Tag Archives: simulation

Evolution of a logistic regression

In my last post I showed how one can easily summarize the outcome of a logistic regression. Here I want to show how this really depends on the data-points that are used to estimate the model. Taking a cue from the evolution of a correlation I have plotted the estimated Odds Ratios (ORs) depending on the number of included participants. The result is bad news for those working with small (< 750 participants) data-sets.

evolution

 

“eval_reg” Function to estimate model parameters for subsets of data 

eval_reg<-function(model){
mod<-model
dat<-mod$data[sample(nrow(mod$data)),]
vars<-names(coef(mod))
est<-data.frame(matrix(nrow=nrow(dat), ncol=length(vars)))
pb <- txtProgressBar(min = 50, max = nrow(dat), style = 3)

for(i in 50:nrow(dat)){
try(boot_mod<-update(mod, data=dat[1:i,]))
try(est[i,]<-exp(coef(boot_mod)))
setTxtProgressBar(pb, i)
}
est$mod_nr<-1:length(dat[,1])
names(est)<-c(vars, ‘mod_nr’)
return(est)
}

As I randomized the order of data you can run it again and again to arrive at an even deeper mistrust as some of the resulting permutations will look like they stabilize earlier. On the balance you need to set the random-number seed to make it reproducible.

Run and plot the development

set.seed(29012001)

mod_eval<-eval_reg(gp_mod)

tmp<-melt(mod_eval,id=’mod_nr’)
tmp2<-tmp[tmp$variable!='(Intercept)',]

ticks<-c(seq(.1, 1, by =.1), seq(0, 10, by =1), seq(10, 100, by =10))

ggplot(tmp2, aes(y=value, x = mod_nr, color = variable)) +
geom_line() +
geom_hline(y=1, linetype=2) +
labs(title = ‘Evolution of logistic regression’, y = ‘OR’, x = ‘number of participants’) +
scale_y_log10(breaks=ticks, labels = as.character(ticks)) +
theme_bw()

Update 29-01-2013:

I added my definition of the ticks on the log-scale. The packages needed are ggplot2 and reshape.

Simulation studies in R – Using all cores and other tips

After working more seriously with simulations I noticed some updates were necessary to my previous setup. Most notably are the following three:

  • It is very handy to explicitly call the different scenarios instead of using nested loops
  • Storing intermediate results in single files obliviates the need to rerun an almost finished but crashed analysis and seperates very clearly the data-generation from analysis part.
  • Using all availible cores can speed up the processing time, but may render the simulation not reproducible.

So here is my new simulation-study sceleton, that consists of five parts:

  1. Praeamble: Load all the functions that are required
  2. Simulation-function: This is the part, that will most likely be much more complicated in your case. Define the steps that will be repeated for different scenarios. The parameters of this function will be filled in by the scencarios.
  3. Scenario-Description: Explicitly show the range of values that should be passed to the Simulation-Function
  4. Run the analysis: Here you pass all the scenario-descriptions to your simulation-function. Either do this on one or all availible cores. In any case you should set a random seed to make the simulatino reproducible.
  5. Analyze the outputs: Not shown here but You propabely

Here is the complete script:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# 1. Praeamble
setwd("c:/temp/")
require(doSNOW)
require(rlecuyer)
 
# 2. Simulation-function
sim_fun<-function(a,b,c){
results<-matrix(NA, 1000,4)
for(i in 1:1000){
#a=1; b=2;c=3;i=1
results[i,1:3]<-cbind(a, b, c)
results[i,4]<-mean(rnorm(100))#THIS MAY BE MORE COMPLEX FOR YOU HEHE!
}
write.table(results, file=paste(a,"_", b,"_", c, "_res.csv"))  
}
 
# 3. Scenario-Description
a<-seq(10, 100, 20)
b<-seq(20, 100, 30)
c<-seq(30, 200, 40)
scenarios<-expand.grid(a, b, c)
 
# 4.a Run the analysis on one core
set.seed(29012001)
for(i in 1:length(scenarios[,1])){sim_fun(scenarios[i,1], scenarios[i,2], scenarios[i,3])}
 
 
# 4.b Run the analysis on all availible Cores
cluster<-makeCluster(4, type = "SOCK")
clusterSetupRNG(cluster, seed = 29012001) 
registerDoSNOW(cluster)
 
foreach(i= 1:length(scenarios[,1])) %dopar% {sim_fun(scenarios[i,1], scenarios[i,2], scenarios[i,3])}
 
# compare the time
system.time(
for(i in 1:length(scenarios[,1])){sim_fun(scenarios[i,1], scenarios[i,2], scenarios[i,3])}
)
 
system.time(
foreach(i= 1:length(scenarios[,1])) %dopar% {sim_fun(scenarios[i,1], scenarios[i,2], scenarios[i,3])}
)

There are also other tutorials on how to run simulations in R. The one I liked most was Roger Koenkers’ “A simple protocoll for simulations in R” (accessible here) that relies more heavily on R’s built in features to solve some of the problems.