In my last post I showed how one can easily summarize the outcome of a logistic regression. Here I want to show how this really depends on the data-points that are used to estimate the model. Taking a cue from the evolution of a correlation I have plotted the estimated Odds Ratios (ORs) depending on the number of included participants. The result is bad news for those working with small (< 750 participants) data-sets.

**“eval_reg” function to estimate model parameters for subsets of data**

eval_reg <- function(model){
  mod <- model
  # shuffle the rows so the order of inclusion is random
  dat <- mod$data[sample(nrow(mod$data)), ]
  vars <- names(coef(mod))
  est <- data.frame(matrix(nrow = nrow(dat), ncol = length(vars)))
  pb <- txtProgressBar(min = 50, max = nrow(dat), style = 3)
  for(i in 50:nrow(dat)){
    # refit the model on the first i participants; keep going if a fit fails
    try(boot_mod <- update(mod, data = dat[1:i, ]))
    try(est[i, ] <- exp(coef(boot_mod)))
    setTxtProgressBar(pb, i)
  }
  est$mod_nr <- 1:nrow(dat)
  names(est) <- c(vars, 'mod_nr')
  return(est)
}

Because I randomized the order of the data, you can run it again and again and arrive at an even deeper mistrust, as some of the resulting permutations will look like they stabilize earlier. To make the result reproducible, set the random-number seed.

**Run and plot the development**

set.seed(29012001)
mod_eval <- eval_reg(gp_mod)
tmp <- melt(mod_eval, id = 'mod_nr')
tmp2 <- tmp[tmp$variable != '(Intercept)', ]
ticks <- c(seq(.1, 1, by = .1), seq(0, 10, by = 1), seq(10, 100, by = 10))
ggplot(tmp2, aes(y = value, x = mod_nr, color = variable)) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = 2) +
  labs(title = 'Evolution of logistic regression', y = 'OR', x = 'number of participants') +
  scale_y_log10(breaks = ticks, labels = as.character(ticks)) +
  theme_bw()

**Update 29-01-2013:**

I added my definition of the ticks on the log-scale. The packages needed are ggplot2 and reshape.

Just curious — what’s the event rate for your data?

As a novice to R, I have some trouble reproducing the code. First error when calling the function: “model” is an unknown object. Secondly, which package is required for the melt function?

Finally I wanted to say that I like this blog! It’s pretty helpful to me!

Dear Daniel,

sorry for not mentioning the packages needed – I have added them to the main post. The function takes as input a model that you have fitted. For example, the following lines generate random data and fit a logistic regression.

x <- rbinom(1000, 1, .5)
y <- rbinom(1000, 1, .5)
z <- rbinom(1000, 1, .5)
data <- data.frame(x, y, z)
model_1 <- glm(z ~ x + y, family = binomial, data = data)

Then call the eval_reg function.

eval_reg(model_1)

It’s interesting, but presumably if you were building a model from the outset on +/- 50 observations (that looks to be about the starting point of the graph), you wouldn’t use a model with 10 predictors. Overfitting to the noise in that situation is predictable, and I think that’s what this shows to some degree.

What would be interesting would be to throw in something like 5-fold CV repeated 5 times into the mix, and see a visual comparison of how well that technique works to smooth out the estimated ORs.
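A rough sketch of what that comparison could look like, using only base R: fit the model on each set of training folds (5-fold CV repeated 5 times) and average the resulting ORs. The data and formula below are simulated stand-ins, not the author's data:

```r
# Sketch: average ORs across repeated k-fold training fits (base R only).
# The data and the model formula are invented for illustration.
set.seed(42)
n <- 200
d <- data.frame(x = rnorm(n), w = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x - 0.3 * d$w))

cv_or <- function(data, formula, k = 5, reps = 5) {
  ors <- NULL
  for (r in 1:reps) {
    fold <- sample(rep(1:k, length.out = nrow(data)))  # random fold labels
    for (f in 1:k) {
      # fit on the k-1 training folds, record the ORs
      fit <- glm(formula, family = binomial, data = data[fold != f, ])
      ors <- rbind(ors, exp(coef(fit)))
    }
  }
  colMeans(ors)  # average over all k * reps training fits
}

cv_or(d, y ~ x + w)
```

Averaging over the k * reps fits damps the fold-to-fold jumpiness, which is the smoothing effect the comment is asking about.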

I think this may be a little misleading, because no confidence intervals (CIs) are reported for any of the ORs, only their point estimates. When you make the comment:

“The result is bad news for those working with small (< 750 participants) data-sets."

It's presumably due to the fact that the 4th parameter jumps between about 1.5 and 6 (~ .4 to 1.8 on the log scale, i.e. the regression coefficients) until the sample size is around 600, where it stabilizes. But if we can't see the CI for this parameter (which I *suspect* is very wide in this zone), we can't make much inference about how appropriate the point estimate is, i.e. if there is a large amount of uncertainty around the point estimate (the CI is wide), it would be foolish to assume that the point estimate accurately reflects the population value of the parameter.
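For what it's worth, Wald CIs on the OR scale are one line in base R. A minimal sketch on simulated data (the small n is chosen deliberately to show how wide the intervals get):

```r
# Sketch: Wald CIs for ORs from a fitted logistic model.
# The data are simulated for illustration only.
set.seed(1)
n <- 100  # deliberately small, so the intervals are wide
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 * d$x))
m <- glm(y ~ x, family = binomial, data = d)

# exponentiate the coefficients and their Wald CIs onto the OR scale
or_ci <- exp(cbind(OR = coef(m), confint.default(m)))
print(round(or_ci, 2))
```

Plotting the upper and lower bounds as ribbons around the OR trajectories in the main post would make the early instability much easier to judge.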

Further, you make the statement:

"As I randomized the order of data you can run it again and again to arrive at an even deeper mistrust as some of the resulting permutations will look like they stabilize earlier."

Then how frequently did this hump occur? If this hump occurs once in every 20-50 simulated runs, then the probability of having this occur in repeated experiments is 2-5% empirically. This gets very close to the probabilistic interpretation of a p-value, in which p represents the estimated proportion of repeated experiments in which your point estimate is some distance away from the point estimate's mean calculated from all of the repetitions.
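One way to put a number on that frequency (a sketch only; the data, threshold, and cutoff below are invented stand-ins for the author's setup): rerun the permutation many times and count how often the running OR wanders past some threshold before a given sample size.

```r
# Sketch: empirical frequency of an early "hump" in the running OR.
# Data, threshold, and cutoff are all invented for illustration.
set.seed(7)
n <- 500
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 * d$x))

hump_rate <- function(data, runs = 20, start = 50, cutoff = 300, threshold = 3) {
  humps <- 0
  for (r in 1:runs) {
    perm <- data[sample(nrow(data)), ]  # new random inclusion order
    # running OR for x as participants are added one by one
    ors <- sapply(start:cutoff, function(i) {
      exp(coef(glm(y ~ x, family = binomial, data = perm[1:i, ]))["x"])
    })
    if (any(ors > threshold | ors < 1 / threshold)) humps <- humps + 1
  }
  humps / runs  # proportion of runs showing a hump
}

hump_rate(d, runs = 10)
```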

This is an interesting post. Thanks for sharing your findings. Are you suggesting that this applies to samples in general or speaking more specifically to the analysis presented here? It seems that 750 may be an appropriate sample size to accurately gauge the relationships between the variables you’re looking at here, given the type of data you’re using, but I don’t see how this would translate to a magical sample size cut-off for studies in general. I see you have a background in model building – is there something I’m missing that makes this situation generalizable? Thanks again for your post.