Simulation studies in R – Reproducing MacCallum et al. (2002) “Effects of variable dichotomization”

I recently came across an excellent paper, “On the Practice of Dichotomization of Quantitative Variables” by MacCallum and colleagues (2002). As I use ANOVAs a lot in my research, it really got me thinking about the whole issue. Even though I have no great idea for an innovative simulation study, you might have one. If you read through this post, you will notice that it’s really simple – at least the technical part.

I will only explain the two-variable scenario, but the setup is basically the same for more complex setups, e.g. with three variables. Let’s start with their small numerical example before turning to the simulation study.

Setting up the packages

As some of you might have noticed, I try to be consistent in how I structure my R scripts. The first step is always to load packages and set up a working directory. As we do not read in any data, the latter is omitted. However, I want to set a specific seed for the random number generation so that the results are reproducible.

set.seed(2901)     #to have reproducible results

Setting up the parameters and generating data

We will use the mvrnorm function from the MASS package to simulate the data. It takes three main arguments: the sample size n, the means of the variables mu, and a covariance matrix Sigma. In case you do not know how to translate a set of correlations between variables into a positive definite covariance matrix, you will also need the rebuild.cov function from the corpcor package. With these you can generate a sample of 50 participants with five lines of code.
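A minimal sketch of how such a data-generating step could look. The means, standard deviations, and the population correlation of .40 are illustrative assumptions, not values taken from this post, and the covariance matrix is built by hand here so that only MASS is needed.

```r
library(MASS)  # provides mvrnorm()

set.seed(2901)

# Assumed population parameters (illustrative only)
n   <- 50                                   # sample size
mu  <- c(10, 10)                            # means of X and Y
sds <- c(2, 2)                              # standard deviations of X and Y
R   <- matrix(c(1, 0.4, 0.4, 1), nrow = 2)  # correlation matrix

# Covariance matrix from correlations and standard deviations: D %*% R %*% D.
# corpcor::rebuild.cov(R, sds^2) is meant for the same conversion.
Sigma <- diag(sds) %*% R %*% diag(sds)

# Draw one sample of n participants (rows) on two variables (columns)
dat <- mvrnorm(n = n, mu = mu, Sigma = Sigma)
head(dat)
```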

Inspect the results

I always like names better than subscripts, hence this completely unnecessary step.
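As a rough sketch, naming the columns and dichotomizing the predictor could look like this. The population values are assumed as above, the name X1_dich is hypothetical, and the median split is one common way to dichotomize, used here for illustration.

```r
library(MASS)

set.seed(2901)

# Assumed population: variances of 4, covariance of 1.6 (correlation .40)
Sigma <- matrix(c(4, 1.6, 1.6, 4), nrow = 2)
dat <- mvrnorm(n = 50, mu = c(10, 10), Sigma = Sigma)

# Give the columns names instead of working with dat[, 1] and dat[, 2]
colnames(dat) <- c("X1", "Y1")
dat <- as.data.frame(dat)

# Median split: 1 = above the sample median of X1, 0 = at or below it
dat$X1_dich <- as.numeric(dat$X1 > median(dat$X1))

# Group sizes after the split
table(dat$X1_dich)
```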

The statistics can be compared with the following commands.

cor.test(X1, Y1)

If you look at the p-values, you will note that both kinds of tests give you a significant effect. If you find a seed that does not, please write me about it.
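To make the comparison concrete, here is a hedged sketch that runs the test on the continuous predictor and on a median-split version of it. The population values and the name X1_dich are assumptions for illustration. Note that the correlation test on a 0/1 variable (the point-biserial correlation) and the pooled-variance t-test on the two groups yield the same p-value.

```r
library(MASS)

set.seed(2901)

# Assumed population: variances of 4, covariance of 1.6 (correlation .40)
Sigma <- matrix(c(4, 1.6, 1.6, 4), nrow = 2)
dat <- as.data.frame(mvrnorm(n = 50, mu = c(10, 10), Sigma = Sigma))
names(dat) <- c("X1", "Y1")
dat$X1_dich <- as.numeric(dat$X1 > median(dat$X1))  # median split

# Test with the continuous predictor
r_cont <- cor.test(dat$X1, dat$Y1)

# Tests with the dichotomized predictor
r_dich <- cor.test(dat$X1_dich, dat$Y1)                        # point-biserial r
t_dich <- t.test(Y1 ~ X1_dich, data = dat, var.equal = TRUE)   # pooled t-test

# Compare estimates and p-values side by side
c(r_continuous = unname(r_cont$estimate),
  r_dichotomized = unname(r_dich$estimate))
c(p_continuous = r_cont$p.value,
  p_dichotomized = r_dich$p.value,
  p_ttest = t_dich$p.value)
```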

Running their small scale study

First we have to define a function that counts how often the dichotomized analysis gives a larger estimate of the correlation than the analysis of the original variables, given a specific correlation in the population and a sample size. I called it overshoot, because this is most likely due to sampling error, as argued in the original paper.

I generated the following mainly by using the “extract function” feature in RStudio.
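One way such a function could be written is sketched below. Only the name overshoot comes from this post. The arguments, the median split, the grid of correlations and sample sizes, and the layout of the result (rows for correlations, columns for sample sizes) are assumptions for illustration. The example call uses only 200 replications per cell to keep it fast. The study itself drew 10,000 samples.

```r
library(MASS)

# Count how often the correlation based on a median-split X exceeds
# the correlation based on the continuous X, across `reps` samples.
overshoot <- function(rho, n, reps = 10000) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  count <- 0
  for (i in seq_len(reps)) {
    d      <- mvrnorm(n = n, mu = c(0, 0), Sigma = Sigma)
    x_dich <- as.numeric(d[, 1] > median(d[, 1]))  # median split of X
    r_cont <- cor(d[, 1], d[, 2])                  # original correlation
    r_dich <- cor(x_dich, d[, 2])                  # dichotomized correlation
    if (r_dich > r_cont) count <- count + 1
  }
  count
}

# Hypothetical grid: rows = population correlations, columns = sample sizes
rhos <- c(0.1, 0.5)
ns   <- c(50, 100)

set.seed(2901)
res <- sapply(ns, function(n) {
  sapply(rhos, function(rho) overshoot(rho, n, reps = 200))
})
res
```

Unlike a tabulation-based approach, this version returns 0 rather than NA for cells in which the dichotomized estimate never comes out larger.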

Inspect the results

And here are the results.

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 4481 4129 3790 3550 3347 3042 2869
[2,] 3359 2402 1642 1087  809  592  416
[3,] 2201 1011  380  141   51   24    5
[4,]  971  154   16    3    1   NA   NA
[5,]   77    1   NA   NA   NA   NA   NA


Thanks to the large number of samples (10,000) drawn from the population, the results are very similar to those published. Hope you liked it.


MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On
the practice of dichotomization of quantitative variables. Psychological
Methods, 7, 19–40.

