I recently came across an excellent paper, “On the Practice of Dichotomization of Quantitative Variables” by MacCallum and colleagues (2002). As I use ANOVAs a lot in my research, it really got me thinking about the whole issue. Even though I have no great idea for an innovative simulation study, you might have one. If you read through this post, you will notice that it is really simple – at least the technical part.
I will only explain the two-variable scenario, but the setup is basically the same for more complex designs. Let’s start with their small numerical example before turning to the simulation study.
Setting up the packages
As some of you might have noticed, I try to be consistent in how I structure my R scripts. The first step is always to load packages and set the working directory. As we do not read in any data, the latter is omitted. However, I want to set a specific seed for the random number generator so that the results are reproducible.
require(MASS)
require(corpcor)
set.seed(2901) # to have reproducible results
Setting up the parameters and generating data
We will use the mvrnorm function from the MASS package to simulate the data. It takes three arguments: the sample size n, the means of the variables mu, and a covariance matrix Sigma. If you do not know how to translate a set of correlations between variables into a positive definite covariance matrix, you will also need the rebuild.cov function from the corpcor package. With these you can generate a sample of 50 participants with five lines of code.
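The code block itself seems to have been lost somewhere along the way, so here is a minimal sketch of what it may have looked like. The population correlation of .40, the zero means, and the unit variances are my assumptions, not values taken from the post.

```r
require(MASS)
require(corpcor)
set.seed(2901) # to have reproducible results

# Assumed population correlation of .40 between the two variables
r <- matrix(c(1, .4,
              .4, 1), nrow = 2)
# Translate correlations plus variances into a covariance matrix
sigma <- rebuild.cov(r, c(1, 1))
# Draw 50 participants from a bivariate normal population
temp <- as.data.frame(mvrnorm(n = 50, mu = c(0, 0), Sigma = sigma))
```

With unit variances the covariance matrix happens to equal the correlation matrix, but going through rebuild.cov keeps the code correct for any variances you choose.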
Inspect the results
I always like names better than subscripts, so here comes a completely unnecessary step.
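The naming and dichotomization code is missing from the post; a plausible reconstruction, assuming a median split as used by MacCallum et al., could look like this. The variable names X1, Y1, and X1_d are taken from the test commands below; the generation of temp repeats my assumed setup from above.

```r
require(MASS)
set.seed(2901)
# temp as generated before (assumed correlation .40, unit variances)
temp <- as.data.frame(mvrnorm(n = 50, mu = c(0, 0),
                              Sigma = matrix(c(1, .4, .4, 1), nrow = 2)))

# Give the columns proper names instead of V1/V2 subscripts
colnames(temp) <- c("X1", "Y1")
X1 <- temp$X1
Y1 <- temp$Y1

# Median split (an assumption; the paper dichotomizes at the median):
# values above the sample median are coded 1, the rest 0
temp$X1_d <- ifelse(temp$X1 > median(temp$X1), 1, 0)
```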
The statistics can be compared with the following commands.
cor.test(X1, Y1)
t.test(Y1 ~ temp$X1_d)
If you look at the p-values, you will note that both tests give you a significant effect. If you find a seed that does not, please write me about it.
Running their small scale study
First, we have to define a function that counts how often the dichotomized analysis yields a larger estimate of the correlation than the analysis of the original data, given a specific population correlation and sample size. I called it overshoot, because this is most likely due to sampling bias, as argued in the original paper.
I generated the following mainly by using the “Extract Function” feature in RStudio.
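The function itself did not survive the export, so the following is a hypothetical reconstruction. It assumes a median split and a comparison of absolute correlations; the point-biserial correlation of the dichotomized variable stands in for the t-test, since the two are statistically equivalent.

```r
require(MASS)

# Counts, out of n_samples draws from a bivariate normal population with
# correlation rho_pop, how often the correlation after a median split on X
# exceeds (in absolute value) the correlation of the raw variables.
overshoot <- function(rho_pop, n, n_samples = 10000) {
  sigma <- matrix(c(1, rho_pop, rho_pop, 1), nrow = 2)
  count <- 0
  for (i in seq_len(n_samples)) {
    s <- mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
    x_d <- ifelse(s[, 1] > median(s[, 1]), 1, 0)
    if (abs(cor(x_d, s[, 2])) > abs(cor(s[, 1], s[, 2]))) {
      count <- count + 1
    }
  }
  count
}

# A results grid can then be filled with nested sapply calls; the rho and n
# values here are placeholders, not the ones used in the post:
# res <- sapply(c(50, 100, 200), function(n)
#          sapply(c(.1, .3, .5), function(r) overshoot(r, n)))
```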
Inspect the results
And here are the results.
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 4481 4129 3790 3550 3347 3042 2869
[2,] 3359 2402 1642 1087  809  592  416
[3,] 2201 1011  380  141   51   24    5
[4,]  971  154   16    3    1   NA   NA
[5,]   77    1   NA   NA   NA   NA   NA
Thanks to the large number of samples (10,000) drawn from the population, the results are very similar to the published data. Hope you liked it.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40.