Category Archives: Dataanalysis / Datenanalyse

Evolution of a logistic regression

In my last post I showed how one can easily summarize the outcome of a logistic regression. Here I want to show how this really depends on the data-points that are used to estimate the model. Taking a cue from the evolution of a correlation I have plotted the estimated Odds Ratios (ORs) depending on the number of included participants. The result is bad news for those working with small (< 750 participants) data-sets.

evolution

 

“eval_reg” Function to estimate model parameters for subsets of data 

eval_reg<-function(model){
mod<-model
dat<-mod$data[sample(nrow(mod$data)),]
vars<-names(coef(mod))
est<-data.frame(matrix(nrow=nrow(dat), ncol=length(vars)))
pb <- txtProgressBar(min = 50, max = nrow(dat), style = 3)

for(i in 50:nrow(dat)){
try(boot_mod<-update(mod, data=dat[1:i,]))
try(est[i,]<-exp(coef(boot_mod)))
setTxtProgressBar(pb, i)
}
est$mod_nr<-1:length(dat[,1])
names(est)<-c(vars, ‘mod_nr’)
return(est)
}

As I randomized the order of data you can run it again and again to arrive at an even deeper mistrust as some of the resulting permutations will look like they stabilize earlier. On the balance you need to set the random-number seed to make it reproducible.

Run and plot the development

set.seed(29012001)

mod_eval<-eval_reg(gp_mod)

tmp<-melt(mod_eval,id=’mod_nr’)
tmp2<-tmp[tmp$variable!='(Intercept)',]

ticks<-c(seq(.1, 1, by =.1), seq(0, 10, by =1), seq(10, 100, by =10))

ggplot(tmp2, aes(y=value, x = mod_nr, color = variable)) +
geom_line() +
geom_hline(y=1, linetype=2) +
labs(title = ‘Evolution of logistic regression’, y = ‘OR’, x = ‘number of participants’) +
scale_y_log10(breaks=ticks, labels = as.character(ticks)) +
theme_bw()

Update 29-01-2013:

I added my definition of the ticks on the log-scale. The packages needed are ggplot2 and reshape.

Plotting Odds Ratios (aka a forrestplot) with ggplot2 –

Hi,

if you like me work in medical research, you have to plot the results of multiple logistic regressions every once in a while. As I have not yet found a great solution to make these plots I have put together the following short skript. Do not expect too much, it’s more of a reminder to my future self than some mind-boggling new invention. The code can be found below the resulting figure looks like this:

fig_1_odds

Here comes the code. It takes the model and optionally a title as an input and generates the above plot.

 

 

plot_odds<-function(x, title = NULL){
tmp<-data.frame(cbind(exp(coef(x)), exp(confint(x))))
odds<-tmp[-1,]
names(odds)<-c(‘OR’, ‘lower’, ‘upper’)
odds$vars<-row.names(odds)
ticks<-c(seq(.1, 1, by =.1), seq(0, 10, by =1), seq(10, 100, by =10))

ggplot(odds, aes(y= OR, x = reorder(vars, OR))) +
geom_point() +
geom_errorbar(aes(ymin=lower, ymax=upper), width=.2) +
scale_y_log10(breaks=ticks, labels = ticks) +
geom_hline(yintercept = 1, linetype=2) +
coord_flip() +
labs(title = title, x = ‘Variables’, y = ‘OR’) +
theme_bw()
}

 

P.s. I know about ggplots “annotation_logticks” but they messed up my graphics, also it is not very often that ORs span more than three orders of magnitude. If they do consider playing with ggplots function or update the line beginning with “ticks <- ” in the above example

Update 29-01-2013: I replaced the nasty ” as they resulted in some nasty copy-past errors…

Reverse research – Extracting data from graphs with WebPlotDigitizer

As long as we all are still waiting for these large-scale data sharing mechanisms in place we will have to use every bit we can can our hand on. For data that is only represented as a graph – instead of tables or in the text – this means that we need some kind of graph digitizer.

The best one I found is the webplotdigitizer developed by Ankit Rohatgi. While I generally prefer desktop versions of programs, because they support reproducible research – this tools is just too easy to use that a further search would be necessary. Together with the video tutorials this is a one-stop shop for your data-extraction needs.

Congratulations to the developer for this excellent example of one task one tool!

 

Open Science – The great leaps forward

For everyone out there who believes that science (i.e. the methods, data and results) should be more open, hear the good news: You are not alone! A whole series of papers published in the worlds most highly ranked journals is on your side – or are they?

Duncal Hull compiled an excellent list of resources.

My personal favorite is the following:

Screenshot of paywall for an editorial that is supposedly about open access

 

 

 

 

fig_3_spice

When Venn diagrams are not enough – Visualizing overlapping data with Social Network Analysis in R

I recently thought about ways to visualize medications and their co-occurences in a group of children. As long as you want to visualize up to  4 different medications you can simply use Venn diagrams. There is a very nice R-package to generate these kind of graphics for you (for a  description see: Chen and Boutros, 2011). But this is of little help here.

The problem I faced involved 29 different medications and 50 children. So my data was stored in a table with 29 columns – one for each medication – and 50 rows – one for each child, so that the cells indicate whether or not the child took the medication.

M <- matrix(sample(0:1, 1450, replace=TRUE, prob=c(0.9,0.1)), nc=29)

The Solution – Social Network Analysis

There are a several R-packages to analyze and visualize social network data – I will focus on “igraph” in this post. The problem I had was that I was not – and probably I am still not –  familiar with the concepts and nomenclature of this field. The key to using the data described above in terms of network analysis was understanding that such data is called an affiliation matrix, where individuals are affiliated with certain events. As “igraph” likes adjacency matrices, where every column and row represents a different node – in our case a medication. The diagonal gives the number of times a medication was given (more information can be found on Daizaburo Shizuka site).

We transform an affilition matrix into an adjacency matrix in R simply by:

adj=M%*%t(M)

Now we can make a first bare-minimum plot:

require(igraph)
g=graph.adjacency(adj,mode=”undirected”, weighted=TRUE,diag=FALSE)
summary(g)
plot(g, main=”The bare minimum”)

 

Adding information and spicing it up a notch

In all likelihood You want to add at least three kinds of  information:

  1. Labels for the nodes
  2. Size of the nodes to represent the total number of events, aka medications
  3. Size of the links to represent the overlap between medications

name<-sample(c(LETTERS, letters, 1:99), 29, replace=TRUE)
number<-diag(adj)*5+5
width<-(E(g)$weight/2)+1
plot(g, main=”A little more information”, vertex.size=number,vertex.label=name,edge.width=width)

 

The “igraph” package lets you adopt quite a few parameters so you should consult with the manual. I only changed some of the colors, layout, fonts, etc.

plot(g, main=”Spice it up a notch”, vertex.size=number, vertex.label=name, edge.width=width, layout=layout.lgl, vertex.color=”red”, edge.color=”darkgrey”, vertex.label.family =”sans”, vertex.label.color=”black”)

 


Here is just the code:

?View Code RSPLUS
require(igraph)
setwd("~/Desktop/")
 
# Generate example data
M <- matrix(sample(0:1, 1450, replace=TRUE, prob=c(0.9,0.1)), nc=29)
 
# Transform matrices
adj=M%*%t(M)
 
# Make a simple plot
g<-graph.adjacency(adj,mode="undirected", weighted=TRUE,diag=FALSE)
summary(g)
plot(g, main="The bare minimum")
 
# Add more information
name<-sample(c(LETTERS, letters, 1:99), 29, replace=TRUE)
number<-diag(adj)*5+5
width<-(E(g)$weight/2)+1
 
plot(g, main="A little more information", vertex.size=number,vertex.label=name,edge.width=width)
 
# Adjust some plotting parameters
plot(g, main="Spice it up a notch", vertex.size=number, vertex.label=name, edge.width=width, layout=layout.lgl, vertex.color="red", edge.color="darkgrey", vertex.label.family ="sans", vertex.label.color="black")

Simulation studies in R – Using all cores and other tips

After working more seriously with simulations I noticed some updates were necessary to my previous setup. Most notably are the following three:

  • It is very handy to explicitly call the different scenarios instead of using nested loops
  • Storing intermediate results in single files obliviates the need to rerun an almost finished but crashed analysis and seperates very clearly the data-generation from analysis part.
  • Using all availible cores can speed up the processing time, but may render the simulation not reproducible.

So here is my new simulation-study sceleton, that consists of five parts:

  1. Praeamble: Load all the functions that are required
  2. Simulation-function: This is the part, that will most likely be much more complicated in your case. Define the steps that will be repeated for different scenarios. The parameters of this function will be filled in by the scencarios.
  3. Scenario-Description: Explicitly show the range of values that should be passed to the Simulation-Function
  4. Run the analysis: Here you pass all the scenario-descriptions to your simulation-function. Either do this on one or all availible cores. In any case you should set a random seed to make the simulatino reproducible.
  5. Analyze the outputs: Not shown here but You propabely

Here is the complete script:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# 1. Praeamble
setwd("c:/temp/")
require(doSNOW)
require(rlecuyer)
 
# 2. Simulation-function
sim_fun<-function(a,b,c){
results<-matrix(NA, 1000,4)
for(i in 1:1000){
#a=1; b=2;c=3;i=1
results[i,1:3]<-cbind(a, b, c)
results[i,4]<-mean(rnorm(100))#THIS MAY BE MORE COMPLEX FOR YOU HEHE!
}
write.table(results, file=paste(a,"_", b,"_", c, "_res.csv"))  
}
 
# 3. Scenario-Description
a<-seq(10, 100, 20)
b<-seq(20, 100, 30)
c<-seq(30, 200, 40)
scenarios<-expand.grid(a, b, c)
 
# 4.a Run the analysis on one core
set.seed(29012001)
for(i in 1:length(scenarios[,1])){sim_fun(scenarios[i,1], scenarios[i,2], scenarios[i,3])}
 
 
# 4.b Run the analysis on all availible Cores
cluster<-makeCluster(4, type = "SOCK")
clusterSetupRNG(cluster, seed = 29012001) 
registerDoSNOW(cluster)
 
foreach(i= 1:length(scenarios[,1])) %dopar% {sim_fun(scenarios[i,1], scenarios[i,2], scenarios[i,3])}
 
# compare the time
system.time(
for(i in 1:length(scenarios[,1])){sim_fun(scenarios[i,1], scenarios[i,2], scenarios[i,3])}
)
 
system.time(
foreach(i= 1:length(scenarios[,1])) %dopar% {sim_fun(scenarios[i,1], scenarios[i,2], scenarios[i,3])}
)

There are also other tutorials on how to run simulations in R. The one I liked most was Roger Koenkers’ “A simple protocoll for simulations in R” (accessible here) that relies more heavily on R’s built in features to solve some of the problems.

 

Keep ssh-scripts running without a connection – “screen” for Linux

I started offloading some of the heavy-lifting and the more time-consuming jobs to my server. While I am more and more comfortable of using the ssh terminal, the mayor problem was that scripts only worked as long as I was logged in. Which goes completely against the whole idea of laying off work to the server. After some goggling I found the small tool “screen” which is available for linux. A more comprehensive introduction can be found here, but after installing I only used three commands: “screen” to start; “screen -ls” to list the open terminals after reconnection, and “screen -r XXX” to reconnect to a running terminal.

Installation

With my Linux (ubuntu) server all I had to do to install it was a simple
$ sudo apt-get install screen

Use

After that you can start screen by simply typing:
$ screen

Now start your time-consuming job and just quit the terminal whenever you can’t stand waiting any longer. Whenever you want to have a look again just connect via ssh, and get a list of open screens:

$ screen -ls

this gave me:
3467.pts-0.MYSERVER (09/14/2011 08:59:46 PM) (Detached)
2238.pts-0.MYSERVER (09/14/2011 08:45:13 PM) (Attached)
2 Sockets in /var/run/screen/S-gerhi.

To reconnect to the first of these still running terminals, you just enter:
$ screen -r 3467.pts-0.MYSERVER

 

Dump MySQL to CSV using R

Based on a related post on one of my favorite python-lists I remembered, that I wrote a similar snipplet some time ago.

So if you want to dump your whole MySQL database to csv-files you can recycle the following code:

?Download mysql2cvs.R
1
2
3
4
5
6
7
8
9
require(RMySQL)
m<-MySQL()
summary(m)
con<-dbConnect(m, dbname = "YOURDB", host="localhost", port=8889, user="YOURUSER", pass="YOURPASS", unix.sock="/Applications/MAMP/tmp/mysql/mysql.sock") # in case you are using MAMP 
tables<-dbListTables(con)
 
for (i in 1 : length(tables)){
temp<-(dbReadTable(con, tables[i]))
write.table(temp, tables[i], row.names=F)}

This also is a great way to use my new source-code plugin (WP-CodeBox)
Enjoy!

input

Distributing Python Scripts – the Batchtraveller.app

[en]This is a short intro on how Python-scripts can be packaged into an easy to distribute mac-app.

Essentially I followed two tutorials. The first on how to turn your python script into a self-contained app, the second on how to make compressed diskimages for mac. The final result aswell as the files necessary to reproduce the steps can be downloaded here. A detailed desription can be found in the remainder of this post.

Continue reading

GIS on a shoestring – Getting traveltimes from google

The analysis of geospatial information is currently a big trend in medicine and public health. Even though some may want to convince you that this can only be achieved with the latest and most expensive software, I am not convinced. First, analysis  of spatial data dates back to at least 1856 when John Snow investigated Cholera-outbreaks in London. Second, as I try to demonstrate today some very interesting analysis and data can be retreived essentially for free.

While I have already made a post on how to plot freely availible geospatial data in R in a previous post , this post will show you how to use Python to access the google maps database and gather e.g. travel times and distances to/from various locations with known zip-codes.

Please note that this is my first Python skript. So it will certainly not meet the high standards you might have developed based on previous posts. On the up-side, you will get the baby step instructions.

Update 2011/07/03: A much more user-friendly version of the script that adds guis to select a proper csv-file, containing start and end-adressess and to store the results can be found here. If you are afraid of Python, you can use the stand-alone Mac app “batchtimer” that basically contains all files necessary from here.

Continue reading

RStudio the missing link between your brain and statistics

RStudio is a graphical user interface for R. Or as the developers put it.

RStudio™ is a new integrated development environment (IDE) for R. RStudio combines an intuitive user interface with powerful coding tools to help you get the most out of R.

 

While there have been a few projects (e.g. RCommander, RkWard, JaguaR) RStudio is the first I will probably integrate into my workflow – the mac-gui I work with is already great and has some essential features like syntax-highlighting out of the box, but I will recommend RStudio to anyone considering to start working with R – and anyone else asking me about statistics.

I just want to highlight two features which can change the learning curve for R.

1. Getting data into R. RStudio has a nice import dataset feature that can be used to read text-files. Something that can be really frustrating in the beginning.

2. Navigating the data. By just clicking aI really hope that this will stay a read-only feature, because everything else is simply not the way to go.

Cons:

  • Umlauts are not yet integrated, but it seems like a matter of time with these guy/is.

Poor but Sexy – GUIs for bash-files with Zenity

Every now and then we want to work with others who are less inclined to working with the console to use some bash files. As most of the skripts we make, only take a few options, we might think of adding a nice gui to the bash-file so that others are more likely to use it. As always, this can be done very quickly and for free in Linux.

All you need is to install the programm zenity and then write a script, that calls zenity to set the value of a parameter.

sudo apt-get install zenity

Then your next “program” is only a couple of lines away.

My first “project” – I will propably laugh about this one in a couple of weeks – was a little GUI that should replace a textstring “999999″  in a textfile with a string that is entered via the gui and save the result with a new filename. You only need to save the following code in a bash-file, e.g. “replace.sh” and make it executable.

#!/bin/bash

cd /home/vikp/Desktop/HLT

var=$(zenity –entry –text “Enter the replacement text” –entry-text “”)

sed s/9999999/$VP_name/g *.txt > newname_$var.dat

 

The reason I needed this was that I wanted a wrapper for psytoolkit – the brialliant linux program for reaction-time experiments – that asks for a participant codes and adds this code to the result file. The complete wrapper for psytoolkit now looks like this:

#!/bin/bash

cd /home/vikp/Desktop/HLT

VP_name=$(zenity –entry –text “Participant-Code” –entry-text “”)

./experiment

sed s/9999999/$VP_name/g hlt_5.psy.*.data > HLT_VP_$VP_name.dat