Running R on 32 cores for 2USD/h

Most modern computers already have four cores that can be used to speed up analyses (see here for a description). However, after I started doing more simulation studies, it became apparent that running them on a desktop machine is highly impractical.

The solution: running R via RStudio on Amazon's cloud servers.

Thanks to Louis Aslett (http://www.louisaslett.com/RStudio_AMI/), it only takes about four minutes to set up and connect to your own RStudio server in the cloud using the free tier program.

If you want to use this with more powerful instances, follow these steps.

1. Register with aws.amazon.com, including payment information etc.
2. Request a service limit increase. This can be done via http://aws.amazon.com/contact-us/ec2-request. You need an instance limit of at least 1. It may take Amazon up to two business days to process this request…
3. Follow the description and links provided by Louis Aslett (http://www.louisaslett.com/RStudio_AMI/). The links differ in server location and thus in pricing (right now the 32-core server costs between 1.7 and 2.3 USD per hour). When choosing an instance you can now select and launch more powerful servers. I decided to go with “c3.8xlarge”, which gives you 32 cores and, at least for me, adequate memory and storage.
4. Connect to your own 32-core RStudio instance in the cloud.
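
Once connected, the extra cores only help if the R code actually runs in parallel. Here is a minimal sketch using the parallel package that ships with R. The toy simulation `one_run`, the number of runs and the seed are placeholders of mine, not part of the setup described above.

```r
library(parallel)

# Number of cores R can see (should be 32 on a c3.8xlarge)
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1

# Placeholder simulation: mean of 1000 standard-normal draws
one_run <- function(i) mean(rnorm(1000))

cl <- makeCluster(n_cores)      # start one worker per core
clusterSetRNGStream(cl, 123)    # reproducible random numbers on the workers
res <- parLapply(cl, 1:200, one_run)
stopCluster(cl)                 # always shut the workers down again

summary(unlist(res))
```

On a laptop this uses four cores or so; on the cloud instance the same script should spread the runs over all available cores without changes.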

All you need is text – Markdown (via pandoc) for academia

Many students struggle to find an adequate format for their thesis. Ironically, the advent of “modern” WYSIWYG programs seems to make it harder to format a text consistently.

While learning LaTeX may be a bit too much to ask for, markdown is a very minimal language that, together with pandoc, covers all typesetting needs of an academic paper. Source documents written in markdown can be opened and edited on any PC (or mobile), and pandoc can translate them into beautifully formatted pdf and (if it is absolutely necessary) docx files. Specifically, markdown implements:

• Headings, Subheadings
• Figures and tables
• Citations and references (here in APA6, but other styles are also possible)

You will need to edit the file paper_v1.md.

See the example paper.

Once pandoc and LaTeX are installed, the following command generates a pdf.

pandoc -s -S --biblio biblio.bib --csl apa.csl -N -V geometry:margin=1in paper_v1.md -o paper.pdf

All files necessary to replicate and adapt the example can be found here.

Evolution of a logistic regression

In my last post I showed how one can easily summarize the outcome of a logistic regression. Here I want to show how this really depends on the data-points that are used to estimate the model. Taking a cue from the evolution of a correlation I have plotted the estimated Odds Ratios (ORs) depending on the number of included participants. The result is bad news for those working with small (< 750 participants) data-sets.

[Figure: evolution of the estimated odds ratios with an increasing number of participants]

“eval_reg” Function to estimate model parameters for subsets of data 

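# Re-estimate a fitted glm on growing subsets of its (shuffled) data
eval_reg <- function(model) {
  mod <- model
  # shuffle the rows so that the order of participants is random
  dat <- mod$data[sample(nrow(mod$data)), ]
  vars <- names(coef(mod))
  est <- data.frame(matrix(nrow = nrow(dat), ncol = length(vars)))
  pb <- txtProgressBar(min = 50, max = nrow(dat), style = 3)

  # start with the first 50 participants; rows 1 to 49 of est stay NA
  for (i in 50:nrow(dat)) {
    try(boot_mod <- update(mod, data = dat[1:i, ]))
    try(est[i, ] <- exp(coef(boot_mod)))
    setTxtProgressBar(pb, i)
  }
  close(pb)
  est$mod_nr <- 1:nrow(dat)
  names(est) <- c(vars, 'mod_nr')
  return(est)
}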

As I randomized the order of the data, you can run it again and again to arrive at an even deeper mistrust, although some of the resulting permutations will look like they stabilize earlier. To make a particular run reproducible, you need to set the random-number seed.

Run and plot the development

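library(ggplot2)
library(reshape)

set.seed(29012001)

# gp_mod: a fitted logistic regression (glm object), not defined in this post
mod_eval <- eval_reg(gp_mod)

tmp <- melt(mod_eval, id.vars = 'mod_nr')
tmp2 <- tmp[tmp$variable != '(Intercept)', ]

# tick marks for the log-scaled axis (no 0, which cannot be shown on a log scale)
ticks <- c(seq(.1, 1, by = .1), seq(2, 10, by = 1), seq(20, 100, by = 10))

ggplot(tmp2, aes(y = value, x = mod_nr, color = variable)) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = 2) +
  labs(title = 'Evolution of logistic regression', y = 'OR', x = 'number of participants') +
  scale_y_log10(breaks = ticks, labels = as.character(ticks)) +
  theme_bw()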

Update 29-01-2013:

I added my definition of the ticks on the log-scale. The packages needed are ggplot2 and reshape.
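
To try the idea without the original data, here is a small sketch on simulated data that only needs base R. The variables `x1` and `x2`, the true coefficients and the chosen subset sizes are assumptions made up for this illustration. It re-estimates the model on the first k rows of a shuffled data set, which is the same principle as in eval_reg but without the loop over every single participant.

```r
set.seed(1)   # any fixed seed makes the shuffle repeatable
n  <- 1000
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n)
# assumed true odds ratios for this toy example: exp(0.5) and exp(-0.3)
y  <- rbinom(n, 1, plogis(-1 + 0.5 * x1 - 0.3 * x2))
sim_dat <- data.frame(y, x1, x2)

sim_mod <- glm(y ~ x1 + x2, data = sim_dat, family = binomial)

# odds ratios estimated from the first k rows of a shuffled copy of the data
shuffled <- sim_dat[sample(nrow(sim_dat)), ]
or_at <- function(k) exp(coef(update(sim_mod, data = shuffled[1:k, ])))

# one row per subset size, one column per coefficient
or_path <- t(sapply(c(50, 100, 250, 500, 1000), or_at))
round(or_path, 2)
```

Changing the seed and rerunning gives a feeling for how much the estimates at small subset sizes depend on which participants happen to come first.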

Plotting Odds Ratios (aka a forest plot) with ggplot2

Hi,

if you, like me, work in medical research, you have to plot the results of multiple logistic regressions every once in a while. As I have not yet found a great solution for making these plots, I have put together the following short script. Do not expect too much; it's more of a reminder to my future self than some mind-boggling new invention. The code can be found below. The resulting figure looks like this:

[Figure: odds ratios with confidence intervals for each variable]

Here comes the code. It takes the model and optionally a title as an input and generates the above plot.

 

 

library(ggplot2)

plot_odds <- function(x, title = NULL) {
  # odds ratios and confidence limits on the OR scale
  tmp <- data.frame(cbind(exp(coef(x)), exp(confint(x))))
  odds <- tmp[-1, ]   # drop the intercept
  names(odds) <- c('OR', 'lower', 'upper')
  odds$vars <- row.names(odds)
  # tick marks for the log-scaled axis (no 0, which cannot be shown on a log scale)
  ticks <- c(seq(.1, 1, by = .1), seq(2, 10, by = 1), seq(20, 100, by = 10))

  ggplot(odds, aes(y = OR, x = reorder(vars, OR))) +
    geom_point() +
    geom_errorbar(aes(ymin = lower, ymax = upper), width = .2) +
    scale_y_log10(breaks = ticks, labels = ticks) +
    geom_hline(yintercept = 1, linetype = 2) +
    coord_flip() +
    labs(title = title, x = 'Variables', y = 'OR') +
    theme_bw()
}
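
If you want to check the numbers before plotting, the table that the function builds internally can be produced on its own. The sketch below uses simulated data; the variable names, coefficients and seed are made up for illustration. Note that it swaps in Wald intervals from confint.default(), which need no extra packages, whereas confint() on a glm usually gives profile-likelihood intervals, so the limits can differ slightly from those in the plot.

```r
set.seed(42)   # arbitrary seed for the toy data
n <- 500
toy <- data.frame(age = rnorm(n), smoker = rbinom(n, 1, 0.3))
toy$ill <- rbinom(n, 1, plogis(-0.5 + 0.4 * toy$age + 0.8 * toy$smoker))

toy_mod <- glm(ill ~ age + smoker, data = toy, family = binomial)

# odds ratios with Wald confidence limits, intercept dropped
odds <- data.frame(exp(cbind(coef(toy_mod), confint.default(toy_mod))))[-1, ]
names(odds) <- c("OR", "lower", "upper")
odds$vars <- row.names(odds)
odds
```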

 

P.s. I know about ggplot2's “annotation_logticks”, but it messed up my graphics. Also, it is not very often that ORs span more than three orders of magnitude. If they do, consider playing with ggplot2's function or updating the line beginning with “ticks <- ” in the above example.

Update 29-01-2013: I replaced the nasty ” as they resulted in some nasty copy-paste errors…

Reverse research – Extracting data from graphs with WebPlotDigitizer

As long as we are all still waiting for large-scale data sharing mechanisms to be put in place, we will have to use every bit of data we can get our hands on. For data that is only presented as a graph (instead of in tables or in the text), this means that we need some kind of graph digitizer.

The best one I found is the WebPlotDigitizer developed by Ankit Rohatgi. While I generally prefer desktop versions of programs because they support reproducible research, this tool is so easy to use that a further search is unnecessary. Together with the video tutorials, this is a one-stop shop for your data-extraction needs.

Congratulations to the developer for this excellent example of one task one tool!

 

Open Science – The great leaps forward

For everyone out there who believes that science (i.e. the methods, data and results) should be more open, hear the good news: You are not alone! A whole series of papers published in the world's most highly ranked journals is on your side – or are they?

Duncan Hull compiled an excellent list of resources.

My personal favorite is the following:

Screenshot of paywall for an editorial that is supposedly about open access

 

 

 

 

Academic CVs with LaTeX and the moderncv package

Every so often you need an up-to-date CV. If you work in science, 80% of it consists of a current list of publications. After a lot of trial and comparison (alternatives from the long list include, e.g., currvita and europecv), I decided to build my CV with the moderncv package, developed by Xavier Danaux. I did, however, make one adjustment to the output style “plainyr” so that the most recent articles come first in the reference list.

You can find the finished result here: cv_hirschfeld_2012_03

You can find the sources to adapt here: latex_cv

To typeset it, you have to carry out the following steps:

  1. Run pdflatex on the tex file: pdflatex source.tex
  2. Run bibtex on the *.aux files of the individual bibliographies: bibtex paper.aux
  3. Run pdflatex on the tex file again: pdflatex source.tex

 

 


When Venn diagrams are not enough – Visualizing overlapping data with Social Network Analysis in R

I recently thought about ways to visualize medications and their co-occurrences in a group of children. As long as you want to visualize up to 4 different medications, you can simply use Venn diagrams. There is a very nice R-package that generates this kind of graphic for you (for a description see: Chen and Boutros, 2011). But this is of little help here.

The problem I faced involved 29 different medications and 50 children. So my data was stored in a table with 29 columns – one for each medication – and 50 rows – one for each child, so that the cells indicate whether or not the child took the medication.

M <- matrix(sample(0:1, 1450, replace=TRUE, prob=c(0.9,0.1)), nc=29)

The Solution – Social Network Analysis

There are several R-packages to analyze and visualize social network data; I will focus on “igraph” in this post. The problem I had was that I was not (and probably still am not) familiar with the concepts and nomenclature of this field. The key to using the data described above in terms of network analysis was understanding that such data is called an affiliation matrix, where individuals are affiliated with certain events. “igraph”, however, likes adjacency matrices, where every column and row represents a node, in our case a medication. The off-diagonal cells give the number of children who took both medications, and the diagonal gives the number of times a medication was given (more information can be found on Daizaburo Shizuka's site).

We transform an affiliation matrix into an adjacency matrix in R simply by multiplying its transpose with the matrix itself. As the medications are stored in the columns of M, the transpose comes first:

adj <- t(M) %*% M
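
A toy example makes the transformation easier to follow. The matrix below is made up for illustration (4 children in the rows, 3 medications in the columns). With this layout, the medication-by-medication adjacency matrix is the transpose times the matrix itself, which base R also offers as crossprod().

```r
# Made-up affiliation matrix: 4 children (rows) x 3 medications (columns)
M_toy <- matrix(c(1, 1, 0,
                  1, 0, 0,
                  0, 1, 1,
                  1, 1, 0),
                ncol = 3, byrow = TRUE,
                dimnames = list(paste0("child", 1:4),
                                c("medA", "medB", "medC")))

# One row and column per medication
adj_toy <- t(M_toy) %*% M_toy
adj_toy

# diagonal: how many children took each medication
diag(adj_toy)

# off-diagonal: how many children took both medications
adj_toy["medA", "medB"]
```

Multiplying the other way round (matrix times its transpose) would instead give a child-by-child matrix counting shared medications.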

Now we can make a first bare-minimum plot:

require(igraph)
g <- graph.adjacency(adj, mode="undirected", weighted=TRUE, diag=FALSE)
summary(g)
plot(g, main="The bare minimum")

 

Adding information and spicing it up a notch

In all likelihood you want to add at least three kinds of information:

  1. Labels for the nodes
  2. Size of the nodes to represent the total number of events, aka medications
  3. Size of the links to represent the overlap between medications

name <- sample(c(LETTERS, letters, 1:99), 29, replace=TRUE)
number <- diag(adj)*5+5
width <- (E(g)$weight/2)+1
plot(g, main="A little more information", vertex.size=number, vertex.label=name, edge.width=width)

 

The “igraph” package lets you adjust quite a few parameters, so you should consult the manual. I only changed some of the colors, layout, fonts, etc.

plot(g, main="Spice it up a notch", vertex.size=number, vertex.label=name, edge.width=width, layout=layout.lgl, vertex.color="red", edge.color="darkgrey", vertex.label.family="sans", vertex.label.color="black")

 


Here is just the code:

require(igraph)
setwd("~/Desktop/")
 
# Generate example data
M <- matrix(sample(0:1, 1450, replace=TRUE, prob=c(0.9,0.1)), nc=29)
 
# Transform matrices
# medications are in the columns of M, so the transpose comes first
adj <- t(M) %*% M
 
# Make a simple plot
g<-graph.adjacency(adj,mode="undirected", weighted=TRUE,diag=FALSE)
summary(g)
plot(g, main="The bare minimum")
 
# Add more information
name<-sample(c(LETTERS, letters, 1:99), 29, replace=TRUE)
number<-diag(adj)*5+5
width<-(E(g)$weight/2)+1
 
plot(g, main="A little more information", vertex.size=number,vertex.label=name,edge.width=width)
 
# Adjust some plotting parameters
plot(g, main="Spice it up a notch", vertex.size=number, vertex.label=name, edge.width=width, layout=layout.lgl, vertex.color="red", edge.color="darkgrey", vertex.label.family ="sans", vertex.label.color="black")

Some thoughts about file names

Hi folks,

not that we have really covered all open source projects that are worth mentioning. However, before I can catch enough breath for the next longer post, I wanted to share some thoughts on file names. As long as you start and finish your work in one sweep, there isn't much need for a systematic approach. However, if you, like probably everyone reading this blog, are working on projects that span more than a day (think of a manuscript, or the data of your latest study), you need a system to keep track of versions and changes.

If all goes well you can tell stories with file-names.

How I name my own files

Filenames consist of four parts:

  1. A title, e.g. “science_paper”
  2. A version-number, e.g. “_3”
  3. A short description of what has changed, e.g. “_made_up_data”
  4. The ending indicating the file-type, e.g. “.doc”
So the final name is: “science_paper_3_made_up_data.doc”.
This way I can quickly find the most up-to-date file by sorting the filenames, and I know what I did last. Sometimes colleagues use the date instead of an integer to keep track of the latest version. In my opinion integers work better because they also give you a sense of how often you have worked on a specific file. If the number goes up into the 20s, it may be a good idea to reflect on what is going wrong. However, if you really have to use the date, I suggest using a yyyy-mm-dd format, because this also lets you quickly find the latest file.

How I like working with others on files

The major problem when working with others is not so much a clash of file-naming conventions as ignorance of how important it is not to mix up files. So the first thing is to agree on trying to keep a system. This way you decrease the likelihood of being called anal behind your back.

I usually opt for a fifth part of the filename, the author's initials, placed after the version-number:

  1. A title: e.g. “science_paper”
  2. A version-number, e.g. “_3”
  3. Author, e.g. “_gh”
  4. A short description of what has changed, e.g. “_made_up_data”
  5. The ending indicating the file-type, e.g. “.doc”
So the final name is: “science_paper_3_gh_made_up_data.doc”.
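
For those who generate or search such names from within R, the scheme can be sketched in a few lines. Both helper functions and the example file names other than those given above are hypothetical. One design note: a plain alphabetical sort puts version 10 before version 3, so the sketch compares the version numbers as integers.

```r
# Hypothetical helper: assemble a file name from its parts
make_name <- function(title, version, change, ext, author = NULL) {
  parts <- c(title, version, author, change)   # a NULL author is simply dropped
  paste0(paste(parts, collapse = "_"), ".", ext)
}

make_name("science_paper", 3, "made_up_data", "doc")
make_name("science_paper", 3, "made_up_data", "doc", author = "gh")

# Hypothetical helper: pick the file with the highest version number
latest_version <- function(files, title) {
  pattern <- paste0("^", title, "_([0-9]+)_.*$")
  hits <- grep(pattern, files, value = TRUE)
  versions <- as.integer(sub(pattern, "\\1", hits))
  hits[which.max(versions)]
}

# made-up file list for illustration
files <- c("science_paper_3_made_up_data.doc",
           "science_paper_10_gh_fixed_refs.doc",
           "science_paper_2_first_draft.doc")
latest_version(files, "science_paper")
```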

Again there is a small question about the version-numbers, specifically who is in charge of changing them. The two options are:

  • Everyone gives a new number.
  • Only the lead-author / lead-analyst gives new numbers.
I tend to prefer the latter, especially in the final stages of a paper when you want to approve a specific final version for submission. But this only works as long as everyone agrees to comment only once on a specific version.
Cheerio and sorry again for being too German.

Easy online audio player: Mp3 player

[en]
Sometimes you would like to play an audio file to your participants in an online survey. But not every type of audio works perfectly in standard online survey software, and sometimes you would just like to link to an audio file on your own server. Here you could use the Flash Mp3-Player.

There are different versions of the player available. For my tasks even the mini player has been sufficient so far. How to play an audio file online with this player (and thus embed it in an online survey as well) is described here.

Pro:
The player is really easy to use and the description and examples are very helpful.
The code generator on the website is an additional helpful tool.

Con:
From my point of view there are no cons.
[/en]


Visualizing GIS data with R and Open Street Map

In this post I want to share some code with you that uses OpenStreetMap maps as a backdrop for a data visualization. We will use the RgoogleMaps package for R. In the following I will show you how to make this graph.



1. Download the map

I wanted to take a closer look at an area around my former neighborhood, which is in Bochum, Germany.

lat_c<-51.47393
lon_c<-7.22667
bb<-qbbox(lat = c(lat_c[1]+0.01, lat_c[1]-0.01), lon = c(lon_c[1]+0.03, lon_c[1]-0.03))

Once this is done, you can download the corresponding Openstreetmap tile with the following line.

OSM.map<-GetMap.OSM(lonR=bb$lonR, latR=bb$latR, scale = 20000, destfile="bochum.png")

2. Add some points to the graphic

Now your second step will most likely be adding points to the map. I chose the following two.

lat <- c(51.47393, 51.479021)
lon <- c(7.22667, 7.222526)
val <- c(10, 100)

As the R-package was mainly built for Google Maps, the coordinates need to be adjusted by hand. I made the following functions, which take the min and max values from the downloaded map.

lat_adj<-function(lat, map){(map$BBOX$ll[1]-lat)/(map$BBOX$ll[1]-map$BBOX$ur[1])}
lon_adj<-function(lon, map){(map$BBOX$ll[2]-lon)/(map$BBOX$ll[2]-map$BBOX$ur[2])}
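
What the two functions do is easiest to see with a stand-in for the downloaded map. The `mock_map` below is an assumption of mine: it only mimics the BBOX element in the way the functions index it (ll = lower left, ur = upper right, each as latitude then longitude), and the object returned by the package may contain more than that. The functions rescale a coordinate to its relative position between the lower-left and the upper-right corner.

```r
lat_adj <- function(lat, map) {
  (map$BBOX$ll[1] - lat) / (map$BBOX$ll[1] - map$BBOX$ur[1])
}
lon_adj <- function(lon, map) {
  (map$BBOX$ll[2] - lon) / (map$BBOX$ll[2] - map$BBOX$ur[2])
}

# Mock object with the same corners as the bounding box requested above
mock_map <- list(BBOX = list(ll = c(51.46393, 7.19667),
                             ur = c(51.48393, 7.25667)))

# lower edge, centre and upper edge of the box: roughly 0, 0.5 and 1
lat_adj(c(51.46393, 51.47393, 51.48393), mock_map)

# the centre longitude also lands at roughly 0.5
lon_adj(7.22667, mock_map)
```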

Now you can add some points to the map. If you want them to mean anything it may be handy to specify an alpha-level and change some aspects of the points, e.g. size, color, alpha corresponding to some variable of interest.

PlotOnStaticMap(OSM.map, lat = lat_adj(lat, OSM.map), lon = lon_adj(lon, OSM.map), col=rgb(200,val,0,85,maxColorValue=255),pch=16,cex=4)

Here is the full code:

require(RgoogleMaps)
 
#define the part of the world you want to plot. Here the area around my former home.
lat_c<-51.47393
lon_c<-7.22667
bb<-qbbox(lat = c(lat_c[1]+0.01, lat_c[1]-0.01), lon = c(lon_c[1]+0.03, lon_c[1]-0.03))
 
# download the tile from OSM
OSM.map<-GetMap.OSM(lonR=bb$lonR, latR=bb$latR, scale = 20000, destfile="bochum.png")
image(OSM.map)
#Add some coordinates
lat<- c(51.47393, 51.479021)
lon<- c(7.22667, 7.222526)
val <- c(0, 255)
 
#function to adjust the coordinates
lat_adj<-function(lat, map){(map$BBOX$ll[1]-lat)/(map$BBOX$ll[1]-map$BBOX$ur[1])}
lon_adj<-function(lon, map){(map$BBOX$ll[2]-lon)/(map$BBOX$ll[2]-map$BBOX$ur[2])}
 
PlotOnStaticMap(OSM.map, lat = lat_adj(lat, OSM.map), lon = lon_adj(lon, OSM.map), col=rgb(255,0, val,90,maxColorValue=255),pch=16,cex=4)
 
dev.print(jpeg,"test.jpeg", width=1204, height=644, units="px")

The four steps to publication-grade graphics in R

[en]
For many, the main reason to use R is to generate really good-looking, or at least informative, graphics. However, while it is easy to find information on how to make an individual plot, it can take some time to find out how to get it out into the world. Here is my four-step program for turning your plot into a graphic file.

In the following I will use my present favorite plot from here as an example.

 

1. Set your options

R allows you to set many general options for your plots, e.g. the margins and whether or not a box should be drawn around the plot; most of them are described in the documentation here.

My favorites are:

  • mfrow: To combine several plots into one (not necessary for the example).
  • mar: To control the margins of the plot (used in the example).
  • las: To rotate the axis-labels (not necessary for the example).
par(mar=c(2,0,2,2))

2. Make your plot

Well this part is the most heterogeneous, just take a peek at the gallery to get some inspiration, or dive into ggplot2 for a very comprehensive graphic-framework that also helps you to add legends.

pie(c(1,1), labels="", col=c("black", "white"), main="Your options according to Yoda", init.angle=90)

3. Add a legend

R has a built-in function to add legends. The full documentation can be found here. The options I use almost every time are:

  • x,y: To tell R where to put the legend. Usually I use the name of the location (e.g. “topleft”) instead of x and y-coordinates.
  • legend: To add some descriptions for the colors/line-types/shadings.
  • fill: To select the colors, or alternatively “lty” for the line-type.
  • bty: To get rid of the box around the legend.
legend("right", c("do", "do not", "try"), fill=c("black", "white", "gold"), bty="n", cex=1.4)

4. Save it to a file

R has various options to save files, as documented here. I most often save them as png, as the file-size for tiffs is extremely large at the same quality. The png device allows you to set the following options:

  • filename: Well, something with a “.png” at the end.
  • width and height: To control the size of the image.
  • units: To set the unit in which width and height are given (e.g. “in” for inches).
  • res: To set the resolution. Journals love images with at least 300 dpi.
  • bg: To have a non-transparent background simply use “white”.
dev.print(png, "yoda.png", width=8, height=6, units="in", res=300, bg="white")

Enjoy the complete script

 

par(mar=c(2,0,2,2))
pie(c(1,1), labels="", col=c("black", "white"), main="Your options according to Yoda", init.angle=90)
legend("right", c("do", "do not", "try"), fill=c("black", "white", "gold"), bty="n", cex=1.4)
dev.print(png, "yoda.png", width=8, height=6, units="in", res=300, bg="white")
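
The script above copies a plot from an open screen device, which may not be available when the script runs without a display (e.g. via Rscript on a server). A variant that should also work there is to open the png device first, draw into it and close it again. This is a sketch of that alternative, not the approach used above, and writing to a temporary folder is just my choice for the example.

```r
# Write into a temporary folder; replace with your own path as needed
out_file <- file.path(tempdir(), "yoda.png")

# 1. + 4. Open the file device with the same size, resolution and background
png(out_file, width = 8, height = 6, units = "in", res = 300, bg = "white")

# Graphical options apply to the device that is currently open
par(mar = c(2, 0, 2, 2))

# 2. Make the plot
pie(c(1, 1), labels = "", col = c("black", "white"),
    main = "Your options according to Yoda", init.angle = 90)

# 3. Add the legend
legend("right", c("do", "do not", "try"),
       fill = c("black", "white", "gold"), bty = "n", cex = 1.4)

# Closing the device writes the file to disk
dev.off()

file.exists(out_file)
```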

[/en]
