A reviewer's dream has come true. The new statcheck package for R automagically checks the accuracy of reported statistical results, i.e. do the test statistic, degrees of freedom, and p-value add up? It relies on pdftotext to first convert the PDFs to text files, then extracts the stats and checks them for consistency. Unfortunately this does not work perfectly, so some PDFs cannot be parsed automatically. Another limitation is that it works best on APA-style reporting of stats and may miss other conventions.

If you want to read more about the package and the distribution of reporting errors across 30,000+ papers in psychology, read Nuijten et al. (2015).

The following ten lines of R code go through some of the PDFs on my hard drive and count the number of gross decision errors, i.e. cases where the reported test statistic and degrees of freedom indicate a non-significant effect while the reported p-value is smaller than .05, or vice versa.

To make it more interesting, I compared the prevalence of these gross errors in papers from vision research (31 papers with 342 stats) and clinical research (100 papers with 972 stats). And the winner is: clinical papers (1.7%) have fewer reporting errors than vision papers (3.8%), Chi(1) = 3.90, p < .05. Who would have guessed that!
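The counting step can be sketched as follows. checkPDFdir() and the logical DecisionError column come from the statcheck package; the directory paths are hypothetical, and the mock data frames below simply reproduce the error counts implied by the reported proportions (13/342 and 17/972):

```r
# Real workflow (paths are hypothetical):
# library(statcheck)
# vis  <- checkPDFdir("~/papers/vision")
# clin <- checkPDFdir("~/papers/clinical")

# Mock statcheck results: DecisionError flags stats whose test statistic
# and reported p-value disagree about significance at .05
vis  <- data.frame(DecisionError = rep(c(TRUE, FALSE), c(13, 329)))
clin <- data.frame(DecisionError = rep(c(TRUE, FALSE), c(17, 955)))

vis_err  <- sum(vis$DecisionError);  vis_all  <- nrow(vis)
clin_err <- sum(clin$DecisionError); clin_all <- nrow(clin)

res <- prop.test(c(vis_err, clin_err), c(vis_all, clin_all))
```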




> prop.test(c(vis_err, clin_err), c(vis_all, clin_all))

	2-sample test for equality of proportions with continuity correction

data:  c(vis_err, clin_err) out of c(vis_all, clin_all)
X-squared = 3.9002, df = 1, p-value = 0.04828
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.003332322  0.044376290
sample estimates:
    prop 1     prop 2 
0.03801170 0.01748971 


Also do not miss the graph!



We already explained here how you can use LaTeX to keep your CV up to date. Today I want to show you how the same BibTeX sources and style files(!) can be turned into a simple HTML file that can be embedded in any department webpage. All you need is the excellent command-line program bibtex2html.


Once this is installed, the following three steps write everything to file.

  1. Set the current directory as the directory for temporary files (this is only a workaround and may not be necessary): export TMPDIR=.
  2. Convert the individual BibTeX files to HTML and give each a header: bibtex2html
  3. Merge them all together and delete the temp files: cat and rm

Here is a short shell script that accomplishes these steps.

export TMPDIR=.
./bibtex2html -a -s plainyr.bst --revkeys --no-abstract --no-keywords -nodoc --header "<h1>Artikel mit Peer-review</h1>" paper.bib
./bibtex2html -a -s plainyr.bst --revkeys --no-abstract -nodoc --header "<h1>Kommentare</h1>" commentary.bib
./bibtex2html -a -s plainyr.bst --revkeys --no-abstract -nodoc --header "<h1>Buchkapitel und Proceedings</h1>" chapters.bib
./bibtex2html -a -s plainyr.bst --revkeys --no-abstract -nodoc --header "<h1>Konferenzen</h1>" conferences.bib
cat paper.html commentary.html chapters.html conferences.html > all_pubs.html
rm bib2html*

Hello again!

After not posting new stuff for more than a year, we want to relaunch the site.

Changes at our old hosting provider broke the database, so it was a good time to switch to a new platform.

We chose to move to Markdown because it fits our down-to-earth approach to science very nicely. Over the next weeks we will add the old posts one by one; at present, importing did not work for some of them. We will also take another shot at importing the old comments.

Open todos:

  • Cleaning posts
  • Cleaning tags and categories
  • Adding comments
  • Piwik integration

Many students struggle to find an adequate format for their thesis. Ironically, the advent of “modern” WYSIWYG programs seems to make it harder to consistently format a text.

While learning LaTeX may be a bit too much to ask, Markdown is a very minimal language that, together with pandoc, affords all the typesetting needs of an academic paper. Source documents written in Markdown can be opened and edited on any PC (or mobile device), and pandoc can translate them into beautifully formatted PDF and (if absolutely necessary) docx files. Specifically, Markdown supports:

  • Headings and subheadings
  • Figures and tables
  • Citations and references (here in APA6, but other styles are also possible)

To adapt the example, edit the file paper_v1.md.

See the example paper.

Once pandoc and LaTeX are installed, the following command generates a PDF.

pandoc -s -S --biblio biblio.bib --csl apa.csl -N -V geometry:margin=1in paper_v1.md -o paper.pdf

All files necessary to replicate and adapt the example can be found here.

In my last post I showed how one can easily summarize the outcome of a logistic regression. Here I want to show how much this depends on the data points used to estimate the model. Taking a cue from the evolution of a correlation, I plotted the estimated odds ratios (ORs) as a function of the number of included participants. The result is bad news for those working with small (< 750 participants) data sets.


eval_reg: a function to estimate model parameters for subsets of the data

eval_reg <- function(mod, dat, vars){
  est <- data.frame(matrix(nrow = nrow(dat), ncol = length(vars) + 1))
  pb <- txtProgressBar(min = 50, max = nrow(dat), style = 3)
  for(i in 50:nrow(dat)){
    try(boot_mod <- update(mod, data = dat[1:i, ]))
    est[i, ] <- c(exp(coef(boot_mod))[vars], i)  # store ORs and sample size
    setTxtProgressBar(pb, i)
  }
  close(pb)
  names(est) <- c(vars, 'mod_nr')
  est
}

As I randomized the order of the data, you can run it again and again and arrive at an even deeper mistrust, as some of the resulting permutations will look as if they stabilize earlier. To make this reproducible, set the random-number seed.
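The core of the approach, refitting the same model on a growing subset of rows with update(), can be demonstrated in a few self-contained lines (the data set and variable names here are simulated stand-ins):

```r
set.seed(42)  # fix the random-number seed for reproducibility

# Simulated stand-in data in (already randomized) row order
dat <- data.frame(y  = rbinom(1000, 1, 0.3),
                  x1 = rnorm(1000),
                  x2 = rnorm(1000))
mod <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Odds ratio for x1 from the first 100 rows vs. the full sample
or_100  <- exp(coef(update(mod, data = dat[1:100, ])))['x1']
or_1000 <- exp(coef(mod))['x1']
```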

Run and plot the development

ticks<-c(seq(.1, 1, by =.1), seq(0, 10, by =1), seq(10, 100, by =10))

# melt() from the reshape package converts est to long format
tmp2 <- melt(est, id = 'mod_nr')

ggplot(tmp2, aes(y = value, x = mod_nr, color = variable)) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = 2) +
  labs(title = 'Evolution of logistic regression', y = 'OR', x = 'number of participants') +
  scale_y_log10(breaks = ticks, labels = as.character(ticks))

Update 29-01-2013:

I added my definition of the ticks on the log-scale. The packages needed are ggplot2 and reshape.


If you, like me, work in medical research, you have to plot the results of multiple logistic regressions every once in a while. As I have not yet found a great solution for making these plots, I have put together the following short script. Do not expect too much; it is more of a reminder to my future self than some mind-boggling new invention. The code can be found below; the resulting figure looks like this:


Here comes the code. It takes the model and optionally a title as an input and generates the above plot.

plot_odds <- function(x, title = NULL){
  odds <- data.frame(cbind(exp(coef(x)), exp(confint(x))))
  names(odds) <- c('OR', 'lower', 'upper')
  odds$vars <- row.names(odds)
  ticks <- c(seq(.1, 1, by = .1), seq(0, 10, by = 1), seq(10, 100, by = 10))

  ggplot(odds, aes(y = OR, x = reorder(vars, OR))) +
    geom_point() +
    geom_errorbar(aes(ymin = lower, ymax = upper), width = .2) +
    scale_y_log10(breaks = ticks, labels = ticks) +
    geom_hline(yintercept = 1, linetype = 2) +
    coord_flip() +
    labs(title = title, x = 'Variables', y = 'OR')
}
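To try the function out, here is a hypothetical example on a stand-in model (mtcars ships with R; any glm with family = binomial will do). The last two lines rebuild the data frame that plot_odds() constructs internally:

```r
# Stand-in model: predict manual transmission from weight and horsepower
mod <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# ORs with profile-likelihood 95% confidence bounds
odds <- data.frame(cbind(exp(coef(mod)), exp(confint(mod))))
names(odds) <- c('OR', 'lower', 'upper')
```

Calling plot_odds(mod, title = 'Transmission type') then draws the forest-style plot shown above.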

P.S. I know about ggplot2's annotation_logticks, but it messed up my graphics. Also, it is not very often that ORs span more than three orders of magnitude; if they do, consider playing with that ggplot2 function or updating the line beginning with “ticks <-” in the above example.

Update 29-01-2013: I replaced the nasty typographic quotes, as they resulted in some nasty copy-paste errors…

As long as we are all still waiting for those large-scale data-sharing mechanisms to be put in place, we will have to use every bit of data we can get our hands on. For data that is only presented as a graph, instead of in tables or in the text, this means we need some kind of graph digitizer.

The best one I found is WebPlotDigitizer, developed by Ankit Rohatgi. While I generally prefer desktop versions of programs because they support reproducible research, this tool is so easy to use that no further search was necessary. Together with the video tutorials, it is a one-stop shop for your data-extraction needs.

Congratulations to the developer for this excellent example of one task, one tool!

For everyone out there who believes that science (i.e. the methods, data, and results) should be more open, hear the good news: you are not alone! A whole series of papers published in the world's most highly ranked journals is on your side - or are they?

Duncan Hull compiled an excellent list of resources.

My personal favorite is the following:

Screenshot of paywall for an editorial that is supposedly about open access

Every year you need an up-to-date CV. If you work in academia, 80% of it consists of an up-to-date publication list. After a lot of trying and comparing (alternatives from the big list include currvita and europecv), I decided to build my CV with the moderncv package developed by Xavier Danaux. I also made one adjustment to the output style “plainyr” so that the most recent articles are listed first in each bibliography.

You can find the finished result here: cv_hirschfeld_2012_03

You can find the sources for adapting it here: latex_cv.zip

To typeset it, carry out the following steps:

  1. Run pdflatex on the tex file:
pdflatex source.tex
  2. Run bibtex on the *.aux files of the individual bibliographies:
bibtex paper.aux
  3. Run pdflatex on the tex file again:
pdflatex source.tex

I recently thought about ways to visualize medications and their co-occurrences in a group of children. As long as you want to visualize up to 4 different medications, you can simply use Venn diagrams. There is a very nice R package to generate this kind of graphic for you (for a description see Chen and Boutros, 2011). But that is of little help here.

The problem I faced involved 29 different medications and 50 children. My data was stored in a table with 29 columns, one for each medication, and 50 rows, one for each child, so that each cell indicates whether or not the child took the medication.

M <- matrix(sample(0:1, 1450, replace=TRUE, prob=c(0.9,0.1)), nc=29)

The Solution - Social Network Analysis

There are several R packages to analyze and visualize social network data; I will focus on “igraph” in this post. The problem I had was that I was not, and probably still am not, familiar with the concepts and nomenclature of this field. The key to using the data described above for network analysis was understanding that such data is called an affiliation matrix, where individuals are affiliated with certain events. “igraph”, however, likes adjacency matrices, where every column and row represents a different node, in our case a medication. The diagonal gives the number of times a medication was given (more information can be found on Daizaburo Shizuka's site).

We transform an affiliation matrix into an adjacency matrix in R simply by:
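For an affiliation matrix of zeros and ones this is just the matrix cross-product; adj is the name the plotting code below expects:

```r
# Affiliation matrix: 50 children (rows) x 29 medications (columns)
M <- matrix(sample(0:1, 1450, replace = TRUE, prob = c(0.9, 0.1)), ncol = 29)

# Adjacency matrix: off-diagonal cells count how often two medications
# co-occur; the diagonal counts how often each medication was given
adj <- t(M) %*% M
```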


Now we can make a first bare-minimum plot:

require(igraph)
g <- graph.adjacency(adj, mode = "undirected", weighted = TRUE, diag = FALSE)
summary(g)
plot(g, main = "The bare minimum")

Adding information and spicing it up a notch

In all likelihood you want to add at least three kinds of information:

  1. Labels for the nodes

  2. Size of the nodes to represent the total number of events, aka medications

  3. Size of the links to represent the overlap between medications

name <- sample(c(LETTERS, letters, 1:99), 29, replace = TRUE)
number <- diag(adj) * 5 + 5
width <- (E(g)$weight / 2) + 1
plot(g, main = "A little more information", vertex.size = number, vertex.label = name, edge.width = width)

The “igraph” package lets you adjust quite a few parameters, so you should consult the manual. I only changed some of the colors, the layout, fonts, etc.

plot(g, main="Spice it up a notch", vertex.size=number, vertex.label=name, edge.width=width, layout=layout.lgl, vertex.color="red", edge.color="darkgrey", vertex.label.family ="sans", vertex.label.color="black")

Here is just the code:


# Generate example data
M <- matrix(sample(0:1, 1450, replace=TRUE, prob=c(0.9,0.1)), nc=29)

# Transform the affiliation matrix into an adjacency matrix
adj <- t(M) %*% M

# Make a simple plot
require(igraph)
g <- graph.adjacency(adj, mode = "undirected", weighted = TRUE, diag = FALSE)
plot(g, main = "The bare minimum")

# Add more information
name <- sample(c(LETTERS, letters, 1:99), 29, replace = TRUE)
number <- diag(adj) * 5 + 5
width <- (E(g)$weight / 2) + 1

plot(g, main = "A little more information", vertex.size = number, vertex.label = name, edge.width = width)

# Adjust some plotting parameters
plot(g, main="Spice it up a notch", vertex.size=number, vertex.label=name, edge.width=width, layout=layout.lgl, vertex.color="red", edge.color="darkgrey", vertex.label.family ="sans", vertex.label.color="black")