2 Reproducible Data Analysis

2.1 Definition and Motivation

  • Reproducibility involves being able to recalculate the exact numbers in a data analysis using the code and raw data provided by the analyst.

  • Reproducibility is often difficult to achieve and has slowed down the discovery of important data analytic errors.

  • Reproducibility should not be confused with “correctness” of a data analysis. A data analysis can be fully reproducible and recreate all numbers in an analysis and still be misleading or incorrect.

From Elements of Data Analytic Style, by Leek

2.2 Reproducible vs. Replicable

The term reproducible research is often used these days to indicate the ability to recalculate the exact numbers in a data analysis.

Replicable research results often refers to the ability to independently carry out a study (thereby collecting new data) and come to conclusions equivalent to those of the original study.

These two terms are often confused, so it is important to state clearly which definition is being used.
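The distinction can be sketched with a toy simulation. This is only an illustration: the function run_study(), the simulated population, and the use of a different seed to stand in for "collecting new data" are all assumptions made for the example, not part of the original text.

```r
# A toy "study": draw a sample and estimate the population mean.
# The population (normal, mean 10, sd 2) and the sample size are hypothetical.
run_study <- function(seed, n = 200) {
  set.seed(seed)                       # the seed determines which data are "collected"
  sample_data <- rnorm(n, mean = 10, sd = 2)
  mean(sample_data)
}

original   <- run_study(seed = 1)
reproduced <- run_study(seed = 1)      # same data, same code
replicated <- run_study(seed = 2)      # new data, same study design

# Reproducibility: the exact number is recovered.
print(identical(original, reproduced)) # TRUE

# Replicability: the numbers differ, but they should support the same conclusion.
print(c(original = original, replicated = replicated))
```

Rerunning with the same seed recovers the original estimate exactly, while the "new study" gives a different number that should still be close to it.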

2.3 Steps to a Reproducible Analysis

  1. Use a data analysis script – e.g., R Markdown (discussed in Section 2.6) or IPython (Jupyter) Notebooks

  2. Record versions of software and parameters – e.g., use sessionInfo() in R as in hw_1.Rmd

  3. Organize your data analysis

  4. Use version control – e.g., Git, with the repository hosted on GitHub

  5. Set a random number generator seed – e.g., use set.seed() in R

  6. Have someone else run your analysis
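Steps 2 and 5 above can be sketched in a few lines of base R. The seed value 508 and the variable names are arbitrary choices made for this illustration.

```r
# Step 5: fix the random number generator seed so random draws can be regenerated.
set.seed(508)                  # the value 508 is arbitrary
draw1 <- rnorm(5)

set.seed(508)                  # resetting the same seed ...
draw2 <- rnorm(5)              # ... regenerates exactly the same numbers

print(identical(draw1, draw2)) # TRUE

# Step 2: record the R version, platform, and attached package versions.
info <- sessionInfo()
print(info$R.version$version.string)
print(info)                    # full record, suitable for the end of a report
```

Placing the sessionInfo() call in the last chunk of a report keeps the version record together with the results it describes.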

2.4 Organizing Your Data Analysis

  • Data
    • raw data
    • processed data (sometimes multiple stages for very large data sets)
  • Figures
    • Exploratory figures
    • Final figures
  • R code
    • Raw or unused scripts
    • Data processing scripts
    • Analysis scripts
  • Text
    • README files explaining what all the components are
    • Final data analysis products like presentations/writeups
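One way to set up such a layout is sketched below. The directory names are hypothetical and simply mirror the outline above; the skeleton is created under a temporary directory so that running the example does not touch a real project.

```r
# Create the project skeleton under a temporary directory.
# All directory and file names here are illustrative assumptions.
project_root <- file.path(tempdir(), "my_analysis")

subdirs <- c(
  "data/raw", "data/processed",
  "figures/exploratory", "figures/final",
  "code/raw", "code/processing", "code/analysis",
  "text"
)

for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# A README explaining what the components are.
readme_file <- file.path(project_root, "text", "README.md")
writeLines(
  c(
    "data/    : raw data and processed data",
    "figures/ : exploratory and final figures",
    "code/    : raw, data processing, and analysis scripts",
    "text/    : README files and final write-ups"
  ),
  readme_file
)

# List what was created.
print(list.dirs(project_root, full.names = FALSE, recursive = TRUE))
```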

2.5 Common Mistakes

  • Failing to use a script for your analysis
  • Not recording software and package version numbers or other settings used
  • Not sharing your data and code
  • Using reproducibility as a social weapon

2.6 R Markdown

2.6.1 R + Markdown + knitr

R Markdown was developed by the RStudio team to allow one to write reproducible research documents using Markdown and knitr. The functionality is provided by the rmarkdown package and is integrated into RStudio, where documents can be rendered directly.

Markdown was originally developed as a very simple text-to-HTML conversion tool. With Pandoc, Markdown becomes a very simple text-to-X conversion tool, where X can be many different formats: HTML, LaTeX, PDF, Word, etc.

2.6.2 R Markdown Files

R Markdown documents begin with a metadata section, the YAML header, that can include information on the title, author, and date as well as options for customizing output.

---
title: "QCB 508 -- Homework 1"
author: "Your Name"
date: February 23, 2017
output: pdf_document
---

Many options are available. See http://rmarkdown.rstudio.com for full documentation.
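As a rough sketch of how a file with such a header is turned into an output document outside of RStudio, the code below writes a minimal R Markdown file and renders it with rmarkdown::render(). The file name, title, and body text are hypothetical, and html_document is used instead of pdf_document only so that no LaTeX installation is assumed. Rendering is skipped if the rmarkdown package or Pandoc is not available.

```r
# Write a minimal R Markdown file with a YAML header to a temporary location.
# The file name, title, and body text are made up for this sketch.
rmd_file <- file.path(tempdir(), "yaml_example.Rmd")
rmd_lines <- c(
  "---",
  'title: "YAML Header Example"',
  'author: "Your Name"',
  "output: html_document",
  "---",
  "",
  "This document was rendered from an R Markdown file."
)
writeLines(rmd_lines, rmd_file)

# Render only if the rmarkdown package and Pandoc are both available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)  # path of the rendered HTML file
} else {
  message("rmarkdown or Pandoc not available; skipping render.")
}
```

The output format is taken from the output field of the YAML header, so changing that one line is enough to switch formats.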

2.6.3 Markdown


Headers:

# Header 1
## Header 2
### Header 3


Emphasis:

*italic* **bold**
_italic_ __bold__


Tables:

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

Unordered list:

- Item 1
- Item 2
    - Item 2a
    - Item 2b

Ordered list:

1. Item 1
2. Item 2
3. Item 3
    - Item 3a
    - Item 3b



Links:

[linked phrase](http://example.com)


Block quotes:

Florence Nightingale once said:

> For the sick it is important 
> to have the best. 

Plain code blocks (text enclosed between two lines that each contain three backticks):

This text is displayed verbatim with no formatting.

Inline Code:

We use the `print()` function to print the contents 
of a variable in R.

Additional documentation and examples can be found here and here.

2.6.4 LaTeX

LaTeX is a markup language for technical writing, especially for mathematics. It can be included in R Markdown files.

For example, placing the following text in an R Markdown file

$y = a + bx + \epsilon$

produces in the output file

\(y = a + bx + \epsilon\)

Here is an introduction to LaTeX and here is a primer on LaTeX for R Markdown.
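As a further sketch, assuming the standard Pandoc math syntax used by R Markdown, an equation can be set on its own centered line by enclosing it in double dollar signs instead of single ones. The indexed form of the model below is an illustrative variation on the example above.

```latex
$$
y_i = a + b x_i + \epsilon_i, \qquad i = 1, \ldots, n
$$
```

Single dollar signs keep the mathematics inline with the surrounding sentence; double dollar signs produce a displayed equation.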

2.6.5 knitr

The knitr R package allows one to execute R code within a document, and to display the code itself and its output (if desired). This is particularly easy to do in the R Markdown setting. For example…

Placing the following text in an R Markdown file

The sum of 2 and 2 is `r 2+2`.

produces in the output file

The sum of 2 and 2 is 4.
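The same substitution can be tried directly from the R console. The sketch below assumes the knitr package is installed and passes the line to knitr::knit() through its text argument rather than through a file. The variable names are arbitrary.

```r
# Knit a single line of Markdown containing inline R code.
# knitr::knit() accepts the document source as a character vector via `text`.
if (requireNamespace("knitr", quietly = TRUE)) {
  src <- "The sum of 2 and 2 is `r 2+2`."
  out <- knitr::knit(text = src, quiet = TRUE)
  cat(out, "\n")   # the inline expression is replaced by its value
} else {
  message("knitr not available; skipping.")
}
```

The printed line should match the output shown above, with the inline expression replaced by its evaluated result.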

2.6.6 knitr Chunks

Chunks are blocks of R code set apart from the text. In R Markdown:

```{r}
x <- 2
x + 1
print(x)
```

Output in file:

> x <- 2
> x + 1
[1] 3
> print(x)
[1] 2

2.6.7 Chunk Option: echo

In R Markdown:

```{r, echo=FALSE}
x <- 2
x + 1
print(x)
```

Output in file:

[1] 3
[1] 2

2.6.8 Chunk Option: results

In R Markdown:

```{r, results="hide"}
x <- 2
x + 1
print(x)
```

Output in file:

> x <- 2
> x + 1
> print(x)

2.6.9 Chunk Option: include

In R Markdown:

```{r, include=FALSE}
x <- 2
x + 1
print(x)
```

Output in file:

(Nothing is shown: the chunk is still run, but neither its code nor its output appears.)


2.6.10 Chunk Option: eval

In R Markdown:

```{r, eval=FALSE}
x <- 2
x + 1
print(x)
```

Output in file:

> x <- 2
> x + 1
> print(x)

2.6.11 Chunk Names

Naming your chunks can be useful for identifying them in your file and during the execution, and also to denote dependencies among chunks.

```{r my_first_chunk}
x <- 2
x + 1
```

2.6.12 knitr Option: cache

Sometimes you don’t want to run chunks over and over, especially for large calculations. You can “cache” them.

```{r chunk1, cache=TRUE, include=FALSE}
x <- 2
```

```{r chunk2, cache=TRUE, dependson="chunk1"}
y <- 3
z <- x + y
```

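This creates a cache directory in your working directory that stores the objects created or modified in these chunks. When knitr is called directly the directory is named cache by default; when rendering through the rmarkdown package it is named after the document, with a _cache suffix. When chunk1 is modified, it is re-run. Since chunk2 depends on chunk1, it will also be re-run.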

2.6.13 knitr Options: figures

You can add chunk options regarding the placement and size of figures. Examples include:

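  • fig.width: width of the figure, in inches
  • fig.height: height of the figure, in inches
  • fig.align: horizontal alignment of the figure ("left", "center", or "right")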

2.6.14 Changing Default Chunk Settings

If you will be using the same options on most chunks, you can set default options for the entire document. Run something like this at the beginning of your document with your desired chunk options.

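```{r my_opts, cache=FALSE, echo=FALSE}
# opts_chunk lives in the knitr package; the knitr:: prefix avoids needing library(knitr)
knitr::opts_chunk$set(fig.align="center", fig.height=4, fig.width=6)
```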