2 Reproducible Data Analysis
2.1 Definition and Motivation
Reproducibility involves being able to recalculate the exact numbers in a data analysis using the code and raw data provided by the analyst.
Reproducibility is often difficult to achieve and has slowed down the discovery of important data analytic errors.
Reproducibility should not be confused with “correctness” of a data analysis. A data analysis can be fully reproducible and recreate all numbers in an analysis and still be misleading or incorrect.
From Elements of Data Analytic Style, by Leek
2.2 Reproducible vs. Replicable
Reproducible research is often used these days to indicate the ability to recalculate the exact numbers in a data analysis.
Replicable research often refers to the ability to independently carry out a study (thereby collecting new data) and come to conclusions equivalent to those of the original study.
These two terms are often confused, so it is important to state clearly which definition is being used.
2.3 Steps to a Reproducible Analysis
- Use a data analysis script, e.g., R Markdown (discussed in the next section) or IPython Notebooks
- Record versions of software and parameters, e.g., use sessionInfo() in R, as in hw_1.Rmd
- Organize your data analysis
- Use version control, e.g., GitHub
- Set a random number generator seed, e.g., use set.seed() in R
- Have someone else run your analysis
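Two of the steps above, setting a seed and recording software versions, can be sketched in a few lines of R. The seed value 508 and the variable names are arbitrary choices for illustration.

```r
# Setting the same seed before each draw makes the "random" numbers repeatable.
set.seed(508)
first_draw <- rnorm(5)

set.seed(508)
second_draw <- rnorm(5)

# The two draws match because the generator started from the same state.
identical(first_draw, second_draw)

# sessionInfo() records the R version, platform, and attached packages.
info <- sessionInfo()
info$R.version$version.string
```

Printing the full `info` object at the end of a report gives readers the version details they need to rebuild the same environment.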
2.4 Organizing Your Data Analysis
- Data
    - Raw data
    - Processed data (sometimes multiple stages for very large data sets)
- Figures
    - Exploratory figures
    - Final figures
- R code
    - Raw or unused scripts
    - Data processing scripts
    - Analysis scripts
- Text
    - README files explaining what all the components are
    - Final data analysis products like presentations/writeups
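One possible way to set up a layout like the one above is to create the folders from R. This is only a sketch: the folder names and the project name `my_analysis` are hypothetical, and the project is created under `tempdir()` so that running the code does not touch real files.

```r
# Hypothetical project root; tempdir() keeps this sketch away from real files.
project_root <- file.path(tempdir(), "my_analysis")

# Folder names are illustrative; adapt them to your own conventions.
project_dirs <- c(
  "data/raw", "data/processed",
  "figures/exploratory", "figures/final",
  "code/raw", "code/processing", "code/analysis",
  "text"
)

for (d in project_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# A README placeholder explaining what the components are.
writeLines(
  "Describe each folder and how to re-run the analysis here.",
  file.path(project_root, "text", "README.md")
)

list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```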
2.5 Common Mistakes
- Failing to use a script for your analysis
- Not recording software and package version numbers or other settings used
- Not sharing your data and code
- Using reproducibility as a social weapon
2.6 R Markdown
2.6.1 R + Markdown + knitr
R Markdown was developed by the RStudio team to allow one to write reproducible research documents using Markdown and knitr. This functionality is contained in the rmarkdown package, and it is easy to use from within RStudio.
Markdown was originally developed as a very simple text-to-HTML conversion tool. With Pandoc, Markdown becomes a very simple text-to-X conversion tool, where X can be many different formats: HTML, LaTeX, PDF, Word, etc.
2.6.2 R Markdown Files
R Markdown documents begin with a metadata section, the YAML header, that can include information on the title, author, and date as well as options for customizing output.
---
title: "QCB 508 -- Homework 1"
author: "Your Name"
date: February 23, 2017
output: pdf_document
---
Many options are available. See http://rmarkdown.rstudio.com for full documentation.
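Outside of RStudio, a document can be rendered from the R console with `rmarkdown::render()`, whose `output_format` argument can override the `output` field of the header. The sketch below writes a tiny, hypothetical document to a temporary file and renders it to HTML only if the rmarkdown package and Pandoc are both available.

```r
# A tiny, hypothetical R Markdown document written to a temporary file.
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  "Some text."
), rmd_path)

# Render only if the rmarkdown package and Pandoc are both available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_path <- rmarkdown::render(rmd_path,
                                output_format = "html_document",
                                quiet = TRUE)
  file.exists(out_path)
}
```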
2.6.3 Markdown
Headers:
# Header 1
## Header 2
### Header 3
Emphasis:
*italic* **bold**
_italic_ __bold__
Tables:
First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell
Unordered list:
- Item 1
- Item 2
    - Item 2a
    - Item 2b
Ordered list:
1. Item 1
2. Item 2
3. Item 3
    - Item 3a
    - Item 3b
Links:
http://example.com
[linked phrase](http://example.com)
Blockquotes:
Florence Nightingale once said:
> For the sick it is important
> to have the best.
Plain code blocks:
```
This text is displayed verbatim with no formatting.
```
Inline Code:
We use the `print()` function to print the contents
of a variable in R.
Additional documentation and examples can be found here and here.
2.6.4 LaTeX
LaTeX is a markup language for technical writing, especially for mathematics. It can be included in R Markdown files.
For example,
$y = a + bx + \epsilon$
produces
\(y = a + bx + \epsilon\)
Here is an introduction to LaTeX and here is a primer on LaTeX for R Markdown.
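The example in this section uses single dollar signs, which set the math inline with the text. As a sketch, the same equation can be set as a displayed equation on its own line by using double dollar signs, which Pandoc-based output formats generally support:

```latex
$$
y = a + bx + \epsilon
$$
```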
2.6.5 knitr
The knitr R package allows one to execute R code within a document and to display the code itself and its output (if desired). This is particularly easy to do in the R Markdown setting. For example…
Placing the following text in an R Markdown file
The sum of 2 and 2 is `r 2+2`.
produces in the output file
The sum of 2 and 2 is 4.
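The same inline evaluation can be tried directly from the R console, without creating a file, by passing text to `knitr::knit()`. This is a minimal sketch that assumes the knitr package is installed; the variable names are illustrative.

```r
# Requires the knitr package; the text is the inline example from this section.
if (requireNamespace("knitr", quietly = TRUE)) {
  source_text <- "The sum of 2 and 2 is `r 2+2`."
  # knit() evaluates the inline R expression and returns the processed text.
  rendered <- knitr::knit(text = source_text, quiet = TRUE)
  cat(rendered, "\n")
}
```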
2.6.6 knitr Chunks
Chunks of R code are set apart from the text. In R Markdown:
```{r}
x <- 2
x + 1
print(x)
```
Output in file:
> x <- 2
> x + 1
[1] 3
> print(x)
[1] 2
2.6.7 Chunk Option: echo
In R Markdown:
```{r, echo=FALSE}
x <- 2
x + 1
print(x)
```
Output in file:
[1] 3
[1] 2
2.6.8 Chunk Option: results
In R Markdown:
```{r, results="hide"}
x <- 2
x + 1
print(x)
```
Output in file:
> x <- 2
> x + 1
> print(x)
2.6.9 Chunk Option: include
In R Markdown:
```{r, include=FALSE}
x <- 2
x + 1
print(x)
```
Output in file:
(nothing)
2.6.10 Chunk Option: eval
In R Markdown:
```{r, eval=FALSE}
x <- 2
x + 1
print(x)
```
Output in file:
> x <- 2
> x + 1
> print(x)
2.6.11 Chunk Names
Naming your chunks can be useful for identifying them in your file and during the execution, and also to denote dependencies among chunks.
```{r my_first_chunk}
x <- 2
x + 1
print(x)
```
2.6.12 knitr Option: cache
Sometimes you don’t want to run chunks over and over, especially for large calculations. You can “cache” them.
```{r chunk1, cache=TRUE, include=FALSE}
x <- 2
```
```{r chunk2, cache=TRUE, dependson="chunk1"}
y <- 3
z <- x + y
```
This creates a directory called cache in your working directory that stores the objects created or modified in these chunks. When chunk1 is modified, it is re-run. Since chunk2 depends on chunk1, it will also be re-run.
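The idea behind caching can be illustrated with a hand-rolled version in base R. This is not how knitr implements its cache (knitr also re-runs a chunk when its code changes, which this sketch does not do); the helper `cached_compute()` and the file name are hypothetical.

```r
# Hypothetical cache file in a temporary location.
cache_file <- tempfile(fileext = ".rds")

# Return the stored result if it exists; otherwise compute and store it.
cached_compute <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

slow_sum <- function() {
  Sys.sleep(0.1)  # stand-in for a long calculation
  sum(1:1000)
}

first  <- cached_compute(cache_file, slow_sum)  # computes and stores
second <- cached_compute(cache_file, slow_sum)  # reads the stored value
identical(first, second)
```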
2.6.13 knitr Options: figures
You can add chunk options regarding the placement and size of figures. Examples include:
- fig.width
- fig.height
- fig.align
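In knitr, fig.width and fig.height are given in inches and are passed on to the graphics device. As a rough analogy outside of knitr, the sketch below opens a PDF device with an explicit width and height; the file path and the plotted values are illustrative only.

```r
# Temporary output file for the figure.
fig_path <- tempfile(fileext = ".pdf")

# Open a 6 x 4 inch PDF device, similar in spirit to fig.width=6, fig.height=4.
pdf(fig_path, width = 6, height = 4)
plot(1:10, (1:10)^2, main = "Example figure", xlab = "x", ylab = "x squared")
dev.off()

file.exists(fig_path)
```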
2.6.14 Changing Default Chunk Settings
If you will be using the same options on most chunks, you can set default options for the entire document. Run something like this at the beginning of your document with your desired chunk options.
```{r my_opts, cache=FALSE, echo=FALSE}
library("knitr")
opts_chunk$set(fig.align="center", fig.height=4, fig.width=6)
```
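To check which defaults are in effect, the current values can be read back with `opts_chunk$get()`. The sketch below assumes the knitr package is installed; it sets the same three options as above, reads one back, and then restores the previous values so that it has no lasting effect on the session.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  # Remember the current defaults so they can be restored afterwards.
  old_opts <- knitr::opts_chunk$get(c("fig.align", "fig.height", "fig.width"))

  knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)

  # Read back one of the new defaults.
  new_align <- knitr::opts_chunk$get("fig.align")
  print(new_align)

  # Restore the previous defaults.
  knitr::opts_chunk$set(old_opts)
}
```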