3 R Programming
3.1 Control Structures
3.1.1 Rationale
- Control structures in R allow you to control the flow of execution of a series of R expressions.
- They allow you to put some logic into your R code, rather than just always executing the same R code every time.
- Control structures also allow you to respond to inputs or to features of the data and execute different R expressions accordingly.
Paraphrased from R Programming for Data Science, by Peng
3.1.2 Common Control Structures
- if and else: testing a condition and acting on it
- for: execute a loop a fixed number of times
- while: execute a loop while a condition is true
- repeat: execute an infinite loop (must break out of it to stop)
- break: break the execution of a loop
- next: skip an iteration of a loop
From R Programming for Data Science, by Peng
3.1.3 Some Boolean Logic
R has built-in functions that produce TRUE or FALSE, such as is.vector or is.na. You can also do the following:
- x == y: does x equal y?
- x > y: is x greater than y? (also < for less than)
- x >= y: is x greater than or equal to y?
- x && y: are both x and y true?
- x || y: is either x or y true?
- !is.vector(x): this is TRUE if x is not a vector
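A minimal sketch of these operators on single values. The variables x and y below are illustrative and not part of the course material.

```r
x <- 3
y <- 5

x == y                 # FALSE: 3 is not equal to 5
x < y                  # TRUE
(x < y) && (y < 10)    # TRUE: both conditions hold
(x > y) || (y > 4)     # TRUE: the second condition holds
!is.vector(x)          # FALSE: a single number is a vector of length 1
```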
3.1.4 if
Idea:
if(<condition>) {
## do something
}
### Continue with rest of code
Example:
> x <- c(1,2,3)
> if(is.numeric(x)) {
+ x+2
+ }
[1] 3 4 5
3.1.5 if-else
Idea:
if(<condition>) {
## do something
} else {
## do something else
}
Example:
> x <- c("a", "b", "c")
> if(is.numeric(x)) {
+ print(x+2)
+ } else {
+ class(x)
+ }
[1] "character"
3.1.6 for Loops
Example:
> for(i in 1:10) {
+ print(i)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Examples:
> x <- c("a", "b", "c", "d")
>
> for(i in 1:4) {
+ print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
>
> for(i in seq_along(x)) {
+ print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
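The two loops above print the same values, but seq_along(x) is arguably the safer habit. A small sketch of why, using an empty vector as an assumed edge case:

```r
x <- character(0)      # an empty vector

1:length(x)            # c(1, 0): a loop over this would run twice
seq_along(x)           # integer(0): a loop over this never runs

n_iter <- 0
for(i in seq_along(x)) {
  n_iter <- n_iter + 1
}
n_iter                 # 0
```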
3.1.7 Nested for Loops
Example:
> m <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
>
> for(i in seq_len(nrow(m))) {
+ for(j in seq_len(ncol(m))) {
+ print(m[i,j])
+ }
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
3.1.8 while
Example:
> x <- 1:10
> idx <- 1
>
> while(x[idx] < 4) {
+ print(x[idx])
+ idx <- idx + 1
+ }
[1] 1
[1] 2
[1] 3
>
> idx
[1] 4
Repeats the loop while the condition is TRUE.
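If the condition never becomes FALSE, idx eventually runs past the end of x, x[idx] becomes NA, and the loop stops with an error. One possible guard, sketched with a shorter x chosen for illustration:

```r
x <- 1:3
idx <- 1

# && evaluates its right side only when the left side is TRUE,
# so x[idx] is never tested once idx runs past the end of x
while(idx <= length(x) && x[idx] < 4) {
  idx <- idx + 1
}
idx   # 4
```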
3.1.9 repeat
Example:
> x <- 1:10
> idx <- 1
>
> repeat {
+ print(x[idx])
+ idx <- idx + 1
+ if(idx >= 4) {
+ break
+ }
+ }
[1] 1
[1] 2
[1] 3
>
> idx
[1] 4
Repeats the loop until break is executed.
3.1.10 break and next
break ends the loop. next skips the rest of the current loop iteration.
Example:
> x <- 1:1000
> for(idx in 1:1000) {
+ # %% calculates division remainder
+ if((x[idx] %% 2) > 0) {
+ next
+ } else if(x[idx] > 10) { # an else-if!!
+ break
+ } else {
+ print(x[idx])
+ }
+ }
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
3.2 Vectorized Operations
3.2.1 Calculations on Vectors
R applies arithmetic to vectors element by element, so explicit loops are often unnecessary. Examples:
>
> x <- 1:3
> y <- 4:6
>
> 2*x # same as c(2*x[1], 2*x[2], 2*x[3])
[1] 2 4 6
> x + 1 # same as c(x[1]+1, x[2]+1, x[3]+1)
[1] 2 3 4
> x + y # same as c(x[1]+y[1], x[2]+y[2], x[3]+y[3])
[1] 5 7 9
> x*y # same as c(x[1]*y[1], x[2]*y[2], x[3]*y[3])
[1] 4 10 18
3.2.2 A Caveat
If two vectors are of different lengths, R tries to find a solution for you (and doesn’t always tell you).
> x <- 1:5
> y <- 1:2
> x+y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 2 4 4 6 6
What happened here?
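A sketch of what R did: the shorter vector is recycled (repeated from its start) until it matches the length of the longer one. rep_len() is used below only to make the recycling visible; it is not meant to describe R's internal implementation.

```r
x <- 1:5
y <- 1:2

y_recycled <- rep_len(y, length(x))   # 1 2 1 2 1
x + y_recycled                        # 2 4 4 6 6, with no warning

# when the longer length is a multiple of the shorter one,
# R recycles silently
1:4 + 1:2                             # 2 4 4 6
```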
3.2.3 Vectorized Matrix Operations
Operations on matrices are also vectorized. Example:
> x <- matrix(1:4, nrow=2, ncol=2, byrow=TRUE)
> y <- matrix(1:4, nrow=2, ncol=2)
>
> x+y
[,1] [,2]
[1,] 2 5
[2,] 5 8
>
> x*y
[,1] [,2]
[1,] 1 6
[2,] 6 16
3.2.4 Mixing Vectors and Matrices
What happens when we do calculations involving a vector and a matrix? Example:
> x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> z <- 1:2
>
> x + z
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 6 7 8
>
> x * z
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 8 10 12
3.2.5 Mixing Vectors and Matrices
Another example:
> x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> z <- 1:3
>
> x + z
[,1] [,2] [,3]
[1,] 2 5 5
[2,] 6 6 9
>
> x * z
[,1] [,2] [,3]
[1,] 1 6 6
[2,] 8 5 18
What happened this time?
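One way to see it: R stores a matrix column by column, so the vector is recycled down the columns, not across the rows. The sketch below rebuilds x + z by hand; manual and z_recycled are illustrative names.

```r
x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
z <- 1:3

as.vector(x)                            # 1 4 2 5 3 6 (column by column)
z_recycled <- rep_len(z, length(x))     # 1 2 3 1 2 3
manual <- matrix(as.vector(x) + z_recycled, nrow=nrow(x))
identical(manual, x + z)                # TRUE
```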
3.2.6 Vectorized Boolean Logic
We saw && and || applied to pairs of logical values. The operators & and | are their vectorized counterparts: they compare two logical vectors element by element.
> a <- c(TRUE, TRUE, FALSE)
> b <- c(FALSE, TRUE, FALSE)
>
> a | b
[1] TRUE TRUE FALSE
> a & b
[1] FALSE TRUE FALSE
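An if condition needs a single TRUE or FALSE, so a vectorized result is usually reduced to one value first. A short sketch using any() and all() from base R:

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(FALSE, TRUE, FALSE)

both <- a & b      # FALSE TRUE FALSE
any(both)          # TRUE: at least one position where both are TRUE
all(a | b)         # FALSE: the third position is FALSE in both vectors
```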
3.3 Subsetting R Objects
3.3.1 Subsetting Vectors
> x <- 1:8
>
> x[1] # extract the first element
[1] 1
> x[2] # extract the second element
[1] 2
>
> x[1:4] # extract the first 4 elements
[1] 1 2 3 4
>
> x[c(1, 3, 4)] # extract elements 1, 3, and 4
[1] 1 3 4
> x[-c(1, 3, 4)] # extract all elements EXCEPT 1, 3, and 4
[1] 2 5 6 7 8
3.3.2 Subsetting Vectors
> names(x) <- letters[1:8]
> x
a b c d e f g h
1 2 3 4 5 6 7 8
>
> x[c("a", "b", "f")]
a b f
1 2 6
>
> s <- x > 3
> s
a b c d e f g h
FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
> x[s]
d e f g h
4 5 6 7 8
3.3.3 Subsetting Matrices
> x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> x
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
>
> x[1,2]
[1] 2
> x[1, ]
[1] 1 2 3
> x[ ,2]
[1] 2 5
3.3.4 Subsetting Matrices
> colnames(x) <- c("A", "B", "C")
>
> x[ , c("B", "C")]
B C
[1,] 2 3
[2,] 5 6
>
> x[c(FALSE, TRUE), c("B", "C")]
B C
5 6
>
> x[2, c("B", "C")]
B C
5 6
3.3.5 Subsetting Matrices
> s <- (x %% 2) == 0
> s
A B C
[1,] FALSE TRUE FALSE
[2,] TRUE FALSE TRUE
>
> x[s]
[1] 4 2 6
>
> x[c(2, 3, 6)]
[1] 4 2 6
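The positions 2, 3, and 6 count down the columns of the matrix. As a sketch, which() recovers them from the logical matrix, either as single positions or as row and column pairs (pos is an illustrative name):

```r
x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
s <- (x %% 2) == 0

which(s)                         # 2 3 6: positions counted column by column
pos <- which(s, arr.ind=TRUE)    # the same positions as (row, col) pairs
pos
```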
3.3.6 Subsetting Lists
> x <- list(my=1:3, favorite=c("a", "b", "c"),
+ course=c(FALSE, TRUE, NA))
>
> x[[1]]
[1] 1 2 3
> x[["my"]]
[1] 1 2 3
> x$my
[1] 1 2 3
> x[[c(3,1)]]
[1] FALSE
> x[[3]][1]
[1] FALSE
> x[c(3,1)]
$course
[1] FALSE TRUE NA
$my
[1] 1 2 3
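The difference between single and double brackets is easy to miss in the output above. A minimal sketch: single brackets return a smaller list, double brackets return the element itself.

```r
x <- list(my=1:3, favorite=c("a", "b", "c"),
          course=c(FALSE, TRUE, NA))

class(x[1])     # "list": a list of length 1 that contains my
class(x[[1]])   # "integer": the element itself
length(x[1])    # 1
length(x[[1]])  # 3
```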
3.3.7 Subsetting Data Frames
> x <- data.frame(my=1:3, favorite=c("a", "b", "c"),
+ course=c(FALSE, TRUE, NA))
>
> x[[1]]
[1] 1 2 3
> x[["my"]]
[1] 1 2 3
> x$my
[1] 1 2 3
> x[[c(3,1)]]
[1] FALSE
> x[[3]][1]
[1] FALSE
> x[c(3,1)]
course my
1 FALSE 1
2 TRUE 2
3 NA 3
3.3.8 Subsetting Data Frames
> x <- data.frame(my=1:3, favorite=c("a", "b", "c"),
+ course=c(FALSE, TRUE, NA))
>
> x[1, ]
my favorite course
1 1 a FALSE
> x[ ,3]
[1] FALSE TRUE NA
> x[ ,"favorite"]
[1] a b c
Levels: a b c
> x[1:2, ]
my favorite course
1 1 a FALSE
2 2 b TRUE
> x[ ,2:3]
favorite course
1 a FALSE
2 b TRUE
3 c NA
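Selecting a single column with x[ ,3] returns a plain vector, while selecting two columns returns a data frame. If a one-column data frame is wanted, the drop argument can be set, as in this sketch:

```r
x <- data.frame(my=1:3, favorite=c("a", "b", "c"),
                course=c(FALSE, TRUE, NA))

class(x[ , 3])               # "logical": a plain vector
class(x[ , 3, drop=FALSE])   # "data.frame": still one column wide
dim(x[ , 3, drop=FALSE])     # 3 1
```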
3.3.9 Note on Data Frames
Before version 4.0.0, data.frame() converted character strings to factors unless you specified otherwise; since version 4.0.0 the default is stringsAsFactors=FALSE.
In the previous slide, we saw it converted the “favorite” column to a factor. Let’s fix that…
> x <- data.frame(my=1:3, favorite=c("a", "b", "c"),
+ course=c(FALSE, TRUE, NA),
+ stringsAsFactors=FALSE)
>
> x[ ,"favorite"]
[1] "a" "b" "c"
> class(x[ ,"favorite"])
[1] "character"
3.3.10 Missing Values
> data("airquality", package="datasets")
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> dim(airquality)
[1] 153 6
> which(is.na(airquality$Ozone))
[1] 5 10 25 26 27 32 33 34 35 36 37 39 42 43 45 46 52
[18] 53 54 55 56 57 58 59 60 61 65 72 75 83 84 102 103 107
[35] 115 119 150
> sum(is.na(airquality$Ozone))
[1] 37
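Once the missing values are located, they usually have to be handled explicitly. A sketch of two common options in base R; ozone_obs is an illustrative name.

```r
data("airquality", package="datasets")

mean(airquality$Ozone)               # NA, because some values are missing
mean(airquality$Ozone, na.rm=TRUE)   # mean of the observed values only

ozone_obs <- airquality$Ozone[!is.na(airquality$Ozone)]
length(ozone_obs)                    # 116, which is 153 - 37
```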
3.3.11 Subsetting by Matching
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
[18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
> vowels <- c("a", "e", "i", "o", "u")
>
> letters %in% vowels
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[12] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[23] FALSE FALSE FALSE FALSE
> which(letters %in% vowels)
[1] 1 5 9 15 21
>
> letters[which(letters %in% vowels)]
[1] "a" "e" "i" "o" "u"
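The which() step is optional here, since a logical vector can be used for subsetting directly. A short sketch, which also negates the match:

```r
vowels <- c("a", "e", "i", "o", "u")

letters[letters %in% vowels]      # "a" "e" "i" "o" "u"
letters[!(letters %in% vowels)]   # the 21 consonants
```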
3.3.12 Advanced Subsetting
The R Programming for Data Science chapter titled “Subsetting R Objects” contains additional material on subsetting that you should know.
The Advanced R website contains more detailed information on subsetting that you may find useful.
3.4 Functions
3.4.1 Rationale
Writing functions is a core activity of an R programmer. It represents the key step of the transition from a mere user to a developer who creates new functionality for R.
Functions are often used to encapsulate a sequence of expressions that need to be executed numerous times, perhaps under slightly different conditions.
Functions are also often written when code must be shared with others or the public.
From R Programming for Data Science, by Peng
3.4.2 Defining a New Function
- Functions are defined using the function() directive
- They are stored as variables, so they can be passed to other functions and assigned to new variables
- Arguments and a final return object are defined
3.4.3 Example 1
> my_square <- function(x) {
+ x*x # can also do return(x*x)
+ }
>
> my_square(x=2)
[1] 4
>
> my_fun2 <- my_square
> my_fun2(x=3)
[1] 9
3.4.4 Example 2
> my_square_ext <- function(x) {
+ y <- x*x
+ return(list(x_original=x, x_squared=y))
+ }
>
> my_square_ext(x=2)
$x_original
[1] 2
$x_squared
[1] 4
>
> z <- my_square_ext(x=2)
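Since the function returns a list, the stored result can be taken apart with the list subsetting tools from earlier. A minimal sketch:

```r
my_square_ext <- function(x) {
  y <- x*x
  return(list(x_original=x, x_squared=y))
}

z <- my_square_ext(x=2)
z$x_squared        # 4
z[["x_original"]]  # 2
```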
3.4.5 Example 3
> my_power <- function(x, e, say_hello) {
+ if(say_hello) {
+ cat("Hello World!\n")
+ }
+ x^e
+ }
>
> my_power(x=2, e=3, say_hello=TRUE)
Hello World!
[1] 8
>
> z <- my_power(x=2, e=3, say_hello=TRUE)
Hello World!
> z
[1] 8
3.4.6 Default Function Argument Values
Some functions have default values for their arguments:
> str(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
You can define a function with default values by the following:
f <- function(x, y=2) {
x + y
}
If the user types f(x=1) then it defaults to y=2, but if the user types f(x=1, y=3), then it executes with these assignments.
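The two calls can be checked directly:

```r
f <- function(x, y=2) {
  x + y
}

f(x=1)        # 3: y takes its default value of 2
f(x=1, y=3)   # 4: the default is overridden
```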
3.4.7 The Ellipsis Argument
You will encounter functions that include as a possible argument the ellipsis: ...
This basically holds arguments that can be passed to functions called within a function. Example:
> double_log <- function(x, ...) {
+ log((2*x), ...)
+ }
>
> double_log(x=1, base=2)
[1] 1
> double_log(x=1, base=10)
[1] 0.30103
3.4.8 Argument Matching
R tries to automatically deal with function calls when the arguments are not named explicitly, matching them by position instead. For example:
x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE) # versus
x <- matrix(1:6, 2, 3, TRUE)
I strongly recommend that you name arguments explicitly. For example, I can never remember which comes first in matrix(), nrow or ncol.
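A sketch showing that the named and positional calls build the same matrix, and that named arguments may be given in any order. The variable names are illustrative.

```r
by_name     <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
by_position <- matrix(1:6, 2, 3, TRUE)
reordered   <- matrix(byrow=TRUE, ncol=3, data=1:6, nrow=2)

identical(by_name, by_position)   # TRUE
identical(by_name, reordered)     # TRUE: named arguments can come in any order
```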
3.5 Environment
3.5.1 Loading .RData Files
An .RData file is a binary file containing R objects. These can be saved from your current R session and also loaded into your current session.
> # generally...
> # to load:
> load(file="path/to/file_name.RData")
> # to save (list the objects to store; x and y are placeholder names):
> save(x, y, file="path/to/file_name.RData")
> ## assumes file in working directory
> load(file="project_1_R_basics.RData")
> ## loads from our GitHub repository
> load(file=url(paste0("https://github.com/SML201/project1/raw/",
+ "master/project_1_R_basics.RData")))
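A self-contained round trip may make the save and load steps concrete. This is only a sketch: the objects a and b and the temporary file are made up for the demonstration.

```r
a <- 1:3
b <- "hello"

tmp <- tempfile(fileext=".RData")   # a throwaway file for this demonstration
save(a, b, file=tmp)                # objects to save must be listed

rm(a, b)
loaded <- load(file=tmp)            # load() returns the names it restored
loaded                              # "a" "b"
a                                   # 1 2 3
```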
3.5.2 Listing Objects
The objects in your current R session can be listed. An environment can also be specified in case you have objects stored in different environments.
> ls()
[1] "num_people_in_precept" "SML201_grade_distribution"
[3] "some_ORFE_profs"
>
> ls(name=globalenv())
[1] "num_people_in_precept" "SML201_grade_distribution"
[3] "some_ORFE_profs"
>
> ## see help file for other options
> ?ls
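To illustrate listing a different environment, the sketch below creates one by hand. The environment e and the object inside_e are hypothetical names used only here.

```r
e <- new.env()                    # a separate environment
assign("inside_e", 42, envir=e)   # store an object in it

ls(name=e)                        # "inside_e"
get("inside_e", envir=e)          # 42
```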
3.5.3 Removing Objects
You can remove specific objects or all objects from your R environment of choice.
> rm("some_ORFE_profs") # removes variable some_ORFE_profs
>
> rm(list=ls()) # Removes all variables from environment
3.5.4 Advanced
The R environment is there to connect object names to object values.
The R Programming for Data Science chapter titled “Scoping Rules of R” discusses environments and object names in more detail than we need for this course.
A useful discussion about environments can also be found on the Advanced R web site.
3.6 Packages
3.6.1 Rationale
“In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. As of January 2015, there were over 6,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This huge variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.”
From http://r-pkgs.had.co.nz/intro.html by Hadley Wickham
3.6.2 Contents of a Package
- R functions
- R data objects
- Help documents for using the package
- Information on the authors, dependencies, etc.
- Information to make sure it “plays well” with R and other packages
3.6.3 Installing Packages
From CRAN:
install.packages("dplyr")
From GitHub (for advanced users):
library("devtools")
install_github("hadley/dplyr")
From Bioconductor (basically CRAN for biology):
library("BiocManager")
BiocManager::install("qvalue")
Be very careful about dependencies when installing from GitHub.
Multiple packages:
install.packages(c("dplyr", "ggplot2"))
Install all dependencies:
install.packages(c("dplyr", "ggplot2"), dependencies=TRUE)
Updating packages:
update.packages()
3.6.4 Loading Packages
Two ways to load a package:
library("dplyr")
library(dplyr)
I prefer the former.
3.6.5 Getting Started with a Package
When you install a new package and load it, what’s next? I like to look at the help files and see what functions and data sets a package has.
library("dplyr")
help(package="dplyr")
3.6.6 Specifying a Function within a Package
You can call a function from a specific package. Suppose you are in a setting where you have two packages loaded that have functions with the same name.
dplyr::arrange(mtcars, cyl, disp)
This calls the arrange function specifically from dplyr. The package plyr also has an arrange function.
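The same idea can be tried without installing anything. In this sketch a deliberately wrong, hypothetical median function masks the one from the stats package, and the :: operator still reaches the original.

```r
# a made-up function that masks stats::median in this session
median <- function(x) "not the real median"

masked <- median(c(1, 5, 9))          # "not the real median"
real   <- stats::median(c(1, 5, 9))   # 5

rm(median)                            # remove the mask again
```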
3.6.7 More on Packages
We will be covering several highly used R packages in depth this semester, so we will continue to learn about packages, how they are organized, and how they are used.
You can download the “source” of a package from R and take a look at the contents if you want to dig deeper. There are also many good tutorials on creating packages, such as http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/.
3.7 Organizing Your Code
3.7.1 Suggestions
RStudio conveniently tries to automatically format your R code. We suggest the following in general.
1. No more than 80 characters per line (or fewer depending on how R Markdown compiles):
really_long_line <- my_function(x=20, y=30, z=TRUE,
                                a="Joe", b=3.8)
2. Indent 2 or more characters for nested commands:
for(i in 1:10) {
  if(i > 4) {
    print(i)
  }
}
3. Generously comment your code.
## a for-loop that prints the index
## whenever it is greater than 4
for(i in 1:10) {
  if(i > 4) {
    print(i)
  }
}
## a good way to get partial credit
## if something goes wrong :-)
4. Do not hesitate to write functions to organize tasks. These help to break up your code into more understandable pieces, and functions can often be used several times.
3.7.2 Where to Put Files
See the Elements of Data Analytic Style chapter titled “Reproducibility” for suggestions on how to organize your files.
In this course, we will keep this relatively simple. We will try to provide you with some organization when distributing the projects.