3 R Programming

3.1 Control Structures

3.1.1 Rationale

  • Control structures in R allow you to control the flow of execution of a series of R expressions

  • They allow you to put some logic into your R code, rather than just always executing the same R code every time

  • Control structures also allow you to respond to inputs or to features of the data and execute different R expressions accordingly

Paraphrased from R Programming for Data Science, by Peng

3.1.2 Common Control Structures

  • if and else: testing a condition and acting on it
  • for: execute a loop a fixed number of times
  • while: execute a loop while a condition is true
  • repeat: execute an infinite loop (must break out of it to stop)
  • break: break the execution of a loop
  • next: skip an iteration of a loop

From R Programming for Data Science, by Peng

3.1.3 Some Boolean Logic

R has built-in functions that produce TRUE or FALSE such as is.vector or is.na. You can also do the following:

  • x == y : does x equal y?
  • x > y : is x greater than y? (also < less than)
  • x >= y : is x greater than or equal to y?
  • x && y : are both x and y true?
  • x || y : is either x or y true?
  • !is.vector(x) : this is TRUE if x is not a vector
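
A brief sketch of these comparisons on single values. The variables x and y below are made up for illustration.

```r
x <- 3
y <- 5

x == y                # FALSE
x < y                 # TRUE
x >= y                # FALSE

# && and || combine single TRUE/FALSE values
(x > 0) && (y > 0)    # TRUE
(x > 4) || (y > 4)    # TRUE

# is.na() tests for a missing value; ! negates the result
!is.na(x)             # TRUE
```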

3.1.4 if

Idea:

if(<condition>) {
        ## do something
} 
### Continue with rest of code

Example:

> x <- c(1,2,3)
> if(is.numeric(x)) {
+   x+2
+ }
[1] 3 4 5

3.1.5 if-else

Idea:

if(<condition>) {
        ## do something
} else {
        ## do something else
}

Example:

> x <- c("a", "b", "c")
> if(is.numeric(x)) {
+   print(x+2)
+ } else {
+   class(x)
+ }
[1] "character"

3.1.6 for Loops

Example:

> for(i in 1:10) {
+   print(i)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

Examples:

> x <- c("a", "b", "c", "d")
> 
> for(i in 1:4) {
+   print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
> 
> for(i in seq_along(x)) {
+   print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
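
One reason to prefer seq_along(x) over a hard-coded range such as 1:4 or 1:length(x) is that it behaves sensibly when x is empty. A minimal sketch, with variable names chosen only for this illustration:

```r
x <- character(0)          # an empty vector

# 1:length(x) counts down from 1 to 0, so a loop would run twice
bad_idx <- 1:length(x)     # c(1, 0)

# seq_along(x) is empty, so the loop body never runs
good_idx <- seq_along(x)   # integer(0)

n_iter <- 0
for(i in good_idx) {
  n_iter <- n_iter + 1
}
n_iter                     # 0
```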

3.1.7 Nested for Loops

Example:

> m <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> 
> for(i in seq_len(nrow(m))) {
+   for(j in seq_len(ncol(m))) {
+     print(m[i,j])
+   }
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6

3.1.8 while

Example:

> x <- 1:10
> idx <- 1
> 
> while(x[idx] < 4) {
+   print(x[idx])
+   idx <- idx + 1
+ }
[1] 1
[1] 2
[1] 3
> 
> idx
[1] 4

Repeats the loop while the condition is TRUE.
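
If the condition could run past the end of the vector, it may help to check the index first. This is a sketch under the assumption that x is shorter than in the example above, so the loop would otherwise reach x[4], which is NA:

```r
x <- 1:3
idx <- 1

# Checking idx first avoids comparing against x[4], which is NA;
# && stops evaluating as soon as the left side is FALSE
while(idx <= length(x) && x[idx] < 4) {
  print(x[idx])
  idx <- idx + 1
}
idx   # 4
```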

3.1.9 repeat

Example:

> x <- 1:10
> idx <- 1
> 
> repeat {
+   print(x[idx])
+   idx <- idx + 1
+   if(idx >= 4) {
+     break
+   }
+ }
[1] 1
[1] 2
[1] 3
> 
> idx
[1] 4

Repeats the loop until break is executed.

3.1.10 break and next

break ends the loop. next skips the rest of the current loop iteration.

Example:

> x <- 1:1000
> for(idx in 1:1000) {
+   # %% calculates division remainder
+   if((x[idx] %% 2) > 0) { 
+     next
+   } else if(x[idx] > 10) { # an else-if!!
+     break
+   } else {
+     print(x[idx])
+   }
+ }
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10

3.2 Vectorized Operations

3.2.1 Calculations on Vectors

R applies arithmetic to vectors element by element, so you rarely need an explicit loop. Examples:

> 
> x <- 1:3
> y <- 4:6
> 
> 2*x     # same as c(2*x[1], 2*x[2], 2*x[3])
[1] 2 4 6
> x + 1   # same as c(x[1]+1, x[2]+1, x[3]+1)
[1] 2 3 4
> x + y   # same as c(x[1]+y[1], x[2]+y[2], x[3]+y[3])
[1] 5 7 9
> x*y     # same as c(x[1]*y[1], x[2]*y[2], x[3]*y[3])
[1]  4 10 18

3.2.2 A Caveat

If two vectors are of different lengths, R recycles the shorter one to match the longer one (and warns you only when the longer length is not a multiple of the shorter).

> x <- 1:5
> y <- 1:2
> x+y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 2 4 4 6 6

What happened here?
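
One way to see what R did is to build the recycled vector by hand. The name y_recycled is introduced here only for illustration:

```r
x <- 1:5
y <- 1:2

# Repeat y until it is as long as x: 1 2 1 2 1
y_recycled <- rep(y, length.out=length(x))

x + y_recycled   # 2 4 4 6 6, the same values as x + y, but no warning
```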

3.2.3 Vectorized Matrix Operations

Operations on matrices are also vectorized. Example:

> x <- matrix(1:4, nrow=2, ncol=2, byrow=TRUE)
> y <- matrix(1:4, nrow=2, ncol=2)
> 
> x+y
     [,1] [,2]
[1,]    2    5
[2,]    5    8
> 
> x*y
     [,1] [,2]
[1,]    1    6
[2,]    6   16

3.2.4 Mixing Vectors and Matrices

What happens when we do calculations involving a vector and a matrix? Example:

> x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> z <- 1:2
> 
> x + z
     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    6    7    8
> 
> x * z
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    8   10   12

3.2.5 Mixing Vectors and Matrices

Another example:

> x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> z <- 1:3
> 
> x + z
     [,1] [,2] [,3]
[1,]    2    5    6
[2,]    6    6    9
> 
> x * z
     [,1] [,2] [,3]
[1,]    1    6    6
[2,]    8    5   18

What happened this time?
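
A hint, sketched in code: R stores a matrix column by column, and the vector is recycled along that ordering. The names z_recycled and by_hand are made up for this sketch:

```r
x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
z <- 1:3

as.vector(x)                                 # 1 4 2 5 3 6 (column by column)
z_recycled <- rep(z, length.out=length(x))   # 1 2 3 1 2 3

# Rebuild the sum by hand, filling column by column
by_hand <- matrix(as.vector(x) + z_recycled, nrow=2, ncol=3)
identical(by_hand, x + z)                    # TRUE
```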

3.2.6 Vectorized Boolean Logic

We saw && and || applied to pairs of single logical values. Their vectorized counterparts, & and |, work element by element.

> a <- c(TRUE, TRUE, FALSE)
> b <- c(FALSE, TRUE, FALSE)
> 
> a | b
[1]  TRUE  TRUE FALSE
> a & b
[1] FALSE  TRUE FALSE
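
Because & and | return a whole vector, the result usually needs to be collapsed to a single value before it is used in if(). A small sketch using any() and all():

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(FALSE, TRUE, FALSE)

any(a & b)    # TRUE: at least one position is TRUE in both
all(a | b)    # FALSE: the third position is FALSE in both

# Collapse to a single value before using the result in if()
if(any(a & b)) {
  print("a and b overlap")
}
```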

3.3 Subsetting R Objects

3.3.1 Subsetting Vectors

> x <- 1:8
> 
> x[1]           # extract the first element
[1] 1
> x[2]           # extract the second element
[1] 2
> 
> x[1:4]         # extract the first 4 elements
[1] 1 2 3 4
> 
> x[c(1, 3, 4)]  # extract elements 1, 3, and 4
[1] 1 3 4
> x[-c(1, 3, 4)] # extract all elements EXCEPT 1, 3, and 4
[1] 2 5 6 7 8

3.3.2 Subsetting Vectors

> names(x) <- letters[1:8]
> x
a b c d e f g h 
1 2 3 4 5 6 7 8 
> 
> x[c("a", "b", "f")]
a b f 
1 2 6 
> 
> s <- x > 3
> s
    a     b     c     d     e     f     g     h 
FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE 
> x[s]
d e f g h 
4 5 6 7 8 

3.3.3 Subsetting Matrices

> x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
> x
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> 
> x[1,2]
[1] 2
> x[1, ]
[1] 1 2 3
> x[ ,2]
[1] 2 5

3.3.4 Subsetting Matrices

> colnames(x) <- c("A", "B", "C")
> 
> x[ , c("B", "C")]
     B C
[1,] 2 3
[2,] 5 6
> 
> x[c(FALSE, TRUE), c("B", "C")]
B C 
5 6 
> 
> x[2, c("B", "C")]
B C 
5 6 

3.3.5 Subsetting Matrices

> s <- (x %% 2) == 0
> s
         A     B     C
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE  TRUE
> 
> x[s]
[1] 4 2 6
> 
> x[c(2, 3, 6)]
[1] 4 2 6

3.3.6 Subsetting Lists

> x <- list(my=1:3, favorite=c("a", "b", "c"), 
+           course=c(FALSE, TRUE, NA))
> 
> x[[1]]
[1] 1 2 3
> x[["my"]]
[1] 1 2 3
> x$my
[1] 1 2 3
> x[[c(3,1)]]
[1] FALSE
> x[[3]][1]
[1] FALSE
> x[c(3,1)]
$course
[1] FALSE  TRUE    NA

$my
[1] 1 2 3
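
The difference between single and double brackets is easy to miss in the output above. A short sketch on a smaller list:

```r
x <- list(my=1:3, favorite=c("a", "b", "c"))

class(x["my"])     # "list": single brackets keep the list wrapper
class(x[["my"]])   # "integer": double brackets extract the element

length(x["my"])    # 1 (a list holding one element)
length(x[["my"]])  # 3 (the vector itself)
```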

3.3.7 Subsetting Data Frames

> x <- data.frame(my=1:3, favorite=c("a", "b", "c"), 
+           course=c(FALSE, TRUE, NA))
> 
> x[[1]]
[1] 1 2 3
> x[["my"]]
[1] 1 2 3
> x$my
[1] 1 2 3
> x[[c(3,1)]]
[1] FALSE
> x[[3]][1]
[1] FALSE
> x[c(3,1)]
  course my
1  FALSE  1
2   TRUE  2
3     NA  3

3.3.8 Subsetting Data Frames

> x <- data.frame(my=1:3, favorite=c("a", "b", "c"), 
+           course=c(FALSE, TRUE, NA))
> 
> x[1, ]
  my favorite course
1  1        a  FALSE
> x[ ,3]
[1] FALSE  TRUE    NA
> x[ ,"favorite"]
[1] a b c
Levels: a b c
> x[1:2, ]
  my favorite course
1  1        a  FALSE
2  2        b   TRUE
> x[ ,2:3]
  favorite course
1        a  FALSE
2        b   TRUE
3        c     NA

3.3.9 Note on Data Frames

In versions of R before 4.0.0, data.frame() converts character strings to factors unless you specify otherwise. (Since R 4.0.0, the default is stringsAsFactors=FALSE.)

In the previous slide, we saw that it converted the “favorite” column to a factor. Let’s fix that…

> x <- data.frame(my=1:3, favorite=c("a", "b", "c"), 
+                 course=c(FALSE, TRUE, NA), 
+                 stringsAsFactors=FALSE)
> 
> x[ ,"favorite"]
[1] "a" "b" "c"
> class(x[ ,"favorite"])
[1] "character"

3.3.10 Missing Values

> data("airquality", package="datasets")
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> dim(airquality)
[1] 153   6
> which(is.na(airquality$Ozone))
 [1]   5  10  25  26  27  32  33  34  35  36  37  39  42  43  45  46  52
[18]  53  54  55  56  57  58  59  60  61  65  72  75  83  84 102 103 107
[35] 115 119 150
> sum(is.na(airquality$Ozone))
[1] 37
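
Once the missing values are located, they can be skipped or dropped. This is one possible sketch, not the only approach; the names ozone and observed are introduced here for illustration:

```r
data("airquality", package="datasets")

ozone <- airquality$Ozone

mean(ozone)               # NA, because some values are missing
mean(ozone, na.rm=TRUE)   # mean of the observed values only

# Keep only the rows where Ozone was observed
observed <- airquality[!is.na(airquality$Ozone), ]
nrow(observed)            # 153 - 37 = 116
```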

3.3.11 Subsetting by Matching

> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
[18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
> vowels <- c("a", "e", "i", "o", "u")
> 
> letters %in% vowels
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
[12] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[23] FALSE FALSE FALSE FALSE
> which(letters %in% vowels)
[1]  1  5  9 15 21
> 
> letters[which(letters %in% vowels)]
[1] "a" "e" "i" "o" "u"
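
The logical vector from %in% can also be used for subsetting directly, without which(). A small sketch, which also shows match() as a related tool:

```r
vowels <- c("a", "e", "i", "o", "u")

direct <- letters[letters %in% vowels]
via_which <- letters[which(letters %in% vowels)]
identical(direct, via_which)   # TRUE

# match() returns the position of each vowel in letters
match(vowels, letters)         # 1 5 9 15 21
```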

3.3.12 Advanced Subsetting

The R Programming for Data Science chapter titled “Subsetting R Objects” contains additional material on subsetting that you should know.

The Advanced R website contains more detailed information on subsetting that you may find useful.

3.4 Functions

3.4.1 Rationale

  • Writing functions is a core activity of an R programmer. It represents the key step of the transition from a mere user to a developer who creates new functionality for R.

  • Functions are often used to encapsulate a sequence of expressions that need to be executed numerous times, perhaps under slightly different conditions.

  • Functions are also often written when code must be shared with others or the public.

From R Programming for Data Science, by Peng

3.4.2 Defining a New Function

  • Functions are defined using the function() directive

  • They are stored as variables, so they can be passed to other functions and assigned to new variables

  • Arguments and a final return object are defined
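
To illustrate the second point, here is a sketch of a function being passed to other functions. The helper apply_twice is hypothetical and exists only for this example:

```r
my_square <- function(x) {
  x*x
}

# Pass the function itself (no parentheses) to sapply()
sapply(1:4, my_square)      # 1 4 9 16

# A function that takes another function as an argument
apply_twice <- function(f, x) {
  f(f(x))
}
apply_twice(my_square, 3)   # 81
```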

3.4.3 Example 1

> my_square <- function(x) {
+   x*x  # can also do return(x*x)
+ }
> 
> my_square(x=2)
[1] 4
> 
> my_fun2 <- my_square
> my_fun2(x=3)
[1] 9

3.4.4 Example 2

> my_square_ext <- function(x) {
+   y <- x*x
+   return(list(x_original=x, x_squared=y))
+ }
> 
> my_square_ext(x=2)
$x_original
[1] 2

$x_squared
[1] 4
> 
> z <- my_square_ext(x=2)

3.4.5 Example 3

> my_power <- function(x, e, say_hello) {
+   if(say_hello) {
+     cat("Hello World!\n")
+   }
+   x^e
+ }
> 
> my_power(x=2, e=3, say_hello=TRUE)
Hello World!
[1] 8
> 
> z <- my_power(x=2, e=3, say_hello=TRUE)
Hello World!
> z
[1] 8

3.4.6 Default Function Argument Values

Some functions have default values for their arguments:

> str(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)  

You can define a function with default values by the following:

f <- function(x, y=2) {
  x + y
}

If the user types f(x=1) then it defaults to y=2, but if the user types f(x=1, y=3), then it executes with these assignments.
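
The two calls can be checked directly:

```r
f <- function(x, y=2) {
  x + y
}

f(x=1)        # 3, y falls back to its default of 2
f(x=1, y=3)   # 4, the default is overridden
```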

3.4.7 The Ellipsis Argument

You will encounter functions that include as a possible argument the ellipsis: ...

This basically holds arguments that can be passed to functions called within a function. Example:

> double_log <- function(x, ...) {
+   log((2*x), ...)
+ }
> 
> double_log(x=1, base=2)
[1] 1
> double_log(x=1, base=10)
[1] 0.30103

3.4.8 Argument Matching

R tries to automatically match arguments in function calls when they are not named explicitly. For example:

x <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)  # versus
x <- matrix(1:6, 2, 3, TRUE)

I strongly recommend that you name arguments explicitly. For example, I can never remember which comes first in matrix(), nrow or ncol.
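
A short sketch of why naming helps: named arguments can appear in any order, while swapping unnamed ones silently changes the result. The variable names are illustrative only:

```r
a <- matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
b <- matrix(1:6, 2, 3, TRUE)
identical(a, b)          # TRUE

# Named arguments can be given in any order
d <- matrix(byrow=TRUE, ncol=3, data=1:6, nrow=2)
identical(a, d)          # TRUE

# Swapping unnamed arguments silently changes the result
dim(matrix(1:6, 3, 2))   # 3 2, not 2 3
```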

3.5 Environment

3.5.1 Loading .RData Files

An .RData file is a binary file containing R objects. These can be saved from your current R session and also loaded into your current session.

> # generally...
> # to load:
> load(file="path/to/file_name.RData")
> # to save everything in the current session:
> save(list=ls(), file="path/to/file_name.RData")
> ## assumes file in working directory
> load(file="project_1_R_basics.RData") 
> ## loads from our GitHub repository
> load(file=url("https://github.com/SML201/project1/raw/master/project_1_R_basics.RData"))
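
A self-contained round trip may make the save/load pairing clearer. This is a sketch only: the objects scores and label are hypothetical, and tempfile() is used so that no real project file is touched.

```r
# Hypothetical objects, created only for this sketch
scores <- c(90, 85, 77)
label <- "demo"

# tempfile() gives a throwaway path so nothing in your project is touched
rdata_path <- tempfile(fileext=".RData")
save(scores, label, file=rdata_path)

# Load into a fresh environment to see exactly what the file contains
loaded <- new.env()
load(file=rdata_path, envir=loaded)
ls(loaded)   # "label" "scores"
```

Loading into a separate environment is optional. A plain load(file=rdata_path) restores the objects into your current session, overwriting any objects with the same names.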

3.5.2 Listing Objects

The objects in your current R session can be listed. An environment can also be specified in case you have objects stored in different environments.

> ls()
[1] "num_people_in_precept"     "SML201_grade_distribution"
[3] "some_ORFE_profs"          
> 
> ls(name=globalenv())
[1] "num_people_in_precept"     "SML201_grade_distribution"
[3] "some_ORFE_profs"          
> 
> ## see help file for other options
> ?ls

3.5.3 Removing Objects

You can remove specific objects or all objects from your R environment of choice.

> rm("some_ORFE_profs") # removes variable some_ORFE_profs
> 
> rm(list=ls()) # Removes all variables from environment
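
To try these commands without clearing your own session, you can work in a separate environment. A minimal sketch, where scratch is a made-up environment name:

```r
# A scratch environment, so the global environment is left alone
scratch <- new.env()
assign("a", 1, envir=scratch)
assign("b", 2, envir=scratch)
ls(scratch)                            # "a" "b"

rm("a", envir=scratch)                 # remove one object
ls(scratch)                            # "b"

rm(list=ls(scratch), envir=scratch)    # remove everything that is left
length(ls(scratch))                    # 0
```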

3.5.4 Advanced

The R environment is there to connect object names to object values.

The R Programming for Data Science chapter titled “Scoping Rules of R” discusses environments and object names in more detail than we need for this course.

A useful discussion about environments can also be found on the Advanced R web site.

3.6 Packages

3.6.1 Rationale

“In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. As of January 2015, there were over 6,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This huge variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.”

From http://r-pkgs.had.co.nz/intro.html by Hadley Wickham

3.6.2 Contents of a Package

  • R functions
  • R data objects
  • Help documents for using the package
  • Information on the authors, dependencies, etc.
  • Information to make sure it “plays well” with R and other packages

3.6.3 Installing Packages

From CRAN:

install.packages("dplyr")

From GitHub (for advanced users):

library("devtools")
install_github("hadley/dplyr")

From Bioconductor (basically CRAN for biology):

library("BiocManager")
BiocManager::install("qvalue")

Be very careful about dependencies when installing from GitHub.

Multiple packages:

install.packages(c("dplyr", "ggplot2"))

Install all dependencies:

install.packages(c("dplyr", "ggplot2"), dependencies=TRUE)

Updating packages:

update.packages()

3.6.4 Loading Packages

Two ways to load a package:

library("dplyr")
library(dplyr)

I prefer the former.

3.6.5 Getting Started with a Package

When you install a new package and load it, what’s next? I like to look at the help files and see what functions and data sets a package has.

library("dplyr")
help(package="dplyr")

3.6.6 Specifying a Function within a Package

You can call a function from a specific package. Suppose you are in a setting where you have two packages loaded that have functions with the same name.

dplyr::arrange(mtcars, cyl, disp)

This calls the arrange function specifically from dplyr. The package plyr also has an arrange function.
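
The same idea can be tried without installing anything, using the stats package that ships with R. In this sketch, the stand-in median function is deliberately wrong and exists only to show the name clash:

```r
# A deliberately bad stand-in that masks the name "median" (illustration only)
median <- function(x) {
  "not the real median"
}

masked <- median(c(1, 2, 10))        # calls the function defined above
real <- stats::median(c(1, 2, 10))   # 2, always the version from stats

rm(median)                           # remove the stand-in again
```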

3.6.7 More on Packages

We will be covering several highly used R packages in depth this semester, so we will continue to learn about packages, how they are organized, and how they are used.

You can download the “source” of a package from R and take a look at the contents if you want to dig deeper. There are also many good tutorials on creating packages, such as http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/.

3.7 Organizing Your Code

3.7.1 Suggestions

RStudio conveniently tries to automatically format your R code. We suggest the following in general.

1. No more than 80 characters per line (or fewer depending on how R Markdown compiles):

really_long_line <- my_function(x=20, y=30, z=TRUE,
                                a="Joe", b=3.8)

2. Indent 2 or more characters for nested commands:

for(i in 1:10) {
  if(i > 4) {
    print(i)
  }
}

3. Generously comment your code.

## a for-loop that prints the index 
## whenever it is greater than 4
for(i in 1:10) {
  if(i > 4) {
    print(i)
  }
}
## a good way to get partial credit
## if something goes wrong :-)

4. Do not hesitate to write functions to organize tasks. These help to break up your code into more understandable pieces, and functions can often be used several times.
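
As a sketch of the last suggestion, the for-loop from the earlier items could be wrapped in a function so it can be reused with different inputs. The name print_above and its cutoff argument are made up for this example:

```r
## print every element of x that is greater than a cutoff
## (print_above is a made-up name for this sketch)
print_above <- function(x, cutoff=4) {
  for(value in x) {
    if(value > cutoff) {
      print(value)
    }
  }
}

print_above(x=1:10)            # prints 5 through 10
print_above(x=1:10, cutoff=8)  # prints 9 and 10
```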

3.7.2 Where to Put Files

See the Elements of Data Analytic Style chapter titled “Reproducibility” for suggestions on how to organize your files.

In this course, we will keep this relatively simple. We will try to provide you with some organization when distributing the projects.