Introduction to R

Home Page - http://dmcglinn.github.io/quant_methods/ GitHub Repo - https://github.com/dmcglinn/quant_methods

Source Code Link

https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/lessons/R_introduction.R

Readings

Chapters 1-7 of The R Book (1st ed) by Crawley
Chapters 1-4 of MASS (4th ed)by Venables and Ripley

The purpose of this lesson is to introduce students to the R programming environment for the first time. The lesson builds off the Software Carpentry Lesson developed here: http://swcarpentry.github.io/r-novice-inflammation/

#Arithmetic

3 + 4       # summation

## [1] 7

3 * 4       # multiplication

## [1] 12

3 / 4       # division

## [1] 0.75

3^4         # exponents

## [1] 81

log(3)      # log base e

## [1] 1.098612

log(3, 10)  # log base 10

## [1] 0.4771213

log10(3)    # log base 10

## [1] 0.4771213

exp(log(3)) # e

## [1] 3

#Logical operations

3 > 4        # greater than

## [1] FALSE

3 < 4        # less than

## [1] TRUE

3 >= 4       # greater than or equal to

## [1] FALSE

3 <= 4       # less than or equal to

## [1] TRUE

3 != 4       # not equal to

## [1] TRUE

3 == 4       # equal to

## [1] FALSE

TRUE         # True

## [1] TRUE

T            # True

## [1] TRUE

TRUE == 1    # True is set to one in R

## [1] TRUE

FALSE        # False

## [1] FALSE

F            # False

## [1] FALSE

FALSE == 0   # False is set to zero in R

## [1] TRUE

T + T + F    # what would this equal?

## [1] 2

# useful functions
# any() and all()
any(c(T, F))

## [1] TRUE

all(c(T, F))

## [1] FALSE

#Variable assignment

you can use <- or = to assign a value to a variable but <- is recommended

weight_kg <- 55

print the value of the variable by simply calling its name

weight_kg

## [1] 55

weight in pounds:

2.2 * weight_kg

## [1] 121

weight_kg <- 57.5

weight in kilograms is now

weight_kg

## [1] 57.5

weight_lb <- 2.2 * weight_kg

weight in kg…

weight_kg

## [1] 57.5

…and in pounds

weight_lb

## [1] 126.5

weight_kg <- 100.0

weight in kg now…

weight_kg

## [1] 100

…and in weight pounds still

weight_lb

## [1] 126.5

Coming up with good object and file names can be difficult, but there are two general rules that can help guide you:

be descriptive
don’t make names you must type a lot too long

So for something like a file name which you’ll only type probably once at read and write you should use a long descriptive name, but for objects in your R code you need to consider typeability and readability when designing the name. A long name like root_rhiz_prod_total_mm is very clear but is a pain to read and worse to type. R has a built-in name completion system but this doesn’t completely remove the burden on you for using long object names.

#Reading in data

first check what your working directory is:

getwd()

## [1] "/home/mcglinndj/quant_methods"

I am using an Rstudio Project that I called “quant_methods”. Projects help you to organize your R code for a specific project into a single directory. To create your own project simply go to File -> New Project then click either “New Directory” or “Existing Directory”. Be default the directory and the project name will be identical - it is not recommended to diverge from that behavior as it can make it very confusing.

The working directory within a project is the main project directory so for me it returns: /home/mcglinndj/quant_methods

All file paths can be made relative to this directory. let’s read in the datafile inflammation-01.csv which is located in the directory: ./quant_methods/data) where the . indicates the directory location in which the directory quant_methods is stored in. The usage of the ./ is a shorthand way to create relative paths. Because my working directory is already set to: /home/mcglinndj/quant_methods I can shorten the path to ./data/inflammation-01.csv where again . indicates my current working directory path.

dat <- read.csv(file = "./data/inflammation-01.csv", header = FALSE)

two other quick notes about path shorthand:

../ is shorthand for the parent directory of the working directory, these can be nested like ../../ but not recommended.
~ is shorthand for the home directory on your machine. On my machine ~ refers to /home/mcglinndj

Rather than a relative path, I could have used an absolute path like: "/home/mcglinndj/quant_methods/data/inflammation-01.csv" but that path would only work on my specific machine. For that reason we generally prefer relative paths over absolute paths for making your code more reproducible and future proof.

Another option is to simply put in the url where the data is stored:

dat <- read.csv('https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/data/inflammation-01.csv', 
                header = FALSE)

this is not always a great option though because remote data and urls can break

Last note about the usage of the function read.csv. Notice the argument, header is set to FALSE. This is because this inflammation dataset is a bit strange because it does not include column names. Most datasets will include column names so header = TRUE is the more common setting of this argument, in fact that is the default setting for read.csv. So in most cases (such as on the homework) you will simply use read.csv('mycsvfile.csv') without specifying the header argument explicitly.

#Using the help

above we used the function “read.csv” to find out more about this function see

?read.csv

or equivalently

help(read.csv)

to do a fuzzy help search use

help.search('read') 
help.search('csv')

#Examine data

visual summary of first 6 rows

head(dat)

##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1  0  0  1  3  1  2  4  7  8   3   3   3  10   5   7   4   7   7  12  18   6
## 2  0  1  2  1  2  1  3  2  2   6  10  11   5   9   4   4   7  16   8   6  18
## 3  0  1  1  3  3  2  6  2  5   9   5   7   4   5   4  15   5  11   9  10  19
## 4  0  0  2  0  4  2  2  1  6   7  10   7   9  13   8   8  15  10  10   7  17
## 5  0  1  1  3  3  1  3  5  2   4   4   7   6   5   3  10   8  10   6  17   9
## 6  0  0  1  2  2  4  2  1  6   4   7   6   6   9   9  15   4  16  18  12  12
##   V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
## 1  13  11  11   7   7   4   6   8   8   4   4   5   7   3   4   2   3   0   0
## 2   4  12   5  12   7  11   5  11   3   3   5   4   4   5   5   1   1   0   1
## 3  14  12  17   7  12  11   7   4   2  10   5   4   2   2   3   2   2   1   1
## 4   4   4   7   6  15   6   4   9  11   3   5   6   3   3   4   2   3   2   1
## 5  14   9   7  13   9  12   6   7   7   9   6   3   2   2   4   2   0   1   1
## 6   5  18   9   5   3  10   3  12   7   8   4   7   3   5   4   4   3   2   1

visual summary of last 6 rows

tail(dat)

##    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 55  0  1  2  1  1  4  5  4  4   5   9   7  10   3  13  13   8   9  17  16  16
## 56  0  0  1  3  2  3  6  4  5   7   2   4  11  11   3   8   8  16   5  13  16
## 57  0  1  1  2  2  5  1  7  4   2   5   5   4   6   6   4  16  11  14  16  14
## 58  0  1  1  1  4  1  6  4  6   3   6   5   6   4  14  13  13   9  12  19   9
## 59  0  0  0  1  4  5  6  3  8   7   9  10   8   6   5  12  15   5  10   5   8
## 60  0  0  1  0  3  2  5  4  8   2   9   3   3  10  12   9  14  11  13   8   6
##    V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
## 55  15  12  13   5  12  10   9  11   9   4   5   5   2   2   5   1   0   0   1
## 56   5   8   8   6   9  10  10   9   3   3   5   3   5   4   5   3   3   0   1
## 57  14   8  17   4  14  13   7   6   3   7   7   5   6   3   4   2   2   1   1
## 58  10  15  10   9  10  10   7   5   6   8   6   6   4   3   5   2   1   1   1
## 59  13  18  17  14   9  13   4  10  11  10   8   8   6   5   5   2   0   2   0
## 60  18  11   9  13  11   8   5   5   2   8   5   3   5   4   1   3   1   1   0

what kind of object is dat

class(dat)

## [1] "data.frame"

what are the dimensions of dat

dim(dat)

## [1] 60 40

You may notice that the data did not have column names and R auto assigned the columns the names V1, V2, V3, and so on. In this dataset, each column represent different times. We can assign column names using the function names

names(dat)

##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
## [13] "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23" "V24"
## [25] "V25" "V26" "V27" "V28" "V29" "V30" "V31" "V32" "V33" "V34" "V35" "V36"
## [37] "V37" "V38" "V39" "V40"

names(dat) <- paste("day", 1:ncol(dat), sep='')
names(dat)

##  [1] "day1"  "day2"  "day3"  "day4"  "day5"  "day6"  "day7"  "day8"  "day9" 
## [10] "day10" "day11" "day12" "day13" "day14" "day15" "day16" "day17" "day18"
## [19] "day19" "day20" "day21" "day22" "day23" "day24" "day25" "day26" "day27"
## [28] "day28" "day29" "day30" "day31" "day32" "day33" "day34" "day35" "day36"
## [37] "day37" "day38" "day39" "day40"

Above the function paste was used to construct text strings that combined the word “patient” with a given index in this case from 1 to the total number of columns in the object dat. By default the function paste inserts a space between strings that you wish to paste together, I’ve set the sep argument to '' to ensure that no space is inserted (see also ?paste0)

#Subsetting the data

There are a variety of ways to subset data in data.frames. This section

#" demonstrates how to subset data using indices.

first value in dat

dat[1, 1]

## [1] 0

middle value in dat

dat[30, 20]

## [1] 16

chunk of data in dat

dat[1:4, 1:10]

##   day1 day2 day3 day4 day5 day6 day7 day8 day9 day10
## 1    0    0    1    3    1    2    4    7    8     3
## 2    0    1    2    1    2    1    3    2    2     6
## 3    0    1    1    3    3    2    6    2    5     9
## 4    0    0    2    0    4    2    2    1    6     7

select specific rows and columns

dat[c(3, 8, 37), c(10, 14, 29)]

##    day10 day14 day29
## 3      9     5     4
## 8      3     5     6
## 37     6     9    10

The code above provides the values of dat at 3,10 ; 8,14 ; and 37,29 where the first number is the row index and the second number is the column index all columns from row 5

dat[5, ]

##   day1 day2 day3 day4 day5 day6 day7 day8 day9 day10 day11 day12 day13 day14
## 5    0    1    1    3    3    1    3    5    2     4     4     7     6     5
##   day15 day16 day17 day18 day19 day20 day21 day22 day23 day24 day25 day26 day27
## 5     3    10     8    10     6    17     9    14     9     7    13     9    12
##   day28 day29 day30 day31 day32 day33 day34 day35 day36 day37 day38 day39 day40
## 5     6     7     7     9     6     3     2     2     4     2     0     1     1

all rows from column 16

dat[ , 16]

##  [1]  4  4 15  8 10 15 13  9 11  6  3  8 12  3  5 10 11  4 11 13 15  5 14 13  4
## [26]  9 13  6  7  6 14  3 15  4 15 11  7 10 15  6  5  6 15 11 15  6 11 15 14  4
## [51] 10 15 11  6 13  8  4 13 12  9

dat[1:nrow(dat), 16]

##  [1]  4  4 15  8 10 15 13  9 11  6  3  8 12  3  5 10 11  4 11 13 15  5 14 13  4
## [26]  9 13  6  7  6 14  3 15  4 15 11  7 10 15  6  5  6 15 11 15  6 11 15 14  4
## [51] 10 15 11  6 13  8  4 13 12  9

you can also exclude certain indices using the - sign

dat[1:5 , -16]      # gives every column but 16

##   day1 day2 day3 day4 day5 day6 day7 day8 day9 day10 day11 day12 day13 day14
## 1    0    0    1    3    1    2    4    7    8     3     3     3    10     5
## 2    0    1    2    1    2    1    3    2    2     6    10    11     5     9
## 3    0    1    1    3    3    2    6    2    5     9     5     7     4     5
## 4    0    0    2    0    4    2    2    1    6     7    10     7     9    13
## 5    0    1    1    3    3    1    3    5    2     4     4     7     6     5
##   day15 day17 day18 day19 day20 day21 day22 day23 day24 day25 day26 day27 day28
## 1     7     7     7    12    18     6    13    11    11     7     7     4     6
## 2     4     7    16     8     6    18     4    12     5    12     7    11     5
## 3     4     5    11     9    10    19    14    12    17     7    12    11     7
## 4     8    15    10    10     7    17     4     4     7     6    15     6     4
## 5     3     8    10     6    17     9    14     9     7    13     9    12     6
##   day29 day30 day31 day32 day33 day34 day35 day36 day37 day38 day39 day40
## 1     8     8     4     4     5     7     3     4     2     3     0     0
## 2    11     3     3     5     4     4     5     5     1     1     0     1
## 3     4     2    10     5     4     2     2     3     2     2     1     1
## 4     9    11     3     5     6     3     3     4     2     3     2     1
## 5     7     7     9     6     3     2     2     4     2     0     1     1

dat[1:5 , -(1:10)]  # gives every column except the first 10

##   day11 day12 day13 day14 day15 day16 day17 day18 day19 day20 day21 day22 day23
## 1     3     3    10     5     7     4     7     7    12    18     6    13    11
## 2    10    11     5     9     4     4     7    16     8     6    18     4    12
## 3     5     7     4     5     4    15     5    11     9    10    19    14    12
## 4    10     7     9    13     8     8    15    10    10     7    17     4     4
## 5     4     7     6     5     3    10     8    10     6    17     9    14     9
##   day24 day25 day26 day27 day28 day29 day30 day31 day32 day33 day34 day35 day36
## 1    11     7     7     4     6     8     8     4     4     5     7     3     4
## 2     5    12     7    11     5    11     3     3     5     4     4     5     5
## 3    17     7    12    11     7     4     2    10     5     4     2     2     3
## 4     7     6    15     6     4     9    11     3     5     6     3     3     4
## 5     7    13     9    12     6     7     7     9     6     3     2     2     4
##   day37 day38 day39 day40
## 1     2     3     0     0
## 2     1     1     0     1
## 3     2     2     1     1
## 4     2     3     2     1
## 5     2     0     1     1

An alternative way to carry out subsetting is to reference specific column names or to use the subset function. Here to avoid printing too much information to the screen I’ll just focus on on the first 5 rows of each subset

dat$patient10[1:5]

## NULL

dat[1:5 , 'day10']

## [1] 3 6 9 7 4

dat[1:5 , c('day10', 'day15')]

##   day10 day15
## 1     3     7
## 2     6     4
## 3     9     4
## 4     7     8
## 5     4     3

notice that the following would give and error

dat[ , -c('patient3')]

## Error in -c("patient3"): invalid argument to unary operator

but that the following would accomplish the intended goal of dropping patient 3

dat[1:5 , -3]

##   day1 day2 day4 day5 day6 day7 day8 day9 day10 day11 day12 day13 day14 day15
## 1    0    0    3    1    2    4    7    8     3     3     3    10     5     7
## 2    0    1    1    2    1    3    2    2     6    10    11     5     9     4
## 3    0    1    3    3    2    6    2    5     9     5     7     4     5     4
## 4    0    0    0    4    2    2    1    6     7    10     7     9    13     8
## 5    0    1    3    3    1    3    5    2     4     4     7     6     5     3
##   day16 day17 day18 day19 day20 day21 day22 day23 day24 day25 day26 day27 day28
## 1     4     7     7    12    18     6    13    11    11     7     7     4     6
## 2     4     7    16     8     6    18     4    12     5    12     7    11     5
## 3    15     5    11     9    10    19    14    12    17     7    12    11     7
## 4     8    15    10    10     7    17     4     4     7     6    15     6     4
## 5    10     8    10     6    17     9    14     9     7    13     9    12     6
##   day29 day30 day31 day32 day33 day34 day35 day36 day37 day38 day39 day40
## 1     8     8     4     4     5     7     3     4     2     3     0     0
## 2    11     3     3     5     4     4     5     5     1     1     0     1
## 3     4     2    10     5     4     2     2     3     2     2     1     1
## 4     9    11     3     5     6     3     3     4     2     3     2     1
## 5     7     7     9     6     3     2     2     4     2     0     1     1

let’s try using the subset function only data for day 3

subset(dat, select = day3)[1:5, ]

## [1] 1 2 1 2 1

data on all days but 3

subset(dat, select = -day3)[1:5, ]

##   day1 day2 day4 day5 day6 day7 day8 day9 day10 day11 day12 day13 day14 day15
## 1    0    0    3    1    2    4    7    8     3     3     3    10     5     7
## 2    0    1    1    2    1    3    2    2     6    10    11     5     9     4
## 3    0    1    3    3    2    6    2    5     9     5     7     4     5     4
## 4    0    0    0    4    2    2    1    6     7    10     7     9    13     8
## 5    0    1    3    3    1    3    5    2     4     4     7     6     5     3
##   day16 day17 day18 day19 day20 day21 day22 day23 day24 day25 day26 day27 day28
## 1     4     7     7    12    18     6    13    11    11     7     7     4     6
## 2     4     7    16     8     6    18     4    12     5    12     7    11     5
## 3    15     5    11     9    10    19    14    12    17     7    12    11     7
## 4     8    15    10    10     7    17     4     4     7     6    15     6     4
## 5    10     8    10     6    17     9    14     9     7    13     9    12     6
##   day29 day30 day31 day32 day33 day34 day35 day36 day37 day38 day39 day40
## 1     8     8     4     4     5     7     3     4     2     3     0     0
## 2    11     3     3     5     4     4     5     5     1     1     0     1
## 3     4     2    10     5     4     2     2     3     2     2     1     1
## 4     9    11     3     5     6     3     3     4     2     3     2     1
## 5     7     7     9     6     3     2     2     4     2     0     1     1

data only on day 3 when inflammation in day 1 is equal to 0

subset(dat, subset = day1 == 0, select = day3)[1:5, ]

## [1] 1 2 1 2 1

#Summary statistics

first row, all of the columns

patient_1 <- dat[1, ]

max inflammation for patient 1

max(patient_1)

## [1] 18

max inflammation for patient 2

max(dat[2, ])

## [1] 18

minimum inflammation on day 7

min(dat[ , 7])

## [1] 1

mean inflammation on day 7

mean(dat[ , 7])

## [1] 3.8

median inflammation on day 7

median(dat[ , 7])

## [1] 4

standard deviation of inflammation on day 7

sd(dat[ , 7])

## [1] 1.725187

summary(dat[ , 7])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.0     4.0     3.8     5.0     6.0

#Aggregate across rows or columns

Thus, to obtain the average inflammation of each patient we will need to calculate the mean of all of the rows (MARGIN = 1) of the data frame.

avg_patient_inflammation <- apply(dat, 1, mean)

And to obtain the average inflammation of each day we will need to calculate the mean of all of the columns (MARGIN = 2) of the data frame.

avg_day_inflammation <- apply(dat, 2, mean)

standard deviation of day

sd_day_inflammation <- apply(dat, 2, sd)

standard deviation of patients

sd_patient_inflammation <- apply(dat, 1, sd)

#Plot data

use the function plot() to plot the data summaries

?plot

## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /usr/lib/R/library
##   base                  /usr/lib/R/library
## 
## 
## Using the first match ...

provides a long list of potential arguments and examples at a minimum you must provide a single quantitative variable, for example:

plot(avg_day_inflammation)

notice how R fills in lots of pieces of missing information automatically. specifically it assumes that the independent variable is simply an index from 1 to the length of the object in this case avg_day_inflammation. A safer more clear way to accomplish the same plot is to use the following:

plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='day', 
    ylab='inflammation')

this makes it clearer that the x-variable is simply an index from 1 to the length of avg_day_inflammation, and it makes the x and y axis labels more understandable.

Note that in most cases in R you have a specific columns in your data.frame you wish to plot against one another. If for example my data.frame was called dat and there were two columns growth and temperature. I could plot growth as a function of temperature using plot(dat$temperature, dat$growth) This will come up in the homework as well.

To output multi-panel plots use for example

par(mfrow=c(1,2))

which will create a single plotting row with two columns

plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='day', 
     ylab='inflammation')

plot(1:length(avg_patient_inflammation), avg_patient_inflammation,
     xlab='patient identity', ylab='inflammation')

to output the figure to file you can use Rstudio’s GUI features or you can use the command line which is what I recommend so that the code is fully reproducible:

# png('./inflammation_fig1.png') # to create a png file
par(mfrow = c(2,1))
plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='Day',
     ylab='Inflammation', frame.plot=F, col='magenta', pch=2, cex=2)
plot(1:length(avg_patient_inflammation), avg_patient_inflammation,
     xlab='patient identity', ylab='inflammation', col='dodgerblue')

# dev.off()                      # to stop writing to the png file.