Home Page - http://dmcglinn.github.io/quant_methods/ GitHub Repo - https://github.com/dmcglinn/quant_methods
https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/lessons/R_introduction.R
The purpose of this lesson is to introduce students to the R programming environment for the first time. The lesson builds off the Software Carpentry Lesson developed here: http://swcarpentry.github.io/r-novice-inflammation/
3 + 4 # summation
## [1] 7
3 * 4 # multiplication
## [1] 12
3 / 4 # division
## [1] 0.75
3^4 # exponents
## [1] 81
log(3) # log base e
## [1] 1.098612
log(3, 10) # log base 10
## [1] 0.4771213
log10(3) # log base 10
## [1] 0.4771213
exp(log(3)) # e
## [1] 3
3 > 4 # greater than
## [1] FALSE
3 < 4 # less than
## [1] TRUE
3 >= 4 # greater than or equal to
## [1] FALSE
3 <= 4 # less than or equal to
## [1] TRUE
3 != 4 # not equal to
## [1] TRUE
3 == 4 # equal to
## [1] FALSE
TRUE # True
## [1] TRUE
T # True
## [1] TRUE
TRUE == 1 # True is set to one in R
## [1] TRUE
FALSE # False
## [1] FALSE
F # False
## [1] FALSE
FALSE == 0 # False is set to zero in R
## [1] TRUE
T + T + F # what would this equal?
## [1] 2
# useful functions
# any() and all()
any(c(T, F))
## [1] TRUE
all(c(T, F))
## [1] FALSE
you can use <-
or =
to assign a value to
a variable but <-
is recommended
weight_kg <- 55
print the value of the variable by simply calling its name
weight_kg
## [1] 55
weight in pounds:
2.2 * weight_kg
## [1] 121
weight_kg <- 57.5
weight in kilograms is now
weight_kg
## [1] 57.5
weight_lb <- 2.2 * weight_kg
weight in kg…
weight_kg
## [1] 57.5
…and in pounds
weight_lb
## [1] 126.5
weight_kg <- 100.0
weight in kg now…
weight_kg
## [1] 100
…and in weight pounds still
weight_lb
## [1] 126.5
Coming up with good object and file names can be difficult, but there are two general rules that can help guide you:
So for something like a file name which you’ll only type probably once at read and write you should use a long descriptive name, but for objects in your R code you need to consider typeability and readability when designing the name. A long name like root_rhiz_prod_total_mm is very clear but is a pain to read and worse to type. R has a built-in name completion system but this doesn’t completely remove the burden on you for using long object names.
first check what your working directory is:
getwd()
## [1] "/home/mcglinndj/quant_methods"
I am using an Rstudio Project that I called “quant_methods”. Projects help you to organize your R code for a specific project into a single directory. To create your own project simply go to File -> New Project then click either “New Directory” or “Existing Directory”. Be default the directory and the project name will be identical - it is not recommended to diverge from that behavior as it can make it very confusing.
The working directory within a project is the main project directory
so for me it returns: /home/mcglinndj/quant_methods
All file paths can be made relative to this directory. let’s read in
the datafile inflammation-01.csv
which is located in the
directory: ./quant_methods/data)
where the .
indicates the directory location in which the directory
quant_methods
is stored in. The usage of the
./
is a shorthand way to create relative paths. Because my
working directory is already set to:
/home/mcglinndj/quant_methods
I can shorten the path to
./data/inflammation-01.csv
where again .
indicates my current working directory path.
dat <- read.csv(file = "./data/inflammation-01.csv", header = FALSE)
two other quick notes about path shorthand:
../
is shorthand for the parent directory of the
working directory, these can be nested like ../../
but not
recommended.
~
is shorthand for the home directory on your
machine. On my machine ~
refers to
/home/mcglinndj
Rather than a relative path, I could have used an absolute path like:
"/home/mcglinndj/quant_methods/data/inflammation-01.csv"
but that path would only work on my specific machine. For that reason we
generally prefer relative paths over absolute paths for making your code
more reproducible and future proof.
Another option is to simply put in the url where the data is stored:
dat <- read.csv('https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/data/inflammation-01.csv',
header = FALSE)
this is not always a great option though because remote data and urls can break
Last note about the usage of the function read.csv
.
Notice the argument, header
is set to FALSE
.
This is because this inflammation dataset is a bit strange because it
does not include column names. Most datasets will include column names
so header = TRUE
is the more common setting of this
argument, in fact that is the default setting for read.csv
.
So in most cases (such as on the homework) you will simply use
read.csv('mycsvfile.csv')
without specifying the
header
argument explicitly.
above we used the function “read.csv” to find out more about this function see
?read.csv
or equivalently
help(read.csv)
to do a fuzzy help search use
help.search('read')
help.search('csv')
visual summary of first 6 rows
head(dat)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1 0 0 1 3 1 2 4 7 8 3 3 3 10 5 7 4 7 7 12 18 6
## 2 0 1 2 1 2 1 3 2 2 6 10 11 5 9 4 4 7 16 8 6 18
## 3 0 1 1 3 3 2 6 2 5 9 5 7 4 5 4 15 5 11 9 10 19
## 4 0 0 2 0 4 2 2 1 6 7 10 7 9 13 8 8 15 10 10 7 17
## 5 0 1 1 3 3 1 3 5 2 4 4 7 6 5 3 10 8 10 6 17 9
## 6 0 0 1 2 2 4 2 1 6 4 7 6 6 9 9 15 4 16 18 12 12
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
## 1 13 11 11 7 7 4 6 8 8 4 4 5 7 3 4 2 3 0 0
## 2 4 12 5 12 7 11 5 11 3 3 5 4 4 5 5 1 1 0 1
## 3 14 12 17 7 12 11 7 4 2 10 5 4 2 2 3 2 2 1 1
## 4 4 4 7 6 15 6 4 9 11 3 5 6 3 3 4 2 3 2 1
## 5 14 9 7 13 9 12 6 7 7 9 6 3 2 2 4 2 0 1 1
## 6 5 18 9 5 3 10 3 12 7 8 4 7 3 5 4 4 3 2 1
visual summary of last 6 rows
tail(dat)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 55 0 1 2 1 1 4 5 4 4 5 9 7 10 3 13 13 8 9 17 16 16
## 56 0 0 1 3 2 3 6 4 5 7 2 4 11 11 3 8 8 16 5 13 16
## 57 0 1 1 2 2 5 1 7 4 2 5 5 4 6 6 4 16 11 14 16 14
## 58 0 1 1 1 4 1 6 4 6 3 6 5 6 4 14 13 13 9 12 19 9
## 59 0 0 0 1 4 5 6 3 8 7 9 10 8 6 5 12 15 5 10 5 8
## 60 0 0 1 0 3 2 5 4 8 2 9 3 3 10 12 9 14 11 13 8 6
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
## 55 15 12 13 5 12 10 9 11 9 4 5 5 2 2 5 1 0 0 1
## 56 5 8 8 6 9 10 10 9 3 3 5 3 5 4 5 3 3 0 1
## 57 14 8 17 4 14 13 7 6 3 7 7 5 6 3 4 2 2 1 1
## 58 10 15 10 9 10 10 7 5 6 8 6 6 4 3 5 2 1 1 1
## 59 13 18 17 14 9 13 4 10 11 10 8 8 6 5 5 2 0 2 0
## 60 18 11 9 13 11 8 5 5 2 8 5 3 5 4 1 3 1 1 0
what kind of object is dat
class(dat)
## [1] "data.frame"
what are the dimensions of dat
dim(dat)
## [1] 60 40
You may notice that the data did not have column names and R auto
assigned the columns the names V1, V2, V3, and so on. In this dataset,
each column represent different times. We can assign column names using
the function names
names(dat)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12"
## [13] "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23" "V24"
## [25] "V25" "V26" "V27" "V28" "V29" "V30" "V31" "V32" "V33" "V34" "V35" "V36"
## [37] "V37" "V38" "V39" "V40"
names(dat) <- paste("day", 1:ncol(dat), sep='')
names(dat)
## [1] "day1" "day2" "day3" "day4" "day5" "day6" "day7" "day8" "day9"
## [10] "day10" "day11" "day12" "day13" "day14" "day15" "day16" "day17" "day18"
## [19] "day19" "day20" "day21" "day22" "day23" "day24" "day25" "day26" "day27"
## [28] "day28" "day29" "day30" "day31" "day32" "day33" "day34" "day35" "day36"
## [37] "day37" "day38" "day39" "day40"
Above the function paste
was used to construct text
strings that combined the word “patient” with a given index in this case
from 1 to the total number of columns in the object dat
. By
default the function paste
inserts a space between strings
that you wish to paste together, I’ve set the sep
argument
to ''
to ensure that no space is inserted (see also
?paste0
)
There are a variety of ways to subset data in
data.frames
. This section
#" demonstrates how to subset data using indices.
first value in dat
dat[1, 1]
## [1] 0
middle value in dat
dat[30, 20]
## [1] 16
chunk of data in dat
dat[1:4, 1:10]
## day1 day2 day3 day4 day5 day6 day7 day8 day9 day10
## 1 0 0 1 3 1 2 4 7 8 3
## 2 0 1 2 1 2 1 3 2 2 6
## 3 0 1 1 3 3 2 6 2 5 9
## 4 0 0 2 0 4 2 2 1 6 7
select specific rows and columns
dat[c(3, 8, 37), c(10, 14, 29)]
## day10 day14 day29
## 3 9 5 4
## 8 3 5 6
## 37 6 9 10
The code above provides the values of dat
at 3,10 ; 8,14
; and 37,29 where the first number is the row index and the second
number is the column index all columns from row 5
dat[5, ]
## day1 day2 day3 day4 day5 day6 day7 day8 day9 day10 day11 day12 day13 day14
## 5 0 1 1 3 3 1 3 5 2 4 4 7 6 5
## day15 day16 day17 day18 day19 day20 day21 day22 day23 day24 day25 day26 day27
## 5 3 10 8 10 6 17 9 14 9 7 13 9 12
## day28 day29 day30 day31 day32 day33 day34 day35 day36 day37 day38 day39 day40
## 5 6 7 7 9 6 3 2 2 4 2 0 1 1
all rows from column 16
dat[ , 16]
## [1] 4 4 15 8 10 15 13 9 11 6 3 8 12 3 5 10 11 4 11 13 15 5 14 13 4
## [26] 9 13 6 7 6 14 3 15 4 15 11 7 10 15 6 5 6 15 11 15 6 11 15 14 4
## [51] 10 15 11 6 13 8 4 13 12 9
dat[1:nrow(dat), 16]
## [1] 4 4 15 8 10 15 13 9 11 6 3 8 12 3 5 10 11 4 11 13 15 5 14 13 4
## [26] 9 13 6 7 6 14 3 15 4 15 11 7 10 15 6 5 6 15 11 15 6 11 15 14 4
## [51] 10 15 11 6 13 8 4 13 12 9
you can also exclude certain indices using the -
sign
dat[1:5 , -16] # gives every column but 16
## day1 day2 day3 day4 day5 day6 day7 day8 day9 day10 day11 day12 day13 day14
## 1 0 0 1 3 1 2 4 7 8 3 3 3 10 5
## 2 0 1 2 1 2 1 3 2 2 6 10 11 5 9
## 3 0 1 1 3 3 2 6 2 5 9 5 7 4 5
## 4 0 0 2 0 4 2 2 1 6 7 10 7 9 13
## 5 0 1 1 3 3 1 3 5 2 4 4 7 6 5
## day15 day17 day18 day19 day20 day21 day22 day23 day24 day25 day26 day27 day28
## 1 7 7 7 12 18 6 13 11 11 7 7 4 6
## 2 4 7 16 8 6 18 4 12 5 12 7 11 5
## 3 4 5 11 9 10 19 14 12 17 7 12 11 7
## 4 8 15 10 10 7 17 4 4 7 6 15 6 4
## 5 3 8 10 6 17 9 14 9 7 13 9 12 6
## day29 day30 day31 day32 day33 day34 day35 day36 day37 day38 day39 day40
## 1 8 8 4 4 5 7 3 4 2 3 0 0
## 2 11 3 3 5 4 4 5 5 1 1 0 1
## 3 4 2 10 5 4 2 2 3 2 2 1 1
## 4 9 11 3 5 6 3 3 4 2 3 2 1
## 5 7 7 9 6 3 2 2 4 2 0 1 1
dat[1:5 , -(1:10)] # gives every column except the first 10
## day11 day12 day13 day14 day15 day16 day17 day18 day19 day20 day21 day22 day23
## 1 3 3 10 5 7 4 7 7 12 18 6 13 11
## 2 10 11 5 9 4 4 7 16 8 6 18 4 12
## 3 5 7 4 5 4 15 5 11 9 10 19 14 12
## 4 10 7 9 13 8 8 15 10 10 7 17 4 4
## 5 4 7 6 5 3 10 8 10 6 17 9 14 9
## day24 day25 day26 day27 day28 day29 day30 day31 day32 day33 day34 day35 day36
## 1 11 7 7 4 6 8 8 4 4 5 7 3 4
## 2 5 12 7 11 5 11 3 3 5 4 4 5 5
## 3 17 7 12 11 7 4 2 10 5 4 2 2 3
## 4 7 6 15 6 4 9 11 3 5 6 3 3 4
## 5 7 13 9 12 6 7 7 9 6 3 2 2 4
## day37 day38 day39 day40
## 1 2 3 0 0
## 2 1 1 0 1
## 3 2 2 1 1
## 4 2 3 2 1
## 5 2 0 1 1
An alternative way to carry out subsetting is to reference specific
column names or to use the subset
function. Here to avoid
printing too much information to the screen I’ll just focus on on the
first 5 rows of each subset
dat$patient10[1:5]
## NULL
dat[1:5 , 'day10']
## [1] 3 6 9 7 4
dat[1:5 , c('day10', 'day15')]
## day10 day15
## 1 3 7
## 2 6 4
## 3 9 4
## 4 7 8
## 5 4 3
notice that the following would give and error
dat[ , -c('patient3')]
## Error in -c("patient3"): invalid argument to unary operator
but that the following would accomplish the intended goal of dropping patient 3
dat[1:5 , -3]
## day1 day2 day4 day5 day6 day7 day8 day9 day10 day11 day12 day13 day14 day15
## 1 0 0 3 1 2 4 7 8 3 3 3 10 5 7
## 2 0 1 1 2 1 3 2 2 6 10 11 5 9 4
## 3 0 1 3 3 2 6 2 5 9 5 7 4 5 4
## 4 0 0 0 4 2 2 1 6 7 10 7 9 13 8
## 5 0 1 3 3 1 3 5 2 4 4 7 6 5 3
## day16 day17 day18 day19 day20 day21 day22 day23 day24 day25 day26 day27 day28
## 1 4 7 7 12 18 6 13 11 11 7 7 4 6
## 2 4 7 16 8 6 18 4 12 5 12 7 11 5
## 3 15 5 11 9 10 19 14 12 17 7 12 11 7
## 4 8 15 10 10 7 17 4 4 7 6 15 6 4
## 5 10 8 10 6 17 9 14 9 7 13 9 12 6
## day29 day30 day31 day32 day33 day34 day35 day36 day37 day38 day39 day40
## 1 8 8 4 4 5 7 3 4 2 3 0 0
## 2 11 3 3 5 4 4 5 5 1 1 0 1
## 3 4 2 10 5 4 2 2 3 2 2 1 1
## 4 9 11 3 5 6 3 3 4 2 3 2 1
## 5 7 7 9 6 3 2 2 4 2 0 1 1
let’s try using the subset function only data for day 3
subset(dat, select = day3)[1:5, ]
## [1] 1 2 1 2 1
data on all days but 3
subset(dat, select = -day3)[1:5, ]
## day1 day2 day4 day5 day6 day7 day8 day9 day10 day11 day12 day13 day14 day15
## 1 0 0 3 1 2 4 7 8 3 3 3 10 5 7
## 2 0 1 1 2 1 3 2 2 6 10 11 5 9 4
## 3 0 1 3 3 2 6 2 5 9 5 7 4 5 4
## 4 0 0 0 4 2 2 1 6 7 10 7 9 13 8
## 5 0 1 3 3 1 3 5 2 4 4 7 6 5 3
## day16 day17 day18 day19 day20 day21 day22 day23 day24 day25 day26 day27 day28
## 1 4 7 7 12 18 6 13 11 11 7 7 4 6
## 2 4 7 16 8 6 18 4 12 5 12 7 11 5
## 3 15 5 11 9 10 19 14 12 17 7 12 11 7
## 4 8 15 10 10 7 17 4 4 7 6 15 6 4
## 5 10 8 10 6 17 9 14 9 7 13 9 12 6
## day29 day30 day31 day32 day33 day34 day35 day36 day37 day38 day39 day40
## 1 8 8 4 4 5 7 3 4 2 3 0 0
## 2 11 3 3 5 4 4 5 5 1 1 0 1
## 3 4 2 10 5 4 2 2 3 2 2 1 1
## 4 9 11 3 5 6 3 3 4 2 3 2 1
## 5 7 7 9 6 3 2 2 4 2 0 1 1
data only on day 3 when inflammation in day 1 is equal to 0
subset(dat, subset = day1 == 0, select = day3)[1:5, ]
## [1] 1 2 1 2 1
first row, all of the columns
patient_1 <- dat[1, ]
max inflammation for patient 1
max(patient_1)
## [1] 18
max inflammation for patient 2
max(dat[2, ])
## [1] 18
minimum inflammation on day 7
min(dat[ , 7])
## [1] 1
mean inflammation on day 7
mean(dat[ , 7])
## [1] 3.8
median inflammation on day 7
median(dat[ , 7])
## [1] 4
standard deviation of inflammation on day 7
sd(dat[ , 7])
## [1] 1.725187
summary(dat[ , 7])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 2.0 4.0 3.8 5.0 6.0
Thus, to obtain the average inflammation of each patient we will need
to calculate the mean of all of the rows (MARGIN = 1
) of
the data frame.
avg_patient_inflammation <- apply(dat, 1, mean)
And to obtain the average inflammation of each day we will need to
calculate the mean of all of the columns (MARGIN = 2
) of
the data frame.
avg_day_inflammation <- apply(dat, 2, mean)
standard deviation of day
sd_day_inflammation <- apply(dat, 2, sd)
standard deviation of patients
sd_patient_inflammation <- apply(dat, 1, sd)
use the function plot() to plot the data summaries
?plot
## Help on topic 'plot' was found in the following packages:
##
## Package Library
## graphics /usr/lib/R/library
## base /usr/lib/R/library
##
##
## Using the first match ...
provides a long list of potential arguments and examples at a minimum you must provide a single quantitative variable, for example:
plot(avg_day_inflammation)
notice how R fills in lots of pieces of missing information automatically. specifically it assumes that the independent variable is simply an index from 1 to the length of the object in this case avg_day_inflamation. A safer more clear way to accomplish the same plot is to use the following:
plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='day',
ylab='inflammation')
this makes it clearer that the x-variable is simply an index from 1 to the length of avg_day_inflammation, and it makes the x and y axis labels more understandable.
Note that in most cases in R you have a specific columns in your
data.frame
you wish to plot against one another. If for
example my data.frame
was called dat
and there
were two columns growth
and temperature
. I
could plot growth as a function of temperature using
plot(dat$temperature, dat$growth)
This will come up in the
homework as well.
To output multi-panel plots use for example
par(mfrow=c(1,2))
which will create a single plotting row with two columns
plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='day',
ylab='inflammation')
plot(1:length(avg_patient_inflammation), avg_patient_inflammation,
xlab='patient identity', ylab='inflammation')
to output the figure to file you can use Rstudio’s GUI features or you can use the command line which is what I recommend so that the code is fully reproducible:
# png('./inflammation_fig1.png') # to create a png file
par(mfrow = c(2,1))
plot(1:length(avg_day_inflammation), avg_day_inflammation, xlab='Day',
ylab='Inflammation', frame.plot=F, col='magenta', pch=2, cex=2)
plot(1:length(avg_patient_inflammation), avg_patient_inflammation,
xlab='patient identity', ylab='inflammation', col='dodgerblue')
# dev.off() # to stop writing to the png file.
Home Page - http://dmcglinn.github.io/quant_methods/ GitHub Repo - https://github.com/dmcglinn/quant_methods