Home Page - http://dmcglinn.github.io/quant_methods/ GitHub Repo -
https://github.com/dmcglinn/quant_methods
Source Code Link
https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/lessons/univariate_models.Rmd
The goals of this lesson are to introduce univariate modeling using
simple and multivariate Ordinary Least Squares (OLS) regression, and to
gain exposure to the concept and methods of model comparison.
The content from this lesson is modified in part from lessons
developed by Jack Weiss and Jason Fridley. The original lessons can be
found at the following links:
Readings and Literature Cited
- Chapters 5, 9-10 in The R book (1st ed) by Crawley
- Chapter 6 in MASS (4th ed) by Venables and Ripley
- Chapter 1 in Model Selection and Multimodel Inference (2nd
ed) by Burnham and Anderson
- p150 in A primer of Ecological Statistics (2nd ed) by
Gotelli and Ellison
# Modeling Philosophy
One of the simplest and most important ideas when constructing models
can be stated succinctly:
“All models are wrong but some models are useful” G.E. Box
As scientists we have to be on guard that we are not too attached to
our models. Additionally it is often difficult to decide when a model is
useful or not because there are three general reasons for building a
model:
- Hypothesis testing
- Exploring data
- Predicting data
These different goals of model building require different approaches
and modes of model evaluation. Hypothesis testing is typically geared
towards testing a small handful of carefully crafted a priori
hypotheses. In this context the model is typically
judged useful if it is statistically significant.
However, many times the investigator does not have a clear
a priori hypothesis (see
post at Small Pond Science) and is instead examining whether a large
number of possibly relevant variables are correlated with a response of
interest. The goal of this kind of exploratory analysis is to generate
hypotheses that can be more carefully tested with other
datasets and not with the dataset used to generate them.
Unfortunately, very frequently and quite misleadingly, the correlations
that are found to be strong in the exploratory analysis are then
presented in a hypothetico-deductive framework as if they were a
priori (see
post on Dynamic Ecology). When using models to make predictions we
are typically less concerned about the exact variables that are in the
model and more concerned that we can predict observations not included
in the building of the model (i.e., cross-validate the model).
There are many reasons that p-values and statistical significance are
abused in science. For example, it can be very tempting to report
statistics of significance in many analyses in which you did not have
clear a priori hypotheses because often R will
report such statistics without prompting from the user (e.g.,
summary(my_ols_model)). Additionally there is a stigma in
many fields of science against exploratory analyses in favor of
hypothesis testing which pushes some researchers to re-frame their
analyses as if they are confirmatory rather than exploratory. And of
course there is pressure during peer-review to only report on statistics
that are significant.
You might be wondering why this is a big deal. The reason is
Freedman’s paradox (Freedman
1983) which demonstrates that you will inevitably get good fitting
models (high R^2) and statistically significant results (p < 0.05) if
you keep adding variables to a model even if those variables by
definition are independent of the response variable. The solution to
Freedman’s paradox is to approach model comparison as carefully and
intentionally as possible with a small number of deliberately chosen
models. Burnham and Anderson (2002, p19) advocate for:
… a conservative approach to the overall issue of strategy in
the analysis of data in the biological sciences with an emphasis on
a priori considerations and models to be considered.
Careful, a priori consideration of alternative models will often
require a major change in emphasis among many people.
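To make Freedman’s paradox concrete before moving on, here is a minimal simulation sketch (the object names are made up): we regress a response on 50 predictors that are pure noise and count how many nonetheless appear “significant” at p < 0.05.
set.seed(1)
n <- 100
noise_preds <- as.data.frame(matrix(rnorm(n * 50), ncol = 50))
noise_preds$y <- rnorm(n)  # the response is independent of every predictor
freedman_mod <- lm(y ~ ., data = noise_preds)
p_vals <- summary(freedman_mod)$coefficients[-1, 4]  # drop the intercept row, keep the p-values
sum(p_vals < 0.05)  # typically a handful of predictors look "significant" purely by chance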
The Principle of Parsimony (Occam’s Razor)
Once you have decided on a small number of models to compare how do
you decide which model is the best? One approach is to use the
principle of parsimony, which can be stated succinctly as:
the correct explanation is the simplest explanation that is
sufficient.
For our purposes we can think of this as:
the model with the smallest possible number of parameters for
adequate representation of the data (Box and Jenkins 1970)
Crawley (2007) suggested that, all else being equal, we should
prefer:
- a model with n−1 parameters to a model with n parameters;
- a model with k−1 explanatory variables to a model with k explanatory
variables;
- a linear model to a model which is curved;
- a model without a hump to a model with a hump;
- a model without interactions to a model containing interactions
between factors.
Thus model simplification is an important part of finding the most
useful model. A simpler model that explains similar levels of deviance
in a response should be considered a better model.
Crawley (2007) suggests the following steps for backward model
simplification:
These steps are reasonable if one is carrying out an
exploratory data analysis; however, I do not recommend
this approach when formal hypothesis testing is the goal of the analysis
due to Freedman’s paradox as explained above. Other criteria for model
selection (not just the p < 0.05 criterion used in those steps) suffer
the same problem.
According to Burnham and Anderson (2002, p18) to resolve Freedman’s
paradox one should:
1) use biologically meaningful models
2) keep the number of candidate models small
3) have a large sample size relative to the number of parameters to be
estimated
4) base inference on more than one model (multi-model inference) if
models perform equally well
With regard to #3 above, Gotelli and Ellison (2018, p150) suggest
The Rule of 10: you should have at least 10 replicate
observations per treatment level or variable included in the
model.
One last point regarding model building: it is also possible to
proceed from simple to more complex models, evaluating with each
additional term whether it significantly increases explanatory power.
# The Milkweeds and Fertilizer Example
We will begin exploring how to build and interpret linear models in R
using a data set in which a researcher collected 40 plants of common
milkweed, Asclepias syriaca, from old fields in northeastern
Maine and grew them in a greenhouse; 20 received fertilizer and the
other 20 were kept as controls. After two months, she measured the
height of each plant and then collected and weighed all of the fruit
produced by the plant.
For today’s purposes we will examine how fruit mass responds to
fertilization and plant height, and whether fertilization modulates the
effect of plant height on fruit mass.
First let’s load the relevant R packages, read in the data, and
look at the structure of the data.
#install.packages("ggplot2") # for ggplot
#install.packages("gridExtra") # for grid.arrange to arrange ggplots
#install.packages("scatterplot3d") # for scatterplot3d to make 3d graphic
#install.packages("MASS") # for stepAIC to automate model selection
library(ggplot2) # to make fancy graphs
library(gridExtra) # arrange ggplot graphs
library(scatterplot3d) # make 3d plot
library(MASS) # for stepAIC to automate model selection
Now that we have our libraries loaded let’s read in the data
weeds <- read.csv('http://dmcglinn.github.io/quant_methods/data/milkweeds.csv',
colClasses = c('integer', 'factor', 'numeric', 'numeric'))
weeds
## samp_id trt plant_ht_cm fruit_mass_mg
## 1 1 fertilized 30.32 19.17
## 2 2 unfertilized 23.07 15.06
## 3 3 fertilized 21.90 16.09
## 4 4 unfertilized 32.32 17.19
## 5 5 unfertilized 30.19 15.88
## 6 6 fertilized 19.63 18.91
## 7 7 unfertilized 15.32 13.43
## 8 8 fertilized 25.26 15.54
## 9 9 fertilized 25.42 22.90
## 10 10 fertilized 22.33 15.97
## 11 11 fertilized 36.81 24.80
## 12 12 unfertilized 23.76 17.18
## 13 13 fertilized 22.75 22.00
## 14 14 fertilized 32.29 19.31
## 15 15 unfertilized 29.27 16.14
## 16 16 fertilized 24.16 17.06
## 17 17 unfertilized 22.62 15.04
## 18 18 unfertilized 18.97 16.70
## 19 19 fertilized 25.32 20.15
## 20 20 unfertilized 27.58 17.97
## 21 21 unfertilized 17.92 16.09
## 22 22 unfertilized 25.08 16.71
## 23 23 fertilized 28.38 19.68
## 24 24 fertilized 30.80 21.19
## 25 25 unfertilized 27.11 18.46
## 26 26 unfertilized 16.63 13.35
## 27 27 fertilized 28.26 24.34
## 28 28 fertilized 37.67 23.41
## 29 29 unfertilized 25.11 20.72
## 30 30 unfertilized 21.72 12.61
## 31 31 fertilized 25.10 22.23
## 32 32 fertilized 31.17 16.74
## 33 33 fertilized 21.08 19.69
## 34 34 unfertilized 26.69 17.69
## 35 35 fertilized 26.04 20.89
## 36 36 unfertilized 32.68 19.59
## 37 37 unfertilized 31.71 18.29
## 38 38 unfertilized 30.93 19.17
## 39 39 unfertilized 29.05 16.92
## 40 40 fertilized 31.95 21.93
So we can see that this data set is fairly simple. There are two
treatments in the column called trt: fertilized and unfertilized. And
there are two continuous variables: plant height is provided by
plant_ht_cm and fruit mass by fruit_mass_mg.
Before we dive into the models it is always important to
first examine the patterns in the data set
graphically. This will help to build our intuition, which will
guide our modeling and interpretation of the statistics.
A quick note on graphics. I will primarily use base graphics in this
course but increasingly the R community is moving towards using
ggplot for graphics. ggplot is a really great
set of tools for making beautiful graphics but it can be more difficult
to understand exactly how it works and how to customize graphics.
Therefore, I want to expose you to both methods of producing graphics:
the simple and clunky base graphics and the elegant and shiny
ggplot graphics.
First let’s examine the effect that fertilizer has on fruit mass
using box-and-whisker plots.
boxplot(fruit_mass_mg ~ trt, data = weeds, xlab='Treatment',
ylab = 'Fruit mass (mg)')
Above we used the function boxplot to produce a modified
box-and-whisker plot. In this case we used the formula
argument to specify the relationship we wished to display, in which the
LHS is the dependent variable and the RHS is the independent variable.
The argument data tells the function where to look for the
variables specified in the formula argument. Lastly,
xlab and ylab allow you to specify the labels
for each axis.
In the graphic, the box represents the interquartile range (IQR),
the line is the median, and the whiskers extend to either the minimum
or maximum or 1.5 * IQR, whichever is closer to the median value.
Outliers are defined as points that fall outside of the whiskers and
will be shown as points.
I prefer box-and-whisker plots because they help to represent the
data in a summarized way that does not make distributional assumptions.
It is also easy to calculate the values used in the boxplot using the
function quantile
quantile(weeds$fruit_mass_mg[weeds$trt == 'fertilized'])
## 0% 25% 50% 75% 100%
## 15.5400 18.4475 19.9200 22.0575 24.8000
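As a minimal check of the whisker rule described above (using the weeds object already loaded), we can compute the 1.5 * IQR fences by hand and compare them to the range of the fertilized group.
fert <- weeds$fruit_mass_mg[weeds$trt == 'fertilized']
q1 <- unname(quantile(fert, 0.25))
q3 <- unname(quantile(fert, 0.75))
iqr <- q3 - q1  # interquartile range, same as IQR(fert)
c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)  # fences at roughly 13.0 and 27.5
range(fert)  # all values fall inside the fences, so the whiskers run to the min and max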
Now let’s make the same graphic using ggplot
ggplot(data = weeds) +
geom_boxplot(mapping = aes(x = trt, y = fruit_mass_mg)) +
labs(x = 'Treatment', y = 'Fruit mass (mg)')
Technically this is the same graphic as produced by base but you can
see that the R code is a bit more cryptic.
Let’s break it down line by line: ggplot(data = weeds) +
spawns a ggplot
graphic and specifies that the data will be
provided by the object weeds
. Now we can simply refer to
column names of weeds
in the remainder of the plotting
call.
The next line:
geom_boxplot(mapping = aes(x = trt, y = fruit_mass_mg)) +
specifies the geometry of the graphic, in this case a boxplot; there are
many other geometries that can be specified in ggplot. The geometry
function requires a few arguments, including how to map data onto the
graphic using the argument mapping. The mapping is usually
wrapped in the function aes, which maps variables in the data onto
aesthetics of the graphic; here aes requires
x and y variables.
The last line:
labs(x = 'Treatment', y = 'Fruit mass (mg)')
provides the
axis labels.
Note that each line that is not the last line of the
ggplot call ends with a plus sign +; this
tells R that you are specifying more options to the plot.
The payoff for the additional overhead of understanding this somewhat
cryptic R code comes when you want to quickly make nice graphics across
multiple factors or splits of a data set and when you want more complex
geometries which the ggplot developers have already
created.
We are only going to touch on the capacities of ggplot
here. Here are a couple of great places to learn more about
ggplot
From our boxplot we can see that fertilizer does seem to increase the
mass of the milkweed fruits. Let’s also examine if plant height seems to
influence fruit mass.
Since fruit mass and plant height are both continuous variables we
will use a scatterplot to display their relationship.
plot(fruit_mass_mg ~ plant_ht_cm, data = weeds, xlab = 'Plant height (cm)',
ylab = 'Fruit mass (mg)')
Like the boxplot
function the plot
function
allows you to use a formula
argument to specify the
variables included in the plot.
From our plot we can see that taller plants do seem to produce more
fruit mass.
Let’s visually examine whether fertilizer modifies how fruit mass responds
to plant height.
plot(fruit_mass_mg ~ plant_ht_cm, data = weeds, type = 'n',
xlab = 'Plant height (cm)', ylab = 'Fruit mass (mg)')
points(fruit_mass_mg ~ plant_ht_cm, data = weeds, subset = trt == "fertilized",
pch = 1, col = 'red')
points(fruit_mass_mg ~ plant_ht_cm, data = weeds, subset = trt == "unfertilized",
pch = 2, col = 'blue')
legend('topleft', c('Fertilized', 'Unfertilized'), col = c('red', 'blue'),
pch = c(1, 2), bty = 'n')
Now we have a graphic where we have different colors and symbols for
the fertilized and unfertilized points.
Let’s add a smoother so that we can better see if the different
fertilization regimes have different functional forms. There are many
options for smoothing functions to choose from in R but for simplicity
we’ll use the function lowess
which is a locally-weighted
polynomial regression.
plot(fruit_mass_mg ~ plant_ht_cm, data = weeds, type = 'n',
xlab = 'Plant height (cm)', ylab = 'Fruit mass (mg)')
points(fruit_mass_mg ~ plant_ht_cm, data = weeds, subset = trt == "fertilized",
pch = 1, col = 'red')
lines(lowess(weeds$plant_ht_cm[weeds$trt == 'fertilized'],
weeds$fruit_mass_mg[weeds$trt == 'fertilized']),
lty = 1, col = 'red')
points(fruit_mass_mg ~ plant_ht_cm, data = weeds, subset = trt == "unfertilized",
pch = 2, col = 'blue')
lines(lowess(weeds$plant_ht_cm[weeds$trt == 'unfertilized'],
weeds$fruit_mass_mg[weeds$trt == 'unfertilized']),
lty = 2, col = 'blue')
legend('topleft', c('Fertilized', 'Unfertilized'), col = c('red', 'blue'),
pch = c(1, 2), lty = c(1, 2), bty = 'n')
Question
Based on the graphic above does it appear that fertilization
modulates the effect of plant height on fruit mass? In statistics we
would consider such an effect an interaction
effect.
[Show answer]
There does not appear to be an interaction effect between plant height and
fertilization because the two smoothers look roughly parallel to one another.
This suggests that a model with just the two main effects, treatment and plant
height, but no interaction effect is likely going to provide the best
approximation to the data without overfitting.
Let’s create the same graphic but using ggplot
ggplot(data = weeds, mapping = aes(x = plant_ht_cm, y = fruit_mass_mg)) +
geom_point(mapping = aes(color = trt)) +
geom_smooth(mapping = aes(linetype = trt, color = trt), method = 'loess') +
scale_color_manual(values = c("red", "blue")) +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
A few quick points about this ggplot call.
- rather than specify the mapping of the x and
y for each geom_* we simply specify them once
when setting up the graphic in the first ggplot() call
- note the usage of scale_color_manual() - this is generally
not needed as ggplot’s default colors are usually pretty
attractive.
Exercise
Modify the ggplot call so that the x and
y axes are properly labeled.
[Show answer]
ggplot(data = weeds, mapping = aes(x = plant_ht_cm, y = fruit_mass_mg)) +
geom_point(mapping = aes(color = trt)) +
geom_smooth(mapping = aes(linetype = trt, color = trt), method = 'loess') +
scale_color_manual(values = c("red", "blue")) +
labs(x = 'Plant height (cm)', y = 'Fruit mass (mg)') +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
# Model Building and Graphing
Now that we’ve visually explored the relationships in the milkweed
data set, and we have developed some intuition about how we might want
to approach this data set let’s actually develop some statistical
regression models to explain the variation in fruit mass.
Ordinary Least Squares Regression
Ordinary Least Squares (OLS) regression is one of the most commonly
used models in statistics. It attempts to approximate the response
variable y using a linear combination of explanatory variables
x. We can express OLS regression algebraically as:
\[y_i = \beta_0 1 + \beta_1 x_{i1} +
\cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad
\varepsilon \sim N(0, \sigma^2) \qquad \text{(equ. 1)}\]
Let’s break equ. 1 down into its parts:
- \(y_i\) is the i-th value
of the response variable
- \(\beta_0\) is the
y-intercept
- the \(\beta\) values are the
partial slopes of the p explanatory variables.
- \(\varepsilon\) represents the
normally distributed error with mean of 0 and variance equal to \(\sigma^2\)
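A minimal way to see equ. 1 in action is to simulate data that obey it and then recover the betas with lm (a sketch with made-up values; we return to this idea in the simulation section at the end of the lesson).
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)  # beta_0 = 2, beta_1 = 3, iid normal errors
coef(lm(y ~ x))  # the estimates should land close to 2 and 3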
One of the core things to take away from equ. 1 is that regression
equations always have one or more error terms. How this error is
specified (e.g., Normal error, Binomial error, Poisson error) is the key
way different regression methods differ from one another. Another
important point about the error term is that in OLS each of the
n samples is assumed to be independent and identically
distributed (iid) from one another. Other methods of regression
we examine (e.g., spatial regression) will not make this same assumption
and will specify how samples co-vary with one another.
Taken together it can be seen that OLS regression has a number of key
assumptions
- Linearity such that y is a linear combination of the
x-variables. Because the x variables may be
transformed this is a less restrictive assumption than it may seem.
- No error in the independent variables.
- Homoscedasticity of errors which means that error around the
regression line is equal across the range of x values.
- Independence of errors such that the direction and magnitude of
error is not correlated between samples.
As you read those assumptions you may be correctly thinking to
yourself that those assumptions are almost never met in most situations.
We’ll examine in a moment using simulated data how critical violations
of some of those assumptions are.
Check out this nice interactive visualization of OLS regression at Explained
Visually
Model Hierarchy
In our greenhouse data set we have several different models we could
examine:
Intercept only null model
There is the minimal intercept only model
\[\mathbf{y} = \beta_0 +
\varepsilon\] This model essentially just uses the mean of
y as a predictor of y. This may seem silly but this is
essentially what you compare all more complex models against - it is our
null model.
null_mod <- lm(fruit_mass_mg ~ 1, data = weeds)
null_mod
##
## Call:
## lm(formula = fruit_mass_mg ~ 1, data = weeds)
##
## Coefficients:
## (Intercept)
## 18.4
mean(weeds$fruit_mass_mg)
## [1] 18.40475
In the above R code, lm specifies this is a linear
model, and we used the formula specification just as we did
when using boxplot and plot, but in this case
the RHS of the equation is just a 1, indicating that an intercept-only
model is being specified.
It is clear from the output of the R code that this model only has the
intercept and that it is equal to the mean of the response variable.
Let’s examine this model graphically
plot(fruit_mass_mg ~ 1, data = weeds)
abline(null_mod, lwd = 2)
abline(h = mean(weeds$fruit_mass_mg), col = 'red', lty = 2, lwd = 2)
Note above that the x-axis is just an index of the 40 samples. The
function abline is useful for specifying regression
(reg=), horizontal (h=), and vertical lines
(v=).
Main effect models
A main effect model is a model that includes at least one independent
variable:
\[y = \beta_0 +
\beta_1x+\varepsilon\]
We can develop single main effect models that include either
treatment or plant height as such in R:
trt_mod <- lm(fruit_mass_mg ~ trt, data = weeds)
# alternatively we can just update null_mod
trt_mod <- update(null_mod, . ~ . + trt)
ht_mod <- update(null_mod, . ~ . + plant_ht_cm)
trt_mod
##
## Call:
## lm(formula = fruit_mass_mg ~ trt, data = weeds)
##
## Coefficients:
## (Intercept) trtunfertilized
## 20.10 -3.39
ht_mod
##
## Call:
## lm(formula = fruit_mass_mg ~ plant_ht_cm, data = weeds)
##
## Coefficients:
## (Intercept) plant_ht_cm
## 10.1205 0.3143
Let’s take a closer look at trt_mod, which includes
the main effect due to treatment. Note that only one of the levels of
treatment is provided as a coefficient. In this case it is
unfertilized. To better understand this you need to
consider how factor variables are included in regression models. A
categorical variable is encoded in R as a set of contrasts (aka dummy
variables).
levels(weeds$trt)
## [1] "fertilized" "unfertilized"
contrasts(weeds$trt)
## unfertilized
## fertilized 0
## unfertilized 1
In this case we have a factor trt that only has two
levels, so it only requires one binary variable, because if a sample is
not fertilized then it must be unfertilized. That may seem a little
strange but just remember that if you have a factor with k
levels then you need k - 1 binary variables to specify that
factor as a set of contrasts. This explains why the treatment
variable only requires a single regression coefficient.
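Two quick checks (a minimal sketch using the objects defined above) make the dummy coding concrete: the design matrix shows the 0/1 column that R builds for trt, and the trt coefficient is simply the difference between the two group means.
head(model.matrix(trt_mod))  # an intercept column of 1s plus the 0/1 trtunfertilized column
with(weeds, tapply(fruit_mass_mg, trt, mean))        # group means: roughly 20.10 and 16.71
diff(with(weeds, tapply(fruit_mass_mg, trt, mean)))  # roughly -3.39, matching the trtunfertilized coefficient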
Sometimes we have factors that are ranked, such as low, medium, high.
In this case the variable is called ordinal, as opposed
to our nominal treatment variable, which did not contain
ranks. The contrasts of ordinal variables are not as simple to specify,
and typically polynomial contrasts are used by default (if you want to
know more and possibly a better solution see Gullickson’s
2020 blog post)
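If you are curious what R does by default with an ordered factor, here is a minimal sketch (the factor itself is made up):
f <- factor(c('low', 'medium', 'high'),
            levels = c('low', 'medium', 'high'), ordered = TRUE)
contrasts(f)  # R's default for ordered factors: orthogonal polynomial contrasts (.L linear, .Q quadratic)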
Let’s examine these models graphically.
par(mfrow=c(1,2))
plot(fruit_mass_mg ~ trt, data = weeds, xlab = 'Treatment',ylab = 'Fruit mass (mg)')
abline(trt_mod)
plot(fruit_mass_mg ~ plant_ht_cm, data = weeds,
xlab = 'Plant height (cm)', ylab = 'Fruit mass (mg)')
abline(ht_mod)
The first panel of this graphic doesn’t quite look correct because
the regression line is not intersecting the center of the boxplots which
is what we would expect. This is because by default R assigns the first
level of the treatment factor an x-value of 1 and the second level of
the factor a value of 2. Therefore when the regression line is added to
the plot it is plotting the y-intercept (which is the mean value of the
fertilized group) off the graph to the left one unit. To correct this we
have to build the graph from scratch a bit more carefully, making sure
that the fertilized group is plotted at an x-axis value of 0 so that the
regression line intersects that group properly.
par(mfrow=c(1,2))
plot(0:1, 0:1, type='n', ylim = range(weeds$fruit_mass_mg), axes = F,
xlim = c(-0.25, 1.25),
xlab = 'Treatment', ylab = 'Fruit mass (mg)', frame.plot = T)
points(rep(0, 20), weeds$fruit_mass_mg[weeds$trt == 'fertilized'])
points(rep(1, 20), weeds$fruit_mass_mg[weeds$trt == 'unfertilized'])
axis(side = 1, at = 0:1, labels = c('fertilized', 'unfertilized'))
axis(side = 2)
abline(trt_mod)
plot(fruit_mass_mg ~ plant_ht_cm, data = weeds,
xlab = 'Plant height (cm)', ylab = 'Fruit mass (mg)')
abline(ht_mod)
This is a good example of how ggplot can simplify things
if you know the correct command. In this case the command is a bit
cryptic: we need to specify in the mapping of the points the
argument group = 1. This tells ggplot to treat all of the
observations as a single group, so that geom_smooth will fit one
line across the levels of the factor - voilà!
p1 <- ggplot(data = weeds,
mapping = aes(x = trt, y = fruit_mass_mg, group = 1)) +
geom_point() +
geom_smooth(method = 'lm') +
labs(x = 'Treatment', y = 'Fruit mass (mg)')
p2 <- ggplot(data = weeds,
mapping = aes(x = plant_ht_cm, y = fruit_mass_mg)) +
geom_point() +
geom_smooth(method = 'lm') +
labs(x = 'Plant height (cm)', y = 'Fruit mass (mg)')
grid.arrange(p1, p2, nrow = 1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Note above that the function grid.arrange
is used to
make multi-panel ggplots
and it is a part of the
gridExtra
package.
All main effects model
Of course we can also specify a model that includes both main
effects
\[y = \beta_0 + \beta_1x + \beta_2z +
\varepsilon\]
In R we would use:
main_mod <- lm(fruit_mass_mg ~ trt + plant_ht_cm, data = weeds)
main_mod
##
## Call:
## lm(formula = fruit_mass_mg ~ trt + plant_ht_cm, data = weeds)
##
## Coefficients:
## (Intercept) trtunfertilized plant_ht_cm
## 12.9586 -2.8822 0.2613
This particular model specifies two parallel regression lines that
only differ in their y-intercept based on whether or not they were
fertilized.
xrange = range(weeds$plant_ht_cm)
plot(fruit_mass_mg ~ plant_ht_cm, data = weeds, type = 'n',
xlab = 'Plant height (cm)', ylab = 'Fruit mass (mg)')
points(fruit_mass_mg ~ plant_ht_cm, data = weeds, subset = trt == "fertilized",
pch = 1, col = 'red')
# add the fitted regression line using two points at either end of the range of x
lines(xrange,
predict(main_mod,
newdata = data.frame(plant_ht_cm = xrange, trt = "fertilized")),
col = 'red', lty = 1)
points(fruit_mass_mg ~ plant_ht_cm, data = weeds, subset = trt == "unfertilized",
pch = 2, col = 'blue')
lines(xrange,
predict(main_mod,
newdata = data.frame(plant_ht_cm = xrange, trt = "unfertilized")),
col = 'blue', lty = 2)
legend('topleft', c('Fertilized', 'Unfertilized'), col = c('red', 'blue'),
pch = c(1, 2), lty = c(1, 2), bty = 'n')
Above in the R code we used a new function, predict().
This function is useful for taking a fitted model and using it to
predict values. In this case we use this feature to predict where two
points fall along the x-axis so that we can connect them with lines and
thus form the fitted regression lines. One of the more difficult
arguments of predict() to understand is
newdata. This is essentially a
data.frame that must have the variables in the model
included as column names. So in this case newdata must have the variables
plant_ht_cm and trt.
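For example (a small sketch with a made-up plant), predict() will return the expected fruit mass for a hypothetical 25 cm tall fertilized plant:
predict(main_mod, newdata = data.frame(plant_ht_cm = 25, trt = 'fertilized'))  # roughly 19.5 mg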
Another way to think about this model is that it is specifying a 3d
plane that can fit through the data.
s3d <- scatterplot3d(weeds[ , c('trt', 'plant_ht_cm', 'fruit_mass_mg')],
type = 'h', color = 'blue', angle = 65, pch = 16,
lab = c(3, 5), xlim = c(0, 3),
x.ticklabs = c('', 'fertilized', 'unfertilized', ''),
box = F)
# Add regression plane
s3d$plane3d(main_mod)
Again it is a little fussy to make this graph because of the
factor.
Interaction effect model
Lastly the most complex typical model that we would specify based on
our graphical exploration would also be a model that includes an
interaction effect. An interaction effect is an effect
in which one variable influences the effect of another variable on the
response.
\[y = \beta_0 + \beta_1x + \beta_2z +
\beta_3xz + \varepsilon\]
int_mod <- lm(fruit_mass_mg ~ trt + plant_ht_cm + trt:plant_ht_cm,
data = weeds)
# alternatively we can use * as short hand
int_mod <- lm(fruit_mass_mg ~ trt * plant_ht_cm, data = weeds)
int_mod
##
## Call:
## lm(formula = fruit_mass_mg ~ trt * plant_ht_cm, data = weeds)
##
## Coefficients:
## (Intercept) trtunfertilized
## 12.842466 -2.669054
## plant_ht_cm trtunfertilized:plant_ht_cm
## 0.265532 -0.008069
Graphically the easiest way to consider this model is that it
specifies separate regression lines for each level of fertilization that
differ in both their y-intercept and slope.
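Another way to read the coefficient table (a quick check using int_mod from above): the plant_ht_cm coefficient is the slope for the fertilized (reference) group, and the interaction term is the change in that slope for the unfertilized group.
coef(int_mod)['plant_ht_cm']  # slope for the fertilized group, roughly 0.266
coef(int_mod)['plant_ht_cm'] + coef(int_mod)['trtunfertilized:plant_ht_cm']  # unfertilized slope, roughly 0.257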
plot(fruit_mass_mg ~ plant_ht_cm, data = weeds, type = 'n',
xlab = 'Plant height (cm)', ylab = 'Fruit mass (mg)')
points(fruit_mass_mg ~ plant_ht_cm, data = weeds, subset = trt == "fertilized",
pch = 1, col = 'red')
xrange = range(weeds$plant_ht_cm[weeds$trt == "fertilized"])
lines(xrange,
predict(int_mod,
newdata = data.frame(plant_ht_cm = xrange, trt = "fertilized")),
col = 'red', lty = 1)
points(fruit_mass_mg ~ plant_ht_cm, data = weeds, subset = trt == "unfertilized",
pch = 2, col = 'blue')
xrange = range(weeds$plant_ht_cm[weeds$trt == "unfertilized"])
lines(xrange,
predict(int_mod,
newdata = data.frame(plant_ht_cm = xrange, trt = "unfertilized")),
col = 'blue', lty = 2)
legend('topleft', c('Fertilized', 'Unfertilized'), col = c('red', 'blue'),
pch = c(1, 2), lty = c(1, 2), bty = 'n')
Exercise
Try to come up with the ggplot
code to graphically
depict the interaction effect model (clue: we made essentially this
graphic above but with a smoother so if you need a reminder of how to
structure the code look above).
[Show answer]
ggplot(weeds, mapping = aes(x = plant_ht_cm, y = fruit_mass_mg)) +
geom_point(aes(color = trt)) +
geom_smooth(aes(color = trt, linetype = trt), method = 'lm', se = F) +
labs(x = 'Plant height (cm)' , y = "Fruit mass (cm)")
## `geom_smooth()` using formula = 'y ~ x'
Question
In the above graphic do you see good evidence that the additional
complexity of the interaction effect model is warranted?
[Show answer]
The interaction effect model allows the slopes of the lines to differ
based on fertilization, but it can be seen that the best fit lines appear to have
very similar slopes, which suggests that the added complexity is likely not
warranted.
Tabular model interpretation
Now that we’ve built our models let’s examine the model fits and
the statistical significance of the explanatory variables.
summary(trt_mod)
##
## Call:
## lm(formula = fruit_mass_mg ~ trt, data = weeds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5600 -1.3049 0.0252 1.7704 4.7000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.1000 0.5557 36.171 < 2e-16 ***
## trtunfertilized -3.3905 0.7859 -4.314 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.485 on 38 degrees of freedom
## Multiple R-squared: 0.3288, Adjusted R-squared: 0.3111
## F-statistic: 18.61 on 1 and 38 DF, p-value: 0.0001099
# because trt is a factor lets also examine the anova output
anova(trt_mod)
## Analysis of Variance Table
##
## Response: fruit_mass_mg
## Df Sum Sq Mean Sq F value Pr(>F)
## trt 1 114.95 114.955 18.614 0.0001099 ***
## Residuals 38 234.68 6.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The function summary
provides several useful pieces of
information for linear models. The coefficients table:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.1000 0.5557 36.171 < 2e-16 ***
trtunfertilized -3.3905 0.7859 -4.314 0.00011 ***
which provides the estimate, standard error, t-statistic, and p-value
for:
- \(\hat{\beta_0}\) the y-intercept -
1st row of the table
- \(\hat{\beta_1}\) the slope of the
regression line - 2nd row of the table
The p-values can be used to assess statistical significance, and the
t-statistics provide a measure of effect size.
Note: most of the time we are not interested in the test of
the intercept
In addition to the coefficient table several statistics for the
entire model are also provided:
Residual standard error: 2.485 on 38 degrees of freedom
Multiple R-squared: 0.3288, Adjusted R-squared: 0.3111
F-statistic: 18.61 on 1 and 38 DF, p-value: 0.0001099
The multiple R-squared and adjusted R-squared provide estimates of
variance explained. The latter statistic adjusts for the number of
variables included in the model. The F-statistic is the ratio of the mean
sum of squares for the model to the mean sum of squares of the residuals
(i.e., the ratio of explained variance to unexplained variance). The
p-value associated with the F-statistic provides a means of examining
the statistical significance of the entire model.
Note: in this specific model, because only a single
explanatory variable is included, the overall model p-value is
identical to the p-value reported for trt in the
coefficients table.
The function anova
reports the analysis of variance
table
Response: fruit_mass_mg
Df Sum Sq Mean Sq F value Pr(>F)
trt 1 114.95 114.955 18.614 0.0001099 ***
Residuals 38 234.68 6.176
The first column lists the degrees of freedom, which indicate that for
our two-level factor trt only one degree of freedom
is used. An additional degree of freedom is used to estimate the
intercept, thus the residual Df = 38 when n =
40. The second column provides the sums of squares for
trt and the residuals; these are also known as the explained and unexplained
variances. The F-value is the ratio of explained to unexplained variance
after taking degrees of freedom into consideration.
# F is (explained SS / model df) / (unexplained SS / residual df); model df = 1 here
114.95 / (234.68 / 38)
## [1] 18.613
Note that the p-value derived from the F-test is
identical to the p-value of the t-test reported for trt in the coefficient
table returned by summary.
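If it helps to connect the two tables, both the R-squared and the overall p-value can be recovered by hand from the sums of squares and the F distribution (a quick check using the numbers printed above):
# multiple R-squared = explained SS / total SS
114.95 / (114.95 + 234.68)  # roughly 0.329, matching summary(trt_mod)
# p-value from the F distribution with 1 and 38 degrees of freedom
pf(18.614, df1 = 1, df2 = 38, lower.tail = FALSE)  # roughly 0.00011, matching both tables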
So the tabular output of the treatment main effect model indicates
that including just treatment explains 31% of the variance which is a
statistically significant amount of variance.
Now let’s examine the plant height main effect model
Question
Extract and interpret the model statistics for the plant height only
model.
[Show answer]
summary(ht_mod)
##
## Call:
## lm(formula = fruit_mass_mg ~ plant_ht_cm, data = weeds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3367 -1.8466 -0.6624 1.8441 5.3379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.12052 2.09707 4.826 2.28e-05 ***
## plant_ht_cm 0.31428 0.07808 4.025 0.000262 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.54 on 38 degrees of freedom
## Multiple R-squared: 0.2989, Adjusted R-squared: 0.2804
## F-statistic: 16.2 on 1 and 38 DF, p-value: 0.0002623
# variance explained is 30% and overall model is statistically significant
# this suggests that the treatment effect is stronger than the plant height effect
# Model comparisons
Our graphical exploration suggested that the model that includes both
treatment and plant height would have the most support; let’s undertake
some formal model comparison.
Adjusted \(R^2\)
A simple but somewhat cumbersome approach is just to compare the
model output from summary
summary(main_mod)
##
## Call:
## lm(formula = fruit_mass_mg ~ trt + plant_ht_cm, data = weeds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3628 -1.3944 0.3798 1.1516 4.0827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.95862 1.86798 6.937 3.44e-08 ***
## trtunfertilized -2.88217 0.68014 -4.238 0.000144 ***
## plant_ht_cm 0.26128 0.06612 3.951 0.000336 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.112 on 37 degrees of freedom
## Multiple R-squared: 0.528, Adjusted R-squared: 0.5025
## F-statistic: 20.69 on 2 and 37 DF, p-value: 9.297e-07
summary(int_mod)
##
## Call:
## lm(formula = fruit_mass_mg ~ trt * plant_ht_cm, data = weeds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.379 -1.371 0.367 1.162 4.082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.842466 2.705632 4.747 3.25e-05 ***
## trtunfertilized -2.669054 3.612254 -0.739 0.46476
## plant_ht_cm 0.265532 0.097429 2.725 0.00985 **
## trtunfertilized:plant_ht_cm -0.008069 0.134256 -0.060 0.95241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.141 on 36 degrees of freedom
## Multiple R-squared: 0.528, Adjusted R-squared: 0.4887
## F-statistic: 13.42 on 3 and 36 DF, p-value: 4.907e-06
From this we can see that the two models have identical \(R^2\) values, which usually would be
surprising because we typically expect more complex models to have
higher raw variance explained. This isn’t surprising to us
because we already graphically inspected the models and concluded there
was little support for an interaction effect. We should also compare the
adjusted \(R^2\), which adjusts the
estimate of variance explained based on how much variance explained we
would expect if purely noise variables were added. In this case
the more complex interaction effect model has a lower adjusted \(R^2\), suggesting that it is not the most
parsimonious model.
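Rather than scanning each printed summary, the adjusted \(R^2\) values can also be pulled out programmatically (a small convenience sketch using the models fit above):
sapply(list(null = null_mod, trt = trt_mod, ht = ht_mod,
            main = main_mod, int = int_mod),
       function(m) summary(m)$adj.r.squared)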
Likelihood ratio tests
We can also use the function anova to carry out
likelihood ratio (F) tests on nested models. Nested models are models in which the
more complex model has all of the same parameters as the simpler model
plus some additional ones.
anova(main_mod, int_mod)
## Analysis of Variance Table
##
## Model 1: fruit_mass_mg ~ trt + plant_ht_cm
## Model 2: fruit_mass_mg ~ trt * plant_ht_cm
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 37 165.04
## 2 36 165.02 1 0.016559 0.0036 0.9524
The ANOVA table indicates that the interaction model does not explain
significantly more variance - in this case this was a silly test to run
because we already could have predicted this result based on the
adjusted \(R^2\) values.
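Note that anova() will also accept a whole sequence of nested models, testing each model against the next more complex one (a quick sketch using the models fit above):
anova(null_mod, ht_mod, main_mod, int_mod)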
AIC model comparison
In the biological sciences a popular way of comparing models is to
use a penalized log likelihood criteria such as as the Akaike
information criterion (AIC) or the Bayesian information criterion (BIC).
Unlike the ANOVA comparisons AIC and BIC can be used to compare models
that are not nested. The formula for AIC is:
\[\mathrm{AIC} = - 2\log(\hat L) + 2k
\]
where \(\hat L\) is the likelihood
of the model and \(k\) is the number of
parameters in the model. The AIC index ranks models both on their degree
of fit and on their complexity. A smaller value of AIC indicates a
better model, so models with more complexity (i.e., more
parameters, and thus a larger \(k\)
value) but equal likelihood are considered worse models (higher
AIC).
Note: the exact value of AIC is not important but
the differences between models are
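To connect the formula to R’s AIC() function, we can compute AIC by hand from the log-likelihood; note that \(k\) counts every estimated parameter, including the residual variance (a minimal check using main_mod from above):
ll <- logLik(main_mod)       # log-likelihood of the main effects model
k <- attr(ll, 'df')          # number of estimated parameters (3 coefficients + the residual variance = 4)
-2 * as.numeric(ll) + 2 * k  # should reproduce AIC(main_mod)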
# examine the AIC scores of all the models we have considered
# smaller number is better
AIC(null_mod)
## [1] 204.2357
AIC(trt_mod)
## [1] 190.2891
AIC(ht_mod)
## [1] 192.0322
AIC(main_mod)
## [1] 178.2068
AIC(int_mod)
## [1] 180.2027
So as you can see all the models are better than the null model, but
the main effects only model is considered the best model. If two models
are within 2 AIC units of one another then they are considered roughly
equivalent models.
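Because only the differences matter, it can be convenient to tabulate the AIC differences relative to the best model (a small sketch):
aic_tab <- AIC(null_mod, trt_mod, ht_mod, main_mod, int_mod)
aic_tab$delta_AIC <- aic_tab$AIC - min(aic_tab$AIC)  # 0 marks the best supported model
aic_tab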
There is another function called extractAIC
which
returns different values for AIC because it uses different additive
constants; however, the differences between the models are still the
same
AIC(main_mod)
## [1] 178.2068
extractAIC(main_mod)[2]
## [1] 62.69167
AIC(main_mod) - AIC(int_mod)
## [1] -1.995986
extractAIC(main_mod)[2] - extractAIC(int_mod)[2]
## [1] -1.995986
Model diagnostics
Now that we think we have a sufficient model (i.e., the main effect
model) we should carefully examine it to make sure it meets the
assumptions of OLS regression. This is particularly important if we are
going to use the p-values for rigorous hypothesis testing.
It is easy to examine the diagnostics of the model visually using
plot
par(mfrow = c(2,2))
plot(main_mod)
The top left and bottom left panels both suggest that the assumption
of homoscedasticity holds. The top right panel suggests
that the residuals are normally distributed. The bottom
right panel suggests that there are no large outliers that are having a
disproportionate effect on the regression.
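If you want a numeric companion to these visual checks, you can, for example, run a formal normality test on the residuals (an optional sketch; the plots above are usually more informative than any single test):
shapiro.test(residuals(main_mod))  # a non-significant p-value is consistent with normally distributed residuals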
Biological conclusion: Fruit mass increases due to fertilizer and
plant height
We have carried out model comparison and checked the model
assumptions, so we can now feel pretty confident about making some
biological conclusions.
Our main effect model received the most empirical support and it
suggests that fruit mass increased due to fertilizer and plant height
(see figure below). The effect of fertilizer was a little stronger than
the effect of height and together these two variables helped to explain
approximately 50% of the variance in fruit mass. We did not find that
the effect of plant height on fruit mass depended on whether the plant was
fertilized.
Above is the final best supported model (source code for this figure is
provided where the main effect model is first introduced above).
Automated model selection
The approach we have taken above, a careful examination of the
graphical patterns in the data set followed by careful and rational
model comparison, is not always possible when the models being compared
are quite large. In those cases automated model selection, such as
stepwise selection with stepAIC, is often used instead; however, this
approach does suffer from Freedman’s
paradox discussed at the top of the lesson.
# run a stepwise regression analysis on the full model.
stepAIC(int_mod)
## Start: AIC=64.69
## fruit_mass_mg ~ trt * plant_ht_cm
##
## Df Sum of Sq RSS AIC
## - trt:plant_ht_cm 1 0.016559 165.04 62.692
## <none> 165.02 64.688
##
## Step: AIC=62.69
## fruit_mass_mg ~ trt + plant_ht_cm
##
## Df Sum of Sq RSS AIC
## <none> 165.04 62.692
## - plant_ht_cm 1 69.644 234.68 74.774
## - trt 1 80.098 245.13 76.517
##
## Call:
## lm(formula = fruit_mass_mg ~ trt + plant_ht_cm, data = weeds)
##
## Coefficients:
## (Intercept) trtunfertilized plant_ht_cm
## 12.9586 -2.8822 0.2613
As you can see above this approach results in the same model that we
found to be the best - the main effects model. Do note that
stepAIC internally uses the function extractAIC
to compute the AIC values.
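As noted earlier in the lesson, you can also build up from simple to complex; stepAIC can search forward from the null model if you give it an upper scope (a sketch, assuming the scope formula below is the most complex model you are willing to entertain):
stepAIC(null_mod, scope = ~ trt * plant_ht_cm, direction = 'forward', trace = FALSE)
# this should also arrive at the main effects model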
# Understanding Regression via Simulation
R is an excellent environment for learning about how models work, in
part because of the ease of generating data with known properties. This
not only lets us check that a model is performing as
expected but also helps to indicate the strengths and weaknesses of various
model fitting and effect size measures.
#generate data for example
set.seed(10) # this is done so that the same results are generated each time
x1 <- runif(900)
x2 <- rbinom(900, 10, .5)
x3 <- rgamma(900, .1, .1)
#organize predictors in data frame
sim_data <- data.frame(x1, x2, x3)
#create noise b/c there is always error in real life
epsilon <- rnorm(900, 0, 3)
#generate response: additive model plus noise, intercept=0
sim_data$y <- 2*x1 + x2 + 3*x3 + epsilon
Above we have defined the response variable y as a linear
function of the three simulated independent variables (the
x variables). Epsilon refers to the error in the model, and
in line with OLS regression assumptions we have made it a normally
distributed variable centered at zero.
Now that we have simulated our data let’s examine how we build OLS
models in R and plot the result.
#First we will demonstrate the simplest possible model
#the intercept only model
mod <- lm(sim_data$y ~ 1)
mod
##
## Call:
## lm(formula = sim_data$y ~ 1)
##
## Coefficients:
## (Intercept)
## 8.581
summary(mod)
##
## Call:
## lm(formula = sim_data$y ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.743 -4.282 -1.764 1.581 88.528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5809 0.2863 29.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.59 on 899 degrees of freedom
#Note that the estimate for the intercept is equivalent to the mean of y
mean(sim_data$y)
## [1] 8.58092
#simple linear regression with x1 as predictor
mod1 <- lm(y ~ x1, data=sim_data)
#plot regression line and mean line
plot(y ~ x1, data=sim_data)
abline(h=mean(sim_data$y), col='pink', lwd=3)
abline(mod1, lty=2)
#simple linear regression with x3 as a predictor
mod3 <- lm(y ~ x3, data=sim_data)
#graph regression line and mean line
plot(y ~ x3, data=sim_data)
abline(mod3)
abline(h=mean(sim_data$y), col='pink', lwd=2)
legend('topleft', c('OLS fit', 'mean'), col=c('black', 'pink'), lty=1, lwd=c(1,2))
Now that we’ve built our two models let’s examine the model fits and
the statistical significance of the explanatory variables.
summary(mod1)
##
## Call:
## lm(formula = y ~ x1, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.334 -4.164 -1.744 1.511 88.076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0261 0.5752 13.954 <2e-16 ***
## x1 1.0967 0.9862 1.112 0.266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.589 on 898 degrees of freedom
## Multiple R-squared: 0.001375, Adjusted R-squared: 0.0002632
## F-statistic: 1.237 on 1 and 898 DF, p-value: 0.2664
We noticed that there was a large outlier in the previous plot. Let’s
run model diagnostics to see if we should consider dropping that
observation.
par(mfrow=c(2,2))
plot(mod1)
par(mfrow=c(1,1))
From the diagnostic plots you can see observations 26, 85, and 88 are
consistently identified as having abnormally large residual values
(Residuals vs Fitted plot), they cause the residuals to diverge from an
expectation under normality (Normal Q-Q plot), and lastly they exert too
much leverage (i.e., control on the slope of the regression, Residuals
vs Leverage plot).
Here is a case where, if this were an actual analysis, we would check to
make sure these values are not mistakes. Let’s drop these points and
examine the changes in the model statistics.
sim_data_sub <- sim_data[-c(26, 85, 88), ]
#verify that the three observations were removed
dim(sim_data)
## [1] 900 4
dim(sim_data_sub)
## [1] 897 4
#refit model to reduced data
mod3_sub <- lm(y ~ x3, data=sim_data_sub)
summary(mod3)
##
## Call:
## lm(formula = y ~ x3, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1071 -2.2041 -0.0316 2.2882 12.1010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.94151 0.12502 47.52 <2e-16 ***
## x3 2.97661 0.04511 65.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.554 on 898 degrees of freedom
## Multiple R-squared: 0.8291, Adjusted R-squared: 0.8289
## F-statistic: 4355 on 1 and 898 DF, p-value: < 2.2e-16
summary(mod3_sub)
##
## Call:
## lm(formula = y ~ x3, data = sim_data_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1068 -2.2200 -0.0314 2.2866 12.1012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.94127 0.12534 47.40 <2e-16 ***
## x3 2.97747 0.04517 65.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.557 on 895 degrees of freedom
## Multiple R-squared: 0.8292, Adjusted R-squared: 0.829
## F-statistic: 4345 on 1 and 895 DF, p-value: < 2.2e-16
So it appears that \(R^2\) is highly
sensitive to outliers but the \(\beta\)
coefficients are more robust.
Exercise
Create a single plot that displays the model of y given x3 before and
after the outliers were removed. How much do they visually differ from
one another? Modify the arguments to abline(), including
lty and lwd.
[Show answer]
plot(y ~ x3, data=sim_data)
points(y ~ x3, data=sim_data_sub, col='dodgerblue', pch=19)
abline(mod3)
abline(mod3_sub, col='dodgerblue', lwd=2)
legend('topleft', c('fit with all data', 'fit w/o outliers'),
col=c('black', 'dodgerblue'), pch=c(1, 19), lty=1,
lwd=c(1,2), bty='n')
Multiple regression
So far we have only examined models with a single variable but by
design we know that y is influenced by three variables. Let’s include
all the relevant explanatory variables in one model now.
mod_main <- lm(y ~ x1 + x2 + x3, data=sim_data)
summary(mod_main)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5652 -1.9274 -0.0122 1.9385 11.1587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12607 0.38847 -0.325 0.746
## x1 2.03852 0.35835 5.689 1.73e-08 ***
## x2 1.01568 0.06629 15.323 < 2e-16 ***
## x3 2.99037 0.03961 75.500 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.118 on 896 degrees of freedom
## Multiple R-squared: 0.8687, Adjusted R-squared: 0.8682
## F-statistic: 1976 on 3 and 896 DF, p-value: < 2.2e-16
coefficients(mod_main)
## (Intercept) x1 x2 x3
## -0.126067 2.038523 1.015678 2.990369
Notice that in the output above the coefficient estimates are close
to the values we set when we created the variable y.
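As a final check (a minimal sketch), the 95% confidence intervals from the fitted model should bracket the true values used to simulate y (an intercept of 0 and slopes of 2, 1, and 3):
confint(mod_main)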