When working with multiple regression models we often spend a lot of time examining tabular summaries and not enough time examining graphical fits of the models to the data. Partial residual plots help us visualize the fit of each independent variable in a multiple regression model after controlling for the other variables. Mathematically, partial residuals are defined as:
\(\text{Residuals} + \hat{\beta}_i X_i \text{ versus } X_i,\)
where
\(\text{Residuals}\) = the residuals from the full model,
\(\hat{\beta}_i\) = the regression coefficient of the i-th independent variable in the full model,
\(X_i\) = the i-th independent variable.
We will use the function termplot to examine partial residual plots in R for lm models, but the approach also works for glm models.
Before we go further, we should mention two caveats:
This approach is not well suited to models that contain interaction effects, because examining the partial effect of a single term that is also involved in an interaction does not make sense.
This approach can also be misleading if there is strong multicollinearity, in which the independent variables are highly correlated with each other. In this case, the variance indicated by the partial residual plot can be much less than the actual variance (see the sketch just below).
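To make the multicollinearity caveat concrete, here is a minimal sketch using hypothetical variables (x_a, x_b, and y_c are ours, not part of the example that follows). The two predictors are nearly collinear, so their coefficient standard errors are large, yet the partial residual plots spread the points over the full range of each predictor, which can make the slopes look better determined than they really are:
set.seed(2)
x_a <- rnorm(100)
x_b <- x_a + rnorm(100, 0, 0.1)  # x_b is nearly collinear with x_a
y_c <- x_a + x_b + rnorm(100, 0, 0.3)
mod_c <- lm(y_c ~ x_a + x_b)
summary(mod_c)$coefficients      # note the large standard errors on x_a and x_b
par(mfrow = c(1, 2))
termplot(mod_c, partial.resid = TRUE, se = TRUE)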
We will use a simple simulated example with 3 independent variables.
set.seed(1)
x1 <- rnorm(100) # continuous variable 1
x2 <- rnorm(100) # continuous variable 2
x3 <- as.factor(rep(c('cnt', 'trt'), 50)) # categorical variable
y <- .5 * x1^2 + .75 * x2 + ifelse(x3 == 'cnt', -0.5, 0.5) + rnorm(100, 0, 0.1) # response: quadratic in x1, linear in x2, offset by x3
We can visually examine the relationship of each simulated variable to y prior to modeling:
par(mfrow=c(1,3))
plot(y ~ x1)
plot(y ~ x2)
plot(y ~ x3)
Even though in this simulated case we know the ‘true’ model because we designed it, visually examining the relationship between y and each variable individually suggests that x1 has no relationship, x2 has a positive relationship, and for x3 the trt samples have higher values of y.
Let’s carry out multiple regression modeling to see if we can uncover more information:
mod <- lm(y ~ x1 + x2 + x3)
summary(mod)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7435 -0.3487 -0.1985 0.1948 2.4465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12350 0.08271 -1.493 0.139
## x1 0.08076 0.06492 1.244 0.217
## x2 0.66073 0.06060 10.904 < 2e-16 ***
## x3trt 1.03916 0.11606 8.954 2.63e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5775 on 96 degrees of freedom
## Multiple R-squared: 0.678, Adjusted R-squared: 0.6679
## F-statistic: 67.37 on 3 and 96 DF, p-value: < 2.2e-16
OK, we fit the model, and the tabular output seems to agree with our initial graphical exploration prior to model fitting. Let’s now examine the partial residual plots for this example:
par(mfrow = c(1, 3))
termplot(mod, partial.resid = TRUE, se = TRUE, smooth = panel.smooth,
         pch = 19, cex = 0.75,
         col.res = 'black', col.term = 'dodgerblue', lwd.term = 2,
         col.se = 'dodgerblue', lwd.se = 2, lty.se = 3,
         col.smth = 'red', lty.smth = 2)
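To connect these panels back to the definition at the top of the lesson, we can compute the partial residuals for x1 by hand (a quick sketch using the mod object fit above; note that termplot centers each term, so its y-axis may differ from this version by a constant shift):
# partial residuals for x1: full-model residuals plus x1's fitted contribution
pres_x1 <- residuals(mod) + coef(mod)['x1'] * x1
plot(x1, pres_x1, pch = 19, cex = 0.75)
# the best-fit line through the partial residuals has slope coef(mod)['x1'],
# because the full-model residuals are orthogonal to x1
abline(lm(pres_x1 ~ x1), col = 'dodgerblue', lwd = 2)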
First let’s get oriented on these graphics. The y-axis is the partial residual of the response variable y with respect to each independent variable. There are several sets of lines on the graphs. The light blue lines are the linear model fits with their associated standard errors (there is more uncertainty at the ends of the best-fit line). The graphs also include a localized smoother or loess curve (in red), which follows the data as closely as possible.
In the first panel we are examining the relationship y ~ x1 | x2 + x3, where the | stands for given, so this formula asks: what is the strength of the relationship of x1 to y, given (i.e., after controlling for) x2 and x3? In other words:
lm(residuals(lm(y ~ x2 + x3)) ~ x1)
##
## Call:
## lm(formula = residuals(lm(y ~ x2 + x3)) ~ x1)
##
## Coefficients:
## (Intercept) x1
## -0.008711 0.079998
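Note that the slope here (about 0.0800) closely matches the x1 coefficient in the full model (0.0808 in the summary(mod) output above). They are not exactly equal because x1 is slightly correlated with x2 and x3 in the sample; they would match exactly if x1 were also residualized against the other predictors, as in an added-variable plot. A quick check:
# compare the partial-regression slope to the full-model coefficient
coef(lm(residuals(lm(y ~ x2 + x3)) ~ x1))['x1']  # ~0.0800
coef(mod)['x1']                                  # ~0.0808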
Note that the leftmost panel above indicates something really useful to us. It shows clearly that the regression model (light blue line) has the wrong functional form. It isn’t that x1 has no relationship to y; it is just that the relationship is not linear. The loess smoother (red line) shows this clearly.
The other two panels show the partial effects of x2 and x3 on y. These panels look good and indicate that our linear model assumptions for the relationships between y and x2 and between y and x3 seem pretty reasonable.
Let’s re-specify our model and re-examine the partial residual plots:
mod2 <- lm(y ~ I(x1^2) + x2 + x3)
summary(mod2)
##
## Call:
## lm(formula = y ~ I(x1^2) + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.304170 -0.047998 0.003126 0.072060 0.254556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.50274 0.01654 -30.40 <2e-16 ***
## I(x1^2) 0.49776 0.00928 53.64 <2e-16 ***
## x2 0.74416 0.01109 67.12 <2e-16 ***
## x3trt 1.01460 0.02092 48.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1046 on 96 degrees of freedom
## Multiple R-squared: 0.9894, Adjusted R-squared: 0.9891
## F-statistic: 2997 on 3 and 96 DF, p-value: < 2.2e-16
par(mfrow = c(1, 3))
termplot(mod2, partial.resid = TRUE, se = TRUE, smooth = panel.smooth,
         pch = 19, cex = 0.75,
         col.res = 'black', col.term = 'dodgerblue', lwd.term = 2,
         col.se = 'dodgerblue', lwd.se = 2, lty.se = 3,
         col.smth = 'red', lty.smth = 2)
Now that the model is properly specified, the model fits look better and have lower uncertainty associated with them.
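As a numerical complement to the graphical comparison, we can also check the two specifications against each other; given the jump in R-squared from 0.678 to 0.989, AIC should strongly favor the quadratic model (a quick check using the mod and mod2 objects fit above):
# lower AIC indicates a better fit after penalizing model complexity
AIC(mod, mod2)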