Consider that you have two variables, x and y:
x <- 1:100
# y is a quadratic function of x plus Gaussian noise (sd = 30)
y <- 20 * x - 0.2 * x^2 + rnorm(100, 0, 30)
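Because rnorm() draws random numbers, the exact values of y (and the fitted coefficients below) will differ slightly between runs; calling set.seed() before simulating makes the example reproducible. The seed value here is arbitrary:
set.seed(42)  # any fixed value works; run this before simulating y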
You are interested in whether x explains variation in y. How will you approach this?
Let's examine the outcome, and the inference we would draw, if we do not visually inspect our data and instead rely only on regression modeling.
mod_lin <- lm(y ~ x)
summary(mod_lin)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -378.48 -108.54   32.55  126.13  180.80
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 344.7734    29.8143  11.564   <2e-16 ***
## x            -0.2913     0.5126  -0.568    0.571
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 148 on 98 degrees of freedom
## Multiple R-squared:  0.003285, Adjusted R-squared:  -0.006886
## F-statistic: 0.3229 on 1 and 98 DF,  p-value: 0.5711
The above model summary table would lead us to believe that there is no relationship between x and y, which we know is false because we created the variables ourselves. The model fails because it systematically misrepresents the functional form of the relationship between x and y, which we defined to be quadratic. This would have been obvious had we first graphed y against x:
plot(y ~ x)
This simple step indicates that y is not simply a linear function of x but a quadratic one, so a more appropriate model is:
mod_quad <- lm(y ~ x + I(x^2))
summary(mod_quad)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -66.837 -22.635   3.558  17.125  67.744
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.85923    8.94040    1.55    0.124
## x           19.17426    0.40860   46.93   <2e-16 ***
## I(x^2)      -0.19273    0.00392  -49.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.21 on 97 degrees of freedom
## Multiple R-squared:  0.9616, Adjusted R-squared:  0.9608
## F-statistic:  1213 on 2 and 97 DF,  p-value: < 2.2e-16
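Because the linear model is nested within the quadratic one, we can also compare the two fits formally rather than just eyeballing the R-squared values. A minimal sketch using base R:
# nested-model F test: does adding the quadratic term improve the fit?
anova(mod_lin, mod_quad)
# information criteria tell the same story (lower is better)
AIC(mod_lin, mod_quad)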
To further articulate why the regression-only approach failed, we should overlay the models on the data. Unfortunately, this critical step is often overlooked.
# overlay the fitted values from both models on the raw data
plot(y ~ x)
lines(x, predict(mod_lin), col='red')
lines(x, predict(mod_quad), col='blue')
legend('bottom', c('linear', 'quadratic'), col=c('red', 'blue'), lty=1, bty='n')
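Residual diagnostics make the same point from a different angle: the linear fit's residuals arch systematically with the fitted values, while the quadratic fit's residuals scatter without pattern. A minimal sketch using base R's plot method for lm objects:
par(mfrow = c(1, 2))       # two panels side by side
plot(mod_lin, which = 1)   # residuals vs fitted: clear arch
plot(mod_quad, which = 1)  # residuals vs fitted: no pattern
par(mfrow = c(1, 1))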
Here we examined just two variables, so it may seem obvious that graphing them is wise. But even in multivariate settings, where graphical exploration is more tedious, it can be very helpful and illuminating prior to model fitting; for example, a scatterplot matrix summarizes all pairwise relationships at once.
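As a sketch, using a hypothetical data frame dat with three predictors (not part of the example above):
# hypothetical multivariate data: y depends nonlinearly on x2
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- dat$x1 + dat$x2^2 + rnorm(100)
pairs(dat)  # scatterplot matrix of all pairwise relationships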