Simple example

Suppose you have two variables, x and y.

x <- 1:100
# y is a quadratic function of x plus Gaussian noise
# (no seed is set, so the exact numbers in the output below will vary from run to run)
y <- 20 * x - 0.2 * x^2 + rnorm(100, 0, 30)

You are interested in understanding whether x explains variation in y. How would you approach this?

Should you build a linear model right away, or plot the variables first and explore them visually?

Let's examine the outcome, and the inference we would draw, if we skip the visual check and rely only on regression modeling.

mod_lin <- lm(y ~ x)
summary(mod_lin)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -378.48 -108.54   32.55  126.13  180.80 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 344.7734    29.8143  11.564   <2e-16 ***
## x            -0.2913     0.5126  -0.568    0.571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 148 on 98 degrees of freedom
## Multiple R-squared:  0.003285,   Adjusted R-squared:  -0.006886 
## F-statistic: 0.3229 on 1 and 98 DF,  p-value: 0.5711

The above model summary table would lead us to believe that there is no relationship between x and y, which we know is false because we created the variables ourselves.
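Even without plotting the raw data, a residual plot from this fit would have flagged the problem; the systematic arch in the residuals is the signature of a missed quadratic term. A minimal sketch using base graphics:

plot(fitted(mod_lin), resid(mod_lin),
     xlab = 'Fitted values', ylab = 'Residuals')  # residuals vs fitted values
abline(h = 0, lty = 2)                            # reference line at zero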

Why is our regression table leading us to an incorrect inference?

Because our model systematically misrepresents the functional form of the relationship between x and y, which we defined above to be quadratic. This would have been obvious if we had first graphed y against x:

plot(y ~ x)

This simple step shows that y is not just a linear function of x but a quadratic one, so a more appropriate model is:

mod_quad <- lm(y ~ x + I(x^2))
summary(mod_quad)
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.837 -22.635   3.558  17.125  67.744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.85923    8.94040    1.55    0.124    
## x           19.17426    0.40860   46.93   <2e-16 ***
## I(x^2)      -0.19273    0.00392  -49.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.21 on 97 degrees of freedom
## Multiple R-squared:  0.9616, Adjusted R-squared:  0.9608 
## F-statistic:  1213 on 2 and 97 DF,  p-value: < 2.2e-16
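If you also want a numerical comparison of the two fits alongside the visual one, base R provides a nested-model F-test and information criteria (a quick sketch; your exact numbers will differ because the data are random):

anova(mod_lin, mod_quad)   # F-test: does adding the quadratic term improve the fit?
AIC(mod_lin, mod_quad)     # lower AIC indicates the better-supported model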

To further illustrate why the regression-only approach failed, we should overlay the fitted models on the data. Unfortunately, this critical step is often overlooked.

plot(y ~ x)
lines(x, predict(mod_lin), col = 'red')    # linear fit
lines(x, predict(mod_quad), col = 'blue')  # quadratic fit
legend('bottom', c('linear', 'quadratic'), col = c('red', 'blue'), lty = 1, bty = 'n')

Here we looked at just two variables, so it may seem obvious that graphing them is wise, but even in multivariate scenarios, where graphical exploration is more tedious, it can still be very helpful and illuminating prior to model fitting.
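For instance, with a data frame holding a response and several predictors, a scatterplot matrix gives a quick first look at every pairwise relationship. A minimal sketch (the data frame dat and its variables are made up purely for illustration):

dat <- data.frame(x1 = runif(100), x2 = rnorm(100), x3 = rpois(100, 5))
dat$y <- 2 * dat$x1 - dat$x2^2 + rnorm(100)   # hypothetical response
pairs(dat)                                    # all pairwise scatterplots at once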