Consider two variables, x and y:
x <- 1:100
y <-  20 * x - 0.2 * x^2 + rnorm(100, 0, 30)
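Note that y includes random noise from rnorm(), so if you run this code yourself your numbers will differ slightly from the output shown below. If you want a reproducible simulation, one option (a minimal sketch; the seed value is an arbitrary choice, not the one behind the output in this section) is to fix the random seed first:
# set an arbitrary seed so the simulated data are reproducible
set.seed(1)
x <- 1:100
y <- 20 * x - 0.2 * x^2 + rnorm(100, 0, 30)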
You are interested in understanding whether x explains variation in y. How would you approach this?
Let's examine the outcome, and the inference we would draw, if we do not visually examine our data and rely only on regression modeling.
mod_lin <- lm(y ~ x)
summary(mod_lin)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -378.48 -108.54   32.55  126.13  180.80 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 344.7734    29.8143  11.564   <2e-16 ***
## x            -0.2913     0.5126  -0.568    0.571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 148 on 98 degrees of freedom
## Multiple R-squared:  0.003285,   Adjusted R-squared:  -0.006886 
## F-statistic: 0.3229 on 1 and 98 DF,  p-value: 0.5711
The above model summary would lead us to believe that there is no relationship between x and y, which we know is false because we created the variables ourselves. The problem is that our model systematically misrepresents the functional form of the relationship between x and y, which we defined to be quadratic. This would have been obvious had we first graphed y against x:
plot(y ~ x)
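Even if we had skipped plotting the raw data, the built-in residual diagnostics for the linear model would flag the problem: with a quadratic signal fit by a straight line, the residuals-versus-fitted plot should show a clear arch rather than a random scatter. A minimal check, using base R's plot method for lm objects:
# residuals vs fitted values for the linear model;
# a curved pattern indicates a misspecified functional form
plot(mod_lin, which = 1)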
The plot of the raw data makes it clear that y is not simply a linear function of x but a quadratic one, so a more appropriate model is:
mod_quad <- lm(y ~ x + I(x^2))
summary(mod_quad)
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.837 -22.635   3.558  17.125  67.744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.85923    8.94040    1.55    0.124    
## x           19.17426    0.40860   46.93   <2e-16 ***
## I(x^2)      -0.19273    0.00392  -49.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.21 on 97 degrees of freedom
## Multiple R-squared:  0.9616, Adjusted R-squared:  0.9608 
## F-statistic:  1213 on 2 and 97 DF,  p-value: < 2.2e-16
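If we want a formal comparison to go along with the summaries, the two models are nested, so we can test whether the quadratic term improves the fit (the exact statistics will depend on your simulated draw):
# F-test comparing the nested linear and quadratic models
anova(mod_lin, mod_quad)
# information-criterion view of the same comparison
AIC(mod_lin, mod_quad)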
To further illustrate why the regression-only approach failed, we can overlay both fitted models on the data. Unfortunately, this critical step is often overlooked.
plot(y ~ x)
lines(x, predict(mod_lin), col='red')
lines(x, predict(mod_quad), col='blue')
legend('bottom', c('linear', 'quadratic'), col=c('red', 'blue'), lty=1, bty='n')
Here we looked at just two variables, so it may seem obvious that graphing them is wise, but even in multivariate settings, where graphical exploration can be more tedious, it is still very helpful and illuminating to explore the data visually before fitting models.
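As a sketch of what that exploration might look like with several predictors (the data frame and variables here are hypothetical, purely for illustration), a scatterplot matrix is often a useful first pass:
# hypothetical multivariate example: inspect pairwise relationships
# with a scatterplot matrix before fitting any model
dat <- data.frame(x1 = rnorm(100),
                  x2 = runif(100, 0, 10),
                  x3 = rpois(100, 5))
dat$y <- 2 * dat$x1 + 0.5 * dat$x2^2 + rnorm(100, 0, 2)
pairs(dat, panel = panel.smooth)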