Standardized \(\beta\) coefficients

Source Code Link

https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/lessons/standardized_beta_coefficients.Rmd

This mini-lesson introduces the concept of standardized regression coefficients in R. A standardized regression coefficient is simply the \(\beta\) estimate from a regression on standardized variables. A standardized variable is one that has been rescaled to have a mean of 0 and a standard deviation of 1.
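As a minimal sketch of what standardizing means, the snippet below computes z-scores by hand and compares them with the output of scale(). The vector `x` and its seed are arbitrary illustrations and are not part of the lesson's data.

```r
# Minimal sketch: standardize a vector by hand and compare with scale()
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- rnorm(50, mean = 10, sd = 4)    # illustrative data, not the lesson's

# z-score: subtract the mean, then divide by the sample standard deviation
x_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops its attributes
x_scaled <- as.numeric(scale(x))

all.equal(x_manual, x_scaled)   # the two approaches agree
mean(x_manual)                  # zero, up to floating point error
sd(x_manual)                    # one
```

Note that scale() uses the sample standard deviation (denominator n - 1), the same as sd().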

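One reason for standardizing variables is that the \(\beta\) estimates become unitless and can be compared directly across predictors. In a simple regression the standardized \(\beta\) equals the Pearson correlation between the predictor and the response. In a multiple regression it equals that correlation only when the predictors are uncorrelated with one another; otherwise it measures the association with the response after accounting for the other predictors. Below is a demo of this.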

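## We will use this function to plot the data and correlations 
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor = 3, ...)
{
    ## save the current plot coordinates and restore them (by name) on exit
    usr <- par("usr"); on.exit(par(usr = usr))
    par(usr = c(0, 1, 0, 1))
    ## absolute Pearson correlation, formatted to `digits` significant digits
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    ## if no text size is supplied, scale the text to fit the panel
    if(missing(cex.cor)) 
        cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor)
}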

Next we simulate some data for running models. To provide a clear demonstration, the explanatory variables are drawn as independent normal variates.

set.seed(10)
n = 90
x1 = rnorm(n)
x2 = rnorm(n)
x3 = rnorm(n)

#create noise b/c there is always error in real life
epsilon = rnorm(n, 0, 3)
#generate response: additive model plus noise, intercept=0
y = 2*x1 + x2 + 3*x3 + epsilon
#organize predictors in data frame
sim_data = data.frame(y, x1, x2, x3)

Before standardizing variables it is worthwhile to highlight the relationship between correlation and regression statistics. Specifically, the t-statistic for a simple correlation coefficient is exactly what is reported for the \(\beta_1\) coefficient in a simple regression model.

cor.test(sim_data$y, sim_data$x1)$statistic

##       t 
## 3.28821

summary(lm(y ~ x1, data=sim_data))$coef

##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 0.5675109  0.5015436 1.131529 0.260906812
## x1          1.7411233  0.5295049 3.288210 0.001450304
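The agreement between the two t-statistics follows from the identity \(t = r\sqrt{n-2}/\sqrt{1-r^2}\). The sketch below checks this on a small simulated example; the response here is a hypothetical one-predictor model, not the lesson's `y`.

```r
# Sketch: the t-statistic of a correlation equals the slope's t-statistic
set.seed(10)
n  <- 90
x1 <- rnorm(n)
y  <- 2 * x1 + rnorm(n, 0, 3)    # hypothetical response for illustration

r <- cor(y, x1)

# t-statistic computed directly from the correlation coefficient
t_from_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

# the same quantity as reported by cor.test() and by lm()
t_from_cor_test <- unname(cor.test(y, x1)$statistic)
t_from_lm       <- coef(summary(lm(y ~ x1)))["x1", "t value"]

c(t_from_r, t_from_cor_test, t_from_lm)   # all three values coincide
```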

The \(\beta\) coefficient reported by the regression is not equal to the correlation coefficient, though, because \(\beta\) is expressed in units of \(y\) per unit of \(x_1\) (i.e., it has not been standardized). Now let’s use the function scale() to standardize the independent and dependent variables.

sim_data_std = data.frame(scale(sim_data))

mod = lm(y  ~ x1 + x2 + x3, data=sim_data)
mod_std = lm(y  ~ x1 + x2 + x3, data=sim_data_std)
round(summary(mod)$coef, 3)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.089      0.320   3.400    0.001
## x1             2.071      0.336   6.161    0.000
## x2             1.089      0.327   3.327    0.001
## x3             3.580      0.323  11.076    0.000

round(summary(mod_std)$coef, 3)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.000      0.063   0.000    1.000
## x1             0.393      0.064   6.161    0.000
## x2             0.212      0.064   3.327    0.001
## x3             0.707      0.064  11.076    0.000
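The two sets of estimates are linked by a simple rescaling: each standardized coefficient is the raw coefficient multiplied by sd(x) / sd(y). The sketch below repeats the simulation with the same seed and call order so that it runs on its own, then recovers the standardized estimates without refitting. The names `preds`, `sd_x`, and `beta_rescaled` are introduced here for illustration only.

```r
# Sketch: recover standardized coefficients from the raw fit
set.seed(10)
n  <- 90
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
epsilon <- rnorm(n, 0, 3)
y <- 2 * x1 + x2 + 3 * x3 + epsilon
sim_data <- data.frame(y, x1, x2, x3)

mod     <- lm(y ~ x1 + x2 + x3, data = sim_data)
mod_std <- lm(y ~ x1 + x2 + x3, data = data.frame(scale(sim_data)))

preds <- c("x1", "x2", "x3")
sd_x  <- sapply(sim_data[preds], sd)   # standard deviation of each predictor

# beta_std = beta_raw * sd(x) / sd(y)
beta_rescaled <- coef(mod)[preds] * sd_x / sd(sim_data$y)

cbind(rescaled = beta_rescaled, refit = coef(mod_std)[preds])
```

The two columns should match, which shows that standardizing changes only the units of the coefficients and not the fitted relationship.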

cor(sim_data$y, sim_data$x1)

## [1] 0.3307912

cor(sim_data$y, sim_data$x2)

## [1] 0.2037518

cor(sim_data$y, sim_data$x3)

## [1] 0.6772098

Notice above that the t-statistics, and consequently the p-values, are the same in mod and mod_std (with the exception of the intercept term, which is always 0 in a regression of standardized variables). This is because the t-statistic is scale invariant: it is the ratio of an estimate to its standard error, and rescaling a variable changes both by the same factor, so the ratio is unchanged.
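The invariance of the slope's t-statistic can be checked with any change of units, not only standardization. In the sketch below a hypothetical predictor is multiplied by 100 (for example, meters to centimeters); the variable names and the simulated response are assumptions made for illustration.

```r
# Sketch: rescaling a predictor changes the estimate and its standard
# error by the same factor, so the t-statistic is unchanged
set.seed(10)
n  <- 90
x1 <- rnorm(n)
y  <- 2 * x1 + rnorm(n, 0, 3)   # hypothetical response for illustration

x1_cm <- x1 * 100               # same predictor expressed in smaller units

fit_a <- coef(summary(lm(y ~ x1)))
fit_b <- coef(summary(lm(y ~ x1_cm)))

# both ratios equal the rescaling factor of 100
est_ratio <- fit_a["x1", "Estimate"]   / fit_b["x1_cm", "Estimate"]
se_ratio  <- fit_a["x1", "Std. Error"] / fit_b["x1_cm", "Std. Error"]
c(est_ratio, se_ratio)

# the t-statistics are identical
c(fit_a["x1", "t value"], fit_b["x1_cm", "t value"])
```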

Additionally, notice that the individual correlation coefficients are very similar to the \(\beta\) estimates in mod_std. Why are these not exactly the same? Here’s a hint: what would happen if there was strong multicollinearity between the explanatory variables?
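One way to explore the hint is to simulate predictors that are deliberately correlated. The sketch below is a hypothetical two-predictor example (the seed, sample size, and the 0.9 correlation are arbitrary choices, not from the lesson). It also uses the fact that, for standardized variables, the least-squares coefficients solve \(R_{xx}\beta = r_{xy}\), where \(R_{xx}\) is the correlation matrix of the predictors and \(r_{xy}\) holds their correlations with the response.

```r
# Sketch: with correlated predictors, standardized betas differ from
# the marginal correlations with the response
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)   # correlation with x1 near 0.9
y  <- 2 * x1 + x2 + rnorm(n)

d <- data.frame(scale(data.frame(y, x1, x2)))

beta_std <- coef(lm(y ~ x1 + x2, data = d))[c("x1", "x2")]
r_marg   <- c(x1 = cor(d$y, d$x1), x2 = cor(d$y, d$x2))

rbind(beta_std, r_marg)   # the two rows are clearly different

# the betas are the correlations adjusted for predictor intercorrelation
R_xx <- cor(d[c("x1", "x2")])
beta_from_cor <- solve(R_xx, r_marg)
beta_from_cor             # matches beta_std
```

When the predictors are uncorrelated, \(R_{xx}\) is the identity matrix and the standardized betas reduce to the marginal correlations. In the lesson's data the predictors are only approximately uncorrelated in the sample, which is consistent with the small differences seen above.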

Let’s plot the variables against one another and also display their individual Pearson correlation coefficients to get a visual perspective on the problem.

pairs(sim_data, lower.panel = panel.cor, upper.panel = panel.smooth)

Home Page - http://dmcglinn.github.io/quant_methods/ GitHub Repo - https://github.com/dmcglinn/quant_methods
