- Previous topics
- Head-to-head comparison:
- One-sample t-test vs. linear regression
- Paired sample t-test vs. linear regression
- Unpaired sample t-test vs. linear regression
- One-way ANOVA vs. linear regression
- Two-way ANOVA vs. linear regression
- One-way ANCOVA vs. linear regression
- Any-way MANOVA vs. linear regression
- Any-way MANCOVA vs. linear regression

- Do we need both? What are the pros?
- Conclusion: which one is better?
- What’s next
- Further readings and references

This post was inspired by this article.

## Previous topics

The parametric statistical tests (*one-* and *two-sample* t-tests, *ANOVA*, *ANCOVA*, *MANOVA* and *MANCOVA*) all look for significant **differences between means** of groups. Since linear regression can find such differences too, here we'll investigate whether there are **differences in the results** of tests vs. linear regression, and which approach is better.

## Head-to-head comparison:

```
# load the packages used throughout this post
library(tidyverse)  # data manipulation (%>%, select, gather) and ggplot2
library(ISLR)       # the Wage dataset
library(car)
library(multcomp)
library(jmv)        # mancova()
library(broom)      # tidy() and glance()
library(knitr)      # kable()
```

### One-sample t-test vs. linear regression

Since the whole family of statistical tests, from a simple t-test to *MANCOVA*, is wrapped around the *linear model* concept in *R*, linear regression is able to reproduce the results of even the simplest statistical tests.

Run the first two lines of code below and see the output of both for yourself, if you wish. The rest of the code simply packs the results into a nice-looking table: (1) `tidy(object, conf.int = T)` from the `broom` package is designed to produce tidy model outputs, (2) `kable` from the `knitr` package makes the table look publication-ready, (3) `%>%` and `select` are data-manipulation functions from the `tidyverse` package family, and (4) `rbind` stacks rows beneath each other, while `cbind` puts columns next to each other. Since `select` is defined by several packages, we need to tell *R* to use `select` from the `dplyr` package.

```
# statistical test (st)
st = t.test(mtcars$mpg)
# equivalent intercept-only linear regression (lr)
lr = lm(mtcars$mpg ~ 1)
# nice looking table
kable(
  rbind(
    tidy(st)[c(1:3, 5:7)],
    cbind(tidy(lr, conf.int = T)[c(2, 4:7)], method = "Simple Linear Model")
  ) %>%
    dplyr::select(method, everything())
)
```

method | estimate | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|
One Sample t-test | 20.09062 | 18.85693 | 0 | 17.91768 | 22.26357 |
Simple Linear Model | 20.09062 | 18.85693 | 0 | 17.91768 | 22.26357 |

You see: the results are identical!
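
Why does this work? An intercept-only model `y ~ 1` estimates nothing but the overall mean and tests it against zero, which is exactly what a one-sample t-test does. A quick check, reusing `mtcars`:

```
# the intercept of the intercept-only model is simply the sample mean
coef(lm(mtcars$mpg ~ 1)) # (Intercept) = 20.09062
mean(mtcars$mpg)         # 20.09062
```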

### Paired sample t-test vs. linear regression

To make the comparison less abstract, let’s make up (simulate) and visualise two samples.

```
# stabilize randomness for reproducibility
set.seed(1)
# tibble is simply a new table
data <- tibble(
  a = rnorm(n = 10, mean = 0, sd = 9),
  b = rnorm(n = 10, mean = 5, sd = 7)
)
# ggplot is also included in the tidyverse package-family
ggplot(data %>% gather(), aes(x = factor(key), y = value)) +
  geom_boxplot() +
  geom_dotplot(binaxis = 'y', stackdir = 'center', binwidth = 1, alpha = .3) +
  xlab("") +
  theme_minimal()
```

Since a *paired t-test* calculates the difference between the samples and then compares this difference to zero (making it a *one-sample t-test*, where the one sample is our difference), in the regression model we simply subtract one sample from the other and fit an intercept-only model.
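
As a quick sanity check of that claim (reusing the simulated `data` from above), a *one-sample t-test* on the differences reproduces the paired test exactly:

```
# the paired t-test is literally a one-sample t-test on the differences
t.test(data$a - data$b) # same estimate (-5.55209), t (-1.4579) and p (0.1789) as below
```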

```
st = t.test(data$a, data$b, paired = TRUE)
lr = lm(data$a - data$b ~ 1)
kable(
  rbind(
    tidy(st)[c(1:3, 5:7)],
    cbind(tidy(lr, conf.int = T)[c(2, 4:7)], method = "Simple Linear Model")
  ) %>%
    dplyr::select(method, everything())
)
```

method | estimate | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|
Paired t-test | -5.55209 | -1.457902 | 0.1788606 | -14.167 | 3.062823 |
Simple Linear Model | -5.55209 | -1.457902 | 0.1788606 | -14.167 | 3.062823 |

Here, too, all results are identical. Thus, the `mean of the differences` in the paired `t.test()` is indeed the same as the `(Intercept)` in the linear regression.

### Unpaired sample t-test vs. linear regression

Since statistical tests compare the means of groups (samples), let's reshape our `data` so that the groups are in one column and the values of our samples are in a second column. The `gather()` function from the `tidyr` package makes this easy to do. Reshaping the data will help us understand the tilde operator `~`, which is often used in tests like *ANOVA* and always used in linear models to define a **model formula**. The fact that `~` can be used within statistical tests also indicates the affinity of statistical tests with models. This affinity is the reason you often see "ANOVA model" in the literature instead of "ANOVA test".

`paired = FALSE` in the code below is actually unnecessary, since `t.test()` uses it by default, but it underlines our intention to run an *unpaired t-test* here.

```
data2 <- data %>% gather() %>% rename(group = key, value = value)
st = t.test(value ~ group, data = data2, paired = FALSE, var.equal = TRUE)
lr = lm(value ~ group, data = data2)
kable( tidy(st)[c(1:4, 6:8)] )
```

estimate1 | estimate2 | statistic | p.value | conf.low | conf.high | method |
---|---|---|---|---|---|---|
1.189825 | 6.741915 | -1.710128 | 0.1044233 | -12.37293 | 1.26875 | Two Sample t-test |

`kable( tidy(lr, conf.int = T) )`

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.189825 | 2.295688 | 0.5182868 | 0.6105707 | -3.633237 | 6.012887 |
groupb | 5.552090 | 3.246594 | 1.7101277 | 0.1044233 | -1.268750 | 12.372930 |

Here we can see that the mean of the first sample "a", `mean(data$a)` = 1.189825, which is `estimate1` in the t-test, is identical to the `(Intercept)` of the linear model. The second estimate of the t-test (6.741915) is identical to the sum of the estimates in the linear model (1.189825 + 5.552090 = 6.741915). This makes sense, because the `(Intercept)` is always the mean of the first (reference) sample, while the estimate for every other group is the transition from the reference sample to that particular group (e.g. `groupb`). So it is the same thing, displayed a bit differently. But why? Well, while the t-test shows you two means, the transition from one mean to the other in the linear model gives you the slope, the rate of change of the second group compared to the first, which will become very useful once we start adding more variables to our models.
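
A quick numerical check of that statement, reusing the `data` and `lr` objects defined above:

```
coef(lr)                    # (Intercept) = mean of group "a", groupb = transition to "b"
mean(data$a)                # 1.189825, identical to the (Intercept)
mean(data$b) - mean(data$a) # 5.552090, identical to the groupb estimate
```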

The t-test indicates that the groups are not significantly different (p = 0.104). Switching from sample "a" to sample "b" in the linear model (see the second row of the model output) has the identical p-value = 0.104. The t-statistic and confidence intervals in the **second row** are also identical, but with the opposite sign. This is simply because the t-test reports the difference as mean of "a" minus mean of "b", while the regression coefficient goes the other way: it describes the change from the reference group "a" to group "b". But what is the **first row** about? Interestingly, the first row of the model output presents the results of a *one-sample t-test* of the first sample "a". Run the code below and compare the estimate, t-statistic, p-value and confidence intervals with the first row of the model output above. You'll see that the results are almost identical.

`t.test(data$a)`

```
##
## One Sample t-test
##
## data: data$a
## t = 0.53557, df = 9, p-value = 0.6052
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -3.835753 6.215403
## sample estimates:
## mean of x
## 1.189825
```

### One-way ANOVA vs. linear regression

Let's take the `Wage` dataset from the `ISLR` package to quickly get 5 groups in the categorical variable we want to compare. Then we compare the results of the two models, *ANOVA* and *regression*.

Similarly to `tidy`, `glance` from the `broom` package also presents model results in a tidy way.

```
st = aov(wage ~ education, Wage)
lr = lm(wage ~ education, Wage)
kable( rbind(glance(st), glance(lr)) )
```

r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2348419 | 0.23382 | 36.52575 | 229.8059 | 0 | 5 | -15048.37 | 30108.73 | 30144.77 | 3995721 | 2995 |
0.2348419 | 0.23382 | 36.52575 | 229.8059 | 0 | 5 | -15048.37 | 30108.73 | 30144.77 | 3995721 | 2995 |

The results are, again, identical!
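
In fact, *R* itself makes no secret of this relationship: `aov()` is a wrapper around `lm()`, and the fitted *ANOVA* object even inherits the `lm` class:

```
class(st) # "aov" "lm" - the ANOVA object is a linear model under the hood
```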

### Two-way ANOVA vs. linear regression

Even if we add more predictors and interactions to our model, the results of statistical tests (here represented by a *two-way ANOVA*) and regression will match. Even the syntax in *R* is the same. The syntax might differ in other programming languages, but since *R* is the premier statistical language and we are here to learn stats, we'll stick with *R* for a while.

```
st = aov(wage ~ education * jobclass, Wage)
lr = lm(wage ~ education * jobclass, Wage)
kable( rbind(glance(st), glance(lr)) )
```

r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2422488 | 0.239968 | 36.37891 | 106.2096 | 0 | 10 | -15033.78 | 30089.55 | 30155.62 | 3957042 | 2990 |
0.2422488 | 0.239968 | 36.37891 | 106.2096 | 0 | 10 | -15033.78 | 30089.55 | 30155.62 | 3957042 | 2990 |

### One-way ANCOVA vs. linear regression

```
st = aov(wage ~ education * age, Wage)
lr = lm(wage ~ education * age, Wage)
kable( rbind(glance(st), glance(lr)) )
```

r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2624205 | 0.2602003 | 35.89144 | 118.2 | 0 | 10 | -14993.3 | 30008.61 | 30074.68 | 3851704 | 2990 |
0.2624205 | 0.2602003 | 35.89144 | 118.2 | 0 | 10 | -14993.3 | 30008.61 | 30074.68 | 3851704 | 2990 |
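
Conversely, if you prefer the classical *ANOVA/ANCOVA*-style table over regression coefficients, you can get it straight from the fitted `lm` object. A small sketch (the `Anova()` variant comes from the `car` package loaded at the start):

```
anova(lr)      # sequential (Type I) sums of squares, matching summary(st)
car::Anova(lr) # Type II sums of squares, another commonly reported variant
```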

### Any-way MANOVA vs. linear regression

The multivariate analysis of variance (*MANOVA*) can also be conducted with the `lm` function:

`summary.aov( aov(cbind(wage, age)~education, Wage) )`

```
## Response wage :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 1226364 306591 229.81 < 2.2e-16 ***
## Residuals 2995 3995721 1334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response age :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 4608 1151.90 8.7353 5.227e-07 ***
## Residuals 2995 394941 131.87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary.aov( lm(cbind(wage, age)~education, Wage) )`

```
## Response wage :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 1226364 306591 229.81 < 2.2e-16 ***
## Residuals 2995 3995721 1334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response age :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 4608 1151.90 8.7353 5.227e-07 ***
## Residuals 2995 394941 131.87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Please don't be confused by the function `summary.aov(object, ...)` being used for the linear model. It simply computes summary tables for one or more fitted model objects, where the object contains the results returned by a model-fitting function (e.g. `aov`, `lm` or `glm`).
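
Note that `summary.aov()` only shows the two univariate response tables. If you are after a genuinely multivariate test statistic (e.g. Pillai's trace), base *R*'s `manova()` wrapper, itself built on `lm`, provides it from the same formula. A brief sketch:

```
summary(manova(cbind(wage, age) ~ education, Wage)) # reports Pillai's trace by default
```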

### Any-way MANCOVA vs. linear regression

```
# library(jmv)
st <- mancova(data = Wage,
              deps = vars(wage, age),
              factors = education, covs = year)
lr = lm(cbind(wage, age) ~ education + year, Wage)
st$univar
```

```
##
## Univariate Tests
## ────────────────────────────────────────────────────────────────────────────────────────────────
## Dependent Variable Sum of Squares df Mean Square F p
## ────────────────────────────────────────────────────────────────────────────────────────────────
## education wage 1226364 4 306591 230.70 < .001
## age 4608 4 1152 8.74 < .001
## year wage 16807 1 16807 12.65 < .001
## age 494 1 494 3.75 0.053
## Residuals wage 3978914 2994 1329
## age 394446 2994 132
## ────────────────────────────────────────────────────────────────────────────────────────────────
```

`summary.aov(lr)`

```
## Response wage :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 1226364 306591 230.700 < 2.2e-16 ***
## year 1 16807 16807 12.647 0.0003821 ***
## Residuals 2994 3978914 1329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response age :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 4608 1151.90 8.7433 5.149e-07 ***
## year 1 494 494.13 3.7507 0.05288 .
## Residuals 2994 394446 131.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

## Do we need both? What are the pros?

So, we convinced ourselves that statistical tests are linear regressions, which was also beautifully described by Jonas Kristoffer Lindeløv.

Pros of regression:

- **simplicity**: why learn so many limited statistical-test concepts and how to tell them apart, if you can learn just one - `lm`.
- **extendibility**: all the tests assume linearity, but since nature is not linear, the regression approach is far more useful, because you can fit almost any type of data. Linear models also give you access to other, more powerful methods, such as non-linear modelling.
- **increased inference**: the regression output is often more useful - it gives you the rate of change, the relationship between variables, fitted values, etc.
- **applicability**: making predictions and comparing models to choose the best one (e.g. which model explains the most variance in the data or makes the best predictions) - see the short sketch after this list.
- **scalability**: regression is easily applicable to “big data” with thousands of variables.
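
To make the *applicability* point concrete, here is a minimal sketch, reusing the `Wage` data from `ISLR`, of what a fitted `lm` lets you do beyond reporting a p-value:

```
# compare two candidate models and use the better one for prediction
m1 <- lm(wage ~ education, Wage)
m2 <- lm(wage ~ education + age, Wage)
anova(m1, m2)                     # does adding age explain significantly more variance?
AIC(m1, m2)                       # or compare the models by an information criterion
head(predict(m2, newdata = Wage)) # predictions from the chosen model
```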

So, if regression not only delivers the same results as statistical tests but also has so many advantages, **then why do we need statistical tests and the ANOVA family anyway?**

Pros of tests:

- **learnability**: tests are simpler to grasp for a beginner and still get the job done when you only need a little statistical help to make your not-too-sophisticated research results more solid. Besides, if you start explaining statistics to a newbie with linear regression, you will almost certainly fail to teach the newbie anything; instead you will either (1) scare the newbie away from any data analysis, or (2) plant a mystical “I'll never get it anyway” belief in the newbie's head and make the newbie use statistical tests, or even more complex methods, without really understanding them, **which is very scary**. As a non-statistician myself, I was there for a long time, until I sat down and started with the basics - statistical tests. However, if teaching regression first is done well enough, you'll see that tests and regression are related, and you'll be able to conduct any test within the general regression framework, avoiding unnecessary complexity^{1}.

## Conclusion: which one is better?

The answer to our original question, whether there are differences between the results of tests and regression, is - **yes and no**. No, because regression delivers the same results statistical tests do. Yes, because regression can do so much more, which no test can.

**Which one to learn?** It depends on your stats level: if you are a newbie, start with tests; if you already have some stats experience, go with regression.

**Which one to use?** Use regression, since it gives you more results and more potential, unless you need a specific *ANOVA*-looking output and do not need or want to go deeper into data science.

## What’s next

- Simple linear regression
- Models with interactions
- Non-parametric statistical tests (in progress)