- Previous topics
- Head-to-head comparison:
- One-sample t-test vs. linear regression
- Paired sample t-test vs. linear regression
- Unpaired sample t-test vs. linear regression
- One-way ANOVA vs. linear regression
- Two-way ANOVA vs. linear regression
- One-way ANCOVA vs. linear regression
- Any-way MANOVA vs. linear regression
- Any-way MANCOVA vs. linear regression
- Do we need both? What are the pros?
- Conclusion: which one is better?
- What’s next
- Further readings and references
This post was inspired by this article
Previous topics
The family of parametric statistical tests (one- and two-sample t-tests, ANOVA, ANCOVA, MANOVA, MANCOVA) all look for significant differences between the means of some groups. Since linear regression can find such differences too, here we'll investigate whether the results of the tests differ from those of linear regression, and which approach is better.
Head-to-head comparison:
library(tidyverse)
library(ISLR)
library(car)
library(multcomp)
library(jmv)
library(broom)
library(knitr)
One-sample t-test vs. linear regression
Since the whole family of statistical tests, from a simple t-test to MANCOVA, is wrapped around the linear-model concept in R, linear regression is able to reproduce the results of even the simplest statistical tests.
Run the first two lines of code below and see the output of both for yourself, if you wish. The rest of the code simply packs the results into a nice-looking table: (1) `tidy(object, conf.int = TRUE)` from the broom package produces tidy model outputs, (2) `kable` from the knitr package makes the table look publication-ready, (3) `%>%` and `select` are data-manipulation functions from the tidyverse package family, and (4) `rbind` stacks rows on top of each other, while `cbind` puts columns next to each other. Since `select` is exported by several packages, we need to tell R to use `select` from the dplyr package (`dplyr::select`).
# statistical test (st)
st = t.test(mtcars$mpg)
# equivalent intercept-only linear regression (lr)
lr = lm(mtcars$mpg ~ 1)
# nice looking table
kable(
  rbind(
    tidy(st)[c(1:3, 5:7)],
    cbind(tidy(lr, conf.int = T)[c(2, 4:7)], method = "Simple Linear Model")) %>%
  dplyr::select(method, everything()) )
method | estimate | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|
One Sample t-test | 20.09062 | 18.85693 | 0 | 17.91768 | 22.26357 |
Simple Linear Model | 20.09062 | 18.85693 | 0 | 17.91768 | 22.26357 |
You see: the results are identical!
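In other words, the intercept-only model just estimates the overall mean of `mpg`, which is exactly the mean the one-sample t-test compares against zero. A quick sanity check, using the `lr` object fitted above:
# the sample mean of mpg ...
mean(mtcars$mpg)
# ... is exactly the intercept of the intercept-only regression
coef(lr)[1]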
Paired sample t-test vs. linear regression
To make the comparison less abstract, let’s make up (simulate) and visualise two samples.
# stabilize randomness for reproducibility
set.seed(1)
# tibble is simply a new table
data <- tibble(
  a = rnorm(n = 10, mean = 0, sd = 9),
  b = rnorm(n = 10, mean = 5, sd = 7) )
# ggplot is also part of the tidyverse package family
ggplot(data %>% gather(), aes(x = factor(key), y = value)) +
  geom_boxplot() +
  geom_dotplot(binaxis = 'y', stackdir = 'center', binwidth = 1, alpha = .3) +
  xlab("") +
  theme_minimal()
Since a paired t-test computes the difference between the samples and then compares this difference to zero (effectively a one-sample t-test, where the one sample is our difference), in the regression model we simply subtract one sample from the other and fit an intercept-only model.
st = t.test(data$a, data$b, paired = TRUE)
lr = lm(data$a - data$b ~ 1)
kable(
  rbind(
    tidy(st)[c(1:3, 5:7)],
    cbind(tidy(lr, conf.int = T)[c(2, 4:7)], method = "Simple Linear Model")) %>%
  dplyr::select(method, everything()) )
method | estimate | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|
Paired t-test | -5.55209 | -1.457902 | 0.1788606 | -14.167 | 3.062823 |
Simple Linear Model | -5.55209 | -1.457902 | 0.1788606 | -14.167 | 3.062823 |
Here, too, all results are identical. Thus, the `mean of the differences` in the paired `t.test()` is indeed the same as the `(Intercept)` in the linear regression.
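As a quick sanity check, the intercept of the difference model is literally the mean of the pairwise differences (using the `data` and `lr` objects from above):
# mean of the pairwise differences ...
mean(data$a - data$b)
# ... equals the (Intercept) of lm(data$a - data$b ~ 1)
coef(lr)[1]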
Unpaired sample t-test vs. linear regression
Since statistical tests compare means of groups (samples), let's reshape our `data` so that the group labels are in one column and the values of our samples are in a second column. The `gather()` function from the tidyr package makes this easy to do. Reshaping the data will also help us understand the tilde operator `~`, which is often used in tests like ANOVA and always used in linear models to define a model formula. The fact that `~` can be used within statistical tests already hints at the affinity of statistical tests with models. This affinity is the reason you often see "ANOVA model" in the literature instead of "ANOVA test".
`paired = FALSE` in the code below is actually unnecessary, since the t-test uses it by default, but it underlines our intention to run an unpaired t-test here.
data2 <- data %>% gather() %>% rename(group = key, value = value)
st = t.test(value ~ group, data = data2, paired = FALSE, var.equal = TRUE)
lr = lm(value ~ group, data = data2)
kable( tidy(st)[c(1:4, 6:8)] )
estimate1 | estimate2 | statistic | p.value | conf.low | conf.high | method |
---|---|---|---|---|---|---|
1.189825 | 6.741915 | -1.710128 | 0.1044233 | -12.37293 | 1.26875 | Two Sample t-test |
kable( tidy(lr, conf.int = T) )
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.189825 | 2.295688 | 0.5182868 | 0.6105707 | -3.633237 | 6.012887 |
groupb | 5.552090 | 3.246594 | 1.7101277 | 0.1044233 | -1.268750 | 12.372930 |
Here we can see that the mean of the first sample "a", `mean(data$a)` = 1.189825, which is `estimate1` in the t-test, is identical to the `(Intercept)` of the linear model. The second estimate of the t-test (6.741915) is identical to the sum of the estimates in the linear model (1.189825 + 5.552090 = 6.741915). This makes sense, because the `(Intercept)` is always the mean of the first (reference) sample, while the estimate for every other group is the transition from the reference sample to that particular group (e.g. `groupb`). Same thing, just displayed a bit differently. But why? Well, while the t-test shows you two means, the transition from one mean to the other in the linear model shows you a slope, the rate of change of the second group compared to the first, which will become very useful when we start adding more variables to our models.
The t-test indicates that the groups are not significantly different (p = 0.104). Switching from sample "a" to sample "b" in the linear model (see the second row of the model output) has the identical p-value = 0.104. The t-statistic and confidence interval in the second row are also identical, but with the opposite sign: the t-test reports the difference as mean(a) - mean(b), while the regression coefficient `groupb` describes the change from the reference group "a" to group "b", i.e. mean(b) - mean(a). But what is the first row about? Interestingly, the first row of the model output resembles a one-sample t-test of the first sample "a". Run the code below and compare the estimate, t-statistic, p-value and confidence interval with the first row of the model output above. You'll see that the results are almost (though not exactly) identical, because the regression uses the pooled residual standard error from both groups, while the one-sample t-test uses only the standard deviation of sample "a".
t.test(data$a)
##
## One Sample t-test
##
## data: data$a
## t = 0.53557, df = 9, p-value = 0.6052
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -3.835753 6.215403
## sample estimates:
## mean of x
## 1.189825
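To make the coefficient interpretation concrete, we can recompute the pieces by hand: the intercept is the mean of group "a", and the `groupb` coefficient is the difference between the two group means (a minimal check with the objects defined above):
# mean of the reference group "a" = (Intercept)
mean(data$a)
# difference between group means = groupb coefficient
mean(data$b) - mean(data$a)
# compare with the regression coefficients
coef(lr)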
One-way ANOVA vs. linear regression
Let's take the `Wage` dataset from the ISLR package to quickly get 5 groups in the categorical variable we want to compare. Then we compare the results of the two models, ANOVA and regression. Similarly to `tidy`, `glance` from the broom package also presents model results in a tidy way.
st = aov(wage ~ education, Wage)
lr = lm(wage ~ education, Wage)
kable( rbind(glance(st), glance(lr)) )
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2348419 | 0.23382 | 36.52575 | 229.8059 | 0 | 5 | -15048.37 | 30108.73 | 30144.77 | 3995721 | 2995 |
0.2348419 | 0.23382 | 36.52575 | 229.8059 | 0 | 5 | -15048.37 | 30108.73 | 30144.77 | 3995721 | 2995 |
The results are, again, identical!
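As in the two-group case, the regression coefficients are just the five education-group means expressed relative to the reference level. A quick check, using base R's `aggregate()` and the `lr` object fitted above:
# mean wage per education level
aggregate(wage ~ education, Wage, mean)
# intercept = mean of the reference level; the other coefficients are differences from it
coef(lr)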
Two-way ANOVA vs. linear regression
Even if we add more predictors and interactions to our model, the results of the statistical tests (here represented by a two-way ANOVA) and of the regression will match. Even the syntax in R is the same. The syntax might differ in other programming languages, but since R is the premier statistical language and we are trying to learn statistics here, we'll stick with R for a while.
st = aov(wage ~ education * jobclass, Wage)
lr = lm(wage ~ education * jobclass, Wage)
kable( rbind(glance(st), glance(lr)) )
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2422488 | 0.239968 | 36.37891 | 106.2096 | 0 | 10 | -15033.78 | 30089.55 | 30155.62 | 3957042 | 2990 |
0.2422488 | 0.239968 | 36.37891 | 106.2096 | 0 | 10 | -15033.78 | 30089.55 | 30155.62 | 3957042 | 2990 |
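`glance()` only compares overall model fit. If you also want to compare the individual terms, the classic ANOVA table and the ANOVA table computed from the regression object match term by term as well (a quick check using the objects fitted above; both are sequential, Type I sums of squares):
# classic two-way ANOVA table
summary(st)
# the same table, produced from the regression fit
anova(lr)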
One-way ANCOVA vs. linear regression
An ANCOVA simply adds a continuous covariate (here `age`, together with its interaction with `education`) to the categorical predictor, and the equivalence with the linear model still holds.
st = aov(wage ~ education * age, Wage)
lr = lm(wage ~ education * age, Wage)
kable( rbind(glance(st), glance(lr)) )
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2624205 | 0.2602003 | 35.89144 | 118.2 | 0 | 10 | -14993.3 | 30008.61 | 30074.68 | 3851704 | 2990 |
0.2624205 | 0.2602003 | 35.89144 | 118.2 | 0 | 10 | -14993.3 | 30008.61 | 30074.68 | 3851704 | 2990 |
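The only new ingredient in the ANCOVA is the continuous covariate `age`. One way to see what the interaction `education * age` means is to plot a separate wage-vs-age regression line per education group; a minimal sketch with ggplot2 (loaded with the tidyverse):
# one regression line per education group: the interaction allows the slopes to differ
ggplot(Wage, aes(x = age, y = wage, colour = education)) +
  geom_point(alpha = .1) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()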
Any-way MANOVA vs. linear regression
The multivariate analysis of variance (MANOVA) can also be conducted with the `lm` function:
summary.aov( aov(cbind(wage, age)~education, Wage) )
## Response wage :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 1226364 306591 229.81 < 2.2e-16 ***
## Residuals 2995 3995721 1334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response age :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 4608 1151.90 8.7353 5.227e-07 ***
## Residuals 2995 394941 131.87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov( lm(cbind(wage, age)~education, Wage) )
## Response wage :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 1226364 306591 229.81 < 2.2e-16 ***
## Residuals 2995 3995721 1334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response age :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 4608 1151.90 8.7353 5.227e-07 ***
## Residuals 2995 394941 131.87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Please don't be confused by the function `summary.aov(object, ...)` being used for a linear model. It simply computes ANOVA tables for one or more fitted model objects, where `object` contains the results returned by a model-fitting function (e.g. `aov`, `lm` or `glm`).
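Note that `summary.aov()` only shows a separate univariate ANOVA for each response. If you also want a genuinely multivariate test statistic, the base R `manova()` wrapper around the same model provides it; a minimal sketch:
# multivariate test across both responses at once (Pillai's trace by default)
summary( manova(cbind(wage, age) ~ education, Wage) )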
Any-way MANCOVA vs. linear regression
# library(jmv)
st <- mancova(data = Wage,
deps = vars(wage, age),
factors = education, covs = year)
lr = lm(cbind(wage,age) ~ education+year, Wage)
st$univar
##
## Univariate Tests
## ────────────────────────────────────────────────────────────────────────────────────────────────
## Dependent Variable Sum of Squares df Mean Square F p
## ────────────────────────────────────────────────────────────────────────────────────────────────
## education wage 1226364 4 306591 230.70 < .001
## age 4608 4 1152 8.74 < .001
## year wage 16807 1 16807 12.65 < .001
## age 494 1 494 3.75 0.053
## Residuals wage 3978914 2994 1329
## age 394446 2994 132
## ────────────────────────────────────────────────────────────────────────────────────────────────
summary.aov(lr)
## Response wage :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 1226364 306591 230.700 < 2.2e-16 ***
## year 1 16807 16807 12.647 0.0003821 ***
## Residuals 2994 3978914 1329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response age :
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 4608 1151.90 8.7433 5.149e-07 ***
## year 1 494 494.13 3.7507 0.05288 .
## Residuals 2994 394446 131.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
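For completeness: multivariate (MANOVA-style) tests can also be obtained directly from the multi-response `lm` fit with `Anova()` from the car package loaded at the top; a hedged sketch (these are Type II tests, so they need not match the sequential tables above exactly):
# multivariate Type II tests from the multi-response linear model
car::Anova(lr)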
Do we need both? What are the pros?
So, we convinced ourselves that statistical tests are linear regressions, which was also beautifully described by Jonas Kristoffer Lindeløv.
Pros of regression:
- simplicity: why learn so many limited statistical tests, and how to differentiate between them, if you can learn just one thing - `lm`.
- extendibility: all the tests assume linearity, but since nature is not linear, the regression approach is far more useful, because it lets you fit many more types of data. Linear models also give you access to other, more powerful methods, such as non-linear modelling.
- increased inference: regression output is often more useful; it gives you the rate of change, the relationship between variables, fitted values, etc.
- applicability: making predictions and comparing models to choose the best one (e.g. which model explains the most variance in the data or makes the best predictions).
- scalability: easily applicable to "big data" with thousands of variables.
So, if regression not only delivers the same results as the statistical tests but also has so many advantages, why do we need the statistical tests and the ANOVA family at all?
Pros of tests:
- learnability: tests are simpler to grasp for a beginner and still get the job done when you only need a little statistical help to make your not-too-sophisticated research results more solid. Besides, if you start explaining statistics to a newbie with linear regression, you will almost certainly fail to teach them anything; you will either (1) scare them away from any data analysis, or (2) plant a mystical "I'll never get it anyway" belief in their head and make them use statistical tests, or even more complex methods, without really understanding them, which is very scary. As a non-statistician myself, I was there for a long time, until I sat down and started with the basics - statistical tests. However, if teaching regression first is done well enough, you'll see that tests and regression are related, and you'll be able to conduct any test using the general regression framework, avoiding unnecessary complexity 1.
Conclusion: which one is better?
The answer to our original question, whether there are differences between the results of tests and regression, is yes and no. No, because regression delivers the same results the statistical tests do. Yes, because regression can do so much more that no test can.
Which one to learn? It depends on your stats level: if you are a newbie, start with tests; if you already have some stats experience, go with regression.
Which one to use? Use regression, since it gives you more results and more potential, unless you need a specific ANOVA-looking output and do not need or want to go deeper into data science.
What’s next
- Simple linear regression
- Models with interactions
- Non-parametric statistical tests (in progress)