Multiple linear regression

Nature and life are complex phenomena, and explaining them with a single variable is rarely possible, which is why simple linear regression needs to be complemented with several predictors. But since we often have many candidate predictors, we usually perform variable selection. Backward selection cannot be used if p > n, while forward selection can always be used. Forward selection is a greedy approach and might include variables early that later become redundant; mixed (stepwise) selection can remedy this. Use a model selection algorithm to choose the best predictors:

library(MASS)                # for stepAIC()
fit1 <- lm(mpg ~ ., mtcars)  # full model with all predictors
fit2 <- lm(mpg ~ 1, mtcars)  # intercept-only (null) model

om1 <- stepAIC(fit1, direction="backward", scope=list(upper=fit1,lower=fit2), trace = FALSE)
om2 <- stepAIC(fit2, direction="forward",  scope=list(upper=fit1,lower=fit2), trace = FALSE)
om3 <- stepAIC(fit2, direction="both",     scope=list(upper=fit1,lower=fit2), trace = FALSE)

AIC(om1, om2, om3)
##     df      AIC
## om1  5 154.1194
## om2  5 155.4766
## om3  5 155.4766
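As a cross-check on the stepwise results, best-subset selection searches all combinations of predictors instead of adding or dropping them one at a time. A minimal sketch, assuming the leaps package is installed (nvmax = 10 simply allows models up to all ten predictors):

library(leaps)
best <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
best_size <- which.min(summary(best)$bic)  # model size with the lowest BIC
coef(best, best_size)                      # coefficients of that best model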

Check the assumptions of the linear model

# 1. non-linearity between response and predictor -------------------------

summary(om1)  # om1 is the best model (lowest AIC)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
car::influencePlot(om1)

##                         StudRes       Hat        CookD
## Merc 230            -1.25110559 0.2970422 0.1620826668
## Lincoln Continental  0.08383783 0.2642151 0.0006541961
## Chrysler Imperial    2.32311937 0.2296338 0.3475974030
## Fiat 128             2.12214584 0.1276313 0.1464019096
# there are a few influential points, but nothing wild, so I keep them

par(mfrow=c(2,2))
plot(om1) 

# residuals show some pattern, some non-normality and some heteroskedasticity;
# Chrysler Imperial could be an outlier, but it's not wild, so it may just be a bad car
# thus - nothing wild, but I'd better run the residual-normality and heteroskedasticity tests to be sure
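# a more direct, optional check of linearity for each predictor: component+residual
# plots (a sketch; the car package is used via car:: elsewhere in this post)
car::crPlots(om1)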
# 2. non-normality (or autocorrelation) of the residual -------------------

shapiro.test(om1$residuals)  # if p < 0.05, the residuals are distributed non-normally (a bad thing)
## 
##  Shapiro-Wilk normality test
## 
## data:  om1$residuals
## W = 0.9411, p-value = 0.08043
# in my case p > 0.05 - the residuals are normally distributed (a good thing)!
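# the section header above also mentions autocorrelation; a quick illustrative check is
# the Durbin-Watson test (row order in mtcars is arbitrary, so this is mostly a demonstration)
car::durbinWatsonTest(om1)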
# 3. heteroskedasticity - non-constant variance of residuals --------------

library(lmtest) # for the Breusch-Pagan test
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(om1) # a p-value > 0.05 indicates no heteroskedasticity in our data, so no need to transform it
## 
##  studentized Breusch-Pagan test
## 
## data:  om1
## BP = 6.1871, df = 3, p-value = 0.1029
library(gvlma)
gvlma(om1) # still no heteroskedasticity (though the global test below flags the link function)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = om1) 
## 
##                      Value  p-value                   Decision
## Global Stat        11.3771 0.022637 Assumptions NOT satisfied!
## Skewness            1.5545 0.212473    Assumptions acceptable.
## Kurtosis            0.6585 0.417077    Assumptions acceptable.
## Link Function       8.9061 0.002842 Assumptions NOT satisfied!
## Heteroscedasticity  0.2580 0.611526    Assumptions acceptable.
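# had heteroskedasticity been detected, one common remedy is heteroskedasticity-consistent
# (robust) standard errors; a sketch, assuming the sandwich package is installed
library(sandwich)
lmtest::coeftest(om1, vcov = vcovHC(om1, type = "HC3"))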
# 4. (multi)collinearity - dependence of predictors -----------------------

car::vif(om1)  # since the values are all below 5, there is no collinearity problem in our "om1" model
##       wt     qsec       am 
## 2.482952 1.364339 2.541437
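# a quick pairwise-correlation look at the selected predictors complements the VIF (base R only)
round(cor(mtcars[, c("wt", "qsec", "am")]), 2)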
# 5. outliers & high-leverage points --------------------------------------

car::influencePlot(om1)

##                         StudRes       Hat        CookD
## Merc 230            -1.25110559 0.2970422 0.1620826668
## Lincoln Continental  0.08383783 0.2642151 0.0006541961
## Chrysler Imperial    2.32311937 0.2296338 0.3475974030
## Fiat 128             2.12214584 0.1276313 0.1464019096
# and Chrysler Imperial could be interpreted as a low-leverage outlier, but removing it would be overkill, so I keep it
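# to double-check that it is safe to keep, one can refit without Chrysler Imperial and
# compare the coefficients (a sketch; output not shown)
om1_refit <- update(om1, data = mtcars[rownames(mtcars) != "Chrysler Imperial", ])
round(cbind(full = coef(om1), without_chrysler = coef(om1_refit)), 2)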

#!!! congrats: for this multiple linear model you solved every problem, no log-transformation is needed. Some conclusions:

# 1. there is a relationship between mpg and wt, qsec, am
# 2. the relationships are strong: the RSE is small, wt, qsec and am together explain ca. 85% of the variation in mpg,
# and the model's p-value is significant. Note that if the p-value of the model is significant but the p-values of the predictors are not, something is wrong with the model, e.g. collinearity.
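# confidence intervals for the coefficients back up the "strong relationship" conclusion (base R)
confint(om1)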
# now you can plot implied predictions
library(jtools)
effect_plot(om1, pred = wt,   interval = TRUE, plot.points = TRUE)

effect_plot(om1, pred = qsec, interval = TRUE, plot.points = TRUE)

effect_plot(om1, pred = am,   interval = TRUE, plot.points = TRUE)
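# numeric predictions for a hypothetical car complement the effect plots
# (the wt, qsec and am values below are made up purely for illustration)
predict(om1, newdata = data.frame(wt = 3, qsec = 18, am = 1), interval = "prediction")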
