Nature and life are complex phenomena, and explaining them with a single variable is simply impossible; that is why simple linear regression needs to be complemented with several predictors. But since we often have many candidate predictors, we usually perform variable selection. Backward selection cannot be used if p > n, while forward selection always can. Forward selection is a greedy approach and may include variables early that later become redundant; mixed selection can remedy this. Use a model-selection algorithm to choose the best predictors:
library(MASS)
fit1 <- lm(mpg ~ ., mtcars)
fit2 <- lm(mpg ~ 1, mtcars)
om1 <- stepAIC(fit1, direction="backward", scope=list(upper=fit1,lower=fit2), trace = F)
om2 <- stepAIC(fit2, direction="forward", scope=list(upper=fit1,lower=fit2), trace = F)
om3 <- stepAIC(fit2, direction="both", scope=list(upper=fit1,lower=fit2), trace = F)
AIC(om1, om2, om3)
## df AIC
## om1 5 154.1194
## om2 5 155.4766
## om3 5 155.4766
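# AIC already favours om1; as an optional cross-check (my addition), the same comparison
# can be run with BIC, which penalizes extra parameters more heavily
BIC(om1, om2, om3)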
Check the assumptions of the linear model
# 1. non-linearity between response and predictor -------------------------
summary(om1) # om1 has the lowest AIC, so it is the best of the three models
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
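# as an extra reading aid (my addition): confidence intervals for the coefficients
confint(om1)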
car::influencePlot(om1)
## StudRes Hat CookD
## Merc 230 -1.25110559 0.2970422 0.1620826668
## Lincoln Continental 0.08383783 0.2642151 0.0006541961
## Chrysler Imperial 2.32311937 0.2296338 0.3475974030
## Fiat 128 2.12214584 0.1276313 0.1464019096
# influencePlot flags a few influential points, but nothing wild, so I keep them
par(mfrow=c(2,2))
plot(om1)
# the residuals show a slight pattern, some non-normality and some heteroskedasticity;
# Chrysler Imperial could be an outlier, but it's not extreme, so it may simply be a badly fitting car.
# thus nothing wild, but I'd better run formal normality and heteroskedasticity tests to be sure
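# an extra linearity check (my addition, a sketch): component+residual plots from car show each
# predictor's partial relationship with mpg; roughly straight lines suggest no strong non-linearity
car::crPlots(om1)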
# 2. non-normality (or autocorrelation) of the residuals ------------------
shapiro.test(om1$residuals) # if p < 0.05, the residuals are non-normally distributed (a bad thing)
##
## Shapiro-Wilk normality test
##
## data: om1$residuals
## W = 0.9411, p-value = 0.08043
# in my case p > 0.05, so the residuals are normally distributed (a good thing)!
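# the heading also mentions autocorrelation; as an optional extra (my addition, mostly relevant
# for ordered data) the Durbin-Watson test checks for first-order autocorrelation of the residuals
car::durbinWatsonTest(om1) # H0: no autocorrelation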
# 3. heteroskedasticity - non-constant variance of residuals --------------
library(lmtest) # for the test
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(om1) # a p-value > 0.05 indicates no heteroskedasticity in our data, so no transformation is needed
##
## studentized Breusch-Pagan test
##
## data: om1
## BP = 6.1871, df = 3, p-value = 0.1029
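# an alternative heteroskedasticity test from car (my addition): it checks whether the residual
# variance changes with the fitted values
car::ncvTest(om1) # H0: constant (homoskedastic) error variance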
library(gvlma)
gvlma(om1) # heteroskedasticity is still acceptable, although gvlma flags the link function
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = om1)
##
## Value p-value Decision
## Global Stat 11.3771 0.022637 Assumptions NOT satisfied!
## Skewness 1.5545 0.212473 Assumptions acceptable.
## Kurtosis 0.6585 0.417077 Assumptions acceptable.
## Link Function 8.9061 0.002842 Assumptions NOT satisfied!
## Heteroscedasticity 0.2580 0.611526 Assumptions acceptable.
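# the flagged link function can be probed with a Box-Cox profile (my addition, a sketch; MASS is
# already loaded): a peak near lambda = 1 means no transformation of mpg is needed, a peak near 0
# would point towards a log transformation
boxcox(om1)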
# 4. (multi)collinearity - dependence of predictors -----------------------
car::vif(om1) # since the values are all below 5, there is no collinearity problem in our "om1" model
## wt qsec am
## 2.482952 1.364339 2.541437
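# to see where the (mild) dependence between predictors comes from (my addition): a plain
# correlation matrix of the selected predictors
cor(mtcars[, c("wt", "qsec", "am")])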
# 5. outliers & high-leverage points --------------------------------------
car::influencePlot(om1)
## StudRes Hat CookD
## Merc 230 -1.25110559 0.2970422 0.1620826668
## Lincoln Continental 0.08383783 0.2642151 0.0006541961
## Chrysler Imperial 2.32311937 0.2296338 0.3475974030
## Fiat 128 2.12214584 0.1276313 0.1464019096
# Chrysler Imperial could be interpreted as a low-leverage outlier, but removing it would be overkill, so I keep it
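# a formal check on that decision (my addition): the Bonferroni outlier test; a non-significant
# result supports keeping Chrysler Imperial in the data
car::outlierTest(om1)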
#!!! congrats: for this multiple linear model you have checked every assumption, and no log transformation is needed. Some conclusions:
# 1. there is a relationship between mpg and wt, qsec and am
# 2. the relationships are strong: the RSE is small, the predictors explain ca. 85% of the variation in mpg,
# and the model's p-value is significant. Note that if the model's p-value is significant but the predictors' p-values are not, something is wrong with the model, e.g. collinearity.
# now you can plot implied predictions
library(jtools)
effect_plot(om1, pred = wt, interval = TRUE, plot.points = TRUE)
effect_plot(om1, pred = qsec, interval = TRUE, plot.points = TRUE)
effect_plot(om1, pred = am, interval = TRUE, plot.points = TRUE)
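# beyond the plots, numeric predictions are straightforward (my addition; the new-car values below
# are made up purely for illustration)
new_car <- data.frame(wt = 3.0, qsec = 18.0, am = 1)
predict(om1, newdata = new_car, interval = "confidence")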