Model diagnostics

The interpretation and understanding of the model is both important and difficult. In this post we demistify the summary and plot of the simple linear model. Think about them as a concentrated useful information which tells the story about your data.

Summary

model <- lm(mpg ~ hp, mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Formula

First you obtain the equation for your regression \(y = a+b*x\). Since we have assumed a linear relationship between the response and the predictor(s), the entire fitting problem reduces to estimating the best coefficients (parameters) \(a\) (intersept) and \(b\) (slope). This coeffitients provide the best solution for our model, meaning minimized the sum of squared error. So the queation takes the form: \(mpg = 30.1 - 0.068*hp\).

Estimates

Intercept

Intercept is the mean (expected) value of \(y\) (\(mpg\)) when \(x\) (\(hp\)) $ = 0$. It’s needed to constact the line , but it often does not make much sense, since no miles can be driven if a car has 0 horse power. So, don’t bother with a meaning of the intercept now. In the posts on centralisation and normalisation of the variables, the intercepts will starts to tell you something useful.

Slope

\(b\) is the slope (rate of change) — the average increase in \(y\) associated with a one-unit increase in \(x\). It shows what kind of correlation, positive or negative do we have. The higher is coefficient \(b\), the stronger the predictor (hp) influences our response variable (mpg).

Standart error

Standatd error can be used to compute confidence intervals. But it’s easily done with R:

confint(model)
##                   2.5 %     97.5 %
## (Intercept) 26.76194879 33.4357723
## hp          -0.08889465 -0.0475619

t-value

t-value from t-test shows weather the slope is different from 0. t_value = Estimate / Std.Error. t-statistic measures the number of standard deviations that \(b\) is away from 0.

p-value and significance stars

p-value shows the level of significance of our correlation. Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small p-value, then we can infer that there is an relashioship between the predictor and the response. Stars are helping to quickly see whether this relationship is significant or not. Thus, we reject the null hypothesis, which assumed no relationship between \(x\) and \(y\). Typical p-value cutoffs for rejecting the null hypothesis are 5 or 1 %. When n = 30, these correspond to t-statistics of around 2 and 2.75, respectively. Conclusion: the lower p-value, the better.

\(R^2\) and adjusted \(R^2\) - goodness of fit

\(R^2\) = 0.6024 means that ca. 60% of the variance in our response variable can be explained by our model. Normally, for physics or ingeneering, it’s not that much, because 40% of the variance are due to the factors which are not yet included into the model. But for some areas of science such as biology this might be a good model. Being a proportion of variance explained, \(R^2\), can only be between 0 and 1, and is independent of the scale of \(y\). Important to know, that \(R^2\) also shows how good are predicted values fit the real values. Conclusion: \(R^2\) is a very good indicator of model efficiency, the higher is \(R^2\), the better our model describes the relationship between two variables (except of the problem of overfitting which is outside of the scope of this article).

The problem with \(R^2\) is that adding more predictors to the model will always increase it, which might lead to overfitting. Thus, Adjusted \(R^2\) accounts for the number of variables in the model and penalizes \(R^2\) in order to avoid overfitting. Thus, for multiple linear regressions, adjusted \(R^2\) is more pragmatic.

R esidual standard error (RSE) - lack of fit

Roughly speaking, it is the average amount that the response \(y\) will deviate from the true regression line. The RSE is considered a measure of the lack of fit of the model to the data. Conclusion: small error - good fit, large error - bad feed. But how do we know, whether our number - 3.86 is small or big. Well, we can calculate the predicted error rate by dividing RSE by the mean of the outcome (mpg): \(3.86/mean(mpg) = 19\)%.

F-statistics and p-value of the model

F-statistics tests the null hypothesis, that all predictors have no influence on the outcome, the coefficients (slopes) of the predictors are zero, meaning non of the predictors has a relationship with the outcome. The alternative hypothesis is that at least one predictor has a non-zero (significant) slope. When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if alternative hypothesis is true, we expect F to be greater than 1. In other words, the large F-statistic suggests that at least one of the predictors must be related to the outcome. Having a lot of data (large \(N\)), an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if \(N\) is small. Thus the F-statistics alone does not show a significance, that why we need a p-value.

The p-value of the model shows whether one, or more, of the predictors is significantly associated with the outcome. Thus, if the p-value of the model is significant, but non of the predictors in the Coefficints block of the summary is, somethinng is wrong with the model and you should not use it.

Plot

par(mfrow = c(2,2))
plot(model)

First and third plots

..are almost the same. They show the distribution of the residuals (error terms) - the difference between the ith observed response value and the ith response value that is predicted by our linear model. If they show some pattern, then your model is non-linear or the model still did not catch some effect. If the model is good, the residuals supposed to be normally and evenly distributed.

Second plot - QQPLOT

Its difficult to say, how much the diviance from the middle line normal or not, that is why this plot is less useful as it should be, instead we can build histograms with normal and standardazed residuals.

Fourth plot

To understand the forth plot we need to understand some definitions: - Outlier - is a point which has a big residual. What is big? Studentised residuals should be between -2 and 2 with approximately 95% probability - High-leverage point - is a point which has small residual, but is far away from the most of the data points - Low-leverage outlier - is a point with a big residual but no, or small leverage - High-leverage outlier - is a point with a big residual and big leverage - Cook’s distance shows how strongly predicted values change if we remove a single point.

Thus, if removing a point heavily influences the model (Cook’s distance >1), then is does not suppose to be in the dataset.

The forth plot is also not to helpful, when it comes to identifying outliers or high-leverage points. So, you can see, that the “classical” model diagnostic by “plot(model)” is not very helpful, nevertheless, it is often used and suppose to be understood, even if you conclude not to use this to often. At least it it now demystified.

Better ways to diagnose models

car::outlierTest(model) # Bonferonni p-value for most extreme observation (Outliers)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##               rstudent unadjusted p-value Bonferroni p
## Maserati Bora 2.568387           0.015632      0.50021
car::qqPlot(model, main="QQ Plot") #qq plot for studentized residuals 

## Toyota Corolla  Maserati Bora 
##             20             31
car::leveragePlots(model) # leverage plots

# shows cook's distance
plot(model, 4)

# shows both outliers and high-leverage points
car::influencePlot(model, 
              main="Influence Plot", 
              sub="Circle size is proportial to Cook's Distance")

##                 StudRes       Hat      CookD
## Toyota Corolla 2.386612 0.0770401 0.20554671
## Ford Pantera L 1.029071 0.1256885 0.07596905
## Maserati Bora  2.568387 0.2745929 1.05223055
Avatar
Yury Zablotski
Data Scientist at LMU Munich, Faculty of Veterinary Medicine

Passion for applying Biostatistics and Machine Learning to Life Science Data

Related

comments powered by Disqus