# Repeated Measures ANOVA

## Previous topics or when do we need it

If you know a paired t-test, which compares only two samples, you might have wondered how to compare more then two samples?" And if you know a simple and multiple ANOVAs, which compare more then two unpaired samples, you might have wondered what to do if the samples are paired? Well, the Repeated-measures ANOVA answers both of these questions.

## Why do we need it? What are the benefits?

Repeated-measures design increases the power of the experiment via removing the individual differences from the total variance, resulting in the sharper image of the effects we actually want to study. This allows making experiments with fewer individuals, which saves resources, like time, money or even reducing suffering while studying some painful therapy. Increasing power also increases chances of finding statistical inference, a.k.a. finding an effect, with fewer observations. It also allows to study individual changes of individuals over time (longitudinal analysis).

## Understanding Repeated Measures ANOVA

NOTE: a lot of the chapter below originates from this amazing article by Laerd statistics. Yes, it’s often just copy-paste. I often did not rephrase explanations because I could not have been explained this topic better! And since my goal here is to learn and teach, I left this most intuitive guide to the Repeated Measures ANOVA be the most intuitive.

### Introduction

Repeated measures ANOVA is the equivalent of the one-way ANOVA, but for related, not independent groups, and is the extension of the dependent (paired) t-test. A repeated measures ANOVA is also referred to as a within-subjects ANOVA or ANOVA for correlated samples. All these names imply the nature of the repeated measures ANOVA, that of a test to detect any overall differences between related means. There are many complex designs that can make use of repeated measures, but throughout this guide, we will be referring to the most simple case, that of a one-way repeated measures ANOVA. This particular test requires one independent variable and one dependent variable. The dependent variable (outcome) needs to be continuous (interval or ratio) and the independent variable (predictor) needs to be categorical (either nominal or ordinal).

### When to use a Repeated Measures ANOVA

We can analyse data using a repeated measures ANOVA for two types of study design. Studies that investigate either (1) changes in mean scores over three or more time points, or (2) differences in mean scores under three or more different conditions. For example, for (1), you might be investigating the effect of a 6-month exercise training programm on blood pressure and want to measure blood pressure at 3 separate time points (pre-, midway and post-exercise intervention), which would allow you to develop a time-course for any exercise effect. For (2), you might get the same subjects to eat different types of cake (chocolate, caramel and lemon) and rate each one for taste, rather than having different people taste each different cake. The important point with these two study designs is that the same people are being measured more than once on the same dependent variable (i.e., why it is called repeated measures).

In repeated measures ANOVA, the independent variable has categories called levels or related groups. Where measurements are repeated over time, such as when measuring changes in blood pressure due to an exercise-training program, the independent variable is time. Each level (or related group) is a specific time point. Hence, for the exercise-training study, there would be three time points and each time-point is a level of the independent variable (a schematic of a time-course repeated measures design is shown below):

Where measurements are made under different conditions, the conditions are the levels (or related groups) of the independent variable (e.g., type of cake is the independent variable with chocolate, caramel, and lemon cake as the levels of the independent variable). A schematic of a different-conditions repeated measures design is shown below. It should be noted that often the levels of the independent variable are not referred to as conditions, but treatments. Which one you want to use is up to you. There is no right or wrong naming convention. You will also see the independent variable more commonly referred to as the within-subjects factor.

The above two schematics have shown an example of each type of repeated measures ANOVA design, but you will also often see these designs expressed in tabular form, such as shown below:

This particular table describes a study with six subjects (S1 to S6) performing under three conditions or at three time points (T1 to T3). As highlighted earlier, the within-subjects factor could also have been labelled “treatment” instead of “time/condition”. They all relate to the same thing: subjects undergoing repeated measurements at either different time points or under different conditions/treatments.

### Hypothesis for Repeated Measures ANOVA

The repeated measures ANOVA tests for whether there are any differences between related population means. The null hypothesis ($$H_0$$) states that the means are equal:

H0: µ1 = µ2 = µ3 = … = µk

where µ = population mean and k = number of related groups. The alternative hypothesis ($$H_{Alt}$$) states that the related population means are not equal (at least one mean is different to another mean):

HA: at least two means are significantly different

For our exercise-training example, the null hypothesis (H0) is that mean blood pressure is the same at all time points (pre-, 3 months, and 6 months). The alternative hypothesis is that mean blood pressure is significantly different at one or more time points. A repeated measures ANOVA will not inform you where the differences between groups lie as it is an omnibus (general) statistical test. The same would be true if you were investigating different conditions or treatments rather than time points, as used in this example. If your repeated measures ANOVA is statistically significant, you can run post-hoc tests that can highlight exactly where these differences occur.

### Logic of the Repeated Measures ANOVA

The logic behind a repeated measures ANOVA is very similar to that of a between-subjects ANOVA. Recall that a between-subjects ANOVA partitions total variability into between-groups variability (SSb) and within-groups variability (SSw), as shown below:

In this design, within-group variability (SSw) is defined as the error variability (SS-error). Following division by the appropriate degrees of freedom, a mean sum of squares for between-groups (MSb) and within-groups (MSw) is determined and an F-statistic is calculated as the ratio of MSb to MSw (or MS-error), as shown below:

A repeated measures ANOVA calculates an F-statistic in a similar way:

The advantage of a repeated measures ANOVA is that whereas within-group variability (SSw) expresses the error variability (SS-error) in an independent (between-subjects) ANOVA, a repeated measures ANOVA can further partition this error term, reducing its size, as is illustrated below:

This has the effect of increasing the value of the F-statistic due to the reduction of the denominator and leading to an increase in the power of the test to detect significant differences between means (this is discussed in more detail later). Mathematically, and as illustrated above, we partition the variability attributable to the differences between groups (SS-conditions) and variability within groups (SSw) exactly as we do in a between-subjects (independent) ANOVA. However, with a repeated measures ANOVA, as we are using the same subjects in each group, we can remove the variability due to the individual differences between subjects, referred to as SS-subjects, from the within-groups variability (SSw). How is this achieved? Quite simply, we treat each subject as a block. That is, each subject becomes a level of a factor called subjects. We then calculate this variability as we do with any between-subjects factor. The ability to subtract SS-subjects will leave us with a smaller SS-error term, as highlighted below:

Now that we have removed the between-subjects variability, our new SS-error only reflects individual variability to each condition. You might recognize this as the interaction effect of subject by conditions; that is, how subjects react to the different conditions. Whether this leads to a more powerful test will depend on whether the reduction in SS-error more than compensates for the reduction in degrees of freedom for the error term (as degrees of freedom go from (n - k) to (n - 1)(k - 1) (remembering that there are more subjects in the independent ANOVA design).

### Calculating a Repeated Measures ANOVA

The repeated measures ANOVA, like other ANOVAs, generates an F-statistic that is used to determine statistical significance. The F-statistic is calculated as below:

The diagram below represents the partitioning of variance that occurs in the calculation of a repeated measures ANOVA.

In order to calculate an F-statistic we need to calculate SS-conditions and SS-error. SS-conditions can be calculated directly quite easily (as you will have encountered in an independent ANOVA as SS-between). Although SS-error can also be calculated directly it is somewhat difficult in comparison to deriving it from knowledge of other sums of squares which are easier to calculate, namely SS-subjects, and either SST or SSw. SS-error can then be calculated in either of two ways:

#### Calculate between groups variation

SS-between is SS-time in our example:

where k = number of conditions, ni = number of subjects under each (ith) condition, Mean of ith term = mean score for each (ith) condition, Mean = grand mean. So, in our example, we have:

#### Calculate within groups variation

Within-groups variation (SSw) is also calculated in the same way as in an independent ANOVA, expressed as follows:

where xi1 is the score of the ith subject in group 1, xi2 is the score of the ith subject in group 2, and xik is the score of the ith subject in group k. In our case, this is:

#### Calculate within subject variation

As mentioned earlier, we treat each subject as its own block. In other words, we treat each subject as a level of an independent factor called subjects. We can then calculate SS-subjects as follows:

where k = number of conditions, Mean of ith term mean of subject i, and Mean = grand mean. In our case, this is:

#### Calculate SS error

We can now calculate SS-error by substitution:

which, in our case, is:

#### Determining MS-time, MS-error and the F-statistic

To determine the mean sum of squares for time (MS-time) we divide SS-time by its associated degrees of freedom (k - 1), where k = number of time points. In our case:

We do the same for the mean sum of squares for error (MS-error), this time dividing by (n - 1)(k - 1) degrees of freedom, where n = number of subjects and k = number of time points. In our case:

Therefore, we can calculate the F-statistic as:

We can now look up (or use a computer program) to ascertain the critical F-statistic for our F-distribution with our degrees of freedom for time (dftime) and error (dferror) and determine whether our F-statistic indicates a statistically significant result.

### Reporting the Result of a Repeated Measures ANOVA

We report the F-statistic from a repeated measures ANOVA as:

F(df_time, df_error) = F-value, p = p-value

which for our example would be:

F(2, 10) = 12.53, p = .002

This means we can reject the null hypothesis and accept the alternative hypothesis. As we will discuss later, there are assumptions and effect sizes we can calculate that can alter how we report the above result. However, we would otherwise report the above findings for this example exercise study as:

There was a statistically significant effect of time on
exercise-induced fitness, F(2, 10) = 12.53, p = .002.

or

The six-month exercise-training programme had a statistically
significant effect on fitness levels, F(2, 10) = 12.53, p = .002.

Normally, the result of a repeated measures ANOVA is presented in the written text, as above, and not in a tabular form when writing a report.

### Increased Power in a Repeated Measures ANOVA

The major advantage with running a repeated measures ANOVA over an independent ANOVA is that the test is generally much more powerful. This particular advantage is achieved by the reduction in MS-error (the denominator of the F-statistic) that comes from the partitioning of variability due to differences between subjects (SS-subjects) from the original error term in an independent ANOVA (SSw): i.e. SS-error = SSw - SS-subjects. We achieved a result of F(2, 10) = 12.53, p = .002, for our example repeated measures ANOVA. How does this compare to if we had run an independent ANOVA instead? Well, if we ran through the calculations, we would have ended up with a result of F(2, 15) = 1.504, p = .254, for the independent ANOVA. We can clearly see the advantage of using the same subjects in a repeated measures ANOVA as opposed to different subjects. For our exercise-training example, the illustration below shows that after taking away SS-subjects from SSw we are left with an error term (SS-error) that is only 8% as large as the independent ANOVA error term.

This does not lead to an automatic increase in the F-statistic as there are a greater number of degrees of freedom for SSw than SS-error. However, it is usual for SS-subjects to account for such a large percentage of the within-groups variability that the reduction in the error term is large enough to more than compensate for the loss in the degrees of freedom (as used in selecting an F-distribution).

### Effect Size for Repeated Measures ANOVA

It is becoming more common to report effect sizes in journals and reports. Partial eta-squared is where the the SS-subjects has been removed from the denominator:

So, for our example, this would lead to a partial eta-squared of:

## Summary of understanding RMA

For only two groups the repeated-measures ANOVA produces identical results as the paired t-test. Individual differences in subjects (which were measured repeatedly) can be seeing as a hidden variation (measurements error), which should be excluded from the analysis.

### Null hypothesis

H0 The means of repeated (dependent/paired) samples are equal:

This and the next pictures in this sub-chapter originate from here.

### What the F again? Or the formula.

The total variation of the response variable is described by the Total Sum of Squared (SStotal), which consists from variance between (SSbetween) and variance within (SSwithin) subjects. The variance-within is the most interesting for us, otherwise we would have used a normal ANOVA. A part of the variance within subjects can be explained by our subject (SSmodel) and the rest of the variance can not (SSerror). If our model explains most of the variance, where SSmodel (or SSsubject) is larger then SSerror, we can reject the H0 and conclude that the samples means are different. The measure of “explainability” of the model is F:

$F = \frac {MS_{model}} {MS_{error}} = \frac { \frac {SS_{model}} {df_{model}} } { \frac {SS_{error}} {df_{error}} } = \frac { \frac {SS_{model}} {df_{model}} } { \frac {SS_{within}−SS_{model}} {df_{error}} } = \frac { \frac {n \sum_{j=1}^{k} (X_j−X)^2} {k−1} } { \frac { (\sum_{i=1}^{n} \sum_{j=1}^{k} (X_{ij}−X_i)^2) - (n \sum_{j=1}^{k} (X_j−X)^2)} {(k−1)⋅(n−1)} }$

where:

• $$df$$ is the degree of freedom
• $$n$$ is the number of subjects;
• $$k$$ is the number of variables;
• $$X_{ij}$$ is the score of subject i on variable j;
• $$X_i$$ is the mean for subject i;
• $$X_j$$ is the mean of variable j;
• $$X$$ is the grand mean.

If F-value is large, the population means are most likely different and we can reject the H0 if the p-value, which results from F-value, is below 0.05.

Let’s have a look at the example, which we could calculate manually, but was already done on this page, so that we can use computer to do the work and just compare the results. So, please focus on the numbers but not on the methods, they will be explained below:

d <- tibble(
Subjects    = seq(1:6),
Pre         = c(45, 42, 36, 39, 51, 44),
three_month = c(50, 42, 41, 35, 55, 49),
six_month   = c(55, 45, 43, 40, 59, 56)) %>%
pivot_longer(cols = 2:4) %>%
mutate_at(vars(Subjects, name), factor)

The total sum of squares in the normal model below is a sum of SS_model (name) which is the explained variance and SS_error (Residuals) which is the unexplained variance. In the example below it is 143.4 + 715.5 = 858.9 In the case of the normal anova, the SS_total consists of both the between and within variance.

summary(aov(data = d, value ~ name))
##             Df Sum Sq Mean Sq F value Pr(>F)
## name         2  143.4   71.72   1.504  0.254
## Residuals   15  715.5   47.70

In the repeated measurements, the between variance will be excluded from the model, so that SS_within = SS_model + SS_error 143.44 + 57.22 = 200.66, while the total variance of the whole model is still the same 143.44 + 57.22 + 658.3 = 858.96:

summary(aov(data = d, value ~ name + Error(Subjects/name)))
##
## Error: Subjects
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals  5  658.3   131.7
##
## Error: Subjects:name
##           Df Sum Sq Mean Sq F value  Pr(>F)
## name       2 143.44   71.72   12.53 0.00189 **
## Residuals 10  57.22    5.72
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since a model output gives us the Mean Squared errors for both within-subject variable (MS_time) and residuals (MS_error), we could calculate the F-value manually using the formula above: 71.72 / 5.72 = 12.53, but why should we, if to the right of the Mean Squares we can find the F-value itself and even the p-value, so that we don’t need to look it up anywhere.

As sum of squares of the error (residuals) decrease due to outsourcing of the individual error, the F-value will increase resulting in decreasing p-value.

The “manually” calculated results in the previous chapter are identical to what repeated-measurements ANOVA model gave us, so we can trust the p-value and for this example we can reject the H0 and conclude that the means of blood pressure changes significantly over time.

If we have both, a within-subject variable and the between-subject variable (such as gender in the picture below), we calculate all the values for a particular block (e.g. gender = Female)in the same way:

## How to compute Repeated Measures ANOVA

There are five main ways to implement the Repeated measures ANOVA in R 1:

1. aov(depvar ~ predictors), followed by summary() of the result to see a conventional ANOVA table.

2. lm(depvar ~ predictors), followed by anova() or Anova() (car package) to see the results.

3. lme(depvar ~ predictors, further_parameters) from nlme package and then anova() for the result. Used for mixed (fixed & random) effects models, where fixed effects describe the variance between variables, while random effect describe the variance within variables.

4. lmer(depvar ~ predictors, further_parameters)from lme4 package and then anova() for the result. Also usually applied to mixed effects models.

5. ezANOVA(dataframe, dv=.(depvar(s)), wid=.(subject_identifier), within=.(within_subject_variable(s)), between=.(between_subject_variable(s)), other_options) from ez package performs not only repeated measures ANOVA but also validity checks (including Levene’s test, Mauchly’s test for sphericity, and the Greenhouse–Geisser and Huynh–Feldt epsilon corrections).

Here we’ll focus on the last ezANOVA way, because it’s simply the best. But we also will compare it to the first classical aov way, because it’s the oldest and the most used one (so you’ll be able to understand the code of others). The mixed models approaches (lme & lmer) will be treated in the separate post. And the lm approach will be completely ignored, because it is totally ridiculous if applied to the repeated measures.

Load all needed packages at once to avoid interruptions.

library(tidyverse)   # for data wrangling and visualization
library(ez)          # for the best Repeated-measures ANOVA
library(knitr)       # for beautifiying the tables
library(ggpubr)      # for powerful visualisation, like "ggarrange"

Load some data2 and visualize it:

mutate(subj = factor(subj))

ggplot(newborn, aes(language, rate))+
geom_violin()+
geom_dotplot(binaxis='y', stackdir='center', dotsize=.5)

ggplot(bilingual, aes(gender, mlu)) +
geom_violin() +
geom_dotplot(binaxis='y', stackdir='center', dotsize=.5)+
facet_grid(language~age)

### Between-subjects (BS) ANOVA

The between-subjects ANOVA is simply a usual (independent observations) ANOVA, where we did not have any repeated measurements. And despite the fact that this post is about repeated measurement ANOVA, the ez package we use here can also be used for conducting and visualizing a usual ANOVA. The ez package defaults to the “median-centered” ANOVA, which is usually more robust then “mean-centered”.

###### The new easy way

This new way is much better then the old one (presented below) because it automatically checks for assumptions, particularly for the homogeneity of variances. Please, have a look at the “subj” column to see why we needed to change it in the code below.

# here we need to change repeated measures of a subject to unique
ezPlot(data    = bilingual %>% mutate(subj = factor(row_number())),
dv      = mlu,
wid     = subj,
between = .(gender, age),
x       = age,
split   = gender)
## Coefficient covariances computed by hccm()

easy <- ezANOVA(data     = bilingual %>% mutate(subj = factor(row_number())),
dv       = mlu,
wid      = subj,
between  = .(gender, age),
detailed = TRUE,
type     = 3,
return_aov = TRUE)
## Coefficient covariances computed by hccm()
easy$ANOVA %>% kable() Effect DFn DFd SSn SSd F p p<.05 ges (Intercept) 1 114 2492.2156242 218.4438 1300.6211303 0.0000000 * 0.9194131 gender 1 114 0.0021087 218.4438 0.0011005 0.9735945 0.0000097 age 2 114 35.4980118 218.4438 9.2627347 0.0001873 * 0.1397880 gender:age 2 114 1.3183298 218.4438 0.3440006 0.7096618 0.0059989 easy$Levene's Test for Homogeneity of Variance %>%
kable()
DFn DFd SSn SSd F p p<.05
5 114 5.104009 79.95702 1.455424 0.2099949

Interpretation:

• gender does not really matter for the mlu, which we already saw in the plot above
• age is definitely influences mlu, for instance the plot above showed that error bars of preschool and secondgrade do not overlap, which indicates a probably significance. Thus, there is a significant difference between the means of the groups in age.
• the interaction between gender and age do not relate to the mlu
• the high p-value of the Levene’s Test for Homogeneity of Variance indicates equal variances among samples, so that we can continue using results of the ANOVA and do not need to apply a correction Greenhouse-Geisser and Huynh-Feldt or transform the data
• “ges” is the Generalized Eta-Squared measure of effect size, which can be reported if needed. By the way reporting:

Reporting: “Levene’s Test for Homogeneity of Variance indicated that this assumption had not been violated, F = 1.19, p > .05. The results show that the mlu was significantly affected by the age of the subject, F = 9.26, p < .05, η2 = 0.14.

If you need a post-hoc for this new method, you can use the pairwise comparisons as shown below:

###### Post-hoc tests
data    = bilingual %>% mutate(subj = factor(row_number()))
pairwise.t.test(data$mlu, interaction(data$age, data$gender), paired=T, p.adjust.method = "holm") ## ## Pairwise comparisons using paired t tests ## ## data: data$mlu and interaction(data$age, data$gender)
##
## preschool.F   1.0000       -           -             -            -
## secondgrade.F 0.2085       0.0130      -             -            -
## firstgrade.M  1.0000       1.0000      1.0000        -            -
## preschool.M   1.0000       1.0000      0.1608        0.4538       -
## secondgrade.M 0.6974       0.1608      1.0000        0.0261       0.0017
##
## P value adjustment method: holm

Interpretations: the results show that indeed, the preschool and secondgrade in both males and females are significantly different, which we suspected already in the visualization part of our analysis. Besides, the error bars of males which are further apart then the error bars of females (still talking about between preschool vs. secondgrade), are in line with the smaller p-value for males (p = 0.0017) compared to the females (p = 0.014), indicating more difference among males then among females, while both are still significant. Moreover, as we could see from the plot, but unfortunately not from the anova output (that is why the visualization is sooo important!), the error bars of firstgrade males did not overlap with the secondgrade males, which was approved by the post-hoc test (p = 0.03)

###### The old hard way
hard <- aov(mlu ~ gender*age,
data=bilingual %>% mutate(subj = factor(row_number())))
anova(hard) %>% kable()
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 0.0021087 0.0021087 0.0011005 0.9735945
age 2 35.4980118 17.7490059 9.2627347 0.0001873
gender:age 2 1.3183298 0.6591649 0.3440006 0.7096618
Residuals 114 218.4437685 1.9161734 NA NA
plot(TukeyHSD(hard), las=1, cex.axis=0.4)

### One within-subjects (WS) factor

This is our first Repeated-measures ANOVA (with one factor), sometimes also called a randomized complete block (RCB) design (with one factor), or a single-factor within-subjects design, or One-way ANOVA within subject. I hate statistics for having so many different synonyms, because it creates an illusions of complexity which scares of and prevents many scientist from using more stats in their research and compare the results among each other. I could imagine that the world misses lots of useful discoveries because of such abundance of terms meaning the same thing!

###### The new easy way

So, if we want to compare several groups, why should we care about the repeated measurements? Repeated or not, they still belong to different groups, right? Yes, they do, but let’s visualize the groups with (plot A below) and without (plot B) accounting for repeated measurement. If you compare the error bars of both groups, the plot A which accounted for individual differences (a within subject variance) and the plot B which did not account for individual differences (a between subject variance of a usual ANOVA model), you’ll see that plot A has smaller and non-overlapping (probably significantly different) bars! It’s because the variance of repeated measurements was subtracted from the between subject variance of the model. This made the between subject variance smaller and so made the differences between groups more distinguishable. Thus, accounting for repeated measurements improves ANOVA’s power and allows to find the real difference between groups, which could have been easily overlooked.

ggarrange(
ezPlot(bilingual, dv = mlu, wid = subj, within  = language, x = language, type = 3),
ezPlot(bilingual, dv = mlu, wid = subj, between = language, x = language, type = 3),
labels = c("A", "B"))
## Coefficient covariances computed by hccm()

easy_between <- ezANOVA(bilingual %>% mutate(subj = factor(row_number())),
dv       = mlu,
wid      = subj,
between  = language,
detailed = TRUE,
type     = 3)
## Coefficient covariances computed by hccm()
easy_within <- ezANOVA(bilingual,
dv       = mlu,
wid      = subj,
within   = language,
detailed = TRUE,
type     = 3)

easy_between$ANOVA %>% kable() Effect DFn DFd SSn SSd F p p<.05 ges (Intercept) 1 118 2492.215624 248.0652 1185.500809 0.0000000 * 0.9094745 language 1 118 7.197062 248.0652 3.423509 0.0667757 0.0281948 easy_within$ANOVA %>%
kable()
Effect DFn DFd SSn SSd F p p<.05 ges
(Intercept) 1 19 830.738541 43.274581 364.741421 0.000000 * 0.9435331
language 1 19 2.399021 6.442012 7.075645 0.015465 * 0.0460327

If you compare the results of “between” and “within” models, you see exactly that - the “between” model missed the significant differences in language (p = 0.067), which we have also seen in the overlapping error bars on the plot B, while the “within” model caught this difference (p = 0.015). Thus, the discoveries made by this two models are fundamentally different and in fact the opposite!

###### The old hard way

The “old hard way” ANOVA also delivers completely opposite results if we don’t account for the individual differences (repeated measurements):

hard <- aov(mlu ~ language, data=bilingual )
anova(hard) %>%
kable()
Df Sum Sq Mean Sq F value Pr(>F)
language 1 7.197062 7.197062 3.423509 0.0667757
Residuals 118 248.065157 2.102247 NA NA
hard <- aov(mlu ~ language + Error( subj/language ), data=bilingual )
summary(hard)
##
## Error: subj
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 19  129.8   6.833
##
## Error: subj:language
##           Df Sum Sq Mean Sq F value Pr(>F)
## language   1  7.197   7.197   7.076 0.0155 *
## Residuals 19 19.326   1.017
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 80  98.92   1.236

### Two WS factors

###### The new easy way

If we have two factors, the plots can quickly become very cramped, thus, I visualized the within subject two factors models in two separate plots, which can already deliver some ideas of what could significantly differ. Though, we need to conduct the test in order to be sure (get p-values).

ggarrange(
ezPlot(data = bilingual, dv = mlu, wid = subj, within = .(age, language),
x = age, split = language),
ezPlot(data = bilingual, dv = mlu, wid = subj, within =.(age, language),
x = language, split = age),
labels = c("A", "B"), ncol = 1)

We want to know if the “age” and the “language” of the subjects affect the “mlu”:

# set orthogonal contrasts
options(contrasts = c("contr.sum", "contr.poly"))

easy <- ezANOVA(data     = bilingual,
dv       = mlu,
wid      = subj,
within   = .(age, language),
detailed = TRUE,
type     = 3,
return_aov = TRUE)

easy$ANOVA %>% kable() Effect DFn DFd SSn SSd F p p<.05 ges (Intercept) 1 19 2492.215624 129.82374 364.741421 0.0000000 * 0.9225955 age 2 38 35.498012 37.96796 17.763985 0.0000036 * 0.1451318 language 1 19 7.197062 19.32604 7.075645 0.0154650 * 0.0332750 age:language 2 38 3.473570 21.97584 3.003200 0.0615301 0.0163410 easy$Mauchly's Test for Sphericity %>%
kable()
Effect W p p<.05
2 age 0.9877079 0.8946573
4 age:language 0.9889349 0.9047097
easy$Sphericity Corrections %>% kable() Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05 2 age 0.9878571 0.000004 * 1.101787 0.0000036 * 4 age:language 0.9890560 0.062193 1.103351 0.0615301 ###### Post-hoc tests: Looking at the plots, I expect at least 10 combinations of factor groups to be significantly different. But the pairwise comparison t-tests give me only 5. pairwise.t.test(bilingual$mlu, interaction(bilingual$age, bilingual$language),
##
##  Pairwise comparisons using paired t tests
##
## data:  bilingual$mlu and interaction(bilingual$age, bilingual$language) ## ## firstgrade.home.only preschool.home.only ## preschool.home.only 1.00000 - ## secondgrade.home.only 1.00000 0.00917 ## firstgrade.school 1.00000 0.40594 ## preschool.school 1.00000 1.00000 ## secondgrade.school 0.00209 0.00087 ## secondgrade.home.only firstgrade.school preschool.school ## preschool.home.only - - - ## secondgrade.home.only - - - ## firstgrade.school 1.00000 - - ## preschool.school 0.20420 0.77050 - ## secondgrade.school 0.05458 0.00065 0.00120 ## ## P value adjustment method: bonferroni Interpretation and reporting: Mauchly’s Test for Sphericity indicated that the assumption of sphericity was not violated for , W = 0.99, p > .05 for age and age:language interaction. There was a significant main effect of age on mlu, F = 17.8, p = 0.0000036, and a significant main effect of language, F = 7.1, p = 0.015. The interaction effect between age and language wasn’t significant, F = 3, p = 0.062. If the Sphericity assumption was violated mildly (Greenouse-Geisser-e > 0,75) the liberal Huyn-Feldt-correction suppose to be taken. If the Sphericity assumption was violated heavily (Greenouse-Geisser-e < 0,75) the conservative Greenhouse- Geisser-correction suppose to be taken. Bonferroni post-hoc tests revealed significant differences between following groups: • secondgrade.school - firstgrade.home.only, p = 0.00209 • secondgrade.home.only & preschool.home.only, p = 0.00917 • secondgrade.school - preschool.home.only, p = 0.00087 • secondgrade.school - firstgrade.school, p = 0.00065 and • secondgrade.school - preschool.school, p = 0.00120 The pair “secondgrade.home.only - secondgrade.school” is almost significant. But I don’t know if we can trust the t-test pairwise comparisons, because they most likely weren’t made for repeated measurements. The special pairwise comparison for repeated-measurements below and a multi-level model both found 6 differences, but again, not 10, so I will have to trust it for now, first, because the error bars on the plot do not always mean a statistical significance, and because I prefer not to do a more advanced model in this post: # set orthogonal contrasts options(contrasts = c("contr.sum", "contr.poly")) library(emmeans) model <- easy$aov
emm <- emmeans(model, ~ age*language)
# all pairwise comparisons
as_tibble() %>%
filter(p.value < 0.051) %>%
kable()
contrast estimate SE df t.ratio p.value
preschool,home.only - secondgrade,home.only -0.9105594 0.2808441 70.95017 -3.242224 0.0271260
preschool,home.only - secondgrade,school -1.8586634 0.3057805 67.96831 -6.078424 0.0000009
preschool,school - secondgrade,school -1.7250821 0.2808441 70.95017 -6.142490 0.0000006
###### Fancy Post-hoc tests :)

You can choose the p-value correction method and plot all the p-values of all the pairwise comparisons at once using the emm object you already have:

# 7. Visualize post-hoc results with a Pairwise P-value plot
pwpp(emm, type = "response", adjust = "bonferroni")+theme_bw()

Below is just a quick glimpse into the doing the same analysis with the Mixed-Effect-Model. But you can skip it, because it will be treated in deapth in separate posts. All within-subject factors supposed to be entered in the model as an “Error” tirm after the “subject” - repeated measures. In our case these are: (1|subj) + (1|age:subj) + (1|language:subj). The between-subject can remain just the part of the formula.

mlm <- lmer(data = bilingual, mlu ~ age*language + (1|subj) + (1|age:subj) + (1|language:subj) )
###### The old hard way
summary(aov(mlu ~ age*language + Error(subj/(age*language)),
data=bilingual ) )
##
## Error: subj
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 19  129.8   6.833
##
## Error: subj:age
##           Df Sum Sq Mean Sq F value   Pr(>F)
## age        2  35.50  17.749   17.76 3.58e-06 ***
## Residuals 38  37.97   0.999
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: subj:language
##           Df Sum Sq Mean Sq F value Pr(>F)
## language   1  7.197   7.197   7.076 0.0155 *
## Residuals 19 19.326   1.017
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: subj:age:language
##              Df Sum Sq Mean Sq F value Pr(>F)
## age:language  2  3.474  1.7368   3.003 0.0615 .
## Residuals    38 21.976  0.5783
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Error(subj/(age*language)) statement is used to separate the residual sums of squares into several components. The statement is equivalent to Error(subj + subj:age + subj:language + subj:age:language), meaning that we want to separate the following error terms: one for subject, one for subject by age interaction, one for subject by language interaction, and one for subject by age by language interaction.

### One BS and one WS factor

Sometimes called a split-plot design, or mixed two-factor within-subjects design, or repeated measures analysis using a split-plot design, or an univariate mixed models approach with subject as a random effect. Oh God, do statisticians speak different (statistical) languages?

Some predictors are not or cannot be replicated. For instance gender can hardly be measured within-subjects. In this case we use so-called mixed-design ANOVA analysis, namely between and withing subject. Conducting a mixed-design ANOVA in R is easy: we just add the between-subject variable(s) to the model formula, but exclude it from the error term.

This is actually our first mixed-effects model! And since it’s sometimes called split-plot design, below I split the plot into categories of the between-subject factor variable.

###### The new easy way
ezPlot(data     = bilingual,
dv       = mlu,
wid      = subj,
between  = gender,
within   = age,
x        = age,
col      = gender)

easy <- ezANOVA(data     = bilingual,
dv       = mlu,
wid      = subj,
between  = gender,
within   = age,
detailed = TRUE,
type     = 3)

easy$ANOVA %>% kable() Effect DFn DFd SSn SSd F p p<.05 ges (Intercept) 1 18 1246.1078121 64.91082 345.5501167 0.0000000 * 0.9373859 gender 1 18 0.0010543 64.91082 0.0002924 0.9865458 0.0000127 age 2 36 17.7490059 18.32481 17.4343981 0.0000051 * 0.1757595 gender:age 2 36 0.6591649 18.32481 0.6474809 0.5293489 0.0078570 easy$Mauchly's Test for Sphericity %>%
kable()
Effect W p p<.05
3 age 0.9937147 0.9478173
4 gender:age 0.9937147 0.9478173
easy$Sphericity Corrections %>% kable() Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05 3 age 0.993754 0.0000054 * 1.116711 0.0000051 * 4 gender:age 0.993754 0.5284531 1.116711 0.5293489 ###### The old hard way summary(aov(mlu ~ gender*age + Error( subj/(age)), data=bilingual) ) ## ## Error: subj ## Df Sum Sq Mean Sq F value Pr(>F) ## gender 1 0.0 0.002 0 0.987 ## Residuals 18 129.8 7.212 ## ## Error: subj:age ## Df Sum Sq Mean Sq F value Pr(>F) ## age 2 35.50 17.749 17.434 5.07e-06 *** ## gender:age 2 1.32 0.659 0.647 0.529 ## Residuals 36 36.65 1.018 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Error: Within ## Df Sum Sq Mean Sq F value Pr(>F) ## Residuals 60 51.97 0.8662 ### Two BS factors and one WS factor ###### The new easy way ezPlot(data = bilingual, dv = mlu, wid = subj, between = .(gender, language), within = age, x = age, split = language, col = gender) easy <- ezANOVA(data = bilingual, dv = mlu, wid = subj, between = .(gender,language), within = age, detailed = TRUE, type = 3) easy$ANOVA %>%
kable()
Effect DFn DFd SSn SSd F p p<.05 ges
1 (Intercept) 1 36 2492.2156242 148.93765 602.3981288 0.0000000 * 0.9236697
2 gender 1 36 0.0021087 148.93765 0.0005097 0.9821129 0.0000102
3 language 1 36 7.1970622 148.93765 1.7396155 0.1955163 0.0337654
5 age 2 72 35.4980118 57.01423 22.4141995 0.0000000 * 0.1470202
4 gender:language 1 36 0.2100207 148.93765 0.0507645 0.8230125 0.0010187
6 gender:age 2 72 1.3183298 57.01423 0.8324214 0.4391372 0.0063604
7 language:age 2 72 3.4735702 57.01423 2.1932861 0.1189480 0.0165862
8 gender:language:age 2 72 1.6112297 57.01423 1.0173647 0.3666841 0.0077626
easy$Mauchly's Test for Sphericity %>% kable() Effect W p p<.05 5 age 0.9980459 0.9663492 6 gender:age 0.9980459 0.9663492 7 language:age 0.9980459 0.9663492 8 gender:language:age 0.9980459 0.9663492 easy$Sphericity Corrections %>%
kable()
Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05
5 age 0.9980497 0.0000000 * 1.05658 0.0000000 *
6 gender:age 0.9980497 0.4389428 1.05658 0.4391372
7 language:age 0.9980497 0.1190596 1.05658 0.1189480
8 gender:language:age 0.9980497 0.3665707 1.05658 0.3666841
###### The old hard way

“And now the more complex methods start to become a bit pointless. First, aov: the problem is that this provides Type 1 Sum of Squares (so analyzing A * B * U differs from analyzing B * A * U), and the drop1() command doesn’t like the multi-stratum output from aov()”3. We will though below present the last case using the old aov methodology.

summary(aov(mlu ~ gender*age*language + Error( subj/(language)),
data=bilingual) )
##
## Error: subj
##           Df Sum Sq Mean Sq F value Pr(>F)
## gender     1    0.0   0.002       0  0.987
## Residuals 18  129.8   7.212
##
## Error: subj:language
##                 Df Sum Sq Mean Sq F value Pr(>F)
## language         1  7.197   7.197   6.777  0.018 *
## gender:language  1  0.210   0.210   0.198  0.662
## Residuals       18 19.116   1.062
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
##                     Df Sum Sq Mean Sq F value   Pr(>F)
## age                  2  35.50  17.749  22.414 2.71e-08 ***
## gender:age           2   1.32   0.659   0.832    0.439
## age:language         2   3.47   1.737   2.193    0.119
## gender:age:language  2   1.61   0.806   1.017    0.367
## Residuals           72  57.01   0.792
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

### One BS factor and two WS factors

###### The old hard way
ezPlot(data    = bilingual,
dv      = mlu,
wid     = subj,
between = gender,
within  = .(language, age),
x       = age,
split   = language,
col     = gender)

easy <- ezANOVA(data     = bilingual,
dv       = mlu,
wid      = subj,
between  = gender,
within   = .(language, age),
detailed = TRUE,
type     = 3,
return_aov = TRUE)

easy$ANOVA %>% kable() Effect DFn DFd SSn SSd F p p<.05 ges 1 (Intercept) 1 18 2492.2156242 129.82164 345.5501167 0.0000000 * 0.9236697 2 gender 1 18 0.0021087 129.82164 0.0002924 0.9865458 0.0000102 3 language 1 18 7.1970622 19.11602 6.7768891 0.0179763 * 0.0337654 5 age 2 36 35.4980118 36.64963 17.4343981 0.0000051 * 0.1470202 4 gender:language 1 18 0.2100207 19.11602 0.1977594 0.6618365 0.0010187 6 gender:age 2 36 1.3183298 36.64963 0.6474809 0.5293489 0.0063604 7 language:age 2 36 3.4735702 20.36461 3.0702417 0.0587292 0.0165862 8 gender:language:age 2 36 1.6112297 20.36461 1.4241442 0.2539516 0.0077626 easy$Mauchly's Test for Sphericity %>%
kable()
Effect W p p<.05
5 age 0.9937147 0.9478173
6 gender:age 0.9937147 0.9478173
7 language:age 0.9905139 0.9221786
8 gender:language:age 0.9905139 0.9221786
easy\$Sphericity Corrections %>%
kable()
Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05
5 age 0.9937540 0.0000054 * 1.116711 0.0000051 *
6 gender:age 0.9937540 0.5284531 1.116711 0.5293489
7 language:age 0.9906031 0.0592938 1.112534 0.0587292
8 gender:language:age 0.9906031 0.2540382 1.112534 0.2539516
###### The old hard way
summary(aov(mlu ~ gender*language*age + Error( subj/(language*age)),
data=bilingual) )
##
## Error: subj
##           Df Sum Sq Mean Sq F value Pr(>F)
## gender     1    0.0   0.002       0  0.987
## Residuals 18  129.8   7.212
##
## Error: subj:language
##                 Df Sum Sq Mean Sq F value Pr(>F)
## language         1  7.197   7.197   6.777  0.018 *
## gender:language  1  0.210   0.210   0.198  0.662
## Residuals       18 19.116   1.062
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: subj:age
##            Df Sum Sq Mean Sq F value   Pr(>F)
## age         2  35.50  17.749  17.434 5.07e-06 ***
## gender:age  2   1.32   0.659   0.647    0.529
## Residuals  36  36.65   1.018
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: subj:language:age
##                     Df Sum Sq Mean Sq F value Pr(>F)
## language:age         2  3.474  1.7368   3.070 0.0587 .
## gender:language:age  2  1.611  0.8056   1.424 0.2540
## Residuals           36 20.365  0.5657
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is no need to extend our cases to more then two levels of between or withing factors, because, first, it will work similarly to the examples above, secondly, the data becomes to complex and will most likely violate some of the assumptions.

Despite the ability of the old aov() way to cover a variety of experimental designs, it has numerous disadvantages. Particularly, it won’t run any diagnostics, so that even if it violates some assumptions, you won’t see it. Moreover, it won’t report any eﬀect size and won’t be able to handle unbalanced designs or missing values. The new ezANOVA() way is much better in a lot of ways, however, if design is unbalanced (missing values exist), the results become diﬃcult to interpret.

One reasonable course of action when ANOVA design becomes too complicated, or assumptions are violated at some level is to switch to so-called mixed-eﬀect models which also oﬀer some other beneﬁts.

This all will inevitably push us towards the mixed-effects model - a natural next step in the analysis of complex experiments. But until then, let’s have a look at a model comparison and think about why it’s useful.

## Model comparison

One useful way to find out whether a particular factor is significant, is to compare models. For instance make a base-intercept only model and a model with one of the factors, then compare both using anova(m1, m2). If the model with a particular factor is significantly different from the base model, this factor is significant.

m1 <- aov(mlu ~ 1, data=bilingual)
m2 <- aov(mlu ~ age, data=bilingual)
m3 <- aov(mlu ~ gender, data=bilingual)

anova(m1, m2)
## Analysis of Variance Table
##
## Model 1: mlu ~ 1
## Model 2: mlu ~ age
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)
## 1    119 255.26
## 2    117 219.76  2    35.498 9.4494 0.000157 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m1, m3)
## Analysis of Variance Table
##
## Model 1: mlu ~ 1
## Model 2: mlu ~ gender
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1    119 255.26
## 2    118 255.26  1 0.0021087 0.001 0.9751

In two cases above, the age seems to be a significant predictor, while the gender is not. However, the comparison between two or more models will only be valid if they are fitted to the same dataset. This may be a problem if there are missing values in one of the variables.

## Predictions

Copy pasted code from the manual of the ez package

#Run ezBoot on the accurate RT data
data(ANT)
rt = ezBoot(
data = ANT
, dv = rt
, wid = subnum
, within = .(cue,flank)
, between = group
, iterations = 1e1 #1e3 or higher is best for publication
)
##
|                                                    |  0%
|=====                                               | 10% ~1 s remaining
|==========                                          | 20% ~1 s remaining
|===============                                     | 30% ~1 s remaining
|====================                                | 40% ~1 s remaining
|==========================                          | 50% ~1 s remaining
|===============================                     | 60% ~1 s remaining
|====================================                | 70% ~0 s remaining
|=========================================           | 80% ~0 s remaining
|==============================================      | 90% ~0 s remaining
|====================================================|100% ~0 s remaining
|====================================================|100%                      Completed after 1 s
#plot the full design
p = ezPlot2(
preds = rt
, x = flank
, split = cue
, col = group
)
print(p)

#plot the effect of group across the flank*cue design
p = ezPlot2(
preds = rt
, x = flank
, split = cue
, diff = group
)
print(p)

#plot the flank*cue design, averaging across group
p = ezPlot2(
preds = rt
, x = flank
, split = cue
)
print(p)

## Assumptions or when NOT to use the Repeated measures ANOVA

• when the predictors are correlated: then the sums of squares are calculated differently (Type I, Type II or Type III):

• when you have completely unique (independent) observations, use an usual ANOVA

• when you have missing values, remove or fill them somehow

• when your responce variable is ordinal, use a Friedman test

• when you have outliers, remove them

• when you have an unbalanced design (NAs or simply different number of observations per group), make sure to use Type = 3 sum of squares calculation. By the way, don’t panic if aov throws some strange message at you with unbalanced design, but make sure to use other methods with Type = 3 Sum of Squares! Repeated measures are handled much better with linear mixed models

• when sphericity (homogeneity of variances) is present: when the variances of differences among the levels of within-subject factors are not equal. In case of present sphericity we can:

1. adjust the degrees of freedom of our F-ratio4 :
• Greenhouse-Geisser correction (too conservative)
• Huynh-Feldt correction (too liberal)
• The average of these two (weird, but kinda works) For the post-hoc tests, only Bonferroni seems to work well. Or…
1. use mixed-effect models, sometimes called multi-level models

Levene’s Test for Homogeneity helps to check this assumption for only two levels of the independent variable. And Mauchly’s Test for Sphericity helps to check this assumption if there are more than 2 levels. It is part of the ezANOVA function. When the significance level of Levene’s or Mauchly’s test is < 0.05 then sphericity assumption is violated, while if p-value > 0.05 then the null hypothesis stating that there is no difference in variances for all pairwise group comparisons is accepted. Note: the sphericity assumption is only relevant to the univariate (one-way) RM-ANOVA. The alternative, a multivariate test, does not require the assumption of sphericity and is therefore recommended (I am not completely sure about that, just found that somewhere while learning). Sphericity is met when the variances of the samples are roughly equal. Violating sphericity reduces the power and increases the probability of a Type 2 error.

We can write up our results (not the exercise example), where we have included Mauchly’s Test for Sphericity as:

Mauchly's Test of Sphericity indicated that the assumption of sphericity
had been violated, χ2(2) = 22.115, p < .0005, and therefore, a Greenhouse-Geisser
correction was used. There was a significant effect of time on cholesterol
concentration, F(1.171, 38) = 21.032, p < .0005.
• Normality: each level of each within-subject independent variable needs to be approximately normally distributed. E.g. in a variable A with three levels each level should be checked, if variable B has four other levels, then each combination of levels supposed to be checked: shapiro.test(our_data[A=="one_of_the_levels" & B=="one_of_the_levels"])

### Conclusion: be careful!

As you could get from the assumptions section, Repeated Measures ANOVA is not always the best statistical analysis for repeated measure designs. It is vulnerable to missing values, imputation, nonequivalent time points between subjects and violations of sphericity. Thus, don’t use a repeated measures ANOVA if you understand and are able to do mixed-effects models. They are more flexible, accurate, powerful and deliver more results.