## Previous topics

The *two-sample Wilcoxon-Mann-Whitney test (WMW-test)* is the non-parametric equivalent of the *two-sample unpaired t-test*, so understanding the *t-test* first will teach you about the importance of the normal distribution. Besides, I’d recommend getting familiar with the *two-sample paired Wilcoxon test* in order to be able to differentiate paired vs. unpaired samples. Since the *two-sample paired Wilcoxon test* is just a *one-sample Wilcoxon test* of the difference between two samples, examining how far this difference is away from zero, I also recommend checking out the *one-sample Wilcoxon test*.

## Why do we need it? What are the benefits?

To **compare two non-normally** distributed (skewed / not bell-shaped) **unpaired samples**. Since the two samples are not connected with each other in any way, they **don’t need to be the same size**. Comparing them answers the question of whether there is a **difference between them**, which can help to figure out whether, e.g., the samples stem from different populations. Compared to the *unpaired t-test*, the *WMW-test* is also **more robust to outliers**.

## When do we need *Wilcoxon-Mann-Whitney test*?

We need it if small samples (n < 30) are non-normally (not bell-shaped) distributed or if outliers are present. Thus, let’s first visualize our two samples and check whether they are normally distributed or have outliers.


We’ll compare the strength (horsepower, `hp`) of cars with automatic (`am = 0`) vs. manual (`am = 1`) gearboxes. **Our question** here is: are cars with an automatic gearbox as strong as cars with a manual gearbox, or do they differ?

Load all needed packages at once to avoid interruptions.

```
library(tidyverse) # for data wrangling and visualization
library(broom)     # for tidy test output
library(knitr)     # for nice looking tables
library(ggpubr)    # for QQplot and theme_pubr()
```

```
# make the transmission variable (am) a factor / categorical
mtcars <- mtcars %>%          # "%>%" means - then. It simplifies programming by separating steps
  mutate_at(vars(am), factor) # "mutate_at" modifies columns

# visualize the distributions of both samples. "grid.arrange" from the "gridExtra"
# package & "ncol" (number of columns) simply place the two plots next to each other
gridExtra::grid.arrange(
  ggplot(mtcars, aes(am, hp)) + geom_boxplot() + theme_pubr(),
  ggplot(mtcars, aes(hp, color = am)) + geom_density() + theme_pubr(),
  ncol = 2)
```

In terms of horsepower, cars with different transmissions differ on both plots. The boxplot on the left shows some **outliers** in the manual-gearbox group, while the density plot on the right shows that the manual-gearbox sample is not normally distributed but **skewed** to the right. Thus, it looks like **we can’t use the unpaired t-test**. But to be even more sure of that, let’s conduct two more checks: a visual normality check using a `QQplot` and the *Shapiro-Wilk normality test*.

`ggqqplot(mtcars, x = "hp", facet.by = "am")`

```
mtcars %>%
  group_by(am) %>%
  do(tidy(shapiro.test(.$hp))) %>% # ".$" means - the current dataset; tidy beautifies the test output
  kable()                          # kable beautifies the resulting table
```

am | statistic | p.value | method |
---|---|---|---|
0 | 0.9583485 | 0.5402948 | Shapiro-Wilk normality test |
1 | 0.7675804 | 0.0028804 | Shapiro-Wilk normality test |

The QQplot confirms the presence of **outliers**, and both the QQplot and the low *p-value* of the normality test (p-value = 0.00288, a bad thing here) indicate the **non-normality of the distribution** of the manual-gearbox cars.

Thus, **here we really need a non-parametric alternative to unpaired t-test, which is WMW-test.**

## How *Wilcoxon-Mann-Whitney test* works and why it’s called “rank-sum” and “U”

*WMW-test* in only 4 steps: ^{1}

1. **Rank the values of both samples** from low to high, no matter which group each value belongs to.
2. **Sum the ranks** for both samples separately: **R _{1}** & **R _{2}**. This is where the **rank-sum** part of the name comes from. Which sample is **R _{1}** is irrelevant.
3. Calculate the test statistic: the **W-value** for n < 20 or the **z-score** for n > 20.

The **W-value** (also called the *U-value*) is calculated for each rank-sum. This is where the **U** part of the name comes from:

\[{\displaystyle W_{1}=R_{1}-{n_{1}(n_{1}+1) \over 2}\,\!} \\ {\displaystyle W_{2}=R_{2}-{n_{2}(n_{2}+1) \over 2}\,\!}\]

**z-score**

\[z={\frac {W-m_{W}}{\sigma _{W}}}\] where \(m_{W}\) is: \[{\displaystyle m_{W}={\frac {n_{1}n_{2}}{2}}}\]

and \(\sigma _{W}\) is:

\[\sigma _{W}={\sqrt {n_{1}n_{2}(n_{1}+n_{2}+1) \over 12}}\]

4. **Get the *p-value*** from a *W-table* or a *z-table*. This part will show you **whether the cars with automatic transmission differ in strength from cars with manual transmission**.
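The first three steps can be sketched on a tiny made-up dataset (two hypothetical samples of 3 and 4 values, not part of `mtcars`):

```r
# toy data: two small made-up samples (n1 = 3, n2 = 4)
a <- c(3, 5, 8)
b <- c(4, 9, 12, 15)

r  <- rank(c(a, b))   # step 1: rank all values, ignoring group membership
R1 <- sum(r[1:3])     # step 2: rank-sum of sample a -> 1 + 3 + 4 = 8
R2 <- sum(r[4:7])     #         rank-sum of sample b -> 2 + 5 + 6 + 7 = 20

W1 <- R1 - 3 * (3 + 1) / 2  # step 3: W-values -> 8 - 6 = 2
W2 <- R2 - 4 * (4 + 1) / 2  #                    20 - 10 = 10
c(W1 = W1, W2 = W2)
```

With the samples this small you would look up the W-value in a *W-table*; the z-score formula is only appropriate for larger samples.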

The code below shows the step-by-step calculation procedure in self-explanatory plain English. You can execute this code step by step to see the result of every step by simply running everything above the needed line. For example, if you leave out the last three lines (and the `%>%` before them), you’ll see the whole table with all the calculated columns: n, rank_sum etc.

```
# calculate the test statistics manually
mtcars %>%
  mutate(rank = rank(hp)) %>% # mutate means - create a new column
  group_by(am) %>%
  summarise(n = n(),
            rank_mean = mean(rank),
            hp_mean = mean(hp),
            rank_median = median(rank),
            hp_median = median(hp),
            rank_sum = sum(rank),
            hp_sum = sum(hp)) %>%
  mutate(W = rank_sum - (n * (n + 1)) / 2,
         z = (W - (.$n[1] * .$n[2]) / 2) /
             sqrt(.$n[1] * .$n[2] * (.$n[1] + .$n[2] + 1) / 12)) %>%
  kable()
```

am | n | rank_mean | hp_mean | rank_median | hp_median | rank_sum | hp_sum | W | z |
---|---|---|---|---|---|---|---|---|---|
0 | 19 | 19.26316 | 160.2632 | 21 | 175 | 366 | 3045 | 176 | 2.014394 |
1 | 13 | 12.46154 | 126.8462 | 11 | 109 | 162 | 1649 | 71 | -2.014394 |

```
# interestingly ;) the two W-values always add up to n1 * n2
176 + 71 == 19 * 13
```

`## [1] TRUE`

## How to compute *Wilcoxon-Mann-Whitney test*

Just for the sake of curiosity, we’ll also apply the *unpaired t-test* to our dataset and compare the results. Before executing both the *WMW-test* and the *t-test*, have a look at the very first visualization one more time and answer this simple question: do you think the cars actually differ? Then execute the code below.

**H _{0}** - null hypothesis: the samples are similar (have similar distributions), so that there is a 50% chance that a random value from one sample is higher than a random value from the other

**H _{alt}** - alternative hypothesis: samples are different

```
bind_rows(
  wilcox.test(data = mtcars, hp ~ am, conf.int = T, exact = F) %>% tidy(),
  t.test(data = mtcars, hp ~ am, conf.int = T, exact = F) %>% tidy() %>%
    select(-estimate1, -estimate2, -parameter) # "minus" de-selects (removes) columns
) %>% kable()
```

estimate | statistic | p.value | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|
55.00007 | 176.000000 | 0.0457013 | 0.0000862 | 91.99995 | Wilcoxon rank sum test with continuity correction | two.sided |
33.41700 | 1.266189 | 0.2209796 | -21.8785802 | 88.71259 | Welch Two Sample t-test | two.sided |

### Interpretation

- **estimates** show the difference between the samples. While the *t-test* reports 33.4 [horsepower] as the difference between the means of cars with automatic vs. manual gearboxes, the *WMW-test* found a bigger difference of 55 [horsepower]. Thus, the *WMW-test* saw the same difference we have seen in the first visualization, while the *t-test* missed it.
- the **test statistic** of the *WMW-test*, W = 176, is identical to the one we calculated manually above. Thus, we are confident the test is fine and our calculation is correct. The *W-value* is large, so we expect a significant difference between the samples, which is confirmed by a…
- **p-value** of the *WMW-test* (*p-value* = 0.0457), which **rejects the H _{0}** that the cars in both groups are similar and **accepts the H _{alt}** saying that the cars differ. Another confirmation of the existing difference between the groups is the **absence of zero within the 95% confidence interval**. Thus, we can be 95% sure that this difference did not happen by chance.
- in contrast, **the *t-test* did not find enough difference** between the two groups and therefore shows a non-significant *p-value* of 0.2. The result of the *t-test* is not surprising, since we saw that two important assumptions of the *t-test* (normality and no outliers) are violated by our dataset.
- the **z-score** is approximated by every statistical software, so a manually calculated z-score might differ a little. The main point here is that both the *W-value* and the *z-score* allow us to look up the *p-value*, which we need to determine the significance of the difference. Usually, if the *z-score* is far less than -1.96, or much greater than 1.96, we can reject the null hypothesis. But if the *z-score* is close to 2 or -2, I’d recommend trusting the test. The `coin` package provides a `wilcox_test` function, which delivers the *z-score*:

`coin::wilcox_test(data = mtcars, hp ~ am, conf.int = T)`

```
##
## Asymptotic Wilcoxon-Mann-Whitney Test
##
## data: hp by am (0, 1)
## Z = 2.0174, p-value = 0.04366
## alternative hypothesis: true mu is not equal to 0
## 95 percent confidence interval:
## 3.332906e-05 8.900007e+01
## sample estimates:
## difference in location
## 54.99999
```

Our manually calculated *z-score* = 2.0144, while the *z-score* from the test = 2.0174, which are practically the same (they differ by only 0.3%). Similarly to `wilcox.test`, which uses the *W-value*, `wilcox_test` and its *z-score* also found a significant difference in the strength of cars with automatic vs. manual transmission.

### One-sided *Wilcoxon-Mann-Whitney test*

The default alternative hypothesis of the *WMW-test* is `two.sided` and only says that a difference is present, but does not say whether cars with automatic transmission (`am = 0`) are weaker or stronger than cars with manual transmission (`am = 1`). To find out exactly this, we need to test two new alternative hypotheses (**H _{alt}**):

- automatic transmission is **weaker** than manual = has `less` horsepower
- automatic transmission is **stronger** than manual = has `greater` horsepower

Doing this will add another useful tool to your statistical toolbox, namely the **one-tailed (or one-sided) non-parametric two-sample unpaired Wilcoxon-Mann-Whitney rank-sum U test** (this name is really killing me :).

`coin::wilcox_test(data = mtcars, hp ~ am, conf.int = T, alternative = "less")`

```
##
## Asymptotic Wilcoxon-Mann-Whitney Test
##
## data: hp by am (0, 1)
## Z = 2.0174, p-value = 0.9782
## alternative hypothesis: true mu is less than 0
## 95 percent confidence interval:
## -Inf 84.00006
## sample estimates:
## difference in location
## 54.99999
```

`coin::wilcox_test(data = mtcars, hp ~ am, conf.int = T, alternative = "greater")`

```
##
## Asymptotic Wilcoxon-Mann-Whitney Test
##
## data: hp by am (0, 1)
## Z = 2.0174, p-value = 0.02183
## alternative hypothesis: true mu is greater than 0
## 95 percent confidence interval:
## 6.000042 Inf
## sample estimates:
## difference in location
## 54.99999
```

The corresponding *p-values* for one-tailed *z-scores* can be found in a *z-table*. But hold on: every statistical software provides you with *p-values* anyway, so you’ll never need to look them up. Still, it’s always good to understand how *p-values* originate.

The low *p-value* (0.02) of the **greater-sided** test confirms that the **automatic-gearbox cars are stronger than the manual-gearbox cars**. The *p-value* of the **less-sided** test is **97.8% (p = 0.9782) sure that they are not weaker**.
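To see where these *p-values* come from, we can reproduce them (a sketch, assuming the asymptotic normal approximation that `coin::wilcox_test` uses) directly from the reported *z-score* with `pnorm()`:

```r
z <- 2.0174  # z-score reported by coin::wilcox_test above

p_greater   <- pnorm(z, lower.tail = FALSE)  # one-sided "greater" tail area
p_less      <- pnorm(z)                      # one-sided "less" tail area
p_two_sided <- 2 * p_greater                 # two-sided doubles the smaller tail

# matches the test outputs: 0.02183, 0.9782 and 0.04366
round(c(greater = p_greater, less = p_less, two.sided = p_two_sided), 5)
```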

## Don’t use *Wilcoxon-Mann-Whitney test* if:

- the samples are dependent. In this case apply the *two-sample paired Wilcoxon test*
- the samples are small (n < 30) and normally distributed (or big and near-normal). In this case use the more powerful *two-sample unpaired/independent t-test*. The lower power of the *WMW-test* is due to a slight loss of information, which happens when the real data is replaced by ranks. In fact, if your sample is very small, < 7, you’ll never get a significant result with the *WMW-test* ^{2}. Thus, despite the fact that the *WMW-test* is very robust, please don’t overuse it.
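The very-small-sample limitation is easy to demonstrate: with two samples of three values each there are only choose(6, 3) = 20 equally likely rank allocations under H _{0}, so even perfectly separated toy samples cannot get below an exact two-sided p of 2/20 = 0.1:

```r
# perfectly separated made-up samples (n = 3 each) - still not significant
wilcox.test(c(1, 2, 3), c(7, 8, 9))$p.value  # exact two-sided p = 0.1
```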

## Conclusion

The *two-sample unpaired rank-sum Wilcoxon-Mann-Whitney test* handles skewed distributions and outliers much better than the *unpaired t-test* and is almost as powerful. Since **real-world data is never perfect, non-parametric tests are very important tools** in the toolbox of any data scientist.

## What’s next

- if we have more than two samples to compare, *one-way ANOVA* (if you meet the assumptions) or the *Kruskal-Wallis rank-sum test* (if you don’t meet the assumptions of *ANOVA*) will cover that.