Two-samples Wilcoxon-Mann-Whitney test: compare two independent groups

Previous topics

The two-samples Wilcoxon-Mann-Whitney test (WMW-test) is the non-parametric equivalent of the two-samples unpaired t-test, so understanding the t-test first will show you why the normal distribution matters. Besides, I’d recommend getting familiar with the two-samples paired Wilcoxon test in order to be able to tell paired from unpaired samples. Since the two-samples paired Wilcoxon test is just a one-sample Wilcoxon test of the differences between the two samples, examining how far these differences are from zero, I also recommend checking out the one-sample Wilcoxon test.

Why do we need it? What are the benefits?

It compares two non-normally distributed (skewed / not bell-shaped) unpaired samples. Since the two samples are not connected with each other in any way, they don’t need to be the same size. Comparing them answers the question whether there is a difference between them. Such a difference can help to figure out whether, e.g., the samples stem from different populations. Compared to the unpaired t-test, the WMW-test is also more robust to outliers.

When do we need Wilcoxon-Mann-Whitney test?

We need it if small samples (n < 30) are not normally distributed (not bell-shaped) or if outliers are present. Thus, let’s first visualize our two samples and check whether they are normally distributed and whether they have outliers.

The picture above is borrowed from here.

We’ll compare the strength (horsepower - hp) of cars with automatic (am = 0) vs. manual (am = 1) gearboxes. Our question here is: are cars with an automatic gearbox as strong as cars with a manual gearbox, or do they differ?

Load all needed packages at once to avoid interruptions.

library(tidyverse)   # for data wrangling and visualization
library(broom)       # for tidy test output
library(knitr)       # for nice looking table
library(ggpubr)      # for QQplot
library(gridExtra)   # for placing plots side by side

# make transmission variable (am) a factor / categorical
mtcars <- mtcars %>%          # "%>%" means - then. It simplifies programming by separating steps
  mutate_at(vars(am), factor) # "mutate_at" modifies columns

# visualize distributions of both samples. "grid.arrange" from "gridExtra" 
# package & "ncol" (number of columns) simply place two plots near each other
grid.arrange(
  ggplot(mtcars, aes(am, hp)) + geom_boxplot() + theme_pubr(),
  ggplot(mtcars, aes(hp, color = am)) + geom_density() + theme_pubr(), 
  ncol = 2)

In terms of horsepower, cars with different transmissions differ on both plots. The boxplot on the left shows some outliers in the manual-gearbox group, while the density plot on the right also shows that the manual-gearbox sample is not normally distributed, but skewed to the right. Thus, it looks like we can’t use the unpaired t-test. But to be even more sure of that, let’s conduct two more checks: a visual normality check using a QQ-plot and the Shapiro-Wilk normality test.

ggqqplot(mtcars, x = "hp", facet.by = "am")   # one QQ-plot panel per gearbox type

mtcars %>% 
  group_by(am) %>%                    # run the test for each gearbox type separately
  do(tidy(shapiro.test(.$hp))) %>%    # ".$" means - the current dataset; tidy beautifies the output of the test
  kable()                             # kable beautifies the resulting table
am statistic p.value method
0 0.9583485 0.5402948 Shapiro-Wilk normality test
1 0.7675804 0.0028804 Shapiro-Wilk normality test

The QQ-plot confirms the presence of outliers, and both the QQ-plot and the low p-value of the normality test (p-value = 0.00288, a bad thing here) indicate that the horsepower of the manual-gearbox cars is not normally distributed.
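If you prefer to see the same check without the tidyverse helpers, here is a minimal base-R sketch. The names `hp_by_am` and `shapiro_p` are my own illustrative choices, not part of any package; the only real functions used are `split`, `sapply` and `shapiro.test`.

```r
# Shapiro-Wilk p-value per transmission group, using base R only.
# `hp_by_am` and `shapiro_p` are illustrative names (assumptions).
hp_by_am  <- split(mtcars$hp, mtcars$am)    # list with elements "0" and "1"
shapiro_p <- sapply(hp_by_am, function(x) shapiro.test(x)$p.value)

round(shapiro_p, 4)                         # one p-value per gearbox type
```

A p-value below 0.05 in a group suggests that this group deviates from a normal distribution.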

Thus, here we really need a non-parametric alternative to unpaired t-test, which is WMW-test.

How Wilcoxon-Mann-Whitney test works and why it’s called “rank-sum” and “U”

WMW-test in only 4 steps:

  1. rank the values of both samples together from low to high, no matter which group each value belongs to
  2. sum the ranks for both samples separately, R1 & R2. This is where the rank-sum part of the name comes from. Which sample is R1 is irrelevant
  3. calculate the test statistic: W-value for small samples (n < 20) or z-score for larger samples (n > 20)
  • W-value (also called U-value) for each rank-sum. This is where the U part of the name comes from.

\[{\displaystyle W_{1}=R_{1}-{n_{1}(n_{1}+1) \over 2}\,\!} \\ {\displaystyle W_{2}=R_{2}-{n_{2}(n_{2}+1) \over 2}\,\!}\]

  • z-score

\[z={\frac {W-m_{W}}{\sigma _{W}}}\] where \(m_{W}\) is: \[{\displaystyle m_{W}={\frac {n_{1}n_{2}}{2}}}\]

and \(\sigma _{W}\) is:

\[\sigma _{W}={\sqrt {n_{1}n_{2}(n_{1}+n_{2}+1) \over 12}}\]

  4. get the p-value from a W-table or z-table. This part will show you whether the cars with automatic transmission differ in strength from the cars with manual transmission.

The code below shows the step-by-step calculation in self-explanatory plain English. You can see the result of every step by executing everything above the line you are interested in. For example, if you leave out the last three lines (and the %>% before them), you’ll see the whole table with all the columns calculated so far: n, rank_sum etc.

# calculate the test statistics manually
mtcars %>% 
  mutate(rank = rank(hp)) %>%          # mutate means - create new column
  group_by(am) %>%
  summarise(n           = n(), 
            rank_mean   = mean(rank),
            hp_mean     = mean(hp),
            rank_median = median(rank),
            hp_median   = median(hp),
            rank_sum    = sum(rank),
            hp_sum      = sum(hp)) %>%
  mutate(W = rank_sum - (n * (n + 1)) / 2,
         z = (W-(.$n[1]*.$n[2])/2) / sqrt(.$n[1]*.$n[2]*(.$n[1]+.$n[2]+1)/12) ) %>% 
  kable()
am n rank_mean hp_mean rank_median hp_median rank_sum hp_sum W z
0 19 19.26316 160.2632 21 175 366 3045 176 2.014394
1 13 12.46154 126.8462 11 109 162 1649 71 -2.014394
# interestingly ;)
176 + 71 == 19 * 13
## [1] TRUE
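The table above stops at step 3. As a rough sketch of all four steps in base R, including the p-value lookup of step 4, something like the following should work. It assumes the plain normal approximation from the formulas above (no correction for ties, no continuity correction), so the p-value will be close to, but not identical with, what the test functions report. All variable names are illustrative.

```r
hp <- mtcars$hp
am <- as.character(mtcars$am)      # works whether am is numeric or a factor

# step 1: rank all values jointly (tied values get their average rank)
r <- rank(hp)

# step 2: rank sums and sample sizes per group
R1 <- sum(r[am == "0"]); n1 <- sum(am == "0")   # automatic
R2 <- sum(r[am == "1"]); n2 <- sum(am == "1")   # manual

# step 3: W (U) values and the z-score, following the formulas above
W1      <- R1 - n1 * (n1 + 1) / 2
W2      <- R2 - n2 * (n2 + 1) / 2
m_W     <- n1 * n2 / 2
sigma_W <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z       <- (W1 - m_W) / sigma_W

# step 4: two-sided p-value from the standard normal distribution
# (pnorm replaces the lookup in a z-table)
p_two_sided <- 2 * pnorm(-abs(z))

c(W1 = W1, W2 = W2, z = z, p = p_two_sided)
```

With the mtcars data this reproduces W = 176 and 71 and z = 2.0144 from the table, and gives a two-sided p-value of roughly 0.044.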

How to compute Wilcoxon-Mann-Whitney test

Just for the sake of curiosity, we’ll also apply the unpaired t-test to our dataset and compare the results. Before executing both the WMW-test and the t-test, have a look at the very first visualization one more time and answer this simple question: do you think the cars actually differ? Then execute the code below.

H0 - null hypothesis: samples are similar (have similar distributions), so that there is a 50% chance that a random value from one sample is higher than a random value from the other

Halt - alternative hypothesis: samples are different
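The "50% chance" in H0 can be made tangible: the W-value is simply the number of (automatic, manual) pairs in which the automatic car has more horsepower, with a tie counted as half a win. The sketch below is my own illustration of that counting view in base R; `x`, `y`, `wins`, `ties` and `U` are made-up names.

```r
am <- as.character(mtcars$am)     # works whether am is numeric or a factor
x  <- mtcars$hp[am == "0"]        # automatic
y  <- mtcars$hp[am == "1"]        # manual

# compare every automatic car with every manual car
wins <- sum(outer(x, y, ">"))     # pairs where automatic is stronger
ties <- sum(outer(x, y, "=="))    # pairs with identical horsepower
U    <- wins + 0.5 * ties         # same quantity as W for the automatic group

# share of pairs "won" by the automatic cars; H0 expects 0.5
U / (length(x) * length(y))
```

Here U equals 176, so the automatic cars win about 71% of all pairwise comparisons instead of the 50% expected under H0.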

bind_rows(
  wilcox.test(hp ~ am, data = mtcars, conf.int = TRUE, exact = FALSE) %>% tidy(),
  t.test(hp ~ am, data = mtcars) %>% tidy() %>%
    select(-estimate1, -estimate2, -parameter) # "minus" de-selects (removes) columns
) %>% kable()
estimate statistic p.value conf.low conf.high method alternative
55.00007 176.000000 0.0457013 0.0000862 91.99995 Wilcoxon rank sum test with continuity correction two.sided
33.41700 1.266189 0.2209796 -21.8785802 88.71259 Welch Two Sample t-test two.sided


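  • estimates show the difference between samples. While the t-test reports 33.4 [horsepower] as the difference between the means of cars with automatic vs. manual gearbox, the WMW-test reports a bigger difference of 55 [horsepower]. Note that the WMW estimate is not a difference of means but the median of all pairwise differences between the two samples, which makes it less sensitive to outliers. Thus, the WMW-test saw the same difference we have seen on the first visualization, while the t-test missed it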

  • the test statistic of the WMW-test, W = 176, is identical to the one we calculated manually above. Thus, we are confident that the test is fine and our calculation is correct. W is well above the value of 123.5 (= 19 * 13 / 2) expected if H0 were true, thus we expect a significant difference between samples, which is confirmed by a…

  • p-value of the WMW-test (p-value = 0.0457), which rejects the H0 that cars in both groups are similar in favour of the Halt saying that cars differ. Another confirmation of the difference between the groups is the absence of zero within the 95% confidence interval. Thus, at the 5% significance level, a difference this large is unlikely to have happened by chance alone

  • in contrast, the t-test did not find enough difference between the two groups and therefore shows a non-significant p-value of 0.22. The result of the t-test is not surprising, since we saw that two important assumptions of the t-test (normality and absence of outliers) are violated by our dataset

  • the z-score is computed as an approximation by statistical software, thus a manually calculated z-score might differ a little. The main point here is that both the W-value and the z-score allow us to look up the p-value, which we need to determine the significance of the difference. Usually, if the z-score is below -1.96 or above 1.96, we can reject the null hypothesis at the 5% level. But if the z-score is close to 2 or -2, I’d recommend trusting the test rather than the manual calculation. The coin package provides a wilcox_test function, which delivers the z-score:

coin::wilcox_test(hp ~ am, data = mtcars, conf.int = TRUE)
##  Asymptotic Wilcoxon-Mann-Whitney Test
## data:  hp by am (0, 1)
## Z = 2.0174, p-value = 0.04366
## alternative hypothesis: true mu is not equal to 0
## 95 percent confidence interval:
##  3.332906e-05 8.900007e+01
## sample estimates:
## difference in location 
##               54.99999

Our manually calculated z-score = 2.0144, while the z-score from the test = 2.0174, which are practically the same (they differ by only about 0.15%). Similarly to wilcox.test, which uses the W-value, wilcox_test and its z-score also found a significant difference in strength of cars with automatic vs. manual transmission.
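A plausible explanation for the small gap between the two z-scores is ties: several cars share the same horsepower, and the simple formula for the standard deviation of W ignores that. The sketch below is my own reconstruction, not code taken from coin or from R itself. It assumes the usual tie-corrected variance and, for the second z-score, the continuity correction of 0.5 that wilcox.test applies by default when exact = FALSE. All names are illustrative.

```r
hp <- mtcars$hp
am <- as.character(mtcars$am)               # works for numeric or factor am
n1 <- sum(am == "0"); n2 <- sum(am == "1"); N <- n1 + n2
W  <- sum(rank(hp)[am == "0"]) - n1 * (n1 + 1) / 2

# size of every group of identical hp values (1 means "no tie")
t_sizes <- as.vector(table(hp))

# standard deviation of W with the usual correction for ties
sigma_ties <- sqrt(n1 * n2 / 12 *
                   ((N + 1) - sum(t_sizes^3 - t_sizes) / (N * (N - 1))))

dev    <- W - n1 * n2 / 2                   # distance of W from its H0 mean
z_ties <- dev / sigma_ties                  # tie correction only
z_cc   <- (dev - sign(dev) * 0.5) / sigma_ties   # plus continuity correction
p_cc   <- 2 * pnorm(-abs(z_cc))             # two-sided p-value

round(c(z_ties = z_ties, z_cc = z_cc, p_cc = p_cc), 4)
```

Under these assumptions z_ties comes out at about 2.0174, matching the coin output, and p_cc at about 0.0457, matching the p-value reported by wilcox.test above.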

One-sided Wilcoxon-Mann-Whitney test

The default alternative hypothesis of the WMW-test is two.sided and only says that a difference is present, but does not say whether cars with automatic transmission (am = 0) are weaker or stronger than cars with a manual one (am = 1). To find out exactly this, we need to test two new alternative hypotheses (Halt):

  1. automatic transmission is weaker than manual = has less horsepower
  2. automatic transmission is stronger than manual = has greater horsepower

Doing this will add another useful tool to your statistical toolbox, namely one-tailed (or one-sided) non-parametric two-samples unpaired Wilcoxon-Mann-Whitney rank-sum U test (this name is really killing me:):

coin::wilcox_test(hp ~ am, data = mtcars, conf.int = TRUE, alternative = "less")
##  Asymptotic Wilcoxon-Mann-Whitney Test
## data:  hp by am (0, 1)
## Z = 2.0174, p-value = 0.9782
## alternative hypothesis: true mu is less than 0
## 95 percent confidence interval:
##      -Inf 84.00006
## sample estimates:
## difference in location 
##               54.99999
coin::wilcox_test(hp ~ am, data = mtcars, conf.int = TRUE, alternative = "greater")
##  Asymptotic Wilcoxon-Mann-Whitney Test
## data:  hp by am (0, 1)
## Z = 2.0174, p-value = 0.02183
## alternative hypothesis: true mu is greater than 0
## 95 percent confidence interval:
##  6.000042      Inf
## sample estimates:
## difference in location 
##               54.99999

The corresponding p-values for one-tailed z-scores can be found in this table, but hold on! Every statistical software provides you with p-values anyway, so you’ll never need to look them up. Still, it’s always good to understand where p-values come from.

The low p-value (0.02) of the greater-sided test confirms that the automatic-gearbox cars are stronger than the manual-gearbox cars. The high p-value of the less-sided test (p = 0.9782) shows no evidence at all that they are weaker.
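To see where these one-sided p-values come from, here is a small sketch that replaces the z-table lookup with pnorm, the cumulative distribution function of the standard normal. It simply plugs in the Z value printed by the test above; the variable names are illustrative.

```r
z <- 2.0174                                  # Z reported in the output above

p_less    <- pnorm(z)                        # area to the left of z
p_greater <- pnorm(z, lower.tail = FALSE)    # area to the right of z
p_two     <- 2 * min(p_less, p_greater)      # both tails together

round(c(less = p_less, greater = p_greater, two.sided = p_two), 4)
```

The two one-sided p-values always add up to 1, and the two-sided p-value is twice the smaller of them, which is why the "greater" test here is about twice as "significant" as the two-sided one.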

Don’t use Wilcoxon-Mann-Whitney test if:

  • samples are dependent. In this case apply the two-samples paired Wilcoxon test
  • samples are small (n < 30) and normally distributed (or big and near normal). In this case use the more powerful two-samples unpaired/independent t-test. The lower power of the WMW-test is due to a slight loss of information, which happens when the real data are replaced by ranks. In fact, if your sample is very small, <7, you’ll never get a significant result with the WMW-test. Thus, despite the fact that the WMW-test is very robust, please don’t overuse it.
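The remark about very small samples can be checked with a little combinatorics. The sketch below assumes an exact two-sided test without ties: the most extreme result possible is complete separation of the two groups, and its p-value is 2 divided by the number of ways to split all values into two groups of the given sizes. The helper `min_p_two_sided` is a hypothetical name of my own, not a function from any package.

```r
# smallest two-sided p-value an exact WMW-test can ever return
# for group sizes n1 and n2 (assumes no ties)
min_p_two_sided <- function(n1, n2) min(1, 2 / choose(n1 + n2, n1))

min_p_two_sided(3, 3)   # total of 6 values
min_p_two_sided(3, 4)   # total of 7 values
min_p_two_sided(4, 4)   # total of 8 values

# completely separated samples reach exactly that minimum
p_sep_3 <- wilcox.test(1:3, 4:6)$p.value
p_sep_4 <- wilcox.test(1:4, 5:8)$p.value
c(p_sep_3, p_sep_4)
```

In this sketch the minimum drops below 0.05 only once each group has four values, so with fewer observations even perfectly separated groups cannot reach significance at the 5% level.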


The two-sample unpaired rank-sum Mann-Whitney test handles skewed distributions and outliers much better than the unpaired t-test and is almost as powerful. Since real-world data is never perfect, non-parametric tests are very important tools in the toolbox of any data scientist.

What’s next

Yury Zablotski
Data Scientist at LMU Munich, Faculty of Veterinary Medicine

Passion for applying Biostatistics and Machine Learning to Life Science Data

