J. Demsar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, volume 7, 2006.

Demsar discusses several means of statistically comparing classifiers on multiple datasets. In computer vision research, such comparisons are important for assessing the performance of different algorithms/classifiers across multiple datasets. For example, Dollár et al. [1] use the Friedman test to compare different approaches to pedestrian detection. The Friedman test uses average ranks

$R_j = \frac{1}{N}\sum_{i = 1}^N r_j^i \qquad (1)$

where $j = 1,\ldots,K$ refers to the classifiers and $i = 1,\ldots,N$ refers to the datasets, i.e. $r_j^i$ is the rank of classifier $j$ on dataset $i$. The null-hypothesis is that all algorithms are equivalent, so that their average ranks $R_j$ should be equal; ideally, we would like to reject it. The Friedman statistic for this null-hypothesis is

$\chi_F^2 = \frac{12 N}{K(K + 1)} \left(\sum_{j = 1}^K R_j^2 - \frac{K(K + 1)^2}{4}\right)$

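which, for sufficiently large $N$ and $K$ (as a rule of thumb, $N > 10$ and $K > 5$), is distributed according to the $\chi^2$-distribution with $K - 1$ degrees of freedom.

Demsar also briefly discusses an alternative statistic, derived by Iman and Davenport because $\chi_F^2$ was shown to be undesirably conservative:

$F_F = \frac{(N - 1)\chi^2_F}{N(K - 1) - \chi_F^2}$,

which is distributed according to the F-distribution with $(K - 1)$ and $(K - 1)(N - 1)$ degrees of freedom.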
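To make the formulas concrete, the following is a minimal sketch in plain Python that computes the average ranks of Equation (1), the Friedman statistic $\chi_F^2$ and the alternative statistic $F_F$. The function names and the score table are hypothetical and only serve as illustration; the sketch assumes that higher scores are better and that tied classifiers share the mean of the ranks they occupy.

```python
def average_ranks(scores):
    """scores[i][j] is the score of classifier j on dataset i (higher is better).

    Returns the average rank R_j of every classifier (rank 1 is best).
    Tied classifiers share the mean of the ranks they occupy.
    """
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        # Classifier indices sorted from best to worst score on this dataset.
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistics(scores):
    """Returns (average ranks, chi^2_F, F_F) for an N x K score table."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    chi2_f = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4
    )
    denom = n * (k - 1) - chi2_f
    # denom is zero when all datasets rank the classifiers identically.
    f_f = float("inf") if denom == 0 else (n - 1) * chi2_f / denom
    return ranks, chi2_f, f_f


# Hypothetical accuracies of K = 3 classifiers on N = 4 datasets.
scores = [
    [0.90, 0.85, 0.80],
    [0.70, 0.75, 0.60],
    [0.88, 0.80, 0.82],
    [0.95, 0.90, 0.85],
]
ranks, chi2_f, f_f = friedman_statistics(scores)
print(ranks)   # [1.25, 2.0, 2.75]
print(chi2_f)  # 4.5
print(f_f)     # approximately 3.857
```

The sketch stops at the statistics themselves; to decide on rejection, they still have to be compared against the critical values of the $\chi^2$-distribution and the F-distribution, respectively. With only four datasets, as in this toy table, the $\chi^2$ approximation would not be reliable anyway.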

If the null-hypothesis is rejected using one of these statistics, the Nemenyi test can be used to decide whether the performance of two specific classifiers is significantly different. This is the case when the average ranks (as in Equation (1)) differ by at least

$q_\alpha \sqrt{\frac{K(K + 1)}{6N}}$

which is called the critical difference. The critical value $q_\alpha$ is based on the Studentized range statistic divided by $\sqrt{2}$. Demsar gives a worked example and tabulates the corresponding values of $q_\alpha$ for $\alpha = 0.05$ and $\alpha = 0.1$ with $K = 2,\ldots,10$.
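As a rough illustration of the Nemenyi test, the sketch below computes the critical difference and lists all pairs of classifiers whose average ranks differ by at least that amount. The function names and the average ranks are hypothetical. The critical value $q_{0.05} = 2.569$ for $K = 4$ is passed in explicitly; it corresponds to the Studentized range quantile of roughly $3.633$ divided by $\sqrt{2}$ and should be checked against a table before being relied upon.

```python
import math
from itertools import combinations


def critical_difference(q_alpha, k, n):
    """Critical difference of the Nemenyi test for k classifiers on n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


def significantly_different_pairs(avg_ranks, q_alpha, n):
    """Pairs (a, b) whose average ranks differ by at least the critical difference."""
    cd = critical_difference(q_alpha, len(avg_ranks), n)
    return [
        (a, b)
        for a, b in combinations(range(len(avg_ranks)), 2)
        if abs(avg_ranks[a] - avg_ranks[b]) >= cd
    ]


# Hypothetical average ranks of K = 4 classifiers over N = 14 datasets.
avg_ranks = [3.5, 2.0, 2.75, 1.75]
q_alpha = 2.569  # assumed critical value for alpha = 0.05 and K = 4

cd = critical_difference(q_alpha, len(avg_ranks), 14)
pairs = significantly_different_pairs(avg_ranks, q_alpha, 14)
print(round(cd, 2))  # 1.25
print(pairs)         # [(0, 1), (0, 3)]
```

In this toy setting, only classifier 0 differs significantly from classifiers 1 and 3; all other differences in average rank stay below the critical difference.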

  • [1] P. Dollár, C. Wojek, B. Schiele, P. Perona. Pedestrian Detection: An Evaluation of the State of the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 34, number 4, 2012.