Publication | Open Access
We Need to Talk about Standard Splits
2019 · 95 citations · 29 references
It is standard practice in speech and language technology to rank systems according to performance on a test set held out for evaluation. However, few researchers apply statistical tests to determine whether differences in performance are likely to have arisen by chance, and few examine the stability of system rankings across multiple training-testing splits. We conduct replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018, each of which reports state-of-the-art performance on a widely used "standard split". We fail to reliably reproduce some rankings using <i>randomly generated</i> splits. We suggest that randomly generated splits should be used in system comparison.
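The evaluation protocol the abstract advocates can be sketched as follows. This is a hypothetical illustration, not the authors' code: the per-token correctness indicators are synthetic, and the split procedure, seeds, and test fraction are all assumptions made for the example.

```python
import random

# Hypothetical sketch: instead of ranking two taggers on one standard
# split, rank them on many randomly generated train/test splits and
# check whether the ranking is stable.
random.seed(0)
n_tokens = 10_000

# Synthetic corpus: each entry records whether (tagger A correct,
# tagger B correct) on one token; A is made slightly more accurate.
corpus = [(random.random() < 0.974, random.random() < 0.972)
          for _ in range(n_tokens)]

def evaluate(split_seed, test_fraction=0.1):
    """Shuffle the corpus with a split-specific seed, hold out a test
    portion, and return (accuracy_A, accuracy_B) on that test set."""
    rng = random.Random(split_seed)
    shuffled = corpus[:]
    rng.shuffle(shuffled)
    test = shuffled[:int(test_fraction * len(shuffled))]
    acc_a = sum(a for a, _ in test) / len(test)
    acc_b = sum(b for _, b in test) / len(test)
    return acc_a, acc_b

# Count how often tagger A outranks tagger B across 20 random splits.
wins_a = 0
for seed in range(20):
    acc_a, acc_b = evaluate(seed)
    wins_a += acc_a > acc_b

print(f"A ranked first on {wins_a}/20 random splits")
```

If the count is far from 20/20 (or 0/20), the ranking is not stable under resampling, which is the kind of instability the paper reports for published state-of-the-art comparisons. A paired significance test on each split would strengthen the comparison further.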