Balancing Training Data for Automated Annotation of Keywords: a Case Study.

TLDR

Automated database annotation is increasingly pursued, with machine learning viewed as a promising tool, yet class imbalance remains a major obstacle: because many proteins are not annotated for a given feature, one class has far more training examples than the other. This study investigates how to mitigate the imbalance by balancing the training sets used for automated annotation. The authors analyze and apply balancing techniques to construct training data for symbolic machine-learning models. Classifiers induced from the balanced data were more accurate than those induced from the original, imbalanced data.

Abstract

There has been increasing interest in tools for automating the annotation of databases. Machine learning techniques are promising candidates to help curators by, at least, guiding the annotation process, which is mostly done manually. Following previous work on automated annotation using symbolic machine learning techniques, the present work deals with a common problem in machine learning: classes usually have skewed prior probabilities, i.e., there are many examples of one class and only a few examples of the other. This happens because a large number of proteins are not annotated for every feature. We therefore analyze and employ techniques for balancing the training data. Our experiments show that classifiers induced from balanced data sampled with our method are more accurate than those induced from the original data.
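
To make the idea of balancing skewed training data concrete, the sketch below shows random undersampling of the majority class, one generic balancing technique. The abstract does not describe the authors' own sampling method, so this is not a reconstruction of it. The function name `undersample_majority`, the label values, and the toy protein identifiers are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def undersample_majority(examples, labels, seed=0):
    """Randomly drop examples from larger classes until every class
    has as many examples as the smallest class.

    Generic random undersampling; an illustrative assumption, not the
    sampling method used in the paper.
    """
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is reduced to the size of the smallest class.
    target = min(len(xs) for xs in by_class.values())

    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):
            balanced.append((x, y))

    # Shuffle so the classes are not grouped together in the output.
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


# Hypothetical toy data: 95 proteins lacking a keyword, 5 carrying it.
examples = [f"protein_{i}" for i in range(100)]
labels = ["absent"] * 95 + ["present"] * 5

bal_x, bal_y = undersample_majority(examples, labels)
print(Counter(bal_y))
```

Undersampling discards majority-class examples, which keeps the training set small but throws information away; oversampling the minority class is the usual alternative when the minority class is very small.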
