Large scale biomedical texts classification: a kNN and an ESA-based approaches

Abstract

Background: With the large and growing volume of textual data, automated methods for identifying significant topics in order to classify documents have received increasing interest. Despite many efforts in this direction, the task remains a real challenge. It is made harder by the fact that full texts are not always freely available. Annotating documents from partial information only is therefore promising, but remains very ambitious.

Methods: We propose two classification methods: a k-nearest neighbours (kNN)-based approach and an explicit semantic analysis (ESA)-based approach. Although kNN is widely used in text classification, it needs to be improved to perform well on this specific problem, which deals with partial information. Compared to existing kNN-based methods, our method uses classical machine learning (ML) algorithms to rank the labels. We also investigate additional features to improve the classifiers' performance, and we combine several learning algorithms with various techniques for fixing the number of relevant topics. ESA, for its part, seems promising for this task because it has yielded interesting results on related problems, such as computing semantic relatedness between texts and text classification. Unlike existing work, which uses ESA to enrich the bag-of-words approach with additional knowledge-based features, our ESA-based method builds a standalone classifier. Furthermore, we investigate whether the results of this method could be useful as a complementary feature of our kNN-based approach.

Results: Experimental evaluations performed on large standard annotated datasets provided by the BioASQ organizers show that the kNN-based method with the Random Forest learning algorithm performs well compared with current state-of-the-art methods, reaching a competitive f-measure of 0.55, while the ESA-based approach, surprisingly, yielded modest results.

Conclusions: We have proposed simple classification methods suitable for annotating textual documents using only partial information. They are therefore adequate for large-scale multi-label classification, particularly in the biomedical domain. Our work thus contributes to the extraction of relevant information from unstructured documents in order to facilitate their automated processing. Consequently, it could be used for various purposes, including document indexing and information retrieval.
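
To make the kNN-based idea concrete, the following is a minimal sketch, not the authors' implementation. It retrieves the k most similar training documents by TF-IDF cosine similarity and collects their labels as candidates. The abstract states that the labels are then ranked by a learned model such as Random Forest; this sketch swaps in a much simpler stand-in, ranking each candidate by the summed similarity of the neighbours that carry it. The function names, the toy documents, and the label names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for tokenised docs, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; the exact weighting scheme is an assumption of this sketch.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_labels(query_tokens, train_docs, train_labels, k=3):
    """Rank candidate labels taken from the k nearest training documents.

    Stand-in scoring: each label's score is the summed similarity of the
    neighbours annotated with it (the paper uses a learned ranker instead).
    """
    vectors, idf = tfidf_vectors(train_docs)
    # Terms unseen in training carry no weight in the query vector.
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    neighbours = sorted(
        ((cosine(query, vec), i) for i, vec in enumerate(vectors)), reverse=True
    )[:k]
    scores = defaultdict(float)
    for sim, i in neighbours:
        for label in train_labels[i]:
            scores[label] += sim
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus: short texts standing in for partial information
# (for example titles), each annotated with illustrative topic labels.
train_docs = [
    "insulin therapy in type 2 diabetes".split(),
    "glucose control and insulin resistance".split(),
    "tumor growth in breast cancer".split(),
    "chemotherapy response in lung cancer".split(),
]
train_labels = [
    {"Diabetes Mellitus", "Insulin"},
    {"Insulin Resistance", "Insulin"},
    {"Breast Neoplasms"},
    {"Lung Neoplasms", "Antineoplastic Agents"},
]

ranked = rank_labels("insulin resistance in diabetes".split(), train_docs, train_labels, k=2)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

On this toy corpus the label shared by both nearest neighbours ranks first. In a fuller pipeline, the per-label scores would instead serve as input features to a trained ranker, and a separate step would decide how many top-ranked labels to keep for each document.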
