Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning

Abstract

Background: In bioinformatics, the orphan genes (ORFans) of a species are associated with regulatory patterns, but experimental methods for identifying ORFans are both time-consuming and expensive. In particular, the class imbalance of ORFan datasets makes it very challenging to design an accurate and robust classification model. Results: The proposed method therefore uses an over-sampling algorithm (SMOTE) to balance the ORFan datasets. This manuscript compared different machine learning methods for identifying ORFans on balanced and unbalanced datasets. It then combined over-sampling and under-sampling algorithms with advanced ensemble algorithms, such as AdaBoost (Adaptive Boosting), GBDT (Gradient Boosting Decision Tree), and XGBoost (Extreme Gradient Boosting), and analyzed their performance on the A. thaliana gene feature dataset. Conclusions: On the A. thaliana gene sequence datasets, the proposed method, which integrates a balancing algorithm with an ensemble method (XGBoost), achieved an F1 score of 0.94, better than the result obtained on the unbalanced dataset. Extensive experiments showed that the integrated XGBoost method achieved higher predictive accuracy than the other combined models. Finally, compared with the other balancing algorithms paired with XGBoost, the combination of over-sampling and under-sampling with XGBoost provided the best model for identifying ORFans. In conclusion, the proposed approach can serve as a theoretical basis for evaluation criteria for identifying orphan genes and helps ensure the validity of that evaluation work.
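To make the over-sampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority-class neighbours. It is an illustration under stated assumptions, not the authors' implementation. The function name `smote_oversample`, the parameter `k`, and the toy feature vectors are hypothetical, and the sketch uses only the Python standard library rather than whichever package the study relied on.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base sample (itself excluded)
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy, made-up feature vectors (assumed for illustration, not data from the study).
orphan = [[0.10, 0.90], [0.20, 0.80], [0.15, 0.95], [0.25, 0.85]]
non_orphan = [[0.80 + 0.01 * i, 0.20] for i in range(12)]

# Generate enough synthetic orphan samples to match the majority class size.
new_points = smote_oversample(orphan, n_new=len(non_orphan) - len(orphan))
balanced_orphan = orphan + new_points
```

After this step both classes contain the same number of samples, and every synthetic point lies between two existing minority samples. A combined scheme, as described in the abstract, would additionally remove some majority-class samples before training the ensemble classifier.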
