OrthoML2GO: homology-based protein function prediction using orthogroups and machine learning
https://doi.org/10.18699/vjgb-25-119
Abstract
The rapid growth of sequencing data in recent years has made the functional annotation of protein sequences increasingly difficult: traditional homology-based methods perform poorly on distant homologs, which hampers accurate determination of protein function. This paper introduces OrthoML2GO, a method for protein function prediction that integrates homology search with the USEARCH algorithm, orthogroup analysis based on OrthoDB version 12.0, and a machine learning algorithm (gradient boosting).
Two features distinguish our approach: orthogroup information is used to account for the evolutionary and functional similarity of proteins, and machine learning is applied to refine the GO terms assigned to the target sequence.
To select the optimal annotation algorithm, three approaches were applied in sequence: (1) the k-nearest neighbors (KNN) method; (2) a method based on the annotation of the orthogroup most represented among the k nearest homologs (OG); and (3) verification of the GO terms identified at the previous stage using machine learning algorithms. The accuracy of GO term prediction by OrthoML2GO was compared with that of the Blast2GO and PANNZER2 annotation programs on sequence samples from individual organisms (human, Arabidopsis) and on a combined sample representing different taxa. Our results demonstrate that the proposed method is comparable to these existing methods in the quality of protein function prediction and outperforms them by some evaluation metrics, especially on large and heterogeneous samples of organisms. The greatest performance improvement is achieved by combining information about the closest homologs and orthogroups with verification of terms using machine learning methods. Our approach demonstrates high performance for large-scale automatic protein annotation. Prospects for further development include optimizing machine learning model parameters for specific biological tasks and integrating additional sources of structural and functional information, which is expected to further improve the method's accuracy and versatility. In addition, the introduction of new bioinformatics tools and the expansion of the annotated protein database should contribute to further improvement of the proposed approach.
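The first two stages named in the abstract (KNN and OG) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the similarity measure, the value of k, or the features passed to the gradient boosting model, so the `Hit` record, the function names, the frequency-based KNN score, and the two-element feature tuple below are all hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homolog returned by the similarity search (illustrative fields)."""
    protein_id: str
    similarity: float      # assumption: higher means a closer homolog
    orthogroup: str        # orthogroup ID of the homolog
    go_terms: frozenset    # GO terms annotated to the homolog


def nearest_hits(hits, k):
    """Return the k hits with the highest similarity."""
    return sorted(hits, key=lambda h: h.similarity, reverse=True)[:k]


def knn_go_scores(hits, k):
    """KNN stage (assumed scoring): fraction of the k nearest hits carrying each term."""
    nearest = nearest_hits(hits, k)
    if not nearest:
        return {}
    counts = Counter(term for h in nearest for term in h.go_terms)
    return {term: n / len(nearest) for term, n in counts.items()}


def dominant_orthogroup(hits, k):
    """OG stage: the orthogroup most represented among the k nearest hits."""
    nearest = nearest_hits(hits, k)
    if not nearest:
        return None
    return Counter(h.orthogroup for h in nearest).most_common(1)[0][0]


def candidate_features(hits, k, og_annotation):
    """Hypothetical per-term features a classifier could use to verify terms."""
    scores = knn_go_scores(hits, k)
    og_terms = set(og_annotation.get(dominant_orthogroup(hits, k), ()))
    return {t: (scores.get(t, 0.0), t in og_terms) for t in set(scores) | og_terms}


# Toy data: protein IDs, similarity values, and orthogroup IDs are invented.
hits = [
    Hit("P1", 0.95, "OG_A", frozenset({"GO:0003677", "GO:0005634"})),
    Hit("P2", 0.90, "OG_A", frozenset({"GO:0003677"})),
    Hit("P3", 0.60, "OG_B", frozenset({"GO:0016301"})),
    Hit("P4", 0.40, "OG_B", frozenset({"GO:0016301"})),
]
og_annotation = {
    "OG_A": {"GO:0003677", "GO:0005634"},
    "OG_B": {"GO:0016301"},
}
features = candidate_features(hits, 3, og_annotation)
```

In this sketch, each candidate GO term ends up with a KNN score and a flag indicating whether the dominant orthogroup carries it. A verification model such as gradient boosting could then be trained on feature vectors of this kind to accept or reject each candidate term; the actual feature set used by OrthoML2GO may differ.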
About the Authors
E. V. Malyugin
Russian Federation
Novosibirsk
D. A. Afonnikov
Russian Federation
Novosibirsk