Vavilov Journal of Genetics and Breeding

Genomic prediction of plant traits by popular machine learning methods

https://doi.org/10.18699/vjgb-25-49

Abstract

The rapid growth of available genomic data has made it possible to obtain extensive results in genomic prediction and in identifying associations between SNPs and phenotypic traits. In many cases, new relationships between phenotypes and genotypes are best identified with machine learning, deep learning and artificial intelligence, especially explainable artificial intelligence, which are capable of recognizing complex patterns. Eighty sources were manually selected; no restrictions were placed on publication date, and the main selection criterion was the originality of the proposed approach to genomic prediction. The article considers models for genomic prediction, convolutional neural networks, explainable artificial intelligence and large language models. Attention is paid to data augmentation, transfer learning, dimensionality reduction and hybrid methods. Research on model-specific and model-independent methods for interpreting model decisions is represented by three main categories: sensing, perturbation, and surrogate models. The examples considered reflect the main current trends in this area of research. The growing role of large language models, including transformer-based ones, in processing the genetic code is noted, as is the development of data augmentation methods. Among hybrid approaches, the prospect of combining machine learning models with models of plant development based on biophysical and biochemical processes is emphasized. Because machine learning and artificial intelligence methods attract the attention of both specialists in applied fields and fundamental scientists, and also provoke public resonance, the number of works devoted to these topics is growing explosively.
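As an illustration of the perturbation category of model-independent interpretation named in the abstract, the following is a minimal sketch and not a method taken from the article. It assumes synthetic SNP genotypes coded 0/1/2, a closed-form ridge regression as a stand-in predictor, and permutation importance as the perturbation; the names `fit_ridge` and `permutation_importance` and all numeric settings are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def permutation_importance(predict, X, y, rng, n_repeats=10):
    """Mean increase in MSE when a single SNP column is shuffled."""
    base = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            # Perturbation: break the link between SNP j and the phenotype.
            Xp[:, j] = rng.permutation(Xp[:, j])
            scores[j] += np.mean((predict(Xp) - y) ** 2) - base
    return scores / n_repeats

# Synthetic data (assumption): 300 plants, 20 SNPs, two causal markers.
rng = np.random.default_rng(0)
n_plants, n_snps = 300, 20
X = rng.integers(0, 3, size=(n_plants, n_snps)).astype(float)
true_effects = np.zeros(n_snps)
true_effects[[3, 11]] = [1.5, -2.0]
y = X @ true_effects + rng.normal(0.0, 0.5, n_plants)

w, b = fit_ridge(X, y)
importance = permutation_importance(lambda M: M @ w + b, X, y, rng)
top_snps = np.argsort(importance)[::-1][:2]
print("Most influential SNP indices:", sorted(top_snps.tolist()))
```

Because the procedure only calls the model's prediction function, the same loop could in principle be applied to any fitted predictor, which is what makes this kind of interpretation model-independent.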

About the Authors

K. N. Kozlov
Peter the Great St. Petersburg Polytechnic University
St. Petersburg, Russian Federation

M. P. Bankin
Peter the Great St. Petersburg Polytechnic University
St. Petersburg, Russian Federation

E. A. Semenova
Far Eastern State Agrarian University
Blagoveshchensk, Amur region, Russian Federation

M. G. Samsonova
Peter the Great St. Petersburg Polytechnic University
St. Petersburg, Russian Federation

Conflict of interest. The authors declare no conflict of interest.

Review


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2500-3259 (Online)