CropGene: a software package for the analysis of genomic and transcriptomic data of agricultural plants
https://doi.org/10.18699/vjgb-25-35
Abstract
The breeding of agricultural plants increasingly relies on molecular biological data on genetic sequences, which makes it possible to significantly accelerate the breeding process and to create new plant varieties through genome editing. These data are large in volume and highly diverse, and their analysis requires substantial resources, both labor and computing. Analysis of data of such volume and complexity can be effective only with modern bioinformatics methods, which include algorithms for identifying genes, predicting their function, and evaluating the effect of mutations on plant phenotype. Such analysis has become impossible without integrated software systems that solve problems at different levels by executing computational pipelines. The paper describes the CropGene software package, developed for the comprehensive analysis of genomic and transcriptomic data of agricultural plants. CropGene includes several blocks of bioinformatic analysis, such as analysis of gene variations, assembly of genomes and transcriptomes, and annotation of genes and proteins. CropGene implements new methods for analyzing long non-coding RNAs and protein domains, searching for and analyzing polymorphisms, and conducting genome-wide association studies. CropGene has a user-friendly interface and supports various types of data, which greatly simplifies its use for researchers who do not have deep knowledge of bioinformatics. The paper provides examples of the use of CropGene for the analysis of agricultural organisms such as Solanum tuberosum and Zea mays. With CropGene, the following have been identified: genetic markers that explain up to 50% of the variability in seed color parameters; potential genes that may become promising material for producing potato varieties; and more than 100 thousand new long non-coding RNAs. Orthogroups were also found whose domain structure shows marked similarity to the domain architecture of characteristic secreted A2 phospholipases. Thus, CropGene is an important tool for scientists and practitioners working in agrobiotechnology and plant genetics.
About the Authors
A. Yu. Pronozin
Russian Federation
Novosibirsk
D. I. Karetnikov
Russian Federation
Novosibirsk
N. A. Shmakov
Russian Federation
Novosibirsk
M. E. Bocharnikova
Russian Federation
Novosibirsk
S. D. Afonnikova
Russian Federation
Novosibirsk
D. A. Afonnikov
Russian Federation
Novosibirsk
N. A. Kolchanov
Russian Federation
Novosibirsk