Prioritization of potato genes involved in the formation of agronomically valuable traits using the SOLANUM TUBEROSUM knowledge base

The development of highly efficient technologies in genomics, transcriptomics, proteomics and metabolomics, as well as new technologies in agriculture, has led to an “information explosion” in plant biology and crop production, including potato production. Only a small part of this information reaches formalized databases (for example, UniProt, NCBI Gene, BioGRID, IntAct, etc.). One of the main sources of reliable biological data is the scientific literature. The well-known PubMed database contains more than 18 thousand abstracts of articles on potato. The effective use of the knowledge presented in such a number of non-formalized natural-language documents requires modern intelligent methods of analysis. However, the literature provides no evidence of widespread use of intelligent methods for automatically extracting knowledge from scientific publications on crops such as potato. Earlier we developed the SOLANUM TUBEROSUM knowledge base (http://www-bionet.sysbio.cytogen.ru/and/plant/). The information integrated into the knowledge base about the molecular genetic mechanisms underlying traits significant for breeding helps to accelerate the identification of candidate genes for potato breeding characteristics and the development of diagnostic markers for breeding. This article searches for new potential participants in the molecular genetic mechanisms of resistance to adverse factors in plants. Prioritization of candidate genes has shown that the PHYA, GF14, CNIH1, RCI1A, ABI5, CPK1, RGS1, NHL3, GRF8, and CYP21-4 genes are the most promising for further testing of their relationships with resistance to adverse factors. The analysis showed that the molecular genetic relationships responsible for the formation of significant agricultural traits are complex and include many direct and indirect interactions.
The construction of associative gene networks and their analysis using the SOLANUM TUBEROSUM knowledge base provide the basis for searching for target genes for targeted mutagenesis and marker-assisted selection of potato varieties with valuable agricultural characteristics.


Introduction
Potatoes (Solanum tuberosum L.) have high nutritional, technical, and feed value and are one of the most important crops. The high nutritional value of potatoes is due to their high content of carbohydrates, ascorbic acid, salts of potassium, calcium, magnesium and other trace elements, as well as the good digestibility of their proteins. Potato starch is a raw material used in the production of alcohol, molasses, dextrins, glucose, maltose, as well as many other products for the chemical industry. Potato tuber starch is widely used in the paper, textile and other industries (Kraak, 1992; Ellis et al., 1998; Jobling, 2004).
The development of high-performance technologies in genomics, transcriptomics, proteomics, and metabolomics, as well as in agriculture, has led to an "information explosion" in plant biology. At the same time, only a small amount of this information gets into formalized factographic databases (for example, NCBI Gene, UniProt, IntAct, BioGRID, etc.). One of the main sources of reliable biological data is the scientific literature. The well-known PubMed database contains more than 18,000 abstracts of articles devoted to potatoes, which makes manual analysis of such data extremely difficult for researchers.
The lack of unified resources that integrate all available information greatly complicates the identification of relationships between data sets describing important and practically useful properties of plants, their structure and processes at the molecular level (Khlestkin et al., 2017). As a result, the obtained results are used less efficiently.
The problem of processing large and extra-large amounts of data is becoming increasingly common in various areas of human activity (Kilicoglu, 2017), making methods of automated text analysis (text mining) more and more popular. These methods can be divided into two main groups: methods based on manually created semantic rules and templates, and methods that use machine learning. Methods based on semantic rules and templates normally have high accuracy, but the completeness of the extracted information is often relatively low (Aggarwal, 2012). Machine learning methods do not require manually created rules and have been widely used recently. However, one of their main disadvantages is the need for extensive training sets, which are often impossible to obtain without manual analysis.
At the same time, most of the published literature reports the application of text-mining approaches only to model plants. For example, the PLAN2L system (Krallinger et al., 2009) contains the results of automatic extraction of information about protein-protein interactions and genetic regulation from full-text articles on Arabidopsis thaliana, as well as some data describing associations of genes with certain cellular and developmental processes (flower, root, etc.). Da Costa and colleagues (2018) developed an interactive system that identifies rice pests and diseases from information sent by farmers as short text messages (SMS).
Previously, we developed a computer platform for integrated intelligent analysis of scientific publications in the field of potato growing, the SOLANUM TUBEROSUM knowledge base, available at http://www-bionet.sysbio.cytogen.ru/and/plant/ (Saik et al., 2017; Ivanisenko et al., 2018). The software of this platform provides automatic extraction and formalized representation of information in the knowledge base, including data on genetics, DNA markers, breeding, seed production, diagnosis of diseases, methods of protection and potato storage technologies. The graphical interface of the SOLANUM TUBEROSUM knowledge base provides user access to the data, execution of user-specified queries and visualization of the results. Automated text analysis was carried out using adapted methods of the ANDSystem tool (Demenkov et al., 2012; Ivanisenko et al., 2015; Saik et al., 2016).
Integrating knowledge about the molecular genetic mechanisms underlying traits significant for breeding can help to accelerate the identification of candidate genes essential for important breeding characteristics of potatoes, as well as the development of diagnostic markers for breeding.
At present, prioritization methods are widely used in bioinformatics to identify candidate genes that are potentially involved in a trait and/or biological process (Chen et al., 2009). Analysis of gene networks is one such approach. Previously, we developed criteria for the prioritization of genes based on the analysis of the structure of ANDSystem associative gene networks (Yankina et al., 2018). In this work, the prioritization of genes was aimed at identifying promising candidates for studying their relationship with resistance to adverse factors.

Materials and methods
The SOLANUM TUBEROSUM knowledge base is available at http://www-bionet.sscc.ru/and/plant/. The base consists of three main modules.
The text-mining module is used to extract information about the interactions between objects from the texts of scientific publications. The module is based on the ANDSystem software tool (Ivanisenko et al., 2015). ANDSystem provides a multi-stage text analysis consisting of text preprocessing, retrieval of information describing the relationships between objects on the basis of semantic-linguistic templates, and presentation of the results in a formalized form. The current version of ANDSystem works only with English texts. In addition to the text analysis tools, ANDSystem also contains tools for collecting and integrating information from external factographic databases.
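The template-based stage described above can be illustrated with a toy sketch. This is not the ANDSystem implementation: the mini-dictionary, the single regular-expression template, the verb list and the function name `extract_relations` are all hypothetical, chosen only to show how dictionary matching combined with a semantic-linguistic template can turn a sentence into formalized triples.

```python
import re

# Hypothetical mini-dictionary mapping object names to object types.
DICTIONARY = {
    "ABI5": "gene",
    "CPK1": "gene",
    "abscisic acid": "metabolite",
    "salt stress": "process",
}

# Hypothetical template verbs and the relation type each one signals.
VERBS = {
    "regulates": "regulation",
    "activates": "activation",
    "inhibits": "inhibition",
}


def extract_relations(sentence):
    """Return (subject, relation, object) triples matched by the toy template
    "<object A> <verb> [the] <object B>"."""
    # Try longer names first so multi-word terms are matched as a whole.
    names = sorted(DICTIONARY, key=len, reverse=True)
    name_pattern = "|".join(re.escape(n) for n in names)
    verb_pattern = "|".join(VERBS)
    template = re.compile(
        rf"\b({name_pattern})\b\s+({verb_pattern})\s+(?:the\s+)?\b({name_pattern})\b",
        flags=re.IGNORECASE,
    )
    # Map lower-cased matches back to the dictionary spelling.
    canonical = {n.lower(): n for n in DICTIONARY}
    triples = []
    for subj, verb, obj in template.findall(sentence):
        triples.append(
            (canonical[subj.lower()], VERBS[verb.lower()], canonical[obj.lower()])
        )
    return triples
```

For the illustrative sentence "Abscisic acid regulates ABI5 during germination." the function returns one triple linking "abscisic acid" to "ABI5" with the relation "regulation". A sentence that mentions no dictionary objects in the template pattern yields an empty list, which mirrors the high accuracy but limited completeness typical of rule-based extraction.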
The SOLANUM TUBEROSUM database module consists of two sections: Dictionary (dictionaries of objects and terms) and Associative networks (information about the relationships between objects and terms).
The Dictionary section includes:
• molecular genetic data for potatoes and model plants (genes, proteins, metabolites, miRNA, biological processes);
• genetic biomarkers;
• potato varieties;
• properties significant for breeding, economically valuable traits and consumer properties of potatoes and model plants;
• physiological, phenotypic traits and diseases of potatoes;
• molecular genetic data on pathogens and potato pests (genes, proteins, metabolites, biological processes);
• genetic markers of resistance to plant protection products;
• molecular targets for plant protection chemicals;
• biotic environmental factors;
• abiotic environmental factors (soil, humidity, temperature, light, air, climate, microclimate, etc.);
• methods and technologies:
  - breeding;
  - diagnosis of diseases;
  - protection against diseases;
  - cultivation, processing, and storage of potatoes.
The Associative networks section contains:
• physical interactions (molecular complexes: protein/protein, protein/ligand, protein/DNA);
• chemical interactions (catalytic reactions and processes), such as substrate-enzyme-product;
• regulatory interactions and associations (regulation of gene expression, regulation of protein activity, gene/trait association, etc.);
• interactions between the terms of breeding, phenomics and seed production, diseases, diagnostic techniques and methods of protection.
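One way to picture the two sections is as typed records: dictionary entries carry a category, and associative-network entries connect two dictionary entries with an interaction type. The sketch below is only an illustration under that assumption. The class names `Entity` and `Interaction`, the field names and the helper `interactions_of_kind` are hypothetical and do not describe the actual storage schema of the knowledge base; the two example records simply restate relationships mentioned in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A dictionary entry: an object name plus its category."""
    name: str
    category: str  # e.g. "gene", "protein", "biological process"


@dataclass(frozen=True)
class Interaction:
    """An associative-network entry: a typed link between two entities."""
    source: Entity
    target: Entity
    kind: str  # e.g. "physical", "chemical", "regulatory", "association"


def interactions_of_kind(interactions, kind):
    """Select the interactions of one type, preserving input order."""
    return [item for item in interactions if item.kind == kind]


# Illustrative records only, not an export from the knowledge base.
EXAMPLE_NETWORK = [
    Interaction(Entity("CNIH1", "protein"), Entity("HKT1", "protein"), "physical"),
    Interaction(
        Entity("CYP21-4", "gene"),
        Entity("response to oxidative stress", "biological process"),
        "association",
    ),
]
```

Calling `interactions_of_kind(EXAMPLE_NETWORK, "physical")` returns only the CNIH1/HKT1 record. Keeping the interaction type on the edge rather than on the node allows the same pair of objects to be linked by several kinds of relationship.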
The module of visualization and bioinformatics is used for interactive construction of associative gene networks and their analysis using bioinformatics methods.
Gene prioritization was carried out on the basis of the crosstalk centrality index (CTC), calculated with the Intelligent Filtration function of the ANDVisio program using the formula $\mathrm{CTC}_j = N_j / M$, where $N_j$ is the number of links of the j-th gene/protein with the participants of the associative gene network and $M$ is the number of vertices of the associative gene network (Yankina et al., 2018). Candidate genes were ranked in descending order of the CTC value, so the genes with the highest CTC score receive the highest priority.
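The ranking step can be sketched in a few lines of Python. The sketch assumes that the index is the ratio of the two quantities defined above, N_j / M, and that links are counted as distinct neighbours in an undirected network; both points are assumptions of this illustration and not a description of how ANDVisio computes the value. The function names and the toy vertex labels (G1, P1, and so on) are hypothetical.

```python
from collections import defaultdict


def crosstalk_centrality(edges):
    """Return {vertex: N_j / M} for an undirected edge list.

    N_j: number of distinct network participants linked to vertex j.
    M:   total number of vertices in the network.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            neighbors[a]  # register the vertex but ignore the self-loop
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)
    m = len(neighbors)
    return {vertex: len(linked) / m for vertex, linked in neighbors.items()}


def rank_candidates(edges, candidates):
    """Sort candidate genes in descending order of their CTC value.

    Candidates absent from the network get a score of 0.0.
    """
    ctc = crosstalk_centrality(edges)
    scored = [(gene, ctc.get(gene, 0.0)) for gene in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy network with six vertices; the labels are placeholders, not real genes.
TOY_EDGES = [
    ("G1", "P1"), ("G1", "P2"), ("G1", "P3"),
    ("G2", "P1"),
    ("G3", "P2"), ("G3", "P3"),
]
```

With `TOY_EDGES`, `rank_candidates(TOY_EDGES, ["G2", "G3", "G1"])` places G1 first (three links among six vertices), followed by G3 and then G2. Dividing by M keeps the score between 0 and 1, so values remain comparable across networks of different size.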

Results
Using the information from the SOLANUM TUBEROSUM knowledge base, we reconstructed and analyzed associative gene networks describing biological processes involved in the formation of agricultural traits significant for breeding, such as resistance to adverse environmental factors and response to various stresses (excess salt, cold, drought, high temperature). The reconstructed associative gene network of resistance to adverse factors is shown in the Figure. The network includes 542 genes, 544 proteins, 34 biological processes and 2406 interactions between them. Table 1 lists the biological processes that are responsible for resistance to adverse factors in potatoes. The table shows that the largest number of genes and proteins is associated with the Gene Ontology process "response to oxidative stress" (GO:0006979). In plants, oxidative stress is observed under the majority of unfavorable environmental factors, including cold, drought, soil salinization, high temperatures and pathogens (Mittler, 2002; Ramirez et al., 2018).
A number of studies on potato have already discussed the possibility of creating plant lines resistant to various adverse environmental conditions by modifying the biological processes presented in Table 1 (Jones et al., 2014; Kikuchi et al., 2015). For example, it was shown that transgenic potatoes expressing the Cu- and Zn-superoxide dismutase genes of tomato had increased resistance to oxidative stress (Perl et al., 1993), as well as to cold and salt stresses (Shafi et al., 2017). P. Monneveux et al. (2013) discussed the relationship of 14 potato genes to drought tolerance and the possibilities of their use for the development of transgenic plants. The relationship between the ACS4 and ACS5 potato genes and the response to biotic stress was studied by C.D. Schlagnhaufer (1997).
Prioritization of genes using the CTC (crosstalk centrality) index made it possible to identify the candidate genes that are most promising for further study of their relationship with resistance to adverse environmental factors and with the response to various stresses. Table 2 contains the top 10 such candidate genes, ranked by the CTC value, which reflects the degree of gene connectivity in the gene network presented in the Figure. Table 2 shows that first place belongs to the PHYA gene, which encodes the photoreceptor phytochrome A. Phytochrome A participates in various biological processes, including the control of the circadian rhythm, flowering and leaf movements in response to light of different wavelengths (Yanovsky et al., 2000). R.J. Sawers et al. (2005) discussed the use of phytochromes in crop breeding programs for developing varieties resistant to negative growth factors under dense planting conditions. The effects of phytochrome mutations on plant phenotypes were also studied by J. Zhang et al. (2013) and J. Chen et al. (2013). J. Zhang et al. (2013) demonstrated the effect of phyB mutations in A. thaliana on a number of plant phenotypic traits, while J. Chen et al. (2013) showed that loss of PHYC functional activity in wheat could lead to changes in the circadian rhythm and a sharp delay in flowering under long-day conditions.
The second, fourth and ninth places belong to genes of the 14-3-3-like protein family (GF14, RCI1A, and GRF8, respectively). These proteins regulate the cell cycle, apoptosis, immune processes, and nitrogen and carbon metabolism, and are involved in the regulation of starch synthesis, ATP production, peroxide detoxification and some other biochemical pathways. Plant development and seed germination are also controlled by factors that are activated by interacting with 14-3-3-like proteins (Fulgosi et al., 2002). Świȩdrych et al. (2002) showed that a decrease in the level of GF14 protein leads to an increase in calcium and starch and in the ratio of soluble sugars to starch in potato tubers, as well as to a significant increase in methionine, proline, and arginine in potato cells. It was shown that suppression of GF14e gene expression by RNA interference could lead to increased resistance of rice to a virulent strain of the bacterial phytopathogen Xanthomonas oryzae pv. oryzae (Xoo) (Manosalva et al., 2011). Plants with mutations in the RCI1A and GRF8 genes are known to have increased resistance to low temperatures (Catalá et al., 2014; Liu et al., 2017).
The third place was given to the CNIH1 gene, which encodes a plant protein that interacts with the sodium transporter HKT1 and ensures the correct localization of the transporter on the Golgi apparatus membrane (Rosas-Santiago et al., 2017). M.M. Wudick et al. (2018) studied the effect of mutations in the CNIH1 gene on pollen and calcium homeostasis in A. thaliana. It is interesting to note that the potato cultivar Yubiley Zhukova, which has enhanced salt and drought tolerance, was obtained through overexpression of the vacuolar Na+/H+ antiporter NHX2 (Belyaev et al., 2011).
Figure. Associative gene network of resistance to adverse factors in potato. Genes are represented by spirals, proteins by red balls, biological processes by brown ovals, and interactions by lines.
The homolog of the bZIP-type transcription factor ABI5 was ranked fifth. ABI5 plays an important role in seed germination, which is regulated by abscisic acid (Finkelstein, 1994; Lopez-Molina et al., 2002). The ABI5 bZIP-type transcription factor is involved in the activation of genes responsible for the accumulation of proteins during seed development. It is known that reduced expression of the ABI5 gene activates meristem growth (Lopez-Molina et al., 2002). Mutations of the ABI5 gene in A. thaliana are associated with reduced sensitivity to abscisic acid, as well as to salt and osmotic stress during germination (Finkelstein, Lynch, 2000; Carles et al., 2002; Tezuka et al., 2013).
In sixth place was the CPK1 gene, which encodes a calcium-dependent protein kinase C involved in the immune response and in resistance to fungal diseases and pathogens (Gravino et al., 2015). Mutations of the CPK1 gene in A. thaliana are known to cause hypersensitivity to salt stress and drought, while transgenic plant lines with increased expression of CPK1 showed significant resistance to salt stress and drought (Huang et al., 2018).
The seventh place was taken by the RGS1 gene, which encodes regulator of G-protein signaling 1, a negative regulator of the G-protein signaling pathway. It is known that the expression of this gene decreases in response to water deficiency (Campbell et al., 2012). A. Chen et al. (2006) showed that transgenic A. thaliana plants overexpressing the RGS1 gene have increased drought tolerance.
The NHL3 gene was in eighth place; this gene encodes NDR1/HIN1-like protein 3, which is involved in the response to pathogens (Chong et al., 2008). A transgenic A. thaliana line with increased expression of the NHL3 gene showed increased resistance to the pathogen Pseudomonas syringae pv. tomato DC3000 (Varet et al., 2003).
The tenth place in Table 2 was taken by the CYP21-4 gene, which encodes a cyclophilin localized in the Golgi apparatus and involved in tolerance to oxidative stress (Park et al., 2017). The authors consider overexpression of the CYP21-4 gene in crops to be a promising new way to increase plant productivity. For potatoes and rice, it has been shown that transgenic plants overexpressing the CYP21-4 gene have increased yield, longer stems and roots, and thicker leaves. Such potatoes also produced a greater number of larger tubers, and microtubers formed faster than in wild-type plants (Park et al., 2017).

Conclusion
In the current work, a search for new potential participants in the molecular genetic mechanisms of resistance to adverse factors in plants was carried out. Prioritization of candidate genes has shown that the PHYA, GF14, CNIH1, RCI1A, ABI5, CPK1, RGS1, NHL3, GRF8 and CYP21-4 genes are the most promising for further study of their relationship with resistance to adverse factors. The analysis revealed that the molecular genetic relationships responsible for the formation of significant agricultural traits are complex and include many direct and indirect interactions. The representation of these interactions in the form of associative gene networks and their analysis using the SOLANUM TUBEROSUM knowledge base can serve as the basis for the search for target genes for targeted mutagenesis and marker-assisted selection of potato varieties resistant to adverse environmental factors.