Phylostratigraphic analysis of gene networks of human diseases

Phylostratigraphic analysis is an approach to the study of gene evolution that makes it possible to determine the time of the origin of genes by analyzing their orthologous groups. The age of a gene belonging to an orthologous group is def ined as the age of the most recent ancestor of all species represented in that group. Such an analysis can reveal important stages in the evolution of both the organism as a whole and groups of functionally related genes, in particular gene networks. In addition to investigating the time of origin of a gene, the level of its genetic variability and what type of selection the gene is subject to in relation to the most closely related organisms is studied. Using the Orthoscape application, gene networks from the KEGG Pathway, Human Diseases database describing various human diseases were analyzed. It was shown that the majority of genes described in gene networks are under stabilizing selection and a high reliable correlation was found between the time of gene origin and the level of genetic variability: the younger the gene, the higher the level of its variability is. It was also shown that among the gene networks analyzed, the highest proportion of evolutionarily young genes was found in the networks associated with diseases of the immune system (65 %), and the highest proportion of evolutionarily ancient genes was found in the networks responsible for the formation of human dependence on substances that cause addiction to chemical compounds (88 %); gene networks responsible for the development of infectious diseases caused by parasites are signif icantly enriched for evolutionarily young genes, and gene networks responsible for the development of specif ic types of cancer are signif icantly enriched for evolutionarily ancient genes.


Introduction
The study of key factors that influence to the development of diseases is one of the most important research areas in both medicine and biology (Stepanov, 2016). It is known that the formation of phenotypic traits that provide the adaptation of organisms to environmental conditions is controlled not by individual genes, but by gene networks -groups of coordinately functioning genes and their products (RNA, proteins, metabolites, etc.) (Kolchanov et al., 2013). The task of highlighting the key structural features of networks, network elements, and their numerical description arises. One of the important characteristics in evolutionary bio logy is the age of a gene. The age of a gene belonging to an orthologous group is defined as the age of the most recent ancestor of all species represented in this group (Liebeskind et al., 2016).
Modern methods of analysis make it possible to evaluate the evolutionary characteristics of genes, in particular, phylostratigraphic analysis, a methodology proposed in 2007 by T. DomazetLošo, which allows to determine the age of a gene using a special index. The index is derived from analysis of orthologous genes and comparison of the position of organisms whose genes are considered in the analysis on a phylogenetic tree (DomazetLošo et al., 2007).
There are many software tools to work with gene networks. Some of them focus on reconstructing networks based on data from biological databases: String (Szklar czyk et al., 2019), GeneMANIA (Montojo et al., 2010). The others have extensive functionality for visualizing network elements and identifying its structural features: Cytoscape (Shannon et al., 2003), yEd (https://www.yworks.com/products/ yed). Cytoscape has an advantage that in addition to its extensive capabilities of constructing networks, layouting and painting their elements and analyzing structural features, it allows users to write their own applications in Java. It makes possible for community to implement any functionality and plug in to Cytoscape. For example, wellknown tools String and GeneMANIA for networks reconstruction from the list of genes based on extracting interactions from biological databases have their own plugins in Cytoscape and allow to use their functionality by combining it with the capabilities of Cytoscape and its other plugins. Also, the plugins allow to import networks from databases, for example, Pathway Commons (Cerami et al., 2011) or KEGG Pathway (Kanehisa et al., 2017), without hard parsing the formats of network representation.
The results of gene network analysis by one of such applications -Orthoscape (Mustafin et al., 2017), are presented in this paper. Orthoscape can analyze the evolutionary features of genes in gene network. It has been shown that most of the genes described in gene networks are under influence of stabilizing selection. A high reliable correlation has been found between the time of occurrence of a gene and the level of its genetic variability -the younger the gene, the higher the level of genetic variability is. Among the gene networks analyzed, the highest proportion of evolutionary young genes was detected in the networks associated with immune diseases (65 %), and the highest proportion of evolutionary ancient genes was detected in the networks responsible for the substance dependencies (88 %); gene networks responsible for the development of infectious diseases caused by parasites are significantly enriched in evolutionary young genes, and gene networks responsible for the development of specific types of cancer are significantly enriched in evolutionary ancient genes.
The data required for the analysis, such as lists of orthologous genes, nucleotide sequences of genes and amino acid sequences of the proteins they encode, protein domains, taxonomic information about organisms whose genes were considered in the analysis were also taken from the KEGG database.
Software used. The analysis was performed using the Cytoscape software package (Shannon et al., 2003). CyKEGG Parser plugin was used to import networks from the KEGG Pathway (Nersisyan et al., 2014). Orthoscape plugin was used to perform phylostratigraphic analysis and analysis of so called divergence index -the index of evolutionary variability (Mustafin et al., 2017).
Methods for estimation the evolutionary characteristics of genes. Orthoscape allows to estimate two evolutio nary characteristics of genes. The first one is phylostrati graphic age index (PAI). This index shows how far from the root of the phylogenetic tree is the taxon reflecting the age of the gene, i. e., the taxon where the studied species diverged from the most distant related taxon in which the ortholog of the studied gene was found. Thus, the more PAI of the gene, the younger it is (Fig. 1). KEGG Orthology service is used in Orthoscape to calculate PAI, which makes it possible to consider orthologous genes among all homologs. Figure 1 shows examples of determining the age of a gene and the phylostratigraphic index, using human genes. On the left (a) the case when the most distant organism in which the ortholog of the studied gene was found is the bonobo is shown. The node most distant from the root of the phylogenetic tree that is common to H. sapiens and P. paniscus (bonobos) is Нominidae. It corresponds to the phylostratigraphic index is equal to 13. On the right (b) is the gene whose ortholog was found in M. domestica (gray short tailed opossum), the most distant node is Mammalia, and the phylostratigraphic index of the gene is equal to 7. Since the PAI in example (a) is larger than the PAI in example (b), we  а -the example of evolutionary young gene is hsa:1029, most distant from the studied organism in which the orthologous gene was found Pan paniscus (bonobo); b -the example of evolutionary older gene is hsa:1030, most distant from the studied organism in which the orthologous gene was found is Monodelphis domestica (gray short-tailed opossum). We can conclude that the gene on example (a) is evolutionary younger than the gene on example (b). The scale on the left shows the PAI index, which corresponds to the depth of a node in the phylogenetic tree (see Table 1 for details). 0 Cellular organism (tree root) 4100 (Bell et al., 2015) 1 Eukaryota 1850 (Leander, 2020) 2 Metazoa 665 (Maloof et al., 2010a) 3 Chordata 541 (Maloof et al., 2010b) 4 Craniata 535 (Maloof et al., 2010b) 5 Vertebrata 525 (Shu et al., 1999) 6 Euteleostomi 420 (Diogo, 2007) 7 Mammalia 225 (Datta, 2005) 8 Eutheria 160 (Luo et al., 2011) 9 Euarchontoglires 65 (Kumar et al., 2013) 10 Primates 55 (Chatterjee et al., 2009) 11 Haplorrhini 50 (Dunn et al., 2016) 12 Catarrhini 44 (Harrison, 2013) 13 Hominidae 17 (Hey, 2005) 14 Homo 2.8 (Schrenk et al., 2014)  can conclude that the gene in example (a) is evolutionary younger than the gene in example (b). An important characteristic for the phylostratigraphic analysis is the list of taxonomic units describing the stages of divergence on the phylogenetic tree. Table 1 shows the complete list of taxa used in the analysis to determine the phylostratigraphic age index of H. sapiens genes, as well as the approximate evolutionary age of these taxa in millions of years from our time. It should be noted that the discussions on this topic are ongoing and there are different data of the age; the values in the table reflect approximate estimates.
Orthoscape also allows to estimate divergence index (DI). DI shows the type of selection to which the gene is in fluenced. This index is calculating based on the dN/dS ratio, where dN reflects the rate of nonsynonymous substitutions between the sequences of analyzed gene and its orthologous gene (the substitutions changing the amino acid encoded) and dS -reflects the rate of synonymous substitutions (the substitutions without changing the amino acid encoded). The index value from 0 to 1 indicates that the gene is under stabilizing selection, value is equal to 1 indicates neutral evolution, and greater than 1 indicates a driving selection. The analysis of this index makes sense only when comparing closely related organisms, because it can't take into account multiple substitutions in the same position, which will be inevitably accumulated when comparing the evolutionary distant organisms. Calculation of dN/dS takes place in two phases: 1. Alignment of original sequences. To align the sequences, the Needleman-Wunsch algorithm is used. The task is to align aminoacid sequences and nucleotide triplets correspond to them and remove the gaps from the result. 2. Aligned sequences are given as input to the PAML (phylogenetic analysis by maximum likelihood) (Yang, 2007) software. Various methods are used to calculate dN/dS. They take into account different positions of triplets, their frequency of occurrence, and other factors. There are Nei-Gojobori (Nei, Gojobori, 1986), Yang & Nielsen (Yang, Nielsen, 2000), LWL85 (Li, 1985), LWLm (Li, 1993), LPB93 (Pamilo, Bianchi, 1993) methods implemented in PAML. To count DI, Orthoscape uses LPB93 method. The formula to count DI is based on dN/dS for every geneortholog pair where dnds i -dN/dS value for gene and ith ortholog; n -number or orthologous genes.

Results and discussion
The analysis of evolutionary characteristics of gene networks 80 networks from KEGG Pathway, Human Diseases were analyzed using Orthoscape software. First of all, PAI and DI values for genes in network have been calculated. Based on these data, PAI values for every network have been calculated (Table 2) as an average PAI value of genes involved in network. Finally, PAI of the category of diseases has been calculated as an average of PAI value of networks from this category. The same metrics have been calculated for DI. There are big PAI variations are observed among the analyzed 80 networks: from 0.44 (i. e., in "Nicotine addiction" gene network, the most of the genes are evolutionary ancient) to 6.38 (i. e., in "Asthma" gene network, the most of the genes are evolutionary young). The DI variation is usually within 0 < DI < 1 interval, i. e., within stabilizing selection interval; however, the level of variability of genes involved in different networks also varies greatly, from 0.16 to 0.64. The diseases "Asthma" and "Nicotine addiction" are the most exuding according to the PAI and DI indices. In the "Asthma" network, evolutionary young and variable genes prevail, and in the "Nicotine addiction" network, evolutio nary ancient and conservative genes prevail. Fig. 2 shows the result of PAI analysis for the "Asthma" and "Nicotine addiction" networks, and Fig. 3 shows the DI results of the same networks.
The most part of genes in the "Asthma" network ( Fig. 2, a) are evolutionary young (colored in green and yellow), with origin on Vertebrata level. On the contrary, in the "Nicotine addiction" network ( Fig. 2, b) all genes have been identified as evolutionary ancient, with origin from the cellular life form (Cellular organisms) to multicellular (Metazoa) stages.
Analysis of the DI indicates that almost all the genes involved in the "Asthma" network (see Fig. 3, a) are more evolutionary variable than the genes involved in the "Nicotine addiction" network (see Fig. 3, b), whose genes are very conservative.
Let's take a look at the estimations of PAI values for 11 disease categories (see Table 2). The most segregated networks are from 4 categories. High PAI and DI values is characteristic of Immune diseases (8 networks) and Infectious diseases: Parasitic (6 networks). Low PAI and DI value is characteristic of cancers: specific types (15 networks) and substance dependence (5 networks).
Genes from the categories above, as well as the complete set of 1436 genes, were divided into two groups: 1) a group of evolutionary ancient genes with PAI < 5 (the age of the genes corresponds to the period of evolution from Cellular organisms to Chordata); 2) a group of evolutionary young genes with PAI ≥ 5 (the age of the genes corresponds to the period of evolution from the Craniata to modern humans).
Contingency tables were created and Fisher's exact test was used to assess whether the difference in the partitioning of genes into groups in the category from the partitioning in the full list of genes was significant (Table 3).
The average PAI of all 1436 genes studied was equal to 2.49. The results from Table 3 show that gene networks from category Immune diseases have not only the highest value of the PAI (5.21), but also a significantly different distribution of the proportion of evolutionary young and ancient genes in comparison with such proportion among all genes analyzed (the last row of the Table 3).
The part of evolutionary young genes in Immune di seases category is 65 %. The most part of genes origin was at vertebrata stages (Vertebrata and Euteleostomi taxa), that corresponds to modern data about the development of specific immunity: it exists in cartilaginous fish (sharks and rays) and, therefore, appeared at least 400-500 million years ago. These fishes have genes related to the genes of the Ig variable region (IgV ), or Tcell receptor (TcR) genes. At the same time, even more primitive vertebrates, the roundworms (hagfish and lampreys), do not have an acquired immunity system; they have neither IgV nor TkR genes (Galaktionov, 2015). The analysis also revealed a small fraction of evolutionary ancient genes in the Immune diseases category. This is consistent with the knowledge that some functions of the immune system originated as early as in unicellular organisms, such as the ability to phagocytose; cells with the Tlymphocyte marker first discovered in ringworms and the histocompatibility system -in sponges (Khaitov, 2009). On the contrary, the highest proportion of evolutionary ancient genes is characteristic of the "Substance dependence" diseases category, which includes genes responsible for addiction to chemicals (88 %). Most of the genes consi dered are involved in nervous system function, including neurotransmitter function.
The infectious diseases parasitic category, which includes genes associated with infectious diseases caused by parasites (53 % of the evolutionary young genes), has a significant difference in the proportion of evolutionary ancient and evolutionary young genes from that among all the genes analyzed. In the case of the infectious diseases parasitic ca tegory, the high proportion of evolutionary young genes can be directly related to the high proportion of evolutionary young genes and the high evolutionary variability of genes found in the Immune diseases category. It is infectious di seases that are one of the most important drivers of immune system evolution. At the same time, infectious diseases of different nature and the immune system coevolve in the process of forming mechanisms to fight each other (Sasaki et al., 2000;Khakoo, 2004;Zheleznikova, 2014).
It should be noted, that there is a significant excess of the proportion of ancient genes over young genes compared to their distribution (ancient/young) in the full sample of   Gene coding the proteins in these networks are shown as rectangles with gene name, the color reflects the gene age. The genes colored in blue and cyan correspond to the most evolutionary ancient taxa, green and yellow correspond to evolutionary younger in compare with taxa colored in blue. genes analyzed in cancers specific types category, which includes genes associated with carcinogenesis. This result is consistent with the current ideas that gene networks involved in cancer development processes were formed during the stages of multicellular organisms origin (DomazetLošo, Tautz, 2010). Let us consider two categories of diseases in more details: (1) immune diseases with the highest proportion of evolu-tionary young genes and (2) substance dependence with the highest proportion of evolutionary ancient genes (Fig. 4). Figure 4 shows the PAI distributions for 13 networks (8 immune diseases networks and 5 substance dependen ce networks) as "violin plot" graphs. The lower and upper points of each graph show the minimum and maximum PAI values, the orange star shows the median PAI values, and the width of the graph for each position on the ordinate  Gene coding the proteins in these networks are shown as rectangles with gene name, the color reflects the gene variability level. In the upper right part of the graph of each network placed the color scheme for DI. The scale for each network is individual, and even the most variable genes involved in the "Nicotine addiction" network have minimal variability compared to genes involved in the "Asthma" network. axis (i. e. for each PAI) shows the proportion of genes with that particular PAI. The median for distributions of immune diseases varies in the range (5, 7) (from Vertebrata to Mammalia), and the distributions themselves have a character expressed in decreasing the number of genes with a corresponding PAI value as PAI decreases. The median for distributions of substance dependence varies in the range (0, 1) (Cellular organisms and Eukaryota), and the distributions themselves have a character expressed in increasing the number of genes with a corresponding PAI value as PAI decreases. These distributions are fundamentally different when comparing the proportion of evolutionary ancient and    Figure 5 shows the distribution of PAI among all the genes involved in the 80 gene networks considered from the KEGG Pathway, Human Diseases. This distribution has two peaks. The left peak includes genes formed early in evolution (from the emergence of cellular organization of life to chordates), and the right one includes genes formed at subsequent stages of evolution (vertebrates to placentals). There were more evolutionary ancient genes than evolutionary young ones. Figure 6 shows the distribution of DI among all the genes involved in considered gene networks from the KEGG Pathway, Human Diseases. The DI analysis makes it possible to estimate what type of selection the genes are influenced. However, it only makes sense when the sequences of the analyzed genes are compared with the orthologous genes of evolutionary close organisms. To calculate dN/dS, human gene sequences were compared with the sequences of  The analysis showed that there is a large significant correlation between PAI and DI with the value of the correlation coefficient (r = 0.876, pvalue < 1.8 × 10 -26 ). It means there is a relationship between the average evolutionary age of genes in gene networks and the level of their genetic variability: the less the evolutionary age of genes, the greater the level of their genetic variability. This agrees well with the fact that evolutionary ancient genes are involved in key processes for organism functionality; they are a subject to many restrictions by other genes and moleculargenetic systems organization peculiarities, so they are not characteri zed by high variability. On the contrary, evolutionary young genes provide adaptation to modern life conditions and are characterized by higher variability.

Conclusion
Phylostratigraphic analysis is a modern methodology that allows genomewide estimation of gene ages based on data on the similarity of genetic sequences and the origin of or-ganisms. Together with information on what type of selection a gene is subject to as a unit of heredity, the results of the analysis allow us to estimate the role of certain genes in the evolution of the gene networks of an organism.
Analysis of gene networks from the KEGG Pathway, Human Diseases database shows several trends. The vast majority of the genes involved in the gene networks studied evolved in the mode of stabilizing selection (DI < 1). There is significant (r = 0.876, pvalue < 1.8 × 10 -26 ) correlation between the average evolutionary age of genes in gene networks and their level of genetic variability: the lower the evolutionary age of genes, the greater the genetic variability is. Some categories of gene networks are especially distinguished by the proportion of evolutionary young and evolutionary ancient genes. The highest proportion of evolutionary young genes (65 %) was found in gene networks from immune diseases category. The highest proportion of evolutionary ancient genes (88 %) was found in gene networks from substance dependence category.
It is also shown that gene networks responsible for the functioning of infectious diseases caused by parasites are significantly enriched with evolutionary young genes, and gene networks responsible for the development of specific cancer types are significantly enriched with evolutionary ancient genes. Such results indicate an active process of adaptation of the human immune system to emerging threats. In addition, the genes involved in chemical addictive diseases have a minimum number of substitutions, i. e., such genes are as conservative as possible. Separate work can be carried out in this direction, with expansion of the original networks thanks to the classifiers and databases currently available.