
Vavilov Journal of Genetics and Breeding


Original Russian text: https://vavilovj-icg.ru/2021-year/25-1/

 

Vol 25, No 1 (2021)
https://doi.org/10.18699/VJ20.677

BIOINFORMATICS AND COMPUTATIONAL SYSTEMS BIOLOGY

 
7-17 772
Abstract
The most popular model for searching ChIP-seq data for transcription factor binding sites (TFBS) is the positional weight matrix (PWM). However, this model does not take into account dependencies between nucleotide occurrences in different site positions. Two recently proposed models, BaMM and InMoDe, can account for such dependencies. However, application of these models has usually been limited to comparing their recognition accuracy with that of PWMs, and no analysis of the co-prediction and relative positioning of hits of different models in peaks has yet been performed. To close this gap, we propose a pipeline called MultiDeNA. The pipeline includes the stages of model training, assessment of recognition accuracy, scanning of ChIP-seq peaks, and classification of the peaks based on the scan results. We applied our pipeline to 22 ChIP-seq datasets of TF FOXA2 and considered PWM, dinucleotide PWM (diPWM), BaMM and InMoDe models. The combination of these four models allowed a significant increase in the fraction of recognized peaks compared to that for the PWM model alone: the increase was 26.3 %. The BaMM model provided the main contribution to the recognition of sites. Although the major fraction of predicted peaks contained TFBS of different models at coinciding positions, the medians of the fraction of peaks containing the predictions of a single model only were 1.08, 0.49, 4.15 and 1.73 % for PWM, diPWM, BaMM and InMoDe, respectively. Thus, FOXA2 BSs were not fully described by any single model, which indicates their heterogeneity. We assume that the BaMM model is the most successful in describing the structure of the FOXA2 BS in the ChIP-seq datasets under study.
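The independence assumption behind the PWM model mentioned in this abstract can be illustrated with a minimal sketch. It is not part of MultiDeNA: the function names, the pseudocount of 1, the uniform background of 0.25 and the toy sites used in the usage note are all assumptions made for illustration. The score of a window is a plain sum of per-position log-odds weights, so no term can depend on two positions at once.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites (hypothetical helper)."""
    pwm = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so unseen bases get a finite weight.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum of independent per-position weights; positions never interact."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)
```

For example, with the made-up sites `["TGTTTAC", "TGTTTGC", "TATTTAC"]`, calling `best_hit(build_pwm(sites), "CCCCTGTTTACGGGG")` locates the embedded site at offset 4. Models such as diPWM, BaMM and InMoDe extend this scheme with terms that depend on more than one position.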
 
18-29 1536
Abstract
Whole-genome and whole-exome sequencing technologies play a very important role in studies of the genetic aspects of the pathogenesis of various diseases. The wide use of genome-wide and exome-wide association study methodology (GWAS and EWAS) has made it possible to identify a large number of genetic variants associated with diseases. This information is accumulated in databases such as GWAS Central, GWAS Catalog, OMIM, ClinVar, etc. Most of the variants identified by GWAS are located in the noncoding regions of the human genome. According to the ENCODE project, the fraction of regions in the human genome potentially involved in transcriptional control is many times greater than the fraction of coding regions. Thus, genetic variation in noncoding regions of the genome can increase susceptibility to diseases by disrupting various regulatory elements (promoters, enhancers, silencers, insulator regions, etc.). However, identifying the mechanisms by which pathogenic genetic variants influence disease risk is difficult because of the wide variety of regulatory elements. The present review focuses on the molecular genetic mechanisms by which pathogenic genetic variants affect gene expression. Attention is concentrated on the transcriptional level of regulation as the initial step in the expression of any gene. A triggering event mediating the effect of a pathogenic genetic variant on the level of gene expression can be, for example, a change in the functional activity of transcription factor binding sites (TFBSs) or a change in DNA methylation, which, in turn, affects the functional activity of promoters or enhancers. Dissecting the regulatory roles of polymorphic loci is impossible without close integration of modern experimental approaches with computer analysis of the growing wealth of genetic and biological data obtained using omics technologies.
The review provides a brief description of a number of the most well-known public genomic information resources containing data obtained using omics technologies, including (1) resources that accumulate data on the chromatin states and the regions of transcription factor binding derived from ChIP-seq experiments; (2) resources containing data on genomic loci, for which allele-specific transcription factor binding was revealed based on ChIP-seq technology; (3) resources containing in silico predicted data on the potential impact of genetic variants on the transcription factor binding sites.
 
30-38 1221
Abstract
De novo transcriptome assembly is an important stage of RNA-seq data computational analysis. It allows researchers to obtain the sequences of transcripts present in the biological sample of interest. The availability of an accurate and complete transcriptome sequence of the organism of interest is, in turn, an indispensable condition for further analysis of RNA-seq data. Through years of transcriptomic research, the bioinformatics community has developed a number of assembler programs for transcriptome reconstruction from short reads of RNA-seq libraries. Different assemblers make it possible to conduct either a de novo transcriptome reconstruction or a genome-guided reconstruction. The majority of the assemblers working with RNA-seq data are based on the De Bruijn graph method of sequence reconstruction. However, the specifics of their procedures can vary drastically, as do their results. A number of authors recommend a hybrid approach to transcriptome reconstruction based on combining the results of several assemblers in order to achieve a better transcriptome assembly. The advantage of this approach has been demonstrated in a number of studies with RNA-seq experiments conducted on the Illumina platform. In this paper, we propose a hybrid approach for creating a transcriptome assembly of the barley Hordeum vulgare isogenic line Bowman and two nearly isogenic lines contrasting in spike pigmentation, based on the results of sequencing on the IonTorrent platform. This approach employs several de novo assemblers: Trinity, Trans-ABySS and rnaSPAdes. Several assembly metrics were examined: the percentage of reference transcripts observed in the assemblies, the percentage of RNA-seq reads involved, and BUSCO scores. It was shown that, judged by the combination of these metrics, the transcriptome meta-assembly surpasses the individual transcriptome assemblies it consists of.
 
39-45 827
Abstract
Active polar transport of the plant hormone auxin carried out by its PIN transporters is a key link in the formation and maintenance of auxin distribution, which, in turn, determines plant morphogenesis. The plasticity of auxin distribution is largely realized through the molecular genetic regulation of the expression of its transporters belonging to the PIN-FORMED (PIN) protein family. Regulation of auxin-response genes occurs through the ARF-Aux/IAA signaling pathway. However, it is not known which ARF-Aux/IAA proteins are involved in the regulation of PIN gene expression by auxin. In Arabidopsis thaliana, the PIN, ARF, and Aux/IAA families each contain a large number of members; various combinations of them are possible in the realization of the signaling pathway, and this is a challenge for understanding the mechanisms of this process. The use of high-throughput sequencing data on auxin-induced transcriptomes makes it possible to identify candidate genes involved in the regulation of PIN expression. To address this problem, we created an approach for the meta-analysis of auxin-induced transcriptomes, which helped us select genes that change their expression during the auxin response together with PIN1, PIN3, PIN4 and PIN7. Possible regulators of the ARF-Aux/IAA signaling pathway were identified for each of the PINs under study, as were the aspects of their regulatory circuits that are common to groups of PIN genes and those that are specific to each PIN gene. Reconstruction of gene networks and their analysis predicted possible interactions between genes and served as an additional confirmation of the pathways obtained in the meta-analysis. The approach developed can be used in the search for gene expression regulators in other genome-wide data.
 
46-56 767
Abstract
Phylostratigraphic analysis is an approach to the study of gene evolution that makes it possible to determine the time of the origin of genes by analyzing their orthologous groups. The age of a gene belonging to an orthologous group is defined as the age of the most recent common ancestor of all species represented in that group. Such an analysis can reveal important stages in the evolution of both the organism as a whole and groups of functionally related genes, in particular gene networks. In addition to the time of origin of a gene, the analysis examines the level of its genetic variability and the type of selection the gene is subject to in relation to the most closely related organisms. Using the Orthoscape application, gene networks from the KEGG Pathway, Human Diseases database describing various human diseases were analyzed. It was shown that the majority of genes described in gene networks are under stabilizing selection, and a high and reliable correlation was found between the time of gene origin and the level of genetic variability: the younger the gene, the higher the level of its variability. It was also shown that among the gene networks analyzed, the highest proportion of evolutionarily young genes was found in the networks associated with diseases of the immune system (65 %), and the highest proportion of evolutionarily ancient genes was found in the networks responsible for the formation of human dependence on addictive substances (88 %); gene networks responsible for the development of infectious diseases caused by parasites are significantly enriched for evolutionarily young genes, and gene networks responsible for the development of specific types of cancer are significantly enriched for evolutionarily ancient genes.
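The definition of gene age given in this abstract can be sketched in a few lines. This is an illustration only, not the Orthoscape implementation: the function names and the simplified root-to-species lineages are hypothetical. The age of a gene is taken as the deepest taxon shared by all species in its orthologous group, so a group spanning more distant species yields an older gene.

```python
# Hypothetical, heavily simplified root-to-species lineages.
LINEAGES = {
    "Homo sapiens": ["Cellular organisms", "Eukaryota", "Metazoa", "Chordata",
                     "Mammalia", "Primates", "Homo sapiens"],
    "Mus musculus": ["Cellular organisms", "Eukaryota", "Metazoa", "Chordata",
                     "Mammalia", "Rodentia", "Mus musculus"],
    "Danio rerio": ["Cellular organisms", "Eukaryota", "Metazoa", "Chordata",
                    "Actinopterygii", "Danio rerio"],
    "Saccharomyces cerevisiae": ["Cellular organisms", "Eukaryota", "Fungi",
                                 "Saccharomyces cerevisiae"],
}

def common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if none is shared."""
    shared = None
    # zip stops at the shortest lineage; walk down from the root level by level.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        shared = level[0]
    return shared

def gene_age(orthologous_group, lineages=LINEAGES):
    """Taxon of the most recent common ancestor of all species in the group."""
    return common_ancestor([lineages[species] for species in orthologous_group])
```

With these toy lineages, a group containing only human and mouse orthologs is dated to Mammalia, while adding a yeast ortholog pushes the origin back to Eukaryota.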
 
57-63 1347
Abstract
Progress in genome sequencing, assembly and analysis allows for a deeper study of agricultural plants’ chromosome structures, gene identification and annotation. The published genomes of agricultural plants have proved to be a valuable tool for studying gene functions and for marker-assisted and genomic selection. However, large structural genome changes, including gene copy number variations (CNVs) and gene presence/absence variations (PAVs), prevail in crops. These genomic variations play an important role in the functional set of genes and the gene composition in individuals of the same species and provide the genetic determination of agronomically important crop properties. The high degree of genomic variation observed indicates that single reference genomes do not represent the diversity within a species, which has led to the pangenome concept. The pangenome represents information about all genes in a taxon: those that are common to all taxon members and those that are variable and are partially or completely specific to particular individuals. Pangenome sequencing and analysis technologies enable large-scale study of genomic variation and provide resources for evolutionary research, functional genomics and crop breeding. This review provides an analysis of agricultural plants’ pangenome studies. Pangenome structural features, methods and programs for bioinformatic analysis of pangenomic data are described.
 
64-70 836
Abstract
Determining the quantitative content of chlorophylls in plant leaves from their reflection spectra is an important task both in monitoring the state of natural and industrial phytocenoses and in laboratory studies of normal and pathological processes during plant growth. The use of machine learning methods for these purposes is promising, since these methods allow inferring the relationships between input and output variables (a prediction model), and, in order to improve the quality of the prediction, a researcher may modify the predictors and select a set of method parameters. Here, we present the results of the implementation and evaluation of the random forest algorithm for predicting the total concentration of chlorophylls a and b from the reflection spectra of plant leaves in the visible and infrared wavelengths. We used the reflection spectra for 276 leaf samples from 39 plant species obtained from open sources. Of these, 181 samples were from the sycamore maple (Acer pseudoplatanus L.). The reflection spectrum covered wavelengths from 400 to 2500 nm with a step of 1 nm. The training set consisted of 85 % of the A. pseudoplatanus L. samples, and the performance was evaluated on the remaining 15 % of the samples of this species (validation sample). Six models based on the random forest algorithm with different predictors were evaluated. The control parameters were selected by cross-validation on five partitions. For the first model, the intensity of the reflection spectra without any transformation was used. Based on the analysis of this model, the optimal ranges of wavelengths for the remaining five models were selected. The best results were obtained by models that used a two-point estimation of the derivative of the reflection spectrum in the visible wavelength range as input data.
We compared one of these models (the two-point estimation of the derivative of the reflection spectrum in the range of 400–800 nm with a step of 1 nm) with a model by other authors (which is based on the functional dependence between two unknown parameters selected by the least squares method and two reflection coefficients, the choice of which is described in the article). The predictions of the model based on the random forest algorithm were compared with those of the model of other authors both on the validation sample of maple and on the sample from other plant species. In the first case, the predictions of the method based on a random forest had a lower estimate of the standard deviation. In the second case, the predictions of this method had a large error for small values of chlorophyll, while the third-party method had acceptable predictions. The article provides an analysis of the results, as well as recommendations for using this machine learning method to assess the quantitative content of chlorophylls in leaves.
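The derivative predictor described in this abstract can be sketched as follows. The abstract does not state which two-point formula was used, so the forward difference between adjacent wavelengths is an assumption here, as are the function name and the toy reflectance values in the usage note.

```python
def two_point_derivative(reflectance, step_nm=1.0):
    """Forward-difference estimate of d(reflectance)/d(wavelength).

    `reflectance` holds values sampled at a constant wavelength step;
    the result has one element fewer than the input.
    """
    return [(right - left) / step_nm
            for left, right in zip(reflectance, reflectance[1:])]
```

For example, `two_point_derivative([0.10, 0.12, 0.15, 0.15])` gives values close to `[0.02, 0.03, 0.0]`. Under this assumption, a spectrum sampled from 400 to 800 nm at 1 nm intervals (401 values) would give 400 derivative features per leaf, which would then serve as the input to the random forest model.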
 
71-81 1078
Abstract

Intraspecific classification of cultivated plants is necessary for the conservation of biological diversity and for the study of their origin and phylogeny. The modern cultivated wheat species originated from three wild diploid ancestors as a result of several rounds of genome doubling and are represented by di-, tetra- and hexaploid species. The identification of the ploidy level of wheats is one of the main stages of their taxonomy. Such classification is possible based on visual analysis of wheat spike traits. The aim of this study is to investigate the morphological characteristics of spikes of hexa- and tetraploid wheat species using high-performance phenotyping. Phenotyping of the quantitative characteristics of the spike was performed for 17 wheat species (595 plants, 3348 images), including eight tetraploids (Triticum aethiopicum, T. dicoccoides, T. dicoccum, T. durum, T. militinae, T. polonicum, T. timopheevii, and T. turgidum) and nine hexaploids (T. compactum, T. aestivum, i:ANK-23 (near-isogenic line of T. aestivum cv. Novosibirskaya 67), T. antiquorum, T. spelta (including cv. Rother Sommer Kolben), T. petropavlovskyi, T. yunnanense, T. macha, T. sphaerococcum, and T. vavilovii). Wheat spike morphology was described on the basis of nine quantitative traits, including the shape, size and awn area of the spike. The traits were obtained by image analysis using the WERecognizer program. A cluster analysis of plants according to the characteristics of the spike shape and a comparison of their distributions in tetraploid and hexaploid species showed a higher variability of traits in hexaploid species than in tetraploid ones. At the same time, the species themselves form two clusters in the visual characteristics of the spike. One cluster consists predominantly of hexaploid species (with the exception of one tetraploid, T. dicoccoides). The other cluster includes tetraploid species (with the exception of three hexaploid species, T. compactum, T. antiquorum and T. sphaerococcum, and the line i:ANK-23). Thus, it has been shown that the morphological characteristics of spikes of hexaploid and tetraploid wheat species, obtained by computer analysis of images, differ, and these differences can be used further to develop methods for the automatic classification of plants by ploidy level and species.

 
82-91 983
Abstract
The paper presents the results of sensitivity-based identifiability analysis of COVID-19 pandemic spread models in the Novosibirsk region using systems of differential equations and the mass balance law. The algorithm is built on sensitivity matrix analysis using the methods of differential and linear algebra. It allows one to determine the parameters that are the least and most sensitive to data changes in order to build a regularization for solving the problem of identifying the most accurate pandemic spread scenarios in the region. The analysis has demonstrated that the virus contagiousness is identifiable from the number of daily confirmed, critical and recovery cases. On the other hand, the predicted proportion of admitted patients who require a ventilator and the mortality rate are determined much less consistently. It has been shown that building a more realistic forecast requires additional information about the process, such as the number of daily hospital admissions. In our study, the problems of parameter identification using additional information about the number of daily confirmed, critical and mortality cases in the region were reduced to minimizing the corresponding misfit functions. The minimization problem was solved by the differential evolution method, which is widely applied for stochastic global optimization. It has been demonstrated that a more general COVID-19 spread compartmental model consisting of seven ordinary differential equations describes the main trend of the spread and is sensitive to the peaks of confirmed cases but does not qualitatively describe small statistical datasets, such as the number of daily critical cases or mortality, which can lead to errors in forecasting. A more detailed agent-oriented model has been able to capture statistical data with additional noise to build scenarios of COVID-19 spread in the region.
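The idea of a sensitivity matrix can be shown on a toy example. The sketch below is not the authors' seven-equation model or their algorithm: it assumes a two-parameter SIR model integrated with daily Euler steps, finite-difference derivatives, and column norms as a crude stand-in for the linear-algebra analysis mentioned in the abstract. All names and parameter values are hypothetical.

```python
def simulate_sir(beta, gamma, days=30, population=1000.0, infected0=1.0):
    """Toy SIR model with daily Euler steps; returns the infected count per day."""
    susceptible, infected = population - infected0, infected0
    trajectory = []
    for _ in range(days):
        new_infections = beta * susceptible * infected / population
        recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - recoveries
        trajectory.append(infected)
    return trajectory

def sensitivity_matrix(params, rel_step=1e-4):
    """Rows are days, columns are parameters.

    Each entry approximates p * d(output)/dp, i.e. the change in the model
    output per unit relative change of parameter p.
    """
    base = simulate_sir(*params)
    columns = []
    for k, value in enumerate(params):
        shifted = list(params)
        shifted[k] = value * (1.0 + rel_step)
        perturbed = simulate_sir(*shifted)
        columns.append([(y1 - y0) / rel_step for y0, y1 in zip(base, perturbed)])
    return [list(row) for row in zip(*columns)]

def column_norms(matrix):
    """Euclidean norm of each column; a larger norm means a more influential parameter."""
    return [sum(row[k] ** 2 for row in matrix) ** 0.5
            for k in range(len(matrix[0]))]
```

Calling `column_norms(sensitivity_matrix((0.4, 0.1)))` gives one number per parameter. A parameter whose column is small, or nearly proportional to another column, is poorly determined by the data, which is the situation that motivates regularization and additional observations.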
 
92-100 985
Abstract
The assumption that chronic mechanical stress in brain cells stemming from intracranial hypertension, arterial hypertension, or mechanical injury is a risk factor for neurodegenerative diseases was put forward in the 1990s and has since been supported. However, the molecular mechanisms that underlie the path from cell exposure to mechanical stress to disturbances in synaptic plasticity, followed by changes in behavior, cognition, and memory, are still poorly understood. Here we review (1) the current knowledge of molecular mechanisms regulating local translation and the actin cytoskeleton state at an activated synapse, where they play a key role in the formation of various sorts of synaptic plasticity and long-term memory, and (2) possible pathways of mechanical stress intervention. The roles of the mTOR (mammalian target of rapamycin) signaling pathway; the RNA-binding FMRP protein; the CYFIP1 protein, interacting with FMRP; the family of small GTPases; and the WAVE regulatory complex in the regulation of translation initiation and actin cytoskeleton rearrangements in dendritic spines of the activated synapse are discussed. Evidence is provided that chronic mechanical stress may result in aberrant activation of mTOR signaling and the WAVE regulatory complex via the YAP/TAZ system, the key sensor of mechanical signals, and influence the associated pathways regulating the formation of F-actin filaments and the dendritic spine structure. These consequences may be a risk factor for various neurological conditions, including autistic spectrum disorders and epileptic encephalopathy.
In further consideration of the role of the local translation system in the development of neuropsychic and neurodegenerative diseases, an original hypothesis was put forward that one of the possible causes of synaptopathies is impaired proteome stability associated with mTOR hyperactivity and formation of complex dynamic modes of de novo protein synthesis in response to synapse-stimulating factors, including chronic mechanical stress.

BIOTECHNOLOGY

 
101-107 730
Abstract
In eukaryotes, trans-splicing is a process of nuclear pre-mRNA maturation where two different RNA molecules are joined together by the spliceosomal machinery utilizing mechanisms similar to cis-splicing. In diverse taxa of lower eukaryotes, spliced leader (SL) trans-splicing is the most frequent type of trans-splicing, in which the same sequence derived from short small nuclear RNA molecules, called SL RNAs, is attached to the 5’ ends of different non-processed pre-mRNAs. One of the functions of SL trans-splicing is processing polycistronic pre-mRNA molecules transcribed from operons, in which several genes are transcribed as one pre-mRNA molecule. However, only a fraction of trans-spliced genes reside in operons, suggesting that SL trans-splicing must also have some other, less understood functions. Regenerative flatworms are informative model organisms which hold the keys to understanding the mechanism of stem cell regulation and specialization during regeneration and homeostasis. Their ability to regenerate is fueled by the division and differentiation of the adult somatic stem cell population called neoblasts. Macrostomum lignano is a flatworm model organism where substantial technological advances have been achieved in recent years, including the development of transgenesis. Although a large fraction of genes in M. lignano were estimated to be SL trans-spliced, SL trans-splicing has not been studied in detail in M. lignano before. Here, we performed the first comprehensive study of SL trans-splicing in M. lignano. By reanalyzing the existing genome and transcriptome data of M. lignano, we estimate that 30 % of its genes are SL trans-spliced, 15 % are organized in operons, and almost 40 % are both SL trans-spliced and in operons. We annotated and characterized the sequence of SL RNA and characterized conserved cis- and SL trans-splicing motifs.
Finally, we found that a majority of SL trans-spliced genes are evolutionarily conserved and significantly over-represented in neoblast-specific genes. Our findings suggest an important role of SL trans-splicing in the regulation and maintenance of neoblasts in M. lignano.
 
108-116 1134
Abstract
Hundreds of millions of people worldwide are infected by various species of parasitic flatworms. Without treatment, acute and chronic infections frequently lead to the development of severe pathologies and even death. Emerging data on the decreasing efficiency of some important anthelmintic compounds and the emergence of resistance to them force the search for alternative drugs. Parasitic flatworms have complex life cycles, are laborious and expensive to culture, and have a range of anatomic and physiological adaptations that complicate the application of standard molecular-biological methods. On the other hand, free-living flatworm species, evolutionarily close to parasitic flatworms, do not have the abovementioned difficulties, which makes them potential alternative models to search for and study homologous genes. In this review, we describe the use of the basal free-living flatworm Macrostomum lignano as such a model. M. lignano has a number of convenient biological and experimental properties, such as fast reproduction, easy and inexpensive laboratory culturing, optical body transparency, obligatory sexual reproduction, annotated genome and transcriptome assemblies, and the availability of modern molecular methods, including transgenesis, gene knockdown by RNA interference, and in situ hybridization. All this makes M. lignano amenable to the most modern approaches of forward and reverse genetics, such as transposon insertional mutagenesis and methods of targeted genome editing by the CRISPR/Cas9 system. Due to the availability of an increasing number of genome and transcriptome assemblies of different parasitic flatworm species, new knowledge generated by studying M. lignano can be easily translated to parasitic flatworms with the help of modern bioinformatic methods of comparative genomics and transcriptomics. In support of this, we provide the results of our bioinformatics search and analysis of genes homologous between M. lignano and parasitic flatworms, which predicts a list of promising gene targets for subsequent research.
 
117-124 743
Abstract
There are more than 30 inherited human disorders connected with repeat expansion (e.g., myotonic dystrophy type I, Huntington’s disease, Fragile X syndrome). Fragile X syndrome is the most common cause of inherited intellectual disability in the human population. The ways in which the expansion develops remain unclear. An important feature of expanded repeats is the ability to form stable alternative DNA secondary structures. There are several hypotheses about the nature of repeat instability. It is proposed that these DNA secondary structures can block various stages of DNA metabolism processes, such as replication, repair and recombination, and this is considered to be the source of repeat instability. However, none of the hypotheses is fully confirmed or is the only valid one. Here, an experimental system for studying (CGG)n repeat expansion associated with transcription and TCR-NER is proposed. It is noteworthy that aberrations of transcription are a poorly studied mechanism of (CGG)n instability. However, the proposed systems take into account the contribution of other processes of DNA metabolism and, therefore, the developed systems are universal and applicable for various studies. Transgenic cell lines carrying a repeat of normal or premutant length under the control of an inducible promoter were established, and a method for repeat instability quantification was developed. One type of the cell lines contains an exogenous repeat integrated into the genome by the Sleeping Beauty transposon; in another cell line, the vector is maintained as an episome due to the SV40 origin of replication. These experimental systems can serve for finding the causes of instability and for the development of therapeutic agents. In addition, a criterion was developed for the quantification of exogenous (CGG)n repeat instability in the transgenic cell lines’ genome.
 
125-134 2135
Abstract
In this review, we discuss the progress in the study and modification of subtilisin proteases. Despite longstanding applications of microbial proteases and a large number of research papers, the search for new protease genes, the construction of producer strains, and the development of methods for their practical application are still relevant and important, judging by the number of citations of the research articles on proteases and their microbial producers. This enzyme class represents the largest share of the industrial production of proteins worldwide. This situation can explain the high level of interest in these enzymes and points to the high importance of designing domestic technologies for their manufacture. The review covers subtilisin classification, the history of their discovery, and subsequent research on the optimization of their properties. An overview of the classes of subtilisin proteases and related enzymes is also provided. We discuss the problems with the search for (and selection of) subtilases from natural strains of various microorganisms, approaches to (and specifics of) their modification, as well as the relevant genetic engineering techniques. Details are provided on the methods for expression optimization of industrial subtilases of various strains, namely the most important parameters of cultivation, i.e., composition of the media, culture duration, and the influence of temperature and pH. Also presented are the results of the latest studies on cultivation techniques: submerged and solid-state fermentation. From the literature data reviewed, we can conclude that native enzymes (i.e., those obtained from natural sources) currently have hardly any practical applications because of the decisive advantages of the enzymes modified by genetic engineering and having better properties: e.g., thermal stability, general resistance to detergents and specific resistance to various oxidants, high activity in various temperature ranges, independence from metal ions, and stability in the absence of calcium. The vast majority of subtilisin proteases are expressed in producer strains belonging to different species of the genus Bacillus. Meanwhile, there is an effort to adapt the expression of these enzymes to other microbes, in particular the yeast Pichia pastoris.


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2500-3259 (Online)