COMPUTATIONAL GENOMICS
It was previously shown that the expression levels of human genes positively correlate with TBP affinity for the promoters of these genes. In turn, single nucleotide polymorphisms (SNPs) in human gene promoters can affect TBP affinity for DNA and, as a consequence, gene expression. The Institute of Cytology and Genetics SB RAS (ICG) has developed a method for predicting TBP affinity for gene promoters based on a three-step binding mechanism: (1) TBP slides along DNA, (2) TBP stops at the binding site, and (3) the TBP-promoter complex is fixed due to DNA helix bending. The method showed a high correlation of theoretical predictions with measured values during repeated experimental testing by independent groups of researchers. This model served as a base for other ICG web services, SNP_TATA_Z-tester and SNP_TATA_Comparator, which make a statistical assessment of the SNP-induced change in the affinity of TBP binding to the human gene promoter and help predict changes in expression that may be associated with a genetic predisposition to diseases or phenotypic features of the organism. In this work, we integrated into a single database information about SNPs in human gene promoters obtained by automatic extraction from various heterogeneous data sources, as well as the estimates of TBP affinity for the promoter obtained using the three-step binding model and predicting their effect on gene expression for wild-type promoters and promoters with SNPs. We have shown that Human_SNP_TATAdb can be used for annotation and identification of candidate SNP markers of diseases. The results of a genome-wide data analysis are presented, including the distribution of genes with respect to the number of transcripts, the distribution of SNPs affecting TBP-DNA affinity with respect to positions within promoters, as well as patterns linking TBP affinity for the promoter, the specificity of the TBP binding site for the promoter and other characteristics of promoters. 
The results of the genome-wide analysis showed that the affinity of TBP for the promoter and the specificity of its binding site are statistically related to other characteristics of promoters important for the functional classification of promoters and the study of the features of differential gene expression.
The development of next-generation sequencing technologies has provided new opportunities for genotyping various organisms, including plants. Genotyping by sequencing (GBS) is used to identify genetic variability more rapidly, and is more cost-effective than whole-genome sequencing. GBS has demonstrated its reliability and flexibility for a number of plant species and populations. It has been applied to genetic mapping, molecular marker discovery, genomic selection, genetic diversity studies, variety identification, conservation biology and evolutionary studies. However, reduction in sequencing time and cost has led to the need to develop efficient bioinformatics analyses for an ever-expanding amount of sequenced data. Bioinformatics pipelines for GBS data analysis serve this purpose. Due to the similarity of data processing steps, existing pipelines are mainly characterised by a combination of software packages specifically selected either to process data for certain organisms or to process data from any organisms. However, despite the usage of efficient software packages, these pipelines have some disadvantages. For example, there is a lack of process automation (in some pipelines, each step must be started manually), which significantly reduces the performance of the analysis. In the majority of pipelines, there is no possibility of automatic installation of all necessary software packages; for most of them, it is also impossible to switch off unnecessary or completed steps. In the present work, we have developed a GBS-DP bioinformatics pipeline for GBS data analysis. The pipeline can be applied to various species. The pipeline is implemented using the Snakemake workflow engine. This implementation makes it possible to fully automate the calculations and the installation of the necessary software packages. Our pipeline is able to perform analysis of large datasets (more than 400 samples).
SYSTEMS COMPUTATIONAL BIOLOGY
Identification of the mechanisms underlying the genetic control of spatial structure formation is among the relevant tasks of developmental biology. Both experimental and theoretical approaches and methods are used for this purpose, including gene network methodology, as well as mathematical and computer modeling. Reconstruction and analysis of the gene networks that provide the formation of traits allow us to integrate the existing experimental data and to identify the key links and intra-network connections that ensure the function of networks. Mathematical and computer modeling is used to obtain the dynamic characteristics of the studied systems and to predict their state and behavior. An example of the spatial morphological structure is the Drosophila bristle pattern with a strictly defined arrangement of its components – mechanoreceptors (external sensory organs) – on the head and body. The mechanoreceptor develops from a single sensory organ parental cell (SOPC), which is isolated from the ectoderm cells of the imaginal disk. It is distinguished from its surroundings by the highest content of proneural proteins (ASC), the products of the achaete-scute proneural gene complex (AS-C). The SOPC status is determined by the gene network we previously reconstructed and the AS-C is the key component of this network. AS-C activity is controlled by its subnetwork – the central regulatory circuit (CRC) comprising seven genes: AS-C, hairy, senseless (sens), charlatan (chn), scratch (scrt), phyllopod (phyl), and extramacrochaete (emc), as well as their respective proteins. In addition, the CRC includes the accessory proteins Daughterless (DA), Groucho (GRO), Ubiquitin (UB), and Seven-in-absentia (SINA). The paper describes the results of computer modeling of different CRC operation modes. As is shown, a cell is determined as an SOPC when the ASC content increases approximately 2.5-fold relative to the level in the surrounding cells. 
The hierarchy of the effects of mutations in the CRC genes on the dynamics of ASC protein accumulation is clarified. AS-C as the main CRC component is the most significant. The mutations that decrease the ASC content by more than 40 % lead to the prohibition of SOPC segregation.
The infectious disease caused by human immunodeficiency virus type 1 (HIV-1) remains a serious threat to human health. The current approach to HIV-1 treatment is based on the use of highly active antiretroviral therapy, which has side effects and is costly. For clinical practice, it is highly important to create functional cures that can enhance immune control of viral growth and infection of target cells with a subsequent reduction in viral load and restoration of the immune status. HIV-1 control efforts with reliance on immunotherapy remain at a conceptual stage due to the complexity of a set of processes that regulate the dynamics of infection and immune response. For this reason, it is extremely important to use methods of mathematical modeling of HIV-1 infection dynamics for theoretical analysis of possibilities of reducing the viral load by affecting the immune system without the usage of antiviral therapy. The aim of our study is to examine the existence of bi-, multistability and hysteresis properties with a meaningful mathematical model of HIV-1 infection. The model describes the most important blocks of the processes of interaction between viruses and the human body, namely, the spread of infection in productively and latently infected cells, the appearance of viral mutants and the development of the T cell immune response. Furthermore, our analysis aims to study the possibilities of transferring the clinical pattern of the disease from a more severe state to a milder one. We analyze numerically the conditions for the existence of steady states of the mathematical model of HIV-1 infection for the numerical values of model parameters corresponding to phenotypically different variants of the infectious disease course. To this end, original computational methods of bifurcation analysis of mathematical models formulated with systems of ordinary differential equations and delay differential equations are used. 
The macrophage activation rate constant is considered as a bifurcation parameter. The regions in the model parameter space, in particular, for the rate of activation of innate immune cells (macrophages), in which the properties of bi-, multistability and hysteresis are expressed, have been identified, and the features characterizing transition kinetics between stable equilibrium states have been explored. Overall, the results of bifurcation analysis of the HIV-1 infection model form a theoretical basis for the development of combination immune-based therapeutic approaches to HIV-1 treatment. In particular, the results of the study of the HIV-1 infection model for parameter sets corresponding to different phenotypes of disease dynamics (typical, long-term non-progressing and rapidly progressing courses) indicate that an effective functional treatment (cure) of HIV-1-infected patients requires the development of a personalized approach that takes into account both the properties of the HIV-1 quasispecies population and the patient’s immune status.
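As a purely illustrative aside, bistability of the kind examined in the abstract above can be seen in a one-variable toy system. The sketch below is not the HIV-1 model of the study: the equation dx/dt = basal + x^2/(1 + x^2) - decay*x, the parameter values, and the function names `rate` and `simulate` are all assumptions chosen so that the steady states can be checked by hand (x = 0 and x = 2 are stable, x = 0.5 is unstable for basal = 0 and decay = 0.4).

```python
def rate(x, basal=0.0, decay=0.4):
    """Toy right-hand side: basal input + saturating self-activation - linear decay."""
    return basal + x * x / (1.0 + x * x) - decay * x


def simulate(x0, t_end=200.0, dt=0.01, **params):
    """Integrate dx/dt = rate(x) with the explicit Euler method; return the final state."""
    x = x0
    for _ in range(int(t_end / dt)):
        x += dt * rate(x, **params)
    return x


# Two initial conditions on either side of the unstable steady state x = 0.5
low = simulate(0.3)   # relaxes towards the lower stable state
high = simulate(0.7)  # relaxes towards the upper stable state
print(low, high)
```

With the same parameters, the two trajectories settle at different equilibria depending only on the initial condition. In a bifurcation analysis one would then vary a parameter (here `decay` or `basal`, by analogy with the macrophage activation rate constant) and record where the number of stable states changes.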
Postoperative delirium (POD) is considered one of the most severe complications, resulting in impaired cognitive function, extended hospitalization, and higher treatment costs. The challenge of early POD diagnosis becomes particularly significant in cardiac surgery cases, as the incidence of this complication exceeds 50 % in certain patient categories. While it is known that neuroinflammation, neurotransmitter imbalances, disruptions in neuroendocrine regulation, and interneuronal connections contribute significantly to the development of POD, the molecular genetic mechanisms of POD in cardiac surgery patients, along with potential metabolomic diagnostic markers, remain inadequately understood. In this study, blood plasma was collected from a group of patients over 65 years old after cardiac surgery involving artificial circulation. The collected samples were analyzed for sphingomyelin content and quantity using high-performance liquid chromatography coupled with mass spectrometry (HPLC-MS/MS) methods. The analysis revealed four sphingomyelins whose content differed significantly in patients with POD compared to those who did not develop POD (control group). Employing gene network reconstruction, we identified a set of 82 regulatory proteins involved in the genetic coordination of the sphingolipid metabolism pathway. Within this set, 47 are assumed to be regulators of gene expression, governing the transcription of enzymes pivotal to the metabolic cascade. The remaining 35 are considered to be regulators of the activity, degradation, and translocation of enzymes integral to this pathway. Analysis of the overrepresentation of diseases with which these regulatory proteins are associated showed that the regulators can be categorized into two groups, associated with cardiovascular pathologies (CVP) and neuropsychiatric diseases (NPD), respectively. 
The regulators associated with CVP are expectedly related to the effects on myocardial tissue during surgery. It is hypothesized that dysfunction of NPD-associated regulators may specifically account for the development of POD after cardiac surgery. Thus, the identified regulatory genes may provide a basis for planning further experiments, in order to study disorders at the level of expression of these genes, as well as impaired function of proteins encoded by them in patients with POD. The identified significant sphingolipids can be considered as potential markers of POD.
The participants of Hepatitis C virus (HCV) replication are both viral and host proteins. Therapeutic approaches based on activity inhibition of viral non-structural proteins NS3, NS5A, and NS5B are undergoing clinical trials. However, rapid mutation processes in the viral genome and acquisition of drug resistance to the existing drugs remain the main obstacles to fighting HCV. Identifying the host factors, exploring their role in HCV RNA replication, and studying viral effects on their expression is essential for understanding the mechanisms of viral replication and developing novel, effective curative approaches. It is known that the host factors PREB (prolactin regulatory element binding) and PLA2G4C (cytosolic phospholipase A2 gamma) are important for the functioning of the viral replicase complex and the formation of the platforms of HCV genome replication. The expression of PREB and PLA2G4C was significantly elevated in the presence of the HCV genome. However, the mechanisms of its regulation by HCV remain unknown. In this paper, using a text-mining technology provided by ANDSystem, we reconstructed and analyzed gene networks describing regulatory effects on the expression of PREB and PLA2G4C by HCV proteins. On the basis of the gene network analysis performed, we put forward hypotheses about the modulation of the host factors functions resulting from protein-protein interaction with HCV proteins. Among the viral proteins, NS3 showed the greatest number of regulatory linkages. We assumed that NS3 could inhibit the function of host transcription factor (TF) NOTCH1 by protein-protein interaction, leading to upregulation of PREB and PLA2G4C. Analysis of the gene networks and data on differential gene expression in HCV-infected cells allowed us to hypothesize further how HCV could regulate the expression of TFs, the binding sites of which are localized within PREB and PLA2G4C gene regions. 
The results obtained can be used for planning studies of the molecular-genetic mechanisms of viral-host interaction and searching for potential targets for anti-HCV therapy.
Hepatocellular carcinoma (HCC) is a common severe type of liver cancer characterized by an extremely aggressive course and low survival rates. It is known that disruptions in the regulation of apoptosis activation are some of the key features inherent in most cancer cells, which determines the pharmacological induction of apoptosis as an important strategy for cancer therapy. The computer design of chemical compounds capable of specifically regulating the external signaling pathway of apoptosis induction represents a promising approach for creating new effective ways of therapy for liver cancer and other oncological diseases. However, at present, most of the studies are devoted to pharmacological effects on the internal (mitochondrial) apoptosis pathway. In contrast, the external pathway induced via cell death receptors remains out of focus. Aberrant gene methylation, along with hepatitis C virus (HCV) infection, are important risk factors for the development of hepatocellular carcinoma. The reconstruction of gene networks describing the molecular mechanisms of interaction of aberrantly methylated genes with key participants of the extrinsic apoptosis pathway and their regulation by HCV proteins can provide important information when searching for pharmacological targets. In the present study, 13 criteria were proposed for prioritizing potential pharmacological targets for developing anti-hepatocarcinoma drugs modulating the extrinsic apoptosis pathway. The criteria are based on indicators of the structural and functional organization of reconstructed gene networks of hepatocarcinoma, the extrinsic apoptosis pathway, and regulatory pathways of virus-extrinsic apoptosis pathway interaction and aberrant gene methylation-extrinsic apoptosis pathway interaction using ANDSystem. 
The list of the top 100 gene targets ranked according to the prioritization rating was statistically significantly (p-value = 0.0002) enriched for known pharmacological targets approved by the FDA, indicating the correctness of the prioritization method. Among the promising potential pharmacological targets, six highly ranked genes (JUN, IL10, STAT3, MYC, TLR4, and KHDRBS1) are likely to deserve close attention.
The animal models used in biomedical research cover virtually every human disease. RatDEGdb, a knowledge base of the differentially expressed genes (DEGs) of the rat as a model object in biomedical research is a collection of published data on gene expression in rat strains simulating arterial hypertension, age-related diseases, psychopathological conditions and other human afflictions. The current release contains information on 25,101 DEGs representing 14,320 unique rat genes that change transcription levels in 21 tissues of 10 genetic rat strains used as models of 11 human diseases based on 45 original scientific papers. RatDEGdb is novel in that, unlike any other biomedical database, it offers the manually curated annotations of DEGs in model rats with the use of independent clinical data on equal changes in the expression of homologous genes revealed in people with pathologies. The rat DEGs put in RatDEGdb were annotated with equal changes in the expression of their human homologs in affected people. In its current release, RatDEGdb contains 94,873 such annotations for 321 human genes in 836 diseases based on 959 original scientific papers found in the current PubMed. RatDEGdb may be interesting first of all to human geneticists, molecular biologists, clinical physicians, genetic advisors as well as experts in biopharmaceutics, bioinformatics and personalized genomics. RatDEGdb is publicly available at https://www.sysbio.ru/RatDEGdb.
STRUCTURAL COMPUTATIONAL BIOLOGY AND PHARMACOLOGY
To date, many derivatives and analogs of nucleic acids (NAs) have been developed. Some of them have found uses in scientific research and biomedical applications. Their effective use is based on the data about their properties. Some of the most important physicochemical properties of oligonucleotides are thermodynamic parameters of the formation of their duplexes with DNA and RNA. These parameters can be calculated only for a few NA derivatives: locked NAs, bridged oligonucleotides, and peptide NAs. Existing predictive approaches are based on an analysis of experimental data and the consequent construction of predictive models. The ongoing pilot studies aimed at devising methods for predicting the properties of NAs by computational modeling techniques are based only on knowledge about the structure of oligonucleotides. In this work, we studied the applicability of the weighted histogram analysis method (WHAM) in combination with umbrella sampling to the calculation of thermodynamic parameters of DNA duplex formation (changes in enthalpy ∆H°, entropy ∆S°, and Gibbs free energy ∆G37° ). A procedure was designed involving WHAM for calculating the hybridization properties of oligodeoxyribonucleotides. Optimal parameters for modeling and calculation of thermodynamic parameters were determined. The feasibility of calculation of ∆H°, ∆S°, and ∆G37° was demonstrated using a representative sample of 21 oligonucleotides 4–16 nucleotides long with a GC content of 14–100 %. Error of the calculation of the thermodynamic parameters was 11.4, 12.9, and 11.8 % for ∆H°, ∆S°, and ∆G37° , respectively, and the melting temperature was predicted with an average error of 5.5 °C. Such high accuracy of computations is comparable with the accuracy of the experimental approach and of other methods for calculating the energy of NA duplex formation. In this paper, the use of WHAM for computation of the energy of DNA duplex formation was systematically investigated for the first time. 
Our results show that a reliable calculation of the hybridization parameters of new NA derivatives is possible, including derivatives not yet synthesized. This work opens up new horizons for a rational design of constructs based on NAs for solving problems in biomedicine and biotechnology.
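The relation between the three thermodynamic parameters named in the abstract above can be written out as a short sketch. It uses only the standard identity ∆G° = ∆H° − T∆S° at 37 °C (310.15 K) and the textbook two-state expression for the melting temperature of a non-self-complementary duplex, Tm = ∆H° / (∆S° + R ln(Ct/4)). It does not reproduce the WHAM and umbrella sampling procedure of the study. The function names, the units (kcal/mol for ∆H°, cal/(mol·K) for ∆S°) and the example values are assumptions made for illustration.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol*K)


def delta_g37(delta_h_kcal, delta_s_cal):
    """Gibbs free energy at 37 degrees C in kcal/mol: dG = dH - T*dS."""
    temperature_k = 310.15
    return delta_h_kcal - temperature_k * delta_s_cal / 1000.0


def melting_temp_c(delta_h_kcal, delta_s_cal, total_strand_conc_molar):
    """Two-state Tm (degrees C) for a non-self-complementary duplex.

    Tm = dH / (dS + R * ln(Ct / 4)), with dH converted to cal/mol.
    """
    denominator = delta_s_cal + R_CAL * math.log(total_strand_conc_molar / 4.0)
    tm_kelvin = 1000.0 * delta_h_kcal / denominator
    return tm_kelvin - 273.15


# Hypothetical duplex: dH = -50 kcal/mol, dS = -140 cal/(mol*K), Ct = 10 uM
print(delta_g37(-50.0, -140.0))
print(melting_temp_c(-50.0, -140.0, 1e-5))
```

Because ∆G37° is a small difference between two large terms, relative errors of around 10 % in ∆H° and ∆S° matter less for the melting temperature when they are correlated, which is why Tm is usually reported alongside the individual parameters.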
EVOLUTIONARY COMPUTATIONAL BIOLOGY
Cancer is a complex and heterogeneous disease characterized by the accumulation of genetic alterations that drive uncontrolled cell growth and proliferation. Evolutionary dynamics plays a crucial role in the emergence and development of tumors, shaping the heterogeneity and adaptability of cancer cells. From the perspective of evolutionary theory, tumors are complex ecosystems that evolve through a process of microevolution influenced by genetic mutations, epigenetic changes, tumor microenvironment factors, and therapy-induced changes. This dynamic nature of tumors poses significant challenges for effective cancer treatment, and understanding it is essential for developing effective and personalized therapies. By uncovering the mechanisms that determine tumor heterogeneity, researchers can identify key genetic and epigenetic changes that contribute to tumor progression and resistance to treatment. This knowledge enables the development of innovative strategies for targeting specific tumor clones, minimizing the risk of recurrence and improving patient outcomes. To investigate the evolutionary dynamics of cancer, researchers employ a wide range of experimental and computational approaches. Traditional experimental methods involve genomic profiling techniques such as next-generation sequencing and fluorescence in situ hybridization. These techniques enable the identification of somatic mutations, copy number alterations, and structural rearrangements within cancer genomes. Furthermore, single-cell sequencing methods have emerged as powerful tools for dissecting intratumoral heterogeneity and tracing clonal evolution. In parallel, computational models and algorithms have been developed to simulate and analyze cancer evolution. These models integrate data from multiple sources to predict tumor growth patterns, identify driver mutations, and infer evolutionary trajectories. 
In this paper, we set out to describe the current approaches to address this evolutionary complexity and theories of its occurrence.
Currently, active research is focused on investigating the mechanisms that regulate the development of various pathologies and their evolutionary dynamics. Epigenetic mechanisms, such as DNA methylation, play a significant role in evolutionary processes, as their changes have a faster impact on the phenotype compared to mutagenesis. In this study, we attempted to develop an algorithm for identifying differentially methylated regions associated with metabolic syndrome, which have undergone methylation changes in humans during the transition from a hunter-gatherer to a sedentary lifestyle. The application of existing whole-genome bisulfite sequencing methods is limited for ancient samples due to their low quality and fragmentation, and the approach to obtaining DNA methylation profiles differs significantly between ancient hunter-gatherer samples and modern tissues. In this study, we validated DamMet, an algorithm for reconstructing ancient methylomes. Application of DamMet to Neanderthal and Denisovan genomes showed a moderate level of correlation with previously published methylation profiles and demonstrated an underestimation of methylation levels in the reconstructed profiles by an average of 15–20 %. Additionally, we developed a new Python-based algorithm that allows for the comparison of methylomes in ancient and modern samples, despite the absence of methylation profiles in modern bone tissue within the context of obesity. This analysis involves a two-step data processing approach, where the first step involves the identification and filtration of tissue-specific methylation regions, and the second step focuses on the direct search for differentially methylated regions in specific areas associated with the researcher’s target condition. By applying this algorithm to test data, we identified 38 differentially methylated regions associated with obesity, the majority of which were located in promoter regions. 
The pipeline demonstrated sufficient efficiency in detecting these regions. These results confirm the feasibility of reconstructing DNA methylation profiles in ancient samples and comparing them with modern methylomes. Furthermore, possibilities for further methodological development and the implementation of a new step for studying differentially methylated positions associated with evolutionary processes are discussed.
Genes encoding cell surface receptors make up a significant portion of the human genome (more than a thousand genes) and play an important role in gene networks. Cell surface receptors are transmembrane proteins that interact with molecules (ligands) located outside the cell. This interaction activates signal transduction pathways in the cell. A large number of exogenous ligands of various origins, including drugs, are known for cell surface receptors, which accounts for interest in them from biomedical researchers. Appetite (the desire of the animal organism to consume food) is one of the most primitive instincts that contribute to survival. However, when the supply of nutrients is stable, the mechanism of adaptation to adverse factors acquired in the course of evolution turned out to be excessive, and therefore obesity has become one of the most serious public health problems of the twenty-first century. Pathological human conditions characterized by appetite violations include both hyperphagia, which inevitably leads to obesity, and anorexia nervosa induced by psychosocial stimuli, as well as decreased appetite caused by neurodegeneration, inflammation or cancer. Understanding the evolutionary mechanisms of human diseases, especially those related to lifestyle changes that have occurred over the past 100–200 years, is of fundamental and applied importance. It is also very important to identify relationships between the evolutionary characteristics of genes in gene networks and the resistance of these networks to changes caused by mutations. The aim of the current study is to identify the distinctive features of human genes encoding cell surface receptors involved in appetite regulation using the phylostratigraphic age index (PAI) and divergence index (DI). The values of PAI and DI were analyzed for 64 human genes encoding cell surface receptors, the orthologs of which were involved in the regulation of appetite in model animal species. 
It turned out that the set of genes under consideration contains an increased number of genes with the same phylostratigraphic age (PAI = 5, the stage of vertebrate divergence), and almost all of these genes (28 out of 31) belong to the superfamily of G-protein coupled receptors. Apparently, the synchronized evolution of such a large group of genes (31 genes out of 64) is associated with the development of the brain as a separate organ in the first vertebrates. When studying the distribution of genes from the same set by DI values, a significant enrichment with genes having low DIs was revealed: eight genes (GPR26, NPY1R, GHSR, ADIPOR1, DRD1, NPY2R, GPR171, NPBWR1) had extremely low DIs (less than 0.05). Such low DI values indicate that these genes are most likely subject to stabilizing selection. It was also found that the group of genes with low DIs was enriched with genes that had brain-specific patterns of expression. In particular, GPR26, which had the lowest DI, is in the group of brain-specific genes. Because the endogenous ligand for the GPR26 receptor has not yet been identified, this gene seems to be an extremely interesting object for further theoretical and experimental research. We believe that the features of the genes encoding cell surface receptors we have identified using the evolutionary metrics PAI and DI can be a starting point for further evolutionary analysis of the gene network regulating appetite.
The coronavirus pandemic caused by the SARS-CoV-2 virus, which humanity resisted using the latest advances in science, left behind, among other things, extensive genetic data. Every day since the end of 2019, samples of the virus genomes have been collected around the world, which makes it possible to trace its evolution in detail from its emergence to the present. The accumulated statistics of testing results showed that the number of confirmed cases of SARS-CoV-2 infection was at least 767.5 million (9.5 % of the current world population, excluding asymptomatic people), and the number of sequenced virus genomes is more than 15.7 million (which is over 2 % of the total number of infected people). These new data potentially contain information about the mechanisms of the variability and spread of the virus, its interaction with the human immune system, the main parameters characterizing the mechanisms of the development of a pandemic, and much more. In this article, we analyze the space of possible variants of SARS-CoV-2 genetic sequences both from a mathematical point of view and taking into account the biological limitations inherent in this system, known both from general biological knowledge and from the consideration of the characteristics of this particular virus. We have developed software capable of loading and analyzing SARS-CoV-2 nucleotide sequences in FASTA format, determining the 5’ and 3’ UTR positions, the number and location of unidentified nucleotides (“N”), performing alignment with the reference sequence by calling the program designed for this, determining mutations, deletions and insertions, as well as calculating various characteristics of virus genomes with a given time step (days, weeks, months, etc.). The data obtained indicate that, despite the apparent mathematical diversity of possible options for changing the virus over time, the corridor of the evolutionary trajectory that the coronavirus has passed through seems to be quite narrow. 
Thus it can be assumed that it is determined to some extent, which allows us to hope for a possibility of modeling the evolution of the coronavirus.
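Two of the processing steps listed in the abstract above, loading FASTA records and locating unidentified nucleotides (“N”), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the software described in the study. The function names `read_fasta`, `n_runs` and `substitutions` are hypothetical, the example sequences are invented, and the substitution count assumes that the sequences have already been aligned to equal length by an external aligner.

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}; the id is the first word of the header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in records.items()}


def n_runs(seq):
    """Return (start, length) for each maximal run of 'N' (0-based start)."""
    runs = []
    start = None
    for i, base in enumerate(seq):
        if base == "N":
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(seq) - start))
    return runs


def substitutions(ref, query):
    """List (position, ref_base, query_base) for pre-aligned, equal-length sequences.

    Positions where either sequence has 'N' or a gap '-' are skipped.
    """
    if len(ref) != len(query):
        raise ValueError("sequences must be aligned to equal length")
    skipped = {"N", "-"}
    return [
        (i, r, q)
        for i, (r, q) in enumerate(zip(ref, query))
        if r != q and r not in skipped and q not in skipped
    ]


# Invented toy input, not real SARS-CoV-2 data
example = ">ref\nACGTAC\nGT\n>sample1 2020-03-01\nACNNACTT\n"
genomes = read_fasta(example)
print(n_runs(genomes["sample1"]))
print(substitutions(genomes["ref"], genomes["sample1"]))
```

Grouping the resulting per-genome counts by collection date (day, week or month) would then give the time-stepped characteristics mentioned in the abstract.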
DEEP LEARNING METHODS IN BIOINFORMATICS AND SYSTEMS BIOLOGY
The development of objective methods for assessing stress levels is an important task of applied neuroscience. Analysis of EEG recorded as part of a behavioral self-control program can serve as the basis for the development of test methods that allow classifying people by stress level. It is well known that participation in meditation practices leads to the development of skills of voluntary self-control over the individual’s mental state due to an increased concentration of attention to themselves. As a consequence of meditation practices, participants can reduce overall anxiety and stress levels. The aim of our study was to develop, train and test a convolutional neural network capable of classifying individuals into groups of practitioners and non-practitioners of meditation by analysis of event-related brain potentials recorded during a stop-signal paradigm. Four non-deep convolutional network architectures were developed, trained and tested on samples of 100 people (51 meditators and 49 non-meditators). Subsequently, all structures were additionally tested on an independent sample of 25 people. It was found that a structure using a one-dimensional convolutional layer, a pooling layer and a two-layer fully connected network showed the best performance in simulation tests. However, this model was often subject to overfitting due to the limited size of the data set. Overfitting was mitigated by changing the structure and scale of the model, the initialization of network parameters, regularization, random deactivation (dropout) and hyperparameter screening by cross-validation. The resulting model showed 82 % accuracy in classifying people into subgroups. The use of such models can be expected to be effective in assessing stress levels and inclination to anxiety and depression disorders in other groups of subjects.
The pigment composition of the plant seed coat affects important properties such as resistance to pathogens, pre-harvest sprouting, and mechanical hardness. The dark color of barley (Hordeum vulgare L.) grain can be attributed to the synthesis and accumulation of two groups of pigments. Blue and purple grain color is associated with the biosynthesis of anthocyanins. Gray and black grain color is caused by melanin. These pigments may accumulate in the grain shells both individually and together. Therefore, it is difficult to visually distinguish which pigments are responsible for the dark color of the grain. Chemical methods are used to accurately determine the presence/absence of pigments; however, they are expensive and labor-intensive. Therefore, the development of a new method for quickly assessing the presence of pigments in the grain would help in investigating the mechanisms of genetic control of the pigment composition of barley grains. In this work, we developed a method for assessing the presence or absence of anthocyanins and melanin in the barley grain shell based on digital image analysis using computer vision and machine learning algorithms. A protocol was developed to obtain digital RGB images of barley grains. Using this protocol, a total of 972 images were acquired for 108 barley accessions. The seed coats of these accessions may contain anthocyanins, melanins, or pigments of both types. Chemical methods were used to accurately determine the pigment content of the grains. Four models based on computer vision techniques and convolutional neural networks of different architectures were developed to predict grain pigment composition from images. The U-Net network model based on the EfficientNetB0 topology showed the best performance on the holdout set (the value of the “accuracy” parameter was 0.821).
ECOLOGICAL COMPUTATIONAL BIOLOGY
At the beginning of the paper, the level of necessary phenomenology of complex models is discussed. When working with complex systems, which of course include living organisms and ecological systems, it is necessary to use a phenomenological description. An illustration of the phenomenological approach is given, which captures the most significant general principles or patterns of interactions; the specific values of the parameters cannot be calculated from first principles, but are determined empirically. An appropriate interpretation is also chosen empirically and pragmatically. However, in order to simulate a wider range of situations, it becomes necessary to lower the level of phenomenology and switch to a more detailed description of the system, introducing interaction between selected elements of the system. The requirements for a system model combining ecological, metabolic and genetic levels of cell culture description are formulated. A mathematical model of quorum sensing dynamics during the growth of batch culture of luminescent bacteria at different concentrations of the nutrient substrate has been developed. The model contains four blocks describing ecological, energy, quorum and luminescent aspects of bacterial culture growth. The model demonstrated good agreement with the experimental data obtained. When analyzing the model, three oddities in the behavior of the culture were noted, which presumably can change the idea of some processes taking place during the development of a culture of luminescent bacteria. The results obtained suggest the presence of some additional control system for the luminescent reaction via the synthesis pathways of FMN·H2 or aliphatic aldehyde. In this case, the generalized description of the contribution of energy metabolism to luminescence only through ATP is too strong a simplification.
As a result of comparing the model dynamics with the experiment, a discrepancy arose between the concentration of the substrate (peptone) measured in the experiment and its effective influence on the bacterial population growth. This discrepancy seems to indicate that peptone is not the leading substrate, and that growth is limited by nutrients contained in the yeast extract, the concentration of which did not change in these experiments. The discrepancies noted between the expectations and the results of experimental data processing, together with the assumptions about the causes of these discrepancies, set the direction for further experimental and theoretical studies of quorum sensing mechanisms in a culture of luminescent bacteria.
The purpose of the study was to compare quantitative analysis methods used in the early stages of closed-loop system prototyping with modern data analysis approaches. As an example, a mathematical model of the stable coexistence of two microalgae in a mixed flow culture, proposed by Bolsunovsky and Degermendzhi in 1982, is considered. The model is built on the basis of a detailed theoretical description of the interaction between species and substrate (in this case, illumination). The ability to control the species ratio makes it possible to adjust the assimilation quotient (AQ), that is, the ratio of carbon dioxide absorbed to oxygen released. The problem of controlling the assimilation quotient of a life support system is still relevant; in modern works, microalgae are considered as promising oxygen generators. At the same time, modern works place emphasis on empirical modeling methods, in particular, on the analysis of big data, and do not go beyond the task of managing a monoculture of microalgae. In our work, we pay attention to three results that, in our opinion, successfully complement modern methods. Firstly, the model allows the use of results from experiments with monocultures. Secondly, the model predicts the transformation of data into a form convenient for further analysis, including for calculating AQ. Thirdly, the model allows us to guarantee the stability of the resulting approximation and further refine the solution by small corrections using empirical methods.
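The dependence of AQ on the species ratio can be illustrated with a simple mass balance. The sketch below is not the Bolsunovsky and Degermendzhi model; it only assumes that each species absorbs carbon dioxide and releases oxygen at fixed rates per unit biomass, so that the AQ of the mixture is the ratio of the biomass-weighted sums. All rate values and function names are hypothetical.

```python
def mixed_aq(x1, co2_1, o2_1, co2_2, o2_2):
    """AQ of a two-species culture in which species 1 makes up fraction x1.

    co2_i and o2_i are assumed per-biomass rates of CO2 uptake and O2 release.
    """
    co2 = x1 * co2_1 + (1.0 - x1) * co2_2
    o2 = x1 * o2_1 + (1.0 - x1) * o2_2
    return co2 / o2


def fraction_for_target_aq(target, co2_1, o2_1, co2_2, o2_2):
    """Fraction of species 1 that yields the target AQ (solves mixed_aq = target)."""
    a1 = co2_1 - target * o2_1  # CO2 surplus of species 1 relative to the target
    a2 = co2_2 - target * o2_2  # CO2 surplus of species 2 relative to the target
    if a1 == a2:
        raise ValueError("both species have the same offset; AQ cannot be tuned")
    x1 = a2 / (a2 - a1)
    if not 0.0 <= x1 <= 1.0:
        raise ValueError("target AQ lies outside the range of the two monocultures")
    return x1


# Hypothetical monoculture values: species 1 has AQ = 0.8, species 2 has AQ = 1.0.
x1 = fraction_for_target_aq(0.9, co2_1=0.8, o2_1=1.0, co2_2=1.0, o2_2=1.0)
print(round(x1, 6))  # 0.5
```

Under these assumptions, the reachable AQ values lie between the AQ values of the two monocultures, which is why monoculture measurements are sufficient as inputs to the calculation.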
The light emitted by a luminescent bacterium serves as a unique native channel of information regarding the intracellular processes within the individual cell. In the presence of highly sensitive equipment, it is possible to obtain the distribution of bacterial culture cells by the intensity of light emission, which correlates with the amount of luciferase in the cells. When growing on rich media, the luminescence intensity of individual cells of brightly luminous strains of the luminescent bacteria Photobacterium leiognathi and Ph. phosphoreum reaches 10^4–10^5 quanta/s. A signal of such intensity can be registered using sensitive photometric equipment. All experiments were carried out with bacterial clones (genetically homogeneous populations). The typical dynamics of the distributions of luminous bacterial cells with respect to the intensity of light emission at various stages of batch culture growth in a liquid medium were obtained. To describe experimental distributions, a phenomenological model that links the light of a bacterial cell with the history of events at the molecular level was constructed. The proposed phenomenological model with a minimum number of fitting parameters (1.5) provides a satisfactory description of the complex process of formation of cell distributions by luminescence intensity at different stages of bacterial culture growth. This may be an indication that the structure of the model describes some essential processes of the real system. Since in the process of division all cells go through the stage of release of all regulatory molecules from the DNA molecule, the resulting distributions can be attributed not only to luciferase, but also to other proteins of constitutive (and not only) synthesis.
COMPUTATIONAL PLANT BIOLOGY
To study the mechanisms of growth and development, it is necessary to analyze the dynamics of the tissue patterning regulators in time and space and to take into account their effect on the cellular dynamics within a tissue. Plant hormones are the main regulators of the cell dynamics in plant tissues; they form gradients and maxima and control molecular processes in a concentration-dependent manner. Here, we present DyCeModel, a software tool implemented in MATLAB for one-dimensional simulation of tissue with a dynamic cellular ensemble, where changes in hormone (or other active substance) concentration in the cells are described by ordinary differential equations (ODEs). We applied DyCeModel to simulate cell dynamics in plant meristems with different cellular structures and demonstrated that DyCeModel helps to identify the relationships between hormone concentration and cellular behaviors. The tool visualizes the simulation progress and presents a video obtained during the calculation. Importantly, the tool is capable of automatically adjusting the parameters by fitting the distribution of the substance concentrations predicted in the model to experimental data taken from microscopic images. Notably, DyCeModel makes it possible to build models for distinct types of plant meristems with the same ODEs, using specific input characteristics for each meristem. We demonstrate the tool’s efficiency by simulating the effect of auxin and cytokinin distributions on tissue patterning in two types of Arabidopsis thaliana stem cell niches: the root and shoot apical meristems. The resulting models represent a promising framework for further study of the role of hormone-controlled gene regulatory networks in cell dynamics.
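The idea of describing substance concentrations in a one-dimensional file of cells with ODEs can be sketched as follows. This is not DyCeModel code (the tool itself is implemented in MATLAB) and it leaves out cell growth and division; it assumes a hypothetical model in which one cell synthesizes a substance that is exchanged between neighboring cells and degraded everywhere, integrated with the explicit Euler method. All parameter values and names are illustrative.

```python
def euler_step(conc, source, exchange, decay, dt):
    """One explicit Euler step for a 1-D file of cells with no-flux ends.

    For cell i: dc/dt = source[i] + exchange * (left - 2*c + right) - decay * c
    """
    n = len(conc)
    updated = []
    for i in range(n):
        left = conc[i - 1] if i > 0 else conc[i]       # no flux at the first cell
        right = conc[i + 1] if i < n - 1 else conc[i]  # no flux at the last cell
        dcdt = source[i] + exchange * (left - 2.0 * conc[i] + right) - decay * conc[i]
        updated.append(conc[i] + dt * dcdt)
    return updated


def simulate(n_cells=10, steps=5000, dt=0.01, exchange=1.0, decay=0.1):
    """Integrate from zero concentration; only cell 0 synthesizes the substance."""
    conc = [0.0] * n_cells
    source = [1.0] + [0.0] * (n_cells - 1)
    for _ in range(steps):
        conc = euler_step(conc, source, exchange, decay, dt)
    return conc


profile = simulate()
print([round(c, 2) for c in profile])  # concentration falls with distance from cell 0
```

With these settings the concentration forms a gradient with its maximum at the producing cell, the kind of pattern that could then be compared with concentrations measured from microscopic images.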
INDUSTRIAL BIOINFORMATICS
Modern investigations in biology often require the efforts of one or more groups of researchers. Often these are groups of specialists from various scientific fields who generate and share data of different formats and sizes. Without modern approaches to work automation and data versioning (where data from different collaborators are stored at different points in time), teamwork quickly devolves into unmanageable confusion. In this review, we present a number of information systems designed to solve these problems. Their application to the organization of scientific activity helps to manage the flow of actions and data, allowing all participants to work with relevant information and solving the issue of reproducibility of both experimental and computational results. The article describes methods for organizing data flows within a team and principles for organizing metadata and ontologies. The information systems Trello, Git, Redmine, SEEK, OpenBIS and Galaxy are considered. Their functionality and scope of use are described. Before using any tools, it is important to understand the purpose of implementation, to define the set of tasks they should solve, and, based on this, to formulate requirements and finally to monitor the application of recommendations in the field. The tasks of creating a framework of ontologies, metadata, data warehousing schemas and software systems are key for a team that has decided to undertake work to automate data circulation. It is not always possible to implement such systems in their entirety, but one should still strive to do so through a step-by-step introduction of principles for organizing data and tasks with the mastery of individual software tools. It is worth noting that Trello, Git, and Redmine are easier to use, customize, and support for small research groups. At the same time, SEEK, OpenBIS, and Galaxy are more specific and their use is advisable if the capabilities of simple systems are no longer sufficient.
MOLECULAR AND CELL BIOLOGY
CHO cells are most commonly used for the synthesis of recombinant proteins in biopharmaceutical production. When stable producer cell lines are obtained, the locus of transgene integration into the genome has a great influence on the level of its expression. Therefore, the identification of genomic loci ensuring a high level of protein production is very important. Here, we used the TRIP assay to study the influence of the local chromatin environment on the activity of transgenes in CHO cells. For this purpose, reporter constructs encoding eGFP under the control of four promoters were stably integrated into the genome of CHO cells using the piggyBac transposon. Each individual transgene contained a unique tag, a DNA barcode, and the resulting polyclonal cell population was cultured for almost a month without any selection. Next, using high-throughput sequencing, the genomic localizations of the barcodes, as well as their abundances in the population and transcriptional activities, were identified. In total, ~640 transgenes more or less evenly distributed across all chromosomes of CHO cells were characterized. More than half of the transgenes were completely silent. The most active transgenes were found to be inserted in gene promoters and 5’ UTRs. Transgenes carrying the full-length Chinese hamster EF-1α gene promoter showed the highest activity. Transgenes with a truncated version of the same promoter and with the mouse PGK gene promoter were on average 10 and 19 times less active, respectively. In total, combinations of genomic loci of CHO cells and transgene promoters that together provide different levels of transcriptional activity of the model reporter construct were described.
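The step from barcode read counts to per-transgene activity can be sketched as follows. The abstract does not state the normalization that was used; the sketch assumes that the activity of each barcode is its share of reads in cDNA divided by its share of reads in genomic DNA, so that the expression signal is corrected for the abundance of the clone in the population. The barcode sequences, counts and function name are hypothetical.

```python
def barcode_activity(cdna_counts, gdna_counts):
    """Relative transcriptional activity per barcode.

    activity = (cDNA reads / total cDNA reads) / (gDNA reads / total gDNA reads)
    Barcodes absent from cDNA get activity 0.0 (silent transgenes).
    """
    total_cdna = sum(cdna_counts.values())
    total_gdna = sum(gdna_counts.values())
    if total_cdna == 0 or total_gdna == 0:
        raise ValueError("both libraries must contain at least one read")
    activity = {}
    for barcode, gdna in gdna_counts.items():
        if gdna == 0:
            continue  # abundance unknown, so activity cannot be normalized
        cdna = cdna_counts.get(barcode, 0)
        activity[barcode] = (cdna / total_cdna) / (gdna / total_gdna)
    return activity


# Hypothetical read counts for three barcoded transgenes.
gdna = {"AACGT": 100, "GTTCA": 100, "CCATG": 200}
cdna = {"AACGT": 300, "CCATG": 100}

activity = barcode_activity(cdna, gdna)
silent = sorted(b for b, a in activity.items() if a == 0.0)
print(activity)  # {'AACGT': 3.0, 'GTTCA': 0.0, 'CCATG': 0.5}
print(silent)    # ['GTTCA']
```

Dividing by the genomic DNA share matters because clones expand unevenly during culture: a barcode with many cDNA reads may simply belong to an abundant clone rather than to a highly active integration site.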