<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">vavilov</journal-id><journal-title-group><journal-title xml:lang="ru">Вавиловский журнал генетики и селекции</journal-title><trans-title-group xml:lang="en"><trans-title>Vavilov Journal of Genetics and Breeding</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">2500-3259</issn><publisher><publisher-name>Institute of Cytology and Genetics of Siberian Branch of the RAS</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.18699/vjgb-24-92</article-id><article-id custom-type="elpub" pub-id-type="custom">vavilov-4406</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>ЭВОЛЮЦИОННАЯ БИОЛОГИЯ</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>EVOLUTIONARY BIOLOGY</subject></subj-group></article-categories><title-group><article-title>Новый подход к анализу эволюции SARS-CoV-2, основанный на визуализации и кластеризации больших объемов генетических данных, компактно представленных в оперативной памяти</article-title><trans-title-group xml:lang="en"><trans-title>A novel approach to analyzing the evolution of SARS-CoV-2 based on visualization and clustering of large genetic data compactly represented in operative memory</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-1108-1486</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Пальянов</surname><given-names>А. 
Ю.</given-names></name><name name-style="western" xml:lang="en"><surname>Palyanov</surname><given-names>A. Yu.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Новосибирск</p></bio><bio xml:lang="en"><p>Novosibirsk</p></bio><email xlink:type="simple">palyanov@iis.nsk.su</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1783-5798</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Пальянова</surname><given-names>Н. В.</given-names></name><name name-style="western" xml:lang="en"><surname>Palyanova</surname><given-names>N. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Новосибирск</p></bio><bio xml:lang="en"><p>Novosibirsk</p></bio><xref ref-type="aff" rid="aff-2"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru">Институт систем информатики им. А.П. Ершова Сибирского отделения Российской академии наук, Новосибирск;
Научно-исследовательский институт вирусологии, Федеральный исследовательский центр фундаментальной и трансляционной медицины;
Новосибирский национальный исследовательский государственный университет<country>Россия</country></aff><aff xml:lang="en">A.P. Ershov Institute of Informatics Systems of the Siberian Branch of the Russian Academy of Sciences;
Research Institute of Virology, Federal Research Center of Fundamental and Translational Medicine;
Novosibirsk State University<country>Russian Federation</country></aff></aff-alternatives><aff-alternatives id="aff-2"><aff xml:lang="ru">Научно-исследовательский институт вирусологии, Федеральный исследовательский центр фундаментальной и трансляционной медицины<country>Россия</country></aff><aff xml:lang="en">Research Institute of Virology, Federal Research Center of Fundamental and Translational Medicine<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2024</year></pub-date><pub-date pub-type="epub"><day>25</day><month>01</month><year>2025</year></pub-date><volume>28</volume><issue>8</issue><fpage>843</fpage><lpage>853</lpage><permissions><copyright-statement>Copyright &#x00A9; Пальянов А.Ю., Пальянова Н.В., 2025</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="ru">Пальянов А.Ю., Пальянова Н.В.</copyright-holder><copyright-holder xml:lang="en">Palyanov A.Y., Palyanova N.V.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://vavilov.elpub.ru/jour/article/view/4406">https://vavilov.elpub.ru/jour/article/view/4406</self-uri><abstract><p>Коронавирус SARS-CoV-2 – это вирус, для которого было собрано, секвенировано и сохранено рекордное количество вариантов генома из источников по всему миру. Нуклеотидные последовательности в формате FASTA включают 16.8 млн геномов, каждый длиной ≈29 900 нт (нуклеотидов), общим размером ≈500 ∙ 10<sup>9</sup> нт, или 466 Гб. Мы предлагаем способ представления данных, позволяющий разместить без потерь всю эту информацию в оперативной памяти (RAM) обычного персонального компьютера. Более того, будет достаточно всего ≈330 Мб. 
Выравнивание их всех относительно исходной референсной последовательности Wuhan-Hu-1 позволяет представить каждый геном как структуру данных, содержащую списки точечных мутаций, делеций и вставок. Наша реализация такого представления данных привела к коэффициенту сжатия 1:1500 (для сравнения, упаковка данных с помощью популярного архиватора WinRAR дает степень сжатия только 1:62) и обеспечила возможность быстрого вычисления редакционного расстояния между различными вариантами генома. С помощью этого подхода, реализованного в виде программы на C++, мы провели анализ различных свойств набора геномов SARS-CoV-2, содержащихся в NCBI Genbank, собранных за 4.5 года (с 24.12.2019 по 24.06.2024). Были рассчитаны распределение числа геномов от числа неопределенных нуклеотидов “N” в них, число уникальных геномов и кластеров из идентичных геномов, а также распределение кластеров по размеру (числу идентичных геномов) и продолжительности (длине временного интервала между первым и последним геномом каждого кластера). Наконец, эволюция распределений числа изменений (редакционное расстояние между каждым геномом и референсной последовательностью), вызванных заменами, делециями и вставками, была визуализирована в виде 3D поверхностей, наглядно изображающих процесс вирусной эволюции в течение 4.5 лет, с интервалом в одну неделю. Такая визуализация хорошо соотносится с филогенетическими деревьями (обычно рассчитываемыми по 3–4 тыс. представителей вариантов генома), но строится на основе миллионов геномов, отображает больше деталей и не зависит от типа классификации линий/клад.</p></abstract><trans-abstract xml:lang="en"><p>SARS-CoV-2 is the virus for which a record number of genome variants have been collected, sequenced and stored from sources around the world. The nucleotide sequences in FASTA format comprise 16.8 million genomes, each ≈29,900 nt (nucleotides) long, with a total size of ≈500 ∙ 10<sup>9</sup> nt, or 465 GB. 
We propose a data representation that allows all of this information to be stored losslessly in the RAM of an ordinary personal computer; in fact, ≈330 MB is enough. Aligning all genomes against the original Wuhan-Hu-1 reference sequence allows each genome to be represented as a data structure containing lists of point mutations, deletions and insertions. Our implementation of this representation achieved a 1:1500 compression ratio (for comparison, compressing the same data with the popular WinRAR archiver gives only 1:62) and provides fast access to genomes (and their metadata) and fast computation of the edit distance between genome variants. Using this approach, implemented as a C++ program, we analyzed various properties of the set of SARS-CoV-2 genomes available in NCBI GenBank, collected over 4.5 years (from 24.12.2019 to 24.06.2024). We calculated the distribution of the number of genomes by the number of undetermined nucleotides (‘N’) they contain, the number of unique genomes and of clusters of identical genomes, and the distribution of clusters by size (the number of identical genomes) and duration (the time interval between each cluster’s first and last genome). Finally, the evolution of the distributions of the number of changes (the edit distance between each genome and the reference sequence) caused by substitutions, deletions and insertions was visualized as 3D surfaces, which clearly show the process of viral evolution over 4.5 years with a time step of one week. 
This visualization agrees well with phylogenetic trees (usually built from 3–4 thousand representative genome variants), but is built from millions of genomes, shows more detail and does not depend on the type of lineage/clade classification.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>коронавирус</kwd><kwd>SARS-CoV-2</kwd><kwd>геном</kwd><kwd>варианты</kwd><kwd>эволюция</kwd><kwd>программная система</kwd><kwd>большие данные</kwd><kwd>компактизация</kwd><kwd>анализ</kwd><kwd>визуализация</kwd></kwd-group><kwd-group xml:lang="en"><kwd>coronavirus</kwd><kwd>SARS-CoV-2</kwd><kwd>genome</kwd><kwd>variants</kwd><kwd>evolution</kwd><kwd>software system</kwd><kwd>big data</kwd><kwd>compact representation of data</kwd><kwd>analysis</kwd><kwd>visualization</kwd></kwd-group><funding-group xml:lang="en"><funding-statement>This research was funded by RSF, grant number 23-64-00005.</funding-statement></funding-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Aksamentov I., Roemer C., Hodcroft B., Neher R.A. Nextclade: clade assignment, mutation calling and quality control for viral genomes. J. Open Source Software. 2021;6(67):3773. doi 10.21105/joss.03773</mixed-citation><mixed-citation xml:lang="en">Aksamentov I., Roemer C., Hodcroft B., Neher R.A. Nextclade: clade assignment, mutation calling and quality control for viral genomes. J. Open Source Software. 2021;6(67):3773. doi 10.21105/joss.03773</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Amicone M., Borges V., Alves M.J., Isidro J., Zé-Zé L., Duarte S., Vieira L., Guiomar R., Gomes J.P., Gordo I. Mutation rate of SARS-CoV-2 and emergence of mutators during experimental evolution. Evol. Med. Public Health. 2022;10(1):142-155. 
doi 10.1093/emph/eoac010</mixed-citation><mixed-citation xml:lang="en">Amicone M., Borges V., Alves M.J., Isidro J., Zé-Zé L., Duarte S., Vieira L., Guiomar R., Gomes J.P., Gordo I. Mutation rate of SARS-CoV-2 and emergence of mutators during experimental evolution. Evol. Med. Public Health. 2022;10(1):142-155. doi 10.1093/emph/eoac010</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Bai C., Zhong Q., Gao G.F. Overview of SARS-CoV-2 genome-encoded proteins. Sci. China Life Sci. 2022;65(2):280-294. doi 10.1007/s11427-021-1964-4</mixed-citation><mixed-citation xml:lang="en">Bai C., Zhong Q., Gao G.F. Overview of SARS-CoV-2 genome-encoded proteins. Sci. China Life Sci. 2022;65(2):280-294. doi 10.1007/s11427-021-1964-4</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Bolze A., Basler T., White S., Rossi A.D., Wyman D., Dai H., Roychoudhury P., Greninger A.L., Hayashibara K., Beatty M., Shah S., Stous S., McCrone J.T., Kil E., Cassens T., Tsan K., Nguyen J., Ramirez J., Carter S., Cirulli E.T., Barrett K.S., Washington N.L., Belda-Ferre P., Jacobs S., Sandoval E., Becker D., Lu J.T., Isaksson M., Lee W., Luo S. Evidence for SARS-CoV-2 Delta and Omicron co-infections and recombination. Med. 2022;3(12):848-859. doi 10.1016/j.medj.2022.10.002</mixed-citation><mixed-citation xml:lang="en">Bolze A., Basler T., White S., Rossi A.D., Wyman D., Dai H., Roychoudhury P., Greninger A.L., Hayashibara K., Beatty M., Shah S., Stous S., McCrone J.T., Kil E., Cassens T., Tsan K., Nguyen J., Ramirez J., Carter S., Cirulli E.T., Barrett K.S., Washington N.L., Belda-Ferre P., Jacobs S., Sandoval E., Becker D., Lu J.T., Isaksson M., Lee W., Luo S. Evidence for SARS-CoV-2 Delta and Omicron co-infections and recombination. Med. 2022;3(12):848-859. 
doi 10.1016/j.medj.2022.10.002</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Campagnola G., Govindarajan V., Pelletier A., Canard B., Peersen O.B. The SARS-CoV-2 nsp12 polymerase active site is tuned for large-genome replication. J. Virol. 2022;96(16):e0067122. doi 10.1128/jvi.00671-22</mixed-citation><mixed-citation xml:lang="en">Campagnola G., Govindarajan V., Pelletier A., Canard B., Peersen O.B. The SARS-CoV-2 nsp12 polymerase active site is tuned for large-genome replication. J. Virol. 2022;96(16):e0067122. doi 10.1128/jvi.00671-22</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Cui X., Wang Y., Zhai J., Xue M., Zheng C., Yu L. Future trajectory of SARS-CoV-2: Constant spillover back and forth between humans and animals. Virus Res. 2023;328:199075. doi 10.1016/j.virusres.2023.199075</mixed-citation><mixed-citation xml:lang="en">Cui X., Wang Y., Zhai J., Xue M., Zheng C., Yu L. Future trajectory of SARS-CoV-2: Constant spillover back and forth between humans and animals. Virus Res. 2023;328:199075. doi 10.1016/j.virusres.2023.199075</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Palyanov A.Yu., Palyanova N.V. On the space of SARS-CoV-2 genetic sequence variants. Vavilovskii Zhurnal Genetiki i Selektsii = Vavilov Journal of Genetics and Breeding. 2023;27(7):839-850. doi 10.18699/VJGB-23-97</mixed-citation><mixed-citation xml:lang="en">Palyanov A.Yu., Palyanova N.V. On the space of SARS-CoV-2 genetic sequence variants. Vavilovskii Zhurnal Genetiki i Selektsii = Vavilov Journal of Genetics and Breeding. 2023;27(7):839-850. 
doi 10.18699/VJGB-23-97</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Palyanova N.V., Sobolev I.A., Alekseev A., Glushenko A., Kazachkova E., Markhaev A., Kononova Y., Gulyaeva M., Adamenko L., Kurskaya O., Bi Y., Xin Y., Sharshov K., Shestopalov A. Genomic and epidemiological features of COVID-19 in the Novosibirsk region during the beginning of the pandemic. Viruses. 2022;14(9):2036. doi 10.3390/v14092036</mixed-citation><mixed-citation xml:lang="en">Palyanova N.V., Sobolev I.A., Alekseev A., Glushenko A., Kazachkova E., Markhaev A., Kononova Y., Gulyaeva M., Adamenko L., Kurskaya O., Bi Y., Xin Y., Sharshov K., Shestopalov A. Genomic and epidemiological features of COVID-19 in the Novosibirsk region during the beginning of the pandemic. Viruses. 2022;14(9):2036. doi 10.3390/v14092036</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Palyanova N.V., Sobolev I.A., Palyanov A.Yu., Kurskaya O.G., Komissarov A.B., Danilenko D.M., Fadeev A.V., Shestopalov A.M. The development of the SARS-CoV-2 epidemic in different regions of Siberia in the 2020–2022 period. Viruses. 2023;15(10):2014. doi 10.3390/v15102014</mixed-citation><mixed-citation xml:lang="en">Palyanova N.V., Sobolev I.A., Palyanov A.Yu., Kurskaya O.G., Komissarov A.B., Danilenko D.M., Fadeev A.V., Shestopalov A.M. The development of the SARS-CoV-2 epidemic in different regions of Siberia in the 2020–2022 period. Viruses. 2023;15(10):2014. doi 10.3390/v15102014</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Sanjuán R., Domingo-Calap P. Mechanisms of viral mutation. Cell. Mol. Life Sci. 2016;73(23):4433-4448. doi 10.1007/s00018-016-2299-6</mixed-citation><mixed-citation xml:lang="en">Sanjuán R., Domingo-Calap P. Mechanisms of viral mutation. Cell. Mol. Life Sci. 
2016;73(23):4433-4448. doi 10.1007/s00018-016-2299-6</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Simon-Loriere E., Holmes E.C. Why do RNA viruses recombine? Nat. Rev. Microbiol. 2011;9(8):617-626. doi 10.1038/nrmicro2614</mixed-citation><mixed-citation xml:lang="en">Simon-Loriere E., Holmes E.C. Why do RNA viruses recombine? Nat. Rev. Microbiol. 2011;9(8):617-626. doi 10.1038/nrmicro2614</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Sonnleitner S.T., Prelog M., Sonnleitner S., Hinterbichler E., Halbfurter H., Kopecky D.B.C., Almanzar G., Koblmüller S., Sturmbauer C., Feist L., Horres R., Posch W., Walde G. Cumulative SARS-CoV-2 mutations and corresponding changes in immunity in an immunocompromised patient indicate viral evolution within the host. Nat. Commun. 2022;13(1):2560. doi 10.1038/s41467-022-30163-4</mixed-citation><mixed-citation xml:lang="en">Sonnleitner S.T., Prelog M., Sonnleitner S., Hinterbichler E., Halbfurter H., Kopecky D.B.C., Almanzar G., Koblmüller S., Sturmbauer C., Feist L., Horres R., Posch W., Walde G. Cumulative SARS-CoV-2 mutations and corresponding changes in immunity in an immunocompromised patient indicate viral evolution within the host. Nat. Commun. 2022;13(1):2560. doi 10.1038/s41467-022-30163-4</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Temmam S., Vongphayloth K., Baquero E., Munier S., Bonomi M., Regnault B., Douangboubpha B., Karami Y., Chrétien D., Sanamxay D., Xayaphet V., Paphaphanh P., Lacoste V., Somlor S., Lakeomany K., Phommavanh N., Pérot P., Dehan O., Amara F., Donati F., Bigot T., Nilges M., Rey F.A., van der Werf S., Brey P.T., Eloit M. Bat coronaviruses related to SARS-CoV-2 and infectious for human cells. Nature. 2022;604(7905):330-336. 
doi 10.1038/s41586-022-04532-4</mixed-citation><mixed-citation xml:lang="en">Temmam S., Vongphayloth K., Baquero E., Munier S., Bonomi M., Regnault B., Douangboubpha B., Karami Y., Chrétien D., Sanamxay D., Xayaphet V., Paphaphanh P., Lacoste V., Somlor S., Lakeomany K., Phommavanh N., Pérot P., Dehan O., Amara F., Donati F., Bigot T., Nilges M., Rey F.A., van der Werf S., Brey P.T., Eloit M. Bat coronaviruses related to SARS-CoV-2 and infectious for human cells. Nature. 2022;604(7905):330-336. doi 10.1038/s41586-022-04532-4</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.-G., Hu Y., Tao Z.-W., Tian J.-H., Pei Y.-Y., Yuan M.-L., Zhang Y.-L., Dai F.-H., Liu Y., Wang Q.-M., Zheng J.-J., Xu L., Holmes E.C., Zhang Y.-Z. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579(7798):265-269. doi 10.1038/s41586-020-2008-3</mixed-citation><mixed-citation xml:lang="en">Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.-G., Hu Y., Tao Z.-W., Tian J.-H., Pei Y.-Y., Yuan M.-L., Zhang Y.-L., Dai F.-H., Liu Y., Wang Q.-M., Zheng J.-J., Xu L., Holmes E.C., Zhang Y.-Z. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579(7798):265-269. doi 10.1038/s41586-020-2008-3</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Zhou P., Yang X.L., Wang X.G., Hu B., Zhang L., Zhang W., Si H.-R., Zhu Y., Li B., Huang C.-L., Chen H.-D., Chen J., Luo Y., Guo H., Jiang R.-D., Liu M.-Q., Chen Y., Shen X.-R., Wang X., Zheng X.-S., Zhao K., Chen Q.-J., Deng F., Liu L.-L., Shi Z.-L. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270-273. 
doi 10.1038/s41586-020-2012-7</mixed-citation><mixed-citation xml:lang="en">Zhou P., Yang X.L., Wang X.G., Hu B., Zhang L., Zhang W., Si H.-R., Zhu Y., Li B., Huang C.-L., Chen H.-D., Chen J., Luo Y., Guo H., Jiang R.-D., Liu M.-Q., Chen Y., Shen X.-R., Wang X., Zheng X.-S., Zhao K., Chen Q.-J., Deng F., Liu L.-L., Shi Z.-L. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270-273. doi 10.1038/s41586-020-2012-7</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that they have no conflicts of interest.</p></fn></fn-group></back></article>
