
Vavilov Journal of Genetics and Breeding


Alembic: a framework for converting disparate biological data into structured resources

https://doi.org/10.18699/vjgb-26-33

Abstract

Re-analysis of existing public sequencing data is central to modern biology, driven by new hypotheses and advanced analytical methods. However, this effort is severely hampered by the heterogeneity of repository data, particularly the non-standardized, free-text descriptions of biological experiments. This lack of structural and semantic homogeneity prevents systematic search, integration, and comparative analysis, leaving much of the potential of accumulated datasets unused. Advances in Natural Language Processing (NLP) offer a way past this bottleneck by transforming unstructured text into computable, homogeneous information. The integrated Entrez database system, maintained by the National Center for Biotechnology Information (NCBI), provides programmatic access via an API to primary sequencing data and its associated metadata, including detailed experimental descriptions. This interface enables researchers to identify and retrieve relevant data through keyword searches, including those based on gene names, and to apply modern NLP techniques to transform textual metadata into structured information. The output is formatted data ready for integration into local databases, accompanied by a systematic list of links for downloading primary files. The Alembic software package offers a comprehensive, automated solution for this entire workflow. Designed as a locally deployable client-server system, Alembic incorporates state-of-the-art transformer-based AI algorithms for analyzing the biomedical text that accompanies sequencing data. Its core uses the openly available AIONER platform, which is built upon the PubMedBERT model trained on the PubMed repository, to ensure efficient and accurate recognition of biomedical named entities (e.g., genes, diseases). This provides users with structured and meaningful keyword search results.
By delivering a curated list of datasets, Alembic streamlines the path from search to analysis. Researchers can efficiently identify high-value targets and obtain a complete package of metadata and primary data to construct a tailored local repository. This positions Alembic as a universal solution that overcomes the fragmented approach of existing tools, offering an integrated workflow for diverse public sequencing data.
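The keyword-search step described above can be illustrated with a minimal sketch against the NCBI E-utilities ESearch endpoint. This is not Alembic's code: the helper names `build_esearch_url` and `extract_ids`, the choice of the `sra` database, the example query, and the sample response are all assumptions made for illustration. The sketch only builds the request URL and parses a response body, so it runs without network access.

```python
import json
from urllib.parse import urlencode

# Public base URL of the NCBI Entrez Programming Utilities (E-utilities).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="sra", retmax=20):
    """Build an ESearch URL for a keyword query against one Entrez database.

    The default database and result limit are illustrative assumptions.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def extract_ids(esearch_json):
    """Return the list of record UIDs from an ESearch JSON response body."""
    result = json.loads(esearch_json)["esearchresult"]
    return result.get("idlist", [])


# Hypothetical response body, trimmed to the fields used here; the UIDs are made up.
SAMPLE_RESPONSE = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'

if __name__ == "__main__":
    print(build_esearch_url("TP53 AND RNA-Seq"))
    print(extract_ids(SAMPLE_RESPONSE))
```

In a real pipeline the returned UIDs would be passed to a follow-up request (for example ESummary or EFetch) to retrieve the free-text metadata that a named entity recognizer then processes.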

About the Authors

I. V. Bezdvornykh
Institute for Translational Biomedicine, Saint Petersburg State University
St. Petersburg, Russian Federation

K. I. Yuditskiy
Institute for Translational Biomedicine, Saint Petersburg State University
St. Petersburg, Russian Federation

N. A. Cherkasov
Institute for Translational Biomedicine, Saint Petersburg State University
St. Petersburg, Russian Federation

A. A. Samsonova
Institute for Translational Biomedicine, Saint Petersburg State University
St. Petersburg, Russian Federation

A. A. Kanapin
Institute for Translational Biomedicine, Saint Petersburg State University
St. Petersburg, Russian Federation

References

1. Aronson A.R., Lang F.M. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010;17(3):229-236. doi 10.1136/jamia.2009.002733

2. Chao H., Li Z., Chen D., Chen M. iSeq: an integrated tool to fetch public sequencing data. Bioinformatics. 2024;40(11):btae641. doi 10.1093/bioinformatics/btae641

3. Chin W.L., Lassmann T. SampleExplorer: using language models to discover relevant transcriptome data. Bioinformatics. 2024;41(1):btae759. doi 10.1093/bioinformatics/btae759

4. Devlin J., Chang M.W., Lee K., Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. 2019. doi 10.48550/arXiv.1810.04805

5. Lee J., Yoon W., Kim S., Kim D., Kim S., So C.H., Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. doi 10.1093/bioinformatics/btz682

6. Luo L., Wei C.-H., Lai P.-T., Leaman R., Chen Q., Lu Z. AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning. Bioinformatics. 2023;39(5):btad310. doi 10.1093/bioinformatics/btad310

7. Neumann M., King D., Beltagy I., Ammar W. ScispaCy: fast and robust models for biomedical natural language processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task. Association for Computational Linguistics, 2019;319-327. doi 10.18653/v1/W19-5034

8. Sayers E. The E-utilities in-depth: parameters, syntax and more. In: Entrez Programming Utilities Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US), 2022. Available at: https://www.ncbi.nlm.nih.gov/books/NBK25499/. Accessed: Jul. 30, 2025

9. Wang X., Zhang Y., Ren X., Zhang Y., Zitnik M., Shang J., Langlotz C., Han J. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics. 2019;35(10):1745-1752. doi 10.1093/bioinformatics/bty869




Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2500-3259 (Online)