Linking hierarchical classification of transcription factors by the structure of their DNA-binding domains to the variability of their binding site motifs
https://doi.org/10.18699/vjgb-25-99
Abstract
De novo motif search is the main approach for determining the nucleotide binding specificity of transcription factors (TFs), the key regulators of gene transcription, from genome-wide in vivo sequencing data on their binding site regions, such as ChIP-seq. The number of known TF binding site (TFBS) motifs has increased severalfold in recent years. Because structurally related TFs have similar DNA-binding domains, many of them have similar and sometimes almost indistinguishable binding site motifs. The TFClass database classifies TFs by the structure of their DNA-binding domains: the top levels of the hierarchy (superclasses and classes of TFs) are defined by domain structure, and the lower levels (families and subfamilies of TFs) by alignments of the amino acid sequences of the domains. However, this classification does not take into account the similarity of TFBS motifs, whereas identifying the TFs actually involved from massive TFBS sequencing data, such as ChIP-seq, requires working with TFBS motifs rather than with the TFs themselves. Therefore, in this study we extracted TFBS motifs for human and the fruit fly Drosophila melanogaster from the HOCOMOCO and JASPAR databases and examined the pairwise similarity of the binding site motifs of related TFs according to their classification in TFClass. We show that the common tree of the TF hierarchy by the structure of DNA-binding domains can be split into separate branches representing non-overlapping sets of TFs. Within each branch, the majority of TF pairs have significantly similar binding site motifs. Each branch can include one or more sister elementary units of the hierarchy together with all of their lower levels: one or more TFs of the same subfamily, a whole subfamily, one or several subfamilies of the same family, an entire family, and so on, up to an entire class.
Analysis of the seven largest human and two largest Drosophila TF classes showed that the similarity of TFs in terms of TFBS motifs differs noticeably between corresponding levels of the hierarchy (classes, families). Supplementing the hierarchical classification of TFs with branches that combine significantly similar TFBS motifs can make it more efficient to identify the TFs involved from the enriched motifs that de novo motif search detects in massive TFBS sequencing data from ChIP-seq.
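The idea of splitting the hierarchy into branches in which most TF pairs have similar motifs can be illustrated with a minimal sketch. This is not the procedure used in the study: it assumes that a set of significantly similar TF pairs has already been computed by some motif comparison tool, it uses a simple top-down rule with a hypothetical threshold of 0.5 standing in for "the majority of pairs", and it does not merge sister units into a shared branch. The TF names, hierarchy paths and the function names `similar_fraction` and `split_into_branches` are all invented for illustration.

```python
from itertools import combinations


def similar_fraction(tfs, similar_pairs):
    """Fraction of unordered TF pairs flagged as significantly similar."""
    pairs = list(combinations(sorted(tfs), 2))
    if not pairs:
        return 1.0  # a single TF is trivially a coherent branch
    hits = sum(1 for pair in pairs if frozenset(pair) in similar_pairs)
    return hits / len(pairs)


def split_into_branches(tf_paths, similar_pairs, prefix=(), threshold=0.5):
    """Top-down split of a hierarchy into branches of mostly similar TFs.

    tf_paths maps a TF name to its path in the hierarchy, e.g. (class,
    family, subfamily). A node is kept whole when more than `threshold`
    of its TF pairs are similar; otherwise its children are examined.
    """
    depth = len(prefix)
    members = sorted(tf for tf, path in tf_paths.items()
                     if path[:depth] == prefix)
    if not members:
        return []
    if similar_fraction(members, similar_pairs) > threshold:
        return [(prefix, members)]
    branches = []
    # TFs whose path ends at this node cannot be split further.
    for tf in members:
        if len(tf_paths[tf]) == depth:
            branches.append((prefix, [tf]))
    children = sorted({tf_paths[tf][:depth + 1] for tf in members
                       if len(tf_paths[tf]) > depth})
    for child in children:
        branches.extend(
            split_into_branches(tf_paths, similar_pairs, child, threshold))
    return branches


# Toy input (hypothetical): four TFs in one class, two families.
tf_paths = {
    "TF_A": (1, 1, 1),
    "TF_B": (1, 1, 1),
    "TF_C": (1, 1, 2),
    "TF_D": (1, 2, 1),
}
similar_pairs = {
    frozenset(("TF_A", "TF_B")),
    frozenset(("TF_A", "TF_C")),
    frozenset(("TF_B", "TF_C")),
}

branches = split_into_branches(tf_paths, similar_pairs)
for node, tfs in branches:
    print(node, tfs)
```

In this toy input the class as a whole does not pass the threshold, because only three of its six TF pairs are similar, so the split descends to the family level: family (1, 1) forms one branch containing TF_A, TF_B and TF_C, and TF_D forms a branch of its own.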
About the Authors
V. G. Levitsky
Russian Federation
Novosibirsk
T. Yu. Vatolina
Russian Federation
Novosibirsk
V. V. Raditsa
Russian Federation
Novosibirsk