Diversity of mariner -like elements in Orthoptera Разнообразие mariner -подобных элементов Orthoptera

Mariner -like elements (MLEs) are among the most widespread DNA transposable elements in eukaryotes. Insects were the first organisms in which MLEs were identified, however the diversity of MLEs in the insect order Orthoptera has not yet been addressed. In the present study, we explore the diversity of MLEs elements in 16 species of Orthoptera belonging to three infraorders, Acridoidea (Caelifera), Grylloidea (Ensifera), and Tettigoniidea (Ensifera) by combining data mined from computational analysis of sequenced degenerative PCR MLE amplicons and available Orthoptera genomic scaffolds. In total, 75 MLE lineages (Ortmar) were identified in all the studied genomes. Automatic phylogeny-based classification suggested that the current known variability of MLEs can be assigned to seven statistically well-supported phylogenetic clusters (I–VII), and the identified Orthoptera lineages were distributed among all of them. The majority of the lineages (36 out of 75) belong to cluster I; 20 belong to cluster VI; and seven, six, four, one and one lineages belong to clusters II, IV, VII, III, and V, respectively. Two of the clusters (II and IV) were composed of a single Orthoptera MLE lineage each (Ortmar37 and Ortmar45, respectively) which were distributed in the vast majority of the studied Orthoptera genomes. Finally, for 16 Orthoptera MLE lineages, horizontal transfer from the distantly related taxa belonging to other insect orders may have occurred. We believe that our study can serve as a basis for future researches on the diversity, distribution, and evolution of MLEs in species of other taxa that are still lacking the sequenced genomes.


Introduction
DNA transposons were the first mobile genetic elements discovered in eukaryotes (McClintock, 1950). Since that time, different groups of these "selfish" elements were identified throughout the whole eukaryotic tree of life (Robertson, 2002;Feschotte, Pritham, 2007;Wicker et al., 2007). The most abundant and widespread group of DNA transposons is a Tc1/mariner superfamily (Plasterk et al., 1999;Tellier et al., 2015) named after its two best-studied elements, Tc1 and mariner from a nematode Caenorhabditis elegans (Emmons Diversity of mariner-like elements in Orthoptera et al., 1983) and a fruit fly Drosophila mauritiana (Jacobson et al., 1986), respectively.
Probably, the most exciting feature of Tc1/mariner elements and MLEs in particular is their ubiquity. There are numerous reports suggesting horizontal transfers of MLEs between highly diverged taxa (Yoshiyama et al., 2001;Lampe et al., 2003;Silva et al., 2004Silva et al., , 2005Oliveira et al., 2012;Dupeyron et al., 2014). Such a tendency to horizontal transfer is explained by (1) an evolutionary necessity for a long-term persistence of a DNA transposon, which otherwise will be eventually silenced by spontaneous mutations and cell's internal defense mechanisms (Lohe et al., 1995;Schaack et al., 2010), and (2) the simplicity of MLEs structure and life cycle (Plasterk, van Luenen, 2002). Presence of cis-or trans-encoded transposase and intact TIRs are the only features required for the transposition of MLEs (Plasterk et al., 1999). In addition, the MLE transposase was demonstrated to be fully active in vitro (Lampe et al., 1996;Tosi, Beverley, 2000). Such a low dependence on host-encoded and other biological factors most probably underlies success of MLEs as inter-taxa colonizers and their use as transgenic vectors (Delaurière et al., 2009).
Insects were the first organisms in which MLEs were identified and for which the horizontal transfer of MLEs was first proposed (Jacobson et al., 1986;Maruyama, Hartl, 1991;Robertson, 1993;Robertson, MacLeod, 1993). The insect order Orthoptera comprises two suborders: Caelifera (grasshoppers and locusts) and Ensifera (katydids and crickets). A computational analysis of the diversity of Tc1/mariner elements was carried out only for the genome of the migratory locust Locusta migratoria (Caelifera, Acrididae, Oedipodinae), one of the largest sequenced animal genome up to date (5.7 Gb) (Wang et al., 2014). However, this study provided no classification of the identified elements below the superfamily level (Bao et al., 2015). Only a few partial sequences belonging to MLEs were experimentally identified in two other grasshopper species (both in Caelifera, Acrididae): Eyprepocnemis plorans (Eyprepocnemedinae) (Montiel et al., 2012) and Trimerotropis pallidipennis (Oedipodinae) (Tevy et al., 2007). In addition, a recently published draft genome of the Hawaiian cricket Laupala kohalensis (Ensifera, Grylloidea, Trigonidiidae) is now also available for the analysis (Blankers et al., 2017). In this work, we explore the diversity of MLEs in Orthoptera from three infraorders, Acridoidea (Caelifera), Grylloidea (Ensifera), and Tettigoniidea (Ensifera). We identify new lineages of MLEs and take a first look into their horizontal transfer dynamics among insects.

Materials and methods
Orthoptera specimens. Specimens of 14 species of Ortho ptera were provided by Dr. A.G. Bugrov, Institute of Systematics and Ecology of Animals, the Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia. We used 10 specimens of grasshoppers (Acridoidea), two of crickets (Grylloidea), and two of katydids (Tettigonioidea). The taxonomic position of each of the studied specimens is indicated in Table 1.
DNA extraction, PCR amplification and sequencing. Total DNA was extracted from muscle or brain tissue of the ethanol-preserved insect specimens using DNAeasy Blood and Tissue kit (Qiagen, Germany). We used canonical degenerate PCR primers specific for the conserved part of the MLE transposase gene catalytic domain to amplify the corresponding sequences for each of the DNA specimen. The primer sequences (MAR 124F and MAR 276R) and the cycling conditions are from the previous study (Robertson, MacLeod, 1993). PCR products were visualized and separated from unincorporated primers on 1.2 % agarose gel stained with EtBr. The resulting bands were cut out and then isolated from the gel for each of the studied Orthoptera specimens using QIAquick Gel Extraction Kit (Qiagen, Germany).
Library preparation, sequencing and assembly. Approximately 100 ng of each of the isolated 14 PCR amplicon mixes were used for barcode library preparation with Ion Xpress Plus Fragment Library Kit (Thermo Fisher Scientific) following the manufacturer's protocol. Fragmentation time was 6 min and 8 PCR amplification cycles were done. Purification of the libraries was done with AMPure XP kit (Beckman Coulter). A DNA fraction of 280-350 bp was extracted from the libraries using LabChip XT (Caliper Life Sciences) on a DNA500 chip with subsequent purification with AMPure XP kit. The quality and molarity of the obtained libraries was evaluated with DNA High Sensivity kit on the Bionanalyzer BA2100 chip (Agilent). The final libraries were amplified in emulsion PCR with Ion PGM Hi-Q View OT2 Kit (Thermo Scientific) and then sequenced on the Ion Torrent PGM system with 316v2 chip. Raw sequence reads produced by the Ion Torrent sequencing were de novo assembled into contigs using MIRA v. 4.0.2 (Chevreux et al., 2004), specifying "iontor" as technology parameter and "est, denovo, accurate" as job parameters in MIRA manifest configuration files.
Availability of supporting data. The genomic sequences used in this work are available at NCBI GenBank (http:// www.ncbi.nlm.nih.gov/genbank). The genomes of Locusta migratoria and Laupala kohalensis include 5,760 Mb covering 1,397,492 scaffolds and 1,595 Mb covering 148,784 scaffolds, respectively (Wang et al., 2014;Blankers et al., 2017). The genome sequences from other 259 insect species (actual on 11.07.2018) were retrieved from NCBI GenBank using the following Entrez searching query on the Assembly database: "Insecta"[Organism] NOT "Orthoptera"[Organism] AND (latest[filter] AND "scaffold level"[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter] filtering option.

Detection of MLE sequences.
We collected a set of 19 known full-length transposase protein sequences belonging to eight MLE subfamilies (Supplementary 1) 1 and used them as queries in a batch tBLASTn search performed on the genomic sequences of L. migratoria and L. kohalensis. The recovered BLAST hits (High Scoring Pairs) were reassembled into copies if they were following the previously defined criteria (Wallau et al., 2014). Two hits from the same scaffold are jointed together if the distance between their centers is less than 1000 bp and if they have the same orientation. Only copies longer than 400 bp were retained. In order to extract the full-length sequence of each copy, the region of each assembled copy was expanded to 1500 bp. The copies were next clustered together into lineages by iterative BLASTn, setting the similarity threshold to 75 % which was previously defined in (Bouallègue et al., 2017), retrieving a majority rule consensus sequence for each lineage during the clustering process. Singlet sequences and consensuses derived from less than 10 sequences were removed from further analysis. Finally, we attempted to identify TIRs at the ends of the lineage consensuses using EMBOSS EINVERTED tool (Rice et al., 2000). Each newly identified complete sequence of Orthoptera MLE lineage was added to the query set and the whole search process was repeated to achieve the higher sensitivity.
The sequences of the query set obtained after the MLE screening of L. migratoria and L. kohalensis genomes were trimmed to the size of the conserved region of the MLE transposase catalytic domain which is between the canonical MLE degenerative primers (Robertson, MacLeod, 1993). Using this set, we performed a batch tBLASTn search against the assembled PCR amplicon sequences of 14 other Orthoptera species. Contigs of expected length (between 450 and 550 bp) which gave positive BLAST hits with more than 90 % coverage to the query sequence were kept for further analysis. The retrieved contigs were clustered into lineages as previously described.
To remove redundancy and to keep (when possible) only a full-length representative of a lineage for further analyses, we did a BLASTn search using the consensus sequences of obtained lineages as queries against the database of full-length Orthoptera MLE lineages identified in this work.
Finally, consensus sequences of the obtained MLE lineages, both full-length and partial, were used as queries in a batch BLASTn search against the sequences originally retrieved by tBLASTn search from all the studied Orthoptera species to estimate the prevalence and distribution of each lineage.
Classification and phylogenetic analysis. Corresponding full-length and partial protein sequences of MLE transposases were identified in lineage consensuses using BLASTx search against the original transposase tBLASTn query set. The obtained protein sequences were aligned together with fulllength transposase proteins of known MLEs from the previous work (Bouallègue et al., 2017) using MAFFT v7.312 (Katoh, Standley, 2013). The resulted alignment was used to build a Bayesian phylogeny of MLEs using the MrBayes v3.2.5 (Ronquist et al., 2012). The best-fitting model LG+I+G with four gamma rate categories was selected using SMS tool (Lefort et al., 2017) based on Akaike information criterion (Akaike, 1998). Bayesian posterior probabilities were estimated by running ten Markov chain Monte Carlo chains for 2,500,000 generations in two parallel runs, sampling each Note. Clusters correspond to those in Figure. Taxonomy is according to NCBI Taxonomy. See Supplementary 4 for more details.
* Species with available genomic scaffolds. ** For each cluster, (N) designate total number of different Ortmar MLE lineages found.
Diversity of mariner-like elements in Orthoptera 250 genera tions, and discarding first 25 % of samples in the end. Automatic phylogeny-based classification of the sequences on the resulted tree was performed using Cluster Picker tool (Ragonnet-Cronin et al., 2013). The minimal statistical support was set to 90 % and various genetic distance thresholds from 40 to 60 % were tested.
Horizontal gene transfer detection. The full-length nucleotide consensus sequences of the identified Orthoptera li neages were used in MEGABLAST search against the genomic scaffolds of 259 insect species from different orders available in the NCBI. Cases of potential horizontal transfer between the studied Orthoptera genomes and other insect taxa were considered when MEGABLAST hit showed more than 90 % of identity covering more than 90 % of the query sequences as it was proposed by several authors (Filée et al., 2015;Wallau et al., 2016;Bouallègue et al., 2017).

Results and discussion
Screening for MLE sequences was based on a homology approach (tBLASTn) with a set of 19 known full-length amino acid transposase sequences used as queries (see Supplementary 1). First, we initialized the search in the available genomic scaffolds of L. migratoria and L. kohalensis in order to identify Orthoptera full-length MLE sequences. A total of 4290 MLE copies from L. migratoria clustered in 27 lineages and 992 MLE copies from L. kohalensis clustered in 24 lineages were identified. Here, one lineage corresponds to a group of at least 10 sequences which have more than 75 % identity to each other detected by BLASTn comparisons.
Partial sequences of the conserved region of the MLE transposase catalytic domain were identified among the de novo sequenced PCR amplicons obtained from 14 other Orthoptera species. The sequences were then clustered into lineages following the same rules as described above (see Table 1). For each lineage, obtained both from the full-length or partial sequences, a majority-rule consensus sequence was derived (Supplementary 2). A total of 75 lineages were identified in all the studied Orthoptera genomes. For 49 of the lineages there was a full-length representative either from L. migratoria or L. kohalensis genome (Supplementary 3). There were no lineages distributed in all the genomes studied. Sixteen lineages were found only in the Caelifera, and 26, only in the Ensifera genomes. Thirty-three lineages were present both in Caelifera and Ensifera. Sixteen lineages were identified in a single genome each (Supplementary 4).
On average, 26 lineages were found for a studied orthopteran genome. The maximum (36 lineages) was identified in the genome of the migratory locust (L. migratoria) and the minimum (9 lineages) in the genome of an African grasshopper (Chrotogonus sp.) (see Table 1). It should be noted that the increase in the number of MLE lineages detected in L. migratoria (from 27 to 36) and L. kohalensis (from 24 to 31) genomes is due to the fact that these lineages are present as singlets or in less than 10 copies in the genomes and therefore were filtered out during the initial stringent mining process. Nevertheless, their minor presence is noted in Table 1 and Supplementary 4.
For the MLE lineage consensuses identified during the first stringent search in the L. migratoria and L. kohalensis genomes, TIRs as well as the characteristic to MLEs TA di-nucleotide TSD have been identified. Among the 27 L. migratoria lineages TIRs were found in 24 lineages, and the TSD was found in 17. TIRs were found in all the 24 lineage consensuses from the L. kohalensis genome, 19 of which also had the TSD (see Supplementary 3).
Transposase amino acid sequences of the lineage consensuses were then used in phylogenetic analysis together with other known MLE transposases. To implement a phylogenybased automatic classification of the identified lineages, we used the Cluster Picker tool (Ragonnet-Cronin et al., 2013) which assigns sequences into a cluster if (1) there is a substantial statistical support for their clustering on a tree and (2) a distance threshold between sequences in a cluster is satisfied. Testing different distance thresholds we found that at 43 % level all the "canonical" (historically first identified) MLE subfamilies (like mauritiana, mellifera, cecropia, etc.) still can be distinguished from each other, while increasing the threshold to 44 % already results in collapsing mauritiana and mellifera subfamilies into a single cluster. Using the 43 % threshold results in 28 clusters (including those represented by a single lineage), which can be supposedly of a subfamily level. However, these clusters at the same time are the parts of the higher-level ones (see Figure). For further convenience and the redundancy removal, we arbitrarily chose 55 % similarity threshold at which seven higher-level and statistically pronounced clusters (I-VII) are defined by Cluster Picker (see Figure, Table 1, and Supplementary 4). Four of these clusters (III, V, VI, and VII) include previously known subfamilies. Cluster III is comprised of mauritiana, mellifera, cecropia, elegans and briggsae subfamilies. Cluster V is represented solely by the irritans subfamily, even more supporting its status as such. Members of a recently defined drosophila subfamily (Wallau et al., 2014) were found within cluster VI. Cluster VII includes vertumnana and marmoratus subfamilies (Bui et al., 2007(Bui et al., , 2008. Although no previously defined subfamilies were found grouped in cluster I, it includes elements recently identified in the insect order Hemiptera (Bouallègue et al., 2017) and assigned to irritans subfamily (which, however it is not supported by our phylogenetic analysis). Clusters II and IV are comprised of single Orthoptera lineage each (Ort-mar37 and Ortmar45, respectively). However, these lineages were found in the vast majority (13 out of 16) of the studied genomes (see Figure, Table 1, and Supplementary 4).
The identified Orthoptera MLE lineages were distributed among all the seven clusters. The majority of the lineages (36 out of 75) belong to cluster I; 20 to cluster VI; and seven, six, four, one and one lineages, to the clusters II, IV, VII, II, and V, respectively (see Figure, and Supplementary 4). Interestingly, lineages of the cluster VII were identified only among the studied Ensifera genomes.
The tree was rooted with transposases from rosa and LTIR Tc1/mariner families (see Supplementary 6 for the names and accessions). Posterior probability is indicated for each node. Nodes with posterior probability below 90 % are collapsed and not shown. Names of the 75 Orthoptera MLE lineages identified in this study start with "Ortmar" and highlighted in blue. Clusters corresponding to original MLE subfamilies and predicted by Cluster Picker (Ragonnet-Cronin et al., 2013) with 43 % genetic distance threshold are indicated by bold red lines to the right of the tree, and their corresponding names (where applicable) are shown. Posterior probabilities of the nodes corresponding to subfamily-level clusters are highlighted in red. Higher-level clusters predicted by Cluster Picker with 55 % genetic distance threshold are indicated with black bold brackets and assigned with Roman numerals. Red stars indicate Orthoptera lineages which may have been horizontally transferred between Orthoptera species and species from other insect orders (see Table 2 for more details).
Diversity of mariner-like elements in Orthoptera tantly related species (divergence time over 250 Mya (http:// www.timetree.org/home)) from other insect orders may have occurred (see Table 2, and Supplementary 5). Among these potential transfers, some draw an additional attention. For instance, high-identity (up to 99 %) MEGABLAST hits with the full-length sequence of the Ortmar63 lineage consensus from the drosophila subfamily (see Figure, cluster VI) were identified simultaneously in 16 genomes belonging to Diptera, Hemiptera, Lepidoptera, and Hymenoptera, while the lineage itself was identified in ten and two copies in two crickets only: L. kohalensis and Cardiodactylus novaeguineae, respectively (see Supplementary 4). A less prominent signal was found for the lineage Ortmar39 of the mauritiana subfamily identified in eleven Orthoptera genomes with 10 copies per genome on average (see Figure, cluster III). MEGABLAST hits to Ort-mar39 (up to 92 % identity) were found for nine Hymenoptera genomes with the number of matches ranging from 1 to 257 (see Table 2, Supplementary 5). Two substantially diverged lineages from the cluster I (Ortmar8 and Ortmar9, each was found in ten studied Orthoptera genomes) gave 80 and 78 high scoring hits with a single genome of a termite Cryptotermes secundus. A substantial amount of high-identity hits with a genome of a stick insect Clitarchus hookeri (Phasmatodea) were found for the lineages Ortmar1 (320 hits up to 95 %, cluster I, distributed in 13 Orthoptera genomes) and Ortmar58 (51 hits up to 97 %, cluster VI, distributed in the 4 Orthoptera genomes). Another example is even more extraordinary: 5784 high-identity (up to 95 %) hits with the sequence of the Ort-mar46 lineage of the irritans subfamily, widespread within the Orthoptera (see Figure, cluster V, Supplementary 4 and S5), were found in a single genome of the blood-sucking bug Rhodnius prolixus (Hemiptera) (see Table 2, and Supplementary 5). Interestingly, previous studies showed that the R. prolixus genome is extremely saturated with members of a single lineage of the irritans subfamily, suggesting a relatively recent transposition burst with the estimated expansion occurred around the time of speciation of R. prolixus, approximately 1.4 Mya (Fernández-Medina et al., 2016). With Orthoptera species as a potential source, it can be hypothesized that the horizontal transfer of the Ortmar46 lineage could be a cause of the irritans subfamily expansion in the R. prolixus genome. Although in all the identified potential horizontal transfer cases the exact mechanism(s) and direction of the transfer remain elusive, these findings can later help in deciphering of each particular cases of the process in insects, as suggested by some authors (Silva et al., 2004;Loreto et al., 2008). Overall, the horizontal transfer cases identified here are in line with the recent report on massive horizontal transfer of transposable elements between insects which was based on more sophisticated genome to genome comparisons (Peccoud et al., 2017).

Conclusion
In this study, we used a non-standard approach to investigate the variability of MLEs. First of all, we took advantage of modern high-throughput sequencing technologies to avoid the bias of non-exhausting sampling which would occur if Ortmar8 ( * For each lineage, (N) -number of studied Orthoptera genomes in which the lineage was identified; ** for each non-Orthoptera order, (X/Y) -(number of genomes with detected potential horizontal transfers/number of genomes in the database); for the corresponding Orthoptera MLE lineage, values X/Y/Z designate number of genomes with MEGABLAST hit with more than 90 % identity/total number of MEGABLAST hits among the genomes/the maximum MEGABLAST identity (%) found.
the standard molecular cloning of the PCR amplicons with the subsequent Sanger sequencing was used. Here, the Ion Torrent PGM was a platform of choice, however, any of the currently available next-generation sequencing technologies can be utilized. The robustness of the obtained results was additionally controlled by the results of computational analysis of the two genome sequences from distantly related taxa of Orthoptera, L. migratoria (suborder Caelifera) and L. kohalensis (suborder Ensifera). Secondly, the automatic phylogeny-based classification of MLEs with Cluster Picker (Ragonnet-Cronin et al., 2013) allowed the definition of clusters on a rational, researcher-independent basis. In all previous studies on MLEs, it was a choice of the researchers whether to include newly identified MLE lineages to previously defined clusters, or designate new ones. Using this approach, we were able to identify and classify MLE lineages in the diverse 16 species of Orthoptera (see Table 1 and Figure), for which so far the genomic sequences were available only for the two species. We believe that our study can serve as a basis for future researches on diversity, distribution, and evolution of MLEs in species of other taxa that are still lacking the sequenced genomes.