Computational problems of analysis of short next generation sequencing reads

1 University of Hertfordshire, Hatfield, United Kingdom
2 Federal State Autonomous Educational Institution of Higher Education "Novosibirsk National Research State University", Novosibirsk, Russia
3 Federal State Budgetary Educational Institution of Higher Professional Education "Novosibirsk State University of Architecture and Civil Engineering (Sibstrin)", Novosibirsk, Russia
4 Federal State Budgetary Scientific Institution "Federal Research Center Institute of Cytology and Genetics of the Siberian Branch of the Russian Academy of Sciences", Novosibirsk, Russia
5 Wellcome Trust Sanger Institute, Cambridge, United Kingdom


Genomics
Vavilov Journal of Genetics and Breeding
Here we will concentrate on second-generation short-read DNA sequencing (Liu et al., 2012), and will drop the term 'second-generation short-read' when speaking about NGS below. Key terms are explained in the Glossary.
An essential feature of NGS technologies is that DNA molecules are broken into a stack of numerous fragments, called a 'library'. The end parts of these fragments are sequenced in parallel and are called 'reads'. Reads are assembled into contiguous strings. These assembled sub-sequences are in turn assembled into genomes, which are the subject of further analysis.
We refer to (van Dijk et al., 2014; Anders et al., 2015) for detailed characteristics of NGS platforms.
NGS data processing is arranged as a set of consecutive steps, called a pipeline. A common post-sequencing NGS pipeline (Mutarelli et al., 2014; Newell, 2014) consists of: (1) Quality Control (QC) of initial data; (2) Mapping to a reference genome and/or assembly; (3) Post-mapping/assembly QC and re-calibration; (4) Variant calling and its QC; (5) Correction of errors.
Step 2 may be combined with, or replaced by, a de-novo genome assembly (Baker, 2012) when there is no reference for the sequenced genome.
For each step of the pipeline above we review (a) what it is and what its goal is; (b) how it is done, summarising the methodology, its advantages and pitfalls.

QC of initial data
For any platform, the initial unprocessed digital outputs of a sequencing run are base calls and their qualities.
Compared to first-generation Sanger sequencing (Sanger et al., 1992), NGS technologies are confronted by shorter read length, platform/instrument/sample-specific biases (Harismendy et al., 2009), higher error rate, and irregular coverage. These factors lower the accuracy of further NGS analysis (e. g. variant calls and de-novo assembly) by introducing sequencing errors that may lead to misinterpretation of the data.
The main factors used in quality control of raw data to characterise the sequencer's performance and library preparation are: total read count; proportion of high-quality data; nucleotide and quality distribution per cycle; duplication rate (optical or amplification duplicates); adapter counts; proportion of bases per sample for pooled multiplexed data.
◊ Total read count shows general library effectiveness. It should be large enough to produce statistically significant results.
◊ The proportion of high-quality (Q > 30) bases within the Q-value distribution should be large: more than half at least. A base call is scored with low Q mostly because of the sequencer's preferences and faults (Abnizova et al., 2010; Ledergerber, Dessimoz, 2011). These low-quality bases are typically trimmed or corrected (Kelley et al., 2010; Del Fabbro et al., 2013), so that low-Q and possibly wrongly called data do not compromise downstream analysis. However, error correction should be applied only to high-coverage and homogeneous data, an assumption that often fails for NGS data.
◊ Quality-per-cycle distribution. Random quality peaks or dips per cycle point to problems on the machine during sequencing. Quality usually declines gradually with cycle as a result of a decreasing signal-to-noise ratio.
◊ Duplicate reads, which appear because of PCR (polymerase chain reaction) and optical problems, may lead to over-estimating the contribution of some variants in the data. Their proportion should be less than 10 %. Duplicate removal is debated in (Pireddu et al., 2011; Davis et al., 2013).
◊ The proportion of adapters should be less than 10 % as well. Parts of an adapter might be erroneously sequenced in the beginning of a read, and thus may introduce artificial mutations (Martin, 2011; Li J. et al., 2012a). Popular tools for adapter removal are discussed in (Marroni et al., 2012; Jiang et al., 2014).
◊ De-multiplexing, namely splitting up samples based on their tags, should in theory be even across tags. For pooled multiplexed samples it is very important that the size of each pool is sufficient and equal (Mir et al., 2013). Fairly even de-multiplexing (Hadfield, 2013) provides less biased data.
◊ It is also possible to quality-control a library before massive sequencing. The MiSeq QC (Illumina, 2014) enables a preliminary run on libraries before deep sequencing on a bigger machine, such as HiSeq or HiSeqX.
Nevertheless, any individual QC metric should be regarded in the context of its project (Guo et al., 2014a).
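As an illustration of the "proportion of high-quality bases" metric, the following minimal sketch computes the fraction of base calls with Q > 30 from FASTQ quality strings. It assumes the Phred+33 encoding; the function names and the default threshold argument are illustrative and do not come from any of the tools cited here.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Q-values (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def high_quality_fraction(quality_strings, threshold=30):
    """Fraction of all base calls whose Q-value is strictly above `threshold`."""
    total = 0
    high = 0
    for quality_string in quality_strings:
        scores = phred_scores(quality_string)
        total += len(scores)
        high += sum(1 for q in scores if q > threshold)
    # An empty input has no bases, so report 0.0 instead of dividing by zero.
    return high / total if total else 0.0
```

Under the rule of thumb above, a run would be flagged when `high_quality_fraction(...)` falls below 0.5, although the acceptable value depends on the project.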
Many sequencers (Cox et al., 2010) generate QC reports as part of their processing pipeline, and these reports mainly investigate the general performance of the corresponding sequencer. They typically do not cover any effects of sample extraction and library preparation.
A special case is FastQC. It is designed to point to problems that arose either in library preparation or on the sequencer. It is a very fast and crude estimation of different metrics computed from stratified samples of the data.
Alternatively, FaQCs (Lo, Chain, 2014) records errors in the whole data. It also removes reads with low Q-values.
The NGS QC Toolkit (Patel, Jain, 2012), besides performing a quality check and generating descriptive statistics, trims low-Q ends of reads and removes low-Q bases. It also enables conversion between various file formats of NGS data from the Illumina and Roche 454 platforms.
One should be careful: what is removed might be a genuine biological signal. Nevertheless, any deviation from the expected values of the QC metrics might indicate an error.

Mapping/aligning to a reference genome and/or assembly
The next step is matching the reads to positions in the reference genome, so-called mapping. This is done by aligning reads to the sub-sequences of the reference genome to which they are closest in terms of nucleotide sequence. Computationally, mapping is the most time- and memory-consuming step (Day-Williams, Zeggini, 2011; Fonseca et al., 2012). It is also critical: any mistake in alignment is carried into further processing and hence propagates errors to the later stages of the analysis.
For the short reads of NGS, the well-known BLAST algorithm (Altschul et al., 1990) is too inefficient in time and memory to map reads to a genome. Therefore dedicated memory- and time-optimised mapping algorithms have been developed.
NGS mappers/aligners can be classified by their methods: hash table indexing (Shang et al., 2014) or the Burrows-Wheeler Transform (BWT) (Li, Durbin, 2010). They also differ in computer resource usage and sensitivity, and thus may produce different mapping results. Here we define sensitivity as the proportion of the genome that is covered by at least one read after mapping. Mapping algorithms also vary in their ability to deal with particular sequencing platforms, base qualities and protocols, and with structural features of the sequenced DNA, such as repetitive motifs, gaps, deletions and insertions.
Both types of aligners typically pre-process and index both the reference and the reads before the actual search for matching read positions in the reference genome. A hash table is a kind of look-up table equipped with an advanced indexing structure. BWT aligners usually compress the data in a particular way (a modification of a suffix array) before matching. BWT aligners are less sensitive than hash table methods, but are faster and use less memory (Newell, 2014).
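The two indexing ideas can be illustrated with a toy sketch. It is not the implementation of any published aligner: the hash index stores every k-mer of the reference and is used for exact seed-and-verify matching only, and the BWT is built naively by sorting all rotations, which real tools avoid. All function names and the "$" terminator are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table from each k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read_exact(read, reference, index, k):
    """Seed with the first k-mer of the read, then verify the whole read."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)
```

For example, with the reference "ACGTACGTTA" and k = 3, the read "ACGT" is reported at positions 0 and 4, which also shows how a repeated sub-sequence leads to ambiguous placement.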
Sequence assembly refers to aligning and merging short fragments of a sequenced DNA in order to recreate the original sequence. If the genome of an organism has not been sequenced before, the assembly results in the first version of its reference genome. This procedure is called "de-novo assembly". Sometimes a de-novo assembly is used together with alignment to reconstruct genome loci that were previously insufficiently covered or unreliably sequenced.
Present-day assembly algorithms for NGS comprise two main groups (Li Z. et al., 2012): (i) Overlap-layout-consensus (OLC) methods; and (ii) Eulerian/de Bruijn Graph (DBG) methods. Both groups apply graph theory to NGS data, but in OLC notation reads are nodes, while in DBG notation a k-mer is a node. Overlaps between sequences stand for graph edges in both groups of assemblers.
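The DBG notation described above can be sketched in a few lines. In this simplified illustration nodes are k-mers and a directed edge joins two k-mers that occur consecutively in a read (they overlap by k-1 bases); error handling, reverse complements and graph traversal, which real assemblers need, are left out. The function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a graph whose nodes are k-mers and whose edges are (k-1)-overlaps."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # touching the key registers the node, even without edges
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return dict(graph)
```

Two overlapping reads such as "ACGT" and "CGTA" share the node "CGT", so the graph joins them into the path ACG, CGT, GTA.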
What can go wrong?
• Reference mistakes. The alignment step clearly depends on the accuracy of the reference. If the reference is incorrect, many reference errors could be mistaken for high-quality genetic variants.
• A bias shared by most technologies is that their accuracy decreases with the number of sequencing cycles, so the mapping error grows towards the end of a read (Balint, 2016; Sameith et al., 2016).
• Among the more specific defects, we refer to: platform-dependent issues; the type of protocol used; complications due to the functional and structural complexity of the sample DNA.

• Read Length and Error Rates
Read lengths span 70-1500 bp (Newell, 2014), depending on the sequencing platform. If reads are short, it is harder to match them precisely to a unique genomic location.
Some sequencing platforms provide longer reads than others (for example, 200 bp for Ion Torrent and 700 bp for Roche's 454, compared with 100-250 bp for Illumina), which makes mapping more precise. However, this advantage is offset by their higher mismatch error rate: aligners discard reads with too many mismatches on the basis of a preset mismatch error rate.
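The effect of a preset mismatch threshold can be shown with a small sketch. It compares a read with the reference segment it was placed on, position by position (substitutions only, no indels). The default threshold of 0.04 is an arbitrary illustrative value, not a setting of any particular aligner.

```python
def mismatch_rate(read, reference_segment):
    """Fraction of positions at which the read differs from the reference."""
    if not read or len(read) != len(reference_segment):
        raise ValueError("read and reference segment must be non-empty and equal in length")
    mismatches = sum(1 for a, b in zip(read, reference_segment) if a != b)
    return mismatches / len(read)


def passes_mismatch_filter(read, reference_segment, max_rate=0.04):
    """True if the read would be kept under the (hypothetical) threshold."""
    return mismatch_rate(read, reference_segment) <= max_rate
```

A read that carries a genuine variant is discarded by such a filter in exactly the same way as a read that carries a sequencing error.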

• Platform-dependent issues
The technology on which a platform is founded may be prone to certain sequencing mistakes, resulting in platform-specific error characteristics.
The 'light-based' sequencing platforms, Illumina, SOLiD, and Complete Genomics, use fluorescent dye labelling to measure the signal strength at each successive sequencing cycle. The light-based platforms are known to be impaired by GC-bias, i. e. low coverage of either GC-rich or GC-poor DNA regions (Chen et al., 2013; Rieber et al., 2013). Its origin is likely to be the fragmentation and/or cloning procedures during library preparation (Benjamini, Speed, 2012; Ross et al., 2013).
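To look for GC-bias, coverage is usually compared with the GC-content of the corresponding region. A minimal sketch of the GC-content part is given below; the non-overlapping window scheme and the function names are assumptions made for illustration.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(sequence, window):
    """GC-content of consecutive non-overlapping windows of the given size."""
    return [
        gc_content(sequence[i:i + window])
        for i in range(0, len(sequence) - window + 1, window)
    ]
```

Plotting mean read depth per window against these values is one simple way to see whether GC-rich or GC-poor windows are under-covered.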
The light-based platforms typically suffer from single-nucleotide misidentifications. The SOLiD platform is known to have difficulties with sequencing palindromic sequences (Huang et al., 2012). Ion Torrent operates on acidity (pH) rather than on light. Roche's 454 (Niu et al., 2010) employs pyrosequencing technology. The accuracy of both technologies depends on the length of sub-sequences of identical nucleotides ("homo-polymers") because they use similar computational approaches to evaluate homo-polymer length.
Defective flow-calls result in insertion/deletion (indel) errors: these are largely homo-polymer-associated errors, in the case when short homo-polymers are frequent while long ones are rare (Bragg et al., 2013; Li et al., 2013).
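Because these errors concentrate in homo-polymers, it can be useful to list the homo-polymer runs of a sequence before interpreting indel calls. The sketch below is a simple illustration; the minimum run length of 3 is an arbitrary choice.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=3):
    """Return (start, base, length) for every run of identical bases
    that is at least `min_length` long."""
    runs = []
    position = 0
    for base, group in groupby(sequence):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs
```

Indel calls that fall inside or next to the reported runs deserve extra scrutiny on flow-based platforms.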
Recognizing indels from NGS data is known to be very challenging (Li et al., 2013), because an 'indel by itself obstructs precise mapping'. To map indels precisely, paired-end (PE) information is employed (Albers et al., 2011). This is valid for indels about half the size of a read. Longer deletions are detected by a split-read method.
To detect long insertions, a de-novo assembly of weakly covered regions is required (Li et al., 2013).
• Sequence-specific errors: For pyrosequencing platforms, 'homopolymer-associated errors' result in repetitive DNA being discarded after mapping. Indel errors are known to be context-dependent. Moreover, for Ion Torrent, GC-poor organisms have a higher error rate and poorer coverage than GC-balanced ones. The nucleotide context of Illumina errors is reported in (Minoche et al., 2011).

• DNA complexity: DNA functionality causes aligning biases
The study of NGS artefacts in (Schwartz et al., 2011) showed that the less linguistically complex sequences of introns are less covered with reads than the more complex sequences of exons. The authors discovered that peaks of mapped reads were associated with biological features, such as intron-exon junctions, expression level, splice sites and transcription length.
Similarly, the authors of (Auerbach et al., 2009) found that regions proximal to promoters are prone to sonication breakage, and hence are subject to regional bias. These regions are the primary cause of uneven read coverage, retaining large peaks of aligned reads.

• DNA complexity: Repetitive DNA causes assembly problem
A particularly troublesome feature of the sequence structure of many genomes is the occurrence of long stretches of repetitive DNA (so-called "repeats"): repetitive DNA is frequently overlooked, mis-mapped and mis-assembled on all platforms (McCoy et al., 2014).
Around half of the human genome consists of repetitive DNA (de Koning et al., 2011), and the fraction of repeats is even larger for some plant genomes (Feschotte et al., 2002). Even though repetitive DNA is functionally important, NGS often fails to sequence it flawlessly (Alkan et al., 2011b; Ye et al., 2011). Most current technologies are error-prone when handling repeats.
But even if a repetitive DNA stretch is sequenced correctly, it may be confused with similar DNA at another genome location, leading to mis-alignment. Finally, repetitive DNA is often a hot-spot of real biological mutations and structural variations (Orlov et al., 2006; Medvedev et al., 2009; Safronova et al., 2015, 2016).
In addition to the various kinds of repetitive DNA, short indels and segmental duplications are also difficult to align and assemble (McCoy et al., 2014) because it is ambiguous at which location an identical DNA subsequence should be mapped.
The main assumption of assembly (similar reads belong to the same location) is violated by various repeats and polymorphisms. Assembly is computationally intractable for genomes where the ratio of repeat length to read length is large (Nagarajan, Pop, 2013).
If a whole long repetitive stretch were sequenced together with its flanking regions, it would be easier to place it within the genome. Therefore, longer reads could solve this problem (Huddleston et al., 2014).

• Diversity of protocols: PE and MP methods
The type of sequencing protocol depends on the researcher's question: e. g. reads are sequenced in pairs (paired-end, PE) (Medvedev et al., 2009) or singly (single-end, SE). PE reads are designed to capture the orientation of, and the distance between, reads, so reads containing complex DNA can be mapped uniquely (Miller et al., 2010; Alkan et al., 2011a).
A sub-type of PE reads, long-insert reads (up to 5 kb), frequently called mate-pair (MP) libraries (Park, 2013), are valuable for spanning long repeats (including repetitive transposable elements) and structural variations.
Longer reads can solve the assembly and mapping problems. With longer reads it is easier to establish the correct genomic location of a sequenced DNA. Therefore, new synthetic long reads (McCoy et al., 2014) from the Illumina TruSeq technology have been developed. They are as long as third-generation PacBio reads (Sharon et al., 2013), but much more accurate, with an error rate as low as 0.03 % per base.
These synthetic long reads are assembled from Illumina short reads by a combination of laboratory and computational efforts (Voskoboynik et al., 2013). Nonetheless, some imperfections remain: gaps in assembly and low coverage of repetitive AT-rich regions.
Regrettably, when some problems are reduced, new ones arise. The essential problems of MP (Park, 2013) are: (i) the extremely elaborate construction of the libraries, and (ii) a common mapping mistake: 'inward facing' reads in place of 'outward facing' ones. This mistake results in chimeric reads (Illumina). Other problems are: unexpectedly small insert sizes (Nextera), under-representation of AT-rich sequences (SOLiD) and unplanned spontaneous secondary fragmentation (Roche).
• Sequencing errors (Abnizova et al., 2012; Ross et al., 2013) are another threat for aligners. Clearly, if a read contains more mismatches than allowed by the aligner settings, then it will be discarded, even if it carries a biological signal. Another concern is the significant discordance of assemblers (Magoc et al., 2013): different assemblers yield very unequal amounts of assembled reads for the same data sets, specifically for homologous genome regions.

Post-mapping/assembly QC and re-calibration
Mapping is known (Li H. et al., 2009) to be a primary cause of sequencing biases. Therefore it is recommended to review the quality of mapped reads before in-depth scientific analysis.

Mapping metrics
To safeguard adequate aligner performance, there are several QC metrics:
◊ Coverage. Ideally, an even read coverage is expected along the genome, to avoid local biases. In practice, however, coverage is known (Minoche et al., 2011) to be non-uniform along the genome, depending on regional function, composition (Rieber et al., 2013) and many other features.
◊ The Q-value/score is a commonly used measure of base call quality (Bonfield, Staden, 1995; Ewing et al., 1998). The quality Q-scores compress different types of information about the quality of base calls into a confidence (probability of error) value. The quality score is a commonly accepted input for the majority of analysis tools, assemblers and aligners in order to produce accurate results. However, in raw fastq/bam files these Qs are inferred, or predicted. The predictions are based on a set of measurements of a base call, and on previous observations of the values of these measurements. The inferred Q-values are assigned by means of a pre-computed look-up table, the so-called calibration table (Brockman et al., 2008; Abnizova et al., 2010).
A sequencer's errors are typically of low Q, and come from technological and hardware shortcomings.
The well-known sources of errors for Illumina sequencers are: phasing and pre-phasing, dye label cross-talk, molecule degradation with time, and G-quenching (IDT, 2011). Phasing inaccuracy results from base-incorporation errors on the sequencing machine. G-quenching is an effect of a preceding G nucleotide; the base quality is typically low for a base call preceded by G (Abnizova et al., 2010, 2012). It was strongly pronounced for the v3 version of HiSeq, and is dramatically reduced for HiSeqX10 and X5.
◊ Contaminating sequences (arising for various reasons) may introduce artefacts during variant calling (Schmieder, Edwards, 2011).
◊ Capture efficiency for exome sequencing is the proportion of useful data (Garcia-Garcia et al., 2016). It is normally 40-75 % (Guo et al., 2014a), and should not be too small for statistically sound results.
As in the section on QC of initial data, any inconsistency with the values expected for the investigated sample should be treated as a warning.
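For reference, the relation between a Q-value and the probability of error follows the standard Phred definition, Q = -10 log10(p). The short sketch below converts in both directions and sums the error probabilities of a read; the function names are illustrative.

```python
import math


def error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def phred_quality(p):
    """Phred quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def expected_errors(q_values):
    """Expected number of wrong base calls in a read with the given Q-values."""
    return sum(error_probability(q) for q in q_values)
```

Thus Q = 30 corresponds to one expected error per 1000 base calls, and Q = 20 to one per 100.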

Assembly metrics
In the absence of a reference genome, the assembly metrics are:
◊ Total number of contigs or scaffolds: the fewer the better;
◊ Contig or scaffold sizes: max, mean and N50. N50 is defined as the length of the scaffold/contig that overlaps the midpoint of the length-ordered concatenation of scaffolds/contigs;
◊ Total size of scaffolds. It should be close to the expected size of the sequenced genome;
◊ Number of Ns, which should be limited. (The gaps created in an assembly are filled with the uninformative base-pair character 'N'.)
Assembly accuracy and several normalised metrics can be assessed when a reference genome exists. Note that normalisation takes into account only those parts of the assembly that can be mapped to a reference genome by standard local alignment tools.
◊ Sensitivity of assembly is defined as the percentage of the genome assembled.
◊ Normalised N50 for contigs and for scaffolds. For scaffolds it is more complicated than for contigs because gaps are filled with Ns (Makinen et al., 2012).
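The N50 definition above can be computed directly from a list of contig or scaffold lengths. In this sketch the lengths are ordered from longest to shortest and the first length at which the running total reaches half of the assembly size is returned; this ordering convention is the common one and is assumed here.

```python
def n50(lengths):
    """N50 of a collection of contig/scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The contig that covers the midpoint of the concatenation defines N50.
        if running >= half_total:
            return length
```

For lengths 2, 3, ..., 10 the total is 54, and the three longest contigs (10, 9, 8) already sum to 27, so N50 is 8.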

Q re-calibration
The predicted Qs often do not correspond to the actual Qs for a certain run/lane/library. In this case (and when heterogeneous data are combined) it is suggested to re-calibrate the data (Ewing et al., 1998; Massingham, Goldman, 2012).
At the WTSI we implemented in-house recalibration and error analysis tools (Abnizova et al., 2010). Instead of trimming ambiguous base calls, we flag possible sequencing errors with a low Q. A trustworthy Q-value is known to increase SNP call accuracy (Li, Stoneking, 2012) more than hard filtering.
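The general idea of re-calibration can be sketched as follows. This is not the WTSI in-house tool; it is a minimal illustration that assumes each base call has already been labelled as correct or erroneous (for example by comparison with a trusted reference at non-variant sites) and that bins base calls by their predicted Q only. The add-one smoothing is an assumption made to avoid taking the logarithm of zero.

```python
import math
from collections import defaultdict


def recalibration_table(observations):
    """Map each predicted Q to an empirical Q.

    `observations` is an iterable of (predicted_q, is_error) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # predicted Q -> [bases, errors]
    for predicted_q, is_error in observations:
        counts[predicted_q][0] += 1
        counts[predicted_q][1] += int(is_error)
    table = {}
    for predicted_q, (bases, errors) in counts.items():
        error_rate = (errors + 1) / (bases + 2)  # smoothed empirical error rate
        table[predicted_q] = round(-10 * math.log10(error_rate))
    return table
```

In this toy setting, bases predicted as Q30 that are wrong about 1 % of the time would be re-assigned Q20. Practical re-calibration also conditions on covariates such as cycle and sequence context.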

Variant calling and its QC
Variant calling from NGS data is defined as the set of computational methods for establishing the presence of a genetic variant from the results of NGS experiments (Lawrence, 2014; Zhang et al., 2015).
Variant calling covers small-range variants (Kojima et al., 2013), such as single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels), and large-range variants: copy number variants (CNV) and structural variants (SV). SVs are inversions, translocations, or large indels. All types of variants are identified by comparison to a reference genome.
The fraction of variation in genomes is significant: e. g. for the human genome, SNPs comprise around 0.1 %, whereas the contribution of SVs is estimated at 1.2 % (Tattini et al., 2015) and that of CNVs is as large as 15 % (Wong et al., 2010).
Variant calling is crucial for comparative genomics and the genetics of human diseases. A valuable application of variant calling is clinical testing: identifying disease-associated mutations (Chin et al., 2013).
Variant calling is implemented in two ways: (i) after aligning reads, or (ii) after assembling reads. Sometimes these steps are combined. SNPs and small indels can be identified by alignment of short sequencing reads to a reference genome. However, larger structural variants and repetitive regions in the genome are harder to find.
Structural variation can disrupt genes or regulatory elements, therefore whole-genome sequencing is not complete without assembly and detection of structural variation (Li H. et al., 2009). In case (i), the position of each read relative to the reference genome is identified first. After the reads are aligned, a set of QC steps, involving recalibration, duplicate removal, and indel realignment, is performed before variant calling.
In case (ii), an assembly of unprocessed reads is performed first, and only then is the assembly compared against a reference genome (if the latter exists). Variant detection after assembly is beneficial for individual genes (Olson et al., 2015), but it loses power when applied to a whole genome: in the absence of a reference genome it is not possible to identify contamination by other genomes, and spurious variants cannot be verified against raw reads after assembly.

Somatic versus germline mutation
Variant calling from NGS data is widely used in the genetics of human diseases. There are three typical ways in which NGS data are applied in this area: (a) detection of causal germline mutations in Mendelian disorders (Lettice et al., 2008; Stitziel et al., 2011); (b) detection of putative genes for complex diseases with GWAS (Day-Williams, Zeggini, 2011; Lander, 2011; Marian, 2012); (c) detection of somatic and constitutional mutations in cancer (Walther et al., 2009).
It is more complicated to identify a somatic mutation than a germline mutation (Pabinger et al., 2014).
To identify somatic mutations in cancer, tumor and normal samples from the same individual are typically compared (Vissers et al., 2011; Yan et al., 2011).
A set of metrics to assess the quality of a variant call is listed below (Guo et al., 2014b; Jun et al., 2015):
◊ Ti/Tv ratio, separately for whole genome sequencing (WGS) and whole exome sequencing (WES) (should be about 2 and 3, respectively);
◊ Heterozygosity ratio;
◊ Number of known and of new SNPs per person: should be not more than 200;
◊ Cross-species and within-species contamination; genotype consistency;
◊ SNP spatial density; QC per SNP;
◊ Strand, cycle, allele balance and reference allele biases; haplotype scores;
◊ Performance metrics: sensitivity and specificity of single nucleotide variant calls.
One can combine these metrics by machine learning methods (DePristo et al., 2011; Jun et al., 2015). In order to minimise false positives (FP), some variant callers do a lot of filtering and trimming using the metrics above: by applying a minimum depth-of-coverage threshold, by masking homo-polymers and repeats, by trimming poor-quality bases from a read, etc. Unfortunately, applying these filters to reduce FP can increase false negatives (FN) (Olson et al., 2015).
◊ To assess the quality of a variant caller, one should use performance metrics: accuracy, sensitivity and specificity (Olson et al., 2015), given reliable benchmarking test sets and a reference.
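Two of the metrics listed above are simple enough to sketch directly. The example below computes the Ti/Tv ratio from a list of (reference base, alternative base) SNP pairs and the sensitivity and specificity of a caller from counts of true/false positives and negatives. The input representation and function names are assumptions made for the example; a real pipeline would read these values from a variant file and a benchmark set.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine-purine or pyrimidine-pyrimidine substitutions."""
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions among (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = sum(
        1 for ref, alt in snps if ref != alt and not is_transition(ref, alt)
    )
    if transversions == 0:
        raise ValueError("no transversions: Ti/Tv is undefined")
    return transitions / transversions


def sensitivity(true_positives, false_negatives):
    """Fraction of real variants that were called."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-variant sites that were correctly left uncalled."""
    return true_negatives / (true_negatives + false_positives)
```

A call set whose Ti/Tv ratio is far from the expected value for its experiment type is a hint that many calls are false positives, since random errors favour transversions.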
A comprehensive review of post-mapping QC is given in (Wyllie, 2013; Guo et al., 2014b). The GATK (DePristo et al., 2011) uses variant QC metrics for its variant calls, applying genotyping and known SNP information for variant QC and annotation. However, there seems to be no standard evaluation of a variant caller so far (Olson et al., 2015).

Correction of errors
A certain number of errors result from sequencing and post-processing imperfections. One way to tackle them is to assign low Q-scores to possible known artefacts, so that they are not used in further analysis. Another way is to correct errors using knowledge about the error sources of the various platforms (Edgar, Flyvbjerg, 2015; Olson et al., 2015) and about computational biases.
Error correction after mapping is the correction of a mismatch between a sequenced read and the reference. Error correction after or during assembly is based on the general agreement of base calls across all reads belonging to the same assembled location.
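The "general agreement of base calls" can be illustrated by a naive majority vote over reads that have already been placed at the same location. This is a deliberately simple sketch, not one of the published correction methods: reads are assumed to be of equal length and gap-free apart from the '-' placeholder for a missing call, and ties are resolved in favour of the base seen first.

```python
from collections import Counter


def consensus(aligned_reads):
    """Column-wise majority base across equally long, co-located reads.

    '-' marks a missing call; a column without any call becomes 'N'.
    """
    if not aligned_reads:
        return ""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("all reads must have the same length")
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)
```

Note that such a vote treats the minority allele at a heterozygous site as an error and overwrites it, which is exactly the kind of mis-correction discussed in this section.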
There have been multiple attempts to correct sequencing errors. However, error correction might introduce a new type of error: mis-correction errors (Yang et al., 2013; Fujimoto et al., 2014). These errors are more difficult to correct back than technological errors.
A sound comparison of NGS platforms is given in (Yang et al., 2013), together with a very good explanation of modern error-correction methods. The paper argues convincingly that one should NOT introduce new mis-correction errors. Additionally, correcting reads without understanding the causes of sequencing/library errors does not look promising. The work of (Fujimoto et al., 2014) confirms that error-correction methods cannot handle heterozygosity and that they introduce new mis-correction errors.
There are approaches to correct for known context biases, such as GGGGT error patterns for Illumina (Minoche et al., 2011; Nakamura et al., 2011). However, new Illumina releases (e. g. HiSeqX10) are almost free from the old type of motif dependency, while new artefacts (such as a larger context dependence on the next base) appear.
Error models are used in (Janin et al., 2014) to realistically simulate individual sequencing runs and/or technologies. These models are mostly empirically derived and context-based. A comparison of genomes without assembling them is introduced by (Patro, Kingsford, 2015).
It might be beneficial to do so for de novo sequenced genomes. However, possible PCR biases in coverage are not included in the model. Some studies, such as (Orton et al., 2015), developed a computational error model of Illumina's sample processing, which involves experimental steps. This model infers possible genomic locations of PCR errors.
In conclusion, one should be aware of possible biases and make decisions depending on the aim of the study. The overall conclusion is that error correction of short sequencing reads is necessary for the mapping and processing of NGS data, and that it depends on the sequencing platform. Details of publications on error correction will be presented in the next paper.

Glossary
First-generation sequencing. The well-known Sanger sequencing method (Sanger et al., 1992) is called a first-generation DNA sequencing technology.
Next generation sequencing (NGS) technologies (Liu et al., 2012) include: (i) 2nd generation sequencing, the massively parallel sequencing of relatively short DNA fragments (Dolled-Filhart et al., 2013); and (ii) 3rd generation sequencing, in which single DNA molecules, and hence much longer fragments (Schadt et al., 2010), are sequenced. In this paper we focus on 2nd generation DNA sequencing, and omit the term '2nd generation' when mentioning NGS.
Base-calling. With NGS technologies, bases are inferred from light/chemistry intensity signals, a process commonly referred to as base-calling. The sequenced bases are assigned the letters A, C, G or T depending on the intensity.
Q-value/score is the most widely accepted measure of base call quality (Bonfield, Staden, 1995). The quality Q-scores compress a variety of types of information about the quality of base calls into a probability-of-error value.
Mapping or aligning is the matching of the reads to locations in the reference genome. This is done by aligning reads to the stretches of the reference genome to which they are most similar in terms of nucleotide sequence.
Sequence assembly refers to aligning and merging short fragments of a DNA sequence in order to reconstruct the original sequence. If the genome of a species has not been sequenced before, the assembly of the reads results in the first version of its reference genome. This is called "de-novo assembly".
Multiplex is a library containing various samples labelled with barcodes. Sample multiplexing is a useful technique when targeting specific genomic regions or working with smaller genomes. To accomplish this, individual "barcode" sequences are added to each sample so that they can be distinguished and sorted during data analysis. Pooling samples greatly increases the number of samples analyzed in a single run, without drastically increasing cost or time.
De-multiplexing is the separation of samples based on their tags; ideally it should be even across tags.
Adapter. The vast majority of next-generation sequencing experiments attach an adapter sequence to the sequencing construct. In many cases these are standard sequences that can be obtained from the vendor and/or sequencing centre. Unfortunately, adapter information is sometimes not properly tracked and attached as metadata to the raw sequencing data, and may not be known for a given sample.
PF (purity-filtered) data: PF-filtering means discarding data with a low maximum intensity signal (purity, in Illumina terminology).
GC-content is a measure of the relative frequency of the cytosine (C) and guanine (G) bases, in comparison with the adenine (A) and thymine (T) bases. A genome is called GC-rich if significantly more than 50 % of its bases are G or C.
Mate-pair libraries. Mate-pair differs from "paired-end" in how the sequence library is made. In "mate-pair" sequencing, 2-5 kb fragments are selected and sequenced from both ends, thus giving information on how nucleotides that are far apart are linked together. Mate-pairs are better suited for studying genomic structural rearrangements and help de novo genome assembly. They also facilitate sensitive structural variant (SV) detection across a widened SV size spectrum and in repetitive areas of the genome.
Insert size = DNA fragment size.
Ti/Tv (sometimes called Ts/Tv): the ratio of transitions to transversions in SNPs. Transitions are mutations within the same type of nucleotide: pyrimidine-pyrimidine mutations (C <-> T) and purine-purine mutations (A <-> G). Transversions are mutations from a pyrimidine to a purine or vice versa.
Heterozygosity ratio is the number of heterozygous sites in an individual divided by the number of non-reference homozygous sites.
Error-correction is an attempt to correct a mismatch between sequenced reads and/or the reference (if it is available).
Genomic variant or mutation is a permanent alteration of the nucleotide sequence of the genome of an organism.
Single nucleotide polymorphism or simple nucleotide polymorphism (SNP) is a variation in a single nucleotide which may occur at some specific position in the genome, where each variation is present to some appreciable degree within a population (e. g. >1 %).
Structural variation (also genomic structural variation) is the variation in structure of an organism's chromosome. Structural variation consists of many kinds of variation in the genome of one species, and usually includes microscopic and submicroscopic types, such as deletions, duplications, copy-number variants, insertions, inversions and translocations.
DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule.
WGS: whole genome sequencing.
WES: whole exome sequencing.
GC-bias. In Illumina, PCR and size selection steps have been implicated in GC-bias. PCR is known to preferentially amplify GC-moderate sequences, while size selection involves DNA heating, which leads to under-representation of GC-poor fragments. Avoiding these steps helps to limit the GC-bias.
BAM file: binary version of a SAM file, a typical output of the secondary phase of data analysis.
Coverage: this value indicates the coverage of an analysed sequence with respect to its length, usually expressed as a percentage; sometimes the term is also used for the depth of reading.
Long-Reads: strategy for sequencing samples prepared by the Mate-Pair-End method.
Mate Pair-End-Read: strategy for sample preparation where a longer fragment (thousands of bases) is circularized using labelled adapters; the molecule is subsequently fragmented, but only the fragments containing the labelled adapters are sequenced.
Paired-End-Read: a method of reading a fragment where the fragment is first read from one end and then from the other.
Read Depth: DNA = number of times a nucleotide is read; RNA = total number of reads per sample.
Read Length: the number of read bases per fragment, or the maximum length of a fragment that can be sequenced at a time (given in bases).
Single-Read: a method of reading a fragment where the fragment is read from one end only during sequencing.
SNP: Single-Nucleotide Polymorphism = sequence divergence in the range of a single base.
SNP Calling: the process of detecting SNPs in the sequences obtained.
Variant Calling: the process of detecting sequence variants in the sequences obtained.
Heterozygosity occurs when an individual has two different alleles of a gene/locus.
Chimeric reads are reads with DNA sequences originating from two different samples.
Vavilov Journal of Genetics and Breeding • 20 • 6 • 2016