Understanding NGS Library Complexity: Sources of Sequencing Errors & Limits of Detection

Next-generation sequencing is, by definition, “massively parallel.” While the first 454/Curagen GS20 systems yielded a million reads per run, the latest sequencing platform from Illumina (the NovaSeq 6000 running its highest-capacity S4 flowcell) yields 10,000X that number (i.e. 10 billion reads) per sequencing run.

Applications such as somatic mutation detection in cancer samples require increasingly greater sensitivity and accuracy. One particularly demanding application is liquid biopsy, specifically the analysis of cell-free DNA for mutated circulating tumor DNA (abbreviated ctDNA). It is not uncommon for ctDNA to be present at one mutant molecule in five hundred, or 0.2% mutant allele frequency (MAF). Detecting such low allele frequencies is imperative for patient care in applications such as therapy selection or ongoing resistance monitoring.

With such sensitive (and clinically important) applications, it is vital to have a firm understanding of NGS library complexity to determine where sensitivity loss can occur or where sequencing errors are created. Of particular importance are the amplification steps from the initial DNA sample to the reading of the signal (fluorescent for Illumina and BGI, a pH change for Ion Torrent) from the DNA cluster, the DNA nanoball, or the Ion Sphere Particle. These three terms refer to the Illumina, BGI MGISEQ, and Thermo Fisher Scientific / Ion Torrent sequencing processes, respectively.

From FFPE DNA to DNA Cluster: An Example

Thermo Fisher Scientific’s AmpliSeq targeted enrichment kit targets DNA purified from an FFPE tissue slice. The DNA is amplified once during the initial target amplification step. After partial digestion and index ligation, the library is then cleaned up with magnetic beads and the DNA amplified a second time to complete the ends of the library molecules (i.e. addition of the P5 and P7 adapter sequences outside the index sequence).

After a second bead purification and library quality check (typically on a gel or via qPCR), the individual libraries are quantitated, normalized, and pooled together. After loading onto the sequencer, the library molecules migrate to the surface of the Illumina flowcell, where bridge amplification takes place. A single library molecule is amplified 1000-fold or more on the solid surface. At this point, the sequencing primer is added to the cluster molecules (several thousand molecules), and sequencing-by-synthesis (SBS) using reversible dye-labeled deoxynucleotides begins.

As in the example above, there are three amplifications that can reduce NGS library complexity: the first during the original enrichment, the second during the addition of the P5/P7 adapter sequences, and the third during cluster generation. The polymerase can incorporate an errant base at any of these steps, and this creates noise (a false-positive signal) in the raw sequencing data. If the error occurs early in the first PCR enrichment, it will be carried through as a high-quality base throughout the sequencing process.
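
To see why an early error is so damaging, the sketch below estimates what fraction of the final product carries an error first introduced at a given PCR cycle. It assumes idealized, perfectly efficient doubling and a single errant copy, and it treats each molecule as one unit (ignoring strandedness). The function name `error_fraction` is illustrative and does not come from any kit or library.

```python
def error_fraction(start_molecules: int, cycle: int) -> float:
    """Fraction of molecules carrying an error first made in `cycle` (1-based).

    Assumption: every molecule doubles in every cycle, so after `cycle`
    rounds there are start_molecules * 2**cycle copies, exactly one of which
    is errant. Later cycles copy errant and correct molecules alike, so the
    fraction stays constant from that point on.
    """
    if start_molecules < 1 or cycle < 1:
        raise ValueError("start_molecules and cycle must both be >= 1")
    return 1.0 / (start_molecules * 2 ** cycle)


if __name__ == "__main__":
    for cycle in (1, 5, 10):
        print(cycle, error_fraction(6, cycle))
```

Under these assumptions, an error made in cycle 1 from six starting molecules sits in one of twelve copies, whereas the same error made in cycle 10 sits in one of 6,144.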

Thus, even with 99.7% raw per-base accuracy, the remaining 0.3% of errors, compounded by errors introduced during the intermediate amplification steps, can lead to false positives. However, with redundancy (i.e. 1000x coverage, or as high as 50,000x in the case of cell-free circulating tumor DNA), these errant bases can be filtered out, thereby increasing the ability to accurately call variant frequencies.
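
The interplay between raw error rate and coverage can be illustrated with a simple binomial model. This is a rough sketch, not a variant caller: it assumes read errors are independent, that the 0.3% error rate is split evenly across the three alternate bases (0.1% for any one specific base), and it ignores amplification errors shared between duplicate reads. The function names are illustrative.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k events in n trials (0 < p < 1).

    Computed in log space so that large depths do not overflow.
    """
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def prob_at_least(k: int, depth: int, p_specific_error: float) -> float:
    """Chance that at least k reads show the same errant base by chance."""
    return 1.0 - sum(binom_pmf(i, depth, p_specific_error) for i in range(k))


if __name__ == "__main__":
    p = 0.003 / 3  # assumed per-read rate for one specific wrong base
    for depth in (1000, 50000):
        print("depth", depth, "expected errant reads", depth * p)
        for k in (1, 5, 10):
            print("  P(at least", k, "errant reads) =", prob_at_least(k, depth, p))
```

Requiring more supporting reads before calling a variant lowers the chance that random read errors alone explain the call, which is the sense in which redundancy lets errant bases be filtered out.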

Detecting duplicates and measuring NGS library complexity

In the example above, the PCR start sites are in fixed positions. PCR duplicates, where a copy of a single original molecule becomes a second, seemingly independent read, can be an issue that goes undetected without the use of special techniques such as barcoding (i.e. labelling individual molecules at the first PCR step with a molecularly unique string of random bases). Typically, molecular barcoding is used only when the minor allele frequency is below a nominal 5% (that is, one variant read per 20 reads). Molecular barcoding is often used in sequencing cell-free circulating tumor DNA (ctDNA).
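
As a minimal sketch of how molecular barcodes expose PCR duplicates, the code below groups reads that share a barcode and start position into families and takes a majority-vote base per family. The input format (tuples of barcode, start position, and observed base at one site), the function name, and the `min_family_size` threshold are all assumptions made for illustration; production tools also handle barcode sequencing errors and base qualities.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads, min_family_size=2):
    """Collapse reads into one consensus base per original molecule.

    reads: iterable of (barcode, start, base) tuples for a single site.
    Families smaller than min_family_size, or without a strict majority
    base, are dropped because a PCR or read error cannot be ruled out.
    """
    families = defaultdict(list)
    for barcode, start, base in reads:
        families[(barcode, start)].append(base)

    consensus = {}
    for key, bases in families.items():
        if len(bases) < min_family_size:
            continue
        base, count = Counter(bases).most_common(1)[0]
        if count * 2 > len(bases):  # strict majority required
            consensus[key] = base
    return consensus


if __name__ == "__main__":
    reads = [
        ("AACG", 100, "T"), ("AACG", 100, "T"), ("AACG", 100, "C"),
        ("GGTA", 100, "C"), ("GGTA", 100, "C"),
        ("TTTT", 100, "T"),  # singleton family, dropped
    ]
    print(consensus_by_barcode(reads))
```

Six reads collapse to two original molecules here, and the lone "C" in the first family is outvoted rather than counted as an independent observation.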

The other main enrichment method, hybridization capture using biotinylated oligonucleotides, selects randomly-fragmented pieces of the original template DNA. This allows a gauge of NGS library complexity that can be evaluated by the number of unique start-sites after mapping.

Naturally, the measurement of NGS library complexity will be confounded by factors that affect coverage, including local G-C content, the quality and quantity of the input material, and the efficiency of the overall enrichment and library construction process.
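
The start-site gauge for hybridization capture data can be sketched as a simple count. The example assumes the mapped reads have already been reduced to (chromosome, start position, strand) tuples; that input format and the function name are illustrative, and the result is subject to the confounding factors noted above.

```python
def start_site_complexity(alignments):
    """Return (unique_start_sites, total_reads, unique_fraction).

    alignments: iterable of (chrom, start, strand) tuples from mapped reads.
    Reads sharing chromosome, start, and strand are treated as likely copies
    of the same original fragment.
    """
    total = 0
    unique = set()
    for chrom, start, strand in alignments:
        total += 1
        unique.add((chrom, start, strand))
    if total == 0:
        raise ValueError("no alignments supplied")
    return len(unique), total, len(unique) / total


if __name__ == "__main__":
    mapped = [
        ("chr7", 55191800, "+"),
        ("chr7", 55191800, "+"),  # same start site: likely a duplicate
        ("chr7", 55191812, "+"),
        ("chr7", 55191800, "-"),
    ]
    print(start_site_complexity(mapped))
```

A unique fraction close to 1 suggests most reads come from distinct original fragments, while a low fraction suggests the library is dominated by copies.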

The reality of conversion

The inherent limitations of assay efficiency may be overlooked due to the availability of commercial kits for enrichment and library construction processes. However, PCR is not 100% efficient. The polymerases do not synthesize with 100% accuracy, and a ligation step may only convert 20% of the molecules (or often less) into sequencing reads. For especially sensitive applications such as measuring ctDNA to one part in one thousand or 0.1% MAF, this conversion efficiency is a crucial but rarely reported metric.

A nominal 20 nanograms of input cell-free DNA from 2 mL of plasma will yield only approximately 6,000 genomic copies, wild-type and mutant combined. At 0.1% MAF, the number of mutant molecules present in this sample is only six. If the process of conversion is only 20% efficient, only about one mutant molecule would be converted into a sequencing read. How can you call a mutation when there is only one representative read? Determining whether it is a real variant or stochastic noise from the sequencing system is virtually impossible.
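
The arithmetic above can be reproduced in a few lines. The sketch assumes roughly 3.3 pg of DNA per haploid human genome (a commonly used approximation that is not stated in the text) and adds a Poisson estimate of the chance that at least one mutant molecule is converted at all; the function names are illustrative.

```python
import math

PG_PER_HAPLOID_GENOME = 3.3  # assumption: approximate mass of one haploid human genome


def haploid_copies(input_ng: float, pg_per_genome: float = PG_PER_HAPLOID_GENOME) -> float:
    """Approximate number of haploid genome copies in input_ng of DNA."""
    return input_ng * 1000.0 / pg_per_genome


def expected_mutant_reads(input_ng: float, maf: float, conversion: float) -> float:
    """Expected mutant molecules that survive conversion into reads."""
    return haploid_copies(input_ng) * maf * conversion


def prob_at_least_one(expected: float) -> float:
    """Poisson chance that one or more mutant molecules are converted."""
    return 1.0 - math.exp(-expected)


if __name__ == "__main__":
    copies = haploid_copies(20)                       # about 6,000 copies
    mutants = copies * 0.001                          # about six mutant molecules
    converted = expected_mutant_reads(20, 0.001, 0.2) # about one after 20% conversion
    print(round(copies), round(mutants, 1), round(converted, 2))
    print(prob_at_least_one(converted))
```

Because the expected count is close to one, sampling alone means a real mutation will sometimes contribute no reads at all, before sequencing error is even considered.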

How SeqOnce RhinoSeq can help

SeqOnce’s RhinoSeq uses a simple additive protocol without intermediate purification steps and requires only minimal input amounts. This minimizes sample loss and maximizes NGS library complexity for Whole Genome and Whole Exome NGS library construction, with additional methods in development. Contact us today for further information or to request a trial kit.

Summary:

  • New NGS applications are pushing the limits of detection and accuracy.
  • Amplification steps, even with a low error rate, can create high-quality reads that are incorrect. Amplification rounds are the primary cause of the bias that reduces NGS library complexity.
  • Multiple amplification steps lead to cumulative errors.
  • Overall conversion efficiency is critical for detecting low-copy molecules, but it is rarely reported.
