Ensuring confident detection of disease-linked variants with NGS

July 20, 2014


As the use of next generation sequencing (NGS) moves closer to the clinic, it is of paramount importance that clinically relevant mutations can be detected and called with complete certainty, without the occurrence of false positives or false negatives. Only reliable and reproducible DNA analysis methods will provide enough confidence to allow the transition of NGS from a useful research tool into a clinically dependable test.

Targeted sequencing, a more focused alternative to whole genome sequencing, is becoming an increasingly popular approach within disease research, where only certain regions of the genome are of interest for a given study. By reducing the target size and focusing on the regions most likely to yield relevant data, targeted sequencing allows a greater depth of coverage for the same sequencing effort, which raises the chance of identifying biologically relevant variants. Targeted approaches have already had a major impact on disease detection by permitting successful identification of causal mutations for a number of genetic disorders,^1-5 including cancer.^6,7

Unlike whole genome sequencing, targeted sequencing involves an initial target enrichment step, selecting those sequences of interest to create an enriched library for sequencing. If poorly designed, target enrichment can generate uneven coverage of those regions of interest, skewing the results in the downstream sequencing assay and negatively affecting variant calling.

The enrichment protocol for targeted sequencing is therefore a vital consideration in terms of data accuracy, and it must be carefully chosen and designed. Through an optimized target enrichment strategy, the pitfalls of bias and error can be diminished or avoided entirely.

Enrichment assay design

Target enrichment assays can be broadly separated into two categories: amplicon-based approaches and hybridization-based approaches. Hybridization protocols begin with random shearing of the DNA sample, after which long oligonucleotide baits are used to capture complementary DNA fragments. Alternatively, libraries can be prepared using amplicon-based enrichment assays, in which the target is amplified using flanking PCR primers.

Each strategy has its own associated advantages and limitations (Table 1) that can impact the accuracy in variant calling. In essence, amplicon assays tend to be quick and affordable for small panels, while the longer baits of hybridization assays can be designed to provide greater accuracy for larger panels, as will be explained. But exactly what do we mean by accuracy?

Sources of systematic bias

Sample origin – DNA quantity can vary, and the DNA may be fragmented or contain contaminants (e.g., those in FFPE material).

Amplification – artifacts and bias are introduced through PCR.

Strand bias – a disproportionate number of forward or reverse strands are represented.

Reference bias – the reference genome influences the sample read.

Mapping bias – the accuracy of mapping the reads along the reference genome varies.

Table 1. Advantages and limitations of hybridization and amplicon-based enrichment assays. In general, hybridization-based enrichment offers superior performance, while amplicon-based approaches offer superior speed and convenience.

The likelihood of a given sequenced base being correct is measured by the Phred (or Q) score. For example, a base call of Phred 20 has 99% accuracy, therefore presenting a one in 100 chance of error. Assuming no bias, the Phred score determines the read depth necessary for accurate base calling, since a higher depth of coverage essentially dilutes random error and improves precision.
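The relationship behind the Phred 20 example can be written out as a minimal sketch. It uses the standard definition Q = -10 log10(P), where P is the probability that the base call is wrong; the function names are illustrative only and do not come from any particular sequencing toolkit.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability and accuracy for a few common thresholds.
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:.4g}, accuracy {100 * (1 - p):.2f}%")
```

Each step of 10 in the score corresponds to a tenfold drop in the chance of error, which is why a modest rise in base quality can noticeably lower the depth needed to separate true variants from random error.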

In addition, sequencing is almost always subject to some level of systematic bias, which compromises the trueness of the data. Unlike precision, trueness cannot be improved by read depth alone (Figure 1).

Figure 1. NGS data accuracy. While precision is improved by increased read depth, trueness (the absence of systematic bias) is affected by multiple sources. Image courtesy of Dr. Chris Mattocks, Research Scientist at National Genetics Reference Laboratory (Wessex, UK).

Sources of systematic bias

Systematic bias is commonly introduced during the target enrichment stage, where it compromises the accuracy of variant calling. It can originate from multiple sources, such as PCR or the quality and quantity of the DNA sample. Both hybridization and amplicon assays can be optimized to perform well with FFPE material. One hybridization-based solid tumor panel, for example, had a <2% failure rate in a study of >200 clinical samples, so long as the average DNA fragment size was >1000 bp. While PCR methods can be susceptible to contaminants found in FFPE material, amplicon-based assays also tend to work well with FFPE, even with samples in which the DNA is heavily fragmented.

In addition to strand bias, reference bias, and mapping bias, amplification bias also presents a challenge for accurate variant calling (Table 2). While higher depth and uniformity of coverage protect against missed variants (i.e., false negatives), false positives are most commonly introduced by PCR artifacts during library preparation.

Table 2. Sources of systematic bias.

Uniformity of coverage

Within disease research, the ultimate goal of any sequencing experiment is to discover all variants present. Uniformity of coverage means that all regions are represented equally, and that all variants present in any region can be called (Figure 2). It also allows lower sequencing depths to be used, which can enable a greater number of samples to be multiplexed in a single run, translating into cost savings. Furthermore, uniformity is important when looking at heterogeneous samples—for example, a mixture of tumor and normal tissue, where it is essential to have enough reads to confidently call a variant present within the tumor DNA.
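To illustrate why local read depth matters in a heterogeneous sample, the sketch below estimates the chance of seeing enough variant-supporting reads at a single position. It is a simplified model, not a variant caller: it assumes each read independently carries the variant with probability equal to the variant allele fraction, ignores sequencing error, and uses a hypothetical minimum-read threshold chosen purely for illustration.

```python
from math import comb


def prob_at_least_k_variant_reads(
    depth: int, allele_fraction: float, min_reads: int
) -> float:
    """Binomial probability of observing at least `min_reads` variant reads.

    Assumes reads are independent draws and that `allele_fraction` is the
    true fraction of DNA molecules carrying the variant (an idealization).
    """
    # Sum the probabilities of seeing fewer than min_reads variant reads,
    # then take the complement. Counts above `depth` are impossible.
    upper = min(min_reads, depth + 1)
    p_fewer = sum(
        comb(depth, i)
        * allele_fraction**i
        * (1 - allele_fraction) ** (depth - i)
        for i in range(upper)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Hypothetical scenario: a variant present in 5% of molecules,
    # requiring at least 5 supporting reads to call it.
    for depth in (50, 100, 250, 500):
        p = prob_at_least_k_variant_reads(depth, 0.05, 5)
        print(f"depth {depth:>3}: P(>= 5 variant reads) = {p:.3f}")
```

Under these assumptions, a region that falls well below the average depth can leave a low-fraction tumor variant with too few supporting reads, even when the mean coverage of the panel looks adequate.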

Figure 2: The importance of uniformity for confident detection of variants. In this genomic region containing six variants, some regions are sequenced to a greater depth than others, causing snv2 to be missed. Image courtesy of Oxford Gene Technology.

Repetitive sequences can be problematic for amplicon-based enrichment, in terms of PCR primer placement. On the other hand, longer hybridization baits can be positioned to flank the repeats, enhancing uniformity in more challenging genomic regions. Novel variants can also potentially occur within the primer site, leading to allelic bias or complete drop-out in amplicon-based assays. Although it is possible to add more primers to increase amplification of specific regions in amplicon-based assays, there is less flexibility in the positioning and design of PCR primers. Hybridization-based enrichment offers an advantage in this situation, since longer baits have a higher resistance to mismatch when hybridizing to a novel variant. Thus with careful bait design, hybridization-based assays tend to afford greater potential for uniformity than the amplicon-based alternative.
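One way to put a number on uniformity is to ask what fraction of target bases reach some proportion of the mean depth. The sketch below does this for a list of per-base depths. The 0.2 × mean threshold is an assumption chosen for illustration, and the depth values are invented; neither comes from the assays discussed here.

```python
from statistics import mean
from typing import Sequence


def uniformity(depths: Sequence[int], fraction_of_mean: float = 0.2) -> float:
    """Fraction of target bases covered to at least fraction_of_mean x mean depth."""
    if not depths:
        raise ValueError("depths must contain at least one position")
    threshold = fraction_of_mean * mean(depths)
    covered = sum(1 for d in depths if d >= threshold)
    return covered / len(depths)


if __name__ == "__main__":
    # Two made-up regions with the same mean depth (160x) but different spread.
    even_region = [160, 150, 170, 165, 155]
    uneven_region = [390, 380, 10, 5, 15]
    print(f"even region:   {uniformity(even_region):.2f}")
    print(f"uneven region: {uniformity(uneven_region):.2f}")
```

Both made-up regions have the same mean depth, yet the uneven one leaves several positions below the threshold, which is the situation where a variant can be missed.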

Amplification artifacts and duplicates

All targeted sequencing experiments rely on PCR to some extent. PCR can introduce artifacts (amplification errors) and duplicates, as well as preferential amplification of specific sequences, all of which can have a significant impact on data quality.

Artifacts caused by amplification errors can be reduced by using proof-reading DNA polymerase enzymes, but even these may still introduce the occasional error. Duplicate fragments are also common, normally arising during the library preparation stage and making the library an untrue representation of the original sample, with some regions massively over-represented. Because hybridization-based protocols use random fragmentation for library preparation, duplicates arising from PCR amplification can be easily detected and removed prior to data analysis. In contrast, amplicon-based enrichment depends entirely on direct replication of each target fragment, so a duplicate cannot be distinguished from an independent copy; duplicates therefore remain in the data and can lead to inaccurate variant calls.
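The logic of duplicate removal after random fragmentation can be sketched as follows. Randomly sheared fragments rarely share both endpoints, so aligned fragments with identical coordinates and strand are treated as PCR copies of one original molecule. This is a deliberately simplified illustration with a hypothetical `Fragment` record; production tools such as Picard MarkDuplicates or samtools markdup handle many more cases.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Fragment(NamedTuple):
    """Hypothetical aligned fragment record used only for this sketch."""
    chrom: str
    start: int
    end: int
    strand: str
    mean_quality: float


def remove_duplicates(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Keep one fragment per (chrom, start, end, strand), preferring higher quality."""
    best: Dict[Tuple[str, int, int, str], Fragment] = {}
    for frag in fragments:
        key = (frag.chrom, frag.start, frag.end, frag.strand)
        # Fragments sharing a key are assumed to be copies of the same molecule.
        if key not in best or frag.mean_quality > best[key].mean_quality:
            best[key] = frag
    return list(best.values())


if __name__ == "__main__":
    sheared = [
        Fragment("chr1", 1000, 1250, "+", 32.0),
        Fragment("chr1", 1000, 1250, "+", 35.5),  # same ends: likely a PCR copy
        Fragment("chr1", 1040, 1310, "+", 33.1),  # different ends: independent
    ]
    print(len(remove_duplicates(sheared)), "fragments kept")
```

The same key is of no use for amplicon data, because every read from a given amplicon starts and ends at the primer-defined positions, so independent molecules and PCR copies look identical.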

All PCR-related bias increases with each cycle, and it is therefore possible to lessen its occurrence through reduced dependence on PCR. This source of bias is perhaps the primary limitation of amplicon-based enrichment assays, which are wholly PCR-dependent. In contrast, hybridization assays require far fewer PCR cycles than amplicon-based assays, vastly diminishing amplification-related error and bias, to generate data with less “noise.”
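A toy model shows how a small difference in amplification efficiency compounds with cycle number. It assumes each fragment is multiplied by (1 + efficiency) at every cycle and that the efficiency stays constant, which real reactions only approximate; the efficiency values are hypothetical.

```python
def representation_ratio(eff_a: float, eff_b: float, cycles: int) -> float:
    """Ratio of fragment A to fragment B after PCR, starting from equal amounts.

    Each cycle multiplies a fragment's copy number by (1 + efficiency),
    where efficiency lies between 0 (no amplification) and 1 (perfect doubling).
    """
    return ((1 + eff_a) / (1 + eff_b)) ** cycles


if __name__ == "__main__":
    # Hypothetical efficiencies: fragment A amplifies slightly better than B.
    for cycles in (5, 10, 20, 30):
        ratio = representation_ratio(0.95, 0.85, cycles)
        print(f"{cycles:>2} cycles: A is over-represented {ratio:.2f}-fold")
```

In this model the skew grows geometrically with the number of cycles, which is consistent with the point that protocols needing fewer cycles start from a smaller amplification bias.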

So in summary…

Accuracy is a primary goal of NGS, especially within the realm of clinical research. Increasing read depth may improve data precision, but systematic bias in targeted sequencing can only be reduced through optimized experimental design, and this includes the choice of target enrichment assay.

Each enrichment assay is suited to a particular application. Amplicon-based enrichment is a quick, accessible and affordable approach, ideal for small targets, but its data quality suffers from a heavy dependence upon PCR and the associated duplicates. In contrast, hybridization assays involve fewer cycles of PCR and, through optimized bait design, offer much greater scope for uniform coverage and enhanced performance, leading to highly confident variant calling.

Simon Hughes, PhD, serves as Team Leader, Cancer R&D, for Oxford Gene Technology (OGT). OGT provides innovative products and services for genomic analysis, including targeted next generation sequencing services.

References

1. Classen CF, et al. Dissecting the genotype in syndromic intellectual disability using whole exome sequencing in addition to genome-wide copy number analysis. Hum Genet. April 4, 2013. [Epub ahead of print]
2. Semler O, Garbes L, Keupp K, et al. A mutation in the 5'-UTR of IFITM5 creates an in-frame start codon and causes autosomal-dominant osteogenesis imperfecta type V with hyperplastic callus. Am J Hum Genet. 2012;91(2):349-357.
3. Choi M, Scholl UI, Ji W, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A. 2009;106(45):19096-19101.
4. Ng SB, Buckingham KJ, Lee C, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010;42(1):30-35.
5. Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461(7261):272-276.
6. Wei X, Walia V, Lin JC, et al. Exome sequencing identifies GRIN2A as frequently mutated in melanoma. Nat Genet. 2011;43(5):442-446.
7. Yan XJ, Xu J, Gu ZH, et al. Exome sequencing identifies somatic mutations of DNA methyltransferase gene DNMT3A in acute monocytic leukemia. Nat Genet. 2011;43(4):309-315.