WES vs WGS: why the exome isn’t the whole story (and sometimes when it’s better)

In this month’s installment we’re going to revisit, in a bit more depth, a topic that’s been touched on in this space before: the differences between a whole genome sequence (WGS) and a whole exome sequence (WES). On the surface the differences are simple and explicit in the names. WGS provides the sequence of the genomic (nuclear) DNA from a sample, including all sorts of noncoding regions such as centromeres, telomeres, long repetitive stretches of “junk” DNA, and various untranscribed control regions which influence the activity of the actual genes. For a human, a whole genome is approximately 3.3 billion base pairs haploid, so 6.6 billion base pairs to capture the whole diploid complement per cell. The exome, by contrast, is just the collection of exons: the sequences represented in expressed RNAs, including both coding mRNAs and noncoding functional RNAs (everything from the rRNAs that form functional ribosomal components, to the tRNAs essential for protein expression, to things like miRNAs important for gene silencing and post-transcriptional regulation). The human exome is roughly 30 million base pairs in total, or only about one percent of the genome.

Sequencing either a genome or an exome requires collecting a significant “overage” of data, known as “sequencing depth.” This is done for two reasons. One is to improve accuracy: a single read may misrepresent a particular base pair, so a consensus of multiple reads over the same spot is more accurate. The other is that building up long contiguous sequences from short reads requires ‘tiling’, or overlap between reads. Since the predominant next generation sequencing (NGS) technologies produce individual read lengths much shorter than many RNA transcripts, tiling is as much a requirement for WES as it is for WGS. Overall then, while there are a lot of nuances we won’t go into, and while either approach requires a lot of data to be generated and processed by bioinformatic pipelines, a WES is to a first approximation 30-fold less data than a WGS. (You’re excused for expecting that to be 100-fold, but WGS tends to be run at ~30x depth and WES at ~100x to allow for capture of rare variants; more on that below.) WES therefore has one immediate advantage over WGS: it’s faster and cheaper to obtain and analyze.
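For readers who like to see the arithmetic, the “roughly 30-fold” figure can be checked with a back-of-envelope sketch. The sizes and depths are the round numbers quoted above; the function and variable names are purely illustrative, and the calculation deliberately ignores real-world nuances such as off-target reads in exome capture.

```python
# Back-of-envelope comparison of raw sequencing yield for WGS vs WES.
# All figures are the approximate values quoted in the text, not
# measurements from any particular platform or kit.

HAPLOID_GENOME_BP = 3_300_000_000  # ~3.3 billion bp (depth is quoted against the haploid reference)
EXOME_BP = 30_000_000              # ~30 million bp, about 1% of the genome
WGS_DEPTH = 30                     # typical ~30x depth for WGS
WES_DEPTH = 100                    # typical ~100x depth for WES


def total_bases(target_bp: int, depth: int) -> int:
    """Approximate number of sequenced bases: target size times mean depth."""
    return target_bp * depth


wgs_bases = total_bases(HAPLOID_GENOME_BP, WGS_DEPTH)
wes_bases = total_bases(EXOME_BP, WES_DEPTH)
fold_difference = wgs_bases / wes_bases

print(f"WGS: ~{wgs_bases / 1e9:.0f} billion bases")
print(f"WES: ~{wes_bases / 1e9:.0f} billion bases")
print(f"WGS generates ~{fold_difference:.0f}-fold more data than WES")
```

With these inputs the ratio comes out at 33, which is where the “to a first approximation 30-fold” figure comes from: the 100-fold difference in target size is partly offset by the roughly threefold greater depth used for WES.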

We generally think of doing some form of NGS in a clinical context as a means to try to uncover the root cause of a particular physical manifestation—a phenotype. We’ll ignore the inconvenient reality that some phenotypic behavior arises from complex polygenic traits and assume for simplicity that in this hypothetical example it’s a simple monogenic Mendelian cause. Cost and time factors aside, what are the pros and cons of using either a WGS or WES approach to tackling this?

Surprise #1: for complete exon coverage, WGS beats WES

Within protein coding sequences, mutations can in some cases be known to be pathogenic from other examples, or they may be novel but of readily apparent impact, such as stop codons, significant insertions/deletions, or frame shifts. Even less readily interpretable amino acid substitutions may in some cases be scrutinized against known or computer predicted protein structures with a reasonable chance of spotting significantly disruptive changes (putting a proline in the middle of that critical α-helix probably isn’t a good thing!). While you might think that mutations in coding regions should be equally observable in both WES and WGS approaches, it’s been observed that that’s not quite true; in particular, GC-rich gene sequences appear to be more accurately captured by WGS than by WES. WGS also scores better for completeness among preselected panels of disease relevant genes, where WES is reported to miss between 0.42 percent and a whopping 24.44 percent of the exonic data captured by a PCR-free WGS strategy (for a more in-depth look at these numbers, see e.g. [1]). If complete coverage even just of exons is your goal, then WGS edges out WES.
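To make “GC-rich” a little more concrete, here is a minimal sketch of how one might compute the GC fraction of a target sequence and flag those above a cutoff. The 0.65 threshold, the function names, and the two toy sequences are all assumptions made for illustration; none of them come from [1], and real coverage analyses work from aligned read data rather than sequence composition alone.

```python
# Minimal sketch: compute GC fraction per target and flag GC-rich ones.
# The threshold below is an illustrative assumption, not a published cutoff.

GC_RICH_THRESHOLD = 0.65  # hypothetical cutoff for calling a target "GC-rich"


def gc_fraction(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0  # no called bases (e.g. all N): report 0 rather than divide by zero
    gc_count = sum(1 for base in called if base in "GC")
    return gc_count / len(called)


def flag_gc_rich(targets: dict, threshold: float = GC_RICH_THRESHOLD) -> list:
    """Return the names of targets whose GC fraction meets the threshold."""
    return [name for name, seq in targets.items() if gc_fraction(seq) >= threshold]


# Toy sequences invented for this example; not real exons.
example_targets = {
    "exon_A": "ATGCATTTAGCTAAGT",
    "exon_B": "GCGCGGCCGCGGGCCA",
}

for name, seq in example_targets.items():
    print(f"{name}: GC fraction = {gc_fraction(seq):.2f}")
print("Flagged as GC-rich:", flag_gc_rich(example_targets))
```

A list like this would only tell you where to look harder; whether a given exome data set actually underperforms in those regions has to be checked against its own coverage statistics.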

Meaningful mutations can also occur outside of exons, in regulatory elements such as transcriptional promoters, enhancers, and silencers, thereby altering expression level and/or location. Similarly, mutations within introns can influence splice site selection and lead to inappropriate expression of particular splice variant isoforms of a gene which is otherwise expressed at an overall appropriate level. Since these by their very nature occur in non-transcribed sections of the genome (or at least in sections not retained in mature transcripts), an immediate expectation might be that these will be captured in WGS and not in WES. Strictly speaking that’s true; a WGS data set will include all of these sorts of regions, but a challenge comes when we try to interpret them. As with exons, in some cases there are very specific variations such as SNPs (single nucleotide polymorphisms) in non-exonic regions which have a known phenotypic impact (or lack thereof). As databases get filled with more and more example human genomes with clinical correlates, the library of known variations becomes bigger. At present, however, compared to the size of the human genome and the frequency with which variations from reference genomes are seen, this known library is small, and in the majority of cases the variations noted are of unknown impact. These even have their own name, “VUS” (variants of uncertain significance), and create a number of headaches in clinical practice, not just in interpretation but also with regard to ethical issues about even disclosing them. Particularly if disclosed to non-specialists, they’re prone to cause misunderstanding (for a more in-depth discussion, see e.g. [2]). By some estimates, each of us is walking around with roughly half a million VUS in our respective genomes. So, while the WGS data captures all of this, we’re left in many cases unsure of how to interpret what we have.

Surprise #2: look to the exons if you want to know what happened outside them

Paradoxically, the best approach to find evidence of meaningful non-exonic variation is probably through WES. That’s right, we should look at the exons to find out what happened elsewhere. The key here is to remember that a WES is generated from cDNA and includes not just individual sequences but also the relative frequencies with which gene products, and even particular splice variants of a single gene, are observed. If (and that’s a critical caveat) the cDNA library used for WES comes from the cell population of interest, this provides a snapshot not of the actual non-exonic sequences but of their significant effects. For example, in the case of mutations impacting net gene expression level, the affected gene will be represented at a lower or higher level than expected when referenced to housekeeping genes in the sample. Where the mutation impacts something more nuanced, such as splice site bias in a particular gene, the relative levels of that gene’s isoforms in the sample will deviate from the equivalent isoform ratios in control samples. While this doesn’t give us any information on what the actual root cause mutation(s) is (are), it sidesteps the truly insignificant variations which we’d otherwise classify as VUS and be left none the wiser.
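The two comparisons just described (expression relative to housekeeping genes, and isoform ratios relative to controls) can be sketched in a few lines. Everything here is hypothetical: the gene and isoform names, the read counts, and the helper functions are invented for illustration, and a real analysis would rely on replicates and dedicated statistical tools rather than a single pair of samples.

```python
import math

# Minimal sketch of the two comparisons described above, using made-up
# read counts. All names and numbers are hypothetical placeholders.


def normalized_expression(counts: dict, gene: str, housekeeping: list) -> float:
    """Gene count divided by the mean count of the housekeeping genes."""
    reference = sum(counts[g] for g in housekeeping) / len(housekeeping)
    return counts[gene] / reference


def log2_fold_change(sample: dict, control: dict, gene: str, housekeeping: list) -> float:
    """log2 ratio of housekeeping-normalized expression, sample vs control."""
    sample_level = normalized_expression(sample, gene, housekeeping)
    control_level = normalized_expression(control, gene, housekeeping)
    return math.log2(sample_level / control_level)


def isoform_fraction(isoform_counts: dict, isoform: str) -> float:
    """Share of a gene's reads assigned to one isoform."""
    return isoform_counts[isoform] / sum(isoform_counts.values())


housekeeping_genes = ["HK1", "HK2"]  # placeholder housekeeping genes
control_counts = {"GENE_X": 400, "HK1": 1000, "HK2": 1000}
patient_counts = {"GENE_X": 100, "HK1": 1000, "HK2": 1000}

lfc = log2_fold_change(patient_counts, control_counts, "GENE_X", housekeeping_genes)
print(f"GENE_X log2 fold change vs control: {lfc:.2f}")

control_isoforms = {"iso_long": 80, "iso_short": 20}
patient_isoforms = {"iso_long": 30, "iso_short": 70}
shift = isoform_fraction(patient_isoforms, "iso_long") - isoform_fraction(control_isoforms, "iso_long")
print(f"Change in iso_long share vs control: {shift:+.2f}")
```

In this toy data the patient sample shows GENE_X at a quarter of the control level and a marked shift away from the long isoform. Either result would point to something upstream worth chasing, without saying which non-exonic variant is responsible.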

So, what’s better, WGS or WES?

The answer to that depends on what it is you’re looking for, and the resources available in terms of time, cost, and bioinformatics tools. WES rose to popularity early on and it remains a cost-effective, focused strategy for looking at what is likely to be the most informationally dense set of genomic data from a sample. Bear in mind the comment above, though, that cDNA populations and their derived WES data sets are tissue specific to some degree. In addition, they have demonstrated biases against representing some sequence types and can lack the completeness of a WGS. In comparison, PCR-free WGS requires more cost and effort but is more complete in its coverage and is generalizable across the whole organism (we’ll pretend this space wasn’t just recently devoted to somatic microchimerism as the exception to this). If at some point in the future we have vastly more data such that VUS are a thing of the past, then WGS will probably be the ‘better’ choice. Before that occurs, however, and as the costs of NGS technology continue to drop and ease of use increases, we may reach a situation where the most complete and interpretable genomic picture is obtained by capturing both a WGS and a paired tissue-relevant WES. Each provides a slightly different insight into the genome, and in reality the two forms of data are complementary.

REFERENCES

  1. Meienberg J, Bruggmann R, Oexle K, Matyas G. Clinical sequencing: is WGS the better WES? Hum Genet. 2016;135(3):359-62.
  2. Hoffman-Andrews L. The known unknown: the challenges of genetic variants of uncertain significance in clinical practice. J Law Biosci. 2018;4(3):648-65.