This month’s foray by The Primer into molecular diagnostics techniques covers a method known as short tandem repeat (STR) typing. STR typing is most commonly applied in molecular forensics, but it is increasingly used in the molecular pathology lab as well, as a method to use (or fall back on, as needed) for tracking and/or confirming the “identity” of tissue samples. While the former application gets more screen time in TV and movies, it’s in the latter context that more readers will likely encounter the method. However, as we’ll see, the actual method is identical in both uses.
Understanding short tandem repeats
We start by considering what an STR actually is. Across the human genome, there are large numbers of places (“loci”) where non-coding DNA sequences consist of a short (usually 3 or 4) nucleotide element, such as “CGG” or “ACAG,” which occurs as a set of concatemeric repeats. For our examples, this would be “CGGCGGCGG….CGG” and “ACAGACAGACAG….ACAG,” respectively. When DNA replication proceeds through these types of elements, a particular type of error can occur: if the nascent (growing) strand becomes detached from the template momentarily, it can re-anneal with a “slippage” of some whole number of repeats and still re-establish almost all of the base pairing. That is, after denaturation and reannealing of the nascent strand to the template, the polymerase may be “ahead of” or “behind” where it was when it left. As replication starts again, the nascent strand has then either “skipped over” some of the template repeats or read them as template twice. The first results in a daughter DNA strand that has lost a whole number of STR repeats, while the second yields a daughter strand with an increased number of STR repeats.
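The effect of a slippage event on tract length can be sketched in a few lines of Python. This is a toy illustration only, not a model of polymerase behavior: the function name `slip` and the 10-repeat starting tract are assumptions made for the example.

```python
def slip(repeat_count: int, unit: str, slipped_units: int) -> str:
    """Return the daughter-strand repeat tract after a slippage event.

    slipped_units > 0: template repeats were read twice (expansion).
    slipped_units < 0: template repeats were skipped over (contraction).
    """
    new_count = repeat_count + slipped_units
    if new_count < 0:
        raise ValueError("cannot lose more repeats than the tract contains")
    return unit * new_count


# Hypothetical starting tract: ten copies of the "ACAG" element.
template = "ACAG" * 10
contracted = slip(10, "ACAG", -1)  # one repeat skipped
expanded = slip(10, "ACAG", +2)    # two repeats read twice

print(len(template), len(contracted), len(expanded))  # 40 36 48
```

Note that the tract length only ever changes in whole multiples of the repeat unit, which is why alleles at an STR locus differ in discrete steps.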
How often does this polymerase “slippage” occur in an STR region? The frequency can vary with a number of factors, but the general answer is: rarely, but frequently enough that a population will display a range of STR repeat numbers at a given locus.
While STR loci occur scattered throughout the human genome, a small number are particularly well understood. These are ones which occur in a region flanked by highly conserved unique sequences, where the unique sequences are neither too close together nor too far apart for effective PCR amplification with primers against the unique regions (roughly, 50 to 500 base pairs). When an STR element like this exists, it is possible to design PCR primers against the flanking conserved regions and know that (thanks to conservation) the primer set will amplify a product from any intact human DNA sample. In fact, when the STR locus in question is on an autosome, as most are, a human DNA sample will carry two copies of the locus (one from maternally derived DNA, one from paternally derived), so two PCR products will be amplified. The important detail here is that while we know a PCR product (or two products) will form, we don’t know a priori how long the products will be. That length will depend on the number of STR repeats between the PCR primer sites. The most useful STR loci are ones where population studies have shown a wide diversity in the number of repeats encountered (say, 5 to 30). This span of 25 repeats for a trinucleotide repeat element would lead to a 75-base pair range (i.e., 25 x 3) of possible amplicon sizes, in steps of three bases per repeat.
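The size arithmetic above can be written out as a short Python sketch. The flank length of 120 bp (primers plus conserved sequence on both sides of the repeat tract) is an invented value for illustration; real loci each have their own.

```python
def amplicon_length(flank_bp: int, unit_bp: int, repeats: int) -> int:
    """Amplicon size = fixed flanking sequence + (repeat unit size x repeat count)."""
    return flank_bp + unit_bp * repeats


FLANK_BP = 120  # hypothetical: total non-repeat sequence between the primer ends
UNIT_BP = 3     # trinucleotide repeat, as in the example above

# Repeat numbers 5 through 30, as in the text.
sizes = [amplicon_length(FLANK_BP, UNIT_BP, n) for n in range(5, 31)]

print(sizes[0], sizes[-1], sizes[-1] - sizes[0])  # 135 210 75
```

The 75 bp spread between the smallest and largest amplicon is independent of the flank length, since the flanks contribute the same fixed amount to every allele.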
Known by locus names such as “D1S80,” or sometimes named for nearby genes, such as “TPOX,” a particular handful of STR loci meet these utility criteria and are well studied, with published primer sets for their amplification and known population statistics for individual repeat numbers at each locus. That is, for a given locus in a given ethnic population, some repeat numbers are commonly found and others are less commonly found.
Finding a DNA fingerprint
With this in mind, imagine we take a primer set for an autosomal STR locus and perform a simple endpoint PCR on a human sample. At the end of the PCR, we analyze the PCR products for size. Since we’re looking for 3-nt or 4-nt steps, standard gel electrophoresis won’t give us good enough resolution to distinguish products differing by one or two repeats, so we employ capillary electrophoresis sequencers to read the product sizes exactly to the nucleotide. Our sample may show a single product size (indicating that both copies of the locus had the same repeat number), or it may show two products differing in repeat number (Figure 1). For the loci examined, we now know the repeat numbers associated with that DNA sample.
Figure 1. A sketch of hypothetical STR results for four STR markers “A,” “B,” “C,” and “D.”
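Converting measured product sizes back into repeat numbers is the reverse of the amplicon-size arithmetic. The sketch below assumes a hypothetical locus with 120 bp of flanking sequence and a 3-nt repeat unit, and it assumes peak sizes have already been rounded to whole nucleotides; the function name `call_genotype` is illustrative and does not come from any instrument software.

```python
def call_genotype(peak_sizes_bp, flank_bp, unit_bp):
    """Turn one or two product sizes at a single locus into a pair of repeat numbers."""
    alleles = set()
    for size in peak_sizes_bp:
        repeat_bp = size - flank_bp
        if repeat_bp <= 0 or repeat_bp % unit_bp != 0:
            raise ValueError(f"{size} bp is not a whole number of repeats")
        alleles.add(repeat_bp // unit_bp)
    ordered = sorted(alleles)
    if len(ordered) == 1:
        # One peak: both copies of the locus carry the same repeat number.
        return (ordered[0], ordered[0])
    if len(ordered) == 2:
        return (ordered[0], ordered[1])
    raise ValueError("more than two alleles: not a single-source sample?")


print(call_genotype([135, 141], flank_bp=120, unit_bp=3))  # (5, 7)
print(call_genotype([150], flank_bp=120, unit_bp=3))       # (10, 10)
```

A single peak is reported as a homozygous pair, and more than two distinct sizes raises an error, since a single-source sample cannot carry more than two alleles at an autosomal locus.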
Now imagine we take a second DNA sample and repeat the experiment. If we get a different result than on the first sample, we have the inescapable conclusion that the two DNA samples come from different individuals. If instead we get the same result as the first sample, we cannot, however, say that the samples are from the same individual. That’s because it’s possible (to some statistical level, depending on the population frequency of the result obtained for this locus) that another person has the same repeat numbers at these loci.
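The asymmetry between the two outcomes (a mismatch excludes, a match merely fails to exclude) can be made explicit in code. This is a minimal sketch using invented profiles; the locus names and repeat numbers are hypothetical.

```python
def mismatched_loci(profile_a, profile_b):
    """Return the loci (typed in both samples) where the genotypes differ."""
    shared = profile_a.keys() & profile_b.keys()
    return sorted(
        locus
        for locus in shared
        if sorted(profile_a[locus]) != sorted(profile_b[locus])
    )


def interpret(profile_a, profile_b):
    """A mismatch excludes a common source; a match only fails to exclude one."""
    if mismatched_loci(profile_a, profile_b):
        return "excluded: different individuals"
    return "not excluded: same individual OR a coincidental match"


# Hypothetical profiles: locus name -> (repeat number, repeat number)
sample_1 = {"A": (7, 9), "B": (12, 12)}
sample_2 = {"A": (7, 9), "B": (12, 14)}
sample_3 = {"A": (9, 7), "B": (12, 12)}

print(interpret(sample_1, sample_2))
print(interpret(sample_1, sample_3))
```

Genotypes are sorted before comparison so that the order in which the two alleles are listed does not matter.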
The technique gains power when we test multiple STR loci in each specimen. The likelihood of two samples exactly matching decreases rapidly as we examine more loci. One commonly used commercial STR test examines 16 loci (15 autosomal, and one special-case sex chromosomal locus discussed further below) with claimed specificity rates (that is, the likelihood of two individuals matching at all loci) of 1 in 1.8 × 10^17 individuals or more. That is many times the total human population of the planet! With probabilities on that scale, it is little wonder that this “DNA fingerprint” evidence is increasingly applied in forensic and criminal uses, both in popular entertainment and in real-life forensics.
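Why the match probability falls so quickly can be seen with a small calculation. The sketch below is a simplified illustration, not the statistical method of any particular kit: it assumes Hardy-Weinberg proportions at each locus (genotype frequency p² for a homozygote, 2pq for a heterozygote) and independence between loci, so per-locus frequencies multiply. All allele frequencies shown are invented.

```python
from math import prod


def genotype_frequency(genotype, allele_freqs):
    """Expected genotype frequency under Hardy-Weinberg proportions (an assumption)."""
    a, b = genotype
    if a == b:
        return allele_freqs[a] ** 2
    return 2 * allele_freqs[a] * allele_freqs[b]


def random_match_probability(profile, freqs_by_locus):
    """Multiply per-locus genotype frequencies, assuming the loci are independent."""
    return prod(
        genotype_frequency(genotype, freqs_by_locus[locus])
        for locus, genotype in profile.items()
    )


# Hypothetical allele frequencies: locus -> {repeat number: population frequency}
freqs = {
    "A": {7: 0.1, 9: 0.2},
    "B": {12: 0.3},
}
profile = {"A": (7, 9), "B": (12, 12)}

rmp = random_match_probability(profile, freqs)
print(f"{rmp:.4f}")  # 0.0036
```

With only two loci the chance of a coincidental match is already below 1 in 250 for these made-up frequencies; each additional locus multiplies in another factor smaller than one.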
Meaningful relationships
In addition to determining (or disproving) identity between two samples, the technique can also be used to determine relatedness. At each autosomal locus, one of an individual’s two repeat numbers should match a value from the maternal source, and the other a value from the paternal source. Full siblings are expected, on average, to match each other at both alleles on about one-fourth of loci, and so on, through simple Mendelian genetic relationships. Detailed statistical analysis (including the population frequency of the particular STR types observed) can be employed to refine data of this sort and prove or disprove levels of relatedness between samples, to a known statistical probability. Note that since de novo STR slippage events and/or experimental errors can occur, a single mismatch to expected values does not definitively disprove relatedness; a match across the remaining loci may still be enough to give a high certainty of relatedness.
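A first-pass parent and child check follows directly from the Mendelian rule: at every locus the child should share at least one repeat number with a true parent. The sketch below uses invented profiles, and the tolerance of one inconsistent locus is an illustrative assumption reflecting the caveat about slippage events and experimental error, not a validated threshold.

```python
def inconsistent_loci(child, alleged_parent):
    """Loci where the child shares no allele with the alleged parent."""
    shared = child.keys() & alleged_parent.keys()
    return sorted(
        locus
        for locus in shared
        if not set(child[locus]) & set(alleged_parent[locus])
    )


def parentage_screen(child, alleged_parent, tolerated_mismatches=1):
    """Crude screen only; real casework uses likelihood-based statistics."""
    n = len(inconsistent_loci(child, alleged_parent))
    if n == 0:
        return "consistent with parentage"
    if n <= tolerated_mismatches:
        return "possible mutation or error: needs statistical review"
    return "inconsistent with parentage"


# Hypothetical profiles: locus name -> (repeat number, repeat number)
child = {"A": (7, 9), "B": (12, 14), "C": (10, 10)}
adult_1 = {"A": (9, 11), "B": (14, 15), "C": (10, 13)}
adult_2 = {"A": (9, 11), "B": (14, 15), "C": (8, 9)}

print(parentage_screen(child, adult_1))
print(parentage_screen(child, adult_2))
```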
Now that we understand how this methodology is applied in its popular applications, what about the more mundane? The power and simplicity of the STR “fingerprinting” method, combined with its ready availability in pre-made, optimized kit formats easily run on equipment already generally present in the molecular pathology laboratory, make it a useful tool for tracking and/or confirming relatedness in samples (or pieces of samples). Consider a case such as a potential tumor biopsy, where multiple small tissue pieces may be embedded in a single FFPE block. Routine immunohistochemistry analysis is done, and the result shows that most of the tissue pieces are non-cancerous, while a single small piece is cancerous. In cases such as this, a concern crossing the pathologist’s mind may be whether all of the tissue pieces are in fact from the same sample, or whether a “floater” from another case has somehow gotten into the block. While carefully guarded against, such events are not impossible and can have serious consequences for the patient. The STR typing method can be a very useful tool here: a microscopic section of the tissue piece in question can serve as the template for one DNA fingerprint, with a reference sample from the patient providing another. A match supports, and a mismatch refutes, the “relatedness” of the tissue piece and the patient, helping to assure a correctly assigned diagnosis.
A particular locus on the sex chromosomes was alluded to above. Occurring within the amelogenin gene, this isn’t strictly a classical STR where the number of copies of a repeat element can vary; rather, it turns out that the version of the amelogenin gene carried on the X chromosome (AMELX; twice in females and once in males) is not precisely length-identical with the same region of the amelogenin gene carried on the Y chromosome (AMELY; once in males). An intron in AMELY contains a 6-base insertion relative to the same intron in AMELX. (Note that since the insertion is in an intron, it does not alter the coding sequence.) When amplified by primers flanking this region, AMELY-derived products are thus 6 bp longer than AMELX-derived products. The most commonly used primer set for this locus yields a single 106 bp product (two copies of the locus, same size) for DNA samples from females, and a 106/112 bp doublet (one product from each locus) for samples from males. This single-locus test thus effectively rules in (or out) about half the population as possible sources for any given DNA sample, and it is routinely included in STR typing panels. Given its size, this locus is also just barely resolvable on an agarose gel with good technique, making it a good classroom demonstration of the overall approach without the need for access to a capillary sequencer instrument.
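The amelogenin readout reduces to a simple lookup on the product sizes quoted above (106 bp for AMELX, 112 bp for AMELY). The following sketch is illustrative; the function name and the wording of the returned strings are assumptions, and any other peak pattern is deliberately reported as inconclusive rather than guessed at.

```python
AMELX_BP = 106  # X-derived product size quoted in the text
AMELY_BP = 112  # Y-derived product size (6 bp longer)


def amelogenin_call(peak_sizes_bp):
    """Interpret amelogenin product sizes from a single-source sample."""
    peaks = set(peak_sizes_bp)
    if peaks == {AMELX_BP}:
        return "X only: consistent with a female source"
    if peaks == {AMELX_BP, AMELY_BP}:
        return "X and Y: consistent with a male source"
    return "inconclusive"


print(amelogenin_call([106]))
print(amelogenin_call([106, 112]))
print(amelogenin_call([112]))
```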
What about mixed samples? The method described above has assumed each sample tested is from a single source, and it works best in this “perfect” situation. However, in real-life cases with proportionately small amounts of other DNA present, the method still works. The primary template will yield the majority of product, with small amounts of product arising from the contaminating template. Capillary sequencers show the actual peak area for each product detected, so the primary sample will show a major set of peaks with small side peaks attributable to the contaminant.
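Separating major peaks from contaminant side peaks can be sketched as a threshold on relative peak area. The 20% cutoff and the peak areas below are invented for illustration; in practice, thresholds are set by each laboratory’s validation, and real data also contain artifacts that this sketch ignores.

```python
def split_major_minor(peak_areas, minor_fraction=0.2):
    """Split peaks at one locus into major and minor sets by relative area.

    peak_areas: product size (bp) -> peak area.
    minor_fraction: hypothetical cutoff, as a fraction of the largest peak.
    """
    cutoff = minor_fraction * max(peak_areas.values())
    major = sorted(size for size, area in peak_areas.items() if area >= cutoff)
    minor = sorted(size for size, area in peak_areas.items() if area < cutoff)
    return major, minor


# Hypothetical locus: two strong peaks from the primary sample, one weak side peak.
areas = {150: 1000, 158: 950, 162: 60}

major, minor = split_major_minor(areas)
print(major, minor)  # [150, 158] [162]
```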
Returning to Figure 1, a sketch of hypothetical STR results for four STR markers “A,” “B,” “C,” and “D”: Samples 1, 2, and 3 represent three different specimens tested for these markers, shown as raw capillary electrophoresis results. Each sample contains two fixed size standards, “R1” and “R2,” which are used to ensure alignment of the results between samples. Each electropherogram shows signal strength vs. product size, and is divided into regions “A,” “B,” “C,” and “D.” Each region represents the expected range of possible amplicon sizes from the STR marker of the same name. Considering Sample 1, it can be seen that the source is heterozygous at markers A, C, and D (two distinct products formed at specific sizes) but homozygous at marker B. (There are two products, but of identical size, yielding a single peak.) Sample 2 would have very low genetic relatedness to Sample 1; note that in this sample, STR markers B, C, and D are heterozygous while A is homozygous, and that in general few of the allele sizes line up between Samples 1 and 2. By contrast, Sample 3 is exactly the same as Sample 1, suggesting identity (or at least, a high degree of relatedness). A real STR result would look similar to this but with more markers, allowing for greater statistical power in detecting unrelatedness.