Each person has about 4 million sequence differences in their genome relative to the reference human genome. These differences are known as variants. A central goal in precision medicine is understanding which of these variants contribute to disease in a particular patient. Therefore, much of the human genome annotation effort is devoted to developing resources to help interpret the relative contribution of human variants to different observable phenotypes – i.e., determining variant impact.
Recently, Yale School of Medicine led a large NIH-sponsored study in which multiple institutions and international collaborators came together to address this challenge. The study generated a large, organized dataset from four individual donors, using high-quality genome sequencing to identify all of their variants, together with many different assays to determine the variants' effects on molecular phenotypes in 25 different tissues. Known as EN-TEx, the resource is an important step toward the future of personalized care. The team published its findings in Cell on March 30.
In their latest project, the team used long-read sequencing technologies to determine the diploid genomes of the four donors with high accuracy. Everyone has a diploid genome. This means that we have two copies of each of the 22 non-sex chromosomes, plus a pair of sex chromosomes, with one copy of each inherited from our mother and one from our father.
The team developed a variety of statistical and deep learning approaches to put the dataset to practical use. In particular, they built statistical models that identify subsets of regulatory regions in the human genome that are highly associated with disease variants. They also found many new linkages between variants and changes in the expression of nearby genes, connecting impactful but uncharacterized variants to genes with known function. These linkages considerably expand previously determined catalogs of such associations, especially in many hard-to-assay tissues.
More fundamentally, the team developed a deep learning model that could predict whether a variant would disrupt a binding site for a regulatory factor, a protein that binds to specific sequences in the genome to turn nearby genes on or off. Interestingly, they found that accurate prediction required looking beyond the binding site itself and considering a large genomic region around it. The key to whether a binding site would be impacted was the presence of nearby binding sequences for other regulatory factors.
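To make the role of surrounding sequence concrete, here is a minimal Python sketch. It is not the team's deep learning model: it simply counts exact motif matches, and the motifs, the toy sequence, and every function name are hypothetical choices made for illustration. It shows how widening the window around a variant can reveal neighboring binding sequences that a narrow, site-only view would miss.

```python
from typing import Dict

# Hypothetical motifs for illustration only; these are assumptions,
# not binding motifs reported by the study.
TARGET_MOTIF = "GATAAG"
COFACTOR_MOTIFS = ["CACGTG", "TGACTCA"]

# Toy sequence (an assumption): one cofactor motif, a spacer,
# the target motif, and some trailing sequence.
SEQ = "CACGTG" + "AAAA" + "GATAAG" + "AAAA" + "TTTTTT"


def apply_variant(seq: str, pos: int, ref: str, alt: str) -> str:
    """Return seq with a single-base substitution at 0-based position pos."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


def count_motif(seq: str, motif: str) -> int:
    """Count exact (possibly overlapping) occurrences of motif in seq."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def context_features(seq: str, pos: int, flank: int) -> Dict[str, int]:
    """Count target and cofactor motifs within flank bases of pos."""
    window = seq[max(0, pos - flank): pos + flank + 1]
    return {
        "target_sites": count_motif(window, TARGET_MOTIF),
        "cofactor_sites": sum(count_motif(window, m) for m in COFACTOR_MOTIFS),
    }


def variant_summary(seq: str, pos: int, ref: str, alt: str, flank: int) -> Dict[str, int]:
    """Compare reference and alternate alleles within a window of +/- flank bases."""
    alt_seq = apply_variant(seq, pos, ref, alt)
    ref_feats = context_features(seq, pos, flank)
    alt_feats = context_features(alt_seq, pos, flank)
    return {
        "target_sites_lost": ref_feats["target_sites"] - alt_feats["target_sites"],
        "cofactor_sites_nearby": alt_feats["cofactor_sites"],
    }


if __name__ == "__main__":
    # The variant changes the T inside the target motif (position 12) to C.
    print(variant_summary(SEQ, 12, "T", "C", flank=3))   # narrow, site-only view
    print(variant_summary(SEQ, 12, "T", "C", flank=12))  # wider genomic context
```

With the narrow window, the sketch sees only that the target motif is lost. With the wider window, it also picks up the neighboring cofactor motif, which is the kind of contextual signal the article says was needed for accurate prediction. A real model would learn such patterns from data rather than rely on exact string matches.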