- Study Characteristics
- Whole Genome Sequencing
- Resources for the Scientific Community
The Trans-Omics for Precision Medicine (TOPMed) program, sponsored by the National Institutes of Health (NIH) National Heart, Lung, and Blood Institute (NHLBI), is part of a broader Precision Medicine Initiative, which aims to provide disease treatments tailored to an individual’s unique genes and environment. TOPMed contributes to this Initiative by integrating whole-genome sequencing (WGS) and other omics data (e.g., metabolic profiles, epigenomics, protein and RNA expression patterns) with molecular, behavioral, imaging, environmental, and clinical data.
A primary goal of the TOPMed program is to improve scientific understanding of the fundamental biological processes that underlie heart, lung, blood, and sleep (HLBS) disorders. TOPMed is providing deep WGS and other omics data to pre-existing ‘parent’ studies having large samples of human subjects with rich phenotypic characterization and environmental exposure data.
As of February 2020, TOPMed consists of ~155k participants from >80 different studies with varying designs. Prospective cohorts provide large numbers of disease risk factors, subclinical disease measures, and incident disease cases; case-control studies provide large numbers of prevalent disease cases; extended family structures and population isolates provide improved power to detect rare variant effects. The phenotype pie chart below shows the numbers and percentages of participants in studies with a focus on HLBS, as well as the percentage belonging to cohort studies that have collected many different phenotypes. It also shows areas of focus within each of the major HLBS categories.
Achieving ancestral and ethnic diversity is a priority in selecting contributing studies. Currently, 60% of the 155k sequenced participants are of predominantly non-European ancestry. Discovery of genotype-phenotype associations frequently includes pooled analysis across ancestry groups and studies, using statistical models that account for population structure and relatedness.
The pie chart below summarizes TOPMed participant diversity using a combination of self-identified or ascriptive race/ethnicity categories, study inclusion criteria, or other demographic information provided by study investigators. Please note that while groupings may correlate to some extent with genetic ancestry, TOPMed recommends distinguishing between genetically and non-genetically inferred descriptions in analyses and publications, as described in these Guidelines on the use and reporting of race, ethnicity, and ancestry in TOPMed.
WGS was performed by several sequencing centers to a median depth of 30X using DNA from blood, PCR-free library construction, and Illumina HiSeq X technology. A Support Vector Machine (SVM) quality filter was trained using known variants as positive labels and Mendelian-inconsistent variants as negative labels. The Informatics Research Center conducts joint genotype calling across all available samples to produce genotype data “freezes.”
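To make the notion of a Mendelian-inconsistent variant concrete, the following minimal sketch checks whether a child's genotype at a single site could have been inherited from its parents. It is an illustration only and not the TOPMed pipeline: the function names are hypothetical, and it assumes a biallelic autosomal site with genotypes coded as alternate-allele counts (0, 1, or 2).

```python
# Illustrative sketch only; not the TOPMed variant QC implementation.
# Assumption: biallelic autosomal site, genotypes coded as the number of
# alternate alleles (0 = hom-ref, 1 = het, 2 = hom-alt).

def transmissible(genotype):
    """Return the set of alt-allele counts (0 or 1) a parent can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one allele per parent."""
    return any(
        m + f == child
        for m in transmissible(mother)
        for f in transmissible(father)
    )


def inconsistency_rate(trios):
    """Fraction of (child, mother, father) genotype trios that are inconsistent."""
    trios = list(trios)
    if not trios:
        return 0.0
    bad = sum(not is_mendelian_consistent(c, m, f) for c, m, f in trios)
    return bad / len(trios)


if __name__ == "__main__":
    # A het child of two hom-ref parents cannot be explained by inheritance.
    print(is_mendelian_consistent(1, 0, 0))
    # A het child of a hom-ref mother and a het father is consistent.
    print(is_mendelian_consistent(1, 0, 1))
```

In a sketch like this, sites with a high inconsistency rate across many trios or duplicates would be candidates for the "likely artifact" class when training a quality filter.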
In TOPMed data freeze 8, with variant discovery on ~186k samples, 811 million single nucleotide variants and 66 million short insertion/deletion variants were identified and passed variant QC.
In TOPMed data freeze 9, variant discovery was initially performed on ~206k samples, including CCDG samples, and the call set was then subset to 158,470 TOPMed samples plus 2,504 1000 Genomes samples. In this subset, 781 million single nucleotide variants and 62 million short insertion/deletion variants were identified and passed variant QC. These counts are slightly smaller than the corresponding numbers in data freeze 8 because sites that show no variation in TOPMed samples were omitted. More information about WGS methods can be found under Sequencing and Data Processing Methods.
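As a quick check on "slightly smaller," the short sketch below compares the freeze 8 and freeze 9 counts quoted above. The counts (in millions) come from the text; the dictionary layout and function name are illustrative assumptions.

```python
# Variant counts in millions, as quoted in the text for freezes 8 and 9.
FREEZE_COUNTS_MILLIONS = {
    "freeze8": {"snv": 811, "indel": 66},
    "freeze9": {"snv": 781, "indel": 62},
}


def relative_drop(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    f8 = FREEZE_COUNTS_MILLIONS["freeze8"]
    f9 = FREEZE_COUNTS_MILLIONS["freeze9"]
    for kind in ("snv", "indel"):
        diff = f8[kind] - f9[kind]
        pct = 100 * relative_drop(f8[kind], f9[kind])
        print(f"{kind}: {diff} million fewer in freeze 9 ({pct:.1f}% decrease)")
```

By this arithmetic, freeze 9 has about 30 million fewer single nucleotide variants (roughly 3.7%) and 4 million fewer short insertion/deletion variants (roughly 6.1%) than freeze 8.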
TOPMed data are being made available to the scientific community as a series of “data freezes”: genotypes and phenotypes via dbGaP; read alignments via the Sequence Read Archive (SRA); and variant summary information via the Bravo variant server (see figure below) and dbSNP. Genotypes for a set of 55k samples have been released on dbGaP (freeze 5), and a release of >140k samples (freeze 8) is expected by mid-2020. TOPMed WGS data are contained in study-specific accessions with names containing “NHLBI TOPMed”, while most phenotypic data are in parent study accessions. The TOPMed accessions can be identified by searching the dbGaP website for “TOPMed”. More information about what data are available and how to access them can be found on the Data Access page.
TOPMed is currently adding other omics assays to samples that have been whole-genome sequenced; these include RNA-seq, metabolomics, proteomics, and epigenomics.
Overview of Bravo variant server resources
This content was adapted from a poster presented at the 2018 American Society of Human Genetics (ASHG) meeting, “Overview of the NHLBI Trans-Omics for Precision Medicine (TOPMed) program: Whole genome sequencing of >100,000 deeply phenotyped individuals” (Poster 3145/T).