TOPMed Data Access for the Scientific Community
TOPMed genomic data and pre-existing Parent study phenotypic data are made available to the scientific community in study-specific accessions in the database of Genotypes and Phenotypes (dbGaP). Different types of data are organized within these accessions as follows:
- Phenotypes: When the Parent study has a dbGaP accession that preceded the existence of the TOPMed program, phenotypic data are in the Parent accession. Otherwise, the phenotypic data are in the TOPMed accession. In addition, the TOPMed Data Coordinating Center (DCC) has harmonized select phenotypes across TOPMed studies for deposit into the TOPMed accessions. These DCC-harmonized phenotypes have been submitted to dbGaP and are pending release.
- Genotypes (WGS): Genotype calls from TOPMed WGS are available in the TOPMed accession as Variant Call Format (VCF) files. Studies may have multiple sets of VCF files corresponding to the various TOPMed data freezes. The VCF files contain variant-level quality metrics and a support vector machine (SVM) quality filter.
- Read alignment data (WGS): Only a limited number of TOPMed Phase 1 CRAMs aligned to build 37 are available directly through the dbGaP Sequence Read Archive (SRA). These can be accessed through a dbGaP approval for their corresponding TOPMed accessions. All other CRAMs, including build 38 alignments for all TOPMed WGS samples, are hosted in NHLBI cloud buckets and accessed using the “fusera” software.
- Instructions for controlled access to TOPMed sequence data on the cloud (Provided by Tom Blackwell, TOPMed IRC)
- Further documentation on dbGaP cloud access, including fusera (Provided by NCBI)
- Non-WGS omics: Generation of TOPMed non-WGS omics data is underway, and these data will be made available in dbGaP in the future.
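As an illustration of how the quality filter in the VCF files might be applied, the following minimal sketch keeps header lines and only those records whose FILTER column is PASS, the standard VCF convention for a record that passed all filters. The function name `iter_passing` and the sample lines are hypothetical, and the failing label "SVM" in the example is an assumption made for illustration; check the header of the actual VCF files for the filter labels they define.

```python
def iter_passing(vcf_lines):
    """Yield header lines unchanged and only records whose FILTER column is PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information and column header lines are kept as-is.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th fixed column of a VCF record (index 6).
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Hypothetical example data; the "SVM" failing label is an assumption.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNWD000001\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n",
    "chr1\t200\t.\tC\tT\t.\tSVM\t.\tGT\t0/0\n",
]
kept = list(iter_passing(example))
```

For real files, which are large and usually compressed, the same selection is better done with a dedicated tool; this sketch only shows where the filter lives in each record.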
Users who want to apply for controlled-access TOPMed data should follow the dbGaP instructions for requesting controlled-access data. In a dbGaP application, each TOPMed study-consent group will need to be requested individually. Note that participant consent and Data Use Limitations (DULs) differ within and across TOPMed studies. Therefore, dbGaP applicants will need to carefully review DULs and ensure that proposed Research Use Statements (RUS) are consistent with the study-consent group(s) being requested. Additionally, some TOPMed studies have consent modifiers that may require additional documentation, such as documentation of local IRB approval and/or letters of collaboration with the primary study PI(s).
Applicants should determine whether phenotype data are deposited in the TOPMed or the Parent accession for each study of interest. If phenotypes are in the Parent accession, applicants will need to apply specifically for access to the Parent accession for phenotypes, in addition to applying to the TOPMed accession for TOPMed WGS genotypes. The phs numbers for TOPMed and Parent accessions are available in the dbGaP methods documents.
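The bookkeeping described above can be sketched as a small helper that lists every accession and consent group to request. This is only an illustration: the `Study` class, the `accessions_to_request` function, and all accession and consent-group labels are hypothetical placeholders, not real dbGaP identifiers. Consent groups are tracked separately for the TOPMed and Parent accessions, since the sketch does not assume they are identical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Study:
    name: str
    topmed_phs: str                      # placeholder for the TOPMed accession
    topmed_consent_groups: List[str]
    parent_phs: Optional[str] = None     # set only if phenotypes are in a Parent accession
    parent_consent_groups: List[str] = field(default_factory=list)


def accessions_to_request(studies: List[Study]) -> List[Tuple[str, str]]:
    """Return (accession, consent group) pairs, one per individual request."""
    requests = []
    for study in studies:
        # Each TOPMed study-consent group is requested individually.
        for group in study.topmed_consent_groups:
            requests.append((study.topmed_phs, group))
        # Phenotypes held in a Parent accession need a separate request.
        if study.parent_phs is not None:
            for group in study.parent_consent_groups:
                requests.append((study.parent_phs, group))
    return requests


# Hypothetical studies with placeholder identifiers.
studies = [
    Study("StudyA", "phs-topmed-A", ["c1", "c2"], "phs-parent-A", ["c1", "c2"]),
    Study("StudyB", "phs-topmed-B", ["c1"]),
]
plan = accessions_to_request(studies)
```

The resulting list is only a checklist; the Data Use Limitations for each pair still need to be reviewed against the Research Use Statement.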
Running mega-analyses across TOPMed studies requires combining genotype and phenotype data from multiple individual dbGaP accessions.
- Combining genotypes: The Informatics Research Center’s (IRC) joint calling process produces a multi-study VCF file for each chromosome, each of which is split into study-specific components. For studies with multiple consent groups, these components are further divided by consent group and deposited in the study’s TOPMed accession. The same variants occur in all VCF components of a given call set. To construct a multi-study VCF file for analysis, a user must apply for access to each study-consent group and reassemble the components. Note that some TOPMed accessions have VCF files for more than one data freeze, so users must take care to select VCF files from the same freeze for their multi-study reassembly. Tools for combining VCF files include vcftools and bcftools.
- Combining phenotypes: The Parent studies contributing to TOPMed have many phenotypic measures in common, thereby providing opportunities for cross-study analyses to gain power in detecting genetic effects. However, the studies differ in how their phenotypic data were collected, annotated, and structured. Creating harmonized phenotypic data sets for cross-study analyses is therefore a challenging and largely manual process. Users will need to carefully evaluate the source phenotypes and accompanying documentation before attempting to harmonize across studies. Note that there is also a centralized phenotype harmonization process conducted by the TOPMed DCC, the products of which will be added back to the TOPMed accessions. Initially, these DCC-harmonized phenotypes focus on demographics and common covariates.
- A note on sample/subject identifiers: Each DNA sample in the TOPMed program is assigned a unique sample identifier (“NWD” followed by 6 digits) centrally by the DCC. The NWD ID is the identifier used in all files containing TOPMed sequence or genotype data. Subject (also known as participant or individual) identifiers are assigned by study investigators and are not guaranteed to be unique across all studies. The subject identifiers are associated with individual-level phenotypic data and, in most cases, are consistent between the TOPMed and Parent accessions for a given study. Mappings between sample and subject identifiers, as well as subject ID aliases, are given in standard dbGaP files labeled as subject-sample mapping and subject consent files.
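To make the reassembly step concrete, here is a minimal pure-Python sketch that pastes the sample columns of several VCF components back together. It relies on two statements from the text above: the same variants occur in every component of a call set, and sample IDs follow the pattern "NWD" plus 6 digits. It further assumes that the variants appear in the same order in every component and that the `##` meta-information lines have been set aside. The names `is_nwd_id` and `merge_components` and the example records are hypothetical. For real data, the tools named above (vcftools, bcftools) are the appropriate choice.

```python
import re

NWD_PATTERN = re.compile(r"NWD\d{6}")


def is_nwd_id(sample_id):
    """True if the ID is 'NWD' followed by exactly 6 digits."""
    return NWD_PATTERN.fullmatch(sample_id) is not None


def merge_components(components):
    """Combine components sample-wise.

    Each component is a list of tab-delimited lines: the #CHROM header
    line followed by its records, in the same order in every component.
    """
    if len({len(c) for c in components}) != 1:
        raise ValueError("components differ in number of lines")
    merged = []
    for rows in zip(*components):
        split_rows = [r.rstrip("\n").split("\t") for r in rows]
        first = split_rows[0]
        # CHROM, POS, ID, REF, ALT (columns 0-4) and FORMAT (column 8) must agree.
        key = first[:5] + first[8:9]
        if any(r[:5] + r[8:9] != key for r in split_rows[1:]):
            raise ValueError("components are not aligned on the same variant")
        values = [v for r in split_rows for v in r[9:]]
        if first[0] == "#CHROM":
            # On the header line, the per-sample columns hold sample IDs.
            if not all(is_nwd_id(s) for s in values):
                raise ValueError("unexpected sample identifier")
            if len(set(values)) != len(values):
                raise ValueError("duplicate sample identifier")
        # QUAL, FILTER and INFO are copied from the first component only;
        # any per-component summary values in INFO would need recomputing.
        merged.append("\t".join(first[:9] + values))
    return merged


# Hypothetical components from two study-consent groups of the same freeze.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
comp_a = [header + "\tNWD000001\tNWD000002", "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0"]
comp_b = [header + "\tNWD000003", "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t1/1"]
merged = merge_components([comp_a, comp_b])
```

The alignment check is what would catch an accidental mix of components from different freezes, provided the freezes differ in their variant lists. Linking the merged NWD IDs to phenotypes still requires the subject-sample mapping files, keyed by study as well as subject ID because subject IDs are not unique across studies.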
The following repositories contain variant summary information for TOPMed studies that granted explicit permission. These resources are publicly available.