TOPMed Data Access for the Scientific Community

Contents

  1. Where are the data?
  2. How do I apply for access?
  3. How do I use the data?
  4. Where can I access variant summary data?
  5. Where can I learn more?

1. Where are the data?

TOPMed genomic data and pre-existing Parent study phenotypic data are made available to the scientific community in study-specific accessions in the database of Genotypes and Phenotypes (dbGaP). Different types of data are organized within these accessions as follows:

  • Phenotypes: When the Parent study has a dbGaP accession that predates the TOPMed program, phenotypic data are in the Parent accession. Otherwise, the phenotypic data are in the TOPMed accession. The TOPMed Data Coordinating Center (DCC) is also harmonizing selected phenotypes across TOPMed studies; these harmonized phenotypes will be deposited into the TOPMed accessions.
  • Genotypes: Genotype calls from TOPMed whole genome sequencing (WGS) are available in the TOPMed accession as Variant Call Format (VCF) files. Studies may have multiple sets of VCF files corresponding to the various TOPMed data freezes (e.g., freeze4 and freeze5b). The VCF files contain variant-level quality metrics and a support vector machine (SVM) quality filter.
  • Read alignment data: Read alignments are provided via the Sequence Read Archive (SRA). Most TOPMed Phase 1 studies have read alignments for GRCh37 in .sra format, which can be accessed through a dbGaP approval for their corresponding TOPMed accessions. Read alignments for GRCh38 are not yet available for access by the scientific community, but a mechanism is under development for accessing and working with these alignments in CRAM format. Sample-level quality metrics for SRA/CRAM files are stored in the dbGaP “genotype-qc” file type.
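
As an illustration of how the variant-level quality filter might be applied, the Python sketch below keeps only records whose FILTER column is PASS. It assumes that variants failing the SVM filter carry a non-PASS value in that column; the actual labels should be checked against the header of the VCF files for the freeze in question. The function name, the "SVM" label, and the toy records are hypothetical.

```python
import io


def passing_records(vcf_lines):
    """Yield header lines plus data records whose FILTER column is PASS."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.split("\t")
        # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if fields[6] == "PASS":
            yield line


# Toy input: the sample ID, positions and the "SVM" label are made up.
columns = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
           "FILTER", "INFO", "FORMAT", "NWD000001"]
toy_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(columns),
    "\t".join(["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/1"]),
    "\t".join(["chr1", "200", ".", "C", "T", ".", "SVM", ".", "GT", "0/0"]),
])

kept = list(passing_records(io.StringIO(toy_vcf)))
for record in kept:
    print(record)
```

For full-size files, an established tool such as bcftools is likely a better choice than line-by-line parsing; the sketch only shows where the filter lives in each record.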

2. How do I apply for access?

Users who want to apply for controlled-access TOPMed data should follow the dbGaP instructions for requesting controlled-access data. In a dbGaP application, each TOPMed study-consent group will need to be requested individually. Note that participant consent and Data Use Limitations (DULs) differ within and across TOPMed studies. Therefore, dbGaP applicants will need to carefully review DULs and ensure that proposed Research Use Statements (RUS) are consistent with the study-consent group(s) being requested. Additionally, some TOPMed studies have consent modifiers that may require additional documentation, such as documentation of local IRB approval and/or letters of collaboration with the primary study PI(s).

Applicants should check whether phenotype data for the studies of interest are deposited in the TOPMed accession or in the Parent accession. If they are in the Parent accession, applicants will need to apply specifically for access to the Parent accession for phenotypes, in addition to applying to the TOPMed accession for TOPMed WGS genotypes. The dbGaP accession (phs) numbers for TOPMed and Parent accessions are available in the dbGaP methods documents for freeze4 and freeze5b.
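
The bookkeeping described above can be sketched in Python as follows. This is only an illustration: the class, the function, the study names, and the accession strings are hypothetical placeholders (real phs numbers come from the dbGaP methods documents), and the sketch assumes that consent group labels match between a study's TOPMed and Parent accessions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StudyConsentGroup:
    study: str
    consent_group: str
    topmed_accession: str
    # None means the phenotypes are deposited in the TOPMed accession itself.
    parent_accession: Optional[str] = None


def accessions_to_request(groups):
    """Return sorted (accession, consent_group) pairs to list in an application."""
    requests = set()
    for group in groups:
        # Every TOPMed study-consent group is requested individually.
        requests.add((group.topmed_accession, group.consent_group))
        # Phenotypes held in a Parent accession need a separate request.
        if group.parent_accession is not None:
            requests.add((group.parent_accession, group.consent_group))
    return sorted(requests)


# Placeholder studies and accession labels, not real phs numbers.
groups = [
    StudyConsentGroup("StudyA", "c1", "phs-TOPMED-A", "phs-PARENT-A"),
    StudyConsentGroup("StudyA", "c2", "phs-TOPMED-A", "phs-PARENT-A"),
    StudyConsentGroup("StudyB", "c1", "phs-TOPMED-B"),
]

requests = accessions_to_request(groups)
for accession, consent_group in requests:
    print(accession, consent_group)
```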

3. How do I use the data?

Running pooled analyses (mega-analyses) across TOPMed studies requires combining genotype and phenotype data from the individual dbGaP accessions.

  • Combining genotypes: The Informatics Research Center’s (IRC) joint calling process produces a multi-study VCF file for each chromosome, each of which is split into study-specific components. For studies with multiple consent groups, these components are further divided by consent group and deposited in the study’s TOPMed accession. The same variants occur in all VCF components of a given call set. To construct a multi-study VCF file for analysis, a user must apply for access to each study-consent group and reassemble the components. Note that some TOPMed accessions have VCF files for more than one data freeze (e.g., freeze4 and freeze5b), so users must take care to select VCF files from the same freeze for their multi-study reassembly. Tools for combining VCF files include vcftools and bcftools.
  • Combining phenotypes: The Parent studies contributing to TOPMed have many phenotypic measures in common, thereby providing opportunities for cross-study analyses to gain power in detecting genetic effects. However, the studies differ in how their phenotypic data were collected and in how those data are annotated and structured. Creating harmonized phenotypic data sets for cross-study analyses is therefore a challenging and largely manual process. Users will need to carefully evaluate the source phenotypes and accompanying documentation before attempting to harmonize across studies. Note that the TOPMed DCC also conducts a centralized phenotype harmonization process, the products of which will be added back to the TOPMed accessions. Initially, these DCC-harmonized phenotypes focus on demographics and common covariates.
  • A note on sample/subject identifiers: Each DNA sample in the TOPMed program is assigned a unique sample identifier (“NWD” followed by 6 digits) centrally by the DCC. The NWD ID is the identifier used in all files containing TOPMed sequence or genotype data. Subject (also known as participant or individual) identifiers are assigned by study investigators and are not guaranteed to be unique across studies. The subject identifiers are associated with individual-level phenotypic data and, in most cases, are consistent between the TOPMed and Parent accessions for a given study. Mappings between sample and subject identifiers, as well as subject ID aliases, are given in standard dbGaP files labeled as subject-sample mapping and subject consent files.
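
The reassembly and identifier handling described in this list can be sketched in Python as follows. This is a minimal illustration, not the IRC's or DCC's procedure: it assumes that components from the same freeze list identical variants in identical order and share the same FORMAT column, and it takes the site-level columns from the first component. The function names, study names, and toy records are hypothetical. Because subject identifiers are not guaranteed to be unique across studies, each subject is keyed by a (study, subject ID) pair.

```python
import re

# "NWD" followed by 6 digits, as described for TOPMed sample identifiers.
NWD_PATTERN = re.compile(r"^NWD\d{6}$")


def merge_components(components):
    """Column-wise merge of VCF components that list the same variants.

    Each component is a list of text lines; '##' meta lines are skipped.
    """
    merged = None
    for lines in components:
        rows = [line.rstrip("\n").split("\t") for line in lines
                if line.strip() and not line.startswith("##")]
        if merged is None:
            merged = [row[:] for row in rows]
            continue
        if len(rows) != len(merged):
            raise ValueError("variant counts differ; are the files from one freeze?")
        for base, row in zip(merged, rows):
            # Compare CHROM, POS, REF, ALT (the header row compares equal too).
            if base[:2] + base[3:5] != row[:2] + row[3:5]:
                raise ValueError("variant rows do not line up across components")
            base.extend(row[9:])  # append this component's sample columns
    return ["\t".join(row) for row in (merged or [])]


def subjects_for_samples(sample_ids, sample_to_subject):
    """Look up the (study, subject ID) pair for each NWD sample ID."""
    bad = [s for s in sample_ids if not NWD_PATTERN.match(s)]
    if bad:
        raise ValueError("not NWD sample identifiers: %s" % bad)
    return {s: sample_to_subject[s] for s in sample_ids}


fixed = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
site = ["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT"]
study_a = ["\t".join(fixed + ["NWD000001"]), "\t".join(site + ["0/1"])]
study_b = ["\t".join(fixed + ["NWD000002"]), "\t".join(site + ["1/1"])]

merged = merge_components([study_a, study_b])
samples = merged[0].split("\t")[9:]

# Toy subject-sample mapping: the same subject ID occurs in two studies.
mapping = {"NWD000001": ("StudyA", "1001"), "NWD000002": ("StudyB", "1001")}
subjects = subjects_for_samples(samples, mapping)
print(samples, subjects)
```

In practice the mapping would be read from the dbGaP subject-sample mapping files, and the merge itself would usually be delegated to bcftools or vcftools; the sketch mainly shows the consistency checks worth making before combining components.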

4. Where can I access variant summary data?

The following repositories contain variant summary information for TOPMed studies that granted explicit permission. These resources are publicly available.

5. Where can I learn more?