DCC-harmonized phenotypes for the scientific community
The DCC has undertaken two projects related to study phenotypes in TOPMed. Please see the sections below for more information.
We are preparing a manuscript describing these projects in more detail. This page will be updated with a link to the paper upon publication.
DCC phenotype harmonization project
The TOPMed DCC has harmonized over 100 phenotype variables related to heart, lung, blood, and sleep domains. The main goal of the DCC harmonization project is to provide harmonized phenotypes that are well-documented, reproducible, and as homogeneous across studies as possible. In harmonized datasets and documentation, the DCC typically uses “phenotype” to refer to the observable characteristic (e.g., diastolic blood pressure) and “variable” to refer to the specific data vector that holds the values for a given phenotype (e.g., bp_diastolic_1). To enable reproducibility, all study data were acquired from dbGaP.
Full documentation for each harmonized variable will be provided in a GitHub repository. For each harmonized variable, the documentation includes the identifiers of the original dbGaP study variables used in harmonization and the code that was used to transform them into the harmonized variable. The repository also includes a reproducible example that shows users how to follow the documentation to reproduce a simulated harmonized variable.
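As a rough illustration of what transforming study variables into a harmonized variable can involve, here is a minimal Python sketch. It is not the DCC's code: the two studies, their source variable names (DBP_MMHG, diabp), the subject identifiers, and all values are invented for illustration. Only the harmonized variable name bp_diastolic_1 comes from the text above.

```python
# Hypothetical sketch: study names, source variable names, and values are
# illustrative assumptions, not taken from the DCC documentation.

# Simulated records from two made-up studies. Each study stores diastolic
# blood pressure under its own variable name and data type.
study_a = [
    {"subject_id": "A1", "DBP_MMHG": 82.0},
    {"subject_id": "A2", "DBP_MMHG": None},  # missing measurement
]
study_b = [
    {"subject_id": "B1", "diabp": "76"},  # stored as text in this study
    {"subject_id": "B2", "diabp": "90"},
]


def harmonize_study(records, study, source_variable):
    """Map one study's source variable onto the harmonized variable."""
    harmonized = []
    for record in records:
        raw = record[source_variable]
        # Convert to a common numeric type; keep missing values as None.
        value = None if raw is None else float(raw)
        harmonized.append(
            {
                "study": study,
                "subject_id": record["subject_id"],
                "bp_diastolic_1": value,
            }
        )
    return harmonized


# Combine the per-study results into one harmonized dataset.
harmonized = (
    harmonize_study(study_a, "study_a", "DBP_MMHG")
    + harmonize_study(study_b, "study_b", "diabp")
)

for row in harmonized:
    print(row)
```

Recording the source variable name next to each transformation, as the function arguments do here, is what lets someone else trace a harmonized value back to its origin and reproduce it.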
DCC phenotype tagging project
In addition to the phenotype harmonization project, the DCC has undertaken a related project to label over 16,000 dbGaP study variables with 65 phenotype concepts from heart, lung, blood, and sleep domains. We refer to this process as “variable tagging.” These labels enable researchers to more easily identify variables of interest that can be used in future harmonization efforts. The results of the tagging project are available in the dbGaP user interface.
The list of tags and instructions for identifying phenotype tags can be found on the DCC phenotype tagging details page.
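To show how tags can help locate candidate variables for a future harmonization, the following Python sketch groups tagged variables by tag. It assumes the tagging results are available as (variable, tag) pairs. The variable names and tag names below are placeholders and do not reflect the actual DCC tag list or the dbGaP interface.

```python
from collections import defaultdict

# Hypothetical (study variable, tag) pairs; all names are placeholders.
tagged_variables = [
    ("study_a.DBP_MMHG", "blood_pressure_diastolic"),
    ("study_b.diabp", "blood_pressure_diastolic"),
    ("study_b.sleep_hrs", "sleep_duration"),
]


def variables_by_tag(pairs):
    """Group study variables under the phenotype tag assigned to them."""
    index = defaultdict(list)
    for variable, tag in pairs:
        index[tag].append(variable)
    return dict(index)


tag_index = variables_by_tag(tagged_variables)

# Variables sharing a tag are candidates for harmonizing that phenotype.
candidates = tag_index.get("blood_pressure_diastolic", [])
print(candidates)
```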