The main steps involve using the ATM model; a detailed documentary of ATM software is available at https://github.com/Xilin-Jiang/ATM/tree/main.
1. Inferring comorbidity patterns:
Download ATM software following the "Installation" section of https://github.com/Xilin-Jiang/ATM/tree/main.
Following the code example in "Inferring disease topics using diagnosis data". Change the `HES_age_example` using your data.
2. Extract the topic weights.
Topic weights are used for excess PRS analysis, SNP x Topic interaction analysis. You could extract topic weights for each individual from the ATM inference results from previous step. Following the code in section "Inferring comorbidity profiles for individuals" of https://github.com/Xilin-Jiang/ATM/tree/main.
3. Extract diagnosis-specific topic probability.
Diagnosis-specific topic probability is used for excess genetic correlation analysis and FST analysis. As an individual has multiple distinct diseases, ATM allows weights for each diagnosis, which is a posterior probability based on the topic weight of the individual and the specific disease. Following the code in section "Inferring comorbidity profiles for individuals" of https://github.com/Xilin-Jiang/ATM/tree/main.
4. Mapping different disease codes.
For mapping between SNOMED and ICD-10 codes, we use mapping downloaded from https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html and mapping between ICD-10 to Phecodes are downloaded from https://phewascatalog.org/phecodes. We implemented code maps as functions in the ATM software.
5. Genetic analysis of comorbidity subtypes.
We define disease subtypes using topic weights of each patient and diagnosis-specific topic probabilities of each disease diagnosis.
For excess PRS analysis, we use BOLT-LMM 2.3 to compute predictive PRS (five fold cross validation) and perform association between PRS and topic weights.
For genetic correlation analysis, we define discrete disease subtypes based on the largest value of diagnosis-specific topic probabilities. For example, if the model learns 10 comorbidity topics, each diagnosis-specific topic probabilities will be a vector of 10, and each diagnosis is assigned to one of the subtypes based on this value. We then compute three types of genetic correlations: (1) subtype-subtype, (2) subtype-disease, (3) and disease-disease. Here disease means all cases without being assigned to a subtype. We compute excess genetic correlation by comparing the first two types of genetic correlation (subtype-subtype and subtype-disease) to disease-disease genetic correlation.
For excess FST analysis, we first compute observed FST estimates (between two subtypes of the same disease, defined the same way as above). We then compute expected null FST distribution based on sampling healthy individuals with matched topic weights (i.e. FST estimates between two sets of healthy controls with topic weight distributions matched to the respective disease subtypes). We reported the permutation P-values for the 52 diseases with discrete subtypes.
For SNP x topic interactions analysis, we test the interaction between topic weights for each patient and GWAS SNPs (SNPs that reached genome-wide significance in an association analysis, using PLINK 1.9), using diseases as the response variable.
For details of the implementation, refer to the Full Methods section of Supplementary Note of our Associated Nature Genetics publication.