1. Cancer cell lines are standardized by name and retained only if they have all four data types available, are of solid tumor origin, are not subject to genetic drift between the Broad Institute and Sanger Institute characterizations of the cell line per Ghandi et al. 2019, and were not reclassified as fibroblast-like by Weck et al. 2017 and Ghandi et al. 2019 [1,8].
2. Observed somatic variants, copy number alterations, and fusions are processed by the Molecular Oncology Almanac (MOAlmanac) to identify clinically relevant molecular features and to annotate them for presence in the Cancer Gene Census (CGC) [9].
3. GDSC’s cell line IC50 z-score thresholds are applied to each therapy and cancer cell line pair to generate boolean-valued labels for sensitive (z-score < -2.0) and resistant (z-score > 2.0) relationships [2]. Pairwise comparisons are made between all cancer cell lines, recording the set of therapies to which both cell lines are sensitive and the size of that set. A pair of cancer cell lines is deemed to share therapeutic sensitivity if this intersection contains at least one therapy. Cancer cell lines are further filtered by requiring that they are sensitive to at least one therapy and that at least one other cell line shares therapeutic sensitivity with them.
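The labeling and filtering in step 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the project's implementation: the cell line names, therapy names, and z-scores are hypothetical, and the function names are introduced here only for the example.

```python
from itertools import combinations

SENSITIVE_Z = -2.0  # sensitive threshold stated in the text


def sensitive_therapies(z_scores):
    """Return the set of therapies whose z-score is below the sensitive threshold."""
    return {therapy for therapy, z in z_scores.items() if z < SENSITIVE_Z}


def shared_sensitivity(z_by_cell_line):
    """Map each cell line pair to the non-empty set of therapies both are sensitive to."""
    sensitive = {name: sensitive_therapies(z) for name, z in z_by_cell_line.items()}
    shared = {}
    for a, b in combinations(sorted(sensitive), 2):
        overlap = sensitive[a] & sensitive[b]
        if overlap:  # intersection size greater than zero
            shared[(a, b)] = overlap
    return shared


def retained_cell_lines(shared):
    """Cell lines that share sensitivity with at least one other cell line."""
    return {name for pair in shared for name in pair}


# Hypothetical z-scores, for illustration only.
z_by_cell_line = {
    "LINE_A": {"drug_1": -2.5, "drug_2": 0.3, "drug_3": -2.1},
    "LINE_B": {"drug_1": -3.0, "drug_2": 2.4, "drug_3": 0.0},
    "LINE_C": {"drug_1": 0.1, "drug_2": -0.4, "drug_3": 1.0},
    "LINE_D": {"drug_1": 0.5, "drug_2": -2.2, "drug_3": 0.2},
}
shared = shared_sensitivity(z_by_cell_line)
kept = retained_cell_lines(shared)
```

In this toy input, LINE_D is sensitive to one therapy but has no partner that shares it, so it is dropped along with LINE_C, which is sensitive to nothing.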
4. Somatic variants, copy number alterations, and fusions are coded into matrices indexed by cancer cell line name, with each column associated with a different molecular feature, to be used to calculate genomic similarity between pairs of cancer cell lines. The coding of features depends on the model implemented, as follows, in alphabetical order:
- Compatibility (compatibility): Similarity measures between case and comparison profiles are generated based on shared observed clinically relevant features.
First, a total possible score is calculated for each profile based on the set of clinically relevant somatic variants, copy number alterations, and fusions observed in the tumor. Specifically, molecular features which match fully characterized MOAlmanac entries (by gene, feature type, and alteration details) receive 75 points, those which match only by gene and feature type receive 25 points, and those which match only by gene receive 10 points. For example, the features PIK3CA p.H1047R, NUP214--ABL1, and CDK12 amplification respectively score 75, 25, and 10 points for a total of 110 points, as PIK3CA p.H1047R is catalogued in MOAlmanac, ABL1 is catalogued with other fusion partners, and CDK12 somatic variants but not copy number alterations are catalogued in MOAlmanac.
Next, pairwise comparisons are performed to score the intersection of observed molecular features relative to each case profile. Consider a second profile (B) with features BCR--ABL1, CDK12 p.L21S, and TP53 deletion being compared relative to the one described above (A). B scores 35 of 110 potential points from A (25 for sharing an ABL1 fusion and 10 for sharing an altered CDK12), resulting in a score of 0.318. Likewise, A relative to B is calculated, and the mean of the two values is taken as the similarity measure, or compatibility, between the two molecular profiles. This approach was inspired by dating algorithms.
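The worked example above can be reproduced with a short Python sketch. Several details are assumptions made only for illustration: each fusion is keyed by a single gene (ABL1), the second profile's variant is taken to be in CDK12 (which is what the 35-point total implies), and a shared feature earns the smaller of the case feature's own points and the level at which the two profiles agree. The `Feature` tuple and function names are hypothetical, and per-feature point values are supplied as input rather than looked up in MOAlmanac.

```python
from collections import namedtuple

Feature = namedtuple("Feature", ["gene", "feature_type", "alteration"])

FULL, TYPE, GENE = 75, 25, 10  # point tiers stated in the text


def agreement(x, y):
    """Points for how closely two features agree: alteration, feature type, or gene."""
    if x.gene != y.gene:
        return 0
    if x.feature_type != y.feature_type:
        return GENE
    if x.alteration != y.alteration:
        return TYPE
    return FULL


def relative_score(case, case_points, comparison):
    """Fraction of the case profile's possible points recovered by the comparison."""
    total = sum(case_points[f] for f in case)
    if total == 0:
        return 0.0
    earned = 0
    for f in case:
        best = max((agreement(f, g) for g in comparison), default=0)
        # Assumption: a match cannot earn more than the case feature is worth.
        earned += min(case_points[f], best)
    return earned / total


def compatibility(a, a_points, b, b_points):
    """Mean of the two directional scores, as described in the text."""
    return (relative_score(a, a_points, b) + relative_score(b, b_points, a)) / 2


A = [
    Feature("PIK3CA", "somatic variant", "p.H1047R"),
    Feature("ABL1", "fusion", "NUP214--ABL1"),
    Feature("CDK12", "copy number", "amplification"),
]
a_points = {A[0]: 75, A[1]: 25, A[2]: 10}  # per-feature points from the text
B = [
    Feature("ABL1", "fusion", "BCR--ABL1"),
    Feature("CDK12", "somatic variant", "p.L21S"),
    Feature("TP53", "copy number", "deletion"),
]

score_b_from_a = relative_score(A, a_points, B)  # 35 / 110
```

Computing the full compatibility for this pair would also require the point values of profile B's features, which the example in the text does not state, so only one direction is evaluated here.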
- Jaccard of CGC feature types (jaccard-cgc-feature-types). Sort by an agreement-based measure (Jaccard), considering variants by Cancer Gene Census gene and feature type (e.g. two CDKN2A copy number alterations match, but a CDKN2A deletion and a CDKN2A nonsense somatic variant do not). Matrix elements are boolean.
- Jaccard of CGC genes (jaccard-cgc-genes). Sort by an agreement-based measure (Jaccard), considering any variant in a Cancer Gene Census gene. Matrix elements are boolean.
- Jaccard of MOAlmanac feature types (jaccard-almanac-feature-types). Sort by an agreement-based measure (Jaccard), considering both gene and data type for all somatic variants, copy number alterations, and rearrangements catalogued in the Molecular Oncology Almanac (e.g. two CDKN2A copy number alterations match, but a CDKN2A deletion and a CDKN2A nonsense somatic variant do not). Matrix elements are boolean.
- Jaccard of MOAlmanac features (jaccard-almanac-features). Sort by an agreement-based measure (Jaccard), considering all somatic variant, copy number, and rearrangement molecular features with alteration details as catalogued in the Molecular Oncology Almanac. Matrix elements are boolean.
- Jaccard of MOAlmanac genes (jaccard-almanac-genes). Sort by an agreement-based measure (Jaccard), considering any somatic variant, copy number alteration, or rearrangement in any gene catalogued in the Molecular Oncology Almanac. Matrix elements are boolean.
- Multi-pass sort: FDA & CGC (multi-pass-sort_fda-cgc). A weakness of agreement-based measures is that they produce tied values. Ties in similarity based on Molecular Oncology Almanac features associated with FDA evidence are broken by similarity based on CGC genes.
- Nonsynonymous variant count (nonsynonymous-variant-count). Similarity is evaluated based on the absolute value of the difference in the number of coding somatic variants between the two cancer cell lines. This is a proxy for mutational burden, because the number of somatic bases considered when calling variants, which would serve as the denominator, is not available from the data sources.
- PCA of CGC genes (pca-cgc-genes). Principal Component Analysis is applied to the matrix of CGC genes, with elements populated if a gene is mutated in a sample either as a somatic variant, copy number alteration, or fusion. Matrix elements are boolean. For example, both TP53 nonsense variants and copy number deletions can populate elements in the column associated with the gene TP53.
- PCA of MOAlmanac genes (pca-almanac-genes). Principal Component Analysis is applied to the matrix of MOAlmanac genes, with elements populated if a gene is mutated in a sample either as a somatic variant, copy number alteration, or fusion. Matrix elements are boolean. For example, both TP53 nonsense variants and copy number deletions can populate elements in the column associated with the gene TP53.
- Random (random_mean). Cell lines are shuffled against one another randomly across 100,000 seeds. The seed whose mean average precision was closest to the average across all seeds was chosen.
- SNF: CGC (snf_cgc). The Python implementation of Similarity Network Fusion by Ross Markello (https://github.com/rmarkello/snfpy) is used to combine similarity across multiple data types [10]. Matrices that contain boolean values describing variants in CGC genes altered by (1) somatic variants, (2) copy number alterations, and (3) rearrangements are processed by the tool.
- SNF: FDA & CGC (snf_fda-cgc). The Python implementation of Similarity Network Fusion by Ross Markello (https://github.com/rmarkello/snfpy) is used to combine similarity across multiple data types [10]. Matrices that contain boolean values describing variants (1) in CGC genes that contain a somatic variant, (2) in CGC genes that contain a copy number alteration, (3) in CGC genes that contain a fusion, and (4) associated with an FDA approval, as identified by MOAlmanac, are processed by the tool.
- SNF: FDA & CGC genes (snf_fda-cgc-genes). The Python implementation of Similarity Network Fusion by Ross Markello (https://github.com/rmarkello/snfpy) is used to combine similarity across multiple data types [10]. Matrices that contain boolean values describing variants (1) in CGC genes if mutated either as a somatic variant, copy number alteration, or fusion and (2) associated with an FDA approval, as identified by MOAlmanac, are processed by the tool.
- SNF: MOAlmanac (snf_almanac). The Python implementation of Similarity Network Fusion by Ross Markello (https://github.com/rmarkello/snfpy) is used to combine similarity across multiple data types [10]. Matrices that contain boolean values describing variants in MOAlmanac genes altered by (1) somatic variants, (2) copy number alterations, and (3) rearrangements are processed by the tool.
- Somatic tree (somatic-tree). This approach is inspired by CELLector, by Najgebauer et al. [4]. CELLector has a prioritized list of alterations based on cancer type and will report similar cell lines based on mutant / wild type status of each alteration. Likewise, we utilize MOAlmanac’s prioritization of somatic variants, copy number alterations, and rearrangements observed in a given profile to rank comparison cell lines based on the mutant or wild type status of the gene and feature type of each molecular feature. Somatic events are prioritized in the order in which they appear in the somatic.scored.txt output of MOAlmanac. As an illustrative example, consider a profile with prioritized somatic events, in order: BRAF somatic variant, COL1A1 fusion, and CDKN2A copy number alteration. Cell lines would be sorted into the order of: (1) BRAF, COL1A1, and CDKN2A mutant, (2) BRAF and COL1A1 mutant and CDKN2A wild type, (3) BRAF and CDKN2A mutant and COL1A1 wild type, (4) BRAF mutant and COL1A1 and CDKN2A wild type, and (5) BRAF wild type, etc.
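The ordering in the somatic tree example is equivalent to sorting comparison cell lines by a tuple of mutant / wild type flags, one per prioritized event, in descending lexicographic order. The following Python sketch illustrates that idea only. The cell line names and the `somatic_tree_rank` function are hypothetical, and each event is represented as a (gene, feature type) pair as an assumption about the encoding.

```python
def somatic_tree_rank(prioritized, comparison_profiles):
    """Rank cell lines by mutant status of each prioritized event, in priority order.

    prioritized: list of (gene, feature_type) pairs, highest priority first.
    comparison_profiles: dict mapping cell line name to a set of such pairs.
    """
    def key(name):
        altered = comparison_profiles[name]
        # True (mutant) sorts ahead of False (wild type) because reverse=True.
        return tuple(event in altered for event in prioritized)

    return sorted(comparison_profiles, key=key, reverse=True)


BRAF = ("BRAF", "somatic variant")
COL1A1 = ("COL1A1", "fusion")
CDKN2A = ("CDKN2A", "copy number alteration")

prioritized = [BRAF, COL1A1, CDKN2A]

# Hypothetical comparison cell lines, for illustration only.
comparison_profiles = {
    "LINE_1": {CDKN2A},                # BRAF wild type
    "LINE_2": {BRAF},                  # only BRAF mutant
    "LINE_3": {BRAF, CDKN2A},
    "LINE_4": {BRAF, COL1A1},
    "LINE_5": {BRAF, COL1A1, CDKN2A},  # all three mutant
}

order = somatic_tree_rank(prioritized, comparison_profiles)
```

Cell lines with identical status across all prioritized events remain tied under this key. Python's sort is stable, so such ties keep their input order in this sketch.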
5. Similarity models are evaluated by their ability to rank cancer cell line pairs that share therapeutic sensitivity as more similar than those that do not, using evaluation metrics from ranked retrieval [11]. The metrics precision at rank (k), recall at rank (k), and average precision are used to evaluate a model’s ability to sort cell lines relative to one cell line. These metrics are defined as follows, considering a ranked list containing relevant and not relevant documents returned after querying many documents:
- Precision at rank k (precision @ k). The number of relevant documents that are in the top k ranked documents divided by k.
- Recall at rank k (recall @ k). The fraction of total relevant documents returned when considering k documents.
- Average precision (AP). The average of precision values at the positions of relevant documents.
These metrics are illustrated by considering a ranked list of four items that are relevant, not relevant, relevant, and not relevant (Figure 1). Considering only the first item (k = 1) yields a precision of 1.0 (1 relevant item / 1 considered item), while also considering the second (k = 2) yields a precision of 0.5 (1 relevant item / 2 considered items). Recall is 0.5 at k = 1 (1 relevant item / 2 total relevant items), 1.0 at k = 3 (2 relevant items / 2 total relevant items), and 1.0 at k = 4. Relevant items exist at k = 1 and k = 3 with associated precision values of 1.0 and 0.67 (2 relevant items / 3 considered items), so the average precision is 0.83.
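The three metrics and the four-item example can be checked with a minimal Python sketch. The function names are chosen here for illustration, and the ranked list is represented as a list of booleans (True for relevant). The sketch assumes every relevant item appears somewhere in the ranked list, as is the case when all cell lines are ranked.

```python
def precision_at_k(relevance, k):
    """Relevant items among the top k, divided by k."""
    return sum(relevance[:k]) / k


def recall_at_k(relevance, k):
    """Fraction of all relevant items that appear in the top k."""
    total_relevant = sum(relevance)
    return sum(relevance[:k]) / total_relevant if total_relevant else 0.0


def average_precision(relevance):
    """Mean of precision@k taken at the rank of each relevant item."""
    precisions = [
        precision_at_k(relevance, k)
        for k, is_relevant in enumerate(relevance, start=1)
        if is_relevant
    ]
    return sum(precisions) / len(precisions) if precisions else 0.0


# The example from the text: relevant, not relevant, relevant, not relevant.
ranked = [True, False, True, False]

p1 = precision_at_k(ranked, 1)   # 1 / 1
p2 = precision_at_k(ranked, 2)   # 1 / 2
r1 = recall_at_k(ranked, 1)      # 1 / 2
r3 = recall_at_k(ranked, 3)      # 2 / 2
ap = average_precision(ranked)   # (1.0 + 2/3) / 2
```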
The mean average precision (mAP) is calculated to evaluate each model’s performance across all cancer cell lines. The metric is defined as follows:
- Mean average precision (mAP). The mean of average precision values from multiple queries.
Given three queries which returned six documents with AP values of 0.66, 0.565, and 0.25, the mAP is calculated to be 0.492 (Figure 2).
The AP for each cancer cell line and the mAP across all cancer cell lines are calculated for each similarity model.
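As a quick check of the arithmetic, the mAP for the three AP values quoted above is simply their mean. The helper name below is chosen for illustration.

```python
from statistics import mean


def mean_average_precision(ap_values):
    """Mean of per-query (here, per cell line) average precision values."""
    return mean(ap_values)


# AP values from the three-query example in the text.
map_value = mean_average_precision([0.66, 0.565, 0.25])
map_rounded = round(map_value, 3)
```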
6. Models are further compared pairwise using randomized testing (Figure 3). The difference in mAP (delta mAP) between two models is chosen as the test statistic and recorded. For 10,000 iterations using seeds 0 to 9,999, the AP values generated by both models for all cancer cell lines are shuffled, and the resulting mAP values and delta mAP associated with each seed are recorded. A p-value describing the significance of the difference between the two models is generated by comparing the test statistic to the distribution of 10,000 delta mAP values.
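One way to realize this test is sketched below in Python. The text does not specify how the AP values are shuffled or how the p-value is derived from the distribution, so two assumptions are made and should be read as such: shuffling is done by randomly swapping the two models' AP values within each cell line (a paired randomization test), and the p-value is two-sided, taken as the fraction of shuffled delta mAP values at least as extreme as the observed one. The function name and input data are hypothetical.

```python
import random


def randomized_test(ap_model_a, ap_model_b, iterations=10_000):
    """Paired randomization test on delta mAP between two models.

    ap_model_a, ap_model_b: AP values per cell line, in the same cell line order.
    Returns (observed delta mAP, two-sided p-value).
    """
    n = len(ap_model_a)
    observed = sum(ap_model_a) / n - sum(ap_model_b) / n

    deltas = []
    for seed in range(iterations):  # seeds 0 to iterations - 1
        rng = random.Random(seed)
        shuffled_a, shuffled_b = [], []
        for a, b in zip(ap_model_a, ap_model_b):
            # Assumption: swap the two models' values for this cell line
            # with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        deltas.append(sum(shuffled_a) / n - sum(shuffled_b) / n)

    # Assumption: two-sided comparison against the shuffled distribution.
    extreme = sum(abs(d) >= abs(observed) for d in deltas)
    return observed, extreme / iterations


# Hypothetical AP values for ten cell lines under two models.
ap_a = [0.90, 0.80, 0.85, 0.95, 0.90, 0.80, 0.90, 0.85, 0.90, 0.95]
ap_b = [0.10, 0.20, 0.15, 0.05, 0.10, 0.20, 0.10, 0.15, 0.10, 0.05]

delta_map, p_value = randomized_test(ap_a, ap_b, iterations=1_000)
```

Seeding each iteration separately mirrors the description of recording the values associated with each seed, and makes any single iteration reproducible on its own.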