Cancer is an evolutionary process1,2, in which the heritable accumulation of somatic mutations in the genome of cancer cells results in the formation of heterogeneous subpopulations of cancer cells, referred to as intra-tumour heterogeneity (ITH)3–5. Most cancer evolution studies quantify ITH from DNA sequencing data by identifying the unique complements of somatic mutations that are carried by these different subpopulations of cells, or ‘subclones’. Accurately reconstructing the genomic profile of each subclone, and inferring the evolutionary hierarchy between the subclones present in a tumour is important, not only for studying the biology of the disease trajectory, but because a tumour subclone harbouring a treatment-resistant genomic variant could have important clinical implications, and could be used to guide therapeutic decision making6–8.
In recent years, progress in next-generation sequencing technology and computational methodology has revealed significant ITH in several cancer types3–5. However, a single tumour tissue biopsy sample may contain a mixture of many thousands of heterogeneous normal and cancer cells, making the full deconvolution of subclonal populations and their phylogenetic ordering from bulk DNA sequencing challenging. Typically, subclonal reconstruction algorithms leverage the observed variant allele frequency (VAF) of single-nucleotide mutations measured from aligned DNA sequencing reads in order to quantify the prevalence of somatic events9–13. Due to the presence of somatic copy number alterations (SCNAs) and normal cell admixtures, the VAF is not an accurate estimator of the population frequency of the variant. Therefore, most existing algorithms apply different approaches to correct the VAF for tumour purity and SCNAs to infer estimates of the cancer cell fraction (CCF) of a mutation, which defines the proportion of cancer cells in the sample that carry the mutation3,8,14. To reconstruct clonal evolution, these computational methods cluster together mutations with similar CCFs in all samples sequenced into ‘subclonal clusters’, under the assumption that they are likely present in a similar set of cells and that represent a clonal expansion at a similar evolutionary time point9,11–15. Then, by nesting subclonal cluster CCFs based on evolutionary principles for constraining lineage relationships, such as the ‘pigeonhole principle’ and ‘crossing rule’ (Supplementary Methods), algorithms seek to infer the evolutionary ordering of clusters and reconstruct the full tumour phylogenetic tree3,11–13,16–18 (Table 1; Figure 2).
Three key challenges make the accurate estimation of mutation CCFs from bulk sequencing data, assigning mutations to clusters, and inferring evolutionary ordering between mutation clusters non-trivial.
First, errors in both mutation and copy number calling may result in errors in the estimated CCFs and, hence, in false mutation clusters, which reflect the presence of errors (e.g., sequencing artefacts, misalignments, etc) rather than true biological signals. For example, subclonal SCNAs undetected by copy-number calling algorithms can result in a genomically clustered group of mutations having a distinct CCF which reflects the copy number event and not the true underlying prevalence of the mutations (Figure 1). Unless explicitly removed, such clusters will be propagated and will impact the phylogenetic tree reconstruction. However, most of the existing algorithms that cluster mutations and reconstruct tumour phylogenetic trees assume that the input data is error free11, either in terms of SNVs11,12,17,19–21, SCNAs11,20,22,23, or both12,13,20. Thus, a cluster resulting from mutation or SCNA errors will be given equal weight to a bona-fide mutation cluster which might erroneously impact the reconstruction of the tumour phylogenetic tree.
Second, SCNAs can result in the loss of mutations when SCNAs delete the genomic segments that contain the locus of their mutated alleles3,14. When analysing these lost mutations, their CCFs are lower than the CCFs of the other mutations that represent the same clonal expansion (i.e. that are part of the same edge of the tumour phylogenetic tree). Hence, mutation losses violate the commonly enforced infinite sites assumption (i.e., the assumption in which mutations occur at most once at a particular genomic locus and cannot be lost by reversion mutation11–13). In this paper, we refer to the fraction of cancer cells that either carry a mutation, or whose ancestors carried the mutation before mutation loss, as the phylogenetic cancer cell fraction (PhyloCCF)14. This concept has been introduced and used in previous studies3,14.
Finally, most current subclonal reconstruction methods are limited in their ability to accurately cluster and construct phylogenetic trees based on large multi-sample studies. In particular, to account for SCNAs during the estimation of CCFs from the observed VAFs, some phylogenetic reconstruction algorithms aim to jointly model the evolution of SNVs and SCNAs12,14,19,24. However, due to the complexity of these models, these algorithms do not scale to the high numbers of mutations found in the whole-genome and whole-sequencing studies4,8,25, and neither to the large number of tumour samples sequenced in recent multi-sample tumour studies3,8,26–28.
To address previous limitations, we develop CONIPHER (COrrecting Noise In PHylogenetic Evaluation and Reconstruction), a novel algorithm to automatically reconstruct subclonal mutation clusters, tumour phylogeny and subclone cell proportions from bulk sequencing data and account for uncertainty. CONIPHER is characterised by three novel features that address the key challenges in phylogenetic reconstruction described above: (1) an approach to remove biologically improbable clusters that either are driven by likely-erroneous mutations or by subclonal SCNAs, (2) a method to correct for complex evolutionary events, including mutation losses14, and (3) an efficient extension of previous and new approaches that allows CONIPHER to scale to a high number of primary tumour samples per patient. In this protocol, we outline the CONIPHER method and also detail how to practically use our tool. We show that CONIPHER outperforms previous algorithms on simulations.
In addition, despite the rich literature on tumour phylogeny reconstruction10–13,17–21, how features of the inferred tumour phylogenies relate to the biology of tumour growth, in terms of selection, mutation rates and rates of chromosomal instability, remains unclear. This protocol enables a user-friendly, straightforward computational framework for analysis of tumour phylogenies in R, including calculation of subclone proportions in each tumour sample. In fact, CONIPHER has been used to automatically reconstruct the tumour phylogenetic trees for 421 patients with non-small cell lung cancers (NSCLC) with primary and metastatic disease in the recent TRACERx421 study26,27.
Development of the protocol
Automated tumour phylogenetic reconstruction from bulk DNA sequencing of tumours with a large number of mutations enables an in depth analysis of tumour evolution. To accurately reconstruct the tumour phylogenetic tree we posit that it is imperative to account for mutation losses and erroneously clustered mutations. Correct tree reconstruction will affect interpretation of downstream analyses of evolutionary relationships between specific driver mutations, and inference of metastatic seeding and dissemination patterns. Hence, we created CONIPHER to process and construct tumour phylogenetic trees for 432 tumours from 421 patients with NSCLC from the TRACERx lung cohort26,27.
Overview of the CONIPHER method
CONIPHER takes as input processed mutation data from bulk DNA sequencing (for example using Mutect229 and Varscan30), as well as SCNAs, purity and ploidy, which can be computed by existing and well established methods, such as ASCAT31, HATCHet32, Sequenza33 and Battenberg34. CONIPHER subsequently performs mutation clustering, followed by tumour phylogeny reconstruction (Figure 1), and finally computes subclone proportions. Below, we describe an overview of the method. Full details of statistics and exact values of the parameters and thresholds used are described in Supplementary Methods.
Subclonal mutation clustering
The first step in CONIPHER is the estimation of PhyloCCFs and clustering of somatic mutations. This step can be broken down into four main components, which were designed to minimise the error introduced at each subsequent step. First, copy number preprocessing of every mutation is performed (Figure 1a), in which the PhyloCCF of every mutation is calculated, by transforming the measured VAF by expected mutation copy number and tumour purity to compute the CCF metric3,14,35,36, and taking into account both clonal and subclonal SCNAs3,26,27. Secondly, a pre-clustering step is implemented to split mutations in different groups, such that each group only contains mutations that are clearly present or clearly absent in the same set of tumour samples (Figure 1b). Similar to recent methods21, this step prevents the mixing of these mutations in the same cluster, an error that has been observed for most existing mutation clustering algorithms21. Thirdly, CONIPHER applies Dirichlet clustering using the PyClone algorithm (v0.13.19) to each group of mutations separately to identify the candidate mutation clusters (Figure 1c). Finally, post-processing is performed on the inferred mutation clusters, in which mutation clusters are removed that comprise a small number of mutations (user-defined, for whole exome sequencing a default threshold of 5 is used) and two subclonal clusters are merged if their difference is driven solely by a subclonal copy number correction (Figure 1d). Full details of the method are described in our companion manuscript26.
Phylogenetic tree building
The second and main step of CONIPHER is reconstruction of the tumour phylogenetic tree. This step takes output from the previously performed mutation clustering as input, namely, inferred assignments of mutations to mutation clusters, and mutation PhyloCCF estimates. Notably, this step is compatible with mutation clustering performed from other methods. The phylogenetic tree building step can be broken down into four main components: cluster nesting, growing the tree, enumerating the solution space of alternative phylogenies, and computing subclone proportions.
Mutation cluster nesting
First, 95% confidence intervals are computed to obtain estimates for average PhyloCCF values for each mutation cluster identified in the clustering step, in each tumour sample (Figure 1e, Supplementary Methods). Secondly, two one-sided tests are performed comparing PhyloCCF values between every possible pair of clusters in each tumour sample, in order to determine whether one cluster could potentially be nested within the other. The truncal cluster is assigned as the cluster that can nest all other clusters (Figure 1f). A test is additionally performed to check whether each cluster could be classified as subclonal, or whether it is indistinguishable from the truncal cluster within each tumour sample (Supplementary Methods26). In order to prevent artefactual mutation clusters from being assigned to a branch of the phylogenetic tree, the genomic positions of mutations within each cluster are inspected. If all mutations in a cluster are less evenly distributed across chromosomes than would be expected based on the distribution of mutations across chromosomes in the truncal cluster, the cluster is deemed as potentially copy number driven and therefore removed from subsequent analysis. Cluster nesting is summarised as a nesting matrix and can be represented as an ancestral graph.
Growing the phylogenetic tree
Then, the ancestral graph is pruned to attempt to produce a tree structure with no cycles (Figure 1g). This method favours a more linear tree topology structure, as opposed to a more branched structure. Subsequently, clusters are removed from the tree that are the cause of the following issues: (i) circles in the tree, or (ii) CCFs of tree branches at each tree level exceeding a user-defined threshold (by default a CCF buffer of 10% is used, Supplementary Methods). Clusters are removed such that the fewest mutations possible are removed from the phylogenetic tree. This step returns one ‘default’ tumour phylogenetic tree.
Growing the forest
After identifying the default tree, our algorithm enumerates all possible alternative phylogenies that fit the identified cluster nesting structure of the pruned ancestral graph (Figure 1h). First, all combinations of clusters are identified that could be moved to descend from a different parental node, without causing issues (i) or (ii) (Supplementary Methods). All possible phylogenetic trees are provided as output.
After all potential trees are identified, tree branches, or edges, that are common to all trees are classified as “consensus” branches, conversely, branches that are found in only a subset of trees are classified as “non-consensus” branches.
CONIPHER additionally provides two methods for summarising the solution space of multiple phylogenetic trees per tumour (Figure 1i). First, CONIPHER computes the tree(s) that generates the lowest amount of nesting error, which we term the sum condition error (SCE). Secondly, CONIPHER computes the tree(s) comprising branches, or tree edges, most commonly shared amongst alternative trees in the solution space, by computing the edge probability. A full description of the computation of the SCE and edge probability metrics can be found in Supplementary Methods.
Computing subclone proportions
Finally, CONIPHER automatically computes the proportion of cells in each tumour sample belonging to each genomically homogeneous subclone, or the "subclone proportions", based on the inferred default tree and tumour phylogeny with lowest SCE (Figure 1j, Supplementary Methods). Notably, subclone proportions will sum to 1 in each tumour sample and will only correspond to the mutation cluster PhyloCCF in the case of terminal nodes on the phylogenetic tree. This enables an analysis of recent subclonal expansions in a tumour, which was found to be prognostic in our companion manuscript26.
Benchmarking and evaluating the performance of CONIPHER
A realistic simulation framework for tumour evolution
We benchmarked the performance of CONIPHER using a set of 150 ground truth simulations that comprise generated tumour phylogenies, mutation clusters and related bulk sequencing data26. The ground truth simulations were designed to model the evolution of genetic variants frequently observed in NSCLC, including SCNAs and whole genome doubling (WGD) events that can occur truncally or subclonally. In particular, the simulation framework models the effect of such genetic alterations on SNV mutation loss and SNV multiplicity. Three distinct categories of simulated datasets were generated: 50 simulated datasets with 2-3 samples per tumour (low category), 50 simulated datasets with 4-7 samples per tumour (medium category) and 50 simulated datasets with >7 samples per tumour (high category), totalling a collection of 150 simulated datasets. Full mathematical details of the simulation framework are described in our companion paper26.
Comparison of CONIPHER with current state-of-the-art tools
Based on the ground truth simulations generated using the simulation framework26, we compared CONIPHER for reconstructing tumour subclonal mutation clusters and inferring tumour phylogeny with current state-of-the-art approaches (Figure 3). We compared our clustering method with PyClone, as well as our clustering and phylogenetic tree building method with PhyloWGS12, LICHeE13, CITUP11 and Pairtree18 (Figure 3). We additionally performed benchmarking of our tree building method only, by using simulated ground truth clusters and applying CONIPHER to reconstruct the phylogenetic trees. We compared the latter benchmarking to LICHeE, CITUP and Pairtree (Figure 3). Table 1 compares functionalities and methodology between these methods and CONIPHER. Overall, CONIPHER is able to identify mutation clusters (Figure 3a) and reconstruct tumour phylogenies (Figure 3b) with higher accuracy than other methods.
Scalability of method
We first compared the scalability of CONIPHER against the current state-of-the art methods. We found that CONIPHER and Pairtree were able to infer tumour phylogeny for every simulated dataset whereas other methods failed to run or complete the reconstruction within the time frame allowed (8 hours). In particular, PhyloWGS was unable to complete tumour phylogenetic reconstruction on any of the simulated datasets in the medium or high category and only able to reconstruct 3/50 trees in the low category (Figure 3c).
Presence-absence informed clustering
We explicitly compared the performance of our mutation clustering to other methods, to evaluate how differences in the clustering would affect tree building downstream (Figure 3d). We found that CONIPHER and LiCHeE had the highest mutation presence precision in every tumour sample, compared to CITUP and Pairtree. In particular, the presence-absence classification step in CONIPHER led to improved mutation presence precision in the high category, compared to the other methods for which performance decreased with larger simulations.
Measuring mutation losses
CONIPHER considers the possibility of mutation losses in tumour evolution, and aims to correct for these when performing mutation clustering, as described above (Figure 1). We assessed each method’s ability to account for mutation losses by evaluating the sensitivity in the identification of truncal mutations. Methods that do not account for mutation loss will incorrectly classify truncal mutations that were lost later in tumour evolution as subclonal, such as CITUP and Pairtree, which had a lower truncal sensitivity in all simulation categories (Figure 3e). We observed that when running Pairtree with CONIPHER clustering, the truncal sensitivity was greatly improved, thereby indicating that Pairtree was not directly accounting for mutation loss (Figure 3e). Clustering performance may directly impact the truncal sensitivity independently of tree building, so we also evaluated the performance of each tree building method on the set of ground truth simulated clusters per dataset (Figure 3f & 3g). We found that CITUP failed in all 150/150 (100%) instances, which we hypothesise is due to the inability to account for mutation loss. Pairtree and LiCHeE were able to identify the correct truncal mutation cluster in 83/150 (55%) and 84/150 (56%) of the simulated instances respectively, compared with CONIPHER that was best able to account for mutation loss and correctly identified the truncal mutation cluster in 141/150 (94%) of ground truth instances.
Accurate error removal
Bulk DNA sequencing data may contain a significant degree of error; however, many algorithms for subclonal reconstruction fail to remove potentially noisy mutations (Table 1). CONIPHER aims to identify mutation clusters driven by sequencing noise, and removes these. We evaluated the extent to which we were correctly identifying and removing mutational sequencing noise by injecting an artefactual cluster in the simulation datasets, and comparing the number of simulations in which we remove the artefact cluster (Figure 3h). Notably, the artefact cluster could be compatible with the tree structure (i.e. it was not necessarily biologically implausible). CONIPHER was able to identify and remove error-driven mutations in 77/150 simulated datasets (51%), compared to LICHeE that removed noisy clusters in 3/150 datasets (2%), and CITUP and Pairtree which did not identify the noisy clusters. For simulations with a low number of samples per tumour, CONIPHER also often failed to remove the erroneous cluster (38/50 datasets). In these cases, many ‘error clusters’ still fit the tree, without the need to remove any mutations. By contrast, for simulations with a high number of samples, the erroneous cluster was correctly identified in 38/50 simulated datasets (76%).
Multiple alternative tree solutions
We further used the simulated dataset to benchmark CONIPHER of ranking plausible alternative phylogenetic trees. First, we measured whether phylogenetic tree solutions with higher mutation descendant accuracy gave better performing SCE and edge probability metric scores. We observed that for simulated tumour cases for which CONIPHER identified more than one potential tree structure, the alternative trees that were reconstructed with the highest mutation descendant accuracy had lower sum condition error (SCE) scores compared to less accurate alternative phylogenetic trees (Supplementary Figure 1a, Supplementary Methods). Evaluating the performance of CONIPHER tree building on the set of ground truth clusters from an additional simulated dataset with no mutation loss and no error-driver mutations (simulated dataset 2, Supplementary Methods), we observed that the inferred edges that were present in the ground truth (GT) tree were shared amongst a larger number of alternative tree solutions than edges not present in the GT tree (Supplementary Figure 1b). Finally, we observed that the highest ranking tree solutions based on the SCE and edge probability metrics had a higher descendant accuracy than alternative tree solutions (Supplementary Figure 1b, Supplementary Methods).
Advantages and limitations of CONIPHER
CONIPHER performs mutation clustering and phylogenetic tree building from processed bulk DNA sequencing data. This can be from bulk whole genome sequencing (WGS), whole exome sequencing (WES) or a targeted sequencing approach. It is highly scalable and can reconstruct tumour phylogenies from tumours with many samples and many clusters in a time frame of the order of minutes. CONIPHER assigns mutations to the phylogenetic tree more accurately than other state-of-the-art methods and in particular improves the quality of the mutations assigned to the tree, by taking into account biological constraints in order to remove error-driven signal. CONIPHER for phylogenetic tree building is compatible with input from mutation clustering performed using other methods and automatically computes subclone proportions in each tumour sample.
However, CONIPHER does have limitations. CONIPHER does not currently support raw sequencing data as input and requires processed data from bulk DNA sequencing. In particular, we assume that mutation and copy number calling algorithms have been applied to the raw sequencing data.
CONIPHER is straightforward to implement from the command line, using basic knowledge of Linux/Unix syntax. CONIPHER output is in both human readable form (.tsv files) and additionally .RDS objects for use in the R programming language. Knowledge of scripting languages would be helpful for users who wish to use CONIPHER output for downstream analyses; however, non-experts in bioinformatics should be able to run CONIPHER using the command line only to obtain mutation clustering and tumour phylogenies with correct input data. The current implementation of CONIPHER is written in the R programming language.
The CONIPHER procedure is composed of two main steps: a clustering step and a tree building step (Figure 4, Method Workflow). The clustering step is optional, and can be replaced by a mutation clustering method of the user’s choice. At each step, output directories are generated containing both data and summary plots. Both steps can be run with a wrapper end-to-end; that is, the clustering step automatically generates output that is taken as input to the tree building step. Both steps can be run from the command line.
Our protocol requires as input a file input.tsv, a mutation table containing information about each point mutation in each tumour sample sequenced, as shown in Example 1 (Figure 5). This input table can be used as input for both clustering and tree building steps, with specific column names required for each step. A complete description of all columns required in the input table is shown in Box 1 (Figure 6).
As shown in Example 1, input.tsv is in long format, with a new row for each mutation, for each tumour sample sequenced. Mutation clustering takes as input the genomic position of every mutation in every tumour sample, the copy number at the genomic position of each mutation, and an estimate of the tumour purity (or aberrant cell fraction, ACF) and ploidy (PLOIDY) within each sample (Example 1). Tree building takes the same table as input, with additional columns required (green box, Example 1): mutation cluster assignments (CLUSTER), estimates of the PhyloCCF (CCF_PHYLO) and observed CCF (CCF_OBS), and mutation copy number estimates for each mutation in each sample (MUT_COPY). These data and table columns are generated automatically by the clustering step (Figure 4).
In our companion manuscripts26,27, the convention is to refer to distinct bulk samples taken from one tumour as ‘tumour regions’, however in this manuscript we refer to SAMPLE as the tumour sample identifier. Chromosome names can be either with or without ‘chr’ prefix (e.g. ‘1’ or ‘chr1’). Chromosomes X and Y are ignored in this procedure.