CONIPHER: a computational framework for scalable phylogenetic reconstruction with error correction

doi:10.21203/rs.3.pex-2158/v1

Method Article

This work is licensed under a CC BY 4.0 License

This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.

Version 1

Intra-tumour heterogeneity provides the fuel for the evolution and selection of subclonal tumour cell populations. However, accurate inference of tumour subclonal architecture and reconstruction of tumour evolutionary history from bulk DNA sequencing data remain challenging. Sequencing and alignment artefacts cannot be distinguished from real cancer somatic mutations, and errors in the identification of copy number alterations or complex evolutionary events (e.g. mutation losses) affect the estimated cellular prevalence of mutations, leading to errors in mutation clustering and phylogenetic reconstructions. In this paper we present a new computational framework, CONIPHER (COrrecting Noise In PHylogenetic Evaluation and Reconstruction), that accurately infers subclonal structure and phylogenetic relationships from multi-sample tumour sequencing, accounting for both copy number alterations and mutation errors. CONIPHER outperforms similar methods on simulated datasets, and in particular scales to a large number of tumour samples and clones. As such, CONIPHER enables automated phylogenetic analysis which can be effectively applied to large sequencing datasets generated with different technologies.

Computational biology and bioinformatics

Intra-tumour heterogeneity

cancer

evolution

phylogeny

subclone

sequencing

Introduction

Cancer is an evolutionary process^1,2, in which the heritable accumulation of somatic mutations in the genome of cancer cells results in the formation of heterogeneous subpopulations of cancer cells, referred to as intra-tumour heterogeneity (ITH)^3–5. Most cancer evolution studies quantify ITH from DNA sequencing data by identifying the unique complements of somatic mutations that are carried by these different subpopulations of cells, or ‘subclones’. Accurately reconstructing the genomic profile of each subclone, and inferring the evolutionary hierarchy between the subclones present in a tumour is important, not only for studying the biology of the disease trajectory, but because a tumour subclone harbouring a treatment-resistant genomic variant could have important clinical implications, and could be used to guide therapeutic decision making^6–8.

In recent years, progress in next-generation sequencing technology and computational methodology has revealed significant ITH in several cancer types^3–5. However, a single tumour tissue biopsy sample may contain a mixture of many thousands of heterogeneous normal and cancer cells, making the full deconvolution of subclonal populations and their phylogenetic ordering from bulk DNA sequencing challenging. Typically, subclonal reconstruction algorithms leverage the observed variant allele frequency (VAF) of single-nucleotide mutations measured from aligned DNA sequencing reads in order to quantify the prevalence of somatic events^9–13. Due to the presence of somatic copy number alterations (SCNAs) and normal cell admixtures, the VAF is not an accurate estimator of the population frequency of the variant. Therefore, most existing algorithms apply different approaches to correct the VAF for tumour purity and SCNAs to infer estimates of the cancer cell fraction (CCF) of a mutation, which defines the proportion of cancer cells in the sample that carry the mutation^3,8,14. To reconstruct clonal evolution, these computational methods cluster together mutations with similar CCFs in all samples sequenced into ‘subclonal clusters’, under the assumption that they are likely present in a similar set of cells and that they represent a clonal expansion at a similar evolutionary time point^9,11–15. Then, by nesting subclonal cluster CCFs based on evolutionary principles for constraining lineage relationships, such as the ‘pigeonhole principle’ and ‘crossing rule’ (Supplementary Methods), algorithms seek to infer the evolutionary ordering of clusters and reconstruct the full tumour phylogenetic tree^3,11–13,16–18 (Table 1; Figure 2).

Three key challenges make the accurate estimation of mutation CCFs from bulk sequencing data, assigning mutations to clusters, and inferring evolutionary ordering between mutation clusters non-trivial.

First, errors in both mutation and copy number calling may result in errors in the estimated CCFs and, hence, in false mutation clusters, which reflect the presence of errors (e.g., sequencing artefacts, misalignments, etc.) rather than true biological signals. For example, subclonal SCNAs undetected by copy-number calling algorithms can result in a genomically clustered group of mutations having a distinct CCF which reflects the copy number event and not the true underlying prevalence of the mutations (Figure 1). Unless explicitly removed, such clusters will be propagated and will impact the phylogenetic tree reconstruction. However, most of the existing algorithms that cluster mutations and reconstruct tumour phylogenetic trees assume that the input data is error free¹¹, either in terms of SNVs^11,12,17,19–21, SCNAs^11,20,22,23, or both^12,13,20. Thus, a cluster resulting from mutation or SCNA errors will be given equal weight to a bona fide mutation cluster, which might erroneously impact the reconstruction of the tumour phylogenetic tree.

Second, SCNAs can result in the loss of mutations when SCNAs delete the genomic segments that contain the locus of their mutated alleles^3,14. When analysing these lost mutations, their CCFs are lower than the CCFs of the other mutations that represent the same clonal expansion (i.e. that are part of the same edge of the tumour phylogenetic tree). Hence, mutation losses violate the commonly enforced infinite sites assumption (i.e., the assumption in which mutations occur at most once at a particular genomic locus and cannot be lost by reversion mutation^11–13). In this paper, we refer to the fraction of cancer cells that either carry a mutation, or whose ancestors carried the mutation before mutation loss, as the phylogenetic cancer cell fraction (PhyloCCF)¹⁴. This concept has been introduced and used in previous studies^3,14.

Finally, most current subclonal reconstruction methods are limited in their ability to accurately cluster and construct phylogenetic trees based on large multi-sample studies. In particular, to account for SCNAs during the estimation of CCFs from the observed VAFs, some phylogenetic reconstruction algorithms aim to jointly model the evolution of SNVs and SCNAs^12,14,19,24. However, due to the complexity of these models, these algorithms scale neither to the high numbers of mutations found in whole-genome and whole-exome sequencing studies^4,8,25, nor to the large number of tumour samples sequenced in recent multi-sample tumour studies^3,8,26–28.

To address previous limitations, we develop CONIPHER (COrrecting Noise In PHylogenetic Evaluation and Reconstruction), a novel algorithm to automatically reconstruct subclonal mutation clusters, tumour phylogeny and subclone cell proportions from bulk sequencing data and account for uncertainty. CONIPHER is characterised by three novel features that address the key challenges in phylogenetic reconstruction described above: (1) an approach to remove biologically improbable clusters that either are driven by likely-erroneous mutations or by subclonal SCNAs, (2) a method to correct for complex evolutionary events, including mutation losses¹⁴, and (3) an efficient extension of previous and new approaches that allows CONIPHER to scale to a high number of primary tumour samples per patient. In this protocol, we outline the CONIPHER method and also detail how to practically use our tool. We show that CONIPHER outperforms previous algorithms on simulations.

In addition, despite the rich literature on tumour phylogeny reconstruction^10–13,17–21, how features of the inferred tumour phylogenies relate to the biology of tumour growth, in terms of selection, mutation rates and rates of chromosomal instability, remains unclear. This protocol provides a user-friendly, straightforward computational framework for analysis of tumour phylogenies in R, including calculation of subclone proportions in each tumour sample. CONIPHER has been used to automatically reconstruct the tumour phylogenetic trees for 421 patients with non-small cell lung cancers (NSCLC) with primary and metastatic disease in the recent TRACERx421 study^26,27.

Results

Development of the protocol

Automated tumour phylogenetic reconstruction from bulk DNA sequencing of tumours with a large number of mutations enables an in-depth analysis of tumour evolution. To accurately reconstruct the tumour phylogenetic tree, we posit that it is imperative to account for mutation losses and erroneously clustered mutations. Correct tree reconstruction will affect interpretation of downstream analyses of evolutionary relationships between specific driver mutations, and inference of metastatic seeding and dissemination patterns. Hence, we created CONIPHER to process and construct tumour phylogenetic trees for 432 tumours from 421 patients with NSCLC from the TRACERx lung cohort^26,27.

Overview of the CONIPHER method

CONIPHER takes as input processed mutation data from bulk DNA sequencing (for example using Mutect2²⁹ and Varscan³⁰), as well as SCNAs, purity and ploidy, which can be computed by existing and well established methods, such as ASCAT³¹, HATCHet³², Sequenza³³ and Battenberg³⁴. CONIPHER subsequently performs mutation clustering, followed by tumour phylogeny reconstruction (Figure 1), and finally computes subclone proportions. Below, we describe an overview of the method. Full details of statistics and exact values of the parameters and thresholds used are described in Supplementary Methods.

Subclonal mutation clustering

The first step in CONIPHER is the estimation of PhyloCCFs and clustering of somatic mutations. This step can be broken down into four main components, which were designed to minimise the error introduced at each subsequent step. First, copy number preprocessing of every mutation is performed (Figure 1a), in which the PhyloCCF of every mutation is calculated by transforming the measured VAF by expected mutation copy number and tumour purity to compute the CCF metric^3,14,35,36, taking into account both clonal and subclonal SCNAs^3,26,27. Secondly, a pre-clustering step is implemented to split mutations into different groups, such that each group only contains mutations that are clearly present or clearly absent in the same set of tumour samples (Figure 1b). Similar to recent methods²¹, this step prevents the mixing of these mutations in the same cluster, an error that has been observed for most existing mutation clustering algorithms²¹. Thirdly, CONIPHER applies Dirichlet clustering using the PyClone algorithm (v0.13.19) to each group of mutations separately to identify the candidate mutation clusters (Figure 1c). Finally, post-processing is performed on the inferred mutation clusters: clusters that comprise a small number of mutations are removed (the threshold is user-defined; for whole exome sequencing a default threshold of 5 is used), and two subclonal clusters are merged if their difference is driven solely by a subclonal copy number correction (Figure 1d). Full details of the method are described in our companion manuscript²⁶.
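For orientation, the relationship between the observed VAF and the CCF that underlies the first component can be written out as follows. This is the standard purity and copy number correction from the literature cited above, not CONIPHER's exact PhyloCCF computation, which additionally corrects for subclonal SCNAs and is given in the Supplementary Methods. The symbols are introduced here for illustration only: \rho is the tumour purity (column ACF), N_T is the total tumour copy number at the locus (COPY_NUMBER_A + COPY_NUMBER_B), N_N = 2 is the assumed normal copy number on autosomes, n_mut is the resulting mutation copy number estimate, and m is the assumed number of mutated copies per mutant cell.

```latex
\mathrm{VAF} = \frac{\mathrm{VAR\_COUNT}}{\mathrm{DEPTH}}, \qquad
n_{\mathrm{mut}} = \mathrm{VAF} \cdot \frac{\rho N_T + (1-\rho) N_N}{\rho}, \qquad
\mathrm{CCF} = \frac{n_{\mathrm{mut}}}{m}
```

As a worked example, a mutation with 43 variant reads at depth 198 in a sample with purity 0.26 and total copy number 3 has VAF ≈ 0.217 and n_mut ≈ 0.217 × (0.78 + 1.48) / 0.26 ≈ 1.89. If one assumes m = 2, this corresponds to a CCF of roughly 0.94.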

Phylogenetic tree building

The second and main step of CONIPHER is reconstruction of the tumour phylogenetic tree. This step takes output from the previously performed mutation clustering as input, namely, inferred assignments of mutations to mutation clusters, and mutation PhyloCCF estimates. Notably, this step is compatible with mutation clustering performed from other methods. The phylogenetic tree building step can be broken down into four main components: cluster nesting, growing the tree, enumerating the solution space of alternative phylogenies, and computing subclone proportions.

Mutation cluster nesting

First, 95% confidence intervals are computed to obtain estimates for average PhyloCCF values for each mutation cluster identified in the clustering step, in each tumour sample (Figure 1e, Supplementary Methods). Secondly, two one-sided tests are performed comparing PhyloCCF values between every possible pair of clusters in each tumour sample, in order to determine whether one cluster could potentially be nested within the other. The truncal cluster is assigned as the cluster that can nest all other clusters (Figure 1f). A test is additionally performed to check whether each cluster could be classified as subclonal, or whether it is indistinguishable from the truncal cluster within each tumour sample (Supplementary Methods²⁶). In order to prevent artefactual mutation clusters from being assigned to a branch of the phylogenetic tree, the genomic positions of mutations within each cluster are inspected. If the mutations in a cluster are less evenly distributed across chromosomes than would be expected based on the distribution of mutations across chromosomes in the truncal cluster, the cluster is deemed potentially copy number driven and is therefore removed from subsequent analysis. Cluster nesting is summarised as a nesting matrix and can be represented as an ancestral graph.

Growing the phylogenetic tree

Then, the ancestral graph is pruned to attempt to produce a tree structure with no cycles (Figure 1g). This method favours a more linear tree topology over a more branched one. Subsequently, clusters that cause either of the following issues are removed from the tree: (i) cycles in the tree, or (ii) CCFs of tree branches at each tree level exceeding a user-defined threshold (by default a CCF buffer of 10% is used, Supplementary Methods). Clusters are removed such that the fewest mutations possible are removed from the phylogenetic tree. This step returns one ‘default’ tumour phylogenetic tree.
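Issue (ii) can be made concrete with the pigeonhole principle mentioned in the Introduction. The inequality below is one common formalisation, written under the assumption that CCFs are expressed as fractions. The notation (C(p) for the children of cluster p, s for the tumour sample, \epsilon for the buffer) is introduced here for illustration, and the exact test applied by CONIPHER is the one given in the Supplementary Methods.

```latex
\sum_{c \in C(p)} \mathrm{CCF}_{c,s} \;\le\; \mathrm{CCF}_{p,s} + \epsilon
\quad \text{for every parent cluster } p \text{ and every sample } s,
\qquad \epsilon = 0.10 \text{ by default}
```

Under this reading, two clusters with CCFs of 0.7 and 0.5 in the same sample cannot both descend directly from a truncal cluster with CCF 1, because 0.7 + 0.5 = 1.2 exceeds 1 + 0.10.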

Growing the forest

After identifying the default tree, our algorithm enumerates all possible alternative phylogenies that fit the identified cluster nesting structure of the pruned ancestral graph (Figure 1h). To do so, all combinations of clusters that could be moved to descend from a different parental node without causing issues (i) or (ii) are identified (Supplementary Methods). All possible phylogenetic trees are provided as output.

After all potential trees are identified, tree branches, or edges, that are common to all trees are classified as “consensus” branches; conversely, branches that are found in only a subset of trees are classified as “non-consensus” branches.
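The consensus classification described above can be illustrated with a small shell sketch. It assumes a hypothetical plain-text edge list (one TREE_ID, PARENT, CHILD triple per line, tab-separated), which is not CONIPHER's own output format, and the function name classify_edges is introduced here for illustration.

```shell
# Input (hypothetical format): one edge per line as TREE_ID<TAB>PARENT<TAB>CHILD.
# Output: EDGE<TAB>TREES_WITH_EDGE/TOTAL_TREES<TAB>consensus|non-consensus
classify_edges() {
  awk -F'\t' '
    !tree_seen[$1]++ { n_trees++ }                      # count distinct trees
    !edge_seen[$1, $2, $3]++ { count[$2 "->" $3]++ }    # count each edge once per tree
    END {
      for (edge in count) {
        # An edge is consensus only if every enumerated tree contains it.
        label = (count[edge] == n_trees) ? "consensus" : "non-consensus"
        printf "%s\t%d/%d\t%s\n", edge, count[edge], n_trees, label
      }
    }
  ' | LC_ALL=C sort
}
```

For two alternative trees that share the edge 1->2 but place cluster 3 under different parents, only 1->2 is reported as consensus. The per-edge count is shown for orientation and should not be read as the edge probability metric, which is defined in the Supplementary Methods.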

CONIPHER additionally provides two methods for summarising the solution space of multiple phylogenetic trees per tumour (Figure 1i). First, CONIPHER computes the tree(s) that generates the lowest amount of nesting error, which we term the sum condition error (SCE). Secondly, CONIPHER computes the tree(s) comprising branches, or tree edges, most commonly shared amongst alternative trees in the solution space, by computing the edge probability. A full description of the computation of the SCE and edge probability metrics can be found in Supplementary Methods.

Computing subclone proportions

Finally, CONIPHER automatically computes the proportion of cells in each tumour sample belonging to each genomically homogeneous subclone, or the "subclone proportions", based on the inferred default tree and tumour phylogeny with lowest SCE (Figure 1j, Supplementary Methods). Notably, subclone proportions will sum to 1 in each tumour sample and will only correspond to the mutation cluster PhyloCCF in the case of terminal nodes on the phylogenetic tree. This enables an analysis of recent subclonal expansions in a tumour, which was found to be prognostic in our companion manuscript²⁶.
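The two properties noted above are consistent with defining a subclone's proportion as its cluster CCF minus the CCFs of its direct descendants on the tree. The expression below is a sketch under that assumption, with notation introduced here for illustration (C(i) for the children of cluster i, s for the tumour sample); the procedure CONIPHER actually applies is described in the Supplementary Methods.

```latex
p_{i,s} = \mathrm{CCF}_{i,s} - \sum_{j \in C(i)} \mathrm{CCF}_{j,s},
\qquad
\sum_{i} p_{i,s} = \mathrm{CCF}_{\mathrm{trunk},s} = 1
```

For a terminal node, C(i) is empty and the proportion equals the cluster CCF. Summing over all nodes, each non-truncal CCF appears once with a positive sign and once with a negative sign, so only the truncal CCF of 1 remains.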

Benchmarking and evaluating the performance of CONIPHER

A realistic simulation framework for tumour evolution

We benchmarked the performance of CONIPHER using a set of 150 ground truth simulations that comprise generated tumour phylogenies, mutation clusters and related bulk sequencing data²⁶. The ground truth simulations were designed to model the evolution of genetic variants frequently observed in NSCLC, including SCNAs and whole genome doubling (WGD) events that can occur truncally or subclonally. In particular, the simulation framework models the effect of such genetic alterations on SNV mutation loss and SNV multiplicity. Three distinct categories of simulated datasets were generated: 50 simulated datasets with 2-3 samples per tumour (low category), 50 simulated datasets with 4-7 samples per tumour (medium category) and 50 simulated datasets with >7 samples per tumour (high category), totalling a collection of 150 simulated datasets. Full mathematical details of the simulation framework are described in our companion paper²⁶.

Comparison of CONIPHER with current state-of-the-art tools

Based on the ground truth simulations generated using the simulation framework²⁶, we compared the performance of CONIPHER in reconstructing tumour subclonal mutation clusters and inferring tumour phylogeny with that of current state-of-the-art approaches (Figure 3). We compared our clustering method with PyClone, and our combined clustering and phylogenetic tree building method with PhyloWGS¹², LICHeE¹³, CITUP¹¹ and Pairtree¹⁸ (Figure 3). We additionally benchmarked our tree building method alone, by applying CONIPHER to the simulated ground truth clusters to reconstruct the phylogenetic trees, and compared the results to LICHeE, CITUP and Pairtree (Figure 3). Table 1 compares functionalities and methodology between these methods and CONIPHER. Overall, CONIPHER is able to identify mutation clusters (Figure 3a) and reconstruct tumour phylogenies (Figure 3b) with higher accuracy than other methods.

Scalability of method

We first compared the scalability of CONIPHER against the current state-of-the-art methods. We found that CONIPHER and Pairtree were able to infer tumour phylogeny for every simulated dataset, whereas other methods failed to run or to complete the reconstruction within the time frame allowed (8 hours). In particular, PhyloWGS was unable to complete tumour phylogenetic reconstruction on any of the simulated datasets in the medium or high category and was only able to reconstruct 3/50 trees in the low category (Figure 3c).

Presence-absence informed clustering

We explicitly compared the performance of our mutation clustering to other methods, to evaluate how differences in the clustering would affect tree building downstream (Figure 3d). We found that CONIPHER and LICHeE had the highest mutation presence precision in every tumour sample, compared to CITUP and Pairtree. In particular, the presence-absence classification step in CONIPHER led to improved mutation presence precision in the high category, compared to the other methods for which performance decreased with larger simulations.

Measuring mutation losses

CONIPHER considers the possibility of mutation losses in tumour evolution, and aims to correct for these when performing mutation clustering, as described above (Figure 1). We assessed each method’s ability to account for mutation losses by evaluating the sensitivity in the identification of truncal mutations. Methods that do not account for mutation loss will incorrectly classify truncal mutations that were lost later in tumour evolution as subclonal. Consistent with this, CITUP and Pairtree had a lower truncal sensitivity in all simulation categories (Figure 3e). We observed that when running Pairtree with CONIPHER clustering, the truncal sensitivity was greatly improved, indicating that Pairtree was not directly accounting for mutation loss (Figure 3e). Clustering performance may directly impact the truncal sensitivity independently of tree building, so we also evaluated the performance of each tree building method on the set of ground truth simulated clusters per dataset (Figure 3f & 3g). We found that CITUP failed in all 150 instances (100%), which we hypothesise is due to its inability to account for mutation loss. Pairtree and LICHeE were able to identify the correct truncal mutation cluster in 83/150 (55%) and 84/150 (56%) of the simulated instances, respectively. By comparison, CONIPHER was best able to account for mutation loss and correctly identified the truncal mutation cluster in 141/150 (94%) of ground truth instances.

Accurate error removal

Bulk DNA sequencing data may contain a significant degree of error; however, many algorithms for subclonal reconstruction fail to remove potentially noisy mutations (Table 1). CONIPHER aims to identify mutation clusters driven by sequencing noise, and removes these. We evaluated the extent to which we were correctly identifying and removing mutational sequencing noise by injecting an artefactual cluster in the simulation datasets, and comparing the number of simulations in which we remove the artefact cluster (Figure 3h). Notably, the artefact cluster could be compatible with the tree structure (i.e. it was not necessarily biologically implausible). CONIPHER was able to identify and remove error-driven mutations in 77/150 simulated datasets (51%), compared to LICHeE that removed noisy clusters in 3/150 datasets (2%), and CITUP and Pairtree which did not identify the noisy clusters. For simulations with a low number of samples per tumour, CONIPHER also often failed to remove the erroneous cluster (38/50 datasets). In these cases, many ‘error clusters’ still fit the tree, without the need to remove any mutations. By contrast, for simulations with a high number of samples, the erroneous cluster was correctly identified in 38/50 simulated datasets (76%).

Multiple alternative tree solutions

We further used the simulated dataset to benchmark CONIPHER's ranking of plausible alternative phylogenetic trees. First, we measured whether phylogenetic tree solutions with higher mutation descendant accuracy gave better performing SCE and edge probability metric scores. We observed that for simulated tumour cases for which CONIPHER identified more than one potential tree structure, the alternative trees that were reconstructed with the highest mutation descendant accuracy had lower sum condition error (SCE) scores compared to less accurate alternative phylogenetic trees (Supplementary Figure 1a, Supplementary Methods). Evaluating the performance of CONIPHER tree building on the set of ground truth clusters from an additional simulated dataset with no mutation loss and no error-driven mutations (simulated dataset 2, Supplementary Methods), we observed that the inferred edges that were present in the ground truth (GT) tree were shared amongst a larger number of alternative tree solutions than edges not present in the GT tree (Supplementary Figure 1b). Finally, we observed that the highest ranking tree solutions based on the SCE and edge probability metrics had a higher descendant accuracy than alternative tree solutions (Supplementary Figure 1b, Supplementary Methods).

Advantages and limitations of CONIPHER

CONIPHER performs mutation clustering and phylogenetic tree building from processed bulk DNA sequencing data. This can be from bulk whole genome sequencing (WGS), whole exome sequencing (WES) or a targeted sequencing approach. It is highly scalable and can reconstruct tumour phylogenies from tumours with many samples and many clusters in a time frame of the order of minutes. CONIPHER assigns mutations to the phylogenetic tree more accurately than other state-of-the-art methods and in particular improves the quality of the mutations assigned to the tree, by taking into account biological constraints in order to remove error-driven signal. CONIPHER for phylogenetic tree building is compatible with input from mutation clustering performed using other methods and automatically computes subclone proportions in each tumour sample.

However, CONIPHER does have limitations. CONIPHER does not currently support raw sequencing data as input and requires processed data from bulk DNA sequencing. In particular, we assume that mutation and copy number calling algorithms have been applied to the raw sequencing data.

Required expertise

CONIPHER is straightforward to implement from the command line, using basic knowledge of Linux/Unix syntax. CONIPHER output is in both human readable form (.tsv files) and additionally .RDS objects for use in the R programming language. Knowledge of scripting languages would be helpful for users who wish to use CONIPHER output for downstream analyses; however, non-experts in bioinformatics should be able to run CONIPHER using the command line only to obtain mutation clustering and tumour phylogenies with correct input data. The current implementation of CONIPHER is written in the R programming language.

Experimental design

The CONIPHER procedure is composed of two main steps: a clustering step and a tree building step (Figure 4, Method Workflow). The clustering step is optional, and can be replaced by a mutation clustering method of the user’s choice. At each step, output directories are generated containing both data and summary plots. Both steps can be run with a wrapper end-to-end; that is, the clustering step automatically generates output that is taken as input to the tree building step. Both steps can be run from the command line.

Input data

Our protocol requires as input a file input.tsv, a mutation table containing information about each point mutation in each tumour sample sequenced, as shown in Example 1 (Figure 5). This input table can be used as input for both clustering and tree building steps, with specific column names required for each step. A complete description of all columns required in the input table is shown in Box 1 (Figure 6).

As shown in Example 1, input.tsv is in long format, with a new row for each mutation, for each tumour sample sequenced. Mutation clustering takes as input the genomic position of every mutation in every tumour sample, the copy number at the genomic position of each mutation, and an estimate of the tumour purity (or aberrant cell fraction, ACF) and ploidy (PLOIDY) within each sample (Example 1). Tree building takes the same table as input, with additional columns required (green box, Example 1): mutation cluster assignments (CLUSTER), estimates of the PhyloCCF (CCF_PHYLO) and observed CCF (CCF_OBS), and mutation copy number estimates for each mutation in each sample (MUT_COPY). These data and table columns are generated automatically by the clustering step (Figure 4).

Conventions

In our companion manuscripts^26,27, the convention is to refer to distinct bulk samples taken from one tumour as ‘tumour regions’; however, in this manuscript we refer to SAMPLE as the tumour sample identifier. Chromosome names can be given either with or without the ‘chr’ prefix (e.g. ‘1’ or ‘chr1’). Chromosomes X and Y are ignored in this procedure.

EQUIPMENT

Data files are required for each tumour in the analysis cohort (input.tsv) as described in section Input data.
A standard computer system with a Linux operating system is required to run CONIPHER from the command line. CONIPHER can be run using access to Conda. Details can be found in Software Requirements.

Programme source code is publicly available for our CONIPHER tree building R package at https://github.com/McGranahanLab/CONIPHER, and for our CONIPHER clustering and tree building wrapper at https://github.com/McGranahanLab/CONIPHER-wrapper.

EQUIPMENT SETUP

Hardware requirements

Memory requirements depend on whether the input data is from whole exome or whole genome sequencing. It is recommended to run the method with at least 8 GB of memory.

Software requirements

Access to a high performance computing (HPC) system is recommended for tumours with a large number of samples and mutations, but is not essential to run clustering and tree building using CONIPHER. CONIPHER can be run using Conda, and a Conda environment containing all R packages required to run the full pipeline can be installed using the instructions given on the CONIPHER-wrapper GitHub page https://github.com/McGranahanLab/CONIPHER-wrapper.

Installation

The CONIPHER code repository can be downloaded from GitHub. We have created an R package for CONIPHER tree building, with full package installation and run instructions at https://github.com/McGranahanLab/CONIPHER. We have additionally created a GitHub repository with a CONIPHER wrapper to run both CONIPHER clustering and tree building end-to-end (https://github.com/McGranahanLab/CONIPHER-wrapper). Instructions for creating the Conda environment required to run clustering and tree building are detailed in the README.md file in the CONIPHER-wrapper GitHub repository.

Data preprocessing

Preprocessing of input.tsv. To run the mutation clustering step, the input table should contain all columns described in Box 1 (Figure 6) up to the dashed line. The columns required for tree building additionally include MUT_COPY, CCF_PHYLO, CCF_OBS and CLUSTER. Optionally, a copy number segmentation file input_seg.tsv can be provided as input (see example CRUK0063 in PROCEDURE), which is used in the clustering step to generate a copy number plot across the genome with overlaid mutation copy numbers.
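Before running either step, it can help to confirm that the required columns are present in the header. The sketch below is one way to do this in the shell. The function name missing_columns is introduced here for illustration, and the column lists are assumptions taken from the example header shown under Step 1 plus the four tree building columns named above; Box 1 remains the authoritative list.

```shell
# List required columns that are absent from the header of a tab-separated file.
# Usage: missing_columns FILE COL1 COL2 ...
missing_columns() {
  file=$1
  shift
  header=$(head -n 1 "$file")
  for col in "$@"; do
    # Split the header into one field per line and match whole fields only.
    if ! printf '%s\n' "$header" | tr '\t' '\n' | grep -qxF -- "$col"; then
      printf '%s\n' "$col"
    fi
  done
}

# Assumed column sets (see lead-in); adjust to match Box 1 if they differ.
clustering_cols="CASE_ID SAMPLE CHR POS REF ALT REF_COUNT VAR_COUNT DEPTH COPY_NUMBER_A COPY_NUMBER_B ACF PLOIDY"
treebuilding_cols="$clustering_cols MUT_COPY CCF_PHYLO CCF_OBS CLUSTER"

# Example: missing_columns input.tsv $treebuilding_cols
```

An empty result means every listed column was found. The variables are deliberately left unquoted in the example call so that the shell splits them into one argument per column name.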

Execution of full pipeline

An example wrapper to run both steps of the pipeline end-to-end is available to download from the GitHub page (0_runningClusteringTreeBuilding.sh). This wrapper is designed to be run for one case in the analysis cohort. A description of how to run each CONIPHER step individually is detailed below in PROCEDURE, in which both the CONIPHER clustering and tree building wrapper functions are run on processed WES data from a patient with metastatic disease from the TRACERx421 cohort, case CRUK0063²⁷.

Step 1: Mutation clustering - TIMING: 10 min - 6 hrs

! CRITICAL. The tumour identifier in column CASE_ID and tumour sample identifier in column SAMPLE must include a prefix character string common to all patients in the cohort, for example prefix ‘CRUK’ in the toy case CRUK0000 shown in Example 1 (Figure 5). The input table should be in tab-separated format (input.tsv), should have no additional column with row names or numbers, and should have no quotation marks for character string entries.

! CRITICAL. In cases of multiple genomically distinct tumours detected within one patient, CONIPHER should be implemented separately for each tumour.
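The formatting requirements in the first note above can be checked mechanically before clustering. The following is a minimal sketch, not part of CONIPHER: the function name check_input_format is hypothetical, and it assumes that CASE_ID and SAMPLE are the first and second columns, as in the example table.

```shell
# Usage: check_input_format FILE PREFIX   (returns non-zero if a problem is found)
check_input_format() {
  awk -F'\t' -v prefix="$2" '
    /"/ { printf "line %d: quotation mark found\n", NR; bad = 1 }
    NR == 1 { n = NF; next }      # remember the header width, then skip the header
    NF != n { printf "line %d: expected %d fields, found %d\n", NR, n, NF; bad = 1 }
    index($1, prefix) != 1 { printf "line %d: CASE_ID lacks prefix %s\n", NR, prefix; bad = 1 }
    index($2, prefix) != 1 { printf "line %d: SAMPLE lacks prefix %s\n", NR, prefix; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

For the example case this would be called as check_input_format input.tsv CRUK. The field-count check is also likely to flag an unwanted row-name column, since a table written from R with row names typically has a header one field shorter than its data rows.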

An example input.tsv for TRACERx case CRUK0063 is shown below. The case CRUK0063 has WES data available for five primary tumour samples (CRUK0063_SU_T1.R3 to CRUK0063_SU_T1.R7) and two metastatic samples (CRUK0063_SU_FLN1 and CRUK0063_BR_T1.R1):

CASE_ID SAMPLE CHR POS REF ALT REF_COUNT VAR_COUNT DEPTH COPY_NUMBER_A COPY_NUMBER_B ACF PLOIDY

CRUK0063 CRUK0063_SU_T1.R3 1 1854811 C G 222 0 222 2 1 0.12 2.99578468212973

CRUK0063 CRUK0063_SU_T1.R4 1 1854811 C G 155 43 198 2 1 0.26 3.6455987662771

CRUK0063 CRUK0063_SU_T1.R5 1 1854811 C G 184 43 229 2 1 0.25 3.82763920021902

CRUK0063 CRUK0063_SU_T1.R6 1 1854811 C G 205 42 247 2 1 0.14 3.63598022755526

CRUK0063 CRUK0063_SU_T1.R7 1 1854811 C G 177 32 209 2 1 0.13 3.60409460005687

CRUK0063 CRUK0063_SU_FLN1 1 1854811 C G 111 26 137 2 1 0.16 3.4421795771731

CRUK0063 CRUK0063_BR_T1.R1 1 1854811 C G 406 0 406 3 0 0.19 2.85294220222681

CRUK0063 CRUK0063_SU_T1.R3 1 2525963 - A 301 2 303 2 1 0.12 2.99578468212973
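The formatting rules above can be checked before starting a long clustering run. The following is a minimal sketch and is not part of CONIPHER: the function name check_input_tsv is hypothetical, and the list of required columns is taken from the example header shown above.

```shell
# Hypothetical helper (not part of CONIPHER): check an input.tsv against the
# formatting rules described above. Prints one message per problem found and
# returns a non-zero status if any rule is violated.
check_input_tsv() {
    # $1: path to input.tsv    $2: cohort prefix, e.g. CRUK
    awk -F'\t' -v prefix="$2" '
        BEGIN { bad = 0 }
        NR == 1 {
            # Required columns, as listed in the example header.
            n = split("CASE_ID SAMPLE CHR POS REF ALT REF_COUNT VAR_COUNT DEPTH COPY_NUMBER_A COPY_NUMBER_B ACF PLOIDY", req, " ")
            for (i = 1; i <= NF; i++) col[$i] = i
            for (i = 1; i <= n; i++)
                if (!(req[i] in col)) { print "missing column: " req[i]; bad = 1 }
            if (bad) exit 1
            ncol = NF
            next
        }
        # Every data row must have the same number of tab-separated fields.
        NF != ncol {
            print "line " NR ": expected " ncol " tab-separated fields, found " NF
            bad = 1
            next
        }
        # Character strings must not be wrapped in double quotation marks.
        index($0, "\"") > 0 { print "line " NR ": contains quotation marks"; bad = 1 }
        # CASE_ID and SAMPLE must both start with the cohort prefix.
        index($(col["CASE_ID"]), prefix) != 1 || index($(col["SAMPLE"]), prefix) != 1 {
            print "line " NR ": CASE_ID or SAMPLE does not start with " prefix
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, check_input_tsv "${inputTSV}" CRUK returns status 0 when none of these checks fail.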

When running the clustering and tree building pipeline for a cohort of tumours, it is recommended to save the input and output in a distinct directory for each tumour case ${CASE_ID}, for example:

inputTSV="${CASE_ID}/input.tsv"

clustering_dir="${CASE_ID}/Clustering/"

treebuilding_dir="${CASE_ID}/TreeBuilding/"

mkdir -p "${clustering_dir}"

mkdir -p "${treebuilding_dir}"

Make sure to specify values for the file and directory parameters: --patient, --working_dir, --script_dir, --input_tsv, and optionally --input_seg_tsv.

1| Run mutation clustering for one patient with this command:

Rscript run_clustering.R --patient ${CASE_ID} --working_dir ${clustering_dir} --script_dir ${scriptDir} --input_tsv ${inputTSV}

Optionally, a table can be provided as input that describes the estimated copy number across the genome for each tumour sample, input_seg.tsv. An example input_seg.tsv for case CRUK0063 is shown below:

SAMPLE CHR STARTPOS ENDPOS COPY_NUMBER_A COPY_NUMBER_B

CRUK0063_SU_T1.R3 1 1154343 24194770 1.84281724051725 0.819410968360379

CRUK0063_SU_T1.R4 1 1154343 24194770 2.06043419365728 0.987132058968649

CRUK0063_SU_T1.R5 1 1154343 24194770 2.03393512167531 1.01848350188843

CRUK0063_SU_T1.R6 1 1154343 24194770 1.98816956507153 0.968961861085115

CRUK0063_SU_T1.R7 1 1154343 24194770 1.96976684730678 0.994579179452045

CRUK0063_BR_T1.R1 1 1154343 24194770 3.2781464926907 0.263955266201884

CRUK0063_SU_FLN1 1 1154343 24194770 1.89405545754167 0.843149172994773

CRUK0063_SU_T1.R3 1 24200891 24201115 0.802114518343767 0

This table is in long format, with one row per copy number segment per tumour sample. The first column, SAMPLE, gives the tumour sample identifier. Columns CHR, STARTPOS and ENDPOS define the genomic segment. Columns COPY_NUMBER_A and COPY_NUMBER_B give the copy number of the major and minor alleles, respectively. These values can be integer copy numbers or raw fractional copy numbers.

! CRITICAL. For file input_seg.tsv, the tumour sample identifiers in column SAMPLE and chromosome identifiers in column CHR should correspond to those in input.tsv.
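The correspondence between the two input tables can be checked with a short script. This is a sketch under the assumption that both files are tab-separated with the headers shown above; the function name check_seg_matches_input is hypothetical and not part of CONIPHER.

```shell
# Hypothetical helper (not part of CONIPHER): report SAMPLE and CHR values in
# input_seg.tsv that never occur in input.tsv. Returns non-zero on a mismatch.
check_seg_matches_input() {
    # $1: input.tsv    $2: input_seg.tsv
    awk -F'\t' '
        BEGIN { bad = 0 }
        FNR == 1 {                      # header line of either file
            split("", col)              # reset the column-name lookup
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        NR == FNR {                     # data rows of input.tsv
            samples[$(col["SAMPLE"])] = 1
            chroms[$(col["CHR"])] = 1
            next
        }
        {                               # data rows of input_seg.tsv
            s = $(col["SAMPLE"])
            c = $(col["CHR"])
            if (!(s in samples) && !(s in seen_s)) {
                print "SAMPLE not in input.tsv: " s
                seen_s[s] = 1
                bad = 1
            }
            if (!(c in chroms) && !(c in seen_c)) {
                print "CHR not in input.tsv: " c
                seen_c[c] = 1
                bad = 1
            }
        }
        END { exit bad }
    ' "$1" "$2"
}
```

A typical mismatch that this catches is a chromosome naming difference, such as 1 in input.tsv and chr1 in input_seg.tsv.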

A full description of all parameters available for the clustering step can be found in Box 2 (Figure 7).

ANTICIPATED CLUSTERING OUTPUT

Running the clustering step will output the following files in patient-specific directory “${CASE_ID}/Clustering/”:

OUTPUT DATA:

<CASE_ID>.SCoutput.CLEAN.tsv. This is a mutation table in the same form as input.tsv, including columns for: mutation cluster assignments (CLUSTER); mutation cell fraction estimates, including the PhyloCCF (CCF_PHYLO) and observed CCF (CCF_OBS); and mutation copy number estimates for each mutation in each sample (MUT_COPY). Additionally, there is a column mutation_id, which is an identifier for the mutation in the form: <CASE_ID>:<CHR>:<POS>:<REF>:<ALT>. By convention, cluster names are integers, ordered by the number of mutations assigned to that cluster (so the cluster with the largest number of mutations will be labelled as CLUSTER==1, and so forth).

<CASE_ID>.removed.muts.txt. This is a mutation table containing the mutations that were removed prior to the clustering step (e.g. if no copy number was available for that mutation). Each row is a new mutation. Tumour sample-specific information is found in columns that begin with <SAMPLE>.*. NOTE: this file will not be generated if no mutations are removed.
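As a quick check of the clustering output, the number of distinct mutations per cluster can be tabulated from the CLUSTER and mutation_id columns described above. This is a minimal sketch; the function name cluster_sizes is hypothetical, and the table is assumed to be tab-separated with one row per mutation per sample.

```shell
# Hypothetical helper (not part of CONIPHER): count distinct mutations per
# cluster in <CASE_ID>.SCoutput.CLEAN.tsv. Each mutation appears once per
# sample, so rows are de-duplicated on mutation_id before counting.
cluster_sizes() {
    # $1: path to <CASE_ID>.SCoutput.CLEAN.tsv
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            id = $(col["mutation_id"])
            if (!(id in seen)) { seen[id] = 1; n[$(col["CLUSTER"])]++ }
        }
        END { for (c in n) print c "\t" n[c] }   # cluster, number of mutations
    ' "$1" | sort -n
}
```

Given the naming convention described above, the counts in the second column would be expected to be non-increasing as the cluster number grows.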

OUTPUT PLOTS:

<CASE_ID>_pyclone_cluster_assignment_copynumber_clean.pdf. This figure is a cross-genome plot of each mutation plotted at its genomic position (x-axis) against its mutation copy number (y-axis), coloured by the cluster it was assigned to in mutation clustering. Each new row shows a new tumour sample. If the input_seg.tsv file was additionally provided, the copy number of each segment will be plotted: black indicates the major allele, green the minor allele. The first page of the pdf displays all mutations from every cluster. The subsequent pages display the same segment copy number information for each sample, with mutations from only one cluster overlaid. Histograms on the right hand side of cross-genome plots (on all pages except the first page) indicate the frequency of mutations at each copy number value. An example of one sample from page 1 of the pdf for case CRUK0063 is shown in Figure 8.
<CASE_ID>.removedCPN.muts.pdf. This figure is identical to the above, except restricting to only mutations that were removed during the clustering step, due to being localised genomically. Each mutation is coloured by the cluster it was assigned to in mutation clustering. Each new row shows a new tumour sample. Histograms on the right hand side indicate the frequency of mutations at each integer copy number value. NOTE: this file will not be generated if no mutations are removed. An example of one sample for case CRUK0063 is shown in Figure 9.
<CASE_ID>.heatmap.pdf. This figure shows a heatmap of presence/absence of each mutation (rows) in each tumour sample (columns). The colour bar on the left indicates removed mutations (blue) and kept mutations (yellow).
<CASE_ID>.cluster.ccf.heatmap.pdf. This figure shows a heatmap of the inferred PhyloCCF of each mutation (rows) in each tumour sample (columns). The colour bar on the left indicates the assigned cluster.
<CASE_ID>.pyclone_cluster_assignment_phylo_clean.pdf. This figure shows a scatter plot of the PhyloCCF of each (non-removed) mutation between each pair of samples. Each mutation is coloured by the assigned cluster. An example of one pair of samples from case CRUK0063 is shown in Figure 10.

Step 2: Phylogenetic tree building - TIMING: 1 min - 1 hr

2| Run tree building for one patient with one of the following commands:

a. If running CONIPHER tree building from CONIPHER clustering output:

Rscript run_treebuilding.R --input_tsv ${CASE_ID}".SCoutput.CLEAN.tsv" --out_dir ${treebuilding_dir} --script_dir ${scriptDir} --prefix CRUK

b. If running CONIPHER tree building directly from an input.tsv file:

Rscript run_treebuilding.R --input_tsv ${inputTSV} --out_dir ${treebuilding_dir} --script_dir ${scriptDir} --prefix CRUK

A full description of all parameters for the tree building step can be found in Box 3 (Figure 11).
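Steps 1 and 2 can be chained for one case in a small shell function. The sketch below is not the wrapper distributed with CONIPHER; the function name run_case and the DRY_RUN switch are hypothetical, only flags named in this protocol are used, and the clustering table is assumed to be written inside the clustering directory.

```shell
# Hypothetical per-case wrapper (a sketch, not the wrapper distributed with
# CONIPHER). Set DRY_RUN=1 to print the two commands instead of running them.
run_case() {
    # $1: CASE_ID    $2: directory holding the CONIPHER scripts    $3: cohort prefix
    case_id=$1
    script_dir=$2
    prefix=$3
    input_tsv="${case_id}/input.tsv"
    clustering_dir="${case_id}/Clustering/"
    treebuilding_dir="${case_id}/TreeBuilding/"
    mkdir -p "${clustering_dir}" "${treebuilding_dir}" || return 1

    # Step 1: mutation clustering. Stop here if clustering fails.
    ${DRY_RUN:+echo} Rscript run_clustering.R \
        --patient "${case_id}" \
        --working_dir "${clustering_dir}" \
        --script_dir "${script_dir}" \
        --input_tsv "${input_tsv}" || return 1

    # Step 2: tree building from the clustering output.
    # Assumption: the clustering table is located in ${clustering_dir}.
    ${DRY_RUN:+echo} Rscript run_treebuilding.R \
        --input_tsv "${clustering_dir}${case_id}.SCoutput.CLEAN.tsv" \
        --out_dir "${treebuilding_dir}" \
        --script_dir "${script_dir}" \
        --prefix "${prefix}"
}
```

Running with DRY_RUN=1 first makes it possible to inspect the paths that would be used before committing to a run that may take several hours.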

! CRITICAL. NOTE: CONIPHER tree building implements its own cluster merging process in addition to cluster merging in the CONIPHER clustering step. By default similar clusters are merged if possible (merge_clusters==TRUE) and bootstrapped confidence intervals are used (use_boot==TRUE). These settings are recommended (Supplementary Methods).

! CRITICAL. If running tree building only, it is required that all columns in the input file ${inputTSV} are present. NOTE: if the clustering method used does not output an estimate of PhyloCCF as well as observed CCF per mutation, the column CCF_PHYLO should be manually added to the input table, with identical entries to column CCF_OBS.
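Adding the CCF_PHYLO column by hand can be scripted. The following is a minimal sketch, assuming a tab-separated table and assuming that the new column may be appended as the last column; the function name add_ccf_phylo is hypothetical.

```shell
# Hypothetical helper (not part of CONIPHER): append a CCF_PHYLO column whose
# entries are identical to CCF_OBS. A table that already has CCF_PHYLO is
# passed through unchanged.
add_ccf_phylo() {
    # $1: tab-separated mutation table containing a CCF_OBS column
    awk -F'\t' -v OFS='\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "CCF_OBS") obs = i
                if ($i == "CCF_PHYLO") has_phylo = 1
            }
            if (!obs) { print "CCF_OBS column not found" > "/dev/stderr"; exit 1 }
            if (has_phylo) { print } else { print $0, "CCF_PHYLO" }
            next
        }
        { if (has_phylo) { print } else { print $0, $obs } }
    ' "$1"
}
```

For example, add_ccf_phylo other_method.tsv > input.tsv writes a table that can be passed to --input_tsv, where other_method.tsv stands for the output of an alternative clustering method.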

ANTICIPATED TREE BUILDING OUTPUT

Running the tree building step will output the following files in patient-specific directory “${CASE_ID}/TreeBuilding/”:

OUTPUT DATA:

allTrees.txt. This is a text file containing all potential inferred phylogenetic trees, in the format below. This file can be parsed into any scripting language for further analysis.

### 11 trees

# tree 1

2 1

8 3

21 4

1 5

…

# tree 2

2 8

8 21

2 1

17 20

…

The first row of the file indicates how many alternative phylogenies were detected by the tree building algorithm. Each alternative tumour phylogeny number X begins with a header: # tree X. For each tree, each new row of allTrees.txt is a tree branch, or edge, connecting a pair of distinct clusters. The first column indicates the parental node; the second column indicates the child node.

! CRITICAL. Tree number 1 (# tree 1) always refers to the default tree generated by the tree building algorithm.
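The layout described above is simple to parse. As a minimal sketch (the function names count_trees and tree_edges are hypothetical), the edges of a chosen tree can be extracted as follows, assuming parent and child are separated by whitespace:

```shell
# Hypothetical helpers (not part of CONIPHER) for reading allTrees.txt.

# Number of alternative trees, counted from the "# tree X" headers.
count_trees() {
    grep -c '^# tree ' "$1"
}

# Edges (parent, child) of one tree; defaults to tree 1, the default tree.
tree_edges() {
    # $1: path to allTrees.txt    $2: tree number (optional)
    awk -v want="${2:-1}" '
        /^# tree / { current = $3; next }     # header of one alternative tree
        /^#/ || NF < 2 { next }               # "### N trees" line, blank lines
        current == want { print $1 "\t" $2 }  # parent node, child node
    ' "$1"
}
```

For example, tree_edges allTrees.txt 2 prints the edges of the second alternative tree, one tab-separated parent-child pair per line.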

alternativeTreeMetrics.txt. This is a tab-delimited text file containing summary metrics of all alternative phylogenetic trees, whereby each row of the table indicates one alternative tree (treeID).

treeID sum_condition_error SCE_ranking lowest_SCE edge_probability_score edge_probability_ranking highest_edge_probability

1 2.45535714285714 1 Lowest SCE tree -13.7503137875581 1 Highest edge probability tree

2 2.66666666666667 2 Alternative tree -13.7503137875581 1 Highest edge probability tree

3 2.84761904761905 4 Alternative tree -28.235254999465 6 Alternative tree

4 3.19047619047619 8 Alternative tree -17.1080085151707 3 Alternative tree

5 2.8 3 Alternative tree -18.3317674469289 5 Alternative tree

6 3.08571428571429 6 Alternative tree -14.8660315441292 2 Alternative tree

7 3.46938775510204 11 Alternative tree -17.1080085151707 3 Alternative tree

8 3.05102040816327 5 Alternative tree -18.3317674469289 5 Alternative tree

9 3.35714285714286 9 Alternative tree -14.8660315441292 2 Alternative tree

10 3.46938775510204 10 Alternative tree -18.2237262717417 4 Alternative tree

11 3.19047619047619 7 Alternative tree -18.2237262717417 4 Alternative tree

The treeID column value directly corresponds to the alternative tree number in the full alternative tree list allTrees.txt. The second column sum_condition_error gives the sum condition error (SCE) value for that tree, and the subsequent column SCE_ranking is an ordering of the trees from lowest error (SCE_ranking == 1) to highest. Correspondingly, lowest_SCE is a binary flag to indicate whether this tree had the lowest error (‘Lowest SCE tree’) or not (‘Alternative tree’). Similarly, the fifth column edge_probability_score gives the edge probability score for that tree, and the subsequent column edge_probability_ranking is an ordering of the trees from highest edge probability (edge_probability_ranking == 1) to lowest. Column highest_edge_probability is a binary flag to indicate whether this tree had the maximal edge probability score (‘Highest edge probability tree’) or not (‘Alternative tree’). Any ties within either SCE_ranking or edge_probability_ranking are labelled with the same rank.
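The flag columns make it easy to pick out the best-scoring trees. The sketch below assumes the file is tab-delimited with the header shown above; the function name trees_flagged is hypothetical.

```shell
# Hypothetical helper (not part of CONIPHER): list the treeID of every tree
# whose flag column carries a given label in alternativeTreeMetrics.txt.
trees_flagged() {
    # $1: path to alternativeTreeMetrics.txt
    # $2: flag column, e.g. lowest_SCE or highest_edge_probability
    # $3: label to match, e.g. "Lowest SCE tree"
    awk -F'\t' -v column="$2" -v label="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!(column in col) || !("treeID" in col)) exit 1
            next
        }
        $(col[column]) == label { print $(col["treeID"]) }
    ' "$1"
}
```

For the CRUK0063 table above, trees_flagged alternativeTreeMetrics.txt lowest_SCE "Lowest SCE tree" would select tree 1, and the same call with highest_edge_probability and "Highest edge probability tree" would select trees 1 and 2.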

clusterInfo.txt. This is a tab-delimited text file containing a table detailing information about each mutation cluster, whereby each row of the table indicates one cluster (clusterID) in one tumour sample (SAMPLE), as shown below.

clusterID truncal treeClust cpnRemClust nMuts SAMPLE meanCCF CCF_CI_low CCF_CI_high clonality clone_proportions_default

1 FALSE TRUE FALSE 180 CRUK0063_SU_T1.R3 0 0 0 absent 0

1 FALSE TRUE FALSE 180 CRUK0063_SU_T1.R4 97 94.4413167785942 100.434004728202 clonal 3

1 FALSE TRUE FALSE 180 CRUK0063_SU_T1.R5 99 96.211117059643 102.101566540685 clonal 11

1 FALSE TRUE FALSE 180 CRUK0063_SU_T1.R6 94 90.9058240254545 97.5439854950948 clonal 0

1 FALSE TRUE FALSE 180 CRUK0063_SU_T1.R7 97 93.7004858700328 100.833168946869 clonal 0

1 FALSE TRUE FALSE 180 CRUK0063_SU_FLN1 105 99.2286716589233 110.747279152377 clonal 0

1 FALSE TRUE FALSE 180 CRUK0063_BR_T1.R1 0 0 0 absent 0

2 TRUE TRUE FALSE 89 CRUK0063_SU_T1.R3 104 96.4730100539471 110.970973969053 clonal 0

The cluster name in clusterID matches the cluster names input into tree building (in either <CASE_ID>.SCoutput.CLEAN.tsv or input.tsv). The second column truncal indicates whether this cluster was assigned to be the truncal cluster of the phylogenetic tree. Only one unique cluster will be assigned to be truncal. The third column treeClust indicates whether the cluster was assigned to a branch of the phylogenetic tree (treeClust==TRUE). If a cluster was identified as erroneous due to being composed of biologically implausible mutations only, column treeClust will have a value of FALSE. If the cluster was identified as erroneous due to subclonal copy number alterations undetected during clustering, column treeClust will have a value of FALSE and cpnRemClust will have a value of TRUE. Column nMuts describes the number of SNVs assigned to that cluster. The columns meanCCF, CCF_CI_low, and CCF_CI_high describe the distribution of PhyloCCF values for all mutations in that clusterID in that SAMPLE. Column clonality describes whether that clusterID in that SAMPLE was classified as being either: absent, subclonal or clonal within that sample (Supplementary Methods)²⁶. Finally, column clone_proportions_default describes the subclone proportion of that clusterID in that SAMPLE, computed from the default phylogenetic tree (tree 1).
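The truncal, treeClust and cpnRemClust columns can be queried directly. The following is a minimal sketch, assuming a tab-delimited file with the header shown above and the logical values written as TRUE and FALSE; the function names truncal_cluster and removed_clusters are hypothetical.

```shell
# Hypothetical helpers (not part of CONIPHER) for reading clusterInfo.txt.

# clusterID of the truncal cluster.
truncal_cluster() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["truncal"]) == "TRUE" { print $(col["clusterID"]); exit }
    ' "$1"
}

# Clusters not placed on the tree, with their cpnRemClust value
# (TRUE: removed because of undetected subclonal copy number alterations).
removed_clusters() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["treeClust"]) == "FALSE" {
            id = $(col["clusterID"])
            # Each cluster has one row per sample; report it only once.
            if (!(id in seen)) { seen[id] = 1; print id "\t" $(col["cpnRemClust"]) }
        }
    ' "$1"
}
```

For the CRUK0063 rows shown above, truncal_cluster would print 2.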

cloneProportionsMinErrorTrees.txt. This is a tab-delimited text file containing subclone proportion tables in long format, restricted to the phylogenetic trees with the lowest SCE. Each row corresponds to one clusterID from one treeID. In example CRUK0063 below, the lowest SCE tree was the default tree (tree 1). Values in the table indicate the proportion of the subclone defined by that clusterID within each tumour sample (column). For each treeID, each column should sum to 100.

CRUK0063_SU_T1.R3 CRUK0063_SU_T1.R4 CRUK0063_SU_T1.R5 CRUK0063_SU_T1.R6 CRUK0063_SU_T1.R7 CRUK0063_SU_FLN1 CRUK0063_BR_T1.R1 clusterID treeID

0 3 11 0 0 0 0 1 1

0 0 0 0 0 0 0 2 1

14.6446280991736 0 0 0 0 0 0 3 1

0 0 0 0 0 0 40 4 1

68 0 0 0 0 0 0 5 1

0 0 0 0 0 73 0 6 1

0 25 0 37.5886524822695 35.7615894039735 0 0 7 1

0 0 0 23 0 0 0 8 1

0 0 0 0 0 0 40 9 1

0 0 0 0 0 0 0 10 1

…
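The statement that each sample column sums to 100 within a tree can be verified with a short script. This is a sketch under stated assumptions: the file is tab-delimited with the header shown above, and a tolerance of 0.01 is used for rounding, which is a choice made here rather than a CONIPHER setting. The function name check_clone_proportions is hypothetical.

```shell
# Hypothetical helper (not part of CONIPHER): for every treeID, check that
# each sample column of cloneProportionsMinErrorTrees.txt sums to 100.
check_clone_proportions() {
    # $1: path to cloneProportionsMinErrorTrees.txt    $2: tolerance (optional)
    awk -F'\t' -v tol="${2:-0.01}" '
        BEGIN { bad = 0 }
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                name[i] = $i
                if ($i == "treeID") t = i
                if ($i == "clusterID") c = i
            }
            if (!t || !c) { print "treeID or clusterID column not found"; bad = 1; exit 1 }
            ncol = NF
            next
        }
        NF == ncol {
            trees[$t] = 1
            for (i = 1; i <= ncol; i++)
                if (i != t && i != c) sum[$t, i] += $i   # per tree, per sample
        }
        END {
            for (tree in trees)
                for (i = 1; i <= ncol; i++) {
                    if (i == t || i == c) continue
                    d = sum[tree, i] - 100
                    if (d < 0) d = -d
                    if (d > tol) {
                        print "tree " tree ", " name[i] ": sum = " sum[tree, i]
                        bad = 1
                    }
                }
            exit bad
        }
    ' "$1"
}
```

The function returns status 0 when every column of every tree is within the tolerance, and otherwise prints the offending tree and sample.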

consensusBranches.txt. This is a text file containing all branches (parent-child pairs) of the phylogenetic tree that were identified to be present across all alternative phylogenies, as shown in example CRUK0063 below. First column: parent node; second column: child node.

1 18

1 5

10 9

12 6

…

consensusRelationships.txt. This is a text file containing all ancestor-descendent pairs that were identified to be present across all alternative phylogenies, as shown in example CRUK0063 below. First column: ancestral node; second column: descendent node.

1 12

1 17

1 18

1 19

…

treeTable.tsv. This is a tab-separated mutation table in the format of input.tsv, except with extra columns: originalCLUSTER and treeCLUSTER. originalCLUSTER indicates the cluster ID this mutation was assigned to in the clustering step (and will correspond to column CLUSTER in the input.tsv to the tree building step). treeCLUSTER indicates the final cluster name the mutation is assigned to after tree building. Note: originalCLUSTER and treeCLUSTER are identical, except in cases of cluster merging (Supplementary Methods).

CASE_ID SAMPLE CHR POS REF ALT REF_COUNT VAR_COUNT DEPTH originalCLUSTER CCF_PHYLO CCF_OBS MUT_COPY COPY_NUMBER_A COPY_NUMBER_B ACF PLOIDY mutation_id treeCLUSTER

CRUK0063 CRUK0063_SU_T1.R3 1 1854811 C G 222 0 222 1 0 0 0 2 1 0.12 2.99578468212973 CRUK0063:1:1854811:C:G 1

CRUK0063 CRUK0063_SU_T1.R4 1 1854811 C G 155 43 198 1 0.949026716242792 1 1.89805343248558 2 1 0.26 3.6455987662771 CRUK0063:1:1854811:C:G 1

CRUK0063 CRUK0063_SU_T1.R5 1 1854811 C G 184 43 229 1 0.857387666989517 1 1.71477533397903 2 1 0.25 3.82763920021902 CRUK0063:1:1854811:C:G 1

CRUK0063 CRUK0063_SU_T1.R6 1 1854811 C G 205 42 247 1 1.30366188871038 1 2.59190089027765 2 1 0.14 3.63598022755526 CRUK0063:1:1854811:C:G 1

CRUK0063 CRUK0063_SU_T1.R7 1 1854811 C G 177 32 209 1 1.27080535818999 1 2.50319026394246 2 1 0.13 3.60409460005687 CRUK0063:1:1854811:C:G 1

CRUK0063 CRUK0063_SU_FLN1 1 1854811 C G 111 26 137 1 1.32634459664356 1 2.51217022185363 2 1 0.16 3.4421795771731 CRUK0063:1:1854811:C:G 1

CRUK0063 CRUK0063_BR_T1.R1 1 1854811 C G 406 0 406 1 0 0 0 3 0 0.19 2.85294220222681 CRUK0063:1:1854811:C:G 1

CRUK0063 CRUK0063_SU_T1.R3 1 2525963 - A 301 2 303 4 0 0.11 0.139734209392565 2 1 0.12 2.99578468212973 CRUK0063:1:2525963:-:A 4
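Because originalCLUSTER and treeCLUSTER differ only where clusters were merged, comparing the two columns lists the merges made during tree building. A minimal sketch, assuming the tab-separated header shown above (the function name merged_clusters is hypothetical):

```shell
# Hypothetical helper (not part of CONIPHER): list each distinct
# (originalCLUSTER, treeCLUSTER) pair in treeTable.tsv where the two differ.
merged_clusters() {
    # $1: path to treeTable.tsv
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            o = $(col["originalCLUSTER"])
            t = $(col["treeCLUSTER"])
            if (o != t && !((o, t) in seen)) {
                seen[o, t] = 1
                print o "\t" t          # original cluster, cluster after merging
            }
        }
    ' "$1"
}
```

An empty result means that no mutation changed cluster name during tree building.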

OUTPUT PLOTS:

pytree_and_bar.pdf. The left side of the figure shows a barplot of the mean estimated PhyloCCF values of each mutation cluster (rows) in each sample (columns), with a bootstrap-computed 95% confidence interval (as described in Figure 1). If the cluster was classified as ‘clonal’ within that tumour sample, the corresponding bar has a black outline (for example, bars for truncal cluster 2 have a black box in every tumour sample). The right hand side of the figure shows the inferred default phylogenetic tree. Each node pie chart corresponds to the same mutation cluster shown in the barplot, whereby each piece of the pie corresponds to one tumour sample and is shaded by the mean PhyloCCF of that mutation cluster in that sample. The numbers of mutations per cluster are shown, as well as clusters that were identified as comprising erroneous mutations and removed. Black tree branches are consensus branches, found to be present in all alternative phylogenies. Grey branches indicate non-consensus branches. An example is shown in Figure 12.
pytree_multipletrees.pdf. This figure displays all alternative phylogenetic trees identified in the tree building procedure. Black branches indicate consensus branches and grey branches indicate non-consensus branches. An example is shown in Figure 13.

Additional output produced by CONIPHER includes files for analysis in R, described in the Supplementary Note.

A troubleshooting table is provided (Table 2).

Anticipated Results

Successful completion of the procedure results in the following output files. CLUSTERING STEP: <CASE_ID>.SCoutput.CLEAN.tsv, <CASE_ID>.all.SNV.cpn.xls, <CASE_ID>.removed.muts.txt, <CASE_ID>.subclonal.mut.cpn.pdf, <CASE_ID>.removedCPN.muts.pdf, <CASE_ID>.heatmap.pdf, <CASE_ID>.cluster.ccf.heatmap.pdf, <CASE_ID>.pyclone_cluster_assignment_phylo_clean.pdf and <CASE_ID>_mutationCCF_all.pdf. TREE BUILDING STEP: allTrees.txt, alternativeTreeMetrics.txt, cloneProportionsMinErrorTrees.txt, clusterInfo.txt, consensusBranches.txt, consensusRelationships.txt and treeTable.tsv. A detailed description of these files is given in PROCEDURE, under ANTICIPATED CLUSTERING OUTPUT and ANTICIPATED TREE BUILDING OUTPUT above. Approximate expected runtime is based on the simulations described in ‘Benchmarking and evaluating the performance of CONIPHER’ (see Supplementary Note). For tumours with a large number of samples or mutations, run time may be longer.

Code Availability

Code to run the CONIPHER clustering and tree building wrapper can be found at https://github.com/McGranahanLab/CONIPHER-wrapper. Code to run the CONIPHER tree building R package on its own can be found at https://github.com/McGranahanLab/CONIPHER. The simulation framework can be found at https://github.com/zaccaria-lab/TRACERx_simulation_tool.

Data Access Statement

The Whole Exome Sequencing data (from the TRACERx study) used during this study has been deposited at the European Genome–phenome Archive (EGA), which is hosted by The European Bioinformatics Institute (EBI) and the Centre for Genomic Regulation (CRG) under the accession codes EGAS00001006494; access is controlled by the TRACERx data access committee. Details on how to apply for access are available on the linked page.

References

1. Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–313 (2012).

2. Nowell, P. C. The clonal evolution of tumor cell populations. Science 194, 23–28 (1976).

3. Jamal-Hanjani, M. et al. Tracking the Evolution of Non-Small-Cell Lung Cancer. N. Engl. J. Med. 376, 2109–2121 (2017).

4. Gerstung, M. et al. The evolutionary history of 2,658 cancers. Nature 578, 122–128 (2020).

5. Dentro, S. C. et al. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell 184, 2239–2254.e39 (2021).

6. Maley, C. C. et al. Genetic clonal diversity predicts progression to esophageal adenocarcinoma. Nat. Genet. 38, 468–473 (2006).

7. McGranahan, N. & Swanton, C. Clonal Heterogeneity and Tumor Evolution: Past, Present, and the Future. Cell 168, 613–628 (2017).

8. Gundem, G. et al. The evolutionary history of lethal metastatic prostate cancer. Nature 520, 353–357 (2015).

9. Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nat. Methods 11, 396–398 (2014).

10. Jiao, W., Vembu, S., Deshwar, A. G., Stein, L. & Morris, Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics 15, 35 (2014).

11. Malikic, S., McPherson, A. W., Donmez, N. & Sahinalp, C. S. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics 31, 1349–1356 (2015).

12. Deshwar, A. G. et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16, 35 (2015).

13. Popic, V. et al. Fast and scalable inference of multi-sample cancer lineages. Genome Biol. 16, 91 (2015).

14. Satas, G., Zaccaria, S., El-Kebir, M. & Raphael, B. J. DeCiFering the Elusive Cancer Cell Fraction in Tumor Heterogeneity and Evolution. bioRxiv 2021.02.27.429196 (2021).

15. Miller, C. A. et al. SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput. Biol. 10, e1003665 (2014).

16. Dentro, S. C., Wedge, D. C. & Van Loo, P. Principles of reconstructing the subclonal architecture of cancers. Cold Spring Harb. Perspect. Med. 7, a026625 (2017).

17. Satas, G. & Raphael, B. J. Tumor phylogeny inference using tree-constrained importance sampling. Bioinformatics 33, i152–i160 (2017).

18. Wintersinger, J. A. et al. Reconstructing Complex Cancer Evolutionary Histories from Multiple Bulk DNA Samples Using Pairtree. Blood Cancer Discov 3, 208–219 (2022).

19. El-Kebir, M., Satas, G., Oesper, L. & Raphael, B. J. Inferring the Mutational History of a Tumor Using Multi-state Perfect Phylogeny Mixtures. Cell Syst 3, 43–53 (2016).

20. El-Kebir, M., Oesper, L., Acheson-Field, H. & Raphael, B. J. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics 31, i62–70 (2015).

21. Myers, M. A., Satas, G. & Raphael, B. J. CALDER: Inferring Phylogenetic Trees from Longitudinal Tumor Samples. Cell Syst 8, 514–522.e5 (2019).

22. Ha, G. et al. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Res. 24, 1881–1893 (2014).

23. Oesper, L., Mahmoody, A. & Raphael, B. J. THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol. 14, R80 (2013).

24. Jiang, Y., Qiu, Y., Minn, A. J. & Zhang, N. R. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc. Natl. Acad. Sci. U. S. A. 113, E5528–37 (2016).

25. Abbosh, C. et al. Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution. Nature 545, 446–451 (2017).

26. TRACERx consortium. The natural history of NSCLC and impact of subclonal selection in TRACERx. Nature. In press.

27. TRACERx consortium. TRACERx: The evolution of metastases in non-small cell lung cancer. Nature. In press.

28. Makohon-Moore, A. P. et al. Limited heterogeneity of known driver gene mutations among the metastases of individual patients with pancreatic cancer. Nat. Genet. 49, 358–366 (2017).

29. Benjamin, D. et al. Calling Somatic SNVs and Indels with Mutect2. bioRxiv 861054 (2019) doi:10.1101/861054.

30. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).

31. Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl. Acad. Sci. U. S. A. 107, 16910–16915 (2010).

32. Zaccaria, S. & Raphael, B. J. Accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data. Nat. Commun. 11, 4301 (2020).

33. Favero, F. et al. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann. Oncol. 26, 64–70 (2015).

34. Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).

35. Dentro, S. C. et al. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell 184, 2239–2254.e39 (2021).

36. McGranahan, N. et al. Clonal status of actionable driver events and the timing of mutational processes in cancer evolution. Sci. Transl. Med. 7, 283ra54 (2015).

37. Frankell, A. M., Colliver, E., McGranahan, N. & Swanton, C. cloneMap: a R package to visualise clonal heterogeneity. bioRxiv 2022.07.26.501523 (2022) doi:10.1101/2022.07.26.501523.

Acknowledgements

The TRACERx study (Clinicaltrials.gov no: NCT01888601) is sponsored by University College London (UCL/12/0279) and has been approved by an independent Research Ethics Committee (13/LO/1546). TRACERx is funded by Cancer Research UK (C11496/A17786) and coordinated through the Cancer Research UK and UCL Cancer Trials Centre which has a core grant from CRUK (C444/A15953). We gratefully acknowledge the patients and relatives who participated in the TRACERx study. We thank all site personnel, investigators, funders and industry partners that supported the generation of the data within this study.

This work was supported by the Francis Crick Institute that receives its core funding from Cancer Research UK (CC2041), the UK Medical Research Council (CC2041), and the Wellcome Trust (CC2041). This work was also supported by the Cancer Research UK Lung Cancer Centre of Excellence, the CRUK City of London Centre Award (C7893/A26233) and the UCL Experimental Cancer Research Centre.

Author contributions

Kristiana Grigoriadis^1,2,3,#, Ariana Huebner^1,2,3,#, Abigail Bunkum^1,4,5,#, Emma Colliver^2,#, Alexander M. Frankell^1,2,#, Mark S. Hill^2, Kerstin Thol^1,3, Nicolai J. Birkbak^1,2,6,7,8, Charles Swanton^1,2,9,*, Simone Zaccaria^1,5,*, Nicholas McGranahan^1,3,*

1. Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, London, UK
2. Cancer Evolution and Genome Instability Laboratory, The Francis Crick Institute, London, UK
3. Cancer Genome Evolution Research Group, Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, London, UK
4. Cancer Metastasis Lab, University College London Cancer Institute, London, UK
5. Computational Cancer Genomics Research Group, University College London Cancer Institute, London, UK
6. Department of Molecular Medicine, Aarhus University Hospital, Aarhus, Denmark
7. Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
8. Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
9. Department of Oncology, University College London Hospitals, London, UK

# These authors contributed equally: Kristiana Grigoriadis, Ariana Huebner, Abigail Bunkum, Emma Colliver, Alexander M. Frankell

* These authors jointly supervised this work

Correspondence to: Charles Swanton, Simone Zaccaria, Nicholas McGranahan

K.G., A.H., E.C., A.M.F., K.T., N.J.B. and N.M. helped develop the protocol and wrote the manuscript. A.B. and S.Z. created the simulations, performed the benchmarking and wrote the manuscript. M.S.H. helped with bioinformatics pipeline development. C.S., S.Z. and N.M. jointly designed and supervised the study and helped write the manuscript.

Competing interests

A.M.F. is co-inventor to a patent application to determine methods and systems for tumour monitoring (PCT/EP2022/077987).

N.J.B. is a co-inventor on a patent for identifying responders to cancer treatment (PCT/GB2018/051912).

C.S. acknowledges grant support from AstraZeneca, Boehringer-Ingelheim, Bristol Myers Squibb, Pfizer, Roche-Ventana, Invitae (previously Archer Dx Inc - collaboration in minimal residual disease sequencing technologies), and Ono Pharmaceutical. He is an AstraZeneca Advisory Board member and Chief Investigator for the AZ MeRmaiD 1 and 2 clinical trials and is also Co-Chief Investigator of the NHS Galleri trial funded by GRAIL and a paid member of GRAIL’s Scientific Advisory Board. He receives consultant fees from Achilles Therapeutics (also a SAB member), Bicycle Therapeutics (also a SAB member), Genentech, Medicxi, Roche Innovation Centre – Shanghai, Metabomed (until July 2022), and the Sarah Cannon Research Institute. C.S. had stock options in Apogen Biotechnologies and GRAIL until June 2021, and currently has stock options in Epic Bioscience, Bicycle Therapeutics, and has stock options and is co-founder of Achilles Therapeutics. C.S. holds patents relating to assay technology to detect tumour recurrence (PCT/GB2017/053289); to targeting neoantigens (PCT/EP2016/059401), identifying patient response to immune checkpoint blockade (PCT/EP2016/071471), determining HLA LOH (PCT/GB2018/052004), predicting survival rates of patients with cancer (PCT/GB2020/050221), identifying patients who respond to cancer treatment (PCT/GB2018/051912), US patent relating to detecting tumour mutations (PCT/US2017/28013), methods for lung cancer detection (US20190106751A1) and both a European and US patent related to identifying insertion/deletion mutation targets (PCT/GB2018/051892) and is co-inventor to a patent application to determine methods and systems for tumour monitoring (PCT/EP2022/077987).

N.M. has received consultancy fees and has stock options in Achilles Therapeutics. N.M. holds European patents relating to targeting neoantigens (PCT/EP2016/059401), identifying patient response to immune checkpoint blockade (PCT/EP2016/071471), determining HLA LOH (PCT/GB2018/052004), and predicting survival rates of patients with cancer (PCT/GB2020/050221).

SupplementaryNote.pdf
