The present invention relates to a method for characterising a DNA sample by determining the presence of copy number signatures associated with different types of chromosomal instability. It is particularly, but not exclusively, concerned with a method for determining whether a tumour has one or more deficiencies associated with chromosomal instability, such as impaired homologous recombination and replication stress, and to methods for identifying a treatment based on the presence of copy number signatures in a DNA sample from a tumour.
Chromosomal instability (CIN) is the process of accumulating numerical and structural changes in DNA. CIN is a hallmark of cancer and over time manifests as changes of whole chromosomes or parts of chromosomes. A stable state where large-scale chromosomal aberrations are tolerated and do not change over time is called aneuploidy and is seen as a product of CIN.
Consequences of CIN are complex and include recurrent loss or amplification of driver genes [9-11], highly complex focal rearrangements, formation of extrachromosomal DNA and micronuclei, activation of innate immune signaling [Bakhoum et al., 2018], as well as associations with disease stage [Raghvendra et al., 2020, Vargas-Rondon et al., 2018], metastasis [Bakhoum et al., 2018b], poor prognosis [Bakhoum et al., 2011], and therapeutic resistance [Lee et al., 2011]. Causes of CIN are equally complex and include mitotic chromosome missegregation, homologous recombination defects, telomere crisis, and breakage-fusion-bridge cycles.
Despite the diversity of causes and consequences, CIN is generally used as an umbrella term. Measures of CIN either divide tumours into broad categories of high/low CIN [Birbak, 2011], or are restricted to a single aetiology like homologous repair deficiency [Davies et al., 2017], or are limited to a particular genomic feature like whole chromosome-arm changes [Cohen-Sharir et al., 2021]. As a result, there is no systematic framework to comprehensively characterise the diversity, extent and origins of CIN, and to define how different types of CIN relate to clinical phenotypes.
Therefore, there is still a need for improved methods for characterising chromosomal instability.
The present inventors devised a robust analysis framework for chromosomal instability in human cancers. In particular, the inventors identified that patterns of chromosomal instability could be robustly characterised using a limited set of fundamental copy number features that does not include the absolute copy number of segments. The inventors further demonstrated the relevance of the approach in a pan-cancer analysis of 7,880 high-quality samples. The inventors thereby identified a compendium of copy number signatures that characterise different types of CIN and their aetiologies across 33 cancer types, and are supported by a wide array of independent data sources. The inventors then demonstrated the biological relevance of the pan-cancer signatures identified by using the signatures to predict drug response and to identify new drug targets. In particular, associations between the signatures and targeting of 40 genes were identified, supported by both drug and genetic perturbations. This demonstrates the use of the signatures in identifying existing therapeutic strategies for cancers. Further, associations between the signatures and targeting of 104 genes with druggable structures but without known targeted therapies were also identified. This demonstrates the use of the signatures in the drug design process, by identifying drug targets for cancer therapeutics. The inventors further showed how the new framework refines the understanding of impaired homologous recombination (IHR), one of the most clinically relevant types of CIN. In particular, the inventors identified three distinct signatures of IHR: a signature of IHR alone, a signature of IHR plus replication stress, and a signature of IHR plus replication stress and impaired damage sensing and nucleotide excision repair (NER). Finally, the inventors demonstrated that these signatures could be used to predict sensitivity to platinum-based therapy in multiple cancers including ovarian and oesophageal cancer.
Thus, according to a first aspect, there is provided a method of characterising a DNA sample obtained from a tumour, the method including the steps of:
The present inventors have identified that omitting any feature representing the absolute copy number of a segment advantageously avoids redundancy when signatures for the same aetiology appear across different ploidy backgrounds. For example, if such a feature were included, a single copy loss on a whole-genome duplicated (WGD) background and a deletion in a non-WGD background would result in two different copy number states and would therefore be encoded by two different signatures, even though they might be caused by the same mutational process. Omitting the feature therefore results in more robust and more biologically relevant signatures.
The method may have any one or more of the following features.
The one or more signatures of chromosomal instability may each individually be associated with one or more processes causing chromosomal instability, and exposure to the one or more signatures may be indicative of the presence of the respective process(es) in the sample. The inventors have identified that not including the absolute copy number of a copy number alteration event as a feature used to characterise a copy number profile unexpectedly improves the signatures that are derived using copy number features. In particular, it avoids the artificial splitting of signatures by ploidy status when the signatures in fact represent the same mutational process.
Quantifying a set of copy number features of the copy number profile may comprise quantifying, for each copy number event in the copy number profile, one or more features selected from: segment size, copy number change-point, breakpoint count per predetermined length of sequence and/or per chromosome arm, and number of segments with oscillating copy number. The set of copy number features may include all of the following features: segment size, breakpoint count per predetermined length of sequence, copy number change-point, breakpoint count per chromosome arm, and number of segments with oscillating copy number. The set of copy number features may consist of said features or said features and one or more allele-specific features. This set of copy number features was found by the inventors to be sufficient to describe known patterns of chromosomal instability (see Table 2) including amplification (captured by the segment size (SS), copy number changepoint (CNC), and breakpoint per 10 MB (BP10) features), aneuploidy (captured by the SS and CNC features), breakage-fusion-bridge (captured by the CNC and breakpoints per chromosome arm (BPARM) features), chromothripsis (captured by the CNC, BP10, BPARM and number of segments with oscillating copy number (OSC) features), complex genomic rearrangements (captured by the SS, CNC, BP10, BPARM and OSC features), deletions (captured by the CNC feature), extrachromosomal DNA (captured by the SS and CNC features), homologous recombination deficiency (captured by the SS, CNC, BP10, BPARM and OSC features), loss of heterozygosity (captured by the SS, CNC and BPARM features), micronuclei (captured by the SS and CNC features), tandem duplications (captured by the SS, CNC, BPARM and OSC features), and whole genome duplications (captured by the CNC feature). Thus, this set of features may represent a compact yet robust and complete set of features to describe copy number profiles. 
Further, the present inventors have identified that the use of such a compact set had several advantages compared to the use of a more extended set that may contain some redundancies (or even noise) in the information captured by each feature. In particular, the use of a compact set of features with definitions that do not overlap means that each feature is more directly interpretable and the information it captures is not diluted across a plurality of features. Additionally, the use of additional features (potentially uninformative in relation to CIN) may lead to the identification of artefactual signatures that do not represent true biological differences. Such artefactual signatures may be difficult to identify and automatically remove, leading to less interpretable and reproducible results.
The segment size may be quantified for each segment in the copy number profile. The copy number change-point may be quantified for each segment in the copy number profile. The copy number change-point may be quantified relative to the left (upstream) neighbouring segment. The breakpoint count per predetermined length of sequence may be quantified for each set of segments that falls within a predetermined length of sequence. The breakpoint count per chromosome arm may be quantified for each set of segments that falls within a chromosome arm. The number of segments with oscillating copy number may be quantified for each set of segments that comprises the maximum number of contiguous segments oscillating between two different absolute copy numbers.
The breakpoint count is preferably quantified over two separate lengths of sequence: a predetermined length of sequence and the length of the chromosome arm for each chromosome arm represented in the copy number profile. The predetermined length of sequence is preferably shorter than the shortest chromosome arm represented in the copy number profile. For example, the breakpoint count per predetermined length of sequence is advantageously quantified over a predetermined length of sequence between 8 and 12 MB, preferably about 10 MB. Such values may be particularly advantageous when looking at human copy number profiles. Indeed, the present inventors have recognised that clusters of copy number alterations (CNAs) can come in various sizes, e.g. short tandem duplications of less than 10 kb in length each, long tandem duplications with lengths of over 100 kb each, and up to large-scale transitions and chromothriptic events which can span tens of megabases. Their respective cluster sizes can range from tens of kb up to whole chromosome arms. The present inventors have further recognised that the set of copy number features would ideally comprise features that are able to capture all of these events. However, the inventors further recognised that the smallest chromosome arm on a human genotyping array such as the SNP6 is 12.8 Mb in length, such that window sizes larger than this may lead to a skewed breakpoint density on smaller chromosome arms. Conversely, if the window size is too small then it may not be able to capture medium size clusters. A further feature could be included for this, but this would increase the complexity of the scheme and potentially lead to some redundancy in the features. Thus, the inventors identified the combination of a breakpoint count per 8-12 MB (preferably about 10 MB) and a breakpoint count per chromosome arm as ideally capturing the patterns of copy number alterations expected in e.g. a human genome.
The use of features such as the breakpoint count per predetermined length of sequence, the breakpoint count per chromosome arm, and the number of segments with oscillating copy number advantageously capture phenomena that manifest themselves over multiple segments or long stretches of DNA. This is not possible when using only features that characterise single segments. Thus, this results in a more complete (and more biologically relevant) characterisation of the CIN processes that are active in a sample.
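The two breakpoint count features described above can be illustrated with a minimal sketch. It assumes that a breakpoint is an internal boundary between two adjacent segments of one chromosome, that windows are fixed and non-overlapping, and that the centromere position is supplied by the caller; the function names, the tuple layout and the example coordinates are hypothetical and are not taken from the described method.

```python
WINDOW = 10_000_000  # assumed window of about 10 MB, as preferred in the text


def breakpoints(segments):
    """Return internal segment boundaries for one chromosome.

    segments: list of (start, end, copy_number) tuples for a single chromosome.
    The end of the last segment is the chromosome end, not a breakpoint.
    """
    ordered = sorted(segments)
    return [end for (_start, end, _cn) in ordered[:-1]]


def count_per_window(positions, chrom_length, window=WINDOW):
    """Count breakpoints in consecutive, non-overlapping windows."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[min(pos // window, n_windows - 1)] += 1
    return counts


def count_per_arm(positions, centromere):
    """Count breakpoints on the p arm (before the centromere) and the q arm."""
    p_count = sum(1 for pos in positions if pos < centromere)
    return {"p": p_count, "q": len(positions) - p_count}


# Hypothetical 40 Mb chromosome with a centromere at 15 Mb.
example_segments = [
    (0, 4_000_000, 2.0),
    (4_000_000, 9_000_000, 3.1),
    (9_000_000, 26_000_000, 2.0),
    (26_000_000, 40_000_000, 1.0),
]
example_breaks = breakpoints(example_segments)
per_window = count_per_window(example_breaks, 40_000_000)
per_arm = count_per_arm(example_breaks, 15_000_000)
```

In this example the two breakpoints at 4 Mb and 9 Mb fall in the first window and on the p arm, which shows how the windowed count picks up a local cluster while the arm count summarises the same events at a coarser scale.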
The set of features may further comprise an allele-specific feature. An allele-specific feature may advantageously enable the detection of copy-neutral loss of heterozygosity. An allele-specific feature may for example be the proportion of the major allele over both alleles for a portion of a sequence. However, inclusion of an allele-specific feature may restrict applicability of the method to sequence data that has allele-specific resolution (e.g. WGS, SNP 6.0 arrays). Therefore, an allele-specific feature (e.g. a loss of heterozygosity feature such as the proportion of the major allele over both alleles) may advantageously not be included in order to increase the scope of applicability of the method to sequencing data that has been obtained using a variety of technologies such as genotyping arrays, single cell sequencing, shallow whole genome sequencing, etc.
Quantifying the set of copy number features may comprise using unrounded copy number segments. Quantifying the set of copy number features may comprise collapsing and merging near diploid segments to a diploid state, wherein near diploid segments are segments that have a copy number within a predetermined distance from 2. The predetermined distance may be 0.1. Thus, quantifying the set of copy number features may comprise assigning a copy number of 2 (i.e. collapsing) to any segment that has a copy number within predetermined boundaries (such as e.g. above 1.9 and below 2.1), and merging any contiguous segments that have the same copy number as a result of the assigning. This may advantageously avoid including signal from segments that are likely to be normal diploid segments and which could act as noise in the process. Unrounded copy number segments may be segments whose copy number has not been rounded to the nearest integer. Thus, the method preferably uses copy number profiles that have not been rounded and/or wherein the only rounding that is performed relates to the collapsing of near diploid segments. The inventors have identified that using unrounded copy numbers enabled them to use more of the information present in copy number profiles (in terms of the number of segments that would otherwise be merged, but also in terms of the copy number information that is contained in said profile for each segment).
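The collapsing and merging step described above may be sketched as follows. This is an illustrative sketch only: the function name, the tuple layout and the example values are hypothetical, and the tolerance of 0.1 is the example distance given in the text.

```python
def collapse_near_diploid(segments, tol=0.1):
    """Collapse near diploid segments to copy number 2 and merge neighbours.

    segments: list of (start, end, copy_number) tuples for one chromosome,
    sorted by position, with unrounded copy numbers.
    """
    merged = []
    for start, end, cn in segments:
        if abs(cn - 2.0) < tol:  # e.g. above 1.9 and below 2.1
            cn = 2.0
        previous = merged[-1] if merged else None
        if (
            cn == 2.0
            and previous is not None
            and previous[2] == 2.0
            and previous[1] == start  # contiguous with the previous segment
        ):
            # Extend the previous diploid segment rather than adding a new one.
            merged[-1] = (previous[0], end, 2.0)
        else:
            merged.append((start, end, cn))
    return merged


example = [(0, 100, 1.95), (100, 250, 2.08), (250, 400, 3.4), (400, 500, 2.0)]
collapsed = collapse_near_diploid(example)
# collapsed == [(0, 250, 2.0), (250, 400, 3.4), (400, 500, 2.0)]
```

Only segments that become diploid through collapsing are merged here; all other segments keep their unrounded copy number, in line with the preference for unrounded profiles.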
Rounding of copy numbers for segments may be performed in order to remove noise in a copy number profile, at the cost of loss of information. The present inventors have further identified that the additional noise that is associated with the use of unrounded copy number segments could be compensated at least in part, with minimal loss of information relevant to CIN by merging segments with minor deviations from the “normal” copy number state.
Quantifying the set of copy number features may comprise quantifying a feature selected from segment size and copy number changepoint, wherein said feature is not quantified for diploid segments. Ignoring normal diploid segments when quantifying the segment size and/or copy number changepoint feature(s) may advantageously avoid inflating the quantifications for such segments, which are expected to be more numerous and do not capture information in relation to chromosomal instability for these features. This was shown by the inventors to result in refined signature exposures (likely to be more reflective of true biology), with samples with similar exposures having more similar exposures (as quantified by cosine similarity) than if the normal segments had not been ignored. Indeed, the present inventors found that normal diploid segments influenced signature activity even on strongly rearranged genomes and that it would therefore be advantageous to remove them, particularly when looking at signatures derived from multiple cancers where many samples may have less rearranged genomes.
Quantifying the set of copy number features may comprise quantifying a feature selected from breakpoint count per predetermined length of sequence and breakpoint count per chromosome arm, wherein said feature is quantified for all segments including diploid segments. The inventors have identified that although breakpoint count features are impacted by the overrepresentation of diploid segments, this impact is not as problematic as e.g. that for a copy number changepoint feature. A changepoint feature may quantify two events for each aberrant segment surrounded by diploid segments, whereas only one relevant change of copy number has occurred. By contrast, a breakpoint feature quantifies breakpoints as relevant events where DNA has been broken and repaired, so that counting both breakpoints of an aberrant segment surrounded by diploid segments does not amount to double counting.
The copy number changepoint feature may be quantified by ignoring the first segment of any chromosome if said segment is a normal segment (where a normal segment in this context refers to a diploid segment). Indeed, the changepoint feature describes the difference in absolute copy number from a segment to its neighbouring segment, which may be the segment on the left (i.e. preceding segment), in which case the very first segment of each chromosome may not have a neighbouring segment to the left. The changepoint feature may be quantified by subtracting 2 from the absolute copy number of the first segment of any chromosome if said segment is not a normal segment. This assumes that the (non-existing) previous segment was a normal segment (copy number 2) if the first segment is not a normal segment.
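The changepoint rules above (skip diploid segments, compare with the left neighbour, and treat the missing neighbour of a first segment as copy number 2) can be sketched as below. The function name is hypothetical, and taking the absolute value of the difference is an assumption of this sketch; the text itself only speaks of a difference in absolute copy number.

```python
def changepoints(copy_numbers, normal=2.0):
    """Quantify the changepoint feature for one chromosome.

    copy_numbers: unrounded copy numbers of consecutive segments, ordered
    left to right, after near diploid segments have been collapsed to 2.
    """
    values = []
    for i, cn in enumerate(copy_numbers):
        if cn == normal:
            continue  # diploid segments are not quantified for this feature
        # The first segment has no left neighbour; assume it was normal.
        left = copy_numbers[i - 1] if i > 0 else normal
        values.append(abs(cn - left))  # absolute value is an assumption
    return values


example_changepoints = changepoints([3.0, 2.0, 4.5, 1.0])
# example_changepoints == [1.0, 2.5, 3.5]
```

In the example, the first segment (3.0) is compared with an assumed normal neighbour, the diploid second segment contributes no value, and the remaining two segments are each compared with the segment to their left.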
Quantifying the set of copy number features may comprise obtaining one or more summarised measures for each copy number feature across the copy number profile. The one or more summarised measures may comprise the sum over all copy number events for which a feature has been quantified of the posterior probabilities of each feature value belonging to each of a set of predetermined distributions. The one or more summarised measures may have been identified empirically by quantifying the copy number features in a plurality of tumour samples. A set of predetermined distributions may comprise one or more distributions, such as e.g. between 1 and 30, between 1 and 25, between 3 and 25. A predetermined distribution may also be referred to herein as a “component”. Each copy number feature may be associated with a plurality of components, and a summarised measure across the copy number profile may be obtained for each such component. The use of summarised measures that have been identified empirically using relevant data means that the features are truly reflective of biological processes rather than arbitrary categories that, while convenient to manipulate, may not be biologically relevant. For example, a set of predetermined distributions may be identified by applying mixture modelling on a data set comprising the quantified set of copy number features for a plurality of tumour samples. This enables the identification of the true states of the copy number features present in tumours, and hence the quantification of the evidence for the presence of these states in a new sample to be analysed. The use of posterior probabilities of each feature value belonging to each of a set of predetermined distributions advantageously enables the method to deal with uncertainty in the assignment of a segment to a state (i.e. uncertainty in whether the segment provides evidence for the presence of a particular category of CIN events characterised by one of the distributions for a copy number feature). This means that the method is by design able to deal with inevitable noise in the data, resulting in a more accurate characterisation of the sample.
The set of predetermined distributions for a copy number feature may be a set of Gaussian distributions for any feature that is quasi-continuous. The set of predetermined distributions may be a set of Poisson distributions for any count feature. A set of predetermined distributions may have been obtained using a mixture modelling technique. Count features may be selected from breakpoint count per predetermined length of sequence and/or per chromosome arm, and number of segments with oscillating copy number. A quasi-continuous feature may be any feature that is not a count feature, such as e.g. segment size and copy number changepoint. For example, the segment size feature may be quantified by obtaining a sum-of-posterior probabilities (summed across copy number events) for each of a plurality of Gaussian distributions, such as e.g. between 20 and 25 Gaussian distributions (e.g. 22 Gaussian distributions). As another example, the copy number changepoint feature may be quantified by obtaining a sum-of-posterior probabilities (summed across copy number events) for each of a plurality of Gaussian distributions, such as e.g. between 5 and 15 Gaussian distributions (e.g. 10 Gaussian distributions). As another example, the breakpoint per predetermined sequence length feature may be quantified by obtaining a sum-of-posterior probabilities (summed across copy number events) for each of a plurality of Poisson distributions, such as e.g. between 1 and 5 Poisson distributions (e.g. 3 Poisson distributions).
As another example, the breakpoint per chromosome arm feature may be quantified by obtaining a sum-of-posterior probabilities (summed across copy number events) for each of a plurality of Poisson distributions, such as e.g. between 1 and 10 Poisson distributions (e.g. 5 Poisson distributions). As another example, the number of segments with oscillating copy number feature may be quantified by obtaining a sum-of-posterior probabilities (summed across copy number events) for each of a plurality of Poisson distributions, such as e.g. between 1 and 5 Poisson distributions (e.g. 3 Poisson distributions). The precise number of distributions used may depend at least in part on the number and diversity of samples used to obtain the signatures. In the examples provided herein, the inventors used a very large number of samples from a wide variety of tumour types, enabling them to provide a nuanced picture of the behaviours of copy number features in cancer, resulting in a higher number of components identified than e.g. in Macintyre et al., 2018. The parameters of each of the predetermined distributions (such as e.g. mean and variance for a Gaussian distribution, λ for a Poisson distribution) may have been determined as part of the process of obtaining the signatures of chromosomal instability. The parameters of each of the predetermined distributions (such as e.g. mean and variance for a Gaussian distribution, λ for a Poisson distribution) may have been determined by fitting mixture models to the quantified set of copy number features in the plurality of tumour samples from which the signatures of chromosomal instability have been obtained. The parameters of each of a set of Gaussian distributions may have been obtained using a Variational Bayes Gaussian mixture model.
In particular, the parameters of Gaussian distributions (for example for features such as segment size and changepoint) may have been obtained by fitting Dirichlet-Process Gaussian mixture models using variational inference. The parameters of each of a set of Poisson distributions (for example for features such as breakpoint counts and lengths of oscillating chains) may have been obtained using a Finite Poisson mixture model. Obtaining a set of distributions may comprise: obtaining a raw set of distributions by fitting a mixture model to a feature distribution, and combining distributions that satisfy one or more similarity criteria to obtain a final set of distributions. The one or more similarity criteria may include the mean of a distribution being within a predetermined distance from the mean of another distribution. The predetermined distance may be based on the standard deviation of the other distribution. For example, a first distribution may be combined with a further distribution if the mean of the first distribution is within one standard deviation of the mean of the further distribution. Combining distributions may comprise defining a new distribution that combines a plurality of raw distributions. The mean of such a new distribution may be defined as the weighted mean of the plurality of raw distributions that are combined. The standard deviation of such a distribution may be obtained by sampling points from the distributions that are combined. Obtaining a raw set of distributions may comprise extracting all distributions resulting from fitting a mixture model that have weights above a threshold, such as e.g. 1%.
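The sum-of-posterior-probabilities measure for a quasi-continuous feature can be illustrated with a short sketch. It assumes that the components are Gaussian, that each is given as a hypothetical (weight, mean, standard deviation) triple, and that the mixing weights enter the posterior; the component values below are invented for illustration and are not those of the predetermined distributions referred to in the text.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sum_of_posteriors(values, components):
    """Sum, over all feature values, the posterior of each component.

    values: feature values, one per quantified copy number event.
    components: list of (weight, mean, sd) triples (hypothetical layout).
    Returns one summarised measure per component.
    """
    totals = [0.0] * len(components)
    for x in values:
        likelihoods = [w * gaussian_pdf(x, m, s) for (w, m, s) in components]
        norm = sum(likelihoods)
        if norm == 0.0:
            continue  # value is too far from every component to assign
        for k, likelihood in enumerate(likelihoods):
            totals[k] += likelihood / norm
    return totals


# Two invented changepoint components and three quantified events.
example_components = [(0.5, 1.0, 0.3), (0.5, 3.0, 0.5)]
example_measures = sum_of_posteriors([0.9, 1.1, 3.2], example_components)
```

Because each event contributes posteriors that sum to one, the measures for a feature add up to the number of events quantified, and an event lying between two components is shared between them rather than being forced into a single category. A count feature would be handled in the same way with Poisson probabilities in place of the Gaussian density.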
Determining exposure to one or more signatures of chromosomal instability based on the quantified features may comprise identifying the values of E that satisfy: PbC≈E×SbC where E is a vector of size n comprising coefficients E1, . . . , En, where Ei is the exposure to signature i; PbC is a vector of size c, each element in the vector representing a summarised measure across the copy number profile associated with one of the copy number features; and SbC is a matrix of size c by n, each value representing the weight of a summarised measure C in a signature i. The signatures of chromosomal instability may have been obtained by identifying the values of SbC and E that satisfy: PbC≈E×SbC where E is a matrix of size n by p, each element in the matrix representing the exposure to a signature in the copy number profile of one of the plurality of tumour samples; PbC is a matrix of size c by p, each element in the matrix representing a summarised measure associated with one of the copy number features, wherein the summarised measure is obtained for the copy number profile of one of the plurality of tumour samples; and SbC is a matrix of size c by n, each value representing the weight of a summarised measure C in a signature i. The values of E and SbC may be obtained by non-negative matrix factorisation. The parameter n may be the number of signatures identified using the process described above. The parameter c may be the total number of summarised measures associated with the set of copy number features. The parameter p may be the number of tumour samples used to obtain the signatures that satisfy the above equation. Each of c, n may be >1. p may be >1. As described above, the summarised measures may comprise the sum over all copy number events for which a feature has been quantified in the copy number profiles for the plurality of tumour samples, of the posterior probabilities of each feature value belonging to each of a set of predetermined distributions.
The set of predetermined distributions may comprise c distributions, where c can be between 30 and 50, such as e.g. 43. The values in matrix/vector E may define the exposure/activity of the signatures defined in the vector/matrix SbC. The elements of the vector/matrix SbC may define the relative weights of the different summarised measures of copy number features in each of one or more signatures. The values of E and SbC may have been obtained by non-negative matrix factorisation. For example, any non-negative matrix factorisation algorithm known in the art may be used, such as e.g. as implemented in SignatureAnalyzer (Kim et al., 2016, Tan & Fevotte, 2013). The values of E and SbC may have been obtained by: performing non-negative matrix factorisation a plurality of times with random initialisation to obtain a plurality of sets of signatures, identifying a number of signatures as the mode of the distribution of the number of signatures obtained across the plurality of times, calculating the similarity between all signatures in the sets of signatures that have the identified number of signatures, clustering the signatures based on said similarity, and selecting a set of signatures based on the location of the signatures in the clusters thus obtained, such as e.g. by selecting the most optimal solution that has a signature in the highest possible number of clusters. Such an approach advantageously enables a computationally efficient and reproducible identification of a final set of signatures from a plurality of iterations of a non-exact optimisation process (due to the use of a finite set of NMF solutions, such as e.g. 100, 200 or 300 solutions and the deterministic process from the results of the NMF iterations). This is particularly advantageous when using a large number of samples to derive the signatures, which in turn underlines the ability of the methods of the invention to comprehensively capture patterns of CIN occurring in cancer. 
Optimality of the solutions may be determined based on the divergence between the term PbC and the term E×SbC for the set of signatures.
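Determining the exposures E for a single sample, given a fixed signature matrix SbC, can be sketched with a simple non-negative least squares fit. The sketch below uses multiplicative updates that minimise the squared error, which is a choice made for illustration and not necessarily the objective or solver used to obtain the signatures; the function name and the toy matrix are hypothetical. With SbC of size c by n and E of size n, the reconstruction is computed as SbC multiplied by E.

```python
def fit_exposures(p_by_c, s_by_c, n_iter=200, eps=1e-12):
    """Estimate non-negative exposures for one sample.

    p_by_c: list of c summarised measures for the sample.
    s_by_c: c-by-n nested list, s_by_c[j][i] being the weight of measure j
            in signature i.
    Returns a list of n exposures.
    """
    c, n = len(s_by_c), len(s_by_c[0])
    exposures = [1.0] * n  # strictly positive starting point
    for _ in range(n_iter):
        # Current reconstruction of the sample from the signatures.
        recon = [
            sum(s_by_c[j][i] * exposures[i] for i in range(n)) for j in range(c)
        ]
        for i in range(n):
            num = sum(s_by_c[j][i] * p_by_c[j] for j in range(c))
            den = sum(s_by_c[j][i] * recon[j] for j in range(c)) + eps
            exposures[i] *= num / den  # multiplicative update keeps E >= 0
    return exposures


# Toy example: 3 summarised measures, 2 signatures, true exposures (4, 1).
toy_signatures = [[0.7, 0.0], [0.3, 0.2], [0.0, 0.8]]
toy_sample = [2.8, 1.4, 0.8]
toy_exposures = fit_exposures(toy_sample, toy_signatures)
```

Since the toy sample was built as an exact combination of the two signatures, the fitted exposures approach 4 and 1. The same kind of update, applied alternately to SbC and E, is one common way of performing non-negative matrix factorisation when the signatures themselves are unknown.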
The summarised measures may comprise the sum over all copy number events for which a feature has been quantified of the posterior probabilities of each feature value belonging to each of a set of predetermined distributions, wherein the predetermined distributions are the distributions defined by the parameters in Table 6, or corresponding distributions obtained by fitting mixture models to summarised measures of the set of copy number features obtained for a plurality of tumour samples. The signatures may be those defined in Table 7 or corresponding signatures obtained by quantifying the set of copy number features in a plurality of tumour samples, and identifying one or more mutational signatures likely to result in the copy number profiles of the plurality of tumour samples. The plurality of tumour samples may comprise tumour samples from a plurality of types of tumours, and the one or more signatures may have been obtained by combining: a first set of signatures obtained by quantifying the set of copy number features in the plurality of tumour samples, and identifying one or more mutational signatures likely to result in the copy number profiles of the plurality of tumour samples; and one or more further sets of signatures obtained by quantifying the set of copy number features in a plurality of tumour samples from a respective tumour type for each further set of signatures, and identifying one or more mutational signatures likely to result in the copy number profiles of the plurality of tumour samples from the respective tumour type. The plurality of tumour types may comprise at least 10 tumour types, at least 20 tumour types, at least 30 tumour types, or all tumour types represented in the TCGA database, such as e.g. 33 tumour types. 
The present inventors have discovered that combining a first set of signatures obtained using a pan-cancer data set and additional sets of signatures obtained using cancer type-specific datasets advantageously enabled the identification of a combined set of signatures that captures more of the signal in the pan-cancer data set than would be possible by extracting signatures only at the pan-cancer level. For example, this process enabled the identification of signatures that were identified in cancer type specific data (e.g. CX9 discovered in OC, CX10 discovered in ESCA) and not when looking at the combined cancer data alone, but that are in fact present in many more cancer types than the one in which they were originally identified. The one or more signatures may have been obtained using copy number profiles of a plurality of tumour samples, each of which has a number of copy number alteration events above a predetermined threshold. For example, a threshold of 20 copy number alteration events may be used, particularly when looking at human genome wide copy number profiles. The plurality of sets of signatures may have been combined by removing signatures in the one or more further sets that have a similarity to any signature in the first set above a predetermined threshold and/or removing signatures in the one or more further sets that have a similarity to any signature in another of the one or more further sets above a predetermined threshold, and/or removing signatures in the one or more further sets that can be obtained by a linear combination of signatures in the first set. Similarity between signatures can be obtained as a cosine similarity. The cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is equal to the cosine of the angle between the two vectors. It is also equal to the inner product of the two vectors after each has been normalised to have length 1.
The cosine similarity between two signatures can be calculated as:

$$\cos(S_1, S_2) = \frac{S_1 \cdot S_2}{\lVert S_1 \rVert \, \lVert S_2 \rVert}$$
where S1 and S2 are equally-sized vectors with nonnegative components being the respective signatures. Alternatively, the similarity between two signatures may be obtained as the angular distance or angular similarity between the two vectors representing the signatures. As another alternative, the similarity between two signatures may be obtained as the Euclidean distance between L2-normalised versions of the two vectors representing the signatures. As another alternative, the similarity between two signatures may be obtained as the correlation between the two vectors representing the signatures. A predetermined threshold for evaluating similarity between signatures may be chosen based on a simulated distribution of similarities associated with simulated signatures derived from the first and one or more further sets of signatures. For example, a predetermined threshold may be chosen as the 0.999 quantile (or the 0.90, 0.95, 0.98, 0.99 or 0.995 quantile) of a distribution of similarities between said simulated signatures. The simulated distribution of similarities may have been obtained from a set of signatures using a Dirichlet process, preferably maintaining the proportion of 0s and/or signature components summing to 1 for each signature. A predetermined threshold may be chosen as a cosine similarity between 0.7 and 0.8, such as e.g. 0.74. The plurality of sets of signatures may have been combined without manual curation. In other words, the plurality of sets of signatures may have been combined by removing redundant signatures using objective criteria as described above, and no signature may be removed using subjective criteria (e.g. removal of signatures judged to be artefactual). The present inventors have identified that the methods of the invention resulted in the identification of sets of signatures that do not contain artefactual signatures and as such it is possible to collapse a set of signatures (such as e.g. a combination of signatures identified using a pan-cancer data set and a cancer type-specific dataset) using only objective criteria based on similarity/linear combinations of signatures.
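As a non-limiting sketch, the similarity-based removal of redundant signatures may be implemented along the following lines. The function names and the toy signatures are hypothetical; the threshold of 0.74 is the example value given above, and the criterion based on linear combinations of signatures is not implemented here.

```python
import math

def cosine_similarity(s1, s2):
    """Cosine of the angle between two equally-sized, non-zero vectors."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norm1 = math.sqrt(sum(a * a for a in s1))
    norm2 = math.sqrt(sum(b * b for b in s2))
    return dot / (norm1 * norm2)

def drop_redundant(first_set, further_set, threshold=0.74):
    """Keep every signature of first_set, then add each signature of
    further_set only if its similarity to all signatures kept so far
    does not exceed the threshold."""
    kept = list(first_set)
    for sig in further_set:
        if all(cosine_similarity(sig, k) <= threshold for k in kept):
            kept.append(sig)
    return kept

# Toy signatures: weights over four summarised copy number features.
pan_cancer = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
type_specific = [[0.68, 0.22, 0.10, 0.0], [0.1, 0.8, 0.1, 0.0]]

combined = drop_redundant(pan_cancer, type_specific)
```

In this toy example the first type-specific signature is nearly identical to a pan-cancer signature and is discarded, whereas the second is retained.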
The one or more signatures of chromosomal instability may comprise signatures associated with one or more processes causing chromosomal instability, and exposure to the one or more signatures is indicative of the presence of the respective process(es) in the sample. The method may comprise determining that a signature is present in the sample if the exposure to the signature in the sample is above a signature-specific threshold. The signature-specific threshold may have been identified using a simulated background distribution of signature exposure for the respective signature. The one or more signatures may comprise one or more signatures selected from: one or more signatures associated with impaired DNA damage response during mitosis, one or more signatures associated with disruption of the spindle assembly checkpoint, one or more signatures associated with impaired homologous recombination, one or more signatures associated with tolerance of whole genome duplication, one or more signatures associated with impaired non-homologous end joining, one or more signatures associated with replication stress, one or more signatures associated with impaired DNA damage sensing, and one or more signatures associated with cell cycle control impairment. The method may comprise determining whether the sample/tumour has one or more processes causing chromosomal instability based on the signature exposures. For example, a process causing chromosomal instability may be present if the exposure of a signature associated with said process indicates that the signature is present in the sample.
A signature-specific threshold for a signature may have been identified (or may be identified as part of a method described herein) using a simulated background distribution of signature exposure for the respective signature. A simulated background distribution of signature exposure may be obtained by adding noise (such as e.g. random noise, optionally limited to 10% of the original value) to the number of copy number alteration events and their associated copy number features from a plurality of tumour samples to obtain a plurality of simulated copy number profiles, and determining the exposure to the one or more signatures for the simulated copy number profiles. A signature-specific threshold for a signature may have been identified using such a simulated background distribution of signature exposure for the respective signature by determining the distribution of exposure values in the simulated background distribution from samples with an exposure below a threshold (e.g. exposure of 0) prior to the addition of noise. For example, the 95th percentile of such a distribution may be used as a signature-specific threshold.
In particular, the one or more signatures of chromosomal instability may comprise one or more signatures selected from: one or more signatures associated with chromosome missegregation (such as e.g. signatures CX1, CX6 and/or CX14 in Table 7 or corresponding signatures), one or more signatures associated with chromosome missegregation via defective mitosis (such as e.g. signatures CX1, CX6 and/or CX14 in Table 7 or corresponding signatures), one or more signatures associated with impaired homologous recombination (such as e.g. signatures CX2, CX3 and/or CX5 in Table 7 or corresponding signatures), one or more signatures associated with impaired DNA damage sensing (such as e.g. signature CX3 in Table 7 or a corresponding signature), one or more signatures associated with replication stress (such as e.g. signatures CX3, CX5, CX8, CX9, CX10, CX11, and/or CX13 in Table 7 or corresponding signatures), one or more signatures associated with tolerance of whole genome duplication and/or PI3K/AKT-mediated tolerance of whole genome duplication (such as e.g. signature CX4 in Table 7 or a corresponding signature), one or more signatures associated with impaired NHEJ, optionally with replication stress, for example replication fork collapse (such as e.g. signature CX10 in Table 7 or a corresponding signature), and one or more signatures associated with disruption of the spindle assembly checkpoint (such as e.g. signature CX14 in Table 7 or a corresponding signature). Determining exposure to one or more signatures may comprise determining exposure to all signatures in Table 7 or corresponding signatures, such as e.g. signatures that have been obtained as described in any embodiment of the first aspect above or second aspect below.
Thus, also described herein are methods of determining whether one or more processes causing chromosomal instability are present in a tumour sample, the method including the steps of: (a) obtaining a tumour copy number profile for the sample; (b) quantifying a set of copy number features of the copy number profile, wherein a copy number feature is a metric that characterises a copy number event in a copy number profile, and wherein the set of features does not comprise the absolute copy number of segments in the copy number profile; and (c) determining exposure to one or more signatures of chromosomal instability based on the quantified features, wherein the signatures of chromosomal instability have been obtained by quantifying the set of copy number features in a plurality of tumour samples, and identifying one or more mutational signatures likely to result in the copy number profiles of the plurality of tumour samples, and wherein the one or more signatures comprise signatures associated with the one or more processes causing chromosomal instability. Determining exposure to one or more signatures of chromosomal instability may comprise obtaining an estimate of exposure to each of the one or more signatures and normalising the exposures for each signature. The normalising may comprise scaling signature exposures using the parameters of a distribution of the exposure to the respective signatures in a cohort of samples.
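For illustration, the derivation of a signature-specific threshold from a simulated background distribution may be sketched as follows. The sketch assumes that the exposures for the original and noise-perturbed copy number profiles have already been estimated; the function names, the nearest-rank percentile and the example values are assumptions made for the purpose of the example.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def signature_threshold(original, simulated, q=95.0, zero_cutoff=0.0):
    """original: exposure of each sample before the addition of noise.
    simulated: for each sample, exposures re-estimated from its noisy profiles.
    Only samples whose original exposure is at or below zero_cutoff
    contribute to the background distribution."""
    background = [
        exposure
        for orig, sims in zip(original, simulated)
        if orig <= zero_cutoff
        for exposure in sims
    ]
    if not background:
        raise ValueError("no sample with exposure at or below the cutoff")
    return percentile(background, q)

# Illustrative exposures to one signature for three samples.
original = [0.0, 0.0, 0.30]
simulated = [
    [0.00, 0.01, 0.02, 0.00],
    [0.03, 0.00, 0.01, 0.05],
    [0.28, 0.33, 0.31, 0.29],
]

threshold = signature_threshold(original, simulated)
# A signature is called present when the exposure exceeds the threshold.
is_present = [exposure > threshold for exposure in original]
```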
The sample may have been obtained from a subject who has been diagnosed as having cancer. The plurality of tumour samples may comprise tumour samples from one or more of a plurality of types of tumours. The cancer and/or the plurality of tumour types may be selected from: ovarian cancer, breast cancer, endometrial cancer, kidney cancer, lung cancer, pancreatic cancer, liver cancer, oesophageal cancer, stomach cancer, head and neck cancer, brain cancer, colon cancer, prostate cancer, bladder cancer, cervical cancer, leukemia, lymphoma, testicular cancer, thyroid cancer, melanoma, adrenal cancer, bowel cancer, sarcoma, thymoma, neuroendocrine tumour, and bile duct cancer. The sample may be a tumour sample or a liquid biopsy sample. The method may further comprise one or more of obtaining the sample from a subject who has been diagnosed as having cancer, obtaining sequence data from the sample, determining a copy number profile from sequence data obtained from the sample, obtaining a matched germline sample, obtaining sequence data from a matched germline sample, and providing to a user one or more of: the exposure to the one or more signatures, a value derived therefrom, values for one or more copy number features, and a determination of whether one or more processes causing chromosomal instability are likely to be present in the sample. The method may further comprise obtaining the sample from a tumour of a subject. The method may further comprise obtaining sequence data from a sample from a tumour. The method may further comprise providing to a user one or more of: the exposure to the one or more signatures, a value derived therefrom (such as e.g. a probabilistic score), values for one or more copy number features, and a determination of whether one or more processes causing chromosomal instability are likely to be present in the sample.
The method may further comprise obtaining a germline sample from the subject and/or obtaining sequence data from a germline sample from the subject. The tumour sample may be a sample comprising tumour cells or genetic material derived therefrom. The tumour sample may be a sample of cells or tissue that has been obtained directly from a tumour (e.g. a tumour biopsy). The tumour sample may be a sample comprising cells or genetic material derived from a tumour, such as e.g. a liquid biopsy sample comprising circulating tumour cells or circulating tumour DNA. Obtaining a copy number profile for a sample may comprise receiving a copy number profile from a database, a user interface, a computing device, etc. Obtaining a copy number profile for a sample may comprise determining a copy number profile from sequence data obtained from the tumour sample and optionally from a matched germline sample. The sequence data may have been obtained using next generation sequencing or a genomic array. The sequence data may have been obtained using a genotyping array, whole exome sequencing, whole genome sequencing, single cell sequencing or shallow whole genome sequencing. Depending on the type of data used, and in particular e.g. the resolution of the data, a different set of components and signatures may be identified. Thus, the exact distributions and signatures described herein are not essential to the invention, and corresponding distributions and signatures may be obtained (which may in particular include a higher number of distributions, particularly if the data used has higher resolution) using different data to extract the signatures. However, the process described herein is usable with any data that is capable of generating copy number profiles.
Further, the particular distributions and signatures described herein are usable to analyse a copy number profile obtained with any such data, although they may not make full use of the resolution in the copy number profile if it has been derived from data with higher resolution than that used to obtain the signatures described herein.
Obtaining a tumour copy number profile for the sample may comprise obtaining absolute copy numbers at each of a plurality of genomic locations, such as e.g. genome bins (bins of any size may be used, such as e.g. 10, 20, 30, 50 or 100 kb, specifically 30 kb). Absolute copy number may be determined or may have been previously determined by obtaining a relative copy number at each of the plurality of genomic locations and determining an absolute copy number for the respective locations based on the relative copy number, the mean relative copy number of the sample, the tumour cell ploidy in the sample and the tumour purity of the sample. The tumour purity (% tumour cells in the sample) and tumour cell ploidy (e.g. average absolute copy number of the tumour cells in the sample) may be purity and ploidy estimates. Methods for estimating purity are known in the art. Purity and ploidy values may be jointly estimated by identifying the purity and ploidy that minimise the differences between estimated absolute copy numbers at the plurality of genomic locations and the closest integer to the respective absolute copy number. An absolute copy number for a genomic location may be calculated as
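$$aCN_j = \frac{rCN_j / d - 2\,(1 - \mathrm{purity})}{\mathrm{purity}}$$

wherein aCN_j is the absolute copy number at location j, rCN_j is the relative copy number at the location, purity is the tumour purity, and d is given by

$$d = \frac{r}{\mathrm{purity} \cdot \mathrm{ploidy} + 2\,(1 - \mathrm{purity})}$$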
where r is the mean relative copy number of the sample, and ploidy is the average absolute copy number of the tumour cells in the sample.
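As an illustrative sketch only, the conversion from relative to absolute copy number and the objective used for joint purity/ploidy estimation may be written as below. The sketch assumes the commonly used mixture model in which non-tumour cells in the sample are diploid; that assumption, the function names and the example values are not taken from the description above.

```python
def absolute_copy_number(r_cn, mean_r_cn, purity, ploidy):
    """Convert a relative copy number to an absolute copy number.
    Assumes non-tumour cells contribute two copies at every location."""
    normal_part = 2.0 * (1.0 - purity)
    d = mean_r_cn / (purity * ploidy + normal_part)
    return (r_cn / d - normal_part) / purity

def integer_distance(r_cns, purity, ploidy):
    """Sum of distances between each estimated absolute copy number and
    its closest integer; smaller values indicate a better fit."""
    mean_r = sum(r_cns) / len(r_cns)
    a_cns = [absolute_copy_number(r, mean_r, purity, ploidy) for r in r_cns]
    return sum(abs(a - round(a)) for a in a_cns)

# Toy profile: true absolute copy numbers observed at 60% purity.
true_a_cn = [2, 2, 3, 1, 4, 2]
true_purity = 0.6
true_ploidy = sum(true_a_cn) / len(true_a_cn)
r_cns = [true_purity * a + 2.0 * (1.0 - true_purity) for a in true_a_cn]

mean_r = sum(r_cns) / len(r_cns)
recovered = [
    absolute_copy_number(r, mean_r, true_purity, true_ploidy) for r in r_cns
]
fit_error = integer_distance(r_cns, true_purity, true_ploidy)
```

A joint estimate may then be obtained by evaluating `integer_distance` over a grid of candidate purity and ploidy values and retaining the pair with the smallest value; several pairs can fit equally well, so additional constraints may be needed in practice.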
According to a second aspect, there is provided a method of characterising the processes causing chromosomal instability occurring in a plurality of types of cancers, the method including the steps of:
The method may have any of the following optional features. The method of the present aspect may have any of the features described in relation to the first aspect. In particular, the copy number features and/or the step of quantifying the copy number features may have any of the features described in relation to copy number features and quantification thereof in any embodiment of the first aspect. The one or more mutational signatures and/or the step of identifying the one or more mutational signatures may have any of the features of any embodiment of the first aspect. The method may further comprise outputting one or more results of the method, for example to a user through a user interface, to a computing device, to a computer readable medium or memory. The one or more results may comprise information identifying the signatures. The information may comprise information identifying the copy number features used (such as e.g. parameters of a plurality of components/distributions of copy number features that have been identified in the plurality of tumour copy number profiles). The information may comprise the weights of each summarised copy number feature (component) in the signature. Thus, the present aspect also relates to a method for identifying signatures of chromosomal instability, for example for use in a method according to the first aspect. The method may further comprise identifying one or more processes causing chromosomal instability associated with at least one of the one or more signatures. Identifying one or more processes causing chromosomal instability associated with a signature may comprise analysing the pattern of copy number alterations associated with the signature (for example using the summarised copy number features and weights thereof in the signature to identify prevalent patterns of copy number abnormalities).
Identifying one or more processes causing chromosomal instability associated with a signature may comprise identifying chromosomal instability related genes (such as e.g. cancer driver genes, genes involved in DNA repair, DNA replication, cell cycle and/or chromatin organisation) whose mutational status correlates with exposure to the signature. Identifying chromosomal instability related genes whose mutational status correlates with exposure to the signature may comprise determining the mutational status (e.g. presence of single nucleotide variants, deletion and/or amplification in the gene) of said genes in the plurality of tumour samples, and testing for a difference in exposure between samples with a mutation in the gene and samples without a mutation in the gene (for example using a statistical test for equality of means between two groups). The exposures for the signature across the plurality of tumour samples may be centred and scaled prior to testing for a difference in exposure.
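By way of example, the comparison of exposures between mutated and non-mutated samples may be sketched as follows. A permutation test is used here purely as one possible test for equality of means; the choice of test, the function names and the example values are assumptions and are not prescribed above.

```python
import random
import statistics

def centre_and_scale(values):
    """Centre on the mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def permutation_test(exposures, mutated, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in mean exposure between
    samples with and without a mutation in the gene of interest.
    Returns the observed difference in means and an estimated p-value."""
    z = centre_and_scale(exposures)

    def mean_diff(flags):
        mut = [v for v, f in zip(z, flags) if f]
        wt = [v for v, f in zip(z, flags) if not f]
        return statistics.fmean(mut) - statistics.fmean(wt)

    observed = mean_diff(mutated)
    rng = random.Random(seed)
    flags = list(mutated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flags)  # reassign mutation labels at random
        if abs(mean_diff(flags)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Illustrative exposures; the first four samples carry a mutation.
exposures = [0.90, 0.80, 0.85, 0.95, 0.10, 0.15, 0.20, 0.05, 0.12, 0.18]
mutated = [True] * 4 + [False] * 6

difference, p_value = permutation_test(exposures, mutated)
```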
According to a third aspect, there is provided a method of predicting whether a subject with cancer is likely to respond to a therapy that targets a particular gene, the method comprising characterising a DNA sample obtained from a tumour of the subject, using the method of any embodiment of the first aspect, as having a high or low exposure to one or more signatures of chromosomal instability associated with response to inhibition of the gene, wherein if the sample is characterised as having a high exposure to said signature, the subject is likely to respond to the therapy. A signature of chromosomal instability may be considered to be associated with response to inhibition of the gene if the exposure to the signature is significantly correlated with the effect (e.g. cellular proliferation, growth inhibition, toxicity, etc) of perturbation of the gene, for example by genetic perturbation (such as in a CRISPR essentiality screen, also referred to as CRISPR knockout screen, or in an RNAi essentiality screen) or by drug perturbation of the gene (such as in a drug response screen). Correlation between exposure to a signature and the effect of perturbation of a gene may be assessed using Kendall's tau correlation. A sample may be considered to have a high exposure to the signature if exposure to the signature is higher than the expected exposure for the signature in a set of samples that do not respond to the perturbation of the gene. Instead, or in addition to this, a sample may be considered to have a low exposure to the signature if exposure to the signature is lower than the expected exposure for the signature in a set of samples that respond to the perturbation of the gene. Alternatively, a sample may be considered to have a high exposure to the signature if exposure to the signature is above a signature-specific predetermined threshold, such as e.g. a signature-specific predetermined threshold as described above. The therapy may be a drug, such as a chemotherapy. 
The method may further comprise treating the subject with a therapy that targets the gene or recommending the subject for treatment with a therapy that targets the gene if the subject is predicted to be likely to respond to the therapy. The method may further comprise treating the subject with an alternative therapy that does not target the gene or recommending the subject for treatment with an alternative therapy that does not target the gene if the subject is predicted to be unlikely to respond to the therapy.
The therapy may inhibit CCND1 and the signature may be a signature associated with tolerance of whole genome duplication or PI3K/AKT-mediated tolerance of whole genome duplication (such as e.g. CX4 in Table 7 or a corresponding signature). The therapy may inhibit PARP1 and the signature may be a signature associated with impaired homologous recombination (such as e.g. CX5 in Table 7 or a corresponding signature). The therapy may inhibit a kinase in a mitogenic pathway (such as EGFR, JAK1, MET, PRKCA, PIK3CA) and the signature may be a signature associated with replication stress (optionally wherein the signature is further indicative of focal amplifications, such as e.g. CX9 in Table 7 or a corresponding signature). The therapy may inhibit CDK4 and the signature may be a signature associated with replication stress (optionally wherein the signature is further indicative of clustered amplifications, such as e.g. CX13 in Table 7 or a corresponding signature). The therapy may be a therapy selected from any of the drugs in Table 4 or a therapy that targets any of the targets in Table 4, and the signature may be the corresponding signature in Table 4, as defined in Table 7, or a corresponding signature. The therapy may be a therapy that targets a gene selected from the genes in Table 5, and the signature may be the corresponding signature in Table 5, as defined in Table 7, or a corresponding signature. The therapy may be a therapy that targets a gene selected from the genes in Table 8, and the signature may be the corresponding signature in Table 8, as defined in Table 7, or a corresponding signature.
A corresponding signature is a signature that has been identified using data from a plurality of tumour samples and the set of copy number features, and that has the same aetiology as the corresponding signature described herein and/or is associated with the same pattern of change (e.g. as described in Table 3) as the corresponding signature described herein and/or is the most similar to the corresponding signature described herein when comparing a newly derived set of corresponding signatures and the set of signatures described herein. Establishing the similarity between sets of signatures may be performed as described herein for example in Example 2. A signature of chromosomal instability may be considered to be associated with response to inhibition of the gene if the exposure to the signature is significantly correlated with the effect of perturbation of the gene. A sample may be considered to have a high exposure to the signature if exposure to the signature is higher than the expected exposure for the signature in a set of samples that do not respond to the perturbation of the gene.
According to a fourth aspect, there is provided a method of identifying a drug target for the treatment of a cancer, the method comprising characterising a plurality of DNA samples, using the method of any embodiment of the first aspect, wherein the plurality of DNA samples comprise samples obtained from a tumour or a tumour cell line in which the drug target has been the subject of inhibition and for which response to inhibition of the drug target has been quantified, and determining whether one or more signatures of chromosomal instability are associated with response to inhibition of the drug target, wherein the presence of a signature of chromosomal instability associated with response to inhibition of the drug target is indicative of the drug target being usable for the treatment of a cancer in which the signature is active.
A signature of chromosomal instability may be considered to be associated with response to inhibition of the drug target if the exposure to the signature is significantly correlated with the effect (e.g. cellular proliferation, growth inhibition, toxicity, AUC of drug response curve, etc) of perturbation of the drug target. Correlation between exposure to a signature and the effect of perturbation of a drug target may be assessed using Kendall's tau correlation. The drug target may be a gene. The perturbation of the drug target may have been obtained, for example by genetic perturbation (such as in a CRISPR essentiality screen, also referred to as CRISPR knockout screen, or in an RNAi essentiality screen). The plurality of DNA samples may comprise samples obtained from a tumour or a tumour cell line in which a plurality of drug targets have been the subject of inhibition and for which response to inhibition of the plurality of drug targets has been quantified. In such embodiments, identifying a drug target may comprise determining whether the one or more signatures of chromosomal instability are associated with response to inhibition of any of the drug targets. The method may further comprise identifying and optionally providing a drug that targets the drug target.
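As a minimal sketch, Kendall's tau between signature exposure and the effect of perturbing a drug target may be computed as follows. The tau-b variant (which corrects for ties) is used as an assumption, the example values are illustrative, and the assessment of statistical significance is not shown.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both terms
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_pairs = concordant + discordant
    denominator = math.sqrt(
        (untied_pairs + ties_x_only) * (untied_pairs + ties_y_only)
    )
    return (concordant - discordant) / denominator

# Illustrative data for five cell lines: exposure to one signature and the
# effect of knocking out a gene (more negative = stronger dependency).
exposure = [0.1, 0.4, 0.2, 0.8, 0.6]
knockout_effect = [-0.1, -0.5, -0.2, -0.9, -0.6]

tau = kendall_tau_b(exposure, knockout_effect)
```

In this toy example the ranking of the effect is exactly the reverse of the ranking of the exposure, so the coefficient takes its minimum value of -1.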
Also described herein is a method of providing a prognosis for a subject who has been diagnosed with a tumour of a particular type, the method comprising: characterising a DNA sample obtained from a tumour of the subject, using the method of any embodiment of the first aspect, as having a high or low exposure to one or more signatures of chromosomal instability associated with prognosis in the particular tumour type, wherein samples with a high or low exposure to the one or more signatures are associated with different prognosis. The particular type of tumour may be ovarian cancer. For example, samples with a high exposure to signature CX14 or signature CX5 (or corresponding signature) may be associated with a poorer prognosis than samples with a low exposure to said signature. As another example, samples with a high exposure to signature CX3, signature CX11 or signature CX16 (or corresponding signature) may be associated with a better prognosis than samples with a low exposure to said signature.
According to a fifth aspect, there is provided a method of predicting whether a subject with cancer is likely to respond to a platinum-based therapy, the method comprising characterising a sample obtained from a tumour in the subject, e.g. using the method of any embodiment of the first aspect, as having a high or low exposure to a first signature (CX3) associated with impaired homologous recombination plus replication stress, impaired damage sensing and impaired nucleotide excision repair, wherein if the sample is characterised as having a high exposure to said signature, the subject is likely to respond to platinum-based therapy. Instead, or in addition to this, the method may comprise characterising a sample obtained from a tumour in the subject, using the method of any embodiment of the first aspect, as having a high or low exposure to a third signature (CX5) associated with impaired homologous recombination plus replication stress, wherein if the sample is characterised as having a high exposure to said signature, the subject is not likely to respond to platinum-based therapy. The first signature (CX3) associated with impaired homologous recombination plus replication stress, impaired damage sensing and impaired nucleotide excision repair may be signature CX3 provided in Table 7 or a corresponding signature. The third signature (CX5) associated with impaired homologous recombination plus replication stress may be signature CX5 provided in Table 7 or a corresponding signature. While mutational signatures enriched in samples that are HR deficient have been proposed in the past, the present inventors have been able to characterise the landscape of HR deficiency in more detail, and have thus identified different categories of samples showing signs of HR deficiency but also other processes that underlie CIN.
This enabled them to develop a highly performant clinical classifier, which is able to make predictions that would not have been possible without this more nuanced characterisation of the CIN processes active in tumours. A sample may be considered to have a high exposure to the first signature if exposure to the first signature is higher than exposure to a second signature (CX2) associated with impaired homologous recombination alone. A sample may be considered to have a low exposure to the first signature if exposure to the first signature is lower than exposure to the second signature (CX2). The second signature (CX2) associated with impaired homologous recombination alone may be signature CX2 provided in Table 7 or a corresponding signature. A sample may be considered to have a high exposure to the first signature if exposure to the first signature is higher than exposure to the first signature in a control sample or set of samples. In particular, a sample may be considered to have a high exposure to the first signature if exposure to the first signature is higher than the expected exposure for the first signature in a cohort of patients that are resistant to platinum-based therapy. Similarly, a sample may be considered to have a low exposure to the first signature if exposure to the first signature is not higher than the expected exposure for the first signature in a cohort of patients that are resistant to platinum-based therapy.
Alternatively, a sample may be considered to have a high exposure to the first signature if exposure to said signature is above a signature-specific predetermined threshold, such as e.g. a signature-specific predetermined threshold as described above (e.g. based on a background distribution of signature exposure for said signature). The cancer may be ovarian cancer (e.g. high grade serous ovarian cancer).
According to a related aspect, there is provided a method of predicting whether a subject with cancer is likely to respond to a platinum-based therapy, the method comprising characterising a sample obtained from a tumour in the subject, e.g. using the method of any embodiment of the first aspect, as having a high or low exposure to a third signature (CX5) associated with impaired homologous recombination plus replication stress but not impaired damage sensing and impaired nucleotide excision repair, wherein if the sample is characterised as having a high exposure to said signature, the subject is not likely to respond to platinum-based therapy. The third signature (CX5) associated with impaired homologous recombination plus replication stress but not impaired damage sensing and impaired nucleotide excision repair may be signature CX5 provided in Table 7 or a corresponding signature. A sample may be considered to have a high exposure to the third signature if exposure to the third signature is higher than the expected exposure for the third signature in a cohort of patients that are sensitive to platinum-based therapy. Similarly, a sample may be considered to have a low exposure to the third signature if exposure to the third signature is not higher than the expected exposure for the third signature in a cohort of patients that are sensitive to platinum-based therapy. Alternatively, a sample may be considered to have a high exposure to the third signature if exposure to said signature is above a signature-specific predetermined threshold, such as e.g. a signature-specific predetermined threshold as described above (e.g. based on a background distribution of signature exposure for said signature). The first, second and/or third signature(s) may have been obtained using the method of any embodiment of the second aspect.
The first, second and/or third signature exposure may have been normalised, for example by centring and scaling using predetermined parameters for the respective signatures. The predetermined parameters may have been obtained using a suitable cohort of patients/samples, such as a cohort of patients comprising platinum-sensitive and platinum-resistant patients.
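For illustration only, centring and scaling exposures with predetermined parameters and then comparing the first signature (CX3) with the second signature (CX2) may be sketched as follows. The scaling parameters, the function names and the example exposures are hypothetical; in practice the parameters would be derived from a suitable cohort.

```python
# Hypothetical (mean, standard deviation) per signature, as might be
# estimated from a cohort of platinum-sensitive and -resistant patients.
SCALING = {"CX2": (0.20, 0.10), "CX3": (0.15, 0.05)}

def normalise(exposures, scaling):
    """Centre and scale each exposure with predetermined parameters."""
    return {
        name: (exposures[name] - mean) / sd
        for name, (mean, sd) in scaling.items()
    }

def has_high_cx3_exposure(exposures, scaling=SCALING):
    """High exposure to CX3 is called when the normalised CX3 exposure
    is higher than the normalised CX2 exposure."""
    z = normalise(exposures, scaling)
    return z["CX3"] > z["CX2"]

def predict_platinum_response(exposures, scaling=SCALING):
    if has_high_cx3_exposure(exposures, scaling):
        return "likely to respond"
    return "not likely to respond"

sample_a = {"CX2": 0.25, "CX3": 0.30}
sample_b = {"CX2": 0.45, "CX3": 0.16}

prediction_a = predict_platinum_response(sample_a)
prediction_b = predict_platinum_response(sample_b)
```

A trained classifier operating on the exposures to the first, second and third signatures could be substituted for the simple comparison used in this sketch.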
According to a related aspect, there is provided a method of predicting whether a subject with cancer is likely to respond to a platinum-based therapy, the method comprising characterising a sample obtained from a tumour in the subject, e.g. using the method of any embodiment of the first aspect, in terms of its exposure to a first signature (CX3) associated with impaired homologous recombination plus replication stress, impaired damage sensing and impaired nucleotide excision repair, a second signature (CX2) associated with impaired homologous recombination alone, and a third signature (CX5) associated with impaired homologous recombination plus replication stress but not impaired damage sensing and impaired nucleotide excision repair, and classifying the sample between at least a first class and a second class using a classifier that has been trained to classify samples between at least a first class and a second class based on their exposure to the first, second and third signatures, wherein samples in the first class are likely to respond to platinum-based therapy and samples in the second class are unlikely to respond to platinum-based therapy. The classifier may be a support vector machine.
The signatures may be copy number signatures, i.e. signatures derived from copy number profiles. Preferably, the signatures of chromosomal instability have been obtained using the methods of the first aspect. The method may further comprise administering a platinum-based therapy to a subject that has been diagnosed as likely to respond to the platinum-based therapy. The method may comprise recommending a subject that has been diagnosed as likely to respond to the platinum-based therapy for treatment with the platinum-based therapy. The method may comprise administering an alternative therapy (e.g. another chemotherapy, radiotherapy, etc.) and/or recommending a subject for treatment with an alternative therapy, where the subject has been diagnosed as not likely to respond to the platinum-based therapy.
According to a further aspect, there is provided a method of selecting a subject having cancer for treatment with a platinum-based therapy, the method comprising characterising a sample obtained from a tumour in the subject as likely to respond to a platinum-based therapy according to any embodiment of the fifth aspect, and selecting the subject for treatment with a platinum-based therapy if the sample is characterised as likely to respond to platinum-based therapy.
According to a further aspect, there is provided a platinum-based therapy for use in a method of treatment of cancer in a subject from whom a DNA sample has been obtained and the DNA sample has been characterised by a method according to any embodiment of the fifth aspect as likely to respond to platinum-based therapy.
According to any of these aspects, the platinum-based therapy may be administered (or recommended for administration) in combination with one or more therapies, such as one or more chemotherapies, one or more courses of radiotherapy and/or one or more surgical interventions.
According to a further aspect, there is provided a platinum-based therapy for use in a method of treatment of cancer in a subject, the method comprising: (i) determining whether a DNA sample obtained from said subject is likely to respond to platinum-based therapy using a method according to any embodiment of the fifth aspect; and (ii) administering the platinum-based therapy to said subject if the DNA sample is determined to be likely to respond to platinum-based therapy.
According to a further aspect, there is provided a platinum-based therapy for use in a method of treatment of cancer in a subject, the method comprising: (i) determining whether a DNA sample obtained from said subject is likely to respond to platinum-based therapy using a method described herein; and (ii) administering the platinum-based therapy to said subject if the DNA sample is determined to be likely to respond to platinum-based therapy. The subject may have been diagnosed as having or being at risk of having ovarian cancer or oesophageal cancer.
According to a further aspect, there is provided a system comprising: a processor; and a computer readable medium comprising instructions that, when executed by the processor, cause the processor to perform the (computer-implemented) steps of the method of any preceding aspect.
According to a further aspect, there is provided a non-transitory computer readable medium or media comprising instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any embodiment of any aspect described herein.
According to a further aspect, there is provided a computer program comprising code which, when the code is executed on a computer, causes the computer to perform the method of any embodiment of any aspect described herein.
According to a further aspect, there is provided a method of predicting whether a subject with cancer is likely to respond to a therapy that targets a particular gene, the method comprising: characterising a DNA sample obtained from a tumour of the subject as having a high or low exposure to one or more signatures of chromosomal instability associated with response to inhibition of the gene, wherein if the sample is characterised as having a high exposure to said signature, the subject is likely to respond to the therapy. In some embodiments, the one or more signatures may be as defined herein (for example a CX1, CX2, CX3, CX4, CX5, CX6, CX7, CX8, CX9, CX10, CX11, CX12, CX13, CX14, CX15, CX16 or CX17 signature as defined in Table 7). The signature may be obtained or obtainable by a method as defined in connection with the first aspect of the invention. The therapy may be a therapy selected from any of the drugs in Table 4 or a therapy that targets any of the targets in Table 4, and the signature may be the corresponding signature in Table 4, as defined in Table 7, or a corresponding signature. The therapy may be a therapy that targets a gene selected from the genes in Table 5, and the signature may be the corresponding signature in Table 5, as defined in Table 7, or a corresponding signature. The therapy may be a therapy that targets a gene selected from the genes in Table 8, and the signature may be the corresponding signature in Table 8, as defined in Table 7, or a corresponding signature.
A corresponding signature is a signature that has been identified using data from a plurality of tumour samples and the set of copy number features, and that has the same aetiology as the corresponding signature described herein and/or is associated with the same pattern of change (e.g. as described in Table 3) as the corresponding signature described herein and/or is the most similar to the corresponding signature described herein when comparing a newly derived set of corresponding signatures and the set of signatures described herein. Establishing the similarity between sets of signatures may be performed as described herein, for example in Example 2. A signature of chromosomal instability may be considered to be associated with response to inhibition of the gene if the exposure to the signature is significantly correlated with the effect of perturbation of the gene. A sample may be considered to have a high exposure to the signature if exposure to the signature is higher than the expected exposure for the signature in a set of samples that do not respond to the perturbation of the gene. In some embodiments in accordance with this aspect of the invention: the therapy inhibits CCND1 and the signature is a signature associated with tolerance to whole genome duplication or PI3K/AKT-mediated tolerance of whole genome duplication (such as e.g. CX4 in Table 7 or a corresponding signature); the therapy inhibits PARP1 and the signature is a signature associated with impaired homologous recombination (such as e.g. CX5 in Table 7 or a corresponding signature); the therapy inhibits a kinase in a mitogenic pathway (such as EGFR, JAK1, MET, PRKCA, PIK3CA) and the signature is a signature associated with replication stress (optionally wherein the signature is further indicative of focal amplifications, such as e.g. CX9 in Table 7 or a corresponding signature); or the therapy inhibits CDK4 and the signature is a signature associated with replication stress (optionally wherein the signature is further indicative of clustered amplifications, such as e.g. CX13 in Table 7 or a corresponding signature).
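By way of illustration only, the identification of the most similar signature when comparing a newly derived set of signatures with a reference set may be sketched as follows. Cosine similarity between component weight vectors is used here as an assumption; it is not stated in this passage to be the measure used in Example 2. The signature names "S1" and "S2" and all weights are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_signatures(new_signatures, reference_signatures):
    """Map each reference signature to the most similar newly derived one."""
    matches = {}
    for ref_name, ref_weights in reference_signatures.items():
        matches[ref_name] = max(
            new_signatures,
            key=lambda name: cosine_similarity(new_signatures[name], ref_weights),
        )
    return matches


# Hypothetical component weights (three components per signature).
reference = {"CX2": [0.7, 0.2, 0.1], "CX5": [0.1, 0.1, 0.8]}
derived = {"S1": [0.2, 0.1, 0.7], "S2": [0.6, 0.3, 0.1]}

matches = match_signatures(derived, reference)
```

In this sketch each reference signature is matched independently, so two reference signatures could in principle map to the same derived signature; a one-to-one assignment would need an additional step.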
In describing the present invention, the following terms will be employed, and are intended to be defined as indicated below.
“and/or” where used herein is to be taken as specific disclosure of each of the two specified features or components with or without the other. For example “A and/or B” is to be taken as specific disclosure of each of (i) A, (ii) B and (iii) A and B, just as if each is set out individually herein.
A “sample” as used herein may be a cell (including a circulating cell such as a circulating tumour cell) or tissue sample (e.g. a biopsy), a biological fluid, an extract (e.g. a protein or DNA extract obtained from the subject), from which genomic material can be obtained for genomic analysis, such as genomic sequencing (whole genome sequencing, whole exome sequencing, targeted (also referred to as “panel”) sequencing). In particular, the sample may be a tumour sample, a biological fluid sample containing DNA or cells, a blood sample (including plasma or serum sample), a urine sample, a cervical smear, an ascites fluid sample. It has been found that urine, ascites fluid and cervical smears contain cells, and so may provide a suitable sample for use in accordance with the present invention. Other sample types suitable for use in accordance with the present invention include fine needle aspirates, lymph node samples (e.g. aspirates or biopsies), surgical margins, bone marrow or other tissue from a tumour microenvironment, where traces of tumour DNA may be found or expected to be found. The sample may be one which has been freshly obtained from a subject or may be one which has been processed and/or stored prior to making a determination (e.g. frozen, fixed or subjected to one or more purification, enrichment or extraction steps). For example, the sample may be a formalin-fixed tumour sample. The sample may be derived from one or more of the above biological samples via a process of enrichment or amplification. For example, the sample may comprise a DNA library generated from the biological sample and may optionally be a barcoded or otherwise tagged DNA library. A plurality of samples may be taken from a single patient, e.g. serially during a course of treatment. Moreover, a plurality of samples may be taken from a plurality of patients.
As such, a sample as described herein may refer to any type of sample comprising cells or genomic material derived therefrom, whether from a biological sample obtained from a subject, or from a sample obtained from e.g. a cell line. The sample is preferably from a mammal (such as e.g. a mammalian cell sample or a sample from a mammalian subject, including in particular a model animal such as mouse, rat, etc.), preferably from a human (such as e.g. a human cell sample or a sample from a human subject). Further, the sample may be transported and/or stored, and collection may take place at a location remote from the genomic sequence data acquisition (e.g. sequencing) location, and/or the computer-implemented method steps may take place at a location remote from the sample collection location and/or remote from the genomic data acquisition (e.g. sequencing) location (e.g. the computer-implemented method steps may be performed by means of a networked computer, such as by means of a “cloud” provider).
“Patient” as used herein in accordance with any aspect of the present invention is intended to be equivalent to “subject” and specifically includes both healthy individuals and individuals having a disease or disorder (e.g. a proliferative disorder such as a cancer). Preferably, the patient is a human patient. In some cases, the patient is a human patient who has been diagnosed with, is suspected of having or has been classified as at risk of developing, a cancer. The cancer may be ovarian cancer, breast cancer, endometrial cancer (uterus/womb cancer), kidney cancer (renal cell), lung cancer (small cell, non-small cell and mesothelioma), central nervous system cancer including brain cancer (gliomas, astrocytomas, glioblastomas), melanoma (including choroid melanoma and skin cancers), Merkel cell carcinoma, clear cell renal cell carcinoma (ccRCC), carcinoma of unknown primary (CUP), lymphoid cancer (such as e.g. lymphoma), gastrointestinal cancer (e.g. colorectal cancer, oesophagus cancer, stomach cancer), small bowel cancers (duodenal and jejunal), leukemia, pancreatic cancer, hepatobiliary tumours, germ cell cancers, bone/soft tissue cancer, prostate cancer, head and neck cancers (such as e.g. adenoid cystic carcinoma, ACC), cervical cancer (e.g. cervical squamous cell carcinoma and endocervical adenocarcinoma, CESC), liver cancer, bladder cancer (such as e.g. bladder carcinoma, BLCA), urinary tract cancer, neuroendocrine tumour (NET), thyroid cancer and sarcomas.
For example, the cancer may be any cancer represented in The Cancer Genome Atlas (TCGA) such as LAML (Acute Myeloid Leukemia), ACC (Adrenocortical carcinoma), BLCA (Bladder Urothelial Carcinoma), LGG (Brain Lower Grade Glioma), BRCA (Breast invasive carcinoma), CESC (Cervical squamous cell carcinoma and endocervical adenocarcinoma), CHOL (Cholangiocarcinoma), LCML (Chronic Myelogenous Leukemia), COAD (Colon adenocarcinoma), ESCA (Esophageal carcinoma), GBM (Glioblastoma multiforme), HNSC (Head and Neck squamous cell carcinoma), KICH (Kidney Chromophobe), KIRC (Kidney renal clear cell carcinoma), KIRP (Kidney renal papillary cell carcinoma), LIHC (Liver hepatocellular carcinoma), LUAD (Lung adenocarcinoma), LUSC (Lung squamous cell carcinoma), DLBC (Lymphoid Neoplasm Diffuse Large B-cell Lymphoma), MESO (Mesothelioma), OV (Ovarian serous cystadenocarcinoma), PAAD (Pancreatic adenocarcinoma), PCPG (Pheochromocytoma and Paraganglioma), PRAD (Prostate adenocarcinoma), READ (Rectum adenocarcinoma), SARC (Sarcoma), SKCM (Skin Cutaneous Melanoma), STAD (Stomach adenocarcinoma), TGCT (Testicular Germ Cell Tumors), THYM (Thymoma), THCA (Thyroid carcinoma), UCS (Uterine Carcinosarcoma), UCEC (Uterine Corpus Endometrial Carcinoma), and UVM (Uveal Melanoma).
A “tumour sample” refers to a sample that contains tumour cells or genetic material derived therefrom. The tumour sample may be a cell or tissue sample (e.g. a biopsy) obtained directly from a tumour. A tumour sample may be a sample that comprises tumour cells or genetic material derived therefrom, that has not been obtained directly from a tumour. For example, a tumour sample may be a sample comprising circulating tumour cells or circulating tumour DNA. Thus, a tumour sample may also be a biological fluid (e.g. a liquid biopsy such as a blood, urine, or cerebrospinal fluid biopsy). A sample comprising a mixture of tumour cells and other cells (or genetic material derived therefrom) may be subject to one or more processing steps, whether prior to or subsequent to the acquisition of sequence data, in order to identify sequence data that is representative of the genetic material from the tumour. For example, a sample comprising cells may be subject to one or more cell purification steps which selectively enrich the sample for tumour cells. As another example, a sample of genetic material may be subject to one or more capture and/or size selection steps to selectively enrich the sample for tumour-derived genetic material. Protocols for doing this are known in the art. As another example, sequence data may be subject to one or more filtering steps (e.g. based on fragment length) to enrich the data for information that relates to tumour-derived genetic material. Protocols for doing this are known in the art. In embodiments, the sample is a sample comprising tumour cells. Preferably, such a sample has a tumour purity (where tumour purity can be quantified as the proportion of cells in the sample that are tumour cells) of at least 30%, at least 35%, at least 40%, at least 45%, or at least 50%. Advantageously, the sample has a tumour purity of at least 40%.
Without wishing to be bound by theory, it is believed that the copy number profiles generated from samples that have lower tumour purity may be less suitable for the purpose of the present invention as the signal corresponding to the tumour genome may be lost amongst the signal from the genomes of other cells.
A “normal sample” (also referred to as “germline sample”) refers to a sample that contains non-tumour or non-modified cells or genetic material derived therefrom. A normal sample may be matched to a particular tumour or modified sample in the sense that it is obtained from the same biological source (subject or cell line) as the tumour or modified sample. A normal sample may be a cell or tissue sample obtained from a subject, or a sample of biological fluid. A normal sample may be, e.g. a blood sample. A sample comprising a mixture of normal cells and other cells (or genetic material derived therefrom) may be subject to one or more processing steps, whether prior to or subsequent to the acquisition of sequence data, in order to identify sequence data that is representative of the genetic material from the normal cells (as already described above). For example, a sample comprising normal and tumour-derived cells (e.g. a blood sample comprising circulating tumour cells or a blood sample comprising cells from a haematological tumour) can be subject to one or more purification steps which selectively enrich the sample for normal cells. A normal sample may be used in the context of the present invention as a control to analyse a tumour sample. For example, a tumour sample and a (typically matched) normal sample may be analysed together in order to obtain a copy number profile for the tumour. Methods to obtain a tumour copy number profile from a pair of normal and tumour samples are known in the art. For example, these include ASCAT [Van Loo et al. 2010] and Sequenza [Favero et al. 2015]. Such methods may provide a tumour copy number profile as well as an estimate of the purity (proportion of tumour cells) of the tumour sample. The purity estimate may be used to exclude tumour samples that have low purity and hence could lead to low quality copy number estimates.
The term “sequence data” refers to information that is indicative of the presence and/or amount of genomic material in a sample that has a particular sequence. Such information may be obtained using sequencing technologies, such as e.g. next generation sequencing (NGS, such as e.g. whole exome sequencing (WES), whole genome sequencing (WGS), or sequencing of captured genomic loci (targeted or panel sequencing)), or using array technologies, such as e.g. SNP arrays, or other molecular counting assays. When NGS technologies are used, the sequence data may comprise a count of the number of sequencing reads that have a particular sequence. When non-digital technologies are used such as array technology, the sequence data may comprise a signal (e.g. an intensity value) that is indicative of the number of sequences in the sample that have a particular sequence, for example by comparison to an appropriate control. Sequence data may be mapped to a reference sequence, for example a reference genome, using methods known in the art (such as e.g. Bowtie (Langmead et al., 2009)). Thus, counts of sequencing reads or equivalent non-digital signals may be associated with a particular genomic location. Further, a genomic location may contain a mutation, in which case counts of sequencing reads or equivalent non-digital signals may be associated with each of the possible variants (also referred to as “alleles”) at the particular genomic location. The process of identifying the presence of a mutation at a particular location in a sample is referred to as “variant calling”, and can be performed using methods known in the art (such as e.g. the GATK HaplotypeCaller, https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller). 
For example, sequence data may comprise a count of the number of reads (or an equivalent non-digital signal) which match a germline (also sometimes referred to as “reference”) allele at a particular genomic location, and a count of the number of reads (or an equivalent non-digital signal) which match a mutated (also sometimes referred to as “alternate”) allele at the genomic location.
A composition as described herein may be a pharmaceutical composition which additionally comprises a pharmaceutically acceptable carrier, diluent or excipient. The pharmaceutical composition may optionally comprise one or more further pharmaceutically active polypeptides and/or compounds. Such a formulation may, for example, be in a form suitable for intravenous infusion.
As used herein “treatment” refers to reducing, alleviating or eliminating one or more symptoms of the disease which is being treated, relative to the symptoms prior to treatment.
The systems and methods described herein may be implemented in a computer system, in addition to the structural components and user interactions described. “Computer-implemented method” where used herein is to be taken as meaning a method whose implementation involves the use of a computer, computer network or other programmable apparatus, wherein one or more features of the method are realised wholly or partly by means of a computer program. As used herein, the term “computer system” includes the hardware, software and data storage devices for embodying a system or carrying out a method according to the above described embodiments. For example, a computer system may comprise a processing unit (such as a central processing unit, CPU, and/or a graphics processing unit, GPU), input means, output means and data storage, which may be embodied as one or more connected computing devices. Preferably the computer system has a display or comprises a computing device that has a display to provide a visual output display. The data storage may comprise RAM, disk drives or other computer readable media. The computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network. It is explicitly envisaged that computer system may consist of or comprise a cloud computer. The methods described herein may be provided as computer programs or as computer program products or computer readable media carrying a computer program which is arranged, when run on a computer, to perform the method(s) described herein. As used herein, the term “computer readable media” includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system. 
The media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media.
The present invention relates broadly to the characterisation of DNA samples, particularly tumour samples, in terms of their copy number profiles.
A “copy number profile” refers to the quantification of the number of copies for each of a plurality of portions of a genomic sequence. In the context of the present disclosure, a copy number profile is preferably a genome-wide copy number profile. A copy number profile is typically obtained by sequencing a sample of genomic DNA (or a DNA library derived therefrom, as explained above, including a sample of DNA derived from genomic DNA by fragmentation, such as e.g. cell free DNA), and quantifying the number of copies per portion (e.g. per bin, where a bin can be e.g. a 30 kb region) of the genomic sequence, as known in the art. A “tumour copy number profile” refers to a copy number profile that is associated with a tumour genome. A tumour copy number profile may be obtained by sequencing a sample of genomic DNA (or a DNA library derived therefrom, as explained above, or a sample of DNA derived from genomic DNA by fragmentation) that is derived primarily from tumour cells or assumed to be derived primarily from tumour cells. As the skilled person understands, such samples can be contaminated with non-tumour DNA, which contamination can be minimised through processing of the sample or of the sequencing data, as known in the art. Preferably, a tumour copy number profile is a copy number profile that has been obtained by sequencing a sample of genomic DNA derived from tumour cells. In other words, a tumour copy number profile is preferably obtained using a sample comprising tumour cells.
A “segment” in a copy number profile refers to a portion of a sequence represented in a copy number profile which is associated with a consistent absolute copy number. The consistent copy number is different from that associated with the sequence directly upstream (if such a sequence is present and associated with a copy number estimate) and the sequence directly downstream (if such a sequence is present and associated with a copy number) of said portion. In other words, a segment refers to a portion of sequence that has a copy number associated with it, where the copy number associated with the segment differs from the copy number associated with its immediate neighbouring segment(s). The copy number associated with a segment may differ from the copy number associated with its immediate neighbouring segment(s) because the segment(s) that surround the segment are associated with a different copy number, because the segment(s) that surround the segment are not associated with a copy number (e.g. because data for the segment(s) is missing, or of insufficient quality), or a combination of both (e.g. a segment may be surrounded by a segment that is associated with a different copy number on one side, and a segment that is not associated with a copy number on the other side). In other words, segments are the longest continuous portions of a copy number profile that are each associated with a single copy number. A segment may be associated with a set of coordinates, e.g. genomic coordinates, which define the boundaries of the segment. Each boundary may be associated with a copy number changepoint (also referred to as “changepoint”). In embodiments, the copy number profiles comprise at most 350 segments (copy number events), at most 300 segments, or at most 250 segments. Preferably, the copy number profiles comprise at most 250 segments (copy number events). These numbers may be particularly useful when looking at human genome-wide copy number profiles.
Without wishing to be bound by theory, it is believed that higher numbers of segments may be indicative of unwanted DNA degradation (such as e.g. formalin-mediated DNA degradation). The copy number of a segment may in practice be a copy number estimate obtained using a method for determining tumour copy number profiles, such as e.g. ASCAT [Van Loo et al., 2010] or Sequenza [Favero et al., 2015]. A “copy number event” refers to an instance in a copy number profile that is associated with one or more segments, and for which one or more copy number features can be quantified. For example, a segment may constitute an event and one or more features can be quantified for the segment, such as its length, its absolute copy number, or the difference in absolute copy number with a neighbouring segment. A set of segments may constitute a copy number event and one or more features can be quantified for the set, such as the number of breakpoints relative to the total length of the set of segments, and the number of contiguous segments oscillating between two copy number states.
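By way of illustration only, the definition of a segment and the associated quality criteria may be sketched as follows. The sketch assumes a single chromosome represented as a list of per-bin integer copy numbers, with None marking bins for which no copy number is available. The 30 kb bin size, the limit of 250 segments and the 40% purity threshold are taken from the values given above; the function names and the example data are hypothetical.

```python
def bins_to_segments(bin_copy_numbers, bin_size=30_000):
    """Merge consecutive bins sharing a copy number into segments.

    Returns a list of (start, end, copy_number) tuples. A bin without a
    copy number (None) closes the current segment and is itself skipped.
    """
    segments = []
    start = None
    current = None
    for i, cn in enumerate(bin_copy_numbers):
        if cn != current:
            if current is not None:
                segments.append((start * bin_size, i * bin_size, current))
            start, current = i, cn
    if current is not None:
        segments.append((start * bin_size, len(bin_copy_numbers) * bin_size, current))
    return segments


def passes_quality_control(segments, purity, max_segments=250, min_purity=0.40):
    """Keep profiles that are neither over-segmented nor of low tumour purity."""
    return len(segments) <= max_segments and purity >= min_purity


# Hypothetical per-bin copy numbers for one short chromosome.
profile_bins = [2, 2, 3, 3, 3, None, 2]
segments = bins_to_segments(profile_bins)
```

For a genome-wide profile the same function would be applied per chromosome and the segment counts summed before applying the limit, since the limit of 250 segments refers to the whole profile.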
The term “copy number (CN) features” refers to properties of copy number events observable in a copy number profile. Copy number features may include: segment size (also referred to as “segment length”, typically expressed in number of bases), breakpoint count per x MB (the number of changepoints appearing in a sliding window across the copy number profile, where the window is preferably 10 MB and the copy number profile is preferably genome-wide), change-point copy number (the absolute difference in copy number between a segment and an adjacent/neighbouring segment in the copy number profile, which may be defined relative to the upstream or downstream neighbouring segment), breakpoint count per chromosome arm (the number of changepoints occurring per chromosome arm), and number of segments with oscillating copy number (sometimes referred to as “length of segments with oscillating copy number”; number of continuous segments alternating between two copy number states, rounded to the nearest integer copy-number state; also referred to as length of chain of oscillating copy number states). The copy number features used according to the present invention do not include the segment copy number (the observed absolute copy number state of each segment, also referred to herein as “copy number” or “absolute copy number”). CN features can be observed on a genome-wide basis for a sample or collection of samples. In the context of the present disclosure, the term “genome-wide” refers to the assessment of a feature over a copy number profile that represents a substantial portion of a genome. For example, a substantial portion of a genome may comprise or consist of a chromosome, a plurality of chromosomes, or a portion of a genome determined by the parameters of the sequencing process used such as e.g. sequencing depth.
Indeed, as the skilled person understands, even whole genome sequencing protocols may fail to accurately capture every sequence in a genome, especially at lower sequencing depths.
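By way of illustration only, several of the copy number features listed above may be quantified as sketched below for a single chromosome. The sketch assumes contiguous segments given as (start, end, copy_number) tuples, counts changepoints in consecutive 10 MB windows as a simplification of a sliding window, and omits the per-arm count, which would need centromere coordinates. All names and values are hypothetical.

```python
def segment_sizes(segments):
    """Segment size in bases."""
    return [end - start for start, end, _ in segments]


def changepoints(segments):
    """Absolute copy number difference relative to the upstream neighbour."""
    return [abs(segments[i][2] - segments[i - 1][2]) for i in range(1, len(segments))]


def breakpoints_per_window(segments, chrom_length, window=10_000_000):
    """Number of changepoints falling in each consecutive window."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for start, _, _ in segments[1:]:  # each later segment starts at a changepoint
        counts[start // window] += 1
    return counts


def oscillating_chain_lengths(segments):
    """Lengths of maximal chains of segments alternating between two states."""
    cns = [round(cn) for _, _, cn in segments]
    chains = []
    run = 0  # consecutive positions i for which cns[i] == cns[i - 2]
    for i in range(2, len(cns)):
        if cns[i] == cns[i - 2] and cns[i] != cns[i - 1]:
            run += 1
        else:
            if run:
                chains.append(run + 2)
            run = 0
    if run:
        chains.append(run + 2)
    return chains


# Hypothetical segments on a 25 MB chromosome.
segments = [
    (0, 4_000_000, 2),
    (4_000_000, 6_000_000, 3),
    (6_000_000, 12_000_000, 2),
    (12_000_000, 15_000_000, 3),
    (15_000_000, 25_000_000, 5),
]
sizes = segment_sizes(segments)
jumps = changepoints(segments)
window_counts = breakpoints_per_window(segments, 25_000_000)
chains = oscillating_chain_lengths(segments)
```

In this example the first four segments alternate between copy numbers 2 and 3, giving a single oscillating chain of length 4. The absolute copy number of each segment is used only to derive differences and alternation, and is not itself reported as a feature.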
The genome-wide characteristics of a copy number feature can be assessed by quantifying the copy number feature for each copy number event in a copy number profile, and obtaining one or more summarised measures for the copy number profile. One or more summarised measures may be obtained for each copy number feature. Alternatively, one or more summarised measures may be obtained which capture the contribution of multiple features. One such measure is exposure to a signature representing the genome-wide imprint of distinct putative mutational processes (where a “mutational process” as used herein refers to any process that can cause chromosomal instability), also referred to herein as “copy number signature” or simply “signature”. As described in Alexandrov et al. in relation to single base substitutions (Cell Rep. 2013 Jan. 31; 3 (1): 246-59. doi: 10.1016/j.celrep.2012.12.008.), exposure to a mutational signature represents the number of mutations attributed to that signature in a particular genome. The signature of a mutational process is the set of probabilities of the mutational process causing each of the possible mutation types in a mutation catalogue (where mutation types are defined in Alexandrov et al. as C:G>A:T, C:G>G:C, C:G>T:A, T:A>A:T, T:A>C:G, and T:A>G:C). In Macintyre et al. (Nat Genet. 2018 September; 50(9): 1262-1270), the present inventors extended the concept of mutational signatures to copy number events instead of single base substitutions. The mutation types in the mutational catalogue were defined as individual components of mixture models fitted to the distribution of values obtained for each of 6 copy number features (including the segment copy number) across a set of genome-wide copy number profiles from a HGSOC cohort.
Copy number signatures (i.e. signatures of copy number alteration processes) were obtained using these, which capture the probability of copy number alteration processes causing copy number events that are distributed according to each of the copy number feature components. Thus, the term “exposure” (also referred to as “signature activity” or “activity”) in this context captures the strength of evidence for the presence of copy number alteration events attributable to the signature. In the present work, the inventors built on this approach and identified that a compact set of features that did not include a feature representing the copy number of a segment advantageously avoided the redundancy that arises when signatures for the same aetiology appear across different ploidy backgrounds. This therefore resulted in more robust and more biologically relevant signatures being identified using these copy number features. Further, the inventors identified and demonstrated that this approach could be used to identify signatures that are recurrent across all cancer types investigated, thus providing a robust pan-cancer framework for the characterisation of chromosomal instability in cancer.
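By way of illustration only, the summarisation of a copy number feature over the components of a mixture model may be sketched as follows: for each copy number event, the posterior probability of belonging to each component is computed, and these posteriors are summed over all events. The sketch assumes Gaussian components; the component parameters and the feature values are hypothetical and are not those of any fitted model described herein.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def sum_of_posteriors(values, components):
    """Sum, over events, of the posterior probability of each component.

    components is a list of (weight, mean, sd) tuples for one feature.
    Values far from every component would need guarding against a zero
    normalising constant; that is omitted here for brevity.
    """
    totals = [0.0] * len(components)
    for x in values:
        likelihoods = [w * gaussian_pdf(x, m, s) for w, m, s in components]
        norm = sum(likelihoods)
        for k, likelihood in enumerate(likelihoods):
            totals[k] += likelihood / norm
    return totals


# Hypothetical two-component model for log10(segment size).
components = [(0.5, 5.0, 0.5), (0.5, 7.0, 0.5)]
log_sizes = [5.1, 4.9, 7.2, 6.0]

catalogue = sum_of_posteriors(log_sizes, components)
```

Because the posteriors for each event sum to one, the entries for a feature sum to the number of events, so the resulting vector behaves like a soft count of events per component.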
Methods for determining the exposure to a signature are known in the art (see e.g. Alexandrov et al., 2020; Macintyre et al., 2018). In particular, the determination of the exposure to one or more mutational signatures may be performed by identifying the matrix E that satisfies C≈PE where C is a mutational catalogue for one or more samples for which exposure is to be determined (also referred to as patient-by-component, PbC matrix), P is a signature matrix comprising the one or more mutational signatures for which exposure is to be determined (also referred to as signature-by-component, SbC matrix), and E is an exposure matrix (also referred to as patient-by-signature, PbS). For example, exposure to a signature may be determined by performing matrix decomposition to identify the value (or vector of values) that satisfies the equation: PbC=PbS×SbC (or, in practice PbC=PbS×SbC+ε, where ε is a residual term to be minimised), where SbC is the row of the signature-by-component matrix that corresponds to a particular signature of a set of signatures, and PbS is the patient-by-signature value (or vector of values). Exposure to multiple copy number signatures can be calculated using the corresponding rows of the signature-by-component matrix. In embodiments, exposure (E) to a copy number signature i as described herein (from a set of n signatures, where n can be e.g. 17, where the signatures can include any or all of the signatures disclosed herein or corresponding signatures) is the value Ei that satisfies the equation:
PbC ≈ SbC × E
where: E is a vector of size n comprising coefficients E1, . . . , En where Ei is the exposure to signature i; PbC is a vector of size c, each value representing the sum-of-posterior probabilities of each copy number event in the copy number profile belonging to a component C, where each component C is a distribution of values for a copy number feature; and SbC is a matrix of size c by n, each value representing the weight of a component C in a copy number signature i as described herein.
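As a minimal sketch of this decomposition, the exposures can be estimated as the non-negative vector E for which SbC × E best approximates PbC in the least-squares sense. The function name `estimate_exposures`, the multiplicative-update solver and the toy 3-component, 2-signature matrix are illustrative assumptions; this is not the YAPSA routine referred to elsewhere in this description.

```python
def estimate_exposures(pbc, sbc, n_iter=500, eps=1e-12):
    """Estimate non-negative exposures E such that SbC x E approximates PbC.

    pbc: list of length c (sum-of-posterior value per component).
    sbc: list of c rows, each with n non-negative weights (component x signature).
    Uses multiplicative updates for non-negative least squares (an assumed,
    illustrative solver).
    """
    c, n = len(sbc), len(sbc[0])
    # Precompute SbC^T PbC (length n) and SbC^T SbC (n x n).
    wtv = [sum(sbc[k][i] * pbc[k] for k in range(c)) for i in range(n)]
    wtw = [[sum(sbc[k][i] * sbc[k][j] for k in range(c)) for j in range(n)]
           for i in range(n)]
    e = [1.0] * n  # strictly positive starting point
    for _ in range(n_iter):
        denom = [sum(wtw[i][j] * e[j] for j in range(n)) for i in range(n)]
        e = [e[i] * wtv[i] / (denom[i] + eps) for i in range(n)]
    return e


# Toy example: 3 components, 2 signatures (columns sum to 1).
toy_sbc = [[0.7, 0.1],
           [0.2, 0.3],
           [0.1, 0.6]]
true_e = [3.0, 1.0]
toy_pbc = [sum(w * x for w, x in zip(row, true_e)) for row in toy_sbc]
toy_exposures = estimate_exposures(toy_pbc, toy_sbc)
```

Dividing each estimated exposure by the sum of all exposures for the sample gives relative activities, which is one common way of comparing samples with different numbers of copy number events.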
The characterisation of a DNA sample from a tumour in terms of signatures of chromosomal instability active in the sample can be used to identify a treatment for a subject with cancer, and to identify a drug target for the treatment of cancer. Thus, the invention also provides a method of treating cancer in a subject, wherein the method comprises administering or recommending a subject for administration of a particular therapy, depending on the signature(s) of chromosomal instability that have been found to be active in the sample.
Optionally, sequence data may be obtained from the tumour (and optionally the matched normal) DNA sample(s). The step of obtaining sequence data from a DNA sample may comprise sequencing the DNA sample or analysing the sample using a genomic array.
Alternatively, sequence data may have been previously obtained. Thus, obtaining sequence data may comprise receiving the data from one or more databases, or from a user through a user interface. At step 32, the sample(s) is/are characterised using methods described herein such as e.g. by reference to
At optional step 38, the subject may be treated with the therapy identified at step 40. Alternatively, the methods described herein may be used to provide a prognosis for a subject. Thus, based on the determination made at step 32, a subject may be classified at step 35 as having a good or a poor prognosis, where prognosis is known to be associated with exposure to one or more signatures. For example, a signature may be known to be associated with prognosis if samples with a high or low exposure to the one or more signatures are associated with different prognosis in a cohort of patients. In other words, a signature may be known to be associated with prognosis if samples with a good prognosis in a cohort of patients have a significantly different expected exposure to the signature than samples with a poor prognosis.
The prognosis may be specific to a particular tumour type, such as e.g. ovarian cancer. Alternatively, the methods described herein may be used to identify a drug target for the treatment of cancer. This may comprise obtaining tumour copy number profiles for a plurality of samples and characterising them at step 32 using the methods described herein. The plurality of DNA samples in this case comprise samples obtained from a tumour or a tumour cell line in which the drug target has been the subject of inhibition and for which response to inhibition of the drug target has been quantified. This may further comprise at step 37 determining whether one or more signatures of chromosomal instability are associated with response to inhibition of the drug target, wherein the presence of a signature of chromosomal instability associated with response to inhibition of the drug target is indicative of the drug target being usable for the treatment of a cancer in which the signature is active. This may further comprise optional step 39 of identifying a drug that targets the drug target, for example through a drug database or using any drug design method known in the art.
Any treatment described herein may be used alone or in combination with another treatment. For example, any treatment with a drug may be used in combination with one or more chemotherapies, one or more courses of radiation therapy, and/or one or more surgical interventions. In particular, any treatment described herein may be used in combination with a treatment for which the subject has been identified as likely to be responsive.
For example, signatures of chromosomal instability described herein have been shown to be predictive of whether a patient with cancer is likely to respond to platinum-based therapy.
Additionally, the presence of some processes causing chromosomal instability in a tumour has been shown to be associated with different prognoses in cancer. Thus, also described herein are methods of providing a prognosis for a subject that has been diagnosed as having a cancer, the method comprising determining the activity of one or more signatures as described herein in a tumour from the subject. The method may further comprise classifying the subject between a group that has good prognosis, and a group that has poor prognosis.
For example, the method may comprise determining whether a sample from a tumour of the subject has a high or low exposure to one or more signatures that have been identified to be associated with prognosis (as explained above). A subject may then be classified in the group that has poor prognosis if the sample is determined to have a high exposure to the signature(s), and in a group that has good prognosis otherwise. Alternatively, a subject may be classified in the group that has poor prognosis if the exposure(s) to the signature or a score derived therefrom is above a threshold, and in the group that has good prognosis otherwise.
Whether a prognosis is considered good or poor may vary between cancers and stage of disease. In general terms a good prognosis is one where the overall survival (OS), disease free survival (DFS) and/or progression-free survival (PFS) is longer than that of a comparative group or value, such as e.g. the average for that stage and cancer type. A prognosis may be considered poor if OS, DFS and/or PFS is lower than that of a comparative group or value, such as e.g. the average for that stage and type of cancer. Thus, in general terms, a “good prognosis” is one where survival (OS, DFS and/or PFS) and/or disease stage of an individual patient can be favourably compared to what is expected in a population of patients within a comparable disease setting. Similarly, a “poor prognosis” is one where survival (OS, DFS and/or PFS) of an individual patient is lower (or disease stage worse) than what is expected in a population of patients within a comparable disease setting.
The subject is preferably a human patient.
The following is presented by way of example and is not to be construed as a limitation to the scope of the claims.
In this example, the inventors present a robust analysis framework for chromosomal instability in human cancers. The approach substantially extends previous work in ovarian cancer (Macintyre et al., 2018) and related work in sarcoma (Steele et al., 2019) using a pan-cancer analysis of 7,880 high-quality samples across the 33 cancer types in the TCGA collection. A compendium of 17 copy number signatures characterises different types of CIN and their aetiologies are supported by a wide array of independent data sources, as demonstrated in Example 2. In Example 3, the inventors apply the signatures to predict drug response and to identify new drug targets. In Example 4, the inventors show how the new framework refines the understanding of impaired homologous recombination, one of the most clinically relevant types of CIN.
Data sources. Table 1 lists the data used throughout these examples and their respective sources. All data are publicly accessible, except for the TCGA, PCAWG and ICGC raw data (SNP 6.0 arrays, WGS, WES). Access to the TCGA data (SNP 6.0 arrays, WES) and the TCGA-part of the PCAWG data (WGS) can be obtained by a Data Access Request through the Database of Genotypes and Phenotypes (dbGaP, http://www.cancergenomicscloud.org/controlled-access-data). For the ICGC data (SNP 6.0) and non-TCGA PCAWG data (WGS), an application to the ICGC DACO (https://daco.icgc.org/) has to be made. The COSMIC database can be accessed with a free account without an application (https://cancer.sanger.ac.uk/cosmic).
Inferring absolute copy numbers. Sample-specific CEL files from TCGA were processed with Affymetrix Power Tools (APT; v2.11.2; standard options) and ASCAT (Van Loo et al., 2010, R package, v2.4). By default ASCAT produces an absolute, genome-wide copy number segmentation where segments are rounded to the nearest integer copy number state and merged. However, the copy number signature framework described herein used unrounded copy number segments. Therefore, the inventors determined an optimal segmentation penalty to avoid over- or under-segmentation. They compared ASCAT generated breakpoints with multiple segmentation penalties (25, 35, 50, 70, 100, 140, 200) to the PCAWG consensus copy number calls for 714 samples from 22 cancer types which were generated using whole-genome sequencing. They found that a penalty of 70 offered the highest degree of overlap. They ran ASCAT on 12,240 samples of which 12,141 produced usable solutions. To remove any noise around diploid copy states, all segments around 2 (1.9<x<2.1) were collapsed to exactly 2 and adjacent segments were merged. This merging was also performed for deleted segments (copy number equals 0).
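The smoothing step described above can be sketched as follows. This is an illustrative sketch only: the tuple layout `(chromosome, start, end, copy_number)`, the function name `smooth_segments`, and the assumption that "adjacent" means consecutive segments on the same chromosome in a position-sorted list are not taken from the source.

```python
def smooth_segments(segments, lower=1.9, upper=2.1):
    """Collapse near-diploid segments to exactly 2 and merge adjacent
    diploid (2) or deleted (0) segments on the same chromosome.

    segments: position-sorted list of (chromosome, start, end, copy_number).
    """
    merged = []
    for chrom, start, end, cn in segments:
        if lower < cn < upper:
            cn = 2.0  # remove noise around the diploid state
        prev = merged[-1] if merged else None
        same_state = (prev is not None and prev[0] == chrom
                      and prev[3] == cn and cn in (0.0, 2.0))
        if same_state:
            # Extend the previous segment instead of adding a new one.
            merged[-1] = (chrom, prev[1], end, cn)
        else:
            merged.append((chrom, start, end, cn))
    return merged


example_segments = [
    ("1", 0, 100, 1.95),
    ("1", 100, 200, 2.05),
    ("1", 200, 300, 3.4),
    ("1", 300, 400, 0.0),
    ("1", 400, 500, 0.0),
    ("2", 0, 50, 2.0),
]
smoothed = smooth_segments(example_segments)
```

Unrounded non-diploid values such as 3.4 are left untouched, consistent with the use of unrounded copy number segments in the signature framework.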
Identifying high-quality copy number profiles. For 815 multisample patients with 2,021 samples, the inventors identified a representative sample based on the following criteria: sample passed quality control; from the primary tumour; had blood as normal control; if there were still multiple options at that point, one was chosen at random. This reduced the number of available copy number profiles to 11,032. The following four quality controls were applied to the absolute copy number estimates: 1) purity>0.4 and purity<1: samples with a purity lower than 40% were removed to avoid low quality absolute copy number estimates. A purity equal to 1 generally indicates that ASCAT has identified a copy number profile from a normal tissue and therefore only samples with purity<1 were considered. 2) tumour or normal MAPD<0.75: The median of absolute pairwise differences between adjacent probes (MAPD) is a per-microarray estimate of the variance, similar to the standard deviation or the interquartile range 33. Filtering for the MAPD removed noisy samples. 3) tumour and normal MAPD<0.4 and fraction homozygous segments <0.1: Occasionally tumour samples are matched with the wrong germline sample. This results in a high fraction of homozygous segments. The above thresholds removed these samples. 4) differences between log R and BAF segments <250: We identified and removed 67 samples from 17 cancer types with extreme noise in log R values, which manifested as a large-scale wave pattern which could not be captured by the median of absolute pairwise differences between adjacent probes (MAPD). The corresponding B-allele frequencies (BAF) values did not show this wave pattern and generally appeared consistent. These samples were identified by having more than 250 segments difference between their log R and BAF segmentations. All filters combined resulted in 7,880 high-quality copy number profiles across 33 cancer types (
Computing detectable chromosomal instability (dCIN). For signature identification, we aimed to use only samples with sufficient evidence of chromosomal instability (CIN). To that end, we identified and established a pragmatic threshold for the detectability of CIN (dCIN) from copy number profiles. We called ovarian copy number signatures on 539 high-grade serous ovarian cancer samples to estimate the number of undetected CNAs in order to establish an empirical threshold for detectable CIN. The first step of calling ovarian copy number signatures is to extract six fundamental features from all copy number segments: segment size, absolute copy number, copy number change to the left neighbouring segment, breakpoints per 10 megabases (Mb), breakpoints per chromosome arm, and length of oscillating chains. Based on previously established mixture models (Macintyre et al., 2018) for each of these six feature distributions, posterior probabilities for the extracted feature values were calculated. The output of this process is a matrix with 539 samples in its rows and the 36 mixture components as columns. The entries are the sum-of-posteriors of the previously determined posterior probabilities. Using YAPSA (Huebschmann et al., 2019, R package, v1.12.0) and the signature definition matrix from Macintyre et al., we calculated signature activities for each of the 539 samples. Not all the signal from the input matrix can be assigned to the different signatures. Using 5% as a threshold for assessing the activity of a signature in a sample, we set signature activities above 5% to 0 and multiplied the activity matrix by the signature definition matrix to derive a matrix with undetected signal. For each sample and feature, we summed the remaining probabilities to estimate the number of undetected CNAs. For each feature, we defined the quantile 0.95 of the distribution of undetected CNAs. 
From this distribution of quantiles 0.95, we took the 3rd quartile (which is less volatile and more robust than the maximum) as the empirical threshold of the detectability of CIN which is 20 CNAs (
Comparison to previous dCIN measures. Previous studies quantifying the proportion of cancer samples with CIN used aneuploidy as a surrogate measure, with pan-cancer estimates ranging between 60% and 95% depending on the method (karyograms: Cimini et al., 2008; genomic case studies: Lens & Medema, 2019, Mitelman et al.; computational methods: Carter et al., 2012; FISH: Cisyk et al., 2015; reviewed in Ben-David & Amon, 2020). Multiple studies found that around 90% of solid tumours are aneuploid (Weaver & Cleveland, 2006; Taylor et al., 2018, Cimini et al., 2008, Lens & Medema, 2019, Ben-David & Amon, 2020, Bakhoum et al. 2018, Duijf et al., 2013). This number goes back to a snapshot of the Mitelman database of chromosome aberrations and gene fusions, which analysed 17 solid and 9 hematopoietic cancers of which 13 could be matched to organs included in the TCGA project (
Deriving pan-cancer chromosomal instability signatures-Choice of features. Research into genomic and chromosomal instability revealed a long list of known mutational processes resulting in unique copy number patterns. Table 2 lists well-studied copy number patterns and describes which specific copy number feature is able to capture their unique characteristics. These observations motivate our selection of five fundamental features to model known patterns and potentially distinguish their mutational processes.
Deriving pan-cancer chromosomal instability signatures-breakpoint density. We included two features capturing potential clusters of CNAs: breakpoints per 10 Mb and breakpoints per chromosome arm. These two features are needed to accurately model events of different sizes: one for small and medium sized clusters with the largest possible window size and one feature for larger clusters. In more detail: Clusters of CNAs can come in various sizes, e.g. short tandem duplications with less than 10 kb in length each, long tandem duplications with lengths of over 100 kb each, and up to large-scale transitions and chromothriptic events which can span multiple dozens of megabases. Their respective cluster sizes can range from tens of kb up to whole chromosome arms. However, the smallest chromosome arm on a SNP 6.0 array is 12.8 Mb in length. Having window sizes larger than 10 Mb would therefore lead to a skewed breakpoint density on smaller chromosome arms. A smaller window size might be unable to capture medium sized clusters, which would require introducing a third feature capturing clusters of CNAs.
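To illustrate the windowed breakpoint density feature, the sketch below counts breakpoints in consecutive 10 Mb windows along one chromosome. The function name `breakpoints_per_window`, the use of non-overlapping windows starting at position 0, and the handling of the final partial window are assumptions made for illustration; the source specifies only the 10 Mb window size.

```python
def breakpoints_per_window(breakpoints, chrom_length, window=10_000_000):
    """Count breakpoints in consecutive, non-overlapping windows.

    breakpoints: iterable of 0-based breakpoint positions on one chromosome.
    chrom_length: chromosome length in base pairs.
    The last window may be shorter than `window`.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in breakpoints:
        index = min(pos // window, n_windows - 1)
        counts[index] += 1
    return counts


# Toy chromosome of 35 Mb with four breakpoints.
toy_counts = breakpoints_per_window(
    [1_000_000, 2_500_000, 12_000_000, 34_999_999], 35_000_000)

# A 12.8 Mb arm, the smallest on a SNP 6.0 array, spans two 10 Mb windows.
small_arm_counts = breakpoints_per_window([500_000], 12_800_000)
```

The per-window counts from all chromosomes of a sample form the count distribution to which the Poisson mixture model for this feature is fitted.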
Deriving pan-cancer chromosomal instability signatures-exclusion of absolute copy number as a feature. In a major deviation from previous work in Macintyre et al., we did not use the absolute copy number of a segment as a feature, because we found that it can artificially split a signature by ploidy status despite representing the same mutational process. This effect is very clear in the original ovarian copy number signatures, where two (signatures 3 and 7) defined the same process for different ploidy levels. When we rederived the ovarian signatures without the copy number feature we found six signatures (rather than the published seven), with signatures 3 and 7 collapsed to one (
Deriving pan-cancer chromosomal instability signatures-exclusion of loss of heterozygosity as a feature. We are not using a feature to describe LOH, because cheap genomic technologies like shallow WGS, which might be very useful in the clinic, do not have allele-specific information. In more detail: Loss of heterozygosity (LOH) is the loss of one allele and is a common genetic event during cancer evolution and might pose a novel class of cancer vulnerabilities by removing genetic redundancy in cancer cells. Most LOH events are so-called copy-loss events and can be detected through a change in copy number. In our data set, the median length of LOH events across the TCGA was 3.6 million bp, which suggests that the majority of events can be detected by the SNP6 technology. Implementing a specific LOH feature, i.e. the proportion of the major allele over both alleles, would add the ability to detect copy-neutral LOH and probably add resolution to further differentiate on copy-loss LOH events. But this would require the use of a high-throughput method allowing for the differentiation of the signal between alleles, such as WGS or SNP 6.0 arrays. However, with the advent of shallow WGS and single-cell sequencing in the scientific mainstream which do not yet allow allele-specific resolution 34, 63-65, a dedicated LOH feature might lead to constraints in the use of the future signatures.
Implementation changes to Macintyre et al., 2018. For the extraction of features from copy number profiles, we introduced three changes from the original implementation described in Macintyre et al.: we smoothed normal segments by collapsing and merging near diploid segments, we only used CNAs (non-diploid segments) for the segment size and changepoint feature and we removed the absolute copy number feature. In more detail: We collapsed and merged near diploid segments to a diploid state (1.9<x<2.1) to avoid signal from segments that were most likely normal diploid segments. We then ignored normal segments when extracting the segment size and changepoint features to avoid inflating the distributions. For the changepoint distribution, we skipped the first segment of a chromosome if it was a normal segment or subtracted 2 from the absolute copy number if it was a copy number aberration.
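A minimal sketch of the changepoint rule described above, for the segments of one chromosome after near-diploid smoothing. The function name `changepoints`, the use of the absolute difference to the left neighbour, and the list-of-copy-numbers input are illustrative assumptions.

```python
def changepoints(chrom_copy_numbers, diploid=2.0):
    """Changepoint values for the non-diploid segments of one chromosome.

    chrom_copy_numbers: copy numbers of consecutive segments, left to right,
    with near-diploid segments already collapsed to exactly `diploid`.
    """
    values = []
    for i, cn in enumerate(chrom_copy_numbers):
        if cn == diploid:
            continue  # normal segments are ignored
        # The first segment has no left neighbour: compare against diploid,
        # i.e. subtract 2 from its absolute copy number.
        left = diploid if i == 0 else chrom_copy_numbers[i - 1]
        values.append(abs(cn - left))
    return values


first_is_cna = changepoints([3.0, 2.0, 5.0, 1.0])
first_is_normal = changepoints([2.0, 4.0])
```

In the first toy chromosome the leading segment is an aberration and is compared against the diploid state; in the second the leading normal segment is skipped and only the following aberration contributes a value.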
The major difference brought about by these changes was a shift between signatures 3 and 7 (
Feature distributions and mixture modelling. We used mixture models for each feature distribution to denoise the data and to identify a core set of underlying copy number features. For the segment size and changepoint feature distributions, being (quasi)-continuous data, we fitted Dirichlet-Process Gaussian mixture models using variational inference (code implementation followed Blei and Jordan, 128 initial components and we ran 10,000 iterations for segment size distribution, and 5,000 iterations for changepoint distribution), resulting in 30 and 22 mixture components, respectively, with weights of more than 1%. Mixture components were merged if a mean was within or near the first standard deviation of another mixture component. The mean of the newly merged component was calculated as the weighted mean of the involved components. Standard deviation was estimated by sampling 100,000 points from the inferred distributions. This reduced the number of mixture components from 30 down to 22 for the segment size mixture model and from 22 down to 10 for the changepoint mixture model. For the count-distributed breakpoints per 10 Mb, breakpoints per chromosome arm and lengths of oscillating chains, we used Poisson mixture models from flexmix (Grün & Leisch, 2008, R package, v2.3-15) and used the Bayesian Information Criterion (BIC) to decide on the number of components. We derived 3 components for the breakpoints per 10 Mb distribution, 5 for the breakpoints per chromosome arm, and 3 for the length of oscillating chains. In total, we derived 43 mixture components. Once derived, we calculated the probability for each feature value to belong to each component. For a given sample, the probabilities of each mixture component were then summed together, resulting in a 1×43 dimensional vector of sum-of-posterior probabilities. Deriving the sum-of-posterior vector for all 6,335 samples resulted in a 6,335 by 43 sum-of-posterior matrix, referred to as input matrix from here on.
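The sum-of-posteriors calculation can be sketched for a single Gaussian-distributed feature as follows. This is an illustrative sketch, assuming that posteriors are the weighted component likelihoods normalised to sum to 1 for each value; the function names and the two toy components are hypothetical and do not correspond to the fitted models.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sum_of_posteriors(values, components):
    """Sum, per component, the posterior probability of each feature value.

    values: feature values extracted from one sample.
    components: list of (weight, mean, sd) tuples for one feature.
    Returns one total per component; the totals add up to len(values).
    """
    totals = [0.0] * len(components)
    for x in values:
        likelihoods = [w * gaussian_pdf(x, m, s) for (w, m, s) in components]
        norm = sum(likelihoods)
        if norm == 0.0:
            continue  # value is numerically outside every component
        for k, lik in enumerate(likelihoods):
            totals[k] += lik / norm
    return totals


# Two hypothetical, well-separated components and three feature values.
toy_components = [(0.5, 1.0, 0.5), (0.5, 5.0, 0.5)]
toy_totals = sum_of_posteriors([1.0, 1.2, 5.1], toy_components)
```

Concatenating the per-component totals of all five features for one sample gives that sample's row of the input matrix; for the count features the Gaussian density would be replaced by a Poisson probability mass function.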
Deriving pan-cancer signatures. We used non-negative matrix factorisation (NMF) on the input matrix. The 43 components contain copy number events generated by all types of CIN and NMF facilitates the identification of separate types of CIN. We used SignatureAnalyzer (Kim et al., 2016, Tan & Fevotte, 2013, Python 3.6.7, cloned on 15/03/19) with L1 regularisation on both output matrices to perform the NMF. We ran the NMF 200 times with random initialisation to navigate the non-convex optimisation function and compared the overall Kullback-Leibler (KL) divergence between the input matrix and the reconstructed input matrix from the output matrices. A lower KL divergence indicates a better solution. The number of signatures, K, was defined as the mode of the distribution of K's over the 200 runs, which was equal to 10 for all 6,335 samples pan-cancer. Of the 200 runs, 95 solutions had a K of 10. To pick a representative solution, we first calculated the cosine similarity between the signatures in the 95 sets and visualised the results as a graph.
The graph had 950 nodes, each one representing a signature, and edges connecting nodes if the cosine similarity between these two signatures was above 0.85, a threshold established in the literature (Alexandrov et al., 2020). This graph showed 12 independent clusters with sporadic outliers attached to them, as well as a few single independent nodes. In a second step, we ordered the 95 solutions by their KL divergence and, starting with the highest-ranked solutions, evaluated their spread in the graph. We chose the 9th optimal solution as it had a signature in 10 of the 12 clusters. The previous 8 solutions all had at least two signatures in the same cluster, indicating almost identical signatures. In terms of the Kullback-Leibler divergence, the solution we picked is about 1% worse than the most optimal solution we derived.
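The graph construction described above can be sketched as follows: signatures are nodes, an edge joins two signatures whose cosine similarity exceeds 0.85, and clusters are read off as connected components. The function names, the union-find implementation and the four toy signatures are illustrative assumptions; the source does not state how clusters were identified in the graph.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_signatures(signatures, threshold=0.85):
    """Group signature indices into connected components of the
    similarity graph (edge if cosine similarity > threshold)."""
    n = len(signatures)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(signatures[i], signatures[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


toy_signatures = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
]
toy_clusters = cluster_signatures(toy_signatures)
```

With clusters in hand, a candidate solution can be scored by how many distinct clusters its signatures fall into, which mirrors the criterion of preferring a solution whose signatures do not share a cluster.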
Deriving cancer-type enriched signatures. Next, we aimed at deriving cancer-type enriched signatures to complement the pan-cancer set of 10 signatures. We hypothesised that there might be signatures present in a certain cancer type but that their signal may be drowned out during pan-cancer NMF. Therefore, for 20 of the 33 cancer types with over 100 samples with detectable CIN, we derived cancer-type enriched signatures following the above described method. The number of samples was similar to those found in previous publications on copy number signatures (Macintyre et al., 2018, Steele et al., 2019). K ranged from three in testicular germ cell tumours to eight in five cancers. For 11 cancer types the mathematically optimal set of signatures was also the representative of other solutions. For the other 9 cancer types we picked a representative solution over the mathematical solution. In total we derived 128 cancer-type enriched signatures.
Merging to a signature compendium. For merging the pan-cancer and the cancer-type enriched signatures to create a compendium of CIN signatures, we compared signatures by cosine similarity. To decide which signatures to merge we derived a threshold using simulation. Based on the set of 138 pan-cancer and cancer-type enriched signatures (138×43 matrix), we simulated 1,000 sets of signatures using a Dirichlet process from MCMCpack (Martin & Park, 2011, R package, v1.4-7), maintaining the proportion of 0s and signature components summing to 1 for each signature. Descriptive analyses (heatmaps) were then used to display (i) the distribution of cosine similarities between the signatures, (ii) the simulated signatures, and (iii) the distribution of cosine similarities between the original signatures and the simulated signatures. These analyses showed that the simulated signatures were similar to the original ones. Therefore, we used the quantile 0.999 of the distribution of simulated cosine similarities, which was equal to 0.74, as a cosine similarity threshold. In order to merge the pan-cancer and cancer-type enriched signatures, we employed a number of strategies to avoid redundancies amongst the signatures: (1) remove cancer-type enriched signatures which were similar to pan-cancer signatures. To do this, we removed all cancer-type enriched signatures that had a cosine similarity over 0.74 with any pan-cancer signature. (2) remove cancer-type enriched signatures which were too similar to each other. To do this, we selected a representative signature (highest activity for the largest number of samples) from the groups of cancer-type enriched signatures that had a cosine similarity over 0.74. (3) remove cancer-type enriched signatures which constitute a combination of pan-cancer signatures. To do this, we performed non-negative least squares on each pair of pan-cancer specific signatures and cancer-type enriched signatures. 
For any combination which showed a reconstruction error below 0.1, the cancer type enriched signature was removed. The combination of three strategies reduced the 128 cancer-type enriched signatures to 7. No combination of pan-cancer signatures resulted in a cancer-type enriched signature, and vice-versa, therefore the 10 pan-cancer signatures were combined with the 7 cancer-type enriched signatures to yield a set of 17 signatures of CIN in human cancers. These signatures were then named by their rank determined by the number of samples with non-zero signature activity. Once defined, the final signature activities for all 6,335 samples with detectable chromosomal instability were computed using the linear combination decomposition function from YAPSA (R package, v1.12.0) on the original input matrix.
Simulation of signature activities. We tested how stable the signature activities are by performing a Monte Carlo simulation where we added noise to the number of CNAs and their characteristics and then measured how strongly the signature activities changed. For each value of the segment length and the copy number changepoint feature, we added or subtracted up to 10% of their value which we drew from a Gaussian distribution with a mean of that value and a standard deviation of one-twentieth of the mean. If a drawn value was larger than 10% we discarded that value and drew again. This procedure had the effect that longer or higher copy number segments were modified more strongly than smaller or lower copy number segments. In addition, we modified the breakpoints per chromosome arm, breakpoints per 10 Mb and length of oscillating copy number chain features by drawing the same number of values from a sample's feature distribution with replacement. In order to avoid a strong deviation from the original sample feature distribution, we applied a maximum change of +/−10% to the number of CNAs. If a draw resulted in a higher change in CNAs, then we redrew values until the number of CNAs was below 10% change from the original value. One run of the simulation included a change of all five feature distributions for all 6,335 samples with a subsequent computation of the signature activities. In total, we ran the simulation 1,000 times. We plotted the interquartile range (IQR) for all samples and all 17 signatures (
Definition of signature-specific thresholds. To ensure robustness of the signature activities and to enable trust in small signature activities, we sought to identify signature-specific thresholds. Based on the 1,000 simulations, we used the non-zero simulated values from samples with true zero activity. We fitted a single Gaussian distribution to the non-zero values with the function Mclust from mclust (Scrucca et al., 2016, R package, v5.4.6) and took the quantile 0.95 as a threshold (
Testing robustness of signatures across sequencing technologies. In order to test the robustness of the signatures across sequencing technologies, we used 478 TCGA samples with detectable levels of CIN that were also part of the PCAWG project and derived signature activities and definitions in five settings: 1. SNP 6.0 without matched normal, 2. WGS downsampled to SNP 6.0 positions, 3. WGS downsampled to shallow WGS, 4. WES on-target reads, 5. WES off-target reads. For each setting, we fitted absolute copy number profiles (step 1, see “Absolute copy number profiling” below), extracted signature activities for the 17 compendium signatures and derived 10 signatures (step 2, see “Extract signature activities and definitions”) and compared them to the 10 pan-cancer signatures (
Absolute copy number profiling-SNP 6.0 without matched normal. For many bulk tumour SNP 6.0 profiles, the non-tumour contaminant pulls the B-allele frequency (BAF) of heterozygous SNPs close to 0.5, such that 0<BAFhet<1 and it can be distinguished from the BAF of homozygous SNPs, i.e. BAFhom=1 or BAFhom=0. Using this rationale, ASCAT provides a built-in function to infer which SNPs are heterozygous from the tumour profile only. The parameters of this function are based on the noise profile of the platform and the array design, as manually defined after visual inspection of multiple profiles from the same platform. In this case for Genome-Wide Human SNP Array 6.0, the function expects a range of BAF values for homozygous SNPs (BAFhom<0.04 or BAFhom>0.96) and a minimum fraction of noisy SNPs (>4% expected with BAFhom<BAFnoisy<BAFhet), as well as a platform-specific maximum fraction of heterozygous SNPs (25%) and minimum fraction of homozygous SNPs (67%). To identify heterozygous SNPs, the BAF of each SNP is first flipped around 0.5 such that all BAF values are within [0, 0.5], i.e. if BAF>0.5, BAFflipped=1-BAF. Then the fractions of homozygous fhom, noisy fnoise, and heterozygous fhet SNPs are determined from the platform-specific parameters and the sample's BAF values: from the total number of SNPs N, the fhom×N SNPs with a BAF closest to 0 are labeled homozygous; from the remaining SNPs, fhet×N SNPs with the lowest distance between their BAF and a local median of the BAF within a window of 301 (of the non-homozygous) SNPs centred around the SNP positions are labeled heterozygous; the remaining SNPs are labeled noisy and left out from downstream analyses. While this methodology allows heterozygous SNPs to be identified in most unbalanced segments, in tumours for which purity is close to 100%, the BAF of heterozygous SNPs in regions that are largely unbalanced, and especially with loss of heterozygosity (LOH), is indistinguishable from the BAF of homozygous SNPs, i.e. 
BAF=1 or BAF=0. Thus, highly imbalanced segments in pure samples are devoid of identifiable heterozygous SNPs. Moreover, because ASCAT's segmentation step only considers heterozygous SNPs, this leaves large blind spots for the identification of copy-number segments in those samples, which would significantly influence our signature framework. To counter this, we added a step after the identification of heterozygous SNPs, which rescues SNP positions in segments with BAF>0.75 in which the density of heterozygous SNPs is much lower than the average in non-LOH regions (<20%). We rescue as many SNP positions from the SNPs labeled as homozygous as are needed to reach a 20% density of heterozygous SNPs within the segment. To avoid inducing spurious segments in the BAF of rescued SNPs, we do not rescue SNPs at random but rather the SNPs with BAF values closest to BAF=0.5. We then run ASCAT on the union of the SNPs identified as heterozygous in the first step and those rescued in segments depleted of heterozygous SNPs.
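The BAF flipping, homozygous labelling and rescue steps described above can be illustrated with a minimal Python sketch. It is not ASCAT's implementation: the function names are hypothetical, the local-median step for heterozygous SNPs is omitted, and the heterozygous density is assumed to mean the fraction of SNPs in a segment that are labelled heterozygous.

```python
import math


def flip_baf(baf):
    """Mirror BAF values around 0.5 so that every value lies in [0, 0.5]."""
    return [1.0 - b if b > 0.5 else b for b in baf]


def label_homozygous(baf, f_hom):
    """Return indices of the f_hom * N SNPs whose flipped BAF is closest to 0."""
    flipped = flip_baf(baf)
    n_hom = int(round(f_hom * len(baf)))
    order = sorted(range(len(baf)), key=lambda i: flipped[i])
    return set(order[:n_hom])


def rescue_snps(baf, hom_idx, het_idx, target_density=0.20):
    """Promote homozygous-labelled SNPs of one segment until the target density is met.

    Assumption: density = heterozygous SNPs / all SNPs in the segment.
    SNPs with BAF closest to 0.5 are promoted first, never at random.
    """
    n_needed = math.ceil(target_density * len(baf)) - len(het_idx)
    if n_needed <= 0:
        return set(het_idx)
    candidates = sorted(hom_idx, key=lambda i: abs(baf[i] - 0.5))
    return set(het_idx) | set(candidates[:n_needed])
```

In this sketch `rescue_snps` is called once per segment that meets the BAF>0.75 and low-density criteria; the returned index set would then be passed on to segmentation together with the originally identified heterozygous SNPs.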
Absolute copy number profiling—WGS downsampled to SNP 6.0 positions. From PCAWG WGS (hg19) data, we considered allele counts from alleleCounter (v4.1.0, default parameters with the “--dense-snps” option; https://github.com/cancerit/alleleCount) at SNP 6.0 loci. For each sample, loci with at least 1 count in the tumour and 10 counts in the matched normal were used to derive log R and BAF. Log R was then corrected for GC content and replication timing, following the methodology implemented in the Battenberg R package (https://github.com/Wedge-lab/battenberg). ASCAT was then used with different penalty values of interest and a gamma value set to 1.
Absolute copy number profiling—Bin size estimation allowing for downsampling to shallow WGS. As genome binning, or lack thereof, differs between SNP 6.0 array copy number calling and shallow WGS approaches, we sought to identify the genomic bin size that most accurately recapitulates the segment feature distribution seen in SNP 6.0. Segment size was selected as an appropriate measure of the similarity in sample segmentation to compare between differing genomic technologies. A set of 40 ovarian cancer WGS tumour samples from the PCAWG study was used to verify segment size agreement because these samples show high variability in segment lengths and high segment counts per sample. WGS samples were downsampled to between 3 and 15 million reads, with consideration for ploidy, purity and bin size. Segment size distributions were calculated for five genomic bin sizes (5 kb, 15 kb, 30 kb, 50 kb, and 100 kb) and compared with segment size distributions generated from matched SNP 6.0 array segmentation as described in section “Inferring absolute copy number for SNP 6.0 arrays”. Segment size distributions were tested against the reference set using a Mann-Whitney U test after removing segments longer than 10 Mb. We selected a bin size of 30 kb because its segment size distribution showed a non-significant difference from the reference and therefore most accurately replicated the segmentation profiles in matched SNP 6.0 array data (30 kb p-value=0.11, all other bin sizes p-value <0.001). We downsampled WGS samples to appropriate read counts, based on the selected bin size, sample ploidy, and sample purity, and performed absolute copy number fitting as described in section “Shallow WGS: Absolute copy number fitting”.
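As an illustration of the bin size comparison, the following Python sketch drops segments longer than 10 Mb and compares two segment size distributions with a two-sided Mann-Whitney U test. It is a simplified stand-in, not the analysis code used here: the function names are hypothetical, the p-value uses the normal approximation, and no tie correction is applied.

```python
import math


def drop_long_segments(sizes, max_len=10_000_000):
    """Keep only segments of at most max_len base pairs (10 Mb by default)."""
    return [s for s in sizes if s <= max_len]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns (U statistic of x, approximate p-value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u1, p
```

Under this reading, a bin size would be preferred when the test against the SNP 6.0 reference distribution does not reject, as reported for 30 kb.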
Absolute copy number profiling—Shallow WGS: Absolute copy number fitting. To infer copy numbers in shallow coverage WGS, we defined 30 kb non-overlapping windows along the genome, excluding centromeres and regions of undefined sequence, i.e. “N”, in the reference genome hg19 (see section above “Bin size estimation allowing for downsampling to shallow WGS”). Then, we empirically identified low-quality bins akin to Scheinin et al., which can be bins of low mappability or high variability of sequence repeats across individuals. This is done as follows: for 10 normal diploid blood samples from the PCAWG project with an estimated fraction of tumour contaminant at 0%, we count the number of reads in each bin and derive the log R across the bins
where Nbin is the number of non-duplicated reads with mapping quality >=30 in the bin, and
where is the rank of each bin's log R and N is the number of bins. We then summed the scores across samples and took the absolute value. After visual inspection of the profiles, a conservative fifteen percent of the bins with the highest final score were labelled as bad bins and removed from downstream analyses. This was done separately for autosomes and chromosome X. We then processed each tumour sample individually: we counted reads in the remaining 30 kb bins and derived the log R as described. The log R was then corrected for the GC fraction of the bins using loess, and segmented using circular binary segmentation (CBS, Olshen et al., 2004) using the function segment (Seshan & Olshen, 2019, R package DNAcopy, v1.64.0) with default parameters. We assessed the influence of the alpha parameter (0.001, 0.01, 0.05), which is the p-value threshold used to identify segments with different mean values. After the profile was segmented, we searched the space of combinations of realistic purity and ploidy values (0.01<purity<1, 1.5<average tissue ploidy<5, and tumour ploidy<5) to fit the log R to integer values, solving the log R equation for each segment:
log R = log2((ρ×nT + 2×(1-ρ)) / ψ)
where nT is the total number of copies in the tumour, ρ is the purity and ψ is the average tissue ploidy. We picked the purity and ploidy combination that minimized the distance between nT and integer values and solved the equation for each segment to obtain a non-rounded total number of copies.
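The purity and ploidy search can be sketched in Python as a plain grid search. This is a hedged illustration only: it assumes the ASCAT-style relationship log R = log2((ρ×nT + 2×(1-ρ))/ψ), a grid step of 0.01, a length-weighted absolute distance to the nearest integer, and a rejection of clearly negative copy numbers; none of these choices is confirmed by the text beyond the stated parameter ranges, and all function names are hypothetical.

```python
import math


def logr_from_copies(n_t, purity, psi):
    """Forward model (assumed): log R for a segment with n_t tumour copies."""
    return math.log2((purity * n_t + 2.0 * (1.0 - purity)) / psi)


def copies_from_logr(logr, purity, psi):
    """Invert the forward model to obtain a non-rounded total copy number."""
    return (psi * 2.0 ** logr - 2.0 * (1.0 - purity)) / purity


def fit_purity_ploidy(seg_logr, seg_len, n_grid=100):
    """Grid search over 0.01 < purity < 1 and 1.5 < psi < 5 (step 1 / n_grid).

    Returns (distance, purity, psi, copies) for the first best combination.
    """
    total = float(sum(seg_len))
    best = None
    for i in range(2, n_grid):
        purity = i / n_grid
        for j in range(int(1.5 * n_grid) + 1, 5 * n_grid):
            psi = j / n_grid
            copies = [copies_from_logr(r, purity, psi) for r in seg_logr]
            if min(copies) < -0.5:  # assumption: reject negative copy numbers
                continue
            ploidy = sum(c * l for c, l in zip(copies, seg_len)) / total
            if ploidy >= 5.0:  # tumour ploidy < 5
                continue
            dist = sum(abs(c - round(c)) * l for c, l in zip(copies, seg_len)) / total
            if best is None or dist < best[0]:
                best = (dist, purity, psi, copies)
    return best
```

Several purity and ploidy combinations can fit a profile equally well (for example solutions that shift or double all copy numbers), so a sketch like this returns only the first minimum it encounters; a real fitting procedure needs additional criteria to choose among them.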
Absolute copy number profiling—WES: Absolute copy number fitting of on-target reads. From TCGA WES (hg19) data, we considered allele counts from alleleCounter (v4.0.0, default parameters with the “--dense-snps” option; https://github.com/cancerit/alleleCount) at SNP loci from the 1000 Genomes Project (Auton et al., 2015, phase 3) falling into exonic regions (GENCODE v19 exome, Frankish et al., 2019, see Table 1 for details; extended by 1,000 bp upstream and downstream). For each sample, loci with at least 1 count in the tumour and 20 counts in the matched normal were used to derive log R and BAF. Log R was then corrected for GC content and replication timing, following the methodology implemented in Battenberg (Wedge et al., 2021; R package, v2.2.9). ASCAT was then used with different penalty values of interest (35, 50, 70, 100, 140) and a gamma value set to 1. The following rules were used to discard samples: No ASCAT solution; Purity>0.99 and 1.99≤ploidy≤2.01; Tumour MAPD>0.75; Tumour MAPD>0.4 and fraction homozygous segments ≥0.1; Differences between log R and BAF segments >250; Number of considered SNPs ≤75,000. To pick a single representative sample for multi-sample cases, we applied the following priority: matched blood over matched adjacent normal tissue, then most recent analysis date (if available). If multiple samples still fitted these criteria, a sample was randomly selected.
Absolute copy number profiling—WES: Absolute copy number fitting of off-target reads. We estimated downsampling read counts using the same methodology and bin size estimates as for the downsampling of WGS to shallow WGS (see section “Bin size estimation allowing for downsampling to shallow WGS”). To derive the copy-number profiles, we first defined an on-target region from the BED definition of the targets in TCGA, adding 1,000 base pairs either side of the targets. We then binned the genome into 30 kb bins excluding the on-target regions, i.e. the bins are not necessarily made of contiguous sequences and their genomic width is usually larger than 30 kb. We removed bins that overlap with “bad bins” as defined earlier, derived the log R in each bin, segmented it with CBS, and applied the ASCAT log R-only methodology as described above in section “Shallow WGS: Absolute copy number fitting”.
Extract signature activities and definitions. Extraction of the feature values from the copy number profiles was performed as described above in section “Feature distributions and mixture modelling”. To identify the signature activities, we used the linear combination decomposition function from YAPSA (35; R package, v1.12.0) and the 17 CIN signatures. To identify new signatures and compare their definitions to the original 10 pan-cancer signatures, we used the same procedure as described above (“Deriving pan-cancer signatures”) with the notable change that we used only solutions with K equal to 10, the number of original pan-cancer signatures. As no solution with K=10 was found for the two WES settings and the SNP6-positions only WGS setting, we chose the optimal K (K=7) for the two WES settings and greyed out the figure for the SNP6-positions only WGS setting.
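To make the idea of fitting activities against fixed signature definitions concrete, here is a small Python sketch. It does not reproduce the YAPSA function: it swaps in multiplicative non-negative least-squares updates with the signatures held fixed, and the function name, iteration count and toy signatures are illustrative assumptions.

```python
def fit_activities(signatures, sample, n_iter=500, eps=1e-12):
    """Estimate non-negative activities h so that sum_k h[k] * signatures[k] ~ sample.

    signatures: list of K vectors, each with one weight per feature component.
    sample:     one vector of feature component values for a single tumour.
    Technique:  multiplicative updates with fixed signatures (a stand-in,
                not the algorithm used by YAPSA).
    """
    k = len(signatures)
    # Precompute W^T v and W^T W, where the columns of W are the signatures.
    wtv = [sum(s * v for s, v in zip(sig, sample)) for sig in signatures]
    wtw = [[sum(a * b for a, b in zip(si, sj)) for sj in signatures]
           for si in signatures]
    h = [1.0] * k
    for _ in range(n_iter):
        denom = [sum(wtw[a][b] * h[b] for b in range(k)) for a in range(k)]
        h = [h[a] * wtv[a] / (denom[a] + eps) for a in range(k)]
    return h
```

For example, with two toy signatures `[0.7, 0.2, 0.1, 0.0]` and `[0.0, 0.1, 0.2, 0.7]` and a sample built as three parts of the first plus one part of the second, the estimated activities approach 3 and 1.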
Comparison to gold standard (SNP6 with matched normal). All signature activities were significantly different from a random background distribution of signature activities. For this background distribution, we simulated 1,000 sets of random signature activities for the 478 samples, similar to the method described above in section “Merging to a signature compendium”. The expected random cosine similarity was around 0.05 (average random cosine similarity across all samples, peaking at 0.05). The signature definitions were compared by cosine similarity, where we chose a cutoff of 0.95 to define equality. Every one of the 10 original pan-cancer signatures (
Extension to WGS and WES data. We developed our signature approach using TCGA SNP6 data because this represents the largest pan-cancer collection of high-resolution copy number profiled tumours to date. Nevertheless, the resolution of copy number calling from SNP6 data is lower than whole-genome sequence data, which is a limitation of the particular implementation in this example. In the cross-technology platform comparison, we normalised all data to match the resolution of SNP6 arrays and showed that our signature methodology is robust across technology platforms at this resolution (
CIN activity cannot be simply measured by the number of CNAs. Modelling the effects of different types of chromosomal instability presents a unique challenge compared to SNV and SV signatures. This challenge stems from the fact that for CNA analyses we need to distinguish between a genome-changing event and the number of CNAs it produces. For some types of chromosomal instability, a single event causes a single copy number change (e.g. whole chromosome missegregation), while for other types a single event can cause many changes (e.g. chromothripsis). Therefore, to adequately model the activity of different types of CIN, the number of CNAs needs to be decoupled from the number of “events”. This is different from SNV signatures, where the number of SNVs is assumed to be a good measure of the activity of a mutational process. Developers of SV signatures faced a similar problem and Li et al. (2020) addressed it by performing extensive clustering of SVs into mutually exclusive “event” types, ranging from events that encompassed a single deletion of a certain size, to events that involved large numbers of SVs as part of chromoplexy chains. This facilitated the counting of these events in a tumour and the application of NMF, similar to what had been previously done for SNVs. The approach of Li et al thus allowed SV signatures to be interpreted in a similar manner to SNV signatures. Copy number data, however, presents unique challenges, precluding the use of a similar approach. CNAs cannot be easily clustered into single “events” as is the case for SVs, as they lack the information required to relate one CNA to another. Further confounding this problem is the fact that copy number changes are quantified with respect to reference genome coordinates. For example, the loss of a “chimeric” chromosome (an already rearranged chromosome containing parts of many different chromosomes), is a single “event” in the tumour cell. 
However, mapped to a reference genome, this can manifest as many copy number changes across multiple chromosomes. Therefore, it is not possible to generate a vector of “event” counts and thus approaches similar to those used in the SNV and SV setting cannot be applied. The section “Estimation of CNAs produced by a signature” in the Methods of Example 2 shows how to provide an estimate of how many copy number changes are represented by each signature in a tumour.
Correlations and substructure in the feature embeddings. Modelling copy number signatures demands a completely new and different approach. As originally published in Macintyre et al. (2018) and substantially enhanced here, we used the published literature to define a set of copy number related features representing known outcomes of different types of CIN including chromothripsis, tandem duplication, whole-genome duplication and focal amplification. This formed the basis of our set of 5 features. Embedding CNAs observed in a tumour into this feature space, means single CNAs can add weight to multiple features. This has the significant advantage of allowing “events” to be encoded that correspond to low or high numbers of CNAs, and thus overcomes the significant challenges outlined above of working with copy number data. However, this approach does have some potential caveats when interpreting the signature definitions. First of all, our feature encoding introduces substructure in the input matrix that is different to that of a typical SNV count matrix. However, this structure does not violate any assumptions of NMF (the only assumption of NMF is that the matrix does not contain negative numbers), so NMF can still be applied. When faced with deciding on which approach to use for matrix decomposition, we identified two possibilities: 1) a bespoke approach, similar to the Hierarchical Dirichlet model with latent variables that was used in Li et al. (2020), where we specifically take into account the substructure of our feature encoding; or 2) the widely used NMF approach, which is agnostic to the substructure. The advantage of the HDP-like approach is that we can impose prior knowledge that certain features should behave in certain, coordinated ways. However, as the feature encoding itself already strongly imposes prior knowledge, this additional constraint may suppress any ability to perform data driven discovery and could be overly biased by our prior knowledge. 
The advantage of using NMF is that it is robust and widely used across many scientific domains, including face recognition, astronomical image analysis and text mining. It has well defined behaviour and provides a data driven approach to signature identification across our feature space. In this work we aimed for a balance between using prior knowledge and data driven discovery: we encoded prior knowledge in the selection of copy number features (which were designed to capture known patterns of CIN) but used a widely used, well-tested, and robust “off-the-shelf” method for deconvolution. However, it is important to note that NMF may result in signature weights that do not reflect the expectations of our prior knowledge. A significant strength of our approach is that it is easy to identify potentially erroneous signature definitions as these have strong weights on a single component. In this case, the single defining weight is so strong that other related weights (which we might expect to see) get shrunk to 0 (this is part of the NMF regularisation to ensure a unique solution). Fortunately, CX13 is the only signature that exhibits such behaviour. Our low to medium amplification signatures (CX8 and CX9) have weights on high changepoints but also have weights on the expected segment sizes (both short and long). However, in the case of CX13, the high changepoint component is such a strong defining feature that no weight is given to segment size components. While this may yield a slightly unexpected signature definition, this does not necessarily imply that this signature is not robust in representing the process generating high-level focal amplification. Indeed, post hoc analysis of segments with the defining changepoint value (at least 10 copies) yields a median segment size of 790 kb, with an interquartile range between 280 kb and 2.7 Mb, which is consistent with estimates for ecDNA sizes of 1-3 Mb. This effect may be more subtly manifested across all signatures. 
The replication of all pan-cancer signatures during our cancer-type enriched signature analysis suggests that our signatures are robust readouts of biological signals.
Testing for an ecDNA-focused signature. In the CIN signature compendium, the maximum mixture component for the copy number changepoint feature was located at around 10 copies, in stark contrast to the ovarian copy number signature components (Macintyre et al., 2018) where the maximum was at 30. This component at a change of 30 copies was crucial for defining ovarian signature 6. Given that focal amplifications such as extrachromosomal DNA (ecDNA) are present in a wide range of cancer types but still show cancer-type specific enrichment, particularly in ovarian cancer (Kim et al., 2020, Gu et al., 2020), we asked whether we were missing a signature focusing on extremely high amplifications. We manually added the mixture component around 30 copies to the components of the changepoint feature and rederived pan-cancer signatures. This additional component did not create an additional signature focusing on ultra-high amplification. Instead, the additional mixture component was added to the signature already containing the strongest weight on the changepoint component around 10 copies.
All examples used the reference genome GRCh37/hg19. Unless stated otherwise, statistical software R (v3.6.1) was used.
The inventors derived 7,880 high-quality absolute copy number profiles across 33 tumour types using ASCAT (Van Loo et al., 2010) on the SNP array data of TCGA (
Using these 6,335 genome-wide copy number profiles, they computed distributions of five fundamental copy number features previously demonstrated to encode patterns of copy number change representing different underlying causes of CIN (Macintyre et al., 2018). These features are the copy number change between a segment and neighbouring segment, segment length, breakpoint count per 10 megabases, breakpoint count per chromosome arm and length of chains of oscillating copy number states (Methods). In a major departure from the previous work, the inventors did not include a feature representing the copy number of a segment as they discovered that this would advantageously avoid redundancy if signatures for the same aetiology appear across different ploidy backgrounds.
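Three of the five features can be illustrated with a short Python sketch operating on a segment table. The exact feature definitions are given in the Methods and are not reproduced here; in particular, taking the changepoint as the absolute copy number difference between adjacent segments on the same chromosome, and the tuple layout `(chrom, start, end, copy_number)`, are assumptions made for illustration.

```python
def segment_features(segments):
    """Compute segment lengths and changepoints from sorted segments.

    segments: list of (chrom, start, end, copy_number), sorted by chrom and start.
    Assumption: changepoint = |copy number difference| between neighbouring
    segments on the same chromosome.
    """
    lengths = [end - start for _, start, end, _ in segments]
    changepoints = []
    for prev, cur in zip(segments, segments[1:]):
        if prev[0] == cur[0]:
            changepoints.append(abs(cur[3] - prev[3]))
    return {"segment_length": lengths, "changepoint": changepoints}


def breakpoints_per_window(segments, chrom, chrom_length, window=10_000_000):
    """Count breakpoints per 10 Mb window along one chromosome."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    on_chrom = [s for s in segments if s[0] == chrom]
    for prev, cur in zip(on_chrom, on_chrom[1:]):
        if prev[3] != cur[3]:  # a breakpoint is a boundary with a copy number change
            counts[cur[1] // window] += 1
    return counts
```

The per-tumour values produced this way would then feed into the cohort-wide feature distributions; breakpoints per chromosome arm and oscillating chain lengths would need arm coordinates and a run-length scan, which are left out of this sketch.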
Then, the inventors applied mixture modelling to define distinct components for each cohort-wide feature distribution, which conceptually represent the basic building blocks for defining CIN processes, and identified a total of 43 mixture components across the 5 features (Methods,
The inventors investigated this data set in two different ways (
The inventors tested the robustness of their signatures in several ways. First, they used a Monte Carlo simulation study to ensure that low signature activities are not just due to noise and derived signature-specific activity thresholds (Methods,
In this example, the inventors investigated the putative causes underlying the signatures identified in Example 1. In order to do this, the inventors decided to derive an initial hypothesis based on the signature definitions (i.e. the patterns of copy number change encoded by the signature,
See Example 1. The paragraphs below first describe how the inventors compiled the list of genes of interest in order to link signature activities to mutated genes, then how they identified the signature aetiologies, and finally how they associated the signatures with orthogonal data.
Selection of 1,146 genes of interest. We performed mutation analysis on 1,146 putative CIN related genes compiled from known pan-cancer driver genes (from the TCGA, Bailey et al., 2018), and genes reported as involved in DNA repair, DNA replication, the cell cycle, and chromatin organisation by the Reactome database (Date of access: Mar. 3, 2020; Jassal et al., 2020, Fabregat et al., 2018). The Reactome query resulted in a list of 1,330 proteins known to be involved in DNA repair, DNA replication, the cell cycle, and chromatin organisation. Reactome supplied three types of identifiers: UniProt, ensembl gene identifier, and EMBL identifier. Using the three identifiers, we mapped the proteins to 1,443 ensembl gene entries (ensembl96 on a GRCh37 reference genome). Ignoring the UniProt identifier and removing duplicated entries, the total went down to 1,036 gene names with 1,136 ensembl identifiers. We added to this list 200 genes reported by Bailey et al. as pan-cancer drivers. We transferred the names to ensembl96 on a GRCh37 reference genome and merged them with the Reactome-derived list of genes. The resulting list had 1,296 ensembl gene identifiers for 1,179 gene names. An additional functional descriptor supplied by Bailey et al. was taken into account for the pan-cancer driver genes: “oncogene” and “tumour-suppressor gene”. If none existed, we assigned the descriptor “ambivalent”. In total, 58 oncogenes, 112 tumour-suppressors, and 30 ambivalent genes were labelled as such. For all 1,179 genes, we identified their transcriptional start sites (TSS) on a GRCh37 reference genome. Since most genes have multiple TSS, the canonical or strongest proven TSS was obtained by the following decision tree: 1) If only one TSS was present for a gene, then take this one independent of quality; 2) if a GENCODE basic flag was present, take only these, 3) if multiple TSS were present, choose one at random. This decision tree worked for all genes, resulting in 1,179 TSS. 
After filtering for TSSs on canonical chromosomes (1-22, X, Y), 1,146 genes remained, including 57 oncogenes, 108 tumour-suppressors, and 30 ambivalent genes. We used somatic, exonic single nucleotide variant (SNV) calls from the TCGA (22). Gene names, if necessary, were replaced by ensembl96 (Hunt et al., 2018, Yates et al., 2020) gene names using biomaRt (Durinck et al., 2009 and 2005; R package, v2.42.1) to reach consensus names across multiple data sources. The only gene from our list of 1,146 key genes that was not present was TERC, which is a long non-coding RNA. We filtered out synonymous and ambivalent SNVs, defined as those with one of the following variant classifications: Silent, Intron, 3′Flank, 3′UTR, 5′Flank, 5′UTR. In addition, we only kept SNVs that passed the TCGA filter criteria. Samples with known hypermutation status were removed in accordance with previous publications (Bailey et al., 2018). The set of SNVs did not include germline mutations, which had been removed by the NCI Genomics Data Commons (GDC) because of potential issues around patient identification. Germline mutations for BRCA1 and BRCA2, as well as promoter methylations for BRCA1 and RAD51C, were obtained from other TCGA publications (see next section). For evaluating the copy number state of the 1,146 key genes, we used the same copy number profiles as we did for deriving the CIN signatures. The copy number at the transcriptional start site (TSS) was taken as a surrogate copy number for the whole gene. For each TSS in every sample, we evaluated whether loss of heterozygosity (LOH), a deletion (DEL) or an amplification (AMP) was present. LOH was defined as a state where the b-allele (the allele with the minor copy number) had a copy number of less than 0.4. For deletions the same threshold of 0.4 was applied. Amplifications were called in cases with copy numbers over 5 copies. 
Both deletions and amplifications were checked against RNA expression data obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC) website (2). Again, gene names were updated to ensembl96 as described above. An amplification or deletion status was kept and carried forward in the analysis only if it was matched by a corresponding expression change: overexpression for amplified oncogenes, or under-expression for deleted genes in all other cases.
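The copy number state calls at a TSS and the expression check can be sketched as follows in Python. The thresholds come from the text; applying the deletion threshold to the total copy number, restricting amplifications to oncogenes, and passing the expression status in as ready-made booleans are assumptions, since the text does not spell out how over- or under-expression was determined. All names are hypothetical.

```python
def call_copy_state(total_cn, minor_cn, loss_threshold=0.4, amp_threshold=5.0):
    """Call LOH, DEL and AMP at a transcriptional start site.

    LOH: minor-allele copy number below 0.4.
    DEL: total copy number below 0.4 (assumed to refer to the total).
    AMP: total copy number above 5.
    """
    return {
        "LOH": minor_cn < loss_threshold,
        "DEL": total_cn < loss_threshold,
        "AMP": total_cn > amp_threshold,
    }


def corroborate_with_expression(state, is_oncogene, overexpressed, underexpressed):
    """Keep AMP/DEL calls only when expression changes in the matching direction.

    overexpressed / underexpressed are hypothetical precomputed flags.
    """
    return {
        "LOH": state["LOH"],
        "AMP": state["AMP"] and is_oncogene and overexpressed,
        "DEL": state["DEL"] and (not is_oncogene) and underexpressed,
    }
```

A call such as `call_copy_state(6.2, 1.0)` reports an amplification only, which survives the second step only if the gene is an oncogene flagged as overexpressed.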
Identification of samples with germline mutations or promoter hypermethylation in BRCA1, BRCA2 and RAD51C. We obtained BRCA1/2 mutation status from two studies: Maxwell et al. 2017 identified 100 BRCA and OV samples with a germline BRCA1/2 mutation (37 breast cancer, 63 ovarian); and Wang et al. 2017 identified 99 ovarian samples, 41 germline and 58 somatic samples. 149 of these 161 showed detectable levels of CIN. Of those, 139 samples also harboured a loss of heterozygosity (LOH) of either gene. LOH was defined as a state where the b-allele (the allele with the minor copy number) had a copy number of less than 0.4. Thirteen TCGA ovarian samples in which RAD51C was epigenetically silenced by promoter hypermethylation were identified by the TCGA Research Network (Cancer Genome Atlas Research Network, 2011), who measured methylation at CpG dinucleotides by using an Illumina GoldenGate methylation assay following the protocols from the TCGA Research Network (Cancer Genome Atlas Research Network, 2008).
Testing for differences in signature activity. Samples were classified as mutant using the following criteria: for oncogenes, only amplifications and point mutations that did not result in a frame shift were considered mutant; for tumour-suppressor genes, we used deletions and all types of point mutations. Amplifications and deletions were only used when they were corroborated by up- or downregulated gene expression. For BRCA1 we added two extra comparisons where we evaluated whether a potential mutation was rescued or not. Rescue was defined by a mutation in either TP53BP1, RIF1, or MAD2L2 (Boersma et al., 2015, Isono et al., 2017). Not rescued was defined by having no mutation in any of these three genes. As we had prior knowledge about germline mutation and promoter methylation for BRCA1, BRCA2, and RAD51C, we treated these samples as mutated. 
To test for differences in signature activity between samples with driver mutations and without a mutation in one of the 1,146 key genes, we scaled and centered the activities per signature (Z-scores). We split the samples into two groups, one group with a mutation and one without. Then we used Welch's t-test to perform a test of equality of means between the two groups. We corrected the resulting 19,006 p-values for the false discovery rate (Benjamini & Hochberg, 1995). Using an alpha of 0.05, we identified 788 genes with at least one significant correlation with the 17 signatures. Of those, 74 genes showed positive associations generating 182 significant correlations with the 17 signatures. We only regarded positive associations in order to avoid compositional data effects.
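The three statistical steps in this paragraph (Z-scoring, Welch's t-test and Benjamini-Hochberg correction) can be sketched with the Python standard library. This is an illustrative sketch rather than the R code used for the analysis: the function names are hypothetical, the sample standard deviation is assumed for scaling, and the p-value lookup from the t distribution is left out because it needs a statistics library.

```python
import math
from statistics import mean, stdev, variance


def z_scores(values):
    """Centre and scale activities of one signature (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def welch_t(mutant, wildtype):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    The p-value would come from a t distribution with df degrees of freedom.
    """
    va = variance(mutant) / len(mutant)
    vb = variance(wildtype) / len(wildtype)
    t = (mean(mutant) - mean(wildtype)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(mutant) - 1) + vb ** 2 / (len(wildtype) - 1))
    return t, df


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q
```

In this sketch a positive t statistic corresponds to higher activity in the mutant group, which matches the restriction to positive associations.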
Selection of high-confidence genes. To ensure that we only regard genes with high confidence when proposing a putative cause, we further filtered the 74 genes down to 32 by applying two additional criteria: q-value <0.005 and effect size >+0.4. The following 32 genes were identified as high-confidence genes: AKT1, BRCA1, BRCA2, CCND1, CDK4, CDKN2C, CIC, CUL1, ERBB2, ERBB3, ERCC2, FBXW7, H3F3A, IDH1, IDH2, MAPK1, MYC, NFE2L2, RAC1, RPL22, PBRM1, PCBP1, PIK3R2, PMS1, PPP2R1A, PSMA4, SPOP, SOS1, SOX9, TP53, U2AF1, VHL. We studied the literature around those genes and collated information on how they are known to influence genomic stability.
Enrichment in extrachromosomal DNA amplicons. For the three signatures (CX8, CX9, CX13) that showed a pattern of amplification (changepoint above 3 copies,
In detail, we used their Supplementary Table 1 (Kim et al., 2020) to identify whether high-confidence genes were part of ecDNA amplicons. For each of the high-confidence genes of the three amplification signatures, we used Welch's t-test to test whether samples with amplicons for that respective gene had a significantly higher signature activity than samples without amplicons. This led to an additional filtering step, resulting in the removal of the associations of seven genes with the three amplification signatures. For CX8 we found that amplicons including the genes SPOP, MYC, RAC1 lead to a significantly higher signature activity. For CX9 we excluded CDK4, CCND1, ERBB2 and for CX13 we excluded CDK4.
Proposing an aetiology and assigning a confidence rating. The two pieces of evidence we used for proposing aetiologies (i.e. putative causes and mechanisms) underpinning the signatures were the high-confidence genes and the pattern of change (signature definition).
In order to show our confidence in the evidence supporting a putative cause, we devised a simple confidence rating from zero to three (
Focused analysis of CCNE1 and CDK12. There is ample evidence for the importance of CCNE1 and CDK12 mutations in cancer and chromosomal instability. CDK12 showed strong associations with CIN signatures CX2 and CX5 but the strong p-value correction (for 19,000 multiple tests) scaled their q-values to 0.06, above our q-value threshold of 0.005. However, those trends support the notion that CX2 and CX5 are associated with class 2 and 3 tandem duplications (
Thus, we found that, classified as pan-cancer oncogene, CCNE1 is significantly associated with eight signatures (CX2, CX3, CX4, CX5, CX8, CX9, CX10-signatures with an effect size >0.4 and an adjusted p-value<0.005).
Correcting for cancer type when testing for driver gene associations. The above approach for testing associations between genes of interest and signature activities may be confounded by cancer type (see “Linking signature activities to mutated genes”). One example is VHL mutation, which is nearly exclusively limited to renal cancer. However, having a single tumour type enriched for a specific gene mutation and type of chromosomal instability does not mean that the association is not putatively causal (there is good support in the literature for VHL's role in regulating mitosis). Therefore, when performing this type of analysis, we faced a conundrum: an analysis without correction potentially reports confounded results, and an analysis with correction potentially excludes strong and real biology (e.g. BRCA2 is mutated in only a small number of tumour types and therefore may not pass correction). To deal with this, we opted to combine both approaches. We used an uncorrected approach to propose putative aetiologies, then used a corrected approach to see if these aetiologies continued to have support. The corrected approach tested the association between activity and mutations with a multivariate regression model correcting for cancer type. The original list of genes associated with signatures is provided below.
CX1: CDKN2C—Cell cycle checkpoint failure—Inhibitor of CDK4, controls G1 progression; VHL—supports the aetiology—Mitotic errors—Mitotic fidelity and spindle orientation; RPL22—Cell cycle checkpoint failure—Ribosomal protein activating TP53; IDH1—Proliferation—Amplification or deletions associated with increased proliferation; CIC—supports the aetiology Mitotic errors RTK/MAPK inhibition. Loss promotes mitotic errors, especially chromosome segregation effects; PBRM1—supports the aetiology—Mitotic errors—Loss may give rise to spindle checkpoint expression phenotype. Also implicated in sister chromatid and centromeric cohesion. Alternative name is BAF180 which is part of the SWI/SNF complex BAF/PBAF; SOX9—No clear role-Loss-of-function related to promotion of metastasis.
CX2: BRCA1—supports the aetiology—Impaired HR—Major role in HR and DNA replication fidelity.
CX3: BRCA1 (not rescued)—supports the aetiology-Impaired HR-Major role in HR and DNA replication fidelity. Samples with additional mutations in TP53BP1, RIF1 or MAD2L2 (also known as REV7) were ignored as they were shown to rescue BRCA1 loss; BRCA1-supports the aetiology-Impaired HR-Major role in HR and DNA replication fidelity; PMS1—Compensatory MMR, Downregulation of HR-Integral part of mismatch repair (MMR).
Upregulated it could compensate for higher SNVs due to impaired HR. MMR is also known to downregulate HR pathway. It has been suggested that MMR is involved in double-strand break repair, especially single-strand annealing (SSA); PCBP1—Alleviating replication stress-Regulates POLH which is important for translesion DNA synthesis (TLS). TLS is involved in resolving stalled forks during DNA replication. Also regulates multiple genes involved in DNA repair. Upregulation is likely to achieve a compensatory dosage effect of repair pathways; U2AF1—supports the aetiology—Replication stress (R loops)-Splicing factor contributing to cancer progression. Mutants involved in increased R loops and replication stress; BRCA2—supports the aetiology—Impaired HR—Major role in HR and DNA replication fidelity; PPP2R1A—supports the aetiology—Replication stress (Fork collapse)—Subunit of PP2. Activation of PP2 has been shown to interrupt DNA replication, resulting in fork collapses and HR activity; SOS1—supports the aetiology—Cell cycle progression-Produces protein involved in activation of Ras proteins; PIK3R2—supports the aetiology—Cell cycle progression—Regulatory subunit of PI3K, which recruits AKT and PDK1, involved in cell growth, survival; MAPK1—supports the aetiology-Cell cycle progression, Replication stress (Asymmetric forks, hyperreplication)—Essential component for progressing cell cylce, inducing mitosis, regulation of transcription etc. Causes Ras—induced CIN which is linked to replication stress by asymmetric forks and re-replication (DNA hyperreplication); NFE2L2—DNA damage response-Produces NRF2. Transcription factor. Involved in antioxidant response. Also activator of ATR. Involved in cellular response to double-strand breaks; TP53—supports the aetiology-Cell cycle progression, DNA damage response, Replication stress (Fork stalling)—Regulating the G1/S and G2/M checkpoints. 
Defective p53 leads to impaired HR and failed recoveries from replication fork stalling; MYC-supports the aetiology-Cell cycle progression, Replication stress (pot. fork stalling)—Known oncogene. Overexpression shown to influence cell cycle progression (G1/S transition), increase of genomic instability and replication stress. It is responsible for inducing DNA replication origins and bubbles. It is assumed fork stalling due to mechanical interaction of close replication bubbles; ERCC2—supports the aetiology—Impaired NER—Encodes a DNA helicase involved in nucleotide excision repair (NER). Many mutations are missense mutations. Most missense studied are known to inactivate NER. Mutations known to drive cisplatin sensitivity.
CX4: PIK3R2 - supports the aetiology - Cell cycle progression - Regulatory subunit of PI3K, which recruits AKT and PDK1, involved in cell growth, survival; PCBP1 - Alleviating replication stress - Regulates POLH, which is important for translesion DNA synthesis (TLS). TLS is involved in resolving stalled forks during DNA replication. Also regulates multiple genes involved in DNA repair. Upregulation is likely to achieve a compensatory dosage effect of repair pathways; AKT1 - supports the aetiology - Cell cycle progression, Mitotic errors - Known oncogene. Part of the AKT/PI3K pathways, regulating a wide variety of functions. Induces asymmetric cell divisions; IDH2 - No clear role - Proteins are used, among other things, in metabolism. Mutants are known to suppress HR and induce sensitivity to PARP inhibitors; U2AF1 - Replication stress (R loops) - Splicing factor contributing to cancer progression. Mutants involved in increased R loops and replication stress; PPP2R1A - Replication stress (Fork collapse) - Subunit of PP2. Activation of PP2 has been shown to interrupt DNA replication, resulting in fork collapses and HR activity; MAPK1 - supports the aetiology - Cell cycle progression, Replication stress (Asymmetric forks, hyper-replication) - Essential component for progressing the cell cycle, inducing mitosis, regulation of transcription etc. Causes Ras-induced CIN which is linked to replication stress by asymmetric forks and re-replication (DNA hyper-replication); ERCC2 - Impaired NER - Encodes a DNA helicase involved in nucleotide excision repair (NER). Many mutations are missense mutations. Most missense mutations studied are known to inactivate NER. Mutations known to drive cisplatin sensitivity; CDK4 - supports the aetiology - Cell cycle progression, Mitotic errors - Regulation of cell cycle. Important for centrosome separation.
CX5: U2AF1 - supports the aetiology - Replication stress (R loops) - Splicing factor contributing to cancer progression. Mutants involved in increased R loops and replication stress; CDK4 - supports the aetiology - Cell cycle progression, Mitotic errors - Regulation of cell cycle. Important for centrosome separation; PCBP1 - supports the aetiology - Alleviating replication stress - Regulates POLH, which is important for translesion DNA synthesis (TLS). TLS is involved in resolving stalled forks during DNA replication. Also regulates multiple genes involved in DNA repair. Upregulation is likely to achieve a compensatory dosage effect of repair pathways; PMS1 - supports the aetiology - Compensatory MMR, Downregulation of HR - Integral part of mismatch repair (MMR). When upregulated, it could compensate for higher SNVs due to impaired HR. MMR is also known to downregulate the HR pathway. It has been suggested that MMR is involved in double-strand break repair, especially single-strand annealing (SSA); MAPK1 - supports the aetiology - Cell cycle progression, Replication stress (Asymmetric forks, hyper-replication) - Essential component for progressing the cell cycle, inducing mitosis, regulation of transcription etc. Causes Ras-induced CIN which is linked to replication stress by asymmetric forks and re-replication (DNA hyper-replication); PPP2R1A - supports the aetiology - Replication stress (Fork collapse) - Subunit of PP2. Activation of PP2 has been shown to interrupt DNA replication, resulting in fork collapses and HR activity; BRCA1 - supports the aetiology - Impaired HR - Major role in HR and DNA replication fidelity.
CX6: PSMA4 - supports the aetiology - Mitotic errors (Spindle assembly checkpoint), APC and SCF Regulation - Part of the 20S proteasome, forming the 26S proteasome which activates the anaphase promoting complex (APC/C) and is implicated in the spindle assembly checkpoint (SAC) machinery. SCF protein complex is regulated by APC and indirectly by the 26S proteasome; CUL1 - supports the aetiology - Cell cycle progression, DNA damage response (NER), SCF Regulation - Part of the SCF ubiquitin ligase and promotes G1/S transition. CUL1 is known to upregulate mTORC1, which leads to cell growth. Cells can undergo tetraploidization involving activation of mTOR. Also implicated in activation of nucleotide excision repair (NER); H3F3A - supports the aetiology - Chromatin remodelling, Integrity of DNA, Mitotic errors - Core component of Histone 3.3 regulating transcription, DNA repair, DNA replication, and chromosomal stability. Histone 3.3 depletion causes dysfunction of heterochromatin structures at telomeres, centromeres, and pericentromeric regions of chromosomes, leading to mitotic defects; RAC1 - supports the aetiology - Cell cycle progression, SCF Regulation, Centrosome amplification - Activates aldolase A and ERK pathways. Associated with poor outcome. Overexpressed RAC1 activates MAPK in cell lines. Regulated by SCF protein complex. Overexpression known to correlate with centrosome amplification. Prominent role of RAC1 in upregulating R5P synthesis and nucleoside metabolism, which is involved in DNA damage repair.
CX8: U2AF1 - supports the aetiology - Replication stress (R loops) - Splicing factor contributing to cancer progression. Mutants involved in increased R loops and replication stress; PCBP1 - supports the aetiology - Alleviating replication stress - Regulates POLH, which is important for translesion DNA synthesis (TLS). TLS is involved in resolving stalled forks during DNA replication. Also regulates multiple genes involved in DNA repair. Upregulation is likely to achieve a compensatory dosage effect of repair pathways; PIK3R2 - supports the aetiology - Cell cycle progression - Regulatory subunit of PI3K, which recruits AKT and PDK1, involved in cell growth, survival; PMS1 - Compensatory MMR, Downregulation of HR - Integral part of mismatch repair (MMR). When upregulated, it could compensate for higher SNVs due to impaired HR. MMR is also known to downregulate the HR pathway. It has been suggested that MMR is involved in double-strand break repair, especially single-strand annealing (SSA); ERCC2 - Impaired NER - Encodes a DNA helicase involved in nucleotide excision repair (NER). Many mutations are missense mutations. Most missense mutations studied are known to inactivate NER. Mutations known to drive cisplatin sensitivity; AKT1 - supports the aetiology - Cell cycle progression, Mitotic errors - Known oncogene. Part of the AKT/PI3K pathways, regulating a wide variety of functions. Induces asymmetric cell divisions; PPP2R1A - supports the aetiology - Replication stress (Fork collapse) - Subunit of PP2. Activation of PP2 has been shown to interrupt DNA replication, resulting in fork collapses and HR activity; ERBB3 - supports the aetiology - Cell cycle progression - Dimerises with ERBB2 and can activate both MAPK and AKT/PI3K pathways; MAPK1 - supports the aetiology - Cell cycle progression, Replication stress (Asymmetric forks, hyper-replication) - Essential component for progressing the cell cycle, inducing mitosis, regulation of transcription etc. Causes Ras-induced CIN which is linked to replication stress by asymmetric forks and re-replication (DNA hyper-replication); CDK4 - supports the aetiology - Cell cycle progression, Mitotic errors - Regulation of cell cycle. Important for centrosome separation; SPOP - Not taken into account due to its significant enrichment in ecDNA amplicons, which we believe results in a spurious correlation with amplification signatures; MYC - Not taken into account due to its significant enrichment in ecDNA amplicons, which we believe results in a spurious correlation with amplification signatures; RAC1 - Not taken into account due to its significant enrichment in ecDNA amplicons, which we believe results in a spurious correlation with amplification signatures.
CX9: CDK4 - Not taken into account due to its significant enrichment in ecDNA amplicons, which we believe results in a spurious correlation with amplification signatures; CCND1 - Not taken into account due to its significant enrichment in ecDNA amplicons, which we believe results in a spurious correlation with amplification signatures; ERBB2 - Not taken into account due to its significant enrichment in ecDNA amplicons, which we believe results in a spurious correlation with amplification signatures.
CX10: PPP2R1A - supports the aetiology - Replication stress (Fork collapse) - Subunit of PP2. Activation of PP2 has been shown to interrupt DNA replication, resulting in fork collapses and HR activity; FBXW7 - supports the aetiology - Impaired NHEJ - Known tumour suppressor. Part of the ubiquitin protein ligase complex SCF. Has been shown to facilitate NHEJ; CDK4 - supports the aetiology - Cell cycle progression, Mitotic errors - Regulation of cell cycle. Important for centrosome separation; ERBB2 - Cell cycle progression - ERBB2, or HER2, is involved in cell-cycle progression and mitosis. ERBB2 signalling is shown to initiate CIN and defective cell-cycle control in certain breast cancer models.
CX11: CDK4 - supports the aetiology - Cell cycle progression, Mitotic errors - Regulation of cell cycle. Important for centrosome separation.
CX13: CDK4 - Not taken into account due to its significant enrichment in ecDNA amplicons, which we believe results in a spurious correlation with amplification signatures.
CX14: CIC - supports the aetiology - Mitotic errors - RTK/MAPK inhibition. Loss promotes mitotic errors, especially chromosome segregation effects.
CX17: U2AF1 - supports the aetiology - Replication stress (R loops) - Splicing factor contributing to cancer progression. Mutants involved in increased R loops and replication stress; PMS1 - Compensatory MMR, Downregulation of HR - Integral part of mismatch repair (MMR). When upregulated, it could compensate for higher SNVs due to impaired HR. MMR is also known to downregulate the HR pathway. It has been suggested that MMR is involved in double-strand break repair, especially single-strand annealing (SSA).
Nearly all genes in the uncorrected analysis (n=24/32) appeared in the corrected list. The following gene/signature associations appeared in the original list, but did not appear in the corrected list: CX1: IDH1, CDKN2C, SOX9, PBRM1, VHL; CX3: NFE2L2, PIK3R2, PPP2R1A; CX10: PPP2R1A. These associations have the potential to be confounded and therefore aetiologies based on these need to be treated with caution. Of these, the only ones that pose significant concern are VHL and PBRM1, as these are used to link CX1 with mitotic errors. However, this concern is mitigated by the fact that in the corrected list, tumours with inactivating mutations in CDH1, a key regulator of mitosis, are significantly higher in CX1 activity, supporting the putative aetiology of mitotic errors. These results demonstrate that our proposed aetiologies, which use driver gene mutation associations, are robust to potential confounding effects of tumour type.
Supporting aetiologies by orthogonal data. In order to corroborate the putative causes we used an array of orthogonal data: two additional patient cohorts and their clinical metadata (~1,900 patients from the PCAWG project, ~400 patients from the ICGC project); 5 types of mutational signatures (SBS, indel, DBS, ovarian copy number, rearrangement); 14 molecular features (somatic point mutations, gene expression, cell cycle score, aneuploidy score, whole-chromosome CNAs, SCNA levels, tandem duplications, loss of heterozygosity (LOH), chromothripsis, kataegis, whole-genome duplication status, telomere length and elongation machinery activity, ecDNA, centrosome amplifications); and 11 DNA repair specific features (germline BRCA1 and BRCA2 mutations, BRCA1 and RAD51C hypermethylation data, HRDetect response, Myriad myChoice score, TP53 inactivation score, telomeric imbalances score, large-scale state transition score, LOH score, DNA repair proficiency score, protein expression score for 23 DNA-damage repair genes, PCAWG structural variants with associated microhomologies). The sources of the data are listed in Table 1. These data were processed in the following way.
Chromothripsis. The detection of chromothripsis events was performed for the PCAWG samples [12]. We only used events that received “Chromothripsis” as a final call. We counted the number of chromothriptic events for the 410 samples with signature activities and tested the Spearman correlation with the activities of the 17 CIN signatures. Multiple testing correction was performed according to Benjamini and Hochberg.
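The rank correlation and multiple-testing correction used here and in the analyses that follow can be sketched as below. This is a minimal illustration in Python rather than the R code used for the analyses; the function names are hypothetical, and the p-value of each correlation is assumed to be obtained separately from a standard statistics package.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, one p-value per signature (17 in total) would be passed to `benjamini_hochberg` for each tested covariate.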
Kataegis. The detection of kataegis events was performed for the PCAWG samples (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, 2020). Signature activities were available for 498 of these samples. We reduced the list of detected kataegis events to those with a significant adjusted p-value. We counted the number of kataegic events per sample and tested the Spearman correlation with the activities of the 17 signatures of CIN. Multiple testing correction was performed according to Benjamini and Hochberg.
Loss of heterozygosity (LOH). Loss of heterozygosity (LOH) segments were obtained from our copy number profiles. LOH was defined as a state where the b-allele (the allele with the minor copy number) had a copy number of less than 0.4. We calculated the proportion of the genome affected by LOH by using the hg19 chromosome sizes. For each signature, we tested the Spearman correlation between this proportion of LOH and signature activity. All p-values have been corrected for multiple testing using the Benjamini-Hochberg method.
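The LOH proportion described above can be illustrated with a short Python sketch. The segment representation (start, end, minor-allele copy number) and the function name are assumptions made for illustration; the genome length is passed in by the caller rather than hard-coded, and would be the sum of the hg19 chromosome sizes in the analysis described here.

```python
def loh_fraction(segments, genome_length, minor_cn_threshold=0.4):
    """Fraction of the genome covered by LOH segments.

    segments: iterable of (start, end, minor_copy_number) tuples, with
    coordinates in base pairs (hypothetical layout).
    A segment counts as LOH when its minor-allele copy number is below
    the threshold (0.4 in the text).
    """
    loh_bp = sum(
        end - start
        for start, end, minor_cn in segments
        if minor_cn < minor_cn_threshold
    )
    return loh_bp / genome_length
```

The resulting per-sample proportion is the quantity that is then correlated with each signature activity.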
SBS, DBS and ID signatures. Signature exposures to single base substitutions (SBS), doublet base substitutions (DBS) and small insertion and deletion (ID) signatures were created in Alexandrov et al. 2020 and were downloaded from Synapse.org. We used signature exposures for both Signature Profiler (SP) and Signature Analyzer (SA). Spearman correlations between SBS, DBS and ID signatures, and the CIN signature activities were performed for the 1900 PCAWG samples. P-values were corrected for multiple testing using the Benjamini-Hochberg method for each type of signature (SBS, DBS, ID) and software (SP, SA).
Potential correlation with age for CX1. CX1 correlated positively with the “clock-like” SBS1 signature, suggesting these errors might also be mediated via a natural ageing process. Therefore, we tested the correlation between CX1 and the age of patients by using a linear as well as a robust linear model correcting for cancer type (R package MASS [96], v7.3_51.5; R core package, v.3.6.1). Pan-cancer, we found no significant relationship (linear model: p-value for CX1 and age 0.81; robust linear model: p-value 0.88). This result is in line with previous publications which found very few CNAs genome-wide in normal tissue, suggesting negative selection. However, we performed the same analysis for each cancer type according to the method of Alexandrov et al. (2020) and found that for ovarian and lung squamous cell cancer, there was a positive correlation between age and number of estimated CNAs for CX1 (OV: F-statistic=5.9, q-value=0.048; LUSC: F-statistic=6.8, q-value=0.044). Given that ovarian cancer has one of the longest latencies (over 20 years), it might be that extreme latency periods lead to an accumulation of mitotic errors.
Whole-chromosome CNAs. The smallest canonical human chromosome in the hg19 reference genome is chromosome 21 with 48 Mb. However, SNP 6.0 arrays do not fully cover the whole range of chromosomes. The actual coverage of the smallest chromosomes is therefore reduced to 35.2 Mb for chromosome 22 and 37.4 Mb for chromosome 21. We counted a segment as a whole-chromosome CNA if it was larger than 95% of the chromosome length covered by SNP 6.0 arrays and had a copy number value greater or smaller than 2. For each signature, we tested the Spearman correlation between counts of whole-chromosome CNAs and signature activity. P-values were corrected for multiple testing using the Benjamini-Hochberg method.
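The counting rule for whole-chromosome CNAs can be sketched as follows. The segment layout and function name are hypothetical; the covered lengths in the usage example are the two values quoted above, and a full analysis would need one covered length per chromosome.

```python
def count_whole_chromosome_cnas(segments, covered_length,
                                min_fraction=0.95, neutral_cn=2):
    """Count segments that qualify as whole-chromosome CNAs.

    segments: iterable of (chromosome, start, end, copy_number) tuples.
    covered_length: dict mapping chromosome to the length (bp) covered
    by the array.
    A segment qualifies when it spans more than min_fraction of the
    covered chromosome length and its copy number differs from 2.
    """
    count = 0
    for chrom, start, end, copy_number in segments:
        spans_chromosome = (end - start) > min_fraction * covered_length[chrom]
        if spans_chromosome and copy_number != neutral_cn:
            count += 1
    return count


# Covered lengths quoted in the text for the two smallest chromosomes.
covered = {"21": 37_400_000, "22": 35_200_000}
example_segments = [
    ("22", 0, 34_000_000, 3),  # spans >95% of covered chr22, gained
    ("21", 0, 30_000_000, 1),  # too short to count
    ("21", 0, 37_000_000, 2),  # long enough but copy-neutral
]
n_events = count_whole_chromosome_cnas(example_segments, covered)
```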
Whole-genome duplication. To test whether the whole-genome duplication (WGD) status associated positively, negatively, or independently with the activity of a signature, we performed two tests for each signature: a proportion test comparing the proportion of WGD samples among samples with zero activity to the proportion of WGD samples across all 6,335 samples, and a Welch's t-test for samples with activities higher than zero. We positively associated a signature with WGD only if it showed both a significantly depleted proportion of WGD samples at zero activity and a significant positive shift in non-zero activities. If a signature had both an enriched proportion of WGD samples at zero and a significant negative shift in non-zero activities, then we negatively associated that signature with WGD. Otherwise, we classified the signature as independent. P-values were corrected for multiple testing using the Benjamini-Hochberg method.
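The three-way decision rule can be written down compactly. The sketch below takes the outcomes of the two tests as inputs and does not perform the tests themselves; the function name, the string labels and the significance level of 0.05 are assumptions for illustration.

```python
def classify_wgd_association(zero_q, zero_direction,
                             shift_q, shift_direction, alpha=0.05):
    """Combine the two per-signature tests into one WGD label.

    zero_q / shift_q: corrected p-values of the proportion test at zero
    activity and of the Welch's t-test on non-zero activities.
    zero_direction: "depleted" or "enriched" (WGD proportion at zero
    activity relative to the whole cohort).
    shift_direction: "positive" or "negative" (shift of non-zero
    activities in WGD samples).
    alpha: significance level (assumed value, not stated in the text).
    """
    both_significant = zero_q < alpha and shift_q < alpha
    if both_significant and zero_direction == "depleted" \
            and shift_direction == "positive":
        return "positive"
    if both_significant and zero_direction == "enriched" \
            and shift_direction == "negative":
        return "negative"
    return "independent"
```

Requiring both tests to agree means that a signature with only one significant result is treated as independent of WGD.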
Identification of signatures related to ploidy changes. There are generally two ways a tumour cell can change its ploidy status (where ploidy is measured as the average copy number state observed across the genome): 1) whole-genome duplication, which increases the ploidy status by a factor of two; or 2) an accumulation of copy number gains, which increases the overall ploidy status (and similarly an accumulation of losses, which reduces the ploidy status). Number 1) is represented by CX4 (
Telomere length and activity of telomere elongation machinery. Telomere length data were obtained from Barthel et al. 2017. The authors estimated telomere length with TelSeq [89] for TCGA samples. We removed estimates for samples based on shallow whole-genome sequencing (sWGS) and whole-exome sequencing (WXS) because we could not evaluate how reliable telomere length estimates derived from off-target reads were. A supplementary figure in Barthel et al. shows significant differences between sequencing technologies. We therefore focused on estimates derived from whole-genome sequenced (WGS) samples. The supplied telomere length ratios were converted to lengths in base pairs. Control telomere lengths were subtracted from tumour telomere lengths, so that positive values indicate longer telomeres in tumours. For each signature, a robust linear regression model (R package MASS, Venables & Ripley 2002, v7.3_51.5) was fitted to the telomere lengths and signature activities. An F-test of whether the slope of the robust linear model differed from zero was performed with sfsmisc (Maechler, 2020, R package, v1.1-7). The resulting 17 p-values were corrected for the false discovery rate according to Benjamini and Hochberg (
Telomerase signature scores (TSS), a gene expression score to assess the activity of the telomere elongation machinery, were obtained from Barthel et al. For each signature, we tested the Spearman correlation between TSS and signature activity. P-values were corrected for multiple testing using the Benjamini-Hochberg method (
Tandem duplications. Number of tandem duplication events, tandem duplication score and Tandem Duplicator Class categorisation were sourced from Menghi et al. For each signature, we tested the Spearman correlation between the number of tandem duplication events, the tandem duplication score and signature activity. P-values were corrected for multiple testing using the Benjamini-Hochberg method (
Cell cycle score (CCS). Cell cycle scores (CCS), an additive gene expression score reflecting the speed at which cancer cells progress through the cell cycle, were obtained for 5466 samples from Lundberg et al. We used the same 3 categories as the authors: low, medium (intermediary), high. The signature activities of the high CCS group were compared to the activities of the low and medium groups using Welch's t-test. P-values were corrected for multiple testing using the Benjamini-Hochberg method (
Centrosome amplification (CA20). The CA20 score, an additive gene expression score of 20 genes experimentally proven to be involved in centrosome amplification, was obtained for 5437 samples from Almeida et al. For each signature, we tested the Spearman correlation between the CA20 score and signature activity. P-values were corrected for multiple testing using the Benjamini-Hochberg method (
Seven HRD metrics. We used the following seven metrics known to correlate with homologous recombination deficiency (HRD), sourced from Knijnenburg et al., pre-processed by Jordan Griffin (Gerke lab) and published in this GitHub repository (https://github.com/GerkeLab/TCGAhrd): Myriad myChoice score (also called HRD score, Telli et al., 2016), TP53 inactivation score (Knijnenburg et al., 2018; more information in this GitHub repository: https://github.com/greenelab/pancancer), telomeric imbalances score (Birkbak et al., 2012), large-scale state transition score (Popova et al., 2012), LOH score (Abkevich et al., 2012), DNA repair proficiency score (RPS, Pitroda et al., 2014), protein expression score for 23 DNA-damage repair genes (Knijnenburg et al., 2018). For each signature, we tested the Spearman correlation between the seven metrics and signature activity. P-values for each metric were corrected for multiple testing using the Benjamini-Hochberg method (
Microhomologies of PCAWG structural variants. From the structural variant (SV) data of the PCAWG project (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, 2020), we extracted the length of microhomologies. Two columns were used: homlen (length of homology) and homseq (sequence of homology). If the homlen column had an NA entry and the homseq column showed “ ” [sic], we interpreted this as a length of zero. Structural variants with NAs in both columns were ignored. SVs with 0-1 bp microhomologies were assigned to the non-homologous end joining (NHEJ) pathway, SVs with 2-20 bp to the polymerase theta-mediated end joining (TMEJ) pathway, and SVs with longer stretches of microhomology to the single-strand annealing (SSA) pathway (Stok et al., 2021, Schimmel et al., 2019). We performed two analyses: One with a focus on samples with a dominant impaired HR signature activity (
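The assignment of structural variants to repair pathways by microhomology length can be sketched as follows. The function names are hypothetical, NA values are represented by `None`, and the handling of an NA length combined with a non-blank sequence (using the length of the sequence) is an assumption, since that case is not described above.

```python
def homology_length(homlen, homseq):
    """Return the microhomology length in bp, or None to ignore the SV."""
    if homlen is not None:
        return int(homlen)
    if homseq is None:
        return None  # NA in both columns: the variant is ignored
    # NA length with a blank sequence is read as zero; for a non-blank
    # sequence the sequence length is used (assumption).
    return len(homseq.strip())


def repair_pathway(length):
    """Map a microhomology length to the assumed repair pathway."""
    if length <= 1:
        return "NHEJ"  # 0-1 bp
    if length <= 20:
        return "TMEJ"  # 2-20 bp
    return "SSA"       # longer stretches


def classify_sv(homlen, homseq):
    """Classify one structural variant, or return None if it is ignored."""
    length = homology_length(homlen, homseq)
    return None if length is None else repair_pathway(length)
```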
Ovarian copy number signatures. We compared the 17 CIN signatures to the previously identified 7 ovarian copy number signatures (Macintyre et al., 2018) by computing the cosine similarities of the signature definitions and signature activities called on the 545 TCGA-OV samples (
Gene expression of DNA repair genes. We explored the activation of DNA repair pathways across 5,186 TCGA samples, which had extracted CIN signature activities and available gene expression data. We initially downloaded the list of genes involved in the different DNA repair pathways from the Reactome database (Fabregat et al., 2018, date of access: Nov. 5, 2021), and selected those genes differentially expressed between IHR and non-IHR samples by using a two-sided t-test (P-value cutoff of 0.01). IHR samples were those having a high activity of CX2, CX3 or CX5 signatures (threshold of scaled and centered signature activity >1.25). Then, Spearman correlations were computed between the activity of IHR signatures and the expression of DNA repair genes (Z-scores). All reported P-values have been corrected for multiple testing using the Benjamini-Hochberg method.
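The definition of IHR samples used above can be illustrated with a short sketch. It assumes that activities are scaled and centred per signature across samples using the sample standard deviation, and that a sample is labelled IHR when any one of CX2, CX3 or CX5 exceeds the threshold; the function names and data layout are hypothetical.

```python
from statistics import mean, stdev


def zscores(values):
    """Centre and scale a list of activities (sample standard deviation)."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def ihr_flags(activities, signatures=("CX2", "CX3", "CX5"), threshold=1.25):
    """Flag samples with high activity of at least one IHR signature.

    activities: dict mapping signature name to a list of activities,
    with all lists in the same sample order.
    """
    n_samples = len(activities[signatures[0]])
    flags = [False] * n_samples
    for signature in signatures:
        for i, z in enumerate(zscores(activities[signature])):
            if z > threshold:
                flags[i] = True
    return flags


example = {
    "CX2": [0.0, 0.0, 0.0, 0.0, 1.0],
    "CX3": [0.1, 0.2, 0.1, 0.2, 0.1],
    "CX5": [1.0, 0.0, 0.0, 0.0, 0.0],
}
flags = ihr_flags(example)
```

The resulting flags define the IHR and non-IHR groups that are compared in the two-sided t-test.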
Testing for the influence of ethnicity. Black patients have been shown to experience higher rates of homologous recombination deficiency (HRD). We therefore explored whether the signatures differed in activity across different ethnic backgrounds as annotated by the TCGA. We found that there were substantial differences in signature activities with respect to the patients' ethnicities (
Testing the independence between CX2, CX3 and CX5 activity in BRCA1/2 mutants. We present three signatures of impaired homologous recombination (IHR): CX2, CX3 and CX5. Their putative aetiologies were all derived in the same way, accumulating evidence from the signature pattern, gene associations and additional data. In particular, CX2 and CX5, like CX3, are not just associated with mutations in HR genes but also have copy number patterns indicative of defects in HR, which argues that they are specific markers of this process. To address the question of whether CX2 and CX5 are independent forms of IHR, or whether their associations are driven by co-occurrence with CX3, we performed an association analysis between BRCA mutant cases and the activity of the three signatures (CX2, CX3, CX5), using a multivariate regression model which explicitly corrects for the effects of association with the other two signatures, cancer types and BRCA mutant cases. The results of this analysis are provided below, where the association between BRCA1/2 mutation status and signature activity for CX2, CX3 and CX5 was tested while correcting in each analysis for the other two signatures and cancer type. The WT BRCA1/2 was used as a benchmark against which the results were compared.
The results of this analysis demonstrate that both CX2 and CX5 have significantly higher activities in BRCA mutant patients independent of CX3 activity levels and cancer type. This provides evidence that CX2 and CX5 are likely directly associated with BRCA mutant status and are therefore potentially a result of impaired HR.
Testing for the influence of IHR signatures on triple-negative breast cancers. Triple-negative breast cancers (TNBC) are known to experience higher rates of homologous recombination deficiency (HRD). It is therefore of interest how the three impaired HR (IHR) signatures CX2, CX3 and CX5 are distributed across TNBC and non-TNBC samples as well as ovarian cancer samples, especially for BRCA1/2 and RAD51C wild type samples. We obtained TNBC status from Lehmann et al. and, for each of the three signatures, tested differences in activities between breast (TNBC and non-TNBC) samples, ovarian (BRCA1/2 mutant and WT) samples and a control group consisting of other WT BRCA1/2 and RAD51C methylation negative samples in the TCGA cohort (
Implicating mitotic catastrophe in impaired HR. Samples with impaired HR signatures CX3 and CX5 experience additional replication stress and impaired DNA damage sensing. The more perturbed the cell cycle checkpoints and DNA repair pathways are, the more easily copy number aberrations and structural variants manifest. However, we hypothesised that samples with an abundance of CNAs over SVs might accumulate these during mitosis, which is influenced by how quickly the cell cycle progresses (as indicated by replication stress and cell cycle activity) and by how strongly double-strand break (DSB) repair mechanisms are impaired. Using the 1,898 PCAWG samples, we fitted a robust linear model to the number of CNAs and SVs. All samples with weights less than 1 did not fit the model well and were deemed samples with inferred mitotic catastrophe (
Therefore, we conclude that most of the excess CNAs produced by CX3 may be caused during the cell cycle potentially by higher replication stress but manifest during mitosis, causing a mitotic catastrophe.
Estimation of CNAs produced by a signature. We fitted a linear model on the raw signature values to the number of CNAs in each sample. We set the intercept to zero, to only allow positive coefficients. Interaction terms between the signatures were avoided as they resulted in negative coefficients, indicating that the model overfitted. From the model we extracted the coefficients and multiplied with the raw signature values for each sample to get the estimates of CNAs produced by a signature (
To determine putative causes underlying each of the 17 signatures, the inventors developed a data integration framework and assigned a confidence score to each signature aetiology based on the quality and extent of supporting data (
Mitotic signatures: CX1, CX6 and CX14 all encoded patterns related to whole arm or whole chromosome changes and significantly correlated with direct counts of whole-chromosome changes (
Signatures of impaired HR: CX2, CX3 and CX5 all exhibited patterns previously associated with impaired HR: CX2 showed a pattern of short, clustered changes associated with tandem-duplications; CX5 showed medium clustered, chained events associated with tandem duplication; and CX3 showed large, single copy changes with associated loss of heterozygosity. CX2, CX3 and CX5 were all observed at significantly higher levels in tumours with somatic BRCA1 mutation independently of each other, despite representing significantly different patterns of copy number changes (
Whole-genome duplication signature: CX4 encompassed a unique pattern of copy number change with neighbouring segments separated by 2 copy changes (
Signature of impaired non-homologous end joining: CX10, a signature of clustered copy number changes, had significantly higher activity in tumours with inactivating mutations in FBXW7, and correlated with FBXW7 mutant mediated tandem duplications class 1/2, suggesting a role for impaired non-homologous end joining (NHEJ) (Zhang et al., 2016 and 2019). A significant increase in the proportion of breakpoints with microhomologies in samples with this signature was indicative of a lack of blunt-end joining, a hallmark of NHEJ (
Correlation of the signature with FBXW7 mutant mediated tandem duplication class 1/2 [54] and higher activity in tumours with PPP2R1A amplification suggest that these errors may occur in the context of replication fork collapse (Perl et al., 2019).
Signatures of amplification: CX8, CX9 and CX13 encoded patterns of low-, medium- and high-level amplifications, respectively. All three signatures were associated with increasing cell cycle score (Lundberg et al., 2020) (
Clustered copy number signatures: CX11, CX16 and CX17 represented patterns indicative of clustered copy number changes. CX11 was associated with cell cycle score (
Cross-signature observations: Many covariates demonstrated associations with multiple signatures. Chromothripsis was linked with seven different signatures (
In this example, the inventors investigated whether the activity of the copy number signatures identified in Example 1 can be predictive of drug response, and can identify drug targets that could be exploited to treat patients that show particular copy number signature activities.
Calculating signature activities for cell lines. Cell line SNP 6.0 array CEL files were downloaded from the CCLE project (Ghandi et al., 2019) and were processed using Affymetrix Power Tools (APT; v2.11.2; standard options). Combined cell line log R and BAF data was split into sample-specific log R and BAF files. No germline samples were available for the identification of heterozygous SNPs in the cell line data, thus we ran the pipeline for SNP 6.0 without matched normal described above (see section “SNP 6.0 without matched normal” in the Methods of Example 1). In this pipeline, while the rescued SNPs allow for the identification of additional segments from their Log R, by construction their average BAF is never equal to 0 or 1 even in the presence of LOH. This biases purity estimates towards lower values and renders the copy-number fitting inaccurate. To circumvent this problem, we first fitted log R and BAF as per the ASCAT methodology (Van Loo et al. 2010), then searched the space of combinations of purity and ploidy around the same ploidy solutions, but between 95% and 100% purity and only solving for the total copy number using the Log R-only equation for each segment
where nT is the total number of copies in the tumour, ρ is the purity and ψ is the average tissue ploidy. Akin to ASCAT, we picked the purity and ploidy combination that minimised the distance between nT and integer values. Fitting only the Log R and not the BAF in cell lines led to more accurate values of the purity and thus of the copy-number profiles. As fitting of absolute copy number to non-matched cell line data is inherently more challenging than other instances, all cell lines where ASCAT identified a plausible solution were visually inspected to determine fit suitability, and cell lines whose absolute copy number profiles were incorrect or poorly fitted were excluded. We extracted signature activities for the 17 compendium signatures in 594 cancer cell lines. Extraction of the feature values from the copy number profiles derived from cell lines was performed as described above in section “Feature distributions and mixture modelling” in the Methods of Example 1. For identifying the signature activities, we used the linear combination decomposition function from YAPSA (R package, v1.12.0) and the 17 signatures of CIN.
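The Log R-only refinement described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the example: it assumes the standard ASCAT Log R model, r = γ·log2((2(1 − ρ) + ρ·nT)/ψ), and it assumes a length-weighted squared distance to the nearest integer as the goodness-of-fit measure. The function names, the `gamma` default and the candidate ploidy list are hypothetical.

```python
import math


def expected_log_r(n_total, purity, psi, gamma=1.0):
    """Forward model (assumed ASCAT form): Log R of a segment with n_total copies."""
    return gamma * math.log2((2 * (1 - purity) + purity * n_total) / psi)


def total_copies(log_r, purity, psi, gamma=1.0):
    """Invert the forward model to get the (non-integer) total copy number."""
    return (psi * 2 ** (log_r / gamma) - 2 * (1 - purity)) / purity


def fit_distance(seg_log_r, seg_len, purity, psi, gamma=1.0):
    """Length-weighted mean squared distance of nT from the nearest integer."""
    total = sum(seg_len)
    dist = 0.0
    for r, w in zip(seg_log_r, seg_len):
        n = total_copies(r, purity, psi, gamma)
        dist += w * (n - round(n)) ** 2
    return dist / total


def refine_fit(seg_log_r, seg_len, psi_candidates, gamma=1.0):
    """Search purity 0.95..1.00 (step 0.01) and the supplied ploidy candidates.

    Returns (distance, purity, psi) for the best-fitting combination.
    """
    best = None
    for step in range(95, 101):
        purity = step / 100
        for psi in psi_candidates:
            dist = fit_distance(seg_log_r, seg_len, purity, psi, gamma)
            if best is None or dist < best[0]:
                best = (dist, purity, psi)
    return best
```

In use, `psi_candidates` would be a small set of values around the ploidy solution returned by the initial Log R and BAF fit, so that only the purity is effectively re-estimated from the Log R.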
CRISPR, RNAi and drug response data integration. Using the DepMap portal, we obtained essentiality scores of 17,645 genes from CRISPR knock-out screens, essentiality scores of 17,309 genes from RNAi screens, and area under dose response curve (AUC) values for 1,394 drugs across 481 human cancer cell lines from the PRISM screen (HTS002 screen) (see Table 1). After merging these data with cell line copy number profiles, we obtained a total of 382 cell lines with available copy number profiling and gene essentiality data from CRISPR and RNAi screens. Of these, 297 also had drug response data from the PRISM screen.
Correlation of signatures with target gene KO, KD and drug inhibition. Kendall's tau correlation was computed between CIN signature activities and gene essentiality scores from the genetic perturbation screens. Correlations were filtered for tau<−0.07 and q-value<0.05 after multiple testing correction using the Benjamini-Hochberg method. This yielded 7,225 targets significantly associated with a signature from the CRISPR screen and 7,091 targets from the RNAi screen. These targets were then used to test correlation of the signature activity with matched drug inhibition of the target measured via AUC. After filtering (Kendall's tau<−0.1 and Benjamini-Hochberg corrected q-value<0.01), a total of 44 drug responses were correlated with signature activity via 40 targets. Conversely, those targets significantly linked to a CIN signature in both CRISPR and RNAi perturbation screens but without matched drug response represented putative novel targets for drug discovery. From an initial list of 319 targets overlapping in both the RNAi- and CRISPR-based correlations, 297 had no known targeted therapies. After manual inspection of target tractability (Mitsopoulos et al., 2021, Therneau et al., 2021), 104 of these targets were considered druggable according to their structure or by ligand-based approaches and, after manual literature curation, 49 had a clear prior implication in CIN.
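The correlation and filtering step can be illustrated with a small, dependency-free sketch. It is an assumption-laden stand-in for whatever statistical software was actually used: Kendall's tau is computed as tau-a with a normal-approximation p-value and no tie correction, and the gene names and function names are hypothetical. Only the thresholds (tau<−0.07, q<0.05) come from the text.

```python
import math


def kendall_tau(x, y):
    """Kendall's tau-a and a two-sided normal-approximation p-value (ties ignored)."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    tau = s / (n * (n - 1) / 2)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    z = s / math.sqrt(var_s)
    return tau, math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


def significant_dependencies(activity, essentiality, tau_max=-0.07, q_max=0.05):
    """Genes whose essentiality score is negatively correlated with signature activity.

    activity: signature activity per cell line.
    essentiality: dict mapping gene name to scores in the same cell-line order.
    """
    genes = list(essentiality)
    stats = [kendall_tau(activity, essentiality[g]) for g in genes]
    qvals = benjamini_hochberg([p for _, p in stats])
    return {g: (tau, q) for g, (tau, _), q in zip(genes, stats, qvals)
            if tau < tau_max and q < q_max}
```

The same function could be reused for the drug AUC step by passing AUC values in place of essentiality scores and tightening the thresholds to `tau_max=-0.1` and `q_max=0.01`.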
The putative signature aetiologies implicated canonical cancer pathways as some of the major drivers of CIN. Many of these pathways have been the focus of targeted therapy development. Therefore, given that the signatures identified in these examples can be readily measured in patient tumours, the inventors explored their utility for therapy response prediction and drug target identification. They integrated data from 297 cancer cell lines, including copy number profiling, genome-wide CRISPR knock-out screens, genome-wide RNAi screens (McFarland et al., 2018) and the PRISM drug repurposing screen (Corsello et al., 2019). They assessed correlations between signature activities, gene essentiality, and sensitivity to drug perturbation of the gene (Methods,
They identified 40 genes where copy number signature activity was significantly correlated with both genetic and drug perturbation of the target (
Copy number signature correlations with gene essentiality scores, from both CRISPR and RNAi perturbation screens, identified 104 target genes with druggable structures (Mitsopoulos, 2021) that currently have no targeted therapies in the clinic. These represent putative synthetic lethal drug targets, 49 of which had evidence of being implicated in CIN-related mechanisms (
In this example, the inventors investigated in more detail the copy number signatures identified in Example 2 as likely to be indicative of impaired homologous recombination, and whether these could be predictive of response to platinum-based therapy.
Preparation of survival data. Clinical metadata for the TCGA cohort were downloaded from the TCGA website: https://portal.gdc.cancer.gov/. Clinical metadata for the ICGC breast cancer cohort were taken from the original publication (Nik-Zainal et al., 2016). Data for the PCAWG OV-AU and ESAD-UK cohorts were obtained from Synapse (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, 2020). Age was converted into years and split into 4 groups: [0, 53), [53, 62), [62, 76) and 76 or older. Stages were simplified by omitting any additional stage identifier after the initial Roman numeral (e.g. Stage IIA1 became Stage II). Stage IS was treated as NA. We added some missing clinical stages supplied by Liu et al. 2018.
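As an illustration of the covariate preparation described above, the following sketch bins age and simplifies stage labels. The function names are hypothetical, and the days-to-years conversion factor of 365.25 is an assumption, since the text does not state how ages were converted.

```python
import re

AGE_BREAKS = [53, 62, 76]
AGE_LABELS = ["[0, 53)", "[53, 62)", "[62, 76)", "76+"]


def age_days_to_years(age_days):
    """Convert an age in days to years (365.25 days per year is an assumption)."""
    return age_days / 365.25


def age_group(age_years):
    """Assign one of the four age groups; None is propagated as missing."""
    if age_years is None:
        return None
    for upper, label in zip(AGE_BREAKS, AGE_LABELS):
        if age_years < upper:
            return label
    return AGE_LABELS[-1]


def simplify_stage(stage):
    """Keep only the leading Roman numeral; 'Stage IS' and unparseable values become None."""
    if stage is None or stage.strip() == "Stage IS":
        return None
    match = re.match(r"Stage\s+(IV|III|II|I)", stage.strip())
    return f"Stage {match.group(1)}" if match else None
```

For example, `simplify_stage("Stage IIA1")` gives `"Stage II"`, while `simplify_stage("Stage IS")` gives `None`, matching the rules stated in the text.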
Survival analysis. In order to study the relationship between the signature activities, clinical classifiers and survival of patients, we deployed a univariate method (Kaplan-Meier estimates) and a multivariate method (Cox proportional hazards model). Kaplan-Meier estimates (function survfit) and Cox proportional hazards models (function coxph) were computed using survival (Therneau, 2021; R package, v3.1-11) and survminer (‘Drawing Survival Curves using ggplot2’ [R package survminer version 0.4.9], 2021; R package, v0.4.8). For the Kaplan-Meier estimates a logrank test was performed. For the Cox proportional hazards models we included the covariates stage and age category, except for HRDetect classification on the TCGA-OV cohort where we only corrected for age and not stage since there were not enough samples in each stage for the model to converge (n=40).
Clinical classifier based on CX2/CX3 signature activity. From our work on the signature aetiologies, we found that CX2 has no significant relation to predicting platinum-sensitivity in TCGA ovarian cancer patients but CX3 does (
Comparison with HRDetect and Myriad myChoice classifier. HRDetect values were taken from Davies et al. 2017 and Degasperi et al. 2020. As in the original publication, the threshold of 0.7 was used to identify positive samples (Davies et al., 2017). The Myriad myChoice classifier is based on the HRD score (Telli et al., 2016). Positive samples were samples with scores above or equal to 42. HRD scores were taken from Knijnenburg et al. 2018 and complemented by data from Marquard et al. 2015 for samples with NA values.
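The two comparator classifications reduce to simple threshold rules, sketched below. The function names are hypothetical. The text does not say whether an HRDetect value of exactly 0.7 counts as positive, so the strict inequality used here is an assumption; the inclusive cut-off of 42 for the HRD score follows the text.

```python
def hrdetect_positive(score, threshold=0.7):
    """HRDetect call; a strict '>' at the threshold is an assumption."""
    if score is None:
        return None
    return score > threshold


def mychoice_positive(hrd_score, fallback_score=None, threshold=42):
    """HRD-score call (positive if >= 42).

    fallback_score stands for a value from a second source, used only when
    the primary score is missing (NA).
    """
    score = hrd_score if hrd_score is not None else fallback_score
    if score is None:
        return None
    return score >= threshold
```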
Clinical classifier based on three impaired HR signatures. In order to test our clinical classifier based on CX2 and CX3 alone, we deployed three additional clinical classifiers taking into account all three impaired HR signatures (CX2, CX3, CX5). The three classifiers produced hazard ratios for the resistant or early death group of 0.62, 0.57 and 0.67, which were inferior to the 0.54 hazard ratio produced by the CX2/CX3 classifier.
Logistic regression on survival outcome. First, we extracted the germline BRCA1 samples from the ovarian TCGA cohort and scaled the activities for the three impaired HR signatures. Then we trained a logistic regression using the function glm (R core package, v.3.6.1) on the scaled signatures to predict the outcome of the survival data (death). Then we scaled the signature activities of all TCGA ovarian cancer samples according to the scale and variance from the germline BRCA1 samples and tested the logistic regression classifier by producing a Kaplan-Meier estimate and a Cox proportional hazard model (
Support vector machine (SVM) on hierarchical clustering. The third classifier was an SVM predicting the class membership of two groups produced by hierarchical clustering on the scaled signature activities of the germline BRCA1 samples of the TCGA ovarian cancer cohort. Ten-fold cross-validation was used as a regularisation method. The classifier was then applied to the complete TCGA-OV cohort, which was again scaled according to the activities of the germline BRCA1 samples (
The aetiologies of the 3 impaired HR signatures described in Example 2 indicated a model of increasing CIN complexity (
Disruption of both HR and NER has been shown to confer sensitivity to platinum-based chemotherapy (Konstantinopoulos et al., 2015, Martin et al., 2008). Given that only CX3 is associated with disruption of NER, we hypothesised that our IHR signatures may demonstrate differing abilities to predict platinum sensitivity. As ovarian cancer patients are routinely treated with platinum-based chemotherapy, we were able to test the ability of all three signatures to predict overall survival, and hence platinum sensitivity, using a Cox proportional hazards model (
Given that the IHR signatures were able to dissect platinum response, we further hypothesised that they could be used in combination to provide better predictors of platinum sensitivity. Given that CX2 was not predictive, we used it as a baseline for measuring IHR-related genomic changes, and required CX3 activity to exceed it, resulting in a simple classification rule: “if CX3 activity is greater than CX2 activity, then predict sensitivity” (
In this example, the inventors further validated the findings described in Example 4 in relation to prediction of sensitivity to platinum-based chemotherapy, in an independent cohort of ovarian cancer patients.
Using CIN signatures quantified across an independent cohort of 40 high-grade serous ovarian cancer patients treated with platinum-based chemotherapy, the inventors validated the ability of CX2, CX3 and CX5 to predict sensitivity to platinum-based chemotherapy treatment. They trained Cox proportional hazards models to predict the progression-free survival interval after first-line treatment for all three signatures, and for the CX3/CX2 classifier described in Example 4. As described in Example 4, it was found that CX2 was not predictive of sensitivity, CX5 predicted resistance and CX3 predicted sensitivity (
Ethical approval and clinical sample collection. Clinical data and samples for the patients were collected as part of the prospective Cambridge Translational Cancer Research Ovarian Study 04 (CTCROV04) approved by the Institutional Ethics Committee (REC08/H0306/61). Patients provided written, informed consent for participation in this study and for the use of their donated tissue for the laboratory studies carried out in this work.
Sample processing for tissues. FFPE tissue blocks were cut as 8 μm sections and tumour-enriched regions were recovered by macrodissection based on regions marked on an adjacent haematoxylin-and-eosin-stained section by the study pathologist. DNA was extracted from 3-10 sections using the QIAamp DNA Micro kit (Qiagen) with the following modification to the original protocol: an additional incubation step with Buffer ATL at 95° C. for 15 minutes was introduced before adding proteinase K. The paraffin was removed using a xylene/ethanol method.
DNA sequencing. Whole-genome sequence libraries were prepared from 75 ng DNA using SMARTer Thruplex DNA-Seq (Takara) protocol. DNA from each sample was sheared on Covaris LE220 (Covaris): duty cycle-30%, intensity-5.0, bursts per sec-50, duration-120 sec, peak incident power-180, temperature 20° C., water level-4. All samples underwent 5 PCR cycles. Library quality and quantity were assessed with D5000 on 4200 Tapestation according to the supplier's recommendations. Libraries were then pooled together and sequenced using PE-50 mode on NovaSeq SP aiming for 10 million reads per sample.
Absolute copy-number fitting. Reads were aligned against the human genome assembly GRCh37 using BWA-MEM (Li 2013, arXiv: 1303.3997). Duplicates were marked using Picard (github.com/broadinstitute/picard) and relative copy number was computed using QDNAseq (Scheinin et al. Genome Res. 2014 December; 24 (12): 2022-32.) with a bin size of 30 kb. Absolute tumour copy number (the number of chromosome copies of each DNA segment in the tumour cells in a sample) was computed for every bin across each sample. Each segmented relative copy number bin estimate j was transformed from relative copy number (rCN) to absolute copy number (aCN) as follows:
where purity is the fraction of tumour cells in the sample, and d is a constant proportional to the read depth, which is computed from the mean relative copy number of the sample, r, and the average absolute copy number of the tumour cells in the sample, ploidy.
Both purity and ploidy were unobserved in the data and were estimated using a grid search over purities in the range [0.05, 1] in increments of 0.01 and ploidies in the range [1.8, 8] in increments of 0.1, minimising the following mean squared error:
Purity/ploidy values were excluded from consideration if they resulted in a fit which showed greater than 10 megabases of the genome with homozygous loss. Fits that did not show at least one genomic segment at every integer copy number state from 1 to ploidy were also removed.
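The grid search described above can be sketched as follows. This is a hedged illustration built on several assumptions that the text does not confirm: that relative and absolute copy number are linked by the usual two-component mixture, rCN = d·(purity·aCN + 2·(1 − purity)), with d = r/(purity·ploidy + 2·(1 − purity)); that the error is the mean squared distance of each bin's aCN from the nearest integer; that homozygous loss means a rounded copy number of 0 or below; and that "from 1 to ploidy" is evaluated up to the integer part of the ploidy. All function names are hypothetical.

```python
def absolute_cn(rcn, purity, ploidy, mean_rcn):
    """Transform per-bin relative copy number to absolute copy number (assumed model)."""
    d = mean_rcn / (purity * ploidy + 2 * (1 - purity))
    return [(r / d - 2 * (1 - purity)) / purity for r in rcn]


def grid_search(rcn, bin_mb=0.03, max_homloss_mb=10.0):
    """Return (mse, purity, ploidy) for the best admissible fit, or None.

    rcn: segmented relative copy number, one value per 30 kb bin.
    """
    mean_rcn = sum(rcn) / len(rcn)
    best = None
    for p in range(5, 101):                 # purity 0.05 .. 1.00, step 0.01
        purity = p / 100
        for q in range(18, 81):             # ploidy 1.8 .. 8.0, step 0.1
            ploidy = q / 10
            acn = absolute_cn(rcn, purity, ploidy, mean_rcn)
            states = [round(a) for a in acn]
            # Exclude fits with more than 10 Mb of homozygous loss.
            if sum(1 for s in states if s <= 0) * bin_mb > max_homloss_mb:
                continue
            # Require at least one bin at every integer state from 1 to ploidy.
            present = set(states)
            if any(k not in present for k in range(1, int(ploidy) + 1)):
                continue
            mse = sum((a - s) ** 2 for a, s in zip(acn, states)) / len(acn)
            if best is None or mse < best[0]:
                best = (mse, purity, ploidy)
    return best
```

The second filter matters in practice: without it, fits that map the true copy number states onto every second or third integer can achieve an equally small error at a different purity and a higher ploidy.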
Signature calling. Pre-identified copy number features such as copy number segment size, magnitude of copy number change, and frequency of copy number changes were extracted from the copy number profiles. These features are already compositionally associated with the pancancer signatures as described above. As a result, by comparing the feature breakdown for an individual copy number profile to a matrix of signature-feature associations, it is possible to generate the signature breakdown for a sample. This signature breakdown gives exposures to each of the 17 pancancer signatures for the sample. In order to improve the reliability of the signature exposures, each of the 17 signatures was used with a threshold, where values below the threshold are set to 0 to avoid false positives in signature exposure. These thresholds were defined in advance of the present cohort-specific analysis, using Markov-Chain Monte Carlo methods to identify the variability in each signature as a result of random noise in the copy number features. In this analysis, all downstream survival models used thresholded signature exposures. Finally, to compare exposures between signatures, the raw exposures for each signature were centred and scaled according to the distribution of that signature over all samples in the presently analysed cohort. By scaling the samples in this way, it is possible to compare exposures in the same sample by their relative magnitude in the cohort.
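The decomposition and thresholding steps can be sketched in a few lines. This is an illustrative stand-in rather than the method used in the examples: non-negative exposures are estimated here with simple multiplicative updates, which is a technique swapped in for brevity and is not claimed to be the solver used by YAPSA. The matrix layout, function names and threshold values are hypothetical.

```python
def decompose(features, signatures, n_iter=500, eps=1e-12):
    """Estimate non-negative exposures h such that signatures @ h approximates features.

    features: list of F feature values for one sample.
    signatures: F x K matrix as a list of rows (feature by signature weights).
    Uses multiplicative updates h <- h * (W^T v) / (W^T W h).
    """
    n_feat, n_sig = len(signatures), len(signatures[0])
    h = [1.0] * n_sig
    wt_v = [sum(signatures[f][k] * features[f] for f in range(n_feat))
            for k in range(n_sig)]
    for _ in range(n_iter):
        recon = [sum(signatures[f][k] * h[k] for k in range(n_sig))
                 for f in range(n_feat)]
        wt_wh = [sum(signatures[f][k] * recon[f] for f in range(n_feat))
                 for k in range(n_sig)]
        h = [h[k] * wt_v[k] / (wt_wh[k] + eps) for k in range(n_sig)]
    return h


def threshold_exposures(exposures, thresholds):
    """Set exposures below their signature-specific threshold to 0."""
    return [e if e >= t else 0.0 for e, t in zip(exposures, thresholds)]
```

In a real analysis `signatures` would be the feature-by-signature matrix for the 17 signatures and `thresholds` the pre-computed noise thresholds; the values used in any quick check of this sketch are toy numbers.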
Calculating progression free survival (PFS) for platinum treatment. PFS was calculated following the CA125 definitions of progression in first-line therapy agreed by the Gynecologic Cancer InterGroup (GCIG) in 2005 (gcigtrials.org/system/files/CA%20125%20Definitions%20Agreed%20to%20by%20GCIG%20-%20November%202005.pdf). Patients were sorted into three categories based on their CA125 levels over the course of first-line platinum treatment: Category A patients begin with ‘abnormal’ readings pre-treatment that then fall into the normal range during treatment; Category B patients begin with abnormal readings that never normalise; and Category C patients begin in the normal range. In these definitions, the normal range was chosen to be 0-35 CA125, and the abnormal range 35+, following NICE guidance (nice.org.uk/guidance/cg122/resources/ovarian-cancer-recognition-and-initial-management-pdf-35109446543557). These categories then defined the patient-specific CA125 progression threshold. In categories A and C, progression occurs on the first CA125 reading that is at least twice the upper limit of the normal range (that is, at least 70), provided that a consecutive reading of at least twice the normal limit is recorded at least one week after the initial reading. In category B, the progression criteria are the same, except that the progression limit is increased to at least twice the lowest CA125 reading in the treatment line. In many patients, CA125 levels will continue to rise during the early platinum treatments, due to the lag between treatment administration and effect. To avoid false early progressions, CA125 readings were only eligible for progression if they occurred after the last treatment in the line. There remain some cases where CA125 readings fail to reach the threshold for progression before the next line of therapy begins. In these cases the beginning of the next line is chosen as the date of progression.
Using the progression date calculated for each patient, the progression-free survival was defined as the number of days from the date of diagnosis to the date of progression.
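The progression rules above can be expressed as a short sketch. Several details are interpretations rather than statements from the text: a reading of exactly 35 is treated as normal, the confirming reading is taken to be any later eligible reading at least 7 days after the candidate reading, and the category B nadir is taken over all readings supplied for the line. Function names are hypothetical.

```python
from datetime import date, timedelta

ULN = 35.0  # upper limit of the normal CA125 range used in the text


def category(pre_treatment, during_treatment):
    """Assign category A, B or C from the pre-treatment and on-treatment readings."""
    if pre_treatment <= ULN:
        return "C"
    return "A" if any(v <= ULN for v in during_treatment) else "B"


def progression_threshold(cat, line_values):
    """Twice the normal limit for A and C; twice the lowest reading in the line for B."""
    if cat == "B":
        return 2 * min(line_values)
    return 2 * ULN


def progression_date(readings, cat, last_treatment, next_line_start=None):
    """First confirmed reading at or above the threshold after the last treatment.

    readings: list of (date, value) tuples sorted by date for the treatment line.
    Falls back to next_line_start when no confirmed progression is found.
    """
    limit = progression_threshold(cat, [v for _, v in readings])
    eligible = [(d, v) for d, v in readings if d > last_treatment]
    for i, (d, v) in enumerate(eligible):
        if v < limit:
            continue
        confirmed = any(v2 >= limit and d2 - d >= timedelta(days=7)
                        for d2, v2 in eligible[i + 1:])
        if confirmed:
            return d
    return next_line_start


def pfs_days(diagnosis, progression):
    """Progression-free survival in days from diagnosis to progression."""
    return None if progression is None else (progression - diagnosis).days
```

A patient whose readings never meet the threshold before the next line of therapy would therefore receive `next_line_start` as the progression date, as described in the text.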
Survival analysis. Cox proportional-hazards models were generated in R using the coxph function from the survival package. Progression-free survival (PFS) was then predicted using various combinations of pancancer signature exposures. Signatures CX2, CX3 and CX5 were used individually, and in addition a classifier was made by comparing CX3 to CX2, where both signatures were centred and scaled to allow for comparison. As described above, in this example a scaled CX3 > scaled CX2 was taken to indicate sensitivity, and a scaled CX3 ≤ scaled CX2 was taken to indicate resistance. To further confirm the results, covariates were also introduced into the Cox model to test for significance. When stratified age at diagnosis and tumour stage were included in the Cox model, neither covariate showed significance in predicting survival, and neither changed the significance of the signature predictors.
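The CX3 versus CX2 rule can be illustrated as below. This is a minimal sketch, assuming that centring and scaling means subtracting the cohort mean and dividing by the sample standard deviation of each signature; the function names are hypothetical.

```python
import statistics


def centre_and_scale(values):
    """Z-score one signature across the cohort (sample standard deviation assumed)."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]


def classify_platinum(cx2, cx3):
    """Predict 'sensitive' where scaled CX3 > scaled CX2, otherwise 'resistant'.

    cx2, cx3: exposures for the same samples, in the same order.
    """
    z2 = centre_and_scale(cx2)
    z3 = centre_and_scale(cx3)
    return ["sensitive" if b > a else "resistant" for a, b in zip(z2, z3)]
```

Because the scaling depends on the cohort, the same raw exposures can be classified differently in different cohorts; the resulting labels would then be passed as a covariate to the Cox model.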
Examples 1-3 present a robust analysis framework for chromosomal instability in human cancers built on a pan-cancer analysis across 33 cancer types, and Examples 4-5 demonstrate its practical utility. This resource advances the field in two ways: it untangles CIN according to characteristic genomic patterns and underlying causes, and defines copy number signatures as new biomarkers to quantitatively measure different types of CIN. The approach described herein complements previous landscape studies of the genetic consequences of CIN (Cancer Genome Atlas Research Network, 2008, 2011, 2017; Zack et al., 2013), which generally focused on recurrent somatic copy number events at individual loci that are likely to reflect selective pressures (Zack et al., 2013). In contrast, copy number signatures (Macintyre et al., 2018, Steele et al., 2019) uncover mechanistic biases in the patterns of alterations across all chromosomes.
To maximise sample size, we used SNP 6.0 technology data from the TCGA collection. This technology is well established for copy number analysis, but has lower resolution than whole-genome sequencing. As further WGS data become available there will be an opportunity to further refine the signatures and increase their resolution. In their current form, however, we have demonstrated that the signatures are widely applicable across technologies, including inexpensive assays like shallow WGS that can be easily applied in a clinical setting to formalin-fixed tumour material (Scheinin et al., 2014). It is important to note that the bulk-sequenced samples we analysed do not show the dynamics of CIN. The approach may be adapted to multiple samples or single cells from the same patient to show how patterns of CIN change over time. Further work may also be performed to quantify copy number signature activity at specific genomic loci, as the method described in these examples quantifies signatures at the whole-genome level.
The 17 copy number signatures (Tables 6 and 7) and their aetiologies provide a valuable resource for furthering our understanding of CIN. For example, CX1 represents the most prevalent type of CIN across tumours: chromosome missegregation. CX1 aetiology analysis pointed to multiple different mitotic defects giving rise to this signature. This suggests that, despite diversity in the potential causes of mitotic defects, these all result in the same change in genome structure (Bakhoum & Cantley, 2018). These missegregation events typically result in large copy number changes, potentially disrupting the function of many genes. However, our signature analysis reveals that these changes only represent, on average, 4% of the total number of copy number changes observed in a tumour (
In summary, the signature compendium presented here is an important resource to guide future studies into a deeper understanding of the origins and diversity of CIN and how to therapeutically target different CIN types.
All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety.
The specific embodiments described herein are offered by way of example, not by way of limitation. Various modifications and variations of the described compositions, methods, and uses of the technology will be apparent to those skilled in the art without departing from the scope and spirit of the technology as described. Any sub-titles herein are included for convenience only, and are not to be construed as limiting the disclosure in any way. Unless context dictates otherwise, the descriptions and definitions of the features set out above are not limited to any particular aspect or embodiment of the invention and apply equally to all aspects and embodiments which are described. Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention. It must be noted that, as used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by the use of the antecedent “about,” it will be understood that the particular value forms another embodiment. The term “about” in relation to a numerical value is optional and means for example +/−10%. 
Throughout this specification, including the claims which follow, unless the context requires otherwise, the words “comprise” and “include”, and variations such as “comprises”, “comprising”, and “including” will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps. Other aspects and embodiments of the invention provide the aspects and embodiments described above with the term “comprising” replaced by the term “consisting of” or “consisting essentially of”, unless the context dictates otherwise. The features disclosed in the foregoing description, or in the following claims, or in the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for obtaining the disclosed results, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2114203.9 | Oct 2021 | GB | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/EP2022/077473 | 10/3/2022 | WO | |