The present invention relates to a method for characterising the properties of cancer based on a DNA sample from a tumour. It is particularly, but not exclusively, concerned with a method for identifying whether the tumour is deficient in mismatch repair (MMR), and methods for identifying a treatment accordingly.
Somatic mutations are a hallmark of cancer and can arise through both endogenous and exogenous processes. Endogenous processes that have been shown to give rise to DNA lesions include biochemical activities such as hydrolysis and oxidation (Lindhal et al., 1972), and errors at replication. Fortunately, our cells are equipped with DNA repair pathways that constantly mitigate this endogenous damage (Mardis et al., 2019; Berger & Mardis, 2018). One such pathway is the DNA mismatch repair (MMR) pathway. This pathway is highly conserved and plays a key role in maintaining genomic stability (Li, 2007). In eukaryotes, the pathway is mediated by key proteins collectively referred to as “Mut homologue” proteins. These include MSH2 and MSH6 (together forming the heterodimer MutSα), MSH2 and MSH3 (together forming the heterodimer MutSβ), MLH1 and PMS2 (together forming the heterodimer MutLα), MLH1 and PMS1 (together forming the heterodimer MutLβ), and MLH1 and MLH3 (together forming the heterodimer MutLγ).
Mutations in the genes encoding the Mut homologue proteins affect genomic stability, and are known to be associated with genetic conditions such as Lynch syndrome (also known as hereditary nonpolyposis colorectal cancer (HNPCC)), an autosomal dominant genetic condition that is associated with a high risk of colon cancer as well as endometrial, ovarian, stomach, small intestine, hepatobiliary tract, upper urinary tract, brain, and skin cancer. MMR deficiency can result in microsatellite instability (MSI), a condition that manifests in the creation of novel microsatellite fragments (repeated sequences of DNA, with repeats often a few base pairs long). MSI has been associated with many cancers, and is most prevalent in association with colon cancer. Studies have found that patients stratified on the basis of whether they were MSI-High (MSI-H), MSI-low (MSI-L) or microsatellite stable (MSS) had different prognoses, with the MSI-H status associated with better survival (Popat et al., 2005). This relationship with cancer prognosis has led to the development of multiple commercial diagnostic assays for the detection of microsatellite instability. However, MSI is only one possible manifestation of impaired DNA mismatch repair. Therefore, testing for MSI is not equivalent to testing for MMR deficiency, which is the true biological difference underlying differences in prognosis and response to therapy. Sequence data (such as e.g. whole exome sequencing or whole genome sequencing data) is increasingly commonly acquired in the context of cancer therapy. This data can potentially be leveraged to acquire a wealth of information about a patient's tumour, including its MMR status. Algorithms to classify MMR-deficient tumours have been developed using massively-parallel sequencing data (Ni Huang et al., 2013; Wang & Liang, 2018; Cortes-Ciriano, 2017; Salipante et al., 2014; Hause et al., 2016).
These classifiers depend on detecting elevated tumour mutational burdens (TMB) or microsatellite instability (MSI). Thus, they also rely on relatively crude metrics of genomic instability that are common manifestations of MMR deficiency.
Therefore, there is still a need for improved methods for identifying MMR-deficient tumours using sequence data.
Statements of Invention
The present inventors postulated that improved prediction of the MMR status of tumours could be obtained through the use of mutational signatures. Somatic mutations arising through endogenous and exogenous processes mark the genome with distinctive patterns, termed mutational signatures (Helleday et al., 2014; Alexandrov et al., 2013; Nik-Zainal et al., 2012; Nik-Zainal et al., 2012). While there have been advancements in analytical aspects of deriving mutational signatures from human cancers (Alexandrov et al., 2020; Haradhvala et al., 2018; Kim et al., 2016), the aetiologies and mechanisms underpinning these mutational patterns (Nik-Zainal, S. et al., 2015; Zou, X. et al., 2018; Christensen, S. et al., 2019; Kucab, J. E. et al., 2019) are often still unclear. The present inventors used an experimental approach to create biallelic gene knockouts that produce mutational signatures in the absence of administered DNA damage, and are thus indicative of genes that are important for protecting the genome from intrinsic sources of DNA perturbations. They identified signatures of substitutions and/or indels in a plurality of genes including 5 genes in the MMR pathway: ΔMLH1, ΔMSH2, ΔMSH6, ΔPMS2, and ΔPMS1, suggesting that the proteins encoded by these genes are critical guardians of the genome in non-transformed cells, and supporting the hypothesis that mutational signatures could provide a useful indication of the presence of a deficiency in this pathway. These insights led them to develop a more sensitive and specific mutational-signature-based assay to detect MMR deficiency, MMRDetect. Current TMB-based assays have reduced sensitivity to detect MMR deficiency because many tissues do not have high proliferative rates and may not meet the detection criteria of such assays. They may also falsely call MMR-deficient cases as MMR-proficient, because they measure a single component (e.g., indel burden or substitution count only).
High mutational burdens can be due to different biological processes (Campbell et al., 2017). Consequently, assays based on burden alone are unlikely to be adequately specific. By contrast, the new approach was shown to have excellent specificity and sensitivity, and was able to correctly classify cases that were misclassified with previous approaches.
Thus, according to a first aspect, there is provided a method of characterising a DNA sample obtained from a tumour, the method including the steps of: determining the value of one or more mutational signature metrics for the sample, wherein the mutational signature metrics are selected from: exposure of one or more mutational signatures of mismatch repair (MMR), similarity between the substitution profile of the sample and that of one or more MMR gene knockouts, the number of repeat mediated indels in the mutational profile of the sample, and the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts; and based on said values of said one or more mutational signature metrics, determining whether said sample has a high or low likelihood of being mismatch repair (MMR)-deficient. Determining the value of one or more mutational signature metrics for the sample may comprise determining the similarity between the substitution profile of the sample and that of one or more MMR gene knockouts.
The present inventors have identified the similarity between the substitution profile of the sample and that of one or more MMR gene knockouts to have high predictive value in relation to the sample's MMR status. Prior to the present invention, prediction of MMR status was based primarily on the observation of signs of microsatellite instability. The inventors postulated that mutational profiles that can be identified in samples known to have an MMR deficiency may provide a good indicator of MMR status in test samples. They found that this was indeed the case, but only for some mutational profiles and metrics derived therefrom. The similarity between the substitution profile of a test sample and those of MMR gene knockouts was surprisingly found to be a particularly good predictor of MMR status. By contrast, the similarity between the repeat-mediated insertion profile of a sample and that of knockout-generated indel signatures was found to be a poor predictor of MMR status.
Determining the value of one or more mutational signature metrics for the sample may comprise determining the exposure of one or more mutational signatures of MMR. The present inventors have identified the exposure of mutational signatures that have been associated with MMR as having high predictive value in relation to the sample's MMR status. Importantly, associations between mutational signatures and possible underlying biological mechanisms are typically proposed aetiologies that are not underpinned by direct mechanistic evidence. Thus, the observation that exposure of MMR signatures is actually predictive of MMR status could not have been predicted from the mere fact that these signatures have been postulated to be associated with MMR deficiency. For example, patterns of mutations that are similar to those caused by MMR deficiency may also result from other mutational processes or combinations thereof, such that the observation of the presence of such patterns may in practice not correlate or not sufficiently correlate with MMR status.
Determining the value of one or more mutational signature metrics for the sample may further comprise determining the number of repeat mediated indels in the mutational profile of the sample. Determining the value of one or more mutational signature metrics for the sample may further comprise determining the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts. The present inventors have identified the number of repeat mediated indels in the mutational profile of a sample, and the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts to improve the MMR status prediction obtained using MMR signature exposure and/or similarity between substitution profiles of the sample and that of one or more MMR gene knockouts, at least in the training cohort used. By contrast, the similarity between the repeat mediated insertion profile of the sample and that of one or more MMR gene knockouts was not found to improve the prediction of MMR status in the training cohort used.
Determining the value of one or more mutational signature metrics for the sample may comprise determining the value of all of: exposure of one or more mutational signatures of mismatch repair (MMR), similarity between the substitution profile of the sample and that of one or more MMR gene knockouts, the number of repeat mediated indels in the mutational profile of the sample, and the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts.
Determining whether said sample has a high or low likelihood of being MMR-deficient comprises using said values of said one or more mutational signature metrics to classify said sample between a class associated with a high likelihood of being mismatch repair (MMR)-deficient and a class associated with a low likelihood of being MMR-deficient. Classifying said sample may comprise classifying the sample between a class associated with a high likelihood of being mismatch repair (MMR)-deficient, a class associated with a low likelihood of being MMR-deficient, and one or more additional classes. The one or more additional classes may comprise one or more classes associated with different likelihood of being MMR deficient, and/or one or more classes associated with unknown status (e.g. a class associated with a medium likelihood of being MMR deficient in addition to classes associated with high and low likelihoods of being MMR deficient, respectively). In other words, the classification may be binary or may be a multi-class classification. Determining whether said sample has a high or low likelihood of being mismatch repair (MMR)-deficient may be performed based on the values of one or more further metrics in addition to the values of the one or more mutational signature metrics.
The step of classifying the sample may be performed using one or more machine learning models selected from: a decision tree, a logistic regression classifier, a support vector machine, a naïve Bayes classifier, and a k-nearest neighbour classifier. The machine learning model is preferably a logistic regression classifier. The present inventors have found that logistic regression classifiers were particularly robust, and in particular performed best when applied to data sets that are different from those on which the classifier was trained (such as e.g. when applied to samples from a different type of tumour from those represented in the data that was used to train the classifier).
Determining whether said sample has a high or low likelihood of being MMR-deficient may comprise: generating, using said values of said one or more mutational signature metrics, a probabilistic score; and based on said probabilistic score, determining whether said sample has a high or low likelihood of being MMR-deficient. Determining, based on said probabilistic score, whether said sample has a high or low likelihood of being MMR-deficient may comprise comparing said probabilistic score with one or more predetermined thresholds, and determining that the sample has a high likelihood of being MMR-deficient if the probabilistic score is below a first predetermined threshold, and a low likelihood of being MMR-deficient if the probabilistic score is at or above a second predetermined threshold. The first and second predetermined threshold may be the same or different.
The method may further comprise receiving (e.g. from a user through a user interface, or from a database) or determining a first and/or second predetermined threshold. The first and/or second predetermined thresholds may be determined (or may have been determined) using test data comprising the values of said probabilistic score for a plurality of samples that have a known MMR deficiency status. For example, the predetermined threshold(s) may be chosen so as to optimise (maximise or minimise, as the case may be) one or more performance metrics such as accuracy, specificity or sensitivity of detection of samples from MMR-deficient tumours.
The first and second predetermined thresholds may be the same, and may be between about 0.5 and about 0.9, between about 0.6 and about 0.8, such as about 0.7. The present inventors have found a threshold of 0.7 to be associated with a particularly high accuracy, at least based on the test data used (comprising colorectal tumour samples).
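The choice of a single threshold from samples of known MMR status can be illustrated with a minimal sketch. The scores, labels and candidate grid below are invented for illustration, accuracy is used as the performance metric, and the sketch assumes the convention in which a score below the threshold indicates a high likelihood of MMR deficiency; the function names are hypothetical and are not taken from the described method.

```python
def accuracy_at(threshold, scores, is_deficient):
    """Fraction of samples classified correctly when a score below
    `threshold` is called MMR-deficient (assumed convention)."""
    correct = sum((s < threshold) == d for s, d in zip(scores, is_deficient))
    return correct / len(scores)


def pick_threshold(scores, is_deficient, candidates):
    """Return the first candidate threshold that maximises accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, scores, is_deficient))


# Hypothetical test data: probabilistic scores and known MMR status.
scores = [0.05, 0.20, 0.65, 0.80, 0.95, 0.99]
is_deficient = [True, True, True, False, False, False]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

best = pick_threshold(scores, is_deficient, candidates)
best_accuracy = accuracy_at(best, scores, is_deficient)
```

Other performance metrics (specificity, sensitivity) could be substituted for accuracy by changing `accuracy_at`; the selected value depends entirely on the data supplied.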
In embodiments, determining, based on said probabilistic score, whether said sample has a high or low likelihood of being MMR-deficient comprises comparing said probabilistic score with one or more predetermined thresholds, and determining that the sample has a high likelihood of being MMR-deficient if the probabilistic score is above a first predetermined threshold, and a low likelihood of being MMR-deficient if the probabilistic score is at or below a second predetermined threshold, optionally wherein the first and second predetermined threshold are the same.
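The two threshold conventions described above (high likelihood when the score is below the first threshold, or alternatively when it is above the first threshold) can be sketched as follows. The function name, the `deficient_if_below` flag and the "indeterminate" label for scores falling between two different thresholds are illustrative assumptions.

```python
def classify_mmr(score, first_threshold=0.7, second_threshold=0.7,
                 deficient_if_below=True):
    """Return 'high', 'low' or 'indeterminate' likelihood of MMR deficiency.

    deficient_if_below=True:  high if score < first, low if score >= second.
    deficient_if_below=False: high if score > first, low if score <= second.
    """
    if deficient_if_below:
        if score < first_threshold:
            return "high"
        if score >= second_threshold:
            return "low"
    else:
        if score > first_threshold:
            return "high"
        if score <= second_threshold:
            return "low"
    # Only reachable when the two thresholds differ and the score lies between them.
    return "indeterminate"
```

With identical thresholds the classification is binary; with different thresholds the gap between them acts as an additional class of unknown or intermediate status.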
The probabilistic score may be obtained using a logistic regression model, optionally wherein the probabilistic score is generated using the formula:
$$p = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^{k} \beta_i x_i\right)}}$$
where p is the probability that a sample has a particular MMR deficiency status, β0 is an intercept weight, β is a vector of weights for each of k variables, and x is a vector of variables associated with the sample, wherein the variables comprise said one or more mutational signature metrics or variables derived therefrom. For example, variables derived from the one or more mutational signature metrics may be obtained by scaling each of the mutational signature metrics. The value of the weights β and intercept weight β0 may be determined using a suitable training cohort.
Determining the value of one or more mutational signature metrics for the sample may comprise scaling the value of each mutational signature metric. Scaling the mutational signature metrics may advantageously increase the comparability of the values of the respective variables and reduce the risk that metrics that are on different scales disproportionately affect the probabilistic score obtained. Scaling may be performed using any method known in the art, such as e.g. by normalisation (also known as min-max scaling, i.e. transforming a variable such that the range of possible values for the variable ranges between 0 and 1), or by standardisation (where values are centred around the mean with a unit standard deviation by, for each observation, subtracting the mean and dividing by the standard deviation for the variable). The present inventors have found simple normalisation, for example dividing each value by the maximum observed or expected value for the variable to strike a good balance between simplicity and improving the comparability of the variables thus improving the performance of the MMR deficiency identification process. The scaling may be performed using one or more parameters for each mutational signature metric, such as e.g. a value by which every value for a particular metric should be divided in order to obtain the corresponding derived (i.e. normalised) value. Thus, the method may further comprise receiving or determining the value of said one or more parameters.
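The scaling options mentioned above can be sketched as follows, using only the Python standard library. The function names and the example counts are hypothetical, and the sketch assumes that the values of a metric are not all identical (otherwise the denominators would be zero).

```python
from statistics import mean, pstdev


def scale_by_max(values, max_value=None):
    """Simple normalisation: divide by the maximum observed value,
    or by a supplied expected maximum (the scaling parameter)."""
    divisor = max(values) if max_value is None else max_value
    return [v / divisor for v in values]


def min_max(values):
    """Min-max scaling onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical numbers of repeat mediated indels in four samples.
indel_counts = [0, 50, 100, 200]
normalised = scale_by_max(indel_counts)
standardised = standardise(indel_counts)
```

Passing a fixed `max_value` lets the same divisor be reused for new samples, which corresponds to receiving the scaling parameter rather than determining it from the data at hand.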
Determining whether said sample has a high or low likelihood of being mismatch repair (MMR)-deficient based on the value of said mutational signature metrics for the sample may comprise weighting each of said values by a predetermined weighting factor. The predetermined weighting factors may represent the relative importance of the mutational signature metrics in the determination of the likelihood of the sample being MMR-deficient. The predetermined weighting factors may be such that the exposure of one or more mutational signatures of mismatch repair (MMR) has a higher weight than any of: the similarity between the substitution profile of the sample and that of one or more MMR gene knockouts, the number of repeat mediated indels in the mutational profile of the sample, and the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts. Instead of or in addition to this, the predetermined weighting factors may be such that the similarity between the substitution profile of the sample and that of one or more MMR gene knockouts has a higher weight than any of: the number of repeat mediated indels in the mutational profile of the sample, and the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts. Instead of or in addition to this, the predetermined weighting factors may be such that the exposure of one or more mutational signatures of mismatch repair (MMR) and the similarity between the substitution profile of the sample and that of one or more MMR gene knockouts both have a higher respective weight than any of: the number of repeat mediated indels in the mutational profile of the sample, and the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts. Instead of or in addition to this, the predetermined weighting factors may be such that the exposure of one or more mutational signatures of mismatch repair (MMR) has a higher weight than the similarity between the substitution profile of the sample and that of one or more MMR gene knockouts, the similarity between the substitution profile of the sample and that of one or more MMR gene knockouts has a higher weight than the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts, and the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts has a higher weight than the number of repeat mediated indels in the mutational profile of the sample.
For example, the exposure of one or more mutational signatures of mismatch repair (MMR) may have a weight between about −60 and about −20, between about −50 and about −30, between about −45 and about −40, such as about −43, e.g. −42.95. As another example, the similarity between the substitution profile of the sample and that of one or more MMR gene knockouts may have a weight between about −20 and about 0, between about −20 and about −10, about −15, such as e.g. −14.53. As another example, the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts may have a weight between about −15 and about 0, between about −10 and about 0, about −5, such as e.g. −4.62. As another example, the number of repeat mediated indels in the mutational profile of the sample may have a weight between about −20 and about 0, between about −15 and 0, between about −10 and 0, between about −5 and 0, about −3, such as e.g. −2.96. When a linear model is used (such as e.g. a logistic regression model), an intercept weight β0 may additionally be used. The intercept weight may have a value between 10 and 20, such as e.g. 16.043. The precise value of the intercept is not critical as it is identical for every sample and hence samples can still be compared to each other regardless of the value used for the intercept weight. However, when using models such as a logistic regression model, an intercept value fitted using a suitable training dataset is preferably used as this enables the interpretation of the resulting score in a more straightforward manner as indicative of the likelihood of samples being MMR deficient.
All of the variables are preferably normalised prior to weighting. Alternatively, the respective weights may be adjusted so as to obtain equivalent weights for un-normalised values. As the skilled person understands, the exact values of the weights used are likely to depend on the training data used. For example, the examples herein demonstrate how to obtain suitable values using training data comprising colorectal cancer samples. Using a different training data set (comprising additional samples and/or different samples such as e.g. samples from other types of tumours) may result in different weights. However, the relative importance of the variables may remain similar.
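As an illustration of how weights of this kind combine into a probabilistic score, the following sketch applies a standard logistic function to the example weights and intercept quoted above. The variable names are hypothetical labels for the four normalised metrics, the input values are invented, and the inputs are assumed to have been normalised to the range 0 to 1 beforehand.

```python
import math

# Example values quoted in the text; other training data may give other weights.
INTERCEPT = 16.043
WEIGHTS = {
    "mmr_signature_exposure": -42.95,
    "substitution_similarity": -14.53,
    "repeat_deletion_similarity": -4.62,
    "repeat_indel_count": -2.96,
}


def probabilistic_score(metrics):
    """Logistic score computed from normalised metric values (assumed in [0, 1])."""
    linear = INTERCEPT + sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-linear))


# Invented inputs: no MMR-like signal versus a strong signal in every metric.
no_signal = dict.fromkeys(WEIGHTS, 0.0)
strong_signal = dict.fromkeys(WEIGHTS, 0.5)

score_no_signal = probabilistic_score(no_signal)
score_strong_signal = probabilistic_score(strong_signal)
```

Because all four example weights are negative, larger metric values lower the score. This is consistent with the embodiment in which a score below the threshold indicates a high likelihood of MMR deficiency.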
Determining whether said sample has a high or low likelihood of being mismatch repair (MMR)-deficient based on said values of said one or more mutational signature metrics may comprise using a machine learning model that has been trained using training data comprising the values of said mutational signature metrics for a plurality of samples that have a known MMR deficiency status. In embodiments, the machine learning model is able to provide a prediction of whether a sample has a high or low likelihood of being mismatch repair (MMR)-deficient with above 99% accuracy (as evaluated using the AUC metric), such as e.g. AUC=1, on at least one test set of samples. The test set of samples preferably comprises at least 50 samples, at least 60 samples, at least 70 samples, at least 80 samples, at least 90 samples, or about 100 samples. The test set of samples may comprise samples from one or more types of tumours. The one or more types of tumours in the test set of samples may be represented in the training set used to train the machine learning algorithm. The test set of samples may comprise colorectal cancer samples. The training set of samples may comprise colorectal cancer samples. The test set of samples and the training set of samples preferably comprise samples that are known to be MMR deficient and samples that are known to be MMR proficient. The test set of samples and/or the training set of samples preferably comprise a plurality of samples that are known to be MMR deficient and a plurality of samples that are known to be MMR proficient. The test set of samples and/or the training set of samples preferably comprise between about 5% and about 50%, between about 10% and about 40%, between about 10% and about 30% of samples that are known to be MMR deficient. In embodiments, the proportion of samples that are known to be MMR deficient in the training set of samples is similar to that in the test set of samples.
The proportion of samples that are known to be MMR deficient in the training set of samples and/or in the test set of samples may be similar to the expected proportion of tumours that are MMR deficient in the tumour samples represented in the data set.
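The AUC metric referred to above can be computed on a test set without any external library, as in the following sketch. It uses the rank-based (Mann-Whitney) definition of AUC; the scores are invented, and the conversion `1 - s` assumes the convention in which low scores indicate MMR deficiency.

```python
def auc(positive_scores, negative_scores):
    """Probability that a randomly chosen positive sample scores higher than
    a randomly chosen negative sample, with ties counted as one half."""
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical test-set scores for samples of known MMR status.
deficient_scores = [0.01, 0.10, 0.30]
proficient_scores = [0.80, 0.90, 0.99]

# Treat MMR-deficient samples as the positive class; low scores indicate
# deficiency here, so the scores are flipped before ranking.
test_auc = auc([1 - s for s in deficient_scores],
               [1 - s for s in proficient_scores])
```

An AUC of 1 means every MMR-deficient sample is ranked ahead of every MMR-proficient sample, independently of the threshold eventually chosen.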
Determining the value of one or more mutational signature metrics for the sample may comprise cataloguing the somatic mutations in said sample to produce a mutational catalogue for that sample, wherein the value of said mutational signature metrics is derived from said mutational catalogue. A mutational catalogue may also be referred to herein as a mutation profile. A mutational catalogue may be separated into sub-catalogues that catalogue mutations of a particular type such as e.g. substitutions, deletions, insertions, indels, etc. These may be referred to as a “substitution profile/catalogue”, “deletion profile/catalogue”, etc. A catalogue may comprise the number of mutations in each of a plurality of classes considered as part of a catalogue or subcatalogue.
A mutational profile may refer to a somatic mutational profile. A somatic mutational profile may comprise exclusively mutations that are not present (or assumed not to be present) in a corresponding germline genome. Thus, cataloguing the somatic mutations in a sample may comprise identifying all mutations present in a sample and removing or otherwise excluding mutations that are present or assumed to be present in a corresponding germline genome. Mutations that are present in a corresponding germline genome may be identified by identifying the mutations present in a germline sample obtained from the same subject. In other words, mutations that are present in a corresponding germline genome may be defined as mutations that have been identified by analysing genomic material from a matched normal (e.g. non-tumour and/or non-modified) sample. For example, a somatic mutational profile for a tumour may be obtained by comparison with a germline sample from the same subject (i.e. a sample of normal/non-tumour cells or genomic material derived therefrom). In the case of a mutational profile that has been obtained from a sample that has been engineered or selected to contain a particular modification, a somatic mutational profile may be obtained using a sample obtained prior to the engineering or selection step that resulted in the particular modification. For example, in the case of MMR gene knockout samples, a corresponding “germline” profile may be obtained from the parent sample, prior to introducing the MMR gene knockout modification. Mutations that are assumed to be present in a corresponding germline genome may be defined as mutations that are present in a reference genome or set of reference genomes. A reference genome or set of reference genomes may be obtained from one or more reference samples that are not strictly matched normal samples. For example, the reference sample(s) may be process matched, or may comprise a plurality of normal (i.e. 
non-tumour/non-modified) samples not all of which are matched to the sample for which a somatic mutational profile is determined (e.g. pooled normal samples may be used as references for a plurality of tumour samples). A reference genome or set of reference genomes may be obtained from one or more databases. For example, a reference genome may be used and all mutations compared to this reference genome may be assumed to be somatic mutations. Alternatively, a set of reference genomes may be obtained from a database as a catalogue of known germline mutations in one or more populations (e.g. a genetic variation database such as dbSNP https://www.ncbi.nlm.nih.gov/snp/, 1000 genomes https://www.internationalgenome.org/, etc.). The use of a matched normal sample advantageously provides greatest certainty that the mutations identified in the DNA from the tumour sample are somatic mutations. The use of pooled normal samples comprising a matched normal sample may provide similar (though less precise) information and may be useful e.g. when sequencing resources are limited. Compared to the use of a matched normal sample, this may risk excluding more somatic mutations as seemingly germline mutations. The use of a reference genome or set of reference genomes advantageously does not require the acquisition and analysis of a separate normal sample. However, the reference genome or set of reference genomes is unlikely to capture all germline mutations present in the subject, and may include mutations that are in fact somatic in the subject. This is particularly true if a single reference genome is used rather than a collection capturing common sequence variation. Thus, this may result in a less accurate identification of somatic mutations.
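The exclusion of germline (or assumed germline) mutations can be sketched as a simple set difference. The tuple representation of a mutation (chromosome, position, reference allele, alternative allele) and all of the calls below are illustrative assumptions.

```python
def somatic_mutations(sample_calls, reference_calls):
    """Keep the calls from the sample that are absent from the reference set
    (matched normal, pooled normals, or a database of known germline variants)."""
    reference = set(reference_calls)
    return [m for m in sample_calls if m not in reference]


# Hypothetical calls: (chromosome, position, reference allele, alternative allele).
tumour = [
    ("chr1", 100, "C", "T"),
    ("chr2", 250, "G", "A"),
    ("chr3", 42, "A", "AT"),
]
matched_normal = [("chr2", 250, "G", "A")]
# A pool that also contains a variant seen only in another subject's normal sample.
pooled_normals = matched_normal + [("chr3", 42, "A", "AT")]

somatic_matched = somatic_mutations(tumour, matched_normal)
somatic_pooled = somatic_mutations(tumour, pooled_normals)
```

In this toy case the pooled reference removes one more call than the matched normal does, which illustrates the risk noted above of excluding somatic mutations as seemingly germline mutations.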
Cataloguing the somatic mutations in said sample may comprise determining the number of mutations in the mutational catalogue which are attributable to each of a plurality of base substitution classes and/or indel classes which are determined to be present, optionally wherein the base substitution classes include all possible trinucleotide substitution classes and/or wherein the indel classes include classes for multiple combinations of indel type, e.g. selected from insertion, deletion and complex, indel size, e.g. selected from 1-bp or longer, and flanking sequence, such as e.g. repeat-mediated, microhomology-mediated or other. The base substitution classes may be described according to the “96 channels convention” known in the art, i.e. the product of 6 types of substitution multiplied by 4 types of 5′ base (A,C,G,T) and 4 types of 3′ base (A,C,G,T). Trinucleotide substitution classes are listed in Table 3 (column “mutation type”). The indel classes may include the following 15 channels: 1 bp C/T insertion at short repetitive sequence (<5 bp), 1 bp C/T insertion at long repetitive sequence (>=5 bp), long insertions (>1 bp) at repetitive sequences, microhomology-mediated insertions, 1 bp C/T deletions at short repetitive sequence (<5 bp), 1 bp C/T deletions at long repetitive sequence (>=5 bp), long deletions (>1 bp) at repetitive sequences, microhomology-mediated deletions, other deletions, and complex indels. Alternatively, the indel classes may include 45 channels including the preceding 15 channels but where the 1 bp C/T indels at repetitive sequences are further expanded according to the exact length of the repetitive sequences (from 0 to 9).
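The 96 substitution channels can be enumerated and counted as in the following sketch. It assumes the commonly used convention of reporting each substitution relative to the pyrimidine (C or T) of the mutated base pair and a channel label of the form 5′base[ref>alt]3′base; the function names and example mutations are hypothetical.

```python
from collections import Counter

SUBSTITUTION_TYPES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
BASES = "ACGT"
# 6 substitution types x 4 five-prime bases x 4 three-prime bases = 96 channels.
CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTION_TYPES
    for five in BASES
    for three in BASES
]
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def channel(five, ref, alt, three):
    """Return the channel label for a substitution in its trinucleotide context.
    Purine references are converted to the reverse-complement strand."""
    if ref in "AG":
        five, ref, alt, three = (
            _COMPLEMENT[three], _COMPLEMENT[ref], _COMPLEMENT[alt], _COMPLEMENT[five]
        )
    return f"{five}[{ref}>{alt}]{three}"


def substitution_catalogue(substitutions):
    """Count substitutions per channel, in the fixed CHANNELS order."""
    counts = Counter(channel(*s) for s in substitutions)
    return [counts[c] for c in CHANNELS]


# Hypothetical somatic substitutions as (5' base, ref, alt, 3' base).
example = [("A", "C", "T", "G"), ("T", "G", "A", "C"), ("A", "C", "T", "G")]
profile = substitution_catalogue(example)
```

The resulting list of 96 counts is one possible representation of a substitution profile; an indel profile could be built in the same way from a list of indel channel labels.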
Determining the value of the exposure of one or more mutational signatures of MMR for the sample may comprise determining the value of the exposure to a plurality of mutational signatures of MMR and summing the values of the exposure to each of the plurality of mutational signatures of MMR. Determining the value of the exposure of one or more mutational signatures of MMR for the sample may be performed as described in Degasperi et al. Determining the value of the exposure of one or more mutational signatures of MMR for the sample may be performed by identifying the matrix E that satisfies C≈PE where C is a mutational catalogue for the sample, P is a signature matrix comprising the one or more mutational signatures of MMR, and E is an exposure matrix. The one or more mutational signatures of MMR may be selected from RefSig MMR1 and RefSig MMR2. The one or more mutational signatures of MMR may be selected from known mutational signatures that have been derived from mutational catalogues associated with a plurality of cancer samples. Known mutational signatures that have been derived from mutational catalogues associated with a plurality of cancer samples include COSMIC signatures (e.g. as described in Alexandrov et al., 2020) or RefSig signatures (as described in e.g. Degasperi et al., 2020). The one or more mutational signatures of MMR may be signatures selected from such sets of signatures that have MMR deficiency as a postulated aetiology.
RefSig MMR1 (also referred to as “MMR1”) and RefSig MMR2 (also referred to as “MMR2”) are described in Degasperi et al., 2020 and available at https://signal.mutationalsignatures.com/explore/study/1 (see https://signal.mutationalsignatures.com/explore/referenceCancerSignature/52 for RefSig MMR1 and https://signal.mutationalsignatures.com/explore/referenceCancerSignature/56 for RefSig MMR2).
The signature matrix P typically comprises the one or more mutational signatures of MMR and additional signatures that have been identified together with the one or more mutational signatures of MMR. The coefficients of the E matrix corresponding to the MMR signatures of interest in the sample under investigation may then be used as the exposure value(s) for the one or more signatures of MMR. The signature matrix P may comprise all of the reference signatures (RefSig) described in Degasperi et al., 2020 (and available at https://signal.mutationalsignatures.com/explore/study/1), or organ-specific equivalents thereof. When organ-specific signatures equivalent to RefSig signatures are used, the values of the exposure to RefSig MMR1 and/or RefSig MMR2 may be obtained using a conversion matrix, such as described in Degasperi et al., 2020, and available at https://signal.mutationalsignatures.com/explore/study/1.
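By way of illustration only, the relationship C≈PE may be solved for a single sample as in the following sketch. It is a generic non-negative least-squares fit using multiplicative updates with P held fixed, and is not asserted to be the fitting procedure of Degasperi et al., 2020. The function names, the iteration count and the use of plain Python lists are assumptions made for this example.

```python
def fit_exposures(catalogue, signatures, n_iter=1000, eps=1e-12):
    """Estimate non-negative exposures e such that catalogue ~= P e.

    catalogue:  list of n mutation counts (one column of C).
    signatures: list of k signatures (the columns of P), each a list of n values.
    Uses multiplicative updates e <- e * (P^T c) / (P^T P e) with P held fixed.
    """
    n, k = len(catalogue), len(signatures)
    total = float(sum(catalogue))
    exposures = [total / k] * k  # uniform, strictly positive starting point
    # P^T c and P^T P do not change between iterations.
    ptc = [sum(sig[i] * catalogue[i] for i in range(n)) for sig in signatures]
    ptp = [[sum(a[i] * b[i] for i in range(n)) for b in signatures]
           for a in signatures]
    for _ in range(n_iter):
        denom = [sum(ptp[j][m] * exposures[m] for m in range(k))
                 for j in range(k)]
        exposures = [exposures[j] * ptc[j] / (denom[j] + eps) for j in range(k)]
    return exposures


def mmr_exposure(exposures, names, mmr_names=("RefSig MMR1", "RefSig MMR2")):
    """Sum the exposures of the signatures regarded as MMR signatures."""
    return sum(e for e, name in zip(exposures, names) if name in mmr_names)
```

Here `mmr_exposure` reflects the option of summing the exposures to a plurality of MMR signatures; the rows of E for any additional signatures in P are fitted but not included in the sum.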
Determining the value of the similarity between a substitution or repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts may comprise determining the cosine similarity between pairs of profiles. Determining the value of similarity between a substitution or repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts may comprise determining the value of similarity between a substitution or repeat mediated deletion profile of the sample and that of each of a plurality of MMR gene knockouts to obtain a plurality of similarity values, and obtaining a summarised similarity value for the plurality of similarity values, optionally wherein the summarised similarity value is the maximum or the mean similarity value. Determining the value of similarity between a substitution profile of the sample and that of one or more MMR gene knockouts may comprise determining the value of similarity between a substitution profile of the sample and that of each of a plurality of MMR gene knockouts to obtain a plurality of similarity values, and obtaining a summarised similarity value for the plurality of similarity values, wherein the summarised similarity value is the maximum similarity value. Determining the value of similarity between a repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts may comprise determining the value of similarity between a repeat mediated deletion profile of the sample and that of each of a plurality of MMR gene knockouts to obtain a plurality of similarity values, and obtaining a summarised similarity value for the plurality of similarity values, wherein the summarised similarity value is the mean similarity value.
The one or more MMR gene knockouts may be selected from: MSH2, MSH3, MSH6, MLH1, PMS2, and PMS1. The one or more MMR gene knockouts may be selected from: MSH2, MSH6, MLH1, PMS2, and PMS1. The one or more MMR gene knockouts may be selected from PMS2, MLH1, MSH2 and MSH6. The one or more MMR gene knockouts may include a plurality of gene knockouts, such as all of the gene knockouts, selected from: MSH2, MSH6, MLH1, PMS2, and PMS1. The one or more MMR gene knockouts may include a plurality of gene knockouts selected from: PMS2, MLH1, MSH2 and MSH6. The one or more gene knockouts may include (all of) PMS2, MLH1, MSH2 and MSH6.
The substitution and/or repeat mediated deletion profile (collectively referred to as mutational profile) of an MMR gene knockout may have been derived from one or more MMR gene knockout samples as described herein. The term “MMR gene knockout sample” refers to any sample of cells or genetic material derived therefrom, in which the function of one or more genes of the MMR pathway is impaired. These one or more genes are the ones referred to as “gene knockouts”, i.e. an MMR gene knockout sample which is MSH2 is a sample of cells or genetic material derived therefrom, in which the function of MSH2 is impaired.
A mutational profile for an MMR gene knockout may have been derived from a plurality of MMR gene knockout samples. Using a plurality of MMR gene knockout samples to generate each MMR gene knockout mutational profile may advantageously reduce the effect of variability between different gene knockout samples. For example, the plurality of MMR gene knockout samples may comprise a plurality (e.g. between 2 and 4) of samples of cells or genetic material derived therefrom in which the same MMR gene has been impaired. The samples may be technical and/or biological replicates, for example samples of cells or genetic material derived therefrom where the same gene has been impaired using the same technical means. The function of a gene in the MMR pathway may have been impaired through a knockout, through silencing, through one or more mutations (e.g. coding or truncating mutations), or through downregulation. Preferably, the function of a gene in the MMR pathway has been impaired through knockout, such as e.g. using CRISPR-Cas9.
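By way of illustration only, a summarised similarity value over a plurality of MMR gene knockout profiles may be computed as in the following sketch. The function names and the dictionary keyed by gene name are assumptions made for this example; the profiles are assumed to be vectors of equal length over the same mutation channels.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero profile vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def summarised_similarity(sample_profile, knockout_profiles, summary="max"):
    """Compare a sample profile with each knockout profile and summarise.

    knockout_profiles: dict mapping a knockout name (e.g. "MSH2") to a profile.
    summary: "max" or "mean".
    """
    values = [cosine_similarity(sample_profile, profile)
              for profile in knockout_profiles.values()]
    if summary == "max":
        return max(values)
    if summary == "mean":
        return sum(values) / len(values)
    raise ValueError(f"unknown summary: {summary!r}")
```

In line with the options set out above, `summary="max"` might be used for substitution profiles and `summary="mean"` for repeat mediated deletion profiles.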
A mutational profile for an MMR gene knockout may have been derived from one or more MMR gene knockout samples and one or more background mutational profiles. The background mutational profiles may have been obtained from one or more control samples.
A mutational profile for an MMR gene knockout may have been derived from an MMR gene knockout sample by: obtaining a plurality of mutational profiles for respective bootstrap samples for the MMR gene knockout, obtaining a plurality of mutational profiles for respective bootstrap background samples, and subtracting a summarised value for the bootstrap background mutational profiles from a summarised value for the bootstrap MMR knockout mutational profiles. A summarised value may be the centroid of a plurality of mutational profiles. Mutational profiles for bootstrap samples (whether for MMR gene knockouts or background) may be obtained using a plurality of mutational profiles each obtained from a respective sample (MMR knockout sample or background sample). A background sample may be a sample in which no gene in the MMR pathway has had its function impaired. A background sample may be a sample in which the function of a control gene has been impaired. A control gene may be chosen as a gene not involved in the MMR pathway or a gene which, if impaired, does not result in a functional impairment of the MMR pathway. A control gene may be chosen as a gene that is not involved in a DNA repair pathway, or a gene which, if impaired, does not result in functional impairment in a DNA repair pathway.
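By way of illustration only, the bootstrap and background subtraction described above may be sketched as follows. The resampling scheme (drawing the mutations of a randomly chosen sample with replacement), the number of bootstrap replicates, the clipping of negative differences to zero and all function names are assumptions made for this example and are not specified in the present description.

```python
import random


def bootstrap_profile(profile, rng):
    """Resample the mutations of one profile with replacement (same total)."""
    total = int(sum(profile))
    counts = [0] * len(profile)
    if total == 0:
        return counts
    for channel in rng.choices(range(len(profile)), weights=profile, k=total):
        counts[channel] += 1
    return counts


def centroid(profiles):
    """Channel-wise mean of a plurality of profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def knockout_profile(knockout_samples, background_samples, n_boot=100, seed=0):
    """Centroid of bootstrap knockout profiles minus centroid of bootstrap
    background profiles, with negative values clipped to zero (assumption)."""
    rng = random.Random(seed)
    ko_boot = [bootstrap_profile(rng.choice(knockout_samples), rng)
               for _ in range(n_boot)]
    bg_boot = [bootstrap_profile(rng.choice(background_samples), rng)
               for _ in range(n_boot)]
    difference = [k - b for k, b in zip(centroid(ko_boot), centroid(bg_boot))]
    return [max(d, 0.0) for d in difference]
```

Fixing the seed makes the derived profile reproducible; other summarised values could be substituted for the centroid without changing the structure of the sketch.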
A mutational profile for an MMR gene knockout may have been derived from a plurality of MMR gene knockout samples by obtaining a mutational profile for each MMR gene knockout sample and deriving a summarised mutational profile for the plurality of MMR gene knockout samples from the mutational profiles of the respective samples. Similarly, a background mutational profile may have been derived from a plurality of control samples by obtaining a mutational profile for each control sample and deriving a summarised mutational profile for the plurality of control samples from the mutational profiles of the respective samples. Alternatively, mutational profiles derived from a plurality of MMR gene knockout samples may each be used individually. For example, when determining the similarity between a mutational profile of a sample and that of a plurality of gene knockout samples, each of the profiles of the respective gene knockout samples may be compared individually with the profile of the sample, and a summarised value for the similarity (such as e.g. the maximum or average) may be used as the value of the corresponding mutational signature metric. Thus, the step of determining the value of a mutational signature metric that uses a mutational profile may comprise obtaining the mutational profile using any of the steps described above.
The similarity between two mutation profiles may be obtained as the cosine similarity. The cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is equal to the cosine of the angle between the two vectors. It is also equal to the inner product of the two vectors after each has been normalised to have length 1. Alternatively, the similarity between two mutation profiles may be obtained as the angular distance or angular similarity between the two vectors encoding the mutation profiles. As another alternative, the similarity between two mutation profiles may be obtained as the Euclidean distance between L2-normalised versions of the two vectors encoding the mutation profiles. As another alternative, the similarity between two mutation profiles may be obtained as the correlation between the two vectors encoding the mutation profiles.
Determining the number of repeat mediated indels in the mutational profile of the sample may comprise obtaining a mutational catalogue for the sample and determining the number of insertions and deletions in the mutational profile that occur within repetitive regions. Repetitive regions may be regions comprising multiple repeats of the same sequence motif, optionally wherein a sequence motif is a sequence of between 1 and 9 bases in length. A repetitive region may be defined as a region of a reference genome (e.g. the reference genome used to call mutational profiles, such as a defined release of the human reference genome, if human genetic material is being analysed) comprising multiple (i.e. 2 or more) repeats of the same sequence motif. A sequence motif may be defined as a sequence of one or more specific bases. For example, AA, AAA, AAAA, AAAAA, ATAT, ATATAT, ATATATAT, CAGCAG, CAGCAGCAG, CAGCAGCAGCAGCAG are all repetitive regions.
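For reference, the first three measures mentioned above are related as follows, where a and b denote the two profile vectors, θ the angle between them, and â and b̂ their L2-normalised versions (this notation is introduced here for illustration only):

```latex
s(a,b) = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2} = \cos\theta,
\qquad
\theta = \arccos s(a,b),
\qquad
\lVert \hat{a} - \hat{b} \rVert_2 = \sqrt{2 - 2\,s(a,b)},
\quad
\hat{a} = \frac{a}{\lVert a \rVert_2},\;
\hat{b} = \frac{b}{\lVert b \rVert_2}.
```

Both the angle and the Euclidean distance between the normalised vectors decrease monotonically as the cosine similarity increases, so these three measures rank pairs of profiles identically. The Pearson correlation equals the cosine similarity of the mean-centred vectors and may therefore rank pairs differently.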
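By way of illustration only, the definition of a repetitive region given above may be checked as in the following sketch. It assumes that an upstream annotation step has already extracted, for each indel, the reference sequence of the tract in which the indel occurs; the `context` key and the function names are hypothetical and are introduced for this example.

```python
def repeat_unit(sequence, max_motif_len=9):
    """Return the shortest motif (1 to max_motif_len bases) of which `sequence`
    consists of two or more tandem copies, or None if there is none."""
    n = len(sequence)
    for motif_len in range(1, min(max_motif_len, n // 2) + 1):
        if n % motif_len == 0:
            motif = sequence[:motif_len]
            if sequence == motif * (n // motif_len):
                return motif
    return None


def count_repeat_mediated_indels(indels):
    """Count indels whose reference context is a repetitive region.

    indels: iterable of dicts with a (hypothetical) 'context' key holding the
    reference sequence of the tract in which the indel occurs.
    """
    return sum(1 for indel in indels
               if repeat_unit(indel["context"]) is not None)
```

Applied to the examples listed above, `repeat_unit` returns `A` for AAAAA, `AT` for ATATAT and `CAG` for CAGCAGCAG, whereas a non-repetitive tract such as ACGT yields `None`.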
The method may further comprise obtaining the sample from a tumour of a subject. The method may further comprise obtaining sequence data from a sample from a tumour. The method may further comprise providing to a user one or more of: the value of the one or more mutational signature metrics, a value derived therefrom (such as e.g. a probabilistic score), and a determination of whether the sample has a high likelihood or a low likelihood of being MMR-deficient. The method may further comprise obtaining a germline sample from the subject and/or obtaining sequence data from a germline sample from the subject. The tumour sample may be a sample comprising tumour cells or genetic material derived therefrom. The tumour sample may be a sample of cells or tissue that has been obtained directly from a tumour (e.g. a tumour biopsy). The tumour sample may be a sample comprising cells or genetic material derived from a tumour, such as e.g. a liquid biopsy sample comprising circulating tumour cells or circulating tumour DNA.
According to a further aspect, there is provided a method of predicting whether a subject with cancer is likely to respond to an immunotherapy, the method comprising characterising a sample obtained from a tumour in the subject as having a high or low likelihood of being MMR-deficient using a method of any embodiment of the first aspect, wherein if the sample is characterised as having a high likelihood of being MMR-deficient, the subject is likely to respond to immunotherapy. The method may further comprise administering the immunotherapy to a subject that has been diagnosed as likely to respond to immunotherapy. The method may comprise recommending a subject that has been diagnosed as likely to respond to the immunotherapy for treatment with the immunotherapy. The method may comprise administering an alternative therapy (e.g. a conventional chemotherapy, radiotherapy, etc.) and/or recommending a subject for treatment with an alternative therapy, where the subject has been diagnosed as not likely to respond to immunotherapy.
According to a further aspect, there is provided a method of selecting a subject having cancer for treatment with an immunotherapy, the method comprising characterising a sample obtained from a tumour in the subject as having a high or low likelihood of being MMR-deficient using a method of any embodiment of the first aspect, and selecting the subject for treatment with an immunotherapy if the sample is characterised as having a high likelihood of being MMR-deficient.
According to a further aspect, there is provided an immunotherapy for use in a method of treatment of cancer in a subject from whom a DNA sample has been obtained and the DNA sample has been characterised by a method according to any one of claims x to x as having a high likelihood of being MMR-deficient.
According to a further aspect, there is provided a method of treating cancer in a subject determined to have a tumour with a high likelihood of being MMR-deficient, wherein the likelihood of the tumour being MMR-deficient is determined by characterising a DNA sample obtained from the tumour using a method according to any embodiment of the first aspect.
According to any of these aspects, the immunotherapy may be administered (or recommended for administration) in combination with one or more therapies, such as one or more chemotherapies, one or more courses of radiotherapy and/or one or more surgical interventions.
According to any of these aspects, the immunotherapy may be administered (or recommended for administration) in combination with a PARP inhibitor or platinum-based therapy if the subject has been determined as having a high likelihood of being HR-deficient and/or having a high-likelihood of responding to a PARP inhibitor or platinum-based therapy. Thus, any such method may further comprise determining whether the subject is likely to respond to a PARP inhibitor or platinum-based therapy and/or characterising a sample obtained from a tumour in the subject as having a high or low likelihood of being HR-deficient. Methods suitable for this purpose are described in WO 2018/115452, WO 2017/191074, and WO 2017/191073, all of which are incorporated herein by reference.
According to a further aspect, there is provided an immunotherapy for use in a method of treatment of cancer in a subject, the method comprising: (i) determining whether a DNA sample obtained from said subject has a high or low likelihood of being MMR-deficient using a method according to any embodiment of the first aspect; and (ii) administering the immunotherapy to said subject if the DNA sample is determined to have a high likelihood of being MMR-deficient. An immunotherapy may be a checkpoint inhibitor drug, such as a PD-1 or PD-L1 inhibitor.
According to a further aspect, there is provided a method of predicting whether a subject with cancer is likely to respond to a non-fluorouracil-based chemotherapy, the method comprising characterising a sample obtained from a tumour in the subject as having a high or low likelihood of being MMR-deficient using a method of any preceding claim, wherein if the sample is characterised as having a high likelihood of being MMR-deficient, the subject is likely to respond to the non-fluorouracil-based chemotherapy.
According to a further aspect, there is provided a method of predicting whether a subject with cancer is likely to respond to a fluorouracil-based chemotherapy, the method comprising characterising a sample obtained from a tumour in the subject as having a high or low likelihood of being MMR-deficient using a method of any preceding claim, wherein if the sample is characterised as having a high likelihood of being MMR-deficient, the subject is unlikely to respond to the fluorouracil-based chemotherapy.
According to any of these aspects, the fluorouracil-based therapy or non-fluorouracil based therapy may be administered (or recommended for administration) in combination with one or more therapies, such as one or more chemotherapies, one or more courses of radiotherapy and/or one or more surgical interventions.
According to any of these aspects, the fluorouracil-based therapy or non-fluorouracil based therapy may be administered (or recommended for administration) in combination with a PARP inhibitor or platinum-based therapy if the subject has been determined as having a high likelihood of being HR-deficient and/or having a high-likelihood of responding to a PARP inhibitor or platinum-based therapy. Thus, any such method may further comprise determining whether the subject is likely to respond to a PARP inhibitor or platinum-based therapy and/or characterising a sample obtained from a tumour in the subject as having a high or low likelihood of being HR-deficient. Methods suitable for this purpose are described in WO 2018/115452, WO 2017/191074, and WO 2017/191073.
According to a further aspect, there is provided a method of providing a prognosis for a subject who has been diagnosed with cancer, the method comprising characterising a sample obtained from a tumour in the subject as having a high or low likelihood of being MMR-deficient using a method of any preceding claim, wherein if the sample is characterised as having a high likelihood of being MMR-deficient, the subject is likely to have a better prognosis than a subject characterised as having a low likelihood of being MMR-deficient.
According to a further aspect there is provided a chemotherapy for use in a method of treatment of cancer in a subject, the method comprising: (i) determining whether a DNA sample obtained from said subject has a high or low likelihood of being MMR-deficient using a method according to any embodiment of the first aspect; and (ii) administering the chemotherapy to said subject if the DNA sample is determined to have a high likelihood of being MMR-deficient, preferably wherein the chemotherapy is a non-fluorouracil-based therapy. Alternatively, the method may comprise administering the chemotherapy to said subject if the DNA sample is determined to have a low likelihood of being MMR-deficient, preferably wherein the chemotherapy is a fluorouracil-based therapy.
According to a further aspect, there is provided a method of providing a tool for characterising a DNA sample obtained from a tumour, the method including the steps of: obtaining mutational signature profiles for a plurality of training samples associated with known MMR-deficiency status; determining the value of one or more mutational signature metrics for the training samples, wherein the mutational signature metrics are selected from: exposure of one or more mutational signatures of mismatch repair (MMR), similarity between the substitution profile of the sample and that of one or more MMR gene knockouts, the number of repeat mediated indels in the mutational profile of the sample, and the similarity between the repeat mediated deletion profile of the sample and that of one or more MMR gene knockouts; and training a machine learning model to predict, based on said values of said one or more mutational signature metrics, whether each training sample has a high or low likelihood of being mismatch repair (MMR)-deficient. The method of the present aspect may have any of the features described in relation to the first aspect.
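By way of illustration only, the training step may be sketched as follows. The present aspect does not prescribe a particular model in this passage; a logistic regression fitted by gradient descent is assumed here purely as an example, and the function names, the feature standardisation, the learning rate and the iteration count are likewise assumptions. Each training sample is represented by a vector of mutational signature metric values and a label (1 for MMR-deficient, 0 for MMR-proficient).

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_model(X, y, lr=0.1, n_iter=2000):
    """Fit a logistic regression on standardised metric values.

    X: list of metric vectors (one per training sample); y: list of 0/1 labels.
    Returns a dict holding the scaler parameters, the weights and the bias.
    """
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by 0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(zip(*X), means)]
    Z = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for z, label in zip(Z, y):
            err = _sigmoid(sum(wj * zj for wj, zj in zip(w, z)) + b) - label
            grad_w = [g + err * zj for g, zj in zip(grad_w, z)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return {"means": means, "stds": stds, "w": w, "b": b}


def predict_probability(model, x):
    """Probabilistic score that a sample with metric vector x is MMR-deficient."""
    z = [(v - m) / s for v, m, s in zip(x, model["means"], model["stds"])]
    return _sigmoid(sum(wj * zj for wj, zj in zip(model["w"], z)) + model["b"])
```

A sample might then be characterised as having a high likelihood of being MMR-deficient when `predict_probability` exceeds a chosen threshold; the threshold is a design choice that is not fixed by this sketch.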
According to a further aspect, there is provided a system comprising: a processor; and a computer readable medium comprising instructions that, when executed by the processor, cause the processor to perform the (computer-implemented) steps of the method of any preceding aspect. According to a further aspect, there is provided a non-transitory computer readable medium or media comprising instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any embodiment of any aspect described herein. According to a further aspect, there is provided a computer program comprising code which, when the code is executed on a computer, causes the computer to perform the method of any embodiment of any aspect described herein.
In describing the present invention, the following terms will be employed, and are intended to be defined as indicated below.
“and/or” where used herein is to be taken as specific disclosure of each of the two specified features or components with or without the other. For example “A and/or B” is to be taken as specific disclosure of each of (i) A, (ii) B and (iii) A and B, just as if each is set out individually herein.
A “sample” as used herein may be a cell or tissue sample (e.g. a biopsy), a biological fluid, an extract (e.g. a protein or DNA extract obtained from the subject), from which genomic material can be obtained for genomic analysis, such as genomic sequencing (whole genome sequencing, whole exome sequencing, targeted (also referred to as “panel”) sequencing). In particular, the sample may be a blood sample, or a tumour sample. The sample may be one which has been freshly obtained from a subject or may be one which has been processed and/or stored prior to making a determination (e.g. frozen, fixed or subjected to one or more purification, enrichment or extraction steps). In particular, the sample may be a cell or tissue culture sample. As such, a sample as described herein may refer to any type of sample comprising cells or genomic material derived therefrom, whether from a biological sample obtained from a subject, or from a sample obtained from e.g. a cell line. The sample is preferably from a mammal (such as e.g. a mammalian cell sample or a sample from a mammalian subject, including in particular a model animal such as mouse, rat, etc.), preferably from a human (such as e.g. a human cell sample or a sample from a human subject). Further, the sample may be transported and/or stored, and collection may take place at a location remote from the genomic sequence data acquisition (e.g. sequencing) location, and/or the computer-implemented method steps may take place at a location remote from the sample collection location and/or remote from the genomic data acquisition (e.g. sequencing) location (e.g. the computer-implemented method steps may be performed by means of a networked computer, such as by means of a “cloud” provider).
A “tumour sample” refers to a sample that contains tumour cells or genetic material derived therefrom. The tumour sample may be a cell or tissue sample (e.g. a biopsy) obtained directly from a tumour. A tumour sample may be a sample that comprises tumour cells or genetic material derived therefrom, that has not been obtained directly from a tumour. For example, a tumour sample may be a sample comprising circulating tumour cells or circulating tumour DNA. Thus, a tumour sample may also be a biological fluid (e.g. a liquid biopsy such as a blood, urine, or cerebrospinal fluid biopsy). A sample comprising a mixture of tumour cells and other cells (or genetic material derived therefrom) may be subject to one or more processing steps, whether prior to or subsequent to the acquisition of sequence data, in order to identify sequence data that is representative of the genetic material from the tumour. For example, a sample comprising cells may be subject to one or more cell purification steps which selectively enrich the sample for tumour cells. Similarly, a sample comprising modified and non-modified cells can be subject to one or more purification or selection steps to enrich the sample for modified cells. Protocols for doing this are known in the art. As another example, a sample of genetic material may be subject to one or more capture and/or size selection steps to selectively enrich the sample for tumour-derived genetic material. Protocols for doing this are known in the art. As another example, sequence data may be subject to one or more filtering steps (e.g. based on fragment length) to enrich the data for information that relates to tumour-derived genetic material. Protocols for doing this are known in the art.
A “normal sample” (also referred to as “germline sample” or “parent sample”) refers to a sample that contains non-tumour or non-modified cells or genetic material derived therefrom. A normal sample may be matched to a particular tumour or modified sample in the sense that it is obtained from the same biological source (subject or cell line) as the tumour or modified sample. A normal sample may be a cell or tissue sample obtained from a subject, or a sample of biological fluid. A sample comprising a mixture of normal cells and other cells (or genetic material derived therefrom) may be subject to one or more processing steps, whether prior to or subsequent to the acquisition of sequence data, in order to identify sequence data that is representative of the genetic material from the normal cells (as already described above). For example, a sample comprising modified and non-modified cells can be subject to one or more purification or selection steps to enrich the sample for non-modified cells. Similarly, a sample comprising normal and tumour-derived cells can be subject to one or more purification steps which selectively enrich the sample for normal cells.
The term “sequence data” refers to information that is indicative of the presence and/or amount of genomic material in a sample that has a particular sequence. Such information may be obtained using sequencing technologies, such as e.g. next generation sequencing (NGS, such as e.g. whole exome sequencing (WES), whole genome sequencing (WGS), or sequencing of captured genomic loci (targeted or panel sequencing)), or using array technologies, such as e.g. SNP arrays, or other molecular counting assays. When NGS technologies are used, the sequence data may comprise a count of the number of sequencing reads that have a particular sequence. When non-digital technologies are used such as array technology, the sequence data may comprise a signal (e.g. an intensity value) that is indicative of the number of sequences in the sample that have a particular sequence, for example by comparison to an appropriate control. Sequence data may be mapped to a reference sequence, for example a reference genome, using methods known in the art (such as e.g. Bowtie (Langmead et al., 2009)). Thus, counts of sequencing reads or equivalent non-digital signals may be associated with a particular genomic location. Further, a genomic location may contain a mutation, in which case counts of sequencing reads or equivalent non-digital signals may be associated with each of the possible variants (also referred to as “alleles”) at the particular genomic location. The process of identifying the presence of a mutation at a particular location in a sample is referred to as “variant calling”, and can be performed using methods known in the art (such as e.g. the GATK HaplotypeCaller, https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller). 
For example, sequence data may comprise a count of the number of reads (or an equivalent non-digital signal) which match a germline (also sometimes referred to as “reference”) allele at a particular genomic location, and a count of the number of reads (or an equivalent non-digital signal) which match a mutated (also sometimes referred to as “alternate”) allele at the genomic location.
The term “mutation” refers to a difference in a nucleotide sequence (e.g. DNA or RNA) in a sample compared to a reference. For example, a mutation may be a single nucleotide variant (SNV), multiple nucleotide variants, a deletion mutation, an insertion mutation, a translocation, a missense mutation, a fusion, etc. Mutations may be identified using sequence data. An “indel mutation” (or simply “indel”) refers to an insertion and/or deletion of bases in a nucleotide sequence (e.g. DNA or RNA) of an organism.
Within the context of the present invention, a mutation is typically a somatic mutation, unless the context indicates otherwise. A “somatic mutation” is a mutation that is present in a tumour or modified cell (or genetic material derived therefrom), but not in a corresponding (matched) normal or non-modified cell.
The present invention relates broadly to the identification of MMR deficiencies. A cell (or by extension, a tissue, tumour or subject comprising such a cell) may be referred to as “MMR-deficient” if it has one or more alterations that impair the function of the mismatch repair pathway. The alteration may be genetic (e.g. a mutation of any kind in one or more genes of the MMR pathway) or epigenetic (e.g. direct or indirect epigenetic silencing of one or more genes of the MMR pathway) or post-translational through complex interactions between multiple proteins. The alteration may directly affect a gene in the MMR pathway, or may indirectly affect a gene in the MMR pathway (for example by directly affecting a gene that is not in the MMR pathway but which, if impaired, affects the function of the MMR pathway, by physical or functional interaction). For example, alteration of the function of a gene in a DNA repair pathway different from the MMR pathway may alter the function of the MMR pathway as a knock-on effect.
A composition as described herein may be a pharmaceutical composition which additionally comprises a pharmaceutically acceptable carrier, diluent or excipient. The pharmaceutical composition may optionally comprise one or more further pharmaceutically active polypeptides and/or compounds. Such a formulation may, for example, be in a form suitable for intravenous infusion.
As used herein “treatment” refers to reducing, alleviating or eliminating one or more symptoms of the disease which is being treated, relative to the symptoms prior to treatment.
The systems and methods described herein may be implemented in a computer system, in addition to the structural components and user interactions described. As used herein, the term “computer system” includes the hardware, software and data storage devices for embodying a system or carrying out a method according to the above described embodiments. For example, a computer system may comprise a central processing unit (CPU), input means, output means and data storage, which may be embodied as one or more connected computing devices. Preferably the computer system has a display or comprises a computing device that has a display to provide a visual output display. The data storage may comprise RAM, disk drives or other computer readable media. The computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network. It is explicitly envisaged that the computer system may consist of or comprise a cloud computer.
The methods described herein may be provided as computer programs or as computer program products or computer readable media carrying a computer program which is arranged, when run on a computer, to perform the method(s) described herein. As used herein, the term “computer readable media” includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system. The media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media.
Prediction of DNA from a Tumour Sample as MMR Deficient or Proficient
In embodiments of the present invention, a prediction of whether a DNA sample from a tumour of a patient is MMR proficient or deficient is performed. In these embodiments, this prediction is performed by a computer-implemented method or tool that takes as its inputs sequence data from the sample or the values of one or more mutational signature metrics derived therefrom, and produces as output a probabilistic score indicative of whether the sample is MMR proficient or deficient, or information derived therefrom such as a classification of the sample as likely MMR deficient/unlikely MMR deficient.
In a development of this embodiment, the computer-implemented method or tool may take as its inputs a list of somatic mutations generated from sequence data associated with a tumour sample (such as e.g. sequencing data obtained from genomic material from fresh-frozen derived DNA, circulating tumour DNA or formalin-fixed paraffin-embedded (FFPE) DNA representative of a suspected or known tumour from a patient). These somatic mutations can then be analysed to determine the value(s) of the one or more mutational signature metrics.
In a development of this embodiment, the computer-implemented method or tool may take as its inputs sequence data associated with a tumour sample, and may use this data to generate a list of somatic mutations. These somatic mutations can then be analysed to determine the value(s) of the one or more mutational signature metrics. A list of somatic mutations may be obtained by identifying mutations present in sequence data associated with a tumour sample, and removing or otherwise excluding mutations that are present or assumed to be present in a corresponding germline genome. Mutations that are present in a corresponding germline genome may be identified by identifying the mutations present in a germline sample obtained from the same subject (also referred to as a “matched germline” or “matched normal” sample). Thus, the computer-implemented method or tool may further take as input sequence data associated with a matched germline sample. Mutations that are assumed to be present in a corresponding germline genome may be identified by identifying mutations that are present in a reference genome or set of reference genomes. A reference genome or set of reference genomes may be obtained from one or more reference samples that are not (or not all) matched normal samples. For example, the reference sample(s) may be process matched, or may comprise a plurality of normal (i.e. non-tumour/non-modified) samples not all of which are matched to the sample for which a somatic mutational profile is determined (e.g. pooled normal samples may be used as references for a plurality of tumour samples). A reference genome or set of reference genomes may be obtained from one or more databases.
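By way of illustration only, the exclusion of germline mutations described above can be sketched as a set difference between tumour calls and germline (or reference panel) calls. The `Mutation` record, the function name and the example calls are hypothetical and are not part of any specific variant calling tool; real pipelines would typically also normalise variant representation before comparison.

```python
from typing import Iterable, List, NamedTuple, Set


class Mutation(NamedTuple):
    """Hypothetical minimal variant record (assumed already normalised)."""
    chrom: str
    pos: int   # 1-based position of the first affected reference base
    ref: str
    alt: str


def somatic_mutations(tumour: Iterable[Mutation],
                      germline: Iterable[Mutation]) -> List[Mutation]:
    """Keep tumour calls that are absent from the germline/reference calls."""
    germline_set: Set[Mutation] = set(germline)
    seen: Set[Mutation] = set()
    result: List[Mutation] = []
    for m in tumour:
        # Skip calls shared with the germline set and skip duplicate calls.
        if m in germline_set or m in seen:
            continue
        seen.add(m)
        result.append(m)
    return result


# Illustrative (made-up) calls: one tumour call is also found in the matched normal.
tumour_calls = [
    Mutation("chr1", 100, "C", "T"),
    Mutation("chr2", 250, "AT", "A"),
    Mutation("chr3", 42, "G", "A"),
]
matched_normal_calls = [Mutation("chr3", 42, "G", "A")]
somatic = somatic_mutations(tumour_calls, matched_normal_calls)
```

The same function may be used with calls pooled from unmatched normal samples in place of `matched_normal_calls`.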
A list of somatic mutations may comprise mutations of one or more types selected from: substitutions, deletions, and insertions. A list of somatic substitutions associated with a sample or a group of samples may be referred to as a “substitution profile”. A list of somatic deletions associated with a sample or a group of samples may be referred to as a “deletion profile”. A list of somatic insertions associated with a sample or a group of samples may be referred to as an “insertion profile”. A list comprising both somatic insertions and deletions associated with a sample or group of samples may be referred to as an “indel profile”. An insertion or deletion may be referred to as “repeat mediated” if it occurs in a repetitive region. A repetitive region may be defined as a region that includes a plurality (e.g. 2 or more) of repeats of a sequence motif. A sequence motif may be defined as a sequence of between 1 and n bases, where n may be selected as 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12. For example n=9 may be convenient. The use of higher values of n requires more extensive cataloguing of such regions, which may be associated with diminishing returns as repeats of longer motifs are less likely. A repetitive region may be defined by reference to a reference genome. In other words, a repetitive region may be defined as a particular locus (defined by its genomic coordinates) in a reference genome. Thus, any mutation identified within such a locus may be considered to be “repeat mediated”.
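As a minimal sketch of the definitions above, a mutation may be typed from the lengths of its reference and alternative alleles, and an indel may be flagged as repeat mediated when its position falls within a catalogued repetitive locus. The record type, the region catalogue and the coordinates below are hypothetical, and the length-based typing is an assumption made for illustration.

```python
from typing import Iterable, NamedTuple, Tuple


class Indel(NamedTuple):
    """Hypothetical minimal indel record."""
    chrom: str
    pos: int   # 1-based position in the reference genome
    ref: str
    alt: str


# Hypothetical catalogue entry: (chromosome, start, end), 1-based inclusive.
RepeatRegion = Tuple[str, int, int]


def mutation_type(ref: str, alt: str) -> str:
    """Classify a mutation by comparing allele lengths (an assumed convention)."""
    if len(ref) == len(alt):
        return "substitution"
    return "deletion" if len(ref) > len(alt) else "insertion"


def is_repeat_mediated(indel: Indel, regions: Iterable[RepeatRegion]) -> bool:
    """True if the indel lies within any catalogued repetitive locus."""
    return any(chrom == indel.chrom and start <= indel.pos <= end
               for chrom, start, end in regions)


# Made-up catalogue and indel profile.
regions = [("chr1", 1000, 1012)]
indels = [
    Indel("chr1", 1005, "TT", "T"),   # deletion inside the repetitive locus
    Indel("chr1", 2000, "A", "AG"),   # insertion outside any catalogued locus
    Indel("chr2", 1005, "TT", "T"),   # same position, different chromosome
]
# Number of repeat mediated indels in the profile.
n_rep_indel = sum(is_repeat_mediated(i, regions) for i in indels)
```

A linear scan over the catalogue is used for brevity; an interval index would be a natural choice for genome-scale catalogues.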
In some embodiments, the present invention provides methods for classifying samples from tumours between classes that are associated with different likelihoods of MMR deficiency. In particular, mutational signature metrics may be evaluated using one or more pattern recognition algorithms. Such analysis methods may be used to form a predictive model, which can be used to classify test data. For example, one convenient and particularly effective method of classification employs multivariate statistical analysis modelling, first to form a model (a “predictive mathematical model”) using data (“modelling data”) from samples of known subgroup (e.g., from subjects known to have a MMR deficient or MMR proficient tumour), and second to classify an unknown sample (e.g., “test sample”) according to subgroup. Pattern recognition methods have been used widely to characterize many different types of problems ranging, for example, over linguistics, fingerprinting, chemistry and psychology. In the context of the methods described herein, pattern recognition is the use of multivariate statistics, both parametric and non-parametric, to analyse data, and hence to classify samples and to predict the value of some dependent variable based on a range of observed measurements. In the context of the present invention, “supervised” approaches are suitably used, whereby a training set of samples with known class or outcome is used to produce a mathematical model which is then evaluated with independent validation data sets. Here, a “training set” of mutational signature metric data is used to construct a statistical model that correctly predicts the “subgroup” of each sample. The model is then tested with independent data (referred to as a test or validation set) to determine the robustness of the computer-based model.
These models may be based on a range of different mathematical procedures such as logistic regression models, support vector machine, decision trees, k-nearest neighbour and naïve Bayes classifiers. The robustness of the predictive models can for example be checked using cross-validation, by leaving out selected samples from the analysis.
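The cross-validation check mentioned above can be sketched as follows, using a k-nearest neighbour classifier (one of the model families listed) and leave-one-out cross-validation. The training values are made up for illustration, the two features stand for hypothetical normalised mutational signature metrics, and the choice of k=3 is an assumption.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

# (metric values, known class label) for one training sample.
Sample = Tuple[Sequence[float], str]


def knn_predict(train: List[Sample], x: Sequence[float], k: int = 3) -> str:
    """Predict a class by majority vote among the k nearest training samples."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples: List[Sample], k: int = 3) -> float:
    """Leave each sample out in turn, predict it from the rest, report accuracy."""
    correct = 0
    for i, (x, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        correct += knn_predict(train, x, k) == label
    return correct / len(samples)


# Made-up, well-separated training data with known MMR status.
samples: List[Sample] = [
    ((0.90, 0.80), "MMR deficient"),
    ((0.85, 0.90), "MMR deficient"),
    ((0.95, 0.70), "MMR deficient"),
    ((0.80, 0.85), "MMR deficient"),
    ((0.10, 0.20), "MMR proficient"),
    ((0.20, 0.10), "MMR proficient"),
    ((0.15, 0.30), "MMR proficient"),
    ((0.05, 0.25), "MMR proficient"),
]
accuracy = leave_one_out_accuracy(samples)
```

On real data the accuracy would be expected to fall below that of such a toy set, and the same loop may be applied to any of the other model families by replacing `knn_predict`.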
The one or more mutational signature metrics may be selected from: the exposure to one or more MMR mutational signatures (EMMRD), the similarity between the substitution profile of the sample and that of one or more MMR gene knockout(s) (Ssub), the number of repeat mediated indels (Nrep.indel), and the similarity between the repeat-mediated deletion profile of the sample and that of one or more MMR gene knockout(s) (Srep.del).
Methods for determining the exposure to a mutational signature are known in the art (see e.g. Alexandrov et al., 2020; Degasperi et al., 2020; Fantini et al., 2020; Gehring et al., 2015). In particular, the determination of the exposure to one or more mutational signatures may be performed by identifying the matrix E that satisfies C≈PE, where C is a mutational catalogue for one or more samples for which exposure is to be determined, P is a signature matrix comprising the one or more mutational signatures for which exposure is to be determined, and E is an exposure matrix. The determination of the exposure to one or more mutational signatures may be performed as described in Degasperi et al., 2020.
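As a hedged illustration of solving C≈PE for a single sample, the sketch below estimates non-negative exposures with multiplicative updates that reduce the squared reconstruction error while P is held fixed. This is a generic non-negative least-squares technique chosen for brevity and is not asserted to be the procedure of Degasperi et al., 2020; the two four-channel signatures and the catalogue are made up.

```python
from typing import List, Sequence


def fit_exposures(catalogue: Sequence[float],
                  signatures: Sequence[Sequence[float]],
                  n_iter: int = 500) -> List[float]:
    """Estimate non-negative exposures e so that the catalogue is close to P e.

    signatures[k][i] is the probability of mutation channel i under
    signature k, i.e. signatures[k] is column k of the signature matrix P.
    """
    n_sig = len(signatures)
    n_chan = len(catalogue)
    eps = 1e-12  # guards against division by zero
    # Start from an even split of the total mutation count.
    e = [sum(catalogue) / n_sig] * n_sig
    for _ in range(n_iter):
        # Current reconstruction P e, one value per mutation channel.
        recon = [sum(signatures[k][i] * e[k] for k in range(n_sig))
                 for i in range(n_chan)]
        # Multiplicative update: e_k <- e_k * (P^T c)_k / (P^T P e)_k.
        e = [
            e[k]
            * sum(signatures[k][i] * catalogue[i] for i in range(n_chan))
            / (sum(signatures[k][i] * recon[i] for i in range(n_chan)) + eps)
            for k in range(n_sig)
        ]
    return e


# Made-up signatures (each sums to 1) and a catalogue built as 100*A + 50*B.
sig_a = [0.7, 0.1, 0.1, 0.1]
sig_b = [0.1, 0.1, 0.1, 0.7]
catalogue = [75.0, 15.0, 15.0, 45.0]
exposures = fit_exposures(catalogue, [sig_a, sig_b])
```

Because the updates only multiply by non-negative ratios, the exposures remain non-negative throughout, which matches the interpretation of E as mutation counts attributed to each signature.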
The one or more MMR mutational signatures may be selected from MMR1, MMR2, or any corresponding tissue specific signatures as described in Degasperi et al., 2020 (and available at https://signal.mutationalsignatures.com/explore/study/1), SBS6, SBS14, SBS15, SBS20, SBS21, SBS26, or ID7 as described in Alexandrov et al., 2020 (and available at https://cancer.sanger.ac.uk/cosmic/signatures/). In general, any mutational signature that has been mechanistically or phenotypically associated with MMR deficiency may be used as an MMR mutational signature. A mutational signature may have been mechanistically associated with MMR if it has been identified in cells that are known to have one or more impairment (e.g. one or more natural or engineered molecular impairment) that lead to MMR deficiency, or if it is more similar than expected by chance to a signature that has been derived from cells that are known to have one or more impairments that lead to MMR deficiency (e.g. a signature that is more similar than expected by chance to a mutational signature derived from a MMR knockout sample). For example, a mutational signature that is enriched (e.g. associated with comparatively strong exposure values) in cells that are known to be MMR deficient (e.g. cancer cells that are known to be MMR deficient) may be a suitable MMR mutational signature. A mutational signature may have been phenotypically associated with MMR deficiency if it is enriched in mutation types that are known hallmarks of MMR deficiency (e.g. small (e.g. 1 bp) insertions and deletions of T at mononucleotide T repeats, C>T substitutions, T>C substitutions) and/or if it is frequently identified in cells that have a phenotype indicative of MMR deficiency, such as e.g. cells that are microsatellite unstable. 
For example, mutational signatures that are often found (more often than expected by chance and/or more often than other signatures) in samples that are microsatellite unstable may be phenotypically associated with MMR deficiency and may be used as MMR mutational signatures.
The determination of the similarity between two mutation profiles may be performed by calculating the cosine similarity between the two mutation profiles. The cosine similarity between two mutation profiles can be calculated as:
$$\mathrm{sim}(S, M) = \frac{\sum_{i=1}^{n} S_i M_i}{\sqrt{\sum_{i=1}^{n} S_i^{2}}\,\sqrt{\sum_{i=1}^{n} M_i^{2}}}$$
where S and M are equally-sized vectors with nonnegative components being the respective mutation profiles (e.g. S being that of a sample and M that of a reference knockout profile), and n is the number of components of each vector.
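The cosine similarity between two mutation profiles, i.e. the dot product of the two vectors divided by the product of their Euclidean norms, can be computed as in the following minimal sketch. The function name is hypothetical, and returning 0.0 for an all-zero profile is a convention assumed here rather than one stated in the description.

```python
import math
from typing import Sequence


def cosine_similarity(s: Sequence[float], m: Sequence[float]) -> float:
    """Cosine similarity between two equally-sized, non-negative profiles."""
    if len(s) != len(m):
        raise ValueError("profiles must have the same number of components")
    dot = sum(a * b for a, b in zip(s, m))
    norm_s = math.sqrt(sum(a * a for a in s))
    norm_m = math.sqrt(sum(b * b for b in m))
    if norm_s == 0 or norm_m == 0:
        # Assumed convention: an empty profile is similar to nothing.
        return 0.0
    return dot / (norm_s * norm_m)


# Made-up three-channel profiles: a sample and a reference knockout profile.
sample_profile = [1.0, 2.0, 3.0]
knockout_profile = [2.0, 4.0, 6.0]
similarity = cosine_similarity(sample_profile, knockout_profile)
```

Since the components are non-negative, the value lies between 0 and 1 and is unaffected by the total number of mutations in either profile, so profiles of samples with different mutation burdens can be compared directly.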
The method may further comprise receiving (for example from a user, through a user interface, or from one or more databases) one or more of: one or more mutational signature(s) of MMR, and a mutation profile (e.g. substitution profile and/or repeat mediated deletion profile) of one or more MMR gene knockouts or gene knockout samples.
The mutational profile of an MMR gene knockout is a mutational profile derived from one or more MMR gene knockout samples. The term “MMR gene knockout sample” refers to any sample of cells or genetic material derived therefrom, in which the function of one or more genes of the MMR pathway is impaired. Any manipulation that impairs the function of at least one MMR gene may therefore result in an MMR gene knockout cell. Such a manipulation may directly affect a gene in the MMR pathway, or may affect a gene in another pathway, indirectly affecting the function of the MMR pathway. In embodiments, an MMR gene knockout sample has one or more alterations that directly affect the function of a gene in the MMR pathway. Such an alteration may be genetic or epigenetic. In embodiments, an MMR gene knockout has one or more alterations that indirectly affect the function of a gene in the MMR pathway. For example, the function of a gene in the MMR pathway may be affected post-translationally through complex interactions with multiple proteins, at least one of these interactions having been impaired by directly impairing the gene coding for a protein involved in the interaction. For example, an MMR gene knockout cell (or cell line) may be a cell in which one or more genes of the MMR pathway has been silenced, mutated, downregulated or knocked out. Techniques for performing such manipulations are known in the art. In embodiments, an MMR gene knockout sample is a sample of cells or genetic material derived therefrom, in which one or more genes in the MMR pathway has been knocked out, for example using CRISPR-Cas9.
An MMR gene may be selected from MSH2 (Homo sapiens Gene ID: 4436, or a homologue thereof), MSH6 (Homo sapiens Gene ID:2956, or a homologue thereof), MSH3 (Homo sapiens Gene ID: 4437, or a homologue thereof), MLH1 (Homo sapiens Gene ID:4292, or a homologue thereof), PMS1 (Homo sapiens Gene ID:5378, or a homologue thereof) or PMS2 (Homo sapiens Gene ID:5395, or a homologue thereof). In embodiments, the one or more MMR genes are selected from MSH2, MSH6, MLH1, PMS2, and PMS1. In embodiments, an MMR gene knockout sample is a sample of cells or genetic material derived therefrom, in which the function of a single gene in the MMR pathway is impaired. A gene knockout sample may be a sample of mammalian cells, suitably human cells, or genetic material derived therefrom.
At step 16, it is determined whether the sample has a high or low likelihood of being MMR deficient, based on the value of the one or more signature metrics received or determined at step 14. This may optionally be performed by classifying the sample between at least two classes, a first class associated with a high likelihood of being MMR deficient, and a second class associated with a low likelihood of being MMR deficient. Such a classification may be performed by generating a probabilistic score at step 16A using the value(s) of the one or more mutational signature metrics or values derived therefrom (such as e.g. by normalisation), and comparing the score thus obtained at step 16B to one or more predetermined thresholds that define the boundary(ies) of the first and second classes. At step 18, one or more results of this analysis may optionally be provided to a user through a user interface.
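Steps 16A and 16B can be illustrated with a logistic model that converts normalised metric values into a probabilistic score, followed by comparison with a threshold. The intercept, weights, metric names and threshold below are hypothetical placeholders rather than trained values, and the score is assumed here to represent the probability of MMR deficiency.

```python
import math
from typing import Dict

# Hypothetical coefficients for illustration only (not trained values).
INTERCEPT = -4.0
WEIGHTS: Dict[str, float] = {
    "E_MMRD": 2.0,        # exposure to MMR mutational signatures
    "S_sub": 3.0,         # similarity of substitution profile to knockouts
    "N_rep_indel": 2.5,   # number of repeat mediated indels
    "S_rep_del": 1.5,     # similarity of repeat mediated deletion profile
}


def mmrd_score(metrics: Dict[str, float]) -> float:
    """Step 16A: logistic score in (0, 1) from metrics normalised to [0, 1]."""
    z = INTERCEPT + sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def classify(score: float, threshold: float = 0.5) -> str:
    """Step 16B: compare the score with a predetermined threshold."""
    return "high likelihood" if score >= threshold else "low likelihood"


# Made-up normalised metric values for two samples.
sample_a = {"E_MMRD": 0.9, "S_sub": 0.9, "N_rep_indel": 0.9, "S_rep_del": 0.9}
sample_b = {"E_MMRD": 0.1, "S_sub": 0.1, "N_rep_indel": 0.1, "S_rep_del": 0.1}
class_a = classify(mmrd_score(sample_a))
class_b = classify(mmrd_score(sample_b))
```

More than one threshold may be used in the same way, for example to define an intermediate class between the high and low likelihood classes.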
Uses of Predictor Outcome
A prediction of whether a tumour is likely to be MMR deficient can be used in the treatment of cancer. Thus, the invention also provides a method of treating cancer in a subject, wherein the method comprises administering or recommending a subject for administration of a particular therapy, depending on whether a tumour of the subject is identified as likely to be MMR deficient.
In particular, MMR deficient cancers have been identified as having an increased likelihood of response to immunotherapy, and particularly checkpoint inhibitors (CPI) (see e.g. Zhao, Jiang & Li, 2019). CPI therapy includes for example treatment with an anti-CTLA-4 or anti-PD(L)1 drug. Thus, also described herein are methods of determining whether a subject that has been diagnosed as having a cancer is likely to benefit from treatment with an immunotherapy, preferably a CPI therapy, the method comprising determining the MMR status of a tumour from the subject using the methods described herein. The method may further comprise classifying the subject between a group that is likely to respond to CPI therapy, and a group that is not likely to respond to CPI therapy. For example, the method may comprise determining whether a sample from a tumour of the subject has a high or low likelihood of being MMR deficient (as explained above). A subject may then be classified in the group that is not likely to respond to CPI therapy if the sample is determined to have a low likelihood of being MMR deficient, and in a group that is likely to respond to CPI therapy otherwise. Alternatively, a subject may be classified in the group that is not likely to respond to CPI therapy if the likelihood of MMR deficiency (e.g. as captured in a probabilistic score as described above) is below a threshold, and in the group that is likely to respond to CPI therapy otherwise.
In some cases CPI therapy may comprise CTLA-4 blockade (cytotoxic T-lymphocyte associated protein 4, Gene ID:1493), PD-1 inhibition (PDCD1, programmed cell death 1, Gene ID:5133), PD-L1 inhibition (CD274, CD274 molecule, Gene ID: 29126), Lag-3 (Lymphocyte activating 3; Gene ID: 3902) inhibition, Tim-3 (T cell immunoglobulin and mucin domain 3; Gene ID: 84868) inhibition, TIGIT (T cell immunoreceptor with Ig and ITIM domains; Gene ID: 201633) inhibition and/or BTLA (B and T lymphocyte associated; Gene ID: 151888) inhibition. The CPI therapy may be an anti-PD1 or anti-PDL1 therapy (also referred to as anti-PD(L)1 inhibitor). The inhibitor may be a therapeutic antibody. For example, the CPI therapy may be a PD-1 inhibitor such as pembrolizumab, nivolumab, or tislelizumab. Pembrolizumab is a therapeutic antibody that has been approved by the FDA (U.S. Food and Drug Administration) for patients with unresectable or metastatic microsatellite instability-high (MSI-H) or mismatch repair deficient (dMMR) solid tumors that have progressed following prior treatment. This indication is independent of PD-L1 expression assessment, tissue type and tumor location. Nivolumab is a therapeutic antibody used to treat various cancers including melanoma, lung cancer, renal cell carcinoma, Hodgkin lymphoma, head and neck cancer, colon cancer, and liver cancer. Tislelizumab is a therapeutic antibody under investigation for the treatment of advanced solid tumours. The CPI therapy may be a PDL-1 (also referred to as “PD-L1”) inhibitor such as atezolizumab, avelumab, or durvalumab. Atezolizumab is a therapeutic antibody used to treat urothelial carcinoma, non-small cell lung cancer (NSCLC), triple-negative breast cancer (TNBC), small cell lung cancer (SCLC), and hepatocellular carcinoma (HCC). It was the first PD-L1 inhibitor approved by the FDA. Avelumab is a therapeutic antibody used for the treatment of Merkel cell carcinoma, urothelial carcinoma, and renal cell carcinoma.
Durvalumab is a therapeutic antibody that has been approved by the FDA for the treatment of certain types of bladder and lung cancer. As another example, the CPI therapy may be a CTLA-4 inhibitor, such as ipilimumab or tremelimumab. Ipilimumab is a therapeutic antibody approved by the FDA for the treatment of melanoma, and under investigation for the treatment of non-small cell lung cancer, small cell lung cancer, bladder cancer and metastatic hormone-refractory prostate cancer. Tremelimumab is a therapeutic antibody under investigation for the treatment of melanoma, mesothelioma and non-small cell lung cancer.
Further, MMR deficient cancers have been identified as having a decreased likelihood of response to fluorouracil based treatment (e.g. adjuvant 5-fluorouracil chemotherapy) and/or an increased likelihood of response to non-fluorouracil based treatments (Devaud & Gallinger, 2013; Jover et al., 2009). Thus, also described herein are methods of determining whether a subject that has been diagnosed as having a cancer is likely to benefit from treatment with chemotherapy, preferably a fluorouracil based therapy or a non-fluorouracil based therapy, the method comprising determining the MMR status of a tumour from the subject using the methods described herein. Such a method may further comprise classifying the subject between a group that is likely to respond to fluorouracil based therapy, and a group that is not likely to respond to fluorouracil-based therapy. For example, the method may comprise determining whether a sample from a tumour of the subject has a high or low likelihood of being MMR deficient (as explained above). A subject may then be classified in the group that is likely to respond to fluorouracil-based therapy if the tumour is determined to have a low likelihood of being MMR deficient, and in a group that is not likely to respond to fluorouracil-based therapy otherwise. Alternatively, a subject may be classified in the group that is not likely to respond to fluorouracil-based therapy if the likelihood of MMR deficiency (e.g. as captured in a probabilistic score as described above) is above a threshold, and in the group that is likely to respond to fluorouracil-based therapy otherwise.
Alternatively, such a method may comprise classifying the subject between a group that is likely to respond to non-fluorouracil-based therapy, and a group that is not likely to respond to non-fluorouracil-based therapy. For example, the method may comprise determining whether a sample from a tumour of the subject has a high or low likelihood of being MMR deficient (as explained above). A subject may then be classified in the group that is likely to respond to non-fluorouracil-based therapy if the tumour is determined to have a high likelihood of being MMR deficient, and in a group that is not likely to respond to non-fluorouracil-based therapy otherwise. Alternatively, a subject may be classified in the group that is not likely to respond to non-fluorouracil-based therapy if the likelihood of MMR deficiency (e.g. as captured in a probabilistic score as described above) is below a threshold, and in the group that is likely to respond to non-fluorouracil-based therapy otherwise.
Any treatment described herein may be used alone or in combination with another treatment. For example, any treatment with a drug may be used in combination with one or more chemotherapies, one or more courses of radiation therapy, and/or one or more surgical interventions. In particular, any treatment described herein may be used in combination with a treatment for which the subject has been identified as likely to be responsive. For example, a subject may be identified as likely to be deficient for homologous recombination (HR-deficient) using one or more methods known in the art. Such a subject may be treated or identified as likely to benefit from treatment with a PARP inhibitor or platinum-based drug. For example, a subject may be identified as likely to be HR-deficient using the methods described in WO 2018/115452 or WO 2017/191074, or likely to respond to a PARP inhibitor or a platinum-based drug using the methods described in WO 2017/191073. As a particular example, a method of treating a subject that has been diagnosed as having cancer may comprise: determining whether the subject is likely to benefit from treatment with an immunotherapy, preferably a CPI therapy, the method comprising determining the MMR status of a tumour from the subject using the methods described herein; and determining whether the subject is likely to benefit from treatment with a PARP inhibitor or platinum-based therapy, the method comprising determining the HR status of a tumour from the subject, for example using the methods described in WO 2018/115452 or WO 2017/191074. Such a method may further comprise treating the subject with an immunotherapy (e.g. a CPI therapy, such as a PD1/PDL1 inhibitor) if the subject has been identified as likely to be MMR deficient, and/or treating the subject with a PARP inhibitor or platinum-based therapy if the subject has been identified as likely to be HR deficient.
Additionally, the MMR status of a tumour has been shown to be associated with different prognosis in cancer (see e.g. Sinicrope, 2009). For example, MMR deficient tumours have been associated with improved prognosis compared to non-MMR deficient tumours, for example in terms of disease free survival and overall survival. Thus, also described herein are methods of providing a prognosis for a subject that has been diagnosed as having a cancer, the method comprising determining the MMR status of a tumour from the subject. The method may further comprise classifying the subject between a group that has good prognosis, and a group that has poor prognosis. For example, the method may comprise determining whether a sample from a tumour of the subject has a high or low likelihood of being MMR deficient (as explained above). A subject may then be classified in the group that has poor prognosis if the sample is determined to have a low likelihood of being MMR deficient, and in a group that has good prognosis otherwise. Alternatively, a subject may be classified in the group that has poor prognosis if the likelihood of MMR deficiency (e.g. as captured in a probabilistic score as described above) is below a threshold, and in the group that has good prognosis otherwise.
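The threshold-based groupings described above for CPI therapy, fluorouracil-based therapy, non-fluorouracil-based therapy and prognosis can be summarised in a short sketch. The function name, the dictionary keys and the reuse of a single threshold of 0.5 for every rule are assumptions made for illustration; in practice each rule may use its own predetermined threshold. The score is assumed to represent the likelihood of MMR deficiency.

```python
from typing import Dict


def stratify(score: float, threshold: float = 0.5) -> Dict[str, str]:
    """Map a probabilistic MMR deficiency score to the groups described above.

    `threshold` is a hypothetical cut-off that is reused for every rule.
    """
    return {
        # Not likely to respond to CPI therapy if the score is below the threshold.
        "cpi": "not likely to respond" if score < threshold
        else "likely to respond",
        # Not likely to respond to fluorouracil-based therapy if the score
        # is above the threshold.
        "fluorouracil": "not likely to respond" if score > threshold
        else "likely to respond",
        # Not likely to respond to non-fluorouracil-based therapy if the
        # score is below the threshold.
        "non_fluorouracil": "not likely to respond" if score < threshold
        else "likely to respond",
        # Poor prognosis if the score is below the threshold, good otherwise.
        "prognosis": "poor" if score < threshold else "good",
    }


# Made-up scores for a likely MMR deficient and a likely MMR proficient tumour.
groups_deficient = stratify(0.9)
groups_proficient = stratify(0.1)
```

The strictness of each comparison follows the wording of the corresponding rule, so a score exactly equal to the threshold falls in the "otherwise" group of each rule.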
Whether a prognosis is considered good or poor may vary between cancers and stage of disease. In general terms a good prognosis is one where the overall survival (OS), disease free survival (DFS) and/or progression-free survival (PFS) is longer than that of a comparative group or value, such as e.g. the average for that stage and cancer type. A prognosis may be considered poor if OS, DFS and/or PFS is lower than that of a comparative group or value, such as e.g. the average for that stage and type of cancer. Thus, in general terms, a “good prognosis” is one where survival (OS, DFS and/or PFS) and/or disease stage of an individual patient can be favourably compared to what is expected in a population of patients within a comparable disease setting. Similarly, a “poor prognosis” is one where survival (OS, DFS and/or PFS) of an individual patient is lower (or disease stage worse) than what is expected in a population of patients within a comparable disease setting.
The subject is preferably a human patient.
The cancer may be any cancer that may be MMR deficient. In particular, the methods described herein may be used to characterise any type of cancer that is known to have MMR deficient subpopulations or in which MMR deficiencies have been reported in at least some patients. The cancer may be ovarian cancer, breast cancer, endometrial cancer (uterus/womb cancer), kidney cancer (renal cell), lung cancer (small cell, non-small cell and mesothelioma), brain cancer (gliomas, astrocytomas, glioblastomas), melanoma, Merkel cell carcinoma, clear cell renal cell carcinoma (ccRCC), lymphoma, gastrointestinal cancer (e.g. colorectal cancer), small bowel cancers (duodenal and jejunal), leukemia, pancreatic cancer, hepatobiliary tumours, germ cell cancers, prostate cancer, head and neck cancers, bladder cancer, thyroid cancer and sarcomas. For example, the cancer may be colorectal cancer, breast cancer, endometrial cancer, prostate cancer, bladder cancer or thyroid cancer, all of which are known to have MMR deficient subpopulations. As another example, the cancer may be colorectal cancer, endometrial/uterus cancer, biliary cancer, bone/soft tissue cancer, breast cancer, central nervous system cancer, choroid melanoma, carcinoma of unknown primary (CUP), esophagus cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, lymphoid cancer, neuroendocrine tumour (NET), ovarian cancer, pancreatic cancer, prostate cancer, skin cancer, stomach cancer, or urinary tract cancer. All of these have been tested with the methods described herein. In embodiments, the cancer is colorectal cancer. The links between MMR deficiency and prognosis as well as therapy response in colorectal cancer have been extensively studied and as such there is strong evidence that treatment and prognosis in such cancers can be adjusted using information regarding the MMR status of such cancers.
Such information is more accurately obtained using the methods described herein, compared to the prior art. As such, the treatment strategy designed for a subject and/or the prognosis provided for a subject having colorectal cancer can be improved using the methods of the present invention.
Systems
The following is presented by way of example and is not to be construed as a limitation to the scope of the claims.
While there have been advancements in analytical aspects of deriving mutational signatures from human cancers (Haradhvala, N.J. et al., 2018; Alexandrov, L. B. et al., 2020; Kim, J. et al., 2016), there is an emerging need for experimental substantiation, elucidating etiologies and mechanisms underpinning these mutational patterns (Nik-Zainal, S. et al., 2015; Zou, X. et al., 2018; Christensen, S. et al., 2019; Kucab, J. E. et al., 2019). In these examples, the inventors combine CRISPR-Cas9-based biallelic knockouts of a selection of DNA replicative/repair genes in human induced Pluripotent Stem Cells (hiPSCs), whole-genome sequencing (WGS), and in-depth analysis of experimentally-generated data, to obtain mechanistic insights into mutation formation. This work focuses on directly mapping whole-genome mutational outcomes associated with human DNA repair defects, critically, in the absence of any applied, external damage. The insights derived from this are then used to develop a classifier, MMRDetect, for improved clinical detection of MMR-deficient tumors.
Methods
Cell lines and culture. The human iPSC line used in this study has been previously described (Kucab et al., 2019). The line was derived at the Wellcome Trust Sanger Institute (Hinxton, UK). The use of this cell line model was approved by the Proportionate Review Sub-committee of the National Research Ethics (NRES) Committee North West - Liverpool Central under the project “Exploring the biological processes underlying mutational signatures identified in induced pluripotent stem cell lines (iPSCs) that have been genetically modified or exposed to mutagens” (ref: 14.NW.0129). It is a long-standing iPSC line that is diploid and does not have any known driver mutations. It does carry a balanced translocation between chromosomes 6 and 8. It grows stably in culture and does not acquire a vast number of karyotypic abnormalities. This was confirmed through mutational and copy number assessment of the WGS data from all subclones.
Cell culture reagents were obtained from Stem Cell Technologies unless otherwise indicated. Cells were routinely maintained on Vitronectin XF-coated plates (10-15 μg/mL) in TeSR-E8 medium. The medium was changed daily, and cells were passaged every 4-8 days depending on the confluence of the plates using Gentle Cell Dissociation Reagent.
All cell lines were grown at 37° C., with 20% oxygen and 5% carbon dioxide in a humidified incubator, except for the pilot study in which the iPSC knockouts were also grown under hypoxic conditions (3% oxygen) as one of the experimental conditions (see “Pilot study” below). Cells were cultivated as monolayers in their respective growth medium and passaged every 3-4 days to maintain sub-confluence during the mutation accumulation step. All cell lines tested negative for mycoplasma contamination using MycoAlert™ Mycoplasma Detection Kit and LookOut® Mycoplasma PCR Detection Kit according to the manufacturers' protocols.
Generation of DNA repair gene knockouts in human iPSCs. Biallelic DNA repair gene knockouts in human iPSCs were performed by the High Throughput Gene Editing team of Cellular Operations at the Sanger Institute, Hinxton, UK. These knockouts were generated based on the principles of CRISPR/Cas9-mediated HDR and NHEJ as described in Bressan, R. B. et al., 2017.
Generation of donor plasmids for precise gene targeting via HDR. All knockouts were generated using an established protocol that was found to minimize potential off-target effects (Bressan, R. B. et al., 2017). Briefly, the intermediate targeting vectors were generated for each gene using Gibson assembly of the four fragments: pUC19 vector, 5′ homology arm, R1-pheS/zeo-R2 cassette and 3′ homology arm. Gene-specific homology arms were amplified by PCR from the iPSC gDNA and were either gel-purified or column-purified (QIAquick, QIAGEN). pUC19 vector and R1-pheS/zeo-R2 cassette were prepared as gel-purified blunt fragments (EcoRV digested). Fragments were assembled via Gibson assembly reactions (Gibson Assembly Master Mix, NEB, E2611) according to the manufacturer's instructions. Assembly reaction mix was transformed into NEB 5-alpha competent cells and clones resistant to carbenicillin (50 μg/mL) and zeocin (10 μg/mL) were analysed by Sanger sequencing to select for correctly-assembled constructs. Sequence-verified intermediate targeting vectors were converted into donor plasmids via a Gateway exchange reaction. LR Clonase II Plus enzyme mix (Invitrogen, 12538120) was used to perform a two-way reaction exchanging only the R1-pheS/zeo-R2 cassette with the pL1-EF1αPuro-L2 cassette as previously described. The latter was generated by cloning synthetic DNA fragments of the EF1α promoter and puromycin resistance cassette into a pL1/L2 vector (Tate, P. H. & Skarnes, W. C., 2011). Following Gateway reaction and selection on yeast extract glucose (YEG)+carbenicillin agar (50 μg/mL) plates, correct donor plasmids were verified by capillary sequencing across all junctions.
Guide RNA design & cloning. For every gene knockout, two separate gRNAs targeting within the same critical exon of a gene were also selected. The gRNAs were selected using the WGE CRISPR tool (Hodgkins, A. et al., 2015) based on their off-target scores. Selected gRNAs were suitably positioned to ensure DNA cleavage within the exonic region, excluding any sequence within the homology arms of the targeting vector. To generate individual gene targeting plasmids, gene-specific forward and reverse oligos were annealed and cloned into BsaI site of either U6_BsaI_gRNA (unpublished). The guide RNA (gRNA) sequences used are listed in Table 1.
Delivery of KO-targeting plasmids, donor templates and Cas9, selection and genotyping. Human iPSCs were dissociated to single cells and nucleofected with Cas9-coding plasmid (hCas9, Addgene 41815), sgRNA plasmid and donor plasmid on Amaxa 4D-Nucleofector program CA-137 (Lonza). Following nucleofection, cells were selected for up to 11 days with 0.25 μg/mL puromycin. Edited cells were expanded to ~70% confluency before subcloning. Approximately 1000 cells were subcloned onto 10 cm tissue culture dishes precoated with SyntheMAX substrate (Corning) at a concentration of 5 μg/cm² to allow colony formation for 8-10 days until colonies were approximately 1-2 mm in diameter. Individual colonies were picked into U-bottom 96-well plates using a dissection microscope and a p20 pipette, grown to confluence and then replica plated. Once confluent, the replica plates were either frozen as single cells in 96-well vials or the wells were lysed for genotyping.
To genotype individual clones from a 96-well replica plate, cells were lysed and used for PCR amplification with LongAmp Taq DNA Polymerase (NEB, M0323). Insertion of the cassette into the correct locus was confirmed by visualizing on 1% E-gel (Invitrogen, G700801) the PCR products generated by gene-specific (GF1 and GR1) and cassette-specific primers (ER: TGATATCGTGGTATCGTTATGCGCCT and PF: CATGTCTGGATCCGGGGGTACCGCGTCGAG) for both 5′ and 3′ ends. We also confirmed single integration of the cassette by performing a qPCR copy number assay. To check the CRISPR site on the non-targeted allele, PCR products were generated from across the locus, using the same 5′ and 3′ gene-specific genotyping primers. The PCR products were treated with exonuclease I and alkaline phosphatase (NEB, M0293; M0371) and Sanger sequenced to verify successful knockouts. Sequence reads and their traces were analysed and visualised on a laboratory information management system (LIMS)-2. For each targeted gene, two independently-derived clones with different specific mutations were isolated and studied further.
Genomic DNA extraction and WGS. Samples were quantified with Biotium Accuclear Ultra high sensitivity dsDNA Quantitative kit using Mosquito LV liquid platform, Bravo WS and BMG FLUOstar Omega plate reader and cherry picked to 500 ng/120 μl using Tecan liquid handling platform. Cherry picked plates were sheared to 450 bp using a Covaris LE220 instrument. Post-sheared samples were purified using Agencourt AMPure XP SPRI beads on Agilent Bravo WS. Libraries were constructed (ER, A-tailing and ligation) using ‘Agilent Sureselect kit’ on an Agilent Bravo WS automation system. KapaHiFi Hot start mix and IDT 96 iPCR tag barcodes were used for PCR set-up on Agilent Bravo WS automation system. PCR cycles include 6 standard cycles: 1) Incubate 95° C. 5 mins; 2) Incubate 98° C. 30 secs; 3) Incubate 65° C. 30 secs; 4) Incubate 72° C. 1 min; 5) Cycle from 2, 5 more times; 6) Incubate 72° C. 10 mins. Post PCR plate was purified using Agencourt AMPure XP SPRI beads on Beckman BioMek NX96 liquid handling platform. Libraries were quantified with Biotium Accuclear Ultra high sensitivity dsDNA Quantitative kit using Mosquito LV liquid handling platform, Bravo WS and BMG FLUOstar Omega plate reader, then pooled in equimolar amounts on a Beckman BioMek NX-8 liquid handling platform and finally normalized to 2.8 nM ready for cluster generation on a c-BOT and loading on the requested Illumina sequencing platform. Pooled samples were loaded on the X10 using 150 PE run length, sequenced to ~25× coverage. The details of sequence coverage for all clones and subclones are provided in Table 2.
Alignment and somatic variant-calling. Short reads were aligned to human reference genome GRCh37/hg19 assembly using the BWA-MEM algorithm (Li, H. 2013). Three algorithms, CaVEMan (http://cancerit.github.io/CaVEMan/) (Jones, D. et al., 2016), Pindel (http://cancerit.github.io/cgpPindel) (Raine, K. M. et al., 2015) and BRASS (https://github.com/cancerit/BRASS) were used to call somatic substitutions, indels and rearrangements in all subclones, respectively.
Assurance of knockout state using WGS data. First, we examined whether there were CRISPR-Cas9 off-target effects by seeking relevant mutations in other DNA repair genes besides the genes of interest. We also searched for potential off-target sites based on gRNA target sequences using COSMID (Cradick, T. J. et al., 2014) and confirmed that there were no off-target hits in knockouts that generated mutational signatures. We confirmed chromosome copy number in all subclones remained stable and unchanged from their parent. Second, we confirmed that there were frameshift indels near the gRNA targeted sequence in the genes of interest for all knockout subclones. One UNG knockout was found to be heterozygous and was excluded from the downstream analysis. Third, we checked for mislabeled samples by examining the shared mutations between subclones. Subclones originally derived from the same parental knockout clone would share some mutations, in contrast to subclones from different knockouts. Consequently, one ΔPRKDC, one ΔTP53 and two ΔNBN subclones were removed from downstream analysis. Fourth, the variant allele fraction (VAF) distribution for each knockout subclone was examined. VAF ≥ 0.4 was used as a cut-off for determining whether the subclone was derived from a single cell. When contrasting mutation burden between subclones, we only selected subclones that were derived from single cells and cultured for 15 days. Shared mutations among subclones were removed to obtain de novo somatic mutations accumulated after knocking out the gene of interest. Table 2 summarizes the number of de novo mutations (substitutions and indels) for all subclones.
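The removal of shared mutations described above can be illustrated with a short Python sketch. It is a minimal illustration, not the inventors' pipeline: the function name `de_novo_mutations` and the representation of a mutation as a (chromosome, position, reference, alternative) tuple are assumptions made for the example.

```python
from collections import Counter

def de_novo_mutations(subclones):
    """Keep, for each subclone, only mutations not seen in any other subclone.

    `subclones` maps a subclone name to a set of mutation keys, here
    (chromosome, position, ref, alt) tuples (a hypothetical representation).
    """
    # Count in how many subclones each mutation occurs.
    occurrences = Counter(m for muts in subclones.values() for m in muts)
    # A mutation present in two or more subclones is treated as shared
    # (inherited from the parental clone) and removed.
    return {
        name: {m for m in muts if occurrences[m] == 1}
        for name, muts in subclones.items()
    }

example = {
    "subclone_1": {("1", 100, "C", "T"), ("2", 5, "G", "A")},
    "subclone_2": {("1", 100, "C", "T"), ("3", 7, "T", "C")},
}
private = de_novo_mutations(example)
```

In this toy input the C>T change at position 100 of chromosome 1 occurs in both subclones, so it is discarded and each subclone retains one private mutation.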
Proteomics analysis. Cell pellets were dissolved in 150 μL buffer containing 1% sodium deoxycholate (SDC), 100 mM triethylammonium bicarbonate (TEAB), 10% isopropanol, 50 mM NaCl and Halt protease and phosphatase inhibitor cocktail (100×) (Thermo, #78442) using pulsed probe sonication followed by boiling at 90° C. for 5 min. Aliquots containing 50 μg of total protein, measured with the Coomassie Plus Bradford Protein Assay (Pierce), were reduced with 5 mM tris-2-carboxyethyl phosphine (TCEP) for 1 h at 60° C. and alkylated with 10 mM iodoacetamide (IAA) for 30 min in the dark. Proteins were then digested with 75 ng/μL trypsin (Pierce) overnight. The tryptic digests from the ATP2B4, EXO1, OGG1, PMS1, PMS2, RNF168 and UNG knock-out clones as well as three biological replicates of the parental cell line were labelled with the TMTpro 16plex reagents (Thermo) according to manufacturer's instructions. The digests from MLH1, MSH2, MSH6 clones were subjected to label-free single-shot analysis. The TMTpro labelled peptides were fractionated with offline high-pH Reversed-Phase (RP) chromatography (XBridge C18, 2.1×150 mm, 3.5 μm, Waters) on a Dionex Ultimate 3000 HPLC system with 1% gradient. Mobile phase A was 0.1% ammonium hydroxide and mobile phase B was acetonitrile, 0.1% ammonium hydroxide. LC-MS analysis was performed on the Dionex Ultimate 3000 system coupled with the Orbitrap Lumos Mass Spectrometer (Thermo Scientific). Selected TMTpro peptide fractions were loaded to the Acclaim PepMap 100, 100 μm×2 cm C18, 5 μm, 100 Å trapping column and were analyzed with the EASY-Spray C18 capillary column (75 μm×50 cm, 2 μm). Mobile phase A was 0.1% formic acid and mobile phase B was 80% acetonitrile, 0.1% formic acid. The TMTpro peptide fractions were analyzed with a 90 min gradient from 5%-38% B. MS spectra were acquired with mass resolution of 120 k and precursors were isolated for CID fragmentation with collision energy 35%.
MS3 quantification was obtained with HCD fragmentation of the top 5 most abundant CID fragments isolated with Synchronous Precursor Selection (SPS) and collision energy 55% at 50k resolution. For the label-free experiments, peptides were analyzed with a 240 min gradient and HCD fragmentation with collision energy 35% and ion trap detection. Database search was performed in Proteome Discoverer 2.4 (Thermo Scientific) using the SequestHT search engine with precursor mass tolerance 20 ppm and fragment ion mass tolerance 0.5 Da. TMTpro at N-terminus/K (for the labelled samples only) and Carbamidomethyl at C were defined as static modifications. Dynamic modifications included oxidation of M and Deamidation of N/Q. The Percolator node was used for peptide confidence estimation and peptides were filtered for q-value <0.01. All spectra were searched against reviewed UniProt human protein entries. Only unique peptides were used for quantification.
Pilot Study. Prior to generating the full set of knockouts described above, a pilot study was conducted to evaluate the effects of culture conditions and time on mutational signatures. Three genes were selected for knockout (Δ): MSH6, UNG and ATP2B4 (negative control). Two genotypes per gene were obtained and grown in culture to gauge reproducibility of signatures between different genotypes of a gene-knockout. These lines were cultured under normoxic (20%) and hypoxic (3%) states, for defined culture times of ~15, 30 or 45 days. Two single-cell subclones were derived for whole genome sequencing for each parental line (equivalent to four subclones per gene edit). One of the UNG genotypes appeared to be heterozygous and was excluded from downstream analysis. All classes of somatic mutations were called, subtracting variation of the primary hiPSC parental clone (see methods in Example 2), and the cosine similarities between mutational profiles of the subclones and the background signature were obtained. The results of this analysis are shown on
Results
We knocked out (Δ) 42 genes involved in DNA repair/replicative pathways and an unrelated control gene, ATP2B4 (
A total of 173 subclones were obtained from 78 genotyped knockouts of 43 genes (Table 2).
All subclones were sequenced to an average depth of ~25-fold. Short-read sequences were aligned to human reference genome assembly GRCh37/hg19. All classes of somatic mutations were called, subtracting variation of the primary hiPSC parental clone (see methods section in Example 2; Table 2, Table 3,
We confirmed that mutational outcomes were neither due to off-target edits nor to the acquisition of new driver mutations (see Methods). We verified that knockouts were biallelic, confirmed this further by protein mass spectrometry, and ensured that subclones were derived from single cells in all comparative analyses (see Methods).
In this example, the inventors investigated whether knocking out the genes as described in Example 1 would produce a mutational signature.
Methods
See Example 1.
Proliferation assay. Cells were seeded at 5,500 per well on 96-well plates. Measurements were taken at 24 h intervals post-seeding over a period of 5 days according to manufacturer's instructions. Briefly, plates were removed from the incubator and allowed to equilibrate at room temperature for 30 minutes, and an equal volume of CellTiter-Glo reagent (Promega) was added directly to the wells. Plates were incubated at room temperature for 2 minutes on a shaker and left to equilibrate for 10 minutes at 22° C. before luminescence was measured on a PHERAstar FS microplate reader. Luminescence readings were normalized to time point 0 (t0) and presented as relative luminescence units (RLU). Doubling time was calculated based on replicate-averaged readings on the linear portion of the proliferation curve (exponential phase) using formula:
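The formula itself does not survive in the text as reproduced here. For orientation only, a commonly used exponential-growth expression is Td = (t2 − t1) × ln 2 / ln(q2/q1), where q1 and q2 are the readings at times t1 and t2. Whether this is exactly the expression the inventors applied is an assumption. A minimal Python sketch under that assumption (the function name `doubling_time` is illustrative):

```python
import math

def doubling_time(t1_h, t2_h, rlu1, rlu2):
    """Doubling time in hours between two exponential-phase readings.

    Assumes pure exponential growth between t1_h and t2_h; this standard
    expression is an assumption, not a formula confirmed by the source.
    """
    if rlu1 <= 0 or rlu2 <= rlu1:
        raise ValueError("readings must be positive and increasing")
    return (t2_h - t1_h) * math.log(2) / math.log(rlu2 / rlu1)

# A four-fold increase in signal over 24 h corresponds to two doublings.
td = doubling_time(24, 48, 1.0, 4.0)
```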
Determination of gene knockout-associated mutational signatures. An intrinsic background mutagenesis exists in normal cells grown in culture. Knocking out a DNA repair gene that is involved in repairing endogenous DNA damage may result in increased unrepaired DNA damage and, thereby result in mutation accumulation with subsequent rounds of replication. Whole-genome sequencing of these knockouts can detect the mutations that occur as a result of being a specified knockout. If the mutation burden and the mutational profile of a knockout is significantly different from the control subclones which have only the background mutagenesis, it is most likely that there is gene knockout-associated mutagenesis. Based on this principle, our approach to identify gene knockout-associated mutational signature involved three steps: 1) we determined the background mutational signature; 2) we determined the difference between the mutational profile of knockout and background mutation profiles; 3) we removed the background mutation profile from mutation profile of the knockout subclone.
Substitution profiles were described according to the classical convention of 96 channels: the product of 6 types of substitution multiplied by 4 types of 5′ base (A,C,G,T) and 4 types of 3′ base (A,C,G,T). Indel profiles were described by type (insertion, deletion, complex), size (1-bp or longer) and flanking sequence (repeat-mediated, microhomology-mediated or other) of the indel. Here, we used two sets of indel channels. Set one contains 15 channels: 1 bp C/T insertion at short repetitive sequence (<5 bp), 1 bp C/T insertion at long repetitive sequence (>=5 bp), long insertions (>1 bp) at repetitive sequences, microhomology-mediated insertions, 1 bp C/T deletions at short repetitive sequence (<5 bp), 1 bp C/T deletions at long repetitive sequence (>=5 bp), long deletions (>1 bp) at repetitive sequences, microhomology-mediated deletions, other deletion and complex indels (see
Note that for all mutational profiles obtained throughout these examples (whether from gene knockouts or from samples), the somatic mutational profiles (excluding germline mutations) are used.
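The 96-channel substitution convention described above can be made concrete with a short Python sketch. It enumerates the channels and maps a substitution with its flanking bases onto the pyrimidine-oriented channel, reverse-complementing when the reference base is a purine. The label format `A[C>T]G` and the function names are illustrative choices, not taken from the source.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Six substitution types, expressed with a pyrimidine (C or T) reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# 6 substitution types x 4 possible 5' bases x 4 possible 3' bases = 96 channels.
CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product(BASES, BASES)
]

def channel(five, ref, alt, three):
    """Return the pyrimidine-oriented channel label for one substitution."""
    if ref in "AG":
        # Purine reference: describe the event on the opposite strand, which
        # swaps and complements the flanking bases.
        five, ref, alt, three = (
            COMPLEMENT[three], COMPLEMENT[ref], COMPLEMENT[alt], COMPLEMENT[five]
        )
    return f"{five}[{ref}>{alt}]{three}"

def substitution_profile(mutations):
    """Count mutations per channel; `mutations` holds (5', ref, alt, 3') tuples."""
    counts = Counter(channel(*m) for m in mutations)
    return [counts.get(c, 0) for c in CHANNELS]

# A G>T change in an A_C context is recorded as C>A in a G_T context.
label = channel("A", "G", "T", "C")
```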
Identifying background signatures. The mutational profiles of control subclones were used to determine background mutagenesis. Aggregated substitution profiles of all control subclones (ΔATP2B4) were used as the background substitution mutational signature. Aggregated indel profiles of all subclones containing <=8 indels were used as the background indel mutational signature.
Distinguishing mutational profiles of control and gene-edited subclone profiles. Signal-to-noise ratio affects mutational signature detection. In this study, ‘noise’ is largely background mutagenesis. The average mutation burdens caused by background mutagenesis in control cells are around 150 substitutions and 10 indels, with standard deviations of 10 and 1.4, respectively. ‘Signal’ represents the elevated mutation burden caused by gene knockouts. The average mutation burden in knockouts ranges from 63 to 2360 for substitutions, and from 0 to 2122 for indels, after 15 days in culture, as shown in Table 2.
The cost associated with whole genome sequencing is prohibitive; thus we have 2-4 subclones per knockout. The intrinsic fluctuation of detected mutation burden in each sample and the limited number of subclones impose greater uncertainty on mutational signature detection. Thus, to distinguish high-confidence mutational signatures from noise, we employed three different methods.
First, we evaluated the similarity of mutational profile between control and each gene knockout. According to the mutational profile of control subclones, $p_{\mathrm{control}} = [p_{\mathrm{control},1}, p_{\mathrm{control},2}, \ldots, p_{\mathrm{control},K}]^{T}$, for a given number of mutations $N$ ($0 < N < 10000$), one could generate $L$ bootstrapped samples:
where $\sum_{k=1}^{K} m_{l,k} = N$. One can calculate the cosine similarities ($s_l$) between bootstrapped control samples ($m_l$) and the experimentally-obtained control profile ($p_{\mathrm{control}}$) to obtain a distribution of cosine similarities $P(S)$:
We can then calculate the cosine similarity (Sknockout) between control profile (pcontrol) and knockout profile (pknockout). As shown in
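The bootstrap comparison described above can be sketched in Python as follows. This is a minimal illustration using only the standard library: each bootstrapped sample draws N mutations from the control profile, and its cosine similarity to that profile is recorded. The function names, and the decision rule of flagging a knockout whose similarity falls more than a chosen number of standard deviations below the bootstrap mean, are assumptions made for the example.

```python
import math
import random
import statistics
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def bootstrap_similarities(p_control, n_mut, n_boot, rng):
    """Cosine similarity of n_boot resampled control profiles to p_control.

    Each bootstrapped sample distributes n_mut mutations over the K channels
    with probabilities proportional to p_control.
    """
    k = len(p_control)
    sims = []
    for _ in range(n_boot):
        draws = Counter(rng.choices(range(k), weights=p_control, k=n_mut))
        sample = [draws.get(i, 0) for i in range(k)]
        sims.append(cosine(sample, p_control))
    return sims

def is_distinct(s_knockout, sims, n_sd=3.0):
    """Flag a knockout whose similarity lies below mean - n_sd * SD.

    The n_sd threshold is an illustrative assumption.
    """
    return s_knockout < statistics.mean(sims) - n_sd * statistics.stdev(sims)
```

Because the spread of bootstrapped similarities widens as N falls, the comparison should be made at a mutation count N matching that of the knockout being assessed.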
Second, we used contrastive principal component analysis (cPCA)(Abid, A. et al., 2018), which efficiently identified directions that were enriched in the knockouts relative to the background through eliminating confounding variations present in both (
Third, we used t-Distributed stochastic neighbor embedding (t-SNE)(van der Maaten, L. & Hinton, G. 2008), which is a visualization technique for viewing pairwise similarity data resulting from nonlinear dimensionality reduction based on probability distributions. In t-SNE implementation, mutational profiles that are similar to each other were plotted nearby each other, whereas profiles that are dissimilar are plotted distantly in a 2D space (
Subtraction of the background mutational signature from knockout mutation profile. The experiment-associated mutational signature can then be obtained by subtracting the background mutational signature from the mutational profile of treated subclones through quantile analysis. First, one can generate a set of bootstrap samples (e.g. 10,000 samples) of each treated subclone in order to determine the distribution of mutation number for each channel. This set of “hypothetical samples” aims to simulate the variability that may be present in a larger population of subclones, even though only 4 subclones could be generated for practical reasons. According to the distribution, the upper and lower boundaries (e.g., 99% CI) for each channel (e.g. each of the 96 channels for substitutions) can be identified for each treatment. The same process is applied to the control knockouts (ATP2B4) to estimate the expected background mutational signature variability. Based on the background mutational signature (average mutation signature in each of the channels, across the 4 control subclones) and averaged mutation burden (across the 4 control subclones; used as initial value), one can construct bootstrapped background profiles. The bootstrap background profiles can then be used to derive a centroid value across bootstrap background profiles, and this is subtracted from the centroid of bootstrap subclone samples. This process results in a mutational signature for each knockout, which is derived from all subclones for the knockout with variability estimated by bootstrapping, and adjusted to remove the estimated background contribution. Due to data noise, some channels may have negative values, in which case, the negative values are set to zero. Occasionally, the number of mutations in a few channels will fall outside the lower boundary after removing the background profile. 
To avoid negative values, the background mutation pattern is maintained but burden is scaled down through an automated iterative process.
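A simplified Python sketch of the background-subtraction idea follows. It is deliberately reduced: it bootstraps the knockout and background channel counts, takes the centroid (per-channel mean) of each set of bootstrap samples, subtracts one from the other and sets negative channels to zero. It does not implement the per-channel quantile boundaries or the automated iterative down-scaling of the background burden, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_centroid(counts, n_boot, rng):
    """Per-channel mean of n_boot resamples of a count profile.

    Each resample redistributes the observed total over the channels with
    probabilities proportional to the observed counts.
    """
    total = sum(counts)
    k = len(counts)
    acc = [0.0] * k
    for _ in range(n_boot):
        draws = Counter(rng.choices(range(k), weights=counts, k=total))
        for i, c in draws.items():
            acc[i] += c
    return [a / n_boot for a in acc]

def subtract_background(knockout_counts, background_counts, n_boot=1000, seed=0):
    """Knockout centroid minus background centroid, floored at zero."""
    rng = random.Random(seed)
    ko = bootstrap_centroid(knockout_counts, n_boot, rng)
    bg = bootstrap_centroid(background_counts, n_boot, rng)
    # Channels where the background exceeds the knockout are set to zero.
    return [max(0.0, a - b) for a, b in zip(ko, bg)]

# Toy three-channel profiles (hypothetical counts, not data from the study).
signature = subtract_background([100, 50, 10], [10, 10, 20], n_boot=200)
```

In the toy input the third channel carries fewer knockout mutations than background mutations, so it is floored at zero, while the first channel retains most of its burden.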
Other software used. IntersectBed (Quinlan, A. R. & Hall, I. M., 2010) was used to identify mutations overlapping certain genomic features. All statistical analyses in these Examples were performed in R (R Core Team, 2017). All plots were generated by ggplot2 (Wickham, H., 2009).
Results
We reasoned that under the controlled experimental settings described in Example 1, if simply knocking-out a gene (in the absence of providing additional DNA damage) could produce a signature, then the gene is critical to maintaining genome stability from endogenous sources of DNA damage. It would manifest an increased mutation burden above background and/or an altered mutation profile (
To address potential uncertainty associated with the relatively small number of subclones per knockout and variable mutation counts in each gene knockout (see Example 1 and Methods above), we generated bootstrapped control samples with variable mutation burdens (50-10,000). We calculated cosine similarities between each bootstrapped sample and the background control (ΔATP2B4) mutational signature (mean and standard deviations). A cosine similarity close to 1.0 indicates that the mutation profile of the bootstrapped sample is near-identical to the control signature. Cosine similarities could thus be considered across a range of mutation burdens (green line in
We identified nine single substitution, two double substitution and six indel signatures. Three gene knockouts, ΔOGG1, ΔUNG, and ΔRNF168, produced only substitution signatures. Six gene knockouts, ΔMSH2, ΔMSH6, ΔMLH1, ΔPMS2, ΔEXO1, and ΔPMS1, presented substitution and indel signatures. ΔEXO1 and ΔRNF168 also produced double substitution patterns. The average de novo mutation burden accumulated for these nine knockouts (
In standardized experiments performed in a diploid, non-transformed human stem cell model, biallelic gene knockouts that produce mutational signatures in the absence of administered DNA damage are indicative of genes that are important for maintaining the genome against intrinsic sources of DNA perturbations. We find signatures of substitutions and/or indels in nine genes: ΔOGG1, ΔUNG, ΔEXO1, ΔRNF168, ΔMLH1, ΔMSH2, ΔMSH6, ΔPMS2, and ΔPMS1, suggesting that proteins of these genes are critical guardians of the genome in non-transformed cells. Many gene knockouts did not show mutational signatures under these conditions. This does not mean that they are not important DNA repair proteins. There may be redundancy, or the gene may be crucial to the orchestration of DNA repair, even if it is not itself imperative for directly preventing mutagenesis. It is also possible that some gene knockouts have very low rates of mutagenesis such that a statistically distinct mutational signature cannot be distinguished from background mutagenesis within our experimental time-frame. For genes involved in double-strand-break (DSB) repair, hiPSCs may not be permissive for surviving DSBs to report signatures. Other genes may require alternative forms of endogenous DNA damage that manifest in vivo but not in vitro, for example, aldehydes, tissue-specific products of cellular metabolism, and pathophysiological processes such as replication stress. Likewise, for genes in the nucleotide excision repair pathway, bulky DNA adducts, whether exogenous (e.g., ultraviolet damage) or endogenous (e.g., cyclopurines and by-products of lipid peroxidation) may be a pre-requisite before these compromised genes reveal associated signatures.
While experimental modifications such as the addition of DNA damaging agents to increase mutation burden or using alternative cellular models, for example, cancer lines or cellular models of specific tissue-types, could amplify signal, they could also modify mutational outcomes, and that must be taken into consideration when interpreting data. Also, not all genes have been successfully knocked out in this endeavour and could have similarly important roles in directly preventing mutagenesis.
In this example, the inventors investigated in-depth the mutational signatures identified in Example 2 associated with genes involved in the MMR pathway.
Methods
See Examples 1 and 2.
Topography analysis of signatures. Strand bias. Reference information on replicative strands and replication-timing regions was obtained from Repli-seq data of the ENCODE project (https://www.encodeproject.org/) (The ENCODE Project Consortium, 2012). The transcriptional strand coordinates were inferred from the known footprints and transcriptional direction of protein coding genes. First, for a given mutational signature, one could calculate the ‘expected’ ratio of mutations between transcribed and non-transcribed strand, or between lagging and leading strands, according to the distribution of trinucleotide sequence context in these regions. Second, the ‘observed’ ratio of mutations between different strands can be identified through mapping mutations to the genomic coordinates of all gene footprints (for transcription) or leading/lagging regions (for replication). Third, all mutations were orientated towards pyrimidines as the mutated base (as this has become the convention in the field). This helped denote which strand the mutation was on. Fourth, the level of asymmetry between different strands was measured by calculating the odds ratio of mutations occurring on one strand (e.g., transcribed or leading strand) vs. on the other strand (e.g., non-transcribed or lagging strand).
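The asymmetry measure described above can be illustrated with a small Python sketch. The exact formulation of the odds ratio is not spelled out in the text, so the version below, which divides the observed strand ratio by the expected strand ratio, is an assumption, as is the function name.

```python
def strand_odds_ratio(obs_a, obs_b, exp_a, exp_b):
    """Observed strand ratio divided by the expected strand ratio.

    obs_a / obs_b: mutation counts observed on strand A (e.g. transcribed or
    leading) and strand B (e.g. non-transcribed or lagging).
    exp_a / exp_b: counts expected from the trinucleotide composition of the
    corresponding regions. A value of 1 indicates no asymmetry.
    """
    if min(obs_b, exp_a, exp_b) <= 0:
        raise ValueError("denominator counts must be positive")
    return (obs_a / obs_b) / (exp_a / exp_b)

# 60 vs 40 observed mutations where an even split is expected.
odds = strand_odds_ratio(60, 40, 50, 50)
```

Here a value above 1 indicates an excess of mutations on strand A relative to expectation, and a value below 1 an excess on strand B.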
Results
Knockouts of five genes involved in the mismatch repair (MMR) pathway (Gupta et al., 2012; Palombo et al., 1995, Warren et al., 2007), MSH2, MSH6, MLH1, PMS2, and PMS1, produced substitution and indel signatures (
In-depth analysis of these mutational signatures allowed us to determine putative sources of endogenous DNA damage (
First, we consistently observed replication strand bias across ΔMLH1, ΔMSH2, ΔMSH6, and ΔPMS2: C>A on the lagging strand (equivalent to G>T leading strand bias), C>T on the leading strand (or G>A lagging) and T>C lagging (or A>G leading) (
Second, the predominance of C>A transversions could be explained by differential processing of 8-oxo-dGs (
Third, we found that T>A transversions at ATT were strikingly persistent in MMR knockout signatures, although with modest peak size (<3% normalized signature,
Since polynucleotide repeat tracts predispose to indels due to replication slippage and are a known source of mutagenesis in MMR-deficient cells, we hypothesize that the T>A transversions observed at sites of abutting polyA and polyT tracts are the result of a ‘reverse template slippage’. In this scenario, the polymerase replicating across a mixed repeat sequence such as a repeat of 6 As followed by 4 Ts in which the template slipped at one of the As would incorporate five instead of six Ts opposite the A repeat (red arrow pathway in
In this example, the inventors compared and validated the mutational signatures identified in Example 2 associated with genes involved in the MMR pathway.
Methods
See Examples 1-3.
CMMRD patient sample collection. Four CMMRD patients were recruited at Doce de Octubre University Hospital, Spain, St George's Hospital in London and Great Ormond Street Hospital under the auspices of the Insignia project. This included two PMS2-mutant patients and two MSH6-mutant patients. Table 5 shows the genotypes of these four patients. A healthy donor was recruited as control.
Generation of iPSCs from Constitutional Mismatch Repair Deficiency (CMMRD) Patients. Peripheral blood mononuclear cell (PBMC) isolation, erythroblast expansion, and iPSC derivation were done by the Cellular Generation and Phenotyping facility at the Wellcome Sanger Institute, Hinxton, according to Agu et al., 2015. Briefly, whole blood samples collected from consented CMMRD patients were diluted with PBS, and PBMCs were separated using the standard Ficoll Paque density gradient centrifugation method. Following the PBMC separation, samples were cultured in media favouring expansion into erythroblasts for 9 days. Reprogramming of erythroblast-enriched fractions was done using the non-integrating CytoTune-iPS Sendai Reprogramming kit (Invitrogen) based on the manufacturer's recommendations. The kit contains three Sendai virus-based reprogramming vectors encoding the four Yamanaka factors, Oct3/4, Sox2, Klf4, and c-Myc. Successful reprogramming was confirmed via genotyping array and expression array.
Results
There are uncertainties regarding which of the cancer-derived signatures (described in Alexandrov, L. B. et al. (2020) and Degasperi, A. et al. (2020)) are truly MMR-deficiency signatures. It was suggested that SBS6, SBS14, SBS15, SBS20, SBS21, SBS26, and SBS44 were MMR-deficiency related (Alexandrov, L. B. et al. (2020)). In an independent analytical exercise, only two MMR-associated signatures were identified (Degasperi, A. et al. (2020)), although variations of the signatures were seen in different tissue types (Degasperi, A. et al. (2020)). An experimental process would help to obtain clarity in this regard (Nik-Zainal, S. et al., 2015; Zou, X. et al., 2018; Christensen, S. et al., 2019; Kucab, J. E. et al., 2019).
As described above, substitution patterns of ΔMSH2, ΔMSH6, and ΔMLH1 showed enormous qualitative similarities to each other and were distinct from ΔPMS2 (
While the qualitative indel profiles of ΔMSH2, ΔMSH6, and ΔMLH1 were very similar, their quantitative burdens were rather different (
Thus, there are clear qualitative differences between substitution and indel profiles of ΔMSH2, ΔMSH6, and ΔMLH1 from ΔPMS2. To validate these two gene-specific experimentally-generated MMR knock-out signatures, we interrogated genomic profiles of normal cells derived from patients with inherited autosomal recessive defects in MMR genes resulting in Constitutional Mismatch Repair Deficiency (CMMRD), a severe, hereditary cancer predisposition syndrome characterized by an increased risk of early-onset (often pediatric) malignancies and cutaneous café-au-lait macules (Poulogiannis et al., 2010; Heinen et al., 2016). hiPSCs were generated from erythroblasts derived from blood samples of four CMMRD patients (two PMS2 homozygotes and two MSH6 homozygotes) and two healthy controls. hiPSC clones obtained were genotyped (Agu et al., 2015). Expression arrays and cellomics-based immunohistochemistry were performed to ensure that pluripotent stem cells were generated (see Methods). Parental clones were grown out to allow mutation accumulation, single-cell subclones were derived, and whole-genome sequenced (
Gene-specificity of mutational signatures seen in CMMRD hiPSCs was virtually identical to those of the CRISPR-Cas9 knockouts and cancers (
Furthermore, gene-specific MMR signatures were seen in the International Cancer Genome Consortium (ICGC) cohort of >2,500 primary WGS cancers (Degasperi, A. et al., 2020). Indeed, biallelic MSH2/MSH6/MLH1 mutant tumors carried the same signature (RefSig MMR1) as ΔMSH2/ΔMSH6/ΔMLH1 clones (
In this example, the inventors developed an algorithm to classify tumours according to MMR-deficiency status using the insights generated in Examples 1-4.
Methods
See Examples 1-4.
MMRDetect algorithm. We trained a mismatch repair (MMR) deficiency logistic regression-based classifier, called MMRDetect, based on mutational signatures obtained from the experimental work. We obtained mutation data from 336 WGS colorectal cancers with accompanying immunohistochemistry (IHC) staining of the four MMR proteins (MSH2, MSH6, MLH1 and PMS2) from the UK 100,000 Genomes Project (UK100kGP). Within this cohort of 336 colorectal cancers, there were 79 (24%) cancers with abnormal IHC staining indicative of MMR deficiency. The 336 cancers were randomly divided into a training set and a test set by using the R function sample( ). The training set had 180 MMR-proficient and 56 MMR-deficient samples. The test data set had 77 MMR-proficient and 23 MMR-deficient samples (Table 6).
Based on the experimental data, we investigated four potential predictor variables in MMRDetect (
The values of different variables were transformed to between 0 and 1 using formula x′=x/max(x) for comparability. This is performed for all training samples and for all samples that are subsequently evaluated for testing purposes or in use to identify MMR deficiency in a subject. Table 6 shows calculated parameters of 336 tumors for MSIseq and MMRDetect. The logistic regression algorithm (function glm( )) provided in R package glmnet was employed as the framework of MMRDetect. Table 7 provides the weight (coefficients) of the four variables obtained from training the model using the training data set, and the value of the intercept weight. A ten-fold cross validation was performed for the training data to evaluate the stability of the weights (
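The scaling step and the way a fitted logistic regression turns scaled variables into a probability can be sketched in Python as follows. This is a minimal illustration rather than the MMRDetect implementation: the function names are assumptions, and the weights and intercept in the example are placeholder values, not the coefficients reported in Table 7.

```python
import math

def max_scale(values):
    """Scale non-negative values to [0, 1] using x' = x / max(x)."""
    top = max(values)
    return [v / top for v in values] if top > 0 else [0.0 for _ in values]

def logistic_probability(features, weights, intercept):
    """Probability given by a logistic model: 1 / (1 + exp(-(b0 + sum(w * x))))."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# One predictor measured across three hypothetical samples.
scaled = max_scale([2.0, 4.0, 8.0])

# Placeholder coefficients for four scaled predictors (hypothetical values).
example_weights = [-1.0, -2.0, -0.5, -1.5]
example_intercept = 3.0
prob = logistic_probability([0.1, 0.2, 0.05, 0.3], example_weights, example_intercept)
```

Dividing by the maximum keeps each variable within the same 0 to 1 range, so that the fitted weights are comparable across variables with very different raw magnitudes, such as a cosine similarity and an indel count.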
Four additional datasets were used to compare the performance of MMRDetect and MSIseq:
The characteristics of each of these cohorts are shown in Tables 8-11 below.
Results
Algorithms to classify MMR-deficiency tumors have been developed using massively-parallel sequencing data (Ni Huang et al., 2013; Wang & Liang, 2018; Cortes-Ciriano, 2017; Salipante et al., 2014; Hause et al., 2016). These classifiers depend on detecting elevated tumor mutational burdens (TMB) or microsatellite instability (MSI). New knowledge from our experimental data and awareness of tissue-specific signature variation (
We obtained WGS data on 336 colorectal cancers from patients recruited via the National Health Service-based UK 100,000 Genomes Project (UK100kGP) run by Genomics England (GEL). These samples critically had accompanying immunohistochemistry (IHC) validation of MMR-deficiency status based on protein staining of MSH2, MSH6, MLH1 and PMS2. 79 out of 336 cases were identified as MMR-deficient (~24%). This cohort of 336 samples was randomly assigned into a training set (comprising 180 MMR-proficient and 56 MMR-deficient samples) or a test set (comprising 77 MMR-proficient and 23 MMR-deficient samples). We developed a logistic regression classifier, called MMRDetect, using new mutational-signatures-based parameters derived from the experimental insights gained from our studies above: 1) the exposure of MMR-deficient substitution signatures (EMMRD); 2) the cosine similarity between substitution profile of the tumor and that of MMR knockouts (Ssub); 3) the mutation burden of indels in repetitive regions (Nrep.indel), and 4) the cosine similarity between repeat-mediated deletion profile of the tumor and that of MMR knockouts (Srep.indel) (further details in Methods,
Samples with MMRDetect-calculated probability <0.7 are defined as MMR-deficient by MMRDetect (
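The decision rule stated above can be sketched in Python as follows. The function and constant names are hypothetical. The sketch applies the rule exactly as stated in the text (a calculated probability strictly below 0.7 yields an MMR-deficient call) and assumes that any other value yields an MMR-proficient call.

```python
CUTOFF = 0.7  # threshold stated in the text


def call_mmr_status(probability, cutoff=CUTOFF):
    """Label a sample from its model-calculated probability.

    A probability strictly below the cutoff is called MMR-deficient;
    otherwise the sample is called MMR-proficient (assumed complement).
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return "MMR-deficient" if probability < cutoff else "MMR-proficient"


# Hypothetical probabilities, for illustration only.
calls = [call_mmr_status(p) for p in (0.05, 0.69, 0.7, 0.98)]
```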
We next directly compared MMRDetect and MSIseq on a further 2,012 colorectal and 713 uterine samples from UK100kGP, 2,610 published WGS primary cancers (Nik-Zainal et al., 2016; Campbell et al., 2020; Staaf et al., 2019) and 2,024 WGS metastatic cancers (Priestley et al., 2019) (Tables 8-11, Methods). There was very high concordance between MMRDetect and MSIseq for classifying tumors (0.97 to 0.997 (
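As an illustration of how agreement between two classifiers may be quantified, the following Python sketch computes concordance as the fraction of samples that receive the same call from both classifiers. This definition of concordance is an assumption made for the purposes of the sketch, and the function name and example calls are hypothetical; no cohort data are reproduced.

```python
def concordance(calls_a, calls_b):
    """Return the fraction of samples given the same call by two classifiers."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both classifiers must be run on the same samples")
    if not calls_a:
        raise ValueError("at least one sample is required")
    agreements = sum(1 for a, b in zip(calls_a, calls_b) if a == b)
    return agreements / len(calls_a)


# Hypothetical calls for four samples (True = called MMR-deficient).
classifier_one_calls = [True, False, True, True]
classifier_two_calls = [True, False, False, True]
agreement = concordance(classifier_one_calls, classifier_two_calls)
```

The discordant samples, rather than the concordant ones, are generally the cases of interest when two classifiers are compared in this way.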
Unlike signatures of environmental mutagens, which are historic, signatures of repair pathway defects are likely to be ongoing in human cancer cells, and could serve as biomarkers of targetable abnormalities for precision medicine (Mardis, 2019; Berger & Mardis, 2018; Wood et al., 2001) (
All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety.
The specific embodiments described herein are offered by way of example, not by way of limitation. Various modifications and variations of the described compositions, methods, and uses of the technology will be apparent to those skilled in the art without departing from the scope and spirit of the technology as described. Any sub-titles herein are included for convenience only, and are not to be construed as limiting the disclosure in any way.
Unless context dictates otherwise, the descriptions and definitions of the features set out above are not limited to any particular aspect or embodiment of the invention and apply equally to all aspects and embodiments which are described.
Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.
It must be noted that, as used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by the use of the antecedent “about,” it will be understood that the particular value forms another embodiment. The term “about” in relation to a numerical value is optional and means for example +/−10%.
Throughout this specification, including the claims which follow, unless the context requires otherwise, the words “comprise” and “include”, and variations such as “comprises”, “comprising”, and “including”, will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
Other aspects and embodiments of the invention provide the aspects and embodiments described above with the term “comprising” replaced by the term “consisting of” or “consisting essentially of”, unless the context dictates otherwise.
The features disclosed in the foregoing description, or in the following claims, or in the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for obtaining the disclosed results, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.
Number | Date | Country | Kind |
---|---|---|---|
2104308.8 | Mar 2021 | GB | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/057387 | 3/21/2022 | WO |