The present invention relates in part to methods for diagnosing, treating and monitoring brain cancer by analysing urine samples. In particular, the methods of the invention find use in the diagnosis, treatment and monitoring of brain cancers such as glioma.
Primary brain tumours, which are diagnosed in over 260,000 patients worldwide annually (Wesseling & Capper, 2018), have a poor prognosis and lack effective treatments. Better methods for early detection and identification of tumour recurrence may enable the development of novel treatment strategies. The development of new treatments would also benefit from minimally invasive methods that characterise the evolving glioma genome (Westphal & Lamszus, 2015; Brennan et al, 2013). DNA analysis in liquid biopsies has the potential to replace or supplement current imaging-based monitoring techniques, which have limited effectiveness, and to provide the genomic information required for precision medicine whilst reducing the morbidity associated with repeated biopsy (Westphal & Lamszus, 2015; Kros et al, 2015; Mouliere et al, 2014). However, cell-free tumour DNA (ctDNA) is extremely challenging to detect in the plasma of patients with brain tumours as its fractional concentration (mutant allele fractions, MAF) is low and appears to be in the same range as that observed in plasma of patients with early stage carcinomas (Bettegowda et al, 2014; Zill et al, 2018). Reported detection rates for ctDNA in plasma of glioma patients are typically around 15%-30% (Bettegowda et al, 2014). Although higher rates of detection have been reported, the high frequency of alterations resulting from clonal hematopoiesis may confound these results (Zill et al, 2018; Piccioni et al, 2019; Pan et al, 2019). In addition to plasma, ctDNA has been detected in urine for some cancer types, however this has been limited largely to urothelial cancers, or patients with advanced cancers and high plasma tumour fraction (Patel et al, 2017; Dudley et al, 2019; Husain et al, 2017; Bosschieter et al, 2018; Hentschel et al, 2020). 
Cerebrospinal fluid (CSF) has been proposed as an alternative medium for brain tumour ctDNA analysis (De Mattos-Arruda et al, 2015; Wang et al, 2015; Mouliere et al, 2018b; Pentsova et al, 2016; Seoane et al, 2019; Pan et al, 2019, 2015); however, detection sensitivity has remained poor in previous analyses (ctDNA detected in the CSF of 42/85 patients, 49.4%) (Miller et al, 2019). In addition, CSF sampling via lumbar puncture is an invasive and painful procedure for patients and requires skilled medical staff, which severely limits its use for research, diagnosis and repeat sampling (Hasbun et al, 2001; Engelborghs et al, 2017).
Thus, compared to other disease types, detection of circulating cell-free tumour DNA (ctDNA) in patients with brain tumours, in particular gliomas (GBM), is challenging. Because CSF is both difficult to collect and associated with significant discomfort for the patient, it is unlikely that analysis of ctDNA in CSF will be considered a viable approach for longitudinal sampling going forward. On the other hand, minimally invasive liquid biopsies, in the form of plasma or urine, do not face these same challenges, but their use is hampered by the presence of only minute levels of glioma-derived cfDNA signal.
Thus, there remains a need for approaches that can effectively detect ctDNA in patients with brain cancer and that do not suffer from the disadvantages of existing methods.
The present inventors have previously demonstrated that tumour cfDNA could be detected in plasma samples for a variety of cancers using a machine learning approach combining cfDNA fragmentation pattern information and somatic alteration analysis (Mouliere et al., 2018a). In particular, in Mouliere et al. (2018a), a random forest model including as predictive features (a) the proportion of fragments in the size ranges 160-180 bp (base pairs), 180-220 bp and 250-320 bp, (b) the amplitude of oscillations in fragment size density with 10-bp periodicity, and (c) a feature quantifying the deviation from copy number neutrality (t-MAD, trimmed median absolute deviation from copy number neutrality) was found to have best performance in discriminating between healthy and cancer patients using plasma samples, when assessed on a cohort of samples from cancer types with low ctDNA in plasma (renal cancer, glioblastoma, bladder cancer, pancreatic cancer). This was also the subject of patent application WO 2020/094775, which is incorporated herein by reference. The present inventors hypothesised that differences in fragment lengths of circulating DNA could be present in urine samples as well. The present inventors further hypothesised that an approach specifically designed for detection of ctDNA in urine samples could be exploited to enhance sensitivity for detecting the presence of ctDNA for non-invasive genomic analysis of brain cancers. As explained above, this is a particularly challenging task even in fluids such as CSF, let alone in urine. As described in detail herein, the present inventors used a sequencing approach that preserves the structural properties of ctDNA, allowing them to determine the size profile of mutant ctDNA in matched CSF, plasma and urine samples from glioma patients.
This demonstrated a shift towards shorter fragment sizes for mutant (tumour-derived) cfDNA in comparison to non-mutant cfDNA in CSF, plasma and urine samples, with different respective characteristics in each of the fluids. Based on this, they designed an approach specifically tailored to detect ctDNA in urine of brain cancer patients. Analysing urine fragmentation in samples from 5 patients with low grade glioma (LGG) and with high grade glioma (HGG), and 53 individuals without glioma, the inventors demonstrated that urine samples from glioma patients could be identified by analysing specific fragmentation patterns from shallow whole genome sequencing (sWGS, <1× coverage) data using machine learning classifiers. They discovered in particular that in this context the proportion of fragments in lower size ranges than those used in plasma were particularly informative, and that including features that capture these size ranges specifically as informative features for the classification improved the sensitivity and specificity of classification in the context of detecting ctDNA from brain tumours in urine samples.
Accordingly, in a first aspect the present invention provides a method for analysing a urine sample from a subject, the method comprising: providing the value of one or more cell-free DNA fragment size metrics for said sample; and determining whether the sample has a high or low likelihood of being from a brain cancer patient by providing said values of said cell-free DNA fragment size metrics as input to a machine learning model trained to classify sample data into one of at least two classes, the at least two classes comprising a first class having a high likelihood of being from a brain cancer patient and a second class having a low likelihood of being from a brain cancer patient, wherein the one or more cell-free DNA fragment size metrics comprise at least one metric representing the proportion of fragments in a size range that does not extend above 100 bp and that is between 10 and 100 bp wide.
The present inventors have discovered that the cfDNA fragmentation profile in urine samples could be used to discriminate between samples that are likely to contain ctDNA from brain cancer and samples that are unlikely to contain ctDNA from brain cancer, and that such a discrimination was particularly improved by investigating the range of sizes below 100 bp in more detail than was previously done for plasma samples. This is based at least in part on the discovery that cfDNA fragmentation patterns are different in urine and plasma samples, and further that samples from patients with other central nervous system diseases also show fragmentation patterns that differ from those seen in samples from healthy patients, such that an approach specifically tailored to the particular size distribution features in these types of patients enhances the ability to discriminate between patients with and without brain malignancies.
All of the methods described herein may be computer implemented unless context indicates otherwise. As the skilled person understands, the complexity of the operations described herein (due at least to the complexity of analysing sequencing data, training a machine learning model, obtaining a distribution of fragment size from sequencing data etc. as described herein, particularly in view of the amount of data that is typically generated by DNA sequencing) is such that they are beyond the reach of mental activity. Thus, unless context indicates otherwise (e.g. where sample preparation or acquisition steps are described), all steps of the methods described herein are computer implemented.
The one or more cell-free DNA fragment size metrics may comprise a plurality of metrics representing the proportion of fragments in respective size ranges. The respective size ranges may be substantially non-overlapping. Two size ranges may be substantially non-overlapping when the proportion of the size ranges that is common between them is smaller than the proportion of each size range that is unique to itself. For example, size ranges that overlap by a common range that represents less than 10% of each of the respective size ranges (where the exact percentage may be different for the respective size range depending on their size) may be considered to be substantially non-overlapping. The one or more cell-free DNA fragment size metrics may comprise a plurality of metrics representing the proportion of fragments in respective size ranges that are each between 0 and 300 bp. Each of the respective size ranges may be between 10 and 100 bp wide. The one or more cell-free DNA fragment size metrics may comprise a metric representing the amplitude of oscillations in fragment size density with approximately 10 bp periodicity in a particular size range. The particular size range may be between approximately 50 bp and approximately 140 bp.
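By way of illustration only, the "substantially non-overlapping" criterion given above can be sketched as follows. The function names, the treatment of ranges as closed integer intervals in bp, and the use of the 10% example figure as a default are assumptions made for this sketch rather than features of the described method.

```python
def overlap_fractions(range_a, range_b):
    """Return the fraction of each range taken up by their common part.

    Ranges are (start, end) tuples in bp, treated here as closed integer
    intervals (an assumption of this sketch).
    """
    start_a, end_a = range_a
    start_b, end_b = range_b
    # Number of integer sizes shared by both ranges (0 if they are disjoint).
    common = max(0, min(end_a, end_b) - max(start_a, start_b) + 1)
    width_a = end_a - start_a + 1
    width_b = end_b - start_b + 1
    return common / width_a, common / width_b


def substantially_non_overlapping(range_a, range_b, max_fraction=0.10):
    """True if the shared part is below max_fraction of each range."""
    frac_a, frac_b = overlap_fractions(range_a, range_b)
    return frac_a < max_fraction and frac_b < max_fraction
```

Under these assumptions, the ranges 30-60 bp and 60-90 bp share a single size (60 bp) and are treated as substantially non-overlapping, whereas 30-60 bp and 40-90 bp are not.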
The one or more cell-free DNA fragment size metrics may comprise a plurality of metrics representing the proportion of fragments in respective substantially non-overlapping size ranges between 0 and 150 bp. The one or more cell-free DNA fragment size metrics may comprise at least 2 or at least 3 metrics representing the proportion of fragments in respective substantially non-overlapping size ranges between 0 and 150 bp. The size range or each of the respective size ranges may be between 20 and 100 bp wide, between 20 and 80 bp wide, between 20 and 50 bp wide, at least 10 bp wide, at least 20 bp wide, at least 30 bp wide, at most 100 bp wide, at most 90 bp wide, at most 80 bp wide, at most 70 bp wide, at most 60 bp wide, at most 50 bp wide, about 20 bp wide, about 30 bp wide, about 40 bp wide or about 50 bp wide. The one or more cell-free DNA fragment size metrics may comprise one or more metrics representing the proportion of fragments in the 30-90 bp range and/or one or more metrics representing the proportion of fragments in the 90-150 bp range. The one or more metrics representing the proportion of fragments in the 30-90 bp range may comprise a metric representing the proportion of fragments in the 30-60 bp range and/or a metric representing the proportion of fragments in the 60-90 bp range. The one or more metrics representing the proportion of fragments in the 90-150 bp range may comprise a metric representing the proportion of fragments in the 90-120 bp range and/or a metric representing the proportion of fragments in the 120-150 bp range. The one or more cell-free DNA fragment size metrics may comprise a metric representing the proportion of fragments in a plurality of ranges selected from the following ranges: 30-60 bp, 60-90 bp, 90-120 bp, 120-150 bp, 150-180 bp, 180-210 bp, 240-270 bp and 270-300 bp.
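As a minimal sketch of how such proportion metrics might be computed from a list of fragment lengths, consider the following. The function name, the feature naming scheme, the half-open treatment of each range (lower bound included, upper bound excluded, so that no fragment is counted twice) and the choice of all supplied fragments as the denominator are assumptions of this sketch.

```python
# Example ranges in bp, taken from the ranges listed above.
SIZE_BINS_BP = [(30, 60), (60, 90), (90, 120), (120, 150)]


def fragment_size_proportions(fragment_lengths, bins=SIZE_BINS_BP):
    """Proportion of fragments whose length falls in each [low, high) bin.

    The denominator is the total number of fragments supplied; a different
    denominator (e.g. fragments within a wider window) could equally be used.
    """
    total = len(fragment_lengths)
    if total == 0:
        raise ValueError("no fragment lengths provided")
    proportions = {}
    for low, high in bins:
        count = sum(1 for length in fragment_lengths if low <= length < high)
        proportions[f"P{low}_{high}"] = count / total
    return proportions


# Illustrative (made-up) fragment lengths in bp.
example_lengths = [35, 45, 60, 75, 100, 130, 167, 167, 200, 320]
example_features = fragment_size_proportions(example_lengths)
```

The resulting dictionary of proportions could then serve as part of the feature vector supplied to the machine learning model.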
The cell-free DNA fragment size metrics may further comprise a metric representing the amplitude of oscillations in fragment size density with 10 bp periodicity in a particular size range. The cell-free DNA fragment size metrics may further comprise a metric representing the proportion of fragments in each of the following ranges: 30-60 bp, 60-90 bp, 90-120 bp, 120-150, 150-180, 180-210, 240-270 and 270-300. As the skilled person understands, the reference to e.g. the 60-90 size range may encompass a range that starts at 61, for example when a 30-60 size range is also used in order to avoid double counting. In other words, strictly non-overlapping equivalents of each of the combinations of ranges described are also envisaged.
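One possible way of quantifying the amplitude of oscillations with 10 bp periodicity is sketched below. It projects the mean-subtracted fragment size density onto a sine and a cosine of the chosen period (a single-frequency Fourier projection). This is an illustrative technique chosen for the sketch and is not asserted to be the calculation used by the inventors; the function name and the input format are likewise assumptions.

```python
import math


def oscillation_amplitude(density, period_bp=10):
    """Amplitude of the component of `density` with the given period.

    `density` maps fragment size (bp) to the fraction of fragments of that
    size, restricted to the size range of interest (e.g. about 50-140 bp).
    """
    sizes = sorted(density)
    n = len(sizes)
    if n == 0:
        raise ValueError("empty density")
    mean = sum(density[s] for s in sizes) / n
    # Project the mean-subtracted density onto a cosine and a sine wave.
    re = sum((density[s] - mean) * math.cos(2 * math.pi * s / period_bp)
             for s in sizes)
    im = sum((density[s] - mean) * math.sin(2 * math.pi * s / period_bp)
             for s in sizes)
    return 2 * math.hypot(re, im) / n
```

The projection is most accurate when the size window spans a whole number of periods, for example 50-139 bp (nine periods of 10 bp).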
Providing the value of one or more cell-free DNA fragment size metrics for said sample may comprise: providing data representing fragment sizes of cell-free DNA fragments obtained from said sample; and determining the value of the one or more cell-free DNA fragment size metrics from the data representing fragment sizes of cell-free DNA fragments obtained from said sample. The step of providing data representing fragment sizes of cell-free DNA fragments obtained from said sample may comprise sequencing DNA from said sample and/or obtaining a urine sample from said subject and/or processing a urine sample from said subject or a sample of DNA derived therefrom. The data representing fragment sizes of the cell-free DNA fragments may comprise fragment sizes inferred from sequence data (e.g. sequence reads), fragment sizes determined by fluorimetry, or fragment sizes determined by densitometry. Alternatively, the data representing fragment sizes of cell-free DNA fragments obtained from the sample may comprise sequence data. The step of providing data representing fragment sizes of cell-free DNA fragments may comprise determining the lengths of cfDNA fragments from sequence data and/or determining the distribution of lengths of cfDNA fragments from sequence data. The sequence data may have been obtained using paired-end sequencing. The sequence data may have been obtained using a ligation-based approach to obtain a sequencing library. The sequencing library may be an indexed sequencing library. The present inventors have found the use of paired-end sequencing and/or a ligation-based strategy for library preparation to result in particularly high recovery rates of cfDNA. This may in turn further improve the performance of the methods described herein. The step of providing data (e.g. sequence data, data representing fragment sizes of cell-free DNA fragments, the value of one or more cell-free DNA fragment size metrics for said sample) for a sample from the subject may comprise or consist of receiving data from a user (for example through a user interface), from one or more computing device(s), or from one or more data stores or databases.
The step of providing data representing fragment sizes of cell-free DNA fragments obtained from said sample may further comprise sequencing (or otherwise determining the sequence composition of genomic material present in a sample) one or more samples from the subject, wherein the one or more samples is/are urine samples from the subject, cfDNA-containing samples derived from urine samples from the subject, or samples derived therefrom such as e.g. by purification (including e.g. size selection to remove very large fragments such as e.g. genomic DNA fragments), extraction, library preparation, etc. Size selection may comprise an in vitro size selection that is performed on DNA extracted from a urine sample and/or is performed on a library created from DNA extracted from a urine sample. For example, in vitro size selection may comprise agarose gel electrophoresis or bead-based size selection. Instead of or in addition to in vitro size selection, size selection may comprise an in silico size selection that is performed on sequence reads. The value of one or more cell-free DNA fragment size metrics for said sample may be derived from sequence data. In convenient embodiments, the sequence data may be whole genome sequencing (WGS) data, paired-end sequencing data, hybrid-capture sequencing data and/or shallow whole genome sequencing (sWGS) data. In general, it is believed that the methods described herein would provide useful results using any type of data from which cell-free DNA fragment size information can be obtained. This includes for example sequencing data, fluorimetry data and densitometry data. Sequencing data is believed to be a particularly convenient type of data (at least because it is generally available), particularly when the sequencing includes a step of ligation and paired-end sequencing (as this can result in high cfDNA recovery rates). The sequencing data may be whole genome (such as e.g. WGS), or may use a capture-based approach (such as e.g. hybrid-capture sequencing). sWGS data may refer to WGS data that has <0.4× depth of coverage. The present inventors have discovered that sWGS was able to provide enough information to analyse urine samples as described herein, thereby providing a cost-effective way of diagnosing brain cancer in a non-invasive manner, increasing the scope of clinical applicability of the methods described.
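For orientation, the mean depth of coverage of a sequencing run can be approximated as the total number of sequenced bases divided by the genome size. The sketch below applies this approximation to check a run against the <0.4× figure mentioned above. The function names, the approximate human genome size of 3.1 Gb and the example run parameters are assumptions introduced for illustration.

```python
# Approximate haploid human genome size in bp (an assumption of this sketch).
HUMAN_GENOME_BP = 3_100_000_000


def mean_depth(num_reads, read_length_bp, genome_size_bp=HUMAN_GENOME_BP):
    """Approximate mean depth of coverage: sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return num_reads * read_length_bp / genome_size_bp


def is_shallow(depth, max_depth=0.4):
    """True if the depth is below the chosen shallow-WGS cut-off."""
    return depth < max_depth


# Hypothetical run: 12 million reads of 100 bp each.
example_depth = mean_depth(num_reads=12_000_000, read_length_bp=100)
example_is_shallow = is_shallow(example_depth)
```

This estimate ignores unmapped, duplicate and filtered reads, so the usable depth after processing would typically be somewhat lower.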
The method may further comprise obtaining, from the subject, one or more urine samples. The method may further comprise processing a urine sample obtained from the subject or a DNA sample derived therefrom, for example by purification, extraction, library preparation, etc. The method may further comprise providing to a user, for example through a user interface, an output of the method such as a determination of whether the sample has a high or low likelihood of being from a brain cancer patient, a probabilistic score provided by the machine learning model and/or a value derived therefrom or associated therewith.
The machine learning model may have been trained using training data comprising the values of cfDNA size metrics for a plurality of urine samples from subjects with brain cancer and for a plurality of urine samples from subjects that do not have brain cancer. The subjects that do not have brain cancer may comprise healthy subjects and subjects with non-malignant central nervous system diseases. For example, data from patients that have non-malignant central nervous system diseases selected from the following set may be used: cervical myelopathy, cerebral artery aneurysm, hydrocephalus and Parkinson's disease. The machine learning model may be a random forest model, a logistic regression model, a support vector machine, or a generalised linear model. A generalised linear model may be a regularised generalised linear model. The machine learning model may provide an output that is a probabilistic score, such as a probability of belonging to the high likelihood class or a probability of correct classification, e.g., a probability that the sample in question has been classified correctly. The machine learning model may provide an output that is a probabilistic score, and determining whether the sample has a high or low likelihood of being from a brain cancer patient may comprise comparing the probabilistic score to a threshold, for example a threshold determined based on the training data as one that most accurately classifies training samples into the high/low likelihood categories. The performance of the machine learning model when trained on the training set may be assessed by the area under the curve (AUC) value from a receiver operating characteristic (ROC) analysis. Generally a model showing the highest AUC value may be selected as having the best performance.
The machine learning model may have been trained on a training set comprising at least 10, 20, 30, 40 or at least 50 samples from subjects that do not have brain cancer and at least 10, 20, 30, or at least 40 samples from subjects known to have a brain cancer.
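The ROC AUC assessment and the threshold comparison described above can be illustrated with the following self-contained sketch. It assumes that a trained model has already produced a probabilistic score per training sample; the function names, the choice of training accuracy as the criterion for the threshold, and the example scores are assumptions made for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a randomly chosen positive sample scores
    higher than a randomly chosen negative sample (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_threshold(scores, labels):
    """Threshold with the highest training accuracy.

    A sample is called 'high likelihood' when its score >= threshold. On
    ties in accuracy, the lowest such threshold is kept.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == (y == 1) for s, y in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best_acc:
            best_t, best_acc = t, accuracy
    return best_t, best_acc


# Made-up scores; label 1 = brain cancer, 0 = no brain cancer.
example_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
example_labels = [1, 1, 1, 0, 0, 0]
example_auc = roc_auc(example_scores, example_labels)
example_threshold, example_accuracy = best_threshold(example_scores, example_labels)
```

In practice, performance estimated on the training samples themselves is optimistic, so a held-out or cross-validated estimate would normally be preferred when comparing candidate models.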
The urine sample may be from a subject having or suspected of having a brain cancer. The brain cancer may be a glioma, a meningioma, a pituitary adenoma, a glioblastoma, a medulloblastoma, an oligodendroglioma, or a brain metastasis. The brain cancer may be a glioma. The subject may be a human. A glioma may be a high grade glioma or a low grade glioma. A brain metastasis may be a metastasis located in the brain, associated with a cancer of any origin. The method may be a method for detecting the presence of, growth of, prognosis of, regression of, treatment response of, residual disease or recurrence of a brain cancer in a subject from which the sample has been obtained. The urine sample may have been obtained prior to the subject having undergone treatment with a cancer therapy. The urine sample may have been obtained subsequent to the subject having undergone treatment with a cancer therapy. The method may be carried out on a sample obtained prior to a cancer treatment of the subject and on a sample obtained following the cancer treatment of the subject. The urine sample may be or have been processed within 12 hours, within 4 hours, within 2 hours or within an hour of collection. The processing may comprise refrigeration, freezing, centrifugation, and/or mixing with one or more preserving compounds such as EDTA. The sample may have been obtained from the subject in a primary care setting, in a hospital, or at any other location such as e.g. privately by the subject (e.g. at home). In particular, the sample may have been obtained at a location that is different from the location at which the sample is processed (e.g. to preserve it, extract DNA, derive a library, sequence the DNA in the sample, etc.) and/or the location at which the sequence data is analysed to provide the value of one or more cell-free DNA fragment size metrics for said sample and/or the location at which said values are analysed as described herein.
In particular, each of the above may be performed at different locations. Further, any data analysis step may be performed over a distributed network such as e.g. on the cloud. Further, each of the above may be performed at locations that are not primary care locations or hospitals. Indeed, it is an advantage of the invention that an analysis can be performed without requiring trained medical staff, contrary to diagnosis/monitoring methods that require an invasive step (such as e.g. collection of blood or CSF) or specialised medical equipment (such as e.g. medical imaging).
In a second aspect the present invention provides a method for analysing a urine sample from a subject, comprising: analysing a urine sample, a DNA sample derived from a urine sample, or a library derived from a urine sample, wherein the sample has been obtained from the subject, to determine fragment sizes of nucleic acid fragments in said sample or said library; and carrying out the method of the first aspect of the invention using the fragment sizes. Also described is a method for analysing a urine sample from a subject, comprising: sequencing a DNA sample derived from the urine sample, or a library derived from the urine sample, that has been obtained from the subject to obtain a plurality of sequence reads; processing the sequence reads to determine data representing fragment sizes of cfDNA fragments obtained from said sample; and carrying out the method of the first aspect of the invention using the data. Processing the sequence reads may comprise one or more of the following steps: aligning sequence reads to a reference genome of the same species as the subject (e.g. the human reference genome GRCh37 for a human subject); removal of contaminating adapter sequences; removal of PCR and optical duplicates; removal of sequence reads of low mapping quality; and, if multiplex sequencing is used, de-multiplexing by excluding mismatches in sequencing barcodes.
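Two of the processing steps listed above, removal of duplicates and removal of reads of low mapping quality, can be sketched on simplified alignment records as follows. The flag bits used (0x4 for unmapped, 0x400 for PCR or optical duplicate) are those defined by the SAM format; the record type, the function names and the minimum mapping quality of 20 are assumptions of this sketch, and a real pipeline would typically rely on established alignment and duplicate-marking tools.

```python
from dataclasses import dataclass

# Flag bits as defined in the SAM format specification.
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


@dataclass
class AlignedRead:
    """Simplified alignment record (hypothetical type for this sketch)."""
    name: str
    flag: int
    mapq: int


def keep_read(read, min_mapq=20):
    """Keep mapped, non-duplicate reads with sufficient mapping quality.

    min_mapq=20 is an illustrative cut-off, not a value from the text.
    """
    if read.flag & FLAG_UNMAPPED:
        return False
    if read.flag & FLAG_DUPLICATE:
        return False
    return read.mapq >= min_mapq


def filter_reads(reads, min_mapq=20):
    return [read for read in reads if keep_read(read, min_mapq)]


example_reads = [
    AlignedRead("r1", flag=0x1 | 0x2, mapq=60),     # properly paired, kept
    AlignedRead("r2", flag=0x1 | 0x400, mapq=60),   # marked duplicate
    AlignedRead("r3", flag=0x1 | 0x2, mapq=5),      # low mapping quality
    AlignedRead("r4", flag=0x4, mapq=0),            # unmapped
]
example_kept = [read.name for read in filter_reads(example_reads)]
```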
In accordance with any aspect of the invention, the fragment sizes of cfDNA fragments may be inferred from sequence reads using the mapping locations of the read ends in the genome following alignment of the sequence reads with the reference genome of the species from which the sample was obtained. In accordance with any aspect of the present invention the sample may be or may have been subjected to one or more processing steps to remove whole cells, for example by centrifugation. In particular cases the sequence reads may comprise paired-end reads generated by sequencing DNA from both ends of the fragments present in a library generated from the urine sample or DNA sample derived therefrom. The original length of the DNA fragments in the cfDNA containing sample may be inferred using the mapping locations of the read ends in the genome following alignment of the sequence reads with the reference genome of the species from which the sample was obtained (e.g. the human reference genome GRCh37 for a human subject). In accordance with any aspect of the present invention, the subject may be mammalian, a human, a companion animal (e.g. a dog or cat), a laboratory animal (e.g. a mouse, rat, rabbit, pig or non-human primate), a domestic or farm animal (e.g. a pig, cow, horse or sheep). Preferably, the subject is a human patient. In some cases, the subject is a human patient who has been diagnosed with, is suspected of having or has been classified as at risk of developing, a brain cancer.
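The inference of fragment length from the mapping locations of the read ends can be illustrated as follows for a single read pair. The sketch assumes that both mates align to the same chromosome as a proper pair, that coordinates are 1-based and inclusive, and that there is no soft clipping at the outer ends; the function name and the example coordinates are hypothetical.

```python
def inferred_fragment_length(mate1_start, mate1_end, mate2_start, mate2_end):
    """Fragment length from the outermost aligned positions of a read pair.

    Coordinates are 1-based inclusive reference positions of each mate's
    alignment on the same chromosome (assumptions of this sketch).
    """
    if mate1_start > mate1_end or mate2_start > mate2_end:
        raise ValueError("start must not exceed end")
    leftmost = min(mate1_start, mate2_start)
    rightmost = max(mate1_end, mate2_end)
    return rightmost - leftmost + 1


# Hypothetical pair of 50 bp reads whose alignments overlap, as happens
# when the original fragment is shorter than twice the read length.
example_length = inferred_fragment_length(1000, 1049, 1038, 1087)
```

Because the calculation uses only the outermost aligned positions, it is unaffected by overlap between the two mates, which matters for the short fragments (below 100 bp) that are of particular interest in urine.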
According to a third aspect, there is provided a method of diagnosing a subject suspected of having a brain cancer as likely to have brain cancer, the method comprising: analysing one or more urine samples from the subject using the method of any embodiment of the first aspect to determine whether the one or more samples have a high or low likelihood of being from a brain cancer patient; and diagnosing the subject as likely to have a brain cancer if one or more of the one or more urine samples are determined to have a high likelihood of being from a brain cancer patient. A subject suspected of having a brain cancer may be a subject belonging to a population considered to be at risk of developing brain cancer. The risk may be low, and may be based on e.g. age, medical history, family history, the presence of genetic markers of risk in the subject or their family, etc. Thus, the method may be used for screening of a population of subjects. As such, also described herein is a method of screening for brain cancer in a population of subjects, the method comprising: analysing one or more urine samples from the subjects using the method of any embodiment of the first aspect to determine whether the one or more samples have a high or low likelihood of being from a brain cancer patient; and diagnosing a subject as likely to have a brain cancer if one or more of the one or more urine samples from the subject are determined to have a high likelihood of being from a brain cancer patient.
According to a fourth aspect, there is provided a method of selecting a subject suspected of having a brain cancer for treatment with a cancer therapy, the method comprising characterising a urine sample obtained from the subject as having a high or low likelihood of being from a cancer patient using the method of any embodiment of the first aspect, and selecting the subject for treatment with the cancer therapy if the sample is characterised as having a high likelihood of being from a brain cancer patient. The subject may have been previously treated for brain cancer, and the brain cancer therapy may be a therapy that has been previously used for the subject or a different therapy. For example, the cancer therapy may be a cancer therapy that has not previously been used for the subject. The method may further comprise obtaining an image-based analysis for the subject such as e.g. a brain MRI. In such embodiments, the step of selecting the subject for treatment with the cancer therapy may depend on the result of the image-based analysis as well as the analysis of the urine sample. For example, a different course of treatment may be selected if the sample is characterised as having a high likelihood of being from a brain cancer patient, depending on the result of the image-based diagnosis.
According to a fifth aspect, there is provided a method of selecting a subject suspected of having a brain cancer for a further diagnostic test, the method comprising characterising a urine sample obtained from the subject as having a high or low likelihood of being from a cancer patient using the method of any embodiment of the first aspect, and selecting the subject for a further diagnostic test if the sample is characterised as having a high likelihood of being from a brain cancer patient. The further diagnostic test may be an invasive diagnostic test and/or an imaging-based test. An invasive diagnostic test may comprise a biopsy, such as e.g. a blood, CSF or tissue biopsy. An imaging-based test may comprise a brain MRI.
According to a sixth aspect, there is provided a method of detecting recurrence of a brain cancer in a subject, the method comprising characterising a urine sample obtained from the subject as having a high or low likelihood of being from a cancer patient using the method of any embodiment of the first aspect, and determining that recurrence is likely to have occurred if the sample is characterised as having a high likelihood of being from a brain cancer patient. According to a related aspect, there is provided a method of detecting residual disease in a subject with brain cancer, the method comprising characterising a urine sample obtained from the subject as having a high or low likelihood of being from a cancer patient using the method of any embodiment of the first aspect, and determining that residual disease is likely to be present if the sample is characterised as having a high likelihood of being from a brain cancer patient. In accordance with any aspect described herein, the subject may have been previously treated for brain cancer. The methods according to any embodiment of any aspect may be repeated using urine samples that have been obtained from the subject at a plurality of times. For example, this may be performed in order to monitor the presence or absence of recurrence of a brain cancer in the subject, or to diagnose a brain cancer in a subject (e.g. a subject at risk of developing brain cancer). One of the advantages of the invention over previous methods to diagnose initial/recurrent brain cancer is that the method is non-invasive and simple to implement, thereby expanding the possibilities in terms of frequency of monitoring. For example, the method may be repeated using urine samples that have been obtained from the subject monthly, weekly or even daily. 
As a result, the sensitivity of detection of a brain cancer or recurrence thereof may be increased, thereby improving the chances of a good prognosis for the subject as the cancer can be treated earlier than would have otherwise been possible. This may be particularly advantageous in the context of detecting recurrence in a subject previously treated for brain cancer.
According to a further aspect, there is provided a method of monitoring brain cancer in a subject previously treated for brain cancer, the method comprising characterising a urine sample obtained from the subject as having a high or low likelihood of being from a brain cancer patient using a method of any embodiment of the first aspect. The method may further comprise determining that the previous course of treatment was ineffective and/or that the subject's cancer has relapsed if the urine sample obtained from the subject is characterised as having a high likelihood of being from a brain cancer patient. The method may further comprise selecting the subject for treatment with a brain cancer therapy if the urine sample obtained from the subject is characterised as having a high likelihood of being from a brain cancer patient. According to a further aspect, there is provided a method of treating a brain cancer in a subject, the method comprising characterising a urine sample obtained from the subject as having a high or low likelihood of being from a cancer patient using the method of any embodiment of the first aspect, and treating the subject with a cancer therapy if the sample is characterised as having a high likelihood of being from a brain cancer patient.
According to a further aspect, there is provided a method of providing a prognosis for a subject who has been diagnosed with a brain cancer, the method comprising characterising a urine sample obtained from the subject as having a high or low likelihood of being from a brain cancer patient, wherein if the sample is characterised as having a high likelihood of being from a brain cancer patient, the subject is likely to have a poorer prognosis than a subject from which a urine sample is characterised as having a low likelihood of being from a brain cancer patient. The method may comprise providing said values of said cell-free DNA fragment size metrics as input to a machine learning model trained to classify sample data into one of a plurality of classes, the plurality of classes associated with different likelihoods of being from a brain cancer patient, wherein the plurality of classes are associated with different prognosis. For example, the plurality of classes may comprise a first class associated with a high likelihood of being from a brain cancer patient, a second class associated with a low likelihood of being from a brain cancer patient, and one or more further classes associated with intermediate likelihoods of being from a brain cancer patient, wherein subjects in the first class have poorer prognosis than subjects in the second and further classes, optionally wherein subjects in at least one of the further classes have poorer prognosis than subjects in the second class.
The methods of any aspect described herein may further comprise outputting a result of the method, for example through a user interface. The result may be selected from a classification of a sample in the high/low likelihood class, a probabilistic score indicating the likelihood of the sample being from a brain cancer patient, or information derived therefrom such as a prognosis, therapeutic or diagnosis indication. The method according to any aspect may comprise one or more of the following steps: subjecting the subject to one or more further diagnostic tests if the sample has been identified as likely to be from a brain cancer patient, optionally wherein the one or more further diagnostic tests are selected from an imaging based test, and a blood, plasma or CSF-based analysis; detecting the presence of one or more genetic alterations in the sequence data obtained from the urine sample; selecting the subject for treatment with a cancer therapy, and/or treating the subject with a cancer therapy; selecting the subject for further monitoring comprising repeating the method at a later time point.
According to a further aspect, there is provided a method for providing a tool for analysing a urine sample, the method comprising: providing the value of one or more cell-free DNA fragment size metrics for a plurality of training urine samples associated with known brain cancer status, wherein the one or more cell-free DNA fragment size metrics comprise at least one metric representing the proportion of fragments in a size range that does not extend above 100 bp and that is between 10 and 100 bp wide; and training a machine learning model to classify sample data into one of at least two classes, the at least two classes comprising a first class having a high likelihood of being from a brain cancer patient and a second class having a low likelihood of being from a brain cancer patient. The method of the present aspect may have any of the features described in relation to the first aspect. The machine learning model may be trained to predict, based on said values of said one or more fragment size metrics, the likelihood of each sample being from a brain cancer patient, and to identify a threshold that applies to said likelihood and that classifies samples between at least two classes comprising a first class having a high likelihood of being from a brain cancer patient and a second class having a low likelihood of being from a brain cancer patient. The method may further comprise providing the trained machine learning model or one or more parameters thereof to a user, e.g. via a user interface, or to a computing device, or writing the trained machine learning model or one or more parameters thereof on a computer readable medium.
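By way of illustration only, the thresholding step described above may be sketched as follows. The sketch selects a cut-off on predicted likelihoods by maximising Youden's J statistic (sensitivity + specificity − 1). The use of Youden's J, the function names `choose_threshold` and `classify`, and the toy scores are assumptions made for this example; the present disclosure does not prescribe how the threshold is identified.

```python
def choose_threshold(scores, labels):
    """Return (threshold, J) maximising sensitivity + specificity - 1.

    scores: predicted likelihoods of being from a brain cancer patient.
    labels: 1 for known cancer samples, 0 for known controls.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one sample of each class")
    best_threshold, best_j = None, -1.0
    for candidate in sorted(set(scores)):
        # A sample is called "high likelihood" when its score >= candidate.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= candidate)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < candidate)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


def classify(score, threshold):
    """Assign a sample to the high or low likelihood class."""
    return "high likelihood" if score >= threshold else "low likelihood"


# Hypothetical scores, for illustration only (not data from the Examples).
toy_scores = [0.10, 0.20, 0.35, 0.60, 0.80, 0.90]
toy_labels = [0, 0, 0, 1, 1, 1]
threshold, youden_j = choose_threshold(toy_scores, toy_labels)
```

Other criteria, such as fixing a minimum specificity, could equally be used to place the threshold.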
According to a further aspect, there is provided a system comprising: a processor; and a computer readable medium comprising instructions that, when executed by the processor, cause the processor to perform the (computer-implemented) steps of the method of any preceding aspect.
According to a further aspect, there is provided a non-transitory computer readable medium or media comprising instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any embodiment of any aspect described herein.
According to a further aspect, there is provided a computer program comprising code which, when the code is executed on a computer, causes the computer to perform the method of any embodiment of any aspect described herein.
Embodiments of the present invention will now be described by way of example and not limitation with reference to the accompanying figures. However various further aspects and embodiments of the present invention will be apparent to those skilled in the art in view of the present disclosure.
The present invention includes the combination of the aspects and preferred features described except where such a combination is clearly impermissible or is stated to be expressly avoided. These and further aspects and embodiments of the invention are described in further detail below and with reference to the accompanying examples and figures.
Aspects and embodiments of the present invention will now be discussed with reference to the accompanying figures. Further aspects and embodiments will be apparent to those skilled in the art. All documents mentioned in this text are incorporated herein by reference.
In describing the present invention, the following terms will be employed, and are intended to be defined as indicated below.
A “sample” as used herein may be a biological sample, such as a cell-free DNA sample, a cell (including a circulating tumour cell) or tissue sample (e.g. a biopsy), a biological fluid, an extract (e.g. a protein or DNA extract obtained from the subject). Within the context of the present invention, the sample may be a urine sample, or a sample derived therefrom. The sample may be one which has been freshly obtained from the subject or may be one which has been processed and/or stored prior to making a determination (e.g. frozen, fixed or subjected to one or more purification, enrichment or extractions steps, including centrifugation). The sample may be derived from one or more of the above biological samples via a process of enrichment or amplification. For example, the sample may comprise a DNA library generated from the biological sample and may optionally be a barcoded or otherwise tagged DNA library. A plurality of samples may be taken from a single patient, e.g. serially during a course of treatment. Moreover, a plurality of samples may be taken from a plurality of patients. Sample preparation may be as described in the Materials and Methods section herein.
The term “sequence data” refers to information that is indicative of the presence and/or amount of genomic material in a sample that has a particular sequence. Such information may be obtained using sequencing technologies, such as e.g. next generation sequencing (NGS, such as e.g. whole exome sequencing (WES), whole genome sequencing (WGS), or sequencing of captured genomic loci (targeted or panel sequencing)), or using array technologies, such as e.g. SNP arrays, or other molecular counting assays. When NGS technologies are used, the sequence data may comprise a count of the number of sequencing reads (also referred to as “sequence reads” or “sequence read data”) that have a particular sequence. When non-digital technologies are used such as array technology, the sequence data may comprise a signal (e.g. an intensity value) that is indicative of the number of sequences in the sample that have a particular sequence, for example by comparison to an appropriate control. Sequence data may be mapped to a reference sequence, for example a reference genome, using methods known in the art (such as e.g. Bowtie (Langmead et al., 2009)). Thus, counts of sequencing reads or equivalent non-digital signals may be associated with a particular genomic location. Sequence read data may be provided or obtained directly, e.g., by sequencing the cfDNA sample or library or by obtaining or being provided with sequencing data that has already been generated, for example by retrieving sequence read data from a non-volatile or volatile computer memory, data store or network location. Where the sequence reads are obtained by sequencing a sample, the median mass of input DNA may in some cases be in the range 1-100 ng, e.g., 2-50 ng or 3-10 ng. The DNA may be amplified to obtain a library having, e.g. 100-1000 ng of DNA. The library may be obtained using a ligation-based approach. The sequencing may be paired-end sequencing.
The sequence reads may be in a suitable data format, such as FASTQ, SAM or BAM. The sequence read data, e.g., FASTQ files, may be subjected to one or more processing or clean-up steps prior to or as part of the step of collapsing reads into read families. For example, the sequence data files may be processed using one or more tools selected from FastQC v0.11.5 and a tool to remove adaptor sequences (e.g. cutadapt v1.9.1). The sequence reads (e.g. trimmed sequence reads) may be aligned to an appropriate reference genome (or may have been previously aligned to an appropriate reference sequence, e.g. in the case of SAM/BAM files), for example, the human reference genome GRCh37 for a human subject. As used herein “read” or “sequencing read” may be taken to mean the sequence that has been read from one molecule and read once. Each molecule can be read any number of times, depending on the sequencing performed.
The present invention relates broadly to the use of cfDNA fragment size metrics to characterise a urine sample from a subject. The term “cfDNA fragment size metric” refers to any metric that can be derived from a distribution of the size of cfDNA fragments in a sample. Within the context of the present invention, a cfDNA fragment size metric includes at least one metric indicative of the proportion of fragments within a particular size range. A size range may be expressed using numbers of base pairs (bp). For example, the size range 30-60 bp refers to the fragments that are between 30 bp and 60 bp in length. A metric indicative of the proportion of fragments within a size range may be a normalised number of fragments that have a length within said size range. The normalised number of fragments in a size range may be equal to the proportion of fragments in said range if the number of fragments is normalised using the total number of fragments in the sample or the total number of fragments within a predetermined size range that comprises the size range and optionally any other size range for which a metric may be calculated. A metric indicative of the proportion of fragments within a size range may be the value of a density function obtained from the distribution of fragment sizes in the sample. A cfDNA fragment size metric may be a metric that is obtained from the distribution of fragment sizes in the sample and that quantifies an aspect of the shape of the distribution, such as e.g. the amplitude of oscillations (optionally with a predetermined approximate periodicity such as e.g. 10 bp) within a predetermined range (e.g. 50-140 bp) of the distribution. Such a metric may be obtained by determining the height of local maxima and minima in the distribution for a sample within the predetermined range.
Such a metric may be obtained by identifying local maxima and minima for each of a plurality of samples, within the predetermined range, estimating the average position of each maximum and minimum across the plurality of samples, and using the height of the distribution at each of these positions for a candidate sample to calculate the amplitude of oscillations for said candidate sample. An amplitude of oscillations may be obtained for a plurality of maxima and minima by summing the height of the maxima and subtracting the sum of the height of the minima. The height of a maximum/minimum may be defined as the number of fragments with the length corresponding to said maximum/minimum divided by the total number of fragments. Identifying local maxima may comprise selecting positions y (i.e. sizes) such that the height of the distribution at y is the largest value in the interval [y−2, y+2]; local minima may be identified analogously. Any other method of identifying local minima/maxima in a distribution may be used. When the positions of maxima/minima are empirically defined (i.e. based on the distributions observed in one or more samples), the periodicity of the oscillation may not be exactly equal to a predetermined periodicity. In particular, the distance between maxima or minima may not be exactly constant, and may vary slightly within the size range in which the periodic oscillations are observed. Thus, reference to periodic oscillations of e.g. 10 bp periodicity may in practice refer to peaks that are between e.g. 8 and 12 bp apart. A set of peak locations may be obtained from a plurality of training samples, for example samples from patients that have been identified as having cancer (e.g. brain cancer).
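The identification of local maxima and minima and the calculation of the amplitude of oscillations described above may be sketched as follows. This is a minimal illustration rather than the implementation used in the Examples: the function names are hypothetical, and requiring a strict maximum or minimum within the window (so that flat regions are not reported as peaks) is an assumption.

```python
from collections import Counter


def size_density(fragment_lengths):
    """Map each fragment length (bp) to its share of all fragments."""
    counts = Counter(fragment_lengths)
    total = len(fragment_lengths)
    return {length: n / total for length, n in counts.items()}


def local_extrema(density, low=50, high=140, half_window=2):
    """Return sizes that are strict maxima / minima within [y-2, y+2]."""
    maxima, minima = [], []
    for y in range(low, high + 1):
        here = density.get(y, 0.0)
        neighbours = [density.get(x, 0.0)
                      for x in range(y - half_window, y + half_window + 1)
                      if x != y]
        if here > max(neighbours):
            maxima.append(y)
        elif here < min(neighbours):
            minima.append(y)
    return maxima, minima


def oscillation_amplitude(density, maxima, minima):
    """Sum of heights at the maxima minus sum of heights at the minima."""
    return (sum(density.get(y, 0.0) for y in maxima)
            - sum(density.get(y, 0.0) for y in minima))
```

In practice, the positions passed to `oscillation_amplitude` may be the average positions estimated across a plurality of samples rather than those found in the candidate sample itself.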
As used herein “treatment” refers to reducing, alleviating or eliminating one or more symptoms of the disease which is being treated, relative to the symptoms prior to treatment.
As used herein, the term “machine learning model” refers to a mathematical model that has been trained to predict one or more output values based on input data, where training refers to the process of learning, using training data, the parameters of the mathematical model that result in a model that can predict output values with minimal error compared to comparative (known) values associated with the training data (where these comparative values are commonly referred to as “labels”). The term “machine learning algorithm” or “machine learning method” refers to an algorithm or method that trains and/or deploys a machine learning model. “Classifier” or “classification algorithm” may be a machine learning model or algorithm that maps input data, such as cfDNA fragment size features, to a category, such as cancerous or non-cancerous origin. A classifier may produce as output a probabilistic score, which reflects the likelihood that an observation belongs to a particular category. In some embodiments, the present invention provides methods for detecting, classifying, prognosticating, or monitoring cancer in subjects. In particular, data obtained from sequence analysis, such as fragment length, may be evaluated using one or more classification algorithms. The machine learning approaches used herein may be termed “supervised” as a training set of samples with known class or outcome is used to produce a mathematical model which is then evaluated with independent validation data sets. Here, a “training set” of sequence information, e.g. fragmentation features, is used to construct a statistical model that predicts correctly the class of each sample. The model built from this training set is then tested with independent data (referred to as a test or validation set) to determine the robustness of the computer-based model. A machine learning model as described herein may comprise an ensemble of models whose predictions are combined.
Alternatively, a machine learning model may comprise a single model. Supervised methods can use a data set with reduced dimensionality (for example, the first few principal components), but typically use unreduced data, with all dimensionality. The robustness of the predictive models can also be checked using cross-validation, by leaving out selected samples from the analysis. Any classification algorithm may be used in accordance with the present disclosure, including for example a regression model, k-nearest neighbour classifier, naïve Bayes classifier, etc. The machine learning model may be a regression model, i.e. a model that captures the relationship between one or more dependent variables (the variables that are being predicted) and a set of independent variables (also referred to as predictors). Any machine learning regression model may be used according to the present invention. For example, a machine learning model may be a random forest regressor (RF), a support vector machine (SVM), a logistic regression model (LR), a generalised linear model with or without regularisation (such as e.g. a binomial generalised linear model with elastic-net regularisation, GLMEN), a decision tree, or a k-nearest neighbour regressor. As detailed in the Examples herein, logistic regression (LR), support vector machine (SVM), generalised linear models with elastic-net regularisation (GLMEN) and Random Forests (RF) were used for variable selection and the classification of samples as “healthy” or “cancer”. A random forest regressor is a model that comprises an ensemble of decision trees and outputs a value that is the average prediction of the individual trees. Decision trees perform recursive partitioning of a feature space until each leaf (final partition set) is associated with a single value of the target. Regression trees have leaves (predicted outcomes) that can be considered to form a set of continuous numbers.
Random forest regressors are typically parameterized by finding an ensemble of shallow decision trees. A logistic regression model (also referred to as “logit model”) is a statistical model that uses a logistic function to model a binary dependent variable. A support vector machine is an algorithm that identifies a hyperplane or set of hyperplanes which can be used for classification or regression. A generalized linear model is a generalization of linear regression in which the response variable can have an error distribution that departs from a normal distribution. In particular each outcome of the dependent variables is assumed to be generated from a particular distribution in an exponential family (a class of distributions that includes the normal, Poisson and gamma distributions) whose mean depends on the independent variables. A regularized regression method is a process whereby additional constraints are provided to prevent overfitting, by introducing a regularization term or penalty that imposes a cost on the optimization function to make the optimal solution unique. The elastic net regularization method linearly combines penalties of the lasso (Tibshirani, Robert (1996). “Regression Shrinkage and Selection via the lasso”. Journal of the Royal Statistical Society. Series B (methodological). Wiley. 58 (1): 267-88) and ridge (see e.g. Gruber, Marvin (1998). Improving Efficiency by Shrinkage: The James-Stein and Ridge Regression Estimators. Boca Raton: CRC Press. pp. 7-15. ISBN 0-8247-0156-9.) methods.
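For illustration, a binomial generalised linear model with elastic-net regularisation may be written as the following penalised optimisation problem. The parameterisation shown is one commonly used form (that of the glmnet software) and is given as an assumption; the present disclosure does not state which parameterisation was used.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i\,\eta_i-\log\big(1+e^{\eta_i}\big)\Big]
+\lambda\Big[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}+\alpha\,\lVert\beta\rVert_1\Big],
\qquad \eta_i=\beta_0+x_i^{\top}\beta
```

Here y_i ∈ {0, 1} is the known class of training sample i, x_i is its vector of predictors (e.g. fragment size metrics), λ ≥ 0 sets the overall strength of the penalty and α ∈ [0, 1] sets the mix between the two penalties: α = 1 gives the lasso penalty and α = 0 gives the ridge penalty.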
“Computer-implemented method” where used herein is to be taken as meaning a method whose implementation involves the use of a computer, computer network or other programmable apparatus, wherein one or more features of the method are realised wholly or partly by means of a computer program. The systems and methods described herein may be implemented in a computer system, in addition to the structural components and user interactions described. As used herein, the term “computer system” includes the hardware, software and data storage devices for embodying a system or carrying out a method according to the above described embodiments. For example, a computer system may comprise a processing unit, such as a central processing unit (CPU) and/or a graphics processing unit (GPU), input means, output means and data storage, which may be embodied as one or more connected computing devices. Preferably the computer system has a display or comprises a computing device that has a display to provide a visual output display. The data storage may comprise RAM, disk drives or other computer readable media. The computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network. It is explicitly envisaged that the computer system may consist of or comprise a cloud computer. The methods described herein may be provided as computer programs or as computer program products or computer readable media carrying a computer program which is arranged, when run on a computer, to perform the method(s) described herein. As used herein, the term “computer readable media” includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system.
The media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media.
The methods described herein find use in detecting the presence of, growth of, prognosis of, regression of, residual disease, treatment response of, or recurrence of a brain cancer in a subject, by analysing a urine sample from said subject. Each of these uses is based on the highly accurate detection of cancer-associated patterns in the pool of cfDNA molecules in urine samples using the methods described herein, which are in particular able to discriminate between samples from brain cancer patients and samples from patients without a brain cancer (including healthy patients and patients with other central nervous system diseases).
Whether a prognosis is considered good or poor may vary between cancers and stage of disease. In general terms a good prognosis is one where the overall survival (OS), disease free survival (DFS) and/or progression-free survival (PFS) is longer than that of a comparative group or value, such as e.g. the average for that stage and cancer type. A prognosis may be considered poor if OS, DFS and/or PFS is lower than that of a comparative group or value, such as e.g. the average for that stage and type of cancer. Thus, in general terms, a “good prognosis” is one where survival (OS, DFS and/or PFS) and/or disease stage of an individual patient can be favourably compared to what is expected in a population of patients within a comparable disease setting. Similarly, a “poor prognosis” is one where survival (OS, DFS and/or PFS) of an individual patient is lower (or disease stage worse) than what is expected in a population of patients within a comparable disease setting.
The following is presented by way of example and is not to be construed as a limitation to the scope of the claims.
A total of 35 glioma patients (30 high grade glioma HGG, 5 low grade glioma LGG) were recruited. Among the 5 LGG, 3 were diffuse astrocytoma, 1 was an oligodendroglioma and 1 a pilocytic astrocytoma. Among the 30 HGG, 29 were glioblastomas (GBM) and 1 was an anaplastic oligodendroglioma (AO). Matched tumour tissue, CSF, plasma, urine and buffy coat samples were collected for 8 patients. In addition, urine samples were collected from 26 healthy volunteers and 27 patients with other pathologies of the brain or central nervous system (CNS). Body fluid samples were analysed using two sequencing based approaches: patient-specific hybrid capture panels, and sWGS (shallow whole genome sequencing).
Lumbar puncture was performed immediately prior to craniotomy for tumour debulking. After sterile field preparation, the thecal sac was cannulated between the L3 and L5 intervertebral spaces using a 0.61 mm gauge lumbar puncture needle, and 10 mL of CSF was removed. After collection, CSF, whole blood and urine samples were immediately placed on ice and then rapidly transferred to a pre-chilled centrifuge for processing. For urine samples, 0.5M EDTA was added within an hour of collection. Samples were centrifuged at 1500 g at 4° C. for 10 minutes. Supernatant was removed and further centrifuged at 20,000 g for 10 minutes, and aliquoted into 2 mL microtubes (Sarstedt, Germany) for storage at −80° C. Tumour tissue DNA was extracted and isolated as described previously (Mouliere et al, 2018b). DNA was extracted from fluids using the QIAsymphony platform (Qiagen, Germany). Up to 10 mL of plasma, 10 mL of urine and 8 mL of CSF were used per sample. DNA from cancer plasma, urine and CSF samples was eluted in 90 μL, and further concentrated down to 30 μL using a Speed-Vac concentrator (Eppendorf, Germany).
In order to identify patient-specific somatic mutations, the inventors first performed whole exome sequencing (WES) of all tumour tissue and germline buffy coat DNA samples. Fifty nanograms of DNA were fragmented to ˜120 bp by acoustic shearing (Covaris) according to the manufacturer's instructions. Libraries were prepared using the Thruplex DNA-Seq protocol (Rubicon Genomics) with 5 cycles of PCR. Libraries were quantified using quantitative PCR (KAPA library quantification, KAPA Biosystems) and pooled for exome capture (TruSeq Exome Enrichment Kit, Illumina). Exome capture was performed with the addition of i5 and i7 specific blockers (IDT) during the hybridization steps to prevent adaptor ‘daisy chaining’. Pools were concentrated using a SpeedVac vacuum concentrator (Eppendorf, Germany). After capture, 8 cycles of PCR were performed. Enriched libraries were quantified using quantitative PCR (KAPA library quantification, KAPA Biosystems), DNA fragment sizes were assessed by Bioanalyzer (2100 Bioanalyzer, Agilent Genomics) and captures were pooled in equimolar ratio for paired-end next generation sequencing on a HiSeq4000 (Illumina). Sequencing reads were de-multiplexed, allowing zero mismatches in barcodes. The reference genome was the GRCh37/b37/hg19 human reference genome (the 1000 Genomes GRCh37-derived reference genome), which includes chromosomal plus unlocalized and unplaced contigs, the rCRS mitochondrial sequence (AC: NC_012920), Human herpesvirus 4 type 1 (AC: NC_007605) and decoy sequence derived from HuRef, Human Bac and Fosmid clones and NA12878. The sequence data of the patient samples were aligned to the reference genome using BWA-MEM v0.7.15. Duplicate reads were marked using Picard v1.122 (http://broadinstitute.github.io/picard). Somatic SNV and indel mutations were called using GATK Mutect2 (Genome Analysis Toolkit, https://www.broadinstitute.org/gatk) in tumour-normal pair mode using buffy coat as the normal.
MAFs for each single-base locus were calculated with MuTect2 for all bases with PHRED quality ≥30. After MuTect2, we applied filtering parameters so that a mutation was called if no mutant reads for an allele were observed in germline DNA at a locus that was covered at least 10×, and if at least 4 reads supporting the mutant were found in the tumour data with at least 1 read on each strand (forward and reverse). Variants were annotated using Ensembl Variant Effect Predictor with details about consequence on protein coding, accession numbers for known variants and associated allele frequencies from the 1000 Genomes project.
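The post-calling filter described above may be sketched as follows. This is an illustrative sketch only: the `CandidateVariant` record and its field names are hypothetical and do not correspond to the output format of any particular variant caller.

```python
from dataclasses import dataclass


@dataclass
class CandidateVariant:
    germline_depth: int          # reads covering the locus in buffy coat DNA
    germline_mutant_reads: int   # mutant reads observed in buffy coat DNA
    tumour_mutant_forward: int   # tumour mutant reads on the forward strand
    tumour_mutant_reverse: int   # tumour mutant reads on the reverse strand


def passes_filters(v, min_germline_depth=10, min_mutant_reads=4):
    """Apply the germline and tumour support criteria to one candidate."""
    # Germline locus covered at least 10x with no mutant reads.
    germline_clean = (v.germline_depth >= min_germline_depth
                      and v.germline_mutant_reads == 0)
    # At least 4 mutant reads in the tumour, with at least 1 on each strand.
    tumour_total = v.tumour_mutant_forward + v.tumour_mutant_reverse
    tumour_supported = (tumour_total >= min_mutant_reads
                        and v.tumour_mutant_forward >= 1
                        and v.tumour_mutant_reverse >= 1)
    return germline_clean and tumour_supported
```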
Hybrid-based capture for the different body fluids (CSF, plasma, urine) analysis was designed to cover the variants identified above for each patient using the SureDesign software (Agilent). In addition, 52 genes of interest for glioma were included in the tumor-guided sequencing panel based on the TCGA databases. Patients were separated into 2 panels covering all the mutations included for those patients (4 patients per panel). Panel 1 covered in total 526 kbp (5841 regions) and panel 2 covered 526 kbp (5701 regions). Panels ranged in size between 1.430 Mb (panel 1) and 1.404 Mb (panel 2) with 120 bp RNA baits. Baits were designed with 5× tiling density, moderately stringent masking and balanced boosting. 99.7% of the targets had baits designed successfully. Indexed sequencing libraries were prepared using the Thruplex tag-seq kits (Takara). Libraries were captured either in 1-plex for plasma and urine samples or 3-plex for CSF samples (to a total of 1000 ng capture input) using the Agilent SureSelectXTHS protocol, with the addition of i5 and i7 blocking oligos (IDT), as recommended by the manufacturer for compatibility with ThruPLEX libraries. Custom Agilent SureSelectXTHS baits were used. 13 cycles were used for amplification of the captured libraries. Post-capture libraries were purified with AMPure XT beads, then quantified using quantitative PCR (KAPA library quantification, KAPA Biosystems), and DNA fragment sizes controlled by Bioanalyzer (2100 Bioanalyzer, Agilent Genomics). Capture libraries were then pooled in equimolar ratios for paired end next generation sequencing on a HiSeq4000 (Illumina).
Sequencing reads were de-multiplexed, allowing zero mismatches in barcodes. Cutadapt v1.9.1 was used to remove known 5′ and 3′ adaptor sequences specified in a separate FASTA file of adaptor sequences. Trimmed FASTQ files were aligned to the UCSC hg19 genome using BWA-mem v0.7.13 with a seed length of 19. Error suppression was carried out on ThruPLEX Tag-seq library BAM files using CONNOR. The consensus frequency threshold (−f) was set as 0.9 (90%), and the minimum family size threshold (−s) was varied between 2 and 5 for characterization of error rates (Wan et al, 2020). Patient-specific sequencing data consist of informative reads at multiple known patient-specific loci that were identified from tumour sequencing (see above).
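The role of the two error suppression thresholds may be illustrated with the simplified sketch below. It is not CONNOR's implementation: it assumes that reads have already been grouped into families and are of equal length, and the function name and the use of “N” for positions without sufficient agreement are assumptions made for this example.

```python
from collections import Counter


def family_consensus(reads, min_family_size=2, consensus_freq=0.9):
    """Collapse one read family into a consensus sequence, or return None.

    min_family_size plays the role of the minimum family size threshold and
    consensus_freq the role of the consensus frequency threshold.
    """
    if len(reads) < min_family_size:
        return None  # family too small to support error suppression
    if len({len(r) for r in reads}) != 1:
        raise ValueError("this sketch expects reads of equal length")
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        # Keep the majority base only if enough family members agree.
        consensus.append(base if count / len(reads) >= consensus_freq else "N")
    return "".join(consensus)
```

With a threshold of 0.9, a base supported by 9 of 10 family members is retained, whereas a base supported by only 2 of 3 members is masked.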
sWGS
Indexed sequencing libraries were prepared using the ThruPLEX-Plasma Seq kit (Rubicon Genomics). Libraries were pooled in equimolar amounts and sequenced to <0.4× depth of coverage on a HiSeq 4000 (Illumina) generating 150-bp paired-end reads. Sequence data was analysed using an in-house pipeline that consists of the following steps. Paired end sequence reads were aligned to the human reference genome (GRCh37) using BWA-mem following the removal of contaminating adapter sequences. PCR and optical duplicates were marked using MarkDuplicates (Picard Tools) feature and these were excluded from downstream analysis along with reads of low mapping quality and supplementary alignments. When necessary, reads were down-sampled to 10 million in all samples for comparison purposes.
The preliminary analysis was carried out on 93 samples (40 cancers and 53 non-cancer controls). For each sample the following features were calculated from sWGS data: P(30-60), P(61-90), P(91-120), P(121-150), P(151-180), P(181-210), P(211-240), P(241-270), P(271-300). The data was arranged in a matrix where the rows represent each sample and the columns held the aforementioned features with an extra “class” column with the binary labels of “cancer” or “controls”. The amplitude of the 10 bp periodic peaks (OSC_10 bp) was calculated from the sWGS data as follows: from the samples with clear peaks, the local maxima (“peak”) and minima (“valley”) in the range 50-140 bp were calculated. The average of their positions across the samples was calculated (minima: 62, 73, 84, 96, 106, 116, 126, 137; and maxima: 58, 69, 80, 92, 102, 112, 122, 134). To compute the “amplitude statistic”, the inventors calculated the sum of the height of the maxima and subtracted the sum of the height of the minima. The larger this difference, the more distinct are the peaks. The height of the x bp peak is defined as the number of fragments with length x divided by the total number of fragments. To define local maxima, the inventors selected the positions y such that the height at y was the largest value in the interval [y−2, y+2]. The same rationale was used to pick minima. PCA was calculated and visualized in R using the package ggbiplot. The tSNE analysis was performed in R with the Rtsne package using 1000 iterations, Spearman correlations and a perplexity score of 8. Plots were generated in R using ggplot2. ROC curves were plotted in R with the plotROC package.
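The feature calculation described above may be sketched as follows for a single sample. The sketch assumes that fragment lengths have already been extracted from the aligned paired-end reads, and that each P(a-b) feature is normalised by the total number of fragments in the sample; both are assumptions, and the function name `fragment_features` is hypothetical. The maxima and minima positions are those listed above.

```python
from collections import Counter

SIZE_BINS = [(30, 60), (61, 90), (91, 120), (121, 150), (151, 180),
             (181, 210), (211, 240), (241, 270), (271, 300)]
MINIMA = [62, 73, 84, 96, 106, 116, 126, 137]
MAXIMA = [58, 69, 80, 92, 102, 112, 122, 134]


def fragment_features(fragment_lengths):
    """Return the P(a-b) proportions and OSC_10bp for one sample."""
    total = len(fragment_lengths)
    if total == 0:
        raise ValueError("no fragments in sample")
    features = {}
    for low, high in SIZE_BINS:
        in_bin = sum(1 for n in fragment_lengths if low <= n <= high)
        features[f"P({low}-{high})"] = in_bin / total
    counts = Counter(fragment_lengths)
    # Height at x bp: fragments of length x divided by all fragments.
    peak_sum = sum(counts[x] / total for x in MAXIMA)
    valley_sum = sum(counts[x] / total for x in MINIMA)
    features["OSC_10bp"] = peak_sum - valley_sum
    return features
```

One such dictionary per sample, together with a “class” label, gives the rows of the feature matrix described above.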
The following analysis was carried out in R utilising the randomForest and pROC packages, and in Python using the scikit-learn and H2O Python API modules. The pairwise correlations between the features were calculated to assess multi-collinearity in the dataset (
All statistical analyses were performed using the R programming language (v3.4.3; www.r-project.org). We also used the ggplot2 (v3.2.0) and ggpubr (v0.2) packages.
Raw sequencing data is deposited at the European Genome-phenome Archive (https://ega-archive.org/studies/EGAS00001004355).
Using paired-end sequencing reads from hybrid capture panels (targeting the 52 most frequently mutated genes in glioma (Brennan et al., 2013) and single nucleotide variants identified by comparing tumour and non-tumour sequences in 8 glioma patients), the inventors determined the distribution of read lengths (fragmentation patterns) of mutant and non-mutant cfDNA, i.e. reads carrying mutations previously identified in matched tissue and reads not carrying such mutations, in the CSF (
Using paired-end sWGS, the inventors analysed the cfDNA fragmentation patterns in 40 urine samples collected pre-treatment from 35 patients with gliomas (30 HGG and 5 LGG). They also sequenced urine cfDNA from 53 controls: 26 healthy individuals and 27 patients with other pathologies affecting the central nervous system (cervical myelopathy, cerebral artery aneurysm (both ruptured and unruptured), hydrocephalus and Parkinson's disease). Baseline urine samples from patients with cancer and other CNS pathologies were collected prior to surgery, and follow-up samples were collected for a subset of the cases. Age and other physiological properties of the cases and controls were recorded. All urine samples were collected and processed according to the same protocol and within the same time-frame, to reduce potential biases due to differences in pre-analytical processing (see Materials and Methods). The mean age of the healthy individuals was lower than that of the cancer cases (41 years and 61 years, respectively). The inventors therefore evaluated the influence of donor age on the cfDNA fragment size distribution in the cohort of healthy individuals, and observed no significant difference (
The inventors demonstrated previously that cfDNA fragmentation features could be used to improve the detection of glioma in plasma samples (Mouliere et al, 2018a). In plasma samples, a random forest model comprising a copy number-based feature (t-MAD) and 4 fragment size features was found to perform best at distinguishing cancer from healthy samples. The fragment size features were OSC10, p(160-180), p(180-220) and p(250-320): respectively, the amplitude of the 10 bp peaks (oscillations) in the distribution of fragment lengths in the 75-150 bp range, and the proportions of fragments in the 160-180 bp, 180-220 bp and 250-320 bp ranges. Here they explored whether such features in urine could be used to enhance detection of tumour DNA in glioma patients, and further to enable this detection in the presence of confounding factors, such as the possible influence of other CNS disease on the cfDNA fragmentation profile. A predictive analysis was performed using 10 fragmentation features across 93 urine samples (40 samples from 35 cancer cases and 53 samples from 53 non-cancer controls). These ten fragmentation features were based on the proportion (P) of fragments in the following size ranges in sWGS data from each sample, using 30 bp bins: P(30 to 60), P(61 to 90), P(91 to 120), P(121 to 150), P(151 to 180), P(181 to 210), P(211 to 240), P(241 to 270) and P(271 to 300) (
Variable selection and the classification of samples as “non-cancer” or “cancer” were performed using logistic regression (LR) and other machine learning models trained and validated on 40 cancer samples and 53 controls (
To better understand the information that can be obtained from fragment size features, the inventors evaluated the cross-correlations of features in the set of samples (40 cancers, comprising HGG and LGG; 53 controls, comprising healthy individuals and patients with non-cancer CNS pathologies) (
Tumour-derived DNA has previously been detected in the CSF of patients with glioma and may be helpful for tumour genomic analysis (De Mattos-Arruda et al, 2015; Pentsova et al, 2016; Wang et al, 2015; Pan et al, 2015; Miller et al, 2019; Mouliere et al, 2018b). However, the difficulty of longitudinal CSF collection, together with the variability in the tumour fraction detected, may hamper the clinical implementation and applicability of CSF analysis. Reports differ on the level of detection of ctDNA in the plasma of glioma patients (Bettegowda et al, 2014; Pan et al, 2019; Mouliere et al, 2018a; Westphal & Lamszus, 2015). No prior studies had, to our knowledge, explored ctDNA analysis in urine samples from glioma patients.
Here, the inventors have shown that ctDNA can be detected, at very low levels, in the urine and plasma of the majority of patients with high grade glioma. The inventors identified size differences between mutant and non-mutant DNA using tumour-guided sequencing in CSF, plasma and urine of glioma patients. They analysed the size distributions of mutant ctDNA by sequencing >435 potentially mutated loci per patient at high depth. This revealed reads that could be unequivocally identified as tumour-derived, and allowed a direct comparison of the fragmentation features of ctDNA with those of bulk cfDNA. Whilst this is a powerful technique, a potential limitation is that capture-based sequencing may be biased by probe capture efficiency and therefore may not accurately reflect the ratios between tumour and non-tumour DNA, especially for short fragments <100 bp. Nevertheless, this observation was important as it strongly suggested that a ctDNA size shift could be observed in the plasma and the urine of glioma patients. For plasma, this agrees with previous data generated using non-capture-based methods.
The inventors complemented this observation by analysing the genome-wide fragmentation patterns of urine cfDNA in 40 samples from 35 glioma patients using sWGS. They identified cfDNA fragmentation features that could distinguish urine samples from glioma patients from those of controls, without a priori knowledge of somatic aberrations. The median size of cfDNA fragments in urine from control individuals without glioma (137 bp), patients with other CNS diseases (121 bp) and patients with gliomas (101 bp) was different from previous reports on other cancer types (Cheng et al, 2019; Markus et al, 2021). This could indicate that the cfDNA fragmentation profile is biased by the collection procedure and pre-analytical factors. It is also possible that the shortening of cfDNA in the urine of glioma patients compared to controls is due, at least in part, to differences in patient physiology and that this may directly contribute to the detection of a fragmentation-based glioma cfDNA signal in urine. Beyond the tissue of cancer origin, it is likely that urine cfDNA fragmentation is also influenced by patient physiology (Teo et al, 2019) and pre-analytical parameters (Bosschieter et al, 2018). We attempted to mitigate these effects by assessing the effect of age on the cfDNA fragmentation of urine samples, by controlling for the duration of pre-operative fasting, by using standardised sample preparation and DNA isolation, and by assessing the effect of tumour size on detectability. A more in-depth analysis of how biological variables affect cfDNA fragmentation in urine samples will be needed to determine the extent to which these factors lead to different fragmentation patterns in different cohorts.
Such pre-analytical differences notwithstanding, by using a binary classification the inventors observed that the shorter size ranges (P30-60 and P61-90) of cfDNA fragments in urine samples showed larger differences between cancer cases and controls. These size ranges were similar to the size range enriched in mutant cfDNA in urine as observed using tumour-guided capture panels. With 4 machine learning analyses, they identified and tested ten size features that can be informative for classifying urine samples as being derived either from healthy individuals or from patients with glioma. The LR, RF, SVM and GLMEN models correctly classified samples derived from patients with glioma in most cases (median AUC=0.90, 0.91, 0.80 and 0.91, respectively). The GLMEN model distinguished samples from cancer patients from samples from controls with a sensitivity of 65% and a specificity of 95% in a cohort of 93 urine samples (40 cancer samples and 53 control samples). These results from urine samples of glioma patients show similar performance to that demonstrated in plasma in the inventors' previous work, which identified 63% of plasma samples from glioma patients with 94% specificity using another RF model based on integration of fragmentation features in plasma cfDNA (Mouliere et al, 2018a). Together with other studies that utilise methylation patterns in plasma (Sabedot et al, 2021; Nassiri et al, 2020), our work suggests that despite a low detection rate of mutations, epigenetic signals (i.e. fragmentation patterns) can be robustly detected in the plasma and also the urine of glioma patients.
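For readers less familiar with the performance metrics quoted above, the following Python sketch shows one way to compute an AUC and the sensitivity at a fixed minimum specificity from per-sample classifier scores. It is an illustration only, not the inventors' evaluation code: the function names are hypothetical, and it assumes that higher scores indicate cancer and that a sample is called "cancer" when its score exceeds a threshold.

```python
def roc_auc(cancer_scores, control_scores):
    """AUC as the probability that a cancer sample outscores a control.

    Ties between a cancer score and a control score count as 0.5.
    """
    wins = 0.0
    for c in cancer_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cancer_scores) * len(control_scores))


def sensitivity_at_specificity(cancer_scores, control_scores,
                               min_specificity=0.95):
    """Best sensitivity over thresholds whose specificity is >= min_specificity.

    A sample is called 'cancer' when its score is strictly above the threshold.
    Candidate thresholds are the observed score values.
    """
    best = 0.0
    for threshold in sorted(set(cancer_scores) | set(control_scores)):
        # Specificity: fraction of controls correctly called non-cancer
        specificity = (sum(k <= threshold for k in control_scores)
                       / len(control_scores))
        if specificity >= min_specificity:
            # Sensitivity: fraction of cancer samples correctly called cancer
            sensitivity = (sum(c > threshold for c in cancer_scores)
                           / len(cancer_scores))
            best = max(best, sensitivity)
    return best
```

Reporting sensitivity at a high fixed specificity, as in the text, reflects the fact that a detection test is typically required to keep false positives among non-cancer controls low.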
Thus, the inventors have demonstrated that classification algorithms can utilise information derived from cfDNA fragmentation features to improve the detection of glioma in patients using urine samples. These techniques may therefore provide a method to detect glioma in a truly non-invasive manner (using urine), thus avoiding the morbidity and risk of mortality associated with CSF sampling. These results encourage further confirmation through the analysis of a larger cohort of both glioma patients and control individuals without cancer.
All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety.
The specific embodiments described herein are offered by way of example, not by way of limitation. Any sub-titles herein are included for convenience only, and are not to be construed as limiting the disclosure in any way.
Number | Date | Country | Kind |
---|---|---|---|
2109941.1 | Jul 2021 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/069203 | 7/8/2022 | WO |