Clonal hematopoiesis (CH), characterized as the homogenization of the hematopoietic stem cell population, can range from benign age-related CH (ARCH) to CH driven by specific oncogenic driver point mutations (e.g., CHIP). CH carries a significantly increased risk of cardiovascular events (e.g., myocardial infarctions), epithelial malignancies, and development of a hematological malignancy (11-13 times higher). The genetic complexity (known mutation type, mutation burden, and their frequencies) is an important distinguishing feature between CH and more aggressive disease. However, several drivers of mono/oligoclonal development are currently unknown, and molecular diagnostics fail to screen for structural variants (copy number changes and translocations) that can be informative for prognosis.
At present, the gold standard for diagnosing hematological diseases without abnormal cytomorphologies is deoxyribonucleic acid (DNA) sequencing or single nucleotide polymorphism (SNP) arrays of specific point mutations whose variant frequencies are ≥2%. Studies have shown this diagnostic approach is insufficient for detecting CH in patients with either unknown oncogenic drivers or structural genomic variants (copy number changes or translocations), some of which are the strongest prognostic indicators of progression from CH to myeloid leukemias. Further, this method is unable to determine the aggressiveness of CH clones and the life histories of CH in patients, preventing insights for clinical decision making beyond the presence or absence of CH or myelodysplastic syndrome (MDS).
Thus, there is a need in the art for diagnostic approaches that are far more powerful, cheaper, and provide earlier insights into hematological diseases.
In some implementations, the techniques described herein relate to a computer-implemented method including: receiving patient data associated with a blood specimen from a subject, the patient data including fluctuating methylation clock (FMC) data; inputting the FMC data into a trained machine learning model; and predicting, using the trained machine learning model, a hematological condition in the subject.
In some implementations, the FMC data includes DNA methylation fluctuation data for a plurality of fluctuating CpG (fCpG) sites.
In some implementations, the patient data further includes one or more DNA alteration markers. For example, the one or more DNA alteration markers can include a single nucleotide variant (SNV), a copy number alteration (CNA), or a structural variant (SV).
In some implementations, the step of predicting, using the trained machine learning model, the hematological condition includes diagnosing the subject with the hematological condition.
In some implementations, the step of predicting, using the trained machine learning model, the hematological condition includes providing a prognosis of the hematological condition.
In some implementations, the hematological condition is clonal hematopoiesis (CH). In some implementations, the hematological condition is clonal hematopoiesis of indeterminate potential (CHIP). In some implementations, the hematological condition is age related clonal hematopoiesis (ARCH).
In some implementations, the trained machine learning model is a random forest classifier.
In some implementations, the techniques described herein relate to a method including: receiving a blood specimen from a subject; obtaining, using a microarray, fluctuating methylation clock (FMC) data associated with the blood specimen; inputting, using a computing device, the FMC data into a trained machine learning model; and predicting, using the trained machine learning model, a hematological condition in the subject.
In some implementations, the method further includes recommending, using the computing device, a course of treatment for the subject based on the predicted hematological condition.
In some implementations, the method further includes recommending performing a course of treatment on the subject based on the predicted hematological condition.
In some implementations, the techniques described herein relate to a system including: at least one processor and a memory operably coupled to the at least one processor, the memory having computer-executable instructions stored thereon that, when executed by the at least one processor, cause the processor to: receive patient data associated with a blood specimen from a subject, the patient data including fluctuating methylation clock (FMC) data; input the FMC data into a trained machine learning model; and predict, using the trained machine learning model, a hematological condition in the subject.
In some implementations, the FMC data includes DNA methylation fluctuation data for a plurality of fluctuating CpG (fCpG) sites.
In some implementations, the patient data further includes one or more DNA alteration markers. For example, the one or more DNA alteration markers can include a single nucleotide variant (SNV), a copy number alteration (CNA), or a structural variant (SV).
In some implementations, the step of receiving, using the trained machine learning model, the predicted hematological condition includes receiving a diagnosis or prognosis of the hematological condition.
In some implementations, the predicted hematological condition is clonal hematopoiesis (CH), clonal hematopoiesis of indeterminate potential (CHIP), or age related clonal hematopoiesis (ARCH).
In some implementations, the trained machine learning model is a random forest classifier.
It should be understood that the above-described subject matter may also be implemented as a computer-controlled apparatus, a computer process, a computing system, or an article of manufacture, such as a computer-readable storage medium.
Other systems, methods, features and/or advantages will be or may become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features and/or advantages be included within this description and be protected by the accompanying claims.
The components in the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding parts throughout the several views.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure. As used in the specification, and in the appended claims, the singular forms “a,” “an,” “the” include plural referents unless the context clearly dictates otherwise. The term “comprising” and variations thereof as used herein is used synonymously with the term “including” and variations thereof and are open, non-limiting terms. The terms “optional” or “optionally” used herein mean that the subsequently described feature, event or circumstance may or may not occur, and that the description includes instances where said feature, event or circumstance occurs and instances where it does not. Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, an aspect includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another aspect. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint.
As used herein, the terms “about” or “approximately,” when referring to a measurable value such as an amount, a percentage, and the like, are meant to encompass variations of ±20%, ±10%, ±5%, or ±1% from the measurable value.
“Administration” or “administering” to a subject includes any route of introducing or delivering to a subject an agent. Administration can be carried out by any suitable means for delivering the agent. Administration includes self-administration and the administration by another.
The term “subject” is defined herein to include animals such as mammals, including, but not limited to, primates (e.g., humans), cows, sheep, goats, horses, dogs, cats, rabbits, rats, mice and the like. In some embodiments, the subject is a human.
The term “artificial intelligence” is defined herein to include any technique that enables one or more computing devices or computing systems (i.e., a machine) to mimic human intelligence. Artificial intelligence (AI) includes, but is not limited to, knowledge bases, machine learning, representation learning, and deep learning. The term “machine learning” is defined herein to be a subset of AI that enables a machine to acquire knowledge by extracting patterns from raw data. Machine learning techniques include, but are not limited to, logistic regression, support vector machines (SVMs), decision trees, random forest classifiers, and artificial neural networks. The term “representation learning” is defined herein to be a subset of machine learning that enables a machine to automatically discover representations needed for feature detection, prediction, or classification from raw data. Representation learning techniques include, but are not limited to, autoencoders. The term “deep learning” is defined herein to be a subset of machine learning that enables a machine to automatically discover representations needed for feature detection, prediction, classification, etc. using layers of processing. Deep learning techniques include, but are not limited to, artificial neural networks or multilayer perceptrons (MLPs).
Machine learning models include supervised, semi-supervised, and unsupervised learning models. In a supervised learning model, the model learns a function that maps an input (also known as feature or features) to an output (also known as target or targets) during training with a labeled data set (or dataset). In an unsupervised learning model, the model learns patterns (e.g., structure, distribution, etc.) within an unlabeled data set. In a semi-supervised model, the model learns a function that maps an input (also known as feature or features) to an output (also known as target or targets) during training with both labeled and unlabeled data.
Referring now to
As described above, a supervised machine learning model “learns” a function that maps an input 120 (also known as feature or features) to an output 140 (also known as target or targets) during training with a labeled data set. Machine learning model training is discussed in further detail below. In some implementations, a trained supervised machine learning model is configured to classify the input 120 into one of a plurality of target categories (i.e., the output 140). In other words, the trained model can be deployed as a classifier. In other implementations, a trained supervised machine learning model is configured to provide a probability of a target (i.e., the output 140) based on the input 120. In other words, the trained model can be deployed to perform a regression.
Optionally, in some implementations, the machine learning model 100 is a random forest classifier. A random forest classifier is a supervised classification model that uses a series of decision tree classifiers. This disclosure contemplates that the random forest classifier can be implemented using a computing device (e.g., a processing unit and memory as described herein). Random forest classifiers are trained with a data set by determining decisions across connected nodes, where each node represents a feature of the data. With sub-sampling of the data features, this results in a probability distribution of a label given an observation. Random forest classifiers are known in the art and are therefore not described in further detail herein.
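By way of non-limiting illustration, the ideas of bootstrap sampling, feature sub-sampling, and a vote-based label probability can be sketched with a toy ensemble of one-split trees (decision stumps). This is a simplified stand-in and not the classifier of this disclosure; the function names `fit_stump`, `fit_forest`, and `predict_proba` and all default parameter values are assumptions introduced only for this sketch.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # No valid split exists: always predict the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random subset of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)   # feature sub-sample
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        )
    return forest

def predict_proba(forest, row):
    """Fraction of trees voting for each label, given one observation."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return {label: n / len(forest) for label, n in votes.items()}
```

In practice, an established library implementation of a random forest with full-depth trees would typically be used in place of this sketch.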
Optionally, in some implementations, the machine learning model 100 is a support vector machine (SVM). An SVM is a supervised learning model that uses statistical learning frameworks to predict the probability of a target. This disclosure contemplates that the SVM can be implemented using a computing device (e.g., a processing unit and memory as described herein). SVMs can be used for classification and regression tasks. SVMs are trained with a data set to maximize or minimize an objective function, for example a measure of the SVM's performance, during training. SVMs are known in the art and are therefore not described in further detail herein.
Optionally, in some implementations, the machine learning model 100 is a Naïve Bayes (NB) classifier. An NB classifier is a supervised classification model that is based on Bayes' Theorem and assumes independence among features (i.e., presence of one feature in a class is unrelated to presence of any other features). This disclosure contemplates that the NB classifier can be implemented using a computing device (e.g., a processing unit and memory as described herein). NB classifiers are trained with a data set by computing the conditional probability distribution of each feature given a label and applying Bayes' Theorem to compute the conditional probability distribution of a label given an observation. NB classifiers are known in the art and are therefore not described in further detail herein.
Optionally, in some implementations, the machine learning model 100 is an artificial neural network (ANN). An artificial neural network (ANN) is a computing system including a plurality of interconnected neurons (also referred to as “nodes”). This disclosure contemplates that the nodes can be implemented using a computing device (e.g., a processing unit and memory as described herein). The nodes can be arranged in a plurality of layers such as input layer, output layer, and optionally one or more hidden layers. An ANN having hidden layers can be referred to as deep neural network or multilayer perceptron (MLP). Each node is connected to one or more other nodes in the ANN. For example, each layer is made of a plurality of nodes, where each node is connected to all nodes in the previous layer. The nodes in a given layer are not interconnected with one another, i.e., the nodes in a given layer function independently of one another. As used herein, nodes in the input layer receive data from outside of the ANN, nodes in the hidden layer(s) modify the data between the input and output layers, and nodes in the output layer provide the results. Each node is configured to receive an input, implement an activation function (e.g., binary step, linear, sigmoid, tanh, or rectified linear unit (ReLU) function), and provide an output in accordance with the activation function. Additionally, each node is associated with a respective weight. ANNs are trained with a dataset to maximize or minimize an objective function. In some implementations, the objective function is a cost function, which is a measure of the ANN's performance (e.g., error such as L1 or L2 loss) during training, and the training algorithm tunes the node weights and/or bias to minimize the cost function. This disclosure contemplates that any algorithm that finds the maximum or minimum of the objective function can be used for training the ANN.
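As a minimal, non-limiting sketch of the training and prediction steps just described, the following example estimates a per-label distribution for each feature and then applies Bayes' Theorem. Modeling each feature with a Gaussian likelihood is an assumption made for this illustration, as are the function names `fit_nb` and `predict_nb`.

```python
import math
from collections import defaultdict

def fit_nb(rows, labels):
    """Estimate a prior and a per-feature (mean, variance) for each label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small floor keeps the variance strictly positive.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model

def predict_nb(model, row):
    """Return P(label | row), treating the features as independent."""
    log_post = {}
    for label, (prior, stats) in model.items():
        log_p = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            log_p += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        log_post[label] = log_p
    # Normalize in log space to avoid underflow.
    top = max(log_post.values())
    weights = {k: math.exp(v - top) for k, v in log_post.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```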
Training algorithms for ANNs include, but are not limited to, backpropagation. ANNs are known in the art and are therefore not described in further detail herein.
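For illustration only, the layered structure, activation functions, and cost function described above can be sketched as a forward pass through fully connected layers. The helper names (`dense`, `forward`, `l2_loss`) are hypothetical, and training (e.g., backpropagation) is deliberately omitted.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: every node sees all outputs of the previous layer."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Propagate inputs through a list of (weights, biases, activation) layers."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values

def l2_loss(predicted, target):
    """Sum of squared errors, one possible cost function."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

# Two inputs -> two hidden ReLU nodes -> one sigmoid output node.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.0], sigmoid),
]
output = forward([2.0, 1.0], layers)
```

A training algorithm would adjust the weights and biases in `layers` to reduce `l2_loss` over a labeled data set.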
As shown in
Referring now to
At step 210, patient data associated with a blood specimen from a subject is received, for example by the computing device. The patient data includes fluctuating methylation clock (FMC) data, where the FMC data includes deoxyribonucleic acid (DNA) methylation fluctuation data for a plurality of fluctuating CpG (fCpG) sites. CpG sites are regions of DNA where a cytosine nucleotide occurs next to a guanine nucleotide in the linear sequence of bases along its length. Thus, as used herein, “CpG” refers to cytosine and guanine separated by a phosphate, which links the two nucleosides together in DNA. As described herein, fluctuating DNA methylation marks can be used as clocks in cells where ongoing methylation and demethylation cause repeated cycling between methylated and unmethylated states. In particular, CpG sites stochastically and measurably fluctuate in their DNA methylation levels (specifically the fraction of methylated alleles, typically referred to as the β value) between 0% (homozygously unmethylated CpG), 50% (heterozygous methylation) and 100% (homozygous methylation).
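To illustrate how the per-cell states above relate to a bulk measurement, the following non-limiting sketch averages per-cell values (0, 0.5, or 1) over a population of cells. It assumes, purely for illustration, two alleles per site that are each methylated with probability 0.5; the names `random_cell`, `bulk_beta`, and `site_variance` are hypothetical.

```python
import random
import statistics

def random_cell(n_sites, rng):
    """Per-site value in {0, 0.5, 1}: two alleles, each methylated with p = 0.5."""
    return [((rng.random() < 0.5) + (rng.random() < 0.5)) / 2 for _ in range(n_sites)]

def bulk_beta(cells, site):
    """Bulk methylation fraction at one site: the mean over all cells."""
    return sum(cell[site] for cell in cells) / len(cells)

def site_variance(cells):
    """Variance of the bulk methylation fraction across all sites."""
    n_sites = len(cells[0])
    return statistics.pvariance([bulk_beta(cells, s) for s in range(n_sites)])

rng = random.Random(1)
n_sites, n_cells = 200, 300

# Polyclonal pool: every cell carries its own independent pattern.
polyclonal = [random_cell(n_sites, rng) for _ in range(n_cells)]

# Monoclonal pool: every cell shares the pattern of a single founder.
founder = random_cell(n_sites, rng)
monoclonal = [list(founder) for _ in range(n_cells)]

var_polyclonal = site_variance(polyclonal)
var_monoclonal = site_variance(monoclonal)
```

Under these assumptions, the polyclonal pool averages out to bulk values near 50% at every site, whereas the monoclonal pool retains the founder's 0%, 50%, and 100% pattern and therefore shows a much larger variance across sites.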
In some implementations, the blood specimen is optionally a peripheral blood sample extracted from the subject. DNA can be extracted from such blood specimen and DNA methylation can then be measured, for example, using a microarray. Example microarrays for measuring DNA methylation include, but are not limited to, EPIC microarrays from Illumina, Inc. of San Diego, California. Techniques for extracting DNA, isolating DNAs, and analyzing DNAs with a microarray are known in the art. As described above, the FMC data includes DNA methylation fluctuation data for a plurality of fCpG sites. It should be understood that the FMC data includes DNA methylation fluctuation data for specific fCpG sites. Optionally, as described in the Examples below, the specific fCpG sites can include all CpG loci having average values between 40% and 60% methylation in a dataset (e.g., the aging database of 656 healthy individuals discussed in Example 1). It should be understood that specific fCpG sites used for predicting hematological conditions using blood specimens may be different than fCpG sites used for predicting diseases using other tissue samples or fCpG sites used for predicting other diseases. Additionally, it should be understood that a peripheral blood sample is only provided as an example blood specimen. This disclosure contemplates that the blood specimen can be a skeletal bone marrow sample in other implementations.
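The site-selection criterion described above (loci whose average methylation across a reference dataset lies between 40% and 60%) can be sketched as follows. This is a minimal illustration: the probe identifiers and values are made up, the function name `select_fcpg` is hypothetical, and treating the 40% and 60% bounds as inclusive is an assumption.

```python
def select_fcpg(beta_by_locus, low=0.40, high=0.60):
    """Return loci whose mean methylation across reference samples lies in [low, high]."""
    selected = []
    for locus, betas in beta_by_locus.items():
        mean = sum(betas) / len(betas)
        if low <= mean <= high:
            selected.append(locus)
    return selected

# Hypothetical probe IDs and methylation fractions, for illustration only.
reference = {
    "cg_example_1": [0.48, 0.52, 0.50],
    "cg_example_2": [0.05, 0.08, 0.03],
    "cg_example_3": [0.58, 0.61, 0.57],
    "cg_example_4": [0.95, 0.97, 0.93],
}
fcpg_sites = select_fcpg(reference)
```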
Optionally, in some implementations, the patient data further includes one or more DNA alteration markers. For example, the one or more DNA alteration markers can include, but are not limited to, a single nucleotide variant (SNV), a copy number alteration (CNA), or a structural variant (SV). DNA alteration markers can be obtained by sampling the subject's blood, extracting DNA from the sample, sequencing the DNA, and identifying DNA alteration markers in the data. DNA alteration markers can be identified based on a comparison of the blood sample DNA sequences to a control set of DNA sequences derived from a control subject or population that either has no disease or no disease recurrence. Techniques for extracting DNA, isolating DNAs, and sequencing are known in the art.
At step 220, the FMC data is input into a trained machine learning model (e.g., machine learning model 100 in
At step 230, the trained machine learning model (e.g., machine learning model 100 in
In some implementations, the techniques described herein relate to a method including: receiving a blood specimen from a subject; obtaining, using a microarray, fluctuating methylation clock (FMC) data associated with the blood specimen; inputting, using a computing device, the FMC data into a trained machine learning model; and predicting, using the trained machine learning model, a hematological condition in the subject. In some implementations, the method further includes recommending, using the computing device, a course of treatment for the subject based on the predicted hematological condition. In some implementations, the method further includes recommending performing a course of treatment on the subject based on the predicted hematological condition.
It should be appreciated that the logical operations described herein with respect to the various figures may be implemented (1) as a sequence of computer implemented acts or program modules (i.e., software) running on a computing device (e.g., the computing device described in
Referring to
In its most basic configuration, computing device 300 typically includes at least one processing unit 306 and system memory 304. Depending on the exact configuration and type of computing device, system memory 304 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in
Computing device 300 may have additional features/functionality. For example, computing device 300 may include additional storage such as removable storage 308 and non-removable storage 310 including, but not limited to, magnetic or optical disks or tapes. Computing device 300 may also contain network connection(s) 316 that allow the device to communicate with other devices. Computing device 300 may also have input device(s) 314 such as a keyboard, mouse, touch screen, etc. Output device(s) 312 such as a display, speakers, printer, etc. may also be included. The additional devices may be connected to the bus in order to facilitate communication of data among the components of the computing device 300. All these devices are well known in the art and need not be discussed at length here.
The processing unit 306 may be configured to execute program code encoded in tangible, computer-readable media. Tangible, computer-readable media refers to any media that is capable of providing data that causes the computing device 300 (i.e., a machine) to operate in a particular fashion. Various computer-readable media may be utilized to provide instructions to the processing unit 306 for execution. Example tangible, computer-readable media may include, but is not limited to, volatile media, non-volatile media, removable media and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. System memory 304, removable storage 308, and non-removable storage 310 are all examples of tangible, computer storage media. Example tangible, computer-readable recording media include, but are not limited to, an integrated circuit (e.g., field-programmable gate array or application-specific IC), a hard disk, an optical disk, a magneto-optical disk, a floppy disk, a magnetic tape, a holographic storage medium, a solid-state device, RAM, ROM, electrically erasable program read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices.
In an example implementation, the processing unit 306 may execute program code stored in the system memory 304. For example, the bus may carry data to the system memory 304, from which the processing unit 306 receives and executes instructions. The data received by the system memory 304 may optionally be stored on the removable storage 308 or the non-removable storage 310 before or after execution by the processing unit 306.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination thereof. Thus, the methods and apparatuses of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium wherein, when the program code is loaded into and executed by a machine, such as a computing device, the machine becomes an apparatus for practicing the presently disclosed subject matter. In the case of program code execution on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. One or more programs may implement or utilize the processes described in connection with the presently disclosed subject matter, e.g., through the use of an application programming interface (API), reusable controls, or the like. Such programs may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language and it may be combined with hardware implementations.
The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how the compounds, compositions, articles, devices and/or methods claimed herein are made and evaluated, and are intended to be purely exemplary and are not intended to limit the disclosure. Efforts have been made to ensure accuracy with respect to numbers (e.g., amounts, temperature, etc.), but some errors and deviations should be accounted for. Unless indicated otherwise, parts are parts by weight, temperature is in ° C. or is at ambient temperature, and pressure is at or near atmospheric.
We previously discovered and exploited FMCs to determine the stem cell numbers and replacement rates in various healthy and precancerous glandular tissues. Within colon crypts and endometrial glands, regular monoclonal conversions occur via neutral drift over a predictable period in healthy and pre-malignant tissue due to their small numbers of stem cells and spatially confined organization. This differs from the hematopoietic system, where polyclonality of the hematopoietic stem cell (HSC) population is extensive and clonal expansions are only likely in the most severe malignancies because there are multiple orders of magnitude more stem cells in a spatially unstratified tissue. The fCpG behavior seen in intestinal crypts and endometrial glands (both epithelial tissues) is likely to be present across other tissue types.
As described in this example, the loss of polyclonality, or homogenization of the HSC pool, should be reflected in fCpG distributions of normal, pre-malignant, and malignant samples (
Here, we build a mechanistic HSC model to show, through symmetric divisions, how fCpG variance increases as polyclonality is lost. We turn to the abundant public methylation array datasets for normal, pre-malignant, and malignant whole blood samples to derive fCpG sites. Unlike for the crypt, a separate, less refined identification process is necessary because heterogeneity of CpG β values within patients is inaccessible due to a lack of measurements within the same individual. Unfortunately, no methylation array datasets currently exist for CH. We collect and analyze whole blood samples from 38 patients with confirmed CH, who have paired methylation array and mutation data from paired peripheral blood (PB) and bone marrow (BM), along with ten confirmed normal control samples. Using this cohort, we develop a novel approach to diagnose CH agnostic to mutation status, evaluating evidence of CH in 1,388 patients.
Whole blood was simulated in Java using the HAL framework as a non-spatial agent-based model using the 27,634 fCpG sites measured in the experimental data. Parameters for normal hematopoiesis are the number of HSCs (N), the number of possible division events (T), the (de)methylation rates (S) for the fCpG sites, and the HSC replacement dynamics (A). To model clonal expansion, a single random cell was selected to grow upon induction; the added parameters are its expansion rate (E) and the final blood frequency of the clonal expansion (w). These clonal expansions caused the overall population size to grow until the appropriate final blood frequency was reached. The output of the simulations provided the β values at the fCpG sites and the overall distribution variance over time.
The number of HSCs was set at a lower value of 1000 initiating cells. This was much lower than the 30,000 suggested by the large number of HSCs inferred by DNA sequencing studies; however, the results shown here are invariant once there are more than 100 initiating cells. (De)methylation rates varied between CpG sites and were assigned based on the distribution averages of the 656 normal, healthy individuals from GSE40279. We found that some of the whole blood fCpGs did not appear to have equal (de)methylation rates because their averages tended to always be above or below 50% in multiple individuals. Hence, to better model and match the data, we used a look-up distribution table in the simulations to initialize a cell's fCpG parameters, with lower and unequal (de)methylation rates at CpG sites with average methylation typically found near 0.4 (demethylation > methylation) or near 0.6 (methylation > demethylation) to maintain the variance of the 27,634 fCpG sites around 0.1 during cell divisions. The (de)methylation rates varied between 0.0001 and 0.001 changes per division, with the highest and more equal (de)methylation rates at CpG sites near 50% methylation.
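The simulation itself was implemented in Java with the HAL framework; the following Python sketch is only a heavily simplified illustration of the same idea and is not that implementation. It assumes a single, equal (de)methylation rate for all sites, seeds a clone by replacing part of a fixed-size pool rather than growing the population, and uses far fewer cells and sites than described above. All function and parameter names are hypothetical.

```python
import random
import statistics

def new_cell(n_sites, rng):
    """Two alleles per site; True means methylated."""
    return [[rng.random() < 0.5, rng.random() < 0.5] for _ in range(n_sites)]

def divide(cell, rate, rng):
    """Copy a cell, flipping each allele's state with probability `rate`."""
    return [[(not a) if rng.random() < rate else a for a in site] for site in cell]

def beta_values(cells):
    """Bulk value per site: fraction of methylated alleles across all cells."""
    n_sites = len(cells[0])
    return [sum(sum(c[s]) for c in cells) / (2 * len(cells)) for s in range(n_sites)]

def simulate(n_cells=100, n_sites=60, n_divisions=50, rate=0.001,
             clone_fraction=0.0, seed=0):
    """Return the variance of the bulk values across sites after the divisions."""
    rng = random.Random(seed)
    cells = [new_cell(n_sites, rng) for _ in range(n_cells)]
    if clone_fraction > 0:
        # Simplified clonal expansion: copies of one random cell replace part of the pool.
        founder = rng.choice(cells)
        n_clone = int(clone_fraction * n_cells)
        cells[:n_clone] = [[list(site) for site in founder] for _ in range(n_clone)]
    for _ in range(n_divisions):
        # Exact replacement: each cell leaves exactly one offspring.
        cells = [divide(c, rate, rng) for c in cells]
    return statistics.pvariance(beta_values(cells))

variance_normal = simulate()
variance_clonal = simulate(clone_fraction=0.5)
```

Under these assumptions, the pool containing a clone yields a higher variance across sites than the fully polyclonal pool, which is the qualitative behavior the full simulation is used to study.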
Cell survival was set at exact replacement (one cell produces one living offspring), and results did not vary much if random replacement was simulated. A proportion of cells underwent replacement at each timestep. For the neoplastic simulations in
More sophisticated modelling with a better selection of whole blood fCpG sites could improve the extraction of ancestral information. For example, a selection of slower fCpG sites may improve the detection and analysis of indolent clonal expansions, where many of the faster fluctuations return to average ~50% methylation by the time the expansion reaches detectable blood levels.
We simulated hematopoiesis to better understand how fluctuating sites detect clonality in whole blood (
We identified suitable fCpG loci by averaging normal whole blood DNA methylation at ~450,000 autosomal CpG loci from a commonly used aging database of 656 healthy individuals. We selected all loci (N=27,634) with average values between 40% and 60% methylation in these 656 specimens. fCpGs appear tissue-specific because only ~5% of the intestinal loci were in the blood set. Fluctuating methylation for each individual sample revealed tight distributions around 50% methylation, which can be described by their variance (
CH in the blood is an early step in the evolution of neoplasia and will increase variances because clonal cells will initially share the 0, 50, 100% methylation pattern of the progenitor. For rapid clonal expansions (i.e., acute leukemias), W-shaped blood distributions like those observed in the crypts are expected. Consistent with these expectations, whole blood samples from different types of major hematopoietic neoplasms had higher than normal variances (
Clonal hematopoiesis (CH) is diagnosed based on somatic alterations whose frequencies are greater than 2% (generally SNVs and small insertions and deletions assessed from peripheral blood samples) in the absence of hematologic malignancy. The prevalence of CH increases as an individual ages and conveys a non-negligible risk for progression to various hematopoietic malignancies. While these studies focus on specific somatic alterations, others have more generally found an increased risk of hematologic malignancies by defining CH as samples with high numbers of somatic mutations. Further still, chromosomal anomalies such as large structural variants or CNAs are also associated with increased risk of hematopoietic malignancies, but are not generally used as a diagnostic criterion for CH, despite their more common occurrence in individuals who subsequently develop myeloid or lymphoid leukemia, as observed through longitudinal studies. While all of these somatic alterations carry an increased risk of developing malignancies, the absolute risk is low. Nevertheless, a more broadly universal method for identifying CH is of great value, especially if it could provide information for teasing apart the risk of hematopoietic malignancies. Here we develop a method agnostic to the underlying somatic alteration type by using FMC behavior.
Patients undergoing elective total hip replacement surgery were diagnosed as either normal or CH based on the variant allele frequency (VAF) of putative driver mutations. Here, we present ten patients with no evidence of CH and 38 patients whose VAF of putative drivers is greater than 1%, subdivided into VAF groups ([1,2)% n=8, [2,5)% n=10, [5,10)% n=10, ≥10% n=10). Most patients present with a DNMT3A driver mutation (21/38 (≈55%) CH patients), where ≈81% (17/21) of those DNMT3A drivers are the highest frequency driver mutation for that patient. TET2 is the second most frequently observed driver (10/38 (≈26%) CH patients), where 8/10 (80%) TET2 drivers are the highest frequency driver mutation for that patient (
fCpG in the blood form a predictable distribution around 50% methylation in normal samples (
To evaluate whether CH can be diagnosed using fCpG sites we used our cohort of confirmed CH and normal samples. Publicly available normal methylation data does not have any paired mutational data to rule out the presence of CH in the study's cohort of patients, especially patients with early CH where clones would be observed only at very low frequencies. Here we sub-sampled our 12,000 fCpG sites, without replacement, to bolster our numbers of samples to roughly 1500 samples of both Normal and CH (1500 and 1482, respectively). Each patient is sub-sampled equally so that representation from each sample is equal across our normal and CH VAF groups. This leaves us with 2982 samples of roughly equal representation from normal and CH samples (
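The sub-sampling step can be sketched as below. This is a hedged illustration: the function name, the subset size, and the reading of "without replacement" as applying within each drawn subset are assumptions. The replicate counts per patient (150 for normals, 39 for CH) are inferred from the reported totals (10 × 150 = 1500 and 38 × 39 = 1482) and are not stated in the text.

```python
import random
from statistics import pvariance

def subsample_patient(values, n_sites, n_replicates, seed=0):
    """Draw n_replicates random subsets of n_sites fCpG values from one patient.

    Sites are drawn without replacement within each subset (an assumption about
    what "without replacement" means in the text).
    Returns a list of (site_indices, variance) tuples, one per replicate.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_replicates):
        idx = rng.sample(range(len(values)), n_sites)
        replicates.append((idx, pvariance([values[i] for i in idx])))
    return replicates

# Toy patient with 1,000 sites clustered around 50% methylation (not real data).
toy_rng = random.Random(1)
toy_values = [min(1.0, max(0.0, toy_rng.gauss(0.5, 0.05))) for _ in range(1000)]
reps = subsample_patient(toy_values, n_sites=200, n_replicates=150)

# Replicate counts that would reproduce the reported totals (inferred, see above).
n_normal_samples = 10 * 150
n_ch_samples = 38 * 39
```

Drawing the same number of replicates from every patient within a group keeps each patient's contribution to the training set equal, which is the balancing property the paragraph describes.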
The FMC-CH classifier proves to be a robust tool for diagnosing CH based only on fluctuating methylation clocks. The confusion matrix between the normal and CH samples yields a false positive rate of only 1.8% and false negative rate of 8.2% (
A diagnostic or research tool is only useful with an appropriate interface and an easily integrated function. For the FMC-CH classifier a function is provided that takes an array of fCpG δ values. The function pre-processes these samples by performing the appropriate sub-sampling (allowing backwards compatibility with 450k, 850k, and earlier CpG probe sets), extracting the relevant summary statistics, and then performing the classification. This process is performed 100 times (by default; the user can specify another number), providing summary statistics such as a 95% confidence interval of the prediction probabilities. The threshold for CH classification from the replicate predictions is based on the 95% confidence interval, where CH is diagnosed if the upper limit of the 95% confidence interval is >50%. We illustrate this process by performing predictions on the entire CH cohort (
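The replicate-and-threshold logic of such a function can be sketched as follows. Everything here is illustrative: `classify_ch` and `toy_predict_proba` are hypothetical names, the logistic stand-in is not the trained FMC-CH model, and computing the 95% interval as an empirical percentile interval over replicate probabilities is an assumption.

```python
import math
import random
from statistics import mean, pvariance, quantiles

def toy_predict_proba(values):
    """Hypothetical stand-in for the trained model: logistic in the fCpG variance."""
    return 1.0 / (1.0 + math.exp(-(pvariance(values) - 0.02) / 0.005))

def classify_ch(values, predict_proba, n_sites, n_replicates=100, seed=0):
    """Sub-sample, predict, and summarize over replicates.

    CH is called when the upper limit of the 95% interval exceeds 0.5.
    """
    rng = random.Random(seed)
    probs = [predict_proba(rng.sample(values, n_sites)) for _ in range(n_replicates)]
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; keep the first and last.
    cuts = quantiles(probs, n=40, method="inclusive")
    lower, upper = cuts[0], cuts[-1]
    return {"mean": mean(probs), "ci95": (lower, upper), "is_ch": upper > 0.5}

# Toy inputs (not real data): a tight distribution at 50% and a W-shaped one.
tight = [0.5] * 300
w_shaped = [0.0, 0.5, 1.0] * 100
normal_call = classify_ch(tight, toy_predict_proba, n_sites=60)
clonal_call = classify_ch(w_shaped, toy_predict_proba, n_sites=60)
```

Thresholding on the upper limit of the interval rather than on the mean makes the call lean toward flagging CH when replicate predictions disagree.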
The gold standard for identifying CH is through identification of deleterious somatic alterations from PB with a VAF cutoff of 2%. We know that FMC dynamics are a function of several underlying processes. At their core it is the turnover, cell number, expansion rates, (de) methylation rates, and final blood frequency of sub-clones/malignant populations
While the FMC-CH classifier can accurately determine that CH is present with increasing accuracy as a clone expands (
Subclones within the simulations are all induced at 5 years. This early induction allows us to visualize the dynamics over the course of a 100-year simulation. While induction of a driver mutation is possible this early in an individual's life, these drivers would likely expand very slowly, either through persistence (i.e., not lost during homogenization of HSCs during aging) or through slow continuous expansions, such as in our model. These clones will be detected at a point where this is of greater importance and may serve as an important delay for when closer monitoring is necessary for patients.
Based on the fCpG variances of the normal samples from the publicly available dataset we wanted to evaluate whether there was any evidence of CH present within these samples even if it may be below the clinical threshold of 2% VAF, given this has never been queried for these samples. For this we used the normal dataset of 656 patients used to derive fCpGs (GSE40279;
The current gold standard for validating the 24.2% of newly characterized patient samples would be to perform DNA sequencing on a panel of CH drivers to examine evidence of mutations with VAF≥2%. Neither of these cohorts has paired mutational data for validation purposes. However, we can examine whether expected characteristics of a CH cohort are present, and we can look for evidence of copy number alterations that could be significantly enriched within the newly identified CH patients.
Chance of Clonal Hematopoiesis Diagnosis Increases with Age
CH, as outlined above, is typically a disease largely limited to the elderly. By the age of 70, 10-15% of individuals will present with CH and by 85 years, more than 30% will have CH. As a sanity check we would expect that our FMC-CH classifier would classify CH in predominantly older patients. Across the two studies evaluated we see that the median age of normal samples is 54 years, significantly different from those with CH, whose median age is 70 years (−0.67 Cohen's d, P=2.09×10^−25 two-sided paired t-test). Individually for each cohort, we see that the median age for samples classified as CH from GSE40279 and GSE87571 is 73 and 62 years compared to the normal samples of 62 and 45 years, respectively (GSE40279, −0.77 Cohen's d, P=3.13×10^−18; GSE87571, −0.54 Cohen's d, P=7.90×10^−09). When comparing the age distributions for each of these two cohorts we see that GSE40279 is a significantly older cohort (64.0±13.7 years; mean±SD) compared to GSE87571, which also has a broader sampling of patient ages (47.4±20.9 years; mean±SD). This is reflected in the differences between the proportion of samples that are classified as CH between the two cohorts, where the older cohort, GSE40279, had 9.3% more CH classifications. For the entire cohort of patients (n=1,388) the median age is 58 years. Patients older than 58 years are more likely to be diagnosed with CH compared to those 58 or younger across these two cohorts (odds ratio (OR)=3.02, two-sided Fisher's exact test, P=1.28×10^−17).
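The effect-size and association statistics quoted above can be computed along the following lines. This is a sketch under assumptions: Cohen's d is taken with a pooled standard deviation and signed as normal minus CH (matching the negative values reported), the 2×2 table layout is a guess at how age groups and classifications were cross-tabulated, and all numbers below are toy values rather than the cohort data.

```python
from math import comb, sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation: (mean(a) - mean(b)) / s_pooled."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

def odds_ratio(table):
    """Sample odds ratio of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test: sum of table probabilities <= the observed one."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))

# Toy ages (not cohort data): classified normal vs classified CH.
d = cohens_d([40, 50, 60], [60, 70, 80])

# Toy table: rows = older / younger than the median age, columns = CH / normal.
toy_table = [[30, 10], [10, 30]]
toy_or = odds_ratio(toy_table)
toy_p = fisher_exact_two_sided(toy_table)
```

With the toy ages, the CH group is older, so d is negative under this sign convention, which is the direction reported for both cohorts.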
Newly diagnosed CH samples from GSE40279 and GSE87571 exhibit fCpG variances consistent with the verified CH samples presented here (
We performed copy number calls using the methylation array data across the CH and normal cohorts to assess evidence of differences between the two groups of patients. We see that there are significant CNA burden differences overall, as well as for copy number losses and gains (
While burden alone may not support evidence of CH, we posit that the distribution of CNA burdens is similar to that seen within clonal mosaicism. In our data we lack information about subclonal proportions to perform the same analysis; however, given our understanding of what drives fCpG variance (loss of polyclonality in HSCs), we can deduce that subclonal expansions within samples with higher fCpG variances are likely. Studies have shown that chromosomal abnormalities are present in expanded clones at frequencies of 7-95%, representing clonal mosaicism, defined as CNA events with corresponding subclonalities above a threshold. SNP array methodologies can resolve subclonal admixtures of normal and expanded subclones, something that is not possible using array-based methylation data. However, a previous study analyzed the frequency of detectable clonal mosaic events by age in both cancer patients and cancer-free individuals. Similar to the age distributions seen for CH, whereby 10-15% of patients present with CH by the age of 70 and 30% by the age of 80, the frequency of individuals with detectable clonal mosaic events increases with age, from 0.23% for those under 50 to 1.91% for those aged 75-79. That study required mosaic proportions to be greater than 7%. Within our cohort of classified CH samples we see a similar age-related pattern. The frequency of individuals with CNAs in the classified CH samples increases with age from 0.27% for those younger than 50 to 11.14% for those >70 years old (P=0.0013, chi-squared test;
We next performed copy number calls using our classified normals as the controls to examine specific differences in genes and recurrent CNAs across different genomic regions that may be implicated in driving CH within the classified CH samples. Using our filtered, high confidence segmentation calls we annotated cytobands, determined genes within aneuploidy segments, and analyzed the genes to determine enrichment in a particular disease area or if overlap exists with known CH or cancer driver genes. On this set of CNAs we see enrichment in several disease classes associated with hematological diseases and malignancies. Of interest, we see significant gene enrichment for genes in regions exhibiting aneuploidy for disease classes related to acute and chronic lymphoblastic and myeloid leukemias as well as various other malignancies. In addition, we find 25 known oncogenic drivers associated with recurrent CNA regions exhibiting gains/losses in the CH cohorts (
Clonally heterogeneous landscapes in the hematopoietic stem cell pool
Our model of HSC dynamics, analysis of the CH cohort presented here, and analysis of publicly available data thus far have revealed that fCpG variances reflect underlying turnover and clonal expansions within the HSC pool using peripheral blood. FMCs in the peripheral blood can be used to diagnose CH; however, it is necessary to evaluate how the make-up of multiple predominant subclones could confound the diagnostic capabilities of the FMC-CH classifier and alter fCpG β distributions. To this end we can leverage our HSC model and explore the CH data.
To assess the presence of multiple subclones we first must establish our baseline fCpG variance for a single clone and the corresponding detection times. From the in silico validation of the FMC-CH classifier, we showed that given a single clone, the expanding clone's expansion rate is the most important variable for the time that CH detection occurs. A rapidly expanding clone can be detected quickly, but an indolent expansion will take more time to be detected (
For our control, a single subclone with varied expansion rates, the mean detection times are 20.0, 41.2, and 73.08 years after induction of the clone at year 5 for the three expansion rates considered (0.25, 0.125, and 0.0625, respectively;
The detection time for CH in the presence of multiple subclones decreases the time to diagnosis (
We know that the HSC compartment is highly heterogeneous, and we observe multiple subclones within our CH patients. Our model results suggest that there is an additive or multiplicative increase in fCpG variance as the number of subclones increases with various expansion rates. This prompts us to explore the presence of these multiple subclones and their relationship with the FMC β distributions that we observe.
Within the CH samples presented, we observe a weak, positive correlation between the largest VAF driver (the gold standard in the clinic for CH diagnosis) with fCpG variance (
Subclonal deconvolution is difficult to perform on sequencing data without deeper whole exome/genome sequencing, and even then, the resolution of low frequency subclones (such as those detected in CH with driver VAFs of 2-10%) is difficult to unravel. Several statistical frameworks attempt to perform this deconvolution, but cannot be applied here. Because of this, we can deduce two possibilities for this increased variance for samples with multiple driver mutations. The first is that the driver subclones arose independently and expand with an unknown slow expansion rate. This first possibility is far easier to assess within the data as we can simply sum the VAFs of all observed clones (ignoring the fact that other independent non-‘driver’ subclones are likely present). When we do this, we see that the cumulative frequency of the subclones for patients with more than one mutation shows a significant shift in what the HSC model would consider to be the final blood frequency corresponding to higher fCpG variances (
Paired Bone Marrow from Patients with Clonal Hematopoiesis
The PB is not the site of HSCs; rather, HSCs are primarily located in the BM of the axial skeleton, where hematopoiesis begins, with cells eventually committing to their identities (one of several terminally differentiated lymphoid or myeloid cell types) and moving out to the peripheral blood. Most measurements within the PB are reflections of events that occur within the BM, but few studies confirm this. Importantly for fCpG distributions, we are interested to know if we can accurately capture the underlying distribution of the HSCs from the terminally differentiated PB. For the fCpG β distribution to accurately reflect that which is seen in the BM, individual subclones would have to give rise to equal proportions of the various blood cell types. This is still a widely open area of interest, but one study found that the production of blood is highly polyclonal, deriving from a large number of HSCs. In our study we have methylation data derived from PB paired with corresponding DNA sequencing to evaluate the presence of CH subclones. Here we also present the paired BM DNA sequencing results to examine whether PB and BM subclones are of equal sizes and whether their subclonal frequencies correlate with fCpG variance (
The presence of paired BM provides several sanity checks for both the interpretation of the PB findings as well as the fCpG variances. We see that all variant allele frequencies are similar between the BM and PB (
We next wanted to evaluate how well the fCpG β distribution correlates with the VAF of the putative drivers from the BM to see if similar correlations are found as those seen in the PB (
Our understanding of CH, aging of HSCs populations, and accumulation of mutations within normal tissues resulting in mosaic tissue suggests that we are likely to observe a vast collection of distinct populations of HSC subclones. Some of these subclonal populations ought to have acquired the necessary drivers for CH with their likelihood increasing as an individual ages. This prompted our analysis of concurrent driver clones as it relates to our ability to detect CH with our FMC-CH classifier (
We performed single cell sequencing on four of our CH samples who had drivers with VAFs>5% and showed presence of a DNMT3A or TET2 mutation (NOC062, NOC137, NOC115, and NOC131). Fortuitously, two of these samples exhibited sub-clonal structures that were nested (NOC062 and NOC137), while the other two were concurrent CH subclones (NOC115 and NOC131) (
So far, we have presented model simulations with distinct subclonal populations with varied expansion rates, various years between mutation induction, and numbers of subclones. However, given that nested structures exist within the data, it is necessary to also simulate the occurrence of multiple subclones with one originating as the daughter of the initial driver population, as confirmed in two of our samples. Here we initialize the first clone at five years with the slowest expansion rate of 6.25 percent per year. Unlike the previous simulations, we introduce a new founder clone as a daughter of the initial subclone at 60 years (the timing of this second subclone was chosen arbitrarily, but it needed to occur early enough to be seen by the 100-year stopping point of the simulations). There are numerous approaches to determining fitness gains conveyed by the acquisition of passenger mutations and additional drivers acquired within a subclonal population, but here we assume that a modest fitness increase is conveyed as a faster expansion rate for that subclonal population, with its expansion rate doubling (2×) to 12.5 percent per year (the moderate expansions seen in previous simulations). We permit these subclones to expand until the driver reaches a final population frequency of 20% (
We see that the nested subclone causes an increase in the fCpG variance over the control where a single subclone grows at a steady 6.25 percent per year (
Through the work presented in this example we have orthogonally validated FMCs to show that not only are fCpGs found in the crypts of the colon, small intestine, and endometrium, but they exist in HSCs as well. Large numbers of fCpG sites reversibly switch their methylation status like an erratically swinging pendulum between 0%, 50% and 100% (representing homozygous and heterozygous (de)methylation). In polyclonal populations, exemplified by HSCs, fluctuations are unsynchronized between individual cells and fCpG methylation is saturated at 50%. This is distinct from clonal populations that form the characteristic W-shaped distribution with modal peaks at 0%, 50% and 100% methylation.
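The contrast between clonal and polyclonal populations can be illustrated with a small simulation. This is a toy model built on stated assumptions, not the HSC model of this disclosure: each site has two alleles that flip independently with a fixed probability per step, lineages are independent, and a "clonal" sample is represented by a single lineage. The function names and parameter values are hypothetical.

```python
import random
from statistics import pvariance

def fluctuate(n_sites, n_steps, flip_prob, rng):
    """One lineage: every site has two alleles that flip independently over time."""
    alleles = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(n_sites)]
    for _ in range(n_steps):
        for site in alleles:
            for k in (0, 1):
                if rng.random() < flip_prob:
                    site[k] ^= 1  # methylated <-> unmethylated
    # Per-site methylation of a diploid cell is 0, 0.5, or 1.
    return [(site[0] + site[1]) / 2 for site in alleles]

def population_methylation(n_lineages, n_sites, n_steps=20, flip_prob=0.05, seed=0):
    """Average per-site methylation over independent lineages in a bulk sample."""
    rng = random.Random(seed)
    lineages = [fluctuate(n_sites, n_steps, flip_prob, rng) for _ in range(n_lineages)]
    return [sum(lin[j] for lin in lineages) / n_lineages for j in range(n_sites)]

clonal = population_methylation(n_lineages=1, n_sites=200)
polyclonal = population_methylation(n_lineages=100, n_sites=200)
clonal_var = pvariance(clonal)
polyclonal_var = pvariance(polyclonal)
```

In this toy setting the single-lineage sample only takes the values 0, 0.5 and 1 (the W shape), while averaging over many unsynchronized lineages pulls every site toward 50% and shrinks the variance, which is the behavior the variance-based classification relies on.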
Within the scope of hematopoietic cells we show that fCpG dynamics are present and useful to reconstruct clonal dynamics. The identity of the fCpG sites in hematopoietic cells differs from those in the epithelium, likely reflecting that fCpGs tend to be found within non-expressed genes and the fact that gene expression patterns vary between tissues. We illustrate the ability of fCpG sites to be used to detect CH through our HSC model of symmetrically expanding subclonal populations corresponding to an increase in average fCpG variances with clonality and characteristic W-shaped distributions present in acute leukemias (
HSC pools end how they begin: more homogeneous in children and the elderly. HSCs begin relatively homogeneous during embryogenesis and expand for the first two decades (<20 years), giving rise to a highly heterogeneous mixture of HSCs before some HSCs are lost through aging. In acute leukemia cohorts of older patients versus pediatric patients we see that pediatric patients exhibit higher fCpG variance, likely reflecting the differences in initial HSC dynamics. On the other end of the life span, where HSCs diminish in number and exhibit an age-related incidence of CH, we see increased fCpG variances.
Due to the age-related incidence of CH it is important to consider how important CH may be in the development of malignancies. Given that across most tissues in the human body somatic alterations are ubiquitous, and may not pose much risk of further development of malignancies, it is paramount to be able to tease apart which underlying dynamics are clinically relevant to future malignancies and which are simply a part of human aging. To this end we show that we can leverage FMCs to determine the extent of CH. The approach developed here adds significant value in that it is more sensitive to the underlying dynamics than the current gold standard based on variant frequencies. FMCs have the added value of being agnostic to the underlying genomic alterations for purposes of diagnosing CH; further, they are more sensitive to faster expanding clones or the presence of multiple subclones with malignant potential.
Described below is a molecular diagnostic method that leverages patient derived data to construct an ensemble machine learning algorithm to provide a patient diagnosis of clonal hematopoiesis (CH) and pre-cancerous hematological conditions encompassing age-related clonal hematopoiesis (ARCH) and clonal hematopoiesis of indeterminate potential (CHIP) using specific DNA methylation fluctuating CpG (fCpG) sites that serve as a fluctuating methylation clock (FMC).
Combining FMCs, a patient specific in silico hematopoietic stem cell (HSC) model, and patient DNA alteration markers (collectively single nucleotide variants (SNVs), copy number alterations (CNAs), and structural variants (SVs)) with associated patient outcomes, a patient prognosis can be constructed to assign risk of progression to a hematological malignancy.
Through longitudinal collection of fCpG measurements and subsequent use of this molecular diagnostic process as described above, an algorithmic interface can be used to construct a patient specific trajectory to predict when and if a patient will present with clinically actionable CH to construct a monitoring schedule for patient follow-ups, clinical intervention, and clinical decision making.
A patient and clinician user interface can provide patients with information on risks of lifestyle choices given their age, extent of clonal homogeneity, and possible outcomes calculated as probabilities using patient associated outcomes and in silico models. Such outcomes can include, but are not limited to, a lifetime risk of: hematological malignancy development, major cardiovascular events such as stroke and heart attack, and/or development of epithelial neoplasia.
Using the molecular diagnostic method described above and refined criteria of observed fCpG measurements, a molecular diagnosis of a hematological malignancy can be provided for clinical follow-up and therapeutic intervention.
Using fCpG monitoring of patients with hematological malignancy combined with DNA alteration markers provides risk and patient stratification based on assessing aggressiveness of disease.
FMC DNA methylation measurements can be used to provide risk assessments for treatment associated hematological malignancies (e.g., treatment associated myeloid neoplasia (tMN)) while patients undergo therapeutic interventions for non-hematological malignancies. The same collected data and measurements provide risk assessments for autologous stem-cell transplantation success.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
This application claims the benefit of U.S. provisional patent application No. 63/319,031, filed on Mar. 11, 2022, and titled “METHODS FOR USING METHYLATION DATA TO PREDICT WHETHER A PATIENT HAS CHIP (CLONAL HEMATOPOIESIS OF INDETERMINATE POTENTIAL),” the disclosure of which is expressly incorporated herein by reference in its entirety.
This invention was made with government support under Grant no. CA143970 awarded by the National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2023/015101 | 3/13/2023 | WO |
Number | Date | Country | |
---|---|---|---|
63319031 | Mar 2022 | US |