To detect whether an individual has a health condition, such as cancer, or a characteristic of that health condition, such as a type of cancer or stage of development, a biological sample from the individual typically is processed to generate signals representative of the biological sample. A data processing system determines a likelihood that the individual has a health condition based on data derived from the generated signals. To build the data processing system, techniques from the field of machine learning typically are used. In this context, machine learning techniques are applied to data derived from the signals generated from biological samples. The data derived from the generated signals are commonly referred to as “features”, which are inputs to a computational model. To apply machine learning techniques, biological samples are obtained which originate from individuals for whom a diagnosis for the health condition is known. For each biological sample, respective data corresponding to features used by a computational model are derived. The data for the set of biological samples with a known diagnosis is referred to as a “training set.” The training set includes data both for individuals with the health condition and for individuals without the health condition. A computational model is built, or “trained,” using the training set, and then that trained computational model is applied to data representative of biological samples from individuals with unknown diagnoses to predict whether they likely have the health condition.
This Summary introduces a selection of concepts in simplified form that are described further below in the Detailed Description. This Summary neither identifies key or essential features, nor limits the scope, of the claimed subject matter.
To use a computational model in this context, there are several technical problems that arise relating to encoding the signal resulting from processing a biological sample into features.
Some problems arise because the signal includes a large amount of information. One of the challenges involves reducing the volume of data into a set of informative features. As the number of features increases, the complexity of the computational model increases. However, as the number of features decreases, information relevant to detection of a health condition may be lost.
Some problems arise because of uncertainty around which metrics and which regions of an analyte are truly informative of a health condition. Omission of some metrics or some regions from the set of features may impact the performance of a trained computational model.
To address such problems, a feature computation module encodes a signal generated by processing a biological sample, given one or more health-condition-informative regions related to an analyte, by using metrics based on marker information occurring within specified windows within a sequence of sites of interest within the health-condition-informative regions related to the analyte. Each window has a specified position within a sequence of sites of interest in the health-condition-informative region, and a specified size. The size is specified in terms of a number of consecutive sites of interest within the analyte. A metric is thus computed for a plurality of positions within the health-condition-informative region.
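By way of illustration only, the windowing described above can be sketched as follows. The sketch assumes that windows advance one site of interest at a time and that a window position is the index of its first site within the region; the stride and the name `enumerate_windows` are hypothetical and are not specified by this description.

```python
from typing import List, Tuple


def enumerate_windows(num_sites: int, size: int) -> List[Tuple[int, int]]:
    """Return (position, size) pairs for every window of `size` consecutive
    sites of interest in a region containing `num_sites` sites of interest.

    Assumption: a stride of one site; `position` is the index of the
    window's first site within the region's sequence of sites of interest.
    """
    if size < 1:
        raise ValueError("window size must be at least 1")
    return [(start, size) for start in range(0, num_sites - size + 1)]


windows = enumerate_windows(num_sites=5, size=3)
print(windows)  # [(0, 3), (1, 3), (2, 3)]
```

A region with fewer sites of interest than the window size yields no windows in this sketch.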
In each metric, a first function of respective marker information for an instance of an analyte for a window is used to compute a respective value for each instance of the analyte in the window. A second function of these respective values is computed to provide one or more values for the one or more metrics for the window.
An example of a first function applied to an instance of an analyte is a count of occurrences of marker information within the instance of the analyte within the window. The second function first computes counts of the number of instances having each possible count resulting from the first function. The second function then divides the respective number of instances computed for each possible count by the total number of instances, thus providing a fractional value for each of the possible counts for this window.
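As a minimal sketch of this count-based metric, assume each instance of the analyte is already restricted to the window and represented as a sequence of 0/1 flags, one per site of interest (1 indicating that marker information is present). The representation, the function names, and the handling of a window with no instances are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def count_markers(instance: Sequence[int]) -> int:
    """First function: count marked sites (1s) in one instance in the window."""
    return sum(instance)


def count_fractions(instances: Sequence[Sequence[int]], size: int) -> Dict[int, float]:
    """Second function: fraction of instances having each possible count 0..size."""
    if not instances:
        # Assumption: a window with no instances yields all-zero fractions.
        return {k: 0.0 for k in range(size + 1)}
    tally = Counter(count_markers(inst) for inst in instances)
    total = len(instances)
    return {k: tally.get(k, 0) / total for k in range(size + 1)}


# Four instances covering a window of 3 sites; 1 = marker present, 0 = absent.
instances = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
print(count_fractions(instances, size=3))
# {0: 0.25, 1: 0.0, 2: 0.5, 3: 0.25}
```

A window of size W thus contributes W + 1 fractional values in this sketch, one for each possible count.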
Another example of a first function applied to an instance of an analyte is a function that identifies a pattern of the marker information in the instance, from among a set of possible patterns, and outputs an indication of that pattern. The second function first computes a count of the number of instances having each possible pattern in a window. The second function then divides the respective number of instances identified for each possible pattern by the total number of instances, thus providing a fractional value for each of the possible patterns for this window.
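The pattern-based metric can be sketched in a similar way. The sketch assumes binary marker information at each site, so that a window of `size` sites has 2**size possible patterns, and it assumes each instance is already restricted to the window; these choices and the function names are illustrative only.

```python
from collections import Counter
from itertools import product
from typing import Dict, Sequence, Tuple

Pattern = Tuple[int, ...]


def pattern_of(instance: Sequence[int]) -> Pattern:
    """First function: identify the pattern of marker states in the window."""
    return tuple(instance)


def pattern_fractions(instances: Sequence[Sequence[int]], size: int) -> Dict[Pattern, float]:
    """Second function: fraction of instances having each possible pattern."""
    possible = list(product((0, 1), repeat=size))  # all 2**size binary patterns
    tally = Counter(pattern_of(inst) for inst in instances)
    total = len(instances)
    # Assumption: a window with no instances yields all-zero fractions.
    return {p: (tally.get(p, 0) / total if total else 0.0) for p in possible}


instances = [(1, 1), (1, 1), (0, 1), (0, 0)]
fractions = pattern_fractions(instances, size=2)
print(fractions[(1, 1)])  # 0.5
```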
Using such metrics, each site of interest within each health-condition-informative region can have a plurality of metrics, thus providing numerous metrics for each health-condition-informative region, and substantially increasing the number of features available to a machine learning system. Also, the metrics described above lose less information than other metrics, such as metrics that average information over an entire health-condition-informative region. For example, a metric based on a count of patterns preserves a large amount of original information, effectively compressing the original data to reduce storage and computational requirements while preserving information useful to training a machine learning model.
Accordingly, in one aspect, a process encodes a signal generated from processing a biological sample originating from a subject. The signal is indicative of marker information in instances of an analyte in the biological sample. The process involves processing the signal using a computer processor having access to computer storage that stores the signal generated from processing a biological sample originating from a subject. The processing includes computing, for each instance of an analyte in the biological sample, and for each window of a plurality of windows on health-condition-informative regions of the analyte, a respective value for the instance for the window based on a first function of respective marker information for the instance of the analyte for the window. The processing includes computing, for each window of the plurality of windows on the health-condition-informative region, one or more respective metrics for the window based on a second function of the respective values computed for the instances of the analyte for the window based on the first function. The processing includes storing a data structure in memory including the one or more respective metrics computed for the window as a set of values associated with an identifier of the window of the health-condition-informative region. The set of values corresponds to a set of features corresponding to inputs of a computational model. The set of values for the subject for the plurality of windows represents an encoding of the signal from the biological sample of the subject.
In one aspect, a process encodes a signal generated from processing a biological sample originating from a subject. The signal is indicative of marker information in instances of an analyte in the biological sample. The computer-implemented process involves processing the signal using a computer processor having access to computer storage that stores the signal generated from processing a biological sample originating from a subject. The processing includes, for each window of a plurality of windows on a health-condition-informative region of an analyte: computing, for each instance of the analyte overlapping the window, a respective value for the instance for the window based on a first function of the respective marker information for the instance overlapping the window; computing one or more respective metrics for the window based on a second function of the respective values computed for the instances overlapping the window based on the first function; and storing a data structure in memory including the one or more respective metrics computed for the window as a set of values associated with an identifier of the window of the health-condition-informative region. The set of values corresponds to a set of features corresponding to inputs of a computational model. The set of values for the subject for the plurality of windows represents an encoding of the signal generated from processing the biological sample originating from the subject.
In one aspect, a process encodes methylation signals for DNA fragments from a liquid biopsy of a subject, wherein each methylation signal is indicative of methylation of CpGs in a respective DNA fragment. The process involves processing the methylation signals using a computer processor having access to computer storage that stores the methylation signals for the DNA fragments from the liquid biopsy of the subject. The processing includes computing, for each DNA fragment, and for each window of a plurality of windows on a cancer-informative region of DNA of the subject, a respective value for the DNA fragment for the window based on a first function of the respective methylation signal for the DNA fragment for the window. The processing includes computing, for each window of the plurality of windows on the cancer-informative region, one or more respective metrics for the window based on a second function of the respective values computed for the DNA fragments for the window based on the first function. The processing includes storing a data structure in memory including the one or more respective metrics computed for the window as a set of values associated with an identifier of the window of the cancer-informative region. The set of values corresponds to a set of features corresponding to inputs of a computational model. The set of values for the subject for the plurality of windows represents an encoding of the methylation signals for the DNA fragments from the liquid biopsy of the subject.
In one aspect, a process encodes methylation signals for DNA fragments from a liquid biopsy of a subject, wherein each methylation signal is indicative of methylation of CpGs in a respective DNA fragment. The process involves processing the methylation signals using a computer processor having access to computer storage that stores the methylation signals for the DNA fragments from the liquid biopsy of the subject. The processing includes, for each window of a plurality of windows on a cancer-informative region of DNA of the subject: computing, for each DNA fragment overlapping the window, a respective value for the DNA fragment for the window based on a first function of the respective methylation signal for the DNA fragment in the window; computing one or more respective metrics for the window based on a second function of the respective values computed for the DNA fragments for the window based on the first function; storing a data structure in memory including the one or more respective metrics computed for the window as a set of values associated with an identifier of the window of the cancer-informative region. The set of values corresponds to a set of features corresponding to inputs of a computational model. The set of values for the subject for the plurality of windows represents an encoding of the methylation signals for the DNA fragments from the liquid biopsy of the subject.
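One possible way to organize the per-window computation and the resulting data structure is sketched below. It assumes each DNA fragment is represented as a mapping from CpG index within the region to a methylation state (1 methylated, 0 unmethylated), that a fragment contributes to a window only if it covers every CpG of the window, and that a window identifier is a (region identifier, start index, size) tuple. All of these, and the name `encode_region`, are hypothetical choices; the count-based functions are used for the first and second functions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Fragment = Dict[int, int]        # CpG index within region -> 1 or 0
WindowId = Tuple[str, int, int]  # (region identifier, start CpG index, size)


def encode_region(region_id: str, num_cpgs: int, size: int,
                  fragments: List[Fragment]) -> Dict[WindowId, List[float]]:
    """Compute, for each window, the fraction of fragments having each
    possible count of methylated CpGs, keyed by a window identifier."""
    encoding: Dict[WindowId, List[float]] = {}
    for start in range(num_cpgs - size + 1):
        sites = range(start, start + size)
        # Assumption: keep only fragments covering every CpG of the window.
        covering = [f for f in fragments if all(s in f for s in sites)]
        # First function: methylated-CpG count per fragment in the window.
        counts = Counter(sum(f[s] for s in sites) for f in covering)
        total = len(covering)
        # Second function: fraction of fragments having each possible count.
        encoding[(region_id, start, size)] = [
            counts.get(k, 0) / total if total else 0.0 for k in range(size + 1)
        ]
    return encoding


fragments = [{0: 1, 1: 1, 2: 0}, {1: 1, 2: 1, 3: 1}, {0: 0, 1: 0}]
features = encode_region("region_A", num_cpgs=4, size=2, fragments=fragments)
print(features[("region_A", 0, 2)])  # [0.5, 0.0, 0.5]
```

Keying each set of values by a window identifier keeps the correspondence between stored values and the features of a computational model explicit.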
In one aspect, a process encodes methylation signals for DNA fragments from a liquid biopsy of a subject, wherein each methylation signal is indicative of methylation of CpGs in a respective DNA fragment. The process involves processing the liquid biopsy of the subject to generate in computer storage a respective methylation signal for each of a plurality of DNA fragments in the liquid biopsy, the respective methylation signal indicative of methylation of CpGs in the DNA fragment. The methylation signals are processed using a computer processor having access to the computer storage that stores the methylation signals for the DNA fragments from the liquid biopsy of the subject. The processing includes, for each window of a plurality of windows on a cancer-informative region of DNA of the subject: computing, for each DNA fragment overlapping the window, a respective value for the DNA fragment for the window based on a first function of the respective methylation signal for the DNA fragment in the window; computing one or more respective metrics for the window based on a second function of the respective values computed for the DNA fragments for the window based on the first function; and storing a data structure in memory including the one or more respective metrics computed for the window as a set of values associated with an identifier of the window of the cancer-informative region. The set of values corresponds to a set of features corresponding to inputs of a computational model. The set of values for the subject for the plurality of windows represents an encoding of the methylation signals for the DNA fragments from the liquid biopsy of the subject.
In one aspect, a non-transitory computer storage medium comprises computer storage with data encoded thereon. The data defines a training set for training a computational model, wherein the data in the training set represents a plurality of processed samples, each processed sample originating from a respective liquid biopsy from a respective subject. The data for each processed sample includes a respective set of values for the processed sample encoding methylation signals from DNA fragments in the processed sample, each set of values including, for each cancer-informative region of DNA, and for each window on the cancer-informative region, one or more respective metrics computed for the window and associated with an identifier of the window of the cancer-informative region, wherein each respective metric comprises a value based on computing a respective value for each DNA fragment for the window based on a first function of the respective methylation signal for the DNA fragment in the window, and a second function of the respective values computed for the DNA fragments for the window based on the first function; and a respective label for the processed sample indicative of a respective known characteristic of the respective subject associated with the processed sample.
In one aspect, a machine includes a processing system comprising at least one computer processor and computer storage, accessible by the processing system. The computer storage includes data defining a training set for training a computational model, the data in the training set representing a plurality of processed samples, each processed sample originating from a respective liquid biopsy from a respective subject, the data for each processed sample including: i. a respective set of values for the processed sample encoding methylation signals from DNA fragments in the processed sample, each set of values including, for each cancer-informative region of DNA, and for each window on the cancer-informative region, one or more respective metrics computed for the window and associated with an identifier of the window of the cancer-informative region, wherein each respective metric comprises a value based on computing a respective value for each DNA fragment for the window based on a first function of the respective methylation signal for the DNA fragment in the window, and a second function of the respective values computed for the DNA fragments for the window based on the first function, and ii. a respective label for the processed sample indicative of a respective known characteristic of the respective subject associated with the processed sample.
For such training, computer program code is stored in the computer storage that when executed by the processing system defines a computational model having inputs for receiving a set of values for a processed sample from the training set, and having an output providing a computed characteristic based on the set of values received at the inputs and parameters of a function, the computer program code further configuring the processing system to access the training set and train the computational model. Training involves repeatedly i. applying the respective sets of values for processed samples in the training set to the inputs of the computational model, ii. receiving, from the output of the computational model, respective outputs in response to the respective set of values applied to the inputs of the computational model, iii. comparing the respective outputs for the respective sets of values to the respective labels for the processed samples corresponding to the input sets of values, and iv. adjusting the parameters of the computational model to reduce error between the respective outputs from the computational model and the respective labels for the processed samples.
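The four training steps above can be illustrated with a small sketch. It assumes, purely for illustration, that the computational model is a logistic regression trained by batch gradient descent and that labels are 0 or 1; this description does not limit the computational model to that form, and the names `predict` and `train`, the learning rate, and the number of iterations are hypothetical.

```python
import math
from typing import List, Tuple


def predict(weights: List[float], bias: float, values: List[float]) -> float:
    """Model output: a value between 0 and 1 computed from the input values."""
    z = bias + sum(w * v for w, v in zip(weights, values))
    return 1.0 / (1.0 + math.exp(-z))


def train(training_set: List[Tuple[List[float], int]],
          epochs: int = 500, lr: float = 0.5) -> Tuple[List[float], float]:
    n = len(training_set[0][0])
    m = len(training_set)
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for values, label in training_set:
            output = predict(weights, bias, values)  # steps i and ii
            error = output - label                   # step iii: compare to label
            for j, v in enumerate(values):
                grad_w[j] += error * v
            grad_b += error
        # Step iv: adjust parameters to reduce error against the labels.
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Toy sets of values (two features per sample) with labels 0 and 1.
training_set = [([0.9, 0.1], 0), ([0.8, 0.2], 0), ([0.2, 0.8], 1), ([0.1, 0.9], 1)]
weights, bias = train(training_set)
print(predict(weights, bias, [0.15, 0.85]) > 0.5)  # True
```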
In one aspect, a cancer recognition system recognizes a risk of presence of a neoplasm in a subject based on a liquid biopsy from the subject. The system includes equipment having an input that receives a liquid biopsy and an output that provides a methylation signal for the liquid biopsy. The methylation signal is indicative of methylation of CpGs of DNA fragments in the liquid biopsy. The system includes an analytical platform having an input receiving the methylation signal for the liquid biopsy from the equipment and having a processing system that, in response to computer program instructions, is configured to process the methylation signals. To perform such processing, the processing system is configured to, for each window of a plurality of windows on a cancer-informative region of DNA of the subject: compute, for each DNA fragment overlapping the window, a respective value for the DNA fragment for the window based on a first function of the respective methylation signal for the DNA fragment in the window; compute one or more respective metrics for the window based on a second function of the respective values computed for the DNA fragments for the window based on the first function; store a data structure in memory including the one or more respective metrics computed for the window as a set of values associated with an identifier of the window of the cancer-informative region, wherein the set of values corresponds to a set of features corresponding to inputs of a computational model, and wherein the set of values for the subject for the plurality of windows represents an encoding of the methylation signals for the DNA fragments from the liquid biopsy of the subject; and input the computed set of values for the subject for the set of features to a trained computational model that applies a function to the computed set of values to produce an output indicative of a risk of presence of a neoplasm in the subject.
In one aspect, a health condition recognition system recognizes a risk of presence of a health condition in a subject based on a biological sample from the subject. The system includes an analytical platform having an input receiving a signal for the biological sample from the equipment and having a processing system that, in response to computer program instructions, is configured to process the signal. To perform such processing, the processing system is configured to compute, for each instance of an analyte in the biological sample, and for each window of a plurality of windows on health-condition-informative regions of the analyte, a respective value for the instance for the window based on a first function of respective marker information for the instance of the analyte for the window. The processing includes computing, for each window of the plurality of windows on the health-condition-informative region, one or more respective metrics for the window based on a second function of the respective values computed for the instances of the analyte for the window based on the first function. The processing includes storing a data structure in memory including the one or more respective metrics computed for the window as a set of values associated with an identifier of the window of the health-condition-informative region. The set of values corresponds to a set of features corresponding to inputs of a computational model. The set of values for the subject for the plurality of windows represents an encoding of the signal from the biological sample of the subject.
In one aspect, a process detects a risk of presence of a health condition in a subject. The process includes obtaining a biological sample from the subject, and processing the biological sample to obtain a signal indicative of marker information in instances of an analyte in the biological sample. This part of the process can be used to obtain training set samples or samples from individuals for whom status of a health condition is unknown. The signal is processed to provide a set of values for the subject as an encoding of the signal from the biological sample of the subject. The encoded signal is applied to a computational model trained using machine learning techniques to provide an output representing a risk of presence of a health condition in a subject.
In one aspect, a process detects a risk of presence of a neoplasm in a subject. The process includes obtaining a liquid biopsy from the subject, and processing the liquid biopsy to obtain a methylation signal for the liquid biopsy. This part of the process can be used to obtain training set samples or samples from individuals for whom status of a health condition is unknown. The methylation signals are processed to provide a set of values for the subject as an encoding of the methylation signal from the liquid biopsy of the subject. The encoded methylation signal is applied to a computational model trained using machine learning techniques to provide an output representing a risk of presence of a neoplasm in a subject.
In one aspect, a health condition recognition system recognizes a risk of presence of a health condition in a subject based on a biological sample from the subject. The system includes means for receiving a signal for the biological sample and means for processing the signal. The means for processing computes, for each instance of an analyte in the biological sample, and for each window of a plurality of windows on health-condition-informative regions of the analyte, a respective value for the instance for the window based on a first function of respective marker information for the instance of the analyte for the window. The means for processing computes, for each window of the plurality of windows on the health-condition-informative region, one or more respective metrics for the window based on a second function of the respective values computed for the instances of the analyte for the window based on the first function. The means for processing stores a data structure in memory including the one or more respective metrics computed for the window as a set of values associated with an identifier of the window of the health-condition-informative region. The set of values corresponds to a set of features corresponding to inputs of a computational model. The set of values for the subject for the plurality of windows represents an encoding of the signal from the biological sample of the subject.
In any of the foregoing aspects, the encoded signal represents features that can be applied to a computational model trained using machine learning techniques to provide an output representing a risk of presence of a health condition in a subject.
In any of the foregoing aspects, the respective metrics computed for each of the plurality of windows on the region can be stored in a database as data representing the biological sample or liquid biopsy sample. In some implementations, stored data includes an identifier identifying a liquid biopsy sample or a biological sample. In some implementations, stored data includes an identifier identifying a subject corresponding to the liquid biopsy sample or biological sample. In some implementations, stored data includes a plurality of sets of values corresponding to a plurality of liquid biopsy samples or biological samples from a single subject. In some implementations, the set of values for the subject corresponding to the set of features includes data associating the respective metric for each window for each health-condition-informative region with an identifier of the window and an identifier of the health-condition-informative region. In some implementations, the data is stored in a database allowing search and retrieval of the data given one or more of an identifier of a subject, an identifier of a health-condition-informative region, an identifier of a window, or an identifier of a liquid biopsy sample or biological sample.
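As one hedged illustration of such storage, the sketch below uses an in-memory SQLite database from the Python standard library. The table name `window_metrics`, its columns, the identifier values, and the helper `metrics_for` are hypothetical; this description does not prescribe a schema or a database product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE window_metrics (
           subject_id   TEXT,
           sample_id    TEXT,
           region_id    TEXT,
           window_start INTEGER,
           window_size  INTEGER,
           metric_index INTEGER,
           value        REAL)"""
)

# Illustrative rows: two windows of one region for one sample of one subject.
rows = [
    ("subj_1", "sample_1", "region_A", 0, 2, 0, 0.5),
    ("subj_1", "sample_1", "region_A", 0, 2, 1, 0.0),
    ("subj_1", "sample_1", "region_A", 0, 2, 2, 0.5),
    ("subj_1", "sample_1", "region_A", 1, 2, 0, 0.0),
    ("subj_1", "sample_1", "region_A", 1, 2, 1, 0.5),
    ("subj_1", "sample_1", "region_A", 1, 2, 2, 0.5),
]
conn.executemany("INSERT INTO window_metrics VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def metrics_for(conn, subject_id, region_id, window_start):
    """Retrieve the metric values of one window, ordered by metric index."""
    cur = conn.execute(
        "SELECT value FROM window_metrics "
        "WHERE subject_id = ? AND region_id = ? AND window_start = ? "
        "ORDER BY metric_index",
        (subject_id, region_id, window_start),
    )
    return [row[0] for row in cur.fetchall()]


print(metrics_for(conn, "subj_1", "region_A", 1))  # [0.0, 0.5, 0.5]
```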
In any of the foregoing aspects, samples in a training set can have a label. In some implementations, a label for a sample is selected from a group comprising a label indicative of non-cancer and a label indicative of cancer. In some implementations, a label for a sample is selected from a group comprising a label indicative of a type of cancer. In some implementations, a label for a sample is indicative of a health condition.
In any of the foregoing aspects, processing can further include applying the data representing the biological sample or liquid biopsy sample to a computational model that determines the risk of presence of the early-stage neoplasm in the subject based on the set of values for the set of features.
In any of the foregoing aspects, the first function applied to an instance of an analyte can include a count of occurrences of marker information within the instance of the analyte within the window. In some implementations, the second function of the respective values computed for the instances of the analyte for the window is based on a respective ratio of a count of instances of the analyte having a specific count of occurrences of marker information to a count of instances of analyte for the window.
In any of the foregoing aspects, the first function of the methylation signal for a DNA fragment in a window can include a count of methylated CpGs in the DNA fragment in the window. In some implementations, the second function of the respective values computed for DNA fragments for a window is based on a respective ratio of a count of DNA fragments having a specific count of methylated CpGs to a count of DNA fragments for the window. In some implementations, the second function of the respective values computed for DNA fragments for a window is based on a count of DNA fragments having a specific count of methylated CpGs.
In any of the foregoing aspects, the first function applied to an instance of an analyte can include an indication of a pattern of marker information in the instance in the window, from among a set of possible patterns. In some implementations, the second function of the respective values computed for the instances of the analyte for the window is based on, for each possible pattern of marker information in the window, a ratio of a count of instances of the analyte having the pattern to a count of instances of the analyte in the window.
In any of the foregoing aspects, the first function of the methylation signal can include an indication of a pattern of methylation of CpGs in the DNA fragment in the window. In some implementations, the second function of the respective values computed for DNA fragments for a window is based on, for each possible pattern of methylation in the window, a respective ratio of a count of DNA fragments having the pattern of methylation to a count of the DNA fragments.
In any of the foregoing aspects, a methylation signal can include data indicative of a respective methylation of each CpG in a sequence of CpGs of a DNA fragment. In any of the foregoing aspects, the region can include a plurality of CpGs wherein a number N of CpGs in the region is an integer greater than or equal to 1 and less than or equal to X, a positive integer. In any of the foregoing aspects, the region can include a plurality of CpGs wherein a number N of CpGs in the region is an integer selected from the group consisting of 1, 2, ..., X.
In any of the foregoing aspects, the first function and the second function can be computed for a plurality of different window sizes for a region. In any of the foregoing aspects, the first function can include, for a window of size W within a region, for each possible pattern of 2^W patterns, a respective count for the pattern, wherein a pattern is counted each time a read has that pattern in that window in that region.
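Computing pattern counts for several window sizes can be sketched as follows. The sketch assumes binary methylation states, reads that each cover every site of the region, and a stride of one site; the feature key (start, size, pattern) and the function names are illustrative assumptions.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

FeatureKey = Tuple[int, int, Tuple[int, ...]]  # (window start, size, pattern)


def pattern_counts(reads: Sequence[Sequence[int]], start: int, size: int) -> Dict[Tuple[int, ...], int]:
    """Count, for each of the 2**size patterns, the reads showing that
    pattern in the window beginning at `start`."""
    counts = {p: 0 for p in product((0, 1), repeat=size)}
    for read in reads:
        counts[tuple(read[start:start + size])] += 1
    return counts


def encode_multi_size(reads: Sequence[Sequence[int]], num_sites: int,
                      sizes: Sequence[int]) -> Dict[FeatureKey, int]:
    """Compute pattern counts for every window of every requested size."""
    features: Dict[FeatureKey, int] = {}
    for size in sizes:
        for start in range(num_sites - size + 1):
            for pattern, count in pattern_counts(reads, start, size).items():
                features[(start, size, pattern)] = count
    return features


reads = [(1, 1, 0), (1, 0, 0), (1, 1, 1)]
features = encode_multi_size(reads, num_sites=3, sizes=(1, 2))
print(len(features))             # 14: 3 windows * 2 patterns + 2 windows * 4 patterns
print(features[(0, 2, (1, 1))])  # 2
```

Because each window of size W contributes 2**W values, the number of features grows quickly with W, which is one reason a set of window sizes might be kept small in practice.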
In any of the foregoing aspects, each window has a specified position within a sequence of sites of interest in a health-condition informative region, and a specified size, wherein the size is specified in terms of a number of consecutive sites of interest within the analyte. In any of the foregoing aspects, the health-condition informative regions can be selected from among the genomic regions in one or more of Table I or Table II.
In any of the foregoing aspects, processing a sample can include processing the sample to locate cell-free DNA fragments. In some implementations, the liquid biopsy samples can be obtained and processed such that an average number of cell-free DNA fragments located per cancer-informative region is sufficient to likely include one or more cell-free DNA fragments originating from an early-stage neoplasm if present in the subject. In some implementations, the cell-free DNA fragments originate from health-condition-informative regions of DNA. In some implementations, processing each sample includes processing located cell-free DNA fragments to determine respective methylation information related to CpGs of the located cell-free DNA fragments. In some implementations, the sample is processed such that an average number of cell-free DNA fragments processed per cancer-informative region is sufficient to be likely to detect one or more cell-free DNA fragments per cancer-informative region originating from a present cancer.
In any of the foregoing aspects, a liquid biopsy sample or biological sample comprises plasma obtained from an asymptomatic individual. In any of the foregoing aspects, a liquid biopsy sample or biological sample comprises plasma obtained from a symptomatic individual.
In another aspect, an article of manufacture includes at least one computer storage, and computer program instructions stored on the at least one computer storage. The computer program instructions, when processed by a processing system of a computer, the processing system comprising one or more processing units and storage, configure the computer as set forth in any of the foregoing aspects and/or perform a process as set forth in any of the foregoing aspects.
Any of the foregoing aspects may be embodied as a computer system, as any individual component of such a computer system, as a process performed by such a computer system or any individual component of such a computer system, or as an article of manufacture including computer storage in which computer program instructions are stored and which, when processed by one or more computers, configure the one or more computers to provide such a computer system or any individual component of such a computer system.
The following Detailed Description references the accompanying drawings which form a part of this application, and which show, by way of illustration, specific example implementations. Other implementations may be made without departing from the scope of the disclosure.
In the drawings, in the data flow diagrams, a parallelogram indicates an object, e.g., data, which is an input to a component of a system that manipulates the object or an output of such a system, whereas a rectangle indicates the component of the system, e.g., a programmed processor, which manipulates that object.
Referring now to the data flow diagram of
Biological samples from a plurality of individuals with known diagnoses are obtained. These biological samples are used to create data for a training set and therefore are referred to as training set samples 100. The training set samples include biological samples both from individuals with a health condition and from individuals without the health condition. For example, if the health condition is cancer, then the training set samples include biological samples both from individuals with cancer and from individuals without cancer. Machine learning techniques are applied to data derived from signals generated from processing the training set samples to train a computational model (118). The trained computational model 118 is used to detect the health condition in individuals (with unknown diagnoses) by analyzing data derived from signals generated from processing biological samples (104) from those individuals.
The biological samples (whether training set samples 100, or samples from individuals 104) are subjected to several processing steps, represented by sample preparation system(s) 102. Such sample preparation systems generally use reagents, probes, and other ingredients and processes to process a biological sample. Characteristics of the processed biological samples are measured using equipment that generates signals representing these characteristics. Data representing signals are output from the sample preparation system(s) 102. The output of the sample preparation system(s) 102 is called herein sample data, which can be sample data 124 for an individual sample 104, or sample data 106 for a training set sample. Training set sample data 106 and sample data 124 for individuals are derived from such measurements and are output by the sample preparation system. While
A feature computation module 108 processes the training set sample data 106 to derive respective data for each training set sample. This processing converts the raw data output from the sample preparation system into a set of values for a set of “features” for training and using machine learning models. This set of values for the set of features for a training set sample is stored in association with a respective label indicative of a known characteristic of the individual associated with the training set sample, as training data 110.
The known classification or “label” related to a health condition indicates either the absence of a health condition (or normal) or the presence of the health condition (or abnormal), with terms such as “normal” or “abnormal,” or “healthy” or “unhealthy.” In an implementation where the health condition being detected is cancer, the label can indicate whether, and in some cases what type or to what extent, an individual has cancer, with terms such as non-cancer (or normal) and cancer (or abnormal). Other information about the individual or the biological sample may be known, such as a tissue or liquid of origin, a stage of development, severity of condition, or type of any tumor, or other information.
It should be understood that the labels “normal” and “abnormal,” or “cancer” and “non-cancer,” are not intended to be limiting, and that any label intended to indicate that a biological sample from an individual is not definitively associated with a health condition can be used. A non-exhaustive set of examples of labels includes “normal,” “not detected,” “healthy,” “not identified,” or any other character or set of characters that is interpreted within the computer to discriminate such samples from other samples where the individual has been diagnosed with a health condition or any characteristics of such a health condition.
The training set 110, which comprises sets of values for a set of features which represent the training set samples 100 for which respective classifications or labels for the corresponding individual or biological sample are known, is used by a training module 116 to build or train a computational model 118. Typically, training is performed by dividing the training set into a train set 112 and a test set 114, applying the values for features for the train set samples to the computational model 118, and receiving the computed characteristic 130 from the computational model. The training module 116 repeatedly adjusts parameters of the computational model 118 using the train set 112 while minimizing errors in the computed characteristic 130. The training module 116 then uses the test set 114 to verify how well the computational model 118 has been trained.
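The division into a train set and a test set, the repeated parameter adjustment, and the verification step can be sketched as follows. This is a minimal illustration only, assuming a logistic regression fitted by stochastic gradient descent as a stand-in for the computational model 118; the function names, the toy feature values, the learning rate, and the 25% test fraction are all hypothetical and are not taken from this description.

```python
import math
import random

def split_training_set(rows, test_fraction=0.25, seed=0):
    """Shuffle (features, label) rows and divide them into a train set and a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict(weights, bias, features):
    """Computed characteristic: a probability between zero and one."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(train_set, epochs=500, lr=0.5):
    """Repeatedly adjust the parameters to reduce the error on the train set."""
    weights, bias = [0.0] * len(train_set[0][0]), 0.0
    for _ in range(epochs):
        for features, label in train_set:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def accuracy(weights, bias, rows):
    """Fraction of rows whose thresholded prediction matches the known label."""
    hits = sum((predict(weights, bias, f) >= 0.5) == bool(y) for f, y in rows)
    return hits / len(rows)

# Toy training set (hypothetical values): label 1 = condition present, 0 = absent.
rng = random.Random(1)
training_set = (
    [([rng.uniform(0.6, 1.0), rng.uniform(0.0, 1.0)], 1) for _ in range(20)]
    + [([rng.uniform(0.0, 0.4), rng.uniform(0.0, 1.0)], 0) for _ in range(20)]
)
train_set, test_set = split_training_set(training_set)
weights, bias = train(train_set)
test_accuracy = accuracy(weights, bias, test_set)  # verification on held-out rows
```

Holding the test set out of the parameter updates is what allows it to indicate how well the model generalizes rather than how well it memorized the train set.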
The sample data 124 for individual samples are processed by a feature computation module 126 to generate a set of values for the set of features for the individual samples to provide feature data 128. Feature computation module 126 is similar to, if not identical to, feature computation module 108; however, feature data 128 from processed individual samples 124 are inputs to the trained computational model 118, to allow the associated individual to be classified by the trained computational model. Thus, the feature data 128 for processed individual samples does not include a label. The trained computational model 118, when applied to feature data 128, outputs the computed characteristic 130 which indicates the model's determination of the likelihood of the presence of the health condition in the individual.
The computational model 118 can be any computational model for detection of, or prediction of a likelihood of presence of, a health condition in a subject. Examples of such health conditions include, but are not limited to, various neoplasms or tumors, such as pre-cancerous neoplasms, cancer, including early-stage cancerous solid tumors, other diseases, or other disorders, including but not limited to autoimmune diseases, metabolic disorders, neurological disorders, aging, and trauma.
Example types of computational models include, but are not limited to, random forest or other form of classification or decision trees, ensembles of models, and “deep learning” models. Computational models are known by a variety of names, including, but not limited to, classifiers, decision trees, random forests, classification and regression trees, clustering algorithms, predictive models, neural networks, genetic algorithms, deep learning algorithms, convolutional neural networks, artificial intelligence systems, machine learning algorithms, Bayesian models, expert rules, support vector machines, conditional random fields, logistic regression, maximum entropy, among others.
In general, such computational models receive a vector or n-dimensional matrix of features as an input, and provide an output such as a classification, prediction, or other value. The computational model used for classification may or may not be a computational model that is trained by a training set. For example, the computational model can be a simplified set of computations performed on values for the features derived for an individual, where that set of computations is based on insights obtained by analyzing data from the training set. Typically, the computational model computes a function of the features, which may be a linear or non-linear function, to produce an output where the output is indicative of the resulting classification.
The output of the computational model can be a form of prediction, indicating a likelihood that the individual from whom a sample was obtained has a health condition, such as cancer, or a characteristic of a health condition, such as a type of cancer or stage of development. This prediction can be in the form of a probability between zero and one, or a binary output, such as a yes or no answer, or a score (which may be compared to one or more thresholds), or other format. The output can be accompanied by additional information indicating, for example, a level of confidence in the prediction. The output typically depends on the form of the computational model used.
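As one hedged illustration of comparing a score to one or more thresholds, the sketch below maps a score to a category. The function name, the threshold values, and the category labels are hypothetical, and the thresholds are assumed to be listed in ascending order.

```python
from bisect import bisect_right

def categorize(score, thresholds=(0.5,), labels=("not detected", "detected")):
    """Map a score to a label using ascending thresholds.

    With k thresholds there must be k + 1 labels; a score equal to a
    threshold falls into the higher category.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    return labels[bisect_right(thresholds, score)]

binary = categorize(0.49)
three_way = categorize(0.5, thresholds=(0.3, 0.7),
                       labels=("low", "indeterminate", "high"))
```

A single threshold yields the binary form of output, while two or more thresholds yield graded categories.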
In some implementations for detecting cancer, the output of the computational model indicates the likelihood of presence of cancer, without indicating a type of tumor or cancer, i.e., the affected tissue. In some implementations, the output of the model can indicate a type of cancer, such as its tissue of origin, or stage, or other characteristic, or any combination of these. In some implementations, the output of the model can indicate the presence of cancer, and then one or more additional models can be applied to the data to indicate a type of cancer. In some implementations, a separate model for each type of cancer can be used, and an ensembling process can process the outputs of the separate models.
To apply machine learning techniques to signals generated from biological samples to detect health conditions, one of the problems to solve is the selection of the “features” to be derived from those signals. These features are inputs to a computational model, and challenges arise in how to efficiently discover, compute, store, and otherwise process such features, both for creating and using a training set for training a computational model and for using a trained computational model for classification or prediction. The features to be selected depend in part on the nature of the biological samples and signals generated from processing such samples.
In some implementations, the biological sample is a liquid biopsy, such as a sample of blood or portion of blood such as plasma, urine, stool, saliva, mucous, or other liquid expelled by the body. In some implementations, the biological sample is a tissue biopsy, providing cells extracted from a specific tissue of the body.
A sample preparation system locates an analyte in the biological sample and measures a characteristic of that analyte. As an example, an analyte that can be detected in a biological sample is nucleic acid, for example DNA or RNA. DNA may be whole DNA or cell-free DNA, which are fragments of DNA in the biological sample, for which laboratory equipment such as a sequencer can be used. Such nucleic acids can be processed to measure a variety of different markers at different positions along the nucleic acid. Such markers can include, but are not limited to, mutations, such as point mutations (e.g., transition or transversion mutations), deletions, insertions, or other modifications. Such markers can be genetic or epigenetic or of another type.
In some implementations, modifications to DNA can include various types, such as 5-methylcytosine, 5-hydroxymethylcytosine, 5-formylcytosine, 5-carboxylcytosine, or methylation of other DNA bases (e.g., N6-methyladenine), and methylations occurring at different kinds of sites, such as at CpG, CpA, CpT, and CpC sites.
In some implementations, cell-free DNA fragments are processed to determine whether CpGs in the DNA fragments are subject to cytosine methylation. In such an implementation, the sample preparation system includes equipment that generates a signal (herein called a “methylation signal”) for DNA fragments in the sample indicating whether CpGs in the DNA fragments are methylated or not. For example, a modified nucleic acid base (e.g., 5-methylcytosine) may be detected directly using, e.g., nanopore sequencing, or indirectly by chemically converting (e.g., bisulfite converting) the unmodified nucleic acid base to a different chemical entity selectively over the modified nucleic acid base (e.g., cytosine may be converted to uracil using bisulfite conversion under conditions where 5-methylcytosine is not converted to thymine), the presence of which may be detected, e.g., through nucleic acid sequencing.
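The indirect, conversion-based readout can be illustrated with a short sketch. It assumes a simplified setting that is not specified in this description: a read already aligned to the forward strand of a reference sequence, of equal length and without insertions or deletions, in which converted uracil is reported by the sequencer as thymine. The function names and the example sequences are hypothetical.

```python
def cpg_positions(reference):
    """Return the 0-based index of the cytosine in each CpG of the reference."""
    return [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]

def methylation_calls(reference, converted_read):
    """For each reference CpG, report 1 (methylated), 0 (unmethylated), or None.

    After bisulfite conversion, an unmethylated cytosine is read as T, while a
    5-methylcytosine is protected from conversion and is still read as C.
    """
    calls = {}
    for pos in cpg_positions(reference):
        base = converted_read[pos]
        if base == "C":
            calls[pos] = 1
        elif base == "T":
            calls[pos] = 0
        else:
            calls[pos] = None  # mismatch or sequencing error: no call
    return calls

reference = "ACGTTCGACCGA"  # CpGs at positions 1, 5, and 9
read = "ACGTTTGATCGA"       # position 8 is a converted non-CpG cytosine
calls = methylation_calls(reference, read)
```

Here the CpGs at positions 1 and 9 are called methylated and the CpG at position 5 is called unmethylated; the converted cytosine at position 8 is ignored because it is not part of a CpG.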
In such applications, and others, the signal generated from processing a biological sample includes data representing a plurality of instances of an analyte, e.g., a nucleic acid, and respective information about markers related to each instance detected. The respective information about markers related to an analyte generally includes, for a marker, a respective location of the marker on the analyte and information about the marker, such as a state or type or other indication. There is a wide variety of types of sample processing techniques, measurement equipment, analytes, and markers, for different applications that can be used to produce signals representing the instances of an analyte and markers on such analytes within a biological sample.
Conceptually, using methylation of CpGs in cell-free DNA as an illustrative example, the signal illustrated in
The information about the markers for each instance of an analyte in a sample can result in a large amount of data. As an example, in practice, in the case of obtaining methylation state of CpGs in cell-free DNA from a blood sample using deep sequencing, using a DNA sequencer that outputs such data into a FASTQ format data file, the signal generated by processing a single blood sample can be many gigabytes, e.g., 20 to 30 gigabytes, of data.
By using the position information for each instance of an analyte, distinct instances of the analyte can be grouped into regions within the analyte. Typically, markers related to health conditions have been found to be localized within identifiable regions of analytes, such as specific genes or regions within the genome in the case of DNA. Thus, the signals generated for each instance of an analyte can be grouped and processed by health-condition-informative regions. Thus, the example in
Different metrics and health-condition-informative regions, if used, may be useful in detecting a variety of diseases or conditions, such as cancer, autoimmune disorders, metabolic disorders, neurological disorders, aging, and trauma. As one example, aspects of the microbiome, such as relative amounts of microorganisms, diversity of microorganisms, gene expression patterns of microorganisms, and others, may be metrics useful in detecting a health condition and may be monitored over time. For instance, sequence information from DNA (e.g., bacterial genomes) or RNA (e.g., 16S rRNA) may be used to distinguish or quantify levels of specific microorganisms in the microbiome.
In certain embodiments, the health-condition-informative regions may be useful for detecting an early-stage health condition, e.g., prior to development of symptoms, such as an early-stage cancer. In certain embodiments, the health-condition-informative regions may be useful for detecting a pre-disease health condition, e.g., a health condition that is not presently a disease state but may later mature to a disease state. Non-limiting examples of pre-disease health conditions include precancer and prediabetes.
The methods described herein may be used in connection with a number of health conditions. The health condition may be cancer or precancer of any solid or liquid tissue. The individual may have hyperplasia, dysplasia, carcinoma in situ, or a benign tumor. In some embodiments, the tissue associated with a precancer may be gastrointestinal tissue (such as colorectal tissue, pancreatic tissue, gastric tissue, esophageal tissue, hepatocellular tissue, cholangiocellular tissue, oral tissue, lip tissue); urogenital tissue (such as prostate tissue, renal tissue, bladder tissue, penile tissue); gynecological tissue (such as ovarian tissue, cervical tissue, endometrial tissue); lung tissue; head and neck tissue; CNS tissue including glial tissue, astrocytes, retinocytes; breast tissue; skin tissue; thyroid tissue; bone and soft tissue; and hematologic tissue (such as lymphocytes). In some embodiments, the cancer may be acute leukemia, astrocytomas, biliary cancer (cholangiocarcinoma), bone cancer, breast cancer, brain stem glioma, bronchioloalveolar cell lung cancer, cancer of the adrenal gland, cancer of the anal region, cancer of the bladder, cancer of the endocrine system, cancer of the esophagus, cancer of the head or neck, cancer of the kidney, cancer of the parathyroid gland, cancer of the penis, cancer of the pleural/peritoneal membranes, cancer of the salivary gland, cancer of the small intestine, cancer of the thyroid gland, cancer of the ureter, cancer of the urethra, carcinoma of the cervix, carcinoma of the endometrium, carcinoma of the fallopian tubes, carcinoma of the renal pelvis, carcinoma of the vagina, carcinoma of the vulva, cervical cancer, chronic leukemia, colon cancer, colorectal cancer, cutaneous melanoma, ependymoma, epidermoid tumors, Ewing's sarcoma, gastric cancer, glioblastoma, glioblastoma multiforme, glioma, hematologic malignancies, hepatocellular (liver) carcinoma, hepatoma, Hodgkin's Disease, intraocular melanoma, Kaposi sarcoma, lung cancer, lymphomas, medulloblastoma, melanoma, meningioma, mesothelioma, multiple myeloma, muscle cancer, neoplasms of the central nervous system (CNS), neuronal cancer, small cell lung cancer, non-small cell lung cancer, osteosarcoma, ovarian cancer, pancreatic cancer, pediatric malignancies, pituitary adenoma, prostate cancer, rectal cancer, renal cell carcinoma, sarcoma of soft tissue, schwannoma, skin cancer, spinal axis tumors, squamous cell carcinomas, stomach cancer, synovial sarcoma, testicular cancer, uterine cancer, or tumors and their metastases, including refractory versions of any of the above cancers, or any combination thereof.
In some embodiments, the autoimmune disorder may be multiple sclerosis, psoriasis, psoriatic arthritis, rheumatoid arthritis, systemic lupus erythematosus, Crohn's disease, Sjogren's syndrome, Behcet's disease, ulcerative colitis, Guillain-Barre syndrome, or a pre-disease thereof.
In some embodiments, the metabolic disorder may be diabetes, obesity, or a pre-disease thereof.
In some embodiments, the neurological disorder may be Alzheimer's disease, Parkinson's disease, Huntington's disease, amyotrophic lateral sclerosis, or a pre-disease thereof.
In some embodiments, the methods described herein may be informative for treating an individual with an intervention. For example, in some embodiments, a health condition in an individual may be detected as contemplated herein, and an intervention may be provided to the individual to treat the health condition. An individual may be treated using any intervention known to those of ordinary skill in the art. Non-limiting examples of interventions include surgery (e.g., excising diseased or pre-disease tissue from an individual), chemotherapy, gene therapy, gene editing, radiation therapy, or a lifestyle intervention (e.g., change in behavior or habits).
For each training set sample, a respective set of values for the set of features is computed by the feature computation module 108. The specific metrics used and the health-condition-informative regions selected can depend on a variety of factors and may be experimentally determined. The selected metric(s) for the selected health-condition-informative regions are computed for the training set to provide the train set and the test set.
In some implementations, a first set of features is computed for a training set, which can include several candidate features. The candidate features can include one or more candidate metrics, or one or more candidate health-condition-informative regions, or combinations of both. A computational model can be trained using candidate features, and then analyzed to determine which candidate features were more influential in the output of the trained computational model. Such analysis can be used to identify features which are more influential to the model, whether due to the metric or due to the health-condition-informative region. A second set of features can be defined by reducing the first set of features based on those identified features which are more influential, and the trained computational model 118 can be built using the second set of features.
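One way to analyze which candidate features were more influential is sketched below. It assumes permutation influence (the drop in accuracy when one feature's values are shuffled across samples) as the analysis technique, which is a choice made for illustration and is not stated in this description. The toy model, the feature names, and the function names are hypothetical.

```python
import random

def permutation_influence(predict, rows, labels, n_features, seed=0):
    """Drop in accuracy when each feature's column is shuffled across samples."""
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    influence = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        influence.append(baseline - accuracy(permuted))
    return influence

def reduce_features(names, influence, keep):
    """Define the second set of features: the `keep` most influential ones."""
    ranked = sorted(zip(names, influence), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:keep]]

# Hypothetical first set of candidate features (region/window pairs).
names = ["region_A/window_0", "region_A/window_1", "region_B/window_0"]
rows = [[i / 20, (i * 7 % 20) / 20, (i * 3 % 20) / 20] for i in range(20)]
labels = [int(r[0] > 0.5) for r in rows]

def toy_model(row):
    """Stand-in for a trained model that happens to rely only on the first feature."""
    return int(row[0] > 0.5)

influence = permutation_influence(toy_model, rows, labels, n_features=3)
second_set = reduce_features(names, influence, keep=1)
```

Because the toy model ignores the second and third features, shuffling them leaves its accuracy unchanged, so only the first feature survives the reduction.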
For each individual sample for which a prediction or classification is to be made, a respective set of values for the set of features is computed by the feature computation module 126. The specific metrics used and the health-condition-informative regions selected can depend on a variety of factors and may be experimentally determined. The selected metric(s) for the selected health-condition-informative regions are computed for the sample, and then input to the trained computational model 118.
In some implementations related to identifying individually informative CpGs, such CpGs typically are found in regions of the human genome referred to as CpG islands (“CGIs”). Several CGIs may include individually informative CpGs. In some implementations, the system may consider specifically information related to a set of CGIs known to be informative of cancer or other health conditions. In some of these implementations, a sample of cell-free DNA is processed to obtain a first set of methylation data by measuring methylation level at a plurality of CpGs within one or more genomic regions set forth in one or more of Table I or Table II, attached as appendices to this Specification. These tables, as listed below, referred to in this Specification and Claims, form a part of this Specification and are hereby incorporated by reference into this Specification.
Table I—Listing of CGIs identified in U.S. Patent Publication 2020/0109456A1, which is hereby incorporated by reference, specifically the “Table I” of CGIs listed in that published patent application.
Table II—Listing of CGIs identified in PCT Patent Publication WO2022/133315, which is hereby incorporated by reference, specifically “Table 2” and “Table 3” of CGIs listed in that published patent application.
To use a computational model in this context, there are several technical problems that arise relating to encoding the signal resulting from processing a biological sample into features.
Some problems arise because the signal includes a large amount of information. One of the challenges involves reducing the volume of data into a set of informative features. As the number of features increases, the complexity of the computational model increases. However, as the number of features decreases, information relevant to detection of a health condition may be lost.
Some problems arise because of uncertainty around which metrics and which regions of an analyte are truly informative of a health condition. Omission of some metrics or some regions from the set of features may impact the performance of a trained computational model.
To address such problems, the feature computation module encodes a signal generated by processing a biological sample, given one or more health-condition-informative regions related to an analyte, by using metrics based on marker information occurring within a plurality of distinct windows within health-condition-informative regions related to the analyte. Each window has a specified position within a sequence of sites of interest in a health-condition-informative region, and a specified size. The size is specified in terms of a number of consecutive sites of interest within the analyte. A metric is thus computed for a plurality of positions within the health-condition-informative region.
In some implementations, the feature computation module begins encoding by processing each instance of the analyte. For example, the feature computation module computes, for each instance of an analyte in the biological sample, and for each window of a plurality of windows on health-condition-informative regions of the analyte, a respective value for the instance for the window based on a first function of respective marker information for the instance of the analyte for the window. After processing instances of the analyte, the feature computation module then computes, for each window of the plurality of windows on the health-condition-informative region, one or more respective metrics for the window based on a second function of the respective values computed for the instances of the analyte for the window.
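The instance-first order of computation can be sketched as follows. This is an illustration under stated assumptions rather than the implementation described here: each instance is modeled as a mapping from a site index within the region to a methylation state (1 or 0), the first function is assumed to be the fraction of methylated sites in the window, the second function is assumed to be the mean over instances, and an instance contributes to a window only if it covers every site in that window. All names are hypothetical.

```python
from collections import defaultdict

def window_value(fragment, window_sites):
    """Hypothetical first function: fraction of methylated sites in the window.

    Returns None when the fragment does not cover every site in the window.
    """
    if not all(site in fragment for site in window_sites):
        return None
    return sum(fragment[site] for site in window_sites) / len(window_sites)

def encode_instance_first(fragments, windows):
    """Outer loop over instances, inner loop over (start, size) windows."""
    per_window = defaultdict(list)
    for fragment in fragments:
        for start, size in windows:
            value = window_value(fragment, range(start, start + size))
            if value is not None:
                per_window[(start, size)].append(value)
    # Hypothetical second function: mean of the per-instance values.
    return {w: sum(v) / len(v) for w, v in per_window.items()}

# Four toy fragments over a region with four sites of interest (0..3).
fragments = [
    {0: 1, 1: 1, 2: 0},
    {0: 1, 1: 1, 2: 1},
    {1: 0, 2: 0, 3: 0},
    {0: 0, 1: 1},
]
windows = [(0, 2), (1, 2), (2, 2)]
metrics = encode_instance_first(fragments, windows)
```

Processing instance by instance suits streaming input, since each fragment can be discarded once its per-window values have been recorded.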
In some implementations, the feature computation module begins encoding by processing each window within a health-condition-informative region. For example, the feature computation module computes, for each window of a plurality of windows on a health-condition-informative region of the analyte, for each instance of the analyte overlapping the window, a respective value for the instance of the analyte for the window based on a first function of the respective marker information for the instance overlapping the window. After computing the respective values for the instances of the analyte that overlap the window, the feature computation module then computes one or more respective metrics for the window based on a second function of the respective values computed for the instances overlapping the window.
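The window-first order can be sketched in the same spirit. The sketch assumes that the first function returns the pattern of methylation states an instance shows across the sites of the window, and that the second function returns the fraction of instances showing each possible pattern. It also assumes that "overlapping" means covering every site in the window. The data model (a mapping from site index to state) and all names are hypothetical.

```python
from collections import Counter
from itertools import product

def pattern_in_window(fragment, window_sites):
    """First function (assumed): tuple of states across the window, or None."""
    if all(site in fragment for site in window_sites):
        return tuple(fragment[site] for site in window_sites)
    return None

def pattern_fractions(patterns, size):
    """Second function (assumed): fraction of instances with each possible pattern."""
    counts = Counter(patterns)
    total = len(patterns)
    return {
        p: (counts[p] / total if total else 0.0)
        for p in product((0, 1), repeat=size)
    }

def encode_window_first(fragments, n_sites, size, step=1):
    """Outer loop over windows, inner loop over the instances overlapping each."""
    features = {}
    for start in range(0, n_sites - size + 1, step):
        sites = range(start, start + size)
        values = [pattern_in_window(f, sites) for f in fragments]
        values = [v for v in values if v is not None]
        features[start] = pattern_fractions(values, size)
    return features

fragments = [
    {0: 1, 1: 1, 2: 0},
    {0: 1, 1: 1, 2: 1},
    {1: 0, 2: 0, 3: 0},
    {0: 0, 1: 1},
]
features = encode_window_first(fragments, n_sites=4, size=2)
```

For the window starting at site 0, three fragments cover both sites, so the pattern (1, 1) receives a fraction of 2/3 and the pattern (0, 1) receives 1/3. A window of size k yields 2^k pattern fractions, which is one reason small window sizes keep the number of features manageable.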
Example implementations of the feature computation module 108 will now be described. To simplify the following description, in the example described herein, training set sample data 106 and sample data 124 for individuals are described as having been derived from a methylation signal output by the sample preparation system 102 (See
In
In
In
In this example in
In
The second function first computes a count of the number of instances having each possible pattern in a window. That is, for that window, the second function first produces a count of the number of instances with the first pattern, a count of the number of instances with the second pattern, and so on. The second function then divides the respective number of instances identified for each possible pattern by the total number of instances, thus providing a fractional value for each of the possible patterns for this window, as shown in the bottom panel of
In this example in
In any of the foregoing example implementations, and in other implementations, a size of a health-condition-informative region, in terms of a number of sites of interest within an instance of an analyte, can vary. For example, cancer-informative regions of DNA may be as small as a single CpG site, or may include tens, hundreds, or thousands of CpG sites. Within a set of features, there may be a plurality of health-condition-informative regions, each having its own respective size.
In any of the foregoing example implementations, and in other implementations, a size of a window in a health-condition-informative region, in terms of a number of sites of interest within an instance of an analyte, can vary. Generally, the number of sites of interest is a positive integer number that ranges between 1 and N. In some example implementations, N is less than or equal to 10, or 9, or 8, or 7, or 6, or 5, or 4, or 3. Within a set of features, there may be a plurality of health-condition-informative regions, each having its own respective window size or set of window sizes. Different window sizes may be used in different regions. The same window size may be used in different regions. A region may have metrics computed for it for multiple different window sizes. Windows may be overlapping or non-overlapping.
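The window placements described above can be enumerated with a short sketch. It assumes that overlapping windows advance by one site of interest and that non-overlapping windows advance by the window size; both step choices, and the function names, are hypothetical.

```python
def enumerate_windows(n_sites, size, overlapping=True):
    """List (start, size) windows over a region with n_sites sites of interest."""
    if size < 1 or size > n_sites:
        raise ValueError("window size must be between 1 and n_sites")
    step = 1 if overlapping else size
    return [(start, size) for start in range(0, n_sites - size + 1, step)]

def windows_for_sizes(n_sites, sizes, overlapping=True):
    """A region may have windows of several different sizes."""
    return {size: enumerate_windows(n_sites, size, overlapping) for size in sizes}

overlapping = enumerate_windows(6, 3)                     # starts 0, 1, 2, 3
tiled = enumerate_windows(6, 3, overlapping=False)        # starts 0, 3
multi = windows_for_sizes(5, sizes=(1, 2, 3))
```

With non-overlapping windows, trailing sites that do not fill a whole window are left out in this sketch; for example, a region of 5 sites tiled with windows of size 2 covers only sites 0 through 3.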
The computed sets of values for the set of features for samples can be stored in a data structure, which can be stored in a database, memory, or other computer storage for use in connection with the computational model, or for other purposes.
In some implementations, the sets of values for the set of features for a sample can be stored in association with an identifier of the subject, or an identifier of the sample, or both, so that the identifier of the subject or the identifier of the sample, or both, can be used to access the set of values from the computer storage. In some implementations, each computed value can be associated with an identifier of the cancer-informative region, and an identifier of the window within that region, to which the value corresponds.
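A minimal in-memory sketch of such an association is shown below. The class names, the field names, and the example identifiers are hypothetical, and a production system would more likely use a database table keyed in a similar way.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FeatureKey:
    """Identifies the region and the window to which a value corresponds."""
    region_id: str
    window_start: int
    window_size: int

@dataclass
class SampleFeatures:
    """Set of feature values for one sample, tagged with sample and subject IDs."""
    sample_id: str
    subject_id: str
    values: dict = field(default_factory=dict)  # FeatureKey -> metric value

class FeatureStore:
    """Look up stored feature values by sample identifier or subject identifier."""

    def __init__(self):
        self._by_sample = {}
        self._by_subject = {}

    def put(self, record):
        self._by_sample[record.sample_id] = record
        self._by_subject.setdefault(record.subject_id, []).append(record)

    def by_sample(self, sample_id):
        return self._by_sample[sample_id]

    def by_subject(self, subject_id):
        return list(self._by_subject.get(subject_id, []))

store = FeatureStore()
record = SampleFeatures(sample_id="S-001", subject_id="P-42")
record.values[FeatureKey("region_A", 0, 3)] = 0.67
record.values[FeatureKey("region_A", 1, 3)] = 0.33
store.put(record)
looked_up = store.by_sample("S-001").values[FeatureKey("region_A", 0, 3)]
```

Making the key immutable (frozen) allows it to serve directly as a dictionary key, so each value stays tied to its region and window.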
Accordingly, an example implementation of such a data structure is shown in
A flowchart for an example process for encoding a signal from a sample is shown in
A flowchart for an example process for selecting features based on such an encoding is shown in
The foregoing description provides example implementations of a computer system implementing these techniques. The various computers used in this computer system can be implemented using one or more general-purpose computers, such as client devices including mobile devices and client computers, one or more server computers, or one or more database computers, or combinations of any two or more of these, which can be programmed to implement the functionality such as described in the example implementations.
Examples of such general-purpose computers include, but are not limited to, larger computer systems such as server computers, database computers, desktop computers, laptop and notebook computers, as well as mobile or handheld computing devices, such as a tablet computer, handheld computer, smart phone, media player, personal data assistant, audio and/or video recorder, or wearable computing device.
With reference to
A computer storage medium is any medium in which data can be stored in and retrieved from addressable physical storage locations by the computer. Computer storage media include volatile and nonvolatile memory devices, and removable and non-removable storage devices. Memory 504, removable storage 508, and non-removable storage 510 are all examples of computer storage media. Some examples of computer storage media are RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optically or magneto-optically recorded storage devices, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Computer storage media and communication media are mutually exclusive categories of media.
The computer 500 may also include communications connection(s) 512 that allow the computer to communicate with other devices over a communication medium. Communication media typically transmit computer program code, data structures, program modules or other data over a wired or wireless substance by propagating a modulated data signal such as a carrier wave or other transport mechanism over the substance. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media include any non-wired communication media that allows propagation of signals, such as acoustic, electromagnetic, electrical, optical, infrared, radio frequency and other signals. Communications connections 512 are devices, such as a network interface or radio transmitter, that interface with the communication media to transmit data over and receive data from signals propagated through communication media.
The communications connections can include one or more radio transmitters for telephonic communications over cellular telephone networks, and/or a wireless communication interface for wireless connection to a computer network. For example, a cellular connection, a Wi-Fi connection, a Bluetooth connection, and other connections may be present in the computer. Such connections support communication with other devices, such as to support voice or data communications.
The computer 500 may have various input device(s) 514, such as various pointer (whether single pointer or multi-pointer) devices, such as a mouse, tablet and pen, touchpad and other touch-based input devices, and stylus; image input devices, such as still and motion cameras; and audio input devices, such as a microphone. Various output device(s) 516, such as a display, speakers, printers, and so on, also may be included. These devices are well known in the art and need not be discussed at length here.
The various storage 510, communication connections 512, output devices 516 and input devices 514 can be integrated within a housing of the computer or can be connected through various input/output interface devices on the computer, in which case the reference numbers 510, 512, 514 and 516 can indicate either the interface for connection to a device or the device itself as the case may be.
An operating system of the computer typically includes computer programs, commonly called drivers, which manage access to the various storage 510, communication connections 512, output devices 516 and input devices 514. Such access generally includes managing inputs from and outputs to these devices. In the case of communication connections, the operating system also may include one or more computer programs for implementing communication protocols used to communicate information between computers and devices through the communication connections 512.
Any of the foregoing aspects may be embodied as a computer system, as any individual component of such a computer system, as a process performed by such a computer system or any individual component of such a computer system, or as an article of manufacture including computer storage in which computer program code is stored and which, when processed by the processing system(s) of one or more computers, configures the processing system(s) of the one or more computers to provide such a computer system or individual component of such a computer system.
Each component (which also may be called a “module” or “engine” or “computational model” or the like) of a computer system such as described herein, which operates on one or more computers, can be implemented as computer program code processed by the processing system(s) of one or more computers. Computer program code includes computer-executable instructions and/or computer-interpreted instructions, such as program modules, which instructions are processed by a processing system of a computer. Such instructions define routines, programs, objects, components, data structures, and so on, that, when processed by a processing system, instruct the processing system to perform operations on data or configure the processor or computer to implement various components or data structures in computer storage. A data structure is defined in a computer program and specifies how data is organized in computer storage, such as in a memory device or a storage device, so that the data can be accessed, manipulated, and stored by a processing system of a computer.
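By way of a non-limiting sketch, a component and a data structure of the kind described above might be expressed in program code as follows. The names `FeatureRecord` and `ThresholdModule`, and the thresholding operation itself, are hypothetical and are used only to show how a data structure organizes data and how a module performs operations on it; they do not represent the computational model described herein.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureRecord:
    """Hypothetical data structure: how feature data for one sample is organized."""
    sample_id: str
    features: List[float]


class ThresholdModule:
    """Hypothetical component ("module"): flags records whose mean feature
    value exceeds a configured threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def process(self, record: FeatureRecord) -> bool:
        # Access the stored data and perform an operation on it.
        mean_value = sum(record.features) / len(record.features)
        return mean_value > self.threshold


module = ThresholdModule(threshold=0.5)
flagged = module.process(FeatureRecord("sample-1", [0.2, 0.9, 0.7]))
not_flagged = module.process(FeatureRecord("sample-2", [0.1, 0.2]))
```

Here the data class specifies the organization of the data in memory, and the module's instructions, when processed, operate on that data.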
It should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific implementations described above. The specific implementations described above are disclosed as examples only.
What is claimed is:
indicates data missing or illegible when filed
| Number | Date | Country |
|---|---|---|
| 63373754 | Aug 2022 | US |

| Relation | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/US2023/073069 | Aug 2023 | WO |
| Child | 19067342 | | US |