This disclosure generally relates to identification of cancer in a patient, and more specifically to performing a physical assay on a test sample obtained from the patient, as well as statistical analysis of the results of the physical assay.
Analysis of circulating cell-free nucleotides, such as cell-free DNA (cfDNA) or cell-free RNA (cfRNA), using next generation sequencing (NGS) is recognized as a valuable tool for detection and diagnosis of cancer. Analyzing cfDNA can be advantageous in comparison to traditional tumor biopsy methods; however, identifying cancer-indicative signals in tumor-derived cfDNA faces distinct challenges, especially for purposes such as early detection of cancer where the cancer-indicative signals are not yet pronounced. As one example, it may be difficult to achieve the necessary sequencing depth of tumor-derived fragments. As another example, errors introduced during sample preparation and sequencing can make accurate identification of cancer-indicative signals difficult. The combination of these various challenges stands in the way of accurately predicting, with sufficient sensitivity and specificity, characteristics of cancer in a subject through the use of cfDNA obtained from the subject.
Embodiments of the invention provide for a method of generating a cancer prediction, such as a presence or absence of cancer, for an individual based on cfDNA in a test sample obtained from the individual. Specifically, cfDNA from the individual is sequenced to generate sequence reads using one or more sequencing assays, also referred to herein as physical assays, examples of which include a small variant sequencing assay, whole genome sequencing assay, and methylation sequencing assay. The sequence reads of the sequencing assays are processed through corresponding computational analyses, also referred to hereafter as computational pipelines, computational assessments, or computational analyses. Each computational analysis identifies values of features of sequence reads that are informative for generating a cancer prediction while accounting for interfering signals (e.g., noise). As an example, small variant features (e.g., features derived from sequence reads that were generated by a small variant sequencing assay) can include a total number of somatic variants. As another example, whole genome features (e.g., features derived from sequence reads that were generated by a whole genome sequencing assay) can include a total number of copy number aberrations. As yet another example, methylation features (e.g., features derived from sequence reads that were generated by a methylation sequencing assay) can include a total number of hypermethylated or hypomethylated regions. Additional features that are not derived from sequencing-based approaches, such as baseline features that can refer to clinical symptoms and patient information, can be further generated and analyzed.
In some embodiments, one, two, three, or all four of the types of features (e.g., small variant features, whole genome features, methylation features, and baseline features) can be provided to a single predictive cancer model that generates a cancer prediction. In some embodiments, the values of different types of features can be separately provided into different predictive models. Each separate predictive model can output a score that then serves as input into an overall model that outputs the cancer prediction.
Embodiments disclosed herein describe a method for detecting the presence of cancer in a subject, the method comprising: obtaining sequencing data generated from a plurality of cell-free nucleic acids in a test sample from the subject, wherein the sequencing data comprises a plurality of sequence reads determined from the plurality of cell-free nucleic acids; analyzing, using a suitably programmed computer, the plurality of sequence reads to identify two or more sequencing-based features; and detecting the presence of cancer based on the analysis of the two or more features.
Embodiments disclosed herein further describe a method for detecting the presence of cancer in an asymptomatic subject, the method comprising: obtaining sequencing data generated from a plurality of cell-free nucleic acids in a test sample from an asymptomatic subject; analyzing, using a suitably programmed computer, the sequencing data to identify two or more sequencing-based features; and detecting the presence of cancer based on the analysis of the two or more features.
In some embodiments, the method detects three or more different types of cancer. In some embodiments, the method detects five or more different types of cancer. In some embodiments, the method detects ten or more different types of cancer. In some embodiments, the method detects twenty or more different types of cancer. In some embodiments, the two or more different types of cancer are selected from breast cancer, lung cancer, prostate cancer, colorectal cancer, renal cancer, uterine cancer, pancreas cancer, esophageal cancer, lymphoma, head and neck cancer, ovarian cancer, hepatobiliary cancer, melanoma, cervical cancer, multiple myeloma, leukemia, thyroid cancer, bladder cancer, gastric cancer, anorectal cancer, and any combination thereof.
In some embodiments, the cell-free nucleic acids comprise cell-free DNA (cfDNA). In some embodiments, the sequence reads are generated from a next generation sequencing (NGS) procedure. In some embodiments, the sequence reads are generated from a massively parallel sequencing procedure using sequencing-by-synthesis. In some embodiments, the cell-free nucleic acids include cfDNA from white blood cells.
In some embodiments, the two or more features are derived from: a methylation sequencing assay on the plurality of cell-free nucleic acids in the test sample; a whole genome sequencing assay on the plurality of cell-free nucleic acids in the test sample; and/or a small variant sequencing assay on the plurality of cell-free nucleic acids in the test sample.
In some embodiments, the methylation sequencing assay is a whole genome bisulfite sequencing assay. In some embodiments, the methylation sequencing assay is a targeted bisulfite sequencing assay. In some embodiments, detecting the presence of cancer is based on the analysis of two or more features determined from the methylation sequencing assay. In some embodiments, the methylation sequencing assay features comprise one or more of a quantity of hypomethylated counts, quantity of hypermethylated counts, presence or absence of abnormally methylated fragments at CpG sites, hypomethylation score per CpG site, hypermethylation score per CpG site, rankings based on hypermethylation scores, and rankings based on hypomethylation scores.
In some embodiments, detecting the presence of cancer is based on the analysis of two or more features determined from the whole genome sequencing assay. In some embodiments, the whole genome sequencing assay features comprise one or more of characteristics of bins across the genome from either a cfDNA sample or a gDNA sample, characteristics of segments across the genome from either a cfDNA sample or a gDNA sample, presence of one or more copy number aberrations, and reduced dimensionality features. In some embodiments, the method further comprises obtaining sequence data of genomic DNA from one or more white blood cells of the subject.
In some embodiments, the small variant sequencing assay is a targeted sequencing assay, and wherein the sequence data is derived from a targeted panel of genes. In some embodiments, detecting the presence of cancer is based on the analysis of two or more features determined from the small variant sequencing assay. In some embodiments, the small variant sequencing assay features comprise one or more of a total number of somatic variants, a total number of nonsynonymous variants, a total number of synonymous variants, a presence/absence of somatic variants per gene, a presence/absence of somatic variants for particular genes that are known to be associated with cancer, an allele frequency of a somatic variant per gene, order statistics according to AF of somatic variants, and classification of somatic variants that are known to be associated with cancer based on their allele frequency.
In some embodiments, the analysis further comprises one or more baseline features, and wherein the baseline features comprise a polygenic risk score or clinical features of an individual, the clinical features comprising one or more of age, behavior, family history, symptoms, anatomical observations, and penetrant germline cancer carrier status.
In some embodiments, the detected cancer is breast cancer, lung cancer, colorectal cancer, ovarian cancer, uterine cancer, melanoma, renal cancer, pancreatic cancer, thyroid cancer, gastric cancer, hepatobiliary cancer, esophageal cancer, prostate cancer, lymphoma, multiple myeloma, head and neck cancer, bladder cancer, cervical cancer, or any combination thereof.
In some embodiments, the analysis further comprises detecting the presence of one or more viral-derived nucleic acids in the test sample and wherein the detection of cancer is based, in part, on detection of the one or more viral nucleic acids. In some embodiments, the one or more viral-derived nucleic acids are selected from the group consisting of human papillomavirus, Epstein-Barr virus, hepatitis B, hepatitis C, and any combination thereof.
In some embodiments, the test sample is a blood, plasma, serum, urine, cerebrospinal fluid, fecal matter, saliva, pleural fluid, pericardial fluid, cervical swab, or peritoneal fluid sample.
In some embodiments, the predictive cancer model is one of a regression predictor, a random forest predictor, a gradient boosting machine, a Naïve Bayes classifier, a neural network, or an XGBoost model.
The figures and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.
Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. For example, a letter after a reference numeral, such as “predictive cancer model 170A,” indicates that the text refers specifically to the element having that particular reference numeral. A reference numeral in the text without a following letter, such as “predictive cancer model 170,” refers to any or all of the elements in the figures bearing that reference numeral (e.g. “predictive cancer model 170” in the text refers to reference numerals “predictive cancer model 170A” and/or “predictive cancer model 170B” in the figures).
The term “individual” refers to a human individual. The term “healthy individual” refers to an individual presumed to not have a cancer or disease. The term “subject” refers to an individual who is known to have, or potentially has, a cancer or disease.
The term “sequence reads” refers to nucleotide sequences read from a sample obtained from an individual. Sequence reads can be obtained through various methods known in the art.
The term “read segment” or “read” refers to any nucleotide sequences including sequence reads obtained from an individual and/or nucleotide sequences derived from the initial sequence read from a sample obtained from an individual.
The term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y may be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”
The term “indel” refers to any insertion or deletion of one or more base pairs having a length and a position (which may also be referred to as an anchor position) in a sequence read. An insertion corresponds to a positive length, while a deletion corresponds to a negative length.
The term “mutation” refers to one or more SNVs or indels.
The term “true” or “true positive” refers to a mutation that indicates real biology, for example, presence of a potential cancer, disease, or germline mutation in an individual. True positives are tumor-derived mutations and are not caused by mutations naturally occurring in healthy individuals (e.g., recurrent mutations) or other sources of artifacts such as process errors during assay preparation of nucleic acid samples.
The term “false positive” refers to a mutation incorrectly determined to be a true positive.
The term “cell free nucleic acid,” “cell free DNA,” or “cfDNA” refers to nucleic acid fragments that circulate in an individual's body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells. Additionally, cfDNA may come from other sources, such as viruses, fetuses, etc.
The term “genomic nucleic acid,” “genomic DNA,” or “gDNA” refers to nucleic acid including chromosomal DNA that originates from one or more healthy (e.g., non-tumor) cells. In various embodiments, gDNA can be extracted from a cell derived from a blood cell lineage, such as a white blood cell.
The term “circulating tumor DNA” or “ctDNA” refers to nucleic acid fragments that originate from tumor cells or other types of cancer cells, which may be released into an individual's bloodstream as a result of biological processes such as apoptosis or necrosis of dying cells, or actively released by viable tumor cells.
The term “alternative allele” or “ALT” refers to an allele having one or more mutations relative to a reference allele, e.g., corresponding to a known gene.
The term “sequencing depth” or “depth” refers to a total number of read segments from a sample obtained from an individual.
The term “alternate depth” or “AD” refers to a number of read segments in a sample that support an ALT, e.g., include mutations of the ALT.
The term “reference depth” refers to a number of read segments in a sample that include a reference allele at a candidate variant location.
The term “variant” or “true variant” refers to a mutated nucleotide base at a position in the genome. Such a variant can lead to the development and/or progression of cancer in an individual.
The term “candidate variant,” “called variant,” or “putative variant” refers to one or more detected nucleotide variants of a nucleotide sequence, for example, at a position in the genome that is determined to be mutated. Generally, a nucleotide base is deemed a called variant based on the presence of an alternative allele on sequence reads obtained from a sample, where the sequence reads each cross over the position in the genome. The source of a candidate variant may initially be unknown or uncertain. During processing, candidate variants may be associated with an expected source such as gDNA (e.g., blood-derived) or cells impacted by cancer (e.g., tumor-derived). Additionally, candidate variants may be called as true positives.
The term “copy number aberrations” or “CNAs” refers to changes in copy number in somatic tumor cells. For example, CNAs can refer to copy number changes in a solid tumor.
The term “copy number variations” or “CNVs” refers to copy number changes that derive from germline cells or from somatic copy number changes in non-tumor cells. For example, CNVs can refer to copy number changes in white blood cells that can arise due to clonal hematopoiesis.
The term “copy number event” refers to one or both of a copy number aberration and a copy number variation.
1.1. Overall Process Flow
At step 102, the test sample is obtained from the individual. Generally, samples may be from healthy subjects, subjects known to have or suspected of having cancer, or subjects where no prior information is known (e.g., asymptomatic subjects). The test sample may be a sample selected from the group consisting of blood, plasma, serum, urine, fecal, and saliva samples. Alternatively, the test sample may comprise a sample selected from the group consisting of whole blood, a blood fraction, a tissue biopsy, pleural fluid, pericardial fluid, cerebral spinal fluid, and peritoneal fluid.
As shown in each of
At step 104, one or more physical process analyses are performed, at least one physical process analysis including a sequencing-based assay on cfDNA 115 to generate sequence reads. Referring to
A baseline analysis 130 of the individual 110 can include a clinical analysis of the individual 110 and can be performed by a physician or a medical professional. In some embodiments, the baseline analysis 130 can include an analysis of germline changes detectable in the cfDNA 115 of the individual 110. In some embodiments, the baseline analysis 130 can perform the analysis of germline changes with additional information such as an identification of upregulated or downregulated genes. In other embodiments, the baseline analysis includes analysis of clinical features (e.g., known risk factors for cancer, such as a subject's age, race, body mass index (BMI), smoking history, alcohol intake, and/or family cancer history). Such additional information can be provided by a computational analysis, such as computational analysis 140B as depicted in
As used hereafter, a small variant sequencing assay refers to a physical assay that generates sequence reads, typically through targeted gene sequencing panels that can be used to determine small variants, examples of which include single nucleotide variants (SNVs) and/or insertions or deletions. Alternatively, as one of skill in the art would appreciate, assessment of small variants may also be done using a whole genome sequencing approach or a whole exome sequencing approach.
As used hereafter, a whole genome sequencing assay refers to a physical assay that generates sequence reads for a whole genome or a substantial portion of the whole genome which can be used to determine large variations such as copy number variations or copy number aberrations. Such a physical assay may employ whole genome sequencing techniques or whole exome sequencing techniques.
As used hereafter, a methylation sequencing assay refers to a physical assay that generates sequence reads which can be used to determine the methylation status of a plurality of CpG sites, or methylation patterns, across the genome. An example of such a methylation sequencing assay can include the bisulfite treatment of cfDNA for conversion of unmethylated cytosines (e.g., CpG sites) to uracil (e.g., using EZ DNA Methylation-Gold or an EZ DNA Methylation-Lightning kit (available from Zymo Research Corp)). Alternatively, an enzymatic conversion step (e.g., using a cytosine deaminase (such as APOBEC-Seq (available from NEBiolabs))) may be used for conversion of unmethylated cytosines to uracils. Following conversion, the converted cfDNA molecules can be sequenced through a whole genome sequencing process or a targeted gene sequencing panel, and the sequence reads used to assess methylation status at a plurality of CpG sites. Methylation-based sequencing approaches are known in the art (e.g., see US 2014/0080715 and U.S. Ser. No. 16/352,602, which are incorporated herein by reference). In another embodiment, DNA methylation may occur in cytosines in other contexts, for example CHG and CHH, where H is adenine, cytosine, or thymine. Cytosine methylation in the form of 5-hydroxymethylcytosine may also be assessed (see, e.g., WO 2010/037001 and WO 2011/127136, which are incorporated herein by reference), and features thereof, using the methods and procedures disclosed herein. In some embodiments, a methylation sequencing assay need not perform a base conversion step to determine methylation status of CpG sites across the genome. For example, such methylation sequencing assays can include PacBio sequencing or Oxford Nanopore sequencing.
Each of the whole genome sequencing assay 132, small variant sequencing assay 134, and methylation sequencing assay 136 is performed on the cfDNA 115 to generate sequence reads of the cfDNA 115. In various embodiments, each of the whole genome sequencing assay 132, small variant sequencing assay 134, and methylation sequencing assay 136 is further performed on the WBC DNA 120 to generate sequence reads of the WBC DNA 120. The process steps performed in each of the whole genome sequencing assay 132, small variant sequencing assay 134, and methylation sequencing assay 136 are described in further detail in relation to
At step 106, the sequence reads generated as a result of performing the sequencing-based assay are processed to determine values for features. Features, generally, are types of information obtainable from physical assays and/or computational analyses that may be used in predicting cancer in an individual. Generally, any given predictive model for identifying cancer in an individual includes one or more features as constituent components of the model. For any given patient or sample, a feature will have a value that is determined from the physical and/or computational analysis. These values are input into the predictive model to generate an output of the model.
Sequence reads are processed by applying a computational analysis. Generally, each computational analysis 140 represents an algorithm that is executable by a processor of a computer, hereafter referred to as a processing system. Therefore, each computational analysis analyzes sequence reads and outputs values of features based on the sequence reads. Each computational analysis is specific for a given sequencing-based assay and therefore, each computational analysis outputs a particular type of feature that is specific for the sequencing-based assay.
As shown in
At step 108, a predictive cancer model is applied to the features to generate a cancer prediction for the individual 110. Examples of a cancer prediction include a presence or absence of cancer, a tissue of origin of cancer, a severity, stage, a grade of cancer, a cancer sub-type, a treatment decision, and a likelihood of response to a treatment. In various embodiments, the cancer prediction output by the predictive cancer model is a score, such as a likelihood or probability that indicates one or more of: a presence or absence of cancer, a tissue of origin of cancer, a severity, stage, a grade of cancer, a cancer sub-type, a treatment decision, and a likelihood of response to a treatment.
Generally, any such scores may be singular, such as the presence or absence of cancer generally, or the presence/absence of a particular type of cancer. Alternatively, such scores may be plural, such that the output of the predictive cancer model may, for example, be a score representing the presence/absence of each of a number of types of cancer, a score representing the severity/grade of each of a number of types of cancer, a score representing the likelihood that particular cfDNA originated in each of a number of types of tissue, and so on. For clarity of description, the output of the predictive cancer model is generally referred to as a set of scores, the set comprising one or more scores depending upon what the predictive cancer model is configured to determine.
The predictive cancer model can be differently structured based on the particular features of the predictive cancer model. For example, the predictive cancer model can include one, two, three, or four different types of features, such as the baseline features 150, whole genome features 152, small variant features 154, and methylation features 156. In some embodiments, there may be four separate predictive cancer models, each structured to include one type of feature. In some embodiments, the predictive cancer model is a two-stage model that includes a first set of sub-models that each include a type of feature and a second sub-model that analyzes the outputs of the first set of sub-models to determine a cancer prediction. Each particularly structured predictive cancer model is described hereafter in relation to a processing workflow that generates values of one or more types of features that the predictive cancer model receives. As used hereafter, a workflow process refers to the performance of the physical process analysis, computational analysis, and application of a predictive cancer model.
In various embodiments, values of features output from different computational analyses are input into a single predictive cancer model to generate a cancer prediction. For example, referring to
In various embodiments, a single workflow process is performed to generate a single cancer prediction 190 without a need for performing other workflow processes. Therefore, values of features output from a single computational analysis are input into a single predictive cancer model to generate a cancer prediction. For example, referring to
In various embodiments, a predictive cancer model (e.g., any one of predictive cancer models 170A-D) can generate a cancer prediction 190A-D based on values of two types of features (e.g., two features selected from baseline features 150, whole genome features 152, small variant features 154, and methylation features 156). In some embodiments, a predictive cancer model can generate a cancer prediction 190A-D based on values of three types of features (e.g., three features selected from baseline features 150, whole genome features 152, small variant features 154, and methylation features 156).
In various embodiments, a two-stage predictive cancer model is applied to the features to generate a cancer prediction. For example, at a first stage, the values of features output from each computational analysis are separately input into individual sub-models. At a second stage, the output of each individual sub-model is provided as input into an overall sub-model to generate a cancer prediction.
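By way of illustration only, the two-stage arrangement can be viewed as a stacked model: each first-stage sub-model scores one feature type, and the second-stage overall model combines those scores. The following sketch assumes feature values are available as numerical arrays and uses logistic regression for every stage; the function names and model choices are illustrative assumptions, not a required implementation.

```python
# Minimal sketch of a two-stage (stacked) predictive cancer model.
# Assumptions: each feature type is a per-sample numpy matrix, labels are binary
# (1 = cancer, 0 = non-cancer), and logistic regression stands in for any of the
# model types mentioned elsewhere in this disclosure.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_two_stage(feature_blocks, labels):
    """feature_blocks: dict mapping feature-type name -> (n_samples, n_features) array."""
    sub_models, stage1_scores = {}, []
    for name, X in feature_blocks.items():
        sub = LogisticRegression(max_iter=1000).fit(X, labels)    # first-stage sub-model
        sub_models[name] = sub
        stage1_scores.append(sub.predict_proba(X)[:, 1])          # one score per sample
    S = np.column_stack(stage1_scores)                            # sub-model scores as inputs
    overall = LogisticRegression(max_iter=1000).fit(S, labels)    # second-stage overall model
    return sub_models, overall

def predict_two_stage(sub_models, overall, feature_blocks):
    S = np.column_stack([sub_models[name].predict_proba(feature_blocks[name])[:, 1]
                         for name in sub_models])                 # keep training order
    return overall.predict_proba(S)[:, 1]                         # cancer prediction score
```

In practice the second stage would typically be fit on held-out (e.g., cross-validated) first-stage scores rather than in-sample scores to limit overfitting.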
Although
In various embodiments, the number of scores output by each of the predictive models 180A, 180B, 180C, and 180D may differ. For example, predictive model 180B may output one set of scores (hereafter referred to as a “WGS score”), predictive model 180C may output a set of two scores (hereafter referred to as “variant gene score” and “Order score”), and predictive model 180D may output a set of three scores (hereafter referred to as “MSUM score,” “WGBS score,” and “Binary score”).
In each of the different embodiments of the predictive cancer model (e.g., predictive cancer model 160 in
During training, training data is processed to generate values for features that are used to train the weights of the predictive cancer model. As an example, training data can include cfDNA and/or WBC DNA obtained from training samples, as well as an output label. For example, the output label can be an indication as to whether the individual is known to be cancerous or known to be devoid of cancer (e.g., healthy), an indication of a cancer tissue of origin, or an indication of a severity of the cancer. Depending on the particular embodiment shown in
The trained predictive cancer model can be stored and subsequently retrieved when needed, for example, during deployment in step 108 of
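A minimal sketch of the training step, assuming the feature values have already been computed by the computational analyses and that each training sample carries a binary output label, is shown below; the feature names and the gradient boosting choice are hypothetical.

```python
# Minimal sketch of assembling a training matrix from per-sample feature values and
# labels and fitting a predictive cancer model. Feature names are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

training_samples = [
    # (feature values from the computational analyses, output label: 1 = cancer, 0 = healthy)
    ({"total_somatic_variants": 7, "total_cnas": 3, "hypermethylated_regions": 41}, 1),
    ({"total_somatic_variants": 0, "total_cnas": 1, "hypermethylated_regions": 4}, 0),
    ({"total_somatic_variants": 4, "total_cnas": 2, "hypermethylated_regions": 22}, 1),
    ({"total_somatic_variants": 1, "total_cnas": 0, "hypermethylated_regions": 6}, 0),
]
feature_names = ["total_somatic_variants", "total_cnas", "hypermethylated_regions"]
X = np.array([[s[0][f] for f in feature_names] for s in training_samples], dtype=float)
y = np.array([label for _, label in training_samples])

model = GradientBoostingClassifier().fit(X, y)   # weights learned from the labeled data
# the fitted model can be serialized here and retrieved later at deployment (step 108)
```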
1.2. Physical Assays
Generally, various sub-combinations of the steps (e.g., steps 205-235) are performed for each of the whole genome sequencing assay, small variant sequencing assay, and methylation sequencing assay. Specifically, steps 205, 215, 230, and 235 are performed for the whole genome sequencing assay. Steps 205 and 215-235 are performed for the small variant sequencing assay. In some embodiments, each of steps 205-235 are performed for the methylation sequencing assay. For example, a methylation sequencing assay that employs a targeted gene panel bisulfite sequencing employs each of steps 205-235. In some embodiments, steps 205-215 and 230-235 are performed for the methylation sequencing assay. For example, a methylation sequencing assay that employs whole genome bisulfite sequencing need not perform steps 220 and 225.
At step 205, nucleic acids (DNA or RNA) are extracted from a test sample. In the present disclosure, DNA and RNA may be used interchangeably unless otherwise indicated. That is, the following embodiments for using error source information in variant calling and quality control may be applicable to both DNA and RNA types of nucleic acid sequences.
However, the examples described herein may focus on DNA for purposes of clarity and explanation. In various embodiments, DNA (e.g., cfDNA) is extracted from the test sample through a purification process. In general, any known method in the art can be used for purifying DNA. For example, nucleic acids can be isolated by pelleting and/or precipitating the nucleic acids in a tube. The extracted nucleic acids may include cfDNA or it may include gDNA, such as WBC DNA.
In step 210, the cfDNA fragments are treated to convert unmethylated cytosines to uracils. In one embodiment, the method uses a bisulfite treatment of the DNA which converts the unmethylated cytosines to uracils without converting the methylated cytosines. For example, a commercial kit such as the EZ DNA METHYLATION—Gold, EZ DNA METHYLATION—Direct or an EZ DNA METHYLATION—Lightning kit (available from Zymo Research Corp, Irvine, Calif.) is used for the bisulfite conversion. In another embodiment, the conversion of unmethylated cytosines to uracils is accomplished using an enzymatic reaction. For example, the conversion can use a commercially available kit for conversion of unmethylated cytosines to uracils, such as APOBEC-Seq (NEBiolabs, Ipswich, Mass.).
At step 215, a sequencing library is prepared. During library preparation, adapters that include, for example, one or more sequencing oligonucleotides for use in subsequent cluster generation and/or sequencing (e.g., known P5 and P7 sequences for use in sequencing by synthesis (SBS) (Illumina, San Diego, Calif.)) are ligated to the ends of the nucleic acid fragments through adapter ligation. In one embodiment, unique molecular identifiers (UMI) are added to the extracted nucleic acids during adapter ligation. The UMIs are short nucleic acid sequences (e.g., 4-10 base pairs) that are added to ends of nucleic acids during adapter ligation. In some embodiments, UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads obtained from nucleic acids. As described later, the UMIs can be further replicated along with the attached nucleic acids during amplification, which provides a way to identify sequence reads that originate from the same original nucleic acid segment in downstream analysis.
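As an illustration of how UMIs support downstream analysis, the sketch below groups reads by UMI and alignment position into families and forms a simple per-base consensus for each family; the record layout, equal read lengths, and majority-vote rule are simplifying assumptions.

```python
# Minimal sketch of collapsing sequence reads that share a UMI and alignment position
# (i.e., reads presumed to originate from the same original cfDNA molecule).
# Assumes each read dict carries 'umi', 'chrom', 'start', and 'sequence', and that
# reads in a family have equal length after trimming.
from collections import defaultdict, Counter

def collapse_by_umi(reads):
    families = defaultdict(list)
    for read in reads:
        families[(read["umi"], read["chrom"], read["start"])].append(read["sequence"])

    consensus = {}
    for key, seqs in families.items():
        # simple per-base majority vote across all reads in the UMI family
        consensus[key] = "".join(Counter(column).most_common(1)[0][0]
                                 for column in zip(*seqs))
    return consensus
```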
In step 220, hybridization probes are used to enrich a sequencing library for a selected set of nucleic acids. Hybridization probes can be designed to target and hybridize with targeted nucleic acid sequences to pull down and enrich targeted nucleic acid fragments that may be informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer type or tissue of origin). In accordance with this step, a plurality of hybridization pull down probes can be used for a given target sequence or gene. The probes can range in length from about 40 to about 160 base pairs (bp), from about 60 to about 120 bp, or from about 70 bp to about 100 bp. In one embodiment, the probes cover overlapping portions of the target region or gene. For targeted gene panel sequencing, the hybridization probes are designed to target and pull down nucleic acid fragments that derive from specific gene sequences that are included in the targeted gene panel. For whole exome sequencing, the hybridization probes are designed to target and pull down nucleic acid fragments that derive from exon sequences in a reference genome. As one of skill in the art would readily appreciate, other known means in the art for targeted enrichment of nucleic acids may be used.
After a hybridization step 220, the hybridized nucleic acid fragments are enriched 225. For example, the hybridized nucleic acid fragments can be captured and amplified using PCR. The target sequences can be enriched to obtain enriched sequences that can be subsequently sequenced. This improves the sequencing depth of sequence reads.
In step 230, the nucleic acids are sequenced to generate sequence reads. Sequence reads may be acquired by known means in the art. For example, a number of techniques and platforms obtain sequence reads directly from millions of individual nucleic acid (e.g., DNA such as cfDNA or gDNA) molecules in parallel. Such techniques can be suitable for performing any of targeted gene panel sequencing, whole exome sequencing, whole genome sequencing, targeted gene panel bisulfite sequencing, and whole genome bisulfite sequencing.
As a first example, sequencing-by-synthesis technologies rely on the detection of fluorescent nucleotides as they are incorporated into a nascent strand of DNA that is complementary to the template being sequenced. In one method, oligonucleotides 30-50 bases in length are covalently anchored at the 5′ end to glass cover slips. These anchored strands perform two functions. First, they act as capture sites for the target template strands if the templates are configured with capture tails complementary to the surface-bound oligonucleotides. They also act as primers for the template directed primer extension that forms the basis of the sequence reading. The capture primers function as a fixed position site for sequence determination using multiple cycles of synthesis, detection, and chemical cleavage of the dye-linker to remove the dye. Each cycle consists of adding the polymerase/labeled nucleotide mixture, rinsing, imaging and cleavage of dye.
In an alternative method, polymerase is modified with a fluorescent donor molecule and immobilized on a glass slide, while each nucleotide is color-coded with an acceptor fluorescent moiety attached to a gamma-phosphate. The system detects the interaction between a fluorescently-tagged polymerase and a fluorescently modified nucleotide as the nucleotide becomes incorporated into the de novo chain.
Any suitable sequencing-by-synthesis platform can be used to identify mutations. Sequencing-by-synthesis platforms include the Genome Sequencers from Roche/454 Life Sciences, the GENOME ANALYZER from Illumina/SOLEXA, the SOLID system from Applied BioSystems, and the HELISCOPE system from Helicos Biosciences. Sequencing-by-synthesis platforms have also been described by Pacific BioSciences and VisiGen Biotechnologies. In some embodiments, a plurality of nucleic acid molecules being sequenced is bound to a support (e.g., solid support). To immobilize the nucleic acid on a support, a capture sequence/universal priming site can be added at the 3′ and/or 5′ end of the template. The nucleic acids can be bound to the support by hybridizing the capture sequence to a complementary sequence covalently attached to the support. The capture sequence (also referred to as a universal capture sequence) is a nucleic acid sequence complementary to a sequence attached to a support that may dually serve as a universal primer.
As an alternative to a capture sequence, a member of a coupling pair (such as, e.g., antibody/antigen, receptor/ligand, or the avidin-biotin pair) can be linked to each fragment to be captured on a surface coated with a respective second member of that coupling pair. Subsequent to the capture, the sequence can be analyzed, for example, by single molecule detection/sequencing, including template-dependent sequencing-by-synthesis. In sequencing-by-synthesis, the surface-bound molecule is exposed to a plurality of labeled nucleotide triphosphates in the presence of polymerase. The sequence of the template is determined by the order of labeled nucleotides incorporated into the 3′ end of the growing chain. This can be done in real time or can be done in a step-and-repeat mode. For real-time analysis, different optical labels to each nucleotide can be incorporated and multiple lasers can be utilized for stimulation of incorporated nucleotides.
Massively parallel sequencing or next generation sequencing (NGS) techniques include sequencing-by-synthesis technology, pyrosequencing, ion semiconductor technology, single-molecule real-time sequencing, sequencing by ligation, nanopore sequencing, or paired-end sequencing. Examples of massively parallel sequencing platforms are the Illumina HISEQ or MISEQ, ION PERSONAL GENOME MACHINE, the PACBIO RSII sequencer or SEQUEL System, Qiagen's GENEREADER, and the Oxford MINION. Additional similar current massively parallel sequencing technologies can be used, as well as future generations of these technologies.
At step 235, the sequence reads may be aligned to a reference genome using known methods in the art to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene.
In various embodiments, a sequence read is comprised of a read pair denoted as R1 and R2. For example, the first read R1 may be sequenced from a first end of a nucleic acid fragment whereas the second read R2 may be sequenced from the second end of the nucleic acid fragment. Therefore, nucleotide base pairs of the first read R1 and second read R2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome. Alignment position information derived from the read pair R1 and R2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R2). In other words, the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds. An output file having SAM (sequence alignment map) format or BAM (binary alignment map) format may be generated and output for further analysis.
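For illustration, alignment position information of this kind can be collected from a coordinate-sorted, indexed BAM file. The sketch below assumes the pysam library is available; the file path and region are hypothetical.

```python
# Minimal sketch of extracting alignment position information (fragment start/end in
# the reference genome) from paired-end reads in a BAM file. Assumes pysam is
# installed and the BAM is coordinate-sorted and indexed; paths/regions are hypothetical.
import pysam

def fragment_positions(bam_path, contig, start, stop):
    positions = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, stop):
            if read.is_unmapped or read.mate_is_unmapped or not read.is_proper_pair:
                continue
            if read.reference_end is None:
                continue
            # reference_start is 0-based inclusive; reference_end is exclusive
            rec = positions.setdefault(read.query_name, [None, None])
            idx = 0 if read.is_read1 else 1
            rec[idx] = (read.reference_start, read.reference_end)

    # fragment spans from the leftmost start to the rightmost end of R1/R2
    spans = {}
    for name, (r1, r2) in positions.items():
        if r1 and r2:
            begin, end = min(r1[0], r2[0]), max(r1[1], r2[1])
            spans[name] = (begin, end, end - begin)   # includes inferred fragment length
    return spans
```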
Following step 235, the aligned sequence reads are processed using a computational analysis, such as computational analysis 140B, 140C, or 140D as described above and shown in
2.1. Small Variant Features
The small variant computational analysis 140C receives sequence reads generated by the small variant sequencing assay 134 and determines values of small variant features 154 based on the sequence reads. As previously described, the small variant sequencing assay may be a sequencing-based assay that generates sequence reads, typically through targeted gene sequencing panels that can be used to determine small variants, examples of which include single nucleotide variants (SNVs) and/or insertions or deletions. Alternatively, as one of skill in the art would appreciate, assessment of small variants may also be done using a whole genome sequencing approach or a whole exome sequencing approach. Examples of small variant features 154 include any of: a total number of somatic variants in a subject's cfDNA, a total number of non-synonymous variants, total number of synonymous variants, a presence/absence of somatic variants per gene in a gene panel, a presence/absence of somatic variants for particular genes that are known to be associated with cancer, an allele frequency (AF) of a somatic variant per gene in a gene panel, a maximum AF of a somatic variant per gene in a gene panel, an AF of a somatic variant per category as designated by a publicly available database, such as oncoKB, and a ranked order of somatic variants according to the AF of somatic variants.
Generally, the feature values for the small variant features 154 are predicated on the accurate identification of somatic variants that may be indicative of cancer in the individual. The small variant computational analysis 140C identifies candidate variants and, from amongst the candidate variants, differentiates between somatic variants likely present in the genome of the individual and false positive variants that are unlikely to be predictive of cancer in the individual. More specifically, the small variant computational analysis 140C identifies candidate variants present in cfDNA that are likely to be derived from a somatic source in view of interfering signals such as noise and/or variants that can be attributed to a genomic source (e.g., from gDNA or WBC DNA). Additionally, candidate variants can be filtered to remove false positive variants that may arise due to an artifact and therefore are not indicative of cancer in the individual. As an example, false positive variants may be variants detected at or near the edge of sequence reads, which arise due to spontaneous cytosine deamination and end repair errors. Thus, somatic variants, and features thereof, that remain following the filtering out of false positive variants can be used to determine the small variant features.
For the feature of the total number of somatic variants, the small variant computational analysis 140C totals the identified somatic variants across the genome, or gene panel. Thus, for a cfDNA sample obtained from an individual, the feature of the total number of somatic variants is represented as a single, numerical value of the total number of somatic variants identified in the cfDNA of the sample.
For the feature of the total number of nonsynonymous variants, the small variant computational analysis 140C may further filter the identified somatic variants to identify the somatic variants that are nonsynonymous variants. As is well known in the art, a non-synonymous variant of a nucleic acid sequence results in a change in the amino acid sequence of a protein associated with the nucleic acid sequence. For instance, non-synonymous variants may alter one or more phenotypes of an individual or cause the individual to develop (or leave the individual more vulnerable to developing) cancer, cancerous cells, or other types of diseases. Therefore, the small variant computational analysis 140C determines that a candidate variant would result in a non-synonymous variant by determining that a modification to one or more nucleobases of a trinucleotide would cause a different amino acid to be produced based on the modified trinucleotide. A feature value for the total number of nonsynonymous variants is determined by summing the identified nonsynonymous variants across the genome. Thus, for a cfDNA sample obtained from an individual, the feature of the total number of nonsynonymous variants is represented as a single, numerical value.
For the feature of the total number of synonymous variants, synonymous variants represent other somatic variants that are not categorized as nonsynonymous variants. In other words, the small variant computational analysis 140C can perform the filtering of identified somatic variants, as described in relation to nonsynonymous variants, and identify the synonymous variants across the genome, or gene panel. Thus, for a cfDNA sample obtained from an individual, the feature of the total number of synonymous variants is represented as a single numerical value.
The feature of a presence/absence of somatic variants per gene can involve multiple feature values for a cfDNA sample. For example, a targeted gene panel may include 500 genes in the panel and therefore, the small variant computational analysis 140C can generate 500 feature values, each feature value representing either a presence or absence of somatic variants for a gene in the panel. As an example, if a somatic variant is present in the gene, then the value of the feature is 1. Conversely, if a somatic variant is not present in the gene, then the value of the feature is 0. In general, any size gene panel may be used. For example, the gene panel may comprise 100, 200, 500, 1000, 2000, 10,000 or more gene targets across the genome. In other embodiments, the gene panel may comprise from about 50 to about 10,000 gene targets, from about 100 to about 2,000 gene targets, or from about 200 to about 1,000 gene targets.
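As a concrete illustration of the counting features and the per-gene presence/absence feature described above, the sketch below converts a list of called somatic variants into feature values; the variant record fields and the three-gene panel are hypothetical.

```python
# Minimal sketch of deriving small variant feature values from called somatic variants.
# Each variant record is assumed to carry 'gene', 'consequence', and 'af' fields;
# the three-gene panel is hypothetical.
GENE_PANEL = ["TP53", "LRP1B", "KRAS"]

def small_variant_counts(somatic_variants):
    total = len(somatic_variants)
    nonsyn = sum(1 for v in somatic_variants if v["consequence"] == "nonsynonymous")
    syn = total - nonsyn                       # remaining somatic variants
    presence = {g: 0 for g in GENE_PANEL}      # one binary feature value per panel gene
    for v in somatic_variants:
        if v["gene"] in presence:
            presence[v["gene"]] = 1
    return {"total_somatic": total, "total_nonsynonymous": nonsyn,
            "total_synonymous": syn,
            **{f"has_variant_{g}": p for g, p in presence.items()}}

variants = [{"gene": "TP53", "consequence": "nonsynonymous", "af": 0.08},
            {"gene": "KRAS", "consequence": "synonymous", "af": 0.02}]
print(small_variant_counts(variants))
```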
For the feature of presence/absence of somatic variants for particular genes that are known to be associated with cancer, the particular genes known to be associated with cancer can be accessed from a public database such as OncoKB. Examples of genes known to be associated with cancer include p53, LRP1B, and KRAS. Each gene known to be associated with cancer can be associated with a feature value, such as a 1 (indicating that a somatic variant is present in the gene) or a 0 (indicating that a somatic variant is not present in the gene).
The AF of a somatic variant per gene (e.g., in a gene panel) refers to the frequency of one or more somatic variants in the sequence reads. Generally, this feature is represented by one feature value per gene of a gene panel or per gene across the genome. The value of this feature can be a statistical value of AFs of somatic variants of the gene. In various embodiments, this feature refers to the one somatic variant in the gene with the maximum AF. In some embodiments, this feature refers to the average AF of somatic variants of the gene. Therefore, for a targeted gene panel of 500 genes, there are 500 feature values that represent the AF of a somatic variant per gene (e.g., in a gene panel).
Another feature is the AF of a somatic variant per category, as designated by a publicly available database such as oncoKB. For example, oncoKB categorizes genes in one of four different categories. In one embodiment, the AF of a somatic variant per category is a maximum AF of a somatic variant across the genes in the category. In one embodiment, the AF of a somatic variant per category is a mean AF across somatic variants across the genes in the category.
The ranked order of somatic variants according to the AF of somatic variants refers to the top N allele frequencies of somatic variants. In general, the value of a variant allele frequency can be between 0 and 1, where a variant allele frequency of 0 indicates no sequence reads that possess the alternate allele at the position and where a variant allele frequency of 1 indicates that all sequence reads possess the alternate allele at the position. In other embodiments, other ranges and/or values of variant allele frequencies can be used. In various embodiments, the ranked order feature is independent of the somatic variants themselves and instead, is only represented by the values of the top N variant allele frequencies. An example of the ranked order feature for the top 5 allele frequencies can be represented as: [0.1, 0.08, 0.05, 0.03, 0.02], which indicates that the 5 highest allele frequencies, independent of the somatic variants, range from 0.02 up to 0.1.
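The per-gene maximum AF and ranked-order (top-N AF) features can be computed as in the sketch below; the variant record layout, the zero-padding, and the N = 5 default are assumptions made for illustration.

```python
# Minimal sketch of AF-based small variant features: the maximum AF per gene and
# the ranked order of the top-N allele frequencies across all somatic variants.
def af_features(somatic_variants, panel_genes, top_n=5):
    max_af_per_gene = {g: 0.0 for g in panel_genes}
    for v in somatic_variants:
        if v["gene"] in max_af_per_gene:
            max_af_per_gene[v["gene"]] = max(max_af_per_gene[v["gene"]], v["af"])

    # top-N AFs, independent of which somatic variants they came from,
    # padded with zeros so the feature vector has a fixed length
    top_afs = sorted((v["af"] for v in somatic_variants), reverse=True)[:top_n]
    top_afs += [0.0] * (top_n - len(top_afs))
    return max_af_per_gene, top_afs
```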
2.2. Small Variant Computational Analysis Process Overview
A processing system, such as a processor of a computer, executes the code for performing the small variant computational analysis 140C.
At step 305B, the processing system stitches the collapsed reads based on the corresponding alignment position information. In some embodiments, the processing system compares alignment position information between a first sequence read and a second sequence read to determine whether nucleotide base pairs of the first and second sequence reads overlap in the reference genome. In one use case, responsive to determining that an overlap (e.g., of a given number of nucleotide bases) between the first and second sequence reads is greater than a threshold length (e.g., threshold number of nucleotide bases), the processing system designates the first and second sequence reads as “stitched”; otherwise, the collapsed reads are designated “unstitched.” In some embodiments, a first and second sequence read are stitched if the overlap is greater than the threshold length and if the overlap is not a sliding overlap. For example, a sliding overlap may include a homopolymer run (e.g., a single repeating nucleotide base), a dinucleotide run (e.g., two-nucleotide base sequence), or a trinucleotide run (e.g., three-nucleotide base sequence), where the homopolymer run, dinucleotide run, or trinucleotide run has at least a threshold length of base pairs.
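The stitching decision can be illustrated with the short sketch below, which compares the aligned reference intervals of two collapsed reads against a minimum-overlap threshold; the interval representation and threshold value are assumptions, and the sliding-overlap (homopolymer/dinucleotide/trinucleotide run) check is omitted for brevity.

```python
# Minimal sketch of the stitching decision: two collapsed reads are designated
# "stitched" when their aligned reference intervals overlap by at least a
# threshold number of bases. Intervals are (start, end) with end exclusive.
def is_stitched(read1_interval, read2_interval, min_overlap=10):
    start = max(read1_interval[0], read2_interval[0])
    end = min(read1_interval[1], read2_interval[1])
    return (end - start) >= min_overlap      # number of overlapping bases

# e.g. reads aligned to [100, 160) and [150, 210) overlap by 10 bases
print(is_stitched((100, 160), (150, 210)))   # True with min_overlap=10
```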
At step 305C, the processing system assembles reads into paths. In some embodiments, the processing system assembles reads to generate a directed graph, for example, a de Bruijn graph, for a target region (e.g., a gene). Unidirectional edges of the directed graph represent sequences of k nucleotide bases (also referred to herein as “k-mers”) in the target region, and the edges are connected by vertices (or nodes). The processing system aligns collapsed reads to a directed graph such that any of the collapsed reads may be represented in order by a subset of the edges and corresponding vertices.
In some embodiments, the processing system determines sets of parameters describing directed graphs and processes directed graphs. Additionally, the set of parameters may include a count of successfully aligned k-mers from collapsed reads to a k-mer represented by a node or edge in the directed graph. The processing system stores directed graphs and corresponding sets of parameters, which may be retrieved to update graphs or generate new graphs. For instance, the processing system may generate a compressed version of a directed graph (e.g., or modify an existing graph) based on the set of parameters. In one use case, in order to filter out data of a directed graph having lower levels of importance, the processing system removes (e.g., “trims” or “prunes”) nodes or edges having a count less than a threshold value, and maintains nodes or edges having counts greater than or equal to the threshold value.
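As a simplified illustration of the assembly and pruning described above, the sketch below counts k-mer edges observed in collapsed reads and removes edges supported by fewer than a threshold number of k-mers; full graph traversal and alignment bookkeeping are omitted, and k and the threshold are illustrative values.

```python
# Minimal sketch of a de Bruijn-style assembly step: count (k-1)-mer -> (k-1)-mer
# edges (each edge corresponds to a k-mer observed in the reads), then prune
# low-count edges. k and the pruning threshold are illustrative values only.
from collections import Counter

def build_pruned_graph(collapsed_reads, k=5, min_count=2):
    edge_counts = Counter()
    for read in collapsed_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edge_counts[(kmer[:-1], kmer[1:])] += 1   # edge between two (k-1)-mers

    # prune: keep only edges supported by at least min_count aligned k-mers
    return {edge: n for edge, n in edge_counts.items() if n >= min_count}
```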
At step 305D, the processing system generates candidate variants from the assembled paths. In one embodiment, the processing system generates the candidate variants by comparing a directed graph (which may have been compressed by pruning edges or nodes in step 305C) to a reference sequence of a target region of a genome. The processing system may align edges of the directed graph to the reference sequence, and record the genomic positions of mismatched edges and mismatched nucleotide bases adjacent to the edges as the locations of candidate variants. In some embodiments, the genomic positions of mismatched edges and mismatched nucleotide bases to the left and right of edges are recorded as the locations of called variants. Additionally, the processing system may generate candidate variants based on the sequencing depth of a target region. In particular, the processing system may be more confident in identifying variants in target regions that have greater sequencing depth, for example, because a greater number of sequence reads help to resolve (e.g., using redundancies) mismatches or other base pair variations between sequences.
In one embodiment, the processing system generates candidate variants using a model to determine expected noise rates for sequence reads from a subject. The model may be a Bayesian hierarchical model, though in some embodiments, the processing system uses one or more different types of models. Moreover, a Bayesian hierarchical model may be one of many possible model architectures that may be used to generate candidate variants and which are related to each other in that they all model position-specific noise information in order to improve the sensitivity/specificity of variant calling. More specifically, the processing system trains the model using samples from healthy individuals to model the expected noise rates per position of sequence reads.
Further, multiple different models may be stored in a database or retrieved for application post-training. For example, a first model is trained to model SNV noise rates and a second model is trained to model insertion deletion noise rates. Further, the processing system may use parameters of the model to determine a likelihood of one or more true positives in a sequence read. The processing system may determine a quality score (e.g., on a logarithmic scale) based on the likelihood. For example, the quality score is a Phred quality score Q=−10·log10 P, where P is the likelihood of an incorrect candidate variant call (e.g., a false positive). Other models, such as a joint model, may use output of one or more Bayesian hierarchical models to determine expected noise of nucleotide mutations in sequence reads of different samples.
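The Phred-scaled quality score mentioned above is a direct transformation of the likelihood of an incorrect call, as in the short example below.

```python
# Phred-scaled quality from the likelihood P of an incorrect candidate variant call:
# Q = -10 * log10(P). For example, P = 0.001 gives Q = 30.
import math

def phred_quality(p_incorrect_call):
    return -10.0 * math.log10(p_incorrect_call)

print(phred_quality(0.001))   # 30.0
```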
At step 305E, the processing system filters the candidate variants using one or more types of models or filters. For example, the processing system scores the candidate variants using a joint model, edge variant prediction model, or corresponding likelihoods of true positives or quality scores.
Reference is now made to
At step 310C, the processing system filters the candidate variants using edge filtering. In particular, the processing system may apply an edge filter that predicts the likelihood that a candidate variant is a false positive, also referred to as an edge variant, in view of the variants observed in the sample, e.g., a cfDNA sample. The term edge variant refers to a mutation located near an edge of a sequence read, for example, within a threshold distance of nucleotide bases from the edge of the sequence read. Specifically, given the collection of candidate variants for a sample, the edge filter may perform a likelihood estimation to determine a predicted rate of edge variants in the sample. Given certain conditions of the sample, the predicted rate may best explain the observed collection of candidate variants for the sample in view of two distributions. One distribution describes features of known edge variants whereas another trained distribution describes features of known non-edge variants. The predicted rate is a sample-specific parameter that controls how aggressively the sample is analyzed to identify and filter edge variants from the sample. Edge variants of the sample are filtered and removed, leaving non-edge variants for subsequent consideration. The term non-edge variant refers to a candidate variant that is not determined to be resulting from an artifact process, e.g., using an edge variant filtering method described herein. In some scenarios, a non-edge variant may not be a true variant (e.g., mutation in the genome) as the non-edge variant could arise due to a different reason as opposed to one or more artifact processes.
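For illustration only, the likelihood estimation above can be sketched as a one-parameter mixture fit: each candidate variant is summarized by its likelihood under the trained edge-variant distribution and under the trained non-edge distribution, and the sample-specific edge-variant rate is the mixture weight that maximizes the total log-likelihood. The grid search and the 0.5 posterior cutoff below are assumptions.

```python
# Minimal sketch of estimating a sample-specific edge-variant rate as the mixture
# weight that best explains the observed candidate variants. Each candidate is
# summarized by two precomputed likelihoods: its density under the trained
# edge-variant distribution and under the trained non-edge distribution (both assumed).
import numpy as np

def estimate_edge_rate(p_edge, p_nonedge, grid=np.linspace(0.001, 0.999, 999)):
    """p_edge, p_nonedge: arrays of per-candidate likelihoods under each distribution."""
    p_edge, p_nonedge = np.asarray(p_edge), np.asarray(p_nonedge)
    log_lik = [np.sum(np.log(rate * p_edge + (1 - rate) * p_nonedge)) for rate in grid]
    return grid[int(np.argmax(log_lik))]      # predicted rate of edge variants

def is_likely_edge_variant(p_e, p_ne, rate):
    # posterior probability that a single candidate is an edge variant at this rate
    return (rate * p_e) / (rate * p_e + (1 - rate) * p_ne) > 0.5
```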
Returning to
At step 305G, optionally, the small variant features derived from the somatic variants can be used to generate a cancer prediction. For example, a prediction model, such as predictive cancer model 170C shown in
2.2.1. Example Noise Models
The probability mass functions (PMFs) illustrated in
Using the example of
$$z_p \sim \mathrm{Multinom}(\vec{\theta})$$
Together, the latent variable $z_p$, the vector of mixture components $\vec{\theta}$, $\alpha$, and $\beta$ allow the model for $\mu$, that is, a sub-model of the Bayesian hierarchical model, to have parameters that “pool” knowledge about noise, that is they represent similarity in noise characteristics across multiple positions. Thus, positions of sequence reads may be pooled or grouped into latent classes by the model. Also advantageously, samples of any of these “pooled” positions can help train these shared parameters. A benefit of this is that the processing system may determine a model of noise in healthy samples even if there is little to no direct evidence of alternative alleles having been observed for a given position previously (e.g., in the healthy tissue samples used to train the model).
The covariate $x_p$ (e.g., a predictor) encodes known contextual information regarding position p which may include, but is not limited to, information such as trinucleotide context, mappability, segmental duplication, or other information associated with sequence reads. Trinucleotide context may be based on a reference allele and may be assigned a numerical (e.g., integer) representation. For instance, “AAA” is assigned 1, “ACA” is assigned 2, “AGA” is assigned 3, etc. Mappability represents a level of uniqueness of alignment of a read to a particular target region of a genome. For example, mappability is calculated as the inverse of the number of position(s) where the sequence read will uniquely map. Segmental duplications correspond to long nucleic acid sequences (e.g., having a length greater than approximately 1000 base pairs) that are nearly identical (e.g., greater than 90% match) and occur in multiple locations in a genome as a result of natural duplication events (e.g., not associated with a cancer or disease).
The expected mean AD count of a SNV at position p is modeled by the parameter μp. For sake of clarity in this description, the terms μp and yp refer to the position specific sub-models of the Bayesian hierarchical model. In one embodiment, μp is modeled as a Gamma-distributed random variable having shape parameter α_{z_p} and rate parameter β_{z_p}, both indexed by the latent class zp:
μ_p ~ Gamma(α_{z_p}, β_{z_p})
In other embodiments, other functions may be used to represent μp, examples of which include but are not limited to: a log-normal distribution with log-mean γ_{z_p}.
In the illustrated example, the observed AD count of a SNV at position p in a healthy training sample i, given the sequencing depth d_ip at the position, is modeled by the distribution y_ip:
y_ip | d_ip ~ Poisson(d_ip · μ_p)
In other embodiments, other functions may be used to represent y_ip, examples of which include but are not limited to: a negative binomial, Conway-Maxwell-Poisson distribution, zeta distribution, or zero-inflated Poisson distribution.
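The following Python sketch simulates draws from a hierarchical SNV noise model of this general shape (latent classes, Gamma rates, and Poisson observations). The class count, mixture weights, and Gamma parameters are invented for the example and are not values from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 3 latent noise classes, 1,000 positions, 20 healthy samples.
n_classes, n_positions, n_samples = 3, 1000, 20

theta = np.array([0.7, 0.2, 0.1])           # mixture weights over latent classes
alpha = np.array([0.5, 2.0, 8.0])           # per-class Gamma shape parameters
beta = np.array([5000.0, 5000.0, 5000.0])   # per-class Gamma rate parameters

# z_p ~ Multinom(theta): each position is assigned a latent noise class, so
# positions in the same class share (pool) the same alpha/beta parameters.
z = rng.choice(n_classes, size=n_positions, p=theta)

# mu_p ~ Gamma(alpha_{z_p}, beta_{z_p}): per-position noise rate.
mu = rng.gamma(shape=alpha[z], scale=1.0 / beta[z])

# y_ip | d_ip ~ Poisson(d_ip * mu_p): alternate-allele (AD) counts in healthy
# samples given the per-sample, per-position sequencing depths d_ip.
depth = rng.integers(1000, 3000, size=(n_samples, n_positions))
ad = rng.poisson(depth * mu)
print(ad.shape, ad.mean())
```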
The expected mean total indel count at position p is modeled by the distribution μp. In some embodiments, the distribution is based on the covariate xp and has a Gamma distribution having shape parameter α_{x_p} and rate parameter β_{x_p}:
μ_p ~ Gamma(α_{x_p}, β_{x_p})
In other embodiments, other functions may be used to represent μp, examples of which include but are not limited to: negative binomial, Conway-Maxwell-Poisson distribution, zeta distribution, and zero-inflated Poisson.
The observed indels at position p in a human population sample i (of a healthy individual) are modeled by the distribution y_ip, given the sequencing depth d_ip at the position:
y_ip | d_ip ~ Poisson(d_ip · μ_p)
In other embodiments, other functions may be used to represent y_ip, examples of which include but are not limited to: a negative binomial, Conway-Maxwell-Poisson distribution, zeta distribution, or zero-inflated Poisson distribution.
Due to the fact that indels may be of varying lengths, an additional length parameter is present in the indel model that is not present in the model for SNVs. As a consequence, the example indel model also models the vector of per-length indel counts y⃗_ip at position p in sample i with a multinomial distribution parameterized by the total indel count y_ip and a per-position length distribution ϕ⃗_p:
y⃗_ip | y_ip, ϕ⃗_p ~ Multinom(y_ip, ϕ⃗_p)
In other embodiments, a Dirichlet-Multinomial function or other types of models may be used to represent y⃗_ip.
By architecting the model in this manner, the processing system may decouple learning of indel intensity (i.e., noise rate) from learning of indel length distribution. Independently inferring an expectation of whether an indel will occur in healthy samples and an expectation of the indel length at a position may improve the sensitivity of the model. For example, the length distribution may be more stable relative to the indel intensity at a number of positions or regions in the genome, or vice versa.
In one embodiment, the processing system performs model fitting by storing draws of μp, the expected mean counts of AF per position and per sample. The model is trained or fitted through posterior sampling, as previously described. In an embodiment, the draws of μp are stored in a matrix data structure having a row per position of the set of positions sampled and a column per draw from the joint posterior (e.g., of all parameters conditional on the observed data). The number of rows R may be greater than 6 million and the number of columns for N iterations of samples may be in the thousands. In other embodiments, the row and column designations are different than the embodiment shown in
The dispersion parameter rp at each position is determined from mp and vp, the mean and variance of the sampled values of μp at the position, respectively. Those of skill in the art will appreciate that other functions for determining rp may also be used, such as a maximum likelihood estimate.
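A minimal sketch of this reduction step is shown below. The method-of-moments form used for the dispersion (matching a Gamma distribution with shape r_p and mean m_p to the posterior draws, so that r_p = m_p² / v_p) is an assumption made for illustration; the disclosure notes a maximum likelihood estimate could be used instead.

```python
import numpy as np

def reduce_posterior_draws(draws: np.ndarray):
    """Collapse a (positions x draws) matrix of sampled mu_p values into
    per-position summaries.

    Returns the per-position mean m_p, variance v_p, and a dispersion r_p
    obtained by method of moments (r_p = m_p**2 / v_p); this particular
    formula is an assumption, not the disclosure's stated expression.
    """
    m = draws.mean(axis=1)
    v = draws.var(axis=1)
    r = m**2 / np.maximum(v, 1e-30)   # guard against zero-variance positions
    return m, v, r

# Illustrative call with 4 positions and 1,000 posterior draws.
rng = np.random.default_rng(1)
print(reduce_posterior_draws(rng.gamma(2.0, 5e-5, size=(4, 1000)))[2])
```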
The processing system may also perform dispersion re-estimation of the dispersion parameters in the reduced matrix, given the mean parameters. In one embodiment, following Bayesian training and posterior approximation, the processing system performs dispersion re-estimation by retraining for the dispersion parameters based on a negative binomial maximum likelihood estimator per position. The mean parameter may remain fixed during retraining. In one embodiment, the processing system determines the dispersion parameters r′p at each position for the original AD counts of the training data (e.g., y_ip).
During application of trained models, the processing system may access the dispersion (e.g., shape) parameters r̃p and mean parameters mp to determine a function parameterized by r̃p and mp. The function may be used to determine a posterior predictive probability mass function (or probability density function) for a new sample of a subject. Based on the predicted probability of a certain AD count at a given position, the processing system may account for site-specific noise rates per position of sequence reads when detecting true positives from samples.
Referring back to the example use case described with respect to
2.3. Example Process Flows for Noise Models
The Bayesian hierarchical model may update parameters simultaneously for multiple (or all) positions included in the model. Additionally, the model may be trained to model expected noise for each ALT. For instance, a model for SNVs may perform a training process four or more times to update parameters (e.g., one-to-one substitutions) for mutations of each of the A, T, C, and G bases to each of the other three bases. In step 320C, the processing system stores parameters of the Bayesian hierarchical model (e.g., ensemble parameters output by the Markov Chain Monte Carlo method). In step 320D, the processing system approximates the noise distribution (e.g., represented by a dispersion parameter and a mean parameter) per position based on the parameters. In step 320E, the processing system performs dispersion re-estimation (e.g., maximum likelihood estimation) using original AD counts from the samples (e.g., training data) used to train the Bayesian hierarchical model.
The processing system obtains a dispersion parameter and a mean parameter, e.g., r̃p and mp, respectively, specific to the candidate variant, which may be based on the position p of the candidate variant. The parameters may be derived using a model, e.g., a Bayesian hierarchical model representing a posterior predictive distribution with an observed depth of a given sequence read and a mean parameter μp at the position p as input. In an embodiment, the mean parameter μp is a gamma distribution encoding a noise level of nucleotide mutations with respect to the position p for a training sample.
At step 330C, the processing system inputs read information (e.g., AD or AF) of the set of sequence reads into a function (e.g., based on a negative binomial) parameterized by the parameters, e.g., r̃p and mp. At step 330D, the processing system determines a score for the candidate variant (e.g., at the position p) using an output of the function based on the input read information. The score may indicate a likelihood of seeing an allele count for a given sample (e.g., from a subject) that is greater than or equal to a determined allele count of the candidate variant (e.g., determined by the model and output of the function). The processing system may convert the likelihood into a Phred-scaled score. In some embodiments, the processing system uses the likelihood to determine false positive mutations responsive to determining that the likelihood is less than a threshold value. In some embodiments, the processing system uses the function to determine that a sample of sequence reads includes at least a threshold count of alleles corresponding to a gene found in sequence reads from a tumor biopsy of an individual. In some embodiments, the processing system may perform weighting based on quality scores, use the candidate variants and quality scores for false discovery methods, annotate putative calls with quality scores, or provision to subsequent systems.
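As a rough sketch of this scoring step, the following Python example uses scipy's (n, p) parameterization of the negative binomial; the exact tail definition, floor value, and all numeric inputs are illustrative assumptions rather than values from the disclosure.

```python
import numpy as np
from scipy.stats import nbinom

def variant_quality_score(ad: int, depth: int, m_p: float, r_p: float) -> float:
    """Phred-scaled score for observing an AD at least this large under a
    position-specific negative binomial noise model with mean m_p * depth and
    dispersion (size) r_p."""
    mean = m_p * depth
    p = r_p / (r_p + mean)            # scipy's NB(n, p): mean = n * (1 - p) / p
    tail = nbinom.sf(ad - 1, r_p, p)  # P(X >= ad) under the noise-only model
    return float(-10.0 * np.log10(max(tail, 1e-300)))

# Illustrative call: AD of 8 at depth 2,000 with a noise rate of 1e-4.
print(variant_quality_score(ad=8, depth=2000, m_p=1e-4, r_p=5.0))
```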
The processing system may use functions encoding noise levels of nucleotide mutations with respect to a given training sample for downstream analysis. In some embodiments, the processing system uses the aforementioned negative binomial function parameterized by the dispersion and mean rate parameters r̃p and mp to determine expected noise for a particular nucleic acid position within a sample, e.g., cfDNA or gDNA. Moreover, the processing system may derive the parameters by training a Bayesian hierarchical model using training data associated with the particular nucleic acid sample. The embodiments below describe another type of model referred to herein as a joint model, which may use output of a Bayesian hierarchical model.
2.4. Example Joint Model
In step 340A, the processing system determines depths and ADs for the various positions of nucleic acids from the sequence reads obtained from a cfDNA sample of a subject. The cfDNA sample may be collected from a sample of blood plasma from the subject.
In step 340B, the processing system determines depths and ADs for the various positions of nucleic acids from the sequence reads obtained from a gDNA of the same subject. The gDNA may be collected from white blood cells or a tumor biopsy from the subject.
In step 340C, a joint model determines a likelihood of a “true” AF of the cfDNA sample of the subject by modeling the observed ADs for cfDNA. In one embodiment, the joint model uses a Poisson distribution function, parameterized by the depths observed from the sequence reads of cfDNA and the true AF of the cfDNA sample, to model the probability of observing a given AD in cfDNA of the subject (also shown in
P(AD_cfDNA | depth_cfDNA, AF_cfDNA) ~ Poisson(depth_cfDNA · AF_cfDNA) + noise_cfDNA
The noise component noisecfDNA is further described below. In other embodiments, other functions may be used to represent ADcfDNA, examples of which include but are not limited to: negative binomial, Conway-Maxwell-Poisson distribution, zeta distribution, and zero-inflated Poisson.
In step 340D, the joint model determines a likelihood of a “true” AF of the gDNA sample of the subject by modeling the observed ADs for gDNA. In one embodiment, the joint model uses a Poisson distribution function parameterized by the depths observed from the sequence reads of gDNA and the true AF of the gDNA sample to model the probability of observing a given AD in gDNA of the subject (also shown in
P(AD_gDNA | depth_gDNA, AF_gDNA) ~ Poisson(depth_gDNA · AF_gDNA) + noise_gDNA
The noise component noisegDNA is further described below. In other embodiments, other functions may be used to represent ADgDNA, examples of which include but are not limited to: negative binomial, Conway-Maxwell-Poisson distribution, zeta distribution, and zero-inflated Poisson.
Since the true AF of cfDNA, as well as the true AF of gDNA, are inherent properties of the biology of a particular subject, it may not necessarily be practical to determine an exact value of the true AF from either source. Moreover, various sources of noise also introduce uncertainty into the estimated values of the true AF. Accordingly, the joint model uses numerical approximation to determine the posterior distributions of true AF conditional on the observed data (e.g., depths and ADs) from a subject and corresponding noise parameters:
P(AF_cfDNA | depth_cfDNA, AD_cfDNA, r̃_cfDNA, m_cfDNA)
P(AF_gDNA | depth_gDNA, AD_gDNA, r̃_gDNA, m_gDNA)
The joint model determines the posterior distributions using Bayes' theorem with a prior, for example, a uniform distribution. The priors used for cfDNA and gDNA may be the same (e.g., a uniform distribution ranging from 0 to 1) and independent from each other.
In an embodiment, the joint model determines the posterior distribution of true AF of cfDNA using a likelihood function by varying the parameter, true AF of cfDNA, given a fixed set of observed data from the sample of cfDNA. Additionally, the joint model determines the posterior distribution of true AF of gDNA using another likelihood function by varying the parameter, true AF of gDNA, given a fixed set of observed data from the sample of gDNA. For both cfDNA and gDNA, the joint model numerically approximates the output posterior distribution by fitting a negative binomial (NB):
In an embodiment, the joint model performs numerical approximation using the following parameters for the negative binomial, which may provide an improvement in computational speed:
P(AF | depth, AD) ∝ NB(AD, size = r̂, μ = m̂ · depth)
where
m̂ = AF + m
r̂ = r̃ · m̂² / m²
and m and r̃ are the mean and dispersion noise parameters for the respective sample type.
Since the observed data is different between cfDNA and gDNA, the parameters determined for the negative binomial of cfDNA will vary from those determined for the negative binomial of gDNA.
In step 340E, the processing system determines, using the likelihoods, a probability that the true AF of the cfDNA sample is greater than a function of the true AF of the gDNA sample. The function may include one or more parameters, for example, empirically determined k and p values, described in additional detail with reference to
In an embodiment, the processing system determines that the posterior probability satisfies a chosen criteria based on the one or more parameters (e.g., k and p described below). The distributions of variants are conditionally independent given the sequences of the cfDNA and gDNA. That is, the processing system presumes that ALTs and noise present in one of the cfDNA or gDNA samples are not influenced by those of the other sample, and vice versa. Thus, the processing system considers the probabilities of the expected distributions of AD as independent events in determining the probability of observing both a certain true AF of cfDNA and a certain true AF of gDNA, given the observed data and noise parameters from both sources:
P(AF_cfDNA, AF_gDNA | depth, AD, r̃, m) ∝ P(AF_cfDNA | depth_cfDNA, AD_cfDNA, r̃_cfDNA, m_cfDNA) · P(AF_gDNA | depth_gDNA, AD_gDNA, r̃_gDNA, m_gDNA)
In the example 3D plot in
In an embodiment, the processing system determines that a given criteria is satisfied by the posterior probability by determining the portion of the joint likelihood that satisfies the given criteria. The given criteria may be based on the k and p parameters, where p represents a threshold probability for comparison. For example, the processing system determines the posterior probability that the true AF of cfDNA is greater than or equal to the true AF of gDNA multiplied by k, and whether the posterior probability is greater than p:
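One way to write this criterion and the cumulative-sum form used to evaluate it is shown below; this is a hedged reconstruction consistent with the description in the next paragraph, not necessarily the disclosure's exact notation.

```latex
P\big(\mathrm{AF}_{\mathrm{cfDNA}} \ge k \cdot \mathrm{AF}_{\mathrm{gDNA}} \mid \mathrm{data}\big)
  = \int_{0}^{1} \big[\,1 - F_{\mathrm{cfDNA}}\!\big(k \cdot \mathrm{AF}_{\mathrm{gDNA}}\big)\big]\,
    P\big(\mathrm{AF}_{\mathrm{gDNA}} \mid \mathrm{data}\big)\, d\mathrm{AF}_{\mathrm{gDNA}} \;>\; p
```

Here F_cfDNA denotes the cumulative distribution of the cfDNA posterior over the true AF.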
As shown in the above equations, the processing system determines a cumulative sum F_cfDNA of the likelihood of the true AF of cfDNA. Furthermore, the processing system integrates over the likelihood function of the true AF of gDNA. In another embodiment, the processing system may determine the cumulative sum for the likelihood of the true AF of gDNA and integrate over the likelihood function of the true AF of cfDNA. By calculating the cumulative sum of one of the two likelihoods (e.g., building a cumulative distribution function), instead of computing a double integral over both likelihoods for cfDNA and gDNA, the processing system reduces the computational resources (expressed in terms of compute time or other similar metrics) required to determine whether the joint likelihood satisfies the criteria and may also increase the precision of the calculation of the posterior probability.
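The following Python sketch illustrates the grid-based approximation and the cumulative-sum shortcut. The grid size, noise parameters, and function names are illustrative assumptions, and the uniform prior is treated implicitly by normalizing the likelihood over the grid.

```python
import numpy as np
from scipy.stats import nbinom

def joint_call_probability(ad_cf, depth_cf, ad_g, depth_g,
                           m_cf, r_cf, m_g, r_g, k=1.0, grid_size=2001):
    """Approximate P(AF_cfDNA >= k * AF_gDNA | observed data) on an AF grid."""
    af = np.linspace(0.0, 1.0, grid_size)

    def posterior(ad, depth, m, r):
        mean = (af + m) * depth            # shift the NB mean by the noise rate m
        p = r / (r + mean)
        like = nbinom.pmf(ad, r, p)
        return like / like.sum()           # normalize over the grid (uniform prior)

    post_cf = posterior(ad_cf, depth_cf, m_cf, r_cf)
    post_g = posterior(ad_g, depth_g, m_g, r_g)

    # Cumulative distribution of the cfDNA posterior, then integrate over gDNA:
    # P = sum_g [1 - F_cf(k * AF_g)] * post_g(AF_g)
    cdf_cf = np.cumsum(post_cf)
    idx = np.clip(np.searchsorted(af, k * af, side="right") - 1, 0, grid_size - 1)
    return float(np.sum((1.0 - cdf_cf[idx]) * post_g))

# Illustrative call: AD 10 of depth 2,000 in cfDNA vs AD 0 of depth 1,000 in gDNA.
print(joint_call_probability(10, 2000, 0, 1000, m_cf=1e-4, r_cf=5.0, m_g=1e-4, r_g=8.0))
```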
To account for noise in the estimated values of the true AF introduced by noise in the cfDNA and gDNA samples, the joint model may use other models (e.g., Bayesian hierarchical model) of the processing system previously described with respect to
In one example, the joint model uses a function parameterized by cfDNA-specific parameters to determine a noise level for the true AF of cfDNA. The cfDNA-specific parameters may be derived using a Bayesian hierarchical model trained with a set of cfDNA samples, e.g., from healthy individuals. In addition, the joint model uses another function parameterized by gDNA-specific parameters to determine a noise level for the true AF of gDNA. The gDNA-specific parameters may be derived using another Bayesian hierarchical model trained with a set of gDNA samples, e.g., from the same healthy individuals. In an embodiment, the functions are negative binomial functions having a mean parameter m and dispersion parameter r̃, and may also depend on the observed depths of sequence reads from the training samples:
noise_cfDNA = NB(m_cfDNA · depth_cfDNA, r̃_cfDNA)
noise_gDNA = NB(m_gDNA · depth_gDNA, r̃_gDNA)
In other embodiments, the processing system may use a different type of function and types of parameters for cfDNA and/or gDNA. Since the cfDNA-specific parameters and gDNA-specific parameters are derived using different sets of training data, the parameters may be different from each other and particular to the respective type of nucleic acid sample. For instance, cfDNA samples may have greater variation in AF than gDNA samples, and thus r̃_cfDNA may be greater than r̃_gDNA.
2.5. Examples for Joint Models
The example results shown in the following figures were determined by the processing system using one or more trained models, which may include joint models and Bayesian hierarchical models. For purposes of comparison, some example results were determined using an empirical threshold or a simple model and are referred to as “empirical threshold” examples and denoted as “threshold” in the figures; these example results were not obtained using one of the trained models. In various embodiments, the results were generated using one of a number of example targeted sequencing assays, including “targeted sequencing assay A” and “targeted sequencing assay B,” also referred to herein and in the figures as “Assay A” and “Assay B,” respectively.
In an example process performed for a targeted sequencing assay, two tubes of whole blood were drawn into Streck BCTs from healthy individuals (self-reported as having no cancer diagnosis). After plasma was separated from the whole blood, it was stored at −80° C. Upon assay processing, cfDNA was extracted and pooled from two tubes of plasma. Coriell genomic DNA (gDNA) was fragmented to a mean size of 180 base pairs (bp) and then size selected to a tighter distribution using magnetic beads. The library preparation protocol was optimized for low input cell free DNA (cfDNA) and sheared gDNA. Unique molecular identifiers (UMIs) were incorporated into the DNA molecules during adapter ligation. Flowcell clustering adapter sequences and dual sample indices were then incorporated at library preparation amplification with PCR. Libraries were enriched using a targeted capture panel.
Target DNA molecules were first captured using biotinylated single-stranded DNA hybridization probes and then enriched using magnetic streptavidin beads. Non-target molecules were removed using subsequent wash steps. The HiSeq X Reagent Kit v2.5 (Illumina; San Diego, Calif.) was used for flowcell clustering and sequencing. Four libraries per flowcell were multiplexed. Dual indexing primer mix was included to enable dual sample indexing reads. The read lengths were set to 150, 150, 8, and 8, respectively, for read 1, read 2, index read 1, and index read 2. The first 6 bases of read 1 and read 2 are the UMI sequences.
The selection of parameters may involve a tradeoff between a target sensitivity (e.g., adjusted using k and p) and target error (e.g., the upper confidence bound). For given pairs of k and p values, the corresponding mean number of false positives may be similar in value, though the sensitivity values may exhibit greater variance. In some embodiments, the sensitivity is measured using percent positive agreement (PPA) values for tumor, in contrast to PPA for cfDNA, which may be used as a measurement of specificity:
In the above equations, “tumor” represents the number of mean variant calls from a ctDNA sample using a set of parameters, and “cfDNA” represents the number of mean variant calls from the corresponding cfDNA sample using the same set of parameters.
In an embodiment, cross-validation is performed to estimate the expected fit of the joint model to sequence reads (for a given type of tissue) that are different from the sequence reads used to train the joint model. For example, the sequence reads may be obtained from tissues having lung, prostate, and breast cancer, etc. To avoid or reduce the extent of overfitting the joint model for any given type of cancer tissue, parameter values derived using samples of a set of types of cancer tissue are used to assess statistical results of other samples known to have a different type of cancer tissue. For instance, parameter values for lung and prostate cancer tissue are applied to a sample having breast cancer tissue. In some embodiments, one or more lowest k values from the lung and prostate cancer tissue data that maximizes the sensitivity is selected to be applied to the breast cancer sample. Parameter values may also be selected using other constraints such as a threshold deviation from a target mean number of false positives, or 95% UCB of at most 3 per sample. The processing system may cycle through multiple types of tissue to cross validate sets of cancer-specific parameters.
The joint model may alter k according to a hinge loss function or another function to guard against non-tumor or disease-related effects where a fixed value for k would not accurately capture and categorize those events. The hinge loss function example is particularly targeted at handling loss of heterozygosity (LOH) events. LOH events are germline events that occur when the copy of a gene inherited from one of an individual's parents is lost. LOH events may contribute to significant portions of observed AF of a gDNA sample. By capping the k values to the predetermined upper threshold of the hinge loss function, the joint model may achieve greater sensitivity for detecting true positives in most sequence reads while also controlling the number of false positives that would otherwise be flagged as true positives due to the presence of LOH. In other embodiments, k and p may be selected based on training data specific to a given application of interest, e.g., having a target population or sequencing assay.
In some embodiments, the joint model takes into account both the AF of a gDNA sample and a quality score of the gDNA sample to guard against underweighting low AF candidate variants. The quality score for a noise model may be used to estimate the probability of error on a Phred scale. Additionally, the joint model may use a modified piecewise function for the hinge function. For example, the piecewise function includes two or more additive components. One component is a linear function based on the AF of the gDNA sample, and another component is an exponential function based on the quality score of the gDNA sample. Given a quality score threshold and a maximum AF scaling factor kmax, the joint model determines, using the exponential component of the piecewise function:
In the above calculation, P(not error) is the probability that an allele of the gDNA sample is not an error, P(error) is the probability that the allele of the gDNA sample is an error, and P(error)min is a minimum probability of error. A minimum threshold for error rate may be empirically determined as the intersecting point for the quality score densities between the likely somatic and likely germline candidate variants of alleles of the gDNA sample.
2.5.1. Example Variant Calls of Joint Model
2.6. Example Tuning of Joint Model
In step 355A, the processing system determines a probability that the true alternate frequency of a cfDNA sample is greater than a function of a true alternate frequency of a gDNA sample. Step 355A may correspond to previously described step 340E of the method 335 shown in
In step 355B, the processing system determines whether the probability is less than a threshold probability. As an example, the threshold probability may be 0.8; however, in practice the threshold probability may be any value between 0.5 and 0.999 (e.g., determined based on a desired filtering stringency), may be static or dynamic, and may vary by gene, position, or other macro factors. Responsive to determining that the probability is greater than or equal to the threshold probability, the processing system determines that the candidate variant is likely not associated with the gDNA sample, such as a blood draw including white blood cells of a subject, i.e., not blood-derived. For example, the candidate variant is typically not present in sequence reads of the gDNA sample for a healthy individual. Accordingly, the processing system may call the candidate variant as a true positive that is potentially associated with cancer or disease, e.g., potentially tumor-derived.
In step 355C, the processing system determines whether the alternate depth of the gDNA sample is or is not significantly different from zero. For instance, the processing system performs an assessment using a quality score of the candidate variant determined using a noise model. The processing system may also compare the alternate depth against a threshold depth, e.g., determining whether the alternate depth is less than or equal to a threshold depth. As an example, the threshold depth may be 0 or 1 reads. Responsive to determining that the alternate depth of the gDNA sample is significantly different from zero, the processing system determines that there is positive evidence that the candidate variant is associated with nucleotide mutations not caused by cancer or disease. For instance, the candidate variant is blood-derived based on mutations that may typically occur in sequence reads of healthy white blood cells.
Responsive to determining that the alternate depth of the gDNA sample is not significantly nonzero, the processing system determines that the candidate variant is possibly associated with the gDNA sample, but does not make a determination of a source of the candidate variant. In other words, the processing system may be uncertain about whether the candidate variant is blood-derived or tumor-derived. In some embodiments, the processing system may select one of multiple threshold depths for comparison with the alternate depth. The selection may be based on a type of processed sample, noise level, confidence level, or other factors.
In step 355D, the processing system determines a gDNA depth quality score of sequence reads of the gDNA sample. In an embodiment, the processing system calculates the gDNA depth quality score using the alternate depth of the gDNA sample, where C is a predetermined constant (e.g., 2) to smooth the gDNA depth quality score using a weak prior, which avoids divide-by-zero computations:
In step 355E, the processing system determines a ratio of sequence reads of the gDNA sample. The ratio may represent the observed cfDNA frequency and observed gDNA frequency in the processed samples. In an embodiment, the processing system calculates the ratio using the depths and alternate depths of the cfDNA sample and gDNA sample:
The processing system may use the predetermined constants C1, C2, C3, and C4 to smooth the ratio by a weak prior. As examples, the constants may be: C1=2, C2=4, C3=2, and C4=4. Thus, the processing system may avoid a divide-by-zero computation if one of the depths or alternate depths in the ratio denominator equals zero. In addition, the processing system may use the predetermined constants to steer the ratio to a certain value, e.g., 1 or 0.5.
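Because the exact expression for the ratio is not reproduced in this excerpt, the following sketch shows only one plausible arrangement of the pseudocount constants; the function name and the placement of C1 through C4 are assumptions.

```python
def cf_to_g_frequency_ratio(ad_cf: int, depth_cf: int, ad_g: int, depth_g: int,
                            c1: float = 2.0, c2: float = 4.0,
                            c3: float = 2.0, c4: float = 4.0) -> float:
    """Hypothetical smoothed ratio of observed cfDNA frequency to observed gDNA
    frequency.  The pseudocounts pull each frequency toward C1/C2 = C3/C4 = 0.5
    and keep the denominator nonzero, consistent with the smoothing behavior
    described above; the exact formula in the disclosure may differ."""
    freq_cf = (ad_cf + c1) / (depth_cf + c2)
    freq_g = (ad_g + c3) / (depth_g + c4)
    return freq_cf / freq_g

# Illustrative call: 12 alternate reads of 2,000 in cfDNA vs 1 of 1,000 in gDNA.
print(cf_to_g_frequency_ratio(12, 2000, 1, 1000))
```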
In step 355F, the processing system determines whether the gDNA depth quality score is greater than or equal to a threshold score (e.g., 1) and whether the ratio is less than a threshold ratio (e.g., 6). Responsive to determining that the gDNA depth quality score is less than the threshold score or that the ratio is greater than or equal to the threshold ratio, the processing system determines that there is uncertain evidence regarding association of the candidate variant with the gDNA sample. Stated differently, the processing system may be uncertain about whether the candidate variant is blood-derived or tumor-derived because the candidate variant appears “bloodish” but there is not definitive evidence that a corresponding mutation is found in healthy blood cells.
In step 355G, responsive to determining that the gDNA depth quality score is greater than or equal to the threshold score and that the ratio is less than the threshold ratio, the processing system determines that the candidate variant is likely associated with a nucleotide mutation of the gDNA sample. In other words, the processing system determines that although there is not definitive evidence that a corresponding mutation is found in healthy blood cells, the candidate variant appears bloodier than normal.
Thus, the processing system may use the ratio and gDNA depth quality score to tune the joint model to provide greater granularity in determining whether certain candidate variants should be filtered out as false positives (e.g., initially predicted as tumor-derived, but actually blood-derived), true positives, or uncertain due to insufficient evidence or confidence to classify into either category. For example, based on the result of the method 350, the processing system may modify one or more of the parameters (e.g., k parameter) for a hinge loss function of the joint model. In some embodiments, the processing system uses one or more steps of the method 350 to assign candidate variants to different categories, for instance, “definitively,” “likely,” or “uncertain” association with gDNA (e.g., as shown in
In various embodiments, the processing system processes candidate variants using one or more filters in addition to the steps described with reference to the flowchart of the method 350 shown in
In some embodiments, the processing system filters candidate variants of sequence reads of a cfDNA sample responsive to determining that there is no quality score for the sequence reads. The processing system may determine quality scores for candidate variants using a noise model. The processing system may determine the quality scores with no base alignment. In some embodiments, the quality score may be missing for some samples or candidate variants due to a lack of training data for the joint model or poor training data that fails to produce useful parameters for a given candidate variant. For instance, high noise levels in sequence reads may lead to unavailability of useful training data. The processing system may tune specificity and selectivity of the joint model based on whether a single variant is processed or if the processing system is controlling for a targeted panel. As other examples, the processing system filters a candidate variant responsive to determining that the candidate variant is an edge variant artifact, has less than a threshold cfDNA depth (e.g., 200 sequence reads), has less than a threshold cfDNA quality score (e.g., 60), or corresponds to human leukocyte antigens (HLA), e.g., HLA-A. Since sequences associated with HLA-A may be difficult to align, the processing system may perform a custom filtering or variant calling process for sequences in these regions.
In some embodiments, the processing system filters candidate variants determined to be associated with germline mutations. The processing system may determine that a candidate variant is germline by determining that the candidate variant occurs at an appropriate frequency corresponding to a given germline mutation event and is present at a particular one or more positions (e.g., in a nucleotide sequence) known to be associated with germline events. Additionally, the processing system may determine a point estimate of gDNA frequency, where C is a constant (e.g., 0.5):
The processing system may determine that a candidate variant is germline responsive to determining that point_gDNA is greater than a point estimate threshold (e.g., 0.3). In some embodiments, the processing system filters candidate variants responsive to determining that a number of variants associated with local sequence repetitions is greater than a threshold value. For example, an "AAAAAA" or "ATATATAT" local sequence repetition may be the result of a polymerase slip that causes an increase in local error rates.
To improve the accuracy of catching these candidate variants, the processing system may use the filters as described above with reference to
2.7. Example Edge Filtering
Edge filtering is performed (e.g., step 310C shown in
2.7.1. Example Training Distributions of Features from Artifact and Non-Edge Variants
Training data 405 includes various sequence reads. Sequence reads in the training data 405 can correspond to various positions on the genome. In various embodiments, sequence reads in the training data 405 are obtained from more than one training sample.
The edge filter categorizes sequence reads in the training data 405 into one of an artifact training data 410A category, reference allele training data 430 category, or non-artifact training data 410B category. In various embodiments, sequence reads in the training data 405 can also be categorized into a no result or a no classification category responsive to determining that the sequence reads do not satisfy the criteria to be placed in any of the artifact training data 410A category, reference allele training data 430 category, or non-artifact training data 410B category.
As shown in
Sequence reads that correspond to a common position on the genome include: 1) sequencing reads that include a nucleotide base at the position that is different from the reference allele (e.g., an ALT) and 2) sequencing reads that include a nucleotide base at the position that matches the reference allele. The edge filter categorizes sequence reads that include an ALT into one of the artifact training data 410A or non-artifact training data 410B. Specifically, sequence reads that satisfy one or more criteria are categorized as artifact training data 410A. The criteria can be a combination of a type of mutation of the ALT and a location of the ALT on the sequence read. Referring to an example of a type of mutation, sequence reads categorized as artifact training data include an alternative allele that is either a cytosine to thymine (C>T) nucleotide base substitution or a guanine to adenine (G>A) nucleotide base substitution. Referring to an example of the location of the alternative allele, the alternative allele is less than a threshold number of base pairs from an edge of a sequence read. In one implementation, the threshold number of base pairs is 25 nucleotide base pairs, however, the threshold number may vary by implementation.
Sequence reads with an alternative allele that are categorized into the non-artifact training data 410B category are all other sequence reads with an alternative allele that do not satisfy the criteria of being categorized as artifact training data 410A. For example, any sequence read that includes an alternative allele that is not one of a C>T or G>A nucleotide base substitution is categorized as a non-edge training variant. As another example, irrespective of the type of nucleotide mutation, any sequence read that includes an alternative allele that is located greater than a threshold number of base pairs from an edge of a sequence read is categorized as non-artifact training data 410B. In one implementation, the threshold number of base pairs is 25 nucleotide base pairs, however, the threshold number may vary by implementation.
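A minimal sketch of this categorization rule follows; measuring distance from the nearest read edge and the 0-based offset convention are assumptions made for illustration.

```python
def categorize_read(ref_base: str, alt_base: str, alt_offset: int,
                    read_length: int, edge_threshold: int = 25) -> str:
    """Place a sequence read carrying an alternative allele into the artifact or
    non-artifact training category using the two criteria described above:
    the substitution type (C>T or G>A) and proximity to a read edge."""
    distance_from_edge = min(alt_offset, read_length - 1 - alt_offset)
    is_ct_or_ga = (ref_base, alt_base) in {("C", "T"), ("G", "A")}
    if is_ct_or_ga and distance_from_edge < edge_threshold:
        return "artifact"
    return "non-artifact"

# Illustrative call: a C>T substitution 4 bases from the end of a 150 bp read.
print(categorize_read("C", "T", alt_offset=145, read_length=150))
```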
Referring now to the reference allele training data 430 category, sequence reads that include the reference allele are categorized in the reference allele training data 430 category.
Returning to
The artifact significance score 423A feature is a representation of whether the location of alternative alleles 475A (e.g., in terms of distance from edge of a sequence read or another measure) on a group of sequence reads in the artifact training data 410A is sufficiently different, to a statistically significant degree, from the location of reference alleles 442 on a group of sequencing reads in the reference allele training data 430. Specifically, artifact significance score 423A is a comparison between edge distances 450A of alternative alleles 475A (see
In various embodiments, the edge filter performs a statistical significance test for the comparison between edge distances. As one example, the statistical significance test is a Wilcoxon rank-sum test. Here, the edge filter assigns each sequence read in the artifact training data 410A and each sequence read in the reference allele training data 430 a rank depending on the magnitude of each edge distance 450A and 450C, respectively. For example, a sequence read that has the greatest edge distance 450A or 450C can be assigned the highest rank (e.g., rank=1), the sequence read that has the second greatest edge distance 450A or 450C can be assigned the second highest rank (e.g., rank=2), and so on. The edge filter compares the median rank of sequence reads in the artifact training data 410A to the median rank of sequence reads in the reference allele training data 430 to determine whether the locations of alternative alleles 475 in the artifact training data 410A significantly differ from locations of reference alleles 442 in the reference allele training data 430A. As an example, the comparison between the median ranks can yield a p-value, which represents a statistical significance score as to whether the median ranks are significantly different. In various embodiments, the artifact significance score 423A is represented by a Phred score, which can be expressed as:
Phred Score=−10 log10 P
where P is the p-value score. Altogether, a low artifact significance score 423A signifies that the difference in median ranks is not statistically significant whereas a high artifact significance score 423A signifies that the difference in median ranks is statistically significant.
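For illustration, the rank-sum comparison and Phred conversion can be sketched as follows; the floor applied to the p-value is an assumption to avoid taking the logarithm of zero.

```python
import numpy as np
from scipy.stats import ranksums

def artifact_significance_score(alt_edge_distances, ref_edge_distances) -> float:
    """Phred-scaled significance of the difference in edge-distance ranks
    between reads carrying the alternative allele and reads carrying the
    reference allele (Wilcoxon rank-sum test)."""
    _, p_value = ranksums(alt_edge_distances, ref_edge_distances)
    return float(-10.0 * np.log10(max(p_value, 1e-300)))

# Illustrative call: alternative alleles cluster near the read edge.
print(artifact_significance_score([3, 5, 7, 4, 6, 2], [40, 55, 62, 48, 71, 66]))
```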
Additionally, the non-artifact allele fraction 424B of the alternative allele 475B can be denoted as
Returning to
In another embodiment, the edge filter may use multiple artifact features 420 or non-artifact features 425 to generate a single distribution 440 or 445. For example,
2.7.2. Example Determining of a Sample-Specific Rate for Identifying Edge Variants
Sequence reads of a called variant 482 are obtained from a sample 480. Generally, the sequence reads of a called variant 482 refer to a group of sequence reads that cross over the position in the genome to which the called variant corresponds.
For each called variant, the edge filter extracts features 484 from the sequence reads of the called variant 482. Each feature 484 extracted from sequence reads of a called variant 482 can be a statistical distance from edge of alternative alleles in the sequence reads, an allele fraction of an alternative allele, a significance score, another type of feature, or some combination thereof. The edge filter applies values of features 484 extracted across called variants of the sample 480 as input to a sample-specific rate prediction model 485 that determines a predicted rate 486 for the sample 480. The predicted rate 486 for the sample 480 refers to an estimated proportion of called variants that are edge variants. In various embodiments, the predicted rate 486 is a value between 0 and 1, e.g., inclusive.
As shown in
In one embodiment, the likelihood equation for the estimation can be expressed as:
L(w | x) = w · L(x | d1) + (1 − w) · L(x | d2)
where w is the predicted rate 486, x represents the features 484, d1 represents the artifact distribution 440, and d2 represents the non-artifact distribution 445. In other words, L(w|x) is the weighted sum of a likelihood of observing the features 484 in view of the artifact distribution 440 and a likelihood of observing the features 484 in view of the non-artifact distribution 445. Therefore, the maximum likelihood estimation determines the predicted rate 486 (e.g., rate w) that maximizes this overall likelihood given a certain set of conditions.
As shown in
L(w | x1, x2, . . . , xn) = w · ∏_{j=1}^{n} L(xj | d1) + (1 − w) · ∏_{j=1}^{n} L(xj | d2)  (2)
Altogether, responsive to determining that the distribution of features 484 extracted from sequence reads of the called variants in the sample 480 is more similar to the artifact distribution 440 than the non-artifact distribution 445, the rate prediction model 485 determines a high predicted rate 486, which indicates that a high estimated proportion of called variants are likely edge variants. Alternatively, responsive to determining that the distribution of features 484 extracted from sequence reads of variants in the sample 480 is more similar to the non-artifact distribution 445 than the artifact distribution 440, the rate prediction model 485 determines a low predicted rate 486, which indicates that a low estimated proportion of called variants are likely edge variants. As discussed below, the predicted rate 486 can be used to control the level of "aggressiveness" with which edge variants are identified in a sample. Therefore, a sample that is assigned a high predicted rate 486 can be aggressively filtered (e.g., using broader criteria to filter out a greater number of possible edge variants) whereas a sample that is assigned a low predicted rate 486 can be less aggressively filtered.
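A sketch of the rate estimation is shown below. For numerical stability it mixes the two distributions per called variant rather than mixing the full products as in equation (2); this per-variant mixture is a common formulation and is an assumption here, not necessarily the disclosure's exact likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_edge_rate(log_lik_artifact: np.ndarray,
                       log_lik_non_artifact: np.ndarray) -> float:
    """Maximum likelihood estimate of the sample-specific edge-variant rate w,
    where the inputs are per-variant log-likelihoods of the extracted features
    under the artifact (d1) and non-artifact (d2) training distributions."""
    def neg_log_likelihood(w: float) -> float:
        per_variant = np.logaddexp(np.log(w) + log_lik_artifact,
                                   np.log1p(-w) + log_lik_non_artifact)
        return -float(per_variant.sum())

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1.0 - 1e-6),
                             method="bounded")
    return float(result.x)

# Illustrative call: most variants look slightly more like the artifact distribution.
print(estimate_edge_rate(np.array([-1.0, -2.0, -1.5]), np.array([-3.0, -2.5, -1.6])))
```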
2.7.3. Example Variant Specific Analysis for Identifying an Edge Variant
In one embodiment, the edge filter filters called variants based on a type of mutation of the called variant. Here, a called variant that is not of the C>T or G>A mutation type can be automatically characterized as a non-edge variant. Alternatively, any called variant that is of the C>T or G>A mutation type is further analyzed in the subsequent steps described hereafter.
As shown in
The edge filter provides values of the extracted features 484 as input to the edge variant prediction model 492. As shown in
Specifically, the edge variant prediction model 492 determines the probability of observing the features 484 of the sequence reads of a called variant 482 in view of the artifact distribution 440 and the non-artifact distribution 445. In one embodiment, the edge variant prediction model 492 determines the artifact score 494 by analyzing the features 484 in view of the artifact distribution 440 and determines the non-artifact score 496 by analyzing the features 484 in view of the non-artifact distribution 445.
As a visual example, referring back to the example distribution shown in
As shown in
In one embodiment, edge variant probability 498 can be expressed as the posterior probability of the called variant being an edge variant in view of the features 484 extracted from sequence reads of the called variant 482. The combination of the artifact score 494, the non-artifact score 496, and the sample-specific predicted rate 486 can be expressed as:
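Since the edge variant probability is described as a posterior that combines the two scores with the sample-specific predicted rate acting as a prior weight, one natural reconstruction (an assumption, not the disclosure's exact expression) is:

```latex
P(\text{edge} \mid x) \;=\;
  \frac{w \cdot L(x \mid d_1)}
       {w \cdot L(x \mid d_1) \;+\; (1 - w) \cdot L(x \mid d_2)}
```

Here w is the sample-specific predicted rate 486, L(x | d1) corresponds to the artifact score 494, and L(x | d2) corresponds to the non-artifact score 496.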
The edge filter may compare the edge variant probability 498 against a threshold value. Responsive to determining that the edge variant probability 498 is greater than the threshold value, the edge filter determines that the called variant is an edge variant. Responsive to determining that the edge variant probability 498 is less than the threshold value, the edge filter determines that the called variant is a non-edge variant.
2.7.4. Example Variant Specific Analysis for Identifying an Edge Variant
For each called variant, one or more features 484 are extracted 454C from the sequence reads of the variant. Values of the extracted features 484 are applied 454D as input to a trained model to obtain an artifact score 494. The artifact score 494 represents a likelihood that the called variant is an edge variant (e.g., result of a processing artifact). The trained model further outputs a non-artifact score 496 which represents a likelihood that the called variant is a non-edge variant (e.g., not a result of a processing artifact).
For each called variant, an edge variant probability 498 is generated by combining the artifact score 494 for the called variant, the non-artifact score 496 for the called variant, and the sample-specific predicted rate 486. Based on the edge variant probability 498, the called variant can be reported 454E as an edge variant (e.g., a variant that was called as a result of a processing artifact).
2.7.5. Examples of Edge Filtering
The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to make and use the disclosed embodiments, and are not intended to limit the scope of what is regarded as the invention. Efforts have been made to ensure accuracy with respect to the numbers used (e.g. amounts, temperature, concentrations, etc.) but some experimental errors and deviations should be allowed for. It will be appreciated by those of skill in the art that, in light of the present disclosure, numerous modifications and changes can be made in the particular embodiments exemplified without departing from the intended scope of the invention.
2.7.5.1. Categorizing Artifact and Clean Training Samples
The edge filter categorizes sequence reads that include an alternative allele for a particular site on the genome into artifact and non-artifact groups, as is described below. Additionally, the sequence reads that include the reference allele for the particular site on the genome are included as reference allele data to be later used to determine features of the sequence reads.
The edge filter categorizes sequence reads that include the alternative allele into the artifact or non-artifact category based on two criteria. A first criteria includes a threshold distance of 25 nucleotide base pairs. Therefore, sequence reads categorized in the artifact category include an alternative allele that is within 25 nucleotide base pairs from the edge of the sequence read. A second criteria is a type of nucleotide base mutation. Specifically, sequence reads categorized into the artifact category include an alternative allele that is one of a C>T or G>A mutation. The edge filter categorizes sequence reads that include an alternative allele that does not satisfy these two criteria into the non-artifact category.
The edge filter extracts features from sequence reads of a called variant, including sequence reads that include an alternative allele as well as sequence reads that include the reference allele. Here, the three types of features extracted include: 1) median distance of alternative alleles from the edge of the sequence read, 2) allele fraction of the alternative allele, and 3) significance score. The three types of extracted features are compiled and used to generate the artifact distributions and non-artifact distributions shown in
Several trends are observed across the artifact distributions and non-artifact distributions shown in
For the datasets shown in
In two separate scenarios, these candidate variants in the initial set were further filtered by the edge filter to identify and remove edge variants. A first scenario included the application of a first filter, hereafter referred to as a simple edge variant filter. The simple edge variant filter removes called variants that exhibit a median distance from edges of sequence reads that fall below a threshold distance. Here, the threshold distance is determined based on the location of edge variants in training sequence reads that are categorized in the artifact training data category. Specifically, the threshold distance is expressed as the summation of the median distance of the edge variants from edges of sequence reads and the median absolute deviation of the median distance of the edge variants from edges of sequence reads. The simple edge variant filter is a simple indiscriminate filter that removes all variants that satisfy this threshold distance criteria. The second filter refers to the edge filtering process. Here, the sample specific edge variant filter identifies edge variants while considering the distribution of called variants observed for the sample.
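The threshold used by the simple edge variant filter can be computed directly from the edge distances of the training edge variants, for example (a minimal sketch):

```python
import numpy as np

def simple_edge_threshold(artifact_edge_distances) -> float:
    """Median edge distance of training edge variants plus the median absolute
    deviation of those distances, per the simple edge variant filter."""
    distances = np.asarray(artifact_edge_distances, dtype=float)
    median = np.median(distances)
    mad = np.median(np.abs(distances - median))
    return float(median + mad)

# Illustrative call.
print(simple_edge_threshold([2, 3, 4, 5, 6, 8, 10]))
```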
The non-edge variants that remain after removing edge variants using either the simple edge variant filter or the sample specific edge variant filter are retained for analysis in comparison to a conventional method. As referred to hereafter, the conventional method refers to the identification of genomic variations from solid tumor samples using a conventional process, specifically the Memorial Sloan Kettering Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT) pipeline (Cheng, D., et al., Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): A Hybridization Capture-Based Next-Generation Sequencing Clinical Assay for Solid Tumor Molecular Oncology, Journal of Molecular Diagnostics, 17(3), pp. 251-264).
Here, called variants that are both non-edge variants and detected by the conventional method are referred to as concordant variants.
The percentage of concordant variants shown in
Referring to the simple edge variant filter in
Comparatively, the application of a sample specific edge variant filter improves specificity without sacrificing sensitivity. As shown in
2.8. Examples of Combined Filtering and Scoring
The example data in the following
As illustrated by the progression of diagrams, the number of called variants generally decreases as the extent of filtering increases. Thus, the examples suggest that non-synonymous filtering, joint modeling, edge filtering, and noise modeling by the processing system can successfully identify and remove a notable number of false positives. Accordingly, the processing system provides a more accurate variant caller that mitigates influence from various sources of noise or artifacts. Targeted assays analyzing cfDNA from blood samples using the disclosed methods may be able to capture tumor-relevant biology. A slight proportional correlation may be observed in the diagrams between the count of called variants and the age of the individuals (e.g., more evident in diagram 466A). Moreover, there are greater counts of called variants for cancer samples than non-cancer samples, as expected.
3.1. Whole Genome Features
Referring briefly again to
Generally, bin scores and segment scores represent a normalized sequence read count. Specifically, bin scores and segment scores represent a total number of sequence reads categorized in a bin or segment that is normalized based on a total number of expected sequence reads in the bin or segment (e.g., expected based on training data).
In some embodiments, the total number of copy number aberrations is a single, numerical feature value for a sample. For example, the presence of copy number aberrations per chromosome can be a value of 0 or 1 that indicates whether a copy number aberration is located on a particular chromosome. Here, detection of copy number aberrations is predicated on a workflow that can accurately differentiate copy number aberrations from other copy number events, such as a copy number variation that arises from a non-somatic source (e.g., arises in blood). An example of the copy number aberration detection workflow is described below.
Reduced dimensionality features refer to features that have been reduced to a lower dimensionality space (in comparison to data of original sequence reads) while still representing the main characteristics of the original sequence reads. As an example, bin scores of bins across the genome can be reduced to a lower dimensionality space and represented by reduced dimensionality features. Reduced dimensionality features can be generated through a dimensionality reduction process such as principal component analysis (PCA).
Generally, feature values of the aforementioned whole genome features 152 are determined through one of two workflows.
3.2. Copy Number Aberration Detection Workflow
Referring first to the copy number aberration detection workflow, the features of the characteristics of bins across the genome, the characteristics of segments across the genome, and the presence of copy number aberrations per chromosome can be determined through the copy number aberration detection workflow. Rather than the described workflow, a copy number aberration workflow known in the art may also be used (see, e.g., U.S. application Ser. No. 16/352,214, which is incorporated by reference herein).
For example, as shown in
At step 506B, sequence reads derived from each of cfDNA and gDNA are analyzed to identify characteristics of bins and segments across a genome. Generally, a bin includes a range of nucleotide bases of a genome. A segment refers to one or more bins. Therefore, each sequence read is categorized in bins and/or segments that include a range of nucleotide bases that corresponds to the sequence read. Each statistically significant bin or segment of the genome includes a total number of sequence reads (or normalized sequence reads) categorized in the bin or segment that is indicative of a copy number event. Generally, a statistically significant bin or segment includes a sequence read count that significantly differs from an expected sequence read count for the bin or segment even when accounting for possible confounding factors, examples of which include processing biases, variance in the bin or segment, or an overall level of noise in the sample (e.g., cfDNA sample or gDNA sample). Therefore, the sequence read count of a statistically significant bin and/or a statistically significant segment likely indicates a biological anomaly such as a presence of a copy number event in the sample.
Generally, the analysis of cfDNA sequence reads and the analysis of gDNA sequence reads are conducted independent of one another. In various embodiments, the analysis of cfDNA sequence reads and gDNA sequence reads are conducted in parallel. In some embodiments, the analysis of cfDNA sequence reads and gDNA sequence reads are conducted at separate times depending on when the sequence reads are obtained.
Reference is now made to
At step 510A, a bin sequence read count is determined for each bin of a reference genome. Generally, each bin represents a number of contiguous nucleotide bases of the genome. A genome can be composed of numerous bins (e.g., hundreds or even thousands). In some embodiments, the number of nucleotide bases in each bin is constant across all bins in the genome. In some embodiments, the number of nucleotide bases in each bin differs for each bin in the genome. In one embodiment, the number of nucleotide bases in each bin is between 10 kilobases (kb) and 1 megabase (mb). In one embodiment, the number of nucleotide bases in each bin is between 25 kilobases (kb) and 200 kb. In one embodiment, the number of nucleotide bases in each bin is between 40 kb and 100 kb. In one embodiment, the number of nucleotide bases in each bin is between 45 kb and 75 kb. In one embodiment, the number of nucleotide bases in each bin is 50 kb. In practice, other bin sizes may be used as well.
The sequence read count of a bin represents a total number of sequence reads that are categorized in the bin. A sequence read is categorized in a bin if the sequence read spans a threshold number of nucleotide bases that are included in the bin (i.e., align or map to a bin). In one embodiment, each sequence read categorized in a bin spans at least one nucleotide base that is included in the bin. Reference is now made to
As shown in
To determine the bin sequence read count for each bin, the sequence reads categorized in each bin are quantified. Therefore, bin 514A shown in
Returning to
For example, in one embodiment, at step 510C, a bin score for each bin is determined by modifying the bin sequence read count for the bin by the expected bin sequence read count for the bin. Step 510C serves to normalize the observed bin sequence read count such that if the particular bin consistently has a high sequence read count (e.g., high expected bin sequence read counts) across many samples, then the normalization of the observed bin sequence read count accounts for that trend. The expected sequence read count for the bin can be determined from training data (e.g., sequence reads from healthy individuals). The generation of the expected sequence read count for each bin is described in further detail below.
In one embodiment, a bin score for a bin can be represented as the log of the ratio of the observed sequence read count for the bin and the expected sequence read count for the bin. For example, bin score bi for bin i can be expressed as:
b_i = log(observed_i / expected_i)
In other embodiments, the bin score for the bin can be represented as the ratio between the observed sequence read count for the bin and the expected sequence read count for the bin (e.g., observed_i / expected_i), the square root of the ratio (e.g., √(observed_i / expected_i)), a generalized log transformation (glog) of the ratio (e.g., log(observed_i + √(observed_i² + expected_i²))), or other variance-stabilizing transforms of the ratio.
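A short sketch of these normalization alternatives follows; the glog expression mirrors the reconstruction above and should be treated as one plausible variance-stabilizing form rather than the disclosure's exact transform.

```python
import numpy as np

def bin_scores(observed, expected, transform: str = "log_ratio") -> np.ndarray:
    """Normalize observed per-bin sequence read counts by the expected counts
    learned from training samples."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    if transform == "log_ratio":
        return np.log(observed / expected)
    if transform == "ratio":
        return observed / expected
    if transform == "sqrt_ratio":
        return np.sqrt(observed / expected)
    if transform == "glog":
        return np.log(observed + np.sqrt(observed**2 + expected**2))
    raise ValueError(f"unknown transform: {transform}")

# Illustrative call: three bins, one with an elevated read count.
print(bin_scores([100, 210, 98], [100, 100, 100]))
```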
Reference is now made to
Here, the observed sequence read counts and expected sequence read counts for bins in the first set 518A may not differ significantly. However, the observed sequence read counts for bins in the second set 518B may be significantly higher than the corresponding expected read counts for the bins. Therefore, the bin scores for each of the bins in the second set 518B are higher than the bin scores for each of the bins in the first set 518A. The higher bin scores of the bins in the second set 518B indicate a higher likelihood that the observed sequence read counts in bin M, bin M+1, and bin M+2 are a result of a copy number event.
The differing bin scores for the first set 518A and second set 518B of bins illustrates the benefit of normalizing the observed sequence read counts for each bin by the corresponding expected sequence read counts for the bin. Specifically, in the example shown in
Here, bin scores across bins can be informative for identifying a copy number aberration. Moreover, as described above, the bin scores for different bins can serve as a whole genome feature.
Returning to
To provide an example, a bin variance estimate (vari) for a bin i can be expressed as:
vari=varexp,i*Isample
where varexp,i represents the expected variance for bin i (e.g., determined from training data such as sequence reads from healthy individuals) and Isample represents an inflation factor for the sample.
To determine the inflation factor Isample of the sample, a deviation of the sample is determined and combined with sample variation factors. Sample variation factors are coefficient values that are previously derived by performing a fit across data derived from multiple training samples. For example, if a linear fit is performed, sample variation factors can include a slope coefficient and an intercept coefficient. If higher order fits are performed, sample variation factors can include additional coefficient values.
More specifically, to determine coefficient values, for each training sample, sequence reads from the training sample can be used to determine z-scores for each bin of the reference genome. A z-score for bin i can be expressed as:
zi=bi/√(vari),
where bi is the bin score for bin i and vari is the bin variance estimate for the bin.
A first curve fit is performed between the bin z-scores of each training sample and the theoretical distribution of z-scores. Here, an example theoretical distribution of z-scores is a normal distribution. In one embodiment, the first curve fit is a linear robust regression fit which yields a slope value. Therefore, performing the first curve fit between bin z-scores of a training sample and the theoretical distribution of z-scores yields a slope value. The first curve fit is performed multiple times for multiple training samples to calculate multiple slope values.
A second curve fit is performed between slope values and deviations of training samples. As an example, the deviation of a training sample can be a median absolute pairwise deviation (MAPD), which represents the median of absolute value differences between bin scores of adjacent bins across the training sample. In one embodiment, the second curve fit is a linear robust regression fit. In another embodiment, the second curve fit can be a higher order polynomial fit. The second curve fit yields coefficient values which, in the embodiment where the second curve fit is a linear robust regression fit, includes a slope coefficient and an intercept coefficient. The coefficient values yielded by the second curve fit are stored as sample variation factors.
The deviation of the sample represents a measure of variability of sequence read counts in bins across the sample. In one embodiment, the deviation of the sample is a median absolute pairwise deviation (MAPD) and can be calculated by analyzing sequence read counts of adjacent bins. Specifically, the MAPD represents the median of absolute value differences between bin scores of adjacent bins across the sample. Mathematically, the MAPD can be expressed as:
MAPD=median{|bi−bi+1|} over all pairs of adjacent bins (bini, bini+1),
where bi and bi+1 are the bin scores for bin i and bin i+1 respectively.
The inflation factor Isample is determined by combining the sample variation factors and the deviation of the sample (e.g., MAPD). As an example, the inflation factor Isample of a sample can be expressed as:
Isample=slope*σsample+intercept.
Here, each of the “slope” and “intercept” coefficients are sample variation factors, whereas σsample represents the deviation of the sample (e.g., the MAPD).
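The following sketch illustrates, under the definitions above, how the MAPD of a sample and the resulting inflation factor could be computed; the function names and the example coefficient values are hypothetical stand-ins for values fitted from training samples.

```python
import numpy as np

def mapd(bin_scores):
    """Median absolute pairwise deviation between adjacent bin scores."""
    b = np.asarray(bin_scores, dtype=float)
    return np.median(np.abs(np.diff(b)))

def sample_inflation_factor(bin_scores, slope, intercept):
    """Combine the sample deviation (MAPD) with previously fitted sample
    variation factors (slope and intercept) to obtain Isample."""
    return slope * mapd(bin_scores) + intercept

# Hypothetical bin scores and fitted coefficients from training samples.
scores = [0.01, -0.02, 0.03, 0.00, 0.02]
print(sample_inflation_factor(scores, slope=5.0, intercept=1.0))
```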
At step 510E, each bin is analyzed to determine whether the bin is statistically significant based on the bin score and bin variance estimate for the bin. For each bin i, the bin score (bi) and the bin variance estimate (vari) of the bin can be combined to generate a z-score for the bin. An example of the z-score (zi) of bin i can be expressed as:
zi=bi/√(vari).
To determine whether a bin is a statistically significant bin, the z-score of the bin is compared to a threshold value. If the z-score of the bin is greater than the threshold value, the bin is deemed a statistically significant bin. Conversely, if the z-score of the bin is less than the threshold value, the bin is not deemed a statistically significant bin. In one embodiment, a bin is determined to be statistically significant if the z-score of the bin is greater than 2. In other embodiments, a bin is determined to be statistically significant if the z-score of the bin is greater than 2.5, 3, 3.5, or 4. In one embodiment, a bin is determined to be statistically significant if the z-score of the bin is less than −2. In other embodiments, a bin is determined to be statistically significant if the z-score of the bin is less than −2.5, −3, −3.5, or −4. The statistically significant bins can be indicative of one or more copy number events that are present in a sample (e.g., cfDNA or gDNA sample).
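As a minimal illustration of this thresholding step, the sketch below computes z-scores of the form zi=bi/√(vari) and flags bins whose absolute z-score exceeds a chosen threshold; the combined two-sided check and the example values are assumptions made for the illustration.

```python
import numpy as np

def significant_bins(bin_scores, bin_variances, threshold=2.0):
    """Return per-bin z-scores and a flag for |z| above the threshold."""
    b = np.asarray(bin_scores, dtype=float)
    var = np.asarray(bin_variances, dtype=float)
    z = b / np.sqrt(var)  # z-score of each bin
    return z, np.abs(z) > threshold

# Hypothetical bin scores and variance estimates; the second bin is flagged.
z, flags = significant_bins([0.02, 0.45, -0.05], [0.01, 0.01, 0.01])
print(z, flags)
```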
At step 510F, segments of the reference genome are generated. Each segment is composed of one or more bins of the reference genome and has a statistical sequence read count. Examples of a statistical sequence read count can be an average bin sequence read count, a median bin sequence read count, and the like. Generally, each generated segment of the reference genome possesses a statistical sequence read count that differs from a statistical sequence read count of an adjacent segment. Therefore, a first segment may have an average bin sequence read count that significantly differs from an average bin sequence read count of a second, adjacent segment.
In various embodiments, the generation of segments of the reference genome can include two separate phases. A first phase can include an initial segmentation of the reference genome into initial segments based on the difference in bin sequence read counts of the bins in each segment. The second phase can include a re-segmentation process that involves recombining one or more of the initial segments into larger segments. Here, the second phase considers the lengths of the segments created through the initial segmentation process to combine false-positive segments that were a result of over-segmentation that occurred during the initial segmentation process.
Referring more specifically to the initial segmentation process, one example of the initial segmentation process includes performing a circular binary segmentation algorithm to recursively break up portions of the reference genome into segments based on the bin sequence read counts of bins within the segments. In other embodiments, other algorithms can be used to perform an initial segmentation of the reference genome. As an example of the circular binary segmentation process, the algorithm identifies a break point within the reference genome such that a first segment formed by the break point includes a statistical bin sequence read count of bins in the first segment that significantly differs from the statistical bin sequence read count of bins in the second segment formed by the break point. Therefore, the circular binary segmentation process yields numerous segments, where the statistical bin sequence read count of bins within a first segment is significantly different from the statistical bin sequence read count of bins within a second, adjacent segment.
The initial segmentation process can further consider the bin variance estimate for each bin when generating initial segments. For example, when calculating a statistical bin sequence read count of bins in a segment, each bin i can be assigned a weight that is dependent on the bin variance estimate (e.g., vari) for the bin. In one embodiment, the weight assigned to a bin is inversely related to the magnitude of the bin variance estimate for the bin. A bin that has a higher bin variance estimate is assigned a lower weight, thereby lessening the impact of the bin's sequence read count on the statistical bin sequence read count of bins in the segment. Conversely, a bin that has a lower bin variance estimate is assigned a higher weight, which increases the impact of the bin's sequence read count on the statistical bin sequence read count of bins in the segment.
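A minimal sketch of the inverse-variance weighting described here is shown below; the specific weighting function (weights equal to 1/vari) is one plausible choice assumed for the example rather than a prescribed formula.

```python
import numpy as np

def weighted_segment_mean(bin_scores, bin_variances):
    """Summarize a segment's bins, weighting each bin inversely to its
    variance estimate so that noisier bins contribute less."""
    b = np.asarray(bin_scores, dtype=float)
    w = 1.0 / np.asarray(bin_variances, dtype=float)
    return float(np.sum(w * b) / np.sum(w))

# The third bin has a larger variance estimate and is down-weighted.
print(weighted_segment_mean([0.40, 0.35, 0.90], [0.01, 0.01, 0.05]))
```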
Referring now to the re-segmenting process, it analyzes the segments created by the initial segmentation process and identifies pairs of falsely separated segments that are to be recombined. The re-segmentation process may account for a characteristic of segments not considered in the initial segmentation process. As an example, a characteristic of a segment may be the length of the segment. Therefore, a pair of falsely separated segments can refer to adjacent segments that, when considered in view of the lengths of the pair of segments, do not have significantly differing statistical bin sequence read counts. Longer segments are generally correlated with a higher variation of the statistical bin sequence read count. As such, adjacent segments that were initially determined to each have statistical bin sequence read counts that differed from the other can be deemed as a pair of falsely separated segments by considering the length of each segment.
Falsely separated segments in the pair are combined. Thus, performing the initial segmentation and re-segmenting processes results in generated segments of a reference genome that takes into consideration variance that arises from differing lengths of each segment.
At step 510G, a segment score is determined for each segment based on an observed segment sequence read count for the segment and an expected segment sequence read count for the segment. An observed segment sequence read count for the segment represents the total number of observed sequence reads that are categorized in the segment. Therefore, an observed segment read count for the segment can be determined by summing the observed bin read counts of bins that are included in the segment. Similarly, the expected segment sequence read count represents the expected sequence read counts across the bins included in the segment. Therefore, the expected segment sequence read count for a segment can be calculated by combining (e.g., summing) the expected bin sequence read counts of bins included in the segment. The expected read counts of bins included in the segment can be determined from prior training data, such as sequence reads from healthy individuals. As described above, the segment score can be indicative of a copy number aberration. Therefore, the segment score can serve as a whole genome feature.
The segment score for a segment can be expressed as the ratio of the segment sequence read count and the expected segment sequence read count for the segment. In one embodiment, the segment score for a segment can be represented as the log of the ratio of the observed sequence read count for the segment and the expected sequence read count for the segment. Segment score sk for segment k can be expressed as:
sk=log(observedk/expectedk).
In other embodiments, the segment score for the segment can be represented as one of the square root of the ratio (e.g., √(observedk/expectedk)), a generalized log transformation of the ratio (e.g., log(observed+√(observed^2+expected))), or other variance stabilizing transforms of the ratio.
At step 510H, a segment variance estimate is determined for each segment. Generally, the segment variance estimate represents how deviant the sequence read count of the segment is. The segment variance can influence whether a segment score is indicative of a copy number aberration. As discussed above, the segment variance can serve as a whole genome feature.
In one embodiment, the segment variance estimate can be determined by using the bin variance estimates of bins included in the segment and further adjusting the bin variance estimates by a segment inflation factor (Isegment). To provide an example, the segment variance estimate for a segment k can be expressed as:
vark=mean(vari)*Isegment
where mean(vari) represents the mean of the bin variance estimates of bins i that are included in segment k. The bin variance estimates of bins can be obtained from training data (e.g., sequence reads from healthy individuals).
The segment inflation factor accounts for the increased deviation at the segment level that is typically higher in comparison to the deviation at the bin level. In various embodiments, the segment inflation factor may scale according to the size of the segment. For example, a larger segment composed of a large number of bins would be assigned a segment inflation factor that is larger than a segment inflation factor assigned to a smaller segment composed of fewer bins. Thus, the segment inflation factor accounts for higher levels of deviation that arises in longer segments. In various embodiments, the segment inflation factor assigned to a segment for a first sample differs from the segment inflation factor assigned to the same segment for a second sample. In various embodiments, the segment inflation factor Isegment for a segment with a particular length can be empirically determined in advance.
In various embodiments, the segment variance estimate for each segment can be determined by analyzing training samples. For example, once the segments are generated in step 510F, sequence reads from training samples are analyzed to determine an expected segment sequence read count for each generated segment and an expected segment variance estimate for each segment.
The segment variance estimate for each segment can be represented as the expected segment variance estimate for each segment determined using the training samples adjusted by the sample inflation factor. For example, the segment variance estimate (vark) for a segment k can be expressed as:
vark=varexp,k*Isample
where varexp,k represents the expected segment variance estimate for segment k determined using the training samples and Isample represents the inflation factor of the sample.
At step 510I, each segment is analyzed to determine whether the segment is statistically significant based on the segment score and segment variance estimate for the segment. For each segment k, the segment score (Sk) and the segment variance estimate (vark) of the segment can be combined to generate a z-score for the segment. An example of the z-score (Zk) of segment k can be expressed as:
Zk=Sk/√(vark).
To determine whether a segment is a statistically significant segment, the z-score of the segment is compared to a threshold value. If the z-score of the segment is greater than the threshold value, the segment is deemed a statistically significant segment. Conversely, if the z-score of the segment is less than the threshold value, the segment is not deemed a statistically significant segment. In one embodiment, a segment is determined to be statistically significant if the z-score of the segment is greater than 2. In other embodiments, a segment is determined to be statistically significant if the z-score of the segment is greater than 2.5, 3, 3.5, or 4. In some embodiments, a segment is determined to be statistically significant if the z-score of the segment is less than −2. In other embodiments, a segment is determined to be statistically significant if the z-score of the segment is less than −2.5, −3, −3.5, or −4. The statistically significant segments can be indicative of one or more copy number events that are present in a sample (e.g., cfDNA or gDNA sample).
Returning to
Generally, if statistically significant bins and statistically significant segments of the cfDNA sample are aligned with corresponding bins and segments of the gDNA sample that are also statistically significant, this indicates that the same copy number event is present in both the cfDNA sample and the gDNA sample. Therefore, the source of the copy number event is likely to be due to a non-tumor event (e.g., either a germline or somatic non-tumor event) and the copy number event is likely a copy number variation.
Conversely, if statistically significant bins and statistically significant segments of the cfDNA sample are aligned with corresponding bins and segments of the gDNA sample that are not statistically significant, this indicates that the copy number event is present in the cfDNA sample but is absent from the gDNA sample. In this scenario, the source of the copy number event in the cfDNA sample is due to a somatic tumor event and the copy number event is a copy number aberration.
Identifying the source of a copy number event that is detected in the cfDNA sample is beneficial in filtering out copy number events that are due to a germline or somatic non-tumor event. This improves the ability to correctly identify copy number aberrations that are due to the presence of a solid tumor.
Here, the presence of the identified copy number aberrations across the genome for a cfDNA sample can serve as a whole genome feature. In various embodiments, the locations of the copy number aberrations are further considered. For example, a whole genome feature can be the number of copy number aberrations in a particular region of a chromosome.
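The following sketch illustrates the alignment logic described above for labeling copy number events; the dictionary-based representation of bins and segments and the region identifiers are hypothetical conveniences for the example.

```python
def classify_copy_number_events(cfdna_significant, gdna_significant):
    """Label each statistically significant cfDNA region by comparing it with
    the corresponding gDNA region.

    Both arguments map a region identifier (bin or segment) to a boolean
    significance flag for that sample.
    """
    labels = {}
    for region, cf_significant in cfdna_significant.items():
        if not cf_significant:
            continue  # no copy number event detected in the cfDNA sample here
        if gdna_significant.get(region, False):
            # Aligned: present in both samples -> germline or somatic non-tumor.
            labels[region] = "copy number variation (non-tumor source)"
        else:
            # Unaligned: present only in cfDNA -> likely somatic tumor source.
            labels[region] = "copy number aberration (tumor source)"
    return labels

# Hypothetical flags loosely mirroring the example regions discussed below.
print(classify_copy_number_events(
    {"segment_A": True, "bin_B": True},
    {"segment_A": False, "bin_B": True}))
```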
3.2.1. Example 1: Copy Number Aberrations Originate from Somatic Tumor Source in a Cancer Sample
Referring specifically to the data shown in
Unaligned indicators (e.g., marked as “+” in
Aligned bin indicators (e.g., marked as “x” in
Aligned segment indicators (e.g., marked as “V” in
Referring to
Referring to
Here, the statistically significant segments (e.g., segments 520A and 522A) in the cfDNA sample are unaligned with the corresponding segments (e.g., segments 520B and 522B) in the gDNA sample. Specifically, statistically significant segment 520A of the cfDNA sample is unaligned with segment 520B of the gDNA sample. Additionally, segment 522A of the cfDNA sample is unaligned with segment 522B of the gDNA sample. This indicates that the copy number events represented by each of the statistically significant segments 520A and 522A are likely due to a somatic tumor event.
Additionally, bin 526A of the cfDNA sample aligns with bin 526B of the gDNA sample. Thus, the copy number event represented by bin 526A of the cfDNA sample is likely due to either a germline or somatic non-tumor event.
3.2.2. Example 2: Potential Copy Number Aberration Originates from Somatic Tumor Source in a Non-Cancer Sample
As shown in
Bin 530A of the cfDNA sample aligns with bin 530B of the gDNA sample. Thus, the copy number event represented by bin 530A of the cfDNA sample is likely due to either a germline or somatic non-tumor event. The statistically significant segment 532A in the cfDNA sample is unaligned with the corresponding segment 532B in the gDNA sample. This indicates that the copy number event represented by the statistically significant segment 532A is possibly due to a somatic tumor event. This demonstrates that a healthy individual (e.g., not diagnosed for cancer) can potentially be screened for early detection of cancer by identifying possible copy number aberrations using cfDNA and gDNA samples obtained from the individual.
3.2.3. Example 3: Copy Number Variations Originate from a Germline or Somatic Non-tumor Source in a Non-Cancer Sample
As shown in
Here, the statistically significant segment 534A in the cfDNA sample aligns with the corresponding statistically significant segment 534B in the gDNA sample. This indicates that the copy number event represented by the statistically significant segment 534A is likely due to either a germline or somatic non-tumor event. Additionally, bin 536A of the cfDNA sample aligns with bin 536B of the gDNA sample. Thus, the copy number event represented by bin 536A of the cfDNA sample is also likely due to either a germline or somatic non-tumor event.
3.3. Dimensionality Reduction Workflow
Returning again to
The dimensionality reduction workflow begins at step 505 where sequence reads derived from a cfDNA sample are obtained. The sequence reads derived from cfDNA 115 can be received from a whole genome sequencing assay 132.
At step 508A, genomic regions across the genome that exhibit low variability in sequence read counts are identified through the use of one or more criteria. In various embodiments, genomic regions can refer to bins of the genome that were discussed above in relation to the copy number aberration detection workflow.
The criteria are formulated to improve the quality of high dimensionality data by identifying and eliminating systematic errors or other types of non-disease related noise introduced during data collection. In other words, by applying the criteria, bins that are typically noisy and thus unsuitable for subsequent analysis can be eliminated. In some embodiments, sequence reads subject to the analysis at step 508A have been pre-processed to correct biases or errors using one or more methods such as normalization, correction of GC biases, and correction of biases due to PCR over-amplification.
In some embodiments, one or more criteria are established to exclude nucleic acid sequencing data that likely contain systematic errors or other types of non-disease related noise introduced during data collection. As disclosed herein, sequence data can include sequence reads of any biological samples, including but not limited to cell-free nucleic acid samples.
In some embodiments, data from only healthy subjects are used to establish the one or more criteria to avoid interference from data associated with one or more disease conditions. In some embodiments, a criterion as disclosed herein can be established with respect to genomic or chromosomal regions. For example, nucleic acid sequence reads can be aligned to regions of a reference genome, and one or more characteristics of the sequence reads can be used to determine whether data associated with a particular genomic region is associated with baseline noise that confounds the information from the genomic region. If so, the particular genomic region can be excluded from subsequent analyses. Exemplary characteristics include, but are not limited to, the number of reads and the mappability of the reads.
In some embodiments, the genomic regions have the same size. In some embodiments, the genomic regions can have different sizes. In some embodiments, a genomic region can be defined by the number of nucleic acid residues within the region. In some embodiments, a genomic region can be defined by its location and the number of nucleic acids residues within the region. Any suitable size can be used to define genomic regions. For example, a genomic region can include 10 kb or fewer, 20 kb or fewer, 30 kb or fewer, 40 kb or fewer, 50 kb or fewer, 60 kb or fewer, 70 kb or fewer, 80 kb or fewer, 90 kb or fewer, 100 kb or fewer, 110 kb or fewer, 120 kb or fewer, 130 kb or fewer, 140 kb or fewer, 150 kb or fewer, 160 kb or fewer, 170 kb or fewer, 180 kb or fewer, 190 kb or fewer, 200 kb or fewer, or 250 kb or fewer.
Regions that are possibly associated with systematic errors are identified. In some embodiments, one or more criteria can be defined to reduce or eliminate data corresponding to these noisier regions. For example, a high variability filter can be created to allow one to discard data corresponding to all regions with data variations above a threshold value. In other embodiments, a low variability filter can be created to focus subsequent analysis on data with data variations below a threshold.
As an illustration, a human haploid reference genome includes over three billion bases that can be divided into about 30,000 regions (or bins). If an experimental value is observed for each bin, for example, a total number of sequence reads that align to the particular region or bin, each subject can correspond to over 30,000 measurements. After a low or high variability filter is applied, the number of measurements corresponding to a subject can be reduced by a significant portion; for example, including but not limited to about 50% or less, about 45% or less, about 40% or less, about 35% or less, about 30% or less, about 25% or less, 20% or less, 15% or less, 10% or less, or 5% or less. In some embodiments, the number of measurements corresponding to a subject can be reduced by 50% or more, such as about 55%, 60%, 65%, or 70% or more. For example, a subject, which originally has over 30,000 corresponding measurements, can have over 30% fewer measurements (e.g., about 20,000) after a high or low variability filter is applied.
At step 508B, the one or more criteria established from the previous step can be applied to a biological dataset of a training group (also referred to as “training data”). As disclosed herein, the training group includes both healthy subjects and subjects known to have one or more medical conditions (also referred to as “diseased subjects”). For example, for sequencing data, the one or more criteria previously determined in step 508A (e.g., a low or high variability filter) is applied to data of the training group to completely remove the portion of the data that are associated with the chromosomal regions defined in the filter. In some embodiments, the presumably noisy data are only partially removed. In some embodiments, the presumably noisy data that are not removed can be assigned a weighting factor to reduce their significance in the overall dataset.
Once data selection is performed for the biological dataset for the training group, the remaining training data, also referred to as the “selected training data” or “filtered training data,” are subject to further analysis to extract features that reflect differences between healthy subjects and subjects known to have one or more medical conditions. As noted previously, the original training data include data from both healthy subjects and diseased subjects. The filtered training data constitute a part of the original training data and thus also include data from both healthy subjects and subjects known to have a medical condition. It is assumed that the largest variations among the filtered training data come from differences between data from the healthy subjects and data from the diseased subjects. In essence, it is assumed that data associated with a healthy subject should be more similar to data of another healthy subject than to data from any diseased subject, and vice versa.
Like the original training data, the filtered training data are also of high dimensionality. In some embodiments, the filtered training data are subject to further analysis to reduce data dimensionality, and differences between the healthy and diseased subjects are defined based on the reduced dimensionalities. For a given subject, about 20,000 filtered measurements can be further reduced to a handful of data points. For example, the about 20,000 filtered measurements can be transformed based on a few extracted features (e.g., a number of principal components) to render a number of data points. In some embodiments, after reduction of dimensionality, there are 5 or fewer features; 6 or fewer features; 7 or fewer features; 8 or fewer features; 9 or fewer features; 10 or fewer features; 12 or fewer features; 15 or fewer features; or 20 or fewer features. In some embodiments, the filtered measurements can have more than 20 features. The filtered measurements can then be transformed based on the selected features. For example, a sample having 20,000 filtered measurements can be transformed and reduced to five or fewer data points. In some embodiments, a sample having 20,000 filtered measurements can be transformed and reduced to more than five data points, such as 10, 15, or 20.
As disclosed herein, the transformed data points from all subjects in the filtered training dataset are subject to further analysis to extract relations or patterns that reflect the differences between the sub-groups in the filtered training dataset. In some embodiments, further analysis includes a binomial logistic regression process; for example, for determining the likelihood of a subject having cancer versus not having cancer. In some embodiments, further analysis includes a multinomial logistic regression process; for example, for determining the type of cancer in addition to the likelihood of a subject having cancer.
At step 508C, optionally, scores are calculated for each subject. In some embodiments, a score is a cancer prediction (e.g., cancer prediction 190B shown in
Referring now to
During the data selection portion, data of high dimensionality are initially processed to improve quality. In some embodiments, the number of sequence reads that align to a particular region of a reference genome is normalized. For example, healthy subject data 605A can include sequence reads from a group of healthy subjects (also referred to as baseline subjects) and data from the baseline subjects can be used to establish the normalization standards. In some embodiments, sequence reads from the baseline subjects are aligned to a reference genome that is already divided into a plurality of regions. Assuming that there are no significant biases during the sequencing process, different regions in the genomes should be covered at roughly the same level. Consequently, the number of sequence reads that align to a particular region should be approximately the same as the number of sequence reads that align to another region of the same size.
In one example, the number of sequence reads from a baseline subject across different genomic regions can be written as Readij, where integer i denotes a subject and is 1 through n while integer j denotes a genomic region and has a value of 1 through m. As disclosed, a reference genome can be divided into any number of genomic regions, or genomic regions of any sizes. A reference genome can be divided into up to 1,000 regions, 2,000 regions, 4,000 regions, 6,000 regions, 8,000 regions, 10,000 regions, 12,000 regions, 14,000 regions, 16,000 regions, 18,000 regions, 20,000 regions, 22,000 regions, 24,000 regions, 26,000 regions, 28,000 regions, 30,000 regions, 32,000 regions, 34,000 regions, 36,000 regions, 38,000 regions, 40,000 regions, 42,000 regions, 44,000 regions, 46,000 regions, 48,000 regions, 50,000 regions, 55,000 regions, 60,000 regions, 65,000 regions, 70,000 regions, 80,000 regions, 90,000 regions, or up to 100,000 regions. As such, m can be an integer corresponding to the number of genomic regions. In some embodiments, m can be an integer larger than 100,000.
In some embodiments, sequence reads of a subject can be normalized to the average read count across all chromosomal regions for the subject. When i remains constant, sequence reads from genomic regions 1 through m and the corresponding sizes of the regions can be used to compute an average sequence read density for subject i, for example, based on the equation:
AvgDensityi=Σj=1 to m(Readij/SizeRegionij)/m,
where SizeRegionij represents the size of the particular chromosomal region (e.g., in bases or kilobases) to which the sequence reads (Readij) are aligned. Here, Readij/SizeRegionij is a sequence read density value. As such, for a subject i, the expected number of sequence reads that would align to a given chromosomal region j having a size of SizeRegionij can be calculated as:
ExpectedReadij=AvgDensityi*SizeRegionij.
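A minimal sketch of this per-subject normalization, assuming the density-based formulation reconstructed above, is shown below; the function and variable names are illustrative and not part of the disclosure.

```python
import numpy as np

def expected_region_reads(read_counts, region_sizes):
    """Expected read count per region for one subject, assuming roughly
    uniform coverage: average read density times each region's size."""
    reads = np.asarray(read_counts, dtype=float)
    sizes = np.asarray(region_sizes, dtype=float)
    avg_density = np.mean(reads / sizes)  # average reads per base over regions
    return avg_density * sizes            # expected reads for each region

# Hypothetical counts over three 50 kb regions and one 100 kb region.
print(expected_region_reads([500, 480, 530, 1000], [50_000, 50_000, 50_000, 100_000]))
```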
As disclosed herein, data for any subject across different genomic regions can be used as a control to normalize the sequence reads of a genomic region. Here, an average read, which is used as the basis for data normalization, can be computed for a healthy control subject, a group of control subjects, or a test subject itself.
In some embodiments, sequence reads of a subject can be normalized against an overall average count from a group of subjects (e.g., a group of n healthy subjects).
In some embodiments, sequence reads for a subject corresponding to a particular region can be normalized using multiple approaches, utilizing both data from different regions for the subject itself and cross different control subjects.
In one aspect, disclosed herein are methods for establishing a template for selecting data for further analysis, based on patterns gleaned from data from healthy subjects (e.g., baseline healthy subjects 605A and reference healthy subjects 605B). In preferred embodiments, reference healthy subjects 605B have no or only minimal overlap with baseline healthy subjects 605A. Sequencing data of cell-free nucleic acids are used to illustrate the concepts. However, one of skill in the art would understand that the current method can be applied to sequencing data of other materials or non-sequencing data as well.
In some embodiments, the number of healthy subjects in a baseline or reference healthy subject group can be varied. In some embodiments, the selection criteria for healthy subjects in the baseline and reference healthy subject groups are the same. In some embodiments, the selection criteria for healthy subjects in the baseline and reference healthy subject groups are different.
In some embodiments, a high or low variability filter is established using data from healthy reference subjects 605B. As disclosed herein, the data from healthy reference subjects 605B can be pre-processed (e.g., undergoing various normalization steps); for example, based on baseline control data from healthy subjects 605A. For example, training data from both healthy and cancer subjects can be pre-processed. In some embodiments, raw sequence read data can be directly used to set up a high or low variability filter.
In some embodiments, sequence reads of each healthy subject (e.g., from healthy subject data 605A) can be aligned to a plurality of chromosomal regions of a reference genome. The variability of each genomic region can be evaluated; for example, by comparing numbers of sequence reads for a particular genomic region across all healthy subjects in the control group. As an illustration, healthy subjects who are not expected to have cancers can be included as reference controls. The healthy subjects include but are not limited to subjects who do not have family histories of cancer or who are healthy and young (e.g., under 30 or 35 years old). In some embodiments, healthy subjects in the reference control group may satisfy other conditions; for example, only healthy women will be included in a control group for breast cancer analysis. Only men will be included in a control group for prostate cancer analysis. In some embodiments, for diseases that are found predominantly or only in a particular ethnic group, only people from the same ethnic group are used to establish the reference control group.
For example, for a group of control healthy subjects (n), if we count the number of sequence reads that align to a genomic region, there will be n values for each genomic region. Parameters, such as a mean or median count, standard deviation (SD), median absolute deviation (MAD), or the interquartile range (IQR), can be computed based on the n count values and used to determine whether a genomic region is considered of low or high variability. Any method for computing these parameters can be used.
For example, the sequence read numbers for region j in subjects 1 through n can be represented as Readij, where j is an integer and i is an integer between 1 and n. An average read count of region j across the n subjects, together with a variability measure such as the SD, MAD, or IQR of the n count values, can then be computed and compared against a threshold to determine whether region j is a low or high variability region.
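As one possible illustration of building such a filter, the sketch below computes a per-region MAD across healthy reference subjects and keeps the regions whose variability falls below a chosen quantile; the MAD statistic and the quantile threshold are assumptions for the example (SD or IQR could equally be used, as noted above).

```python
import numpy as np

def low_variability_mask(read_matrix, quantile=0.75):
    """Build a low variability filter from healthy reference subjects.

    read_matrix: (n_subjects, n_regions) array of normalized region counts.
    Returns a boolean mask keeping regions whose across-subject MAD falls
    below the chosen quantile of all region MADs.
    """
    counts = np.asarray(read_matrix, dtype=float)
    mad = np.median(np.abs(counts - np.median(counts, axis=0)), axis=0)
    return mad <= np.quantile(mad, quantile)

# Toy data for three subjects and three regions; the last region is noisy.
print(low_variability_mask([[10, 11, 30], [11, 10, 5], [10, 12, 55]]))
```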
In some embodiments, high or low variability filters can be created for only a portion of the genome; for example, for only a particular chromosome or a portion thereof.
In some embodiments, training data 605C includes biological data (e.g., sequencing data) from both healthy subjects and subjects known to have a medical condition (also known as diseased subjects). In some embodiments, data associated with healthy subjects who have previously been included in the baseline control group or reference control group will be excluded from training data 605C to possibly avoid certain biases.
In some embodiments, normalization parameters obtained using healthy subject data 605A and the low or high variability filter 605D can be applied to training data 605C to render new and filtered training data 610A for subsequent analysis.
In some embodiments, filtered training data 610A comprise balanced data for healthy and diseased subjects; for example, the numbers of healthy and diseased subjects are within about 5 to 10% of each other. In some embodiments, filtered training data 610A comprise unbalanced data for healthy and diseased subjects; for example, the numbers of healthy and diseased subjects differ more than 10% from each other. In the latter situation, methods can be applied to reduce the impact of unbalanced data.
In some embodiments, filtered training data 610A are subject to further analysis to create prediction model 610B. Prediction model 610B is used to predict whether a subject has a certain medical condition.
In some embodiments, prediction model 610B reflects differences between healthy and diseased subjects. In some embodiments, the differences used in prediction model 610B can be obtained by applying, for example, logistic regression to filtered training data 610A. In some embodiments, filtered training data 610A (e.g., numbers of sequence reads that align to certain regions of a reference genome) can be directly used in logistic regression analysis. In some embodiments, filtered training data 610A undergoes a dimensionality reduction to reduce and possibly transform the dataset to a much smaller size. For example, Principal Component Analysis (PCA) can be used to reduce the size of a data set by about 100,000-fold or less, about 90,000-fold or less, about 80,000-fold or less, about 70,000-fold or less, about 60,000-fold or less, about 50,000-fold or less, about 40,000-fold or less, about 30,000-fold or less, about 20,000-fold or less, about 10,000-fold or less, about 9,000-fold or less, about 8,000-fold or less, about 7,000-fold or less, about 6,000-fold or less, about 5,000-fold or less, about 4,000-fold or less, about 3,000-fold or less, about 2,000-fold or less, about 1,000-fold or less, or about 500-fold or less. In some embodiments, the size of a data set can be reduced by more than 100,000-fold. In some embodiments, the size of a data set can be reduced by a few hundred-fold or less. As disclosed herein, although the size of a data set is reduced, the number of samples can be retained. For example, after PCA, a data set of 1,000 samples can still retain 1,000 samples but the complexity of each sample is reduced (e.g., from corresponding to 25,000 features to 5 or fewer features). As such, the methods disclosed herein can improve efficiency and accuracy of data processing while greatly reducing the required computer storage space.
Once a prediction model is established, it can be applied to test data 615A. Test data 615A can be taken from a test subject whose status is unknown with respect to a medical condition. In some embodiments, data from test subjects of known statuses can also be used for validation purposes. In some embodiments, test data 615A will be pre-processed, for example by normalization, GC content correction, and the like. In some embodiments, a high or low variability filter 605D is applied to test data 615A to remove data in chromosomal regions that likely correspond to systematic errors. In some embodiments, both pre-processing and a high or low variability filter can be applied to test data 615A to render filtered test data for further processing.
In some embodiments, when prediction model 610B is applied to the filtered test data, a classification score can be computed as a probability score to represent the likelihood for the particular medical condition to be present in the test subject being analyzed. In such embodiments, the prediction model 610B shown in
At step 635B, the filtered training data are separated using cross-validation methods. In some embodiments, the cross validation methods include but are not limited to exhaustive methods like leave-p-out cross-validation (LpO CV), where p can have any value that would create a valid partition, or leave-one-out cross validation (LOOCV), where p=1. In some embodiments, the cross validation methods include but are not limited to non-exhaustive methods such as the holdout method, repeated random sub-sampling validation method, or a stratified or non-stratified k-fold cross-validation method where k can have any value that would create a valid partitioning. As disclosed herein, a cross validation procedure partitions the filtered training data into different pairs of a training subset and a validation subset at a predetermined percentage split. For example, the first training subset and first validation subset depicted at step 635B represent an 80:20 split during one fold of a k-fold cross-validation experiment. In another fold of the same k-fold cross-validation experiment, the filtered training data will be split into a different pair of training and validation subsets at the same percentage ratio. In some embodiments, multiple cross-validation experiments are applied, where the split ratio of a pair of training and validation subsets can be varied in each experiment. As disclosed herein, the subsets can be created randomly. In some embodiments, the subsets are created such that each subset includes data from both healthy and diseased subjects. In some embodiments, only one of the subsets includes data from both healthy and diseased subjects. In that case, it is the training subset that must include both healthy and diseased subjects.
In some embodiments, a training subset constitutes a majority of the filtered training data, for example, up to 60%, up to 65%, up to 70%, up to 75%, up to 80%, up to 85%, up to 90%, or up to 95% of the filtered training data. In some embodiments, more than 95% of a very large set of filtered training data can be used as the training subset. To avoid training biases, it is usually good practice to save at least 5% of untouched data as a test subset; this subset will never be used as training data and will only be used to validate the resulting model.
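A minimal sketch of this partitioning step using a stratified k-fold split is shown below; the use of scikit-learn, the 5-fold setting, and the synthetic data are assumptions for the example, not requirements of the method.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical filtered training data: rows are subjects, columns are
# low variability regions; y marks healthy (0) versus diseased (1) subjects.
rng = np.random.default_rng(0)
X = rng.random((100, 2_000))
y = np.array([0] * 60 + [1] * 40)

# A 5-fold stratified split keeps the healthy/diseased proportions in every
# fold, giving an 80:20 training/validation split per fold as described above.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: {len(train_idx)} training, {len(val_idx)} validation subjects")
```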
At step 635C, data from the first training subset (e.g., counts of sequence reads or quantities derived therefrom) can be used to derive dimensionality reduction components that capture one or more differences between the data of healthy and diseased subjects. For example, samples that have been identified to have about 10,000 to about 20,000 low variability regions can have 10,000 to 20,000 corresponding count values (or derived quantities such as a relative count value or a logCount value). By using a dimensionality reduction method such as principal component analysis (PCA), it is possible to identify and select dimensionality reduction components, examples of which are principal components (PCs), that represent the largest variations among data in the first training subset. These dimensionality reduction components can be used to reduce the 10,000 to 20,000 values to a lower dimensionality space where each value in the reduced dimensionality space corresponds to one of the selected PCs. As an example, the higher dimensionality values may be N bin scores across N bins of the genome. Therefore, the dimensionality reduction process reduces the N bin scores down to a reduced number of values that each corresponds to a dimensionality reduction component. In some embodiments, 5 or fewer dimensionality reduction components are selected. In some embodiments, 10 or fewer dimensionality reduction components are selected. In some embodiments, 15 or fewer dimensionality reduction components are selected. In some embodiments, 20 or fewer dimensionality reduction components are selected. In some embodiments, 25 or fewer dimensionality reduction components are selected. In some embodiments, 30 or fewer dimensionality reduction components are selected. In some embodiments, 35 or fewer dimensionality reduction components are selected. In some embodiments, 40 or fewer dimensionality reduction components are selected. In some embodiments, 45 or fewer dimensionality reduction components are selected. In some embodiments, 50 or fewer dimensionality reduction components are selected. In some embodiments, 60 or fewer dimensionality reduction components are selected. In some embodiments, 70 or fewer dimensionality reduction components are selected. In some embodiments, 80 or fewer dimensionality reduction components are selected. In some embodiments, 90 or fewer dimensionality reduction components are selected. In some embodiments, 100 or fewer dimensionality reduction components are selected. In some embodiments, more than 100 dimensionality reduction components are selected.
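For illustration, the sketch below reduces a hypothetical filtered training subset to five principal components; the use of scikit-learn's PCA, the data shapes, and the synthetic values are assumptions, and pca.components_ plays the role of the per-region component weights discussed in the following steps.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical filtered counts for a training subset: 80 subjects by
# 2,000 low variability regions, reduced to 5 principal components.
rng = np.random.default_rng(0)
X_train = rng.random((80, 2_000))

pca = PCA(n_components=5)
X_train_pc = pca.fit_transform(X_train)  # shape (80, 5): five values per subject

# pca.components_ (shape (5, 2000)) holds per-region loadings, which act as
# the component weights applied again during deployment.
print(X_train_pc.shape, pca.components_.shape)
```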
These dimensionality reduction components identified through the dimensionality reduction method (e.g., through PCA) can be used during deployment (e.g., in the steps shown in
At step 635D, one or more component weights can be derived that reflect the relative contribution of each of the dimensionality reduction components. For example, for each dimensionality reduction component, a component weight is assigned to each value from each low variability region to reflect the respective importance of the data. A region contributing more to the observed differences will be assigned a larger component weight, and vice versa. In some embodiments, the component weight is specific for a component and is also region-specific, but the same for all subjects; for example, wjk can represent the component weight associated with region j in connection with PC k, where j is an integer from 1 to m′, and m′ is an integer smaller than m, the original number of genomic regions designated in the reference genome. This component weight value is the same for different subjects. In some embodiments, more individualized component weight values can be formulated to reflect differences between the subjects. For example, the component weight may differ from one cancer type to another cancer type. For the same region and same PC, the component weight for people of different ethnic origins can differ. These component weights can be used during deployment (e.g., in the steps shown in
At step 635E, data in the first training subset are transformed based on the extracted dimensionality reduction components (e.g., a few selected PCs). In some embodiments, dimensionality of the transformed data is much smaller than those of the filtered training data, whose dimensionality is already reduced from the original un-filtered data. The concepts are illustrated as follows.
Subject1(Read)=[Read11,Read12,Read13, . . . , Read1m]
illustrates the data of subject 1 before a low variability filter is applied, where m is the total number of regions.
Subject1(FRead)=[FRead11,FRead12,FRead13, . . . , FRead1m′]
illustrates the data of subject 1 after a low variability filter is applied, where m′ is the total number of remaining regions. After a low variability filter is applied, the total number of genomic regions is reduced to m′, which can be significantly smaller than m. For example, un-filtered data for a subject can include 30,000 components or more, each associated with a genomic region. After a low variability filter is applied, a significant portion of the genomic regions can be excluded as having high variability; for example, filtered data for the same subject can include 20,000 components or fewer, each associated with a low variability genomic region.
At step 635E, data dimensionality of the filtered data can be further reduced based on the number of extracted dimensionality reduction components. For example, if k principal components are selected, the dimensionality of the filtered data can be reduced to k. The number of selected PCs can be much smaller than the dimensionality of the filtered data. For example, when only 5 PCs are selected, the data dimensionality of filtered read data (FRead) for subject 1 can be further reduced to 5, such as the expression below:
Subject1(FReadPC)=[FReadPC1,FReadPC2,FReadPC3,FReadPC4,FReadPC5]
As such, quantity data (e.g., read numbers) associated with a large number of low variability regions can be reduced and transformed to a handful of numeric values. In some embodiments, a component weight (e.g., wPCj) can be assigned to each PC. In some embodiments, a single value can be computed based on the values associated with multiple PCs.
At step 635F, a classification method is applied to the transformed data of each subject to provide a classification score. In some embodiments, the classification score can be a binomial or multinomial probability score. For example, in a binomial classification for cancer, logistic regression can be applied to compute a probability score, where 0 represents no likelihood of cancer while 1 represents the highest certainty of having cancer. A score of over 0.5 indicates that the subject is more likely to have cancer than not having cancer. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest. Using the same example to illustrate probability determination by logistic regression, the probability (p) of a subject having cancer can be written as follows:
logit(p)=b0+b1×FReadPC1+b2×FReadPC2+b3×FReadPC3+b4×FReadPC4+b5×FReadPC5
where each transformed and reduced data value derived from a principal component (FReadPC1 through FReadPC5) is assigned a coefficient b1 through b5, and b0 is an intercept term. The logit transformation is defined as the logged odds:
logit(p)=log(p/(1−p)),
where p is the probability of the subject having cancer.
The value of p can be computed from the logit(p) value (e.g., p=exp(logit(p))/(1+exp(logit(p)))). In some embodiments, it is possible to look up values in a logit table.
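The sketch below evaluates the logistic model above for one subject, computing logit(p) from hypothetical coefficients and converting it to a probability; the coefficient and PC values are purely illustrative.

```python
import numpy as np

def cancer_probability(pc_values, coefficients, intercept):
    """Binomial logistic regression on the reduced representation:
    logit(p) = b0 + b1*FReadPC1 + ... + b5*FReadPC5 and
    p = 1 / (1 + exp(-logit(p)))."""
    logit_p = intercept + float(np.dot(coefficients, pc_values))
    return 1.0 / (1.0 + np.exp(-logit_p))

# Hypothetical coefficients b1..b5, intercept b0, and PC values for a subject.
p = cancer_probability(
    pc_values=[0.8, -1.2, 0.3, 0.0, 2.1],
    coefficients=[0.5, -0.2, 0.1, 0.4, 0.3],
    intercept=-0.7,
)
print(p)  # about 0.65 here; a value over 0.5 favors the cancer class
```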
In some embodiments, a multinomial classification approach can be taken to classify subjects into different cancer types. For example, existing multinomial classification techniques can be categorized into (i) transformation to binary, (ii) extension from binary, and (iii) hierarchical classification. In a transformation to binary approach, a multi-class problem can be transformed into multiple binary problems based on a one-vs-rest or one-vs-one approach. Exemplary extension-from-binary algorithms include but are not limited to neural networks, decision trees, k-nearest neighbors, naive Bayes, support vector machines, and Extreme Learning Machines. Hierarchical classification tackles the multinomial classification problem by dividing the output space into a tree. Each parent node is divided into multiple child nodes and the process is continued until each child node represents only one class. Several methods have been proposed based on hierarchical classification. In some embodiments, multinomial logistic regression can be applied. It is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).
At step 635G, the filtered training data are partitioned into a second training subset and a second test/validation subset and the steps of 635B through 635F are repeated in one or more refinement cycles (also referred to as “one or more cross-validation cycles”). As disclosed herein, in the cross-validation procedure the validation subsets themselves have little (e.g., in repeated random sampling) or no overlap at all (LOOCV, LpO CV, k-fold) over different folds.
During a refinement cycle, a predetermined condition (e.g., a cost function) can be applied to optimize the classification results. In some embodiments, one or more parameters in a classification function are refined using training data subset and validated by validation or held out subset during each fold of the cross-validation procedure. In some embodiments, PC-specific weights and/or region-specific weights can be refined to optimize classification results.
In some embodiments, a small portion of the filtered training data can be kept aside, not as a part of a training subset during any fold of the cross-validation procedure to better estimate overfitting.
At step 635H, the refined parameters are used to compute classification scores. As disclosed herein, the refined parameters can function as a prediction model for cancer as well as cancer types. It is possible to construct a prediction model using multiple types of biological data, including but not limited to nucleic acid sequencing data (cell-free versus non-cell-free, whole genome sequencing data, whole genome methylation sequencing data, RNA sequencing data, targeted panel sequencing data), protein sequencing data, tissue pathology data, family history data, and epidemiology data.
In one aspect, disclosed herein are methods for classifying a subject as having a certain medical condition, based on the dimensionality reduction components and the component weights (or refined component weights) established using training data.
At step 645A, test data is received from a test sample from the subject whose status is unknown. In some embodiments, the test data are of the same type as those from the baseline healthy subjects. In some embodiments, the test data are of the same type as those from the reference healthy subjects. Sample data types include but are not limited to sequencing data for detecting targeted mutations, whole genome sequencing data, RNA sequencing data, and whole genome sequencing data for detecting methylation. In some embodiments, the test data can be calibrated and adjusted for improved quality (e.g., via normalization, GC content correction, and the like).
At step 645B, high dimensionality values of the test data are reduced to a lower dimensional space using component weights that are determined during training. For example, the high dimensionality values may be read data, such as a bin score (e.g., the normalized number of sequence reads that are categorized in bins across the genome). First, a data selection can be performed using a previously defined low variability filter. Advantageously, the filter-based approach is straight-forward and can be easily adjusted by changing the threshold value of the reference quantities computed for genomic regions in a reference genome. Therefore, high dimensionality values that are highly variable (e.g., due to presence of noise in certain bins) can be filtered and removed. Next, the high dimensionality values that remain following the data selection step can be reduced in dimensionality using a dimensionality reduction process such as PCA. The number of dimensionality reduction components resulting from the dimensionality reduction process can be based on the number of dimensionality reduction components that were determined during training. For example, if the dimensionality reduction process generated 5 principal components, then the dimensionality reduction process for values of the test sample also generates 5 principal components. Each dimensionality reduction component is weighted by a corresponding component weight that was determined during training. Therefore the lower dimensionality data can be calculated.
As an example, with 5 dimensionality reduction components, 5 lower dimensionality data can be generated. Specifically, the data dimensionality of filtered read data (FRead) can be reduced to 5 lower dimensionality data (FReadPC1, FReadPC2, FReadPC3, FReadPC4, FReadPC5), which can be expressed as:
Subject(FReadPC)=[wPC1×FReadPC1,wPC2×FReadPC2,wPC3×FReadPC3,wPC4×FReadPC4,wPC5×FReadPC5],
where wPCj represents the component weight assigned to each dimensionality reduction component. Here, the reduced dimensionality data represented as FReadPC can serve as whole genome features, as described above.
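A minimal sketch of this projection, assuming the weighted-combination form given above, is shown below; the mask, weight matrix, and bin scores are toy values introduced only for the illustration.

```python
import numpy as np

def reduce_test_sample(bin_scores, low_var_mask, component_weights):
    """Project one test sample into the reduced space learned during training.

    bin_scores: per-bin values for the test sample (high dimensionality).
    low_var_mask: boolean filter selecting the low variability regions.
    component_weights: (k, n_selected_regions) array; row j holds the weights
    for dimensionality reduction component j.
    """
    filtered = np.asarray(bin_scores, dtype=float)[np.asarray(low_var_mask)]
    return np.asarray(component_weights, dtype=float) @ filtered  # k values

# Toy example: six bins filtered down to four regions, then two components.
scores = [0.1, -0.2, 0.4, 0.0, 0.3, -0.1]
mask = [True, True, False, True, True, False]
weights = [[0.5, 0.5, 0.5, 0.5], [0.7, -0.7, 0.1, -0.1]]
print(reduce_test_sample(scores, mask, weights))
```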
At step 645C, a classification score can be computed for the test subject using the reduced dimensionality data. In various embodiments, the classification score can be a cancer prediction outputted by a prediction model, such as predictive cancer model 170C shown in FIG. 1C. In other words, the prediction model can serve as a single-assay prediction model that outputs a cancer prediction 190B based only on whole genome features 152.
In some embodiments, a certain level of data selection can occur during initial data processing: e.g., during normalization, GC content correction, and other initial data calibration steps, it is possible to reject sequence reads that are clearly defective and thus reduce the amount of data. As illustrated in
As disclosed herein, a low variability filter specifies a subset of the genomic regions that will be selected for further processing. For example, the regions indicated with a hatched shading (e.g., regions indicated as 1, 2, 3, 4, 5, 6 . . . m′−1, m′) are selected regions. The filter allows categorical selection or rejection of data based on established analysis of possible systematic errors using reference healthy subjects.
The selected data are then transformed to further reduce data dimensionality to reduced dimensionality data (e.g., element 660). In some embodiments, transformed data 660 can be generated using data from all the selected genomic regions, but the dimensionality of data can be greatly reduced. For example, data from over 20,000 different genomic regions can be transformed into a handful of values. In some embodiments, a single value can be generated. Here, the reduced dimensionality data in the form of a handful of values or a single value are the reduced dimensionality features described above that can serve as whole genome features.
In some embodiments, selected sequencing data from element 655 can be sorted in subgroups according to the fragment size represented by the sequencing data. For example, instead of a single count for all sequence reads bound to a particular region, multiple quantities can be derived, each corresponding to a size or size range. For example, sequence reads corresponding to fragments of 140-150 bases will be separately grouped from sequence reads corresponding to fragments of 150-160 bases, as illustrated in element 670. As such, additional detail and fine-tuning can be incorporated before the data are used for classification.
As illustrated in
The current method and system offer advantages over previously known methods. For example, classification is done using quantities that can be easily derived from raw sequencing data. The current method does not require building chromosome-specific segmentation maps, thus eliminating the time-consuming process of generating those maps. Also, the current method permits more efficient use of computer storage space because it no longer requires storage for the large segmentation maps.
3.3.1. Example 1: Comparison of Classification Score and Z-Score
4.1. Methylation Features
Referring briefly again to
The quantity of hypomethylated counts and quantity of hypermethylated counts refer to the total number of cfDNA fragments that are hypomethylated and hypermethylated, respectively. Hypomethylated and hypermethylated cfDNA fragments are identified using methylation state vectors, as is described below in relation to process 700 in
The hypomethylation score per CpG site and hypermethylation score per CpG site relate to an estimate of cancer probability given the presence of hypomethylation or hypermethylation of fragments. Generally, the hypomethylation score per CpG site and the hypermethylation score per CpG site are based on counts of methylation state vectors that overlap the CpG site and that are either hypomethylated or hypermethylated.
The rankings based on hypermethylation scores and rankings based on hypomethylation scores can refer to rankings of methylation state vectors based on their associated hypomethylation/hypermethylation scores. In one embodiment, the rankings refer to rankings of fragments based on a maximum hypomethylation/hypermethylation score of methylation state vectors in each fragment.
The presence or absence of an abnormally methylated (e.g., hypomethylated or hypermethylated) fragment at a CpG site can be expressed as a 0 (absent) or 1 (present) value for each CpG site. In some aspects, the presence/absence of an abnormally methylated fragment is determined for CpG sites that provide the most information gain. For example, in some scenarios, ˜3000 CpG sites are identified as providing information gain and therefore, there are ˜3000 feature values for this particular feature.
4.2. Methylation Computational Analysis Overview
For a given methylation state vector, the processing system enumerates 710B all possibilities of methylation state vectors having the same starting CpG site and same length (i.e., set of CpG sites) as the methylation state vector. As each methylation state may be methylated or unmethylated, there are only two possible states at each CpG site, and thus the count of distinct possibilities of methylation state vectors depends on a power of 2, such that a methylation state vector of length n would be associated with 2^n possibilities of methylation state vectors.
The processing system calculates 710C the probability of observing each possibility of methylation state vector for the identified starting CpG site/methylation state vector length by accessing a healthy control group data structure. In one embodiment, calculating the probability of observing a given possibility uses a Markov chain probability to model the joint probability calculation which will be described in greater detail with respect to
To generate a healthy control group data structure, the processing system subdivides methylation state vectors derived from cfDNA of individuals in the healthy control group into strings of CpG sites. In one embodiment, the processing system subdivides the methylation state vector such that the resulting strings are all less than a given length. For example, a methylation state vector of length 11 subdivided into strings of length less than or equal to 3 would result in 9 strings of length 3, 10 strings of length 2, and 11 strings of length 1. In another example, a methylation state vector of length 7 being subdivided into strings of length less than or equal to 4 would result in 4 strings of length 4, 5 strings of length 3, 6 strings of length 2, and 7 strings of length 1. If a methylation state vector is shorter than or the same length as the specified string length, then the methylation state vector may be converted into a single string containing all of the CpG sites of the vector.
The processing system tallies the strings by counting, for each possible CpG site and possibility of methylation states in the vector, the number of strings present in the control group having the specified CpG site as the first CpG site in the string and having that possibility of methylation states. For example, at a given CpG site and considering string lengths of 3, there are 2^3 or 8 possible string configurations. At that given CpG site, for each of the 8 possible string configurations, the processing system tallies how many occurrences of each methylation state vector possibility come up in the control group. Continuing this example, this may involve tallying the following quantities: <Mx, Mx+1, Mx+2>, <Mx, Mx+1, Ux+2>, . . . , <Ux, Ux+1, Ux+2> for each starting CpG site x in the reference genome. The processing system creates the data structure storing the tallied counts for each starting CpG site and string possibility. Thus, this data structure can be used to determine the probability of observing each possibility of a methylation state vector.
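By way of illustration only, the healthy control group data structure described above could be built along the following lines; the helper names (subdivide, build_control_counts) and the keying of tallies by (starting CpG site, string of states) are assumptions made for this sketch.

```python
from collections import Counter

def subdivide(states, start_site, max_len=3):
    """Yield (starting CpG site, string of states) pairs for every substring of the
    methylation state vector up to max_len CpG sites long."""
    n = len(states)
    for length in range(1, min(max_len, n) + 1):
        for offset in range(n - length + 1):
            yield start_site + offset, tuple(states[offset:offset + length])

def build_control_counts(control_vectors, max_len=3):
    """control_vectors: iterable of (starting CpG site, states) pairs from healthy-control
    cfDNA, where states is a string such as 'MUM' (M = methylated, U = unmethylated)."""
    counts = Counter()
    for start_site, states in control_vectors:
        for key in subdivide(states, start_site, max_len):
            counts[key] += 1
    return counts

# Illustrative healthy-control fragments (synthetic data).
counts = build_control_counts([(23, "MMUM"), (23, "UUUM"), (24, "MUMU")])
# Number of control strings starting at CpG site 23 with states <M, M, U>:
print(counts[(23, ("M", "M", "U"))])  # 1
```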
The processing system calculates 710D a p-value score for the methylation state vector using the calculated probabilities for each possibility. In one embodiment, this includes identifying the calculated probability corresponding to the possibility that matches the methylation state vector in question. Specifically, this is the possibility having the same set of CpG sites, or similarly the same starting CpG site and length as the methylation state vector. The processing system sums the calculated probabilities of any possibilities having probabilities less than or equal to the identified probability to generate the p-value score.
This p-value represents the probability of observing the methylation state vector of the fragment, or other methylation state vectors even less probable, in the healthy control group. A low p-value score thereby generally corresponds to a methylation state vector that is rare in a healthy subject, and which causes the fragment to be labeled anomalously methylated relative to the healthy control group. A high p-value score generally relates to a methylation state vector that is expected to be present, in a relative sense, in a healthy subject. If the healthy control group is a non-cancerous group, for example, a low p-value indicates that the fragment is anomalously methylated relative to the non-cancer group, and therefore possibly indicative of the presence of cancer in the test subject.
As above, the processing system calculates p-value scores for each of a plurality of methylation state vectors, each representing a cfDNA fragment in the test sample. To identify which of the fragments are anomalously methylated, the processing system may filter 710E the set of methylation state vectors based on their p-value scores. In one embodiment, filtering is performed by comparing the p-value scores against a threshold and keeping only those fragments below the threshold. This threshold p-value score could be on the order of 0.1, 0.01, 0.001, 0.0001, or similar.
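By way of illustration only, the p-value score and threshold filtering described above can be sketched as follows, assuming the probabilities of all possibilities of methylation state vectors over the relevant CpG sites have already been calculated (as described in the paragraphs that follow); the names and toy probabilities are hypothetical.

```python
def p_value_score(possibility_probs, observed_vector):
    """Sum the probabilities of all possibilities that are no more probable than the
    observed methylation state vector.

    possibility_probs: dict mapping each possible methylation state vector over the same
                       CpG sites (e.g., all 2**n combinations of 'M'/'U') to its probability
                       under the healthy control group.
    observed_vector:   the test fragment's methylation state vector, as a tuple of states.
    """
    p_observed = possibility_probs[observed_vector]
    return sum(p for p in possibility_probs.values() if p <= p_observed)

# Toy example over two CpG sites (2**2 = 4 possibilities).
probs = {("M", "M"): 0.55, ("M", "U"): 0.25, ("U", "M"): 0.15, ("U", "U"): 0.05}
score = p_value_score(probs, ("U", "M"))  # 0.15 + 0.05 = 0.20
anomalous = score <= 0.01                 # compare against a threshold such as 0.01
print(score, anomalous)
```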
The processing system calculates 710C probabilities 720B for the enumerated possibilities of methylation state vectors. As methylation is conditionally dependent on the methylation status of nearby CpG sites, one way to calculate the probability of observing a given methylation state vector possibility is to use a Markov chain model. Generally, a methylation state vector such as <S1, S2, . . . , Sn>, where S denotes the methylation state, whether methylated (denoted as M), unmethylated (denoted as U), or indeterminate (denoted as I), has a joint probability that can be expanded using the chain rule of probabilities as:
P(<S1,S2, . . . , Sn>)=P(Sn|S1, . . . , Sn−1)*P(Sn−1|S1, . . . , Sn−2)* . . . *P(S2|S1)*P(S1).
A Markov chain model can be used to make the calculation of the conditional probabilities of each possibility more efficient. In one embodiment, the processing system selects a Markov chain order k which corresponds to how many prior CpG sites in the vector (or window) to consider in the conditional probability calculation, such that the conditional probability is modeled as P(Sn|S1, . . . , Sn−1)˜P(Sn|Sn−k−2, . . . , Sn−1).
To calculate each Markov modeled probability for a possibility of methylation state vector, the processing system accesses the control group's data structure, specifically the counts of various strings of CpG sites and states. To calculate P(Mn|Sn−k−2, . . . , Sn−1), the processing system takes the stored count of the number of strings from the data structure matching <Sn−k−2, . . . , Sn−1, Mn> divided by the sum of the stored count of the number of strings from the data structure matching <Sn−k−2, . . . , Sn−1, Mn> and <Sn−k−2, . . . , Sn−1, Un>. Thus, P(Mn|Sn−k−2, . . . , Sn−1) is calculated as a ratio having the form:
[# of <Sn−k−2, . . . , Sn−1,Mn>]/[# of <Sn−k−2, . . . , Sn−1,Mn>+# of <Sn−k−2, . . . , Sn−1,Un>].
The calculation may additionally implement a smoothing of the counts by applying a prior distribution. In one embodiment, the prior distribution is a uniform prior as in Laplace smoothing. As an example of this, a constant is added to the numerator and another constant (e.g., twice the constant in the numerator) is added to the denominator of the above equation. In other embodiments, an algorithmic technique such as Kneser-Ney smoothing is used.
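A minimal sketch of the smoothed conditional probability calculation follows, assuming control-group tallies keyed by (starting CpG site, string of states) as in the earlier sketch; the pseudo-count of 1 in the numerator (and 2 in the denominator) mirrors the uniform-prior example above but is not a disclosed value.

```python
from collections import Counter

def conditional_probability(counts, start_site, prior_states, state, alpha=1):
    """Markov-modeled probability P(state | prior_states), estimated from healthy-control
    string tallies with Laplace smoothing (alpha added to the numerator, 2*alpha to the
    denominator).

    counts:       Counter keyed by (starting CpG site, tuple of states).
    prior_states: tuple of states at the preceding CpG sites, e.g. ('M', 'U').
    state:        'M' or 'U', the state whose conditional probability is wanted.
    """
    numerator = counts[(start_site, prior_states + (state,))] + alpha
    denominator = (counts[(start_site, prior_states + ("M",))]
                   + counts[(start_site, prior_states + ("U",))]
                   + 2 * alpha)
    return numerator / denominator

# Synthetic healthy-control tallies keyed by (starting CpG site, string of states).
counts = Counter({(23, ("M", "U", "M")): 4,    # <M23, U24, M25> strings in the controls
                  (23, ("M", "U", "U")): 12})  # <M23, U24, U25> strings in the controls
# P(M25 | M23, U24) with Laplace smoothing: (4 + 1) / (4 + 12 + 2) = 5/18
print(conditional_probability(counts, 23, ("M", "U"), "M"))
```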
In the illustration, the above denoted formulas are applied to the test methylation state vector 720A covering sites 23-26. Once the calculated probabilities 720B are completed, the processing system calculates 710D a p-value score 720C by summing the probabilities that are less than or equal to the probability of the possibility of methylation state vector that matches the test methylation state vector 720A.
In one embodiment, the computational burden of calculating probabilities and/or p-value scores may be further reduced by caching at least some calculations. For example, the analytic system may cache in transitory or persistent memory calculations of probabilities for possibilities of methylation state vectors (or windows thereof). If other fragments have the same CpG sites, caching the possibility probabilities allows for efficient calculation of p-value scores without needing to re-calculate the underlying possibility probabilities. Equivalently, the processing system may calculate p-value scores for each of the possibilities of methylation state vectors associated with a set of CpG sites from a vector (or window thereof). The processing system may cache the p-value scores for use in determining the p-value scores of other fragments including the same CpG sites. Generally, the p-value scores of possibilities of methylation state vectors having the same CpG sites may be used to determine the p-value score of a different one of the possibilities from the same set of CpG sites.
Referring again to
In calculating p-value scores for a methylation state vector larger than the window, the window identifies the sequential set of CpG sites from the vector that fall within the window, starting from the first CpG site in the vector. The analytic system calculates a p-value score for the window including the first CpG site. The processing system then “slides” the window to the second CpG site in the vector, and calculates another p-value score for the second window. Thus, for a window size l and methylation vector length m, each methylation state vector will generate m−l+1 p-value scores. After completing the p-value calculations for each portion of the vector, the lowest p-value score from all sliding windows is taken as the overall p-value score for the methylation state vector. In another embodiment, the processing system aggregates the p-value scores for the methylation state vectors to generate an overall p-value score.
Using the sliding window helps to reduce the number of enumerated possibilities of methylation state vectors and their corresponding probability calculations that would otherwise need to be performed. Example probability calculations are shown in
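By way of illustration only, the sliding-window aggregation can be sketched as follows; the window scorer passed in is a stand-in for the p-value calculation described above, and taking the minimum reflects the embodiment in which the lowest window score is used.

```python
def windowed_p_value_score(vector, window_size, p_value_for_window):
    """Slide a fixed-size window across a methylation state vector and take the lowest
    window p-value score as the vector's overall p-value score.

    vector:             tuple of methylation states for the fragment's CpG sites.
    window_size:        number of CpG sites per window (l).
    p_value_for_window: callable returning a p-value score for a window of states.
    """
    m, l = len(vector), window_size
    if m <= l:
        return p_value_for_window(vector)
    # A vector of length m yields m - l + 1 windows, and one p-value score per window.
    return min(p_value_for_window(vector[i:i + l]) for i in range(m - l + 1))

# Illustrative use with a stand-in scorer that favors unmethylated-rich windows.
def toy_scorer(window):
    return 0.5 ** window.count("U")

print(windowed_p_value_score(("M", "U", "U", "U", "M", "M"), 4, toy_scorer))  # 0.125
```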
The processing system may perform any variety and/or possibility of additional analyses with the set of anomalous fragments. One additional analysis identifies 710G hypomethylated fragments or hypermethylated fragments from the filtered set. Fragments that are hypomethylated or hypermethylated may be defined as fragments of a certain length of CpG sites (e.g., more than 3, 4, 5, 6, 7, 8, 9, 10, etc.) with a high percentage of unmethylated CpG sites (e.g., more than 80%, 85%, 90%, or 95%, or any other percentage within the range of 50%-100%) or a high percentage of methylated CpG sites (e.g., more than 80%, 85%, 90%, or 95%, or any other percentage within the range of 50%-100%), respectively.
In various embodiments, the processing system may identify a presence or absence of an anomalously methylated fragment that overlaps a particular CpG site. As described above, the presence/absence of an anomalously methylated fragment of a CpG site can serve as a methylation feature. As one example, if a hypermethylated or hypomethylated fragment overlaps a particular CpG site, then the CpG site can be assigned a value of 1, indicating the presence of an anomalously methylated fragment. Conversely, the CpG site can be assigned a value of 0 which indicates the lack of an anomalously methylated fragment at the CpG site. In various embodiments, the processing system identifies CpG sites that provide the most information gain and determines the presence/absence of an anomalously methylated fragment at the CpG sites that provide the most information gain. For example, for each CpG site, the processing system computes information gain for the label (cancer vs. non-cancer) given knowledge of whether a highly methylated or unmethylated fragment overlaps that CpG site. The processing system selects the top k CpG sites based on the information gain of each CpG site and can determine the presence/absence of an anomalously methylated fragment at each of the most informative CpG sites.
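By way of illustration only, selecting the most informative CpG sites could be sketched with a standard information gain computation over the binary presence/absence feature; the helper names and toy data are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of binary labels (1 = cancer, 0 = non-cancer)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def information_gain(labels, presence):
    """Reduction in label entropy from knowing whether an anomalously methylated fragment
    overlaps the CpG site (presence: list of 0/1 values, one per sample)."""
    conditional = 0.0
    for value in (0, 1):
        subset = [lab for lab, f in zip(labels, presence) if f == value]
        if subset:
            conditional += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - conditional

def top_k_cpg_sites(labels, site_presence, k):
    """site_presence: dict mapping CpG site -> per-sample 0/1 presence values.
    Returns the k CpG sites whose presence/absence feature provides the most information gain."""
    gains = {site: information_gain(labels, values) for site, values in site_presence.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]

# Toy example: 6 samples and two candidate CpG sites.
labels = [1, 1, 1, 0, 0, 0]
site_presence = {"cg001": [1, 1, 1, 0, 0, 0],   # perfectly separates the labels
                 "cg002": [1, 0, 1, 0, 1, 0]}   # nearly uninformative
print(top_k_cpg_sites(labels, site_presence, k=1))  # ['cg001']
```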
An alternate analysis applies 710H a trained classification model on the set of anomalous fragments. The trained classification model described hereafter may refer to the predictive cancer model 170B shown in
The trained classification model can be trained to identify any condition of interest that can be identified from the methylation state vectors. Generally, the trained classification model can employ any one of a number of classification techniques. In one embodiment the classifier is a non-linear classifier. In a specific embodiment, the classifier is a non-linear classifier utilizing a L2-regularized kernel logistic regression with a Gaussian radial basis function (RBF) kernel.
In one embodiment, the trained classification model is a binary classifier trained based on methylation states for cfDNA fragments obtained from a subject cohort with cancer, and optionally based on methylation states for cfDNA fragments obtained from a healthy subject cohort without cancer, and is then used to classify a test subject's probability of having cancer, or not having cancer, based on anomalous methylation state vectors. In further embodiments, different classifiers may be trained using subject cohorts known to have particular cancers (e.g., breast, lung, prostate, etc.) to predict whether a test subject has those specific cancers.
In one embodiment, the classifier is trained based on information about hyper/hypo methylated regions from the process 710G and as described with respect to
Another additional analysis calculates the log-odds ratio that the anomalous fragments from a subject are indicative of cancer generally, or of particular types of cancer. The log-odds ratio can be calculated by taking the log of a ratio of a probability of being cancerous over a probability of being non-cancerous (i.e., one minus the probability of being cancerous), both as determined by the applied classification model.
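For example, a model-estimated cancer probability of 0.8 corresponds to a log-odds ratio of log(0.8/0.2)≈1.39, as in the short sketch below.

```python
import math

def log_odds_ratio(p_cancer):
    """Log of the cancer probability over the non-cancer probability (one minus the
    cancer probability), as determined by the applied classification model."""
    return math.log(p_cancer / (1.0 - p_cancer))

print(round(log_odds_ratio(0.8), 3))  # 1.386
```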
4.3. Hyper/Hypo Methylated Regions and a Classifier
The process determines 730A, for each methylation state vector, whether the methylation state vector is hypomethylated or hypermethylated. Here, the hypermethylated or hypomethylated label is assigned if at least some number of CpG sites have a particular state (methylated or unmethylated, respectively) and/or have a threshold percentage of sites that are the particular state (again, methylated or unmethylated, respectively). As defined above, cfDNA fragments are identified as hypomethylated or hypermethylated, respectively, if the fragment has at least five CpG sites that are either unmethylated or methylated and (logical AND) more than 80% of the fragment's CpG sites are unmethylated or methylated. The total fragments that are hypomethylated (e.g., quantity of hypomethylated counts) and the total fragments that are hypermethylated (e.g., quantity of hypermethylated counts) can serve as methylation features.
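Under one reading of the definition above (at least five CpG sites of the particular state and more than 80% of the fragment's CpG sites in that state), the labeling can be sketched as follows; the function names are illustrative only.

```python
def label_fragment(states, min_state_sites=5, threshold=0.8):
    """Return 'hyper', 'hypo', or None for a fragment's methylation state vector."""
    n = len(states)
    for label, state in (("hyper", "M"), ("hypo", "U")):
        k = states.count(state)
        if k >= min_state_sites and k / n > threshold:
            return label
    return None

# Count hyper- and hypomethylated fragments; the totals can serve as methylation features.
fragments = [("M",) * 6, ("U", "U", "U", "U", "U", "M"), ("M", "U", "M", "U", "M")]
counts = {"hyper": 0, "hypo": 0}
for fragment in fragments:
    label = label_fragment(fragment)
    if label:
        counts[label] += 1
print(counts)  # {'hyper': 1, 'hypo': 1}
```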
In an alternate embodiment, the process considers portions of the methylation state vector and determines whether the portion is hypomethylated or hypermethylated, and may distinguish that portion as hypomethylated or hypermethylated. This alternative avoids missing methylation state vectors that are large in size but contain at least one region of dense hypomethylation or hypermethylation. This process of defining hypomethylation and hypermethylation can be applied in step 710G of
The process generates 730B a hypomethylation score and a hypermethylation score per CpG site in the genome. As discussed above, the hypomethylation score and hypermethylation score for each CpG site in the genome can serve as a methylation feature. To generate either score at a given CpG site, the classifier takes four counts at that CpG site—(1) count of (methylation state) vectors of the cancer set labeled hypomethylated that overlap the CpG site; (2) count of vectors of the cancer set labeled hypermethylated that overlap the CpG site; (3) count of vectors of the non-cancer set labeled hypomethylated that overlap the CpG site; and (4) count of vectors of the non-cancer set labeled hypermethylated that overlap the CpG site. Additionally, the process may normalize these counts for each group to account for variance in group size between the non-cancer group and the cancer group.
To generate 730B the hypomethylation score at a given CpG site, the process takes a ratio of (1) over the sum of (1) and (3). Similarly, the hypermethylation score is calculated by taking a ratio of (2) over the sum of (2) and (4). Additionally, these ratios may be calculated with an additional smoothing technique as discussed above. The hypomethylation score and the hypermethylation score relate to an estimate of cancer probability given the presence of hypomethylation or hypermethylation of fragments from the cancer set.
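Illustratively, the per-CpG-site scores with a simple smoothing (a pseudo-count added to the numerator and twice that added to the denominator, which is one possible choice rather than a disclosed value) could be computed as follows.

```python
def site_scores(cancer_hypo, cancer_hyper, noncancer_hypo, noncancer_hyper, alpha=1):
    """Hypomethylation and hypermethylation scores at one CpG site from the four
    (optionally group-size-normalized) overlap counts:
    (1) cancer hypomethylated, (2) cancer hypermethylated,
    (3) non-cancer hypomethylated, (4) non-cancer hypermethylated."""
    hypo_score = (cancer_hypo + alpha) / (cancer_hypo + noncancer_hypo + 2 * alpha)
    hyper_score = (cancer_hyper + alpha) / (cancer_hyper + noncancer_hyper + 2 * alpha)
    return hypo_score, hyper_score

# A CpG site overlapped by 30 hypomethylated cancer vectors, 2 hypomethylated non-cancer
# vectors, 5 hypermethylated cancer vectors, and 4 hypermethylated non-cancer vectors.
hypo, hyper = site_scores(cancer_hypo=30, cancer_hyper=5, noncancer_hypo=2, noncancer_hyper=4)
print(round(hypo, 3), round(hyper, 3))  # 0.912 0.545
```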
The process generates 730C an aggregate hypomethylation score and an aggregate hypermethylation score for each anomalous methylation state vector. The aggregate hyper and hypo methylation scores are determined based on the hyper and hypo methylation scores of the CpG sites in the methylation state vector. In one embodiment, the aggregate hyper and hypo methylation scores are assigned as the largest hyper and hypo methylation scores of the sites in each state vector, respectively. However, in alternate embodiments, the aggregate scores could be based on means, medians, maximums, or other calculations that use the hyper/hypo methylation scores of the sites in each vector.
The process then ranks 730D all of that subject's methylation state vectors by their aggregate hypomethylation score and by their aggregate hypermethylation score, resulting in two rankings per subject. In particular embodiments, the process ranks fragments based on the maximum hypomethylation score or maximum hypermethylation score, thereby resulting in two rankings per subject. These two rankings, which are the rankings based on hypermethylation scores and the rankings based on hypomethylation scores, can serve as methylation features. The process selects aggregate hypomethylation scores from the hypomethylation ranking and aggregate hypermethylation scores from the hypermethylation ranking. With the selected scores, the classifier generates 730E a single feature vector for each subject. In one embodiment, the scores selected from either ranking are selected with a fixed order that is the same for each generated feature vector for each subject in each of the training groups. As an example, in one embodiment the classifier selects the first, the second, the fourth, the eighth, and the sixteenth aggregate hypermethylation score, and similarly for the aggregate hypomethylation scores, from each ranking and writes those scores into the feature vector for that subject. The process then trains 730F a binary classifier to distinguish feature vectors between the cancer and non-cancer training groups. The trained classifier can then be applied when needed at, for example, step 710H of
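By way of illustration only, building the fixed-order feature vector from the two rankings could look like the following sketch; padding with zero when fewer than sixteen anomalous vectors are available is an assumption made here for completeness.

```python
def subject_feature_vector(hypo_scores, hyper_scores, ranks=(1, 2, 4, 8, 16)):
    """Build one subject's feature vector from the aggregate hypomethylation and
    hypermethylation scores of that subject's anomalous methylation state vectors,
    picking the same rank positions from each ranking for every subject."""
    def pick(scores):
        ranked = sorted(scores, reverse=True)
        return [ranked[r - 1] if r <= len(ranked) else 0.0 for r in ranks]
    return pick(hypo_scores) + pick(hyper_scores)

# Ten feature values per subject: ranks 1, 2, 4, 8, and 16 from each of the two rankings.
vector = subject_feature_vector(hypo_scores=[0.9, 0.7, 0.95, 0.4], hyper_scores=[0.8, 0.85])
print(vector)  # [0.95, 0.9, 0.4, 0.0, 0.0, 0.85, 0.8, 0.0, 0.0, 0.0]
```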
Referring again to
Common assay features can be informative for predicting cancer. As one example, common assay features can be a characteristic of the cfDNA 115. As another example, common assay features can be a characteristic of the gDNA (e.g., WBC DNA 120). For example, a characteristic of cfDNA or gDNA can be a total quantity of cfDNA or gDNA in a sample. Tumors in cancerous individuals often produce higher levels of DNA and therefore, the total quantity of DNA (e.g., cfDNA) can be informative for generating a cancer prediction. As another example, a characteristic of cfDNA can be a total concentration of tumor-derived nucleic acid. As yet another example, a characteristic of cfDNA can be a mean or median fragment length across DNA fragments obtained from the cfDNA sample.
6.1. Baseline Features
Referring again to
Examples of baseline features 150 can also include a polygenic risk score derived from germline mutations. Germline mutations can include known germline mutations that are associated with cancer, such as one of ATM, BRCA1, BRCA2, CHEK2, PALB2, PTEN, STK11, or TP53. Here, the baseline computational analysis 140A can perform the steps to determine a feature value represented by the polygenic risk score.
6.2. Baseline Analysis and Computational Analysis Workflow
At step 810B, germline mutations of the individual are obtained. In various embodiments, identified germline mutations are obtained from the application of a physical assay to DNA obtained from the individual. As one example, the germline mutations are previously identified through application of the whole genome sequencing assay 132 and the whole genome computational analysis 140B. Specifically, the whole genome sequencing assay 132 can be applied to gDNA (e.g., WBC DNA 120) to generate sequence reads and the whole genome computational analysis 140B can be applied to identify the germline mutations that are present. As an example, a germline mutation can be a copy number variation that is jointly found in both gDNA and cfDNA, as discussed above in relation to the copy number aberration detection workflow (e.g., Section 3.2).
At step 810C, a polygenic risk score (PRS) is determined for the individual based on the germline mutations that were obtained. As an example, a PRS can be represented as the weighted summation of the identified germline mutations. Each weight assigned to a germline mutation can be previously determined through training data. As one example, the PRS for each individual i is calculated as the weighted sum of log(odds ratio) across all risk loci.
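As an illustrative sketch only: one common form of such a weighted sum multiplies each locus's log(odds ratio) by the individual's risk-allele count at that locus. The allele-count weighting shown below is an assumption for illustration; the disclosure specifies a weighted sum of log(odds ratio) with weights determined from training data.

```python
import math

def polygenic_risk_score(odds_ratios, risk_allele_counts):
    """PRS for one individual as a weighted sum of log(odds ratio) across risk loci.

    odds_ratios:        per-locus odds ratios determined from training data.
    risk_allele_counts: per-locus risk-allele counts (0, 1, or 2) for the individual.
    """
    return sum(math.log(oddsratio) * count
               for oddsratio, count in zip(odds_ratios, risk_allele_counts))

# Three risk loci with odds ratios 1.3, 1.1, and 2.0; the individual carries 1, 0, and 2
# copies of the respective risk alleles.
print(round(polygenic_risk_score([1.3, 1.1, 2.0], [1, 0, 2]), 3))  # 1.649
```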
At step 810D, optionally, a cancer prediction for the individual can be provided based on the baseline features including the clinical features and the polygenic risk score. In one embodiment, a cancer prediction can include the stratification of the individual in a particular category based on the baseline features. Stratification of the individual can be useful because a cancer prediction for the individual can be tailored for the particular stratification that the individual is categorized in. An example of a stratification can be a risk of developing cancer (e.g., high risk, medium risk, low risk).
In one embodiment, criteria of the baseline features are used to stratify the individual into a category. As an example, a criterion for a PRS can be a threshold PRS percentile (e.g., top 5%, 10%, 25%, or 50% of individuals). If the individual is assigned a PRS that is above the threshold PRS percentile, then the individual can be stratified in a first group. Conversely, if the individual is assigned a PRS that is below the threshold PRS percentile, then the individual can be stratified in a second group. As another example, a criterion for a clinical feature can be a presence or absence of cancer in an individual's family history. Thus, if the individual's family history includes the presence of cancer, then the individual is assigned to a first category. Conversely, the individual is categorized in a second category if the individual's family history lacks a presence of cancer. Various criteria can be jointly used to further stratify individuals into different categories.
In an embodiment, the cancer prediction can be a presence or absence of cancer based on the baseline features. For example, given the PRS and one or more clinical features, a prediction as to whether the individual has cancer can be provided.
6.2.1. Example 1: Cancer Prediction Based on Baseline Features
Many baseline features were predictive of breast cancer. Different predictive models were tested for their ability to predict the presence of the different types of breast cancer. The first model was based on the variable selection resulting from stepwise logistic regression; the second model receives baseline features including a PRS and breast density; the third model receives the PRS only as the baseline feature; and the fourth model receives breast density only as the baseline feature. The performance of each model is documented as both the area under the curve (AUC) and the leave one out cross validation (LOOCV) result. In
Each of the predictive cancer models described within this section generates a cancer prediction based on one or more types of features (e.g., small variant features, whole genome features, methylation features, and/or baseline features) through a logistic regression model, unless stated otherwise. Each predictive cancer model is trained using a set of training data derived from a first subset of patients of a circulating cell-free genome atlas (CCGA) study (NCT02889978) and then subsequently tested using a set of testing data derived from a second subset of patients from the CCGA study.
Of the 1792 patients in the training set, 1399 patients passed the eligibility criteria. The 1399 patients, of which 841 were cancerous (with 437 of the 841 presenting with tumor tissue) and 558 were non-cancerous, were used to train the models using a 10-fold cross validation process. The 1008 held out patients were subsequently used to test the performance of the models.
7.1. Multi-Assay Predictive Cancer Model in Accordance with
7.2. Single-Assay Predictive Cancer Model in Accordance with
7.2.1. Small Variant Assay Predictive Cancer Model
7.2.2. Whole Genome Predictive Cancer Model
7.2.3. Methylation Predictive Cancer Model
7.2.4. Summary of Results of Single-Assay Predictive Cancer Models
As shown in
Of note, there are five patients that are clinically non-cancerous, but at least two of the predictive models called a presence of cancer. Of these five patients, one of the patients was subsequently clinically diagnosed with cancer. The additional four patients may be false positives.
7.3. Additional Single Assay Predictive Cancer Models
Specifically,
7.4. Multi-Stage, Multi-Assay Predictive Cancer Model in Accordance with
In accordance with
The second stage of the predictive cancer model is an overall predictive model that is trained to output a cancer prediction using values of each of the WGS score, variant gene score, Order score, Binary score, WGBS score, and MSUM score. Therefore, during deployment, the second stage of the predictive cancer model receives, as input, the various scores and outputs a cancer prediction for an individual 110, such as a presence/absence of cancer in the individual.
The total AUC of the ROC curve is 0.722.
7.5. Example Multi-Stage, Multiple Assay Predictive Cancer Models
As described above in
In an embodiment, each first-stage model is configured to generate a cancer prediction from a similar type of features (e.g., features from a single sequencing assay). For example, a first first-stage model may generate a cancer prediction for the individual using small variant features (e.g., small variant features 154), a second first-stage model may generate a cancer prediction using methylation features (e.g., methylation features 156), etc. In another embodiment, each first stage model is configured to generate a cancer prediction from one or more types of features. For example, a first first-stage model may generate a cancer prediction for the individual using small variant features and methylation features, and the second first-stage model may generate a cancer prediction using whole genome features (e.g., whole genome features 152) and baseline features (e.g., baseline features 150).
There are many possible configurations of first-stage models for a multi-stage model. For example, there may be two, three, four, five, etc. first-stage models included in a multi-stage model. As another example, there may be one, two, three, four, five, etc. types of features used as an input to a first stage model.
A multi-stage model includes an overall predictive model (e.g., overall predictive model 185) that generates a cancer prediction (e.g., cancer prediction 190). The overall predictive model ("second-stage model") takes as input the cancer predictions of the first-stage models and generates the cancer prediction. The second-stage model may be trained using any of the various methods described herein. As an example, the second-stage model may be a random-forest model trained to determine a cancer prediction using features derived from test sequences with a known cancer diagnosis.
There are many possible configurations of second-stage models for a multi-stage model. For example, the second-stage model may be trained to determine a cancer prediction using two, three, four, five, etc. cancer predictions generated from their corresponding first-stage models. As another example, a multi-stage model may have more than two stages. To illustrate, the results of two second-stage models may be input into a cancer prediction model ("third-stage model") trained to determine a cancer prediction for an individual.
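By way of illustration only, a simplified two-stage configuration could be assembled with scikit-learn along the following lines. The feature values are synthetic, and fitting the second-stage model on in-sample first-stage scores is a simplification; in practice, the first-stage scores used to train the second stage would typically come from held-out or cross-validated predictions, consistent with the cross-validation described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic training data: two feature types for 200 subjects with known labels.
rng = np.random.default_rng(0)
small_variant_features = rng.normal(size=(200, 4))
methylation_features = rng.normal(size=(200, 10))
labels = rng.integers(0, 2, size=200)

# First stage: one model per feature type, each producing a per-subject cancer score.
first_stage_a = LogisticRegression().fit(small_variant_features, labels)
first_stage_b = LogisticRegression().fit(methylation_features, labels)
score_a = first_stage_a.predict_proba(small_variant_features)[:, 1]
score_b = first_stage_b.predict_proba(methylation_features)[:, 1]

# Second stage: an overall model that takes the first-stage scores as its inputs.
stacked = np.column_stack([score_a, score_b])
second_stage = RandomForestClassifier(n_estimators=100, random_state=0).fit(stacked, labels)

# Deployment: run a new subject's features through both stages to obtain a cancer prediction.
new_small_variant = rng.normal(size=(1, 4))
new_methylation = rng.normal(size=(1, 10))
new_scores = np.column_stack([first_stage_a.predict_proba(new_small_variant)[:, 1],
                              first_stage_b.predict_proba(new_methylation)[:, 1]])
print(second_stage.predict_proba(new_scores)[:, 1])  # predicted probability of cancer
```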
7.5.1. Example 1: First-Stage Models Using WGS and Small Variant Features
As described above, one example of a multi-stage model includes two first-stage models and a single second-stage model. In this example, the first first-stage model ("first model") is a cancer prediction model configured to determine a cancer prediction for an individual using small variant features. The features input into the first model may be derived using techniques similar to those described in Section 2, "Small Variant Computation Analysis." The first model may output a score (e.g., an A score) indicating a cancer prediction for the individual. Section 7.2.1, "Small Variant Assay Predictive Cancer Model," describes a predictive cancer model that may be used as the first model.
The ROC curve plots illustrate the true positive rate as a function of the false positive rate for the first model, the second model, and the combined model. Curves on the ROC curve plots provide a representation of the predictive ability of each of the cancer prediction models.
Table 7.5.1.1. provides quantifications of the predictive capability of the multi-stage model illustrated in
Table 7.5.1.2 provides another quantification of the predictive ability of the first model and the combination model. In this example, the test sequences include 27 liquid cancer test sequences, 574 non-cancer test sequences, and 890 solid cancer test sequences. Within the table, “True” indicates that a model determined a positive cancer prediction and “False” indicates that a model determined a negative cancer prediction. A number under a cancer type indicates the number of test-sequences having the indicated predictions. So, for example, the combination model correctly called three liquid cancer test sequences “True” while the first model incorrectly called those same three liquid cancer test sequences “False.”
7.5.2. Example 2: First-Stage Models Using WGS and Methylation Features
Example 2 is another multi-stage model having a first model, a second model, and a combination model. In this example, the first model is configured to determine a cancer prediction for an individual using WGS features. The features input into the first model may be derived using techniques similar to those described in Section 3, "Whole Genome Computational Analysis." The first model may output a score (e.g., a WGS score) indicating a cancer prediction for the individual. Section 7.2.2, "Whole Genome Predictive Cancer Model," describes a predictive cancer model that may be used as the first model.
The multi-stage model was configured in the manner described above. In particular, the multi-stage model included (i) a first model to generate a cancer prediction using WGS features (principal component projections on the first 5 principal components of the read counts as described in section 3.1) as an input, (ii) a second model to generate a cancer prediction using methylation features (hyper-methylation counts and/or hypo-methylation counts as described in section 4.3), and (iii) a combined model generating a cancer prediction using the cancer predictions of the first model and the second model. The multi-stage model was applied to test sequences obtained from a group of individuals to determine a cancer prediction.
The ROC curve plots illustrate the sensitivity as a function of specificity for the first model, the second model, and the combined model. Curves on the ROC curve plots provide a representation of the predictive ability of each of the cancer prediction models. In this case, the improvement of the combined model is more marginal than the multi-stage model shown in Example 1. As such, it is useful to analyze the area under curve (“AUC”) for each of the lines on the ROC curve plots. The AUC is a measure of the total predictive capability of each of the models, with a higher AUC representing, generally, a more sensitive and specific model.
Table 7.5.2.1. provides the AUC for the multi-stage model for determining a cancer prediction for different types of cancer. The table includes a column indicating the type of cancer and the AUC for the combined model, the first model, and the second model. Here, marginal improvement in cancer prediction is seen in lung cancer and colorectal cancer.
7.5.3. Example 3: First-Stage Models Using Baseline and Methylation Features
Example 3 is another multi-stage model having a first model, a second model, and a combination model. In this example, the first model is configured to determine a cancer prediction for an individual using methylation features. The features input into the first model may be derived using techniques similar to those described in Section 4, "Methylation Computational Analysis." The first model may output a score (e.g., an M score) indicating a cancer prediction for the individual. Section 7.2.3, "Methylation Predictive Cancer Model," describes a predictive cancer model that may be used as the first model.
The multi-stage model was configured in the manner described above. In particular, the multi-stage model included (i) a first model to generate a cancer prediction using methylation features (hyper-methylation counts and/or hypo-methylation counts as described in section 4.3), (ii) a second model to generate a cancer prediction using baseline features (including sex, race, body mass index (BMI), alcohol intake, breastfeeding history, BiRAD breast density, smoking history, second hand smoking, and/or immediate family history of lung cancer), and (iii) a combined model generating a cancer prediction using the cancer predictions of the first model and the second model (using logistic regression). The multi-stage model was applied to test sequences obtained from a group of individuals to determine a cancer prediction.
The ROC curve plots illustrate the sensitivity as a function of specificity for the first model and the combined model. Curves on the ROC curve plots provide a representation of the predictive ability of each of the cancer prediction models. Second models are not illustrated but perform worse than the combination model. For
7.6. Example Single-Stage, Multiple Assay Predictive Cancer Models
As described above in
Each of
As described above,
The above-illustrated ROC curve plots illustrate the true positive rate as a function of the false positive rate for each of the above cancer predictive models utilizing one or more features as described. Each of the curves of the ROC curve plots provides a representation of the predictive ability of each of the cancer prediction models. Notably, the predictive capability of the cancer predictive models including two or more, three or more, or four features is better than that of the cancer predictive models based on a single feature. In other words, using two or more sequencing features in the cancer predictive models described herein results in a model with better cancer predictive capability than models based on a single feature.
The foregoing detailed description of embodiments refers to the accompanying drawings, which illustrate specific embodiments of the present disclosure. Other embodiments having different structures and operations do not depart from the scope of the present disclosure. The term “the invention” or the like is used with reference to certain specific examples of the many alternative aspects or embodiments of the applicants' invention set forth in this specification, and neither its use nor its absence is intended to limit the scope of the applicants' invention or the scope of the claims. This specification is divided into sections for the convenience of the reader only. Headings should not be construed as limiting of the scope of the invention. The definitions are intended as a part of the description of the invention. It will be understood that various details of the present invention may be changed without departing from the scope of the present invention. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation.
This application claims the benefit of priority to U.S. Provisional Application No. 62/657,635, filed Apr. 13, 2018, and U.S. Provisional Application No. 62/679,738 filed Jun. 1, 2018, both of which are incorporated herein by reference in their entirety for all purposes.