This disclosure generally relates to noise models for determining quality scores for nucleic acid sequencing datasets.
Computational techniques can be used on DNA sequencing data to identify mutations or variants in DNA that may correspond to various types of cancer or other diseases. Thus, cancer diagnosis or prediction may be performed by analyzing a biological sample such as a tissue biopsy or blood drawn from an individual, an animal, a plant, etc. Detecting DNA that originated from tumor cells in a blood sample is difficult because circulating tumor DNA (ctDNA) is present at low levels relative to other molecules in cell-free DNA (cfDNA) extracted from the blood. Existing methods often cannot distinguish true positives (e.g., variants indicative of cancer in the subject) from false positives caused by noise sources, which can result in unreliable variant calling or other types of analyses.
Disclosed herein are systems and methods for training and applying site-specific noise models that are classified into a plurality of read tiers. The noise models can determine likelihoods of true positives in targeted sequencing. True positives can include single nucleotide variants, insertions, or deletions of base pairs. In particular, the models can use Bayesian inference to determine a rate or level of noise, e.g., indicative of an expected likelihood of certain mutations, per position of a nucleic acid sequence. Each model can be specific to a read tier. A read tier can be determined based on whether a potential variant location is located at an overlapping region and/or a complementary region of processed sequencing reads. Each model specific to a read tier can be a hierarchical model that accounts for covariates (e.g., trinucleotide context, mappability, or segmental duplication) and various types of parameters (e.g., mixture components or depth of sequence reads) that are specific to the read tier. The models can be trained from sequence reads of healthy subjects that are also stratified by read tiers. The outputs of different noise models can be combined to generate an overall quality score. An overall pipeline that incorporates various read-tier models can identify true positives at higher sensitivities and filter out false positives when compared to a single model that does not differentiate sequence reads by read tiers.
By way of example, in various embodiments, a method for processing a DNA sequencing dataset of a sample (e.g., of an individual) can include accessing the DNA sequencing dataset generated by DNA sequencing, the DNA sequencing dataset comprising a plurality of processed sequence reads that include a variant location. The method can also include stratifying the plurality of processed sequence reads into a plurality of read tiers. The method can further include determining, for each read tier, a stratified sequencing depth at the variant location. The method can further include determining, for each read tier, one or more noise parameters conditioned on the stratified sequencing depth of the read tier, the one or more noise parameters corresponding to a noise model specific to the read tier. The method can further include generating, for each read tier, an output of the noise model specific to the read tier based on the one or more noise parameters conditioned on the stratified sequencing depth of the read tier. The method can further include combining the generated noise model outputs to produce a combined result. The combined result can represent a likelihood that a total variant count greater than or equal to the total variant count observed in the plurality of processed sequence reads would arise in subsequently observed data due to noise.
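By way of illustration only, and not limitation, the stratifying and depth-determining steps can be sketched as follows; the read representation, field names, and tier labels are illustrative assumptions rather than a required implementation:

```python
def read_tier(read):
    # Tier is determined by whether the read is supported by both strands
    # and whether it was stitched from overlapping reads (see the four
    # tiers enumerated below).
    strand = "double-stranded" if read["double_stranded"] else "single-stranded"
    stitch = "stitched" if read["stitched"] else "unstitched"
    return f"{strand}, {stitch}"

def stratify(reads):
    """Group processed sequence reads covering a variant location by tier."""
    tiers = {}
    for r in reads:
        tiers.setdefault(read_tier(r), []).append(r)
    return tiers

def stratified_depths(tiers):
    """Stratified sequencing depth at the variant location, per read tier."""
    return {name: len(rs) for name, rs in tiers.items()}

reads = [
    {"double_stranded": True, "stitched": True},
    {"double_stranded": True, "stitched": False},
    {"double_stranded": False, "stitched": False},
    {"double_stranded": False, "stitched": False},
]
depths = stratified_depths(stratify(reads))
# depths == {"double-stranded, stitched": 1,
#            "double-stranded, unstitched": 1,
#            "single-stranded, unstitched": 2}
```

Each tier's depth then conditions that tier's noise model, as described in the embodiments below.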
In one or more embodiments, the plurality of read tiers include one or more of: (1) a double-stranded, stitched read tier, (2) a double-stranded, unstitched read tier, (3) a single-stranded, stitched read tier, and (4) a single-stranded, unstitched read tier.
In one or more embodiments, a mutation at the variant location is one of: a single nucleotide variant, an insertion, or a deletion.
In one or more embodiments, the method can further include determining a quality score of the combined result, the quality score being a Phred-scale score.
In one or more embodiments, the method can further include, responsive to the quality score being higher than a predetermined threshold, indicating that the sample is likely to have a mutation at the variant location.
In one or more embodiments, determining, for each read tier, the one or more noise parameters conditioned on the stratified sequencing depth of the read tier can include accessing a parameter distribution specific to the read tier, the parameter distribution describing a distribution of a set of DNA sequencing samples associated with the read tier. The noise parameters are determined from the parameter distribution.
In one or more embodiments, for each read tier, the set of DNA sequencing samples associated with the read tier comprises sequence reads stratified into the read tier and corresponds to one or more healthy individuals.
In one or more embodiments, for each read tier, the noise model specific to the read tier is a Bayesian hierarchical model and the parameter distribution is based on a Gamma distribution.
In one or more embodiments, a first noise parameter corresponding to a noise model specific to a first read tier has a different value than a corresponding second noise parameter corresponding to a noise model specific to a second read tier.
In one or more embodiments, for each read tier, the determined one or more noise parameters comprise a mean of a noise distribution conditioned on the stratified sequencing depth of the read tier.
In one or more embodiments, each noise distribution is a negative binomial distribution conditioned on the stratified sequencing depth of each read tier.
In one or more embodiments, for each read tier, the determined one or more noise parameters further comprise a dispersion parameter.
In one or more embodiments, the generated output of each noise model is the one or more noise parameters conditioned on the stratified sequencing depth determined for the read tier.
In one or more embodiments, the generated output of each noise model comprises a likelihood that a stratified variant count for the read tier exceeds a threshold.
In one or more embodiments, combining the generated noise model outputs comprises combining a mean variant count and a variance from each noise model output to produce an overall mean variant count and an overall dispersion parameter representative of an overall noise distribution for the combined result.
In one or more embodiments, the overall noise distribution is modeled based on a negative binomial distribution. Determining the overall mean variant count and the overall dispersion parameter can include determining the mean variant count for each read tier based on the stratified sequencing depth of the read tier. The determining step can also include determining the variance for each read tier. The determining step can further include summing the mean variant count for each read tier to determine the overall mean variant count. The determining step can further include combining the variance for each read tier to determine an overall variance. The determining step can further include determining the overall dispersion parameter based on the overall mean variant count and the overall variance.
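The combining steps above can be sketched as follows. This is a minimal illustration assuming a negative binomial parameterized by mean and dispersion r with variance = mean + mean²/r (the "size" parameterization); the parameterization and example values are assumptions, not limitations:

```python
def nb_variance(mean, r):
    """Variance of a negative binomial with the given mean and dispersion r,
    assuming the 'size' parameterization: var = mean + mean**2 / r."""
    return mean + mean * mean / r

def combine_tiers(tier_params):
    """tier_params: iterable of (mu_p, depth, r) tuples, one per read tier."""
    means, variances = [], []
    for mu_p, depth, r in tier_params:
        m = mu_p * depth              # per-tier mean variant count
        means.append(m)
        variances.append(nb_variance(m, r))
    overall_mean = sum(means)         # sum the per-tier mean variant counts
    overall_var = sum(variances)      # combine the per-tier variances
    # recover the overall dispersion from the overall mean and variance
    overall_r = overall_mean ** 2 / (overall_var - overall_mean)
    return overall_mean, overall_var, overall_r

m, v, r = combine_tiers([(0.001, 1000, 10.0), (0.002, 500, 5.0)])
# m = 2.0, v = 2.3, r = 4 / 0.3 ≈ 13.33
```

Summing variances assumes noise counts in different tiers are independent.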
In one or more embodiments, combining the output for each noise model to generate the combined result can include determining an observed stratified variant count of each read tier. The combining step can also include determining, in each read tier, possible events that are more likely than the observed stratified variant count of each read tier. The combining step can further include identifying combinations of the possible events associated with a higher likelihood of occurrence than the observed stratified variant count of each read tier. The combining step can further include summing probabilities of the identified combinations to determine a statistic complement. The combining step can further include determining a likelihood value by subtracting the statistic complement from 1.0.
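For illustration, the enumeration-and-complement steps above can be sketched with a simple per-tier Poisson noise distribution standing in for the tier-specific noise models (the distribution choice and truncation bound are assumptions):

```python
from itertools import product
from math import exp, lgamma, log

def pois_pmf(k, lam):
    # Poisson pmf, used here as a stand-in per-tier noise distribution
    return exp(k * log(lam) - lam - lgamma(k + 1))

def joint_likelihood(counts, lams):
    p = 1.0
    for k, lam in zip(counts, lams):
        p *= pois_pmf(k, lam)
    return p

def noise_likelihood(observed, lams, max_count=40):
    """Likelihood of an outcome at least as unlikely as the observed
    stratified counts: sum probabilities of count combinations strictly
    more likely than the observation (the 'statistic complement'),
    then subtract from 1.0."""
    p_obs = joint_likelihood(observed, lams)
    complement = 0.0
    for combo in product(range(max_count + 1), repeat=len(lams)):
        p = joint_likelihood(combo, lams)
        if p > p_obs:
            complement += p
    return 1.0 - complement

p = noise_likelihood([3], [1.0])
# for a single tier with Poisson(1) noise this equals P(X >= 3) ≈ 0.080
```

The truncation at `max_count` assumes negligible probability mass beyond it.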
In one or more embodiments, a first identified combination comprising one double-stranded read is equivalent to a second identified combination comprising two single-stranded reads.
In one or more embodiments, the determined likelihood value is equal to or greater than a likelihood of occurrence of the observed stratified variant count of each read tier.
In one or more embodiments, the method can further include training a machine learning model to determine the likelihood value.
In one or more embodiments, the method can further include receiving a body fluid sample of the individual. The method can further include performing the DNA sequencing on cfDNA of the body fluid sample. The method can further include generating raw sequence reads based on a result of the DNA sequencing. The method can further include collapsing and stitching the raw sequence reads to generate the plurality of processed sequence reads.
In one or more embodiments, the body fluid sample is a sample of one of: blood, whole blood, plasma, serum, urine, cerebrospinal fluid, feces, saliva, tears, a tissue biopsy, pleural fluid, pericardial fluid, or peritoneal fluid of the individual.
In one or more embodiments, the plurality of processed sequence reads is sequenced from a tumor biopsy.
In one or more embodiments, the plurality of processed sequence reads is sequenced from an isolate of cells from blood, the isolate of cells including at least buffy coat white blood cells or CD4+ cells.
In one or more embodiments, the DNA sequencing is a type of massively parallel DNA sequencing.
In various embodiments, a non-transitory computer readable medium includes instructions that, when executed by one or more processors, cause the one or more processors to perform any of the steps described above and disclosed herein.
Further, in various embodiments, a system having a computer processor and a memory that stores computer program instructions is provided, whereby execution of the instructions by the computer processor causes the processor to perform any of the steps described above and disclosed herein.
Embodiments according to the invention are in particular disclosed in the attached claims directed to a method and a computer program product, wherein any feature mentioned in one claim category, e.g. method, can be claimed in another claim category, e.g. computer program product, system, storage medium, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However, any subject matter resulting from a deliberate reference back to any previous claims (in particular multiple dependencies) can be claimed as well, so that any combination of claims and the features thereof is disclosed and can be claimed regardless of the dependencies chosen in the attached claims. The subject-matter which can be claimed comprises not only the combinations of features as set out in the attached claims but also any other combination of features in the claims, wherein each feature mentioned in the claims can be combined with any other feature or combination of other features in the claims. Furthermore, any of the embodiments and features described or depicted herein can be claimed in a separate claim and/or in any combination with any embodiment or feature described or depicted herein or with any of the features of the attached claims.
The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
The term “individual” refers to a human individual. The term “healthy individual” refers to an individual presumed to not have cancer or a disease. The term “subject” refers to an individual who is being tested for cancer or a disease.
The term “sequence reads” refers to nucleotide sequences read from a sample obtained from an individual. Sequence reads can be obtained through various methods known in the art.
The term “read segment” or “read” refers to any nucleotide sequences including sequence reads obtained from an individual and/or nucleotide sequences derived from the initial sequence read from a sample obtained from an individual. For example, a read segment can refer to an aligned sequence read, a collapsed sequence read, or a stitched read. Furthermore, a read segment can refer to an individual nucleotide base, such as a single nucleotide variant.
The term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y can be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”
The term “indel” refers to any insertion or deletion of one or more base pairs having a length and a position (which can also be referred to as an anchor position) in a sequence read. An insertion corresponds to a positive length, while a deletion corresponds to a negative length.
The term “variant” refers to one or more SNVs or indels. A variant location refers to a location of interest in a DNA sequencing dataset that could potentially contain SNVs or indels.
The term “true positive” refers to a mutation that indicates real biology, for example, the presence of a potential cancer, disease, or germline mutation in an individual. True positives are not caused by mutations naturally occurring in healthy individuals (e.g., recurrent mutations) or other sources of artifacts such as process errors during assay preparation of nucleic acid samples.
The term “false positive” refers to a mutation incorrectly determined to be a true positive. Generally, false positives can be more likely to occur when processing sequence reads associated with greater mean noise rates or greater uncertainty in noise rates.
The term “cell-free DNA” or “cfDNA” refers to nucleic acid fragments that circulate in an individual's body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells.
The term “circulating tumor DNA” or “ctDNA” refers to nucleic acid fragments that originate from tumor cells or other types of cancer cells, which can be released into an individual's bloodstream as a result of biological processes such as apoptosis or necrosis of dying cells or actively released by viable tumor cells.
The term “alternative allele” or “ALT” refers to an allele having one or more mutations relative to a reference allele, e.g., corresponding to a known gene.
The term “sequencing depth” or “depth” refers to a total number of read segments from a sample obtained from an individual.
The term “alternate depth” or “AD” refers to a number of read segments in a sample that supports an ALT, e.g., include mutations of the ALT.
The term “alternate frequency” or “AF” refers to the frequency of a given ALT. The AF can be determined by dividing the corresponding AD of a sample by the depth of the sample for the given ALT.
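The AD-to-depth relationship above can be illustrated with a minimal sketch (the function name and values are illustrative only):

```python
def alternate_frequency(ad, depth):
    """AF = AD / depth: fraction of read segments supporting the ALT."""
    if depth <= 0:
        raise ValueError("sequencing depth must be positive")
    return ad / depth

# 6 ALT-supporting read segments out of 2000 total -> AF = 0.003
```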
In step 110, a nucleic acid sample (DNA or RNA) is extracted from a subject. The subject can be an individual. The sample can be any subset of the human genome or the whole genome. The sample can be extracted from a subject known to have or suspected of having cancer. The sample can include blood, plasma, serum, urine, feces, saliva, other types of bodily fluids, or any combination thereof. In some embodiments, methods for drawing a blood sample (e.g., syringe or finger prick) can be less invasive than procedures for obtaining a tissue biopsy, which can require surgery. The extracted sample can comprise cfDNA and/or ctDNA. For healthy individuals, the human body can naturally clear out cfDNA and other cellular debris. If a subject has cancer or a disease, ctDNA in an extracted sample can be present at a detectable level for diagnosis.
In step 120, a sequencing library is prepared. During the library preparation, the nucleic acid samples are randomly cleaved into thousands or millions of fragments. Unique molecular identifiers (UMI) are added to the nucleic acid fragments (e.g., DNA fragments) through adapter ligation. The UMIs are short nucleic acid sequences (e.g., 4-10 base pairs) that are added to ends of DNA fragments during adapter ligation. In some embodiments, UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment. During PCR amplification following adapter ligation, the UMIs are replicated along with the attached DNA fragment, which provides a way to identify sequence reads that came from the same original fragment in downstream analysis.
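For illustration, the downstream use of UMIs to identify reads from the same original fragment can be sketched as a per-base majority-vote consensus; this simplified sketch assumes equal-length reads grouped by (UMI, alignment start), whereas actual pipelines also handle length mismatches and base-quality weighting:

```python
from collections import Counter, defaultdict

def collapse_by_umi(reads):
    """Collapse raw reads sharing a UMI (and alignment start) into one
    consensus read by per-base majority vote."""
    groups = defaultdict(list)
    for umi, start, seq in reads:
        groups[(umi, start)].append(seq)
    consensus = {}
    for key, seqs in groups.items():
        # vote column by column across the grouped reads
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

reads = [
    ("ACGT", 100, "TTAG"),
    ("ACGT", 100, "TTAG"),
    ("ACGT", 100, "TTAC"),   # likely amplification/sequencing error
    ("GGCC", 250, "CCGA"),
]
collapsed = collapse_by_umi(reads)
# collapsed[("ACGT", 100)] == "TTAG"; collapsed[("GGCC", 250)] == "CCGA"
```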
In step 130, targeted DNA sequences are enriched from the library. During enrichment, hybridization probes (also referred to herein as “probes”) are used to target, and pull down, nucleic acid fragments informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer type or tissue of origin). For a given workflow, the probes can be designed to anneal (or hybridize) to a target (complementary) strand of DNA or RNA. The target strand can be the “positive” strand (e.g., the strand transcribed into mRNA, and subsequently translated into a protein) or the complementary “negative” strand. The probes can range in length from 10s, 100s, or 1000s of base pairs. In some embodiments, the probes are designed based on a gene panel to analyze particular mutations or target regions of the genome (e.g., of the human or another organism) that are suspected to correspond to certain cancers or other types of diseases. Moreover, the probes can cover overlapping portions of a target region. By using a targeted gene panel rather than sequencing all expressed genes of a genome, also known as “whole exome sequencing,” the method 100 can be used to increase the sequencing depth of the target regions, where depth refers to the count of the number of times a given target sequence within the sample has been sequenced. Increasing sequencing depth reduces the required input amounts of the nucleic acid sample. After a hybridization step, the hybridized nucleic acid fragments are captured and can also be amplified using PCR.
In step 140, sequence reads are generated from the enriched DNA sequences. Sequencing data can be acquired from the enriched DNA sequences by known means in the art. For example, the method 100 can include next-generation sequencing (NGS) techniques including synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators.
In some embodiments, the sequence reads can be aligned to a reference genome using known methods in the art to determine the alignment position information. The alignment position information can indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information can also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome can be associated with a gene or a segment of a gene.
In various embodiments, a sequence read is comprised of a read pair denoted as R1 and R2. For example, the first read R1 can be sequenced from a first end of a nucleic acid fragment whereas the second read R2 can be sequenced from the second end of the nucleic acid fragment. Therefore, nucleotide base pairs of the first read R1 and second read R2 can be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome. Alignment position information derived from the read pair R1 and R2 can include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R2). In other words, the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds. An output file having SAM (sequence alignment map) format or BAM (binary) format can be generated and output for further analysis such as variant calling, as described below with respect to
In step 300, the sequence processor 205 collapses sequence reads of the input sequencing data. In some embodiments, collapsing sequence reads includes using UMIs, and optionally alignment position information from sequencing data of an output file (e.g., from the method 100 shown in
In step 305, the sequence processor 205 stitches the collapsed reads based on portions of overlapping nucleotide sequences among two or more sequence reads. In some embodiments, the sequence processor 205 compares nucleotide sequences between a first read and a second read to determine whether nucleotide base pairs of the first and second reads overlap. The two sequence reads can also be compared to a reference genome. In an example use case, responsive to determining that an overlap (e.g., of a given number of nucleotide bases) between the first and second reads is greater than a threshold length (e.g., a threshold number of nucleotide bases), the sequence processor 205 designates the first and second reads as “stitched”; otherwise, the collapsed reads are designated “unstitched.” In some embodiments, the first and second reads are stitched if the overlap is greater than the threshold length and if the overlap is not a sliding overlap. For example, a sliding overlap can include a homopolymer run (e.g., a single repeating nucleotide base), a dinucleotide run (e.g., two-nucleotide base sequence), or a trinucleotide run (e.g., three-nucleotide base sequence), where the homopolymer run, dinucleotide run, or trinucleotide run has at least a threshold length of base pairs.
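The stitching decision described above can be sketched as follows; the threshold value and the exact overlap-detection strategy are illustrative assumptions:

```python
def suffix_prefix_overlap(r1, r2):
    """Length of the longest suffix of r1 that equals a prefix of r2."""
    best = 0
    for k in range(1, min(len(r1), len(r2)) + 1):
        if r1[-k:] == r2[:k]:
            best = k
    return best

def is_repeat_run(seq, max_unit=3):
    """True if seq is entirely a repeat of a 1-, 2-, or 3-base unit
    (homopolymer, dinucleotide, or trinucleotide run)."""
    for unit in range(1, max_unit + 1):
        if len(seq) >= unit and all(seq[i] == seq[i % unit] for i in range(len(seq))):
            return True
    return False

def should_stitch(r1, r2, threshold=5):
    """Designate reads 'stitched' when the overlap exceeds the threshold
    length and the overlapping sequence is not a sliding (repetitive)
    overlap; otherwise 'unstitched'."""
    k = suffix_prefix_overlap(r1, r2)
    return k > threshold and not is_repeat_run(r1[-k:])

# 7-base non-repetitive overlap -> stitched;
# 8-base homopolymer (sliding) overlap -> unstitched
```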
In step 310, the sequence processor 205 assembles reads into paths. In some embodiments, the sequence processor 205 assembles reads to generate a directed graph, for example, a de Bruijn graph, for a target region (e.g., a gene). Unidirectional edges of the directed graph represent sequences of k nucleotide bases (also referred to herein as “k-mers”) in the target region, and the edges are connected by vertices (or nodes). The sequence processor 205 aligns collapsed reads to a directed graph such that any of the collapsed reads can be represented in order by a subset of the edges and corresponding vertices.
In some embodiments, the sequence processor 205 determines sets of parameters describing directed graphs and processes directed graphs. Additionally, the set of parameters can include a count of successfully aligned k-mers from collapsed reads to a k-mer represented by a node or edge in the directed graph. The sequence processor 205 stores, e.g., in the sequence database 210, directed graphs and corresponding sets of parameters, which can be retrieved to update graphs or generate new graphs. For instance, the sequence processor 205 can generate a compressed version of a directed graph (e.g., or modify an existing graph) based on the set of parameters. In an example use case, in order to filter out data of a directed graph having lower levels of importance, the sequence processor 205 removes (e.g., “trims” or “prunes”) nodes or edges having a count less than a threshold value, and maintains nodes or edges having counts greater than or equal to the threshold value.
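The k-mer counting and trimming described above can be sketched as follows; representing the graph as a flat k-mer count table (rather than explicit vertices and edges) is a simplification for illustration:

```python
from collections import Counter

def build_kmer_counts(reads, k=4):
    """Count k-mers across collapsed reads; each k-mer corresponds to an
    edge of a de Bruijn graph whose vertices are (k-1)-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def prune(counts, threshold=2):
    """Trim low-support edges: keep only k-mers whose count is greater
    than or equal to the threshold value."""
    return {kmer: c for kmer, c in counts.items() if c >= threshold}

reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
graph = prune(build_kmer_counts(reads, k=4), threshold=2)
# {"ACGT": 2, "CGTA": 2, "GTAC": 2}; k-mers unique to "ACGAAC" are pruned
```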
In step 315, the variant caller 240 generates candidate variant reads from the paths assembled by the sequence processor 205. A variant can correspond to an SNV or an indel. In some embodiments, the variant caller 240 can generate the candidate variant reads by comparing a directed graph (which could have been compressed by pruning edges or nodes in step 310) to a reference sequence of a target region of a genome. The variant caller 240 can align edges of the directed graph to the reference sequence, and records the genomic positions of mismatched edges and mismatched nucleotide bases adjacent to the edges as the locations of candidate variants. Additionally, the variant caller 240 can generate candidate variant reads based on the sequencing depth of a target region. In particular, the variant caller 240 can be more confident in identifying variants in target regions that have greater sequencing depth, for example, because a greater number of sequence reads help to resolve (e.g., using redundancies) mismatches or other base pair variations between sequences.
In some embodiments, variant reads can be classified into different read tiers based on the quality of the variant read. The quality of the variant read can correspond to the location of the potential variant relative to the overlapping and/or complementary regions of the collapsed sequences. During sample preparation (e.g., a library preparation process) for massively parallel sequencing, the nucleic acid samples of a subject individual can be cleaved randomly before parallel sequencing is performed. Different copies of the same nucleic acid sequence can thus be cleaved at different positions. Hence, some of the enriched fragments have overlapping regions that can be stitched with other enriched fragments, while other enriched fragments do not. Some enriched fragments can also have complementary sequences that are also enriched, thus generating a double-stranded fragment in sequence processing. As a result, variant reads at different sequence locations can correspond to different qualities. For example, a variant read at a location in which both complementary strands of fragments are enriched often has better quality than another variant read at a location that only finds support from a single-stranded fragment. The details of the read tiers of the variant reads will be further discussed in
In some embodiments, the variant caller 240 generates candidate variant reads using the models 225 to determine expected noise rates for sequence reads from a subject. Each of the models 225 can be a Bayesian hierarchical model. A Bayesian hierarchical model can be one of many possible model architectures that can be used to generate candidate variants and which are related to each other in that they all model position-specific noise information in order to improve the sensitivity/specificity of variant calling. More specifically, the machine learning engine 220 trains the model 225 using samples from healthy individuals to model the expected noise rates per position of sequence reads. In some embodiments, the variant reads corresponding to different read tiers can be treated differently by different models each specific to a particular read tier. The results of each model can be combined to generate a combined result. The details of stratifying read tiers and models will be further discussed below in association with
Further, multiple different models can be stored in the model database 215 or retrieved for application post-training. For example, a first model is trained to model SNV noise rates and a second model is trained to model indel noise rates. Further, the score engine 235 can use parameters of the models 225 to determine a likelihood of one or more true positives in a sequence read. The score engine 235 can determine a quality score (e.g., on a logarithmic scale) based on the likelihood. For example, the quality score is a Phred quality score Q=−10·log10 P, where P is the likelihood of an incorrect candidate variant call (e.g., a false positive).
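The Phred-scale conversion above can be sketched directly:

```python
from math import log10

def phred_quality(p_false):
    """Phred-scale quality: Q = -10 * log10(P), where P is the likelihood
    of an incorrect candidate variant call (a false positive)."""
    return -10.0 * log10(p_false)

# P = 0.001 -> Q = 30; P = 0.000001 -> Q = 60
```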
In step 320, the score engine 235 scores the variant reads based on the models 225 or corresponding likelihoods of true positives or quality scores. Training and application of the model 225 are described in more detail below.
In step 325, the processing system 200 outputs the analysis result with respect to the variants. In some embodiments, the processing system 200 outputs some or all of the determined candidate variants along with the corresponding scores. Downstream systems, e.g., external to the processing system 200 or other components of the processing system 200, can use the variants and scores for various applications including, but not limited to, predicting the presence of cancer, disease, or germline mutations.
The probability mass functions (PMFs) illustrated in
Using the example of
zp˜Multinom({right arrow over (θ)})
Together, the latent variable zp, the vector of mixture components {right arrow over (θ)}, α, and β allow the model for a given position, that is, a sub-model of the Bayesian hierarchical model 225, to have parameters that “pool” knowledge about noise; that is, they represent similarity in noise characteristics across multiple positions. Thus, positions of sequence reads can be pooled or grouped into latent classes by the model. Advantageously, samples from any of these “pooled” positions can help train the shared parameters. A benefit of this is that the processing system 200 can determine a model of noise in healthy samples even if there is little to no direct evidence of alternate alleles having been observed for a given position previously (e.g., in the healthy tissue samples used to train the model).
The covariate xp (e.g., a predictor) encodes known contextual information regarding position p which can include, but is not limited to, information such as trinucleotide context, mappability, segmental duplication, or other information associated with sequence reads. Trinucleotide context can be based on a reference allele and can be assigned numerical (e.g., integer) representation. For instance, “AAA” is assigned 1, “ACA” is assigned 2, “AGA” is assigned 3, etc. Mappability represents a level of uniqueness of the alignment of a read to a particular target region of a genome. For example, mappability is calculated as the inverse of the number of position(s) where the sequence read will uniquely map. Segmental duplications correspond to long nucleic acid sequences (e.g., having a length greater than approximately 1000 base pairs) that are nearly identical (e.g., greater than 90% match) and occur in multiple locations in a genome as a result of natural duplication events (e.g., not associated with cancer or a disease).
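The covariate encodings above can be sketched as follows; the text fixes only the first three trinucleotide codes (“AAA” = 1, “ACA” = 2, “AGA” = 3), so the ordering of the remaining codes (flanking bases outer, center base varying fastest) is an assumption for illustration:

```python
# integer codes for trinucleotide context, consistent with the three
# examples given in the text; the rest of the ordering is assumed
TRINUCLEOTIDES = [left + center + right
                  for left in "ACGT"
                  for right in "ACGT"
                  for center in "ACGT"]
TRINUC_CODE = {t: i + 1 for i, t in enumerate(TRINUCLEOTIDES)}
# TRINUC_CODE["AAA"] == 1, TRINUC_CODE["ACA"] == 2, TRINUC_CODE["AGA"] == 3

def mappability(num_unique_mapping_positions):
    """Inverse of the number of positions where the read maps uniquely."""
    return 1.0 / num_unique_mapping_positions
# mappability(4) == 0.25
```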
The expected mean AD frequency of an SNV at position p is modeled by the parameter μp. In some embodiments, the parameter μp corresponds to the mean AD count per sequencing depth dip:
μp˜Gamma(αzp, βzp)
In some embodiments, other functions can be used to represent μp, examples of which include but are not limited to: a log-normal distribution with log-mean γzp.
The variance of a distribution can be determined by the mean variant frequency μp and the dispersion parameter rp. For example, in the case of a Gamma distribution, the variance, vp, can be determined by:
Lambda λp can be the mean variant count, which can be determined by μp multiplied by the sequencing depth dip, that is, λp=dip·μp.
In the example shown in
y_ip|d_ip˜Poisson(d_ip·μp)
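The Gamma-Poisson hierarchy described above can be sketched as a small generative sampler. The shape, rate, and depth values below are hypothetical placeholders, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise_counts(alpha, beta, depths, rng):
    """Sample noise AD counts under the Gamma-Poisson hierarchy:
    mu_p ~ Gamma(alpha, rate=beta) is the mean AD frequency at position p,
    y_ip | d_ip ~ Poisson(d_ip * mu_p) is the observed AD count at depth d_ip."""
    mu_p = rng.gamma(shape=alpha, scale=1.0 / beta)  # numpy takes scale = 1/rate
    counts = rng.poisson(np.asarray(depths) * mu_p)
    return mu_p, counts

# Hypothetical shape/rate for one latent class z_p and a few sample depths.
mu, counts = sample_noise_counts(alpha=2.0, beta=2000.0, depths=[1500, 2200, 1800], rng=rng)
```

Drawing μp once and reusing it across samples mirrors the pooling described above: one position-level rate drives all of that position's observed counts.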
In some embodiments, other functions can be used to represent y_ip, an example of which is a negative binomial (NB) distribution:
y_ip|d_ip˜NB(d_ip·μp,rp)
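The negative binomial above is parameterized by a mean (d_ip·μp) and a dispersion rp; a minimal sketch of this mean/dispersion form via the Gamma-Poisson pmf follows. The numeric values are illustrative only:

```python
import math

def nb_pmf(k, lam, r):
    """pmf of a negative binomial with mean lam and dispersion r, written in
    the Gamma-Poisson form: with p = r / (r + lam), the mean is lam and the
    variance is lam + lam**2 / r."""
    p = r / (r + lam)
    return math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                    + r * math.log(p) + k * math.log1p(-p))

# Sanity checks: the pmf should sum to ~1 and have mean lam over a wide support.
total = sum(nb_pmf(k, lam=3.0, r=2.0) for k in range(400))
mean = sum(k * nb_pmf(k, lam=3.0, r=2.0) for k in range(400))
```

The mean/dispersion parameterization is convenient here because the mean scales directly with sequencing depth while rp controls overdispersion relative to a plain Poisson.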
The mean variant frequency μp, the mean variant count λp (λp=d_ip·μp), and the dispersion parameter rp can be specific to each position p of the sequence reads.
The expected mean total indel frequency at position p is modeled by the parameter μp. In some embodiments, the parameter μp corresponds to the mean indel count per sequencing depth, d_ip:

μp˜Gamma(α_xp,β_xp)
In some embodiments, other functions can be used to represent μp, examples of which include but are not limited to: negative binomial, Conway-Maxwell-Poisson distribution, zeta distribution, and zero-inflated Poisson. The shape parameter α_xp and the rate parameter β_xp can depend on the covariate xp.
The variance of a distribution can be determined by the mean variant frequency μp and the dispersion parameter rp. For example, in the case of a Gamma distribution, the variance, vp, can be determined by:
Lambda λp can be the mean variant count, which can be determined by μp multiplied by the sequencing depth, d_ip (λp=μp×d_ip).
The observed indels at position p in a human population sample i (of a healthy individual) are modeled by the distribution of y_ip:
y_ip|d_ip˜Poisson(d_ip·μp)
In some embodiments, other functions can be used to represent y_ip, an example of which is a negative binomial distribution:
y_ip|d_ip˜NB(d_ip·μp,rp)
The mean variant frequency μp, the mean variant count λp (λp=d_ip·μp), and the dispersion parameter rp can be specific to each position p of the sequence reads.
Because indels can be of varying lengths, an additional length parameter is present in the indel model that is not present in the model for SNVs. As a consequence, the example model shown in
which represents the indel distribution under noise conditional on parameters. The distribution can be a multinomial given indel intensity y_ip.
In some embodiments, a Dirichlet-Multinomial function or other types of models can be used to represent y_ip.
By architecting the model in this manner, the machine learning engine 220 can decouple the learning of indel intensity (i.e., noise rate) from the learning of indel length distribution. Independently inferring an expectation of whether an indel will occur in healthy samples and an expectation of the indel's length at a position may improve the sensitivity of the model. For example, the length distribution can be more stable relative to the indel intensity at a number of positions or regions in the genome, or vice versa.
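The decoupling described above can be sketched as a two-stage sampler: an intensity draw for how many noise indels occur, then a separate multinomial draw over indel lengths. The depth, rate, and length probabilities below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_indels(depth, mu_p, length_probs, rng):
    """Decoupled indel noise: intensity first, then lengths.

    The number of noise indels is Poisson(depth * mu_p); their lengths are then
    drawn from an independent multinomial over length classes, so the two
    parts can be learned and updated separately."""
    n_indels = rng.poisson(depth * mu_p)                     # indel intensity
    length_counts = rng.multinomial(n_indels, length_probs)  # indel length distribution
    return n_indels, length_counts

# Hypothetical length classes of 1..4 bp with decreasing probability.
n, lengths = sample_indels(depth=2000, mu_p=0.002,
                           length_probs=[0.6, 0.25, 0.1, 0.05], rng=rng)
```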
In some embodiments, the machine learning engine 220 performs model fitting by storing draws of μp in the parameters database 230. The model is trained or fitted through posterior sampling, as previously described. In some examples, the draws of μp are stored in a matrix data structure having a row per position of the set of positions sampled and a column per draw from the joint posterior (e.g., of all parameters conditional on the observed data). The number of rows R can be greater than 6 million and the number of columns for N iterations of samples can be in the thousands. In some embodiments, the row and column designations are different than the embodiment shown in
rp=λp²/(vp−λp)

where λp and vp are the mean and variance of the sampled values of μp at the position, respectively. Those of skill in the art will appreciate that other functions for determining rp can also be used, such as a maximum likelihood estimate. Different noise parameters can be determined for different read tiers. For example, each read tier can have different values of λp and rp.
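The method-of-moments relation above can be sketched as follows. The posterior draws and depth are hypothetical, and scaling the draws into count space at a fixed depth is an assumption made here for illustration; the estimator returns infinity when the sampled variance does not exceed the Poisson variance:

```python
import statistics

def dispersion_from_draws(mu_draws, depth):
    """Method-of-moments dispersion from posterior draws of mu_p, scaled to
    count space at a given depth: r_p = lam_p**2 / (v_p - lam_p)."""
    counts = [m * depth for m in mu_draws]
    lam_p = statistics.fmean(counts)
    v_p = statistics.pvariance(counts)
    if v_p <= lam_p:  # no overdispersion beyond Poisson; dispersion is unbounded
        return float("inf")
    return lam_p ** 2 / (v_p - lam_p)

r_p = dispersion_from_draws([0.001, 0.005], depth=1000)  # counts 1.0 and 5.0
```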
The machine learning engine 220 can also perform dispersion re-estimation of the dispersion parameters in the reduced matrix, given the rate parameters. In some embodiments, following Bayesian training and posterior approximation, the machine learning engine 220 performs dispersion re-estimation by retraining for the dispersion parameters based on a negative binomial maximum likelihood estimator per position. The rate parameter can remain fixed during retraining. In some embodiments, the machine learning engine 220 determines the dispersion parameters r′p at each position for the original AD counts of the training data (e.g., y_ip).
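A sketch of this re-estimation step, holding the rate fixed and maximizing the negative binomial likelihood over candidate dispersions. A coarse log-spaced grid search stands in for a proper 1-D optimizer, and the counts, depths, and rate are hypothetical:

```python
import math

def nb_logpmf(k, mean, r):
    """log pmf of a negative binomial with the given mean and dispersion r."""
    p = r / (r + mean)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log1p(-p))

def refit_dispersion(counts, depths, mu, grid=None):
    """Re-estimate the dispersion r' by maximum likelihood with the rate mu
    held fixed, scanning a log-spaced grid of candidate dispersions."""
    if grid is None:
        grid = [10 ** (e / 4.0) for e in range(-8, 17)]  # ~0.01 .. 10000
    def loglik(r):
        return sum(nb_logpmf(y, d * mu, r) for y, d in zip(counts, depths))
    return max(grid, key=loglik)

r_prime = refit_dispersion(counts=[2, 3, 4, 2, 3], depths=[1000] * 5, mu=0.003)
```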
During the application of trained models, the processing system 200 can access the dispersion (e.g., shape) parameters r′p and rate parameters λp to determine a function parameterized by r′p and λp. The function can be used to determine a posterior predictive probability mass function (or probability density function) for a new sample of a subject. Based on the predicted probability of a certain AD count at a given position, the processing system 200 can account for site-specific noise rates per position of sequence reads when detecting true positives from samples. Referring back to the example use case described with respect to
The Bayesian hierarchical models and distributions used to model various parameters in the Bayesian hierarchical models can be trained separately for different read tiers of variant reads. For example, each read tier can have its own Bayesian hierarchical model that has its own parameters such as α_xp, β_xp, μp, and rp.
For more information regarding the training and use of the Bayesian hierarchical models that model the noise levels of a sequencing dataset, U.S. patent application Ser. No. 16/153,593, filed on Oct. 5, 2018 and entitled “Site-Specific Noise Model for Targeted Sequencing,” is hereby incorporated by reference for all purposes.
It is noted that in a sequence amplification process (e.g., massively parallel sequencing), one or more sequences of a sample (e.g., an individual) can be cleaved into different fragments in a pseudo-random fashion. In some cases, not all fragments are ligated with UMIs, and unligated fragments are washed away before the ligated fragments are enriched. Hence, the enriched fragments are at least partially random in each sequencing run. The extent of overlap between different fragments can vary. For example, some of the enriched fragments can have overlapping regions that can be stitched with other enriched fragments. Some enriched fragments can also have complementary sequences (e.g., forward and reverse sequences, positive and negative sequences, top and bottom sequences, 5′ to 3′ and 3′ to 5′ sequences) that are enriched, thus generating a double-stranded read for all or part of the sequence read. As a result, variant reads at different sequence locations can, in some examples, include a complementary and/or overlapping sequence read that confirms the variant. Hence, each variant read can correspond to a different read tier quality. For example, a variant read at a location at which both complementary strands of fragments are enriched often has better quality than a variant read at a second location at which only a single fragment is enriched. There is an increased likelihood that a variant read at a location that is not included within an overlapping region or a complementary region is attributable to noise, and not to an actual variant present in the sample of the subject.
By way of example, in
At
A third example read tier 830 includes variant reads located or otherwise fully embedded within single-stranded (e.g., non-duplex) and stitched reads. In the third read tier 830, the potential variant location is included within the overlapping region of two or more sequence reads, and thus the sequence reads include a stitched region. However, the sequence reads are single-stranded because the sequence reads (such as the two illustrated 5′ to 3′ sequence reads) do not include a complementary region (e.g., the sequence reads are based on the 5′ to 3′ strands only and are not supported by a complementary 3′ to 5′ strand). In some cases (not illustrated), one or more complementary sequence reads (e.g., a 3′ to 5′ sequence read) can be found at example read tier 3 but not include the potential variant location. Thus, the potential variant read belongs to the third read tier 830 representing single-stranded but stitched reads.
Further as shown at
In some embodiments, sequence reads of a sample can be stratified into the four read tiers illustrated in
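The four-tier stratification can be sketched as a simple classifier over two booleans. Only the third tier (single-stranded but stitched) is fully described above, so the ordering of the other tiers below is an assumption for illustration, with duplex-and-stitched treated as the highest-quality tier and single-stranded unstitched as the lowest:

```python
from dataclasses import dataclass

@dataclass
class ReadSupport:
    in_complementary_region: bool  # variant location covered by both strands (duplex)
    in_overlapping_region: bool    # variant location inside a stitched overlap

def read_tier(support: ReadSupport) -> int:
    """Stratify a processed read covering the variant location into one of
    four tiers (tier ordering is assumed here for illustration)."""
    if support.in_complementary_region:
        return 1 if support.in_overlapping_region else 2
    return 3 if support.in_overlapping_region else 4

assert read_tier(ReadSupport(False, True)) == 3  # single-stranded but stitched
```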
Sequence reads can, additionally or alternatively, be classified into different read tiers by other classification methods. For example, if the variants are SNVs, each read tier can be further sub-divided into twelve additional sub-tiers based on the types of nucleotide substitution (e.g., A>C, A>T, G>C, etc.) (see, e.g.,
Referring to
Comparing the differences among
Turning now to
In step 910, a processing system can access a DNA sequencing dataset generated by DNA sequencing. For example, the DNA sequencing can be a type of massively parallel DNA sequencing, such as next-generation sequencing (NGS). The DNA sequencing dataset includes a plurality of processed sequence reads that include a variant location of interest (e.g., a specific gene location in a DNA sequence). At least some of the processed sequence reads can be generated from collapsing and stitching of raw sequence reads in the DNA sequencing, generated such as by the process described in
The processed sequence reads that include the variant location of interest can be of different base-pair lengths and can overlap and/or complement one another to different extents. In step 920, the processing system can stratify the plurality of processed sequence reads into different read tiers. The different read tiers can be stratified based on the quality of the sequence reads. For example, the processed sequence reads can be stratified based on whether the variant location is included in an overlapping region and/or is included in a complementary region, as discussed in association with
In step 930, the processing system can determine, for each read tier, a stratified sequencing depth at the variant location. For each read tier, the stratified sequencing depth can be the sequencing depth of the sequence reads that are stratified into the read tier. In other words, a stratified sequencing depth can be the total number of sequence reads that are stratified into the read tier. The processing system can also determine the actual variant count for each read tier. For example, for a read tier, a majority of the sequence reads may not contain an actual variant (whether it is an SNV or indel) at the variant location. In some cases, only a handful of sequence reads include an actual variant at the variant location. A stratified variant count can be the total number of actual variant counts for a particular read tier.
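The stratified depth and variant tallies of steps 920-930 can be sketched as follows; the (tier, has_variant) input encoding is an assumption made here for illustration:

```python
from collections import Counter

def stratified_counts(reads):
    """Tally per-tier sequencing depth and variant count.

    `reads` is an iterable of (tier, has_variant) pairs for reads covering the
    variant location; the stratified depth is the number of reads in each tier
    and the stratified variant count is how many of those carry the variant.
    """
    depth, variants = Counter(), Counter()
    for tier, has_variant in reads:
        depth[tier] += 1
        variants[tier] += int(has_variant)
    return depth, variants

depth, variants = stratified_counts(
    [(1, False), (1, True), (3, False), (3, False), (3, True)])
```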
In step 940, the processing system can determine, for each read tier, one or more noise parameters conditioned on the stratified sequencing depth of the read tier. The noise parameters can be the parameters of a noise model that is specific to the read tier. For example, the processing system can include a plurality of stratified noise models, each specific to a read tier. The stratified noise models (or some of them) can correspond to the Bayesian hierarchical models described in
A probability distribution of a stratified variant count for each read tier can be modeled by a noise distribution. The probability distribution of the stratified variant count can depend on the type of distribution used and one or more parameters that define the noise distribution. For example, in the case of a Bayesian hierarchical model discussed, the distribution of the stratified variant count can correspond to a posterior distribution conditioned on two parameters. The parameters can be the stratified mean variant count conditioned on the stratified sequencing depth and the dispersion parameter. Each of the parameters can further correspond to one or more prior distributions that affect the parameters. For example, the stratified mean variant count conditioned on the stratified sequencing depth can be modeled by a Gamma distribution. Because a prior distribution can describe the distribution of a parameter, the prior distribution can also be referred to as a parameter distribution.
For each read tier, the processing system can determine the one or more noise parameters conditioned on the stratified sequencing depth by inputting the stratified sequencing depth, obtained from the dataset of the subject, to the trained noise model. For example, a trained noise model can access a parameter distribution (e.g., a prior distribution) specific to the read tier. The parameter distribution can be formed based on a stratified training set of reference individuals and can describe the distribution of the stratified training set. The trained noise model can use the parameter distribution to determine the noise parameter conditioned on the stratified sequencing depth corresponding to the read tier.
While a Bayesian hierarchical model is used as an example of a noise model, in various embodiments different types of trained machine learning models can be used as the noise models. Also, depending on the model used, different noise parameters can be used to model the noise distribution.
In step 950, the processing system can generate an output for the noise model specific to a read tier based on the one or more noise parameters conditioned on the stratified sequencing depth of the read tier. The generation of the output can be repeated for different read tiers. Depending on the embodiment, different types of outputs can be generated. For instance, in some embodiments, each stratified noise model does not perform further computation after the noise parameters are determined. The output of a noise model can be the one or more noise parameters that are conditioned on the stratified sequencing depth determined for each tier. In the case of a negative binomial distribution being used as the noise distribution to model the stratified variant count, the output of the noise model can be a stratified mean variant count conditioned on the stratified sequencing depth and a dispersion parameter. In some embodiments, after determining the noise parameters, each stratified noise model can generate a posterior distribution. In such embodiments, the output of a noise model specific to a read tier can be a likelihood that a variant count of the read tier for subsequently observed data being greater than or equal to the variant count observed in the DNA dataset of the subject individual is attributable to noise. Other suitable outputs are also possible.
In step 960, the processing system can combine the generated noise model outputs to produce a combined result. The combined result can be a representation of the overall processing result of the DNA sequencing dataset of the subject individual. The combined result can take any suitable form. In some embodiments, the combined result can include a likelihood that a total variant count for subsequently observed data being greater than or equal to the total variant count observed in the plurality of processed sequence reads is attributable to noise. Put differently, the likelihood can represent the likelihood of an event that is as or more extreme than the total variant count observed in the plurality of processed sequence reads of the DNA dataset of the subject individual. In some cases, the likelihood can correspond to the p-value that is used in a null hypothesis test. The manner in which the outputs of the stratified noise models are combined to generate a combined result can depend on the embodiment. In some embodiments, a moment matching method, which will be discussed in detail in
In step 970, the processing system can determine a quality score of the combined result. In some examples, the combined result, such as in the form of likelihood P (e.g., a p-value), can be converted into a Phred-scaled quality score, where Q=−10·log10 P. For example, a Phred quality score of 20 indicates a P= 1/100 chance of an incorrect variant call, and a Phred quality score of 60 indicates a P= 1/1,000,000 chance of an incorrect variant call. Thus, a higher Phred quality score corresponds to a greater confidence for the detection of an actual mutation. The quality score can be used to distinguish a true positive from a false positive. In some embodiments, in response to the quality score being higher than a predetermined threshold, the processing system can indicate that the individual is statistically likely to have a mutation at the variant location.
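The Phred conversion described above can be sketched directly:

```python
import math

def phred(p):
    """Phred-scaled quality score Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)

q20 = phred(0.01)  # ≈ 20: a 1/100 chance of an incorrect variant call
q60 = phred(1e-6)  # ≈ 60: a 1/1,000,000 chance of an incorrect variant call
```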
The step 1010 can include several sub-steps. For each read tier, the processing system can determine the stratified sequencing depth. The first and second moments of the noise distribution for each tier can be used as the noise parameters to define the noise distribution. In step 1012, based on the stratified sequencing depth, the processing system can determine the first moment (e.g., the mean variant count) of each read tier. For example, in the case of the Bayesian hierarchical model discussed above, the variant frequency for a particular read tier can be modeled as a Gamma-distributed random variable having a tier-specific shape parameter and rate parameter. Each read tier can have its own shape parameter and rate parameter that are determined based on reference sample datasets. Hence, the variant frequency of each read tier conditioned on the stratified sequencing depth can be different.
The processing system can determine the first moment, the stratified mean variant count, λp, for each tier by multiplying the variant frequency by the stratified sequencing depth:
λp=μp×d_ip
In step 1014, the processing system can also determine the second moment, the variance, of each read tier. In the case of a Bayesian hierarchical model having a Gamma-distributed variant frequency, the variance of each read tier can be determined by the mean variant count, λp, and a dispersion parameter, rp. For example, the variance, vp, can be determined by:

vp=λp+λp²/rp
In step 1016, the processing system can determine the overall mean variant count (the overall first moment) and the overall variance (the overall second moment) by moment matching. In some cases, the processing system can perform the moment matching by summing the moments for different read tiers to obtain the overall moment. For example, the overall mean variant count across all read tiers conditioned on the total sequencing depth can be determined by:
λall=Σλp
Likewise, the overall variance across all read tiers can be determined by summing the variance of each read tier:
vall=Σvp
The processing system can model the likelihood of the overall observed variant count conditioned on the total sequencing depth by an overall noise distribution. The overall noise distribution can be a negative binomial distribution that is parameterized by the overall mean, λall, and an overall dispersion parameter, rall. The overall dispersion parameter can be determined by the overall mean and the overall variance:

rall=λall²/(vall−λall)
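The moment-matching combination of steps 1012-1016 can be sketched as follows; the per-tier mean and dispersion values are hypothetical:

```python
def combine_tiers(tier_params):
    """Moment-match per-tier negative binomial noise into one overall NB.

    `tier_params` maps tier -> (lam_p, r_p). Means and variances add across
    independent tiers (lam_all = sum of lam_p; v_all = sum of v_p with
    v_p = lam_p + lam_p**2 / r_p), and the overall dispersion follows from
    r_all = lam_all**2 / (v_all - lam_all)."""
    lam_all = sum(lam for lam, _ in tier_params.values())
    v_all = sum(lam + lam * lam / r for lam, r in tier_params.values())
    r_all = lam_all ** 2 / (v_all - lam_all)
    return lam_all, r_all

lam_all, r_all = combine_tiers({1: (0.2, 5.0), 2: (0.5, 2.0), 3: (0.8, 1.0)})
```

Summing moments is valid because the tier counts are modeled as independent; the resulting (λall, rall) pair then parameterizes the single overall negative binomial used in step 1020.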
In step 1020, the processing system can determine the overall likelihood using an overall noise distribution that is modeled by the overall first moment and the overall second moment. For example, the random variable y_i, the total variant count conditioned on the total sequencing depth d_i, can be modeled by a negative binomial distribution:

y_i|d_i˜NB(λall,rall)
The random variable y_i represents the total variant count observed across all read tiers.
Focusing on only the read tier for the double-stranded variant count (x-axis) first, in an example, the observed stratified variant count from a subject individual is 2. For the same read tier, an event having a stratified variant count of 3 is less likely (more extreme) than the actual observed stratified variant count because a variant read at a potential variant location is unlikely compared to a non-variant read at the potential variant location. Likewise, another event having a stratified variant count of 4 is even less likely than the actual observed stratified variant count. In other words, the less likely (more extreme) events occupy the space of counts larger than the observed variant count, spanning all the way to infinity. Conversely, an event having a stratified variant count of 1 or 0 is more likely than the actual observed stratified variant count of 2.
Now considering both read tiers, there can be different combinations of observed stratified variant counts that can be assumed to be equally or roughly equally likely. In NGS sample preparation, the nucleic acid sequences of the subject individual can be cleaved in a partially random fashion. As a result, some of the processed sequence reads may not include a complementary sequence read. Hence, some of the processed sequence reads can be single-stranded sequence reads. In other words, for the same nucleic acid sequence sample, different NGS runs will produce different combinations of sequence reads in different read tiers. The stratified variant count of a first read tier can be treated as equivalent to the stratified variant count of a second read tier based on some ratio. In some embodiments, the ratio is modeled as a predetermined value. For example, one double-stranded variant count can be considered equivalent to two single-stranded variant counts, although in some embodiments numbers other than 2 can also be used.
Based on the observed stratified variant counts of different read tiers, coordinates in the graph shown in
The combined result of the processing system can take the form of the p-value that represents the likelihood of an event that is as or more extreme than the observed data. The processing system can integrate by summing the probabilities corresponding to all coordinates that represent events that are as or more extreme than the observed data to determine the p-value. However, because the coordinates can include points that go all the way to infinity, the processing system can also compute the statistical complement of the p-value instead. In other words, the processing system can sum the probabilities corresponding to all coordinates that represent events that are less extreme than the observed data to determine the complement of the p-value. The processing system can then determine the p-value by subtracting the complement from 1.0. In some embodiments, the processing system can use probabilities in a logarithmic scale for numerical stability because adding floating-point numbers on a computer can be numerically unstable.
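The complement trick over two read tiers can be sketched as follows. Extremeness is scored by weighting one double-stranded count as two single-stranded counts, per the text; the Poisson tier distributions and rate values are illustrative stand-ins for the tier noise models:

```python
import math

def pois_pmf(k, lam):
    """Poisson pmf, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tiered_p_value(obs_ds, obs_ss, lam_ds, lam_ss, weight_ds=2):
    """p-value over two read tiers via the complement.

    Sum the probability of every (double-stranded, single-stranded) grid point
    whose extremeness score weight_ds*a + b is strictly below the observed
    score, then subtract from 1.0; this avoids summing an infinite tail."""
    obs_score = weight_ds * obs_ds + obs_ss
    less = 0.0
    for a in range(obs_score // weight_ds + 1):
        pa = pois_pmf(a, lam_ds)
        for b in range(obs_score - weight_ds * a):  # scores strictly below observed
            less += pa * pois_pmf(b, lam_ss)
    return 1.0 - less

p = tiered_p_value(obs_ds=2, obs_ss=1, lam_ds=0.3, lam_ss=0.5)
```

The double loop stays finite because only the less-extreme region is enumerated; the as-or-more-extreme region, which extends to infinity, is captured by the final subtraction.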
Referring back to
Other ways are also possible for determining the overall p-value. For example, a tail probability technique can be used. In some embodiments, the integration method can be replaced by one or more machine learning models. For example, a random forest regression model can be trained to determine the Phred-scale quality score or the p-value from a set of training sample data. The integration process described in
In
In step 1470, the system can generate a diagnosis of a disease based on the identified variant locations. In some embodiments, variants or mutations that can be indicative of certain cancers and/or serve as biomarkers for certain therapeutics can include: ACVR1B, AKT3, AMER1, APC, ARID1A, ARID1B, ARID2, ASXL1, ASXL2, ATM, ATR, BAP1, BCL2, BCL6, BCORL1, BCR, BLM, BRAF, BRCA1, BTG1, CASP8, CBL, CCND3, CCNE1, CD74, CDC73, CDK12, CDKN2A, CHD2, CJD2, CREBBP, CSF1R, CTCF, CTNNB1, DICER1, DNAJB1, DNMT1, DNMT3A, DNMT3B, DOT1L, EED, EGFR, EIF1AX, EP300, EPHA3, EPHA5, EPHB1, ERBB2, ERBB4, ERCC2, ERCC3, ERCC4, ESR1, FAM46C, FANCA, FANCC, FANCD2, FANCE, FAT1, FBXW7, FGFR3, FLCN, FLT1, FOXO1, FUBP1, FYN, GATA3, GPR124, GRIN2A, GRM3, H3F3A, HIST1H1C, IDH1, IDH2, IKZF1, IL7R, INPP4B, IRF4, IRS1, IRS2, JAK2, KAT6A, KDM6A, KEAP1, KIF5B, KIT, KLF4, KLH6, KMT2C, KRAS, LMAP1, LRP1B, LZTR1, MAP3K1, MCL1, MGA, MSH2, MSH6, MST1R, MTOR, MYD88, NPM1, NRAS, NTRK1, NTRK2, NUP93, NUTM1, PAX3, PAX8, PBRM1, PGR, PHOX2B, PIK3CA, POLE, PTCH1, PTEN, PTPN11, PTPRT, RAD21, RAF1, RANBP2, RB1, REL, RFWD2, RHOA, RPTOR, RUNX1, RUNX1T1, SDHA, SHQ1, SLIT2, SMAD4, SMARCA4, SMARCD1, SNCAIP, SOCS1, SPEN, SPTA1, SUZ12, TET1, TET2, TGFBR, and TNFRSF14. In some embodiments, cancer immunotherapies can target OX40, LAG3, and/or ICOS.
In step 1480, a treatment of the disease may be provided. Before providing the treatment, a companion diagnostics operation may also be performed. The companion diagnostics operation may identify one or more criteria, including variants or mutations, using the process described herein. Providing the treatment may take the form of causing or recommending a medical practitioner to administer a specific dosage of a drug to a patient.
For example, the systems and methods described herein can be used for detecting variants or mutations that are biomarkers for cancer treatments, such as certain immunotherapies and targeted therapeutics. Such therapies can include, for example, an immunoglobulin, a protein, a peptide, a small molecule, a nanoparticle, or a nucleic acid. In some embodiments, the therapies comprise an antibody, or a functional fragment thereof. In some embodiments, the antibody may include: Rituxan® (rituximab), Herceptin® (trastuzumab), Erbitux® (cetuximab), Vectibix® (panitumumab), Arzerra® (ofatumumab), Benlysta® (belimumab), Yervoy® (ipilimumab), Perjeta® (pertuzumab), tremelimumab, Opdivo® (nivolumab), dacetuzumab, urelumab, Tecentriq® (atezolizumab, MPDL3280A), lambrolizumab, blinatumomab, CT-011, Keytruda® (pembrolizumab, MK-3475), BMS-936559, MEDI4736, MSB0010718C, Imfinzi® (durvalumab), Bavencio® (avelumab), and margetuximab (MGAH22).
In some embodiments, the immunotherapies and targeted therapeutics comprise PD-1 inhibition, PD-L1 inhibition, or CTLA-4 inhibition. PD-1 inhibition targets the programmed death receptor on T-cells and other immune cells. Examples of PD-1 inhibition immunotherapies include pembrolizumab (Keytruda®), nivolumab (Opdivo®), and cemiplimab (Libtayo®). PD-L1 inhibition targets the programmed death receptor ligand expressed by tumor and regulatory immune cells. Examples of PD-L1 inhibition immunotherapies include atezolizumab (Tecentriq®), avelumab (Bavencio®), and durvalumab (Imfinzi®). CTLA-4 inhibition targets T-cell activation. Examples of CTLA-4 inhibition immunotherapies include ipilimumab (Yervoy®).
For non-small cell lung cancer indications, variants or mutations that can be biomarkers for immunotherapy treatments can include EGFR exon 19 deletions and EGFR exon 21 L858R alterations (e.g., for therapies such as Gilotrif® (afatinib), Iressa® (gefitinib), Tagrisso® (osimertinib), or Tarceva® (erlotinib)); EGFR exon 20 T790M alterations (e.g., which may be treated with Tagrisso® (osimertinib)); ALK rearrangements (e.g., which may be treated with Alecensa® (alectinib), Xalkori® (crizotinib), or Zykadia® (ceritinib)); BRAF V600E (e.g., which may be treated with Tafinlar® (dabrafenib) in combination with Mekinist® (trametinib)); and single nucleotide variants (SNVs) and indels that lead to MET exon 14 skipping (e.g., which may be treated with Tabrecta™ (capmatinib)).
For melanoma indications, variants or mutations that can be biomarkers for immunotherapy treatments can include BRAF V600E (e.g., which may be treated with Tafinlar® (dabrafenib) or Zelboraf® (vemurafenib)); BRAF V600E or V600K (e.g., which may be treated with Mekinist® (trametinib) or Cotellic® (cobimetinib), in combination with Zelboraf® (vemurafenib)).
For breast cancer indications, variants or mutations that can be biomarkers for immunotherapy treatments can include ERBB2 (HER2) amplification (e.g., which may be treated with Herceptin® (trastuzumab), Kadcyla® (ado-trastuzumab-emtansine), or Perjeta® (pertuzumab)); PIK3CA alterations (e.g., which may be treated with Piqray® (alpelisib)).
For colorectal cancer indications, variants or mutations that can be biomarkers for immunotherapy treatments can include KRAS wild-type (absence of mutations in codons 12 and 13) (e.g., which may be treated with Erbitux® (cetuximab)); KRAS wild-type (absence of mutations in exons 2, 3, and 4) and NRAS wild type (absence of mutations in exons 2, 3, and 4) (e.g., which may be treated with Vectibix® (panitumumab)).
For ovarian cancer indications, variants or mutations that can be biomarkers for immunotherapy treatments can include BRCA1/2 alterations (e.g., which may be treated with Lynparza® (olaparib) or Rubraca® (rucaparib)).
For prostate cancer indications, variants or mutations that can be biomarkers for immunotherapy treatments can include Homologous Recombination Repair (HRR) gene (BRCA1, BRCA2, ATM, BARD1, BRIP1, CDK12, CHEK1, CHEK2, FANCL, PALB2, RAD51B, RAD51C, RAD51D and RAD54L) alterations (e.g., which may be treated with Lynparza® (olaparib)).
For solid tumor cancer indications, variants or mutations that can be biomarkers for immunotherapy treatments can include a tumor mutational burden (TMB) that is greater than or equal to 10 mutations per megabase (e.g., which may be treated with Keytruda® (pembrolizumab)).
By way of example,
The structure of a computing machine described in
By way of example, a computing machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, an internet of things (IoT) device, a switch or bridge, or any machine capable of executing instructions 1524 that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the terms “machine” and “computer” may also be taken to include any collection of machines that individually or jointly execute instructions 1524 to perform any one or more of the methodologies discussed herein.
The example computer system 1500 includes one or more processors 1502 such as a CPU (central processing unit), a GPU (graphics processing unit), a TPU (tensor processing unit), a DSP (digital signal processor), a system on a chip (SOC), a controller, a state machine, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or any combination of these. Parts of the computing system 1500 may also include a memory 1504 that stores computer code including instructions 1524 that may cause the processors 1502 to perform certain actions when the instructions are executed, directly or indirectly, by the processors 1502. Instructions can be any directions, commands, or orders that may be stored in different forms, such as equipment-readable instructions, programming instructions including source code, and other communication signals and orders. Instructions may be used in a general sense and are not limited to machine-readable codes.
One or more methods described herein improve the operation speed of the processors 1502 and reduce the space required for the memory 1504. For example, the machine learning methods described herein reduce the complexity of the computation of the processors 1502 by applying one or more novel techniques that simplify the steps in training, reaching convergence, and generating results. The algorithms described herein also reduce the size of the models and datasets to reduce the storage space required for the memory 1504.
The performance of certain of the operations may be distributed among more than one processor, not only residing within a single machine but deployed across a number of machines. In some example embodiments, the one or more processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the one or more processors or processor-implemented modules may be distributed across a number of geographic locations. Even though the specification or the claims may refer to some processes as being performed by a processor, this should be construed to include a joint operation of multiple distributed processors.
The computer system 1500 may include a main memory 1504, and a static memory 1506, which are configured to communicate with each other via a bus 1508. The computer system 1500 may further include a graphics display unit 1510 (e.g., a plasma display panel (PDP), a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)). The graphics display unit 1510, controlled by the processors 1502, displays a graphical user interface (GUI) to display one or more results and data generated by the processes described herein. The computer system 1500 may also include an alphanumeric input device 1512 (e.g., a keyboard), a cursor control device 1514 (e.g., a mouse, a trackball, a joystick, a motion sensor, or other pointing instrument), a storage unit 1516 (e.g., a hard drive, a solid-state drive, a hybrid drive, or a memory disk), a signal generation device 1518 (e.g., a speaker), and a network interface device 1520, which also are configured to communicate via the bus 1508.
The storage unit 1516 includes a computer-readable medium 1522 on which are stored instructions 1524 embodying any one or more of the methodologies or functions described herein. The instructions 1524 may also reside, completely or at least partially, within the main memory 1504 or within the processor 1502 (e.g., within a processor's cache memory) during execution thereof by the computer system 1500, the main memory 1504 and the processor 1502 also constituting computer-readable media. The instructions 1524 may be transmitted or received over a network 1526 via the network interface device 1520. While the computer-readable medium 1522 is shown in an example embodiment to be a single medium, the term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions (e.g., instructions 1524). The computer-readable medium may include any medium that is capable of storing instructions (e.g., instructions 1524) for execution by the processors (e.g., processors 1502) and that cause the processors to perform any one or more of the methodologies disclosed herein. The computer-readable medium may include, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media. The computer-readable medium does not include a transitory medium such as a propagating signal or a carrier wave.
Beneficially, various embodiments described herein improve the accuracy and efficiency of existing technologies in the field of sequencing, such as PCR and massively parallel DNA sequencing (e.g., NGS). The embodiments provide solutions to the challenge of identifying errors introduced by the sequencing and amplification process. A massively parallel DNA sequencing run may start with one or more DNA samples, which are randomly cleaved and typically amplified using PCR. The parallel nature of massively parallel DNA sequencing results in replicates of the nucleotide sequences of each allele. The extent of replication and sequencing at each allele site can vary. For example, some sequences are overlapped and/or double-stranded while other sequences are not. Both the PCR amplification process and the sequencing process have non-trivial error rates. The sequencing errors may obscure the nucleotide sequences of the true alleles. The embodiments may be used to determine one or more alleles analyzed by a massively parallel DNA sequencing instrument. By taking into consideration read-tier-specific noise models, the massively parallel DNA sequencing workflows exhibit sufficient fidelity to generate correct sequence determinations by more accurately distinguishing true alleles from erroneous sequences.
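The combination of read-tier-specific noise models into an overall quality score can be illustrated with a minimal sketch. The function names, the binomial tail as a stand-in for the hierarchical Bayesian noise models of the disclosure, the independence assumption across tiers, and the example tier names and noise rates are all hypothetical illustrations, not the disclosed implementation:

```python
import math

def tier_noise_pvalue(alt_count: int, depth: int, noise_rate: float) -> float:
    """Probability of observing at least alt_count variant-supporting reads
    at a site whose tier-specific noise model predicts an error rate of
    noise_rate (simplified binomial tail; the disclosure uses hierarchical
    Bayesian models with covariates such as trinucleotide context)."""
    p = sum(
        math.comb(depth, k) * noise_rate**k * (1 - noise_rate) ** (depth - k)
        for k in range(alt_count, depth + 1)
    )
    return min(p, 1.0)

def combined_quality_score(observations: dict) -> float:
    """Combine per-tier noise-model outputs into one Phred-scaled quality
    score, assuming independence across read tiers."""
    p_total = 1.0
    for alt, depth, rate in observations.values():
        p_total *= tier_noise_pvalue(alt, depth, rate)
    return -10.0 * math.log10(max(p_total, 1e-300))

# Hypothetical example: overlapping, double-stranded (duplex-like) reads
# are assigned a lower noise rate than single-stranded (simplex-like) reads.
obs = {
    "duplex": (3, 200, 1e-4),   # (alt reads, depth, tier noise rate)
    "simplex": (2, 300, 1e-3),
}
q = combined_quality_score(obs)
```

A high score indicates the observed variant-supporting reads are unlikely under the noise-only models of every tier, consistent with a true positive.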
Conventionally, reducing the error rate of sequence determination requires increasing the sequencing depth of the sample. This means fewer samples can be analyzed in a batch of sequencing because more resources are dedicated to each sample. The embodiments improve the accuracy of sequencing without increasing the sequencing depth at a particular allele site, thereby allowing more allele sites or patient samples to be sequenced at the same time in a massively parallel DNA sequencing run. Embodiments described herein may reduce the sequencing depth needed while increasing the accuracy of the massively parallel DNA sequencing that is used to read the nucleotide sequences generated in the amplification.
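The depth-versus-batch-size tradeoff above follows from a fixed per-run read budget: total reads equal required depth times the targeted bases per sample. A simplified worked example, with a hypothetical instrument throughput, panel size, and read length, and ignoring duplicate and off-target reads:

```python
def samples_per_run(total_reads: int, depth: int,
                    panel_size: int, read_len: int = 150) -> int:
    """Number of samples that fit in one sequencing run when every base of
    a targeted panel must reach the given depth (idealized: assumes no
    duplicate or off-target reads)."""
    reads_per_sample = (depth * panel_size) // read_len
    return total_reads // reads_per_sample

# Hypothetical run of 3 billion reads over a 500 kb panel:
hi = samples_per_run(3_000_000_000, depth=60_000, panel_size=500_000)
lo = samples_per_run(3_000_000_000, depth=30_000, panel_size=500_000)
```

Under these assumptions, halving the required depth doubles the number of samples per run, which is the resource saving the read-tier models aim to enable without losing accuracy.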
The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In some embodiments, a software module is implemented with a computer program product including a non-transitory computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may include information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiments of a computer program product or other data combination described herein.
While one or more processes described herein may be described with one or more steps, the use of the term “step” does not imply a particular order. For example, while this disclosure may describe a process that includes multiple steps in sequence, the steps in the process do not need to be performed in the specific order claimed or described in the disclosure. Some steps may be performed before others even though the other steps are claimed or described first in this disclosure.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2020/049751 | 9/8/2020 | WO |
Number | Date | Country
---|---|---
62897923 | Sep 2019 | US