This disclosure generally relates to a Bayesian inference based model for targeted sequencing and to leveraging the model in variant calling and quality control.
Computational techniques can be used on DNA sequencing data to identify mutations or variants in DNA that may correspond to various types of cancer or other diseases. Thus, cancer diagnosis or prediction may be performed by analyzing a biological sample such as a tissue biopsy or blood drawn from a subject. Detecting DNA that originated from tumor cells from a blood sample is difficult because circulating tumor DNA (ctDNA) is typically present at low levels relative to other molecules in cell-free DNA (cfDNA) extracted from the blood. The inability of existing methods to identify true positives (e.g., indicative of cancer in the subject) from signal noise diminishes the ability of known and future systems to distinguish true positives from false positives caused by noise sources, which can result in unreliable results for variant calling or other types of analyses.
Disclosed herein are methods for training and applying a site-specific noise model (also referred to herein as a “Bayesian hierarchical model,” “noise model,” or “model”) for determining likelihoods of true positives in targeted sequencing. True positives may include single nucleotide variants, insertions, or deletions of base pairs. In particular, the model may use Bayesian inference to determine a rate or level of noise, e.g., indicative of an expected likelihood of certain mutations, per position of a nucleic acid sequence. Moreover, the model may be a hierarchical model that accounts for covariates (e.g., trinucleotide context, mappability, or segmental duplication) and various types of parameters (e.g., mixture components or depth of sequence reads). The model may be trained by Markov chain Monte Carlo sampling from sequence reads of healthy subjects. Therefore, an overall pipeline that incorporates the model can identify true positives at higher sensitivities and filter out false positives.
In various embodiments, a method for processing sequencing data of a nucleic acid sample includes identifying a candidate variant of a plurality of sequence reads. The method further includes accessing a plurality of parameters including a dispersion parameter r and a mean rate parameter m specific to the candidate variant, where the r and m are derived using a model. The method further includes inputting read information of the plurality of sequence reads into a function parameterized by the plurality of parameters. The method further includes determining a score for the candidate variant using an output of the function based on the input read information.
In one or more embodiments, the plurality of parameters represent mean and shape parameters of a gamma distribution, and the function is a negative binomial based on the plurality of sequence reads and the plurality of parameters.
In one or more embodiments, the plurality of parameters represent parameters of a distribution that encodes an uncertainty level of nucleotide mutations with respect to a given position of a sequence read.
In one or more embodiments, a gamma distribution is one component of a mixture of the distribution.
In one or more embodiments, the plurality of parameters are derived from a training sample of sequence reads from a plurality of healthy individuals.
In one or more embodiments, the training sample excludes a subset of the sequence reads from the plurality of healthy individuals based on filtering criteria.
In one or more embodiments, the filtering criteria indicates to exclude sequence reads that have (i) a depth less than a threshold value or (ii) an allele frequency greater than a threshold frequency.
In one or more embodiments, the filtering criteria varies based on positions of candidate variants in a genome.
In one or more embodiments, the plurality of parameters are derived using a Bayesian Hierarchical model.
In one or more embodiments, the Bayesian Hierarchical model includes a multinomial distribution grouping positions of sequence reads into latent classes.
In one or more embodiments, the Bayesian Hierarchical model includes fixed covariates unrelated to training samples from healthy individuals.
In one or more embodiments, the covariates are based on a plurality of nucleotides adjacent to a given position of a sequence read.
In one or more embodiments, the covariates are based on a level of uniqueness of a given sequence read relative to a target region of a genome.
In one or more embodiments, the covariates are based whether a given sequence read is a segmental duplication.
In one or more embodiments, the Bayesian Hierarchical model is estimated using a Markov chain Monte Carlo method.
In one or more embodiments, the Markov chain Monte Carlo method uses a Metropolis-Hastings algorithm.
In one or more embodiments, the Markov chain Monte Carlo method uses a Gibbs sampling algorithm.
In one or more embodiments, the Markov chain Monte Carlo method uses Hamiltonian mechanics.
In one or more embodiments, the read information includes a depth d of the plurality of sequence reads, the function parameterized by m·d.
In one or more embodiments, the score is a Phred-scaled likelihood.
In one or more embodiments, the plurality of sequence reads are obtained from a cell free nucleotide sample obtained from an individual.
In one or more embodiments, the method further includes collecting or having collected the cell free nucleotide sample from a blood sample of the individual, and performing enrichment on the cell free nucleotide sample to generate the plurality of sequence reads.
In one or more embodiments, the plurality of sequence reads are obtained from a sample of blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal, saliva, tears, a tissue biopsy, pleural fluid, pericardial fluid, or peritoneal fluid of an individual.
In one or more embodiments, the plurality of sequence reads are obtained from tumor cells obtained from a tumor biopsy.
In one or more embodiments, the plurality of sequence reads is sequenced from an isolate of cells from blood, the isolate of cells including at least buffy coat white blood cells or CD4+ cells.
In one or more embodiments, the method further includes determining that the candidate variant is a false positive mutation responsive to comparing the score to a threshold value.
In one or more embodiments, the candidate variant is a single nucleotide variant.
In one or more embodiments, the model encodes noise levels of nucleotide mutations for one base of A, T, C, and G to each of the other three bases.
In one or more embodiments, the candidate variant is an insertion or deletion of at least one nucleotide.
In one or more embodiments, the model includes a distribution of lengths of insertions or deletions.
In one or more embodiments, model separates inference for determining a likelihood of an alternate allele from inference for determining a length of the alternate allele using the distribution of lengths.
In one or more embodiments, the distribution of lengths is multinomial with Dirichlet prior.
In one or more embodiments, the Dirichlet prior on the multinomial distribution of lengths is determined by covariates of anchor positions of a genome.
In one or more embodiments, the model includes a distribution ω determined based on covariates.
In one or more embodiments, the model includes a distribution ϕ determined based on covariates and anchor positions of a genome.
In one or more embodiments, the model includes a multinomial distribution grouping lengths of insertions or deletions at anchor positions of sequence reads into latent classes.
In one or more embodiments, an expected mean total count of insertions or deletions at a given anchor position is modeled by a distribution based on covariates and anchor positions of a genome.
Figure (
The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
The term “individual” refers to a human individual. The term “healthy individual” refers to an individual presumed to not have a cancer or disease. The term “subject” refers to an individual who is known to have, or potentially has, a cancer or disease.
The term “sequence reads” refers to nucleotide sequences read from a sample obtained from an individual. Sequence reads can be obtained through various methods known in the art.
The term “read segment” or “read” refers to any nucleotide sequences including sequence reads obtained from an individual and/or nucleotide sequences derived from the initial sequence read from a sample obtained from an individual. For example, a read segment can refer to an aligned sequence read, a collapsed sequence read, or a stitched read. Furthermore, a read segment can refer to an individual nucleotide base, such as a single nucleotide variant.
The term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y may be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”
The term “indel” refers to any insertion or deletion of one or more base pairs having a length and a position (which may also be referred to as an anchor position) in a sequence read. An insertion corresponds to a positive length, while a deletion corresponds to a negative length.
The term “mutation” refers to one or more SNVs or indels.
The term “candidate variant,” “called variant,” or “putative variant,” refers to one or more detected nucleotide variants of a nucleotide sequence, for example, at a position in the genome that is determined to be mutated (i.e., a candidate SNV) or an insertion or deletion at one or more bases (i.e., a candidate indel). Generally, a nucleotide base is deemed a called variant based on the presence of an alternative allele on a sequence read, or collapsed read, where the nucleotide base at the position(s) differ from the nucleotide base in a reference genome. Additionally, candidate variants may be called as true positives or false positives.
The term “true positive” refers to a mutation that indicates real biology, for example, presence of a potential cancer, disease, or germline mutation in an individual. True positives are not artifacts that may mimic real biology. For example, recurrent apparent variants in healthy individuals are likely to be technical artifacts rather than biology, and various process errors can lead to spurious variants.
The term “false positive” refers to a mutation incorrectly determined to be a true positive. Generally, false positives may be more likely to occur when processing sequence reads associated with greater mean noise rates or greater uncertainty in noise rates.
The term “cell-free nucleic acids” or “cfNAs” refers to nucleic acid molecules that can be found outside cells, in bodily fluids such blood, sweat, urine, or saliva. Cell-free nucleic acids are used interchangeably as circulating nucleic acids.
The term “cell-free DNA” or “cfDNA” refers to nucleic acid fragments that circulate in bodily fluids such blood, sweat, urine, or saliva and originate from one or more healthy cells and/or from one or more cancer cells.
The term “circulating tumor DNA” or “ctDNA” refers to deoxyribonucleic acid fragments that originate from tumor cells or other types of cancer cells, which may be released into an individual's bodily fluids such blood, sweat, urine, or saliva as result of biological processes such as apoptosis or necrosis of dying cells or actively released by viable tumor cells.
The term “circulating tumor RNA” or “ctRNA” refers to ribonucleic acid fragments that originate from tumor cells or other types of cancer cells, which may be released into an individual's bodily fluids such blood, sweat, urine, or saliva as result of biological processes such as apoptosis or necrosis of dying cells or actively released by viable tumor cells.
The term “alternative allele” or “ALT” refers to an allele having one or more mutations relative to a reference allele, e.g., corresponding to a known gene.
The term “sequencing depth” or “depth” refers to a total number of read segments from a sample obtained from an individual at a given position, region, or loci. In some embodiments, the depth refers to the average sequencing depth across the genome or across a targeted sequencing panel.
The term “alternate depth” or “AD” refers to a number of read segments in a sample that support an ALT, e.g., include mutations of the ALT.
The term “alternate frequency” or “AF” refers to the frequency of a given ALT. The AF may be determined by dividing the corresponding AD of a sample by the depth of the sample for the given ALT.
In step 110, a test sample comprising a plurality of nucleic acid molecules (DNA or RNA) is obtained from a subject, and the nucleic acids are extracted and/or purified from the test sample. In the present disclosure, DNA and RNA may be used interchangeably unless otherwise indicated. That is, the following embodiments for using error source information in variant calling and quality control may be applicable to both DNA and RNA types of nucleic acid sequences. However, the examples described herein may focus on DNA for purposes of clarity and explanation. The nucleic acids in the extracted sample may comprise the whole human genome, or any subset of the human genome, including the whole exome. Alternatively, the sample may be any subset of the human transcriptome, including the whole transcriptome. The test sample may be obtained from a subject known to have or suspected of having cancer. In some embodiments, the test sample may include blood, plasma, serum, urine, fecal, saliva, other types of bodily fluids, or any combination thereof. Alternatively, the test sample may comprise a sample selected from the group consisting of whole blood, a blood fraction, a tissue biopsy, pleural fluid, pericardial fluid, cerebral spinal fluid, and peritoneal fluid. In some embodiments, methods for drawing a blood sample (e.g., syringe or finger prick) may be less invasive than procedures for obtaining a tissue biopsy, which may require surgery. The extracted sample may comprise cfDNA and/or ctDNA. For healthy individuals, the human body may naturally clear out cfDNA and other cellular debris. In general, any known method in the art can be used to extract and purify cell-free nucleic acids from the test sample. For example, cell-free nucleic acids can be extracted and purified using one or more known commercially available protocols or kits, such as the QIAamp circulating nucleic acid kit (Qiagen). If a subject has a cancer or disease, ctDNA in an extracted sample may be present at a detectable level for diagnosis.
In step 120, a sequencing library is prepared. During library preparation, sequencing adapters comprising unique molecular identifiers (UMI) are added to the nucleic acid molecules (e.g., DNA molecules), for example, through adapter ligation (using T4 or T7 DNA ligase) or other known means in the art. The UMIs are short nucleic acid sequences (e.g., 4-10 base pairs) that are added to ends of DNA fragments and serve as unique tags that can be used to identify nucleic acids (or sequence reads) originating from a specific DNA fragment. Following adapter addition, the adapter-nucleic acid constructs are amplified, for example, using polymerase chain reaction (PCR). During PCR amplification, the UMIs are replicated along with the attached DNA fragment, which provides a way to identify sequence reads that came from the same original fragment in downstream analysis. Optionally, as is well known in the art, the sequencing adapters may further comprise a universal primer, a sample-specific barcode (for multiplexing) and/or one or more sequencing oligonucleotides for use in subsequent cluster generation and/or sequencing (e.g., known P5 and P7 sequences for used in sequencing by synthesis (SBS) (Illumina, San Diego, Calif.)).
In step 130, targeted DNA sequences are enriched from the library. In accordance with one embodiment, during targeted enrichment, hybridization probes (also referred to herein as “probes”) are used to target, and pull down, nucleic acid fragments known to be, or that may be, informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer type or tissue of origin). For a given workflow, the probes may be designed to anneal (or hybridize) to a target (complementary) strand of DNA or RNA. The target strand may be the “positive” strand (e.g., the strand transcribed into mRNA, and subsequently translated into a protein) or the complementary “negative” strand. The probes may range in length from 10 s, 100 s, or 1000 s of base pairs. In one embodiment, the probes are designed based on a gene panel to analyze particular mutations or target regions of the genome (e.g., of the human or another organism) that are suspected to correspond to certain cancers or other types of diseases. Moreover, the probes may cover overlapping portions of a target region. As one of skill in the art would readily appreciate, any known means in the art can be used for targeted enrichment. For example, in one embodiment, the probes may be biotinylated and streptavidin coated magnetic beads used to enrich for probe captured target nucleic acids. See, e.g., Duncavage et al., J Mol Diagn. 13(3): 325-333 (2011); and Newman et al., Nat Med. 20(5): 548-554 (2014). By using a targeted gene panel rather than sequencing the whole genome (“whole genome sequencing”), all expressed genes of a genome (“whole exome sequencing” or “whole transcriptome sequencing”), the method 100 may be used to increase sequencing depth of the target regions, where depth refers to the count of the number of times a given target sequence within the sample has been sequenced. Increasing sequencing depth allows for detection of rare sequence variants in a sample and/or increases the throughput of the sequencing process. After a hybridization step, the hybridized nucleic acid fragments are captured and may also be amplified using PCR.
In step 140, sequence reads are generated from the enriched nucleic acid molecules (e.g., DNA molecules). Sequencing data or sequence reads may be acquired from the enriched nucleic acid molecules by known means in the art. For example, the method 100 may include next generation sequencing (NGS) techniques including synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators.
In various embodiments, the enriched nucleic acid sample 115 is provided to the sequencer 145 for sequencing. As shown in
In some embodiments, the sequencer 145 is communicatively coupled with one or more computing devices 160. Each computing device 160 can process the sequence reads for various applications such as variant calling or quality control. The sequencer 145 may provide the sequence reads in a BAM file format to a computing device 160. Each computing device 160 can be one of a personal computer (PC), a desktop computer, a laptop computer, a notebook, a tablet PC, or a mobile device. A computing device 160 can be communicatively coupled to the sequencer 145 through a wireless, wired, or a combination of wireless and wired communication technologies. Generally, the computing device 160 is configured with a processor and memory storing computer instructions that, when executed by the processor, cause the processor to process the sequence reads or to perform one or more steps of any of the methods or processes disclosed herein.
In some embodiments, the sequence reads may be aligned to a reference genome using known methods in the art to determine alignment position information. For example, in one embodiment, sequence reads are aligned to human reference genome hg19. The sequence of the human reference genome, hg19, is available from Genome Reference Consortium with a reference number, GRCh37/hg19, and also available from Genome Browser provided by Santa Cruz Genomics Institute. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene.
In various embodiments, for example when a paired-end sequencing process is used, a sequence read is comprised of a read pair denoted as R1 and R2. For example, the first read R1 may be sequenced from a first end of a double-stranded DNA (dsDNA) molecule whereas the second read R2 may be sequenced from the second end of the double-stranded DNA (dsDNA). Therefore, nucleotide base pairs of the first read R1 and second read R2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome. Alignment position information derived from the read pair R1 and R2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R2). In other words, the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds. An output file having SAM (sequence alignment map) format or BAM (binary) format may be generated and output for further analysis such as variant calling, as described below with respect to
At step 300, optionally, the sequence processor 205 collapses aligned sequence reads of the input sequencing data. In one embodiment, collapsing sequence reads includes using UMIs, and optionally alignment position information from sequencing data of an output file (e.g., from the method 100 shown in
At step 305, optionally, the sequence processor 205 may stitch sequence reads, or collapsed sequence reads, based on the corresponding alignment position information merging together two sequence reads into a single read segment. In some embodiments, the sequence processor 205 compares alignment position information between a first sequence read and a second sequence read (or collapsed sequence reads) to determine whether nucleotide base pairs of the first and second reads partially overlap in the reference genome. In one use case, responsive to determining that an overlap (e.g., of a given number of nucleotide bases) between the first and second reads is greater than a threshold length (e.g., threshold number of nucleotide bases), the sequence processor 205 designates the first and second reads as “stitched”; otherwise, the collapsed reads are designated “unstitched.” In some embodiments, a first and second read are stitched if the overlap is greater than the threshold length and if the overlap is not a sliding overlap. For example, a sliding overlap may include a homopolymer run (e.g., a single repeating nucleotide base), a dinucleotide run (e.g., two-nucleotide repeating base sequence), or a trinucleotide run (e.g., three-nucleotide repeating base sequence), where the homopolymer run, dinucleotide run, or trinucleotide run has at least a threshold length of base pairs.
At step 310, the sequence processor 205 may optionally assemble two or more reads, or read segments, into a merged sequence read (or a path covering the targeted region). In some embodiments, the sequence processor 205 assembles reads to generate a directed graph, for example, a de Bruijn graph, for a target region (e.g., a gene). Unidirectional edges of the directed graph represent sequences of k nucleotide bases (also referred to herein as “k-mers”) in the target region, and the edges are connected by vertices (or nodes). The sequence processor 205 aligns collapsed reads to a directed graph such that any of the collapsed reads may be represented in order by a subset of the edges and corresponding vertices.
In some embodiments, the sequence processor 205 determines sets of parameters describing directed graphs and processes directed graphs. Additionally, the set of parameters may include a count of successfully aligned k-mers from collapsed reads to a k-mer represented by a node or edge in the directed graph. The sequence processor 205 stores, e.g., in the sequence database 210, directed graphs and corresponding sets of parameters, which may be retrieved to update graphs or generate new graphs. For instance, the sequence processor 205 may generate a compressed version of a directed graph (e.g., or modify an existing graph) based on the set of parameters. In one use case, in order to filter out data of a directed graph having lower levels of importance, the sequence processor 205 removes (e.g., “trims” or “prunes”) nodes or edges having a count less than a threshold value, and maintains nodes or edges having counts greater than or equal to the threshold value.
At step 315, the variant caller 240 generates candidate variants from the sequence reads, collapsed sequence reads, or merged sequence reads assembled by the sequence processor 205. In one embodiment, the variant caller 240 generates the candidate variants by comparing sequence reads, collapsed sequence reads, or merged sequence reads (which may have been compressed by pruning edges or nodes in step 310) to a reference sequence of a target region of a reference genome (e.g., human reference genome hg19). The variant caller 240 may align edges of the sequence reads collapsed sequence reads, or merged sequence reads to the reference sequence, and records the genomic positions of mismatched edges and mismatched nucleotide bases adjacent to the edges as the locations of candidate variants. Additionally, the variant caller 240 may generate candidate variants based on the sequencing depth of a target region. In particular, the variant caller 240 may be more confident in identifying variants in target regions that have greater sequencing depth, for example, because a greater number of sequence reads help to resolve (e.g., using redundancies) mismatches or other base pair variations between sequences.
In one embodiment, the variant caller 240 generates candidate variants using the model 225 to determine expected noise rates for sequence reads from a subject (e.g., from a healthy subject). The model 225 may be a Bayesian hierarchical model, though in some embodiments, the processing system 100 uses one or more different types of models. Moreover, a Bayesian hierarchical model may be one of many possible model architectures that may be used to generate candidate variants and which are related to each other in that they all model position-specific noise information in order to improve the sensitivity or specificity of variant calling. More specifically, the machine learning engine 220 trains the model 225 using samples from healthy individuals to model the expected noise rates per position of sequence reads.
Further, multiple different models may be stored in the model database 215 or retrieved for application post-training. For example, a first model is trained to model SNV noise rates and a second model is trained to model indel noise rates. Further, the score engine 235 may use parameters of the model 225 to determine a likelihood of one or more true positives in a sequence read. The score engine 235 may determine a quality score (e.g., on a logarithmic scale) based on the likelihood. For example, the quality score is a Phred quality score Q=−10·log10 P, where P is the likelihood of an incorrect candidate variant call (e.g., a false positive).
At step 320, the score engine 235 scores the candidate variants based on the model 225 or corresponding likelihoods of true positives or quality scores. Training and application of the model 225 is described in more detail below.
At step 325, the processing system 200 outputs the candidate variants. In some embodiments, the processing system 200 outputs some or all of the determined candidate variants along with the corresponding scores. Downstream systems, e.g., external to the processing system 200 or other components of the processing system 200, may use the candidate variants and scores for various applications including, but not limited to, predicting presence of cancer, disease, or germline mutations.
The probability mass functions (PMFs) illustrated in
Using the example of
z
p˜Multinom({right arrow over (θ)})
Together, the latent variable zp, the vector of mixture components {right arrow over (θ)}, α, and β allow the model for μ, that is, a sub-model of the Bayesian hierarchical model 225, to have parameters that “pool” knowledge about noise, that is they represent similarity in noise characteristics across multiple positions. Thus, positions of sequence reads may be pooled or grouped into latent classes by the model. Also advantageously, samples of any of these “pooled” positions can help train these shared parameters. A benefit of this is that the processing system 100 may determine a model of noise in healthy samples even if there is little to no direct evidence of alternate alleles having been observed for a given position previously (e.g., in the healthy tissue samples used to train the model).
The covariate xp (e.g., a predictor) encodes known contextual information regarding position p which may include, but is not limited to, information such as trinucleotide context, segmental duplication, distance closest to repeat, mappability, uniqueness, k-mer uniqueness, warnings for badly behaved regions of a sequence, or other information associated with sequence reads. Trinucleotide context may be based on a reference allele and may be assigned numerical (e.g., integer) representation. For instance, “AAA” is assigned 1, “ACA” is assigned 2, “AGA” is assigned 3, etc. Mappability represents a level of uniqueness of alignment of a read to a particular target region of a genome. For example, mappability is calculated as the inverse of the number of position(s) where the sequence read will uniquely map. Segmental duplications correspond to long nucleic acid sequences (e.g., having a length greater than approximately 1000 base pairs) that are nearly identical (e.g., greater than 90% match) and occur in multiple locations in a genome as result of natural duplication events (e.g., not associated with a cancer or disease).
The expected mean AD count of a SNV at position p is modeled by the parameter μp. For sake of clarity in this description, the terms μp and yp refer to the position specific sub-models of the Bayesian hierarchical model 225. In one embodiment, μp is modeled as a Gamma-distributed random variable having shape parameter αz
μp˜Gamma(αz
In other embodiments, other functions may be used to represent μp, examples of which include but are not limited to: a log-normal distribution with log-mean γz
In the example shown in
y
i
|d
i
˜Poisson(di
In other embodiments, other functions may be used to represent yi
The expected mean total indel count at position p is modeled by the distribution μp. In some embodiments, the distribution is based on the covariate and has a Gamma distribution having shape parameter αx
μp˜Gamma(αx
In other embodiments, other functions may be used to represent μp, examples of which include but are not limited to: negative binomial, Conway-Maxwell-Poisson distribution, zeta distribution, and zero-inflated Poisson.
The observed indels at position p in a human population sample i (of a healthy individual) is modeled by the distribution yi
y
i
|d
i
˜Poisson(di
In other embodiments, other functions may be used to represent yi
Due to the fact that indels may be of varying lengths, an additional length parameter is present in the indel model that is not present in the model for SNVs. As a consequence, the example model shown in
{right arrow over (y)}
i
|y
i
,{right arrow over (ϕ)}p˜Multinom(yi
In other embodiments, a Dirichlet-Multinomial function or other types of models may be used to represent yi
By architecting the model in this manner, the machine learning engine 220 may decouple learning of indel intensity (i.e., noise rate) from learning of indel length distribution. Independently determining inferences for an expectation for whether an indel will occur in healthy samples and expectation for the length of the indel at a position may improve the sensitivity of the model. For example, the length distribution may be more stable relative to the indel intensity at a number of positions or regions in the genome, or vice versa.
In one embodiment, the machine learning engine 220 performs model fitting by storing draws of μp, the expected mean counts of AF per position and per sample, in the parameters database 230. The model is trained or fitted through posterior sampling, as previously described. In an embodiment, the draws of μp are stored in a matrix data structure having a row per position of the set of positions sampled and a column per draw from the joint posterior (e.g., of all parameters conditional on the observed data). The number of rows R may be greater than 6 million and the number of columns for N iterations of samples may be in the thousands. In other embodiments, the row and column designations are different than the embodiment shown in
where mp and vp are the mean and variance of the sampled values of μp at the position p, respectively. Those of skill in the art will appreciate that other functions for determining rp may also be used such as a maximum likelihood estimate.
The machine learning engine 220 may also perform dispersion re-estimation of the dispersion parameters in the reduced matrix, given the mean parameters. In one embodiment, following Bayesian training and posterior approximation, the machine learning engine 220 performs dispersion re-estimation by retraining for the dispersion parameters based on a negative binomial maximum likelihood estimator per position. The mean parameter may remain fixed during retraining. In one embodiment, the machine learning engine 220 determines the dispersion parameters r′p at each position for the original AD counts of the training data (e.g., yi
=max(rp, r′p), and stores
in the reduced matrix. Those of skill in the art will appreciate that other functions for determining
may also be used, such as a method of moments estimator, posterior mean, or posterior mode.
During application of trained models, the processing system 100 may access the dispersion (e.g., shape) parameters and mean parameters mp to determine a function parameterized by
and mp. The function may be used to determine a posterior predictive probability mass function (or probability density function) for a new sample of a subject. Based on the predicted probability of a certain AD count at a given position, the processing system 100 may account for site-specific noise rates per position of sequence reads when detecting true positives from samples. Referring back to the example use case described with respect to
The Bayesian hierarchical model 225 may update parameters simultaneously for multiple (or all) positions included in the model. Additionally, the model 225 may be trained to model expected noise for each ALT. For instance, a model for SNVs may perform a training process four or more times to update parameters (e.g., one-to-one substitutions) for mutations of each of the A, T, C, and G bases to each of the other three bases. In step 830, the machine learning engine 220 stores parameters of the Bayesian hierarchical model 225 (e.g., ensemble parameters output by the Markov Chain Monte Carlo method). In step 840, the machine learning engine 220 approximates noise distribution (e.g., represented by a dispersion parameter and a mean parameter) per position based on the parameters. In step 850, the machine learning engine 220 performs dispersion re-estimation (e.g., maximum likelihood estimation) using original AD counts from the samples (e.g., training data) used to train the Bayesian hierarchical model 225.
and mp, respectively, specific to the candidate variant, which may be based on the position p of the candidate variant. The parameters may be derived using a model, e.g., a Bayesian hierarchical model 225 representing a posterior predictive distribution with an observed depth of a given sequence read and a mean parameter μp at the position p as input. In an embodiment, the mean parameter μp is a gamma distribution encoding a noise level of nucleotide mutations with respect to the position p for a training sample.
At step 930, the processing system 100 inputs read information (e.g., AD or AF) of the set of sequence reads into a function (e.g., based on a negative binomial) parameterized by the parameters, e.g., and mp. At step 940, the processing system 100 (e.g., the score engine 235) determines a score for the candidate variant (e.g., at the position p) using an output of the function based on the input read information. The score may indicate a likelihood of seeing an allele count for a given sample (e.g., from a subject) that is greater than or equal to a determined allele count of the candidate variant (e.g., determined by the model and output of the function). The processing system 100 may convert the likelihood into a Phred-scaled score. In some embodiments, the processing system 100 uses the likelihood to determine false positive mutations responsive to determining that the likelihood is less than a threshold value. In some embodiments, the processing system 100 uses the function to determine that a sample of sequence reads includes at least a threshold count of alleles corresponding to a gene found in sequence reads from a tumor biopsy of an individual. Responsive to this determination, the processing system 100 may predict presence of cancer cells in the individual based on variant calls. In some embodiments, the processing system 100 may perform weighting based on quality scores, use the candidate variants and quality scores for false discovery methods, annotate putative calls with quality scores, or provision to subsequent systems. The methods described above with respect to
The example results shown in the following figures were determined by the processing system 100 using one or more trained Bayesian hierarchical models 225. Bayesian hierarchical (BH) models 225 for SNVs and Indels may be referred to as a “SNV BH model” and “Indel BH model,” respectively. For purposes of comparison, some example results were determined without using a model 225 and are referred to as “no model” examples. In various embodiments, the results were generated using a targeted sequencing assay utilizing GRAIL's (GRAIL, Inc., Menlo Park, Calif.) proprietary 508 cancer gene panel to evaluate and call variants from targeted sequencing data from circulating cell-free DNA (cfDNA) samples obtained from subjects in one of two studies, Study “A” and Study “B,” as indicated in the figures. Study A included sequencing data from plasma samples obtained from 50 healthy subjects (no diagnosis of cancer) and 50 samples each from subjects with pre-metastatic breast cancer and pre-metastatic non-small cell lung cancer. Study B included evaluable sequencing data from plasma samples obtained from 124 cancer patients (39 subjects with metastatic breast cancer (MBC), 41 subjects with non-small cell lung cancer (NSCLC), and 44 subjects with castration-resistant prostate cancer (CRCP).
Whole blood was drawn into STRECK Blood Collection Tubes (BCT®) from healthy individuals and cancer patients, separated into plasma and buffy coat, and stored at −80° C. Cell-free DNA (cfDNA) was extracted from the plasma using a modified QIAmp Circulating Nucleic Acid kit (Qiagen, Germantown, Md.) and quantified using the Fragment Analyzer High Sensitivity NGS kit (Advanced Analytical Technologies, Akneny Iowa). A sequencing library was prepared from extracted cfDNA with a modified Illumina TruSeq DNA Nano protocol (ILLUMINA®; San Diego, Calif.). The library preparation protocol included adapter ligation of sequencing adapters comprising unique molecular identifiers (UMIs) used for error correction as described above. Sequencing libraries were PCR amplified and quantified using the Fragment Analyzer Standard Sensitivity NGS kit.
Quantified DNA libraries underwent hybridization-based capture with GRAIL's proprietary research panel targeting 508 cancer related genes (GRAIL, Inc., Menlo Park, Calif.). Target DNA molecules were first captured using biotinlyated single-stranded DNA hybridization probes and then enriched using magnetic streptavidin beads. Non-target molecules were removed using subsequent wash steps. Enriched libraries were sequenced on a HiSex X using the HiSeq X Reagent Kit v2.5 (ILLUMINA®; San Diego, Calif.) at a nominal raw target coverage of 60,000×. Four libraries were pooled per flowcell and dual indexing primer mix was included to enable dual sample indexing reads. The read lengths were set to 150, 150, 8, and 8, respectively for read 1, read 2, index read 1, and index read 2. The first 6 base reads in read 1 and read 2 were the UMI sequences.
The example results of the
The example results for both SNV and indel type mutations (excluding hypermutators) shown in
As shown by the example results, the BH models retain concordant mutations and in several cases improve sensitivity (e.g., greater PPA) of concordant mutations. For instance, in the breast cancer cfNDA sample for indels, the baseline PPA are 0.1 and 0.26 for “No Model 1” and “No Model 2,” respectively. However, the PPA increases to 0.37 and 0.42 for “BH_gDNA” and “BH_nonsyn,” respectively.
The “Tumor-ordered” results indicate that target cancer genes detected by the tumor-based “GRAIL” and cfDNA-based “Tumor” assays generally match. The baseline “GRAIL-ordered PASS” results obtained without using the BH models indicate that the “GRAIL” assay detects mutations in genes that do not match either of the target cancer genes or the genes detected by the “Tumor” assay. However, the “GRAIL-ordered BH” results obtained using the BH models indicate that the “GRAIL” assay detects genes that matches some of the target cancer genes and some of the genes detected by the “Tumor” assay. For instance, in
The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product including a computer-readable non-transitory medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may include information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
This application claims the benefit of priority to U.S. Provisional Application No. 62/569,367, filed on Oct. 6, 2017, which is incorporated herein by reference in its entirety for all purposes.
Number | Date | Country | |
---|---|---|---|
62569367 | Oct 2017 | US |