It is important to accurately determine the presence of disease-causing mutations in a patient, given the severity not only of the disease, but also of the treatment for such diseases (e.g., chemotherapy or radiation treatment). A method for determining the presence of such mutations can be performed by taking a tissue or fluid sample from the patient, then sequencing the sample to look for variants (mutations) in the DNA. However, there are factors in both sample procurement and sequencing that can lead to an abundance of false positive results that reduce the confidence level of the test results.
In a first embodiment, a method for detection of variant DNA in a heterogeneous cell sample is described, the method comprising: sequencing the heterogeneous cell sample from a subject, producing an input sequence; and applying a heuristic filter pipeline to the input sequence, producing an output report.
In a second embodiment, a method is described as the method of the first embodiment further comprising: sequencing a control cell sample from the subject, producing a control sequence.
In a third embodiment, a method is described as the method of the second embodiment wherein the heuristic filter pipeline further comprises at least one of: determining amplicons to be excluded; determining read positions to be excluded; and determining variants to be excluded.
These embodiments are exemplary and other embodiments are understood from the disclosure. One skilled in the art could conceive of further embodiments from the teachings herein.
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate one or more embodiments of the present disclosure and, together with the description of example embodiments, serve to explain the principles and implementations of the disclosure.
Genome sequencing is useful for detection and identification of disease mutations in cells, such as with cancer. Difficulties can arise in computer-aided sequencing when the biological sample, for example taken from a blood sample from a patient, contains a heterogeneous mixture of cell deoxyribonucleic acid (DNA).
Nucleic acid sequencing is a method for determining the exact order of nucleotides present in a given DNA or RNA molecule. Next-generation sequencing (NGS), also known as high-throughput sequencing, is a term used to describe a number of different modern nucleic acid sequencing technologies including Illumina™ sequencing, Roche 454™ sequencing, Ion Torrent: Proton/PGM™ sequencing and SOLiD™ sequencing. These sequencing technologies allow one to sequence DNA and RNA quickly and cheaply compared to the previously used Sanger sequencing.
The terms “nucleic acids” and “polynucleotides” as used herein refer to biological molecules comprising a plurality of nucleotides. Exemplary nucleic acids include deoxyribonucleic acids (DNA) and ribonucleic acids (RNA), each synthesized from four different types of nucleotides, also called “bases”. The nucleotides for DNA include deoxy-adenosine (“A”), deoxy-thymidine (“T”), deoxy-cytosine (“C”), and deoxy-guanosine (“G”). The nucleotides for RNA include adenosine (“A”), uracil (“U”), cytosine (“C”) and guanosine (“G”). The nucleotides of a DNA or RNA are arranged in a particular order, referred to as the sequence of the DNA or RNA. The precise order of nucleotides, i.e. the four bases, within a DNA or RNA molecule is determined using nucleic acid sequencing methods.
In cases where a suspected disease or condition is concerned, targeted sequencing of specific genes or genomic regions is preferred. Compared to whole genome sequencing, which sequences an entire genome, targeted sequencing focuses on a sequence segment of interest comprising one or more specific genes or genomic regions. Targeted sequencing generally yields higher coverage of genomic regions of interest and reduces sequencing cost and time.
“Amplicon sequencing” as used herein refers to a targeted sequencing method in which a discrete region of a genome is first amplified from the entire genome using PCR and the generated amplicons are used as templates for subsequent sequencing. Amplicon sequencing is typically used to investigate genetic variants in complex and heterogeneous samples. Sequencing can be carried out in a sample containing amplification products of a single amplicon. Alternatively, the sample can contain mixtures of multiple amplicons pooled together, as will be understood by a skilled person. Amplicon Sequencing is a method where multiple amplicons are pooled together and co-sequenced.
“Amplicons” as used herein are defined as replicated DNA (or ribonucleic acid—RNA) strands that are formed by polymerase chain reaction (PCR), ligase chain reactions (LCR), or other DNA duplication methods, where the strands are copies of a target region of a genome. In order to multiplex PCR amplification, each amplicon has to be unique and independent (no overlapping amplicons), which requires careful selection of the primers used to tag the regions to be amplified. Amplicons for sequencing have a length typically in the range between 100 bp and 500 bp.
The processing and sequencing of amplicons with different sequencing platforms can be flexible and allows for a range of experimental designs. A variety of options regarding design parameters can be selected, such as the length of amplicons, the number of amplicons pooled together, the number of reads desired for a given amplicon or a pool of amplicons, whether to read from one end (unidirectional sequencing) or both ends (bi-directional sequencing) of the amplicon and other factors identifiable to a skilled person in the art.
“Read” or “reads” as used herein are defined as a sequenced range of DNA or RNA. A read can be a sequence that is output by a sequencing instrument, where the read attempts to match a range of DNA that was input to the instrument. Each set of reads maps to a particular amplicon, with a read being a sequence for the complete amplicon or, typically, a range of bases comprising a subset of the amplicon. The total set of reads in the input data for the filter pipeline can include multiple amplicons, each having multiple reads mapped to them. The range of the read lengths depends upon the primers chosen for a given library. The mapping of reads to an amplicon can be determined during alignment/assembly using a sequencing alignment tool, for example the Bowtie™ 2 read alignment tool from Johns Hopkins University (see “Fast gapped-read alignment with Bowtie 2” by Ben Langmead and Steven L. Salzberg, Nat Methods, Author manuscript; PMC Apr. 1, 2013).
In order to analyze libraries formed from heterogeneous mixtures of DNA (i.e., a mixture of different cells), rare sequencing events that contain a disease mutation, called herein the “signal”, must be differentiated or filtered from extraneous sequencing information, called herein the “noise”. A signal that is of the same order of magnitude as noise (e.g., a high frequency of DNA in the sample that is not being targeted for analysis) is difficult to interpret unless a specific filtering method is used to remove at least some of the noise.
There are at least two sources of noise in the sequencing pipeline. First, the DNA mixtures that are produced from input pellets (DNA or cell pellets) are complicated mixtures of cells and therefore any useful signal is diluted by DNA that has no informational content. A second source of noise is due to the specific sequencing technology employed. For example, sequencing noise or “machine” noise can be derived from an ion-to-bases sequencing process, for example with the Ion Torrent™ Personal Genome Machine (PGM™) platform. For example, ion detection sequencing that reads bases on pH detection is sensitive to homopolymers and will sometimes read a homopolymer chain as being one base too long or too short, particularly if the chain is long.
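By way of non-limiting illustration, homopolymer chains of the kind described above can be located in a sequence with a short routine, so that variants falling in or next to such chains can be treated with additional caution. The following Python sketch is an assumption made for illustration and is not a step of the disclosed pipeline; the function name `homopolymer_runs` and the default minimum run length of four bases are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=4):
    """Return (start, base, length) for each run of identical bases.

    Only runs of at least `min_length` bases are reported. The default of
    four is an illustrative assumption, not a value from the disclosure.
    """
    runs = []
    position = 0
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


# Positions are 0-based offsets into the sequence.
print(homopolymer_runs("ACGTTTTTAGGGGC"))
```

A longer reported run could be weighted as more likely to produce a one-base insertion or deletion artifact, consistent with the observation that long chains are the most error prone.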
As used herein, “ion-to-bases” refers to ion semiconductor sequencing or ion detection sequencing, a method of sequencing DNA based on the detection of hydrogen ions that are released during the polymerization of DNA. This is a method of sequencing by synthesis, such that a complementary strand is built based on the sequence of the target strand.
Based upon empirical evidence, the machine noise contribution can be 5% to 10% or higher. Based upon the nature of the rare cell pellets recovered from a cell isolation platform, the required theoretical sensitivity needs to be on the order of about 1% to enable useful patient information to be reproducibly recovered from samples. Given that this sensitivity is not compatible with the noise characteristics of the sequencing platform, an informatics based sequence filtering strategy is required to reduce the noise below the required sensitivity (for example, 1%, or one cell in one hundred being a target cell). The noise in a sequencing pipeline can be reduced significantly by a heuristic filtration method.
The ability to distinguish a sequence variant (SNV) from a non-variant/reference genome requires sufficient sampling of the test sample to ensure a statistically valid result (i.e., a satisfactory degree of confidence in the results). For example, at the 1.0% threshold this translates to 20 informative (mutation-bearing) reads per 2000 total reads. Cell-free DNA, however, may not have enough integrity to allow that many reads, so a lower threshold might be required, which in turn results in a lower level of confidence in the results. In addition to collecting a sufficient number of total reads, there are other considerations that affect the ability to call SNVs from sequencing tests. In order to call a sequence variant as a true mutation, confounding artifacts of the sequencing process must be excluded.
Sequence variants, also known as mutations, include deletions, insertions, substitutions and duplications of a single or multiple nucleotides, and chromosome rearrangements such as translocation and inversion. A particular type of sequence variant is a genetic variation formed by a single base pair substitution, called a point mutation.
Once the FASTQ files (i.e., text-based files containing sequences of reads produced from a genome sequencing procedure) are exported from the ion-to-bases conversion server, they must be analyzed for sequence variants (SNVs). To accomplish this, the experimental files must be aligned to the reference sequence. In order to perform an alignment of the FASTQ sequences to a human reference assembly, a sequence alignment software device is required. The alignment output is in the BAM format, which is a binary version of the tab-delimited SAM (Sequence Alignment/Map) format. Once an indexed BAM file has been produced and gapped, the actual alignment can be visualized if needed.
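The read-count arithmetic above can be illustrated with a brief Python sketch. It is offered only as an example under stated assumptions: the function names are hypothetical, and treating 2000 total reads as a minimum depth is an assumption drawn from the 20-in-2000 example rather than a requirement of the disclosure.

```python
def variant_fraction(variant_reads, total_reads):
    """Fraction of reads at a position that carry the variant."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def meets_threshold(variant_reads, total_reads,
                    threshold=0.01, min_total_reads=2000):
    """True when depth is sufficient and the variant fraction reaches threshold.

    The defaults (1.0% and 2000 reads) mirror the example in the text;
    using 2000 as a hard minimum depth is an illustrative assumption.
    """
    if total_reads < min_total_reads:
        return False
    return variant_fraction(variant_reads, total_reads) >= threshold


print(meets_threshold(20, 2000))  # 20 informative reads out of 2000 total
print(meets_threshold(19, 2000))  # one informative read short
```

For samples such as cell-free DNA that cannot supply 2000 reads, `min_total_reads` could be lowered, with the corresponding loss of confidence noted above.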
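For illustration, the text-based FASTQ layout referred to above (a header line beginning with “@”, a sequence line, a separator line beginning with “+”, and a quality line) can be read with a few lines of Python. This is a minimal sketch that assumes the common four-line-per-record layout with no wrapped sequence lines; the function name `parse_fastq` is hypothetical, and an established parsing library would ordinarily be used instead.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from four-line FASTQ records.

    Assumes each record occupies exactly four lines; wrapped (multi-line)
    sequences are not supported by this sketch.
    """
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Hypothetical record, invented for illustration only.
example = ["@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n"]
for read_id, sequence, quality in parse_fastq(example):
    print(read_id, sequence, len(quality))
```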
Despite the alignment of each FASTQ read to the reference sequence (an amplicon), there is still a chance that a given base will be in error due to the base calling or due to biological or machine noise. Thus a post-alignment software program for sequence analysis has been developed. This program is called the “heuristic filter pipeline”, a series of filtering steps that generates an SNV report from the FASTQ data. This SNV report can then be exported into the LIMS (Laboratory Information Management System) for patient reporting. An example heuristic filter algorithm is as below:
The filters can be applied in any order, and in any combination (i.e., not all filters need to be used). The inclusion of and thresholds used by the various filters can depend upon the nature of the input data and the sources of noise present in the DNA acquisition and sequencing process. Each filter step can also record its pass and/or fail rate, which can be compared against a threshold to determine whether the filter should be applied to the results (for example, if the number of amplicons failing the Amplicon Coverage filter is too high, or equivalently if the number of amplicons passing the Amplicon Coverage filter is too low, then the amplicons that would be excluded by the Amplicon Coverage filter are not excluded). This would create a controllable tolerance level for the filter in question, allowing a filter to be more permissive for batches that would otherwise have too few remaining SNVs after filtering.
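The controllable tolerance level described above can be sketched as follows. The Python example is illustrative only and rests on assumptions: the function name `apply_filter_with_tolerance`, the default maximum fail fraction of 0.5, and the coverage values are all hypothetical and are not taken from the disclosure.

```python
def apply_filter_with_tolerance(items, passes, max_fail_fraction=0.5):
    """Apply a filter unless it would reject too large a share of the items.

    Returns (kept_items, applied). When the fail fraction exceeds
    `max_fail_fraction`, the filter is skipped and every item is kept.
    """
    items = list(items)
    if not items:
        return [], False
    kept = [item for item in items if passes(item)]
    failed = len(items) - len(kept)
    if failed / len(items) > max_fail_fraction:
        return items, False
    return kept, True


# Hypothetical per-amplicon read coverage, invented for illustration only.
coverage = {"amp1": 2500, "amp2": 3000, "amp3": 150}

kept, applied = apply_filter_with_tolerance(
    coverage, lambda name: coverage[name] >= 2000)
print(kept, applied)  # one of three amplicons fails, so the filter applies

kept, applied = apply_filter_with_tolerance(
    coverage, lambda name: coverage[name] >= 2800)
print(kept, applied)  # two of three would fail, so the filter is skipped
```

Because each filter is an independent predicate, several such calls can be chained in any order or omitted, in keeping with the statement that not all filters need to be used.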
“Noise”, as used herein, includes false positive and unreliable results from any source, internal or external to the system, or data that is not clinically significant. “Signal”, as used herein, includes highly reliable results that a user is trying to analyze.
Case/control can include comparing sequences from the patient's sample (e.g., blood to be analyzed) and a germline control sample (e.g., patient's normal/unmutated tissue).
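One possible form of the case/control comparison is sketched below in Python. It assumes, for illustration only, that a variant observed in both the patient's sample and the control sample is excluded from the results; the representation of a variant as a (chromosome, coordinate, reference base, variant base) tuple and the function name `case_control_filter` are hypothetical.

```python
def case_control_filter(case_variants, control_variants):
    """Split case variants into those kept and those also seen in the control.

    Each variant is a (chromosome, coordinate, reference_base, variant_base)
    tuple. Excluding shared variants is an illustrative assumption.
    """
    control = set(control_variants)
    kept = [variant for variant in case_variants if variant not in control]
    removed = [variant for variant in case_variants if variant in control]
    return kept, removed


# Hypothetical variants, invented for illustration only.
case = [(1, 1000, "C", "T"), (2, 2000, "G", "A")]
control = [(2, 2000, "G", "A")]
kept, removed = case_control_filter(case, control)
print(kept)     # variants found only in the case sample
print(removed)  # variants shared with the control sample
```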
In an example library, for a three minute assembly the post-assembly process adds about one and a half minutes to the process.
In addition to the hit/miss statistics, the hit list analysis can include details of why each removed hit was filtered out. The specific filter (Case/control, End-of-read, Cluster, etc.) that removed the hit can be listed next to the hit for analysis of the noise of the system.
Table 1 shows a portion of an example filter report. For a given sequencing run (run) for a given patient (pat_id), variants (avar) are shown relative to the reference base (aref) they replace, with the variant location identified by chromosome number (chr) and gene coordinate (coordinate). The total base coverage (coverage) and variant count (var_count) for the variant are given. A filter report field (filter) reports whether the variant was not filtered by the heuristic filter (value of NON) or, if it was filtered, which filter removed the variant from the final results (e.g., EOR for end-of-read filter, GLOB for global filter, etc.). Another field (effect) reports other effects that can determine scoring of the variant, such as being non-synonymous (value NS). The report can include further information, such as the type of run (e.g., germ line run), base counts at that position, percent variation, deletion counts, gene identification, transcript identification, protein change, complementary DNA (cDNA) change, or Catalogue of Somatic Mutations in Cancer (COSMIC) identification.
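A report with the fields described above lends itself to a simple tally of how many variants each filter removed. The following Python sketch is illustrative only: the field names follow the description of Table 1, while the row values, the function names, and the computation of percent variation as variant count divided by coverage are assumptions made for the example.

```python
from collections import Counter


def summarize_filter_report(rows):
    """Count variants per filter code and collect those no filter removed."""
    counts = Counter(row["filter"] for row in rows)
    retained = [row for row in rows if row["filter"] == "NON"]
    return counts, retained


def percent_variation(row):
    """Variant reads as a percentage of total coverage (assumed definition)."""
    return 100.0 * row["var_count"] / row["coverage"]


# Hypothetical report rows, invented for illustration only.
example_rows = [
    {"chr": 1, "coordinate": 1000, "aref": "C", "avar": "T",
     "coverage": 2000, "var_count": 40, "filter": "NON"},
    {"chr": 2, "coordinate": 2000, "aref": "G", "avar": "A",
     "coverage": 2500, "var_count": 30, "filter": "EOR"},
    {"chr": 3, "coordinate": 3000, "aref": "A", "avar": "G",
     "coverage": 3000, "var_count": 45, "filter": "GLOB"},
]

counts, retained = summarize_filter_report(example_rows)
print(dict(counts))
for row in retained:
    print(row["chr"], row["coordinate"], percent_variation(row))
```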
A number of embodiments of the disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the present disclosure. Accordingly, other embodiments are within the scope of the following claims.
The examples set forth above are provided to those of ordinary skill in the art as a complete disclosure and description of how to make and use the embodiments of the disclosure, and are not intended to limit the scope of what the inventor/inventors regard as their disclosure.
Modifications of the above-described modes for carrying out the methods and systems herein disclosed that are obvious to persons of skill in the art are intended to be within the scope of the following claims. All patents and publications mentioned in the specification are indicative of the levels of skill of those skilled in the art to which the disclosure pertains. All references cited in this disclosure are incorporated by reference to the same extent as if each reference had been incorporated by reference in its entirety individually.
It is to be understood that the disclosure is not limited to particular methods or systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the content clearly dictates otherwise. The term “plurality” includes two or more referents unless the content clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure pertains.