This application generally relates to methods, systems, and computer readable media for nucleic acid sequencing, and, more specifically, to methods, systems, and computer readable media for improving base calling accuracy when analyzing nucleic acid sequencing data.
Nucleic acid sequencing data may be obtained in various ways, including using next-generation sequencing systems such as, for example, the Ion PGM™ and Ion Proton™ systems implementing Ion Torrent™ sequencing technology; see, e.g., U.S. Pat. No. 7,948,015 and U.S. Pat. Appl. Publ. Nos. 2010/0137143, 2009/0026082, and 2010/0282617, which are all incorporated by reference herein in their entirety. There is a need for new methods, systems, and computer readable media that can better evaluate base calls and reduce sequencing errors when analyzing data obtained using these or other sequencing systems/platforms.
The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more exemplary embodiments and serve to explain the principles of various exemplary embodiments. The drawings are exemplary and explanatory only and are not to be construed as limiting or restrictive in any way.
According to an exemplary embodiment, there is provided a method for improving base calling accuracy in nucleic acid sequencing, comprising: (a) exposing a plurality of template polynucleotide strands, sequencing primers, and polymerase disposed in a plurality of defined spaces disposed on a sensor array to a series of flows of nucleotide species according to a predetermined order; (b) obtaining a plurality of series of measured intensity values corresponding to the series of flows of nucleotide species and to the plurality of defined spaces disposed on the sensor array and randomly selecting a training subset of the plurality of series of measured intensity values; (c) generating a first plurality of series of base calls corresponding to the training subset of the plurality of series of measured intensity values using a base caller and aligning the first plurality of series of base calls to a reference genome or sequence using an aligner; (d) determining a plurality of intensity value thresholds corresponding to different homopolymer lengths and nucleotide species, and a plurality of parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species; (e) generating a second plurality of series of base calls corresponding to the plurality of series of measured intensity values using the base caller and, for homopolymers of at least a first predetermined length, at least some of the plurality of parameters of a linear transformation; and (f) recalibrating the second plurality of series of base calls corresponding to the plurality of series of measured intensity values, for homopolymers of at most a second predetermined length, using at least some of the plurality of intensity value thresholds.
According to an exemplary embodiment, there is provided a non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a method for improving base calling accuracy in nucleic acid sequencing, comprising: (a) exposing a plurality of template polynucleotide strands, sequencing primers, and polymerase disposed in a plurality of defined spaces disposed on a sensor array to a series of flows of nucleotide species according to a predetermined order; (b) obtaining a plurality of series of measured intensity values corresponding to the series of flows of nucleotide species and to the plurality of defined spaces disposed on the sensor array and randomly selecting a training subset of the plurality of series of measured intensity values; (c) generating a first plurality of series of base calls corresponding to the training subset of the plurality of series of measured intensity values using a base caller and aligning the first plurality of series of base calls to a reference genome or sequence using an aligner; (d) determining a plurality of intensity value thresholds corresponding to different homopolymer lengths and nucleotide species, and a plurality of parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species; (e) generating a second plurality of series of base calls corresponding to the plurality of series of measured intensity values using the base caller and, for homopolymers of at least a first predetermined length, at least some of the plurality of parameters of a linear transformation; and (f) recalibrating the second plurality of series of base calls corresponding to the plurality of series of measured intensity values, for homopolymers of at most a second predetermined length, using at least some of the plurality of intensity value thresholds.
According to an exemplary embodiment, there is provided a system for improving base calling accuracy in nucleic acid sequencing, including: a plurality of template polynucleotide strands, sequencing primers, and polymerase disposed in a plurality of defined spaces disposed on a sensor array; an apparatus configured to expose the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for improving base calling accuracy in nucleic acid sequencing, comprising: (a) obtaining a plurality of series of measured intensity values corresponding to the series of flows of nucleotide species and to the plurality of defined spaces disposed on the sensor array and randomly selecting a training subset of the plurality of series of measured intensity values; (b) generating a first plurality of series of base calls corresponding to the training subset of the plurality of series of measured intensity values using a base caller and aligning the first plurality of series of base calls to a reference genome or sequence using an aligner; (c) determining a plurality of intensity value thresholds corresponding to different homopolymer lengths and nucleotide species, and a plurality of parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species; (d) generating a second plurality of series of base calls corresponding to the plurality of series of measured intensity values using the base caller and, for homopolymers of at least a first predetermined length, at least some of the plurality of parameters of a linear transformation; and (e) recalibrating the second plurality of series of base calls corresponding to the plurality of series of measured intensity values, for homopolymers of at most a second predetermined length, using at least some of the plurality of intensity value thresholds.
The following description and the various embodiments described herein are exemplary and explanatory only and are not to be construed as limiting or restrictive in any way. Other embodiments, features, objects, and advantages of the present teachings will be apparent from the description and accompanying drawings, and from the claims.
According to various exemplary embodiments, methods, systems, and computer readable media for improving base calling accuracy in nucleic acid sequencing using recalibration of base calls or related intensity signals or parameters are disclosed herein. The various embodiments may improve accuracy by performing recalibration of base calls or related intensity signals or parameters using a training subset of called reads that were aligned to a reference genome or sequence, which may compensate for systematic bias that may be present in nucleic acid sequencing signals and often results in under-calls or over-calls. Such methods, systems, and computer readable media may reduce certain systematic errors and improve overall sequencing accuracy (especially in the case of long homopolymers), which may in turn improve downstream processing such as variant calling.
In this application, “defined space” generally refers to any space (which may be in one, two, or three dimensions) in which at least some of a molecule, fluid, and/or solid can be confined, retained and/or localized. The space may be a predetermined area (which may be a flat area) or volume, and may be defined, for example, by a depression or a micro-machined well in or associated with a microwell plate, microtiter plate, microplate, or a chip, or by isolated hydrophobic areas on a generally hydrophobic surface. Defined spaces may be arranged as an array, which may be a substantially planar one-dimensional or two-dimensional arrangement of elements such as sensors or wells. Defined spaces, whether arranged as an array or in some other configuration, may be in electrical communication with at least one sensor to allow detection or measurement of one or more detectable or measurable parameters or characteristics. The sensors may convert changes in the presence, concentration, or amounts of reaction by-products (or changes in ionic character of reactants) into an output signal, which may be registered electronically, for example, as a change in a voltage level or a current level which, in turn, may be processed to extract information or signal about a chemical reaction or desired association event, for example, a nucleotide incorporation event and/or a related ion concentration (e.g., a pH measurement). The sensors may include at least one ion sensitive field effect transistor (“ISFET”) or chemically sensitive field effect transistor (“chemFET”).
In an embodiment, the primer-template-polymerase complex may be subjected to a series of exposures of different nucleotides in a pre-determined sequence or ordering. If one or more nucleotides are incorporated, then the signal resulting from the incorporation reaction may be detected, and after repeated cycles of nucleotide addition, primer extension, and signal acquisition, the nucleotide sequence of the template strand may be determined. The output signals measured throughout this process depend on the number of nucleotide incorporations. Specifically, in each addition step, the polymerase extends the primer by incorporating added dNTP only if the next base in the template is complementary to the added dNTP. With each incorporation, a hydrogen ion is released, and collectively a population of released hydrogen ions changes the local pH of the reaction chamber. The production of hydrogen ions may be monotonically related to the number of contiguous complementary bases (e.g., homopolymers) in the template. Deliveries of nucleotides to a reaction vessel or chamber may be referred to as “flows” of nucleotide triphosphates (or dNTPs). For convenience, a flow of dATP will sometimes be referred to as “a flow of A” or “an A flow,” and a sequence of flows may be represented as a sequence of letters, such as “ATGT” indicating “a flow of dATP, followed by a flow of dTTP, followed by a flow of dGTP, followed by a flow of dTTP.” The predetermined ordering may be based on a cyclical, repeating pattern consisting of consecutive repeats of a short pre-determined reagent flow ordering (e.g., consecutive repeats of a pre-determined sequence of four nucleotide reagents such as, for example, “ACTG ACTG . . . ”), may be based in whole or in part on some other pattern of reagent flows (such as, e.g., any of the various reagent flow orderings discussed herein and/or in Hubbell et al., U.S. Pat. Appl. Publ. No. 2012/0264621, published Oct. 18, 2012, which is incorporated by reference herein in its entirety), and may also be based on some combination thereof.
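By way of a non-limiting illustration, the relationship between a flow ordering and the number of incorporations per flow may be sketched as follows. This is a minimal, idealized sketch (no phasing, signal decay, or noise); the function name `ideal_flowgram` and its arguments are hypothetical and are not part of the present teachings. The input is taken to be the strand being synthesized, so a flowed nucleotide is incorporated when it matches the next base of that strand.

```python
def ideal_flowgram(synthesized, flow_order, num_flows):
    """Return the idealized number of incorporations for each flow.

    synthesized: bases of the strand being extended, e.g. "AACTTTG" (assumed input).
    flow_order:  repeating reagent ordering, e.g. "ACTG".
    num_flows:   total number of flows to simulate.
    """
    counts = []
    pos = 0
    for k in range(num_flows):
        nuc = flow_order[k % len(flow_order)]
        n = 0
        # A homopolymer run is incorporated entirely within a single flow.
        while pos < len(synthesized) and synthesized[pos] == nuc:
            n += 1
            pos += 1
        counts.append(n)
    return counts


print(ideal_flowgram("AACTTTG", "ACTG", 4))  # [2, 1, 3, 1]
print(ideal_flowgram("TTAG", "ACTG", 8))     # [0, 0, 2, 0, 1, 0, 0, 1]
```

In this idealization each entry is a homopolymer length; under the example scaling used elsewhere herein, an n-mer would correspond to a measured intensity near n×100.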
In various embodiments, output signals due to nucleotide incorporation may be processed, given knowledge of what nucleotide species were flowed and in what order to obtain such signals, to make base calls for the flows and compile consecutive base calls associated with a sample nucleic acid template into a read. A base call refers to a particular nucleotide identification (e.g., dATP (“A”), dCTP (“C”), dGTP (“G”), or dTTP (“T”)). Base calling may include performing one or more signal normalizations, signal phase and signal decay (e.g., enzyme efficiency loss) estimations, signal corrections, and model-based signal predictions, and may identify or estimate base calls for each flow for each defined space. Any suitable base calling method may be used, including as described in Davey et al., U.S. Pat. Appl. Publ. No. 2012/0109598, published on May 3, 2012, and/or Sikora et al., U.S. Pat. Appl. Publ. No. 2013/0060482, published on Mar. 7, 2013, which are all incorporated by reference herein in their entirety, recognizing of course that more accurate base callers may yield better results.
Examples of hardware elements may include processors, microprocessors, input(s) and/or output(s) (I/O) device(s) (or peripherals) that are communicatively coupled via a local interface circuit, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. The local interface may include, for example, one or more buses or other wired or wireless connections, controllers, buffers (caches), drivers, repeaters and receivers, etc., to allow appropriate communications between hardware components. A processor is a hardware device for executing software, particularly software stored in memory. The processor can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer, a semiconductor based microprocessor (e.g., in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions. A processor can also represent a distributed processing architecture. The I/O devices can include input devices, for example, a keyboard, a mouse, a scanner, a microphone, a touch screen, an interface for various medical devices and/or laboratory instruments, a bar code reader, a stylus, a laser reader, a radio-frequency device reader, etc. Furthermore, the I/O devices also can include output devices, for example, a printer, a bar code printer, a display, etc. Finally, the I/O devices further can include devices that communicate as both inputs and outputs, for example, a modulator/demodulator (modem; for accessing another device, system, or network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, etc.
Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Software in memory may include one or more separate programs, which may include ordered listings of executable instructions for implementing logical functions. The software in memory may include a system for identifying data streams in accordance with the present teachings and any suitable custom made or commercially available operating system (O/S), which may control the execution of other computer programs such as the system, and may provide scheduling, input-output control, file and data management, memory management, communication control, etc.
According to various embodiments, one or more features of teachings and/or embodiments described herein may be performed or implemented using an appropriately configured and/or programmed non-transitory machine-readable medium or article that may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, scientific or laboratory instrument, etc., and may be implemented using any suitable combination of hardware and/or software. The machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, read-only memory compact disc (CD-ROM), recordable compact disc (CD-R), rewriteable compact disc (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disc (DVD), a tape, a cassette, etc., including any medium suitable for use in a computer. Memory can include any one or a combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, EPROM, EEROM, Flash memory, hard drive, tape, CDROM, etc.). Moreover, memory can incorporate electronic, magnetic, optical, and/or other types of storage media. Memory can have a distributed, clustered, remote, or cloud architecture where various components are situated remote from one another, but are still accessed by the processor. 
The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, etc., implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
In various embodiments, recalibration methods as discussed herein may be performed in parallel on subdivisions or partitions of a sensor array or chip. For example, a sensor array or chip may be divided into two or more regions (which could be physical regions, such as quadrants, or could be based on a content of regions or portions, such as library fragments vs. test fragments, for example) and recalibration may be performed on each region independently of the others. In some cases, recalibration may also be performed separately for different groups of nucleotide flows (e.g., earlier flows vs. later flows).
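As a non-limiting sketch of such partitioning, a defined space and a flow index may be mapped to a (region, flow bin) pair so that recalibration data can be kept separately per partition. The function `partition_key`, the equal-sized rectangular grid, and the equal-sized flow bins are assumptions made for illustration only.

```python
def partition_key(row, col, flow, n_rows, n_cols, n_flows, grid=2, flow_bins=2):
    """Map a well position and flow index to (region index, flow-bin index).

    Assumes a grid x grid spatial split (grid=2 gives quadrants) and
    flow_bins equal-sized groups of flows (e.g., earlier vs. later flows).
    """
    r = min(row * grid // n_rows, grid - 1)
    c = min(col * grid // n_cols, grid - 1)
    f = min(flow * flow_bins // n_flows, flow_bins - 1)
    return r * grid + c, f


# A well in the upper-right quadrant, observed in the second half of 400 flows.
print(partition_key(49, 50, 200, n_rows=100, n_cols=100, n_flows=400))  # (1, 1)
```

Because each key identifies an independent partition, the per-partition work could be dispatched in parallel.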
In various embodiments, recalibration methods as discussed herein may be performed in two stages: (i) a training stage, and (ii) a run stage. In the training stage, one or more recalibration thresholds and model parameters may be determined using a training set of base sequences aligned to a reference genome or sequence. In the run stage, the one or more recalibration thresholds and model parameters may be used to recalibrate base calls or related intensity signals or parameters to improve base calling accuracy.
In various embodiments, recalibration methods as discussed herein may comprise both a non-parametric recalibration module and a parametric model-based recalibration module, which can yield more accurate base calling without significant impact on computation efficiency. In various embodiments, one or two decision points or configurable switches may be used to control which one(s) of the approaches are used for homopolymers of a given length. In some cases, non-parametric recalibration may be used for homopolymers of at most a given length and parametric model-based recalibration may be used for homopolymers exceeding that given length (possibly up to some preset maximum length). In other cases, there may be some overlap, wherein non-parametric recalibration may be used for homopolymers of at most a first given length and parametric model-based recalibration may be used for homopolymers of at least a second given length (possibly up to some preset maximum length).
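The decision points described above may be illustrated with the following minimal sketch, in which the function name `recalibration_modes` and the parameter names are hypothetical. Non-parametric recalibration is selected for homopolymers of at most `nonparam_max_len`, and parametric recalibration for homopolymers of at least `param_min_len` (optionally up to `param_max_len`); the two ranges may or may not overlap.

```python
def recalibration_modes(hp_length, nonparam_max_len, param_min_len, param_max_len=None):
    """Return the recalibration approaches that apply to a homopolymer length."""
    modes = []
    if hp_length <= nonparam_max_len:
        modes.append("non_parametric")
    in_param_range = hp_length >= param_min_len and (
        param_max_len is None or hp_length <= param_max_len
    )
    if in_param_range:
        modes.append("parametric")
    return modes


# Overlapping configuration: lengths 3 and 4 are handled by both approaches.
for n in (2, 4, 6, 9):
    print(n, recalibration_modes(n, nonparam_max_len=4, param_min_len=3, param_max_len=8))
```

Setting `param_min_len` to `nonparam_max_len + 1` would give the non-overlapping case.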
Non-Parametric Recalibration
In an embodiment, the homopolymer statistics may be accumulated as follows to determine the lower and upper thresholds for a plurality of different homopolymer lengths and nucleotides: For each aligned/mapped sequence, a data array may be populated that comprises (i) the called homopolymers (e.g., A 1-mers, C 2-mers, etc.), (ii) the reference homopolymers (e.g., A 1-mers, C 2-mers, etc.), (iii) corresponding intensities, and (iv) corresponding lower/upper homopolymer boundary perturbations relative to an ideal homopolymer intensity threshold. For example, if 2-mers should in theory have a measured intensity of 200 with a lower cut-off of 150 and an upper cut-off of 250, a particular 2-mer having been called with an intensity value of 230 would have an upper perturbation of −20. (Of course, such a numbering is only an example and different scaling/numbering could also be used). Then, for each flow among the nucleotide flows, a data array comprising counts for the nucleotide species, called homopolymer, reference homopolymer, intensities, and perturbation may be generated. Such counts may be converted to frequencies, which may be used to generate probability distributions.
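The perturbation in the example above may be computed as in the following sketch. It assumes the example scaling in which an n-mer is centered at n×100 with cut-offs 50 below and above; the function name and the sign convention for the lower perturbation (intensity minus lower cut-off) are assumptions, since only the upper perturbation is worked in the text.

```python
def boundary_perturbations(intensity, hp_length, scale=100):
    """Return (lower, upper) perturbations relative to the ideal cut-offs.

    Each perturbation is the measured intensity minus the corresponding
    ideal cut-off for the called homopolymer length.
    """
    lower_cut = hp_length * scale - scale / 2
    upper_cut = hp_length * scale + scale / 2
    return intensity - lower_cut, intensity - upper_cut


lower, upper = boundary_perturbations(230, 2)
print(lower, upper)  # 80.0 -20.0
```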
In an embodiment, a random training sample of reads (e.g., about 1 million reads, about 2 million reads, or more) may be obtained and aligned to a reference genome. The size of the training sample may vary according to experimental needs and objectives, with larger sizes allowing improved determination of parameters (and non-parametric thresholds). However, larger training samples typically lead to increases in required computational resources and/or time. The aligned reads may be post-processed to determine the joint distribution of true homopolymer length and observed flow signals for each nucleotide. The distributions may be further processed to generate accuracy values, which may then be used for recalibration of all reads produced by a base caller. The accuracy values may be presented as a graph of accuracy as a function of intensity (e.g., homopolymer length) for each flow signal. For a given flow signal, a graph of accuracy (which can sometimes be referred to as a flow quality value) may be related to homopolymer-calling error probabilities and generated using an expression −10×log10 (1−Cn/C), where Cn represents the frequency of the homopolymer length n having the highest frequency among homopolymers of lengths 1, . . . , j observed in some interval of intensities, and where C=C1+ . . . +Cj represents the total of the frequencies C1, . . . , Cj of homopolymers of lengths 1, . . . , j observed in that interval. Of course, such a graph may be represented in the form of a data array or table. The distributions may be determined separately for a plurality of regions (e.g., four quadrants) on an array/chip and for a plurality of bins of nucleotide flows (e.g., the first half of the flows making up a first bin, with the second half making up a second bin). In an example, there may thus be 32 sets of accuracy graphs or tables and their related homopolymer distributions, corresponding to the 4 nucleotide types, 4 regions from a 2-by-2 spatial stratification of the chip, and 2 partitions of flows. In other embodiments, spatial and flow partitions may be defined more or less densely.
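The accuracy expression −10×log10 (1−Cn/C) and the count of 32 stratified sets may be illustrated with the following minimal sketch. The function name `flow_quality`, the dictionary layout, and the example counts are hypothetical and serve only to show the arithmetic.

```python
import math
from itertools import product


def flow_quality(counts):
    """Accuracy value for one intensity interval.

    counts maps reference homopolymer length -> frequency observed in the
    interval; Cn is the largest frequency and C is the sum of all of them.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations in this intensity interval")
    best = max(counts.values())
    if best == total:
        return math.inf  # no observed error in this interval
    return -10 * math.log10(1 - best / total)


# Made-up counts: 980 of 1000 observations in the interval are true 2-mers.
print(round(flow_quality({1: 10, 2: 980, 3: 10}), 2))  # 16.99

# One table per (nucleotide, region, flow bin): 4 x 4 x 2 = 32 sets.
keys = list(product("ACGT", range(4), range(2)))
print(len(keys))  # 32
```

Intervals near the boundary between two homopolymer lengths would be expected to show lower accuracy values, since the counts there are split between adjacent lengths.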
Parametric Recalibration
In an embodiment, the predicted intensity values may be obtained via a predictive model used by the base caller, which may be a model that can predict intensity values that would be likely to arise for candidate base call sequences given the underlying sequencing technology and operating parameters (such as ordering of nucleotide flows, sensor characteristics, etc.). Under such a predictive model, the measured intensity values mi1, mi2, . . . , mij, . . . , miM represent a vector of measured values for M nucleotide flows associated with an i-th read (e.g., a set of normalized, calibrated values observed for the i-th read over the M flows) and the model-predicted intensity values pi1, pi2, . . . , pij, . . . , piM represent a vector of predicted values for the i-th read over the M flows under the predictive model. Such model-predicted base calling may be performed as described in Davey et al., U.S. Pat. Appl. Publ. No. 2012/0109598, published on May 3, 2012, and/or Sikora et al., U.S. Pat. Appl. Publ. No. 2013/0060482, published on Mar. 7, 2013, which are all incorporated by reference herein in their entirety. For recalibration, the measurements and model-predicted values may be obtained for each aligned read and for each nucleotide flow. Once the measurements and model-predicted values have been accumulated, the parameters a and b of a linear transformation (e.g., m=a×p+b) may be estimated to minimize, for certain sets of measurements/predictions corresponding to homopolymers of given length and nucleotide, a difference between the measurements and the transformed predictions. The parameters of the linear transformation may be determined using any suitable data analysis or optimization method known in the art. In an embodiment, the parameters may be determined by solving a least-squares problem.
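One way the parameters a and b of m=a×p+b might be estimated is ordinary least squares, fitted separately for each (nucleotide, homopolymer length) group. The following is a minimal sketch under that assumption; the function names, the record layout, and the example values are hypothetical, and other estimation methods could equally be used.

```python
from collections import defaultdict


def fit_linear(predicted, measured):
    """Closed-form least-squares fit of measured = a * predicted + b."""
    n = len(predicted)
    if n < 2 or n != len(measured):
        raise ValueError("need at least two paired observations")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    sxx = sum((p - mean_p) ** 2 for p in predicted)
    if sxx == 0:
        raise ValueError("predicted values are all identical")
    sxy = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    a = sxy / sxx
    b = mean_m - a * mean_p
    return a, b


def fit_by_group(records):
    """records: iterable of (nucleotide, homopolymer length, predicted, measured)."""
    groups = defaultdict(lambda: ([], []))
    for nuc, hp_len, p, m in records:
        groups[(nuc, hp_len)][0].append(p)
        groups[(nuc, hp_len)][1].append(m)
    return {key: fit_linear(ps, ms) for key, (ps, ms) in groups.items()}


# Made-up A 3-mer observations in which measurements run about 10% low.
records = [("A", 3, p, 0.9 * p + 5) for p in (290, 300, 305, 310)]
a, b = fit_by_group(records)[("A", 3)]
print(round(a, 3), round(b, 3))  # 0.9 5.0
```

In the run stage, the fitted pair for a group could then be applied to the model-predicted intensity values for homopolymers in that group before base calls are generated.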
In an embodiment, a phenomenon such as the measurement spikes of
In some embodiments, in addition to recalibrating homopolymers (and thus base sequences), related data or signals may also be modified consistently. For example, predicted base quality scores may be adjusted by ensuring production of the same number of quality scores as the number of recalibrated base calls. In an embodiment, in the case of a deletion, the first predicted quality score may be removed; in the case of an insertion of a longer homopolymer, the last quality score of the homopolymer may be re-used; and in the case of insertion of a new 1-mer, a flow quality value may be used. In addition, a corresponding measured intensity may be modified for possible downstream analysis. For example, a measured intensity may be modified using the following expression: 98×(m−LT)/(UT−LT)+n×100−49, where m is the intensity to be modified, LT and UT are the lower and upper intensity thresholds for the homopolymer length in question, and n is the called homopolymer length (before recalibration). (Of course, such constants or multipliers are only an example and different scaling/numbering could also be used.) For example, a 2-mer with intensity of 235 may be recalibrated to a 3-mer if the lower and upper thresholds for 2-mers are 131 and 230, respectively, in which case the corresponding intensity may be modified as 98×(235−131)/(230−131)+2×100−49=254 (rounded to nearest integer). The same equation would keep the intensity unchanged if the lower and upper thresholds were identical to those one might expect in theory (e.g., 151 and 249, under an assumption that n-mers would be centered around intensities n×100 and separated at midpoints therebetween): 98×(235−151)/(249−151)+2×100−49=235.
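The two worked values above may be reproduced with the following short sketch. The function name `modified_intensity` is hypothetical; the constants 98, 100, and 49 are the example constants from the expression above and could be scaled differently.

```python
def modified_intensity(m, lower, upper, n):
    """Evaluate 98*(m - LT)/(UT - LT) + n*100 - 49 for a measured intensity m.

    lower and upper are the thresholds LT and UT for n-mers, and n is the
    homopolymer length called before recalibration.
    """
    return 98 * (m - lower) / (upper - lower) + n * 100 - 49


print(round(modified_intensity(235, 131, 230, 2)))  # 254
print(modified_intensity(235, 151, 249, 2))         # 235.0
```

The expression rescales the span between the trained thresholds onto the span between the theoretical cut-offs, so an intensity just above the trained upper threshold lands just above the theoretical boundary for the next homopolymer length.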
According to an exemplary embodiment, there is provided a method for improving base calling accuracy in nucleic acid sequencing, comprising: (a) exposing a plurality of template polynucleotide strands, sequencing primers, and polymerase disposed in a plurality of defined spaces disposed on a sensor array to a series of flows of nucleotide species according to a predetermined order; (b) obtaining a plurality of series of measured intensity values corresponding to the series of flows of nucleotide species and to the plurality of defined spaces disposed on the sensor array and randomly selecting a training subset of the plurality of series of measured intensity values; (c) generating a first plurality of series of base calls corresponding to the training subset of the plurality of series of measured intensity values using a base caller and aligning the first plurality of series of base calls to a reference genome or sequence using an aligner; (d) determining a plurality of intensity value thresholds corresponding to different homopolymer lengths and nucleotide species, and a plurality of parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species; (e) generating a second plurality of series of base calls corresponding to the plurality of series of measured intensity values using the base caller and, for homopolymers of at least a first predetermined length, at least some of the plurality of parameters of a linear transformation; and (f) recalibrating the second plurality of series of base calls corresponding to the plurality of series of measured intensity values, for homopolymers of at most a second predetermined length, using at least some of the plurality of intensity value thresholds.
In such a method, the plurality of intensity value thresholds may comprise a lower intensity threshold and an upper intensity threshold for each of the different homopolymer lengths and nucleotide species. The plurality of intensity value thresholds may comprise a set of lower intensity thresholds and upper intensity thresholds for each of nucleotide species A, C, G, and T determined using a graph of accuracy as a function of signal intensity. The accuracy may be determined using an expression −10×log10 (1−Cn/C), where Cn represents a frequency of the homopolymer length n with highest frequency among homopolymers of lengths 1, . . . , j, and wherein C=C1+ . . . +Cj represents the total of frequencies of homopolymers of lengths 1, . . . , j. The lower intensity thresholds and upper intensity thresholds may correspond to local minima of the graph of accuracy as a function of signal intensity for each of nucleotide species A, C, G, and T.
In such a method, recalibrating the second plurality of series of base calls may comprise replacing a homopolymer base call called for a measured intensity value falling outside a range defined by the lower intensity threshold and upper intensity threshold for the homopolymer length and nucleotide species of the homopolymer base call with a different homopolymer base call. Recalibrating the second plurality of series of base calls may further comprise correcting the measured intensity value corresponding to the replaced homopolymer base call using an expression comprising a constant multiplied by a ratio between (i) a difference between the measured intensity value and a lower intensity threshold and (ii) a difference between an upper intensity threshold and a lower intensity threshold. The plurality of intensity value thresholds may comprise a plurality of separate sets of intensity value thresholds, each corresponding to a partition of the sensor array. The plurality of intensity value thresholds may comprise a plurality of separate sets of intensity value thresholds, each corresponding to a partition of the series of flows of nucleotide species. The plurality of intensity value thresholds may comprise a plurality of separate sets of intensity value thresholds, each corresponding to a partition of the sensor array and a partition of the series of flows of nucleotide species.
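The threshold-based replacement described above may be sketched as follows. This is an illustrative assumption only: the function name and the threshold table layout are hypothetical, and the choice of replacing an out-of-range call with the adjacent homopolymer length (one longer above the upper threshold, one shorter below the lower threshold) is one simple reading that is consistent with the 2-mer to 3-mer example given earlier.

```python
def recalibrate_call(nuc, called_len, intensity, thresholds):
    """Return a possibly replaced homopolymer length for one base call.

    thresholds maps (nucleotide, homopolymer length) -> (lower, upper).
    Calls without a trained threshold pair are left unchanged.
    """
    bounds = thresholds.get((nuc, called_len))
    if bounds is None:
        return called_len
    lower, upper = bounds
    if intensity > upper:
        return called_len + 1          # under-call corrected upward
    if intensity < lower:
        return max(called_len - 1, 0)  # over-call corrected downward
    return called_len


thresholds = {("A", 2): (131, 230)}  # example thresholds for A 2-mers
print(recalibrate_call("A", 2, 235, thresholds))  # 3
print(recalibrate_call("A", 2, 200, thresholds))  # 2
```

Where separate sets of thresholds are kept per partition of the sensor array and/or of the flows, the lookup key could be extended with the corresponding partition indices.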
In such a method, the base caller may be configured to call bases at least in part using differences between the measured intensity values and model-predicted intensity values obtained using a predictive model of intensities responsive to flows of nucleotide species. The plurality of parameters of a linear transformation may comprise a slope and an offset for different homopolymer lengths and nucleotide species that represent a compensation for differences between measured intensity values and model-predicted intensity values. The plurality of parameters of a linear transformation may comprise parameters a and b for different homopolymer lengths and nucleotide species that minimize an absolute value of a difference between (i) a times the model-predicted intensity values plus b, and (ii) the measured intensity values. The plurality of parameters of a linear transformation may comprise a plurality of separate sets of parameters of a linear transformation, each corresponding to a partition of the sensor array. The plurality of parameters of a linear transformation may comprise a plurality of separate sets of parameters of a linear transformation, each corresponding to a partition of the series of flows of nucleotide species. The plurality of parameters of a linear transformation may comprise a plurality of separate sets of parameters of a linear transformation, each corresponding to a partition of the sensor array and a partition of the series of flows of nucleotide species. Generating the second plurality of series of base calls corresponding to the plurality of series of measured intensity values may comprise applying the plurality of parameters of a linear transformation to the model-predicted intensity values. Generating the second plurality of series of base calls corresponding to the plurality of series of measured intensity values may comprise transforming the model-predicted intensity values using the plurality of parameters of a linear transformation.
According to an exemplary embodiment, there is provided a non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a method for improving base calling accuracy in nucleic acid sequencing, comprising: (a) exposing a plurality of template polynucleotide strands, sequencing primers, and polymerase disposed in a plurality of defined spaces disposed on a sensor array to a series of flows of nucleotide species according to a predetermined order; (b) obtaining a plurality of series of measured intensity values corresponding to the series of flows of nucleotide species and to the plurality of defined spaces disposed on the sensor array and randomly selecting a training subset of the plurality of series of measured intensity values; (c) generating a first plurality of series of base calls corresponding to the training subset of the plurality of series of measured intensity values using a base caller and aligning the first plurality of series of base calls to a reference genome or sequence using an aligner; (d) determining a plurality of intensity value thresholds corresponding to different homopolymer lengths and nucleotide species, and a plurality of parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species; (e) generating a second plurality of series of base calls corresponding to the plurality of series of measured intensity values using the base caller and, for homopolymers of at least a first predetermined length, at least some of the plurality of parameters of a linear transformation; and (f) recalibrating the second plurality of series of base calls corresponding to the plurality of series of measured intensity values, for homopolymers of at most a second predetermined length, using at least some of the plurality of intensity value thresholds.
According to an exemplary embodiment, there is provided a system for improving base calling accuracy in nucleic acid sequencing, including: a plurality of template polynucleotide strands, sequencing primers, and polymerase disposed in a plurality of defined spaces disposed on a sensor array; an apparatus configured to expose the plurality of template polynucleotide strands, sequencing primers, and polymerase to a series of flows of nucleotide species according to a predetermined order; a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for improving base calling accuracy in nucleic acid sequencing, comprising: (a) obtaining a plurality of series of measured intensity values corresponding to the series of flows of nucleotide species and to the plurality of defined spaces disposed on the sensor array and randomly selecting a training subset of the plurality of series of measured intensity values; (b) generating a first plurality of series of base calls corresponding to the training subset of the plurality of series of measured intensity values using a base caller and aligning the first plurality of series of base calls to a reference genome or sequence using an aligner; (c) determining a plurality of intensity value thresholds corresponding to different homopolymer lengths and nucleotide species, and a plurality of parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species; (d) generating a second plurality of series of base calls corresponding to the plurality of series of measured intensity values using the base caller and, for homopolymers of at least a first predetermined length, at least some of the plurality of parameters of a linear transformation; and (e) recalibrating the second plurality of series of base calls corresponding to the plurality of series of measured intensity values, for homopolymers of at most a second predetermined length, using at least some of the plurality of intensity value thresholds.
According to an exemplary embodiment, there is provided a method for determining recalibration thresholds and parameters in nucleic acid sequencing, comprising: (a) exposing a plurality of template polynucleotide strands, sequencing primers, and polymerase disposed in a plurality of defined spaces disposed on a sensor array to a series of flows of nucleotide species according to a predetermined order; (b) obtaining a plurality of series of measured intensity values corresponding to the series of flows of nucleotide species and to the plurality of defined spaces disposed on the sensor array and randomly selecting a training subset of the plurality of series of measured intensity values; (c) generating a first plurality of series of base calls corresponding to the training subset of the plurality of series of measured intensity values using a base caller and aligning the first plurality of series of base calls to a reference genome or sequence using an aligner; and (d) determining a plurality of intensity value thresholds corresponding to different homopolymer lengths and nucleotide species, and a plurality of parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species.
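By way of illustration only, the random selection of a training subset in step (b) above may be sketched as follows. The function name, the mapping from a defined-space identifier to its series of measured intensity values, and the `fraction` and `seed` parameters are hypothetical; the present description does not specify the size of the training subset or the manner of random selection.

```python
import random
from typing import Dict, List, Optional


def select_training_subset(series_by_well: Dict[str, List[float]],
                           fraction: float,
                           seed: Optional[int] = None) -> Dict[str, List[float]]:
    """Randomly pick a fraction of the wells and return their intensity series.

    Assumptions: each well (defined space) has a string identifier, and a
    fixed seed is used only to make the selection reproducible.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not series_by_well:
        return {}
    rng = random.Random(seed)
    # Keep one well or more so that the training step always has data.
    k = max(1, round(len(series_by_well) * fraction))
    # Sort the identifiers so the same seed yields the same subset.
    chosen = rng.sample(sorted(series_by_well), k)
    return {well: series_by_well[well] for well in chosen}
```

The selected series would then be base called and aligned as in step (c), and the aligned calls used to determine the thresholds and linear-transformation parameters of step (d).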
According to an exemplary embodiment, there is provided a method for improving base call accuracy using recalibration thresholds and parameters, comprising: (a) receiving a plurality of intensity value thresholds corresponding to different homopolymer lengths and nucleotide species, and a plurality of parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species; (b) generating a plurality of series of base calls corresponding to a plurality of series of measured intensity values using a base caller and, for homopolymers of at least a first predetermined length, at least some of the plurality of parameters of a linear transformation; and (c) recalibrating the plurality of series of base calls corresponding to the plurality of series of measured intensity values, for homopolymers of at most a second predetermined length, using at least some of the plurality of intensity value thresholds. In such a method, (i) the plurality of intensity value thresholds and plurality of parameters may have been generated using an initial plurality of series of base calls corresponding to a randomly selected training subset of the plurality of series of measured intensity values; (ii) the plurality of series of measured intensity values may have been obtained as a result of a series of flows of nucleotide species to a sensor array comprising a plurality of defined spaces in which template polynucleotide strands, sequencing primers, and polymerase have been disposed; and (iii) the initial plurality of series of base calls may have been obtained using a base caller and aligned to a reference genome or sequence using an aligner.
According to an exemplary embodiment, there is provided a method for improving base calling accuracy in nucleic acid sequencing, comprising: exposing template polynucleotide strands, sequencing primers, and polymerase to flows of nucleotide species; obtaining a series of measured intensity values and randomly selecting a training subset therefrom; generating series of base calls using a base caller and aligning the series of base calls to a reference genome or sequence using an aligner; determining intensity value thresholds and parameters of a linear transformation corresponding to different homopolymer lengths and nucleotide species; generating series of base calls corresponding to the series of measured intensity values using at least some of the parameters of a linear transformation; and recalibrating the series of base calls corresponding to the plurality of series of measured intensity values using at least some of the intensity value thresholds.
Unless otherwise specifically designated herein, terms, techniques, and symbols of biochemistry, cell biology, genetics, molecular biology, nucleic acid chemistry, nucleic acid sequencing, and organic chemistry used herein follow those of standard treatises and texts in the relevant field.
Although the present description describes certain embodiments in detail, other embodiments are also possible and within the scope of the present invention. For example, those skilled in the art may appreciate from the present description that the present teachings may be implemented in a variety of forms, and that the various embodiments may be implemented alone or in combination. Variations and modifications will be apparent to those skilled in the art from consideration of the specification and figures and practice of the teachings described in the specification and figures, and the claims.
This application is a Division of U.S. application Ser. No. 14/255,528 filed Apr. 17, 2014, which claims priority to U.S. application No. 61/879,910 filed Sep. 19, 2013, and to U.S. application No. 61/814,061 filed Apr. 19, 2013, which disclosures are herein incorporated by reference in their entirety.
Number | Date | Country
---|---|---
61879910 | Sep 2013 | US
61814061 | Apr 2013 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 14255528 | Apr 2014 | US
Child | 16055315 | | US