The technology disclosed relates to apparatus and corresponding methods for the automated analysis of an image or recognition of a pattern. Included herein are systems that transform an image for the purpose of (a) enhancing its visual quality prior to recognition, (b) locating and registering the image relative to a sensor or stored prototype, or reducing the amount of image data by discarding irrelevant data, and (c) measuring significant characteristics of the image. In particular, the technology disclosed relates to segmenting clusters into subpopulations and base calling clusters in a particular subpopulation.
The following are incorporated by reference for all purposes as if fully set forth herein:
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Various protocols in biological or chemical research involve performing a large number of controlled reactions on local support surfaces or within predefined reaction chambers. The desired reactions may then be observed or detected, and subsequent analysis may help identify or reveal properties of chemicals involved in the reaction. For example, in some multiplex assays, an unknown analyte having an identifiable label (e.g., fluorescent label) may be exposed to thousands of known probes under controlled conditions. Each known probe may be deposited into a corresponding well of a microplate. Observing any chemical reactions that occur between the known probes and the unknown analyte within the wells may help identify or reveal properties of the analyte. Other examples of such protocols include known DNA sequencing processes, such as sequencing-by-synthesis or cyclic-array sequencing. In cyclic-array sequencing, a dense array of DNA features (e.g., template nucleic acids) are sequenced through iterative cycles of enzymatic manipulation. After each cycle, an image may be captured and subsequently analyzed with other images to determine a sequence of the DNA features.
As a more specific example, one known DNA sequencing system uses a pyrosequencing process and includes a chip having a fused fiber-optic faceplate with millions of wells. A single capture bead having clonally amplified sstDNA from a genome of interest is deposited into each well. After the capture beads are deposited into the wells, nucleotides are sequentially added to the wells by flowing a solution containing a specific nucleotide along the faceplate. The environment within the wells is such that if a nucleotide flowing through a particular well complements the DNA strand on the corresponding capture bead, the nucleotide is added to the DNA strand. A colony of DNA strands is called a cluster, and a cluster can include many (thousands of) nucleotides. Incorporation of the nucleotide into the cluster initiates a process that ultimately generates a fluorescent light signal. The system includes a CCD camera that is positioned directly adjacent to the faceplate and is configured to detect the light signals from the DNA clusters in the wells. Subsequent analysis of the images taken throughout the pyrosequencing process can determine a sequence of the genome of interest. Based on different fluorescent light signals of nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T), the particular nucleotide incorporated into the DNA strand of the cluster can be identified. This identification process is also known as “base calling.”
One challenge with the analysis of image data is variation in intensity profiles of clusters being base called. This may cause a drop in data throughput and an increase in error rates of base calling during a sequencing run. There are many potential reasons for intensity profile variation. It may result from the chemistry modulation effects where the intensity profiles of clusters at a current sequencing cycle can be shifted based on their base context. It may result from differences in cluster brightness, caused by fragment length distribution in the cluster population. It may result from phasing, which occurs when a molecule in a cluster does not incorporate a nucleotide in some sequencing cycles and lags behind other molecules, or when a molecule incorporates more than one nucleotide in a single sequencing cycle. It may result from fading, i.e., exponential decay in signal intensity of clusters as a function of sequencing cycle number due to excessive washing and laser exposure as the sequencing run progresses. It may result from underdeveloped cluster colonies, i.e., small cluster sizes that produce empty or partially filled wells on a patterned flow cell. It may result from overlapping cluster colonies caused by unexclusive amplification. It may result from under illumination or uneven illumination, for example, due to clusters being located on the edges of a flow cell. It may result from impurities on a flow cell that obfuscate emitted signal. It may result from polyclonal clusters, i.e., when multiple clusters are deposited in the same well.
Base calling accuracy is crucial for high-throughput DNA sequencing and downstream analysis such as read mapping and genome assembly. Accordingly, an opportunity arises to correct the intensity variations of clusters. Improved base calling throughput and reduced base calling error rate during a sequencing run may result.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The color drawings also may be available in PAIR via the Supplemental Content tab. In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which,
The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The discussion is organized as follows. First, we introduce base calling clusters and variations in intensity profiles of the clusters caused by base context. Then we propose the technology disclosed for a base calling pipeline that processes base calls of an already based called sequence and is iteratively trained to generate predicted k-mer-specific centroids. Each of the k-mer-specific centroids represents a mean value of the intensities of clusters with the same k-mer context. In particular, the base calling pipeline includes a context-dependent signal modulation model that subdivides base calls of already base called sequences into k-mer-specific time series, transforms these time series and merges them into predicted per-sequencing cycle intensity values represented by k-mer-specific centroids. After that, we setup examples of using k-mer-specific centroids to base call target clusters. Advancing further, we provide various performance results of context-dependent base calling and improvement over context-independent base calling approaches.
The technology disclosed begins with the concept of clusters, intensity extraction and base calling clusters. In one implementation, a sequencer uses sequencing by synthesis (SBS) technology for generating sequencing images. SBS relies on growing nascent strands complementary to cluster strands with fluorescently-labeled nucleotides, while tracking the emitted signal of each newly added nucleotide. The fluorescently-labeled nucleotides have a 3′ removable block that anchors a fluorophore signal of the nucleotide type. SBS occurs in repetitive sequencing cycles, each comprising three steps: (a) extension of a nascent strand by adding the fluorescently-labeled nucleotide; (b) excitation of the fluorophore using one or more lasers of an optical system of the sequencer and imaging through different filters of the optical system, yielding sequencing images; and (c) cleavage of the fluorophore and removal of the 3′ block in preparation for the next sequencing cycle. Incorporation and imaging are repeated up to a designated number of sequencing cycles, defining the read length, which refers to the number of base pairs (bp) sequenced from a DNA fragment. Using this approach, each sequencing cycle interrogates a new position along the cluster strands.
Intensity values can be extracted from different color/intensity channel sequencing images generated by a sequencer at each sequencing cycle during a sequencing run. Examples of the sequencer include Illumina's iSeq, HiSeqX, HiSeq 3000, HiSeq 4000, HiSeq 2500, NovaSeq 6000, NextSeq 550, NextSeq 1000, NextSeq 2000, NextSeqDx, MiSeq, and MiSeqDx.
The tremendous power of Illumina's sequencers stems from their ability to simultaneously execute and sense millions or even billions of analytes (e.g., clusters). A cluster comprises approximately one thousand identical copies of a template strand, though clusters vary in size and shape. Clusters are grown from the template strand, prior to the sequencing run, by bridge amplification or exclusion amplification of the input library which is a collection of similarly sized DNA fragments. The purpose of the amplification and cluster growth is to increase the intensity of the emitted signal since the imaging device cannot reliably sense the fluorophore signal of a single strand. In some embodiments, the imaging device perceives a cluster of thousands of template strands as a single spot. For instance, the imaging device can detect such a cluster of thousands of template strands as a spot represented by a single pixel or multiple pixels.
The sequencing process occurs in a flow cell—a small glass slide that holds the input DNA fragments during the sequencing process. The flow cell is connected to the high-throughput optical system that includes microscopic imaging, excitation lasers, and fluorescence filters. In some cases, the flow cell consists of (or includes) a complementary metal-oxide-semiconductor (CMOS). An imaging device (e.g., a solid-state imager such as a charge-coupled device (CCD) or a CMOS sensor) in the sequencer takes images at multiple locations along a series of non-overlapping regions called tiles. At each sequencing cycle, the imaging device takes sequencing images of each tile at each color/intensity channel. The sequence data of clusters immobilized on each tile at each sequencing cycle, therefore, includes intensity signals extracted from the sequencing images.
In the illustrated embodiment, the flow cell 102 includes sidewalls 138, 125, and a flow cover 136 that is supported by the sidewalls 138, 125. The sidewalls 138, 125 are coupled to the sample surface 134 and extend between the flow cover 136 and the sidewalls 138, 125. In some embodiments, the sidewalls 138, 125 are formed from a curable adhesive layer that bonds the flow cover 136 to the sampling device 104.
The sidewalls 138, 125 are sized and shaped so that a flow channel 144 exists between the flow cover 136 and the sampling device 104. The flow cover 136 may include a material that is transparent to excitation light 101 propagating from an exterior of the biosensor 100 into the flow channel 144. In an example, the excitation light 101 approaches the flow cover 136 at a non-orthogonal (or orthogonal) angle.
Also shown, the flow cover 136 may include inlet and outlet ports 142, 146 that are configured to fluidically engage other ports (not shown). For example, the other ports may be from the cartridge or the workstation. The flow channel 144 is sized and shaped to direct a fluid along the sample surface 134. A height H1 and other dimensions of the flow channel 144 may be configured to maintain a substantially even flow of a fluid along the sample surface 134. The dimensions of the flow channel 144 may also be configured to control bubble formation.
By way of example, the flow cover 136 (or the flow cell 102) may comprise a transparent material, such as glass or plastic. The flow cover 136 may constitute a substantially rectangular block having a planar exterior surface and a planar inner surface that defines the flow channel 144. The block may be mounted onto the sidewalls 138, 125. Alternatively, the flow cell 102 may be etched to define the flow cover 136 and the sidewalls 138, 125. For example, a recess may be etched into the transparent material. When the etched material is mounted to the sampling device 104, the recess may become the flow channel 144.
The sampling device 104 may be similar to, for example, an integrated circuit comprising a plurality of stacked substrate layers 120-126. The substrate layers 120-126 may include a base substrate 120, a solid-state imager 122 (e.g., CMOS image sensor), a filter or light-management layer 124, and a passivation layer 126. It should be noted that the above is only illustrative and that other embodiments may include fewer or additional layers. Moreover, each of the substrate layers 120-126 may include a plurality of sub-layers. The sampling device 104 may be manufactured using processes that are similar to those used in manufacturing integrated circuits, such as CMOS image sensors and CCDs. For example, the substrate layers 120-126 or portions thereof may be grown, deposited, etched, and the like to form the sampling device 104.
The passivation layer 126 is configured to shield the filter layer 124 from the fluidic environment of the flow channel 144. In some cases, the passivation layer 126 is also configured to provide a solid surface (i.e., the sample surface 134) that permits biomolecules or other analytes-of-interest to be immobilized thereon. For example, each of the reaction sites may include a cluster of biomolecules that are immobilized to the sample surface 134. Thus, the passivation layer 126 may be formed from a material that permits the reaction sites to be immobilized thereto. The passivation layer 126 may also comprise a material that is at least transparent to a desired fluorescent light. By way of example, the passivation layer 126 may include silicon nitride (Si2N4) and/or silica (SiO2). However, other suitable material(s) may be used. In the illustrated embodiment, the passivation layer 126 may be substantially planar. However, in alternative embodiments, the passivation layer 126 may include recesses, such as pits, wells, grooves, and the like. In the illustrated embodiment, the passivation layer 126 has a thickness that is about 150-200 nm and, more particularly, about 170 nm.
The filter layer 124 may include various features that affect the transmission of light. In some embodiments, the filter layer 124 can perform multiple functions. For instance, the filter layer 124 may be configured to (a) filter unwanted light signals, such as light signals from an excitation light source; (b) direct emission signals from the reaction sites toward corresponding sensors 106, 108, 110, 112, and 114 that are configured to detect the emission signals from the reaction sites; or (c) block or prevent detection of unwanted emission signals from adjacent reaction sites. As such, the filter layer 124 may also be referred to as a light-management layer. In the illustrated embodiment, the filter layer 124 has a thickness that is about 1-5 μm and, more particularly, about 2-4 μm. In alternative embodiments, the filter layer 124 may include an array of microlenses or other optical components. Each of the microlenses may be configured to direct emission signals from an associated reaction site to a sensor.
In some embodiments, the solid-state imager 122 and the base substrate 120 may be provided together as a previously constructed solid-state imaging device (e.g., CMOS chip). For example, the base substrate 120 may be a wafer of silicon and the solid-state imager 122 may be mounted thereon. The solid-state imager 122 includes a layer of semiconductor material (e.g., silicon) and the sensors 106, 108, 110, 112, and 114. In the illustrated embodiment, the sensors are photodiodes configured to detect light. In other embodiments, the sensors comprise light detectors. The solid-state imager 122 may be manufactured as a single chip through CMOS-based fabrication processes.
The solid-state imager 122 may include a dense array of sensors 106, 108, 110, 112, and 114 that are configured to detect activity indicative of a desired reaction from within or along the flow channel 144. In some embodiments, each sensor has a pixel area (or detection area) that is about 1-2 square micrometer (μm2). The array can include 500,000 sensors, 5 million sensors, 10 million sensors, or even 200 million sensors. The sensors 106, 108, 110, 112, and 114 can be configured to detect a predetermined wavelength of light that is indicative of the desired reactions.
In some embodiments, the sampling device 104 includes a microcircuit arrangement, such as the microcircuit arrangement described in U.S. Pat. No. 7,595,882, which is incorporated herein by reference in the entirety. More specifically, the sampling device 104 may comprise an integrated circuit having a planar array of the sensors 106, 108, 110, 112, and 114. Circuitry formed within the sampling device 104 may be configured for at least one of signal amplification, digitization, storage, and processing. The circuitry may collect and analyze the detected fluorescent light and generate pixel signals (or detection signals) for communicating detection data to a signal processor. The circuitry may also perform additional analog and/or digital signal processing in the sampling device 104. Sampling device 104 may include conductive vias 130 that perform signal routing (e.g., transmit the pixel signals to the signal processor). The pixel signals may also be transmitted through electrical contacts 132 of the sampling device 104.
The sampling device 104 is discussed in further detail with respect to U.S. Nonprovisional patent application Ser. No. 16/874,599, titled “Systems and Devices for Characterization and Performance Analysis of Pixel-Based Sequencing,” filed May 14, 2020, which is incorporated by reference as if fully set forth herein. The sampling device 104 is not limited to the above constructions or uses as described above. In alternative embodiments, the sampling device 104 may take other forms. For example, the sampling device 104 may comprise a CCD device, such as a CCD camera, that is coupled to a flow cell or is moved to interface with a flow cell having reaction sites therein.
The clusters and the tiles are discussed in further detail with respect to U.S. Nonprovisional patent application Ser. No. 16/825,987, titled “Training Data Generation For Artificial Intelligence-based Sequencing,” filed 20 Mar. 2020.
Base calling can be performed by fitting a mathematical model to the intensity profiles of clusters to be called. In particular, a mixture of four intensity distributions can be fitted to the intensity values of a target cluster to be called at a given sequencing cycle and determines the likelihoods of the intensity profiles of the target cluster belonging to each of the four intensity distributions.
In some implementations, the mixture of intensity distribution is a Gaussian mixture model. A Gaussian mixture model comprises multiple Gaussians, each identified by k ∈{1, . . . , K}, where K is the number of clusters (i.e., groups of data points). For example, the Gaussian mixture model can include four intensity distributions, corresponding to four nucleotide bases A, G, C and T. Each Gaussian k in the mixture includes the following parameters:
When a target cluster is to be base called at a current sequencing cycle, an expectation maximization algorithm can be used to fit the mixture of intensity distributions to the intensity profiles of the target cluster during the current sequencing cycle. When the mixture of intensity distributions is a Gaussian mixture model, for example, the expectation maximization algorithm iteratively maximizes the likelihood of observing means u (centroids) and covariances Σ (dimensions of the ellipsoid) that best fit the intensity profiles for the target cluster to be base called. For each of the four intensity distributions corresponding to one of the four bases A, C, T, and G, a centroid and covariances of the distribution are calculated. The intensity distribution (or centroid of the intensity distribution) with a maximum likelihood to which the target cluster belongs is determined as the base call for the target cluster.
For a base to be called at a given sequencing cycle, its corresponding base context varies. Base context refers to prior and/or succeeding bases that are identified at prior and/or succeeding sequencing cycles, respectively. Analysis has revealed that the intensity profiles of clusters at a current sequencing cycle can be shifted based on their base context identified at prior and succeeding sequencing cycles, also known as chemistry modulation effects or fully functional nucleotide (FFN) triphosphate modulation effects. In some embodiments, chemistry modulation effects result from differential incorporation of two (or more) FFN species for a given base. When prior base context includes one or more base A, the shift in the intensity distribution can be substantial. As illustrated in
Quenching effect is another effect by which base context causes variations in the intensity profiles of clusters. In the sequencing-by-synthesis (SBS) process, nucleotides incorporated into the template sequences contain fluorophores that specifically identify the types of the bases, and attached to the nucleotides is a cleavable linker. After the incorporated base is identified, the linker is cleaved, allowing the fluorophore to be removed and ready for the next base to be attached and identified. Nevertheless, the cleavage can leave a remaining “pendant arm” moiety located on each of the detected nucleotides, which impacts the intensity profiles of the following nucleotides incorporated into the template sequences. For example, the remaining “pendant arm” after the cleavage of the fluorophores attached to base G quenches/reduces/suppresses the intensity values of a subsequent fluorophore when the next nucleotide is incorporated. The quenching effect can be substantial when base calling dimer GA. The fluorophores attached to base A can be significantly quenched by the “pendant arm” of the fluorophores attached to prior base G. In a two-channel base calling system, the intensity values of base A at both color/intensity channels can be reduced, increasing the risk of miscalls. The intensity profiles of other bases (e.g., C, G and T) can be similarly impacted by the “pendant arm” of the fluorophores attached to base G (or some other nucleotide base). In some cases, however, a preceding G can lead to a high average intensity in certain FFN sets, while an A directly preceding an A can lead to relatively low intensity values.
The technology disclosed provides approaches to context-dependent base calling, by taking into consideration the variations in the intensity profiles of clusters caused by their base context. We introduce a base calling system including memory storing context-specific centroids and runtime logic configured to use the context-specific centroids to base call a target cluster. Each of the context-specific centroids represents a mean value μ of the intensity distribution of clusters with the same base context.
Base context can be represented by k-mers (k≥1). The k-mers can be 4∧k permutations of k base positions, where 4 corresponds to four bases A, G, C and T. Therefore, context-specific centroid can be k-mer-specific centroids, including 4∧k k-mer-specific centroids. Each k-mer-specific centroid can represent the mean value μ of the intensity distributions of clusters with the same k-mer context.
The context-specific centroids can be learned by iteratively training a base calling pipeline using base calls of known (i.e., already base called) sequences as training samples. In one or more embodiments, the base calling pipeline can process the base calls of already base called sequences in k-mer-specific time series, each of the k-mer-specific time series representing presence or absence of a particular k-mer at each sequencing cycle in a plurality of sequencing cycles across which the base calls are generated. The base calling pipeline can transform the k-mer-specific time series into predicted k-mer-specific centroids and merge these predicted k-mer-specific centroids on a sequencing cycle-by-sequencing cycle basis to generate predicted per-sequencing cycle intensity values represented by the predicted k-mer-specific centroids. By comparing the predicted per-sequencing cycle intensity values against known intensity values of the base calls, the base calling pipeline can determine a training loss (e.g., a transformation loss) and based on which, update the predicted k-mer-specific centroids accordingly to generate updated k-mer-specific centroids. These updated k-mer-specific centroids can be stored in the memory as k-mer-specific centroids.
The base calling system can use context-specific centroids to base call a target cluster. The base calling system can access current intensity data of the target cluster captured at a current sequencing cycle of a sequencing run, as well as context intensity data of the target cluster for at least one of a preceding sequencing cycle or a succeeding sequencing cycle. The context intensity data is used to identify the base context of the target cluster. For instance, the base calling system can determine base context from prior and/or succeeding base calls based on base calls made during previous cycles (e.g., prior base calls) and/or preliminary base calls made for future cycles (e.g., succeeding base calls). The base calling system can access k-mer-specific centroids stored in the memory and select context-specific centroids that correspond to the base context of the target cluster. By comparing the current intensity data of the target cluster with the selected context-specific centroids, the base calling system can base call the cluster. For example, the context-specific centroid of the intensity distribution with a maximum likelihood to which the target cluster belongs can be determined as the base call for the target cluster. Alternatively, one of the selected context-specific centroids that is closest to the current intensity data of the target cluster can be determined as the base call.
When a target cluster is to be called at the current sequencing cycle N, the base calling system can identify the corresponding base context determined from prior sequencing cycles. For example, the prior two bases that are called at prior sequencing cycles N−2 and N−1 can be G and A, respectively. The base calling system can select four intensity distributions, each with an optimized trimer-specific centroid, corresponding to the base context of GA_ (e.g., trimers GAA, GAG, GAC, and GAT). The base calling system can base call the target cluster at the current sequencing cycle by comparing the intensity profile of the cluster with the four centroids. In one or more embodiments, the base calling system calculates a Euclidean distance between each of the four trimer-specific centroids and the intensity profile of the target cluster at the respective color/intensity channel. The centroid of the intensity distribution with a shortest Euclidean distance to the target cluster is determined as the base call. Alternatively, the base calling system can determine the likelihoods of the intensity profiles of the target cluster belonging to each of the four intensity distributions. The centroid with a maximum likelihood to which the target cluster belongs is determined as the base call for the target cluster.
The k-mers can include a current base to be called at a current sequencing cycle and prior bases identified at prior sequencing cycles. When the base context is represented by dimers including a current base to be called and an immediately prior base (i.e., k=2), for example, there are sixteen (4∧2) permutations of two base positions, namely, AA, AG, AC, AT, CA, CG, CC, CT, GA, GG, GC, GT, TA, TG, TC and TT. Accordingly, there are sixteen (4∧2) dimer-specific centroids that can be learned by iteratively training the base calling pipeline. Consider a target cluster that is to be called at a given sequencing cycle with an immediately prior base A. To base call the target cluster, the base calling system can identify the immediately prior base A and select four dimer-specific centroids of the intensity distributions corresponding to dimer context AA, AG, AC and AT, respectively. The base calling system can compare the intensity profile of the target cluster at the current sequencing cycle with the four centroids and call the base for the target cluster.
As illustrated in
Alternatively, k-mers can include a current base to be called at a current sequencing cycle, prior bases identified at prior sequencing cycles and succeeding bases identified succeeding sequencing cycles. For example, base context can be represented by trimers including the current base to be called, one prior base and one succeeding base (i.e., k=3). Thus, there are sixty-four (4∧3) permutations of 3 base positions. Consider a target cluster that has been base called for three successive sequencing cycles, namely, cycles N−1, N, and N+1. The base calling system can determine a preliminary base call for the target cluster at each of the three successive sequencing cycles based on the corresponding intensity profiles. When the target cluster has a preliminary base A called at sequencing cycle N−1 and a preliminary base T called at sequencing cycle N+1, the base calling system can identify the base context A_T and compare the intensity profiles of the target cluster at the current sequencing cycle N with the trimer-specific centroids of the intensity distributions corresponding to base context AAT, AGT, ACT and ATT, respectively. In the foregoing trimer examples AAT, AGT, ACT and ATT, the base call in the middle of each trimer represents the base call for cycle N.
We now turn to the advantages of context-dependent base calling. In the base calling domain, sequencing characteristics can show significant diversity in various categories, including sequencing platforms, sequencing instruments, sequencing protocols, sequencing chemistries, sequencing reagents, cluster densities and so on. The disclosed base calling pipeline can be trained on large-scale training samples with diverse sequencing characteristics that adequately model the real-world sequencing runs.
More importantly, the disclosed base calling pipeline models context-dependent effects by iteratively learning context-specific centroids using large-scale known sequences as training samples. The optimized context-specific centroids accurately reflect the intensities of clusters having the same context but diverse sequencing characteristics. In other words, instead of generating a mixture of four intensity distribution with varying shapes and dimensions, the base calling pipeline can granulize them into groups of context-dependent distributions. It reduces the adverse impact of the intensity variations caused by e.g., chemistry modulation effects, FFN modulation effects, quenching effects and therefore, reduces the error rate of base calling.
The modeling of context-dependent effect can be trained offline to determine optimized context-specific centroids and thus, significantly saves computation power. Compared to base calling algorithms that rely on iteratively fitting both the centroids and covariances of the intensity distributions to the intensity profiles for a target cluster, the base call system disclosed herein may only need to compare four context-specific centroids with the intensity profiles of the target cluster for base calling at each sequencing cycle. Because each centroid is optimized to represent mean values of the intensity distributions of clusters with the same context, the corresponding intensity distribution can be considered substantially uniform (circular instead of elliptical). Therefore, the context-dependent base calling disclosed herein can improve the efficiency of base calling while maintaining the low error rate.
In one or more embodiments, the base calling pipeline is a context-dependent signal modulation (CDSM) model that corrects for the context-dependent effect. In some embodiments, the CDSM model functions by processing data from previously determined base calls for certain sequences. As described above, to “base call” a cluster at a given sequencing cycle refers to processing the intensity profiles of the cluster by fitting a mixture of intensity distributions to the intensity profiles and determines the base incorporated into the template nucleotide as one of the four bases A, G, C and T. After such initial base calling, the CDSM model takes as input base calls of known sequences and generates k-mer-specific centroids as predicted mean values of intensity distributions of clusters with k-mer-specific base context.
In one or more embodiments, the base calls as input to the CDSM model 600 are discrete base call that are encoded as binary permutations across two color/intensity channels.
A skilled person would appreciate that the binarized intensities as illustrated in
The k-mer-specific time series can be transformed using k-mer-specific transforms, such as channel mixing matrices and/or k-mer-specific phasing correction (e.g., using convolutional kernels). In one or more embodiments, the CDSM model uses k-mer-specific matrices to transform k-mer-specific time series and generate transformed time series that represent predicted k-mer-specific centroids. Each of the 4∧k time series has a corresponding k-mer-specific 2×2 matrix and after transformation, generates a corresponding predicted k-mer-specific centroid. A binarized k-mer-specific identifier can be used as a lookup index to identify the corresponding k-mer-specific matrix in order to perform the transformation. For a base X with intensity profiles captured at a given sequencing cycle and a particular k-mer-specific context, the CDSM model can transform the k-mer-specific time series by multiplying the binarized intensities of base X at the given sequencing cycle with the corresponding k-mer-specific matrix. In the alternative to a 2×2 linear transform matrix, in some cases, a 1 or c value can be added to the intensity vector to generate a 3×3 affine transform matrix. A 1 or c value can be added depending on whether an inverse or the forward transform is used. For example, [x,y] can be fed to a 2×2 matrix for a linear transform. For an affine transform, a vector [x,y, 1] or [x,y,c] is multiplied to a 3×3 matrix. The value c represents a learnable parameter through back propagation.
Consider as an example tetramer context of ATGC (where G represents the base at a current sequencing cycle). The k-mer-specific 2×2 matrix M corresponding to tetramer context ATGC is identified using a unique integer identifier computed from the k-mer itself as a lookup index. For a context length of 4, as in this example, there are 4∧4=256 possible k-mers and as a result 256 possible unique k-mer identifiers that can be used to lookup the corresponding transform matrices. The binarized intensities of base G at the current sequencing cycle is in a vector form b. Accordingly, the transformed time series with adjusted intensities i in a vector form are calculated using the following:
All of the coefficients within matrix M that map binarized intensities b with adjusted intensities i are learnable through backpropagation, such that the gradient update can be applied to the entries of matrix M. When k-mer-specific 2×2 matrix M is used, the CDSM model can perform linear transformation. The CDSM model can also perform non-linear transformations. For example, the CDSM model can use k-mer-specific 3×3 matrix and perform affine transformation to generate predicted k-mer-specific centroids.
In some embodiments, the CDSM model directly learns adjusted intensities i using gradient descents. Instead of separately transforming each binarized k-mer-specific time series using a corresponding transformation matrix, the k-mer-specific centroids are learnable through backpropagation. Indeed, in some embodiments, the CDSM model treats the transformed centroids (e.g., the transformed intensities i) as learnable parameters, which can shortcut some (or all) of the computation of the transform coefficients (in matrix M) and the application through multiplying by M. For example, for each of the k-mer-specific time series, the respective binarized intensities are encoded with a dimension of k×2, where k represents the number of bases in each k-mer and 2 represents the two color/intensity channels. The CDSM model processes the binarized intensities of k-mer-specific time series as input through e.g., convolutional kernels, and generates predicted k-mer-specific centroids. In one or more embodiments, for each of the k-mer-specific time series, the respective discrete base calls are one-hot encoded with a dimension of k×4, where k represents the number of bases in each k-mer and 4 represents the four bases A, G, C and T. The CDSM model processes the one-hot encoded base calls of k-mer-specific time series as input through e.g., convolutional kernels, and generates predicted k-mer-specific centroids. The coefficients of the convolutional kernels can be optimized through backpropagation. The use of learnable k-mer-specific centroids without corresponding transformation matrices can significantly save computation power and accelerate the optimization process of k-mer-specific centroids.
When X is base A, the binarized intensities of trimer-specific time series AAA, ACA, AGA, ATA, CAA, CCA, CGA, CTA, GAA, GCA, GGA, GTA, TAA, TCA, TGA, TTA are captured from both the first and second color/intensity channels. For example, the highlighted (e.g., outlined in boxes) time series represent the time series corresponding to GCA, GAA, TAA, CAA and AAA, respectively, shown in either
It should also be noted that some of the trimers may not appear in a given sequence. Accordingly, their corresponding time series have minimal binarized intensities.
The transformed k-mer-specific time series can further be corrected for context-based phasing. In the ideal situation of sequencing-by-synthesis (SBS) process, the lengths of all nascent strands within an analyte would be the same. Imperfections in the cyclic reversible termination (CRT) chemistry create stochastic failures that result in nascent strand length heterogeneity. In other words, the readout of the sequence copies of an analyte loses synchrony. One example is the phasing effect where an oligonucleotide in a cluster does not incorporate a nucleotide in some of the sequencing cycles and therefore, lags behind other oligonucleotides. To correct for the phasing effect, the CDSM model can apply k-mer-specific phasing coefficients to the k-mer-specific time series and generate corrected k-mer-specific time series. K-mer-specific phasing coefficients are k-mer-dependent instead of cluster-dependent. Each of the k-mer-specific time series has a corresponding k-mer-specific coefficient for phasing correction and thus, there are 4∧k permutations of k-mer-specific phasing coefficients.
In accordance with
In one or more embodiments in accordance with
Next, we turn to more details of the training process of the CDSM model in which the transformations parameters and context-dependent phasing coefficients are optimized.
The goal of training the CDSM model is to optimize the parameters for transformations and context-dependent phasing coefficients. The model gradually combines simpler features into complex features so that the most suitable hierarchical representations can be learned from training data. Given a training dataset, the forward pass sequentially computes the output and propagates the function signals forward through the model. In the final output layer, an objective loss function measures error between the inferenced outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to backpropagate error signals and compute gradients with respect to all parameters throughout the model. Finally, the parameters are updated using optimization algorithms based on stochastic gradient descent. Whereas batch gradient descent performs parameter updates for each complete dataset, stochastic gradient descent provides stochastic approximations by performing the updates for each small set of data examples. Several optimization algorithms stem from stochastic gradient descent. For example, the Adagrad, Adam and Levenberg-Marquardt training algorithms perform stochastic gradient descent while adaptively modifying learning rates based on update frequency and moments of the gradients for each parameter, respectively.
In one or more embodiments, the CDSM model is trained using a plurality of already base called sequences. The number of already base called sequences as training samples can be 10-50, 50-200, 200-500, 500-1000, 1000-2000 and so on. For example, the training samples can include 512 or 1024 sequences. The base calls of these training samples as well as the corresponding intensity profiles at each sequencing cycle can be used as ground truth.
During the training process, the CDSM model receives base calls of training samples as input and subdivides them into 4∧k k-mer-specific time series. Each of the time series represents presence or absence of a particular k-mer at each sequencing cycle in a plurality of sequencing cycles across which the base calls are generated. The CDSM model transforms the k-mer-specific time series into transformed time series with adjusted binarized intensities representing predicted k-mer-specific centroids. In one or more embodiments, the transformation is performed through matrices or convolution kernels. Each of the k-mer-specific time series can have a corresponding matrix with transformation parameters that can be optimized during the training. As illustrated, the CDSM model can have a first set of transformation parameters, a second set of transformation parameters, . . . , 4∧k set of transformation parameters, each set representing the parameters of a particular k-mer-specific matrix. When 2×2 matrices are used for transformation, there are 4∧k of k-mer-specific 2×2 matrices with 4∧(k+1) learnable transformation parameters. In one or more embodiments, the k-mer-specific matrices are initialized with identity matrices, which model individual-sequence-specific behavior (or individual-k-mer-specific behavior) of k-mers. Accordingly, there are 4∧(k+2) parameters including initial parameters in the identity matrices and 4∧(k+1) learnable parameters.
The CDSM model can apply learnable k-mer-specific phasing coefficients to transformed k-mer-specific time series and generate corrected k-mer-specific time series. As illustrated, the CDSM model can have a first set of phasing parameters, a second set of phasing parameters, . . . , 4∧k set of phasing parameters, each set corresponding to a particular k-mer. These parameters can be adjusted during the training process by comparing a loss between the ground truth and the actual output. In some embodiments, the CDSM model uses a single phasing/prephasing coefficient set for all training samples.
The corrected transformed k-mer-specific time series can be merged via e.g., a sum operator into a merged time series on a sequencing cycle-by-sequencing cycle basis. The merged time series represent predicted per-sequencing cycle intensity values. The CDSM model can compare the predicted per-sequencing cycle intensity values to the ground truth intensity profiles of the training samples and determine a transformation loss based on the comparison. To update the model parameters (e.g., parameters for transformations and context-dependent phasing coefficients) with gradient descent, the use of transformation loss is to minimize the difference between the predicted per-sequencing cycle intensity values and the ground truth intensity profiles. During the backpropagation through computation graph, the gradients can flow backward through the merge step, and all of the upstreaming parameters can be updated.
An example of how gradients flow backwards through a sum operator is as follows. During backpropagation the backward pass computes the gradients with respect to the inputs of each node in the computational graph. The sum operation takes the gradients on its outputs and broadcasts it equally to all of its inputs, regardless of what the input values were during the forward pass. It follows from the fact that the local gradient for the sum operation is simply +1.0. As a result of applying the chain rule, the gradients on all inputs should be equal to the gradients on the output multiplied by 1.0 and thus, remain unchanged.
In one or more embodiments, the CDSM model iteratively fits the base calls. This process can start from a batch of sequences as training samples. For each sequence in the batch, initial respective parameters for intensity corrections (e.g., scale correction, background correction, laser ramp correction) can be estimated. The CDSM model processes discrete base calls of the batch of sequences and generates predicted k-mer-specific centroids. Via backpropagation, the CDSM model iteratively updates the parameters of the model. This iterative process can repeat e.g., 2000 times and during which, the base calls can be updated as well. For example, every thirty steps/cycles, the CDSM model, with newly updated parameters, can be inverted. That is, the CDSM model performs the base calling process by using the predicted k-mer-specific centroid and generate a finer fit for base calls. Based on the newly updated parameters and k-mer-specific centroid, the base calling system can update initial base calls to be more accurate.
In some embodiments, the base calling system uses an Adam algorithm to perform stochastic gradient descent for updating the CDSM model. The following pseudo code represents an example Adam algorithm for updating the CDSM model:
In particular, an exponential moving average of the gradient (dw) and the square of the gradient for each parameter (delta1 and delta2 used as ema parameters) are stored. An average normalized gradient vector can be created, and the stochastic gradient descent can be applied in the following form:
Unbiasing terms are used to debias the exp moving average at the beginning of training, these terms have no effect after a few steps (when t>>1). “e” is a small number used for numerical stability. “full( )” is used to map gradients from float 16 to float 32 for numerical stability.
With the completion of the training process, the transformation parameters and context-dependent phasing coefficients are optimized. They can be locked and thus, are no longer learnable. The predicted k-mer-specific centroids are optimized to accurately represent mean values of the intensity distributions of cluster with the same k-mer context and used for base calling unknown sequences.
The discussion now turns to particular implementations of context-dependent base calling, performed by the base calling pipeline disclosed herein. In one or more embodiments, the base calling system accesses current intensity data for a target cluster to be called at a current sequencing cycle of a sequencing run and context intensity data for the target cluster at preceding and/or succeeding sequencing cycles. The base context of the target cluster can be identified based on having base called the bases in previous cycles and having made preliminary base calls for future cycles. The base calling system further accesses a plurality of k-mer-specific centroids stored in the memory and determines respective k-mer-specific centroids that correspond to the base context of the target cluster at the current sequencing cycle. By comparing the respective k-mer-specific centroids with the current intensity data, the base calling system determines the base call of the target cluster.
Using maximum likelihood sequence estimation (MLSE), the base calling system identifies or determines a base call without inverting phasing. Phasing correction gathers the signal into a single cycle and enables an easier way to make base calls by comparing intensities on a per-cycle basis (e.g., after phasing correction the base calling system only looks at the intensities from cycle n to determine a base call). In some cases, the drawback of phasing correction is that it amplifies noise, and thus, after phasing correction, the intensities from clusters detected by the base calling system might exhibit relatively higher variation and thereby cause a base call error. Once a base call error is made, this error is propagated to neighboring cycles through the decision feedback loop of base calling. Such an error-propagated decision feedback loop can result in incorrect context and the wrong centroid to generate a base call for the following cycles. In some cases, a wrong base call might throw off (or otherwise adversely reconfigure) the RTA channel estimation algorithm and provoke more errors.
To overcome these issues using MLSE, the base calling system identifies a sequence of k bases that can explain the shape of the signal without performing phasing correction. Thus, the chance for decision feedback errors is reduced. The base call decision is made by matching the signal along more than 1 cycles of intensities. In some cases, the base calling system uses a multi-k-mer approach (or a brute-force approach 0 by determining signals based on all possible 3-mers or 5-mers and selecting the k-mer that causes the signal to be closest to the observed signal. The base calling system further runs these candidate sequence calls through the CDSM model (in the forward direction) and compares 3 or 5 cycles worth of intensities. (As indicated above, when feeding candidate sequence calls through the CDSM model, the base calling system also applies sequence dependent effects and a forward version of phasing.) In some embodiments, the base calling system applies such a multi-k-mer approach followed by candidate sequence calls at every cycle, while in other embodiments the base calling system applies more sophisticated algorithms based on the fact that once the problem is solved for a 5-mer, shifting one cycle to the right will result in redundant computations. This every cycle multi-k-mer approach is akin to a tree search algorithm where a system executes all possible branches of the tree each corresponding to a different sequence. In some embodiments, the base calling system uses an algorithm based on dynamic programming (e.g., a Viterbi algorithm, which is the core of the MLSE algorithm). In certain cases, the base calling system uses shortcut techniques with hardware acceleration to precompute all the sequence permutations and parallelize the matching to the data using parallel computations.
As illustrated in both figures, the CDSM model is trained to take as input encoded base calls 1312/1342 of already base called sequences and performs context-dependent signal modulation 1314/1344. The CDSM model iteratively learns k-mer-specific centroids 1316/1346, which can be stored in memory for base calling.
When a target cluster immobilized in a flow cell is to be base called, the sequencing platform generates raw intensities 1336/1366 of the target cluster, referring to the raw signals captured by the sequencing platform. In some embodiments, the raw intensities 1336/1366 are further corrected to generate corrected intensities, e.g., fully corrected intensities 1320 as illustrated in
In particular, the phasing/prephasing correction 1322/1352 is to address loss of synchrony in the readout of the sequence copies of an analyte loses synchrony caused by phasing and prephasing. Phasing is caused by incomplete removal of 3′ terminators and fluorophores as well as sequences in the analyte missing an incorporation cycle. Prephasing is caused by the incorporation of nucleotides without effective 3′-blocking. Incomplete extension due to phasing results in lagging strands (e.g., t−1 from the current cycle). Addition of multiple nucleotides or probes in a population of identical strands due to prephasing results in leading strands (e.g., t+1 from the current cycle). Phasing and prephasing effects are nonstationary distortions and thus the proportion of sequences in each analyte that is affected by phasing and prephasing increases with cycle number, which hampers correct base identification and limiting the length of useful sequence reads.
The decay correction 1326/1356 is to address the signal decay, for example, fading of the intensities of the fluorophores that are incorporated into the template sequences during the sequencing-by-synthesis process. As sequencing proceeds, accurate base calling becomes increasingly difficult, because signal strength decreases and noise increases, resulting in a substantially decreased signal-to-noise ratio. It has been observed that later synthesis steps attach tags in a different position relative to the sensor than earlier synthesis steps. When the sensor is below a sequence that is being synthesized, signal decay results from attaching tags to strands further away from the sensor in later sequencing steps than in earlier steps. This causes signal decay with progression of sequencing cycles.
The background corrections 1330/1360 and 1324/1354 are to address background variation. Background intensity of a particular sensor is relatively steady between cycles, but varies across the sensors. Positioning of the illumination source, which can vary by illumination color, creates a spatial pattern of background variation over a field of the sensors. It has been found that manufacturing differences among the sensors were observed to produce different background intensity readouts, even between adjoining sensors. In a first approximation, idiosyncratic variation among sensors can be ignored. In a refinement, the idiosyncratic variation in background intensity among sensors can be taken into account. Background intensity can be a constant parameter to be fit, either overall or per pixel. Alternatively, different background intensities are taken into account and corrected accordingly.
The scale correction 1328/1358 is to address the variations in the intensities of clusters. When clusters are immobilized on the surface of the flow, their size and shape may vary. A larger-sized cluster includes more template oligonucleotides than a small-sized cluster and thus, may show higher intensity values when more fluorophores are incorporated into the oligonucleotides. The scale correction 1328/1358 can account for the difference in the scale of the intensities of clusters.
In some embodiments, at least one of the camera gain correction 1332/1362, background correction 1330/1360 and 1324/1354, scale correction 1328/1358, decay correction 1326/1356 and phasing/rephrasing correction 1322/1352 can be iteratively learned by training the base calling system. Each of the correction processes can involve learnable and cluster-dependent parameters, that is, each cluster or a batch of clusters can have a particular set of learnable parameters used to correct for inter-cluster intensity variations. During the training process of the base calling system where these parameters are iteratively optimized, the transformation parameters and context-dependent phasing parameters in the CDSM models can be locked. In other words, the base calling system does not learn the chemistry effects caused by base context but leverages the optimized transformation parameters and context-dependent phasing parameters.
It should also be noted that the orders of signal corrections as illustrated in
As further illustrated in
In some embodiments, the current intensity data of the target cluster at a current sequencing cycle N is processed using inverse matrices for base calling. In particular, the current intensity data (i.e., fully corrected intensities 1320) can be expressed as 1×2 array fci(c). In certain embodiments, “fci” refers to fully corrected intensities 1320 and/or to application of the per cluster corrections learned in the CDSM model (e.g., scale, offset, decay, and camera gain). In some cases, the base calling system can base call at different stages in the CDSM model by carrying the intensities from the instrument (e.g., the transformed signal) output backwards through the CDSM model inverse and provide the output to a given stage. The base calling system can then iterate over the 4 possible base calls given the context and carry this forward to the same model stage to find the base call that produces the least difference with the transformed signal coming from the instrument.
The target cluster has two prior base calls identified at prior sequencing cycles N−2 and N−1. Given the particular base context, the base calling system selects respective matrices Sk that correspond to the base context.
For each of the respective matrices Sk, the base calling system computes the binarized base calls bc(c) by multiplying the inverse matrix Sk with the current intensity data as follows:
Next, the base calling system calculates a normalized difference x(c) between the binarized base calls and rounded binarized base calls as follows:
The binarized base call bc(c) that produces the lowest value of x(c) is determined as the base call for the target cluster.
For a k-mer that has a base to be called at sequencing cycle N, when N<k, the base context (e.g., number of prior base calls) of the target cluster can be insufficient to determine which of the four k-mer-specific centroids 1316 should be selected to compare with the fully corrected intensities 1320 of the target cluster. The base calling system can compare each of the k-mer-specific centroids 1316 with the current intensity data of the target cluster and determine which centroid fits the best.
Consider an example of base calling the target cluster at sequencing cycle 1 (N=1). No prior base context of the target cluster can be determined because this is the first cycle. The k-mer-specific centroids 1316 include sixty-four trimer-specific centroids (k=3). The base calling pipeline can compare each of the sixty-four centroids with the fully corrected intensities 1320 of the target cluster for base calling.
At sequencing cycle 2 (N=2), now the target cluster has a base context including a known prior base call (e.g., base A) identified at sequencing cycle 1. Therefore, the base calling pipeline does not need to compare all of the sixty-four trimer-specific centroids with the current intensity data of the target cluster. Instead, sixteen trimer-specific centroids with base A as the first base AGG, AGT, AGC, AGA, ACG, ACT, ACC, ACA, AAG, AAT, AAC, AAA, ATG, ATT, ATC and ATA can be selected. These trimer-specific centroids are used to call the target cluster at sequencing number 2.
When N≥k, the target cluster has a base context including more prior base calls identified at sequencing cycles N−1, N−2, . . . , 1. Only four k-mer-specific centroids are needed to compare to the intensity profiles of the cluster to be called. For example, when k=3 and N≥3, the base context can include two known prior base calls that are identified at sequencing cycles N−2 and N−1. The base calling pipeline can compare four of the trimer-specific centroids with the same base context with the current intensity data for base calling the target.
As illustrated in
of The transformation parameters in these tetramer-specific matrices can be optimized during iterative training of the base calling pipeline via backpropagation. The color bars in
The identification of the tetramers with transformation matrices significantly deviated from identity matrices sheds light on the study of the intensity variations caused by chemistry modulation effect.
In one implementation, at least one of the base calling system, base calling pipeline or Context-Dependent Signal Modulation (CDSM) model is communicably linked to the storage subsystem 2210 and the user interface input devices 2238.
User interface input devices 2238 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 2200.
User interface output devices 2276 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 2200 to the user or to another machine or computer system.
Storage subsystem 2210 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 2278.
Processors 2278 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 2278 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of processors 2278 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX15 Rackmount Series™, NVIDIA DGX-1™, Microsoft′ Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Testa V100s™, and others.
Memory subsystem 2222 used in the storage subsystem 2210 can include a number of memories including a main random access memory (RAM) 2232 for storage of instructions and data during program execution and a read only memory (ROM) 2234 in which fixed instructions are stored. A file storage subsystem 2236 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of some implementations can be stored by file storage subsystem 2236 in the storage subsystem 2210, or in other machines accessible by the processor.
Bus subsystem 2255 provides a mechanism for letting the various components and subsystems of computer system 2200 communicate with each other as intended. Although bus subsystem 2255 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
Computer system 2200 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2200 depicted in
Each of the processors or modules discussed herein may include an algorithm (e.g., instructions stored on a tangible and/or non-transitory computer readable storage medium) or sub-algorithms to perform particular processes. The base calling pipeline can be implemented utilizing any combination of dedicated hardware boards, DSPs, processors, etc. Alternatively, the base calling pipeline implemented utilizing an off-the-shelf PC with a single processor or multiple processors, with the functional operations distributed between the processors. As a further option, the modules described below may be implemented utilizing a hybrid configuration in which some modular functions are performed utilizing dedicated hardware, while the remaining modular functions are performed utilizing an off-the-shelf PC and the like. The modules also may be implemented as software modules within a processing unit.
Various processes and steps of the methods set forth herein can be carried out using a computer. The computer can include a processor that is part of a detection device, networked with a detection device used to obtain the data that is processed by the computer or separate from the detection device. In some implementations, information (e.g., image data) may be transmitted between components of a system disclosed herein directly or via a computer network. A local area network (LAN) or wide area network (WAN) may be a corporate computing network, including access to the Internet, to which computers and computing devices comprising the system are connected. In one implementation, the LAN conforms to the transmission control protocol/internet protocol (TCP/IP) industry standard. In some instances, the information (e.g., image data) is input to a system disclosed herein via an input device (e.g., disk drive, compact disk player, USB port etc.). In some instances, the information is received by loading the information, e.g., from a storage device such as a disk or flash drive.
A processor that is used to run an algorithm or other process set forth herein may comprise a microprocessor. The microprocessor may be any conventional general purpose single- or multi-chip microprocessor such as a Pentium™ processor made by Intel Corporation. A particularly useful computer can utilize an Intel Ivybridge dual-12 core processor, LSI raid controller, having 128 GB of RAM, and 2 TB solid state disk drive. In addition, the processor may comprise any conventional special purpose processor such as a digital signal processor or a graphics processor. The processor typically has conventional address lines, conventional data lines, and one or more conventional control lines.
The implementations disclosed herein may be implemented as a method, apparatus, system or article of manufacture using standard programming or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” as used herein refers to code or logic implemented in hardware or computer readable media such as optical storage devices, and volatile or non-volatile memory devices. Such hardware may include, but is not limited to, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), complex programmable logic devices (CPLDs), programmable logic arrays (PLAs), microprocessors, or other similar processing devices. One or more implementations of the technology disclosed, or elements thereof can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations of the technology disclosed, or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
As used herein, the term “sequenced data” refer to intensity data (e.g., intensity values) and non-intensity data. In some implementations, the segmentation and conditional base calling are performed on non-intensity data, such as on pH changes induced by the release of hydrogen ions during molecule extension. The pH changes are detected and converted to a voltage change that is proportional to the number of bases incorporated (e.g., in the case of Ion Torrent). Therefore, the sequence data disclosed herein includes voltage signals. In other implementations, the non-intensity data is constructed from nanopore sensing that uses biosensors to measure the disruption in current as an analyte passes through a nanopore or near its aperture while determining the identity of the base. For example, the Oxford Nanopore Technologies (ONT) sequencing is based on the following concept: pass a single strand of DNA (or RNA) through a membrane via a nanopore and apply a voltage difference across the membrane. The nucleotides present in the pore will affect the pore's electrical resistance, so current measurements over time can indicate the sequence of DNA bases passing through the pore. This electrical current signal (the ‘squiggle’ due to its appearance when plotted) is the raw data gathered by an ONT sequencer. These measurements are stored as 16-bit integer data acquisition (DAC) values, taken at e.g., 4 kHz frequency. With a DNA strand velocity of ˜450 base pairs per second, this gives approximately nine raw observations per base on average. This signal is then processed to identify breaks in the open pore signal corresponding to individual reads. These stretches of raw signal are base called—the process of converting DAC values into a sequence of DNA bases. In some implementations, the non-intensity data comprises normalized or scaled DAC values. Therefore, the sequence data disclosed herein can include current signals.
As used herein, the terms “polynucleotide” or “nucleic acids” refer to deoxyribonucleic acid (DNA), but where appropriate the skilled artisan will recognize that the systems and devices herein can also be utilized with ribonucleic acid (RNA). The terms should be understood to include, as equivalents, analogs of either DNA or RNA made from nucleotide analogs. The terms as used herein also encompasses cDNA, that is complementary, or copy, DNA produced from an RNA template, for example by the action of reverse transcriptase.
The single stranded polynucleotide molecules sequenced by the systems and devices herein can have originated in single-stranded form, as DNA or RNA or have originated in double-stranded DNA (dsDNA) form (e.g., genomic DNA fragments, PCR and amplification products and the like). Thus, a single stranded polynucleotide may be the sense or antisense strand of a polynucleotide duplex. Methods of preparation of single stranded polynucleotide molecules suitable for use in the method of the disclosure using standard techniques are well known in the art. The precise sequence of the primary polynucleotide molecules is generally not material to the disclosure, and may be known or unknown. The single stranded polynucleotide molecules can represent genomic DNA molecules (e.g., human genomic DNA) including both intron and exon sequences (coding sequence), as well as non-coding regulatory sequences such as promoter and enhancer sequences.
In some implementations, the nucleic acid to be sequenced through use of the current disclosure is immobilized upon a substrate (e.g., a substrate within a flow cell or one or more beads upon a substrate such as a flow cell, etc.). The term “immobilized” as used herein is intended to encompass direct or indirect, covalent or non-covalent attachment, unless indicated otherwise, either explicitly or by context. In some implementations covalent attachment may be preferred, but generally all that is required is that the molecules (e.g., nucleic acids) remain immobilized or attached to the support under conditions in which it is intended to use the support, for example in applications requiring nucleic acid sequencing.
As indicated above, the present disclosure comprises novel systems and devices for sequencing nucleic acids. As will be apparent to those of skill in the art, references herein to a particular nucleic acid sequence may, depending on the context, also refer to nucleic acid molecules which comprise such nucleic acid sequence. Sequencing of a target fragment means that a read of the chronological order of bases is established. The bases that are read do not need to be contiguous, although this is preferred, nor does every base on the entire fragment have to be sequenced during the sequencing. Sequencing can be carried out using any suitable sequencing technique, wherein nucleotides or oligonucleotides are added successively to a free 3′ hydroxyl group, resulting in synthesis of a polynucleotide chain in the 5′ to 3′ direction. The nature of the nucleotide added is preferably determined after each nucleotide addition. Sequencing techniques using sequencing by ligation, wherein not every contiguous base is sequenced, and techniques such as massively parallel signature sequencing (MPSS) where bases are removed from, rather than added to, the strands on the surface are also amenable to use with the systems and devices of the disclosure.
As described herein, the term “SBS” refers to sequencing-by-synthesis. In SBS, four fluorescently labeled modified nucleotides are used to sequence dense clusters of amplified DNA (possibly millions of clusters) present on the surface of a substrate (e.g., a flow cell). Various additional aspects regarding SBS procedures and methods, which can be utilized with the systems and devices herein, are disclosed in, for example, WO04018497, WO04018493 and U.S. Pat. No. 7,057,026 (nucleotides), WO05024010 and WO06120433 (polymerases), WO05065814 (surface attachment techniques), and WO 9844151, WO06064199 and WO07010251, the contents of each of which are incorporated herein by reference in their entirety.
As used herein, an element or step recited in the singular and proceeded with the word “a” or “an” should be understood as not excluding plural of said elements or steps, unless such exclusion is explicitly stated. Furthermore, references to “one implementation” are not intended to be interpreted as excluding the existence of additional implementations that also incorporate the recited features. Moreover, unless explicitly stated to the contrary, implementations “comprising” or “having” or “including” an element or a plurality of elements having a particular property may include additional elements whether or not they have that property.
In particular implementations, the reaction includes the incorporation of a fluorescently-labeled molecule to an analyte. The analyte may be an oligonucleotide and the fluorescently-labeled molecule may be a nucleotide. The desired reaction may be detected when an excitation light is directed toward the oligonucleotide having the labeled nucleotide, and the fluorophore emits a detectable fluorescent signal. In alternative implementations, the detected fluorescence is a result of chemiluminescence or bioluminescence. A desired reaction may also increase fluorescence (or Förster) resonance energy transfer (FRET), for example, by bringing a donor fluorophore in proximity to an acceptor fluorophore, decrease FRET by separating donor and acceptor fluorophores, increase fluorescence by separating a quencher from a fluorophore or decrease fluorescence by co-locating a quencher and fluorophore.
In some implementations, sensors (e.g., light detectors, photodiodes) are associated with corresponding pixel areas of a sample surface of a biosensor. As such, a pixel area is a geometrical construct that represents an area on the biosensor's sample surface for one sensor (or pixel). A sensor that is associated with a pixel area detects light emissions gathered from the associated pixel area when a desired reaction has occurred at a reaction site or a reaction chamber overlying the associated pixel area. In a flat surface implementation, the pixel areas can overlap. In some cases, a plurality of sensors may be associated with a single reaction site or a single reaction chamber. In other cases, a single sensor may be associated with a group of reaction sites or a group of reaction chambers.
As used herein, a “biosensor” includes a structure having a plurality of reaction sites and/or reaction chambers (or wells). A biosensor may include a solid-state imaging device (e.g., CCD or CMOS imager) and, optionally, a flow cell mounted thereto. The flow cell may include at least one flow channel that is in fluid communication with the reaction sites and/or the reaction chambers. As one specific example, the biosensor is configured to fluidically and electrically couple to a bioassay system. The bioassay system may deliver reactants to the reaction sites and/or the reaction chambers according to a predetermined protocol (e.g., sequencing-by-synthesis) and perform a plurality of imaging events. For example, the bioassay system may direct solutions to flow along the reaction sites and/or the reaction chambers. At least one of the solutions may include four types of nucleotides having the same or different fluorescent labels. The nucleotides may bind to corresponding oligonucleotides located at the reaction sites and/or the reaction chambers. The bioassay system may then illuminate the reaction sites and/or the reaction chambers using an excitation light source (e.g., solid-state light sources, such as light-emitting diodes or LEDs). The excitation light may have a predetermined wavelength or wavelengths, including a range of wavelengths. The excited fluorescent labels provide emission signals that may be captured by the sensors.
In alternative implementations, the biosensor may include electrodes or other types of sensors configured to detect other identifiable properties. For example, the sensors may be configured to detect a change in ion concentration. In another example, the sensors may be configured to detect the ion current flow across a membrane.
As used herein, a “cluster” is a colony of similar or identical molecules or nucleotide sequences or DNA strands. For example, a cluster can be an amplified oligonucleotide or any other group of a polynucleotide or polypeptide with a same or similar sequence. In other implementations, a cluster can be any element or group of elements that occupy a physical area on a sample surface. In implementations, clusters are immobilized to a reaction site and/or a reaction chamber during a base calling cycle.
As used herein, “base calling” identifies a nucleotide base in a nucleic acid sequence. Base calling refers to the process of determining a base call (A, C, G, T) for every cluster at a specific cycle. As an example, base calling can be performed utilizing four-channel, two-channel or one-channel methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232. In particular implementations, a base calling cycle is referred to as a “sampling event.” In one dye and two-channel sequencing protocol, a sampling event comprises two illumination stages in time sequence, such that a pixel signal is generated at each stage. The first illumination stage induces illumination from a given cluster indicating nucleotide bases A and T in a AT pixel signal, and the second illumination stage induces illumination from a given cluster indicating nucleotide bases C and T in a CT pixel signal.
In some implementations, a computer-implemented method set forth herein can occur in real time while multiple images of an object are being obtained. Such real time analysis is particularly useful for nucleic acid sequencing applications wherein an array of nucleic acids is subjected to repeated cycles of fluidic and detection steps. Analysis of the sequencing data can often be computationally intensive such that it can be beneficial to perform the methods set forth herein in real time or in the background while other data acquisition or analysis algorithms are in process. Example real time analysis methods that can be used with the present methods are those used for the MiSeq, HiSeq, and NovaSeq sequencing devices commercially available from Illumina, Inc. (San Diego, Calif.) and/or described in US Pat. App. Pub. No. 2012/0020537 A1, which is incorporated herein by reference.
One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
The detailed description of some implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., processors or memories) may be implemented in a single piece of hardware (e.g., a general purpose signal processor or random access memory, hard disk, or the like). Similarly, the programs may be standalone programs, may be incorporated as subroutines in an operating system, may be functions in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.
While the present invention is disclosed by reference to the preferred implementations and examples detailed above, it is to be understood that these examples are intended in an illustrative rather than in a limiting sense. It is contemplated that modifications and combinations will readily occur to those skilled in the art, which modifications and combinations will be within the spirit of the invention and the scope of the following claims.
The present application claims the benefit of, and priority to, U.S. Provisional Application 63/476,428, entitled “CONTEXT-DEPENDENT BASE CALLING,” filed Dec. 21, 2022. The aforementioned application is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
63476428 | Dec 2022 | US |