Provided herein is a system that includes a joint event segmentation and basecalling module configured to accept raw current signals sequenced from a DNA strand, wherein the raw current signals span one or more events each representing a single base movement in an underlying DNA base sequence. The joint event segmentation and basecalling module combines a run-length tracking hidden Markov model (HMM) for event detection and a de Bruijn HMM for basecalling in a single joint HMM, wherein the run-length HMM tracks duration of a current event in the DNA base sequence and the de Bruijn HMM tracks a local k-mer of the current event in the DNA base sequence via a de Bruijn graph. The joint event segmentation and basecalling module then utilizes the single joint HMM to process the raw current signals to track both the local k-mer and the duration of the current event in the DNA base sequence simultaneously via dynamic programming.
These and other features and advantages will be apparent from a reading of the following detailed description.
Before various embodiments are described in greater detail, it should be understood that the embodiments are not limiting, as elements in such embodiments may vary. It should likewise be understood that a particular embodiment described and/or illustrated herein has elements which may be readily separated from the particular embodiment and optionally combined with any of several other embodiments or substituted for elements in any of several other embodiments described herein.
It should also be understood that the terminology used herein is for the purpose of describing the certain concepts, and the terminology is not intended to be limiting. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood in the art to which the embodiments pertain.
Unless indicated otherwise, ordinal numbers (e.g., first, second, third, etc.) are used to distinguish or identify different elements or steps in a group of elements or steps, and do not supply a serial or numerical limitation on the elements or steps of the embodiments thereof. For example, “first,” “second,” and “third” elements or steps need not necessarily appear in that order, and the embodiments thereof need not necessarily be limited to three elements or steps. It should also be understood that the singular forms of “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Some portions of the detailed descriptions that follow are presented in terms of procedures, methods, flows, logic blocks, processing, and other symbolic representations of operations performed on a computing device or a server. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present application, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of operations or steps or instructions leading to a desired result. The operations or steps are those utilizing physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system or computing device or a processor. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as transactions, bits, values, elements, symbols, characters, samples, pixels, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present disclosure, discussions utilizing terms such as “storing,” “determining,” “sending,” “receiving,” “generating,” “creating,” “fetching,” “transmitting,” “facilitating,” “providing,” “forming,” “detecting,” “decrypting,” “encrypting,” “processing,” “updating,” “instantiating,” or the like, refer to actions and processes of a computer system or similar electronic computing device or processor. The computer system or similar electronic computing device manipulates and transforms data represented as physical (electronic) quantities within the computer system memories, registers or other such information storage, transmission or display devices.
It is appreciated that present systems and methods can be implemented in a variety of architectures and configurations. For example, present systems and methods can be implemented as part of a distributed computing environment, a cloud computing environment, a client server environment, hard drive, etc. Embodiments described herein may be discussed in the general context of computer-executable instructions residing on some form of computer-readable storage medium, such as program modules, executed by one or more computers, computing devices, or other devices. By way of example, and not limitation, computer-readable storage media may comprise computer storage media and communication media. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular data types. The functionality of the program modules may be combined or distributed as desired in various embodiments.
Computer storage media/drive can include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media can include, but is not limited to, random access memory (RAM), read only memory (ROM), electrically erasable programmable ROM (EEPROM), flash memory, or other memory technology, compact disk ROM (CD-ROM), digital versatile disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed to retrieve that information.
DNA sequencing has undergone several phases of innovation starting from the classic sequencing methods of Sanger dideoxy (terminator base) method and Maxam-Gilbert (chemical cleavage) method. The Sanger dideoxy method works by replicating a strand but stopping at a random instance of the specific base, while the Maxam-Gilbert method works by selectively cleaving the DNA strand at instances of a specific chosen base (A, C, G or T). Both methods are similar in that they generate short DNA fragments (of random lengths) terminating at an instance of the selected base. A technique called polyacrylamide gel electrophoresis is used to measure the lengths of the resulting fragments, which determines the positions of the specific base in the original DNA strand. The process is done separately for each of the four bases to get the complete DNA sequence. A limitation of these methods is that they only work well for sequences that are 100 bases long. Beyond this limit the uncertainty in the position of each base becomes unacceptable. As a result, longer sequences have to be broken down, sequenced individually, and stitched together like pieces of a jig saw puzzle using genome assembly algorithms.
The second generation of DNA sequencing saw the rise of massively parallel sequencers that were still limited to processing short fragments. The idea of “single molecule sequencing” is to avoid fragmentation of the DNA and try to sequence the entire single stranded DNA (ssDNA) molecule in a single pass. Currently, single-molecule sequencers from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) can sequence molecules over 10 kilobases long. PacBio's single-molecule real-time sequencer (SMRT) uses an optical waveguide system to sequence DNA, while ONT uses a nanopore device with a sensor to measure electrical currents.
Nanopore DNA sequencing is a single molecule sequencing technology that promises low-cost high-throughput DNA sequencing for medical applications as well as DNA based data storage. A nanopore DNA sequencing device contains a tiny pore through which a negatively charged single strand of DNA from the sample is made to pass through a nanopore channel under an external electric field by a process called electrophoresis. The width of the nanopore channel is in the order of, e.g., 2 nm, just wide enough for a single stranded DNA molecule (ssDNA) to pass through. The raw signal measured by the device sensor is a sampled transverse ionic or tunneling current between two electrodes, wherein the current depends on the base sequence in the DNA strand. Each base produces a characteristic current response. By measuring the changes in the measured current with the passing of each base in the DNA sequence through the pore, the method may detect the bases (nucleotides) in the ssDNA sequence.
The main computational problem associated with nanopore DNA sequencing is using the raw current signal to recover the underlying DNA base sequence, known as the basecalling problem. However, it is often simpler to first identify events boundaries (event detection problem) via segmentation and post process the identified events to recover the base sequence via basecalling. During such two-step approach, the sampled current signal is first segmented into events, wherein each event corresponds to individual bases transiting through the nanopore channel. Next, the event information is fed to a basecaller to detect the base sequence in the DNA strand. The basecaller is typically implemented using a hidden Markov model (HMM), which is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobservable (i.e. hidden) states, or neural-networks.
The two-step process of event segmentation followed by post-segmentation basecalling is not necessarily optimal because the event segmentation step does not use the interdependency in the signal levels caused by inter-symbol interference (ISI). Furthermore, such two-step process typically needs to over-segment the raw signal to keep the basecaller complexity under control and it may not be able to handle cases where the raw data is under-segmented beyond the HMM capability.
A new approach is proposed that contemplates systems and methods to support a new architecture to perform joint event segmentation and basecalling from a sampled current signal from a DNA sequencer via a hidden Markov model (HMM). Here, the HMM is based on a dynamic programming algorithm that tracks both the duration of the current event (event run-length) and the local k-mer pattern in the DNA strand/base sequence. Compared to the two-step process discussed above, the proposed approach utilizes a joint HMM to operate directly on the raw signal samples sensed by single bio-molecule sequencers in general, including nanopore DNA sequencing. Since the proposed approach is derived from first principles/terms using Bayesian estimation, it is provably optimal for the class of models considered. In addition, the joint event segmentation and basecaller is aware of the underlying structure (base sequence) that causes the signal levels to be interdependent. The proposed approach provides an efficient way to compute the branch metrics for the HMM with O(1) complexity per state and the overall complexity of the HMM is O(4MK) where M is the memory size of the ISI in the base response and K is the maximum event duration tracked by the HMM.
Referring now to
In the example of
In the example of
In some embodiments, the run-length tracking HMM 106 is configured to encode the run-length or duration of the current event, rather than signal level of the current event, into the HMM state. In some embodiments, the run-length tracking HMM 106 is configured to estimate the event levels on-the-fly at each HMM state based on the samples in the current event at that state. Since the signal level is not encoded at all, encoding the run-length or duration of the current event works accurately even for large or even continuous level sets without level quantization.
In some embodiments, the de Bruijn HMM 108 used for basecalling is based on the de Bruijn graph, which is a graph whose states are k-mer substrings and had edges (U, V) if the length-(k−1) suffix of state U equals the length-(k−1) prefix of state V. For post-segmentation basecalling, the de Bruijn HMM 108 is configured to minimize under-segmentation where two or more true events are detected as a single merged event even if one or more individual events may be detected as multiple smaller events as a result.
In some embodiments, the joint event segmentation and basecalling module 104 is configured to adopt a simple model for the raw current signals for a typical nanopore DNA sequence wherein each raw current signal is modeled as a (noisy) piecewise-constant over each event where each event represents the movement of a single base through the nanopore channel. In the discussions below, the following definitions and notation are adopted. Let B={A, C, G, T} denote the DNA base (nucleotide) alphabet and b={bk∈B} a base sequence in the DNA strand being read by the sequencer. Let d={dk} and λ={λk} be shorthand for the sequence of durations and ideal (noise-free) levels of all events that make up the observation sequence. Let Ek denote the temporal support of the k-th event:
Since the event duration dk may be very weakly correlated with from one event to the next, in some embodiments, the event durations dk can be modeled as independent and identically distributed (IID) and independent of the level variable kk with known a priori probability distribution P (dk).
In some embodiments, the joint event segmentation and basecalling module 104 is configured to adopt a signal model for noisy raw samples in the k-th event as
s
n=λk+wn (2)
where wn˜(0,σk2) for n∈Ek is additive IID Gaussian noise with zero mean variance σk2 and kk is the level of the k-th event that may depend on the local k-mer pattern. In some embodiments, the joint event segmentation and basecalling module 104 is configured to adopt a slightly more complex noise model by introducing correlations between noise samples with an event:
w(Ek)˜(0;Σk) (3)
where w(Ek)={wn: n∈Ek} is shorthand for the vector of noise samples in the k-th event, and Σk is a dk×dk covariance matrix. One way to introduce correlations is by using a common noise source for each event. Specifically, we model wn as a sum of two IID noise components:
w
n
=v
n
+u
k
;n∈E
k
with vn˜(0,σk2) and uk˜(0,τk2) (indexed by k rather than n so that it affects all noise samples in the k-th event). Here, vn represents the sensor measurement noise with a bandwidth high enough to be modeled as white noise. And uk represents a slowly varying noise that includes unexplained noise sources such as interference from other bio-molecules or residual ISI not modeled by the signal. This results in a dk×dk covariance matrix with the following two-parameter form:
where e=(1, 1, . . . , 1)T is the vector with all elements being ones. This model generalizes the simpler IID noise model when we set τk2=0. All these noise parameters can also be modeled as look-up tables Φσ[⋅]. and Φτ[⋅] applied to the local (2L+1)-mer base pattern:
σk2=Φσ[bk−Lk+L],τk2=Φτ[bk−Lk+L]
Although our primary focus above was a signal model for nanopore DNA sequencing, the general models for other bio-molecule sequencers are similar. All of our following methods are broadly applicable to these sequencing technologies as well.
Starting from the signal model (2) and the noise model (3), the joint event segmentation and basecalling module 104 is configured to calculate log-probability of the observed raw signal samples as
wherein b={bk} and d={dk} denote the base sequence and event durations, respectively and s(Ek) denote the raw signal samples in event Ek defined in (1). All of λk,σk and τk2 are all functions of the local M-mer pattern bk−Lk+L with M=2L+1. By assumption, the durations {dk} are independent and identically distributed (IID) with a known priori probability distribution function (PDF) P(dk). Therefore,
If there is no prior information about the base sequence, the joint event segmentation and basecalling module 104 is configured to use Bayesian estimation to find the durations d and base sequence b as follows:
Taking the logarithm of the above equation and applying (4) and (5), ({circumflex over (d)}, {circumflex over (b)}) can be calculated as:
If there is prior information on the base sequence in the form of a Markov model P(bk+L|bn−Lk+L−1), the joint event segmentation and basecalling module 104 is configured to incorporate a log-probability of this prior information into the above expression. Using (2), (3) and the standard expression for a multi-dimensional Gaussian PDF, the first term in the above summation can be calculated as:
wherein detΣk can be calculated as
Using (9), the first term in (7) can be reduced to the following simple form:
while the second term in (7), being independent of the observations, can be precomputed using (8) and stored in a look-up table for efficiency by the joint event detection and basecalling module 104.
In some embodiments, the joint event segmentation and basecalling module 104 is configured to solve the equation (6) using dynamic programming (DP). The idea of dynamic programming is to reduce the main problem to a simple post-processing step in terms of some or all smaller sub-problems. These sub-problems then are solved sequentially going from smallest to largest in size. Let Vk=bk−L+1k+L denote the (M−1)-mer pattern referred to as the “k-mer state” at time k. The state sequence V={Vk} uniquely describes the base sequence b={bk} and vice versa. Furthermore, the k-mer bk−Lk+L can be compactly represented by the pair (Vk−1; Vk), which also represents an edge in the de Bruijn graph g whose states are (M−1)-mers. As such, the Bayesian estimation problem (6) can be calculated as a maximization problem/function using an additive scoring model, wherein the goal is to maximize a sum of individual event scores as follows:
with the score for the k-th event defined as
S(Vk−1,Vk,Ek))log P(s(Ek)|dk,bk−Lk+L+log P(dk) (12)
For dynamic programming, let F [n, V], for n≤N and each ending at state V. Denote the best score for sub-problem with partial observation sequence S1n and let En,mi={j:n−m+1≤j≤n} denote the event of length m ending at index n. Then, starting with F [0, V]=0, F [n, V] can be updated recursively as follows:
where p(V)={U:(U,V) is an edge in } is the set of parent states of V.
In the maximization F[n, V] above, m represents the duration of the last event and U is the ending state for the penultimate event for the sub-problem of size n, wherein U must be such that (U, V) is an edge in the de Bruijn graph . In some embodiments, the joint event segmentation and basecalling module 104 is configured to calculate the maximization function sequentially over n for each V∈. For each n and V, the joint event segmentation and basecalling module 104 is configured to store the optimal values form and U (trace-back information) in memory. At the end, the joint event segmentation and basecalling module 104 is configured to trace back to recover the event durations and sequence of k-mer states (and hence base movements).
In some embodiments, the joint event segmentation and basecalling module 104 is configured to reformulate the dynamic programming algorithm as running the Viterbi algorithm on the following joint HMM 110 that combines the run-length HMM 106 and the de Bruijn HMM 108 to track both the local k-mer state and the event duration (run-length). In some embodiments, the HMM states are labeled by a pair (U, m) where U is a k-mer and m is the duration of the current event. The state (U, m) has an outgoing edge to (U, m+1) to represent the extension of the current event. It also has edges to states (V, 1) to represent the transition to a new event for all V such that (U, V) is an edge in the de Bruijn graph . In some embodiments, where m is large, the joint event segmentation and basecalling module 104 is configured to limit the HMM complexity by picking a sufficiently large parameter K and merging all states with m≥K into a single state labeled m=K For a slightly technical reason we also need to include the k-mer V along with U in the terminal states with m=K. This will be clear shortly when we describe the branch metrics below. Such HMM implementation with a limit on the maximum run-length effectively achieves linear complexity and real-time segmentation and basecalling. In summary, the main HMM states are labeled (U, m) with U∈ and 1≤m≤K−1. Additional HMM states are of the form (U, V, K) for all edges (U, V) in , wherein the following is a complete list of all the edges in the HMM 110:
1. (U,m) has edges to (U,m+1) for 1≤m≤K−2 and (V,1)
2. (U,K−1) has edges to (U,V,K) and (V,1)
3. (U,V,K) has edges to (U,V,K) and (V,1)
where U and V are any k-mers in such that (U, V) is an edge in .
In some embodiments, the joint event segmentation and basecalling module 104 is configured to translate the DP into the language and form of HMMs as follows. Specifically, the branch metrics in the HMI 110, expressed in terms of the scoring function (12) discussed above, are defined for event extension edges and are set to zero:
βn[(U,m)→(U,m+1)]=0 for 1≤m≤K−2;
βn[(U;K−1)→(U,V,K)]=0
because a simple approach is adopted in the way the signal samples are processed. In some embodiments, the (accumulated) metric of involving these signal samples are computed by means of the event scoring function, only when the joint event segmentation and basecalling module 104 commits to ending the current event, i.e., on event transition edges. The branch metric for these edges are
βn[(U,m)→(V,1)]=−S(U,V,En−1,m) for 1≤m≤K−1;
βn[(U,V,K)→(V,1)]=−S(U,V,En−1,K)
Finally, the branch metric for the self-loop at state (U,V,K),
βn[(U,V,K)→(U,V,K)]=−S(U,V,En,K+1)+S(U,V,En,K),
is carefully chosen to guarantee the correct path metric if the event duration were exactly K+1. As such, the HMM 110 computes path metrics correctly for all event durations up to length K+1. Beyond that it is an approximation, but generally a good one, if K is sufficiently large. In some cases K only needs to be long enough such that P(dk) has a geometric tail distribution for d≥K to get correct results. One such example is when the noise has a simple IID noise model, i.e. when τk2=0. Note that the self-loop metric depends on both U and V: this is the reason the terminal states is labeled to include both U and V. Additionally, it can be shown that, ignoring the finiteness of K, that the score F[n, V] in the DP equals negative of the state metric at state (V,1) at time n+1. In some embodiments, the Viterbi algorithm is used to compute the state sequence with the least accumulated path metric in the HMM.
Since the base alphabet has 4 elements, the number of (M−1)-mers is 4M−1. A simple counting argument shows that there are 4M−1(K−1)+4M−1(K+3)=states in the HMM 110. Similarly, there are 4M (K+1) edges with nonzero branch metrics. The bulk of the computation needed for the branch metric βn[(U,m)→(V,1)] is the expression (10) as it involves many signal samples. However, that computation can be done efficiently in terms of cumulative sums of (sn−m−λk) and (sn−m−λk)2 over m from 1 to K. This results in a complexity of only O(1) per edge. The overall complexity of the HMM 110 per time sample is thus O(4MK).
While the embodiments have been described and/or illustrated by means of particular examples, and while these embodiments and/or examples have been described in considerable detail, it is not the intention of the Applicants to restrict or in any way limit the scope of the embodiments to such detail. Additional adaptations and/or modifications of the embodiments may readily appear, and, in its broader aspects, the embodiments may encompass these adaptations and/or modifications. Accordingly, departures may be made from the foregoing embodiments and/or examples without departing from the scope of the concepts described herein. The implementations described above and other implementations are within the scope of the following claims.