In certain embodiments, a method may comprise receiving a training signal generated by molecular detection, segmenting the training signal into events, determining signal characteristics for individual ones of the events, and fitting a Hidden Markov Model (HMI) based on the events and the signal characteristics. The method may further comprise receiving a second signal generated by molecular detection, applying the HMI to the second signal and, in response, segmenting the second signal into the events and labeling the events based on the signal characteristics of the individual ones of the events.
In certain embodiments, an apparatus may comprise a circuit configured to receive a training signal generated by molecular detection. The circuit may segment the training signal into events, determine signal characteristics for individual events, and fit a Hidden Markov Model (HMM) based on the events and the signal characteristics. The circuit may receive a second signal generated by molecular detection. The circuit may apply the HMI to the second signal and responsively segment the second signal into the events and label the events based on the signal characteristics of the individual events.
In certain embodiments, a memory device may store instructions that, when executed by a processor, cause the processor to perform a method comprising receiving a training signal generated by molecular detection, segmenting the training signal into events, determining signal characteristics for individual events, and fitting a Hidden Markov Model (HMI) based on the events and the signal characteristics. The method performed by the processor may further comprise receiving a second signal generated by molecular detection, applying the HMM to the second signal and responsively segmenting the second signal into the events and labeling the events based on the signal characteristics of the individual events.
In the following detailed description of certain embodiments, reference is made to the accompanying drawings which form a part hereof, and in which are shown by way of illustration of example embodiments. It is also to be understood that features of the embodiments and examples herein can be combined, exchanged, or removed, other embodiments may be utilized or created, and structural changes may be made without departing from the scope of the present disclosure.
In accordance with various embodiments, the methods and functions described herein may be implemented as one or more software programs running on a computer processor or controller. Dedicated hardware implementations including, but not limited to, Application Specific Integrated Circuits (ASICs), programmable logic arrays, and other hardware devices can likewise be constructed to implement the methods and functions described herein. Methods and functions may be performed by modules, which may include one or more physical components of a computing device (e.g., logic, circuits, processors, sensors, etc.) configured to perform a particular task or job, or may include instructions that, when executed, can cause a processor to perform a particular task or job, or may include any combination thereof. Further, the methods described herein may be implemented as a computer readable storage medium or memory device including instructions that, when executed, cause a processor to perform the methods.
DNA sequencing can be accomplished by various methods. Two of the first-generation sequencing methods are the Sanger dideoxy terminator base method and the Maxam-Gilbert (chemical cleavage) method. The Maxam-Gilbert method works by selectively cleaving the DNA strand at instances of a specific chosen base (A, C, G or T). The Sanger dideoxy method works by replicating a strand but stopping at a random instance of the specific base. Both methods are similar in that they generate short DNA fragments (of random lengths) terminating at an instance of the selected base. A technique called polyacrylamide gel electrophoresis can be used to measure the lengths of the resulting fragments. Polyacrylamide gel electrophoresis indicates the positions of the specific base in the original DNA strand. The process can be done separately for each of the four bases to get the complete DNA sequence. A limitation of these two methods may be that they only work well for sequences that are 100 bases long. Beyond this limit, the uncertainty in the position of each base becomes unacceptable. Longer sequences must be broken down, sequenced individually, and stitched together like pieces of a jig saw puzzle using genome assembly algorithms.
Other methods of DNA sequencing include massively parallel sequencers that are still limited to processing short fragments. The idea of “single molecule sequencing” is to avoid fragmentation of the DNA and try to sequence the entire single stranded DNA (ssDNA) molecule in a single pass. Single-molecule sequencers can sequence molecules over 10 kilobases long. Some single-molecule sequencers can use an optical waveguide system to sequence DNA, while other single molecule sequencers can use a nanopore device with a sensor to measure electrical currents.
Nanopore DNA sequencing allows for low-cost high-throughput genome sequencing. Nanopore DNA sequencers comprise a tiny pore about 2 nm wide, just wide enough for a single stranded DNA molecule (ssDNA) to pass through. As a negatively charged single stranded DNA translocates under an external electric field by a process called electrophoresis, the device sensor picks up a transverse ionic or tunneling current between two electrodes. By measuring the changes in the current, with the movement of each nucleotide through the pore, and applying a base-calling algorithm to the signals, the bases in the ssDNA sequence can be detected.
Protein sequencing is another important problem within the realm of single molecule sequencing. Here the goal is to identify the component amino acids in a protein sequence. Although the underlying sequencing technology may vary from one application to another, the fundamental signal processing problem of segmenting the measured waveforms and labeling them into the constituent parts is similar.
Although signal events in the nanopore DNA model are simple level shifts contaminated by additive noise, this model is inadequate for other applications where these signal characteristics could be much more complex than level changes alone. While nanopore DNA sequencing may produce signal events with different signal levels, sequencing other molecule types like protein sequences may produce different events that do not have noticeably different signal levels. Instead, protein sequencing events may comprise different noise profiles such a correlated non-Gaussian noise, and even spiky noise. Some nanopore DNA models are unable to effectively segment and label signals that lack noticeably different signal levels.
Detection device 110 may interact with molecule sequence 101 to detect changes in molecule sequence 101 that indicate the timing of events. An “event” comprises the movement of a single molecule of molecule sequence 101 through detection device 110. For example, by measuring changes in current with the passing of molecule sequence 101 through detection device 110, detection device 110 can detect the constituent molecule types in molecule sequence 110.
As molecule sequence 101 passes through the detection device 110 in the translocation direction indicated on
Signal generation module 202 can receive a known molecular sequence 201. The known molecular sequence 201 can comprise a molecular strand with known constituent molecules arranged in a known sequence. For example, known molecular sequence 201 may comprise a polypeptide chain with a known sequence of amino acids (e.g., GLY-GLY-HIS-HIS-GLY-GLY-TYR). Signal generation module 202 can generate a signal that characterizes the known molecule sequence. The signal may comprise a waveform, noise profile, other signal characteristic, or a combination thereof that indicates individual molecules that comprise the known molecular sequence 201. The signal generation module 202 can convert the analog signal into a digital format and transfer it to event detection module 203.
Event detection module 203 can segment the signal into events based on the known constituent molecules and the known sequence of sequence 201. An event represents the passing of one molecule of known molecular sequence 201 through a molecular sequencer. For example, event detection module 203 may segment the signal in a manner similar to output signal 120 illustrated by
Model generation module 204 can generate an HMM to characterize the signal characteristics associated with each event. An HMI is an artificial statistical model that utilizes observed parameters (e.g., observable sequential symbols) to identify hidden parameters, which can then be used for further determinations of properties of other signals. In system 200, each HMI can correspond to an event type and each event type can correspond to a set of signal characteristics unique to that event type. For example, model generation module 204 can fit a probabilistic model for the HMI, where each node in the HMM represents an individual signal characteristic, and the edges indicate the probability a first signal characteristic will transition to another (or the same) signal characteristic. Model generation module 204 may also determine the transition probabilities for the edges. For example, model generation module 204 may implement a Baum-Welch algorithm to learn the emission and state transition probabilities to generate the HMI. Model generation module 204 can repeat this process until the signal characteristics for each event type have been classified. Model generation module 204 may combine individually generated HMMs into an overall HMI at the HMM module 205.
Signal generation module 202 can also receive an unknown molecule sequence 206. An unknown molecular sequence 206 may comprise the same types of molecules that comprise known molecular sequence 201 but in an unknown arrangement. Signal generation module 202 can generate an analog signal representative of unknown molecule sequence 206, convert the analog signal to a digital representation of the analog signal, and transfer the digital signal to HMM 205. HMI module 205 can then segment the digital signal into events. HMI module 205 may determine the probability that the observed signal characteristics correspond to a molecule type. For example, HMM module 205 can label the events based on the probability that their signal characteristics correspond to the molecule type. For example, HMM module 205 may estimate a probability that the signal characteristics of an event is the nucleobase adenine and responsively label the event as adenine based on the probability. Once the signal has been labeled, HMM module 205 can emit a labeled sequence 207. An example of a labeled output sequence or signal is provided in
The entire HMM 400 can represent a distinct signal characteristic for an event. The individual states of 400 may generate signals as simple as level shifts but the entire combination of HMI states can model signals with correlated samples, including spiky or other complicated behavior. For example, HMI 400 may be generated to characterize event type A depicted in chart 300 and the nodes 1-4 of HMI 400 may correspond to distinct signal levels that comprise the signal characteristics for event type A. States 1-4 are bridged by edges depicted by arrows. An edge from a state i to another state j≠i represents the end of a first signal characteristic and the beginning of a new signal characteristic. Some of the self-loops (an edge from a node to itself) may represent the continuation of the same signal characteristic from the same event and other ones of the self-loops may represent a transition to a new event with the same signal characteristic.
In general, the individual transition probabilities pij from state i to state j (e.g., the transition from state 1 to state 2 is labeled as p12 in HMI 400) can be modeled. HMI 400 models the overall signal characteristics for a signal event generated by single molecule sequencing. In some examples, HMM 400 may accurately model the signal characteristics to be able to generate a new (idealized estimate) signal(s) to mimic a real sequencer output. The idealized signal(s) may be compared to a signal generated by single molecule sequencing to characterize the signal. HMI 400 may model a wide range of signal behaviors including correlated noise, non-Gaussian noise, spiky noise, and the like.
For explanatory purposes, let Y=y1N denote a length-N signal that a molecular sequencer outputs. For example, the length-N signal may comprise the training signal illustrated in
In the example HMM 400, at each time t, a state transition occurs from a state πt-1 to another state πt with probability P (πt=v|πt-a=u)=Puv. For example, the probability of transitioning from state 1 to state 2 in HMM 400 would be P(πt=2|πt-1=1)=P12. The destination state πt emits an output sample yt with a probability P(yt|πt). In relation to the output signal, each state πt indicates a signal characteristic in line with the aforementioned probability. These transitions form a PDF that could be a conditional Gaussian, parametrized by its mean and variance, or any other suitable continuous probability distribution. The joint PDF of the sequence output Y and the hidden state sequence Π=π1N has the factor form:
In some examples, the Baum-Welch algorithm is used to learn the emission and state transition probabilities. The Baum-Welch algorithm comprises a special case of an Expectation Maximization (EM) algorithm used to find the unknown parameters of an HMM. This algorithm tries to maximize the likelihood of the observed data P(Y) over all the HMI parameters. The Baum-Welch algorithm may be used to learn the adjacency structure of HMI 400. HMM 400 starts fully connected (e.g., the solid arrows) and the algorithm prunes the edges with vanishingly small transition probabilities, say less than 10−9, in each iteration. Once the transition probability given by an edge falls below a threshold, the algorithm prunes the edge. With respect to HMI 400, the pruned edges comprise the dashed arrows. In some examples, HMM 400 comprises far fewer edges with sparse connectivity once the pruning process is complete. HMI 400 may be trained to model the sequencer output for single molecule sequencing. Specifically, HMM 400 may be used to determine the PDF P(Y, Π) for typical sequencer output signals. In some examples, the distribution generated by HMI 400 can be sampled to generate idealized signals Y that mimic those of a real sequencer.
In addition to generating signals, HMI 400 may also be used to test whether a given signal Y=y1N of unknown origin may be a typical output of the sequencer using a “likelihood test”. A normalized log-likelihood function
may be computed with respect to the estimated PDF. If a given signal scores score sufficiently high, HMM 400 may label the signal. A signal that is not a typical sequencer output would score low on this test. A suitable threshold for the score can be determined by testing on several signals from both inside and outside the class of typical sequencer outputs.
In this example, HMI 500 is configured to segment and label a signal generated by single molecule sequencing for two molecule types, type A and type B. HMI 500 is configured to automatically segment a signal into the individual type A and B events. HMI 500 may be configured to segment and label the output signal illustrated in
Once HMM 500 is constructed, HMM 500 may perform automatic segmentation of a sequencer output Y that comprises an unknown molecular sequence of molecules type-A and type-B into events of type-A and type-B. The Viterbi algorithm may be applied to HMI 500 with a negative log-likelihood branch metric to perform the automatic segmentation. The Viterbi algorithm comprises a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states. In this case, the sequence of hidden states comprises the sequencer output Y. It should be noted that the log-likelihood metric can be used to discriminate between signals that came from different sources. Here the log-likelihood metric can be used to discriminate between the signals of type-A and type-B. The Viterbi algorithm computes the maximum-likelihood (ML) path in HMM 500 which results in the ML segmentation of the signal into type-A and type-B events.
Advantageously, HMM 500 may automatically segment and label a signal generated by signal molecule sequencing. Moreover, HMM 500 may process signals that comprise different noise profiles that correspond to different types of molecules.
Known sequence 701 may be translocated through detection device 710. As known sequence 701 passes through detection device 710, detection device 710 can interact with the molecule sequence 701 to detect changes in known sequence 701 that indicate the timing of events. An event may be the movement of a single amino acid through detection device 710. As the amino acids of known sequence 701 pass through and interact with detection device 710, detection device 710 can generate a training signal 720 that indicates the interactions between the amino acids and the detection device.
Training signal 720 comprises the readout generated by detection device 710. Training signal 720 can be divided into seven distinct events based on the known sequence 701. The seven distinct events may correspond to the individual amino acids that comprise the known sequence 701 and can be labeled accordingly. In some examples, training signal 720 can be used to generate and train an HMM to sequence an unknown molecular sequence. Such a resulting HMM can be used to segment and label unknown signals that comprise the same molecules as training signal 720. For example, the resulting HMM may be unable to sequence an unknown sequence comprising different molecule types like DNA bases or different types of amino acids than the ones of known sequence 701.
In some embodiments, to train the HMIs 801-803, a molecular training sequence (e.g., sequence 701) can be modeled along X=x1N to denote the sequence of activator labels of amino acids. The signal output (e.g., training signal 720) is modeled along Y=y1N to denote the sequencer output. For example, x1-7N may correspond to glycine, x8-15N may correspond to glycine, and so on. In this example, the set xt is a sequence of discrete labels for glycine, tyrosine, and histidine and the label xt is attached to every signal sample yt. The labels xt can repeat themselves several times and change only at the event boundaries since an event duration can last many time steps. The HMMs 801-803 can be generated and trained using their corresponding signal samples. For each HMM, the sequences X and Y are jointly modeled. Thus, the state πt of HMMS 801-803 emit a new output pair (xt, yt) with a probability P(xt,yt|πt)=P(xt|πt)P(yt|xt,πt). The distribution P(xt|πt) is a discrete Probability Mass Function (PMF) that can be stored in a look-up table. The second term P(yt|xt,πt) is a Probability Density Function (PDF) that could be a conditional Gaussian, parametrized by it mean and variance, or some other parametrized continuous distribution. The full joint PDF of sequences X, Y, and Π has the factor form:
Note that the label xt is treated as an output of the HMM rather than a description of the state πt itself. By adopting this approach there is no preconceived notion of what states should represent. The Baum-Welch algorithm can be applied to HMMs 801-803 to learn the emission and state transition probabilities. This algorithm maximizes the likelihood of the observed data P(x1N,y1N) over all the HMM parameters. In a similar manner to HMM 400, the adjacency structure is learned by pruning edges with vanishingly small probabilities.
The generative HMMs described in these examples contain states πt that emit a single output a pair of outputs (xt, yt). In further examples, the edges (πt-1,πt) in HMMs may emit the output(s), specifically P(xt,yt|πt-1,πt). This class of PDF is indeed a generalization of the original PDF which can be obtained turning off its dependence on the state πt-1. All the algorithms presented for the “state-emitting” formulation can be adapted equally well to the “edge emitting” case.
An unknown output signal 1000 can be fed to a trained HMI (e.g., HMI 900) that is configured to process unknown signals generated by single molecule sequencing of protein chains that comprise glycine, histidine, and tyrosine. However, when the unknown signal 1000 is generated by a molecule sequence that comprises different types of molecules, the HMI can be trained to handle those molecule types. In some examples, to segment and label an unknown signal, let Π=π1N denote the hidden state sequence. The hidden state sequence denotes the sequence of nodes in the trained HMM that correctly output the observed signal characteristics of an unknown output signal 1000. To identify the correct events, the activator sequence X that maximizes the following is found by:
The summation is carried out in a manner like the marginalization technique used in the forward algorithm for HMMs so to output P(xt|Y) at each time t. The HMM can implement the summation and responsively segment and label output signal 1010.
In other examples, the HMM may implement the Viterbi algorithm to segment and label signal 1000. In doing so, the activator labels X can be treated as unknown or “missing data” and the Viterbi algorithm can be run on the HMM to detect the Maximum Likelihood (ML) state and activator sequences (Π, X) that explains the observed sequencer output Y.
In this disclosure, the problem of segmenting and labeling a sequencer output is addressed. There are several promising new technologies for single molecule sequencing for low-cost high throughput sequencing for medical applications as well as applications to DNA based data storage. For example, the systems and methods disclosed herein could be utilized in conjunction with the system and methods described in co-owned patent application Ser. No. 16/175,223, filed Oct. 30, 2018, entitled “Event Timing Detection for DNA Sequencing”, currently pending, the contents of which is hereby incorporated by reference in its entirety.
While some single molecule sequencing techniques produce signal events with different signal levels, sequencing other molecule types like protein sequences may produce waveforms that do not have noticeably different signal levels. Traditional signal classification methods are unable to effectively and effectively segment and label signals that lack noticeably different signal levels.
This disclosure addresses systems and methods of joint event segmentation and labeling using generative Hidden Markov Model (HMMs). Generative models are useful for generating signals that mimic a given class of training signals. The generative HMIs are used to identify the signal class from which a test signal originated. This ability to discriminate between different signal classes can be used to build a system for automatic signal segmentation and labeling. This disclosure also demonstrates a direct approach of building a generative HMM to jointly model both the labels and signal samples. Similarly, this HMM can also be used to perform automatic labeling and segmentation of sequencer signals.
The illustrations of the embodiments described herein are intended to provide a general understanding of the structure of the various embodiments. The illustrations are not intended to serve as a complete description of all of the elements and features of apparatus and systems that utilize the structures or methods described herein. Many other embodiments may be apparent to those of skill in the art upon reviewing the disclosure. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure. Moreover, although specific embodiments have been illustrated and described herein, it should be appreciated that any subsequent arrangement designed to achieve the same or similar purpose may be substituted for the specific embodiments shown.
This disclosure is intended to cover any and all subsequent adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the description. Additionally, the illustrations are merely representational and may not be drawn to scale. Certain proportions within the illustrations may be exaggerated, while other proportions may be reduced. Accordingly, the disclosure and the figures are to be regarded as illustrative and not restrictive.