Recent studies have demonstrated significant progress in DNA synthesis and sequencing technologies. This progress has also spurred the development of data storage technology based upon DNA molecules. A DNA storage system consists of three important components. The first is the DNA synthesis, which produces the oligonucleotides, also called strands, that encode the data. In order to produce strands with acceptable error rates in a high-throughput manner, the length of the strands is typically limited to no more than 250 nucleotides. The second part is a storage container with compartments which stores the DNA strands, however without order. Finally, sequencing is performed to read back a representation of the strands, which are called reads.
Current synthesis technologies are not able to generate a single copy of each DNA strand, but only multiple copies, where the number of copies is on the order of thousands to millions. Moreover, sequencing of DNA strands is usually preceded by PCR amplification, which replicates the strands. Hence, every strand has multiple copies and several of them are read during sequencing.
The encoding and decoding stages are two processes, external to the storage system, that convert the user's binary data into strands of DNA such that, even in the presence of errors, it will be possible to recover the original binary data. These two stages consist of three steps, which we refer to as 1. clustering, 2. reconstruction, and finally 3. error correction. After the strands are read back by sequencing, the first task is to partition them into clusters such that all strands in the same cluster originated from the same synthesized strand. After the clustering step, the goal is to reconstruct each strand based upon all its noisy copies, and this stage is the main problem studied in this paper. Lastly, errors which were not corrected by the reconstruction step, mis-clustering errors, lost strands, and any other error mechanisms should be corrected by the use of an error-correcting code.
Any reconstruction algorithm for the second stage is performed on each cluster to recover the original strand from the noisy copies in the cluster. Having several copies of each strand is beneficial since it allows the correction of errors that may occur during this process. In fact, this setup falls under the general framework of the string reconstruction problem, which refers to recovering a string based upon several noisy copies of it. Examples of this problem are the sequence reconstruction problem, which was first studied by Levenshtein, and the trace reconstruction problem. In general, these models assume that the information is transmitted over multiple channels, and the decoder, which observes all channel estimations, uses this inherent redundancy in order to correct the errors.
Generally speaking, the main problem studied under the paradigm of the sequence reconstruction and trace reconstruction problems is to find the minimum number of channels that guarantees successful decoding, either in the worst case or with high probability. However, in DNA-based storage systems we do not necessarily have control over the number of strands in each cluster. Hence, the goal of this work is to propose efficient algorithms for the reconstruction problem as it is reflected in DNA-based storage systems, where the cluster size is a given parameter. Then, the goal is to output a strand that is close to the original one so that the number of errors the error-correcting code should correct is minimized. We will present algorithms that work with a flexible number of copies and various probabilities for deletion, insertion, and substitution errors.
One of the early experiments of data storage in DNA was conducted by Clelland et al. in 1999. In their study they coded and recovered a message consisting of 23 characters. Three sequences of nine bits each were successfully stored by Leier et al. in 2000. Gibson et al. presented in 2010 a more significant progress, in terms of the amount of data stored successfully: they demonstrated in-vivo storage of 1,280 characters in a bacterial genome. The first large-scale demonstrations of the potential of in-vitro DNA storage were reported by Church et al., who recovered 643 KB of data, and by Goldman et al., who accomplished the same task for a 739 KB message. However, neither of these pioneering groups recovered the entire message successfully, and no error-correcting codes were used. Shortly thereafter, Grass et al. managed to successfully store and recover an 81 KB message in an encapsulated media, and Bornholt et al. demonstrated storing a 42 KB message. A significant improvement in volume was reported by Blawat et al., who successfully stored 22 MB of data. Erlich and Zielinski improved the storage density and stored 2.11 MB of data. The largest volume of stored data was reported by Organick et al., who stored roughly 200 MB of data, an order of magnitude more data than previously reported. Yazdi et al. developed a method that offers both random access and rewritable storage in a portable DNA-based storage system. Recently, Anavy et al. enhanced the capacity of the DNA storage channel by using composite DNA letters; a similar approach was taken on a smaller scale. Lopez et al. stored and decoded 1.67 MB of data; in their work they focused on increasing the throughput of nanopore sequencing by assembling and sequencing together fragments of 24 short DNA strands. Recent studies also presented an end-to-end demonstration of DNA storage, the use of LDPC codes for DNA-based storage, a computer-systems perspective on molecular processing and storage, and lastly, the work of Tabatabaei et al., which uses existing DNA strands as punch cards to store information.
The processes of synthesizing, storing, sequencing, and handling strands are all error prone. Each step in these processes can independently introduce a significant number of errors. Additionally, the DNA storage channel has several attributes which distinguish it from other storage media such as tapes, hard disk drives, and flash memories. We summarize some of these differences and the special error behavior in DNA.
The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the disclosed embodiments will be apparent from the following detailed description taken in conjunction with the accompanying drawings.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention. The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings. It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Because the illustrated embodiments of the present invention may, for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.

Any reference in the specification to a method should be applied mutatis mutandis to a device or system capable of executing the method and/or to a non-transitory computer readable medium that stores instructions for executing the method. Any reference in the specification to a system or device should be applied mutatis mutandis to a method that may be executed by the system, and/or may be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions executable by the system. Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a device or system capable of executing instructions stored in the non-transitory computer readable medium and/or may be applied mutatis mutandis to a method for executing the instructions.

Any combination of any module or unit listed in any of the figures, any part of the specification and/or any claims may be provided. The specification and/or drawings may refer to a processor. The processor may be a processing circuitry. The processing circuitry may be implemented as a central processing unit (CPU), and/or one or more other integrated circuits such as application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), full-custom integrated circuits, etc., or a combination of such integrated circuits. Any combination of any steps of any method illustrated in the specification and/or drawings may be provided. Any combination of any subject matter of any of the claims may be provided. Any combination of systems, units, components, processors, or sensors illustrated in the specification and/or drawings may be provided.

There may be provided systems, methods and non-transitory computer readable media for reconstruction of data stored in a DNA storage system.
The following text refers to methods for simplicity of explanation. There may be provided methods that reconstruct the data using shortest common supersequences (SCSs) and longest common subsequences (LCSs). In contrast to most of the previous algorithms, which are shaped in the structure of alignment and then majority reconstruction, the methods also involve finding the SCS and LCS, identifying similar patterns in the sequences, and using the maximum-likelihood decoder. The methods may use the edit distances between the traces as the basis of the estimation. Instead of building the estimation from scratch, the methods may use the edit distances and error vectors between pairs of traces in order to align them, and then use one or more of them as the basis of the output. The methods may be determined based on practical values of a DNA storage system, by taking into account an error characterization of the DNA storage channel. The methods place fewer limitations on the input: while most previous algorithms assume a limited number of errors, dependency between the errors, limited lengths of the sequences, a limited number of distinct traces, and the like, the methods are flexible and consider any error distribution and any strand length. The methods aim at minimization of the error rates and maximization of the success rate. The methods were designed for practical purposes; hence, they aim at reducing the error rates of the reconstructed sequences and maximizing the success rate, which is the probability of exact reconstruction. The inventors evaluated the distribution of the number of errors in the reconstructed sequences. Using these values may be beneficial as they define the parameters of the error-correcting codes that guarantee exact reconstruction in a given library.
Step 220 may include applying a processing operation on at least some of the r-tuples, wherein the applying may include searching for a maximum-likelihood SCS. When the maximum-likelihood SCS is not found, an SCS that minimizes a sum of Levenshtein distances to all the traces of the cluster is returned.
Step 220 may include repeating the processing operations for different values of r. The value of r may or may not exceed ten.
There may be only a few different values of r. A few may mean, for example, a number that does not exceed 5, 6, 7, 8, 9, 10, and the like.
The processing operations applied on the at least some of the r-tuples may include calculating longest common subsequences (LCSs).
Step 220 of estimating the information unit may be based on a size of the cluster. Step 220 of estimating may include estimating an error probability of the cluster using an average length of the traces.
Step 220 of estimating may include applying the processing operations only on a group of longest traces of the cluster. The longest traces may be about one fifth of the traces of the cluster. Step 220 may include calculating a distance between the traces of the cluster. The calculating of the distance between the traces of the cluster may be followed by the processing of an r-tuple. The distance may be a k-mer distance.
We also apply our algorithms to data from previously published DNA storage experiments and compare our accuracy and performance with state-of-the-art algorithms known from the literature.
While the foregoing written description of the invention enables one of ordinary skill to make and use what is considered presently to be the best mode thereof, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the specific embodiment, method, and examples herein. The invention should therefore not be limited by the above described embodiment, method, and examples, but by all embodiments and methods within the scope and spirit of the invention as claimed.
In this work we present several reconstruction algorithms, crafted for DNA storage systems. Since our purpose is to solve the reconstruction problem as it is reflected in DNA storage systems, our algorithms aim to minimize the distance between the output and the original strands. The algorithms in this work are different from most of the previously published reconstruction algorithms in several respects. Firstly, we do not require any assumption on the input; that is, the input can be arbitrary and does not necessarily belong to an error-correcting code. Secondly, our algorithms are not limited to a specific cluster size, do not require any dependencies between the error probabilities, and do not assume zero errors in any specific location of the strands. Thirdly, we limit the complexity of our algorithms, so they can run in practical time on actual data from previous DNA storage experiments. Lastly, since clusters in DNA storage systems may vary in their sizes and error distributions, our algorithms are designed to minimize the distance between our output and the original strand, taking into account that the remaining errors can be corrected by the use of an error-correcting code.
We denote by Σq = {0, . . . , q−1} the alphabet of size q and by Σq* the set of all finite-length strings over Σq. The length of a string x ∈ Σq^n is denoted by |x| = n. The Levenshtein distance between two strings x, y ∈ Σq*, denoted by dL(x, y), is the minimum number of insertions and deletions required to transform x into y. The edit distance between two strings x, y ∈ Σq*, denoted by de(x, y), is the minimum number of insertions, deletions, and substitutions required to transform x into y, and dH(x, y) denotes the Hamming distance between x and y, when |x| = |y|. For a positive integer n, the set {1, . . . , n} is denoted by [n].
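For concreteness, the following is a minimal Python sketch of these three distance measures, using standard dynamic programming; the function names are illustrative and the strings are assumed to be given as Python strings. Note that dL(x, y) = |x| + |y| − 2·LCS(x, y).

```python
def levenshtein_indel(x: str, y: str) -> int:
    """d_L(x, y): minimum number of insertions and deletions transforming
    x into y; equals |x| + |y| - 2 * LCS(x, y)."""
    m, n = len(x), len(y)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return m + n - 2 * lcs[m][n]

def edit_distance(x: str, y: str) -> int:
    """d_e(x, y): minimum number of insertions, deletions, and substitutions."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def hamming(x: str, y: str) -> int:
    """d_H(x, y): defined only when |x| = |y|."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))
```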
The trace reconstruction problem was studied in several theoretical works. Under this framework, a length-n string x yields a collection of noisy copies, also called traces, y1, y2, . . . , yt, where each yi is independently obtained from x by passing through a deletion channel, under which each symbol is independently deleted with some fixed probability pd. Suppose the input string x is arbitrary. In the trace reconstruction problem, the main goal is to determine the required minimum number of i.i.d. traces in order to reconstruct x with high probability. This problem has two variants: in the "worst case", the success probability refers to all possible strings, and in the "average case" (or "random case") the success probability is guaranteed for an input string x which is chosen uniformly at random.
The trace reconstruction problem can be extended to the model where each trace is a result of x passing through a deletion-insertion-substitution channel. Here, in addition to deletions, each symbol can be substituted with some substitution probability ps, and for each j, with probability pi, a symbol is inserted before the j-th symbol of x. Under this setup, the goal is again to find the minimum number of channels which guarantees successful reconstruction of x with high probability. (Note that there are many interpretations of the deletion-insertion-substitution channel; most of them differ on the event when more than one error occurred at the same index. Our interpretation of this channel is described in Section 0.14.)
Motivated by the storage channel of DNA and in particular the fact that different clusters can be of different sizes, this work is focused on another variation of the trace reconstruction problem, which is referred to as the DNA reconstruction problem. The setup is similar to the trace reconstruction problem. A length-n string x is transmitted t times over the deletion-insertion-substitution channel and generates t traces y1, y2, . . . , yt. A DNA reconstruction algorithm is a mapping R: (Σq*)^t → Σq* which receives the t traces y1, y2, . . . , yt as an input and produces x̂, an estimation of x. The goal in the DNA reconstruction problem is to minimize de(x, x̂), i.e., the edit distance between the original string and the algorithm's estimation. When the channel of the problem is the deletion channel, the problem is referred to as the deletion DNA reconstruction problem and the goal is to minimize dL(x, x̂). While the main figure of merit in these two problems is the edit/Levenshtein distance, we will also be concerned with the complexity, that is, the running time of the proposed algorithms.
This section reviews the related works on the different reconstruction problems. In particular, we list the reconstruction algorithms that have been used in previous DNA storage experiments and summarize some of the main theoretical results on the trace reconstruction problem.
Some of these works showed that it is possible to reconstruct a sequence with high probability. However, they assume that the deletion and insertion probabilities are relatively small, while the substitution probability is relatively large. In practice, these probabilities vary from cluster to cluster and do not necessarily meet these assumptions.
While all previous works on reconstruction algorithms used variations of the majority algorithm on localized areas of the traces, we take a different, global approach to tackle this problem. Namely, the algorithms presented in this paper are heavily based on the maximum-likelihood decoder for multiple deletion channels as well as on the concepts of the shortest common supersequence and the longest common subsequence.
For a sequence x ∈ Σq* and a set of indices I ⊆ [|x|], the sequence xI is the projection of x on the indices of I, which is the subsequence of x received by taking the symbols at the entries of I. A sequence x ∈ Σ* is called a supersequence of y ∈ Σ* if y can be obtained by deleting symbols from x, that is, there exists a set of indices I ⊆ [|x|] such that y = xI. In this case, it is also said that y is a subsequence of x. Furthermore, x is called a common supersequence (subsequence) of some sequences y1, . . . , yt if x is a supersequence (subsequence) of each one of these t sequences. The length of the shortest common supersequence (SCS) of y1, . . . , yt is denoted by SCS(y1, . . . , yt), and the set of all shortest common supersequences of y1, . . . , yt ∈ Σ* is denoted by SCS-set(y1, . . . , yt). Similarly, the length of the longest common subsequence (LCS) of y1, . . . , yt is denoted by LCS(y1, . . . , yt), and the set of all longest common subsequences of y1, . . . , yt is denoted by LCS-set(y1, . . . , yt).
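The following sketch computes, for a pair of sequences, one arbitrary LCS and one SCS; it relies on the standard identity SCS(x, y) = |x| + |y| − LCS(x, y) for two sequences (for t > 2 sequences the dynamic program runs over the product of all lengths and is not shown).

```python
def lcs_one(x: str, y: str) -> str:
    """Return one (arbitrary) longest common subsequence of x and y."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out, i, j = [], m, n           # backtrack through the DP table
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

def scs_one(x: str, y: str) -> str:
    """Return one shortest common supersequence of x and y by merging the
    two strings around one of their LCSs; its length is |x| + |y| - LCS(x, y)."""
    w = lcs_one(x, y)
    out, i, j = [], 0, 0
    for ch in w:
        while x[i] != ch:
            out.append(x[i]); i += 1
        while y[j] != ch:
            out.append(y[j]); j += 1
        out.append(ch); i += 1; j += 1
    out.extend(x[i:]); out.extend(y[j:])
    return "".join(out)
```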
Consider a channel S that is characterized by a conditional probability PrS, which is defined by
PrS{y rec. |x trans.},
for every pair (x, y) ∈ (Σq*)². Note that it is not assumed that the lengths of the input and output sequences are the same, as we consider also deletions and insertions of symbols. As an example, it is well known that if S is the binary symmetric channel (BSC) with crossover probability 0 < p ≤ ½, denoted by BSC(p), it holds that

PrBSC(p){y rec. | x trans.} = p^dH(x,y) · (1−p)^(n−dH(x,y)),

where n = |x| = |y|.
The maximum-likelihood (ML) decoder for a code C with respect to S, denoted by ML, outputs a codeword c ∈ C that maximizes the probability PrS{y rec. | c trans.}. That is, for y ∈ Σq*,

ML(y) = argmax_{c∈C} PrS{y rec. | c trans.}.
It is well known that for the BSC, the ML decoder simply chooses the closest codeword with respect to the Hamming distance.
The conventional setup of channel transmission is extended to the case of more than a single instance of the channel. Assume a sequence x is transmitted over t identical and independent channels of S and the decoder receives all channel outputs y1, . . . , yt. This setup is characterized by the conditional probability

Pr(S,t){y1, . . . , yt rec. | x trans.} = ∏_{i=1}^{t} PrS{yi rec. | x trans.}.
Now, the input to the ML decoder is the sequences y1, . . . , yt and the output is the codeword c which maximizes the probability Pr(S,t){y1, . . . , yt rec. | c trans.}.
For two sequences x, y ∈ Σq*, the number of times that y can be received as a subsequence of x is called the embedding number of y in x and is defined by

Emb(x; y) = |{I ⊆ [|x|] : xI = y}|.
Note that if y is not a subsequence of x then Emb(x; y)=0. The embedding number has been studied in several previous works and was referred to as the binomial coefficient. In particular, this value can be computed with quadratic complexity.
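As an illustration, the embedding number can be computed with the following quadratic dynamic program (a sketch; the recursion adds the count for the shorter prefixes whenever the current symbols match):

```python
def emb(x: str, y: str) -> int:
    """Emb(x; y): the number of index sets I with x_I = y, computed with a
    one-dimensional DP in O(|x| * |y|) time."""
    m, n = len(x), len(y)
    E = [1] + [0] * n                 # E[j] = Emb(x[:i]; y[:j]) for the current i
    for i in range(1, m + 1):
        # iterate j in descending order so E[j-1] still refers to x[:i-1]
        for j in range(min(i, n), 0, -1):
            if x[i - 1] == y[j - 1]:
                E[j] += E[j - 1]
    return E[n]

# e.g. emb("AAB", "AB") == 2: "AB" embeds via the index sets {1,3} and {2,3}
```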
While the calculation of the conditional probability PrS{y rec. | x trans.} is a rather simple task for many of the known channels, it is not straightforward for channels which introduce insertions and deletions. In the deletion channel with deletion probability p, denoted by Del(p), every symbol of the word x is deleted with probability p. For the deletion channel it is known that for all (x, y) ∈ (Σq*)², it holds that

PrDel(p){y rec. | x trans.} = p^(|x|−|y|) · (1−p)^(|y|) · Emb(x; y).
According to this property, the ML decoder for one or multiple deletion channels is stated as follows.

Lemma 1. Assume c ∈ C ⊆ (Σq)^n is the transmitted sequence and y1, . . . , yt ∈ (Σq)* are the output sequences from Del(p). Then

ML(y1, . . . , yt) = argmax_{c∈C} ∏_{i=1}^{t} Emb(c; yi),

since the factors p^(n−|yi|) · (1−p)^(|yi|) do not depend on the choice of c ∈ C.
Note that since there is more than a single channel, when the goal is to minimize the average decoding error probability, the ML decoder does not necessarily have to output a codeword but any sequence that minimizes the average decoding error probability. In the next sections it will be shown how to use the concepts of the SCS and LCS together with the maximum likelihood decoder in order to build decoding algorithms for the deletion DNA reconstruction and the DNA reconstruction problems.
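A sketch of such a decoder follows, reusing the emb function from the sketch above and assuming all candidate sequences have the same length; by Lemma 1 it then suffices to maximize the product of embedding numbers.

```python
from math import prod  # Python 3.8+

def ml_deletion(candidates, traces):
    """ML decoding over t independent deletion channels: when all candidates
    share the same length n, the factors p**(n - |y|) * (1 - p)**|y| are common
    to every candidate, so maximizing prod_i Emb(c; y_i) maximizes the
    likelihood (Lemma 1)."""
    return max(candidates, key=lambda c: prod(emb(c, y) for y in traces))
```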
This section studies the deletion DNA reconstruction problem. Assume that a cluster consists of t traces, y1, y2, . . . , yt, where all of them are noisy copies of a synthesized strand. This model assumes that every trace is a sequence that is independently received by the transmission of a length-n sequence x (the synthesized strand) through a deletion channel with some fixed deletion probability pd. Our goal is to propose an efficient algorithm which returns x̂, an estimation of the transmitted sequence x, with the intention of minimizing the Levenshtein distance between x and x̂. We consider both the case where t is a fixed small number and the case of large values of t.
Our approach is based on the maximum-likelihood decoder over the deletion channel. A straightforward implementation of this approach on a cluster of size t is to compute the set of shortest common supersequences of y1, y2, . . . , yt, i.e., the set SCS-set(y1, y2, . . . , yt), and then return the maximum-likelihood sequence among them. This algorithm has been rigorously studied to analyze its Levenshtein error rate for t = 2. The method to calculate the length of the SCS commonly uses dynamic programming, and its complexity is the product of the lengths of all sequences. Hence, even for moderate cluster sizes, e.g., t ≥ 5, this solution will incur high complexity and impractical running times. However, for many practical values of n and pd, the original sequence x can be found among the list of SCSs while taking fewer than t traces or even only two of them. This fact, which we verified empirically, can drastically reduce the complexity of the ML-based algorithm. Furthermore, note that x is always a common supersequence of all traces; however, it is not necessarily the shortest one. Hence, our algorithm works as follows. The algorithm creates sorted sets of r-tuples, where each tuple consists of r traces from the cluster. The r-tuples are sorted in a non-decreasing order according to the sum of their lengths. For each r-tuple (y_i1, . . . , y_ir), the algorithm computes the length of its SCS, SCS(y_i1, . . . , y_ir), while keeping track of nmax, the largest SCS length observed so far, and of the r-tuples that attain it.

In Algorithm 1, we present a pseudo-code of our solution for the deletion DNA reconstruction problem. Note that the algorithm uses another procedure, which is presented in Algorithm 2, to filter the supersequences and output the maximum-likelihood supersequence. The input to the algorithm is the length n of the original sequence and a cluster of t traces C. Algorithm 1's main loop is in Step 2; first, in Step 2-a, it generates the set F, which is a set of all r-tuples of traces, sorted by the sum of their lengths. Then, in Step 2-b, it iterates over all r-tuples in F and checks for each r-tuple (y_i1, . . . , y_ir) whether its SCS length exceeds the current value of nmax, in which case nmax and the set Cmax of maximizing r-tuples are updated. In Step 3, the algorithm computes Smax, the union of the sets SCS-set(c) of the r-tuples c ∈ Cmax whose SCS length was nmax. In Step 4, the algorithm invokes again Algorithm 2 to check if Smax includes supersequences of all traces in C, and returns the maximum likelihood among them. If none of the sequences in Smax is a supersequence of all traces in C, the algorithm returns in Step 5 the sequence which minimizes the sum of Levenshtein distances to all the traces in C.
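The following sketch illustrates the flow of Algorithm 1 restricted to pairs of traces (r = 2), reusing scs_one, emb, and prod from the sketches above; it is a simplification, since the full algorithm enumerates the entire SCS set of each r-tuple and iterates over several values of r.

```python
from itertools import combinations

def is_supersequence(x: str, y: str) -> bool:
    """True iff y is a subsequence of x."""
    it = iter(x)
    return all(ch in it for ch in y)

def reconstruct_pairs(cluster, n):
    """Algorithm-1-style reconstruction restricted to r = 2 (our simplified
    reading). Pairs are scanned by non-decreasing sum of lengths; for two
    traces, one SCS is obtained by merging them around an LCS. Since every
    trace is a subsequence of x, no SCS can be longer than n."""
    best_len, candidates = 0, []
    pairs = sorted(combinations(cluster, 2),
                   key=lambda p: len(p[0]) + len(p[1]))
    for y1, y2 in pairs:
        s = scs_one(y1, y2)          # one SCS of the pair
        if len(s) > best_len:
            best_len, candidates = len(s), [s]
        elif len(s) == best_len:
            candidates.append(s)
        if best_len == n:            # length n is the best achievable
            break
    # prefer candidates that are supersequences of all traces, then take ML
    full = [c for c in candidates if all(is_supersequence(c, y) for y in cluster)]
    pool = full or candidates
    return max(pool, key=lambda c: prod(emb(c, y) for y in cluster))
```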
We evaluated the accuracy and efficiency of Algorithm 1 by the following simulations. These simulations were tested over sequences of length n = 200, clusters of size 4 ≤ t ≤ 10, and deletion probability p in the range [0.01, 0.10]. The alphabet size was 4. Each simulation consisted of 100,000 randomly generated clusters. Furthermore, we ran another set of simulations for n = 100 with deletion probability p in the range [0.11, 0.20] and clusters of size 4 ≤ t ≤ 10. Each simulation for these values of p, n, and t included 10,000 randomly selected clusters. We calculated the Levenshtein error rate (LER) of the decoded output sequence as well as the average decoding success probability (referred to as the success rate). We also calculated the k-error success rate, which is defined as the fraction of clusters where the Levenshtein distance between the algorithm's output sequence and the original sequence was at most k. Note that for k = 0, this is equivalent to calculating the success rate. We also calculated the minimal k for which the k-error success rate is at least q, and denote this value of k by kq_succ. Note that for q = 1 this value determines the minimal number of Levenshtein errors that an error-correcting code must correct in order to fully decode the original sequences using Algorithm 1 with an error-correcting code. In addition, each cluster was also reconstructed using the BMA algorithm.
[Algorithm 1, main loop (recovered from the pseudo-code figure): let F = {c_i^(r) = (y_i1, . . . , y_ir) : 1 ≤ i1 < i2 < . . . < ir ≤ t}, with 1 ≤ i ≤ (t choose r), be the set of all r-tuples from C, sorted by non-decreasing sum of lengths; for each c_i^(r) ∈ F, if SCS(c_i^(r)) > nmax, then update nmax and Cmax; finally, Smax ← the union of all sets SCS-set(c) of c_i^(r) ∈ Cmax.]
In order to also simulate high deletion probabilities, we simulated 1,000 clusters of sequences over a 4-ary alphabet of length n = 100 with cluster size t between 4 and 10, while the deletion probability was p = 0.25.
In case the cluster is of larger size, for example on the order of Θ(n), we present in Algorithm 3 a variation of Algorithm 1 for large clusters. In this case, since the cluster is large, the probability of finding a pair, triplet, or quadruplet of traces whose set of SCSs contains the original sequence x is very high, if not 1. In fact, in all of our simulations, which we elaborate on below in this section, we were always able to successfully decode the original sequence with no errors even when the deletion probability was as high as 0.2. Hence, our main goal in this part is to decrease the runtime of Algorithm 1 while preserving a success rate of 1. Algorithm 3 keeps the same structure as Algorithm 1; however, it performs two filters on the cluster in order to reduce the computation time.
The complexity of finding the length of the SCS of some set of r traces is the product of their lengths, i.e., Θ(n^r). Therefore, the complexity of finding the length of the SCS of a pair of traces is Θ(n^2), while there are Θ(n^2) pairs of traces (assuming the cluster size is Θ(n)). Therefore, in this case, calculating the length of the SCS of each pair of traces before considering some triplets is not necessarily the right strategy when our goal is to optimize the algorithm's running time. Hence, in Algorithm 3 we focus on filtering the traces in the cluster in order to check only a subset of the traces which are more likely to succeed and produce the correct sequence.
To define the filtering criteria for Algorithm 3, we simulated Algorithm 1 on large clusters. The length of the original sequence x was n = 200 and the cluster size was t = 100. We generated 10,000 clusters of size t, where the deletion probability p was in the range [0.01, 0.15]. The success rate of all the simulations was 1. We evaluated the percentage of clusters in which the first r-tuple to have an SCS of length n consisted of the longest 20% of traces in the cluster. We observed that when the deletion probability was at most 0.07, in all of the clusters the first r-tuple of traces that had an SCS of size n consisted of the longest 20% of traces in the cluster. For deletion probabilities between 0.08 and 0.11 these percentages ranged between 94.76% and 99.98%, while for p = 0.15 this percentage was 60.88%. Therefore, by filtering the longest 20% of traces, it was enough to check only (20 choose 2) pairs instead of (100 choose 2) pairs in order to reach the successful pair. The results of these simulations are depicted in the accompanying figures.
This observation led us to the first filter in Algorithm 3, where we pick the longest 20% of traces of the cluster. The second filter computes a cost function (in linear time complexity), to be explained below, on a given r-tuple of traces in order to evaluate whether the traces in this r-tuple are likely to have an SCS of length n. Thus, the algorithm skips the SCS computation of r-tuples that are less likely to have an SCS of length n. First, before performing the first filter, the algorithm calculates the average length of the traces in the cluster and uses it to estimate the deletion probability p. Then, if p > 0.1, the algorithm calculates the cost function on every r-tuple and checks if it is higher than some fixed threshold. This threshold depends on the estimated value of p, and the cost function is based on a characterization of the sequences, as will be described in Section 0.11.2.
In this section we present Algorithm 3. We list here the steps that are different from Algorithm 1. In Step 2 the algorithm estimates the deletion probability in the cluster by checking the average length of the traces n′ and then calculates the estimated deletion probability as p = 1 − n′/n.
In Step 3, the algorithm filters the cluster so it contains only the longest 20% of traces. The last difference between Algorithm 3 and Algorithm 1 can be found in Step 4-b. In this step, before the computation of the SCS of a given r-tuple of traces, the algorithm computes the k-mer cost function (for k-mers of size k = 2) and checks if it is larger than the threshold Tp.
We evaluated the performance of Algorithm 3 and verified our filters by simulations. Each simulation consisted of 10,000 clusters of size t = 100, the length of the original strand was n = 200, the alphabet size was q = 4, and the deletion probability p was in the range [0.01, 0.2]. Algorithm 3 reconstructed the exact sequence x in all of the tested clusters. A comparison between the runtimes of Algorithm 1 and Algorithm 3 can be found in the accompanying figures.
The k-mer vector of a sequence y, denoted by k-mer(y), is a vector that counts the frequency in y of each substring of length k (k-mer). The frequencies are ordered in a lexicographic order of their corresponding k-mers. For example, for a given sequence y = "ACCTCC" and k = 2, its k-mer vector is k-mer(y) = 0100020100000101, according to the following calculation of the frequencies {AA:0, AC:1, AG:0, AT:0, CA:0, CC:2, CG:0, CT:1, GA:0, GC:0, GG:0, GT:0, TA:0, TC:1, TG:0, TT:1}. We define the k-mer distance between two sequences y1 and y2 as the L1 distance between their k-mer vectors. The k-mer distance is denoted by

d_k-mer(y1, y2) = ‖k-mer(y1) − k-mer(y2)‖₁.

For a given set of r sequences Y = {y1, y2, . . . , yr}, we define its k-mer cost function, denoted by c_k-mer(y1, y2, . . . , yr), as the sum of the k-mer distances of each pair of sequences in Y. That is,

c_k-mer(y1, y2, . . . , yr) = Σ_{1≤i<j≤r} d_k-mer(yi, yj).
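A short sketch of these definitions (with k = 2 and the DNA alphabet assumed):

```python
from itertools import product

def kmer_vector(y: str, k: int = 2, alphabet: str = "ACGT"):
    """Frequencies of the k-mers of y, in lexicographic order of the k-mers."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(y) - k + 1):
        counts[y[i:i + k]] += 1
    return [counts[m] for m in sorted(counts)]

def kmer_distance(y1: str, y2: str, k: int = 2) -> int:
    """L1 distance between the k-mer vectors of y1 and y2."""
    return sum(abs(a - b) for a, b in zip(kmer_vector(y1, k), kmer_vector(y2, k)))

def kmer_cost(traces, k: int = 2) -> int:
    """k-mer cost of a tuple of traces: the sum over all pairs."""
    return sum(kmer_distance(traces[i], traces[j], k)
               for i in range(len(traces)) for j in range(i + 1, len(traces)))

# The example from the text:
# kmer_vector("ACCTCC") == [0,1,0,0, 0,2,0,1, 0,0,0,0, 0,1,0,1]
```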
Observe that the k-mer distance between a sequence x and a trace y1 which results from x by one deletion is at most 2k−1. Every deleted symbol in x decreases the value of at most k entries in k-mer(x) and increases the value of at most k−1 of the entries. Hence, each deletion increases the k-mer distance by at most 2k−1, which means that an upper bound on the k-mer distance between the original strand x and a trace yi with np deletions is np(2k−1). However, when comparing the k-mer distance of two traces, y1 and y2, with more than one deletion, the k-mer distance can also decrease. An example of such a case is depicted in the accompanying figures.
This section studies the DNA reconstruction problem. Assume that a cluster consists of t traces, y1, y2, . . . , yt, where all of them are noisy copies of a synthesized strand. This model assumes that every trace is a sequence that is independently received by the transmission of a length-n sequence x (the synthesized strand) through a deletion-insertion-substitution channel with some fixed probability pd for deletion, pi for insertion, and ps for substitution. Our goal is to propose an efficient algorithm which returns x̂, an estimation of the transmitted sequence x, with the intention of minimizing the edit distance between x and x̂. In our simulations, we consider several values of t and a wide range of error probabilities, as well as data from previous DNA storage experiments.
Before we present the algorithms, we list here several more notations and definitions. An error vector of y and x, denoted by EV(y, x), is a vector of a minimum number of edit operations to transform y to x. Each entry in EV(y, x) consists of the index in y, the original symbol at this index, and the edit operation; in case the operation is an insertion or a substitution, the entry also includes the inserted or substituted symbol, respectively. Note that for two sequences y and x, there can be more than one sequence of edit operations to transform y to x. The edit distance between a pair of sequences is computed using a dynamic programming table, and the error vector is computed by backtracking on this table. Hence, EV(y, x) is not unique in general, but it can be made unique by giving priorities to the different operations in case of ambiguity. That is, if there is an entry in the vector EV(y, x) (processed from the last entry to the first) where more than one edit operation can be selected, then the operation is selected according to these given priorities. The error vector EV(y, x) also maps each symbol in y to a symbol in x (and vice versa). We denote this mapping as V_EV(y, x): {1, 2, . . . , |y|} → {1, 2, . . . , |x|} ∪ {?}, where V_EV(y, x)(i) = j if and only if the i-th symbol in y appears as the j-th symbol in x, with respect to the error vector EV(y, x). Note that in the case where the i-th symbol in y was classified as a deleted symbol in EV(y, x), V_EV(y, x)(i) = ?. This mapping can also be represented as a vector of size |y|, where the i-th entry in this vector is V_EV(y, x)(i). The reversed cluster of a cluster C, denoted by CR, consists of the traces in C where each one of them is reversed.
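The following sketch computes one such error vector by backtracking on the dynamic-programming table, breaking ties with fixed priorities (here substitutions, then insertions, then deletions, matching the choice used by Algorithm 6 below); the entry format (index in y, original symbol, operation, new symbol) mirrors the definition above.

```python
def error_vector(y: str, x: str):
    """One minimum-length sequence of edit operations transforming y into x.
    Ties during backtracking are broken by fixed priorities (substitution,
    then insertion, then deletion), which makes the result unique."""
    m, n = len(y), len(x)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i][j - 1] + 1, d[i - 1][j] + 1)
    ops, i, j = [], m, n               # backtrack from the last entry to the first
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (y[i - 1] != x[j - 1]):
            if y[i - 1] != x[j - 1]:   # substitution (copies are not recorded)
                ops.append((i - 1, y[i - 1], "substitute", x[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append((i, None, "insert", x[j - 1]))  # insert before index i of y
            j -= 1
        else:
            ops.append((i - 1, y[i - 1], "delete", None))
            i -= 1
    return list(reversed(ops))
```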
In this section we present Algorithm 4, the LCS-anchor algorithm. The algorithm receives C, a cluster of traces sorted by their lengths, from the closest to n to the farthest from n. First, the algorithm initializes x̂ and x̂_bck as length-n sequences of the symbol '-'. Second, the algorithm computes lcs, an arbitrary LCS of y1 and y2, the two traces in the cluster whose lengths are closest to n. Then, for each of the t traces in the cluster, yk, the algorithm computes EV(yk, lcs) and the mapping vector V_EV(lcs, yk). For 1 ≤ i ≤ |lcs| the algorithm performs a majority vote on the i-th entries of the t mapping vectors. If the majority is j ≠ ?, the algorithm writes the symbol lcs(i) in the j-th index of x̂. If j = ?, the symbol lcs(i) is omitted, and it is not written in x̂. At this point, Algorithm 4 has written into x̂ at most LCS(y1, y2) symbols; these symbols serve as "anchor" symbols in the estimated string. Each of the anchor symbols is located in a specific index of x̂, and the rest of the symbols in x̂ are '-'. Note that the anchor symbols are not necessarily placed in consecutive indices of x̂. In the following steps, Algorithm 4 computes, for all yk ∈ C, the vectors EV(x̂, yk) and V_EV(x̂, yk). Then, for each h, an entry of '-' in x̂, the algorithm performs a majority vote on yk(V_EV(x̂, yk)(h)) to find the most frequent symbol in this entry, and saves it in the h-th entry of x̂. Lastly, the algorithm performs these steps on CR and saves the resulting sequence in x̂_bck. Algorithm 4 returns as output an estimation derived from x̂ and x̂_bck.
In this section we present Algorithm 5. The algorithm receives a cluster of t traces C and the design length n. Algorithm 5 uses several methods to revise the traces from the cluster and to generate from the revised traces a multiset of candidates. Then, Algorithm 5 returns the candidate that is most likely to be the original sequence x. The methods used to revise the traces are described in this section as Algorithm 6 and Algorithm 7. Algorithm 5 invokes Algorithm 6 and Algorithm 7 on the cluster in two different procedures as described in Algorithm 8 and Algorithm 9.
The first method is described in Algorithm 6. The algorithm receives C, a cluster of t traces, the design length n, and yi, a trace from the cluster. Algorithm 6 calculates for every 1 ≤ k ≤ t, k ≠ i, the vector EV(yi, yk). In some cases, there may be more than one error vector for EV(yi, yk) that corresponds to the edit operations transforming yi to yk. In these cases, the algorithm prioritizes substitutions, then insertions, then deletions, in order to choose one unique vector. (These priorities were selected to support our definition of the deletion-insertion-substitution channel. However, for practical uses, one can easily change these priorities if some preliminary knowledge of the error rates in the data is given.) Then, the algorithm performs a majority vote in each index on these vectors and creates S, which is a vector of edit operations. Lastly, Algorithm 6 performs the edit operations on yi and returns the revised trace as an output for Algorithm 8 and Algorithm 9. Algorithm 6 is used as a procedure in Algorithm 8 and Algorithm 9 to correct substitution and insertion errors of the traces in the cluster.
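A simplified sketch of this procedure, under our reading of the majority vote (one vote per index of yi per trace, reusing error_vector from the sketch above; end-of-string insertions are handled separately):

```python
from collections import Counter

def majority_revise(cluster, i):
    """Algorithm-6-style revision of trace y_i (a sketch of our reading):
    compute EV(y_i, y_k) against every other trace, take the most common
    operation at each index of y_i, and apply the agreed operations."""
    yi = cluster[i]
    votes = [Counter() for _ in range(len(yi) + 1)]
    for k, yk in enumerate(cluster):
        if k == i:
            continue
        seen = set()
        for idx, _, op, sym in error_vector(yi, yk):
            if idx not in seen:                 # one vote per index per trace
                votes[idx][(op, sym)] += 1
                seen.add(idx)
        for idx in range(len(yi)):              # untouched indices vote "copy"
            if idx not in seen:
                votes[idx][("copy", None)] += 1
    out = []
    for idx, ch in enumerate(yi):
        if not votes[idx]:
            out.append(ch)
            continue
        (op, sym), _ = votes[idx].most_common(1)[0]
        if op == "substitute":
            out.append(sym)
        elif op == "insert":
            out.append(sym); out.append(ch)     # insert before this symbol
        elif op != "delete":                    # "copy"
            out.append(ch)
    if votes[len(yi)]:                          # insertion after the last symbol
        (op, sym), _ = votes[len(yi)].most_common(1)[0]
        if op == "insert":
            out.append(sym)
    return "".join(out)
```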
The second method is described in Algorithm 7. Similarly to Algorithm 6, Algorithm 7 receives C, a cluster of t traces, the design length n, and yk, a trace from the cluster. Algorithm 7 uses similar patterns (defined in Section 0.13.1) on each pair of traces and creates a weighted graph from them. Each vertex of the graph represents a pattern, and an edge connects patterns with identical prefix and suffix. The weight on each edge represents the frequency of the incoming pattern, i.e., the number of pairs of traces in the cluster that have this pattern in their sequences. Algorithm 7 is described in detail in Section 9A.3.1. Algorithm 7 is used as a procedure in Algorithm 8 and Algorithm 9 to correct deletion errors in the traces in the cluster.
Algorithm 8 receives a cluster of t traces C and the design length n. Algorithm 8 performs k cycles, where in each cycle it iterates over all the traces in the cluster. For each trace yk, it first uses Algorithm 6 to correct substitution errors, then it uses Algorithm 7 to correct deletion errors, and lastly, it uses Algorithm 6 to correct insertion errors. When it finishes iterating over the traces in the cluster, Algorithm 8 updates the cluster with all the revised traces and continues to the next cycle. At the end, Algorithm 8 performs the same procedure on CR. Algorithm 8 returns a multiset of all the revised traces.
Algorithm 9 also receives a cluster of t traces C and the design length n. Algorithm 9 uses the same procedures as Algorithm 8. However, in each cycle, it first corrects substitutions in all of the traces in the cluster using Algorithm 6, then it invokes Algorithm 7 on each trace to correct deletions, and finally invokes Algorithm 6 to correct insertions. Similarly to Algorithm 8, Algorithm 9 performs the same operations also on CR and returns a multiset of the results.
Algorithm 5 invokes Algorithms 8 and 9 with k = 2 cycles and combines the resulting multisets into the multiset S. If one or more sequences of length n exist in the multiset S, it returns the one that minimizes the sum of edit distances to the traces in the cluster. Otherwise, it checks if there are sequences of length n−1 or n+1 in S, and returns the most frequent among them. If such a sequence does not exist, it returns the first sequence in S.
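The candidate-selection rule of Algorithm 5 can be sketched as follows, reusing edit_distance from the earlier sketch (S is assumed to be given as a list, so that "the first sequence" is well defined):

```python
from collections import Counter

def select_candidate(S, cluster, n):
    """Prefer a length-n candidate minimizing the sum of edit distances to
    the traces; otherwise the most frequent candidate of length n-1 or n+1;
    otherwise the first candidate in S."""
    exact = [s for s in S if len(s) == n]
    if exact:
        return min(exact, key=lambda s: sum(edit_distance(s, y) for y in cluster))
    near = Counter(s for s in S if len(s) in (n - 1, n + 1))
    if near:
        return near.most_common(1)[0][0]
    return S[0]
```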
In this section we present Algorithm 7, the Pattern-Path algorithm. Algorithm 7 is used to correct deletion errors. Denote by w an arbitrary LCS sequence of x and y of length ℓ. The sequence w is a subsequence of x, and hence all of its symbols appear in some indices of x; assume these indices are given by i_1^x < i_2^x < . . . < i_ℓ^x. (A subsequence can have more than one set of such indices, while the number of such sets is defined as the embedding number; see the work by Elzinga et al. In our algorithm, we choose one of these sets arbitrarily.) Furthermore, we also define the embedding sequence of w in x, denoted by u_x,w, as a sequence of length |x| where for 1 ≤ j ≤ ℓ, u_x,w(i_j^x) equals x(i_j^x), and otherwise it equals ?.
The gap of x, y and their LCS sequence w at index 1 ≤ j ≤ |x|, with respect to u_x,w and u_y,w, is denoted by gap_{u_x,w, u_y,w}(j). Let i_1^y < i_2^y < . . . < i_ℓ^y be a set of indices of y such that w(j′) = y(i_j′^y) for 1 ≤ j′ ≤ ℓ. Given such a set of indices, gap_{u_x,w, u_y,w}(j) is the subsequence of y that appears between the symbols of y matched to the LCS symbols surrounding index j of x.
The pattern of x and y with respect to the LCS sequence w, its embedding sequences u_x,w and u_y,w, an index 1 ≤ i ≤ |x|, and a length m ≥ 2, denoted by Ptn(x, y, w, u_x,w, u_y,w, i, m), is defined as

Ptn(x, y, w, u_x,w, u_y,w, i, m) ≜ (u_x,w(i−1), gap_{u_x,w,u_y,w}(i−1), u_x,w(i), gap_{u_x,w,u_y,w}(i), . . . , u_x,w(i+m−2)),

that is, m consecutive symbols of u_x,w interleaved with the gaps between them. We also define the prefix and suffix of a pattern Ptn(x, y, w, u_x,w, u_y,w, i, m) to be the pattern without its last, respectively first, symbol of u_x,w and its adjacent gap. Finally, we define

P(x, y, w, u_x,w, u_y,w, m) ≜ {Ptn(x, y, w, u_x,w, u_y,w, i, m) : 1 ≤ i ≤ |x|}.
Algorithm 7 receives a cluster C of t traces and one of the traces in the cluster, yk. First, the algorithm initializes L[yk], which is a set of |yk| empty lists. For 1 ≤ i ≤ |yk|, the i-th list of L[yk] is denoted by L[yk]_i. Algorithm 7 pairs yk with each of the other traces in C. For each pair of traces, yk and yh, Algorithm 7 computes an arbitrary LCS sequence w and arbitrary embedding sequences u_yk,w and u_yh,w. Then, for 1 ≤ i ≤ |yk|, the algorithm saves Ptn(yk, yh, w, u_yk,w, u_yh,w, i, m) in the list L[yk]_i. The patterns in the lists form the vertices of a weighted graph, and the weight of an edge e incoming to a pattern at index i is

w(e) = |{Ptn ∈ L[yk]_i : Ptn = Ptn(yk, yh, w, u_yk,w, u_yh,w, i, m)}|,

which is the number of appearances of Ptn(yk, yh, w, u_yk,w, u_yh,w, i, m) in the list L[yk]_i.
The graph also includes a source vertex S and a sink vertex U; the vertex U has incoming edges from all vertices of the last index, and the weight of each such edge is zero.
Algorithm 7 finds a longest path from S to U in the graph. This path induces a sequence, denoted by ŷk, that consists of patterns of yk where some of them include gaps. The algorithm returns ŷk, which is a revised version of yk, while adding to the original sequence of yk the gaps that are inherited from the longest path vertices.
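Because the pattern graph is layered by the indices of yk, the heaviest path can be found with a single dynamic-programming pass; the following generic sketch assumes the vertices are given layer by layer and that weight(u, v) returns the weight of the edge u → v, or 0 when the prefix/suffix of the patterns do not match.

```python
def heaviest_path(layers, weight):
    """Heaviest path through a layered DAG (the core step of Algorithm 7).
    layers[i] is the list of vertices (patterns) at index i; weight(u, v) is
    the weight of the edge u -> v, or 0 if there is no edge."""
    # best[i][v] = (score of the heaviest path reaching v in layer i, parent)
    best = [{v: (0, None) for v in layers[0]}]
    for i in range(1, len(layers)):
        cur = {}
        for v in layers[i]:
            cands = [(best[i - 1][u][0] + weight(u, v), u)
                     for u in best[i - 1] if weight(u, v) > 0]
            if cands:
                cur[v] = max(cands)
        best.append(cur)
    if not best[-1]:
        return None
    v = max(best[-1], key=lambda u: best[-1][u][0])
    path = [v]                                   # backtrack through the parents
    for i in range(len(layers) - 1, 0, -1):
        v = best[i][v][1]
        path.append(v)
    return path[::-1]
```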
Example 1. We present here a short example of the definitions above and of Algorithm 7. The original strand in this example is x and the cluster of traces is C = {y1, . . . , y5}. Note that the original length is n = 10. The traces are noisy copies of x and include deletions, insertions, and substitutions. In this example Algorithm 7 receives the cluster C and the trace yk = y1 as its input.
L[y1]1={GT:3, GX:1}.
L[y1]2={GTA:3, GXcA:1}.
L[y1]3={TAX:2, TAG:1, XcAX:1}.
L[y1]4={AXG:2, AGX:1, AXX:1}.
L[y1]5={XGT:2, GXX:1, XXT:1}.
L[y1]6={GTG:1, GTX:1, XXcG:1, XTG:1}.
L[y1]7={TGC:2, TXC:1, XcGC:1}.
L[y1]8={GCC:2, XCC:1, GCtG:1}.
L[y1]9={CCtG:2, CCa:1, CtCtG:1}.
L[y1]10={CtG:3, CaG:1}.
It is not hard to observe that the longest path in the pattern path graph of this example is:
GT→GTA→TAX→AXG→XGT→GTG→TGC→GCC→CCtG→CtG→G,
and the algorithm output will be ŷ1 = GTAGTGCCTG = x.
In this section we present an evaluation of the accuracy of Algorithm 4 and Algorithm 5 on simulated data and on data from previous DNA storage experiments. We also implemented the lookahead algorithms by Gopalan et al. and by Viswanathan et al., as well as our variation of the BMA algorithm that supports also insertion and substitution errors, which is referred to as the Divider BMA algorithm. (For the window size w of the algorithm by Gopalan et al. we used the parameters 2 ≤ w ≤ 4, and we present the results for all of them. For the algorithm by Viswanathan et al. the parameters we used were a lookahead window of 5, δ = (1+ps)/2, r = 2, and γ = ¾, while for the data of the DNA storage experiments the substitution probability ps was taken from SOLQC.)
The Divider BMA algorithm receives a cluster and the design length n. The Divider BMA algorithm divides the traces of the cluster into three sub-clusters by their lengths: traces of length n, traces of length smaller than n, and traces of length larger than n. It performs a majority vote on the traces of length n. Then, similarly to the technique presented in the BMA algorithm and in the lookahead algorithm by Gopalan et al., the Divider BMA algorithm performs a majority vote on the sub-cluster of traces of length smaller than n, while detecting and correcting deletion errors. Lastly, the Divider BMA algorithm uses the same technique on the traces of length larger than n to detect and correct insertion errors.
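A sketch of the first step, a symbol-wise majority over the traces of length exactly n (the treatment of the shorter and longer sub-clusters requires the BMA-style shift detection and is omitted here):

```python
from collections import Counter

def divider_bma_partition(cluster, n):
    """Partition the cluster by trace length and majority-vote the length-n
    sub-cluster column by column (a sketch of the first step only)."""
    same = [y for y in cluster if len(y) == n]
    shorter = [y for y in cluster if len(y) < n]   # needs deletion correction
    longer = [y for y in cluster if len(y) > n]    # needs insertion correction
    if not same:
        return None, shorter, longer
    est = "".join(Counter(col).most_common(1)[0][0] for col in zip(*same))
    return est, shorter, longer
```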
We compare the edit error rates and the success rates of all the algorithms. In all of the simulations, Algorithm 5 presented significantly smaller edit error rates and higher success rates. The results on the simulated data are depicted in Figure 6.
We evaluated the accuracy of Algorithm 4 and Algorithm 5 by simulations. First, we present our interpretation of the deletion-insertion-substitution channel. In our deletion-insertion-substitution channel, the sequence is transmitted symbol-by-symbol. First, the channel checks for an insertion error before the transmitted symbol: the channel flips a coin, and with probability pi an insertion error occurs before the transmitted symbol. If an insertion error occurs, the inserted symbol is chosen uniformly. Then, the channel checks for a deletion error: it again flips a coin, and with probability pd the transmitted symbol is deleted. Lastly, the channel checks for a substitution error: the channel flips a coin, and with probability ps the transmitted symbol is substituted by another symbol. The substituted symbol is chosen uniformly. In case both deletion and substitution errors occur in the same symbol, we refer to it as a substitution.
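A direct simulation of this channel interpretation (a sketch; the DNA alphabet is assumed):

```python
import random

def dis_channel(x: str, pi: float, pd: float, ps: float,
                alphabet: str = "ACGT") -> str:
    """Deletion-insertion-substitution channel under the interpretation
    above: per transmitted symbol, first a possible insertion before it,
    then a possible deletion, then a possible substitution; a deletion and
    a substitution on the same symbol count as a substitution."""
    out = []
    for ch in x:
        if random.random() < pi:                 # insertion before the symbol
            out.append(random.choice(alphabet))
        if random.random() < pd:                 # deletion of the symbol
            if random.random() < ps:             # deletion + substitution ->
                out.append(random.choice(        # treated as a substitution
                    [a for a in alphabet if a != ch]))
            continue
        if random.random() < ps:                 # substitution
            out.append(random.choice([a for a in alphabet if a != ch]))
        else:
            out.append(ch)
    return "".join(out)
```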
We simulated 100,000 clusters of sizes t = 6, 10, 20; the sequence length was n = 100, and the alphabet size was q = 4. The deletion, insertion, and substitution probabilities were all identical and ranged between 0.01 and 0.1. This means that the actual error probability of each symbol was 1 − (1−pi)(1−ps)(1−pd) and ranged between 0.029701 and 0.271. We reconstructed the original sequences of the clusters using Algorithm 4, Algorithm 5, and the algorithms from Gopalan et al. and from Viswanathan et al. For each algorithm we evaluated its edit error rate, the success rate, and the value of k1_succ. The edit error rate of Algorithm 5 was the lowest among the tested algorithms, while the algorithm from Viswanathan et al. presented the highest edit error rates. Moreover, it can be seen that Algorithm 5 presented significantly lower edit error rates for higher values of error probabilities. In addition, Algorithm 5 also presented the lowest value of k1_succ. For example, when the cluster size was t = 20 and the error probability was p = 0.142625, the value of k1_succ of Algorithm 5 was 2, while the other algorithms presented k1_succ values of at least 12. The results of these simulations for cluster sizes of t = 10 and t = 20 can be found in the accompanying figures.
In this section we present the results of the tested algorithms on data from previous DNA storage experiments. The clustering of these data sets was done with the SOLQC tool. We ran each of the tested algorithms on the data and evaluated the edit error rates. Note that in order to reduce the runtime of Algorithm 5, we filtered clusters of size t > 25 to keep only the first 25 traces. Also here, Algorithm 5 presented the lowest edit error rates in almost all of the tested data sets. These results are depicted in the accompanying figures.
We evaluated the performance of the different algorithms discussed in this paper. The performance evaluation was performed on our server with an Intel(R) Xeon(R) CPU E5-2630 v3 2.40 GHz. We implemented our algorithms as well as the previously published algorithm from Gopalan et al., which presented the second-lowest error rates in our results from Section 0.14. In order to present a reliable performance evaluation, the clusters in our experiments were reconstructed in serial order. However, it is important to note that for practical uses, additional performance improvements can be made by running the algorithms on the different clusters in parallel, shortening the running time.
We presented in this paper several new algorithms for the deletion DNA reconstruction problem and for the DNA reconstruction problem. While most of the previously published algorithms use symbol-wise majority approaches, our algorithms look globally at the entire sequence of the traces and use the LCS or SCS of a given set of traces. Our algorithms are designed to specifically support DNA storage systems and to reduce the edit error rate of the reconstructed sequences. According to our tests on simulated data and on data from DNA storage experiments, we found that our algorithms significantly reduced the error rates compared to the previously published algorithms. Moreover, our algorithms performed even better when the error probabilities were high, while using fewer traces than the other algorithms. Even though our algorithms improve previous results, there are still several challenges that need to be addressed in order to fully solve the DNA reconstruction problem. Some of these challenges are listed as follows.
In this section we present Algorithm 0.17. Algorithm 0.17 receives a cluster of t traces C, the design length n, and a trace from the cluster, yj ∈ C. The algorithm calculates for each 1 ≤ k ≤ t, k ≠ j, the vector EV(yj, yk), which is a vector of edit operations to transform yj to yk. The set of such vectors is denoted by EV_yj. For 1 ≤ i ≤ n and for any two symbols u, v ∈ {A, C, G, T}, we define the conditional probability of symbol v in index i+1, given that the i-th symbol is u, where w[i] refers to the i-th symbol of a vector w. Following that, the algorithm calculates the set Post_{i,u,v,C} of the vectors in EV_yj that are consistent with symbol u at index i and symbol v at index i+1. Based on these probabilities, we can now define the edit graph G_yj, where the length of the vectors is bounded by 2|yj| + 1; for 1 ≤ i ≤ 2|yj| + 1 and for any two symbols u, v, the algorithm calculates Pr(v|u). Finally, Algorithm 0.17 calculates the heaviest path of length n+1 in the edit graph G_yj.
Edit Vector. When editing string X to string Y, we use four operations: insert letters of Y before a letter of X, delete a letter of X, substitute a letter of X for a letter of Y, and copy a letter of X. These operations can be divided into two categories. In-place edit operations are operations on a letter of X: delete the letter, substitute it, or copy it. Out-of-place edit operations insert a substring of Y before a letter of X. If X is a string of length m, we can regard any series of edit operations which edits X to string Y as edit strings: m strings for the in-place operations on each letter of X and m+1 strings for the out-of-place operations, i.e., inserts, before each letter of X and after the last one (before the end of the string). The in-place edit strings are: copy letter x of X: x; substitute letter x of X for letter y of Y: y; delete letter x of X: the empty string. The out-of-place edit strings are: insert string s before a letter of X: s (the empty string for no insert). So, given a series of edit operations which edits X to string Y, we define a vector v of 2m+1 strings: the in-place entries hold the edit strings of the in-place operations on the letters of X, the out-of-place entries hold the edit strings of the out-of-place operations before the letters of X, and the last entry holds the edit string of the before-the-end insert. Finally, we can define the edit vector of X to Y. Example: X=CJDHF, Y=ABCDEFG. We can edit X to Y with the following operations: insert AB before C, copy C, delete J, copy D, substitute H with E, copy F, insert G before the end. The resulting graph is illustrated in the accompanying drawings.
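A small sketch that reassembles Y from X and such a vector of 2m+1 edit strings, verified on the example above (the 0-based indexing is ours):

```python
def apply_edit_vector(x: str, v: list) -> str:
    """Reassemble Y from X and an edit vector of 2 * len(X) + 1 strings:
    entries 0, 2, 4, ... are the out-of-place insert strings, entries
    1, 3, 5, ... are the in-place edit strings, and the last entry is the
    before-the-end insert."""
    assert len(v) == 2 * len(x) + 1
    out = []
    for i in range(len(x)):
        out.append(v[2 * i])        # insert before letter i (may be empty)
        out.append(v[2 * i + 1])    # copy / substitute / delete of letter i
    out.append(v[-1])               # insert before end of string
    return "".join(out)

# The example above: insert "AB" before C, copy C, delete J, copy D,
# substitute H -> E, copy F, insert "G" before the end.
v = ["AB", "C", "", "", "", "D", "", "E", "", "F", "G"]
assert apply_edit_vector("CJDHF", v) == "ABCDEFG"
```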
In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims. Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures may be implemented which achieve the same functionality. Any arrangement of components to achieve the same functionality is effectively "associated" such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as "associated with" each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being "operably connected," or "operably coupled," to each other to achieve the desired functionality. Furthermore, those skilled in the art will recognize that boundaries between the above described operations are merely illustrative. The multiple operations may be combined into a single operation, a single operation may be distributed in additional operations, and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments. Also for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner. However, other modifications, variations and alternatives are also possible. The specification and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word 'comprising' does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms "a" or "an," as used herein, are defined as one or more than one. Also, the use of introductory phrases such as "at least one" and "one or more" in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an." The same holds true for the use of definite articles. Unless stated otherwise, terms such as "first" and "second" are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.
The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage. While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention. It is appreciated that various features of the embodiments of the disclosure which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the embodiments of the disclosure which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable sub-combination. It will be appreciated by persons skilled in the art that the embodiments of the disclosure are not limited by what has been particularly shown and described hereinabove. Rather the scope of the embodiments of the disclosure is defined by the appended claims and equivalents thereof.
Number | Date | Country
---|---|---
63075753 | Sep 2020 | US