A single DNA strand (i.e. oligonucleotide) is a molecule made up of two polymers forming a sequence of complimentary pairs of nucleobases, each of which typically is one of four possible nucleotides denoted as A, C, G and T. Short DNA sequences can be synthesized easily and have been proposed for use in different applications in information processing, including high density information storage [2], molecular computation of hard combinatorial problems [1], and molecular barcodes to identify individual modules in complex chemical libraries [3]. These applications rely on the specific hybridization properties of the half-strands that result when a specially chosen set of DNA strands is denatured and allowed to re-bind. A single short strand of DNA is referred to as a ‘code word pair’, each being made up of two half-strands, a ‘code word’ and its Watson Crick complement. The key to success in all of these DNA information processing applications is the availability of a large set, or library of DNA code word pairs whose half-strands crosshybridize perfectly within each code-word pair, but poorly with the half-strands from all other code word pairs in the set, according to some hybridization metric. Thus, if all the code word pairs in the library are heated and denatured in the same compartment and then cooled and allowed to hybridize, only the original code words hybridize and are recovered. This effect can be used to obtain spatial addressability of the binding location of different DNA half-strands in a second set of DNA strands, if each type of half-strand in the second set is ligated to a different half-strand in the code word library, and if the mixture is allowed to bind onto a substrate that is treated with different localized areas each treated, or containing, or functionalized with, a different corresponding half-strand of the library set. Alternatively, a string of many half-strand code words can be constructed, and each half-strand code word either hybridized with its complement or not, thereby representing 1's or 0's, and encoding digital data onto the constructed strand.
The capability of hybridization between two oligonucleotides is determined by the base sequences of the hybridizing oligonucleotides, the location of potential mismatches, the molar concentrations of the strands, the temperature of the reaction, the salinity of the medium, and the length of the sequences [4]. The melting temperature (Tm) is a parameter that characterizes these factors [4]. It is defined as the temperature at which 50% of the DNA molecules have been separated to single strands, or denatured. Another closely related measure of the relative stability of a DNA duplex is its Gibbs free energy denoted as ΔGO. The nearest-neighbor (NN) model [8] was proven to be an effective and accurate estimation of the free energy. In [14], the concept of t-stem block insertion-deletion codes was introduced that captures the key aspects of the nearest neighbor model. In the same reference, a dynamic programming algorithm is presented to calculate the maximum weight of the t-stem common subsequence.
Search methods for DNA codes are extremely time-consuming [5], and this has limited research on DNA codeword design, especially for codes of length greater than about 12-14 bases. For example, the largest known DNA codeword library generated based on the edit distance constraints (an easier to determine metric than pair-wise nearest neighbor model) with length 16 and edit distance 10 consists of 132 pairs, and composing such codes can take several days on a cluster of processors.
Commonly assigned U.S. patent application Ser. No. 12/378,261, disclosed a novel accelerator for the composition of reverse complement, edit distance DNA codes of length 16. The invention disclosed in Ser. No. 12/378,261 incorporates a hardware genetic algorithm (GA), hardware edit distance calculation, and hardware exhaustive search, the latter of which extends an initial codeword library by doing a final scan across the entire universe of possible code words. An embodiment of the invention disclosed in Ser. No. 12/378,261 comprises a host PC, a hardware accelerator implemented in reconfigurable logic on a field programmable gate array (FPGA) and a software program running in a host PC that controls and communicates with the hardware accelerator. The embodiment uses a modified genetic algorithm (GA) that uses a locally exhaustive, mutation-only heuristic tuned for speed, and optional selection, mating, and decloning steps. The embodiment successfully reduced the DNA codeword library construction time from 6 days (on 10 Pentium processors) to 1.5 hours (on a notebook computer with FPGA card) and achieved an effective 1000× speed-up.
The edit distance metric only provides a first order approximation of the free energy of the DNA duplexes. To improve the quality of the code words, a more accurate method based on the thermodynamics of DNA duplex binding is needed.
The DNA molecule is a nucleic acid. It consists of two oriented oligonucleotide sequences. One end of it is denoted as 3′ and the other as 5′. There are four types of bases, denoted briefly as A, T, C, and G. Each base can pair up with only one particular base through hydrogen bonds: A+T, T+A, C+G and G+C. It is common to say that A and T are complementary to each other while C and G are complementary to each other. A Watson-Crick complement of a DNA sequence is another DNA sequence which replaces all the A with T or vise versa and replaces all the T with A or vise versa, and also switches the 5′ and 3′ ends. Referring to
The thermodynamics of binding of nucleic acids has been widely studied and reported in the literature. The nearest-neighbor (NN) model [8] was proven to be effective and accurate for the thermodynamic energy estimation. The NN model assumes that stability of a DNA duplex depends on the identity and orientation of neighboring base pairs. There are 10 possible NN pairs: AA/TT, AT/TA, TA/AT, CA/GT, GT/CA, CT/GA, GA/CT, CG/GC, GC/CG, and GG/CC. Based on the NN model, the total free energy change of a DNA duplex at temperature T can be calculated by the following equation:
where ΔGOT,initiation is the initiation energy, ΔGOT,symmetry is a parameter that reflects whether the duplex is self-complementary, ΔGOT,AT Terminal is a parameter that accounts for the differences between duplexes with terminal AT versus terminal GC, ΔGOT,stack(i) gives the thermodynamic energy of Watson-Crick NN duplex i. Their values at 37° C. are given in
Example: Using the values depicted in
can be calculated as:
ΣiεWC NN
While the parameters ΔGOT,initiation, ΔGOT,symmetry, and ΔGOT,AT Terminal can be obtained in a straightforward manner, the NN free energy (i.e. ΣiεWC NNΔGOT,stack(i)) is determined by the structure of the primary sequence of the DNA duplex.
The present invention provides an apparatus that accelerates the determination of NN free energy using reconfigurable hardware and applies it to hardware based DNA code word search.
Specifically, the present invention provides an apparatus that determines the maximum weight of the 2-stem common subsequence of two DNA oligonucleotides.
The present invention provides an apparatus for determining the maximum weight of the 2-stem common subsequence of two DNA olgiconucleotides which in turn the present invention uses to determine an estimate of the nearest neighbor Gibb's free energy of binding between DNA olgiconucleotides.
The apparatus of the present invention provides a computer, hardware accelerator, and a software program for the production of high quality DNA codeword libraries constructed with hybridization constraints based on nearest neighbor method estimates of the Gibb's free energy of binding between DNA olgiconucleotides.
a depicts a Watson-Crick duplex.
b depicts a non-Watson-Crick or crosshybridized (CH) duplex.
a depicts the secondary structures in the CH duplex for στ2=2, 3, 8.
b depicts the secondary structures in the CH duplex for στ2=2, 3, 7.
a depicts the structure of each systolic array cell, including its input/output and the computation implemented.
b depicts the overall architecture of a 2D systolic array as well as the data dependency and timing information.
a depicts library size versus time for search time under different crosshybridized duplex (CH) ranges.
b depicts the number of codeword pairs found in 200 seconds.
c depicts the time to find 400 codeword pairs under different CH ranges.
a depicts the runtime of a GA under different ranges.
b depicts the size of the codeword library found by a GA and the size of the final library.
a depicts the number of codeword pairs found in five (5) minutes for CH upper bounds ranging from 5 to 8.0.
b depicts the runtime to find 300 codeword pairs for CH upper bounds ranging from 8.5 to 10.
The present invention provides a hardware accelerator and method for implementing a nearest-neighbor based free energy calculation. The invention further provides a method to produce the maximum weight of the 2-stem common subsequence of two DNA oligonucleotides. In practice, the present invention would comprise a general purpose microprocessor or computer, a hardware accelerator, and a software program. One skilled in the art would appreciate that the hardware accelerator could be implemented as a reconfigurable Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC), or an embedded processor, or a Cell BroadBand Engine processor, or a Graphical Processing Unit, but by no means is limited to such implementations.
The determination of the Gibbs free energy of the DNA crosshybridized (CH) duplexes based on the nearest-neighbor model is key to the present invention. The invention's transformed algorithm preserves the physical data locality and hence is suitable to be implemented using systolic array. The invention provides a novel hardware accelerator for accelerating the discovery of DNA which binds according to thermodynamic constraints and which provides more than 250× speed-up compared to a software only implementation.
The description of the present invention herein adopts notations that are likewise used in references [1] and [14]. We use [n] to denote the set {0, . . . , n−1 } and (n) to denote the sequence 1, 2, . . . , n. We call α<(n) a string if and only if it is a subsequence of consecutive integers, e.g., α=i, i+1, . . . , i+k. Let [q]n denote the set of sequences of length n with entries in [q]. For x=x1, . . . , xn with xε[q]n and σ=i1, i2, . . . , ik where σ<(n), we use xσ<x to denote the subsequence xi
For σ<(n), a substring β<σ is called a block of σ if β is not a subsequence of any substring α of σ with β≠α. Denote σ as a sequence of blocks β1, β2, . . . , βi, . . . , βl, if the difference between the endpoint of βi and the starting point of βi+1 is greater than or equal to t, then σ is a t-gap sequence of (n). It is denoted as σεGt(n). Given σ<(n), τ<(m) with |σ|=|τ| and σεGt(n), τεGt(m), let f: σ→τ be a unique mapping, σ and τ are said to be t-gap block isomorphic (denoted as
if α<σ is stringf(α)<τ is a string. For xε[q]n and yε[q]m, if xσ=yτ and
then we say xσ and yτ are t-gap block isomorphic and denote it as
Definition 1 For 2≦t≦n−1 and x, y ε[q]n, we define the weight of the longest common t-gap block subsequence of x and y as
The weighted distance of two t-gap block insertion-deletion codes is defined as ΦΩ,qt(x,y)≡min(∥x∥Ω,∥y∥Ω)−φΩt(x,y). When t=1 and ∥xσ∥Ω=|xσ|, ΦΩ,q1(x,y) is the Levenshtein insertion-deletion distance.
The weight of the longest common t-gap block subsequence of x and y (i.e. φΩt(x,y)) can be calculated using dynamic programming. For x, y ε[q]n and t≦i, j≦n, let MΩ,i,jt denote φΩt(x[1,i], y[1,j]) and suf (x,y) denote the length of the longest common suffix between x and y. It is proved ([1][14]) that:
M
Ω,i,j
t=max(MΩ,i−1,jt,MΩ,i,j−1t,DΩ,i,jt}, (1)
where DΩ,i,jt is defined as
Given two sequences x, yε[q]n, xσ=yτ with a unique mapping f: σ→, a t-stem exists if and only if subsequences x[i,i+t−1]=y[f(i),f(i)+t−1]. Let στt be the sequence of the first index of all the t-stems. For xε[q]n, let dq(x[α,α+t−1]) be a unique number in [qt] to represent xαxα+1 . . . xα+t−1, we define x(t)ε[qt]n−t as a sequence whose ith element is equal to (dq(xi,i+t−1]). For 2≦t≦n−1, it can be proved that if |στt|≠0, then
Definition 2 Let Ω be a weight function on [qt], the maximum weighted t-stem common subsequence is defined as
The t-stem code distance is defined as Ψt(x,y)=min(∥x(t)∥Ω,∥y(t)∥Ω)−ψΩt(x,y).
It is proved ([1][14]) that the maximum weighted t-stem common subsequence of x and y is equal to the weight of the longest common t-gap block subsequence of x(t) and y(t), i.e. ψΩt(x,y)=φΩ,q
Let the CH duplex between x and
where
is a representation of
have a secondary structure, then its free energy of nearest neighbor stacked pairs can be calculated as ψΩ2(x,y), where the weight function Ω is equal to −ΔGOT,stack.
Example: Consider the CH duplex
It corresponds to strings x=AACGTAGAT and y=CGACGATGA. Because x[2,4][8,9]=y[3,5][6,7], we have σ=[2, 4][8, 9], τ=[3, 5][6, 7], στ2=2, 3, 8, and ∥xσ(2)=∥x2,3,8(2)=−ΔG37,stack(AC)−ΔG37,stack(CG)−ΔG37,stack(AT)=4.49 kcal/mol.
Let A, C, G, T be encoded as 0, 1, 2, and 3, then x=0, 0, 1, 2, 3, 0, 2, 0, 3 and y=1, 2, 0, 1, 2, 0, 3, 2, 0. x(2)=0, 1, 6, 11, 12, 2, 8, 3 and y(2)=6, 8, 1, 6, 8, 3, 14, 8. It is easy to see that x2,3,8(2)=y3,4,6(2). Let σ=2, 3, 8 and τ=3, 4, 6, because any string in C corresponds to a string in τ, and the gaps between blocks in σ and τ are equal to 2, we say
Note that although x2,3,7,8(2)=y3,4,5,6(2), because a string in τ=3, 4, 5, 6 does not necessarily correspond to a string in σ=2, 3, 7, 8, therefore, x2,3,7,8(2) and y3,4,5,6(2) are not t-gap block isomorphic. Using equation (1) and (2) we can find that ψΩ2(x,y)=φΩt(x(2),y(2))=∥x2,3,7(2)∥=−ΔG37,stack(AC)−ΔG37,stack(CG)−ΔG37,stack(GA)=4.91 kcal/mol. Referring to
The present invention provides an apparatus for executing a dynamic programming algorithm within a 2D systolic array implementation. The invention employs software implementable steps to determine the conditions defined by equations (1) and (2). Given a CH duplex
we define 3 matrices. They include a suffix matrix (S) which stores the length of the longest common suffix between x and y, a weighted suffix matrix (WS) which stores the accumulated weight of each common stem-2 and an energy matrix (E) which stores the accumulated free energy of the possible NNs. The value of the ijth entry of these matrices can be determined using the following equations.
The parameter w(α[i−1],α[i]) is the stack-pair free energy specified in
Example: Consider x=5′AATGA3′ and
(i.e. y=5′GTACC3′,) the matrix S, WS, and E can be calculated as the following and the NN free energy of the CH duplex is 2.33.
It is important to note that the values in
Systolic array processing has been widely used in parallel computing to enhance performance. Its general architecture is depicted in
Equations (3)-(5) cannot be directly mapped to a 2D systolic array architecture because to calculate eij we need the value of wsi−d,j−d(ei−d,j−d), 1≦d≦sij. The variable eij is calculated by processor P(i,j). The variables wsi−d,j−d and ei−d,j−d are calculated by processor P(i−d,j−d). If the calculation of eij is performed at clock period t, then the calculations of wsi−d,j−d and ei−d,j−d of the same DNA duplex are performed at clock period t−2d. Because cells in the systolic array will register the new input and update their results every 2 clock periods, it is not possible for us to access the data of wsi−d,j−d and ei−d,j−d at clock period t if d is greater than 1. One way to handle this problem is to memorize the values of wsi−d,j−d and ei−d,j−d by adding extra storage elements. Because the maximum value of sij can be as high as the length of the DNA strand, which in our case is 16 (for 16 mers, or 32 for 32 mers, etc.), this solution potentially would require duplicatation of each cell in the systolic array 16 times. This is not practical as it significantly increases the hardware cost.
In the present invention, function transformation is used to simplify the hardware design. We define a minimum weighted suffix matrix (MIN_WS) which stores the minimum value of the difference between wsi−d,j−d and ei−d−1, j−d−1, where 1≦d≦sij. The ijth entry of MIN_WS can be calculated as
when x[i]≠y[j], min_wsij will be set to an extremely large number (for example 1,000,000 or some other large number that can be represented in the computing platform), otherwise, it is the minimum between min_wsi−1,j−1 and wsij−ei−1,j−1. The calculation of eij and wsij is transformed into the following equations.
Equations (6)˜(8) are equivalent to equations (3)˜(5), however, only information from adjacent cells is needed in the calculation, hence, they can be implemented using the systolic array architecture.
The hardware design of the 2D systolic array can be derived directly from equations (6)˜(8). The systolic array is an n×n array of identical cells. Each cell in the array has 7 inputs, among which the inputs ei−1,j and x[i−1,j] are coming from the cell that is located above, the inputs ei,j−1 and y[i,j−1] are coming from the cell that is located to the left, and the inputs ei−1,j−1, wsi−1,j−1 and min_wsi−1,j−1 are coming from the cell that is located to the upper left. Each cell performs the computations that are described in equations (6)˜(8). For cell (i,j), the outputs xi,j and yi,j are equal to the inputs xi−1,j and yi,j−1.
The overall architecture of the 2D systolic array as well as the data dependency and timing information are shown in
The present invention considers each DNA codeword as a sequence of length n in which each symbol is an element of an alphabet of 4 elements. Let
denote the nearest neighbor free energy of duplex
The invention focuses on searching for a set of DNA codeword pairs S, where S consists of a set of DNA strands of length n and their reverse complement strands e.g. {(S1,
where g and range are user defined threshold called CH upper bound and CH range. Equation (8) indicates that our objective is to maximize the size of the DNA codeword library. Constraints (9)˜(10) specify that the NN free energy of any CH duplexes must be lower than or equal to g but greater than or equal to g-range. For any DNA duplex, the weakest stacked pair is the AT pair with ΔGO37,stack(AT)=0.88 and ΔGO37,stack(TA)=0.58. Therefore, for y, x<[A,C,G,T]16, min(∥x(2)∥Ω,∥y(2)∥Ω)=7*(0.58+0.88)+0.58=10.8. Based on definition 2, the weighted t-stem distance between x and y is greater than 10.8−g and less than
Therefore, constraints (9)˜(10) ensure that the t-stem distance between any non-WC pairs in the library is within the range [10.8−g+range, 10.8−g]. The range was initially introduced because we thought that adding code words that are too far away from the rest of the library will restrict the future growth of the library. Therefore, we only add code words that are “just good enough”. Later in the experiments we found that the range has little impact on the size of the library, however, it has a significant impact on the convergence speed of the GA. Other sets of constraints can be used within the spirit and scope of the present invention, for example, constraints based upon different minimums, maximums, and ranges (that might even be specified separately for intended (WC) and unintended (CH) binding), desired melting temperatures, GC content, freedom from hairpins, and other constraints used by those skilled in the art of DNA Codeword library design.
The present invention solves the longstanding optimization problem involved in assembling a large set of DNA codeword pairs by using a genetic algorithm. A genetic algorithm (GA) is a stochastic search technique based on the mechanism of natural selection and recombination. Solutions, which are also called individuals, are evolved from generation to generation, with selection, mating, and mutation operators that provide an effective combination of exploration of the global search space.
Given a codeword library S, the fitness of each individual d reflects how well the corresponding codeword fits into the current codeword library. Two values define the fitness, reject_num and max_match. The reject_num is the number of codewords in the library which do not satisfy the condition (9)˜(10) and
The max_match is the largest violation of any of the constraints for individual d that is measured across all possible crosshybridizations in the library. Typically a code is built by repeatedly picking a first codeword candidate with random contents, and checking the constraint in (9), until one is found that passes. Then, a population of random candidate next codewords is set up. The candidates in the GA population are checked with constraints (9)-(10). Any good codewords satisfying all constraints are added to the library, the best failing codewords are identified (by storing a number that is a weighted sum of, reject_num and max_match), The GA operators are applied to produce a mixture of old and new candidates, and the steps after the initial setup are iterated. Some new candidates are constructed from old individuals in the population, and some new random individuals are added. The present invention stops after a time limit, or generation limit, or a desired library size is attained. Note that seeding the random number generator (that is used to set up the initial GA population, control the GA operators, and add new individuals) with different random number seeds causes a different first individual, different GA population candidates, and results in different DNA libraries. Thus, the algorithm can be re-run many times to produce many different libraries that each satisfy the same set of constraints.
A traditional GA mutation function might randomly pick an individual in the population, randomly pick a pair of bits in the individual representing one of its 16 bases, and randomly change the base to one of the 3 other bases in the set of 4 possible bases. In the proposed algorithm, however, an individual is randomly selected, but then exhaustively checked for all of the 48 possible base changes (for 16 mers, 3 in each of 16 base positions). This is an attempt to speed beneficial evolution of the population by minimizing the overhead that would be associated with randomly picking this individual again and again in order to test those mutations. We also specify that if none of the 48 mutations are beneficial, optionally a random individual is generated to replace the original one, or one of the mutated individuals is used to replace the original anyway. For more details about the genetic algorithm and its hardware implementation, please refer to [9]. In the present embodiment, the architecture of the hardware GA presented in [9] is extended to incorporate consideration of nearest-neighbor free energy. The 2D systolic array is used as a fitness evaluation module and the main state machine controller of the GA is modified so that it checks all the constraints (9)˜(10). In the present embodiment a hardware GA, hardware code library building loop, and hardware fitness evaluation are employed, although other embodiments within the spirit and scope of the present invention are possible that use any or all of a software GA, a software DNA library word building loop, or a software fitness evaluator.
The present invention has been tested and experimental results have been generated. A hardware accelerator that uses a stochastic GA to build DNA codeword libraries of codeword length 16 has also been designed, implemented, and tested. The invention was implemented on a reconfigurable computing platform that comprises a desktop computer and an Annapolis WildStar-Pro FPGA board. The FPGA board is plugged into the PCI-X slot of the host system. The WildStar-Pro uses XC2VP70 FPGA that has 74,448 programmable logic cells. The hardware accelerator uses about 80% of the logic resource, and it runs at 45 MHz clock frequency. A hardware based code extender that uses exhaustive search to complete the codeword library generated initially by running the GA for a limited time has also been designed and implemented. All the code word libraries that have been found were verified using the online tool SynDCode [13]. Since GA is a stochastic algorithm, all results reported are the average of multiple runs.
The first set of experiments compares the performance of the hardware-based and the software-only DNA codeword search. Two search algorithms are implemented. They are denoted as “deterministic search” (DS) and “randomized search” (RS). The population size was set to 16. The population of the DS was initialized using 16 sequential data from 0x000003F0 to 0x000003FF, which corresponds to DNA codeword 3′AATTTAAAAAAAAAAA′5 and 3′TTTTTAAAAAAAAAAA′5, while the population of the RS was initialized randomly. When a new codeword is found, or when none of the mutated codewords has lower fitness than the original individual, a new individual is generated to replace the original one. In the DS, a counter is used to generate the new individual. The counter is initialized to 0x000006D6. In the RS, the new individual is generated randomly. The random search is more effective than the deterministic search. However, in order to compare the speed of hardware-based implementation and software-based implementation, the two systems should perform exactly the same computation tasks. This is achievable only with a deterministic algorithm that uses the same sequence of (pseudo) random numbers to run the algorithm. All experiments were run with g=8.5 and range=1.0. They were terminated after 300 code word pairs have been found.
Referring to
The second set of experiments evaluated the impact of CH range on the speed and quality (library size vs time) of the search. The CH range was varied from 0.05 to 3 and the GA based code word search was run with g=8.5.
The third set of experiments compares the search speed for different CH upper bounds (g). The CH upper bound was varied from 6.5 to 10.0 and run the GA-based code word search. The search was stopped when it found 300 code word pairs or the run time exceeded 15 minutes.
This patent application claims the priority benefit of the filing date of a provisional application Ser. No. 61/132,338, filed in the United States Patent and Trademark Office on Jun. 5, 2008.
The invention described herein may be manufactured and used by or for the Government of the United States for governmental purposes without the payment of any royalty thereon.
Number | Date | Country | |
---|---|---|---|
61132338 | Jun 2008 | US |