The programmable chemistry of nucleic acid base pairing enables the rational design of self-assembling molecular structures, devices, and systems. However, the ability to specify a target nucleic acid structure, and then design a nucleic acid sequence that will adopt this target structure is still challenging and computationally intensive.
For an RNA strand with N nucleotides, the sequence, φ, is specified by base identities φiε{A, C, G, U} for i=1, . . . , N (T replaces U for DNA). The secondary structure, s, of one or more interacting RNA strands can be defined by a set of base pairs (each a Watson Crick pair [A-U or C-G] or wobble pair [G-U]). Using the set of base pairs, a polymer graph for a secondary structure can then be constructed by ordering the strands around a circle, drawing the backbones in succession from 5′ to 3′ around the circumference with a nick between each strand, and drawing straight lines connecting paired bases.
A secondary structure is “pseudoknotted” if every strand ordering corresponds to a polymer graph with crossing lines. A secondary structure is connected if no subset of the strands is free of the others. An “ordered complex” corresponds to the unpseudoknotted structural ensemble, Γ, comprising all connected polymer graphs with no crossing lines for a particular ordering of a set of strands. For a secondary structure, sεΓ, the free energy, ΔG(φ, s), is calculated using nearest-neighbor empirical parameters for RNA in 1M Na+5 or for DNA in user-specified Na+ and Mg++ concentrations. This physical model provides the basis for rigorous analysis and design of equilibrium base-pairing in the context of the free energy landscape defined over ensemble Γ.
Using this secondary energy model, the equilibrium of a nucleic acid complex can be characterized. The equilibrium state of nucleic acid complex can be determined by calculating the partition function
over the unpseudoknotted structure ensemble Γ. Using the partition function, it is possible to then evaluate the equilibrium probability
of any secondary structure sεΓ. Here, kB is the Boltzmann constant and T is temperature. The secondary structure with the highest probability at equilibrium is the minimum free energy (MFE) structure, satisfying
The equilibrium structural features of ensemble Γ are quantified by the base-pairing probability matrix, P(φ), with entries Pi,j(φ)ε[0, 1] corresponding to the probability,
that base pair i·j forms at equilibrium. Here, S(s) defines a structure matrix with entries Si,j(s)ε{0, 1}. If structure s contains pair i·j, then Si,j(s)=1, otherwise Si,j(s)=0. For convenience, the structure and probability matrices are augmented with an extra column to describe unpaired bases. The entry Si,N+1(s) is unity if base i is unpaired in structure s and zero otherwise; the entry Pi,N+1(φ)ε[0, 1] denotes the equilibrium probability that base i is unpaired over ensemble Γ. Hence the row sums of the augmented S(s) and P(φ) matrices are unity.
The distance between two secondary structures, s1 and s2, is the number of nucleotides paired differently in the two structures:
The discrete delta function is defined as
with respect to secondary structure.
Although the size of the ensemble, Γ, grows exponentially with the number of nucleotides N, the MFE structure having the lowest energy, the partition function, and the equilibrium base-pairing probabilities can be evaluated using Θ(N3) dynamic programs.
For a given target structure, s, the determination of the nucleotide sequence that will produce the target structure s can be specified as an optimization problem, minimizing an objective function with respect to sequence, φ. Rather than seeking a global optimum, optimization can be terminated if the objective function is reduced below a prescribed stop condition.
One strategy to determine the lowest free energy sequence that corresponds to a particular target structure s is to minimize the MFE defect
corresponding to the distance between the MFE structure sMFE(φ) and the target structure s. This approach hinges on whether or not the equilibrium structural features of ensemble Γ are well-characterized by the single structure sMFE(φ), which in turn depends on the specific sequence φ. If μ(φ, s)=0, the target structure s is the most probable secondary structure at equilibrium. However, p(φ, s) can nonetheless be arbitrarily small due to competition from other secondary structures in Γ.
To address this concern, an alternative strategy to MFE defect minimization is to minimize the probability defect:
π(φ,s)=1−p(φ,s),
corresponding to the sum of the probabilities of all non-target structures in the ensemble Γ. If π(φ, s)≈0, the sequence design is essentially ideal because the equilibrium structural properties of the ensemble are dominated by the target structure s. However, as π(φ, s) deviates from zero, it increasingly fails to characterize the quality of the sequence because the probability defect treats all non-target structures as being equally defective. This property is a concern for challenging designs where it may be infeasible to achieve π(φ, s)≈0.
To address these shortcomings, still another strategy is to minimize the ensemble defect between the target structure s and the equilibrium properties of sequence φ. The ensemble defect
corresponds to the average number of incorrectly paired nucleotides at equilibrium calculated over ensemble Γ.
These three objective functions, ensemble defect minimization, MFE defect minimization and probability defect minimization, and can be cast into a unified form to highlight their differences:
Using n(φ, s) to perform ensemble defect optimization, the average number of incorrectly paired nucleotides at equilibrium is evaluated over ensemble Γ using p(φ, σ), the Boltzmann-weighted probability of each secondary structure σεΓ, and d(σ, s), the distance between each secondary structure σεΓ and the target structure s. By comparison, using μ(φ, s) to perform MFE defect optimization, p(φ, σ) is replaced by the discrete delta function δσ, sMFE, which is unity for SMFE and zero for all other structures σεΓ. Alternatively, using σ(φ, s) to perform probability defect optimization, d(σ, s) is replaced by the binary distance function (1−δσ, s), which is zero for s and 1 for all other structures σεΓ.
Hence, the MFE defect makes the optimistic assumption that sMFE will dominate Γ at equilibrium, while the probability defect makes the pessimistic assumption that all structures σεΓ with d(σ, s)≠0 are equally distant from the target structure s. The objective function n(φ, s) quantifies the equilibrium structural defects of sequence φ even when μ(φ, s) and π(φ, s) do not.
Described herein are systems and processes for designing the sequence of one or more interacting nucleic acid strands intended to adopt a target secondary structure at equilibrium. Sequence design is formulated as an optimization problem with the goal of reducing an ensemble defect below a user-specified stop condition. For a candidate sequence and a given target secondary structure, the ensemble defect is the average number of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of unpseudoknotted secondary structures. To reduce the computational cost of accepting or rejecting mutations to a random initial sequence, candidate mutations are evaluated on leaf nodes of a tree-decomposition of the target structure.
As described in more detail below, during leaf optimization, defect-weighted mutation sampling is used to select each candidate mutation position with probability proportional to its contribution to the ensemble defect of the leaf. As subsequences are merged moving up the tree, emergent structural defects resulting from crosstalk between sibling sequences are eliminated via reoptimization within the defective subtree starting from new random subsequences. Using a Θ(N3) dynamic program to evaluate the ensemble defect of a target structure with N nucleotides, this hierarchical approach implies an asymptotic optimality bound on design time. Thus, for sufficiently large N, the cost of sequence design is bounded below by 4/3 the cost of a single evaluation of the ensemble defect for the full sequence. Hence, the design process has time complexity Ω(N3). For target structures containing Nε{100, 200, 400, 800, 1600, 3200} nucleotides and duplex stems ranging from 1 to 30 base pairs, RNA sequence designs at 37° C. typically succeed in satisfying a stop condition with ensemble defect less than N/100. Empirically, the sequence design process exhibits asymptotic optimality and the exponent in the time complexity bound is sharp. This demonstrates a great improvement over prior systems that did not demonstrate this asymptotic optimality.
An embodiment is an electronic system for optimizing the sequence of a nucleic acid strand to adopt a specific target secondary structure at equilibrium. This embodiment of the system includes: an input for receiving a target secondary structure; a hierarchical structure decomposition module configured to decompose the target secondary structure at split points into a tree of parental nodes, child nodes and leaf nodes wherein each split point is formed within a duplex stem of the target structure; a leaf optimization module configured to determine a leaf nucleotide sequence of the target structure at each leaf node in the tree; and a merging module configured to recurse the nodes of the tree and merge the determined leaf nucleotide sequences at each node to optimize the nucleotide sequence of a nucleic acid strand that adopts the specific target secondary structure at equilibrium.
Yet another embodiment is a method in an electronic system for designing a nucleotide sequence to adopt a target secondary structure at equilibrium. This method includes: decomposing a target secondary structure of a nucleic acid molecule into a tree structure having leaves and nodes, wherein the decomposition takes place at splice points within duplex stems of the target secondary structure; designing a nucleotide sequence for each leaf within the binary tree; recursing the tree to merge and reoptimize the nucleotide sequence for each node of the tree; and determining the nucleotide sequence of the root node from the merged and reoptimized nucleotide sequences of the other nodes in the tree.
One other embodiment is an electronic system for designing a nucleotide sequence to adopt a target secondary structure at equilibrium. In this embodiment, the system includes means for decomposing a target secondary structure of a nucleic acid molecule into a tree having leaves and nodes, wherein the decomposition takes place at splice points within duplex stems of the target secondary structure; means for designing a nucleotide sequence for each leaf within the binary tree; means for recursing the tree to merge and reoptimize the nucleotide sequence for each node of the tree; and means for determining the nucleotide sequence of the root node from the merged and reoptimized nucleotide sequences of the other nodes in the tree.
Still another embodiment is a programmed storage device having instructions that when executed perform a method that includes: decomposing a target secondary structure of a nucleic acid molecule into a tree having leaves and nodes, wherein the decomposition takes place at splice points within duplex stems of the target secondary structure; designing a nucleotide sequence for each leaf within the binary tree; recursing the tree to merge and reoptimize the nucleotide sequence for each node of the tree; and determining the nucleotide sequence of the root node from the merged and reoptimized nucleotide sequences of the other nodes in the tree.
a-5c are plots of the structural features of test sets generated to characterize the nucleic acid design system.
a-6d are plots showing the performance of certain embodiments of the system as the target structure size is increased. Particularly,
a and 7b depict the degree leaf independence in the design process for the engineered and random test set, respectively. They show ensemble defects of the leaves vs the root ensemble defects for initial sequences, sequences in which the leaves are designed and sequences after merging and reoptimization.
a-8d are plots highlighting the effects of hierarchical structure decomposition and defect-weighted sampling on performance of certain embodiments on the engineered test set. Particularly,
a-9d show plots of the effects of sequence initialization on performance for certain of the embodiments. Particularly, four types of sequences are considered: entirely random sequences, sequences of random adenosine-uracil (AU), sequences of random guanine-cytosine (GC), and sequences satisfying sequence symmetry minimization (SSM).
a-10d show plots of the effects of a selected stop condition stringency on performance for certain embodiments of the invention. Particularly, five degrees of stop condition stringency are considered: fstop=0.001, fstop=0.005, fstop=0.01 fstop=0.05, and fstop=0.1. Each stringency was tested using the engineered test set.
a-11d show plots of the effects of single and multi-stranded structures on performance for certain embodiments of the invention. Structures were selected from the engineered test set.
a-12d show plots of the effects of the selected design material on performance for certain embodiments of the invention. Structures were selected from the engineered test set.
a-13d show plots of the effects of pattern prevention for certain of the embodiments described herein. Structures were selected from the engineered test set.
a plots the efficiency on engineered test structures when certain embodiments are parallelized.
a-15f are plots of the effects of different objective functions and variations on the embodiments described herein, specifically embodiments using single-scale methods to optimize the probability defect, single-scale methods to optimize the ensemble defect, hierarchical methods to optimize the MFE defect, and hierarchical methods to optimize the ensemble defect. These embodiments are all tested on the engineered test set. Particularly,
Embodiments of the invention relate to systems and processes for designing nucleic acid molecules that meet a target design criteria. In one embodiment, a single target unpseudoknotted structure having predetermined design criteria is input into the system. The target structure may include one or more nucleic acid molecule chains of unknown sequence. As described in more detail below, the system uses the target unpseudoknotted structure to optimize the base composition of one or more nucleotide sequences that will base pair together into a secondary structure that meets the target design criteria. This system is useful in order to design nucleic acid molecules that assemble in vitro or in vivo into target structures.
In one embodiment, the system is used to design small conditional RNAs that are used in a hybridization chain reaction (HCR). The HCR is described in U.S. Pat. No. 7,727,721 issued on Jun. 1, 2010, which is incorporated herein by reference in its entirety. These small conditional RNAs form target secondary structures that change conformation upon detection of a particular signal (Dirks R M, Pierce N A (2004) Triggered amplification by hybridization chain reaction. Proc Natl Acad Sci USA 101:15275-15278). For example, small conditional RNAs have been designed to self-assemble and disassemble along a predetermined pathway in order to perform dynamic functions (Yin, et al, (2008) Programming biomolecular self-assembly pathways Nature 451:318-322). Small conditional RNAs can cause selective cell death of cancer cells (Venkataramanm, et al, (2010) Selective cell death mediated by small conditional RNAs, Proc Natl Acad Sci USA 107:16777-16782) and have been used in multiplexed fluorescent in situ hybridization assays to map mRNA expression within intact biological samples (Choi, et al. (2010) Programmable in situ amplification for multiplexed imaging of mRNA expression Nature. Biotechnol. 28:1208-1212). The disclosed embodiments provide a nucleic acid molecule design process and system that achieves high design quality while maintaining a low computational overhead.
In one example, the system may use separate sub-processes to design a polynucleotide sequence φ of one or more strands that will form into a target secondary structure s at equilibrium. The system optimizes the ensemble defect of this target structure using several primary processes. The three main sub-processes in this embodiment are hierarchical structure decomposition of the target structure, leaf optimization with defect-weighted mutation sampling, and subsequence merging and reoptimization, each of which is described in more detail below. Equilibrium can be at 1M Na+, or designated by user-specified Na+ and Mg++ concentrations.
In one embodiment a target nucleotide sequence secondary structure is defined as a structure matrix S(s) with entries Si,j(s)ε{0, 1}. If structure s contains pair i·j, then Si,j(s)=1, otherwise Si,j(s)=0. The system performs hierarchical structure decomposition on the target secondary structure s, defect-weighted sampling on the leaves, and merging and reoptimization as it traverses up the decomposition tree. The hierarchical structure decomposition identifies stem structures within the target structure s, and divides the root structure into a tree of substructures. In one embodiment, the tree of substructures is a binary tree of substructures. The minimum number of nucleotides in a node can be set by the user. If a node can be divided into two nodes, both of which contain more than the minimum number of nucleotides, then the node is divided. For a given target secondary structure, s, with N nucleotides, we seek to design a sequence, φ, with ensemble defect, n(φ, s), satisfying the stop condition:
n(φ,s)≦fstopN,
for a user-specified value of fstopε(0, 1). Using ensemble defect analysis, candidate mutations in a test nucleotide sequence are evaluated at the leaves of the binary tree decomposition of the target structure. During leaf optimization, defect-weighted mutation sampling is used to select each candidate mutation position with probability proportional to its contribution to the ensemble defect of the leaf. Mutations are accepted if they decrease the leaf ensemble defect. Subsequences are merged moving up the tree and parental ensemble defects are evaluated. If emergent structural defects are encountered when merging subsequences, they are eliminated via defect-weighted child sampling and reoptimization. After all of the subsequences have been merged, the final result is a nucleotide sequence φ that has a minimized ensemble defect when folded into the target structure s. Because the minimization of the ensemble defect is performed by recursing nodes and leaves of the binary tree, the process is computationally much more efficient than prior systems that optimized the objective function over entire nucleotide sequence φ.
As used herein, an “input device” can be, for example, a keyboard, rollerball, mouse, voice recognition system or other device capable of transmitting information from a user to a computer. The input device can also be a touch screen associated with the display, in which case the user responds to prompts on the display by touching the screen. The user may enter textual information through the input device such as the keyboard or the touch-screen.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable fore use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices.
As used herein, “instructions” refer to computer-implemented steps for processing information in the system. Instructions can be implemented in software, firmware or hardware and include any type of programmed step undertaken by components of the system.
A “microprocessor” or “processor” may be any conventional general purpose single- or multi-core microprocessor such as a Pentium® processor, Intel® Core™, a 8051 processor, a MIPS® processor, or an ALPHA® processor. In addition, the microprocessor may be any conventional special purpose microprocessor such as a digital signal processor or a graphics processor.
The system is comprised of various modules as discussed in detail below. As can be appreciated by one of ordinary skill in the art, each of the modules comprises various sub-routines, procedures, definitional statements and macros. Each of the modules are typically separately compiled and linked into a single executable program. Therefore, the following description of each of the modules is used for convenience to describe the functionality of the preferred system. Thus, the processes that are undergone by each of the modules may be arbitrarily redistributed to one of the other modules, combined together in a single module, or made available in, for example, a shareable dynamic link library.
The system may be used in connection with various operating systems such as SNOW LEOPARD®, iOS®, LINUX, UNIX or MICROSOFT WINDOWS®.
The system may be written in any conventional programming language such as C, C++, BASIC, Pascal, or Java, and run under a conventional operating system.
A web browser comprising a web browser user interface may be used to display information (such as textual and graphical information) to a user. The web browser may comprise any type of visual display capable of displaying information received via a network. Examples of web browsers include Microsoft's Internet Explorer browser, Mozilla's Firefox browser, Apple Safari and PalmSource's Web Browser, or any other browsing or other application software capable of communicating with a network.
The invention disclosed herein may be implemented as a method, apparatus or article of manufacture using standard programming or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” as used herein refers to code or logic implemented in hardware or computer readable media such as optical storage devices, and volatile or non-volatile memory devices. Such hardware may include, but is not limited to, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), complex programmable logic devices (CPLDs), programmable logic arrays (PLAs), microprocessors, or other similar processing devices.
In addition, the modules or instructions may be stored onto one or more programmable storage devices, such as FLASH drives, CD-ROMs, hard disks, and DVDs. One embodiment includes a programmable storage device having instructions stored thereon that when executed perform the methods described herein to determine nucleic acid sequence information.
As used herein, a “node” is a structure which may contain a value, a condition, or represent a separate data structure (which could be a tree of its own). Each node in a tree has zero or more child nodes, which are below it in the tree (by convention, trees are drawn growing downwards). A node that has a child is called the child's parent node (or ancestor node, or superior). A node typically has at most one parent. As is known, nodes that do not have any children are called leaf nodes or terminal nodes.
The topmost node in a tree is the root node. An internal node or inner node is any node of a tree that has child nodes and is thus is not considered to be a leaf node. Similarly, an external node or outer node is any node that does not have child nodes and is thus a leaf node
A subtree of a tree is a tree consisting of a node in the tree and all of its descendants in the tree. For example, the subtree corresponding to the root node is the entire tree. The subtree that corresponds to any other node may be called a proper subtree.
Also within the nucleic acid design system 110 is a hierarchical structure decomposition module 115 that contains instructions for dividing the larger target structure into a memory structure comprising a binary tree that has nodes and leaves for further analysis. Thus, the hierarchical structure decomposition module 115 provides one means for decomposing a target nucleic acid structure. Each node or leaf may correspond to a stem, or stem fragment of the target structure s. In order minimize emergent errors that may occur when the target structures in the leaf nodes are later rejoined, one or more “dummy” or spacer nucleotides can be added to the ends of the structures within each leaf so that the leaf structure more accurately mimics the performance of the larger overall target structure. The memory structure of the binary tree may be stored within the memory 113 of the system 110.
One schematic representation of a method of decomposing a target structure 182 is shown in
Referring back to
If the random nucleotide sequence has a normalized ensemble defect greater than the stop condition, then the leaf optimization module 120 also provides instructions for mutating the random nucleotide sequence, and recalculating the ensemble defect of the mutated sequence. Once the leaf optimization module 120 has performed enough rounds of optimization so that the ensemble defect is below the leaf stop condition, a subsequence merging and reoptimization module 130 is used to merge two or more optimized leaves into a node. In one embodiment, the dummy nucleotides are first removed from each leaf target structure prior to merger with an upper node. Thus, the subsequence merging and reoptimization module 130 provides one means for recursing the tree to merge and reoptimize the nucleotide sequence for each node of the tree.
The pair probabilities of the merged nucleotide sequence are then compared against the target structure at that node to determine whether the merged sequences meet a predetermined stop condition or minimum free energy. If the stop condition is reached, then the subsequence merging and reoptimization module recurses down a layer in the binary tree to re-optimize the child sequences prior to merger. If the stop condition is not reached, then the subsequence merger and reoptimization module 130 continues recursing up the binary tree until the entire nucleotide sequence corresponding to a target structure 140 has been determined. Thus, the subsequence merging and reoptimization module also provides one means for determining the nucleotide sequence of the root node of the tree from the merged and reoptimized nucleotide sequences of the other nodes in the tree. Additional details on the modules within the system 110 are described in more detail with regard to
Referring now to
Once the target structure s has been received by the system at the state 210, the process 200 moves to a state 215 wherein the hierarchical structure decomposition module 115 (
In contrast to earlier hierarchical methods that decompose parent structures at multiloops, this hierarchical tree structure decomposes parent structures within duplex stems within the target structure s. Eligible split-points within each stem are those locations within a duplex stem with at least Hsplit consecutive base-pairs to either side, such that both children would have at least Nsplit nucleotides. If there are no eligible split-points, a structure becomes a leaf node in the decomposition tree. Otherwise, an eligible split-point is selected so as to minimize the difference in the size of the children, ∥slk|−|srk∥. In order to minimize emergent defects resulting from crosstalk between child subsequences in one embodiment, dummy nucleotides are introduced into the children to compensate for the fact that the stem has been divided into multiple sequences for analysis. Dummy nucleotides can be defined by extending the newly-split duplex stem across the split-point by Hsplit base pairs (|sdummyk
For a parent node k, the sequence φk follows the same partitioning as the structure sk (φk=φlk∪φrk and φlk∩φrk=φ). Likewise, for a child node kl, the sequence may contain both native and dummy nucleotides (φk
Once the target secondary structure has been decomposed into nodes and leaves, and dummy nucleotides added where necessary, the process 200 moves to a state 225 wherein the root node of the binary tree is selected for analysis.
Leaf Optimization with Defect-Weighted Mutation Sampling
Once the first node for analysis has been selected at the state 225, the process 200 moves to a state 230 to determine whether to perform defect-weighted mutation sampling or to further recurse in the decomposition tree. If the node is a leaf, the process 200 moves to perform defect-weighted mutation sampling (states 235,240,250,255,260,265,270). If the node is not a leave, the process 200 recursively optimizes the left, then the right child nodes.
If a determination was made at the decision state 230 that the current node was a leaf, then defect-weighted mutation sampling starts from a random initial sequence at a state 235. The process 200 then moves to a decision state 240 to determine if a leaf stop condition has been satisfied. If the current sequence satisfies the stop condition, optimization stops and the process 200 moves to a state 320 to decide whether the current node is the root node. However, if a determination is made at the decision state 240 that the leaf stop condition has not been satisfied, then a mutation position is chosen using defect-weighted mutation sampling in a state 250. The process 200 then evaluates the ensemble defect at a state 255. The process 200 then moves to a decision state 260 to determine whether to accept the mutation or not. If the ensemble defect is found to have decreased due to the mutation, then the process moves to a state 265 and the mutation is accepted and the process returns to state 240 to determine if a leaf stop condition has been satisfied. However, if the ensemble defect is not found to have decreased at the decision state 260, then the process moves to a state 270, the mutation is rejected, and the process returns to state 240 to determine if a leaf stop condition has been satisfied.
In order to calculate the average number of mispaired nucleotides in the test nucleotide sequence in comparison to the target structure, the state 255 may call the ensemble defect evaluation module 125 (
After analyzing the initial random test nucleotide sequence, a sequence optimization process is performed by the leaf optimization module 120 for each leaf node using defect-weighted mutation sampling in which each candidate mutation position is selected with probability proportional to its contribution to the ensemble defect of the leaf.
At leaf node k, sequence optimization is performed by mutating either one base at a time Si,|s
Defect-weighted mutation sampling by selecting nucleotide i as a candidate for mutation with probability nik/nk. A candidate sequence {circumflex over (φ)}k is evaluated via calculation of {circumflex over (n)}k if the candidate mutation, ξ, is not in the set of previously rejected mutations, γunfavorable (position and sequence). A candidate mutation is retained if {circumflex over (n)}k<nk and rejected otherwise. The set, γunfavorable, is updated after each unsuccessful mutation and cleared after each successful mutation.
Optimization of leaf k at state 240 terminates successfully if the leaf stop condition:
n
k
≦f
stop
|s
k|
is satisfied, or restarts if Munfavorable|sk| consecutive unfavorable candidate mutations are either in unfavorable or are evaluated and added to γunfavorable. Leaf optimization is attempted from new random initial conditions up to Mleafopt times before terminating unsuccessfully. The outcome of leaf optimization is the leaf sequence φk corresponding to the lowest encountered value of the leaf ensemble defect nk.
After a leaf optimization has been performed for a first leaf of a selected node at the state 240, the process 200 moves to the decision state 320 to determine if the current node is a root node, and thereby the process can move up the decomposition tree. If a determination was made at the decision state 320 that the current node is not a root node, then the process 200 moves to a state 315 to set the current node to the parent of the current node. The process then moves to a state 275 to determine if the nucleotide sequence at the left child node of the parent has been designed. If the nucleotide sequence has not been designed, then the process sets the current node to the left child at a state 305 and moves to decision state 230 to determine if the current node is a leaf node.
However, if the nucleotide sequence of the left child node has been designed at the decision state 275, then the process 200 moves to a decision state 280 to determine if the right child has been designed. If the right child has not been designed, then the current node is set to the right child node at a state 310 and the process moves to the decision state 230 to again determine if the current node is a leaf.
If a decision is made that the right child has been designed at the decision state 280 then the process 200 moves to a state 285 to evaluate the ensemble defect of the current parental node.
Thus, once the leaves of a selected node have been optimized, they are merged at a state 285 so that the merged sequence can be analyzed for conformance with the target structure. As subsequences are merged moving up the tree, each parent node can call the subsequence merging and reoptimization module 130 to initiate defect-weighted child sampling and reoptimization within its sub-tree if there are emergent defects resulting from crosstalk between child subsequences.
Thus, after evaluation of the ensemble defect of the parental node at the state 285, the process 200 moves to a decision state 290 to determine if the merge sequences from the leaf nodes satisfy a stop condition. If a decision is made that the merged sequence does not satisfy a stop condition, then the process moves to a state 295 to perform defect-weighted child sampling and re-optimization. After reaching the leave by recursively performing defect-weighted child sampling, the process 200 then returns to the state 235 to perform leaf optimization. The merging and reoptimization begins again at the termination of the leaf optimization.
Leaf reoptimization at state 235 starts from a new random initial test nucleotide sequence. After sibling nodes kl and kr have been optimized, parent node k may merge their native subsequences (setting φlk=φnativek
n
k≦max(fstop|slk|,nnativek
For any node k with sequence φk and structure sk, the ensemble defect, nk≡n(φk, sk), may be expressed as
is the contribution of nucleotide i to the ensemble defect of the node. For a parent node k, the ensemble defect can be expressed as a sum of contributions from bases partitioned to the left and right children nk=nlk+nrk. For a child node kl, the ensemble defect can be expressed as a sum of contributions from native and dummy nucleotides (nk
Failure to satisfy the stop condition at decision state 290 implies the existence of emergent defects resulting from crosstalk between the two child sequences. In this case, parent node k initiates defect-weighted child sampling and reoptimization within its subtree. Left child kl is selected for reoptimization with probability nlk/nk and right child kr is selected for reoptimization with probability nrk/nk. This defect-weighted child sampling procedure is performed recursively until a leaf is encountered (each time using partitioned defect information inherited from the parent k that initiated the reoptimization). The standard leaf optimization procedure is then performed starting from a new random initial sequence. The use of random initial conditions during leaf reoptimization is based on the assumption that sequence space is sufficiently rich that emergent defects can typically be eliminated simply by designing a different leaf sequence. Following leaf reoptimization, merging begins again starting with the reoptimized leaf and its sibling. The elimination of emergent defects in parent k by defect-weighted child sampling and reoptimization is attempted up to Mreopt times.
If this stop condition is satisfied, the process 200 moves to a state 320 to determine if the last node in the binary tree has been analyzed. This allows the subsequence merging to continue up the tree. If the last node in the tree has not been analyzed, the process 200 moves to a state 315 wherein the nucleotide sequence at the next node up or across the tree is selected for design or processing. The process 200 then returns to state 275 to determine if the nucleotide sequence at the left child node has been designed. If the last node has been analyzed, then the process 200 stops at an end state 325.
The utility of hierarchical structure decomposition reflects the assumption that sequence space is sufficiently rich that two subsequences optimized for sibling substructures will often not exhibit crosstalk when merged by a parent node. This hierarchical mutation procedure is designed to benefit from this property when it holds true, and to eliminate emergent defects when they do arise.
One embodiment of this entire design process is detailed in pseudocode shown in
This hierarchical sequence design approach implies an asymptotic optimality bound on the cost of designing the full sequence relative to the cost of evaluating a single candidate mutation on the full sequence. For a target structure with N nucleotides, evaluation of a candidate sequence requires calculation of n(φ, s) at cost ceval(N)=Θ(N3). Performing sequence design using hierarchical structure decomposition, mutations are evaluated at the leaf nodes and merged subsequences are evaluated at all other nodes. For node k, the evaluation cost is ceval(|sk|). If at least one mutation is required in each leaf, the design cost is minimized by maximizing the depth of the binary tree. Furthermore, at each depth in the tree, the design cost is minimized by balancing the tree. Hence, a lower bound on the cost of designing the full sequence is given by
c
des(N)≧ceval(N)[1+2(½)3+4(¾)3+8(½)3+ . . . ]
or
c
des(N)≧4/3ceval(N).
Hence, if the sequence design process performs optimally for large N, we would expect the cost of full sequence design to be 4/3 the cost of evaluating a single mutation on the full sequence. In practice, many factors might be expected to undermine optimality: imperfect balancing of the tree, the addition of dummy nucleotides in each non-root node, the use of finite tree depth, leaf optimizations requiring evaluation of multiple candidate mutations, and reoptimization to eliminate emergent defects. This optimality bound implies time complexity Ω(N3) for the sequence design process.
The systems and processes described above relate to a single-objective sequence design where the goal was to design the sequence of one or more interacting nucleic acid strands intended to adopt a target secondary structure at equilibrium. Thus, the sequence design was formulated as an optimization problem with the objective of reducing the ensemble defect below a user-specified stop condition. For a candidate sequence, φ, and a target secondary structure, s, the ensemble defect, n(φ; s), is the average number of incorrectly paired nucleotides at equilibrium evaluated over ensemble Γ. As described above, for a target secondary structure with N nucleotides, the above process seeks to satisfy
n(φ,s)≦fstopN
with fstop=0.01.
A single-objective design job is specified by entering a target secondary structure into the nucleic acid sequence design system 110 (
Process performance was evaluated on structure test sets containing 30 target structures for each of Nε{100, 200, 400, 800, 1600, 3200}. An engineered test set was generated by randomly selecting structural components and dimensions from ranges intended to reflect current practice in engineering nucleic acid secondary structures. A multi-stranded version was produced by introducing nicks into the structures in the engineered test set. Each structure in a random test set was obtained by calculating an MFE structure of a different random RNA sequence at 37° C.
To illustrate the roles of hierarchical structure decomposition and defect-weighted sampling in the context of ensemble defect optimization, we compared our process to three alternative processes lacking either or both of these features:
We also modified our process to compare performance to processes already known by others:
The sequence design process was coded in the C programming language. By parallelizing the dynamic program for evaluating P(φ) using MPI,26 the sequence design process reduced run time using multiple cores. For a design job allocated M computational cores, each evaluation of Pk for node k with structure sk is performed using m cores for some mε1, . . . , M selected to approximately minimize run time based on |sk|.
Our primary test scenario was RNA sequence design at 37° C. for target structures in the engineered test set. For each target structure in a test set, 10 independent design trials were performed. Each plotted data point represents a median over 300 design trials (10 trials for each of 30 structures for a given size N).
For the engineered test set (
By comparison, for the random test set, merging of leaf-optimized sequences typically did lead to emergent defects in the root node. Even in this case, our process was found to successfully eliminate emergent defects using defect-weighted child sampling and reoptimization starting from new random initial subsequences.
To explore the effect of sequence initialization on typical design quality and cost, we tested four types of initial conditions (
Multi-stranded target structures arise frequently in engineering practice.
Molecular engineers sometimes constrain the sequence of certain nucleotides in the target structure (e.g., to ensure complementarity to a specific biological sequence), or prevent certain patterns from appearing anywhere in the design (e.g., GGGG). Our process accepts sequence constraints and pattern prevention requirements expressed using standard nucleic acid codes.
The contour plots of
By contrast, hierarchical MFE defect optimization with defect-weighted sampling led to efficient satisfaction of the MFE stop condition (
Using a Θ(N3) dynamic program to evaluate the design objective function, we derived an asymptotic optimality bound on design time: for large N, the minimum cost to design a sequence with N nucleotides was 4/3 the cost of evaluating the objective function once on N nucleotides. Hence, our design process has time complexity Ω(N3).
We studied the performance of our process in the context of empirical secondary structure free energy models that have practical utility for the analysis and design of functional nucleic acid systems. In particular, we examined RNA design at 37° C. on target structures containing Nε{100, 200, 400, 800, 1600, 3200} nucleotides and duplex stems ranging from 1 to 30 base pairs. Empirically, we observe several striking properties. For example, emergent defects were sufficiently infrequent that they were typically eliminated by leaf reoptimization starting from new random initial sequences. In addition, it was routine to design sequences with ensemble defect n(φ, s)<N/100 over a wide range of GC contents. Additionally, the process described herein exhibited asymptotic optimality for large N, with full sequence design costing roughly 4/3 the cost of a single evaluation of the objective function. Hence, the process was efficient in the sense that the exponent in the Ω(N3) time complexity bound is sharp.
While the above processes and methods are described above as including certain steps and are described in a particular order, it should be recognized that these processes and methods may include additional steps or may omit some of the steps described. Further, each of the steps of the processes does not necessarily need to be performed in the order it is described.
While the above description has shown, described, and pointed out novel features of the invention as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the system or process illustrated may be made by those skilled in the art without departing from the spirit of the invention. As will be recognized, the present invention may be embodied within a form that does not provide all of the features and benefits set forth herein, as some features may be used or practiced separately from others.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
This application claims priority to U.S. Provisional Application No. 61/320,863 filed on Apr. 5, 2010 and hereby incorporates by reference all of the contents described therein.
This work was supported federal by funding from the National Science Foundation under grant numbers NSF-CCF-0832824 (The Molecular Programming Project) and NSF-CCF-CAREER-0448835. The government has rights in this invention.
Number | Date | Country | |
---|---|---|---|
61320863 | Apr 2010 | US |