1. Field of the Invention
Embodiments of the invention relate to systems and methods for designing a nucleic acid molecule that will fold into a target secondary structure at a target concentration. More specifically, embodiments relate to methods and systems that allow the design of particular target secondary structures that will exist in a predefined environment, such as a dilute solution in a test tube.
2. Description of the Related Art
The programmable chemistry of nucleic acid base pairing serves as a versatile medium for the rational design of self-assembling molecular structures, devices, and systems. To date, considerable effort has been invested in designing the equilibrium base pairing properties of a single complex of (one or more) interacting nucleic acid strands. However in these earlier efforts, neither the concentration of the complex, nor the concentrations of other undesired complexes were considered. As a result, sequences that were successfully optimized to stabilize a target secondary structure in the context of a complex, may nonetheless fail to ensure that the target complex would actually form at appreciable concentration when the strands were introduced into a test tube in the lab (see
We have previously shown that the design of target nucleic acid structures can be formulated as an optimization problem based on a physically meaningful objective function, the complex ensemble defect (Zadeh, et al. (2011) NUPACK: Analysis and design of nucleic acid systems J. Comput. Chem. 32, 170-173 and Zadeh, et al. (2011) Nucleic acid sequence design via efficient ensemble defect optimization. J. Comput. Chem. 32, 439-452). In this work, a candidate sequence and target secondary structure were evaluated as a complex ensemble defect corresponding to the average number of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the complex.
a and 1b show a schematic comparison of prior art complex design (
a-d are line graphs that show structural features of tetramer target structures for the on-target complexes in a standard test set.
a-e show line graphs that compare embodiments of the present invention to prior complex design processes. Embodiments of the test tube design process described below are shown as solid lines and the prior single-complex design process is shown as dashed lines for the design of test tubes containing only a single on-target complex.
a-e show the process performance for test tube design.
a and 8b show parallel process performance
a-d are line graphs that show the effect on process performance of GC content used for seeding and reseeding with embodiments of the invention. RNA design was performed at 37° C. on the standard test set.
a-d show the effect of design material on process performance RNA design was performed at 37° C. and DNA design was performed at 25° C.
a-e show test tube design with multiple on- and off-target complexes. In this design, target test tubes contained four tetramer on-targets and all off-target complexes up to size Lmaxε{0,1,2,3,4}.
a-g show a target test tube and graphs for designing competing on-target complexes.
One embodiment is a method of designing the sequence of nucleic acid molecules that will form a predetermined secondary structure in solution. This method includes providing a set of desired on-target nucleic acid complexes each having a target secondary structure and a target concentration; providing a set of undesired off-target nucleic acid complexes each having a vanishing target concentration; and designing the sequence of one or more nucleic acid molecules that will predominantly form the on-target secondary structures at approximately the on-target concentrations, and will predominantly not form the undesired off-target complexes.
Another embodiment is an electronic system configured to design a sequence of nucleic acid molecules that will form a predetermined secondary structure in solution. This embodiment can include a first module configured to determine a set of desired on-target nucleic acid complexes each having a target secondary structure and a target concentration; a second module configured to determine a set of undesired off-target nucleic acid complexes each having a vanishing target concentration; a processor programmed to design the sequence of one or more nucleic acid molecules that will predominantly form the on-target secondary structures at approximately the on-target concentrations, and will predominantly not form the undesired off-target complexes; and an output port configured to output the design of the one or more nucleic acid molecules.
Yet another embodiment is a non-transitory computer readable medium comprising instructions that when executed by a processor perform a method of: providing a set of desired on-target nucleic acid complexes each having a target secondary structure and a target concentration; providing a set of undesired off-target nucleic acid complexes each having a vanishing target concentration; and designing the sequence of one or more nucleic acid molecules that will predominantly form the on-target secondary structures at approximately the on-target concentrations, and will predominantly not form the undesired off-target complexes.
Embodiments of the invention relate to methods and electronic systems for designing a target secondary structure of a nucleic acid molecule that will form as a predominant species under predetermined environmental conditions. For example, in one embodiment the system can design a target secondary structure of a nucleic acid molecule at a target concentration in a dilute solution. In this case, the salt concentration, and/or temperature of the dilute solution would be one predetermined condition. Using the methods, systems and modules described herein, one can calculate the equilibrium base-pairing properties of a dilute solution of interacting nucleic acid strands (e.g., a “test tube”), yielding predictions for the equilibrium concentration and base-pairing probabilities for an arbitrary number of complex species that form from an arbitrary number of strand species.
As used herein a dilute solution means a solution in which the concentration of solvent molecules is much higher than the concentration of nucleic acid strands. For example, dilute solutions may have a concentration of 10 mM, 1 mM, 100 μM, 10 μM, 1 μM, 1 nM, 1 pM, 1 fM, 1 aM, 1 zM, or 1 yM and be within the scope contemplated by embodiments of the invention.
Accordingly, embodiments of the invention include programmed modules that have been configured to formulate the design of nucleic acid sequences in the context of a test tube of interacting nucleic acid strands at equilibrium. As used herein this type of design contemplates not only the base-pairing properties of a candidate nucleic acid strand, but also the predetermined environmental conditions that the nucleic acid will be placed into, and is termed herein “test tube design”. This allows the system to design nucleic acid sequences that interact at equilibrium to form a target secondary structure at a target concentration for an arbitrary number of desired complexes (the “on-target” complexes). The system also designs against formation of an arbitrary number of undesired complexes, each specified with a vanishing target concentration such as zero (the “off-target” complexes).
In one embodiment, a “test tube ensemble defect” is calculated, which represents the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the test tube. By calculating a test tube ensemble defect, the system is able to not just consider whether a target secondary structure has formed, but also take into consideration, and minimize, the formation of other off-target complexes of monomers, dimers, etc. that would form at the same time. This allows the system to determine a nucleic acid that not only forms into a target secondary structure, but also would form that structure as a dominant species in solution.
In one embodiment, the test tube ensemble defect is estimated by decomposing the target structure for each on-target complex into a tree of substructures, as described below. This leads to a forest of decomposition trees, with one tree for each on-target complex. The test tube ensemble defect at the root level of the forest can be efficiently estimated using only physical quantities calculated inexpensively at the leaf nodes of the forest, or at any other level of the forest intermediate between the leaf and root levels.
One embodiment is an electronic system that is configured to design a sequence of nucleic acid molecules, such as RNA molecules, that will form a predetermined secondary structure in solution. This electronic system can comprise several functional modules programmed to carry out aspects of the invention. The modules may be software modules stored in the memory of a computer system or a combination of software and processor executing instructions provided by the software to carry out aspects of the invention. Accordingly, the system may include a first module configured to determine a set of desired on-target nucleic acid complexes each having a target secondary structure and a target concentration. The system may also have a second module configured to determine a set of undesired off-target nucleic acid complexes each having a vanishing target concentration. Finally, the system may include at least one processor programmed to design the sequence of one or more nucleic acid molecules that will predominantly form the on-target secondary structures at approximately the on-target concentrations, and will predominantly not form the undesired off-target complexes. In order to provide an output of the designed nucleic acid molecules, the system may have an output port configured to output the design of the one or more nucleic acid molecules. In this circumstance, the output port may be connected to a computer or mobile display screen for displaying the sequence of the designed nucleic acid molecule, or a printer configured to print a copy of the designed nucleic acid molecule. Other possible output formats known to those with ordinary skill in the art are also contemplated within aspects of the invention.
The one or more processors in the system may also be programmed to calculate a test tube ensemble defect corresponding to the concentration of incorrectly paired nucleotides in a candidate sequence at equilibrium evaluated over the ensemble of a theoretical test tube containing the nucleic acid molecules, by the methods described herein and the pseudocode provided in the Appendix.
a, shows a schematic example of the prior art “complex design” method of determining a nucleic acid molecule that will form a target secondary structure 100. Sequence design formulated in the context of the target complex 100 ensures that at equilibrium the calculated target structure 110 dominates the structural ensemble of the complex that is determined by the complex design process. Unfortunately, subsequent thermodynamic analysis of the calculated target molecule in the context of a test tube can reveal that the desired target complex 100 occurs at negligible concentration (4 nm) relative to other undesired monomers and homodimers when actually built and put into solution in a test tube.
In one embodiment of a test tube design system, as described herein, the user specifies: 1) a set of desired on-target complexes, each with a target secondary structure and target concentration, 2) a set of undesired off-target complexes, each with vanishing target concentration. Given these parameters, the modules within the system derive a “test tube ensemble defect”, corresponding to the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the test tube. To efficiently optimize the test tube ensemble defect, the system builds on hierarchical sequence optimization concepts previously developed by us for complex design (See U.S. Pat. No. 8,478,543 entitled “System and Method for Nucleic Acid Sequence Design”, hereby incorporated by reference in its entirety). However, embodiments of the current invention address new conceptual challenges that arise in the context of test tube design which better reflects real-world experimental conditions, in that undesired off-target complexes compete with the desired on-target complexes.
In one embodiment, the process of determining the sequence of a nucleic acid that will form a target structure at a target concentration in predetermined environmental conditions begins by describing the physical quantities that provide the basis for analyzing and designing the equilibrium base-pairing properties of a test tube of interacting nucleic acid strands. This description starts with defining a secondary structure model for the nucleic acid strands to be evaluated.
The sequence, φ, of one or more interacting RNA strands is specified as a list of bases φaε{A, C, G, U} for a=1, . . . , |φ| (T replaces U for DNA). A secondary structure, s, of one or more interacting RNA strands is defined by a set of base pairs (each a Watson-Crick pair [A•U or C•G] or a wobble pair [G•U]). A polymer graph representation of a secondary structure is constructed by ordering the strands around a circle, drawing the backbones in succession from 5′ to 3′ around the circumference with a nick between each strand, and drawing straight lines connecting paired bases. A secondary structure is unpseudoknotted if there exists a strand ordering for which the polymer graph has no crossing lines. A secondary structure is connected if no subset of the strands is free of the others. A complex of interacting strands with strand ordering, π, has structural ensemble, Γ(π), containing all connected polymer graphs with no crossing lines. For sequence φ and secondary structure, sεF, the free energy, ΔG(φ, s), is calculated using nearest-neighbor empirical parameters for RNA in 1M Na+ or for DNA in user-specified Na+ and Mg++ concentrations. These physical models have practical utility for the analysis and design of functional nucleic acid systems, and provide the basis for rational analysis and design of equilibrium base-pairing in the context of a dilute solution.
Now that a model has been formulated to analyze the secondary structure of interacting nucleic acid molecules, the system also provides a means for determining the equilibrium base pairing of nucleotides within the interacting nucleic acid molecules.
Let Ψ0 denote the set of strand species that interact in a test tube to form the set of complex species Ψ. For complex jεΨ, with sequence φj and structural ensemble Γj, the partition function
can be used to calculate the equilibrium probability of any secondary structure sεΓj:
p(φj,s)=exp└−ΔG(φj,s)/kBT┘/Q(φj).
Here, kB is the Boltzmann constant and T is temperature. The equilibrium base-pairing properties of complex j are characterized by the base-pairing probability matrix P(φj), with entries Pa,b(φj)ε[0, 1] corresponding to the probability,
that base pair a·b forms at equilibrium within ensemble Γj. Here, S(s) is a structure matrix with entries Sa,b(s)=1 if structure s contains base pair a·b and Sa,b(s)=0 otherwise. For convenience, the structure and probability matrices are augmented with an extra column to describe unpaired bases. The entry Sa,|s|+1(s) is unity if base a is unpaired in structure s and zero otherwise; the entry Pa,|φ
Let QΨ=Qj∀jεΨ denote the set of partition functions for the complexes in the test tube. The set of equilibrium concentrations, xΨ, (specified as mole fractions) are the unique solution to the strictly convex optimization problem:
where the constraints impose conservation of mass. A is the stoichiometry matrix with entries Ai,j corresponding to the number of strands of type i in complex j, and xi0 is the total concentration of strand i introduced to the test tube.
To analyze the equilibrium base-pairing properties of a test tube of nucleic acid strands, the partition function, Q(φj), and equilibrium pair probability matrix, P(φj), are calculated for each complex jεΨ using θ(|φj|3) dynamic program modules. The equilibrium concentrations, xΨ, are calculated by solving the convex programming problem (equation (1)) using an efficient trust region method at a cost that is typically negligible by comparison. The overall time complexity to analyze the test tube is then O(|Ψ∥φ|3max), where |φ|max is the size of the largest complex.
In specifying an analysis problem, a convenient and powerful approach is to define Ψ if to include all complexes of up to Lmax strands. For a test tube containing the set of strands, Ψ0, the total number of complexes that can form of up to size Lmax is:
so the overall time complexity to analyze the test tube is O(|Ψ0|Lmax|s|3max/Lmax).
A test tube design problem is specified as a target test tube containing a set of desired on-target complexes, Ψon, and a set of undesired off-target complexes, Ψoff. The set of complexes in the test tube is then:
ψ=ψon∪ψoff.
Each complex, jεΨ, is specified as a strand ordering, πj, corresponding to structural ensemble Γ(πj). For each on-target complex, jεΨon, the user specifies a target secondary structure, sj, and a target concentration, yj. For each off-target complex, jεΨoff, the target concentration is vanishing (yj=0) and there is no target structure (sj=Ø). When specifying the off-targets in Ψoff, it is convenient to include all complexes of up to Lmax strands. For example, by equation 2, four strands can interact to form 108 complexes of up to size four.
Complementarity constraints may be imposed on the design at the sequence level by defining strands in terms of sequence domains (e.g., see the sequence domains in the monomer and dimer on-target structures of
Described herein are methods, systems and modules configured to perform sequence optimization for a test tube design based on a physically meaningful objective function that quantifies sequence quality with respect to the environment of a target test tube. This allows the design of nucleic acid sequences that will form a target secondary structure in a chosen concentration when tested in vitro in solution.
As a precedent for this approach, consider the related problem of complex design, where the goal is to design strands that, at equilibrium, adopt a target secondary structure within the ensemble of a complex, without considering the environment (e.g., a dilute solution) that the nucleic acid will eventually be placed within. For a candidate sequence, φj, and target structure, sj, the complex ensemble defect
is the average number of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the complex, Γj. The complex ensemble defect falls in the interval (0,|sj|). For complex design, the complex ensemble defect provides a physically meaningful objective function for quantifying sequence quality.
Here, to provide a basis for evaluating candidate nucleic acid sequences in the context of a test tube, we derive the “test tube ensemble defect”, representing the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the test tube. For a target test tube with target secondary structures, sΨ, target concentrations, yΨ, and candidate sequences, φΨ, the test tube ensemble defect
may be expressed in terms of the defect contribution of each complex jεΨ:
For each on-target complex jεΨon, the first term in equation (5) represents the structural defect, quantifying the concentration of nucleotides that are in an incorrect base-pairing state on average within the ensemble of complex j, and the second term represents the concentration defect, quantifying the concentration of nucleotides that are in an incorrect base-pairing state because there is a deficiency in the concentration of complex j. Because yj=0 for off-target complexes, the structural and concentration defects are both identically zero (so the sum in equation (4) may be written over Ψon instead of ‘Ψ’). This does not mean that the defects associated with the off-targets are ignored. By conservation of mass, non-zero off-target concentrations imply deficiencies in on-target concentrations, and these concentration defects are quantified by equation (4).
The test tube ensemble defect falls in the interval (0,ynt), where
is the total concentration of nucleotides in the test tube.
Note that if there is only one species of complex in the test tube (|Ψ|=1), its concentration is necessarily equal to the target concentration (x1=y1), so the formulation is independent of concentration. In this case, optimization of the test tube ensemble defect, C(φ1,s1,y1), is equivalent to optimization of the complex ensemble defect, n(φ1,s1).
Calculation of the test tube ensemble defect (equation 4) requires calculation of the complex partition functions, QΨ, which are used to calculate the equilibrium concentrations, xΨ, as well as the equilibrium pair probability matrices, PΨ
With the above structure in place, below is a description of a “test tube” design system and process that includes modules for calculating the nucleic acid sequence of a nucleotide strand that will adopt a target secondary structure at a target concentration in solution based on test tube ensemble defect optimization. For a target test tube with target secondary structures, sΨ, and target concentrations, yΨ, the system seeks to design a set of sequences, φΨ, such that the test tube ensemble defect satisfies the test tube stop condition:
C(φΨ,sΨ,yΨ)≦Cstop (6)
with
C
stop
≡f
stop
y
nt (7)
for a user-specified value of fstopε(0,1). It should be realized that the fstop condition may be, for example, between 0.5 and 0.001 in some embodiments. For example, the fstop condition may be 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, or 0.001. Using this notation, an fstop of 0.01 would correspond to requiring a normalized test tube ensemble defect of greater than 1%.
The test tube ensemble defect is reduced via iterative mutation of a random initial sequence. Because of the high computational cost of calculating the test tube ensemble defect, the system tries to avoid direct recalculation of C in evaluating each candidate mutation. To reduce the cost of sequence optimization, each on-target structure is decomposed into a tree of substructures, yielding a “forest” of decomposition trees. Candidate mutations are evaluated efficiently by estimating the test tube ensemble defect based on nodal contributions calculated efficiently at the leaves of the decomposition forest.
During leaf optimization, defect-weighted mutation sampling is used to select each candidate mutation position with probability proportional to its contribution to the estimated test tube ensemble defect. As optimized subsequences are merged toward the root level of the forest, emergent defects that arise due to crosstalk between subsequences are eliminated via reoptimization within a defective subtree. After subsequences are merged to the root level, the full test tube ensemble defect, C, is calculated for the first time, including all on- and off-target complexes in the test tube ensemble. Any off-target complexes that form at appreciable concentration are decomposed, added to the decomposition forest, and actively destabilized during subsequent forest reoptimization. The exclusion of off-targets from the decomposition forest during the initial phase of sequence design is critical to enabling the design of test tubes containing large numbers of off-target complexes (e.g., 104 off-targets). The elements of this hierarchical sequence design process are described below and detailed in the pseudocode in the attached appendix.
To initialize the system, a set of nucleic acid molecule complexes, Ψ, is partitioned into two disjoint sets of complexes:
Ψ=Ψactive∪Ψpassive
where Ψactive denotes complexes that will be actively designed and Ψpassive denotes complexes that will inherit sequence information from Ψactive. Initially, we set
Ψactive=Ψon,Ψpassive=Ψoff
so that only the on-target complexes are actively designed. The user-specified on-target structures provide the basis for hierarchical structure decomposition, which enables efficient sequence design. The sequences for the complexes jεΨactive are randomly initialized subject to respecting complementarity constraints provided by the design problem specification. Watson-Crick complements are used to initialize complementary sequence domains or any bases that are paired within an on-target structure.
The hierarchical decomposition module is configured to decompose each on-target structure into a binary tree of substructures. Each target structure sjεΨactive is decomposed into a (possibly unbalanced) binary tree of substructures, resulting in a forest of |Ψon| trees. Each node in the forest is indexed by a unique integer k. For each parent node, k, there is a left child node, kl, and a right child node, kr. Each nucleotide in parent structure sk is partitioned to either the left or right child substructure (sk=skl∪skr, skl∩skr=Ø). Eligible split-points are those locations within a duplex stem with at least Hsplit consecutive base-pairs on either side, such that each child would have at least Nsplit nucleotides. An eligible split-point is selected so as to minimize the difference in the size of the children, ∥skl|−|skr∥. Child node kl inherits from parent node k the substructure skl augmented by dummy nucleotides that approximate the influence of its sibling in the context of their parent. Dummy nucleotides are defined by extending the newly-split duplex stem across the split-point by Hsplit base pairs (|sk
Decomposition of the sequence, φk, is performed in accordance with decomposition of structure sk. If the maximum depth of a leaf in the forest of binary trees is D, any nodes with depth d<D that lack an eligible split-point are replicated at each depth down to D so that all leaves have depth D. Let Λdenote the set of all nodes in the forest. Let Λd denote the set of all nodes at depth d. Let Λd,j denote the set of all nodes at depth d resulting from decomposition of complex j. Each nucleotide in complex j is native in exactly one nodal structure, skεsΛ
Test Tube Ensemble Defect Estimation from Nodal Contributions.
The nodal test tube ensemble defect estimation module is configured to estimate the ensemble defect of one or more candidate nucleic acid molecules over the ensemble of a collection of nucleic acid molecules in a theoretical test tube based on the contributions of each nodal estimate. Evaluation of the test tube ensemble defect (equation (4)) at the root level of the forest requires calculation of the defect contribution (equation (5)) of each complex in the active complex Ψactive. The O(|Ψ∥φ|max3) time complexity for this calculation results from the Θ(|φj|3) dynamic programs used to calculate the partition function, Qj, and equilibrium pair probability matrix, Pj, for each complex jεΨ. To reduce the cost of evaluating candidate sequences, here, we derive decompositions of the relevant physical quantities so that defect contribution of each complex (equation (5)) can be estimated less expensively using nodal contributions calculated at any depth dε{2, . . . , D} within its decomposition forest.
For each complex jεΨ, the cost of estimating the defect contribution cj at level d is dominated by calculation of the nodal partition function, Qk, and nodal pair probability matrix, Pk, at cost Θ(|φk|3) for each node kεΛAd,j. For an optimal decomposition, |φk| halves and |Λd,j| doubles at each level moving down the tree, so the cost of estimating cj at level d can be a factor of ½2d−2 lower than the cost of calculating cj exactly at the root. Hence, for maximum efficiency, all candidate mutations are evaluated by estimating the test tube ensemble defect at the leaf level of the forest (depth d=D). As subsequences are merged toward the root level, the test tube ensemble defect is estimated at intermediate depths in the forest.
The accuracy of the defect estimate will depend on the equilibrium structural properties of the sequence. If a split-point partitions a parent structure within a duplex that is predominantly well-formed at equilibrium, the physical properties of the parent can be accurately estimated based on the physical properties of the native nucleotides in its children, because the children are relatively isolated from each other by the base pairs adjacent to the split-point. The role of the dummy nucleotides within each child is to approximate the stabilizing influence of the missing sibling on the base pairs adjacent to the split-point. As the quality of the sequence design improves, the quality of the decomposition approximation will also improve as the duplex containing the split-point increasingly dominates at equilibrium The accuracy of the decomposition breaks down if there is crosstalk when sibling sequences are merged within a parent; crosstalk can destabilize the duplex containing the split-point, undermining the validity of the decomposition. The utility of root defect estimation hinges on the assumption that sequence space is sufficiently rich that subsequences within the decomposition forest will often not exhibit crosstalk when merged to the root.
The hierarchical mutation procedure exploits root defect estimation when crosstalk is absent, and eliminates crosstalk when it does arise during subsequence merging.
The following sections describe how to calculate each of the nodal contributions at any level dε{2, . . . , D} so as to efficiently and accurately estimate the complex contributions, cΨ
The complex partition function module is configured to estimate the partition function for a root node in a decomposition tree. We begin by calculating the complex partition function estimate, {tilde over (Q)}j, for each complex jεΨactive in terms of partition function contributions evaluated efficiently at the nodes kεΛd,j at any level dε{2, . . . , D}. This decomposition is illustrated for parent node k and its children kl and kr in
The utility of this approach hinges on the assumption that the base pairs on either side of a split-point are predominantly well-formed at equilibrium, so that the partition functions of two sibling nodes are approximately independent and can be usefully combined to approximate the partition function of their parent node. Let Ek denote the set of native base pairs adjacent to decomposition split-points in node k. For each base pair a·bεEk, let φkdummy(a·b) denote the sequence of a·b and all the dummy nucleotides on the other side of the split-point. The native partition function for node k is then
where the approximation follows from the assumption that the equilibrium probabilities for the base pairs in Ek are independent; the expression becomes exact if Ek contains only one base pair, and in the limit as the equilibrium probabilities of the base pairs in Ek approach unity. The partition functions, Q(φk) and Q(φkdummy(a·b)), and the equilibrium base-pairing probability matrices, P(φk) and P(φkdummy(a·b)), are calculated using dynamic programs suitable for complexes containing arbitrary numbers of strands.
Note that the periodic strand repeat, vj of complex j is defined as the number of different rotations of the polymer graph that map strands of the same type to each other (e.g., vj=4 for complex AAAA, vj=3 for complex ABABAB, vj=2 for ABAABA). For complexes in which all strands are distinct, vj=1. However, complexes containing multiple copies of the same strand may have vj>1, in which case the dynamic program that is used to calculate the partition function of complex j will be incorrect due to symmetry and overcounting errors that are different for different structures in Γj. Fortunately, these errors interact in such a way that they can be exactly and simultaneously corrected by dividing the calculated partition function by the integer vj
Q(φj)=Qcalc(φj)/vj.
When employing the dynamic program to calculate the nodal partition functions for kεΛd,j, it is important to correct each of these values using vj.
Next, the system reconstructs the approximate partition function for complex j from the native partition functions of all descendant nodes at level d. Let Fd,j denote the set of base-pair stacks sandwiching the split-points in the decomposition of complex j at depth d. Each of these base-pair stacks a·b:e·f is an interior loop whose free energy, ΔGa·b:e·finterior, is not incorporated in the native partition functions calculated for the nodes on either side of the split point. The complex partition function estimate is then
representing the product of the native partition functions jεΛd,j and the additional contributions from the interior loops a·b:e·f at the split-points. This estimate becomes exact in the limit as the equilibrium probabilities of the base-pairing stacks in Fd,j approach unity.
Complex Concentration Estimate using Deflated Mass Constraints
After calculating the set of complex partition function estimates, {tilde over (Q)}Ψ
Initial sequence optimization is performed on a decomposition forest that contains only the on-target complexes in Ψactive, but ultimately, the system tries to satisfy the test tube stop condition (equation (6)) for the full set of complexes in Ψ, including the off-targets in Ψpassive. Recall that the off-targets in Ψpassive do not contribute directly to the sum used to calculate the test tube ensemble defect (equation (4)), but contribute indirectly by forming at positive concentrations, causing concentration defects for complexes in Ψactive as a result of conservation of mass. Hence, we can pre-allocate a portion of the permitted test tube ensemble defect, fstopynt, to the neglected off-target complexes in Ψpassive by deflating the total strand concentrations used to impose the mass constraints (equation (1b)) in calculating the equilibrium concentrations {tilde over (x)}Ψ
Following this approach, if Ψpassive≠Ø, we make the assumption that the complexes in Ψpassive consume a constant fraction of each total strand concentration:
corresponding to a total mass allocation of fpassivefstopynt to the neglected off-targets in Ψpassive.
To calculate the equilibrium concentrations of the complexes in Ψactive via (equation (1)), we therefore use the deflated strand concentrations:
in place of the full strand concentrations (equation (10)). For each complex jεΨactive, the concentration estimate, {tilde over (x)}j, is passed to the nodes in the subtree of complex j at level d:
x
k
={tilde over (x)}
j
∀kεΛ
d,j
Nodal concentrations are useful for representing the test tube ensemble defect estimate as a sum of nodal (rather than complex) contributions.
The complex ensemble defect estimate, ñj, is calculated for each complex jεΨactive active based on nodal defect contributions, nk, calculated efficiently at the nodes kεΛd,j at any level dε{2, . . . , D}. This decomposition is illustrated for parent node k and its children kl and kr in
Because each nucleotide in complex j is native in exactly one node kεΛd,j, the system can approximate the complex ensemble defect as the sum of the native nodal defect contributions at any depth in the subtree. The nodal pair probability matrix, Pk (with entries for both native and dummy nucleotides), was previously calculated in order to estimate the nodal partition function contribution (equation 8).
For any node kεΛd,j, the contribution of nucleotide aεsk to the nodal defect is given by
and the native nodal defect contribution is:
Based on nodal contributions at depth d, the complex ensemble defect estimate for any complex jεΨactive is then:
This estimate becomes exact in the limit as the equilibrium probabilities of the base pairs sandwiching the decomposition split-points approach unity.
Having calculated the complex concentration estimates, {tilde over (x)}Ψ
{tilde over (c)}
j
=ñ
jmin({tilde over (x)}j,yj)+|sj|max(yj−{tilde over (x)}j,0), (12)
and a test tube ensemble defect module can then calculate the test tube ensemble defect as:
This sum can equivalently be expressed as a sum over nodal contributions at depth d. The test tube ensemble defect associated with nucleotide a in node kεΛd is
c
k
a
=n
k
amin(xk,yk)+max(yk−xk,0)lem
so the native nodal defect contribution for node k is
and the test tube ensemble defect estimate (equation (13)) becomes:
The total defect permitted by the test tube stop condition (equation (6)) can be allocated proportionally to the nodes at depth d in the decomposition forest:
c
k
stop
≡f
stop
|s
k
native
|y
k
∀kεΛ
d (15)
so that the nodal defect allocations sum to the total permitted test tube ensemble defect
During hierarchical sequence optimization, candidate sequences are evaluated using
the thresholded test tube ensemble defect estimate:
in place of (equation (14)) to drive proportional defect allocation across the nodes at each level in the decomposition forest.
To minimize computational cost, all candidate mutations are preferably evaluated by a leaf mutation calculation module at the leaf nodes, kεΛD, of the decomposition forest. Leaf mutation terminates successfully if the leaf stop conditions,
c
k
native
≦c
k
stop
{kεΛ
D, (17)
are all satisfied. The multiple leaf stop conditions collectively enforce the single test tube stop condition (equation (6)) and further mandate consistent design quality across the leaves. A candidate mutation is accepted if it decreases the thresholded test tube ensemble defect estimate (equation (16)) and rejected otherwise. Let FD denote the set of leaves that do not yet satisfy the leaf stop condition (equation (17)). The thresholded test tube ensemble defect is compatible with the leaf stop conditions in the sense that a candidate mutation is accepted if and only if it reduces the defect contribution of the leaves in FD.
In some embodiments, defect weighted mutation sampling is performed by selecting nucleotide a for mutation from amongst those leaves kεFD with probability proportional to the contribution of nucleotide a to the defect contribution of these leaves:
If the selected candidate mutation position is subject to complementarity constraints implied by the design problem specification, either via complementary sequence domains or via base-pairing within any on-target structure, the candidate mutation respects the constraint in one of three strengths: 1) strong complementarity (default): constrained nucleotides are selected randomly from a uniform distribution of Watson-Crick pairs, b) medium complementarity: constrained nucleotides are selected randomly from a uniform distribution of Watson-Crick and wobble pairs, c) weak complementarity: constrained nucleotides are selected randomly from a uniform distribution of Watson-Crick pairs, wobble pairs, and mismatches. For design problems where on-target structures place competing demands on the test tube ensemble defect, permitting weak complementarity permits the process to increase the defect contribution in one part of a design in order to reduce the ensemble defect of the test tube as a whole (e.g., see the example of
A candidate sequence {circumflex over (φ)}Λ
After leaf mutation terminates, if any leaves fail to satisfy the leaf stop condition (i.e., FD≠Ø), leaf reoptimization commences by the leaf reoptimization module. The counter mkleaf is used to keep track of the number of times that leaf k is reoptimized. During each round of leaf reoptimization, the leaf kεFD with the minimal mkleaf is reseeded with a random initial sequence and a new round of leaf mutation is performed. After leaf mutation terminates, the counter mkleaf is incremented for any leaf whose sequence has changed. The reoptimized candidate sequences, {circumflex over (φ)}Λ
After leaf reoptimization terminates, parent nodes at depth d=D−1 merge their left and right child sequences to create the set of candidate sequences {circumflex over (φ)}Λ
c
k
native≦max(ck
or if all parents at level d have exhausted Mopt optimization attempts (i.e., mkopt≧Mopt), merging continues up to the next level in the forest. Otherwise, failure to satisfy the parental stop condition implies the existence of emergent defects resulting from crosstalk between child sequences. In this case, the parent node at depth d with the minimal mkopt that also fails to satisfy the parental stop condition (equation (18)), is selected for reoptimiziation (and labeled kreopt).
To reoptimize parent node kreopt at depth d, the current sequences at depth d are pushed down to all nodes below depth d, and the counter, mkopt, is reset to zero for all nodes below depth d. Let Fk denote the set of native nucleotides in parent kreopt, that are partitioned to leaf k. Parent kreopt performs defect weighted leaf sampling by selecting a leaf k within its subtree with probability:
The selected leaf (labeled kreseed) is reseeded to a random initial sequence and a new round of leaf mutation and leaf reoptimization is performed. Reseeding with a random initial sequence is based on the assumption that sequence space is sufficiently rich that emergent defects are atypical and can reliably be eliminated by designing a different leaf sequence. Following leaf reoptimization, merging begins again. Subsequence merging and reoptimization terminate successfully if all root nodes satisfy the parental stop condition (equation (18)). The outcome of subsequence merging and reoptimization are the sequences φΨ
Focusing Effort within the Decomposition Forest
To focus mutation effort in portions of the decomposition forest where it is most likely to reduce the test tube ensemble defect, we define the set of nodes, Ωfocus. Initially, all nodes in the decomposition forest, kεΛ, are placed in Ωfocus. During leaf reoptimization, ΩDfocus contains only those leaves whose sequences were changed by the most recent leaf reseeding. During parent reoptimization following a failed merge at level d, Ωd
To avoid expending undue effort on nodes that exhausted reoptimization attempts during a previous traversal of the decomposition forest, leaf mutation, leaf reoptimization, and parent reoptimization all focus on nodes in Ωfocus, as detailed in the pseudocode in the attached appendix. Leaf mutation is restricted to leaves in the set:
ΩDmutate={kεΩDfocus:cknative>ckstop,mkmutate<Mmutate|sknative|}
and terminates when ΩDmutate is empty. Leaf reoptimization is restricted to leaves in the set:
ΩDleaf={kεΩDfocus:cknative>ckstop,mkleaf<Mleaf}
and terminates when ΩDleaf is empty. Parent reoptimization at depth d is restricted to parents in the set:
Ωdopt={kεΩdfocus:cknative>max(ck
and merging continues up the forest when Ωdopt is empty.
Initial forest optimization is performed for the on-target complexes in Ψactive, neglecting the off-target complexes in Ψpassive. At the termination of initial forest optimization, the test tube ensemble defect estimate (equation (13)) is {tilde over (C)} calculated at depth d=1. For this estimate, the complex defect contributions, {tilde over (c)}Ψ
For the first time, the full test tube ensemble defect (equation (4)), C, is then calculated for all complexes in the complex Ψ. For this exact calculation, the complex defect contributions, cΨ, are based on complex concentrations, xΨ, calculated using the full strand concentrations (equation (10)).
Sequence design terminates successfully if the test tube ensemble defect satisfies either the test tube stop condition (equation (6)), or is no greater than the forest-estimated defect (equation (13)):
C≦max(Cstop,{tilde over (C)}). (19)
This latter condition allows sequence design to terminate if the actual defect contribution resulting from the off-target complexes in Ψpassive is no greater than the built-in defect allowance resulting from deflation of the total strand concentrations during forest optimization. Otherwise, we have
C>{tilde over (C)} (20)
and the off-target complex jεΨpassive with the largest concentration is transferred from Ψpassive to Ψactive. Because the off-target structure, sj, is undefined and we require a structural basis for tree decomposition, we generate an off-target structure, sj, that includes all base pairs a·b that form with equilibrium probability Pja,b>psplit (for a specified psplitε(0.5,1.0)) between nucleotides a and b that are constrained to be complementary (either due to specification of complementary sequence domains or due to specification of an on-target structure containing a·b). The root defect estimate, {tilde over (C)}, is then recalculated (using deflated strand concentrations (equation (11)) if Ψpassive≠Ø). This process of transferring the highest-concentration off-target complex, j, from Ψpassive to Ψactive generating an off-target structure sj, and re-calculating the root defect estimate, {tilde over (C)}, is repeated until (equation (20)) no longer holds.
The new off-target structures Ψinactive are then hierarchically decomposed, the decomposition forest is augmented with new nodes at all depths, and forest reoptimization commences starting from the final sequences from the previous round of forest optimization. During forest reoptimization, the process actively attempts to destabilize the off-targets that were added to Ψactive. This process of forest augmentation and reoptimization is repeated until (equation (19)) is satisfied, which is guaranteed to occur in the event that all off-targets are eventually added to Ψactive. The sequence design process shown in the attached pseudocode appendix returns the sequences φΨ that yielded the lowest encountered test tube ensemble defect, C. The appendix includes pseudocode for hierarchical test tube ensemble defect optimization. For a given set of target secondary structures, sΨ, and target concentrations, yΨ, a set of designed sequences φΨ, is returned by the function call OptimizeTube (sΨ, yΨ, Ψ, Ψon, Ψoff).
In one embodiment, the test tube design process described herein is coded in the C programming language. To reduce run-time for large jobs, the dynamic programs for evaluating the nodal partition function, Qk, and the nodal base-pairing probability matrix, Pk, can be parallelized using MPI. For parallel execution, each evaluation of Qk and Pk for node k with target structure sk is performed using a number of cores selected so as to approximately minimize run time based on node size, |sk|.
The performance of the test tube design process was demonstrated using a set of target test tubes. Within each target test tube, there was a single on-target tetramer with a target concentration of 1 μM. The off-targets were specified to be to all complexes of up to Lmax=4 strands (excluding the on-target tetramer), corresponding to a total of 107 off-target complexes. The target structure for each on-target tetramer was randomly generated with stem and loop sizes randomly selected from a distribution of sizes representative of the nucleic acid engineering literature. Sixty on-target tetramers were generated for each target structure size |s|ε{100,200,400,800} nucleotides, corresponding to a total of 240 target test tubes. Within a tetramer, all strands were of the same length. The structural properties of these on-target tetramers are summarized in
Design trials were run on a cluster of 2.53 GHz Intel E5540 Xeon® dual-processor/quad-core nodes with 24 GB of memory per node. Unless otherwise noted, trials were performed on a single computational core using the default process parameters of Table 1. Design quality is plotted as the normalized test tube ensemble defect, C/ynt.
The primary test scenario is RNA sequence design at 37° C. with fstop=0.01 (i.e., less than 1% of the nucleotides should be incorrectly paired within the test tube at equilibrium). Ten independent trials were performed for each of the 240 target test tubes in the standard test set.
Performance of Complex Design without Consideration of Off Target Complexes
In order to have a comparison with prior systems, an initial experiment was run to characterize the special case in which test tube design reduces to the earlier complex design: a target test tube containing one on-target complex and no off-target complexes. Once the performance of this method was determined, it could be used as a baseline for comparison against embodiments of the invention that utilize a test tube design process.
Complex Ensemble Defect Estimation within the On-Target Decomposition Tree
e compares the ensemble defect evaluated at the root of the on-target decomposition tree to the estimated ensemble defect based on physical quantities calculated efficiently at the leaves of the tree. These data reveal both the progression in design quality and the progression in the accuracy of defect estimation as tree optimization proceeds. Consistent with the performance of the earlier single-complex design process, three striking properties were observed. First, for a random initial sequence, the root defect was large and well-approximated by the leaf-estimated defect (data fall near the diagonal). Second, leaf-optimized sequences that were merged without reoptimization (Mopt=1) were typically estimated to satisfy the stop condition (leaf-estimated defect ≦0.01) but failed to satisfy the root stop condition (root defect ≦0.01) due to emergent defects resulting from crosstalk between merged subsequences. These emergent defects were successfully eliminated during reoptimization of defective subtrees from new random initial subsequences, resulting in final sequence designs that satisfied the root stop condition (root defect ≦0.01).
Furthermore, for the final sequence designs, the leaf-estimated defect typically closely approximated the root defect, indicating that there was minimal crosstalk between merged subsequences and that dummy nucleotides in the leaves did a good job of approximating parental context.
As previously observed, as |s| increased, the cost of optimizing the on target decomposition tree approached 4/3 the cost of a single evaluation of the complex ensemble defect for the on-target (
Indeed, this is typically the case for test tubes containing an on-target structure with |s|ε{200,400,800} (panel d). For example, for |s|=800, approximately 70% of design trials required no off-target destabilization and hence only a single evaluation of the test tube ensemble defect. A further 20% of design trials required only one round of off-target destabilization and a total of two evaluations of the test tube ensemble defect, leading to the observed stair step structure in the cumulative histogram of relative design cost (panel d). For test tubes containing an on-target structure with |s|=100, off-target destabilization was typically required, and a typical design trial costs about three times the cost of a single evaluation of the test tube ensemble defect.
Test Tube Ensemble Defect Estimation within the Decomposition Forest
e compares the test tube ensemble defect evaluated at the root level of the decomposition forest with the leaf-estimated defect. These data reveal both the progression in design quality and the progression in the accuracy of root defect estimation as optimization of the decomposition forest proceeds for the full test tube design process. Because of the presence of off-targets within the test tube (unlike
Methods of analyzing and designing equilibrium nucleic acid secondary structure depend on empirical free energy models. It is inevitable that the parameter sets in these models will continue to be refined. In order to make useful design predictions based on an approximate physical model, it is important that conclusions about design quality are robust to model perturbations. To assess the sensitivity of the test tube ensemble defect to model perturbations, we considered all 600 design trials for target test tubes with |s|=200. Each sequence design was evaluated using 100 perturbed parameter sets with each parameter perturbed by Gaussian noise with a standard deviation of 5, 10, 20, or 40 percent of the parameter absolute value (
a-d are line graphs that compare the process performance of the test tube design process using different GC contents for random seeding and reseeding.
The contour plots of
In the standard test set, there is only one on-target complex per test tube, so there is no disadvantage to stabilizing this complex to the maximum extent possible, since all off-target complexes have vanishing target concentration. However, if there are multiple on-target complexes competing for the same strands, then the process needs to look at balancing the relative stability of these competing on-target complexes. To examine this challenge, we considered target test tubes in which a strand was intended to form both a monomer hairpin and a dimer duplex (
b and 12c demonstrate that typical design quality varies greatly depending on the target monomer concentration (i.e., depending on the desired relative stability of the monomer and dimer on-targets). For example, the process typically succeeded in producing designs for low/high monomer/dimer target concentrations but struggled to satisfy the stop condition for high/low monomer/dimer target concentrations. These designs were performed with strong sequence complementarity constraints, requiring nucleotides that were intended complements in one or more on-target structures to be Watson-Crick complements.
If the designs were performed with weak complementarity requirements, permitting the process to introduce wobble pairs or mismatches between intended complements, typical design performance significantly improved (
Because of the competition between on-target complexes, we revisited the question of robustness to model perturbations. The perturbation studies of
Test Tube Design with Multiple On- and Off-Target Complexes
As illustrated above, the test tube design process was found to provide a powerful framework for engineering nucleic acid base pairing to conform to a target secondary structure at a target concentration. The desired equilibrium base-pairing properties for candidate nucleic acid molecules in a predetermined environment (such as a dilute solution in a test tube) were specified as an arbitrary number of on-target complexes, each with a target secondary structure and target concentration, and an arbitrary number of off-target complexes, each with vanishing target concentration.
Given a theoretical target test tube, embodiments of the invention determine a test tube ensemble defect that quantifies the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the test tube. Embodiments of the test tube ensemble defect optimization process implements a positive design paradigm (stabilize on-targets) and a negative design paradigm (destabilize off-targets) at two levels: a) designing for the on-target structure and against the off-target structures within the structural ensemble of each on-target complex, and b) designing for the on-target complexes and against the off-target complexes within the ensemble of the test tube. Using the hierarchical mutation process described above, test tube designs involving multiple on- and off-targets for strand lengths of practical interest to the molecular programming and synthetic biology communities can be realized.
In the preceding description, specific details are given to provide a thorough understanding of the examples. However, it will be understood by one of ordinary skill in the art that the examples may be practiced without these specific details. For example, electrical components/devices may be shown in block diagrams in order not to obscure the examples in unnecessary detail. In other instances, such components, other structures and techniques may be shown in detail to further explain the examples.
It is also noted that the examples may be described as a process, which is depicted as a flowchart, a flow diagram, a finite state diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, or concurrently, and the process can be repeated. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a software function, its termination corresponds to a return of the function to the calling function or the main function.
Those of skill in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those having skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and process steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. One skilled in the art will recognize that a portion, or a part, may comprise something less than, or equal to, a whole. For example, a portion of a collection of pixels may refer to a sub-collection of those pixels.
The various illustrative logical blocks, modules, and circuits described in connection with the implementations disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or process described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of non-transitory storage medium known in the art. An exemplary computer-readable storage medium is coupled to the processor such the processor can read information from, and write information to, the computer-readable storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal, camera, or other device. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal, camera, or other device.
Headings are included herein for reference and to aid in locating various sections. These headings are not intended to limit the scope of the concepts described with respect thereto. Such concepts may have applicability throughout the entire specification.
The previous description of the disclosed implementations is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these implementations will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other implementations without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the implementations shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
This application claim priority to U.S. Provisional Application Ser. No. 61/704,012 filed on Sep. 21, 2012, the entirety of which is hereby incorporated by reference.
This invention was made with government support under CCF0832824 and CFF0448835 awarded by the National Science Foundation. The government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
61704012 | Sep 2012 | US |