SYSTEMS AND METHODS OF DESIGNING NUCLEIC ACIDS THAT FORM PREDETERMINED SECONDARY STRUCTURE

BACKGROUND OF THE INVENTION

1. Field of the Invention

Embodiments of the invention relate to systems and methods for designing a nucleic acid molecule that will fold into a target secondary structure at a target concentration. More specifically, embodiments relate to methods and systems that allow the design of particular target secondary structures that will exist in a predefined environment, such as a dilute solution in a test tube.

2. Description of the Related Art

The programmable chemistry of nucleic acid base pairing serves as a versatile medium for the rational design of self-assembling molecular structures, devices, and systems. To date, considerable effort has been invested in designing the equilibrium base pairing properties of a single complex of (one or more) interacting nucleic acid strands. However, in these earlier efforts, neither the concentration of the complex nor the concentrations of other undesired complexes were considered. As a result, sequences that were successfully optimized to stabilize a target secondary structure in the context of a complex may nonetheless fail to ensure that the target complex actually forms at appreciable concentration when the strands are introduced into a test tube in the lab (see FIG. 1).

We have previously shown that the design of target nucleic acid structures can be formulated as an optimization problem based on a physically meaningful objective function, the complex ensemble defect (Zadeh, et al. (2011) NUPACK: Analysis and design of nucleic acid systems. J. Comput. Chem. 32, 170-173 and Zadeh, et al. (2011) Nucleic acid sequence design via efficient ensemble defect optimization. J. Comput. Chem. 32, 439-452). In that work, a candidate sequence was evaluated against a target secondary structure using the complex ensemble defect, which corresponds to the average number of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the complex.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1a and 1b show a schematic comparison of prior art complex design (FIG. 1a) with an embodiment of a test tube design (FIG. 1b).

FIG. 2 is a schematic that shows one embodiment of a hierarchical decomposition process of an on-target structure.

FIG. 3 illustrates efficient estimation of physical quantities from nodal contributions. (a) Complex partition function estimate. Conceptually, Q_k_l^native, the native partition function of child k_l (calculated on child node k_l at cost Θ(|φ_k_l|³)), approximates the contribution of the left-child nucleotides to the partition function of parent k (which can be calculated exactly on parent node k at higher cost Θ(|φ_k|³)). (b) Complex ensemble defect estimate. Conceptually, n_k_l^native, the native defect of child k_l (calculated on child node k_l at cost Θ(|φ_k_l|³)), approximates the contribution of the left-child nucleotides to the native defect of parent k (which can be calculated exactly on parent node k at higher cost Θ(|φ_k|³)).

FIGS. 4a-d are line graphs that show structural features of tetramer target structures for the on-target complexes in a standard test set. FIG. 4a shows the fraction of bases paired. FIG. 4b shows the number of duplex stems per structure. FIG. 4c shows the number of base pairs per stem. FIG. 4d shows the minimum number of base pairs that must be cut to disconnect a strand from a structure.

FIGS. 5a-e show line graphs that compare embodiments of the present invention to prior complex design processes. Embodiments of the test tube design process described below are shown as solid lines and the prior single-complex design process is shown as dashed lines for the design of test tubes containing only a single on-target complex. FIG. 5a shows the normalized ensemble defect, which indicates design quality. The stop condition is depicted as a dashed line. FIG. 5b shows design cost. FIG. 5c shows sequence composition. The initial GC content is depicted as a dashed line. FIG. 5d shows the cost of sequence design relative to a single evaluation of the objective function. The optimality bound is depicted as a dashed line. FIG. 5e shows leaf independence and emergent defects as a comparison of the complex ensemble defect to the leaf-based estimate of the complex ensemble defect at three stages in the sequence design process: random initial sequence, after initial leaf optimization, and after the completion of complex design. Dots represent design trials. Symbols denote medians for each size of on-target structure in the standard test set (|s| ∈ {100, 200, 400, 800}; symbol size increases with |s|). RNA design was performed at 37° C. The test set shown is a single-complex version of the standard test set with all the off-targets removed from the test tubes.

FIG. 6 is a line graph showing the design quality of a set of sequences. Comparison of the test tube ensemble defect achieved for test tube design (solid line; design for the on-targets and against the off-targets) vs the prior complex design (dashed line; design for the on-targets while ignoring the off-targets). The stop condition is depicted as a dashed line. RNA design was performed at 37° C. for the standard test set.

FIGS. 7a-e show the process performance for test tube design. FIG. 7a shows design quality with the stop condition depicted as a dashed line. FIG. 7b shows the design cost. FIG. 7c shows the sequence composition with the initial GC content depicted as a dashed line. FIG. 7d shows the cost of sequence design relative to a single evaluation of the objective function. FIG. 7e shows leaf independence and emergent defects as a comparison of the test tube ensemble defect to the leaf-based estimate of the test tube ensemble defect at four stages in the sequence design process: random initial sequence, after initial leaf optimization (for on-target complexes), after initial forest optimization (for on-target complexes), and after the completion of test tube design (including destabilization of any troublesome off-target complexes). Dots represent design trials. Symbols denote medians for each size of on-target structure in the standard test set (|s| ∈ {100, 200, 400, 800}; symbol size increases with |s|). RNA design was performed at 37° C. for the standard test set.

FIGS. 8a and 8b show parallel process performance. FIG. 8a shows parallel efficiency and FIG. 8b shows parallel speedup using multiple computational cores. Using M computational cores, efficiency(|s|, M) = t(|s|, 1)/(t(|s|, M) × M) and speedup(|s|, M) = t(|s|, 1)/t(|s|, M), where t is wall clock time. Dashed lines denote boundaries between compute nodes, indicating the use of message passing. RNA design was performed at 37° C. on the standard test set.

FIGS. 9a-d are line graphs that show the effect on process performance of GC content used for seeding and reseeding with embodiments of the invention. RNA design was performed at 37° C. on the standard test set.

FIGS. 10a-d show the effect of design material on process performance. RNA design was performed at 37° C. and DNA design was performed at 25° C.

FIGS. 11a-e show test tube design with multiple on- and off-target complexes. In this design, target test tubes contained four tetramer on-targets and all off-target complexes up to size L_max ∈ {0, 1, 2, 3, 4}. FIG. 11a shows design quality: test tube ensemble defect for a test tube containing all complexes of up to size L_max = 4. The stop condition is depicted as a dashed line (f_stop = 0.02). FIG. 11b shows the design cost. FIG. 11c shows the sequence composition. The initial GC content is depicted as a dashed line. FIG. 11d shows the cost of sequence design relative to a single evaluation of the test tube ensemble defect for a test tube containing all complexes of up to size L_max = 4. FIG. 11e shows the relative design cost. RNA sequence design was performed at 37° C.

FIGS. 12a-g show a target test tube and graphs for designing competing on-target complexes. FIG. 12a is a schematic showing that monomer and dimer on-targets compete for the same strand. To examine the challenge of designing the monomer and dimer on-targets with different relative stabilities, a range of target test tubes was defined by sweeping the target concentration of the monomer, y_monomer, from 0 to 1 μM in 0.01 μM increments, while letting y_dimer = (1 − y_monomer)/2, so that the total strand concentration is fixed at 1 μM. FIGS. 12b and 12c show sequence designs with complementarity constraints (intended complements required to be Watson-Crick pairs). FIGS. 12d and 12e show sequence designs without complementarity constraints (intended complements permitted to be Watson-Crick pairs, wobble pairs, or mismatches). FIGS. 12f and 12g show the robustness of design quality predictions to model perturbations for sequence designs without complementarity constraints. Each design trial was evaluated with 100 perturbed models, with every parameter perturbed by Gaussian noise with a standard deviation of 10% of the parameter modulus. FIGS. 12b, 12d and 12f show the median design quality for each target test tube. FIGS. 12c, 12e and 12g show the cumulative histogram of design quality for selected target test tubes with y_monomer ∈ {0, 0.33, 0.67, 1.0} μM. RNA sequence design was performed at 37° C. with 100 design trials for each target test tube. The test tube stop condition is depicted as a dashed line (f_stop = 0.02).

FIG. 13 is a line graph that shows the robustness of design quality predictions to model perturbations. Each sequence design was evaluated using 100 perturbed physical models, with each parameter perturbed by Gaussian noise with a standard deviation of 5, 10, 20, or 40 percent of the parameter modulus. RNA design was performed at 37° C. for target test tubes in the standard test set with |s|=200.

SUMMARY

One embodiment is a method of designing the sequence of nucleic acid molecules that will form a predetermined secondary structure in solution. This method includes providing a set of desired on-target nucleic acid complexes each having a target secondary structure and a target concentration; providing a set of undesired off-target nucleic acid complexes each having a vanishing target concentration; and designing the sequence of one or more nucleic acid molecules that will predominantly form the on-target secondary structures at approximately the on-target concentrations, and will predominantly not form the undesired off-target complexes.

Another embodiment is an electronic system configured to design a sequence of nucleic acid molecules that will form a predetermined secondary structure in solution. This embodiment can include a first module configured to determine a set of desired on-target nucleic acid complexes each having a target secondary structure and a target concentration; a second module configured to determine a set of undesired off-target nucleic acid complexes each having a vanishing target concentration; a processor programmed to design the sequence of one or more nucleic acid molecules that will predominantly form the on-target secondary structures at approximately the on-target concentrations, and will predominantly not form the undesired off-target complexes; and an output port configured to output the design of the one or more nucleic acid molecules.

Yet another embodiment is a non-transitory computer readable medium comprising instructions that when executed by a processor perform a method of: providing a set of desired on-target nucleic acid complexes each having a target secondary structure and a target concentration; providing a set of undesired off-target nucleic acid complexes each having a vanishing target concentration; and designing the sequence of one or more nucleic acid molecules that will predominantly form the on-target secondary structures at approximately the on-target concentrations, and will predominantly not form the undesired off-target complexes.

DETAILED DESCRIPTION

Embodiments of the invention relate to methods and electronic systems for designing a target secondary structure of a nucleic acid molecule that will form as a predominant species under predetermined environmental conditions. For example, in one embodiment the system can design a target secondary structure of a nucleic acid molecule at a target concentration in a dilute solution. In this case, the salt concentration, and/or temperature of the dilute solution would be one predetermined condition. Using the methods, systems and modules described herein, one can calculate the equilibrium base-pairing properties of a dilute solution of interacting nucleic acid strands (e.g., a “test tube”), yielding predictions for the equilibrium concentration and base-pairing probabilities for an arbitrary number of complex species that form from an arbitrary number of strand species.

As used herein a dilute solution means a solution in which the concentration of solvent molecules is much higher than the concentration of nucleic acid strands. For example, dilute solutions may have a concentration of 10 mM, 1 mM, 100 μM, 10 μM, 1 μM, 1 nM, 1 pM, 1 fM, 1 aM, 1 zM, or 1 yM and be within the scope contemplated by embodiments of the invention.

Accordingly, embodiments of the invention include programmed modules that have been configured to formulate the design of nucleic acid sequences in the context of a test tube of interacting nucleic acid strands at equilibrium. As used herein this type of design contemplates not only the base-pairing properties of a candidate nucleic acid strand, but also the predetermined environmental conditions that the nucleic acid will be placed into, and is termed herein “test tube design”. This allows the system to design nucleic acid sequences that interact at equilibrium to form a target secondary structure at a target concentration for an arbitrary number of desired complexes (the “on-target” complexes). The system also designs against formation of an arbitrary number of undesired complexes, each specified with a vanishing target concentration such as zero (the “off-target” complexes).

In one embodiment, a “test tube ensemble defect” is calculated, which represents the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the test tube. By calculating a test tube ensemble defect, the system is able to not just consider whether a target secondary structure has formed, but also take into consideration, and minimize, the formation of other off-target complexes of monomers, dimers, etc. that would form at the same time. This allows the system to determine a nucleic acid that not only forms into a target secondary structure, but also would form that structure as a dominant species in solution.

In one embodiment, the test tube ensemble defect is estimated by decomposing the target structure for each on-target complex into a tree of substructures, as described below. This leads to a forest of decomposition trees, with one tree for each on-target complex. The test tube ensemble defect at the root level of the forest can be efficiently estimated using only physical quantities calculated inexpensively at the leaf nodes of the forest, or at any other level of the forest intermediate between the leaf and root levels.

One embodiment is an electronic system that is configured to design a sequence of nucleic acid molecules, such as RNA molecules, that will form a predetermined secondary structure in solution. This electronic system can comprise several functional modules programmed to carry out aspects of the invention. The modules may be software modules stored in the memory of a computer system or a combination of software and processor executing instructions provided by the software to carry out aspects of the invention. Accordingly, the system may include a first module configured to determine a set of desired on-target nucleic acid complexes each having a target secondary structure and a target concentration. The system may also have a second module configured to determine a set of undesired off-target nucleic acid complexes each having a vanishing target concentration. Finally, the system may include at least one processor programmed to design the sequence of one or more nucleic acid molecules that will predominantly form the on-target secondary structures at approximately the on-target concentrations, and will predominantly not form the undesired off-target complexes. In order to provide an output of the designed nucleic acid molecules, the system may have an output port configured to output the design of the one or more nucleic acid molecules. In this circumstance, the output port may be connected to a computer or mobile display screen for displaying the sequence of the designed nucleic acid molecule, or a printer configured to print a copy of the designed nucleic acid molecule. Other possible output formats known to those with ordinary skill in the art are also contemplated within aspects of the invention.

The one or more processors in the system may also be programmed to calculate a test tube ensemble defect corresponding to the concentration of incorrectly paired nucleotides in a candidate sequence at equilibrium evaluated over the ensemble of a theoretical test tube containing the nucleic acid molecules, by the methods described herein and the pseudocode provided in the Appendix.

FIG. 1a shows a schematic example of the prior art “complex design” method of determining a nucleic acid molecule that will form a target secondary structure 100. Sequence design formulated in the context of the target complex 100 ensures that at equilibrium the calculated target structure 110 dominates the structural ensemble of the complex that is determined by the complex design process. Unfortunately, subsequent thermodynamic analysis of the calculated target molecule in the context of a test tube can reveal that the desired target complex 100 occurs at negligible concentration (4 nM) relative to other undesired monomers and homodimers when actually built and put into solution in a test tube. FIG. 1b illustrates one embodiment of the result of the “test tube design” process described herein, wherein the target sequence structure 150 is formulated in the context of a test tube 160, which ensures that at equilibrium an actual molecule 170 having the calculated nucleotide sequence of the desired ‘on-target’ complex will be the dominant structure and form at approximately its target concentration. Moreover, with test tube design, the undesired ‘off-target’ complexes (all monomers and homodimers) will form at negligible concentrations. This allows any subsequent thermodynamic analysis in the context of a test tube (right) to be consistent with the test tube design formulation.

In one embodiment of a test tube design system, as described herein, the user specifies: 1) a set of desired on-target complexes, each with a target secondary structure and target concentration, 2) a set of undesired off-target complexes, each with vanishing target concentration. Given these parameters, the modules within the system derive a “test tube ensemble defect”, corresponding to the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the test tube. To efficiently optimize the test tube ensemble defect, the system builds on hierarchical sequence optimization concepts previously developed by us for complex design (See U.S. Pat. No. 8,478,543 entitled “System and Method for Nucleic Acid Sequence Design”, hereby incorporated by reference in its entirety). However, embodiments of the current invention address new conceptual challenges that arise in the context of test tube design which better reflects real-world experimental conditions, in that undesired off-target complexes compete with the desired on-target complexes.

In one embodiment, the process of determining the sequence of a nucleic acid that will form a target structure at a target concentration in predetermined environmental conditions begins by describing the physical quantities that provide the basis for analyzing and designing the equilibrium base-pairing properties of a test tube of interacting nucleic acid strands. This description starts with defining a secondary structure model for the nucleic acid strands to be evaluated.

Secondary Structure Model

The sequence, φ, of one or more interacting RNA strands is specified as a list of bases φ^a ∈ {A, C, G, U} for a = 1, . . . , |φ| (T replaces U for DNA). A secondary structure, s, of one or more interacting RNA strands is defined by a set of base pairs (each a Watson-Crick pair [A•U or C•G] or a wobble pair [G•U]). A polymer graph representation of a secondary structure is constructed by ordering the strands around a circle, drawing the backbones in succession from 5′ to 3′ around the circumference with a nick between each strand, and drawing straight lines connecting paired bases. A secondary structure is unpseudoknotted if there exists a strand ordering for which the polymer graph has no crossing lines. A secondary structure is connected if no subset of the strands is free of the others. A complex of interacting strands with strand ordering, π, has structural ensemble, Γ(π), containing all connected polymer graphs with no crossing lines. For sequence φ and secondary structure, s ∈ Γ(π), the free energy, ΔG(φ, s), is calculated using nearest-neighbor empirical parameters for RNA in 1M Na⁺ or for DNA in user-specified Na⁺ and Mg⁺⁺ concentrations. These physical models have practical utility for the analysis and design of functional nucleic acid systems, and provide the basis for rational analysis and design of equilibrium base-pairing in the context of a dilute solution.
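The no-crossing condition on the polymer graph can be illustrated with a short Python sketch. It assumes a structure is represented as a list of (a, b) index pairs numbered 5′ to 3′ along the strands in one chosen ordering; the function name `has_crossing` and this representation are illustrative assumptions rather than part of the described system. The check covers a single strand ordering only, whereas a structure is unpseudoknotted if any ordering passes.

```python
from itertools import combinations

def has_crossing(pairs):
    """Return True if any two base pairs cross in the polymer graph.

    `pairs` holds (a, b) tuples of nucleotide indices numbered 5' to 3'
    along the strands in one chosen ordering (an assumed representation).
    """
    normalized = [tuple(sorted(p)) for p in pairs]
    for (a, b), (c, d) in combinations(normalized, 2):
        # Two chords of the circle cross when exactly one endpoint of one
        # pair lies strictly between the endpoints of the other pair.
        if a < c < b < d or c < a < d < b:
            return True
    return False

nested = [(1, 10), (2, 9), (3, 8)]   # stacked stem: lines do not cross
interleaved = [(1, 6), (4, 10)]      # interleaved pairs: lines cross
print(has_crossing(nested), has_crossing(interleaved))
```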

Now that a model has been formulated to analyze the secondary structure of interacting nucleic acid molecules, the system also provides a means for determining the equilibrium base pairing of nucleotides within the interacting nucleic acid molecules.

Analyzing Equilibrium Base-Pairing in a Test Tube

Let Ψ⁰ denote the set of strand species that interact in a test tube to form the set of complex species Ψ. For complex j ∈ Ψ, with sequence φ_j and structural ensemble Γ_j, the partition function

$Q (φ_{j}) = \sum_{s \in Γ_{j}} \exp \left[ - Δ G (φ_{j}, s) / k_{B} T \right]$

can be used to calculate the equilibrium probability of any secondary structure s ∈ Γ_j:

$p (φ_{j}, s) = \exp \left[ - Δ G (φ_{j}, s) / k_{B} T \right] / Q (φ_{j}).$

Here, k_B is the Boltzmann constant and T is temperature. The equilibrium base-pairing properties of complex j are characterized by the base-pairing probability matrix P(φ_j), with entries P^{a,b}(φ_j) ∈ [0, 1] corresponding to the probability,

$P^{a, b} (φ_{j}) = \sum_{s \in Γ_{j}} p (φ_{j}, s) S^{a, b} (s),$

that base pair a·b forms at equilibrium within ensemble Γ_j. Here, S(s) is a structure matrix with entries S^{a,b}(s) = 1 if structure s contains base pair a·b and S^{a,b}(s) = 0 otherwise. For convenience, the structure and probability matrices are augmented with an extra column to describe unpaired bases. The entry S^{a,|s|+1}(s) is unity if base a is unpaired in structure s and zero otherwise; the entry P^{a,|φ_j|+1}(φ_j) ∈ [0, 1] denotes the equilibrium probability that base a is unpaired over ensemble Γ_j. Hence the row sums of the augmented S(s) and P(φ_j) matrices are unity.
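As a minimal sketch of these definitions, the following Python code computes the partition function, the structure probabilities, and the augmented pair probability matrix for a tiny ensemble that is enumerated explicitly. The function name `analyze_ensemble`, the 0-based indexing, and the free energies in the example are hypothetical; a practical system would obtain Q and P from dynamic programs rather than by enumeration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K), i.e. k_B on a per-mole basis

def analyze_ensemble(ensemble, n_bases, temperature_k=310.15):
    """Boltzmann analysis of an explicitly enumerated ensemble.

    ensemble: list of (pairs, free_energy) with pairs as 0-based (a, b)
    tuples and free_energy in kcal/mol (values here are hypothetical).
    Returns (Q, structure probabilities, augmented pair probability matrix).
    """
    kT = R_KCAL * temperature_k
    weights = [math.exp(-dG / kT) for _, dG in ensemble]
    Q = sum(weights)
    probs = [w / Q for w in weights]
    # Column index n_bases is the extra column for unpaired bases.
    P = [[0.0] * (n_bases + 1) for _ in range(n_bases)]
    for (pairs, _), p in zip(ensemble, probs):
        paired = set()
        for a, b in pairs:
            P[a][b] += p
            P[b][a] += p
            paired.update((a, b))
        for a in range(n_bases):
            if a not in paired:
                P[a][n_bases] += p
    return Q, probs, P

# Toy 6-nucleotide ensemble: open chain, one pair, two stacked pairs.
toy = [([], 0.0), ([(0, 5)], -1.2), ([(0, 5), (1, 4)], -2.5)]
Q, probs, P = analyze_ensemble(toy, n_bases=6)
print(Q, [round(sum(row), 6) for row in P])
```

Each row of the augmented matrix sums to one because every base is either paired with exactly one partner or unpaired in each structure.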

Let Q_Ψ ≡ Q_j ∀ j ∈ Ψ denote the set of partition functions for the complexes in the test tube. The set of equilibrium concentrations, x_Ψ, (specified as mole fractions) are the unique solution to the strictly convex optimization problem:

$\begin{matrix} \min_{x_{Ψ}} \sum_{j \in Ψ} x_{j} (\log x_{j} - \log Q_{j} - 1) \quad \text{subject to} & (1 a) \\ \sum_{j \in Ψ} A_{i, j} x_{j} = x_{i}^{0} \quad \forall i \in Ψ^{0}, & (1 b) \end{matrix}$

where the constraints impose conservation of mass. A is the stoichiometry matrix with entries A_{i,j} corresponding to the number of strands of type i in complex j, and x_i⁰ is the total concentration of strand i introduced to the test tube.
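To make the structure of this problem concrete, the sketch below solves a simplified case with a single strand species, so that the stoichiometry matrix reduces to one row of strand counts. Setting the derivative of the Lagrangian to zero gives x_j = Q_j exp(λ A_j), and the multiplier λ is then fixed by mass conservation. The code finds λ by bisection, which is a substitute chosen for brevity and not the trust region method referred to in this description; the function name and the partition function values are hypothetical.

```python
import math

def equilibrium_concentrations(Q, n_strands, x0, iterations=200):
    """Equilibrium mole fractions for complexes of ONE strand species.

    Q[j] is the partition function of complex j, n_strands[j] the number
    of strands it contains, and x0 the total strand mole fraction.
    Stationarity of (1a) under constraint (1b) gives
    x_j = Q_j * exp(lam * n_j); lam is found by bisection.
    """
    def mass_residual(lam):
        return sum(n * q * math.exp(n * lam) for q, n in zip(Q, n_strands)) - x0

    lo, hi = -1.0, 1.0
    while mass_residual(lo) > 0:   # expand bracket downward
        lo *= 2.0
    while mass_residual(hi) < 0:   # expand bracket upward
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mass_residual(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    return [q * math.exp(n * lam) for q, n in zip(Q, n_strands)]

# Monomer and homodimer competing for the same strand (hypothetical Q values).
x_monomer, x_dimer = equilibrium_concentrations([1.0, 1.0e8], [1, 2], 1.0e-6)
print(x_monomer, x_dimer, x_monomer + 2 * x_dimer)
```

The residual is strictly increasing in λ, so the bracketed root is unique, consistent with the uniqueness of the solution to the strictly convex problem.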

To analyze the equilibrium base-pairing properties of a test tube of nucleic acid strands, the partition function, Q(φ_j), and equilibrium pair probability matrix, P(φ_j), are calculated for each complex j ∈ Ψ using Θ(|φ_j|³) dynamic program modules. The equilibrium concentrations, x_Ψ, are calculated by solving the convex programming problem (equation (1)) using an efficient trust region method at a cost that is typically negligible by comparison. The overall time complexity to analyze the test tube is then O(|Ψ| |φ|³_max), where |φ|_max is the size of the largest complex.

In specifying an analysis problem, a convenient and powerful approach is to define Ψ to include all complexes of up to L_max strands. For a test tube containing the set of strands, Ψ⁰, the total number of complexes of up to size L_max that can form is:

$\begin{matrix} |Ψ| = \sum_{L = 1}^{L_{\max}} \sum_{l = 1}^{L} \frac{|Ψ^{0}|^{\gcd (l, L)}}{L}, & (2) \end{matrix}$

so the overall time complexity to analyze the test tube is O(|Ψ⁰|^{L_max} |s|³_max / L_max).
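Equation (2) can be evaluated directly, as in the following Python sketch (the function name `count_complexes` is an illustrative choice). The inner sum is always divisible by L, so integer division is exact. For four strand species and L_max = 4 the terms are 4 + 10 + 24 + 70, which gives the 108 complexes cited in the problem specification.

```python
from math import gcd

def count_complexes(n_strand_species, l_max):
    """Number of distinct complexes of up to l_max strands, per equation (2)."""
    total = 0
    for L in range(1, l_max + 1):
        # Sum over rotations l of the circular strand ordering.
        rotations = sum(n_strand_species ** gcd(l, L) for l in range(1, L + 1))
        total += rotations // L  # exact: the sum is a multiple of L
    return total

print(count_complexes(4, 4))  # 108
```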

Test Tube Design Problem Specification

A test tube design problem is specified as a target test tube containing a set of desired on-target complexes, Ψ_on, and a set of undesired off-target complexes, Ψ_off. The set of complexes in the test tube is then:

Ψ = Ψ_on ∪ Ψ_off.

Each complex, j ∈ Ψ, is specified as a strand ordering, π_j, corresponding to structural ensemble Γ(π_j). For each on-target complex, j ∈ Ψ_on, the user specifies a target secondary structure, s_j, and a target concentration, y_j. For each off-target complex, j ∈ Ψ_off, the target concentration is vanishing (y_j = 0) and there is no target structure (s_j = Ø). When specifying the off-targets in Ψ_off, it is convenient to include all complexes of up to L_max strands. For example, by equation (2), four strands can interact to form 108 complexes of up to size four.

Complementarity constraints may be imposed on the design at the sequence level by defining strands in terms of sequence domains (e.g., see the sequence domains in the monomer and dimer on-target structures of FIG. 12a) and at the structural level by specifying base-pairing within the on-target structures. Complementarity constraints can propagate between complexes if, for example, nucleotides a and b are paired in one on-target structure and nucleotides b and c are paired in another on-target structure.

Test Tube Ensemble Defect Objective Function

Described herein are methods, systems and modules configured to perform sequence optimization for a test tube design based on a physically meaningful objective function that quantifies sequence quality with respect to the environment of a target test tube. This allows the design of nucleic acid sequences that will form a target secondary structure in a chosen concentration when tested in vitro in solution.

As a precedent for this approach, consider the related problem of complex design, where the goal is to design strands that, at equilibrium, adopt a target secondary structure within the ensemble of a complex, without considering the environment (e.g., a dilute solution) that the nucleic acid will eventually be placed within. For a candidate sequence, φ_j, and target structure, s_j, the complex ensemble defect

$\begin{matrix} n (φ_{j}, s_{j}) = |s_{j}| - \sum_{\substack{1 \leq a \leq |φ_{j}| \\ 1 \leq b \leq |φ_{j}| + 1}} P^{a, b} (φ_{j}) S^{a, b} (s_{j}), & (3) \end{matrix}$

is the average number of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the complex, Γ_j. The complex ensemble defect falls in the interval (0,|s_j|). For complex design, the complex ensemble defect provides a physically meaningful objective function for quantifying sequence quality.
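The complex ensemble defect of equation (3) can be sketched in Python as follows. The function assumes the augmented pair probability matrix is already available as a list of rows with 0-based indices, with the last column holding unpaired probabilities; the function name and the example probabilities are hypothetical.

```python
def complex_ensemble_defect(P, target_pairs):
    """Average number of incorrectly paired nucleotides, per equation (3).

    P: augmented pair probability matrix, n rows by n + 1 columns, where
       column n holds the probability that a base is unpaired.
    target_pairs: 0-based (a, b) tuples defining the target structure.
    """
    n = len(P)
    # Build the augmented structure matrix S(s) for the target structure.
    S = [[0] * (n + 1) for _ in range(n)]
    paired = set()
    for a, b in target_pairs:
        S[a][b] = S[b][a] = 1
        paired.update((a, b))
    for a in range(n):
        if a not in paired:
            S[a][n] = 1
    correct = sum(P[a][b] * S[a][b] for a in range(n) for b in range(n + 1))
    return n - correct

# Four nucleotides, target pairs base 0 with base 3 (probabilities are made up).
P_example = [
    [0.0, 0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.8, 0.0, 0.0, 0.0, 0.2],
]
print(complex_ensemble_defect(P_example, [(0, 3)]))
```

In this example the two target-paired bases are each correct with probability 0.8 and the two unpaired bases are always correct, so the defect is 4 − 3.6 = 0.4 nucleotides.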

Here, to provide a basis for evaluating candidate nucleic acid sequences in the context of a test tube, we derive the “test tube ensemble defect”, representing the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the test tube. For a target test tube with target secondary structures, s_Ψ, target concentrations, y_Ψ, and candidate sequences, φ_Ψ, the test tube ensemble defect

$\begin{matrix} C (φ_{Ψ}, s_{Ψ}, y_{Ψ}) = \sum_{j \in Ψ}^{} c (φ_{j}, s_{j}, y_{j}) & (4) \end{matrix}$

may be expressed in terms of the defect contribution of each complex j ∈ Ψ:

$\begin{matrix} c (φ_{j}, s_{j}, y_{j}) = n (φ_{j}, s_{j}) \min (x_{j}, y_{j}) + |s_{j}| \max (y_{j} - x_{j}, 0) . & (5) \end{matrix}$

For each on-target complex j ∈ Ψ_on, the first term in equation (5) represents the structural defect, quantifying the concentration of nucleotides that are in an incorrect base-pairing state on average within the ensemble of complex j, and the second term represents the concentration defect, quantifying the concentration of nucleotides that are in an incorrect base-pairing state because there is a deficiency in the concentration of complex j. Because y_j = 0 for off-target complexes, the structural and concentration defects are both identically zero (so the sum in equation (4) may be written over Ψ_on instead of Ψ). This does not mean that the defects associated with the off-targets are ignored. By conservation of mass, non-zero off-target concentrations imply deficiencies in on-target concentrations, and these concentration defects are quantified by equation (4).

The test tube ensemble defect falls in the interval (0, y_nt), where

$y_{nt} \equiv \sum_{j \in Ψ_{on}} |s_{j}| y_{j}$

is the total concentration of nucleotides in the test tube.

Note that if there is only one species of complex in the test tube (|Ψ|=1), its concentration is necessarily equal to the target concentration (x₁=y₁), so the formulation is independent of concentration. In this case, optimization of the test tube ensemble defect, C(φ₁,s₁,y₁), is equivalent to optimization of the complex ensemble defect, n(φ₁,s₁).
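Equations (4) and (5) translate into a few lines of Python. The sketch below assumes that the complex ensemble defect n_j, the size |s_j|, the equilibrium concentration x_j, and the target concentration y_j have already been computed for each complex; the function names and numerical values are hypothetical.

```python
def defect_contribution(n_j, size_j, x_j, y_j):
    """Defect contribution of one complex, per equation (5)."""
    structural = n_j * min(x_j, y_j)             # wrong pairing within complex j
    concentration = size_j * max(y_j - x_j, 0.0) # shortfall in complex j
    return structural + concentration

def tube_ensemble_defect(complexes):
    """Sum of contributions over all complexes, per equation (4).

    complexes: iterable of (n_j, size_j, x_j, y_j) tuples.
    """
    return sum(defect_contribution(*c) for c in complexes)

# One on-target at 90% of its 1 uM target, plus one off-target (y_j = 0).
on_target = (2.0, 100, 0.9e-6, 1.0e-6)
off_target = (50.0, 100, 0.05e-6, 0.0)
C = tube_ensemble_defect([on_target, off_target])
y_nt = on_target[1] * on_target[3]
print(C, C / y_nt)
```

The off-target contributes zero directly, as noted above, but the strands it sequesters show up as the concentration shortfall of the on-target.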

Calculation of the test tube ensemble defect (equation 4) requires calculation of the complex partition functions, Q_Ψ, which are used to calculate the equilibrium concentrations, x_Ψ, as well as the equilibrium pair probability matrices, P_Ψ_on, which are used to calculate the complex ensemble defects, n_Ψ_on. Hence, the time complexity to evaluate the test tube ensemble defect is the same as the time complexity to analyze equilibrium base-pairing in a test tube.

Overview of the Process

With the above structure in place, below is a description of a “test tube” design system and process that includes modules for calculating the nucleic acid sequence of a nucleotide strand that will adopt a target secondary structure at a target concentration in solution based on test tube ensemble defect optimization. For a target test tube with target secondary structures, s_Ψ, and target concentrations, y_Ψ, the system seeks to design a set of sequences, φ_Ψ, such that the test tube ensemble defect satisfies the test tube stop condition:

$\begin{matrix} C (φ_{Ψ}, s_{Ψ}, y_{Ψ}) \leq C_{stop} & (6) \end{matrix}$

with

$\begin{matrix} C_{stop} \equiv f_{stop} \, y_{nt} & (7) \end{matrix}$

for a user-specified value of f_stop ∈ (0, 1). It should be realized that the f_stop condition may be, for example, between 0.5 and 0.001 in some embodiments. For example, the f_stop condition may be 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, or 0.001. Using this notation, an f_stop of 0.01 would correspond to requiring a normalized test tube ensemble defect of no more than 1%.
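The stop condition of equations (6) and (7) can be sketched as a simple threshold test. The function names and the representation of on-targets as (size, target concentration) tuples are illustrative assumptions.

```python
def stop_threshold(f_stop, on_targets):
    """C_stop = f_stop * y_nt, per equation (7).

    on_targets: iterable of (size_j, y_j) for the on-target complexes.
    """
    y_nt = sum(size * y for size, y in on_targets)
    return f_stop * y_nt

def satisfies_stop_condition(C, f_stop, on_targets):
    """True when the test tube ensemble defect C meets equation (6)."""
    return C <= stop_threshold(f_stop, on_targets)

# One 100-nt on-target at 1 uM with f_stop = 0.02 (hypothetical design).
targets = [(100, 1.0e-6)]
print(stop_threshold(0.02, targets))
print(satisfies_stop_condition(1.5e-6, 0.02, targets))
```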

The test tube ensemble defect is reduced via iterative mutation of a random initial sequence. Because of the high computational cost of calculating the test tube ensemble defect, the system tries to avoid direct recalculation of C in evaluating each candidate mutation. To reduce the cost of sequence optimization, each on-target structure is decomposed into a tree of substructures, yielding a “forest” of decomposition trees. Candidate mutations are evaluated efficiently by estimating the test tube ensemble defect based on nodal contributions calculated efficiently at the leaves of the decomposition forest.

During leaf optimization, defect-weighted mutation sampling is used to select each candidate mutation position with probability proportional to its contribution to the estimated test tube ensemble defect. As optimized subsequences are merged toward the root level of the forest, emergent defects that arise due to crosstalk between subsequences are eliminated via reoptimization within a defective subtree. After subsequences are merged to the root level, the full test tube ensemble defect, C, is calculated for the first time, including all on- and off-target complexes in the test tube ensemble. Any off-target complexes that form at appreciable concentration are decomposed, added to the decomposition forest, and actively destabilized during subsequent forest reoptimization. The exclusion of off-targets from the decomposition forest during the initial phase of sequence design is critical to enabling the design of test tubes containing large numbers of off-target complexes (e.g., 10⁴ off-targets). The elements of this hierarchical sequence design process are described below and detailed in the pseudocode in the attached appendix.

Initialization

To initialize the system, a set of nucleic acid molecule complexes, Ψ, is partitioned into two disjoint sets of complexes:

Ψ=Ψ_active∪Ψ_passive

where Ψ_active denotes complexes that will be actively designed and Ψ_passive denotes complexes that will inherit sequence information from Ψ_active. Initially, we set

Ψ_active=Ψ_on,Ψ_passive=Ψ_off

so that only the on-target complexes are actively designed. The user-specified on-target structures provide the basis for hierarchical structure decomposition, which enables efficient sequence design. The sequences for the complexes j∈Ψ_active are randomly initialized subject to the complementarity constraints provided by the design problem specification. Watson-Crick complements are used to initialize complementary sequence domains or any bases that are paired within an on-target structure.

Hierarchical Decomposition of On-Target Structures.

The hierarchical decomposition module is configured to decompose each on-target structure into a binary tree of substructures. Each target structure s_j, j∈Ψ_active, is decomposed into a (possibly unbalanced) binary tree of substructures, resulting in a forest of |Ψ_on| trees. Each node in the forest is indexed by a unique integer k. For each parent node, k, there is a left child node, k_l, and a right child node, k_r. Each nucleotide in parent structure s_k is partitioned to either the left or right child substructure (s_k=s_k^l∪s_k^r, s_k^l∩s_k^r=Ø). Eligible split-points are those locations within a duplex stem with at least H_split consecutive base pairs on either side, such that each child would have at least N_split nucleotides. An eligible split-point is selected so as to minimize the difference in the size of the children, ||s_k^l|−|s_k^r||. Child node k_l inherits from parent node k the substructure s_k^l augmented by dummy nucleotides that approximate the influence of its sibling in the context of their parent. Dummy nucleotides are defined by extending the newly split duplex stem across the split-point by H_split base pairs (|s_k_l|=|s_k^l|+2H_split). All nucleotides in root nodes are termed “native nucleotides”. Nucleotides that are native in a parent are inherited as native in a child (|s_k_l^native|=|s_k^l∩s_k^native|). Nucleotides that are dummy in a parent are inherited as dummy in a child (|s_k_l^dummy|=|s_k^l∩s_k^dummy|+2H_split). See FIG. 2 for an example of hierarchical structure decomposition. In FIG. 2, a selected split-point within each parent is denoted by a solid line. The dummy nucleotides within each child are depicted in light grey. The native nucleotides within each structure are depicted in black. H_split=2, N_split=20 in this example.
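The split-point selection rule described above can be illustrated with a minimal sketch. It assumes a single-stranded structure written in dot-parenthesis notation with no pseudoknots; the function names (`pair_table`, `stacked_run`, `choose_split_point`) are hypothetical and are not taken from the appendix pseudocode. Child sizes are counted before dummy nucleotides are added.

```python
def pair_table(dot_paren):
    """Map each nucleotide index to its partner index (-1 if unpaired)."""
    partner = [-1] * len(dot_paren)
    stack = []
    for i, ch in enumerate(dot_paren):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def stacked_run(partner, i, j, step):
    """Count consecutive stacked pairs (i, j), (i+step, j-step), ..."""
    count = 0
    while 0 <= i < j < len(partner) and partner[i] == j:
        count += 1
        i, j = i + step, j - step
    return count


def choose_split_point(dot_paren, h_split=2, n_split=20):
    """Return the outer pair (i, j) of the selected split-point, or None.

    The split falls between stacked pairs (i, j) and (i+1, j-1); the inner
    child holds nucleotides i+1..j-1 and the outer child holds the rest.
    """
    n = len(dot_paren)
    partner = pair_table(dot_paren)
    best, best_diff = None, None
    for i, j in enumerate(partner):
        if j <= i:
            continue  # unpaired, or the 3' side of a pair
        # Require at least h_split stacked pairs on each side of the split.
        outer_ok = stacked_run(partner, i, j, -1) >= h_split
        inner_ok = stacked_run(partner, i + 1, j - 1, +1) >= h_split
        if not (outer_ok and inner_ok):
            continue
        inner_size = j - i - 1
        outer_size = n - inner_size
        if min(inner_size, outer_size) < n_split:
            continue  # a child would be too small
        diff = abs(inner_size - outer_size)
        if best is None or diff < best_diff:
            best, best_diff = (i, j), diff
    return best
```

For a 22-nucleotide hairpin with a six-pair stem, `choose_split_point("(" * 6 + "." * 10 + ")" * 6, h_split=2, n_split=4)` selects the split that leaves two stacked pairs on the inner side, since that choice gives the most balanced children; with the default N_split=20 no split-point is eligible and the node would remain a leaf.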

Decomposition of the sequence, φ_k, is performed in accordance with decomposition of structure s_k. If the maximum depth of a leaf in the forest of binary trees is D, any nodes with depth d<D that lack an eligible split-point are replicated at each depth down to D so that all leaves have depth D. Let Λ denote the set of all nodes in the forest. Let Λ_d denote the set of all nodes at depth d. Let Λ_d,j denote the set of all nodes at depth d resulting from decomposition of complex j. Each nucleotide in complex j is native in exactly one nodal structure, s_k, k∈Λ_d,j, at any depth d∈{1, ..., D}.

Test Tube Ensemble Defect Estimation from Nodal Contributions.

The nodal test tube ensemble defect estimation module is configured to estimate the ensemble defect of one or more candidate nucleic acid molecules over the ensemble of a collection of nucleic acid molecules in a theoretical test tube based on the contributions of each nodal estimate. Evaluation of the test tube ensemble defect (equation (4)) at the root level of the forest requires calculation of the defect contribution (equation (5)) of each complex in the active set Ψ_active. The O(|Ψ||φ|_max³) time complexity for this calculation results from the Θ(|φ_j|³) dynamic programs used to calculate the partition function, Q_j, and equilibrium pair probability matrix, P_j, for each complex j∈Ψ. To reduce the cost of evaluating candidate sequences, we derive decompositions of the relevant physical quantities so that the defect contribution of each complex (equation (5)) can be estimated less expensively using nodal contributions calculated at any depth d∈{2, ..., D} within its decomposition forest.

For each complex j∈Ψ, the cost of estimating the defect contribution c_j at level d is dominated by calculation of the nodal partition function, Q_k, and nodal pair probability matrix, P_k, at cost Θ(|φ_k|³) for each node k∈Λ_d,j. For an optimal decomposition, |φ_k| halves and |Λ_d,j| doubles at each level moving down the tree, so the cost of estimating c_j at level d can be reduced by a factor of 2^(2d−2) relative to the cost of calculating c_j exactly at the root. Hence, for maximum efficiency, all candidate mutations are evaluated by estimating the test tube ensemble defect at the leaf level of the forest (depth d=D). As subsequences are merged toward the root level, the test tube ensemble defect is estimated at intermediate depths in the forest.
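The cost reduction follows from simple arithmetic, sketched below under idealized assumptions: a perfectly balanced binary decomposition, strictly cubic cost per node, and no overhead from dummy nucleotides. The function name is hypothetical.

```python
def relative_estimation_cost(depth):
    """Cost of a depth-d estimate relative to exact evaluation at the root.

    Assumes 2**(depth - 1) nodes, each holding an equal share of the
    nucleotides, and a cubic-time dynamic program per node.
    """
    nodes = 2 ** (depth - 1)
    size_fraction = 1.0 / nodes
    return nodes * size_fraction ** 3
```

At depth d=3, for instance, four nodes each hold one quarter of the nucleotides, so the relative cost is 4·(1/4)³ = 1/16 = (1/2)^(2d−2). In practice the dummy nucleotides and unbalanced trees make the realized savings somewhat smaller.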

The accuracy of the defect estimate will depend on the equilibrium structural properties of the sequence. If a split-point partitions a parent structure within a duplex that is predominantly well-formed at equilibrium, the physical properties of the parent can be accurately estimated based on the physical properties of the native nucleotides in its children, because the children are relatively isolated from each other by the base pairs adjacent to the split-point. The role of the dummy nucleotides within each child is to approximate the stabilizing influence of the missing sibling on the base pairs adjacent to the split-point. As the quality of the sequence design improves, the quality of the decomposition approximation will also improve as the duplex containing the split-point increasingly dominates at equilibrium. The accuracy of the decomposition breaks down if there is crosstalk when sibling sequences are merged within a parent; crosstalk can destabilize the duplex containing the split-point, undermining the validity of the decomposition. The utility of root defect estimation hinges on the assumption that sequence space is sufficiently rich that subsequences within the decomposition forest will often not exhibit crosstalk when merged to the root.

The hierarchical mutation procedure exploits root defect estimation when crosstalk is absent, and eliminates crosstalk when it does arise during subsequence merging.

The following sections describe how to calculate each of the nodal contributions at any level d∈{2, ..., D} so as to efficiently and accurately estimate the complex contributions, c_Ψ_active, to the test tube ensemble defect. Also described below is how to construct the complex partition function estimates, Q̃_Ψ_active, using nodal partition functions, Q_Λ_d, and nodal pair probability matrices, P_Λ_d. Complex concentration estimates, x̃_Ψ_active, are then calculated based on Q̃_Ψ_active, using deflated mass constraints to model the effect of the neglected off-target complexes in Ψ_passive. Complex ensemble defect estimates, ñ_Ψ_on, are calculated based on P_Λ_d. These estimates are then used to calculate the defect estimates, c̃_Ψ_active, which are summed to produce the test tube ensemble defect estimate, C̃.

Estimation of the Complex Partition Function.

The complex partition function module is configured to estimate the partition function for a root node in a decomposition tree. We begin by calculating the complex partition function estimate, Q̃_j, for each complex j∈Ψ_active in terms of partition function contributions evaluated efficiently at the nodes k∈Λ_d,j at any level d∈{2, ..., D}. This decomposition is illustrated for parent node k and its children k_l and k_r in FIG. 3a.

The utility of this approach hinges on the assumption that the base pairs on either side of a split-point are predominantly well-formed at equilibrium, so that the partition functions of two sibling nodes are approximately independent and can be usefully combined to approximate the partition function of their parent node. Let E_k denote the set of native base pairs adjacent to decomposition split-points in node k. For each base pair a·b∈E_k, let φ_k^dummy(a·b) denote the sequence of a·b and all the dummy nucleotides on the other side of the split-point. The native partition function for node k is then

$\begin{matrix} Q_{k}^{native} \approx Q (φ_{k}) \prod_{a \cdot b \in E_{k}}^{} \frac{P^{a, b} (φ_{k})}{Q (φ_{k}^{dummy (a \cdot b)}) P^{a, b} (φ_{k}^{dummy (a \cdot b)})} , & (8) \end{matrix}$

where the approximation follows from the assumption that the equilibrium probabilities for the base pairs in E_k are independent; the expression becomes exact if E_k contains only one base pair, and in the limit as the equilibrium probabilities of the base pairs in E_k approach unity. The partition functions, Q(φ_k) and Q(φ_k^dummy(a·b)), and the equilibrium base-pairing probability matrices, P(φ_k) and P(φ_k^dummy(a·b)), are calculated using dynamic programs suitable for complexes containing arbitrary numbers of strands.

Note that the periodic strand repeat, v_j, of complex j is defined as the number of different rotations of the polymer graph that map strands of the same type to each other (e.g., v_j=4 for complex AAAA, v_j=3 for complex ABABAB, v_j=2 for ABAABA). For complexes in which all strands are distinct, v_j=1. However, complexes containing multiple copies of the same strand may have v_j>1, in which case the dynamic program that is used to calculate the partition function of complex j will be incorrect due to symmetry and overcounting errors that are different for different structures in Γ_j. Fortunately, these errors interact in such a way that they can be exactly and simultaneously corrected by dividing the calculated partition function by the integer v_j:

Q(φ_j)=Q_calc(φ_j)/v_j.

When employing the dynamic program to calculate the nodal partition functions for k∈Λ_d,j, it is important to correct each of these values using v_j.
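As an illustration of the symmetry correction, the periodic strand repeat can be counted directly from the ordered list of strand types in a complex. This is a minimal sketch; the function names are hypothetical, and the strand ordering is assumed to be given as a string or list of strand-type labels.

```python
def periodic_strand_repeat(strand_order):
    """Count rotations of the strand ordering that map it onto itself.

    The identity rotation is included, so the result is 1 when all
    strands are distinct.
    """
    order = list(strand_order)
    n = len(order)
    return sum(1 for r in range(n) if order[r:] + order[:r] == order)


def corrected_partition_function(q_calc, strand_order):
    """Apply Q = Q_calc / v_j to a partition function from the dynamic program."""
    return q_calc / periodic_strand_repeat(strand_order)
```

This reproduces the examples given in the text: AAAA, ABABAB, and ABAABA yield 4, 3, and 2, respectively.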

Next, the system reconstructs the approximate partition function for complex j from the native partition functions of all descendant nodes at level d. Let F_d,j denote the set of base-pair stacks sandwiching the split-points in the decomposition of complex j at depth d. Each of these base-pair stacks a·b:e·f is an interior loop whose free energy, ΔG_a·b:e·f^interior, is not incorporated in the native partition functions calculated for the nodes on either side of the split-point. The complex partition function estimate is then

$\begin{matrix} {\tilde{Q}}_{j} = \prod_{k \in Λ_{d, j}}^{} Q_{k}^{native} \prod_{(a \cdot b : e \cdot f) \in F_{d, j}}^{} \exp (- Δ G_{a \cdot b : e  \cdot f}^{interior} / k_{B} T), & (9) \end{matrix}$

representing the product of the native partition functions of the nodes k∈Λ_d,j and the additional contributions from the interior loops a·b:e·f at the split-points. This estimate becomes exact in the limit as the equilibrium probabilities of the base-pair stacks in F_d,j approach unity.
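Because partition functions span many orders of magnitude, equation (9) is conveniently evaluated in log space. The following is a minimal sketch, assuming the native nodal partition functions are supplied as natural logarithms and the interior-loop free energies are given in kcal/mol; the function name and the default temperature (37° C.) are illustrative choices rather than details from the appendix.

```python
K_B = 0.0019872  # Boltzmann constant in kcal/(mol K) (molar gas constant)


def log_complex_partition_estimate(log_q_native, stack_energies,
                                   temperature_k=310.15):
    """Return log of the complex partition function estimate (equation (9)).

    log_q_native   : log Q_k^native for every node k at the chosen depth
    stack_energies : interior-loop free energies (kcal/mol) for the
                     base-pair stacks sandwiching each split-point
    """
    kt = K_B * temperature_k
    # log of a product is a sum; exp(-dG/kT) contributes -dG/kT.
    return sum(log_q_native) - sum(stack_energies) / kt
```

A favorable (negative) stacking free energy at a split-point increases the estimate, as expected for a stabilizing interior loop that neither sibling accounted for.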

Complex Concentration Estimate using Deflated Mass Constraints

After calculating the set of complex partition function estimates, Q̃_Ψ_active, based on the nodal partition function contributions at level d, the corresponding equilibrium complex concentration estimates, x̃_Ψ_active, may be found by solving the convex programming problem shown above for equation (1). To impose the conservation of mass constraints (equation (1b)), the total concentration of each strand species, i∈Ψ⁰, is specified. The total strand concentrations follow from the target concentration and strand composition of each on-target complex j∈Ψ_on:

$\begin{matrix} x_{i}^{0} = \sum_{j \in Ψ_{on}}^{} A_{i, j} y_{j} \forall i \in Ψ^{0} . & (10) \end{matrix}$

Initial sequence optimization is performed on a decomposition forest that contains only the on-target complexes in Ψ_active, but ultimately, the system tries to satisfy the test tube stop condition (equation (6)) for the full set of complexes in Ψ, including the off-targets in Ψ_passive. Recall that the off-targets in Ψ_passive do not contribute directly to the sum used to calculate the test tube ensemble defect (equation (4)), but contribute indirectly by forming at positive concentrations, causing concentration defects for complexes in Ψ_active as a result of conservation of mass. Hence, we can pre-allocate a portion of the permitted test tube ensemble defect, f_stop y_nt, to the neglected off-target complexes in Ψ_passive by deflating the total strand concentrations used to impose the mass constraints (equation (1b)) in calculating the equilibrium concentrations x̃_Ψ_active.

Following this approach, if Ψ_passive≠Ø, we make the assumption that the complexes in Ψ_passive consume a constant fraction of each total strand concentration:

$\sum_{j \in Ψ_{passive}}^{} A_{i, j} {\tilde{x}}_{j} = f_{passive} f_{stop} \sum_{j \in Ψ_{on}}^{} A_{i, j} y_{j} \forall i \in Ψ^{0},$

corresponding to a total mass allocation of f_passive f_stop y_nt to the neglected off-targets in Ψ_passive.

To calculate the equilibrium concentrations of the complexes in Ψ_active via equation (1), we therefore use the deflated strand concentrations:

$\begin{matrix} \begin{matrix} x_{i}^{0} = (1 - f_{stop} f_{passive}) \sum_{j \in Ψ_{on}} A_{i, j} y_{j} & \forall i \in Ψ^{0} \end{matrix} & (11) \end{matrix}$

in place of the full strand concentrations (equation (10)). For each complex j∈Ψ_active, the concentration estimate, x̃_j, is passed to the nodes in the subtree of complex j at level d:

$x_{k} = {\tilde{x}}_{j} \quad \forall k \in Λ_{d, j}$

Nodal concentrations are useful for representing the test tube ensemble defect estimate as a sum of nodal (rather than complex) contributions.
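Equations (10) and (11) amount to a matrix-vector product followed by a uniform scaling. The sketch below assumes the stoichiometry matrix A is given as a list of rows (one per strand species, one column per on-target complex); the function name and the `passive_nonempty` flag are hypothetical conveniences.

```python
def total_strand_concentrations(A, y, f_stop=0.01, f_passive=0.01,
                                passive_nonempty=True):
    """Return the total strand concentrations used in the mass constraints.

    A[i][j] : number of copies of strand i in on-target complex j
    y[j]    : target concentration of on-target complex j
    When off-targets are still being neglected (passive set non-empty),
    each total is deflated by the factor (1 - f_stop * f_passive).
    """
    scale = 1.0 - f_stop * f_passive if passive_nonempty else 1.0
    return [scale * sum(a_ij * y_j for a_ij, y_j in zip(row, y)) for row in A]
```

With the default values f_stop=0.01 and f_passive=0.01 from Table 1, the deflation removes only 0.01% of each strand concentration, which is the allowance reserved for the neglected off-targets.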

Complex Ensemble Defect Estimate.

The complex ensemble defect estimate, ñ_j, is calculated for each complex j∈Ψ_active based on nodal defect contributions, n_k, calculated efficiently at the nodes k∈Λ_d,j at any level d∈{2, ..., D}. This decomposition is illustrated for parent node k and its children k_l and k_r in FIG. 3b.

Because each nucleotide in complex j is native in exactly one node k∈Λ_d,j, the system can approximate the complex ensemble defect as the sum of the native nodal defect contributions at any depth in the subtree. The nodal pair probability matrix, P_k (with entries for both native and dummy nucleotides), was previously calculated in order to estimate the nodal partition function contribution (equation (8)).

For any node k∈Λ_d,j, the contribution of nucleotide a∈s_k to the nodal defect is given by

$n_{k}^{a} = 1 - \sum_{1 \leq b \leq |s_{k}| + 1} P_{k}^{a, b} S_{k}^{a, b}$

and the native nodal defect contribution is:

$n_{k}^{native} = \sum_{a \in s_{k}^{native}} n_{k}^{a} .$

Based on nodal contributions at depth d, the complex ensemble defect estimate for any complex j∈Ψ_active is then:

${\tilde{n}}_{j} = \sum_{k \in Λ_{d, j}} n_{k}^{native} .$

This estimate becomes exact in the limit as the equilibrium probabilities of the base pairs sandwiching the decomposition split-points approach unity.
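The nodal defect sums can be written out directly. The sketch below assumes that both the pair probability matrix and the structure matrix have one row per nucleotide and one extra final column holding the unpaired probability (or unpaired indicator), which is how the upper summation limit |s_k|+1 is interpreted here; the function names and zero-based indexing are illustrative.

```python
def structure_matrix(partner):
    """Build S from a partner list (-1 marks an unpaired nucleotide).

    S[a][b] = 1 if a is paired to b in the target structure; the final
    column S[a][n] = 1 if a is unpaired.
    """
    n = len(partner)
    S = [[0.0] * (n + 1) for _ in range(n)]
    for a, b in enumerate(partner):
        S[a][n if b < 0 else b] = 1.0
    return S


def native_nodal_defect(P, S, native):
    """Sum n_k^a = 1 - sum_b P[a][b] * S[a][b] over native nucleotides a."""
    return sum(1.0 - sum(p * s for p, s in zip(P[a], S[a])) for a in native)
```

Dummy nucleotides are simply left out of the `native` index list, so they influence the pair probabilities but never contribute to the defect total.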

Test Tube Ensemble Defect Estimate

Having calculated the complex concentration estimates, x̃_Ψ_active, and the complex ensemble defect estimates, ñ_Ψ_active, based on nodal contributions at any depth d∈{2, ..., D}, the contribution of complex j∈Ψ_active to the test tube ensemble defect is

$\begin{matrix} {\tilde{c}}_{j} = {\tilde{n}}_{j} \min ({\tilde{x}}_{j}, y_{j}) + |s_{j}| \max (y_{j} - {\tilde{x}}_{j}, 0), & (12) \end{matrix}$

and a test tube ensemble defect module can then calculate the test tube ensemble defect as:

$\begin{matrix} \tilde{C} = \sum_{j \in Ψ_{active}} {\tilde{c}}_{j} . & (13) \end{matrix}$

This sum can equivalently be expressed as a sum over nodal contributions at depth d. The test tube ensemble defect associated with nucleotide a in node k∈Λ_d is

$c_{k}^{a} = n_{k}^{a} \min (x_{k}, y_{k}) + \max (y_{k} - x_{k}, 0)$

so the native nodal defect contribution for node k is

$c_{k}^{native} = \sum_{a \in s_{k}^{native}} c_{k}^{a}$

and the test tube ensemble defect estimate (equation (13)) becomes:

$\begin{matrix} \tilde{C} = \sum_{k \in Λ_{d}} c_{k}^{native} . & (14) \end{matrix}$

The total defect permitted by the test tube stop condition (equation (6)) can be allocated proportionally to the nodes at depth d in the decomposition forest:

$\begin{matrix} c_{k}^{stop} \equiv f_{stop} |s_{k}^{native}| y_{k} \quad \forall k \in Λ_{d} & (15) \end{matrix}$

so that the nodal defect allocations sum to the total permitted test tube ensemble defect

$C_{stop} = \sum_{k \in Λ_{d}} c_{k}^{stop} .$

During hierarchical sequence optimization, candidate sequences are evaluated using the thresholded test tube ensemble defect estimate:

$\begin{matrix} {\tilde{C}}_{thresh} = \sum_{k \in Λ_{d}} \max (c_{k}^{stop}, c_{k}^{native}) & (16) \end{matrix}$

in place of (equation (14)) to drive proportional defect allocation across the nodes at each level in the decomposition forest.
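The effect of thresholding in equation (16) is easiest to see numerically. The sketch below uses hypothetical function names and represents each node as a tuple of its native defect contribution, its number of native nucleotides, and its target concentration.

```python
def nodal_stop_allocation(f_stop, native_size, y_k):
    """Equation (15): share of the permitted defect allocated to one node."""
    return f_stop * native_size * y_k


def thresholded_defect(nodes, f_stop=0.01):
    """Equation (16): each node counts for at least its own allocation.

    nodes : iterable of (c_native, native_size, y_k) tuples
    """
    return sum(max(nodal_stop_allocation(f_stop, size, y), c)
               for c, size, y in nodes)
```

Because a node whose defect is already below its allocation contributes the allocation itself, further improving that node leaves the thresholded total unchanged. Only mutations that help nodes still above their allocation lower the objective, which is what drives proportional allocation of design effort.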

Leaf Mutation

To minimize computational cost, all candidate mutations are preferably evaluated by a leaf mutation calculation module at the leaf nodes, k∈Λ_D, of the decomposition forest. Leaf mutation terminates successfully if the leaf stop conditions,

$\begin{matrix} c_{k}^{native} \leq c_{k}^{stop} \quad \forall k \in Λ_{D}, & (17) \end{matrix}$

are all satisfied. The multiple leaf stop conditions collectively enforce the single test tube stop condition (equation (6)) and further mandate consistent design quality across the leaves. A candidate mutation is accepted if it decreases the thresholded test tube ensemble defect estimate (equation (16)) and rejected otherwise. Let F_D denote the set of leaves that do not yet satisfy the leaf stop condition (equation (17)). The thresholded test tube ensemble defect is compatible with the leaf stop conditions in the sense that a candidate mutation is accepted if and only if it reduces the defect contribution of the leaves in F_D.

In some embodiments, defect-weighted mutation sampling is performed by selecting nucleotide a for mutation from amongst those leaves k∈F_D with probability proportional to the contribution of nucleotide a to the defect contribution of these leaves:

$c_{k}^{a} / \sum_{k \in F_{D}} c_{k}^{native} .$
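Defect-weighted sampling can be sketched with the Python standard library. The data layout (a dictionary mapping each leaf that fails its stop condition to a list of per-nucleotide defects) and the function names are assumptions made for illustration; the sketch also assumes at least one nucleotide has a positive defect.

```python
import random


def mutation_probabilities(leaf_defects):
    """Return {(leaf, position): probability} over leaves failing the stop condition.

    leaf_defects : dict mapping leaf id -> list of per-nucleotide
                   defect contributions c_k^a for that leaf's native nucleotides
    """
    total = sum(sum(defects) for defects in leaf_defects.values())
    return {(k, a): c / total
            for k, defects in leaf_defects.items()
            for a, c in enumerate(defects)}


def sample_mutation_position(leaf_defects, rng=random):
    """Draw one (leaf, position) with probability proportional to its defect."""
    probs = mutation_probabilities(leaf_defects)
    keys = list(probs)
    return rng.choices(keys, weights=[probs[key] for key in keys], k=1)[0]
```

Positions that are already correctly paired (zero defect) are never proposed, so mutation effort concentrates where the estimated defect is largest. Passing a seeded `random.Random` instance as `rng` makes a design trial reproducible.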

If the selected candidate mutation position is subject to complementarity constraints implied by the design problem specification, either via complementary sequence domains or via base-pairing within any on-target structure, the candidate mutation respects the constraint at one of three strengths: a) strong complementarity (default): constrained nucleotides are selected randomly from a uniform distribution of Watson-Crick pairs; b) medium complementarity: constrained nucleotides are selected randomly from a uniform distribution of Watson-Crick and wobble pairs; c) weak complementarity: constrained nucleotides are selected randomly from a uniform distribution of Watson-Crick pairs, wobble pairs, and mismatches. For design problems where on-target structures place competing demands on the test tube ensemble defect, weak complementarity allows the process to increase the defect contribution in one part of a design in order to reduce the ensemble defect of the test tube as a whole (e.g., see the example of FIG. 12).

A candidate sequence φ̂_Λ_D is evaluated via calculation of the thresholded test tube ensemble defect, C̃_thresh, if the candidate mutation, ξ, is not in the set of previously rejected mutations, γ_mutate (position and sequence). The set, γ_mutate, is updated after each unsuccessful mutation and cleared after each successful mutation. The counter m_k^mutate is used to keep track of the number of consecutive failed mutation attempts for each leaf. The counter m_k^mutate is incremented for leaves with φ_k≠φ̂_k after each unsuccessful mutation and reset to zero for leaves with φ_k≠φ̂_k after each successful mutation. Leaf mutation terminates unsuccessfully if each leaf that fails to satisfy the leaf stop condition (equation (17)) undergoes M_mutate|s_k^native| consecutive unfavorable mutation attempts (i.e., m_k^mutate≧M_mutate|s_k^native|). The outcome of leaf mutation is the set of leaf sequences, φ_Λ_D, corresponding to the lowest encountered C̃_thresh.

Leaf Reoptimization

After leaf mutation terminates, if any leaves fail to satisfy the leaf stop condition (i.e., F_D≠Ø), leaf reoptimization is commenced by the leaf reoptimization module. The counter m_k^leaf is used to keep track of the number of times that leaf k is reoptimized. During each round of leaf reoptimization, the leaf k∈F_D with the minimal m_k^leaf is reseeded with a random initial sequence and a new round of leaf mutation is performed. After leaf mutation terminates, the counter m_k^leaf is incremented for any leaf whose sequence has changed. The reoptimized candidate sequences, φ̂_Λ_D, are accepted if they decrease C̃_thresh and rejected otherwise. Leaf reoptimization terminates successfully if F_D=Ø or unsuccessfully if each leaf k∈F_D has exhausted M_leaf reoptimization attempts (i.e., m_k^leaf≧M_leaf). The outcome of leaf reoptimization is the set of leaf sequences, φ_Λ_D, corresponding to the lowest encountered C̃_thresh.

Subsequence Merging and Parent Reoptimization

After leaf reoptimization terminates, parent nodes at depth d=D−1 merge their left and right child sequences to create the set of candidate sequences φ̂_Λ_d. The counter m_k^opt is used to keep track of the number of times that parent k is optimized, and is incremented for each parent with φ_k≠φ̂_k following a merge. The nodal defect contributions, ĉ_Λ_d, are calculated for the parents at depth d and the candidate sequences, φ̂_Λ_d, are accepted if they decrease C̃_thresh calculated at depth d and rejected otherwise. If each parent at depth d satisfies the parental stop condition:

$\begin{matrix} c_{k}^{native} \leq \max (c_{k_{l}}^{stop}, c_{k_{l}}^{native}) + \max (c_{k_{r}}^{stop}, c_{k_{r}}^{native}) & (18) \end{matrix}$

or if all parents at level d have exhausted M_opt optimization attempts (i.e., m_k^opt≧M_opt), merging continues up to the next level in the forest. Otherwise, failure to satisfy the parental stop condition implies the existence of emergent defects resulting from crosstalk between child sequences. In this case, the parent node at depth d with the minimal m_k^opt that also fails to satisfy the parental stop condition (equation (18)) is selected for reoptimization (and labeled k_reopt).

To reoptimize parent node k_reopt at depth d, the current sequences at depth d are pushed down to all nodes below depth d, and the counter, m_k^opt, is reset to zero for all nodes below depth d. Let F_k denote the set of native nucleotides in parent k_reopt that are partitioned to leaf k. Parent k_reopt performs defect-weighted leaf sampling by selecting a leaf k within its subtree with probability:

$\sum_{a \in F_{k}} c_{k_{reopt}}^{a} / c_{k_{reopt}}^{native} .$

The selected leaf (labeled k_reseed) is reseeded to a random initial sequence and a new round of leaf mutation and leaf reoptimization is performed. Reseeding with a random initial sequence is based on the assumption that sequence space is sufficiently rich that emergent defects are atypical and can reliably be eliminated by designing a different leaf sequence. Following leaf reoptimization, merging begins again. Subsequence merging and reoptimization terminate successfully if all root nodes satisfy the parental stop condition (equation (18)). The outcome of subsequence merging and reoptimization is the set of sequences, φ_Ψ_active, corresponding to the lowest encountered C̃_thresh calculated at depth d=1.

Focusing Effort within the Decomposition Forest

To focus mutation effort in portions of the decomposition forest where it is most likely to reduce the test tube ensemble defect, we define the set of nodes, Ω^focus. Initially, all nodes in the decomposition forest, k∈Λ, are placed in Ω^focus. During leaf reoptimization, Ω_D^focus contains only those leaves whose sequences were changed by the most recent leaf reseeding. During parent reoptimization following a failed merge at level d, Ω_d_reopt^focus is emptied for all levels d_reopt>d. When candidate sequences are accepted, Ω^focus is updated to include any nodes whose sequences have changed.

To avoid expending undue effort on nodes that exhausted reoptimization attempts during a previous traversal of the decomposition forest, leaf mutation, leaf reoptimization, and parent reoptimization all focus on nodes in Ω^focus, as detailed in the pseudocode in the attached appendix. Leaf mutation is restricted to leaves in the set:

Ω_D^mutate={k∈Ω_D^focus: c_k^native>c_k^stop, m_k^mutate<M_mutate|s_k^native|}

and terminates when Ω_D^mutate is empty. Leaf reoptimization is restricted to leaves in the set:

Ω_D^leaf={k∈Ω_D^focus: c_k^native>c_k^stop, m_k^leaf<M_leaf}

and terminates when Ω_D^leaf is empty. Parent reoptimization at depth d is restricted to parents in the set:

Ω_d^opt={k∈Ω_d^focus: c_k^native>max(c_k_l^stop,c_k_l^native)+max(c_k_r^stop,c_k_r^native), m_k^opt<M_opt}

and merging continues up the forest when Ω_d^opt is empty.
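The two leaf-level focus sets are simple filters over the focused leaves. The sketch below uses plain dictionaries with hypothetical keys (`c_native`, `c_stop`, `native_size`, `m_mutate`, `m_leaf`) to hold each leaf's state; it is not the data structure used in the appendix pseudocode.

```python
def leaves_to_mutate(focus, leaves, M_mutate=4):
    """Focused leaves that fail the stop condition and still have mutation budget."""
    return {k for k in focus
            if leaves[k]["c_native"] > leaves[k]["c_stop"]
            and leaves[k]["m_mutate"] < M_mutate * leaves[k]["native_size"]}


def leaves_to_reoptimize(focus, leaves, M_leaf=3):
    """Focused leaves that fail the stop condition and still have reseeding budget."""
    return {k for k in focus
            if leaves[k]["c_native"] > leaves[k]["c_stop"]
            and leaves[k]["m_leaf"] < M_leaf}
```

Leaf mutation and leaf reoptimization each stop as soon as the corresponding set is empty, whether because every focused leaf meets its stop condition or because the remaining defective leaves have exhausted their attempts.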

Off-Target Evaluation, Decomposition and Destabilization.

Initial forest optimization is performed for the on-target complexes in Ψ_active, neglecting the off-target complexes in Ψ_passive. At the termination of initial forest optimization, the test tube ensemble defect estimate, C̃ (equation (13)), is calculated at depth d=1. For this estimate, the complex defect contributions, c̃_Ψ_active, are based on complex concentration estimates, x̃_Ψ_active, calculated using deflated total strand concentrations (equation (11)) to create a built-in defect allowance for the effect of the neglected off-targets in Ψ_passive.

For the first time, the full test tube ensemble defect (equation (4)), C, is then calculated for all complexes in the set Ψ. For this exact calculation, the complex defect contributions, c_Ψ, are based on complex concentrations, x_Ψ, calculated using the full strand concentrations (equation (10)).

Sequence design terminates successfully if the test tube ensemble defect either satisfies the test tube stop condition (equation (6)) or is no greater than the forest-estimated defect (equation (13)):

C≦max(C_stop,C̃). (19)

This latter condition allows sequence design to terminate if the actual defect contribution resulting from the off-target complexes in Ψ_passiveis no greater than the built-in defect allowance resulting from deflation of the total strand concentrations during forest optimization. Otherwise, we have

C>C̃ (20)

and the off-target complex j∈Ψ_passive with the largest concentration is transferred from Ψ_passive to Ψ_active. Because the off-target structure, s_j, is undefined and we require a structural basis for tree decomposition, we generate an off-target structure, s_j, that includes all base pairs a·b that form with equilibrium probability P_j^a,b>p_split (for a specified p_split∈(0.5,1.0)) between nucleotides a and b that are constrained to be complementary (either due to specification of complementary sequence domains or due to specification of an on-target structure containing a·b). The root defect estimate, C̃, is then recalculated (using deflated strand concentrations (equation (11)) if Ψ_passive≠Ø). This process of transferring the highest-concentration off-target complex, j, from Ψ_passive to Ψ_active, generating an off-target structure s_j, and recalculating the root defect estimate, C̃, is repeated until (equation (20)) no longer holds.
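Generating an off-target structure from pair probabilities reduces to thresholding the allowed pairs. The sketch below assumes the pair probability matrix is a nested list indexed from zero and that the complementarity-constrained pairs are supplied as a set of (a, b) tuples with a<b; the function name is hypothetical.

```python
def off_target_structure(P, complementary_pairs, p_split=0.9):
    """Return the base pairs defining a generated off-target structure.

    P[a][b]             : equilibrium probability that a pairs with b
    complementary_pairs : set of (a, b) tuples, a < b, that the design
                          specification constrains to be complementary
    Only constrained pairs whose probability exceeds p_split are kept.
    """
    return sorted((a, b) for a, b in complementary_pairs if P[a][b] > p_split)
```

Requiring p_split greater than 0.5 guarantees a valid secondary structure: the pairing probabilities of a given nucleotide sum to at most one, so no nucleotide can have two partners that each exceed the threshold.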

The new off-target structures in Ψ_active are then hierarchically decomposed, the decomposition forest is augmented with new nodes at all depths, and forest reoptimization commences starting from the final sequences from the previous round of forest optimization. During forest reoptimization, the process actively attempts to destabilize the off-targets that were added to Ψ_active. This process of forest augmentation and reoptimization is repeated until (equation (19)) is satisfied, which is guaranteed to occur in the event that all off-targets are eventually added to Ψ_active. The sequence design process shown in the attached pseudocode appendix returns the sequences φ_Ψ that yielded the lowest encountered test tube ensemble defect, C. The appendix includes pseudocode for hierarchical test tube ensemble defect optimization. For a given set of target secondary structures, s_Ψ, and target concentrations, y_Ψ, a set of designed sequences, φ_Ψ, is returned by the function call OptimizeTube(s_Ψ, y_Ψ, Ψ, Ψ_on, Ψ_off).

In one embodiment, the test tube design process described herein is coded in the C programming language. To reduce run-time for large jobs, the dynamic programs for evaluating the nodal partition function, Q_k, and the nodal base-pairing probability matrix, P_k, can be parallelized using MPI. For parallel execution, each evaluation of Q_k and P_k for node k with target structure s_k is performed using a number of cores selected so as to approximately minimize run time based on node size, |s_k|.

EXAMPLES
Standard Test Set

The performance of the test tube design process was demonstrated using a set of target test tubes. Within each target test tube, there was a single on-target tetramer with a target concentration of 1 μM. The off-targets were specified to be all complexes of up to L_max=4 strands (excluding the on-target tetramer), corresponding to a total of 107 off-target complexes. The target structure for each on-target tetramer was randomly generated with stem and loop sizes randomly selected from a distribution of sizes representative of the nucleic acid engineering literature. Sixty on-target tetramers were generated for each target structure size |s|∈{100,200,400,800} nucleotides, corresponding to a total of 240 target test tubes. Within a tetramer, all strands were of the same length. The structural properties of these on-target tetramers are summarized in FIG. 4. For the design studies that follow, new target test tubes were generated from scratch. The design process was not tested on these target test tubes prior to generating the depicted results.

Sequence Design Trials.

Design trials were run on a cluster of 2.53 GHz Intel E5540 Xeon® dual-processor/quad-core nodes with 24 GB of memory per node. Unless otherwise noted, trials were performed on a single computational core using the default process parameters of Table 1. Design quality is plotted as the normalized test tube ensemble defect, C/y_nt.

TABLE 1

DEFAULT PROCESS PARAMETERS.

Parameter    RNA     DNA
H_split      2       3
N_split      20      30
P_split      0.9     0.9
f_stop       0.01    0.01
f_passive    0.01    0.01
M_opt        10      10
M_leaf       3       3
M_mutate     4       4

Results

The primary test scenario is RNA sequence design at 37° C. with f_stop=0.01 (i.e., less than 1% of the nucleotides should be incorrectly paired within the test tube at equilibrium). Ten independent trials were performed for each of the 240 target test tubes in the standard test set.
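As a worked example of this stop condition, the following sketch checks whether a test tube ensemble defect C satisfies C ≦ f_stop·y_nt. It assumes that y_nt is the total target concentration of nucleotides, computed as the sum of |s_j|·y_j over the on-target complexes; the function names and the numerical defect values are hypothetical.

```python
def nucleotide_target_concentration(targets):
    """Sum |s_j| * y_j over on-targets given as (num_nucleotides, target_conc)."""
    return sum(size * conc for size, conc in targets)


def satisfies_stop_condition(defect, targets, f_stop=0.01):
    """Return True when C <= f_stop * y_nt, i.e. C / y_nt <= f_stop."""
    return defect <= f_stop * nucleotide_target_concentration(targets)


# One on-target tetramer with |s| = 100 nucleotides at 1 uM (units: molar).
targets = [(100, 1e-6)]
y_nt = nucleotide_target_concentration(targets)

print(satisfies_stop_condition(5e-7, targets))  # defect below the 1% threshold
print(satisfies_stop_condition(2e-6, targets))  # defect above the 1% threshold
```

For this tube the threshold is 1% of 100 μM of nucleotides, that is, 1 μM of incorrectly paired nucleotides.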

Performance of Complex Design without Consideration of Off Target Complexes

In order to have a comparison with prior systems, an initial experiment was run to characterize the special case in which test tube design reduces to the earlier complex design: a target test tube containing one on-target complex and no off-target complexes. Once the performance of this method was determined, it could be used as a baseline for comparison against embodiments of the invention that utilize a test tube design process.

FIG. 5 includes a set of line graphs demonstrating that the performance of the test tube design process and the prior single-complex design process were essentially indistinguishable for the on-target structures in the standard test set. Typical designs surpassed the desired design quality (normalized ensemble defect ≦0.01; panel a). Typical design cost ranged from a fraction of a second for |s|=100 to 100 seconds for |s|=800 (panel b). Typical GC content was less than 60% starting from random initial sequences (panel c). As the depth of the decomposition tree increased with |s|, the relative design cost, c_des(|s|)/c_eval(|s|), was found to decrease asymptotically towards the 4/3 optimality bound for typical design trials (panel d).
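The 4/3 figure can be motivated with a simple geometric series. The sketch below is an idealization, not the design process itself: it assumes that evaluating a node of size n costs on the order of n³, that every node splits into two children of half its size, and that each node is evaluated exactly once. The function name and the `exponent` parameter are illustrative.

```python
def relative_design_cost(levels, exponent=3):
    """Cost of evaluating every node once, relative to one root evaluation.

    Level d of a balanced binary tree holds 2**d nodes, each of relative
    size 0.5**d, so the level costs 2**d * (0.5**d)**exponent.
    """
    return sum(2**d * (0.5**d) ** exponent for d in range(levels))


for levels in (1, 2, 4, 8):
    print(levels, relative_design_cost(levels))
```

With cubic scaling, each additional level costs one quarter of the level above it, so the total approaches 1 + 1/4 + 1/16 + ... = 4/3 as the tree deepens.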

Complex Ensemble Defect Estimation within the On-Target Decomposition Tree

FIG. 5e compares the ensemble defect evaluated at the root of the on-target decomposition tree to the estimated ensemble defect based on physical quantities calculated efficiently at the leaves of the tree. These data reveal both the progression in design quality and the progression in the accuracy of defect estimation as tree optimization proceeds. Consistent with the performance of the earlier single-complex design process, three striking properties were observed. First, for a random initial sequence, the root defect was large and well-approximated by the leaf-estimated defect (data fall near the diagonal). Second, leaf-optimized sequences that were merged without reoptimization (M_opt=1) were typically estimated to satisfy the stop condition (leaf-estimated defect ≦0.01) but failed to satisfy the root stop condition (root defect ≦0.01) due to emergent defects resulting from crosstalk between merged subsequences. Third, these emergent defects were successfully eliminated during reoptimization of defective subtrees from new random initial subsequences, resulting in final sequence designs that satisfied the root stop condition (root defect ≦0.01).

Furthermore, for the final sequence designs, the leaf-estimated defect typically closely approximated the root defect, indicating that there was minimal crosstalk between merged subsequences and that dummy nucleotides in the leaves did a good job of approximating parental context.

Performance for Test Tube Design Process

FIG. 7 demonstrates the performance of the test tube design process detailed herein on the standard test set of 240 target test tubes. Typical design trials surpassed the desired design quality (normalized test tube ensemble defect ≦0.01; panel a). The cost of test tube design was considerably higher than for single-complex design because evaluation of the test tube ensemble defect required consideration of 107 off-target complexes in addition to the single on-target tetramer. Typical design cost ranged from a second for |s|=100 to approximately half an hour for |s|=800 (panel b). Typical GC content was less than 65% starting from random initial sequences (panel c).

As previously observed, as |s| increased, the cost of optimizing the on-target decomposition tree approached 4/3 the cost of a single evaluation of the complex ensemble defect for the on-target (FIG. 5d). These costs, however, were negligible compared to the cost of evaluating the full test tube ensemble defect, including all 107 off-targets. Hence, if initial forest optimization (with Ψ_active=Ψ_on) yields a design that satisfies the test tube stop condition without requiring explicit off-target destabilization and forest reoptimization (i.e., augmentation of Ψ_active with off-targets that form at appreciable concentrations), the cost of test tube design should be almost indistinguishable from the cost of a single evaluation of the test tube ensemble defect.

Indeed, this was typically the case for test tubes containing an on-target structure with |s| ∈ {200, 400, 800} (panel d). For example, for |s|=800, approximately 70% of design trials required no off-target destabilization and hence only a single evaluation of the test tube ensemble defect. A further 20% of design trials required only one round of off-target destabilization and a total of two evaluations of the test tube ensemble defect, leading to the observed stair-step structure in the cumulative histogram of relative design cost (panel d). For test tubes containing an on-target structure with |s|=100, off-target destabilization was typically required, and a typical design trial cost about three times as much as a single evaluation of the test tube ensemble defect.

Test Tube Ensemble Defect Estimation within the Decomposition Forest

FIG. 7e compares the test tube ensemble defect evaluated at the root level of the decomposition forest with the leaf-estimated defect. These data reveal both the progression in design quality and the progression in the accuracy of root defect estimation as optimization of the decomposition forest proceeds for the full test tube design process. Because of the presence of off-targets within the test tube (unlike FIG. 5e), crosstalk between merged subsequences can result not only in structural defects within the on-target complex, but also in concentration defects due to appreciable formation of off-target complexes. As a result, it is more challenging to estimate the root defect and to satisfy the root stop condition. For test tubes containing an on-target with |s|=100 (plus 107 off-target complexes), emergent defects are typical (median above the diagonal) for random initial sequences, for leaf-optimized sequences (merged without reoptimization), and for sequences merged and reoptimized within the on-target decomposition tree (with Ψ_active=Ψ_on). Only after explicit off-target destabilization and forest reoptimization do sequences typically satisfy the root stop condition (root defect ≦0.01). For the final sequence designs, the leaf-estimated defect closely approximates the root defect (data near the diagonal).

Importance of Destabilizing Off-Targets

FIG. 6 is a line graph that compares the test tube ensemble defect for design trials performed without or with off-target destabilization (sequences from FIGS. 5 and 7). If sequence design is performed without off-targets in the test tube ensemble, the resulting sequences often fail to satisfy the test tube stop condition evaluated with off-targets in the ensemble (majority of trials for |s|=100, sizable minority of trials for |s| ∈ {200, 400, 800}). By contrast, sequences designed with both on- and off-target complexes present in the test tube ensemble satisfied the test tube stop condition for nearly all design trials.

Robustness to Model Perturbations

Methods of analyzing and designing equilibrium nucleic acid secondary structure depend on empirical free energy models. It is inevitable that the parameter sets in these models will continue to be refined. In order to make useful design predictions based on an approximate physical model, it is important that conclusions about design quality are robust to model perturbations. To assess the sensitivity of the test tube ensemble defect to model perturbations, we considered all 600 design trials for target test tubes with |s|=200. Each sequence design was evaluated using 100 perturbed parameter sets with each parameter perturbed by Gaussian noise with a standard deviation of 5, 10, 20, or 40 percent of the parameter absolute value (FIG. 13).
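The perturbation procedure described above can be sketched as follows. This is a minimal illustration, assuming that each parameter is perturbed independently by zero-mean Gaussian noise whose standard deviation is the stated fraction of the parameter's absolute value. The parameter names and values are hypothetical placeholders and are not taken from any real free energy parameter set.

```python
import random


def perturb_parameters(params, relative_sd, rng):
    """Add zero-mean Gaussian noise with sd = relative_sd * |value| to each value."""
    return {
        name: value + rng.gauss(0.0, relative_sd * abs(value))
        for name, value in params.items()
    }


def perturbed_sets(params, relative_sd, count=100, seed=0):
    """Generate `count` independently perturbed copies of a parameter set."""
    rng = random.Random(seed)
    return [perturb_parameters(params, relative_sd, rng) for _ in range(count)]


# Hypothetical parameters (illustration only).
nominal = {"param_a": -3.4, "param_b": 5.6, "param_c": 0.0}

for relative_sd in (0.05, 0.10, 0.20, 0.40):
    sets = perturbed_sets(nominal, relative_sd)
    spread = max(abs(p["param_a"] - nominal["param_a"]) for p in sets)
    print(relative_sd, len(sets), spread)
```

Each design would then be re-evaluated once per perturbed set, which the sketch leaves out because it requires the full free energy model. Note that a parameter whose nominal value is zero is left unchanged by proportional noise.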

Random Seed Composition.

FIGS. 9a-d are line graphs that compare the performance of the test tube design process using different GC contents for random seeding and reseeding.

Design Material.

FIG. 10 compares RNA and DNA design. DNA designs were performed in 1 M Na⁺ at 25° C. to reflect that DNA systems are typically engineered for room temperature studies.

Parallel Efficiency and Speedup.

The contour plots of FIG. 8 demonstrate the parallel efficiency and speedup achieved using a parallel implementation of the test tube design process.

Designing Competing On-Target Complexes

In the standard test set, there is only one on-target complex per test tube, so there is no disadvantage to stabilizing this complex to the maximum extent possible, since all off-target complexes have vanishing target concentration. However, if there are multiple on-target complexes competing for the same strands, then the process must balance the relative stability of these competing on-target complexes. To examine this challenge, we considered target test tubes in which a strand was intended to form both a monomer hairpin and a dimer duplex (FIG. 12a), varying the target concentration of the monomer from 0 to 1 μM while keeping the total strand concentration fixed at 1 μM.
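Because each dimer consumes two copies of the strand, fixing the total strand concentration ties the two target concentrations together. The sketch below makes this mass balance explicit, assuming the dimer is a homodimer of the single strand species; the function name is illustrative.

```python
def dimer_target(monomer_target, total_strand=1e-6):
    """Dimer target concentration that keeps the total strand concentration fixed.

    Mass conservation for one strand species:
        y_monomer + 2 * y_dimer = total_strand
    """
    if not 0.0 <= monomer_target <= total_strand:
        raise ValueError("monomer target must lie between 0 and the total strand concentration")
    return (total_strand - monomer_target) / 2.0


# Sweep the monomer target from 0 to 1 uM (units: molar).
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    y_monomer = fraction * 1e-6
    print(y_monomer, dimer_target(y_monomer))
```

At one end of the sweep the tube is all dimer at 0.5 μM, and at the other it is all monomer at 1 μM.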

FIGS. 12b and 12c demonstrate that typical design quality varies greatly depending on the target monomer concentration (i.e., depending on the desired relative stability of the monomer and dimer on-targets). For example, the process typically succeeded in producing designs for low/high monomer/dimer target concentrations but struggled to satisfy the stop condition for high/low monomer/dimer target concentrations. These designs were performed with strong sequence complementarity constraints, requiring nucleotides that were intended complements in one or more on-target structures to be Watson-Crick complements.

If the designs were performed with weak complementarity requirements, permitting the process to introduce wobble pairs or mismatches between intended complements, typical design performance significantly improved (FIGS. 12d and 12e).

Because of the competition between on-target complexes, we revisited the question of robustness to model perturbations. The perturbation studies of FIGS. 12f and 12g demonstrate that the predicted design quality was typically robust to model perturbations for test tubes where one on-target dominates the other, but became more sensitive to model perturbations for test tubes where both on-targets were in competition at non-saturated target concentrations. Hence, for applications where on-targets are in competition, it is more likely that the relative stabilities of the on-targets will need to be fine-tuned to account for imperfections in the physical model. Many applications seek to saturate on-targets at maximum concentration and off-targets at vanishing concentration, reducing the sensitivity of computational predictions to perturbations in the model parameters.

Test Tube Design with Multiple On- and Off-Target Complexes

FIG. 11 demonstrates the performance of the process for target test tubes containing four on-target tetramers and different sets of off-target complexes (all off-target complexes up to size L_max ∈ {0, 1, 2, 3, 4}). If the design was performed without off-targets (L_max=0) or with all off-targets up to monomers or dimers, the typical design quality was poor. If the design was performed with all off-targets up to trimers or tetramers, typical design trials surpassed the desired design quality (normalized test tube ensemble defect ≦0.01; panel a). These results illustrate the importance of destabilizing off-target complexes during sequence design.

CONCLUSION

As illustrated above, the test tube design process was found to provide a powerful framework for engineering nucleic acid base pairing to conform to a target secondary structure at a target concentration. The desired equilibrium base-pairing properties for candidate nucleic acid molecules in a predetermined environment (such as a dilute solution in a test tube) were specified as an arbitrary number of on-target complexes, each with a target secondary structure and target concentration, and an arbitrary number of off-target complexes, each with vanishing target concentration.

Given a theoretical target test tube, embodiments of the invention determine a test tube ensemble defect that quantifies the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the test tube. Embodiments of the test tube ensemble defect optimization process implement a positive design paradigm (stabilize on-targets) and a negative design paradigm (destabilize off-targets) at two levels: a) designing for the on-target structure and against the off-target structures within the structural ensemble of each on-target complex, and b) designing for the on-target complexes and against the off-target complexes within the ensemble of the test tube. Using the hierarchical mutation process described above, test tube designs involving multiple on- and off-targets for strand lengths of practical interest to the molecular programming and synthetic biology communities can be realized.
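To make the two levels concrete, the following sketch computes a test tube ensemble defect as the sum of a structural term and a concentration term per complex. It assumes the per-complex form n_j·min(x_j, y_j) + |s_j|·max(y_j − x_j, 0), where n_j is the complex ensemble defect, x_j the equilibrium concentration and y_j the target concentration; the equations given earlier in this specification govern, and the numerical values here are hypothetical.

```python
def tube_ensemble_defect(complexes):
    """Sum structural and concentration defects over all complexes.

    Each complex is a tuple (n_j, size_j, x_j, y_j):
      n_j    complex ensemble defect in nucleotides
      size_j number of nucleotides in the complex
      x_j    equilibrium concentration
      y_j    target concentration (zero for off-targets)
    """
    total = 0.0
    for n_j, size_j, x_j, y_j in complexes:
        structural = n_j * min(x_j, y_j)              # mispaired nucleotides in complexes that formed
        concentration = size_j * max(y_j - x_j, 0.0)  # nucleotides missing from the on-target
        total += structural + concentration
    return total


# Hypothetical tube: one on-target tetramer and one off-target dimer (molar units).
tube = [
    (2.0, 100, 0.9e-6, 1.0e-6),  # on-target: forms at 0.9 uM instead of 1 uM
    (30.0, 50, 0.05e-6, 0.0),    # off-target: target concentration is zero
]
defect = tube_ensemble_defect(tube)
y_nt = sum(size * y for _, size, _, y in tube)
print(defect, defect / y_nt)
```

In this form an off-target contributes nothing directly; it degrades the design by drawing strands away from the on-target, which shows up as the on-target's concentration defect.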

In the preceding description, specific details are given to provide a thorough understanding of the examples. However, it will be understood by one of ordinary skill in the art that the examples may be practiced without these specific details. For example, electrical components/devices may be shown in block diagrams in order not to obscure the examples in unnecessary detail. In other instances, such components, other structures and techniques may be shown in detail to further explain the examples.

It is also noted that the examples may be described as a process, which is depicted as a flowchart, a flow diagram, a finite state diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, or concurrently, and the process can be repeated. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a software function, its termination corresponds to a return of the function to the calling function or the main function.

Those of skill in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Those having skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and process steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. One skilled in the art will recognize that a portion, or a part, may comprise something less than, or equal to, a whole. For example, a portion of a collection of pixels may refer to a sub-collection of those pixels.

The various illustrative logical blocks, modules, and circuits described in connection with the implementations disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or process described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of non-transitory storage medium known in the art. An exemplary computer-readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the computer-readable storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal, camera, or other device. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal, camera, or other device.

Headings are included herein for reference and to aid in locating various sections. These headings are not intended to limit the scope of the concepts described with respect thereto. Such concepts may have applicability throughout the entire specification.

The previous description of the disclosed implementations is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these implementations will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other implementations without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the implementations shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

APPENDIX

OPTIMIZETUBE(s_Ψ, y_Ψ, Ψ_on, Ψ_off)
    φ_Ψ ← INITSEQ(∅, s_Ψ, Ψ)
    Ψ_active, Ψ_passive ← Ψ_on, Ψ_off
    φ_Λ, s_Λ, y_Λ, Λ, D ← DECOMPOSE(φ_{Ψ_active}, s_{Ψ_active}, y_{Ψ_active})
    φ_Ψ, C̃ ← OPTIMIZEFOREST(φ_Λ, s_Λ, y_Λ, D)
    C ← TUBEDEFECT(φ_Ψ, s_Ψ, y_Ψ)
    φ̂_Ψ, Ĉ ← φ_Ψ, C
    while Ĉ > max(C_stop, C̃)
        s_{Ψ_active} ← AUGMENTACTIVE(s_{Ψ_active}, Ĉ, C̃, φ̂_Ψ)
        φ_Λ, s_Λ, y_Λ, Λ, D ← DECOMPOSE(φ_{Ψ_active}, s_{Ψ_active}, y_{Ψ_active})
        φ̂_Ψ, C̃ ← OPTIMIZEFOREST(φ_Λ, s_Λ, y_Λ, D)
        Ĉ ← TUBEDEFECT(φ̂_Ψ, s_Ψ, y_Ψ)
        if Ĉ < C
            C, φ_Ψ ← Ĉ, φ̂_Ψ
    return φ_Ψ
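The control flow of OPTIMIZETUBE can be illustrated with a small runnable sketch. It is a simplification under stated assumptions: the forest optimizer, the full test tube defect evaluation, and the choice of which off-target to activate are passed in as callables, one off-target is activated per round, and the `toy_*` stand-ins (which track a hypothetical "leak" per off-target instead of real sequences) exist only to exercise the loop.

```python
def optimize_tube(seqs, on_targets, off_targets, c_stop,
                  optimize_forest, tube_defect, pick_off_target):
    """Optimize on the active set, check on the full tube, augment, repeat."""
    active, passive = set(on_targets), set(off_targets)
    seqs, estimate = optimize_forest(seqs, active)   # estimate uses active complexes only
    defect = tube_defect(seqs)                       # full test tube ensemble defect
    best_seqs, best_defect = seqs, defect
    while defect > max(c_stop, estimate) and passive:
        chosen = pick_off_target(seqs, passive)      # move one off-target to the active set
        active.add(chosen)
        passive.remove(chosen)
        seqs, estimate = optimize_forest(seqs, active)
        defect = tube_defect(seqs)
        if defect < best_defect:                     # keep the best design seen so far
            best_seqs, best_defect = seqs, defect
    return best_seqs, best_defect, active


# Toy stand-ins (hypothetical): a "design" is a dict of off-target leaks.
def toy_optimize_forest(leaks, active):
    new = {name: (0.0 if name in active else leak) for name, leak in leaks.items()}
    estimate = sum(leak for name, leak in new.items() if name in active)
    return new, estimate


def toy_tube_defect(leaks):
    return sum(leaks.values())


def toy_pick(leaks, passive):
    return max(passive, key=lambda name: leaks[name])


start = {"A.B": 0.3, "A.A": 0.0, "B.B": 0.1}
design, final_defect, active_set = optimize_tube(
    start, {"tetramer"}, set(start), 0.05,
    toy_optimize_forest, toy_tube_defect, toy_pick)
print(final_defect, sorted(active_set))
```

In the toy run, only the two off-targets that actually leak are activated; the third stays passive, mirroring the idea that off-targets are destabilized explicitly only when they form at appreciable concentration.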

OPTIMIZEFOREST(φ_Λ, s_Λ, y_Λ, D)
    c_Λ ← ∞
    m_Λ^opt ← 0
    Ω^focus ← Λ
    Ω_1^opt ← Λ_1
    while Ω_1^opt ≠ ∅
        φ_{Λ_D}, c_{Λ_D} ← OPTIMIZELEAVES(φ_{Λ_D}, s_{Λ_D}, y_{Λ_D}, Ω_D^focus)
        Ω_D^opt ← ∅
        d ← D − 1
        while d ≧ 1 and Ω_{d+1}^opt = ∅
            φ̂_{Λ_d} ← MERGESEQ(φ_{Λ_{d+1}})
            m_k^opt ← m_k^opt + 1 ∀k ∈ Λ_d: φ_k ≠ φ̂_k
            ĉ_{Λ_d} ← NODALDEFECTS(φ̂_{Λ_d}, s_{Λ_d}, y_{Λ_d})
            if Σ_{k∈Λ_d} max(ĉ_k^native, c_k^stop) < Σ_{k∈Λ_d} max(c_k^native, c_k^stop)
                Ω_d^focus ← Ω_d^focus ∪ {k ∈ Λ_d: φ_k ≠ φ̂_k}
                φ_{Λ_d}, c_{Λ_d} ← φ̂_{Λ_d}, ĉ_{Λ_d}
            else
                φ_{Λ_{d+1}} ← SPLITSEQ(φ_{Λ_d})
                c_{Λ_{d+1}} ← NODALDEFECTS(φ_{Λ_{d+1}}, s_{Λ_{d+1}}, y_{Λ_{d+1}})
            Ω_d^opt ← {k ∈ Ω_d^focus: m_k^opt < M_opt and
                c_k^native > max(c_{k_l}^native, c_{k_l}^stop) + max(c_{k_r}^native, c_{k_r}^stop)}
            if Ω_d^opt ≠ ∅
                k_reopt ← arg min_{k∈Ω_d^opt} m_k^opt
                for d′ = d + 1, . . . , D
                    φ_{Λ_d′} ← SPLITSEQ(φ_{Λ_{d′−1}})
                    c_{Λ_d′} ← ∞
                    m_{Λ_d′}^opt ← 0
                    Ω_d′^focus ← ∅
                k_reseed ← WEIGHTEDLEAFSAMPLING(c_{k_reopt}^native, s_{k_reopt}^native, s_{Λ_D}^native)
                φ̂_{Λ_D} ← INITSEQ(φ_{Λ_D}, s_{Λ_D}, k_reseed)
                Ω_D^focus ← {k ∈ Λ_D: φ_k ≠ φ̂_k}
                φ_{Λ_D} ← φ̂_{Λ_D}
            d ← d − 1
    return φ_{Λ_1}, Σ c_{Λ_1}

AUGMENTACTIVE(s_{Ψ_active}, C, C̃, φ_Ψ)
    while C < C̃
        ĵ ← j ∈ Ψ_passive: x_j ≧ x_k ∀k ∈ Ψ_passive
        Ψ_active ← {ĵ} ∪ Ψ_active
        Ψ_passive ← Ψ_passive \ {ĵ}
        s_ĵ ← PAIRPROBSTRUCTURE(φ_ĵ)
        ĉ ← NODALDEFECTS(φ_{Ψ_active}, s_{Ψ_active}, y_{Ψ_active})
        C̃ ← Σ ĉ
    return s_{Ψ_active}

OPTIMIZELEAVES(φ_{Λ_D}, s_{Λ_D}, y_{Λ_D}, Ω_D^focus)
    m_k^leaf ← 0 ∀k ∈ Λ_D
    φ̂_{Λ_D}, c_{Λ_D} ← MUTATELEAVES(φ_{Λ_D}, s_{Λ_D}, y_{Λ_D}, Ω_D^focus)
    m_k^leaf ← m_k^leaf + 1 ∀k ∈ Λ_D: φ_k ≠ φ̂_k
    Ω_D^focus ← Ω_D^focus ∪ {k ∈ Λ_D: φ_k ≠ φ̂_k}
    φ_{Λ_D} ← φ̂_{Λ_D}
    Ω_D^leaf ← {k ∈ Ω_D^focus: c_k^native > c_k^stop}
    while Ω_D^leaf ≠ ∅
        k_reseed ← arg min_{k∈Ω_D^leaf} m_k^leaf
        φ̂_{Λ_D} ← INITSEQ(φ_{Λ_D}, s_{Λ_D}, k_reseed)
        Ω̂_D^focus ← {k ∈ Λ_D: φ_k ≠ φ̂_k}
        φ̂_{Λ_D}, ĉ_{Λ_D} ← MUTATELEAVES(φ̂_{Λ_D}, s_{Λ_D}, y_{Λ_D}, Ω̂_D^focus)
        m_k^leaf ← m_k^leaf + 1 ∀k ∈ Λ_D: φ_k ≠ φ̂_k
        if Σ_{k∈Λ_D} max(ĉ_k^native, c_k^stop) < Σ_{k∈Λ_D} max(c_k^native, c_k^stop)
            Ω_D^focus ← Ω_D^focus ∪ {k ∈ Λ_D: φ_k ≠ φ̂_k}
            φ_{Λ_D}, c_{Λ_D} ← φ̂_{Λ_D}, ĉ_{Λ_D}
        Ω_D^leaf ← {k ∈ Ω_D^focus: c_k^native > c_k^stop and m_k^leaf < M_leaf}
    return φ_{Λ_D}, c_{Λ_D}

MUTATELEAVES(φ_{Λ_D}, s_{Λ_D}, y_{Λ_D}, Ω_D^focus)
    c_{Λ_D} ← NODALDEFECTS(φ_{Λ_D}, s_{Λ_D}, y_{Λ_D})
    γ^mutate ← ∅
    m_{Λ_D}^mutate ← 0
    Ω_D^mutate ← {k ∈ Ω_D^focus: c_k^native > c_k^stop}
    while Ω_D^mutate ≠ ∅
        ξ, φ̂_{Λ_D} ← WEIGHTEDMUTATIONSAMPLING(φ_{Λ_D}, {c_k^1, . . . , c_k^|s_k| ∀k ∈ Ω_D^mutate})
        if ξ ∈ γ^mutate
            m_k^mutate ← m_k^mutate + 1 ∀k ∈ Λ_D: φ_k ≠ φ̂_k
        else
            ĉ_{Λ_D} ← NODALDEFECTS(φ̂_{Λ_D}, s_{Λ_D}, y_{Λ_D})
            if Σ_{k∈Λ_D} max(ĉ_k^native, c_k^stop) < Σ_{k∈Λ_D} max(c_k^native, c_k^stop)
                Ω_D^focus ← Ω_D^focus ∪ {k ∈ Λ_D: φ_k ≠ φ̂_k}
                m_k^mutate ← 0 ∀k ∈ Λ_D: φ_k ≠ φ̂_k
                γ^mutate ← ∅
                φ_{Λ_D}, c_{Λ_D} ← φ̂_{Λ_D}, ĉ_{Λ_D}
            else
                m_k^mutate ← m_k^mutate + 1 ∀k ∈ Λ_D: φ_k ≠ φ̂_k
                γ^mutate ← γ^mutate ∪ {ξ}
        Ω_D^mutate ← {k ∈ Ω_D^focus: c_k^native > f_stop |s_k^native| y_k
            and m_k^mutate < M_mutate |s_k|}
    return φ_{Λ_D}, c_{Λ_D}
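The accept/reject bookkeeping in MUTATELEAVES can be illustrated on a single leaf with a toy objective. This sketch makes several simplifying assumptions that depart from the pseudocode: candidate mutations are sampled uniformly rather than weighted by per-nucleotide defect, only one leaf is mutated, and the defect is a stand-in (the Hamming distance to a hypothetical ideal sequence) rather than a nodal test tube ensemble defect.

```python
import random


def mutate_leaf(seq, defect_fn, stop, m_mutate=4, seed=0):
    """Single-leaf mutation loop with a rejected-mutation set and failure counter."""
    rng = random.Random(seed)
    seq = list(seq)
    defect = defect_fn(seq)
    rejected = set()              # plays the role of gamma^mutate
    failures = 0                  # plays the role of m^mutate
    limit = m_mutate * len(seq)   # M_mutate * |s_k|
    while defect > stop and failures < limit:
        pos = rng.randrange(len(seq))
        base = rng.choice([b for b in "ACGU" if b != seq[pos]])
        mutation = (pos, base)
        if mutation in rejected:
            failures += 1         # previously rejected: count it, skip evaluation
            continue
        candidate = seq[:pos] + [base] + seq[pos + 1:]
        candidate_defect = defect_fn(candidate)
        if candidate_defect < defect:
            seq, defect = candidate, candidate_defect
            rejected.clear()      # accepted: forget rejections, reset the counter
            failures = 0
        else:
            rejected.add(mutation)
            failures += 1
    return "".join(seq), defect


# Stand-in defect (hypothetical): mismatches against an arbitrary ideal sequence.
IDEAL = "GGGAAACCC"


def toy_defect(seq):
    return sum(a != b for a, b in zip(seq, IDEAL))


start = "AAAAAAAAA"
final_seq, final_defect = mutate_leaf(start, toy_defect, stop=0)
print(start, toy_defect(start), "->", final_seq, final_defect)
```

Remembering rejected mutations avoids paying for a defect evaluation that is already known to fail, while the failure counter bounds the effort spent on a leaf before control returns to the caller for reseeding.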

NODALDEFECTS(φ_{Λ_d}, s_{Λ_d}, y_{Λ_d})
    Q_{Λ_d}, P_{Λ_d} ← NODALPROPERTIES(φ_{Λ_d})
    Q_{Ψ_active} ← COMPLEXPFUNC(Q_{Λ_d}, P_{Λ_d}, s_{Λ_d})
    x⁰_{Ψ⁰} = A_{Ψ⁰,j} y_j ∀j ∈ Ψ_active
    if Ψ_passive ≠ ∅
        x⁰_{Ψ⁰} = x⁰_{Ψ⁰} (1 − f_stop f_passive)
    x̂_{Ψ_active} ← COMPLEXCONCENTRATIONS(Q_{Ψ_active}, x⁰_{Ψ⁰})
    x_{Λ_d} ← NODALCONCENTRATIONS(x̂_{Ψ_active})
    n_{Λ_d} ← NODALCOMPLEXDEFECT(P_{Λ_d})
    c_{Λ_d} ← NODALTESTTUBEDEFECT(n_{Λ_d}, x_{Λ_d}, y_{Λ_d})
    return c_{Λ_d}
