The programmable chemistry of nucleic acid base pairing has proven a fertile design space for engineering pathway-controlled self-assembly and disassembly processes. However, it remains an error-prone and time-consuming process to design nucleic acid sequences in a computer system so as to engineer a desired reaction pathway.
Described herein are systems and processes for designing the sequence of one or more interacting nucleic acid strands intended to hybridize in solution via a prescribed reaction pathway. Sequence design is formulated as a multistate optimization problem using a virtual set of target test tubes containing different subsets of strands to represent reactant, intermediate, and product states of the system. Each target test tube programmed in the system contains a set of desired “on-target” complexes, each with a target secondary structure and target concentration, and a set of undesired “off-target” complexes with vanishing target concentration. Sequences can be designed subject to diverse classes of user-specified sequence constraints. Constrained multistate sequence design is performed to facilitate the nucleic acid reaction pathway engineering described herein.
As described in more detail below, in accordance with one aspect, an electronic system for determining the suitability of candidate nucleic acid strands to hybridize in solution via a target reaction pathway includes a set of virtual target test tubes representing reactants, intermediates, or products in the target reaction pathway, each virtual target test tube comprising data representing on-target complexes each having a target secondary structure and a target concentration, and off-target complexes each having vanishing target concentration; one or more sequence constraints inherent to the reaction pathway; a mutations module configured to generate a feasible candidate sequence that satisfies the one or more sequence constraints; and an objective function module configured to calculate a design objective function that quantifies the quality of a feasible candidate sequence based on physical properties calculated over the set of target test tubes.
In accordance with another aspect, an electronic system for optimizing the sequence of one or more nucleic acid strands to hybridize in solution via a target reaction pathway includes a set of virtual target test tubes representing reactants, intermediates, or products in the target reaction pathway, each virtual target test tube comprising data representing on-target complexes each having a target secondary structure and a target concentration, and off-target complexes each having a vanishing target concentration; one or more sequence constraints; a mutations module configured to generate a feasible candidate sequence that satisfies the one or more sequence constraints; an objective function module configured to evaluate the quality of a feasible candidate sequence based on physical properties calculated over the set of target test tubes; a test tube ensemble focusing module configured to define a focused ensemble within each target test tube comprising a subset of the complexes within the target test tube; a hierarchical ensemble decomposition module configured to decompose each complex in the focused ensemble into a tree of conditional subensembles, yielding a forest of decomposition trees; and an estimations module configured to optimize the sequence by accepting or rejecting a feasible candidate sequence based on conditional physical properties calculated at any level within the forest of decomposition trees.
In accordance with another aspect, a computer-implemented method for determining the suitability of candidate nucleic acid strands to hybridize in solution via a target reaction pathway comprises representing reactants, intermediates, or products in the target reaction pathway using a set of virtual target test tubes, each virtual target test tube comprising data representing on-target complexes each having a target secondary structure and a target concentration, and off-target complexes each having vanishing target concentration; determining one or more sequence constraints inherent to the reaction pathway; generating a feasible candidate sequence that satisfies the one or more sequence constraints; and calculating a design objective function that quantifies the quality of a feasible candidate sequence based on physical properties calculated over the set of target test tubes.
Embodiments of the invention relate to systems and processes for designing nucleic acid molecules intended to hybridize in solution via a prescribed reaction pathway. In one embodiment, to enable sequence design for reaction pathway engineering, a multistate test tube design process is used to build on subsidiary engineering processes. These subsidiary engineering processes can include: complex design, test tube design, and multistate complex design. Such a multistate test tube design process may take into consideration various sequence constraints, pathway constraints, etc. The process may employ a test tube ensemble focusing module to focus design effort within the ensemble of the test tube. The process may also employ a hierarchical ensemble decomposition module to decompose target test tubes into a decomposition forest. The multistate test tube design process may also employ an estimation module to estimate whether a stop condition has been satisfied based on calculations performed efficiently at any level in the decomposition forest. The process may also employ an objective function module to calculate exactly (at higher cost) whether a stop condition has been satisfied over the full ensemble of the test tube. Finally, the process may employ a mutations module programmed to generate candidate sequences based on the various sequence constraints placed on each nucleic acid sequence in the design.
Specifically, in some embodiments, for complex design, the goal may be to design equilibrium base pairing properties of a single complex of one or more interacting nucleic acid strands. A complex design may utilize a complex ensemble defect optimization process that implements both a positive design paradigm, which explicitly designs for on-target structures, and a negative design paradigm that designs against off-target structures. However, complex design does not usually consider the concentration of the single on-target complex or the concentrations of any off-target complexes. As a result, sequences that are successfully optimized to stabilize a target secondary structure in the context of the on-target complex, may nonetheless fail to ensure that this complex forms at appreciable concentration when the actual molecules of each designed strand are synthesized and introduced into a physical test tube. To address this issue, one embodiment of the invention includes a computerized process of “test tube design” which expands the design ensemble and aims at properly engineering the equilibrium base-pairing properties of the actual molecules within a physical test tube of interacting nucleic acid strands. Within the systems described herein, a virtual target test tube may be specified as containing on-target complexes and off-target complexes. The goal of this system may be to design a set of sequences so that the normalized test tube ensemble defect satisfies the test tube stop condition so that the on-target complex will form at approximately the target concentration and predominantly adopt the target secondary structure at equilibrium if the constituent molecules are synthesized and placed within a container, such as a test tube, by a researcher.
But neither complex design nor test tube design address the multistate challenges inherent in reaction pathway engineering. To facilitate reaction pathway engineering, the multistate complex design process builds on the complex design process. For multistate complex design, the goal includes engineering the equilibrium base pairing properties of an arbitrary number of complexes representing reactant, intermediate, and product states along the reaction pathway. And the goal is to design a set of sequences such that for each complex, the normalized complex ensemble defect satisfies a complex stop condition.
And to facilitate reaction pathway engineering while retaining the conceptual benefits of test tube design over complex design, the multistate test tube design process may build on the test tube design process and the complex design process. For example, a set of target test tubes may be specified and the number of target test tubes can be arbitrary. By employing a set of target virtual test tubes to represent reactant, intermediate, and product states along the reaction pathway, positive and negative design paradigms may be implemented along the reaction pathway in the system to facilitate the engineering of the pathway. The system may evaluate whether stop conditions have been met using the objective function module and estimations module. The objective function module may be configured to minimize or evaluate a multistate test tube ensemble defect of the target test tubes, in view of stop conditions. And the mutations module may be used to generate candidate sequences subject to sequence and pathway-specific constraints, such as base-pairing properties, G-C content requirements, and so forth.
Furthermore, to enable efficient estimation of test tube ensemble properties, test tube ensemble focusing initially focuses design effort on the on-target portion of the test tube ensemble. To further enable efficient estimation of test tube ensemble properties, structural ensembles may be hierarchically decomposed into trees of sub-ensembles, yielding a forest of decomposition trees as described herein. Such decomposition may be guided by target structure considerations, base-pairing probabilities, or both. In addition, feasible sequences may be generated by solving constraint satisfaction problems using sequence and pathway constraints, as discussed in more detail below. Such feasible sequences may be evaluated via calculation of a multistate defect estimate. The designed test tubes may be subject to evaluation, refocusing, and reoptimization, and sequences may be augmented (as nodes at various depths in a decomposed forest) as described herein. A modified sequence may also be a result of a leaf-reoptimization process using sequence reseeding, generated from a valid sequence selected from a current set of leaf sequences, instead of starting from a random sequence or a sequence previously selected as a starting point.
Molecular programmers engineer nucleic acid reaction pathways using an ever-increasing variety of small conditional DNA (scDNA, alternatively scRNA) motifs that exploit diverse design elements to interact and change conformation via prescribed hybridization cascades. Modes of nucleating interactions include toe-hold/toehold, loop/toehold, loop/loop, and template/toehold hybridization. Modes of strand displacement include 3-way branch migration, 4-way branch migration, and spontaneous dissociation. To exert control over the order of self-assembly and disassembly events, scDNAs are sometimes designed to co-exist metastably (i.e., the molecules are kinetically trapped) or stably (i.e., the molecules are thermodynamically trapped), with the next step in the reaction pathway triggered either by a cognate molecular input detected from the environment or by a molecular output of a previous step in the reaction pathway. In some instances, principles for engineering conditional metastability include nucleation barriers, topological constraints, toehold sequestration, and template unavailability, while principles for engineering conditional stability include cooperativity and sequence transduction. These pathway-controlled self-assembly and disassembly reactions may be driven by the enthalpy of base pairing and the entropy of mixing. These design elements have enabled the rational design and construction of scDNAs executing diverse dynamic functions, including catalysis, signal amplification, sequence transduction, shape transduction, Boolean logic, and locomotion.
In some instances, devising a new reaction pathway is akin to molecular choreography, requiring conception of both the scDNA participants and the dance that they will execute via pathway-controlled self-assembly and disassembly operations. Once a new reaction pathway has been choreographed, the task remains of encoding the intended pathway-controlled interactions and conformation changes into the sequences of the constituent scDNAs. To program this dynamic function, the nucleic acid sequences may be designed so that the molecules predominantly execute the desired on-pathway interactions while avoiding off-pathway alternatives. Systems and methods for formulating and solving the sequence design problem for nucleic acid reaction pathway engineering are disclosed herein.
Consider a set of nucleic acid molecules intended to execute a prescribed hybridization cascade. For example, the reaction pathway of an embodiment depicted in
In this embodiment, the reaction pathway specifies the elementary self-assembly and disassembly operations by which the molecules are intended to interact, the desired secondary structure for each on-pathway complex, and the complementarity relationships between sequence domains in the molecules. For example, the reaction pathway in the embodiment in
In some embodiments, a multistate test tube design problem can be specified as a set of target test tubes, Ω. Each tube, hεΩ, contains a set of desired on-target complexes, Ψhon, and a set of undesired off-target complexes, Ψhoff. The set of complexes in tube h is then:
Ψh∝Ψhon∪Ψhoff
and the set of all complexes in Ω is:
Ψ∝∪hεΩΨh
For each on-target complex, jεΨhon, the user or the reaction pathway engineering system 1901 may specify a target secondary structure, sj, and a target concentration, yh,j. For each off-target complex, jεΨhoff, the target concentration is vanishing (yh,j=0) and there is no target structure (sj=). When specifying off-targets, the reaction pathway engineering system 1901 may define Ψhoff to include all complexes of up to Lmax strands. Let
φΨ∝φj∀jεΨ
denote the set of sequences for the complexes in Ω.
For a given reaction pathway, target test tubes can be used to represent reactant, intermediate, and product states of the system. For example,
In one example, each target test tube isolates a different stable subset of the system components in local equilibrium, enabling optimization of kinetically significant states that would appear insignificant if all components were placed in a single test tube at equilibrium. For example, with the target test tubes of the embodiment shown in
In one example, to provide a physically meaningful objective function for optimizing the equilibrium base-pairing properties of a single test tube of interacting nucleic acid strands, the reaction pathway engineering system 1901, and specifically, the objective function module 1910, may first derive the test tube ensemble defect, C, quantifying the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the target test tube (also described herein). In some embodiments, optimization of the test tube ensemble defect may implement a positive design paradigm (stabilize on-targets) and a negative design paradigm (destabilize off-targets) at two levels: (1) designing for the on-target structure and against the off-target structures within the structural ensemble of each on-target complex, and (2) designing for the on-target complexes and against the off-target complexes within the ensemble of the test tube. Both paradigms may be important at both levels in order to achieve high-quality test tube designs with a low test tube ensemble defect.
In this particular embodiment using the present multistate test tube design setting, let
h
∝C
h
/y
h
nt
denote the normalized test tube ensemble defect for tube hεΩ, which falls in the interval (0, 1). Here,
is the total concentration of nucleotides in tube h. As h approaches zero, each on-target complex, jεΨhon, approaches its target concentration, yh,j, and is dominated by its target structure, sj, and each off-target complex, jεΨhoff, forms with vanishing target concentration.
To quantify sequence quality over the set of target test tubes, Ω, the reaction pathway engineering system 1901 employs the multistate test tube ensemble defect,
representing the average normalized test tube ensemble defect over Ω.
In one embodiment, the goal is to design a set of sequences, φΨ, such that for each test tube, hεΩ, the normalized test tube ensemble defect, h, satisfies a test tube stop condition,
h
≦f
h
stop, (2)
for a user-specified value of fhstopε(0,1). To focus design effort on test tubes that do not yet satisfy (2), we modify (1) by thresholding the contribution of each test tube by its stop condition
In this embodiment, objective function module 1910 achieves the optimal value of the design objective function of:
if and only if all test tubes satisfy (2). By construction, ≦obj, so optimality for the design objective function obj, implies
≦fstopε(0,1) (5)
for the multistate test tube ensemble defect.
For design problems where it proves challenging to sufficiently reduce obj due to competing demands within the objective function, the user or the pathway engineering design system 1901 may wish to alter the weighting of various defect contributions within obj to prioritize or deprioritize design quality for a portion of the design ensemble. To provide flexibility
for the user to adjust priorities at any level of organization within the design ensemble, the reaction pathway engineering system 1901 permits the user to define defect weights at three levels of organization within the design ensemble: nucleotide, complex, and test tube. In some embodiments, with the default value of unity for all weights, the objective function has a clear physical interpretation. With user-defined weights, the physical meaning is distorted in the service of adjusting design priorities. Following sequence design, the multistate test tube ensemble defect, , provides a physically meaningful characterization of sequence quality independent of user-defined stop conditions and defect weights.
A nucleic acid reaction pathway imposes implicit sequence constraints on the constituent scDNAs (e.g., the complementary sequence domains ‘a’ and ‘a*’ in the reaction pathway of
In some embodiments, the reaction pathway engineering system 1901 may employ a unified framework for handling diverse types of sequence constraints:
Each constraint can be expressed as a constraint relation, such as the examples shown in Table 1. Additional types of sequence constraints can be supported within this framework by specifying new constraint relations. For a given design problem, the reaction pathway engineering system 1901 may use to denote the set of user-specified sequence constraints.
In one embodiment, to design a set of sequences, φΨ, for a nucleic acid reaction pathway, the reaction pathway engineering system 1901 may define a set of target test tubes, Ω, representing reactant, intermediate, and product states of the system, and formulate the constrained multistate sequence design problem as:
An optimal design would be satisfaction of the test tube stop condition (2) for all hεΩ, which in turn implies ≦fstop.
In one embodiment, the sequence, φ, of one or more interacting RNA strands can be specified as a list of bases φaε{A,C,G,U} for a=1, . . . . , |φ| (T replaces U for DNA). A secondary structure, s, of one or more interacting RNA strands can be defined by a set of base pairs (each a Watson-Crick pair [A·U or C·G] or a wobble pair [G·U]). A polymer graph representation of a secondary structure can be constructed by ordering the strands around a circle, drawing the backbones in succession from 5′ to 3′ around the circumference with a nick between each strand, and drawing straight lines connecting paired bases. A secondary structure is unpseudoknotted if there exists a strand ordering for which the polymer graph has no crossing lines. A secondary structure is connected if no subset of the strands is free of the others. A complex of interacting strands is specified as a strand ordering, π, corresponding to the structural ensemble, Γ, containing all connected polymer graphs with no crossing lines. For sequence and secondary structure, sεΓ, the free energy, ΔG(φ,s), may be calculated using nearest-neighbor empirical parameters for RNA in 1M Na+ or for DNA in user-specified Na+ and Mg++ concentrations. These physical models have practical utility for the analysis and design of functional nucleic acid systems, and provide the basis for rational analysis and design of equilibrium base-pairing in the context of a dilute solution using the reaction pathway engineering system 1901.
In one embodiment, the reaction pathway engineering system 1901 uses Ψh0 to denote the set of strand species that interact in a test tube hεΩ to form the set of complex species Ψh. For complex jεΨh, with sequence φj and structural ensemble Γj, the partition function
can be used to calculate the equilibrium probability of any secondary structure sεΓj:
p(φj,s)=exp[−ΔG(φj,s)/kBT]/Q(φj).
Here, kB is the Boltzmann constant and T is temperature. The equilibrium base-pairing properties of complex j can be characterized by the base-pairing probability matrix P (φj), with entries Pa,b(φj)ε[0,1] corresponding to the probability,
that base pair a·b forms at equilibrium within ensemble Γj. Here, S(s) is a structure matrix with entries Sa,b(s)=1 if structure s contains base pair a·b and Sa,b(s)=0 otherwise. For convenience, in some embodiments, the structure and probability matrices can be augmented with an extra column to describe unpaired bases. The entry Sa,|s|+1(s) is unity if base a is unpaired in structure s and zero otherwise; the entry Pa,|φ
In one embodiment, the reaction pathway engineering system 1901 may use QΨh∝Qj∀jεΨh to denote the set of partition functions for the complexes in test tube h. The set of equilibrium concentrations, xh,Ψ
where the constraints impose conservation of mass. A is the stoichiometry matrix with entries Ai,j corresponding to the number of strands of type i in complex j, and xh,i0 is the total concentration of strand i introduced to test tube h.
In one embodiment, to analyze the equilibrium base-pairing properties of all test tubes hεΩ, the partition function, Q(φj), and equilibrium pair probability matrix, P(φj), may be generated for each complex jεΨ using θ(|φj|3) dynamic programs. In some other embodiments, the reaction pathway engineering system 1901 may utilize other processes or methods. The equilibrium concentrations, xv,Ψ
In one scenario, calculation of the design objective function (3) may include calculation of the complex partition functions, QΨ, which are used to calculate the equilibrium concentrations, xh,Ψ
To enable sequence design for reaction pathway engineering, the reaction pathway engineering system 1901 may solve the multistate test tube design problem, which builds conceptually on three subsidiary design problems that may be viewed as special cases of multistate test tube design: complex design, test tube design, and multistate complex design. Here, we describe the design ensemble, design objective function, and design paradigms for each design problem, and highlight interesting conceptual relationships between the approaches.
For complex design, the goal is to design the equilibrium base pairing properties of a complex of (one or more) interacting nucleic acid strands. This subsidiary design problem is the foundation on which the other three design problems build, and has attracted the most engineering method development to date.
Design Ensemble. For a complex design problem, the user may specify a target secondary structure, s, for the complex. The reaction pathway engineering system 1901 may view complex design as a special case of multistate test tube design where the design ensemble contains a single target test tube containing a single on-target complex and zero off-target complexes (Table 2).
Objective Function. To provide a physically meaningful basis for complex design, the complex ensemble defect may be used by the objective function module 1910
to quantify the average number of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of complex j. Let
N
j
∝n
j/|φj|
denote the normalized complex ensemble defect, which falls in the interval (0,1). The goal is to design a sequence, φj, such that the normalized complex ensemble defect satisfies the complex stop condition,
N
j
≦f
j
stop
for a user-specified value of fjstopε(0,1).
Design Paradigms. Complex ensemble defect optimization may implement both a positive design paradigm (explicitly designing for the on-target structure) and a negative design paradigm (explicitly designing against off-target structures) within the ensemble of the complex. In some embodiments, both paradigms can be important to achieving high sequence quality in the context of the complex. Because complex design may correspond to a test tube ensemble containing one on-target complex and zero off-target complexes, complex design may, in one embodiment, implement only a positive design paradigm at the level of the test tube ensemble (Table 3).
With complex design, neither the concentration of the on-target complex, nor the concentrations of any off-target complexes are considered. As a result, sequences that are successfully optimized to stabilize a target secondary structure in the context of the on-target complex, may nonetheless fail to ensure that this complex forms at appreciable concentration when the strands are introduced into a test tube. To address this issue, test tube design expands the design ensemble. For test tube design, the goal of the reaction pathway engineering system 1901 is to engineer the equilibrium base-pairing properties of a test tube of interacting nucleic acid strands.
Design Ensemble. For a test tube design problem, the user or the system 1901 may specify a target test tube h containing on-target complexes, jεΨhon (each with a target secondary structure, sj, and a target concentration, yh,j) and off-target complexes, jεΨhoff (each with vanishing target concentration). The reaction pathway engineering system 1901 may view test tube design as a special case of multistate test tube design where the design ensemble contains a single target test tube containing arbitrary numbers of on-target and off-target complexes (Table 2).
Objective Function. To provide a physically meaningful basis for test tube design, the test tube ensemble defect,
quantifies the concentration of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of the target test tube. Here, xh,j, denotes the equilibrium concentration of complex j in tube h. For each on-target complex, jεΨhon, in one embodiment, the first term in the sum represents the structural defect, quantifying the concentration of nucleotides that are in an incorrect base-pairing state on average within the ensemble of complex j. The second term in the sum represents the concentration defect, quantifying the concentration of nucleotides that are in an incorrect base-pairing state because there is a deficiency in the concentration of complex j. Note that in some embodiments, the reaction pathway engineering system 1901 may sum over Ψhon rather than Ψh because yh,j=0 for off-target complexes, so the structural and concentration defects are both identically zero for all jεΨhon. This does not mean that the defects associated with off-targets are ignored in this embodiment. By conservation of mass, non-zero off-target concentrations imply deficiencies in on-target concentrations, and these concentration defects are quantified by (7). Let
h
∝C
h
/y
h
nt
denote the normalized test tube ensemble defect, which falls in the interval (0, 1). Here, yhnt∝ΣjεΨ
h
≦f
h
stop
for a user-specified value of fhstopε(0,1).
Design Paradigms. In one embodiment, test tube ensemble defect optimization implements a positive design paradigm and a negative design paradigm at two levels (Table 3): (1) designing for the on-target structure and against off-target structures within the structural ensemble of each on-target complex, and (2) designing for on-target complexes and against off-target complexes within the ensemble of the test tube. Both paradigms may in some embodiments be important at both levels in order to achieve high-quality test tube designs with a low test tube ensemble defect.
Neither complex design nor test tube design addresses the multistate challenges inherent in reaction pathway engineering. To address this issue, multistate complex design builds on complex design to expand the design ensemble. For multistate complex design, the goal of the reaction pathway engineering system 1901 is to engineer the equilibrium base pairing properties of an arbitrary number of complexes representing reactant, intermediate, or product states along the reaction pathway.
Design Ensemble. In one embodiment, for a multistate complex design problem, the user may specify a set of complexes, jεΨ, each with a target secondary structure, sj. The reaction pathway engineering system 1901 may view multistate complex design as a special case of multistate test tube design where the design ensemble contains an arbitrary number target test tubes each containing a single on-target complex and zero off-target complexes (Table 2).
Objective Function. To quantify sequence quality over the set of target complexes, Ψ, the multistate complex ensemble defect,
represents the average normalized complex ensemble defect over Ψ. Here,
N
j
∝n
j(φj,sj)/|φj|,
denotes the normalized complex ensemble defect for complex j, which falls in the interval (0,1). The goal of the reaction pathway engineering system 1901 is to design a set of sequences, φΨ, such that for each complex jεΨ, the normalized complex ensemble defect may satisfy a complex stop condition,
N
j
≦f
j
stop; (9)
for a user-specified value of fjstopε(0,1). To focus design effort on complexes that do not yet satisfy (9), the reaction pathway engineering system 1901 may modify (8) by thresholding the contribution of each complex by its stop condition
In one embodiment, this design objective function achieves its optimal value of
if and only if all complexes satisfy (9). By construction, N≦Nobj, so optimality for the design objective function, Nobj, implies
N≦f
stopε(0,1) (12)
for the normalized multistate complex ensemble defect.
Design Paradigms. In one embodiment, multistate complex ensemble defect optimization may implement both positive and negative design paradigms within the ensemble of each complex. Because each state in the multistate formulation corresponds to a test tube ensemble containing one on-target complex and zero off-target complexes, multistate complex design may, in some embodiments, implement only a positive design paradigm within the ensemble of each test tube. By designing for multiple states along the reaction pathway, multistate complex design using the reaction pathway engineering system 1901 also implements a positive design paradigm along the reaction pathway (Table 3).
To facilitate reaction pathway engineering while retaining the conceptual benefits of test tube design over complex design, multistate test tube design using the reaction pathway engineering system 1901 builds on test tube design to expand the design ensemble.
Design Ensemble. For a multistate test tube design problem, the user may specify a set of target test tubes, Ω. Each tube hεΩ contains a set of on-target complexes, jεΨhon (each with target secondary structure, sj, and target concentration, yh,j), and a set of off-target complexes, jεΨhoff (each with vanishing target concentration, yh,j=0). In one embodiment, in contrast to the three subsidiary design problems, for multistate test tube design, the number of target test tubes can be arbitrary and the numbers of on- and off-target complexes within each target test tube are also arbitrary (see the embodiment described in Table 2).
Objective Function. To provide a physically meaningful basis for multistate test tube design, the multistate test tube ensemble defect, , quantifies the average normalized test tube ensemble defect over Ω, as described previously (1-5). When the design ensemble comprises a single target test tube containing a single on-target complex and no off-target complexes, the present formulation may be considered to be equivalent to complex ensemble defect optimization. When the design ensemble comprises a single target test tube containing arbitrary numbers of on-target and off-target complexes, the present formulation may be considered to be equivalent to test tube ensemble defect optimization. When the design ensemble comprises multiple target test tubes, each containing a single on-target complex and no off-target complexes, the present formulation is equivalent to multistate complex ensemble defect optimization. The multistate test tube ensemble defect generalizes the three subsidiary ensemble defects to an ensemble of an arbitrary number of target test tubes, each containing arbitrary numbers of on- and off-target complexes.
Design Paradigms. The present formulation may, in some embodiments, implement positive and negative design paradigms within the ensemble of each complex and also within the ensemble of each test tube. Moreover, by employing a set of target test tubes to design for on-pathway states and against off-pathway states along the reaction pathway, the present formulation used by the reaction pathway engineering system 1901 also implements positive and negative design paradigms along the reaction pathway (Table 3).
In some embodiments, the constrained multistate sequence design process used by the reaction pathway engineering system 1901 generalizes the test tube design process of Wolfe and Pierce, adapting similar process ingredients to perform sequence design over an ensemble of an arbitrary number of target test tubes subject to implicit and explicit sequence constraints. In some other embodiments, additional limitations may also be added to the engineering process.
Specifically, in some embodiments, the objective function, obj, may be reduced via iterative mutation of a random initial sequence. Because of the high computational cost of calculating obj, it may be useful, in some situations, to avoid direct recalculation of the objective function in evaluating each candidate mutation. The reaction pathway engineering system 1901, may exploit two approximations to enable efficient calculation of the objective function estimate, {tilde over ()}obj: using test tube ensemble focusing (e.g., via the test tube ensemble focusing module 1915), sequence optimization initially focuses on only the on-target portion of each test tube ensemble; using hierarchical ensemble decomposition (e.g., via the hierarchical ensemble decomposition module 1920), the structural ensemble of each on-target complex is hierarchically decomposed into a tree of subensembles, yielding a forest of decomposition trees. Candidate sequences may be evaluated at the leaf level of the decomposition forest by the estimations module 1935 by calculating {tilde over ()}obj from nodal defects generated efficiently over the leaf subensembles. As optimized subsequences are merged toward the root level of the forest, any emergent defects may be eliminated via ensemble redecomposition from the parent level on down and sequence reoptimization may be done from the leaf level on up.
After subsequences are successfully merged to the root level, the exact objective function, {tilde over ()}obj, may be calculated (e.g. via the objective function module 1910) for the first time, explicitly checking for the effect of the previously neglected off-target complexes. Any off-target complexes observed to form at appreciable concentration are hierarchically decomposed, added to the decomposition forest, and actively destabilized during subsequent forest reoptimization. When decomposition or focusing defects are encountered, decomposition is performed using multiple exclusive split-points. For sequence initialization, mutation, and reseeding, the reaction pathway engineering system 1901 may solve a constraint satisfaction problem to obtain valid sequences satisfying all constraints in . The elements of this hierarchical sequence optimization process are described herein and in the pseudocode describing an embodiment of the process as included in Appendix I.
In one embodiment, to reduce the cost of sequence optimization, the set of complexes, Ψ, may be partitioned into two disjoint sets:
Ψ=Ψactive∪Ψpassive,
where Ψactive denotes complexes that will be actively designed and Ψpassive denotes complexes that will inherit sequence information from Ψactive. Only the complexes in Ψactive are directly accounted for in the focused test tube ensembles that are used to evaluate candidate sequences. Initially, the reaction pathway engineering system 1901 sets
Ψactive=Ψon,Ψpassive=Ψoff (13)
where
Ψon ∝∪hεΩΨhon
is the set of complexes that appear as on-targets in at least one test tube, and
Ψoff∝Ψ\Ψon
is the set of complexes that appear as off-targets in at least one test tube and do not appear as on-targets in any test tube. Hence, with test tube ensemble focusing, only complexes that are on-targets in at least one test tube are actively designed at the outset of sequence design.
In one scenario, exact evaluation of the objective function, obj, may require calculation of the test tube ensemble defect contribution, ch,j, for each complex jεΨh for each tube hεΩ. The θ(|φj|3)(cost of calculating ch,j is dominated by calculation of the partition function, Qj and equilibrium pair probability matrix, Pj. To reduce the cost of evaluating candidate sequences, the reaction pathway engineering system 1901, and specifically the hierarchical ensemble decomposition module 1920, seeks to estimate ch,j at lower cost by hierarchically decomposing the structural ensemble, j, of each complex jεΨactive into a binary tree of subensembles, yielding a forest of |Ψon| decomposition trees. Each node in the forest is indexed by a unique integer k.
In some embodiments, estimating the defect contribution, ch,j, using physical quantities generated at depth d via the estimations module 1935 may include calculation of the nodal partition function, Qk, and nodal pair probability matrix, Pk, at cost θ(|φk|3) for each node k at depth d in the decomposition tree of complex j. For an optimal binary decomposition, |φk| halves and the number of nodes doubles at each depth moving down the tree, so the cost of estimating ch,j at depth d can be a factor of ½2d−2 lower than the cost of calculating ch,j exactly on the full ensemble Γj. Hence, for maximum efficiency, candidate mutations can be evaluated based on the estimated test tube ensemble defect calculated at the leaves of the decomposition forest. As designed subsequences are merged toward the root level, the test tube ensemble defect is estimated at intermediate depths in the forest.
To decompose the structural ensemble Γk of parent node k, the nucleotides of k are partitioned by the hierarchical ensemble decomposition module 1920 into left and right child nodes kl and kr by a split-point F, as shown in
The dual goals of accuracy and efficiency can both be achieved by the hierarchical ensemble decomposition module 1920 through placing the split-point F within a duplex that forms with high probability at equilibrium such that approximately half the parent nucleotides are partitioned to each child. Recall that the structural ensemble Γk is defined to contain all unpseudoknotted secondary structures, corresponding to precisely those polymer graphs with no crossing base pairs. In one example, since no structure in Γk can contain both base pairs that sandwich F and base pairs that cross F, placement of F between base pairs with probability close to one implies that the structures containing base pairs crossing F occur with low probability at equilibrium and may be safely neglected. Partitioning the parent nucleotides into left and right children of equal size minimizes the total cost, θ(|φkl|3)+θ(|φkr|3) of evaluating both children.
During the course of sequence design, if the base pairs sandwiching split-point F in parent k do not form with probability close to one, the accuracy of the decomposition breaks down. In this case, {tilde over (Γ)}k excludes structures that are important to the equilibrium physical properties of Γk, preventing the children from approximating the full defect of the parent. As described herein, the resulting decomposition defects can be eliminated by redecomposing the parental ensemble, Γk, using a set of multiple exclusive split-points, {F}, that define exclusive child subensembles (see
Structure-Guided Decomposition of on-Target Complexes
In one embodiment, at the outset of sequence design, equilibrium base-pairing probabilities are not yet available to guide ensemble decomposition. Instead, initial decomposition of each on-target complex, jεΨactive, may be guided by the user-specified on-target structure, sj, making the optimistic assumption that the base-pairs in sj will form with probability close to one after sequence design. Using this structure-guided ensemble decomposition approach, as the quality of the sequence design improves, the quality of the ensemble decomposition approximation will also improve.
For each complex jεΨactive, the target structure sj can be decomposed by the hierarchical ensemble decomposition module 1920 into a (possibly unbalanced) binary tree of substructures, resulting in a forest of |Ψon| trees. Each nucleotide in parent structure sk is partitioned to either the left or right child substructure (sk=sk
In one embodiment, if the maximum depth of a leaf in the forest of binary trees is D, any nodes with depth d<D that lack an eligible split-point may be replicated at each depth down to D so that all leaves have depth D. Let Λ denote the set of all nodes in the forest. Let Λd denote the set of all nodes at depth d. Let Λd,j denote the set of all nodes at depth d resulting from decomposition of complex j. Note that each complex jεΨactive contributes a single tree to the decomposition forest whether it is contained in one or more tubes.
In order to build in a tolerance for a basal level of decomposition defect as subsequences are merged moving up the decomposition forest, the stringency of each test tube stop condition (2) can be increased by a factor of fstringentε(0,1) at each level moving down the decomposition forest:
f
h,d
stop
∝f
h
stop(fstringent)d−1∀dε{1, . . . ,D}.
In some embodiments, the reaction pathway engineering system 1901 may calculate physical quantities at any level dε{2, . . . , D} as described herein to efficiently and accurately estimate the complex contributions, CΨactive, to the test tube ensemble defect, C. The complex partition function estimates, {tilde over (Q)}Ψactive, can be constructed from the conditional partition functions, {tilde over (Q)}Λ
In one embodiment, the estimations module 1935 begins by calculating the complex partition function estimate, {tilde over (Q)}j, for each complex jεΨactive in terms of conditional partition functions evaluated efficiently at the nodes kεΛd,j at any level dε{2, . . . ,D}. Let Ek denote the set of base pairs that are enforced in node k, and are hence adjacent to a split-point in some ancestor. On node k, the reaction pathway engineering system 1901 calculates the conditional partition function
{tilde over (Q)}
k
∝Q(φk|Ek) (14)
over the conditional ensemble, {tilde over (Γ)}k, comprising all structures in Γk that contain all base pairs in Ek. In one example, this calculation can be performed using a dynamic program suitable for complexes containing arbitrary numbers of strands, enforcing the base pairs in Ek by applying a bonus energy, ΔGclamp, each time a base pair in Ek is encountered within the dynamic program. It may also be performed using other suitable methods.
Next, the conditional partition functions generated at depth d are merged recursively to estimate the partition function for complex j. Consider split-point Fin parent k, with left-child and right-child conditional partition function, {tilde over (Q)}k
{tilde over (Q)}
k
={tilde over (Q)}
k
{tilde over (Q)}
k
exp(ΔGiFinterior/kBT). (15)
At the conclusion of recursive merging, the partition function estimate for complex j based on conditional quantities calculated at depth d is {tilde over (Q)}j. This estimate may become exact as the equilibrium probabilities of the base pairs sandwiching decomposition split-points approach unity in accordance with the decomposition assumption. In this case, enforcing these base pairs within the conditional child ensembles at depth d leads to conditional child partition function estimates neglecting only structures that contribute negligibly to the partition function of complex j.
Similarly, on node k, the estimations module 1935 may calculate the conditional pair probability matrix
{tilde over (P)}
k
∝P(φk|Ek) (16)
over the conditional ensemble, {tilde over (Γ)}k, using a related dynamic program. The conditional pair probability matrices calculated at depth d may be merged recursively to calculate the pair probability matrix estimate for complex j. During each merge, the matrix entries for the parent are taken from the corresponding entries in the children. At the conclusion of recursive merging, the pair probability matrix estimate for complex j based on conditional quantities calculated at depth d is {tilde over (P)}j. This estimate can become exact in the limit as the equilibrium probabilities of the base pairs sandwiching the decomposition split-points approach unity in accordance with the decomposition assumption. In this case, enforcing these base pairs within the conditional child ensembles at depth d is an accurate reflection of the predominant equilibrium base-pairing state of these nucleotides in complex j.
Let Ek denote the set of base pairs that are enforced in node k, and are hence adjacent to a split-point in some ancestor. In one embodiment, the estimations module 1935 seeks to calculate the conditional partition function (14):
{tilde over (Q)}
k
∝Q(φk|Ek)
and the conditional equilibrium base-pairing probabilities (16):
{tilde over (P)}
k
∝P(φk|Ek)
over the conditional ensemble, {tilde over (Γ)}k, comprising those structures in Γk that contain all base pairs in Ek. The approach in this embodiment is to apply standard dynamic program processes that operate over ensemble Γk, but to modify the free energy model so that contributions from structures that do not contain all the base pairs in Ek are effectively neglected.
In one example, the partition function may be generated in a forward sweep moving from short subsequences to the full nodal sequence. Equilibrium base-pairing probabilities are then generated in a backward sweep moving from the full nodal sequence back to short subsequences. During the forward sweep, each time a base pair in Ek is encountered, the standard free energy is augmented with a bonus energy, ΔGclamp, thus increasing the partition function contribution of every structure containing that base pair by a factor of exp(−ΔGclamp/kBT). By choosing ΔGclamp to be sufficiently negative, contributions from structures lacking any base pair in Ek can be effectively neglected. After the forward sweep over ensemble Γk using the modified free energy model:
In some embodiments, these processes may be performed using dynamic program processes via general-purpose or specialized processors, memory, etc. suitable for processing complexes containing arbitrary numbers of strands. In some other embodiments, the reaction pathway engineering system 1901 may use other methods to perform these processes.
After calculating the left-child and right-child conditional partition functions, {tilde over (Q)}k
{tilde over (Q)}
k
={tilde over (Q)}
k
{tilde over (Q)}
k
exp(−ΔGFinterior/kBT),
where ΔGFinterior is the free energy for the interior loop formed by the base-pairs sandwiching F. In each child, the base pair adjacent to F in the parent can be enforced in the child using an energetic bonus as described above. By supposition, this base pair terminates a duplex in every structure in the conditional child ensemble, but borders an interior loop in the conditional parent ensemble. In some embodiments, using nearest neighbor free energy models that include a free energy penalty, ΔGterminal, for terminating a duplex with a non-G·C base pair, if a child has a non-G·C base pair as the terminal clamped pair, the parental partition function can be compensated to remove the penalty present in the child partition function. In some examples, this can be achieved by multiplying the child partition function by a factor of exp(ΔGterminal/kBT) at the time the parental partition function is reconstructed using (15).
Complex Concentration Estimation using Deflated Mass Constraints
In one example, after calculating the set of complex partition function estimates, {tilde over (Q)}Ω
x
h,i
0=ΣjεΨ
follows from the target concentration, yh,j, and strand composition, Ai,j of each on-target complex jεΨhon.
In one embodiment, using test tube ensemble focusing module 1915, initial sequence optimization can be performed on a decomposition forest that contains only the on-target complexes in Ψactive but ultimately, the reaction pathway engineering system 1901 may try to satisfy the test tube stop condition (2) for each tube hεΩ for the full set of complexes in Ψh, including the off-targets in Ψhpassive. Recall that the off-targets in Ψhpassive do not contribute directly to the sum used to calculate the test tube ensemble defect (7), but contribute indirectly by forming with positive concentrations, causing concentration defects for complexes in Ψhactive as a result of conservation of mass. Hence, for each tube hεΩ, the reaction pathway engineering system 1901 can pre-allocate a portion of the permitted test tube ensemble defect, fhstopyhnt, to the neglected off-target complexes in Ψhpassive by deflating the total strand concentrations (17) used to impose the mass constraints (6b) in calculating the equilibrium concentrations {tilde over (x)}h,Ψ
Following this approach if Ψhpassive≠, the estimations module 1935 makes the assumption that the complexes in Ψhpassive consume a constant fraction of each total strand concentration:
corresponding to a total mass allocation of fpassivefhstopyhnt to the neglected off-targets in Ψhpassive. To calculate the complex concentration estimates, {tilde over (x)}h,Ψ
in place of the full strand concentrations (17). In the context of the deflated-mass test tube, the complex concentration estimates, {tilde over (x)}h,Ψ
In some embodiments, for each complex jεΨon, the complex ensemble defect estimate, ñj, can be calculated using the complex pair probability matrix estimate, {tilde over (P)}j, reconstructed from conditional quantities calculated efficiently at the nodes, kεΛd,j, at any level dε{2, . . . , D}.
For complex j, the contribution of nucleotide a to the complex ensemble defect estimate is given by:
and the complex ensemble defect estimate is then:
This estimate becomes exact in the limit as the complex pair probability matrix estimate, {tilde over (P)}j, becomes exact.
Having calculated the complex concentration estimates, {tilde over (x)}h,Ψ
is the contribution of complex j. The normalized test tube ensemble defect estimate for tube hεΩ at level d can be represented as:
is the total concentration of nucleotides in tube h. In the context of the deflated-mass test tube, this estimate becomes exact in the limit as the concentration estimates, {tilde over (x)}h,Ψ
Additionally, the following:
{tilde over (M)}
h,d
a
=[ñ
j
amin({tilde over (x)}h,j,yh,j)+max(yh,j−{tilde over (x)}h,j,0]/yhnt
can represent the contribution of nucleotide a to {tilde over ()}kh,d, which will provide a basis for defect-weighted mutation sampling during leaf mutation and defect-weighted reseeding during leaf reoptimization.
The design objective function estimate based on physical quantities calculated efficiently at any depth dε{2, . . . , D}, is then:
If desired, the user may adjust design priorities by specifying defect weights at three levels of organization within the design problem:
By default, all weights are unity. This can be changed as necessary.
Initialization
Sequences can be randomly initialized subject to the constraints in by solving a constraint satisfaction problem using a branch and propagate process.
Leaf Mutation
In some embodiments, to minimize computational cost, all candidate mutation sets can be evaluated at the leaf nodes, kεΛD, of the decomposition forest. Leaf mutation terminates if each tube hεΩ satisfies its leaf stop condition,
{tilde over ()}h,D≦fh,Dstop. (20)
A candidate mutation set is accepted by the mutations module 1930 if it decreases the objective function estimate (19) and rejected otherwise. Let U denote the set of tubes in Ω that do not yet satisfy a leaf stop condition. The reaction pathway engineering system 1901 performs defect weighted mutation sampling by selecting nucleotide a in tube hε U for mutation with probability,
proportional to its contribution to the objective function estimate (19). After selecting a candidate mutation position, a candidate mutation is randomly selected from the set of permitted nucleotides at that position. If the resulting sequence is infeasible (due to constraint violations caused by the candidate mutation), the mutations module 1930 may generate a feasible candidate sequence, {circumflex over (φ)}Λ
A feasible candidate sequence, {circumflex over (φ)}Λ
Leaf Reoptimization
In one embodiment, after leaf mutation terminates, if a leaf stop condition (20) is not satisfied for any tube hεΩ (i.e., if U≠), leaf reoptimization commences. At the outset of each round of leaf reoptimization, the mutations module 1930 may perform defect-weighted reseeding of Mreseed positions by selecting nucleotide a in tube hεU for reseeding (with a new random initial sequence) with probability (21). Following reseeding, a feasible candidate sequence, {circumflex over (φ)}Λ
Subsequence Merging, Redecomposition, and Reoptimization
In some embodiments, moving down the decomposition forest, hierarchical ensemble decomposition module 1920 makes the assumption that base pairs sandwiching parental split-points form with probability approaching unity. Conditional child ensembles enforce these sandwiching base pairs at all levels in the decomposition forest in accordance with the decomposition assumption. As subsequences are merged moving up the decomposition forest, the accuracy of the decomposition assumption is checked. If the assumption is correct, the child-estimated defect calculated using estimations module 1935 will accurately predict the parent-estimated defect. If the assumption is incorrect, the child-estimated defect will not accurately predict the parent-estimated defect since the conditional child ensembles neglect the contributions of structures that lack the sandwiching base pairs. During subsequence merging, if the decomposition assumption is discovered to be incorrect, hierarchical ensemble redecomposition is performed by module 1920 based on the newly available parental base-pairing information. The details of subsequence merging, redecomposition, and reoptimization are as follows.
After leaf reoptimization terminates, parent nodes at depth d=D−1 may merge their left and right child sequences to create the candidate sequence {circumflex over (φ)}Λ
{tilde over ()}h,d≦max(fh,dstop,{tilde over ()}h,d+1/fstringent), (22)
merging continues up to the next level in the forest. Otherwise, failure to satisfy a parental stop condition may indicate the existence of a decomposition defect,
{tilde over ()}h,d−{tilde over ()}h,d+1/fstringent>0,
exceeding the basal level permitted by the parameter fstringent. Let V denote the set of tubes hεV that do not satisfy a parental stop condition at level d. For each tube hεV, the parent node at depth d whose replacement by its children results in the greatest underestimate of the test tube ensemble defect at level d is subjected to structure- and probability-guided hierarchical ensemble decomposition. In some examples, additional parents can be redecomposed until
{tilde over ()}h,d−{tilde over ()}h,d+1*/fstrigent≦fredecomp({tilde over ()}h,d−{tilde over ()}h,d+1/fstringent)
is satisfied for all tubes hεV. Here, {tilde over ()}h,d+1 is the child defect estimate before any redecomposition, {tilde over ()}h,d+1 is the child defect estimate after redecomposition, and fredecompε(0,1).
In some embodiments, after redecomposition, the current sequences at depth d can be pushed down to all nodes below depth d and a new round of leaf mutation and leaf reoptimization is performed. Following leaf reoptimization, merging begins again. Subsequence merging and reoptimization terminate successfully if each tube hεΩ satisfies its parental stop condition (22) at depth d=1. The outcome of subsequence merging, redecomposition, and reoptimization is the feasible sequence, φΛ
Using test tube ensemble focusing module 1915, initial sequence optimization can be performed for the on-target complexes in Ψactive, neglecting the off-target complexes in Ψpassive. At the termination of initial forest optimization, the estimated objective function is estimated objective function is {tilde over ()}1obj, calculated using (19). For this estimate, each normalized test tube ensemble defect estimate, {tilde over ()}h,1, can be based on complex defect contributions, {tilde over (c)}Ψ
If each test tube hεΩ satisfies its termination stop condition,
h≦max(fhstop,{tilde over ()}h,1), (23)
sequence design terminates successfully. Otherwise, failure to satisfy a termination stop condition may sometimes indicate the existence of a focusing defect,
−{tilde over ()}h,1>0. (24)
Let W denote the set of tubes in Ω that do not satisfy a termination stop condition.
For each tube hεW, the test tube ensemble is refocused using test tube ensemble focusing module 1915 by transferring the highest-concentration off-target in Ψhpassive to Ψhactive. For each tube hε W, additional off-targets are transferred from Ψhpassive to Ψhactive until
h−{tilde over ()}h,1*≦frefocus(h−{tilde over ()}h,1) (25)
where {tilde over ()}h,1 is the forest-estimated defect before any refocusing, {tilde over ()}h,1* is the forest-estimated defect after refocusing (calculated using deflated total strand concentrations (18) if Ψhpassive≠), and frefocusε(0,1).
The new off-targets in Ψactive are then decomposed using probability-guided hierarchical ensemble decomposition using module 1920, the decomposition forest is augmented with new nodes at all depths, and forest reoptimization commences starting from the final sequences from the previous round of forest optimization. During forest reoptimization, the process actively attempts to destabilize the off-targets that were added to Ψactive. This process of tube ensemble refocusing and forest reoptimization is repeated until each test tube hεΩ satisfies its termination stop condition (23), which is guaranteed to occur in the event that all off-targets are eventually added to Ψactive. At the conclusion of sequence design, the process returns the feasible sequence set, φΨ, that yielded the lowest encountered objective function, obj.
In one embodiment, prior to sequence design, in the absence of base-pairing probability information, hierarchical ensemble decomposition was performed for each complex, jεΨactive, based on the user-specified on-target structure, sj, using the hierarchical ensemble decomposition module 1920. For a parent node, k, with structural ensemble, Γk, a single split-point, F, was positioned within a duplex in target structure, sj, so as to minimize the cost of evaluating both children, yielding left and right child nodes kl and kr with conditional ensembles {tilde over (Γ)}k
In some embodiments, two candidate split-points, Fi and Fj, are exclusive if they cross when depicted in a polymer graph (denoted FiFj, see
For comparison with the formulations that follow, here we recast the previously described structure-guided hierarchical ensemble decomposition using modified notation. Let F denote a split-point and let F± denote the union of the sets of Hsplit base pairs sandwiching F on either side. For a node k descendant from on-target complex jεΨactive with user-specified target structure sj, the nodal target structure matrix, Sk, is defined using the corresponding entries from the root target structure matrix Sj. The set of valid split-points may be denoted
And the optimal split-point selected for decomposition by the hierarchical ensemble decomposition module 1920,
minimizes the cost of evaluating the two child nodes implied by F.
Probability-Guided Decomposition using Multiple Exclusive Split-Point
In some embodiments, the set of sets of valid exclusive split-points may be denoted
for a user-specified value of fsplitε(0,1) and the optimal set of exclusive split-points selected for decomposition by the hierarchical ensemble decomposition module 1920,
minimizes the cost of evaluating the 2|{F}| child nodes implied by {F}.
The set of sets of valid split-points may be denoted
And the optimal set of exclusive split-points selected for decomposition by the hierarchical ensemble decomposition module 1920
minimizes the cost of evaluating the 2|{F}| child nodes implied by {F}. In one example, the structure-guided component of this approach may ensure that the redecomposition is compatible with the user-specified target structure, while the probability-guided component of this approach ensures that the physical properties of the parent can be accurately estimated using the children. For any children resulting from split-points that are target-structure-incompatible, subsequent decomposition is performed via probability-guided decomposition using the hierarchical ensemble decomposition module 1920.
During probability-guided decomposition, the optimal set of exclusive split points, {F}*, may be chosen to minimize the cost:
subject to constraints (27) on the collective probability:
the minimum child size, and split-point exclusivity. The hierarchical ensemble decomposition module 1920 may, in some embodiments, solve this optimization problem using a depth-first branch and bound procedure, where each branch in the optimization tree corresponds to a set of split-points satisfying the exclusivity and child-size constraints. Starting at the root with no split-points, at each branching step, a split-point is added by the hierarchical ensemble decomposition module 1920 so as to greedily maximize the collective probability, P({F}), of the current branch, {F}, while satisfying the exclusivity and child-size constraints. The cost of the current branch, cost({F}), serves as a lower bound on the cost of all unexplored subtrees of this branch. Branching continues until a branch is encountered that satisfies the collective probability constraint, P({F})≧fsplit. This branch defines the current minimal-cost solution, {
In one scenario, the estimations module 1935 may generalize the formulation of test tube ensemble defect estimation at depth dε{2, . . . , D} to account for the possibility of multiple exclusive split-points within any parent in the decomposition forest. Consider parent k decomposed using the set of exclusive split-points, {F}.
Following (15), for each split-point Fiε{F}, the corresponding exclusive contribution to the parental partition function estimate is
yielding the partition function estimate for parent k:
Recursive merging is performed until the complex pair probability matrix estimate, {tilde over (Q)}j, is obtained.
Likewise, for each split-point Fiε{F}, the entries in the exclusive parental pair probability matrix estimate,
Botzmann weighting of exclusive parental estimates then yields the pair probability matrix estimate for parent k:
Recursive merging is performed until the complex pair probability matrix estimate, {tilde over (P)}j is obtained.
Calculation of the complex ensemble defect estimates, ñΨ
Each time the sequence is initialized, mutated, or reseeded, a feasible sequence is generated by solving a constraint satisfaction problem based on the user-specified constraints in .
A constraint satisfaction problem (CSP) can be specified as:
In the present setting, each variable is the sequence, φa, of a nucleotide, a. For RNA, the domain for each variable is {A, C, G, U}. Each constraint in is specified using one of the constraint relations in Table 1 applied to one or more nucleotides (e.g., specification of constraint Ra,bmatch requires that φa=φb for nucleotides a and b).
In general, constraint satisfaction problems are NP-complete, so general-purpose polynomial-time processes are unavailable. Empirically, we find that CSPs arising in the context of nucleic acid reaction pathway engineering specified in terms of the diverse constraint relations of Table 1 can typically be solved efficiently using the branch and propagate methods described below.
In some embodiments, the mutations module 1930 solves the CSP using a branch and propagate processes that returns a solution if one exists and returns a warning if no solution exists. Initially, the domain for each variable is {A, C, G, U}. The mutations module 1930 first pre-processes the CSP by trivially removing any value from the domain of a variable a that is inconsistent with a constraint (e.g., an assignment or library constraint). The mutations module 1930 may, in some embodiments, further pre-process the CSP using constraint propagation to impose arc consistency as described below.
The branch and propagate processes involves iterated application of two ingredients:
Sequence initialization commences with constraint propagation to impose arc consistency and then a first branching step in which a variable, a, is randomly selected and randomly assigned a value from the domain of a. Constraint propagation may then be used to impose arc consistency and the next branching step is taken by randomly selecting an unassigned variable, b, and assigning a consistent value from the domain of b. Backtracking is performed if no consistent value for b exists. The branch and propagate processes returns a feasible set of initial sequences, Cactve, if one exists, and a warning otherwise.
In one embodiment, during leaf mutation, a feasible candidate sequence can be generated by the mutations module 1930 by mutating the current leaf sequence, φΛ
Branching may be performed by the mutations module 1930 by randomly selecting an unassigned variable b from with probability proportional to the size of the domain of b. For each value in the domain of b, the mutations module 1930 may check the implications of arc consistency on the size of the candidate mutation set ξ, and randomly assign a value of b that minimizes the increase in the size of ξ. The branch and propagate process returns a feasible candidate sequence, {circumflex over (φ)}Λ
During leaf reoptimization, a feasible candidate reseeded sequence, {circumflex over (φ)}Λ
In one example, the constrained multistate sequence design can be coded in programming languages such as the C and C++.
For each design problem, 10 independent design trials were performed. Design trials were run on a cluster of 2.53 GHz Intel E5540 Xeon dual-processor/quad-core nodes with 24 GB of memory per node. Other testing environment and testing hardware may also be used. Each trial was run on a single computational core using the default process parameters of Table 5. Design quality is quantified by the multistate test tube ensemble defect divided by the mean stop condition, /fstop; this quantity should be less than or equal to one for a design that achieves the stop condition for each target test tube. Relative design cost is quantified by dividing the cost of sequence design (costdes) by the cost of a single evaluation of the multistate test tube ensemble defect (costeval). To characterize typical design performance for each design problem, we plot sample median over design trials. DNA designs are performed at 25° C. and RNA designs are performed at 37° C. For each design problem, the test tube stop condition was fhstop=0.05 for each tube hεΩ (i.e., no more than 5% of the nucleotides in each test tube incorrectly paired at equilibrium).
Default parameters that may be used in one embodiment are shown in Table 5. In other embodiments, different parameters may be used. Note that other parameters and values may be adopted in other embodiments.
To demonstrate performance, a set of constrained multistate design problems were formulated for selected reaction pathways from the molecular programming literature (Table 6). For each reaction pathway (
Detection of initiator I triggers a conditional self-assembly cascade in which metastable hairpins sequentially nucleate and open to polymerize into a long nicked double-stranded polymer, as shown in
Upon binding mRNA detection target containing subsequence ‘a-b-c’, scRNAs perform shape and sequence transduction to produce a Dicer substrate targeting an mRNA silencing target containing independent subsequence ‘x-y-z’, as shown in
Detection of input sequences F and G leads to release of output sequence E, as shown in
In
Upon detection of Initiator, metastable hairpins undergo conditional catalytic self-assembly to form a 3-arm branched junction, as shown in
In
Specifically, as shown in
One to five orthogonal systems were designed for each of five reaction pathways with the stop condition set to 5% for each target test tube (fhstop ∀hεΩ). Ten design trials were run for each of these 25 design problems.
To characterize typical performance for each design problem, we plot the median over 10 design trials for design quality, design cost, and relative design cost in
An example result for the design of two orthogonal HCR systems is depicted in
Adding pattern prevention constraints (RPpattern of Table 1) keeps particular subsequences from appearing in a specified strand or set of strands. For example, we can prevent any nucleotide from showing up in four positions in a row by preventing the patterns AAAA, CCCC, GGGG, UUUU, or we could prevent any pair of nucleotides from appearing in six consecutive positions by preventing the patterns RRRRRR, YYYYYY, MMMMMM, KKKKKK, WWWWWW, SSSSSS. The design performance while preventing four consecutive instances of the same nucleotide or both four consecutive of the same nucleotide and six consecutive of any two nucleotides are shown in
Composition constraints can be used to specify sequence composition and similarity constraints can be used to specify similarity to a sequence of choice (Rcomposition and Rsimilarity of Table 1). We demonstrate these capabilities by creating a set of designs that are constrained to have a GC content between 40% and 60% for every sequence domain. We created another set of designs where no nucleotide type could make up more than 35% of any domain longer than 3-nt. The effects of these sequence constraints on the quality and cost of design are shown in
In some embodiments, constrained multistate test tube design provides a powerful framework for nucleic acid reaction pathway engineering. As discussed above, sequence design may be formulated as a multistate optimization problem using a set of target virtual test tubes containing different subsets of the strands to represent reactant, intermediate, and product states of the reaction pathway. Each target test tube contains a set of desired ‘on-target’ complexes, each with a target secondary structure and target concentration, and a set of undesired ‘off-target complexes’, each with vanishing target concentration. In one embodiment, the multistate test tube ensemble defect provides a physically meaningful basis for quantifying sequence quality over the set of target test tubes using the objective function module 1910. Sequence design may be performed using the mutations module 1930 subject to both implicit sequence constraints inherent to the reaction pathway and explicit sequence constraints imposed by the user, including compatibility with prescribed biological sequences. In one embodiment, constrained multistate test tube ensemble defect optimization using the reaction pathway engineering system 1901 implements positive and negative design paradigms at three levels: (1) explicitly designing for the target structure and against all off-target structures within the ensemble of each complex, (2) explicitly designing for the target concentration of all on-target complexes and against the formation of all off-target complexes within the ensemble of each test tube, (3) explicitly designing for on-pathway states and against off-pathway states along the reaction pathway.
The reaction pathway engineering system 1901 may, in some examples, implement three concepts, among others, that enable efficient multistate test tube design by reducing the cost of evaluating candidate sequences:
Other techniques may be used in combination with these methods.
Used in combination in some embodiments, test tube ensemble focusing, hierarchical ensemble decomposition, and calculation of conditional physical properties enable the reaction pathway engineering system 1901 to efficiently estimate and optimize the multistate test tube ensemble defect. Generation of feasible candidate sequences by efficient solution of a constraint satisfaction problem enables sequence design even in presence of stringent user-specified sequence constraints. Constrained multi-state sequence design facilitates nucleic acid reaction pathway engineering for diverse applications in molecular programming and synthetic biology.
As used herein, an “input device” can be, for example, a keyboard, rollerball, mouse, voice recognition system or other device capable of transmitting information from a user to a computer. The input device can also be a touch screen associated with the display, in which case the user responds to prompts on the display by touching the screen. The user may enter textual information through the input device such as the keyboard or the touch-screen.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices.
As used herein, “instructions” refer to computer-implemented steps for processing information in the system. Instructions can be implemented in software, firmware or hardware and include any type of programmed step undertaken by components of the system.
A “microprocessor” or “processor” may be any conventional general purpose single- or multi-core microprocessor such as a Pentium® processor, Intel® Core™, a 8051 processor, a MIPS® processor, or an ALPHA® processor. In addition, the microprocessor may be any conventional special purpose microprocessor such as a digital signal processor or a graphics processor.
The reaction pathway engineering system 1901 is comprised of various modules as discussed in detail herein and in
As can be appreciated by one of ordinary skill in the art, each of the modules comprises various sub-routines, procedures, definitional statements and macros. Each of the modules are typically separately compiled and linked into a single executable program. Therefore, the following description of each of the modules is used for convenience to describe the functionality of the preferred system. Thus, the processes that are undergone by each of the modules may be arbitrarily redistributed to one of the other modules, combined together in a single module, or made available in, for example, a shareable dynamic link library.
The reaction pathway engineering system 1901 system may be used in connection with various operating systems such as SNOW LEOPARD®, iOS®, LINUX, UNIX or MICROSOFT WINDOWS®.
The reaction pathway engineering system 1901 may be written in any conventional programming language such as C, C++, BASIC, Pascal, or Java, and run under a conventional operating system.
A web browser comprising a web browser user interface may be used to display information (such as textual and graphical information) to a user. The web browser may comprise any type of visual display capable of displaying information received via a network. Examples of web browsers include Microsoft's Internet Explorer browser, Mozilla's Firefox browser, Apple Safari and PalmSource's Web Browser, or any other browsing or other application software capable of communicating with a network.
The invention disclosed herein may be implemented as a method, apparatus or article of manufacture using standard programming or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” as used herein refers to code or logic implemented in hardware or computer readable media such as optical storage devices, and volatile or non-volatile memory devices. Such hardware may include, but is not limited to, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), complex programmable logic devices (CPLDs), programmable logic arrays (PLAs), microprocessors, or other similar processing devices.
In addition, the modules or instructions may be stored onto one or more programmable storage devices, such as FLASH drives, CD-ROMs, hard disks, and DVDs. One embodiment includes a programmable storage device having instructions stored thereon that when executed perform the methods described herein to determine nucleic acid sequence information.
As used herein, a “node” is a structure which may contain a value, a condition, or represent a separate data structure (which could be a tree of its own). Each node in a tree has zero or more child nodes, which are below it in the tree (by convention, trees are drawn growing downwards). A node that has a child is called the child's parent node (or ancestor node, or superior). A node typically has at most one parent. As is known, nodes that do not have any children are called leaf nodes or terminal nodes.
The topmost node in a tree is the root node. An internal node or inner node is any node of a tree that has child nodes and is thus is not considered to be a leaf node. Similarly, an external node or outer node is any node that does not have child nodes and is thus a leaf node
A subtree of a tree is a tree consisting of a node in the tree and all of its descendants in the tree. For example, the subtree corresponding to the root node is the entire tree. The subtree that corresponds to any other node may be called a proper subtree.
This application claims priority to U.S. Provisional Application No. 61/883,405 filed on Sep. 27, 2013 and hereby incorporates by reference all of the contents described therein.
This work was funded by the National Science Foundation via the Molecular Programming Project (NSF-CCF-0832824 and NSF-CCF-1317694). The government has rights in this invention.
Number | Date | Country | |
---|---|---|---|
61883405 | Sep 2013 | US |