The present disclosure relates to computational methods of compressing genetic information and, in particular, to computational methods and systems for compressing genetic information in multiple reading frames to reduce the total amount of linear sequence required to encode a set of genetic elements by overlapping the sequences that encode them.
There are many situations where the amount of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) available to encode genetic information (e.g., protein coding sequences, RNAs, regulatory elements) is limiting. These situations include, but are not limited to, delivery of genetic information (e.g., large genes) to animals (e.g., humans, domestic animals) or plants (e.g., crops) for gene therapy or genetic editing applications using viruses such as adeno-associated virus (AAV) or other vectors (e.g., other viruses, or non-viral delivery methods). Additional situations involve CRISPR-Cas (or similar programmable DNA-binding proteins such as zinc finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs)) based systems for genome editing including: targeted DNA cutting, homology directed repair, base editing, transcriptional regulation, translational regulation, or splicing regulation. In some cases there is a strict upper-limit on the amount of DNA or RNA that can be delivered. For example, AAV vectors for mammalian gene delivery are limited to genetic cargos of <5 kb. Geminivirus vectors have similar limitations. Moreover, it is generally preferable to use shorter sequences because many steps of the engineering process, including delivery, are typically more efficient with shorter sequences.
To reduce the total amount of linear sequence (DNA, RNA or other medium) required to encode a set of genetic elements, the present disclosure describes a computational method for compressing genetic information by finding a single sequence that mutually encodes two genetic elements in the same stretch of sequence (a “co-encoding”). The co-encoding can, in principle, be in either strand of a double-stranded encoding and the two elements can be encoded in the same or different reading frames if they are proteins. This technique can also be applied to a single genetic element if it is split into two elements (e.g., a protein like galactosidase or Cas9 that can be functionally expressed in two fragments). Potential split points can be identified using computational methods or by functional screens. Split proteins may spontaneously assemble after translation. Alternatively, split proteins can be reconstituted by intein-mediated trans-splicing, mRNA trans-splicing, or known protein-protein interaction domains. Split RNAs may spontaneously assemble through base pairing interactions or be reconstituted through RNA trans-splicing.
Because the natural genetic code is redundant (several codons code for the same amino acid) and many amino acids in proteins are readily substituted, functional genetic elements like proteins admit many DNA representations. The disclosed methods encode information about acceptable nucleotide representations of genetic elements (e.g., proteins or functional RNAs) as a directed acyclic graph (DAG) structure. The DAG is generated from multiple data sources, including codon degeneracy, information from multiple sequence alignments (MSAs), and diverse functional screens of mutant elements. Mutations can include single substitutions as well as insertions, deletions and topological rearrangements of the sequence (e.g., circular permutations and other reorderings). DAG representations of two genetic elements can be used to compute viable co-encodings, which are equivalent to partial intersections of the two DAGs. Individual co-encodings are paths through the resulting intersection graph. Many possible overlaps can be tested by changing the relative positions of the two graphs before calculating the partial intersection.
In nearly all cases with viable co-encodings, many viable sequences are found, many more than can be feasibly tested in a laboratory setting. A computational method for evaluating the quality of individual sequences uses information from natural and mutant sequences to quantify the degree to which overlapping variants preserve the sequence characteristics of functional variants. Data for this ranking procedure is drawn from diverse sources including MSAs, protein structure, and high-throughput functional assays of mutant sequences. A score ranking the quality of a co-encoding is computed, which enables prioritization of variants in a “library generation” procedure. A “library” is a suite of putative functional co-encoding sequences to be tested in a laboratory setting. When high-throughput tests (e.g., functional screens and selections in microbes) are available, large co-encoding libraries (e.g., >10^5 sequence variants) are generated computationally and synthesized for testing. Smaller libraries (e.g., 10-1000 variants) comprising only the top-ranking variants are designed in cases where low-throughput methods (e.g., in vitro biochemistry) are the only means of testing. Overall, the disclosed methods integrate multiple sources of data through computation to predict functional co-encodings of two (or more) genetic elements (or fragments thereof). These co-encodings are then tested by experimental means to identify those that work best for the desired application.
A method of compressing genetic information in multiple reading frames by intersecting graph representations includes, for a series of first genetic sequences encoding first proteins or nucleic acid sequences, associating a first score with each possible nucleotide or amino acid residue, insertion, and deletion at each position. The method further includes encoding the first genetic sequences in first computer-readable data structures comprising first directed acyclic graphs (DAGs) or first finite automatons (FAs) such that (i) a plurality of potential genetic sequences for the first proteins or nucleic acid sequences are encoded in the first data structures, (ii) each edge in the first DAGs or first FAs represents a nucleotide residue, insertion, or deletion at that position and the first score associated with the nucleotide residue, insertion, or deletion at that position, (iii) each path through the first DAGs or accepted sequence in the first FAs represents a potential sequence encoding one of the first proteins or nucleic acid sequences, and (iv) for each path through one of the first DAGs or accepted sequence in one of the first FAs, a first aggregate score of the path or accepted sequence is the accumulation of the first score of all edges along the path or accepted sequence. The method includes encoding, in a second DAG or a second FA, overlapping sequences between the encoded first genetic sequences for the first proteins or nucleic acid sequences, calculating, for each edge in the second DAG or the second FA, a second score representing a combined total effect of the component edges of the first data structures, and selecting, according to a second aggregate score of each of the edges, a sequence represented by a path through the second DAG or the second FA.
A system includes a computer processor, and a memory, communicatively coupled to the computer processor. The memory stores instructions, executable by the computer processor to cause the processor to perform a number of steps. The steps include, for a series of first genetic sequences encoding first proteins or nucleic acid sequences, associating a first score with each possible nucleotide or amino acid residue, insertion, and deletion at each position. The steps also include encoding the first genetic sequences in first computer-readable data structures comprising first directed acyclic graphs (DAGs) or first finite automatons (FAs) such that (i) a plurality of potential genetic sequences for the first proteins or nucleic acid sequences are encoded in the first data structures, (ii) each edge in the first DAGs or first FAs represents a nucleotide residue, insertion, or deletion at that position and the first score associated with the nucleotide residue, insertion, or deletion at that position, (iii) each path through the first DAGs or accepted sequence in the first FAs represents a potential sequence encoding one of the first proteins or nucleic acid sequences, and (iv) for each path through one of the first DAGs or accepted sequence in one of the first FAs, a first aggregate score of the path or accepted sequence is the accumulation of the first score of all edges along the path or accepted sequence. The steps further include encoding, in a second DAG or a second FA, overlapping sequences between the encoded first genetic sequences for the first proteins or nucleic acid sequences. Still further, the steps include calculating, for each edge in the second DAG or the second FA, a second score representing a combined total effect of the component edges of the first data structures, and selecting, according to a second aggregate score of each of the edges, a sequence represented by a path through the second DAG or the second FA.
This disclosure describes a computational method of designing co-encoding nucleic acid sequences for reducing the number of bases (DNA or RNA) required to encode larger constructs. The co-encoded sequences can be separate genetic elements or single genetic elements that are split and then reassemble in situ using native interactions, intein-mediated trans-splicing, mRNA trans-splicing, or known/engineered interaction domains.
The admissible encodings of relevant genetic elements may be determined for a number of nucleic acids, proteins, or other genetic elements/sequences and stored, for example, in a library of such admissible encodings. Alternatively, a user may indicate or select specific genetic elements or sequences of interest, and the admissible encodings of the selected genetic elements or sequences may be determined from the data sources 101 according to the selection.
In any event, for each of the genetic elements, a score is associated with each possible residue, insertion, and deletion at each position in the genetic element. In various embodiments, the scores reflect different types of metrics, depending on the particular application, payload genetic elements, and/or goals. In some embodiments, the score reflects a likelihood or statistical probability of a residue, insertion, or deletion at a position, while in other embodiments, the score may reflect a fitness metric (e.g., fitness of the resulting genetic element for performing its intended function). Generally, the score reflects an expression of a probability or effect of the residue, insertion, or deletion at the position in question.
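One of the score types mentioned above, a likelihood-based score, can be illustrated with a minimal sketch. It assumes the score is a negative log-likelihood estimated from the symbol counts in a single alignment column, with `-` standing for a deletion; the function name `column_scores`, the pseudocount, and the toy column are illustrative assumptions rather than part of the disclosed method.

```python
import math
from collections import Counter

def column_scores(column, alphabet="ACGT-", pseudocount=1.0):
    """Negative log-likelihood score for each symbol in one alignment column.

    `column` holds the symbol observed at one position in each aligned
    sequence; '-' denotes a deletion. Lower scores mean more likely symbols.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {s: -math.log((counts[s] + pseudocount) / total) for s in alphabet}

# Toy column (hypothetical data): A seen four times, G once, one deletion.
scores = column_scores("AAAG-A")
for symbol, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(symbol, round(score, 3))
```

Because the scores are negative logarithms of probabilities, summing them along a sequence corresponds to multiplying the underlying probabilities.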
In some embodiments, the type of score employed is determined by the genetic elements under analysis. In other embodiments, the type of score employed may be selected by a user along with the selection of the genetic elements to analyze. In some embodiments, the library of admissible encodings may store multiple types of scores for each residue, insertion, or deletion at each position while, in other embodiments, the scores may be determined or calculated for each genetic element by referencing specific relevant ones of the data sources 101 to calculate the type of score requested or required.
The scores associated with the various substitutions, insertions, and deletions (referred to sometimes as “mutations” for brevity) are used to determine, for each genetic element under evaluation, which sequences for the genetic element are viable. The viable nucleotide sequences are represented in any desired format.
Referring again to
In order to encode information about allowed sequences in a DAG format, the incorporation of a particular nucleotide at a position in the sequence is represented as an edge in the graph. This edge has an associated nucleotide (or degenerate code indicating several possible nucleotides) and a length defined according to one of a variety of scoring metrics such as probability, negative log-likelihood, or fitness effect. In this formulation, multiple edges leaving any one node may be associated with the same nucleotide; this results in a DAG that is isomorphic to a nondeterministic FA. Alternatively, the graph could be arranged such that each node has at most one outgoing edge for each nucleotide, in which case the graph is isomorphic to a deterministic FA. In a graph constructed in this manner, a node with no incoming edges represents the starting position. A node with no outgoing edges represents an accepting state or end state. In this construction, any path from the start position to the end position represents a potential sequence, and the length of this path represents that sequence's score. This construction allows the storage of a large number of sequences in minimal space by storing the rules of how to make and score sequences instead of the sequences themselves. This avoids the combinatorial (O(n!)) increase in the number of sequences that would have to be stored. At the same time, this allows fast longest/shortest pathfinding algorithms to generate best scoring sequences from the graph very quickly. Further, modifying edge lengths according to desirable characteristics of the sequences, such as GC content, codon usage, proximity to known functional sequences, position entropy, or random variation, runs in time linear in the number of edges. This then allows new paths to be generated that are weighted according to new criteria.
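The edge-based encoding and the shortest-path search described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: edges are hypothetical tuples of (source node, target node, allowed nucleotides, score), lower scores are treated as better (as with negative log-likelihoods), the score values are invented, and a degenerate edge is resolved by taking its first listed nucleotide. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each edge: (source node, target node, allowed nucleotides, score).
# Toy graph for one residue position that may be Lys (AAR) or Val (GTN).
EDGES = [
    (0, 1, "A", 0.1), (1, 2, "A", 0.0), (2, 5, "AG", 0.0),    # Lys path
    (0, 3, "G", 0.9), (3, 4, "T", 0.0), (4, 5, "ACGT", 0.0),  # Val path
]

def best_path(edges, start, end):
    """Return (total score, sequence) for the lowest-scoring start-to-end path."""
    # Map every node to its set of predecessors, as TopologicalSorter expects.
    preds = {}
    for src, dst, _, _ in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())
    best = {start: (0.0, "")}
    # Visit nodes so that every predecessor is finalized before its successors.
    for node in TopologicalSorter(preds).static_order():
        if node not in best:
            continue  # not reachable from the start node
        score, seq = best[node]
        for src, dst, nts, weight in edges:
            if src == node:
                candidate = (score + weight, seq + nts[0])
                if dst not in best or candidate < best[dst]:
                    best[dst] = candidate
    return best[end]

print(best_path(EDGES, 0, 5))
```

The six edges here stand in for six distinct codons (two for Lys, four for Val), which shows on a small scale how the graph stores rules rather than enumerated sequences.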
A similar construction can be used in order to represent sequence overlaps. Edges in the DAG represent a nucleotide (or degenerate multi-nucleotide) that can be accepted by both sequences, and the length of the edge represents the combination of the scores (e.g. product for probabilities, or sum for negative log likelihoods). This configuration benefits from all of the advantages mentioned above. However, paths through this graph represent sequences that contain overlaps of both parental sequences and scores that represent the joint score of the overlap. To generate this overlap graph representation, two sequence DAGs can be compared using graph search algorithms given an overlap start position in each sequence. This search adds edges that are valid combinations of edges in the parental graph, and trims paths that end at nodes which are not end nodes but have no valid outgoing paths. This removes complexity from the graph. This graph can then be used to generate sequences as described for single sequence graphs and can be modified to reflect different sequence priorities as described there. By generating sequences that reflect design priorities or high scoring sequences, sequences likely to perform well in downstream steps can be efficiently sampled from the combinatorially large (and therefore computationally intractable) set of possible sequences that could be made for a particular overlap.
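The overlap-graph search described above resembles a product construction over two automata. The sketch below is a simplified illustration: it assumes both graphs are read in step from a given pair of start nodes, that edge scores combine by summation (as for negative log-likelihoods), and that a pair of nodes is an end node only when both components are end nodes. The function name `intersect`, the edge tuple layout, and the toy graphs are assumptions made for this example.

```python
from collections import deque

def intersect(edges_a, edges_b, start_a, start_b, ends_a, ends_b):
    """Pair up edges of two DAGs whose nucleotide sets share a symbol.

    Edges are (source, target, nucleotides, score) tuples. Returns the edges
    of the overlap graph after trimming paths that cannot reach a joint end.
    """
    out_a, out_b = {}, {}
    for edge in edges_a:
        out_a.setdefault(edge[0], []).append(edge)
    for edge in edges_b:
        out_b.setdefault(edge[0], []).append(edge)

    edges = []
    seen = {(start_a, start_b)}
    queue = deque([(start_a, start_b)])
    while queue:
        a, b = queue.popleft()
        for _, a2, nts_a, w_a in out_a.get(a, []):
            for _, b2, nts_b, w_b in out_b.get(b, []):
                shared = set(nts_a) & set(nts_b)
                if shared:  # both parent graphs accept at least one nucleotide
                    label = "".join(sorted(shared))
                    edges.append(((a, b), (a2, b2), label, w_a + w_b))
                    if (a2, b2) not in seen:
                        seen.add((a2, b2))
                        queue.append((a2, b2))

    # Trim: keep only edges that lead to a node from which a joint end is reachable.
    alive = {n for n in seen if n[0] in ends_a and n[1] in ends_b}
    changed = True
    while changed:
        changed = False
        for src, dst, _, _ in edges:
            if dst in alive and src not in alive:
                alive.add(src)
                changed = True
    return [e for e in edges if e[1] in alive]

# Graph A accepts AAR or GTN; graph B accepts R, then N, then Y.
EDGES_A = [(0, 1, "A", 0.1), (1, 2, "A", 0.0), (2, 5, "AG", 0.0),
           (0, 3, "G", 0.9), (3, 4, "T", 0.0), (4, 5, "ACGT", 0.0)]
EDGES_B = [(0, 1, "AG", 0.2), (1, 2, "ACGT", 0.0), (2, 3, "CT", 0.3)]

for edge in intersect(EDGES_A, EDGES_B, 0, 0, {5}, {3}):
    print(edge)
```

In this toy case the branch beginning with A is explored but then trimmed, because its third position (A or G) shares no nucleotide with the third position of graph B (C or T). Only the G-T-(C/T) path survives.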
Turning to the graph 150, the DFA graph representation is discussed in more detail. In the first amino acid residue position 152, either an N (Asparagine) or a deletion is viable in that position. In the second amino acid residue position 153, any of an A (Alanine), a V (Valine), or a G (Glycine) is viable in that position. In the third amino acid residue position 154, either a K (Lysine) or a V (Valine) is viable in that position. Within the graph 150, each of the amino acid residue positions 152-154 is represented by nodes (represented by circles) and edges (represented by direction specific lines). In the amino acid residue position 152, a start node 156 denotes the start of the peptide. An Asparagine amino acid is encoded by a sequence AAY (using standard genetic coding), in which the Y denotes a wobble for which either a C or T nucleotide may be present. Thus, the Asparagine amino acid may be encoded either by the sequence A-A-C or by the sequence A-A-T. As a result, the graph representation 150 depicts a first edge 152A associated with a nucleotide A, a second edge 152B associated with a nucleotide A, and a third edge 152C associated with either a nucleotide C or a nucleotide T. Edge 152C could also be represented as a separate edge for each nucleotide in particular instantiations. Each of the edges 152A-C is separated from the others by a node, and each of the edges 152A-C is associated with a corresponding score for the corresponding nucleotide. The nucleotide score may be derived from the score for the relevant amino acid or amino acids. The amino acid residue position 152 also depicts an edge 152D from the start node 156 to a node between the edges 153A and 153B in the next amino acid position 153, indicating that a deletion is a viable option at the first amino acid residue position 152, and has associated with it a corresponding score for the deletion and a G nucleotide representing the first nucleotide of the amino acid position 153.
Edges associated with deletions must still be associated with a symbol for the graph to remain isomorphic to a DFA.
Similarly, in the amino acid residue position 153, the node 158 separates the first and second amino acid residue positions 152 and 153. The amino acids Alanine, Valine, and Glycine are notated, respectively, as GCN, GUN, and GGN, with N denoting a wobble for which any nucleotide may be present. Thus, each potential amino acid at the second residue position 153 has a first edge 153A associated with a G nucleotide, and a third edge 153C associated with any one of an A, C, T, or G, nucleotide. The edge 153C could also be represented as a separate edge for each nucleotide in particular instantiations. A second edge 153B— representing the second nucleotide encoding of the codon—is associated with a C, T, or G, nucleotide, depending on whether the second amino acid residue is Alanine, Valine, or Glycine. The edge 153B could also be represented with a separate edge for each nucleotide in particular instantiations. Each of the edges 153A-C is separated from the others by a node, and each of the edges 153A-C is associated with a corresponding score for the corresponding nucleotide.
Likewise, in the amino acid residue position 154, a node 160 separates the second and third amino acid residue positions 153 and 154. The amino acids Lysine and Valine are notated, respectively, as AAR or GUN, with R denoting a wobble for which either an A or a G nucleotide may be present. A first path from the node 160 to an end node 162 denoting the end of the peptide sequence includes edges 154A-C and represents the codon for the Lysine amino acid, while a second path from the node 160 to the node 162 includes edges 154D-F and represents the codon for the Valine amino acid. The edges 154A-C are associated, respectively, with nucleotides A, A, and either A or G, and respective scores associated with the corresponding nucleotides. Similarly, the edges 154D-F are associated, respectively, with nucleotides G, T, and any one of A, C, G, and T, and respective scores associated with the corresponding nucleotides. Any of the edges 154C and 154F could also be represented as a separate edge for each nucleotide in particular instantiations.
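The set of nucleotide sequences represented by the three residue positions just described can be enumerated directly from the degenerate codons, which gives a sense of how many sequences even a short graph stands for. The sketch below is illustrative only: it writes codons in DNA form (GTN rather than GUN), uses `-` as a hypothetical marker for the deletion option, and ignores scores.

```python
from itertools import product

# Degenerate nucleotide codes used in the description above.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "CT", "R": "AG", "N": "ACGT"}

# Degenerate codons for the amino acids in the example; '-' marks a deletion.
CODONS = {"N": "AAY", "A": "GCN", "V": "GTN", "G": "GGN", "K": "AAR", "-": ""}

def expand(degenerate):
    """List every concrete nucleotide string matching a degenerate string."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in degenerate))]

def sequences(positions):
    """Enumerate all nucleotide sequences for a list of per-position options."""
    options = [[s for aa in pos for s in expand(CODONS[aa])] for pos in positions]
    return ["".join(p) for p in product(*options)]

# Position 1: Asn or deletion; position 2: Ala, Val, or Gly; position 3: Lys or Val.
seqs = sequences(["N-", "AVG", "KV"])
print(len(seqs), seqs[:3])
```

The count is the product of the per-position option counts (3 x 12 x 6), whereas the graph itself needs only a handful of edges per position.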
We will now discuss the NFA representation graph 170 in more detail.
Similarly, in the amino acid residue position 173, the node 178 separates the first and second amino acid residue positions 172 and 173. The amino acids Alanine, Valine, and Glycine are notated, respectively, as GCN, GUN, and GGN, with N denoting a wobble for which any nucleotide may be present. Each potential amino acid at the second residue position 173 has its own first edge 173A, D, or G associated with a G nucleotide, and third edge 173C, F, or I associated with any one of an A, C, T, or G nucleotide. Any of the edges 173C, 173F, or 173I could also be represented as a separate edge for each nucleotide in particular instantiations. A second edge 173B, E, or H, representing the second nucleotide encoding of the codon, is associated with a C, T, or G nucleotide, depending on whether the second amino acid residue is Alanine, Valine, or Glycine. Each of these edges is arranged into paths, one for each amino acid represented. A first path from the node 178 to an end node 180 denoting the end of the codon includes edges 173A-C and represents the codon for the Valine amino acid, while a second path from the node 178 to the end node 180 includes edges 173D-F and represents the codon for the Glycine amino acid. Finally, a third path from the node 178 to the end node 180 includes edges 173G-I and represents the codon for the Alanine amino acid. Each of the edges 173A-I is separated from its neighbors by a node, and each of the edges 173A-I is associated with a corresponding score for the corresponding nucleotide.
Likewise, in the amino acid residue position 174, a node 180 separates the second and third amino acid residue positions 173 and 174. The amino acids Lysine and Valine are notated, respectively, as AAR or GUN, with R denoting a wobble for which either an A or a G nucleotide may be present. A first path from the node 180 to an end node 182 denoting the end of the peptide sequence includes edges 174A-C and represents the codon for the Lysine amino acid, while a second path from the node 180 to the node 182 includes edges 174D-F and represents the codon for the Valine amino acid. The edges 174A-C are associated, respectively, with nucleotides A, A, and either A or G, and respective scores associated with the corresponding nucleotides. Similarly, the edges 174D-F are associated, respectively, with nucleotides G, T, and any one of A, C, G, and T, and respective scores associated with the corresponding nucleotides. Any of the edges 174C and 174F could also be represented as a separate edge for each nucleotide in particular instantiations.
When attempting to compress the sequence data for two or more genetic elements, the computational method 100 (
A library of co-encodings may be created (block 110). The library may be prioritized according to various characteristics such as co-encoding length, total payload size, suitability for the intended purpose, and the like. A set of top co-encodings may be selected according to the prioritization and/or the scores associated with each co-encoding before further optimization is performed. These scores can be adjusted by taking into account non-local interactions. The library of co-encodings may be optimized further (block 112) by adjusting the sequence to maximize positive non-local interactions and minimize negative non-local interactions. Experimental (e.g., in vitro, in vivo, in silico, etc.) testing may be conducted (block 114) on selected and/or optimized candidate co-encodings.
The general concept of the generation of co-encoding sequences will be illustrated further with reference to
C T A C G C A/T/C/G T C T A C G
Once a co-encoding sequence is encoded in a graph, the graph can be used to generate potential overlap sequences by using any of a variety of longest/shortest path algorithms, stochastic algorithms, deterministic algorithms, or a combination, coupled with adjustments of edge weights to up-weight or down-weight paths with specific attributes. These algorithms can be employed to quickly generate a large number of potentially overlapping sequences with associated scores that are maximized for attributes of their sequence, such as highest or lowest scores, similarity to a specific sequence, amino acid preference, or codon usage. Potential overlapping sequences can then be further scored while taking into account interactions between distant positions to account for any non-local effects that are expected from mutagenesis or bioinformatics studies.
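One of the stochastic generation strategies mentioned above can be sketched as a random walk that prefers low-scoring edges. The assumptions here are loud ones: edge scores are treated as negative log-likelihoods, each step is weighted locally by exp(-score / temperature) rather than by an exact distribution over whole paths, the graph has already been trimmed so that every node can reach the end node, and the `temperature` parameter is a hypothetical knob introduced for this example.

```python
import math
import random

# node -> list of (next node, allowed nucleotides, score); lower score is better.
EDGES = {
    0: [(1, "A", 0.1), (3, "G", 0.9)],
    1: [(2, "A", 0.0)],
    2: [(5, "AG", 0.0)],
    3: [(4, "T", 0.0)],
    4: [(5, "ACGT", 0.0)],
}

def sample_path(edges, start, end, rng, temperature=1.0):
    """Walk from start to end, choosing each edge with weight exp(-score / T)."""
    node, seq, total = start, "", 0.0
    while node != end:
        choices = edges[node]
        weights = [math.exp(-score / temperature) for _, _, score in choices]
        nxt, nts, score = rng.choices(choices, weights=weights, k=1)[0]
        seq += rng.choice(nts)  # resolve a degenerate edge uniformly at random
        total += score
        node = nxt
    return seq, total

rng = random.Random(0)  # seeded so that runs are reproducible
samples = [sample_path(EDGES, 0, 5, rng) for _ in range(50)]
print(sorted(set(samples)))
```

Raising the temperature flattens the preference for low-scoring edges and yields a more diverse set of sequences; lowering it concentrates samples near the best-scoring path.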
The computational methods described herein may be implemented in a computer environment. An example computational environment 300 is depicted in
Specifically, the memory 304 may store a variety of data sources 312 corresponding to some or all of the data sources 101 described above with respect to
The memory 304 may also store genetic element data 314. The genetic element data 314 may include various genetic elements that may be selected for analysis and/or compression using the methods described herein. For example, the genetic element data 314 may include protein data 314A for a variety of proteins, such that a user could select two or more of the proteins to determine whether suitable overlapping sequences exist for the selected proteins, which would allow for compression. The protein data 314A may include, for example, for each protein, the possible amino acid residue sequences that make up the protein. At the same time, a set of amino acid data 314B may include, for each of the amino acids the possible nucleic acid residue sequences that code for the particular amino acid.
Of course, each amino acid, protein, or other genetic element may be susceptible to any number of nucleic acid residue substitutions, insertions, or deletions. That is, for a given nucleotide sequence, a substitution, insertion, or deletion may occur at any position, with a potentially known probability and, potentially, a known effect on the overall functionality or suitability of the resulting nucleotide sequence. The data sources 312 may include data directed to the probability of a particular substitution, insertion, or deletion at a specific position, may include data directed to the advantageous or deleterious effects of such a substitution, insertion, or deletion at a specific position, and may provide other data that may be used to develop a score associated with the presence (or absence) of a particular nucleotide at a specific position.
A scoring routine 316 may be stored in the memory and executed by the processor 302 to determine, for a selected genetic element, a score associated with each substitution, insertion, or deletion at each position in the nucleotide or amino acid sequence for the genetic element. The scoring routine may make use of the data sources 312. The scoring routine 316 may store in the memory, the scores associated with each position in the sequence, for each mutation. In embodiments, the type of scoring to be used may be selected by the user, while in other embodiments the type of scoring used may be determined according to the genetic element type (e.g., protein, gene, etc.) or according to the intended use of the co-encoding sequence (e.g., gene editing, etc.).
A graph generation routine 318 may use the genetic element data 314, the data sources 312, and the output of the scoring routine 316 to generate data structures (e.g., FAs or DAGs) representing each of the selected genetic elements. The resulting data structure for each selected genetic element may include the information to generate a representation of every possible sequence of nucleotide residues, along with the scores for each potential substitution, insertion, or deletion at each position. In embodiments, the graph generation routine may ignore potential substitutions, insertions, or deletions having scores that are above (or below) some predefined threshold, such as those that are exceedingly improbable, unsuitable, or undesirable. The resulting data structures may be stored (e.g., in graph storage 320) in the memory 304.
An overlap analysis routine 322 may retrieve data from the graph storage 320 and may analyze graphs for selected genetic elements to determine overlapping sequences between the selected genetic elements. The overlap analysis routine 322 may analyze the graphs for the selected elements by shifting the starting points of each genetic element relative to the other(s) to determine whether there may be overlapable segments. In embodiments, the overlap analysis routine 322 may also analyze the reverse complement of one or more of the selected genetic elements—for example, comparing the reverse complement of a first genetic element relative to a second genetic element. The overlap analysis routine 322 may also generate a new graph (or FA) data structure representing the overlap sequences between the selected genetic elements, and may associate with each edge in the graph or FA an aggregate score representing the combined effects of the corresponding edges in the graphs for the selected genetic elements. The data structure representing the overlap sequences may likewise be stored in the memory 304 (e.g., in the graph storage 320).
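The two operations named above, taking a reverse complement and shifting one element relative to the other, can be illustrated on plain sequences. This is a minimal sketch: it works on strings rather than on graph data structures, handles only the degenerate codes R, Y, and N in addition to A, C, G, and T, and the helper names and the `min_overlap` parameter are assumptions made for the example.

```python
# Complement table for A, C, G, T and the degenerate codes R (A/G), Y (C/T), N.
COMPLEMENT = str.maketrans("ACGTRYN", "TGCAYRN")

def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def offsets(len_a, len_b, min_overlap=1):
    """Yield (offset, overlap length) for every placement of b against a.

    The offset is the position in a at which b starts; negative offsets mean
    that b starts before a. Placements that overlap by fewer than
    min_overlap positions are skipped.
    """
    for off in range(-(len_b - 1), len_a):
        overlap = min(len_a, off + len_b) - max(0, off)
        if overlap >= min_overlap:
            yield off, overlap

print(reverse_complement("AAY"))             # degenerate Asn codon, other strand
print(list(offsets(9, 6, min_overlap=3)))    # placements worth analyzing
```

Requiring a minimum overlap discards placements in which the elements barely touch, which offer little opportunity for compression.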
In embodiments, the overlap analysis routine 322 (or another routine) may also generate, from the overlap data structure, a co-encoding library 324. The overlap analysis routine 322 (or other routine) may traverse the various paths or acceptable states through the overlap data structure to determine nucleotide sequences, exhibiting various levels of overlap, that encode the selected genetic elements. Each of the sequences in the co-encoding library 324 may have associated with it one or more scores. For example, each sequence in the co-encoding library 324 may have associated with it a score for each of the selected genetic elements encoded by the co-encoding sequence, and/or may have associated with it an overall score indicative of the relative suitability of the co-encoding sequence.
An optimization routine 326 may further score and/or optimize the sequences in the co-encoding library 324 using, for example, data from bioinformatics studies or experimental results (e.g., data from the data sources 312) to inform knowledge of higher order interactions between nucleotides at various positions. The best scoring co-encoding sequences may then be selected for in vivo, in vitro, and/or in silico testing.
For each genetic element, a score is associated with each nucleic acid residue, amino acid residue, insertion, or deletion at each position (block 402). The data for determining the scores associated with each nucleic acid residue, insertion, or deletion at each position of the first sequence is taken from bioinformatics and experimental data 403 (which may correspond to the data sources 101 described with respect to
Once the data structures encoding the potential nucleotide sequences for the selected genetic elements have been created, the method 400 includes encoding, in one or more co-encoding data structures and, particularly, one or more DAGs or FAs, overlapping sequences between genetic element data structures such that each co-encoding data structure captures a particular position and orientation of the relevant genetic elements, and all interesting positions and orientations are accounted for (block 405). Each edge in a co-encoding data structure corresponds to a combination of edges in the genetic element data structures that are reached at the same point in the progression through an overlapping path in the overlapped data structures and associated with overlapping sets of nucleotides. Accordingly, the score for each edge of the co-encoding data structure is the aggregation of the scores for the corresponding edges of the overlapping genetic element data structures. Similarly, the associated nucleotide or nucleotides are the intersection of the sets of the associated nucleotide(s) of the overlapped edges in the genetic element data structures. In embodiments, the scores may be updated or adjusted according to the bioinformatics and experimental data 403. Further, because shifting the relative start positions for the genetic element data structures, or analyzing the reverse complement of data structures with respect to various start positions of the other data structures, may result in different sets of overlapping sequences, a number of co-encoding data structures may be generated, with each corresponding to a different relative start position between the genetic element data structures.
As a result, when block 405 is executed it may create multiple new co-encoding data structures, with each new co-encoding data structure corresponding to the overlapping sequences between the genetic element data structures when the start nodes are shifted relative to one another and/or when a different set of genetic element data structures are analyzed as reverse complements. Thus the number of possible co-encoding positions and orientations, and therefore the number of co-encoding data structures, will fall between hard upper and lower bounds. The lower bound is two times the summation of the length of the longest sequence minus the length of the current sequence for each sequence ($2\sum_{i=0}^{n}\left(\max(\mathrm{lengths}) - \mathrm{lengths}_i\right)$ for $n$ sequences with lengths in array lengths). The upper bound is calculated similarly but substituting the sum of all the sequence lengths for the longest length ($2\sum_{i=0}^{n}\left(\sum(\mathrm{lengths}) - \mathrm{lengths}_i\right)$ for $n$ sequences with lengths in array lengths). In practice, however, many of the relative start positions would not be worth analyzing, because the opportunity for compression is not meaningful; for example, when two genetic elements are being overlapped, but the start position of one genetic element is analyzed with respect to only the last few positions of the other genetic element.
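The two bounds stated above can be evaluated directly. The sketch below transcribes the formulas as written, summing once over each sequence in the array; the function name `bounds` and the example lengths are assumptions chosen for illustration.

```python
def bounds(lengths):
    """Return (lower, upper) bounds on the number of co-encoding data structures.

    lower = 2 * sum over sequences of (longest length - this length)
    upper = 2 * sum over sequences of (sum of all lengths - this length)
    """
    longest = max(lengths)
    total = sum(lengths)
    lower = 2 * sum(longest - length for length in lengths)
    upper = 2 * sum(total - length for length in lengths)
    return lower, upper

# Two hypothetical elements of 9 and 6 nucleotides.
print(bounds([9, 6]))
```

For lengths 9 and 6 the lower bound is 2 x ((9 - 9) + (9 - 6)) = 6 and the upper bound is 2 x ((15 - 9) + (15 - 6)) = 30.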
The weights of the co-encoding data structure can then be updated to bias the produced sequences towards particular characteristics, such as GC content, codon usage, or amino acid usage, or towards specific sequences (block 406). Then, a library of overlapping, co-encoding sequences can be created (block 407), for example by using a shortest/longest path algorithm to select the best scoring co-encoding sequence, by using a partially stochastic algorithm to find sequences similar to, but distinct from, the best scoring sequence, or by using a weighted stochastic algorithm to generate random sequences that prefer high scoring paths. Each sequence in the library of co-encoding sequences may be a sequence that co-encodes the entirety or some part of both of the selected genetic elements (e.g., such as that depicted in
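Two of the library-generation strategies named for block 407 may be sketched, under stated assumptions, on a toy co-encoding DAG. The graph `DAG`, the functions `best_path` and `sample_path`, and the `temperature` parameter are all hypothetical; the sketch assumes that higher scores are better, that node identifiers increase along every edge, and that every node can reach the end node. The stochastic variant weights each edge by the exponential of its score, which is only one possible weighting.

```python
import math
import random

# Hypothetical toy co-encoding DAG: node -> [(next_node, nucleotide, score)].
# Node 0 is the start and node 3 is the end.
DAG = {
    0: [(1, "A", 2.0), (1, "G", 1.0)],
    1: [(2, "T", 0.5), (2, "C", 1.5)],
    2: [(3, "G", 1.0)],
    3: [],
}


def best_path(dag, start, end):
    """Return (sequence, score) of the highest-scoring start-to-end path."""
    best = {end: ("", 0.0)}
    # Node ids increase along edges, so descending order is reverse topological.
    for node in sorted(dag, reverse=True):
        if node == end:
            continue
        options = [
            (nuc + best[nxt][0], score + best[nxt][1])
            for nxt, nuc, score in dag[node]
            if nxt in best
        ]
        if options:
            best[node] = max(options, key=lambda option: option[1])
    return best[start]


def sample_path(dag, start, end, rng, temperature=1.0):
    """Draw one sequence, choosing each edge with weight exp(score / temperature)."""
    node, sequence = start, ""
    while node != end:
        edges = dag[node]
        weights = [math.exp(score / temperature) for _, _, score in edges]
        nxt, nuc, _ = rng.choices(edges, weights=weights, k=1)[0]
        sequence += nuc
        node = nxt
    return sequence


top_sequence, top_score = best_path(DAG, 0, 3)
library = {sample_path(DAG, 0, 3, random.Random(seed)) for seed in range(20)}
```

Lowering `temperature` in this sketch concentrates the samples near the best scoring path, while raising it spreads the library across lower scoring alternatives.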
The co-encodings resulting from the application of one or more of the methods described and claimed herein facilitate the delivery of larger payloads by compressing the data for multiple sequences into a single co-encoded sequence that is shorter than the combined length of the individual sequences. As a result, it may be possible for vectors to carry sequences that would otherwise exceed the maximum payload for the vector, in turn potentially facilitating treatment of conditions that would otherwise not be treatable using currently known methods or, at least, facilitating treatment of those conditions with methods that might be easier than those capable of carrying the uncompressed payloads. This may also allow treatments that would previously have required multiple vectors to instead be delivered in a single vector, reducing costs and easing treatment. These benefits are not restricted to applications in medicine, but extend to delivery to plants, fungi, or animals for agricultural purposes and to delivery to microorganisms for biotechnological applications. Further, many delivery vectors and plasmids are easier to synthesize, clone, manufacture and/or otherwise work with when they are smaller, even if their maximum size is not exceeded. All of the above applications in medicine, agriculture, and biotechnology may also be eased through reductions in the sizes of the necessary components even in the absence of direct payload limits.
The following list of aspects reflects a variety of the embodiments explicitly contemplated by the present disclosure. Those of ordinary skill in the art will readily appreciate that the aspects below are neither limiting of the embodiments disclosed herein, nor exhaustive of all of the embodiments conceivable from the disclosure above, but are instead meant to be exemplary in nature.
This application is a national phase application under 35 U.S.C. § 371 of international application PCT/US2021/062573, filed Dec. 9, 2021, which claims the benefit of priority under 35 U.S.C. § 119 to U.S. Provisional Patent Application No. 63/123,884, filed on Dec. 10, 2020, the disclosure of which is hereby incorporated by reference in its entirety.
This invention was made with government support under Grant Number GM127463 awarded by the National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US21/62573 | 12/9/2021 | WO | |

Number | Date | Country |
---|---|---|
63123884 | Dec 2020 | US |