This instant application contains a Sequence Listing which has been submitted electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Apr. 14, 2022, is named 59725703601_SL.txt and is 23,977,365 bytes in size.
Codon rewriting and repurposing translational machinery may be important tools to expand the genetic code artificially and ultimately to custom-design a synthetic genome.
These may also be important tools to enable incorporation of non-canonical amino acids (ncAAs) into proteins. However, approaches for determining codon replacement remain limited, and there is a need for improved approaches for selecting one or more codons for rewriting and replacement.
In some aspects, provided herein is a method comprising: a) analyzing at least a portion of a genome of an organism to identify a first plurality of codons based at least in part on a first local context of a codon-of-interest in the genome of the organism to be rewritten; b) rewriting the first plurality of codons in the genome of the organism to a second codon, wherein the first plurality of codons and the second codon encode a first amino acid, and wherein the rewriting of the first plurality of codons modulates an occurrence of the first plurality of codons; and c) synthesizing a nucleic acid construct comprising the portion of the genome, wherein the first plurality of codons is rewritten to the second codon.
Another aspect of the present disclosure provides a method of producing a polypeptide comprising a non-canonical amino acid (ncAA) or a population of polypeptide molecules comprising the ncAA in an organism, the method comprising: rewriting a first codon encoding a first amino acid to a second codon encoding the first amino acid in a genome of the organism, wherein the rewriting comprises identifying the first codon based at least in part on a first local context of a codon-of-interest in the genome of the organism; reassigning the first codon to encode the ncAA in the genome of the organism; and introducing into the organism an aminoacyl-tRNA synthetase (aaRS)/tRNA pair engineered to recognize the first codon and incorporate the ncAA into an amino acid sequence of the polypeptide or the population of the polypeptide molecules.
Another aspect of the present disclosure provides a method of producing a peptide, the method comprising editing a genome of an organism, wherein the editing comprises revising a codon of the genome to encode a non-canonical amino acid, wherein the peptide comprises the non-canonical amino acid.
Another aspect of the present disclosure provides a cell or a population of cells comprising a genome, wherein a first plurality of codons in the genome of the organism is rewritten to a second codon, wherein the first plurality of codons and the second codon encode a first amino acid, and wherein an occurrence of the first plurality of codons is modulated responsive to being rewritten to the second codon. Another aspect of the present disclosure provides an organism comprising the cell or the population of cells described herein.
Another aspect of the present disclosure provides a computer system for editing a genome of an organism, comprising: a database that is configured to store at least a portion of the genome of the organism; and one or more computer processors operatively coupled to said database, wherein said one or more computer processors are individually or collectively programmed to: a) analyze the at least the portion of the genome of the organism to identify a first plurality of codons in the genome of the organism to be rewritten; and b) rewrite the first plurality of codons in the genome of the organism to a second codon, wherein the first plurality of codons and the second codon encode a first amino acid, and wherein the rewriting of the first plurality of codons modulates an occurrence of the first plurality of codons, thereby editing the genome of the organism.
Another aspect of the present disclosure provides a non-transitory computer-readable storage medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for editing a genome of an organism, the method comprising: a) analyzing at least a portion of the genome of the organism to identify a first plurality of codons in the genome of the organism to be rewritten; and b) rewriting the first plurality of codons in the genome of the organism to a second codon, wherein the first plurality of codons and the second codon encode a first amino acid, and wherein the rewriting of the first plurality of codons modulates an occurrence of the first plurality of codons, thereby editing the genome of the organism.
Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Each patent, publication, and non-patent literature cited in the application is hereby incorporated by reference in its entirety as if each was incorporated by reference individually. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.
The features of the present disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:
In some cases, the computer system comprises a computer processing unit and a sequence processing unit, wherein the computer processing unit and the sequence processing unit are bilaterally communicatively coupled. In some embodiments, the sequence processing unit and the computer processing unit comprise a storage component. 1410: Computer system. 1420: Central processing unit of computer system. 1430: Data storage with files containing the translation tables representing the genetic code of the organism whose genome is being rewritten. 1440: Instructions describing which translation table to use, the codons to be eliminated, and the locations of input and output files. 1450: Computer program implementing the methods to perform the codon rewriting. 1460: Input file, possibly on the same computer system or accessible from a different computer system, providing the sequence of protein-coding regions in the original genome. 1470: Output file, possibly on the same computer system or writeable on a different computer system, with the gene sequences rewritten to eliminate specified codons, and possible additional files with diagnostics, statistical analyses providing context-specific codon usage, and other reports. 1480: The computer system may also be attached to cloud resources for data import and export.
Provided herein are methods for designing a genome of an organism by rewriting one or more codons. In some aspects, methods described herein may comprise replacing one or more codons with another codon encoding the same amino acid. In some aspects, the one or more codons being replaced may be used to encode another amino acid, for example, a non-canonical amino acid (ncAA). Provided herein are methods for reducing or minimizing an occurrence of one or more synonymous codons used to encode an amino acid. Also provided herein are methods for efficient translation of a protein or a portion thereof with one or more ncAAs. The present specification also describes how to identify one or more codons for rewriting and/or replacement.
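By way of non-limiting illustration, the replacement of one or more codons with another codon encoding the same amino acid can be sketched in Python as follows. The sketch assumes the standard genetic code and an in-frame coding sequence given as a DNA string; the function name `rewrite_codons` and the example sequence are hypothetical and are not taken from the present disclosure.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), DNA alphabet, TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame coding sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def rewrite_codons(cds, targets, replacement):
    """Replace every in-frame occurrence of the target codons with a
    synonymous replacement codon (hypothetical helper for illustration)."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    for codon in targets:
        # Refuse any rewrite that would change the encoded amino acid.
        if CODON_TABLE[codon] != CODON_TABLE[replacement]:
            raise ValueError(f"{codon} and {replacement} are not synonymous")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(replacement if c in targets else c for c in codons)


original = "ATGTCATCGTAA"  # Met-Ser-Ser-stop (made-up example sequence)
rewritten = rewrite_codons(original, {"TCA", "TCG"}, "AGC")
print(rewritten)  # ATGAGCAGCTAA
```

In this sketch the encoded protein is unchanged while the occurrence of the target codons in the sequence is reduced to zero, which would leave those codons free for reassignment, for example, to an ncAA.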
As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the content clearly dictates otherwise. It should also be noted that the term “or” is generally employed in its sense including “and/or” unless the content clearly dictates otherwise. The terms “and/or” and “any combination thereof” and their grammatical equivalents as used herein, can be used interchangeably. These terms can convey that any combination is specifically contemplated. Solely for illustrative purposes, the following phrases “A, B, and/or C” or “A, B, C, or any combination thereof” can mean “A individually; B individually; C individually; A and B; B and C; A and C; and A, B, and C.” The term “or” can be used conjunctively or disjunctively, unless the context specifically refers to a disjunctive use.
The term “about” or “approximately” can mean within an acceptable error range for the particular value, which may depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, “about” can mean within 1 or more than 1 standard deviation. Alternatively, “about” can mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, within 5-fold, or within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed.
Throughout this disclosure, numerical features are presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of any embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range to the tenth of the unit of the lower limit unless the context clearly dictates otherwise. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual values within that range, for example, 1.1, 2, 2.3, 5, and 5.9. This applies regardless of the breadth of the range. The upper and lower limits of these intervening ranges may independently be included in the smaller ranges, and are also encompassed within the present disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the present disclosure, unless the context clearly dictates otherwise.
As used in this specification and claim(s), the words “comprising” (and any form of comprising, such as “comprise” and “comprises”), “having” (and any form of having, such as “have” and “has”), “including” (and any form of including, such as “includes” and “include”) or “containing” (and any form of containing, such as “contains” and “contain”) are inclusive or open-ended and do not exclude additional, unrecited elements or method steps. It is contemplated that any embodiment discussed in this specification can be implemented with respect to any method or composition of the present disclosure, and vice versa. Furthermore, compositions of the present disclosure can be used to achieve methods of the present disclosure.
Reference in the specification to “some embodiments,” “an embodiment,” “one embodiment” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the present disclosures. To facilitate an understanding of the present disclosure, a number of terms and phrases are defined below.
Certain specific details of this description are set forth in order to provide a thorough understanding of various embodiments. However, one skilled in the art will understand that the present disclosure may be practiced without these details. In other instances, well-known techniques or methods have not been shown or described in detail to avoid unnecessarily obscuring descriptions of the embodiments. Unless the context requires otherwise, throughout the specification and claims which follow, the word “comprise” and variations thereof, such as, “comprises” and “comprising” are to be construed in an open, inclusive sense, that is, as “including, but not limited to.” Further, headings provided herein are for convenience only and do not interpret the scope or meaning of the claimed disclosure.
The nomenclature used to describe polypeptides or proteins follows the conventional practice wherein the amino group is presented to the left (the amino- or N-terminus) and the carboxyl group to the right (the carboxy- or C-terminus) of each amino acid residue. When amino acid residue positions are referred to in a polypeptide or a protein, they are numbered in an amino to carboxyl direction with position one being the residue located at the amino terminal end of the polypeptide or the protein of which it can be a part. The amino acid sequences of peptides set forth herein are generally designated using the standard single letter or three letter symbol. (A or Ala for Alanine; C or Cys for Cysteine; D or Asp for Aspartic Acid; E or Glu for Glutamic Acid; F or Phe for Phenylalanine; G or Gly for Glycine; H or His for Histidine; I or Ile for Isoleucine; K or Lys for Lysine; L or Leu for Leucine; M or Met for Methionine; N or Asn for Asparagine; P or Pro for Proline; Q or Gln for Glutamine; R or Arg for Arginine; S or Ser for Serine; T or Thr for Threonine; V or Val for Valine; W or Trp for Tryptophan; and Y or Tyr for Tyrosine).
The term “non-canonical amino acid” or “ncAA” refers to any amino acid other than the 20 standard amino acids (alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine). There are over 700 known ncAAs, any of which may be used in the methods described herein. In some embodiments, examples of ncAA include, but are not limited to, L-Tryptazan, 5-Fluoro-L-tryptophan, L-Ethionine, L-Selenomethionine, Trifluoro-L-methionine, L-Norleucine, L-Homopropargylglycine, (2S)-2-amino-5-(methylsulfanyl) pentanoic acid, (2S)-2-amino-6-(methylsulfanyl) hexanoic acid, Para-fluoro-L-phenylalanine, Para-iodo-L-phenylalanine, Para-azido-L-phenylalanine, Para-acetyl-L-phenylalanine, Para-benzoyl-L-phenylalanine, Meta-fluoro-L-tyrosine, O-methyl-L-tyrosine, Para-propargyloxy-L-phenylalanine, (2S)-2-aminooctanoic acid, (2S)-2-aminononanoic acid, (2S)-2-aminodecanoic acid, (2S)-2-aminohept-6-enoic acid, (2S)-2-aminooct-7-enoic acid, L-Homocysteine, (2S)-2-amino-5-sulfanylpentanoic acid, (2S)-2-amino-6-sulfanylhexanoic acid, L-S-(2-nitrobenzyl) cysteine, L-S-ferrocenyl-cysteine, L-O-crotylserine, L-O-(pent-4-en-1-yl)serine, L-O-(4,5-dimethoxy-2-nitrobenzyl)serine, (2S)-2-amino-3-({[5-(dimethylamino)naphthalen-1-yl]sulfonyl}amino)propanoic acid, (2S)-3-[(6-acetyl-naphthalen-1-yl)amino]-2-aminopropanoic acid, L-Pyrrolysine, N6-[(propargyloxy)carbonyl]-L-lysine, L-N6-acetyllysine, N6-trifluoroacetyl-L-lysine, N6-{[1-(6-nitro-1,3-benzodioxol-5-yl)ethoxy]carbonyl}-L-lysine, N6-{[2-(3-methyl-3H-diaziren-3-yl)ethoxy]carbonyl}-L-lysine, p-azidophenylalanine, and 2-aminoisobutyric acid.
In some embodiments, examples of ncAA include, but are not limited to, AbK (unnatural amino acid for Photo-crosslinking probe), 3-Aminotyrosine (unnatural amino acid for inducing red shift in fluorescent proteins and fluorescent protein-based biosensors), L-Azidohomoalanine hydrochloride (unnatural amino acid for bio-orthogonal labeling of newly synthesized proteins), L-Azidonorleucine hydrochloride (unnatural amino acid for bio-orthogonal or fluorescent labeling of newly synthesized proteins), BzF (photoreactive unnatural amino acid; photo-crosslinker), DMNB-caged-Serine (caged serine; excited by visible blue light), HADA (blue fluorescent D-amino acid for labeling peptidoglycans in live bacteria), NADA-green (fluorescent D-amino acid for labeling peptidoglycans in live bacteria), NB-caged Tyrosine hydrochloride (ortho-nitrobenzyl caged L-tyrosine), RADA (orange-red TAMRA-based fluorescent D-amino acid for labeling peptidoglycans in live bacteria), Rf470DL (blue rotor-fluorogenic fluorescent D-amino acid for labeling peptidoglycans in live bacteria), sBADA (green fluorescent D-amino acid for labeling peptidoglycans in bacteria), and YADA (green-yellow lucifer yellow-based fluorescent D-amino acid for labeling peptidoglycans in live bacteria). In some embodiments, examples of ncAA include, but are not limited to, β-alanine, D-alanine, 4-hydroxyproline, desmosine, D-glutamic acid, γ-aminobutyric acid, β-cyanoalanine, norvaline, 4-(E)-butenyl-4(R)-methyl-N-methyl-L-threonine, N-methyl-L-leucine, selenocysteine, and statine. In some embodiments, a ncAA comprises p-azidophenylalanine or 2-aminoisobutyric acid (also known as α-aminoisobutyric acid, AIB, α-methylalanine, or 2-methylalanine).
The terms “codon” and “anticodon” as used herein may refer to DNA or RNA. In some embodiments, DNA comprises nucleotide bases adenine (A), guanine (G), cytosine (C), or thymine (T). In some embodiments, RNA comprises nucleotide bases adenine (A), guanine (G), cytosine (C), or uracil (U). In some embodiments, DNA or RNA may comprise inosine (I). In some embodiments, inosine (I) may pair with adenine (A), cytosine (C), or uracil (U). In some embodiments, DNA or RNA may comprise queuosine (Q). In some embodiments, queuosine (Q) may pair with cytosine (C) or uracil (U).
Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure, suitable methods and materials are described below.
In some aspects, provided herein are methods for selecting a codon for rewriting or replacement. In some embodiments, a codon may be selected based on an analysis of the genetic code. In some embodiments, the analysis may depend on messenger RNA (mRNA) codon recognition by a tRNA anticodon. In some embodiments, ribonucleotides (e.g., A, C, G, U, or I) may be used. In some embodiments, deoxyribonucleotides (e.g., A, C, G, or T) may be used.
In some aspects, a codon may be selected for replacement to minimize wobble. In some embodiments, more than one codon ending in different nucleotides can encode the same amino acid. For example, this may happen because a single transfer RNA (tRNA) anticodon can recognize multiple mRNA codons through wobble. The third nucleotide position of a codon is the wobble position, corresponding to the first nucleotide position of a corresponding anticodon.
For example, the wobble rule may be that an anticodon starting with the nucleotide C (e.g., CXX from 5′ to 3′ direction of an anticodon, wherein X can be any nucleotide) can only recognize the nucleotide G in the third nucleotide position of a corresponding codon (e.g., XXG from 5′ to 3′ direction of a codon, wherein X can be any nucleotide). In some embodiments, an anticodon starting with the nucleotide C may only recognize G in the third nucleotide position of a codon. Thus, in some embodiments, ATG codon may only encode methionine (Met). In some embodiments, UGG codon may only encode tryptophan (Trp). In some embodiments, CUA anticodon may suppress the amber stop codon UAG. In some embodiments, CUA anticodon may not suppress the ochre stop codon UAA.
In some embodiments, an anticodon may start with nucleotide G, and G may be converted to queuosine (Q) that can recognize nucleotide C or U in a codon. In some embodiments, an anticodon may start with nucleotide A, and A may be converted to inosine (I) that can recognize nucleotide A, C, or U in a codon. In some embodiments, an anticodon may start with U and may be modified to recognize nucleotide A or G, or in some cases C or U. Thus, in an embodiment, a codon having G in the wobble position may be used as a target for rewriting.
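By way of non-limiting illustration, the wobble rules described above can be expressed as a lookup from the first anticodon nucleotide to the set of third-position codon nucleotides it may recognize. The following Python sketch assumes only the simplified rules recited above (C reads G; G, optionally as queuosine, reads C or U; A, as inosine, reads A, C, or U; U reads A or G) and ignores the additional cases noted for modified U. The names `WOBBLE` and `codons_read_by` are hypothetical.

```python
# Simplified wobble rules taken from the description above; keys are the
# first (5') anticodon nucleotide, values are the codon third-position
# nucleotides that the anticodon may recognize.
WOBBLE = {
    "C": {"G"},            # C pairs only with G
    "G": {"C", "U"},       # G (optionally modified to queuosine)
    "A": {"A", "C", "U"},  # A modified to inosine
    "U": {"A", "G"},       # U, common case only (an assumption of this sketch)
}

# Watson-Crick complements for the two non-wobble positions.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codons_read_by(anticodon):
    """Return the set of mRNA codons (5'->3') that an RNA anticodon
    (5'->3') may recognize under the simplified wobble rules."""
    first, second, third = anticodon
    # Codon position 1 pairs with anticodon position 3, and codon
    # position 2 pairs with anticodon position 2 (antiparallel pairing).
    prefix = COMPLEMENT[third] + COMPLEMENT[second]
    return {prefix + base for base in WOBBLE[first]}


print(codons_read_by("CUA"))          # {'UAG'}: amber only, not ochre UAA
print(sorted(codons_read_by("UUU")))  # ['AAA', 'AAG']: both lysine codons
```

Under these assumptions, a codon ending in G that is read by a C-starting anticodon has a single dedicated decoder, whereas codons read by U-starting or inosine-containing anticodons share a decoder with their synonyms.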
In some embodiments, an amino acid may be encoded by one codon (e.g., out of 64 possible permutations of codons, having one of 4 different nucleotides at each of 3 different positions). For example, Methionine (Met) can be encoded by a single codon AUG. In some embodiments, an amino acid may be encoded by one or more codons. In some embodiments, an amino acid may be encoded by one or two codons (e.g., out of 64 possible permutations of codons). For example, Lysine (Lys) can be encoded by either of the two codons AAA or AAG. For example, Glutamic acid (Glu) can be encoded by either of the two codons GAA or GAG. In these embodiments, an anticodon starting with U may recognize AAA or GAA, and in addition, AAG or GAG, due to cross-talk (see Table 1). Thus, in some embodiments, a codon encoding an amino acid encoded by one or two codons may not be used for genome rewriting or replacement.
In some embodiments, an amino acid may be encoded by any of one, two, three, four, five, or six codons. For example, arginine (Arg) can be encoded by any of the six codons CGU, CGC, CGA, CGG, AGA, or AGG. For example, serine (Ser) can be encoded by any of the six codons AGU, AGC, UCU, UCC, UCA, or UCG. For example, leucine (Leu) can be encoded by any of the six codons UUA, UUG, CUU, CUC, CUA, or CUG. In some embodiments, a codon of the set of one, two, three, four, five, or six codons that encode the same amino acid may be selected for rewriting or replacement.
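By way of non-limiting illustration, the number of synonymous codons per amino acid can be tabulated directly from the genetic code. The following Python sketch assumes the standard genetic code in RNA letters; the function names and the `min_codons` threshold are hypothetical and merely mirror the embodiments above in which amino acids encoded by only one or two codons are not used for rewriting.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), RNA alphabet, UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sets():
    """Group the sense codons by the amino acid they encode."""
    sets = defaultdict(set)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # skip stop codons
            sets[aa].add(codon)
    return dict(sets)


def rewriting_candidates(min_codons=3):
    """Amino acids encoded by at least `min_codons` synonymous codons
    (hypothetical filter for illustration)."""
    return {aa: sorted(codons)
            for aa, codons in synonymous_sets().items()
            if len(codons) >= min_codons}


for aa, codons in sorted(rewriting_candidates().items()):
    print(aa, len(codons), " ".join(codons))
```

With the default threshold, this sketch excludes methionine and tryptophan (one codon each) and the two-codon amino acids such as lysine and glutamic acid, and retains the six-codon amino acids arginine, serine, and leucine among others.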
Table 2 below shows standard rules for anticodon-codon pairing in a model organism, yeast.
In some embodiments, a class of codons for which a corresponding anticodon is not a part of the tRNA identity element recognized by a corresponding aminoacyl-tRNA synthetase (aaRS) may be considered. In some embodiments, this class of codons comprises, but is not limited to, leucine (Leu), serine (Ser), or alanine (Ala).
In some aspects, provided herein are methods for codon rewriting and replacement that allow high fitness of an organism. In some embodiments, at the amino acid-to-tRNA level, aminoacyl-tRNA synthetase (aaRS) that may not interact with an anticodon for clean codon reassignment downstream may be considered. In some embodiments, yeast genetic code evolution may be considered. In some embodiments, at the codon-to-anticodon level, codon removal may allow for deletion of all tRNAs used for decoding. In some embodiments, deletion of tRNAs may not disable decoding of synonymous codons through wobble. In some embodiments, no remaining natural tRNAs can decode rewritten, replaced, or eliminated codon(s), if reinserted.
In some embodiments, methods for codon rewriting and/or replacement disclosed herein can use a context-sensitive design (e.g., learned from a host organism) for unbiased discovery of problematic motifs based on positive evolutionary selection and/or negative evolutionary selection. In some embodiments, each codon may be considered in the local context (e.g., based on the codons on either side of a given codon of interest), and codons may be selected for re-writing at least in part by normalizing for the observed frequency of the codon in the context of its surrounding codons relative to the null hypothesis of overall relative synonymous codon usage.
In some embodiments, genes such as Saccharomyces cerevisiae genes can be examined for context-sensitive codon usage. In some embodiments, S. cerevisiae genes may have statistically significant evolutionary signals, such as negative selection leading to predictable de-enriched sequences, such as “slippery sites” (e.g., homopolymer runs), and/or positive selection for functional regulatory motifs, such as Rap1 binding sites. In some embodiments, methods for selecting a replacement codon may comprise a statistical optimization or outlier avoidance approach (e.g., a “Goldilocks” approach) to avoid selection of a replacement codon with a positive evolutionary signal (e.g., a codon that is too “hot” having a usage that is significantly higher than the overall RSCU for that given codon) or a negative evolutionary signal (e.g., a codon that is too “cold” having a usage that is significantly lower than the overall RSCU for that given codon), and instead to select a replacement codon based at least in part on consideration of the codon's local context (e.g., by considering replacement codons whose relative synonymous usage in the given context most closely matches its relative synonymous usage overall). In some embodiments, such selection of replacement codons may comprise determining context-sensitive relative synonymous codon usage (RSCU) value for each of a plurality of codons (e.g., representing a local context of a given codon of interest), and identifying a codon from among the plurality of codons having a maximum or largest RSCU value. For example, the plurality of codons may comprise a codon of interest, a second codon that is upstream of the codon of interest, and a third codon that is downstream of the codon of interest. 
For example, the plurality of codons may comprise a set of at least three consecutive codons: a codon of interest, a second codon that is upstream of and adjacent to the codon of interest, and a third codon that is downstream of and adjacent to the codon of interest. For example, the maximal RSCU value may be at least about 0.01, at least about 0.05, at least about 0.10, at least about 0.11, at least about 0.12, at least about 0.13, at least about 0.14, at least about 0.15, at least about 0.16, at least about 0.17, at least about 0.18, at least about 0.19, at least about 0.20, at least about 0.21, at least about 0.22, at least about 0.23, at least about 0.24, at least about 0.25, at least about 0.26, at least about 0.27, at least about 0.28, at least about 0.29, at least about 0.30, at least about 0.31, at least about 0.32, at least about 0.33, at least about 0.34, at least about 0.35, at least about 0.36, at least about 0.37, at least about 0.38, at least about 0.39, at least about 0.40, at least about 0.41, at least about 0.42, at least about 0.43, at least about 0.44, at least about 0.45, at least about 0.46, at least about 0.47, at least about 0.48, at least about 0.49, at least about 0.50, at least about 0.51, at least about 0.52, at least about 0.53, at least about 0.54, at least about 0.55, at least about 0.56, at least about 0.57, at least about 0.58, at least about 0.59, at least about 0.60, at least about 0.61, at least about 0.62, at least about 0.63, at least about 0.64, at least about 0.65, at least about 0.66, at least about 0.67, at least about 0.68, at least about 0.69, at least about 0.70, at least about 0.71, at least about 0.72, at least about 0.73, at least about 0.74, at least about 0.75, at least about 0.76, at least about 0.77, at least about 0.78, at least about 0.79, at least about 0.80, at least about 0.81, at least about 0.82, at least about 0.83, at least about 0.84, at least about 0.85, at least about 0.86, at least about 0.87, at least about 0.88, at least about 0.89, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or about 1.00. This approach may advantageously select the replacement codon having the maximum context-sensitive codon usage. In some embodiments, motifs identified as associated with positive evolutionary signals or negative evolutionary signals that include codons that are to be replaced by a rewriting design may be highlighted as requiring greater scrutiny to avoid introducing fitness defects by rewriting. In such embodiments, a replacement codon that shares the same evolutionary signal as the rewritten codon may be used. In some embodiments, rewriting designs may be selected to minimize the number of evolutionary motifs affected. In some embodiments, nonsynonymous codons may be introduced instead of introducing a motif with an evolutionary signal through replacement with a synonymous codon.
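By way of non-limiting illustration, one possible way to compute a context-sensitive synonymous codon usage and to select the replacement codon having the largest value is sketched below in Python. The sketch makes several assumptions that are not mandated by the present disclosure: the local context is taken to be the immediately adjacent upstream and downstream codons, usage is expressed as a fraction among synonymous codons (so that values fall between 0 and 1), and ties are broken by overall usage. All function names and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), DNA alphabet, TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonyms(codon):
    """All codons encoding the same amino acid as `codon` (inclusive)."""
    aa = CODON_TABLE[codon]
    return [c for c, a in CODON_TABLE.items() if a == aa]


def count_usage(cds_list):
    """Count codons overall and within (upstream, codon, downstream) triplets."""
    overall, in_context = Counter(), Counter()
    for cds in cds_list:
        codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
        overall.update(codons)
        in_context.update(zip(codons, codons[1:], codons[2:]))
    return overall, in_context


def fraction(counts, codon):
    """Usage of `codon` as a fraction of all of its synonymous codons."""
    total = sum(counts.get(c, 0) for c in synonyms(codon))
    return counts.get(codon, 0) / total if total else 0.0


def pick_replacement(up, target, down, overall, in_context):
    """Pick the synonymous codon with the largest usage in this context."""
    # Restrict the context counts to synonyms seen between `up` and `down`.
    ctx = {c: in_context[(up, c, down)] for c in synonyms(target)}
    candidates = [c for c in synonyms(target) if c != target]
    return max(candidates,
               key=lambda c: (fraction(ctx, c), fraction(overall, c)))


genes = ["ATGGCTTCTAAAGCTTCTAAAGCTTCCAAATAA"]  # toy coding sequence
overall, in_context = count_usage(genes)
print(pick_replacement("GCT", "TCA", "AAA", overall, in_context))  # TCT
```

In the toy sequence, serine between GCT and AAA is encoded twice by TCT and once by TCC, so TCT has the largest context-sensitive usage (2/3) and is selected as the replacement for TCA. An outlier-avoidance variant could instead rank candidates by how closely the context-specific fraction matches the overall fraction.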
In some embodiments, codon and/or genome rewriting may comprise a risk. In some embodiments, the risk may comprise translational frameshifts (
In some embodiments, the risk may be related to orthogonal translation system. In some embodiments, the risk may comprise low uptake of ncAA from media into an organism (e.g., yeast), low expression levels of aaRS, or mislocalization of aaRS. In some embodiments, the risk may comprise inefficient interaction between an ncAA and the corresponding aaRS, inefficient acylation of a tRNA, or suboptimal ribosome interaction of tRNA or codon (
In some embodiments, each aaRS may recognize all of the tRNAs for an amino acid for amino acid targeting. In some embodiments, recognition may involve the amino acid and, depending on the aaRS, regions of the tRNA, for example, the attachment region, variable loops and stems, and/or the anticodon loop. In some embodiments, the anticodon loop recognition may pose an issue for a method disclosed herein. For example, if an anticodon that is part of aaRS recognition is used, then the native aaRS may still recognize the anticodon and give a mixture of canonical and non-canonical amino acid incorporation. Serine, leucine, and alanine are special in this regard, as the corresponding aaRS generally does not recognize the anticodon. In some embodiments, it may be because serine and leucine have 6 codon blocks, which can provide more diversity in the anticodon. In some embodiments, it may be because in yeast, a part of the anticodon loop is recognized for leucine.
In some aspects, the genetic code may have variations depending on organism. This may be because of evolutionary reassignment of codons (see Table 3). For example, leucine codons are captured by serine in Candida (e.g., CTG). For example, leucine codons are captured by alanine in a fungal clade including Pachysolen. In another example, arginine codons have been lost in yeast mitochondria. In another example, serine-aaRS does not recognize serine anticodon.
In some embodiments, stop codons deleted for codon reassignment/replacement may be captured by nearby amino acids (eRF1 in ciliates evolved for UGA vs UAA/UAG recognition). In some embodiments, alanine is not captured by evolution. In some embodiments, alanine's 4-codon block (i.e., there are 4 synonymous codons encoding alanine) in yeast is covered by two larger tRNA families, so it may be difficult to completely eliminate one of the families. In some embodiments, tRNA-aaRS interaction with amino acid works by excluding large sidechains.
In some embodiments, the following codons may be removed for rewriting and/or replacement.
In some embodiments, a host genome may be divided into multiple regions for codon replacement design. In some embodiments, a host genome may be divided into at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or at least 50 regions for codon design. In some embodiments, a host genome may be divided into approximately 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or approximately 50 regions for codon design. In some embodiments, a host genome may be divided into 5 regions for codon design.
In some embodiments, each region may be at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or at least about 50 kilobases (kb). In some embodiments, each region may be approximately 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or approximately 50 kb. In some embodiments, each region may have at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or at least 50 designs. In some embodiments, each region may have approximately 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or approximately 50 designs.
In some embodiments, the total region of codon removal design may comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, or at least 1000 kb. In some embodiments, the total region of codon removal design may comprise approximately 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, or approximately 1000 kb.
In some embodiments, each region may have at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or at least 50 codons removed. In some embodiments, each region may have approximately 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or approximately 50 codons removed. In some embodiments, each region may have 2 codons removed (e.g., “Individual” design). In some embodiments, the “Individual” design may comprise removing one or more codons encoding leucine, arginine, or serine. In some embodiments, each region may have 3 codons removed (e.g., “Paired” design). In some embodiments, the “Paired” design may comprise removing one or more codons encoding leucine/arginine, leucine/serine, or arginine/serine. In some embodiments, each region may have 6 codons removed (e.g., “All” design). In some embodiments, the “All” design may comprise removing one or more codons encoding leucine, arginine, and serine.
In some embodiments, the total number of codons removed, rewritten, or replaced may comprise at least 1, 10, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or at least 1000 codons. In some embodiments, the total number of codons removed, rewritten, or replaced may comprise approximately 1, 10, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or approximately 1000 codons. In some embodiments, the total number of codons removed, rewritten, or replaced may comprise at least 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 20K, 30K, 40K, 50K, 60K, 70K, 80K, 90K, 100K, 110K, 120K, 130K, 140K, 150K, 160K, 170K, 180K, 190K, 200K, 250K, 300K, 350K, 400K, 450K, 500K, 550K, 600K, 650K, 700K, 750K, 800K, 850K, 900K, 950K, or at least 1000K codons. In some embodiments, the total number of codons removed, rewritten, or replaced may comprise approximately 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 20K, 30K, 40K, 50K, 60K, 70K, 80K, 90K, 100K, 110K, 120K, 130K, 140K, 150K, 160K, 170K, 180K, 190K, 200K, 250K, 300K, 350K, 400K, 450K, 500K, 550K, 600K, 650K, 700K, 750K, 800K, 850K, 900K, 950K, or approximately 1000K codons.
In some aspects, provided herein are methods for synonymous codon rewriting and design rules for synonymous codon rewriting and observed bug rate. A bug or bugs, as used here, may refer to unanticipated fitness defect(s) caused by a designed DNA sequence. In some embodiments, a bug may also be referred to as a risk. Methods for synonymous codon rewriting may follow design rules that provide technical improvements in decreasing or minimizing a bug rate (e.g., by avoiding the selection of codons for use in rewriting that may introduce unanticipated fitness defects in the designed DNA sequence). In some embodiments, methods disclosed herein may comprise utilizing encoded watermarks (e.g., PCRTags or any other DNA barcodes) in the genome. For example, watermarks may be encoded in non-protein-coding regions. In some embodiments, watermarks may be encoded in ORFs. In some embodiments, methods described herein may synonymously rewrite 1 out of approximately every 20 codons globally. In some embodiments, methods disclosed herein may comprise performing a PCRTag algorithm. In some embodiments, the PCRTag algorithm may specify a “most-different” design. In some embodiments, the “most-different” design may ignore the relative synonymous codon usage (RSCU), codon adaptation, or translation efficiency matching to maximize base pair changes. In some embodiments, the “most-different” design may yield about 1 bug per 10K codons removed, rewritten, or replaced. In some embodiments, the “most-different” design may yield about 3 bugs per 20K codons removed, rewritten, or replaced (details described in Richardson, et al., Science (2017) 355, 1040-1044, which is incorporated by reference herein in its entirety). In some embodiments, methods disclosed herein may decrease the number of bugs. In some embodiments, methods disclosed herein may eliminate one or more bugs. In some embodiments, methods disclosed herein may avoid a bug or a risk.
In some embodiments, the risk may comprise a known regulatory site in ORFs that can impede transcription. In some embodiments, the known regulatory site may comprise a binding site of Repressor Activator Protein 1 (Rap1p, an essential DNA-binding transcription regulator) in ORFs. Details are described in Yarrington, et al. Genetics (2012) 190(2):523-35 and Wu, et al., Science (2017) 355, 1048, each of which is incorporated by reference herein in its entirety. In some embodiments, a Rap1p binding site consensus sequence may comprise ACACCCRYACAYM (SEQ ID NO: 11,813), wherein R may be G or A, Y may be C or T, and M may be A or C.
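As an illustration of how a design could be screened for such a site, the following minimal sketch expands the degenerate consensus into a regular expression and reports match positions. The function names `consensus_to_regex` and `find_sites` are hypothetical, and the sketch scans only the strand it is given; screening the reverse complement would be a separate pass.

```python
import re

# IUPAC codes for the degenerate positions, as defined in the text above.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[GA]", "Y": "[CT]", "M": "[AC]"}

RAP1_CONSENSUS = "ACACCCRYACAYM"


def consensus_to_regex(consensus):
    """Expand a degenerate consensus into a regex; the lookahead allows overlapping hits."""
    return re.compile("(?=" + "".join(IUPAC[base] for base in consensus.upper()) + ")")


def find_sites(sequence, consensus=RAP1_CONSENSUS):
    """Return 0-based start positions of consensus matches on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

A rewritten ORF for which `find_sites` returns a non-empty list could then be flagged for review before synthesis.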
In some aspects, provided herein are methods for codon rewriting and/or replacement. In some embodiments, methods described herein may comprise rewriting and/or replacing a codon while retaining GC content. In some embodiments, a nucleotide in the wobble position of a codon (third position of a codon) is changed in a way that retains GC content. For example, a codon ending in G or A in a 4-codon block may be changed to C or T, respectively, to retain GC content. In some embodiments, these changes may also replace codons with other codons having the same frequency. Alternatively, in some embodiments, methods for codon rewriting and/or replacing described herein may comprise changing one or more codons encoding an amino acid to the most frequently used codon for that specific amino acid in the genome. For example, one or more synonymous codons can be replaced with a synonymous codon with the highest number of occurrences for that specific amino acid in the genome. In some embodiments, methods that have the smallest effect on tRNA pools may be used.
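The GC-retaining wobble change described above can be sketched as follows. This is a minimal illustration under the standard genetic code, in which the eight listed two-base prefixes each define a 4-codon block; the helper name `gc_retaining_swap` is hypothetical.

```python
# Wobble-position substitutions that keep GC content unchanged (G->C, A->T),
# following the example in the text.
WOBBLE_SWAP = {"G": "C", "A": "T"}

# First two bases of the 4-codon blocks in the standard genetic code
# (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN, Ala GCN, Arg CGN, Gly GGN).
FOUR_CODON_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def gc_retaining_swap(codon):
    """Return a synonymous codon with the same GC content, or the codon unchanged."""
    codon = codon.upper()
    prefix, wobble = codon[:2], codon[2]
    if prefix in FOUR_CODON_PREFIXES and wobble in WOBBLE_SWAP:
        return prefix + WOBBLE_SWAP[wobble]
    return codon
```

Codons outside a 4-codon block, and codons already ending in C or T, are returned unchanged in this sketch.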
Many synonymous codon rewriting methods are based on matching single-codon properties such as, for example, relative synonymous codon usage (RSCU) over all genes, codon adaptation index (CAI) over highly-expressed or stress-response genes, and translational efficiency (TE) incorporating tRNA pool. Some methods optimize over 2-codon windows or mRNA secondary structure using a hidden Markov model (HMM). Another new approach for codon rewriting and/or replacement is a Goldilocks method which utilizes machine learning analysis (e.g., statistical analysis) of a host genome.
The present disclosure provides computer systems that are programmed to implement methods of the disclosure.
The computer system 1410 can regulate various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, analyzing at least a portion of the genome of the organism to identify a first plurality of codons in the genome of the organism to be rewritten, rewriting the first plurality of codons in the genome of the organism to a second codon, and analyzing a local context of a codon-of-interest in the genome of the organism. The computer system 1410 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
The computer system 1410 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1420, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 1410 also includes memory or memory location 1440 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1430 (e.g., hard disk), communication interface 1420 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1450, such as cache, other memory, data storage and/or electronic display adapters. The memory 1440, storage unit 1430, interface 1420 and peripheral devices 1450 are in communication with the CPU 1420 through a communication bus (solid lines), such as a motherboard. The storage unit 1430 can be a data storage unit (or data repository) for storing data. The computer system 1410 can be operatively coupled to a computer network (“network”) 1480 with the aid of the communication interface 1420. The network 1480 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
The network 1480 in some cases is a telecommunication and/or data network. The network 1480 can include one or more computer servers, which can enable distributed computing, such as cloud computing. For example, one or more computer servers may enable cloud computing over the network 1480 (“the cloud”) to perform various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, analyzing at least a portion of the genome of the organism to identify a first plurality of codons in the genome of the organism to be rewritten, rewriting the first plurality of codons in the genome of the organism to a second codon, and analyzing a local context of a codon-of-interest in the genome of the organism. Such cloud computing may be provided by cloud computing platforms such as, for example, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and IBM cloud. The network 1480, in some cases with the aid of the computer system 1410, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1410 to behave as a client or a server.
The CPU 1420 may comprise one or more computer processors and/or one or more graphics processing units (GPUs). The CPU 1420 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1440. The instructions can be directed to the CPU 1420, which can subsequently program or otherwise configure the CPU 1420 to implement methods of the present disclosure. Examples of operations performed by the CPU 1420 can include fetch, decode, execute, and writeback.
The CPU 1420 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1410 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
The storage unit 1430 can store files, such as drivers, libraries and saved programs. The storage unit 1430 can store user data, e.g., user preferences and user programs. The computer system 1410 in some cases can include one or more additional data storage units that are external to the computer system 1410, such as located on a remote server that is in communication with the computer system 1410 through an intranet or the Internet.
The computer system 1410 can communicate with one or more remote computer systems through the network 1480. For instance, the computer system 1410 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 1410 via the network 1480.
Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1410, such as, for example, on the memory 1440 or electronic storage unit 1430. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 1420. In some cases, the code can be retrieved from the storage unit 1430 and stored on the memory 1440 for ready access by the processor 1420. In some situations, the electronic storage unit 1430 can be precluded, and machine-executable instructions are stored on memory 1440.
The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
Aspects of the systems and methods provided herein, such as the computer system 1410, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
The computer system 1410 can include or be in communication with an electronic display 1460 that comprises a user interface (UI) 1470 for providing, for example, a visual display indicative of training and testing of a trained algorithm. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1420. The algorithm can, for example, analyze at least a portion of the genome of the organism to identify a first plurality of codons in the genome of the organism to be rewritten, rewrite the first plurality of codons in the genome of the organism to a second codon, and analyze a local context of a codon-of-interest in the genome of the organism.
In some embodiments, the computer system may be a machine learning-based computer system comprising a computer processing unit communicatively coupled to a sequence processing unit via a first controller and to a storage unit via a second controller. In some embodiments, the machine learning-based computer system optionally comprises a sequence analyzer that sequences at least a portion of a genome of an organism (e.g., at least in part by assaying nucleic acid molecules obtained or derived from the organism to determine genetic sequences of the at least the portion of the genome of the organism). In some embodiments, the sequence processing unit comprises a storage component that retains genome sequence data generated by the sequence processing unit. The sequence processing unit may receive input data from the computer processing unit. For example, the input data may comprise translation tables obtained from the National Center for Biotechnology Information (NCBI), a sequence read of at least a portion of a genome of an organism contained in a sample, or a combination thereof. In some embodiments, the at least the portion of the genome comprises a nucleus-derived DNA. In some embodiments, the at least the portion of the genome comprises protein-coding genes. In some embodiments, mitochondrial genes, transposable element genes, pseudogenes, and blocked reading frames are excluded from the method disclosed herein. The sequence processing unit determines the codon count for each of a plurality of codons in the genome (e.g., including stop codons). In some embodiments, a translation table is used to map codons to amino acids. In some embodiments, the sequence processing unit determines an RSCU for each codon (e.g., as the number of counts for the codon divided by the number of counts for all codons for the same amino acid).
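The codon counting and RSCU calculation described above can be sketched as follows. This is a minimal illustration assuming the standard genetic code (NCBI translation table 1) and in-frame coding sequences; the function names `count_codons` and `rscu` are hypothetical, and RSCU follows the definition given in this paragraph (codon count divided by the count of all codons for the same amino acid).

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(orfs):
    """Count every in-frame codon (stop codons included) across coding sequences."""
    counts = Counter()
    for orf in orfs:
        seq = orf.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1
    return counts


def rscu(counts):
    """RSCU per codon: its count divided by the count of all codons for its amino acid."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}
```

For a different organism, `AMINO` would be swapped for the string of the applicable translation table.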
In some embodiments, the sequence processing unit determines the frequency of 9-mers in coding domains of a genome of an organism. In some embodiments, the 9-mers are converted to contexts. Contexts, as disclosed herein, may comprise a codon-amino acid-codon pattern.
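One possible way to tabulate contexts is sketched below, under the assumption that a context is keyed by the upstream codon, the amino acid encoded by the central codon, and the downstream codon, with the observed central codons tallied under each key. The function name `context_counts` is hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def context_counts(orfs):
    """Map (upstream codon, amino acid, downstream codon) -> Counter of central codons."""
    table = defaultdict(Counter)
    for orf in orfs:
        seq = orf.upper()
        codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
        # Slide a 3-codon (9-mer) window along the reading frame.
        for up, mid, down in zip(codons, codons[1:], codons[2:]):
            if all(c in CODON_TABLE for c in (up, mid, down)):
                table[(up, CODON_TABLE[mid], down)][mid] += 1
    return table
```

Each entry of the returned table then gives the context-specific usage of the synonymous codons for one amino acid.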
In some embodiments, the sequence processing unit comprises an algorithm that determines a value for each coding sequence by identifying positions of one or more codons to eliminate; analyzing each codon, in turn; and rewriting the codon with the most frequently used codon as the central codon in a 3-codon (9-mer) context. In some embodiments, the first codon is unique because there is no preceding context. In standard genetic codes, however, the first codon is always ATG. In some cases, the last codon (e.g., stop codon) has no following context. In some embodiments, if stop codons are rewritten, a favored design comprises changing TAA and TAG to TGA, which leaves TGA as the single choice of stop codon. Alternatively, in some embodiments, a 6-nt (6-nucleotide) context or 9-nt (9-nucleotide) context with the stop codon as the final 3 nt may be used.
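The rewriting step above can be sketched as follows. This is a simplified illustration, not the disclosed implementation: it assumes a precomputed context table keyed by (upstream codon, amino acid, downstream codon), uses the original neighboring codons as the context even when a neighbor is itself rewritten, and leaves the first and last codons untouched. The name `rewrite_orf` and the table layout are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rewrite_orf(orf, targets, context_table):
    """Replace each target codon with the most used non-target synonym in its context."""
    seq = orf.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    out = list(codons)
    # The first and last codons lack a full 3-codon context and are skipped here.
    for i in range(1, len(codons) - 1):
        if codons[i] not in targets:
            continue
        key = (codons[i - 1], CODON_TABLE[codons[i]], codons[i + 1])
        candidates = {c: n for c, n in context_table.get(key, {}).items()
                      if c not in targets}
        if candidates:  # leave the codon unchanged if the context was never observed
            out[i] = max(candidates, key=candidates.get)
    return "".join(out) + seq[len(codons) * 3:]
```

Codons left unchanged for lack of an observed context would need a fallback rule, for example the genome-wide most frequent synonym.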
In some embodiments, the sequence processing unit performs dynamic programming for treatment of neighboring codons. In some embodiments, the sequence processing unit uses a different codon selection criterion, such as maintaining GC content, codon adaptation index, or translational efficiency, as the main codon replacement rule. In some embodiments, the sequence processing unit employs a Goldilocks codon with the greatest fold-enrichment, rather than a Goldilocks codon that is most often used, in the context. In some embodiments, the sequence processing unit uses random codons selected using the Goldilocks context-dependent probabilities as the probability distribution.
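Two of the alternative selection rules above can be sketched as follows. The sketch assumes that fold-enrichment means a codon's usage fraction within a context divided by its overall RSCU, which is an interpretation rather than a definition taken from the text; the function names are hypothetical.

```python
import random


def context_fractions(context_counts):
    """Convert per-context codon counts into usage fractions that sum to 1."""
    total = sum(context_counts.values())
    return {codon: n / total for codon, n in context_counts.items()}


def most_enriched(context_counts, overall_rscu):
    """Pick the codon with the greatest fold-enrichment (context fraction / overall RSCU)."""
    fractions = context_fractions(context_counts)
    return max(fractions, key=lambda codon: fractions[codon] / overall_rscu[codon])


def sample_codon(context_counts, rng):
    """Draw a codon at random, weighted by its usage in this context."""
    codons = sorted(context_counts)
    weights = [context_counts[codon] for codon in codons]
    return rng.choices(codons, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` keeps a sampled design reproducible.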
In some embodiments, the final codon is a stop codon and a special case. Most designs may have a single choice for the stop codon, TGA, or a pair of choices, TGA and TAA. For the stop codon, a 9-mer pattern or a 5-mer pattern ending with the stop codon may be used instead of the 9-mer pattern with the codon of interest in the middle position. Some example embodiments avoid significantly enriched codons as possible regulatory signals (e.g., too hot), thereby choosing codons whose usage matches the overall RSCU. Some example embodiments avoid codons that are used significantly less (e.g., too cold), thereby choosing codons whose usage matches the overall RSCU. Some example embodiments may consider the RSCU value for the specific codon. In some embodiments, a codon with an RSCU value of at least about 0.01, at least about 0.05, at least about 0.10, at least about 0.11, at least about 0.12, at least about 0.13, at least about 0.14, at least about 0.15, at least about 0.16, at least about 0.17, at least about 0.18, at least about 0.19, at least about 0.20, at least about 0.21, at least about 0.22, at least about 0.23, at least about 0.24, at least about 0.25, at least about 0.26, at least about 0.27, at least about 0.28, at least about 0.29, at least about 0.30, at least about 0.31, at least about 0.32, at least about 0.33, at least about 0.34, at least about 0.35, at least about 0.36, at least about 0.37, at least about 0.38, at least about 0.39, at least about 0.40, at least about 0.41, at least about 0.42, at least about 0.43, at least about 0.44, at least about 0.45, at least about 0.46, at least about 0.47, at least about 0.48, at least about 0.49, at least about 0.50, at least about 0.51, at least about 0.52, at least about 0.53, at least about 0.54, at least about 0.55, at least about 0.56, at least about 0.57, at least about 0.58, at least about 0.59, at least about 0.60, at least about 0.61, at least about 0.62, at least about 0.63, at least about 0.64, at least about 0.65, at least about 0.66, at least about 0.67, at least about 0.68, at least about 0.69, at least about 0.70, at least about 0.71, at least about 0.72, at least about 0.73, at least about 0.74, at least about 0.75, at least about 0.76, at least about 0.77, at least about 0.78, at least about 0.79, at least about 0.80, at least about 0.81, at least about 0.82, at least about 0.83, at least about 0.84, at least about 0.85, at least about 0.86, at least about 0.87, at least about 0.88, at least about 0.89, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or about 1.00 may be selected. In some embodiments, a codon with the highest RSCU value for a local context may be selected.
Codons are under evolutionary selection pressure such as positive selection or negative selection. For example, positive selection can include, but is not limited to, within-ORF regulatory elements. For example, negative selection can include, but is not limited to, frameshifts, ribosome stalls, and secondary structure interfering with transcription/translation. Codon choice can depend on context of surrounding codons.
For example, a Goldilocks method may be performed based on a principle that 1) most open reading frame (ORF) regions are not regulatory, 2) a replacement codon that is not too “hot” (e.g., a codon with usage that is significantly higher than the overall RSCU for that specific codon; positive selection) and not too “cold” (e.g., a codon with usage that is significantly lower than the overall RSCU for that specific codon; negative selection) is chosen, and 3) a replacement codon depends on context of upstream and downstream codons. In some embodiments, a replacement codon that is “too hot” may comprise a codon that may have been evolutionarily positively selected.
In some embodiments, methods for selecting a replacement codon may comprise an optimization or outlier avoidance approach (e.g., a “Goldilocks” approach) to avoid selection of a replacement codon with a positive evolutionary signal (e.g., a codon that is too “hot” having a usage that is significantly higher than the overall RSCU for that given codon) or a negative evolutionary signal (e.g., a codon that is too “cold” having a usage that is significantly lower than the overall RSCU for that given codon), and instead to select a replacement codon based at least in part on consideration of the codon's local context (e.g., by considering replacement codons whose relative synonymous usage in the given context most closely matches their relative synonymous usage overall). In some embodiments, such selection of replacement codons may comprise determining a context-sensitive relative synonymous codon usage (RSCU) value for each of a plurality of codons (e.g., representing a local context of a given codon of interest), and identifying a codon from among the plurality of codons having a maximum or largest RSCU value. For example, the plurality of codons may comprise a codon of interest, a second codon that is upstream of the codon of interest, and a third codon that is downstream of the codon of interest. For example, the plurality of codons may comprise a set of at least three consecutive codons: a codon of interest, a second codon that is upstream of and adjacent to the codon of interest, and a third codon that is downstream of and adjacent to the codon of interest.
For example, the maximal RSCU value may be at least about 0.01, at least about 0.05, at least about 0.10, at least about 0.11, at least about 0.12, at least about 0.13, at least about 0.14, at least about 0.15, at least about 0.16, at least about 0.17, at least about 0.18, at least about 0.19, at least about 0.20, at least about 0.21, at least about 0.22, at least about 0.23, at least about 0.24, at least about 0.25, at least about 0.26, at least about 0.27, at least about 0.28, at least about 0.29, at least about 0.30, at least about 0.31, at least about 0.32, at least about 0.33, at least about 0.34, at least about 0.35, at least about 0.36, at least about 0.37, at least about 0.38, at least about 0.39, at least about 0.40, at least about 0.41, at least about 0.42, at least about 0.43, at least about 0.44, at least about 0.45, at least about 0.46, at least about 0.47, at least about 0.48, at least about 0.49, at least about 0.50, at least about 0.51, at least about 0.52, at least about 0.53, at least about 0.54, at least about 0.55, at least about 0.56, at least about 0.57, at least about 0.58, at least about 0.59, at least about 0.60, at least about 0.61, at least about 0.62, at least about 0.63, at least about 0.64, at least about 0.65, at least about 0.66, at least about 0.67, at least about 0.68, at least about 0.69, at least about 0.70, at least about 0.71, at least about 0.72, at least about 0.73, at least about 0.74, at least about 0.75, at least about 0.76, at least about 0.77, at least about 0.78, at least about 0.79, at least about 0.80, at least about 0.81, at least about 0.82, at least about 0.83, at least about 0.84, at least about 0.85, at least about 0.86, at least about 0.87, at least about 0.88, at least about 0.89, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or about 1.00.
This approach may advantageously select the replacement codon having the maximum context-sensitive codon usage. In some embodiments, motifs identified as associated with positive evolutionary signals or negative evolutionary signals that include codons that are to be replaced by a rewriting design may be highlighted as requiring greater scrutiny to avoid introducing fitness defects by rewriting. In such embodiments, a replacement codon that shares the same evolutionary signal as the rewritten codon may be used. In some embodiments, rewriting designs may be selected to minimize the number of evolutionary motifs affected. In some embodiments, nonsynonymous codons may be introduced instead of introducing a motif with an evolutionary signal through replacement with a synonymous codon.
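The outlier-avoidance selection described above can be sketched as follows. This is a simplified illustration: the text refers to usage that is "significantly" higher or lower than the overall RSCU, whereas the sketch substitutes a plain fold-change cutoff (`max_fold`, a hypothetical parameter) for a statistical test. Among the codons that are neither too hot nor too cold, it returns the one with the highest context-sensitive usage.

```python
def goldilocks_choice(context_counts, overall_rscu, excluded=frozenset(), max_fold=2.0):
    """Choose a replacement codon that is neither too hot nor too cold in this context.

    context_counts: codon -> count of that codon as the central codon in the context.
    overall_rscu:   codon -> genome-wide RSCU (fraction among synonymous codons).
    excluded:       codons being eliminated, which may not be used as replacements.
    max_fold:       assumed cutoff; codons enriched or depleted beyond it are skipped.
    """
    total = sum(context_counts.values())
    kept = {}
    for codon, n in context_counts.items():
        if codon in excluded or total == 0:
            continue
        fraction = n / total
        fold = fraction / overall_rscu[codon]
        if 1.0 / max_fold <= fold <= max_fold:  # not too cold, not too hot
            kept[codon] = fraction
    return max(kept, key=kept.get) if kept else None
```

A return value of `None` marks a context with no acceptable synonym, which could be routed to manual review or to a fallback rule.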
In some embodiments, a replacement codon that is “too hot” may comprise a codon that may be a regulatory element, e.g., a within-ORF regulatory element. In some embodiments, a replacement codon that is not “too hot” may comprise a codon that may not be a regulatory element, e.g., a within-ORF regulatory element. In some embodiments, a replacement codon that is “too cold” may comprise a codon that may have been evolutionarily negatively selected. In some embodiments, a replacement codon that is “too cold” may comprise a codon that may cause frameshifts, ribosome stalls, or secondary structure interfering with transcription and/or translation. In some embodiments, a replacement codon that is not “too cold” may comprise a codon that may not cause frameshifts, ribosome stalls, or secondary structure interfering with transcription and/or translation. In some embodiments, machine learning approaches (e.g., statistical analysis approaches) can be performed to determine the rules for Goldilocks methods for codon replacement from the host genome. Details of examples of Goldilocks methods are provided in, for example, Example 3 and Example 4. In some embodiments, sequences of original yeast ORFs (Saccharomyces cerevisiae S288C strain) and rewritten yeast ORFs using methods described herein are shown as SEQ ID NOs: 1-11,812.
In some aspects, provided herein are methods for codon rewriting and/or replacement, wherein a codon may be selected by examining a local context of the codon. In some embodiments, a codon may be selected by examining a local context of a codon-of-interest within an ORF or a gene. In some embodiments, a local context of a codon-of-interest may comprise the codon-of-interest and a codon on each side of the codon-of-interest. In some embodiments, a local context of a codon-of-interest may comprise the codon-of-interest and codons on both 5′ and 3′ side of the codon-of-interest. In some embodiments, a local context of a codon-of-interest may comprise a preceding codon, the codon-of-interest, and the subsequent codon. In some embodiments, a local context of a codon-of-interest may comprise a codon upstream of the codon-of-interest, the codon-of-interest, and a codon downstream of the codon-of-interest. In some embodiments, a local context of a codon-of-interest may comprise a codon 5′ to the codon-of-interest, the codon-of-interest, and a codon 3′ to the codon-of-interest.
In some embodiments, a local context of a codon-of-interest may comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or at least 21 codons. In some embodiments, a local context of a codon-of-interest may comprise 3 codons, i.e., a preceding codon, the codon-of-interest, and the subsequent codon. In some embodiments, a local context of a codon-of-interest may comprise 3 codons, i.e., a codon upstream of (or 5′ to) the codon-of-interest, the codon-of-interest, and a codon downstream of (or 3′ to) the codon-of-interest. In some embodiments, a local context of a codon-of-interest may comprise 5 codons, i.e., two preceding codons, the codon-of-interest, and the two subsequent codons. In some embodiments, a local context of a codon-of-interest may comprise 5 codons, i.e., two codons upstream of (or 5′ to) the codon-of-interest, the codon-of-interest, and two codons downstream of (or 3′ to) the codon-of-interest.
In some embodiments, a local context of a codon-of-interest may comprise at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, or at least 63 nucleotides or base pairs. In some embodiments, a local context of a codon-of-interest may comprise a total of 9 nucleotides. For example, a local context of a codon-of-interest may comprise a 3 nucleotide preceding codon, the 3 nucleotide codon-of-interest, and a 3 nucleotide subsequent codon. For example, a local context of a codon-of-interest may comprise a 3 nucleotide codon upstream of (or 5′ to) the codon-of-interest, the 3 nucleotide codon-of-interest, and a 3 nucleotide codon downstream of (or 3′ to) the codon-of-interest. In some embodiments, a local context of a codon-of-interest may comprise a total of 11 nucleotides. For example, a local context of a codon-of-interest may comprise 4 nucleotides upstream of (or 5′ to) the codon-of-interest, the 3 nucleotide codon-of-interest, and 4 nucleotides downstream of (or 3′ to) the codon-of-interest. In some embodiments, a local context of a codon-of-interest may comprise a total of 15 nucleotides. For example, a local context of a codon-of-interest may comprise two preceding codons, each having 3 nucleotides, the 3 nucleotide codon-of-interest, and two subsequent codons, each having 3 nucleotides. For example, a local context of a codon-of-interest may comprise two codons, each having 3 nucleotides, upstream of (or 5′ to) the codon-of-interest, the 3 nucleotide codon-of-interest, and two codons, each having 3 nucleotides, downstream of (or 3′ to) the codon-of-interest.
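By way of non-limiting illustration, the codon-level local contexts described above may be enumerated as in the following minimal Python sketch. The function names `split_codons` and `local_contexts`, the `flank` parameter, and the example sequence are hypothetical and are used for illustration only; the sketch assumes an in-frame ORF and skips codons that lack a full set of flanking codons.

```python
def split_codons(orf: str) -> list[str]:
    """Split an in-frame ORF sequence into 3-nucleotide codons."""
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return [orf[i:i + 3] for i in range(0, len(orf), 3)]


def local_contexts(orf: str, flank: int = 1) -> list[tuple[str, ...]]:
    """Return each window of 2*flank + 1 codons centred on a codon-of-interest.

    Assumption of this sketch: codons too close to either end of the ORF
    to have `flank` codons on both sides are skipped.
    """
    codons = split_codons(orf.upper())
    return [
        tuple(codons[i - flank:i + flank + 1])
        for i in range(flank, len(codons) - flank)
    ]


orf = "ATGGCTTTAGGTTAA"  # hypothetical 5-codon ORF
print(local_contexts(orf, flank=1))
# [('ATG', 'GCT', 'TTA'), ('GCT', 'TTA', 'GGT'), ('TTA', 'GGT', 'TAA')]
```

With `flank=1` each window spans 3 codons (9 nucleotides); with `flank=2` each window spans 5 codons (15 nucleotides).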
In some embodiments, a local context of a codon-of-interest may comprise C(n−1)−Cn−C(n+1), wherein
C(n−1) denotes a codon downstream of the codon-of-interest;
Cn denotes the codon-of-interest; and
C(n+1) denotes a codon upstream of the codon-of-interest.
In some embodiments, a local context of a codon-of-interest may comprise C(n−1)−AAn−C(n+1), wherein
C(n−1) denotes a codon downstream of the codon-of-interest;
AAn is an amino acid encoded by the codon-of-interest; and
C(n+1) denotes a codon upstream of the codon-of-interest.
In some embodiments, methods described herein may comprise determining a number of occurrences of the local context of the codon-of-interest. In some embodiments, methods described herein may comprise determining a relative synonymous codon usage (RSCU) of the codon-of-interest (Cn). In some embodiments, the RSCU may be determined as the frequency of a codon divided by the frequency of all codons encoding the same amino acid.
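As a non-limiting sketch, the RSCU described above (the frequency of a codon divided by the frequency of all codons encoding the same amino acid) may be computed as follows. The helper name `rscu` and the example codon list are hypothetical; the standard genetic code is assumed for the codon-to-amino-acid mapping.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(codons: list[str]) -> dict[str, float]:
    """Codon count divided by the count of all codons for the same amino acid."""
    codon_counts = Counter(codons)
    aa_counts = Counter(CODON_TABLE[c] for c in codons)
    return {c: n / aa_counts[CODON_TABLE[c]] for c, n in codon_counts.items()}


# Hypothetical codon list: four leucine codons and two alanine codons.
values = rscu(["TTA", "TTA", "TTG", "CTT", "GCT", "GCC"])
print(values["TTA"])  # 0.5
```

Under this definition, the RSCU values of all observed codons for one amino acid sum to 1.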
In some embodiments, a codon may be selected based on the RSCU value of the codon for a local context. In some embodiments, a codon with an RSCU value of at least about 0.01, at least about 0.05, at least about 0.10, at least about 0.11, at least about 0.12, at least about 0.13, at least about 0.14, at least about 0.15, at least about 0.16, at least about 0.17, at least about 0.18, at least about 0.19, at least about 0.20, at least about 0.21, at least about 0.22, at least about 0.23, at least about 0.24, at least about 0.25, at least about 0.26, at least about 0.27, at least about 0.28, at least about 0.29, at least about 0.30, at least about 0.31, at least about 0.32, at least about 0.33, at least about 0.34, at least about 0.35, at least about 0.36, at least about 0.37, at least about 0.38, at least about 0.39, at least about 0.40, at least about 0.41, at least about 0.42, at least about 0.43, at least about 0.44, at least about 0.45, at least about 0.46, at least about 0.47, at least about 0.48, at least about 0.49, at least about 0.50, at least about 0.51, at least about 0.52, at least about 0.53, at least about 0.54, at least about 0.55, at least about 0.56, at least about 0.57, at least about 0.58, at least about 0.59, at least about 0.60, at least about 0.61, at least about 0.62, at least about 0.63, at least about 0.64, at least about 0.65, at least about 0.66, at least about 0.67, at least about 0.68, at least about 0.69, at least about 0.70, at least about 0.71, at least about 0.72, at least about 0.73, at least about 0.74, at least about 0.75, at least about 0.76, at least about 0.77, at least about 0.78, at least about 0.79, at least about 0.80, at least about 0.81, at least about 0.82, at least about 0.83, at least about 0.84, at least about 0.85, at least about 0.86, at least about 0.87, at least about 0.88, at least about 0.89, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or about 1.00 may be selected. In some embodiments, a codon with the highest RSCU value for a local context may be selected.
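For illustration only, selecting the synonymous codon with the highest RSCU value for a given local context may be sketched as follows. The names `context_usage` and `best_replacement`, the `forbidden` set of codons being rewritten, and the example ORFs are hypothetical; the sketch assumes a 3-codon local context and the standard genetic code.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def context_usage(orfs):
    """Count codons Cn for each context C(n-1), AAn, C(n+1)."""
    usage = defaultdict(Counter)
    for codons in orfs:
        for c_minus, c_n, c_plus in zip(codons, codons[1:], codons[2:]):
            usage[(c_minus, CODON_TABLE[c_n], c_plus)][c_n] += 1
    return usage


def best_replacement(usage, c_minus, c_n, c_plus, forbidden):
    """Return the allowed synonymous codon with the highest in-context RSCU."""
    counts = usage.get((c_minus, CODON_TABLE[c_n], c_plus), Counter())
    total = sum(counts.values())
    scores = {c: n / total for c, n in counts.items() if c not in forbidden}
    # None signals that no allowed synonymous codon was observed in this context.
    return max(scores, key=scores.get) if scores else None


# Hypothetical ORFs (as codon lists) sharing the flanking codons ATG and GGT.
orfs = [["ATG", "TTA", "GGT"], ["ATG", "CTT", "GGT"],
        ["ATG", "CTT", "GGT"], ["ATG", "TTG", "GGT"]]
usage = context_usage(orfs)
print(best_replacement(usage, "ATG", "TTA", "GGT", forbidden={"TTA"}))  # CTT
```

A threshold on the in-context RSCU value, such as those listed above, could be applied to `scores` before taking the maximum.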
In some embodiments, methods described herein may comprise determining an expected number of occurrences of the local context of the codon-of-interest. In some embodiments, the expected number of occurrences of the first local context of the codon-of-interest is determined as a product of: a number of occurrences of the second local context of the codon-of-interest, and the determined RSCU of the codon-of-interest. In some embodiments, the expected number of occurrences of C(n−1)−Cn−C(n+1) is determined as:
(number of occurrences of C(n−1)−AAn−C(n+1)) × (RSCU of Cn).
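The product above may be sketched, in a non-limiting way, as follows. The function name `observed_and_expected` and the example ORFs are hypothetical; the sketch assumes that the RSCU of Cn is computed over all codons in the supplied ORFs and that the local context spans 3 codons.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def observed_and_expected(orfs):
    """Map each C(n-1), Cn, C(n+1) triplet to (observed, expected) counts."""
    triplets, aa_contexts = Counter(), Counter()
    codon_counts, aa_counts = Counter(), Counter()
    for codons in orfs:
        for c in codons:
            codon_counts[c] += 1
            aa_counts[CODON_TABLE[c]] += 1
        for c_minus, c_n, c_plus in zip(codons, codons[1:], codons[2:]):
            triplets[(c_minus, c_n, c_plus)] += 1
            aa_contexts[(c_minus, CODON_TABLE[c_n], c_plus)] += 1
    result = {}
    for (c_minus, c_n, c_plus), observed in triplets.items():
        aa = CODON_TABLE[c_n]
        rscu = codon_counts[c_n] / aa_counts[aa]
        # expected = occurrences of C(n-1)-AAn-C(n+1) multiplied by RSCU of Cn
        expected = aa_contexts[(c_minus, aa, c_plus)] * rscu
        result[(c_minus, c_n, c_plus)] = (observed, expected)
    return result


# Hypothetical ORFs: CTT and TTA are used equally overall (RSCU 0.5 each),
# but only CTT appears between ATG and GGT.
orfs = [["ATG", "CTT", "GGT"], ["ATG", "CTT", "GGT"],
        ["GCT", "TTA", "GCT"], ["GCT", "TTA", "GCT"]]
print(observed_and_expected(orfs)[("ATG", "CTT", "GGT")])  # (2, 1.0)
```

In this example the triplet is observed twice against an expected count of 1.0, i.e., it occurs more often than the overall codon usage would suggest.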
In some embodiments, methods described herein may comprise identifying a statistically significant evolutionary signal. In some embodiments, statistically significant evolutionary signals may comprise a negative evolutionary selection signal, a positive evolutionary selection signal, or a combination thereof. For example, the negative selection signal may include, but is not limited to, a frameshift, a ribosome stall, or a secondary RNA structure interfering with transcription and/or translation. For example, the positive selection signal may include, but is not limited to, a regulatory element within an open reading frame (ORF).
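One possible way to flag a statistically significant signal is to compare observed and expected counts of a local context. The sketch below is an assumption made for illustration and is not a test specified herein: it uses a normal approximation to a binomial model, a hypothetical cutoff `z_cutoff`, and the hypothetical convention that over-represented contexts are labeled as candidate positive signals and under-represented contexts as candidate negative signals. Correction for multiple comparisons is omitted.

```python
import math


def selection_signal(observed: int, context_total: int, rscu: float,
                     z_cutoff: float = 3.0) -> tuple[float, str]:
    """Return a z-score and a label for one codon-level local context.

    context_total is the number of occurrences of C(n-1)-AAn-C(n+1);
    rscu is the RSCU of the codon-of-interest Cn.
    """
    expected = context_total * rscu
    variance = context_total * rscu * (1.0 - rscu)
    if variance == 0:
        return 0.0, "none"  # degenerate case: no variability to test against
    z = (observed - expected) / math.sqrt(variance)
    if z >= z_cutoff:
        return z, "positive"
    if z <= -z_cutoff:
        return z, "negative"
    return z, "none"


print(selection_signal(90, 100, 0.5))  # (8.0, 'positive')
print(selection_signal(30, 100, 0.5))  # (-4.0, 'negative')
```

Contexts labeled in this way could then be highlighted for closer scrutiny before rewriting.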
tRNA Removal & Supplementation
In some embodiments, methods described herein may comprise removing or supplementing one or more tRNAs corresponding to one or more codons to be rewritten or replaced. In some embodiments, methods described herein may comprise supplementing tRNAs that may be oversubscribed as a function of the replacement strategy.
In some embodiments, performing genome design may comprise removing codons and corresponding tRNAs for rewriting and/or replacement. For example, codons may be rewritten synonymously and tRNAs with complementary anticodons may be deleted as part of the genome design (e.g., deleting tRNA genes). In this embodiment, deleting one or more tRNA genes prior to rewriting the entire genome may cause slow growth or lethality of an organism. In some embodiments, tRNA genes may be provided on a plasmid or chromosomal region that may be removed at the final step of genome rewriting or strain construction.
In some embodiments, additional tRNAs with anticodons recognizing the newly assigned codons (i.e., codons encoding a newly assigned amino acid or an ncAA) may be provided. In some embodiments, the total number of tRNA genes deleted can be determined, and the copy number of the remaining tRNA genes for an amino acid can be increased by the same amount. In some embodiments, wobble rules can be used to identify the tRNA genes responsible for decoding the replacement codons, and copy number increases can be allocated proportionally. In some embodiments, one or more non-native tRNA genes may be introduced. For example, for leucine, tL(AAG) from Candida species may be introduced.
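As a non-limiting sketch, the proportional allocation of copy number increases described above may be computed as follows. The function name `allocate_copies`, the largest-remainder rounding rule, and the example gene names and copy numbers are hypothetical; the sketch assumes that the tRNA genes decoding the replacement codons have already been identified (e.g., using wobble rules).

```python
def allocate_copies(remaining: dict[str, int], deleted: int) -> dict[str, int]:
    """Add `deleted` copies to the remaining tRNA genes in proportion to their copy number."""
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no remaining tRNA genes to supplement")
    shares = {gene: deleted * n / total for gene, n in remaining.items()}
    extra = {gene: int(share) for gene, share in shares.items()}
    # Hand out copies lost to rounding, largest fractional remainder first.
    leftover = deleted - sum(extra.values())
    by_remainder = sorted(shares, key=lambda g: shares[g] - extra[g], reverse=True)
    for gene in by_remainder[:leftover]:
        extra[gene] += 1
    return {gene: remaining[gene] + extra[gene] for gene in remaining}


# Hypothetical copy numbers of leucine tRNA genes left after deleting 4 genes.
remaining = {"tL(CAA)": 10, "tL(UAA)": 7, "tL(UAG)": 3}
print(allocate_copies(remaining, deleted=4))
# {'tL(CAA)': 12, 'tL(UAA)': 8, 'tL(UAG)': 4}
```

The total copy number after allocation equals the original total plus the number of deleted genes, so the overall tRNA gene count for the amino acid is preserved.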
In some aspects, methods described herein may comprise synthesizing a nucleic acid construct comprising one or more codons rewritten based on codon rewriting/replacement methods described herein. In some embodiments, any known methods in the art can be used to synthesize the nucleic acid construct comprising one or more codons rewritten based on codon rewriting/replacement methods described herein. In some embodiments, a chromosome can be computationally divided into 30-60 kilobase long constructs, each comprising a set of segments that is less than about 10 kilobase in length. Each segment can be synthesized using any known methods in the art, e.g., a polymerase chain reaction (PCR), and/or restriction enzyme digestion/ligation. In some embodiments, these segments can be assembled into a construct by restriction enzyme cutting and ligation in vitro, or any other methods known in the art. In some embodiments, the construct can be sequenced to confirm the sequence of the nucleic acid construct and subsequently integrated into the host genome, e.g., a yeast genome, using any known methods in the art to replace the corresponding portion, region, or segment of the wild-type genome.
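The computational division of a chromosome into constructs and segments may be sketched, for illustration only, as follows. The function name `partition_chromosome`, the default sizes of 50 kilobases per construct and 9 kilobases per segment, and the example chromosome length are hypothetical. The sketch ignores gene boundaries and any overlaps or restriction sites needed for assembly, and the final construct may be shorter than the others.

```python
def partition_chromosome(chrom_len: int, construct_len: int = 50_000,
                         segment_len: int = 9_000) -> list[list[tuple[int, int]]]:
    """Return constructs, each a list of half-open (start, end) segment coordinates."""
    constructs = []
    for c_start in range(0, chrom_len, construct_len):
        c_end = min(c_start + construct_len, chrom_len)
        segments = [
            (s, min(s + segment_len, c_end))
            for s in range(c_start, c_end, segment_len)
        ]
        constructs.append(segments)
    return constructs


constructs = partition_chromosome(230_000)  # hypothetical 230 kb chromosome
print(len(constructs))     # 5
print(constructs[0][:2])   # [(0, 9000), (9000, 18000)]
```

Each construct here spans at most 50 kilobases and each segment is shorter than 10 kilobases, consistent with the ranges described above.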
In some aspects, methods described herein may further comprise replacing a portion of a genome with a nucleic acid construct comprising one or more codons rewritten based on codon rewriting/replacement methods described herein. In some embodiments, site-specific nucleases (SSNs) or homology-directed recombination (HR) can be used to replace a portion of a genome. In some embodiments, HR can be performed using an endogenous homologous recombination machinery. In some embodiments, a yeast homologous recombination machinery can be used as detailed in Example 6.
In some embodiments, SSNs may comprise meganucleases, zinc-finger nucleases (ZFNs), TAL effector nucleases (TALENs), and clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated (Cas) systems. These four major classes of gene-editing techniques, namely, meganucleases, ZFNs, TALENs, and CRISPR/Cas systems, share a common mode of action in binding a user-defined sequence of DNA and mediating a double-stranded DNA break (DSB). A DSB may then be repaired by HR, an event that introduces the homologous sequence from a donor DNA fragment, or by non-homologous end joining (NHEJ), when there is no donor DNA present.
A CRISPR-Cas system may be used with a guide target sequence for genetic screening, targeted transcriptional regulation, targeted knock-in, and targeted genome editing, including base editing, epigenetic editing, and introducing double strand breaks (DSBs) for homologous recombination-mediated insertion of a nucleotide sequence. A CRISPR-Cas system comprises an endonuclease protein whose DNA-targeting specificity and cutting activity can be programmed by a short guide RNA or a duplex crRNA/tracrRNA. A CRISPR endonuclease comprises a Cas effector nuclease, typically microbial Cas9, and a short guide RNA (gRNA) or an RNA duplex comprising an 18 to 20 nucleotide targeting sequence that directs the nuclease to a location of interest in the genome. Genome editing can refer to the targeted modification of a DNA sequence, including but not limited to, adding, removing, replacing, or modifying existing DNA sequences, and inducing chromosomal rearrangements or modifying transcription regulation elements (e.g., methylation/demethylation of a promoter sequence of a gene) to alter gene expression. As described above, a CRISPR-Cas system requires a guide system that can locate the Cas protein to the target DNA site in the genome. In some instances, the guide system comprises a CRISPR RNA (crRNA) with a 17-20 nucleotide sequence that is complementary to a target DNA site and a trans-activating crRNA (tracrRNA) scaffold recognized by the Cas protein (e.g., Cas9). The 17-20 nucleotide sequence complementary to a target DNA site is referred to as a spacer while the 17-20 nucleotide target DNA sequence is referred to as a protospacer. While crRNAs and tracrRNAs exist as two separate RNA molecules in nature, a single guide RNA (sgRNA or gRNA) can be engineered to combine and fuse crRNA and tracrRNA elements into one single RNA molecule. Thus, in one embodiment, the gRNA comprises two or more RNAs, e.g., crRNA and tracrRNA. 
In another embodiment, the gRNA comprises an sgRNA comprising a spacer sequence for genomic targeting and a scaffold sequence for Cas protein binding. In some instances, the guide system naturally comprises an sgRNA. For example, Cas12a/Cpf1 utilizes a guide system lacking tracrRNA and comprising only a crRNA containing a spacer sequence and a scaffold for Cas12a/Cpf1 binding. While the spacer sequence can be varied depending on a target site in the genome, the scaffold sequence for Cas protein binding can be identical for all gRNAs.
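For illustration only, composing a single guide RNA from a variable spacer and a constant scaffold may be sketched as follows. The function name `make_sgrna`, the `scaffold_first` option, and the placeholder scaffold string are hypothetical; no real scaffold sequence is given here. The sketch assumes that the scaffold follows the spacer for a Cas9-type guide and precedes it for a Cas12a-type guide.

```python
def make_sgrna(spacer_rna: str, scaffold_rna: str, scaffold_first: bool = False) -> str:
    """Join a 17-20 nucleotide RNA spacer with a constant scaffold sequence."""
    spacer = spacer_rna.upper()
    if not 17 <= len(spacer) <= 20:
        raise ValueError("spacer must be 17-20 nucleotides long")
    if set(spacer) - set("ACGU"):
        raise ValueError("spacer must contain only A, C, G, or U")
    # Assumption: Cas9-type guides are spacer + scaffold; Cas12a-type guides
    # are scaffold + spacer. Select the layout with `scaffold_first`.
    return scaffold_rna + spacer if scaffold_first else spacer + scaffold_rna


SCAFFOLD = "N" * 12  # placeholder only, not a functional scaffold sequence
guides = [make_sgrna(s, SCAFFOLD) for s in ("ACGUACGUACGUACGUACGU", "GGCAUUACGGAUCCAUG")]
print(guides[0])  # ACGUACGUACGUACGUACGUNNNNNNNNNNNN
```

Because only the spacer varies between target sites, the same scaffold string is reused for every guide in the list.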
CRISPR-Cas systems described herein can comprise different CRISPR enzymes. For example, the CRISPR-Cas system can comprise Cas9, Cas12a/Cpf1, Cas12b/C2c1, Cas12c/C2c3, Cas12d/CasY, Cas12e/CasX, Cas12g, Cas12h, or Cas12i. Non-limiting examples of Cas enzymes include Cas1, Cas1B, Cas2, Cas3, Cas4, Cas5, Cas5d, Cas5t, Cas5h, Cas5a, Cas6, Cas7, Cas8, Cas8a, Cas8b, Cas8c, Cas9 (also known as Csn1 or Csx12), Cas10, Cas10d, Cas12a/Cpf1, Cas12b/C2c1, Cas12c/C2c3, Cas12d/CasY, Cas12e/CasX, Cas12f/Cas14/C2c10, Cas12g, Cas12h, Cas12i, Cas12k/C2c5, Cas13a/C2c2, Cas13b, Cas13c, Cas13d, C2c4, C2c8, C2c9, Csy1, Csy2, Csy3, Csy4, Cse1, Cse2, Cse3, Cse4, Cse5e, Csc1, Csc2, Csa5, Csn1, Csn2, Csm1, Csm2, Csm3, Csm4, Csm5, Csm6, Cmr1, Cmr3, Cmr4, Cmr5, Cmr6, Csb1, Csb2, Csb3, Csx17, Csx14, Csx10, Csx16, CsaX, Csx3, Csx1, Csx15, Csx11, Csf1, Csf2, Csf3, Csf4, Csd1, Csd2, Cst1, Cst2, Csh1, Csh2, Csa1, Csa2, Csa3, Csa4, Csa5, GSU0054, Type II Cas effector proteins, Type V Cas effector proteins, Type VI Cas effector proteins, CARF, DinG, homologues thereof, or modified or engineered versions thereof such as dCas9 (endonuclease-dead Cas9) and nCas9 (Cas9 nickase that has an inactive DNA cleavage domain). In some cases, the compositions, methods, devices, and systems described herein may use the Cas9 nuclease from Streptococcus pyogenes, of which amino acid sequences and structures are well known to those skilled in the art.
In some aspects, described herein, are methods for contacting a genome from a sample with one or more agents configured to cleave the genome at a locus. In some embodiments, the contacting may occur in vitro. In some embodiments, the contacting may occur in vivo, e.g., in a cell. In some embodiments, the one or more agents comprise a polypeptide, a polynucleotide, or a combination thereof. In some embodiments, the polypeptide comprises an enzyme, e.g., a site-specific nuclease. Examples of a site-specific nuclease are shown above. In some embodiments, a site-specific nuclease comprises an engineered homing endonuclease or meganuclease, a zinc-finger nuclease (ZFN), a transcription activator-like effector nuclease (TALEN), a clustered regularly interspaced short palindromic repeat (CRISPR/Cas), or a combination thereof. In some embodiments, the polynucleotide comprises a guide RNA (gRNA). In some embodiments, the one or more agents comprise a site-specific nuclease and a gRNA (e.g., CRISPR/Cas system).
Agents described herein can be delivered into cells in vitro or in vivo by art-known methods or as described herein. Delivery methods such as physical, chemical, and viral methods are also known in the art. In some instances, physical delivery methods can include, but are not limited to, electroporation, microinjection, or use of ballistic particles. On the other hand, chemical delivery methods require use of complex molecules such as calcium phosphate, lipid, or protein. In some embodiments, viral delivery methods are applied for gene editing techniques using viruses such as, but not limited to, adenovirus, lentivirus, and retrovirus. In some embodiments, agents described herein can be delivered via a carrier. In some embodiments, agents described herein can be delivered by, e.g., vectors (e.g., viral or non-viral vectors), non-vector based methods (e.g., using naked DNA, DNA complexes, lipid nanoparticles, RNA such as mRNA), or a combination thereof. In some embodiments, a carrier can comprise a vector, a messenger RNA (mRNA), double stranded DNA (dsDNA), single stranded DNA (ssDNA), or a plasmid. In some embodiments, agents can be delivered directly to cells as naked DNA or RNA, for instance by means of transfection or electroporation, or can be conjugated to molecules (e.g., N-acetylgalactosamine) promoting uptake by cells.
In some embodiments, vectors can comprise one or more sequences encoding one or more agents described herein. Vectors can also comprise a sequence encoding a signal peptide (e.g., for nuclear localization, nucleolar localization, or mitochondrial localization), associated with (e.g., inserted into or fused to) a sequence coding for a protein. As one example, vectors can include a Cas9 coding sequence that includes one or more nuclear localization sequences (e.g., a nuclear localization sequence from SV40). Vectors described herein can also include any suitable number of regulatory/control elements, e.g., promoters, enhancers, introns, polyadenylation signals, Kozak consensus sequences, or internal ribosome entry sites (IRES). These elements are well known in the art. Vectors described herein may include recombinant viral vectors. Any viral vectors known in the art can be used. Examples of viral vectors include, but are not limited to, lentivirus (e.g., HIV and FIV-based vectors), adenovirus (e.g., AD100), retrovirus (e.g., Moloney murine leukemia virus, MML-V), herpesvirus vectors (e.g., HSV-2), adeno-associated viruses (AAVs), and other plasmid or viral vector types. In some embodiments, agents described herein may be delivered in one carrier (e.g., one vector). In some embodiments, agents described herein may be delivered in multiple carriers (e.g., multiple vectors).
In addition, viral particles can be used to deliver agents in nucleic acid and/or peptide form. For example, “empty” viral particles can be assembled to contain any suitable cargo. Viral vectors and viral particles can also be engineered to incorporate targeting ligands to alter target tissue specificity. Non-viral vectors can also be used to deliver agents according to the present disclosure. One example of a non-viral nucleic acid vector is a nanoparticle, which can be organic or inorganic. Nanoparticles are well known in the art. Any suitable nanoparticle design can be used to deliver agents described herein (e.g., nucleic acids encoding such agents).
In some embodiments, agents described herein can be delivered as a ribonucleoprotein (RNP) to cells. An RNP may comprise a nucleic acid binding protein, e.g., Cas9, in a complex with a gRNA targeting a genome/locus/sequence of interest. RNPs can be delivered to cells using known methods in the art, including, but not limited to electroporation, nucleofection, or cationic lipid-mediated methods, for example, as reported by Zuris, J. A. et al., 2015, Nat. Biotechnology, 33(1):73-80.
In some aspects, methods described herein may comprise utilizing a machine learning-based computer system. In some embodiments, machine learning-based computer systems described herein may comprise one or more storage units comprising, respectively, one or more storage devices included within respective storage arrays controlled by a respective one or more storage controllers; and one or more computer processing units, wherein the one or more computer processing units are configured to communicate with the one or more storage units over a communication interface.
In some embodiments, the machine learning-based computer system provides the plurality of intermediate scores to a machine learning algorithm that processes the plurality of intermediate scores to generate the rewritten codons (e.g., the first plurality of codons that are selected to be rewritten into a second codon). The machine learning algorithm may comprise a function that determines how intermediate scores are combined and weighted. The machine learning algorithm may comprise a supervised machine learning algorithm. The supervised machine learning algorithm may be trained on prior data from a reference genome, or on prior data from multiple genomes. The prior data may include observed fitness values for genomes, including growth rates on different media. The machine learning-based computer system can train the supervised machine learning algorithm by providing examples of fitness values to an untrained or partially trained version of the algorithm to generate replacement codons for one or more of the input genomes or of a different genome. The system can compare the predicted fitness to the measured fitness (i.e., whether the cell growth rate was maintained), and if there is a difference, the system can perform training at least in part by updating the parameters of the supervised machine learning algorithm. The supervised machine learning algorithm may comprise a regression algorithm, a support vector machine, a decision tree, a neural network, or the like. In cases in which the machine learning algorithm comprises a regression algorithm, the weights may be regression parameters. The supervised machine learning algorithm may comprise a classifier or a predictor that determines a prediction of which replacement codons (e.g., selected from among a plurality of possible replacement codons) are least likely to result in a fitness deficit. 
The predictor may generate a fitness risk score that is indicative of a likelihood of a fitness risk (e.g., a probabilistic fitness risk score between 0 and 1). In some cases, the machine learning-based computer system may map the probabilistic fitness risk score to a qualitative risk category (e.g., selected from among a plurality of risk categories). For example, a fitness risk score that is at least 0.5 may be considered a high risk, while a fitness risk score that is less than 0.5 may be considered a low risk. Alternatively, the supervised machine learning algorithm may be a multi-class classifier (e.g., binary classifier) that predicts a qualitative risk category directly.
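As a non-limiting sketch, combining weighted intermediate scores into a probabilistic fitness risk score and mapping that score to a qualitative category may look as follows. The function names, the logistic form of the combination, and the example scores, weights, and bias are assumptions made for illustration; in practice the weights would be learned during training.

```python
import math


def fitness_risk_score(scores: list[float], weights: list[float], bias: float = 0.0) -> float:
    """Combine intermediate scores into a risk score between 0 and 1 (logistic model)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


def risk_category(score: float, threshold: float = 0.5) -> str:
    """Map a probabilistic risk score to a qualitative category."""
    return "high" if score >= threshold else "low"


# Hypothetical intermediate scores and regression parameters.
score = fitness_risk_score([0.2, 0.9], weights=[1.0, 2.0], bias=-1.0)
print(round(score, 3), risk_category(score))  # 0.731 high
```

The 0.5 threshold follows the example above; a different threshold or more than two categories could be substituted.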
The machine learning algorithm may comprise an unsupervised machine learning algorithm. The unsupervised machine learning algorithm may identify patterns in a genome or multiple genomes of interest. For example, it may identify a set of codon usage contexts that are an outlier as compared to other sets of codon usage for the same amino acid. If the unsupervised machine learning algorithm determines that a particular context-dependent codon usage is an outlier, the machine learning-based computer system may determine that relying on genome-wide codon usage for codon selection may lead to a fitness deficit. On the other hand, a set of codon usage scores that is consistent with overall codon usage for the genome may indicate that codon replacement has lower risk of generating a fitness defect. The unsupervised machine learning algorithm may comprise a clustering algorithm, an isolation forest, an autoencoder, or the like.
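For illustration only, flagging a context-dependent codon usage as an outlier may be sketched with a simple z-score rule. This is a deliberately simple stand-in and not one of the algorithms named above (clustering, isolation forest, or autoencoder). The function name `outlier_contexts`, the cutoff of 2.0, and the example usage values are hypothetical.

```python
import statistics


def outlier_contexts(usage_by_context: dict, z_cutoff: float = 2.0) -> list:
    """Return contexts whose codon usage deviates strongly from the other contexts."""
    values = list(usage_by_context.values())
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # sample standard deviation
    if spread == 0:
        return []
    return [
        ctx for ctx, value in usage_by_context.items()
        if abs(value - mean) / spread > z_cutoff
    ]


# Hypothetical fraction of one leucine codon within eight different local contexts.
usage = {
    ("ATG", "L", "GGT"): 0.30, ("GCT", "L", "GCT"): 0.31,
    ("AAA", "L", "GAA"): 0.29, ("GGT", "L", "ATG"): 0.30,
    ("TCT", "L", "AAA"): 0.32, ("GAT", "L", "TCT"): 0.28,
    ("CCA", "L", "GAT"): 0.30, ("GAA", "L", "AAA"): 0.90,
}
print(outlier_contexts(usage))  # [('GAA', 'L', 'AAA')]
```

A flagged context suggests that genome-wide codon usage may be a poor guide for codon selection at that position.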
In some aspects, methods and systems described herein may employ one or more trained algorithms. The trained algorithm(s) may process or operate on one or more datasets comprising information about a codon-of-interest, a codon upstream of (or 5′ to) the codon-of-interest, a codon downstream of (or 3′ to) the codon-of-interest, or any combination thereof. In some embodiments, the datasets comprise structural or sequence information about codons. In some embodiments, the datasets comprise one or more datasets of codons. The one or more datasets may be observed empirically, derived from computational studies, be derived from or retrieved from one or more databases, be artificially generated (e.g., as in silico variants of empirically observed datasets), or any combination thereof.
The trained algorithm may comprise an unsupervised machine learning algorithm. The trained algorithm may comprise a supervised machine learning algorithm. The trained algorithm may comprise a classification and regression tree (CART) algorithm. The supervised machine learning algorithm may comprise, for example, a Random Forest, a support vector machine (SVM), a neural network, or a deep learning algorithm. The trained algorithm may comprise a self-supervised machine learning algorithm. The trained algorithm may comprise a statistical model, statistical analysis, or statistical learning.
In some embodiments, a machine learning algorithm (or software module) of a platform as described herein utilizes one or more neural networks. In some embodiments, a neural network is a type of computational system that can learn the relationships between an input dataset and a target dataset. A neural network may be a software representation of a human neural system (e.g., cognitive system), intended to capture “learning” and “generalization” abilities as used by a human. In some embodiments, the machine learning algorithm (or software module) comprises a neural network comprising a convolutional neural network (CNN). Non-limiting examples of structural components of embodiments of the machine learning software described herein include: CNNs, recurrent neural networks, dilated CNNs, fully-connected neural networks, deep generative models, and Boltzmann machines.
In some embodiments, a neural network comprises a series of layers of nodes termed “neurons.” In some embodiments, a neural network comprises an input layer, to which data is presented; one or more internal, and/or “hidden”, layers; and an output layer. A neuron may be connected to neurons in other layers via connections that have weights, which are parameters that control the strength of the connection. The number of neurons in each layer may be related to the complexity of the problem to be solved. The minimum number of neurons required in a layer may be determined by the problem complexity, and the maximum number may be limited by the ability of the neural network to generalize. The input neurons may receive data being presented and then transmit that data to the first hidden layer through weighted connections, the weights of which are modified during training. The first hidden layer may process the data and transmit its result to the next layer through a second set of weighted connections. Each subsequent layer may “pool” the results from a set of the previous layers into more complex relationships. In addition, whereas some software programs require writing specific instructions to perform a task, neural networks are programmed by training them with a known sample set and allowing them to modify themselves during (and after) training so as to provide a desired output such as an output value (e.g., predicted value). After training, when a neural network is presented with new input data, it generalizes what was “learned” during training and applies what was learned from training to the new, previously unseen, input data in order to generate an output associated with that input (e.g., a predicted value). The output may be generated in order to minimize an expected error or loss function between the output value and an expected value.
In some embodiments, the neural network comprises artificial neural networks (ANNs). ANNs may be machine learning algorithms that may be trained to map an input dataset to an output dataset, where the ANN comprises an interconnected group of nodes organized into multiple layers of nodes. For example, the ANN architecture may comprise at least an input layer, one or more hidden layers, and an output layer. The ANN may comprise any total number of layers, and any number of hidden layers, where the hidden layers function as trainable feature extractors that allow mapping of a set of input data to an output value or set of output values. As used herein, a deep learning algorithm (such as a deep neural network, or DNN) is an ANN comprising a plurality of hidden layers, e.g., two or more hidden layers. Each layer of the neural network may comprise a number of nodes (or “neurons”). A node receives a set of inputs either directly from the input data or from the output of nodes in previous layers, and performs a specific operation, e.g., a summation operation, on the set of inputs. A connection from an input to a node is associated with a weight (or weighting factor). The node may determine a sum of the products of all pairs of inputs and their associated weights. The weighted sum may be offset with a bias. The output of a node or neuron may be gated using a threshold or activation function. The activation function may be a linear or non-linear function. The activation function may be, for example, a rectified linear unit (ReLU) activation function, a Leaky ReLU activation function, or other function such as a saturating hyperbolic tangent, identity, binary step, logistic, arctan, softsign, parametric rectified linear unit, exponential linear unit, softplus, bent identity, softexponential, sinusoid, sine, Gaussian, or sigmoid function, or any combination thereof.
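The node operation described above (a weighted sum of inputs, offset by a bias and gated by an activation function) may be sketched as follows. The function names and the example inputs, weights, and biases are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def relu(x: float) -> float:
    """Rectified linear unit activation."""
    return max(0.0, x)


def node_output(inputs, weights, bias, activation=relu):
    """Weighted sum of inputs plus bias, passed through an activation function."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)


def layer_output(inputs, weight_rows, biases, activation=relu):
    """Apply every node of one fully-connected layer to the same inputs."""
    return [node_output(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


inputs = [1.0, 2.0]
# Node 1: 0.5*1 - 1.0*2 + 0.25 = -1.25, gated to 0.0 by ReLU.
# Node 2: 1.0*1 + 1.0*2 - 0.5 = 2.5.
print(layer_output(inputs, [[0.5, -1.0], [1.0, 1.0]], [0.25, -0.5]))  # [0.0, 2.5]
```

Passing the output list of one call as the `inputs` of the next call chains layers into a multi-layer network.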
The weighting factors, bias values, and threshold values, or other computational parameters of the neural network, may be “taught” or “learned” in a training phase using one or more sets of training data. For example, the parameters may be trained using the input data from a training dataset and a gradient descent or backward propagation method so that the output value(s) that the ANN determines are consistent with the examples included in the training dataset.
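As a minimal sketch of such a training phase, the following fits the weight and bias of a single linear node by gradient descent on a mean squared error. The function name `train_node`, the learning rate, the number of epochs, and the training samples are hypothetical; a practical network would have many nodes and would typically use backpropagation through several layers.

```python
def train_node(samples, learning_rate: float = 0.1, epochs: int = 500):
    """Fit y ~ w*x + b to (x, y) samples by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            error = (w * x + b) - y
            grad_w += 2.0 * error * x / n  # d(MSE)/dw
            grad_b += 2.0 * error / n      # d(MSE)/db
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training set generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_node(samples)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

After training, the learned parameters reproduce the examples in the training dataset to within a small numerical error.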
The number of nodes used in the input layer of the ANN or DNN may be at least about 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, 100,000, or greater. In other instances, the number of nodes used in the input layer may be at most about 100,000, 90,000, 80,000, 70,000, 60,000, 50,000, 40,000, 30,000, 20,000, 10,000, 9,000, 8,000, 7,000, 6,000, 5,000, 4,000, 3,000, 2,000, 1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 10, or fewer. In some instances, the total number of layers used in the ANN or DNN (including input and output layers) may be at least about 3, 4, 5, 10, 15, 20, or greater. In other instances, the total number of layers may be at most about 20, 15, 10, 5, 4, 3, or fewer.
In some instances, the total number of learnable or trainable parameters, e.g., weighting factors, biases, or threshold values, used in the ANN or DNN may be at least about 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, 100,000, or greater. In other instances, the number of learnable parameters may be at most about 100,000, 90,000, 80,000, 70,000, 60,000, 50,000, 40,000, 30,000, 20,000, 10,000, 9,000, 8,000, 7,000, 6,000, 5,000, 4,000, 3,000, 2,000, 1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 10, or fewer.
In some embodiments of a machine learning software module as described herein, a machine learning software module comprises a neural network such as a deep CNN. In some embodiments in which a CNN is used, the network is constructed with any number of convolutional layers, dilated layers, or fully-connected layers. In some embodiments, the number of convolutional layers is between 1-10, and the number of dilated layers is between 0-10. The total number of convolutional layers (including input and output layers) may be at least about 1, 2, 3, 4, 5, 10, 15, 20, or greater, and the total number of dilated layers may be at least about 1, 2, 3, 4, 5, 10, 15, 20, or greater. The total number of convolutional layers may be at most about 20, 15, 10, 5, 4, 3, or fewer, and the total number of dilated layers may be at most about 20, 15, 10, 5, 4, 3, or fewer. In some embodiments, the number of convolutional layers is between 1-10 and the fully-connected layers between 0-10. The total number of convolutional layers (including input and output layers) may be at least about 1, 2, 3, 4, 5, 10, 15, 20, or greater, and the total number of fully-connected layers may be at least about 1, 2, 3, 4, 5, 10, 15, 20, or greater. The total number of convolutional layers may be at most about 20, 15, 10, 5, 4, 3, 2, 1, or less, and the total number of fully-connected layers may be at most about 20, 15, 10, 5, 4, 3, 2, 1, or fewer.
In some embodiments, the input data for training of the ANN may comprise a variety of input values depending on whether the machine learning algorithm is used for processing sequence or structural data. In some embodiments, the ANN or deep learning algorithm may be trained using one or more training datasets comprising the same or different sets of input and paired output data.
In some embodiments, a machine learning software module comprises a neural network comprising a CNN, recurrent neural network (RNN), dilated CNN, fully-connected neural networks, deep generative models, and deep restricted Boltzmann machines.
In some embodiments, a machine learning algorithm comprises CNNs. The CNN may be deep and feedforward ANNs. The CNN may be applicable to analyzing visual imagery. The CNN may comprise an input, an output layer, and multiple hidden layers. The hidden layers of a CNN may comprise convolutional layers, pooling layers, fully-connected layers, and normalization layers. The layers may be organized in 3 dimensions: width, height, and depth.
The convolutional layers may apply a convolution operation to the input and pass results of the convolution operation to the next layer. For processing sequence data, the convolution operation may reduce the number of free parameters, allowing the network to be deeper with fewer parameters. In neural networks, each neuron may receive input from some number of locations in the previous layer. In a convolutional layer, neurons may receive input from only a restricted subarea of the previous layer. The convolutional layer's parameters may comprise a set of learnable filters (or kernels). The learnable filters may have a small receptive field and extend through the full depth of the input volume. During the forward pass, each filter may be convolved across the length of the input sequence, determine the dot product between the entries of the filter and the input, and produce a two-dimensional activation map of that filter. As a result, the network may learn filters that activate when it detects some specific type of feature at some spatial position in the input.
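As a simplified sketch of a filter being convolved across the length of an input sequence, the following Python example slides a single filter over a one-hot encoded nucleotide sequence and records the dot product at each position. The helper names `one_hot` and `conv1d` are hypothetical, and the filter is set by hand to match the codon TCG rather than learned; for a one-dimensional sequence and a single filter, the resulting activation map is simply a list indexed by position.

```python
from typing import List

NUCLEOTIDES = "ACGT"


def one_hot(sequence: str) -> List[List[float]]:
    """Encode each nucleotide as a 4-element indicator vector (A, C, G, T)."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in sequence]


def conv1d(encoded: List[List[float]], kernel: List[List[float]]) -> List[float]:
    """Slide the kernel along the sequence and take the dot product at each offset."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        activations.append(
            sum(
                window[i][c] * kernel[i][c]
                for i in range(width)
                for c in range(len(NUCLEOTIDES))
            )
        )
    return activations


# Hand-set filter (not learned) that responds most strongly to the triplet TCG.
tcg_filter = one_hot("TCG")
print(conv1d(one_hot("ATGTCGTAA"), tcg_filter))
# prints [1.0, 1.0, 0.0, 3.0, 0.0, 0.0, 1.0]
```

The activation peaks at offset 3, where the window exactly matches the filter, which illustrates how a filter may activate when a specific feature occurs at a specific position in the input.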
In some embodiments, the pooling layers comprise global pooling layers. The global pooling layers may combine the outputs of neuron clusters at one layer into a single neuron in the next layer. For example, max pooling layers may use the maximum value from each of a cluster of neurons in the prior layer; and average pooling layers may use the average value from each of a cluster of neurons at the prior layer.
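The pooling operations described above may be sketched as follows. The function names and the non-overlapping cluster layout are assumptions made for this example.

```python
from typing import List, Sequence


def max_pool(values: Sequence[float], size: int) -> List[float]:
    """Keep the maximum of each non-overlapping cluster of `size` neurons."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


def average_pool(values: Sequence[float], size: int) -> List[float]:
    """Keep the mean of each non-overlapping cluster of `size` neurons."""
    pooled = []
    for i in range(0, len(values), size):
        cluster = values[i:i + size]
        pooled.append(sum(cluster) / len(cluster))
    return pooled


def global_max_pool(values: Sequence[float]) -> float:
    """Global pooling: reduce the whole layer output to a single value."""
    return max(values)


layer_output = [1, 3, 2, 5, 4, 0]
print(max_pool(layer_output, 2))      # prints [3, 5, 4]
print(average_pool(layer_output, 2))  # prints [2.0, 3.5, 2.0]
print(global_max_pool(layer_output))  # prints 5
```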
In some embodiments, the fully-connected layers connect every neuron in one layer to every neuron in another layer. In neural networks, each neuron may receive input from some number of locations in the previous layer. In a fully-connected layer, each neuron may receive input from every element of the previous layer.
In some embodiments, the normalization layer is a batch normalization layer. The batch normalization layer may improve the performance and stability of neural networks. The batch normalization layer may provide any layer in a neural network with inputs that are zero mean/unit variance. The advantages of using batch normalization layer may include faster trained networks, higher learning rates, easier to initialize weights, more activation functions viable, and simpler process of creating deep networks.
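As a minimal sketch of the zero mean/unit variance normalization described above, the following Python example normalizes one batch of activations for a single feature. The function name `batch_normalize` and the small constant `epsilon` (added for numerical stability) are assumptions of this sketch; the learnable scale and shift parameters that batch normalization layers typically include are omitted for brevity.

```python
import math
from typing import List, Sequence


def batch_normalize(batch: Sequence[float], epsilon: float = 1e-5) -> List[float]:
    """Shift and scale a batch of activations to roughly zero mean, unit variance."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(variance + epsilon) for x in batch]


normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
print([round(x, 3) for x in normalized])
```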
In some embodiments, a machine learning software module comprises a recurrent neural network software module. A recurrent neural network software module may receive sequential data as an input, such as consecutive data inputs, and the recurrent neural network software module updates an internal state at every time step. A recurrent neural network can use internal state (memory) to process sequences of inputs. The recurrent neural network may be applicable to tasks such as codon selection. The recurrent neural network may also be applicable to next codon prediction, and codon usage anomaly detection. A recurrent neural network may comprise fully recurrent neural network, independently recurrent neural network, Elman networks, Jordan networks, Echo state, neural history compressor, long short-term memory, gated recurrent unit, multiple timescales model, neural Turing machines, differentiable neural computer, and neural network pushdown automata.
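The internal state update performed at every time step may be illustrated with the following simplified sketch of a scalar, Elman-style recurrence. The function names `rnn_step` and `run_rnn` and the weight values are hypothetical, and the weights are fixed rather than trained.

```python
import math
from typing import List, Sequence


def rnn_step(state: float, x: float, w_state: float, w_input: float, bias: float) -> float:
    """Combine the previous state with the current input to form the new state."""
    return math.tanh(w_state * state + w_input * x + bias)


def run_rnn(
    inputs: Sequence[float],
    w_state: float = 0.5,
    w_input: float = 1.0,
    bias: float = 0.0,
) -> List[float]:
    """Process a sequence one element at a time, carrying the state forward."""
    state = 0.0
    states = []
    for x in inputs:
        state = rnn_step(state, x, w_state, w_input, bias)
        states.append(state)
    return states


# The same inputs in a different order give a different final state,
# because the state carries a memory of earlier time steps.
print(run_rnn([1.0, 0.0]))
print(run_rnn([0.0, 1.0]))
```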
In some embodiments, a machine learning software module comprises a supervised or unsupervised learning method such as, for example, support vector machines (“SVMs”), random forests, clustering algorithm (or software module), gradient boosting, linear regression, logistic regression, and/or decision trees. The supervised learning algorithms may be algorithms that rely on the use of a set of labeled, paired training data examples to infer the relationship between input data and output data. The unsupervised learning algorithms may be algorithms used to draw inferences from training datasets without labeled output data. The unsupervised learning algorithm may comprise cluster analysis, which may be used for exploratory data analysis to find hidden patterns or groupings in process data. One example of an unsupervised learning method may comprise principal component analysis. The principal component analysis may comprise reducing the dimensionality of one or more variables. The dimensionality of a given variable may be at least 1, 5, 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 1,100, 1,200, 1,300, 1,400, 1,500, 1,600, 1,700, 1,800, or greater. The dimensionality of a given variable may be at most 1,800, 1,700, 1,600, 1,500, 1,400, 1,300, 1,200, 1,100, 1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 10, or fewer.
In some embodiments, the machine learning algorithm may comprise reinforcement learning algorithms. The reinforcement learning algorithm may be used for optimizing Markov decision processes (i.e., mathematical models used for studying a wide range of optimization problems where future behavior cannot be accurately predicted from past behavior alone, but rather also depends on random chance or probability). One example of reinforcement learning may be Q-learning. Reinforcement learning algorithms may differ from supervised learning algorithms in that correct training data input/output pairs are not presented, nor are sub-optimal actions explicitly corrected. The reinforcement learning algorithms may be implemented with a focus on real-time performance through finding a balance between exploration of possible outcomes (e.g., correct compound identification) based on updated input data and exploitation of past training.
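As one hedged illustration of Q-learning, the following sketch applies the standard tabular update, in which the value of a state/action pair is moved toward the observed reward plus the discounted best value of the next state. The function name `q_update`, the state and action labels, and the learning rate and discount values are hypothetical.

```python
from typing import Dict

QTable = Dict[str, Dict[str, float]]


def q_update(
    q: QTable,
    state: str,
    action: str,
    reward: float,
    next_state: str,
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> float:
    """Apply one tabular Q-learning update in place and return the new value."""
    next_values = q.get(next_state, {})
    best_next = max(next_values.values()) if next_values else 0.0
    # Move the current estimate toward reward + discounted best future value.
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


q_table: QTable = {
    "s0": {"a": 0.0, "b": 0.0},
    "s1": {"a": 1.0, "b": 2.0},
}
# 0.0 + 0.5 * (1.0 + 0.5 * 2.0 - 0.0) = 1.0
print(q_update(q_table, "s0", "a", reward=1.0, next_state="s1", alpha=0.5, gamma=0.5))
```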
In some embodiments, training data resides in a cloud-based database that is accessible from local and/or remote computer systems on which the machine learning-based sensor signal processing algorithms are running. The cloud-based database and associated software may be used for archiving electronic data, sharing electronic data, and analyzing electronic data. In some embodiments, training data generated locally may be uploaded to a cloud-based database, from which it may be accessed and used to train other machine learning-based detection systems at the same site or a different site.
The trained algorithm may accept a plurality of input variables and produce one or more output variables based on the plurality of input variables. The input variables may comprise one or more datasets of codons. For example, the input variables may comprise information about a codon-of-interest, a codon upstream of (or 5′ to) the codon-of-interest, a codon downstream of (or 3′ to) the codon-of-interest, or any combination thereof.
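One possible way to assemble such input variables is sketched below: the coding sequence is split into in-frame codons, and each occurrence of the codon-of-interest is paired with its immediate upstream (5′) and downstream (3′) codons. The function name `codon_contexts`, the single-codon window, and the example sequence are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Context = Tuple[int, Optional[str], str, Optional[str]]


def codon_contexts(cds: str, codon_of_interest: str) -> List[Context]:
    """Return (codon index, upstream codon, codon, downstream codon) per occurrence."""
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    contexts = []
    for idx, codon in enumerate(codons):
        if codon == codon_of_interest:
            upstream = codons[idx - 1] if idx > 0 else None
            downstream = codons[idx + 1] if idx + 1 < len(codons) else None
            contexts.append((idx, upstream, codon, downstream))
    return contexts


# Hypothetical coding sequence: ATG TCG GCA TCG TAA
print(codon_contexts("ATGTCGGCATCGTAA", "TCG"))
# prints [(1, 'ATG', 'TCG', 'GCA'), (3, 'GCA', 'TCG', 'TAA')]
```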
The trained algorithm may be trained with a plurality of independent training samples. Each of the independent training samples may comprise information about a codon-of-interest, a codon upstream of (or 5′ to) the codon-of-interest, a codon downstream of (or 3′ to) the codon-of-interest, or a combination thereof. The trained algorithm may be trained with at least about 5, at least about 10, at least about 15, at least about 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, at least about 500, at least about 1,000, at least about 1,500, at least about 2,000, at least about 2,500, at least about 3,000, at least about 3,500, at least about 4,000, at least about 4,500, at least about 5,000, at least about 5,500, at least about 6,000, at least about 6,500, at least about 7,000, at least about 7,500, at least about 8,000, at least about 8,500, at least about 9,000, at least about 9,500, at least about 10,000, or more independent training samples.
The trained algorithm may associate information about a codon-of-interest, a codon upstream of (or 5′ to) the codon-of-interest, a codon downstream of (or 3′ to) the codon-of-interest, or a combination thereof for the best selection of codons for rewriting/replacement at an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The trained algorithm may be adjusted or tuned to improve a performance or accuracy of determining the prediction or classification. The trained algorithm may be adjusted or tuned by adjusting parameters of the trained algorithm. The trained algorithm may be adjusted or tuned continuously during the training process or after the training process has completed.
After the trained algorithm is initially trained, a subset of the inputs may be identified as most influential or most important to be included for making high-quality predictions. For example, a subset of the data may be identified as most influential or most important to be included for making a high-quality choice for selecting codons for rewriting and/or replacement. The data or a subset thereof may be ranked based on classification metrics indicative of each parameter's influence or importance toward making a high-quality selection of codons for rewriting and/or replacement. Such metrics may be used to reduce, in some embodiments significantly, the number of input variables (e.g., predictor variables) that may be used to train the trained algorithm to a desired performance level (e.g., based on a desired minimum accuracy). For example, if training the trained algorithm with a plurality comprising several dozen or hundreds of input variables in the trained algorithm results in an accuracy of classification of more than 99%, then training the trained algorithm instead with only a selected subset of no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100 such most influential or most important input variables among the plurality can yield decreased but still acceptable accuracy of classification (e.g., at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%). The subset may be selected by rank-ordering the entire plurality of input variables and selecting a predetermined number (e.g., no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100) of input variables with the best association metrics.
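The rank-ordering step may be sketched as follows, assuming that an importance or association metric has already been computed for each input variable. The function name `select_top_features`, the variable names, and the scores are hypothetical values used only to show the selection logic.

```python
from typing import Dict, List


def select_top_features(importance: Dict[str, float], k: int) -> List[str]:
    """Rank input variables by their metric (highest first) and keep the top k."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores for four input variables.
scores = {
    "codon_of_interest": 0.60,
    "upstream_codon": 0.30,
    "downstream_codon": 0.08,
    "gc_content": 0.02,
}
print(select_top_features(scores, k=2))
# prints ['codon_of_interest', 'upstream_codon']
```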
Systems and methods as described herein may use more than one trained algorithm to determine an output. Systems and methods may comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more trained algorithms. A trained algorithm of the plurality of trained algorithms may be trained on a particular type of data (e.g., sequence data, structural data). Alternatively, a trained algorithm may be trained on more than one type of data. The inputs of one trained algorithm may comprise the outputs of one or more other trained algorithms. Additionally, a trained algorithm may receive as its input the output of one or more trained algorithms. A set of outputs generated using one or more trained algorithms may be combined into a single output (e.g., by determining a sum, an average, a minimum, a maximum, or any other function applied to the set of outputs).
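Combining a set of outputs into a single output may be sketched as below. The function name `combine_outputs` and the example scores are hypothetical; the four combiners shown correspond to the sum, average, minimum, and maximum mentioned above.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

COMBINERS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": sum,
    "average": mean,
    "minimum": min,
    "maximum": max,
}


def combine_outputs(outputs: Sequence[float], how: str = "average") -> float:
    """Reduce the outputs of several trained algorithms to a single value."""
    if how not in COMBINERS:
        raise ValueError(f"unknown combiner: {how}")
    return COMBINERS[how](outputs)


# Hypothetical scores produced by three separately trained algorithms.
model_scores = [0.25, 0.5, 0.75]
print(combine_outputs(model_scores, "average"))  # prints 0.5
print(combine_outputs(model_scores, "maximum"))  # prints 0.75
```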
In some aspects, provided herein, are methods for codon rewriting and replacement. In some embodiments, codons rewritten or replaced can be used to encode a new amino acid. In some embodiments, the new amino acid can be any canonical amino acid. For example, the new amino acid can be alanine, arginine, asparagine, aspartic acid, cysteine, glutamine, glutamic acid, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, or valine. In some embodiments, the new amino acid can be a non-canonical amino acid (ncAA).
In some aspects, provided herein, are methods for genetic code expansion using codon rewriting and replacement. In some embodiments, methods described herein may enable site-specific, co-translational incorporation of one or more ncAAs into a polypeptide or a protein. In some embodiments, methods described herein can provide transformational approaches to understand and control one or more biological functions. For example, codon rewriting/replacement can allow genetically encoding amino acids corresponding to post-translationally modified versions of natural amino acids. For example, codon rewriting/replacement to allow genetically encoding photocaged amino acids can enable the rapid activation of protein function with light to dissect dynamic processes in cells. For example, codon rewriting/replacement to allow genetically encoding crosslinkers can provide a way to map protein interactions. For example, ncAAs containing fluorophores or other biophysical probes can be used to follow changes in protein structure and/or activity. In some embodiments, ncAAs may be used to alter enzyme function. In some embodiments, ncAAs may be used to trap labile enzyme-substrate intermediates for structural studies and substrate identification. In some embodiments, ncAAs bearing bio-orthogonal and chemically reactive groups may provide strategies for rapidly attaching a wide range of functionalities to proteins to precisely control and image protein function in cells and to create protein conjugates, including defined therapeutic conjugates. In some embodiments, genetic code expansion using codon rewriting and replacement methods described herein may form the basis of strategies for the reversible control of gene expression in animals and strategies for determining cell type-specific proteomes in animals.
In some embodiments, genetic code expansion using codon rewriting and replacement methods described herein may allow incorporating multiple distinct ncAAs into polypeptides or proteins.
Non-Canonical Amino Acid (ncAA)
As used herein, a non-canonical amino acid (ncAA) can refer to any amino acid other than the 20 genetically encoded alpha-amino acids comprising alanine, arginine, asparagine, aspartic acid, cysteine, glutamine, glutamic acid, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine. In some aspects, described herein are non-canonical amino acids (ncAAs) that may comprise side chain chemistries and/or structures that are not available from canonical amino acids (cAAs). In some embodiments, ncAAs may comprise fluorinated amino acids or amino acids comprising a reactive group (e.g., carbonyl, alkene, or alkyne moieties), or photoactivatable group (e.g., azide, benzophenone, or fluorophores). Translation of ncAAs into proteins may allow chemical modification and accordingly, ncAAs may be useful for in vivo structure-function studies, protein-protein interaction studies, protein localization studies, protein activity regulation studies or studies to generate new protein function. ncAAs can be incorporated in different cells, including, but not limited to, bacterial cells (e.g., Escherichia coli), yeast cells (e.g., Saccharomyces cerevisiae, Pichia pastoris, or Candida albicans), mammalian cells and plant cells, or in organisms, including, but not limited to, Drosophila melanogaster, Caenorhabditis elegans, Bombyx mori, rabbit and cow.
In some embodiments, a ncAA may comprise Para-fluoro-L-phenylalanine, Para-iodo-L-phenylalanine, Para-azido-L-phenylalanine, Para-acetyl-L-phenylalanine, Para-benzoyl-L-phenylalanine, Meta-fluoro-L-tyrosine, O-methyl-L-tyrosine, Para-propargyloxy-L-phenylalanine, (2S)-2-aminooctanoic acid, (2S)-2-aminononanoic acid, (2S)-2-aminodecanoic acid, (2S)-2-aminohept-6-enoic acid, (2S)-2-aminooct-7-enoic acid, L-Homocysteine, (2S)-2-amino-5-sulfanylpentanoic acid, (2S)-2-amino-6-sulfanylhexanoic acid, L-S-(2-nitrobenzyl)cysteine, L-S-ferrocenyl-cysteine, L-O-crotylserine, L-O-(pent-4-en-1-yl)serine, L-O-(4,5-dimethoxy-2-nitrobenzyl)serine, (2S)-2-amino-3-({[5-(dimethylamino)naphthalen-1-yl]sulfonyl}amino)propanoic acid, (2S)-3-[(6-acetyl-naphthalen-1-yl)amino]-2-aminopropanoic acid, L-Pyrrolysine, N6-[(propargyloxy)carbonyl]-L-lysine, L-N6-acetyllysine, N6-trifluoroacetyl-L-lysine, N6-{[1-(6-nitro-1,3-benzodioxol-5-yl)ethoxy]carbonyl}-L-lysine, N6-{[2-(3-methyl-3H-diaziren-3-yl)ethoxy]carbonyl}-L-lysine, p-azidophenylalanine or 2-aminoisobutyric acid (also known as α-aminoisobutyric acid, AIB, α-methylalanine, or 2-methylalanine).
In some embodiments, a ncAA may comprise AbK (unnatural amino acid for Photo-crosslinking probe), 3-Aminotyrosine (unnatural amino acid for inducing red shift in fluorescent proteins and fluorescent protein-based biosensors), L-Azidohomoalanine hydrochloride (unnatural amino acid for bio-orthogonal labeling of newly synthesized proteins), L-Azidonorleucine hydrochloride (unnatural amino acid for bio-orthogonal or fluorescent labeling of newly synthesized proteins), BzF (photoreactive unnatural amino acid; photo-crosslinker), DMNB-caged-Serine (caged serine; excited by visible blue light), HADA (blue fluorescent D-amino acid for labeling peptidoglycans in live bacteria), NADA-green (fluorescent D-amino acid for labeling peptidoglycans in live bacteria), NB-caged Tyrosine hydrochloride (ortho-nitrobenzyl caged L-tyrosine), RADA (orange-red TAMRA-based fluorescent D-amino acid for labeling peptidoglycans in live bacteria), Rf470DL (blue rotor-fluorogenic fluorescent D-amino acid for labeling peptidoglycans in live bacteria), sBADA (green fluorescent D-amino acid for labeling peptidoglycans in bacteria), or YADA (green-yellow lucifer yellow-based fluorescent D-amino acid for labeling peptidoglycans in live bacteria).
In some embodiments, a ncAA may comprise an O-methyl-L-tyrosine, an L-3-(2-naphthyl)alanine, a 3-methyl-phenylalanine, an O-4-allyl-L-tyrosine, a 4-propyl-L-tyrosine, a tri-O-acetyl-GlcNAcβ-serine, an L-Dopa, a fluorinated phenylalanine, an isopropyl-L-phenylalanine, a p-azido-L-phenylalanine, a p-acyl-L-phenylalanine, a p-benzoyl-L-phenylalanine, an L-phosphoserine, a phosphonoserine, a phosphonotyrosine, a p-iodo-phenylalanine, a p-bromophenylalanine, a p-amino-L-phenylalanine, or an isopropyl-L-phenylalanine.
In some embodiments, a ncAA may comprise an unnatural analogue of a canonical amino acid. For example, a ncAA may comprise an unnatural analogue of a tyrosine amino acid, an unnatural analogue of a glutamine amino acid, an unnatural analogue of a phenylalanine amino acid, an unnatural analogue of a serine amino acid, or an unnatural analogue of a threonine amino acid. In some embodiments, a ncAA may comprise an alkyl, aryl, acyl, azido, cyano, halo, hydrazine, hydrazide, hydroxyl, alkenyl, alkynyl, ether, thiol, sulfonyl, seleno, ester, thioacid, borate, boronate, phospho, phosphono, phosphine, heterocyclic, enone, imine, aldehyde, hydroxylamine, keto, or amino substituted amino acid, or any combination thereof.
In some embodiments, a ncAA may comprise an amino acid with a photoactivatable cross-linker, a spin-labeled amino acid, a fluorescent amino acid, an amino acid with a novel functional group, an amino acid that covalently or noncovalently interacts with another molecule, a metal binding amino acid, a metal-containing amino acid, a radioactive amino acid, a photocaged amino acid, a photoisomerizable amino acid, a biotin or biotin-analogue containing amino acid, a glycosylated or carbohydrate modified amino acid, a keto containing amino acid, an amino acid comprising polyethylene glycol, an amino acid comprising polyether, a heavy atom substituted amino acid, a chemically cleavable or photocleavable amino acid, an amino acid with an elongated side chain, an amino acid containing a toxic group, or a sugar substituted amino acid. In some embodiments, a sugar substituted amino acid may comprise a sugar substituted serine. In some embodiments, a ncAA may comprise a carbon-linked sugar-containing amino acid, a redox-active amino acid, an α-hydroxy containing amino acid, an amino thio acid containing amino acid, an α,α disubstituted amino acid, a β-amino acid, or a cyclic amino acid other than proline.
In some embodiments, a ncAA may comprise p-azidophenylalanine or 2-aminoisobutyric acid (also known as α-aminoisobutyric acid, AIB, α-methylalanine, or 2-methylalanine).
The ribosome uses tRNA adaptors, aminoacylated with their cognate amino acids by specific aminoacyl-tRNA synthetases (aaRSs), to progressively decode the triplet codons in a coding sequence and polymerize the corresponding sequence of amino acids into a protein. 64 triplet codons are used to encode the 20 canonical amino acids, and the initiation and termination of protein synthesis. In some aspects, codon rewriting and replacement methods described herein may allow reassigning those rewritten codons to encode a new amino acid (referred to as orthogonal codons). In some embodiments, orthogonal codons can be assigned to ncAAs. In some embodiments, each new orthogonal codon must be decoded by an additional aminoacyl-tRNA synthetase (aaRS)/tRNA pair. In some embodiments, these aaRS/tRNA pairs may uniquely decode distinct codons and recognize distinct ncAAs.
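To make the codon arithmetic concrete, the following sketch builds the standard genetic code (64 triplets covering the 20 canonical amino acids plus three stop signals) and lists the codons that are synonymous with a given codon, i.e., the candidates to which it could be rewritten without changing the encoded amino acid. The compact one-letter encoding in T, C, A, G order is a common convention; the names `CODON_TABLE` and `synonymous_codons` are chosen for this example.

```python
from itertools import product
from typing import Dict, List

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE: Dict[str, str] = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(codon: str) -> List[str]:
    """Return the other codons that encode the same amino acid (or stop)."""
    target = CODON_TABLE[codon]
    return sorted(c for c, aa in CODON_TABLE.items() if aa == target and c != codon)


print(len(CODON_TABLE))                           # prints 64
print(len(set(CODON_TABLE.values()) - {"*"}))     # prints 20
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "*"))
# prints ['TAA', 'TAG', 'TGA']
print(synonymous_codons("TCG"))
# prints ['AGC', 'AGT', 'TCA', 'TCC', 'TCT']
```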
In some aspects, methods described herein may require orthogonal aaRS/tRNA pairs. In some embodiments, each orthogonal aaRS may aminoacylate its cognate orthogonal tRNA, and/or minimally aminoacylate the other tRNAs in an organism. In some embodiments, the orthogonal tRNA may be aminoacylated by its cognate synthetase and/or minimally be aminoacylated by the aaRSs of the organism. In some embodiments, the orthogonal tRNA may be engineered to recognize an orthogonal codon that is not assigned to a canonical amino acid (i.e., rewritten/replaced codons), while maintaining selective aminoacylation by the orthogonal synthetase. In some embodiments, an active site of the orthogonal synthetase may be engineered.
In some aspects, provided herein are methods for reassigning a codon to encode an amino acid that the codon does not naturally encode. For example, a codon may be reassigned to a ncAA, i.e., the codon encodes a ncAA instead of an amino acid naturally encoded by the codon. Over 100 ncAAs with diverse chemistries may be synthesized and co-translationally incorporated into polypeptides and proteins using evolved orthogonal aminoacyl-tRNA synthetase (aaRS)/tRNA pairs. Various aaRS/tRNA pairs can be used for methods described herein. In some embodiments, an ncAA may be designed based on tyrosine or pyrrolysine. In some embodiments, an aaRS/tRNA pair may be provided on a plasmid or integrated into the genome of a cell or an organism comprising one or more reassigned codons. In some embodiments, an orthogonal aaRS/tRNA pair can be used to bioorthogonally incorporate ncAAs into polypeptides or proteins.
In some embodiments, vector-based over-expression systems may be used. In some embodiments, vector-based over-expression systems may outcompete natural codon function with its reassigned function. In some embodiments where natural aaRS and/or tRNAs for the rewritten codon are completely abolished or removed, a lower amount of aaRS/tRNA for the newly assigned ncAA may be sufficient to achieve efficient ncAA incorporation. In some embodiments, genome-based aaRS/tRNA pairs (i.e., aaRS/tRNA pairs incorporated into the genome of the cell or organism) may be used to reduce the mis-incorporation of canonical amino acids in the absence of available ncAAs. In some embodiments, ncAA incorporation into polypeptides or proteins may involve supplementing the growth media with the ncAA described herein and an inducer for the aaRS expression. Alternatively, the aaRS may be expressed constitutively.
In some embodiments, aaRS/tRNA pairs may be imported from evolutionarily divergent organisms, wherein the sequence has diverged from that of the aaRS/tRNA pairs in the host organism or cell of interest (e.g., archaeal and eukaryotic pairs in an E. coli host). In some embodiments, derivatives of the Methanocaldococcus jannaschii tyrosyl-tRNA synthetase (MjTyrRS)/MjtRNATyr pair may be used to incorporate a wide variety of ncAAs into polypeptides or proteins. In some embodiments, derivatives of the E. coli leucyl-tRNA synthetase (EcLeuRS)/EctRNALeu, E. coli tryptophanyl-tRNA synthetase (EcTrpRS)/EctRNATrp, or EcTyrRS/EctRNATyr pairs may be used to incorporate one or more ncAAs into polypeptides or proteins. In some embodiments, the EcTyrRS/EctRNATyr pair or the EcTrpRS/EctRNATrp pair may be directly evolved for a new ncAA specificity. In some embodiments, endogenous copies of aaRS/tRNA pairs may be replaced with pairs that are orthogonal in another host organism.
In some embodiments, evolved derivatives of a Methanococcus maripaludis phosphoseryl-tRNA synthetase (MmpSepRS)/MjtRNASep pair may be used to incorporate phosphoserine, its non-hydrolysable analogue, or phosphothreonine. In some embodiments, a Methanosarcina mazei pyrrolysyl-tRNA synthetase (MmPylRS)/MmtRNAPylCUA pair, a Methanosarcina barkeri PylRS (MbPylRS)/MbtRNAPylCUA pair, or derivatives thereof, may be used to incorporate one or more ncAAs. In some embodiments, an Archaeoglobus fulgidus (Af)TyrRS/AftRNATyrCUA pair may be used to incorporate one or more ncAAs. In some embodiments, engineered aaRS/tRNA pairs may be used to incorporate one or more ncAAs.
An organism or a host organism described herein can be an animal. In some embodiments, the animal may be a mammal. In some embodiments, the mammal comprises a human, non-human primate, rodent, caprine, bovine, ovine, equine, canine, feline, mouse, rat, rabbit, horse or goat. In some embodiments, an organism or a host organism may comprise E. coli, Salmonella enterica subsp. enterica serovar Typhimurium, Saccharomyces cerevisiae, cultured mammalian cells, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster or Mus musculus.
A cell or a host cell described herein can be a bacterial cell, a yeast cell, a fungal cell, an insect cell, or a mammalian cell. In some embodiments, a cell may comprise a mammalian cell. Mammalian cells can be derived or isolated from a tissue of a mammal. In some embodiments, mammalian cells may comprise COS cells, BHK cells, 293 cells, 3T3 cells, NSO hybridoma cells, baby hamster kidney (BHK) cells, PER.C6™ human cells, HEK293 cells or Cricetulus griseus (CHO) cells. In some embodiments, a mammalian cell may comprise a human cell, a rodent cell, or a mouse cell. Examples of mammalian cells can also include but are not limited to cells from humans, non-human primates such as chimpanzees, and other apes and monkey species; farm animals such as cattle, horses, sheep, goats, swine; domestic animals such as rabbits, dogs, and cats; laboratory animals including rodents, such as rats, mice and guinea pigs, and the like. In some embodiments, a mammalian cell is a human cell. In some embodiments, a mammalian cell is a mouse cell. In some embodiments, a mammalian cell comprises an embryonic stem cell (ESC), a pluripotent stem cell (PSC), or an induced pluripotent stem cell (iPSC). In some embodiments, a cell or a host cell may comprise a eukaryotic cell or a prokaryotic cell. In some embodiments, the prokaryotic cell comprises an archaebacterial cell, a bacterial cell, or a combination thereof. In some embodiments, the eukaryotic cell comprises a yeast cell, a fungal cell, a plant cell, an animal cell, an insect cell, a mammalian cell, or a combination thereof. In some embodiments, the mammalian cell comprises a rodent cell, a mouse cell, or a human cell, or a combination thereof.
Methods for incorporating non-canonical amino acids in yeast are described in, for example, Stieglitz J. T., Van Deventer J. A. (2022) Incorporating, Quantifying, and Leveraging Noncanonical Amino Acids in Yeast. In: Rasooly A., Baker H., Ossandon M. R. (eds) Biomedical Engineering Technologies. Methods in Molecular Biology, vol 2394. Humana, New York, NY (doi.org/10.1007/978-1-0716-1811-0_21), which is incorporated by reference herein in its entirety.
Applications of proteins with non-canonical amino acids are described in, for example, Jeremiah A Johnson, Ying Y Lu, James A Van Deventer, David A Tirrell, Residue-specific incorporation of non-canonical amino acids into proteins: recent developments and applications, Current Opinion in Chemical Biology, Volume 14, Issue 6, 2010, Pages 774-780, ISSN 1367-5931, doi.org/10.1016/j.cbpa.2010.09.013 (www.sciencedirect.com/science/article/pii/S1367593110001390), which is incorporated by reference herein in its entirety.
Examples of orthogonal translation in E. coli with a genome rewritten to exclude a subset of sense codons are described in, for example, Robertson W E, Funke LFH, de la Torre D, Fredens J, Elliott T S, Spinck M, Christova Y, Cervettini D, Böge FL, Liu K C, Buse S, Maslen S, Salmond GPC, Chin JW. Sense codon reassignment enables viral resistance and encoded polymer synthesis. Science. 2021 Jun. 4; 372(6546):1057-1062. doi: 10.1126/science.abg3029. PMID: 34083482; PMCID: PMC7611380, which is incorporated by reference herein in its entirety.
Additional examples of orthogonal translation are described in, for example, de la Torre, D., Chin, J. W. Reprogramming the genetic code. Nat Rev Genet 22, 169-184 (2021) (doi.org/10.1038/s41576-020-00307-7), which is incorporated by reference herein in its entirety.
Quantitative Reporter Platform to Evaluate ncAA Incorporation
In some embodiments, a precise plate-based assay using flow cytometry-based endpoint readouts can be used to measure efficiency and fidelity of an orthogonal translation system (as shown in
In some aspects, provided herein, is a method comprising: a) analyzing at least a portion of a genome of an organism to identify a first plurality of codons based at least in part on a first local context of a codon-of-interest in the genome of the organism to be rewritten; b) rewriting the first plurality of codons in the genome of the organism to a second codon, wherein the first plurality of codons and the second codon encode a first amino acid, and wherein the rewriting of the first plurality of codons modulates an occurrence of the first plurality of codons; and c) synthesizing a nucleic acid construct comprising the portion of the genome, wherein the first plurality of codons is rewritten to the second codon.
In some embodiments, the method further comprises introducing the nucleic acid construct into a cell of the organism to replace the portion of the genome of the organism. In some embodiments, the modulating of the occurrence of the first plurality of codons comprises eliminating the occurrence of the first plurality of codons. In some embodiments, the analyzing comprises identifying one or more synonymous codons with a least number of occurrences in the genome of the organism. In some embodiments, the first plurality of codons comprises the one or more synonymous codons with the least number of occurrences.
In some embodiments, the first local context of the codon-of-interest comprises C(n−1)−Cn−C(n+1), wherein C(n−1) denotes a codon downstream of the codon-of-interest; Cn denotes the codon-of-interest; and C(n+1) denotes a codon upstream of the codon-of-interest. In some embodiments, the analyzing further comprises determining a number of occurrences of the first local context of the codon-of-interest. In some embodiments, the analyzing further comprises determining a relative synonymous codon usage (RSCU) of the codon-of-interest.
In some embodiments, the analyzing further comprises identifying the first plurality of codons based at least in part on a second local context of the codon-of-interest in the genome of the organism. In some embodiments, the second local context of the codon-of-interest comprises C(n−1)−AAn−C(n+1), wherein C(n−1) denotes a codon downstream of the codon-of-interest; AAn denotes an amino acid encoded by the codon-of-interest; and C(n+1) denotes a codon upstream of the codon-of-interest. In some embodiments, the analyzing further comprises determining a number of occurrences of the second local context of the codon-of-interest. In some embodiments, the analyzing further comprises determining an expected number of occurrences of the first local context of the codon-of-interest. In some embodiments, the expected number of occurrences of the first local context of the codon-of-interest is determined as a product of: a number of occurrences of the second local context of the codon-of-interest, and the determined RSCU of the codon-of-interest.
In some embodiments, the analyzing comprises processing the at least the portion of the genome of the organism using a machine learning-based computer system. In some embodiments, the machine learning-based computer system comprises one or more storage units comprising, respectively, one or more storage devices included within respective storage arrays controlled by a respective one or more storage controllers; and one or more computer processing units, wherein the one or more computer processing units communicate with the one or more storage units over a communication interface.
In some embodiments, the analyzing further comprises identifying one or more statistically significant evolutionary signals. In some embodiments, the one or more statistically significant evolutionary signals comprise a negative evolutionary selection signal, a positive evolutionary selection signal, or a combination thereof. In some embodiments, the negative selection signal comprises a frameshift, a ribosome stall, or a secondary RNA structure interfering with transcription or translation. In some embodiments, the positive selection signal comprises a regulatory element within an open reading frame (ORF).
In some embodiments, the method further comprises reassigning the first plurality of codons to a second amino acid. In some embodiments, the first amino acid or the second amino acid comprises alanine, cysteine, aspartic acid, glutamic acid, phenylalanine, glycine, histidine, isoleucine, lysine, leucine, methionine, asparagine, proline, glutamine, arginine, serine, threonine, valine, tryptophan, or tyrosine. In some embodiments, the first amino acid comprises arginine, leucine, or serine. In some embodiments, the first plurality of codons comprises CGT, CGC, CGA, CGG, AGA, AGG, or a combination thereof. In some embodiments, the first plurality of codons comprises CGA, CGG, or a combination thereof. In some embodiments, the first plurality of codons comprises TTA, TTG, CTT, CTC, CTA, CTG, or a combination thereof. In some embodiments, the first plurality of codons comprises CTA, CTG, or a combination thereof. In some embodiments, the first plurality of codons comprises TCT, TCC, TCA, TCG, AGT, AGC, or a combination thereof. In some embodiments, the first plurality of codons comprises AGT, AGC, TCG, TCA, or a combination thereof.
In some embodiments, the rewriting further comprises removing a plurality of tRNA molecules with anticodons that recognize the first plurality of codons. In some embodiments, the removing comprises deleting one or more genes that encode the plurality of tRNA molecules that recognize the first plurality of codons. In some embodiments, the method further comprises providing additional tRNA molecules that recognize the first plurality of codons and aminoacyl-tRNA synthetases (aaRSs) for charging the additional tRNA molecules with the second amino acid. In some embodiments, the method further comprises providing a tRNA pre-charged with the second amino acid.
In some embodiments, the second amino acid comprises a non-canonical amino acid. In some embodiments, the non-canonical amino acid comprises p-azidophenylalanine, 2-aminoisobutyric acid (Aib), or a combination thereof.
In some embodiments, the rewriting of the first plurality of codons comprises modulating one or more codons in the first plurality of codons, wherein the one or more codons are within 4 codons of each other. In some embodiments, the rewriting of the first plurality of codons comprises modulating a codon fragment of one or more codons in the first plurality of codons. In some embodiments, the codon fragment comprises a trimer, a hexamer, a 9 mer, or a combination thereof.
In some aspects, provided herein, is a method of producing a polypeptide comprising a non-canonical amino acid (ncAA) or a population of polypeptide molecules comprising the ncAA in an organism, the method comprising: rewriting a first codon encoding a first amino acid to a second codon encoding the first amino acid in a genome of the organism, wherein the rewriting comprises identifying the first codon based at least in part on a first local context of a codon-of-interest in the genome of the organism; reassigning the first codon to encode the ncAA in the genome of the organism; and introducing into the organism an aminoacyl-tRNA synthetase (aaRS)/tRNA pair engineered to recognize the first codon and incorporate the ncAA into an amino acid sequence of the polypeptide or the population of the polypeptide molecules.
In some embodiments, the first codon has a least number of occurrences for the first amino acid in the genome of the organism. In some embodiments, the first local context of the codon-of-interest comprises C(n−1)−Cn−C(n+1), wherein C(n−1) denotes a codon downstream of the codon-of-interest; Cn denotes the codon-of-interest; and C(n+1) denotes a codon upstream of the codon-of-interest. In some embodiments, the rewriting comprises determining a number of occurrences of the first local context of the codon-of-interest. In some embodiments, the rewriting further comprises determining a relative synonymous codon usage (RSCU) of the codon-of-interest.
In some embodiments, the rewriting further comprises identifying the first codon based at least in part on a second local context of the codon-of-interest in the genome of the organism. In some embodiments, the second local context of the codon-of-interest comprises C(n−1)−AAn−C(n+1), wherein C(n−1) denotes a codon downstream of the codon-of-interest; AAn denotes an amino acid encoded by the codon-of-interest; and C(n+1) denotes a codon upstream of the codon-of-interest. In some embodiments, the rewriting further comprises determining a number of occurrences of the second local context of the codon-of-interest. In some embodiments, the rewriting further comprises determining an expected number of occurrences of the first local context of the codon-of-interest. In some embodiments, the expected number of occurrences of the first local context of the codon-of-interest is determined as a product of: a number of occurrences of the second local context of the codon-of-interest, and the determined RSCU of the codon-of-interest.
In some embodiments, the rewriting comprises analyzing at least a portion of the genome of the organism using a machine learning-based computer system. In some embodiments, the machine learning-based computer system comprises one or more storage units comprising, respectively, one or more storage devices included within respective storage arrays controlled by a respective one or more storage controllers; and one or more computer processing units, wherein the one or more computer processing units communicate with the one or more storage units over a communication interface.
In some embodiments, the method further comprises identifying one or more statistically significant evolutionary signals. In some embodiments, the one or more statistically significant evolutionary signals comprise a negative evolutionary selection signal, a positive evolutionary selection signal, or a combination thereof. In some embodiments, the negative selection signal comprises a frameshift, a ribosome stall, or a secondary RNA structure interfering with transcription or translation. In some embodiments, the positive selection signal comprises a regulatory element within an open reading frame (ORF).
In some embodiments, the first amino acid comprises alanine, cysteine, aspartic acid, glutamic acid, phenylalanine, glycine, histidine, isoleucine, lysine, leucine, methionine, asparagine, proline, glutamine, arginine, serine, threonine, valine, tryptophan, or tyrosine. In some embodiments, the first amino acid comprises arginine, leucine, or serine. In some embodiments, the first codon or the second codon comprises CGT, CGC, CGA, CGG, AGA, AGG, or a combination thereof. In some embodiments, the first codon comprises CGA, CGG, or a combination thereof. In some embodiments, the first codon or the second codon comprises TTA, TTG, CTT, CTC, CTA, CTG, or a combination thereof. In some embodiments, the first codon comprises CTA, CTG, or a combination thereof. In some embodiments, the first codon or the second codon comprises TCT, TCC, TCA, TCG, AGT, AGC, or a combination thereof. In some embodiments, the first codon comprises AGT, AGC, TCG, TCA, or a combination thereof.
In some embodiments, the first codon comprises a plurality of codons. In some embodiments, the rewriting further comprises removing a plurality of tRNA molecules that recognize the first codon. In some embodiments, the removing comprises deleting one or more genes that encode the plurality of tRNA molecules that recognize the first codon. In some embodiments, the introducing further comprises providing a tRNA pre-charged with the ncAA. In some embodiments, the ncAA comprises p-azidophenylalanine, 2-aminoisobutyric acid (Aib), or a combination thereof.
In some aspects, provided herein, is a method of producing a peptide, the method comprising editing a genome of an organism, wherein the editing comprises revising a codon of the genome to encode a non-canonical amino acid, wherein the peptide comprises the non-canonical amino acid.
In some aspects, provided herein, is a cell or a population of cells comprising a genome, wherein a first plurality of codons in the genome of the organism is rewritten to a second codon, wherein the first plurality of codons and the second codon encode a first amino acid, and wherein an occurrence of the first plurality of codons is modulated responsive to being rewritten to the second codon.
In some embodiments, the occurrence of the first plurality of codons is eliminated. In some embodiments, the first plurality of codons is reassigned to a second amino acid. In some embodiments, the first plurality of codons is identified based at least in part on a first local context of a codon-of-interest.
In some embodiments, the first local context of the codon-of-interest comprises C(n−1)−Cn−C(n+1), wherein C(n−1) denotes a codon downstream of the codon-of-interest; Cn denotes the codon-of-interest; and C(n+1) denotes a codon upstream of the codon-of-interest. In some embodiments, the identifying comprises determining a number of occurrences of the first local context of the codon-of-interest. In some embodiments, the identifying further comprises determining a relative synonymous codon usage (RSCU) of the codon-of-interest.
In some embodiments, the first plurality of codons is further identified based at least in part on a second local context of the codon-of-interest in the genome of the organism. In some embodiments, the second local context of the codon-of-interest comprises C(n−1)−AAn−C(n+1), wherein C(n−1) denotes a codon downstream of the codon-of-interest; AAn denotes an amino acid encoded by the codon-of-interest; and C(n+1) denotes a codon upstream of the codon-of-interest.
In some embodiments, the identifying further comprises determining a number of occurrences of the second local context of the codon-of-interest. In some embodiments, the identifying further comprises determining an expected number of occurrences of the first local context of the codon-of-interest. In some embodiments, the expected number of occurrences of the first local context of the codon-of-interest is determined as a product of: a number of occurrences of the second local context of the codon-of-interest, and the determined RSCU of the codon-of-interest.
In some embodiments, the identifying comprises analyzing at least a portion of the genome of the organism using a machine learning-based computer system. In some embodiments, the machine learning-based computer system comprises one or more storage units comprising, respectively, one or more storage devices included within respective storage arrays controlled by a respective one or more storage controllers; and one or more computer processing units, wherein the one or more computer processing units communicate with the one or more storage units over a communication interface.
In some embodiments, the identifying further comprises identifying one or more statistically significant evolutionary signals. In some embodiments, the one or more statistically significant evolutionary signals comprise a negative evolutionary selection signal, a positive evolutionary selection signal, or a combination thereof. In some embodiments, the negative selection signal comprises a frameshift, a ribosome stall, or a secondary RNA structure interfering with transcription or translation. In some embodiments, the positive selection signal comprises a regulatory element within an open reading frame (ORF). In some embodiments, the cell or the population of cells comprises a eukaryotic cell or a prokaryotic cell. In some embodiments, the prokaryotic cell comprises an archaebacteria cell, a bacterial cell, or a combination thereof. In some embodiments, the eukaryotic cell comprises a yeast cell, a fungal cell, a plant cell, an animal cell, an insect cell, a mammalian cell, or a combination thereof. In some embodiments, the mammalian cell comprises a rodent cell, a mouse cell, or a human cell, or a combination thereof.
In some embodiments, the first amino acid comprises alanine, cysteine, aspartic acid, glutamic acid, phenylalanine, glycine, histidine, isoleucine, lysine, leucine, methionine, asparagine, proline, glutamine, arginine, serine, threonine, valine, tryptophan, or tyrosine. In some embodiments, the first amino acid comprises arginine, leucine, or serine. In some embodiments, the first plurality of codons comprises CGT, CGC, CGA, CGG, AGA, AGG, or a combination thereof. In some embodiments, the first plurality of codons comprises CGA, CGG, or a combination thereof. In some embodiments, the first plurality of codons comprises TTA, TTG, CTT, CTC, CTA, CTG, or a combination thereof. In some embodiments, the first plurality of codons comprises CTA, CTG, or a combination thereof. In some embodiments, the first plurality of codons comprises TCT, TCC, TCA, TCG, AGT, AGC, or a combination thereof. In some embodiments, the first plurality of codons comprises AGT, AGC, TCG, TCA, or a combination thereof.
In some embodiments, the second amino acid comprises alanine, cysteine, aspartic acid, glutamic acid, phenylalanine, glycine, histidine, isoleucine, lysine, leucine, methionine, asparagine, proline, glutamine, arginine, serine, threonine, valine, tryptophan, or tyrosine. In some embodiments, the second amino acid comprises a non-canonical amino acid (ncAA). In some embodiments, the ncAA comprises p-azidophenylalanine, 2-aminoisobutyric acid (Aib), or a combination thereof.
In some aspects, provided herein, is an organism comprising the cell or the population of cells described herein.
In some aspects, provided herein, is a computer system for editing a genome of an organism, comprising: a database that is configured to store at least a portion of the genome of the organism; and one or more computer processors operatively coupled to said database, wherein said one or more computer processors are individually or collectively programmed to: a) analyze the at least the portion of the genome of the organism to identify a first plurality of codons in the genome of the organism to be rewritten; and b) rewrite the first plurality of codons in the genome of the organism to a second codon, wherein the first plurality of codons and the second codon encode a first amino acid, and wherein the rewriting of the first plurality of codons modulates an occurrence of the first plurality of codons, thereby editing the genome of the organism.
In some aspects, provided herein, is a non-transitory computer-readable storage medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for editing a genome of an organism, the method comprising: a) analyzing at least a portion of the genome of the organism to identify a first plurality of codons in the genome of the organism to be rewritten; and b) rewriting the first plurality of codons in the genome of the organism to a second codon, wherein the first plurality of codons and the second codon encode a first amino acid, and wherein the rewriting of the first plurality of codons modulates an occurrence of the first plurality of codons, thereby editing the genome of the organism.
These examples are provided for illustrative purposes only and not to limit the scope of the claims provided herein.
For maximum flexibility in selecting replacement codons, amino acids encoded by 6 different codons are used for this example using Saccharomyces cerevisiae as the model organism. As this example focuses on DNA genes, DNA nomenclature, e.g., A, C, G, or T, is used.
Leucine: Leucine may be encoded by a set of 6 codons, which include CTT, CTC, CTG, CTA, TTG, and TTA. The choices are to rewrite CTG/CTA (1.42% of all Leucine codons) or TTG/TTA (5.2% of all Leucine codons). To reduce the number of rewritten codons, CTG/CTA is chosen to be rewritten. It's noteworthy that the Candida genus of yeast has lineages in which CTG has been reassigned from leucine (the ancestral state) to serine.
This demonstrates the ability to reassign this codon. The leucine anticodons for the 4-block are GAG (1 copy) and TAG (3 copies). It is most likely the TAG anticodon that decodes CTG. The GAG anticodon may decode CTC and CTT. Deleting the GAG anticodon tRNA (YNCG0028W) causes no fitness defect, which suggests that the 3-copy TAG anticodon tRNA can decode the codons otherwise read by the GAG anticodon tRNA. Candida species have additional tRNAs with the AAG anticodon for the 4-block. If the TAG tRNAs are deleted, then these additional tRNAs may have to be supplied.
Leucine design summary: rewrite CTG/CTA codons, or possibly just the CTG codons. Delete the tL(TAG) genes, 3 copies. Possibly supplement with tL(AAG) tRNA genes from a related yeast species.
Serine: Serine may be encoded by a set of 6 codons, which include TCT, TCC, TCG, TCA, AGT, and AGC. The candidates for rewriting are TCG/TCA (2.78% of all serine codons) or AGT/AGC (2.47% of all serine codons). For the TCG/TCA choice, the anticodons are tS(CGA) 1 copy and tS(TGA) 3 copies. For the AGT/AGC choice, the anticodons are tS(GCT) 4 copies. Although in some embodiments it is favored to rewrite codons ending in G, in this case it may be reasonable to rewrite the AGT/AGC pair, because the GCT anticodon may not give cross-talk outside of the AGT/AGC 2-block.
Serine design summary, design 1: rewrite TCG/TCA codons, delete tS(CGA) 1 copy, tS(TGA) 3 copies. Increase copy numbers of other tS tRNA genes.
Serine design summary, design 2: rewrite AGT, AGC codons, delete tS(GCT) 4 copies. Increase copy numbers of other tS tRNA genes.
Arginine: Arginine may be encoded by a set of 6 codons, which include CGT, CGC, CGG, CGA, AGG, and AGA. The choices are to rewrite CGG/CGA (0.56% of all arginine codons) or AGG/AGA (3.110% of all arginine codons). To reduce the number of rewritten codons, CGG/CGA is chosen to be rewritten. The anticodons in the 4-block are ACG (6 copies) and CCG (1 copy). The single-copy CCG anticodon tRNA is TRR4. It is an essential tRNA gene, suggesting that no other tRNA recognizes CGG. Rewriting CGG and deleting TRR4 may permit use of CGG for orthogonal translation. In this case it may not be necessary to rewrite CGA because it is decoded by the ACG tRNA that may not recognize CGG.
Arginine design summary: rewrite CGG/CGA codons, delete tR(CCG) single-copy tRNA. Possibly increase copy number of remaining Arg tRNA genes to account for rewritten codons.
Leu CTG/CTA rewrite: 69K codons, 3 tRNAs.
Arg CGG/CGA rewrite: 14K codons, 1 tRNA.
Ser AGT/AGC rewrite: 70K codons, 4 tRNAs.
Ser TCG/TCA rewrite: 78K codons, 4 tRNAs.
Total over 6 codons: ~160K codons to rewrite.
5 regions of 20 kb each, 7 designs per region, 700 kb total.
‘Individual’ designs: 2 codons removed: Leu, Arg, Ser.
‘Paired’ designs: 4 codons removed: Leu/Arg, Leu/Ser, Arg/Ser.
‘All’ design: 6 codons removed: Leu/Arg/Ser.
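The design count above can be checked with a short enumeration. The following is a minimal sketch, assuming that each design corresponds to a non-empty subset of the three amino acids; the variable names are illustrative only and are not taken from the example.

```python
from itertools import combinations

AMINO_ACIDS = ("Leu", "Arg", "Ser")   # amino acids whose codons are rewritten
REGIONS = 5                            # number of genomic regions
REGION_KB = 20                         # size of each region in kb

# Individual (size 1), paired (size 2), and all (size 3) designs.
designs = [combo for size in (1, 2, 3)
           for combo in combinations(AMINO_ACIDS, size)]

# Every design is synthesized for every region.
total_kb = REGIONS * REGION_KB * len(designs)

for combo in designs:
    print("/".join(combo))
print(len(designs), "designs per region;", total_kb, "kb total")
```

The enumeration yields three individual designs, three paired designs, and one design covering all three amino acids, which matches the 7 designs per region and 700 kb total stated above.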
A simple method for rewriting a codon is to change a nucleotide in the wobble position (third position of a codon) in a way that retains GC content. For example, a codon that ends with G or A in a 4-codon block (4 codons encoding a same amino acid) may be changed to end with C or T, respectively. Alternatively, a codon may be changed to another codon having the highest frequency for that specific amino acid.
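The wobble-position rule can be illustrated as follows. This is a minimal sketch, assuming the input codon belongs to a 4-codon block so that any third-position change is synonymous; the function names rewrite_wobble and gc_count are hypothetical.

```python
# GC-preserving wobble swap: G -> C and A -> T in the third codon position.
WOBBLE_SWAP = {"G": "C", "A": "T"}

def rewrite_wobble(codon: str) -> str:
    """Return the codon with its third base swapped G->C or A->T.

    Codons ending in C or T are returned unchanged. The caller is
    assumed to pass only codons from a 4-codon block.
    """
    codon = codon.upper()
    if len(codon) != 3:
        raise ValueError("a codon must have exactly 3 nucleotides")
    return codon[:2] + WOBBLE_SWAP.get(codon[2], codon[2])

def gc_count(seq: str) -> int:
    """Number of G or C nucleotides in the sequence."""
    return sum(base in "GC" for base in seq.upper())

for old in ("CTG", "CTA", "CGG", "CGA", "TCG", "TCA"):
    new = rewrite_wobble(old)
    print(old, "->", new, "GC content kept:", gc_count(old) == gc_count(new))
```

The swap does not apply to 2-codon blocks such as AGT/AGC, where a replacement codon is chosen from a different block.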
The Goldilocks method for codon replacement can start with examining the local context of a codon. First, the frequency of each single codon is determined, and the relative synonymous codon usage (RSCU) may be determined (e.g., as the frequency of a codon divided by the frequency of all codons encoding the same amino acid). Second, the context of a codon is determined considering the preceding codon, the codon under consideration, and the subsequent codon. A protein-coding gene of a host species is examined, and the number of times each codon-codon-codon 9 mer occurs is determined. For example, in yeast, there are 4^9 (= 262,144) different 9 mers and approximately 3 million codons. On average, each 9 mer occurs 11 times. The observed number of occurrences of the 9 mer may be defined as O(9 mer). The 9 mer contexts are then converted to patterns of codon-amino acid (aa)-codon, wherein aa is the amino acid encoded by the central codon. There are 4^3×20×4^3 (= 81,920) different patterns.
Next, the number of times that the central codon is expected to be observed under the null hypothesis is the number of times that the codon-aa-codon pattern occurs times the RSCU for the central codon. This is denoted as E(9 mer) for the expected number of occurrences of the 9 mer.
The p-value is then determined for a two-sided Poisson test for enrichment or depletion of the 9 mer relative to the null distribution. Standard significance at the 0.05 level, corrected for 262,144 9 mer tests, requires a single-test p-value of 1.9E-7 for significance.
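The expected count and the Poisson test can be sketched as follows. This is a minimal sketch using only the Python standard library. It assumes one common convention for a two-sided Poisson p-value (twice the smaller tail probability, capped at 1), which may differ from the convention used in practice; the counts are toy values and the function names are hypothetical.

```python
import math

N_TESTS = 4 ** 9            # 262,144 possible 9 mers
ALPHA = 0.05 / N_TESTS      # Bonferroni-corrected single-test threshold

def poisson_pmf(k: int, mu: float) -> float:
    """P(X = k) for a Poisson variable with mean mu > 0."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def poisson_lower_tail(k: int, mu: float) -> float:
    """P(X <= k)."""
    return min(1.0, sum(poisson_pmf(i, mu) for i in range(k + 1)))

def poisson_upper_tail(k: int, mu: float) -> float:
    """P(X >= k), summed directly when k > mu to keep small values accurate."""
    if k <= 0:
        return 1.0
    if k <= mu:
        return max(0.0, 1.0 - poisson_lower_tail(k - 1, mu))
    total, i = 0.0, k
    while True:
        term = poisson_pmf(i, mu)      # terms decrease monotonically for i > mu
        total += term
        if term <= total * 1e-17:      # remaining terms are negligible
            break
        i += 1
    return min(1.0, total)

def two_sided_poisson_p(observed: int, expected: float) -> float:
    """Two-sided p-value: twice the smaller tail, capped at 1 (assumed convention)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    tail = min(poisson_lower_tail(observed, expected),
               poisson_upper_tail(observed, expected))
    return min(1.0, 2.0 * tail)

# Toy values: the codon-aa-codon pattern occurs 200 times, RSCU of the central codon is 0.05.
context_count, rscu_central, observed = 200, 0.05, 40
expected = context_count * rscu_central          # E(9 mer)
p_value = two_sided_poisson_p(observed, expected)
print(f"E = {expected:.1f}, O = {observed}, significant = {p_value < ALPHA}")
```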
The 9 mers that are over-represented or under-represented suggest selective pressure. Over-represented 9 mers may include regulatory motifs. Under-represented 9 mers may have undesired functions, such as frameshifts. The Goldilocks approach may have a goal to avoid creating 9 mers that have a significant deviation from the null.
One implementation is to use a simple codon replacement (maintaining GC content as described in Example 3) unless the result creates a 9 mer that deviates from the null, in which case an alternative is selected. An alternative implementation is to choose the new codon as the 9 mer whose observed frequency is closest to the expected frequency, excluding 9 mers whose central codon is in the set to be replaced. For repeated occurrences of codons that are to be replaced, the Goldilocks method may be applied in overlapping 9 mer windows across the region.
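The alternative implementation can be sketched as a selection over synonymous codons. This is a minimal sketch, assuming that "closest" is measured by the absolute log ratio of observed to expected counts with a pseudocount of 0.5; that distance measure, the toy counts, and the name choose_replacement are assumptions for illustration only.

```python
import math

def choose_replacement(prev, nxt, synonyms, forbidden, observed,
                       context_count, rscu):
    """Pick the synonymous central codon whose 9 mer count is closest to expectation.

    prev, nxt      flanking codons of the position being rewritten
    synonyms       all codons encoding the central amino acid
    forbidden      codons in the set to be replaced (never chosen)
    observed       dict mapping 9 mer -> observed count O(9 mer)
    context_count  number of occurrences of the codon-aa-codon pattern
    rscu           dict mapping codon -> RSCU
    """
    best, best_dev = None, math.inf
    for codon in synonyms:
        if codon in forbidden:
            continue
        expected = context_count * rscu[codon]            # E(9 mer)
        obs = observed.get(prev + codon + nxt, 0)         # O(9 mer)
        # Assumed distance: |ln(O/E)| with a 0.5 pseudocount on both counts.
        dev = abs(math.log((obs + 0.5) / (expected + 0.5)))
        if dev < best_dev:
            best, best_dev = codon, dev
    return best

# Toy data for a leucine position flanked by GCT and GAA (values are made up).
leu = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]
toy_rscu = {"TTA": 0.28, "TTG": 0.29, "CTT": 0.13,
            "CTC": 0.06, "CTA": 0.14, "CTG": 0.10}
toy_observed = {"GCTTTAGAA": 50, "GCTTTGGAA": 29,
                "GCTCTTGAA": 5, "GCTCTCGAA": 9}
choice = choose_replacement("GCT", "GAA", leu, {"CTA", "CTG"},
                            toy_observed, 100, toy_rscu)
print("replacement codon:", choice)
```

For consecutive codons that are both to be replaced, the choice at one position changes the flanking codon of the next, so the selection would be repeated across overlapping windows.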
This example uses the Goldilocks method to rewrite yeast protein-coding genes. This example uses computer files with the following directory structure (Table 5).
Translation tables were retrieved from NCBI from:
www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Yeast ORFs were retrieved from the Saccharomyces Genome Database (SGD) archive at: sgd-archive.yeastgenome.org/?prefix=sequence/S288C_reference/
This release is Genome Release 64-3-1.
The ORF files have the following counts:
Mitochondrial genes are excluded because the application is to the nuclear genome, not the mitochondrial genome. Codon usage differs between the nuclear and mitochondrial genomes, and in some organisms the genetic codes are also different.
The transposable element genes are excluded for two reasons. First, transposable elements are parasitic DNA that may be better removed, and therefore may not be retained in a rewritten genome. Second, transposable elements have very similar DNA sequences because of recent common ancestors. Their codon usage does not necessarily match the codon usage of the rest of the yeast genome. This can create a spurious statistical signal.
Pseudogenes are excluded because mutations are free to occur in non-functional DNA.
Codon counts, amino acids counts, and relative synonymous codon usage (RSCU)
The codon count for each codon, including stop codons, is then determined. For simplicity, when writing “for each amino acid”, the stop symbols and their codons TAA, TAG, and TGA (UAA, UAG, and UGA in RNA) are included among the amino acids. The translation table for the organism is used (see Tables 6A and 6B; translation table 1 for yeast, or the standard table from the website provided above) to map codons to amino acids. The number of codons for each amino acid is determined. Then for each codon, the RSCU is determined (e.g., as the number of counts for the codon divided by the number of counts for all codons for the same amino acid).
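The counting step can be sketched as follows. This is a minimal sketch, assuming the coding sequences are already loaded as in-frame DNA strings and that the standard genetic code (translation table 1) applies; the function names are hypothetical. Note that RSCU as defined in this example is the fraction of an amino acid's codons, so the values for one amino acid sum to 1.

```python
from collections import Counter
from itertools import product

# Standard genetic code (translation table 1), bases ordered T, C, A, G; "*" marks stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_counts(cds_list):
    """Count every in-frame codon, including the stop codon, over all CDSs."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        if len(cds) % 3:
            raise ValueError("CDS length is not a multiple of 3")
        counts.update(cds[i:i + 3] for i in range(0, len(cds), 3))
    return counts

def rscu(counts):
    """RSCU per codon: codon count divided by the count of all synonymous codons."""
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TABLE[codon]] += n
    return {codon: n / aa_totals[CODON_TABLE[codon]] for codon, n in counts.items()}

toy_cds = ["ATGCTGCTGCTATAA"]          # Met-Leu-Leu-Leu-stop (toy sequence)
counts = codon_counts(toy_cds)
usage = rscu(counts)
for codon in sorted(usage):
    print(codon, CODON_TABLE[codon], counts[codon], round(usage[codon], 3))
```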
Results for yeast are based on 2,832,327 codons and are in the Table 6C (amino acid counts), Table 6D (codon counts and RSCU for the original yeast genome), and Table 6E (codon counts and RSCU for the yeast genome after rewriting).
Next, the frequency of 9 mers in coding domains is determined. The 9 mers are in-frame sliding windows across the coding sequence (CDS). A CDS with n amino acids (including the stop codon) may have (n−2) different 9 mers. The total number of 9 mers determined is 2,820,515 and the number of unique 9 mers is 215,766. The maximum number of unique 9 mers is not 64*64*64=262,144, but rather 61*61*64=238,144, because stop codons can only occur in the third position. The actual number observed is smaller because some codon patterns are too rare to be observed.
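The in-frame sliding window can be sketched as follows. This is a minimal sketch, assuming each CDS is an in-frame DNA string that ends with its stop codon; the function name inframe_9mers is hypothetical.

```python
from collections import Counter

def inframe_9mers(cds: str):
    """Yield codon-codon-codon windows, stepping one codon (3 nt) at a time.

    A CDS with n codons (including the stop codon) yields n - 2 windows.
    """
    cds = cds.upper()
    for i in range(0, len(cds) - 8, 3):
        yield cds[i:i + 9]

def count_9mers(cds_list):
    """Observed count O(9 mer) over all coding sequences."""
    counts = Counter()
    for cds in cds_list:
        counts.update(inframe_9mers(cds))
    return counts

toy_cds = "ATGCTGCTGCTATAA"                  # 5 codons -> 3 windows
windows = list(inframe_9mers(toy_cds))
print(windows)

# Upper bound on unique 9 mers: a stop codon can only occupy the third position.
max_unique = 61 * 61 * 64
print("maximum unique 9 mers:", max_unique)
```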
Codon-codon-codon patterns are then converted to contexts, which are codon-aa-codon patterns. There are 61*20*64=78,080 possible contexts, of which 75,918 are observed in the yeast genome.
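The conversion from 9 mers to contexts can be sketched as follows. This is a minimal sketch, assuming the standard genetic code and 9 mers supplied as DNA strings; the names to_context and central_codon_counts are hypothetical. The table of central-codon counts per context is the input to the statistical test.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (translation table 1), bases ordered T, C, A, G; "*" marks stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def to_context(ninemer: str):
    """Map codon-codon-codon to (codon, amino acid of the central codon, codon)."""
    return ninemer[0:3], CODON_TABLE[ninemer[3:6]], ninemer[6:9]

def central_codon_counts(ninemers):
    """For each context, count how often each central codon n(c) is used."""
    table = defaultdict(Counter)
    for ninemer in ninemers:
        table[to_context(ninemer)][ninemer[3:6]] += 1
    return table

# Number of possible contexts: sense codon, amino acid, any codon.
n_sense = sum(aa != "*" for aa in CODON_TABLE.values())
n_amino = len(set(CODON_TABLE.values()) - {"*"})
max_contexts = n_sense * n_amino * len(CODON_TABLE)
print("possible contexts:", max_contexts)

toy = ["GCTCTGGAA", "GCTCTAGAA", "GCTCTGGAA"]
table = central_codon_counts(toy)
print(dict(table[("GCT", "L", "GAA")]))
```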
Next for each context, a test of the null hypothesis is performed that the frequency of the central codon, conditioned on the context of the surrounding codons, follows the same distribution as the RSCU. This is performed as a single statistical test for all the possible central codons given the central amino acid.
The test is motivated by considering a likelihood ratio test with test statistic

$$Q = -2 \ln \frac{\Pr(D \mid \text{null})}{\Pr(D \mid \text{ML})},$$
where Pr(D|null) is the probability of central codon counts under the null distribution given by the genome-wide RSCU, and Pr(D|ML) is the probability of the central codon counts under an alternative distribution in which the codon usage depends on the context defined by the outer codons, using the maximum likelihood estimator for the model parameters. Under the null, Q follows a chi-square distribution with a number of degrees of freedom (df) equal to the number of possible codons minus 1. Thus, for amino acids with a single codon, the test has 0 df (only a single choice), amino acids with 2 codons have 1 df, amino acids with 4 codons have 3 df, and amino acids with 6 codons have 5 df. The stop signal has 3 codons and 2 df.
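For a given context, let c be one of the possible codons, r(c) be the RSCU for that codon, and n(c) be the number of times that codon occurs in the central position of that context. Under the null,

$$\Pr(D \mid \text{null}) \propto \prod_c r(c)^{n(c)},$$

with a multinomial coefficient that is identical under both models and cancels in the ratio.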
For the ML distribution, the standard result is that the maximum likelihood probabilities are the observed probabilities. Let N=sum_c n(c) be the number of examples of the context. The maximum likelihood estimate for the frequency of codon c is determined as:

$$\hat{p}(c) = \frac{n(c)}{N}.$$
Putting this together,
Note that the argument of the logarithm is the ratio of the number of codons observed to the number expected under the null.
In the case that a particular codon is not observed, n(c)=0 and the corresponding term n(c) ln[n(c)/(N r(c))] takes its limiting value of 0.
There are no problems with divergences. Other statistical tests are possible, including using pseudocounts to smooth out the distributions.
The single-tailed p-value is then determined for the chi-square values to identify contexts whose codon usage differs from the null. For a stringent family-wise error of 0.05, an individual test p-value is required to be smaller than 0.05/78,080=6.4E-7.
The likelihood ratio statistic is asymptotically chi-square distributed, but for small numbers of observations there are standard corrections. Therefore, a chi-square test is also performed as implemented by scipy.stats.chisquare (Pearson's chi-square test), which takes as arguments the same lists of observed and expected counts, including the zero counts. The test statistics and p-values may be very similar.
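The two tests can be illustrated with the following minimal sketch, which uses only the Python standard library. The closed-form `chi2_sf` helper (valid for integer df) is an assumption made here to keep the sketch self-contained; `scipy.stats.chi2.sf` and `scipy.stats.chisquare` could be used instead. The observed and expected counts are the lysine counts reported below for the context GAA_K_AAA, and the function names are hypothetical.

```python
import math


def likelihood_ratio_q(observed, expected):
    """Q = 2 * sum_c n(c) ln[n(c) / (N r(c))]; unobserved codons contribute 0."""
    return 2.0 * sum(n * math.log(n / e)
                     for n, e in zip(observed, expected) if n > 0)


def pearson_chi_square(observed, expected):
    """Pearson statistic, sum_c (n(c) - N r(c))^2 / (N r(c))."""
    return sum((n - e) ** 2 / e for n, e in zip(observed, expected))


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1 (closed form)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = total = math.exp(-half)
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Odd df: erfc term plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range((df - 1) // 2):
        total += term
        term *= x / (2 * j + 3)
    return total


observed = [195, 343]   # AAA, AAG observed in context GAA_K_AAA
expected = [312, 226]   # N r(c) under the genome-wide null
df = len(observed) - 1

q = likelihood_ratio_q(observed, expected)
x2 = pearson_chi_square(observed, expected)
p_value = chi2_sf(q, df)
threshold = 0.05 / 78080  # family-wise error of 0.05 over all possible contexts
significant = p_value < threshold
```

In this example the likelihood ratio and Pearson statistics are close to one another, consistent with the statement that the two tests may give very similar results.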
A small p-value can result from many observations with a small difference between observed and expected counts, or from fewer observations with a larger difference between observed and expected counts. The difference is quantified as a weighted geometric mean of the observed-to-expected ratio magnitudes as follows.
Let n(c) be the number of occurrences of codon c as before, and N r(c) be the null expectation as before. The weighted log-ratio w is determined as:
w = (1/N) sum_c n(c) |ln[n(c)/(N r(c))]|,
where the vertical bars indicate absolute value. The absolute value is taken to count both enrichment, n(c) higher than expected, and depletion, n(c) lower than expected, as contributing their magnitudes rather than cancelling each other out.
The ratio magnitude R is then determined as:
R = exp(w).
For a context with a small p-value and large ratio magnitude, it is instructive to examine the under-represented codon choices and over-represented codon choices. For a codon c, the regularized log-ratio is determined as:
LR(c) = ln[max(n(c), 0.5)/(N r(c))],
which is just the log ratio, but with n(c) changed from 0 to 0.5 for codons that are never observed. Then, within each context, the 9 mer patterns with the most negative LR and the most positive LR are provided.
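A minimal sketch of the ratio magnitude and regularized log-ratio follows, assuming the weights in the geometric mean are the observed fractions n(c)/N. The function names are hypothetical; the counts are the lysine counts reported below for the context GAA_K_AAA.

```python
import math


def ratio_magnitude(observed, expected):
    """R = exp(w), with w the n(c)/N-weighted mean of |ln(observed/expected)|."""
    total = sum(observed)
    w = sum(n * abs(math.log(n / e))
            for n, e in zip(observed, expected) if n > 0) / total
    return math.exp(w)


def regularized_log_ratio(n, e):
    """Log ratio of observed to expected, with a count of 0 replaced by 0.5."""
    return math.log(max(n, 0.5) / e)


observed = [195, 343]   # AAA, AAG observed
expected = [312, 226]   # AAA, AAG expected under the null

r_mag = ratio_magnitude(observed, expected)
log_ratios = [regularized_log_ratio(n, e) for n, e in zip(observed, expected)]
```

Here `r_mag` comes out close to the 1.5-fold change quoted for this context, and the sign of each entry in `log_ratios` indicates depletion (negative) or enrichment (positive).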
Contexts, their observed and null hypothesis counts of central codons, p-values, and ratios are provided in Table 6F (context_cnt.txt as tab-delimited text). Amino acids with a single codon are included in the results. For these amino acids, observed and expected counts are identical, and all p-values are set to 1.
The number of contexts with p-value below 6.4E-7 is 584. The rows of the context_cnt.txt belonging to this subset are provided in Table 9. A few of the patterns observed are discussed.
One pattern of depleted codon use is to avoid creating codon patterns that are slippery sites for ribosomal frameshifting. An exemplary pattern for a slippery site is:
nnX XXY YYZ
where spaces indicate codon boundaries, X and Y may be A or T, YYZ may be AAC or TTA, and the small n's at the beginning of the pattern may be any nucleotides. This site promotes a −1 frameshift in which the new codon boundaries are:
nn XXX YYY Z.
Note that in both the original reading frame and in the −1 frameshift, the first two codon positions are XX in the second codon and YY in the third codon. The only changes in base pairing are at the wobble position of each codon.
See, for example, these references:
An example is the context GAA_K_AAA encoding the three amino acids E_K_K. There are two possible choices for the lysine codon, AAA (195 observed, 312 expected) and AAG (343 observed, 226 expected). The 1.5-fold change from the expected distribution is highly significant, p=2.eE-24.
A second example is the context GGT_G_GGT encoding the three amino acids G_G_G. The most depleted central codon is GGG (5 observed, 28 expected), and the most enriched is GGT (172 observed, 102 expected). The mean ratio magnitude is 1.8, p=1.8E-19.
A third example is the context CTC_P_TTG encoding the three amino acids L_P_L. The most depleted central codon is CCT (0 observed, 3 expected). This creates a possible slippery site with a −1 frameshift:
CTC CCT TTG -> CT CCC TTT G
The most enriched is CCC (22 observed, 4 expected), which eliminates the slippery site.
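The slippery-site pattern can be screened for with a short check such as the following. This is a simplified sketch: it tests only the repeat structure nnX XXY YYZ of an in-frame 9-mer and does not restrict which nucleotides X, Y, and Z may be. The function names are hypothetical.

```python
def is_slippery(ninemer):
    """True if an in-frame 9-mer has the repeat structure nnX XXY YYZ."""
    s = ninemer.upper()
    x, y = s[2], s[5]
    # Second codon must start with XX and third codon must start with YY.
    return s[3:5] == x * 2 and s[6:8] == y * 2


def minus_one_frame(ninemer):
    """Regroup a 9-mer into the -1 frame: nn | XXX | YYY | Z."""
    s = ninemer.upper()
    return s[:2], s[2:5], s[5:8], s[8:]


candidates = ["CTCCCTTTG", "CTCCCCTTG", "GAAAAAAAA", "GAAAAGAAA"]
flags = {seq: is_slippery(seq) for seq in candidates}
shifted = minus_one_frame("CTCCCTTTG")
```

With the contexts discussed above, choosing CCC instead of CCT in CTC_P_TTG, or AAG instead of AAA in GAA_K_AAA, removes the repeat structure.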
Some patterns of context-dependent codon usage match regulatory signal sequences. An example is the ACCCA sequence recognized by the Rap1p binding protein:
This sequence can cause transcriptional silencing, and inadvertent creation of a Rap1p binding site created a fitness defect in Sc2.0 synthetic chromosome synX:
The context TTA_P_AGA, with amino acids L_P_R, has a depleted central codon CCC (2 observed, 11 expected) that creates the ACCCA Rap1p binding motif. The most enriched central codon is CCA (50 observed, 27 expected), with a mean ratio magnitude 1.9 and p=3.7E-7.
The inspiration for Goldilocks is codon usage that is not too hot, not too cold, but just right for the context. Given a set of codons to avoid throughout the genome, the codon is mapped to the amino acid, and then a replacement codon is determined based at least in part on statistical analysis of a local context of the replacement codon.
A one-pass Goldilocks algorithm is performed as follows, processing each CDS in turn:
An implementation of a one-pass Goldilocks algorithm is provided, along with sample input and output for the entire yeast genome. The codons removed are as follows (Table 7):
The method rewrites 164,568 out of 2,832,327 codons=5.8% of the total codons.
The output CDS records are validated to lack any instances of the codons, and the translation of the CDS is validated to be identical to the original translation.
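One way a one-pass rewrite with validation could look is sketched below. This is not the implementation referred to above. It assumes that the replacement is the permitted synonymous codon observed most often in the context formed by the current left neighbor and the original right neighbor, that ties and unseen contexts fall back to list order, and that context counts are keyed by (left codon, amino acid, right codon). The counts, the set of codons to avoid, and all names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def split_codons(cds):
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def rewrite_one_pass(cds, avoid, context_counts):
    """Replace each avoided codon with the permitted synonym seen most in context."""
    codons = split_codons(cds)
    for i, codon in enumerate(codons):
        if codon not in avoid:
            continue
        aa = CODON_TABLE[codon]
        permitted = [c for c in SYNONYMS[aa] if c not in avoid]
        if not permitted:
            raise ValueError(f"no permitted codon for amino acid {aa}")
        left = codons[i - 1] if i > 0 else None
        right = codons[i + 1] if i + 1 < len(codons) else None
        counts = context_counts.get((left, aa, right), {})
        # max() keeps the first candidate on ties (an arbitrary fallback).
        codons[i] = max(permitted, key=lambda c: counts.get(c, 0))
    return "".join(codons)


avoid = {"TCA", "TCG"}                      # hypothetical rewrite set
toy_counts = {("GCT", "S", "GGT"): {"TCT": 12, "AGC": 30, "TCA": 8}}
old_cds = "ATGGCTTCAGGTTAA"                 # M A S G stop
new_cds = rewrite_one_pass(old_cds, avoid, toy_counts)

# Validation: no avoided codon remains in frame and the translation is unchanged.
assert not any(c in avoid for c in split_codons(new_cds))
assert translate(new_cds) == translate(old_cds)
```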
The one-pass method described above is appropriate for separated instances of codons to rewrite. If adjacent codons are in the rewrite set, however, then rewriting one changes the context for the other. There are many instances of this in the yeast genome. For each CDS, the maximum run length of codons to rewrite was determined. These are the rewrite lengths and numbers of genes (Table 8):
The gene with the longest run length of 13 codons in a row is YGR130C SGDID:S000003362, Chr VII from 753844-751394, Genome Release 64-3-1, reverse complement, Verified ORF, “Component of the eisosome with unknown function; GFP-fusion protein localizes to the cytoplasm; specifically phosphorylated in vitro by mammalian diphosphoinositol pentakisphosphate (IP7)”, which is incorporated by reference herein in its entirety.
This is the protein sequence with a run of 16 serine residues highlighted in bold, with many encoded by TCA and TCG codons in the set to be rewritten.
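The maximum run length of adjacent codons in the rewrite set can be determined per CDS with a single scan, as in this minimal sketch. The function name and the toy rewrite set are hypothetical.

```python
def max_rewrite_run(cds, avoid):
    """Length of the longest run of consecutive in-frame codons found in `avoid`."""
    best = run = 0
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in avoid:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


avoid = {"TCA", "TCG"}
# ATG TCA TCG TCA GGT TCA TAA: the longest run of avoided codons has length 3.
longest = max_rewrite_run("ATGTCATCGTCAGGTTCATAA", avoid)
```

A CDS whose longest run is 1 can be handled by the one-pass method; longer runs are the case addressed by the dynamic programming optimization.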
A dynamic programming optimization proceeds as follows. Suppose a sequence of n codons, numbered 1 through n, must be rewritten. Denote c(1) as a permitted codon for position 1, which means that it encodes the same amino acid as the original codon and it is not in the set of codons to remove. Similarly c(2) is a permitted codon for position 2, and so on. Codons c(0) and c(n+1) are fixed by the pre-existing codons, which by definition are outside the set to be removed. As described above, the boundary case that c(1) is the start codon should not occur because ATG is the only start codon. The boundary case that c(n) is the stop codon is a special case in which our favored design uses only a single stop codon, TGA.
Denote the score for a codon as a value that increases monotonically with our preference for the context with that codon in the middle. Scores should be additive. A suitable value for the score of a codon given its context is ln[n(c)], the natural logarithm of the number of times the codon is observed to occur in that context.
Denote Context[x, y, z] as this type of additive score for the choice of codon y given the amino acid required and the flanking codons x and z.
Denote S[c(1), c(2)] as the best score for codons through position 1 that have position 1 set to c(1) and position 2 set to c(2). This can be determined by enumeration, since S[c(1), c(2)]=Context[c(0), c(1), c(2)].
Then S[c(2), c(3)]=max_c(1) {S[c(1), c(2)]+Context[c(1), c(2), c(3)]}, which is the best score for having positions 2 and 3 set to c(2) and c(3) as specified.
This process continues,
S[c(n), c(n+1)]=max_c(n−1) {S[c(n−1), c(n)]+Context[c(n−1), c(n), c(n+1)]}, which is the best score for having positions n and n+1 set to c(n) and c(n+1) as specified.
The search ends here because the codon c(n+1) is not in the set to be removed. The traceback of the maximum values leading to this last step provides the codons that together optimize an objective function corresponding to context-dependent codon usage.
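The recursion and traceback can be sketched as follows. This is an illustrative implementation under stated assumptions: the caller supplies the fixed flanking codons, a list of permitted codons for each position in the run, and an additive score function; unobserved contexts are scored as ln(0.5), by analogy with the regularized log-ratio, because ln(0) is undefined. The toy counts and all names are hypothetical.

```python
import math


def optimize_run(c0, permitted, c_end, score):
    """Choose codons for positions 1..n maximizing the sum of context scores.

    c0 and c_end are the fixed flanking codons c(0) and c(n+1); permitted[i]
    lists the candidate codons for position i + 1; score(x, y, z) is the
    additive score of codon y between codons x and z.
    """
    n = len(permitted)
    options = list(permitted) + [[c_end]]   # position n + 1 is fixed
    # tables[i] maps (c(i+1), c(i+2)) to the best score through position i + 1.
    tables = [{(a, b): score(c0, a, b) for a in options[0] for b in options[1]}]
    back = [{}]                             # back[i][state] = best c(i)
    for i in range(1, n):
        prev, new, ptr = tables[-1], {}, {}
        for b in options[i]:
            for c in options[i + 1]:
                best_a = max(options[i - 1],
                             key=lambda a: prev[(a, b)] + score(a, b, c))
                new[(b, c)] = prev[(best_a, b)] + score(best_a, b, c)
                ptr[(b, c)] = best_a
        tables.append(new)
        back.append(ptr)
    # Best final state, then trace the back-pointers to position 1.
    state = max(tables[-1], key=tables[-1].get)
    best_score = tables[-1][state]
    chosen = [state[0]]
    for i in range(n - 1, 0, -1):
        previous = back[i][state]
        chosen.append(previous)
        state = (previous, state[0])
    chosen.reverse()
    return chosen, best_score


# Hypothetical context counts n(c), keyed by (left, centre, right).
TOY_COUNTS = {
    ("GCT", "TCT", "TCT"): 40, ("GCT", "TCT", "AGC"): 5,
    ("GCT", "AGC", "TCT"): 10, ("GCT", "AGC", "AGC"): 20,
    ("TCT", "TCT", "GGT"): 30, ("TCT", "AGC", "GGT"): 2,
    ("AGC", "TCT", "GGT"): 8, ("AGC", "AGC", "GGT"): 25,
}


def toy_score(x, y, z):
    return math.log(max(TOY_COUNTS.get((x, y, z), 0), 0.5))


run_choice, run_score = optimize_run(
    "GCT", [["TCT", "AGC"], ["TCT", "AGC"]], "GGT", toy_score)
print(run_choice, round(run_score, 3))
```

Because each table is indexed by a pair of adjacent codons, the work grows linearly with the run length rather than exponentially, which matters for runs such as the 13-codon run noted above.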
Alternatively or in combination, one or more of the following algorithm choices may be used:
Use dynamic programming for a more sophisticated treatment of neighboring codons.
Use a different codon selection strategy, for example maintaining GC content, codon adaptation index, or translational efficiency, as the main codon replacement rule, but if this may result in the creation of a pattern that is depleted with statistical significance or other relevant criterion, use the Goldilocks-selected codon instead.
Use the Goldilocks codon with the greatest fold-enrichment over the null hypothesis, rather than the Goldilocks codon that is most often used in the context.
Use a random codon selected using the Goldilocks context-dependent probabilities as the probability distribution.
The final codon is a stop codon and a special case. Some designs may use a single choice for the stop codon, TGA, or a pair of choices, TGA and TAA. For the stop codon, a 9 mer pattern or 6 mer pattern ending with the stop codon may be used instead of the 9 mer pattern with the codon of interest in the middle position.
Avoid significantly enriched codons as possible regulatory signals, choosing codons whose usage matches the overall RSCU and is not too hot, not too cold, but just right.
These and other methods that determine context-dependent codon usage values and use them as the basis for codon selection may be used.
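Two of the alternative selection rules listed above, greatest fold-enrichment over the null and random selection weighted by context-dependent counts, can be sketched as follows. The function names are hypothetical; the counts are the lysine counts for the context GAA_K_AAA, with the null probabilities derived from the expected counts quoted for that context.

```python
import random


def pick_by_fold_enrichment(counts, null_probs, permitted):
    """Permitted codon with the largest observed/expected ratio in the context."""
    total = sum(counts.get(c, 0) for c in null_probs)   # N for this context
    if total == 0:
        return permitted[0]   # no data for this context: arbitrary fallback
    return max(permitted,
               key=lambda c: counts.get(c, 0) / (total * null_probs[c]))


def pick_random(counts, permitted, rng):
    """Sample a permitted codon with probability proportional to its count."""
    weights = [counts.get(c, 0) for c in permitted]
    if sum(weights) == 0:
        return rng.choice(permitted)
    return rng.choices(permitted, weights=weights, k=1)[0]


counts = {"AAA": 195, "AAG": 343}
null_probs = {"AAA": 312 / 538, "AAG": 226 / 538}
permitted = ["AAA", "AAG"]

enriched = pick_by_fold_enrichment(counts, null_probs, permitted)
rng = random.Random(0)   # seeded only to make the sketch reproducible
samples = [pick_random(counts, permitted, rng) for _ in range(1000)]
```

Sampling keeps the rewritten genome's context-dependent codon usage close to the observed distribution, whereas the deterministic rules concentrate usage on a single codon per context.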
The sequences of original yeast ORFs (Saccharomyces cerevisiae S288C strain) and rewritten yeast ORFs using methods described herein are shown as SEQ ID NOs: 1-11,812.
This example shows site-specific incorporation of ncAAs into proteins in yeast using a generic orthogonal translation system, with both displayed and intracellular proteins, in the yeast display strain RJY100. ncAA incorporation systems comprise a protein construct containing a TAG codon, an orthogonal translation system, and a ncAA added during expression of the protein construct. This method can be adapted for use in other yeast strains; plasmids encoding the protein of interest and plasmids encoding the orthogonal translation system need to carry distinct selection markers that are compatible with the genotype of the yeast strain.
1. One or more yeast display vectors containing a protein of interest (POI) with and without a TAG stop codon at a permissible site under a galactose-inducible promoter are prepared. The vectors can be named pPOIVector-POI-TAG (with a TAG stop codon) and pPOIVector-POI (without a TAG stop codon), respectively. The vectors also contain an auxotrophic marker, e.g., tryptophan marker, for use in yeast and an antibiotic marker, e.g., ampicillin marker, for propagation in E. coli.
2. One or more galactose-inducible vectors for a dual-fluorescent protein construct consisting of two fluorescent proteins, e.g., blue fluorescent protein and superfolder green fluorescent protein, connected by a linker sequence with or without a TAG codon (BXG and BYG, respectively) are prepared. These vectors can be named pPOIVector-BXG and pPOIVector-BYG, respectively. The vectors also contain an auxotrophic marker, e.g., tryptophan marker, for use in yeast and an antibiotic marker, e.g., ampicillin marker, for propagation in E. coli.
3. One or more galactose-inducible vectors for a single-fluorescent protein construct consisting of a fluorescent protein, e.g., superfolder green fluorescent protein, with or without a TAG codon in place of the tyrosine codon at position 151 are prepared. These vectors can be named pPOIVector-GFP-TAG and pPOIVector-GFP, respectively. The vectors also contain an auxotrophic marker, e.g., tryptophan marker, for use in yeast and an antibiotic marker, e.g., ampicillin marker, for propagation in E. coli.
4. One or more constitutive expression vectors for an orthogonal translation system comprising an aminoacyl-tRNA synthetase and cognate tRNA are prepared (pOTSVector-OTS). The vectors also contain an auxotrophic marker, e.g., leucine marker, for use in yeast and an antibiotic marker, e.g., ampicillin marker, for propagation in E. coli.
5. Saccharomyces cerevisiae yeast display strain RJY100 is prepared for use with conventional yeast display and intracellular fluorescent protein expression.
6. Media preparation:
A) SD-SCAA-TRP-LEU-URA and SD-SCAA-TRP-URA media, pH 4.5: Dissolve 20 g glucose, 6.7 g yeast nitrogen base without amino acids, 2 g synthetic casamino acids (-TRP-LEU-URA or -TRP-URA), and citrate buffer salts (10.4 g sodium citrate, 7.4 g citric acid monohydrate) in 1 L ddH2O. Filter sterilize using a 0.2 μm filter and store at room temperature.
B) SD-SCAA-TRP-LEU-URA and SD-SCAA-TRP-URA plates, pH 6.0: Mix phosphate buffer salts (5.4 g sodium phosphate dibasic, anhydrous, and 8.56 g sodium phosphate monobasic monohydrate), 15 g agar, and 182 g sorbitol in a final volume of 900 mL with ddH2O in a 1 L bottle with a magnetic stir bar. Autoclave the mixture and cool with stirring at room temperature. At the same time, dissolve 20 g glucose, 6.7 g yeast nitrogen base without amino acids, and 2 g synthetic casamino acids (-TRP-LEU-URA or -TRP-URA) in a final volume of 100 mL using vigorous stirring. Once the autoclaved solution has cooled to approximately 60° C., filter sterilize the glucose/yeast nitrogen base/synthetic casamino acid mixture directly into the autoclaved solution, mix briefly, and pour plates. This recipe is expected to produce approximately 80-100, 100 mm plates. Store at room temperature or at 4° C.
C) SG-SCAA-TRP-LEU-URA and SG-SCAA-TRP-URA media, pH 6.0: Dissolve 20 g galactose, 2 g glucose, 6.7 g yeast nitrogen base without amino acids, 2 g synthetic casamino acids (-TRP-LEU-URA or -TRP-URA), and phosphate buffer salts (5.4 g sodium phosphate dibasic, anhydrous, and 8.56 g sodium phosphate monobasic monohydrate) in 1 L ddH2O. Filter sterilize using a 0.2 μm filter and store at room temperature.
D) Yeast Extract-Peptone-Dextrose (YPD) media: Mix 20 g peptone and 10 g yeast extract in 900 mL ddH2O. Separately, prepare a solution of 100 mL 20% glucose (20 g glucose in 100 mL ddH2O). Autoclave both solutions, let them cool, and combine the two to make the final product. Store at room temperature.
E) Yeast Extract-Peptone-Galactose (YPG) media: Mix 20 g peptone and 10 g yeast extract in 900 mL ddH2O. Separately, prepare a solution of 100 mL 20% galactose (20 g galactose in 100 mL ddH2O). Autoclave both solutions, let them cool, and combine the two to make the final product. Store at room temperature.
F) YPD plates: Mix 10 g peptone, 5 g yeast extract, and 7.5 g agar in 450 mL ddH2O in a 1 L bottle with a magnetic stir bar. Separately, make a solution of 50 mL 20% glucose (10 g in 50 mL). Autoclave both solutions, cool both solutions to 55° C. with stirring, mix them together, and pour plates. This recipe is expected to produce approximately 40-50, 100 mm plates. The 20% glucose solution can be made ahead of time. Store at room temperature or at 4° C.
7. Other reagents to be prepared:
A) Penicillin-streptomycin: 10,000 IU/mL and 10,000 μg/mL, respectively, in 100× solution.
B) 50 mM noncanonical amino acid (ncAA): Prepare a 50 mM liquid stock of the L-isomer of the ncAA by dissolving the ncAA in 90% of the final volume of ddH2O and vortexing thoroughly. The addition of NaOH may be required to fully dissolve the ncAA. Add ddH2O to the final volume and sterile filter using a 0.2 μm filter before use. Use immediately or store at 4° C.
8. Kits, containers and instruments needed:
A) Zymo Research Frozen-EZ Yeast Transformation II Kit (Zymo Research).
B) Cryoprotectant isopropanol containers to slow-freeze competent yeast cells. An example of a suitable isopropanol container is the Thermo Scientific™ Mr. Frosty™ (Thermo Fisher catalog number 5100-0001).
C) Sterile 1.7 mL microcentrifuge tubes.
D) Sterile polyethylene culture tubes.
E) Sterile 15 mL polypropylene conical tubes.
F) Benchtop vortexer.
G) Benchtop centrifuge for spinning culture tubes.
H) Stationary incubator at 30° C. (for yeast plate incubation).
I) Shaking incubator at 30° C., 300 rpm (for yeast liquid culture growth).
J) Shaking incubator at 20° C., 300 rpm (for induction of liquid cultures).
K) NanoDrop or other spectrophotometer for measuring yeast culture density.
9. Flow Cytometry system for Flow Cytometry- and Microplate Reader-based evaluation of ncAA Incorporation events.
A) Refrigerated benchtop centrifuge for spinning microcentrifuge tubes.
B) Rotary wheel at room temperature.
C) Flow cytometer.
D) Flow cytometry data analysis software.
E) Spectrophotometric microplate reader.
F) Flow cytometry tubes compatible with available flow cytometer.
G) 96-well microplates compatible with available flow cytometer for large-scale experiments (provided that the flow cytometer has an autosampler).
H) Adhesive foil for covering 96-well microplates.
I) Primary antibodies: Chicken anti-c-Myc (Gallus Immunotech) and Mouse anti-HA antibody (BioLegend).
J) Secondary antibodies: Goat anti-chicken Alexa Fluor 647 (Invitrogen); Goat anti-chicken Alexa Fluor 488 (Invitrogen); Goat anti-mouse Alexa Fluor 488 (Invitrogen).
K) 96-well clear bottom black-walled microplates.
10. Bioorthogonal Reactions with ncAAs on the yeast surface.
A) Rotary wheel at 4° C.
B) 1× PBS, pH 7.4: Mix 8 g sodium chloride, 0.2 g potassium chloride, 1.44 g sodium phosphate dibasic (anhydrous), and 0.24 g potassium phosphate monobasic (anhydrous) in 1 L ddH2O. Use hydrochloric acid or sodium hydroxide to adjust the pH to 7.4. Sterile filter using a 0.2 μm filter and store at room temperature.
C) Sterile PBS+0.1% bovine serum albumin (BSA), pH 7.4 (PBSA): Add 1 g BSA to 1 L 1× PBS, pH 7.4, dissolve, and sterile filter using a 0.2 μm filter. Store at room temperature.
D) 20 mM copper sulfate (CuSO4): Dissolve 0.0050 g of CuSO4 pentahydrate powder (MW 249.68 g/mol) in 1 mL ddH2O by vortexing. Store at 4° C.
E) 50 mM Tris(3-hydroxypropyltriazolylmethyl)amine (THPTA): Dissolve 0.0217 g THPTA powder (MW 434.50 g/mol) in 1 mL ddH2O by vortexing. Store at 4° C.
F) 1:2 solution of 20 mM CuSO4: 50 mM THPTA: Combine 20 mM CuSO4 and 50 mM THPTA at a 1:2 volume ratio. Prepare immediately prior to use.
G) 20 mM biotin-(PEG)4-alkyne or biotin-(PEG)4-azide: Dissolve biotin-(PEG)4-alkyne or biotin-(PEG)4-azide in dimethyl sulfoxide (DMSO). Store at −20° C. in a desiccant jar.
H) 200 mM cargo-alkyne or cargo-azide: Dissolve the cargo-alkyne or cargo-azide in ddH2O or DMSO for long-term storage at −20° C.
I) 100 mM aminoguanidine: Dissolve 0.011 g aminoguanidine HCl (MW 110.55 g/mol) in 1 mL ddH2O immediately prior to use.
J) 100 mM sodium ascorbate: Dissolve 0.020 g sodium ascorbate (MW 198.11 g/mol) in 1 mL ddH2O immediately prior to use.
K) 20 mM dibenzocyclooctyne-amine (DBCO)-biotin: Dissolve DBCO-biotin (MW=749.91 g/mol) in DMSO and store at −20° C. Dilute to 2 mM in DMSO prior to use.
L) 200 mM dibenzocyclooctyne-amine (DBCO)-cargo: Dissolve DBCO-cargo in DMSO.
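The masses quoted for the stock solutions above follow from mass = concentration × volume × molecular weight. A minimal sketch of that arithmetic, using only the concentrations, volumes, and molecular weights stated in this list, is given below; the function name is hypothetical.

```python
def stock_mass_g(conc_mM, volume_mL, mw_g_per_mol):
    """Mass of solute (g) needed for a stock of the given molarity and volume."""
    moles = (conc_mM / 1000.0) * (volume_mL / 1000.0)
    return moles * mw_g_per_mol


stocks = {
    "CuSO4 (20 mM)": stock_mass_g(20, 1, 249.68),
    "THPTA (50 mM)": stock_mass_g(50, 1, 434.50),
    "aminoguanidine HCl (100 mM)": stock_mass_g(100, 1, 110.55),
    "sodium ascorbate (100 mM)": stock_mass_g(100, 1, 198.11),
}
for name, grams in stocks.items():
    print(f"{name}: {grams:.4f} g per 1 mL")
```

The same function can be used to scale a stock to a different volume, for example when more than 1 mL of sodium ascorbate is needed for a large set of reactions.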
11. Click Chemistry Analysis
A) Secondary antibody: Streptavidin, Alexa Fluor 488 conjugate (Invitrogen).
12. Preparation of Libraries Involving the Use of Orthogonal Translation Systems
A) A yeast display vector pCTCON2 that contains tryptophan marker for use in yeast and ampicillin marker for propagation in E. coli.
B) A constitutive expression vector pRS315-LeuOmeRS for orthogonal translation system comprising an E. coli leucyl-tRNA synthetase mutant and cognate tRNA. This vector contains leucine marker for use in yeast and ampicillin marker for propagation in E. coli.
C) Restriction enzymes NcoI and NdeI for preparing libraries of OTSs in pRS315-LeuOmeRS.
D) Restriction enzymes SalI, NheI, and BamHI for preparing libraries of POIs in pCTCON2.
E) DNA polymerase and corresponding buffers for PCR.
F) 10 mM dNTPs.
G) Thin-walled PCR tubes.
H) Template DNA for library amplification.
I) Primers for template amplification with homologous recombination flanking regions. Each protein library will contain different 5′ and 3′ ends and will need to be designed to accommodate the specific library design.
J) Additional primers needed to construct the library of interest.
K) Forward and reverse pCTCON2 sequencing primers.
L) Forward and reverse pRS315 sequencing primers.
M) Molecular biology-grade agarose.
N) Tris-acetate-EDTA (TAE) buffer (50×): Dissolve 242 g Tris base in ddH2O, then add 57.1 mL glacial acetic acid and 100 mL 500 mM EDTA, pH 8.0, and add ddH2O to 1 L. Store at room temperature.
O) Nucleic acid gel stain, DNA gel loading dye (1×), DNA molecular weight size marker.
P) DNA gel electrophoresis equipment: gel mold and extraction combs, gel box, voltage box, gel imager.
Q) Heat block set to 55° C. for melting agarose containing DNA fragments.
R) Gel extraction kit (Gel extraction buffer for melting agarose gel, DNA purification columns and wash buffers).
S) NanoDrop or other spectrophotometer for measuring DNA concentrations.
T) Sterile ddH2O chilled to 4° C.
U) Pellet Paint co-precipitant (EMD Millipore).
V) 70% ethanol in ddH2O and 100% ethanol.
W) SD-SCAA-LEU-URA media, pH 4.5: Dissolve 20 g glucose, 6.7 g yeast nitrogen base without amino acids, 2 g synthetic casamino acids (-LEU-URA), and citrate buffer salts (10.4 g sodium citrate, 7.4 g citric acid monohydrate) in 1 L ddH2O. Filter sterilize using a 0.2 μm filter and store at room temperature.
X) 100 mM lithium acetate (sterile) and 1 M dithiothreitol (DTT)
Y) 50 mL conical tubes and 2 mm electroporation cuvettes chilled on ice prior to use in electroporations
Z) Refrigerated benchtop centrifuge for spinning 50 mL conical tubes and for pelleting large volumes (1 L or greater)
AA) Bio-Rad Gene Pulser XCell Total System (Bio-Rad) or other electroporator with square wave protocol capability.
BB) Sterile 250 mL and 2 L flasks for liquid culture growth.
CC) Autoclavable centrifuge bottles (500 mL or greater capacity).
DD) Sterile 60% glycerol: Prepare a solution of 60% v/v glycerol in ddH2O and autoclave to sterilize. Store at room temperature.
EE) 2 mL cryogenic screw-cap vials.
FF) Zymoprep Yeast Plasmid Miniprep II kit (Zymo Research).
GG) Chemically competent E. coli.
HH) SOC medium: Mix 2 g bactotryptone, 0.5 g yeast extract, 0.2 mL 5 M NaCl, and 0.2 mL 1.25 M KCl in ddH2O to approximately 97 mL and autoclave to sterilize. Under sterile conditions, add 1 mL sterile 1 M MgCl2 and 1.8 mL sterile 20% glucose. Store at room temperature.
II) Luria-Bertani (LB) medium (available as premixed powder or use the following recipe: for 1 L, mix 10 g tryptone, 5 g yeast extract, and 10 g sodium chloride in 1 L ddH2O and autoclave to sterilize). Store at room temperature.
JJ) 2000× ampicillin stock: Dissolve ampicillin in ddH2O at 100 mg/mL and sterile filter using a 0.2 μm filter. Store at −20° C. for up to 1 year or at 4° C. for up to 1 month. The working concentration of ampicillin in liquid or solid media is 50 μg/mL.
KK) Luria-Bertani (LB) plates with antibiotics: Mix 5 g tryptone, 2.5 g yeast extract, 5 g sodium chloride, and 7.5 g agar in 500 mL ddH2O with a stir bar in a 1 L bottle. Autoclave to sterilize, allow media to cool with stirring to 55° C., add ampicillin, and pour plates. This recipe is expected to produce approximately 40-50, 100 mm plates. Store at 4° C.
LL) E. coli plasmid DNA miniprep kit such as those sold by Qiagen, Epoch Life Science, or Zymo Research.
1. Site-specific Incorporation of ncAAs in Proteins in Yeast
(a) Prepare chemically competent yeast by first streaking out cells from a glycerol or other stock on a YPD plate. Grow at 30° C. in a stationary incubator for 1-2 days, then inoculate a single, isolated colony from the YPD plate into a 5 mL YPD culture supplemented with penicillin-streptomycin. Grow the culture at 30° C. in a shaking incubator overnight or until the culture is saturated, then dilute 500 μL into 4.5 mL YPD supplemented with penicillin-streptomycin and grow for another 4-6 h at 30° C. in a shaking incubator.
Continue to prepare cells using a kit such as the Zymo Research Frozen-EZ Yeast Transformation II Kit. Chemically competent yeast can be used immediately or frozen in a cryoprotectant container at −80° C.
(b) Using the same yeast chemical competence preparation and transformation kit, transform the plasmid DNA of interest into the cells. For yeast-displayed proteins, prepare the following separate transformations: pPOIVector-TAG and pOTSVector, pPOIVector-WT and pOTSVector, and the pPOIVector-WT only (this serves as a control for yeast display). For intracellular proteins, only the pPOIVector-TAG/pOTSVector and pPOIVector-WT/pOTSVector combinations are necessary. Plate on selective media for retention of the specific combinations of plasmids. Grow at 30° C. in a stationary incubator for 2-3 days.
(c) For each non-control plasmid combination, inoculate three single, isolated colonies from the selective media plate into three 5 mL selective media cultures supplemented with penicillin-streptomycin. For yeast-displayed protein controls, only one culture is needed. Note that separate cultures of yeast that do not contain any plasmid DNA are necessary for microplate reader-based data collection. Grow the cultures at 30° C. in a shaking incubator until saturated, then dilute each culture to an OD600 of 1 in 5 mL of the identical growth media supplemented with penicillin-streptomycin and grow until the OD600 is between 2 and 5 (this should take 4-6 h). Induce each culture at an OD600 of 1 in 2 mL galactose-containing selective media supplemented with penicillin-streptomycin. For each POI, prepare a culture with no ncAA, and one tube each for the ncAAs of interest. Incubate cultures at 20° C. in a shaking incubator for 16 h.
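The dilutions to an OD600 of 1 in step (c) follow from C1·V1 = C2·V2. The short sketch below shows the arithmetic; the measured OD600 values are hypothetical, and the calculation assumes OD600 scales linearly with cell density over the range measured.

```python
def dilution_volumes(measured_od, target_od, final_volume_ml):
    """Return (culture volume, fresh media volume) in mL for the target OD600."""
    if measured_od < target_od:
        raise ValueError("culture is already below the target OD600")
    culture_ml = target_od * final_volume_ml / measured_od
    return culture_ml, final_volume_ml - culture_ml


# Saturated culture at a hypothetical OD600 of 8, diluted to OD600 1 in 5 mL.
growth = dilution_volumes(8.0, 1.0, 5.0)
# Mid-log culture at a hypothetical OD600 of 4, induced at OD600 1 in 2 mL.
induction = dilution_volumes(4.0, 1.0, 2.0)
```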
2. Flow Cytometry- and Microplate Reader-Based Evaluation of ncAA Incorporation Events in Yeast
(a) To prepare cells with yeast-displayed POIs for flow cytometry, begin by removing two million cells to microcentrifuge tubes. Centrifuge to pellet, aspirate supernatant, and resuspend each pellet in 1 mL PBSA to wash. Repeat the wash twice more and then resuspend each sample in 50 μL PBSA with the necessary primary label(s), then incubate on a rotary wheel for 30 min at room temperature. Following this step, all steps should be performed on ice or in a refrigerated centrifuge at 4° C. to reduce label dissociation. Dilute each sample with 950 μL ice-cold PBSA, centrifuge to pellet, and aspirate supernatant. Wash twice more with ice-cold PBSA, then resuspend each sample in 50 μL PBSA with the necessary secondary label(s). Incubate on ice in the dark for 15 min. Cells can be immediately resuspended and evaluated on the flow cytometer or kept as wet pellets on ice or at 4° C. in the dark for short periods before evaluation.
(b) To prepare cells with intracellular POIs for flow cytometry, begin by removing two million cells to microcentrifuge tubes. Centrifuge to pellet, aspirate supernatant, and resuspend each pellet in 1 mL PBSA to wash. Repeat the wash twice more for a total of three washes. Cells can be immediately resuspended and evaluated on the flow cytometer or kept as wet pellets on ice or at 4° C. for short periods before evaluation.
(c) To prepare cells with intracellular POIs for microplate reader assays, begin by removing two million cells to microcentrifuge tubes. Centrifuge to pellet, aspirate supernatant, and resuspend each pellet in 1 mL PBSA to wash. Repeat the wash twice more for a total of three washes. Cells can be immediately resuspended and evaluated on the microplate reader or kept as wet pellets on ice or at 4° C. for short periods before evaluation. Samples should be resuspended and transferred to 96-well black wall microplates, taking care not to introduce any air bubbles, prior to being evaluated on the microplate reader.
(a) To begin isolating single cells, draw a polygon gate on the unlabeled yeast sample on a log plot of side scatter (SSC) area versus forward scatter (FSC) area. This population is now called Gate 1 and contains cells that are morphologically similar and are likely to be alive based on size and scatter.
(b) Within Gate 1, draw a polygon gate on a log plot of FSC height versus FSC width. This population is now called Gate 2 and contains single cells while excluding doublets, triplets, or other groups of cells. Further isolation of the single-cell populations may be possible on some flow cytometers (such as with SSC height versus SSC width).
(c) Within Gate 2, prepare a dot plot with axes set to the fluorescence heights corresponding to detection of the C-terminus and N-terminus. For samples with only C-terminus detection ability (e.g., GFP-only samples), the second axis should be set to another fluorescence detection channel that is not expected to have crosstalk with the C-terminus detection channel.
(d) For samples with dual-terminus detection capability, gate the population of cells with above-background levels of N-terminus detection on the Gate 2 histogram plot of N-terminus detection.
4. Bioorthogonal Reactions with ncAAs on the Yeast Surface
(a) One-step click chemistry is used as a control for reacting available azide or alkyne functional groups that have been genetically encoded in the protein of interest on the yeast surface with a probe that can be labeled and detected on a flow cytometer, such as biotin. Step 1: react the surface-displayed protein with an encoded ncAA containing an azide or alkyne functional group with an alkyne- or azide-biotin, or cyclooctyne-biotin for use with azide functional groups only (strain-promoted click chemistry).
(b) Two-step click chemistry. Step 1: react the surface-displayed protein with an encoded ncAA containing an azide or alkyne functional group with an alkyne- or azide-cargo, or cyclooctyne-cargo for use with azide functional groups only (strain-promoted click chemistry). The outcome of the first step may include a mixture of unreacted proteins and cargo-modified proteins. Step 2: react the population of yeast from the first step with an alkyne- or azide-biotin, or cyclooctyne-biotin (for use with azide functional groups only; strain-promoted click chemistry). The products of the second step are expected to be a mixture of cargo-modified proteins and biotin-modified proteins (reactions with biotin probes should be performed under conditions known to lead to complete reactions to avoid unreacted functional groups).
(c) The level of chemical modification with the cargo of interest can be evaluated by determining the extent of reaction. The background-subtracted one-step biotin detection and background-subtracted two-step biotin detection are required for this calculation. CuAAC: copper-catalyzed azide-alkyne cycloaddition. SPAAC: strain-promoted azide-alkyne cycloaddition.
Details of click chemistry analysis are described in, for example, Stieglitz and Deventer 2022, Biomedical Engineering Technologies, Methods in Molecular Biology, vol 2394, Humana, New York, NY.
6. Preparation of Libraries Involving the Use of Orthogonal Translation Systems
(a) To prepare a library of OTSs, begin by performing a double restriction enzyme digest on the pRS315-LeuOmeRS plasmid. Note that other OTS expression vectors can be used with corresponding restriction enzymes specific to that vector. Evaluate on a DNA gel and extract the band corresponding to the vector with no OTS insert. Amplify the OTS library insert(s) via PCR with primers containing the desired degenerate codon(s) or mutation(s), then evaluate and extract from a DNA gel. Follow the Pellet Paint manufacturer's protocols to concentrate the pooled OTS and vector DNA. Separately, prepare yeast cells that only contain a ncAA incorporation reporter.
(b) To prepare a library of POIs, begin by performing a triple restriction enzyme digest on pCTCON2. Note that other yeast display vectors can be used with corresponding restriction enzymes specific to that vector. Evaluate on a DNA gel and extract the band corresponding to the vector with no POI insert. Amplify the POI library insert(s) via PCR with primers containing the desired degenerate codon(s) or mutation(s), then evaluate and extract from a DNA gel. Follow the Pellet Paint manufacturer's protocols to concentrate the pooled POI and vector DNA. Separately, prepare yeast cells that only contain the pOTSVector.
(c) Prepare electrocompetent cells then combine with the concentrated library and vector DNA and electroporate. Recover each electroporated sample with 2 mL YPD at 30° C. for 1 h with no shaking. Also, pre-warm one selective media plate for each sample at this time. To determine the transformation efficiency, prepare four serial dilutions of each sample and plate on quadrants of the selective media plates. Grow at 30° C. for 3-4 days and determine a number of the colonies in each quadrant to determine the approximate number of transformants. Centrifuge the remainder of the recovered samples and aspirate the YPD, then resuspend each pellet in 100 mL selective media supplemented with penicillin-streptomycin and grow at 30° C. with shaking for 1-2 days until saturated. Centrifuge the culture to pellet, decant supernatant, and resuspend in 1 L selective media supplemented with penicillin-streptomycin. At this point, remove 200 μL of the 1 L cultures and set aside for additional characterization steps. Grow at 30° C. for 1-2 days until saturated, then centrifuge and resuspend the entire pellet in 5 mL 60% glycerol. Freeze library at −80° C. Take the 200 μL removed after passaging to 1 L and propagate for flow cytometry characterization. Also, use a yeast DNA purification “miniprep” kit such as the Zymoprep Yeast Plasmid Miniprep II kit to isolate the plasmid DNA and characterize the constructed library or libraries.
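The colony counts from the dilution quadrants in (c) can be scaled to an approximate number of transformants as sketched below. The 2 mL recovery volume comes from the protocol above; the dilution factor and the volume plated per quadrant are not specified in this description, so the values used here are illustrative assumptions, and the function name is hypothetical.

```python
def estimate_transformants(colonies, dilution_factor, plated_volume_ul,
                           recovery_volume_ul=2000.0):
    """Scale one quadrant's colony count up to the whole recovered sample.

    colonies           -- colonies counted in the quadrant
    dilution_factor    -- fold dilution of the plated sample (assumed value)
    plated_volume_ul   -- volume spotted on the quadrant (assumed value)
    recovery_volume_ul -- total recovery volume (2 mL YPD per the protocol)
    """
    if plated_volume_ul <= 0:
        raise ValueError("plated volume must be positive")
    return colonies * dilution_factor * (recovery_volume_ul / plated_volume_ul)


# Example: 42 colonies from a hypothetical 10 uL spot of a 10^4-fold dilution.
print(f"{estimate_transformants(42, 10_000, 10.0):.2e}")
```

In practice, only quadrants with a countable number of well-separated colonies would be used for the estimate.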
This example uses an assembly strategy to generate a yeast strain with a synthetic genome. Yeast has 16 chromosomes (ChrI to ChrXVI). In some embodiments, an assembly strategy may comprise using endogenous homologous recombination machinery to replace one or more 30- to 60-kilobase segments of each wild-type chromosome with the corresponding synthetic sequence. A chromosome can be computationally divided into 30- to 60-kilobase “megachunks,” each comprising a set of “chunks,” i.e., segments that are each less than about 10 kilobases in length. These “chunks” can be assembled into “megachunks” by restriction enzyme cutting and ligation in vitro, or by any other method known in the art. The “megachunks” can be subsequently integrated into the host genome, e.g., a yeast genome, replacing the corresponding wild-type segment.
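The computational division of a chromosome into “megachunks” and “chunks” can be sketched as follows. This is a simplified illustration that cuts at fixed coordinates; an actual design would likely place boundaries at suitable restriction sites or other sequence features, which this sketch does not model. The function names and default sizes (50 kb megachunks, chunks of at most 10 kb) are assumptions chosen to fall within the ranges given above.

```python
def partition(length, size):
    """Split [0, length) into consecutive half-open intervals of at most `size` bp."""
    return [(start, min(start + size, length)) for start in range(0, length, size)]


def design_megachunks(chrom_length, megachunk_size=50_000, chunk_size=10_000):
    """Return a list of megachunks, each with the chunks that tile it."""
    if not 30_000 <= megachunk_size <= 60_000:
        raise ValueError("megachunk size should be 30-60 kb")
    layout = []
    for m_start, m_end in partition(chrom_length, megachunk_size):
        # Chunk coordinates are reported in chromosome coordinates.
        chunks = [(m_start + s, m_start + e)
                  for s, e in partition(m_end - m_start, chunk_size)]
        layout.append({"megachunk": (m_start, m_end), "chunks": chunks})
    return layout


# Example: a hypothetical 120 kb chromosome.
for entry in design_megachunks(120_000):
    print(entry["megachunk"], len(entry["chunks"]), "chunks")
```

Note that with fixed-size cuts the final megachunk may be shorter than 30 kb; a real design would rebalance the boundaries.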
In some embodiments, “megachunks” can be introduced sequentially from left to right (i.e., in the 5′ to 3′ direction) using the endogenous homologous recombination machinery and termini. In some embodiments, the termini may comprise terminal universal telomere cap (UTC) sequences for the first and last “megachunk” extremities. In some embodiments, the termini may comprise terminal sequences of up to 500 bp that can facilitate integration into a partially synthetic, partially native chromosome. In some embodiments, “chunks” and/or “megachunks” may comprise a selectable marker. In some embodiments, the rightmost “chunk” in each “megachunk” (i.e., the “chunk” at the 3′-most side of a “megachunk”) may comprise a selectable marker. For example, the selectable marker can be any auxotrophic marker. In some embodiments, an auxotrophic marker may comprise URA3, LYS2, LEU2, TRP1, HIS3, MET15, or ADE2. In some embodiments, the selectable marker may be LEU2 or URA3. In some embodiments, as each “megachunk” is introduced, the previously used marker is overwritten as a consequence of homologous recombination with the incoming “megachunk.” In some embodiments, if the first “megachunk” is tagged with LEU2, the second “megachunk” is tagged with another marker, such as URA3. In some embodiments, two markers can be alternated. For example, if the first “megachunk” is tagged with LEU2, the second “megachunk” is tagged with URA3, and the third “megachunk” is tagged with LEU2 again.
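The alternation of two markers across sequentially introduced “megachunks” can be illustrated with a short sketch. The function name is hypothetical, and the default marker pair simply follows the LEU2/URA3 example above.

```python
from itertools import cycle, islice


def assign_markers(n_megachunks, markers=("LEU2", "URA3")):
    """Assign alternating selectable markers to megachunks, left to right."""
    return list(islice(cycle(markers), n_megachunks))


plan = assign_markers(3)
print(plan)  # ['LEU2', 'URA3', 'LEU2']

# Each incoming megachunk carries a different marker from the one it overwrites,
# so gain of the new marker and loss of the old one can both be scored.
assert all(a != b for a, b in zip(plan, plan[1:]))
```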
In other embodiments, “chunks” can be provided as a series of “minichunks” that overlap with each other and can be recombined with each other. In this embodiment, the series of “minichunks” can be integrated into the genome simultaneously by using selectable marker (e.g., auxotrophic marker) switching. In some embodiments, the first (5′) “megachunk” of a synthetic chromosome may be provided with a telomere seed sequence (TeSS) within the larger UTC fragment. In some embodiments, the last (3′) “megachunk” of a synthetic chromosome may be provided with a terminal sequence homology targeting the wild-type chromosome. In some embodiments, the TeSS end may be designed to grow a new telomere. In some embodiments, the TeSS may not participate in homologous recombination. In some embodiments, the last or rightmost “megachunk” of a synthetic chromosome (i.e., the “megachunk” at the 3′ end of a synthetic chromosome) may comprise a selectable marker. In some embodiments, the last or rightmost “megachunk” of a synthetic chromosome (i.e., the “megachunk” at the 3′ end of a synthetic chromosome) may not comprise a selectable marker. In this embodiment, the second-to-last “megachunk” may comprise a URA3 marker, and selection for the last “megachunk” can be provided by the 5-fluoroorotic acid (5-FOA) resistance phenotype conferred by the last “megachunk” as it overwrites the URA3 marker from the second-to-last “megachunk.”
In some embodiments, integration may comprise utilizing an inducible genome rearrangement system. In some embodiments, the inducible genome rearrangement system may be based on a chemically inducible Cre recombinase. In some embodiments, a palindromic recombination site, loxPsym, may be inserted in the genome. In some embodiments, the palindromic recombination site loxPsym may be inserted 3 bp downstream of the stop codon of a nonessential gene/ORF.
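Placement of a loxPsym site 3 bp downstream of a stop codon can be sketched as a simple string operation. This sketch assumes an uppercase, forward-strand ORF and a known stop codon position; the 34-bp site sequence below is the commonly cited loxPsym sequence and is supplied here as an assumption to be checked against the design reference, since it is not given in this description. The function names are hypothetical.

```python
LOXPSYM = "ATAACTTCGTATAATGTACATTATACGAAGTTAT"  # assumed sequence; verify before use
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def insert_loxpsym(seq, stop_start, site=LOXPSYM, offset=3):
    """Insert `site` `offset` bp downstream of the stop codon at `stop_start`."""
    if seq[stop_start:stop_start + 3] not in STOP_CODONS:
        raise ValueError("no stop codon at the given position")
    if site != reverse_complement(site):
        raise ValueError("site is not palindromic")
    pos = stop_start + 3 + offset  # first base after the 3-bp gap
    if pos > len(seq):
        raise ValueError("not enough sequence downstream of the stop codon")
    return seq[:pos] + site + seq[pos:]


# Toy example: ORF ATG AAA TAA followed by GGGCCC.
print(insert_loxpsym("ATGAAATAAGGGCCC", stop_start=6))
```

Because the site is its own reverse complement, its orientation relative to the gene does not matter, which is the property the palindrome check enforces.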
Next, the assembled synthetic chromosomes are sequenced to verify and quantify the synthetic content of the genome. A “PCRTagging” watermark system can be used, in which slight nucleotide sequence alterations are introduced through synonymous recoding within ORFs so that pairs of primers can be designed that are specific to either the wild-type or the synthetic version of a given gene/ORF. In addition, synthetic chromosomes are validated by whole-genome sequencing. In some embodiments, “semisynthetic” strains may be sequenced at major intervals during assembly (e.g., after every 300 to 500 kb integrated) in order to identify major structural variants that occur at about that frequency and to eliminate them early in assembly.
In addition, the fitness of the resulting recombinant semi-synthetic yeast strains is assessed, and any substitution that proves lethal or leads to a measurable fitness defect can be corrected. The correction can be done by reverting the sequence to wild type (“debugging”). The hierarchical nature of the assembly scheme can facilitate debugging, as specific designer features for codon rewriting can be corrected and fixed once bugs are identified. In some embodiments, this can facilitate a “design-build-assemble-test-learn” cycle used in the final stage of production of synthetic chromosomes.
Once assembly of the various synthetic chromosomes is completed, an efficient meiotic strategy can be used to combine all synthetic chromosomes. In one embodiment, synthetic chromosomes can be consolidated into a single strain by mating and sporulation. In another embodiment, conditional chromosome destabilization can be used (e.g., endoreduplication intercross). In this embodiment, the centromere function of two specified native chromosomes may be simultaneously disrupted in a doubly heterozygous diploid synthetic strain (e.g., synIII/III VI/synVI). In some embodiments, this can be performed by using the GAL1 promoter in cis to generate a “2n−2” strain. In some embodiments, each chromosome can be individually lost in diploids, yielding hemizygotes for the destabilized chromosome. In some embodiments, most such “2n−1” strains may endoreduplicate the remaining single chromosome to regenerate a 2n state. In some embodiments, conditional chromosome destabilization can be used to backcross synthetic strains to wild type, called an “endoreduplication backcross,” to revert the sequence to wild type or to debug. Diploid strains can be sporulated to produce haploid strains. Karyotypic analysis by pulsed-field gel electrophoresis can be used to visualize mobility shifts of the synthetic chromosomes in the resulting haploid strains relative to wild-type chromosomes.
The examples and embodiments described herein are for illustrative purposes only and various modifications or changes suggested to persons skilled in the art are to be included within the spirit and purview of this application and scope of the appended claims.
Biol. 1995, which is incorporated by reference herein in its entirety.
This application claims the benefit of U.S. Provisional Application No. 63/174,823, filed on Apr. 14, 2021, which is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2022/024888 | 4/14/2022 | WO |

Number | Date | Country
---|---|---
63174823 | Apr 2021 | US