The present invention is in the field of nucleic acid editing and translation optimization.
There is growing evidence that local mRNA folding (i.e., short-range secondary-structure) inside the coding region is often stronger or weaker than expected, but the explanation for this phenomenon is yet to be fully understood. mRNA folding strength affects many central cellular processes, including the transcription rate and termination, translation initiation, translation elongation and ribosomal traffic jams, co-translational folding, mRNA aggregation, mRNA stability and mRNA splicing. Many of these effects are mediated by interactions of mRNA within the CDS (protein-coding sequence) with proteins and other RNAs and may include structure-specific or non-structure-specific interactions.
In recent years several studies showed evidence for selection acting directly to affect mRNA folding strength within the CDS (
The present invention provides nucleic acid molecules comprising a coding sequence and a region of increased folding energy upstream of a stop codon. Expression vectors and cells comprising the nucleic acid molecule are also provided. Methods for optimizing a coding sequence comprising increasing folding energy in a region upstream of that stop codon are also provided.
According to a first aspect, there is provided a method for optimizing a coding sequence, the method comprising introducing a mutation into a first region from 90 nucleotides upstream of a stop codon of the coding sequence to the stop codon; wherein the mutation increases folding energy of the first region or of RNA encoded by the first region, thereby optimizing a coding sequence.
According to another aspect, there is provided a nucleic acid molecule comprising a coding sequence, the coding sequence comprises at least one codon substituted to a synonymous codon within a first region from 90 nucleotides upstream of a stop codon of the coding sequence to the stop codon, wherein the substitution increases folding energy of the first region or of RNA encoded by the first region.
According to another aspect, there is provided an expression vector comprising a nucleic acid molecule of the invention.
According to another aspect, there is provided a cell comprising a nucleic acid molecule of the invention or an expression vector of the invention.
According to another aspect, there is provided a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to execute a genetic-type machine learning algorithm configured to:
According to some embodiments, the optimizing comprises optimizing expression of protein encoded by the coding sequence.
According to some embodiments, the optimizing is optimizing in a target cell.
According to some embodiments, the target cell is selected from:
According to some embodiments, the mutation is a synonymous mutation.
According to some embodiments, the introducing comprises providing a mutated sequence or providing a mutation to be made in the coding sequence.
According to some embodiments, the mutation increases folding energy of the first region to above a predetermined threshold.
According to some embodiments, the predetermined threshold is a value above which the difference as compared to folding energy of the region without the substitution would be significant.
According to some embodiments, the threshold is species-specific and is selected from a threshold provided in Table 5, or the threshold is domain-specific and is selected from a threshold provided in Table 1.
According to some embodiments, the method comprises introducing a plurality of mutations wherein each mutation increases folding energy of the first region or of RNA encoded by the first region or wherein the plurality of mutations in combination increases folding energy of the first region or of RNA encoded by the first region.
According to some embodiments, the method comprises mutating all possible codons within the region to a synonymous codon that increases folding energy of the first region or of RNA encoded by the first region.
According to some embodiments, the method comprises introducing synonymous mutations to produce a first region or RNA encoded by the first region with the maximum possible folding energy.
According to some embodiments, the method further comprises introducing a mutation into a second region from a translational start site (TSS) to 20 nucleotides downstream of the TSS, wherein the mutation increases folding energy of the second region or of RNA encoded by the second region.
According to some embodiments, the method is a method for optimizing expression in a target cell, and wherein the target cell is selected from:
According to some embodiments, the method is a method for optimizing expression in a target cell, and wherein the target cell is a bacterial or archaeal cell and the method further comprises introducing a mutation into a third region between the first and the second regions, wherein the mutation decreases folding energy of the third region or of RNA encoded by the third region.
According to some embodiments, the method is a method for optimizing expression in a target cell, and wherein the target cell is a eukaryotic cell and the method further comprises introducing a mutation into a third region between the first and the second regions, wherein the mutation increases folding energy of the third region or of RNA encoded by the third region.
According to some embodiments, the third region is from 20 to 50 nucleotides downstream of the TSS.
According to some embodiments, the third region is from 20 to 300 nucleotides downstream of the TSS or from 300 to 90 nucleotides upstream of the stop codon.
According to some embodiments, the nucleic acid molecule is an RNA molecule, or a DNA molecule.
According to some embodiments, the first region is from 50 nucleotides upstream of the stop codon to the stop codon.
According to some embodiments, the first region is from 40 nucleotides upstream of the stop codon to the stop codon.
According to some embodiments, the substitution increases folding energy of the first region to above a predetermined threshold.
According to some embodiments, the predetermined threshold is a value above which the difference as compared to folding energy of the region without the substitution would be significant.
According to some embodiments, the threshold is species-specific and is selected from a threshold provided in Table 5, or the threshold is domain-specific and is selected from a threshold provided in Table 1.
According to some embodiments, the nucleic acid molecule comprises a plurality of synonymous substitutions, wherein each substitution increases folding energy of the first region or of RNA encoded by the first region or wherein the plurality of synonymous substitutions in combination increases folding energy of the first region or of RNA encoded by the first region.
According to some embodiments, all possible codons within the first region are substituted to a synonymous codon that increases folding energy of the first region or of RNA encoded by the first region.
According to some embodiments, the region comprises synonymous codons substituted to increase folding energy to a maximum possible.
According to some embodiments, a second region of the coding sequence from a translational start site (TSS) to 20 nucleotides downstream of the TSS comprises at least one codon substituted to a synonymous codon, and wherein the substitution increases folding energy of the second region or of RNA encoded by the second region.
According to some embodiments, the coding sequence encodes a bacterial or archaeal gene and further comprises a third region between the first region and the second region, the third region comprising at least one codon substituted to a synonymous codon, wherein the substitution decreases folding energy of the third region or of RNA encoded by the third region.
According to some embodiments, the coding sequence encodes a eukaryotic gene and further comprises a third region between the first region and the second region, the third region comprising at least one codon substituted to a synonymous codon, wherein the substitution increases folding energy of the third region or of RNA encoded by the third region.
According to some embodiments, the third region is from 20 to 50 nucleotides downstream of the TSS.
According to some embodiments, the third region is from 20 to 300 nucleotides downstream of the TSS or from 300 to 90 nucleotides upstream of the stop codon.
According to some embodiments, the folding energy is the RNA secondary structure folding Gibbs free energy.
According to some embodiments, the cell is a target cell.
According to some embodiments, the nucleic acid molecule, expression vector or both are optimized for expression in the cell.
Further embodiments and the full scope of applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
For each region, the following symbols identify the relation between the “high” and “low” groups: (+) The trend observed in this region (i.e., increased or decreased folding strength) is more extreme in highly expressed or highly abundant genes. (−) The trend observed in this region (i.e., increased or decreased folding strength) is less extreme in highly expressed or highly abundant genes (or the opposite trend is observed). (no symbol) There is no consistent and statistically significant difference between the groups (or there is no ΔLFE trend in this region). (+/−) Inconsistent or contradictory results in different positions. (NA) Data was not available for this species.
The present invention, in some embodiments, provides nucleic acid molecules comprising a coding sequence, wherein the coding sequence comprises at least one codon substituted to a synonymous codon within a region upstream of the stop codon and wherein the substitution increases folding energy of the region. The present invention further concerns a method of optimizing a coding sequence by introducing a mutation that increases folding energy into a region upstream of the stop codon.
The invention is based on the following surprising findings. First, it was found that selection on mRNA folding strength in most (but not all) species follows a conserved structure with three distinct regions (
Conformance to different model elements varies significantly between the three domains: weak folding at the beginning of the coding regions appears in the great majority of bacterial species (88%) but only in 56%/60% of eukaryotes/archaea respectively (
Second, it was found that in some eukaryotes (in 13% of the analyzed eukaryotes and in one bacterium: D. puniceus) there is significant positive ΔLFE throughout the mid-CDS region (i.e., opposite to the general trend in prokaryotes,
Third, it was shown that the “transition peak”, a region of selection for strong mRNA folding beginning around 30-70 nt downstream of the start codon that was reported elsewhere to be associated with translation efficiency, appears frequently (45%) in the analyzed organisms, indicating this mechanism is common (
Fourth, despite these differences, there was found a strong correlation between the strengths of three profile elements (found at the beginning, middle and end of the coding regions,
Fifth, there were found several variables that correlate with ΔLFE (and account for much of the variation mentioned above). The variables showing the strongest correlation are genomic GC-content (despite being explicitly controlled for by the randomizations as explained above,
The influence on ΔLFE of all traits analyzed in the mid-CDS region can be compared in
Sixth, there were identified four specific conditions that tend to prevent strong ΔLFE from occurring (separately and together). The first two conditions are based on the correlated traits described above: low GC-content and low CUB. Another characteristic is optimum growth temperature, since at higher temperatures base-pairing is weakened and consequently the influence of codon arrangement and composition must also be reduced, and so is any possible effect of ΔLFE. The last disrupting factor, an intracellular life phase, stems from the fact that such organisms generally have lower effective population size (due to recurring population bottlenecks) and lower selection pressure on gene expression (because they partly rely on the host). A binary classification model based on these four features has precision 0.66 and recall 0.82 in classification of ΔLFE strength (see Example 2 and
These results point to cases where evolutionarily close organisms exhibit very different ΔLFE patterns and selection levels. For example, in fungi, members of Pezizomycotina (such as Aspergillus niger or Zymoseptoria brevis) have much more positive ΔLFE compared to members of Saccharomycotina (including Eremothecium gossypii and Candida albicans). Notably, a few eukaryotic species (e.g., the unrelated species Fonticula alba and Saprolegnia parasitica) have a ΔLFE profile that looks typical for bacteria (
Finally, it should be noted that this analysis is based on average values over entire genomes. This provides important statistical power and reduces the random effects of other factors on specific genes. It is important to remember, however, that some of the gene-level factors filtered this way are nevertheless important and there is considerable variation between genes.
By a first aspect, there is provided a nucleic acid molecule comprising a coding sequence comprising at least one codon substituted to a different codon within a first region of said coding sequence, wherein said substitution increases or decreases folding energy of the first region or of RNA encoded by the first region.
In some embodiments, the nucleic acid molecule is an RNA molecule or a DNA molecule. In some embodiments, the nucleic acid molecule is an RNA molecule. In some embodiments, the nucleic acid molecule is a DNA molecule. In some embodiments, the DNA is genomic DNA. In some embodiments, the DNA is cDNA. In some embodiments, the nucleic acid molecule is a vector. In some embodiments, the vector is an expression vector. In some embodiments, the expression vector is a prokaryotic expression vector. In some embodiments, the expression vector is a eukaryotic expression vector. In some embodiments, the prokaryote is a bacterium. In some embodiments, the prokaryote is an archaeon. In some embodiments, the eukaryote is a mammal. In some embodiments, the mammal is a human. In some embodiments, the eukaryote is not a fungus.
In some embodiments, the nucleic acid molecule comprises a coding region. In some embodiments, the nucleic acid molecule comprises a coding sequence. In some embodiments, the coding region comprises a start codon. In some embodiments, the nucleic acid molecule comprises a stop codon. It will be understood by a skilled artisan that both DNA and RNA can be considered to have codons. Within a DNA molecule a codon refers to the 3 bases that will be transcribed into RNA bases that will act as a codon for recognition by a ribosome and will thus translate an amino acid. In some embodiments, the nucleic acid molecule further comprises an untranslated region (UTR). In some embodiments, the UTR is a 5′ UTR. In some embodiments, the UTR is a 3′ UTR.
As used herein, the term “coding sequence” refers to a nucleic acid sequence that when translated results in an expressed protein. In some embodiments, the coding sequence is to be used as a basis for making codon alterations. In some embodiments, the coding sequence is a gene. In some embodiments, the coding sequence is a viral gene. In some embodiments, the coding sequence is a prokaryotic gene. In some embodiments, the coding sequence is a bacterial gene. In some embodiments, the coding sequence is a eukaryotic gene. In some embodiments, the coding sequence is a mammalian gene. In some embodiments, the coding sequence is a human gene. In some embodiments, the coding sequence is a portion of one of the above listed genes. In some embodiments, the coding sequence is a heterologous transgene. In some embodiments, the above listed genes are wild type, endogenously expressed genes. In some embodiments, the above listed genes have been genetically modified or in some way altered from their endogenous formulation. These alterations may be changes to the coding region such that the protein the gene codes for is altered.
The term “heterologous transgene” as used herein refers to a gene that originated in one species and is being expressed in another. In some embodiments, the transgene is a part of a gene originating in another organism. In some embodiments, the heterologous transgene is a gene to be overexpressed. In some embodiments, expression of the heterologous transgene in a wild-type cell reduces global translation in the wild-type cell.
In some embodiments, the nucleic acid molecule further comprises a regulatory element. In some embodiments, the regulatory element is configured to induce transcription of the coding sequence. In some embodiments, the regulatory element is a promoter. In some embodiments, the regulatory element is selected from an activator, a repressor, an enhancer, and an insulator. In some embodiments, the coding region is operably linked to the regulatory element. The term “operably linked” is intended to mean that the coding sequence is linked to the regulatory element or elements in a manner that allows for expression of the coding sequence (e.g., in an in vitro transcription/translation system or in a host cell when the vector is introduced into the host cell). In some embodiments, the promoter is a promoter specific to the expression vector. In some embodiments, the promoter is a viral promoter. In some embodiments, the promoter is a bacterial promoter. In some embodiments, the promoter is a eukaryotic promoter.
A vector nucleic acid sequence generally contains at least an origin of replication for propagation in a cell and optionally additional elements, such as a heterologous polynucleotide sequence, an expression control element (e.g., a promoter or an enhancer), a selectable marker (e.g., antibiotic resistance), and a poly-adenine sequence.
The vector may be a DNA plasmid delivered via non-viral methods or via viral methods. The viral vector may be a retroviral vector, a herpesviral vector, an adenoviral vector, an adeno-associated viral vector or a poxviral vector.
The term “promoter” as used herein refers to a group of transcriptional control modules that are clustered around the initiation site for an RNA polymerase i.e., RNA polymerase II. Promoters are composed of discrete functional modules, each consisting of approximately 7-20 bp of DNA, and containing one or more recognition sites for transcriptional activator or repressor proteins.
In some embodiments, nucleic acid sequences are transcribed by RNA polymerase II (RNAP II or Pol II). RNAP II is an enzyme found in eukaryotic cells. It catalyzes the transcription of DNA to synthesize precursors of mRNA and most snRNA and microRNA.
In some embodiments, mammalian expression vectors include, but are not limited to, pcDNA3, pcDNA3.1 (±), pGL3, pZeoSV2(±), pSecTag2, pDisplay, pEF/myc/cyto, pCMV/myc/cyto, pCR3.1, pSinRep5, DH26S, DHBB, pNMT1, pNMT41, pNMT81, which are available from Invitrogen, pCI which is available from Promega, pMbac, pPbac, pBK-RSV and pBK-CMV which are available from Stratagene, pTRES which is available from Clontech, and their derivatives.
In some embodiments, expression vectors containing regulatory elements from eukaryotic viruses such as retroviruses are used by the present invention. SV40 vectors include pSVT7 and pMT2. In some embodiments, vectors derived from bovine papilloma virus include pBV-1MTHA, and vectors derived from Epstein-Barr virus include pHEBO, and p2O5. Other exemplary vectors include pMSG, pAV009/A+, pMTO10/A+, pMAMneo-5, baculovirus pDSVE, and any other vector allowing expression of proteins under the direction of the SV-40 early promoter, SV-40 late promoter, metallothionein promoter, murine mammary tumor virus promoter, Rous sarcoma virus promoter, polyhedrin promoter, or other promoters shown effective for expression in eukaryotic cells.
In some embodiments, recombinant viral vectors, which offer advantages such as lateral infection and targeting specificity, are used for in vivo expression. In one embodiment, lateral infection is inherent in the life cycle of, for example, retrovirus and is the process by which a single infected cell produces many progeny virions that bud off and infect neighboring cells. In one embodiment, the result is that a large area becomes rapidly infected, most of which was not initially infected by the original viral particles. In one embodiment, viral vectors are produced that are unable to spread laterally. In one embodiment, this characteristic can be useful if the desired purpose is to introduce a specified gene into only a localized number of targeted cells.
In one embodiment, plant expression vectors are used. In one embodiment, the expression of a polypeptide coding sequence is driven by a number of promoters. In some embodiments, viral promoters such as the 35S RNA and 19S RNA promoters of CaMV [Brisson et al., Nature 310:511-514 (1984)], or the coat protein promoter of TMV [Takamatsu et al., EMBO J. 6:307-311 (1987)] are used. In another embodiment, plant promoters are used such as, for example, the small subunit of RUBISCO [Coruzzi et al., EMBO J. 3:1671-1680 (1984); and Broglie et al., Science 224:838-843 (1984)] or heat shock promoters, e.g., soybean hsp17.5-E or hsp17.3-B [Gurley et al., Mol. Cell. Biol. 6:559-565 (1986)]. In one embodiment, constructs are introduced into plant cells using Ti plasmid, Ri plasmid, plant viral vectors, direct DNA transformation, microinjection, electroporation and other techniques well known to the skilled artisan. See, for example, Weissbach & Weissbach [Methods for Plant Molecular Biology, Academic Press, NY, Section VIII, pp 421-463 (1988)]. Other expression systems such as insect and mammalian host cell systems, which are well known in the art, can also be used by the present invention.
It will be appreciated that other than containing the necessary elements for the transcription and translation of the inserted coding sequence (encoding the polypeptide), the expression construct of the present invention can also include sequences engineered to optimize stability, production, purification, yield or activity of the expressed polypeptide.
In some embodiments, the different codon is a synonymous codon. In some embodiments, a codon is substituted to a synonymous codon. In some embodiments, the substitution is a silent substitution. In some embodiments, the substitution is a mutation. In some embodiments, a codon is mutated to another codon. In some embodiments, the other codon is a synonymous codon. In some embodiments, the mutation is a silent mutation.
The term “codon” refers to a sequence of three DNA or RNA nucleotides that correspond to a specific amino acid or stop signal during protein synthesis. The codon code is degenerate, in that more than one codon can code for the same amino acid. Such codons that code for the same amino acid are known as “synonymous” codons. Thus, for example, CUU, CUC, CUA, CUG, UUA, and UUG are synonymous codons that code for Leucine. Synonymous codons are not used with equal frequency. In general, the most frequently used codons in a particular cell are those for which the cognate tRNA is abundant, and the use of these codons enhances the rate of protein translation. Conversely, tRNAs for rarely used codons are found at relatively low levels, and the use of rare codons is thought to reduce translation rate. “Codon bias” as used herein refers generally to the non-equal usage of the various synonymous codons, and specifically to the relative frequency at which a given synonymous codon is used in a defined sequence or set of sequences.
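By way of non-limiting illustration, the synonymous codons of a given codon may be looked up as in the following minimal sketch. The sketch assumes the standard genetic code; the names CODON_TABLE and synonymous_codons are hypothetical and are introduced here only for illustration.

```python
from itertools import product

# Assumption: the standard genetic code. Amino acids are listed in the order
# obtained by cycling the third base fastest over U, C, A, G ("*" marks a stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(codon: str) -> list[str]:
    """Return the other codons that encode the same amino acid (or stop)."""
    codon = codon.upper().replace("T", "U")  # accept DNA or RNA spelling
    amino_acid = CODON_TABLE[codon]
    return sorted(
        other for other, aa in CODON_TABLE.items()
        if aa == amino_acid and other != codon
    )


# Leucine has six codons, so CUU has five synonymous alternatives.
print(synonymous_codons("CUU"))
# Methionine has a single codon, so AUG has none.
print(synonymous_codons("AUG"))
```

A codon with no synonymous alternative, such as AUG, cannot be changed by a synonymous substitution and is left as-is.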
Synonymous codons are provided in Table 6. The first nucleotide in each codon encoding a particular amino acid is shown in the left-most column; the second nucleotide is shown in the top row; and the third nucleotide is shown in the right-most column.
Table 6: Codon table showing synonymous codons
As used herein, the term “silent mutation” refers to a mutation that does not affect or has little effect on protein functionality. A silent mutation can be a synonymous mutation and therefore not change the amino acids at all, or a silent mutation can change an amino acid to another amino acid with the same functionality or structure, thereby having no or a limited effect on protein functionality.
In some embodiments, the first region is from 90 nucleotides upstream of a stop codon of the coding sequence to the stop codon. In some embodiments, the first region is from 50 nucleotides upstream of the stop codon to the stop codon. In some embodiments, the first region is from 40 nucleotides upstream of the stop codon to the stop codon. It will be understood by a skilled artisan that “upstream from the stop codon” refers to from the first base of the stop codon. Thus, the first base of the stop codon is considered to be nucleotide zero, and the base directly 5′ to that first base of the stop codon is therefore 1 nucleotide upstream of the stop codon. Thus, the first region may be from 90, 50 or 40 nucleotides upstream of the stop codon. In some embodiments, the first region does not include the stop codon. In some embodiments, the first region does include the stop codon. In some embodiments, the first region is from 90 nucleotides upstream of the stop codon to 1 nucleotide upstream of the stop codon. In some embodiments, the first region is from 50 nucleotides upstream of the stop codon to 1 nucleotide upstream of the stop codon. In some embodiments, the first region is from 40 nucleotides upstream of the stop codon to 1 nucleotide upstream of the stop codon. In some embodiments, the first region does not comprise the two codons closest to the stop codon. In some embodiments, the first region is from 90 nucleotides upstream of the stop codon to 7 nucleotides upstream of the stop codon. In some embodiments, the first region is from 50 nucleotides upstream of the stop codon to 7 nucleotides upstream of the stop codon. In some embodiments, the first region is from 40 nucleotides upstream of the stop codon to 7 nucleotides upstream of the stop codon.
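The coordinate convention described above may be sketched as follows, by way of non-limiting example. The sketch assumes that the coding sequence is supplied as a string whose final three characters are the stop codon; the function name first_region and its parameters are hypothetical.

```python
def first_region(cds: str, start_upstream: int = 90, end_upstream: int = 1) -> str:
    """Extract the region upstream of the stop codon.

    Assumption: `cds` ends with its stop codon. The first base of the stop
    codon is nucleotide zero; the base directly 5' to it is 1 nucleotide
    upstream. Both bounds are inclusive.
    """
    stop_index = len(cds) - 3  # string index of nucleotide zero
    if not 0 <= end_upstream <= start_upstream:
        raise ValueError("bounds must satisfy 0 <= end_upstream <= start_upstream")
    if stop_index < start_upstream:
        raise ValueError("coding sequence is shorter than the requested region")
    begin = stop_index - start_upstream    # index of the 5'-most base
    end = stop_index - end_upstream + 1    # exclusive slice end
    return cds[begin:end]


# Illustrative sequence only: a start codon, 40 alanine codons and a stop codon.
example_cds = "AUG" + "GCU" * 40 + "UAA"
print(len(first_region(example_cds, 40, 1)))  # region excluding the stop codon
print(len(first_region(example_cds, 40, 7)))  # also excluding the two closest codons
```

Setting end_upstream to 7 omits the six nucleotides of the two codons closest to the stop codon, and setting it to 0 extends the region to the first base of the stop codon.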
In some embodiments, the first region is upstream and proximal to the stop codon and folding energy of the first region or of RNA encoded by the first region is increased. In some embodiments, the folding energy is RNA secondary structure folding Gibbs free energy. In some embodiments, the region is DNA and the folding energy of the RNA encoded by the region is increased. It will be understood by a skilled artisan that the measure of folding energy is generally negative, and that an area with complex secondary structure, i.e., abundant folding, will have a very low, negative folding energy. Thus, increasing folding energy is decreasing secondary structure complexity and decreasing folding. In some embodiments, the substitution increases folding energy of the first region or RNA encoded by the first region to above a predetermined threshold. In some embodiments, the predetermined threshold is −5 kcal/mol/40 bp. In some embodiments, the predetermined threshold is −6 kcal/mol/40 bp. In some embodiments, the predetermined threshold is −6.09 kcal/mol/40 bp. In some embodiments, the predetermined threshold is −6.8 kcal/mol/40 bp. In some embodiments, the threshold is a statistically significant increase. In some embodiments, the threshold is derived from a randomized sequence. In some embodiments, the threshold is derived from a null hypothesis. In some embodiments, the threshold is the folding energy of a random sequence. In some embodiments, the threshold is 0 kcal/mol/40 bp. In some embodiments, the threshold is a value above which the difference as compared to the already existing folding energy would be significant. In some embodiments, the threshold is a level that is statistically significant as compared to a null model for folding energy of the region. In some embodiments, the threshold is organism specific. In some embodiments, the threshold is selected from a threshold provided in Table 1.
In some embodiments, the threshold is domain-specific and selected from a threshold provided in Table 1. In some embodiments, the threshold is species-specific and is selected from a threshold provided in Table 5. In embodiments wherein the species is not provided in Table 5, the more general thresholds from Table 1 are used. In some embodiments, the threshold is selected from a threshold provided in Table 5. In some embodiments, the domain is Archaea, and the threshold is −5.76 kcal/mol/40 bp. In some embodiments, the threshold is an archaeal threshold, and the threshold is −5.76 kcal/mol/40 bp. In some embodiments, the domain is Bacteria, and the threshold is −6.17 kcal/mol/40 bp. In some embodiments, the threshold is a bacterial threshold, and the threshold is −6.17 kcal/mol/40 bp. In some embodiments, the domain is Eukaryotes, and the threshold is −5.95 kcal/mol/40 bp. In some embodiments, the threshold is a eukaryotic threshold, and the threshold is −5.95 kcal/mol/40 bp. In some embodiments, the threshold is the native LFE mean at 0 nt. In some embodiments, the mean at 0 nt in the table is the threshold for a given domain or species.
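As a non-limiting sketch of how synonymous substitutions may be selected to increase (make less negative) the folding energy of a region, the following example applies a simple greedy search. It is an illustration under stated assumptions and not the method of the invention: the function toy_folding_energy is a hypothetical stand-in that merely counts the maximum number of nested base pairs (a Nussinov-style score) and returns its negative, and it is not a Gibbs free energy in kcal/mol. In practice a thermodynamic folding predictor would be supplied through the energy parameter. All names are hypothetical.

```python
from itertools import product

# Assumption: the standard genetic code, third base cycling fastest over U, C, A, G.
CODON_TABLE = dict(zip(
    ("".join(c) for c in product("UCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_folding_energy(rna: str, min_loop: int = 3) -> float:
    """Toy proxy: minus the maximum number of nested base pairs (not kcal/mol)."""
    n = len(rna)
    if n == 0:
        return 0.0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: base j is unpaired
            for k in range(i, j - min_loop):  # case: base k pairs with base j
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return -float(best[0][n - 1])


def translate(rna: str) -> str:
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3))


def increase_folding_energy(region: str, energy=toy_folding_energy):
    """Greedily accept synonymous substitutions that raise the folding energy.

    Assumption: `region` is in frame and its length is a multiple of three.
    """
    region = region.upper().replace("T", "U")
    codons = [region[i:i + 3] for i in range(0, len(region), 3)]
    current = energy("".join(codons))
    for idx, original in enumerate(codons):
        amino_acid = CODON_TABLE[original]
        for alt, aa in CODON_TABLE.items():
            if aa != amino_acid or alt == original:
                continue
            trial = codons[:idx] + [alt] + codons[idx + 1:]
            trial_energy = energy("".join(trial))
            if trial_energy > current:  # higher means less folding
                codons, current = trial, trial_energy
    return "".join(codons), current


# Illustrative, GC-rich region encoding Gly-Gly-Ala-Ala-Pro-Pro.
region = "GGGGGGGCUGCUCCCCCC"
optimized, optimized_energy = increase_folding_energy(region)
print(toy_folding_energy(region), optimized_energy, optimized)
```

Because only synonymous codons are considered, the encoded amino acid sequence is unchanged. A greedy search of this kind is not guaranteed to reach the maximum possible folding energy; an exhaustive or heuristic search over codon combinations may be used for that purpose.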
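The selection of a threshold described above, in which a species-specific value is used when available and the domain-specific value is used otherwise, may be sketched as follows. The domain values are those recited above; the species-specific mapping stands in for Table 5 and is left empty here, and the species name and value in the usage lines are placeholders rather than data from the tables. All function and variable names are hypothetical.

```python
# Domain-specific thresholds recited above, in kcal/mol/40 bp.
DOMAIN_THRESHOLDS = {
    "archaea": -5.76,
    "bacteria": -6.17,
    "eukaryotes": -5.95,
}


def select_threshold(species: str, domain: str, species_thresholds=None) -> float:
    """Prefer a species-specific threshold; otherwise fall back to the domain value.

    Assumption: `species_thresholds` maps species names to thresholds taken
    from a species-level table; it is not populated in this sketch.
    """
    species_thresholds = species_thresholds or {}
    if species in species_thresholds:
        return species_thresholds[species]
    return DOMAIN_THRESHOLDS[domain.lower()]


def exceeds_threshold(folding_energy: float, threshold: float) -> bool:
    """Folding energy is generally negative, so 'above' means less negative."""
    return folding_energy > threshold


# Placeholder species and value, for illustration only.
placeholder_table = {"Hypothetical species X": -5.0}
print(select_threshold("Hypothetical species X", "Bacteria", placeholder_table))
print(select_threshold("Unlisted species", "Bacteria", placeholder_table))
print(exceeds_threshold(-5.5, select_threshold("Unlisted species", "Archaea")))
```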
Acidiplasma aeolicum str. VT
Aeropyrum camini SY1 = JCM 12091
Aeropyrum pernix K1
Archaeoglobus fulgidus DSM 4304
Caldisphaera lagunensis DSM 15908
Candidatus Haloredivivus sp. G17
Candidatus Korarchaeum cryptofilum OPF8
Candidatus Methanomassiliicoccus intestinalis
Candidatus Methanomethylophilus alvus Mx1201
Candidatus Nanopusillus acidilobi
Candidatus Nitrosoarchaeum limnia BG20
Candidatus Nitrosopumilus koreensis AR1
Candidatus Nitrososphaera gargensis Ga9.2
Cenarchaeum symbiosum A
Ferroglobus placidus DSM 10642
Ferroplasma acidarmanus fer1
Halobacterium salinarum NRC-1
Halobacterium salinarum R1
Haloferax mediterranei ATCC 33500
Halogeometricum borinquense DSM 11551
Halopiger xanaduensis SH-6
Haloquadratum walsbyi DSM 16790
Halosimplex carlsbadense 2-9-1
Ignisphaera aggregans DSM 17230
Methanobrevibacter smithii ATCC 35061
Methanocaldococcus jannaschii DSM 2661
Methanococcus maripaludis S2
Methanocorpusculum labreanum Z
Methanoculleus bourgensis MS2
Methanofollis liminatans DSM 4140
Methanohalobium evestigatum Z-7303
Methanomethylovorans hollandica DSM 15978
Methanopyrus kandleri AV19
Methanosarcina acetivorans C2A
Methanosarcina mazei S-6
Methanosphaera stadtmanae DSM 3091
Methanosphaerula palustris E1-9c
Methanothermobacter thermautotrophicus
Nanoarchaeum equitans
Nanohaloarchaea archaeon SG9
Natronobacterium gregoryi SP2
Nitrosopumilus maritimus SCM1
Nitrososphaera viennensis EN76
Palaeococcus pacificus DY20341
Picrophilus torridus DSM 9790
Pyrobaculum aerophilum str. IM2
Pyrococcus abyssi GE5
Pyrococcus furiosus DSM 3638
Pyrococcus horikoshii OT3
Pyrodictium delaneyi
Pyrolobus fumarii 1A
Sulfolobus islandicus L.S.2.15
Sulfolobus tokodaii str. 7
Thaumarchaeota archaeon SCGC AB-539-E09
Thermococcus barophilus MP
Thermococcus cleftensis
Thermococcus gammatolerans EJ3
Thermococcus guaymasensis DSM 11113
Thermococcus nautili
Thermoplasma acidophilum DSM 1728
Thermoplasma volcanium GSS1
Thermoproteus tenax Kra 1
Vulcanisaeta distributa DSM 14429
Abiotrophia defectiva ATCC 49176
Acetobacter pasteurianus 386B
Acetohalobium arabaticum DSM 5501
Acetonema longum DSM 6540
Acholeplasma laidlawii PG-8A
Acidimicrobium ferrooxidans DSM 10331
Acidithiobacillus ferrivorans SS3
Acidithiobacillus ferrooxidans ATCC 23270
Acidobacterium capsulatum ATCC 51196
Acidothermus cellulolyticus 11B
Acinetobacter baumannii ATCC 17978
Aequorivita sublithincola DSM 14238
Agrobacterium fabrum str. C58
Agrobacterium tumefaciens LBA4213 (Ach5)
Ahrensia marina str. LZD062
Akkermansia muciniphila ATCC BAA-835
Alcanivorax borkumensis SK2
Alicyclobacillus acidocaldarius LAA1
Alkalilimnicola ehrlichii MLHE-1
Anabaena sp. 90
Anaerobaculum mobile DSM 13181
Anaerococcus prevotii DSM 20548
Anaerolinea thermophila UNI-1
Anoxybacillus flavithermus WK1
Aquifex aeolicus VF5
Arthrospira platensis NIES-39
Asticcacaulis excentricus CB 48
Bacillus coagulans DSM 1 = ATCC 7050
Bacillus halodurans C-125
Bacillus selenitireducens MLS10
Bacillus subtilis subsp. subtilis str. 168
Bacteroides fragilis YCH46
Bacteroides nordii
Bacteroides thetaiotaomicron VPI-5482
Bartonella henselae str. Houston-1
Bdellovibrio bacteriovorus HD100
Berkelbacteria bacterium GW2011_GWA1_36_9
Bifidobacterium animalis subsp. animalis ATCC
Bizionia argentinensis JUB59
Blattabacterium sp. (Blattella germanica) str. Bge
Bordetella parapertussis Bpp5
Brachyspira murdochii DSM 12563
Bradyrhizobium japonicum SEMIA 5079
Brevibacillus brevis NBRC 100599
Brevundimonas subvibrioides ATCC 15264
Brucella melitensis bv. 1 str. 16M
Buchnera aphidicola str. APS (Acyrthosiphon pisum)
Caldilinea aerophila DSM 14535 = NBRC 104270
Caldisericum exile AZM16c01
Calditerrivibrio nitroreducens DSM 19672
Caldithrix abyssi DSM 13497
Campylobacter jejuni subsp. jejuni NCTC 11168 =
Candidatus Azambacteria bacterium
Candidatus Azambacteria bacterium
Candidatus Beckwithbacteria bacterium
Candidatus Blochmannia floridanus
Candidatus Collierbacteria bacterium
Candidatus Curtissbacteria bacterium
Candidatus Desulforudis audaxviator MP104C
Candidatus Endomicrobium trichonymphae
Candidatus Entotheonella sp. TSY1
Candidatus Entotheonella sp. TSY2
Candidatus Falkowbacteria bacterium
Candidatus Hepatoplasma crinochetorum Av
Candidatus Jorgensenbacteria bacterium
Candidatus Kaiserbacteria bacterium
Candidatus Kaiserbacteria bacterium
Candidatus Kinetoplastibacterium oncopeltii
Candidatus Magasanikbacteria bacterium
Candidatus Magnetobacterium bavaricum
Candidatus Moranella endobia PCIT
Candidatus Nomurabacteria bacterium
Candidatus Nomurabacteria bacterium
Candidatus Nomurabacteria bacterium
Candidatus Nomurabacteria bacterium
Candidatus Pelagibacter sp. IMCC9063
Candidatus Peregrinibacteria bacterium
Candidatus Photodesmus katoptron Akat1
Candidatus Solibacter usitatus Ellin6076
Candidatus Woesebacteria bacterium
Candidatus Wolfebacteria bacterium
Candidatus Yanofskybacteria bacterium
Capnocytophaga ochracea DSM 7271
Catenulispora acidiphila DSM 44928
Caulobacter crescentus CB15
Cellulophaga lytica
Cetobacterium somerae ATCC BAA-474
Chlamydia abortus S26-3
Chlamydophila pneumoniae CWL029
Chlamydophila pneumoniae J138
Chlorobaculum parvum NCIB 8327
Chlorobium tepidum TLS
Chloroflexus aggregans DSM 9485
Chloroflexus aurantiacus J-10-fl
Chloroherpeton thalassium ATCC 35110
Chromobacterium violaceum ATCC 12472
Chryseobacterium greenlandense
Chthonomonas calidirosea T49
Clavibacter michiganensis subsp. michiganensis
Cloacibacillus evryensis DSM 19522
Clostridium lentocellum DSM 5427
Clostridium tetani E88
Cobetia amphilecti str. KMM 296
Conexibacter woesei DSM 14684
Coraliomargarita akajimensis DSM 45221
Corynebacterium efficiens YS-314
Corynebacterium glutamicum ATCC 13032
Coxiella burnetii RSA 493
Croceibacter atlanticus HTCC2559
Cryobacterium sp. MLB-32
Curtobacterium flaccumfaciens UCD-AKU
Deferribacter desulfuricans SSM1
Dehalococcoides mccartyi CBDB1
Dehalococcoides mccartyi CG5
Dehalogenimonas lykanthroporepellens BL-DC-9
Deinococcus geothermalis DSM 11300 str.
Deinococcus peraridilitoris DSM 19664
Deinococcus puniceus
Deinococcus radiodurans R1
Denitrovibrio acetiphilus DSM 12809
Desulfobacula toluolica Tol2
Desulfonatronospira thiodismutans ASO3-1
Desulfosporosinus orientis DSM 765
Desulfovibrio vulgaris str. Hildenborough
Desulfurispirillum indicum S5
Desulfurobacterium thermolithotrophum DSM
Dialister microaerophilus UPII 345-E
Dictyoglomus thermophilum H-6-12
Dictyoglomus turgidum DSM 6724
Eggerthia catenaformis OT 569 = DSM 20559
Elusimicrobium minutum Pei191
Enterococcus faecalis V583
Enterovibrio norvegicus FF-454
Erythrobacter litoralis HTCC2594
Escherichia coli str. K-12 substr. MG1655
Escherichia coli str. K-12 substr. W3110
Exiguobacterium sp. AT1b
Fervidobacterium nodosum Rt17-B1
Fibrobacter succinogenes subsp. succinogenes S85
Fimbriimonas ginsengisoli Gsoil 348
Flavobacteriales bacterium ALC-1
Flavobacterium limnosediminis JC2902
Flavobacterium psychrophilum JIP02/86
Fluviicola taffensis DSM 16823
Formosa agariphila KMM 3901
Frateuria aurantia DSM 6220
Fructobacillus fructosus KCTC 3544
Fusobacterium gonidiaformans ATCC 25563
Fusobacterium nucleatum subsp. nucleatum ATCC
Fusobacterium periodonticum 2_1_31
Galbibacter marinus
Gardnerella vaginalis 409-05
Gelidibacter algens
Gemmata sp. SH-PL17
Gemmatimonas aurantiaca T-27
Gemmatimonas phototrophica
Gemmatirosa kalamazoonesis
Geoalkalibacter ferrihydriticus DSM 17813
Geobacillus kaustophilus HTA426
Geobacillus stearothermophilus 10
Geobacter lovleyi SZ
Gloeobacter kilaueensis JS1
Gloeobacter violaceus PCC 7421
Gluconobacter oxydans 621H
Gramella forsetii KT0803
Granulibacter bethesdensis CGDNIH1
Haemophilus ducreyi 35000HP
Halobacillus halophilus DSM 2266
Halobacteriovorax marinus SJ
Haloplasma contractile SSD-17B
Halothermothrix orenii H 168
Halothiobacillus neapolitanus c2
Helicobacter pylori 26695
Herpetosiphon aurantiacus DSM 785
Hippea maritima DSM 10411
Holospora undulata HU1
Hydrocarboniphaga effusa AP103
Hydrogenobacter thermophilus TK-6
Hydrogenobaculum sp. HO
Ignavibacterium album JCM 16511
Ilumatobacter coccineus YM16-304
Ilyobacter polytropus DSM 2926
Imtechella halotolerans K1
Isoptericola variabilis 225
Isosphaera pallida ATCC 43644
Joostella marina DSM 19592
Kineococcus radiotolerans SRS30216 = ATCC
Kitasatospora setae KM-6054
Klebsiella pneumoniae subsp. pneumoniae HS11286
Kluyvera ascorbata ATCC 33433
Kosmotoga olearia TBF 19.5.1
Kosmotoga pacifica
Ktedonobacter racemifer DSM 44963
Lacinutrix sp. 5H-3-7-4
Lactobacillus johnsonii NCC 533
Lactobacillus plantarum WCFS1
Lactococcus garvieae Lg2
Lactococcus lactis subsp. lactis Il1403
Leclercia adecarboxylata ATCC 23216 = NBRC
Leeuwenhoekiella blandensis MED217
Leifsonia xyli subsp. xyli str. CTCB07
Lelliottia amnigena CHS 78
Lentisphaera araneosa HTCC2155
Leptospira biflexa serovar Patoc strain ‘Patoc 1
Leptospira interrogans serovar Copenhageni str.
Leptospirillum ferriphilum YSK
Leptotrichia goodfellowii F0264
Listeria innocua Clip11262
Listeria monocytogenes EGD-e
Lyngbya confervoides BDU141951
Magnetococcus marinus MC-1
Marinithermus hydrothermalis DSM 14884
Marinitoga piezophila KA3
Meiothermus ruber DSM 1279
Mesorhizobium australicum WSM2073
Mesotoga prima MesG1.Ag.4.2
Methylacidiphilum infernorum V4
Methylobacterium extorquens PA1
Methylococcus capsulatus str. Bath
Microcystis aeruginosa NIES-843
Mitsuokella multacida DSM 20544
Mobiluncus curtisii ATCC 43063
Mucispirillum schaedleri ASF457
Muricauda ruestringensis DSM 13258
Mycobacterium leprae TN
Mycobacterium tuberculosis H37Rv
Mycoplasma agalactiae PG2
Mycoplasma genitalium G37
Mycoplasma mycoides subsp. mycoides SC str. PG1
Mycoplasma penetrans HF-2
Mycoplasma pneumoniae M129
Mycoplasma pulmonis UAB CTIP
Natranaerobius thermophilus JW/NM-WN-LF
Neisseria meningitidis MC58
Neorhizobium galegae bv. orientalis str. HAMBI
Nitritalea halalkaliphila LW7
Nitrococcus mobilis Nb-231
Nitrolancea hollandica Lb
Nitrosomonas europaea ATCC 19718
Nitrospina gracilis 3-211
Nitrospira defluvii
Nocardioides sp. JS614
Nonlabens dokdonensis DSW-6
Nostoc punctiforme PCC 73102
Oceanithermus profundus DSM 14977
Oceanobacillus iheyensis HTE831
Oenococcus oeni PSU-1
Olsenella uli DSM 7084
Opitutus terrae PB90-1
Oscillochloris trichoides DG-6
Owenweeksia hongkongensis DSM 17368
Parachlamydia acanthamoebae UV-7
Parageobacillus toebii
Parcubacteria group bacterium
Parcubacteria group bacterium
Parcubacteria group bacterium
Parcubacteria group bacterium
Parvibaculum lavamentivorans DS-1
Parvularcula bermudensis HTCC2503
Pasteurella multocida str. ATCC 43137
Persephonella marina EX-H1
Petrotoga mobilis SJ95
Photobacterium profundum SS9
Photorhabdus luminescens subsp. laumondii TTO1
Phycisphaera mikurensis NBRC 102666
Piscirickettsia salmonis LF-89 = ATCC VR-1361
Planctopirus limnophila DSM 3776
Porphyromonas gingivalis ATCC 33277
Prochlorococcus marinus str. MIT 9301
Pseudomonas aeruginosa PAO1
Pseudomonas stutzeri
Pseudothermotoga hypogea DSM 11164 = NBRC
Psychrobacter arcticus 273-4
Psychrobacter cryohalolentis K5
Psychroflexus gondwanensis ACAM 44
Ralstonia solanacearum GMI1000
Ramlibacter tataouinensis TTB310
Rathayibacter toxicus
Renibacterium salmoninarum ATCC 33209
Rhizobium leguminosarum bv. trifolii CB782
Rhodopirellula baltica SH 1
Rhodopseudomonas palustris CGA009
Rhodothermus marinus DSM 4252
Richelia intracellularis HH01
Robiginitalea biformata HTCC2501
Roseburia hominis A2-183
Roseiflexus castenholzii DSM 13941
Rothia dentocariosa ATCC 17931
Rubidibacter lacunae KORDI 51-2
Saccharopolyspora erythraea NRRL 2338
Salegentibacter salarius
Salinicoccus halodurans
Salinicoccus roseus
Salinicoccus sediminis
Salinisphaera shabanensis E1L3A
Salinispira pacifica
Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
Sebaldella termitidis ATCC 33386
Shewanella oneidensis MR-1
Siansivirga zeaxanthinifaciens CC-SAMT-1
Simkania negevensis Z
Singulisphaera acidiphila DSM 18658
Sinorhizobium meliloti 1021
Slackia piriformis YIT 12062
Solitalea canadensis DSM 3403
Sphaerobacter thermophilus DSM 20745
Sphaerochaeta globosa str. Buddy
Stackebrandtia nassauensis DSM 44728
Staphylococcus aureus subsp. aureus NCTC 8325
Staphylococcus epidermidis ATCC 12228
Streptobacillus moniliformis DSM 12112
Streptococcus pyogenes M1 GAS
Streptomyces avermitilis MA-4680 = NBRC 14893
Streptomyces coelicolor A3(2)
Streptomyces thermoautotrophicus
Succinatimonas hippei YIT 12066
Sulfurihydrogenibium azorense Az-Fu1
Sulfurihydrogenibium yellowstonense SS-5
Sulfurimonas denitrificans DSM 1251
Synechococcus elongatus PCC 6301
Synechococcus sp. CC9902
Synechocystis sp. PCC 6803
Tepidanaerobacter acetatoxydans Re1
Thalassolituus oleivorans R6-15
Thalassospira profundimaris WP0211
Thermanaerovibrio acidaminovorans DSM 6589
Thermobaculum terrenum ATCC BAA-798
Thermobifida fusca YX
Thermobispora bispora DSM 43833
Thermocrinis albus DSM 14484
Thermodesulfatator indicus DSM 15286
Thermodesulfobacterium commune DSM 2178
Thermodesulfobacterium geofontis OPF15
Thermodesulfovibrio yellowstonii DSM 11347
Thermomicrobium roseum DSM 5159
Thermosipho africanus TCF52B
Thermosipho melanesiensis BI429
Thermosulfidibacter takaii ABI70S6
Thermotoga maritima MSB8
Thermovibrio ammonificans HB-1
Thermovirga lienii DSM 17291
Thermus aquaticus Y51MC23
Thermus oshimai JL-2
Thermus thermophilus HB8
Thiocapsa marina 5811
Thiohalorhabdus denitrificans
Thiovulum sp. ES
Tolypothrix campylonemoides VB511288
Treponema denticola ATCC 35405
Trichodesmium erythraeum IMS101
Tropheryma whipplei str. Twist
Truepera radiovictrix DSM 17093
Tumebacillus flagellatus
Turicella otitidis ATCC 51513
Ureaplasma parvum serovar 3 str. ATCC 27815
Verrucosispora maris AB-18-032
Vibrio fischeri MJ11
Vibrio parahaemolyticus RIMD 2210633
Vibrio vulnificus YJ016
Winogradskyella psychrotolerans RS-3
Wolbachia pipientis wAlbB
Wolinella succinogenes DSM 1740
Xanthomonas axonopodis Xac29-1
Xanthomonas campestris pv. campestris str. ATCC
Xylella fastidiosa 9a5c
Xylella fastidiosa subsp. sandyi Ann-1
Zunongwangia profunda SM-A87
Acanthamoeba castellanii str. Neff
Adineta vaga
Albugo candida
Allomyces macrogynus ATCC 38327
Amphimedon queenslandica
Aspergillus niger
Aureococcus anophagefferens
Babesia bovis T2Bo
Bigelowiella natans
Botryobasidium botryosum FD-172 SS1
Candida albicans SC5314
Capsaspora owczarzaki ATCC 30864
Chlamydomonas reinhardtii
Chondrus crispus (carragheen)
Coccomyxa subellipsoidea C-169
Cryptococcus neoformans var. neoformans JEC21
Cryptomonas paramecium
Cryptosporidium parvum Iowa II
Cyanidioschyzon merolae
Daphnia pulex
Dictyostelium discoideum AX4
Diplodia seriata
Dunaliella salina
Emiliania huxleyi CCMP1516
Entamoeba histolytica HM-1: IMSS-A
Eremothecium cymbalariae DBVPG#7215
Eremothecium gossypii ATCC 10895 (assembly
Fistulifera solans
Fonticula alba
Fragilariopsis cylindrus CCMP1102
Galdieria sulphuraria
Giardia lamblia ATCC 50803
Guillardia theta CCMP2712
Gymnopus luxurians FD-317 M1
Hypholoma sublateritium FD-334 SS-4
Laccaria bicolor S238N-H82
Leishmania major strain Friedlin
Magnaporthe oryzae
Micromonas pusilla CCMP1545
Mnemiopsis leidyi
Moniliophthora perniciosa FA553
Monosiga brevicollis MX1
Naegleria gruberi strain NEG-M
Nematostella vectensis
Neofusicoccum parvum UCRNP2
Ostreococcus lucimarinus
Paramecium tetraurelia strain d4-2
Perkinsus marinus ATCC 50983
Phaeodactylum tricornutum CCAP 1055/1
Physcomitrella patens
Phytophthora ramorum
Plasmodium falciparum 3D7
Plasmopara halstedii
Pneumocystis murina b123
Postia placenta Mad-698-R
Puccinia graminis f. sp. tritici
Pythium vexans DAOM BR484
Saccharomyces cerevisiae S288c
Salpingoeca rosetta
Saprolegnia parasitica CBS 223.65
Schizophyllum commune H4-8
Schizosaccharomyces pombe (strain 972/ATCC
Spirodela polyrhiza
Spizellomyces punctatus DAOM BR117
Sporothrix schenckii 1099-18
Tetrahymena thermophila SB210
Thalassiosira pseudonana
Theileria annulata strain Ankara
Toxoplasma gondii ME49
Trichomonas vaginalis G3
Trichoplax adhaerens
Trypanosoma cruzi
Vanderwaltozyma polyspora DSM 70294
Volvox carteri
Wickerhamomyces anomalus NRRL Y-366-8
Wickerhamomyces ciferrii
Zymoseptoria brevis
Zymoseptoria tritici
In some embodiments, the threshold is species-specific. In some embodiments, the threshold is domain-specific. In some embodiments, the threshold is kingdom-specific. In some embodiments, the threshold is a prokaryotic threshold. In some embodiments, the threshold is a eukaryotic threshold. In some embodiments, the threshold is an archaeal threshold. In some embodiments, the threshold is a bacterial threshold.
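The fallback from a species-specific threshold to a domain-level threshold can be illustrated with a minimal sketch. The domain values are those stated above; the function name `select_threshold` and the `species_thresholds` argument are hypothetical, and no species-level values are assumed here because they would have to be taken from Table 5.

```python
# Domain-level thresholds quoted in the text (kcal/mol per 40-nt window).
DOMAIN_THRESHOLDS = {
    "archaea": -5.76,
    "bacteria": -6.17,
    "eukaryotes": -5.95,
}


def select_threshold(domain, species=None, species_thresholds=None):
    """Return the species-specific threshold when one is supplied,
    otherwise fall back to the more general domain-level threshold.

    `species_thresholds` is a hypothetical mapping (species name -> value)
    that a user would populate from Table 5; it is not defined here.
    """
    if species_thresholds and species in species_thresholds:
        return species_thresholds[species]
    return DOMAIN_THRESHOLDS[domain.lower()]
```

For example, `select_threshold("Bacteria")` returns the bacterial value, and passing a species that is absent from the supplied mapping falls back to the value for its domain.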
In some embodiments, the first region comprises at least one codon substituted to another codon. In some embodiments, the first region comprises a plurality of codons substituted to another codon. In some embodiments, each substitution increases folding energy of the first region or RNA encoded by the first region. In some embodiments, the plurality of mutations in combination increases folding energy of the first region or RNA encoded by the first region.
In some embodiments, at least 1, at least 2, at least 3, at least 4, at least 5, at least 10, at least 15, at least 20, at least 25, or at least 30 codons of the first region have been substituted. Each possibility represents a separate embodiment of the present invention. In some embodiments, at least 5%, at least 10%, at least 15%, at least 20%, at least 25%, at least 30%, at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or 100% of all codons in the region have been substituted. Each possibility represents a separate embodiment of the present invention. In some embodiments, at least 5%, at least 10%, at least 15%, at least 20%, at least 25%, at least 30%, at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or 100% of codons in the region that have synonymous codons that increase the folding energy of the region have been substituted. Each possibility represents a separate embodiment of the present invention.
In some embodiments, all possible codons within the first region are substituted to synonymous codons that increase folding energy of the region or RNA encoded by the region. In some embodiments, codons are substituted to synonymous codons to produce a region with the highest possible folding energy while maintaining the amino acid sequence of a peptide encoded by the region. In some embodiments, all possible combinations of synonymous mutations are examined and the combination with the highest folding energy is selected. In some embodiments, the region comprises synonymous codons substituted to increase folding energy to a maximum possible for the region.
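Examining all combinations of synonymous codons and selecting the one with the highest folding energy can be sketched as an exhaustive search. This is only an illustration: `toy_energy` is a hypothetical stand-in for a folding program (it merely penalizes G/C content), the standard genetic code is assumed, and the region is assumed to be in frame. The number of combinations grows exponentially with region length, so the approach is practical only for short regions.

```python
from itertools import product

# Standard genetic code (assumption), built in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def toy_energy(seq):
    """Hypothetical placeholder for a folding program: each G or C makes the
    score more negative (stronger folding). Not a real energy model."""
    return -float(sum(base in "GC" for base in seq))


def max_energy_variant(region, energy=toy_energy):
    """Enumerate every synonymous codon combination for an in-frame region
    and return the sequence with the highest folding energy."""
    codons = [region[i:i + 3] for i in range(0, len(region), 3)]
    choices = [SYNONYMS[CODON_TO_AA[c]] for c in codons]
    return max(("".join(combo) for combo in product(*choices)), key=energy)
```

In practice the `energy` argument would be replaced by a call to an actual folding predictor; the search logic itself does not depend on which one is used.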
In some embodiments, the coding sequence comprises a second region. In some embodiments, the second region is from the translational start site (TSS) to 20 nucleotides downstream of the TSS. In some embodiments, the TSS is a start codon. It will be understood by a skilled artisan that the first base of the start codon is considered base 1, and so bases 1 to 3 of the region are the start codon. In some embodiments, the second region comprises the start codon. In some embodiments, the second region is from the TSS to 10 nucleotides downstream. In some embodiments, the second region is from the TSS to 150 nucleotides downstream. In some embodiments, the second region does not include the start codon. In some embodiments, the second region comprises at least one codon substituted to another codon. In some embodiments, the other codon is a synonymous codon. In some embodiments, the substitution increases folding energy in the second region or of RNA encoded by the second region. In some embodiments, the second region comprises synonymous mutations that increase the folding energy of the region or of RNA encoded by the region to a maximum possible while retaining the amino acid sequence encoded by the region.
In some embodiments, the coding sequence comprises a third region. In some embodiments, the third region is from the first region to the second region. In some embodiments, the third region is between the first region and the second region. In some embodiments, the third region is from the end of the second region to the beginning of the first region. In some embodiments, the third region is between the end of the second region and the beginning of the first region. In some embodiments, the third region does not overlap with the first region, the second region or both. In some embodiments, the third region does not overlap with the first region. In some embodiments, the third region does not overlap with the second region. In some embodiments, the third region overlaps with the second region. In some embodiments, the third region is from 20 to 50 nucleotides downstream of the TSS. In some embodiments, the third region is from 21 to 50 nucleotides downstream of the TSS. In some embodiments, the third region is from 20 to 70 nucleotides downstream of the TSS. In some embodiments, the third region is from 21 to 70 nucleotides downstream of the TSS. In some embodiments, the third region is from 20 to 150 nucleotides downstream of the TSS. In some embodiments, the third region is from 21 to 150 nucleotides downstream of the TSS. In some embodiments, the third region is from 20 to 300 nucleotides downstream of the TSS. In some embodiments, the third region is from 21 to 300 nucleotides downstream of the TSS. In some embodiments, the third region is from 300 to 90 nucleotides upstream of the stop codon. In some embodiments, the third region is from 300 to 70 nucleotides upstream of the stop codon. In some embodiments, the third region is from 300 to 50 nucleotides upstream of the stop codon. In some embodiments, the third region is from 300 to 40 nucleotides upstream of the stop codon.
In some embodiments, the third region comprises at least one codon substituted to another codon. In some embodiments, the other codon is a synonymous codon. In some embodiments, the substitution decreases folding energy in the third region or of RNA encoded by the third region. In some embodiments, the third region comprises synonymous mutations that decrease the folding energy of the region or of RNA encoded by the region to a minimum possible while retaining the amino acid sequence encoded by the region.
In some embodiments, the first region is the second region. In some embodiments, the first region is the third region. In some embodiments, the coding sequence comprises only the second region. In some embodiments, the coding region comprises only the third region. In some embodiments, the coding region comprises the second and third regions and not the first region.
Whether a mutation increases or decreases local folding energy can be determined by modeling or empirically. Methods of determining local folding energy are well known in the art and any such method may be employed. Methods are also provided herein and any of these methods may be employed. In some embodiments, the method comprises determining the local folding energy for a region, generating at least one mutation in the region, determining the local folding energy in the mutated region and selecting the mutation if it increases the local folding energy. In some embodiments, the method comprises determining the local folding energy for a region, generating at least one mutation in the region, determining the local folding energy in the mutated region and selecting the mutation if it decreases the local folding energy. In some embodiments, determining local folding energy comprises inputting the sequence into a folding program. In some embodiments, a folding program is a program that predicts RNA folding. In some embodiments, a folding program is a program that models RNA folding. In some embodiments, a folding program provides a folding energy for a sequence. In some embodiments, the folding energy is local folding energy. In some embodiments, local is over a given window. In some embodiments, the window is 40 nt. In some embodiments, the sequence is the sequence of the region. Examples of folding programs are well known in the art and include, for example, Mfold, RNAfold, RNA123, RNAshapes, RNAstructure, RNAstructureWeb, RNAslider and UNAFold, to name but a few. In some embodiments, local folding energy is determined with RNAfold. Once the local folding energy is found for a given sequence over a given window, various mutations can be tested for their effect on local folding energy. A mutation that increases folding energy or a mutation that decreases folding energy can be selected. Multiple mutations can be tested at once, or one at a time.
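The determine, mutate, re-determine and select procedure described above can be sketched as follows. The sketch makes several assumptions that are not taken from the specification: `toy_energy` is a hypothetical placeholder for a folding program, local folding energy over a region is summarized as the mean over all 40-nt windows, and the function names are illustrative only.

```python
def toy_energy(seq):
    """Hypothetical placeholder for a folding program: each G or C makes the
    score more negative (stronger folding). Not a real energy model."""
    return -float(sum(base in "GC" for base in seq))

# With the ViennaRNA Python bindings installed, a real predictor could be
# supplied instead, e.g. energy=lambda s: RNA.fold(s)[1], since RNA.fold
# returns a (structure, minimum free energy) pair.


def windowed_energies(seq, energy=toy_energy, window=40):
    """Folding energy of every `window`-nt window along `seq`."""
    if len(seq) <= window:
        return [energy(seq)]
    return [energy(seq[i:i + window]) for i in range(len(seq) - window + 1)]


def local_folding_energy(seq, energy=toy_energy, window=40):
    """Mean of the per-window energies (averaging is an assumption here)."""
    values = windowed_energies(seq, energy, window)
    return sum(values) / len(values)


def try_mutation(region, position, new_base, direction="increase",
                 energy=toy_energy, window=40):
    """Apply one substitution and keep it only if the local folding energy
    moves in the requested direction ("increase" or "decrease")."""
    mutated = region[:position] + new_base + region[position + 1:]
    before = local_folding_energy(region, energy, window)
    after = local_folding_energy(mutated, energy, window)
    accepted = after > before if direction == "increase" else after < before
    return (mutated if accepted else region), before, after
```

Calling `try_mutation` repeatedly corresponds to testing mutations one at a time; testing several at once would amount to applying all substitutions before the second energy determination.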
When the folding architecture of a window is known, the mutations can be designed rationally, as generating mismatches in areas of secondary structure will reduce the secondary structure and thus increase local folding energy. Similarly, generating secondary structure where there was none will decrease local folding energy. Since the G-C bond is stronger than the T-A bond, substituting one for the other can decrease local folding energy (T-A to G-C) or increase local folding energy (G-C to T-A). The predicted local folding energy can be compared to a null model to detect/predict meaningful levels of folding energy changes. A mutant region can also be tested empirically by methods such as are described herein. The region can be inserted into a reporter plasmid comprising a detectable protein (e.g., a fluorescent protein). The detectable protein may be for example GFP or RFP. Changes in expression of the reporter (e.g., GFP) can be monitored. Increases in expression of the reporter indicate that the folding energy just before the stop codon has been increased (i.e., weaker folding) leading to increased translation. Decreases in expression of the reporter indicate that the folding energy just before the stop codon has been decreased leading to decreased translation. Changes made in any of the regions can be measured in this way as well. Weakening folding just after the start codon will improve translation, and increasing/decreasing folding in the middle of the CDS will affect translation in different ways depending on the domain/species of the coding region/target cell.
By another aspect, there is provided a vector comprising a nucleic acid molecule of the invention.
In some embodiments, the vector is an expression vector. In some embodiments, the vector is configured for expression in a target cell. In some embodiments, the vector comprises at least one regulatory element for expression in the target cell. In some embodiments, the regulatory element is configured for producing expression in the target cell. In some embodiments, the regulatory element produces expression in the target cell. In some embodiments, the regulatory element regulates expression in the target cell.
By another aspect, there is provided a cell comprising the expression vector or nucleic acid molecule of the invention.
In some embodiments, the cell is a target cell. In some embodiments, the cell is an archaeal cell. In some embodiments, the cell is a bacterial cell. In some embodiments, the cell is a eukaryotic cell. In some embodiments, the eukaryotic cell is not a fungal cell. In some embodiments, the cell is in culture. In some embodiments, the cell is in vivo. In some embodiments, the cell is ex vivo. In some embodiments, the nucleic acid molecule is optimized for expression in the cell.
According to another aspect, there is provided a method for optimizing a coding sequence, the method comprising introducing a mutation into a first region of the coding sequence, wherein the mutation increases or decreases folding energy of the first region or RNA encoded by the first region.
In some embodiments, the first region is upstream and proximal to the stop codon and the mutation increases folding energy of the first region or RNA encoded by the first region. In some embodiments, the first region is downstream and proximal to the start codon and the mutation increases folding energy of the first region or RNA encoded by the first region. In some embodiments, the first region is in the gene body not proximal to the start codon or stop codon and the mutation decreases folding energy of the first region or RNA encoded by the first region.
In some embodiments, optimizing comprises optimizing expression of a protein encoded by the coding sequence. In some embodiments, optimizing is optimizing in a target cell. In some embodiments, optimizing is optimizing protein expression in a target cell. In some embodiments, optimizing is optimizing expression of a protein from a heterologous transgene in a target cell. In some embodiments, the heterologous transgene is not native to the target cell. In some embodiments, the target cell is a prokaryotic cell. In some embodiments, the target cell is a bacterial cell. In some embodiments, the target cell is an archaeal cell. In some embodiments, the target cell is a eukaryotic cell. In some embodiments, the target cell is a mammalian cell. In some embodiments, the target cell is a human cell. In some embodiments, the coding sequence is a viral, bacterial, archaeal, or eukaryotic sequence. In some embodiments, the coding sequence is exogenous to the target cell.
In some embodiments, the target cell is an archaeal cell and the first region is from 90 nucleotides upstream of the stop codon of the coding sequence to the stop codon. In some embodiments, the target cell is a bacterial cell and the first region is from 50 nucleotides upstream of the stop codon of the coding sequence to the stop codon. In some embodiments, the target cell is a eukaryotic cell and the first region is from 40 nucleotides upstream of the stop codon of the coding sequence to the stop codon.
In some embodiments, the mutation is a synonymous mutation. In some embodiments, the mutation is a silent mutation. In some embodiments, introducing comprises providing a mutated sequence. In some embodiments, introducing comprises providing a mutation or a list of mutations to be made in the coding sequence. In some embodiments, introducing is introducing a plurality of mutations. In some embodiments, each mutation of the plurality of mutations increases folding energy in the first region or RNA encoded by the first region. In some embodiments, a plurality of mutations in combination increases folding energy of the first region or of RNA encoded by the first region.
In some embodiments, the method comprises introducing at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25 or 30 mutations into the first region. Each possibility represents a separate embodiment of the invention. In some embodiments, the method comprises introducing all possible synonymous mutations that increase folding energy of the first region or RNA encoded by the first region. In some embodiments, the method comprises mutating all possible codons with synonymous codons that increase folding energy of the first region or RNA encoded by the first region. In some embodiments, the method comprises introducing synonymous mutations to produce a first region or RNA encoded by the first region with the maximum possible folding energy. Thus, the method may include calculating all possible synonymous mutations that increase folding energy, and all possible combinations of mutations that increase folding energy, and selecting the combination of synonymous mutations that increases the folding energy of the region or RNA encoded by the region the most.
In some embodiments, folding energy is increased. In some embodiments, folding energy is decreased. In some embodiments, the folding energy is folding energy of the coding sequence. In some embodiments, the folding energy is folding energy of the region. In some embodiments, the folding energy is folding energy of the RNA encoded.
In some embodiments, the method further comprises introducing a mutation into a second region. In some embodiments, the second region is from the TSS to 20 nucleotides downstream of the TSS. In some embodiments, the cell is an archaeal cell and the second region is from the TSS to 10 nucleotides downstream of the TSS. In some embodiments, the cell is selected from a bacterial cell and a eukaryotic cell and the second region is from the TSS to 20 nucleotides downstream of the TSS. In some embodiments, the mutation increases folding energy of the second region or of RNA encoded by the second region. In some embodiments, the second region is mutated with synonymous mutations such that the folding energy is increased to the maximum while retaining the amino acid sequence encoded by the region.
In some embodiments, the method further comprises introducing a mutation into a third region. In some embodiments, the third region is from the second region to the first region. In some embodiments, the third region is from 20 to 50 nucleotides downstream of the TSS. In some embodiments, the size of the region is organism-specific. In some embodiments, the size of the region is domain-specific. In some embodiments, the size of the region is specific to bacteria. In some embodiments, the size of the region is specific to archaea. In some embodiments, the size of the region is specific to prokaryotes. In some embodiments, the size of the region is specific to eukaryotes. In some embodiments, the mutation decreases folding energy of the third region or of RNA encoded by the third region. In some embodiments, the third region is mutated with synonymous mutations such that the folding energy is decreased to the minimum while retaining the amino acid sequence encoded by the region.
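The domain-specific region sizes recited above (first region of 90, 50 or 40 nucleotides upstream of the stop codon for archaeal, bacterial or eukaryotic target cells; second region of 10 or 20 nucleotides from the TSS) can be turned into sequence coordinates with a short sketch. Several choices below are assumptions made for illustration: coordinates are 0-based half-open slices over a coding sequence that includes the start and stop codons, the stop codon itself is excluded from the first region, and the third region is taken as everything between the second and first regions.

```python
# Region lengths (nt) taken from the embodiments described in the text.
FIRST_REGION_NT = {"archaea": 90, "bacteria": 50, "eukaryote": 40}
SECOND_REGION_NT = {"archaea": 10, "bacteria": 20, "eukaryote": 20}


def region_slices(cds_length, domain):
    """Return (start, end) slices for the second, third and first regions.

    Assumes the CDS starts at the first base of the start codon and ends
    with a 3-nt stop codon; for very short sequences the second and third
    regions shrink so that no regions overlap.
    """
    stop_start = cds_length - 3                      # first base of the stop codon
    first_start = max(stop_start - FIRST_REGION_NT[domain], 0)
    second_end = min(SECOND_REGION_NT[domain], first_start)
    return {
        "second": (0, second_end),           # energy increased in this region
        "third": (second_end, first_start),  # energy decreased in this region
        "first": (first_start, stop_start),  # energy increased in this region
    }
```

Because the region lengths are not all multiples of three, a caller applying synonymous substitutions would still need to work with whole codons that fall inside each slice.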
In some embodiments, the method is an ex vivo method. In some embodiments, the method is an in vitro method. In some embodiments, the method is performed in a cell.
According to another aspect, there is provided a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to execute a genetic-type machine learning algorithm configured to perform a method of the invention.
According to another aspect, there is provided a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to execute a genetic-type machine learning algorithm configured to:
According to another aspect, there is provided a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to execute a genetic-type machine learning algorithm configured to:
In some embodiments, the computer program product optimizes the region for expression in a target cell. In some embodiments, the computer program product determines the combination of mutations that increases folding energy to a maximum while retaining the amino acid sequence encoded by the region.
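One way a genetic-type search over synonymous codons might look is sketched below. This is an illustration under stated assumptions, not the algorithm of the invention: `optimize_region`, its parameters and `toy_energy` are hypothetical names, the standard genetic code is assumed, the region is assumed to be in frame, and `toy_energy` is a deliberately crude stand-in (it only counts G/C versus A/T) that should be replaced by a real folding-energy predictor.

```python
import random
from itertools import product
from typing import Callable, Dict, List

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code (NCBI translation table 1): codon -> amino acid.
CODON_TO_AA: Dict[str, str] = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)
}
SYNONYMS: Dict[str, List[str]] = {}
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(seq: str) -> str:
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def toy_energy(seq: str) -> float:
    """Placeholder score, NOT a thermodynamic model: G/C = -3, A/T = -2."""
    return -float(sum(3 if b in "GC" else 2 for b in seq))


def optimize_region(region: str,
                    energy: Callable[[str], float] = toy_energy,
                    generations: int = 60, pop_size: int = 30,
                    mutation_rate: float = 0.1, seed: int = 0) -> str:
    """Search synonymous variants of `region` for higher folding energy."""
    assert len(region) % 3 == 0, "region must be in frame"
    rng = random.Random(seed)
    codons = [region[i:i + 3] for i in range(0, len(region), 3)]
    choices = [SYNONYMS[CODON_TO_AA[c]] for c in codons]

    def score(individual: List[str]) -> float:
        return energy("".join(individual))

    def mutate(individual: List[str]) -> List[str]:
        # Each codon may be swapped for a randomly chosen synonymous codon.
        return [rng.choice(opts) if rng.random() < mutation_rate else c
                for c, opts in zip(individual, choices)]

    # The unmodified region is kept in the initial population.
    population = [codons] + [[rng.choice(o) for o in choices]
                             for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]      # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(codons)) if len(codons) > 1 else 0
            children.append(mutate(a[:cut] + b[cut:]))  # one-point crossover
        population = parents + children
    return "".join(max(population, key=score))
```

Because the best individuals are carried over unchanged in every generation and the input region is part of the initial population, the returned sequence never scores lower than the input under the chosen energy function, and every candidate encodes the same amino acid sequence by construction.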
In some embodiments, the computer program product also determines within a second region of the coding sequence at least one mutation that increases folding energy of the second region or RNA encoded by the second region and outputs a mutated coding sequence that further comprises at least one mutation in the second region. In some embodiments, the computer program product also determines within a second region of the coding sequence at least one mutation that increases folding energy of the second region or RNA encoded by the second region and outputs a list of possible mutations that further comprises mutations in the second region that increase folding energy of the second region or of RNA encoded by the second region. In some embodiments, the computer program product determines the combination of mutations in the second region that produces the maximum folding energy while retaining the amino acid sequence encoded by the second region.
In some embodiments, the computer program product also determines within a third region of the coding sequence at least one mutation that decreases folding energy of the third region or RNA encoded by the third region and outputs a mutated coding sequence that further comprises at least one mutation in the third region. In some embodiments, the computer program product also determines within a third region of the coding sequence at least one mutation that decreases folding energy of the third region or RNA encoded by the third region and outputs a list of possible mutations that further comprises mutations in the third region that decrease folding energy of the third region or of RNA encoded by the third region. In some embodiments, the computer program product determines the combination of mutations in the third region that produces the minimum folding energy while retaining the amino acid sequence encoded by the third region.
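The two kinds of output described above (a mutated coding sequence and a list of mutations) can be illustrated for all three regions at once. The sketch below is hypothetical throughout: `tune_region`, `tune_cds` and `toy_energy` are illustrative names, a simple one-pass greedy search is swapped in for the genetic-type algorithm, the coding sequence is assumed to be in frame and to end with its stop codon, and `toy_energy` is a crude placeholder rather than a folding-energy model. Codons that straddle a region boundary are left untouched.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code (NCBI translation table 1): codon -> amino acid.
CODON_TO_AA: Dict[str, str] = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)
}
SYNONYMS: Dict[str, List[str]] = {}
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)

Mutation = Tuple[int, str, str]  # (codon index, original codon, new codon)


def translate(seq: str) -> str:
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def toy_energy(seq: str) -> float:
    """Placeholder score, NOT a thermodynamic model: G/C = -3, A/T = -2."""
    return -float(sum(3 if b in "GC" else 2 for b in seq))


def tune_region(cds: str, start: int, end: int, direction: int,
                energy: Callable[[str], float] = toy_energy
                ) -> Tuple[str, List[Mutation]]:
    """Greedy pass: direction=+1 raises, -1 lowers, energy of cds[start:end]."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    changes: List[Mutation] = []
    # Only codons lying entirely inside [start, end) are considered.
    for idx in range(-(-start // 3), end // 3):
        original = codons[idx]

        def region_score(candidate: str) -> float:
            trial = codons[:idx] + [candidate] + codons[idx + 1:]
            return direction * energy("".join(trial)[start:end])

        best = max(SYNONYMS[CODON_TO_AA[original]], key=region_score)
        if region_score(best) > region_score(original):
            codons[idx] = best
            changes.append((idx, original, best))
    return "".join(codons), changes


def tune_cds(cds: str, start_len: int = 20) -> Tuple[str, List[Mutation]]:
    """Raise energy in the second and first regions, lower it in the third."""
    stop = len(cds) - 3                      # first nucleotide of the stop codon
    if len(cds) % 3 or stop - 90 < start_len:
        raise ValueError("CDS must be in frame and long enough")
    plan = [(0, start_len, +1),              # second region: increase
            (start_len, stop - 90, -1),      # third region: decrease
            (stop - 90, stop, +1)]           # first region: increase
    all_changes: List[Mutation] = []
    for start, end, direction in plan:
        cds, changes = tune_region(cds, start, end, direction)
        all_changes.extend(changes)
    return cds, all_changes
```

Calling `tune_cds(sequence)` returns both the mutated coding sequence and the list of synonymous substitutions that produced it; passing `start_len=10` corresponds to the archaeal embodiment.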
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. Rather, the computer readable storage medium is a non-transient (i.e., non-volatile) medium.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention may be described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Before the present invention is further described, it is to be understood that this invention is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
As used herein, the term “about” when combined with a value refers to plus or minus 10% of the reference value. For example, a length of about 1000 nanometers (nm) refers to a length of 1000 nm ± 100 nm.
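The arithmetic of this definition can be written out as a short sketch. The function names `about_range` and `is_about` are hypothetical and used only for illustration.

```python
from typing import Tuple


def about_range(reference: float, tolerance: float = 0.10) -> Tuple[float, float]:
    """Return the (low, high) bounds covered by "about" the reference value."""
    delta = abs(reference) * tolerance  # 10% of the reference value
    return reference - delta, reference + delta


def is_about(value: float, reference: float) -> bool:
    """True if `value` lies within plus or minus 10% of `reference`."""
    low, high = about_range(reference)
    return low <= value <= high
```

With a reference of 1000 nm, the bounds are 900 nm and 1100 nm, matching the example in the text.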
It is noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a polynucleotide” includes a plurality of such polynucleotides and reference to “the polypeptide” includes reference to one or more polypeptides and equivalents thereof known to those skilled in the art, and so forth. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
In those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination. All combinations of the embodiments pertaining to the invention are specifically embraced by the present invention and are disclosed herein just as if each and every combination was individually and explicitly disclosed. In addition, all sub-combinations of the various embodiments and elements thereof are also specifically embraced by the present invention and are disclosed herein just as if each and every such sub-combination was individually and explicitly disclosed herein.
Additional objects, advantages, and novel features of the present invention will become apparent to one ordinarily skilled in the art upon examination of the following examples, which are not intended to be limiting. Additionally, each of the various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below finds experimental support in the following examples.
Generally, the nomenclature used herein and the laboratory procedures utilized in the present invention include molecular, biochemical, microbiological and recombinant DNA techniques. Such techniques are thoroughly explained in the literature. See, for example, “Molecular Cloning: A Laboratory Manual” Sambrook et al., (1989); “Current Protocols in Molecular Biology” Volumes I-III Ausubel, R. M., ed. (1994); Ausubel et al., “Current Protocols in Molecular Biology”, John Wiley and Sons, Baltimore, Md. (1989); Perbal, “A Practical Guide to Molecular Cloning”, John Wiley & Sons, New York (1988); Watson et al., “Recombinant DNA”, Scientific American Books, New York; Birren et al. (eds) “Genome Analysis: A Laboratory Manual Series”, Vols. 1-4, Cold Spring Harbor Laboratory Press, New York (1998); methodologies as set forth in U.S. Pat. Nos. 4,666,828; 4,683,202; 4,801,531; 5,192,659 and 5,272,057; “Cell Biology: A Laboratory Handbook”, Volumes I-III Celis, J. E., ed. (1994); “Culture of Animal Cells—A Manual of Basic Technique” by Freshney, Wiley-Liss, N. Y. (1994), Third Edition; “Current Protocols in Immunology” Volumes I-III Coligan J. E., ed. (1994); Stites et al. (eds), “Basic and Clinical Immunology” (8th Edition), Appleton & Lange, Norwalk, Conn. (1994); Mishell and Shiigi (eds), “Strategies for Protein Purification and Characterization—A Laboratory Course Manual” CSHL Press (1996); all of which are incorporated by reference. Other general references are provided throughout this document.
Species selection and sequence filtering: The set of species included in the dataset (Table 2) was chosen to maximize taxonomic coverage, include closely related species which differ in GC-contents and other traits (
CDS sequences and gene annotations for all species were obtained from Ensembl genomes, NCBI, JGI and SGD (Table 4). CDS sequences were matched with their GFF3 annotations to filter suspect sequences, as follows. The dataset excludes CDSs marked as pseudo-genes or suspected pseudo-genes, incomplete CDSs and those with sequencing ambiguities, as well as CDSs of length <150 nt. If multiple isoforms were available, only the primary (or first) transcript was included. Genes annotated as belonging to organelle genomes were also excluded. Genomic GC-content, optimum growth temperatures and translation tables were extracted from NCBI Entrez automatically, using a combination of Entrez and E-utilities requests (Table 4). A few general characteristics of the included CDSs are shown in
The taxonomic hierarchy and classifications used to analyze and present the data were obtained from NCBI Taxonomy. Endosymbionts were annotated using a literature survey (Table 4). Growth rates were extracted from Vieira-Silva S, Rocha EPC, “The Systemic Imprint of Growth and Its Uses in Ecological (Meta)Genomics”, PLOS Genet. 2010 Jan. 15; 6(1):e1000808, herein incorporated by reference.
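The CDS filtering criteria described above can be sketched as a simple predicate. This is an illustration only: the `CdsRecord` fields are hypothetical stand-ins for flags that would be derived from the GFF3 annotations, and the way completeness and isoform priority are determined is assumed to be decided upstream.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_CDS_LENGTH = 150  # nt; shorter CDSs are excluded, as stated in the text


@dataclass
class CdsRecord:
    # Hypothetical record layout; flags are assumed to come from annotations.
    gene_id: str
    sequence: str
    is_pseudogene: bool = False          # includes suspected pseudo-genes
    is_complete: bool = True
    is_primary_transcript: bool = True   # primary (or first) isoform
    is_organellar: bool = False


def keep_cds(record: CdsRecord) -> bool:
    """Apply the exclusion criteria listed in the text to one CDS."""
    seq = record.sequence.upper()
    return (not record.is_pseudogene
            and record.is_complete
            and record.is_primary_transcript
            and not record.is_organellar
            and len(seq) >= MIN_CDS_LENGTH
            and set(seq) <= set("ACGT"))   # no sequencing ambiguities


def filter_cds(records: Iterable[CdsRecord]) -> List[CdsRecord]:
    return [r for r in records if keep_cds(r)]
```

Keeping each criterion as a separate boolean makes it straightforward to report how many sequences each filter removes for a given species.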
Pasteurella multocida str. ATCC 43137
Desulfovibrio vulgaris str. Hildenborough
Cellulophaga lytica
Synechocystis sp. PCC 6803
Chondrus crispus (carragheen)
Cryptomonas paramecium
Dunaliella salina
Chlamydomonas reinhardtii
Volvox carteri
Physcomitrella patens
Plasmopara halstedii
Wickerhamomyces anomalus
Aspergillus niger
Trypanosoma cruzi
Daphnia pulex
Trichoplax adhaerens
Mnemiopsis leidyi
Methanofollis liminatans DSM 4140
Candidatus Magnetobacterium bavaricum
Spirodela polyrhiza
Plasmodium falciparum 3D7
Aureococcus anophagefferens
Nematostella vectensis
Salinicoccus roseus
Anabaena sp. 90
Gelidibacter algens
Fibrobacter succinogenes subsp. succinogenes S85
Nostoc punctiforme PCC 73102
Halobacterium salinarum NRC-1
Albugo candida
Pyrococcus horikoshii OT3
Mycobacterium tuberculosis H37Rv
Helicobacter pylori 26695
Staphylococcus aureus subsp. aureus
Pseudomonas stutzeri
Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
Streptomyces coelicolor A3(2)
Adineta vaga
Buchnera aphidicola str. APS
Chlamydophila pneumoniae CWL029
Neisseria meningitidis MC58
Persephonella marina EX-H1
Galdieria sulphuraria
Chlamydophila pneumoniae J138
Rathayibacter toxicus
Parageobacillus toebii
Xylella fastidiosa subsp. sandyi Ann-1
Magnetococcus marinus MC-1
Sphaerochaeta globosa str. Buddy
Streptococcus pyogenes M1 GAS
Xylella fastidiosa 9a5c
Thermococcus cleftensis
Phytophthora ramorum
Prochlorococcus marinus str. MIT 9301
Listeria monocytogenes EGD-e
Staphylococcus epidermidis ATCC 12228
Agrobacterium fabrum str. C58
Pyrobaculum aerophilum str. IM2
Giardia lamblia ATCC 50803
Pyrococcus furiosus DSM 3638
Alkalilimnicola ehrlichii MLHE-1
Methanothermobacter thermautotrophicus str. Delta H
Methanosarcina acetivorans C2A
Methanopyrus kandleri AV19
Fusobacterium nucleatum subsp. nucleatum ATCC 25586
Xanthomonas campestris pv. campestris
Caulobacter crescentus CB15
Campylobacter jejuni subsp. jejuni NCTC
Chlorobium tepidum TLS
Thermococcus nautili
Nocardioides sp. JS614
Corynebacterium efficiens YS-314
Vibrio vulnificus YJ016
Corynebacterium glutamicum ATCC 13032
Oenococcus oeni PSU-1
Trichodesmium erythraeum IMS101
Tropheryma whipplei str. Twist
Candidatus Blochmannia floridanus
Sulfurihydrogenibium azorense Az-Fu1
Pseudomonas aeruginosa PAO1
Shewanella oneidensis MR-1
Clostridium tetani E88
Methanosarcina mazei S-6
Cryptococcus neoformans var. neoformans
Croceibacter atlanticus HTCC2559
Chlamydia abortus S26-3
Lactobacillus plantarum WCFS1
Oceanobacillus iheyensis HTE831
Vibrio parahaemolyticus RIMD 2210633
Bacillus subtilis subsp. subtilis str. 168
Aquifex aeolicus VF5
Archaeoglobus fulgidus DSM 4304
Brucella melitensis bv. 1 str. 16M
Enterococcus faecalis V583
Bacteroides thetaiotaomicron VPI-5482
Coxiella burnetii RSA 493
Streptomyces avermitilis MA-4680 = NBRC
Nitrosomonas europaea ATCC 19718
Nanoarchaeum equitans
Haemophilus ducreyi 35000HP
Candidatus Solibacter usitatus Ellin6076
Geobacillus kaustophilus HTA426
Candida albicans SC5314
Acidobacterium capsulatum ATCC 51196
Magnaporthe oryzae
Rhodopirellula baltica SH 1
Acidithiobacillus ferrooxidans ATCC 23270
Deinococcus radiodurans R1
Methanocaldococcus jannaschii DSM 2661
Methylococcus capsulatus str. Bath
Photorhabdus luminescens subsp. laumondii TTO1
Mycoplasma genitalium G37
Thermotoga maritima MSB8
Treponema denticola ATCC 35405
Chromobacterium violaceum ATCC 12472
Gloeobacter violaceus PCC 7421
Dehalococcoides mccartyi CBDB1
Lactobacillus johnsonii NCC 533
Rhodopseudomonas palustris CGA009
Psychrobacter arcticus 273-4
Onion yellows phytoplasma OY-M
Verrucosispora maris AB-18-032
Picrophilus torridus DSM 9790
Bdellovibrio bacteriovorus HD100
Sinorhizobium meliloti 1021
Kineococcus radiotolerans SRS30216 =
Methanococcus maripaludis S2
Ralstonia solanacearum GMI1000
Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130
Synechococcus elongatus PCC 6301
Thermobifida fusca YX
Aeropyrum pernix K1
Bacillus halodurans C-125
Geobacillus stearothermophilus 10
Lactococcus lactis subsp. lactis Il1403
Listeria innocua Clip11262
Mycobacterium leprae TN
Mycoplasma mycoides subsp. mycoides SC
Mycoplasma penetrans HF-2
Mycoplasma pneumoniae M129
Mycoplasma pulmonis UAB CTIP
Pyrococcus abyssi GE5
Sulfolobus tokodaii str. 7
Thermoplasma acidophilum DSM 1728
Thermoplasma volcanium GSS1
Wolinella succinogenes DSM 1740
Emiliania huxleyi CCMP1516
Cyanidioschyzon merolae
Leifsonia xyli subsp. xyli str. CTCB07
Bartonella henselae str. Houston-1
Eremothecium gossypii ATCC 10895
Schizosaccharomyces pombe (strain 972/
Renibacterium salmoninarum ATCC 33209
Thermodesulfovibrio yellowstonii
Thermodesulfobacterium commune
Gluconobacter oxydans 621H
Bacteroides fragilis YCH46
Thalassiosira pseudonana
Photobacterium profundum SS9
Thermus thermophilus HB8
Dictyoglomus thermophilum H-6-12
Thermomicrobium roseum DSM 5159
Tetrahymena thermophila SB210
Robiginitalea biformata HTCC2501
Lentisphaera araneosa HTCC2155
Erythrobacter litoralis HTCC2594
Parvularcula bermudensis HTCC2503
Nitrococcus mobilis Nb-231
Herpetosiphon aurantiacus DSM 785
Synechococcus sp. CC9902
Escherichia coli str. K-12 substr. W3110
Deinococcus geothermalis DSM 11300 str.
Aster yellows witches'-broom phytoplasma
Chloroflexus aurantiacus J-10-fl
Sulfurimonas denitrificans DSM 1251
Chloroflexus aggregans DSM 9485
Nitrospira defluvii
Blattabacterium sp. (Blattella germanica)
Simkania negevensis Z
Ferroplasma acidarmanus fer1
Psychrobacter cryohalolentis K5
Zymoseptoria tritici
Methanosphaera stadtmanae DSM 3091
Chryseobacterium greenlandense
Mycoplasma agalactiae PG2
Leishmania major strain Friedlin
Akkermansia muciniphila ATCC BAA-835
Acidothermus cellulolyticus 11B
Dictyostelium discoideum AX4
Cryptosporidium parvum Iowa II
Theileria annulata strain Ankara
Brevibacillus brevis NBRC 100599
Exiguobacterium sp. AT1b
Haloquadratum walsbyi DSM 16790
Ramlibacter tataouinensis TTB310
Halothermothrix orenii H 168
Candidatus Korarchaeum cryptofilum OPF8
Gemmatimonas aurantiaca T-27
Thiohalorhabdus denitrificans
Fervidobacterium nodosum Rt17-B1
Roseiflexus castenholzii DSM 13941
Vibrio fischeri MJ11
Thermosipho melanesiensis BI429
Granulibacter bethesdensis CGDNIH1
Flavobacteriales bacterium ALC-1
Thermococcus barophilus MP
Alcanivorax borkumensis SK2
Leeuwenhoekiella blandensis MED217
Geobacter lovleyi SZ
Acinetobacter baumannii ATCC 17978
Amphimedon queenslandica
Flavobacterium psychrophilum JIP02/86
Parvibaculum lavamentivorans DS-1
Petrotoga mobilis SJ95
Saccharopolyspora erythraea NRRL 2338
Salinicoccus halodurans
Methanocorpusculum labreanum Z
Gramella forsetii KT0803
Paramecium tetraurelia strain d4-2
Trichomonas vaginalis G3
Cenarchaeum symbiosum A
Puccinia graminis f. sp. tritici
Methylobacterium extorquens PA1
Methanobrevibacter smithii ATCC 35061
Diplodia seriata
Lactococcus garvieae Lg2
Perkinsus marinus ATCC 50983
Sulfolobus islandicus L.S.2.15
Monosiga brevicollis MX1
Porphyromonas gingivalis ATCC 33277
Sulfurihydrogenibium yellowstonense SS-5
Salegentibacter salarius
Ostreococcus lucimarinus
Nitrosopumilus maritimus SCM1
Vanderwaltozyma polyspora DSM 70294
Bacillus selenitireducens MLS10
Acholeplasma laidlawii PG-8A
Marinitoga piezophila KA3
Clavibacter michiganensis subsp. michiganensis NCPPB 382
Elusimicrobium minutum Pei191
Stackebrandtia nassauensis DSM 44728
Microcystis aeruginosa NIES-843
Opitutus terrae PB90-1
Kitasatospora setae KM-6054
Leptospira biflexa serovar Patoc strain
Natranaerobius thermophilus JW/NM-WN-
Thermobispora bispora DSM 43833
Halogeometricum borinquense DSM 11551
Conexibacter woesei DSM 14684
Fusobacterium periodonticum 2_1_31
Fusobacterium gonidiaformans
Bradyrhizobium japonicum SEMIA 5079
Candidatus Desulforudis audaxviator
Halobacterium salinarum R1
Catenulispora acidiphila DSM 44928
Sphaerobacter thermophilus DSM 20745
Methylacidiphilum infernorum V4
Thermosipho africanus TCF52B
Babesia bovis T2Bo
Ktedonobacter racemifer DSM 44963
Laccaria bicolor S238N-H82
Anoxybacillus flavithermus WK1
Thermus aquaticus Y51MC23
Mitsuokella multacida DSM 20544
Meiothermus ruber DSM 1279
Ureaplasma parvum serovar 3 str.
Acidiplasma aeolicum str. VT
Toxoplasma gondii ME49
Caldisericum exile AZM16c01
Escherichia coli str. K-12 substr. MG1655
Dictyoglomus turgidum DSM 6724
Chlorobaculum parvum NCIB 8327
Chloroherpeton thalassium ATCC 35110
Rhodothermus marinus DSM 4252
Streptobacillus moniliformis DSM 12112
Methanosphaerula palustris E1-9c
Kosmotoga olearia TBF 19.5.1
Capnocytophaga ochracea DSM 7271
Planctopirus limnophila DSM 3776
Denitrovibrio acetiphilus DSM 12809
Haloferax mediterranei ATCC 33500
Thermanaerovibrio acidaminovorans DSM
Thermobaculum terrenum ATCC BAA-798
Acidimicrobium ferrooxidans DSM 10331
Anaerococcus prevotii DSM 20548
Sebaldella termitidis ATCC 33386
Brachyspira murdochii DSM 12563
Alicyclobacillus acidocaldarius LAA1
Hydrogenobaculum sp. HO
Mobiluncus curtisii ATCC 43063
Dehalogenimonas lykanthroporepellens
Gardnerella vaginalis 409-05
Moniliophthora perniciosa FA553
Galbibacter marinus
Halothiobacillus neapolitanus c2
Desulfonatronospira thiodismutans
Phaeodactylum tricornutum CCAP 1055/1
Saccharomyces cerevisiae S288c
Postia placenta Mad-698-R
Micromonas pusilla CCMP1545
Vulcanisaeta distributa DSM 14429
Ilyobacter polytropus DSM 2926
Asticcacaulis excentricus CB 48
Acetohalobium arabaticum DSM 5501
Coccomyxa subellipsoidea C-169
Isosphaera pallida ATCC 43644
Schizophyllum commune H4-8
Allomyces macrogynus ATCC 38327
Thermovirga lienii DSM 17291
Rubidibacter lacunae KORDI 51-2
Coraliomargarita akajimensis DSM 45221
Ignisphaera aggregans DSM 17230
Roseburia hominis A2-183
Ferroglobus placidus DSM 10642
Abiotrophia defectiva ATCC 49176
Nonlabens dokdonensis DSW-6
Thermococcus gammatolerans EJ3
Capsaspora owczarzaki ATCC 30864
Leptotrichia goodfellowii F0264
Hydrogenobacter thermophilus TK-6
Olsenella uli DSM 7084
Brevundimonas subvibrioides ATCC 15264
Fragilariopsis cylindrus CCMP1102
Thermocrinis albus DSM 14484
Deferribacter desulfuricans SSM1
Winogradskyella psychrotolerans RS-3
Clostridium lentocellum DSM 5427
Methanohalobium evestigatum Z-7303
Spizellomyces punctatus DAOM BR117
Thermovibrio ammonificans HB-1
Truepera radiovictrix DSM 17093
Desulfobacula toluolica Tol2
Desulfurispirillum indicum S5
Zunongwangia profunda SM-A87
Mesotoga prima MesG1.Ag.4.2
Fimbriimonas ginsengisoli Gsoil 348
Thermodesulfatator indicus DSM 15286
Oceanithermus profundus DSM 14977
Fonticula alba
Pyrolobus fumarii 1A
Saprolegnia parasitica CBS 223.65
Arthrospira platensis NIES-39
Bifidobacterium animalis subsp. animalis
Slackia piriformis YIT 12062
Acidithiobacillus ferrivorans SS3
Isoptericola variabilis 225
Naegleria gruberi strain NEG-M
Aequorivita sublithincola DSM 14238
Thermus oshimai JL-2
Bigelowiella natans
Mesorhizobium australicum WSM2073
Fluviicola taffensis DSM 16823
Hippea maritima DSM 10411
Rothia dentocariosa ATCC 17931
Succinatimonas hippei YIT 12066
Oscillochloris trichoides DG-6
Parachlamydia acanthamoebae UV-7
Frateuria aurantia DSM 6220
Calditerrivibrio nitroreducens DSM 19672
Thiocapsa marina 5811
Thermoproteus tenax Kra 1
Desulfosporosinus orientis DSM 765
Thermodesulfobacterium geofontis OPF15
Halosimplex carlsbadense 2-9-1
Halopiger xanaduensis SH-6
Natronobacterium gregoryi SP2
Candidatus Nitrosoarchaeum limnia BG20
Gemmatirosa kalamazoonesis
Halobacteriovorax marinus SJ
Cloacibacillus evryensis DSM 19522
Halobacillus halophilus DSM 2266
Methanomethylovorans hollandica
Desulfurobacterium thermolithotrophum
Marinithermus hydrothermalis DSM 14884
Caldithrix abyssi DSM 13497
Turicella otitidis ATCC 51513
Entamoeba histolytica HM-1:IMSS-A
Singulisphaera acidiphila DSM 18658
Muricauda ruestringensis DSM 13258
Anaerobaculum mobile DSM 13181
Candidatus Moranella endobia PCIT
Guillardia theta CCMP2712
Dialister microaerophilus UPII 345-E
Leclercia adecarboxylata ATCC 23216 =
Caldilinea aerophila DSM 14535 =
Joostella marina DSM 19592
Owenweeksia hongkongensis DSM 17368
Anaerolinea thermophila UNI-1
Nitrososphaera viennensis EN76
Solitalea canadensis DSM 3403
Fructobacillus fructosus KCTC 3544
Botryobasidium botryosum FD-172 SSI
Eremothecium cymbalariae DBVPG#7215
Deinococcus peraridilitoris DSM 19664
Gymnopus luxurians FD-317 M1
Hypholoma sublateritium FD-334 SS-4
Ignavibacterium album JCM 16511
Imtechella halotolerans K1
Salpingoeca rosetta
Lacinutrix sp. 5H-3-7-4
Bacteroides nordii
Eggerthia catenaformis OT 569 = DSM
Candidatus Pelagibacter sp. IMCC9063
Kluyvera ascorbata ATCC 33433
Acetonema longum DSM 6540
Neorhizobium galegae bv. orientalis str.
Salinisphaera shabanensis E1L3A
Haloplasma contractile SSD-17B
Rhizobium leguminosarum bv. trifolii
Wickerhamomyces ciferrii
Bizionia argentinensis JUB59
Zymoseptoria brevis
Cobetia amphilecti str. KMM 296
Caldisphaera lagunensis DSM 15908
Pneumocystis murina B123
Candidatus Haloredivivus sp. G17
Wolbachia pipientis wAlbB
Bacillus coagulans DSM 1 = ATCC 7050
Geoalkalibacter ferrihydriticus DSM 17813
Pseudothermotoga hypogea DSM 11164 =
Klebsiella pneumoniae subsp. pneumoniae
Nitrolancea hollandica Lb
Phycisphaera mikurensis NBRC 102666
Tumebacillus flagellatus
Richelia intracellularis HH01
Hydrocarboniphaga effusa AP103
Thalassospira profundimaris WP0211
Thiovulum sp. ES
Deinococcus puniceus
Gloeobacter kilaueensis JS1
Enterovibrio norvegicus FF-454
Psychroflexus gondwanensis ACAM 44
Nitritalea halalkaliphila LW7
Thaumarchaeota archaeon SCGC
Aeropyrum camini SY1 = JCM 12091
Methanoculleus bourgensis MS2
Thalassolituus oleivorans R6-15
Bordetella parapertussis Bpp5
Candidatus Kinetoplastibacterium oncopeltii TCC290E
Tepidanaerobacter acetatoxydans Re1
Pythium vexans DAOM BR484
Piscirickettsia salmonis LF-89 =
Candidatus Nitrosopumilus koreensis AR1
Candidatus Methanomethylophilus alvus
Candidatus Photodesmus katoptron Akat1
Candidatus Nitrososphaera gargensis
Tolypothrix campylonemoides VB511288
Acanthamoeba castellanii str. Neff
Nitrospina gracilis 3-211
Acetobacter pasteurianus 386B
Pyrodictium delaneyi
Neofusicoccum parvum UCRNP2
Curtobacterium flaccumfaciens UCD-AKU
Candidatus Methanomassiliicoccus intestinalis Issoire-Mx1 str. Mx1-Issoire
Thermosulfidibacter takaii ABI70S6
Chthonomonas calidirosea T49
Xanthomonas axonopodis Xac29-1
Salinispira pacifica
Ilumatobacter coccineus YM16-304
Cetobacterium somerae ATCC BAA-474
Holospora undulata HU1
Kosmotoga pacifica
Flavobacterium limnosediminis JC2902
Palaeococcus pacificus DY20341
Formosa agariphila KMM 3901
Gemmatimonas phototrophica
Mucispirillum schaedleri ASF457
Sporothrix schenckii 1099-18
Candidatus Endomicrobium trichonymphae
Candidatus Hepatoplasma crinochetorum
Candidatus Entotheonella sp. TSY1
Candidatus Entotheonella sp. TSY2
Dehalococcoides mccartyi CG5
Salinicoccus sediminis
Thermococcus guaymasensis DSM 11113
Agrobacterium tumefaciens LBA4213
Lelliottia amnigena CHS 78
Leptospirillum ferriphilum YSK
Siansivirga zeaxanthinifaciens CC-SAMT-1
Streptomyces thermoautotrophicus
Ahrensia marina str. LZD062
Fistulifera solaris
Cryobacterium sp. MLB-32
Lyngbya confervoides BDU141951
Candidatus Nanopusillus acidilobi
Berkelbacteria bacterium
Candidatus Beckwithbacteria bacterium
Candidatus Collierbacteria bacterium
Candidatus Curtissbacteria bacterium
Candidatus Gottesmanbacteria bacterium
Candidatus Woesebacteria bacterium
Candidatus Azambacteria bacterium
Candidatus Azambacteria bacterium
Candidatus Falkowbacteria bacterium
Candidatus Jorgensenbacteria bacterium
Candidatus Kaiserbacteria bacterium
Candidatus Kaiserbacteria bacterium
Candidatus Nomurabacteria bacterium
Candidatus Nomurabacteria bacterium
Candidatus Nomurabacteria bacterium
Candidatus Nomurabacteria bacterium
Parcubacteria group bacterium
Parcubacteria group bacterium
Parcubacteria group bacterium
Parcubacteria group bacterium
Candidatus Wolfebacteria bacterium
Candidatus Yanofskybacteria bacterium
Candidatus Magasanikbacteria bacterium
Candidatus Peregrinibacteria bacterium
Gemmata sp. SH-PL17
Nanohaloarchaea archaeon SG9
Abiotrophia defectiva
Lactobacillus
johnsonii NCC 533
Acanthamoeba castellanii
Lactobacillus
plantarum WCFS1
Acetobacter pasteurianus
Lactococcus garvieae
Acetohalobium arabaticum
Lactococcus lactis
Acetonema longum
Leclercia
adecarboxylata ATCC
Acholeplasma laidlawii
Leeuwenhoekiella
blandensis MED217
Acidimicrobium
Leifsonia xyli subsp.
ferrooxidans DSM 10331
xyli str. CTCB07
Acidiplasma aeolicum str.
Leishmania major
Acidithiobacillus
Lelliottia amnigena
ferrivorans SS3
Acidithiobacillus
Lentisphaera
ferrooxidans ATCC 23270
araneosa HTCC2155
Acidobacterium
Leptospira biflexa
capsulatum ATCC 51196
serovar Patoc strain
Acidothermus cellulolyticus
Leptospira
interrogans serovar
Copenhageni str.
Acinetobacter baumannii
Leptospirillum
ferriphilum YSK
Adineta vaga
Leptotrichia
goodfellowii F0264
Aequorivita sublithincola
Listeria innocua
Aeropyrum camini SY1 =
Listeria
monocytogenes
Aeropyrum pernix K1
Lyngbya
confervoides
Agrobacterium fabrum str.
Magnaporthe oryzae
Agrobacterium
Magnetococcus
tumefaciens LBA4213
marinus MC-1
Ahrensia marina str.
Akkermansia muciniphila
Marinithermus
hydrothermalis
Albugo candida
Marinitoga
piezophila KA3
Alcanivorax borkumensis
Meiothermus ruber
Alicyclobacillus
Mesorhizobium
acidocaldarius LAA1
australicum
Alkalilimnicola ehrlichii
Mesotoga prima
Allomyces macrogynus
Methanobrevibacter
smithii ATCC 35061
Amphimedon
Methanocaldococcus
queenslandica
jannaschii DSM 2661
Anabaena sp. 90
Methanococcus
maripaludis S2
Anaerobaculum mobile
Methanocorpusculum
labreanum Z
Anaerococcus prevotii
Methanoculleus
bourgensis MS2
Anaerolinea thermophila
Methanofollis
liminatans DSM 4140
Anoxybacillus flavithermus
Methanohalobium
evestigatum Z-7303
Aquifex aeolicus VF5
Methanomethylovorans
hollandica
Archaeoglobus fulgidus
Methanopyrus
kandleri AV19
Arthrospira platensis
Methanosarcina
acetivorans C2A
Aspergillus niger
Methanosarcina
mazei S-6
Methanosphaera
stadtmanae
Asticcacaulis excentricus
Methanosphaerula
palustris E1-9c
Aureococcus
Methanothermobacter
anophagefferens
thermautotrophicus
Babesia bovis T2Bo
Methylacidiphilum
infernorum V4
Bacillus coagulans DSM 1 =
Methylobacterium
extorquens PA1
Bacillus halodurans C-125
Methylococcus
capsulatus str. Bath
Bacillus selenitireducens
Microcystis
aeruginosa NIES-843
Bacillus subtilis subsp.
Micromonas pusilla
subtilis str. 168
Bacteroides fragilis YCH46
Mitsuokella
multacida
Bacteroides nordii
Mnemiopsis leidyi
Bacteroides
Mobiluncus curtisii
thetaiotaomicron VPI-5482
Bartonella henselae str.
Moniliophthora
perniciosa FA553
Bdellovibrio bacteriovorus
Monosiga brevicollis
Berkelbacteria bacterium
Mucispirillum
schaedleri ASF457
ruestringensis
Bigelowiella natans
Mycobacterium
leprae TN
Bizionia argentinensis
Mycobacterium
tuberculosis H37Rv
Blattabacterium sp.
Mycoplasma
agalactiae PG2
Bordetella parapertussis
Mycoplasma
genitalium G37
Botryobasidium botryosum
Mycoplasma
mycoides subsp.
mycoides SC str. PG1
Brachyspira murdochii
Mycoplasma
penetrans HF-2
Bradyrhizobium japonicum
Mycoplasma
pneumoniae M129
Brevibacillus brevis
Mycoplasma
pulmonis UAB CTIP
Brevundimonas
Naegleria gruberi
subvibrioides ATCC 15264
Brucella melitensis bv. 1
Nanoarchaeum
equitans
Buchnera aphidicola str.
Nanohaloarchaea
archaeon SG9
Caldilinea aerophila DSM
Natranaerobius
thermophilus
Caldisericum exile
Natronobacterium
gregoryi SP2
Caldisphaera lagunensis
Neisseria
meningitidis MC58
Calditerrivibrio
Nematostella
nitroreducens DSM 19672
vectensis
Caldithrix abyssi
Neofusicoccum
parvum UCRNP2
Campylobacter jejuni
Neorhizobium
galegae bv. orientalis
Candida albicans SC5314
Nitritalea
halalkaliphila LW7
Candidatus Azambacteria
Nitrococcus mobilis
bacterium
Candidatus Azambacteria
Nitrolancea
bacterium
hollandica Lb
Candidatus
Nitrosomonas
Beckwithbacteria
europaea
bacterium
Candidatus Blochmannia
Nitrosopumilus
floridanus
maritimus SCM1
Candidatus Collierbacteria
Nitrososphaera
bacterium
viennensis EN76
Candidatus Curtissbacteria
Nitrospina gracilis
bacterium
Candidatus Desulforudis
Nitrospira defluvii
audaxviator MP104C
Candidatus Endomicrobium
Nocardioides sp.
trichonymphae
Candidatus Entotheonella
Nonlabens
dokdonensis DSW-6
Candidatus Entotheonella
Nostoc punctiforme
Candidatus Falkowbacteria
Oceanithermus
bacterium
profundus
Candidatus
Oceanobacillus
Gottesmanbacteria
iheyensis HTE831
bacterium
Candidatus Haloredivivus
Oenococcus oeni
Candidatus Hepatoplasma
Olsenella uli
crinochetorum Av
Candidatus
Jorgensenbacteria
bacterium
phytoplasma OY-M
Candidatus Kaiserbacteria
Opitutus terrae
bacterium
Candidatus Kaiserbacteria
Oscillochloris
bacterium
trichoides DG-6
Candidatus
Ostreococcus
Kinetoplastibacterium
lucimarinus
oncopeltii TCC290E
Candidatus Korarchaeum
Owenweeksia
cryptofilum OPF8
hongkongensis
Candidatus
Palaeococcus
Magasanikbacteria
pacificus DY20341
bacterium
Candidatus
Parachlamydia
Magnetobacterium
acanthamoebae
bavaricum
Candidatus
Parageobacillus
Methanomassiliicoccus
toebii
intestinalis Issoire-Mx1 str.
Candidatus
Paramecium
Methanomethylophilus
tetraurelia strain
Candidatus Moranella
Parcubacteria group
bacterium
Candidatus Nanopusillus
Parcubacteria group
acidilobi
bacterium
Candidatus
Parcubacteria group
Nitrosoarchaeum limnia
bacterium
Candidatus Nitrosopumilus
Parcubacteria group
koreensis AR1
bacterium
Candidatus Nitrososphaera
Parvibaculum
gargensis Ga9.2
lavamentivorans
Candidatus
Parvularcula
Nomurabacteria bacterium
bermudensis
Candidatus
Pasteurella
Nomurabacteria bacterium
multocida str.
Candidatus
Perkinsus marinus
Nomurabacteria bacterium
Candidatus
Persephonella
Nomurabacteria bacterium
marina EX-H1
Candidatus Pelagibacter sp.
Petrotoga mobilis
Candidatus
Phaeodactylum
Peregrinibacteria
tricornutum CCAP
bacterium
Candidatus Photodesmus
Photobacterium
katoptron Akat1
profundum SS9
Candidatus Solibacter
Photorhabdus
usitatus Ellin6076
luminescens subsp.
laumondii TTO1
Candidatus Woesebacteria
Phycisphaera
bacterium
mikurensis
Candidatus Wolfebacteria
Physcomitrella
bacterium
patens
Candidatus
Phytophthora
Yanofskybacteria
ramorum
bacterium
Capnocytophaga ochracea
Picrophilus torridus
Capsaspora owczarzaki
Piscirickettsia
salmonis LF-89 =
Catenulispora acidiphila
Planctopirus
limnophila
Caulobacter crescentus
Plasmodium
falciparum 3D7
Cellulophaga lytica
Plasmopara halstedii
Cenarchaeum symbiosum A
Pneumocystis
murina b123
Cetobacterium somerae
Porphyromonas
gingivalis
Chlamydia abortus S26-3
Postia placenta Mad-
Chlamydomonas reinhardtii
Prochlorococcus
marinus str.
Chlamydophila
Pseudomonas
pneumoniae CWL029
aeruginosa PAO1
Chlamydophila
Pseudomonas
pneumoniae J138
stutzeri
Chlorobaculum parvum
Pseudothermotoga
hypogea DSM 11164 =
Chlorobium tepidum TLS
Psychrobacter
arcticus 273-4
Chloroflexus aggregans
Psychrobacter
cryohalolentis K5
Chloroflexus aurantiacus
Psychroflexus
gondwanensis
Chloroherpeton thalassium
Puccinia graminis f.
Pyrobaculum
aerophilum str. IM2
Chromobacterium
Pyrococcus abyssi
violaceum ATCC 12472
Chryseobacterium
Pyrococcus furiosus
greenlandense
Chthonomonas calidirosea
Pyrococcus
horikoshii OT3
Clavibacter michiganensis
Pyrodictium delaneyi
Cloacibacillus evryensis
Pyrolobus fumarii 1A
Clostridium lentocellum
Pythium vexans
Clostridium tetani E88
Ralstonia
solanacearum
Cobetia amphilecti str.
Ramlibacter
tataouinensis
Coccomyxa subellipsoidea
Rathayibacter
toxicus
Conexibacter woesei
Renibacterium
salmoninarum
Coraliomargarita
Rhizobium
akajimensis DSM 45221
leguminosarum bv.
trifolii CB782
Corynebacterium efficiens
Rhodopirellula
baltica SH 1
Corynebacterium
Rhodopseudomonas
glutamicum ATCC 13032
palustris CGA009
Coxiella burnetii RSA493
Rhodothermus
marinus DSM 4252
Croceibacter atlanticus
Richelia
intracellularis HH01
Cryobacterium sp. MLB-32
Robiginitalea
biformata HTCC2501
Cryptococcus neoformans
Roseburia hominis
Cryptomonas paramecium
Roseiflexus
castenholzii
Cryptosporidium parvum
Rothia dentocariosa
Curtobacterium
Rubidibacter lacunae
flaccumfaciens UCD-AKU
Cyanidioschyzon merolae
Saccharomyces
cerevisiae S288c
Daphnia pulex
Saccharopolyspora
erythraea NRRL2338
Deferribacter desulfuricans
Salegentibacter
salarius
Dehalococcoides mccartyi
Salinicoccus
halodurans
Dehalococcoides mccartyi
Salinicoccus roseus
Dehalogenimonas
Salinicoccus
lykanthroporepellens BL-
sediminis
Deinococcus geothermalis
Salinisphaera
shabanensis E1L3A
Deinococcus peraridilitoris
Salinispira pacifica
Deinococcus puniceus
Salmonella enterica
serovar
Typhimurium str. LT2
Deinococcus radiodurans
Salpingoeca rosetta
Denitrovibrio acetiphilus
Saprolegnia
parasitica
Desulfobacula toluolica
Schizophyllum
Desulfonatronospira
Schizosaccharomyces
thiodismutans ASO3-1
pombe (strain 972/
Desulfosporosinus orientis
Sebaldella termitidis
Desulfovibrio vulgaris str.
Shewanella
oneidensis MR-1
Desulfurispirillum indicum
Siansivirga
zeaxanthinifaciens
Desulfurobacterium
Simkania negevensis
thermolithotrophum DSM
Dialister microaerophilus
Singulisphaera
acidiphila
Dictyoglomus
Sinorhizobium
thermophilum H-6-12
meliloti 1021
Dictyoglomus turgidum
Slackia piriformis
Dictyostelium discoideum
Solitalea canadensis
Diplodia seriata
Sphaerobacter
thermophilus
Dunaliella salina
Sphaerochaeta
globosa str. Buddy
Eggerthia catenaformis OT
Spirodela polyrhiza
Elusimicrobium minutum
Spizellomyces
punctatus DAOM
Emiliania huxleyi
Sporothrix schenckii
Entamoeba histolytica
Stackebrandtia
nassauensis
Enterococcus faecalis V583
Staphylococcus
aureus subsp. aureus
Enterovibrio norvegicus
Staphylococcus
epidermidis
Eremothecium cymbalariae
Streptobacillus
moniliformis
Eremothecium gossypii
Streptococcus
pyogenes M1 GAS
Erythrobacter litoralis
Streptomyces
avermitilis MA-4680 =
Escherichia coli str. K-12
Streptomyces
coelicolor A3(2)
Escherichia coli str. K-12
Streptomyces
thermoautotrophicus
Exiguobacterium sp. AT1b
Succinatimonas
hippei YIT 12066
Ferroglobus placidus
Sulfolobus islandicus
Ferroplasma acidarmanus
Sulfolobus tokodaii
Fervidobacterium nodosum
Sulfurihydrogenibium
azorense Az-Fu1
Fibrobacter succinogenes
Sulfurihydrogenibium
yellowstonense
Fimbriimonas ginsengisoli
Sulfurimonas
denitrificans
Fistulifera solaris
Synechococcus
elongatus PCC 6301
Flavobacteriales bacterium
Synechococcus sp.
Flavobacterium
Synechocystis sp.
limnosediminis JC2902
Flavobacterium
Tepidanaerobacter
psychrophilum JIP02/86
acetatoxydans Re1
Fluviicola taffensis
Tetrahymena
thermophila SB210
Fonticula alba
Thalassiosira
pseudonana
Formosa agariphila
Thalassolituus
oleivorans R6-15
Fragilariopsis cylindrus
Thalassospira
profundimaris
Frateuria aurantia
Thaumarchaeota
archaeon SCGC AB-
Fructobacillus fructosus
Theileria annulata
Fusobacterium
Thermanaerovibrio
gonidiaformans
acidaminovorans
Fusobacterium nucleatum
Thermobaculum
terrenum ATCC
Fusobacterium
Thermobifida fusca
periodonticum 2_1_31
Galbibacter marinus
Thermobispora
bispora DSM 43833
Galdieria sulphuraria
Thermococcus
barophilus MP
Gardnerella vaginalis
Thermococcus
cleftensis
Gelidibacter algens
Thermococcus
gammatolerans EJ3
Thermococcus
Gemmata sp. SH-PL17
guaymasensis
Gemmatimonas aurantiaca
Thermococcus
nautili
Gemmatimonas
Thermocrinis albus
phototrophica
Gemmatirosa
Thermodesulfatator
kalamazoonesis
indicus DSM 15286
Geoalkalibacter
Thermodesulfobacterium
ferrihydriticus DSM 17813
commune
Geobacillus kaustophilus
Thermodesulfobacterium
geofontis
Geobacillus
Thermodesulfovibrio
stearothermophilus 10
yellowstonii
Geobacter lovleyi SZ
Thermomicrobium
roseum DSM 5159
Thermoplasma
Giardia lamblia ATCC 50803
acidophilum
Gloeobacter kilaueensis JS1
Thermoplasma
volcanium GSS1
Gloeobacter violaceus
Thermoproteus
tenax Kra 1
Gluconobacter oxydans
Thermosipho
africanus TCF52B
Gramella forsetii KT0803
Thermosipho
melanesiensis BI429
Granulibacter bethesdensis
Thermosulfidibacter
takaii ABI70S6
Guillardia theta CCMP2712
Thermotoga
maritima MSB8
Gymnopus luxurians FD-
Thermovibrio
ammonificans HB-1
Haemophilus ducreyi
Thermovirga lienii
Halobacillus halophilus
Thermus aquaticus
Halobacteriovorax marinus
Thermus oshimai
Halobacterium salinarum
Thermus
thermophilus HB8
Halobacterium salinarum
Thiocapsa marina 5811
Haloferax mediterranei
Thiohalorhabdus
denitrificans
Halogeometricum
Thiovulum sp. ES
borinquense DSM 11551
Halopiger xanaduensis SH-6
Tolypothrix
campylonemoides
Haloplasma contractile
Toxoplasma gondii
Haloquadratum walsbyi
Treponema
denticola
Halosimplex carlsbadense
Trichodesmium
erythraeum IMS101
Halothermothrix orenii
Trichomonas
vaginalis G3
Halothiobacillus
Trichoplax
neapolitanus c2
adhaerens
Helicobacter pylori 26695
Tropheryma
whipplei str. Twist
Herpetosiphon aurantiacus
Hippea maritima
Trypanosoma cruzi
Holospora undulata HU1
Tumebacillus
flagellatus
Hydrocarboniphaga effusa
Turicella otitidis
Hydrogenobacter
Ureaplasma parvum
thermophilus TK-6
serovar 3 str.
Hydrogenobaculum sp. HO
Vanderwaltozyma
polyspora
Hypholoma sublateritium
Verrucosispora maris
Ignavibacterium album
Vibrio fischeri MJ11
Ignisphaera aggregans
Vibrio
parahaemolyticus
Ilumatobacter coccineus
Vibrio vulnificus
Ilyobacter polytropus
Volvox carteri
Imtechella halotolerans K1
Vulcanisaeta
distributa
Isoptericola variabilis 225
Wickerhamomyces
anomalus NRRL
Isosphaera pallida
Wickerhamomyces
ciferrii
Joostella marina
Winogradskyella
psychrotolerans RS-3
Kineococcus radiotolerans
Wolbachia pipientis
Kitasatospora setae
Wolinella
succinogenes
Klebsiella pneumoniae
Xanthomonas
axonopodis Xac29-1
Kluyvera ascorbata
Xanthomonas
campestris pv.
campestris str.
Kosmotoga olearia TBF
Xylella fastidiosa
Kosmotoga pacifica
Xylella fastidiosa
Ktedonobacter racemifer
Zunongwangia
profunda SM-A87
Laccaria bicolor S238N-H82
Zymoseptoria brevis
Lacinutrix sp. 5H-3-7-4
Zymoseptoria tritici
Randomization procedures: To test different hypotheses regarding local folding-energy (LFE), native sequences were compared against randomized sequences preserving attributes as defined by each null hypothesis, as follows (
To test the hypothesis that the native arrangement of synonymous codons causes a significant bias in LFE, synonymous codons were randomly permuted within each CDS (i.e., all codons encoding the same amino acid within a given CDS are randomly rearranged). This “CDS-wide” randomization preserves the encoded protein sequence, nucleotide frequencies (including GC-content) and codon frequencies of each CDS (but generally disrupts longer-range dependencies). Synonymous codons were determined according to the nuclear genetic code annotated for each species in NCBI genomes.
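The CDS-wide randomization described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes the standard nuclear genetic code (NCBI translation table 1), upper-case DNA with no ambiguous bases, and a hypothetical function name `shuffle_synonymous_codons`. Trailing nucleotides that do not form a full codon are dropped.

```python
import random
from collections import defaultdict

# Standard nuclear genetic code (NCBI translation table 1) in TCAG order.
# Assumption: species annotated with another table need a different string.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def shuffle_synonymous_codons(cds, rng=random):
    """Permute codons among the positions of one CDS that encode the same amino acid."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

    # Collect the codon positions belonging to each amino acid.
    positions_by_aa = defaultdict(list)
    for pos, codon in enumerate(codons):
        positions_by_aa[CODON_TABLE[codon]].append(pos)

    # Shuffle the codons of each amino acid and write them back in place.
    shuffled = list(codons)
    for positions in positions_by_aa.values():
        pool = [codons[p] for p in positions]
        rng.shuffle(pool)
        for p, codon in zip(positions, pool):
            shuffled[p] = codon
    return "".join(shuffled)
```

Because codons are only exchanged between positions encoding the same amino acid, the translated protein and the codon counts of the CDS are unchanged by construction.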
To test the contribution of position-specific biases in amino-acid composition, nucleotide frequencies and codon frequencies including CUB (factors that are equalized at the CDS level by the CDS-wide randomization) on the observed LFE, a second “position-specific” randomization was used. In this randomization, synonymous codons were randomly permuted between codons found at the same position (relative to the CDS start) across all CDSs in each genome. This randomization preserves the amino-acid sequence of each CDS, while nucleotide (including GC-content) and codon frequencies are preserved at each position across a genome.
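The position-specific randomization can be sketched in the same way. This is an illustrative sketch under the same assumptions (standard genetic code, upper-case unambiguous DNA); the function name `position_specific_shuffle` is hypothetical. Codons are exchanged only between CDSs that carry the same amino acid at the same codon position, counted from the CDS start.

```python
import random
from collections import defaultdict

# Standard nuclear genetic code (NCBI translation table 1) in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def position_specific_shuffle(cds_list, rng=random):
    """Permute synonymous codons across CDSs, separately at each codon position."""
    codon_lists = [
        [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
        for cds in cds_list
    ]

    # Group CDS indices by (codon position, encoded amino acid).
    groups = defaultdict(list)
    for gene, codons in enumerate(codon_lists):
        for pos, codon in enumerate(codons):
            groups[(pos, CODON_TABLE[codon])].append(gene)

    # Shuffle the codons inside each group and redistribute them.
    shuffled = [list(codons) for codons in codon_lists]
    for (pos, _aa), genes in groups.items():
        pool = [codon_lists[g][pos] for g in genes]
        rng.shuffle(pool)
        for g, codon in zip(genes, pool):
            shuffled[g][pos] = codon
    return ["".join(codons) for codons in shuffled]
```

Each protein sequence is preserved, and the multiset of codons found at every position across the genome is preserved, which matches the invariants stated for this null model.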
LFE profile calculation: Local folding-energy (LFE) profiles were created by calculating the folding-energy of all 40 nt-long windows, at 10 nt intervals, relative to the CDS start and end, on each native and randomized sequence. This measure estimates local secondary-structure strength (ignoring the specific structures) and reflects (among other considerations) the structure of mRNA during translation, which prevents long-range structures but allows formation of local secondary-structure and generally agrees with existing large-scale experimental validation results. Previous studies showed that this measure is robust to changes in the window size. The coordinates shown always refer to the window start position relative to the CDS start (e.g., window 0 includes the first 40 nt in the CDS) or to the window end position relative to the CDS end. Estimated folding-energies were calculated for each window using RNAfold from the ViennaRNA package 2.3.0, with the default settings. All folding-energies were estimated at 37° C. so as to compare equivalent quantities between all genomes (but see below under native-temperature profiles). The ΔLFE profile for each protein, defined as the estimated excess local folding-energy caused by the arrangement of synonymous codons at any CDS position, was created by subtracting the average profile of 20 randomized sequences for that protein from the native LFE profile:
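\[ \Delta\mathrm{LFE}_i = \mathrm{nativeLFE}_i - \frac{1}{N}\sum_{n=1}^{N}\mathrm{randomizedLFE}_i(n) \]
(i: CDS position; N: number of randomized sequences)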
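The profile calculation (40 nt windows at 10 nt steps, native profile minus the mean of the randomized profiles) can be sketched as below. The energy function is passed in as a parameter; `toy_energy` is a placeholder that only counts G and C and is not a thermodynamic model. The study used RNAfold from ViennaRNA; with the ViennaRNA Python bindings installed, one would presumably pass something like `lambda s: RNA.fold(s)[1]` instead, which is an assumption about the local setup. Only start-relative coordinates are shown.

```python
def lfe_profile(seq, fold_energy, window=40, step=10):
    """Folding energy of every `window`-long window, indexed by window start."""
    return [
        fold_energy(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


def delta_lfe_profile(native, randomized, fold_energy, window=40, step=10):
    """Native LFE profile minus the mean profile of the randomized sequences.

    Assumes every randomized sequence has the same length as the native one,
    which holds for codon permutations.
    """
    native_profile = lfe_profile(native, fold_energy, window, step)
    random_profiles = [lfe_profile(r, fold_energy, window, step) for r in randomized]
    n = len(random_profiles)
    return [
        nat - sum(profile[i] for profile in random_profiles) / n
        for i, nat in enumerate(native_profile)
    ]


def toy_energy(window_seq):
    """Placeholder energy for illustration only: -0.1 per G or C in the window."""
    return -0.1 * sum(window_seq.count(base) for base in "GC")
```

Negative values in the resulting profile indicate stronger-than-random folding at that window, and positive values indicate weaker-than-random folding.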
The mean ΔLFE profile for each species was created by averaging each position i over all proteins of sufficient length (so a different number of sequences may be averaged at each position). Note that while the native LFE of different CDSs within each genome vary considerably, the LFE of each native CDS is compared to its own set of randomized sequences.
To determine if the mean ΔLFE for a species in position i (relative to CDS start or end) is significantly different than 0, the differences di(p, n) between LFE of the native and randomized sequences for each CDS at that position were collected:
\[ d_i(p,n) = \mathrm{nativeLFE}_i - \mathrm{randomizedLFE}_i(p,n) \]
(p: CDS index; n≤N=20: index of the randomized sequence, N being the number of randomized sequences). The Wilcoxon signed-rank test was applied to all values di(p, n) (with the null hypothesis implying that the distribution is symmetrical about 0).
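The significance test on the collected differences can be illustrated with a small self-contained version of the Wilcoxon signed-rank test. This sketch uses the large-sample normal approximation without tie or continuity corrections, so it is an approximation for illustration rather than the exact procedure used in the study; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
import math


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed-rank test using the normal approximation.

    Returns (w_plus, z, p_value). Zero differences are discarded and tied
    absolute values receive their average rank.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        raise ValueError("all differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda k: abs(nonzero[k]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Sum of ranks of the positive differences and its null distribution.
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_value
```

A small p-value indicates that the differences at that position are not symmetric about 0, i.e. that the mean ΔLFE at that position deviates from the null model.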
Native-temperature profiles: The predicted folding-energy calculations for native and randomized sequences for a sample of N=71 bacterial and archaeal species were repeated using the same procedure but with folding predicted at the optimal growth temperature specified for that species (instead of 37° C.).
Phylogenetic tree preparation: To study the relation between ΔLFE profiles and other traits, the profiles were analyzed using a phylogenetic tree as follows. The phylogenetic tree is based on Hug L A, Baker B J, Anantharaman K, Brown C T, Probst A J, Castelle C J, et al. A new view of the tree of life. Nat Microbiol. 2016 Apr. 11; 1:16048 (herein incorporated by reference in its entirety; see Tables 2-4) and contains species from our dataset across the three domains of life. Since there are slight discrepancies in some node identifiers between the tree and accessions table, species names were matched by hand. Tree nodes and profiles were then matched by NCBI tax-id at the species or lower level between the available genomes and phylogenetic tree nodes (e.g., when the tree specifies a species, and there is only one genome available for a specific strain of this species). The tree distances were converted to approximate relative ultrametric distances using PATHd8 version 1.9.8 with the default settings. Finally, the tree was pruned to the set of leaf nodes found in the dataset (or a subset of them which has data for both variables being correlated), by removing unused inner and leaf nodes and merging single-child inner nodes by summing distances. The resulting ultrametric tree was used to create a covariance matrix using a Brownian process (to reflect the null hypothesis that a trait is not under selection), using the ape package in R.
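The pruning step (removing unused leaves and inner nodes, then merging single-child inner nodes by summing branch lengths) can be sketched on a simple dictionary-based tree. The node layout (`name`, `dist`, `children`) and the helper names are assumptions made for this illustration and do not correspond to any particular tree library.

```python
def leaf(name, dist):
    """Build a leaf node; `dist` is the branch length to its parent."""
    return {"name": name, "dist": dist, "children": []}


def prune(node, keep):
    """Return a pruned copy of `node` that retains only leaves named in `keep`.

    Returns None when no retained leaf lies below `node`.
    """
    if not node["children"]:
        return dict(node, children=[]) if node["name"] in keep else None

    kept = [c for c in (prune(child, keep) for child in node["children"]) if c]
    if not kept:
        return None  # unused inner node
    if len(kept) == 1:
        # Merge a single-child inner node into its child, summing distances.
        only = kept[0]
        return dict(only, dist=only["dist"] + node["dist"])
    return {"name": node["name"], "dist": node["dist"], "children": kept}
```

Summing the distances when merging keeps every root-to-leaf path length unchanged, so an ultrametric tree stays ultrametric after pruning.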
Phylogenetically-controlled regression: To test for correlations between traits among species while controlling for the similarity expected to exist between related species even in the absence of selection on either trait, generalized least-squares (GLS) regression was performed with the nlme package in R and using REML optimization. Each regression included the subset of species for which data for both correlated traits were available, and which were also included in the tree. Regression p-values are based on the null hypothesis that the slope of the explanatory variable is 0 (i.e., that the variables are independent), and were estimated using the t-test. Coefficient of determination (R2) values were calculated according to:
û—residuals, V—variance-covariance matrix, Y—observations,
For continuous traits, regression formulas included an intercept term. Discrete traits were represented by ordered or unordered factors and the intercept term was omitted from the regression formula. For discrete traits, values of the explained variable (such as ΔLFE) were centered to have mean 0 (so regression is based on a null hypothesis that all levels have the same mean).
Regression robustness verification: To test the robustness of a correlation between traits at different CDS regions, the regression was repeated at all profile positions starting between 0-300 nt (relative to CDS start and end) and all contiguous subranges (using the mean ΔLFE value in each range) and reported only if consistent over the relevant range of positions (
To test for specific trait correlations in individual taxa, the regression procedure was repeated for each taxonomic group (at any rank) containing at least 9 species (
Elements of the ΔLFE profile model were formalized as follows to allow estimation of their prevalence (
wi(p,n)=di*(p,n)−di(p,n)
To measure the performances of several criteria in predicting ΔLFE strength, the following simple model was used. ΔLFE values for all species were divided into weak and strong groups based on the standard-deviation of the mean ΔLFE at positions 0-300 nt. Species with standard-deviation <0.14 were included in the “weak ΔLFE” group. The binary classification of each species is based on 4 species traits as inputs, using the following rule (optimized using grid search):
PredictedWeakLFE=(Endosymbiont=True) or (Genomic-GC<38%) or (Genomic-ENc′>56.5) or (Optimum-temp>58° C.)
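The classification rule above translates directly into a small predicate. The function and parameter names are hypothetical; the thresholds are those stated in the rule, and all four trait values are assumed to be available for the species.

```python
def predicted_weak_lfe(endosymbiont, genomic_gc_percent, genomic_enc_prime, optimum_temp_c):
    """Apply the four-trait rule: any single criterion predicts weak ΔLFE."""
    return (
        endosymbiont
        or genomic_gc_percent < 38.0
        or genomic_enc_prime > 56.5
        or optimum_temp_c > 58.0
    )
```

For example, a free-living mesophile with 50% genomic GC and ENc′ of 50 is predicted to have strong ΔLFE, while changing any one trait past its threshold flips the prediction.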
Maximal Information Coefficient (MIC) is a statistical measure of general (not necessarily linear) dependence between two variables. Informally, it is a generalization of R2, and also has values in the range 0.0-1.0, with high values indicating that knowing the value of one variable allows inferring the value of the other. MIC was calculated using the minerva package in R. p-values were estimated using 10,000 random samples.
Correlogram plot (
Codon-bias metrics (CAI, CBI, Nc, Fop) were calculated for each genome using codonW version 1.4.4. ENc′ was calculated using ENCprime (github user jnovembre, commit 0ead568, October 2016) using the default settings. I_TE was calculated using DAMBE7, based on the included codon frequency tables for each species. DCBS was calculated according to Sabi R, Tuller T. Modelling the Efficiency of Codon-tRNA Interactions Based on Codon Usage Bias. DNA Res. 2014 Oct. 1; 21(5):511-26, herein incorporated by reference.
Shine-Dalgarno (SD) strength for each gene was calculated according to Bahiri Elitzur S, et al. “Prokaryotic rRNA-mRNA interactions are involved in all translation steps and shape bacterial transcripts.” Rev. 2020, herein incorporated by reference in its entirety, based on the minimal anti-SD hybridization energy found in the 20 nt region upstream of the start codon.
Taxon characteristic profiles chart: The mean ΔLFE profiles for CDS positions 0-300 nt relative to the CDS start and end within each taxon were summarized (
PCA display for ΔLFE profiles: To summarize ΔLFE profiles and show how different values relate to different profile types, PCA was used to obtain a two-dimensional arrangement in which similar ΔLFE profiles are mapped to nearby positions. (see for example
PCA of the ΔLFE profiles (treated as vectors of length 31) was performed using scikit-learn. Analysis was limited to the first 3 components and only the first two components are displayed (
Evolutionary and taxonomic trees were plotted using ETE toolkit.
Methodology for
Methodology for
RNA sequencing data was obtained through ENA from the experiments detailed in the table below. Species were chosen based on availability of data for the same strain or a closely related strain, generated using short-read sequencing technology compatible with the pipeline described here. Experiments are transcriptomic in their design and the control sample from each experiment was used (from the logarithmic growth phase if possible).
Normalized read counts were calculated as follows. Reads were trimmed with Trimmomatic version 0.38, using the single-end or paired-end mode and the Illumina adapters, a sliding window with window size 4 nt and quality threshold 15, removal of leading and trailing bases with quality below 3, and a minimum length of 36 nt. Reads were mapped to reference genomes obtained from Ensembl Genomes, except for E. coli, which was obtained from NCBI. Reads were mapped to genomic positions with Bowtie2 version 2.3.4.3 using local alignment with the default settings. Reads were then assigned to coding sequences using htseq-count version 0.11.2 in union mode with non-unique matches included and ignoring expected strand. Normalized counts for each CDS were finally obtained by dividing by the CDS length. Genes were divided into the “low” and “high” groups based on the median normalized read count for each species, with genes having no reads counted as 0.
Protein abundance (PA) results were obtained from PaxDB using the “Integrated” dataset. Genes were divided into the “low” and “high” groups based on the median count for each species, with genes having no reads counted as 0. I_TE, a CUB measure designed to measure codon optimization for translation elongation, was computed using DAMBE7 based on the included codon frequency tables for each species.
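The final normalization and grouping steps can be sketched as follows. The function names are hypothetical, and the treatment of genes whose value equals the median (assigned to the "low" group here) is an assumption, since the text does not specify it.

```python
import statistics


def normalized_counts(read_counts, cds_lengths):
    """Divide each gene's read count by its CDS length; missing genes count as 0."""
    return {
        gene: read_counts.get(gene, 0) / length
        for gene, length in cds_lengths.items()
    }


def split_by_median(values):
    """Split genes into (low, high) sets around the median value.

    Assumption: genes exactly at the median are placed in the low group.
    """
    median = statistics.median(values.values())
    low = {gene for gene, value in values.items() if value <= median}
    high = {gene for gene, value in values.items() if value > median}
    return low, high
```

Iterating over `cds_lengths` rather than `read_counts` ensures that genes with no assigned reads still enter the median calculation with a value of 0, as described above.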
To test different hypotheses related to direct selection acting on the local folding-energy (LFE) in different regions of the coding sequence, the mean deviation in LFE between the native and randomized sequences was measured (maintaining the amino-acid sequence of all CDSs as well as codon and nucleotide composition including the GC-content, see Materials and Methods for more details). The resulting deviation values, denoted ΔLFE, measure the increase or decrease in local mRNA folding-energy relative to what would be expected based on the encoded protein and codon frequencies. Any significant deviation from random can be attributed to a specific arrangement of codons that supports increased or decreased base-pairing and folding strength along the mRNA strand (
Specifically, if the null hypothesis used to generate the randomized sequences holds for the native sequences at some position, the expected ΔLFE is 0. Otherwise, a significant deviation from ΔLFE=0 indicates that the local folding-energy values cannot be explained by selection on amino-acid content, codon bias or GC-content alone and serves as evidence for direct selection on local folding-energy (
It was observed that significant ΔLFE is present in most species and in most regions of the CDS (
To measure how frequently these elements appear together within the same species, they were tested against a model, based on two variants. The stricter variant, Model 1, counts species in which the regions of weak folding at the beginning and end of the CDS have, on average, weaker than expected folding, i.e., significantly positive ΔLFE. The less restrictive Model 2 requires folding in these regions to be significantly weaker than in the middle of the CDS, but not necessarily significantly weaker than random (see Materials and Methods for details). Since the models are applied to the mean ΔLFE of a population of genes which may vary greatly in their individual values, both estimates of the adherence to the model are informative. The combined models (composed of the three regions described) are found in 23% (Model 1) and 69% (Model 2) of the species analyzed (
GC-content and LFE both change during evolution, and it is worthwhile to compare their level of conservation in related species. LFE is to a large degree determined by GC-content (as evident by the almost perfect correlations found between GC-content and native or randomized LFE,
Additional tests also support direct selection acting to maintain folding strength. ΔLFE profile features are also preserved when calculated using a null distribution that maintains the codon distribution at any position in the CDS relative to the CDS start; thus, local (position-specific) genomic amino-acid or codon distributions are not enough to explain the ΔLFE profile (
It should be noted that the randomized LFE profiles are not always flat either, revealing that some residual influence on LFE, caused by the amino-acid frequencies at different regions, remains even after randomization. ΔLFE controls for this by separately measuring the folding-energy biases found in each position.
The different elements making up the model profile structure have functions associated with them. The weak folding region at the beginning of the coding region may improve access to the regulatory signals in this region (e.g., the start codon). The region of positive ΔLFE preceding the CDS end may help recognition of the stop codon and ribosomal dissociation from the mRNA and prevent ribosomal read-through. Strong folding in the middle of the coding sequence may assist co-translational folding by slowing down translation in specific positions to allow protein folding or other co-translational processes to take place, as well as regulate mRNA stability or prevent mRNA aggregation.
The division of the profile into the three regions described here is also apparent when the data is analyzed in an unsupervised manner via Principal Components Analysis (PCA) (
An additional feature was found in 45% of the organisms: a peak of selection for strong mRNA folding around 30-70 nt downstream of the start codon (
The ΔLFE profiles of eukaryotes are much more diverse than those found in prokaryotes. One striking observation is that significant positive ΔLFE throughout the mid-CDS region, present in 13% of the eukaryotes tested, is not observed in any of the 371 bacterial species tested except in Deinococcus puniceus (
Despite these general trends, there is also significant variation in the ΔLFE profiles across and within taxonomic groups. Examples 4-7 discuss genomic and environmental factors that explain some of the variation between mean ΔLFE profiles in different species.
The strengths of the three major regions of the ΔLFE profile described above are strongly correlated (
Together these results suggest that the different elements making up the typical profile structure are influenced at the genome level by a factor or combination of factors acting jointly on all regions and strengthening or weakening |ΔLFE|, as well distinct factors acting on each region differently. Some factors contributing to this scaling effect are discussed in Examples 4-7.
Codon usage bias (CUB) is generally correlated with adaptation to translation efficiency. If ΔLFE is also related to selection for translation efficiency, it is reasonable to expect that it would correlate with CUB. To test this hypothesis, ENc′ (ENc prime) was used; this measure of CUB compensates for the influence of extreme GC-content values, which skew standard ENc (Effective Number of Codons) scores. Indeed, such a correlation is found (
Genomic CUB, used here as a measure of optimization for efficient translation elongation, was found to also be a good predictor of the strength of ΔLFE. One interpretation is that the genomic variation in ΔLFE can largely be explained not by different species having distinct ‘target’ ΔLFE levels, but by different species having varying ‘abilities’ to maintain ΔLFE in the presence of mutations and drift, because the selection pressure is insufficient under their effective population size (either because the selection pressure is low or because the effective population size is low).
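For orientation, the standard Effective Number of Codons can be sketched in Python as follows. This is the uncorrected ENc in the commonly cited form Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, and not the GC-compensated ENc′ variant used in the analysis above. The function name, the use of the standard genetic code, and the handling of rare amino acids (those with fewer than two codons observed, or with non-positive homozygosity, are skipped) are simplifying assumptions of this sketch.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)


def effective_number_of_codons(cds):
    """Standard ENc for a coding sequence; 20 = extreme bias, 61 = no bias."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Codon homozygosity F per amino acid, grouped by degeneracy class.
    f_by_class = defaultdict(list)
    for aa, codons in SYNONYMS.items():
        k = len(codons)
        if aa == "*" or k == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        n = sum(counts[c] for c in codons)
        if n < 2:
            continue  # F is undefined (simplification: skip this amino acid)
        s = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * s - 1) / (n - 1)
        if f > 0:
            f_by_class[k].append(f)

    nc = 2.0  # Met and Trp each contribute exactly one codon
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        if not f_by_class[k]:
            raise ValueError(f"no usable {k}-fold amino acids in this sequence")
        mean_f = sum(f_by_class[k]) / len(f_by_class[k])
        nc += n_amino_acids / mean_f
    return min(nc, 61.0)  # values above 61 are capped, as is conventional
```

Lower values indicate stronger codon usage bias. In a genome-level analysis such a score would be computed over the pooled coding sequences of each species and then compared against the strength of ΔLFE.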
GC-content is a fundamental genomic feature and is correlated with many other genomic traits and environmental aspects. It might be a trait maintained under direct selection, or merely a statistical measure of the genome that other traits evolve in response to because of its biological and thermodynamic consequences. GC-content is also the strongest factor determining the native LFE (
The correlations (expressed as R²) between genomic GC-content and ΔLFE at different points near the CDS start and end are shown in
Near the CDS start, positive correlation (indicating a moderating effect) exists in the windows starting at 0-60 nt (
The opposite effect exists in the mid-CDS: negative (reinforcing) dependence on genomic GC-content appears in the region at 70-300 nt after CDS start in most bacterial and archaeal taxa (
In eukaryotes, a wider variation in mid-CDS ΔLFE was observed than in other groups, ranging from strongly positive to strongly negative, with a non-linear dependence on genomic GC-content (
The group of fungi and other eukaryotes having strong selection for weak local mRNA folding in the mid-CDS region (all of which have high genomic GC-content) runs counter to the general trend in prokaryotes. It is possible that these species are under selection for higher translation elongation speeds, which tend to be hindered by stronger mRNA folding; however, it is not clear why such cases are not observed in other groups such as bacteria. The correlation with GC-content reported here may also be partially explained by the fact that both GC-content and ΔLFE are affected by common factors, such as the ability to maintain the selected sequences under the effective population size. The wide range of ΔLFE values for eukaryotic species and the absence of a linear correlation with GC-content (in general) indicate that additional factors are involved in this aspect of gene expression.
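The per-position correlation between genomic GC-content and ΔLFE discussed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the layout of one ΔLFE value per window per species, and the three made-up species are hypothetical and do not reproduce the data or the exact regression procedure used herein.

```python
from math import sqrt


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T/U)."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGTU")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlation_profile(gc_values, dlfe_profiles):
    """For each window, return (r, R²) between GC-content and ΔLFE across species."""
    n_windows = len(dlfe_profiles[0])
    if len(gc_values) != len(dlfe_profiles):
        raise ValueError("one GC value is required per species profile")
    if any(len(p) != n_windows for p in dlfe_profiles):
        raise ValueError("all profiles must cover the same windows")
    result = []
    for w in range(n_windows):
        r = pearson_r(gc_values, [p[w] for p in dlfe_profiles])
        result.append((r, r * r))
    return result


# Made-up example: three species (rows), two windows (columns).
gc_values = [0.3, 0.5, 0.7]
dlfe_profiles = [
    [0.10, -0.2],
    [0.20, -0.4],
    [0.30, -0.6],
]
for window, (r, r2) in enumerate(correlation_profile(gc_values, dlfe_profiles)):
    trend = "positive" if r > 0 else "negative"
    print(f"window {window}: R² = {r2:.2f} ({trend} dependence on GC-content)")
```

Reporting the sign of r alongside R² keeps the direction of the dependence visible, since R² alone does not distinguish a positive from a negative dependence on GC-content. A linear measure of this kind would not capture the non-linear dependence noted for eukaryotes.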
Many endosymbionts and other species with intracellular life stages have low effective population sizes because their lifecycles include recurring population bottlenecks, or experience lower selective pressure due to reliance on the host. These species generally have weaker ΔLFE than their relatives, as can be clearly seen from their ΔLFE profiles (
At temperatures approaching the RNA melting temperature, base-pairing is destabilized, and it is likely that codon arrangement and ΔLFE can no longer significantly affect the secondary-structure. Hyperthermophilic archaea and bacteria were found to have weaker (closer to 0) ΔLFE in the mid-CDS region (
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
This application is a Bypass continuation of PCT Patent Application No. PCT/IL2021/050074, having International filing date of Jan. 24, 2021, which claims the benefit of priority of U.S. Provisional Patent Application No. 62/964,859 filed Jan. 23, 2020, both entitled “MOLECULES AND METHODS FOR INCREASED TRANSLATION”, the contents of which are all incorporated herein by reference in their entirety.
Number | Date | Country |
---|---|---|
62964859 | Jan 2020 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/IL2021/050074 | Jan 2021 | US |
Child | 17870029 | | US |