This application claims the benefit of priority of Israeli Patent Application No. 261157 filed 14 Aug. 2018, the contents of which are incorporated herein by reference in their entirety.
The ASCII file, entitled 78359 Sequence Listing.txt, created on 14 Aug. 2019, comprising 188,416 bytes, submitted concurrently with the filing of this application is incorporated herein by reference.
The present invention, in some embodiments thereof, relates to enzymology, and more particularly, but not exclusively, to phosphotriesterase variants designed by a designated computational method to exhibit catalytic activity towards a broad range of organophosphates and chemical warfare nerve agents.
At present, both prophylaxis and post-intoxication treatments of chemical warfare nerve agent (CWNA) poisoning are based on drugs selected to counteract the symptoms caused by accumulation of acetylcholine in cholinergic neurons. Current antidotal regimes consist of pretreatment with pyridostigmine, and of post-exposure therapy that involves administration of a cocktail containing atropine, an oxime reactivator and an anticonvulsant drug such as diazepam. The multi-drug approach against CWNA toxicity has been adopted by many countries and integrated into their civil and military medical protocols. However, it is commonly recognized that these drug regimens suffer from several disadvantages that call for new therapeutic strategies. The preferred approach is to rapidly detoxify the CWNA in the blood before it has had the chance to reach its physiological targets. One way of achieving this objective is by the use of bioscavengers. However, use of the best stoichiometric bioscavenger currently available (human butyrylcholinesterase, hBChE) requires administration of hundreds of milligrams of protein to confer protection against toxic doses of CWNA.
A safer and more effective treatment strategy can be achieved by using a catalytic bioscavenger to rapidly degrade the intoxicating organophosphate (OP) in the circulation. The promiscuous nerve-agent hydrolyzing activities of the enzyme phosphotriesterase (PTE) make it a prime candidate for both prophylactic and post-exposure treatment of nerve-agent intoxications. However, efficient in-vivo detoxification using low doses of enzyme (≤50 mg/70 kg) following exposure to toxic doses of nerve agents requires that the catalytic efficiencies (kcat/KM) of wild-type PTE towards the toxic nerve agent isomers be increased.
PTE variants that can efficiently hydrolyze V-type nerve agents were disclosed previously [Cherney, I. et al., ACS Chem Biol, 2013, 8(11), pp. 2394-2403]. In-vivo post-exposure activity of one of these variants (C23) was demonstrated in guinea-pigs intoxicated with a lethal dose of VX [Worek, F. et al., Toxicol Lett, 2014, 231(1), pp. 45-54].
Additional background art pertaining to PTE variants includes U.S. Pat. No. 8,735,124, WO2016/092555, WO2018/087759 and Roodveldt, C. and Tawfik, D.S., Protein Eng Des Sel., 2005, 18(1), pp. 51-8.
Mutations that alter enzyme activity profiles are essential for adaptation to an organism's changing needs, such as metabolizing new substrates. Such mutations are also highly desired in basic research, biotechnology, and biomedicine to enable efficient and environmentally safe solutions, for instance in the synthesis of useful molecules or the degradation of harmful ones. Most mutations, however, are deleterious to protein activity and stability, constraining the emergence of improved variants through natural evolution or protein engineering. Furthermore, due to mutational epistasis, a mutation's effect on activity depends on whether or not other mutations were previously acquired. In the extreme case, known as sign epistasis, two mutations that are individually deleterious, enhance activity when combined, or vice versa. In natural evolution, mutations usually occur one at a time, and thus, epistatic combinations of mutations must accumulate in a specific order, since all intermediates must be at least as active as their predecessors or they would be purged by selection. The high prevalence of sign epistasis in improved mutants further reduces the likelihood of obtaining beneficial combinations. Protein evolution is additionally constrained by stability-threshold effects, whereby activity-enhancing mutations may destabilize the protein, and therefore accumulate only up to a threshold in which additional mutations are no longer tolerated. To overcome stability-threshold effects, stabilizing mutations, both in proximity to the active-site pocket and in distant regions, are essential for the accumulation of function-enhancing mutations.
Due to epistasis and stability-threshold effects, the evolution of variants with significant enhancement in an enzyme activity demands multiple mutations of different types, affecting different regions of the protein. Laboratory-evolution experiments, for instance, may comprise more than a dozen rounds of genetic diversification and selection for improved mutants, and substantial improvements by three orders of magnitude or more require on average ten mutations. The majority of these mutations occur outside the catalytic pocket and are likely to affect activity only indirectly by enhancing tolerance to function-enhancing mutations. Another complication is that laboratory-evolution experiments are laborious and demand high-throughput or even ultrahigh-throughput screening (>10⁶ variants per round). Such screens, however, are only applicable to certain enzyme activities and typically employ synthetic model substrates.
In principle, computational protein design strategies could bypass the need for multiple rounds of experimental optimization, since they are unconstrained by mutational trajectories. Previous applications of protein design computed favorable point mutants or focused libraries for experimental screening, yielding limited gains in activity, and de novo designed enzymes exhibited low catalytic efficiencies. Overall, computational enzyme design remains a specialized expertise, and still depends on laboratory evolution to reach comparable efficiencies to those seen in natural enzymes. Thus, substantial gaps remain in the understanding and control of the basic principles of enzyme design.
Additional background art pertaining to computational design of protein variants includes U.S. Patent Application Publication No. 2017/0032079, International Patent Application No. WO 2017/017673, Fleishman, S. L. et al., PLoS One, 2011, 6(6), and Goldenzweig, A. et al. Mol Cell., 2016, 63(2), pp. 337-346.
Substantial improvements in enzyme activity demand multiple mutations at spatially proximal positions in the active site. Such mutations, however, often exhibit unpredictable epistatic (non-additive) effects on activity. Here, the present invention provides an automated method for designing multipoint mutations at enzyme active sites using phylogenetic analysis and Rosetta design calculations, referred to herein as FuncLib. FuncLib is demonstrated herein using phosphotriesterase; the designed variants of PTE were all active, and most showed activity profiles that significantly differed from the wild type and from one another. Several dozen designs with only 3-6 active-site mutations exhibited 10-4,000-fold higher efficiencies with a range of alternative substrates, including the hydrolysis of the toxic organophosphate nerve agents soman and cyclosarin. FuncLib has also been implemented as a web-server (www(dot)funclib(dot)weizmann(dot)ac(dot)il); it circumvents iterative, high-throughput screens and opens the way to design highly efficient and diverse catalytic repertoires.
Thus, according to an aspect of some embodiments of the present invention, there is provided a protein having a sequence selected from the group consisting of any combination of at least 2 amino acid substitutions of a sequence space afforded for phosphotriesterase (PTE) from Pseudomonas diminuta as an original protein, and listed in Table A:
In some embodiments, the protein is a hybrid protein wherein the combination of amino acid substitutions is implemented on a PTE protein other than the original protein.
In some embodiments, the protein is characterized by a sequence selected from the group consisting of the sequences presented in Table A set forth hereinbelow.
In some embodiments, the protein is characterized by a sequence selected from the group consisting of PTE_28 (SEQ ID NO: 28), PTE_29 (SEQ ID NO: 29), PTE_56 (SEQ ID NO: 56), and PTE_57 (SEQ ID NO: 57).
According to an aspect of some embodiments of the present invention, there is provided a method of detoxification and decontamination of organophosphate agents, which is effected by contacting an area suspected of being contaminated with the organophosphate agents with at least one of the PTE variant proteins provided herein according to some embodiments of the present invention.
In some embodiments, the area is selected from the group consisting of a floor, a wall, a building or a part thereof, a vehicle, a piece of clothing, a piece of equipment, a plant, an animal, and an inanimate object.
In some embodiments, the organophosphate agents are selected from the group consisting of a G-type nerve agent, a V-type nerve agent, and a GV-type nerve agent.
According to an aspect of some embodiments of the present invention, there is provided a method of generating a library of enzyme variants (designs) having diverse, improved catalytic activity compared to an original enzyme, the method being effected by:
identifying a group of substitutable residues (substitutable positions) in a first shell and a second shell of an active site of the enzyme, and a group of fixed residues (fixed positions) in these shells;
permuting mutations of the substitutable residues according to a PSSM scoring regimen using a computational software that calculates stability parameters and ranks the permutated mutants according to their energy value, thereby obtaining a stability score list of enzyme variants;
enumerating the enzyme variants resulting from the previous step;
selecting a number of the resulting variants (permutated mutants) at the top of the stability score list, which have at least two mutations in the substitutable residues compared to the original enzyme; and
cloning and expressing that number of variants having top stability score and at least two mutations relative to the original enzyme.
In some embodiments, the method of generating a library of enzyme variants, further includes, prior to identifying substitutable and fixed residues, providing a stabilized variant of the wild-type enzyme using any design-for-stability method (such as PROSS), and using this variant as the original enzyme.
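By way of illustration only, the enumeration, ranking and selection steps recited above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in any embodiment: the function names, the positions, the tolerated residues and the toy energy table are all hypothetical, and the `toy_score` function merely stands in for the stability calculation that would be carried out by the computational software.

```python
from itertools import combinations, product


def enumerate_variants(allowed, max_mutations):
    """Yield every multipoint mutant as a dict {position: new_residue}.

    `allowed` maps each substitutable position to the residues tolerated
    there (the original identity is not listed).
    """
    positions = sorted(allowed)
    for k in range(1, max_mutations + 1):
        for subset in combinations(positions, k):
            for residues in product(*(allowed[p] for p in subset)):
                yield dict(zip(subset, residues))


def select_library(allowed, score, max_mutations, library_size, min_mutations=2):
    """Rank all enumerated variants by score (lower = more stable) and
    keep the top-ranked ones carrying at least `min_mutations` mutations."""
    ranked = sorted(enumerate_variants(allowed, max_mutations), key=score)
    eligible = [v for v in ranked if len(v) >= min_mutations]
    return eligible[:library_size]


# Hypothetical, made-up energies for illustration only (not Rosetta output).
TOY_SINGLE = {(10, "L"): -1.0, (10, "M"): 0.5, (20, "E"): -0.5, (30, "R"): 0.25}
# A non-additive (epistatic) pair: each mutation alone is unfavorable,
# but the combination is favorable.
TOY_PAIR = {frozenset({(10, "M"), (30, "R")}): -2.0}


def toy_score(variant):
    """Sum hypothetical single-mutation terms plus pairwise coupling terms."""
    items = list(variant.items())
    total = sum(TOY_SINGLE[m] for m in items)
    for a, b in combinations(items, 2):
        total += TOY_PAIR.get(frozenset({a, b}), 0.0)
    return total


if __name__ == "__main__":
    allowed = {10: ["L", "M"], 20: ["E"], 30: ["R"]}
    for variant in select_library(allowed, toy_score, max_mutations=3, library_size=2):
        print(variant, toy_score(variant))
```

In this toy example the best-ranked design contains the epistatic pair, which a search restricted to individually favorable point mutations would not have reached.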
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the Drawings:
The present invention, in some embodiments thereof, relates to enzymology, and more particularly, but not exclusively, to phosphotriesterase variants designed by a designated computational method to exhibit catalytic activity towards a broad range of organophosphates and chemical warfare nerve agents.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of calculation, enumeration and the values of the computational parameters and/or laboratory methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details set forth in the following description or exemplified by the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
A method for designing functionally diverse repertoires of an enzyme:
To address the gaps still plaguing contemporary protein design approaches, as discussed in the introductory section hereinabove, the present inventors have developed a protein design strategy that affords sequences of proteins having stable networks of interacting residues at the active site and selects a small set of diverse designs amenable to low-throughput screening. This design paradigm and practical strategy, and the corresponding computational tools and methods provided herein, address epistasis by designing dense and pre-organized networks of interacting active-site multipoint mutants. Optionally, the protein design strategy may further include the use of PROSS, which addresses stability-threshold effects by first designing a stable enzyme scaffold. The method does not a priori target a specific substrate, as this demands accurate models of the enzyme transition-state complex, and such models are rarely attainable and are mostly approximate. Rather, the method (design strategy) provided herein, according to some embodiments of the present invention, results in a repertoire of stable and highly efficient proteins (e.g., enzymes, antibodies etc.) that can be screened for the activities of interest.
As presented herein, starting from exemplary enzymes for demonstrative purposes, the method provided herein was used to design functionally diverse repertoires comprising dozens of enzymes that exhibited 10-4,000 fold improvements in a range of activities. The robust and effective strategy presented herein can be combined with the previously provided method, implemented as the publicly available protein-stabilization platform “PROSS” (see, U.S. Patent Application Publication No. 2017/0032079 and WO 2017/017673, each of which is incorporated herein by reference as if fully set forth herein; and e.g., www(dot)pross(dot)weizmann(dot)ac(dot)il/). The method, provided herewith and referred to as “FuncLib” or “AbLift”, has also been implemented as an automated web-accessible server.
The main difference between PROSS and the method provided herein and implemented in FuncLib and AbLift is that PROSS designs the protein outside the active/binding site, while FuncLib and AbLift design the active/binding sites, since PROSS's objective is to stabilise the protein without changing its structure-related activity. This distinction is of paramount importance: since there are many positions in any protein open to design of stable variants (>90% of the protein is not directly related to function), PROSS looks only for the safest combinations of mutations, using a combinatorial design algorithm that assumes that the backbone stays fixed and results in a combination of mutations with a mostly additive effect on stability. In contrast, FuncLib/AbLift work in the regions of the protein system where positions are highly interdependent (the active/binding site). In such structural regions, there are fewer allowed mutations (≤10% of the protein and very high conservation due to functional constraint) and almost all positions are dependent on one another, so there are almost no “safe” combinations of mutations in which each mutation impacts activity in an additive way; they are all potentially deleterious, and indeed experiments show that these regions are incredibly sensitive to mutation, let alone multipoint mutations. Therefore, in the method provided herein, and implemented as the exemplary procedures FuncLib and AbLift, the tolerated sequence space is first identified using more relaxed settings (energetic stability threshold) than PROSS, so as to enable mutations even in conserved positions, and all of the possible combinations, which are kept at manageable numbers to enable effective computation, are then enumerated.
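To give a sense of why the number of enumerated combinations must be kept manageable, the following Python sketch counts the multipoint mutants that arise from a given tolerated sequence space. It is an illustration only; the function name and the example numbers (eight positions, three tolerated substitutions each, up to five simultaneous mutations) are assumptions chosen for demonstration and are not taken from any embodiment.

```python
def count_multipoint_mutants(choices_per_position, max_mutations):
    """Count variants carrying between 1 and `max_mutations` substitutions.

    choices_per_position[i] is the number of non-original residues
    tolerated at substitutable position i.
    """
    # ways[k] = number of ways to mutate exactly k of the positions seen so far
    ways = [1] + [0] * max_mutations
    for n in choices_per_position:
        # Iterate k downwards so that each position is used at most once.
        for k in range(max_mutations, 0, -1):
            ways[k] += ways[k - 1] * n
    return sum(ways[1:])


if __name__ == "__main__":
    # Hypothetical sequence space: 8 positions, 3 tolerated substitutions each.
    print(count_multipoint_mutants([3] * 8, max_mutations=5))
```

Under these assumed numbers the count is 21,066 variants, which is small enough to model each variant individually, whereas adding positions or tolerated identities increases the count combinatorially.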
In each instance of a multipoint mutant generated by the method provided herein (FuncLib/AbLift), the backbone is allowed to change conformation, thereby allowing mutations, including small-to-large mutations that are considered very difficult for computational design and even combinations of small-to-large mutations. All of the enumerated multipoint mutants are then ranked by energy to ensure that only stable, pre-organised networks of mutations are selected. It has been surprisingly noticed by the inventors of the present invention, that there are often hundreds or even thousands of sequences with lower energies (more stable) than the wild type or the original/starting sequence, which has never been seen by applying straightforward combinatorial design simulations or in PROSS results. Thus, the method provided herein is based on a rigorous sampling of sequence space with fewer assumptions on the rigidity of the protein or on the additive contribution of mutations to function or stability.
While FuncLib and AbLift share many computational components, the main difference between the two implementations of the computational protein design method provided herein is that FuncLib is mainly applied to enzyme active sites, which are solvent exposed and therefore potentially still tolerant to mutation, whereas AbLift is applied to the interface between two protein chains (e.g., light/heavy chain interface in antibodies). This chain interface region is as tightly packed as a protein core, and therefore potentially less tolerant to mutation. It is noted herein that PROSS, the previously provided method, typically fails to find mutations in such regions, whereas AbLift is designed to readily find hundreds of multipoint combinations with improved energy (stability and preorganization).
Hence, the method provided herein (FuncLib/AbLift) deals with the problem of how to find favourable multipoint mutants among interdependent positions in highly conserved regions, an outcome that PROSS explicitly tries to avoid, that other computational design methods typically fail to achieve, and that experimental in vitro evolution strategies often require multiple iterative step-by-step screening rounds to achieve.
Thus, according to an aspect of some embodiments of the present invention, there is provided a method for computationally designing a library of proteins (polypeptides), stemming from a template/original protein (original polypeptide chain), e.g., an enzyme, wherein members of this library exhibit 10-4,000 fold improvements in a range of activities and functionalities, compared to the template/original protein. In some embodiments, the protein is an enzyme with a known activity in terms of substrate/product/rate, and the library, which is generated according to embodiments of the present invention, includes enzymes with improved known activities, new activities, or both. It is noted that in the context of the present invention, a new activity may be seen as an activity known to be low or essentially null, hence the description below addresses both new and improved activities, as improvement can start from essentially no activity up to an enhanced activity, regardless of the known activity.
In terms of parameter values and Rosetta energy units, the more relaxed energetic stability threshold used in FuncLib/AbLift includes PSSM score ≥−2 or −1 and ΔΔG score ≤+1, +2, +3, +4, +5, or +6, compared to the energetic stability threshold used in PROSS, which includes PSSM score ≥0 and ΔΔG score ≤−0.45, −0.9, −2.0, −3.0, or −4.0.
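The effect of the two threshold regimes can be illustrated with a short Python sketch that filters candidate point mutations by PSSM score and ΔΔG. The sketch assumes one setting from each of the ranges quoted above (PSSM ≥ −2 and ΔΔG ≤ +6 for the relaxed regime; PSSM ≥ 0 and ΔΔG ≤ −0.45 for the stricter regime); the class and function names and the candidate values are hypothetical and serve only to show how the relaxed regime admits more identities per position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_pssm: float  # lowest PSSM score accepted
    max_ddg: float   # highest ΔΔG accepted, in Rosetta energy units


# One illustrative setting from each of the ranges quoted in the text.
RELAXED = Thresholds(min_pssm=-2, max_ddg=6.0)    # FuncLib/AbLift-like
STRICT = Thresholds(min_pssm=0, max_ddg=-0.45)    # PROSS-like


def tolerated(candidates, thresholds):
    """Return (position, residue) pairs that pass both cutoffs.

    `candidates` is an iterable of (position, residue, pssm, ddg) tuples.
    """
    return [
        (pos, aa)
        for pos, aa, pssm, ddg in candidates
        if pssm >= thresholds.min_pssm and ddg <= thresholds.max_ddg
    ]


if __name__ == "__main__":
    # Made-up candidate mutations: (position, residue, PSSM score, ΔΔG).
    candidates = [
        (10, "L", 1, -1.2),
        (10, "M", -1, 2.5),
        (20, "E", -2, 5.9),
        (30, "R", -3, 0.1),
        (30, "K", 0, 6.5),
    ]
    print("relaxed:", tolerated(candidates, RELAXED))
    print("strict: ", tolerated(candidates, STRICT))
```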
For the demonstration of the method, an enzyme with a publicly available crystal structure, zinc-containing phosphotriesterase (PTE) from Pseudomonas diminuta (PDB entry 1HZY), was selected. The method presented herein was effectively used to provide modified polypeptide chains, starting with an original polypeptide chain, such as found in a corresponding wild type protein or a previously engineered/designed variant, wherein several amino acid residues in the original polypeptide chains have been substituted such that a protein expressed to have the modified polypeptide chains (a variant protein) exhibits improved catalytic activity with respect to a certain substrate, as well as structural stability, compared to the wild type protein. The term “variant”, as used herein, refers to a designed protein obtained by employing the method presented herein. Herein and throughout, the terms “amino acid sequence” and/or “polypeptide chain” are used also as a reference to the protein having that amino acid sequence and/or that polypeptide chain; hence the terms “original amino acid sequence” and/or “original polypeptide chain” are equivalent or relate to the terms “original protein” and “wild type protein”, and the terms “modified amino acid sequence” and/or “modified polypeptide chain” and/or “designed polypeptide” are equivalent or relate to the terms “designed protein” and “variant”.
In some embodiments, the original polypeptide chain, or the original protein, is naturally occurring (wild type; WT) or artificial (man-made non-naturally occurring), or a designed polypeptide chain, namely a product of a computational method, such as PROSS.
In the context of some embodiments of the present invention, the term “designed” and any grammatical inflections thereof, refers to a non-naturally occurring sequence or protein.
In the context of some embodiments of the present invention, the term “sequence” is used interchangeably with the term “protein” when referring to a particular protein having the particular sequence.
According to an aspect of some embodiments of the present invention, there is provided a method of computationally designing a modified polypeptide chain starting from an original polypeptide chain.
Method requirements and input preparation:
The basic requirements for implementing the method for designing modified polypeptide chains for activity diversification include:
availability of structural information pertaining to the original polypeptide chain, such as obtained from an experimentally determined crystal structure of the original polypeptide chain, or a crystal structure of a close homolog thereof, having at least 30-60% amino acid sequence identity, or computationally derived structural information based on an experimentally determined structure of a close homolog thereof;
optional availability of experimental mutation analysis, either point mutations, combinations of mutations, or deep mutational scanning; and
availability of sequence data derived from several qualifying homologous proteins, whereas the criteria for a qualifying homologous sequence are described below (
In the context of embodiments of the present invention, the term “% amino acid sequence identity” or in short “% identity” is used herein, as in the art, to describe the extent to which two amino acid sequences have the same residues at the same positions in an alignment. It is noted that the term “% identity” is also used in the context of nucleotide sequences.
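As a simple illustration of the term, the following Python sketch computes % identity for two sequences that have already been aligned. It assumes one common convention, namely that only alignment columns in which neither sequence has a gap are counted; other conventions for the denominator exist, and the function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the gap-free alignment columns.

    Both inputs must be rows of the same alignment (equal length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Made-up aligned fragments: 4 identical residues over 5 gap-free columns.
    print(percent_identity("MKT-LV", "MRTALV"))
```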
It is noted herein that in general, the method presented herein (e.g., FuncLib) does not require a structural model of a transition state or its complex structure. Rather it computes diverse yet stable networks of interacting residues at the active-site pocket, thereby encoding different stereochemical complementarities for alternative substrates/ligands that do not need to be defined a priori. It is therefore expected that the method provides designs that form a functional repertoire, from which individual designs that efficiently turn over various target substrates could be isolated. In applications that target a specific substrate, by contrast, sequence space can be further constrained by designing the enzyme in the presence of the substrate or transition-state model, and this option is enabled in the web-server, presented herein.
Structural data preparation:
According to some embodiments of the invention, the structural information is a set of atomic coordinates of the original polypeptide chain. This set of atomic coordinates is referred to herein as the “template structure”, which is used in the method as discussed below. In some embodiments, the template structure is a crystal structure of the original polypeptide chain, and in some embodiments the template structure is a computationally generated structure based on a crystal structure of a close homolog (more than 30-60% identity) of the original polypeptide chain, wherein the amino acid sequence of the original polypeptide chain has been threaded thereon and subjected to weighted fitting to afford energy minimization thereof, as these are discussed below.
In cases where the protein of interest is an oligomer (having several polypeptide chains), the chain of interest, namely the original polypeptide chain to be modified, is defined in the template structure. In the case of hetero-oligomers, it is required to select the chain that will undergo the sequence design procedure or to subject both chains to simultaneous design. For homo-oligomers, it is advantageous to select the original polypeptide chain having more or better-quality structural data. For example, in some homo-oligomers, bound ions may be discernible in a crystal structure in some of the chains and less so in others. In addition, it is advantageous to define key residues related to function and activity, as discussed hereinbelow.
Structure refinement:
According to some embodiments, prior to its use in the method presented herein, the template structure is optionally subjected to a global energy minimization, afforded by weighted fitting thereof, as discussed below.
According to some embodiments of the present invention, the template structure is optionally refined by energy minimization prior to using its coordinates, while fixing the conformations of key residues, as defined hereinbelow. Structure refinement is a routine procedure in computational chemistry, and typically involves weight fitting based on free energy minimization, subjected to rules, such as harmonic restraints.
The term “weight fitting”, according to some embodiments of any of the embodiments of the present invention, refers to one or more computational structure refinement procedures or operations, aimed at optimizing geometrical, spatial and/or energy criteria by minimizing polynomial functions based on predetermined weights, restraints and constraints (constants) pertaining to, for example, sequence homology scores, backbone dihedral angles and/or atomic positions (variables) of the refined structure. According to some embodiments, a weight fitting procedure includes one or more of a modulation of bond lengths and angles, backbone dihedral (Ramachandran) angles, amino acid side-chain packing (rotamers) and an iterative substitution of an amino acid, whereas the terms “modulation of bond lengths and angles”, “modulation of backbone dihedral angles”, “amino acid side-chain packing” and “change of amino acid sequence” are also used herein to refer to, inter alia, well known optimization procedures and operations which are widely used in the field of computational chemistry and biology. An exemplary energy minimization procedure, according to some embodiments of the present invention, is cyclic-coordinate descent (CCD), which can be implemented with the default all-atom energy function in the Rosetta™ software suite for macromolecular modeling. For a review of general optimization approaches, see for example, “Encyclopedia of Optimization” by Christodoulos A. Floudas and Panos M. Pardalos, Springer Pub., 2008.
According to some embodiments of the present invention, a suitable computational platform for executing the method presented herein is the Rosetta™ software suite platform, publicly available from the “Rosetta@home” at the Baker laboratory, University of Washington, U.S.A. Briefly, Rosetta™ is a molecular modeling software package for understanding protein structures, protein design, protein docking, protein-DNA and protein-protein interactions. The Rosetta software contains multiple functional modules, including RosettaAbinitio, RosettaDesign, RosettaDock, RosettaAntibody, RosettaFragments, RosettaNMR, RosettaDNA, RosettaRNA, RosettaLigand, RosettaSymmetry, and more.
Weight fitting, according to some embodiments, is effected under a set of restraints, constraints and weights, referred to as rules. For example, when refining the backbone atomic positions and dihedral angles of any given polypeptide segment having a first conformation, so as to drive towards a different second conformation while attempting to preserve the dihedral angles observed in the second conformation as much as possible, the computational procedure would use harmonic restraints that bias, e.g., the Cα positions, and harmonic restraints that bias the backbone-dihedral angles from departing freely from those observed in the second conformation, hence allowing the minimal conformational change to take place per each structural determinant while driving the overall backbone to change into the second conformation.
In some embodiments, a global energy minimization is advantageous due to differences between the energy function that was used to determine and refine the source of the template structure, and the energy function used by the method presented herein. By allowing changes to occur in backbone conformation and in rotamer conformation through minimization, the global energy minimization relieves small mismatches and small steric clashes, thereby lowering the total free energy of some template structures by a significant amount.
In some embodiments, energy minimization may include iterations of rotamer sampling (repacking) followed by side chain and backbone minimization. An exemplary refinement protocol is provided in Korkegian, A. et al., Science, 2005. In some embodiments, energy minimization may include more substantial energy minimization in the backbone of the protein.
As used herein, the terms “rotamer sampling” and “repacking” refer to a particular weight fitting procedure wherein favorable side chain dihedral angles are sampled, as defined in the Rosetta software package. Repacking typically introduces larger structural changes to the weight fitted structure, compared to standard dihedral angles minimization, as the latter samples small changes in the residue conformation while repacking may swing a side chain around a dihedral angle such that it occupies an altogether different space in the protein structure.
In some embodiments, wherein the template structure is of a homologous protein, the query sequence is first threaded on the protein's template structure using well established computational procedures. For example, when using the Rosetta software package, according to some embodiments of the present invention, the first two iterations are done with a “soft” energy function wherein the atom radii are defined to be smaller. The use of smaller radius values reduces the strong repulsion forces, resulting in a smoother energy landscape and allowing energy barriers to be crossed. The next iterations are done with the standard Rosetta energy function. A “coordinate constraint” term may be added to the standard energy function to penalize substantial deviations from the original Cα coordinates. The coordinate constraint term behaves harmonically (Hooke's law), having a weight ranging between about 0.05-0.4 r.e.u (Rosetta energy units), depending on the degree of identity between the query sequence and the sequence of the template structure. During refinement, key residues are subjected only to small-range minimization and not to rotamer sampling.
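By way of a non-limiting illustration, the harmonic behavior of such a coordinate constraint term can be sketched in Python as follows. The function names, the linear mapping from sequence identity to weight, and the coordinates are assumptions made for illustration only; the sketch does not reproduce the Rosetta implementation.

```python
def harmonic_ca_penalty(ref_coords, new_coords, weight):
    """Hooke's-law penalty: weight * squared displacement, summed over C-alpha atoms."""
    total = 0.0
    for ref, new in zip(ref_coords, new_coords):
        squared_distance = sum((a - b) ** 2 for a, b in zip(ref, new))
        total += weight * squared_distance
    return total


def weight_from_identity(identity, low=0.05, high=0.4):
    """Hypothetical mapping (an assumption): interpolate the weight linearly
    between `low` and `high` as the query/template identity goes from 0 to 1."""
    identity = min(max(identity, 0.0), 1.0)
    return low + (high - low) * identity


# Toy example: one C-alpha atom displaced by 2 Angstrom along x.
reference = [(0.0, 0.0, 0.0)]
displaced = [(2.0, 0.0, 0.0)]
penalty = harmonic_ca_penalty(reference, displaced, weight_from_identity(0.0))
```

Because the penalty grows with the square of the displacement, small adjustments are tolerated while large departures from the template Cα positions are strongly disfavored.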
Sequence data preparation:
Once an original polypeptide chain has been identified, and a corresponding template structure has been provided, the method requires assembling a database of qualifying homologous amino acid sequences related to the amino acid sequence of the original polypeptide chain. The amino acid sequence of the original polypeptide chain can be extracted, for example, from a FASTA file that is typically available for proteins in the protein data bank (PDB), or provided otherwise. The search for qualifying homologous sequences is done, according to some embodiments of the present invention, in the non-redundant (nr) protein database, using the sequence of the original polypeptide chain as a search query. Such nr-database typically contains manually and automatically annotated sequences and is therefore much larger than databases that contain only manually annotated sequences.
Non-limiting examples of protein sequence databases include INSDC EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase (for the insect family Drosophilidae), H-Invitational Database (H-Inv), International Protein Index (IPI), Protein Information Resource (PIR-PSD), Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome Database (SGD), The Arabidopsis Information Resource (TAIR), TROME, UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL, Vertebrate and Genome Annotation Database (VEGA), WormBase, the European Patent Office (EPO), the Japan Patent Office (JPO) and the US Patent Office (USPTO).
A search in an nr-database yields variable results depending on the search query (amino-acid sequence of the original polypeptide chain). For proteins for which sequence data is sparse, results may include fewer than 10 hits. For proteins common to all kingdoms of life, the results may include thousands of hits. For most proteins, hundreds to thousands of hits are expected upon search in an nr-database. In all databases, including an nr-database and despite its name, there may be redundancy to some extent, and hits may be found in groups of identical sequences. The redundancy problem is addressed during the sequence data editing.
In some embodiments of the invention, the obtained sequence data is optionally filtered and edited as follows:
(a) Redundant sequences are clustered into a single representative sequence. The clustering is carried out with a predetermined threshold. For example, a threshold of 0.97 means that all sequences that share at least 97% identity among themselves are clustered into a single representative sequence that is the average of all the sequences contributing to the cluster;
(b) Sequences for which the alignment length is less than a predetermined threshold (e.g., 60%) of the search query length are excluded; and
(c) Sequences whose identity with respect to the search query falls below a cutoff of, for example, about 28% to 34% are excluded, following guidelines such as provided elsewhere [Rost, B., Protein Eng, 1999, 12(2):85-94].
The exact choice of the minimal identity parameter depends on the richness of the sequence data. Hence, according to some embodiments of the invention, if the number of sequence hits afforded under a strict threshold is about 50 or less, a less strict threshold may be used (lower % identity). The effect of threshold tuning of the identity parameter is demonstrated in the design of a phosphotriesterase from Pseudomonas diminuta, where lowering the threshold from 30% to 28% identity increased the number of qualifying homologous sequences from 45 to 95.
In some embodiments of the invention, the cutoff for selecting qualifying homologous sequences for a multiple sequence alignment is more than 20%, 25%, 30%, 35%, 40%, or more than 50% identity with respect to the original polypeptide chain.
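By way of a non-limiting illustration, the coverage and identity filters, together with the relaxation of the identity cutoff when too few hits qualify, can be sketched in Python as follows. The record layout (a dictionary with `aligned_len` and `identity` fields), the function names and the example cutoffs are assumptions made for illustration; clustering of redundant sequences is assumed to have been performed beforehand by a dedicated tool.

```python
def filter_hits(hits, query_len, min_identity, min_coverage=0.60):
    """Keep hits whose alignment covers enough of the query and whose
    identity to the query meets the cutoff (both given as fractions)."""
    return [
        h for h in hits
        if h["aligned_len"] / query_len >= min_coverage
        and h["identity"] >= min_identity
    ]


def select_homologs(hits, query_len, thresholds=(0.34, 0.30, 0.28), min_hits=50):
    """Try identity cutoffs from strict to permissive and stop at the first
    cutoff that yields more than `min_hits` qualifying sequences."""
    kept = []
    for cutoff in thresholds:
        kept = filter_hits(hits, query_len, cutoff)
        if len(kept) > min_hits:
            return cutoff, kept
    # No cutoff was rich enough; return the most permissive result.
    return thresholds[-1], kept
```

In this sketch the thresholds are tried in order, so the strictest cutoff that affords a sufficiently rich set of sequences is the one used.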
It is noted that the method is not limited to any particular sequence database, search method, identity determination algorithm, and any set of criteria for qualifying homologous sequences. However, the quality of the results obtained by use of the method depends to some extent on the quality of the input sequence data.
Once an assembly of qualifying homologous sequences is obtained, a multiple sequence alignment (MSA) is generated (
Cases of low availability of homologous proteins:
Generally, adding sequences exhibiting a % identity below 20% to an MSA having dozens of homologous sequences of higher % identity may contribute diversity to the alignment; however, adding such low % identity sequences increases the risk of errors (false positives) significantly while not necessarily improving diversity by much, since most of this diversity will probably be covered by the high homology sequences that were already part of the MSA. On the other hand, when the protein of interest is poorly represented in the sequence database, using a low % identity homolog becomes an advantage rather than a risk.
In some cases the protein of interest is poorly represented in the currently available protein sequence databases in terms of the number of non-redundant homologous sequences. For example, if a sequence homology search finds only one homologous sequence having 60% sequence identity to the protein of interest, the method is limited to zero amino acid substitutions in 60% of the sequence positions, and among the remaining 40% it would be difficult to identify a position with more than a few amino acid alternatives.
In such cases, the present inventors have envisioned several scenarios where standard sequence homology search methods might result in low sequence diversity within the space of homologous sequences (e.g., less than 50%, less than 40%, less than 30%, less than 25% (the “twilight zone”) or less than 20% sequence identity with respect to the amino acid sequence of the protein of interest). An example for such a scenario is where the fold of the protein of interest (the target protein, also referred to herein as the original polypeptide chain) is unique or phylogenetically restricted to particular genera or phyla, or the protein function has emerged in recent millennia and the protein of interest therefore has few homologues. It was envisioned by the present inventors that in such or other cases of low sequence diversity, the following steps could be taken to increase the sequence diversity used by the presently provided method, while minimizing the risk of introducing unrelated sequences.
An exemplary sub-algorithm for treating such cases is described in U.S. Patent Application Publication No. 2017/0032079, which is incorporated herein by reference. The general rationale behind this sub-algorithm is to increase the number of homologous sequences in the MSA as much as possible while minimizing the risk of including non-related sequences; for example, accounting for the fact that the fold of the protein of interest is unique and/or phylogenetically distant from typical organisms interrogated by sequencing efforts.
Step 1: search for low-sequence identity homologous sequences (e.g., less than 50%, less than 40%, less than 30%, less than 25% or less than 20% sequence identity; preferably less than 30% identity) in any given sequence database by using an algorithm that specializes in detection of distant homologues (e.g., CSI-BLAST; see, PMIDs: 19234132, 18004781);
Step 2: cluster the results from Step 1 using a clustering threshold 90-100% (see, e.g., PMID: 11294794);
Step 3: remove sequences with coverage below 40% relative to that of the original polypeptide chain (protein of interest), and sequence identity of less than 15%;
Step 4: inspect the annotation and source organism of each sequence in the list resulting from Step 3, and exclude sequences that have a high chance of being false positives. Non-limiting examples are hits that have no molecular-function annotation (typically these are annotated as “hypothetical protein”), sequences from genera or phyla other than the protein of interest's genus or phylum, or proteins that are annotated with functions that are different from the function of the protein of interest;
Step 5: exclude sequences that have more than 5%, more than 4%, more than 3%, more than 2%, more than 1%, or more than 0.5% gaps (insertions or deletions, known by the acronym INDELs) in a pairwise alignment with the original polypeptide chain, preferably retaining only sequences having less than 5% gaps (see, e.g., PMID: 18048315);
Step 6: combine sequences resulting from Step 5 with high sequence identity sequences (i.e., more than 30% sequence identity to the protein of interest) that were collected and processed using any sequence identity search protocol, and generate a multiple-sequence alignment (MSA). This MSA can then be used as input by the method presented herein even if it contains few (less than 3-10) sequences.
Following is a More Specific Yet Non-Limiting Example:
Step I: Use the CSI-BLAST search algorithm instead of BLASTP to identify homologs. The use of an alternative sequence search algorithm to find distant homologues, such as using CSI-BLAST (context-specific iterative BLAST) with 3 iterations instead of BLASTP, is advantageous in some cases since CSI-BLAST constructs a different substitution matrix to calculate alignment scores. The CSI-BLAST matrix is context specific (i.e., the probabilities at each position also depend on 12 neighboring amino acids), and thus it finds 50% more homologous sequences than BLAST at the same error rate. The iterative use means that this process is repeated, and at the end of each round the substitution matrix is updated according to the sequence information from homologues collected up to that point.
Step II: Use minimal sequence identity thresholds of 19% and 15% for strict and permissive alignments, respectively. First, lowering the minimal sequence identity threshold to 15% (permissive alignment) and 19% (strict alignment) while using BLASTP may be meaningless, since BLASTP is tuned to find sequences with higher sequence identity to the target. Secondly, these thresholds are chosen according to the results obtained from the CSI-BLAST search; hence these thresholds are set after the CSI-BLAST search and depend on its outcome. Specifically, the thresholds may need to be adjusted to obtain more true positive or fewer false positive hits, where true positives are hits with a functional annotation and phylogenetic origin that correspond to the requirements of Step III, below.
Step III: Exclude sequences from genera or phyla other than the one corresponding to the protein of interest if it is expected that the target protein's fold or function is unique to the genus or phylum of the target protein. If this expectation holds, proteins from genera and phyla outside those of the target protein are likely to be false-positive hits; that is, proteins that adopt different folds or functions.
Step IV: Use an INDEL fraction of up to 1% for sequences sharing below 19% sequence identity, in pairwise alignment with the query. In the treatment of gaps/INDELs, the CSI-BLAST pairwise alignment INDEL fraction may be required to be up to 1% for sequences with % identity below 19%. The rationale is that for low-homology sequences sharing such a small sequence identity to the query, the risk of inserting false positives in the MSA is too high, but a small INDEL fraction indicates that these are likely to be true hits.
Step V: Set the sequence coverage threshold for hits relative to the target protein in the alignment to 50%. It is likely that all the sequences that passed the criteria set forth in Steps II, III and IV will exhibit a coverage of more than 50%; however, if the coverage threshold is set to 60%, as typically practiced in the art, most of the sequences would be filtered out.
Step VI: Generate MSA for the remaining sequences as typically practiced in the art.
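By way of a non-limiting illustration, the numerical criteria of Steps II, IV and V can be sketched in Python as follows. The definitions used here (identity as the fraction of identical residues among aligned pairs, coverage as the fraction of query residues aligned to a hit residue, and INDEL fraction as the fraction of alignment columns containing a gap) and the function names are assumptions made for illustration; search tools may define these quantities differently. The annotation and phylogeny checks of Step III require expert inspection and are not sketched.

```python
def alignment_stats(query_aln, hit_aln):
    """Compute identity, coverage and INDEL fraction from the two rows of a
    pairwise alignment, where '-' denotes a gap."""
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned rows must have equal length")
    columns = list(zip(query_aln, hit_aln))
    aligned = [(q, h) for q, h in columns if q != "-" and h != "-"]
    gap_columns = sum(1 for q, h in columns if (q == "-") != (h == "-"))
    query_len = sum(1 for q in query_aln if q != "-")
    identical = sum(1 for q, h in aligned if q == h)
    return {
        "identity": identical / len(aligned) if aligned else 0.0,
        "coverage": len(aligned) / query_len if query_len else 0.0,
        "indel_fraction": gap_columns / len(columns) if columns else 0.0,
    }


def passes_permissive(stats, min_identity=0.15, strict_identity=0.19,
                      max_indel=0.01, min_coverage=0.50):
    """Apply the 15% identity floor, the 50% coverage floor, and the 1% INDEL
    limit that is imposed only on hits below 19% identity."""
    if stats["coverage"] < min_coverage or stats["identity"] < min_identity:
        return False
    if stats["identity"] < strict_identity and stats["indel_fraction"] > max_indel:
        return False
    return True
```

Hits at or above 19% identity are not subjected to the INDEL limit in this sketch, reflecting the rationale that the limit compensates for the elevated false-positive risk of very low identity hits.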
Variable loop regions:
BLAST algorithms may provide results that include sequences with different lengths. The differences typically stem from different lengths in loop regions, and loops with different lengths may reflect different biochemical context. As a result, MSA columns representing loop positions may contain aligned residues from loops with different length, thus possibly degrading the data with information from different biochemical context, possibly irrelevant to the biochemical context of the protein of interest. A BLAST hit may therefore contain relevant information at some positions while containing non-relevant information in other positions. To minimize the level of irrelevant sequence information for each loop, the secondary structure of the original protein is identified and a context specific sub-MSA file is created for each loop region, and the sub-MSA contains only loop sequences with the same length.
Secondary structure identification is done through identification of hydrogen bond patterns in the structure and this is termed “dictionary of protein secondary structure” (DSSP). There are several software packages available that offer such analysis, such as, for example, a Rosetta™ module for loop identification.
The output of the secondary structure identification procedure is typically a string (i.e., an output string) that has the same length as the template structure, wherein each character represents a residue in a secondary structure element that may be either H, E or L, denoting an amino acid forming a part of either an α-helix, a β-sheet or a loop.
According to some embodiments of the invention, the amino acid sequence of the loop regions in the structure of the original protein is processed as follows:
(a) Loops in the template structure are identified by automatic or manual inspection of a structure model, and/or by any secondary-structure analyzing algorithms.
(b) The positions representing each loop on the output string are determined including loop stems (two additional amino acids at each end of the loop). To account for the stems, two positions are added to each of the loop's ends, unless the loop is at one of the main-chain termini. According to some embodiments of the invention, it is advantageous to include the stems in the loop definition since stems anchoring different loops may potentially exhibit different conformations and form different contacts among themselves or with the loop residues, and it is advantageous that the sequence data used as input in the method presented would represent that.
For example, if the secondary structure output string is:
LLLHHHHHHHLLLLLHHHHHLLLEEEE
then the loop regions are defined at positions 1-5, 9-17 and 19-25 (bold characters).
(c) The positions that represent each loop are identified in the query sequence in the MSA. The loop positions in the MSA may be different than the loop positions in the original string from the previous step since in the MSA the query is aligned to other sequences and may therefore contain both amino acid characters and hyphens, representing gaps.
(d) After the loop positions were located on the query sequence in the MSA, a character pattern is defined for each loop. For example, a pattern may comprise “X” character to represent an amino acid and “-” (hyphen) to represent a gap.
(e) Lastly, a context specific sub-MSA file is generated for each loop excluding all sequences that do not share the same character pattern for that loop, namely context specific sub-MSA contains sequences wherein the loop has the same length, gaps included.
For example, positions 4-10 in a hypothetical original protein are recognized as a loop with the hypothetical sequence “APTESVV” including stems. The loop is identified on the query protein in the MSA file and the pattern is found to be “A--PTESVV”. The context specific sub-MSA file that will be generated for this loop will contain only those sequences in the MSA file that match the pattern “X--XXXXXX”.
Thus, according to some embodiments of the present invention, for loop regions, the sequence alignment comprises amino acid sequences having sequence length equal to a corresponding loop in the original polypeptide chain. Accordingly, sequence alignments, which are relevant in the context of loop regions, are referred to herein as “context specific sub-MSA”.
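By way of a non-limiting illustration, the loop processing of steps (a) to (e) can be sketched in Python as follows. The function names and the toy alignment are assumptions made for illustration; the secondary structure string and the “APTESVV” loop are the hypothetical examples given above.

```python
import re


def loop_regions(ss_string, stem=2):
    """Return 1-based inclusive (start, end) loop ranges, each extended by
    `stem` positions per side unless the loop touches a chain terminus."""
    n = len(ss_string)
    regions = []
    for match in re.finditer(r"L+", ss_string):
        start, end = match.start() + 1, match.end()
        if start > 1:
            start = max(1, start - stem)
        if end < n:
            end = min(n, end + stem)
        regions.append((start, end))
    return regions


def gap_pattern(segment):
    """Replace every residue by 'X' and keep '-' for gaps."""
    return "".join("-" if c == "-" else "X" for c in segment)


def msa_slice_for(query_row, start, end):
    """Map 1-based ungapped query positions to a 0-based MSA column slice."""
    residue, first, last = 0, None, None
    for col, char in enumerate(query_row):
        if char == "-":
            continue
        residue += 1
        if residue == start:
            first = col
        if residue == end:
            last = col
    if first is None or last is None:
        raise ValueError("loop positions fall outside the query sequence")
    return first, last + 1


def sub_msa(msa, query_id, start, end):
    """Keep only sequences whose gap pattern over the loop columns matches
    that of the query (context specific sub-MSA)."""
    first, stop = msa_slice_for(msa[query_id], start, end)
    target = gap_pattern(msa[query_id][first:stop])
    return {name: row for name, row in msa.items()
            if gap_pattern(row[first:stop]) == target}


regions = loop_regions("LLLHHHHHHHLLLLLHHHHHLLLEEEE")  # [(1, 5), (9, 17), (19, 25)]
toy_msa = {
    "query": "MKLA--PTESVVGG",
    "hit1":  "MKLS--PSDSIVGG",   # same loop length: retained
    "hit2":  "MKLAGTPTESVVGG",   # longer loop: excluded
}
loop_msa = sub_msa(toy_msa, "query", 4, 10)
```

In the toy alignment, the loop columns of the query read “A--PTESVV”, so only sequences exhibiting the pattern “X--XXXXXX” over those columns are retained.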
Rules for substitutions:
The method calls for identification of substitutable residues. The selection of substitutable residues may rely on expert-guided decision on positions to mutate. These positions are typically positions in the active site of an enzyme that are not crucial for the core catalytic activity but are in proximity (first shell) of the substrate or in proximity to first shell positions (second shell) etc.
In some embodiments of the present invention, a set of restraints, constraints and weights are used as rules that govern some of the computational procedures. In the context of some embodiments of the present invention, these rules are applied in the method presented herein to determine which of the positions in the original polypeptide chain will be allowed to permute (be substituted), and to which amino acid alternative. These rules may also be used to preserve, at least to some extent, some positions in the sequence of the original polypeptide chain.
One of the rules employed in amino acid sequence alterations stems from highly conserved sequence patterns at specific positions, which are typically exhibited in families of structurally similar proteins. According to some embodiments of the present invention, the rules by which a substitution of amino acids is dictated during a sequence design procedure include position-specific scoring matrix values, or PSSMs.
A “position-specific scoring matrix” (PSSM), also known in the art as position weight matrix (PWM), or a position-specific weight matrix (PSWM), is a commonly used representation of recurring patterns in biological sequences, based on the frequency of appearance of a character (monomer; amino acid; nucleic acid etc.) in a given position along the sequence. Thus, a PSSM represents the log-likelihood of observing mutations to any of the 20 amino acids at each position. PSSMs are often derived from a set of aligned sequences that are thought to be structurally and functionally related and have become widely used in many software tools for computational motif discovery. In the context of amino acid sequences, a PSSM is a type of scoring matrix used in protein BLAST searches in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment. Thus, a Tyr-Trp substitution at position A of an alignment may receive a very different score than the same substitution at position B, subject to different levels of amino acid conservation at the two positions. This is in contrast to position-independent matrices such as the PAM and BLOSUM matrices, in which the Tyr-Trp substitution receives the same score no matter at what position it occurs. PSSM scores are generally shown as positive or negative integers. Positive scores indicate that the given amino acid substitution occurs more frequently in the alignment than expected by chance, while negative scores indicate that the substitution occurs less frequently than expected. Large positive scores often indicate critical functional residues, which may be active site residues or residues required for other intermolecular or intramolecular interactions. PSSMs can be created using Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST) [Schäffer, A. A. et al., Nucl. Acids Res., 2001, 29(14), pp. 2994-3005], which finds similar protein sequences to a query sequence, and then constructs a PSSM from the resulting alignment. Alternatively, PSSMs can be retrieved from the National Center for Biotechnology Information Conserved Domains Database (NCBI CDD), since each conserved domain is represented by a PSSM that encodes the observed substitutions in the seed alignments. These CD records can be found either by text searching in Entrez Conserved Domains or by using Reverse Position-Specific BLAST (RPS-BLAST), also known as CD-Search, to locate these domains on an input protein sequence.
In the context of some embodiments of the present invention, a PSSM data file can be in the form of a table of integers, each indicating how evolutionarily conserved any one of the 20 amino acids is at any possible position in the sequence of the designed protein. As indicated hereinabove, a positive integer indicates that an amino acid is more probable in the given position than it would have been in a random position in a random protein, and a negative integer indicates that an amino acid is less probable at the given position than it would have been in a random protein. In general, the PSSM scores are determined according to a combination of the information in the input MSA and general information about amino acid substitutions in nature, as introduced, for example, by the BLOSUM62 matrix [Eddy, S. R., Nat Biotechnol, 2004, 22(8), pp. 1035-6].
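By way of a non-limiting illustration, the log-odds character of PSSM scores can be sketched in Python as follows. This is a deliberately simplified stand-in and is not the PSI-BLAST procedure: it assumes a uniform background frequency of 1/20, a flat pseudocount and no sequence weighting, whereas PSI-BLAST derives its pseudocounts from substitution data such as BLOSUM62. The function name and the toy alignment are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simple_pssm(msa_rows, pseudocount=0.1):
    """For each alignment column, score each amino acid as the rounded log2
    ratio of its (pseudocount-smoothed) frequency to a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*msa_rows):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            row[aa] = round(math.log2(frequency / background))
        pssm.append(row)
    return pssm


# Toy alignment: column 1 is fully conserved (A); column 2 is mixed (C, C, D, gap).
toy_pssm = simple_pssm(["AC", "AC", "AD", "A-"])
```

In the toy alignment, the conserved alanine in the first column receives a positive score, while amino acids never observed in that column receive negative scores.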
In general, the method presented herein can use the PSSM output of a PSI-BLAST software package to derive a PSSM for both the original MSA and all sub-MSA files. A final PSSM input file, according to some embodiments of the present invention, includes the relevant lines from each PSSM file. For sequence positions that represent a secondary structure, relevant lines are copied from the PSSM derived from the original full MSA. For each loop, relevant lines are copied from the PSSM derived from the sub-MSA file representing that loop. Thus, according to some embodiments of the present invention, a final PSSM input file is a quantitative representation of the sequence data, which is incorporated in the structural calculations, as discussed hereinbelow.
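By way of a non-limiting illustration, the assembly of a final PSSM from the full-MSA PSSM and the loop-specific PSSMs can be sketched in Python as follows. The data layout (one row per sequence position, and loop PSSM rows already mapped to the positions of the original polypeptide chain) and the function name are assumptions made for illustration.

```python
def assemble_final_pssm(full_pssm, loop_pssms):
    """Start from the rows of the full-MSA PSSM and overwrite the rows of each
    loop with the rows derived from that loop's context specific sub-MSA.

    full_pssm  : list of rows, one per sequence position (position 1 first).
    loop_pssms : dict mapping a 1-based inclusive (start, end) loop range to
                 the list of rows for exactly those positions.
    """
    final = list(full_pssm)
    for (start, end), rows in loop_pssms.items():
        if len(rows) != end - start + 1:
            raise ValueError("loop PSSM must have one row per loop position")
        final[start - 1:end] = rows
    return final


# Toy example with labelled placeholder rows instead of score vectors.
full = ["full1", "full2", "full3", "full4", "full5", "full6"]
final = assemble_final_pssm(full, {(3, 5): ["loop3", "loop4", "loop5"]})
```

Positions belonging to secondary structure elements keep the rows of the full-MSA PSSM, and only the loop positions are replaced.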
According to some embodiments of the present invention, MSA and PSSM-based rules determine the unsubstitutable positions and the substitutable positions in the amino acid sequence of the original polypeptide chain, and further determine which of the amino acid alternatives will serve as candidate alternatives in the single position scanning step of the method, as discussed hereinbelow.
Key residues:
The method, according to some embodiments of the present invention, allows the incorporation of information about the original polypeptide chain and/or the wild type protein. This information, which can be provided by various sources, is incorporated into the method as part of the rules by which amino acid substitutions are governed during the design procedure. Albeit optional, the addition of such information is advantageous as it reduces the probability of the method providing results which include folding- and/or function-abrogating substitutions. In the examples presented in the Example section below, valuable information about activity has been employed successfully as part of the rules.
The term “key residues” refers to positions in the designed sequence that are defined in the rules as fixed (invariable), at least to some extent. Sequence positions occupied by key residues optionally constitute a part of the unsubstitutable positions.
Information pertaining to key residues can be extracted, for example, from the structure of the original polypeptide chain (or the template structure), or from other highly similar structures when available. Exemplary criteria that can assist in identifying key residues, and support reasoning for fixing an amino-acid type or identity at any given position, include:
In the previously provided protein stabilization design method, PROSS, when used to provide stabilized enzyme variants, the key residues are selected within a radius of about 5-8 Å around the substrate binding site, as may be inferred from complex crystal structures comprising a substrate, a substrate analog, an inhibitor and the like. Similarly, when using PROSS to provide stabilized metal binding proteins, key residues are selected within about 5-8 Å around a metal atom. Other key residues may be designated in a protein interface that involves the chain of interest in an oligomer, as interacting chains are oftentimes involved in dimerization interfaces, binding ligands or protein-substrate interactions. Likewise, key residues may be designated within a certain distance from DNA/RNA chains interacting with the protein of interest, within a certain distance from an epitope region, and the like.
It is noted that the shape and size of the space within which key residues are selected is not limited to a sphere of a radius of 5-8 Å; the space can be of any size and shape that corresponds to the sequence, function and structure of the original protein. It is further noted that specific key residues may be provided by any external source of information (e.g., a researcher).
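By way of a non-limiting illustration, a spherical distance-based selection of candidate key residues can be sketched in Python as follows. The input layout (atom coordinates grouped by residue number), the function name and the coordinates are assumptions made for illustration; parsing of a structure file is not shown.

```python
import math


def residues_within(residue_atoms, ligand_atoms, radius=6.0):
    """Return the sorted residue identifiers having at least one atom within
    `radius` Angstrom of any ligand (substrate or metal) atom.

    residue_atoms : dict mapping a residue identifier to a list of (x, y, z).
    ligand_atoms  : list of (x, y, z) coordinates.
    """
    selected = []
    for residue_id, atoms in residue_atoms.items():
        if any(math.dist(atom, ligand) <= radius
               for atom in atoms for ligand in ligand_atoms):
            selected.append(residue_id)
    return sorted(selected)


# Toy coordinates (hypothetical): one metal atom and three residues.
metal = [(0.0, 0.0, 1.0)]
residues = {
    55: [(0.0, 0.0, 0.0)],
    101: [(10.0, 0.0, 0.0)],
    233: [(3.0, 4.0, 0.0), (20.0, 0.0, 0.0)],
}
near_metal = residues_within(residues, metal, radius=6.0)
```

A residue qualifies as soon as any one of its atoms lies within the chosen radius, so long side chains that reach towards the ligand are captured even when their backbone is more distant.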
In the context of the present invention, key residues are selected sparingly (≤10 positions, and more typically 0-3 positions), even and particularly in and around regions of the activity the method is attempting to diversify or improve. This strategy allows the activity-determining regions to diversify while the stability of the protein is not sacrificed.
When the template structure, the PSSM file (which is based on the full MSA and any optional context specific sub-MSA), and the identification of key residues, unsubstitutable positions and the substitutable positions are provided, the method presented herein can use these data to provide the modified polypeptide chain starting from the original polypeptide chain.
Main method steps:
The objective of the method provided herein (FuncLib/AbLIFT) is to design a small set of stable, efficient, and functionally diverse multipoint active-site mutants suitable for low-throughput experimental testing. The design strategy is general and can be applied, in principle, to any natural enzyme or designed protein, using its molecular structure and a diverse set of homologous sequences.
According to some embodiments of the present invention, the method presented herein includes a step that determines which of the positions in the amino-acid sequence of the original polypeptide chain will be subjected to amino-acid substitution (referred to herein as substitutable positions) and which amino acid alternatives will be assessed, and in which positions in the amino acid sequence of the original polypeptide chain the amino-acid will not be subjected to amino-acid substitution (referred to herein as unsubstitutable positions).
In a following step (single position scanning step), a position-specific stability score is given to each of the allowed amino acid alternatives at each substitutable position. In the enzyme repertoire cases, the active-site residues to be designed were defined by visual examination of the enzyme molecular structures. Evolutionary conservation scores were computed from PSSMs, and ΔΔG values were computed essentially as described previously [Goldenzweig, A. et al. Mol Cell., 2016, 63(2), pp. 337-346]. Tolerated amino acid identities at the active site of PTE were filtered according to the following thresholds: PSSM≥−2 and ΔΔG≤+6 r.e.u.
It is noted that the detailed description of the method presented herein uses some terms, units and procedures which are common or unique to the Rosetta™ software package; however, it is to be understood that the method is capable of being implemented using other software modules and packages, and other terms, units and procedures are therefore contemplated within the scope of the present invention.
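By way of a non-limiting illustration, the filtering of tolerated amino acid identities by the PSSM and ΔΔG thresholds can be sketched in Python as follows. The data layout, the function name, the position number and the scores are hypothetical values chosen for illustration and are not results of the method.

```python
def tolerated_identities(scan, pssm_min=-2, ddg_max=6.0):
    """Keep, per position, the amino acids with PSSM >= pssm_min and
    ddG <= ddg_max (in r.e.u.); positions with no survivor are dropped.

    scan : dict mapping position -> dict mapping amino acid -> (pssm, ddg).
    """
    space = {}
    for position, options in scan.items():
        kept = sorted(
            aa for aa, (pssm, ddg) in options.items()
            if pssm >= pssm_min and ddg <= ddg_max
        )
        if kept:
            space[position] = kept
    return space


# Hypothetical single-position scanning results for two positions.
scan_results = {
    106: {"H": (3, 0.0), "L": (-1, 2.5), "W": (-4, 1.0), "D": (0, 7.2)},
    132: {"P": (-5, 9.0)},
}
tolerated = tolerated_identities(scan_results)
```

An amino acid must satisfy both thresholds to be retained; in the hypothetical data, tryptophan fails the PSSM threshold and aspartate fails the ΔΔG threshold.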
It is also noted that the detailed description of the method presented herein is using the proteins and variables presented in the Examples section, which are not to be seen as limiting in any way, as the method is applicable for any protein and polypeptide chain sequence for which the required data is available.
According to some embodiments of the present invention, the following step of the method is an exhaustive enumeration of all possible combinations of at least 3 and as many as 5, 6, 7, 8, 9, 10 or more mutations in the original polypeptide chain (e.g., of PTE). Each mutant was modeled in Rosetta, including combinatorial sidechain packing, and the backbone and sidechains of all residues were minimized energetically, subject to harmonic restraints on the Cα coordinates of the entire protein (being composed of one polypeptide chain or more). All designed polypeptide chains (designed proteins, or “designs” for short) were ranked according to all-atom energy, and the top-ranked designs were chosen for experimental analysis after removing designs with fewer than two mutations relative to one another.
As stated hereinabove, one of the main differences between PROSS and the method provided herein is the combinatorial design step in PROSS that is being replaced by a comprehensive enumeration step in the instant method. In the exemplary study presented here, small-scale testing of the method provided herein (FuncLib/AbLIFT) proved sufficient to identify variants that exhibited orders-of-magnitude changes in enzyme activity profiles without loss in apparent protein stability. The method can therefore be used to rapidly optimize specific activities or generate functional repertoires from enzymes that are not amenable to high-throughput screening. Whereas conventional active-site design strategies rely on transition-state modeling, the method provided herein computes diverse and stable networks of interacting active-site mutations, enabling design even in the cases discussed here, for which enzyme transition-state models are uncertain. Although the designed mutations conserve the wild type backbone structure, some designs exhibit sign-epistatic relationships, which render these designs all but inaccessible to stepwise mutational trajectories. Thus, the sequence space of an enzyme active site provides a vast resource of functional diversity that defies exploration by natural and laboratory evolution but can now be accessed through computational protein design.
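By way of a non-limiting illustration, the exhaustive enumeration of multipoint mutants and the removal of designs that differ from an already selected design by fewer than two mutations can be sketched in Python as follows. The representation of a design as a mapping from position to mutant amino acid, the function names and the toy sequence space are assumptions made for illustration; modeling and energy ranking of each mutant, which the method performs in Rosetta, is assumed to have produced the ranked input list.

```python
from itertools import combinations, product


def enumerate_designs(space, min_mut=3, max_mut=5):
    """Yield every combination of min_mut..max_mut mutations.

    space : dict mapping position -> list of alternative amino acids
            (the wild-type identity is not listed).
    Each design is a dict {position: mutant amino acid}; positions that are
    absent from the dict retain the wild-type identity.
    """
    positions = sorted(space)
    for k in range(min_mut, max_mut + 1):
        for subset in combinations(positions, k):
            for choice in product(*(space[p] for p in subset)):
                yield dict(zip(subset, choice))


def n_differences(design_a, design_b):
    """Number of positions at which two designs differ from one another."""
    return sum(1 for p in set(design_a) | set(design_b)
               if design_a.get(p) != design_b.get(p))


def pick_diverse(ranked_designs, min_diff=2, top_n=10):
    """Walk the energy-ranked list (best first) and keep a design only if it
    differs by at least min_diff mutations from every design already kept."""
    chosen = []
    for design in ranked_designs:
        if all(n_differences(design, kept) >= min_diff for kept in chosen):
            chosen.append(design)
        if len(chosen) == top_n:
            break
    return chosen
```

Because the enumeration grows combinatorially with the number of positions and alternatives, the generator form avoids holding all designs in memory at once.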
According to some embodiments of the present invention, the method is implemented effectively for original polypeptide chains that comprise more than 100 amino acids (aa). In some embodiments, the original polypeptide chains comprise more than 110 aa, more than 120 aa, more than 130 aa, more than 140 aa, more than 150 aa, more than 160 aa, more than 170 aa, more than 180 aa, more than 190 aa, more than 200 aa, more than 210 aa, more than 220 aa, more than 230 aa, more than 240 aa, more than 250 aa, more than 260 aa, more than 270 aa, more than 280 aa, more than 290 aa, more than 300 aa, more than 350 aa, more than 400 aa, more than 450 aa, more than 500 aa, more than 550 aa, or more than 600 amino acids.
According to some embodiments of the present invention, the method presented herein provides modified polypeptide chains having more than 2 amino acid substitutions (mutations), more than 3 substitutions, more than 4 substitutions, more than 5 amino acid substitutions, more than 6 substitutions, more than 7 substitutions, more than 8 substitutions, more than 9 substitutions, more than 10 substitutions, more than 11 substitutions, or more than 12 substitutions compared to the starting original polypeptide chain.
Sequence space:
According to some embodiments of the present invention, after filtering key residues and imposing a free energy acceptance threshold, the number of substitutable positions in a given sequence is greatly reduced, thereby providing a wide yet manageable combinatorial sequence space from which designed sequences can be selected. Thus, the term “sequence space” refers to a set of substitutable positions, each having at least one optional substitution over the original/WT amino acid at the given position.
A sequence space is therefore a result of a certain acceptance threshold; each acceptance threshold produces a different sequence space, where sequence spaces defined by stricter acceptance thresholds are contained within larger sequence spaces defined by more permissive acceptance thresholds. As discussed hereinabove, in order to avoid false positives the acceptance threshold can be small and should be negative, wherein −2 r.e.u is considered to be highly restrictive (strict) and +6 r.e.u is highly permissive. The sequence space obtained by using an acceptance threshold of +6 r.e.u (permissive) will inevitably be larger than a sequence space obtained by using an acceptance threshold of −2.00 r.e.u (strict). Experimental use of the method presented herein to produce actual proteins has shown that an intermediate acceptance threshold produces an optimal sequence space. In fact, the sequence space is a sub-space of the broader space defined by the PSSM rules.
An exemplary and general means to present a sequence space is in a list of sequence positions based on the wild-type sequence numbering, P1, P2, P3, . . . , Pn, wherein each position is either designated as a key residue, namely an amino acid as found in the WT, AAWT; or a position that can take any one amino acid from a limited list comprising at least one alternative amino acid based on the PSSM and energy minimization analysis, AAm, wherein m is a number denoting one of the naturally occurring amino acids, e.g., A=1, R=2, N=3, D=4, C=5, Q=6, E=7, G=8, H=9, L=10, I=11, K=12, M=13, F=14, P=15, S=16, T=17, W=18, Y=19 and V=20 (aa numbering is arbitrary and used herein to demonstrate a general representation of a sequence space).
For example, the sequence space can be presented as:
P1: AAWT, AA5, AA8, and AA12;
P2: AAWT;
P3: AAWT and AA16;
P4: AAWT, AA1, AA3, AA6, AA10, and AA14;
P5: AAWT, AA4, AA8, and AA11;
. . .
Pn: AAWT, AAm, AAm, AAm, AAm, and AAm;
whereas in this general example, P1 can take the wild-type amino acid or one of three alternative amino acids, P2 is a key residue, and so forth.
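The general representation above can be sketched in code as a mapping from each position to its allowed identities, from which all variants carrying at least two substitutions are enumerated. The following is a minimal illustration only; the dictionary contents simply mirror the general example, and the names `sequence_space` and `enumerate_variants` are hypothetical and not part of the method as implemented.

```python
from itertools import product

# Hypothetical sequence space mirroring the general example above:
# each position maps to its allowed identities, wild type (AAWT) first.
sequence_space = {
    "P1": ["AAWT", "AA5", "AA8", "AA12"],
    "P2": ["AAWT"],  # key residue: only the wild-type identity is allowed
    "P3": ["AAWT", "AA16"],
    "P4": ["AAWT", "AA1", "AA3", "AA6", "AA10", "AA14"],
    "P5": ["AAWT", "AA4", "AA8", "AA11"],
}


def enumerate_variants(space, min_substitutions=2):
    """Yield every combination with at least `min_substitutions` non-WT positions."""
    positions = list(space)
    for choice in product(*(space[p] for p in positions)):
        substitutions = {p: aa for p, aa in zip(positions, choice) if aa != "AAWT"}
        if len(substitutions) >= min_substitutions:
            yield substitutions


# Total number of combinations, including the wild type and single mutants.
total_combinations = 1
for options in sequence_space.values():
    total_combinations *= len(options)

variants = list(enumerate_variants(sequence_space))
```

In this toy space the full product contains the wild type, the single-point mutants and the multipoint variants; only the latter are returned by `enumerate_variants`.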
According to some embodiments of the present invention, the sequence space can be further limited by imposing a stricter acceptance threshold, or expanded by imposing a more permissive acceptance threshold. In general, the value of +2 r.e.u has been found to be adequately permissive; however, sequence spaces based on an acceptance threshold larger than +2 r.e.u (e.g., +6 r.e.u) or based on an acceptance threshold smaller than −2.00 r.e.u (e.g., −2.1 r.e.u) are also contemplated.
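The containment of stricter sequence spaces within more permissive ones can be illustrated with a short sketch. The energy values below are invented solely for illustration, and the acceptance rule (a mutation is accepted when its computed energy change does not exceed the threshold) is an assumption about how the threshold is applied; the function names are hypothetical.

```python
# Hypothetical per-mutation energy changes (r.e.u.) relative to the wild type.
# These numbers are invented for illustration and are not computed values.
energy_change = {
    ("P1", "AA5"): -2.5,
    ("P1", "AA8"): 0.7,
    ("P3", "AA16"): 1.9,
    ("P4", "AA1"): 4.2,
    ("P4", "AA3"): 7.5,
}


def build_sequence_space(energies, threshold):
    """Keep mutations whose energy change is at or below the threshold (assumed rule)."""
    space = {}
    for (position, amino_acid), value in energies.items():
        if value <= threshold:
            space.setdefault(position, set()).add(amino_acid)
    return space


def is_contained(inner, outer):
    """True if every allowed mutation in `inner` is also allowed in `outer`."""
    return all(aas <= outer.get(pos, set()) for pos, aas in inner.items())


strict = build_sequence_space(energy_change, -2.0)
intermediate = build_sequence_space(energy_change, 2.0)
permissive = build_sequence_space(energy_change, 6.0)
```

Under this rule, raising the threshold can only add mutations, so `strict` is contained in `intermediate`, which is in turn contained in `permissive`.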
In the Examples section that follows below, a sequence space based on an acceptance threshold of +6 r.e.u is presented for some of the exemplary proteins on which the method has been demonstrated. Any designed sequence having any choice of any 2 or more substitutions relative to the wild-type/starting sequence that are selected from the presented sequence space, and that exhibits at least one improved catalytic activity, is contemplated within the scope of the present invention.
It is noted herein that embodiments of the present invention encompass any and all the possible combinations of amino acid alternatives in any given sequence space afforded by the method presented herein (all possible variants stemming from the sequence space as defined herein).
It is further noted that in some embodiments of the present invention, the sequence space resulting from implementation of the method presented herein on an original protein, can be applied on another protein that is different than the original protein, as long as the other protein exhibits at least 30%, at least 40%, or at least 50% sequence identity and higher. For example, a set of amino acid alternatives, taken from a sequence space afforded by implementing the method presented herein on a human protein, can be used to modify a non-human protein by producing a variant of the non-human protein having amino acid substitutions at the sequence-equivalent positions. The resulting variant of the non-human protein, referred to herein as a “hybrid variant”, would then have “human amino acid substitutions” (selected from a sequence space afforded for a human protein) at positions that align with the corresponding position in the human protein. In some embodiments of the present invention, any such hybrid variant, having at least 2 substitutions that match amino acid alternatives in any given sequence space afforded by the method presented herein (all possible variants stemming from the sequence space as defined herein), is contemplated and encompassed in the scope of the present invention.
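The transfer of substitutions to sequence-equivalent positions of another protein can be sketched as follows, assuming a pairwise alignment of the two sequences is already available as two gapped strings of equal length. The sequences, positions and function names are hypothetical; the sketch does not compute the alignment or verify the sequence-identity requirement.

```python
def map_substitutions(aligned_ref, aligned_target, substitutions):
    """Map {ref_position: new_aa} (1-based) onto target positions via the alignment.

    Substitutions at reference positions aligned to a gap in the target are dropped.
    """
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    mapped = {}
    ref_pos = target_pos = 0
    for ref_char, target_char in zip(aligned_ref, aligned_target):
        if ref_char != "-":
            ref_pos += 1
        if target_char != "-":
            target_pos += 1
        if ref_char != "-" and target_char != "-" and ref_pos in substitutions:
            mapped[target_pos] = substitutions[ref_pos]
    return mapped


def apply_substitutions(sequence, substitutions):
    """Return a copy of `sequence` with the 1-based substitutions applied."""
    residues = list(sequence)
    for position, amino_acid in substitutions.items():
        residues[position - 1] = amino_acid
    return "".join(residues)


# Toy alignment (invented sequences): '-' marks a gap.
aligned_reference = "MK-TAYIAK"
aligned_other = "MKQTA-IGK"
reference_substitutions = {3: "S", 5: "F", 7: "L"}  # positions in the reference

mapped = map_substitutions(aligned_reference, aligned_other, reference_substitutions)
hybrid_variant = apply_substitutions(aligned_other.replace("-", ""), mapped)
```

Here the substitution at reference position 5 has no equivalent position in the other protein and is therefore omitted from the hybrid variant.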
FuncLib web-server:
A FuncLib web-server was constructed to implement several improvements of the method presented herein. In designing the exemplary enzyme PTE variants, as presented herein, a multiple-sequence alignment (MSA) was computed for the entire protein sequence, and wherever loops were observed in the query structure, any aligned sequence that exhibited gaps relative to the query was eliminated to reduce alignment ambiguity (see [Goldenzweig, A. et al., Mol Cell., 2016, 63(2), pp. 337-346]). In the FuncLib web-server, by contrast, all secondary-structure elements are subjected to this filtering, resulting in improved PSSM accuracy, particularly in the active-site pocket. Furthermore, the web-server implements more accurate atomistic modeling and scoring: it uses the recent Rosetta energy function [Park, H. et al., J Chem Theory Comput., 2016, 12(12), pp. 6201-6212] with improved electrostatics and solvation potentials relative to previous Rosetta energy functions; implements harmonic coordinate restraints on sidechain atoms of essential amino acid residues in the catalytic pocket to guarantee their preorganization; restricts refinement to amino acids within 8 Å (or within the range of 6-10 Å) of designed positions instead of refining the entire protein; allows the user to modify the tolerated sequence space (for instance, based on prior experimental and structural analysis); and enables modeling of small-molecule ligands or transition-state complexes.
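The restriction of refinement to residues near the designed positions can be illustrated by a simple distance filter. This is a sketch only: it measures distances between one representative point per residue (for instance a Cα atom, which is an assumption and may differ from the criterion used by the web-server), and the coordinates and function name are hypothetical.

```python
import math


def residues_within(coordinates, designed_positions, cutoff=8.0):
    """Return residues whose representative point lies within `cutoff` Å
    of any designed position (designed positions are included)."""
    shell = set()
    for residue, xyz in coordinates.items():
        if any(math.dist(xyz, coordinates[d]) <= cutoff for d in designed_positions):
            shell.add(residue)
    return shell


# Invented coordinates (Å), one representative point per residue.
toy_coordinates = {
    1: (0.0, 0.0, 0.0),
    2: (3.8, 0.0, 0.0),
    3: (7.6, 0.0, 0.0),
    4: (11.4, 0.0, 0.0),
    5: (30.0, 0.0, 0.0),
}

refinement_shell = residues_within(toy_coordinates, designed_positions=[2], cutoff=8.0)
```

Only residues in `refinement_shell` would be subjected to refinement in such a scheme, while the remainder of the protein is held fixed.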
Diverse phosphotriesterase repertoire:
Natural and laboratory evolution of altered activities depend on the stepwise accumulation of mutations, each of which must be at least neutral in fitness. Following a few mutations, however, improvements in activity often plateau due to epistasis or stability-threshold effects. Typical evolutionary trajectories leading from one highly efficient enzyme to another are therefore time-consuming and often comprise dozens of enabling mutations outside the active site, most of which only contribute to the activity indirectly, for instance by stabilizing the enzyme. The strategy presented herein rationalizes and accelerates the generation of stable enzymes exhibiting altered activities: it starts by designing stable and highly expressed enzyme variants, using a method provided previously (PROSS), and then designs dozens of variants that encode preorganized networks of active-site mutants exhibiting different stereochemical features. The combination of evolutionary-conservation analysis and Rosetta atomistic modeling focuses design calculations on stable, preorganized, and functional active-site constellations.
Accordingly, the present inventors have implemented the FuncLib procedure in order to enumerate PTE variants with enhanced catalytic activities towards substrates against which WT PTE is less effective, since such PTE variants could serve as detoxification agents against various organophosphates/nerve agents, as well as to increase PTE's catalytic activity towards known PTE substrates, such as VX-type nerve agents. The starting sequence was the PROSS-stabilized sequence dPTE2 (SEQ ID NO: 1) [WO 2017/017673; Goldenzweig, A. et al., Mol Cell., 2016, 63(2), pp. 337-346], which is a variant of PTE that contains 20 mutations outside the active-site pocket and stems from PTE-S5 [Roodveldt, C. and Tawfik, D.S., Protein Eng Des Sel., 2005, 18(1), pp. 51-8]. Using this sequence and the crystal structure of WT PTE (PDB Entry: 1HZY), the designed variants obtained by the method presented herein exhibited broad-spectrum activity, with thousands-fold higher activity relative to WT PTE.
Thus, according to one aspect of the invention there is provided a protein having a sequence selected from the group consisting of any combination of at least 2 amino acid substitutions of a sequence space afforded for phosphotriesterase (PTE) from Pseudomonas diminuta as an original protein, and listed in Table A below, whereas wild-type positions I106, F132, H254, H257, L271, L303, F306 and M317 are not shown therein.
The protein, according to some embodiments of the present invention, can be selected from the list presented in Table A set forth herein. In some embodiments the protein has a sequence selected from the group consisting of PTE_28 (SEQ ID NO: 28), PTE_29 (SEQ ID NO: 29), PTE_56 (SEQ ID NO: 56), and PTE_57 (SEQ ID NO: 57).
According to some embodiments, the protein can be an isolated protein, a fusion to another domain, such as Fc, or a mixture of proteins and other agents, factors, carriers and the like, as long as it includes at least one of the PTE designed proteins, as defined in Table A.
The original protein can be any enzyme of the PTE family having the EC No. 3.1.8.1 (EC: 3.1.8.1), including wild-type PTE from Pseudomonas diminuta or any other biological source, or any designed or artificial PTE, including PTE variants obtained by using a computational method, such as, but not limited to, PROSS. In order to identify the amino acid residues for substitution of any original protein, the sequence of the original protein is aligned with the sequence of phosphotriesterase (PTE) from Pseudomonas diminuta as presented in PDB entry: 1HZY. As used herein, the term “phosphotriesterase”, abbreviated herein to PTE and also referred to as parathion hydrolase (EC: 3.1.8.1), refers to an enzyme belonging to the amidohydrolase superfamily. The phosphotriesterases of this aspect of the present invention are bacterial phosphotriesterases that have an enhanced catalytic activity towards V-type organophosphonates due to an extended loop 7 amino acid sequence, as compared to other phosphotriesterases. Such phosphotriesterases have been identified in Brevundimonas diminuta, Flavobacterium sp. (PTEflavob) and Agrobacterium sp.
As used herein, a “nerve agent” refers to an organophosphate (OP) compound having an acetylcholinesterase inhibitory activity. The toxicity of an OP compound depends on the rate of its inhibition of acetylcholinesterase with the concomitant release of the leaving group, such as a fluoride, alkylthiolate, cyanide or aryloxy group. The nerve agent may be a racemic composition or a purified enantiomer (e.g., Sp or Rp). In the context of embodiments of the present invention, the terms “organophosphate” or “nerve agent” encompass V-type (Amiton) nerve agents, G-type (Trilon) nerve agents and GV-type (Novichok) nerve agents. In the context of embodiments of the present invention, the term “nerve agent” includes, without limitation, G-type agents such as Tabun (GA), Sarin (GB), Chlorosarin (GC), Soman (GD), Ethylsarin (GE), and Cyclosarin (GF), V-type agents such as EA-3148, VE, VG, VM, VP, VR, VS, R/S-VX, CVX and RVX, and GV-type agents such as Novichok agents and GV (2-[dimethylamino(fluoro)phosphoryl]-N,N-dimethylethanamine).
A method of organophosphate detoxification:
According to an aspect of the present invention, the designed proteins, or PTE variants provided herein, can be used for decontamination of equipment, clothes and the environment by hydrolyzing a broad spectrum of organophosphate agents, including G-type, V-type, and GV-type nerve agents, thereby detoxifying an object or an area which is suspected of being contaminated with such agents. The area can be an inanimate object, a ground, a piece of equipment, a piece of clothing or a bodily surface.
In some embodiments, the designed proteins, or PTE variants provided herein, can be administered in vivo to a subject being suspected of nerve agent poisoning. In such uses, the protein is administered as a pharmaceutical composition, and may include a pharmaceutically accepted carrier as well as other active ingredients and excipients.
It is expected that during the life of a patent maturing from this application many relevant designed PTE variants with broad specificity hydrolysis of organophosphates will be developed and the scope of the phrase “designed PTE variants” is intended to include all such new technologies a priori.
As used herein the term “about” refers to ±10%.
The terms “comprises”, “comprising”, “includes”, “including”, “having” and their conjugates mean “including but not limited to”.
The term “consisting of” means “including and limited to”.
As used herein, the phrases “substantially devoid of” and/or “essentially devoid of” in the context of a certain substance, refer to a composition that is totally devoid of this substance or includes less than about 5, 1, 0.5 or 0.1 percent of the substance by total weight or volume of the composition. Alternatively, the phrases “substantially devoid of” and/or “essentially devoid of” in the context of a process, a method, a property or a characteristic, refer to a process, a composition, a structure or an article that is totally devoid of a certain process/method step, or a certain property or a certain characteristic, or a process/method wherein the certain process/method step is effected at less than about 5, 1, 0.5 or 0.1 percent compared to a given standard process/method, or property or a characteristic characterized by less than about 5, 1, 0.5 or 0.1 percent of the property or characteristic, compared to a given standard.
As used herein, the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicate number and a second indicate number and “ranging/ranges from” a first indicate number “to” a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
As used herein the term “method” refers to manners, means, techniques and procedures for accomplishing a given task including, but not limited to, those manners, means, techniques and procedures either known to, or readily developed from known manners, means, techniques and procedures by practitioners of the chemical, pharmacological, biological, biochemical and medical arts.
As used herein, the term “treating” includes abrogating, substantially inhibiting, slowing or reversing the progression of a condition, substantially ameliorating clinical or aesthetical symptoms of a condition or substantially preventing the appearance of clinical or aesthetical symptoms of a condition.
When reference is made to particular sequence listings, such reference is to be understood to also encompass sequences that substantially correspond to its complementary sequence as including minor sequence variations, resulting from, e.g., sequencing errors, cloning errors, or other alterations resulting in base substitution, base deletion or base addition, provided that the frequency of such variations is less than 1 in 50 nucleotides, alternatively, less than 1 in 100 nucleotides, alternatively, less than 1 in 200 nucleotides, alternatively, less than 1 in 500 nucleotides, alternatively, less than 1 in 1000 nucleotides, alternatively, less than 1 in 5,000 nucleotides, alternatively, less than 1 in 10,000 nucleotides.
It is understood that any Sequence Identification Number (SEQ ID NO) disclosed in the instant application can refer to either a DNA sequence or a RNA sequence, depending on the context where that SEQ ID NO is mentioned, even if that SEQ ID NO is expressed only in a DNA sequence format or a RNA sequence format. For example, SEQ ID NO: # is expressed in a DNA sequence format (e.g., reciting T for thymine), but it can refer to either a DNA sequence that corresponds to an # nucleic acid sequence, or the RNA sequence of an RNA molecule nucleic acid sequence. Similarly, though some sequences are expressed in a RNA sequence format (e.g., reciting U for uracil), depending on the actual type of molecule being described, it can refer to either the sequence of a RNA molecule comprising a dsRNA, or the sequence of a DNA molecule that corresponds to the RNA sequence shown. In any event, both DNA and RNA molecules having the sequences disclosed with any substitutes are envisioned.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
Various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below find experimental or calculated support in the following examples.
Reference is now made to the following examples, which together with the above descriptions illustrate some embodiments of the invention in a non-limiting fashion.
Embodiments of the present platform, also termed as FuncLib, aim at the design of a small set of stable, efficient, and functionally diverse multipoint active-site mutants suitable for low-throughput experimental testing. The design strategy is general and can be applied, in principle, to any natural enzyme using its molecular structure and a diverse set of homologous sequences (
Computational tools:
The Rosetta software suite for biomolecular design was used as the framework for the computational part of the method, and is available for download at www(dot)rosettacommons(dot)org. Specifically, the Rosetta GitHub version 627f7dd22223c3074594934b789abb4f4e2e3b10 was used for all simulations. All Rosetta modeling and design was done using RosettaScripts [Fleishman, S. J. et al., PLoS One, 2011, 6(6)], which are available with their command lines and flag files herein below. All design calculations used the Rosetta talaris14 all-atom energy function, which is dominated by van der Waals packing, hydrogen bonding, solvation, and electrostatics.
FuncLib design strategy:
The objective of the method provided herein (FuncLib) was to design a small set of stable, efficient, and functionally diverse multipoint active-site variants (mutants) suitable for low-throughput experimental testing. The design strategy, which was used, is general and can be applied to any natural enzyme or designed protein, using its molecular structure and a diverse set of homologous sequences.
As seen in
One of the reasons for selecting metalloenzyme phosphotriesterase (PTE) from Pseudomonas diminuta for the demonstration of the method presented herein is that in addition to highly efficient hydrolysis of the organophosphate pesticide paraoxon (kcat/KM approximately 10⁸ M⁻¹s⁻¹), PTE promiscuously hydrolyzes esters, lactones, and diverse organophosphates, including toxic nerve agents, such as VX, Russian VX, soman (GD), and cyclosarin (GF), albeit with kcat/KM values that are orders-of-magnitude lower than for paraoxon.
Effective organophosphate detoxification for in vivo protection, however, demands high catalytic efficiency, with a minimal kcat/KM of 10⁷ M⁻¹ min⁻¹, thereby motivating several recent enzyme-engineering efforts that targeted PTE. Furthermore, the threat from a new generation of nerve agents (“Novichoks”), similar in structure to VX and GF, reinforces the need for broad-spectrum nerve-agent hydrolases.
Since active-site mutations often impair protein stability, active-site design calculations may be started from a polypeptide chain of a stabilized design of the original polypeptide chain, namely a design provided by a method such as PROSS (see above). In the example used to demonstrate the method provided herein, the inventors employed dPTE2 (SEQ ID NO: 1), which is a variant of PTE-S5 [Roodveldt, C. and Tawfik, D. S., Protein Eng Des Sel., 2005, 18(1), pp. 51-8] with 20 stabilizing mutations outside the active-site pocket that was previously designed using the PROSS stability-design algorithm [Goldenzweig, A. et al., Mol Cell., 2016, 63(2), pp. 337-346]. Original sequence dPTE2 (SEQ ID NO: 1) exhibited higher stability and fivefold higher bacterial-expression yields than PTE-S5, while retaining wild-type levels of activity.
Eight active-site positions that comprise the PTE active-site wall (first-shell) were selected for the design method; it is noted, however, that the number of starting positions varies depending on the subject of the method and the information available for it. The method, using FuncLib, started by defining a sequence space comprising active-site point mutations that are predicted to be individually tolerated (see,
Method results and sequence space:
Table 1 presents the results obtained using FuncLib as described hereinabove, starting from the original sequence of PTE, dPTE2 (SEQ ID NO: 1), and represents, at least to some extent, the sequence space of PTE variants designed for improved reactivity towards a broad spectrum of substrates. Marked in bold are the variants PTE_28 (SEQ ID NO: 28), PTE_29 (SEQ ID NO: 29), PTE_56 (SEQ ID NO: 56), and PTE_57 (SEQ ID NO: 57), which exhibited substantially broadened substrate selectivity relative to the enzyme of the original sequence.
Table 1
Variant   SEQ ID NO   I/C/H/L/M   F/L   H/G/R   H/Y/W   L/I/R   L/T   F/I   M/L
dPTE2     1           I           F     H       H       L       L     F     M
28        28          L           F     G       H       L       L     F     L
29        29          L           F     G       W       L       T     F     M
56        56          I           F     G       W       L       T     F     M
57        57          I           F     G       W       L       T     I     M
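The active-site identities listed in Table 1 can be compared with the dPTE2 starting sequence to list the substitutions carried by each variant and to count the combinations spanned by the allowed identities shown in the table header. The sketch below assumes that the eight columns correspond, in order, to positions 106, 132, 254, 257, 271, 303, 306 and 317 mentioned hereinabove; this ordering is an assumption, and the names used are hypothetical.

```python
import math

# Assumption: the eight table columns correspond, in order, to these positions.
POSITIONS = [106, 132, 254, 257, 271, 303, 306, 317]

# Allowed identities per column, as listed in the header of Table 1.
ALLOWED = ["ICHLM", "FL", "HGR", "HYW", "LIR", "LT", "FI", "ML"]

# Active-site identities per variant, as listed in the rows of Table 1.
DESIGNS = {
    "dPTE2": "IFHHLLFM",
    "PTE_28": "LFGHLLFL",
    "PTE_29": "LFGWLTFM",
    "PTE_56": "IFGWLTFM",
    "PTE_57": "IFGWLTIM",
}


def mutations(design, reference=DESIGNS["dPTE2"]):
    """List substitutions relative to the reference, e.g. 'H254G'."""
    return [
        f"{ref}{pos}{aa}"
        for pos, ref, aa in zip(POSITIONS, reference, design)
        if ref != aa
    ]


# Number of combinations spanned by the identities shown in the table header.
combinations_in_header = math.prod(len(options) for options in ALLOWED)

for name, identities in DESIGNS.items():
    print(name, mutations(identities))
```

Under the stated column assumption, each of the four highlighted variants carries three or four active-site substitutions relative to dPTE2, all including the substitution of the third listed position from H to G.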
RosettaScripts xml and flags files:
Materials:
Substrates were synthesized as previously published: 5-thiobutyl butyrolactone (TBBL) [Khersonsky, O. and Tawfik, D. S., Chembiochem, 2006, 7, pp. 49-53]; phosphonates with cyanocoumarin leaving group, ethyl methyl phosphocyanocoumarin (EMP), isopropyl methyl phosphocyanocoumarin (IMP), cyclohexyl methyl phosphocyanocoumarin (CMP), and pinacolyl methyl phosphocyanocoumarin (PMP) [Ashani, Y. et al., Chemico-Biological Interactions, 2010, 187(1-3), pp. 362-369]; and VX and RVX enantiomers [Berman, H. A. and Leonard, K., J. Biol. Chem., 1989, 264, pp. 3942-3950].
All the other reagents (paraoxon, malathion, p-nitrophenyl acetate, p-nitrophenyl octanoate, 2-naphthyl acetate, γ-nonanoic lactone, DTNB, m-cresol, sodium acetate, propionic acid, butyric acid, isobutyric acid, valeric acid, isovaleric acid, sodium lactate, caproic acid, NADH, lactate dehydrogenase, phosphoenol pyruvate, pyruvate kinase, adenosine 3-phosphate, coenzyme A) were purchased from Sigma-Aldrich, and yeast myokinase was purchased from Merck.
Cloning:
Synthetic genes for the original enzyme and the designed variants were codon optimized for efficient E. coli expression, and custom synthesized as linear fragments by Twist Bioscience. The genes of PTE designs were amplified and cloned into the pMal C2 vector with N-terminal MBP fusion tag through the EcoRI and PstI restriction sites. The plasmids were transformed into E. coli BL21 DE3 cells, and DNA was extracted for Sanger sequencing to validate accuracy. The plasmids with genes of active designs were deposited at AddGene (deposit number 75507).
Protein expression:
2 ml of 2YT medium supplemented with 100 μg/ml ampicillin (and 0.1 mM ZnCl2 in case of PTE) were inoculated with a single colony and grown at 37° C. for about 15 hours. 10 ml 2YT medium supplemented with 50 μg/ml kanamycin (and 0.1 mM ZnCl2 in case of PTE) were inoculated with 0.2 ml overnight culture and grown at 37° C. to an OD600 of about 0.6. Overexpression was induced with 0.2 mM IPTG, and the cultures were grown for about 24 hours at 20° C. After centrifugation and storage at −20° C., the pellets were resuspended in lysis buffer and lysed by sonication.
PTE purification:
PTE lysis buffer: 50 mM Tris (pH 8.0), 100 mM NaCl, 10 mM NaHCO3, 0.1 mM ZnCl2, benzonase and 0.1 mg/ml lysozyme. The protein was bound to amylose resin (NEB), washed with 50 mM Tris with 100 mM NaCl and 0.1 mM ZnCl2, and the proteins were eluted with wash buffer containing 10 mM maltose. The elution fraction was used for SDS-PAGE gel and before activity assays the proteins were dialyzed in wash buffer. For crystallization, the PTE variants were re-cloned into pETMBPH vector containing an N-terminal 6×His tag and MBP fusion [Peleg, Y. and Unger, T., Methods Mol. Biol., 2008, 426, pp. 197-208] and the expression was performed with 500 ml culture. After purification, the protein was digested with TEV protease to remove the MBP fusion tag (1:20 TEV, 1 mM DTT, 24-48 h/RT). The MBP fusion was removed by binding to Ni2+-NTA resin, and the protein was purified by gel filtration (HiLoad 26/600 Superdex75 preparative grade column, GE).
Kinetic measurements:
The kinetic measurements of PTE designs were performed with purified proteins in activity buffer (50 mM Tris pH 8.0 with 100 mM NaCl, and 0.1 mM ZnCl2). A range of enzyme concentrations was used, depending on the activity. The activity of PTE designs was tested colorimetrically with phosphotriesters (paraoxon (0.5 mM), malathion (0.25 mM), and EMP, IMP, CMP and PMP (0.1 mM each)), esters (p-nitrophenyl acetate (0.5 mM), p-nitrophenyl octanoate (0.1 mM), and 2-naphthyl acetate (0.3 mM)), and lactones (TBBL (0.5 mM) and γ-nonanoic lactone (0.5 mM; pH-sensitive assay, monitoring the absorbance of the m-cresol indicator at 577 nm)). The kinetic measurements were performed in 96-well plates (optical path length 0.5 cm), and background hydrolysis rates were subtracted.
The rate of hydrolysis of the V-type nerve agents in the presence of organophosphate (OP) hydrolases was measured as described [Cherny, I. et al., ACS Chem Biol., 2013, 8(11), pp. 2394-403]. The in situ conversion of the coumarin surrogates to the corresponding G nerve agents in dilute aqueous solutions and the monitoring of the rate of detoxification of the G agents by OP hydrolases were performed as previously described [Ashani, Y. et al., Toxicology Letters, 2011, 206, pp. 24-28; and Gupta, R. D. et al., Nat Chem Biol., 2011, 7(2), pp. 120-5]. Note that the concentration of the in situ generated G- and V-agents is non-hazardous, primarily because the in situ synthesis was performed on a small (mg) scale in dilute aqueous solutions. Nonetheless, due to their high potency as inhibitors of AChE, all safety requirements were strictly observed.
Catalytic efficiencies (kcat/KM) were determined for the most active PTE designs by measuring the activity at several low substrate concentrations, in the region where the Michaelis-Menten equation approximates first-order kinetics. All the reported values represent the averages ± standard deviations based on at least two independent measurements.
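At substrate concentrations well below KM, the Michaelis-Menten equation reduces to v0 ≈ (kcat/KM)·[E]·[S], so kcat/KM can be estimated from the slope of initial rate versus substrate concentration divided by the enzyme concentration. The following sketch illustrates this estimate with a least-squares line through the origin; the fitting choice, the function name and the numerical values are assumptions made for illustration and are not measured data.

```python
from statistics import mean, stdev


def kcat_over_km(substrate_conc, initial_rates, enzyme_conc):
    """Estimate kcat/KM (M^-1 s^-1) from the first-order region.

    Fits v0 = slope * [S] through the origin by least squares and
    divides the slope by [E]. Concentrations in M, rates in M/s.
    """
    numerator = sum(s * v for s, v in zip(substrate_conc, initial_rates))
    denominator = sum(s * s for s in substrate_conc)
    return numerator / denominator / enzyme_conc


# Invented example: two independent measurements at three low [S] values.
substrate = [1.0e-6, 2.0e-6, 4.0e-6]     # M
enzyme = 1.0e-9                           # M
replicate_rates = [
    [1.0e-9, 2.0e-9, 4.0e-9],             # M/s, first measurement
    [1.1e-9, 2.2e-9, 4.4e-9],             # M/s, second measurement
]

estimates = [kcat_over_km(substrate, rates, enzyme) for rates in replicate_rates]
average = mean(estimates)
deviation = stdev(estimates)  # sample standard deviation of the replicates
```

The average and standard deviation over independent measurements correspond to the form in which the values are reported herein.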
Structure determination and refinement of the PTE designs structures:
Crystals of PTE_6 (SEQ ID NO: 6), PTE_28 (SEQ ID NO: 28) and PTE_29 (SEQ ID NO: 29) were obtained using the hanging-drop vapor-diffusion method with a Mosquito robot (TTP LabTech). All data sets were collected at 100 K on a single crystal on in-house RIGAKU RU-H3R X-ray. The crystals of PTE_6 (SEQ ID NO: 6) were grown from 0.85 M lithium sulfate and 0.05 M HEPES pH=7.0. The crystals formed in the space group P4₃2₁2, with one dimer per asymmetric unit and diffracted to 1.63 Å resolution. Crystals of PTE_28 (SEQ ID NO: 28) were grown from 0.1 M MgCl2*6H2O, 10% PEG 4000 and 0.05 M Tris pH=7.5. The crystals formed in the space group C2, with one dimer per asymmetric unit and diffracted to 1.9 Å resolution. Crystals of PTE_29 (SEQ ID NO: 29) were grown from 0.1 M Mg(OAc)2*4H2O, 8% PEG 8000 and 0.05 M Na cacodylate pH=6.4. The crystals formed in the space group C2, with one dimer per asymmetric unit and diffracted to 1.95 Å resolution.
Diffraction images of PTE_6 (SEQ ID NO: 6), PTE_28 (SEQ ID NO: 28) and PTE_29 (SEQ ID NO: 29) crystals were indexed and integrated using the Mosflm program, and the integrated reflections were scaled using the SCALA program. Structure factor amplitudes were calculated using TRUNCATE from the CCP4 program suite. The PTE_6 (SEQ ID NO: 6), PTE_28 (SEQ ID NO: 28) and PTE_29 (SEQ ID NO: 29) structures were solved by molecular replacement with the program PHASER. The model used to solve the PTE_6 (SEQ ID NO: 6), PTE_28 (SEQ ID NO: 28) and PTE_29 (SEQ ID NO: 29) structures was the engineered organophosphorous hydrolase (PDB entry: 1QW7).
All steps of atomic refinement were carried out with the CCP4/REFMAC5 program and by Phenix refine. The models were built into 2 mFobs-DFcalc, and mFobs-DFcalc maps by using the COOT program. Details of the refinement statistics of the PTE_6 (SEQ ID NO: 6), PTE_28 (SEQ ID NO: 28) and PTE_29 (SEQ ID NO: 29) structures are described in Table 1. The coordinates of PTE_6 (SEQ ID NO: 6), PTE_28 (SEQ ID NO: 28) and PTE_29 (SEQ ID NO: 29) were deposited in the RCSB Protein Data Bank with accession codes 6GBJ, 6GBK and 6GBL respectively. The structures will be released upon publication.
All PTE designs retained detectable levels of paraoxonase activity (see, Table 2 below), demonstrating that their active site was intact and functional despite the high sequence diversity.
PTE variants and paraoxon/malathion:
Table 2 presents specific activity of PTE variants (μM product/min for mg protein) with phosphotriesters paraoxon (0.5 mM) and malathion (0.25 mM).
The specific activities of the variants were measured with alternative, promiscuous substrates including phosphotriesters other than paraoxon, phosphonodiesters, carboxy-esters, and lactones (see,
PTE variants and phosphotriesters with coumarin:
Table 3 presents the specific activity of PTE variants (μM product/min per mg protein) with phosphotriesters bearing a coumarin leaving group (0.1 mM). Bold face indicates relaxed enantioselectivity (no biphasic behavior characteristic of different hydrolysis rates of the two stereoisomers was observed).
Table 3 values (row and column labels not recoverable): 2465, 166006, 1558, 25702, 6534, 2190, 3131, 1549, 47759, 1478, 76404, 940, 2344, 1785, 29633, 1072, 42811, 420, 1055, 7293, 5976, 1234, 694, 767, 3513, 4347, 123657, 784, 4408, 43103, 612, 23822, 1666, 39817, 329, 2749, 10074, 1115, 2501, 10662, 18288, 155709, 1523, 3989, 57811, 9703, 1880, 187, 3124, 95, 4410, 1005, 360, 402, 1400, 11207, 84039, 331, 8489, 127, 13306, 7461, 23941, 26543, 423, 15879, 437, 3435, 240, 6659, 1562, 7348, 68, 23786, 1375
PTE variants and esters:
Table 4 presents the specific activity of PTE variants (μM product/min per mg protein) with esters. ND=below detection limit.
PTE variants and lactones:
Table 5 presents the specific activity of PTE variants (μM product/min per mg protein) with lactones. ND=below detection limit.
In addition to exhibiting improved catalytic efficiencies against a range of substrates, the PTE variants presented herein, according to some embodiments of the present invention, also showed vast changes in substrate selectivity. For instance, PTE-S5 is selective for paraoxon over the ester 2-naphthyl acetate (2NA) by 3×10⁴-fold. Through only five active-site mutations, selectivity has been reversed in the variant PTE_37 (SEQ ID NO: 37) to 0.04, a nearly million-fold selectivity switch. Similarly, PTE-S5 favors paraoxon over the synthetic lactone tetrabutyl butyrolactone (TBBL) by 10³-fold, whereas in design PTE_27 (SEQ ID NO: 27) selectivity is switched to 0.1 (see, Table 6 below).
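The size of a selectivity switch follows from dividing the starting selectivity ratio by the final one. The short sketch below reproduces that arithmetic for the two examples quoted above, reading the quoted starting selectivities as 3×10⁴ and 10³; the function name `selectivity_switch` is illustrative only and is not part of the described method.

```python
def selectivity_switch(initial_ratio: float, final_ratio: float) -> float:
    """Fold change between two selectivity ratios.

    Each ratio is (kcat/KM for paraoxon) / (kcat/KM for the alternative substrate).
    """
    if initial_ratio <= 0 or final_ratio <= 0:
        raise ValueError("selectivity ratios must be positive")
    return initial_ratio / final_ratio


# Values quoted in the text (assumed readings: 3e4 and 1e3).
switch_2na = selectivity_switch(3e4, 0.04)   # PTE-S5 versus PTE_37, paraoxon over 2NA
switch_tbbl = selectivity_switch(1e3, 0.1)   # PTE-S5 versus PTE_27, paraoxon over TBBL

print(f"paraoxon/2NA switch:  {switch_2na:.2e}-fold")
print(f"paraoxon/TBBL switch: {switch_tbbl:.2e}-fold")
```

Under these readings the 2NA switch comes to about 7.5×10⁵, consistent with the description of a nearly million-fold change.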
Catalytic efficiency of PTE variants:
Table 6 presents specificity changes (as ratios of catalytic efficiency, kcat/KM) in PTE variants.
Remarkably, these designs retained substantial paraoxonase activity (kcat/KM ≥ 10⁴ M⁻¹s⁻¹), demonstrating that some of the designs broadened substrate recognition rather than only trading off one activity for another (see,
Next, the catalytic efficiency of the designs that retained high phosphotriesterase activity with the toxic nerve agents VX, Russian VX (RVX), Soman (GD), and Cyclosarin (GF) was measured (see, Table 7 and Table 8).
Table 7 presents the activity of PTE variants with V-type nerve agents, expressed as kcat/KM in s⁻¹M⁻¹.
Table 8 compares the activity of the best PTE designs towards nerve agents with that of PTE variants obtained by directed evolution; kcat/KM, ×10⁶ M⁻¹min⁻¹, measured in 50 mM Tris with 50 mM NaCl at pH 8, 25 °C.
ᵃData for wt-PTE-S5 taken from Cherny et al. [Cherny, I. et al., ACS Chem Biol., 2013, 8(11), pp. 2394-403]. Determined at 25 °C, by use of both the DTNB and the loss of anti-AChE protocols.
ᵇIn some cases, detoxification of the two S-enantiomers of GD was biphasic, which is attributed to the two toxic isomers, SpCR and SpCS. The parameters for the slow phase are given in parentheses.
ᶜData from Goldsmith et al. [Goldsmith, M. et al., Arch. Toxicol., 2016, 90, pp. 2711-2724]. All entries determined with authentic nerve agents at 37 °C using the protocol of monitoring the loss of anti-AChE activity of the OPs.
ᵈData from Goldsmith and Tawfik [Goldsmith, M. and Tawfik, D. S., Curr. Opin. Struct. Biol., 2017, 47, pp. 140-150].
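Because Table 7 reports kcat/KM per second while Table 8 reports it per minute (scaled by 10⁶), values have to be converted before they are compared. The following is a minimal sketch of that conversion; the function names are illustrative assumptions and do not come from the original protocols.

```python
SECONDS_PER_MINUTE = 60.0


def per_second_to_per_minute(kcat_km_per_s: float) -> float:
    """Convert kcat/KM from M^-1 s^-1 to M^-1 min^-1."""
    return kcat_km_per_s * SECONDS_PER_MINUTE


def per_minute_to_per_second(kcat_km_per_min: float) -> float:
    """Convert kcat/KM from M^-1 min^-1 to M^-1 s^-1."""
    return kcat_km_per_min / SECONDS_PER_MINUTE


def in_table8_units(kcat_km_per_s: float) -> float:
    """Express a per-second value in the x10^6 M^-1 min^-1 units of Table 8."""
    return per_second_to_per_minute(kcat_km_per_s) / 1e6


# Example: an efficiency of 1e7 M^-1 min^-1 expressed per second.
threshold_per_s = per_minute_to_per_second(1e7)
print(f"{threshold_per_s:.3g} M^-1 s^-1")
```

For instance, an efficiency of 10⁷ M⁻¹min⁻¹ corresponds to roughly 1.7×10⁵ M⁻¹s⁻¹.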
As can be seen in Table 8, PTE_28 (SEQ ID NO: 28) exhibited a 66-fold increase in VX hydrolysis efficiency relative to wild-type PTE, and PTE_29 (SEQ ID NO: 29) exhibited remarkable gains in efficiency of 1,550-fold and 3,980-fold in hydrolyzing RVX and GF, respectively.
Starting from PTE_28 (SEQ ID NO: 28), a second round of design was initiated, this time directing FuncLib to model all combinations of 3-5 mutations that occurred in the best nerve-agent hydrolases tested in the first round and eliminating designs that were predicted to be unstable (>8 Rosetta energy units relative to PTE_28 (SEQ ID NO: 28)). The 14 resulting designs were experimentally tested, finding that designs PTE_56 (SEQ ID NO: 56) and PTE_57 (SEQ ID NO: 57) exhibited increased activities towards GD (32-fold and 122-fold, respectively), and both designs exhibited a 3,000-fold increase in hydrolyzing GF. These variants, with kcat/KM ≥ 10⁷ M⁻¹min⁻¹ for the highly toxic nerve agents RVX, GD, and GF, may be suitable for in vivo detoxification.
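The enumerate-then-filter logic of the second design round can be illustrated with a short sketch. This is not FuncLib or Rosetta code: the candidate substitutions labelled P1 to P3 are hypothetical placeholders, and `toy_energy` is a stand-in for the real energy calculation, used here only to show how the 3-5 mutation combinations and the 8-energy-unit cutoff interact.

```python
from itertools import combinations


def enumerate_designs(candidates, min_size=3, max_size=5):
    """Yield every combination of min_size..max_size substitutions,
    allowing at most one substitution per sequence position."""
    mutations = [(pos, aa) for pos, options in candidates.items() for aa in options]
    for size in range(min_size, max_size + 1):
        for combo in combinations(mutations, size):
            # Discard combinations that mutate the same position twice.
            if len({pos for pos, _ in combo}) == size:
                yield combo


def filter_stable(designs, energy_fn, max_delta=8.0):
    """Keep designs whose predicted energy relative to the parent does not
    exceed max_delta (designs above the cutoff are eliminated)."""
    return [d for d in designs if energy_fn(d) <= max_delta]


# Hypothetical candidate substitutions; only 254 -> G and 303 -> T appear in the text.
candidates = {"254": ["G"], "303": ["T"], "P1": ["A", "S"], "P2": ["V"], "P3": ["L"]}


def toy_energy(design):
    """Placeholder scoring: 2.0 energy units per mutation (NOT a Rosetta call)."""
    return 2.0 * len(design)


all_designs = list(enumerate_designs(candidates))
stable = filter_stable(all_designs, toy_energy)
print(len(all_designs), "combinations enumerated;", len(stable), "pass the cutoff")
```

In practice the energy function would be supplied by the modeling software, and the surviving designs would then be ranked before a small number is chosen for experimental testing.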
As can further be seen in Table 8, the efficiency gains observed by testing 63 variants were comparable to those of the best variants obtained through more than a dozen rounds of diversification and experimental testing of thousands of variants using conventional laboratory-evolution strategies. Furthermore, laboratory-evolution experiments demand separate selection campaigns for each substrate, whereas the designed repertoire comprised dozens of enzymes with improved efficiency towards each of the substrates tested. Additionally, all of the variants showed bacterial-expression levels comparable to the highly expressed dPTE2 (SEQ ID NO: 1) starting sequence (>300 mg protein per liter culture).
These results demonstrate that the combination of PROSS and FuncLib may not exhibit the stability-threshold bottlenecks that have constrained the laboratory evolution of many enzymes, including PTE. Thus, FuncLib results in a small but functionally highly diverse repertoire of stable and efficient enzymes and may in some cases bypass the requirement for high-throughput screens.
Sequence space for PTE:
Table B presents the sequence space of amino acid substitutions (mutations) resulting from the method presented herein (FuncLib), imposing the key residues described above and allowing active-site residues to be substituted. The sequence space has 8 amino acid substitution positions, each with at least one optional substitution over the WT (or starting sequence) amino acid at the given position, wherein the original (wild type) amino acid in the position is marked by bold face and is the first from the left.
To understand what molecular factors underlie the high gains in catalytic efficiency in some variants obtained by implementing the design method provided herein, X-ray crystallography was used to determine the molecular structures of PTE_6 (SEQ ID NO: 6) (280-fold improved activity with 2NA), PTE_28 (SEQ ID NO: 28) (65-fold improved activity with TBBL and 103-fold improved activity with S-VX), and PTE_29 (SEQ ID NO: 29) (3,980-fold improved activity with GF), and the results are presented in
Table 9 presents crystallographic data collection and refinement statistics for the PTE designs, wherein values in parentheses refer to the data of the corresponding upper resolution shell.
Structural insights:
Visual inspection and position analysis of the crystal structures revealed that all three structures showed high accuracy relative to their respective models (root mean square deviation [RMSD] <0.5 Å over the backbone and 0.3 Å all-atom RMSD over the mutated active-site residues), confirming that the design process resulted in precise and preorganized active-site pockets as required for high-efficiency catalysis.
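For reference, the RMSD between a design model and a crystal structure is the square root of the mean squared distance between matched atoms. The sketch below assumes the two coordinate sets are already superimposed and listed in the same atom order; it does not perform structural alignment, and the coordinates shown are made-up illustrative values rather than data from the deposited structures.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of
    (x, y, z) coordinates, assumed to be pre-superimposed and atom-matched."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


# Illustrative coordinates in angstroms (hypothetical, not from the PDB entries).
model_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
crystal_atoms = [(0.0, 0.0, 0.3), (1.5, 0.0, 0.3), (1.5, 1.5, 0.3)]

deviation = rmsd(model_atoms, crystal_atoms)
print(f"RMSD = {deviation:.2f} A")
```

A backbone RMSD would restrict the atom lists to backbone atoms, whereas the all-atom value quoted for the mutated active-site residues would include their side chains.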
The crystal structures were also compared to the structures obtained in molecular docking simulations, which were generated to model the toxic Sp stereoisomers of VX, RVX, and GD in the active-site pockets of PTE_28 (SEQ ID NO: 28), PTE_29 (SEQ ID NO: 29), and PTE_56 (SEQ ID NO: 56), respectively. The resulting models indicated that the designed active-site pockets were large enough to accommodate the bulky nerve agents and form direct contacts with them, mostly due to two large-to-small substitutions, His254Gly and Leu303Thr (see,
Sign epistasis among designed mutations:
In each variant of PTE, according to some embodiments of the present invention, the mutations are spatially clustered. It was therefore anticipated that some designs would show complex epistatic relationships, whereby the effects of multipoint mutants could not be simply predicted based on the effects of the single-point mutants. The specific activities of all single- and double-point mutants comprising three of the best designs were therefore measured: PTE_6 (SEQ ID NO: 6), PTE_28 (SEQ ID NO: 28), and PTE_33 (SEQ ID NO: 33) with four, three, and four active-site mutations relative to PTE, respectively (see,
As can be seen in
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.
In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.
Number | Date | Country | Kind |
---|---|---|---|
261157 | Aug 2018 | IL | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IL2019/050916 | 8/14/2019 | WO | 00 |