This instant application contains a Sequence Listing which has been submitted electronically in XML file format and is hereby incorporated by reference in its entirety. Said XML Copy, created on Dec. 11, 2024, is named Seqlisting.xml and is 34000 bytes in size.
This invention relates to the fields of bioinformatics, biotechnology, biochemistry, computational biology, molecular biology, artificial intelligence, and machine learning.
Proteins are functional biomolecules of the cell. As enzymes, they are necessary for the catalysis of chemical reactions. Structural proteins are crucial components of the cytoskeleton and of the locomotory elements of cells. Transporter proteins act as carriers of compounds to different regions of the cell or across membranes. Many proteins are involved in the regulatory mechanisms of cells as interacting species in a particular pathway. Proteins can also function as hormones to elicit a desired gene expression or trigger a specific biochemical pathway. Protein function is entirely dependent on the three-dimensional structure of the protein and on the specific physicochemical interactions between two protein species or between a protein and a ligand. Protein structures are, in turn, defined by the interaction, packing, and spatial arrangement of the amino acid residues constituting the long polypeptide sequence of the protein.
The study of structure-function relationships in proteins is well established in bioinformatics and molecular biology as a means of determining the function of uncharacterized proteins. Most structure-function relationship studies depend on comparative studies between two or more proteins, where a protein of unknown characteristics is compared, using sequence and structural similarity, to a protein with defined characteristics. From this comparison, it can be inferred that two proteins that share similarity in sequence and structure are homologs of each other and thus share a similar function. Furthermore, structural homologs are more prevalent than sequence homologs because protein structures are evolutionarily more conserved; that is, two proteins that have low sequence similarity (~30% sequence identity) can still have conserved domains and structural similarities.
Homology-based structural modelling is a method used to determine the structure of a protein whose structure has not been derived experimentally. In homology modelling, structures are derived using a sequence homology-based search, wherein the 3D structural details of local, high-identity matching regions of a template structure are used to model the structure of the query protein sequence.
One of the challenges in protein comparison is that, when two proteins do not share similar global structures or sequences, the traditional superimposition methods, which rely on overall alignment, become ineffective (
Grid-based structural characterization is a commonly used method for deriving information about a protein structure. Organized grid points capture atomistic details: because they are evenly spaced, they provide regular, normalized data points that are easier to compute with than the unique, often irregular spatial distribution of atoms in a protein structure. Several grid-based methods to study protein structure and function have been developed.
FEATURE is a tool that was developed to structurally and functionally characterize microenvironments within protein structures. The tool defines the microenvironment by measuring physicochemical properties of atoms around a specific chosen site, using concentric shells of 1.25 Å thickness to capture 80 different characteristics, including biochemical characteristics such as charge and polypeptide-based characteristics such as secondary structure type, resulting in a numeric vector of length 480. The tool identifies the unique features of functional sites by using non-site microenvironment characterization to eliminate background properties (Bagley, S. C., et al., 1995).
Torng et al. (2017) developed a method for structure-based protein analysis using 3D convolutional neural networks to predict the amino acids most compatible with a specific location within a protein structure. Protein microenvironments are defined as atom channels, one for each atom type (C, O, N, S), within a 20 Å box around a central location within the protein. The authors also developed a visualization method, known as an “atom importance map”, to inspect the individual contribution of each atom within the input. The method is based on the principle that a mutation introduced into a protein sequence is considered non-detrimental if the newly introduced residue can maintain the critical interactions observed between the wild-type residue and its surrounding residues. Atom importance map visualization provides information to validate and rank the introduction of mutations into the protein structure.
Siamese Atomic Surfacelet Network (SASNet) is a tool that was developed to determine the probability of an amino acid on the surface of the protein to interact with another amino acid on the surface of another protein by voxelizing the local atomic environments, or “surfacelets” into 4D grids, the last dimension being the atomic element type. The method uses a Siamese-like three-dimensional convolutional neural network trained on the database of interacting protein structures (DIPS) which leverages already existing protein complex structures in their bound states (Townshend et al., 2019).
Sato et al. (2019) developed a quality assessment method for protein tertiary structure prediction based on a deep neural network with three-dimensional convolutional layers, which assesses the local structure quality of each residue and integrates the local residue assessments to derive a whole-structure model quality assessment. Local residue quality assessment was conducted using a 3D grid bounding box centered on the Cα atom of a residue. The bounding box was oriented with respect to the vectors formed between the Cα, C, and N atoms of the residue backbone of the structure. The bounding grid was divided into 1 Å voxels, and the atoms within each voxel were used to characterize the voxel based on atom types, each of which was assigned to an independent channel of the neural network.
The literature cited above details tools developed to map the atomic properties of a protein structure. However, a cubic grid restricts comparison to its six faces, which limits spatial orientation-independent comparison of atomic properties. Therefore, the present invention provides a method that employs localized spherical feature grids to capture atomistic properties, since spherical grids offer many different rotational orientations and thus more variation in the comparison between regions of proteins.
The primary objective of the present invention is to provide a method for alignment-free protein comparison at the atomic level, using an atomistic grid match-based computational approach. This method enables the identification of chemical and functional similarities across proteins without relying on traditional structural alignment, thereby overcoming limitations associated with conventional spatial alignment methods. The invention achieves high-resolution chemical profiling by constructing a finely spaced, three-dimensional grid around protein structures, which captures the potential energy landscape across the protein, allowing for the identification of high-energy residues. Another objective of the invention is to analyze and identify high-energy residues and functionally relevant localized regions within the protein structure by capturing the potential energy landscape. This identification facilitates targeted modifications in protein engineering to enhance desired properties. A further aim is to construct the localized spherical feature grid (LSFG) around the identified regions and design protein variants with enhanced functionality and stability, especially focusing on enzymes such as glucose dehydrogenase, where improvements in co-factor recycling and catalytic efficiency are targeted. Additionally, the invention facilitates functional annotation of uncharacterized proteins by comparing their LSFGs to known protein structures within a pre-established database, enabling prediction of potential functions, active sites, binding pockets, and catalytic domains. Another aspect of the invention involves optimizing antibody engineering, using LSFG-based analysis to enhance binding specificity, stability, and affinity for target antigens in therapeutic contexts. 
Overall, the objectives of this invention advance the field of protein engineering by providing a versatile and precise method for protein analysis, functional prediction, and the development of biologically active proteins for industrial and therapeutic applications.
This invention introduces a computational method for engineering proteins, especially glucose dehydrogenases (GDHs), by analyzing their atomic composition through a 3D grid-based system. Unlike conventional methods that depend on structural alignment, this approach arranges a protein's atoms into a fine, three-dimensional grid to capture potential energies from all atoms in the protein. These properties are then used to identify high-energy residues and functional regions within the protein. The process involves constructing a localized spherical feature grid (LSFG) around targeted regions of the protein to store atomic-level information such as atom type, partial charge, polarity, atomic volume, solvent accessible surface area, electronegativity, ionization energy, polarizability, electron affinity, electrostatic potential, solvent accessibility, and coordination number. From this information, composite values are derived and stored at each grid point for comparison with predefined LSFGs of other protein structures in a comprehensive database. To determine the best match, the invention employs two methods: (1) a geometric alignment method, where rotation matrices and quaternions systematically test all orientations, and (2) a transformer-based similarity approach, which captures spatial and chemical patterns at each grid point. The method is particularly useful for identifying functional motifs, optimizing enzyme stability, and designing mutations for improved protein functionality. This approach was used to engineer glucose dehydrogenase (GDH) variants with enhanced stability and co-factor recycling efficiency, relevant in biocatalytic processes. Additionally, the approach has potential applications in antibody engineering and functional annotation of novel proteins by aligning chemical properties without requiring structural similarity.
Table 1: Atomic properties calculated at every grid point.
Table 2: Atomistic property descriptors captured for each grid point for a segment of the localized spherical feature grid.
Table 3: Residue differences of the engineered GDH relative to SEQ ID No: 1.
“Protein,” “polypeptide”, and “peptide” are used interchangeably herein to denote a polymer of at least two amino acids covalently linked by an amide bond, regardless of length or post-translational modification.
“Amino acids” are referred to herein by either their commonly known three-letter symbols or by the one-letter symbols recommended by IUPAC-IUB biochemical nomenclature commission.
“Atomistic Grid Match” herein refers to a computational technique arranging protein atoms into a spherical 3D grid to capture atomic-level details for protein comparison and analysis.
“3D Spherical Grid” herein refers to a high-resolution grid enclosing the protein structure, spaced at regular intervals.
“Grid points” herein refers to finely spaced positions in the grid, where probe atoms are placed to capture relevant data.
“Probe Atoms” herein refers to atoms (C, O, N, H, S, P) used in the grid to compute potential energy.
“Angstrom (Å)” herein refers to a unit of length equal to 0.1 nanometers, used to measure atomic-scale distances.
“Potential energy” herein refers to energies which are calculated based on the Coulombic and Lennard-Jones potential functions.
“High-Energy Residues” herein refers to the top 5% of residues with the highest energy values, identified by sorting all residues in descending order based on their energy values.
“Localised region” herein refers to the region around the high-energy residues, together with the super-secondary structures, domains, or motifs identified for the protein of interest with unknown function.
“Localized Spherical Feature Grid (LSFG)” herein refers to a spherical grid with a 6 Å radius around localized regions, capturing both chemical and spatial information of the protein of interest.
“Atomic Properties” herein refers to the chemical and physical characteristics of atoms, such as atom type (C, O, N, H, S, P), partial charge, polarity, atomic volume (Å³), accessible surface area (Å²), electronegativity, ionization energy (eV), polarizability (Å³), electron affinity (eV), electrostatic potential (kcal/mol), solvent accessibility, and coordination number.
“Composite values” herein are the aggregated representations of multiple atomic properties (such as atom type, partial charge, and polarity) at a specific grid point.
“Solvent Accessible Surface Area (SASA)” herein refers to the area of an atom exposed to the solvent, helping to identify buried or exposed regions in the protein.
“Electronegativity” herein represents an atom's ability to attract electrons.
“Polarizability” herein indicates the flexibility of an atom's electron cloud.
“Coordination Number” herein specifies the number of atoms bonded to a central atom.
“Energy Maps” herein refers to the 2D representation of potential energy distributions around the protein, created using probe atoms.
“One-Hot Encoding” herein refers to a method for representing atom types (e.g., C: [1, 0, 0, 0, 0, 0]) as a vector.
“Self-attention mechanism” herein refers to a technique in transformers that allows the model to weigh the importance of different tokens (grid points) in relation to each other, enabling it to capture both local and global patterns in the data.
“Positional encodings” herein refers to the information added to the input data in transformers to preserve the spatial or sequential positions of elements, ensuring that the model maintains the relative positions or distances within the data.
“Contrastive learning” herein refers to a machine learning technique that trains models by comparing pairs of similar and dissimilar examples, encouraging the model to learn distinct features for each class.
“Similarity scores” herein refers to the values that indicate how similar two grids are to each other, often used in matching or ranking.
“Attention scores” herein refers to the values that quantify how much focus each grid point in a transformer model should give to the other grid points based on their relationships.
“Query vector” herein refers to a vector in the transformer model that represents the information the token seeks from others, used to compute attention scores during the self-attention mechanism.
“Key vector” herein refers to a vector in the transformer model that represents the information offered by a token, used to match with the query vector in the self-attention mechanism.
“Value vector” herein refers to a vector that holds the actual information of a token, which is weighted by the attention score during the self-attention mechanism.
“Amino acid difference or residue difference” refers to a change in the residue at a specified position of a polypeptide sequence when compared to a reference sequence.
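The definitions of potential energy and high-energy residues given above can be illustrated with a minimal Python sketch. It assumes the standard 12-6 Lennard-Jones form, a Coulomb constant of 332.0636 kcal·Å/(mol·e²), and per-residue energies supplied as a dictionary. The function names, parameters, and example energies are hypothetical and are not taken from the invention.

```python
import math

# Assumption: Coulomb constant in kcal*Angstrom/(mol*e^2), as commonly used in
# molecular mechanics; the text does not specify constants or units.
COULOMB_K = 332.0636

def pair_energy(q_probe, q_atom, r, epsilon, sigma):
    """Coulombic plus 12-6 Lennard-Jones energy of a probe/atom pair at distance r."""
    if r <= 0:
        raise ValueError("distance must be positive")
    coulomb = COULOMB_K * q_probe * q_atom / r
    sr6 = (sigma / r) ** 6
    lennard_jones = 4.0 * epsilon * (sr6 * sr6 - sr6)
    return coulomb + lennard_jones

def high_energy_residues(residue_energy, fraction=0.05):
    """Return the residues in the top `fraction`, sorted by descending energy."""
    ranked = sorted(residue_energy, key=residue_energy.get, reverse=True)
    n_top = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:n_top]

# Hypothetical per-residue energies for a 20-residue fragment.
energies = {f"RES{i}": float(i) for i in range(1, 21)}
top = high_energy_residues(energies)
```

With 20 residues and a 5% cut-off, the selection keeps a single residue; rounding up guarantees that at least one residue is always returned.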
This invention provides a novel method for engineering proteins, specifically glucose dehydrogenase, by utilizing an atomistic grid match based computational method to analyze and compare the atomic composition of proteins. The method arranges the atoms in a protein's 3D structure into a finely spaced three-dimensional spherical grid. This spherical grid captures atomic-level details in a highly localized manner, allowing for the comparison of specific regions within two proteins, even if they are globally dissimilar.
In conventional protein comparison methods, spatial alignment is heavily relied upon to superimpose proteins or binding sites to reveal conformational similarities. However, these approaches are limited when proteins lack significant structural similarity, even though they may possess similar chemical environments in functionally relevant regions. This invention provides an alternative by mapping chemical properties directly onto a 3D grid, allowing for alignment-free comparison focused on chemical composition rather than spatial arrangement.
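As an illustration of how a spherical grid might be laid out, the following Python sketch generates regularly spaced points inside a sphere and assigns an atom to its nearest grid point. The 6 Å radius follows the LSFG definition given earlier; the 1 Å spacing, the use of a Cartesian lattice clipped to a sphere, and the function names are assumptions made only for this example.

```python
import math

def spherical_grid(center, radius=6.0, spacing=1.0):
    """Return lattice points lying within `radius` of `center` (all in Angstrom)."""
    cx, cy, cz = center
    n = int(math.floor(radius / spacing))
    points = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            for k in range(-n, n + 1):
                dx, dy, dz = i * spacing, j * spacing, k * spacing
                # Small tolerance so points exactly on the surface are kept.
                if dx * dx + dy * dy + dz * dz <= radius * radius + 1e-9:
                    points.append((cx + dx, cy + dy, cz + dz))
    return points

def nearest_grid_point(atom_xyz, points):
    """Index of the grid point closest to an atom position."""
    return min(range(len(points)), key=lambda idx: math.dist(atom_xyz, points[idx]))

grid = spherical_grid((0.0, 0.0, 0.0))
closest = grid[nearest_grid_point((0.2, -0.1, 0.4), grid)]
```

A finer spacing increases resolution at the cost of more grid points to compare.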
The Atom Type (AT) identifies each atom as carbon (C), nitrogen (N), oxygen (O), hydrogen (H), sulphur (S), or phosphorus (P) and is typically encoded as a categorical or one-hot vector (e.g., C: [1, 0, 0, 0, 0, 0]). Partial Charge (PC) reflects the charge distribution based on the atom's bonding environment and is represented as a real number derived from molecular mechanics or quantum calculations (e.g., C: +0.1, O: −0.8). Polarity (PO) is a binary indicator of whether the atom is polar or non-polar: polar atoms (such as oxygen, nitrogen, sulphur, and phosphorus) are assigned 1 and non-polar atoms (such as carbon) are assigned 0. Atomic Volume (AV) represents the approximate space occupied by an atom (e.g., 20.58 Å³ for carbon), and Solvent Accessible Surface Area (SASA) indicates how much of an atom's surface is exposed to the solvent, with values ranging, for example, from 5-10 Å² for Cα atoms and 12-18 Å² for O atoms. Electronegativity (EN) shows each atom's ability to attract electrons, which influences molecular bonding; for instance, carbon has a value of 2.55 and oxygen 3.44. Ionization Energy (IE) represents the energy required to remove an electron, relevant for chemical reactivity, with carbon at 11.26 eV and oxygen at 13.62 eV. Polarizability (PZ) describes the flexibility of an atom's electron cloud, influencing van der Waals interactions (e.g., C: 11.3 a.u., N: 7.4 a.u., where a.u. denotes atomic units). Electron Affinity (EA) indicates the energy change when an electron is added, showing the atom's propensity to gain electrons, with carbon at 1.26 eV and oxygen at 1.46 eV. Electrostatic Potential (ESP) represents the potential energy of a unit positive charge near an atom, calculated in context and influenced by surrounding atoms.
Solvent Accessibility (SA) is a binary indicator of exposure to solvent (1 for exposed, 0 for buried). Coordination Number (CN) specifies the number of atoms bonded to the central atom, relevant in structural modeling.
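The twelve properties described above can be flattened into a single 17-component vector per atom: six one-hot atom-type channels followed by eleven property values. The Python sketch below shows one possible encoding. The channel order (C, O, N, H, S, P), the class and function names, and the SASA, ESP, solvent accessibility, and coordination number values in the example are assumptions; the remaining carbon values are those quoted in the text.

```python
from dataclasses import dataclass

ATOM_TYPES = ("C", "O", "N", "H", "S", "P")  # assumed channel order

@dataclass
class AtomProperties:
    atom_type: str
    partial_charge: float
    polarity: int
    atomic_volume: float
    sasa: float
    electronegativity: float
    ionization_energy: float
    polarizability: float
    electron_affinity: float
    electrostatic_potential: float
    solvent_accessible: int
    coordination_number: int

def one_hot(atom_type):
    """One-hot encode the atom type, e.g. C -> [1, 0, 0, 0, 0, 0]."""
    if atom_type not in ATOM_TYPES:
        raise ValueError(f"unsupported atom type: {atom_type}")
    return [1.0 if atom_type == t else 0.0 for t in ATOM_TYPES]

def to_vector(a):
    """Concatenate the one-hot atom type with the eleven numeric properties."""
    return one_hot(a.atom_type) + [
        a.partial_charge, float(a.polarity), a.atomic_volume, a.sasa,
        a.electronegativity, a.ionization_energy, a.polarizability,
        a.electron_affinity, a.electrostatic_potential,
        float(a.solvent_accessible), float(a.coordination_number),
    ]

# Carbon example; SASA, ESP, SA, and CN are hypothetical placeholder values.
carbon = AtomProperties("C", 0.1, 0, 20.58, 7.5, 2.55, 11.26, 11.3, 1.26, -0.2, 1, 4)
vector = to_vector(carbon)
```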
When multiple atoms overlap at a single grid point, composite values (112) for the atomic properties are calculated using methods like averaging or weighted selection to reflect the most chemically relevant atom. This ensures an accurate representation of overlapping atomic contributions in a grid point's chemical profile. To derive composite values (112) when multiple atoms overlap at a single grid point, each property must be aggregated to represent the combined effect of these atoms as depicted in the
The following provides a stepwise approach to determining a composite property value:
For an atom equidistant from two grid points, G1 and G2, the preferred grid point for composite parameter calculation is the one to which another atom, bonded or attached to the equidistant atom, has already been assigned for composite parameter calculation (
The resultant grid point vector integrates the combined chemical properties of the overlapping atoms, as in the following example: [0.6,0.3,0,0,0,0,−0.07,1,55.87,30.32,8.54,36.14,27.9,3.98,−0.8,1,8]. This vector structure serves to represent the composite effect of all contributing atoms at a specific grid point.
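One of the aggregation options named above, averaging, can be sketched in Python as a weighted mean over the vectors of the atoms that overlap at a grid point. This is only an illustration: it applies the same weighted mean to every property, whereas individual properties may be aggregated differently in practice, and the function name and weights are hypothetical.

```python
def composite_vector(vectors, weights=None):
    """Weighted mean of equal-length per-atom property vectors at one grid point."""
    if not vectors:
        raise ValueError("at least one atom vector is required")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(vectors)  # plain average
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(length)
    ]

# Leading components only (atom-type channels and partial charge) for brevity.
carbon = [1, 0, 0, 0, 0, 0, 0.1]
oxygen = [0, 1, 0, 0, 0, 0, -0.8]
composite = composite_vector([carbon, oxygen])
```

With equal weights, the atom-type channels become fractional occupancies (0.5 for C and 0.5 for O) and the partial charge becomes the mean of the two charges.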
The entire localized spherical feature grid is constructed using the vectors at each grid point, which are derived from the composite values calculated for atomic properties (113). For instance, the vectors at individual grid points might appear as follows: Grid point 1: [0.5,0.5,0,0,0,0,−0.35, 1,36.18, . . . ], Grid point 2: [1,0,0,0,0,0,+0.1,0,20.58, . . . ], Grid point 3: [0,1,0,0,0,0,−0.6,1,14.71, . . . ] and so forth. The generated localized spherical feature grid (LSFG) around the protein region of interest is then compared to a comprehensive database of similar predefined grids (114). This database was created from protein datasets, including BRENDA, ProThermDB, ThermoMutDB, FireProt, and an in-house collection of thermally stable enzymes collected from the published literature (
The localised spherical feature grids are compared in two ways:
Rotation matrices are applied incrementally around each principal axis (X, Y, Z), typically in small increments, such as 5° or 10°, to ensure thorough coverage of potential orientations.
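The incremental sweep over the principal axes can be sketched in Python as follows. The sketch assumes that rotations are applied to the 3D coordinates of the grid points and that the three axis rotations are composed in X, then Y, then Z order; both choices, and the function names, are assumptions made for this example.

```python
import math
from itertools import product

def rotation_matrix(axis, degrees):
    """3x3 rotation matrix about a principal axis ('x', 'y', or 'z')."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    if axis == "x":
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == "y":
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    if axis == "z":
        return [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    raise ValueError("axis must be 'x', 'y', or 'z'")

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def apply(matrix, point):
    return tuple(sum(matrix[i][k] * point[k] for k in range(3)) for i in range(3))

def orientations(step_degrees):
    """Yield one combined rotation matrix per (X, Y, Z) angle triple."""
    angles = range(0, 360, step_degrees)
    for ax, ay, az in product(angles, repeat=3):
        rx, ry, rz = (rotation_matrix(n, a) for n, a in (("x", ax), ("y", ay), ("z", az)))
        yield matmul(rz, matmul(ry, rx))

rotated = apply(rotation_matrix("z", 90), (1.0, 0.0, 0.0))
```

A 10° step yields 36 angles per axis, or 46,656 angle triples, some of which describe the same orientation; a 5° step yields 373,248.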
Alternatively, quaternions can be employed to represent rotations more efficiently. Quaternions enable smooth, continuous rotation by defining the rotation as
q = w + xi + yj + zk, where q is applied to each vector at the grid points to rotate it in 3D space.
A quaternion rotation is applied by calculating v′ = q v q⁻¹, where q is the quaternion, v is the vector to be rotated, and v′ is the rotated vector. By systematically varying the quaternion, the grid can be rotated smoothly around any arbitrary axis.
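The quaternion rotation described above can be sketched in Python with quaternions stored as (w, x, y, z) tuples. The helper names and the axis-angle constructor are assumptions introduced for this example.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )

def quat_from_axis_angle(axis, degrees):
    """Unit quaternion for a rotation of `degrees` about `axis`."""
    norm = math.sqrt(sum(c * c for c in axis))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    half = math.radians(degrees) / 2.0
    s = math.sin(half) / norm
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)

def rotate(v, q):
    """Rotate a 3D vector v by quaternion q using v' = q v q^-1."""
    w, x, y, z = q
    norm2 = w * w + x * x + y * y + z * z
    q_inv = (w / norm2, -x / norm2, -y / norm2, -z / norm2)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, v[0], v[1], v[2])), q_inv)
    return (rx, ry, rz)

q = quat_from_axis_angle((0.0, 0.0, 1.0), 90.0)
v_rot = rotate((1.0, 0.0, 0.0), q)  # approximately (0, 1, 0)
```

Because the inverse is computed explicitly, the rotation remains correct even if q drifts slightly from unit length during a sweep.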
For each rotational orientation, whether derived through rotation matrices or quaternions, a match score is computed by comparing the query rotated grid's vectors to those of the rotated dataset grid. This is achieved by calculating either the Euclidean distance, which reflects the spatial difference in position, or cosine similarity, which assesses directional alignment (
Euclidean Distance: Calculate the Euclidean distance between the vector at each grid point in the query grid and the corresponding grid points in each dataset grid. Smaller distances indicate higher similarity.
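The Euclidean distance between two grid points is given by the following equation:
\[ d(GP_1, GP_2) = \sqrt{\sum_{i=1}^{n} \left(GP_{1i} - GP_{2i}\right)^2} \]
where GP1i and GP2i are the individual components, such as the values of the atomic descriptors, of the two grid point vectors GP1 and GP2, and n is the number of components.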
For instance, for two grid points, GP1=[0.6,0.3,0,0,0,0,−0.07,1,55.87,30.32,8.54,36.14,27.9,3.98,−0.8,1,8] of LSFG1 and GP2=[0,1,0,0,0,0,−0.12,0,14.71,21,3.44,13.62,5.3,1.46,−0.85,1,2] of LSFG2, the Euclidean distance d(GP1,GP2) is calculated as 53.56.
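The worked example can be checked with a few lines of Python. The sketch uses the first 17 components of each vector, matching the feature vector length of six atom-type channels plus eleven properties; the function name is an assumption.

```python
import math

# 17-component grid point vectors with the values from the example in the text.
GP1 = [0.6, 0.3, 0, 0, 0, 0, -0.07, 1, 55.87, 30.32, 8.54, 36.14, 27.9, 3.98, -0.8, 1, 8]
GP2 = [0, 1, 0, 0, 0, 0, -0.12, 0, 14.71, 21, 3.44, 13.62, 5.3, 1.46, -0.85, 1, 2]

def euclidean_distance(a, b):
    """Square root of the summed squared component differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = euclidean_distance(GP1, GP2)
print(round(d, 2))  # 53.56
```

The distance is dominated by the large-magnitude properties (atomic volume, ionization energy, polarizability), which is one reason to normalize the vectors before combining scores.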
Cosine Similarity: Compute the cosine similarity (Sc) between the vectors at each grid point in the query and dataset grids. Cosine similarity takes values in the range [−1, 1], wherein Sc=1 indicates that the two vectors point in the same direction, Sc=0 indicates that the two vectors are orthogonal, and Sc=−1 indicates that the two vectors point in opposite directions.
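The cosine similarity between two grid points is given by the following equation:
\[ S_c(GP_1, GP_2) = \frac{\sum_{i=1}^{n} GP_{1i}\, GP_{2i}}{\sqrt{\sum_{i=1}^{n} GP_{1i}^2}\; \sqrt{\sum_{i=1}^{n} GP_{2i}^2}} \]
where GP1i and GP2i are the individual components, such as the values of the atomic descriptors, of the two grid point vectors GP1 and GP2, and n is the number of components.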
For instance, for two grid points, GP1=[0.6,0.3,0,0,0,0,−0.07,1,55.87,30.32,8.54,36.14,27.9,3.98,−0.8,1,8] of LSFG1 and GP2=[0,1,0,0,0,0,−0.12,0,14.71,21,3.44,13.62,5.3,1.46,−0.85,1,2] of LSFG2, the cosine similarity Sc(GP1,GP2) is calculated as 0.91, indicating that the vectors point in nearly the same direction.
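The cosine similarity of the same pair of vectors can be verified in the same way. As before, the sketch uses the first 17 components of each vector, and the function name is an assumption.

```python
import math

GP1 = [0.6, 0.3, 0, 0, 0, 0, -0.07, 1, 55.87, 30.32, 8.54, 36.14, 27.9, 3.98, -0.8, 1, 8]
GP2 = [0, 1, 0, 0, 0, 0, -0.12, 0, 14.71, 21, 3.44, 13.62, 5.3, 1.46, -0.85, 1, 2]

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

sc = cosine_similarity(GP1, GP2)
print(round(sc, 2))  # 0.91
```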
For the comparison between two LSFGs, the combined score, as a function of Euclidean distance and cosine similarity, is given by the following equation:
Where, w1 and w2 are weights derived from the range of Euclidean distances and cosine similarities, respectively, for each grid point compared between LSFG1 and LSFG2; L1d(GP1,GPn) and L1Sc(GP1,GPn) are the Euclidean distances and cosine similarities, respectively, derived from the comparisons between the normalized vectors GP1, GPn.
Among the various orientations tested, the orientation yielding the highest match score is selected as the best alignment for that dataset grid.
Afterward, all dataset grids are ranked based on their optimal match scores, with the highest-ranking grids representing the closest spatial and chemical alignment with the region of interest. This approach ensures that the dataset grids are compared comprehensively in all possible orientations, with thresholding applied if necessary to retain only grids with significant similarity, thus identifying the most relevant spatial matches across the dataset.
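The selection and ranking steps can be sketched in Python as follows. The sketch assumes that a match score has already been computed for every tested orientation of every dataset grid and that a higher score means a better match; the grid identifiers, scores, threshold, and function name are hypothetical.

```python
def rank_dataset_grids(scores_by_grid, threshold=None):
    """Keep the best orientation per dataset grid, then rank grids by that score.

    scores_by_grid maps a grid identifier to a list of match scores, one per
    tested orientation. Returns (grid_id, best_score, best_orientation_index)
    tuples sorted from best to worst.
    """
    best = []
    for grid_id, orientation_scores in scores_by_grid.items():
        if not orientation_scores:
            continue  # nothing was tested for this grid
        index, score = max(enumerate(orientation_scores), key=lambda pair: pair[1])
        best.append((grid_id, score, index))
    best.sort(key=lambda item: item[1], reverse=True)
    if threshold is not None:
        best = [item for item in best if item[1] >= threshold]
    return best

# Hypothetical scores for three dataset grids over four orientations each.
scores = {
    "grid_A": [0.41, 0.77, 0.52, 0.60],
    "grid_B": [0.90, 0.35, 0.20, 0.88],
    "grid_C": [0.10, 0.15, 0.12, 0.09],
}
ranking = rank_dataset_grids(scores, threshold=0.5)
```

Here `grid_B` ranks first with its first orientation, `grid_A` second, and `grid_C` is dropped by the threshold.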
Transformer-Based Similarity Method (115): A transformer-based approach can effectively match, score, and rank localized spherical feature grids within a protein by leveraging its ability to capture complex relational data across spatial and chemical dimensions (
For each grid point i, the Query, Key, and Value vectors are computed as
\[ q_i = W_Q x_i, \qquad k_i = W_K x_i, \qquad v_i = W_V x_i \]
where WQ, WK, and WV are learnable matrices and xi is the feature vector of grid point i.
The attention score between two grid points i and j is computed as the dot product of their Query and Key vectors, scaled by the square root of the key dimension dk:
\[ s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} \]
This score indicates how much token j's features should contribute to token i's representation.
The raw scores are then normalized using the softmax function to ensure they sum to 1:
\[ \alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})} \]
The result, αij, represents the normalized attention score that reflects the influence of grid point j on grid point i. These attention scores are used to compute a weighted sum of the Value vectors across all grid points j, updating the representation of grid point i:
\[ z_i = \sum_{j} \alpha_{ij}\, v_j \]
where zi is the updated feature representation of grid point i, incorporating the contributions of all other grid points weighted by their attention scores.
The attention scores help the model to capture both local and global spatial relationships between grid points based on their chemical and spatial features. This enables the transformer to prioritize more relevant grid points during similarity ranking.
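The attention computation described above can be sketched in plain Python for a single attention head. The sketch assumes small dense weight matrices stored as nested lists and omits positional encodings, multiple heads, and training; the function names and the toy inputs are hypothetical.

```python
import math

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, WQ, WK, WV):
    """Return (attention weights, updated representations) for grid point vectors X."""
    Q = [matvec(WQ, x) for x in X]
    K = [matvec(WK, x) for x in X]
    V = [matvec(WV, x) for x in X]
    d_k = len(K[0])
    alphas, Z = [], []
    for q in Q:
        # Scaled dot product between this Query and every Key.
        raw = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        alpha = softmax(raw)
        alphas.append(alpha)
        # Weighted sum of the Value vectors.
        Z.append([sum(a * v[c] for a, v in zip(alpha, V)) for c in range(len(V[0]))])
    return alphas, Z

identity = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for simplicity
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy grid point vectors
alphas, Z = self_attention(X, identity, identity, identity)
```

Subtracting the largest score before exponentiation leaves the softmax result unchanged but avoids overflow, which matters when unnormalized property values produce large dot products.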
For instance, considering two grid points, GP1=[0.6,0.3,0,0,0,0,−0.07,1,55.87,30.32, 8.54,36.14,27.9,3.98,−0.8,1,8], of LSFG1 and GP2=[0,1,0,0,0,0,−0.12,0,14.71,21,3.44,13.62,5.3,1.46,−0.85,1,2,62.56] of LSFG2, the attention scoring is as follows:
Assuming Weight matrices WQ, WK, WV are identity matrices for simplicity:
Assuming dk=17 (Feature Vector Length)
Assuming we compare GP1 with GP2, GP3 and GP4, the scores are
The attention score for GP2 (α12=1) with respect to GP1 is dominant and is the highest ranked followed by GP3 and GP4.
To address rotational variance, data augmentation with random rotations can be applied during training, or rotationally invariant transformers can be used to handle orientation differences directly.
The top-ranked Localized Spherical Feature Grid (LSFG) matches are analyzed to gain insights into the protein structure-function relationship. This analysis involves several steps, with a focus on incorporating mutations into the protein of interest and identifying key functional domains.
Analysis of Top-Ranked LSFG Matches: Once the LSFGs from the protein of interest are compared with the LSFGs in the dataset (using the two methods outlined previously), the highest-ranked matches are selected. These high-ranking LSFGs represent grid regions in the protein that exhibit the most similarity to known protein regions with well-characterized functions. By analyzing the atomic-level features in these matched regions (such as atom types, charges, hydrophobicity, and spatial arrangement), it is possible to identify conserved patterns and functional motifs shared between the protein of interest and known functional protein domains.
Incorporation of Mutations: The information derived from the top-ranked matches can be used to introduce mutations into the protein of interest. By incorporating specific mutations into the protein's amino acid sequence and observing how they affect the LSFG or the spatial arrangement of atomic properties, it can be predicted how these mutations impact the protein's stability, function, or interactions. If the mutation disrupts a functionally important region, the LSFG comparison can reveal potential compensatory mutations or guide the design of mutations that enhance the desired function.
Characterization of Domain Function: LSFG matching helps in identifying functional domains within the protein of interest. Functional domains are regions of the protein that are responsible for carrying out specific biological activities, such as binding to substrates or interacting with other proteins. By comparing the LSFG of the protein of interest with the LSFGs of known functional domains from the dataset, researchers can identify regions of high similarity that likely correspond to similar functions. The matched regions can be further analyzed to characterize the specific type of function and understand how mutations might influence these activities.
Mapping Mutations to Functional Impacts: Through this analysis, it becomes possible to predict how the mutations could alter the protein's overall function. For example, mutations that occur within regions matching known active sites or interaction domains can be evaluated for their potential to enhance or inhibit enzymatic activity, change binding specificity, or affect protein stability.
After introducing the mutations into the enzyme, the structural integrity and stability of the engineered enzyme are validated using AlphaFold, to predict the modified protein's conformation, ensuring that the introduced mutations do not negatively impact the enzyme's functional integrity. Once the structure is validated, the engineered enzyme gene is cloned into an appropriate expression vector, and the recombinant enzyme is expressed in a suitable host organism. Following expression, the enzyme activity is assessed by testing its catalytic efficiency. This ensures that the engineered enzyme demonstrates the desired improved performance for the intended applications.
This method offers an alignment-free comparison that identifies chemical similarities across structurally diverse proteins, facilitates high-resolution localized chemical profiling, and enhances functional insight into protein interactions.
In some embodiments, the antibodies are engineered using this method. This approach can be applied to engineer antibodies with enhanced binding specificity, stability, and affinity for their target antigens. By understanding how mutations in key functional regions affect antibody structure and interaction, this method can be used to optimize antibody properties for therapeutic use, such as in cancer immunotherapy, autoimmune disease treatments, or infectious disease management. Furthermore, this method can be used to identify the functionality of a protein by analyzing the spatial arrangement of atomic types within its functional domains. By matching the LSFGs of the protein of interest with those in a database of known protein structures with defined structure function characteristics, we can predict the function of uncharacterized proteins and identify novel functions such as enzyme activation loops of tyrosine kinases, TATA box binding proteins, nuclear localizing signals and SH3 binding domains. This is particularly valuable for the functional annotation of novel proteins, allowing for the identification of active sites, binding pockets, or catalytic domains.
The local spherical feature grid of the present invention was used to engineer and design variants of a Glucose dehydrogenase (GDH) enzyme for improved functionality and co-factor recycling ability. Enzymes such as short-chain dehydrogenase/reductase, imine reductases, reductive aminases, amine-dehydrogenases, amino-acid dehydrogenases, ene-reductase and other oxidoreductase enzymes bind Nicotinamide adenine dinucleotide phosphate (NAD(P)H) molecules as cofactors for a source of hydrides required during reduction reactions. GDH, therefore, is an enzyme of immense utility in biocatalysis for the replenishment of NAD(P)H cofactor that is consumed during reduction reactions. GDH enzymes are coupled with any reductase enzyme in a one-pot reaction with a sacrificial substrate such as glucose to convert oxidized NAD(P)+ to reduced NAD(P)H. Hence, another objective of the current invention is to use the method of the localized spherical feature grids described in the present invention to design variants through enzyme engineering for achieving an improvement in GDH stability and recycling efficiency.
Specifically, the present invention provides an engineered glucose dehydrogenase designed using the localized spherical feature grid method described in the present invention, wherein the glucose dehydrogenase has 90% sequence identity to the polypeptide sequence given in SEQ ID NO: 1 and contains residue differences corresponding to X152S and X199H, for the improved conversion of glucose to gluconic acid with simultaneous conversion of NADP+ to NADPH.
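The identity and residue criteria above can be illustrated with a minimal sketch. It assumes the candidate has already been aligned to the reference without gaps, treats the 90% figure as a minimum threshold, and uses a toy poly-alanine reference in place of SEQ ID NO: 1; the function names and the `REQUIRED` mapping are hypothetical and are not part of the specification.

```python
# Illustrative only: toy sequences, hypothetical names, ungapped alignment assumed.

REQUIRED = {152: "S", 199: "H"}  # X152S and X199H, 1-based positions


def percent_identity(ref: str, var: str) -> float:
    """Percentage of identical positions between two equal-length sequences."""
    if not ref or len(ref) != len(var):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ref, var))
    return 100.0 * matches / len(ref)


def has_required_residues(var: str, required: dict) -> bool:
    """True if every required (1-based) position carries the required residue."""
    return all(
        len(var) >= pos and var[pos - 1] == aa for pos, aa in required.items()
    )


def meets_criteria(ref: str, var: str, min_identity: float = 90.0) -> bool:
    """Identity threshold (assumed to be a minimum) plus the two required residues."""
    return (
        percent_identity(ref, var) >= min_identity
        and has_required_residues(var, REQUIRED)
    )


# Toy stand-in for the reference; NOT the sequence of SEQ ID NO: 1.
reference = "A" * 260
residues = list(reference)
residues[151] = "S"  # position 152
residues[198] = "H"  # position 199
variant = "".join(residues)

print(round(percent_identity(reference, variant), 2), meets_criteria(reference, variant))
```

For real sequences of unequal length, a pairwise alignment step would be needed before positions can be compared; that step is outside this sketch.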
Additionally, the engineered glucose dehydrogenase polypeptide of the present invention contains one or more of the following residue differences as compared to SEQ ID NO: 1: The residue corresponding to X6 is glutamate, or arginine; The residue corresponding to X7 is glycine, or glutamate; The residue corresponding to X9 is valine, or arginine; The residue corresponding to X15 is serine, or alanine; The residue corresponding to X16 is serine, cysteine, threonine, or alanine; The residue corresponding to X17 is threonine, or arginine; The residue corresponding to X19 is leucine, alanine, or tyrosine; The residue corresponding to X20 is glycine, or cysteine; The residue corresponding to X21 is lysine, or histidine; The residue corresponding to X22 is serine, alanine, or lysine; The residue corresponding to X25 is isoleucine, or valine; The residue corresponding to X29 is threonine, arginine, lysine, or alanine; The residue corresponding to X31 is lysine, glutamine, or asparagine; The residue corresponding to X33 is lysine, aspartate, arginine, or glutamine; The residue corresponding to X36 is valine, or arginine; The residue corresponding to X38 is tyrosine, or cysteine; The residue corresponding to X40 is serine, leucine, or glutamate; The residue corresponding to X41 is lysine, arginine, or glutamate; The residue corresponding to X42 is glutamate, lysine, or glutamine; The residue corresponding to X45 is alanine, or aspartate; The residue corresponding to X46 is asparagine, or aspartate; The residue corresponding to X47 is serine, aspartate, or lysine; The residue corresponding to X49 is leucine, or valine; The residue corresponding to X53 is lysine, or histidine; The residue corresponding to X56 is glycine, asparagine, serine, or aspartate; The residue corresponding to X57 is glycine, lysine, aspartate, proline, or asparagine; The residue corresponding to X58 is glutamate, lysine, or isoleucine; The residue corresponding 
to X60 is isoleucine, or arginine; The residue corresponding to X61 is alanine, lysine, or arginine; The residue corresponding to X62 is valine, or aspartate; The residue corresponding to X73 is isoleucine, or lysine; The residue corresponding to X74 is asparagine, or arginine; The residue corresponding to X78 is serine, glutamate, or lysine; The residue corresponding to X83 is phenylalanine, aspartate, or glutamate; The residue corresponding to X92 is asparagine, or cysteine; The residue corresponding to X95 is leucine, or isoleucine; The residue corresponding to X96 is glutamate, glutamine, valine, aspartate, alanine, isoleucine, or methionine; The residue corresponding to X97 is asparagine, isoleucine, or valine; The residue corresponding to X98 is proline, tyrosine, phenylalanine, threonine, asparagine, alanine, or serine; The residue corresponding to X100 is serine, threonine, alanine, or proline; The residue corresponding to X101 is serine, threonine, or alanine; The residue corresponding to X102 is histidine, or lysine; The residue corresponding to X105 is serine, lysine, or threonine; The residue corresponding to X107 is serine, or glutamate; The residue corresponding to X108 is aspartate, glutamate, or leucine; The residue corresponding to X110 is asparagine, arginine, or histidine; The residue corresponding to X113 is isoleucine, or aspartate; The residue corresponding to X117 is leucine, or tyrosine; The residue corresponding to X118 is threonine, lysine, arginine, or glutamate; The residue corresponding to X120 is alanine, or threonine; The residue corresponding to X122 is leucine, or glutamate; The residue corresponding to X131 is phenylalanine, or cysteine; The residue corresponding to X132 is valine, or aspartate; The residue corresponding to X137 is lysine, or cysteine; The residue corresponding to X138 is glycine, or cysteine; The residue corresponding to X139 is threonine, or aspartate; The 
residue corresponding to X146 is valine, aspartate, serine, alanine, isoleucine, or glutamate; The residue corresponding to X147 is histidine, serine, alanine, tyrosine, proline, arginine, glutamine, isoleucine, valine, asparagine, glycine, phenylalanine, threonine, or glutamate; The residue corresponding to X148 is glutamate, or cysteine; The residue corresponding to X149 is lysine, glutamate, threonine, or isoleucine; The residue corresponding to X151 is proline, valine, tyrosine, phenylalanine, alanine, aspartate, methionine, cysteine, glutamate, histidine, or serine; The residue corresponding to X153 is proline, methionine, asparagine, threonine, leucine, alanine, cysteine, or isoleucine; The residue corresponding to X154 is leucine, valine, tryptophan, glutamine, threonine, or asparagine; The residue corresponding to X155 is phenylalanine, aspartate, asparagine, isoleucine, proline, leucine, valine, serine, threonine, histidine, tryptophan, methionine, glutamine, glutamate, or cysteine; The residue corresponding to X160 is alanine, cysteine, or lysine; The residue corresponding to X163 is glycine, or alanine; The residue corresponding to X164 is glycine, or cysteine; The residue corresponding to X166 is lysine, arginine, or cysteine; The residue corresponding to X167 is leucine, or lysine; The residue corresponding to X168 is methionine, or cysteine; The residue corresponding to X170 is glutamate, or lysine; The residue corresponding to X175 is glutamate, or cysteine; The residue corresponding to X177 is alanine, cysteine, or aspartate; The residue corresponding to X179 is lysine, or arginine; The residue corresponding to X180 is glycine, cysteine, serine, or glutamate; The residue corresponding to X185 is asparagine, leucine, or glutamine; The residue corresponding to X187 is glycine, or alanine; The residue corresponding to X189 is glycine, lysine, glutamate, cysteine, aspartate, threonine, or alanine; The residue corresponding to X190 is alanine, cysteine, 
proline, or glycine; The residue corresponding to X191 is isoleucine, leucine, phenylalanine, serine, histidine, proline, tyrosine, methionine, or glycine; The residue corresponding to X192 is asparagine, aspartate, or arginine; The residue corresponding to X194 is proline, alanine, glutamine, valine, glutamate, methionine, histidine, or phenylalanine; The residue corresponding to X195 is isoleucine, glutamate, tryptophan, glycine, serine, valine, alanine, threonine, proline, histidine, aspartate, arginine, asparagine, glutamine, tyrosine, lysine, or methionine; The residue corresponding to X196 is asparagine, glutamate, threonine, or alanine; The residue corresponding to X197 is alanine, valine, tryptophan, histidine, asparagine, lysine, or isoleucine; The residue corresponding to X198 is glutamate, tyrosine, cysteine, histidine, valine, leucine, arginine, isoleucine, glycine, serine, methionine, asparagine, threonine, glutamine, phenylalanine, tryptophan, alanine, or aspartate; The residue corresponding to X203 is proline, alanine, or phenylalanine; The residue corresponding to X204 is glutamate, valine, glutamine, lysine, or alanine; The residue corresponding to X205 is glutamine, lysine, or arginine; The residue corresponding to X207 is alanine, asparagine, lysine, arginine, or serine; The residue corresponding to X208 is aspartate, glutamate, glycine, or lysine; The residue corresponding to X209 is valine, or threonine; The residue corresponding to X211 is serine, alanine, glutamate, glutamine, leucine, or methionine; The residue corresponding to X212 is methionine, leucine, or threonine; The residue corresponding to X214 is proline, or cysteine; The residue corresponding to X215 is methionine, cysteine, leucine, or glutamate; The residue corresponding to X216 is glycine, arginine, or valine; The residue corresponding to X217 is tyrosine, valine, or arginine; The residue corresponding to X218 is isoleucine, or aspartate; The residue corresponding to X220 is 
glutamate, or arginine; The residue corresponding to X222 is glutamate, lysine, or arginine; The residue corresponding to X223 is glutamate, or cysteine; The residue corresponding to X227 is valine, or lysine; The residue corresponding to X230 is tryptophan, phenylalanine, or tyrosine; The residue corresponding to X234 is serine, lysine, aspartate, or glutamate; The residue corresponding to X235 is glutamate, or arginine; The residue corresponding to X237 is serine, histidine, lysine, glutamate, arginine, or alanine; The residue corresponding to X238 is tyrosine, or cysteine; The residue corresponding to X240 is threonine, or lysine; The residue corresponding to X242 is isoleucine, glutamine, lysine, or glutamate; The residue corresponding to X243 is threonine, alanine, glycine, or lysine; The residue corresponding to X244 is leucine, isoleucine, or aspartate; The residue corresponding to X248 is glycine, cysteine, or lysine; The residue corresponding to X250 is methionine, isoleucine, asparagine, aspartate, serine, glycine, threonine, alanine, glutamate, cysteine, tryptophan, proline, or leucine; The residue corresponding to X252 is glutamine, or lysine; The residue corresponding to X253 is tyrosine, or cysteine; The residue corresponding to X255 is serine, cysteine, leucine, tyrosine, phenylalanine, histidine, glycine, glutamate, glutamine, alanine, or aspartate; The residue corresponding to X256 is phenylalanine, proline, glutamine, histidine, leucine, alanine, tryptophan, or arginine; The residue corresponding to X257 is glutamine, phenylalanine, alanine, cysteine, tyrosine, lysine, leucine, or methionine; The residue corresponding to X258 is alanine, arginine, tryptophan, glutamate, asparagine, lysine, or valine.
In some embodiments, the engineered glucose dehydrogenase polypeptides given in SEQ ID NOs: 2 to 30 can differ from SEQ ID NO: 1 by one or more of the above-mentioned substitutions in combination with one or more further residue differences (Table 3).
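As an illustration of how a variant can be compared against the substitution list above, the following sketch enumerates residue differences relative to a reference and flags any that fall outside a permitted set. The `PERMITTED` table holds only a small excerpt of the list (X6, X7, X9, X152 and X199, in one-letter codes), the ten-residue sequences are invented for the example, and all function names are hypothetical.

```python
# Illustrative only: excerpt of the substitution list, invented toy sequences.

PERMITTED = {
    6: {"E", "R"},   # X6: glutamate or arginine
    7: {"G", "E"},   # X7: glycine or glutamate
    9: {"V", "R"},   # X9: valine or arginine
    152: {"S"},      # X152S
    199: {"H"},      # X199H
}


def residue_differences(ref: str, var: str) -> list:
    """Return (position, reference residue, variant residue), 1-based positions."""
    if len(ref) != len(var):
        raise ValueError("sequences must be aligned to equal length")
    return [
        (i, r, v) for i, (r, v) in enumerate(zip(ref, var), start=1) if r != v
    ]


def format_difference(pos: int, ref_aa: str, var_aa: str) -> str:
    """Conventional substitution notation, e.g. G6E."""
    return f"{ref_aa}{pos}{var_aa}"


def unlisted_differences(diffs: list, permitted: dict) -> list:
    """Differences whose variant residue is not in the permitted set for that position."""
    return [d for d in diffs if d[2] not in permitted.get(d[0], set())]


toy_reference = "MKTAYGLVNQ"  # hypothetical, not SEQ ID NO: 1
toy_variant = "MKSAYELVVQ"    # differs at positions 3, 6 and 9

diffs = residue_differences(toy_reference, toy_variant)
print([format_difference(*d) for d in diffs])
print([format_difference(*d) for d in unlisted_differences(diffs, PERMITTED)])
```

Differences that are not in the excerpt are reported rather than rejected, since the polypeptides may carry further residue differences in addition to the listed substitutions.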
The invention offers a significant advantage by eliminating the need for traditional spatial alignment. By focusing on the chemical composition and atomic-level properties, the method can compare proteins without requiring superimposition or matching of their overall structures. This enables more accurate comparison of proteins with divergent overall shapes, allowing the identification of functionally relevant regions even in proteins with low global similarity.
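Why a composition-based descriptor needs no superimposition can be shown with a deliberately simplified stand-in. The sketch below counts atoms by element in concentric shells around a chosen center and compares two such count vectors by cosine similarity; because it depends only on distances from the center, it is unchanged when the structure is rotated. This is not the LSFG construction defined in this specification: the shell layout, the element set, the radius, and every function name are assumptions made for illustration.

```python
# Illustrative only: a simplified radial-shell histogram, not the LSFG itself.
import math
from collections import Counter

ATOM_TYPES = ("C", "N", "O", "S")  # assumed element set


def shell_descriptor(atoms, center, radius=8.0, n_shells=4):
    """Count atoms per (shell, element) within `radius` of `center`."""
    width = radius / n_shells
    counts = Counter()
    for element, xyz in atoms:
        d = math.dist(xyz, center)
        if d < radius and element in ATOM_TYPES:
            counts[(int(d // width), element)] += 1
    return [counts[(s, e)] for s in range(n_shells) for e in ATOM_TYPES]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rotate_z(atoms, angle):
    """Rotate all coordinates about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(el, (c * x - s * y, s * x + c * y, z)) for el, (x, y, z) in atoms]


# Invented coordinates (in angstroms) around a center at the origin.
origin = (0.0, 0.0, 0.0)
site = [
    ("C", (1.0, 0.5, 0.0)),
    ("N", (2.5, 1.0, 0.5)),
    ("O", (0.0, 3.0, 1.0)),
    ("S", (4.5, 0.0, 2.0)),
    ("C", (5.0, 3.0, 1.0)),
    ("O", (9.0, 0.0, 0.0)),  # outside the 8 angstrom sphere, ignored
]

original = shell_descriptor(site, origin)
rotated = shell_descriptor(rotate_z(site, 1.234), origin)
print(original == rotated, cosine_similarity(original, rotated))
```

Two sites with different overall orientation but similar local composition would therefore score as similar without any structural alignment step, which is the property the paragraph above relies on.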
The invention provides a computational approach that can identify high-energy residues and functional regions within proteins by analyzing their potential energy distribution. This leads to better predictions of protein function and the identification of critical sites for mutation or modification.
By capturing the localized atomic properties of a protein's structure through a grid-based approach, this method enables the precise engineering of proteins, antibodies, and enzymes. It allows for targeted modifications that enhance the stability, activity, and specificity of proteins for therapeutic and industrial purposes.
The use of localized spherical feature grids (LSFGs) and advanced machine learning models (e.g., a transformer-based similarity method) enables the analysis of a wide range of proteins, regardless of their structural differences. This versatility makes the invention applicable to numerous areas, including antibody engineering, enzyme engineering, and functional domain identification.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202341079086 | Dec 2023 | IN | national |