This application claims the benefit of European Application No. EP22177692 (filed on Jun. 7, 2022). The entirety of the foregoing application is incorporated by reference herein.
In addition, this application shares at least one common inventor with those individuals named as authors of the following publication: Gainza et al., “DE NOVO DESIGN OF PROTEIN INTERACTIONS WITH LEARNED SURFACE FINGERPRINTS,” Nature vol. 617 (May 4, 2023). The entirety of the foregoing publication is incorporated by reference herein.
Designing novel protein-protein interactions (PPIs) remains a fundamental challenge in computational protein design, with broad basic and translational applications in biology. The challenge consists of generating amino acid sequences that engage a target site and form a quaternary complex with a given protein, a stringent test of our understanding of the determinants that drive biomolecular interactions. Robust computational methods to design de novo PPIs could be used to rapidly engineer protein-based therapeutics, such as antibodies and inhibitors, as well as vaccines, and are therefore of major interest for biomedical and translational applications.
Despite recent advances in rational PPI design and prediction, designing novel protein binders against specific targets remains a challenge, particularly when no structural elements from preexisting binders are known. Current state-of-the-art methods for de novo PPI design, such as hotspot-centric approaches and rotamer information fields, rely on placing disembodied residues on the target interface and then optimizing their presentation on a protein scaffold. Intrinsic limitations of these approaches relate to the very weak energetic signatures that scoring functions provide for single side-chain placements, a problem that is compounded in flat interfaces that lack deep pockets. These methods also face the challenge of finding compatible protein scaffolds to precisely display the generated constellations of residues. To circumvent these limitations, new approaches are needed to design de novo binders to various surface types and protein sites.
A long-standing model of molecular recognition postulates that PPIs form between protein surfaces with chemical and geometric complementarity. The complementarity features arise as a consequence of the energetic contributions that are critical to stabilize PPIs, including van der Waals interactions (geometric complementarity), the hydrophobic effect, and electrostatic interactions (chemical complementarity). At the structural level, most protein interfaces contain surface regions that become inaccessible to solvent upon complex formation, which we refer to as the buried or core interface, as well as patches that are involved in the interface but also exposed to solvent, which we refer to as the interface rim. Residues within the buried areas tend to be much less tolerant of mutations and give larger contributions to the affinity of PPIs. Rim regions are often more polar and more tolerant of mutations, and also make important contributions to affinity and, more notably, specificity. Based on these general principles of molecular recognition, we introduce a novel protein design approach based on the critical importance of the fully buried patches of the interface to drive protein interactions. We implemented these design principles by leveraging surface fingerprints learned from interacting protein surfaces, which capture features that are determinant for molecular recognition.
Physical interactions between proteins are essential for most biological processes governing life. However, the determinants of such interactions have been challenging to understand, even as genomic, proteomic, and structural data grow. This knowledge gap has been a major obstacle for the complete understanding of cellular protein-protein interaction (PPI) networks and for the de novo design of protein binders that are crucial for synthetic biology and translational applications. We exploit a geometric deep learning framework to generate surface fingerprints from protein structures, which are learned from protein interfaces to describe geometric and chemical features critical to drive PPIs. We hypothesized that these fingerprints efficiently capture key aspects of molecular recognition, and used them as the foundation of a new PPI design method. As a proof-of-principle, we computationally designed four de novo protein binders to engage three protein targets: SARS-CoV-2 spike, PD-1, and PD-L1. The designs bound the target sites with high affinity upon experimental optimization, and structural and mutational characterization showed that the predictions were highly accurate. Overall, we present a surface-centric approach to describe structure and capture molecular recognition determinants, enabling a novel approach for the de novo design of protein interactions and, more broadly, of artificial proteins with function.
Pursuant to 37 C.F.R. § 1.84(a)(2), the patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The Figures described below depict various aspects of the system and methods disclosed herein. It should be understood that each Figure depicts an embodiment of a particular aspect of the disclosed system and methods, and that each of the Figures is intended to accord with a possible embodiment thereof. Further, wherever possible, the following description refers to the reference numerals included in the following Figures, in which features depicted in multiple Figures are designated with consistent reference numerals.
There are shown in the drawings arrangements which are presently discussed, it being understood, however, that the present embodiments are not limited to the precise arrangements and instrumentalities shown, wherein:
The Figures depict preferred embodiments for purposes of illustration only. Alternative embodiments of the systems and methods illustrated herein may be employed without departing from the principles of the invention described herein.
Design of De Novo PPIs Using Learned Surface Fingerprints.
We have introduced a geometric deep learning framework, MaSIF (Molecular Surface Interaction Fingerprinting), to generate surface fingerprints that embed the geometric and chemical features of molecular surfaces, and learn patterns that determine the propensity of protein interactions. Since MaSIF has robust performance in PPI-related prediction tasks, we hypothesized that we could leverage it to design novel PPIs by targeting sites only using structural information from the target. To address the de novo PPI design problem we devised a three-stage computational approach (
The new MaSIF-seed protocol tackles the problem of identifying binding seeds that can mediate productive binding interactions (see
In MaSIF-seed, protein molecular surfaces are decomposed into overlapping radial patches with a 12 Å radius, capturing on average nearly 400 Å² of surface area, consistent with the buried surface areas observed in native interfaces (
To benchmark our method, we built a database of 31 dimeric PPIs from the PDBBind database, where, in each pair, the interface of one of the proteins consists of a single α-helical domain. We then assembled a database of 1000 helical fragments (decomposed into 600K patches for MaSIF-seed) extracted randomly from the PDB, and benchmarked the capacity of MaSIF-seed to identify the true binder from the co-crystal structure in the correct orientation (<3 Å iRMSD) among 1000 decoys (
Encouraged by MaSIF-seed's speed and accuracy in discriminating the true binders from decoys based on rich surface features, we sought to design de novo protein binders to engage challenging protein targets in disease-relevant sites. We thus assembled a database of 250,000 structural fragments (140 million surface fingerprints) with helical secondary structure, extracted from the PDB. We computationally designed and experimentally validated binders against three structurally diverse targets: the receptor binding domain (RBD) of the SARS-CoV-2 spike protein where we identified a neutralization-sensitive site, and the two partners of the PD-1/PD-L1 complex, an important protein interaction in immuno-oncology that displays a flat interface and is considered a “hard-to-drug” target using small-molecules (
Table 1 is shown below, which shows an example benchmark of MaSIF-seed against traditional docking methods in recovering the native binder in the correct conformation from co-crystal structures for 31 helix-receptor complexes, discriminating between 1000 decoys.
With reference to Table 1, the first column identifies the benchmarked method. The second through fourth columns identify the number of receptors for which the given benchmarked method recovered the native binding helix (<3 Å iRMSD) within the top 1 (second column), top 10 (third column), and top 100 (fourth column) results. Finally, the fifth column shows the average running time of the given benchmarked method in minutes, excluding precomputation time.
Targeting a Predicted Neutralization-Sensitive Site on SARS-CoV-2 Spike Protein.
We applied our surface-centric approach to design de novo binders to target the SARS-CoV-2 RBD. First, we used MaSIF-site to predict surface sites on the RBD with high propensity to be engaged by protein binders. We selected a site distinct from the ACE2 binding region, but sufficiently proximal that a putative binder could inhibit the ACE2-RBD interaction (
Next, using the Rosetta MotifGraft protocol we identified several protein scaffolds compatible with both binding modes of the seed (
To structurally characterize the binding mode of DBR3_03 we solved a cryo-EM structure of the design in complex with the trimeric spike protein at 2.9 Å local resolution (
Targeting Flat Surface Sites in Immune Checkpoint Receptors.
Surface sites presenting flat structural features are difficult to target with small molecule drugs, leading to their categorization as undruggable. To test our fingerprint-based approach, we sought to design binders to target the PD-1/PD-L1 interaction, which is central to the regulation of T-cell activity in the immune system. We used MaSIF-site to find high propensity protein binding sites in PD-L1, and unsurprisingly, the identified site overlapped significantly with the native binding site engaged by PD-1 (
The second lead design based on a different scaffold, DBL2_01, could not be solubly expressed and therefore we designed a combinatorial library to improve expression and binding affinity (
Based on the SSM data, we generated the DBL2_04 design with additional polar mutations (
One-Shot Design of De Novo Protein Binders with Native-Like Affinities.
Despite the successes in designing site specific binders to engage two different targets, the computational designs still required in vitro evolution to enable expression and detectable binding affinities that could be biochemically characterized. To address these issues we assembled a more comprehensive design pipeline (
This was a promising result given that the design was not subjected to experimental optimization by in vitro evolution. Next, we performed an SSM study and observed that mutations at the predicted core interface positions (L23, L27, I30, M31) were generally deleterious for binding, supporting the structural and sequence accuracy of the design (
Physical interactions between proteins in living cells are one of the hallmarks of function. Our incomplete understanding of the complex interplay of molecular forces that drive PPIs has greatly hindered the comprehension of fundamental biological processes as well as the capability to engineer such interactions from first principles. It has been particularly challenging for protein modeling methodologies that use discrete atomic representations to perform de novo design of PPIs. In large part this is due to the small number of molecular interactions involved in most protein interfaces and to the very small energetic contributions that determine binding affinities, making physics-based energy functions less reliable. To address this gap we developed an enhanced data framework to represent proteins as surfaces and learn the geometric and chemical patterns that ultimately determine the propensity of two molecules to interact. We proposed a new geometric deep learning tool, MaSIF-seed, to overcome the PPI design challenge by both identifying patches with a high propensity to form buried surfaces and binding seeds with complementary surfaces to those patches. By embedding molecular surfaces as numerical fingerprints, we rapidly and reliably identify complementary surface fragments that can engage a specific target within 140 million candidate surfaces, solving an important challenge in protein design by efficiently handling search spaces of daunting scales.
The identified binding seeds were then used as the interface driving core to design novel binding proteins against three challenging targets: a novel predicted interface in the SARS-CoV-2 spike protein, which ultimately yielded a SARS-CoV-2 inhibitor, and the two partners of the PD-1/PD-L1 protein complex, which exemplifies sites that are difficult to target with small molecules due to their flat surfaces. Multiple designed binders closely matched the computationally predicted models and achieved high binding affinities, often after experimental optimization. In the case of the PD-1 binder, the design showed low micromolar affinity without experimental optimization, which is within the range of many native PPIs. By using surface fingerprints we identified novel structural motifs that can mediate de novo PPIs, presenting a route to expand the landscape of motifs that can be used to functionalize proteins, which may be critical for the de novo design of function.
For all three targets, the original binding seed arguably provided the principal driver of molecular recognition representing the design's binding interface core (
In our study several limitations of the approach became evident, namely the absence of conformational flexibility and adaptation of the protein backbone to mutations, and the difficulty of designing polar interactions that balance the hydrophobic patches of the interface and contribute to affinity and specificity, which has also been observed by other authors. In future methodological developments, neural network architectures could be optimized to capture such features of native interfaces. Lastly, our approach is intrinsically limited by the restricted structural space used for binding seeds with helical structure, which present advantages in terms of the modularity and predictability of their structures, but are unlikely to be the universal solution for the de novo PPI design problem. Most likely, a larger diversity of structural motifs will be necessary, and we anticipate the emergence of generative algorithms that can construct backbones conditioned on the target binding sites.
Here we presented a surface-centric design approach that leveraged molecular representations of protein structures based on learned geometrical and chemical features. We showed that these structural representations can be efficiently used for the design of de novo protein binders, one of the most challenging problems in computational protein design. We anticipate that this conceptual framework for the generation of rich descriptors of molecular surfaces can open possibilities in other important biotechnological fields such as drug design, biosensing, or biomaterials, in addition to providing a means to study interaction networks in biological processes at the systems level.
Description of Various Methods in Accordance with the Disclosure Herein.
The below describes various methods in accordance with the present disclosure and various embodiments herein.
Analysis of Buried Molecular Surface Areas on Transient Interactions
A dataset of protein-protein interactions was downloaded from the PDBBind database containing all interactions with a reported affinity stronger than 10 μM; since these PPIs have a reported affinity, all were assumed to be transient. The PDBBind database does not report the chains involved in the interaction with the reported affinity; thus, for simplicity, only those complexes containing exactly two chains in the PDB crystal structure were considered for analysis.
Computing PPI Buried Surface Areas on Protein Molecular Surfaces
The MSMS program was used to compute all molecular surfaces in this work (density=3.0, water radius=1.5 Å). Since MSMS produces molecular surfaces with highly irregular meshes, meshes were regularized to a resolution of 1.0 Å using PyMESH. To compute buried surface areas of protein subunits in co-crystallized complexes, the molecular surface of the isolated subunit was first computed, followed by the molecular surface of the complex. Any vertex in the surface of the subunit that was farther than 2.0 Å from every vertex in the surface of the complex was labeled as a buried surface vertex. The size of a buried area was computed by summing the area assigned to each vertex labeled as a buried surface vertex.
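The vertex-labeling step described above can be illustrated with a minimal sketch. It assumes the subunit and complex surfaces are already available as lists of (x, y, z) vertex coordinates in Å and that a per-vertex area has been precomputed. The function names and the per-vertex area input are hypothetical and are not taken from MSMS or PyMESH. The brute-force nearest-vertex search is for clarity only; a spatial index would likely be preferable for real meshes.

```python
import math

def label_buried_vertices(subunit_vertices, complex_vertices, cutoff=2.0):
    """Return a list of booleans, True where a subunit surface vertex is buried.

    A vertex counts as buried when no vertex of the complex surface lies
    within `cutoff` angstroms of it, i.e. that part of the subunit surface
    is no longer present once the partner is bound.
    """
    buried = []
    for v in subunit_vertices:
        nearest = min(math.dist(v, c) for c in complex_vertices)
        buried.append(nearest > cutoff)
    return buried

def buried_area(vertex_areas, buried):
    """Sum the (assumed precomputed) per-vertex areas of the buried vertices."""
    return sum(area for area, flag in zip(vertex_areas, buried) if flag)

# Toy example: the third subunit vertex has no counterpart on the complex surface.
subunit = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
complex_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
flags = label_buried_vertices(subunit, complex_surface)
total = buried_area([1.0, 1.0, 0.8], flags)
```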
We note that in this work we focus on the protein molecular surface (also known as the solvent excluded surface) as opposed to the solvent accessible surface, which is slightly larger. In most analyses of protein-protein interactions performed in the field, the solvent accessible surface area is used to measure the buried area of PPIs, and typically this value counts the interfaces of both partners that become buried. Thus, in general the areas presented in this work are less than half of the areas reported in other work and should not be compared with them directly.
Decomposing Protein Surfaces into Radial Patches.
In order to process protein surface information, all molecular surfaces are decomposed into overlapping radial patches. This means that each vertex on the surface becomes the center of a radial patch of a given radius. To compute the geodesic radius of patches, throughout this work we used the Dijkstra algorithm, a fast and simple approximation to the true geodesic distance in the patch. We used a radius of 12 Å for patches, limited to at most 200 points, which we found corresponds roughly to 400 Å² (
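As an illustration of the patch decomposition, the following sketch grows one radial patch around a center vertex using Dijkstra distances along mesh edges, stopping at the 12 Å radius or the 200-point cap. The mesh representation (vertex coordinates plus triangle index triples) and the function names are assumptions made for this example, and keeping the nearest points when the cap is reached is one plausible reading of the limit rather than a confirmed detail.

```python
import heapq
import math

def build_edge_graph(vertices, faces):
    """Adjacency map {vertex: {neighbor: edge length}} built from triangle faces."""
    graph = {i: {} for i in range(len(vertices))}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            length = math.dist(vertices[u], vertices[v])
            graph[u][v] = length
            graph[v][u] = length
    return graph

def radial_patch(graph, center, radius=12.0, max_points=200):
    """Return [(vertex, distance)] within `radius` of `center`, nearest first."""
    dist = {center: 0.0}
    heap = [(0.0, center)]
    done = set()
    patch = []
    while heap and len(patch) < max_points:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # a shorter route to u was already processed
        done.add(u)
        patch.append((u, d))
        for v, w in graph[u].items():
            nd = d + w
            if nd <= radius and nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return patch

# Toy mesh: a unit square split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
faces = [(0, 1, 2), (1, 3, 2)]
graph = build_edge_graph(vertices, faces)
small_patch = radial_patch(graph, center=0, radius=1.5)
```

Repeating the call with every vertex as `center` yields the overlapping patches described above.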
Computing the Largest Circumscribed Patch in Buried Surface Areas.
From each labeled interface point, we used the Dijkstra algorithm to compute the shortest distance to a non-interface point. The interface point with the largest distance to a non-interface point was labeled as the center of the interface, and the distance to the nearest non-interface point as the radius of the largest circumscribed patch.
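One way to sketch this computation is a multi-source Dijkstra search seeded from all non-interface points at once, which gives every interface point its shortest distance to the nearest non-interface point in a single pass. This is an equivalent reformulation chosen for the example, not necessarily how the original analysis was implemented. The graph format ({vertex: {neighbor: edge length}}) and the function name are assumptions.

```python
import heapq
import math

def largest_circumscribed_patch(graph, is_interface):
    """Return (center vertex, radius) of the largest patch inside the interface.

    graph: {vertex: {neighbor: edge length}}
    is_interface: {vertex: bool}
    Raises ValueError if no interface vertex can reach a non-interface vertex.
    """
    # Every non-interface vertex starts at distance zero.
    dist = {v: 0.0 for v, flag in is_interface.items() if not flag}
    heap = [(0.0, v) for v in dist]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    reachable = [v for v, flag in is_interface.items() if flag and v in dist]
    center = max(reachable, key=lambda v: dist[v])
    return center, dist[center]

# Toy example: a chain of five vertices whose middle three are interface points.
chain = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
labels = {0: False, 1: True, 2: True, 3: True, 4: False}
center, radius = largest_circumscribed_patch(chain, labels)
```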
Geometric and Chemical Features Used to Describe Molecular Surfaces.
Each point in a patch of the computed molecular surface was assigned an array of two geometric features (shape index, distance-dependent curvature), and three chemical features (hydrophobicity, Poisson-Boltzmann electrostatics, and a hydrogen bond potential). These features are identical to those described in Gainza et al.
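For illustration, the shape index is commonly defined from the two principal curvatures κ1 ≥ κ2 as (2/π)·arctan((κ1 + κ2)/(κ1 − κ2)), which ranges from −1 to 1. The sketch below implements that general definition. It is offered as background rather than as the exact implementation used here, the sign convention depends on the orientation of the surface normals, and returning 0 for a perfectly flat point (where the index is formally undefined) is a choice made for this example.

```python
import math

def shape_index(k1, k2):
    """Shape index in [-1, 1] from two principal curvatures.

    atan2 is used instead of a plain division so that umbilic points
    (k1 == k2) do not cause a division by zero.
    """
    if k1 < k2:
        k1, k2 = k2, k1  # enforce k1 >= k2
    return (2.0 / math.pi) * math.atan2(k1 + k2, k1 - k2)

cap = shape_index(1.0, 1.0)      # both curvatures equal and positive
saddle = shape_index(1.0, -1.0)  # equal and opposite curvatures
cup = shape_index(-1.0, -1.0)    # both curvatures equal and negative
```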
Prediction of Buried Surface Area Sites in Proteins.
The MaSIF-site tool was trained to predict buried surface areas on the surface of proteins. Here, MaSIF-site was used to predict buried surface areas in each of the 27 targets of our benchmark (
Identifying Complementary Surfaces Using Fingerprints.
MaSIF-search was used to compute fingerprints for every overlapping patch in proteins of interest. MaSIF-search was trained on a large dataset of protein-protein interactions to receive as input the features of a target patch, a binder patch, and a random patch, potentially from a different protein. MaSIF-search uses a Siamese architecture trained to produce similar fingerprints for the target patch vs. the binder patch, and dissimilar fingerprints for the target patch vs. the random patch. In order to decrease training time and improve performance, the features of the target were multiplied by −1, turning the problem from one of complementarity into one of similarity.
Decomposition of the PDB into Alpha Helical Peptides.
A snapshot of the non-redundant set of the PDB was downloaded and decomposed into alpha helices, removing all non-helical elements. The DSSP program was used to label each residue according to its secondary structure. Fragments with 10 or more consecutive residues with a helical (‘H’) label assigned by DSSP were extracted. Each extracted helical fragment was treated as a monomeric protein, and surface features were computed for each one. MaSIF-search fingerprints and MaSIF-site labels were then computed for all extracted helices.
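The fragment extraction rule can be sketched as a scan for runs of ‘H’ labels. The example assumes the DSSP assignment for one contiguous chain is already available as a string with one character per residue. Parsing DSSP output and handling chain breaks are outside the scope of this sketch, and the function name is hypothetical.

```python
def extract_helical_fragments(dssp_labels, min_length=10):
    """Return (start, end) index pairs, end exclusive, for runs of 'H'.

    Only runs of at least `min_length` consecutive 'H' labels are kept.
    """
    fragments = []
    start = None
    for i, label in enumerate(dssp_labels):
        if label == "H":
            if start is None:
                start = i  # a new helical run begins here
        elif start is not None:
            if i - start >= min_length:
                fragments.append((start, i))
            start = None
    # Close a run that extends to the end of the chain.
    if start is not None and len(dssp_labels) - start >= min_length:
        fragments.append((start, len(dssp_labels)))
    return fragments

# Toy labels: a 12-residue helix, a 5-residue helix (too short), a 10-residue helix.
labels = "CC" + "H" * 12 + "TT" + "H" * 5 + "C" + "H" * 10
fragments = extract_helical_fragments(labels)
```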
Dataset of Helix: Receptor Proteins.
The set of transient interactions from PDBBind was scanned to identify proteins that bind to helices. A protein was classified as a helix if 80% of its residues were helical and its total number of residues did not exceed 60. If both partners met the criteria, the PPI was discarded. The set was later curated to remove pairs of PPIs with high homology, and finally a set of 27 unique PPIs was defined. The helices from these receptors were added to the database of alpha helical peptides; MaSIF-search fingerprints and MaSIF-site fingerprints were computed for them.
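The selection rule for helix:receptor pairs can be written compactly, as in the sketch below. It assumes per-residue DSSP labels for each partner and reads the 80% criterion as "at least 80%", which is an interpretation made for this example. The function names are hypothetical, and the homology filtering step is not shown.

```python
def is_helix(dssp_labels, min_helical_fraction=0.8, max_residues=60):
    """True if the chain is short enough and mostly helical ('H' labels)."""
    n = len(dssp_labels)
    if n == 0 or n > max_residues:
        return False
    helical = sum(1 for label in dssp_labels if label == "H")
    return helical / n >= min_helical_fraction

def helical_partner(dssp_a, dssp_b):
    """Return 'A' or 'B' for the helical partner, or None if the pair is not used.

    A pair is not used when both partners are helices (discarded per the rule
    above) or when neither is (no helical binder is present).
    """
    a, b = is_helix(dssp_a), is_helix(dssp_b)
    if a == b:
        return None
    return "A" if a else "B"

peptide = "H" * 16 + "C" * 4   # 20 residues, 80% helical
receptor = "C" * 100
choice = helical_partner(peptide, receptor)
```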
MaSIF-Seed: A Pipeline to Identify Helical Fragments in Proteins.
Based on the MaSIF-ppi-search ultra-fast searching pipeline, we developed a novel pipeline to identify potential binding seeds to targets. For each target, MaSIF-site was first used to label each point on the surface with its propensity to form a buried surface region. After scanning the entire protein, the best patch was selected as the target site. In one case, the SARS-CoV-2 RBD, the fourth best site was selected, as it was the site with the highest potential to disrupt binding to the natural receptor. A MaSIF-search fingerprint was then computed for the target patch, inverting the target features before inputting them to the MaSIF-search network. The Euclidean distance between the target fingerprint and the 140 million fingerprints in the helical peptide database was then computed, and all patches with a fingerprint distance <2.0 (<1.7 for RBD) were accepted.
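The final matching step can be sketched as a distance filter over precomputed fingerprints. The example assumes fingerprints are plain numeric vectors that have already been produced upstream (including the feature inversion for the target). The dictionary-based database and the function name are hypothetical. A linear Python loop is shown only for clarity; a search over a database of the scale described above would presumably rely on vectorized or indexed distance computations.

```python
import math

def match_fingerprints(target_fingerprint, database, cutoff=2.0):
    """Return [(patch id, distance)] with distance below `cutoff`, closest first.

    database: {patch id: fingerprint vector of the same length as the target}
    """
    hits = []
    for patch_id, fingerprint in database.items():
        d = math.dist(target_fingerprint, fingerprint)
        if d < cutoff:
            hits.append((patch_id, d))
    hits.sort(key=lambda hit: hit[1])
    return hits

# Toy three-dimensional fingerprints; real fingerprints have more dimensions.
target = (0.0, 0.0, 0.0)
seed_database = {
    "helix_a": (1.0, 0.0, 0.0),
    "helix_b": (3.0, 0.0, 0.0),
    "helix_c": (0.0, 0.5, 0.0),
}
accepted = match_fingerprints(target, seed_database, cutoff=2.0)
```

A stricter cutoff, such as the 1.7 used for the RBD, is passed through the `cutoff` argument.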
Once fingerprints were matched, a second-stage alignment and scoring method used the RANSAC algorithm implemented in Open3D to align the patches, similar to that presented in Gainza et al. The RANSAC algorithm implemented in Open3D chooses three random points in the binder patch and computes the Euclidean distance of the MaSIF-search fingerprints between these points and all points in the target patch; the most similar fingerprints provide the RANSAC algorithm with three correspondences to compute a transformation between the patches. Once an alignment was made, it was scored by a neural network trained to classify true binders vs. non-binders. Binders with a neural network score above a threshold of 0.90-0.97 were accepted.
Clustering of Solutions.
In each design case all of the top matched seeds were clustered by first computing the root mean square deviation between all pairs of helices, computed on the C-alpha atoms of each pair of helices, in the segment overlapping over the buried surface area. The pairwise distances were then clustered using metric multi-dimensional scaling implemented in scikit-learn.
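The first half of this procedure, building the pairwise RMSD matrix, can be sketched as follows. The example assumes each seed has already been trimmed to the same number of C-alpha atoms over the overlapping segment and that all seeds are expressed in the target's coordinate frame, so no superposition is performed. Both assumptions, and the function names, are specific to this illustration. The resulting symmetric matrix is the kind of precomputed dissimilarity input that scikit-learn's multi-dimensional scaling can accept; that step is not shown.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of C-alpha coordinates (no fitting)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("segments must contain the same number of C-alpha atoms")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def pairwise_rmsd_matrix(helices):
    """Symmetric matrix of RMSD values between every pair of helices."""
    n = len(helices)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = ca_rmsd(helices[i], helices[j])
    return matrix

# Toy seeds with two C-alpha atoms each; the second is shifted 3 angstroms in z.
seeds = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
]
rmsd = pairwise_rmsd_matrix(seeds)
```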
Seed Refinement.
For the PD-1 target, seed candidates proposed by MaSIF were refined using RosettaScripts and a FastDesign protocol with a penalty for buried unsatisfied polar atoms in the scoring function. A total of 33 refined seeds were selected based on the computed binding energy, the shape complementarity, the number of hydrogen bonds, and the number of buried unsatisfied polar atoms.
Grafting of Seeds onto Monomeric Scaffolds and Computational Design with Rosetta.
A representative seed was selected from each solution space, and then matched using Rosetta MotifGraft to a database of 1300 monomeric scaffolds in the case of the RBD and PD-L1 designs. For PD-1 designs, the 33 selected seeds were grafted to a database of 4,347 small globular proteins (<100 amino acids), originating from the PDB, two computationally designed miniprotein databases, and one AF2 proteome prediction database. After grafting by Rosetta, a computational design protocol was used to design the outside of the interface for affinity to the target. Final designs were selected for yeast display based on the computed binding energy, the shape complementarity, the number of hydrogen bonds, and the number of buried unsatisfied polar atoms.
Yeast Surface Display of Single Designs.
DNA sequences of binder designs were purchased from Twist Bioscience containing homology overhangs for cloning. DNA was transformed with linearized pCTcon2 (Addgene #41843) or a modified pNTA vector with V5 tag into EBY-100 yeast using the Frozen-EZ Yeast Transformation II Kit (Zymo Research). Transformed yeast were passaged once in minimal glucose medium (SDCAA) before induction of surface display in minimal galactose medium (SGCAA) overnight at 30° C. Transformed cells were washed in cold PBS with 0.05-0.1% BSA and incubated with the binding target for 2 hours at 4° C. Cells were washed once and incubated for an additional 30 minutes with appropriate antibodies. Cells were washed and analyzed using a Gallios flow cytometer (Beckman Coulter). For quantitative binding measurements, binding was quantified by measuring the fluorescence of a PE-conjugated anti-human Fc antibody (Invitrogen) detecting the Fc-fused protein target. Yeast cells were gated for the displaying population only (V5 positive).
Yeast Libraries.
Combinatorial sequence libraries were constructed by assembling multiple overlapping primers containing degenerate codons at selected positions for combinatorial sampling of the binding interface, core residues, or hydrophobic surface residues. Primers were mixed (10 μM each) and assembled in a PCR reaction (55° C. annealing for 30 sec, 72° C. extension time for 1 min, 25 cycles). To amplify full-length assembled products, a second PCR reaction was performed, with forward and reverse primers specific for the full-length product. For SSM libraries and oligo pools, DNA was ordered from Twist Bioscience and amplified with primers to give homology to the pCTcon2/pNTA backbone. In all cases, the PCR product was desalted and used for transformation.
Yeast Surface Display of Libraries.
Combinatorial libraries, SSM libraries, and oligo pools were transformed as linear DNA fragments in a 5:1 ratio with linearized pCTcon2 or pNTA V5 vector as described previously into EBY-100 yeast. Transformation efficiency generally yielded around 10⁷ transformants per cuvette. Transformed yeast were passaged at least once in minimal glucose medium (SDCAA) before induction of surface display in minimal galactose medium (SGCAA) overnight at 30° C. Induced cells were labeled in the same manner as the single designs. Labeled cells were washed and sorted on a Sony SH800 cell sorter. For combinatorial libraries and oligo pool libraries, sorted cells were grown in SDCAA and prepared similarly for two additional rounds of sorting. After the third sort, cells were plated on SDCAA agar and single colonies were sequenced. SSM libraries were sorted once, collecting both binding and nonbinding populations, and grown in liquid culture for plasmid preparation.
MiSeq Sequencing.
After sorting, yeast cells were grown in SDCAA medium, pelleted and plasmid DNA was extracted using Zymoprep Yeast Plasmid Miniprep II (Zymo Research) following the manufacturer's instructions. The coding sequence of the designed variants was amplified using vector-specific primer pairs, Illumina sequencing adapters and Nextera barcodes were attached using an additional overhang PCR, and PCR products were desalted with Qiaquick PCR purification kit (Qiagen) or AMPure XP selection beads (Beckman Coulter). Next generation sequencing was performed using Illumina MiSeq with appropriate read length, yielding between 0.45-0.58 million reads/sample. For bioinformatic analysis, sequences were translated in the correct reading frame, and enrichment values were computed for each sequence.
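The bioinformatic step can be illustrated with a short sketch that translates in-frame reads and compares sequence frequencies between a binding and a nonbinding pool. The enrichment formula is not specified above, so the log2 frequency ratio with a pseudocount used here is an assumption, as are the function names and the choice to mark unrecognized codons as ‘X’. Read trimming, quality filtering, and frame detection are not shown.

```python
import math
from collections import Counter

# Standard genetic code, enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a coding sequence that starts in frame; unknown codons give 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def enrichment(binding_reads, nonbinding_reads, pseudocount=1.0):
    """log2 ratio of each protein's frequency in the binding vs. nonbinding pool."""
    bind = Counter(translate(read) for read in binding_reads)
    nonbind = Counter(translate(read) for read in nonbinding_reads)
    proteins = set(bind) | set(nonbind)
    bind_total = sum(bind.values()) + pseudocount * len(proteins)
    nonbind_total = sum(nonbind.values()) + pseudocount * len(proteins)
    return {
        p: math.log2(
            ((bind[p] + pseudocount) / bind_total)
            / ((nonbind[p] + pseudocount) / nonbind_total)
        )
        for p in proteins
    }

# Toy pools: the variant "MA" dominates the binding pool, "MD" the nonbinding pool.
binding = ["ATGGCT"] * 3 + ["ATGGAT"]
nonbinding = ["ATGGCT"] + ["ATGGAT"] * 3
scores = enrichment(binding, nonbinding)
```

Positive scores indicate variants that are more frequent in the binding population, and negative scores indicate variants that are depleted from it.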
Protein Expression and Purification.
DNA sequences were ordered from Twist Bioscience, and Gibson cloning or T7 ligation was used to clone them into bacterial (pET21b) or mammalian (pHLSec) expression vectors. Specific protein constructs and tags are listed in table AV12. Mammalian expressions were performed using the Expi293™ expression system from Thermo Fisher Scientific. Supernatant was collected 6 days post transfection, filtered, and purified. E. coli expressions were performed using BL21 (DE3) cells with IPTG induction (1 mM at OD 0.6-0.8) and growth overnight at 16-18° C. Pellets were lysed in lysis buffer (50 mM Tris, pH 7.5, 500 mM NaCl, 5% Glycerol, 1 mg/ml lysozyme, 1 mM PMSF, and 1 μg/ml DNase) with sonication, and the lysate was clarified and purified. All proteins were purified using an ÄKTA pure system (GE Healthcare) with either Ni-NTA affinity or protein A affinity columns followed by size exclusion chromatography. If TEV cleavage was necessary, fused proteins were dialyzed overnight at 4° C. (dialysis buffer 20 mM Tris pH 7.5, 150 mM NaCl, 10% glycerol) with an excess of TEV enzyme.
Surface Plasmon Resonance.
SPR measurements were performed on a Biacore 8K (GE Healthcare) with HBS-EP+ as running buffer (10 mM HEPES pH 7.4, 150 mM NaCl, 3 mM EDTA, 0.005% v/v Surfactant P20, GE Healthcare). Ligands were immobilized on a CM5 chip (GE Healthcare #29104988) via amine coupling. 500-1000 response units (RU) were immobilized and designed proteins were injected as an analyte in serial dilutions. The flow rate was 30 μl/min for a contact time of 120 s followed by 800 s dissociation time. After each injection, the surface was regenerated using 3 M magnesium chloride (for PD-L1) or 10 mM glycine, pH 3.0 (for RBD). Data were fit with 1:1 Langmuir binding model within the Biacore 8K analysis software (GE Healthcare #29310604).
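For reference, the 1:1 Langmuir binding model describes the sensor response R through dR/dt = ka·C·(Rmax − R) − kd·R, with KD = kd/ka. The sketch below evaluates the closed-form solution of that equation for one injection, using the 120 s contact time stated above as the default. It only simulates the model's forward curve; the actual fitting was performed in the instrument software, and the function name and the parameter values in the example are illustrative assumptions, not measured data.

```python
import math

def langmuir_response(t, conc, ka, kd, rmax, contact_time=120.0):
    """Response (RU) at time t (s) for analyte concentration conc (M).

    ka: association rate constant (1/(M*s)); kd: dissociation rate constant (1/s);
    rmax: response at full saturation of the immobilized ligand (RU).
    """
    kobs = ka * conc + kd
    req = rmax * ka * conc / kobs  # steady-state response at this concentration
    if t <= contact_time:
        # Association phase: exponential approach to the steady state.
        return req * (1.0 - math.exp(-kobs * t))
    # Dissociation phase: exponential decay from the response at injection end.
    r_end = req * (1.0 - math.exp(-kobs * contact_time))
    return r_end * math.exp(-kd * (t - contact_time))

# Illustrative parameters only: KD = kd / ka = 100 nM, analyte injected at 100 nM.
params = dict(conc=1e-7, ka=1e5, kd=1e-2, rmax=100.0)
at_injection_end = langmuir_response(120.0, **params)
after_dissociation = langmuir_response(920.0, **params)
```

When the analyte concentration equals KD, the steady-state response is half of Rmax, which provides a quick sanity check on fitted values.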
Biolayer Interferometry.
BLI measurements were performed on a Gator BLI system. The running buffer was 150 mM NaCl, 10 mM HEPES pH 7.5. Fc-tagged designs were diluted to 5 μg/mL and immobilized on the tips (1-2 nm immobilized). The loaded tips were then dipped into serial dilutions of either spike protein or RBD. Curves were fit using a 1:1 model on the Gator software after subtracting the background.
Size Exclusion Chromatography Multi-Angle Light Scattering (SEC-MALS).
Size exclusion chromatography with an online multi-angle light scattering device (miniDAWN TREOS, Wyatt) was used to determine the oligomeric state and molecular weight of the protein in solution. Purified proteins were concentrated to 1 mg/ml in PBS (pH 7.4), and 100 μl of the sample was injected into a Superdex 75 10/300 GL column (GE Healthcare) at a flow rate of 0.5 ml/min, and UV280 and light scattering signals were recorded. Molecular weight was determined using the ASTRA software (version 6.1, Wyatt).
Circular Dichroism.
Far-UV circular dichroism spectra were measured using a Chirascan™ spectrometer (Applied Photophysics) in a 1-mm path-length cuvette. The protein samples were prepared in a 10 mM sodium phosphate buffer at a protein concentration between 20 and 50 μM. Wavelengths between 200 nm and 250 nm were recorded with a scanning speed of 20 nm min−1 and a response time of 0.125 s. All spectra were averaged two times and corrected for buffer absorption. Temperature ramping melts were performed from 20 to 90° C. with an increment of 2° C./min. Thermal denaturation curves were plotted by the change of ellipticity at the global curve minimum to calculate the melting temperature (Tm).
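One simple way to read a Tm off such a denaturation curve is sketched below. The text does not state how Tm was computed from the plotted curve, so the midpoint rule used here is an assumption made purely for illustration; the function name and the synthetic ellipticity values are hypothetical.

```python
import math

def estimate_tm(temps, signal):
    """Estimate Tm as the temperature where the signal crosses the midpoint
    between its first (folded) and last (unfolded) values.

    Assumes a single, monotonic two-state transition; this midpoint rule is
    an illustrative choice, not necessarily the procedure used in the text.
    """
    midpoint = 0.5 * (signal[0] + signal[-1])
    for i in range(1, len(temps)):
        lo, hi = signal[i - 1], signal[i]
        if (lo - midpoint) * (hi - midpoint) <= 0 and lo != hi:
            # Linear interpolation between the two bracketing points
            frac = (midpoint - lo) / (hi - lo)
            return temps[i - 1] + frac * (temps[i] - temps[i - 1])
    return None  # no crossing found (e.g. the protein did not unfold)

# Synthetic melt: hypothetical ellipticity values with a true Tm of 65 degrees C,
# sampled every 2 degrees from 20 to 90 degrees C for illustration
temps = [20 + 2 * i for i in range(36)]
signal = [-20.0 + 15.0 / (1.0 + math.exp((65.0 - t) / 3.0)) for t in temps]
tm = estimate_tm(temps, signal)
```

A sigmoidal (two-state) fit with sloping baselines is a common alternative when the pre- and post-transition regions are not flat.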
Purification of Proteins for Crystallography.
PD-L1 extracellular domain fragment (from F19 to R238) was over-expressed as inclusion bodies in the BL21 (DE3) strain of E. coli. Renaturation and purification of PD-L1 was performed as previously described. Briefly, inclusion bodies of PD-L1 were diluted against a refolding buffer (100 mM Tris, pH 8.0; 400 mM L-Arginine; 5 mM EDTA-Na; 5 mM Glutathione (GSH); 0.5 mM Glutathione disulfide (GSSG)) at 4° C. for 24 h. Then the PD-L1 was concentrated and exchanged into a buffer of 20 mM Tris-HCl (pH 8.0) and 15 mM NaCl and further analyzed by HiLoad 16/60 Superdex 75 pg (Cytiva) chromatography.
PD-L1 binder designs, DBL1_03 and DBL2_02, were over-expressed in E. coli as inclusion bodies. Renaturation and purification of the PD-L1 binder designs was performed as for the PD-L1 protein. PD-L1 and binder designs were then mixed together at a molar ratio of 1:2 and incubated for 1 h on ice. The binder/PD-L1 complex was further purified by HiLoad 16/60 Superdex 75 pg (Cytiva) chromatography.
Data Collection and Structure Determination.
For crystal screening, 1 μl of binder/PD-L1 complex protein solution (10 mg/mL) was mixed with 1 μl of crystal growing reservoir solution. The resulting mixture was sealed and equilibrated against 100 μl of reservoir solution at 4° or 18° C. Crystals of DBL1_03/PD-L1 complex were grown in 0.2 M potassium/sodium tartrate, 0.1 M Bis Tris propane, pH 6.5, 20% w/v PEG 3350, and crystals of DBL2_02/PD-L1 complex were grown in 0.1 M sodium citrate tribasic dihydrate pH 5.5, 16% w/v polyethylene glycol 8,000. Crystals were flash-cooled in liquid nitrogen after incubating in anti-freezing buffer (reservoir solution containing 20% (v/v) glycerol). Diffraction data of crystals were collected at Shanghai Synchrotron Radiation Facility (SSRF) BL19U. The collected intensities were subsequently processed and scaled using the Denzo program and the HKL2000 software package (HKL Research). The structures were determined using molecular replacement with the program Phaser MR in CCP4, with the reported PD-L1 structure (PDB: 3RRQ) as the search model. COOT and PHENIX were used for subsequent model building and refinement. The stereochemical qualities of the final model were assessed with MolProbity. Structure-related figures were generated using PyMOL.
Luminex Binding Assays.
Luminex beads were prepared as previously published. Briefly, MagPlex beads were covalently coupled to SARS-CoV-2 spike proteins of different variants. Serial dilutions of the antibodies or designs were performed and binding curves were calculated as previously published. Response curves were fit using GraphPad Prism nonlinear four-parameter curve fitting analysis of the log(agonist) versus response.
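The four-parameter log(agonist) versus response curve has the general form sketched below. This only evaluates the curve; it does not reproduce the nonlinear fitting performed in GraphPad Prism, and the function name and parameter values are hypothetical.

```python
def four_param_logistic(log_conc, bottom, top, log_ec50, hill_slope):
    """Four-parameter log(agonist) vs. response curve.

    log_conc and log_ec50 are log10 concentrations; bottom and top are the
    lower and upper plateaus; hill_slope sets the steepness.
    """
    return bottom + (top - bottom) / (1.0 + 10.0 ** ((log_ec50 - log_conc) * hill_slope))

# Hypothetical parameters for illustration only (not measured values)
bottom, top, log_ec50, hill = 0.0, 100.0, -8.0, 1.0
curve = [four_param_logistic(x, bottom, top, log_ec50, hill) for x in (-10, -9, -8, -7, -6)]
# At the EC50 the response lies halfway between the two plateaus
half_max = four_param_logistic(log_ec50, bottom, top, log_ec50, hill)
```

Fitting consists of choosing the four parameters that minimize the squared difference between this curve and the measured responses across the dilution series.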
Live Virus Neutralization Assays.
The virus neutralization assays were performed as previously published. Briefly, VeroE6 cells were seeded in 96 well plates the day before the infection. The DBR3_03-Fc compound in serial dilutions was mixed with omicron virus and incubated at 37° C. for one hour before addition to the cells. The cells with virus were kept a further 48 hours at 37° C., then washed and fixed for crystal violet staining and analysis. Neutralization EC50 calculations were performed using GraphPad Prism nonlinear four parameter curve fitting analysis.
Cryo-EM Sample Preparation and Data Acquisition.
The carbon/copper grids (Quantifoil R2/1, 400 mesh) were glow-discharged at 15 mA for 30 seconds. A 3.0 μl aliquot of the spikeD614G-binder sample or the spikeOmicron-binder sample, at a concentration of 0.87 or 1.0 mg/ml, respectively, was applied onto the grids and blotted for 4.0-8.0 s, then flash-frozen in a pre-cooled liquid ethane/propane mixture using a Vitrobot Mark IV (Thermo Fisher Scientific) operated at 4° C. with 100% humidity.
The spikeD614G-binder dataset, comprising 20,794 movies, was collected on a Titan Krios G4 microscope (Cold FEG, Thermo Fisher Scientific), operated at 300 kV and equipped with a Falcon4 direct detection camera (Thermo Fisher Scientific). Movies were recorded in counting mode using the EPU automation program (Thermo Fisher Scientific) with a physical pixel size of 0.40 Å per pixel and the defocus ranging from −0.8 to −2.0 μm. Exposures were calibrated to 60 e-/Å2 total dose. For the spikeOmicron-binder data, 22,266 movies were recorded on a Titan Krios G4 microscope equipped with a SelectrisX imaging filter (Thermo Fisher Scientific) containing a Falcon4 direct detection camera. Exposures were adjusted to 60 e-/Å2 total dose with a physical pixel size of 0.726 Å per pixel and a defocus range of −0.8 to −2.5 μm. The raw movies were exported in EER format.
Cryo-EM Image Processing, Model Building and Refinement.
Details of the image processing are shown in
CTF estimation was performed using the patch-based option in cryoSPARC. For the sample of spikeD614G in complex with the de novo designed binder, 832,816 particles were automatically picked by the template picker, followed by three rounds of 2D classification, resulting in a particle set of 184,763 particles. With the ab-Initio and hetero-Refine implementation in cryoSPARC, the particles were grouped into three classes. The best 3D class, composed of 97,804 particles, was further subjected to another round of ab-Initio reconstruction and hetero-Refinement. The well-resolved class consisting of 50,448 particles resulted in a 2.63 Å overall resolution global map in C1 symmetry. The binder-RBD region was refined with a soft mask, resulting in a local map at 3.10 Å resolution. Resolution of all the 3D maps was estimated based on the Fourier shell correlation (FSC) with a cutoff value of 0.143. For the data processing of the spikeOmicron-binder complex sample, 1,820,333 particles were picked using the cryoSPARC template-based implementation. After two rounds of 2D classification, 981,561 particles were selected and then subjected to ab-Initio reconstruction and hetero-Refinement, resulting in a set of 595,599 particles. Subsequently, the selected particle set was classified by multiple rounds of 3D classification in cryoSPARC. The best-resolved 3D class containing 50,758 particles resulted in a 2.80 Å overall resolution map, and the binder-RBD region was further improved by focused refinement with a soft mask, reaching a 3.29 Å resolution as estimated by FSC at the 0.143 cutoff.
For model building of spikeD614G-binder, the previous model (PDB: 7BNO, spikeD614G) was used as a starting model for the spikeD614G region. The model was rigid-body fit into the resulting cryo-EM density in UCSF Chimera and adjusted manually in Coot 0.9.4. De novo building of the binder parts was performed manually in Coot 0.9.4. For the building of the spikeOmicron-binder structure, the model (PDB: 7QO7, spikeOmicron) was fit into the density, then rebuilt and extended manually using UCSF Chimera and Coot 0.9.4. After the structural rebuilding, all the atomic models were refined using the Phenix (1.19.2-4158) implementation of real_space_refine with general restraints. EM densities and atomic models were visualized in UCSF Chimera, UCSF ChimeraX and PyMOL.
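The FSC-based resolution estimate can be illustrated with a short sketch: given a correlation curve over spatial frequency, the reported resolution is the reciprocal of the frequency at which the curve first falls below 0.143. The function name and the example curve are hypothetical, and linear interpolation between shells is an assumption of this sketch rather than a description of what cryoSPARC does internally.

```python
def fsc_resolution(spatial_freqs, fsc_values, cutoff=0.143):
    """Return the resolution (in Angstrom) at which the FSC first drops below cutoff.

    spatial_freqs: increasing spatial frequencies in 1/Angstrom.
    fsc_values: Fourier shell correlation for each shell.
    """
    for i in range(1, len(fsc_values)):
        prev, curr = fsc_values[i - 1], fsc_values[i]
        if prev >= cutoff > curr:
            # Linear interpolation of the frequency at the crossing
            frac = (prev - cutoff) / (prev - curr)
            freq = spatial_freqs[i - 1] + frac * (spatial_freqs[i] - spatial_freqs[i - 1])
            return 1.0 / freq
    return None  # curve never falls below the cutoff

# Hypothetical FSC curve for illustration only
freqs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50]
fsc = [1.00, 0.99, 0.95, 0.60, 0.043, 0.01]
resolution = fsc_resolution(freqs, fsc)
```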
Data Availability.
Cryo-EM maps for spikeD614G-binder full, spikeD614G-binder local, spikeOmicron-binder full and spikeOmicron-binder local were deposited in the Electron Microscopy Data Bank respectively under the codes of EMD-14947 (spike(D614G)-binder full and spike(D614G)-binder local maps), EMD-14922 (spike(Omicron)-binder full) and EMD-14930 (spike(Omicron)-binder local). Atomic models were deposited at the PDB under the following accession codes: 7ZSS (spike(D614G)-binder), 7ZRV (spike(Omicron)-binder full) and 7ZSD (spike(Omicron)-binder local). Crystal structures have been deposited at the PDB under the following accession codes: 7XYQ (DBL1_03-PD-L1 complex) and 7XAD (DBL2_02-PD-L1 complex).
Additional Aspects of the Disclosure.
Although some aspects have been described in the context of a system, apparatus, and/or method, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Embodiments of the invention may be implemented on a computer system. The computer system may be a local computer device (e.g. personal computer, laptop, tablet computer or mobile phone) with one or more processors and one or more storage devices or may be a distributed computer system (e.g. a cloud computing system with one or more processors and one or more storage devices distributed at various locations, for example, at a local client and/or one or more remote server farms and/or data centers). The computer system may comprise any circuit or combination of circuits. In one embodiment, the computer system may include one or more processors which can be of any type. As used herein, processor may mean any type of computational circuit, such as but not limited to a microprocessor, a microcontroller, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a graphics processor, a digital signal processor (DSP), multiple core processor, a field programmable gate array (FPGA), or any other type of processor or processing circuit. Other types of circuits that may be included in the computer system may be a custom circuit, an application-specific integrated circuit (ASIC), or the like, such as, for example, one or more circuits (such as a communication circuit) for use in wireless devices like mobile telephones, tablet computers, laptop computers, two-way radios, and similar electronic systems. The computer system may include one or more storage devices, which may include one or more memory elements suitable to the particular application, such as a main memory in the form of random access memory (RAM), one or more hard drives, and/or one or more drives that handle removable media such as compact disks (CD), flash memory cards, digital video disk (DVD), and the like. 
The computer system may also include a display device, one or more speakers, and a keyboard and/or controller, which can include a mouse, trackball, touch screen, voice-recognition device, or any other device that permits a system user to input information into and receive information from the computer system.
Some or all of the method steps may be executed by (or using) a hardware apparatus, for example, a processor, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a non-transitory storage medium such as a digital storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may, for example, be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the present invention is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the present invention is, therefore, a storage medium (or a data carrier, or a computer-readable medium) comprising, stored thereon, the computer program for performing one of the methods described herein when it is performed by a processor. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory. A further embodiment of the present invention is an apparatus as described herein comprising a processor and the storage medium.
A further embodiment of the invention is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example, via the internet.
A further embodiment comprises a processing means, for example, a computer or a programmable logic device, configured to, or adapted to, perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.
Embodiments may be based on using a machine-learning model or machine-learning algorithm. Machine learning may refer to algorithms and statistical models that computer systems may use to perform a specific task without using explicit instructions, instead relying on models and inference. For example, in machine-learning, instead of a rule-based transformation of data, a transformation of data may be used, that is inferred from an analysis of historical and/or training data. For example, the content of images may be analyzed using a machine-learning model or using a machine-learning algorithm. In order for the machine-learning model to analyze the content of an image, the machine-learning model may be trained using training images as input and training content information as output. By training the machine-learning model with a large number of training images and/or training sequences (e.g. words or sentences) and associated training content information (e.g. labels or annotations), the machine-learning model “learns” to recognize the content of the images, so the content of images that are not included in the training data can be recognized using the machine-learning model. The same principle may be used for other kinds of sensor data as well: By training a machine-learning model using training sensor data and a desired output, the machine-learning model “learns” a transformation between the sensor data and the output, which can be used to provide an output based on non-training sensor data provided to the machine-learning model. The provided data (e.g. sensor data, metadata and/or image data) may be preprocessed to obtain a feature vector, which is used as input to the machine-learning model.
Machine-learning models may be trained using training input data. The examples specified above use a training method called “supervised learning”. In supervised learning, the machine-learning model is trained using a plurality of training samples, wherein each sample may comprise a plurality of input data values, and a plurality of desired output values, i.e. each training sample is associated with a desired output value. By specifying both training samples and desired output values, the machine-learning model “learns” which output value to provide based on an input sample that is similar to the samples provided during the training. Apart from supervised learning, semi-supervised learning may be used. In semi-supervised learning, some of the training samples lack a corresponding desired output value. Supervised learning may be based on a supervised learning algorithm (e.g. a classification algorithm, a regression algorithm or a similarity learning algorithm). Classification algorithms may be used when the outputs are restricted to a limited set of values (categorical variables), i.e. the input is classified to one of the limited set of values. Regression algorithms may be used when the outputs may have any numerical value (within a range). Similarity learning algorithms may be similar to both classification and regression algorithms but are based on learning from examples using a similarity function that measures how similar or related two objects are. Apart from supervised or semi-supervised learning, unsupervised learning may be used to train the machine-learning model. In unsupervised learning, (only) input data might be supplied and an unsupervised learning algorithm may be used to find structure in the input data (e.g. by grouping or clustering the input data, finding commonalities in the data).
Clustering is the assignment of input data comprising a plurality of input values into subsets (clusters) so that input values within the same cluster are similar according to one or more (pre-defined) similarity criteria, while being dissimilar to input values that are included in other clusters.
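As one concrete instance of clustering, a basic k-means loop on scalar values is sketched below. The choice of k-means, the absolute-distance similarity criterion, the function name and the input values are all assumptions made for illustration; the disclosure does not prescribe a particular clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=20):
    """Cluster scalar values with a basic k-means loop.

    centers: initial cluster centers (their count fixes the number of clusters).
    The similarity criterion here is absolute distance to the nearest center.
    """
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the closest center
        labels = [min(range(len(centers)), key=lambda k: abs(v - centers[k])) for v in values]
        # Update step: each center moves to the mean of its members
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers

# Hypothetical input values forming two visibly separate groups
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
labels, centers = kmeans_1d(values, centers=[0.0, 5.0])
```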
Reinforcement learning is a third group of machine-learning algorithms. In other words, reinforcement learning may be used to train the machine-learning model. In reinforcement learning, one or more software actors (called “software agents”) are trained to take actions in an environment. Based on the taken actions, a reward is calculated. Reinforcement learning is based on training the one or more software agents to choose the actions such that the cumulative reward is increased, leading to software agents that become better at the task they are given (as evidenced by increasing rewards).
Furthermore, some techniques may be applied to some of the machine-learning algorithms. For example, feature learning may be used. In other words, the machine-learning model may at least partially be trained using feature learning, and/or the machine-learning algorithm may comprise a feature learning component. Feature learning algorithms, which may be called representation learning algorithms, may preserve the information in their input but also transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions. Feature learning may be based on principal components analysis or cluster analysis, for example.
In some examples, anomaly detection (i.e. outlier detection) may be used, which is aimed at providing an identification of input values that raise suspicions by differing significantly from the majority of input or training data. In other words, the machine-learning model may at least partially be trained using anomaly detection, and/or the machine-learning algorithm may comprise an anomaly detection component.
In some examples, the machine-learning algorithm may use a decision tree as a predictive model. In other words, the machine-learning model may be based on a decision tree. In a decision tree, observations about an item (e.g. a set of input values) may be represented by the branches of the decision tree, and an output value corresponding to the item may be represented by the leaves of the decision tree. Decision trees may support both discrete values and continuous values as output values. If discrete values are used, the decision tree may be denoted a classification tree, if continuous values are used, the decision tree may be denoted a regression tree.
Association rules are a further technique that may be used in machine-learning algorithms. In other words, the machine-learning model may be based on one or more association rules. Association rules are created by identifying relationships between variables in large amounts of data. The machine-learning algorithm may identify and/or utilize one or more relational rules that represent the knowledge that is derived from the data. The rules may e.g. be used to store, manipulate or apply the knowledge.
Machine-learning algorithms are usually based on a machine-learning model. In other words, the term “machine-learning algorithm” may denote a set of instructions that may be used to create, train or use a machine-learning model. The term “machine-learning model” may denote a data structure and/or set of rules that represents the learned knowledge (e.g. based on the training performed by the machine-learning algorithm). In embodiments, the usage of a machine-learning algorithm may imply the usage of an underlying machine-learning model (or of a plurality of underlying machine-learning models). The usage of a machine-learning model may imply that the machine-learning model and/or the data structure/set of rules that is the machine-learning model is trained by a machine-learning algorithm.
For example, the machine-learning model may be an artificial neural network (ANN). ANNs are systems that are inspired by biological neural networks, such as can be found in a retina or a brain. ANNs comprise a plurality of interconnected nodes and a plurality of connections, so-called edges, between the nodes. There are usually three types of nodes: input nodes that receive input values, hidden nodes that are (only) connected to other nodes, and output nodes that provide output values. Each node may represent an artificial neuron. Each edge may transmit information from one node to another. The output of a node may be defined as a (non-linear) function of its inputs (e.g. of the sum of its inputs). The inputs of a node may be used in the function based on a “weight” of the edge or of the node that provides the input. The weight of nodes and/or of edges may be adjusted in the learning process. In other words, the training of an artificial neural network may comprise adjusting the weights of the nodes and/or edges of the artificial neural network, i.e. to achieve a desired output for a given input.
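The weight adjustment described above can be illustrated with the smallest possible case: a single output neuron with two input edges and no hidden nodes, trained by gradient descent. This is a simplified sketch under stated assumptions (sigmoid activation, cross-entropy loss, a toy logical-OR training set, hypothetical function names), not the network architecture used in any embodiment.

```python
import math

def neuron_output(weights, bias, inputs):
    """Output of one artificial neuron: sigmoid of the weighted input sum."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))

def train_neuron(samples, epochs=2000, learning_rate=0.5):
    """Adjust edge weights and bias by gradient descent on cross-entropy loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = neuron_output(weights, bias, inputs) - target
            # Each weight moves against its contribution to the error
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Toy training set (logical OR), used purely for illustration
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(samples)
predictions = [round(neuron_output(weights, bias, x)) for x, _ in samples]
```

Networks with hidden nodes follow the same principle, with the error propagated backwards through each layer to update every edge weight.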
Alternatively, the machine-learning model may be a support vector machine, a random forest model or a gradient boosting model. Support vector machines (i.e. support vector networks) are supervised learning models with associated learning algorithms that may be used to analyze data (e.g. in classification or regression analysis). Support vector machines may be trained by providing an input with a plurality of training input values that belong to one of two categories. The support vector machine may be trained to assign a new input value to one of the two categories. Alternatively, the machine-learning model may be a Bayesian network, which is a probabilistic directed acyclic graphical model. A Bayesian network may represent a set of random variables and their conditional dependencies using a directed acyclic graph. Alternatively, the machine-learning model may be based on a genetic algorithm, which is a search algorithm and heuristic technique that mimics the process of natural selection.
Number | Date | Country | Kind |
---|---|---|---|
22177692.5 | Jun 2022 | EP | regional |