The Sequence Listing, which is a part of the present disclosure, is submitted concurrently with the specification in computer readable format. The name of the file containing the Sequence Listing is “57581_Seglisting.xml”, which was created on Feb. 9, 2023 and is 489,494 bytes in size. The subject matter of the Sequence Listing is incorporated herein in its entirety by reference.
The present application claims priority to U.S. Provisional Patent Application No. 63/478,933, entitled UNLOCKING DE NOVO ANTIBODY DESIGN WITH GENERATIVE ARTIFICIAL INTELLIGENCE, filed Jan. 7, 2023, and U.S. Provisional Patent Application No. 63/308,495, entitled STRUCTURE-BASED DESIGN OF BINDING PARTNER-TARGETED BIOMOLECULES, filed Feb. 9, 2022. The contents of each are incorporated herein by reference in their entireties.
Antibodies are diverse proteins used by the immune system to naturally bind and neutralize foreign objects, such as viruses, fungi, and bacteria. Accordingly, antibodies have the potential to serve as drug candidates for many infectious diseases and cancers. Conventional techniques for in silico automatic design and/or improvement of antibodies are lacking, for example, because such deep learning-based protein design techniques design proteins unconditionally, use only open source data, and do not enable targeted design to a specified binding partner. Such techniques have also not demonstrated the ability to condition design on other criteria/aims. Conventional laboratory-based techniques are time-consuming and expensive. For example, traditional de novo antibody discovery requires time and resource intensive screening of large immune or synthetic libraries. These methods also offer little control over the output sequences, which can result in lead candidates with sub-optimal binding and poor developability attributes. Conventional generative antibody design techniques have not demonstrated de novo antibody design with experimental validation. Thus, improved techniques for protein design are needed.
In one aspect, a computing system for training a machine learning model to generate structural information of a target biomolecule includes one or more processors; and one or more non-transitory computer-readable media having stored thereon instructions that, when executed by the one or more processors, cause the computing system to: (1) receive one or more training inputs, including one or more of (i) input biomolecule structural information, (ii) input biomolecule binding partner structural information or (iii) input biomolecule-input binding partner binding complex structural information; (2) process the one or more training inputs with a machine-learned biomolecule prediction model to generate predicted biomolecule structural information; (3) evaluate a loss function that compares the predicted biomolecule structural information to a ground truth value; and (4) modify one or more values of one or more parameters of the machine-learned model based at least in part on the loss function.
In another aspect, a computing system for generating structural information of a target biomolecule includes one or more processors; and one or more non-transitory computer-readable media having stored thereon: a machine-learned biomolecule prediction model trained to predict structural information of biomolecules based on an input; and instructions that, when executed by the one or more processors, cause the computing system to: (1) receive a target input including one or more of a target binding partner primary sequence, three-dimensional coordinates of a target binding partner, a target binding partner epitope primary sequence, or three-dimensional coordinates of a target binding partner epitope primary sequence, or a fragment or portion of any of the foregoing; and (2) predict the structural information of the target biomolecule by processing the target input with the machine-learned biomolecule prediction model.
In yet another aspect, a computing system for predicting an affinity of a target biomolecule includes one or more processors; and one or more non-transitory computer-readable media having stored thereon: a machine-learned affinity prediction artificial neural network, including: (i) one or more biomolecule prediction layers trained to predict biomolecule structural information from target inputs; (ii) one or more docking layers trained to generate docked complexes from two or more input three-dimensional biomolecules; and (iii) one or more affinity prediction layers trained to predict affinity from input docked complexes; wherein the one or more biomolecule prediction layers, the one or more docking layers, and the one or more affinity prediction layers are connected; and instructions that, when executed by the one or more processors, cause the computing system to: (1) receive a target input comprising one or more of a target binding partner sequence, a target binding partner, or a target epitope; and (2) process the target input using the affinity prediction artificial neural network to generate a docked complex corresponding to the target input and a corresponding structural affinity value.
The present disclosure addresses the need for artificial intelligence (AI) and machine learning (ML) models trained to predict biomolecule sequences using known biomolecule/binding partner complexes. In particular, generative AI has the potential to greatly increase the speed, quality and controllability of biomolecule (e.g., antibody) design. The present techniques include using generative deep learning models to design antibodies de novo against three distinct targets in a zero-shot fashion, where all designs are the result of a single round of model generations with no follow-up optimization. In particular, the present techniques may screen a large number (e.g., 400,000 or more) of antibody variants designed for binding to human epidermal growth factor receptor 2 (HER2) using high-throughput wet lab capabilities.
From these screens, the present techniques are able to further characterize 421 binders biophysically using surface plasmon resonance (SPR), finding that three bind tighter than the therapeutic antibody trastuzumab. The binders are highly diverse and have low sequence identity to known antibodies. Additionally, these binders score highly on our previously introduced Naturalness metric (see Bachas S, Rakocevic G, Spencer D, Sastry A V, Haile R, Sutton J M, et al. Antibody optimization enabled by artificial intelligence predictions of binding affinity and naturalness. bioRxiv. 2022; doi:10.1101/2022.08.16.504181.), indicating that they are likely to possess desirable developability profiles and low immunogenicity. These results unlock a path to accelerated drug creation for novel therapeutic targets using generative AI combined with high throughput experimentation.
A particularly difficult aspect of antibody drug creation is the initial step of lead candidate identification due to the labor intensive and uncontrolled nature of traditional screening methods. Generative AI-based de novo design has the potential to overcome these shortcomings of the current drug discovery process. The zero-shot nature of the AI design approach obviates the need for cumbersome library screening to identify binding molecules, generating large time and cost savings. Furthermore, the controllable nature of model-based design allows for the creation of proteins optimized for developability and immunogenicity characteristics, mitigating downstream developability risks.
The present techniques provide steps towards fully de novo antibody design by demonstrating the ability to generate, in a zero-shot fashion, novel antibody variants that confer binding and natural sequence characteristics comparable to and, in some cases, superior to those of the parent antibody. The AI-generated sequences are distinct from any observed in the model training set and the vast majority are distinct from the known sequences in the Observed Antibody Space (OAS) database, yet maintain high Naturalness scores, showing the model can design antibody sequences along a biologically feasible manifold. Furthermore, the designed sequences are highly dissimilar from one another, indicating the ability to design a diverse solution set of binding molecules. Additionally, we demonstrate progress in designing multiple CDRs de novo by creating and validating binders with up to 3 novel heavy chain CDRs using a modified multi-step approach. The present techniques are generalizable, as demonstrated by deployment of the present generative design methods to distinct antigens. Additionally, developing epitope-specificity across multiple antigens for antibody designs may allow for precise interaction with biologically relevant target regions associated with disease mechanisms of action. In addition to advancements on the generative modeling front, the speed and scale of wet lab validation for AI-generated designs will progressively increase as the time and cost of DNA synthesis continue to decline.
Antibodies in particular are a growing class of therapeutic molecules due to their attractive drug-like properties, including high target selectivity and minimal immunogenic effects. Antibody drug development commonly begins with initial lead molecule discovery. Existing approaches for lead discovery typically consist of randomly searching through a massive combinatorial sequence space by screening large libraries of antibody variants against a target antigen. Techniques such as phage display, yeast display, and immunization coupled with hybridoma screening or B-cell sequencing are typically employed for initial discovery, followed by further molecule development. These methods are time and resource intensive, lack control over the properties of the designed antibodies, and often produce sub-optimal leads. Applying generative artificial intelligence (AI) to design de novo antibodies in a zero-shot and controllable fashion, rather than screening and developing lead molecules, may drastically reduce the time and resources necessary for therapeutic antibody development.
The application of AI methods to antibody design, and more generally protein therapeutic design, is compelling given the availability of large protein sequence and structure databases that can fuel model training. Indeed, recent work has shown that models trained on these data could be used for the de novo design of certain classes of proteins. These works screen dozens to thousands of protein designs, representing two to four orders of magnitude fewer proteins than are validated in our study. Moreover, no method has yet achieved de novo design of antibodies with wet-lab validation, despite the immense therapeutic relevance of antibody-based therapeutics, which accounted for 30% of FDA approved biologics in 2022. De novo antibody design is of particular interest as the key determinants of antibody function emerge from the complementarity-determining regions (CDRs) of the sequence. These hyper-variable regions interact directly with the antigen and, among these, heavy chain CDR3 (HCDR3) is often most critical to binding but also most variable, making it particularly challenging to model. Several works with experimental validation have attempted to optimize antibodies using supervised learning, though none have attempted zero-shot or de novo design.
Many groups have recognized the potential of zero-shot generative AI to impact antibody design. Several promising methods have recently emerged, leveraging ideas from language modeling to geometric learning, for the design of antibodies. However, no such method has been able to demonstrate de novo antibody design in a zero-shot fashion with validation in the lab. The present techniques integrate generative modeling ideas with high-throughput experimentation capabilities in the wet lab. Recent advancements in DNA synthesis and sequencing, E. coli-based antibody expression, and fluorescence-activated cell sorting have made it possible to experimentally assess hundreds of thousands of individual designs rapidly and in parallel.
The present techniques demonstrate zero-shot antibody design with extensive wet lab experimentation. As a first step towards fully de novo antibody design, the present techniques show that HCDR3 can be designed with generative AI methods using trastuzumab and its target antigen, human epidermal growth factor receptor 2 (HER2), as a model system. All antibodies binding HER2 or homologs of HER2 may be removed from the training set, in some aspects. The present techniques may include de novo design of many (e.g., approximately 440,000 or more) unique HCDR3 variants of trastuzumab and screening of those variants for binding to HER2 using a proprietary Activity-specific Cell-Enrichment (“ACE”) assay. As used herein, the term “quantitative affinity Activity-specific Cell-Enrichment or qaACE assay” refers to a high throughput assay for obtaining affinity and sequence data of biomolecule variants (U.S. Provisional Application No. 63/371,474, filed Aug. 15, 2022, and PCT/US23/60167, filed on Jan. 5, 2023, each incorporated by reference in its entirety).
For example, quantitative affinity ACE (“qaACE”) and the ACE analyses described further herein, and known as “de novo ACE” or “dnACE,” are methods for sampling the binding of antibody variants at high throughput using flow cytometry and next generation sequencing. The main goal of this method is to generate high throughput binding information and/or training data for an AI model to perform sequence-based binding predictions. This method can be applied to any antibody format (e.g., mAbs, Fabs, scFv, scFab, VHH, nanobody, etc.) and could conceivably be applied to other binding drug formats as well.
In one embodiment, the first step in the qaACE process is to generate a mutationally diverse antibody library that evenly samples the sequence space around the starting point antibody molecule. This library contains variants that span a range in mutational distance from the original sequence.
In some embodiments including the Examples herein, the method provides a flow cytometry read out of an antibody, expressed in SoluPro E. coli, binding to a fluorescently labeled antigen probe. In the qaACE assay setting, expression of the antibody molecule is normalized such that a change in fluorescent signal in a cell will be due to different affinities of the expressed antibody variants in the cells binding to the fluorescent antigen probe. This normalization is accomplished via a generic target molecule probe that will bind to all variants and whose signal will be in an orthogonal fluorescent channel to the antigen probe. In this setting we show that the fluorescent signal of a variant is proportional to the measured KD of an antibody variant within a range. Given this proportionality, using FACS, cells containing antibody variants can be sorted that span a range (e.g., a distribution) of affinities.
After sorting across a range of affinity values with gating across the library population distribution, the cell material is sequenced and quantified for the prevalence of observed variants across the affinity gates (bins, tubes). Using the quantifications, an enrichment score is calculated for each variant. The enrichment scores generated via qaACE are well suited to AI modeling purposes because of their accuracy and throughput.
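The sorting-and-sequencing step above can be illustrated with a minimal sketch. The source does not state how the enrichment score is computed, so the formula used here (a frequency-weighted mean gate index) is purely an assumption for illustration, as are the function name `enrichment_scores` and the input layout.

```python
def enrichment_scores(gate_counts):
    """Return one illustrative enrichment score per variant.

    gate_counts: list of dicts, one per affinity gate, ordered from the
    lowest-signal gate to the highest-signal gate. Each dict maps a
    variant sequence to its sequencing read count in that gate.

    ASSUMPTION: the score is the frequency-weighted mean gate index.
    The source does not specify the formula actually used.
    """
    # Convert raw counts to within-gate frequencies so that a gate that
    # happened to be sequenced more deeply does not dominate the score.
    freqs = []
    for counts in gate_counts:
        total = sum(counts.values())
        freqs.append({v: c / total for v, c in counts.items()} if total else {})

    variants = set()
    for f in freqs:
        variants.update(f)

    scores = {}
    for v in variants:
        weights = [f.get(v, 0.0) for f in freqs]
        mass = sum(weights)
        if mass == 0:
            continue  # variant listed with zero reads everywhere
        # Higher score means the variant is found mostly in higher gates.
        scores[v] = sum(i * w for i, w in enumerate(weights)) / mass
    return scores
```

Under this assumed scoring, a variant found mostly in the highest-signal gate receives a score near the largest gate index, and a variant found mostly in the lowest-signal gate receives a score near zero.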
In one exemplary workflow, the present disclosure provides a qaACE assay that comprises some or all of the following general steps:
Additional ACE Assay Analyses are discussed further below.
From these designs, the present techniques may be used to functionally validate many (e.g., 421 or more) binders using SPR and estimate the presence of many more (e.g., approximately 4,000) binders in total. Not only are the designed binders distinct in sequence from those found in the training dataset, but they are also highly diverse and dissimilar to anything previously observed in structural antibody databases or massive datasets of known antibodies. Furthermore, according to the previously described Naturalness metric (Bachas S et al.), the designed binders are likely to be developable and possess favorable immunogenicity characteristics. We show the extensibility of our approach by designing and validating binding molecules to two additional antigens: human vascular endothelial growth factor A (VEGF-A) and the SARS-CoV-2 spike protein (COVID-19 Omicron variant).
While the primary focus of this work is the in silico design of HCDR3, fully de novo antibody design will require the generation of multiple antibody CDR regions. We show initial progress toward this goal with a multi-step generative AI approach for designing antibodies with any antibody domain/sequence (e.g., all three heavy chain CDRs (HCDR1, HCDR2, HCDR3), any light chain, etc.) distinct from those of the parental antibody. Taken together, this work paves the way for rapid progress toward fully de novo antibody design using generative AI, which has the potential to revolutionize the availability of therapeutics for patients.
The present techniques represent an important advancement in in silico antibody design with the potential to revolutionize the availability of effective therapeutics for patients. Generative AI-designed antibodies will significantly reduce development timelines by generating molecules with desired qualities without the need for further optimization. Additionally, the controllability of AI-designed antibodies will enable creation of customized molecules for specific disease targets, leading to safer and more efficacious treatments than would be possible by traditional development approaches for patients. The core platform of generative AI design methods and ultra-high throughput wet lab screening capabilities will continue to drive progress on this front, unlocking new capabilities in the rapidly accelerating field of protein therapeutic design.
The present techniques leverage our previously described ACE assay to screen massive antibody variant libraries containing hundreds of thousands of members expressed in Fragment antigen-binding (Fab) format. The present techniques validate the ACE assay for the de novo discovery workflow by sampling sequences for follow-up analysis by SPR, a gold standard in binding affinity measurement and detection. Empirical evidence finds that the ACE assay is able to correctly classify binders with nearly 60% precision and >95% recall (Table S3).
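As a hedged illustration of how such precision and recall figures can be computed, the sketch below treats the ACE assay call as the prediction and the SPR result as the reference label. The function name and the dictionary layout are assumptions for illustration. No data from Table S3 is reproduced.

```python
def precision_recall(ace_calls, spr_calls):
    """Compare ACE binder calls against SPR reference calls.

    ace_calls, spr_calls: dicts mapping a variant identifier to True
    (binder) or False (non-binder). Only variants present in both
    dicts are scored. The layout is an illustrative assumption.
    """
    shared = ace_calls.keys() & spr_calls.keys()
    tp = sum(1 for v in shared if ace_calls[v] and spr_calls[v])
    fp = sum(1 for v in shared if ace_calls[v] and not spr_calls[v])
    fn = sum(1 for v in shared if not ace_calls[v] and spr_calls[v])

    # Precision: fraction of ACE-called binders confirmed by SPR.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    # Recall: fraction of SPR-confirmed binders that ACE also called.
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall
```

Read this way, a precision near 60% means that roughly four in ten ACE-called binders would be expected to fail SPR confirmation, while a recall above 95% means that few true binders are missed by the initial ACE screen.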
This enables a powerful workflow where a large population of predictions can be initially screened by the ACE assay, and the expected binding population can be subsequently screened via SPR to remove false positives and collect high quality binding affinity measurements (
The client computing device 102 may be an individual server, a group (e.g., cluster) of multiple servers, or another suitable type of computing device or system (e.g., a collection of computing resources). For example, the client computing device 102 may be any suitable computing device (e.g., a server, a mobile computing device, a smart phone, a tablet, a laptop, a wearable device, etc.). In some aspects, one or more components of the client device 102 may be embodied by one or more virtual instances (e.g., a cloud-based virtualization service) and/or may be included in a respective remote data center (e.g., a cloud computing environment, a public cloud, a private cloud, hybrid cloud, etc.). The client computing device 102 includes a processor and a network interface controller (NIC). The processor may include any suitable number of processors and/or processor types, such as CPUs and one or more graphics processing units (GPUs). Generally, the processor is configured to execute software instructions stored in a memory. The memory may include one or more persistent memories (e.g., a hard drive/solid state memory) and stores one or more sets of computer executable instructions/modules. For example, the executable instructions may receive and/or display results generated by the server 104.
The client computing device 102 may include a respective input device and a respective output device. The respective input devices may include any suitable device or devices for receiving input, such as one or more microphones, one or more cameras, a hardware keyboard, a hardware mouse, a capacitive touch screen, etc. The respective output devices may include any suitable device for conveying output, such as a hardware speaker, a computer monitor, a touch screen, etc. In some cases, the input device and the output device may be integrated into a single device, such as a touch screen device that accepts user input and displays output. The NIC of the client computing device may include any suitable network interface controller(s), such as wired/wireless controllers (e.g., Ethernet controllers), and facilitate bidirectional/multiplexed networking over the network between the client computing device 102 and other components of the environment 100.
The structural prediction server 104 includes a processor 150, a network interface controller (NIC) 152 and a memory 154. The structural prediction server 104 may further include a data repository 180. The data repository 180 may be a structured query language (SQL) database (e.g., a MySQL database, an Oracle database, etc.) or another type of database (e.g., a not only SQL (NoSQL) database). In some aspects, the data repository 180 may comprise a file system (e.g., an EXT filesystem, Apple file system (APFS), a networked filesystem (NFS), local filesystem, etc.), an object store (e.g., Amazon Web Services S3), a data lake, etc. The data repository 180 may include a plurality of data types, such as pre-training data sourced from public data sources (e.g., SAbDab data, OAS data), other pre-training data, and fine-tuning data. Fine-tuning data may be proprietary affinity data that is sourced from a quantitative assay ACE, Carterra, or any other suitable source. The data repository 180 may include machine learning model training data represented in any suitable data format(s), such as protein data bank (PDB) format, JavaScript Object Notation (JSON) format, eXtensible Markup Language (XML) format, etc.
The server 104 may include a library of client bindings for accessing the data repository 180. In some aspects, the data repository 180 is located remote from the structural prediction server 104. For example, the data repository 180 may be implemented using a RESTdb.IO database, an Amazon Relational Database Service (RDS), etc. in some aspects. In some aspects, the structural prediction server 104 may include a client-server platform technology such as Python, PHP, ASP.NET, Java J2EE, Ruby on Rails, Node.js, a web service or online API, responsible for receiving and responding to electronic requests. Further, the structural prediction server 104 may include sets of instructions for performing machine learning operations, as discussed below, that may be integrated with the client-server platform technology.
The assay device 106 may be a Surface Plasmon Resonance (SPR) machine, for example, such as a Carterra SPR machine. The device 106 may be physically connected to either the structural prediction server 104 or the data repository 180, as depicted. The device 106 may be located in a laboratory, and may be accessible from one or more computers within the laboratory (not depicted) and/or from the structural prediction server 104. The device 106 may generate data and upload that data to the data repository 180, directly and/or via the laboratory computer(s). The assay device 106 may include instructions for receiving one or more sequences (e.g., mutated sequences) and for synthesizing those sequences. The synthesis may sometimes be performed via another technique (e.g., via a different device or via a human). In some aspects, the device 106 may be configured not as a device, but as an alternative assay that can measure protein-protein interactions as listed in other sections of this application. For example, the device 106 may instead be configured as a suite of devices/workflows, including plates and liquid handling. In general, the device 106 may be substituted with suitable hardware and/or software optionally including human operators to generate affinity data.
The network 108 may be a single communication network, or may include multiple communication networks of one or more types (e.g., one or more wired and/or wireless local area networks (LANs), and/or one or more wired and/or wireless wide area networks (WANs) such as the Internet). The network 108 may enable bidirectional communication between the client computing device 102 and the structural prediction server 104, for example.
The processor 150 may include any suitable number of processors and/or processor types, such as one or more graphics processing units (GPUs), one or more central processing units (CPUs), etc. Generally, the processor 150 is configured to execute software instructions stored in the memory 154. The memory 154 may include one or more persistent memories (e.g., a hard drive/solid state memory) and stores one or more sets of computer executable instructions/modules 160, including an input/output (I/O) module 162, a machine learning training module 164 and a machine learning operation module 166. In some aspects, more or fewer modules may be included, and in some aspects, one or more of the modules may be combined or aggregated into a fewer number of modules.
Each of the modules 160 implements specific functionality related to the present techniques, as will be described further, below. The modules 160 may store machine readable instructions, including one or more application(s), one or more software component(s), and/or one or more APIs, which may be implemented to facilitate or perform the features, functions, or other disclosure described herein, such as any methods, processes, elements or limitations, as illustrated, depicted, or described for the various flowcharts, illustrations, diagrams, figures, and/or other disclosure herein. In some aspects, a plurality of the modules 160 may act in concert to implement a particular technique. For example, the machine learning operation module 166 may load information from one or more other models prior to, during and/or after initiating an inference operation. Thus, the modules 160 may exchange data via suitable techniques, e.g., via inter-process communication (IPC), a Representational State Transfer (REST) API, etc. within a single computing device, such as the structural prediction server 104. In some aspects, one or more of the modules 160 may be implemented in a plurality of computing devices (e.g., a plurality of servers 104). The modules 160 may exchange data among the plurality of computing devices via a network such as the network 108. The modules 160 of
Generally, the I/O module 162 includes instructions that enable a user (e.g., an employee of the company) to access and operate the structural prediction server 104 (e.g., via the client computing device 102). For example, the employee may be a software developer who trains one or more ML models using the ML training module 164 in preparation for using the one or more trained ML models to generate outputs used in an antibody prediction project, a docking complex prediction project, and/or an affinity value prediction project. Once the one or more ML models are trained, the same user (or another) may access the structural prediction server 104 via the I/O module to cause the molecular modeling process to be initiated. The I/O module 162 may include instructions for generating one or more graphical user interfaces (GUIs) (not depicted) that collect and store parameters related to biomolecular modeling, such as a user selection of a particular reference protein, biomolecule, binding partner, etc. from a list stored in the data repository 180.
In general, a computer program or computer based product, application, or code (e.g., the model(s), such as machine learning models, or other computing instructions described herein) may be stored on a computer usable storage medium, or tangible, non-transitory computer-readable medium (e.g., standard random access memory (RAM), an optical disc, a universal serial bus (USB) drive, or the like) having such computer-readable program code or computer instructions embodied therein, wherein the computer-readable program code or computer instructions may be installed on or otherwise adapted to be executed by the processor(s) 150 (e.g., working in connection with the respective operating system in memory 154) to facilitate, implement, or perform the machine readable instructions, methods, processes, elements or limitations, as illustrated, depicted, or described for the various flowcharts, illustrations, diagrams, figures, and/or other disclosure herein. In this regard, the program code may be implemented in any desired program language, and may be implemented as machine code, assembly code, byte code, interpretable source code or the like (e.g., via Golang, Python, C, C++, C#, Objective-C, Java, Scala, ActionScript, JavaScript, HTML, CSS, XML, etc.).
The present techniques may include a generative model to automatically design or improve antibodies against a specific target of interest (e.g., a target biomolecule). The model may be trained on antibody-antigen structures or a large collection of extant antibody sequences. The model may also be trained using proprietary high-throughput binding interaction data. The model may be validated, in some aspects, by designing antibodies that bind to a receptor protein implicated in cancer. The present techniques may enable zero-shot generation of antibodies, tailored to bind to any target. The present techniques enable redesigning the complementarity-determining regions (CDRs) of antibodies, in some aspects, which confer binding specificity to the target. The present generative model may automatically design or improve antibodies against a specific target of interest, and may be pre-trained on a collection of antibody-antigen complexes mined from public data sources (e.g., the Protein Data Bank).
In some aspects, a model trained according to the present techniques may be trained to receive an antigen structure, and to predict the structure, sequence, and/or affinity of the antibody based on the received antigen. In some aspects, the prediction may be performed by iteratively refining the model's prediction of the structure while autoregressively unraveling the sequence. In such cases, generalized autoregressive or bidirectional training may be used to capture long-range patterns in the sequence. The antibody and antigen modules may capture independent SE(3) equivariance, enabling data efficient protein-protein rigid body docking.
In some aspects, high GPU memory of 80 GB A100s may be used to achieve unravelling of long structural graphs. Scalable implementations of pairwise SE(3) equivariance can be helpful to improve training efficiency, in some aspects. After pre-training, the present techniques may fine-tune a model to predict the binding affinity of an antibody to an antigen of interest. An affinity prediction module may use structural and sequence features inferred by the model to predict the binding affinity of the antibody to the antigen. In particular, embeddings at each residue may be inferred and propagated to the affinity prediction module with attention. The affinity prediction module may combine high-throughput binding interaction data (e.g., data generated on Absci's Integrated Drug Creation™ platform). Thus, advantageously, the present techniques achieve a novel coupling of structural models of antibody binding with real-world binding affinity data.
The present techniques may validate the trained model by designing antibodies to neutralize HER2, an important target in breast cancer. Starting with trastuzumab, a well understood HER2 binder, the present techniques may include designing libraries containing trastuzumab variants with single, double and higher-order mutations. Binding affinity of variants in the library may be measured by high-throughput screening, and the library scored using generative models, to demonstrate a high correlation between model predictions and the high-throughput data. In addition to pre-training on structural data, or alternatively, the present techniques may pre-train a language model on a large database of extant antibody sequences. The language model and predictive model may be effectively ensembled to further improve predictions of the binding affinity data. After training, the model can generate diverse new antibodies against an unseen antigen by iteratively decoding a sequence autoregressively. In contrast to prior work on deep learning-based protein design in which proteins are designed unconditionally, our model enables targeted design to a specified antigen.
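The autoregressive, antigen-conditioned decoding described above can be sketched as a simple sampling loop. This is a toy illustration only: the trained model is not available here, so `placeholder_next_residue_probs` is a hypothetical stand-in that returns a uniform distribution and ignores its inputs, and all names and signatures are assumptions rather than the disclosed implementation.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def placeholder_next_residue_probs(prefix, antigen):
    """Hypothetical stand-in for a trained conditional model.

    A real model would return a distribution over the next residue
    given the residues decoded so far (prefix) and the target antigen.
    This placeholder ignores both and returns a uniform distribution.
    """
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def decode_autoregressively(antigen, length,
                            probs_fn=placeholder_next_residue_probs,
                            seed=0):
    """Sample a sequence one residue at a time, conditioning each step
    on the antigen and on everything decoded so far."""
    rng = random.Random(seed)
    prefix = ""
    for _ in range(length):
        probs = probs_fn(prefix, antigen)
        residues = list(probs)
        weights = [probs[r] for r in residues]
        prefix += rng.choices(residues, weights=weights, k=1)[0]
    return prefix
```

Because every step receives the antigen as an input, the same loop yields target-specific designs once `probs_fn` is replaced by a model whose output actually depends on the antigen, which is the distinction from unconditional design drawn above.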
In some aspects, the ML training module 164 may include a set of computer-executable instructions implementing machine learning training, configuration, parameterization and/or storage functionality. The ML training module 164 may initialize, train and/or store one or more ML models, as discussed herein. The trained ML models and their weights/parameters may be stored in the data repository 180 and/or in the memory 154, which is accessible or otherwise communicatively coupled to the structural prediction server 104.
For example, the ML training module 164 may train one or more ML models (e.g., an artificial neural network (ANN)). One or more training data sets may be used for model training in the present techniques, as discussed herein. The input data may have a particular shape that may affect the ANN network architecture. The elements of the training data set may comprise tensors scaled to small values (e.g., in the range of (−1.0, 1.0)). In some aspects, a preprocessing layer may be included in training (and operation) which applies principal component analysis (PCA) or another technique to the input data. PCA or another dimensionality reduction technique may be applied during training to reduce dimensionality from a high number to a relatively smaller number. Reducing dimensionality may result in a substantial reduction in computational resources (e.g., memory and CPU cycles) required to train and/or analyze the input data.
In general, training an ANN may include establishing a network architecture, or topology, adding layers including activation functions for each layer (e.g., a “leaky” rectified linear unit (ReLU), softmax, hyperbolic tangent, etc.), and selecting a loss function and an optimizer. In an aspect, the ANN may use different activation functions at each layer, or as between hidden layers and the output layer. Suitable optimizers may include the Adam and Nadam optimizers. In an aspect, a different neural network type may be chosen (e.g., a graph convolutional neural network, a message passing neural network, a geometric vector perceptron network, a recurrent neural network, a deep learning neural network, etc.). Training data may be divided into training, validation, and testing data. For example, 20% of the training data set may be held back for later validation and/or testing. In that example, 80% of the training data set may be used for training. In that example, the training data set may be shuffled before being so divided. Dividing the dataset may also be performed in a cross-validation setting, e.g., when the data set is small. Data input to the artificial neural network may be encoded in an N-dimensional tensor, array, matrix, and/or other suitable data structure. In some aspects, training may be performed by successive evaluation (e.g., looping) of the network, using labeled training samples. The process of training the ANN may cause weights, or parameters, of the ANN to be altered. The weights may be initialized to random values. The weights may be adjusted as the network is successively trained, by using one or more gradient descent algorithms, to reduce loss and to cause the values output by the network to converge to expected, or “learned”, values. In an aspect, a regression may be used which has no activation function. Therein, input data may be normalized by mean centering, and a mean squared error loss function may be used, in addition to mean absolute error, to determine the appropriate loss as well as to quantify the accuracy of the outputs.
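By way of non-limiting illustration, the shuffled 80%/20% division, the mean centering, and the mean squared error and mean absolute error metrics described above may be sketched as follows. The function names, the fixed random seed, and the toy data are assumptions made for illustration only and do not represent a required implementation.

```python
import random


def split_dataset(samples, holdout_fraction=0.2, seed=0):
    """Shuffle the samples, then hold back a fraction for validation/testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seed is an illustrative assumption
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def mean_center(values):
    """Normalize by subtracting the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def mean_squared_error(predicted, expected):
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)


def mean_absolute_error(predicted, expected):
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(expected)


# Toy data: ten sample identifiers, 80% for training and 20% held back.
train, holdout = split_dataset(range(10))
centered = mean_center([1.0, 2.0, 3.0])
mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])
mae = mean_absolute_error([1.0, 2.0], [1.0, 4.0])
```

In a cross-validation setting, the same shuffling step may be followed by division into several folds rather than a single held-back portion.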
In some aspects, the ML training module 164 may include computer-executable instructions for performing ML model pre-training, ML model fine-tuning and/or ML model self-supervised training. Model pre-training may be known as transfer learning, and may enable training of a base model that is universal, in the sense that it can be used as a common grammar for all antibody sequences, for example. In some examples, pre-training may be used to train multiple models of independent artificial neural networks, and/or multiple respective layers of a single artificial neural network (e.g., an artificial neural network used to predict affinity values from biological structural information). The term “pre-training” may be used to describe scenarios wherein a second training may occur (i.e., when the model may be “fine-tuned”). Transfer learning refers to the ability of the model to leverage the result (weights) of a first pre-training to better initialize the second training, which may otherwise require a random initialization. The second training, i.e., fine-tuning, may be performed using affinity data as discussed herein. The technique of combining pre-training and fine-tuning advantageously boosts performance, in that the result of the training on affinity data performs better after pre-training (e.g., using natural antibody structures from SAbDab) than when no pre-training is performed. Model fine-tuning may be performed with respect to given antibody-antigen pairs, in some aspects. In some aspects, ML model self-supervised learning may be performed to endow the model with an understanding of the antibody grammar during pre-training.
Generally, an ML model may be trained as described herein using a supervised, semi-supervised or unsupervised machine learning program or algorithm. The machine learning program or algorithm may employ a neural network, which may include one or more of a graph convolutional neural network, a message passing neural network, a geometric vector perceptron network, a recurrent neural network, a deep learning neural network, a convolutional neural network, a transformer, an autoencoder and/or a combined learning module or program that learns from two or more features or feature datasets (e.g., structured data, unstructured data, etc.). The machine learning programs or algorithms may also include natural language processing, semantic analysis, automatic reasoning, regression analysis, support vector machine (SVM) analysis, decision tree analysis, random forest analysis, K-Nearest neighbor analysis, naive Bayes analysis, clustering, reinforcement learning, and/or other machine learning algorithms and/or techniques (e.g., generative algorithms, genetic algorithms, etc.).
In some aspects, an ML algorithm or technique may be chosen for a particular input based on the problem set size of the input. In some aspects, the artificial intelligence and/or machine learning based algorithms may be based on, or otherwise incorporate aspects of one or more machine learning algorithms included as a library or package executed on the server(s) 104. For example, libraries may include the TensorFlow based library, the PyTorch library (e.g., PyTorch Lightning), the Keras libraries, the Jax library, the HuggingFace ecosystem (e.g., transformers, datasets and/or tokenizers libraries therein), and/or the scikit-learn Python library. However, these popular open source libraries are a convenience, and are not required. The present techniques may be implemented using other frameworks/languages.
Machine learning may involve identifying and recognizing patterns in existing data (e.g., structural information, docking capabilities, binding affinity, etc.) in order to facilitate making predictions, classifications, and/or identifications for subsequent data (such as using the trained models to generate an antibody based on an input antibody, predict the ability of the generated antibody to dock to the input antigen (or other antigen(s)) and/or predict the binding affinity of the generated antibody). Machine learning model(s) may be created and trained based upon example data (e.g., “training data”) inputs or data (which may be termed “features” and “labels”) in order to make valid and reliable predictions for new inputs. In supervised machine learning, a machine learning program operating on a server, computing device, or otherwise processor(s), may be provided with example inputs (e.g., “features”) and their associated, or observed, outputs (e.g., “labels”) in order for the machine learning program or algorithm to determine or discover rules, relationships, patterns, or otherwise machine learning “models” that map such inputs (e.g., “features”) to the outputs (e.g., “labels”), for example, by determining and/or assigning weights or other metrics to the model across its various feature categories. Such rules, relationships, or otherwise models may then be provided subsequent inputs in order for the model, executing on the server, computing device, or otherwise processor(s), to predict, based on the discovered rules, relationships, or model, an expected output.
For example, the ML training module 164 may analyze labeled data at an input layer of a model having a networked layer architecture (e.g., an artificial neural network, a convolutional neural network, a deep neural network, etc.) to generate ML models. The training data may be, for example, structural information of antibodies. In some aspects, outputs may be the sequence of a new protein. During training, the labeled data may be propagated through one or more connected deep layers of the ML model to establish weights of one or more nodes, or neurons, of the respective layers. Initially, the weights may be initialized to random values, and one or more suitable activation functions may be chosen for the training process, as will be appreciated by those of ordinary skill in the art. The ML training module 164 may include training a respective output layer of the one or more machine learning models. The output layer may be trained to output a prediction. For example, the ML models trained herein are able to predict structural information of an antibody by analyzing the labeled examples provided during training. In some aspects, the binding affinity may be expressed as a real number (e.g., in a regression analysis). In some aspects, the binding affinity may be expressed as a boolean value (e.g., in classification). In some aspects, multiple ANNs may be separately trained and/or operated. For example, an individual model may be fine-tuned (i.e., trained) based on a pre-trained model, using transfer learning, for a plurality of different antibody-antigen pairs.
In unsupervised or semi-supervised machine learning, the server, computing device, or otherwise processor(s), may be required to find its own structure in unlabeled example inputs, where, for example, multiple training iterations are executed by the server, computing device, or otherwise processor(s) to train multiple generations of models until a satisfactory model is generated. In the present techniques, semi-supervised learning may be used, inter alia, for natural language processing purposes and to learn a grammar of antibody sequences using an objective, such as a masked language model objective. Supervised learning and/or unsupervised machine learning may also comprise retraining, relearning, or otherwise updating models with new, or different, information, which may include information received, ingested, generated, or otherwise used over time. In various aspects, training the ML models herein may include generating an ensemble model comprising multiple models or sub-models, comprising models trained by the same and/or different AI algorithms, as described herein, and that are configured to operate together.
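By way of non-limiting illustration, the input preparation for a masked language model objective may be sketched as follows. The mask token name, the 15% masking fraction, the random seed, and the example amino acid string are assumptions chosen for illustration; the present disclosure does not prescribe these values.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical token name


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original residue the model is asked to recover.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    n_mask = max(1, int(round(len(tokens) * mask_fraction)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK_TOKEN
    return tokens, targets


# Arbitrary illustrative amino acid string (13 residues, so 2 are masked).
original = "SRWGGDGFYAMDY"
masked, targets = mask_sequence(original)

# Restoring the targets recovers the original sequence.
restored = "".join(targets.get(i, tok) for i, tok in enumerate(masked))
```

During training, a loss would be computed only at the masked positions, which allows the model to learn from unlabeled sequences.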
Once the model machine learning training module 164 has initialized the one or more ML models, which may be ANNs or regression networks, for example, the model machine learning training module 164 trains the ML models by inputting labeled data into the models (e.g., antibody/antigen complexes; docked biomolecules; etc.). The trained ML model provides accurate predictions given inputs previously unseen by the model (i.e., not used during training).
The model machine learning training module 164 may divide training data into a respective training data set and testing data set. The model machine learning training module 164 may train the ANN using the labeled data. The model machine learning training module 164 may compute accuracy/error metrics (e.g., cross entropy) using the test data and the corresponding test labels. The model machine learning training module 164 may serialize the trained model and store the trained model in a database (e.g., the data repository 180). Of course, it will be appreciated by those of ordinary skill in the art that the model machine learning training module 164 may train and store more than one model. For example, the model machine learning training module 164 may train an individual model for performing antibody generation, another for performing antibody docking, and still another for performing affinity prediction. It should be appreciated that the structure of the network as described may differ, depending on the embodiment. For example, in some aspects, antibody generation, antibody docking and affinity prediction may each correspond to respective layers of a monolithic machine learning model. For example, in some aspects, the trained ML model(s) predict several pieces of data: the sequence of the antibody, the affinity of the antibody, and the docking configuration. The present techniques enable supervising any one of these components, and the remaining components can improve through training. This capability is an advantageous result of the end-to-end training of the present techniques.
The machine learning operation module 166 may include a set of computer-executable instructions implementing machine learning loading, configuration, initialization and/or operation functionality. The ML operation module 166 may include instructions for storing trained models (e.g., in the electronic data repository 180, as a pickled binary, etc.). Once trained, a trained ML model may be operated in inference mode, whereupon when provided with de novo input that the model has not previously been provided, the model may output one or more predictions, classifications, etc. as described herein. In some aspects, a loss minimization function may be used, for example, to teach a ML model to generate output that resembles known output (i.e., ground truth exemplars).
Once the model(s) are trained by the model machine learning training module 164, the model operation module 166 may load one or more trained models (e.g., from the data repository 180). The model operation module 166 generally applies new data that the trained model has not previously analyzed to the trained model. For example, the model operation module 166 may load a serialized model, deserialize the model, and load the model into the memory 154. The model operation module 166 may load new data (e.g., binding partner structural information) that was not used to train the trained model. For example, the new data may include data (e.g., antigen sequence data, etc.) as described herein, encoded as input tensors. The model operation module 166 may apply the one or more input tensor(s) to the trained ML model. The model operation module 166 may receive output (e.g., tensors, feature maps, etc.) from the trained ML model. The output of the ML model may include structural information of one or more biomolecules (e.g., antibodies) predicted to have good docking and high binding affinity with the input binding partner (e.g., an antigen). The output of the ML model may include docking complexes and binding affinity values. In this way, the present techniques advantageously provide a means, for example, of quantitatively designing a target antibody corresponding to an input target antigen, while also quantifying docking of the input antigen and designed target antibody, and quantifying the binding affinity of the two. These techniques are far more accurate and data rich than conventional industry practices. By using ML, the present techniques avoid the expense and time consuming process of laboratory experimentation due to in silico performance, rather than requiring continued use of the wet lab.
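By way of non-limiting illustration, the serialize, deserialize, and apply sequence described above may be sketched as follows. The linear scoring function stands in for a trained ML model and, like the parameter values and input features, is a hypothetical placeholder rather than the model of the present techniques.

```python
import pickle


def predict_affinity(params, features):
    """Hypothetical linear stand-in for a trained model: w . x + b."""
    weighted = sum(w * x for w, x in zip(params["weights"], features))
    return weighted + params["bias"]


# Parameters as they might exist at the end of training (illustrative values).
trained_params = {"weights": [0.5, -1.0], "bias": 2.0}

# Training side: serialize the trained parameters for storage.
serialized = pickle.dumps(trained_params)

# Operation side: deserialize, load into memory, and apply to unseen input.
restored_params = pickle.loads(serialized)
new_input = [2.0, 1.0]  # input tensor not used during training
prediction = predict_affinity(restored_params, new_input)
```

Pickled data should be loaded only from trusted storage, because deserializing a pickle can execute arbitrary code.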
The model operation module 166 may be accessed by another element of the structural prediction server 104 (e.g., a web service). The ML operation module 166 may pass its output to another module for further processing/analysis. In some aspects, a user may interact with the ML model during training and/or operation using a command line tool, an Application Programming Interface (API), a software development kit (SDK), a Jupyter notebook, etc.
Regarding the modules 160, it will be appreciated by those of ordinary skill in the art that in some aspects, the software instructions comprising the modules 160 may be organized differently, and more/fewer modules may be included. For example, one or more of the modules 160 may be omitted or combined. In some aspects, additional modules may be added (e.g., a localization module). In some aspects, software libraries implementing one or more modules (e.g., Python code) may be combined, such that, for example, the ML training module 164 and ML operation module 166 are a single set of executable instructions used for training and making predictions.
In operation, the assay device 106 may be used to produce binding scores from reads. For example, in an embodiment, the following procedures may be used:
1. Paired-end reads may be merged using FLASH2 with a maximum allowed overlap set according to the amplicon size and sequencing read length (e.g., 150 bases for all the libraries described herein).
2. Primers may be removed from both ends of the merged read using the cutadapt tool, and reads may be discarded where primers are not detected.
3. Reads may be aggregated across all FACS sorting gates and aligned to the reference sequence (parental version of the amplicon) in amino acid space. Alignment may be performed using the Needleman-Wunsch algorithm implemented in Biopython. In an example, the following parameters may be used: PairwiseAligner, mode=global, match_score=5, mismatch_score=−4, open_gap_score=−20, extend_gap_score=−1. Parameters may be chosen by manual inspection across a number of processed libraries, in some aspects.
4. Reads may then be discarded if (1) the mean base quality is below a number (e.g., 20), or (2) a sequence (in DNA space) is seen in fewer than a number (e.g., 10) of reads across all gates.
5. The present techniques may further include flagging: (1) sequences that align to the reference with a low score (e.g., defined as less than 0.6 of the score obtained by aligning the reference to itself); (2) sequences containing stop codons outside of the region of interest and (3) sequences containing frame-shifting insertions or deletions. Flagged sequences may not be included in any mutation-related statistics, but may be used for count normalization for binding score calculations, in some aspects. FastQC and MultiQC may be used to generate sequencing quality control metrics.
6. For each gate, the prevalence of each sequence (read count relative to the total number of reads from all sequences in that gate) may be normalized to a number (e.g., 1 million) of counts.
7. The binding score (e.g., ACE score) may be assigned to each unique DNA sequence by taking a weighted average of the normalized counts across the sorting gates. In all experiments, weights may be assigned linearly using an integer scale: the gate capturing the lowest fluorescence signal may be assigned a weight of 1, the next lowest gate may be assigned a weight of 2, etc.
8. Any detected sequence which was not present in the originally designed and synthesized library may be dropped.
9. ACE scores may be averaged across independent FACS sorts, dropping sequences for which the standard deviation of replicate measurements is greater than 1.25. An amino acid variant may be retained only if the present techniques collect a number (e.g., at least three) of independent QC-passing observations between synonymous DNA variants and replicate FACS sorts.
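By way of non-limiting illustration, the per-gate normalization, linearly weighted binding score, and replicate averaging of the foregoing procedure may be sketched as follows. The sketch assumes one reading of the weighted average, namely the sum of gate weight times normalized count divided by the sum of normalized counts, and assumes a sample standard deviation for the replicate filter; neither choice is specified above. All function names and read counts are hypothetical.

```python
from statistics import mean, stdev


def normalize_gate(counts, scale=1_000_000):
    """Rescale one gate's read counts to a fixed total (e.g., 1 million)."""
    total = sum(counts.values())
    return {seq: scale * c / total for seq, c in counts.items()}


def binding_scores(gate_counts):
    """Score each sequence from per-gate counts.

    gate_counts: list of {sequence: read_count} dicts ordered from the
    lowest-fluorescence gate (weight 1) to the highest. Counts are assumed
    to be positive.
    """
    normalized = [normalize_gate(g) for g in gate_counts]
    sequences = set().union(*normalized)
    scores = {}
    for seq in sequences:
        per_gate = [g.get(seq, 0.0) for g in normalized]
        weighted = sum(w * n for w, n in enumerate(per_gate, start=1))
        scores[seq] = weighted / sum(per_gate)
    return scores


def average_replicates(replicate_scores, max_sd=1.25):
    """Average scores across sorts, dropping sequences that vary too much.

    Sequences seen in only one sort are kept here (an assumption).
    """
    merged = {}
    for scores in replicate_scores:
        for seq, score in scores.items():
            merged.setdefault(seq, []).append(score)
    return {
        seq: mean(values)
        for seq, values in merged.items()
        if len(values) < 2 or stdev(values) <= max_sd
    }


# Two gates: sequence "A" is mostly in the low gate, "B" in the high gate.
scores = binding_scores([{"A": 90, "B": 10}, {"A": 10, "B": 90}])
averaged = average_replicates([{"A": 1.0, "B": 1.0}, {"A": 1.2, "B": 4.0}])
```

Under this reading, a score near 1 indicates a sequence concentrated in the lowest-fluorescence gate, and higher scores indicate enrichment in higher gates.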
Exemplary Binary Assay (dnACE) Analysis
Enrichment scores may be calculated for individual variants screened by a binary version of the ACE assay. For example, the following procedure may be used, in some embodiments:
1. Paired-end reads may be merged using Fastp with quality filtering and base correction in merged regions enabled.
2. Primers may be removed from both ends of the merged read using the cutadapt tool, and reads may be discarded where primers are not detected.
3. Unique sequences may be tallied to provide raw counts of each variant observed in each sample. Sequences that do not match a designed sequence in the library may be discarded.
4. For each sample, proportional abundances may be calculated for each variant. Enrichment scores may be calculated by dividing the proportional abundance of each variant in a gate by its proportional abundance in the unsorted library sample.
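By way of non-limiting illustration, the proportional abundance and enrichment score calculations of the foregoing procedure may be sketched as follows. The function names and read counts are hypothetical, and the sketch assumes that non-library sequences have already been discarded.

```python
def proportional_abundance(counts):
    """Convert raw variant counts in one sample to fractions of the total."""
    total = sum(counts.values())
    return {variant: c / total for variant, c in counts.items()}


def enrichment_scores(gate_counts, unsorted_counts):
    """Divide each variant's abundance in a gate by its unsorted abundance.

    Variants absent from the gate receive an enrichment of 0.0; variants
    absent from the unsorted sample are skipped to avoid division by zero.
    """
    gate = proportional_abundance(gate_counts)
    unsorted_ab = proportional_abundance(unsorted_counts)
    return {
        variant: gate.get(variant, 0.0) / abundance
        for variant, abundance in unsorted_ab.items()
        if abundance > 0
    }


# Variants "A" and "B" start equally abundant; "A" is enriched in the gate.
enrichment = enrichment_scores({"A": 30, "B": 10}, {"A": 50, "B": 50})
```

An enrichment score above 1 indicates that a variant is over-represented in the gate relative to the unsorted library sample.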
ACE scores are discretized into a set of 6 bins and for each bin a set of variants belonging to that bin are screened via SPR. The percentage of a bin's variants that bind is used to estimate the expected binding rate for said bin. These values can be seen in Table S2.
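By way of non-limiting illustration, the per-bin binding rate estimate may be sketched as follows. How the six bin boundaries are chosen is not specified above; the sketch assumes equal-width bins over a score range supplied by the caller, and the scores and SPR outcomes shown are hypothetical.

```python
def bin_index(score, lo, hi, n_bins=6):
    """Map a score in [lo, hi] to one of n_bins equal-width bins."""
    if score >= hi:
        return n_bins - 1
    return int((score - lo) / (hi - lo) * n_bins)


def binding_rate_per_bin(screened, lo, hi, n_bins=6):
    """Estimate the fraction of binders in each bin.

    screened: iterable of (ace_score, binds) pairs, where binds is the
    SPR outcome for a variant. Bins with no screened variants yield None.
    """
    totals = [0] * n_bins
    binders = [0] * n_bins
    for score, binds in screened:
        i = bin_index(score, lo, hi, n_bins)
        totals[i] += 1
        binders[i] += int(binds)
    return [b / t if t else None for b, t in zip(binders, totals)]


# Hypothetical SPR outcomes for six screened variants.
screened = [
    (0.5, False), (0.7, False),
    (3.1, True),
    (5.2, True), (5.5, True), (5.9, False),
]
rates = binding_rate_per_bin(screened, lo=0.0, hi=6.0)
```

The resulting per-bin rates may then be applied to unscreened variants by looking up the bin of each variant's ACE score.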
As discussed above, SPR may be used for various techniques, including functional validation of binders.
For example, in some aspects, post induction samples may be transferred to plates (e.g., 96-well plates (e.g., Greiner Bio-One)), pelleted and lysed in 50 μL lysis buffer (e.g., 1× BugBuster protein extraction reagent containing 0.01 KU Benzonase Nuclease and 1× Protease inhibitor cocktail). Plates may be incubated for 15-20 min at 30° C. then centrifuged to remove insoluble debris. After lysis, samples may be adjusted with 200 μL SPR running buffer (e.g., 10 mM HEPES, 150 mM NaCl, 3 mM EDTA, 0.01% w/v Tween-20, 0.5 mg/mL BSA) to a final volume of 260 μL and filtered into 96-well plates. Lysed samples may then be transferred from 96-well plates to larger plates (e.g., 384-well plates) for high-throughput SPR, for example, using a Hamilton STAR automated liquid handler. Colonies may be prepared in two sets of independent replicates prior to lysis and each replicate measured in two separate experimental runs. In some instances, single replicates may be used, as indicated.
High-throughput SPR experiments may be conducted on a microfluidic Carterra LSA SPR instrument using SPR running buffer (e.g., 10 mM HEPES, 150 mM NaCl, 3 mM EDTA, 0.01% w/v Tween-20, 0.5 mg/mL BSA) and SPR wash buffer (e.g., 10 mM HEPES, 150 mM NaCl, 3 mM EDTA, 0.01% w/v Tween-20). Carterra LSA SAD200M chips may be pre-functionalized with 20 μg/mL biotinylated antibody capture reagent for 600 s prior to conducting experiments. Lysed samples in 384-well blocks may be immobilized onto chip surfaces for 600 s followed by a 60 s washout step for baseline stabilization.
Antigen binding may be conducted using the non-regeneration kinetics method with a 300 s association phase followed by a 900 s dissociation phase. For analyte injections, six leading blanks may be introduced to create a consistent baseline prior to monitoring antigen binding kinetics. After the leading blanks, five concentrations of HER2 extracellular domain antigen (e.g., ACRO Biosystems, prepared in three-fold serial dilution from a starting concentration of 500 nM), may be injected into the instrument and the time series response recorded. In some experiments, measurements on individual DNA variants may be repeated four times. Each experiment run may consist of two complete measurement cycles (ligand immobilization, leading blank injections, analyte injections, chip regeneration) which may provide two duplicate measurement attempts per clone per run. In some experiments, technical replicates measured in separate runs may further double the number of measurement attempts per clone to four.
To identify the DNA sequence of individual antibody variants evaluated by SPR, duplicate plates may be provided for sequencing. A portion of the pelleted material may be transferred via a pinner (e.g., Fisher Scientific) into a 96-well PCR plate (e.g., Thermo-Fisher), which may contain reagents for performing an initial phase PCR of a two-phase PCR for addition of Illumina adapters and sequencing. Reaction volumes used may be 12.5 μl, for example. During the initial PCR phase, partial Illumina adapters may be added to the amplicon via 4 PCR cycles. The second phase PCR may add the remaining portion of the Illumina sequencing adapter and the Illumina i5 and i7 sample indices. The initial PCR reaction may use, for example, a 0.45 μM UMI primer concentration and 6.25 μl Q5 2× master mix (NEB). Reactions may be initially denatured at 98° C. for 3 min, followed by 4 cycles of 98° C. for 10 s; 59° C. for 30 s; 72° C. for 30 s; with a final extension of 72° C. for 2 min. Following the initial PCR, 0.5 μM of the secondary sample index primers may be added to each reaction tube.
Reactions may then be denatured at 98° C. for 3 min, followed by 29 cycles of 98° C. for 10 s; 62° C. for 30 s; 72° C. for 15 s; with a final extension of 72° C. for 2 min. Reactions may then be pooled into a 1.5 mL tube (Eppendorf). Pooled samples may be size selected with a 1× AMPure XP (Beckman Coulter) bead procedure. Resulting DNA samples may be quantified by Qubit fluorometer. Pool size may be verified via Tapestation 1000 HS and sequenced on an Illumina MiSeq Micro (2×150 nt) for HCDR3 libraries or an Illumina MiSeq Reagent Kit v3 (2×300 nt) for HCDR1-HCDR3 libraries with 20% PhiX, in some aspects.
After sequencing, amplicon reads may be merged using Fastp, trimmed by cutadapt [44] and each unique sequence enumerated. Next, custom R scripts may be applied to calculate sequence frequency ratios between the most abundant and second-most abundant sequence in each sample. Levenshtein distance may be calculated between the two sequences. These distance values may be used for downstream filtering to ensure a clonal population was measured by SPR. The dominant sequence within each sample may be compared to the designed sequences and discarded if it does not match any expected sequence. Dominant sequences may then be combined with their companion Carterra SPR measurements.
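By way of non-limiting illustration, the frequency ratio and Levenshtein distance calculations used for clonality filtering may be sketched as follows. The passage above refers to custom R scripts; this sketch is written in Python purely for illustration, and the function names and reads are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def clonality(reads):
    """Summarize one sample as (dominant sequence, frequency ratio, distance).

    The ratio compares the most abundant to the second-most abundant
    sequence; the distance is measured between those two sequences.
    """
    ranked = Counter(reads).most_common(2)
    dominant, top_count = ranked[0]
    if len(ranked) == 1:
        return dominant, float("inf"), None
    runner_up, second_count = ranked[1]
    return dominant, top_count / second_count, levenshtein(dominant, runner_up)


# Hypothetical sample: nine reads of one sequence, three of a close variant.
summary = clonality(["ACGT"] * 9 + ["ACGA"] * 3)
```

A high frequency ratio suggests a clonal sample, while a low ratio between two dissimilar sequences may indicate a mixed population that should be filtered out.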
In some aspects, the present techniques may include training one or more machine learning models to generate target biomolecule structures and/or sequences.
Generally, the one or more training inputs may include data representing one or more of: (i) an amino acid; (ii) a peptide sequence; (iii) a polypeptide sequence; (iv) a primary sequence; (v) one or more secondary structures; (vi) one or more tertiary structures; (vii) one or more quaternary structures; or (viii) three-dimensional coordinates of a primary sequence, corresponding to an input biomolecule, an input biomolecule binding partner, or an input biomolecule-input binding partner binding complex.
The method 200 may include processing the one or more training inputs with a machine-learned biomolecule prediction model to generate predicted biomolecule structural information (block 204). The method 200 may include evaluating a loss function that compares the predicted biomolecule structural information to a ground truth value (block 206). The method 200 may include modifying one or more values of one or more parameters of the machine-learned model based at least in part on the loss function (block 208). The method 200 may be performed by the ML training module 164 of
In some aspects, evaluating the loss function that compares the predicted biomolecule structural information to the ground truth value to train the machine-learned model at block 206 may include configuring a second model acting as a critic to train the machine-learned model to predict according to the ground truth data. Such techniques as adversarial architectures (e.g., generative adversarial networks) may be used in this context.
In some aspects, the biomolecule may be an antibody or protein. In some aspects, the biomolecule structural information comprises three-dimensional coordinates of a primary sequence. In some aspects, the biomolecule binding partner is an antigen, receptor, ligand or a cell membrane. In some aspects, the training inputs are represented, respectively, as at least one of: (i) a protein data bank (PDB) data format, (ii) a JSON data format or (iii) an XML data format. Other suitable formats may be used, in some aspects. In some aspects, the biomolecule binding partner is a protein, and the biomolecule binding partner structural information comprises three-dimensional structure. In some aspects, the method 200 may further include receiving the training inputs from a database comprising antibody structures and/or antigen structures. For example, the training inputs may be received/retrieved from a structural antibody database (SAbDab). It will be appreciated by those of ordinary skill in the art that data processing steps may be helpful for performing the present techniques. For example, to assess whether an ML model can generalize to particular binding partners (e.g., antigens) that are very different from what the model has been trained with, all biomolecules in the training set may be removed that are within 40% sequence identity to any protein in corresponding validation and testing sets.
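By way of non-limiting illustration, the removal of training biomolecules that are within 40% sequence identity of any held-out protein may be sketched as follows. The sketch assumes that sequences have already been aligned to equal length and interprets the threshold as removing any training sequence whose identity to a held-out sequence is 40% or greater; both are assumptions, and in practice the identity function may be replaced with one based on a pairwise alignment. The sequences shown are hypothetical.

```python
def aligned_identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_training_set(train, heldout, threshold=0.4, identity=aligned_identity):
    """Keep only training sequences below the identity threshold
    with respect to every validation/testing sequence."""
    return [
        seq for seq in train
        if all(identity(seq, other) < threshold for other in heldout)
    ]


# Hypothetical toy sequences; only "CCCCC" is dissimilar enough to keep.
kept = filter_training_set(["AAAAA", "CCCCC", "AACCC"], ["AAAAC"])
```

Filtering in this manner makes the validation and testing sets a stricter test of generalization to binding partners unlike those seen in training.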
Generally, the structural information of the target biomolecule includes data representing one or more of: (i) an amino acid of the target biomolecule; (ii) a peptide sequence of the target biomolecule; (iii) a polypeptide sequence of the target biomolecule; (iv) a primary sequence of the target biomolecule; (v) one or more secondary structures of the target biomolecule; (vi) one or more tertiary structures of the target biomolecule; (vii) one or more quaternary structures of the target biomolecule; or (viii) three-dimensional coordinates of a primary sequence of the target biomolecule.
The method 230 may include storing (e.g., in the memory 154, the data repository 180, etc.) and/or outputting data representing a portion or all of the structural information of the biomolecule. For example, the data may represent the one or more of: (i) the amino acid of the target biomolecule; (ii) the peptide sequence of the target biomolecule; (iii) the polypeptide sequence of the target biomolecule; (iv) the primary sequence of the target biomolecule; (v) the one or more secondary structures of the target biomolecule; (vi) the one or more tertiary structures of the target biomolecule; (vii) the one or more quaternary structures of the target biomolecule; or (viii) the three-dimensional coordinates of a primary sequence of the target biomolecule. In some aspects, the polypeptide sequence of the target biomolecule corresponds to a complementarity determining region of an antibody. In some aspects, the structural information of the target biomolecule is selected from the group consisting of: an alpha helix, a beta pleated sheet, and a coil. In some aspects, the target biomolecule is an antibody and the target binding partner is an antigen. The method 230 may include repeating block 234 until the structural information of the target biomolecule is complete. In some aspects, the method 230 may be performed iteratively.
In some aspects, the method 230 may include receiving an input parameter specifying a desired length of amino acids of the target biomolecule; and generating a target biomolecule having the desired length. In some aspects, the machine-learned biomolecule prediction model may determine the length by predicting an end-of-sequence token. In some aspects, the method 230 may include updating the structural information of the target biomolecule by folding one or more proteins. In some aspects, the target binding partner is at least one of (i) a primary amino acid sequence, (ii) an antigen epitope and/or (iii) a known three-dimensional structure of an antigen. In some aspects, the structural information of the target biomolecule is predicted in at least one of (i) N-terminus to C-terminus order, (ii) C-terminus to N-terminus order, or (iii) via random sampling. Advantageously, the present techniques are not limited to predicting in any fixed order.
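By way of non-limiting illustration, the two ways of determining sequence length described above, an input parameter specifying a desired length or a predicted end-of-sequence token, may be sketched as follows for N-terminus to C-terminus decoding. The token name, the toy next-token function, and the upper bound on length are hypothetical placeholders for a trained model and its configuration.

```python
EOS = "<eos>"  # hypothetical end-of-sequence token


def toy_next_token(prefix, allow_eos):
    """Hypothetical stand-in for a trained model's next-residue prediction."""
    template = "ARDY"
    if allow_eos and len(prefix) >= len(template):
        return EOS
    return template[len(prefix) % len(template)]


def decode(next_token, desired_length=None, hard_limit=64):
    """Generate residues one at a time, each conditioned on the prefix so far.

    If desired_length is given, exactly that many residues are generated and
    the end-of-sequence token is disallowed; otherwise generation stops when
    the model predicts the end-of-sequence token (or at hard_limit).
    """
    sequence = []
    limit = desired_length if desired_length is not None else hard_limit
    while len(sequence) < limit:
        token = next_token(sequence, allow_eos=desired_length is None)
        if token == EOS:
            break
        sequence.append(token)
    return "".join(sequence)


model_chosen_length = decode(toy_next_token)
fixed_length = decode(toy_next_token, desired_length=6)
```

Other decoding orders, such as C-terminus to N-terminus or random sampling of positions, would change which positions are filled at each step but not the overall iterative structure.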
In some aspects, the method 230 may further include operating and/or training a machine-learned docking model trained to generate docked complexes from two or more input three-dimensional biomolecules. In some aspects, the method 230 may further include processing the target input and the predicted structural information of the target biomolecule using the docking model to generate a docked complex comprising the target input and the target biomolecule. In some aspects, the method 230 may include providing the docked complex as a docked complex output. In some aspects, the method 230 may further include operating and/or training a machine-learned binding affinity prediction model trained to predict binding affinity from input docked complexes. The method 230 may include processing the docked complex using the trained affinity prediction model to generate a binding affinity value. In some aspects, the machine-learned model is an artificial neural network. In some aspects, the artificial neural network is at least one of a graph convolutional neural network, a message passing neural network, or a geometric vector perceptron network. In some aspects, the method 230 may include summarizing node information contained within the artificial neural network using a pooling operation to generate the binding affinity value. In some aspects, the machine-learned model may be trained to predict binding specificity in addition to, or as an alternative to, binding affinity.
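The pooling operation described above may be sketched as follows. This is a minimal illustration that assumes mean pooling followed by a linear readout; the source does not state which pooling operation or readout is used, and the function names, weights, and bias are hypothetical.

```python
from typing import List, Sequence


def mean_pool(node_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-node feature vectors into one graph-level vector."""
    if not node_features:
        raise ValueError("at least one node is required")
    n_nodes = len(node_features)
    dim = len(node_features[0])
    return [sum(node[d] for node in node_features) / n_nodes for d in range(dim)]


def affinity_readout(
    node_features: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Map pooled node information to a single scalar (assumed linear readout)."""
    pooled = mean_pool(node_features)
    return sum(w * x for w, x in zip(weights, pooled)) + bias
```

Because the pooled vector has a fixed size, the same readout can be applied to docked complexes with different numbers of nodes.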
In some aspects, the method 230 may include training the machine-learned biomolecule prediction model by providing sequence data as input to teacher force the biomolecule prediction model; training the docking model by providing antibody structure data as input to teacher force the docking model; and/or training the affinity prediction model by providing sequence data as input to teacher force the affinity prediction model. Specifically, the biomolecule prediction model may be designing one or more new antibodies. Early in training, the design capabilities may not be performant (e.g., the model may never get to a point where the training signal is meaningful). With teacher forcing, the problem is made easier. At each stage, the model is trained to look at the last ground truth position, causing the model to focus on the most important information at each training stage.
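Teacher forcing as described above may be sketched by constructing training pairs in which the model always conditions on the ground truth prefix rather than on its own earlier predictions. The function name and the end-of-sequence token string are assumptions for this sketch; the example uses a sequence only, whereas the present techniques may also condition on structure.

```python
from typing import List, Tuple


def teacher_forced_pairs(sequence: str, eos: str = "<eos>") -> List[Tuple[List[str], str]]:
    """Build (ground-truth prefix, next token) training pairs.

    At step t the input is the true tokens before position t and the
    target is the true token at position t. The final target is an
    end-of-sequence token so that a model can learn when to stop.
    """
    tokens = list(sequence) + [eos]
    return [(tokens[:t], tokens[t]) for t in range(len(tokens))]
```

For a two-residue sequence this yields three pairs: an empty prefix predicting the first residue, the first residue predicting the second, and both residues predicting the end-of-sequence token.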
The models may be trained using antibody/antigen structures (e.g., antibody/antigen complexes in a protein databank, such as SAbDab). The models may be configured to emit an initial guess as to a drug molecule or antibody. The guess may be generated by instructions that iteratively output one residue or structure at a time, based on an input binding partner (e.g., antigen). Initially, the trained model may predict a first structure of the target biomolecule without knowing the corresponding sequence, but while knowing the target antigen (e.g., provided as input to the trained model). The ML model may iteratively predict a first amino acid and update the target biomolecule's structure, then a second amino acid, and so on until the target biomolecule's structure is complete, or until another stopping condition occurs. Stopping conditions may include the trained model having made a prediction for each corresponding input, satisfaction of a convergence criterion, filling of a portion of a CDR, etc. The predictions of the trained ML model may work in any direction, including (i) N-terminus to C-terminus order, (ii) C-terminus to N-terminus order, or (iii) via random sampling. The antigen provided to the trained ML model as input may be a primary amino acid sequence, an antigen epitope and/or a known antigen three-dimensional structure, optionally including an antigen and epitope. The model is capable of designing an entire structure, or a portion thereof (e.g., a CDR3). However, empirical research may demonstrate that coarsening the residues or averaging the residues is less performant than, and not comparable to, aspects of the present techniques that analyze the entire structure (e.g., an entire antibody).
In some aspects, the method 230 may include training a machine-learned affinity prediction artificial neural network, including: (i) one or more biomolecule prediction layers trained to predict biomolecule structural information from target inputs; (ii) one or more docking layers trained to generate docked complexes from two or more input three-dimensional biomolecules; and (iii) one or more affinity prediction layers trained to predict affinity from input docked complexes. The one or more biomolecule prediction layers, the one or more docking layers, and the one or more affinity prediction layers may be connected (fully or otherwise). The method 230 may include receiving a target input comprising one or more of a target binding partner sequence, a target binding partner, or a target epitope; and processing the target input using the affinity prediction artificial neural network to generate a docked complex corresponding to the target input and a corresponding structural affinity value. In some aspects, the method 230 may include receiving a sequence affinity prediction value from an affinity sequencing prediction model; and averaging the structural affinity value with the sequence affinity prediction value to generate an ensemble affinity prediction value.
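The ensemble step described above may be sketched as a simple average of the two affinity values. The function name and the optional weight parameter are assumptions for this sketch; the default weight of 0.5 reproduces the plain average described in the text, and both inputs are assumed to be on the same scale.

```python
def ensemble_affinity(
    structural_affinity: float,
    sequence_affinity: float,
    structural_weight: float = 0.5,
) -> float:
    """Combine structure-based and sequence-based affinity predictions.

    With the default weight this is the arithmetic mean of the two values.
    """
    if not 0.0 <= structural_weight <= 1.0:
        raise ValueError("structural_weight must be in [0, 1]")
    return (
        structural_weight * structural_affinity
        + (1.0 - structural_weight) * sequence_affinity
    )
```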
In some aspects, the method 230 may include training the machine learned affinity prediction artificial neural network (or layers thereof) to make diverse generations. For example, given an antigen, the prediction ANN may make millions of antibodies. In order to rank/filter those and choose the ones that we think are most likely to work, the present techniques may include ranking the antibodies using other techniques, including those described in U.S. Provisional Patent Application No. 63/297,679; incorporated by reference in its entirety herein.
In some aspects, the method 230 may include pre-training the one or more biomolecule prediction layers using bound or unbound structures; pre-training the one or more biomolecule prediction layers and the one or more docking layers using bound structures; and/or pre-training the machine-learned affinity prediction artificial neural network using affinity training data. In some aspects, the method 230 may include receiving an external computationally docked complex corresponding to the target input; and comparing the external computationally docked complex to the generated docked complex. Thus, the present techniques may advantageously be used for verification purposes. In some aspects, the method 230 may include the affinity training data being derived from an assay (e.g., an ACE assay). The affinity training data may include an affinity score that is proportional to activity. The method 230 may include controlling a gradient flow of the machine-learned affinity prediction artificial neural network by applying a stop gradient function in at least one of the one or more biomolecule prediction layers, the one or more docking layers, or the one or more affinity prediction layers. For example, in PyTorch, a stop gradient may be implemented using ‘.detach()’, which detaches the relevant variable from the computational graph, thereby creating a new graph. In aspects wherein the same network weights are applied in each iteration, stop gradients may be employed such that each iteration is a separate graph. Gradients may then be averaged. In this context, conceptually, the stop gradients may be thought of as a strategy to augment limited data. Generally, the method 230 may be used to perform de novo discovery of biomolecules (e.g., a drug candidate such as an antibody) in silico.
Regarding the results of the present techniques, the fact that the visualized predicted structural information 402a and visualized ground truth structural information 402b appear visually similar indicates that the trained ML model/ANN is working as expected. In particular, the protein-folding component of the network is working, and designing the outputs along the way.
As discussed herein, the present techniques include ML modeling to automatically design or improve biomolecules against a specific target of interest. For example, in some cases, the models may be pre-trained (e.g., using biomolecule complexes mined from the Protein Data Bank) to predict the structure, sequence and/or affinity of an antibody given an antigen structure. Recent testing has quantified that the present techniques are highly efficacious, even without the added benefit of pre-training the ML model(s) to predict the binding affinity of the target binding partner and predicted target biomolecule structural information.
In general, the data presented in
In some aspects, the present techniques leverage ACE assays (e.g., Bachas S et al.; Liu J. Activity-specific cell enrichment; Patent Publication No. WO2021/146626, 22.07.2021; etc.) to screen massive antibody variant libraries containing hundreds of thousands of members (e.g., expressed in Fab format). After screening by ACE, the present techniques may validate the assay for the de novo discovery techniques described herein by sampling sequences for follow-up analysis by SPR, a gold standard in binding affinity measurement and detection. Empirical results have demonstrated that the ACE assay has high recall, correctly classifying >95% of the SPR binders. Moreover, the assay has high precision, with 60% of ACE binders validating in SPR. This may enable a powerful workflow where a large population of predictions can be initially screened by the ACE assay, and the expected binding population can be subsequently screened via SPR to remove false positives and collect high quality binding affinity measurements, as discussed herein.
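The recall and precision figures above treat SPR as the reference measurement. A minimal sketch of how such figures could be computed from paired binder calls is shown below; the function name and the dictionary-based input format are assumptions for illustration, and the example data in the usage note are invented.

```python
from typing import Dict, Tuple


def recall_precision(ace_calls: Dict[str, bool], spr_calls: Dict[str, bool]) -> Tuple[float, float]:
    """Compute ACE recall and precision against SPR as the reference.

    Each dict maps a variant identifier to True (binder) or False
    (non-binder). Only variants measured by both assays are compared.
    """
    shared = ace_calls.keys() & spr_calls.keys()
    tp = sum(1 for v in shared if ace_calls[v] and spr_calls[v])
    fp = sum(1 for v in shared if ace_calls[v] and not spr_calls[v])
    fn = sum(1 for v in shared if not ace_calls[v] and spr_calls[v])
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return recall, precision
```

For example, with hypothetical calls `{"v1": True, "v2": True, "v3": False}` for ACE and `{"v1": True, "v2": False, "v3": False}` for SPR, every SPR binder is recovered while half of the ACE binders validate.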
The present techniques demonstrate the ability of generative AI to design de novo antibodies against antigens of interest and to generate new HCDR3 sequences zero-shot for known antibodies. The present techniques focus on design of the HCDR3 region, a key determinant of antibody function, due to its high sequence diversity in immune repertoires and high density of paratope residues. For example, the present techniques may select trastuzumab, which binds to HER2, as a scaffold antibody to test HCDR3 designed sequences. HCDR3 designs are generated by a model conditioned on an antigen-only modified HER2 3-D structure (PDB:1N8Z (chain C)) and the sequence of the trastuzumab scaffold, excluding the HCDR3. The present techniques may remove, for models used herein, any antibody known to bind the target or any homolog (>40% sequence identity or part of the same homologous superfamily) to the target. In some settings, we instead remove all antibodies from the training set with >40% sequence identity to the wildtype antibody. In all cases, we observed binders. In total, we generate and screen 440,354 antibody variants with the ACE assay to identify binding variants. The present techniques find approximately 4,000 total estimated binders based on expected ACE assay binding rates and advance a subset for further characterization, in total confirming HER2 binding for 421 AI designs via SPR.
Confirmed binders show a range of affinity to HER2, with 71 designs exhibiting <10 nM affinity (
In addition to favorable affinity, AI model designs have high sequence diversity, both in terms of amino acid length and identity. Verified binders have HCDR3s ranging from 11 to 15 amino acids (
Despite the high sequence diversity of the 421 designed binders, the present techniques compare model outputs to the training set to ensure that model generations are novel sequences rather than simply reproduced training examples. Such reproduction has been observed in machine learning models, and past methods have been critiqued for generating molecules that are similar to those previously known. The present techniques compute the minimum distance between the designed binders and all HCDR3s in the models' training and validation sets, finding that designed binders are distinct from those observed during training (
The present techniques examined the sequence similarity of the model's output to sequences in the Observed Antibody Space (OAS), a database of immune repertoire sequencing studies. The present techniques found that some of the model's designs already existed in the OAS, while others were unique with minimum edit distances between 1-5 (
Zero-Shot Designs have a High Naturalness Score
Therapeutic lead antibody candidates that are successful in the drug development process typically have high affinity and are developable with low immunogenicity. In previous work we described a language model that can assign a score to antibody sequences indicating the likelihood of finding a sequence in a typical immune repertoire. This metric is referred to as Naturalness. A high Naturalness score is associated with favorable developability and immunogenicity development outcomes. Using the Naturalness scoring model on our designs, we find our models can generate sequences with high Naturalness scores, with high affinity in a zero-shot manner, despite not training or sampling based on these qualities (
3D Structural Representation of De Novo Designed HCDR3s and their Comparison to Trastuzumab-HER2 Complex
The present techniques next predict the binding mechanisms of the present de novo designed HCDR3 variants to better understand the structural basis of the highly sequence diverse variants. To this end, the present techniques built structural models of eight diverse HCDR3 candidates bound to HER2 in Fab format. These eight variants are selected based on their edit distance to the trastuzumab HCDR3, diversity in length (ranging from 12-15 amino acids), and affinity range, spanning three orders of magnitude (Table 1). The present techniques use trastuzumab Fab-HER2 complex (PDB:1N8Z) as a starting template for structural modeling. The present techniques run local constrained backbone geometry and side chain rotamer optimization followed by relaxation of Fab-HER2 complexes to correct global conformational ambiguities, steric clashes, and sub-optimal loop geometry. As a control, the experimental trastuzumab-Fab complex is optimized using the same protocol and used as a reference for comparing final HCDR3 structural models. The present techniques use the lowest free energy poses of the de novo HCDR3 models for structural analyses and comparisons.
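Edit distance is used above as a selection criterion and elsewhere herein to compare designs against training and reference sets. A minimal sketch is given below, assuming the Levenshtein distance (insertions, deletions, and substitutions each cost 1); the source does not specify the exact distance definition, and the function names are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def min_distance_to_set(query: str, references: Iterable[str]) -> int:
    """Smallest edit distance from a design to any reference sequence."""
    return min(edit_distance(query, ref) for ref in references)
```

A minimum distance of 0 indicates that a design already exists in the reference set, while larger values indicate increasingly novel designs.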
The eight de novo Fab-HER2 structural models are globally similar to trastuzumab with all-atom HCDR3 RMSDs ranging from 1.9 Å to 2.4 Å, despite sequence dissimilarity. Minimal structural rearrangements are observed in the unmodified regions of the heavy chain, the light chain and epitope residues of the antigen (
Even though de novo HCDR3s adopt distinct conformations, there are important positional similarities among all structures. A closer analysis of the spatial orientation of side chain conformers reveals conservation of identical side chains at five discrete spatial locations, two of these locations corresponding to IMGT residue positions R103 and Y117 in trastuzumab, which are highly conserved in most antibodies. However, there is physicochemical conservation in all structures corresponding to the spatial positions of IMGT residue numbers 109, 113, and 117 of trastuzumab, which contribute to the paratope of the trastuzumab-HER2 complex. Although conserved spatially, these side chains originate from a diversity of residue positions, which highlights that the observed conformational flexibility may be required for orienting key paratope side chains to make the same important protein-protein interactions with the epitope of HER2.
Although the overall binding region is identical, each designed HCDR3 exhibits distinct binding modes between HCDR3 and epitope. In most cases, novel interactions not observed in the trastuzumab-HER2 complex are formed between designed HCDR3 and domain IV HER2 (
In several cases, de novo HCDR3 variants show a larger binding interface area than trastuzumab, which could suggest novel interactions with the HER2 epitope. Interestingly, no correlation is observed between binding interface area and binding affinity. This finding would suggest that hydrophobic contributions and surface area burial are not key determinants of binding. Moreover, specific contacts formed between each designed HCDR3 and the epitope are critical to the binding stability of the Fab-HER2 complex. Furthermore, the present techniques calculated the grand average of hydropathy (GRAVY) value of each HCDR3 variant, which captures the collective hydrophobic properties summed over its residues, and compared these values to the binding affinities. The present techniques observe no correlation between affinity and hydrophobicity, which further supports that the hydrophobic effect is not the major determinant of binding for de novo designed HCDR3s (Table 2). Combined, these results suggest that binding affinity of designed HCDR3s is intrinsic to the sequence design and is not driven by a common binding mechanism. The high dependence of binding on sequence attributes agrees with a low probability of designing binders by chance.
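The GRAVY calculation referenced above may be sketched as follows. GRAVY is conventionally the sum of per-residue hydropathy values divided by the sequence length; the sketch assumes the Kyte-Doolittle hydropathy scale, which the source does not name explicitly, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean hydropathy over all residues."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence.upper()) / len(sequence)
```

Positive values indicate a more hydrophobic sequence and negative values a more hydrophilic one, so the resulting values can be compared against measured affinities across variants.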
The present techniques next conduct a pilot study to demonstrate the applicability of our approach to a broader set of antigens. For these additional targets, the present techniques do not pre-screen by the ACE assay. Rather, the present techniques sample a small number of sequences and validate binding by SPR. The present techniques first successfully design HCDR3 variants of the therapeutic ranibizumab, which binds to human vascular endothelial growth factor A (VEGF-A), as shown in
The present techniques next design a set of HCDR3 variants of casirivimab, conditioned on Omicron spike protein binding. Casirivimab binds to multiple COVID spike protein variants and in particular binds weakly to Omicron. The present techniques measured casirivimab affinity to Omicron via SPR at 240.0 nM (Table S1). The present techniques identify one AI designed variant that binds with similar affinity to Omicron at 179.7 nM (
Expanding antibody design beyond HCDR3 allows for increased sequence diversity and controllability and represents the next step toward fully de novo design. To this end, we applied an alternative multi-step AI design method that is distinct from our described zero-shot approach to generate variants of all heavy chain variable regions (HCDR1, HCDR2, HCDR3) simultaneously. The present techniques report multiple binding designs to HER2 identified within a library of less than 500 multi-step designed variants (Table 2). The present techniques find that these binders again are distinct from examples in the model's training data and antibodies in the SAbDab and OAS databases (
For the de novo design of HCDR3 trastuzumab variants, the present techniques may begin by designing two high-diversity libraries (HDLs), denoted HDL1 and HDL2, consisting of, for example, 223,046 and 217,308 designs, respectively. HDL1 may be screened using the ACE assay and the dnACE assay described herein, and the results are used to design low-diversity libraries (LDLs), each consisting of 1,000 or fewer sequences that are screened using SPR. HDL2 is screened using ACE and the results are again used to design LDLs for SPR screening. In total, 199 binders are confirmed from HDL1, and ACE scores yield an estimate of an additional 3,765 binders in HDL2 (Table S5). The ACE assay may not be used to screen variants of HCDR3 ranibizumab, HCDR3 casirivimab, or HCDR123 trastuzumab. Instead, small LDLs may be screened directly with SPR.
The Naturalness score used in this study may be computed using the pre-trained antibody language model discussed herein and introduced previously. This model is based on the pseudo-perplexity across the extended CDRs (e.g., defined by a union of the IMGT and Martin definitions) of an antibody heavy chain under the language model. This metric may be predictive of desirable antibody therapeutic properties such as developability or lack of immunogenicity. The present techniques may place variant HCDR3s into the wildtype scaffold by replacing the wildtype HCDR3. In addition to computing naturalness for the present de novo binders, the present techniques may further include several controls:
1. OAS: Consists of 1,000 HCDR3s randomly sampled from OAS. Naturalness scores are computed over a grafting of the HCDR3 into the trastuzumab scaffold. We expect sequences from OAS to have high Naturalness scores, given that the Naturalness model is pretrained on OAS, so we treat this as a positive control.
2. Frequency Baseline: These are 1000 sequences generated by randomly sampling from a length-conditioned frequency distribution of OAS. The present techniques may compute $P_L(\ell)$, the probability that an HCDR3 in OAS has length $\ell$, and then compute the probability of sampling a particular sequence with length $\ell$, using an independent factorization based on amino acid frequencies, defined as:
The present techniques may include sampling a number (e.g., 1000) of lengths $\ell_1, \ell_2, \ldots, \ell_{1000} \sim P_L$ and then sampling 1000 sequences $s_i$, one for each sampled length $\ell_i$, from the corresponding length-conditioned distribution.
3. Phage Display Baseline: These are 1000 HCDR3s randomly sampled from a first round of phage display panning (Liu et al.). Antibody heavy chain sequences are sampled and the HCDR3s are extracted. The collection from which the antibodies are sampled consists of both non-binders and binders.
4. Scrambled OAS: This consists of permuted versions of the 1000 HCDR3s in the OAS control. For each such HCDR3 the present techniques may include permuting the respective sequence 5 different times, computing naturalness using the permuted HCDR3, and reporting the average across the 5 permutations. In some aspects, the motivation for this computation as a negative control is that permuting a protein sequence destroys positional information. Lower Naturalness scores of this baseline compared to the first OAS baseline imply that the Naturalness model is able to capture positional information, and is not just considering amino acid composition.
The naturalness scores of the present de novo designs may be compared to these controls using two-sample t-tests (H0: μ1 = μ2, Ha: μ1 ≠ μ2) and compared to wildtype using one-sample t-tests with the wildtype naturalness as the population mean (H0: μ1 = μ, Ha: μ1 ≠ μ). A table of the mean naturalness scores is shown below, across the different groups as well as p-values for the relevant statistical comparisons to the de novo binders.
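Two of the controls listed above, the Frequency Baseline and the Scrambled OAS control, may be sketched as follows. This is an illustration under stated assumptions: amino acid frequencies are taken to be global (position-independent), which the source does not specify; `score_fn` is a hypothetical stand-in for the Naturalness model; and all function names are invented for this sketch.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def sample_frequency_baseline(hcdr3s: Sequence[str], n: int, rng: random.Random) -> List[str]:
    """Sample n sequences: a length from the empirical length distribution,
    then each residue independently from empirical amino acid frequencies."""
    length_counts = Counter(len(s) for s in hcdr3s)
    residue_counts = Counter(ch for s in hcdr3s for ch in s)
    lengths, length_weights = zip(*length_counts.items())
    residues, residue_weights = zip(*residue_counts.items())
    samples = []
    for _ in range(n):
        length = rng.choices(lengths, weights=length_weights)[0]
        samples.append("".join(rng.choices(residues, weights=residue_weights, k=length)))
    return samples


def scrambled_score(
    hcdr3: str,
    score_fn: Callable[[str], float],
    rng: random.Random,
    n_permutations: int = 5,
) -> float:
    """Average score over random permutations of one HCDR3."""
    scores = []
    for _ in range(n_permutations):
        shuffled = list(hcdr3)
        rng.shuffle(shuffled)  # destroys positional information, keeps composition
        scores.append(score_fn("".join(shuffled)))
    return sum(scores) / len(scores)
```

Both controls preserve amino acid composition statistics while discarding positional structure, which is why lower scores on them would indicate that a scoring model uses positional information.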
8.05 · 10−300
Table 2 depicts mean naturalness scores across different groups, and p-values using trastuzumab scaffolding.
Further, the present techniques previously noted the diversity of designs relative to OAS, the pre-training data used for the model that was used to compute naturalness herein, in some aspects. Taking samples far away from the training set could lead to lower naturalness. Indeed, as shown in
As discussed, the present techniques include training models using training data. For example, the present techniques may use the following binder data for training, in some aspects. Below are Tables E1-E4 including binder data for, respectively, De Novo Trastuzumab HCDR3 Variant HER2 Binders, De Novo Ranibizumab HCDR3 Variant VEGF-A Binders, De Novo Casirivimab HCDR3 Variant COVID-Omicron Binders, and De Novo Trastuzumab HCDR123 Variant HER2 Binders. The following common sequences apply to the binder data.
For the HER2 binders the following information is relevant but does not change for each binder:
Common/invariant information for the COVID, VEGF and HER2 affinity tables is as follows:
In some aspects, antibody variants may be cloned and expressed in Fab format. For example, to produce ACE assay and SPR datasets, the present techniques may include synthesizing DNA variants spanning HCDR3 or HCDR1 to HCDR3 across different libraries in an oligonucleotide format using ssDNA oligo pools (Twist Bioscience) as well as a double stranded DNA format using eBlocks (IDT). Codons may be randomly selected from the two most common in E. coli B strain for each variant. Amplification of the ssDNA oligo pools may be carried out by PCR according to Twist Bioscience's recommendations, with the exception that Q5 high fidelity DNA polymerase (NEB) may be used in place of KAPA polymerase. Briefly, 25 μL reactions may consist of 1×Q5 Mastermix, 0.3 μM each of forward and reverse primers, and 10 ng oligo pool. Reactions may be initially denatured for 3 min at 95° C., followed by 13 cycles of: 95° C. for 20 s; 66° C. for 20 s; 72° C. for 15 s; and a final extension of 72° C. for 1 min. DNA amplification may be confirmed by agarose gel electrophoresis, and amplified DNA may be subsequently purified (DNA Clean and Concentrate Kit, Zymo Research).
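The codon selection step described above (a random choice between the two most common codons for each amino acid) may be sketched as follows. The codon table below is an illustrative placeholder and is not taken from the source; it is not a verified E. coli B usage table, and a measured usage table for the strain would be substituted in practice. Methionine and tryptophan each have a single codon in the standard genetic code. The function name is hypothetical.

```python
import random
from typing import Dict, Tuple

# Placeholder two-codon table (ASSUMPTION: illustrative only, not verified
# against E. coli B codon usage). Every codon encodes the listed amino acid.
TOP_TWO_CODONS: Dict[str, Tuple[str, ...]] = {
    "A": ("GCG", "GCC"), "R": ("CGT", "CGC"), "N": ("AAC", "AAT"),
    "D": ("GAT", "GAC"), "C": ("TGC", "TGT"), "Q": ("CAG", "CAA"),
    "E": ("GAA", "GAG"), "G": ("GGC", "GGT"), "H": ("CAT", "CAC"),
    "I": ("ATT", "ATC"), "L": ("CTG", "TTA"), "K": ("AAA", "AAG"),
    "M": ("ATG",), "F": ("TTT", "TTC"), "P": ("CCG", "CCA"),
    "S": ("AGC", "TCT"), "T": ("ACC", "ACG"), "W": ("TGG",),
    "Y": ("TAT", "TAC"), "V": ("GTG", "GTT"),
}


def back_translate(protein: str, rng: random.Random) -> str:
    """Encode a protein sequence as DNA, picking one allowed codon per residue."""
    return "".join(rng.choice(TOP_TWO_CODONS[residue]) for residue in protein.upper())
```

Passing a seeded `random.Random` instance makes the codon choices reproducible for a given library design.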
To build libraries meant for SPR validation of model designs in independent experiments, oligonucleotides spanning appropriate CDR(s) and the immediate upstream/downstream flanking nucleotides were synthesized by Integrated DNA Technologies (IDT).
To generate linearized vectors, the present techniques may include performing a two-step PCR to split the present proprietor's plasmid vector carrying Fab format trastuzumab into two fragments in a manner that provides cloning overlaps of approximately 25 nucleotides (nt) on the 5′ and 3′ ends of the amplified ssDNA oligo pool libraries, or 40 nt on the 5′ and 3′ ends of IDT eBlocks. Vector linearization reactions may be digested with DpnI (New England Biolabs) and purified from a 0.8% agarose gel (Gel DNA Recovery Kit, Zymo Research) to eliminate parental vector carry through.
Cloning reactions may consist of 50 fmol of each purified vector fragment, either 100 fmol purified library (Twist Bioscience) or 10 pmol gBlock insert (IDT), and 1× final concentration NEBuilder HiFi DNA Assembly (New England Biolabs). Reactions may be incubated at 50° C. for either two hours (Twist Bioscience libraries) or 25 min (IDT library), and subsequently purified (DNA Clean and Concentrate Kit, Zymo Research).
For HDLs, Transformax EPI300 (Lucigen) E. coli may be transformed by electroporation (BioRad MicroPulser) with the purified assembly reactions and grown overnight in 20 mL of Teknova LB Broth with 50 μg/mL Kanamycin at 30° C. and 80% humidity with 270 rpm shaking for 18 h. Plasmids may be extracted (Plasmid Midi Kit, Zymo Research) and submitted for QC sequencing. Absci's SoluPro™ host strain may be transformed with 1 ng QC plasmid and grown in 20 mL of Teknova LB Broth with 50 μg/mL Kanamycin at 30° C. and 80% humidity with 270 rpm shaking for 18 hours.
For LDLs, Absci's SoluPro™ host strain may be transformed with the purified assembly reactions and grown overnight at 30° C. on agar plates containing 50 μg/mL kanamycin and 1% glucose. Colonies may be picked for QC analysis prior to cultivation for induction. The foregoing experimental parameters and process flow may be modified, in some aspects.
In the present techniques, quality of high diversity variant libraries may be assessed by deep sequencing. Briefly, library plasmid pools may be amplified by PCR across the HCDR3 region and sequenced with 2×150 nt reads using the Illumina MiSeq platform with 20% PhiX, for example. The PCR reaction may be 10 nM primer concentration, Q5 2× master mix (NEB) and 1 ng of input DNA diluted in H2O. Reactions may be initially denatured at 98° C. for 3 min; followed by 30 cycles of 98° C. for 10 s, 59° C. for 30 s, 72° C. for 15 s; with a final extension of 72° C. for 2 min. Sequencing results may be analyzed for distribution of mutations, variant representation, library complexity and recovery of expected sequences. Metrics may include coefficient of variation of sequence representation, read share of top 1% most prevalent sequences and percentage of designed library sequences observed within the library. Quality of low diversity variant libraries may be assessed by performing rolling circle amplification (Equiphi29, Thermo Fisher Scientific) on 24 colonies and sequencing using the Illumina DNA Prep, Tagmentation Kit (Illumina Inc.). Each colony may be analyzed for single nucleotide polymorphisms (SNPs), presence of multiple variants, misassembly, and/or matching to a library sequence (Geneious Prime).
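The three library metrics named above may be sketched as follows. This sketch assumes that the coefficient of variation is computed over observed sequences using the population standard deviation and that the top 1% is rounded up to at least one sequence; the source does not specify these details, and the function name and input format are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, Set


def library_qc(read_counts: Dict[str, int], designed: Set[str]) -> Dict[str, float]:
    """Compute simple representation metrics for a sequenced variant library.

    read_counts maps each observed sequence to its read count;
    designed is the set of sequences the library was designed to contain.
    """
    counts = sorted((c for c in read_counts.values() if c > 0), reverse=True)
    if not counts or not designed:
        raise ValueError("need at least one observed read and one designed sequence")
    total_reads = sum(counts)
    top_n = max(1, math.ceil(0.01 * len(counts)))  # top 1%, at least one sequence
    observed_designed = sum(1 for seq in designed if read_counts.get(seq, 0) > 0)
    return {
        "cv": pstdev(counts) / mean(counts),
        "top_1pct_read_share": sum(counts[:top_n]) / total_reads,
        "pct_designed_observed": 100.0 * observed_designed / len(designed),
    }
```

A low coefficient of variation and a low top 1% read share suggest even representation, while the percentage of designed sequences observed reflects recovery of the intended library.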
Exemplary Antibody Expression in SoluPro™ E. coli B Strain
After transformation and 8 hour recovery, HDLs may be grown in 50 mL of Teknova LB Broth with 50 μg/mL Kanamycin at 30° C. and 80% humidity with 270 rpm shaking for 24 hours. At the end of the 24 hours, the preculture may be OD600 normalized to 0.1 in induction media (IBM) (4.5 g/L Potassium Phosphate monobasic, 13.8 g/L Ammonium Sulfate, 20.5 g/L yeast extract, 20.5 g/L glycerol, 1.95 g/L Citric Acid) containing inducers and supplements (250 μM Arabinose, 50 μg/mL Kanamycin, 8 mM Magnesium Sulfate, 1 mM Propionate, 1× Korz trace metals) and grown for 16 hours in a 500 mL baffled flask at 26° C. and 80% humidity with 270 rpm shaking. At the end of the 16 hours, 500 mL aliquots may be adjusted to 20% v/v glycerol and stored at −80° C.
After transformation and QC of LDLs, individual colonies may be picked into deep well plates containing 400 μL of Teknova LB Broth with 50 μg/mL Kanamycin and incubated at 30° C. and 80% humidity with 1000 rpm shaking for 24 hours. At the end of the 24 hours, 150 μL samples may be centrifuged (3300 g, 7 min) and supernatant decanted from the preculture plate for sequence analysis. 80 μL of the preculture may be transferred to 400 μL of IBM containing inducers and supplements as described above. The culture may be grown for 16 hours at 26° C. and 80% humidity with 270 rpm shaking. At the end of the 16 hours, 150 μL samples may be taken and centrifuged (3300 g, 7 min) into pellets with supernatant decanting prior to being stored at −80° C.
High-throughput quantitative selection of antigen-specific Fab-expressing cells may be adapted from conventional approaches, in some aspects. For staining, an OD600=2 of thawed glycerol stocks from induced cultures may be transferred to 0.7 ml matrix tubes, centrifuged (4000 g, 5 min) and resulting pelleted cells washed three times with PBS (pH 7.4, 1 mM EDTA). Washed cells may be thoroughly resuspended in 250 μL of phosphate buffer (32 mM, pH 7.4) by pipetting prior to fixation by the addition of 250 μL of 0.6% paraformaldehyde and 0.04% glutaraldehyde in phosphate buffer (32 mM, pH 7.4). After 40 min incubation on ice, samples may be centrifuged (4000 g, 5 min) and pellets washed three times with PBS (pH 7.4, 1 mM EDTA), resuspended in permeabilization buffer (20 mM Tris, 50 mM glucose, 10 mM EDTA, 5 μg/mL rLysozyme) and incubated for 8 min on ice. Fixed and permeabilized cells may then be centrifuged (4000 g, 5 min) and washed a number (e.g., three) of times with staining buffer (Perkin Elmer AlphaLISA immunoassay buffer, 25 mM HEPES, 0.1% casein, 1 mg/mL dextran-500, 0.5% Triton X-100, 0.05% Kathon).
Prior to library staining, the HER2 probe may be titrated against the reference strain to determine the 75% effective concentration (EC75). Following cell preparation, the library may be resuspended in 500 μL staining buffer containing 100 nM of either His/Avi-tagged human HER2 (Acro Biosystems) conjugated to 50 nM streptavidin-AF647 (Invitrogen) or tag-free human HER2 (Acro Biosystems) directly conjugated to AF647 via free amines. Libraries may be incubated with the probe overnight (16 h) with end-to-end rotation at 4° C., centrifuged (4000 g, 5 min), and pellets washed three times with PBS. Pellets may be resuspended in 500 μL of staining buffer containing 26.5 nM anti-kappa light chain:BV421 (BioLegend) and incubated for 2 hours with end-to-end rotation at 4° C. prior to centrifugation (4000 g, 5 min), three washes with PBS and resuspension in 200 μL of PBS for sorting.
Libraries may be sorted by one of two methods based on binding in a quantitative ACE assay or a binary version of the ACE assay as described herein. For either method, libraries may be sorted on FACSymphony S6 (BD Biosciences) instruments. Immediately prior to sorting, 50 μL of stained sample may be transferred to a flow tube containing 1 mL PBS+3 μL propidium iodide.
Aggregates, debris, and impermeable cells may be removed with singlets, size, and PI+ parent gating, respectively. Cells may then be gated to include only those with kappa light chain expression (BV421). For the quantitative ACE assay, collection gates may be drawn to sample across the log range of binding signal. The far right gate may be set to collect the brightest 0.1% of the library and the far left gate may be set to collect at the low end of the positive binding signal based on stained control strains. Four additional gates of the same width may be distributed in between, with each set to be approximately half the gMFI of the gate to the right.
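By way of non-limiting illustration, the gate placement described above may be sketched as follows. The sketch assumes that per-event binding signal is available as a list of positive numbers; the function names, the default of four intermediate gates and the halving ratio are illustrative parameters taken from the description, not the interface of any sorter software. The far left gate is not computed here because it may be set from stained control strains.

```python
import math


def gmfi(values):
    """Geometric mean fluorescence intensity of strictly positive values."""
    vals = list(values)
    if not vals or any(v <= 0 for v in vals):
        raise ValueError("gMFI needs at least one value, all strictly positive")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


def top_fraction_threshold(signal, fraction=0.001):
    """Smallest signal value still inside the brightest `fraction` of events.

    fraction=0.001 corresponds to the brightest 0.1% of the library.
    """
    ordered = sorted(signal, reverse=True)
    if not ordered:
        raise ValueError("no events supplied")
    n_top = max(1, int(len(ordered) * fraction))
    return ordered[n_top - 1]


def intermediate_gate_targets(top_gate_gmfi, n_gates=4, ratio=0.5):
    """Target gMFI for each intermediate gate, ordered from right to left.

    Each gate targets `ratio` times the gMFI of the gate to its right.
    """
    if top_gate_gmfi <= 0:
        raise ValueError("top gate gMFI must be positive")
    return [top_gate_gmfi * ratio ** k for k in range(1, n_gates + 1)]
```

In such a sketch, the gMFI of the events falling in the far right gate may be passed to intermediate_gate_targets to obtain the approximate gMFI targets for the four gates in between.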
For the binary version of the ACE assay, a total of three collection gates may be set to sample at the high end of the binding range (top 0.1%), the remaining positive binding signal events, and a negative gate containing the events with no binding signal. Libraries may be sorted simultaneously on two instruments with photomultipliers adjusted to normalize fluorescence intensity, and the collected events may be processed independently as technical replicates.
Sample preparation for sequencing may follow the same protocol for both the previously described quantitative ACE assay and the binary version of the ACE assay. Specifically, cell material from sorted gates may be collected in a diluted PBS mixture (VWR), in 1.5 mL tubes (Eppendorf). A sample of the unsorted library material may be processed for QC and ACE metric calculations. Post-sort samples may be centrifuged (3,800×g) and tube volume normalized to 20 μl. Amplicons encompassing the HCDR3 region may be generated by PCR. The reaction may use 10 nM primer concentration, Q5 2× master mix (NEB) and 20 μl of sorted cell material input suspended in diluted PBS (VWR). Reactions may be initially denatured at 98° C. for 3 min, followed by 30 cycles of 98° C. for 10 s; 59° C. for 30 s; 72° C. for 15 s; with a final extension of 72° C. for 2 min. After amplification, samples may be cleaned enzymatically using ExoSAP-IT (Applied Biosystems). Resulting DNA samples may be quantified by Qubit fluorometer (Invitrogen), prepped for sequencing with the ThruPLEX DNA-Seq Kit (Takara Bio), normalized and pooled. Pool size may be verified via Tapestation 1000 HS and may be sequenced on an Illumina NextSeq 1000 P2 (2×150 nt) with 20% PhiX.
Rosetta's FastRelax application is run with flexible backbone and side-chain degrees of freedom. Prior to the relax procedure, we first idealize all candidate structures using Rosetta's Idealize protocol to avoid steric clashes and improper geometry. We relax using the maximum number of rotamers by passing -EX1, -EX2, -EX3 and -EX4 flags at initialization. We also include flags -packing:repack_only to disable design as well as flags -no_his_his_pairE and -multi_cool_annealer 10 to set the number of annealing iterations. For ranking of conformations in FastRelax, we use Rosetta's ref2015 energy function. It is well known that running relax on a structure will often move the backbone a few Angstroms, so we include an additional term containing harmonic distance constraints for all pairs of Cβ atoms that are either not part of a CDR loop or not within distance 10 of any atom in a CDR loop, based on the conformation of the initial structure. These constraints are given weight 10⁻⁴. The protocol is run ten times for each target, and the decoy with the lowest energy in the HCDR3 loop region is eventually selected.
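By way of non-limiting illustration, the harmonic distance constraint term may be sketched in plain Python as follows. The sketch does not call Rosetta and does not reproduce Rosetta's constraint functions; the function names and the data layout (dictionaries of Cβ coordinates keyed by residue number) are hypothetical. It also assumes one reading of the selection rule, namely that a residue is restrained only if it lies outside every CDR loop and its Cβ atom is farther than the cutoff from every CDR atom, and it assumes a simple penalty of the weight multiplied by the squared deviation from the initial distance.

```python
import math
from itertools import combinations


def restrained_residues(cb_coords, cdr_residues, cdr_atoms, cutoff=10.0):
    """Residues whose C-beta is outside the CDRs and beyond `cutoff` of all CDR atoms.

    cb_coords:    dict mapping residue number -> (x, y, z) of its C-beta atom
    cdr_residues: set of residue numbers belonging to any CDR loop
    cdr_atoms:    list of (x, y, z) coordinates for every atom in the CDR loops
    """
    return sorted(
        res
        for res, xyz in cb_coords.items()
        if res not in cdr_residues
        and all(math.dist(xyz, atom) > cutoff for atom in cdr_atoms)
    )


def build_constraints(cb_coords, residues):
    """Reference distance for every pair of restrained C-beta atoms (initial structure)."""
    return {
        (i, j): math.dist(cb_coords[i], cb_coords[j])
        for i, j in combinations(residues, 2)
    }


def constraint_energy(cb_coords_now, constraints, weight=1e-4):
    """Assumed harmonic penalty: weight * sum of squared deviations from reference."""
    return weight * sum(
        (math.dist(cb_coords_now[i], cb_coords_now[j]) - d0) ** 2
        for (i, j), d0 in constraints.items()
    )
```

With a small weight such as 10⁻⁴, a term of this form penalizes large drifts of the framework without dominating the energy function, while the CDR loops and their surroundings remain unrestrained.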
As described herein, the present techniques may include a computer system, computer-implemented method and/or computer-readable medium/media that predicts or generates structural information of a biomolecule such as an antibody. The term “antibody” as used herein refers to whole antibodies that interact with (e.g., by binding, steric hindrance, stabilizing/destabilizing, spatial distribution) an epitope on a target antigen. A naturally occurring “antibody” is a glycoprotein comprising at least two heavy (H) chains and two light (L) chains inter-connected by disulfide bonds. Each heavy chain is comprised of a heavy chain variable region (abbreviated herein as VH) and a heavy chain constant region. The heavy chain constant region is comprised of three domains, CH1, CH2 and CH3. Each light chain is comprised of a light chain variable region (abbreviated herein as VL) and a light chain constant region. The light chain constant region is comprised of one domain, CL. The VH and VL regions can be further subdivided into regions of hypervariability, termed complementarity determining regions (CDR), interspersed with regions that are more conserved, termed framework regions (FR). Each VH and VL is composed of three CDRs and four FRs arranged from amino-terminus to carboxy-terminus in the following order: FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4. The variable regions of the heavy and light chains contain a binding domain that interacts with an antigen. The constant regions of the antibodies may mediate the binding of the immunoglobulin to host tissues or factors, including various cells of the immune system (e.g., effector cells) and the first component (C1q) of the classical complement system.
The term “antibody” includes for example, monoclonal antibodies, human antibodies, humanized antibodies, camelised antibodies, chimeric antibodies, single-chain Fvs (scFv), disulfide-linked Fvs (sdFv), Fab fragments, F(ab′) fragments, and anti-idiotypic (anti-Id) antibodies (including, e.g., anti-Id antibodies to antibodies of the invention), and epitope-binding fragments of any of the above. The antibodies can be of any isotype (e.g., IgG, IgE, IgM, IgD, IgA and IgY), class (e.g., IgG1, IgG2, IgG3, IgG4, IgA1 and IgA2) or subclass. The antibody or epitope-binding fragments may be, or be a component of, a multi-specific molecule.
Both the light and heavy chains are divided into regions of structural and functional homology. The terms “constant” and “variable” are used functionally. In this regard, it will be appreciated that the variable domains of both the light (VL) and heavy (VH) chain portions determine antigen recognition and specificity. Conversely, the constant domains of the light chain (CL) and the heavy chain (CH1, CH2 or CH3) confer important biological properties such as secretion, transplacental mobility, Fc receptor binding, complement binding, and the like. By convention the numbering of the constant region domains increases as they become more distal from the antigen binding site or amino-terminus of the antibody. The N-terminus is a variable region and at the C-terminus is a constant region; the CH3 and CL domains actually comprise the carboxy-terminus of the heavy and light chain, respectively.
The phrase “antibody fragment”, as used herein, refers to one or more portions of an antibody that retain the ability to specifically interact with (e.g., by binding, steric hindrance, stabilizing/destabilizing, spatial distribution) a target epitope. Examples of binding fragments include, but are not limited to, a Fab fragment, a monovalent fragment consisting of the VL, VH, CL and CH1 domains; a F(ab)2 fragment, a bivalent fragment comprising two Fab fragments linked by a disulfide bridge at the hinge region; a Fd fragment consisting of the VH and CH1 domains; a Fv fragment consisting of the VL and VH domains of a single arm of an antibody; a dAb fragment (Ward et al., (1989) Nature 341:544-546), which consists of a VH domain; and an isolated complementarity determining region (CDR). Furthermore, although the two domains of the Fv fragment, VL and VH, are coded for by separate genes, they can be joined, using recombinant methods, by a synthetic linker that enables them to be made as a single protein chain in which the VL and VH regions pair to form monovalent molecules (known as single chain Fv (scFv); see e.g., Bird et al., (1988) Science 242:423-426; and Huston et al., (1988) Proc. Natl. Acad. Sci. 85:5879-5883). Such single chain antibodies are also intended to be encompassed within the term “antibody fragment”. These antibody fragments are obtained using conventional techniques known to those of skill in the art, and the fragments are screened for utility in the same manner as are intact antibodies.
As described herein, antibodies may include biologically active derivatives or variants or fragments. As used herein “biologically active derivative” or “biologically active variant” includes any derivative or variant of an antibody having substantially the same functional and/or biological properties of said antibody (e.g., a WT antibody), such as binding properties, and/or the same structural basis, such as a peptidic backbone or a basic polymeric unit, including framework regions.
An “analog,” such as a “variant” or a “derivative,” is an antibody substantially similar in structure and having the same biological activity, albeit in certain instances to a differing degree, to a naturally-occurring antibody or a WT antibody or another reference antibody as will be understood by those of skill in the art. For example, an antibody variant refers to an antibody sharing substantially similar structure and having the same biological activity as a reference antibody. Variants or analogs differ in the composition of their amino acid sequences compared to the reference antibody from which the analog is derived, based on one or more mutations involving (i) deletion of one or more amino acid residues at one or more termini of the antibody and/or one or more internal regions of the antibody sequence (e.g., fragments), (ii) insertion or addition of one or more amino acids at one or more termini (typically an “addition” or “fusion”) of the antibody and/or one or more internal regions (typically an “insertion”) of the antibody sequence or (iii) substitution of one or more amino acids for other amino acids in the antibody sequence. By way of example, a “derivative” is a type of analog and refers to an antibody sharing the same or substantially similar structure as a reference antibody that has been modified, e.g., chemically.
In some aspects, the variants or sequence variants are mutants wherein 1, 2, 3, 4, 5, 6 or more amino acids within one or more CDR are mutated relative to a reference antibody. In some aspects, CDRs on the light chain, heavy chain, or both heavy and light chain, are mutated. In some aspects, one or more framework amino acid residues are mutated relative to a reference antibody.
In substitution variants, one or more amino acid residues, e.g., in a CDR region, of an antibody are removed and replaced with alternative residues. In one aspect, the substitutions are conservative in nature and conservative substitutions of this type are well known in the art. Alternatively, the disclosure embraces substitutions that are also non-conservative. Exemplary conservative substitutions are described in Lehninger, [Biochemistry, 2nd Edition; Worth Publishers, Inc., New York (1975), pp. 71-77].
Antibodies contemplated herein include full-length antibodies, biologically active subunits or fragments of full length antibodies, as well as biologically active derivatives and variants of any of these forms of therapeutic proteins. Thus, antibodies include those that (1) have an amino acid sequence that has greater than about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98% or about 99% or greater amino acid sequence identity, over a region of at least about 25, about 50, about 100, about 200, about 300, about 400, or more amino acids, to a reference antibody (e.g., encoded by a referenced nucleic acid or an amino acid sequence described herein). According to the present disclosure, the term “recombinant protein” or “recombinant antibody” includes any protein obtained via recombinant DNA technology. In certain aspects, the term encompasses antibodies as described herein.
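By way of non-limiting illustration, a percent identity over a region of a given length may be computed as sketched below. The sketch assumes the two sequences have already been aligned to equal length with "-" marking gaps, and it assumes one particular convention for the denominator (all alignment columns in which at least one sequence has a residue); other conventions exist and the present disclosure does not mandate this one. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned, equal-length sequences.

    Columns where both sequences have a gap are ignored; a column with a gap
    in only one sequence counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if not (x == "-" and y == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def best_window_identity(aligned_a, aligned_b, window):
    """Highest percent identity over any contiguous region of `window` columns."""
    n = len(aligned_a)
    if len(aligned_b) != n:
        raise ValueError("aligned sequences must have equal length")
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and the alignment length")
    return max(
        percent_identity(aligned_a[i:i + window], aligned_b[i:i + window])
        for i in range(n - window + 1)
    )
```

For example, a candidate may be compared with a reference antibody by calling best_window_identity with a window of 25 columns and checking the result against a chosen threshold such as 90.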
In some embodiments, the antibodies or antibody variants described herein are expressed from one or more expression construct and/or in a cell or strains as described herein.
Exemplary wild-type or reference antibodies include commercially available or other known antibodies, including therapeutic monoclonal antibodies. Reference antibodies according to the present disclosure may include any antibodies now known or later developed, including those that are not clinically and/or commercially available.
Antibodies of the present disclosure, including wild-type (WT) antibodies and variant antibodies, are produced in some embodiments in cells. Cells comprising one or more of the expression constructs described herein are contemplated in various embodiments of the present disclosure.
Prokaryotic host cells. In some embodiments of the disclosure, expression constructs designed for expression of gene products, including fusion proteins as described herein, are provided in host cells, such as prokaryotic host cells. Prokaryotic host cells can include archaea (such as Haloferax volcanii, Sulfolobus solfataricus), Gram-positive bacteria (such as Bacillus subtilis, Bacillus licheniformis, Brevibacillus choshinensis, Lactobacillus brevis, Lactobacillus buchneri, Lactococcus lactis, and Streptomyces lividans), or Gram-negative bacteria, including Alphaproteobacteria (Agrobacterium tumefaciens, Caulobacter crescentus, Rhodobacter sphaeroides, and Sinorhizobium meliloti), Betaproteobacteria (Alcaligenes eutrophus), and Gammaproteobacteria (Acinetobacter calcoaceticus, Azotobacter vinelandii, Escherichia coli, Pseudomonas aeruginosa, and Pseudomonas putida). Preferred host cells include Gammaproteobacteria of the family Enterobacteriaceae, such as Enterobacter, Erwinia, Escherichia (including E. coli), Klebsiella, Proteus, Salmonella (including Salmonella typhimurium), Serratia (including Serratia marcescens), and Shigella.
Eukaryotic host cells. Many additional types of host cells can be used for the expression systems of the present disclosure, including eukaryotic cells such as yeast (Candida shehatae, Kluyveromyces lactis, Kluyveromyces fragilis, other Kluyveromyces species, Pichia pastoris, Saccharomyces cerevisiae, Saccharomyces pastorianus also known as Saccharomyces carlsbergensis, Schizosaccharomyces pombe, Dekkera/Brettanomyces species, and Yarrowia lipolytica); other fungi (Aspergillus nidulans, Aspergillus niger, Neurospora crassa, Penicillium, Tolypocladium, Trichoderma reesei); insect cell lines (Drosophila melanogaster Schneider 2 cells and Spodoptera frugiperda Sf9 cells); and mammalian cell lines including immortalized cell lines (Chinese hamster ovary (CHO) cells, HeLa cells, baby hamster kidney (BHK) cells, monkey kidney cells (COS), human embryonic kidney (HEK, 293, or HEK-293) cells, and human hepatocellular carcinoma cells (Hep G2)). The above host cells are available from the American Type Culture Collection.
As described in WO/2017/106583, incorporated by reference in its entirety herein, producing gene products such as therapeutic proteins at commercial scale and in soluble form is addressed by providing suitable host cells capable of growth at high cell density in fermentation culture, and which can produce soluble gene products in the oxidizing host cell cytoplasm through highly controlled inducible gene expression. Host cells of the present disclosure with these qualities are produced by combining some or all of the following characteristics. (1) The host cells are genetically modified to have an oxidizing cytoplasm, through increasing the expression or function of oxidizing polypeptides in the cytoplasm, and/or by decreasing the expression or function of reducing polypeptides in the cytoplasm. Specific examples of such genetic alterations are provided herein. Optionally, host cells can also be genetically modified to express chaperones and/or cofactors that assist in the production of the desired gene product(s), and/or to glycosylate polypeptide gene products. (2) The host cells comprise one or more expression constructs designed for the expression of one or more gene products of interest; in certain embodiments, at least one expression construct comprises an inducible promoter and a polynucleotide encoding a gene product to be expressed from the inducible promoter. (3) The host cells contain additional genetic modifications designed to improve certain aspects of gene product expression from the expression construct(s). 
In particular embodiments, the host cells (A) have an alteration of gene function of at least one gene encoding a transporter protein for an inducer of at least one inducible promoter, and as another example, wherein the gene encoding the transporter protein is selected from the group consisting of araE, araF, araG, araH, rhaT, xylF, xylG, and xylH, or particularly is araE, or wherein the alteration of gene function more particularly is expression of araE from a constitutive promoter; and/or (B) have a reduced level of gene function of at least one gene encoding a protein that metabolizes an inducer of at least one inducible promoter, and as further examples, wherein the gene encoding a protein that metabolizes an inducer of at least one said inducible promoter is selected from the group consisting of araA, araB, araD, prpB, prpD, rhaA, rhaB, rhaD, xylA, and xylB; and/or (C) have a reduced level of gene function of at least one gene encoding a protein involved in biosynthesis of an inducer of at least one inducible promoter, which gene in further embodiments is selected from the group consisting of scpA/sbm, argK/ygfD, scpB/ygfG, scpC/ygfH, rmlA, rmlB, rmlC, and rmlD.
Host Cells with Oxidizing Cytoplasm. The expression systems of the present disclosure are designed to express gene products; in certain embodiments of the disclosure, the gene products are expressed in a host cell. Examples of host cells are provided that allow for the efficient and cost-effective expression of gene products, including components of multimeric products. Host cells can include, in addition to isolated cells in culture, cells that are part of a multicellular organism, or cells grown within a different organism or system of organisms. In certain embodiments of the disclosure, the host cells are microbial cells such as yeasts (Saccharomyces, Schizosaccharomyces, etc.) or bacterial cells, or are gram-positive bacteria or gram-negative bacteria, or are E. coli, or are an E. coli B strain, or are E. coli (B strain) EB0001 cells (also called E. coli ASE(DGH) cells), or are E. coli (B strain) EB0002 cells. In growth experiments with E. coli host cells having oxidizing cytoplasm, specifically the E. coli B strains SHuffle® Express (NEB Catalog No. C3028H) and SHuffle® T7 Express (NEB Catalog No. C3029H) and the E. coli K strain SHuffle® T7 (NEB Catalog No. C3026H), these E. coli B strains with oxidizing cytoplasm are able to grow to much higher cell densities than the most closely corresponding E. coli K strain (WO/2017/106583).
Alterations to host cell gene functions. Certain alterations can be made to the gene functions of host cells comprising inducible expression constructs, to promote efficient and homogeneous induction of the host cell population by an inducer. Preferably, the combination of expression constructs, host cell genotype, and induction conditions results in at least 75% (more preferably at least 85%, and most preferably, at least 95%) of the cells in the culture expressing gene product from each induced promoter, as measured by the method of Khlebnikov et al. described in Example 9 of WO/2017/106583. For host cells other than E. coli, these alterations can involve the function of genes that are structurally similar to an E. coli gene, or genes that carry out a function within the host cell similar to that of the E. coli gene. Alterations to host cell gene functions include eliminating or reducing gene function by deleting the gene protein-coding sequence in its entirety, or deleting a large enough portion of the gene, inserting sequence into the gene, or otherwise altering the gene sequence so that a reduced level of functional gene product is made from that gene. Alterations to host cell gene functions also include increasing gene function by, for example, altering the native promoter to create a stronger promoter that directs a higher level of transcription of the gene, or introducing a missense mutation into the protein-coding sequence that results in a more highly active gene product. Alterations to host cell gene functions include altering gene function in any way, including for example, altering a native inducible promoter to create a promoter that is constitutively activated. In addition to alterations in gene functions for the transport and metabolism of inducers, as described herein with relation to inducible promoters, and/or an altered expression of chaperone proteins, it is also possible to alter the reduction-oxidation environment of the host cell.
Host cell reduction-oxidation environment. In bacterial cells such as E. coli, proteins that need disulfide bonds are typically exported into the periplasm where disulfide bond formation and isomerization is catalyzed by the Dsb system, comprising DsbABCD and DsbG. Increased expression of the cysteine oxidase DsbA, the disulfide isomerase DsbC, or combinations of the Dsb proteins, which are all normally transported into the periplasm, has been utilized in the expression of heterologous proteins that require disulfide bonds (Makino et al., Microb Cell Fact 2011 May 14; 10: 32). It is also possible to express cytoplasmic forms of these Dsb proteins, such as a cytoplasmic version of DsbA and/or of DsbC (‘cDsbA’ or ‘cDsbC’), that lacks a signal peptide and therefore is not transported into the periplasm. Cytoplasmic Dsb proteins such as cDsbA and/or cDsbC are useful for making the cytoplasm of the host cell more oxidizing and thus more conducive to the formation of disulfide bonds in heterologous proteins produced in the cytoplasm. The host cell cytoplasm can also be made less reducing and thus more oxidizing by altering the thioredoxin and the glutaredoxin/glutathione enzyme systems directly: mutant strains defective in glutathione reductase (gor) or glutathione synthetase (gshB), together with thioredoxin reductase (trxB), render the cytoplasm oxidizing. These strains are unable to reduce ribonucleotides and therefore cannot grow in the absence of exogenous reductant, such as dithiothreitol (DTT). Suppressor mutations (such as ahpC* and ahpCΔ, Lobstein et al., Microb Cell Fact 2012 May 8; 11: 56; doi: 10.1186/1475-2859-11-56) in the gene ahpC, which encodes the peroxiredoxin AhpC, convert it to a disulfide reductase that generates reduced glutathione, allowing the channeling of electrons onto the enzyme ribonucleotide reductase and enabling the cells defective in gor and trxB, or defective in gshB and trxB, to grow in the absence of DTT.
A different class of mutated forms of AhpC can allow strains, defective in the activity of gamma-glutamylcysteine synthetase (gshA) and defective in trxB, to grow in the absence of DTT; these include AhpC V164G, AhpC S71F, AhpC E173/S71F, AhpC E171Ter, and AhpC dup162-169 (Faulkner et al., Proc Natl Acad Sci USA 2008 May 6; 105(18): 6735-6740, Epub 2008 May 2). In such strains with oxidizing cytoplasm, exposed protein cysteines become readily oxidized in a process that is catalyzed by thioredoxins, in a reversal of their physiological function, resulting in the formation of disulfide bonds. Other proteins that may be helpful to reduce the oxidative stress effects in host cells of an oxidizing cytoplasm are HPI (hydroperoxidase I) catalase-peroxidase encoded by E. coli katG and HPII (hydroperoxidase II) catalase-peroxidase encoded by E. coli katE, which disproportionate peroxide into water and O2 (Farr and Kogoma, Microbiol Rev. 1991 December; 55(4): 561-585; Review). Increasing levels of KatG and/or KatE protein in host cells through induced coexpression or through elevated levels of constitutive expression is an aspect of some embodiments of the disclosure.
Another alteration that can be made to host cells is to express the sulfhydryl oxidase Erv1p from the inner membrane space of yeast mitochondria in the host cell cytoplasm, which has been shown to increase the production of a variety of complex, disulfide-bonded proteins of eukaryotic origin in the cytoplasm of E. coli, even in the absence of mutations in gor or trxB (Nguyen et al, Microb Cell Fact 2011 Jan. 7; 10: 1).
Host cells comprising expression constructs preferably also express cDsbA and/or cDsbC and/or Erv1p; are deficient in trxB gene function; are also deficient in the gene function of either gor, gshB, or gshA; optionally have increased levels of katG and/or katE gene function; and express an appropriate mutant form of AhpC so that the host cells can be grown in the absence of DTT.
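By way of non-limiting illustration, the preferred combination of host cell characteristics listed above may be expressed as a simple check over a genotype record. The record type, the field names and the label "AhpC_mutant" are hypothetical and serve only to make the logic explicit; the sketch does not decide which AhpC mutant is appropriate for a given background, and it treats increased katG and/or katE function as optional.

```python
from dataclasses import dataclass, field


@dataclass
class HostGenotype:
    """Hypothetical summary of a host strain for illustration only."""
    expressed: set = field(default_factory=set)   # gene products expressed in the cytoplasm
    deficient: set = field(default_factory=set)   # genes with eliminated or reduced function


OXIDIZING_FACTORS = {"cDsbA", "cDsbC", "Erv1p"}
GLUTATHIONE_PATHWAY_GENES = {"gor", "gshB", "gshA"}


def meets_preferred_profile(host):
    """True if the host matches the preferred oxidizing-cytoplasm profile."""
    return (
        bool(host.expressed & OXIDIZING_FACTORS)              # cDsbA and/or cDsbC and/or Erv1p
        and "trxB" in host.deficient                          # deficient in trxB
        and bool(host.deficient & GLUTATHIONE_PATHWAY_GENES)  # deficient in gor, gshB, or gshA
        and "AhpC_mutant" in host.expressed                   # permits growth without DTT
    )
```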
Chaperones. In some embodiments, desired gene products are coexpressed with other gene products, such as chaperones, that are beneficial to the production of the desired gene product. Chaperones are proteins that assist the non-covalent folding or unfolding, and/or the assembly or disassembly, of other gene products, but do not occur in the resulting monomeric or multimeric gene product structures when the structures are performing their normal biological functions (having completed the processes of folding and/or assembly). Chaperones can be expressed from an inducible promoter or a constitutive promoter within an expression construct, or can be expressed from the host cell chromosome; preferably, expression of chaperone protein(s) in the host cell is at a sufficiently high level to produce coexpressed gene products that are properly folded and/or assembled into the desired product. Examples of chaperones present in E. coli host cells are the folding factors DnaK/DnaJ/GrpE, DsbC/DsbG, GroEL/GroES, IbpA/IbpB, Skp, Tig (trigger factor), and FkpA, which have been used to prevent protein aggregation of cytoplasmic or periplasmic proteins. DnaK/DnaJ/GrpE, GroEL/GroES, and ClpB can function synergistically in assisting protein folding and therefore expression of these chaperones in combinations has been shown to be beneficial for protein expression (Makino et al., Microb Cell Fact 2011 May 14; 10: 32). When expressing eukaryotic proteins in prokaryotic host cells, a eukaryotic chaperone protein, such as protein disulfide isomerase (PDI) from the same or a related eukaryotic species, is in certain embodiments of the disclosure coexpressed or inducibly coexpressed with the desired gene product.
One chaperone that can be expressed in host cells is a protein disulfide isomerase from Humicola insolens, a soil hyphomycete (soft-rot fungus). An amino acid sequence of Humicola insolens PDI is shown as SEQ ID NO: 1 of WO/2017/106583; it lacks the signal peptide of the native protein so that it remains in the host cell cytoplasm. The nucleotide sequence encoding PDI was optimized for expression in E. coli; the expression construct for PDI is shown as SEQ ID NO: 2 of WO/2017/106583. SEQ ID NO: 2 contains a GCTAGC NheI restriction site at its 5′ end, an AGGAGG ribosome binding site at nucleotides 7 through 12, the PDI coding sequence at nucleotides 21 through 1478, and a GTCGAC SalI restriction site at its 3′ end. The nucleotide sequence of SEQ ID NO: 2 was designed to be inserted immediately downstream of a promoter, such as an inducible promoter. The NheI and SalI restriction sites in SEQ ID NO: 2 can be used to insert it into a vector multiple cloning site, such as that of the pSOL expression vector (SEQ ID NO: 3 of WO/2017/106583), described in published US patent application US2015353940A1, which is incorporated by reference in its entirety herein. Other PDI polypeptides can also be expressed in host cells, including PDI polypeptides from a variety of species (Saccharomyces cerevisiae (UniProtKB P17967), Homo sapiens (UniProtKB P07237), Mus musculus (UniProtKB P09103), Caenorhabditis elegans (UniProtKB Q17770 and Q17967), Arabidopsis thaliana (UniProtKB O48773, Q9XI01, Q9SRG3, Q9LJU2, Q9MAU6, Q94F09, and Q9T042), Aspergillus niger (UniProtKB Q12730)), and also modified forms of such PDI polypeptides.
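By way of non-limiting illustration, the layout described for the PDI expression construct (a GCTAGC site at the 5′ end, an AGGAGG ribosome binding site at nucleotides 7 through 12, a coding sequence at nucleotides 21 through 1478, and a GTCGAC site at the 3′ end) may be checked against a candidate nucleotide string as sketched below. The function names are hypothetical, positions are treated as 1-based and inclusive, and the sketch does not contain or reproduce SEQ ID NO: 2 itself.

```python
NHEI_SITE = "GCTAGC"
RIBOSOME_BINDING_SITE = "AGGAGG"
SALI_SITE = "GTCGAC"
CDS_START, CDS_END = 21, 1478  # 1-based, inclusive, as stated in the description


def slice_1based(seq, start, end):
    """Return seq[start..end] using 1-based inclusive coordinates."""
    return seq[start - 1:end]


def check_construct_layout(seq):
    """Report whether each described feature is found at its stated position."""
    seq = seq.upper()
    cds = slice_1based(seq, CDS_START, CDS_END)
    return {
        "nhei_at_5prime_end": seq.startswith(NHEI_SITE),
        "rbs_at_7_to_12": slice_1based(seq, 7, 12) == RIBOSOME_BINDING_SITE,
        "cds_has_expected_length": len(cds) == CDS_END - CDS_START + 1,
        "cds_is_whole_codons": len(cds) > 0 and len(cds) % 3 == 0,
        "sali_at_3prime_end": seq.endswith(SALI_SITE),
    }
```

The stated coding region spans 1478 − 21 + 1 = 1458 nucleotides, that is, 486 codons, which is why the sketch checks for a whole number of codons.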
In certain embodiments of the disclosure, a PDI polypeptide expressed in host cells of the disclosure shares at least 70%, or 80%, or 90%, or 95% amino acid sequence identity across at least 50% (or at least 60%, or at least 70%, or at least 80%, or at least 90%) of the length of SEQ ID NO: 1 of WO/2017/106583, where amino acid sequence identity is determined according to Example 10 of WO/2017/106583.
Cellular transport of cofactors. When using the expression systems of the disclosure to produce enzymes that require cofactors for function, it is helpful to use a host cell capable of synthesizing the cofactor from available precursors, or taking it up from the environment. Common cofactors include ATP, coenzyme A, flavin adenine dinucleotide (FAD), NAD+/NADH, and heme. Polynucleotides encoding cofactor transport polypeptides and/or cofactor synthesizing polypeptides can be introduced into host cells, and such polypeptides can be constitutively expressed, or inducibly coexpressed with the gene products to be produced by methods of the disclosure.
Glycosylation of polypeptide gene products. Host cells can have alterations in their ability to glycosylate polypeptides. For example, eukaryotic host cells can have eliminated or reduced gene function in glycosyltransferase and/or oligosaccharyltransferase genes, impairing the normal eukaryotic glycosylation of polypeptides to form glycoproteins. Prokaryotic host cells such as E. coli, which do not normally glycosylate polypeptides, can be altered to express a set of eukaryotic and prokaryotic genes that provide a glycosylation function (DeLisa et al., WO2009089154A2, 2009 Jul. 16).
Available host cell strains with altered gene functions. To create preferred strains of host cells to be used in the expression systems and methods of the disclosure, it is useful to start with a strain that already comprises desired genetic alterations (Table A; WO/2017/106583).
[Table A: E. coli host cell strains with altered gene functions]
In some embodiments of the present disclosure, inducible promoters are contemplated for use with the expression constructs. Exemplary promoters are described herein and are also described in WO/2016/205570, incorporated by reference in its entirety herein. As described herein, the cells comprising one or more expression constructs may optionally include one or more inducible promoters to express antibodies of the present disclosure, including wild-type antibodies and variant antibodies.
Expression Constructs. Expression constructs are polynucleotides designed for the expression of one or more antibodies, and thus are not naturally occurring molecules. Expression constructs can be integrated into a host cell chromosome, or maintained within the host cell as polynucleotide molecules replicating independently of the host cell chromosome, such as plasmids or artificial chromosomes. An example of an expression construct is a polynucleotide resulting from the insertion of one or more polynucleotide sequences into a host cell chromosome, where the inserted polynucleotide sequences alter the expression of chromosomal coding sequences. An expression vector is a plasmid expression construct specifically used for the expression of one or more gene products, such as one or more antibodies. One or more expression constructs can be integrated into a host cell chromosome or be maintained on an extrachromosomal polynucleotide such as a plasmid or artificial chromosome. The following are descriptions of particular types of polynucleotide sequences that can be used in expression constructs for the expression or coexpression of antibodies.
Origins of replication. Expression constructs must comprise an origin of replication, also called a replicon, in order to be maintained within the host cell as independently replicating polynucleotides. Different replicons that use the same mechanism for replication cannot be maintained together in a single host cell through repeated cell divisions. As a result, plasmids can be categorized into incompatibility groups depending on the origin of replication that they contain, as shown in Table 2 of WO/2016/205570. Origins of replication can be selected for use in expression constructs on the basis of incompatibility group, copy number, and/or host range, among other criteria. As described above, if two or more different expression constructs are to be used in the same host cell for the coexpression of multiple antibodies or components of antibodies (e.g., heavy and light chains, including fragments, as described herein), in one embodiment the different expression constructs contain origins of replication from different incompatibility groups: a pMB1 replicon in one expression construct and a p15A replicon in another, for example. The average number of copies of an expression construct in the cell, relative to the number of host chromosome molecules, is determined by the origin of replication contained in that expression construct. Copy number can range from a few copies per cell to several hundred (Table 2 of WO/2016/205570). In one embodiment of the disclosure, different expression constructs are used which comprise inducible promoters that are activated by the same inducer, but which have different origins of replication.
By selecting origins of replication that maintain each different expression construct at a certain approximate copy number in the cell, it is possible to adjust the levels of overall production of an antibody component or fragment (e.g., a heavy or light chain) expressed from one expression construct, relative to another antibody component or fragment (e.g., a heavy or light chain) expressed from a different expression construct. As an example, to coexpress subunits A and B of a multimeric protein (including, for example, a heavy chain and a light chain), an expression construct is created which comprises the colE1 replicon, the ara promoter, and a coding sequence for subunit A expressed from the ara promoter: ‘colE1-Para-A’.
Another expression construct is created comprising the p15A replicon, the ara promoter, and a coding sequence for subunit B: ‘p15A-Para-B’. These two expression constructs can be maintained together in the same host cells, and expression of both subunits A and B is induced by the addition of one inducer, arabinose, to the growth medium. If the expression level of subunit A needed to be significantly increased relative to the expression level of subunit B, in order to bring the stoichiometric ratio of the expressed amounts of the two subunits closer to a desired ratio, for example, a new expression construct for subunit A could be created, having a modified pMB1 replicon as is found in the origin of replication of the pUC9 plasmid (‘pUC9ori’): pUC9ori-Para-A. Expressing subunit A from a high-copy-number expression construct such as pUC9ori-Para-A should increase the amount of subunit A produced relative to expression of subunit B from p15A-Para-B. In a similar fashion, use of an origin of replication that maintains expression constructs at a lower copy number, such as pSC101 (WO/2016/205570), could reduce the overall level of a gene product expressed from that construct. Selection of an origin of replication can also determine which host cells can maintain an expression construct comprising that replicon. For example, expression constructs comprising the colE1 origin of replication have a relatively narrow range of available hosts, species within the Enterobacteriaceae family, while expression constructs comprising the RK2 replicon can be maintained in E. coli, Pseudomonas aeruginosa, Pseudomonas putida, Azotobacter vinelandii, and Alcaligenes eutrophus, and if an expression construct comprises the RK2 replicon and some regulator genes from the RK2 plasmid, it can be maintained in host cells as diverse as Sinorhizobium meliloti, Agrobacterium tumefaciens, Caulobacter crescentus, Acinetobacter calcoaceticus, and Rhodobacter sphaeroides (Kües and Stahl, Microbiol Rev 1989 December; 53(4): 491-516).
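The replicon selection logic described above can be sketched in a few lines of Python. This is only an illustration: the incompatibility-group labels and copy numbers in the REPLICONS table are placeholder assumptions chosen for the example, not values taken from Table 2 of WO/2016/205570, and the helper names compatible and copy_ratio are hypothetical.

```python
# Illustrative placeholder data (assumptions, not values from WO/2016/205570):
# replicon name -> (incompatibility-group label, approximate copies per cell).
REPLICONS = {
    "colE1":   ("groupA", 20),
    "pMB1":    ("groupA", 20),
    "pUC9ori": ("groupA", 500),
    "p15A":    ("groupB", 10),
}


def compatible(replicon_names):
    """Return True if no two of the named replicons share an incompatibility group."""
    groups = [REPLICONS[name][0] for name in replicon_names]
    return len(groups) == len(set(groups))


def copy_ratio(replicon_a, replicon_b):
    """Approximate ratio of construct copies per cell (A relative to B)."""
    return REPLICONS[replicon_a][1] / REPLICONS[replicon_b][1]
```

With these placeholder numbers, moving subunit A from a colE1 construct to a pUC9ori construct while subunit B stays on p15A would raise the copy ratio from about 2 to about 50. Actual expression ratios would also depend on promoter strength, translation efficiency, and protein stability, so such a calculation is at most a starting point.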
Similar considerations can be employed to create expression constructs for inducible expression or coexpression in eukaryotic cells. For example, the 2-micron circle plasmid of Saccharomyces cerevisiae is compatible with plasmids from other yeast strains, such as pSR1 (ATCC Deposit Nos. 48233 and 66069; Araki et al., J Mol Biol 1985 Mar. 20; 182(2): 191-203) and pKD1 (ATCC Deposit No. 37519; Chen et al., Nucleic Acids Res 1986 Jun. 11; 14(11): 4471-4481).
Selectable markers. Expression constructs usually comprise a selection gene, also termed a selectable marker, which encodes a protein necessary for the survival or growth of host cells in a selective culture medium. Host cells not containing the expression construct comprising the selection gene will not survive in the culture medium. Typical selection genes encode proteins that confer resistance to antibiotics or other toxins, or that complement auxotrophic deficiencies of the host cell. One example of a selection scheme utilizes a drug such as an antibiotic to arrest growth of a host cell. Those cells that contain an expression construct comprising the selectable marker produce a protein conferring drug resistance and survive the selection regimen. Some examples of antibiotics that are commonly used for the selection of selectable markers (and abbreviations indicating genes that provide antibiotic resistance phenotypes) are: ampicillin (AmpR), chloramphenicol (CmlR or CmR), kanamycin (KanR), spectinomycin (SpcR), streptomycin (StrR), and tetracycline (TetR). Many of the plasmids in Table 2 of WO/2016/205570 comprise selectable markers, such as pBR322 (AmpR, TetR); pMOB45 (CmR, TetR); pACYC177 (AmpR, KanR); and pGBM1 (SpcR, StrR). The native promoter region for a selection gene is usually included, along with the coding sequence for its gene product, as part of a selectable marker portion of an expression construct. Alternatively, the coding sequence for the selection gene can be expressed from a constitutive promoter.
In various aspects, suitable selectable markers include, but are not limited to, neomycin phosphotransferase (npt II), hygromycin phosphotransferase (hpt), dihydrofolate reductase (dhfr), zeocin, phleomycin, bleomycin resistance gene (ble), gentamycin acetyltransferase, streptomycin phosphotransferase, mutant form of acetolactate synthase (als), bromoxynil nitrilase, phosphinothricin acetyl transferase (bar), enolpyruvylshikimate-3-phosphate (EPSP) synthase (aro A), muscle specific tyrosine kinase receptor molecule (MuSK-R), copper-zinc superoxide dismutase (sod1), metallothioneins (cup1, MT1), beta-lactamase (BLA), puromycin N-acetyl-transferase (pac), blasticidin acetyl transferase (bls), blasticidin deaminase (bsr), histidinol dehydrogenase (HDH), N-succinyl-5-aminoimidazole-4-carboxamide ribotide (SAICAR) synthetase (ade1), argininosuccinate lyase (arg4), beta-isopropylmalate dehydrogenase (leu2), invertase (suc2), orotidine-5′-phosphate (OMP) decarboxylase (ura3), and orthologs of any of the foregoing.
Inducible promoter. As described herein, there are several different inducible promoters that can be included in expression constructs as part of the inducible coexpression systems of the disclosure. Preferred inducible promoters share at least 80% polynucleotide sequence identity (more preferably, at least 90% identity, and most preferably, at least 95% identity) to at least 30 (more preferably, at least 40, and most preferably, at least 50) contiguous bases of a promoter polynucleotide sequence as defined in Table 1 of WO/2016/205570 by reference to the E. coli K-12 substrain MG1655 genomic sequence, where percent polynucleotide sequence identity is determined using the methods of Example 11 of WO/2016/205570. Under ‘standard’ inducing conditions (see Example 5 of WO/2016/205570), preferred inducible promoters have at least 75% (more preferably, at least 100%, and most preferably, at least 110%) of the strength of the corresponding ‘wild-type’ inducible promoter of E. coli K-12 substrain MG1655, as determined using the quantitative PCR method of De Mey et al. (Example 6 of WO/2016/205570). Within the expression construct, an inducible promoter is placed 5′ to (or ‘upstream of’) the coding sequence for the gene product (e.g., antibody or antibody fragment) that is to be inducibly expressed, so that the presence of the inducible promoter will direct transcription of the gene product coding sequence in a 5′ to 3′ direction relative to the coding strand of the polynucleotide encoding the gene product.
Ribosome binding site. For some antibodies or antibody fragments, the nucleotide sequence of the region between the transcription initiation site and the initiation codon of the coding sequence of the gene product that is to be inducibly expressed corresponds to the 5′ untranslated region (‘UTR’) of the mRNA for the polypeptide gene product. Preferably, the region of the expression construct that corresponds to the 5′ UTR comprises a polynucleotide sequence similar to the consensus ribosome binding site (RBS, also called the Shine-Dalgarno sequence) that is found in the species of the host cell. In prokaryotes (archaea and bacteria), the RBS consensus sequence is GGAGG or GGAGGU, and in bacteria such as E. coli, the RBS consensus sequence is AGGAGG or AGGAGGU. The RBS is typically separated from the initiation codon by 5 to 10 intervening nucleotides. In expression constructs, the RBS sequence is preferably at least 55% identical to the AGGAGGU consensus sequence, more preferably at least 70% identical, and most preferably at least 85% identical, and is separated from the initiation codon by 5 to 10 intervening nucleotides, more preferably by 6 to 9 intervening nucleotides, and most preferably by 6 or 7 intervening nucleotides. The ability of a given RBS to produce a desirable translation initiation rate can be calculated at the website salis.psu.edu/software/RBSLibraryCalculatorSearchMode, using the RBS Calculator; the same tool can be used to optimize a synthetic RBS for a translation rate across a 100,000+ fold range (Salis, Methods Enzymol 2011; 498: 19-42).
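As a minimal sketch of how the identity and spacing preferences above might be checked for a candidate 5′ UTR, the following Python assumes that percent identity is an ungapped, position-by-position comparison against the seven-nucleotide AGGAGGU consensus; that definition, and the function names percent_identity and best_rbs, are assumptions made for illustration. The sketch does not estimate translation initiation rate and is not a substitute for the RBS Calculator.

```python
CONSENSUS = "AGGAGGU"  # E. coli RBS consensus given in the text


def percent_identity(candidate, consensus=CONSENSUS):
    """Ungapped, position-by-position identity (an assumed definition)."""
    if len(candidate) != len(consensus):
        raise ValueError("candidate and consensus must have equal length")
    matches = sum(a == b for a, b in zip(candidate, consensus))
    return 100.0 * matches / len(consensus)


def best_rbs(utr, min_spacer=5, max_spacer=10):
    """Scan a 5' UTR that ends immediately before the initiation codon.

    Returns (identity, spacer, site) for the best-scoring 7-nt window whose
    distance to the initiation codon lies within the allowed spacer range,
    or None if the UTR is too short to contain such a window.
    """
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA input
    best = None
    for spacer in range(min_spacer, max_spacer + 1):
        end = len(utr) - spacer  # window ends `spacer` nt before the codon
        start = end - len(CONSENSUS)
        if start < 0:
            continue
        site = utr[start:end]
        score = percent_identity(site)
        if best is None or score > best[0]:
            best = (score, spacer, site)
    return best
```

Because the consensus is seven nucleotides long, the 55%, 70%, and 85% thresholds correspond to at least 4, 5, and 6 matching positions, respectively, under this assumed definition.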
Multiple cloning site. A multiple cloning site (MCS), also called a polylinker, is a polynucleotide that contains multiple restriction sites in close proximity to or overlapping each other. The restriction sites in the MCS typically occur once within the MCS sequence, and preferably do not occur within the rest of the plasmid or other polynucleotide construct, allowing restriction enzymes to cut the plasmid or other polynucleotide construct only within the MCS. Examples of MCS sequences are those in the pBAD series of expression vectors, including pBAD18, pBAD18-Cm, pBAD18-Kan, pBAD24, pBAD28, pBAD30, and pBAD33 (Guzman et al., J Bacteriol 1995 July; 177(14): 4121-4130); or those in the pPRO series of expression vectors derived from the pBAD vectors, such as pPRO18, pPRO18-Cm, pPRO18-Kan, pPRO24, pPRO30, and pPRO33 (U.S. Pat. No. 8,178,338 B2; May 15, 2012; Keasling, Jay). A multiple cloning site can be used in the creation of an expression construct: by placing a multiple cloning site 3′ to (or downstream of) a promoter sequence, the MCS can be used to insert the coding sequence for a gene product to be expressed or coexpressed into the construct, in the proper location relative to the promoter so that transcription of the coding sequence will occur. Depending on which restriction enzymes are used to cut within the MCS, there may be some part of the MCS sequence remaining within the expression construct after the coding sequence or other polynucleotide sequence is inserted into the expression construct. Any remaining MCS sequence can be upstream of, or downstream of, or on both sides of the inserted sequence. A ribosome binding site can be placed upstream of the MCS, preferably immediately adjacent to or separated from the MCS by only a few nucleotides, in which case the RBS would be upstream of any coding sequence inserted into the MCS.
Another alternative is to include a ribosome binding site within the MCS, in which case the choice of restriction enzymes used to cut within the MCS will determine whether the RBS is retained, and in what relation to the inserted sequences. A further alternative is to include an RBS within the polynucleotide sequence that is to be inserted into the expression construct at the MCS, preferably in the proper relation to any coding sequences to stimulate initiation of translation from the transcribed messenger RNA.
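The preference that each MCS restriction site occur only once in the whole construct can be checked computationally. The sketch below is illustrative only: it uses the well-known recognition sequences of EcoRI, BamHI, and HindIII as an example enzyme set (not a set named in this disclosure), treats the plasmid as circular, and uses hypothetical function names.

```python
# Recognition sequences of three common restriction enzymes (example set).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def count_sites(plasmid, site):
    """Count occurrences of `site` in a circular plasmid (top strand).

    The example sites are palindromic, so scanning one strand suffices.
    """
    seq = plasmid.upper()
    # Append the first len(site)-1 bases so sites spanning the origin are seen.
    wrapped = seq + seq[: len(site) - 1]
    return sum(wrapped.startswith(site, i) for i in range(len(seq)))


def single_cutters(plasmid, sites=SITES):
    """Names of enzymes whose site occurs exactly once in the plasmid."""
    return sorted(
        name for name, site in sites.items() if count_sites(plasmid, site) == 1
    )
```

An enzyme returned by single_cutters cuts the construct at one position only, which is the property that makes a site in the MCS useful for inserting a coding sequence. Non-palindromic recognition sequences would additionally require scanning the reverse complement.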
Expression from constitutive promoters. Expression constructs of the disclosure can also comprise coding sequences that are expressed from constitutive promoters. Unlike inducible promoters, constitutive promoters initiate continual gene product production under most growth conditions. One example of a constitutive promoter is that of the Tn3 bla gene, which encodes beta-lactamase and is responsible for the ampicillin-resistance (AmpR) phenotype conferred on the host cell by many plasmids, including pBR322 (ATCC 31344), pACYC177 (ATCC 37031), and pBAD24 (ATCC 87399). Another constitutive promoter that can be used in expression constructs is the promoter for the E. coli lipoprotein gene, lpp, which is located at positions 1755731-1755406 (plus strand) in E. coli K-12 substrain MG1655 (Inouye and Inouye, Nucleic Acids Res 1985 May 10; 13(9): 3101-3110). A further example of a constitutive promoter that has been used for heterologous gene expression in E. coli is the trpLEDCBA promoter, located at positions 1321169-1321133 (minus strand) in E. coli K-12 substrain MG1655 (Windass et al., Nucleic Acids Res 1982 Nov. 11; 10(21): 6639-6657). Constitutive promoters can be used in expression constructs for the expression of selectable markers, as described herein, and also for the constitutive expression of other gene products useful for the coexpression of the desired product. For example, transcriptional regulators of the inducible promoters, such as AraC, PrpR, RhaR, and XylR, if not expressed from a bidirectional inducible promoter, can alternatively be expressed from a constitutive promoter, on either the same expression construct as the inducible promoter they regulate, or a different expression construct.
Similarly, gene products useful for the production or transport of the inducer, such as PrpEC, AraE, or Rha, or proteins that modify the reduction-oxidation environment of the cell, as a few examples, can be expressed from a constitutive promoter within an expression construct. Gene products useful for the production of coexpressed gene products, and the resulting desired product, also include chaperone proteins, cofactor transporters, etc.
Signal Peptides. Antibodies or antibody fragments expressed or coexpressed by the methods of the disclosure can contain signal peptides or lack them, depending on whether it is desirable for such gene products to be exported from the host cell cytoplasm into the periplasm, or to be retained in the cytoplasm, respectively. Signal peptides (also termed signal sequences, leader sequences, or leader peptides) are characterized structurally by a stretch of hydrophobic amino acids, approximately five to twenty amino acids long and often around ten to fifteen amino acids in length, that has a tendency to form a single alpha-helix. This hydrophobic stretch is often immediately preceded by a shorter stretch enriched in positively charged amino acids (particularly lysine). Signal peptides that are to be cleaved from the mature polypeptide typically end in a stretch of amino acids that is recognized and cleaved by signal peptidase. Signal peptides can be characterized functionally by the ability to direct transport of a polypeptide, either co-translationally or post-translationally, through the plasma membrane of prokaryotes (or the inner membrane of gram negative bacteria like E. coli), or into the endoplasmic reticulum of eukaryotic cells. The degree to which a signal peptide enables a polypeptide to be transported into the periplasmic space of a host cell like E. coli, for example, can be determined by separating periplasmic proteins from proteins retained in the cytoplasm, using a method such as described in Example 12 of WO/2016/205570.
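The structural features described above (a hydrophobic stretch of roughly five to twenty residues preceded by positively charged residues) can be turned into a crude screening heuristic. The Python below is a sketch under stated assumptions: the hydrophobic residue set, the 30-residue N-terminal window, and the function names are illustrative choices, and the heuristic is not a validated signal peptide predictor and does not model the signal peptidase cleavage site.

```python
HYDROPHOBIC = set("AVLIMFWCG")  # assumed residue set for this sketch
POSITIVE = set("KR")            # lysine and arginine


def hydrophobic_core(peptide, window=30):
    """Return (start, length) of the longest hydrophobic run within the
    first `window` residues, or None if no hydrophobic residue is found."""
    best = None
    run_start = None
    region = peptide.upper()[:window]
    for i, aa in enumerate(region + "-"):  # sentinel closes a final run
        if aa in HYDROPHOBIC:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = i - run_start
            if best is None or length > best[1]:
                best = (run_start, length)
            run_start = None
    return best


def looks_like_signal_peptide(peptide, min_core=5, max_core=20):
    """Heuristic: hydrophobic core of allowed length, preceded by K or R."""
    core = hydrophobic_core(peptide)
    if core is None:
        return False
    start, length = core
    has_positive = any(aa in POSITIVE for aa in peptide.upper()[:start])
    return min_core <= length <= max_core and has_positive
```

For a sequence beginning MKKTAIAIAVALAGFATVAQA, the longest hydrophobic run under this residue set is twelve residues starting at the fifth position, preceded by two lysines, so the heuristic returns True. Experimental confirmation of periplasmic localization remains necessary.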
The following is a description of inducible promoters that can be used in expression constructs for expression or coexpression of gene products, along with some of the genetic modifications that can be made to host cells that contain such expression constructs. Examples of these inducible promoters and related genes are, unless otherwise specified, from Escherichia coli (E. coli) strain MG1655 (American Type Culture Collection deposit ATCC 700926), which is a substrain of E. coli K-12 (American Type Culture Collection deposit ATCC 10798). Table 1 of WO/2016/205570 lists the genomic locations, in E. coli MG1655, of the nucleotide sequences for these examples of inducible promoters and related genes. Nucleotide and other genetic sequences, referenced by genomic location as in Table 1 of WO/2016/205570, are expressly incorporated by reference herein. Additional information about E. coli promoters, genes, and strains described herein can be found in many public sources, including the online EcoliWiki resource, located at ecoliwiki.net.
Arabinose promoter. (As used herein, ‘arabinose’ means L-arabinose.) Several E. coli operons involved in arabinose utilization are inducible by arabinose (araBAD, araC, araE, and araFGH), but the terms ‘arabinose promoter’ and ‘ara promoter’ are typically used to designate the araBAD promoter. Several additional terms have been used to indicate the E. coli araBAD promoter, such as Para, ParaB, ParaBAD, and PBAD. The use herein of ‘ara promoter’ or any of the alternative terms given above, means the E. coli araBAD promoter. As can be seen from the use of another term, ‘araC-araBAD promoter’, the araBAD promoter is considered to be part of a bidirectional promoter, with the araBAD promoter controlling expression of the araBAD operon in one direction, and the araC promoter, in close proximity to and on the opposite strand from the araBAD promoter, controlling expression of the araC coding sequence in the other direction. The AraC protein is both a positive and a negative transcriptional regulator of the araBAD promoter. In the absence of arabinose, the AraC protein represses transcription from PBAD, but in the presence of arabinose, the AraC protein, which alters its conformation upon binding arabinose, becomes a positive regulatory element that allows transcription from PBAD. The araBAD operon encodes proteins that metabolize L-arabinose by converting it, through the intermediates L-ribulose and L-ribulose-phosphate, to D-xylulose-5-phosphate. For the purpose of maximizing induction of expression from an arabinose-inducible promoter, it is useful to eliminate or reduce the function of AraA, which catalyzes the conversion of L-arabinose to L-ribulose, and optionally to eliminate or reduce the function of at least one of AraB and AraD, as well.
Eliminating or reducing the ability of host cells to decrease the effective concentration of arabinose in the cell, by eliminating or reducing the cell's ability to convert arabinose to other sugars, allows more arabinose to be available for induction of the arabinose-inducible promoter. The genes encoding the transporters which move arabinose into the host cell are araE, which encodes the low-affinity L-arabinose proton symporter, and the araFGH operon, which encodes the subunits of an ABC superfamily high-affinity L-arabinose transporter. Other proteins which can transport L-arabinose into the cell are certain mutants of the LacY lactose permease: the LacY(A177C) and the LacY(A177V) proteins, having a cysteine or a valine amino acid instead of alanine at position 177, respectively (Morgan-Kiss et al., Proc Natl Acad Sci USA 2002 May 28; 99(11): 7373-7377). In order to achieve homogenous induction of an arabinose-inducible promoter, it is useful to make transport of arabinose into the cell independent of regulation by arabinose. This can be accomplished by eliminating or reducing the activity of the AraFGH transporter proteins and altering the expression of araE so that it is only transcribed from a constitutive promoter. Constitutive expression of araE can be accomplished by eliminating or reducing the function of the native araE gene, and introducing into the cell an expression construct which includes a coding sequence for the AraE protein expressed from a constitutive promoter. Alternatively, in a cell lacking AraFGH function, the promoter controlling expression of the host cell's chromosomal araE gene can be changed from an arabinose-inducible promoter to a constitutive promoter. In similar manner, as additional alternatives for homogenous induction of an arabinose-inducible promoter, a host cell that lacks AraE function can have any functional AraFGH coding sequence present in the cell expressed from a constitutive promoter.
As another alternative, it is possible to express both the araE gene and the araFGH operon from constitutive promoters, by replacing the native araE and araFGH promoters with constitutive promoters in the host chromosome. It is also possible to eliminate or reduce the activity of both the AraE and the AraFGH arabinose transporters, and in that situation to use a mutation in the LacY lactose permease that allows this protein to transport arabinose. Since expression of the lacY gene is not normally regulated by arabinose, use of a LacY mutant such as LacY(A177C) or LacY(A177V), will not lead to the ‘all or none’ induction phenomenon when the arabinose-inducible promoter is induced by the presence of arabinose. Because the LacY(A177C) protein appears to be more effective in transporting arabinose into the cell, use of polynucleotides encoding the LacY(A177C) protein is preferred to the use of polynucleotides encoding the LacY(A177V) protein.
Propionate promoter. The ‘propionate promoter’ or ‘prp promoter’ is the promoter for the E. coli prpBCDE operon, and is also called PprpB. Like the ara promoter, the prp promoter is part of a bidirectional promoter, controlling expression of the prpBCDE operon in one direction, and with the prpR promoter controlling expression of the prpR coding sequence in the other direction. The PrpR protein is the transcriptional regulator of the prp promoter, and activates transcription from the prp promoter when the PrpR protein binds 2-methylcitrate (‘2-MC’). Propionate (also called propanoate) is the ion, CH3CH2COO−, of propionic acid (or ‘propanoic acid’), and is the smallest of the ‘fatty’ acids having the general formula H(CH2)nCOOH that shares certain properties of this class of molecules: producing an oily layer when salted out of water and having a soapy potassium salt. Commercially available propionate is generally sold as a monovalent cation salt of propionic acid, such as sodium propionate (CH3CH2COONa), or as a divalent cation salt, such as calcium propionate (Ca(CH3CH2COO)2). Propionate is membrane-permeable and is metabolized to 2-MC by conversion of propionate to propionyl-CoA by PrpE (propionyl-CoA synthetase), and then conversion of propionyl-CoA to 2-MC by PrpC (2-methylcitrate synthase). The other proteins encoded by the prpBCDE operon, PrpD (2-methylcitrate dehydratase) and PrpB (2-methylisocitrate lyase), are involved in further catabolism of 2-MC into smaller products such as pyruvate and succinate. In order to maximize induction of a propionate-inducible promoter by propionate added to the cell growth medium, it is therefore desirable to have a host cell with PrpC and PrpE activity, to convert propionate into 2-MC, but also having eliminated or reduced PrpD activity, and optionally eliminated or reduced PrpB activity as well, to prevent 2-MC from being metabolized.
Another operon encoding proteins involved in 2-MC biosynthesis is the scpA-argK-scpBC operon, also called the sbm-ygfDGH operon. These genes encode proteins required for the conversion of succinate to propionyl-CoA, which can then be converted to 2-MC by PrpC. Elimination or reduction of the function of these proteins would remove a parallel pathway for the production of the 2-MC inducer, and thus might reduce background levels of expression of a propionate-inducible promoter, and increase sensitivity of the propionate-inducible promoter to exogenously supplied propionate. It has been found that a deletion of sbm-ygfD-ygfG-ygfH-ygfI, introduced into E. coli BL21(DE3) to create strain JSB (Lee and Keasling, Appl Environ Microbiol 2005 November; 71(11): 6856-6862), was helpful in reducing background expression in the absence of exogenously supplied inducer, but this deletion also reduced overall expression from the prp promoter in strain JSB. It should be noted, however, that the deletion sbm-ygfD-ygfG-ygfH-ygfI also apparently affects ygfI, which encodes a putative LysR-family transcriptional regulator of unknown function. The genes sbm-ygfDGH are transcribed as one operon, and ygfI is transcribed from the opposite strand. The 3′ ends of the ygfH and ygfI coding sequences overlap by a few base pairs, so a deletion that takes out all of the sbm-ygfDGH operon apparently takes out ygfI coding function as well. Eliminating or reducing the function of a subset of the sbm-ygfDGH gene products, such as YgfG (also called ScpB, methylmalonyl-CoA decarboxylase), or deleting the majority of the sbm-ygfDGH (or scpA-argK-scpBC) operon while leaving enough of the 3′ end of the ygfH (or scpC) gene so that the expression of ygfI is not affected, could be sufficient to reduce background expression from a propionate-inducible promoter without reducing the maximal level of induced expression.
Rhamnose promoter. (As used herein, ‘rhamnose’ means L-rhamnose.) The ‘rhamnose promoter’ or ‘rha promoter’, or PrhaSR, is the promoter for the E. coli rhaSR operon. Like the ara and prp promoters, the rha promoter is part of a bidirectional promoter, controlling expression of the rhaSR operon in one direction, and with the rhaBAD promoter controlling expression of the rhaBAD operon in the other direction. The rha promoter, however, has two transcriptional regulators involved in modulating expression: RhaR and RhaS. The RhaR protein activates expression of the rhaSR operon in the presence of rhamnose, while RhaS protein activates expression of the L-rhamnose catabolic and transport operons, rhaBAD and rhaT, respectively (Wickstrum et al., J Bacteriol 2010 January; 192(1): 225-232). Although the RhaS protein can also activate expression of the rhaSR operon, in effect RhaS negatively autoregulates this expression by interfering with the ability of the cyclic AMP receptor protein (CRP) to coactivate expression with RhaR to a much greater level. The rhaBAD operon encodes the rhamnose catabolic proteins RhaA (L-rhamnose isomerase), which converts L-rhamnose to L-rhamnulose; RhaB (rhamnulokinase), which phosphorylates L-rhamnulose to form L-rhamnulose-1-P; and RhaD (rhamnulose-1-phosphate aldolase), which converts L-rhamnulose-1-P to L-lactaldehyde and DHAP (dihydroxyacetone phosphate). To maximize the amount of rhamnose in the cell available for induction of expression from a rhamnose-inducible promoter, it is desirable to reduce the amount of rhamnose that is broken down by catalysis, by eliminating or reducing the function of RhaA, or optionally of RhaA and at least one of RhaB and RhaD. E. coli cells can also synthesize L-rhamnose from alpha-D-glucose-1-P through the activities of the proteins RmlA, RmlB, RmlC, and RmlD (also called RfbA, RfbB, RfbC, and RfbD, respectively) encoded by the rmlBDACX (or rfbBDACX) operon.
To reduce background expression from a rhamnose-inducible promoter, and to enhance the sensitivity of induction of the rhamnose-inducible promoter by exogenously supplied rhamnose, it could be useful to eliminate or reduce the function of one or more of the RmlA, RmlB, RmlC, and RmlD proteins. L-rhamnose is transported into the cell by RhaT, the rhamnose permease or L-rhamnose:proton symporter. As noted above, the expression of RhaT is activated by the transcriptional regulator RhaS. To make expression of RhaT independent of induction by rhamnose (which induces expression of RhaS), the host cell can be altered so that all functional RhaT coding sequences in the cell are expressed from constitutive promoters. Additionally, the coding sequences for RhaS can be deleted or inactivated, so that no functional RhaS is produced. By eliminating or reducing the function of RhaS in the cell, the level of expression from the rhaSR promoter is increased due to the absence of negative autoregulation by RhaS, and the level of expression of the rhamnose catalytic operon rhaBAD is decreased, further increasing the ability of rhamnose to induce expression from the rha promoter.
Xylose promoter. (As used herein, ‘xylose’ means D-xylose.) The xylose promoter, or ‘xyl promoter’, or PxylA, means the promoter for the E. coli xylAB operon. The xylose promoter region is similar in organization to other inducible promoters in that the xylAB operon and the xylFGHR operon are both expressed from adjacent xylose-inducible promoters in opposite directions on the E. coli chromosome (Song and Park, J Bacteriol. 1997 November; 179(22): 7025-7032). The transcriptional regulator of both the PxylA and PxylF promoters is XylR, which activates expression of these promoters in the presence of xylose. The xylR gene is expressed either as part of the xylFGHR operon or from its own weak promoter, which is not inducible by xylose, located between the xylH and xylR protein-coding sequences. D-xylose is catabolized by XylA (D-xylose isomerase), which converts D-xylose to D-xylulose, which is then phosphorylated by XylB (xylulokinase) to form D-xylulose-5-P. To maximize the amount of xylose in the cell available for induction of expression from a xylose-inducible promoter, it is desirable to reduce the amount of xylose that is broken down by catalysis, by eliminating or reducing the function of at least XylA, or optionally of both XylA and XylB. The xylFGHR operon encodes XylF, XylG, and XylH, the subunits of an ABC super-family high-affinity D-xylose transporter. The xylE gene, which encodes the E. coli low-affinity xylose-proton symporter, represents a separate operon, the expression of which is also inducible by xylose. To make expression of a xylose transporter independent of induction by xylose, the host cell can be altered so that all functional xylose transporters are expressed from constitutive promoters.
For example, the xylFGHR operon could be altered so that the xylFGH coding sequences are deleted, leaving XylR as the only active protein expressed from the xylose-inducible PxylF promoter, and with the xylE coding sequence expressed from a constitutive promoter rather than its native promoter. As another example, the xylR coding sequence is expressed from the PxylA or the PxylF promoter in an expression construct, while either the xylFGHR operon is deleted and xylE is constitutively expressed, or alternatively an xylFGH operon (lacking the xylR coding sequence since that is present in an expression construct) is expressed from a constitutive promoter and the xylE coding sequence is deleted or altered so that it does not produce an active protein.
Lactose promoter. The term ‘lactose promoter’ refers to the lactose-inducible promoter for the lacZYA operon, a promoter which is also called lacZp1; this lactose promoter is located at ca. 365603-365568 (minus strand, with the RNA polymerase binding (‘−35’) site at ca. 365603-365598, the Pribnow box (‘−10’) at 365579-365573, and a transcription initiation site at 365567) in the genomic sequence of the E. coli K-12 substrain MG1655 (NCBI Reference Sequence NC_000913.2, 11 Jan. 2012). In some embodiments, inducible coexpression systems of the disclosure can comprise a lactose-inducible promoter such as the lacZYA promoter. In other embodiments, the inducible coexpression systems of the disclosure comprise one or more inducible promoters that are not lactose-inducible promoters.
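Promoter locations in this section are given as genomic coordinates together with a strand. As a minimal sketch of how such a region could be retrieved from a reference sequence, the following Python assumes that the coordinates are 1-based and inclusive, that the genome is supplied as a plus-strand string, and that a minus-strand feature is reported as the reverse complement; the function name extract_region is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(genome, pos_a, pos_b, strand):
    """Return the 5'->3' sequence between two 1-based inclusive positions.

    `genome` is the plus-strand sequence; `strand` is "+" or "-".
    Coordinates may be given in either order (e.g. 365603-365568).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    lo, hi = sorted((pos_a, pos_b))
    if lo < 1 or hi > len(genome):
        raise ValueError("coordinates outside the sequence")
    region = genome[lo - 1:hi]
    if strand == "-":
        region = region.translate(COMPLEMENT)[::-1]  # reverse complement
    return region
```

Under these assumptions, a call such as extract_region(genome, 365603, 365568, "-") would return a 36-nucleotide sequence read 5′ to 3′ on the minus strand. Coordinates are specific to a reference sequence version, so the same version cited in the text should be used.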
Alkaline phosphatase promoter. The terms ‘alkaline phosphatase promoter’ and ‘phoA promoter’ refer to the promoter for the phoApsiF operon, a promoter which is induced under conditions of phosphate starvation. The phoA promoter region is located at ca. 401647-401746 (plus strand, with the Pribnow box (‘-10’) at 401695-401701 (Kikuchi et al., Nucleic Acids Res 1981 Nov. 11; 9(21): 5671-5678)) in the genomic sequence of the E. coli K-12 substrain MG1655 (NCBI Reference Sequence NC_000913.3, 16 Dec. 2014). The transcriptional activator for the phoA promoter is PhoB, a transcriptional regulator that, along with the sensor protein PhoR, forms a two-component signal transduction system in E. coli. PhoB and PhoR are transcribed from the phoBR operon, located at ca. 417050-419300 (plus strand, with the PhoB coding sequence at 417,142-417,831 and the PhoR coding sequence at 417,889-419,184) in the genomic sequence of the E. coli K-12 substrain MG1655 (NCBI Reference Sequence NC_000913.3, 16 Dec. 2014). The phoA promoter differs from the inducible promoters described above in that it is induced by the lack of a substance (intracellular phosphate) rather than by the addition of an inducer. For this reason the phoA promoter is generally used to direct transcription of gene products that are to be produced at a stage when the host cells are depleted for phosphate, such as the later stages of fermentation. In some embodiments, inducible coexpression systems of the disclosure can comprise a phoA promoter. In other embodiments, the inducible coexpression systems of the disclosure comprise one or more inducible promoters that are not phoA promoters.
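By way of non-limiting illustration, the genomic coordinates recited above for plus-strand and minus-strand promoter regions could be resolved to sequence along the following lines. This is a minimal sketch only: the function name `extract_region` and the toy genome string are hypothetical, and the sketch assumes that the recited coordinates are 1-based and inclusive and that a minus-strand region is to be read as the reverse complement of the plus-strand sequence.

```python
# Hypothetical helper: resolve 1-based, inclusive genomic coordinates to sequence.
# Assumption: a minus-strand region is reported as the reverse complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(genome: str, start: int, end: int, strand: str = "+") -> str:
    """Return the sequence between two 1-based inclusive coordinates.

    Coordinates may be given in either order, since minus-strand features
    are often written from the higher to the lower coordinate.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    lo, hi = min(start, end), max(start, end)
    if lo < 1 or hi > len(genome):
        raise ValueError("coordinates fall outside the genome")
    region = genome[lo - 1:hi]  # convert to a 0-based, half-open slice
    if strand == "-":
        # complement each base, then reverse to read 5' to 3' on the minus strand
        return region.translate(_COMPLEMENT)[::-1]
    return region


# Toy sequence standing in for a genome (not real E. coli sequence)
toy_genome = "AAACCCGGGT"
plus_region = extract_region(toy_genome, 2, 5, "+")
minus_region = extract_region(toy_genome, 5, 2, "-")
print(plus_region, minus_region)
```

In such a sketch, a full genomic sequence for the recited reference accession would be loaded from a sequence file in place of the toy string, and the recited coordinates would be passed with the strand indicated in the text.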
As described herein, the present techniques may include a computer system, computer-implemented method and/or computer-readable medium/media that predicts or generates structural information of a biomolecule such as an antibody. In some aspects, the computer system predicts binding affinity between, for example, an antibody and an antigen. Nevertheless, wet lab techniques are contemplated to (i) confirm a predicted affinity or (ii) generate data to train an affinity model as described herein. Antibody binding and antibody affinity determination assays are well known in the art.
In one embodiment, an activity-specific cell-enrichment method (ACE) can be used to identify host cells that express “active” antibodies rather than “inactive material.” Active antibodies can be distinguished from inactive antibodies by the ability of active antibodies to specifically bind a binding partner molecule (e.g., an antigen or epitope). The ACE assay protocol is described in WO/2021/146626, incorporated by reference herein. It will be appreciated by those of ordinary skill in the art that ACE can not only discriminate between active/inactive in a binary fashion, but can also compute a score that is proportional to affinity. Thus, ACE provides quantitative assay information, not merely binary/Boolean information, which enables the modeling techniques herein to perform regression techniques. This richer modeling represents an advantageous improvement over the limited binary classification of conventional techniques.
In another embodiment, the HiPR Bind assay described in WO/2021/163349 and incorporated by reference herein is used in conjunction with the methods provided herein.
Binding assays, for example assays that measure protein-protein interactions, including antibody-antigen interactions and including measuring binding affinity, are well known in the art. By way of example, Surface plasmon resonance (SPR), Dual polarisation interferometry (DPI), Static light scattering (SLS), Dynamic light scattering (DLS), Flow-induced dispersion analysis (FIDA), Fluorescence polarization/anisotropy, Fluorescence resonance energy transfer (FRET), Bio-layer interferometry (BLI), Isothermal titration calorimetry (ITC), Microscale thermophoresis (MST), Single colour reflectometry (SCORE) are contemplated. Additionally, Bimolecular fluorescence complementation (BiFC), affinity electrophoresis, label transfer, phage display, Tandem affinity purification (TAP), cross-linking, Quantitative immunoprecipitation combined with knock-down (QUICK) and Proximity ligation assay (PLA) are other well-known assays that provide protein-protein interaction information.
In some embodiments, the binding affinities of the antibodies described herein are measured by array surface plasmon resonance (SPR), according to standard techniques (Abdiche, et al. (2016) MAbs 8:264-277). Briefly, antibodies are immobilized on a HC 30M chip at four different densities/antibody concentrations. Varying concentrations (0-500 nM) of antibody target are then bound to the captured antibodies. Kinetic analysis is performed using Carterra software to extract association and dissociation rate constants (ka and kd, respectively) for each antibody. Apparent affinity constants (KD) are calculated from the ratio of kd/ka. In some embodiments, the Carterra LSA Platform is used to determine kinetics and affinity. In other embodiments, binding affinity can be measured, e.g., by surface plasmon resonance (e.g., BIAcore™) using, for example, the IBIS MX96 SPR system from IBIS Technologies or the Carterra LSA SPR platform, or by Bio-Layer Interferometry, for example using the Octet™ system from ForteBio. In some embodiments, a biosensor instrument such as Octet RED384, ProteOn XPR36, IBIS MX96 and Biacore T100 is used (Yang, D., et al., J. Vis. Exp., 2017, 122:55659).
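The calculation of an apparent affinity constant from the ratio kd/ka can be illustrated with a short sketch. The function name `apparent_kd` and the numeric rate constants below are hypothetical and chosen only for illustration; they are not measured values, and the sketch does not represent the analysis performed by any particular instrument software.

```python
def apparent_kd(ka: float, kd: float) -> float:
    """Return the apparent equilibrium dissociation constant KD (in M).

    ka: association rate constant, in 1/(M*s)
    kd: dissociation rate constant, in 1/s
    """
    if ka <= 0:
        raise ValueError("ka must be positive")
    if kd < 0:
        raise ValueError("kd must be non-negative")
    return kd / ka


# Hypothetical rate constants, for illustration only
example_ka = 1.0e5   # 1/(M*s)
example_kd = 1.0e-3  # 1/s
example_KD = apparent_kd(example_ka, example_kd)
print(f"apparent KD = {example_KD:.2e} M")
```

With these illustrative inputs the ratio is 10^-8 M, i.e., 10 nM.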
KD is the equilibrium dissociation constant, the ratio koff/kon, between the antibody and its antigen. KD and affinity are inversely related: the KD value corresponds to a concentration of antibody, so the lower the KD value (lower concentration), the higher the affinity of the antibody. The KD of an antibody, including a reference antibody and a variant antibody, according to various embodiments of the present disclosure can be, for example, in the micromolar range (10^-4 to 10^-6), the nanomolar range (10^-7 to 10^-9), the picomolar range (10^-10 to 10^-12) or the femtomolar range (10^-13 to 10^-15). In some embodiments, antibody affinity of a variant antibody is improved, relative to a reference antibody, by approximately 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50% or more. The improvement may also be expressed relative to a fold change (e.g., 2×, 4×, 6×, or 2-, 3-, 4-, 5-, 6-, 7-, 8-, 9-, 10-fold or more improvement in binding activity, etc.) and/or an order of magnitude (e.g., 10^7, 10^8, 10^9, etc.).
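As a non-limiting sketch, a KD value can be assigned to one of the ranges recited above and compared against a reference antibody as follows. The function names `kd_range` and `fold_improvement` are hypothetical. The sketch assumes that KD is expressed in molar units, that the recited ranges are read as orders of magnitude (so that, e.g., 5 x 10^-7 M falls in the nanomolar range), and that fold improvement is taken as the ratio of the reference KD to the variant KD; the disclosure does not prescribe these particular definitions.

```python
def kd_range(kd_molar: float) -> str:
    """Name the recited range containing a KD value given in molar units."""
    if kd_molar <= 0:
        raise ValueError("KD must be positive")
    # Read the base-10 exponent from scientific notation, e.g. '2.000000e-08' -> -8
    exponent = int(f"{kd_molar:e}".split("e")[1])
    if -6 <= exponent <= -4:
        return "micromolar"
    if -9 <= exponent <= -7:
        return "nanomolar"
    if -12 <= exponent <= -10:
        return "picomolar"
    if -15 <= exponent <= -13:
        return "femtomolar"
    return "outside the recited ranges"


def fold_improvement(reference_kd: float, variant_kd: float) -> float:
    """Assumed definition: lower KD means higher affinity, so fold = reference / variant."""
    if reference_kd <= 0 or variant_kd <= 0:
        raise ValueError("KD values must be positive")
    return reference_kd / variant_kd


# Hypothetical KD values, for illustration only
reference = 2.0e-8  # M
variant = 5.0e-9    # M
print(kd_range(reference), kd_range(variant))
print(f"{fold_improvement(reference, variant):.1f}-fold")
```

Under these assumptions, a variant whose KD is one quarter of the reference KD is reported as an approximately 4-fold improvement.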
The data generated from the antibodies and assays described herein is, in some embodiments, used to train one or more models, as will be described next.
Before the present disclosure is further described, it is to be understood that this disclosure is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present disclosure will be limited only by the appended claims.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the disclosure. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the disclosure.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present disclosure, the preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.
It must be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a conformation switching probe” includes a plurality of such conformation switching probes and reference to “the microfluidic device” includes reference to one or more microfluidic devices and equivalents thereof known to those skilled in the art, and so forth. It is further noted that the claims may be drafted to exclude any element, e.g., any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present disclosure. Any recited method can be carried out in the order of events recited or in any other order which is logically possible. This is intended to provide support for all such combinations.
The various embodiments described above can be combined to provide further embodiments. All U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet are incorporated herein by reference, in their entirety. Aspects of the embodiments can be modified if necessary to employ concepts of the various patents, applications, and publications to provide yet further embodiments.
These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US23/62331 | 2/9/2023 | WO | |

| Number | Date | Country | |
|---|---|---|---|
| 63478933 | Jan 2023 | US | |
| 63308495 | Feb 2022 | US | |