The present invention relates to a method for predicting the solubility of polypeptide chains, including antibodies. Other aspects of the invention relate to a method of making polypeptide chains with a reduced propensity to aggregate or enhanced solubility, and to a method of making a pharmaceutical composition comprising polypeptide chains with altered solubility.
Therapeutic proteins such as antibodies are widely employed for diagnostic and therapeutic purposes because of their capacity to bind to target molecules with high affinity and specificity. In antibodies, the residues responsible for antigen binding are found in the so-called complementarity-determining regions (CDRs). These solvent-exposed regions are known to contain, in many cases, some hydrophobic, poorly-soluble, aggregation-promoting residues that, in addition to helping antigen binding, can also mediate self-association and aggregation. For therapeutic applications, the poor solubility of proteins can prove especially problematic as aggregation may not only affect the activity and efficiency of the therapeutic, but also elicit an immune response (as described in “Aggregation-resistant domain antibodies engineered with charged mutations near the edges of the complementarity-determining regions” by Perchiacca et al, Prot Eng Des Sel 25, 591-601 (2012)). This problem is further exacerbated by the need to formulate and store therapeutic proteins at high concentrations for efficient sub-cutaneous delivery. In contrast, as a rule, proteins are highly soluble at the concentrations at which they are produced by healthy living organisms. Moreover, if we exclude the CDR regions in antibodies, which are unstructured but very small compared to the size of the whole antibody, these molecules are structured proteins, quite stable under physiological or close to physiological conditions.
Protein aggregation also represents a problem in vivo. A number of pathological conditions are associated with aberrant protein deposition or aggregation. Examples of such disorders include neurodegenerative conditions such as Alzheimer's, Huntington's and Parkinson's diseases.
Conversely, there are instances where it may be desirable to form aggregates, in particular amyloid fibrils, such as for use as plastics materials in electronics, as conductors, for catalysis or as a slow release form of the polypeptide, or where polypeptide fibrils are to be spun into a polypeptide ‘yarn’ for various applications; for example as described in published patent applications WO0017328 (Dobson) and WO024321 (Dobson & McPhee).
It would therefore be useful to be able to predict the solubility of a target polypeptide chain and further predict what mutations or insertions could be made to the amino acid sequence to affect—preferably increase—its solubility while maintaining its structure and function.
Currently a number of computational methods are available to predict protein solubility or aggregation, mainly based on the sequence of the protein and physico-chemical properties such as hydrophobicity, charge and secondary structure propensities (“Rationalization of the effects of mutations on peptide and protein aggregation rates” by Chiti et al. Nature 424, 805-808 (2003)). Other methods based on the protein sequence include SOLpro (“SOLpro: Accurate sequence-based prediction of protein solubility” by Magnan et al., Bioinformatics 25, 2200-2207 (2009)) and PROSO II (“PROSO II—a new method for protein solubility prediction” by Smialowski et al, FEBS J 279, 2192-2200 (2012)).
When predicting protein solubility for medical applications, however, it is very important to remember that these proteins are already folded inside the expression organism before they are concentrated. Aggregation is consequently initiated from the native state of the protein. Thus, the aggregation pathway is mediated by partial unfolding events leading to the formation of oligomeric species, which, in some cases, can evolve into fibrillar conformations once a critical number of molecules is present, so that the enthalpy associated with their ordered stacking overcomes the corresponding loss of conformational entropy.
For this reason there is a need to calculate the structurally-corrected solubility or aggregation propensities, which correspond to the propensity of a protein in its native state to remain soluble or to aggregate. Such structurally-corrected solubility or aggregation propensities can be very different from the corresponding intrinsic propensities, which refer to the solubility or aggregation propensities of the unfolded state. In fact, ordered proteins tend to bury their poorly-soluble, aggregation-promoting regions inside the native structure. There are cases, however, and antibodies are one example, where some of these aggregation-prone residues need to be exposed on the surface for structural or functional reasons (“Physico-chemical principles that regulate the competition between functional and dysfunctional association of proteins”. Pechmann et al. Proc. Natl. Acad. Sci. USA, 106, 10159-10164 (2009)). For some antibodies, exposing ‘sticky’ (i.e. aggregation-prone) residues in their CDR loops is essential for antigen binding.
Here we present a computational algorithm that can predict the solubility of a target polypeptide chain in its native conformation, and furthermore can be used to predict specific amino acid substitutions and/or insertions that will alter the solubility of a target polypeptide chain while preserving its structure and functionality. The algorithm is very general and can be readily applied to any peptide or protein, requiring only knowledge of the protein sequence, structure and the residues that are important for function.
Furthermore, in cases where homology modeling can be applied, the knowledge of the structure is not necessary. Thus, the algorithm presented here allows for the rational design and production of a target polypeptide chain with a desired solubility, which is related to, but distinct from, the aggregation propensity (see
According to a first aspect of the invention, there is provided a method of identifying mutations or insertions that alter a property such as the solubility or aggregation propensity of an input polypeptide chain as set out in claim 1. According to another aspect of the invention, there is provided a data processing system for identifying mutations or insertions that alter the solubility or aggregation propensity of an input polypeptide chain as set out in claim 29.
In both aspects, the data processing system may be trained by training said first neural network in said data processing system using a set of polypeptide chains having known sequences of amino acids and known values for the solubility or aggregation propensity to determine a first function which maps the known sequences to the known values. Said training may comprise dividing each polypeptide chain in said set of polypeptide chains into a plurality of segments, with each segment having a first fixed length; inputting each segment into said first neural network by representing each amino acid in each segment using an input neuron in the first neural network; and
Thus, according to another aspect of the invention, there is provided a method of training a data processing system to predict a value for a property of a first polypeptide chain comprising a sequence of amino acids, the method comprising:
According to another aspect of the invention, there is provided a data processing system which has been trained to predict a value for a property of a polypeptide chain comprising a sequence of amino acids, the system comprising:
The following features apply to the methods and systems described above.
The first and second neural networks may be trained simultaneously so that the networks are available for prediction at a similar level of accuracy after a similar time scale. It will be appreciated that it may take longer to train the second network because longer segments are used. It will also be appreciated that more neural networks with different lengths of segments may also be used.
The neural network may be a deterministic neural network, for example a non-linear multilayer perceptron. Here, by non-linear, it is meant that one or more layers of neurons in the network have a non-linear transfer function so that the network is not constrained to fit just linear data. The skilled person will recognise that, in principle, the mapping need not be performed by a neural network but may be performed by any deterministic function, for example a large polynomial, splines or the like, but in practice such techniques are undesirable because of the exponential growth in the number of parameters needed as the length of the input/output vectors increases.
The known value may be a solubility or aggregation propensity value, for example a profile having a solubility or aggregation value for each amino acid in the sequence. The method may also be used for other known values, for example solvent exposure, secondary structure population or any other property that can be expressed as one value per residue in the sequence. Where the known value is in the form of a profile, i.e. a set of M real numbers (x_0, x_1, ..., x_{M-1}), the method may comprise applying a Fourier transform to the profile to determine a set of Fourier coefficients. For example, the discrete Fourier Transform (DFT) may be used where:
A subset of the set of Fourier coefficients may be used as the known values. In this way, a smaller number of output neurons is required. Moreover, when the networks are trained to predict only half of the coefficients, the improvement in the accuracy of the neural network appears to compensate for the error introduced by reconstructing the profile from only half of the coefficients.
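By way of illustration only, the use of a reduced set of Fourier coefficients may be sketched in Python as follows. The sketch assumes the common unnormalised forward DFT with a 1/M factor on the inverse, and assumes that the retained subset consists of the lowest-frequency modes together with their complex-conjugate mirror images (which carry no independent information for a real profile). Neither choice is fixed by the text, and the function names are hypothetical.

```python
import cmath


def dft(profile):
    """Forward DFT of a real profile (assumed convention: no normalisation)."""
    m = len(profile)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / m) for n, x in enumerate(profile))
        for k in range(m)
    ]


def inverse_dft(coeffs):
    """Inverse DFT (assumed convention: 1/M factor); returns the real part."""
    m = len(coeffs)
    return [
        (sum(c * cmath.exp(2j * cmath.pi * k * n / m) for k, c in enumerate(coeffs)) / m).real
        for n in range(m)
    ]


def keep_low_frequencies(coeffs, keep):
    """Zero every mode whose frequency index is >= keep.

    Coefficient k and coefficient M-k describe the same frequency for a real
    profile, so both are retained or discarded together.
    """
    m = len(coeffs)
    return [c if min(k, m - k) < keep else 0j for k, c in enumerate(coeffs)]


# Hypothetical per-residue profile for an 8-residue segment.
profile = [0.2, 1.5, 1.1, -0.3, -0.8, 0.4, 1.9, 0.7]
coeffs = dft(profile)
approximate_profile = inverse_dft(keep_low_frequencies(coeffs, 3))
```

In this sketch a network would only need to output the three retained independent coefficients rather than a value per residue, and the approximate profile preserves the mean of the original profile because the zero-frequency coefficient is kept.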
Once trained, the data processing system may be used to predict values, for example solubility or aggregation values if this is what it was trained on. Thus, generating said first output value of said solubility or aggregation propensity for each said input polypeptide chain using said first trained neural network may comprise dividing each said input polypeptide into a plurality of segments each having a first fixed length, inputting each amino acid in each segment to the first neural network; using the first function to map the input amino acids to a first segment output value for each segment; and combining the first segment output values to generate said first output value. Similarly, generating a second output value of said solubility or aggregation propensity for each said input polypeptide chain using said second trained neural network comprises dividing said input polypeptide chain into a plurality of segments each having a said second length which is greater than said first length, inputting each amino acid in each segment to the second neural network and using the second function to map the input amino acids to a second segment output value for each segment; and combining the second segment output values to generate said second output value. It will be also appreciated that the prediction process may be a stand-alone process.
Thus, according to another aspect of the invention, there is provided a method of predicting a value for a property of an input polypeptide chain comprising a sequence of amino acids, the method using a data processing system comprising a first trained neural network having a first function mapping an input to a first output value and a second trained neural network having a second function mapping an input to a second output value, the method comprising:
According to another aspect there is provided a data processing system for predicting a value for a property of an input polypeptide chain comprising a sequence of amino acids, the data processing system comprising
Again the following features apply to the methods and systems described above.
If more than two networks are used, the processor may be configured to combine all output values to determine the combined output value. The data processing system and thus the first and second neural networks may be trained as described above. Thus, the value to be predicted may be a solubility value or an aggregation propensity value, e.g. a profile having a value for each amino acid in the sequence.
Where the networks were trained to predict Fourier coefficients as described above, the first and second segment output values may be a set of Fourier coefficients. The Fourier coefficients may then be converted to the full profile by applying an inverse Fourier transform. For the DFT above, the reverse transform may take the form:
The Fourier coefficients represent the oscillatory modes of the profile. Accordingly, the first network that covers a sequence segment of smaller length k (suppose k<l) is better suited to capture high frequency modes, while the second network captures low frequency modes. Therefore the employment of two networks increases the accuracy of the reconstruction from the Fourier coefficients.
The Fourier method may also be used with only one network. Thus, according to another aspect of the invention, there is provided a method of training a data processing system to predict a value for a property of a first polypeptide chain comprising a sequence of amino acids, the method comprising training a neural network in said data processing system using a set of polypeptide chains having known sequences of amino acids and known values for said property, the known values being in the form of profiles having a value for each amino acid, wherein said training comprises dividing each polypeptide chain in said set of polypeptide chains into a plurality of segments, each segment having a first fixed length and an associated section of the profile; applying a Fourier transform to each associated section of the profile to generate a set of Fourier coefficients for each segment; inputting each segment into said first neural network by representing each amino acid in each segment using an input neuron in the first neural network and representing; and determining a function which maps the input segments to the set of Fourier coefficients for each segment.
When the sequence is divided into a plurality of segments, potential problems could arise at the boundaries of the segments because the influence of neighbouring residues belonging to different segments would be neglected. The use of two networks with different segment lengths helps to solve this fixed-length problem provided the second length is not a multiple of the first length. The first length may for example be 22 and the second length may for example be 40. The number of input neurons in each network matches the length of the segments. Thus, the first neural network may have 22 input neurons in an input layer and the second neural network may have 40 input neurons.
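The effect of choosing segment lengths that are not multiples of one another can be checked with a short calculation, sketched here in Python. The sketch assumes contiguous, non-overlapping segments counted from the first residue; the function names are hypothetical.

```python
def segment_boundaries(chain_length, segment_length):
    """Positions at which one non-overlapping segment ends and the next begins."""
    return set(range(segment_length, chain_length, segment_length))


def shared_boundaries(chain_length, first_length, second_length):
    """Boundaries that fall at the same position for both segment lengths."""
    first = segment_boundaries(chain_length, first_length)
    second = segment_boundaries(chain_length, second_length)
    return sorted(first & second)


# With lengths 22 and 40, boundaries coincide only at multiples of 440,
# so a 400-residue chain has no boundary shared by both networks.
no_overlap = shared_boundaries(400, 22, 40)

# With lengths 20 and 40, every boundary of the second network is also a
# boundary of the first, so the boundary problem is not relieved.
all_overlap = shared_boundaries(100, 20, 40)
```

Under these assumptions, a residue pair split by a boundary in one network lies inside a single segment of the other network for any chain shorter than 440 residues.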
Another way to solve this problem comprises dividing the polypeptide chain into a plurality of segments each having an overlapping region with adjacent segments. The overlapping region may comprise at least one amino acid, perhaps two to four amino acids, which is present in both adjacent segments. Splitting the sequence into segments having a longer length may mean that the overlapping regions may also be longer, e.g. to ensure that the segments have uniform length. The overlapping region may have a length which ranges from one amino acid through to all amino acids of the segment except one. Thus for segments of length n, the overlapping region may have a length of between 1 and n−1. For each network having an overlapping region of n−1 residues, the network resembles a sliding window that moves one residue at a time along the sequence, an approach that can be preferable for some applications.
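The segmentation with overlapping regions may be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name and return format are hypothetical, the final segment is anchored at the end of the chain (so that it may overlap its neighbour by more than the requested amount in order to keep all segments the same length), and chains shorter than one segment are returned whole without padding.

```python
def split_into_segments(sequence, length, overlap=0):
    """Split a sequence into fixed-length segments sharing `overlap` residues.

    Returns a list of (start_index, segment) pairs.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    if len(sequence) <= length:
        return [(0, sequence)]
    step = length - overlap
    starts = list(range(0, len(sequence) - length + 1, step))
    # Assumption: cover any leftover residues with one extra segment anchored
    # at the end of the chain, which keeps the segment length uniform.
    if starts[-1] + length < len(sequence):
        starts.append(len(sequence) - length)
    return [(s, sequence[s:s + length]) for s in starts]


chain = "ACDEFGHIKL"  # hypothetical 10-residue chain
overlapping = split_into_segments(chain, 4, overlap=1)
sliding_window = split_into_segments(chain, 4, overlap=3)  # overlap of n-1
```

With an overlap of n−1 the step between segments is one residue, which reproduces the sliding-window behaviour described above.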
The prediction method above can be used to identify mutations or insertions that alter the aggregation properties and hence the solubility of a target (or input) polypeptide chain. For example, according to another aspect of the invention, there is provided a method of identifying poorly soluble or aggregation-prone regions in a target polypeptide chain, the method comprising predicting a value for solubility or aggregation propensity using the method described above, comparing the predicted values against a threshold value and identifying the poorly soluble or aggregation-prone regions as regions having predicted values above the threshold value. In a preferred embodiment, the polypeptide chain is in its native (i.e. folded) state.
According to another aspect of the invention, there is provided a method of identifying mutations or insertions that alter the solubility or aggregation propensity of an input polypeptide chain, the method comprising
The methods described above may be used to identify mutations which increase or decrease the solubility of the target polypeptide chain. Alternatively, the methods may be used to alter the solubility of the target polypeptide chain to a desired amount.
The regions may be selected by comparing the score of each amino acid in the sequence to a threshold value, e.g. one, and selecting a region having more than a given number of adjacent amino acids above the threshold value. These fragments correspond to the ‘dangerous’ regions, i.e. those that can reduce solubility or trigger aggregation. The selected regions may be ranked by taking into account both the length (the size of the ‘dangerous’ region) and its solubility or aggregation propensity (how dangerous its components are) but it will be appreciated that other ranking scores may be awarded. The ranking preferably sorts the regions from less soluble to more soluble, or from more aggregation-prone to less aggregation-prone.
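The selection and ranking of regions may be sketched in Python as follows. The sketch is illustrative only: the minimum run length, the use of the summed score as the ranking score (chosen here because it grows with both the length of the region and the scores of its residues) and the function names are assumptions rather than features fixed by the text.

```python
def find_regions(scores, threshold=1.0, min_length=3):
    """Return half-open (start, end) index pairs for runs of adjacent residues
    whose score exceeds the threshold and which span at least min_length residues."""
    regions = []
    start = None
    # A sentinel below any threshold closes a run that reaches the chain end.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    return regions


def rank_regions(scores, regions):
    """Rank regions from most to least aggregation-prone (assumed score: sum)."""
    return sorted(regions, key=lambda r: sum(scores[r[0]:r[1]]), reverse=True)


# Hypothetical per-residue aggregation propensity scores.
scores = [0.2, 1.4, 1.6, 1.2, 0.1, 1.1, 0.3, 2.0, 2.5, 1.9, 1.3]
regions = find_regions(scores)
ranked = rank_regions(scores, regions)
```

In this example the single residue above the threshold at index 5 is discarded as too short, and the longer, higher-scoring region is ranked first.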
The method of prediction using neural networks described above may be used to predict a value for solubility or aggregation propensity. This method is able to run much faster than other methods for calculating solubility or aggregation propensity such as the structurally-corrected value used in the calculating step. Although this predicted value is designed to depict the tendency to remain soluble or aggregate from the unfolded state, the positions selected for mutations were selected using the structurally-corrected value. As a result, the predicted effect of a mutation at one of these positions on the solubility should, to a very good approximation, be the same. However, as a check, the structurally-corrected values may also be calculated for some or all of the mutated sequences as explained in more detail below.
The structurally-corrected solubility or aggregation propensity profile comprises a score for each amino acid in the sequence. The structurally-corrected solubility or aggregation propensity score may be calculated in any known way but it is essential that the score takes account of both the intrinsic propensity to aggregate and the native structure of the sequence. One method for calculating the score is set out in more detail below. The structurally-corrected solubility or aggregation propensity score A_i^surf of residue i can be written as a sum which is extended over all the residues of the protein within a distance r_S from residue i:
where w_j^E is the “exposure weight” which depends on the solvent exposure of residue j, and w_j^D is the “smoothing weight”, defined as
where d_ij is the distance of residue j from residue i.
The exposure weight is defined as
where x_j is the relative exposure of residue j, i.e. the SASA (solvent accessible surface area) of residue j in the given structure divided by the SASA of the same residue in isolation, and θ is the Heaviside step-function, which is employed so that residues less than 5% solvent-exposed are not taken into account.
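The general shape of such a calculation may be sketched in Python as follows. This is not the definition used by the method: the summand (the product of the two weights and the intrinsic score of residue j), the linear smoothing weight and the form of the exposure weight are all assumptions introduced for illustration. Only the sum over residues within r_S and the 5% exposure cut-off are taken from the description above, and all function names are hypothetical.

```python
import math


def gated_exposure_weight(x, cutoff=0.05):
    """Assumed exposure weight: the relative exposure itself, gated by a
    Heaviside step so that residues below 5% exposure contribute nothing."""
    return x if x >= cutoff else 0.0


def linear_smoothing_weight(d, r_s):
    """Assumed smoothing weight: decays linearly from 1 at d = 0 to 0 at d = r_s."""
    return max(0.0, 1.0 - d / r_s)


def structurally_corrected(intrinsic, coords, exposure, r_s,
                           exposure_weight=gated_exposure_weight,
                           smoothing_weight=linear_smoothing_weight):
    """Sum weighted intrinsic scores over all residues within r_s of residue i."""
    corrected = []
    for i in range(len(intrinsic)):
        total = 0.0
        for j in range(len(intrinsic)):
            d = math.dist(coords[i], coords[j])
            if d <= r_s:
                total += smoothing_weight(d, r_s) * exposure_weight(exposure[j]) * intrinsic[j]
        corrected.append(total)
    return corrected


# Hypothetical three-residue example: coordinates in angstroms, relative exposure in [0, 1].
intrinsic = [1.0, 2.0, 3.0]
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
exposure = [0.5, 0.02, 0.8]
surface_scores = structurally_corrected(intrinsic, coords, exposure, r_s=10.0)
```

In this example the buried second residue (2% exposed) contributes nothing to any score, and the distant third residue does not influence the first two. The weights are passed as arguments so that other functional forms can be substituted.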
The identified positions may be ranked, for example, using a combination of the ranking applied to each sequence and its individual score. Again, the ranking may be from less soluble to more soluble, or from more aggregation-prone to less aggregation-prone.
We now have a list of positions that are suitable for mutations and/or insertions. These positions are mapped on both the sequence and the structure. On the one hand, it could be desirable to perform mutations/insertions at several positions, in order to maximise the solubility of the resulting protein. On the other hand, too many mutations could change the protein too much.
Rather than constraining the number of mutations/insertions to perform, one might wish to stop making mutations when a given solubility is reached. Accordingly, the method may comprise choosing the top ranked position and generating the plurality of mutated sequences by applying a plurality of mutations and/or insertions at that position. The solubility or aggregation propensity is then predicted for each of these mutated sequences and the sequences are optionally ranked. It is then determined whether any of the predicted values for solubility or aggregation propensity, in particular the predicted value for the top ranked sequence, is higher than a threshold value (i.e. the value corresponding to the desired result). When the predicted value is higher than the threshold value, the mutated sequence is output as a target polypeptide. When the predicted value is lower than the threshold value, the method reiterates. Thus the choosing, generating and determining steps may be repeated for the next ranked position until the predicted value is higher than the threshold value.
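The iterative procedure may be sketched in Python as follows. The sketch considers substitutions only, and assumes that the best substitution found at each position is retained before moving to the next ranked position; the text does not specify whether mutations accumulate in this way. The `predict` argument stands in for the trained predictor, and `toy_solubility` is a hypothetical placeholder used only to make the example executable.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_solubility(sequence):
    """Hypothetical stand-in for the predictor: charged minus hydrophobic residues."""
    charged = sum(sequence.count(aa) for aa in "DEKR")
    hydrophobic = sum(sequence.count(aa) for aa in "FILVW")
    return charged - hydrophobic


def mutate_until_target(sequence, ranked_positions, predict, threshold):
    """Mutate one ranked position at a time until the prediction exceeds the threshold."""
    current = sequence
    for pos in ranked_positions:
        # Generate every single-residue substitution at this position.
        candidates = [
            current[:pos] + aa + current[pos + 1:]
            for aa in AMINO_ACIDS
            if aa != current[pos]
        ]
        best = max(candidates, key=predict)
        if predict(best) > predict(current):
            current = best  # assumption: keep the improvement and carry it forward
        if predict(current) > threshold:
            break  # desired value reached, so stop mutating
    return current, predict(current)


# Hypothetical input: positions ranked from most to least suitable for mutation.
result = mutate_until_target("AVILFA", [3, 2, 1], toy_solubility, threshold=-1)
```

In this example the loop stops after the second position because the predicted value has exceeded the threshold, leaving the third ranked position unmodified.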
Alternatively, a user may input that at most N mutations (preferably 3 or 4 mutations) may be performed. Accordingly, the method may comprise choosing a set of the top N ranked positions and generating the plurality of mutated sequences by applying at least some of all possible mutations and insertions at that set of positions. The number of mutations/insertions can be decreased, for example by excluding from the list of candidates the strongly hydrophobic amino acids (i.e. tryptophan, threonine, valine, leucine, isoleucine, phenylalanine residues) because it is known that their effect, if any, is to increase the aggregation propensity. The predicted value for solubility or aggregation propensity for each mutated sequence may then be ranked and the highest ranked mutated sequences may be output with their values as the output polypeptides. The structurally-corrected aggregation propensity which was calculated for the original sequence may also be calculated for each of these top-ranked mutated sequences and the ranking may be reorganised based on the structurally-corrected solubility or aggregation propensity.
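The enumeration over a set of top-ranked positions may be sketched in Python as follows. The sketch covers substitutions only and is illustrative: the excluded residues are those listed above, given as a default argument so that the list can be changed; the wild-type residue is kept as an option at each position so that sequences with fewer than N mutations are also generated; and `toy_solubility` is a hypothetical placeholder for the trained predictor.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EXCLUDED = "WTVLIF"  # candidate residues excluded in the text above


def toy_solubility(sequence):
    """Hypothetical stand-in for the predictor: charged minus hydrophobic residues."""
    charged = sum(sequence.count(aa) for aa in "DEKR")
    hydrophobic = sum(sequence.count(aa) for aa in "FILVW")
    return charged - hydrophobic


def enumerate_mutants(sequence, positions, excluded=EXCLUDED):
    """All sequences mutated at one or more of the given (distinct) positions."""
    options = []
    for pos in positions:
        wild_type = sequence[pos]
        allowed = [aa for aa in AMINO_ACIDS if aa not in excluded and aa != wild_type]
        options.append([wild_type] + allowed)
    mutants = []
    for choice in product(*options):
        residues = list(sequence)
        for pos, aa in zip(positions, choice):
            residues[pos] = aa
        mutant = "".join(residues)
        if mutant != sequence:  # drop the unmutated sequence
            mutants.append(mutant)
    return mutants


def top_ranked(mutants, predict, count):
    """The highest-scoring mutants, best first."""
    return sorted(mutants, key=predict, reverse=True)[:count]


mutants = enumerate_mutants("AVILFA", [1, 3])
best = top_ranked(mutants, toy_solubility, 5)
```

Excluding six residue types reduces the options at each position from 19 substitutions to at most 14, which limits the combinatorial growth as N increases. The top-ranked sequences returned here are the ones for which a structurally-corrected value could then be recalculated.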
Preferably, another input to the system indicates any positions at which mutations or insertions are prohibited. Accordingly, the method may further comprise identifying any positions at which mutations or insertions are prohibited and flagging such positions as immutable so that in the generating steps no mutations or insertions are applied at these positions. If all the positions in a selected region are flagged as immutable, the positions at the side of the selected region may be identified as positions which are suitable for mutation or insertion.
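The handling of immutable positions may be sketched in Python as follows. The sketch assumes that regions are given as half-open (start, end) index pairs and that "the positions at the side" means the single residue immediately flanking each end of the region; the function name is hypothetical.

```python
def candidate_positions(region, immutable, chain_length):
    """Positions in a selected region that may be mutated, falling back to the
    flanking positions when every position in the region is immutable."""
    start, end = region
    inside = [pos for pos in range(start, end) if pos not in immutable]
    if inside:
        return inside
    # Assumption: fall back to the residues directly before and after the region,
    # provided they exist and are not themselves flagged as immutable.
    flanks = (start - 1, end)
    return [pos for pos in flanks if 0 <= pos < chain_length and pos not in immutable]


# Hypothetical region covering residues 3 to 5 of a 10-residue chain.
partly_blocked = candidate_positions((3, 6), immutable={4}, chain_length=10)
fully_blocked = candidate_positions((3, 6), immutable={3, 4, 5}, chain_length=10)
```

When the region lies at the start or end of the chain, only the flank that exists is returned.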
The methods of predicting and training may preferably be computer-implemented methods. The invention thus further provides processor control code to implement the above-described systems and methods, for example on a general-purpose computer system or on a digital signal processor (DSP). The code is provided on a physical data carrier such as a disk, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (Firmware). Code (and/or data) to implement embodiments of the invention may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as C, Python, or assembly code. As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another.
In another aspect the invention provides a method of making a target polypeptide with altered solubility or aggregation propensity comprising identifying a mutation and/or insertion that alters the solubility of the target polypeptide as defined above and making a polypeptide chain comprising said mutation(s) and/or insertion(s). In a preferred embodiment, the method is a method of making a protein with increased solubility or a reduced propensity to aggregate. In an alternative embodiment, the method is a method of making a protein with decreased solubility or an enhanced propensity to aggregate. The protein can be made using techniques well known in the art. Such techniques include chemical synthesis, for example solid-phase synthesis, and standard recombinant techniques.
The ability of a polypeptide chain to form highly-organised aggregates such as amyloid fibrils has been found to be a generic property of polypeptide chains regardless of their structures or sequences, and not simply a feature of a small number of peptides and proteins associated with recognised pathological conditions (C. M. Dobson, “The structural basis of protein folding and its links with human disease,” Philos. Trans. R. Soc. Lond., B. Sci., vol. 356, no. 1406, pp. 133-145, February 2001). For this reason the target polypeptide chain can be any sequence of at least two amino acids (also called residues) joined by a peptide bond, regardless of length, post-translational modification, chemical modification or function. Similarly, the polypeptide chain may be naturally occurring or chemically synthesized, wild type or recombinant, such as a chimeric or hybrid. In the present invention the terms ‘polypeptide chain’, ‘peptide’ and ‘protein’ are used interchangeably.
In one aspect, the target polypeptide chain may be any protein, including but not limited to a protein hormone, antigen, immunoglobulin (e.g. antibody), repressors/activators, enzymes, cytokines, chemokines, myokines, lipokines, growth factors, receptors, receptor domains, neurotransmitters, neurotrophins, interleukins, interferons and nutrient-transport molecules (e.g. transferrin).
In a preferred embodiment, the target polypeptide chain is a CDR-containing polypeptide chain such as a T-cell receptor or antibody. In a preferred embodiment the CDR-containing polypeptide chain is an antibody or antigen-binding fragment thereof.
The term ‘antibody’ in the present invention refers to any immunoglobulin, preferably a full-length immunoglobulin. Preferably, the term covers monoclonal antibodies, polyclonal antibodies, multispecific antibodies, such as bispecific antibodies, and antibody fragments thereof, so long as they exhibit the desired biological activity. Antibodies may be derived from any species. Alternatively, the antibodies may be humanised, chimeric or antibody fragments thereof. The immunoglobulins can also be of any type (e.g. IgG, IgE, IgM, IgD, and IgA), class (e.g. IgG1, IgG2, IgG3, IgG4, IgA1 and IgA2) or subclass of immunoglobulin molecule.
The term ‘antigen-binding fragment’ in the present invention refers to a portion of a full-length antibody where such antigen-binding fragments of antibodies retain the antigen-binding function of a corresponding full-length antibody. The antigen-binding fragment may comprise a portion of a variable region of an antibody, said portion comprising at least one, two, preferably three CDRs selected from CDR1, CDR2 and CDR3. The antigen-binding fragment may also comprise a portion of an immunoglobulin light and heavy chain. Examples of antibody fragments include Fab, Fab′, F(ab′)2, scFv, di-scFv, and BiTE (Bi-specific T-cell engagers), Fv fragments including nanobodies, diabodies, diabody-Fc fusions, triabodies and tetrabodies; minibodies; linear antibodies; fragments produced by a Fab expression library, anti-idiotypic (anti-Id) antibodies, CDR (complementarity-determining region), and epitope-binding fragments of any of the above that immunospecifically bind to a target antigen such as cancer cell antigens, viral antigens or microbial antigens, single-chain or single-domain antibody molecules including heavy chain only antibodies, for example, camelid VHH domains and shark V-NAR; and multispecific antibodies formed from antibody fragments. For comparison, a full-length antibody, termed ‘antibody’, is one comprising VL and VH domains, as well as complete light and heavy chain constant domains.
The term ‘antibody’ may also include a fusion protein of an antibody, or a functionally active fragment thereof, for example in which the antibody is fused via a covalent bond (e.g., a peptide bond), at either the N-terminus or the C-terminus to an amino acid sequence of another protein (or portion thereof, such as at least 10, 20 or 50 amino acid portion of the protein) that is not the antibody. The antibody or fragment thereof may be covalently linked to the other protein at the N-terminus of the constant domain.
Furthermore, the antibody or antigen-binding fragments of the present invention may include analogs and derivatives of antibodies or antigen-binding fragments thereof that are modified, such as by the covalent attachment of any type of molecule, as long as such covalent attachment permits the antibody to retain its antigen binding immunospecificity. Examples of modifications include glycosylation, acetylation, pegylation, phosphorylation, amidation, derivatization by known protecting/blocking groups, proteolytic cleavage, linkage to a cellular antibody unit or other protein, etc. Any of numerous chemical modifications can be carried out by known techniques, including, but not limited to specific chemical cleavage, acetylation, formylation, metabolic synthesis in the presence of tunicamycin, etc.
Additionally, the analog or derivative can contain one or more unnatural amino acids. When non-natural amino acids or post-translational modifications come into play, the method can be easily applied at a good level of approximation by replacing such amino acids with the natural ones (modifications seldom take place close to aggregation-promoting regions). To actually account for modifications rather than neglecting them, one would need to introduce a correction to the intrinsic profile.
In an alternative embodiment, the target polypeptide chain is a peptide hormone. Examples of peptide hormones include insulin, glucagon, islet amyloid polypeptide (IAPP), ACTH (corticotrophin), granulocyte colony stimulating factor (G-CSF), tissue plasminogen, somatostatin, erythropoietin and calcitonin.
In a further alternative embodiment, the target polypeptide may be a protein associated with an amyloid disease. Examples include, but are not limited to, the Aβ peptide (Alzheimer's disease), amylin (or IAPP) (Diabetes mellitus type 2), α-synuclein (Parkinson's disease), PrPSc (Transmissible spongiform encephalopathy), huntingtin (Huntington's disease), calcitonin (medullary carcinoma of the thyroid), atrial natriuretic factor (cardiac arrhythmias, isolated atrial amyloidosis), apolipoprotein A1 (Atherosclerosis), serum amyloid A (Rheumatoid arthritis), medin (Aortic medial amyloid), prolactin (Prolactinomas), transthyretin (Familial amyloid polyneuropathy), lysozyme (Hereditary non-neuropathic systemic amyloidosis), β2 microglobulin (Dialysis related amyloidosis), gelsolin (Finnish amyloidosis), keratoepithelin (Lattice corneal dystrophy), cystatin (Cerebral amyloid angiopathy, Icelandic type), immunoglobulin light chain AL (Systemic AL amyloidosis), fibrinogen Aα chain (Familial visceral amyloidosis), oncostatin M receptor (Primary cutaneous amyloidosis), integral membrane protein 2B (Cerebral amyloid angiopathy, British type) and S-IBM (Sporadic inclusion body myositis).
Further examples of the target polypeptide include angiogenin, anti-inflammatory peptides, BNP, endorphins, endothelin, GLIP, Growth Hormone Releasing Factor (GRF), hirudin, insulinotropin, neuropeptide Y, PTH, VIP, growth hormone release hormone (GHRH), octreotide, pituitary hormones (e.g., hGH), ANF, growth factors, bMSH, platelet-derived growth factor releasing factor, human chorionic gonadotropin, hirulog, interferon alpha, interferon beta, interferon gamma, interleukins, granulocyte macrophage colony stimulating factor (GM-CSF), granulocyte colony stimulating factor (G-CSF), menotropins (urofollitropin (FSH) and LH), streptokinase, urokinase, ANF, ANP, ANP clearance inhibitors, antidiuretic hormone agonists, calcitonin gene related peptide (CGRP), IGF-I, pentigetide, protein C, protein S, thymosin alpha-1, vasopressin antagonist analogs, dominant negative TNF-α, alpha-MSH, VEGF, PYY, and polypeptide chains, fragments, polypeptide analogs and derivatives of the above.
In one aspect of the invention, there is provided a method of making a pharmaceutical composition wherein the composition comprises one or more polypeptide chains produced by the methods described herein formulated with a pharmaceutically acceptable carrier, adjuvant and/or excipient. In a preferred embodiment, the method is a method of making a pharmaceutical composition comprising a target polypeptide chain with an increased solubility or a reduced propensity to aggregate. In an alternative embodiment, the method is a method of making a pharmaceutical composition comprising a target polypeptide chain with a decreased solubility or an increased propensity to aggregate. Pharmaceutical compositions of the present invention can also be administered as part of a combination therapy, meaning the composition is administered with at least one other therapeutic agent, for example, an anti-cancer drug.
A pharmaceutically acceptable carrier can include solvents, dispersion media, coatings, antibacterial and antifungal agents, isotonic and absorption delaying agents. Preferably the carrier is suitable for intravenous, intramuscular, subcutaneous, parenteral, spinal or epidermal administration.
The pharmaceutical compositions of the present invention may also include one or more pharmaceutically acceptable salts, a pharmaceutically acceptable anti-oxidant, excipients and/or adjuvants such as wetting agents, emulsifying agents and dispersing agents.
In another aspect of the invention there is provided a polypeptide chain, preferably an antibody, with altered solubility or aggregation propensity, obtained or obtainable by the methods described herein. In a preferred embodiment, the polypeptide chain has a reduced propensity to aggregate or an increased solubility. In an alternative embodiment the polypeptide chain has an increased propensity to aggregate or a decreased solubility. In certain embodiments the polypeptide chain may be used as a medicament. In one embodiment the polypeptide chain may be used in the treatment of a disease, such as but not limited to, autoimmune diseases, immunological diseases, infectious diseases, inflammatory diseases, neurological diseases and oncological and neoplastic diseases including cancer.
The present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Returning to
An optional step S102 also includes inputting the maximum number N of mutations that are to be considered. A desired output value may also be input.
Once the various inputs are entered, at step S104, the structurally-corrected aggregation propensity is calculated for the whole sequence (i.e. for the whole molecule). This calculation yields a solubility score that is related to the solubility of the whole protein. This calculation also gives a profile, which is a score representative of the propensity for aggregation for every amino acid along the sequence.
The solubility score is calculated from the aggregation propensity profile, as predicted by the neural networks (intrinsic solubility score) or as modified by the structural correction (structurally-corrected solubility score, also called structurally-corrected aggregation propensity score). The solubility score takes into account only aggregation promoting residues (value in the profile larger than 1) and aggregation-resistant residues (value in the profile smaller than −1) and ignores all intermediate values, which are treated as neutral noise in the profile. Specifically, it is the sum of the individual aggregation propensity of those residues with aggregation propensity values either larger than one or smaller than minus one divided by the total length of the sequence.
As a consequence, a protein sequence with no solubility-enhancing and no solubility-reducing regions (that is, no aggregation-resistant and no aggregation-promoting regions) will have a score of zero; a protein with a majority of solubility-promoting or aggregation-resistant regions will have a negative score, and a protein with a majority of solubility-reducing or aggregation-promoting regions a positive one. Since the sum is divided by the total length of the sequence, typical values of this score are close to zero, and small variations can have a significant impact on the solubility of the protein. In an alternative embodiment, threshold values different from −1 and 1 can be employed, in order to make the score more or less sensitive to mutations and insertions. The intrinsic solubility score and the structurally-corrected one can be very different and in principle uncorrelated, since they are calculated from different profiles. However, when mutations or insertions are performed at sites that are exposed to the solvent, the variations of the two scores always correlate. This is why we use the intrinsic score to scan a large number of possible combinations of mutations and insertions and we calculate the structural correction only for the most promising ones.
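The score definition above can be illustrated with a minimal sketch. The function name `solubility_score` and the example profile values are hypothetical; only the rule itself (sum the profile values above 1 or below −1, then divide by the sequence length) is taken from the description.

```python
def solubility_score(profile, upper=1.0, lower=-1.0):
    """Sum the profile values above `upper` or below `lower`, divided by the sequence length.

    `profile` holds one aggregation propensity value per residue.
    Values between the two thresholds are treated as neutral noise and ignored.
    """
    if not profile:
        raise ValueError("profile must not be empty")
    total = sum(value for value in profile if value > upper or value < lower)
    return total / len(profile)


# Illustrative (made-up) six-residue profile: 1.5, 2.0 and -1.2 are counted.
example_profile = [0.2, 1.5, 2.0, -0.4, -1.2, 0.9]
example_score = solubility_score(example_profile)
```

Passing different `upper` and `lower` values corresponds to the alternative embodiment in which thresholds other than −1 and 1 are used.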
One method for defining the structurally-corrected surface solubility or aggregation propensity is to project the intrinsic solubility or aggregation propensity profile onto the surface and smooth it over a surface patch of size S with radius $r_S$. $A_j^{\mathrm{int}}$ is the intrinsic aggregation propensity score of residue j, which is calculated using the neural networks as described below. The structurally-corrected solubility or aggregation propensity score $A_i^{\mathrm{surf}}$ of residue i can be written as a sum which is extended over all the residues of the protein within a distance $r_S$ from residue i:
$$A_i^{\mathrm{surf}} = \sum_{j\,:\,d_{ij} \le r_S} w_j^{E}\, w_j^{D}\, A_j^{\mathrm{int}}$$
where $w_j^{E}$ is the “exposure weight”, which depends on the solvent exposure of residue j, and $w_j^{D}$ is the “smoothing weight”, defined as
where $d_{ij}$ is the distance of residue j from residue i.
This definition of the smoothing weight guarantees that neighbouring residues contribute more to the local surface solubility or aggregation propensity than more distant ones. Furthermore, the smoothing weight does not bias towards a preselected surface patch size, and thus makes the method applicable to the study of a wide range of interface sizes. In the present work we set $r_S$ equal to 10 Å, as this value is consistent with the seven-amino-acid window implemented in the prediction of the intrinsic profile.
The exposure weight is defined as
where $x_j$ is the relative exposure of residue j, i.e. the SASA (solvent accessible surface area) of residue j in the given structure divided by the SASA of the same residue in isolation, and θ is the Heaviside step-function, which is employed so that residues less than 5% solvent-exposed are not taken into account.
The equation defining the exposure weight is a sigmoidal function, where a and b are parameters tuned so that the weight grows slowly up to a relative exposure x≈20% and then grows linearly, reaching 1 at x≈50%. When a residue is 50% solvent-exposed, half of it faces inwards in the structure while the other half, facing the solvent, already provides the largest surface for potential aggregation partners.
Thus the structurally-corrected solubility or aggregation propensity profile has a value for each residue that is associated with the contribution of that residue to the overall solubility or aggregation propensity. Since this is a structurally-corrected profile, amino acids with little exposure to the solvent will get a value that is zero or close to zero.
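The structural correction can be sketched as follows, under stated assumptions. The explicit forms of the two weights are not reproduced in this text, so the sketch assumes a smoothing weight that decays linearly from 1 at zero distance to 0 at the patch radius, and a logistic exposure weight with illustrative parameters `a` and `b` combined with the 5% cutoff. These forms, the parameter values and all function names are hypothetical choices that only mimic the qualitative behaviour described above.

```python
import math


def exposure_weight(x, a=20.0, b=0.3, cutoff=0.05):
    """Assumed exposure weight: zero below the 5% cutoff, logistic in the relative exposure x."""
    if x < cutoff:
        return 0.0
    return 1.0 / (1.0 + math.exp(-a * (x - b)))


def smoothing_weight(d, r_s=10.0):
    """Assumed smoothing weight: linear decay with distance d, zero beyond the radius r_s."""
    return max(1.0 - d / r_s, 0.0)


def structurally_corrected_profile(intrinsic, coords, exposure, r_s=10.0):
    """For each residue i, sum w_E * w_D * A_int over all residues j within r_s of i.

    intrinsic: intrinsic aggregation propensity per residue
    coords:    one (x, y, z) tuple per residue, in angstroms
    exposure:  relative solvent exposure per residue, between 0 and 1
    The sum includes residue i itself, which lies at distance zero.
    """
    corrected = []
    for i in range(len(intrinsic)):
        total = 0.0
        for j in range(len(intrinsic)):
            d = math.dist(coords[i], coords[j])
            if d <= r_s:
                total += exposure_weight(exposure[j]) * smoothing_weight(d, r_s) * intrinsic[j]
        corrected.append(total)
    return corrected
```

With these assumed weights a fully buried residue contributes nothing to any patch, which reproduces the statement that poorly exposed amino acids receive values at or near zero.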
Returning to
At step S110, we scan through our ensemble of fragments that were selected in the previous steps searching for any residues which were indicated as not to be changed in the first step. These residues are flagged as immutable and thus this step may be considered as filtering the fragments for immutable residues. At least in the case of antibodies, after this filtering, some of the fragments may be completely immutable, as it is quite common for solubility-reducing or aggregation-promoting residues to be found within the CDR loops. Regardless of the number of immutable residues, the position of the fragment in the sequence and in the structure and its ranking score are stored.
The next step S112 is to highlight some positions as candidates for mutations or insertions. Each fragment is considered one at a time. If the fragment still contains some mutable residues, i.e. residues that were not flagged in the previous step, their positions in the sequence are highlighted as possible positions for mutations. If the fragment contains no mutable residues, the positions at the side of the fragment are highlighted as possible positions for mutations/insertions. Each site can either be a candidate for an insertion or a mutation; it cannot be a candidate for both.
It is known that the presence of solubility-promoting or aggregation-resistant residues (such as charged residues or residues that disfavour β-strand formation, like proline or glycine residues) has an effect on the aggregation propensity of the region that contains them. Solubility-neutral or aggregation-neutral residues may be defined as ones having a score between −1 and 1 according to our prediction. The list of known solubility-promoting or aggregation-resistant residues consists of the charged residues (lysine, arginine, glutamic acid and aspartic acid), with the addition of proline and glycine, as these two are known to break secondary structures.
Mutating solubility-neutral or aggregation-neutral residues to solubility-promoting or aggregation-resistant ones at one or both sides of a ‘dangerous’ fragment can significantly increase the solubility or decrease the aggregation propensity of the fragment itself. For this reason we look at the position in the structure of the residues adjacent to an immutable fragment. If the amino acid in the adjacent position is solvent exposed and its sidechain is not involved in particular interactions (such as salt bridges, disulphide bonds or hydrogen bonds), its position is flagged for mutation. In addition, if the amino acid is part of some secondary structure (and its backbone hydrogen is involved in a hydrogen bond), proline and glycine residues are excluded from the list of possible candidates to replace it. On the other hand, if the adjacent amino acids are not solvent exposed or their sidechains form important interactions, the sides of the solubility-reducing or aggregation-prone fragment are labelled as possible sites for insertions. Furthermore, the sidechain could be part of the hydrophobic core and thus would not be flagged for mutation. However, this is generally accounted for by checking the solvent exposure (e.g. if it is part of the hydrophobic core then it is not exposed to the solvent).
We now have a list of positions that are suitable for mutations and/or insertions. These positions are mapped on both the sequence and the structure. Each position also has a score (the one given to the fragments before, i.e. as calculated in S106-108) that reflects how large the effect on solubility of a mutation/insertion at that site is expected to be. Sites for possible mutation/insertion are therefore ranked. At this point, a choice needs to be made by the user. On the one hand, it could be desirable to perform mutations/insertions at several positions, in order to maximise the solubility of the resulting protein. On the other hand, too many mutations could change the protein too much. This is generally unsuitable for pharmaceutical applications, as one needs to be sure that the resulting protein, after injection, does not trigger an immune reaction in the patient. Moreover, a large number of mutations, even when they are solely on the surface, can affect the stability of the protein, compromising its folding and, consequently, its function.
A strategy that can be employed to distribute the N mutations/insertions more effectively among the fragments is to first normalize the scores of the fragments. Once the dangerous fragments have been selected and ranked with their scores, such scores are normalized (by dividing each score by the sum of all the scores) so that their sum is equal to one. The mutations/insertions are then assigned, always starting from the most dangerous fragment in the ranking, by rounding the product of N times the normalized score to the nearest integer and moving down the ranking until all N mutations/insertions have been assigned to some of the positions determined in S112 in the various fragments.
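The allocation strategy above can be sketched as follows. The function name `allocate_mutations` is hypothetical. Two details are assumptions not stated in the description: halves are rounded up, and any mutations left unassigned after one pass through the ranking are handed out one at a time from the top of the ranking.

```python
import math


def allocate_mutations(fragment_scores, n_mutations):
    """Distribute n_mutations among fragments in proportion to their normalized scores.

    Returns a list with the number of mutations/insertions assigned to each fragment,
    in the same order as `fragment_scores`. Scores are assumed to be positive.
    """
    total = sum(fragment_scores)
    if total <= 0:
        raise ValueError("fragment scores must sum to a positive value")
    # Rank fragments from most to least dangerous.
    order = sorted(range(len(fragment_scores)), key=lambda k: fragment_scores[k], reverse=True)
    allocation = [0] * len(fragment_scores)
    remaining = n_mutations
    for k in order:
        if remaining == 0:
            break
        # Round N times the normalized score to the nearest integer (halves round up).
        share = math.floor(n_mutations * fragment_scores[k] / total + 0.5)
        share = min(share, remaining)
        allocation[k] = share
        remaining -= share
    # Assumption: leftovers caused by rounding down are given out from the top of the ranking.
    idx = 0
    while remaining > 0:
        allocation[order[idx % len(order)]] += 1
        remaining -= 1
        idx += 1
    return allocation


# Three fragments with scores 3, 1 and 6 and N = 5 mutations.
example_allocation = allocate_mutations([3.0, 1.0, 6.0], 5)
```

In this example the top-ranked fragment receives three mutations and the second-ranked one the remaining two, so the lowest-ranked fragment is left untouched.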
Once the positions are selected, sequences corresponding to every possible combination of mutations and/or insertions at those sites are generated at step S116. Even though each site can only have a mutation or an insertion, this step generally involves a very large number of sequences (of the order of $20^N$ for N sites) because there are 20 types of amino acids that can be used at each position. The number of sequences, however, can be decreased, for example by excluding from the list of candidates the strongly hydrophobic amino acids (i.e. tryptophan and threonine) because it is known that their effect, if any, is to increase the aggregation propensity. Other techniques may also be used to reduce the number of sequences. However, in most cases, it is unlikely to be possible to reduce the number of generated sequences enough for the structurally-corrected solubility or aggregation propensity that was calculated for the original sequence to be calculated for each generated sequence in a reasonable time (not least because every sequence needs to be mapped on the structure first).
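The combinatorial growth described above can be made concrete with a short sketch that enumerates substitutions only; insertions are left out for brevity. The function name `candidate_sequences`, the zero-based site indices and the `excluded` argument are hypothetical, and the wild-type residue is kept among the candidates at each site, so the original sequence is one of the outputs.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def candidate_sequences(sequence, sites, excluded=""):
    """Yield every sequence obtained by substituting the given sites.

    sequence: original sequence as a string of one-letter codes
    sites:    zero-based positions selected for mutation
    excluded: one-letter codes removed from the candidate alphabet
    """
    alphabet = [aa for aa in AMINO_ACIDS if aa not in excluded]
    for combo in product(alphabet, repeat=len(sites)):
        residues = list(sequence)
        for position, amino_acid in zip(sites, combo):
            residues[position] = amino_acid
        yield "".join(residues)


# Two sites in a toy four-residue sequence give 20 * 20 candidates.
example_candidates = list(candidate_sequences("AVLG", [1, 2]))
```

Because the generator is lazy, candidates can be scored one at a time without holding the full set in memory, which matters as the number of sites grows.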
Accordingly, at step S118, an intrinsic solubility or aggregation propensity value is calculated as explained in more detail with reference to
Once this calculation is completed, the mutated sequences are ranked at step S120 using the calculated intrinsic solubility or aggregation propensity. The top m (say m≈10) mutated sequences that are predicted to be the most soluble are selected at step S122. The structurally-corrected solubility or aggregation propensity that was calculated for the original sequence is calculated for each of these top-ranked mutated sequences at step S124. The mutated sequences are ranked at step S126 using the calculated structurally-corrected solubility or aggregation propensity in order to double-check the ranking and to obtain a more accurate solubility score. It will be appreciated that the calculation of the structurally-corrected solubility or aggregation propensity and the double-checking of the ranking are optional.
The mutated sequences and their solubility scores are output in ranked order at step S128. The output thus lists the most soluble protein sequences that are obtainable with the given number of mutations, i.e. without changing the protein too much.
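Steps S118 to S128 amount to a two-stage screen, which can be sketched as below. The function name `two_stage_ranking` is hypothetical, and the two scoring functions are passed in as callables because the predictors themselves are not reproduced here. The sketch assumes that a lower score means a more soluble sequence, in line with the sign convention of the solubility score.

```python
def two_stage_ranking(candidates, intrinsic_score, corrected_score, m=10):
    """Rank all candidates by the cheap score, then re-rank the top m by the expensive one.

    candidates:      iterable of sequences
    intrinsic_score: callable giving the intrinsic score (lower = more soluble)
    corrected_score: callable giving the structurally-corrected score
    Returns a list of (corrected score, sequence) pairs, most soluble first.
    """
    shortlist = sorted(candidates, key=intrinsic_score)[:m]
    return sorted((corrected_score(seq), seq) for seq in shortlist)


# Toy stand-ins for the two predictors, for illustration only.
def toy_intrinsic(seq):
    return seq.count("W")


def toy_corrected(seq):
    return seq.count("W") - 0.5 * seq.count("K")


example_ranking = two_stage_ranking(["AWW", "AKW", "AKK", "WWW"], toy_intrinsic, toy_corrected, m=2)
```

Only the m shortlisted sequences are ever passed to the expensive scorer, which is the point of screening with the intrinsic score first.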
One could now select the sequence with the highest solubility, or could consider carrying out another refinement step. This step entails using one of the many available algorithms that calculate the effect of a mutation/insertion on the stability of the protein (i.e. ΔΔG). See for example Zhang Z, Wang L, Gao Y, Zhang J, Zhenirovskyy M, Alexov E “Predicting folding free energy changes upon single point mutations” in Bioinformatics 2012, 28(5):664-671 (http://www.ncbi.nlm.nih.gov/pubmed/22238268); Li Y, Fang J “PROTS-RF: a robust model for predicting mutation-induced protein stability changes” in PLoS One. 2012; 7(10):e47247 (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0047247); Thiltgen G, Goldstein R A “Assessing predictors of changes in protein stability upon mutation using self-consistency” in PLoS One. 2012; 7(10):e46084 (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0046084); or Capriotti E, Fariselli P, Casadio R “I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure” in Nucleic Acids Res. 2005; 33:W306-10 (http://gpcr2.biocomp.unibo.it/~emidio/I-Mutant2.0/I-Mutant2.0_Details.html).
This could be helpful because, in the majority of cases, the m most soluble sequences will have very similar solubility; scoring them with the sum of the ΔΔG values of all the mutations/insertions they contain could therefore be a useful way to select the best one. As already mentioned, however, since our mutation sites are selected on the surface of the protein, the ΔΔG values of mutations at these sites should be close to zero. Therefore this further refinement is not expected to add much to the prediction.
In this method, the initial step S130 is to select the top site from the sites selected in step S112. The top site is the one with the highest fragment score and the highest score within that fragment. At step S132, all possible mutations are performed at this site and the new intrinsic aggregation propensity for each mutated sequence is calculated using the predictor as set out at S134. The mutated sequences are ranked according to this calculated propensity at S136. Thus, steps S132 to S136 are the same as steps S116 to S120 in the previous method except that the mutations are being performed at only one site.
The methods diverge at this stage. At step S138, the best mutation (i.e. highest-ranked sequence) is selected rather than the top m sequences as in the previous arrangement. The structurally-corrected score is then calculated for this top ranked sequence (S140). It then extrapolates the solubility using the correlation coefficients calculated from the fit of the experimental data. For example, in the case of single domain antibodies, the coefficient in
where the sum is over the n connections that enter in the neuron and the function g, specific to the neuron, is called activation function.
The activation function can be very general; the most common examples are a threshold, a symmetric threshold, a sigmoid, a symmetric sigmoid, a stepwise sigmoid and also linear functions. A symmetric sigmoid is implemented in the current version and is illustrated in
The deterministic neural network may be, for example, a non-linear multilayer perceptron. Here, by non-linear, it is meant that one or more layers of neurons in the network have a non-linear transfer function so that the network is not constrained to fit just linear data. The skilled person will recognise that, in principle, the mapping need not be performed by a neural network but may be performed by any deterministic function, for example a large polynomial, splines or the like, but in practice such techniques are undesirable because of the exponential growth in the number of parameters needed as the length of the input/output vectors increases.
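The single-neuron computation described above, an activation function g applied to a weighted sum over the n incoming connections, can be sketched as follows. The hyperbolic tangent is used here as one common form of symmetric sigmoid; the exact function, the `steepness` parameter and the name `neuron_output` are assumptions of this sketch, and no bias term is included.

```python
import math


def neuron_output(inputs, weights, steepness=1.0):
    """Return g(sum_i w_i * x_i), with g taken to be tanh (a symmetric sigmoid).

    inputs:  values arriving on the n connections entering the neuron
    weights: one weight per connection
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return math.tanh(steepness * weighted_sum)


# Two inputs with equal weights; the weighted sum is 1.0.
example_output = neuron_output([1.0, 1.0], [0.5, 0.5])
```

The output of this activation is bounded between −1 and 1 and is zero for a zero weighted sum, which is what distinguishes a symmetric sigmoid from the ordinary one.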
A neural network requires a fixed number of input neurons but sequences have variable length. To overcome this problem, as illustrated in
As shown in
As shown in
Returning to
As explained above, the neural network first needs to be trained before it can be used to perform predictions. Accordingly, the segments need to be input with an output value.
The output value may be the known value for the propensity as represented by the profiles in
The set of numbers is thus reduced to a smaller set of complex numbers:
The last coefficients (i.e. the lower set of complex numbers) of the Discrete Fourier Transform may be ignored without compromising the output. In this way, a smaller number of output neurons are required. In the output layer, a reverse Fourier transform is applied to recreate the profile. For the DFT above, the reverse transform may take the form
Moreover, as mathematically the Fourier coefficients represent the oscillatory modes of the profile, the network that covers a sequence segment of smaller length k (suppose k<l) is better suited to capture high frequency modes, while the other one is better suited to capture low frequency modes. Therefore the employment of two networks helps not only to solve the fixed length problem but also to increase the accuracy of the reconstruction from the Fourier coefficients.
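The idea of encoding a profile by a truncated set of Fourier coefficients and rebuilding it with the inverse transform can be sketched as follows. The exact transform conventions used by the networks are not reproduced in this text, so the sketch assumes the standard unnormalized DFT, keeps only the first `n_keep` coefficients, and uses the conjugate symmetry of a real-valued profile for the reconstruction. The function names are hypothetical.

```python
import cmath
import math


def dft_low(profile, n_keep):
    """Return the first n_keep DFT coefficients of a real-valued profile."""
    n = len(profile)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(profile))
        for k in range(n_keep)
    ]


def reconstruct(coeffs, n):
    """Rebuild a length-n real profile from its lowest-frequency coefficients.

    The discarded coefficients are treated as zero; the negative-frequency
    partners of the kept ones are supplied by conjugate symmetry.
    """
    if 2 * (len(coeffs) - 1) >= n:
        raise ValueError("too many coefficients for this reconstruction")
    profile = []
    for t in range(n):
        value = coeffs[0].real
        for k in range(1, len(coeffs)):
            value += 2.0 * (coeffs[k] * cmath.exp(2j * math.pi * k * t / n)).real
        profile.append(value / n)
    return profile


# A smooth eight-point profile (constant plus one slow oscillation) is
# recovered from only two complex coefficients.
example = [1.0 + math.cos(2 * math.pi * t / 8) for t in range(8)]
approx = reconstruct(dft_low(example, 2), len(example))
```

A profile with sharper features would need more coefficients for the same accuracy, which is consistent with using a second network on shorter segments to capture the high frequency modes.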
Returning to
Once each network has been trained, it can be used to predict output values as shown in
Accordingly, the first step in the prediction sequence shown in
Where we are predicting the intrinsic aggregation propensity for the sequence, the final output is the profile. Where the Fast Fourier methodology has been used, the output values that are predicted are the complex coefficients (S408, S410). These complex coefficients are then converted to the profile by using the inverse Fourier transform as described. Each segment has its own output profile (S412, S414); these profiles are then combined to create an output profile for the associated network (S416, S418). In the overlapping regions, the profiles are simply averaged to create the combined profile. Finally, at step S420, the profile from each neural network is averaged to provide a combined output profile.
The averaging can be carried out as a smoothing over a sliding window of seven residues. This step is done to strengthen the influence that residues in the sequence have on their vicinity and to reduce the noise of the profile, highlighting regions rather than single residues.
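The combination of per-segment profiles and the seven-residue smoothing can be sketched as follows. The function names are hypothetical, and the handling of the sequence ends (the window is simply truncated there) is an assumption, since the description does not specify it.

```python
def combine_segments(length, segments):
    """Merge per-segment profiles into one profile, averaging where segments overlap.

    length:   length of the full sequence
    segments: list of (start, values) pairs, `start` being a zero-based offset
    """
    sums = [0.0] * length
    counts = [0] * length
    for start, values in segments:
        for offset, value in enumerate(values):
            sums[start + offset] += value
            counts[start + offset] += 1
    if 0 in counts:
        raise ValueError("segments do not cover the whole sequence")
    return [s / c for s, c in zip(sums, counts)]


def average_profiles(profile_a, profile_b):
    """Average the combined profiles of the two networks position by position."""
    return [(a + b) / 2.0 for a, b in zip(profile_a, profile_b)]


def smooth(profile, window=7):
    """Moving average over `window` residues, truncating the window at the sequence ends."""
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        lo = max(0, i - half)
        hi = min(len(profile), i + half + 1)
        smoothed.append(sum(profile[lo:hi]) / (hi - lo))
    return smoothed


# Two overlapping three-residue segments on a four-residue sequence.
example_combined = combine_segments(4, [(0, [1.0, 1.0, 1.0]), (1, [3.0, 3.0, 3.0])])
```

In the example, the two middle positions are covered by both segments and therefore take the mean of the two predictions.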
To experimentally validate the predictor of
The algorithm of
The values of the score predicted using the method above, the critical concentration that was experimentally determined and the error between the measured and the predicted concentration are reported in the table below. These results are also plotted in the correlation plot of
The wild type has a predicted score of 0.059 and a measured critical concentration of 27.6 μM. The six variants have scores ranging between −0.015 and 0.041 and measured concentrations varying between 42.3 and 113.4 μM. Accordingly, some of the mutations have only a small effect on the solubility or aggregation propensity whereas others (e.g. Aβ33-42 EEP) change it radically. Every mutation or insertion we have tried was predicted using the algorithm described above. However, since we wanted to validate the accuracy of our predictions, we did not simply select the most solubilizing combination of mutations and insertions as described previously, but instead screened a wider range of solubility values.
The critical concentration of the wild type and of these variants was measured. Gammabody Aβ33-42 mutant variants were obtained by employing phosphorylated oligonucleotide PCR or Quick Change XLII kit (Qiagen) on the wild type variant cDNA, depending on the kind of mutation. The different gammabodies were expressed in E. coli BL21 (DE3)-pLysS strain (Stratagene) for 24 h at 30° C. using Overnight Express Instant TB Medium (Novagen) supplemented with ampicillin (100 μg/mL) and chloramphenicol (35 μg/mL). The cell suspension was then centrifuged twice at 6000 rcf and the supernatant incubated with Ni-NTA resin (Qiagen; 2.5 mL per litre of supernatant) at 18° C. overnight with mild agitation. The Ni-NTA beads were collected and the protein eluted in PBS pH 3 and neutralized to pH 7 upon elution. The protein purity, as determined by SDS-PAGE electrophoresis, exceeded 95%. Solutions of the purified proteins were then divided into aliquots, flash-frozen in liquid nitrogen and stored at −80° C.; each protein aliquot was thawed only once before use. Protein concentrations and soluble protein yields were determined by absorbance measurements at 280 nm using theoretical extinction coefficients calculated with Expasy ProtParam.
In order to determine the critical concentration (cc) of the gammabody variants (i.e. the highest concentration at which the gammabodies are able to keep their native monomeric conformations), protein samples at different concentrations were obtained by centrifugation steps using AmiconUltra-0.5, Ultracel-3 Membrane, 3 kDa (Millipore), incubated for 30 min at room temperature and ultracentrifuged at 90000 rpm for 45 min at 4° C. The protein concentration of the resulting supernatant was plotted as a function of the starting protein concentration of the solution, before ultracentrifugation, and analysed using an exponential equation, with the top asymptote taken as the critical concentration value.
In
The processor of
No doubt many other effective alternatives will occur to the skilled person. It will be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the spirit and scope of the claims appended hereto.
N N K (NotI- A A A) W G Q G T L V T V S S-
D E D (NotI- A A A) W G Q G T L V T V S S-
E E E (NotI- A A A) W G Q G T L V T V S S-
Number | Date | Country | Kind |
---|---|---|---|
1310859.2 | Jun 2013 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2014/051866 | 6/17/2014 | WO | 00 |