The present invention relates to a method for identifying vaccine candidates, for example from the proteome of a pathogenic organism and in particular a bacterium, to vaccines identified using this method, and to computer-readable media which are useful in it.
During the past 200 years the use of vaccines to control infectious diseases caused by bacterial pathogens has proven to be both effective and safe. Many of these vaccines were discovered using an empirical approach and such vaccines include live attenuated forms of bacterial pathogens, killed bacterial cells and individual components of the bacterium (sub-units). Although many bacterial vaccines are still widely used, a shift towards reliance on antibiotic therapy for the control of many other infectious diseases occurred during the latter half of the twentieth century.
The recent appearance of antibiotic resistant strains of many bacterial pathogens has prompted a resurgence of interest in the use of vaccines to prevent disease. However, many of the existing bacterial vaccines are not considered to offer appropriate levels of protection against infection. In addition, an increased awareness of the potential for transient side effects following vaccination has prompted an increased emphasis on the use of sub-unit vaccines rather than vaccines based on whole bacterial cells. Also, there are still several infectious organisms for which no effective vaccine has yet been produced.
Whilst empirical approaches to the selection of vaccine sub-units are still employed, the selection of candidate sub-units for testing is generally dependent on a significant body of background knowledge on the molecular interactions between pathogen and host. For many bacterial pathogens this information is not available. More recently, there has been an increased awareness that bioinformatic-based approaches can allow candidate protein sub-units to be selected in silico from bacterial genome sequences. These methods can be used to screen whole genomes for potential candidates far more rapidly than empirical approaches, so providing a more rapid advance towards preclinical studies with vaccines.
In general the ‘in silico’ approaches have relied on the assumption that candidate proteins will be located on the outer surface of, or exported from, the bacterium. Some workers have first identified ORFs which would encode proteins which possess a signal sequence directing export across the cytoplasmic membrane (Gomez M, et al. Infec. Immun. 2000 66: 2323-2327; Pizza M, et al, 2000). This dataset has then been screened to eliminate proteins which include transmembrane domains (Pizza et al., 2000; Gomez et al., 2000 supra.) and to include proteins which possess lipoprotein attachment sites (Gomez et al., 2000 supra; Chakravarti et al. Vaccine. 2000 19:601-612) or other motifs associated with surface anchoring (Pizza et al., 2000 supra.; Ross et al. Vaccine. 2001 19:4135-4142). Whilst these approaches have yielded novel sub-units, the predictive power of these approaches is limited both by limited knowledge of the export and protein processing pathways in different bacterial species and by limited knowledge of the molecular architecture of outer membrane proteins. In addition, it should be borne in mind that some vaccine antigens might not be located predominantly on the outer surface of the bacterium.
The genome sequences of many bacterial pathogens have now been determined or are due for completion in the next few years, and this has prompted significant work to investigate how these genome sequences can be interpreted to provide improved pretreatments or therapies for disease. Previous workers have considered the likely cellular location of vaccine antigens on the surface of the bacterium, and used algorithms which predict the cellular location to interrogate the predicted bacterial proteome for novel vaccine candidates.
Other previous methods for the prediction of vaccine candidates have included using algorithms to locate proteins with sequence similarity to known vaccines. However, such techniques would fail to predict new families of vaccine candidates. Yet further reported methods searched for tandem repeats at the 5′ end of a gene, since such repeats have been associated with some virulence genes (Hood DW, et al. Proc Natl Acad Science USA. 1996, 93:11121-11125). However, many virulence-associated genes lack such repeats and so would not be identified by this method.
Algorithms that search for signal sequences to identify secreted proteins have also been used by many workers to identify candidate vaccine antigens (Chakravarti et al., 2001 supra, Janulczyk R and Rasmussen M. Infect. Immun. 2001 69:4019-4026). However, such programs are unable to take into account the different methods used to export proteins and the different signal sequences possessed by different bacteria. Nor do such algorithms provide 100% accuracy when predicting the cellular locality of proteins, so possible candidates may be missed. As has been previously pointed out (Montgomery D L. Brief. Bioinform. 2000 1:289-296), protein antigens having no classic leader sequence would not be identified using this method. One such example is the vaccine antigen ESAT-6 from Mycobacterium tuberculosis, a known T-cell antigen (Sorensen A L, et al., Infect. Immun. 1995 63:1710-1717, Li Z, et al, Infect. Immun. 1999 67:4780-4786, Olsen A W, et al., Infect. Immun. 2001 69:2773-2778), which would be missed using this method.
The applicants have surprisingly found that certain properties of reported protein vaccine antigens are significantly different from a representative control protein dataset. This indicates that likely vaccine antigens can be identified by comparing those properties of known protein vaccine antigens with those of randomly selected but representative proteins in a control dataset.
The present invention provides a method for identifying a vaccine candidate, said method comprising selecting a protein from the proteome of a target organism on the basis of a property selected from a biophysical property or the amino acid composition of that protein.
In particular the method requires that an algorithm is constructed based upon a comparison of the above-mentioned property of a range of proteins known to have the desired protective immunogenic property (i.e. vaccine antigens) as compared to that property of a random selection of proteins.
The term “biophysical property”, as used herein, refers to a bulk property of the protein as a whole, such as molecular weight or isoelectric point (pI). It has also been found that amino acid composition can act as a basis of the selection, either by considering the properties of the individual amino acids within the sequence, such as hydrophobicity, bulkiness, flexibility and mutability, or, more particularly, by considering the simple amino acid makeup or composition itself.
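By way of illustration only, one of the bulk properties mentioned above, the predicted molecular weight, can be computed from the sequence alone. The sketch below assumes an unmodified linear chain of the 20 standard amino acids and uses average residue masses; the function name `molecular_weight` is hypothetical, and prediction of pI is not covered.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Predicted average molecular weight (Da) of an unmodified protein chain."""
    seq = sequence.upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER
```

Because the calculation uses the sequence as given, any signal peptide present in the sequence contributes to the predicted weight.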
Surprisingly, it has been found that there is a particularly good correlation between these properties and the ability of the protein to produce a protective immune response and therefore to have application as a vaccine. No such correlation between such basic properties and function or activity has previously been noted.
In particular the method comprises collecting a first set of data for a said property of one or more vaccine antigens of a particular genus, collecting a control set of data for said property of one or more random proteins from the same genus, comparing said data, examining the said property of proteins from the proteome of a target species, and selecting a vaccine candidate from that proteome which has a property more similar to that of the first set of data.
Suitably the first and control sets of data are each obtained from a plurality of proteins, which are themselves suitably obtained from a plurality of species of the selected genus.
The method may be applied to any genus of organism for which vaccines are required, for example bacteria (including mycoplasma), viruses and yeasts, but is preferably applied to bacteria, including both Gram-negative and Gram-positive bacteria.
A list of suitable bacteria from which the datasets are constructed is set out in Table 1 hereinafter. Preferably, the datasets are constructed using proteins from all of the bacterial species listed in Table 1.
In a particularly preferred embodiment, the datasets are interrogated or analysed on the basis of the percentage composition of individual amino acids.
This embodiment therefore comprises the steps of analysing the individual amino acid content of proteins from one or more species having a known vaccine effect, and comparing this with the individual amino acid content of a range of randomly selected proteins from said species.
A suitable comparison is carried out by first ascribing an amino acid score to each amino acid within the protein sequence using the equation:
When this analysis is applied to all proteins derived from all the species listed in Table 1 hereinafter, each amino acid has a score shown in Table 4 hereinafter. With this information, the sequence of proteins within a proteome of a target organism can be given a “total” score, based upon applying the appropriate figure. For vaccine use, it has been found that the protein preferably scores highly on this scale. Thus for example, proteins from said target organism which are in the highest 20% of scores, suitably in the top 10%, and more preferably in the top 3% may be selected as vaccine candidates.
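The ranking and selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the values in `EXAMPLE_SCORES` are placeholders and are not the scores of Table 4, the function names are hypothetical, and the score of a protein is taken as the sum of its residue scores divided by its length.

```python
from typing import Dict, List, Tuple

# Placeholder per-residue scores (NOT the values of Table 4).
# Residues absent from the table are treated as scoring 0.
EXAMPLE_SCORES: Dict[str, float] = {"K": 0.2, "N": 0.15, "T": 0.1, "L": -0.1, "R": -0.2}


def protein_score(sequence: str, scores: Dict[str, float]) -> float:
    """Sum the residue scores and divide by the number of residues."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(scores.get(aa, 0.0) for aa in sequence.upper()) / len(sequence)


def top_candidates(proteome: Dict[str, str], scores: Dict[str, float],
                   fraction: float = 0.10) -> List[Tuple[str, float]]:
    """Rank proteins by score, highest first, and keep the top fraction."""
    ranked = sorted(
        ((name, protein_score(seq, scores)) for name, seq in proteome.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

Setting `fraction` to 0.20, 0.10 or 0.03 corresponds to the highest 20%, 10% or 3% of scores mentioned above.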
If required, analysis using one or more different properties can be applied in order to select a vaccine candidate which “fits” the vaccine profile more closely. In all cases, the analysis is suitably effected in silico and may be carried out using software which is in the public domain, as illustrated below.
Once the vaccine candidate has been identified, it may then be obtained and tested to establish its suitability as a vaccine. For example, it may be isolated from the bacterial source, or synthesized, for example chemically using a peptide or protein synthesizer, or using recombinant DNA technology as is well known in the art. Thus a nucleotide sequence encoding the protein is incorporated into an expression vector including the necessary control elements such as a promoter, which is used to transform a host cell, which may be a prokaryotic or eukaryotic cell, but is preferably a prokaryotic host cell such as E. coli.
It may then be tested either in vitro, and/or in vivo for example in animal models and in clinical trials, to establish that it produces a protective immune response.
Vaccine candidates identified as described above form a further aspect of the invention.
In addition, vaccines which use these candidates or protective variants thereof or protective fragments of any of these, as active components, and which may include pharmaceutically acceptable carriers, as understood in the art, form a further aspect of the invention. Vaccines may be suitable for administration by various routes including oral, parenteral, inhalation, insufflation or intranasal routes, depending upon factors such as the nature of the active component and the type of formulation used. Active vaccine components may be used in the form of proteins or peptides, or nucleic acids which encode these may be used in such a way that they are expressed within the host animal. For example, they may be used to transform organisms such as viruses or gut colonizing organisms, which are then used as “live” vaccines, or they may be incorporated into plasmids in the form of so called “naked DNA” vaccines.
As used herein, the expression “variant” refers to sequences of amino acids which differ from the base sequence from which they are derived in that one or more amino acids within the sequence are substituted for other amino acids. Amino acid substitutions may be regarded as “conservative” where an amino acid is replaced with a different amino acid with broadly similar properties. Non-conservative substitutions are where amino acids are replaced with amino acids of a different type. Broadly speaking, fewer non-conservative substitutions will be possible without altering the biological activity of the polypeptide. Suitably variants will be at least 60% identical, preferably at least 75% identical, and more preferably at least 90% identical to the base sequence.
Identity in this instance can be judged for example using the algorithm of Lipman-Pearson, with Ktuple:2, gap penalty:4, Gap Length Penalty:12, standard PAM scoring matrix (Lipman, D. J. and Pearson, W. R., Rapid and Sensitive Protein Similarity Searches, Science, 1985, vol. 227, 1435-1441).
The term “fragment thereof” refers to any portion of the given amino acid sequence which has the same activity as the complete amino acid sequence. Fragments will suitably comprise at least 5 and preferably at least 10 consecutive amino acids from the basic sequence.
In a further aspect, the invention provides a computer-readable medium, which contains first and control datasets, for use in the method described above, and computer readable instructions for performing the method as described above.
Newly reported vaccine antigens could be added, to further refine the positive dataset.
As described in more detail below, using the method of the invention, the applicants found that both the pI and molecular weight of the proteins in the positive dataset showed a statistically significant difference from the control dataset. The two-peak pattern seen in the pI analysis occurs in all datasets tried. Bacteria are more likely to experience acidic or basic conditions in nature (and rarely encounter neutral conditions), which may account for the trough in the pI analysis at neutral conditions.
In addition, the analysis in accordance with the invention has revealed that the hydrophobicity, bulkiness, flexibility and mutability of vaccine antigens are significantly different from these properties of the control dataset. As most vaccine antigens previously described are surface exposed or secreted, they are more likely to be in contact with surrounding media. This might be reflected in their hydrophobicity and may therefore explain the differences seen between the two datasets using hydrophobicity as a scale. The difference in mutability could reflect the ability of pathogens to alter their antigenic presentation and thereby evade the host's immune system. Phenotypic variation in the relevant cell-surface proteins has been seen amongst clinical isolates of some species, suggesting that antigenic proteins can mutate and evolve during the period of infection (Peterson et al, 1995). This could also account for the differences seen in the comparisons of bulkiness and flexibility since the use of small, flexible residues on a protein surface may also reflect the need for mutation.
Using the vaccine antigen amino acid scoring scale described above, it has been found that vaccine antigens have a significant scoring similarity to outer membrane and secreted proteins. Since most vaccine antigens identified to date are known to be surface exposed or secreted, this is expected. This particular scoring algorithm was able to rank known antigens within the top 10% of proteins from the Streptococcus pneumoniae proteome.
Other bacterial proteomes have also been ranked using the scoring algorithm described herein, and the known vaccine antigens that are included in our positive dataset most frequently occur in the top 10% of scores (data not shown).
This study demonstrates the effective use of certain properties, in particular amino acid composition, as a tool for the prediction of vaccine candidates. The approach described here would be applicable to any pathogenic organism, and in particular bacteria, for which a proteome or a substantial part of the proteome is or becomes available. Since it does not rely on sequence similarity, motifs or sub-cellular location, it should identify vaccine candidates that other prediction tools may miss.
The method of the invention appears robust in that it allows potential vaccine candidates to be identified irrespective of the cellular location. It does not require that a specific sequence or motif is present in the protein. For instance, using a method of the invention based upon the amino acid composition, the ESAT-6 from Mycobacterium tuberculosis, the known T-cell antigen discussed above, was the 85th ranked protein in the entire predicted proteome of M. tuberculosis (i.e. in the top 3%, data not shown).
The invention will now be particularly described by way of example with reference to the accompanying tables and drawings in which:
Table 1 lists the data sources of proteins used to construct the vaccine antigen dataset. Vaccine antigen proteins were selected from the references indicated in the table.
Table 2 lists the data sources of proteins used to construct the control dataset. Proteins were selected from existing databases as shown in the table. (1 http://www.ncbi.nlm.nih.gov; 2 http://www.sanger.ac.uk; 3 http://www.tigr.org; 4 http://www.genomecorp.com; 5 http://genome.wisc.edu; 6 http://www.genome.ou.edu)
Table 3 is a summary of bacterial subcellular location protein database. Proteins were selected from the SWISSPROT annotated protein database from the species listed in the table. Proteins from each subcellular location were grouped to form subcellular location databases.
Table 4 shows amino acid composition of vaccine antigen and control databases, and the results of the application of an algorithm of a preferred embodiment of the invention to them. The mean percentage amino acid composition and standard deviation of the proteins within the vaccine antigen and control databases are listed. The probability (P) of the two databases sharing the same median has been calculated by the Wilcoxon Rank Sum test and is given to three decimal places. Values of P below 0.05 are significantly different and have been allocated a score as indicated in the methods.
Table 5 shows proteins of Streptococcus pneumoniae R6 scored by the vaccine antigen scale. The top 50 ranked proteins of Streptococcus pneumoniae as scored by the vaccine antigen scale are listed. Other known vaccine antigens of S. pneumoniae are also shown, along with their rankings and vaccine antigen scores. An asterisk (*) represents vaccine candidates as previously recognised by bioinformatic methods (Hoskins et al, 2001).
Table 6 shows P scores for comparisons of positive and control datasets with databases for various sub-cellular locations. The vaccine antigen scale was used to score proteins from either the positive or control datasets and compared to databases of proteins from various cellular locations. The probability (P) of the two databases sharing the same median has been calculated by the Wilcoxon Rank Sum test.
Construction of Vaccine Antigen Dataset
Vaccine antigens were identified by patent and open literature searches to derive a list of bacterial proteins which have been shown to induce a protective response when used as immunogens in an appropriate animal model of disease. To qualify for inclusion into the database the candidate, whole or part of the protein or corresponding DNA must have been shown to induce a protective response after immunisation using an appropriate animal model of infection, or to induce a protective response against the effects of a toxic component challenge. Those chosen were entered into a FASTA formatted database file.
In total, 72 vaccine antigens were identified (Table 1). These proteins originated from 32 bacterial species in 23 genera. Of the 72 antigens held within the vaccine antigen dataset, 26 originated from Gram-positive bacteria and 46 from Gram-negative bacteria (for the purposes of this study Mycobacteria were treated as Gram-positive bacteria).
The amino acid sequences of the vaccine antigens were obtained from publicly available sequence databases, primarily the NCBI database, which may be interrogated at http://www.ncbi.nlm.nih.gov. The vaccine antigen proteins identified for use in this study are shown in Table 1.
Construction of Control Dataset
In order to allow meaningful comparisons, a control database was constructed that mirrored the vaccine antigen dataset with respect to the proportion of entries from each genus. For the control dataset a single species which was considered to be representative of each genus included in the vaccine antigen dataset was selected. The species was also selected on the basis of availability of an entire predicted proteome or genome sequence. Then, for each entry in the vaccine antigen dataset, we randomly selected 35 proteins from the proteome of the corresponding species, for inclusion in the control dataset, using a routine written in PERL. In cases where a genome sequence was available but had not been annotated, the proteome was predicted using Glimmer (Delcher et al., 1999). In these cases the program fastablast.pl from TIGR (which may be found at http://www.tigr.org.uk) was adapted and used to produce a FASTA file of all the predicted protein sequences. Where no completed genome sequence was available for any member of the genus represented in the vaccine antigen dataset, all of the known proteins from the chosen species were downloaded from the publicly available protein sequence databases (NCBI). All proteome data was stored in FASTA format. The genus, species and data sources used to construct the control database are shown in Table 2.
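The random selection step described above was carried out with a PERL routine that is not reproduced here. A minimal sketch of an equivalent routine is given below; the function names, the fixed seed and the choice to sample without replacement across all entries of a genus are assumptions made for illustration.

```python
import random
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def sample_controls(proteome: Dict[str, str], n_antigens: int,
                    per_antigen: int = 35, seed: int = 0) -> List[str]:
    """Pick per_antigen control proteins for each vaccine antigen entry.

    Assumption: proteins are drawn without replacement, and the draw is
    capped at the size of the proteome.
    """
    rng = random.Random(seed)
    wanted = min(len(proteome), n_antigens * per_antigen)
    # Sorting first makes the draw reproducible for a given seed.
    return rng.sample(sorted(proteome), wanted)
```

With 72 vaccine antigen entries and 35 proteins per entry, such a routine would yield at most 2520 control proteins in total, provided each representative proteome is large enough.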
The control dataset was sized so that it was approximately equal to the number of proteins encoded by a typical bacterial genome. Annotated genome sequences contain protein sequences, inclusive of any signal peptides. Since the proteins in the control dataset were derived mainly from predicted proteomic and genomic data, they are inclusive of any signal sequences. To ensure that the positive database mirrored the control dataset, the sequences used were also inclusive of any signal sequences. The vaccine antigen and control datasets were used for all of the comparisons detailed below.
Programs were written in PERL to calculate the predicted molecular weight and predicted isoelectric point (pI) of each protein within the control and vaccine antigen databases. The results were ranked, grouped into histogram bins corresponding to increments of 15Da (
The two-peak distribution of pI values in both the control and positive datasets was also seen with all of the predicted proteomes analysed (including E. coli, M. tuberculosis, H. pylori, N. meningitidis and S. pneumoniae; data not shown). The mean values for each dataset were calculated, and to allow a comparison of the distribution of the data, the Wilcoxon Rank Sum test was applied. A comparison of positive and control datasets revealed that the distribution of molecular weight and pI values was significantly different (P=0.5×10−6 for molecular weight and P=0.002 for pI).
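For illustration, a two-sided Wilcoxon Rank Sum comparison of two sets of values can be sketched as below. This is not the program used in the study: it relies on the large-sample normal approximation, assigns average ranks to ties, and applies neither a tie correction nor a continuity correction to the variance. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided rank sum test; returns (z, p) from the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                    # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2        # expected rank sum under H0
    var_w = n1 * n2 * (n1 + n2 + 1) / 12   # variance under H0 (no tie correction)
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p
```

Here `x` and `y` would hold, for example, the predicted molecular weights of the proteins in the vaccine antigen and control datasets respectively.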
A PERL program was written to allow each protein in the control and vaccine antigen databases to be scored according to published scales. The amino acid compositions of the proteins in the vaccine antigen and control datasets were analysed using four different scales. The total amino acids which were present in these datasets were scored for hydrophobicity (Kyte & Doolittle, 1982), flexibility (Bhaskaran & Ponnuswamy, 1988), bulkiness (Zimmermann et al., 1968) or relative mutability (Dayhoff et al., 1978) according to previously reported scoring methodologies.
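Scoring against a published scale can be sketched as follows, here using the Kyte & Doolittle hydropathy values. The averaging of residue values over the protein length and the skipping of residues absent from the scale are assumptions made for this illustration, and the function name is hypothetical; the other three scales would be applied in the same way by substituting their tables.

```python
from typing import Dict

# Kyte & Doolittle (1982) hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_scale_score(sequence: str, scale: Dict[str, float]) -> float:
    """Average the scale value over all residues that the scale covers."""
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    if not values:
        raise ValueError("no scorable residues in sequence")
    return sum(values) / len(values)
```

A positive result indicates a protein that is hydrophobic on average under this scale, and a negative result one that is hydrophilic on average.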
The output from each of these analyses was again ranked, grouped into 25 equally distributed histogram bins and plotted as a percentage of the total database (
A PERL program was written to calculate the percentage amino acid composition of every protein within a FASTA formatted database. [Previous workers have described a program, ProtLock, that uses amino acid composition to predict five protein cellular locations using the Least Mahalanobis Distance Algorithm (Cedano et al, 1997). This method was compared to the one we have developed but was not found to give any better results (data not shown).]
A novel method for the prediction of bacterial protein vaccine antigens using amino acid composition to develop a new scoring algorithm was then tried.
This allowed the average amino acid composition of each database to be calculated, in addition to the standard deviation for each amino acid. Statistically significant differences in amino acid composition between the control and vaccine antigen databases were calculated by the Wilcoxon Rank Sum test. Amino acid composition and the significance of any differences between the two databases are shown in Table 4.
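The composition calculation described above can be sketched as below. The sketch assumes that percentages are taken over the 20 standard residues only and that the sample standard deviation is intended; the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def percent_composition(sequence: str) -> Dict[str, float]:
    """Percentage of each standard amino acid in one protein sequence."""
    seq = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not seq:
        raise ValueError("no standard residues in sequence")
    return {aa: 100.0 * seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def database_summary(sequences: List[str]) -> Dict[str, Tuple[float, float]]:
    """Mean and standard deviation of the percentage of each amino acid
    across all proteins of a database."""
    compositions = [percent_composition(seq) for seq in sequences]
    summary = {}
    for aa in AMINO_ACIDS:
        values = [comp[aa] for comp in compositions]
        # stdev needs at least two values; report 0.0 for a single protein.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[aa] = (mean(values), spread)
    return summary
```

The per-protein percentages for a given amino acid in the two databases are the values that would then be passed to the rank sum test.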
Development of Scoring Algorithms
A score table was produced for amino acids based on the amino acid composition of the control and vaccine antigen datasets. The amino acid composition of each database had been calculated as described above and statistically significant differences noted. Amino acids that showed a statistically significant difference in occurrence in the two databases were allocated a score. Each amino acid score was calculated using the mean database scores as follows:
Amino acids that showed an increased frequency in the vaccine antigen database when compared with the control database therefore received a positive score, while those depleted in the vaccine antigen database received a negative score. Those that showed no statistically significant difference between the two databases scored 0. The scores obtained by each amino acid are shown in Table 4.
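The sign behaviour described above can be illustrated with the sketch below. The scoring formula used here, a simple difference of the mean percentages, is only an assumed stand-in chosen because it gives positive scores for enriched residues, negative scores for depleted residues and zero where the difference is not significant; it should not be read as the equation used to produce Table 4. The function name and the `alpha` parameter are hypothetical.

```python
from typing import Dict


def score_table(vaccine_means: Dict[str, float],
                control_means: Dict[str, float],
                p_values: Dict[str, float],
                alpha: float = 0.05) -> Dict[str, float]:
    """Build a per-residue score table from two sets of mean percentages.

    ASSUMPTION: the score is the difference of the means. Residues whose
    P value is not below alpha score 0.
    """
    table = {}
    for aa in vaccine_means:
        if p_values[aa] < alpha:
            table[aa] = vaccine_means[aa] - control_means[aa]
        else:
            table[aa] = 0.0
    return table
```

The resulting dictionary has the form expected by a per-protein scoring routine that sums residue scores and divides by the protein length.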
This scoring table was then used to score individual proteins in the positive and control datasets. The mean score of a protein was calculated by adding up the scores for each amino acid in the protein and dividing by the number of amino acids in the protein. The proteins were ranked on this score and then the output was allocated into 25 equally distributed histogram bins (
The vaccine antigen scoring scale of Example 4 was used to score proteins from each of the sub-cellular databases described. The distributions of the scores obtained by these databases are shown in
It was hypothesised that the differences in amino acid composition of the vaccine antigen and control datasets might reflect the differences in the likely cellular locations of vaccine antigens. To investigate this possibility, the scoring algorithm described above was applied to groups of proteins with known cellular locations (cytoplasmic, inner membrane, periplasmic, outer membrane and secreted proteins).
The SWISSPROT annotated protein database (http://www.expasy.ch/sprot) was searched for proteins with a defined sub-cellular location from each of the bacterial species contained in the control dataset. Any entries where the sub-cellular location of the protein was listed as ‘putative’, ‘by similarity’ or ‘suggested’ were omitted from the databases. Separate databases were constructed for each sub-cellular location, producing cytoplasmic, inner membrane, periplasmic, outer membrane and exported protein databases. Gram-positive membrane proteins were included in the inner membrane database.
The resulting sub-cellular location databases and the number of proteins per species are listed in Table 3.
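The filtering and grouping of annotated entries can be sketched as follows. The input format, a list of (protein name, location annotation) pairs, is a simplification assumed for this illustration and is not the SWISSPROT record format; the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Qualifiers that mark a sub-cellular location as uncertain.
UNCERTAIN_FLAGS = ("putative", "by similarity", "suggested")


def build_location_databases(
        entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group protein names by location, omitting uncertain annotations."""
    databases: Dict[str, List[str]] = {}
    for name, annotation in entries:
        lowered = annotation.lower().strip()
        if any(flag in lowered for flag in UNCERTAIN_FLAGS):
            continue  # location is not firmly established, so skip the entry
        databases.setdefault(lowered, []).append(name)
    return databases
```

Each resulting group can then be scored with the vaccine antigen scale and compared with the vaccine antigen and control datasets.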
Each dataset of different sub-cellular location was compared with both the vaccine antigen and control databases. Since most currently known vaccine antigens are either surface expressed or excreted proteins, it was expected that this analysis would reveal a similarity between the positive dataset and the databases of both the outer membrane and secreted proteins. The P scores of 0.38 and 0.30 (outer membrane and secreted proteins) confirmed this (
To evaluate whether the algorithm of Example 4 could be used to screen an entire predicted proteome for vaccine antigens, the proteome of Streptococcus pneumoniae was analysed. When the algorithm was applied to this predicted proteome, the surface protein A (PspA), a known protective antigen (Briles et al, 2000), was identified as the 11th ranked protein. Other known S. pneumoniae protective antigens were found ranked within the top 190 proteins, which puts them in the top 10% of the scores (Table 5). Of the 5 proteins identified by Wizemann et al. (2001) and found to give a protective immune response in a mouse model, all but one were also found in the top 10% of proteins ranked by our scoring algorithm. Of the five, a conserved hypothetical protein with a signal peptidase II cleavage site motif identified by Wizemann et al. (SP101) had the worst ranking at 347 (Table 5).
Denis-Mize K. S, et al. FEMS Immunology and Medical Microbiology. 2000 27:147-154.
Zhang Y., et al. Infect. Immun. 2001 69:6828-3836.
Bacillus
Bordetella
Borrelia
Brucella melitensis
Campylobacter
Chlamydia
Clostridium
Corynebacterium
Escherichia
Haemophilus
Helicobacter
Legionella pneumophila
Listeria monocytogenes
Neisseria meningitidis
Pasteurella
Pseudomonas
Rickettsia
Shigella
Staphylococcus
Streptococcus
Treponema
Yersinia
Borrelia burgdorferi
Bacillus subtilis
Bordetella pertussis
Campylobacter jejuni
Chlamydia pneumoniae
Escherichia coli
Haemophilus influenzae
Helicobacter pylori
Staphylococcus aureus
Neisseria meningitidis
Pasteurella multocida
Pseudomonas aeruginosa
Rickettsia prowazekii
Streptococcus pyogenes
Treponema pallidum
Vibrio cholerae
Yersinia pestis
S. pneumoniae Protein
Number | Date | Country | Kind |
---|---|---|---|
0204387.5 | Feb 2002 | GB | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/GB03/00796 | 2/25/2003 | WO | | 5/12/2005 |