The present invention relates to a method for identifying and/or characterizing a (poly)peptide comprising: (a) analyzing a peptide map of said (poly)peptide, comprising at least 1 peptide, and its peptide primary structure fingerprint by mass spectrometry; and (b) comparing data obtained in step (a) with a reference (poly)peptide database, said database comprising mass spectrometric data of peptide maps, comprising at least 1 peptide, and of its peptide primary structure fingerprint, of a (poly)peptide or of a variety of (poly)peptides.
With the human genome project well underway and the deadline for completion approaching, the challenges of understanding the function of newly discovered genes have to be addressed. Initial attempts at sequencing the large and complex human genome were intentionally focused on expressed regions, as represented by cDNA repertoires. Estimates of the total gene number in the human genome vary from 60,000 to over 140,000 (Nature, 401:311, news section, 1999). While the majority of human genes are now represented as expressed sequence tags (ESTs) in the dbEST database, only a tiny minority have yet been assigned a function. For example, in the Oct. 22, 1999 release, the number of entries for human was 1,617,045 (hftp://www.ncbi.nlm.nih.gov/dbEST/index.html) (Wolfsberg and Landsman, 1997), corresponding to 85,713 clusters in the UniGene set (www.ncbi.nlm.gov/UninGene/Hs.stats.shtml), of which only 9,274 contained known genes. The most straightforward solution to this structure-function discrepancy seems to be the direct correlation between the functional status of a tissue and the expression of certain sets of genes.
However, although the primary amino acid sequences of proteins are encoded by genes, the relationship between genes and proteins is profoundly non-linear. The control and signaling pathways executing the functions of cells are robust and irregular. Cellular activity is transacted through a vast array of signaling, regulatory, and metabolic pathways, each embodied in the functional and structural relationship of many specific molecules. This makes it difficult to predict protein dynamics or structure using genetics. Also, gene-protein dynamics are non-linear as there is no reliable correlation between gene activity and protein abundance (Anderson and Seilhammer, 1997). Structurally, the existence of alternative splice variants of mRNA complicates the relationship between genes and proteins. Many proteins undergo post-translational modifications that are critical to their function but are not encoded in the protein's corresponding DNA. Furthermore, a protein may be processed in different ways under different conditions, which seems to be of critical importance, for example, in Alzheimer's disease (Masters and Beyreuther, 1998). Another example is the cystic fibrosis transmembrane conductance regulator (CFTR), which is involved in cystic fibrosis. This disease is caused by a mutation in a single gene but has a complex pathogenesis: CFTR functions as a chloride channel but has additional, possibly pathological, roles in the regulation of outer membrane conductance pathways. Additionally, the expression of CFTR is highly variable within the lungs, depending on cell type and anatomical location. Such complex functions of a single-gene defect complicate the determination of the role of CFTR in cystic fibrosis and the identification of appropriate cellular targets for therapy (Jiang and Engelhardt, 1998). The overwhelming majority of human diseases are vastly more complex than cystic fibrosis, involving large numbers of genes and environmental factors.
Thus, a full understanding of the expression profile of a tissue or organism on the genomic and proteomic levels requires the screening of many samples in parallel, as rapidly as possible.
Accordingly, the technical problem underlying the present invention was to provide a method that allows the identification and/or characterization of proteins on a large scale, in a short time, with high throughput and at low cost.
The solution to this technical problem is achieved by providing the embodiments characterized in the claims.
Accordingly, the present invention relates to a method for identifying and/or characterizing a (poly)peptide comprising:
The term “(poly)peptide” as used in accordance with the present invention refers both to peptides and to (poly)peptides, naturally occurring or recombinantly, chemically or by other means produced or modified, which may assume the three dimensional structure of proteins that may be post-translationally processed, optionally in essentially the same way as native proteins. Furthermore, this term encompasses (poly)peptides or proteins having a length of about 50 to several hundreds of amino acids as well as peptides having a length of about 1, 2, 3, 4 and preferably 5 to 50 amino acids. In a further preferred embodiment, said peptide has a length of 6 amino acids. Said (poly)peptide and its map, respectively, in other embodiments comprise 2, 3, 4, 5, 6, up to 10, or more peptides.
The term “peptide map” as used in accordance with the present invention denotes a set of peptides that is obtained by fragmentation of a given (poly)peptide and, thus, specific for said (poly)peptide. Fragmentation may be effected e.g., by enzymatic digestion of the (poly)peptide, e.g., with trypsin, according to conventional techniques. In specific embodiments, only data from one peptide of a (poly)peptide is contained in said database. In further embodiments, the database comprises data from a variety of peptides wherein each peptide is derived from a different (poly)peptide. It is preferred, however, that said database comprises mass spectrometric data of peptide maps comprising more than one peptide such as 2, 3, 4, 5, 6, 7, 8, 9, 10 or more peptides of a variety of (poly)peptides (see
The term “peptide primary structure fingerprint” as used in accordance with the present invention denotes the peptide fragmentation pattern as generated by mass spectrometry.
A “variety” of (poly)peptides denotes a number of at least 2 or 3, preferably at least 5 to 50, more preferably at least 50 to 1,000, even more preferred at least 1,000 to 10,000, and most preferred more than 10,000 (poly)peptides.
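To make the terms defined above concrete, the following minimal Python sketch represents the mass spectrometric data of one (poly)peptide as a peptide map (a list of peptide masses) together with one fragmentation pattern per peptide, and compares such a record with a small reference database. The names `MassRecord`, `count_matches` and `identify`, the absolute mass tolerance, the minimum number of matching peptides and the scoring rule are illustrative assumptions, not features taken from this specification; all masses are invented.

```python
from dataclasses import dataclass, field


@dataclass
class MassRecord:
    """Mass spectrometric data for one (poly)peptide (hypothetical layout)."""
    name: str
    peptide_map: list[float]  # measured peptide masses
    # maps a peptide mass to the fragment ion masses measured for that peptide
    fingerprints: dict[float, list[float]] = field(default_factory=dict)


def count_matches(query: list[float], reference: list[float], tol: float) -> int:
    """Count query masses lying within +/- tol of any reference mass."""
    return sum(any(abs(q - r) <= tol for r in reference) for q in query)


def identify(query: MassRecord, database: list[MassRecord],
             tol: float = 0.2, min_peptides: int = 2):
    """Return the name of the best matching entry, or None if nothing matches.

    An entry scores one point per matching peptide mass and one further
    point per matching fragment ion of a matched peptide (assumed scoring).
    """
    best_name, best_score = None, 0
    for entry in database:
        peptide_hits = count_matches(query.peptide_map, entry.peptide_map, tol)
        if peptide_hits < min_peptides:
            continue
        score = peptide_hits
        for q_mass, q_frags in query.fingerprints.items():
            for r_mass, r_frags in entry.fingerprints.items():
                if abs(q_mass - r_mass) <= tol:
                    score += count_matches(q_frags, r_frags, tol)
        if score > best_score:
            best_name, best_score = entry.name, score
    return best_name


# Invented reference entries: both share two peptide masses, but the
# fragmentation patterns of the shared peptide differ.
database = [
    MassRecord("protein A", [842.5, 1045.6, 1479.8],
               {1045.6: [175.1, 288.2, 401.3]}),
    MassRecord("protein B", [842.5, 1045.6, 2211.1],
               {1045.6: [147.1, 260.2, 389.2]}),
]
sample = MassRecord("unknown spot", [842.4, 1045.7, 1479.9],
                    {1045.7: [175.1, 288.3, 401.2]})
novel = MassRecord("unmatched spot", [500.0, 600.0])

print(identify(sample, database))  # protein A
print(identify(novel, database))   # None
```

In this sketch a result of `None` corresponds to the case in which no matching data are present in the reference database, so that the analyzed (poly)peptide would be a candidate for a new database entry.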
The method of the present invention advantageously combines data obtained by mass spectrometric analyses of a peptide map, comprising at least 1 peptide, and of its peptide primary structure fingerprint, where “peptide primary structure fingerprint” as used in accordance with the present invention denotes the peptide fragmentation pattern generated by mass spectrometry. Compared to protein identification by mass spectrometric peptide maps, the inclusion of peptide primary structure fingerprints of the peptides of the peptide map strongly improves protein identification in sequence databases and enables unambiguous identification of (poly)peptides (see
The set of structural information obtained by the method of the present invention for each (poly)peptide, in the following also designated as “minimal protein identifier” (MPI), (see
Moreover, MPIs may be electronically stored, thus allowing computer based comparison of different MPIs. This further improves speed and accuracy, reduces costs, and consequently allows high-throughput identification and/or characterization of (poly)peptides (see
A further advantage of the method of the present invention is that it allows identification and/or characterization of a (poly)peptide without knowing its amino acid sequence and/or further structural features (such as identifying spots from 2D gels, as seen in
It is envisaged in accordance with the present invention that for the identification and/or characterization of a (poly)peptide not necessarily all data obtained in step (a) is compared with the reference (poly)peptide database. Accordingly, for unambiguous identification and/or characterization comparison of the data obtained by the analysis of the peptide map and/or one peptide primary structure fingerprint with the reference (poly)peptide database may be sufficient. Alternatively, comparing the data obtained by analyses of the peptide map and, e.g., in a most preferred embodiment, at least 6-8, preferably 10 or more peptide primary structure fingerprints with the reference (poly)peptide database may result in the finding that no identical mass spectrometric data are present in the reference (poly)peptide database. This would identify the analyzed (poly)peptide as a new entry into the database. Accordingly, such a situation is also encompassed by the term “identifying” as used in accordance with the present invention (see
In a preferred embodiment of the present invention, the data obtained in step (a) are recorded as lists of digit numbers corresponding to measured molecular or fragment ion masses or mass/charge (m/z) ratios (see
In another preferred embodiment, said reference (poly)peptide database in step (b) is produced by the steps of:
Preferably, the above recited organism is an animal, more preferably a mammal and most preferably a human.
The term “specific time point” refers to time points after a tissue, a cell, a non-human organism, including a plant, microorganism etc., an organelle, a tissue culture cell line, a protein complex or interacting proteins, an antibody, an antibody library, a bacteriophage, a virus etc. (of a specific developmental stage, disease stage, sex, age etc.) has been contacted, incubated or treated with a ligand, drug, compound etc., such as described above. Preferably, said tissue etc. is compared to a second sample of said tissue etc. not so contacted or treated.
This embodiment of the present invention advantageously not only allows the simultaneous identification and/or characterization of a large number of different (poly)peptides due to the high resolution of the employed two-dimensional gel-electrophoresis (2-DE) but also the assignment of functional parameters to the analyzed (poly)peptide. Accordingly, it is envisaged in accordance with the present invention that 2-DE patterns obtained from, e.g., different species, tissues, developmental stages, cells or organelles, sexes and disease states are compared and subtracted with respect to the presence/absence of protein spots on the different 2-DE patterns, and with respect to different quantitative levels of a (poly)peptide. Evaluation of 2-DE patterns may be performed by laser scanning followed by software-assisted spot recognition and characterization. For presence/absence analysis of protein patterns, highly sensitive silver-staining procedures may be used. For quantification purposes, Coomassie blue or fluorescent stains, well known in the art, may be employed. This embodiment of the present invention further allows the detection of post-translational modifications, and the person skilled in the art is well aware of, e.g., glycostaining or phosphostaining procedures.
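The comparison and subtraction of 2-DE patterns described above can be sketched as follows. The sketch assumes that spot detection and spot matching between gels have already been carried out, so that each pattern is simply a mapping from a spot identifier to a positive measured intensity. The function name `compare_patterns`, the spot identifiers, the intensities and the two-fold threshold are hypothetical.

```python
def compare_patterns(gel_a: dict[str, float], gel_b: dict[str, float],
                     fold: float = 2.0):
    """Compare two 2-DE spot patterns given as {spot id: intensity}.

    Returns the spots present only in A, the spots present only in B, and
    the shared spots whose intensity ratio B/A changes at least `fold`-fold.
    Intensities are assumed to be strictly positive.
    """
    only_a = sorted(set(gel_a) - set(gel_b))
    only_b = sorted(set(gel_b) - set(gel_a))
    changed = {}
    for spot in sorted(set(gel_a) & set(gel_b)):
        ratio = gel_b[spot] / gel_a[spot]
        if ratio >= fold or ratio <= 1.0 / fold:
            changed[spot] = ratio
    return only_a, only_b, changed


# Invented example patterns from two samples
healthy = {"s1": 100.0, "s2": 40.0, "s3": 10.0}
diseased = {"s1": 110.0, "s2": 10.0, "s4": 55.0}

only_healthy, only_diseased, changed = compare_patterns(healthy, diseased)
print(only_healthy)   # ['s3']
print(only_diseased)  # ['s4']
print(changed)        # {'s2': 0.25}
```

The first two results correspond to the presence/absence analysis, the third to the comparison of quantitative levels.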
Thus, the method of the present invention allows for the identification and/or characterization of a (poly)peptide if the corresponding MPI matches with a MPI present in the database and, e.g., containing further information with regard to the source of the corresponding (poly)peptide (see
Additionally, due to the MPIs, known as well as unknown individual (poly)peptides may be characterized in a certain population of (poly)peptides and, furthermore, unambiguously identified within and across two or more populations of (poly)peptides (see
Another advantage of the method of the present invention is that due to the MPIs a two dimensional (2-D) reference standard pattern can be provided that allows simple and fast comparison of 2-D gels from different laboratories, of different gel formats, independently of the gel resolution and/or applied separation technology, from different patients, tissues, etc. (see above). Once a 2-D reference standard pattern has been established by mass spectrometric analysis of a representative number of spots, preferably of at least 100 spots, more preferred of at least 5,000 spots, most preferred of all discernible spots on the gel, and storage of the corresponding MPIs in a database, in combination with their coordinates of molecular weight and pH in the spot pattern, analysis of only a small number of reference spots (e.g. 20 spots) of, e.g., two gels that are to be compared and allocation to the corresponding spots on the reference standard pattern allows standardization and, thus, comparison of the two gels. This considerably improves the speed of the identification and/or characterization of multiple protein spots by comparison of two different 2-D gels (see
The advantages of this method are that the MPI can be used to compare different 2-D gels, as well that the spots, which are differentially present in different 2-D gels (see
In an additionally preferred embodiment of the method of the present invention, said reference (poly)peptide database in step (b) is produced by the steps of:
In a further preferred embodiment of the method of the present invention, said reference (poly)peptide database in step (b) is produced by the steps of:
The term “cDNA or genomic library” refers to libraries consisting of complementary DNA or genomic DNA molecules. These cDNA or genomic DNA molecules, referred to throughout this specification, may be full length or non-full length. It is preferred that they are full length. If not full length, said fragments preferably encode a protein domain or an epitope.
This embodiment is particularly useful for applications where it is desired or necessary to have direct access to the genetic information encoding the (poly)peptide the MPI of which has been found in the database. For example, if the MPI of an unknown (poly)peptide is compared with the MPIs of the database, the identification of a MPI in the database matching with the MPI of the (poly)peptide to be analyzed thus does not only provide information with regard to certain functions of the (poly)peptide but also makes immediately available the corresponding genetic information. Thus, only clones of interest need to be sequenced (see
This embodiment also contributes to the speed and convenience of the method of the present invention in another aspect. In the prior art, in order to identify and/or obtain the nucleic acid encoding a (poly)peptide that has been analyzed by mass spectrometry, DNA sequences in the database were computer-translated into amino acid sequences in all possible reading-frames and, e.g., trypsin digestion products of these amino acid sequences computer-generated. The molecular masses of these digestion products were then theoretically calculated and compared with the experimentally obtained mass spectrometric data. Thus, identification of a desired nucleic acid molecule was not only time-consuming and cumbersome but also prone to the identification of false-positive sequences because theoretically and experimentally obtained data were compared to each other. Alternatively or additionally, for the same reason, correct sequences could be missed.
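For illustration, the prior-art computation described above can be sketched in Python: a DNA sequence is translated in all six reading frames, each translation is cleaved in silico after lysine (K) and arginine (R), and a theoretical monoisotopic mass is calculated for every resulting peptide. The function names, the example sequence and the simplified cleavage rule (no proline exception, no missed cleavages) are assumptions made for this sketch; the codon table is the standard genetic code and the residue masses are standard monoisotopic values.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Monoisotopic residue masses in Da
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna: str) -> list[str]:
    """Translate both strands in reading frames 0, 1 and 2 ('*' = stop)."""
    proteins = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            proteins.append("".join(CODON_TABLE[c] for c in codons))
    return proteins


def tryptic_peptides(protein: str) -> list[str]:
    """Split at stop codons, then cleave after every K or R."""
    peptides = []
    for segment in protein.split("*"):
        current = ""
        for aa in segment:
            current += aa
            if aa in "KR":
                peptides.append(current)
                current = ""
        if current:
            peptides.append(current)
    return peptides


def peptide_mass(peptide: str) -> float:
    """Theoretical monoisotopic mass of the neutral peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


dna = "ATGGCTAAAGGTCGTTGGTAA"  # invented example sequence
for frame, protein in enumerate(six_frame_translations(dna)):
    for peptide in tryptic_peptides(protein):
        print(frame, peptide, round(peptide_mass(peptide), 4))
```

Even for this short sequence, six translations and several candidate peptides per translation have to be generated before any comparison with measured masses can take place, which illustrates why the approach was cumbersome for large sequence databases.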
In yet another preferred embodiment of the method of the present invention, said reference (poly)peptide database is generated from (poly)peptides isolated from their natural context.
This advantageously allows for the generation of MPIs inter alia taking into account, e.g., post-translational modifications or specifically processed forms of a (poly)peptide that may not occur when, e.g., a eukaryotic (poly)peptide is recombinantly produced in a prokaryotic host.
However, it is also envisaged in accordance with the present invention that the database also comprises entries comprising structural and functional information of recombinantly produced (poly)peptides, where their corresponding DNA sequences may or may not be known.
The (poly)peptides may be native or denatured.
In a still further preferred embodiment, said (poly)peptide to be identified and/or characterized is a recombinantly produced (poly)peptide.
Methods for the recombinant production of (poly)peptides are well known in the art and include, e.g., production of the (poly)peptide in prokaryotic or eukaryotic hosts. However, the (poly)peptide may also be produced by well known in vitro transcription and translation methods.
In a more preferred embodiment, said recombinantly produced (poly)peptide is comprised in a (poly)peptide library, said library being prepared by expressing a library of nucleic acid molecules comprising a nucleic acid molecule encoding said (poly)peptide.
Vectors that may be used in accordance with the present invention comprise, e.g., plasmids, cosmids, viruses and bacteriophages used conventionally in genetic engineering. Expression vectors derived from viruses such as retroviruses, vaccinia virus, adeno-associated virus, herpes viruses, or bovine papilloma virus, may be used for delivery of the nucleic acid molecule of the invention into targeted cell populations. Methods which are well known to those skilled in the art can be used to construct recombinant viral vectors; see, for example, the techniques described in Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory (1989) N.Y. and Ausubel et al., Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y. (1989). The vector comprising the nucleic acid molecule of the invention can be transferred into the host cell by well-known methods, which vary depending on the type of cellular host. For example, calcium chloride transfection is commonly utilized for prokaryotic cells, whereas, e.g., calcium phosphate or DEAE-Dextran mediated transfection or electroporation may be used for other cellular hosts; see Sambrook, supra.
Such vectors may comprise further genes such as marker genes which allow for the selection of said vector in a suitable host cell and under suitable conditions.
Expression vectors further comprise expression control sequences allowing expression in prokaryotic or eukaryotic cells. Expression of said nucleic acid molecule comprises transcription of the nucleic acid molecule into a translatable mRNA. Regulatory elements ensuring expression in eukaryotic cells, preferably mammalian cells, are well known to those skilled in the art. They usually comprise regulatory sequences ensuring initiation of transcription and, optionally, a poly-A signal ensuring termination of transcription and stabilization of the transcript, and/or an intron further enhancing expression of said polynucleotide. Additional regulatory elements may include transcriptional as well as translational enhancers, and/or naturally-associated or heterologous promoter regions. Possible regulatory elements permitting expression in prokaryotic host cells comprise, e.g., the PL, lac, trp or tac promoter in E. coli, and examples for regulatory elements permitting expression in eukaryotic host cells are the AOX1 or GAL1 promoter in yeast or the CMV-, SV40-, RSV-promoter (Rous sarcoma virus), CMV-enhancer, SV40-enhancer or a globin intron in mammalian and other animal cells. Beside elements which are responsible for the initiation of transcription such regulatory elements may also comprise transcription termination signals, such as the SV40-poly-A site or the tk-poly-A site, downstream of the nucleic acid molecule. Furthermore, depending on the expression system used leader sequences capable of directing the polypeptide to a cellular compartment or secreting it into the medium may be added to the coding sequence of the nucleic acid molecule of the invention and are well known in the art. The leader sequence(s) is (are) assembled in appropriate phase with translation, initiation and termination sequences, and preferably, a leader sequence capable of directing secretion of translated protein, or a portion thereof, into the periplasmic space or extracellular medium. 
Optionally, the heterologous sequence can encode a fusion protein including a C- or N-terminal identification peptide imparting desired characteristics, e.g., stabilization or simplified purification of expressed recombinant product. In this context, suitable expression vectors are known in the art such as Okayama-Berg cDNA expression vector pcDV1 (Pharmacia), pCDM8, pRc/CMV, pcDNA1, pcDNA3 (Invitrogen), pSPORT1 (GIBCO BRL), pCI (Promega), or pQE30 (Qiagen).
In an additionally preferred embodiment of the method of the present invention, said (poly)peptide to be identified and/or characterized is part of a protein complex. Here, a protein complex is isolated and the protein or proteins which form the complex are identified using their MPIs. Such complexes can also be run on 1D or 2D gels, and the spots isolated and identified.
In yet another preferred embodiment of the method of the present invention, said (poly)peptide to be identified and/or characterized interacts with another (poly)peptide.
The term “another (poly)peptide” includes antibodies specifically recognizing said (poly)peptide or fragments or derivatives thereof having the same specificity. The term “fragment” of an antibody is well understood in the art (see e.g. Harlow and Lane “Antibodies, A Laboratory Manual”, CSH Press, Cold Spring Harbor, USA, 1988) and includes Fab and F(ab′)2 fragments. The term “derivative” is equally well understood and includes scFv fragments. Phage displaying antibodies may also be used, and are well known in the art.
In a further preferred embodiment, said (poly)peptide to be identified and/or characterized is present in a lysate or a whole cell extract. Here, (poly)peptides may be isolated which may be difficult to separate on 2D gels, or may be difficult to express recombinantly. Examples of such (poly)peptides may include membrane-bound proteins, trans-membrane proteins and receptors, as well as proteins which are toxic to the expression host if a recombinant expression system is used.
In a still further preferred embodiment, said mass spectrometric method is MALDI-MS, MALDI-MS/MS, electrospray ionization (ESI), Q-TOF or post-source decay (PSD).
In a particularly preferred embodiment, said library of nucleic acid molecules encode the (poly)peptides as fusion proteins.
In a still further more preferred embodiment said fusion proteins comprise a tag.
Advantageously, tags allow for the convenient isolation, purification, detection and localization for re-arraying purposes of the produced (poly)peptides.
In a most preferred embodiment said tag is a His-tag.
However, other tags like, e.g., c-myc, FLAG, alkaline phosphatase, EpiTag™, V5 tag, T7 tag, Xpress™ tag, Strep-tag, a fusion protein, preferably GST, cellulose binding domain, green fluorescent protein, maltose binding protein or lacZ may also be useful in performing the method of the present invention.
In another particularly preferred embodiment of the method of the present invention, expression is inducible.
In yet another more preferred embodiment of the method of the present invention, said nucleic acid molecule is cDNA. This embodiment also includes nucleic acid molecules that constitute a fragment or a full length cDNA molecule.
However, it is also envisaged that said nucleic acid molecule is genomic DNA. This embodiment also includes nucleic acid molecules that constitute a fragment or a full length genomic DNA molecule.
In another preferred embodiment of the method of the present invention, said analysis in step (a) is, in addition to or alternatively to mass spectrometry, effected by surface plasmon resonance, as is well known in the art. Such procedures can be performed using BIAcore systems. This has the advantage of allowing the determination of interactions, affinity measurements, dissociation and association measurements, as well as the identification and characterization of the interacting partners.
In a still further particularly preferred embodiment, prior to expression of said library of nucleic acid molecules, the following steps are carried out:
It is particularly preferred that the nucleic acid molecules are full length.
In this embodiment arrays, preferably microarrays, are provided comprising an optionally non-redundant set of genomic DNA or cDNA clones (in the following also designated as “UNIgene set” or “UNIclone set”) representing a set of mRNAs expressed in a specific species, tissue, developmental stage, cell, organelle, sex, disease state, microorganism, tissue culture cell line, virus, bacteriophage, organism, or plant etc. (see above).
The oligonucleotides may be hybridized sequentially to the array of nucleic acid molecules or as a mixture of oligonucleotides. In the latter case, each species of oligonucleotide is labeled with a specific label. This method also referred to as oligonucleotide fingerprinting is known in the art (Meier-Ewert et al., 1998; Radelof et al., 1998; Poustka et al., 1999; Herwig et al., 1999). Furthermore, the person skilled in the art is well aware of various nucleic acid labels (see, e.g., WO 99/29897 and WO 99/29898).
Regularly arraying said amplified nucleic acid molecules may be effected, e.g., by needle or pin spotting, where liquid containing the nucleic acid molecules is delivered through adhesion to stainless steel pins. Alternatively, piezo-ink-jet technology may be utilized, where cDNAs, for example, are transferred without touching the surface. Advantageously, a multi-head piezo-jet micro-arraying system is used, which permits the construction of large micro-arrays on a variety of surfaces with a spot density of more than 2000 clones/cm2. This methodology is combined with high resolution detection systems, based on laser scanning. As a further alternative to conventional needle spotting, a drop-on-demand technology may be employed. This technology reduces the dimensions of the hybridization arrays by one or two orders of magnitude. The genetic samples are pipetted with a multi-channel micro-dispensing robot, which works on a similar principle to an ink jet printer. Integrated image analysis routines decide whether a suitable drop is generated. If the drop is poorly formed, the nozzle tip is cleaned automatically. A second integrated camera defines positions for automated dispensing, e.g. filling of cavities in silicon wafers. Each head is capable of dispensing single or multiple drops with a volume of 1000 pl. The dispensers may contain an internal magnetic bead-based purification system. This allows concentration and purification of spotting probes prior to dispensing. The resulting spot size depends on the surface onto which the liquid is dispensed and varies between 100 μm and 120 μm in diameter. The density of the arrays can be increased to 3,000 spots/cm2. The micro-dispensing system has the ability to dispense on-the-fly and takes less than three minutes to dispense 100×100 spots, in a square, with a diameter of 100 μm and a distance of 230 μm between the centers of adjacent spots.
At this density, it is possible to immobilize a small cDNA library consisting of 14,000 clones, on a microscope-slide surface. This advantageously offers a higher degree of automation since glass-slides are more rigid and easier to handle than membranes.
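The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below derives the spot density implied by a square grid with a given center-to-center distance, the area needed for 14,000 clones at 3,000 spots/cm2, and the dispensing rate implied by 100×100 spots in three minutes. The slide dimensions of 25 mm × 75 mm are an assumption (a common microscope-slide format), not a value stated in this specification, and the function names are hypothetical.

```python
def spots_per_cm2(pitch_um: float) -> float:
    """Spot density of a square grid with the given center-to-center pitch."""
    spots_per_cm = 10_000.0 / pitch_um  # 1 cm = 10,000 micrometers
    return spots_per_cm ** 2


def area_needed_cm2(n_spots: int, density_per_cm2: float) -> float:
    """Array area required for n_spots at the given density."""
    return n_spots / density_per_cm2


density_230 = spots_per_cm2(230.0)             # grid with 230 um pitch
area_14000 = area_needed_cm2(14_000, 3_000.0)  # 14,000 clones at 3,000/cm2
slide_area = 2.5 * 7.5                         # ASSUMED 25 mm x 75 mm slide, cm2
rate = (100 * 100) / (3 * 60)                  # spots per second at 3 minutes

print(round(density_230))    # 1890
print(round(area_14000, 2))  # 4.67
print(slide_area)            # 18.75
print(round(rate, 1))        # 55.6
```

Under these assumptions, 14,000 clones at 3,000 spots/cm2 occupy roughly a quarter of the assumed slide area, and dispensing 100×100 spots in under three minutes corresponds to more than about 55 spots per second.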
The array so produced is then hybridized under stringent conditions with 9-mer oligonucleotides at a temperature between 37 degrees centigrade and 42 degrees centigrade, depending on the GC content, preferably 39 degrees centigrade, and the positive signals are detected, quantified and stored using image-analysis software. This step is repeated until data from several hybridizations have been collected. By combining all these data, an oligofingerprint consisting of the list of probes which hybridize to the nucleic acid molecule may be constructed for each clone. Since the hybridizations are conducted under stringent conditions, these fingerprints are a property of the clones' DNA sequences and, therefore, whenever two clones have similar or identical fingerprints they must have similar or identical sequences and can be clustered together on this basis. Each cluster represents a different gene and has an average, or consensus, fingerprint characteristic of that gene.
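A minimal sketch of this clustering step is given below. Each clone's oligofingerprint is represented as the set of probes that gave a positive signal, clones are grouped when their fingerprints are sufficiently similar, and a consensus fingerprint is derived for each cluster. The Jaccard similarity measure, the threshold of 0.75, the single-linkage grouping, the majority rule for the consensus and all clone and probe names are assumptions chosen for illustration; no particular clustering algorithm is prescribed here.

```python
from collections import Counter


def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity of two fingerprints: shared probes / all probes."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_clones(fingerprints: dict[str, set[str]],
                   threshold: float = 0.75) -> list[list[str]]:
    """Group clones linked by a fingerprint similarity >= threshold."""
    clusters: list[list[str]] = []
    for clone in sorted(fingerprints):
        # existing clusters containing at least one sufficiently similar clone
        linked = [c for c in clusters
                  if any(jaccard(fingerprints[clone], fingerprints[m]) >= threshold
                         for m in c)]
        merged = [clone] + [m for c in linked for m in c]
        clusters = [c for c in clusters if c not in linked]
        clusters.append(sorted(merged))
    return sorted(clusters)


def consensus(cluster: list[str], fingerprints: dict[str, set[str]]) -> set[str]:
    """Probes that hybridize to more than half of the clones in a cluster."""
    counts = Counter(p for clone in cluster for p in fingerprints[clone])
    return {p for p, n in counts.items() if n > len(cluster) / 2}


# Invented hybridization results: clone -> probes with a positive signal
fingerprints = {
    "clone1": {"p1", "p2", "p3", "p4", "p5"},
    "clone2": {"p1", "p2", "p3", "p4", "p5"},
    "clone3": {"p1", "p2", "p3", "p4"},
    "clone4": {"p6", "p7", "p8"},
}
clusters = cluster_clones(fingerprints)
print(clusters)  # [['clone1', 'clone2', 'clone3'], ['clone4']]
print(sorted(consensus(clusters[0], fingerprints)))  # ['p1', 'p2', 'p3', 'p4', 'p5']
```

Picking one representative clone per cluster would then yield a non-redundant set of the kind used for re-arraying.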
Finally, nucleic acid molecules showing the same sequence may be identified, and a set of non-redundant nucleic acid molecules be regularly re-arrayed by the same procedures described hereinabove.
These arrays will allow the simultaneous measurement of the gene expression level of all genes represented in the array in any sample investigated and, therefore, provide an indication of their level of activity. When complex mixtures of RNAs, cDNAs or genomic DNA from, e.g., different tissues or developmental stages are hybridized to these DNA chips, this will enable the determination of differences in gene expression profiles.
It is further envisaged that (poly)peptide arrays, in which the positions of the (poly)peptides correspond to the positions of their corresponding cDNA clones on the DNA array, are produced, and the (poly)peptides analyzed as described hereinabove. Protein arrays may be produced by, e.g., automatically spotting proteins from liquid expression cultures using a transfer stamp mounted onto a flat-bed spotting robot. If the expression profiles are used to complement the MPIs of the corresponding (poly)peptides, this provides a direct linkage of mRNA and protein populations extracted from, e.g., cells or tissues. (Büssow et al., 1998; also see
In a more preferred embodiment, the amplification in step (aa) is effected by PCR.
PCR amplification is a well known technique in the art (see, e.g., Sambrook et al., loc. cit.) and the person skilled in the art knows without further ado how to adapt reaction parameters to certain amplification reactions. Exemplary conditions for 12-mer oligonucleotides, where preferably no mismatch occurs, are at a temperature between 37 degrees centigrade and 42 degrees centigrade, depending on the GC content, preferably 39 degrees centigrade.
In a more preferred embodiment of the method of the present invention, after expression of said library of nucleic acid molecules, the following steps are carried out in connection with step (b):
With this embodiment, the same advantages are obtained at the protein level as discussed for the preceding embodiment at the nucleic acid level. Namely, a library or collection of essentially non-redundant (poly)peptides is obtained that may then be further analysed. This library, also known as a UNIclone, or a UNIprotein or a UNIgene set, can be used to generate protein arrays, and/or DNA arrays as described in Cahill (2000).
In yet another more preferred embodiment, said regularly arraying and/or said regularly re-arraying is effected on a solid support.
In a still further more preferred embodiment, said solid support is a chip, a glass substrate, a filter, a membrane, a magnetic bead, a silica wafer, metal, a mass spectrometry target or a matrix.
Any of the above solid supports may be coated or uncoated. Coating may be with a gel such as hydrogel or with teflon. Chemical coating is also envisaged. The surface of the solid supports may also be covered by anchor targets.
In a most preferred embodiment of the method of the present invention, said regularly arraying and/or said regularly re-arraying is effected on a porous surface.
The porous surface may be a solid or a non-solid support. Said porous surface may, for example, be a sponge, a membrane or a filter, for example a PVDF membrane or a nylon membrane.
In another most preferred embodiment said regularly arraying and/or said regularly re-arraying is effected on a non-porous surface.
The non-porous surface may also be a solid or non-solid surface/support.
In a further most preferred embodiment of the method of the present invention, said arraying and/or re-arraying is effected by an automated device.
Said automated device, preferably in the form of a robot, may effect spotting, gridding, pipetting or piezo-electric spraying of biological material.
Expression of a library of nucleic acid molecules may be effected, e.g., by the picking of randomly distributed clones from agar plates and arraying these clones into microtitre plates. Advantageously, this is done by picking robots. The colonies are checked by an image analysis system to address the position for picking. The software, furthermore, identifies clone positions and translates the position into robot movement. The next step is the profiling of protein products encoded by differentially expressed genomic DNA or cDNA clones, including the simultaneous expression of large numbers of cDNA clones in an appropriate vector system and high-speed arraying of protein products. For example, using robotic technology, a human fetal brain cDNA expression library may be arrayed in microtitre plates, and bacterial colonies may be gridded onto PVDF filters. In situ expression of recombinant fusion proteins may be induced and detected using an antibody against a 6xHis-tag-containing epitope. Using such an approach, the genes in these libraries can be studied on the DNA and protein levels simultaneously, and provide sources of recombinant genes and proteins to make DNA and protein chips. This approach may also achieve the large-scale systematic provision of recombinant proteins for functional studies by making and arraying cDNA expression libraries and by allowing the direct connection from DNA sequence information on individual clones to protein products and back again on a whole genome level. This makes translated gene products directly amenable to high-throughput experimentation and generates a direct link between protein expression and DNA sequence data (Cahill et al., 2000).
In another more preferred embodiment of the method of the present invention, said variety of oligonucleotides comprises at least 2, preferably at least 10, and most preferred at least 150 different oligonucleotides.
In another preferred embodiment of the method of the present invention prior to step (aa), the following steps are carried out:
Isolation of mRNA and reverse transcription into cDNA are well known methods in the art (see, e.g., Sambrook loc. cit.). Accordingly, RNA may be prepared, and mRNA isolated via, e.g., oligo-dT cellulose. Subsequently, e.g., oligo-dT primer may be hybridized to the poly-A tails of the mRNA, and mRNA reverse transcribed via, e.g., AMV reverse transcriptase. After second strand synthesis the so obtained cDNA may then be cloned into an expression vector using well known techniques. Suitable expression vectors have been described herein above.
If extracted mRNA populations are, via reverse transcription and cloning, expressed as recombinant fusion proteins, their encoded MPIs can easily be determined by mass spectrometry (see
In a still further preferred embodiment, the following further steps are carried out:
In this embodiment, clones may be grown, e.g., in microtitre plates, protein expression induced, and the produced fusion proteins purified via their tag and, e.g., magnetic beads. Furthermore, it is envisaged that the bound fusion proteins are digested “on-particle” by, e.g., trypsin, and the emerging peptides subjected to MALDI-MS and MS-PSD. As a result, an MPI profile is generated for each (poly)peptide produced by the optionally non-redundant clones that unambiguously specifies each entry, and allows its rapid identification (see
In a more preferred embodiment, said isolation is effected by metal chelate affinity purification.
In a most preferred embodiment, said metal chelate affinity purification employs Ni2+-NTA ligands immobilized onto magnetic particles. Alternatively, they may be immobilized on agarose; see
However, Ni2+-NTA ligands may also be immobilized onto Ni2+-NTA agarose or a matrix of a column.
This embodiment of the purification is most preferred because the yield and the purity of the product are high, and the method is cheap, fast, and appropriate for automation and high-throughput handling of large numbers of proteins.
Another most preferred embodiment of the method of the present invention further comprises:
Any of the above recited hybridizing molecules may be in the form of synthetic oligonucleotides. Yet, other origins such as naturally derived or recombinantly produced are also envisaged.
This embodiment of the present invention advantageously provides the link of genes to their expression products and vice versa (see
In a more preferred embodiment of the method of the present invention, expression is effected in procaryotes.
In an even more preferred embodiment said procaryotes are bacteria.
In a most preferred embodiment said bacteria are E. coli (see
In a more preferred embodiment of the method of the present invention, expression is effected in non-human eukaryotes or eukaryotic cells.
In an even more preferred embodiment said non-human eukaryotes are yeast, such as S. cerevisiae.
In a most preferred embodiment said yeast belong to the species Pichia pastoris (see
In another more preferred embodiment said eukaryotic cells are mammalian or insect cells.
In a preferred embodiment of the method of the present invention, said peptides have a molecular weight in the range of 600 to 4500 Daltons. This range of peptides has specific advantages, in particular, if the peptides to be analysed are of heterologous nature as compared to the peptides stored in the data base, as is evident from the appended example (see
The distribution of m/z values is important for the determination of MPIs. The MPIs were calculated for the number of peaks in a spectrum within the range 800 Da to 2000 Da. This range was selected because the minimal and maximal region of detection is on average 600-2750 Da for the homologous and 600-4500 Da for the heterologous protein, respectively (
In a most preferred embodiment, said peptides have a molecular weight of 600 to 2750 Daltons. This embodiment is particularly advantageous if the peptides are of homologous nature.
In a preferred embodiment of the method of the present invention, said comparing in step (b) comprises normalization for chemical or post-translational modifications. Normalization can be effected e.g. on the basis of the teachings of the appended example.
In a most preferred embodiment, said chemical modification is oxidation.
Post-translational modifications include glycosylation, phosphorylation, acetylation, sulfation and myristoylation.
As described hereinabove, by the method of the present invention (poly)peptides may be identified and/or characterized. In other words, the method of the present invention allows for the provision of structural and functional features of (poly)peptides independently of whether they are known or unknown.
As also described hereinabove, the method of the present invention further allows for the combination of these biological and biochemical parameters of different (poly)peptides with their gene expression profiles (see
Finally, if genomic DNA molecules are hybridized to the arrays of nucleic acid molecules produced in accordance with the present invention, the here described method not only allows for the functional and structural identification and/or characterization of (poly)peptides but also for the identification and isolation of the genes encoding these (poly)peptides, thus further contributing to the elucidation of the genome-proteome interrelation, e.g., in a particular cell or tissue, under normal conditions, disease conditions and activated (for example drug-treated) conditions.
The method of the present invention may also be useful for the development of pharmaceuticals and/or diagnostics. Accordingly, the method of the present invention may be focused on the identification and/or characterization of (poly)peptides that show, e.g., altered expression levels and/or structural modifications like, e.g., post-translational modifications or amino acid substitutions, additions and/or deletions in different disease states or if normal conditions and disease conditions are compared. This may, in turn, lead to the identification of corresponding defects on the DNA level, valuable information for pharmaceutical and/or diagnostic purposes, and/or the identification of compounds counteracting the abnormal expression levels and/or structural modifications and, thus, being potential drug candidates.
The disclosure content of the documents cited herein is herewith incorporated by reference in its entirety.
The figures show:
The example illustrates the invention.
Material and Methods
Strains, transformation and media. Escherichia coli strains XL-1Blue, BL21(D3)pLysS (Invitrogen) and SCS1 (Stratagene) were used for cloning and expression as described [Büssow et al., 1998; Lueking et al., 2000].
Pichia pastoris: strain GS115 (his4, Mut+; Invitrogen) was used for eukaryotic protein expression as described [Lueking et al., 2000].
Protein expression and purification. Bacterial protein expression in strain SCS1 was performed as described in [Büssow et al., 1998], and expression in strain BL21(D3)pLysS as described in [Lueking et al., 2000]. The proteins were purified as previously described [Büssow et al., 2000].
Mass Spectrometry
Tryptic Digestion of 2-D Gel Separated Proteins from Human Brain
Coomassie G250-stained large-format 2D gels of human brain total protein extract were prepared according to the protocol of Klose (1975), Humangenetik 26, 231-243. Cylindrical gel samples of 1 mm diameter were excised and then destained by incubation with 400 μL 25% isopropanol for 30 min. The destained gel samples were dried in a vacuum centrifuge for 10 min, followed by addition of 5 μL digestion buffer (5 mM DTT, 5 mM n-octylglucopyranoside (n-OGP), 20 mM Tris, pH 7.8) containing 12 ng/μL modified porcine trypsin (sequencing grade, Promega). Following overnight incubation at 37° C., 5 μL of 0.4% TFA, 5 mM n-OGP was added and the samples were incubated for 1 h at room temperature. Samples were stored at −20° C. prior to MALDI-MS sample preparation.
Tryptic Digestion of Heterologous Expressed Proteins
The proteins were electrophoretically separated by SDS-PAGE (12.5% polyacrylamide, bisacrylamide 30:0.8). The gels were stained with Coomassie Blue and destained, and the protein spots were visualised. The spots were excised from the 2-D gels, and the proteins were extracted and tryptically digested as described above and as is well known in the art.
MALDI Sample Preparation
Sample desalting and enrichment were achieved using micro-scale reversed-phase purification tips (ZipTip-C18, Millipore), following the protocol provided by the manufacturer.
CHCA Surface Affinity Preparation
Samples were prepared on pre-structured MALDI sample supports (Schuerenberg et al., 2000), using alpha-cyano-4-hydroxycinnamic acid (CHCA) as the matrix, according to a recently described protocol (Gobom et al., 2001).
MALDI-TOF-MS
Mass spectra of positively charged ions were recorded on a Bruker Scout 384 Reflex III instrument (Bruker Daltonik, Bremen, Germany) operated in the reflector mode. 100 single-shot spectra were accumulated from each sample. The total acceleration voltage was 25 kV. The XMASS 5.0 and MS Biotools software packages provided by the manufacturer were used for data processing. The spectra of the tryptically digested protein samples were calibrated internally using known auto-proteolytic products of trypsin.
Database Searching
For protein identification, human protein sequences in the SwissProt (www.expasy.ch/) and PROWL (Rockefeller University; www.prowl.rockefeller.edu/) databases were searched using the Mascot software (Matrix Science Ltd., U.K.). The probability score calculated by the software was used as the criterion for correct identification. A further criterion was applied, namely that a minimum of three peptides was required to match the highest-ranking sequence entry, compared to the next unrelated candidate. A mass deviation of 30 ppm was tolerated in the searches and, for proteins isolated from 2-DE, oxidation of methionine residues was considered a possible modification.
Generation of MPI
For the generation of MPIs, all possible m/z-values in the databases searched were transformed using the software “m/z-freeware editions” (Proteometrics, LLC) (www.canada.proteometrics.com/). The theoretical enzymatic cleavage of the database proteins was performed using the GPMAW software version 3.15 (Lighthouse data) (www.welcome.to/gpmaw).
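The theoretical cleavage step can be illustrated with a minimal Python sketch. It is not the GPMAW implementation referred to above; it assumes the common trypsin rule (cleavage C-terminal to K or R, except before P), standard monoisotopic residue masses, and singly protonated ions. The function names, the `missed_cleavages` parameter and the default mass window are hypothetical and chosen only for illustration.

```python
# Sketch of an in silico tryptic digest (assumption: standard trypsin rule,
# cleave after K or R unless the next residue is P).

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added to the sum of residue masses
PROTON = 1.00728   # charge carrier for a singly protonated ion


def tryptic_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides, optionally allowing missed cleavages."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def mh_plus(peptide):
    """Monoisotopic m/z of the singly protonated peptide [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def theoretical_peaks(sequence, low=600.0, high=4500.0, missed_cleavages=0):
    """Sorted theoretical [M+H]+ values inside the chosen mass window."""
    masses = (mh_plus(p) for p in tryptic_digest(sequence, missed_cleavages))
    return sorted(m for m in masses if low <= m <= high)
```

The default window of 600 to 4500 Da mirrors the peptide range named in the preferred embodiments; a narrower window can be passed where only part of the spectrum is evaluated.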
Comparison of MALDI-TOF-MS of Recombinant Proteins and their Corresponding Native Proteins from 2D Gels.
For comparison by mass spectrometry, 5 proteins (Aconitate hydrogenase, pyruvate kinase, GTP binding protein, tubulin α-1 chain and tubulin β-3 chain) that were previously identified and analysed on 2-DE gels (
The spectra of the recombinantly expressed proteins and the homologous proteins from 2-DE gels (as is shown for
To determine the feasibility of this approach, the coverage and the MPI value were calculated, both in percent. The coverage, as a percentage, was determined on comparing the number of actually identified peaks with the number of all theoretically possible peaks, after in silico digestion. The MPI value is the number of identical peaks, from the homologous and heterologous protein, based on the total number of peaks obtained from the heterologous protein, as a percentage.
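The two percentages defined above can be sketched in Python as follows. This is an illustrative reading of the definitions, not the software used in the example. The peak-matching tolerance of 30 ppm is borrowed from the database search settings and is an assumption here, as is restricting the MPI calculation to the 800 to 2000 Da window mentioned elsewhere in this description; all function names are hypothetical.

```python
def count_matches(query, reference, tol_ppm=30.0):
    """Count query peaks that have a reference peak within tol_ppm."""
    return sum(
        any(abs(q - r) / r * 1e6 <= tol_ppm for r in reference)
        for q in query
    )


def coverage_percent(observed, theoretical, tol_ppm=30.0):
    """Identified peaks as a percentage of all theoretically possible peaks."""
    if not theoretical:
        return 0.0
    return 100.0 * count_matches(theoretical, observed, tol_ppm) / len(theoretical)


def mpi_percent(homologous, heterologous, low=800.0, high=2000.0, tol_ppm=30.0):
    """Peaks shared by both spectra as a percentage of the heterologous peaks.

    Only peaks inside [low, high] are considered (assumed window).
    """
    het = [m for m in heterologous if low <= m <= high]
    hom = [m for m in homologous if low <= m <= high]
    if not het:
        return 0.0
    return 100.0 * count_matches(het, hom, tol_ppm) / len(het)
```

For instance, with four heterologous peaks inside the window, two of which are found in the homologous spectrum, `mpi_percent` returns 50.0.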
In
The Effect of Oxidation of Homologous Proteins from 2-DE Gels and its Consequence for the MPIs.
Due to the long staining times of 2D gels with Coomassie® G250, homologous proteins may be oxidised, particularly at methionine residues. Since the recombinantly expressed proteins are generally more concentrated and therefore require only short staining times, these proteins are less oxidised. As a consequence, a peptide containing an oxidised amino acid has an increased mass; for example, when methionine is oxidised, an increase of 16.00 m/z units is obtained in the monoisotopic state. This corresponds to the addition of one oxygen atom. For example, each of the peptides 6, 19 and 35 from tryptically digested tubulin β-3 chain contains one methionine. Comparing the spectrum of the homologous protein with that of the recombinantly expressed tubulin β-3 chain, the peaks 6, 19 and 35 of the homologous protein show a precise increase of 16 Da (see Table 3). This difference of 16 Da may result in some difficulties in the identification of unknown proteins from 2D gels when compared to a database based on spectra of heterologously expressed proteins. Modifying the MPI database by adding such values of oxidised peptides will increase the number of identical peaks obtained and improve the probability of correct identification. For tubulin β-3 chain, such a database modification increases the number of peaks used to determine the MPI value from 2 to 5, resulting in a more reliable MPI value.
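The database modification described in this paragraph can be sketched as a short Python routine that adds oxidised variants to a reference peak list. It is a sketch under stated assumptions: the reference entries are taken to be pairs of m/z value and methionine count, the function name is hypothetical, and the monoisotopic mass of one oxygen atom (15.9949 Da) is used in place of the rounded 16 Da shift quoted above.

```python
OXYGEN = 15.9949  # monoisotopic mass of one oxygen atom in Da


def with_oxidised_variants(reference):
    """Expand a reference peak list with methionine-oxidised variants.

    `reference` is a list of (mz, n_met) pairs (assumed data layout), where
    n_met is the number of methionine residues in the peptide. For each
    peptide, every oxidation state from 0 to n_met is emitted, because
    methionine is not always completely oxidised.
    """
    expanded = []
    for mz, n_met in reference:
        for k in range(n_met + 1):
            expanded.append(mz + k * OXYGEN)
    return sorted(expanded)
```

Keeping both the unmodified and the oxidised values allows a peak from a gel-derived protein to match whether or not the methionine in that peptide was oxidised during staining.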
The Distribution of M/Z Values.
The distribution of m/z values is important for the determination of MPIs. In general, the value of MPIs (%) was calculated for the number of peaks in a spectrum within the range 800 Da to 2000 Da. This range was selected because the minimal and maximal region of detection is on average 600-2750 Da (see
Influence of Expression by Different Hosts on the MPIs
The generation of a database containing MPIs may use heterologously expressed proteins from different hosts. Therefore, it is important to analyse whether expression by different hosts influences the peptide spectrum. cDNA expression libraries are mainly generated in E. coli (Büssow, 1998) and, only recently, in yeast, as described (Lueking, 2000); here, therefore, E. coli and the yeast Pichia pastoris were used as reference expression hosts. Human GAPDH was expressed in both hosts using the dual expression vector (Lueking et al., 2000) suitable for expression in P. pastoris (see
These data provide a proof of principle of the method of the present invention for improving the identification of proteins, e.g. from 2-D gels, using MPIs generated from proteins such as recombinantly expressed proteins. The above data qualify the present invention as a high-throughput and potentially fully automated method to identify proteins using mass spectrometry.
With the prior art methods, it was only possible to obtain about 50% coverage when identifying proteins by MALDI-MS. There are a number of reasons for this. For example, due to the redundancy in the genetic code, an incorrect amino acid sequence may be generated. Other reasons include the absence of the protein from the databases searched, as well as sequencing errors and contaminating sequences in the databases.
Therefore, a technique is described to improve this by generating mass spectrometry fingerprints of proteins, such as recombinant proteins. It was also shown that it is possible to carry out a high throughput and reliable method to identify proteins by mass spectrometry. The method of the invention also enables high throughput or automatic generation of MPI, which includes the standardisation of sample preparation procedures (for a general outline of the procedure, see
However, for the establishing of such an MPI database, the following points should be noted. For the identification of a known, or previously unknown protein, it was determined that an MPI value of at least 15% is sufficient, which may correspond to about 5 peaks that match to the peaks obtained from the homologous protein. Based on the results shown in
Preferably, the relative intensity units are correctly selected, so that only the well-defined peaks above background are selected. It is also preferred that an internal standard is measured, such as the auto-digestion peaks of trypsin, which will be used for the automatic calibration of the software, and also to determine if the spectrum is worth measuring.
The MPI database will also include information such as the expected peptide mass changes resulting from modifications of proteins, such as oxidation, and from incomplete digestion by trypsin, as well as known variability factors, for example that methionine, when present in a peptide, is not always completely oxidised. Including such information in the MPI database facilitates the improved identification of the various peptides obtained.
As can be seen from Table 1, peptides were obtained that were not present in the theoretical peak list. However, this did not hinder the generation of useful MPIs. These additional peaks may be explained by the presence of premature terminated proteins, which may have resulted from differences in codon usage when the protein is expressed in different host expression systems. Other possibilities may be due to the degradation of the proteins during storage or their proteolytic digestion by contaminating host proteases.
Also, as has been shown, not all the recombinant proteins used were full-length but, despite this, useful MPIs were obtained. This implies that MPIs can be generated from gene products that are not full-length, as is frequently the case in cDNA expression libraries. The criteria determined should also not affect the generation of MPIs from most recombinant systems, as genes cloned in either random-primed or oligo-dT-primed cDNA libraries should encode proteins which, on digestion, give peaks in this range.
In conclusion, the generation of an MPI database may have broad applications in the improved identification of proteins from many sources, such as from 2D gels, recombinant proteins, interacting proteins and whole protein complexes.
Number | Date | Country | Kind |
---|---|---|---|
EP 00102567.5 | Feb 2000 | EP | regional |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 10203334 | Jul 2003 | US |
Child | 12036525 | | US |