The present invention relates to the study of biological samples containing a mixture of biomolecules, e.g. peptides, in order to identify, characterise and quantify individual biomolecules, and more particularly to methods and systems for profiling the relative abundance of at least some of the individual biomolecules across different experimental and biological conditions optionally defining a subset of biomolecules for identification or further characterisation.
A widespread method of studying protein content in biological samples is by using two-dimensional gel electrophoresis in combination with mass spectrometry, see for example, Kennedy, S., Toxicol. Lett. 2001, 120, 379-384. Two-dimensional gel electrophoresis is limited to the analysis of molecules with a molecular mass greater than approximately 10 kDa and there are no well-established methods to globally address the content of proteins and peptides below this limit.
Many of these smaller protein and peptide molecules play an important role in biological processes, and a method to routinely analyse peptide content in biological samples would therefore be a significant advance. Liquid chromatography (LC) coupled with mass spectrometry (MS) has emerged as a promising tool in proteomics, capable of dealing with the inherent complexity of biological samples, and an increasing number of reports have been published illustrating the usefulness of combining LC and MS. It is suggested in “A neuroproteomic approach to targeting neuropeptides in the brain.”, Sköld K, Svensson M, Kaplan A, Björkesten L, Åström J and Andrén, Proteomics, 2, 447-454, that neuropeptides in the mass range of 300-5000 Da can be analysed by on-line nanoscale capillary reversed phase liquid chromatography (CRP LC) followed by electrospray ionisation quadrupole-time of flight mass spectrometry (Q-TOF MS). The article describes how the relative abundance of individual biomolecules across samples representing different experimental and biological conditions can be profiled and differences between the samples shown. Samples containing biomolecules were run through nanoscale CRP LC and Q-TOF MS. Each run resulted in an elution profile. Each individual data point in the elution profile represented an intensity value, or ion count, obtained from the MS detector for a particular chromatographic elution time and a particular m/z value. 3D representations of these elution profiles were drawn in which the y-axis showed the m/z ratio, the x-axis showed the elution time and the z-axis represented ion counts. Comparison between the different samples was performed by manually selecting similar regions on the 3D representations of the different samples, integrating the ion counts within the regions and comparing the integrated ion counts of corresponding regions.
An LC/MS analysis can be pictured as a dispersion of the signal from each biomolecule species in the elution time and m/z dimensions, and each peptide species will typically yield a plurality of peaks in the elution profile. If the resolution of the mass spectrometer is high enough, different isotopes of the same biomolecule species will be separated in the elution profile. Another type of dispersion of the signal is introduced by the experimental method: the biomolecules may receive different charge states during the experimental procedure. The different charge states will appear at different positions in the elution profile. A further type of dispersion may arise from chemical pre-processing of the samples, for example mass labelling. In order to accurately compare relative abundances of biomolecule species across different samples, the dispersion of the signal originating from one peptide species has to be considered. In the method of Sköld et al, the different isotopes of one biomolecule species were manually identified and reassembled in an annotation process. The different charge states were not considered. Comparison between the 3D representations obtained from different samples was performed by manually selecting similar regions on the 3D representations of the different samples, integrating the ion counts of the spots and comparing the integrated ion counts of corresponding regions. Since elution times of samples in LC columns may vary from run to run, it is not possible to simply overlay different representations of elution profiles on top of each other; instead, the corresponding regions on the different representations have to be manually identified, selected and marked so that they can be compared to each other.
Both the manual annotation and the manual process of finding corresponding regions in different elution profiles (samples) are extremely labour intensive and time consuming. The manual methods are not useful in large scale experiments or for industrial applications.
Several automated methods of processing LC/MS data have been reported. In a number of methods, exemplified by “MoWeD, a computer program to rapidly deconvolute low resolution electrospray liquid chromatography/mass spectrometry runs to determine component molecular weights” by Pearcy and Lee, J Am Soc Mass Spectrom, 12, (2001) 599-606; and “Automated postprocessing of electrospray LC/MS data for profiling protein expression in bacteria.”, by Williams, Leopold and Musser, Anal Chem 74, (2002) 5807-5813, individual mass spectra are deconvoluted by transformation methods. The methods offer automated detection of peaks corresponding to peptides and are to some degree capable of handling the dispersed signals originating from the same peptide species. However, since only one or a few mass spectra are treated at a time and a transformation of the spectra is used, weak signals will often be ignored. In addition, the methods are noise sensitive, as spurious noise peaks appearing in one or a few spectra are easily mistaken for peaks originating from peptides. To reduce the effects of this problem, hard filtering is used, resulting in low sensitivity.
In “New algorithms for processing and peak detection in liquid chromatography/mass spectrometry data” by Hastings et al, Rapid Commun Mass Spectrom 16, (2002) 462-467, a peak detection method, “vectorized peak detection”, is disclosed that is performed in a two-dimensional representation similar to the above described elution profiles. For an (elution time, m/z) position to be considered a peak, it must be a peak in the mass spectrum as well as a peak in the elution time dimension. The method is effective in avoiding spurious noise peaks, for example, but does not address the problem of dispersed signals.
The above mentioned studies illustrate the usefulness of LC/MS investigations. However, to make LC/MS-based analysis a method that is routinely used for analysing peptide content in biological samples, further requirements have to be met. Most importantly, the method has to be able to screen a large amount of data and profile the relative abundance of some of the individual biomolecules across different experimental and biological conditions. In doing so, the method has to address the problem of signal dispersion in the elution profiles. Due to the vast amount of data produced in a typical experiment, the method needs to be at least partly automated.
Furthermore, an attractive method needs to provide means for confirmation and validation of the result. This will be of special importance in fully automated methods and/or if advanced statistical methods like multivariate analysis are used, since these usually powerful analysis methods in certain cases can yield doubtful or misleading results even if the statistical measures indicate a high accuracy. In these cases an ability to compare the final results or an interim result with for example the unprocessed elution profiles would be of high value.
The objective problem is to provide a method and measurement system for analysing LC/MS data in order to profile the relative abundance of some of the individual biomolecules across different experimental and biological conditions, adapted for the vast amount of data typically appearing in real experiments. Furthermore, it should preferably be possible to trace high level results back to their origins in the source data and to define subsets of biomolecule species for further analysis.
The problem is solved by the method as defined in claim 1, the measurement system as defined in claim 19 and the computer program product defined in claim 23. Further improved methods and measurement systems have the features mentioned in the respective dependent claims.
The method of performing a combined Chromatography and Mass Spectrometry analysis (C/MS) according to the present invention comprises the steps of:
In one embodiment of the method according to the present invention the dispersion of signal from each biomolecule species arises from the existence of different isotopes and/or charge states of the biomolecule species, and the automated annotation reassembles, for essentially each biomolecule species, the signal dispersion caused by both the different isotopes and/or different charge states of the biomolecule species.
In another embodiment the sample comprises biomolecule species that have received different chemical labels, giving at least a first chemically labelled biomolecule with a first label and a second chemically labelled biomolecule with a second label. The chemical difference causes a further dispersion of the signal in the elution profile, and the automated annotation reassembles the signal dispersion caused by the chemical labelling.
In a further embodiment the automated annotation uses knowledge of the mass spectrometer resolution in the reassembling of dispersed signals.
In a still further embodiment of the present invention the automated annotation uses, in the reassembling of dispersed signals, a priori assumptions on the relations between different charge states and/or different isotopes of the same biomolecule species. Alternatively, or in combination, the automated annotation uses resemblances detected during the analysis, for example in the signal pattern between different charge states, in the reassembling process.
One advantage afforded by the present invention is that the automated alignment makes it possible to screen a large amount of data and profile the relative abundance of some biomolecule species across different samples.
A further advantage is that the enhancement in the signal intensity afforded by the consensus profile can be used to detect weak signals typically corresponding to biomolecule species with low abundance.
Another advantage is that in the method according to the present invention it is possible to trace a high level result back to its origins in the source data, and to define subsets of biomolecule species for further analysis.
The features and advantages of the present invention outlined above are described more fully below in the detailed description in conjunction with the drawings where like reference numerals refer to like elements throughout, in which:
a is an example of an elution profile produced by the system of
b and
a is a flowchart illustrating the main steps of the method according to the invention;
b is a flowchart illustrating the details of the annotating algorithm of the method according to the invention;
A Chromatography/Mass-Spectrometry (C/MS) analysis of a biological system is typically performed by running a plurality of samples representing different conditions in a biological system under study, through a combination of C/MS instrumentation. The chromatography can be seen as a separation method and the mass-spectrometry as a method of detection. Currently the most used and most promising method for the separation of biomolecules comprises Liquid Chromatography (LC). However, also other separation methods can be used, for example Gas Chromatography (GC). The inventive method and apparatus will be described using, but is not limited to, liquid chromatography. An instrumental setup, schematically illustrated in
As will be appreciated by those skilled in the art, an instrumental setup adapted for producing elution profiles with the described characteristics may be realized in a number of ways, and the above should be regarded as a non-limiting example of an instrumental setup adapted for performing the method according to the present invention.
In the description the use of the method and the measurement arrangement according to the present invention, will be exemplified with analysis of peptides in a biological system. The peptides are of special interest due to their importance in many biological processes. The peptides may be native or resulting from a digestion of full length protein, for example by using enzymes like trypsin. However, the method and apparatus according to the present invention are not limited to the study of peptides. A wide range of biomolecules, especially molecules with masses smaller than 10 kDa, can advantageously be analyzed with the method and apparatus disclosed herein. The term biomolecules should be interpreted as including both single biomolecules and biomolecule complexes.
A proteomic experiment typically includes a plurality of varieties, e.g. a treated group and a control group of subjects (patients, animals, colonies etc.), generating a large and diverse data set. The LC/MS analysis can be pictured as dispersing the signal from each peptide species in the elution time and m/z dimensions. The typically large data set and the dispersion of the signal constitute an information handling problem. In the method according to the invention the vast amount of data is handled by alternately using refined data representations, the original elution profiles and peptide maps generated from elution profiles. The refined data representations are for example a consensus elution profile combining the data of several elution profiles, or a differential profile highlighting differences between individual elution profiles. Throughout the method, although refined data representations are used, the raw data and the links between the raw and refined data are preferably always preserved, in order to be able to “go back” to confirm a result and to be able to perform further analysis, either on the data already collected or by initiating further analysis processes. The preservation of raw data and the possibility to alternate between refined data and the corresponding original raw data are useful for checking the reliability of the results generated by a method in accordance with the present invention.
In the method according to the invention, regions of interest, corresponding to peptides showing an interesting variation over a set of samples, may be selected based on the variation behaviour, before the peptides have been identified. The concept of detecting a region with an interesting signal variation between different profiles and selecting a region of interest for further analysis, without attempting to identify the peptides before the selection, is to be regarded as part of the present invention.
As discussed above the LC/MS analysis can be pictured as a dispersion of the signal from each peptide species in the elution time and m/z dimensions and each peptide species will typically yield a plurality of peaks in the elution profile. If the resolution of the mass spectrometer is high enough different isotopes of the same peptide species will be separated in the elution profile. Characteristic “isotope ladders” 205 can be seen in the elution profiles, as exemplified in
wherein it is assumed that the spacing between isotopes and the adduct ion mass are precisely 1 Da. As indicated in the figure the “distance” between different isotopes of the same peptide species will be 1/z.
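The 1/z spacing can be illustrated numerically. The following is a minimal sketch only: it assumes the commonly used relation m/z = (M + k + z)/z for a species of neutral mass M, isotope index k and charge state z, with both the isotope spacing and the adduct ion mass taken as exactly 1 Da as stated above. The function name and parameters are hypothetical and are not taken from the description.

```python
def isotope_ladder(neutral_mass, z, n_isotopes=4,
                   adduct_mass=1.0, isotope_spacing=1.0):
    """Expected m/z positions of the first isotopes at charge state z.

    Assumed relation (illustrative): m/z = (M + k*spacing + z*adduct) / z.
    """
    if z < 1:
        raise ValueError("charge state z must be a positive integer")
    return [
        (neutral_mass + k * isotope_spacing + z * adduct_mass) / z
        for k in range(n_isotopes)
    ]


# A hypothetical species of neutral mass 1000 Da observed at z = 2.
ladder = isotope_ladder(1000.0, z=2)
spacings = [b - a for a, b in zip(ladder, ladder[1:])]
```

Under these assumptions the rungs of the ladder at z = 2 are 0.5 apart in m/z, i.e. 1/z, and move closer together as z increases.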
Whether different isotopes can be distinguished depends on the mass spectrometer resolution. The resolution of the mass spectrometer may in turn depend on m/z. A peptide species will typically appear in the elution profile with separated isotopes, i.e. well defined peaks, for the charge states with low z, and as less well defined “blobs” including several isotopes for higher z.
In order to, for example, compare the abundances of certain peptide species between different samples, it is in most applications advantageous to reassemble, or link, all peaks originating from the same peptide species. The aim of the reassembling is to generate a peptide map corresponding to an elution profile. In the peptide map all dispersed signal relating to each peptide species in one elution profile is, if possible, brought together.
To be able to compare the relative abundance between different peptide species and/or the changes in abundances of certain peptides between different experimental and biological conditions, typically represented by different samples (and hence different elution profiles), it is necessary to also link peptide species across different samples represented by individual elution profiles and peptide maps thus forming a global annotation. The global annotation is preferably achieved by an automated matching process as will be described below.
Even though the theoretical relation between different peaks of the same peptide species is known according to the above, the generation of peptide maps and the matching are, in practice, not trivial tasks. The complications arise from several factors. In a typical sample a large number of different peptides are present, and peaks may be very close or overlapping, making it difficult, given the experimental uncertainties, to ascribe for example the correct charge states to a specific peptide. In addition, typically not all charge states are represented and their relations are not known. Noise will always be present, both as a background noise level and as spurious noise peaks. The noise may lead to falsely identified peptide peaks. One complication of special importance is caused by experimental variations, most pronounced as an unpredictable variation in the elution time. Elution profiles from identical samples may be shifted and/or compressed or expanded in the elution time when compared to each other. The method according to the present invention offers an automated annotation process, adapted to produce a peptide map for each elution profile or from a group of elution profiles. The method produces peptide maps of high quality and reliability and, importantly, significantly reduces the time needed for the annotation process in comparison with the prior art methods. The method according to the present invention differs from the prior art methods of automated annotation in that, among other features, it is capable of reassembling isotopes as well as charge states. In addition, the inventive method offers an increased effective sensitivity, as very weak signals can be detected and processed by the automated annotation.
This is possible since the peak detection is performed simultaneously in both the elution time dimension and the m/z-dimension, requiring a peak to have an extension in both dimensions, giving a detection method that is less sensitive to noise.
The peptide maps produced by the annotation are the input to the matching process. The outcome of the matching, as well as the processing time needed, is highly dependent on the quality of the annotation, i.e. the peptide maps. The automated annotation method according to the present invention, which gives accurate and reliable peptide maps, is required for an effective and accurate matching process, and hence to achieve a correct global annotation. The global annotation is in turn needed for a reliable statistical evaluation of the experiment.
Different types of chemical pre-processing of the samples can also cause differences in the mass of the biomolecule and hence a splitting of the signal. Even if the differences are intended and aimed at facilitating a certain analysis, their effect must be accounted for in any reassembling of the biomolecule peaks in the elution profiles. The method of automated annotation according to the present invention is easily adapted also for this type of intended mass difference.
In a plurality of the analysing steps of the method according to the invention the analysis is performed in the two-dimensional space defined by the elution time and the m/z. This might at first sight seem like a complication, but will be shown to simplify the process of re-assembling the spread out signal from each peptide, for example. The concept of simultaneously using both the elution time dimension and the m/z dimension of an elution profile is advantageous
The main steps of method according to the present invention, which will be described with references to the flowchart of
The steps of the method will be described in detail below:
Performing 300 a C/MS Analysis and Generating First Elution Profiles 305
Two or more biomolecule containing samples are run through a combination of LC/MS instrumentation according to the setup described above. The samples could typically represent different conditions in a biological system being studied. The simplest case is a differential experiment aiming at highlighting biomolecule species for which there is a large change in abundance between two different experimental conditions. A more advanced experimental design involves more than two conditions and/or introduces replication, i.e., the use of more than one sample per experimental condition. By the use of well-established statistical methods it is possible to assign statistical significance to abundance changes between the different conditions.
The measurement system according to
Generating Peptide Maps 310 by an Automated Annotation Process
The automated annotation process, according to the method of the present invention, automatically reassembles signals originating from the same peptide species dispersed in the elution profile and appearing as a plurality of peaks. The peaks typically range from well-defined to weak and diffuse for the same peptide species. The automated annotation process generates a peptide map for each elution profile.
The automated annotation algorithm starts by detecting primary features presumably corresponding to peaks in the signal variation of the elution profile. Primary features may comprise e.g. local maxima in the signal intensity, seeds from thresholding morphological operations or positions selected by analysis of gradients. Spots are compact areas of high intensity, which are detected starting from the primary features. Spots may correspond to individual isotopic peaks, or to isotopic peak clusters when the instrument resolution is not good enough to separate them. Spots may also originate from noise and data acquisition artefacts. The primary feature detection and spot detection steps make use of the local surroundings of the data points in both the m/z and elution time dimensions. A spot must have at least a predefined extension in both dimensions. In that way noise peaks, for example, are avoided.
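The primary feature and spot detection steps can be sketched as follows. This is a minimal illustration, not the algorithm of the description: it assumes the elution profile is a small 2D list indexed as profile[time][mz], uses local maxima as primary features, grows each spot by a simple flood fill over points above a background threshold, and takes the minimum extension as two data points in each dimension. All names and threshold values are hypothetical.

```python
from collections import deque


def local_maxima(profile, peak_threshold):
    """Primary features: points >= peak_threshold and >= all 8 neighbours."""
    n_t, n_mz = len(profile), len(profile[0])
    seeds = []
    for t in range(n_t):
        for m in range(n_mz):
            value = profile[t][m]
            if value < peak_threshold:
                continue
            neighbours = [
                profile[tt][mm]
                for tt in range(max(0, t - 1), min(n_t, t + 2))
                for mm in range(max(0, m - 1), min(n_mz, m + 2))
                if (tt, mm) != (t, m)
            ]
            if all(value >= v for v in neighbours):
                seeds.append((t, m))
    return seeds


def grow_spot(profile, seed, background_threshold):
    """Flood-fill the connected region above the background around a seed."""
    n_t, n_mz = len(profile), len(profile[0])
    spot, queue = {seed}, deque([seed])
    while queue:
        t, m = queue.popleft()
        for tt, mm in ((t - 1, m), (t + 1, m), (t, m - 1), (t, m + 1)):
            if (0 <= tt < n_t and 0 <= mm < n_mz
                    and (tt, mm) not in spot
                    and profile[tt][mm] > background_threshold):
                spot.add((tt, mm))
                queue.append((tt, mm))
    return spot


def detect_spots(profile, background_threshold, peak_threshold, min_extent=2):
    """Keep only spots that extend over min_extent points in BOTH dimensions."""
    spots, claimed = [], set()
    for seed in local_maxima(profile, peak_threshold):
        if seed in claimed:
            continue
        spot = grow_spot(profile, seed, background_threshold)
        claimed |= spot
        t_extent = len({t for t, _ in spot})
        mz_extent = len({m for _, m in spot})
        if t_extent >= min_extent and mz_extent >= min_extent:
            spots.append(spot)
    return spots


# Toy profile: one compact 2x2 spot and one isolated spike (rows = time).
profile = [
    [0, 0, 0, 0, 0, 0],
    [0, 5, 8, 0, 0, 0],
    [0, 6, 9, 0, 0, 9],
    [0, 0, 0, 0, 0, 0],
]
spots = detect_spots(profile, background_threshold=2, peak_threshold=7)
```

In this toy profile the isolated value 9 in the last column is a local maximum, but it has no extension in either dimension and is therefore rejected, whereas the 2x2 region is kept as a spot.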
When a spot is found, attempts are made to put it into context, i.e., to find additional traces of the peptide species that gave rise to the spot in the elution profile. As previously described, these traces are highly structured; the spot corresponds to a certain charge state and possibly a certain molecule isotope of the peptide species, and there may also be spots for other molecule isotopes and additional charge states. If a labelling method is used, there may also be spots corresponding to differently labelled versions of the same peptide species. Thus, a peptide map entry for the peptide species is constructed, starting from a single spot. This step is carried out for each spot.
The last step in the process is a refinement step, where duplicate entries are removed and overlaps are resolved. A peptide species may be detected several times by the algorithm (e.g. once for each charge state), which leads to duplicate entries in the peptide map. Such duplication is detected by systematic comparison and duplicate entries are removed either automatically or manually. There may also be regions where two or more peptide species overlap, due to insufficient chromatographic separation. A region where there is a large overlap between two peptide species cannot be used for measurements of the amounts of either species, and may therefore have to be removed from the map entries of both species or otherwise be indicated as being unreliable.
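The removal of duplicate entries can be sketched as below. The sketch assumes, purely for illustration, that each peptide map entry records the identifiers of the spots it was built from, and that two entries sharing any spot are duplicates of which the one with more spots is kept. The description does not specify the comparison criterion, so this rule and the field names are hypothetical.

```python
def remove_duplicates(entries):
    """Keep one entry per group of entries that share spots.

    Entries are visited from largest to smallest spot set, so the most
    complete entry of each group of duplicates is the one retained.
    """
    kept = []
    for entry in sorted(entries, key=lambda e: len(e["spots"]), reverse=True):
        if all(entry["spots"].isdisjoint(k["spots"]) for k in kept):
            kept.append(entry)
    return kept


# The same species found twice (starting from two charge states), plus another.
entries = [
    {"name": "A_from_z1", "spots": {1, 2, 3}},
    {"name": "A_from_z2", "spots": {2, 3}},
    {"name": "B", "spots": {7, 8}},
]
kept_names = [e["name"] for e in remove_duplicates(entries)]
```

A real refinement step would also need to distinguish duplicates from genuinely overlapping species, which this simple rule does not attempt.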
Referring now to the flowchart of
In order to assess the significance and consistency of the detected isotopes, charge states, and label varieties of step 310:3, a number of measures can be used, e.g.:
As can be seen, these measures make extensive use of both the m/z and elution time dimensions. The measures a) and b) are examples of how the method according to the invention uses a priori knowledge of the structure of the dispersion of the signal to verify an assumption on charge state and isotope, for example. The above measures can preferably be combined.
Whether the different isotopes of a peptide species can be distinguished depends on the charge state z and on the mass spectrometer resolution at the particular m/z ratio. A peptide species will typically appear in the elution profile with separated isotopes, i.e. well-defined peaks, for the charge states with low z, and as less well defined “blobs” including several isotopes for higher z. In the case where a mass spectrometer operating according to the time-of-flight (TOF) principle is used, the mass spectrometer resolution also depends on m/z, which complicates the isotope detection step 310:3:1.
In one embodiment of the present invention peptide map entry construction step 310:3 is improved by including different modes reflecting the resolution characteristics of the mass spectrometer. The resolution of the spectrometer is typically assumed to be dependent on m/z and described by a spectrometer resolution function R(m/z), as stated by the mass spectrometer manufacturer. The peptide map entry construction step 310:3 may then operate in at least two different modes: a high resolution mode and a low resolution mode, wherein the shifting between the modes is dynamic. The criteria for shifting between the modes are for example dependent on R(m/z) and z. In this embodiment, using the two resolution modes and the dynamic switching between them, the algorithm will only search for different isotopes of a peptide species for charge states where isotope resolution is expected according to the mass spectrometer resolution. This not only saves processing time, it also improves the quality and reliability of the produced peptide maps. This in turn is a prerequisite for a reliable result of the subsequent matching step 315.
In the case where the resolution of the spectrometer is well described by the function R(m/z), an effective resolution βR can be used for setting up a criterion for shifting between the resolution modes. β is an empirically predefined parameter relating to a required minimum difference between peaks and valleys in the elution profiles. A suitable value of β is 0.85 (unitless). R(m/z) depends on the properties of the mass spectrometer and is usually available from the manufacturer. For a given m/z and z the high resolution mode is used if:
and the low resolution mode is used otherwise.
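A mode selection of this kind can be sketched as follows. The inequality used here is an assumption made for illustration and is not taken from the description: it adopts the conventional definition R = (m/z)/Δ(m/z), so that the peak width at the effective resolution βR is (m/z)/(βR(m/z)), and selects the high resolution mode when the isotope spacing 1/z exceeds that width. The function names and the constant resolution of 5000 are hypothetical.

```python
def resolution_mode(mz, z, resolution, beta=0.85):
    """Return 'high' if isotopes at (mz, z) are expected to be resolved.

    Assumed criterion (illustrative only): 1/z > mz / (beta * R(mz)).
    """
    peak_width = mz / (beta * resolution(mz))  # width at effective resolution
    isotope_spacing = 1.0 / z                  # spacing between isotopes in m/z
    return "high" if isotope_spacing > peak_width else "low"


def constant_resolution(mz):
    """Hypothetical instrument with R = 5000 regardless of m/z."""
    return 5000.0


mode_low_charge = resolution_mode(800.0, z=2, resolution=constant_resolution)
mode_high_charge = resolution_mode(800.0, z=8, resolution=constant_resolution)
```

With these assumed values the peak width at m/z 800 is about 0.19, so isotopes at z = 2 (spacing 0.5) would be searched for individually while those at z = 8 (spacing 0.125) would be treated as an unresolved cluster.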
A background noise will always be present in the elution profiles, and the annotation process may be preceded by a noise removing step. All signal intensity below a threshold may be removed, for example. Since the signal level may fluctuate significantly between elution profiles, any signal intensity thresholds should preferably be chosen individually for each elution profile. Suitable background and peak thresholds are taken to be the 95th and 99th percentiles of the intensity distribution of the elution profile, respectively.
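Choosing the thresholds individually for each elution profile can be sketched as below. The sketch assumes a nearest-rank percentile definition and a profile stored as a 2D list; both are illustrative choices, and the function names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (integer q in 1..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def profile_thresholds(profile, background_q=95, peak_q=99):
    """Background and peak thresholds from the profile's own intensities."""
    intensities = [v for row in profile for v in row]
    return percentile(intensities, background_q), percentile(intensities, peak_q)


def remove_background(profile, background_threshold):
    """Set all intensities below the background threshold to zero."""
    return [[v if v >= background_threshold else 0 for v in row]
            for row in profile]


# Toy profile holding the intensities 1..100.
profile = [[10 * r + c + 1 for c in range(10)] for r in range(10)]
background, peak = profile_thresholds(profile)
cleaned = remove_background(profile, background)
remaining = sum(1 for row in cleaned for v in row if v > 0)
```

Because the thresholds are derived from each profile's own intensity distribution, a run with a generally higher signal level automatically receives higher thresholds.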
A detailed example of an automated annotation algorithm, representing a current best mode of operation, is presented under the section Implementation examples.
The usefulness of the method according to the present invention, compared to some prior art methods, is illustrated in
Matching Peptide Maps 315
The aim of the matching step 315 is to generate the global annotation, which is needed to produce abundance profiles for individual peptides across different samples. The matching links the peptide species across the different elution profiles, which for example represent different experimental and biological conditions.
In certain applications the number of biomolecules in one map will not be very large (typically on the order of 100-10,000) and the mass spectrometer can give a very accurate and specific mass measurement for each peptide. In these cases, and since the elution profiles are aligned, the matching of the peptide maps will be a simple projection of the peptide map of one elution profile (or consensus) onto another elution profile.
In other cases the unique masses of individual peptides cannot be fully resolved and clusters will be formed. These clusters must be resolved in order to obtain the global annotation. This is preferably achieved by treating the matching process as an optimization problem. Those skilled in the art will appreciate that many different optimization methods may be used for this type of problem, including greedy algorithms, simulated annealing, dynamic programming or genetic algorithms.
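As one illustration of the greedy option, the sketch below pairs entries of two peptide maps closest-first. It assumes that each entry carries a mass and an (already aligned) elution time, and that a pair is admissible only within fixed tolerances; the field names, the tolerances and the cost function are hypothetical and are not taken from the description.

```python
def greedy_match(map_a, map_b, mass_tol=0.05, time_tol=2.0):
    """Pair entries of two peptide maps, taking the closest candidates first."""
    candidates = []
    for i, a in enumerate(map_a):
        for j, b in enumerate(map_b):
            d_mass = abs(a["mass"] - b["mass"])
            d_time = abs(a["time"] - b["time"])
            if d_mass <= mass_tol and d_time <= time_tol:
                # Dimensionless cost: each deviation relative to its tolerance.
                cost = d_mass / mass_tol + d_time / time_tol
                candidates.append((cost, i, j))
    used_a, used_b, pairs = set(), set(), []
    for _cost, i, j in sorted(candidates):
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(pairs)


# Two close-lying species form a cluster; a third has no counterpart.
map_a = [
    {"mass": 1000.00, "time": 10.0},
    {"mass": 1000.02, "time": 10.5},
    {"mass": 1500.00, "time": 20.0},
]
map_b = [
    {"mass": 1000.01, "time": 10.1},
    {"mass": 1000.03, "time": 10.6},
    {"mass": 2000.00, "time": 30.0},
]
pairs = greedy_match(map_a, map_b)
```

A greedy strategy is simple and fast but does not guarantee a globally optimal assignment, which is why the other optimization methods mentioned above may be preferable for dense clusters.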
An example of a matching algorithm, suitable to be combined with the automated annotation, which has generated peptide maps, is given under the section Implementation Examples.
Abundance Measurement 320
For each elution profile with an associated peptide map, the signal intensity over the data points belonging to each peptide species in the map can be integrated. This yields an intensity measurement for each peptide species, and (optionally) for its charge states and molecule isotopes.
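The integration can be sketched as follows, assuming that the peptide map simply lists, for each peptide species, the (time index, m/z index) data points assigned to it. The data layout and names are hypothetical.

```python
def integrate_abundance(profile, peptide_map):
    """Sum the ion counts over the data points assigned to each species."""
    return {
        name: sum(profile[t][m] for t, m in points)
        for name, points in peptide_map.items()
    }


# Toy elution profile (rows = elution time, columns = m/z bins).
profile = [
    [0, 0, 0, 0, 0, 0],
    [0, 5, 8, 0, 0, 0],
    [0, 6, 9, 0, 0, 9],
    [0, 0, 0, 0, 0, 0],
]
peptide_map = {
    "peptide_1": {(1, 1), (1, 2), (2, 1), (2, 2)},
    "peptide_2": {(2, 5)},
}
abundances = integrate_abundance(profile, peptide_map)
```

The same function could be applied per charge state or per isotope by using a map keyed on those finer units.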
A data point in an elution profile is a measurement of the number of ions that were detected in a certain mass-to-charge ratio interval during a certain time interval. Provided that the ions all come from the same peptide species, this can be regarded as a measurement of the amount of the species in the sample. Measurements cannot be compared directly between species, because different molecule species are ionised to different extents in the mass spectrometer. However, the previously mentioned investigation by Sköld et al indicates that the measurements are at least repeatable. Since the peptide species are matched, the relative abundance of peptide species between the different samples can be established.
Certain measures can be taken to further increase the accuracy of the abundance and relative abundance measurements: a normalisation procedure can be applied, e.g. to compensate for uneven sample loadings among the LC/MS runs; and internal standards (spikes), i.e. known amounts of certain peptide species, can be added to the samples before the LC/MS analysis. In each experiment there will be a large number of elution profiles, yielding a large number of abundance measurements. These measurements have a high degree of structure. To begin with, there is the relation between peptide species, charge state and isotope, over which measurements may be aggregated to reduce their number. There is also the experimental design, which relates the runs to each other and adds a number of factors/dimensions to the data set. In many cases further analysis of the data will facilitate the interpretation. This kind of data is preferably analysed by multivariate statistical methods, for example ANOVA (Analysis of Variance), PCA (Principal Components Analysis) and FA (Factor Analysis). Various regression methods can also prove useful for model building. The analysis may be performed using dedicated, custom-built software, or by general-purpose statistical and data analysis packages such as SAS (SAS Institute Inc, Cary, N.C. USA) or Spotfire (Spotfire, U.S. Headquarters, Somerville, Mass., USA).
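A normalisation against an internal standard can be sketched as below. The sketch assumes that the same known amount of one spike species was added to every sample, so that dividing by its measured abundance compensates for differences in loading; this is one simple possibility among many, and the names are hypothetical.

```python
def normalise_to_spike(run_abundances, spike_name):
    """Express every abundance in a run relative to the internal standard."""
    spike = run_abundances[spike_name]
    if spike <= 0:
        raise ValueError("internal standard not detected in this run")
    return {name: value / spike for name, value in run_abundances.items()}


# Two runs where the second was loaded with half as much sample.
run_1 = {"spike": 200.0, "peptide_1": 50.0}
run_2 = {"spike": 100.0, "peptide_1": 50.0}
norm_1 = normalise_to_spike(run_1, "spike")
norm_2 = normalise_to_spike(run_2, "spike")
```

Although the raw ion counts for peptide_1 are equal in both runs, the normalised values (0.25 and 0.5) indicate a twofold difference once the uneven loading is taken into account.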
Defining Subsets of Peptide Species for Further Analysis 325
One aim of the method according to the present invention is to be able to define a subset of peptide species for further analysis from the samples, represented by the peptide maps. The preceding steps of the method have made it possible to select peptides of interest since their abundance and/or relative abundance across different samples is measured. The subset of peptide species may be peptides that show a high variation in abundance between samples, or show a statistically significant variation between replica groups of samples, or yield individual measurements with high abundances. The selection of these biomolecules may be achieved automatically, by applying user-specified thresholds for the selection criteria. Selection criteria are for example “all peptides with significant variation between samples above a threshold”, “the ten peptides with the highest abundance” etc. The selection may also be done manually, or by a combination of manual and automated selection. The selection process, manual or automated, may advantageously use a differential profile to highlight the differences between samples.
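The automatic selection by user-specified thresholds can be illustrated with a minimal sketch. It implements two of the criteria exemplified above: variation between samples above a threshold, and the n peptides with the highest abundance. Expressing variation as the ratio of the highest to the lowest abundance, as well as the function and variable names, are assumptions of this example; a test for statistical significance between replica groups is not shown.

```python
def select_by_variation(abundances, min_fold_change):
    """Return peptides whose max/min abundance ratio reaches the threshold.

    abundances: dict mapping peptide id -> list of abundances, one per sample.
    Peptides with a non-positive minimum are skipped in this sketch.
    """
    selected = []
    for peptide, values in abundances.items():
        lowest, highest = min(values), max(values)
        if lowest > 0 and highest / lowest >= min_fold_change:
            selected.append(peptide)
    return sorted(selected)


def select_top_abundance(abundances, n):
    """Return the n peptides with the highest single abundance measurement."""
    ranked = sorted(abundances, key=lambda p: max(abundances[p]), reverse=True)
    return ranked[:n]


# Hypothetical abundance measurements for three matched peptides, two samples.
abundances = {
    "pepA": [100.0, 110.0],
    "pepB": [50.0, 200.0],
    "pepC": [900.0, 1000.0],
}
varying = select_by_variation(abundances, min_fold_change=2.0)
top_two = select_top_abundance(abundances, n=2)
```

The two selections can be combined, for example by intersecting the returned lists, to mix several criteria in one subset definition.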
The further analysis of the subsets of peptides typically and preferably comprises identification or further characterisation by MS/MS. The previous exemplified, in connection to
In a further embodiment of the invention a first portion of a sample is analysed according to the above method and at least one subset of peptide species is selected. The times at which the selected peptides are expected to elute, and what is expected to elute in between, are known from the representation of the elution profile, and it is therefore possible to construct a list of features to be on the lookout for during an upcoming identification/characterisation run on a second portion of the same sample. These features consist of the identification candidates themselves, taken together with a number of “sentinel features” that act as markers/milestones that enable corrections to be made for experimental variation in elution time. The subset is then further analysed with MS/MS. By using the list of features, the elaborate MS/MS analysis is essentially only performed on the selected peptides. The ability to construct this list is provided by the method according to the invention because the raw data (elution profiles) and the links between the global annotation, the peptide maps and the raw data are preserved.
In the area of chromatography much attention has lately been given to the possibilities of introducing more than one separating step. These techniques are referred to as multidimensional chromatography and are well known in the art. Multidimensional liquid chromatography is advantageously combined with mass spectrometry (MDLC/MS). By introducing additional separation steps more complex samples, for example blood plasma, may be purposefully analysed. A 2-dimensional expansion of the measurement system described with reference to
The method according to the present invention of automatically annotating elution profiles will also work for this type of experiment, without any non-trivial adaptations. The elution profiles from an MDLC/MS experiment are annotated in the same manner as in the described 1DLC/MS. Since the multidimensionality multiplies the number of elution profiles, the amount of data will be very large even in an experiment involving a rather small number of samples, and the method according to the invention will be particularly useful.
Additionally, other types of multidimensionality, created by additional separation steps, e.g. electrophoresis and iso-electric focusing (IEF) or other methods, may in the same manner be handled by the method according to the present invention.
Methods of chemically labelling molecules in samples have received increasing attention. The idea of chemical labelling is to treat samples from, for example, a treated group and a control group in exactly the same way through the sampling, preparation and measurement procedures. The chemical labels are used to separate the groups at a late stage in the analysis. A chemical labelling of particular interest in the area of proteomics and LC/MS techniques is mass labelling.
The method of automated annotation according to the present invention handles chemical labels, for example mass labels, as described in the step 310:3:3. As appreciated by those skilled in the art, other types of labels, including, for example, isotope labels, may be used in the same manner.
Illustrated in
An example of an experiment utilising mass labels and the method of automated annotation according to the present invention is given under the section Implementation Examples.
In the automated alignment of the present invention, and also in the annotation and matching process, the original data of the elution profiles is preferably preserved, as well as the correlations between refined data and the original data. In addition, the method is very visual, and is preferably visualised with the aid of computer graphics, for example showing how peptide maps are projected onto elution profiles. This gives an ability to visualise the steps of the method as well as to confirm and verify a high-level result against original data, for example to check the consistency of a global annotation with the first elution profiles. This is of special importance if, for example, advanced statistical methods are needed for the abundance measurement. Such advanced methods, however powerful, may in certain cases produce doubtful results even if the statistical measure indicates a high accuracy. In these cases, the ability to trace the result back to original data and the visual nature of the results and interim results, such as elution profiles and peptide maps, are of high value.
Below, the present invention will be explained in more detail by way of examples, which however are not to be construed as limiting the present invention as defined by the appended claims. All references given below and elsewhere in the present specification are hereby included herein by reference.
A. Auto-Annotating an LC/MS Elution Profile
1. Spot Detection
1.1. Selection of Background and Peak Thresholds
Because the signal level may fluctuate significantly between elution profiles, any signal intensity thresholds should be chosen individually for each elution profile. In this implementation, the background and peak thresholds are taken to be the 95th and 99th percentiles of the intensity distribution of the elution profile, respectively.
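A minimal sketch of the per-profile threshold selection is given below. It takes the 95th and 99th percentiles of all intensities in one elution profile. The nearest-rank percentile definition and the representation of a profile as a list of scans (each a list of intensities) are assumptions of this example.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def select_thresholds(profile):
    """Return (background_threshold, peak_threshold) for one elution profile.

    profile: list of scans, each scan being a list of intensities per m/z bin.
    """
    intensities = [value for scan in profile for value in scan]
    return percentile(intensities, 95), percentile(intensities, 99)


# Hypothetical 10 x 10 profile holding the intensities 1..100.
profile = [[10 * scan + channel + 1 for channel in range(10)] for scan in range(10)]
background, peak = select_thresholds(profile)
```

Because the thresholds are derived from each profile's own intensity distribution, a run with a generally higher signal level automatically receives higher thresholds.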
1.2. Detection of Primary Features
Each data point in the elution profile is compared with its neighbours in order to find local maxima. Any local maxima above the peak threshold are considered valid primary features.
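The detection of primary features can be sketched as follows. Each data point is compared with its up to eight surrounding neighbours in the time and m/z directions; the choice of an eight-point neighbourhood and the handling of ties (equal neighbours do not disqualify a point) are assumptions of this example.

```python
def find_primary_features(profile, peak_threshold):
    """Return (scan, channel) index pairs of local maxima above the threshold.

    profile[t][k] is the intensity at elution-time index t and m/z index k.
    A point qualifies if it exceeds peak_threshold and no neighbouring
    point has a strictly higher intensity.
    """
    features = []
    n_scans = len(profile)
    for t in range(n_scans):
        for k in range(len(profile[t])):
            value = profile[t][k]
            if value <= peak_threshold:
                continue
            neighbours = [
                profile[tt][kk]
                for tt in range(max(0, t - 1), min(n_scans, t + 2))
                for kk in range(max(0, k - 1), min(len(profile[tt]), k + 2))
                if (tt, kk) != (t, k)
            ]
            if all(value >= other for other in neighbours):
                features.append((t, k))
    return features


# Hypothetical profile with one clear maximum and one sub-threshold bump.
profile = [
    [1, 1, 1, 1, 1],
    [1, 9, 1, 1, 1],
    [1, 1, 1, 3, 1],
    [1, 1, 1, 1, 1],
]
features = find_primary_features(profile, peak_threshold=5)
```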
1.3. Spot Detection (Corresponds to 310:2)
For each local maximum, an m/z interval centred at the maximum is set up. The width of the interval is taken to be the FWHM (full width at half maximum) for a mass spectrometer peak at that particular m/z, a figure which is available from the manufacturer of the mass spectrometer.
An elution time interval is then found by scanning for signal above the background threshold within the m/z interval in both directions along the elution time axis. A spot is formed by combining the m/z interval with the elution time interval.
A thresholding procedure is applied to remove spots that have too short a time extent, on the assumption that they result from spurious noise.
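The spot formation and the subsequent duration filter can be sketched as below. The m/z interval is passed in as a range of channel indices; converting the FWHM figure into such a range, the function names and the minimum number of scans are assumptions of this example.

```python
def grow_spot(profile, t_peak, k_low, k_high, background):
    """Extend a spot along the elution-time axis from the scan of its maximum.

    profile[t][k] is the intensity at scan t and m/z channel k. The spot is
    extended in both time directions for as long as some channel in
    k_low..k_high (inclusive) lies above the background threshold.
    Returns the inclusive scan interval (t_start, t_end).
    """
    def has_signal(t):
        return any(v > background for v in profile[t][k_low:k_high + 1])

    t_start = t_peak
    while t_start > 0 and has_signal(t_start - 1):
        t_start -= 1
    t_end = t_peak
    while t_end < len(profile) - 1 and has_signal(t_end + 1):
        t_end += 1
    return t_start, t_end


def is_long_enough(t_start, t_end, min_scans):
    """Duration filter: keep only spots spanning at least min_scans scans."""
    return t_end - t_start + 1 >= min_scans


# Hypothetical profile: channel 1 carries a peak over scans 1..3 and an
# isolated noise point at scan 5.
profile = [
    [0, 0, 0],
    [0, 3, 0],
    [0, 8, 0],
    [0, 4, 0],
    [0, 0, 0],
    [0, 3, 0],
]
spot = grow_spot(profile, t_peak=2, k_low=0, k_high=2, background=1)
noise = grow_spot(profile, t_peak=5, k_low=0, k_high=2, background=1)
```

With a minimum extent of three scans, the first spot in this hypothetical profile is kept and the single-scan spot is discarded as noise.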
2. Peptide Map Entry Construction (Peptide Pattern Reassembly) (Corresponds to 310:3)
This step is carried out for each spot individually. Spots are ordered with respect to decreasing peak intensity.
2.1. Seed-Spot Charge Screening
The set of putative charges z is screened for candidates in steps 2.1.1-2.1.3. Each z that passes the screening is assigned a score, and the z with the best score is selected.
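First, isotope detection is attempted, provided that a) it is meaningful at the resolution of the mass spectrometer and b) the spot is a well-resolved peak rather than a blob.
Then, the neighbouring charge states are tried.
Then, the labels are tried, if a labelling scheme is used.
Finally, after one z has been selected, the peptide map entry is refined.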
2.1.1. Isotope Detection
if
(i.e. high-res mode is suitable) and the peak is well-resolved (test by comparing to a model peak), then
search for isotopes with spacing 1/z Da. The minimum number of detectable isotopes is estimated from the average isotope distribution (the averaging of a certain mass is an average of all peptides of that mass). The tentative isotope positions m/z±1/z, ±2/z, are investigated:
If there are enough valid isotope positions, the charge state passes the screening.
2.1.2. Neighbour Charge State Detection
detect additional charge states at (m−1)/(z±1)+1
using the same time interval as the spot, look for
This has to be specifically implemented for each labelling scheme.
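The tentative positions examined during the charge screening of steps 2.1.1 and 2.1.2 can be sketched as follows. The isotope positions follow the 1/z spacing stated above. For the neighbouring charge states, the sketch converts the observed m/z to a neutral mass and back; with a proton mass rounded to 1 this agrees with the expression (m−1)/(z±1)+1 if m is read as the singly protonated mass, which is an interpretation made for this example. The function names and the `PROTON_MASS` constant are likewise assumptions.

```python
PROTON_MASS = 1.00728  # Da; the description uses the rounded value 1


def isotope_positions(mz, z, count=2):
    """Tentative isotope positions mz - k/z and mz + k/z for k = 1..count."""
    return [mz + sign * k / z for k in range(1, count + 1) for sign in (-1, 1)]


def neighbour_charge_positions(mz, z, proton=PROTON_MASS):
    """Expected m/z of the same species at charge z-1 and z+1.

    Returns a dict mapping neighbouring charge -> m/z; charge 0 is omitted.
    """
    neutral_mass = z * (mz - proton)
    return {
        charge: neutral_mass / charge + proton
        for charge in (z - 1, z + 1)
        if charge >= 1
    }


# Hypothetical spot at m/z 501 with putative charge 2, proton mass rounded to 1.
isotopes = isotope_positions(500.0, 2, count=1)
neighbours = neighbour_charge_positions(501.0, 2, proton=1.0)
```

Each returned position would then be inspected for signal within the same elution time interval as the seed spot.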
2.1.4. Peptide Map Entry Refinement
It is possible (even likely for large peptides) that the lowest isotope has very low abundance and therefore won't be detected. The empirical isotope distribution is matched to various shifted versions of the average isotope distribution, and the closest match is selected for the calculation of the peptide mass.
A subsequent step is to find the start of the isotope ladder that contains the spot. This is necessary for assigning the correct mass to the peptide species. Simply taking the first detectable spot to be the start does not work for large peptides or proteins, where the relative abundance of the first molecule isotope is almost zero. Instead, an approximate molecule isotope distribution is calculated as described by Senko et al, which is then fit to the region surrounding the spot for a number of possible integer-mass shifts.
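The search over integer-mass shifts can be illustrated with a minimal sketch. The model isotope distribution is taken as an input here; its calculation according to Senko et al. is not reproduced. Cosine similarity as the measure of fit, and the names used, are assumptions of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_ladder_shift(observed, model, max_shift=3):
    """Estimate how many leading isotopes are missing from the observed ladder.

    observed: intensities of the detected isotope peaks, by increasing m/z.
    model: expected relative abundances, starting at the first isotope.
    Returns the shift s for which observed[i] best matches model[i + s].
    """
    best_shift, best_score = 0, -1.0
    for shift in range(max_shift + 1):
        window = model[shift:shift + len(observed)]
        if len(window) < len(observed):
            break
        score = cosine_similarity(observed, window)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


# Hypothetical case: the first isotope (5 % of the total) was not detected.
model = [0.05, 0.20, 0.35, 0.25, 0.10, 0.05]
observed = [200.0, 350.0, 250.0, 100.0]
shift = best_ladder_shift(observed, model)
```

The m/z of the first isotope then follows by subtracting shift/z from the lowest detected isotope position, using the 1/z isotope spacing.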
3. Peptide Map Refinement (Corresponds to 310:4-5)
In this step, overlapping peptides are detected and the overlaps resolved. The method identifies four cases and handles them separately:
The algorithm takes two or more peptide maps as input. The output is a match table, holding one column for each peptide map. The rows of the table correspond to unique peptides. Non-empty table cells represent a mapping from a unique peptide (table row) to a peptide in a particular map (table column). An empty table cell indicates that a unique peptide does not match any peptide in a particular peptide map. For each peptide in each map, the mass (M/z and usually M) and the elution time are known.
The matching is performed in two steps. Both steps employ a greedy algorithm. A greedy algorithm is not optimal, but it scales well with problem size and is therefore selected. Other algorithms, such as simulated annealing or genetic algorithms, could also be employed.
1. Cluster Formation:
A cluster is a putative row in the match table. In the first step, the optimal cluster for each peptide is found, at this stage ignoring conflicts with other clusters.
All peptide maps are joined to form a large peptide list. The list is sorted with respect to M (or M/z if charges are not available). For each entry in the list, the optimal cluster is identified by exhaustive search (within a mass tolerance). The optimal cluster for a given list entry (i.e., peptide) is defined as the best-scoring cluster that contains that particular list entry (called the reference) and at most one list entry from each of the other maps, fulfilling the requirements: a) the mass difference between the peptide and the reference must be within a predefined limit, and b) the peptide does not belong to a selected cluster (see below).
Each cluster is assigned a score, which is calculated as the sum of all pairwise elution time difference scores within the cluster:
wherein |ti−tj| is the pairwise elution time difference. The parameter τ is interpreted as the largest time difference that is considered a perfect match. Score 1 is considered a perfect match between two peptides, and 0 an infinitely bad match. A cluster must not contain a pair with zero score.
2. Selection of Clusters:
In the second step the clusters are sorted with respect to score. The following procedure is then iterated as long as there are any clusters left:
a) the best-scoring cluster is found using linear search.
b) the cluster formation algorithm of step 1 is run on that cluster again. If the score has decreased, it is assumed that some of the peptides in the cluster now belong to a selected cluster; the cluster score is updated and the procedure restarts. It may also happen that the score increases; this is due to the non-optimality of the greedy algorithm and is ignored.
c) the best-scoring cluster is selected, i.e., copied to the match table.
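The two-step matching can be illustrated by the following simplified sketch. It is not the exact procedure described above: the cluster around a reference is built greedily rather than by exhaustive search, and the pairwise score function (1 up to a time difference `tau`, then `tau/dt`, and 0 beyond a hypothetical `cutoff`) is an assumed form chosen only to behave as the text describes. All names are assumptions of this example.

```python
from itertools import combinations


def pair_score(t_i, t_j, tau, cutoff):
    """Assumed score: 1 within tau, decaying as tau/dt, 0 beyond the cutoff."""
    dt = abs(t_i - t_j)
    if dt <= tau:
        return 1.0
    return tau / dt if dt <= cutoff else 0.0


def form_cluster(ref, peptides, used, mass_tol, tau, cutoff):
    """Build a cluster around peptide index ref; return (score, members).

    peptides: list of (map_id, mass, elution_time). At most one unused
    peptide per other map is added, within mass_tol of the reference and
    with a non-zero score against every member already in the cluster.
    """
    ref_map, ref_mass, ref_time = peptides[ref]
    best = {}  # map_id -> (score against the reference, peptide index)
    for i, (map_id, mass, time) in enumerate(peptides):
        if i in used or map_id == ref_map or abs(mass - ref_mass) > mass_tol:
            continue
        s = pair_score(ref_time, time, tau, cutoff)
        if s > 0 and s > best.get(map_id, (0.0, -1))[0]:
            best[map_id] = (s, i)
    members = [ref]
    for _, i in sorted(best.values(), reverse=True):
        if all(pair_score(peptides[i][2], peptides[j][2], tau, cutoff) > 0
               for j in members):
            members.append(i)
    members.sort()
    score = sum(pair_score(peptides[i][2], peptides[j][2], tau, cutoff)
                for i, j in combinations(members, 2))
    return score, members


def match_maps(peptides, mass_tol, tau, cutoff):
    """Greedy selection of clusters; returns the rows of the match table."""
    used = set()
    clusters = {i: form_cluster(i, peptides, used, mass_tol, tau, cutoff)
                for i in range(len(peptides))}
    table = []
    while clusters:
        ref = max(clusters, key=lambda i: clusters[i][0])  # linear search
        score, members = form_cluster(ref, peptides, used, mass_tol, tau, cutoff)
        if score < clusters[ref][0]:
            clusters[ref] = (score, members)  # members were taken; re-rank
            continue
        table.append(members)
        used.update(members)
        for i in members:
            clusters.pop(i, None)
    return table


# Hypothetical peptides from two maps "A" and "B": (map, mass, elution time).
peptides = [
    ("A", 1000.00, 10.0), ("B", 1000.01, 10.2),
    ("A", 1500.00, 20.0), ("B", 1500.02, 26.0),
    ("B", 2000.00, 30.0),
]
table = match_maps(peptides, mass_tol=0.05, tau=0.5, cutoff=10.0)
```

Each row of `table` lists the indices of the peptides regarded as the same unique peptide; a row with a single index corresponds to a match-table row whose other cells are empty.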
This exemplary algorithm may preferably be extended in several ways, for example with a limitation on how well the elution times must match in order to make a valid match. A simple way of achieving this is to append a cutoff threshold to the cluster formation requirements. Alternatively, dynamic thresholds, for example based on a statistical measure of how well all peptides match, can be used.
C. Annotation of a Mass Labelled Sample
Consider a simple experiment with the intent to examine the effects of a drug. There are two experimental varieties: a “treated” variety that receives a drug treatment, and a “control” variety that is treated identically except that the drug is replaced by placebo.
1: Collect tissue samples from animals of each variety and prepare them for LC-MS analysis.
2: Label each sample with a different label. In the case of ICAT, the labels are molecules that bind to the cysteine residues in the peptides. One label contains eight hydrogen atoms, and the other kind contains eight deuterium atoms.
3: Pool the labelled samples.
4: Purify the labelled peptides on an affinity column. Peptides and other molecules that lack a label flow right through and are removed, leading to less background in the subsequent analysis steps.
5: Perform an LC-MS analysis of the purified, pooled sample. In this example, peptides will show up in pairs separated by eight Da.
6: Annotate the profile, i.e., run the peptide detection algorithm, and quantitate each peptide.
7: Identify peptide pairs (or n-tuples if there are more than two labels) and mark each labelled peptide with its corresponding variety; this is easily done because the labelling scheme (and therefore the expected mass difference) is known, and the mass difference should not lead to large differences in elution time. The outcome of this process is a cross-table of <mass, control-intensity, treated-intensity> entries that can be further analysed by appropriate statistical methods. This identification is performed in step 310:3:3 of the annotation algorithm.
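The pairing of step 7 can be sketched as follows for the two-label case. The sketch assumes one label per peptide, so that the two partners differ by eight Da, and it assumes that the lighter partner belongs to the control variety; both assumptions, together with the tolerances and names, are made for this example only.

```python
def pair_labelled_peptides(peptides, label_shift=8.0, mass_tol=0.05, time_tol=0.5):
    """Pair peptides that differ by the label mass shift and co-elute.

    peptides: list of (mass, elution_time, intensity) from the pooled run.
    Returns a list of (light_mass, light_intensity, heavy_intensity) rows.
    """
    ordered = sorted(peptides)
    used = set()
    rows = []
    for i, (m_light, t_light, i_light) in enumerate(ordered):
        if i in used:
            continue
        for j in range(i + 1, len(ordered)):
            m_heavy, t_heavy, i_heavy = ordered[j]
            if m_heavy - m_light > label_shift + mass_tol:
                break  # the list is sorted by mass; no partner further on
            if (j not in used
                    and abs(m_heavy - m_light - label_shift) <= mass_tol
                    and abs(t_heavy - t_light) <= time_tol):
                rows.append((m_light, i_light, i_heavy))
                used.update((i, j))
                break
    return rows


# Hypothetical annotated peptides: one true pair, and two peptides that are
# eight Da apart but elute ten time units apart.
peptides = [
    (1000.00, 12.0, 500.0),
    (1008.01, 12.1, 1500.0),
    (1200.00, 15.0, 300.0),
    (1208.00, 25.0, 300.0),
]
cross_table = pair_labelled_peptides(peptides)
```

Under the stated assumption that the light label marks the control variety, each row reads as <mass, control-intensity, treated-intensity>.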
It is apparent that many modifications and variations of the invention as hereinabove set forth may be made without departing from the spirit and scope thereof. The specific embodiments described are given by way of example only, and the invention is limited only by the terms of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
0316943.0 | Jul 2003 | GB | national |
This application is a filing under 35 U.S.C. § 371 and claims priority to international patent application number PCT/EP2004/007339 filed Jul. 6, 2004, published on Feb. 17, 2005 as WO 2005/015209, which claims priority to application number 0316943.0 filed in Great Britain on Jul. 21, 2003; the disclosures of which are incorporated herein by reference in their entireties.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP04/07339 | 7/6/2004 | WO | | 12/14/2005 |