A “marker” typically refers to a polypeptide or some other molecule that differentiates one biological status from another. It is useful to identify novel markers for diagnostics and drug discovery processes. One way to discover if substances are markers for a disease is by determining if they are “differentially expressed” in biological samples from patients exhibiting the disease as compared to samples from patients not having the disease. For example,
When the signals in the graphs 100, 102 are viewed collectively, it is apparent that the average intensity of the signals at the mass-to-charge ratio A is higher in the samples from diseased patients than the average intensity of the signals at the mass-to-charge ratio A from the normal patient samples. The marker at the mass-to-charge ratio A is said to be “differentially expressed” in diseased patients, because the concentration of this marker is, on average, greater in samples from diseased patients than in samples from normal patients.
Mass spectra like those shown in
When forming more sophisticated analytical models, signals in mass spectra are often “clustered” together and are then further processed by a computer. For example, various signals associated with the different mass spectra at one or more mass-to-charge ratios can form one or more signal clusters. The signals forming the signal clusters may be further processed, for example, to identify markers and/or to form an analytical model. If, for example, it was not known that the mass-to-charge ratio A represented a differentially expressed marker in normal and diseased patients, a computer could cluster all 36 signals shown in
Deciding which signals to include within a signal cluster is a problem. Different signal peaks with slightly different mass-to-charge ratios in respectively different mass spectra may in fact represent the same marker. Consequently, these signals are clustered together as a signal cluster and each of the signals in the signal cluster is treated as having the mass-to-charge ratio associated with the signal cluster, even though the signals are in fact associated with slightly different mass-to-charge ratios.
A “cluster window” can be used to capture all desired signals for a signal cluster. The cluster window is typically a continuous range of values such as time-of-flight values, mass-to-charge ratio values, or values derived therefrom. All signal peaks within the cluster window would form a signal cluster, and the signals in the signal cluster and the mass-to-charge ratio for the signal cluster would be used for further data analysis. The width of a cluster window was specified in terms of a percentage of the mass-to-charge ratio (e.g., 1% of a particular mass-to-charge ratio).
A problem with the cluster window is that it was not wide enough to capture all signals that should have been in the same signal cluster. If some signal peaks are incorrectly excluded in this clustering process, then any subsequent data analysis and model formation would also be incorrect. Accordingly, it is desirable to cluster signals correctly.
The cluster window could be widened so that more signals are included in a signal cluster. For example, the proportional growth rate of the cluster window could be increased as the time-of-flight or mass-to-charge ratio increases. However, doing so may upset the clustering of peaks at lower molecular masses. For example, at low time-of-flights or low mass-to-charge ratios, one might capture too many signals within a signal cluster if the cluster window is too wide. Signals associated with different markers could be erroneously included in the same cluster. This would also be undesirable. This potential solution would also require manual tuning on the part of the user, which is subjective and prone to human error.
Embodiments of the invention address these and other problems.
Embodiments of the invention are directed to methods for processing spectra such as mass spectra. Other embodiments of the invention are directed to computer readable media including code for processing spectra as well as systems that use the computer readable media.
One embodiment of the invention is directed to a method for processing spectra, the method comprising: (a) obtaining a plurality of spectra, each spectrum in the plurality of spectra comprising a signal including a signal strength as a function of time-of-flight, mass-to-charge ratio, or a value derived from time-of-flight or mass-to-charge ratio; and (b) forming a signal cluster by clustering signals from the plurality of spectra with time-of-flights, mass-to-charge ratios, or values derived from time-of-flights or mass-to-charge ratios that are within a window that is defined using an expected signal width value.
Another embodiment of the invention is directed to a method for processing spectra, the method comprising: (a) obtaining a first plurality of spectra, each spectrum in the first plurality of spectra comprising a signal including a signal strength as a function of time-of-flight, mass-to-charge ratio, or a value derived from time-of-flight or mass-to-charge ratio; (b) determining a peak value for each signal above a predetermined signal-to-noise ratio in the first plurality of spectra; (c) forming a first signal cluster by clustering signals from the plurality of spectra with time-of-flights, mass-to-charge ratios, or values derived from time-of-flights or mass-to-charge ratios that are within a first cluster window that is defined using a first expected signal width value; (d) determining a cluster center value using the peak values of the signals in the first signal cluster; and (e) forming a second signal cluster by clustering signals from the first plurality of spectra with time-of-flights, mass-to-charge ratios, or values derived from time-of-flights or mass-to-charge ratios that are within a second cluster window that is defined using the cluster center value and a second expected signal width value associated with the cluster center value.
Other embodiments of the invention are directed to computer readable media for processing spectra and systems for obtaining and processing spectra.
These and other embodiments of the invention are described below with reference to the drawings.
Some embodiments of the invention are directed to methods for processing spectra. The method comprises obtaining a plurality of spectra. Each spectrum in the plurality of spectra comprises a signal that is represented by signal strength as a function of time-of-flight, mass-to-charge ratio, or a value derived from time-of-flight or mass-to-charge ratio. An example of a “value derived from time-of-flight or mass-to-charge ratio” may be, for example, the mass of an ion.
In one type of mass spectrum display format, the signals in the mass spectrum are generally in the form of “peaks”. After the spectra are obtained, one or more signal clusters are formed by selecting signals from the plurality of spectra with time-of-flights, mass-to-charge ratios, or values derived from time-of-flights or mass-to-charge ratios that are within the one or more corresponding cluster windows. The cluster windows are defined using expected signal width values. Expected signal width values are sometimes referred to as “expected peak width” values if the signals are in the form of peaks. After clustering the signals, the signals in the signal cluster and the mass-to-charge ratios associated therewith may be further processed or analyzed. In embodiments of the invention, there may be one, or two or more signal clusters per group of mass spectra.
Using an expected signal width value to determine the size of a cluster window is more desirable than the above-described way of defining the cluster window (e.g., by defining it in terms of a percentage of a mass-to-charge ratio). By using an expected signal width to determine the size of the cluster window, the non-linear relation of the signal width to the time-of-flight, mass-to-charge ratio, or value derived therefrom is automatically taken into account. Defining a cluster window in terms of expected signal width also has the added benefit of being more intuitive if the clustering algorithm fails for some reason. In embodiments of the invention, it is easy to see why the algorithm does not cluster two peaks (from different spectra) together when they are visually separated. It is also easier for a user to see that two adjacent signals overlap and are desirably included in the same signal cluster.
I. Expected Signal Widths
An “expected signal width” includes an expected signal dimension such as an expected or measured signal width. The expected signal width for a peak can be the width of a signal peak in a mass spectrum that is predicted at a given time-of-flight value or mass-to-charge ratio value (or value derived from such values) by the mass spectrometer.
If the signal is in the form a peak, the expected signal width can be measured from any suitable point along the height of a signal. In some embodiments, the expected signal width may be the expected width of the base of a signal peak, or may include a point between the apex and base of each signal peak. For instance, the signal widths that are used may be the signal widths at half the height of each signal peak. In another example, for a series of signals in a mass spectrum, the expected signal widths can be at a point between the apex and the base of each peak at the same distance from the baseline forming the bases of the peaks. In each case, the expected signal width generally increases as the time-of-flights, mass-to-charge ratios, or values derived from such values increase.
The expected signal widths can be theoretically or empirically derived. For example, a mass spectrum signal with a number of peaks corresponding to different analytes with known mass-to-charge values can be created, where the number of each of the different analytes is known to be approximately the same. The average time-of-flight value associated with each peak and the width of the peak can be recorded in a table of expected signal widths using analytes with known mass-to-charge values. An exemplary table of expected signal widths is shown in the Table below.
Using the values in the Table, a best-fit curve can be created to fit the values in the Table. Alternatively, linear interpolations can be used to form a piecewise linear function that represents the data.
In another example, an equation such as the following can be used to determine expected signal width. In the following equation,
Reasonable values to use with the above equation for predicting the width of a signal for some current mass spectrometers commercially available from Ciphergen Biosystems, Inc. are Δvi=800 m/s,
Other methodologies can be used to determine expected signal widths for specific time-of-flight values or mass-to-charge ratio values.
II. Signal Clustering
Exemplary clustering methods can be described with reference to the flowchart shown in
First, mass spectra are obtained (step 26). Any suitable process may be used to obtain the mass spectra. For example, the mass spectra may be retrieved (e.g., downloaded) from a local or remote server computer having access to one or more databases of mass spectra. The databases may contain libraries of mass spectra of different biological samples associated with different biological statuses. Alternatively, the mass spectra may be generated from the biological samples. Regardless of how they are obtained, the mass spectra and the samples used are preferably processed under similar conditions to ensure that any changes in the spectra are due to the biological factors, and not differences in processing.
Any suitable biological samples may be used in embodiments of the invention. Biological sample examples include tissue (e.g., from biopsies), blood, serum, plasma, nipple aspirate, urine, tears, saliva, cells, soft and hard tissues, organs, semen, feces, and the like. The biological samples may be obtained from any suitable organism including eukaryotic, prokaryotic, or viral organisms. Other examples of biological samples are described in the U.S. Pat. No. 6,675,104, which is herein incorporated by reference for all purposes.
In embodiments of the invention, a gas phase ion mass spectrometer may be used to create mass spectra. A “gas phase ion spectrometer” refers to an apparatus that measures a parameter that can be translated into mass-to-charge ratios of ions formed when a sample is ionized into the gas phase. This includes, e.g., mass spectrometers, ion mobility spectrometers, or total ion current measuring devices.
The mass spectrometer may use any suitable ionization technique. The ionization techniques may include for example, an electron ionization, fast atom/ion bombardment, matrix-assisted laser desorption/ionization (MALDI), surface enhanced laser desorption/ionization (SELDI), or electrospray ionization.
In preferred embodiments, a laser desorption time-of-flight mass spectrometer is used to create the mass spectra. Laser desorption spectrometry is especially suitable for analyzing high molecular weight substances such as proteins. For example, the practical mass range for a MALDI or SELDI process can be up to 300,000 daltons or more. Moreover, laser desorption processes can be used to analyze complex mixtures and have high sensitivity. In addition, the likelihood of protein fragmentation is lower in a laser desorption process such as a MALDI or SELDI process than in many other mass spectrometry processes. Thus, laser desorption processes can be used to accurately characterize and quantify high molecular weight substances such as proteins.
In a typical process for creating a mass spectrum, a probe with a marker is introduced into an inlet system of the mass spectrometer. The marker is then ionized. After the marker ions are generated, the generated ions are collected by an ion optic assembly, and then a mass analyzer disperses and analyzes the passing ions. The ions exiting the mass analyzer are detected by a detector. In a time-of-flight mass analyzer, ions are accelerated through a short high voltage field and drift into a high vacuum chamber. At the far end of the high vacuum chamber, the accelerated ions strike a sensitive detector surface at different times. Since the time-of-flight of the ions is a function of the mass-to-charge ratio of the ions, the elapsed time between ionization and impact can be used to identify the presence or absence or the quantity of molecules of specific mass-to-charge ratio.
Signals corresponding to the presence of a potential marker are identified in each spectrum. Each such signal is assigned a mass-to-charge ratio value. Signals above a predetermined signal-to-noise ratio are then detected to form a first plurality of mass spectra (step 28). In a typical example, signals with a signal-to-noise ratio greater than a value S may be detected. The value S may be an absolute or a relative value.
In embodiments of the invention, signals can be obtained in any suitable manner. In preferred embodiments, the signals are derived from analytes, including biological molecules such as nucleotides, amino acids, carbohydrates, simple lipids, polynucleotides (e.g., nucleic acids), polypeptides (e.g., proteins), polysaccharides (e.g., complex carbohydrates), complex lipids and conjugates of these (e.g., glycoproteins, lipoproteins and glycolipids).
A “peak value” for each signal in each mass spectrum is then determined (step 30). The peak value associated with a signal is the time-of-flight value, mass-to-charge ratio value, or any value derived from such values that corresponds to the tip or maximum intensity associated with a particular signal.
A first signal cluster is then formed using an expected signal width value (step 32). For example, a first cluster window can be formed using an expected signal width value. The width of the first cluster window may be the same or substantially the same as the expected signal width value at a particular mass-to-charge ratio. For example, the expected signal width at a mass-to-charge ratio X may be about 100 Daltons and the width of the first cluster window may also be about 100 Daltons wide. Signals with peak values that are within the first cluster window around X (X−50 Da to X+50 Da) form the first signal cluster. There may, of course, be more signal clusters per plurality of mass spectra.
A cluster center value is then determined for the signals in the first signal cluster (step 34). The cluster center value is determined using the peak values of the signals within the first signal cluster. In some embodiments, the center of the range of peak values associated with the first signal cluster may be used as a cluster center value. For example, if a first signal cluster comprises three signals with peak values 9,900 Da, 10,090 Da, and 10,100 Da, respectively, then the range of peak values would be from 9900 Da to 10,100 Da. The center (or midpoint) of that range would be 10,000 Da. In other embodiments, the cluster center value may be the average peak value for the peak values in the first signal cluster. For example, in the previously described example, the average of the peak values 9,000 Da, 10,090 Da, and 10,100 Da would be 10,030 Da, and the cluster center value would be 10,030 Da.
Referring to
After the second signal cluster is formed, signal clusters having a predetermined number of signals can be selected (step 37). Signal clusters having less than the predetermined number are discarded. In a typical example, if the number of signals in a signal cluster is less than 50% of the number of mass spectra, then the signal cluster can be discarded. In some embodiments, the selection process results in anywhere from as few as about 20 to more than about 200 selected signal clusters. This ensures that signal clusters of potential significance are selected for further analysis and processing. Once the signal clusters are selected, the mass-to-charge ratios for these signal clusters can be identified (step 38).
Once the mass-to-charge ratios are identified, “missing signals” for the mass-to-charge ratios can be determined. For example, some of the mass spectra may not exhibit a signal at the identified mass-to-charge ratios. This group of mass spectra or the samples associated with the mass spectra can be re-analyzed to determine if signals do in fact exist at the identified mass-to-charge ratios. Estimates are added for any missing signals (step 40). For spectra where no signal is found in a cluster, an intensity value is estimated from the trace height or noise value. The estimated intensity value may be user selectable.
The steps shown in
With reference to
Referring to
As shown in
A second signal cluster 203 is formed using this cluster center value 112 and a cluster window 111 is formed using second expected signal width value associated with that cluster center value 112. The expected signal width at the centroid value of 10,010 Da may be, for example, about 106 Da. In this example, the second cluster window 111 may be 106 Da wide and may be centered around 10,010 Da. The signals 101, 103, and 105 would fall within this second cluster window 111. Thus, in this example, the second signal cluster 203 includes the same signals 101, 103, and 105 as the first signal cluster 201.
Signal clusters with signals in more than N spectra may then be selected (step 37) for further data analysis and/or for further processing. For example, if N equals 3 or more signals, then the second signal cluster 203 comprising the signals 101, 103, 105 would be selected. The signal 107 would not belong to a signal cluster meeting the condition N equals 3 or more signals and would therefore be excluded from further data analysis, processing, and/or display. For instance, as shown in
The mass-to-charge ratio value associated with the cluster center value 112 for the second signal cluster 203 shown in
In some embodiments, the signal intensities of the signals in the second signal cluster 203 can be placed in a spreadsheet (e.g., an Excel™ spreadsheet) and can be labeled with the mass-to-charge ratio associated with the cluster center value 112. The mass spectra and their associated signals may then be processed using one or more statistical analyses as described in further detail below.
In some embodiments, each signal 101, 103, and 105 may be marked with a red line (not shown) at the mass-to-charge ratio value corresponding to the cluster center value 112. This shows a user where the mass-to-charge ratio of the signal cluster is in relation to the peak value of the particular signal being viewed.
III. Additional Processing of Mass Spectra Data
Referring to
In some embodiments, the log normalized data set is then processed by a classification process (step 46) that is embodied by code that is executed by a digital computer. After the code is executed by the digital computer, the analytical model (e.g., a classification model) is formed (step 48). The analytical model can use analysis processes such as hierarchical clustering, p-value plots, and multi-condition visualizations.
Statistical processes such as recursive partitioning processes can also be used to classify spectra. The spectra that are grouped together can be classified using a pattern recognition process that uses a classification model. In general, the spectra will represent samples from at least two different groups for which a classification algorithm is sought. For example, the groups can be pathological v. non-pathological (e.g., cancer v. non-cancer), drug responder v. drug non-responder, toxic response v. non-toxic response, progressor to disease state v. non-progressor to disease state, phenotypic condition present v. phenotypic condition absent.
In some embodiments, data derived from the spectra (e.g., mass spectra or time-of-flight spectra) that are generated using samples such as “known samples” can then be used to “train” a classification model. A “known sample” is a sample that is pre-classified. The data that are derived from the spectra and are used to form the classification model can be referred to as a “training data set”. Once trained, the classification model can recognize patterns in data derived from spectra generated using unknown samples. The classification model can then be used to classify the unknown samples into classes. This can be useful, for example, in predicting whether or not a particular biological sample is associated with a certain biological condition (e.g., diseased vs. non diseased).
Classification models can be formed using any suitable statistical classification (or “learning”) method that attempts to segregate bodies of data into classes based on objective parameters present in the data. Classification methods may be either supervised or unsupervised. Examples of supervised and unsupervised classification processes are described in Jain, “Statistical Pattern Recognition: A Review”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000, which is herein incorporated by reference in its entirety.
In supervised classification, training data containing examples of known categories are presented to a learning mechanism, which learns one more sets of relationships that define each of the known classes. New data may then be applied to the learning mechanism, which then classifies the new data using the learned relationships. Examples of supervised classification processes include linear regression processes (e.g., multiple linear regression (MLR), partial least squares (PLS) regression and principal components regression (PCR)), binary decision trees (e.g., recursive partitioning processes such as CART-classification and regression trees), artificial neural networks such as backpropagation networks, discriminant analyses (e.g., Bayesian classifier or Fischer analysis), logistic classifiers, and support vector classifiers (support vector machines).
A preferred supervised classification method is a recursive partitioning process. Recursive partitioning processes use recursive partitioning trees to classify spectra derived from unknown samples. Further details about recursive partitioning processes are in U.S. Provisional Patent Application Nos. 60/249,835, filed on Nov. 16, 2000, and 60/254,746, filed on Dec. 11, 2000, and U.S. Non-Provisional patent application Ser. No. 09/999,081, filed Nov. 15, 2001, now U.S. Pat. No. 6,675,104, and Ser. No. 10/084,587, filed on Feb. 25, 2002. All of these U.S. Provisional and Non Provisional patent applications, and U.S. patents are herein incorporated by reference in their entirety for all purposes.
In other embodiments, the classification models that are created can be formed using unsupervised learning methods. Unsupervised classification attempts to learn classifications based on similarities in the training data set, without pre classifying the spectra from which the training data set was derived. Unsupervised learning methods include statistical cluster analyses. A statistical cluster analysis attempts to divide the data into groups that ideally should have members that are very similar to each other, and very dissimilar to members of other groups. Similarity is then measured using some distance metric, which measures the distance between data items, and groups together data items that are closer to each other. Statistical clustering techniques include the MacQueen's K-means algorithm and the Kohonen's Self-Organizing Map algorithm.
IV. Systems
All or some of the steps in
A block diagram of an exemplary system incorporating a computer readable medium and a digital computer is shown in
The mass spectrometer 72 can be operably associated with the digital computer 74 without being physically or electrically coupled to the digital computer 74. For example, data from the mass spectrometer could be obtained (as described above) and then the data may be manually or automatically entered into the digital computer 74 using a human operator. In other embodiments, the mass spectrometer 72 can automatically send data to the digital computer 74 where it can be processed. For example, the mass spectrometer 72 can produce raw data (e.g., time-of-flight data) from one or more biological samples. The data may then be sent to the digital computer 74 where it may be pre-processed or processed. Instructions for processing the data may be obtained from the computer readable medium 78. After the data from the mass spectrometer is processed, an output may be produced and displayed on the display 76.
The computer readable medium 78 may contain any suitable instructions for processing the data from the mass spectrometer 72. For example, the computer readable medium 78 may include computer code for entering data obtained from a mass spectrum of an unknown biological sample into the digital computer 74. The data may then be processed using any of the above-described steps. Although the block diagram shows the mass spectrometer 72, digital computer 74, display 76, and computer readable medium 78 in separate blocks, it is understood that one or more of these components may be present in the same or different housings. For example, in some embodiments, the digital computer 74 and the computer readable medium 78 may be present in the same housing, while the mass spectrometer 72 and the display 76 are in different housings. In yet other embodiments, all of the components 72, 74, 76, 78 could be formed into a single unit.
Any of the functions described herein can be embodied by computer code that can be executed by the digital computer 74 or stored on the computer readable medium 78. The code may be stored on any suitable computer readable media. Examples of computer readable media include magnetic, electronic, or optical disks, tapes, sticks, chips, etc. The code may also be written in any suitable computer programming language including, C, C++, Java, Fortran, Pascal, etc.
Cluster editing functions can also be provided in the software in the system. Cluster editing allows a user to directly edit signal clusters. Cluster editing functions can comprise a cluster selection cue in a spectrum viewer. Signals in a selected signal cluster in the cluster table are highlighted in red while the rest are in gray for easy distinction of which peaks belong to the same cluster. This also flags the current cluster that is being edited. The cluster editing functions also include a feature which allows a user to directly adjust (“move”) signal peaks within a signal cluster, and a tool to delete signal clusters (e.g., allows a user to delete clusters with high p-values). Yet another cluster editing function is a cluster index/peak type display function. This includes an additional mode that allows one to directly examine a cluster index and whether the peak was identified in the first or second signal cluster or an estimated signal.
While the foregoing is directed to certain preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope of the invention. Such alternative embodiments are intended to be included within the scope of the present invention. Moreover, the features of one or more embodiments of the invention may be combined with one or more features of other embodiments of the invention without departing from the scope of the invention.
For example, although
All publications and patent documents cited in this application are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication or patent document were so individually denoted. By his citation of various references and providing background descriptions in this document Applicant does not admit that any particular reference or any particular description herein is “prior art”.
This application claims the benefit of U.S. Patent Application No. 60/540,741, filed Jan. 30, 2004, entitled “METHOD FOR CLUSTERING SIGNALS IN SPECTRA,” which disclosure is incorporated by reference herewith for all purposes.
Number | Name | Date | Kind |
---|---|---|---|
4885697 | Hubner | Dec 1989 | A |
6253162 | Jarman et al. | Jun 2001 | B1 |
6675104 | Gavin et al. | Jan 2004 | B2 |
6940065 | Graber et al. | Sep 2005 | B2 |
20020102610 | Townsend et al. | Aug 2002 | A1 |
20030218130 | Boschetti et al. | Nov 2003 | A1 |
20050075797 | Wishart et al. | Apr 2005 | A1 |
20050267689 | Tsypin | Dec 2005 | A1 |
Number | Date | Country |
---|---|---|
WO 0242733 | May 2002 | WO |
WO 200242733 | May 2002 | WO |
Number | Date | Country | |
---|---|---|---|
20050206363 A1 | Sep 2005 | US |
Number | Date | Country | |
---|---|---|---|
60540741 | Jan 2004 | US |