The present teachings relate to the field of mass spectrometry.
Mass defect information can be used to filter mass spectrometer data. However, such methods typically use a mass defect filtering window that does not scale with ion mass and/or does not include a statistical confidence measure. In such cases, the selected mass defect window is generally optimal only for a limited mass range. Various embodiments of the present teachings provide a statistical confidence value associated with the selected mass defect window and filter the data such that the window scales appropriately with the mass of the compound.
FIG. 3a: Spectrum from 1 fmol/µL B-gal before filtering.
FIG. 3b: Spectrum from 1 fmol/µL B-gal after 2 sigma filtering.
Different elements and isotopes have different nuclear binding energies. This typically results in an atomic mass shift away from their nominal mass. This mass difference is called the mass defect. A chemical compound will have a mass defect that is the sum of the mass defects of all its component atoms. Different classes of molecules are made of characteristic combinations of elements, and different classes of molecules therefore typically exhibit distinctly characteristic mass defects.
In the field of high-resolution mass spectrometry, mass defects can be used as a signature of the chemical compound. In the study of elemental compositions, the Kendrick mass defect spectrum has been used to show the mass defects of thousands of elemental compositions as a function of their nominal masses and thus permit classification of compositions based on their mass defects. Mass defects of monoisotopic ions are routinely used in the identification of drug metabolites using LC-MS (liquid chromatography-mass spectrometry), and a fixed mass defect window can be used to filter out chemical noise. In PMF (peptide mass fingerprinting) based on MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight) mass spectrometry, peptides and matrix ions generally have different ranges of mass defects, and mass defects can be used to differentiate matrix ion peaks from peptide ion peaks.
It has been observed that the mass defect of a peptide is a function of its mass and a random variable whose distribution function varies according to peptide mass. The present teachings discuss selecting a mass defect window to use in filtering in a manner appropriate to exclude as many non-peptide ions as possible, yet large enough to include most peptide ions.
Statistical Model for Peptide Mass Defects
The present teachings contemplate the use of a statistical model of mass defect distribution to perform filtering of mass spectrometer data. One skilled in the art will appreciate that there are many methods of building such a model. The model disclosed herein is presented for illustrative purposes and does not limit the present teachings specifically to that model.
A peptide is a chain of amino acids that are made of only a few elements, generally C, H, N, O and S. Each of these elements has a small mass defect, except the isotope 12C, which has zero mass defect by definition. The mass defect of each element can be normalized by its nominal mass. In the typical mass spectrometer range of interest of a few hundred to a few thousand mass units, a peptide is made of hundreds or thousands of such unit masses. Statistically, the average value of a large collection of independent measurements generally follows a normal distribution (the central limit theorem). Considering each mass unit to be a measurement, the average mass defect of a single mass unit in a peptide can be modeled with a normal distribution.
Building on this normal-based modeling concept, given an average mass defect d1 and a standard deviation σ1 for a single mass unit, the corresponding mean and standard deviation at any nominal mass N can be calculated as:
$d_N = N d_1$ (1)
$\sigma_N = \sqrt{N}\,\sigma_1$ (2)
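The mass defect distribution can be described by the following normal distribution:
$f(\Delta m_N) = \frac{1}{\sigma_N \sqrt{2\pi}} \exp\left(-\frac{(\Delta m_N - d_N)^2}{2\sigma_N^2}\right)$ (3)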
Furthermore, the mass defect and standard deviation for a single mass unit can be estimated from peptide mass data according to the following equations:
where ΔmN is the mass defect at nominal mass N.
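By way of non-limiting illustration, the estimation of d1 and σ1 from peptide mass data can be sketched in Python as follows. The estimators shown are assumed forms chosen to be consistent with equations (1) and (2); they are not a reproduction of the estimation equations referred to above. The function name `estimate_unit_parameters` and the sample values are hypothetical.

```python
import math


def estimate_unit_parameters(peptides):
    """Estimate (d1, sigma1) from (monoisotopic_mass, nominal_mass) pairs.

    Assumed estimators (not taken from the source text):
      d1     = mean of (mass defect / nominal mass)
      sigma1 = sqrt(mean of (residual^2 / nominal mass)),
    the second following from sigma_N^2 = N * sigma_1^2 in equation (2).
    """
    peptides = list(peptides)
    if not peptides:
        raise ValueError("at least one peptide is required")

    # Per-unit mass defect of each peptide, then the average over peptides.
    ratios = [(mass - nominal) / nominal for mass, nominal in peptides]
    d1 = sum(ratios) / len(ratios)

    # Squared residual about the predicted mean N*d1, scaled by 1/N.
    scaled = [((mass - nominal) - nominal * d1) ** 2 / nominal
              for mass, nominal in peptides]
    sigma1 = math.sqrt(sum(scaled) / len(scaled))
    return d1, sigma1


# Hypothetical sample data: (monoisotopic mass, nominal mass) in Da.
sample = [(1000.4, 1000), (1000.6, 1000)]
d1, sigma1 = estimate_unit_parameters(sample)
print(f"d1 = {d1:.6f} Da, sigma1 = {sigma1:.6f} Da")
```

The nominal masses are passed in explicitly because, at a few thousand mass units, the accumulated mass defect can exceed 0.5 Da, so simple rounding of the measured mass would not recover the nominal mass.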
The following table lists some peptide masses, their nominal masses and their mass defects.
Enzyme Digestion Correction:
Enzymes generally cleave a protein into peptide segments at particular sites. A commonly used enzyme is trypsin, which cleaves at lysine (K) and arginine (R) residues, resulting in what are known as tryptic peptides. For a tryptic peptide, the C-terminal residue will generally be either K or R, not a randomly chosen amino acid as assumed by the statistical model. Because of their large number of hydrogen atoms, both K and R have larger mass defects than most other amino acids. Thus the mass defect at the C-terminus will generally be higher than the average mass defect. The extra mass defect contribution from the C-terminus, De, modifies equation (1) to become
$d_N = N d_1 + D_e$ (6)
and equation (4) becomes:
The other equations are not affected.
To estimate De from equation (6), knowledge of the average mass defect for a single mass unit, d1, can be used. If the peptide mass is very large, the impact of De on the total mass defect is relatively small, so equation (4) remains approximately valid and can be used to obtain an initial estimate of d1 from large peptides.
Five proteins were theoretically digested according to the trypsin digestion rule. The five proteins were: Bovine Lactoperoxidase, BGAL_ECOLI Beta-galactosidase, Pig Immuno gamma globulin, Bovine Catalase and Rabbit Phosphorylase B. Twenty-five peptides in the range of 3000-5000 Da were used for estimating the average mass defect. The average mass defect for a single mass unit is calculated to be d1 = 0.477 × 10⁻³ Da according to equation (4).
According to equation (1), the average mass defect at mass 128 Da (the mass of K) is 0.061 Da. The actual mass defect of K is 0.095 Da. Thus the extra mass defect introduced by K is 0.034 Da. Similarly, the extra mass defect introduced by R is 0.027 Da. Thus, De is chosen to be 0.03 Da for tryptic peptides.
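The arithmetic above can be checked with a short Python sketch. The residue masses used below are standard monoisotopic values supplied here as an assumption (the text quotes only the resulting mass defects), and `extra_defect` is a hypothetical helper name.

```python
# Initial per-unit mass defect from the 25 large peptides (Da per mass unit).
D1_INITIAL = 0.477e-3

# Assumed monoisotopic residue masses (Da); not quoted in the text itself.
LYSINE_RESIDUE = 128.09496
ARGININE_RESIDUE = 156.10111


def extra_defect(residue_mass, d1=D1_INITIAL):
    """Actual mass defect minus the defect predicted by equation (1)."""
    # Rounding is safe here because the residue defects are far below 0.5 Da.
    nominal = round(residue_mass)
    actual = residue_mass - nominal
    predicted = nominal * d1
    return actual - predicted


extra_k = extra_defect(LYSINE_RESIDUE)
extra_r = extra_defect(ARGININE_RESIDUE)
print(f"extra defect for K: {extra_k:.3f} Da")
print(f"extra defect for R: {extra_r:.3f} Da")
print(f"average of the two: {(extra_k + extra_r) / 2:.2f} Da")
```

Averaging the two contributions gives approximately 0.03 Da, which agrees with the value of De chosen above.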
Once De is determined, equations (7), (6) and (5) can be used to calculate d1 and σ1. A total of 310 peptides in the mass range of 300 to 5000 Da (from the same five proteins) were used for the calculation. The average mass defect and standard deviation were determined to be d1 = 0.4802 × 10⁻³ Da and σ1 = 1.46 × 10⁻³ Da.
According to equations (6) and (2), some predicted mass defects as a function of nominal mass are listed in the following table:
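For illustration only, predicted values of this kind can be generated directly from equations (6) and (2) using the parameters reported above. The following Python sketch computes them; the function names and the particular nominal masses chosen are hypothetical, and the printed values are computed here rather than copied from any table.

```python
import math

D1 = 0.4802e-3     # average mass defect per mass unit (Da), from the text
SIGMA1 = 1.46e-3   # standard deviation per mass unit (Da), from the text
DE = 0.03          # tryptic C-terminus correction (Da), from the text


def predicted_defect(nominal_mass):
    """Mean mass defect at a nominal mass, per equation (6)."""
    return nominal_mass * D1 + DE


def predicted_sigma(nominal_mass):
    """Standard deviation of the mass defect, per equation (2)."""
    return math.sqrt(nominal_mass) * SIGMA1


# Hypothetical selection of nominal masses spanning the modeled range.
for n in (500, 1000, 2000, 3000, 4000):
    print(f"N = {n:4d}  d_N = {predicted_defect(n):.4f} Da  "
          f"sigma_N = {predicted_sigma(n):.4f} Da")
```

The mean grows linearly with N while the standard deviation grows only as the square root of N, which is why a fixed-width window cannot be optimal across the whole mass range.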
Validation of the Model:
According to the statistical model adopted in some embodiments of the present teachings, mass defects at different masses follow normal distributions with mass-dependent means and standard deviations. A new, normalized variable can be defined
$x_N = \frac{\Delta m_N - d_N}{\sigma_N}$ (8)
for each nominal mass N, and the mass defect distribution becomes:
$f(x_N) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x_N^2}{2}\right)$ (9)
This distribution is independent of the nominal mass N. Thus the normalized mass defects of all peptides should follow the same distribution, described by equation (9).
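As a non-limiting sketch, the normalization can be expressed in Python, assuming the normalized variable is the usual standardization (mass defect minus its predicted mean, divided by its predicted standard deviation). If the model holds, the normalized values from peptides of any mass should have a mean near 0 and a standard deviation near 1. The function names and the sample data are hypothetical.

```python
import math
import statistics

D1 = 0.4802e-3     # Da per mass unit, from the text
SIGMA1 = 1.46e-3   # Da per mass unit, from the text
DE = 0.03          # Da, tryptic correction from the text


def normalized_defect(defect, nominal_mass):
    """Standardize a mass defect using the mass-dependent mean and sigma."""
    mean_n = nominal_mass * D1 + DE
    sigma_n = math.sqrt(nominal_mass) * SIGMA1
    return (defect - mean_n) / sigma_n


def summarize(values):
    """Return (mean, population standard deviation) of normalized defects."""
    return statistics.mean(values), statistics.pstdev(values)


# Hypothetical (mass defect, nominal mass) pairs, for illustration only.
observations = [(0.47, 1000), (0.55, 1000), (0.93, 2000), (1.05, 2000)]
z_values = [normalized_defect(d, n) for d, n in observations]
print(summarize(z_values))
```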
To validate the model, thirteen proteins were theoretically digested according to the trypsin rule. Mass defects of all 663 peptides in the mass range of 300 to 5000 Da were normalized according to equation (8). The normalized mass defect distribution from those peptides is compared against the standard normal distribution as described by equation (9). The comparison is shown in
Mass Defects from Modifications:
Often, peptides undergo modifications that can change their mass. The chemical composition of a modification may not be similar to that of the standard amino acids, and thus the modification may introduce an extra mass defect. The impact of this extra mass defect can be handled in a similar fashion to the enzyme digestion correction. The following table shows the impact of some large modifications on mass defects.
When a modification is considered, there are two groups of peptides, one without the modification and the other with it. Generally, their mass defects follow the same normal distribution with different De. In many cases, the extra mass defect due to the modification is very small. For spectrum filtering purposes, one can assume that all mass defects follow the same distribution and add this extra mass defect to one side of the mass defect filtering window.
The impact of a modification may become more significant when the modification contains one or more elements with a large mass defect, such as Br, I, or Cs. The mass defect distribution for the modified peptides is still normal and has the same standard deviation as that of the unmodified peptides. In some applications, a large mass defect has been added to peptides as a mass defect tag to efficiently track the desired tagged species. The amount of mass defect introduced in the tagged peptide determines the amount of overlap between the two mass defect distributions (one for untagged peptides, the other for tagged peptides), and thus determines the probability of false positive identification, because tagged and untagged peptides cannot be distinguished in the overlapping region.
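The relationship between the size of a mass defect tag and the overlap of the two distributions can be illustrated with a short Python sketch. It assumes two normal distributions with equal standard deviation and a decision threshold placed midway between their means; the midpoint threshold and the function names are assumptions made here for illustration and are not prescribed by the present teachings.

```python
import math

SIGMA1 = 1.46e-3  # Da per mass unit, from the text


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def misclassification_rate(tag_defect, nominal_mass, sigma1=SIGMA1):
    """Fraction of untagged peptides falling on the tagged side of a
    midpoint threshold (equal, by symmetry, to the fraction of tagged
    peptides falling on the untagged side)."""
    sigma_n = math.sqrt(nominal_mass) * sigma1
    separation = abs(tag_defect) / sigma_n  # shift in units of sigma_N
    return 1.0 - normal_cdf(separation / 2.0)


# Hypothetical tag defects (Da) evaluated at nominal mass 2500.
for tag in (0.05, 0.15, 0.30):
    rate = misclassification_rate(tag, 2500)
    print(f"tag defect {tag:.2f} Da -> overlap rate {rate:.4f}")
```

Because the standard deviation grows with the square root of N, a tag of fixed mass defect separates the two populations less cleanly at higher masses.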
Application of Mass Defect Model in Spectrum Filtering:
Low abundance proteins play very important roles in biological processes. An active research area is the detection of biomarker proteins. Very often, biomarkers are associated with low abundance proteins whose mass peak intensities are barely above background noise levels. Because of this and other factors, reliably identifying biomarker patterns can be very challenging. If mass spectral noise can be reduced without significantly affecting peptide peaks, the chance of identifying low abundance proteins will likely be greatly improved.
Using the normal mass defect distribution with the mean and standard deviation described by equations (6) and (2), the mean and standard deviation of the mass defect at any mass can be computed. Some embodiments contemplate using a mass filter to exclude masses whose mass defect lies more than 2 standard deviations from the mean mass defect. Statistically, about 95.5% of peptide ions should pass this filter, while all noise outside the window is removed. Because a 2 sigma window corresponds to a confidence level of about 95.5%, the filtering process carries a statistical confidence measure. Instead of being fixed, the window size scales with mass according to equation (2). The size of the window, i.e., the multiplier for sigma, can be set to other values as appropriate.
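The correspondence between the sigma multiplier and the statistical confidence can be sketched in Python using the standard library's `statistics.NormalDist` (available in Python 3.8 and later). The helper names `coverage` and `multiplier_for_coverage` are hypothetical.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def coverage(k):
    """Fraction of a normal population within +/- k standard deviations."""
    return 2.0 * _STANDARD_NORMAL.cdf(k) - 1.0


def multiplier_for_coverage(fraction):
    """Sigma multiplier k whose two-sided window retains `fraction`."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return _STANDARD_NORMAL.inv_cdf((1.0 + fraction) / 2.0)


for k in (1.0, 2.0, 3.0):
    print(f"k = {k:.0f} sigma retains {coverage(k):.2%} of peptide ions")
print(f"99% retention needs k = {multiplier_for_coverage(0.99):.2f}")
```

A wider window retains more peptide ions but also admits more noise, so the multiplier is a trade-off that can be chosen per application.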
The present teachings contemplate a filtering algorithm based on variable window-sizes to filter MS spectra from MALDI-TOF data, although any type of mass spectrometer data can benefit from the present teachings. The algorithm computes a statistical model based on the mass defects, calculates the mass defect for a given mass and applies a filter to remove peaks outside a window that scales with the mass. This scaling can be performed by using a multiple of the standard deviation of the mass defects for a given mass.
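A minimal Python sketch of such a variable-window filter is given below, using the tryptic peptide parameters reported above. It is an illustration under stated assumptions rather than a definitive implementation: peak masses are treated as directly comparable to the peptide masses used to fit the model (any charge-carrier offset is ignored), the nominal mass is inferred by inverting equation (6), and all function names and peak values are hypothetical.

```python
import math

D1 = 0.4802e-3     # Da per mass unit, from the text
SIGMA1 = 1.46e-3   # Da per mass unit, from the text
DE = 0.03          # Da, tryptic correction from the text


def nominal_mass(mass):
    """Infer the nominal mass by inverting mass = N + N*D1 + DE (assumed)."""
    return round((mass - DE) / (1.0 + D1))


def passes_filter(mass, k=2.0):
    """True if the mass defect lies within k sigma of the predicted mean."""
    n = nominal_mass(mass)
    if n <= 0:
        return False
    expected = n * D1 + DE              # equation (6)
    sigma_n = math.sqrt(n) * SIGMA1     # equation (2)
    return abs((mass - n) - expected) <= k * sigma_n


def filter_peaks(peaks, k=2.0):
    """Keep only (mass, intensity) peaks whose mass passes the filter."""
    return [(mass, intensity) for mass, intensity in peaks
            if passes_filter(mass, k)]


# Hypothetical peak list: (mass in Da, intensity in arbitrary units).
peaks = [(1000.51, 3.0), (1000.10, 7.0), (2000.99, 10.0), (2000.40, 5.0)]
print(filter_peaks(peaks))
```

In this hypothetical peak list, the peaks at 1000.51 Da and 2000.99 Da lie close to the predicted mass defects and are retained, while the peaks at 1000.10 Da and 2000.40 Da fall outside the 2 sigma window and are removed.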
FIGS. 3a and 3b show the comparison between spectra before and after mass defect filtering using a 2 standard deviation window.
One skilled in the art will appreciate that the present teachings, which involve constructing a mass defect model and filtering MS data such that the size of the filter window varies with mass and is based on mass defect information, can also be applied to other chemical compound families, such as small molecule drug metabolites. Generally, what differentiates one family of compounds from another is the value of the average mass defect and standard deviation. Thus, the same methodology can be applied, but with parameters that depend on the types of compounds being studied.
Computer System Implementation:
Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 514, including alphanumeric and other keys, is coupled to bus 502 for communicating information and command selections to processor 504. Another type of user input device is cursor control 516, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Consistent with certain embodiments of the present teachings, functions such as mass defect computation and mass defect filtering can be performed, and the results displayed, by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in memory 506. Such instructions may be read into memory 506 from another computer-readable medium, such as storage device 510. Execution of the sequences of instructions contained in memory 506 causes processor 504 to perform the process states described herein. Alternatively, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.
The term “computer-readable medium” as used herein refers to any media that participates in providing instructions to processor 504 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510. Volatile media includes dynamic memory, such as memory 506. Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 502.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution. For example, the instructions may initially be carried on magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector coupled to bus 502 can receive the data carried in the infra-red signal and place the data on bus 502. Bus 502 carries the data to memory 506, from which processor 504 retrieves and executes the instructions. The instructions received by memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
The foregoing description has been presented for purposes of illustration and description. It is not exhaustive and does not limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice. Additionally, the described implementation includes software but the present teachings may be implemented as a combination of hardware and software or in hardware alone. The present teachings may be implemented with both object-oriented and non-object-oriented programming systems.
Number | Date | Country | |
---|---|---|---|
20070038387 A1 | Feb 2007 | US |
Number | Date | Country | |
---|---|---|---|
60694127 | Jun 2005 | US |