This application is related to U.S. patent application Ser. No. 09/285,481, filed Apr. 2, 1999, and entitled “Automated Process Line”, which is referred to and incorporated herein in its entirety by this reference.
The present invention is in the field of biological identification. More specifically, the present invention relates to identifying a biological sample by analyzing information received from a test instrument.
Advances in the field of genomics are leading to the discovery of new and valuable information regarding genetic processes and relationships. This newly illuminated genetic information is revolutionizing the way medical therapies are advanced, tested, and delivered. As more information is gathered, genetic analysis has the potential to play an integral and central role in developing and delivering medical advancements that will significantly enhance the quality of life.
With the increasing importance of and reliance on genetic information, the accurate and reliable collection and processing of genetic data are critical. However, known systems for collecting and processing genetic or DNA data are inadequate to support the informational needs of the genomics community. For example, known DNA collection systems often require substantial human intervention, which undesirably risks the inaccuracies that accompany manual handling. Further, the slow pace of such a manual task severely limits the quantity of data that can be collected in a given period of time, which slows needed medical advancements and adds substantially to the cost of data collection.
In a particularly exciting area of genomics, the identification and classification of minute variations in human DNA have been linked with treatment and medical advice tailored to a specific individual. For example, such variations are a strong indication of predisposition to a particular disease, drug tolerance, and drug efficacy. The most promising of these minute variations are commonly referred to as Single Nucleotide Polymorphisms (SNPs), which relate to a single base-pair change between a first subject and a second subject. By accurately and fully identifying such SNPs, a health care provider would have a powerful indication of a person's likelihood of succumbing to a particular disease, which drugs will be most effective for that person, and what drug treatment plan will be most beneficial. Armed with such knowledge, the health care provider can assist a person in lowering other risk factors for high-susceptibility diseases. Further, the health care provider can confidently select appropriate drug therapies, a process which is now an iterative, hit-or-miss exercise in which different drugs and treatment schedules are tried until an effective one is found. Not only does this waste limited medical resources, but the time lost in finding an effective therapy can have serious medical consequences for the patient.
In order to fully benefit from the use of SNP data, vast quantities of DNA data must be collected, compared, and analyzed. For example, collecting and identifying the SNP profile for a single human subject requires the collection, identification, and classification of thousands, even tens of thousands, of DNA samples. Further, the analysis of the resulting DNA data must be carried out with precision. In making a genetic call, where the composition of a biological sample is identified, any error in the call may detrimentally affect the medical advice or treatment given to a patient.
Known systems and processes for collecting and analyzing DNA data are inadequate to timely and efficiently implement a widespread medical program benefiting from SNP information. For example, many known DNA analysis techniques require an operator or technician to monitor and review the DNA data. An operator, even with sufficient training and substantial experience, is still likely to occasionally make a classification error. For example, the operator may incorrectly identify a base pair, leading to the patient receiving a faulty SNP profile. Alternatively, the operator may view the data and decide that the data do not clearly identify any particular base pair. Although such a “no call” may be warranted, it is likely that the operator will make “no-call” decisions when the data actually support a valid call. In such a manner, the opportunity to more fully profile the patient is lost.
Therefore, there exists a need for a system and process to efficiently and accurately collect and analyze data, such as DNA data.
It is an object of the present invention to provide an apparatus and process for accurately identifying genetic information. It is another object of the present invention that genetic information be extracted from genetic data in a highly automated manner. Therefore, to overcome the deficiencies in the known conventional systems, a method and apparatus for identifying a biological sample is proposed.
Briefly, the method and system for identifying a biological sample generate a data set indicative of the composition of the biological sample. In a particular example, the data set is DNA spectrometry data received from a mass spectrometer. The data set is denoised and a baseline is removed. Since the possible compositions of the biological sample may be known, expected peak areas may be determined. Using the expected peak areas, a residual baseline is generated to further correct the data set. Probable peaks are then identifiable in the corrected data set and are used to identify the composition of the biological sample. In a disclosed example, statistical methods are employed to determine the probability that a probable peak is an actual peak, is not an actual peak, or that the data are too inconclusive to call.
Advantageously, the method and system for identifying a biological sample accurately makes composition calls in a highly automated manner. In such a manner, complete SNP profile information, for example, may be collected efficiently. More importantly, the collected data are analyzed with highly accurate results. For example, when a particular composition is called, the result may be relied upon with great confidence. Such confidence is provided by the robust computational process employed and the highly automatic method of collecting, processing, and analyzing the data set.
These and other features and advantages of the present invention will be appreciated from review of the following detailed description of the invention, along with the accompanying figures in which like reference numerals refer to like parts throughout.
In accordance with the present invention, a method and device for identifying a biological sample is provided. Referring now to
The apparatus 10 for identifying a biological sample may operate as an automated identification system having a robot 25 with a robotic arm 27 configured to deliver a sample plate 29 into a receiving area 31 of the mass spectrometer 15. In such a manner, the sample to be identified may be placed on the plate 29 and automatically received into the mass spectrometer 15. The biological sample is then processed in the mass spectrometer to generate data indicative of the mass of DNA fragments in the biological sample. These data may be sent directly to the computing device 20, or may have some preprocessing or filtering performed within the mass spectrometer. In a preferred embodiment, the mass spectrometer 15 transmits unprocessed and unfiltered mass spectrometry data to the computing device 20. However, it will be appreciated that the analysis in the computing device may be adjusted to accommodate preprocessing or filtering performed within the mass spectrometer.
Referring now to
The data generated by the test instrument, and in particular the mass spectrometer, include information indicative of the identification of the biological sample. More specifically, the data are indicative of the DNA composition of the biological sample. Typically, mass spectrometry data gathered from DNA samples obtained through DNA amplification techniques are noisier than, for example, those from typical protein samples. This is due in part to protein samples being more readily prepared in greater abundance, and to protein samples being more easily ionized than DNA samples. Accordingly, conventional mass spectrometer data analysis techniques are generally ineffective for DNA analysis of a biological sample.
To improve the analysis capability so that DNA composition data can be more readily discerned, a preferred embodiment uses wavelet technology for analyzing the DNA mass spectrometry data. Wavelets are an analytical tool for signal processing, numerical analysis, and mathematical modeling. Wavelet technology provides a basic expansion function which is applied to a data set. Using wavelet decomposition, the data set can be simultaneously analyzed in both the time and frequency domains. Wavelet transformation is the technique of choice in the analysis of data that exhibit complicated time (mass) and frequency domain information, such as MALDI-TOF DNA data. Wavelet transforms as described herein have superior denoising properties as compared to conventional Fourier analysis techniques. Wavelet transformation has proven to be particularly effective in interpreting the inherently noisy MALDI-TOF spectra of DNA samples. In using wavelets, a “small wave” or “scaling function” is used to transform a data set into stages, with each stage representing a frequency component in the data set. Using wavelet transformation, mass spectrometry data can be processed, filtered, and analyzed with sufficient discrimination to be useful for identification of the DNA composition for a biological sample.
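The stage-wise decomposition described above can be sketched with a simple Haar wavelet. This is a minimal illustration only: the patent does not name a specific wavelet family, and the Haar basis, the function names, and the even-length requirement below are assumptions chosen for brevity.

```python
import math

def haar_stage(signal):
    # One Haar wavelet stage: pairwise averages give the low
    # (approximation) band, pairwise differences the high (detail) band.
    low = [(signal[i] + signal[i + 1]) / math.sqrt(2)
           for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / math.sqrt(2)
            for i in range(0, len(signal) - 1, 2)]
    return low, high

def decompose(signal, stages):
    # Repeatedly split the low band; collect the high band of each
    # stage, so each stage represents one frequency component.
    highs = []
    for _ in range(stages):
        signal, high = haar_stage(signal)
        highs.append(high)
    return signal, highs  # final low band plus stage 0..n-1 high bands
```

A flat signal, for example, decomposes into zero-valued detail stages with all energy in the final low band, which is what makes the high bands useful for isolating noise.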
Referring again to
After the denoising in block 45 and the baseline correction in block 50, a signal remains which is generally indicative of the composition of the biological sample. However, due to the extraordinary discrimination required for analyzing the DNA composition of the biological sample, the composition is not readily apparent from the denoised and corrected signal. For example, although the signal may include peak areas, it is not yet clear whether these “putative” peaks actually represent a DNA composition, or whether the putative peaks are the result of a systemic or chemical aberration. Further, any call of the composition of the biological sample would have a probability of error which would be unacceptable for clinical or therapeutic purposes. In such critical situations, there needs to be a high degree of certainty that any call or identification of the sample is accurate. Therefore, additional data processing and interpretation are necessary before the sample can be accurately and confidently identified.
Since the quantity of data resulting from each mass spectrometry test is typically thousands of data points, and an automated system may be set to perform hundreds or even thousands of tests per hour, the quantity of mass spectrometry data generated is enormous. To facilitate efficient transmission and storage of the mass spectrometry data, block 55 shows that the denoised and baseline corrected data are compressed.
In a preferred embodiment, the biological sample is selected and processed to have only a limited range of possible compositions. Accordingly, it is known where peaks indicating composition should be located, if present. Taking advantage of the known locations of these expected peaks, in block 60 the method 35 matches putative peaks in the processed signal to the locations of the expected peaks. In such a manner, the probability that each putative peak in the data is an actual peak indicative of the composition of the biological sample can be determined. Once the probability of each peak is determined in block 60, then in block 65 the method 35 statistically determines the composition of the biological sample and determines whether the confidence is high enough to call a genotype.
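A minimal sketch of this matching step follows, assuming a simple nearest-within-tolerance rule; the function name, the tolerance parameter, and the matching rule itself are illustrative assumptions rather than the patent's specified procedure.

```python
def match_peaks(putative, expected, tol):
    # Pair each expected peak location with the nearest putative peak
    # within tol mass units; unmatched expected peaks map to None.
    matches = {}
    for e in expected:
        best = None
        for p in putative:
            if abs(p - e) <= tol and (best is None or abs(p - e) < abs(best - e)):
                best = p
        matches[e] = best
    return matches
```

An expected peak with no putative peak nearby would later contribute to a “no call” or a low-probability genotype, consistent with the statistical step in block 65.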
Referring again to block 40, data are received from the test instrument, which is preferably a mass spectrometer. In a specific illustration,
Referring again to block 45, where the raw data received in block 40 are denoised, the denoising process will be described in more detail. As illustrated in
Referring now to
Referring now to
The standard deviation number for each stage is used with the stage 0 noise profile (the exponential curve) 97 to generate a scaled noise profile for each stage. For example,
In a similar manner, stage 2 high 100 has stage 2 high data 104 with the last five percent of points represented by area 101. The data points in area 101 are then used to calculate a standard deviation number which is then used to scale the stage 0 noise profile 97 to generate a noise profile for stage 2 data. This same process is continued for each of the stage high data as shown by the stage n high 105. For stage n high 105, stage n high data 108 has the last five percent of data points indicated in area 106. The data points in area 106 are used to determine a standard deviation number for stage n. The stage n standard deviation number is then used with the stage 0 noise profile 97 to generate a noise profile for stage n. Accordingly, each of the high data stages has a noise profile.
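The per-stage scaling described above might be sketched as follows, assuming the scale factor for each stage is the ratio of that stage's tail standard deviation to the stage 0 tail standard deviation; the exact scaling rule, along with all names below, is an illustrative assumption.

```python
import statistics

def scaled_noise_profiles(stage0_profile, stage_highs):
    # The standard deviation of the last 5% of each stage's
    # high-frequency points scales the stage 0 noise profile
    # (the fitted exponential curve) to yield that stage's profile.
    base_tail = stage_highs[0][-max(1, len(stage_highs[0]) // 20):]
    base = statistics.pstdev(base_tail)
    profiles = []
    for high in stage_highs:
        tail = high[-max(1, len(high) // 20):]  # last 5% of the points
        sigma = statistics.pstdev(tail)
        scale = sigma / base if base else 0.0
        profiles.append([p * scale for p in stage0_profile])
    return profiles
```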
Due to the characteristics of wavelet transformation, the lower stages, such as stages 0 and 1, will have more noise content than the later stages, such as stage 2 or stage n. Indeed, the stage n low data are likely to contain little noise at all. Therefore, in a preferred embodiment the noise profiles are applied more aggressively in the lower stages and less aggressively in the later stages. For example,
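One way to realize stage-dependent aggressiveness is to weight each stage's noise threshold before soft-thresholding the data; the linear weighting scheme below is an assumption for illustration, not the patent's specified curve.

```python
def denoise_stage(high, profile, stage, n_stages):
    # Soft-threshold each high-band point against its noise profile
    # value; the weight shrinks for later stages so the threshold is
    # applied less aggressively there (linear schedule assumed).
    weight = (n_stages - stage) / n_stages
    out = []
    for x, t in zip(high, profile):
        limit = t * weight
        if abs(x) <= limit:
            out.append(0.0)          # within the noise envelope: discard
        else:
            out.append(x - limit if x > 0 else x + limit)
    return out
```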
Referring again to
The formula is generally indicated in
Putative peak areas 145, 147 and 149 are removed from the signal 150 to create a peak-free signal 155 as shown in
Referring again to
In compressing the data according to a preferred embodiment, an intermediate format 186 is generated. The intermediate format 186 generally comprises a real number having a whole number portion 188 and a decimal portion 190. The whole number portion is the x-axis point 183 while the decimal portion is the value data 184 divided by the maximum data value. For example, in the data 182 a data value “25” is indicated at x-axis point “100”. The intermediate value for this data point would be “100.025”.
From the intermediate compressed data 186, the final compressed data 195 are generated. The first point of the intermediate data file becomes the starting point of the compressed data. Thereafter, each data point in the compressed data 195 is calculated as follows: the whole number portion (left of the decimal) is replaced by the difference between the current and the previous whole number, while the remainder (right of the decimal) remains intact. For example, the starting point of the compressed data 195 is the same as the first intermediate data point, which is “100.025”. The difference between the first intermediate data point “100.025” and the second intermediate data point “150.220” is “50.220”; therefore, “50.220” becomes the second point of the compressed data 195. In a similar manner, the second intermediate point is “150.220” and the third intermediate data point is “500.0001”; therefore, the third compressed data point becomes “350.0001”. The calculation is continued until the entire array of data points is converted to a single array of real numbers.
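The two-step compression, intermediate encoding followed by delta encoding of the whole-number part, can be sketched as follows; the function and parameter names are illustrative, and the maximum data value of 1000 in the usage below is assumed to match the worked example.

```python
def compress(points, max_value):
    # Step 1 (intermediate format): encode each (x, value) pair as a
    # single real number whose whole part is the x-axis point and
    # whose fractional part is value / max_value.
    inter = [x + v / max_value for x, v in points]

    # Step 2 (delta encoding): keep the first point verbatim, then
    # replace each whole part with its difference from the previous
    # whole part while leaving the fractional part intact.
    out = [inter[0]]
    for prev, cur in zip(inter, inter[1:]):
        out.append((int(cur) - int(prev)) + (cur - int(cur)))
    return out
```

Running this on the example data, (100, 25), (150, 220), (500, 0.1) with a maximum value of 1000, reproduces the sequence 100.025, 50.220, 350.0001 described above.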
Referring again to
Once the putative peaks have been shifted to match expected peaks, the strongest putative peak is identified in
As generally addressed above, the denoised, shifted, and baseline-corrected signal is not sufficiently processed for confidently calling the DNA composition of the biological sample. For example, although the baseline has generally been removed, there are still residual baseline effects present. These residual baseline effects are therefore removed to increase the accuracy and confidence in making identifications.
To remove the residual baseline effects,
The peaks are removed and remaining minima 247 located as shown in
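One plausible way to turn the located minima into a residual baseline is linear interpolation between them; the interpolation scheme, and the function below, are assumptions for illustration, as the patent's exact fitting method appears in its figures.

```python
from bisect import bisect_right

def residual_baseline(xs, minima):
    # minima: sorted (x, y) points remaining after peak removal.
    # Linearly interpolate between adjacent minima to estimate the
    # residual baseline at every x; clamp at the ends.
    mx = [m[0] for m in minima]
    baseline = []
    for x in xs:
        i = bisect_right(mx, x)
        if i == 0:
            baseline.append(minima[0][1])
        elif i == len(mx):
            baseline.append(minima[-1][1])
        else:
            (x0, y0), (x1, y1) = minima[i - 1], minima[i]
            baseline.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return baseline
```

Subtracting this baseline from the signal removes the residual baseline effects before peak heights are measured.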
To determine peak height, as shown in
An indication of the confidence that each putative peak is an actual peak can be discerned by calculating a signal-to-noise ratio for each putative peak. Accordingly, putative peaks with a strong signal-to-noise ratio are generally more likely to be an actual peak than a putative peak with a lower signal-to-noise ratio. As described above and shown in
Although the signal-to-noise ratio is generally a useful indicator of the presence of an actual peak, further processing has been found to increase the confidence with which a sample can be identified. For example, in the preferred embodiment the signal-to-noise ratio for each peak is adjusted by the goodness of fit between a Gaussian and each putative peak. It is a characteristic of a mass spectrometer that sample material is detected in a manner that generally complies with a normal distribution. Accordingly, greater confidence will be associated with a putative peak having a Gaussian shape than with one whose shape departs from a normal distribution. The error resulting from having a non-Gaussian shape can be referred to as a “residual error”.
Referring to
where G is the Gaussian signal value, R is the putative peak value, and N is the number of points from −W to +W. The calculated residual error is used to generate an adjusted signal-to-noise ratio, as described below.
An adjusted signal-to-noise ratio is calculated for each putative peak using the formula (S/N)*EXP(−0.1*R), where S/N is the signal-to-noise ratio and R is the residual error determined above. Although the preferred embodiment calculates an adjusted signal-to-noise ratio using a residual error for each peak, it will be appreciated that other techniques can be used to account for the goodness of fit between the Gaussian and the actual signal.
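The adjustment can be sketched as follows. The exponential factor is the patent's stated formula; the residual error is computed here as a mean squared deviation over the N points from −W to +W, which is an assumed normalization since the exact formula appears in the patent's figures.

```python
import math

def residual_error(gauss, peak):
    # Mean squared deviation between the fitted Gaussian values G and
    # the putative peak values R over the N points from -W to +W
    # (normalization assumed from the patent's stated variables).
    n = len(peak)
    return sum((g - r) ** 2 for g, r in zip(gauss, peak)) / n

def adjusted_snr(snr, residual):
    # The patent's stated adjustment: (S/N) * exp(-0.1 * R).
    return snr * math.exp(-0.1 * residual)
```

A perfectly Gaussian putative peak has zero residual error and keeps its full signal-to-noise ratio; any departure from the Gaussian shape discounts it exponentially.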
Referring now to
At some target value for the adjusted signal-to-noise ratio, it has been found that the probability is 100% that the putative peak is an actual peak and can confidently be used to identify the DNA composition of a biological sample. However, the target value of the adjusted signal-to-noise ratio at which the probability is assumed to be 100% is a variable parameter which is to be set according to application-specific criteria. For example, the target signal-to-noise ratio will be adjusted depending upon trial experience, sample characteristics, and the acceptable error tolerance in the overall system. More specifically, for situations requiring a conservative approach where error cannot be tolerated, the target adjusted signal-to-noise ratio can be set to, for example, 10 or higher. Accordingly, 100% probability will not be assigned to a peak unless the adjusted signal-to-noise ratio is 10 or greater.
In other situations, a more aggressive approach may be taken as sample data are more pronounced or the risk of error may be reduced. In such a situation, the system may be set to assume a 100% probability with a 5 or greater target signal-to-noise ratio. Of course, an intermediate signal-to-noise ratio target figure can be selected, such as 7, when a moderate risk of error can be assumed. Once the target adjusted signal-to-noise ratio is set for the method, then for any adjusted signal-to-noise ratio a probability can be determined that a putative peak is an actual peak.
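As a sketch, a simple mapping from adjusted signal-to-noise ratio to peak probability might ramp linearly up to 100% at the configurable target value; the linear shape below the target is an assumption for illustration, as the patent's actual probability curve is given in its figures.

```python
def peak_probability(adj_snr, target):
    # Probability that a putative peak is an actual peak: 100% at or
    # above the target adjusted S/N ratio, ramping linearly (assumed)
    # down to 0% below it.
    return min(1.0, max(0.0, adj_snr / target))
```

With a conservative target of 10, a peak with an adjusted ratio of 5 receives only a 50% probability; with an aggressive target of 5, the same peak would be assigned 100%.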
Due to the chemistry involved in performing an identification test, especially a mass spectrometry test of a sample prepared by DNA amplification, the allelic ratio between the signal strength of the highest peak and the signal strength of the second (or third, and so on) highest peak should fall within an expected range. If the allelic ratio falls outside of normal guidelines, the preferred embodiment imposes an allelic ratio penalty on the probability. For example,
With the peak probability of each peak determined, the statistical probability of various composition components may be determined. As an example, consider two peaks, peak G and peak C, and the task of determining the probability of each of their three possible combinations: GG, CC, and GC.
With the probability of G existing (90%) and the probability of C existing (20%) as a starting point, the probability of combinations of G and C existing can be calculated. For example,
In a similar manner, the probability of CC existing is equivalent to the probability of C existing (20%) multiplied by the probability of G not existing (100% − 90%, or 10%). As shown in
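Following the stated rule for CC, and extending it by symmetry to GG and to the joint combination GC, the three combination probabilities can be sketched as below; the function name, and the product rules for GG and GC, are assumptions consistent with the worked example values.

```python
def genotype_probs(p_g, p_c):
    # Probabilities of the three two-peak combinations, assuming
    # independence: GG needs G present and C absent, CC needs C
    # present and G absent, GC needs both present.
    p_gg = p_g * (1.0 - p_c)
    p_cc = p_c * (1.0 - p_g)
    p_gc = p_g * p_c
    return p_gg, p_cc, p_gc
```

With the example inputs of 90% for G and 20% for C, this yields 72% for GG, 2% for CC, and 18% for GC, from which the most probable combination can be selected or a no-call issued if no probability is dominant.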
Once the probabilities of each of the possible combinations have been determined,
Referring now to
One skilled in the art will appreciate that the present invention can be practiced by other than the preferred embodiments which are presented in this description for purposes of illustration and not of limitation, and the present invention is limited only by the claims which follow. It is noted that equivalents for the particular embodiments discussed in this description may practice the invention as well.