The modern era of breath testing dawned in 1971, when Linus Pauling first reported that normal human breath contains large numbers of volatile organic compounds (VOCs) in low concentrations. Subsequent researchers have attempted to employ breath VOCs as disease biomarkers with varying degrees of success. The U.S. Food & Drug Administration (FDA) has approved a small number of breath tests for clinical use (e.g. breath nitric oxide for airways inflammation), but FDA has not yet approved a breath test for lung cancer. Despite 30 years of research resulting in more than 300 relevant publications, no breath VOC has yet emerged as a clinically useful biomarker of lung cancer when employed alone. However, several breath VOCs appear to provide moderately accurate biomarkers that could potentially identify lung cancer if combined with one another in a multifactorial algorithm.
In seeking breath biomarkers of lung cancer, researchers have employed a wide range of tools, including VOC separation methods using gas chromatography mass spectrometry (GC MS), non-separative detectors such as electronic noses and chemosensors, analysis of expired breath condensate, measurement of breath temperature, and sniffer dogs. Analysis of breath VOCs with analytical instruments employing 2-dimensional GC has revealed a complex matrix of ~2,000 different VOCs in a single sample. Data management tools for metabolomic analysis that were originally developed for genomics and proteomics have been used to manage the information. An increased risk of false discovery of biomarkers can arise when a multivariate model over-fits a large number of candidate breath VOCs to a small number of test subjects, a pitfall that has been termed “voodoo correlations” or “seeing faces in the clouds”.
Despite these concerns, breath biomarkers of lung cancer have been proposed as safe and cost-effective tools to help determine a person's risk of lung cancer. There is a clinical need for such a test because more people in the United States die from lung cancer than from any other type of cancer. Early detection can save lives: the National Lung Screening Trial found that screening with low-dose chest CT reduced mortality from lung cancer by 20%. However, the comparatively low positive predictive value (PPV) of chest CT (2.4% to 5.2%) has raised concerns that screening for lung cancer might yield an overwhelming number of false-positive test results.
Volatile organic compounds (VOCs) contained in human breath have been identified as candidate biomarkers of breast cancer as described in Phillips et al., Detection of an Extended Human Volatome with Comprehensive Two-Dimensional Gas Chromatography, Time-of-Flight Mass Spectrometry. PLoS One 2013; 8:e75274. The tool most widely employed for breath VOC biomarker discovery is gas chromatography mass spectrometry (GC MS). A sample of concentrated breath VOCs is injected onto a chromatographic column that separates the complex mixture into a series of individual VOCs according to their physicochemical properties such as polarity and boiling point. The separated VOCs then flow into a detector where they are broken into fragments by a beam of high-energy electrons in a vacuum, and the resulting mass spectrum of fragments comprises a “fingerprint” that can be used to identify the VOC from a computer-based spectral library.
GC MS is a widely accepted tool, but it can potentially yield erroneous identification of analytes if a mixture as complex and diverse as human breath VOCs overburdens the separation column. If the separation of VOCs is incomplete, then two or more VOCs may enter the MS detector simultaneously, and their combined mass spectra may lead to misidentification of their chemical structures in the spectral library. Breath volatile organic compounds (VOCs) contain biomarkers of breast cancer that are detectable with gas chromatography mass spectrometry (GC MS). However, chemical identification of breath VOC biomarkers may be erroneous because spectral matching can misidentify their structure.
It is desirable to provide new and improved methods of identifying biomarkers of a disease to potentially improve the sensitivity and specificity of the disease screening and reduce the number of false-positive and false-negative test findings.
The present invention provides a method for identifying biomarkers and generating an output indicative of disease, including for example lung cancer or breast cancer. The method for identifying biomarkers comprises the steps of:
collecting a breath sample from subjects known to have a disease and subjects known to be free of the disease;
analyzing the collected breath samples to determine all mass ions in each of the collected breath samples using at least one time-resolved separation technique and at least one mass-resolved separation technique;
identifying a subset of the determined mass ions in a processor as the biomarkers for detecting the disease, the subset of the determined mass ions being statistically significant for detecting the disease; and
combining the subset of the determined mass ions in a multivariate algorithm in the processor to generate a value of a discriminant function indicating the likelihood that the subject has the disease. It will be appreciated that the subset of the determined mass ions will be different for each disease which is analyzed with the collected breath sample. Similarly, each subset of determined mass ions for each disease is combined in a different multivariate algorithm in order to generate a value of the discriminant function for the particular disease.
In one embodiment, biomarker mass ions are determined from breath VOCs after bombardment of the breath VOCs with high energy electrons using a mass spectrometer.
The invention also comprises a method for predicting the probable presence of disease in a test subject using the method for identifying biomarkers described above. In one embodiment, the method of the present invention is used for predicting the probable presence of lung cancer in a test subject using the method for identifying biomarkers of the present invention. In another embodiment, the method of the present invention is used for predicting the probable presence of breast cancer in a test subject using the method for identifying biomarkers of the present invention.
Another embodiment of the invention features a system for identifying a plurality of biomarkers for predicting a disease in a subject, including an apparatus for collecting a breath sample from subjects known to have the disease and subjects known to be free of the disease. A mass spectrometer (MS) associated with a gas chromatograph (GC) apparatus analyzes the collected breath samples to determine all mass ions in each of the collected breath samples. A computer identifies a subset of the determined mass ions as the biomarkers for detecting the disease, the subset of the determined mass ions being statistically significant for detecting the disease, and combines the subset of the determined mass ions in a multivariate algorithm to generate a discriminant function. The discriminant function indicates a value of the likelihood that the subject has the disease. In particular embodiments, the system can be used for predicting the probable presence of lung cancer or breast cancer in the subject using, in the multivariate algorithm, the biomarkers identified for predicting lung cancer or breast cancer, respectively.
It was found that biomarkers determined with the method of the present invention accurately predicted lung cancer in a blinded replicated study. Breath testing in parallel with chest CT can potentially improve the accuracy of lung cancer screening.
It was found that biomarkers determined with the method of the present invention accurately identified women with breast cancer and can be used for early diagnosis and treatment monitoring.
The invention will be more fully described by reference to the following drawings.
Reference will now be made in greater detail to a preferred embodiment of the invention, an example of which is illustrated in the accompanying drawings. Wherever possible, the same reference numerals will be used throughout the drawings and the description to refer to the same or like parts.
In block 14, the collected breath samples are analyzed to determine all mass ions in each of the collected breath samples using at least one time-resolved separation technique and at least one mass-resolved separation technique. In a preferred embodiment, the samples are analyzed with gas chromatography and mass spectrometry (GC MS). Chromatogram data from the GC MS is processed in a computer processor to identify mass ions in the sample.
In block 16, a subset of the determined mass ions that is statistically significant for detecting the disease is identified as the biomarkers for detecting the disease. In block 18, the subset of the determined mass ions is combined in a multivariate predictive algorithm to generate a value of a discriminant function (DF) indicating the likelihood that the subject has the disease. For example, a subset of determined mass ions can be determined which is statistically significant for detecting lung cancer or breast cancer.
Blocks 34, 35 and 36 describe steps using multiple Monte Carlo simulations to identify a set of mass ion biomarkers of a disease that detect the disease with greater than random accuracy. In block 34, a correct assignment curve is constructed with data of the AUC of the ROC curves for all candidate biomarker mass ions. In one embodiment, block 34 can be performed by assigning all data of the AUC of the ROC curves to a series of bins with incremental values. For example, the bins can be assigned values of 0.50 to 0.51, 0.51 to 0.52 and so forth up to 0.99 to 1.0. The correct assignment curve is generated as a plot of the number of mass ions in a bin on the y-axis versus the AUC value of a bin on the x-axis.
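By way of non-limiting illustration, the binning step of block 34 might be sketched as follows in Python. The function name `correct_assignment_curve`, its parameters, and the decision to ignore AUC values outside the 0.50 to 1.0 range are assumptions made for this sketch and are not taken from the specification.

```python
def correct_assignment_curve(auc_values, bin_width=0.01, lower=0.50, upper=1.00):
    """Count candidate mass ions per AUC bin (hypothetical helper).

    Returns a list of (bin_lower_edge, count) pairs, i.e. the x and y
    values of a correct assignment curve.
    """
    n_bins = round((upper - lower) / bin_width)
    counts = [0] * n_bins
    for auc in auc_values:
        if auc < lower or auc > upper:
            continue  # assumption: values outside the binned range are ignored
        # Small epsilon guards against floating-point error at bin edges;
        # an AUC of exactly 1.0 falls in the last bin (0.99 to 1.0).
        idx = min(int((auc - lower) / bin_width + 1e-9), n_bins - 1)
        counts[idx] += 1
    return [(round(lower + i * bin_width, 2), c) for i, c in enumerate(counts)]


# Illustrative AUC values for five hypothetical mass ions.
curve = correct_assignment_curve([0.505, 0.515, 0.515, 0.995, 1.0])
```

Plotting the second element of each pair against the first would give the number of mass ions per bin on the y-axis versus the AUC value on the x-axis.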
An example correct assignment curve for lung cancer is shown as 50a in
The accuracy of the correct assignment curve can be re-evaluated by comparison of Monte Carlo simulations of the identified subset of mass ions to a plurality of Monte Carlo simulations of random assignment of each of the mass ions to either disease or being free of disease. Referring to
Referring to
In one embodiment, block 36 can be implemented using vertical line V1 53a of
Referring to
For example, from the embodiment shown in
Method 10 for identifying biomarkers and generating an output indicative of disease of the present invention can be used to detect the probable presence of a disease in a human subject. A breath sample from a test subject is collected, chemically analyzed and the data is analyzed with the multivariate algorithm to generate a value of the discriminant function for the test subject. The value of the discriminant function for the test subject is compared to the value of the discriminant function determined in block 18.
In one example, the probability of presence of lung cancer in a test subject increases with the value of the discriminant function, as shown in
VOCs are thermally desorbed from the sorbent trap 62, separated by gas chromatography apparatus 70, and injected into mass spectrometry detector 72. In mass spectrometry detector 72 the VOCs are bombarded with energetic electrons in a vacuum and degraded into a set of ionic fragments, each with its own mass/charge (m/z) ratio. Data from gas chromatography apparatus 70 and mass spectrometry detector 72 is received at processor 74.
Although some embodiments herein refer to methods, it will be appreciated by one skilled in the art that they may also be embodied as a system or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “processor,” “device,” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable mediums having computer readable program code embodied thereon. Any combination of one or more computer readable mediums may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to CDs, DVDs, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. 
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The invention can be further illustrated by the following examples thereof, although it will be understood that these examples are included merely for purposes of illustration and are not intended to limit the scope of the invention unless otherwise specifically indicated. All percentages, ratios, and parts herein, in the Specification, Examples, and Claims, are by weight and are approximations unless otherwise stated.
Methods
Model-Building Phase—Unblinded for Detecting Lung Cancer
In the unblinded model-building phase, breath VOCs were analyzed with gas chromatography mass spectrometry to provide data in breath chromatograms. The human subjects from whom breath chromatograms were obtained are shown in Table 1 and comprised Group 1: 82 asymptomatic high-risk subjects, including smokers aged >=50 years, undergoing chest CT; Group 2: 84 symptomatic high-risk subjects with a tissue diagnosis; Group 3: 99 symptomatic high-risk subjects without a tissue diagnosis; and Group 4: 35 apparently healthy subjects free of lung cancer.
Multiple Monte Carlo simulations identified candidate breath VOC mass ions from the data with greater than random diagnostic accuracy for detecting lung cancer, and the determined candidate biomarkers were combined in the multivariate predictive algorithm.
In the blinded model-testing phase, breath VOCs were analyzed in a new set of human subjects. The subjects from whom breath chromatograms were obtained comprised Group 1: 68 asymptomatic high-risk subjects, including smokers aged >=50 years, undergoing chest CT; Group 2: 51 symptomatic high-risk subjects with a tissue diagnosis; Group 3: 76 symptomatic high-risk subjects without a tissue diagnosis; and Group 4: 19 apparently healthy subjects free of lung cancer. The multivariate algorithm predicted discriminant function (DF) values in blinded replicate samples analyzed independently at two laboratories (A and B).
The subjects of Group 3 are shown in Table 2.
Collection of breath VOC samples: Collection of breath VOC samples was performed in accordance with method 10 for identifying biomarkers and generating an output indicative of lung cancer and system 60. A subject wears a nose clip and breathes normally through a disposable valved mouthpiece and bacterial filter into the BCA for 2.0 min. Alveolar breath VOCs are captured on to a sorbent trap that is immediately sealed in a hermetic container. Since there is low resistance to expiration (~6 cm water), breath samples could be collected without discomfort from elderly patients and those with respiratory disease. In order to minimize the risk of potential site-dependent confounding factors such as environmental contamination of room air, subjects in all four groups donated breath samples in the same room at each clinical site. All subjects donated two samples for replicate assay at two independent laboratories (Menssana Research, Inc and American Westech, Inc., Harrisburg, Pa.). Samples were stored at −15° C. prior to analysis.
Analysis of breath VOC samples: Analysis of breath VOC samples was performed with method 10 for identifying biomarkers and generating an output indicative of lung cancer and system 60. Using automated instrumentation, VOCs were thermally desorbed from the sorbent trap 62, cryogenically concentrated, and assayed by gas chromatography mass spectrometry (GC MS). A known quantity of an internal standard (bromofluorobenzene) was automatically loaded on to all samples in order to normalize the abundance of VOCs and to facilitate alignment of chromatograms. A typical total ion chromatogram of breath VOCs is shown in
Analysis of data: GC MS data from both laboratories was pooled for analysis and development of a single predictive algorithm.
Alignment of single ion masses in chromatograms: Chromatograms were processed with metabolomic analysis software (XCMS in R) in order to generate a table listing retention times with their associated ion masses and intensities. Retention times and ion mass intensities were normalized to the bromofluorobenzene (ion mass 95) internal standard in each chromatogram. The aligned data was then binned into a series of 5 sec retention time segments.
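By way of non-limiting illustration, the intensity normalization and 5 sec binning described above might be sketched as follows in Python. This is not the XCMS processing itself; the function name `bin_peaks`, the tuple layout of a peak, the summing of intensities that fall in the same segment, and the passing of the internal standard intensity as a parameter are all assumptions made for this sketch. Normalization of retention times is omitted.

```python
from collections import defaultdict


def bin_peaks(peaks, standard_intensity, segment_sec=5.0):
    """Normalize intensities and bin peaks by retention time (hypothetical helper).

    peaks: iterable of (retention_time_sec, ion_mass, intensity) tuples.
    standard_intensity: intensity of the internal standard peak
        (bromofluorobenzene, ion mass 95) in the same chromatogram.
    Returns a dict mapping (segment_index, ion_mass) to normalized intensity.
    """
    binned = defaultdict(float)
    for retention_time, ion_mass, intensity in peaks:
        segment = int(retention_time // segment_sec)  # 0 to 5 s is segment 0, etc.
        # Assumption: intensities of the same ion mass within one segment are summed.
        binned[(segment, ion_mass)] += intensity / standard_intensity
    return dict(binned)


# Illustrative peaks from a hypothetical chromatogram.
table = bin_peaks(
    [(12.3, 95, 2000.0), (13.9, 95, 1000.0), (14.0, 43, 500.0), (15.0, 43, 250.0)],
    standard_intensity=1000.0,
)
```

Each key of the returned table identifies one mass ion in one 5 sec retention time segment, which is the unit ranked in the subsequent step.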
Identification of biomarker single ions: The statistical methods have been previously described. Mass ions as candidate biomarkers of lung cancer were ranked by comparing their intensity values in subjects with lung cancer (Group 3 lung cancer confirmed by tissue diagnosis shown in Table 3) to cancer-free controls (Group 1 with negative chest CT). In each 5 sec time segment, the diagnostic accuracy of each mass ion was ranked according to its C-statistic value [area under curve (AUC) of the receiver operating characteristic (ROC) curve]. Multiple Monte Carlo simulations were employed in order to minimize the risk of including random identifiers of disease by selecting the mass ions in each time segment that identified active lung cancer with greater than random accuracy. The average random behavior of mass ions in each time segment was determined by randomly assigning subjects to the “lung cancer” or the “cancer-free” group and performing 40 estimates of the C-statistic. For any given value of the C-statistic, it was then possible to identify the ionic biomarkers that exhibited greater diagnostic accuracy with correct assignment than with multiple random assignments.
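By way of non-limiting illustration, the comparison of correct assignment with random assignment for a single mass ion might be sketched as follows in Python. The function names `c_statistic` and `random_assignment_c_statistic`, the pairwise computation of the AUC, and the use of label shuffling to implement random assignment are assumptions made for this sketch; only the count of 40 random estimates is taken from the text above.

```python
import random


def c_statistic(cases, controls):
    """AUC of the ROC curve: fraction of case/control pairs in which the
    case has the higher intensity (ties count as one half)."""
    wins = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def random_assignment_c_statistic(cases, controls, n_estimates=40, seed=0):
    """Average C-statistic after randomly reassigning subjects to the two
    groups (hypothetical helper; group sizes are kept fixed)."""
    rng = random.Random(seed)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    total = 0.0
    for _ in range(n_estimates):
        rng.shuffle(pooled)
        total += c_statistic(pooled[:n_cases], pooled[n_cases:])
    return total / n_estimates


# Illustrative intensities of one mass ion in hypothetical subjects.
cancer = [5.1, 6.3, 7.2, 8.0, 6.8, 7.7]
cancer_free = [1.2, 2.4, 3.1, 4.0, 2.9, 3.6]
observed = c_statistic(cancer, cancer_free)
baseline = random_assignment_c_statistic(cancer, cancer_free)
is_candidate = observed > baseline  # better than random assignment
```

In this sketch a mass ion is retained as a candidate biomarker only when its C-statistic with correct assignment exceeds the average obtained with random assignment.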
Development of predictive algorithm: Biomarker ions that identified lung cancer with greater than random accuracy were employed to construct a predictive algorithm using multivariate weighted digital analysis (WDA).
Model-Testing Phase—Blinded for Detecting Lung Cancer
Blinding procedures: The independent monitor maintained a database of all clinical and diagnostic data, and this information was not shared with any participant in the research. Laboratories received no clinical information and only the subject identification number accompanied sorbent traps sent for analysis.
Human subjects: A new set of human subjects was recruited in the same fashion as described above in the model-building phase. No subject from the unblinded phase was included in the blinded phase of the research.
Collection of breath VOC samples and analysis of breath VOC samples were performed in the same fashion as described above in the model-building phase.
Prediction of outcomes: The predictive algorithm developed in the unblinded phase was applied to the mass ions in each of the blinded breath chromatograms in order to generate a discriminant function (DF) value. This procedure was replicated in duplicate breath samples that were analyzed at two laboratories. At the conclusion of the study, the resulting DF values with their associated subject identification numbers were transmitted to the monitor who then broke the blinding and determined the predictive accuracy of the breath test. There were no adverse effects associated with breath testing in either phase of the study.
It was found that in the unblinded model-building phase, the method of the present invention identified lung cancer with sensitivity 74.0%, specificity 70.7% and C-statistic 0.78 as shown in
This figure displays the expected improvement in sensitivity and specificity of chest CT for lung cancer if it is combined in parallel with breath testing. If both tests are positive for lung cancer, then specificity increases from 73.4% to 91.49%. If either test is positive, then sensitivity increases from 93.8% to 98.15%. These improvements were computed from the formulas for combining two independent tests (A and B) in parallel. If both tests are required to be positive, then sensitivity (sen)=(A)sen×(B)sen, and specificity (spec)=(A)spec+(B)spec−[(A)spec×(B)spec]. Compared to either test employed alone, their combined specificity is increased but sensitivity is reduced. If either test is positive, then sensitivity=(A)sen+(B)sen−[(A)sen×(B)sen] and specificity=(A)spec×(B)spec. Compared to either test employed alone, their combined sensitivity is increased but specificity is reduced.
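By way of non-limiting illustration, the formulas for combining two independent tests in parallel might be expressed as follows in Python. The function names are assumptions made for this sketch. In the usage lines, the chest CT values (93.8% and 73.4%) are those quoted above, while the breath test values (74.0% and 70.7%) are the model-building phase values and are used only as illustrative inputs, so the outputs are not expected to match the percentages quoted above.

```python
def both_positive(sen_a, spec_a, sen_b, spec_b):
    """Combined test counted positive only when tests A and B are both positive."""
    sensitivity = sen_a * sen_b
    specificity = spec_a + spec_b - spec_a * spec_b
    return sensitivity, specificity


def either_positive(sen_a, spec_a, sen_b, spec_b):
    """Combined test counted positive when test A or test B is positive."""
    sensitivity = sen_a + sen_b - sen_a * sen_b
    specificity = spec_a * spec_b
    return sensitivity, specificity


# Test A: chest CT (values from the text). Test B: illustrative breath test inputs.
ct_sen, ct_spec = 0.938, 0.734
breath_sen, breath_spec = 0.740, 0.707

sen_both, spec_both = both_positive(ct_sen, ct_spec, breath_sen, breath_spec)
sen_either, spec_either = either_positive(ct_sen, ct_spec, breath_sen, breath_spec)
```

Requiring both tests to be positive raises specificity above that of either test alone at the cost of sensitivity, and accepting either positive test does the reverse.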
The expected outcome of screening one million people for lung cancer is shown in Table 3.
Table 3 indicates TP=true positives, FN=false negatives, TN=true negatives, and FP=false positives. The main limiting factor in population screening programs is the potentially overwhelming number of false-positive test results. Screening one million people with chest CT alone would result in 263,074 false-positive test results, but if both chest CT and breath testing are required to be positive, the increased specificity would reduce this number to 88,919, i.e., by 66.2%. However, if either test is counted as positive, then the increased sensitivity would reduce the number of false-negatives from 682 to 198, i.e., by 71.0%.
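By way of non-limiting illustration, the arithmetic behind such a screening table might be sketched as follows in Python. The function names are assumptions made for this sketch. The prevalence of 1.1% is also an assumption: it is not stated in the text, but together with the chest CT sensitivity of 93.8% and specificity of 73.4% it reproduces the chest-CT-alone counts of 263,074 false positives and 682 false negatives.

```python
def screening_outcomes(population, prevalence, sensitivity, specificity):
    """Expected TP, FN, TN and FP counts for one screening test (hypothetical helper)."""
    diseased = round(population * prevalence)
    healthy = population - diseased
    tp = round(diseased * sensitivity)
    tn = round(healthy * specificity)
    return {"TP": tp, "FN": diseased - tp, "TN": tn, "FP": healthy - tn}


def percent_reduction(before, after):
    """Percentage fall from `before` to `after`, rounded to one decimal place."""
    return round(100.0 * (before - after) / before, 1)


# Chest CT alone, with an ASSUMED prevalence of 1.1% (not stated in the text).
ct_alone = screening_outcomes(1_000_000, 0.011, 0.938, 0.734)

# Reductions computed directly from the counts quoted in the text.
fp_reduction = percent_reduction(263_074, 88_919)  # false positives
fn_reduction = percent_reduction(682, 198)         # false negatives
```

The two percentage reductions evaluate to 66.2 and 71.0, in agreement with the figures stated above.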
The present results indicate that ionic biomarkers in breath accurately predicted the presence or absence of lung cancer in a blinded validation study. A multivariate algorithm predicted the diagnosis from replicate breath samples independently analyzed at two laboratories, and the sensitivity, specificity, and overall accuracy of the test were similar at both sites. The outcome of the test was not significantly affected by age or pack-years of tobacco smoking.
The breath test for biomarker ions can improve both the sensitivity and the specificity of chest CT if the two tests are employed in parallel. In a program to screen one million asymptomatic high-risk subjects for lung cancer with chest CT alone, the expected outcome would include 263,074 false-positive test results. However, if chest CT and a breath test are combined in parallel and both are required to be positive, the number of false-positive results would be expected to fall to 88,919, a reduction of 66.2%. Similarly, if either test is counted as positive, then the number of false-negatives would be expected to fall from 682 to 198, i.e., by 71.0%. As a result, combined parallel testing could potentially facilitate large-scale screening for lung cancer by reducing the economic costs and the potential harms of false-positive and false-negative test outcomes that are currently associated with chest CT.
Model-Building Phase—Unblinded for Detecting Breast Cancer
Collection of breath VOC samples: Collection of breath VOC samples was performed in accordance with method 10 for identifying biomarkers and generating an output indicative of breast cancer and system 60. A subject wears a nose clip and breathes normally through a disposable valved mouthpiece and bacterial filter into the BCA for 2.0 min. Alveolar breath VOCs are captured on to a sorbent trap that is immediately sealed in a hermetic container.
VOCs in 54 women with biopsy-proven breast cancer and in 204 healthy controls were analyzed. Subjects were randomly assigned to a training set (2/3) and a test set (1/3). Analysis of breath VOC samples: Analysis of breath VOC samples was performed with method 10 for identifying biomarkers and generating an output indicative of breast cancer and system 60. Using automated instrumentation, VOCs were thermally desorbed from the sorbent trap 62, cryogenically concentrated, and assayed by gas chromatography mass spectrometry (GC MS). A known quantity of an internal standard (bromofluorobenzene) was automatically loaded on to all samples in order to normalize the abundance of VOCs and to facilitate alignment of chromatograms.
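By way of non-limiting illustration, the random assignment of subjects to a training set (2/3) and a test set (1/3) might be sketched as follows in Python. The function name `split_subjects`, the subject identifiers, the fixed seed, and the use of a simple unstratified shuffle are assumptions made for this sketch; the text does not state how the random assignment was carried out.

```python
import random


def split_subjects(subject_ids, train_fraction=2 / 3, seed=0):
    """Randomly split subjects into a training set and a test set (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# 54 women with breast cancer and 204 healthy controls, with made-up identifiers.
subjects = [f"BC{i:03d}" for i in range(54)] + [f"HC{i:03d}" for i in range(204)]
training_set, test_set = split_subjects(subjects)
```

With 258 subjects this yields 172 subjects in the training set and 86 in the test set.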
Analysis of data: GC MS data from both laboratories was pooled for analysis and development of a single predictive algorithm.
Alignment of single ion masses in chromatograms: Chromatograms were processed with metabolomic analysis software (XCMS in R) in order to generate a table listing retention times with their associated ion masses and intensities, and binned into a series of 5 sec retention time segments. In the training set, mass ions in each time segment were ranked according to their diagnostic accuracy i.e. the area under curve (AUC) of the receiver operating characteristic (ROC) curve. Correct assignment curve 50b shown in
Multiple Monte Carlo simulations were employed in order to minimize the risk of including random identifiers of disease by selecting the mass ions in each time segment that identified active breast cancer with greater than random accuracy. The mass ions with the highest diagnostic accuracy were then combined in a predictive algorithm using multivariate weighted digital analysis (WDA). The algorithm was used to predict the diagnosis in the test set.
It was found that the method of the present invention using the WDA algorithm employing 21 mass ion biomarkers identified breast cancer with sensitivity 79.0% in the training set as shown in
It is to be understood that the above-described embodiments are illustrative of only a few of the many possible specific embodiments, which can represent applications of the principles of the invention. Numerous and varied other arrangements can be readily devised in accordance with these principles by those skilled in the art without departing from the spirit and scope of the invention.
The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.
Number | Date | Country
---|---|---
62174256 | Jun 2015 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 15177695 | Jun 2016 | US
Child | 15263621 | | US