Lung diseases impair lung function and, according to the American Lung Association, are the third leading cause of death in America, accounting for one in six deaths. The main categories of lung disease include airway diseases, lung tissue diseases and pulmonary circulation diseases, as well as combinations of the above. Examples of diseases affecting lung function include asthma, chronic obstructive pulmonary disease (COPD), lung cancer, alpha-1 antitrypsin deficiency, respiratory distress syndrome, chronic bronchitis, chronic systemic inflammation, and inflammatory respiratory disease, among others.
COPD is the fourth leading cause of morbidity and mortality in the United States and is expected to rank third as the cause of death, worldwide, by 2020 (Mannino and Braman, 2007, Proceedings of the American Thoracic Society 4:502-506). Cigarette smoking is widely recognized as a primary causative factor of COPD and accounts for approximately 80-99% of all cases in the United States. COPD is characterized by chronic airflow limitation, measured spirometrically by the ratio of the forced expiratory volume in one second (FEV1) to the forced vital capacity (FVC), and associated with an abnormal inflammatory response of the lung to noxious particles or gases. The operational diagnosis of lung diseases such as COPD has traditionally been made by spirometry, as a ratio of FEV1 to FVC below 70% (Rabe et al., 2007, American Journal of Respiratory and Critical Care Medicine 176:532-555).
Prior diagnostic methods of COPD and other lung diseases employ diagnostic tests which rely on the presumed correlation of decreased pulmonary function with lung disease such as COPD, asthma, fibrosis, emphysema and others. While lung function tests can provide a general assessment of the functional status of a subject's lungs, the tests do not distinguish between the different types of lung diseases that may be present. For example, certain diseases such as asthma cannot be confirmed based on functional tests alone. In addition, it is only when a measurable change in lung function exists that such tests aid in the diagnosis of a lung disease.
Studies of mechanisms underlying lung diseases are hampered by the procedures required to obtain samples of disease tissue. In particular, studies investigating differential gene expression associated with lung disease have been hindered by the invasiveness of procedures used to obtain sample tissue from diseased and normal subjects. Methods which provide an accurate diagnosis of lung disease prior to the development of measurable changes in lung function, using less invasive tissue sampling techniques, would be desirable.
Novel gene biomarkers of lung function are provided. In one aspect, the gene biomarkers are identified using comparisons of gene expression profiles in subjects with a lung disease and in subjects not having the disease. In another aspect, the profiles are obtained using a method comprising high-throughput analysis. Compositions and devices comprising the novel gene biomarkers are also provided. The gene biomarkers also are useful as prognostic or diagnostic indicators of lung disease or as an indicator of a subject's risk of developing lung disease. In an additional aspect, the lung disease is COPD.
In one embodiment, gene biomarkers of lung function comprise one, two, three, four, five, six, seven, eight or more genes selected from the group of genes set forth in Supplementary Table II. In another embodiment a gene biomarker of lung function is selected from a nucleic acid molecule (polynucleotide) having a nucleotide sequence of a gene set forth in Supplementary Table II, or a nucleic acid molecule (polynucleotide) having a sequence with 70-99% identity to the nucleic acid sequence of a gene set forth in Supplementary Table II, or a fragment thereof. In another embodiment a gene biomarker of lung function is selected from a nucleic acid molecule comprising a nucleotide sequence of a gene selected from IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, and RPL14, or a nucleic acid molecule comprising a sequence with 70-99% identity to the nucleic acid sequence of a gene selected from IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, and RPL14, or a fragment thereof. It is understood that such nucleic acid molecules and fragments thereof include the sequence of the coding strand or the non-coding strand of the gene, or a fragment thereof, unless stated otherwise. It is also understood that such nucleic acid molecules and fragments may comprise the sequences found in the exons and/or introns of the genes set forth in Supplementary Table II unless stated otherwise.
The present disclosure provides for a composition comprising nucleic acids having the nucleotide sequence of a gene biomarker of lung function. In one embodiment the disclosure provides for compositions comprising two nucleic acid molecules wherein the first nucleic acid molecule comprises a first nucleotide sequence and the second nucleic acid molecule comprises a second nucleotide sequence, wherein the first nucleotide sequence differs from the second nucleotide sequence and the first and second nucleotide sequences are selected independently from the group consisting of the nucleotide sequences of the genes set forth in Supplementary Table II, or a sequence having 70-99% identity to the nucleotide sequences of the genes set forth in Supplementary Table II, or a fragment thereof. In other embodiments the disclosure provides for compositions further comprising a third, fourth, fifth, sixth, seventh, eighth and/or ninth nucleic acid molecule.
Also provided is a device comprising a plurality of locations (e.g., a chip or slide bearing an array), wherein 2, 3, 4, 5, 6, 7, 8 or more of said locations each comprise a different nucleic acid molecule comprising a nucleotide sequence of a gene set forth in Supplementary Table II, or a sequence having 70-99% identity to the nucleotide sequence of a gene set forth in Supplementary Table II, or a fragment thereof (e.g., a fragment of the protein coding exon regions).
In one embodiment, the disclosure provides a method of identifying a gene biomarker associated with lung disease by employing statistical analysis of nucleic acid sequences differentially expressed in subjects having lung disease as compared to control subjects without the disease. In one aspect, the gene biomarkers of lung disease are identified as the group of genes set forth in Supplementary Table II. In another embodiment, the gene biomarkers of lung function are identified as one or more genes (or nucleic acids encoding those genes) selected from: IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, and RPL14. Exemplary lung diseases include, for example, asthma, chronic obstructive pulmonary disease, lung cancer, alpha-1 antitrypsin deficiency, respiratory distress syndrome, chronic bronchitis, chronic systemic inflammation, and inflammatory respiratory disease, among others. In one embodiment, lung diseases or disorders may exclude cancers and/or tumors of the lungs, airways, or of other respiratory tissues. In another embodiment lung diseases may exclude one or more of asthma, chronic bronchitis, chronic systemic inflammation or inflammatory respiratory disease.
In one embodiment, a diagnostic and/or prognostic method of assessing lung disease in a subject is provided, wherein the method includes use of two or more described gene biomarkers. In one aspect, the method includes detecting expression of two or more gene biomarkers in a biological sample obtained from a subject. In another embodiment, the method includes measurement of the level of expression of a gene biomarker selected from: IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, and RPL14.
In another aspect, the present disclosure provides a method of monitoring an increase in the severity of lung disease in a subject by comparing expression profiles of two or more gene biomarkers in the subject at a first time point versus a second time point, wherein a difference in the expression profiles indicates an increase in severity of the subject's lung disease. In one embodiment, the gene biomarker is selected from: IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, and RPL14 (including sequences complementary to those encoding mRNAs).
In an additional aspect, the gene biomarkers are useful as prognostic indicators of lung disease. Thus, in one embodiment, the present disclosure provides a method of determining the prognosis of a lung disease in a subject by detecting in a subject sample expression of two or more gene biomarkers at a first point in time and then at a second point in time, and comparing the profile of gene biomarkers expressed at the second time point versus the first time point to determine the prognosis of the lung disease in a subject. In one embodiment, the gene biomarker is selected from: IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, and RPL14 (and complementary sequences thereof).
Also provided are kits for use in the diagnosis, prognosis and treatment of lung disease comprising one or more of the gene biomarkers or compositions described herein.
The present disclosure provides compositions and methods of identifying genes as biomarkers of lung disease and compositions and kits comprising materials (e.g., nucleic acids and/or protein affinity reagents such as antibodies) for use in assessing nucleic acid and protein expression from those genes. Also provided are methods of using the novel biomarkers for diagnostic, prognostic and predictive measures of a subject's lung disease. In one embodiment, the lung disease is COPD, and biomarkers for the diagnostic, prognostic and predictive measures of a subject's lung disease are provided by identifying genes differentially expressed in subjects with COPD compared to control subjects. Other exemplary diseases include, but are not limited to, obstructive pulmonary disease, chronic systemic inflammation, emphysema, asthma, pulmonary fibrosis, cystic fibrosis, obstructive lung disease, pulmonary inflammatory disorder, and lung cancer.
In one embodiment an individual or a population of individuals may be considered as not having lung disease or impaired lung function when they do not exhibit clinically relevant signs, symptoms, and/or measures of lung disease. Thus, in various aspects, an individual or a population of individuals may be considered as not having chronic obstructive pulmonary disease, chronic systemic inflammation, emphysema, asthma, pulmonary fibrosis, cystic fibrosis, obstructive lung disease, pulmonary inflammatory disorder, or lung cancer when they do not manifest clinically relevant signs, symptoms and/or measures of those disorders. In another embodiment, an individual or a population of individuals may be considered as not having lung disease or impaired lung function, such as COPD, when they have a FEV1/FVC ratio greater than or equal to about 0.70 or 0.72 or 0.75. In another embodiment, an individual or population of individuals that may be considered as not having lung disease or impaired lung function are sex- and age-matched with test subjects (e.g., age matched to 5 or 10 year bands) and are current or former cigarette smokers without apparent lung disease who have an FEV1/FVC≧0.70 or ≧0.75. Individuals or populations of individuals without lung disease or impaired lung function may be employed to establish the normal range of proteins, peptides or gene expression. Individuals or populations of individuals without lung disease or impaired lung function may also provide samples against which to compare one or more samples taken from a subject (e.g., samples taken at one or more different first and second times) whose lung disease or lung function status may be unknown. In other embodiments, an individual or a population of individuals may be considered as having lung disease or impaired lung function when they do not meet the criteria of one or more of the above mentioned embodiments.
In one embodiment, control subjects, as that term is used herein, are sex- and age-matched current or former cigarette smokers without apparent lung disease who have FEV1/FVC≧0.70. Age matching may be conducted in bands of several years, including 5, 10 or 15 year bands. Control subjects are preferably recruited from the same clinical settings. A control group is more than one, and preferably a statistically significant number of, control subjects. In one embodiment control subjects are sex- and age-matched (in 10 year bands) current or former cigarette smokers without apparent lung disease who had FEV1/FVC≧0.70.
In one embodiment, a control sample is a sample from one or more control subjects or which provides a result representative of tests conducted on a control group. In another embodiment, a control sample is a sample from a subject without lung disease (e.g., COPD) or which provides a result representative of tests conducted on subjects without lung disease. In another embodiment a control sample is a sample containing a known amount (e.g., in mass, number of moles, or concentration) of one or more nucleic acids and/or proteins.
As described herein, a “gene biomarker” is a gene, or a nucleic acid sequence, such as the sequence of a gene, or fragment thereof, which is differentially expressed in a sample obtained from an individual having one phenotypic status (e.g., having a lung disease such as COPD) as compared with an individual having another phenotypic status (e.g., a control subject without a lung disease). A biomarker is an assayable nucleic acid sequence (or fragment thereof) that is used to identify, predict, or monitor a condition related to lung disease, such as COPD, or a therapy for such a condition, in a subject or sample obtained from a subject. The presence, absence, or relative amount of a gene biomarker can be used to identify a condition or status of a condition in a subject or sample obtained from that subject. Proteins that are encoded by a nucleic acid gene biomarker may be assayed as surrogates for the nucleic acid, and may be understood to be a biomarker or gene biomarker in that circumstance.
A gene biomarker may be characterized using a variety of approaches. Exemplary methodologies include, but are not limited to, the use of the polymerase chain reaction, sequencing, quantitative polymerase chain reaction, quantitative real-time polymerase chain reaction, protein or DNA array, microarray, ligase chain reaction, and oligonucleotide ligation assay, as well as use of high-throughput techniques such as cDNA microarray followed by statistical analysis to identify those nucleic acid sequences which are differentially expressed in subjects having lung disease as compared to control subjects.
A biomarker is differentially expressed between different phenotypic statuses if the expression level of the biomarker in the different groups is calculated to be statistically significantly different. Exemplary statistical analysis includes, among others, random forest analysis (Breiman, 2001, Random Forests. Machine Learning 45:5-32), L1 penalized logistic regression (Tibshirani, 1996, Journal of the Royal Statistical Society B 58:267-288) and use of the R programming environment (R Development Core Team 2007, R: a language and environment for statistical computing. http://www.R-project.org).
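As a minimal illustration of what "statistically significantly different" expression between two groups can mean, the following sketch computes an exact two-sided permutation p-value for a difference in group means. This is a deliberately simple stand-in and is not the random forest or L1 penalized logistic regression analysis named above; the function name and the expression values are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(cases, controls):
    """Exact two-sided permutation p-value for a difference in group means.

    Every way of relabeling the pooled values into a "case" group of the
    same size is enumerated, so this is only practical for small groups.
    """
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    n_rest = len(pooled) - n_cases
    total = sum(pooled)
    observed = abs(mean(cases) - mean(controls))

    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_cases):
        group_sum = sum(pooled[i] for i in idx)
        diff = abs(group_sum / n_cases - (total - group_sum) / n_rest)
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical expression values for one gene (arbitrary units).
case_levels = [10, 11, 12]
control_levels = [1, 2, 3]
p = permutation_p_value(case_levels, control_levels)  # 2 of 20 relabelings -> 0.1
```

With only three subjects per group the smallest attainable two-sided p-value is 0.1, which illustrates why a control group with a statistically meaningful number of subjects matters.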
Gene biomarkers, alone or in combination, are useful as diagnostic markers of: lung disease; determining therapeutic effectiveness of a treatment for lung disease and/or lung disease progression; determining prognosis of lung disease; and/or for determining an individual's relative risk of developing lung disease.
Methods for identifying gene biomarkers are useful as diagnostic or prognostic indicators of different classifications and/or severity of lung disease by comparison of gene biomarkers differentially expressed in subjects having lung disease varying in degrees of severity or symptoms. In one embodiment, the gene biomarkers of lung function may be used as prognostic indicators of how likely a subject having lung disease is to experience an increase in disease symptoms or how severe those symptoms may become. In one embodiment, the greater the difference in expression of the gene biomarkers of lung function (e.g., IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, and RPL14) in a subject with suspected lung disease when compared to control subjects, the more likely the subject is to have the disease.
Gene biomarkers may also be identified by analysis of nucleic acid sequences differentially expressed by a subject with a lung disease as compared to nucleic acid sequences expressed by gender-matched control subjects. Identification of nucleic acid sequences that are differentially abundant among subjects with lung disease as compared to control subjects (e.g., COPD subjects having mild to moderate COPD with rapid or slow decline in lung function versus age- and gender-matched smokers without COPD) allows an understanding of the mechanisms underlying a lung disease and its related decline in lung function. Such nucleic acid sequences are useful as gene biomarkers for diagnostic and prognostic determinants of lung disease and/or assessing a subject's relative risk of developing a lung disease.
In one embodiment, methods for determining gene expression profiles include determining the amount of RNA that is produced by a gene encoding a polypeptide. Such methods include, but are not limited to, the use of reverse-transcriptase PCR (RT-PCR), competitive RT-PCR, real time RT-PCR, differential display RT-PCR, Northern Blot analysis and other related assays. The methods include the use of individual PCR reactions as well as amplification of complementary DNA (cDNA) and/or complementary RNA (cRNA) produced from mRNA and analysis via microarray.
Gene expression profiling using microarray analysis allows measurement of the steady-state mRNA level of thousands of genes simultaneously. Microarray techniques useful in the methods described herein are known in the art and are described, for example, in U.S. Pat. No. 6,271,002; U.S. Pat. No. 6,218,122; U.S. Pat. No. 6,218,114; and U.S. Pat. No. 6,004,755.
A gene biomarker may be detected in any tissue of interest from a subject suspected of having, at risk of having, or diagnosed as having a lung disease. Biological samples obtained from a subject that are suitable for detection of gene biomarkers include, but are not limited to, serum, plasma, blood, lymphatic fluid, cerebrospinal fluid, saliva, and epithelial cells, such as those available from a buccal swab. It is known that the transcriptome of peripheral blood leukocytes (PBLs) reflects a majority of genes actively expressed in a subject. Thus, PBLs are useful as a target tissue “surrogate” for identifying genes differentially expressed in diseased subjects as compared to control subjects. As such, the present disclosure also provides a method of identifying the presence of a gene biomarker in a biological sample of a subject obtained using less invasive sampling techniques. A biological sample includes peripheral blood cells, which are readily accessible using traditional blood drawing techniques such as, for example, venipuncture or finger prick.
In one embodiment, a gene biomarker of lung disease is selected from the nucleic acid sequence of a gene set forth in Supplementary Table II. In another embodiment, a gene biomarker of lung disease is a nucleic acid sequence encoding IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE and RPL14, or a complementary sequence thereof (i.e., IL6R complementary sequence, CCR2 complementary sequence, PPP2CB complementary sequence, RASSF2 complementary sequence, WTAP complementary sequence, DNTTIP2 complementary sequence, GDAP1 complementary sequence, LIPE complementary sequence and RPL14 complementary sequence), or a fragment thereof.
In another embodiment, the present disclosure provides a composition comprising two, three, four, five, six, seven, eight or nine nucleic acid molecules, wherein each nucleic acid molecule differs from the other nucleic acid molecules and each nucleic acid molecule comprises a nucleotide sequence that is selected independently from the nucleic acid sequences of the genes set forth in Supplementary Table II, their complements, or a sequence having 70-99% identity to the nucleic acid sequences of the genes set forth in Supplementary Table II, or a fragment thereof. Moreover, such a composition may contain two, three, four, five, six, seven, eight or nine nucleic acid molecules that are directed to different sequences selected independently from the nucleic acid sequences of the genes set forth in Supplementary Table II, or a sequence having 70-99% identity to the nucleic acid sequences of the genes set forth in Supplementary Table II, or a fragment thereof. It is understood that such nucleic acid molecules may have the sequence of the coding strand or the non-coding strand of the gene, or a fragment thereof. In aspects of such an embodiment, the fragments may be selected independently to have lengths greater than about 20, 22, 23, 24, 25, 26, 27, 28, 32, 34, 36, 38, 40, 50, 60, 75, 100, or 150 contiguous nucleotides of those sequences.
In another embodiment, the present disclosure provides a composition comprising two, three, four, five, six, seven, eight or nine different nucleic acid molecules where each comprises a nucleotide sequence that is complementary to a fragment greater than about 20, 22, 23, 24, 25, 26, 27, 28, 32, 34, 36, 38, 40, 50, 60, 75, 100, or 150 contiguous nucleotides of the coding or non-coding strand of a gene set forth in Supplementary Table II, an RNA or cDNA transcribed from a gene set forth in Supplementary Table II, or the protein coding regions (exons) thereof.
Nucleic acid molecules, which may also be referred to herein as polynucleotides, “polynucleotide probes” or simply as “probes”, may be immobilized on a substrate. In one embodiment, the present disclosure provides a device comprising one or more nucleic acid molecules immobilized on a substrate wherein each probe includes a gene biomarker. In another embodiment, the device comprises a plurality of nucleic acid molecules, each probe stably associated with (e.g., covalently bound to) and having a unique position on the substrate. In one embodiment, the substrate comprises an array or microarray device. In yet another embodiment the array comprises an array of nucleic acid molecules wherein two, three, four, five, six, seven, eight or nine different nucleic acid molecules are gene biomarkers of lung disease described herein (e.g., IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, and RPL14).
Nucleic acid molecules comprising a nucleotide sequence of a gene biomarker of lung disease may also be immobilized on beads or nanoparticles, such as gold, platinum, or silver nanoparticles. Nucleic acid molecules comprising a nucleotide sequence of a gene biomarker of lung disease may also be detectably labeled. In one embodiment, the label is detectable by fluorescence, or UV/Visible spectroscopic means. In other embodiments, the label is a nanoparticle such as a colloidal metal nanoparticle that is detectable by spectroscopic means including plasmon resonance. In still other embodiments, the label is a radioactive label.
Another embodiment is directed to a device comprising two, three, four, five, six, seven or eight different nucleic acid molecules that comprise the sequence of a gene biomarker of lung disease. In one embodiment the nucleic acid molecule(s) comprises a nucleotide sequence having greater than about 20, 22, 23, 24, 25, 26, 27, 28, 32, 34, 36, 38, 40, 50, 60, 75, 100, or 150 contiguous nucleotides of a gene biomarker of lung disease set forth in Supplementary Table II. In such embodiments the device can be an array wherein each nucleic acid molecule is fixed at a spatially addressable location.
The disclosure provided herein employs highly sensitive techniques for identification of gene biomarkers that have low systemic levels in a subject. In one embodiment, a biological sample may be analyzed by use of an array technology and methods employing arrays such as, for example, a nucleic acid microarray or a biochip bearing an array of nucleic acids. An array or biochip generally comprises a solid substrate having a generally planar surface, to which a capture reagent is attached. Frequently, the surface of an array or biochip comprises a plurality of addressable locations, each of which has a capture reagent bound thereon. In one embodiment the arrays will permit the detection and/or quantitation of two, three, four, five, six, seven, or eight or more different biomarkers associated with COPD or its progression. In another embodiment the array will comprise addressable locations for capturing/binding and/or measuring two, three, four, five, six, seven, eight or more different gene biomarkers of lung disease. In one embodiment the gene biomarkers of lung disease are selected from nucleic acid sequences of one or more genes selected from IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE and RPL14 (including the coding strand, non-coding strand, or exons thereof).
In one particular embodiment, the methods are provided using one or more gene biomarkers for diagnosing the presence of a lung disease or for determining a risk of developing a lung disease in a subject. A gene biomarker may include a nucleic acid sequence or fragment thereof encoding IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, RPL14, IL6R complementary sequence, CCR2 complementary sequence, PPP2CB complementary sequence, RASSF2 complementary sequence, WTAP complementary sequence, DNTTIP2 complementary sequence, GDAP1 complementary sequence, LIPE complementary sequence or RPL14 complementary sequence. A lung disease may include, but is not limited to, asthma, COPD, lung cancer, alpha-1 antitrypsin deficiency, respiratory distress syndrome, chronic bronchitis, chronic systemic inflammation, and inflammatory respiratory disease, which may, or may not, include lung cancer in any embodiment described herein. In one aspect the biological sample is a blood sample, a plasma sample, a serum sample, a urine sample, a lymphatic fluid sample, saliva sample or a sputum sample.
In one aspect, the present disclosure provides a method for identifying gene biomarkers of a disease that are associated with either a slow decrease or a rapid decrease in lung function. Methods are also provided for discriminating between a rapid and a slow decline in lung function and/or methods for identifying a subject as having an increased risk of developing a rapid decline in lung function or an increased risk of developing a slow decline in lung function by use of a gene biomarker. As used herein, the term “increased risk” refers to a statistically higher frequency of occurrence of the disease or disorder in an individual in comparison to the average frequency of occurrence of the disease or disorder in a population. A “decreased risk” refers to a statistically lower frequency of occurrence of the disease or disorder in an individual in comparison to the average frequency of occurrence of the disease or disorder in a population.
In another embodiment, the status of a subject's lung disease may be determined by measuring the quantity of one or more particular gene biomarkers present in a biological sample from that subject, and correlating the quantity of each biomarker with a previously determined measure of the severity of the disease based on the presence and/or quantity of one or more particular gene biomarkers present in a test sample from the subject. As used herein, the term “status” refers to the degree of severity of a subject's lung disease such as, for example, the number or degree of severity of symptoms presented or exhibited by the subject with the lung disease. The symptoms associated with different forms of lung diseases may differ between forms of lung diseases or may overlap. For example, exemplary symptoms commonly associated with COPD include destruction or decreased function of the air sacs in the lungs, cough producing mucus that may be streaked with blood, fatigue, frequent respiratory infections, headaches, dyspnea, swelling of extremities, and wheezing. A subject with COPD may have a few to all of these symptoms. A subject with an early stage of COPD may exhibit one, two, three, or only a few of those symptoms.
In another embodiment, the present disclosure provides a method of determining the status of a subject's lung disease by assessing the level of expression of one or more gene biomarkers during the course of the subject's lung disease. Such assessment includes (1) measuring at a first time point the level of expression of one or more gene biomarkers of lung disease in a subject's sample, (2) measuring the same biomarker(s) at a second time, and (3) comparing the first measurement to the second measurement, wherein a difference between the two measurements indicates the status of the lung disease, such as an increase or decrease in severity of the disease. In one embodiment a gene biomarker of a lung disease or an impaired lung function measure is selected from the group consisting of: IL6R, CCR2, PPP2CB, RASSF2, WTAP, DNTTIP2, GDAP1, LIPE, RPL14, or fragments thereof. In other aspects the method further comprises measuring two, three, four, five, six, seven, or eight, or more different gene biomarkers of lung disease.
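The three-step comparison described above can be sketched as follows. This is only an illustration under stated assumptions: expression levels are taken to be positive numbers on a linear scale, the difference is summarized as a log2 fold change, and the threshold of 1.0 (a two-fold change) is a hypothetical cutoff. The source does not specify a cutoff, nor which direction of change corresponds to increased severity for any gene, and the numeric values below are invented.

```python
import math


def log2_fold_changes(first, second):
    """Return log2(second / first) for each biomarker measured at both times."""
    changes = {}
    for gene, level_1 in first.items():
        level_2 = second.get(gene)
        # Skip genes that are missing or non-positive at either time point.
        if level_2 is None or level_1 <= 0 or level_2 <= 0:
            continue
        changes[gene] = math.log2(level_2 / level_1)
    return changes


def changed_biomarkers(first, second, threshold=1.0):
    """Biomarkers whose absolute log2 fold change meets a hypothetical threshold."""
    return {
        gene: change
        for gene, change in log2_fold_changes(first, second).items()
        if abs(change) >= threshold
    }


# Hypothetical measurements (arbitrary units) at a first and a second time.
time_1 = {"IL6R": 100.0, "CCR2": 50.0}
time_2 = {"IL6R": 400.0, "CCR2": 55.0}
flagged = changed_biomarkers(time_1, time_2)  # {"IL6R": 2.0}
```

Interpreting a flagged change as an increase or decrease in severity would require reference data for each biomarker, which this sketch does not attempt to supply.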
Techniques for use in a method of measuring an increased or decreased expression of gene biomarkers include the use of quantitative assays for nucleic acids and proteins, including for example, polymerase chain reaction, array detection and measurement of proteins (e.g., using immobilized antibodies), quantitative RT-PCR (reverse transcriptase followed by PCR for measuring mRNA for example), quantitative real time PCR, multiplex PCR, quantitative DNA array analysis, autoradiograph analysis, quantitative hybridization, immunoassays (e.g., ELISA, Western, or sandwich assays), quantitative rRNA-based amplification, fluorescent probe hybridization, fluorescent nucleic acid sequence specific amplification, loop-mediated isothermal amplification and/or ligase chain reaction.
In one embodiment, the present disclosure provides a method of managing a subject's lung disease whereby a therapeutic treatment plan is customized/personalized or adjusted based on the status of the disease. Exemplary therapeutic treatments for lung disease include administering to the subject one or more of: immunosuppressants, corticosteroids (e.g., betamethasone delivered by inhaler), β2-adrenergic receptor agonists (e.g., short acting agonists such as albuterol), anticholinergics (e.g., ipratropium, or a salt thereof delivered by nebulizer), and/or oxygen. In addition, where the lung disease is caused or exacerbated by bacterial or viral infections, one or more antibiotics or antiviral agents may also be administered to the subject.
The materials and reagents required for diagnosing a lung disease, for determining the prognosis of a lung disease, or for use in the treatment or management of lung disease in a subject may be assembled together in a kit. A kit comprises one or more biomarker probes and a control nucleic acid sequence (e.g., present in a known quantity or amount), wherein the control nucleic acid sequence corresponds to a sequence that is not a gene biomarker of lung disease. The kit may be used for diagnosing, identifying prognosis, and/or predicting a lung disease in a subject. The kit generally will comprise components and reagents necessary for determining one or more biomarkers in a biological sample as well as control and/or standard samples. For example, a kit may include probes and/or antibodies specific to the one or more proteins, or peptide fragments of proteins, encoded by a gene set forth in Supplementary Table II for use in a quantitative assay such as RT-PCR, in situ hybridization, microarray and/or biochip detection. In another embodiment, the kit may include compositions with gene expression products in ratios found in individuals having lung disease and/or compositions with gene expression products in ratios found in individuals not having a lung disease, thus avoiding the use of control gene(s) or control sample(s) from “control” subjects. In some embodiments, the kit includes a pamphlet which includes a description of use of the kit in relation to COPD diagnosis, prognosis, or therapeutic management and instructions for analyzing results obtained using the kit.
A cDNA microarray was used to obtain data to identify genes differentially expressed in PBLs between adult cigarette smokers or other subjects with or without COPD. In a training set of Cases and Controls clearly defined by spirometric criteria, random forest statistical modeling was used to generate a list of variables that predicted COPD classification. This list was then subjected to an L1 penalized logistic regression model to create a more focused set of variables. Both lists were assessed in a test set of subjects with spirometric parameters that closely bordered the generally acceptable spirometric diagnostic value for COPD. The identified genes were analyzed for their ontology assignment and pathway involvement. The gene expression profiles identified in this study are novel biomarkers for COPD and provide insight into disease mechanisms.
Study Design and Subjects
The COPD Biomarker Discovery Study (CBD) included male and female self-reported cigarette smokers, aged 45 years or older, with at least 10 pack-years smoking history that were recruited from the University of Utah Health Sciences Network of local clinics and hospitals and from community physician offices. COPD was diagnosed in 300 subjects according to the Global Initiative for Chronic Obstructive Lung Disease (GOLD) spirometric guidelines as having a ratio of forced expiratory volume in 1 second (FEV1) to forced vital capacity (FVC)<0.70 (Rabe et al. 2007, American Journal of Respiratory and Critical Care Medicine 176:532-555). The Control group included 425 sex- and age-matched, current or former cigarette smokers, without apparent lung disease with FEV1/FVC≧0.70. Individuals who had recent exacerbation of COPD, uncontrolled angina, hypertension, or allergy to albuterol, and females who were pregnant or lactating were excluded. Demographic variables, respiratory symptoms, medical history, tobacco use history, and concomitant medications were assessed. Pack-years were calculated as (maximum average cigarettes smoked per day over total smoking history/20)×(total years smoking). Body weight and height were measured. Spirometry was performed with a rolling seal spirometer by certified pulmonary function technicians according to American Thoracic Society guidelines (Miller et al. 2005, European Respiratory Journal 26:319-338). Measurements of FEV1 and FVC were made before and at least 20 min after inhaled bronchodilator administration (albuterol 180 μg). The FEV1/FVC ratio was calculated for each subject from the highest post-bronchodilator values of FEV1 and FVC. A blood sample was collected for assessment of carboxyhemoglobin (COHb) and complete blood cell counts. In a subgroup of 81 subjects with COPD and 61 unaffected (Control) subjects, a whole blood sample was also obtained for assessment of gene expression in PBLs.
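The pack-year formula and the spirometric Case/Control rule described above can be illustrated with a short Python sketch. The function names and the example inputs are hypothetical and are not part of the study; only the formula and the 0.70 cutoff come from the text.

```python
def pack_years(max_avg_cigs_per_day: float, years_smoked: float) -> float:
    """Pack-years = (maximum average cigarettes per day / 20) x total years smoking."""
    return (max_avg_cigs_per_day / 20.0) * years_smoked


def gold_status(fev1_liters: float, fvc_liters: float, cutoff: float = 0.70) -> str:
    """Classify by post-bronchodilator FEV1/FVC: below the cutoff is a Case."""
    ratio = fev1_liters / fvc_liters
    return "Case" if ratio < cutoff else "Control"


def meets_smoking_criterion(max_avg_cigs_per_day: float, years_smoked: float) -> bool:
    """Enrollment required at least 10 pack-years of smoking history."""
    return pack_years(max_avg_cigs_per_day, years_smoked) >= 10.0
```

For instance, a hypothetical subject who smoked 30 cigarettes per day for 20 years has 30 pack-years, and a subject with FEV1 of 2.0 L and FVC of 4.0 L (ratio 0.50) would be classified as a Case.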
Blood Sample Collection and Processing
Whole blood samples were obtained from each subject by venipuncture using 10 mL EDTA Vacutainer® tubes (BD, Franklin Lakes, N.J., USA). COHb, hemoglobin, hematocrit and total and differential white blood cell (WBC) counts were measured at ARUP Laboratories™, a national, CLIA (Clinical Laboratory Improvement Amendments of 1988)-certified reference laboratory (Centers for Medicare & Medicaid Services 1992, Federal Register 40:7002-7186). Isolation of PBLs was carried out using the LeukoLOCK™ Total RNA Isolation System (Ambion, Inc., Austin Tex., USA) following the manufacturer's protocol. Briefly, after isolation of PBLs, the filter was flushed with 3 EA. of phosphate-buffered saline, to remove residual red blood cells, and then with RNAlater®, to stabilize the leukocyte RNA, and frozen at −20° C. until processing for RNA. RNA isolation was then carried out using the mirVana™ miRNA Isolation Kit (Ambion, Inc., Austin Tex., USA). The LeukoLOCK™ filter was flushed with 2.5 mL of mirVana miRNA Lysis Solution, and the lysate was collected in a 15-mL conical tube. mirVana miRNA homogenate additive (one-tenth volume) was then added to the cell lysate. A volume of acid-phenol:chloroform, equal to the lysate volume, was used to flush the LeukoLOCK™ filter and was collected into the same 15-mL conical tube as the lysate. The tube was shaken vigorously for 30 seconds and stored for 5 min at room temperature. The samples were centrifuged for 10 min at 10,000×g (maximum) in a table-top centrifuge. The aqueous phase was transferred into a new tube, and mixed with 1.25 volumes of room-temperature 100% ethanol, and the mixture was filtered through the filter cartridge into the collection tubes supplied with the kit. The isolated RNA was then washed and eluted following the standard steps described in the kit's manual. Quality of the isolated RNA was checked using the Agilent 2100 Bioanalyzer (Agilent Technologies, Inc., Santa Clara, Calif., USA) before use and storage at −80° C.
Microarray Data Acquisition
Statistical procedures and analysis involved in pre-processing and identifying differential expression of microarray data were performed using Bead Studio® v3.0.14 (Illumina Inc., San Diego, Calif., USA) and R-2.6.1 software (R Development Core Team 2007). Following RNA isolation, cRNA from each sample was hybridized to Sentrix® Human WG-6 BeadChips (Illumina Inc., San Diego, Calif., USA). Hybridized BeadArrays™ were examined with respect to number of genes detected, average intensity, 95th percentile of signal intensity, signal-to-noise ratio, and background signal intensity as a means of assessing quality. For each quality control (QC) measure, the BeadArray statistics were plotted and the mean+3 standard deviations were overlaid on the plot as a method for identifying potentially outlying arrays. All BeadArrays were considered to be within acceptable limits for these QC measures. In addition, the BeadArrays were examined with respect to beadtypes labeled as hybridization, low and high stringency, biotin, housekeeping, and labeling controls (data not shown). All control beadtypes yielded intensities at the expected levels; therefore, each of the 142 hybridizations was considered to be of good quality.
Microarray Data Preprocessing
Prior to analysis, the gene expression data was log2 transformed. Since negative control bead background correction was demonstrated to negatively impact identifying differentially expressed genes (Dunning et al. 2008, BMC Bioinformatics 9:85), the estimated background from the negative control beads was not subtracted from the mean beadtype signal intensities. The log2 transformed intensities were subsequently normalized using a global median scaling method. Specifically, the expression for each sample was scaled by an array-specific constant factor so that the median expression values were the same across all arrays. An arbitrarily selected array was set as the baseline against which all other arrays were normalized. For array i and beadtype j, using the log2 transformed expression values $\log_2(x_{ij})$, global normalization was performed as follows: 1) the median expression for the baseline array, $\tilde{x}_{\text{base}}$, was calculated; 2) for the ith array, the median expression, $\tilde{x}_i$, was also calculated; and 3) for the ith array, $b_i = \tilde{x}_{\text{base}} / \tilde{x}_i$ was taken to be the global scaling factor and was applied to normalize the j expression values for array i so that the log2 transformed and scaled values for beadtype j and array i were $x_{ij}^{\text{norm}} = b_i \log_2(x_{ij})$.
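The global median scaling steps described above can be sketched in Python as follows. This is a minimal illustration, not the R code used in the study; the function name, the list-of-lists data layout, and the default choice of the first array as baseline are assumptions made for the example.

```python
import math
from statistics import median


def global_median_scale(raw_arrays, baseline_index=0):
    """Log2-transform each array, then scale it so its median matches the baseline.

    raw_arrays: list of arrays, each a list of positive beadtype intensities.
    Assumes no array has a log2 median of exactly zero.
    """
    logged = [[math.log2(x) for x in array] for array in raw_arrays]
    base_median = median(logged[baseline_index])      # median of the baseline array
    normalized = []
    for array in logged:
        b = base_median / median(array)               # array-specific scaling factor
        normalized.append([b * value for value in array])
    return normalized
```

After scaling, every array has the same median log2 expression as the baseline array, which is the property the normalization is designed to enforce.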
Random Forest Analysis
The normalized gene expression data were combined with selected demographic, smoking history and clinical variables (see Supplementary Table I). A random forest consisting of 10,000 trees was derived for predicting COPD-affected (Case) or unaffected (Control) samples/individuals, using a split-sample approach (training and test sets) and the randomForest package in the R programming environment (Breiman 2001, Liaw & Wiener 2002, R News 2:18-22; R Development Core Team, 2007). An extreme discordant phenotype design (Zhang et al. 2006, Pharmacogenetics and Genomics 16:401-413), based on the FEV1/FVC ratio, was used to select the training set for the analysis. Of 142 subjects, 36 were clearly classified as having COPD (FEV1/FVC<0.60), and 36 were classified as Controls (FEV1/FVC>0.75). This set of samples was then used as the training set for the analysis in order to maximally stratify the Case and Control subgroups. The remaining 70 subjects had FEV1/FVC values between 0.60 and 0.75 and were used as the test set.
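The extreme discordant phenotype split can be illustrated with a brief Python sketch. The function name and the (identifier, ratio) input format are hypothetical; only the 0.60 and 0.75 thresholds are taken from the text.

```python
def split_extreme_discordant(subjects, case_below=0.60, control_above=0.75):
    """Assign subjects to the training set (extreme ratios) or the test set.

    subjects: iterable of (subject_id, fev1_fvc_ratio) pairs.
    Returns (training, test), where training holds (subject_id, label) pairs
    and test holds the ids of subjects with ratios from 0.60 to 0.75.
    """
    training, test = [], []
    for subject_id, ratio in subjects:
        if ratio < case_below:
            training.append((subject_id, "Case"))
        elif ratio > control_above:
            training.append((subject_id, "Control"))
        else:
            test.append(subject_id)
    return training, test
```

Training on the two extremes maximizes the contrast between groups, while the borderline subjects are held out to evaluate the classifier where spirometry is least decisive.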
For each classification tree in the random forest, the observations left out of the bootstrap re-sample (i.e., the “out-of-bag” observations) were used as a natural test set for estimating prediction error. The out-of-bag observations were also used to estimate the importance of each variable for the classification task (Archer & Kimes, 2008, Computational Statistics and Data Analysis 52:2249-2260). The bootstrap method was used to estimate the null distribution for the mean decrease in Gini impurity by drawing a random sample with replacement from those variables with a non-zero mean decrease in Gini impurity, estimating the mean decrease of the re-sampled observations and repeating this procedure 2000 times. Candidate predictors whose mean decrease in Gini impurity exceeded the 99.99795th percentile of this null distribution were considered significant for the classification task.
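The bootstrap step for the variable importance measure can be sketched as follows. This is one possible reading of the procedure, written in Python rather than R: it assumes the per-variable mean decrease in Gini impurity has already been computed, and it interprets the 99.99795% criterion as a percentile of the bootstrap null distribution. All function names are hypothetical.

```python
import math
import random
from statistics import mean


def bootstrap_null_means(importances, n_boot=2000, seed=0):
    """Resample the non-zero importances with replacement and record each mean."""
    rng = random.Random(seed)
    nonzero = [v for v in importances if v != 0]
    return [mean(rng.choices(nonzero, k=len(nonzero))) for _ in range(n_boot)]


def empirical_quantile(values, q):
    """Return the smallest observed value with at least a fraction q at or below it."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def significant_predictors(importances, q=0.9999795, n_boot=2000, seed=0):
    """Indices of variables whose importance exceeds the null-distribution cutoff."""
    null = bootstrap_null_means(importances, n_boot, seed)
    cutoff = empirical_quantile(null, q)
    return [i for i, v in enumerate(importances) if v > cutoff]
```

With 2000 bootstrap replicates, a percentile this extreme coincides with the largest bootstrap mean, so only variables whose importance exceeds every resampled mean are retained.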
L1 Penalized Logistic Regression
An L1 penalized logistic regression model was fit to predict the dichotomous outcome variable (Case/Control status) using the significant candidate predictors identified by the random forest algorithm. This additional modeling step was used to identify a more focused set of predictor variables that retain a similar error rate as the complete predicted random forest. This model was fit using the same training set used to derive the random forest model. The glmpath library (Park & Hastie, 2007, Journal of the Royal Statistical Society B 69:659-677) in the R programming environment (R Development Core Team, 2007) was used for fitting the L1 penalized models. The final model was selected as that model with minimum Akaike's information criterion (AIC) and was subsequently used to obtain fitted probabilities for all testable subjects. Those subjects with probabilities≧0.5 were classified as Cases, and all others were classified as Controls.
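The model selection and classification rule described above can be sketched in Python. The sketch assumes a regularization path has already been fitted (for example by glmpath) and is supplied as a list of candidate models; the dictionary layout, the function names, and the convention of counting the intercept plus non-zero coefficients as the number of parameters are assumptions for illustration.

```python
import math


def aic(log_likelihood, n_parameters):
    """Akaike's information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


def select_min_aic(path):
    """Pick the candidate model with the smallest AIC.

    path: list of dicts with keys 'intercept', 'coefs' and 'loglik'.
    """
    def model_aic(model):
        k = 1 + sum(1 for c in model["coefs"] if c != 0.0)  # intercept + active terms
        return aic(model["loglik"], k)
    return min(path, key=model_aic)


def fitted_probability(model, x):
    """Logistic-model probability that a subject with predictors x is a Case."""
    z = model["intercept"] + sum(c * v for c, v in zip(model["coefs"], x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(model, x, threshold=0.5):
    """Subjects with fitted probability >= threshold are Cases, all others Controls."""
    return "Case" if fitted_probability(model, x) >= threshold else "Control"
```

Because the L1 penalty drives many coefficients exactly to zero, the count of non-zero coefficients changes along the path, and AIC trades that count against the fit.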
Gene Ontology and Pathway Analysis
Genes identified statistically as having significant predictive value for the discrete Case/Control outcome were used as the input for subsequent gene ontology and pathway analysis. Gene ontology and functional categories were identified by analyzing isolated gene lists using the Database for Annotation, Visualization and Integrated Discovery (DAVID, on the world wide web at david.abcc.ncifcrf.gov/) (Dennis et al. 2003, Genome Biology 4:3) and Pathway Studio V5.0 (Ariadne Inc., Rockville, Md., USA). EASE scores for gene-enrichment analysis were calculated using a 0.1 threshold. The DAVID annotation tool was also used to probe the Kyoto Encyclopedia of Genes and Genomes (KEGG, www.genome.jp/kegg/kegg2.html), BioCarta (www.biocarta.com/genes/index.asp) and the Biological and Biochemical Image Database (BBID, on the world wide web at bbid.grc.nia.nih.gov/) pathway databases to identify regulated pathways and to complement the gene ontology. “Biological processes” and “Pathways” with a p-value≦0.05 were considered significant. The output analyses were manually filtered to remove overlapping and redundant categories to generate non-redundant lists.
Quantitative Real-Time PCR (qRT-PCR)
Quantitative real-time polymerase chain reaction (qRT-PCR) was performed on isolated RNA from randomly selected subjects in the training set (12 with and 12 without COPD) to confirm the microarray results in terms of differential expression and statistical significance. First-strand cDNA was synthesized from 1 μg of RNA in a 100 μl reaction volume with the TaqMan® Reverse Transcriptase Reaction Kit (Applied Biosystems, Carlsbad, Calif., USA) using random hexamers as primers following the manufacturer's recommended protocol. After the synthesis was complete, the cDNA was diluted 1:3. Six microliters of diluted cDNA were then used for each qRT-PCR reaction in a final volume of 20 μl using pre-designed Gene Expression Assays (Applied Biosystems, Carlsbad, Calif., USA) for the genes of interest. All PCR reactions were carried out in triplicate. Relative expression levels were calculated using the ΔΔCt algorithm provided by Applied Biosystems. The average intensity value obtained for the Control subjects was used as the calibrator. All reactions were run in an Applied Biosystems 7500 Fast Sequence Detection System (Applied Biosystems, Carlsbad, Calif., USA). The gene expression assays used were: 18S (Hs99999991—1), GAPDH (4310884E), DNTTIP2 (Hs00966646—1), GDAP1 (Hs00184079—1), IL6R (Hs01075667—1), LIPE (Hs00943410—1), WTAP (Hs00374488—1), CCR2 (Hs00174150—1), PPP2CB (Hs00602137—1), RASSF2 (Hs00542460—1) and RPL14 (Hs00427856—1).
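The ΔΔCt calculation can be illustrated with a short Python sketch. It uses the standard form of the method, which assumes roughly 100% amplification efficiency; the function names and the example Ct values are hypothetical, and the choice of endogenous reference gene (18S or GAPDH) is left to the caller because the text does not specify which was used for normalization.

```python
from statistics import mean


def delta_ct(target_ct_replicates, reference_ct_replicates):
    """ΔCt: mean Ct of the target gene minus mean Ct of the reference gene."""
    return mean(target_ct_replicates) - mean(reference_ct_replicates)


def relative_expression(sample_delta_ct, calibrator_delta_ct):
    """Fold change relative to the calibrator: 2 ** -(ΔCt_sample - ΔCt_calibrator)."""
    delta_delta_ct = sample_delta_ct - calibrator_delta_ct
    return 2.0 ** (-delta_delta_ct)


def calibrator_from_controls(control_delta_cts):
    """Use the average ΔCt across Control subjects as the calibrator."""
    return mean(control_delta_cts)
```

For example, if the calibrator ΔCt is 10 and a Case sample has a ΔCt of 9, the relative expression is 2, indicating a two-fold higher level of the target transcript in that sample.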
Subject Demographics
Characteristics of the spirometrically defined COPD-affected and unaffected groups (overall and for the training set) are summarized in Table I. The distribution of the COPD group by severity of airflow obstruction (FEV1 as percent of predicted) by GOLD spirometric guidelines (Rabe et al. 2007) was GOLD 1 (mild, n=30), GOLD 2 (moderate, n=38), GOLD 3 (severe, n=6), and GOLD 4 (very severe, n=7). It should be noted that 10 subjects with FEV1/FVC>0.70 were categorized as Controls according to the GOLD guideline but had subnormal FEV1 (<80% predicted) and could be considered to have spirometrically indeterminate Case/Control status; 3 subjects were in the training set, and 7 were in the test set. In the cohort overall and in the training and test sets, the COPD group was older and had at least 56% greater pack-years of cigarette smoking, on average, than the Control group. However, the proportion of current smokers was similar across all groups, at 58-69%. Although the mean total circulating WBC count did not differ significantly between the groups, those with COPD had significantly higher mean neutrophils and lower mean lymphocytes, as percentages of the total WBC, than the group without COPD.
(a) COPD subjects with % FEV1/FVC <60 and control subjects with % FEV1/FVC >75.
(b) p-value for difference in mean values between the Case/Control groups was obtained by Welch's t-test for continuous variables and by Fisher's exact test for categorical variables.
(c) Average daily cigarette consumption of current smokers during the 3 months prior to study participation.
Identification of COPD Predictors
Because the random forest algorithm cannot handle missing values among the predictor variables, the medication history of the subjects was not included in the analysis, since several subjects had missing values. For example, 15/81 (18.5%) Cases and 19/61 (31%) Controls failed to indicate whether they were using glucocorticoids. The final size of the training set was 33 Cases and 34 Controls because 3 Cases and 2 Controls had missing values for other key variables. The out-of-bag estimate of error associated with the random forest analysis in the training set was 6.0% overall, with a misclassification rate of 2.9% for the spirometric Controls and 9.1% for the spirometric Cases (Table II). The random forest algorithm identified 1,014 candidate predictor variables, which included only 1 phenotypic variable, ‘years of daily smoking’. The top 30 candidate predictors using the mean decrease in Gini impurity, as well as the mean decrease in accuracy, are displayed in
The random forest model derived using the training set was then applied to the remaining 70 subjects with FEV1/FVC values of 0.60-0.75 (test set). Five subjects were excluded due to missing values for a key variable, leaving 65 subjects as a test set for evaluation of the random forest classifier. The overall misclassification rate for the test set was 24.6% (16/65). Spirometric versus gene expression-predicted classifications for the training and test sets are shown in Table II, along with misclassification rates. Of the discordantly classified subjects in the test group, 14/16 (87.5%) were classified as Cases by spirometry but not by their gene expression profile.
Gene Ontology and Pathway Analyses
In an effort to identify biological processes and pathways that were differentially affected in Cases versus Controls, gene ontology assessment using the DAVID annotation tool (Dennis et al., 2003) was performed. A total of 784 genes (77.4% of the 1,013 genes identified by random forest modeling) were represented in the DAVID gene ontology categories. The analysis output list was manually edited to remove redundant and overlapping gene ontologies. Biological processes that were enriched in the set of predictor genes included regulation of apoptosis and cell growth, macromolecule (protein and RNA) transport, post-translational protein modification, cellular defense response, inflammatory response and RNA processing (
The gene ontology analysis revealed a number of up-regulated genes involved in positive regulation of apoptosis (e.g., BAD, CASP4, CASP6, CASP10, DIABLO, FAF1, FASTK and TRADD) as well as a number of genes involved in inhibition of apoptosis (e.g., BCL2L1, BIRC2, CDKN2D, MCL1, NAIP, SERPINB2, SGMS1 and YWHAZ). A similar situation occurred with cell cycle progression related genes. Several of the genes identified are involved in general regulation of the cell (e.g., CCT7, CDC2L1, CDK2, CDC42, CDKN2D, MDM4, NEDD9, PCNA, PML, PMS1, RASSF2, RASSF4, RASSF5, RB1, TSC1, VEGFB and VHL) with a number of them clearly involved in negative regulation of the cell (e.g., CDKN2D, PML, RASSF2, RASSF4, RB1 and TSC1).
A number of genes were identified that were involved in the MAPK signaling pathway (e.g., ATF2, ATF4, DUSP6, DUSP10, IL1R2, MAP2K3, MAP4K3, MAPK14, MAX, MEF2A, PIK3R5, SOS1, SOS2 and TGFBR2) and in inflammatory response (e.g., ALOX5, CCL7, CCR2, CCR4, CD97, CD163, NFRKB, NLRP3, PLAA, SPN, TLR4, TLR6, TLR8), consistent with prior reports in the literature and the systemic pro-inflammatory characteristics associated with COPD (Mossman et al. 2006, American Journal of Respiratory Cell and Molecular Biology 34:666-669; Agusti et al. 2003, European Respiratory Journal 21:347-360; Rahman et al. 1996, American Journal of Respiratory and Critical Care Medicine 154:1055-1060; Chung 2001, European Respiratory Journal Supplement 34:50s-59s; Chung 2005, Curr Drug Targets Inflamm Allergy 4:619-625; Rahman 2005, Treatments in Respiratory Medicine 4:175-200; Agusti & Soriano 2008, Journal of Chronic Obstructive Pulmonary Disease 5:133-138; Fabbri & Rabe 2007, Lancet 370:797-799). A summary of the protein-protein interactions and possible biological outcomes identified by Pathway Studio from the list of candidate predictor genes is shown in
L1 Penalized Logistic Regression Model
In order to identify a more focused set of variables having a similar predictive capability as the random forest, an L1 penalized logistic regression model was fit to predict the dichotomous outcome variable (Case/Control status) using the 1,014 variables identified by the random forest algorithm. L1 penalized models are effective in performing automatic variable selection (Tibshirani, 1996). The model was first fit using data from the training set of 33 Cases and 34 Controls used to derive the random forest model. The final model, selected as the L1 logistic regression model with minimum AIC (data not shown), comprised 9 predictor genes: IL6R, CCR2, PPP2CB, RASSF2, and WTAP were up-regulated and DNTTIP2, GDAP1, LIPE, and RPL14 were down-regulated in Cases compared with Controls. As shown in Table III, the 9-gene model had an overall error rate of 3.0%, discordantly classifying 1 spirometric Case and 1 spirometric Control. The derived L1 penalized logistic regression model was subsequently applied to classify the test set of 70 subjects with FEV1/FVC of 0.60-0.75, although one subject was excluded for missing a key variable, leaving 69 subjects in the test set. The overall misclassification rate was 21.7% (Table III). The calculated sensitivity, specificity, and positive and negative predictive values in the test set of samples for both models are shown in Table IV.
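The performance measures mentioned above follow directly from a two-by-two table of spirometric versus gene expression-predicted status. The Python sketch below is illustrative only; the function name is hypothetical, and spirometric status is treated as the reference classification. The example counts are those implied by the training-set result reported in the text (33 Cases and 34 Controls, with 1 Case and 1 Control discordantly classified), not the test-set values of Table IV.

```python
def diagnostic_metrics(true_pos, false_neg, false_pos, true_neg):
    """Summary measures for a classifier against a reference classification."""
    total = true_pos + false_neg + false_pos + true_neg
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "misclassification": (false_neg + false_pos) / total,
    }


# 9-gene model, training set: 32 of 33 Cases and 33 of 34 Controls concordant.
training_metrics = diagnostic_metrics(true_pos=32, false_neg=1, false_pos=1, true_neg=33)
```

With these counts the misclassification rate is 2/67, which rounds to the 3.0% overall error rate stated for the training set.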
Biological Validation
Real-time PCR was performed using isolated RNA from 24 randomly selected subjects in the training set (12 Cases and 12 Controls) to confirm the microarray results for the 9 predictor genes. Experimental results are shown in
Using microarray analysis of PBLs and random forest modeling, 1,013 genes and one phenotypic variable were identified as candidate predictors capable of differentiating smokers (current or former) with or without COPD. Gene ontology analyses indicate that these genes are involved in various cellular processes including regulation of apoptosis, regulation of cell growth, macromolecule (protein and RNA) transport, post-translational protein modification, cellular defense response, inflammatory response and RNA processing. A 9-gene subset derived from the larger set of candidate predictors that reliably discriminated between COPD and non-COPD subjects was also identified. Differential expression of 7 of the 9 genes identified was confirmed by qRT-PCR, corroborating the microarray results.
The full random forest predictive model discordantly classified, or “misclassified,” 6% of the training set and 24.6% of the test set, and the 9-gene model differed from the spirometrically-defined classification for 3% of the training set and 21.7% of the test set. These models performed well in the more phenotypically extreme (by spirometry) training set and less well in the test set whose FEV1/FVC values more closely bordered the diagnostic Case/Control cutoff value of 0.70. The great majority of the discordantly classified subjects in the test set were classified as Cases by spirometry but as Controls by their gene expression profile. It is possible for an individual to have a spuriously low airflow measurement that could result in a misdiagnosis of COPD by the GOLD guideline, which uses a fixed, arbitrary cutoff value of FEV1/FVC.
Furthermore, although spirometric parameters are the traditional diagnostic and prognostic markers for COPD, it has become clear that they do not adequately represent all of its respiratory and systemic aspects (Marin et al. 2009, Respiratory Medicine 103(3):373-378; Celli 2006, Proceedings of the American Thoracic Society 3:461-465). FEV1 correlates poorly with the degree of dyspnea, and the change in FEV1 does not reflect the rate of decline in health status (Celli et al. 2004, Celli 2006, Burge et al. 2000, British Medical Journal 320:1297-1303). Other factors, such as emphysema and hyperinflation (Casanova et al. 2005, American Journal of Respiratory and Critical Care Medicine 171:591-597), malnutrition (Schols et al. 1998, American Journal of Respiratory and Critical Care Medicine 157:1791-1797), peripheral muscle dysfunction (Maltais et al. 2000, Clinics in Chest Medicine 21:665-677), and dyspnea (Nishimura et al. 2002, Chest 121:1434-1440), are independent predictors of outcome. In fact, the multifactorial BODE index that includes body mass index (B), degree of airflow obstruction (O), dyspnea score (D), and exercise endurance (E), is a better predictor of mortality than FEV1 alone (Celli et al. 2004, The New England Journal of Medicine 350:1005-1012). The PBL gene expression profile alone or in combination with clinical markers such as the BODE components and/or lung parenchymal or airway changes on chest CT scans (Omori et al. 2006, Respirology 11:205-210) may be more predictive of the (early) presence, activity, and progression of the multi-component syndrome that is COPD than the clinical parameters alone.
One of the major constraints of COPD biomarker discovery has been the accessibility of suitable samples. In the past, sputum, bronchoalveolar lavage fluid, exhaled breath condensate, and bronchial biopsy tissue have been used (Sin & Man 2008, Chest 133:1296-1298). However, the sampling methodologies for such specimens are limited by their invasiveness and poor reproducibility. Since COPD is accompanied by systemic changes, as well as increased serum levels of certain proteins [e.g., C-reactive protein (CRP), interleukin 6 (IL-6), IL-8, leukotriene B4 (LTB4), and TNFα], the use of PBLs as a surrogate biosample is an ideal alternative because they can be easily collected in large quantities at multiple time points using a relatively non-invasive procedure (Celli 2006; Schols et al. 1996, Thorax 51:819-824; Rahman & Biswas 2004, Redox Report: Communications in Free Radical Research 9:125-143; Rahman et al. 1996, Vernooy et al. 2002, American Journal of Respiratory and Critical Care Medicine 166:1218-1224; Agusti et al. 2003, Noguera et al. 1998, American Journal of Respiratory and Critical Care Medicine 158:1664-1668). As noted earlier, PBL gene expression profiles are successfully used to identify the presence or risk of other diseases having prominent systemic components.
Due to the role of PBLs in inflammation, the gene expression differences between subjects with and without COPD in this population of cells can reflect the degree of systemic inflammation or inflammation in the lungs. Lung inflammation is known to increase with the severity of the disease, as classified by the degree of airflow limitation (Hogg et al. 2004). The gene expression-based classifier is derived from the training set of COPD subjects with the most extreme airflow limitation, who likely also have the greatest degree of inflammation, while the test group with lesser airflow limitation may be predicted to have less inflammation. This may also partially account for the lower predictive ability between spirometric Cases and Controls in the test set compared to the training set.
In the present study, biological processes identified as over-represented in the set of COPD predictor genes include regulation of apoptosis, regulation of cell growth, macromolecule (protein and RNA) transport, post-translational protein modification, cellular defense response, inflammatory response and RNA processing. Major pathways identified include apoptosis, p38/MAPK signaling, focal adhesion, and leukocyte transendothelial migration. Changes in these biological processes and pathways may reflect the changes in activation, differentiation and cellular composition of the samples analyzed. The identification of leukocyte transendothelial migration is an important change in this cell population as COPD is characterized by leukocyte infiltration in the lung parenchyma (Panina et al. 2006, Current Drug Targets 7:669-674). Differences in expression of these genes may result in a predisposition of leukocyte subpopulations to infiltrate the lung tissue, and perhaps other tissues. This observation is supported by previously reported changes in chemotaxis and extracellular proteolysis in neutrophils isolated from the blood of subjects with COPD (Burnett et al. 1987, Lancet 2:1043-1046).
The subset of 9 genes identified using L1 penalized logistic regression modeling has predictive performance similar to that of the full set of candidate predictors identified by the random forest model. It includes 5 up-regulated genes (CCR2, IL6R, PPP2CB, RASSF2, and WTAP) and 4 down-regulated genes (DNTTIP2, GDAP1, LIPE, RPL14) in COPD Cases compared with Controls. IL6R and CCR2 have been previously reported to have possible roles in COPD development and progression (Owen 2001, Pulmonary Pharmacology and Therapeutics 14:193-202; Wilk et al. 2007, BMC Medical Genetics 8 Suppl 1:S8). However, there have been no prior reports of an association with COPD for DNTTIP2, GDAP1, LIPE, PPP2CB, RASSF2, RPL14 and WTAP.
The IL6R gene codes for the IL6 receptor, which is only reported to be expressed in subpopulations of leukocytes (monocytes, neutrophils and T and B lymphocytes) and hepatocytes (Chalaris et al. 2007, Blood 110:1748-1755; Jones et al. 2001, The FASEB Journal 15:43-58; Hamid et al. 2004, Diabetes 53:3342-3345). Many cell types do not express IL6R and are not directly responsive to IL6 (Chalaris et al. 2007, Jones et al. 2001). However, these cell types can be stimulated by IL6 bound to a soluble form of the IL6 receptor in a process called trans-signaling (Chalaris et al. 2007, Jones et al. 2001). IL6R shedding and subsequent release of the soluble form of the receptor results from cleavage of the membrane-bound receptor during apoptosis, a biological process and pathway identified in the gene expression signatures. This process is dependent on the metalloproteinases, ADAM17 and to a lesser extent ADAM10 (Chalaris et al. 2007, Matthews et al. 2003, The Journal of Biological Chemistry 278:38829-38839). ADAM17 was also found to be up-regulated in the microarray and was identified as one of the candidate predictor genes. Reported inducers of IL6R shedding include phorbol myristate acetate, cholesterol depletion, CRP, bacterial toxins, Fas stimulation and ultraviolet light (Chalaris et al. 2007, Mullberg et al. 1992, Biochemical and Biophysical Research Communications 189:794-800; Jones et al. 1999, Journal of Experimental Medicine 189:599-604; Matthews et al. 2003). Signaling through IL6R has also been shown to have a role in both inflammation and apoptosis (Finotto et al. 2007, Int Immunol 19:685-693). Furthermore, genome-wide association analyses have identified IL6R as a likely candidate gene for association with lung function (Wilk et al. 2007).
CCR2, which encodes the receptor for monocyte chemoattractant protein 1 and 3 (MCP1 and MCP3), is involved in inflammatory processes related to rheumatoid arthritis, alveolitis and tumor infiltration (Owen 2001). Higher levels of MCP1 mRNA and protein are detected in the bronchiolar epithelium in subjects with COPD, and increased levels of CCR2 are detected in macrophages, mast cells and epithelial cells of COPD subjects, indicating that MCP1 and CCR2 are involved in the recruitment of macrophages into the airway epithelium (Owen 2001, de Boer et al. 2000, Journal of Pathology 199:619-626). This increased expression of CCR2 also correlates with increased levels of mast cells and macrophages in the lungs of COPD subjects (de Boer et al. 2000). In addition, it has been demonstrated that activated neutrophils migrate in response to MCP1 (Johnston et al. 1999, The Journal of Clinical Investigation 103:1269-1276). These findings indicate mechanistic roles of IL6R and CCR2 in systemic and lung inflammation in COPD.
The 7 other genes in the 9-gene profile have varied biological functions. PPP2CB encodes the beta-isoform of the catalytic subunit of protein phosphatase 2A (PP2A) (Hemmings et al. 1988, Nucleic Acids Research 16:11366; Cohen 1989, Annual Review of Biochemistry 58:453-508). PP2A has been shown to regulate apoptosis in neutrophils by dephosphorylating both p38/MAPK and its substrate caspase 3, suggesting that PP2A has a role in the induction of apoptosis and the resolution of inflammation (Alvarado-Kristensson & Andersson 2005, The Journal of Biological Chemistry 280:6238-6244). RASSF2 promotes apoptosis and cell cycle arrest (Vos et al. 2003, The Journal of Biological Chemistry 278:28045-28051). WTAP is involved in the expression of genes related to cell division cycle and the G2/M checkpoint (Horiuchi et al. 2006, PNAS USA 103:17278-17283). The DNTT-interacting protein 2 (DNTTIP2), also known as estrogen receptor-binding protein, can bind the estrogen receptor-alpha and enhance its transcriptional activity in an estrogen-dependent manner (Bu et al. 2004, Biochemical and Biophysical Research Communications 317:54-59). GDAP1, or ganglioside-induced differentiation-associated protein 1, is found localized in the mitochondrial outer membrane and regulates the mitochondrial network. Over-expression of GDAP1 induces fragmentation of mitochondria without inducing apoptosis, affecting overall mitochondrial activity, or interfering with mitochondrial fusion (Niemann et al. 2005, The Journal of Cell Biology 170:1067-1078; Cuesta et al. 2002, Nature Genetics 30:22-25). LIPE, also known as HSL (hormone-sensitive lipase), has a role in the mobilization of free fatty acids from adipose tissue by controlling the rate of lipolysis of the stored triglycerides (Holm et al. 1988, Nucleic Acids Research 16:9879). Finally, RPL14 is a gene coding for a protein of the large ribosomal subunit (Robledo et al. 2008, RNA 14:1918-1929).
The role of these genes in COPD may be linked to the cellular processes and pathways, such as cell cycle regulation and apoptosis, associated with the full list of genes.
Some factors, such as cellular composition of the sample, may influence the gene expression profiles detected by microarray in this study. Although the average total circulating WBC counts were similar between the groups with and without COPD, the mean lymphocyte and granulocyte counts as percentages of the total were significantly different (Table I). These parameters were included in the random forest analysis yet were not retained in the final model, indicating that the gene expression differences were more predictive of COPD status than lymphocyte and granulocyte percentages. Because the random forest algorithm cannot handle missing values among the predictor variables, and several subjects had missing values for medication history, the medication history of the subjects was not included in the analysis. Although it is unclear how corticosteroids might affect gene expression in PBLs, it is known that the small airway inflammation responsible for airflow obstruction in COPD is poorly sensitive to the anti-inflammatory effects of corticosteroids (Hogg et al. 2004, The New England Journal of Medicine 350:2645-2653; Barnes 2006, Chest 129:151-155). Recent evidence has attributed this to oxidative and nitrative stress-induced reduction in histone deacetylase expression in inflammatory cells, thus preventing activated corticosteroid receptors from reversing the acetylation of activated inflammatory genes and turning off their transcription (Barnes 2006). The analysis also included 10 subjects whose spirometric COPD Case/Control status was possibly indeterminate based on their combination of FEV1/FVC and FEV1% predicted; these subjects were categorized spirometrically as Controls by the GOLD-identified FEV1/FVC cutoff value. Only one of these subjects, who was in the test set, was discordantly classified as a Case by the gene expression profile (in both the full and reduced models).
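The comparison between spirometric status and profile-based classification discussed above can be illustrated with a short sketch. It assumes the fixed FEV1/FVC cutoff of 70% (0.70) for the spirometric call; the function names, subject identifiers and spirometry values are hypothetical and are not data from the study, and the predicted labels stand in for the output of any gene expression classifier.

```python
from typing import Dict, List, Tuple

# Assumed fixed-ratio cutoff: FEV1/FVC below 0.70 is treated as a Case.
GOLD_RATIO_CUTOFF = 0.70


def spirometric_status(fev1_l: float, fvc_l: float,
                       cutoff: float = GOLD_RATIO_CUTOFF) -> str:
    """Return 'Case' or 'Control' from FEV1 and FVC (both in liters)."""
    if fvc_l <= 0:
        raise ValueError("FVC must be positive")
    return "Case" if fev1_l / fvc_l < cutoff else "Control"


def discordant_subjects(spirometry: Dict[str, Tuple[float, float]],
                        predicted: Dict[str, str]) -> List[str]:
    """List subjects whose predicted label differs from the spirometric label."""
    return sorted(
        subject_id
        for subject_id, (fev1_l, fvc_l) in spirometry.items()
        if spirometric_status(fev1_l, fvc_l) != predicted[subject_id]
    )


# Hypothetical subjects: (FEV1, FVC) in liters. Values are made up.
spirometry = {"S01": (2.1, 2.9), "S02": (1.5, 2.6), "S03": (2.8, 3.9)}
# Hypothetical labels assigned by a gene expression profile.
predicted = {"S01": "Control", "S02": "Case", "S03": "Case"}

discordant = discordant_subjects(spirometry, predicted)
print(discordant)
```

In this made-up data, subject S03 has a ratio just above the cutoff and is spirometrically a Control, so a profile-based call of Case is reported as discordant.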
Cigarette smoke exposure can also influence gene expression. Of the 1,013 predictor genes identified in this analysis, differential expression of ATF4, MCL1, MAPK14, SERPINA1 and SOD2 was also identified in a study by van Leeuwen et al. (2007, Carcinogenesis 28:691-697) as strongly correlating with serum cotinine levels, a biomarker of recent exposure to tobacco. Two additional genes in the list, CCR2 and EPB41, were observed by Lampe et al. (2004, Cancer Epidemiology, Biomarkers & Prevention 13:445-453) as part of a cigarette smoke exposure molecular signature. Both the van Leeuwen and Lampe studies used PBLs isolated from current smokers and non-smokers, indicating that the differential expression of some of the genes identified in this analysis may be related to tobacco smoke exposure. In a study of bronchial epithelial cells from never, current and former smokers, Beane et al. (2007, Genome Biology 8:R201) found 175 genes differentially expressed between never and current smokers, with irreversible changes in expression for 28 genes, slowly reversible changes for 6 genes and rapidly reversible changes for 139 genes. This indicates that duration and possibly intensity of cigarette smoking, and length of time since quitting, may be important confounding variables in gene expression analysis. The one phenotypic variable identified as a candidate predictor in this analysis (‘years of daily smoking’) appears to support this possibility.
This example indicates, among other things, that a training set and test set can be established that permit the identification of differential gene expression (1,013 genes in this instance) in peripheral WBCs that discriminates between cigarette smokers with and without spirometrically defined COPD. The group of 1,013 genes can be reduced to a 9-gene subset with similar performance in differentiating smokers with and without COPD. Gene ontology and pathway analyses indicate that these genes are involved in regulation of apoptosis, regulation of cell growth, macromolecule (protein and RNA) transport, RNA processing, post-translational protein modification, cellular defense response, and inflammatory response. This is the first study to use microarray analysis of PBLs to identify gene expression differences associated with COPD. PBL samples are easy to obtain and their analysis complements current clinical diagnostic procedures for COPD. The gene expression profiles identified are novel biomarkers for COPD.
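One way to check whether a reduced gene subset performs similarly to a full gene list on a held-out test set is to compute the same summary measures for both models. The following is a minimal sketch, assuming test-set labels of "Case" and "Control"; the function name and all label vectors are hypothetical and do not reproduce results from this example.

```python
from typing import Dict, Sequence


def classification_metrics(actual: Sequence[str], predicted: Sequence[str],
                           positive: str = "Case") -> Dict[str, float]:
    """Compute sensitivity, specificity and accuracy for a two-class problem."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the actual labels")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical test-set labels (spirometrically defined status); made up.
actual = ["Case"] * 4 + ["Control"] * 4
# Hypothetical predictions from a full-list model and a reduced-subset model.
full_pred = ["Case", "Case", "Case", "Control",
             "Control", "Control", "Control", "Case"]
reduced_pred = ["Case", "Case", "Control", "Control",
                "Control", "Control", "Control", "Control"]

full_metrics = classification_metrics(actual, full_pred)
reduced_metrics = classification_metrics(actual, reduced_pred)
print(full_metrics)
print(reduced_metrics)
```

Reporting sensitivity and specificity alongside accuracy is a deliberate choice here: two models can reach the same overall accuracy while trading missed Cases against misclassified Controls, as the made-up vectors above illustrate.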
Unless otherwise indicated, the nucleic acids listed or set forth in Supplementary Table II include: nucleic acids having the sequences recited in the table and/or their complement; the sequences of nucleic acids transcribed from the genes or loci listed in the table or their complement; and either or both strands (if double-stranded) of cDNA clones of the nucleic acids transcribed from the genes or loci listed in the table. The nucleic acids listed or set forth in Supplementary Table II also include the specific nucleic acid sequences listed under the NCBI accession and/or the NCBI GI number categories and their complementary sequences.
Substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the aspects and embodiments described herein without departing from the spirit of the subject matter as expressed, inter alia, in the appended claims. Additional advantages, features and modifications will readily occur to those skilled in the art. Therefore, the subject matter of this disclosure, in its broader aspects, is not limited to the specific details, examples, or representative devices, shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general concepts as defined, inter alia, by the appended claims and their equivalents.
All of the references cited herein, including patents, patent applications, and publications, are hereby incorporated in their entireties by reference.
The scope of the claims below is not restricted to the particular embodiments described herein. The examples are described for illustrative purposes and are not intended to limit the methods and compositions of the present disclosure in any manner. Those of skill in the art will recognize a variety of parameters that can be changed or modified to yield the same results.
This application claims the benefit of U.S. Provisional Application Ser. No. 61/292,154, filed Jan. 4, 2010, entitled “GENE BIOMARKERS OF LUNG FUNCTION,” the entirety of which is hereby incorporated by reference.
Number | Date | Country
---|---|---
61292154 | Jan 2010 | US
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US2011/000016 | Jan 2011 | US
Child | 13541349 | | US