1. Field
The invention relates to predictive models for determining smoking status based on marker expression measurements, to their methods of use, and to computer systems and software for their implementation.
2. Description of the Related Art
Smoking is the leading cause of preventable death in the world, resulting in over 5 million deaths per year worldwide, with ~500,000 of those deaths occurring in the United States (1,2). Smoking has been shown to be detrimental to human health, increasing the risk of multiple diseases, including many forms of cancer (lung, pancreatic) and cardiovascular/pulmonary disease (atherosclerosis, chronic obstructive pulmonary disease) (3,4,5). Cigarette smoke contains over 4,000 compounds, many of which have been shown to be carcinogenic or toxic; these compounds are able to enter the circulatory system by way of the pulmonary alveoli, and are distributed to different organs of the body, resulting in damage (6). During this process, circulatory cells of the immune system are exposed to these compounds, which may result in changes in gene expression that can be assessed using established technologies.
Unmet Clinical and Scientific Need
Cotinine is a metabolite of nicotine and appears in the blood and urine of cigarette smokers. Biochemical measurements of cotinine in blood or urine therefore provide a marker of smoking status, but specialized assays are required. A general assay using readily available molecular biology tools, such as quantitative RNA measures or nucleic acid sequencing reactions, provides an independent way to determine smoking status and can be carried out as part of a parallel or multiplexed series of nucleic acid-based measures obtained from a patient sample.
Described herein is a computer-implemented method for scoring a sample obtained from a subject, wherein said score indicates said subject's smoking status, comprising: obtaining a dataset associated with said sample, wherein said dataset comprises quantitative expression data for one or more of marker 1, marker 2, marker 3, marker 4, and/or marker 5, wherein marker 1 is CLDND1 or IL7R, wherein marker 2 is LRRN3 or CCR7, wherein marker 3 is MUC1 or FOXP3, wherein marker 4 is GOPC or MCM3, and wherein marker 5 is LEF1 or CCR7; and determining, by a computer processor, a score from said dataset using an interpretation function, wherein said score is indicative of said subject's smoking status.
In some embodiments, said dataset comprises quantitative expression data for marker 1, marker 2, marker 3, marker 4 and marker 5, wherein marker 1 is CLDND1, wherein marker 2 is LRRN3, wherein marker 3 is MUC1, wherein marker 4 is GOPC, and wherein marker 5 is LEF1. In some embodiments, said dataset comprises quantitative expression data for two or more of marker 1, marker 2, marker 3, marker 4, and marker 5. In some embodiments, said dataset comprises quantitative expression data for three or more of marker 1, marker 2, marker 3, marker 4, and marker 5. In some embodiments, said dataset comprises quantitative expression data for four or more of marker 1, marker 2, marker 3, marker 4, and marker 5. In some embodiments, said dataset comprises quantitative expression data for marker 1, marker 2, marker 3, marker 4, and marker 5.
In some embodiments, the method further comprises determining, by a computer processor, the subject's risk for developing a smoking-related disease, based on said score. In some embodiments, said smoking-related disease is chronic obstructive pulmonary disease, chronic bronchitis, emphysema, lung cancer, and/or asthma.
In some embodiments, the dataset comprises quantitative expression data for at least one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty or more additional markers selected from Table 1.
In some embodiments, the dataset further comprises a clinical factor used to calculate the score. In some embodiments, the clinical factor is selected from the group consisting of: gender and hypertension. In some embodiments, the clinical factor is gender.
In some embodiments, said interpretation function is based on a predictive model. In some embodiments, said predictive model is selected from the group consisting of a partial least squares model, a logistic regression model, a linear regression model, a linear discriminant analysis model, a ridge regression model, and a tree-based recursive partitioning model. In some embodiments, said interpretation function is log(Pr(Smoker)/(1−Pr(Smoker)))=15.78306+0.3876*SEX−3.3368*CLDND1−3.4034*LRRN3−1.4847*MUC1+5.9209*GOPC+2.27166*LEF1, wherein SEX=1 if male and 0 if female, and Pr is probability. In some embodiments, said interpretation function is an interpretation function selected from the group of interpretation functions consisting of those set out in Table 7.
In some embodiments, obtaining said dataset associated with said sample comprises obtaining said sample and processing said sample to experimentally determine said dataset. In some embodiments, obtaining said dataset associated with said sample comprises receiving said dataset directly or indirectly from a third party that has processed said sample to experimentally determine said dataset.
In some embodiments, the dataset is obtained from a storage memory on which it is stored. In some embodiments, said quantitative expression data are from hybridization data. In some embodiments, said quantitative expression data are from polymerase chain reaction data. In some embodiments, said quantitative expression data are from sequence data.
Also described herein is a computer-implemented method for scoring a sample obtained from a subject, comprising: obtaining a dataset associated with the sample, wherein the dataset comprises a clinical factor used to calculate a score and quantitative expression level values for at least one marker selected from the group consisting of CLDND1, IL7R, LRRN3, CCR7, MUC1, FOXP3, GOPC, MCM3, LEF1, and CCR7; and determining, by a computer processor, the score from the dataset using an interpretation function, wherein the score is indicative of said subject's smoking status. In some embodiments, said dataset comprises quantitative expression data for CLDND1, LRRN3, MUC1, GOPC, and LEF1. In some embodiments, said dataset comprises quantitative expression data for two or more markers. In some embodiments, said dataset comprises quantitative expression data for three or more markers. In some embodiments, said dataset comprises quantitative expression data for four or more markers. In some embodiments, said dataset comprises quantitative expression data for five or more markers.
Also described herein is a system for scoring a sample obtained from a subject, wherein said score indicates said subject's smoking status, comprising: a storage memory for storing a dataset associated with said sample, wherein said dataset comprises quantitative expression data for one or more of marker 1, marker 2, marker 3, marker 4, and/or marker 5, wherein marker 1 is CLDND1 or IL7R, wherein marker 2 is LRRN3 or CCR7, wherein marker 3 is MUC1 or FOXP3, wherein marker 4 is GOPC or MCM3, and wherein marker 5 is LEF1 or CCR7; and a processor communicatively coupled to the storage memory for determining a score from said dataset using an interpretation function, wherein said score is indicative of said subject's smoking status.
Also described herein is a computer-readable storage medium storing computer-executable program code, the program code comprising: program code for storing a dataset associated with said sample, wherein said dataset comprises quantitative expression data for one or more of marker 1, marker 2, marker 3, marker 4, and/or marker 5, wherein marker 1 is CLDND1 or IL7R, wherein marker 2 is LRRN3 or CCR7, wherein marker 3 is MUC1 or FOXP3, wherein marker 4 is GOPC or MCM3, and wherein marker 5 is LEF1 or CCR7; and program code for determining a score from said dataset using an interpretation function, wherein said score is indicative of said subject's smoking status.
Also described herein is a method for scoring a sample obtained from a subject, wherein said score indicates said subject's smoking status, comprising: obtaining a sample from the subject, wherein the sample comprises a plurality of analytes; contacting the sample with a reagent; generating a plurality of complexes between the reagent and the plurality of analytes; detecting the plurality of complexes to obtain a dataset associated with said sample, wherein said dataset comprises quantitative expression data for one or more of marker 1, marker 2, marker 3, marker 4, and/or marker 5, wherein marker 1 is CLDND1 or IL7R, wherein marker 2 is LRRN3 or CCR7, wherein marker 3 is MUC1 or FOXP3, wherein marker 4 is GOPC or MCM3, and wherein marker 5 is LEF1 or CCR7; and determining a score from the dataset using an interpretation function, wherein said score is indicative of said subject's smoking status.
Also described herein is a kit for scoring a sample obtained from a subject, wherein said score indicates said subject's smoking status, comprising: a set of reagents comprising a plurality of reagents for determining from a sample obtained from the subject quantitative expression data for one or more of marker 1, marker 2, marker 3, marker 4, and/or marker 5, wherein marker 1 is CLDND1 or IL7R, wherein marker 2 is LRRN3 or CCR7, wherein marker 3 is MUC1 or FOXP3, wherein marker 4 is GOPC or MCM3, and wherein marker 5 is LEF1 or CCR7; and instructions for using the plurality of reagents to determine quantitative expression data in a dataset from the sample, wherein the instructions comprise instructions for determining, by a computer processor, a score from said dataset using an interpretation function, wherein said score is indicative of said subject's smoking status.
In an embodiment, the invention provides a method for determining the smoking status of a subject through use of a dataset that includes quantitative expression data for one or more markers listed in Table 1, by analyzing the dataset to determine an expression level of the marker, wherein the expression level of the marker positively or negatively correlates with the smoking status of the subject, thereby determining the subject's smoking status. In an embodiment, the subject's smoking status can be used to assess risk of developing a smoking-related disease such as chronic obstructive pulmonary disease, chronic bronchitis, emphysema, lung cancer, or asthma. In an embodiment, the analyzing step is carried out by comparing the expression level of the marker to a threshold value. In other embodiments, the dataset includes quantitative expression data for at least two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty or more markers selected from Table 1. The marker may be positively or negatively correlated with smoking status, and the expression level of the marker may be increased or decreased in a smoker as compared to a non-smoker. In other aspects, the methods of the invention are implemented on one or more computers. In some embodiments, the dataset is obtained by assaying a sample to experimentally determine expression values. In other embodiments, the dataset is obtained directly or indirectly from a third party that has processed the sample to experimentally determine the data. The data in the dataset may reflect measurements made using a nucleotide-based assay such as a qRT-PCR assay, a hybridization assay, or a sequencing reaction assay. In some embodiments, the methods of the invention are implemented using a computer processor. The invention also encompasses systems for determining smoking status of a subject.
The system includes a storage memory for storing the dataset, and a processor communicatively coupled to the storage memory for analyzing the dataset to determine the expression level of the marker. In other embodiments, the invention includes a computer-readable storage medium storing computer-executable program code for storing a dataset associated with a sample obtained from the subject, which dataset includes quantitative expression data for a marker selected from Table 1, and program code for analyzing the dataset to determine an expression level of the marker, where the expression level positively or negatively correlates with the subject's smoking status. In yet other embodiments, the system of the invention includes a storage memory for storing a dataset that includes a threshold value for a marker selected from Table 1. The threshold can be associated with expression data obtained from a non-smoking subject or a non-smoking population of subjects. Still other embodiments of the invention include a kit for use in determining a subject's smoking status that includes a set of reagents for determining from a sample obtained from the subject quantitative expression data for a marker selected from Table 1 and instructions for using the reagents to determine quantitative expression data from the sample and analyzing the dataset to determine an expression level of the marker, which positively or negatively correlates with the subject's smoking status. The instructions may further include threshold values for use in the analysis and/or an interpretation function for generating a score indicative of smoking status. The kit may include reagents for more than one marker selected from Table 1, for example, at least two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty or more markers.
Embodiments of the invention also incorporate predictive models and associated interpretation functions that operate on the quantitative expression data to generate a score that is indicative of a subject's smoking status. The predictive model can be a partial least squares model, a logistic regression model, a linear regression model, a linear discriminant analysis model, a ridge regression model, or a tree-based recursive partitioning model. In certain embodiments, the markers comprise CLDND1, LRRN3, MUC1, GOPC, or LEF1, or markers selected from Table 1 whose expression is correlated with CLDND1, LRRN3, MUC1, GOPC, and LEF1. In some embodiments, the interpretation function is log(Pr(Smoker)/(1−Pr(Smoker)))=15.78306+0.3876*SEX−3.3368*CLDND1−3.4034*LRRN3−1.4847*MUC1+5.9209*GOPC+2.27166*LEF1, wherein SEX=1 if male and 0 if female, and Pr is probability. In other embodiments, the interpretation function is an interpretation function set out in Table 7.
These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, and accompanying drawing, where:
FIG. 1 is a plot showing the relationship between the classification of samples into smoker and non-smoker categories by application of the predictive model and the mean cotinine levels for the same samples.
In general, terms used in the claims and the specification are intended to be construed as having the plain meaning understood by a person of ordinary skill in the art. Certain terms are defined below to provide additional clarity. In case of conflict between the plain meaning and the provided definitions, the provided definitions are to be used.
The term “Ct” refers to cycle threshold and is defined as the PCR cycle number where the fluorescent value is above a set threshold. Therefore, a low Ct value corresponds to a high level of expression, and a high Ct value corresponds to a low level of expression.
The term “Cp” refers to the crossing point and is defined as the intersection of the best fit of the log-linear portion of a standard's amplification curve in a real time PCR instrument such as, e.g., a LightCycler, and the noise band (set according to background fluorescence measurements).
The term “FDR” refers to false discovery rate. FDR can be estimated by analyzing randomly-permuted datasets and tabulating the average number of genes at a given p-value threshold.
The terms “GL,” “GM,” and “GU” respectively refer to the 1st percentile, median, and 99th percentile of Cp for that gene in the Algorithm Development data set.
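The permutation-based estimate described above can be sketched in Python as follows. This is an illustrative sketch only: the function name and inputs are hypothetical, and dividing the average number of genes passing the threshold in permuted datasets by the number passing in the real dataset is one common way to turn the tabulated counts into a rate, assumed here rather than taken from the text.

```python
def estimate_fdr(observed_p, permuted_p_sets, threshold):
    """Estimate FDR at a p-value threshold from permuted datasets.

    observed_p      -- per-gene p-values from the real dataset
    permuted_p_sets -- one list of per-gene p-values per permuted dataset
    threshold       -- p-value cutoff under consideration
    """
    # Number of genes called significant in the real data.
    observed_hits = sum(p <= threshold for p in observed_p)
    if observed_hits == 0:
        return 0.0  # nothing called, so no false discoveries to estimate
    # Average number of genes passing the cutoff across permutations.
    null_hits = [sum(p <= threshold for p in ps) for ps in permuted_p_sets]
    mean_null_hits = sum(null_hits) / len(permuted_p_sets)
    # Assumed ratio estimator, capped at 1.
    return min(1.0, mean_null_hits / observed_hits)
```

For example, if three genes pass a threshold of 0.05 in the real data and, on average, one gene passes in the permuted data, the estimate is about 0.33.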
The terms “marker” or “markers” encompass, without limitation, lipids, lipoproteins, proteins, cytokines, chemokines, growth factors, peptides, nucleic acids, genes, and oligonucleotides, together with their related complexes, metabolites, mutations, variants, polymorphisms, modifications, fragments, subunits, degradation products, elements, and other analytes or sample-derived measures. A marker can also include mutated proteins, mutated nucleic acids, variations in copy numbers, and/or transcript variants, in circumstances in which such mutations, variations in copy number and/or transcript variants are useful for generating a predictive model, or are useful in predictive models developed using related markers (e.g., non-mutated versions of the proteins or nucleic acids, alternative transcripts, etc.).
The terms “highly correlated gene expression” or “highly correlated marker expression” refer to gene or marker expression values that have a sufficient degree of correlation to allow their interchangeable use in a predictive model of smoking status. For example, if gene x having expression value X is used to construct a predictive model, highly correlated gene y having expression value Y can be substituted into the predictive model in a straightforward way readily apparent to those having ordinary skill in the art and the benefit of the instant disclosure. Assuming an approximately linear relationship between the expression values of genes x and y such that Y=a+bX, then X can be replaced in the predictive model with (Y−a)/b. For non-linear correlations, similar mathematical transformations can be used that effectively convert the expression value of gene y into the corresponding expression value for gene x. The terms “highly correlated marker” or “highly correlated substitute marker” refer to markers that can be substituted into and/or added to a predictive model based on, e.g., the above criteria. A highly correlated marker can be used in at least two ways: (1) by substitution of the highly correlated marker(s) for the original marker(s) and generation of a new model for predicting smoking status; or (2) by substitution of the highly correlated marker(s) for the original marker(s) in the existing model for predicting smoking status.
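Substitution into an existing model, i.e., way (2) above, can be illustrated with a short Python sketch. The function name, the dictionary layout, and the gene labels in the usage note are hypothetical; the constants a and b are assumed to have been estimated beforehand from paired measurements of genes x and y.

```python
def substitute_marker(values, original, substitute, a, b):
    """Replace a substitute marker's value with its x-equivalent.

    values     -- dict mapping marker name to expression value; it holds
                  the substitute marker (gene y) but not the original
    original   -- name of the model marker (gene x)
    substitute -- name of the highly correlated marker (gene y)
    a, b       -- intercept and slope of the assumed relation Y = a + b*X
    """
    if b == 0:
        raise ValueError("slope b must be non-zero to invert Y = a + b*X")
    out = dict(values)          # leave the caller's dict untouched
    y = out.pop(substitute)     # measured value Y of gene y
    out[original] = (y - a) / b # value the model expects for gene x
    return out
```

With a=1 and b=2, a measured value of 11 for the substitute gene corresponds to 5 on the scale of the original gene, and that value is then passed to the unchanged model.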
The term “mammal” encompasses both humans and non-humans and includes but is not limited to humans, non-human primates, canines, felines, murines, bovines, equines, and porcines.
The term “metagene” refers to a set of genes whose expression values are combined to generate a single value that can be used as a component in a predictive model. (Brunet, J. P., et al. Proc. Natl. Acad. Sciences 2004; 101(12):4164-9)
The term “sample” can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, taken from a subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.
The term “subject” encompasses a cell, tissue, or organism, human or non-human, whether in vivo, ex vivo, or in vitro, male or female.
The term “obtaining a dataset associated with a sample” encompasses obtaining a set of data determined from at least one sample. Obtaining a dataset encompasses obtaining a sample, and processing the sample to experimentally determine the data. The phrase also encompasses receiving a set of data, e.g., from a third party that has processed the sample to experimentally determine the dataset. Additionally, the phrase encompasses mining data from at least one database or at least one publication or a combination of databases and publications. A dataset can be obtained by one of skill in the art in a variety of known ways, including by retrieval from a storage memory on which it is stored.
It must be noted that, as used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.
Methods
Markers and Clinical Factors
The quantity of one or more markers of the invention can be indicated as a value. A value can be one or more numerical values resulting from evaluation of a sample under a condition. The values can be obtained, for example, by experimentally obtaining measures from a sample by an assay performed in a laboratory, or alternatively, obtaining a dataset from a service provider such as a laboratory, or from a database or a server on which the dataset has been stored, e.g., on a storage memory.
In an embodiment, the quantity of one or more markers can be one or more numerical values associated with expression levels of the genes set out in Table 1 resulting from evaluation of a sample under a condition. The column labels of Table 1 indicate the following: “Probe Name” refers to the names of probes found on Agilent Human Whole Genome Arrays (Agilent Technologies, Santa Clara, Calif.); “Gene Name” refers to the names of human genes in accordance with guidelines provided by the Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC). Further information about each human gene, such as accession number(s) and aliases, can be found by entering the gene name into the search page on the HGNC genenames.org website. For example, entering the term “LRRN3” into the Simple Search field of the HGNC website on Aug. 10, 2011 returns the approved gene name of LRRN3 (leucine rich repeat neuronal 3), the sequence accession IDs of LRRN3 (GenBank: AB060967; RefSeq: NM_001099658), and the previous symbols or synonyms for LRRN3 (FIGLER5, FLJ11129, NLRR3). Further human gene names are provided in the Examples section below. A person of ordinary skill in the art recognizes that the Gene Name information provided in Table 1 unambiguously identifies genes used as biomarkers in the present invention, and is able to use the Gene Name information of Table 1 to obtain protein and nucleic acid sequence information about the named gene without exercising undue experimentation. Such information readily enables the skilled artisan to obtain quantitative expression level data for these markers using any one of a number of methods described in this specification. “Smoking Log Odds” refers to a standard statistical measure of the association of a biomarker with smoking status.
A positive value in Table 1 indicates that the marker is positively associated with smoking status, while a negative value indicates that the marker is negatively associated with smoking status (i.e., that the marker is associated with negative (“non-smoking”) smoking status). Thus, if expression goes down with increased smoking, the marker has a negative value, and if expression goes up with increased smoking, the marker has a positive value in Table 1. “Smoking p” refers to the statistical significance of a marker's association (positive or negative) with smoking status.
In an embodiment, a marker's associated value can be included in a dataset associated with a sample obtained from a subject. A dataset can include the marker expression value of two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, thirteen or more, fourteen or more, fifteen or more, sixteen or more, seventeen or more, eighteen or more, nineteen or more, twenty or more, twenty-one or more, twenty-two or more, twenty-three or more, twenty-four or more, twenty-five or more, twenty-six or more, twenty-seven or more, twenty-eight or more, twenty-nine or more, or thirty or more marker(s) set out in Table 1. A dataset can include a subset or a complete set of the markers set out in Table 1 with other markers now known or later determined to be positively or negatively associated with smoking status. For example, a dataset can include the expression values for SASH1, P2RY6, MUC1, LRRN3, MGAT3, and CLDND1. In another embodiment, a dataset can include the expression values for CLDND1, LRRN3, MUC1, GOPC, and LEF1. Other combinations are described in more detail in the Examples section below. A dataset may also combine expression values for markers with a clinical factor, e.g., gender. A dataset may also combine expression values for markers with an indicator of a subject's sex (i.e., an indication of whether the subject is male or female). A dataset may also combine expression values for markers with an indicator of a subject's hypertension status.
In another embodiment, the invention includes obtaining a sample associated with a subject, where the sample includes one or more markers. The sample can be obtained by the subject or by a third party, e.g., a medical professional. Examples of medical professionals include physicians, emergency medical technicians, nurses, first responders, psychologists, medical physics personnel, nurse practitioners, surgeons, dentists, and any other obvious medical professional as would be known to one skilled in the art. A sample can include peripheral blood cells, isolated leukocytes, or RNA extracted from peripheral blood cells or isolated leukocytes. The sample can be obtained from any bodily fluid, for example, amniotic fluid, aqueous humor, bile, lymph, breast milk, interstitial fluid, blood, blood plasma, cerumen (earwax), Cowper's fluid (pre-ejaculatory fluid), chyle, chyme, female ejaculate, menses, mucus, saliva, urine, vomit, tears, vaginal lubrication, sweat, serum, semen, sebum, pus, pleural fluid, cerebrospinal fluid, synovial fluid, intracellular fluid, and vitreous humour. In an example, the sample is obtained by a blood draw, where the medical professional draws blood from a subject, such as by a syringe. The bodily fluid can then be tested to determine the value of one or more markers using an assay. The value of the one or more markers can then be evaluated by the same party that performed the assay using the methods of the invention or sent to a third party for evaluation using the methods of the invention.
Smoking status is well known to correlate with certain smoking-related disease risks. These include chronic obstructive pulmonary disease (COPD), chronic bronchitis, emphysema, lung cancer, and asthma (11, 12). Thus, the methods of the invention can be used to provide an independent risk factor to assess an individual's risk of developing one or more smoking-related diseases. The result from the methods of the invention can be fed into any one of a number of diagnostic processes that use smoking status to assess a smoking-related disease risk. These results can be used in lieu of or in addition to an individual's self-reported smoking status, for example, in providing patient history data to a physician, to an insurance carrier, or to any other entity that is interested in assessing an individual's risk of developing one or more smoking-related diseases.
Interpretation Functions
In an embodiment, an interpretation function can be a function produced by a predictive model. An interpretation function can also be produced by a plurality of predictive models. In an embodiment, an interpretation function can take the form of: log(Pr(Smoker)/(1−Pr(Smoker)))=15.7306+0.3876*SEX−3.3368*CLDND1−3.4034*LRRN3−1.4847*MUC1+5.9209*GOPC+2.7166*LEF1, where SEX=1 if male, 0 if female; Pr=probability. Other interpretation functions are set out in Table 7.
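As a minimal sketch of how an interpretation function of this form could be evaluated, the following Python code computes the log odds and converts them to Pr(Smoker) with the logistic transform. The coefficients are copied as printed in the preceding paragraph. The function names, the dictionary input, and the assumption that each gene value is a normalized expression measurement on the scale used to develop the model are illustrative and are not specified in this section.

```python
import math

# Intercept and coefficients as printed in the paragraph above.
INTERCEPT = 15.7306
COEFFS = {
    "SEX": 0.3876,      # SEX = 1 if male, 0 if female
    "CLDND1": -3.3368,
    "LRRN3": -3.4034,
    "MUC1": -1.4847,
    "GOPC": 5.9209,
    "LEF1": 2.7166,
}

def smoker_log_odds(values):
    """Return log(Pr(Smoker) / (1 - Pr(Smoker))) for one sample.

    values -- dict with one entry per key in COEFFS (hypothetical layout).
    """
    return INTERCEPT + sum(COEFFS[name] * values[name] for name in COEFFS)

def smoker_probability(values):
    """Convert the log odds to Pr(Smoker) with the logistic function."""
    z = smoker_log_odds(values)
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

A score of this kind can be reported as the probability itself or classified by comparing it with a cutoff; the choice of cutoff is left open here.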
In an embodiment, a predictive model can include a partial least squares model, a logistic regression model, a linear regression model, a linear discriminant analysis model, a ridge regression model, or a tree-based recursive partitioning model. In an embodiment, a predictive model can also include Support Vector Machines, quadratic discriminant analysis, or a LASSO regression model. See Elements of Statistical Learning, Springer 2003, Hastie, Tibshirani, Friedman; which is herein incorporated by reference in its entirety for all purposes. Predictive model performance can be characterized by an area under the curve (AUC). In an embodiment, predictive model performance is characterized by an AUC ranging from 0.68 to 0.70. In an embodiment, predictive model performance is characterized by an AUC ranging from 0.70 to 0.79. In an embodiment, predictive model performance is characterized by an AUC ranging from 0.80 to 0.89. In an embodiment, predictive model performance is characterized by an AUC ranging from 0.90 to 0.99. Interpretation functions can be developed using combinations of informative markers as shown in the Examples below, or using a single gene whose expression is highly correlated with smoking status. In certain embodiments, methods for classifying based on a single gene are developed using logistic regression or linear discriminant analysis (LDA).
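For orientation, the AUC used above to characterize model performance can be computed directly from model scores. The Python sketch below uses the rank-based definition of the area under the receiver operating characteristic curve, i.e., the fraction of smoker/non-smoker pairs in which the smoker receives the higher score, with ties counted as one half. The function and argument names are hypothetical, and the section does not state how the AUC values were computed.

```python
def auc(smoker_scores, nonsmoker_scores):
    """Area under the ROC curve from two lists of model scores.

    Equals the probability that a randomly chosen smoker scores higher
    than a randomly chosen non-smoker (ties count as one half).
    """
    if not smoker_scores or not nonsmoker_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for s in smoker_scores:
        for n in nonsmoker_scores:
            if s > n:
                wins += 1.0
            elif s == n:
                wins += 0.5
    return wins / (len(smoker_scores) * len(nonsmoker_scores))
```

An AUC of 0.5 corresponds to scores that do not separate the two groups, and an AUC of 1.0 to perfect separation.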
Assays
Examples of assays for one or more markers include DNA assays, microarrays, and sequencing-based assays in which the number of sequenced molecules is counted and the count used to determine expression level. The sequenced molecules can be cDNAs corresponding to mRNA transcripts. Other assays include polymerase chain reaction (PCR), RT-qPCR, sequencing assays, Southern blots, Northern blots, antibody-binding assays, enzyme-linked immunosorbent assays (ELISAs), flow cytometry, protein assays, Western blots, nephelometry, turbidimetry, chromatography, mass spectrometry, immunoassays, including, by way of example, but not limitation, RIA, immunofluorescence, immunochemiluminescence, immunoelectrochemiluminescence, or competitive immunoassays, immunoprecipitation, and the assays described in the Examples section below. The information from the assay can be quantitative and sent to a computer system of the invention. The information can also be qualitative, such as observing patterns or fluorescence, which can be translated into a quantitative measure by a user or automatically by a reader or computer system. In an embodiment, the subject can also provide information other than assay information to a computer system, such as a clinical factor (e.g., gender).
In addition to the use of RT-qPCR to assess expression levels, other modalities such as microarrays or RNA sequencing can be used. For example, to crosswalk a predictive model based on RT-qPCR data to microarray data, the array data is first subjected to standard normalization. A regression line is then fit to predict the PCR value for each of the model genes from its array value. The fitted values of each regression are then inserted into the smoking model as predictors. To crosswalk a predictive model to RNA sequencing, targeted re-sequencing of the model genes is accomplished using a next-generation sequencing platform. Raw sequence reads are aligned to the respective targeted genes, and raw expression levels assessed by calculating depth of coverage. Raw values are normalized by the total number of raw sequences per sample and length of target gene. A regression line is then fit to predict the PCR value for each of the model genes from its normalized sequence value. The fitted values of each regression are inserted into the smoking model as predictors.
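The crosswalk steps described above can be sketched in Python as follows. This is an illustration under stated assumptions: all function names are hypothetical, the normalization is taken to be a simple division by total reads and gene length, and the regression is an ordinary least-squares line fit on paired samples measured on both platforms. None of these details is fixed by the text.

```python
def normalize_depth(raw_depth, total_reads, gene_length):
    """Normalize a raw depth-of-coverage value for one gene in one sample.

    Assumed form: divide by total raw sequences per sample and by the
    length of the target gene.
    """
    return raw_depth / (total_reads * gene_length)

def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def crosswalk(new_values, paired_platform, paired_pcr):
    """Predict PCR-scale values for one model gene.

    paired_platform, paired_pcr -- values for the same samples measured
    on the other platform (array or sequencing) and by RT-qPCR.
    new_values -- other-platform values for the samples to be scored.
    """
    a, b = fit_line(paired_platform, paired_pcr)
    # Fitted values are what would be inserted into the smoking model.
    return [a + b * v for v in new_values]
```

One regression is fit per model gene, and the fitted values for all model genes are then supplied to the unchanged interpretation function in place of the RT-qPCR measurements.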
Informative Marker Groups
In addition to the specific, exemplary markers identified in this application by name, accession number, or sequence, included within the scope of the invention are all operable predictive models of smoking status and methods for their use to score and optionally classify samples using expression values of variant sequences having at least 90% or at least 95% or at least 97% or greater identity to the exemplified sequences or that encode proteins having sequences with at least 90% or at least 95% or at least 97% or greater identity to those encoded by the exemplified genes or sequences. The percentage of sequence identity may be determined using algorithms well known to those of ordinary skill in the art, including, e.g., BLASTn and BLASTp, as described in Stephen F. Altschul et al., J. Mol. Biol. 215:403-410 (1990) and available at the National Center for Biotechnology Information website maintained by the National Institutes of Health. As described below, also within the scope of an embodiment of the present invention are all operable predictive models, and methods for their use in scoring and optionally classifying samples, that use a marker expression measurement that is now known or later discovered to be highly correlated with an exemplary marker expression value, in addition to or in lieu of that exemplary marker expression value. For the purposes of the present invention, such highly correlated genes are contemplated either to be within the literal scope of the claimed inventions or alternatively encompassed as equivalents to the exemplary markers. Identification of markers having expression values that are highly correlated to those of the exemplary markers, and their use as a component of a predictive model, is well within the level of ordinary skill in the art.
The Examples section below provides numerous examples of methods for identifying highly correlated markers and substituting them for algorithm markers in predictive models of smoking status and methods for their use to score and optionally classify samples.
Computer Implementation
In one embodiment, a computer comprises at least one processor coupled to a chipset. Also coupled to the chipset are a memory, a storage device, a keyboard, a graphics adapter, a pointing device, and a network adapter. A display is coupled to the graphics adapter. In one embodiment, the functionality of the chipset is provided by a memory controller hub and an I/O controller hub. In another embodiment, the memory is coupled directly to the processor instead of the chipset.
The storage device is any device capable of holding data, such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory holds instructions and data used by the processor. The pointing device may be a mouse, track ball, or other type of pointing device, and is used in combination with the keyboard to input data into the computer system. The graphics adapter displays images and other information on the display. The network adapter couples the computer system to a local or wide area network.
As is known in the art, a computer can have different and/or other components than those described previously. In addition, the computer can lack certain components. Moreover, the storage device can be local and/or remote from the computer (such as embodied within a storage area network (SAN)).
As is known in the art, the computer is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic utilized to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device, loaded into the memory, and executed by the processor.
The term percent “identity,” in the context of two or more nucleic acid or polypeptide sequences, refers to two or more sequences or subsequences that have a specified percentage of nucleotides or amino acid residues that are the same, when compared and aligned for maximum correspondence, as measured using one of the sequence comparison algorithms described below (e.g., BLASTP and BLASTN or other algorithms available to persons of skill) or by visual inspection. Depending on the application, the percent “identity” can exist over a region of the sequence being compared, e.g., over a functional domain, or, alternatively, exist over the full length of the two sequences to be compared.
For sequence comparison, typically one sequence acts as a reference sequence to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are input into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated program parameters.
Optimal alignment of sequences for comparison can be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Nat'l. Acad. Sci. USA 85:2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, Wis.), or by visual inspection (see generally Ausubel et al., infra).
One example of an algorithm that is suitable for determining percent sequence identity and sequence similarity is the BLAST algorithm, which is described in Altschul et al., J. Mol. Biol. 215:403-410 (1990). Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information.
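For illustration only, percent identity over an alignment that has already been computed can be sketched as follows. This is not an implementation of BLAST or of any of the alignment algorithms cited above; it assumes the two sequences are supplied pre-aligned with "-" as the gap character, and it counts identity over all alignment columns, which is one of several conventions in use. The function name and example sequences are hypothetical.

```python
def percent_identity(aligned_ref, aligned_test, gap="-"):
    """Percent identity over the columns of a pairwise alignment.

    Both strings must already be aligned (equal length, gaps marked with `gap`).
    Columns in which both sequences carry a gap are ignored.
    """
    if len(aligned_ref) != len(aligned_test):
        raise ValueError("aligned sequences must have equal length")
    columns = 0
    matches = 0
    for a, b in zip(aligned_ref.upper(), aligned_test.upper()):
        if a == gap and b == gap:
            continue
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        raise ValueError("alignment has no comparable columns")
    return 100.0 * matches / columns


# Hypothetical 10-base alignment with a single mismatch.
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
meets_90 = identity >= 90.0  # compare against the 90% identity level
```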
Embodiments of the entities described herein can include other and/or different modules than the ones described here. In addition, the functionality attributed to the modules can be performed by other or different modules in other embodiments. Moreover, this description occasionally omits the term “module” for purposes of clarity and convenience.
Kits
The invention provides kits for determining quantitative expression data for one or more markers selected from Table 1 and instructions for using the data to determine a subject's smoking status. Optionally the kit may include packaging. The kit can include reagents for carrying out a nucleotide-based assay such as a qRT-PCR assay, a hybridization assay, or a sequencing assay for determining the expression levels of the one or more markers selected from Table 1. The kit can include reagents for carrying out any of the other types of assays described in this specification. The reagents can be probes and primers such as those set out in Table 4, or other similar reagents. The reagents can be probes such as the probes identified in Table 1 or Table 2. The instructions can include an interpretation function that is used to operate on the quantitative expression data. The interpretation function can be generated from a predictive model. The instructions can include thresholds that can be determined from a smoking subject or a smoking population of subjects, or from a non-smoking subject or a non-smoking population of subjects. The instructions can include methods for comparing the quantitative expression data to a threshold for determining smoking status.
Below are examples of specific embodiments for carrying out the present invention. The examples are offered for illustrative purposes only, and are not intended to limit the scope of the present invention in any way. Efforts have been made to ensure accuracy with respect to numbers used (e.g., amounts, temperatures, etc.), but some experimental error and deviation should, of course, be allowed for.
The practice of the present invention will employ, unless otherwise indicated, conventional methods of protein chemistry, biochemistry, recombinant DNA techniques and pharmacology, within the skill of the art. Such techniques are explained fully in the literature. See, e.g., T. E. Creighton, Proteins: Structures and Molecular Properties (W.H. Freeman and Company, 1993); A. L. Lehninger, Biochemistry (Worth Publishers, Inc., current edition); Sambrook, et al., Molecular Cloning: A Laboratory Manual (2nd Edition, 1989); Methods In Enzymology (S. Colowick and N. Kaplan eds., Academic Press, Inc.); Remington's Pharmaceutical Sciences, 18th Edition (Easton, Pa.: Mack Publishing Company, 1990); Carey and Sundberg, Advanced Organic Chemistry, 3rd Ed. (Plenum Press), Vols. A and B (1992).
Materials and Methods
Statistical Methods
All statistical methods were performed using the R software package. The statistical methods used are described and referenced in greater detail below.
Gene Selection
Genes for RT-PCR were selected based on significance, fold-change, pathway analysis, and literature support. Hierarchical clustering based on gene:gene correlations ensured that RT-PCR genes represented multiple clusters. Normalization genes were selected based on low variance, moderate to high expression, and no significant association with case:control status, sex, age, or cell counts.
PCR Statistical Analysis
Clinical/demographic factors were assessed for smoking status association using univariate and multivariate logistic regression. Gene expression association with smoking status and other clinical/demographic factors was assessed by robust logistic regression (unadjusted and sex/age adjusted) (7).
Whole Genome Microarray Analysis
Phase I—PREDICT DISCOVERY.
We performed whole genome microarray analysis on RNA isolated from 210 catheter lab patients enrolled in a prospective clinical trial (PREDICT) designed to identify gene expression signatures that correlate with coronary artery disease.
Blood was collected at the time of catheterization in PAXgene tubes. RNA was isolated by an automated method, using the Agencourt RNAdvance system, and quantified using Ribogreen (Invitrogen (now Life Technologies), Carlsbad, Calif.). RNA was labeled with Cy3 using methods recommended by the manufacturer (Agilent, Santa Clara, Calif.) and hybridized to whole genome arrays (Agilent Human Whole Genome Arrays).
Array Normalization
Agilent processed signal values for array normalization were scaled to a trimmed mean of 100 and then log2 transformed. Standard array QC metrics (percent present, pairwise correlation, and signal intensity) were used for quality assessment, resulting in 12 of 210 PREDICT samples being excluded.
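The scaling and transformation step can be sketched as follows, purely by way of example. The trim fraction is not stated in this description, so the default used here is an assumption, as are the function names and the signal values.

```python
import math


def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values.

    The default trim fraction is an assumption, not a value from the study.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


def normalize_array(signals, target=100.0, trim=0.02):
    """Scale positive signals to a trimmed mean of `target`, then take log2."""
    factor = target / trimmed_mean(signals, trim)
    return [math.log2(s * factor) for s in signals]


# Hypothetical processed signals for one array.
signals = [50.0, 100.0, 200.0, 400.0, 50.0]
normalized = normalize_array(signals, target=100.0, trim=0.2)
```

Because scaling by a positive factor preserves rank order, the back-transformed values have a trimmed mean equal to the target.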
Array Analysis
For the PREDICT array, given the paired design, conditional logistic regression was used. False discovery rates were used to account for multiple comparisons. GOEAST was used to determine over-representation of Gene Ontology (GO) terms.
Array feature data was extracted using Agilent Feature Extraction software and normalized using quantile normalization.
Algorithm Calculation and Transformation
Data Preprocessing and QC Steps
In certain cases, an algorithm score could not be calculated for a subject. Reasons for this included low PAXgene® tube blood volume, lab QC failure, etc. The frequency of occurrence of these failures was tabulated, though these subjects were not included in the analysis set.
To identify genes whose expression levels correlated with smoking, a robust linear model was used (8), with smoking status used as the dependent variable and age, sex, and gene expression as the independent variables. Table 1 contains the 4988 probes (representing 4214 genes) which showed significant correlation with smoking status (p<0.05). 1933 probes were down-regulated (indicated by negative smoking Log Odds) in smokers and 3055 were up-regulated (indicated by positive smoking Log Odds) (Table 1).
A small number of genes (36) had more than one associated probe, of which one or more probes displayed up-regulation and one or more displayed down-regulation.
In a subsequent microarray analysis on RNA isolated from 150 female catheter lab patients enrolled in the PREDICT trial, six genes were selected for further evaluation via RT-qPCR due to their association with self-reported smoking status: SASH1, P2RY6, MUC1, LRRN3, MGAT3, and CLDND1.
Among microarray probes that were biologically annotated and had non-zero expression, these were selected because they exhibited the strongest absolute correlation with smoking status (r>0.425). All have higher expression with smoking status. Five of these six RT-qPCR probe designs were successful; the design for MGAT3 was not, and this gene was not included. These 5 genes are designated as Set 1. Corresponding Agilent Whole Genome Array probes and Gene Name for the Set 1 markers (and MGAT3) are provided in Table 2.
Phase II—RT-qPCR Analysis.
In an RT-qPCR analysis of RNA isolated from 1039 PREDICT patients, 261 genes, including the Set 1 genes, were assessed for association with smoking status (Table 3). The additional 255 genes were selected for association with coronary disease, associated traits (e.g., lipid levels) or as cell markers. Expression values for the 261 genes were normalized to the mean of ACLY and TFCP2; expression values were truncated if they fell below the 0.01 quantile or exceeded the 0.99 quantile. Of the genes, 135 showed a significant association with smoking status in an age- and sex-adjusted logistic regression model; 59 of the 80 significant array genes remained significant by qRT-PCR. Of this set, all but 3 (HIST1H2AC, NONO, PAPD4) agreed with the array data in directionality of gene expression. LRRN3 remained the gene most significantly associated with smoking status, followed by CLDND1, SASH1, and P2RY6 (p<0.001). Note that in Table 3, below, for the genes GNAS and FTH1, an “x” suffix in the gene symbol denotes that the assay for the given gene was designed against an exonic sequence; an “n” suffix in the gene symbol denotes that the assay for the given gene was designed against an intronic sequence.
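As a non-limiting sketch, the normalization and truncation steps might be carried out as shown below. Two points are assumptions rather than statements from this description: normalization is taken to mean subtracting the mean Ct of the two reference genes, and truncation is taken to mean clamping values to the 0.01 and 0.99 quantiles. The function names and Ct values are hypothetical.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a list, for q between 0 and 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def truncate(values, low_q=0.01, high_q=0.99):
    """Clamp values lying outside the given quantiles to those quantiles."""
    lo = quantile(values, low_q)
    hi = quantile(values, high_q)
    return [min(max(v, lo), hi) for v in values]


def normalize_sample(ct, reference_genes=("ACLY", "TFCP2")):
    """Subtract the mean Ct of the reference genes from every other gene."""
    ref = sum(ct[g] for g in reference_genes) / len(reference_genes)
    return {g: v - ref for g, v in ct.items() if g not in reference_genes}


# Hypothetical Ct values for one sample.
sample = normalize_sample({"ACLY": 20.0, "TFCP2": 22.0, "LRRN3": 24.5})

# Hypothetical values of one gene across 101 samples.
truncated = truncate([float(i) for i in range(101)])
```

Truncation is applied per gene across samples, so the quantiles are computed from the population of values for that gene.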
Determining Discriminating Threshold Values for Individual Markers.
In certain embodiments, smoking status is determined by analyzing the expression level of a single marker or a group of markers to determine whether the expression level is significantly different from a threshold level established by analysis of the expression level of the same marker or markers in a non-smoking subject or population of non-smoking subjects. A significant difference between a subject's value for the marker or markers and the threshold value indicates smoking status. Conversely, methods can be developed to set thresholds using a smoking subject or population of smoking subjects. In this embodiment, a significant difference indicates non-smoking status of the subject providing the test sample.
A predictive model for smoking status was built using stepwise forward logistic regression (9), with smoking status as the dependent variable and age, sex, and gene expression as the independent variables. A patient was defined as a smoker if their self-reported smoking status was a current smoker or one who had recently quit (within the previous two months). Five genes were selected by the model, designated as Set 2. Three of the Set 1 genes were selected by the model (CLDND1, LRRN3, MUC1). The remaining 2 genes (GOPC, LEF1) were included in the 259 gene analysis due to CAD association and as a CD8+ naïve cell marker, respectively. The probe and primer sequences used to assess the expression levels of the Set 2 genes are given in Table 4.
The resulting model formula is:
log(Pr(Smoker)/(1−Pr(Smoker)))=15.7306+0.3876*SEX−3.3368*CLDND1−3.4034*LRRN3−1.4847*MUC1+5.9209*GOPC+2.7166*LEF1,
where SEX=1 if male, 0 if female; Pr=probability.
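For illustration, the model formula can be evaluated and converted to a probability by inverting the log-odds. The coefficients below are those of the formula above; the inputs are assumed to be expression values normalized and truncated as described in the "Phase II" section, and the sample values and function names are hypothetical.

```python
import math

INTERCEPT = 15.7306
SEX_COEFFICIENT = 0.3876  # applied when SEX=1 (male)
GENE_COEFFICIENTS = {
    "CLDND1": -3.3368,
    "LRRN3": -3.4034,
    "MUC1": -1.4847,
    "GOPC": 5.9209,
    "LEF1": 2.7166,
}


def smoker_log_odds(expression, male):
    """log(Pr(Smoker)/(1-Pr(Smoker))) for one subject."""
    score = INTERCEPT + (SEX_COEFFICIENT if male else 0.0)
    for gene, coefficient in GENE_COEFFICIENTS.items():
        score += coefficient * expression[gene]
    return score


def smoker_probability(expression, male):
    """Invert the log-odds to obtain Pr(Smoker), avoiding overflow."""
    z = smoker_log_odds(expression, male)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical normalized expression values for a female subject.
subject = {"CLDND1": 2.0, "LRRN3": 2.0, "MUC1": 1.0, "GOPC": 0.0, "LEF1": 0.0}
log_odds = smoker_log_odds(subject, male=False)
probability = smoker_probability(subject, male=False)
is_smoker = probability >= 0.5  # cutoff probability of 0.5
```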
For predicting self-reported status, the model has a mean AUC of 0.932 in ten-fold cross-validation. At a cutoff probability of 0.5, the fitted sensitivity of the model was 0.784, with a specificity of 0.953. More details of model performance are provided in Table 5. Model performance was validated using 180 independent PREDICT subjects, with an AUC of 0.82 (95% CI 0.65-0.94), a sensitivity of 0.63 and a specificity of 0.94.
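The performance measures named above can be computed from fitted probabilities and reference labels as sketched below. The data are hypothetical and do not reproduce the reported results; treating a probability equal to the cutoff as a positive call is an assumption.

```python
def sensitivity_specificity(probabilities, is_smoker, cutoff=0.5):
    """Sensitivity and specificity when calling 'smoker' at p >= cutoff."""
    tp = fn = tn = fp = 0
    for p, truth in zip(probabilities, is_smoker):
        predicted = p >= cutoff
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(probabilities, is_smoker):
    """Area under the ROC curve as the fraction of correctly ranked pairs."""
    pos = [p for p, t in zip(probabilities, is_smoker) if t]
    neg = [p for p, t in zip(probabilities, is_smoker) if not t]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical fitted probabilities and self-reported status.
probs = [0.9, 0.7, 0.4, 0.2, 0.1, 0.6]
truth = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(probs, truth)
area = auc(probs, truth)
```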
Ten-fold cross-validated sub-models were developed for all subsets of the Set 2 genes using the approach described in Example 1 (i.e., stepwise forward logistic regression (9), with smoking status as the dependent variable and age, sex, and gene expression as the independent variables). The performance of these sub-models is set out in Table 6.
The equations corresponding to these sub-models are set out below in Table 7.
They are applied in the same manner as the equation set out in Example 1 and are used to solve for log(Pr(Smoker)/(1−Pr(Smoker))). For example, the formula associated with submodel 1 is: log(Pr(Smoker)/(1−Pr(Smoker)))=3.411334+0.5660*SEX−6.4940*CLDND1, where SEX=1 if male, 0 if female; Pr=probability.
Models were developed based on the model described in Example 1, in which one of the markers was substituted with a highly correlated marker; the criterion for selecting the highly correlated marker was maximum (Pearson) correlation R to the original gene. Table 8 shows the mean AUC in ten-fold cross-validated models where each gene in the smoking model was replaced by a gene selected from 253 study genes that was not already included in the smoking model. Correlations (expressed as R values) are also included in Table 8.
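By way of example, selection of a substitute marker by maximum Pearson correlation can be sketched as follows. The candidate gene names GENE_A and GENE_B and all expression values are hypothetical placeholders; the selection uses the signed correlation, consistent with the criterion stated above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient R between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_substitute(target, candidates, expression):
    """Return (gene, R) for the candidate most correlated with the target."""
    scored = [(pearson(expression[target], expression[g]), g) for g in candidates]
    r, gene = max(scored)
    return gene, r


# Hypothetical expression values across four samples.
expression = {
    "LRRN3": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.5],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
}
substitute, r_value = best_substitute("LRRN3", ["GENE_A", "GENE_B"], expression)
```

The substitute then takes the place of the original gene when the model is refit and cross-validated.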
The reported mean Cts and standard deviations of the informative set of 259 markers in the smoking and non-smoking population are as shown in Table 9. These values have been normalized and truncated as described above in the “Phase II—RT-qPCR Analysis” section.
To check on the accuracy of self-reported smoking status, a biochemical method was used. Levels of cotinine, a relatively stable metabolite of nicotine, have been shown to correlate with self-reported smoking status and can be easily measured by enzyme-linked immunoassays (ELISA)(10). Twenty samples were assayed in total: ten samples from self-reported non-smokers with less than 0.3% fitted probability of smoking by gene expression, and ten samples from self-reported smokers with a greater than 99% fitted probability of smoking by gene expression. Strong concordance was seen between self-reported status and cotinine levels with the exception of 1 self-reported smoker with lower cotinine levels (
Using a threshold of 10 ng/ml, cotinine levels provided an AUC of 0.89 (95% CI 0.81-0.97), a sensitivity of 0.81 and a specificity of 0.97. Moderate concordance was observed between the gene expression model and cotinine (91% agreement, 95% CI 85.97-94.83, kappa=0.53); where both methods reported positive smoking status, 85% (11) of the subjects were self-reported smokers, 1 had recently quit, and 1 was a former smoker.
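For illustration, percent agreement and Cohen's kappa between two binary calls, such as the gene expression model and the cotinine threshold, can be computed as sketched below. The calls shown are hypothetical and do not reproduce the study data.

```python
def agreement_and_kappa(calls_a, calls_b):
    """Observed agreement and Cohen's kappa for two lists of 0/1 calls."""
    n = len(calls_a)
    if n == 0 or n != len(calls_b):
        raise ValueError("need two non-empty lists of equal length")
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    rate_a = sum(calls_a) / n
    rate_b = sum(calls_b) / n
    # Agreement expected by chance from the marginal positive rates.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical calls (1 = smoker) for ten subjects.
model_calls = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
cotinine_calls = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
agreement, kappa = agreement_and_kappa(model_calls, cotinine_calls)
```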
All references cited in this specification, including scientific publications, published patent applications, and issued patents, are hereby incorporated in their entirety for all purposes.
This application claims the benefit of U.S. Provisional Application No. 61/528,616, filed Aug. 29, 2011, the entire disclosure of which is hereby incorporated by reference in its entirety for all purposes.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US12/52303 | 8/24/2012 | WO | 00 | 7/31/2014 |
Number | Date | Country | |
---|---|---|---|
61528616 | Aug 2011 | US |