1. Field of the Invention
The invention relates to the field of diagnostic and prognostic methods that use gene and protein biomarkers to determine a subject's radiation exposure and to discriminate between persons who have and have not been exposed to radiation, and between various levels of radiation exposure.
2. Related Art
Ionizing radiation causes well understood molecular, cellular, and tissue damage, with health effects that range in severity from non-detectable to acute radiation sickness and possibly death. Exposure dose is the main predictor of the severity of the health effect. Unlike for medical procedures, assessing radiation dose in cases of nuclear accidents and nuclear terrorism remains a major unmet challenge. Rapid methods are needed to determine an individual's dose using small biological samples to separate the larger numbers of worried well from those individuals who will benefit most from immediate medical care. More than 200 mammalian proteins have been reported to be responsive to ionizing radiation [1], but currently there are no small panels of proteins capable of individual biodosimetry. Currently, there is no accepted method or protocol for human radiation biodosimetry, i.e., for determining radiation exposure in cases where a physical dosimeter, such as a badge reader, is not available or practical.
Ossetrova and Blakely (2009) investigated the utility of multiple blood protein biomarkers for early-response assessment of radiation exposure using BALB/c mice. Serum amyloid A (SAA) was measured in plasma of irradiated mice using ELISA at 4, 24, 48 and 72 hr after whole body exposure to 0, 1, 2, 3.5, 5 and 7 Gy. Results showed significant dose-related increases in protein levels in plasma of irradiated mice. SAA was significantly increased at doses of 2 Gy and above at 24 hr only. This study was performed using an optimized ELISA protocol; their measurements were well above the limit of detection (LOD). The authors demonstrated that the use of multivariate discriminant analysis enhanced dose-dependent separation of irradiated animals from controls as the number of biomarkers increased.
Rithidech et al (2009) utilized two-dimensional electrophoresis gel coupled with mass spectrometry to analyze plasma proteins in CBA/CaJ mice exposed to 0 or 3 Gy. Plasma was collected from total body irradiated mice at 2 and 7 days post-irradiation. A dose dependent increase in both CC3 and VCAM levels was observed.
Prat (2005 and 2006) demonstrated that FLT3LG levels in the plasma of BALB/c mice were increased after whole body exposure to 2, 4, 7.5 and 11 Gy. Results showed that FLT3LG levels remained increased throughout the duration of the experiment which concluded at 28 days post irradiation.
Sugimoto (2001) demonstrated that C3H/HeN mice that were total body irradiated with 15 Gy showed a significant increase in serum FLT3LG levels.
This investigation was undertaken to evaluate classification analysis of gold standard ELISA proteomic data for four candidate markers for individual radiation dose prediction. We are down-selecting candidate proteins to identify small panels of proteins for individual biodosimetry, using a mouse model as described herein and in Kim, D., Marchetti, F., Chen, Z., Zaric, S., Wilson, R. J., Hall, D. A., Gaster, R. S., Lee, J. R., Wang, J., Osterfeld, S. J., Yu, H., White, R. M., Blakely, W. F., Peterson, L. E., Bhatnagar, S., Mannion, B., Tseng, S., Roth, K., Coleman, M. A., Snijders, A. M., Wyrobek, A. J., Wang, S. X. Nanosensor dosimetry of mouse blood proteins after exposure to ionizing radiation. Scientific Reports(Nature). 3:2234, 2013, hereby incorporated by reference in its entirety.
We selected the following four biomarkers for analysis by ELISA: 1) FLT3 ligand (Fms-related tyrosine kinase 3 ligand; FLT3LG), which is a hematopoietic growth factor and is used as a clinical indicator for bone marrow status; 2) Serum amyloid A (SAA1), which is a major acute phase protein that is expressed and regulated in response to tissue injury and inflammation; 3) CC3, also known as HTATIP2 (HIV-1 Tat interactive protein 2), which is an oxidoreductase with proapoptotic as well as antiangiogenic properties; and 4) VCAM-1 (vascular cell adhesion molecule 1), which plays a role in cell-cell recognition [3].
Using an in vivo mouse radiation model, we developed protocols for measuring FLT3 ligand (FLT3LG), serum amyloid A1 (SAA1), HIV-1 Tat interactive protein 2 (CC3) and vascular cell adhesion molecule 1 (VCAM-1) in small amounts of blood collected during the first week after X-ray exposures of sham, 0.1, 1, 2, 3, or 6 Gy. FLT3LG concentrations showed excellent dose discrimination at ≥1 Gy in the time window of 1 to 7 days after exposure, except for 1 Gy at day 7. SAA1 dose response was limited to the first two days after exposure. A multiplex assay with both proteins, and with all four proteins, showed improved dose classification accuracy.
Random forests analyses were then used for calculating permutation-based importance scores, feature selection frequency counts during decision tree learning, and for generating unsupervised and supervised cluster representations of samples via eigenanalysis of decision tree-based proximity matrices. We also employed several feature filtering and selection techniques, and a variety of supervised classification methods for class prediction of irradiation dose category.
Thus, the invention provides for a panel of one, two, three or four blood plasma proteins that are responsive to whole body radiation exposure and the development of an algorithm based on protein expression that discriminates individuals into 5 radiation-exposure categories: those not exposed to ionizing radiation, and those exposed to 1 Gy, 2 Gy, 3 Gy, or 6 Gy of radiation. Studies were conducted in mice and are described herein. Studies will be validated in human blood. Classification analysis resulted in assigning unirradiated and irradiated mice to their correct dose groups (0, 1, 2, 3, and 6 Gy) with 90-100% accuracy on day 1 and with 100% accuracy on day 5 after exposure.
Table 1. Number of class comparisons, M, and number of biomarkers, Nm, filtered for each comparison.
Table 2. Supervised classification accuracy for 5-class problem involving dose categories 0 Gy, 1 Gy, 2 Gy, 3 Gy, and 6 Gy for Day-1 ELISA responses. Leave-one-out cross validation used.
Table 3. Classification accuracy for 5-class problem involving dose categories 0 Gy, 1 Gy, 2 Gy, 3 Gy, and 6 Gy for Day-5 ELISA responses. Leave-one-out cross validation used.
Tables 4A, 4B, 4C, and 4D. Gender variation and dose dependence of plasma protein concentrations with time after exposures. Dose response tables for each biomarker: average concentrations (all samples), average concentrations (males and females), standard deviation; separated by time points (24 hr and 5 d). (Table 4A) FLT3LG. (Table 4B) SAA1. (Table 4C) CC3. (Table 4D) VCAM1.
Table 5. Receiver operator characteristic (ROC) area under the curve (AUC) for 2-class models using k-nearest neighbors (k=7, “7NN”) with leave-one-out cross validation (LOOCV) and linear discriminant analysis (LDA) for Day 1 and Day 5.
Table 6. Receiver operator characteristic (ROC) area under the curve (AUC) for one-against-other classification analyses for triplicate ELISA Day 1 and 5 response using k-nearest neighbor (k=7, i.e. “7NN”) and linear discriminant analysis (LDA) with leave-one-out cross validation (LOOCV) for Day 1 and Day 5.
Tables 7A, 7B, and 7C. (Table 7A) T-test results (t-statistics and p-values in parentheses) for FLT3LG concentrations between two doses on day 1 (above diagonal line) and day 5 (below diagonal line). Conclusion: FLT3LG mean concentration in plasma was significantly different between all pairs of doses on day 1 and day 5 after irradiation. (Table 7B) T-test results (t-statistics and p-values in parentheses) for FLT3LG concentrations between two dose groups on day 1. Conclusion: t-test results show that on day 1 after irradiation, FLT3LG concentration in plasma was significantly different between 0 Gy and groups (1 Gy, 2 Gy), (1 Gy, 2 Gy, 3 Gy), (1 Gy, 2 Gy, 3 Gy, 6 Gy), between 1 Gy and groups (2 Gy, 3 Gy), (2 Gy, 3 Gy, 6 Gy), and between 2 Gy and group (3 Gy, 6 Gy). (Table 7C) T-test results (t-statistics and p-values in parentheses) for FLT3LG concentrations between two dose groups on day 5. Conclusion: t-test results indicate that on day 5 after irradiation, FLT3LG concentration in plasma was significantly different between 0 Gy and groups (1 Gy, 2 Gy), (1 Gy, 2 Gy, 3 Gy), (1 Gy, 2 Gy, 3 Gy, 6 Gy), between 1 Gy and groups (2 Gy, 3 Gy), (2 Gy, 3 Gy, 6 Gy), and between 2 Gy and group (3 Gy, 6 Gy).
Tables 8A, 8B, and 8C. (Table 8A) T-test results (t-statistics and p-values in parentheses) for SAA1 concentrations between two doses on day 1 (above diagonal line) and day 5 (below diagonal line). Conclusion: on day 1 after irradiation, SAA1 mean concentration in plasma was significantly different between 0 Gy and 2 Gy, 3 Gy, 6 Gy and between 1 Gy and 2 Gy, 3 Gy, 6 Gy, but not between 2 Gy and 3 Gy, 6 Gy and also not between 3 Gy and 6 Gy. On day 5, SAA1 mean concentration was not significantly different between any two doses. (Table 8B) T-test results (t-statistics and p-values in parentheses) for SAA1 concentrations between two dose groups on day 1. Conclusion: t-test results show that on day 1 after irradiation, SAA1 concentration in plasma was significantly different between 0 Gy and groups (1 Gy, 2 Gy), (1 Gy, 2 Gy, 3 Gy), (1 Gy, 2 Gy, 3 Gy, 6 Gy), and between 1 Gy and groups (2 Gy, 3 Gy), (2 Gy, 3 Gy, 6 Gy), but not between 2 Gy and group (3 Gy, 6 Gy). (Table 8C) T-test results (t-statistics and p-values in parentheses) for SAA1 concentrations between two dose groups on day 5. Conclusion: SAA1 concentration in plasma was not significantly different between these dose groups on day 5 after irradiation.
Tables 9A, 9B, and 9C. (Table 9A) T-test results (t-statistics and p-values in parentheses) for CC3 concentrations between two doses on day 1 (above diagonal line) and day 5 (below diagonal line). Conclusion: t-test results indicate that CC3 concentration in plasma was significantly different between 0 Gy and 2 Gy, 3 Gy, 6 Gy, and between 1 Gy and 3 Gy, 6 Gy on day 1 after irradiation; on day 5, CC3 concentration was significantly different between 0 Gy and 6 Gy and between 2 Gy and 6 Gy. (Table 9B) T-test results (t-statistics and p-values in parentheses) for CC3 concentrations between two dose groups on day 1. Conclusion: t-test results indicate significantly different mean concentrations between 0 Gy and groups (1 Gy, 2 Gy), (1 Gy, 2 Gy, 3 Gy), (1 Gy, 2 Gy, 3 Gy, 6 Gy), between 1 Gy and group (2 Gy, 3 Gy, 6 Gy), and between 2 Gy and group (3 Gy, 6 Gy) on day 1. (Table 9C) T-test results (t-statistics and p-values in parentheses) for CC3 concentrations between two dose groups on day 5. Conclusion: on day 5 after irradiation, CC3 concentration in plasma was not significantly different between 0 Gy and groups (1 Gy, 2 Gy), (1 Gy, 2 Gy, 3 Gy), (1 Gy, 2 Gy, 3 Gy, 6 Gy), between 1 Gy and group (2 Gy, 3 Gy, 6 Gy), or between 2 Gy and group (3 Gy, 6 Gy).
Tables 10A, 10B, and 10C. (Table 10A) T-test results (t-statistics and p-values in parentheses) for VCAM concentrations between two doses on day 1 (above diagonal line) and day 5 (below diagonal line). Conclusion: on day 1 after irradiation, VCAM concentration in plasma was significantly different between 0 Gy and 3 Gy, 6 Gy, and between 1 Gy and 6 Gy; on day 5, it was very significantly different between 6 Gy and 0 Gy, 2 Gy, 3 Gy. (Table 10B) T-test results (t-statistics and p-values in parentheses) for VCAM concentrations between two dose groups on day 1. Conclusion: 0 Gy and groups (1 Gy, 2 Gy, 3 Gy), (1 Gy, 2 Gy, 3 Gy, 6 Gy) have significant differences for VCAM concentration in plasma on day 1 after irradiation. (Table 10C) T-test results (t-statistics and p-values in parentheses) for VCAM concentrations between two dose groups on day 5. Conclusion: 0 Gy and groups (1 Gy, 2 Gy, 3 Gy), (1 Gy, 2 Gy, 3 Gy, 6 Gy), and 1 Gy and group (2 Gy, 3 Gy, 6 Gy), have significant differences for VCAM concentration in plasma on day 5 after irradiation.
Herein are described systems, methods and compositions for the identification of a panel of genes whose gene expression and protein levels in part provide a signature for determining individual radiation dose prediction and dosimetry. In some embodiments, a 1-gene, 2-gene, 3-gene, or 4-gene blood protein biomarker panel signature is described. In other embodiments, ELISA-based methods for analyzing irradiation responses of proteomic biomarkers for dose category prediction are described.
In various embodiments, 1-, 2-, 3- and 4-identified blood proteins are sufficient to provide a 100% classification accuracy for assigning individuals to the correct radiation exposure dose (5 class problem). Further development of this blood protein panel is to test the same proteins in blood from irradiated persons or in irradiated blood cells collected from healthy normal people. We have already demonstrated the efficacy of the latter approach to develop a biodosimetry panel for human blood. We previously used human blood exposed ex vivo to ionizing radiation to develop a panel of blood biomarkers consisting of a combination of several blood mRNAs and proteins. Blood proteins are more stable than blood mRNA, and therefore more promising for biodosimetry and were the focus of the present Examples. The high level of homology between genes/proteins of mice and humans makes the in vivo mouse model extremely suitable for biomarker-discovery studies (Rithidech 2009).
Thus, the present panels can also be used to determine an individual's degree of radiation exposure from 1, 2, 3, 4, to 5 days after such exposure. In some embodiments, the panel of four blood plasma proteins that are responsive to whole body radiation exposure is used with the methods of analysis of protein expression, whereby the result of the analysis discriminates individuals into 5 radiation-exposure categories: those not exposed to ionizing radiation, and those exposed to 1 Gy, 2 Gy, 3 Gy, or 6 Gy of radiation.
In various embodiments, blood protein levels are measured and detected in a sample from a patient. In some embodiments, the protein levels are analyzed and determined by Enzyme-Linked Immunosorbent Assay (ELISA). Such methods for protein analyses are well known to those skilled in the art. Suitable methods for ELISAs are described in Kim, D., Marchetti, F., Chen, Z., Zaric, S., Wilson, R. J., Hall, D. A., Gaster, R. S., Lee, J. R., Wang, J., Osterfeld, S. J., Yu, H., White, R. M., Blakely, W. F., Peterson, L. E., Bhatnagar, S., Mannion, B., Tseng, S., Roth, K., Coleman, M. A., Snijders, A. M., Wyrobek, A. J., Wang, S. X. Nanosensor dosimetry of mouse blood proteins after exposure to ionizing radiation. Scientific Reports (Nature). 3:2234, 2013, and Budworth, H., Snijders, A. M., Marchetti, F., Mannion, B., Bhatnagar, S., Kwoh, E., Tan, Y., Wang, S. X., Blakely, W. F., Coleman, M. A., Peterson, L. E., Wyrobek, A. J. DNA repair and cell cycle biomarkers of radiation exposure and inflammation stress in human blood. PLoS One. 7(11):e48619, 2012, both of which are hereby incorporated by reference in their entirety.
We selected the following four biomarkers for analysis by ELISA: 1) FLT3 ligand (Fms-related tyrosine kinase 3 ligand; FLT3LG), which is a hematopoietic growth factor and is used as a clinical indicator for bone marrow status; 2) Serum amyloid A (SAA1), which is a major acute phase protein that is expressed and regulated in response to tissue injury and inflammation; 3) CC3, also known as HTATIP2 (HIV-1 Tat interactive protein 2), which is an oxidoreductase with proapoptotic as well as antiangiogenic properties; and 4) VCAM-1 (vascular cell adhesion molecule 1), which plays a role in cell-cell recognition [3].
Methods for detection of expression levels of a gene can be carried out using known methods in the art including but not limited to, optical density, fluorescent intensity, fluorescent in situ hybridization (FISH), immunohistochemical analysis, fluorescence detection, comparative genomic hybridization, PCR methods including real-time and quantitative PCR, mass and imaging spectrometry and spectroscopy methods and other sequencing and analysis methods known or developed in the art. The expression level of the gene in question can be measured by measuring the amount or number of molecules of mRNA or transcript in a cell. The measuring can comprise directly measuring the mRNA or transcript obtained from a cell, or measuring the cDNA obtained from an mRNA preparation thereof. Such methods of extracting the mRNA or transcript from a cell, or preparing the cDNA thereof are well known to those skilled in the art. In other embodiments, the expression level of a gene can be measured by measuring or detecting the amount of protein or polypeptide expressed, such as measuring the amount of antibody that specifically binds to the protein in a dot blot or Western blot. The proteins described in the present invention can be overexpressed and purified or isolated to homogeneity and antibodies raised that specifically bind to each protein. Such methods are well known to those skilled in the art.
The detected expression level of a gene in a patient sample is often compared to the expression levels detected in a normal tissue sample or to a reference expression level. In some embodiments, the reference expression level can be the average or normalized expression level of the gene or gene product in a known panel of standards or a panel of normal cell lines or cancer cell lines.
In various embodiments, the expression levels of proteins in the biomarker panel of FLT3LG, SAA1, CC3, and/or VCAM are measured and analyzed to determine an individual subject's radiation exposure level by a method comprising: (a) measuring, in a sample from a patient, the protein expression level of each of the blood proteins FLT3LG, SAA1, CC3, and/or VCAM in one of the biomarker panels, whereby a patient prediction score calculated by the methods and analyses described herein is used to determine the level of radiation exposure of the patient.
Gene sequences and gene products that may be detected are herein identified by gene name, Unigene ID, GeneID and/or GenBank Accession Numbers, and the publicly available content, all of which are hereby incorporated by reference in their entireties for all purposes. As understood in the art, there are naturally occurring polymorphisms for many gene sequences. Genes that are naturally occurring allelic variations for the purposes of this invention are those genes encoded by the same genetic locus. The proteins which are detected and encoded by allelic variations of the four proteins FLT3LG, SAA1, CC3, and/or VCAM typically have at least 95% amino acid sequence identity to one another, i.e., an allelic variant of a gene indicated herein typically encodes a protein product that has at least 95% identity, often at least 96%, at least 97%, at least 98%, or at least 99%, or greater, identity to the amino acid sequence encoded by the nucleotide sequence denoted by the Entrez Gene ID number (as of Jun. 25, 2014) shown herein for that gene. For example, an allelic variant of a gene encoding FLT3LG (gene: Homo sapiens fms-related tyrosine kinase 3 ligand (FLT3LG)) typically has at least 95% identity, often at least 96%, at least 97%, at least 98%, or at least 99%, or greater, to the FLT3LG protein sequence encoded by the nucleic acid sequence available under Entrez Gene ID no. 2323. In some cases, a “gene identified” herein may also refer to an isolated polynucleotide that can be unambiguously mapped to the same genetic locus as that of a gene assigned to a genetic locus by the Entrez Gene ID, or it may also refer to an expression product that is encoded by a polynucleotide that can be unambiguously mapped to the same genetic locus as that of a gene assigned to a genetic locus by the Entrez Gene ID.
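The 95% identity criterion above can be illustrated with a minimal Python sketch. It assumes the two protein sequences have already been aligned to equal length (with "-" marking gaps) by some external alignment tool; the function names, the gap convention, and the toy sequences are hypothetical and are not taken from the sequence listing.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns with identical residues.

    Assumption: inputs are pre-aligned and of equal length; columns where
    both sequences carry a gap ('-') are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # column carries no information
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def meets_identity_threshold(aligned_ref: str, aligned_query: str,
                             threshold: float = 95.0) -> bool:
    """True if the query reaches the stated identity threshold."""
    return percent_identity(aligned_ref, aligned_query) >= threshold


# Toy 20-residue sequences (hypothetical, for illustration only).
reference = "ACDEFGHIKLMNPQRSTVWY"
one_substitution = "ACDEFGHIKLMNPQRSTVWA"   # 19 of 20 positions match
two_substitutions = "ACDEFGHIKLMNPQRSTVAA"  # 18 of 20 positions match
```

With these toy inputs, one substitution in 20 residues sits exactly at the 95% threshold, while two substitutions fall below it.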
FLT3LG: Homo sapiens fms-related tyrosine kinase 3 ligand (FLT3LG). Nucleotide sequence GenBank Accession No. NM_001204502.1 GI:325197196 encodes FLT3LG protein sequence GenBank Accession No. NP_001191431.1 GI:325197197, both of which are hereby incorporated by reference.
Dendritic cells (DCs) provide the key link between innate and adaptive immunity by recognizing pathogens and priming pathogen-specific immune responses. FLT3LG controls the development of DCs and is particularly important for plasmacytoid DCs and CD8 (see MIM 186910)-positive classical DCs and their CD103 (ITGAE; MIM 604682)-positive tissue counterparts (summary by Sathaliyawala et al., 2010 [PubMed 20933441]). [supplied by OMIM, January 2011]. SEQ ID NO: 1 is the FLT3LG nucleotide sequence and Transcript Variant isoform 1. This variant (1) encodes the longer isoform (1). Variants 1, 2, and 3 encode the same isoform (1).
Homo sapiens fms-related tyrosine kinase 3 ligand (FLT3LG), transcript
SAA1: Homo sapiens serum amyloid A1 (SAA1). Nucleotide sequence GenBank Accession No. NM_000331.4 GI:295821191 encodes SAA1 protein sequence GenBank Accession No. NP_000322.2 GI:40316912, both of which are hereby incorporated by reference. This gene encodes a member of the serum amyloid A family of apolipoproteins. The encoded protein is a major acute phase protein that is highly expressed in response to inflammation and tissue injury. It represents a family of low molecular weight acute phase proteins, which are produced primarily by the liver in response to infection and inflammatory stimuli (Glojnaric, 2007). This protein also plays an important role in HDL metabolism and cholesterol homeostasis. High levels of this protein are associated with chronic inflammatory diseases including atherosclerosis, rheumatoid arthritis, Alzheimer's disease and Crohn's disease. This protein may also be a potential biomarker for certain tumors. Alternate splicing results in multiple transcript variants that encode the same protein. A pseudogene of this gene is found on chromosome 11. [provided by RefSeq, Jun 2012]. This variant (1) represents the longest transcript. Variants 1, 2 and 3 encode the same protein.
Homo sapiens serum amyloid A1 (SAA1), transcript variant 1, mRNA
CC3: oxidoreductase HTATIP2 isoform b [Homo sapiens], HIV-1 TAT-interactive protein 2. Nucleotide sequence GenBank Accession No. NM_001098522.1 GI:148728171 encodes the CC3 protein sequence GenBank Accession No. NP_001091992.1 GI:148728172, both of which are hereby incorporated by reference. This variant (4) has an alternate 5′ end and differs in the 5′ UTR, compared to variant 1. These differences cause translation initiation at a downstream AUG and an isoform (b, also known as CC3) with a shorter N-terminus compared to isoform a. Variants 2, 3 and 4 encode the same isoform.
Homo sapiens HIV-1 Tat interactive protein 2, 30 kDa (HTATIP2),
VCAM-1 (also referred to herein as VCAM): Vascular cell adhesion molecule. Nucleotide GenBank Accession No. NM_001078.3 GI:31543426 encodes protein GenBank Accession No. NP_001069.1 GI:4507875, both of which are hereby incorporated by reference. This gene is a member of the Ig superfamily and encodes a cell surface sialoglycoprotein expressed by cytokine-activated endothelium. This type I membrane protein mediates leukocyte-endothelial cell adhesion and signal transduction, and may play a role in the development of atherosclerosis and rheumatoid arthritis. Three alternatively spliced transcripts encoding different isoforms have been described for this gene. [provided by RefSeq, December 2010]. This variant (1) encodes the predominant, full-length isoform (a).
Homo sapiens vascular cell adhesion molecule 1 (VCAM1), transcript
In various embodiments, the present methods and protein analysis may be carried out with or on a system incorporating computer and/or software elements configured for performing logic operations and calculations, input/output operations, machine communications, statistical analysis, detection of gene or protein expression levels and analysis of the measured levels and/or the like. Such system may also be used to generate a report, determinations of the total expression levels measured, the comparison with any reference levels, and calculation of the median levels of gene and gene product expression levels. It will be appreciated by one of skill in the art that various modifications are anticipated by the present embodiments.
In various embodiments, the methods described are carried out on a computer readable storage medium having computer readable program code embodied in the medium to carry out the methods and determinations of protein concentration and/or patient dosage classification.
In some embodiments, protein concentration is measured by ELISA by optical density (O.D.) or fluorescent intensity. In some embodiments, the clinician drops a plasma, serum, or whole blood sample on a reader such as a Radbiochip described in Kim, D., Marchetti, F., Chen, Z., Zaric, S., Wilson, R. J., Hall, D. A., Gaster, R. S., Lee, J. R., Wang, J., Osterfeld, S. J., Yu, H., White, R. M., Blakely, W. F., Peterson, L. E., Bhatnagar, S., Mannion, B., Tseng, S., Roth, K., Coleman, M. A., Snijders, A. M., Wyrobek, A. J., Wang, S. X. Nanosensor dosimetry of mouse blood proteins after exposure to ionizing radiation. Scientific Reports (Nature). 3:2234, 2013, and the reader determines O.D. or fluorescent intensity, followed by software that runs Ensemble vote predictions of the dose class into which a person should be classified. RF analysis is employed to determine the relative discrimination informativeness of each marker, but could also be used for dose category prediction. In some embodiments, the ensemble majority vote from supervised classifiers is used for dose prediction.
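An ensemble majority vote of the kind referred to above can be sketched in Python as follows. This is an illustration only: the per-classifier votes and weights are made-up values, the classifier names are borrowed from the abbreviations used in the Examples, and the tie-breaking rule (lowest dose label wins) is an assumption that the source does not specify.

```python
from collections import Counter


def majority_vote(votes):
    """Return the dose class (in Gy) predicted by the most classifiers.

    Assumption: ties are broken in favor of the lowest dose label.
    """
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


def weighted_majority_vote(votes, weights):
    """Like majority_vote, but each vote counts with its classifier's weight.

    Assumption: weights are non-negative numbers, e.g. per-classifier
    cross-validation accuracy; the source does not define the weighting.
    """
    totals = {}
    for label, weight in zip(votes, weights):
        totals[label] = totals.get(label, 0.0) + weight
    top = max(totals.values())
    return min(label for label, total in totals.items() if total == top)


# Hypothetical votes (dose class in Gy) from eight supervised classifiers.
votes_by_classifier = {
    "kNN": 2, "NBC": 3, "LDA": 2, "LVQ1": 2,
    "SVMLS": 1, "ANN": 3, "CPSO": 2, "PLOG": 2,
}
predicted_dose = majority_vote(list(votes_by_classifier.values()))

# Hypothetical weighted case: one strong classifier outweighs two weak ones.
weighted_dose = weighted_majority_vote([1, 2, 2], [0.9, 0.3, 0.3])
```

In the first example five of eight classifiers vote for 2 Gy, so that class is returned; in the weighted example the single 1 Gy vote carries more total weight than the two 2 Gy votes.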
In various embodiments, a patient's predicted dose category is determined after a known exposure event, for which time since exposure would always be known. However, prediction of time since exposure using the methods herein is contemplated. Time prediction from markers may further involve regression models, or possibly classification models withhold as inputs.
In various embodiments, a computer-implemented software component carries out the methods described herein. In some embodiments, such method uses optical density (O.D.) or fluorescent intensity, a dose prediction is made, and the dose category is determined. The software would be trained using blood sample data similar to the mouse data described herein, data from non-human primates, data from irradiated blood samples, or data from patients that have received known irradiation exposure (e.g., cancer, radiotherapy patients).
In one embodiment, the 1-4 protein signature panel may also be added to a larger biomarker panel comprising the detection of the genes or gene products. A method is described for identifying a patient with higher predicted probability of disease free survival. Methods for determining such disease-free survival may comprise: (a) measuring the amplification or expression level of each gene in the biomarker panel in a sample from a patient; and (b) determining a total amplification or expression level of said panel by adding together the measurements from Step (a); and (c) comparing said total in Step (b) to a median of total amplification or expression level of said panel of genes in a normal tissue sample or a reference amplification or expression level, whereby a below-median expression level indicates a patient that has a higher predicted probability of disease free survival.
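Steps (a) through (c) above amount to summing the panel measurements and comparing the total with a reference median. A minimal Python sketch, using made-up gene names and values (all hypothetical, not data from this disclosure), might look like:

```python
from statistics import median


def panel_total(levels):
    """Step (b): add together the per-gene measurements from step (a)."""
    return sum(levels.values())


def is_below_reference_median(patient_levels, reference_panels):
    """Step (c): compare the patient's total with the median reference total.

    Assumption: reference_panels is a list of per-sample dictionaries
    measured on the same genes and in the same units as the patient sample.
    """
    reference_totals = [panel_total(panel) for panel in reference_panels]
    return panel_total(patient_levels) < median(reference_totals)


# Hypothetical expression levels (arbitrary units) for a three-gene panel.
reference_samples = [
    {"geneA": 1.0, "geneB": 2.0, "geneC": 3.0},   # total 6.0
    {"geneA": 2.0, "geneB": 2.0, "geneC": 4.0},   # total 8.0
    {"geneA": 3.0, "geneB": 3.0, "geneC": 4.0},   # total 10.0
]
patient = {"geneA": 1.5, "geneB": 2.0, "geneC": 3.0}  # total 6.5
below_median = is_below_reference_median(patient, reference_samples)
```

Here the reference median total is 8.0 and the patient total is 6.5, so the sample falls in the below-median group.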
In one embodiment, a kit is provided comprising probes for detection of expression levels of the 1-4 protein signature panel, wherein said probes provide for assessment of a subject's radiation exposure.
In other embodiments, a sample is obtained from a patient; an ELISA is conducted; FLT3LG, SAA1, CC3, and VCAM-1 protein concentrations in the patient blood sample are determined, for example, by using an optical density spectrophotometer reader, and protein concentration is calculated (e.g., as ng/mL) using analysis software; the protein concentrations are next log transformed, mean-zero standardized, and then transformed through the classification algorithms and methods described herein in the Examples to produce prediction scores for the dose classification of the patient sample. The resulting prediction scores are normally distributed and equilibrated, a prediction/probability is made for each possible dose class membership, and the patient is assigned to the class with the greatest probability. In some embodiments, the classification methods employ Ensemble methods which use 8 supervised votes, and the majority class determines the dosage classification. In other embodiments, it may be better to use Ensemble majority and weighted majority vote methods.
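The preprocessing and final assignment steps described above can be sketched in Python. Several details are assumptions rather than statements from the source: the natural logarithm is used for the log transform, standardization divides by the population standard deviation of the training data, and the concentrations and class probabilities are invented for illustration. The classifiers that would produce the class probabilities are not shown.

```python
import math
from statistics import mean, pstdev


def fit_standardizer(training_concentrations):
    """Learn center and scale from log-transformed training concentrations.

    Assumptions: natural log; scaling to unit variance in addition to
    mean-zero centering.
    """
    logs = [math.log(c) for c in training_concentrations]
    return mean(logs), pstdev(logs)


def transform(concentration, center, scale):
    """Log transform, then standardize with the training center and scale."""
    return (math.log(concentration) - center) / scale


def most_probable_class(class_probabilities):
    """Assign the sample to the dose class with the greatest probability."""
    return max(class_probabilities, key=class_probabilities.get)


# Hypothetical training concentrations (ng/mL) for a single marker.
training = [10.0, 100.0, 1000.0]
center, scale = fit_standardizer(training)
z_scores = [transform(c, center, scale) for c in training]

# Hypothetical class-membership probabilities from some classifier.
probabilities = {"0 Gy": 0.05, "1 Gy": 0.10, "2 Gy": 0.60,
                 "3 Gy": 0.20, "6 Gy": 0.05}
assigned = most_probable_class(probabilities)
```

A new patient sample would be passed through `transform` with the center and scale learned from the training data, so that training and patient values share the same scale.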
In various embodiments, following the determination of radiation dosage classification of the patient sample, the patient is then triaged according to the severity of the dosage received and then appropriate treatments are recommended and prescribed.
It is generally believed that the best accuracy in assigning an individual with unknown exposure to ionizing radiation into the correct dose category for proper medical care requires panels of multiple radiation-responsive biomarkers. However, major uncertainties remain regarding the number of biomarkers required and the selection of the best methods for unsupervised dose class discovery and supervised dose class prediction.
This investigation was undertaken to evaluate classification analysis of gold standard ELISA proteomic data for four candidate markers considered for use on a nano-biochip for individual radiation dose prediction [2]. We selected the following four biomarkers for analysis by ELISA: 1) FLT3 ligand (Fms-related tyrosine kinase 3 ligand; FLT3LG), which is a hematopoietic growth factor and is used as a clinical indicator for bone marrow status; 2) Serum amyloid A (SAA1), which is a major acute phase protein that is expressed and regulated in response to tissue injury and inflammation; 3) CC3, also known as HTATIP2 (HIV-1 Tat interactive protein 2), which is an oxidoreductase with proapoptotic as well as antiangiogenic properties; and 4) VCAM-1 (vascular cell adhesion molecule 1), which plays a role in cell-cell recognition [3]. Random forests were used for calculating permutation-based importance scores, feature selection frequency counts during decision tree learning, and for generating unsupervised and supervised cluster representations of samples via eigenanalysis of decision tree-based proximity matrices. We also employed several feature filtering and selection techniques, and a variety of supervised classification methods for class prediction of irradiation dose category.
C57BL/6 inbred mice were exposed to 0 Gy, 1 Gy, 2 Gy, 3 Gy, or 6 Gy and peripheral blood plasma was obtained on day 1 (n=50) and day 5 after exposure (n=50) for evaluation by ELISA for 4 proteomic biomarkers (FLT3LG, SAA1, CC3, and VCAM) in triplicate with equal numbers of males and females in each dose-time group. Random forests (RF) were used for unsupervised analyses to evaluate biomarker informativeness, while 8 supervised classification techniques were compared for multiclass analyses of this 5-class problem (0, 1, 2, 3, 6 Gy). Classifiers included k nearest neighbor (kNN), naive Bayes classifier (NBC), linear discriminant analysis (LDA), learning vector quantization (LVQ1), least squares support vector machines (SVMLS), artificial neural networks (ANN), constricted particle swarm optimization (CPSO), and polytomous logistic regression (PLOG). Results indicate that gender and time since exposure were much less informative than the biomarker responses were for dose category assignments. For day 1, SAA1 and FLT3LG are almost equally informative when considering RF importance scores and supervised classification results. For day 5, FLT3LG dominated the importance scores, but VCAM, CC3 and SAA1 were, nevertheless, selected quite frequently during first and all node splits during RF decision tree generation. During feature filtration and selection, only FLT3LG was selected for the day 5 classification runs. Feature selection approaches using various inferential hypothesis testing approaches did not result in markedly different classification performance. For day 1 supervised classification analyses, the overall accuracies for each method, including the ensemble majority vote (EMV) and ensemble weighted majority vote (EWMV), were (in decreasing order): EMV (80%), PLOG (78%), LDA (77%), 5NN (76%), CPSO (76%), EWMV (75%), LVQ1 (72%), ANN (71%), NBC (65%), and SVMLS (49%). For day 5 supervised classification results, the accuracies were: LDA (100%), 5NN (100%), LVQ1 (100%), CPSO (100%), PLOG (98%), EMV (96%), EWMV (87%), ANN (75%), SVMLS (71%), and NBC (53%).
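As an illustration of how a leave-one-out cross validation accuracy for a k nearest neighbor classifier can be computed, the following Python sketch uses a tiny invented data set. The feature values, the Euclidean distance metric, and the tie-breaking rule are assumptions; this is not the software used to produce the accuracies reported above.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` from its k nearest training samples.

    Assumptions: Euclidean distance; on a tied vote, the tied label whose
    member lies closest to the query wins.
    """
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = [train_y[i] for i in order[:k]]
    counts = Counter(nearest)
    top = max(counts.values())
    for label in nearest:  # nearest is ordered by increasing distance
        if counts[label] == top:
            return label


def loocv_accuracy(xs, ys, k):
    """Hold out each sample once, train on the rest, and score the hit rate."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical standardized two-marker responses for two dose classes.
features = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0),
            (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
labels = ["0 Gy", "0 Gy", "0 Gy", "6 Gy", "6 Gy", "6 Gy"]
accuracy_1nn = loocv_accuracy(features, labels, k=1)
accuracy_3nn = loocv_accuracy(features, labels, k=3)
```

Because the two toy classes are well separated, every held-out sample is classified correctly for both k=1 and k=3.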
These analyses demonstrate that RF bootstrapping to generate alternative realizations of training data and simultaneous random selection of features during node splitting in decision tree learning is a superior approach to unsupervised and supervised classification analysis, especially for evaluating biomarker dose informativeness. Our findings lay the groundwork as additional radiation biomarkers become available to improve the cluster structure of the data and to improve supervised classification performance.
Animals and Treatments. C57BL/6 mice, 8- to 10-week-old males and females, were purchased from Harlan Laboratories. Mice were housed under conventional conditions in microisolator filter-top cages. Animal rooms were provided with 10-12 air changes per hour of 100% fresh conditioned air and maintained at 22° C.±1° C. with a relative humidity of 50%±20%. Animals remained on 12:12-h full spectrum light:dark cycles and were provided food (Lab Diet 5008 Mouse Chow) and reverse osmosis filtered water ad libitum. Mice were acclimated for a minimum of 2 weeks before sham treatment or exposure to ionizing radiation. The use of animals in the study and the study protocols were approved by the Institutional Animal Care and Use Committee (IACUC) of Lawrence Berkeley National Laboratory.
Mouse Irradiations. Mice were total-body-irradiated (TBI) using a Pantak 320 kV X-ray machine set at 300 kV and 10 mA. Mean weights before irradiation were 28.2 g±3.3 and 22.4 g±3.0 for males and females, respectively. Irradiations of mice were carried out in well-ventilated clear plastic rodent restrainers each containing a single mouse. Restrainers were placed on a turning table and mice were irradiated one or two at a time. 1, 2 and 3 Gy irradiations were carried out at a dose rate of 775 mGy/min and the turning table was set to 85 cm. 6 Gy irradiations were carried out at a dose rate of 1.9 Gy/min and the turning table was set to 60 cm. Sham-irradiated animals were treated in the same manner but not exposed to the radiation source. Dosimetry was performed using an Accu-Pro™ dosimeter.
Euthanasia and blood collection. At 24 h or 5 d post irradiation, mice were weighed (mean weights were 27.4 g±3.1 and 21.9 g±2.6 for males and females, respectively) and euthanized by CO2 asphyxiation followed by open thoracotomy. Blood was collected via intracardiac puncture with a heparin rinsed syringe. The average volume of blood collected was 738 μl and 660 μl for males and females, respectively. Tubes with collected peripheral blood were centrifuged at 400 g for 5 min, and plasma was collected, aliquoted, and preserved at −80° C. until use. The average volume of plasma collected was 321 μl and 288 μl for males and females, respectively.
Groups of 10 C57BL/6 mice (5 male and 5 female) underwent whole-body irradiation using X-rays (0.7 Gy/min dose rate) to doses of 0 (sham), 1, 2, 3, and 6 Gy (total=100 mice) with approval of the LBNL Animal Use Committee. Cardiac blood was collected at 24 hours and 5 days after exposure, and plasma was prepared for protein ELISA analyses. Triplicate measures of plasma-based ELISA concentration for four protein biomarkers (FLT3LG, SAA1, CC3, VCAM-1) were obtained from each mouse at 24 h post-exposure (n=50) and 5 days post-exposure (n=50).
Protein bioassays. Total protein concentrations of the samples were measured via the bicinchoninic acid (BCA) method (Pierce). Radiation responses of blood protein biomarkers were measured using ELISA. All quality control concentrations were within 2 SD of the mean on all plates that were run.
FLT-3LG. Sandwich ELISA for mouse FLT-3LG was run according to the manufacturer's instructions using a commercially available kit (R&D Quantikine Mouse Flt-3 Ligand Immunoassay, cat #MFK00, Minneapolis, Minn., USA). The quality control provided with the kit was resuspended in distilled water, aliquoted, and frozen. An aliquot of the quality control was run in triplicate on each plate. The quality control gave an average concentration of ~237 pg/mL.
SAA1. Sandwich ELISA for mouse SAA1 was run according to the manufacturer's instructions using a commercially available kit (ALPCO Immunoassays, cat #41-SAAMS-E01, Salem, N.H., USA). The quality control provided with the kit was diluted 1:2000, aliquoted, and frozen. An aliquot of the quality control was run in triplicate on each plate. The quality control gave an average concentration of ~215 μg/mL.
CC3. Sandwich ELISA for mouse CC3 was run according to the manufacturer's instructions using a commercially available kit (Alpha Diagnostic International Mouse C3, cat #6270, San Antonio, Tex., USA). The quality control provided with the kit (mouse serum) was pooled from 6 ELISA kits, aliquoted, and placed at 4° C. An aliquot of the quality control was run in triplicate on each plate. The quality control gave an average concentration of ~66 ng/mL.
VCAM-1. Sandwich ELISA for mouse VCAM-1 was run according to the manufacturer's instructions using a commercially available kit (Abnova, cat #KA0428, Taipei City, Taiwan). A quality control was not provided with the kit, so one was made using 10 μl of plasma collected from a sham-irradiated mouse from a pilot study. The plasma was diluted 1:400 with VCAM sample diluent buffer, aliquoted, and frozen. An aliquot of the quality control was run in triplicate on each plate. The quality control gave an average concentration of ~800 μg/mL.
Data Collection. ELISA plates were read using a TECAN Infinite M200 plate reader with the TECAN Magellan software. Data obtained from Magellan were exported into Microsoft Excel.
Biomarker Transformations. Triplicate ELISA concentration measurements for each mouse were collapsed into an average value. Within each biomarker, the continuously-scaled averages were log-transformed and then mean-zero standardized using the mean and standard deviation over all mice. For notation purposes, the log-transformed variant of FLT3LG was lnflt3lg, and the mean-zero standardized value was then termed zlnflt3lg. Therefore, the final variable names of the four log-transformed mean-zero standardized biomarker ELISA concentrations were zlnflt3lg, zlnsaa1, zlncc3, and zlnvcam, which were used in all analyses.
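The transformation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: the function names, the example concentrations, and the choice of the natural logarithm and the sample (n−1) standard deviation are assumptions, since the text does not specify them.

```python
import math
import statistics

def collapse_triplicates(triplicates):
    """Average the three ELISA readings obtained for each mouse."""
    return [statistics.mean(t) for t in triplicates]

def log_standardize(averages):
    """Log-transform, then mean-zero standardize over all mice.

    Assumption: natural log and sample standard deviation (n - 1).
    """
    logs = [math.log(v) for v in averages]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)
    return [(x - mean) / sd for x in logs]

# Hypothetical triplicate concentrations for three mice (not study data).
triplicates = [(100.0, 110.0, 90.0), (200.0, 210.0, 190.0), (400.0, 390.0, 410.0)]
averages = collapse_triplicates(triplicates)
z_scores = log_standardize(averages)
```

After this step each biomarker has mean 0 and unit standard deviation across mice, which puts the four markers on a common scale for distance-based classifiers.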
Statistical Analyses. We used the linear discriminant analysis (LDA) and k-nearest neighbors (k=7, i.e., “7NN”) modules of Stata Version 12 (College Station, TX) for classification analysis. The MAUCROC algorithm for Stata was used for generating receiver operator characteristic curve (ROC) area under the curve (AUC). Classification and AUC runs were made for all possible pairs of dose classes (e.g., 0 vs. 1, 0 vs. 2, . . . ,3 vs. 6) as well as all one-against-remaining class comparisons (e.g., 0 vs. other, 1 vs. other, . . . , 6 vs. other) for the Day 1 and Day 5 data.
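For readers unfamiliar with the AUC, the following minimal sketch computes it for one 2-class comparison as the probability that a randomly chosen object from one class scores higher than a randomly chosen object from the other, with ties counted as one half. It is a generic illustration and is not the MAUCROC algorithm referenced above; the function name and example scores are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC for one 2-class comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive object has the higher score; ties contribute 0.5.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical standardized biomarker values for two dose classes.
perfect = auc([3.0, 4.0], [1.0, 2.0])    # complete separation
chance = auc([1.0, 2.0], [1.0, 2.0])     # identical distributions
partial = auc([2.0, 3.0], [1.0, 2.5])    # one overlapping pair
```

An AUC of 1.0 corresponds to complete separation of the two dose classes, and 0.5 to no discrimination.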
Unsupervised and Supervised Random Forest Analysis of Biomarker Informativeness. Random forests (RF) were used to generate importance scores and frequency of biomarker (feature) selection in first node splits and all node splits within the trees employed in a forest[4]. A total of 1,000 trees was used for each forest, and for each node split $j_{try}=\sqrt{p}$ features were randomly selected and evaluated with the Gini index to identify the optimal cutpoint value for splitting. During tree generation, node splitting was performed until each daughter node had either one object or multiple objects with class purity. Supervised clustering results were based on eigenanalysis of the proximity matrix and presented in the form of 2D principal component score plots with varying symbols (colors) assigned to objects based on their true class labels. For unsupervised cluster analysis, the dataset being analyzed was augmented with n simulated objects by randomly selecting feature values from the observed n objects (within the same feature), such that the final dataset contained a total of 2n objects. Objects in the original dataset were assigned class 1 and objects in the augmented dataset were assigned to class 2. Eigenanalysis was then performed on the proximity matrix of the 2n objects, and 2D score plots were generated for the first n original objects in class 1 using a single symbol (color).
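The data augmentation step for unsupervised RF analysis can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name, the seed, and the example values are hypothetical, and sampling is done with replacement within each feature, which the text does not specify.

```python
import math
import random

def augment_for_unsupervised_rf(data, seed=0):
    """Append n simulated objects to n observed objects.

    Each simulated feature value is drawn at random from the observed
    values of the same feature, which breaks the dependence between
    features. Original objects get class 1, simulated objects class 2.
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    columns = [[row[j] for row in data] for j in range(p)]
    simulated = [[rng.choice(columns[j]) for j in range(p)] for _ in range(n)]
    features = [list(row) for row in data] + simulated
    labels = [1] * n + [2] * n
    return features, labels

# Hypothetical standardized values for 4 biomarkers on 3 mice.
observed = [[-1.0, 0.2, 0.5, 1.1], [0.0, -0.3, 0.4, -0.9], [1.0, 0.1, -0.9, -0.2]]
features, labels = augment_for_unsupervised_rf(observed)

# Number of features tried per node split when p = 4.
j_try = int(math.sqrt(len(observed[0])))
```

A forest trained to separate class 1 from class 2 then yields a proximity matrix for the original objects without using their dose labels.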
Biomarker Filtering from Class comparisons. Let xi=(xi1, xi2, . . . , xip) be an object (mouse) with p(j=1,2, . . . , p) features (biomarkers), n(i=1,2, . . . , n) the total number of objects (mice), and Ω(ω=1,2, . . . , Ω) be the total number of classes. In addition, let M=Ω(Ω−1)/2 be the number of possible pairwise class comparisons and M=Ω be the number of possible one-against-all remaining class comparisons (m=1,2, . . . , M). For each mth class comparison, the top Nm biomarkers with the greatest informativeness were identified. Informativeness was based on the T-test, Mann-Whitney test, F-test, Kruskal-Wallis test, Gini index and entropy in the form of information gain[5]. For statistical tests, Nm was equal to the number of biomarkers for which pj≦0.05. A list of non-redundant biomarkers among the N=N1+N2+ . . . +Nm+ . . . +NM biomarkers was then constructed. (For large gene lists, we commonly identify 150 unique biomarkers from M sets of Nm=150/M biomarkers). Table 4 lists the number of biomarkers filtered for the M possible class comparisons. It warrants noting that the biomarkers from various class comparisons can be redundant, so a unique list is obtained from the M comparisons.
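For the five dose classes used here, the counts of class comparisons and the construction of the non-redundant biomarker list can be illustrated as follows. This is a sketch only; the helper name and the per-comparison lists of filtered biomarkers are hypothetical and do not reproduce the study's filtering results.

```python
from itertools import combinations

dose_classes = [0, 1, 2, 3, 6]  # Gy

# All pairwise comparisons: M = omega * (omega - 1) / 2.
pairwise = list(combinations(dose_classes, 2))

# All one-against-remaining comparisons: M = omega.
one_vs_rest = [(d, [o for o in dose_classes if o != d]) for d in dose_classes]

def unique_biomarkers(filtered_per_comparison):
    """Merge the biomarkers kept for each comparison into one
    non-redundant list, preserving first-seen order."""
    seen = []
    for names in filtered_per_comparison:
        for name in names:
            if name not in seen:
                seen.append(name)
    return seen

# Hypothetical filtering output for three comparisons.
filtered = [["FLT3LG", "SAA1"], ["SAA1"], ["FLT3LG", "VCAM"]]
unique = unique_biomarkers(filtered)
```

With five classes this gives 10 pairwise and 5 one-against-remaining comparisons.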
Results. Results for the selected protein biomarkers (FLT-3LG, SAA1, CC3 and VCAM) in mouse plasma represent 5 mice per group (dose and sampling time-point) derived from four independent experiments. Results shown for FLT-3LG (
FLT-3LG. As shown in
SAA1. As shown in
CC3. As shown in
VCAM. As shown in
Table 5 lists receiver operator characteristic (ROC) area under the curve (AUC) for all possible 2-class comparisons of triplicate ELISA response using linear discriminant analysis (LDA) and k-nearest neighbor (k=7, i.e., “7NN”) with leave-one-out cross validation (LOOCV) for Day 1 and Day 5. For Day 1 data, mean univariate AUCs for FLT3LG and SAA1 exceeded 90%, while mean univariate AUCs for CC3 and VCAM were less than 90%. Average multivariate AUC for both classification methods (LDA and 7NN) exceeded 95%. For Day 5 data, mean univariate AUC for FLT3LG exceeded 90%, while mean univariate AUCs for SAA1, CC3, and VCAM were less than 90%. Average multivariate AUC for both classification methods (LDA and 7NN) was equal to 100%. Table 6 lists receiver operator characteristic (ROC) area under the curve (AUC) for one-against-other classification analyses for triplicate ELISA Day 1 and Day 5 response using k-nearest neighbor (k=7, i.e., “7NN”) and linear discriminant analysis (LDA) with leave-one-out cross validation (LOOCV). For Day 1 data, mean univariate AUC for FLT3LG was 93% for 7NN and 79% for LDA. Mean univariate AUCs for SAA1, CC3, and VCAM were less than 90% for both 7NN and LDA. Average multivariate AUC for 7NN and LDA was 95% and 85%, respectively. For Day 5 data, mean univariate AUC for FLT3LG was 100% for 7NN and 81% for LDA. Mean univariate AUCs for SAA1, CC3, and VCAM were less than 80% for both 7NN and LDA. Average multivariate AUC for 7NN and LDA was 100% and 80%, respectively.
For Day 1 data, we expected classifier breakdown for the 2 vs. 3 vs. 6 Gy dose comparisons, since the dose-response results do not reveal clear separation of the FLT3LG and SAA1 responses at greater doses. On day 5, FLT3LG contributes the majority of the discrimination due to the greater separation between mean responses over the entire dose range. Overall, LDA results were less appealing than results from 7NN classification. The non-parametric k-nearest neighbor classifier can go within clusters of samples and correctly classify each test sample left out of training during LOOCV using the majority true class label among the k closest neighbors in Euclidean space. LDA, in contrast, assumes that the number of samples used is large enough for the data to approximate a multivariate normal distribution with a reliably estimated pooled covariance matrix, and that there are no outliers that can bias the results through leveraging.
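The k-nearest neighbor rule with leave-one-out cross validation discussed above can be sketched as follows. This is an illustrative implementation, not the Stata module used in the study; the function names and the toy data are hypothetical, and ties in the vote are broken in favor of the class of the nearest neighbor, which is an assumption.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority class label among the k nearest training objects
    in Euclidean space (ties go to the class seen first, i.e. nearest)."""
    dists = sorted((math.dist(x, query), i) for i, x in enumerate(train_x))
    votes = Counter(train_y[i] for _, i in dists[:k])
    return votes.most_common(1)[0][0]

def loocv_accuracy(xs, ys, k):
    """Leave each object out in turn, classify it from the rest,
    and return the fraction classified correctly."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)

# Hypothetical two-feature data with two well-separated dose classes.
xs = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
ys = [0, 0, 0, 0, 6, 6, 6, 6]
accuracy = loocv_accuracy(xs, ys, k=3)
```

Because the rule depends only on local neighborhoods, it makes no distributional assumption of the kind LDA requires.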
FLT3LG. Our results showed that ionizing radiation significantly increased plasma levels of FLT-3LG at all doses tested at both 24 hr and 5 days. The strongest induction of FLT-3LG levels was observed at 5 days post irradiation. These results are in agreement with Prat et al. (2005) in BALB/c mice, where it was demonstrated that FLT-3LG levels peaked at 3 and 7 days post irradiation. Our results were also in agreement with Sugimoto (2001), who showed that C3H/HeN mice irradiated with 15 Gy showed a significant increase in serum FLT-3LG levels with a peak at 6 days post irradiation.
SAA1. Our results showed that ionizing radiation significantly increased plasma levels of SAA1 at all doses tested at 24 hr after irradiation. At day 5, SAA1 levels in irradiated mice had returned to baseline levels and were not different from sham, irrespective of radiation dose. These results are in agreement with findings of Ossetrova and Blakely (2009) and Ossetrova et al (2010) in BALB/c mice.
CC3. Our results showed that ionizing radiation significantly increased plasma levels of CC3 at all doses tested at 24 hr after irradiation. At day 5, CC3 levels in irradiated mice appeared to have returned to baseline levels and were not different from sham, irrespective of the radiation dose. Rithidech et al (2009) had used 2-D gel to identify a significant increase in plasma CC3 levels at 7 days after exposure to 3 Gy in CBA/CaJ mice, suggesting possible genetic variation in time response.
VCAM-1. Our results showed that ionizing radiation induced a dose-related decrease of plasma levels of VCAM at both 24 hrs and 5 days. Rithidech et al (2009) used 2-D gel to identify a significant increase in plasma VCAM at 7 days after exposure to 3 Gy in CBA/CaJ, suggesting that biomarker response may show genetic variation.
RF analysis allowed us to evaluate the informativeness of biomarker response at 1 and 5 days after exposure for the purpose of discriminating the 5 classes of dose. The advantage of RF lies in the strength of bootstrapping multiple realizations of the data and each time randomly selecting features for Gini evaluation for each node splitting in order to generate thousands of decision trees in a forest. RFs do not commonly overfit data and typically are the most reliable approach for generalizing results to future unobserved test data, which is mostly due to their conservativeness hinged to bootstrapping and random feature selection-evaluation during tree generation. Importance scores are permutation-based and offer a means of evaluating feature informativeness based on null and alternative distributions, while the frequency of feature selection during first node splits and all node splits provides additional information regarding feature informativeness, because if a feature results in strong class separation, it will likely be identified via Gini in the first node split and numerous other splits within a decision tree. The supervised and unsupervised clustering results presented in 2D score plots of the proximity matrix reveal the cluster structure of objects based on training data with intact class labels and simulated class labels. Table 1 shows the number of class comparisons, M, and number of biomarkers, Nm, filtered for each comparison.
Selection of Filtered Biomarkers for Supervised Classification. A stepwise greedy plus-take-away (“Greedy PTA”) method using a plus 1 take away 1 heuristic[6] was used for selecting biomarkers from the unique list of filtered biomarkers described above. (This step requires biomarkers to be mean-zero standardized over the n objects). Forward stepping was carried out to add (delete) the most (least) important biomarkers for class separability based on squared Mahalanobis distance and the F-to-enter and F-to-remove statistics. Biomarkers were entered into the model if their standardized expression resulted in the greatest Mahalanobis distance between the two closest classes, and their F-to-enter statistic exceeded F=3.84. At any step, a biomarker was removed if its F statistic was less than the F-to-remove criterion of F=2.71. When done, there were N unique biomarkers which were jointly statistically significantly different between all classes, and which also provided the greatest multivariate Mahalanobis distance-based class separation. We also selected the “Best ranked N” biomarkers for supervised classification runs, where N was set equal to the number of unique biomarkers selected during greedy PTA.
Supervised Classification Analysis. Eight supervised classification techniques were employed for multiclass analysis of a 5-class problem (0 Gy, 1 Gy, 2 Gy, 3 Gy, and 6 Gy). These included k nearest neighbor (kNN), naïve Bayes classifier (NBC), linear discriminant analysis (LDA), learning vector quantization (LVQ1), least squares support vector machines (SVMLS), artificial neural networks (ANN), constricted particle swarm optimization (CPSO), and polytomous logistic regression (PLOG)[7,8,9,10,11,12,13,14]. For kNN, k was set equal to 5 (“5NN”), and for the LVQ1 we used a single prototype per class. For SVMs, we used an L2 soft norm least squares approach. A weighted exponentiated RBF kernel was employed to map samples in the original space into the dot-product space, given as
where m is the number of features. Such kernels are likely to yield the greatest class prediction accuracy provided that a suitable choice of γ is used. To determine an optimum value of γ for use with RBF kernels, a grid search was done using incremental values of γ from $2^{-15}, 2^{-13}, \ldots, 2^{3}$ in order to evaluate accuracy for all training samples. We also used a grid search in the range of $10^{-2}, 10^{-1}, \ldots, 10^{4}$ for the SVM margin parameter C. The optimal choice of C was based on the grid search for which classification accuracy was the greatest, resulting in the optimal value for the separating hyperplane and minimum norm ∥ξ∥ of the slack variable vector. SVM tuning was performed by taking the median of parameters during grid search iterations when the test sample misclassification rate was zero. For the ANN classifier, the logistic activation function was used with each hidden node, and the softmax function was used to compute class membership probabilities for output node weight connections. We also used 500 sweeps with a grid search for each ANN model in which the learning rate ε and momentum α ranged from $2^{-9}, 2^{-8}, \ldots, 2^{-1}$. The grid search for ANNs also included an evaluation of error for a variable number of hidden nodes in the single hidden layer, which ranged from the number of training features (i.e., the length of input vector for each sample) down to the number of output nodes, incremented by −2. In cases when there were multiple values of grid search parameters for the same error rate, we used the median value. Leave-one-out cross validation was employed for all runs.
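The parameter grids and the median tie-breaking rule described above can be written out explicitly. The sketch below assumes that the γ grid runs over powers of two from −15 to 3 in steps of 2, that the C grid runs over powers of ten from −2 to 4, and that the ANN learning rate and momentum grid runs over powers of two from −9 to −1; the helper name and the example error rates are hypothetical.

```python
import statistics

# Assumed grids (powers of two and ten).
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]   # 2^-15, 2^-13, ..., 2^3
c_grid = [10.0 ** e for e in range(-2, 5)]          # 10^-2, 10^-1, ..., 10^4
ann_rate_grid = [2.0 ** e for e in range(-9, 0)]    # 2^-9, 2^-8, ..., 2^-1

def median_of_best(results):
    """Given (parameter, error_rate) pairs from a grid search, return
    the median of the parameter values that share the lowest error."""
    best_error = min(error for _, error in results)
    best_params = [param for param, error in results if error == best_error]
    return statistics.median(best_params)

# Hypothetical grid-search outcome: three settings tie for lowest error.
example = [(1.0, 0.2), (2.0, 0.1), (4.0, 0.1), (8.0, 0.1)]
chosen = median_of_best(example)
```

Taking the median rather than the first or last tied value avoids systematically favoring one end of the grid.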
Ensemble Techniques for Supervised Classifier Fusion. Classifiers were trained with the same feature sets, and then classifier votes were combined using the ensemble majority voting (EMV) and ensemble weighted majority voting (EWMV) combination techniques[15]. Let $d_{l,\omega}(x)\in\{0,1\}$ be the decision rule for an object by the lth classifier (l=1,2, . . . , L) for class ω. The support for EMV and EWMV, respectively, is functionally composed as
where $w_l$ is the normalized weight reflecting the accuracy of the lth classifier. Here, accuracy is based on the proportion of classified test objects assigned to the diagonal of the confusion matrix divided by the number of test objects. Let the ensemble decision for object x be $E(x)$. The decision rule for test object x is
$E(x)=\underset{\omega\in\Omega}{\arg\max}\,\{\mu_{\omega}(x)\}$. (3)
Results of ensemble methods are presented under the classifier names EMV and EWMV in tabular form with results of the individual classifiers.
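A minimal sketch of the two voting schemes follows. It assumes that the support for a class is the sum of the votes cast for it (EMV) or the sum of the normalized accuracy weights of the classifiers voting for it (EWMV), which is the conventional formulation; the function name, the tie-breaking rule (first class in the list), and the example votes and accuracies are hypothetical.

```python
def ensemble_vote(predictions, classes, weights=None):
    """Combine one predicted class label per classifier.

    With weights=None every classifier counts equally (majority vote).
    Otherwise weights (e.g., classifier accuracies) are normalized to
    sum to one and added to the support of the predicted class.
    Returns the class with the greatest support and the support table.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    support = {c: 0.0 for c in classes}
    for predicted, weight in zip(predictions, weights):
        support[predicted] += weight / total
    decision = max(classes, key=lambda c: support[c])
    return decision, support

classes = [0, 1, 2, 3, 6]    # dose categories in Gy
votes = [1, 1, 2]            # hypothetical votes from three classifiers

emv_decision, _ = ensemble_vote(votes, classes)
ewmv_decision, ewmv_support = ensemble_vote(votes, classes, weights=[0.3, 0.3, 0.9])
```

In this example the unweighted vote selects 1 Gy, while the weighted vote selects 2 Gy because the single dissenting classifier carries more weight than the other two combined.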
Table 2 lists the 5-class accuracy for class prediction using the 8 supervised classifiers based on the most informative biomarkers for their day-1 response. The lowest mean accuracy was observed for SVMLS (49%) and the greatest mean accuracy among individual classifiers occurred for PLOG (78%). The EMV mean accuracy, however, was 80%, which reflects the benefit of ensemble methods for combining multiple votes from a committee of classifiers. Classification accuracy using each method was (in decreasing order): EMV (80%), PLOG (78%), LDA (77%), 5NN (76%), CPSO (76%), EWMV (75%), LVQ1 (72%), ANN (71%), NBC (65%), and SVMLS (49%). Table 3 lists mean classification accuracy based on day-5 biomarker responses used as inputs. The mean accuracy for NBC was quite low (53%), and mean accuracies for SVMLS (71%) and ANN (75%) were not markedly better. Classification accuracy using the various techniques was: LDA (100%), 5NN (100%), LVQ1 (100%), CPSO (100%), PLOG (98%), EMV (96%), EWMV (87%), ANN (75%), SVMLS (71%), and NBC (53%). The majority of classifiers showed outstanding performance for dose category prediction based on the 5-day biomarker responses.
Lastly, there was good agreement between the values of biomarker RF importance scores and biomarker filtering and selection results prior to input for supervised classification. The additional information provided by RF-based frequency of biomarker selection in first and all node splits reveals that, although a biomarker can suffer from having a low RF importance score, it can nevertheless be selected more often than other biomarkers with greater importance scores.
RF analysis allowed us to evaluate the informativeness of biomarker response at 1 and 5 days after exposure for the purpose of discriminating the 5 classes of dose. The advantage of RF lies in the strength of bootstrapping multiple realizations of the data and each time randomly selecting features for Gini evaluation for each node splitting in order to generate thousands of decision trees in a forest. RFs do not commonly overfit data and typically are the most reliable approach for generalizing results to future unobserved test data, which is mostly due to their conservativeness hinged to bootstrapping and random feature selection-evaluation during tree generation. Importance scores are permutation-based and offer a means of evaluating feature informativeness based on null and alternative distributions, while the frequency of feature selection during first node splits and all node splits provides additional information regarding feature informativeness, because if a feature results in strong class separation, it will likely be identified via Gini in the first node split and numerous other splits within a decision tree. The supervised and unsupervised clustering results presented in 2D score plots of the proximity matrix reveal the cluster structure of objects based on training data with intact class labels and simulated class labels.
RF results based on all data indicate that gender and time since exposure are much less informative than the association between biomarker response and dose category. This is an impressive observation which suggests that in the context of all data, the four biomarkers evaluated are considerably more informative for predicting class when compared with mouse gender and time after exposure. For day 1 results, SAA1 and FLT3LG are almost equally informative when considering RF importance scores and supervised classification results. Filtering and selection methods for supervised classification resulted in SAA1 and FLT3LG being selected the majority of time during various filtration approaches. For day 5 biomarker response, FLT3LG dominated the importance scores, but VCAM, CC3 and SAA1 were nevertheless selected quite frequently during first and all node splits during decision tree generation. During feature filtration and selection, only FLT3LG was selected for the day 5 classification runs.
Feature selection approaches using various inferential hypothesis testing approaches did not result in markedly different classification performance. The NBC, SVMLS, and ANN supervised classifiers resulted in lower performance most likely because of the small number of features used. ANNs are biased toward the amount of training data used, tend to suffer when the feature number is low, and can also overfit the data if fewer than n~200 objects are used per feature. For the day 1 data, the SVMLS suffered due to the increased overlap of mice in the 3 Gy and 6 Gy dose categories.
This study is a work in progress to develop classification methods for ELISA-based irradiation responses of proteomic biomarkers for dose category prediction. Gender and time-after-exposure dose-response curves, ROC curves, and ROC area under the curve for the biomarkers evaluated were not provided in this report since they form the basis of evaluations in our other reports.
Unsupervised class discovery and supervised class prediction of ionizing radiation dose category for mice exposed to 0 Gy, 1 Gy, 2 Gy, 3 Gy, and 6 Gy were investigated using plasma ELISA concentration of 4 proteomic biomarkers (FLT3LG, SAA1, CC3, and VCAM). Plasma ELISA concentrations were obtained in triplicate from n=50 mice at 1 day post-exposure and n=50 mice at 5 days post-exposure, with equal sample sizes of gender (male, female) at each dose level. Random forests (RF) were used for unsupervised analyses to evaluate biomarker informativeness, while 8 supervised classification techniques were employed for multiclass analysis of a 5-class problem (0,1,2,3,6 Gy). Classifiers included k nearest neighbor (kNN), naïve Bayes classifier (NBC), linear discriminant analysis (LDA), learning vector quantization (LVQ1), least squares support vector machines (SVMLS), artificial neural networks (ANN), constricted particle swarm optimization (CPSO), and polytomous logistic regression (PLOG). Results indicate that, for the biomarkers considered, gender and time since exposure were much less informative than the association between biomarker response and dose category. For day 1 results, SAA1 and FLT3LG are almost equally informative when considering RF importance scores and supervised classification results. For day 5 biomarker response, FLT3LG dominated the importance scores, but VCAM, CC3, and SAA1 were nevertheless selected quite frequently during first and all node splits during RF decision tree generation. During feature filtration and selection, only FLT3LG was selected for the day 5 classification runs. Feature selection approaches using various inferential hypothesis testing approaches did not result in markedly different classification performance. 
The RF analyses performed demonstrate that bootstrapping to generate alternative realizations of training data and simultaneous random selection of features during node splitting in decision tree learning is a superior approach to unsupervised and supervised classification analysis, especially for evaluating biomarker dose informativeness. Our findings lay the groundwork as additional radiation biomarkers become available to improve the cluster structure of the data and to improve supervised classification performance.
This application claims priority to U.S. Provisional Patent Application No. 62/018,501, filed on Jun. 27, 2014. This application is related to U.S. Provisional Patent Application No. 61/901,372; and U.S. patent application Ser. No. 14/023,968, which claims priority to U.S. Provisional Patent Application No. 61/801,372, filed on Mar. 15, 2013 and to U.S. Provisional Patent Application No. 61/699,418, filed on Sep. 11, 2012, the contents of all of which are incorporated by reference in their entirety.
The invention was made with government support under Contract No. DE-AC02-05CH11231 awarded by the U.S. Department of Energy and under Contract No. HHSO100201000006C awarded by the Biomedical Advanced Research and Development Authority, Office of the Assistant Secretary for Preparedness and Response, Office of the Secretary, Department of Health and Human Services. The government has certain rights in the invention.
Number | Date | Country |
---|---|---|
62018501 | Jun 2014 | US |