The present invention relates to gender and race identification from body fluid traces using spectroscopic analysis.
Body fluids found at a crime scene can be some of the most valuable forms of evidence in forensic investigations. They can provide complex information about a potential suspect or victim. Therefore, a crucial step of forensic casework is the identification of biological traces such as blood, semen, saliva, or sweat (Kobilinsky, L. F., In Forensic Chemistry Handbook; John Wiley & Sons: Hoboken, N.J., pp 269-282 (2012)). Human blood is the most common body fluid found at scenes of violent crimes. Also, the amount of sample available for a forensic investigation could be extremely small. In these instances, even more care should be taken to preserve the evidence for further analysis. There are presumptive assays for blood, such as the Kastle-Meyer test, Hemastix, and Leucomalachite Green, as well as tests using luminol or fluorescein (Kobilinsky, L. F., In Forensic Chemistry Handbook; John Wiley & Sons: Hoboken, N.J., pp 269-282 (2012); Johnston et al., “Comparison of Presumptive Blood Test Kits Including Hexagon OBTI,” J. Forensic Sci. 53:687-689 (2008)), and confirmatory tests (microcrystal assays) for detecting and identifying blood (Kobilinsky, L. F., In Forensic Chemistry Handbook; John Wiley & Sons: Hoboken, N.J., pp 269-282 (2012)). Nevertheless, many of these tests require the use of hazardous chemicals, and all consume part of the sample. Furthermore, the current tests can only identify the presence of blood and do not provide investigators with any additional information about the donor. A person's race can be inferred through cranial and dental analyses (Rosas et al., “Thin-Plate Spline Analysis of the Cranial Base in African, Asian and European Populations and its Relationship with Different Malocclusions,” Arch. Oral Biol. 53:826-834 (2008); Blumenfeld, J., “Racial Identification in the Skull and Teeth,” Totem: The University of Western Ontario Journal of Anthropology 8:20-23 (2011)) and through DNA analysis (Elkins, K. M., Forensic DNA Biology: A Laboratory Manual, 1st ed.; Academic Press: Oxford, UK (2012)). Therefore, a nondestructive and rapid method that reliably identifies human blood and also provides identifying information, such as race, would be highly advantageous in forensic casework.
Raman spectroscopy is a sensitive method for obtaining information about the chemical and biochemical composition of a sample (Skoog et al., In Principles of Instrumental Analysis, 5th ed.; Saunders College Publishing: Orlando pp 429-444 (1998)). This analytical technique is based on molecular vibrations and requires a change in polarizability. Raman spectroscopy uses monochromatic light to irradiate a sample and inelastically scatter photons, which are collected to generate a spectrum (Skoog et al., In Principles of Instrumental Analysis, 5th ed.; Saunders College Publishing: Orlando pp 429-444 (1998)). Raman spectroscopy has already been used for the analysis of various types of forensic evidence including fibers (Miller et al., “Forensic Analysis of Single Fibers by Raman Spectroscopy,” Appl. Spectrosc. 55:1729-1732 (2001)), ink (Zięba-Palus et al., “Application of the Micro-FTIR Spectroscopy, Raman Spectroscopy and XRF Method Examination of Inks,” Forensic Sci. Int. 158:164-172 (2006)), paints (Zięba-Palus et al., “Examination of Multilayer Paint Coats by the Use of Infrared, Raman and XRF Spectroscopy for Forensic Purposes,” J. Mol. Struct. 792-793:286-292 (2006)), gunshot residue (Bueno et al., “Raman Spectroscopic Analysis of Gunshot Residue Offering Great Potential for Caliber Differentiation,” Anal. Chem. 84(10):4334-9 (2012)), and bones (McLaughlin et al., “Spectroscopic Discrimination of Bone Samples from Various Species,” Am. J. Anal. Chem. 3:161-167 (2012)), to name a few. Studies on different biological traces including blood, semen, saliva, sweat, vaginal fluid, and body fluid mixtures (Virkler et al., “Raman Spectroscopy Offers Great Potential for the Nondestructive Confirmatory Identification of Body Fluids,” Forensic Sci. Int. 181(1-3):e1-e5 (2008); Virkler et al., “Raman Spectroscopic Signature of Semen and Its Potential Application to Forensic Body Fluid Identification,” Forensic Sci. Int. 
193(1-3):56-62 (2009); Virkler et al., “Forensic Body Fluid Identification: the Raman Spectroscopic Signature of Saliva,” Analyst 135(3):512-7 (2010); Sikirzhytskaya et al., “Raman Spectroscopic Signature of Vaginal Fluid and Its Potential Application in Forensic Body Fluid Identification,” Forensic Sci. Int. 216(1-3):44-8 (2012); Sikirzhytski et al., “Advanced Statistical Analysis of Raman Spectroscopic Data for the Identification of Body Fluid Traces: Semen and Blood Mixtures,” Forensic Sci. Int. 222(1-3):259-265 (2012); Sikirzhytski et al., “Discriminant Analysis of Raman Spectra for Body Fluid Identification for Forensic Purposes,” Sensors 10(4):2869-2884 (2010)) were published. The interference of common substrates with the Raman signal of deposited bloodstains (McLaughlin et al., “Circumventing Substrate Interference in the Raman Spectroscopic Identification of Blood Stains,” Forensic Sci. Int. 231(1-3):157-166 (2013)) and contaminated blood traces (Sikirzhytski et al., “Forensic Identification of Blood in the Presence of Contaminations Using Raman Microspectroscopy Coupled with Advanced Statistics: Effect of Sand, Dust, and Soil,” J. Forensic Sci. 58:1141-1148 (2013)) were previously investigated. A wide study on blood traces was also conducted to understand the heterogeneous chemical composition of blood (Virkler et al., “Raman Spectroscopic Signature of Blood and its Potential Application to Forensic Body Fluid Identification,” Anal. Bioanal. Chem. 396(1):525-534 (2010)) and to distinguish between peripheral and menstrual blood (Sikirzhytskaya et al., “Raman Spectroscopy Coupled With Advanced Statistics for Differentiating Menstrual and Peripheral Blood,” J. Biophotonics 7(1-2):59-67 (2014)).
Variances in the biochemical composition of blood from donors of different races, genders, and ages have been reported by Koh et al. (Koh et al., “Comparison of Selected Blood Components by Race, Sex, and Age,” Am. J. Clin. Nutr. 33(8):1828-35 (1980)). They found higher levels of albumin, hemoglobin, hematocrit, serum iron, and serum triglycerides in Caucasian (CA) donors' blood than in African American (AA) donors' blood, while AA donors had significantly higher glucose and total protein concentrations. Hemoglobin concentration has been widely studied over the last few decades (Koh et al., “Comparison of Selected Blood Components by Race, Sex, and Age,” Am. J. Clin. Nutr. 33(8):1828-35 (1980); Garn et al., “Lifelong Differences in Hemoglobin Levels Between Blacks and Whites,” J. Natl. Med. Assoc. 67:91-96 (1975); Johnson et al., “Advance data From Vital and Health Statistics,” The National Center for Health Statistics, U.S. Department of Health, Education, and Welfare, Public Health Service, Office of Health Research, Statistics, and Technology, 46:1-12 (1979); Meyers et al., “Components of the Difference in Hemoglobin Concentrations in Blood Between Black and White Women in the United States,” Am. J. Epidemiol. 109:539-549 (1979); Reeves et al., “Screening for Anemia in Infants: Evidence in Favor of Using Identical Hemoglobin Criteria for Blacks and Caucasians,” Am. J. Clin. Nutr. 34:2154-2157 (1981); Garn et al., “The Magnitude and the Implications of Apparent Race Differences in Hemoglobin Values,” Am. J. Clin. Nutr. 28:563-568 (1975)), and these investigations have confirmed that there is a higher amount of hemoglobin in the blood of CA subjects than in that of AA subjects. Kramer et al. showed that CA and AA racial groups can be distinguished based on the concentration of certain enzymes (creatine kinase and lactate dehydrogenase) in blood serum (Kramer et al., “Biocatalytic Analysis of Biomarkers for Forensic Identification of Ethnicity Between Caucasian and African American Groups,” Analyst 138(21):6251-6257 (2013)). Differences between races in the concentrations of plasma lipids and lipoproteins have also been shown (Morrison et al., “Black-White Differences in Plasma Lipids and Lipoproteins in Adults: The Cincinnati Lipid Research Clinic Population Study,” Prev. Med. 8:34-39 (1979)).
The present invention is directed to overcoming these and other deficiencies in the art.
One aspect of the present invention relates to a method of identifying gender and/or race of a subject using a body fluid stain from the subject. This method includes providing a sample containing a body fluid stain from the subject; providing a statistical model for determination of gender and/or race of a subject; subjecting the sample or an area of the sample containing the stain to a spectroscopic analysis to produce a spectroscopic signature for the sample; and applying the spectroscopic signature for the sample to the statistical model to ascertain gender and/or race of the subject.
Another aspect of the present invention relates to a method of establishing a statistical model for determination of gender and/or race of a subject using a body fluid stain from the subject. This method includes providing a plurality of samples containing a known type of body fluid stain from a subject of known race and/or gender; subjecting each sample or an area of each sample containing the stain to a spectroscopic analysis to produce a spectroscopic signature for each sample; and establishing a statistical model for determination of gender and/or race of a subject for a particular body fluid type based on said subjecting.
Due to the significant information that can be gathered from blood, it requires special attention during forensic investigations. It can even lead to identifying a suspect. All currently applied methods for collecting information about a person are destructive to the sample since they require extraction of DNA or biomarkers from a bloodstain. Treated traces can no longer be used for further examination. A nondestructive method would therefore be very valuable in supporting forensic investigations. Attenuated total reflectance (ATR) Fourier transform infrared (FTIR) spectroscopy was applied in order to discriminate gender and race from human blood traces. Such identification of a person's characteristics is possible due to chemical and biochemical differences in blood composition from donor to donor. Advanced statistics were applied in order to enhance classification processes.
Genetic profiling (or phenotype profiling; these two terms are considered synonymous here) is a very important part of criminal investigations. Determining the suspect's race and gender at the very early stages of an investigation would be most important. A method for determining race and/or gender based on Raman spectra of blood, saliva, sweat, and semen samples was developed. Near-infrared (NIR) Raman microspectroscopy and attenuated total reflectance (ATR) Fourier transform infrared (FTIR) spectroscopy were combined with advanced statistics to develop classification models that account for sample heterogeneity and donor-to-donor variation.
Gaining knowledge from these studies, the highly selective technique of Raman spectroscopy was applied to detect chemical and biochemical differences in dry blood traces from two different racial groups. It was already reported for different species that, even if visual differentiation of Raman blood spectra is impossible, advanced statistics allow for classification (Virkler et al., “Blood Species Identification for Forensic Purposes using Raman Spectroscopy Combined with Advanced Statistical Analysis,” Anal. Chem. 81(18):7773-7777 (2009); McLaughlin et al., “Discrimination of Human and Animal Blood Traces Via Raman Spectroscopy,” Forensic Sci. Int. 238(0):91-95 (2014); De Wael et al., “In Search of Blood-Detection of Minute Particles using Spectroscopic Methods,” Forensic Sci. Int. 180(1):37-42 (2008), which are hereby incorporated by reference in their entirety). Therefore, in the present application, an advanced statistical approach was utilized for discrimination processes.
The present application describes the use of genetic algorithm (GA) analysis, which helped to select the spectral regions with the largest diversity between Caucasian (CA) and African American (AA) peripheral blood donors. GA analysis is a heuristic search algorithm developed to select variables with the lowest prediction error using simulated natural processes necessary for evolution (Niazi et al., “Genetic Algorithms in Chemometrics,” Journal of Chemometrics 26(6):345-351 (2012), which is hereby incorporated by reference in its entirety). For statistical analysis, principal component analysis (PCA) was used to remove outliers (Pascoal et al., In Combining Soft Computing and Statistical Methods in Data Analysis; Borgelt et al., Eds.; Springer Berlin Heidelberg: Vol. 77:499-507 (2010), which is hereby incorporated by reference in its entirety), and support vector machine-discriminant analysis (SVM-DA) was used to build classification models. SVM-DA is a supervised machine learning technique that has been widely used in pattern classification problems (Sikirzhytskaya et al., “Raman Spectroscopy Coupled With Advanced Statistics for Differentiating Menstrual and Peripheral Blood,” J. Biophotonics 7(1-2):59-67 (2014); Marcelo et al., “Profiling Cocaine by ATR-FTIR,” Forensic Sci. Int. 246:65-71 (2015), which are hereby incorporated by reference in their entirety). In order to validate the accuracy of the SVM-DA models built for this study, an outer cross-validation (CV) loop was performed.
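By way of illustration only, PCA-based outlier screening of a set of spectra may be sketched as follows. The sketch computes a Hotelling T² value and a Q residual for each spectrum from a PCA model and flags spectra exceeding a cutoff. The function names, the use of NumPy, and the percentile-based cutoff are assumptions made for this example and are not taken from the studies described herein; statistically derived confidence limits are more commonly used in practice.

```python
import numpy as np


def pca_outlier_statistics(spectra, n_components):
    """Return (Hotelling T^2, Q residual) for each spectrum (one spectrum per row)."""
    X = np.asarray(spectra, dtype=float)
    centered = X - X.mean(axis=0)
    # SVD of the mean-centered data: the rows of vt are the principal axes.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_components].T               # (variables, components)
    scores = centered @ loadings                 # (samples, components)
    # Variance of each retained score vector.
    score_variance = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    # T^2: distance from the model center measured inside the PCA subspace.
    t_squared = np.sum(scores ** 2 / score_variance, axis=1)
    # Q: squared reconstruction error, i.e. the part the model does not explain.
    residual = centered - scores @ loadings.T
    q_residual = np.sum(residual ** 2, axis=1)
    return t_squared, q_residual


def flag_outliers(t_squared, q_residual, percentile=95.0):
    """Flag spectra above an assumed percentile cutoff on either statistic."""
    t_limit = np.percentile(t_squared, percentile)
    q_limit = np.percentile(q_residual, percentile)
    return (t_squared > t_limit) | (q_residual > q_limit)
```

Spectra flagged in this way would be inspected or removed before a classification model is trained on the remaining data.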
The receiver operating characteristic (ROC) and area under the curve (AUC) analyses are commonly used in diagnostic and screening tests (Hajian-Tilaki, K., “Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation,” Caspian J. Intern. Med. 4:627-635 (2013), which is hereby incorporated by reference in its entirety). The trapezoidal method of integration was used to estimate AUCs of ROC curves, with corresponding 95% confidence intervals (CIs) estimated by the method described in DeLong et al., “Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach,” Biometrics 837-845 (1988), which is hereby incorporated by reference in its entirety. A ROC curve, which plots sensitivity (true positive rate) against specificity (true negative rate) for varying thresholds of class prediction probabilities, was generated as a way to gauge the prediction efficiency of the SVM-DA models built. Here, a proof of concept was demonstrated that Raman spectroscopic analysis of bloodstains is able to successfully differentiate between CA and AA racial groups. Further studies are necessary to examine other factors and conditions that can potentially affect the biochemical composition and corresponding Raman signature of a bloodstain.
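As a non-limiting sketch, the trapezoidal AUC estimate may be computed from class prediction scores as shown below. The function names and the convention that the positive class is labeled 1 are assumptions made for this example. The curve is built here on the false positive rate (1 − specificity), which yields the same area as plotting against specificity on a reversed axis. The DeLong confidence interval is not implemented in this sketch.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs over all thresholds.

    `labels` uses 1 for the positive class and 0 for the negative class.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to draw a ROC curve")
    # Lower the threshold step by step, from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def trapezoid_auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An AUC of 0.5 corresponds to a classifier that cannot separate the classes, and an AUC of 1.0 corresponds to perfect separation.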
The word “race” has become a complex and sensitive term. Some believe race to be a purely socio-cultural construct, while others report that there is biological evidence to support it (Jorde et al., “Genetic Variation, Classification, and ‘Race’,” Nature Reviews. Genetics 36(11):28-33 (2004), which is hereby incorporated by reference in its entirety). One approach has been to differentiate two terms: “race” and “biological race” (Ousley et al., “Understanding Race and Human Variation: Why Forensic Anthropologists are Good at Identifying Race,” American Journal of Physical Anthropology 139(1):68-76 (2009), which is hereby incorporated by reference in its entirety). The first refers to the social notions about race, often characterized by broad generalizations and stereotypes. The latter refers to “a division of a species which differs from other divisions by the frequency with which certain hereditary traits appear among its members” (Brues, A. M., “People and Races,” New York: Macmillan 336 (1977), which is hereby incorporated by reference in its entirety). In this sense, “biological race” is very similar to biogeographic ancestry.
There is no technique to predict a person's race based on the Raman spectrum of a dried semen sample. In the present application, “race” refers to a self-reported characteristic that includes, but is not limited to, skin color. In the present application, the hypothesis was uncritically adopted that groups from different biological races or biogeographic ancestries have biological differences, which appear to be evident in skeletal morphology and genetics (Ousley et al., “Understanding Race and Human Variation: Why Forensic Anthropologists are Good at Identifying Race,” American Journal of Physical Anthropology 139(1):68-76 (2009), which is hereby incorporated by reference in its entirety). While the validity of this hypothesis is a serious and important consideration, it is outside the scope of the present work. It was hypothesized that discernible differences could be seen in the biochemical makeup of semen. In the present application, Raman spectra were acquired from human semen samples from donors of three different races (Caucasian, Black, and Hispanic). Their spectra were then analyzed and compared using MATLAB version R2012a. Statistical models were built to differentiate the spectra according to their respective races. The developed model allowed for discrimination between races with excellent sensitivity and specificity. Ultimately, all 28 donors were classified correctly. The results described show Raman spectroscopy's potential to correctly differentiate races based on dry semen traces.
In the present application, Raman microspectroscopy was used for gender identification from human blood, taking into account its heterogeneity. Advanced statistical analysis was performed to deal with variations of Raman spectra and to minimize the possibility of false gender identification. An automatic mapping technique was used to collect Raman spectra from different spots of dried blood samples. The fluorescent background was subtracted from the experimental data using an automatic baseline correction procedure, and two data sets (male and female) were formed. The present application showed that gender could be predicted from dry blood traces using support vector machine discriminant analysis (SVMDA) and k-nearest neighbors (KNN) algorithms with a high level of confidence. Despite the visual similarity of Raman spectra from male and female donors, the sensitivity and specificity of the SVMDA model were about 77% and 93%, respectively.
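For reference, sensitivity and specificity of a two-class model may be computed from true and predicted class labels as sketched below. The function name and the choice of which gender is treated as the positive class are assumptions made for this example; the present application does not specify the latter.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive):
    """Return (sensitivity, specificity) for a two-class problem.

    `positive` names the class treated as positive (an assumption of this sketch).
    """
    true_pos = false_neg = true_neg = false_pos = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive:
            if guess == positive:
                true_pos += 1
            else:
                false_neg += 1
        else:
            if guess == positive:
                false_pos += 1
            else:
                true_neg += 1
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("both classes must be present in true_labels")
    # Sensitivity: fraction of positive samples that were recognized as positive.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: fraction of negative samples that were recognized as negative.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity
```

Swapping the class passed as `positive` exchanges the two values, so reported figures are only meaningful together with the class convention used.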
In the present application, ATR-FTIR spectroscopy was applied as a sensitive analytical method for human blood identification. The focus was on dissimilarities between gender groups and between racial groups. As already reported, blood from different donors cannot be distinguished by visual inspection of Raman or infrared spectra (Virkler et al., “Blood Species Identification for Forensic Purposes Using Raman Spectroscopy Combined with Advanced Statistical Analysis,” Anal. Chem. 81(18):7773-7777 (2009); McLaughlin et al., “Discrimination of Human and Animal Blood Traces Via Raman Spectroscopy,” Forensic Sci. Int. 238(0):91-95 (2014); De Wael et al., “In Search of Blood—Detection of Minute Particles Using Spectroscopic Methods,” Forensic Sci. Int. 180(1):37-42 (2008), which are hereby incorporated by reference in their entirety). In the present application, discrimination power was therefore supported by advanced statistical analysis (Wise et al., PLS_Toolbox 3.5 for Use with MATLAB, Wenatchee, Wash.: Eigenvector Research, Inc. (2005), which is hereby incorporated by reference in its entirety). First, a genetic algorithm (GA) allowed for selection of the spectral ranges where the biggest differences between the classes occur (Niazi et al., “Genetic Algorithms in Chemometrics,” J. Chemometrics 26(6):345-351 (2012), which is hereby incorporated by reference in its entirety). This step was carried out in two different ways: for gender discrimination and for distinction between the Caucasian (CA), African American (AA), and Hispanic (HI) races. A principal component analysis (PCA) model was used to remove outliers (through Q residuals and Hotelling T²) (Rodriguez et al., “Raman Spectroscopy and Chemometrics for Identification and Strain Discrimination of the Wine Spoilage Yeasts Saccharomyces cerevisiae, Zygosaccharomyces bailii, and Brettanomyces bruxellensis,” Appl. Environ. Microbiol. 79(20):6264-6270 (2013); Xiao et al., “Drift Compensation of Gas Sensor Array by Matrix Transform and Genetic Algorithm Based on Database,” J. Computational Information Systems, 9(9):3469-3476 (2013), which are hereby incorporated by reference in their entirety). Multivariate partial least squares-discriminant analysis (PLS-DA) was conducted to differentiate gender and races, with emphasis on the validation phase to assure the applicability of the built models. PLS-DA is a classification method based on the standard PLS algorithm in which class labels are used for the dependent y-vector (Varmuza et al., Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press (2008), which is hereby incorporated by reference in its entirety). An external cross-validation (CV) was used to examine the prediction performance of the models: all spectra from one donor were set aside from the training dataset and predicted by a model recalculated from the remaining n−1 donors. Y predictions were recorded from all donors, both for each spectrum and for each donor. Additionally, the predictive abilities of the PLS-DA models were summarized using a receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). In the ROC space, the AUC is a single measure of model performance. ROC curves were generated from cross-validated Y-predicted values, and the best threshold was determined for each class prediction and for its corresponding PLS-DA classifier. The last step of validation was testing the model with external blind samples from donors who were not included in the training datasets. This approach showed potential to discriminate donors based on dry blood traces found at a crime scene. Moreover, the method gives fast results and is not destructive to the sample, and thus can be applied as an additional investigation technique before the sample is subjected to final DNA testing. The availability of portable ATR-FTIR instruments (Mukhopadhyay, R., “Product Review: Portable FTIR Spectrometers Get Moving,” Anal. Chem. 76(19):369 A-372 A (2004), which is hereby incorporated by reference in its entirety) raises the efficacy of this approach compared with other bloodstain tests, which mostly require laboratory settings.
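The donor-wise external cross-validation described above may be illustrated by the following minimal sketch, in which all spectra from one donor are held out, a model is rebuilt from the remaining donors, and predictions are recorded per spectrum and per donor. For brevity, a nearest-centroid classifier stands in for PLS-DA, and the donor-level prediction is taken as a majority vote over that donor's spectra; both choices, along with all function names, are assumptions of this example rather than the procedure actually used.

```python
from collections import Counter, defaultdict


def nearest_centroid_fit(spectra, labels):
    """Mean spectrum per class (a simple stand-in for a PLS-DA model)."""
    sums, counts = {}, Counter()
    for spectrum, label in zip(spectra, labels):
        if label not in sums:
            sums[label] = [0.0] * len(spectrum)
        sums[label] = [s + x for s, x in zip(sums[label], spectrum)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def nearest_centroid_predict(centroids, spectrum):
    """Return the class whose mean spectrum is closest in Euclidean distance."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], spectrum))
    return min(centroids, key=squared_distance)


def leave_one_donor_out(spectra, labels, donors):
    """Predict each donor's spectra with a model trained on all other donors."""
    spectrum_predictions = [None] * len(spectra)
    for held_out in sorted(set(donors)):
        train = [i for i, d in enumerate(donors) if d != held_out]
        test = [i for i, d in enumerate(donors) if d == held_out]
        model = nearest_centroid_fit([spectra[i] for i in train],
                                     [labels[i] for i in train])
        for i in test:
            spectrum_predictions[i] = nearest_centroid_predict(model, spectra[i])
    # Donor-level call: majority vote over that donor's spectra (assumed rule).
    votes = defaultdict(Counter)
    for donor, prediction in zip(donors, spectrum_predictions):
        votes[donor][prediction] += 1
    donor_predictions = {d: tally.most_common(1)[0][0] for d, tally in votes.items()}
    return spectrum_predictions, donor_predictions
```

Holding out whole donors, rather than individual spectra, prevents spectra from the same donor from appearing in both the training and the test data, which would otherwise inflate the apparent accuracy.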
One aspect of the present invention relates to a method of identifying gender and/or race of a subject using a body fluid stain from the subject. This method includes providing a sample containing a body fluid stain from the subject; providing a statistical model for determination of gender and/or race of a subject; subjecting the sample or an area of the sample containing the stain to a spectroscopic analysis to produce a spectroscopic signature for the sample; and applying the spectroscopic signature for the sample to the statistical model to ascertain gender and/or race of the subject.
In one embodiment, the body fluid is selected from the group consisting of blood, saliva, sweat, urine, semen, and vaginal fluid. In a preferred embodiment, the body fluid is blood.
In one embodiment, the gender of the subject is determined.
In another embodiment, the race of the subject is determined.
In one embodiment, the method determines the race of the subject as being Black, White, Asian, or Hispanic.
In one embodiment, the sample is recovered at a crime scene.
In another embodiment, spectroscopic analysis is selected from the group consisting of Raman spectroscopy, mass spectrometry, fluorescence spectroscopy, laser induced breakdown spectroscopy, infrared spectroscopy, scanning electron microscopy, X-ray diffraction spectroscopy, powder diffraction spectroscopy, X-ray luminescence spectroscopy, inductively coupled plasma mass spectrometry, capillary electrophoresis, and atomic absorption spectroscopy.
Raman spectroscopy is a spectroscopic technique which relies on inelastic, or Raman, scattering of monochromatic light to study vibrational, rotational, and other low-frequency modes in a system (Gardiner, D. J., Practical Raman Spectroscopy, Berlin: Springer-Verlag, pp. 1-3 (1989), which is hereby incorporated by reference in its entirety). Vibrational modes are very important and very specific for chemical bonds in molecules. They provide a fingerprint by which a molecule can be identified. The Raman effect is obtained when a photon interacts with the electron cloud of a molecular bond, exciting the electrons into a virtual state. The scattered photon is shifted to lower frequencies (Stokes process) or higher frequencies (anti-Stokes process) as it transfers energy to, or takes up energy from, the molecule, respectively. The polarizability change in the molecule will determine the Raman scattering intensity, while the Raman shift will be equal to the frequency of the vibrational mode involved.
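As a worked illustration of the Raman shift, the following sketch converts between excitation wavelength, scattered wavelength, and shift in wavenumbers. The function names are assumptions made for this example, and the 785 nm excitation wavelength mentioned below is an illustrative value only, not one stated herein.

```python
def raman_shift_cm1(excitation_nm, scattered_nm):
    """Raman shift in cm^-1 from excitation and scattered wavelengths in nm.

    Positive values are Stokes shifts (scattered light at lower frequency);
    negative values are anti-Stokes shifts.
    """
    # 1 nm = 1e-7 cm, so the wavenumber in cm^-1 is 1e7 / wavelength_nm.
    return 1e7 / excitation_nm - 1e7 / scattered_nm


def scattered_wavelength_nm(excitation_nm, shift_cm1):
    """Wavelength in nm of light scattered with the given Raman shift in cm^-1."""
    return 1e7 / (1e7 / excitation_nm - shift_cm1)
```

For example, with an assumed 785 nm excitation, a Stokes band at a shift of 1000 cm⁻¹ appears near 852 nm, while the corresponding anti-Stokes band appears at a wavelength shorter than 785 nm.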
Raman spectroscopy is based upon the inelastic scattering of photons or the Raman shift (change in energy) caused by molecules. The analyte is excited by laser light and upon relaxation scatters radiation at a different frequency which is collected and measured. With the availability of portable Raman spectrometers, it is possible to collect Raman spectra in the field. Using portable Raman spectrometers offers distinct advantages to government agencies, first responders, and forensic scientists (Hargreaves et al., “Analysis of Seized Drugs Using Portable Raman Spectroscopy in an Airport Environment—a Proof of Principle Study,” J. Raman Spectroscopy 39(7):873-880 (2008), which is hereby incorporated by reference in its entirety).
Raman spectroscopy is increasing in popularity among the different disciplines of forensic science. Some examples of its use today involve the identification of drugs (Hodges et al., “The Use of Fourier Transform Raman Spectroscopy in the Forensic Identification of Illicit Drugs and Explosives,” Molecular Spectroscopy 46:303-307 (1990), which is hereby incorporated by reference in its entirety), lipsticks (Rodger et al., “The In-Situ Analysis of Lipsticks by Surface Enhanced Resonance Raman Scattering,” Analyst 1823-1826 (1998), which is hereby incorporated by reference in its entirety), and fibers (Thomas et al., “Raman Spectroscopy and the Forensic Analysis of Black/Grey and Blue Cotton Fibers Part 1: Investigation of the Effects of Varying Laser Wavelength,” Forensic Sci. Int. 152:189-197 (2005), which is hereby incorporated by reference in its entirety), as well as paint (Suzuki et al., “In Situ Identification and Analysis of Automotive Paint Pigments Using Line Segment Excitation Raman Spectroscopy: I. Inorganic Topcoat Pigments,” J. Forensic Sci. 46:1053-1069 (2001), which is hereby incorporated by reference in its entirety) and ink (Mazzella et al., “Raman Spectroscopy of Blue Gel Pen Inks,” Forensic Sci. Int. 152:241-247 (2005), which is hereby incorporated by reference in its entirety) analysis. Very little or no sample preparation is needed, and the required amount of tested material could be as low as several picograms or femtoliters (10⁻¹² gram or 10⁻¹⁵ liter, respectively). A typical Raman spectrum consists of several narrow bands and provides a unique vibrational signature of the material (Grasselli et al., “Chemical Applications of Raman Spectroscopy,” New York: John Wiley & Sons (1981), which is hereby incorporated by reference in its entirety). 
Unlike infrared (IR) absorption spectroscopy, another type of vibrational spectroscopy, Raman spectroscopy shows very little interference from water (Grasselli et al., “Chemical Applications of Raman Spectroscopy,” New York: John Wiley & Sons (1981), which is hereby incorporated by reference in its entirety). Proper Raman spectroscopic measurements do not damage the sample. A swab could be tested in the field and still be available for further use in the lab, and that is very important to forensic application. The design of a portable Raman spectrometer is a reality now (Yan et al., “Surface-Enhanced Raman Scattering Detection of Chemical and Biological Agents Using a Portable Raman Integrated Tunable Sensor,” Sensors and Actuators B. 6 (2007); Eckenrode et al., “Portable Raman Spectroscopy Systems for Field Analysis,” Forensic Science Communications 3:(2001), which are hereby incorporated by reference in their entirety) which could lead to the ability to make identifications at the crime scene.
Fluorescence interference is the largest problem with Raman spectroscopy and is perhaps the reason why the technique has not been more popular in the past. If a sample contains molecules that fluoresce, the broad and much more intense fluorescence peak will mask the sharp Raman peaks of the sample. There are a few remedies to this problem. One solution is to use deep ultraviolet (DUV) light for exciting Raman scattering (Lednev I. K., “Vibrational Spectroscopy: Biological Applications of Ultraviolet Raman Spectroscopy,” in: V. N. Uversky, and E. A. Permyakov, Protein Structures, Methods in Protein Structures and Stability Analysis (2007), which is hereby incorporated by reference in its entirety). Practically no condensed phase exhibits fluorescence below ˜250 nm. Possible photodegradation of biological samples is an expected disadvantage of DUV Raman spectroscopy. Another option to eliminate fluorescence interference is to use near-IR (NIR) excitation for Raman spectroscopic measurement. Finally, surface enhanced Raman spectroscopy (SERS), which involves a rough metal surface, can also alleviate the problem of fluorescence (Thomas et al., “Raman Spectroscopy and the Forensic Analysis of Black/Grey and Blue Cotton Fibers Part 1: Investigation of the Effects of Varying Laser Wavelength,” Forensic Sci. Int. 152:189-197 (2005), which is hereby incorporated by reference in its entirety). However, this method requires direct contact with the analyte and cannot be considered to be nondestructive.
Basic components of a Raman spectrometer are (i) an excitation source; (ii) optics for sample illumination; (iii) a single, double, or triple monochromator; and (iv) a signal processing system consisting of a detector, an amplifier, and an output device.
Typically, a sample is exposed to a monochromatic source, usually a laser in the visible, near infrared, or near ultraviolet range. The scattered light is collected using a lens and is focused at the entrance slit of a monochromator. The monochromator, which is set for a desirable spectral resolution, rejects the stray light in addition to dispersing incoming radiation. The light leaving the exit slit of the monochromator is collected and focused on a detector (such as a photodiode array (PDA), a photomultiplier tube (PMT), or a charge-coupled device (CCD)). This optical signal is converted to an electrical signal within the detector. The incident signal is stored in computer memory for each predetermined frequency interval. A plot of the signal intensity as a function of its frequency difference (usually in units of wavenumbers, cm⁻¹) will constitute the Raman spectroscopic signature.
Raman signatures are sharp and narrow peaks observed on a Raman spectrum. These peaks are located on both sides of the excitation laser line (Stokes and anti-Stokes lines). Generally, only the Stokes region is used for comparison with a Raman spectrum of a known sample (the anti-Stokes region is identical in pattern, but much less intense). A visual comparison of these sets of peaks (spectroscopic signatures) between experimental and known samples is needed to verify the reproducibility of the data. Therefore, establishing correlations between experimental and known data is required to assign the peaks to the molecules and to identify a specific component in the sample.
The types of Raman spectroscopy suitable for use in conjunction with the present invention include, but are not limited to, conventional Raman spectroscopy, Raman microspectroscopy, near-field Raman spectroscopy, including but not limited to the tip-enhanced Raman spectroscopy, surface enhanced Raman spectroscopy (SERS), surface enhanced resonance Raman spectroscopy (SERRS), and coherent anti-Stokes Raman spectroscopy (CARS). Also, both Stokes and anti-Stokes Raman spectroscopy could be used.
In addition to Raman spectroscopy, the spectroscopic analysis of the present invention can be performed using, for example, mass spectrometry, fluorescence spectroscopy, laser induced breakdown spectroscopy, infrared spectroscopy, scanning electron microscopy, X-ray diffraction spectroscopy, powder diffraction spectroscopy, X-ray luminescence spectroscopy, inductively coupled plasma mass spectrometry, capillary electrophoresis, or atomic absorption spectroscopy. Some of the spectroscopic methods mentioned above, including but not limited to Raman spectroscopy, are relatively simple, rapid, non-destructive, and would allow for the development of a portable instrument. The technique can be performed with relatively small samples, picogram (pg) quantities. The composition of the sample is not changed in any way, allowing for further forensic tests on the residue or other components of the evidence.
Scanning Electron Microscopy combined with Energy Dispersive Spectroscopy (SEM/EDS or EDX when equipped with an X-ray analyzer) is capable of obtaining both morphological information and the elemental composition. Recently, SEM/EDS systems have become automated, making automated computer-controlled SEM the method of choice for most laboratories conducting analyses. Several features of the SEM make it useful in many forensic studies, including magnification, imaging, composition analysis, and automation.
Inductively coupled plasma mass spectrometry (ICP-MS) is a mass analysis method with sensitivity to metals. As a result, this analytical technique is ideal for analyzing barium, lead, and antimony. This technique is known for its sensitivity, having detection limits that are usually in the parts per billion.
Fourier transform infrared (FTIR) spectroscopy is a versatile tool for the detection, estimation and structural determination of organic compounds such as drugs, explosives, and organic components. Due to the availability of portable IR spectrometers, it will be possible to analyze the samples at scenes remote from laboratories. Capillary electrophoresis (CE) is another suitable analytical technique. The significant advantage of CE is the low probability of false positives (Bell, S., Forensic Chemistry, Pearson Education: Upper Saddle River, N.J. (2006), which is hereby incorporated by reference in its entirety).
Atomic absorption spectroscopy (AAS) is a bulk method of analysis used in the analysis of inorganic materials in primer residue, namely Ba and Sb. The high sensitivity for a small volume of sample is one advantage of AAS. This technique involves the absorption of thermal energy by the sample and subsequent emission of some or all of the energy in the form of radiation (Bauer et al., Instrumental Analysis, Allyn and Bacon, Inc.: Boston (1978), which is hereby incorporated by reference in its entirety). These emissions are generally unique for specific elements and thus give information about the composition of the sample. Laser-induced breakdown spectroscopy (LIBS) is a type of atomic emission spectroscopy that uses lasers to excite the sample. Unlike flame AAS, LIBS is amenable to field testing because of the availability of portable LIBS systems.
X-ray diffraction (XRD) is one such technique that can be used for the characterization of a wide variety of substances of forensic interest (Abraham et al., “Application of X-Ray Diffraction Techniques in Forensic Science,” Forensic Science Communications 9(2) (2007), which is hereby incorporated by reference in its entirety). XRD is capable of obtaining information about the actual structure of samples, in a non-destructive manner.
In one embodiment, spectroscopic analysis is Raman spectroscopy. In a preferred embodiment, Raman spectroscopy is selected from the group consisting of resonance Raman spectroscopy, normal Raman spectroscopy, Raman microscopy, Raman microspectroscopy, NIR Raman spectroscopy, surface enhanced Raman spectroscopy (SERS), tip enhanced Raman spectroscopy (TERS), Coherent anti-Stokes Raman scattering (CARS), and Coherent anti-Stokes Raman scattering microscopy.
In another embodiment, spectroscopic analysis is Infrared spectroscopy. In a preferred embodiment, the Infrared spectroscopy is selected from the group consisting of Infrared microscopy, Infrared microspectroscopy, Infrared reflection spectroscopy, Infrared absorption spectroscopy, attenuated total reflection infrared spectroscopy, Fourier transform infrared spectroscopy, and attenuated total reflection Fourier transform infrared spectroscopy.
The spectroscopic signature can be obtained from: spectra at different locations of the sample of the body fluid; a single spectrum of the sample of the body fluid; or as an average of spectra collected at different locations of the sample.
In the present invention, the term “spectroscopic signature” refers to a single spectrum, an averaged spectrum, multiple spectra, or any other spectroscopic representation of intrinsically heterogeneous samples.
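One of the options above, an averaged spectrum built from spectra collected at different locations of the same sample, can be sketched as follows. This is only an illustrative example: the function name and the toy intensity values are hypothetical, and it assumes that all spectra share the same wavenumber axis.

```python
from typing import List, Sequence


def average_spectrum(spectra: Sequence[Sequence[float]]) -> List[float]:
    """Point-by-point mean of spectra recorded on a common wavenumber axis."""
    if not spectra:
        raise ValueError("at least one spectrum is required")
    n_points = len(spectra[0])
    if any(len(s) != n_points for s in spectra):
        raise ValueError("all spectra must share the same wavenumber axis")
    # zip(*spectra) groups the intensities measured at the same wavenumber.
    return [sum(column) / len(spectra) for column in zip(*spectra)]


# Three toy spectra (three wavenumber points each) from three spots of one stain.
spots = [[1.0, 4.0, 2.0], [3.0, 6.0, 2.0], [2.0, 5.0, 2.0]]
signature = average_spectrum(spots)
```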
In one embodiment, the statistical model for determination of gender and/or race of a subject is prepared by multivariate analysis. In a preferred embodiment, multivariate analysis is supervised multivariate analysis.
In another embodiment, the statistical model is prepared by classification statistical analysis. In a preferred embodiment, the classification statistical analysis is selected from the group consisting of Partial least squares discriminant analysis (PLS-DA), Support vector machines discriminant analysis (SVMDA), K-Nearest neighbor (KNN), Artificial neural network (ANN), and Soft independent modeling of/by class analogy (SIMCA).
Artificial neural networks (ANNs) are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain) which are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Artificial neural networks are typically specified using architecture, activity rule, and learning rule.
Classical least squares (CLS) techniques are also known as direct least squares or forward least squares. CLS methods are typically used for exploratory analysis, detection, classification, and quantification. CLS regression methods include classical, extended, weighted, and generalized least squares. These methods can be used to account for interferents (i.e. analytes other than the one of interest) in spectroscopic systems. CLS also provides a natural framework for the development of popular de-cluttering methods such as External Parameter Orthogonalization (EPO) and Generalized Least Squares (GLS) weighting.
Locally weighted regression (LWR) is a memory-based method that performs a regression around a point of interest using only training data that are “local” to that point.
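The idea of regressing only on “local” training data can be illustrated with a minimal one-dimensional sketch. The choice of the k nearest points, the tricube weighting, and the function name are assumptions made for illustration; they are not prescribed by the present description.

```python
from typing import Sequence


def lwr_predict(x_train: Sequence[float], y_train: Sequence[float],
                x0: float, k: int = 5) -> float:
    """Predict y at x0 from a weighted straight-line fit to the k nearest points."""
    if k < 2 or k > len(x_train):
        raise ValueError("k must be between 2 and the number of training points")
    # Keep only the k training points closest to the point of interest.
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    # Bandwidth slightly beyond the farthest kept point, so every weight is positive.
    h = max(abs(x - x0) for x, _ in nearest) * 1.0001 or 1.0
    weights = [(1.0 - (abs(x - x0) / h) ** 3) ** 3 for x, _ in nearest]
    sw = sum(weights)
    x_bar = sum(w * x for w, (x, _) in zip(weights, nearest)) / sw
    y_bar = sum(w * y for w, (_, y) in zip(weights, nearest)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, (x, _) in zip(weights, nearest))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, (x, y) in zip(weights, nearest))
    if sxx == 0.0:
        # All local points share one x value: fall back to the weighted mean.
        return y_bar
    return y_bar + (sxy / sxx) * (x0 - x_bar)


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
prediction = lwr_predict(xs, ys, 4.5, k=4)
```

Because a new local fit is computed for every query point, no global model is stored; the training data themselves act as the model, which is why the method is described as memory-based.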
Multiple linear regression (MLR) is the most common form of linear regression analysis. As a predictive analysis, multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical.
Multiway partial least squares (MPLS) is an extension of the ordinary regression model PLS to the multi-way case. In chemometrics, there is some confusion in distinguishing between multi-way methods and multi-way data. Bilinear two-way PLS and PCA can cope with multi-way data by unfolding the data arrays to matrices, but the methods themselves are not multi-way and do not take advantage of any multi-way structure in the data.
Principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). It considers regressing the outcome (also known as the response or the dependent variable) on a set of covariates (also known as predictors, explanatory variables, or independent variables) based on a standard linear regression model, but uses PCA for estimating the unknown regression coefficients in the model.
Support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier.
Partial least squares (PLS) or Partial least squares regression (PLSR) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares Discriminant Analysis (PLS-DA) is a variant used when the Y is categorical.
Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
Multivariate analysis of variance (MANOVA) is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables, and is typically followed by significance tests involving individual dependent variables separately.
K-Nearest neighbor (KNN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
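For the classification case, the use of the k closest training examples can be sketched as follows. The Euclidean distance, the majority vote, the function name, and the toy class labels "A" and "B" are illustrative assumptions only.

```python
import math
from collections import Counter
from typing import Sequence


def knn_classify(train_x: Sequence[Sequence[float]], train_y: Sequence[str],
                 query: Sequence[float], k: int = 3) -> str:
    """Assign the class most common among the k training points nearest to query."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training points by Euclidean distance to the query in feature space.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = ["A", "A", "A", "B", "B", "B"]
predicted = knn_classify(features, labels, [0.2, 0.1])
```

With two classes, an odd value of k avoids tied votes.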
Soft independent modeling of/by class analogy (SIMCA) is a statistical method for supervised classification of data. The method requires a training data set consisting of samples (or objects) with a set of attributes and their class membership. The term soft refers to the fact that the classifier can identify samples as belonging to multiple classes, rather than necessarily producing a classification of samples into non-overlapping classes.
Another aspect of the present invention relates to a method of establishing a statistical model for determination of gender and/or race of a subject using a body fluid stain from the subject. This method includes providing a plurality of samples containing a known type of body fluid stain from a subject of known race and/or gender; subjecting each sample or an area of each sample containing the stain to a spectroscopic analysis to produce a spectroscopic signature for each sample; and establishing a statistical model for determination of gender and/or race of a subject for a particular body fluid type based on said subjecting.
For samples containing a known type of body fluid stain from a subject of known race and/or gender the spectroscopic signature is obtained from the spectra at: different locations of the same sample of the body fluid; different samples of the same type of body fluid; or different locations on different samples of the same type of body fluid.
According to the present invention, a statistical model for determination of gender and/or race of a subject using a body fluid stain from the subject can be prepared using any type of the statistical analysis described above.
In one embodiment, the statistical model for determination of gender and/or race of a subject is prepared by multivariate analysis. In a preferred embodiment, multivariate analysis is supervised multivariate analysis.
In another embodiment, the statistical model is prepared by classification statistical analysis. In a preferred embodiment, the classification statistical analysis is selected from the group consisting of Partial least squares discriminant analysis (PLS-DA), Support vector machines discriminant analysis (SVMDA), K-Nearest neighbor (KNN), Artificial neural network (ANN), and Soft independent modeling of/by class analogy (SIMCA).
In another embodiment, the method further includes rebuilding the statistical model; and validating the statistical model.
In yet another embodiment, the method further includes performing an informative spectral features selection for further developing a spectroscopic signature.
In one embodiment, the establishing produces a statistical model for determination of the subject's gender for a specific type of body fluid.
In another embodiment, the establishing produces a statistical model for determination of the subject's race for a specific type of body fluid.
According to one embodiment, the method of developing a statistical model for determination of gender and/or race of a subject using a body fluid stain from the subject using spectroscopic analysis involves the following steps. First, multiple spectra for samples of body fluid of known gender and race are collected. Second, these spectra are preprocessed. The preprocessing step can be performed using any of the different pre-treatment procedures alone or in different combinations. Then a statistical model is developed using any of the statistical methods described above alone or in combination. Next, an informative spectral features selection is performed. Next, the model is rebuilt and, if necessary, the model can be validated using any of the statistical methods described above alone or in combination (validation step is optional).
According to another embodiment, the method of determining gender and/or race of an unknown sample involves the following steps. First, multiple spectra for an unknown sample are obtained. Second, the spectra are preprocessed. The preprocessing step can be performed using any of the above-described pre-treatment procedures alone or in different combinations. Next, the statistical model for determining gender and/or race of a subject is applied to determine the gender and/or race of a subject using a body fluid stain.
A total of 20 human peripheral blood samples were used for this experiment, which were purchased from Bioreclamation, Inc. Donors were chosen with consideration to gender and age diversity. The average age of Caucasian (CA) and African American (AA) donors was 45.0±8.4 and 43.8±7.2 years, respectively, with male donors making up 40% and 50% of the donor pool, respectively. All blood samples were kept frozen until sample preparation. After defrosting, tubes of blood were vortexed and 10 μL of blood were deposited onto an aluminum foil covered microscope slide. Prepared samples were allowed to dry overnight prior to spectral collection.
A Renishaw inVia Raman spectrometer was used for sample analysis. The instrument was equipped with a Leica optical microscope with a 20× objective and PRIOR automatic stage. A 785 nm laser light (power=4.0 mW) was used for excitation; twenty 10-second accumulations were recorded from each spot on the sample. Spectra were recorded in the range of 250-1800 cm−1. A total of 180 spectra were collected using Raman mapping with nine different spots for each sample. The instrument was calibrated using a silicon standard (peak at 520.6 cm−1) before collecting spectra from a bloodstain.
Data treatment and advanced statistical analysis were performed using MATLAB R2013b (Mathworks, Inc.). Recorded blood spectra were divided into two datasets based on race. Raman spectra were baseline corrected using the automatic weighted least squares baseline algorithm, normalized by the standard normal variate method, and mean centered. After these preprocessing steps, further analysis was performed using the PLS Toolbox (Eigenvector Research, Inc.). Informative spectral regions were identified using genetic algorithm (GA) analysis. Multivariate outlier removal was carried out using PCA prior to all statistical analyses, which resulted in the removal of 20 spectra from the 180 total spectra originally collected. To distinguish between blood spectra from CA and AA donors, SVM-DA models were built. The method was validated by outer subject-wise CV loop where all spectra from one donor were taken out, one at a time, from the training dataset and used for validation. The remaining spectra of n−1 donors were used as training data to build a new SVM-DA model and predictions were performed for the validation data (excluded donor's spectra). For evaluation purposes, receiver operating characteristic (ROC) and area under the curve (AUC) analyses were applied. ROC analysis was carried out with the open source package pROC (Robin et al., “pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves.” BMC Bioinformatics 12(1):77 (2011), which is hereby incorporated by reference in its entirety). The AUC analysis indicated how well the model ranks subjects according to the probability of assignment to the correct class.
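Two of the preprocessing steps named above, standard normal variate (SNV) normalization and mean centering, can be sketched in a few lines. This is a simplified illustration and not the PLS Toolbox implementation: baseline correction is omitted, the sample standard deviation is assumed for SNV, and the function names and toy values are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def snv(spectrum: Sequence[float]) -> List[float]:
    """Standard normal variate: center and scale each spectrum individually."""
    m = mean(spectrum)
    s = stdev(spectrum)
    if s == 0:
        raise ValueError("a flat spectrum cannot be SNV-normalized")
    return [(v - m) / s for v in spectrum]


def mean_center(spectra: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the dataset mean at each wavenumber from every spectrum."""
    column_means = [mean(column) for column in zip(*spectra)]
    return [[v - m for v, m in zip(row, column_means)] for row in spectra]


# The second toy spectrum is the first one recorded at twice the intensity.
raw = [[10.0, 12.0, 18.0, 12.0], [20.0, 24.0, 36.0, 24.0]]
normalized = [snv(s) for s in raw]
centered = mean_center(normalized)
```

In this toy case the two spectra become identical after SNV, which illustrates how the normalization removes overall intensity differences between spots before classification.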
As previously mentioned, other studies have shown that visual distinction between Raman spectra of blood from different classes is not possible (Virkler et al., “Blood Species Identification for Forensic Purposes using Raman Spectroscopy Combined with Advanced Statistical Analysis,” Anal. Chem. 81(18):7773-7777 (2009); McLaughlin et al., “Discrimination of Human and Animal Blood Traces Via Raman Spectroscopy,” Forensic Sci. Int. 238(0):91-95 (2014); De Wael et al., “In Search of Blood-Detection of Minute Particles using Spectroscopic Methods,” Forensic Sci. Int. 180(1):37-42 (2008), which are hereby incorporated by reference in their entirety). This is due to the fact that spectra generated by Raman analysis of dried blood, using 785 nm excitation, are composed of peaks originating exclusively from vibrational modes of hemoglobin, which is present in all human blood samples (Premasiri et al., “Surface-Enhanced Raman Scattering of Whole Human Blood, Blood Plasma, and Red Blood Cells: Cellular Processes and Bioanalytical Sensing,” J. Phys. Chem. B, 116(31):9376-86 (2012), which is hereby incorporated by reference in its entirety). The averaged preprocessed spectrum of all CA and AA donors analyzed in this study is shown in
GA analysis was carried out on the 160 spectra used to build the SVM-DA models for optimization purposes and to better understand and identify the origin of differences between classes. The analysis considered all possible variables (wavenumbers) within the Raman spectral dataset and their significance for the discrimination between classes (races). This allowed for the reduction of the original Raman spectra to subsets of unique wavenumbers in order to achieve better prediction performance (Niazi et al., “Genetic Algorithms in Chemometrics,” Journal of Chemometrics 26(6):345-351 (2012), which is hereby incorporated by reference in its entirety). The GA analysis only selected variables that gave the most valuable information for discrimination within the entire training dataset of donors from both races. The spectral regions selected by the GA operation are shown in
An SVM-DA classification model was built based on 160 spectra from 20 donors (10 for each race). The model was used to differentiate races based on the spectral features, selected by GA analysis from the original Raman spectra. The SVM-DA model was automatically trained with a dataset of labeled spectra and by tuning parameters via modification of the underlying kernel function. For this study, pattern recognition SVM-DA was used with the radial basis function as a kernel function, and it was optimized by a combined approach of 5-fold CV and a systematic grid search of the parameters. The internal CV executed by the model showed 71% accuracy. The prediction performance of the subsequently built SVM-DA models was estimated by the outer loop of leave-one-out CV at the donor level. For additional information, see Varma et al., “Bias in Error Estimation When Using Cross-Validation for Model Selection,” BMC Bioinformatics 7:91 (2006), which is hereby incorporated by reference in its entirety.
All spectra from one subject at a time were excluded from the initial training set and used as the validation set to test the model built using spectra from the remaining (n−1) donors. This process was repeated until all subjects were separately used for validation.
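The subject-wise splitting described above can be sketched as a generator of training and validation indices. The function name and the toy donor identifiers are hypothetical; the model building and prediction steps are left out.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_donor_out(
    donor_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held-out donor, training indices, validation indices) for each donor."""
    for held_out in sorted(set(donor_ids)):
        # Every spectrum of the held-out donor goes to validation, none to training.
        train = [i for i, d in enumerate(donor_ids) if d != held_out]
        validation = [i for i, d in enumerate(donor_ids) if d == held_out]
        yield held_out, train, validation


# One donor label per spectrum; spectra from the same donor share a label.
donor_ids = ["d1", "d1", "d2", "d2", "d2", "d3"]
folds = list(leave_one_donor_out(donor_ids))
```

Keeping all spectra of a donor on the same side of the split prevents a model from being tested on spectra of a donor it has already seen during training.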
For each donor, the final classification results were calculated as prediction probabilities that each spectrum, or each subject as a whole (based on the classification of all of that donor's spectra), belongs to the correct class. Among the subsets from all 20 subjects, the predicted group membership and probabilities, for each spectrum and for each subject, were recorded. Using ROC analysis, the best thresholds were identified (above which the spectrum/donor probability estimate was assigned to the correct class) to rank the SVM classifier's ability to separate the races. The results of the AUC analysis can range from 0 to 1. An AUC value of 0.5 represents a random classifier and an AUC value of 1.0 indicates a perfect test. This analysis allowed for discrimination of CA and AA races with an AUC value of 0.71 (95% CI: 0.63-0.79) based on a single spectrum, and 0.83 (95% CI: 0.64-1.00) based on each subject (
This preliminary study showed promise for race differentiation based on human blood traces analyzed by Raman spectroscopy.
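The meaning of the AUC values reported above can be illustrated with a minimal sketch that computes the AUC as the fraction of (positive, negative) pairs ranked correctly by the predicted probabilities. This is not the pROC implementation; the function name and the toy scores are hypothetical.

```python
from typing import Sequence


def auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Fraction of positive/negative pairs in which the positive scores higher.

    Tied pairs count as one half, so a classifier that gives every sample
    the same score obtains 0.5.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy predicted probabilities of belonging to the positive class.
example_auc = auc([0.9, 0.4], [0.6, 0.1])
```

In the toy example three of the four pairs are ranked correctly, which gives an AUC of 0.75.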
For the first time, Raman spectroscopy, combined with chemometrics, has been used to differentiate between dry blood traces from CA and AA donors. To validate the internal CV results, which achieved 71% correct classification of donors based on all spectra included in a training dataset, outer CV was performed. The summary of predictions from outer CV for 20 different SVM-DA models demonstrated 83% (AUC) probability of correct race classification of individual donors after ROC analysis. These results showed promise for discrimination of the race of human peripheral blood found at a crime scene. Since blood composition quantitatively varies for different races, these changes for the two races considered here may be detected by Raman spectroscopy. More importantly, chemometrics was applied to support and strengthen the classification. This approach allowed for nondestructive detection of minor differences that were present in blood spectra between two races (CA and AA).
By using Raman spectroscopy for the method of analysis, the bloodstain's integrity was preserved, and it can be further examined or used for subsequent tests (e.g. DNA profiling) with no change to the sample. Therefore, this technique could extract information about an unknown blood sample without damaging or consuming it, unlike most tests currently used for blood identification and/or analysis in forensic casework. The application of Raman spectroscopy in real crime scene investigations is highly probable due to commercially available portable instruments, which allow for nondestructive and rapid examination at the scene of a crime. Furthermore, not only can a stain be identified as blood using the present technology but, by incorporating statistical analysis, more information about the donor can be obtained, all in a reliable and statistically confident manner.
Samples
A total of 30 male and 30 female blood samples, purchased from the certified company Bioreclamation Inc., were used for the entire study. All donors were found to be negative for HIV 1/2 AB and HCV AB and non-reactive for HBSAG, HIV-1 RNA, HCV RNA, and STS. The average age for all subjects was 42 years. Samples were prepared by putting a 10-μl drop on aluminum foil placed on a microscope slide. Aluminum foil has a low level of fluorescence and a very weak Raman signal. It is also an inexpensive material, which can be easily prepared right before an experiment. A Raman mapping procedure was performed on dry spots using one 10-second accumulation with 785-nm laser excitation at approximately 10 mW. In total, more than 4,500 spectra were collected from an area of about 4×4 mm using a PRIOR automatic stage, attached to a Renishaw inVia confocal Raman spectrometer equipped with a research-grade Leica microscope with a 50× long-range objective (numerical aperture of 0.35). A silicon standard was used for the calibration.
Data Treatment
The spectra were imported into MATLAB 7.11 for statistical analysis. The fluorescent background contribution in Raman spectra of blood was removed using an adaptive iteratively reweighted penalized least squares (air-PLS) baseline correction algorithm. No contribution of aluminum substrate was found in the Raman spectra of blood. All Raman spectra were subjected to the statistical analysis including significant factor analysis (SFA), principal component analysis (PCA), hierarchical clustering such as k-nearest neighbor (KNN), and support vector machine discriminant analysis (SVMDA).
Raman Spectra of Blood
Human blood consists of diverse biochemical constituents, and their contribution varies from donor to donor (Virkler et al., “Raman Spectroscopic Signature of Blood and Its Potential Application to Forensic Body Fluid Identification,” Anal. Bioanal. Chem. 396(1): 525-34 (2010), which is hereby incorporated by reference in its entirety). The heterogeneous nature of such a system could be illustrated by deviations in Raman spectra (
The averaged blood spectra of female and male donors, normalized by total area, have very similar profiles (
a - Movasaghi et al., “Raman Spectroscopy of Biological Tissues,” Appl. Spectrosc. Rev. 42(5): 493-541 (2007); Alfano et al., Detection of Glucose Levels Using Excitation and Difference Raman Spectroscopy at the IUSL (2008); Janko et al., “Preservation of 5300 Year Old Red Blood Cells in the Iceman,” J. R. Soc. Interface (2012); Aubrey et al., “Raman Spectroscopy of Filamentous Bacteriophage Ff (fd, M13, f1) Incorporating Specifically-Deuterated Alanine and Tryptophan Side Chains. Assignments and Structural Interpretation,” Biophys. J. 60(6): 1337-49 (1991); Grasselli, J., Chemical Applications of Raman Spectroscopy, New York: John Wiley & Sons (1981); Johnson et al., “Ultraviolet Resonance Raman Characterization of Photochemical Transients of Phenol, Tyrosine, and Tryptophan,” J. Am. Chem. Soc. 108: 905-912 (1986); Hu et al., “Tyrosine and Tryptophan Structure Markers in Hemoglobin Ultraviolet Resonance Raman Spectra: Mode Assignments Via Subunit-Specific Isotope Labeling of Recombinant Protein,” Biochemistry 36(50): 15701-12 (1997); Sato et al., “Excitation Wavelength-Dependent Changes in Raman Spectra of Whole Blood and Hemoglobin: Comparison of the Spectra with 514.5-, 720-, and 1064-nm Excitation,” J. Biomed. Opt. 6(3): 366-70 (2001); Premasiri et al., “Surface-Enhanced Raman Scattering of Whole Human Blood, Blood Plasma, and Red Blood Cells: Cellular Processes and Bioanalytical Sensing,” J. Phys. Chem. B, 116(31): 9376-86 (2012), which are hereby incorporated by reference in their entirety.
Main Approach
The present application describes the feasibility of using multidimensional Raman blood signatures to differentiate the sex of a donor. The present application demonstrates that Raman spectra of blood, regardless of the gender of the donors, can be distinguished from other body fluids using the previously developed blood signature (Sikirzhytski, et al., “Multidimensional Raman Spectroscopic Signatures as a Tool for Forensic Identification of Body Fluid Traces: A Review,” Appl. Spectrosc. 65(11):1223-32 (2011), which is hereby incorporated by reference in its entirety).
Unsupervised methods of spectroscopic data analysis can be used as a first step of analysis to find out the general relationships between spectra. Their application exposed a high level of similarity between the male and female data sets (
Hierarchical clustering methods were used to search for the internal structure of Raman spectroscopic data. This method allows splitting the analyzed data into hierarchical subgroups forming a dendrogram. In particular, spectral clusters unique for male and female donors were under consideration (
Support Vector Machine Discriminant Analysis of Human Blood
SVMDA classification models built using the described characteristic clusters demonstrated high selectivity and sensitivity (~90%) of gender determinations. Results were cross-validated using a sample-wise leave-one-out approach. The best results were obtained using an SVMDA algorithm, which allows for effective separation of overlapping classes (
An alternative possibility of data preparation is to calculate averaged spectra and use them for building a classification model. This approach helps to reduce the dimensionality of data and overcome difficulties, originating from the poor quality of some spectra. It was hypothesized that misclassification in different gender classes in some cases can be caused by a relatively low signal-to-noise ratio. The presence of noise influences sensitivity of the method, making spectral features indistinguishable for male and female groups. To overcome this problem, the averaged spectra for each donor were calculated and subjected to SVMDA (
Raman microspectroscopy was used for the identification of human gender based on dried blood traces. Blood samples from a total of 60 human donors were subjected to automatic mapping followed by chemometrical analysis. Male and female datasets were formed using MATLAB 7.11 after preprocessing (baseline correction, noise reduction and normalization by total area). Spectroscopic patterns from those two groups were found to be the same, despite the high level of blood heterogeneity. Both human genders were described by characteristic Raman spectra based on unsupervised cluster analysis. The most successful results were achieved using the SVM algorithm followed by cross-validation using the sample-wise leave-one-out approach using Raman spectra averaged by donors. Further development of this classification method is ongoing.
Sample Preparation and Raman Microspectroscopy
Twenty-eight human semen samples were purchased from Bioreclamation LLC (Westbury, N.Y.). Donors self-reported their race as Caucasian (n=10), Black (n=8), or Hispanic (n=10). Each group had an age range from mid-twenties to mid-fifties to ensure donor diversity. Samples were kept frozen until preparation for analysis, when they were thawed to room temperature and vortexed for 30 seconds to ensure a homogeneous distribution of the different phases of the sample. A 10 μL aliquot was deposited on a microscope slide covered with aluminum foil, which has minimal Raman and fluorescence signal contribution. Samples were air dried overnight prior to analysis.
A Renishaw inVia confocal Raman microspectrometer equipped with a Renishaw PRIOR automatic stage was used for data collection. The excitation source was a 785-nm laser operating at about 50 mW. Calibration was performed with a silicon standard. Spectra were collected with a 50× long range/working distance range objective in the range of 300-1800 cm−1, with a 10 second exposure time and 7 accumulations. Each sample was automatically mapped to collect 64 spectra across an area of approximately 2.0 mm2.
Data Treatment
Statistical software MATLAB version R2012a (Mathworks, Inc., Natick Mass.) was used with the PLS Toolbox 7.0.3 (Eigenvector Research, Inc., Wenatchee, Wash.) for data pretreatment and analysis. Spectra that exhibited significant noise or cosmic ray interference were removed from the dataset, resulting in a total of 1,537 spectra. Each sample's dataset was baseline corrected with an adaptive iteratively reweighted penalized least-squares (air-PLS) baseline correction algorithm (Zhang et al., “Baseline Correction Using Adaptive Iteratively Reweighted Penalized Least Squares,” Analyst 135(5):1138-1146 (2010), which is hereby incorporated by reference in its entirety). Spectra were averaged to create one mean spectrum per donor for the development of the model based on donors, instead of individual spectra. The donor's class (Black, Caucasian, or Hispanic) was assigned to all spectra. Two datasets were created from the existing data, one collective dataset with all spectra (n=1,537), and one with all mean spectra (n=28). All spectra were smoothed with a Savitzky-Golay filter, normalized by total area, and mean centered prior to analysis. Principal component analysis (PCA) with leave-one-out cross-validation was applied to the preprocessed collective dataset for dimensionality reduction of the data and to calculate the number of principal components (PCs) that could fully describe the obtained data, which was found to be five. Several comprehensive chemometrical approaches were investigated, including Significant Factor Analysis (SFA), k-nearest neighbor (KNN) hierarchical clustering, Partial Least Squares Discriminant Analysis (PLS-DA), and Support Vector Machine Discriminant Analysis (SVMDA).
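The smoothing and total-area normalization steps mentioned above can be sketched as follows. The filter window and polynomial order are not stated in this description, so the 5-point quadratic Savitzky-Golay filter used here is an assumption for illustration, as are the function names and the use of absolute intensities for the area.

```python
from typing import List, Sequence


def savgol5(y: Sequence[float]) -> List[float]:
    """5-point quadratic Savitzky-Golay smoothing.

    The two points at each end of the spectrum are left unchanged.
    """
    coefficients = (-3, 12, 17, 12, -3)  # standard 5-point quadratic weights, sum 35
    smoothed = list(y)
    for i in range(2, len(y) - 2):
        window = y[i - 2:i + 3]
        smoothed[i] = sum(c * v for c, v in zip(coefficients, window)) / 35
    return smoothed


def normalize_area(y: Sequence[float]) -> List[float]:
    """Scale a spectrum so that the sum of its absolute intensities equals one."""
    total = sum(abs(v) for v in y)
    if total == 0:
        raise ValueError("cannot normalize a spectrum with zero total area")
    return [v / total for v in y]


parabola = [float(x * x) for x in range(7)]
smoothed_parabola = savgol5(parabola)
unit_area = normalize_area([1.0, 2.0, 1.0])
```

A quadratic filter reproduces a parabola exactly at the interior points, which is a convenient check that the coefficients are applied correctly.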
The main objective of this example was to use Raman spectroscopy of dry semen traces to identify a donor's race. Three different classification schemes were explored. First, a chemometric model was built to classify donors into one of the three races (Caucasian, Black, or Hispanic) in one step, based solely on their mean spectrum. Next, a two-step scheme was constructed using the collective data set. The first step classified the spectra into one of the three races studied using a chemometric model, just as the previous model had with the mean spectra. The overall donor classification was then determined using the classification results observed for each individual donor. Finally, a three-step scheme was created. Using the collective dataset, this scheme employed two models to classify the spectra. The first model separated the spectra from Caucasian and Hispanic from those of Black donors. The second model then differentiated Caucasian and Hispanic spectra. In the third and final step, the spectral classification results were used to classify individual donors.
Spectra Acquisition and Analysis
Previously, a spectroscopic signature was reported that can be used to identify semen, and differentiate it from other body fluids (Virkler et al., “Raman Spectroscopic Signature of Semen and its Potential Application to Forensic Body Fluid Identification,” Forensic Sci. Int. 193(1-3):56-62 (2009), which is hereby incorporated by reference in its entirety). The Raman spectrum of dry semen can be characterized by the peaks typical for tyrosine (641, 798, 829, 848, 983, 1179, 1200, 1213, 1265, 1327, and 1616 cm−1), choline (715 cm−1), albumin (759, 1003, 1336, and 1448 cm−1), other proteins (1668 and 1240 cm−1), and spermine phosphate hexahydrate (888, 958, 1011, 1055, 1065, 1125, 1317, 1461, and 1494 cm−1) (Sikirzhytski et al., “Multidimensional Raman Spectroscopic Signatures as a Tool for Forensic Identification of Body Fluid Traces: A Review,” Applied Spectroscopy 65(11):1223-32 (2011), which is hereby incorporated by reference in its entirety).
The spectra showed significant variation between donors and within the same sample, illustrating semen's heterogeneous nature (
One-Step Classification Scheme
Several different decomposition, regression, and classification models were investigated. An SVMDA model proved to be the best at differentiating the races, based on true positive and true negative rates. The SVMDA model parameters were optimized to enhance classification performance. The first SVMDA model was built using the 28 mean spectra, as a way to classify at the individual level as opposed to the spectral level. As a result, the model generated would classify donors in a single step. Unfortunately, this approach did not yield successful results; 18 of the 28 donors were misclassified (
Two-Step Classification Model
Based on the results from the direct application of the classification algorithm on the mean spectra, it was hypothesized that the collective dataset may yield more accurate predictions. When the donor's spectra are averaged, it can mask subtle, but key, spectral features that are characteristic of certain races. In a study from Belgium, researchers attempted to differentiate human, canine, and feline blood using an average spectrum and no statistical analysis (De Wael et al., “In Search of Blood—Detection of Minute Particles using Spectroscopic Methods,” Forensic Sci. Int. 180(1):37-42 (2008), which is hereby incorporated by reference in its entirety). In another study, these exact groups were differentiated using Raman mapping and chemometric models (Virkler et al., “Blood Species Identification for Forensic Purposes using Raman Spectroscopy Combined with Advanced Statistical Analysis,” Anal. Chem. 81(18):7773-7777 (2009), which is hereby incorporated by reference in its entirety).
The SVMDA model in the two-step system was built using the collective data set. The results from the model and its classification sensitivity and specificity are shown in
While the model's classification performance was improved by using the collective dataset, a complication arose. In the mean dataset, each donor was represented by a single spectrum, so the SVMDA model classified each donor into one race. The model built using the collective dataset, in which each donor was represented by several spectra, could classify a single donor's spectra into more than one race. Therefore, this approach can lead to ambiguous results. To resolve this problem, a classification scheme was developed to use the results from the SVMDA model to classify individuals on the donor level (
Using this classification scheme, the donor classification results were significantly better than the results from the first SVMDA model, which was built using the mean spectra. When every donor is studied individually, on average 90% of each donor's spectra were classified correctly (Table 4). Table 4 shows the breakdown of each donor's spectral classification, including the number and percentage classified correctly. A threshold was set at 51%, such that if 51% of a donor's spectra were attributed to a specific race, the donor was classified as a member of that race. Using this threshold, 100% of donors were classified into the correct race (Table 5). This is a notable improvement from the first SVMDA model built using the mean spectra, which only classified 10 (35.7%) donors into the correct race.
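The donor-level decision rule described above can be sketched as a simple vote over per-spectrum predictions. This is an illustrative sketch only: the function name `classify_donor` and the example counts are hypothetical, and the per-spectrum labels are assumed to come from a separately trained classifier.

```python
from collections import Counter


def classify_donor(spectrum_labels, threshold=0.51):
    """Return the class holding at least `threshold` of a donor's spectra, else None."""
    if not spectrum_labels:
        raise ValueError("donor has no classified spectra")
    label, votes = Counter(spectrum_labels).most_common(1)[0]
    fraction = votes / len(spectrum_labels)
    return label if fraction >= threshold else None


# Hypothetical donor: 9 of 14 spectra (about 64%) predicted as Caucasian.
predicted = ["Caucasian"] * 9 + ["Hispanic"] * 5
donor_class = classify_donor(predicted)
```

Returning `None` when no class reaches the threshold keeps an evenly split donor from being forced into a class.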
Table 4 shows that while the classification results given by the model were not perfect, every donor clearly fell into one race. In each case, a majority of spectra were correctly classified into one race, with only a few being misclassified. On average, 90% of each donor's spectra were classified correctly. This shows that most of the samples were not being classified by a simple majority, but rather by an overwhelming proportion.
Three-Step Classification System
While all donors were separated with 100% accuracy using the two-step classification scheme, the SVMDA model used did not yield perfect results. Upon closer examination of the misclassified data, it was observed that a majority were from Caucasian or Hispanic donors. In an attempt to improve the average number of spectra classified correctly, a third approach was investigated. A three-step classification system was designed: the first two steps consisted of SVMDA models to classify the spectra, and the third step classified the donors (
The results from the models are reported in Table 5. The true positive and true negative rates are similar to those reported for the two-step system, but the error has decreased considerably. The classification results from the first and second SVMDA models are reported in Table 6. Using the same 51% threshold applied in the second classification system, the third system also classifies all 28 donors correctly.
In the first SVMDA model, 21 (75%) of the 28 donors have at least 90% of their spectra classified correctly. In the second step, 19 (95%) of the 20 donors have at least 90% of their spectra classified correctly. The overall trend is not just a simple majority being classified correctly, but that the models are classifying a vast majority of each donor's spectra correctly. On average, only 5% of each donor's spectra were misclassified in the first step, and only 5% in the second step.
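One possible reading of the three-step system can be sketched as two binary decisions applied in sequence to each spectrum, followed by a donor-level vote. The two classifier callables below are hypothetical placeholders operating on a toy scalar feature. They stand in for the trained SVMDA models and are not meant to resemble them.

```python
from collections import Counter


def three_step_classify(spectra, is_black, is_caucasian, threshold=0.51):
    """Cascade two binary decisions per spectrum, then vote at the donor level."""
    labels = []
    for s in spectra:
        if is_black(s):            # step 1: Black vs. Caucasian/Hispanic
            labels.append("Black")
        elif is_caucasian(s):      # step 2: Caucasian vs. Hispanic
            labels.append("Caucasian")
        else:
            labels.append("Hispanic")
    # Step 3: assign the donor to the class that reaches the threshold.
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= threshold else None


# Hypothetical stand-ins for the two trained models (toy scalar "spectra").
toy_is_black = lambda s: s < 0
toy_is_caucasian = lambda s: s < 10

donor_result = three_step_classify([1, 2, 3, 20, -1], toy_is_black, toy_is_caucasian)
```

In this toy run three of the five spectra pass through both steps as Caucasian, so the donor is assigned to that class at the 51% threshold.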
For the three-step classification system, donor #1 demonstrated the lowest rate of classification in the second SVMDA model. While 90% of this donor's spectra were classified correctly as Caucasian/Hispanic in the first step, only 64% were classified correctly as Caucasian in the second step. Bioreclamation LLC was contacted to request additional information about this particular donor. More detailed records showed that the donor was actually biracial, of both Caucasian and Hispanic descent. Although this information provides a possible explanation as to why this particular donor had poor classification rates, it also introduces a new limitation: semen from biracial or mixed-race men may prove to be more difficult to classify. However, further data collection from additional biracial donors could be used to investigate this unique class more thoroughly. Eventually, new classes could be added to the model to differentiate these samples as well.
Near-Infrared (NIR) Raman microspectroscopy was used to analyze human semen samples.
A new two-step classification system using advanced statistical analysis was developed to determine a donor's race based on the Raman spectroscopic profile of their semen. An SVMDA model was used to classify each spectrum as belonging to one of the three races studied, Caucasian, Black, or Hispanic. The sensitivity and specificity scores for the model were reported as 93.9/86.6/89.2 and 96.6/95.0/93.6, respectively.
A new three-step classification system using advanced statistical analysis was developed to determine a donor's race based on the Raman spectroscopic profile of their semen. Two SVMDA models were used in sequence to classify each spectrum as belonging to one of three races. The sensitivity/specificity of the first and second model was 96.3/86.3% and 93.9/96.5%, respectively.
The overall classification pattern of each donor's spectra was used to classify the individual's race. This final step resulted in 100% sensitivity and specificity. The results obtained during the SVMDA classification were examined using extensive cross-validation with spectroscopic data acquired from additional donors. The small amount of sample needed, minimal sample preparation, automated scanning, and nondestructive nature of this method give it the potential to be very useful in forensic investigations. The present model can be further improved by including more racial groups, analyzing more samples from biracial donors, and acquiring samples for external validation. Nonetheless, the method demonstrates the ability of Raman spectroscopy and advanced statistical analysis to determine an individual's race from their semen. The present method can be extended by including more racial groups as well as differentiation of donors by their age.
Blood Samples
The experiment was performed on human blood from a total of 30 donors, acquired from Bioreclamation, Inc. Samples were divided into classes by gender (15 per class) and race (10 per class: CA, AA, and HI). Age diversity was maintained in subject selection. From the total sample population, 26 samples were used to create a training dataset. The remaining four samples were used as blind samples to externally validate the models built. Each blood sample was defrosted and vortexed to homogenize it before deposition. Samples were prepared by depositing 30 μL of fluid on a microscope slide and drying overnight.
Instrumentation and Spectra Collection
Spectra were recorded using a PerkinElmer Spectrum 100 FT-IR Spectrometer connected with Spectrum software version 6.0.2.0025 (PerkinElmer, Inc.). A diamond/ZnSe plate was used as an ATR attachment which was cleaned with water and acetone before each sample, and a 10% bleach solution after each analysis. Consistently, a background check was run prior to collecting spectra. Ten spectra were recorded from each sample in a spectral range of 600-4000 cm−1. Each spectrum was the result of ten co-added scans. The spectral resolution was set to 4 cm−1.
Data Treatment
Dataset preparation and statistical analysis was performed using MATLAB (Mathworks, Inc. version R2013b) with PLS Toolbox (Eigenvector Research, Inc.) (Wise et al., PLS_Toolbox 3.5 for Use with MATLAB Wenatchee, Wash.: Eigenvector Research, Inc. (2005), which is hereby incorporated by reference in its entirety). Previous studies on species' differentiation based on infrared blood spectra demonstrated enhanced contribution from the ATR crystal in the spectral range of 1711-2669 cm−1 (
Validation Tests
Because of the small sample population, a sufficiently large test (blind) dataset could not be created after using 26 of the 30 donors in the calibration dataset. In order to achieve the best verification of this model, the PLS-DA model was validated in two different ways. First, to rule out the effect of the test dataset size, the training dataset consisting of 26 donors was externally cross-validated: the spectra from one donor were removed from the training dataset, the PLS-DA model was refit to the remaining training data, and the refit model was used to predict the removed spectra. This was repeated until all subjects had been removed and predicted. No subjects that were used to test predictions were used during model development, so a reliable CV error rate was ensured (Anderssen et al., “Reducing Over-Optimism in Variable Selection by Cross-Model Validation,” Chemometrics and Intelligent Laboratory Systems 84(1-2):69-74 (2006), which is hereby incorporated by reference in its entirety). CV results are reported as the performance over all test sets. This provided an estimate of model performance, and it confirmed classification of predictions performed for this particular training dataset. However, it required refitting the model for each individual subject. Therefore, the generalizability and predictive potential indicated by the external CV were additionally assessed by validating the primary PLS-DA models, built with all 26 donors, on four blind samples that had been separated from the training dataset at the beginning of the statistical analysis. The Y values for all spectra were predicted by building PLS-DA models using the same number of optimal latent variables as was determined by CV. For each class prediction and its corresponding PLS-DA classifier, a threshold was determined. The trained threshold of Y predictions identified during the external CV was used to classify the gender or race of all test samples.
During the testing, the features extracted from spectra were compared against the trained threshold to assess the gender and race assignment. The test samples included a diversity of gender and race (1 CA male, 1 AA male, 1 CA female, 1 HI female). This step was used to examine the prediction performance of the method and models, as well as to confirm the models' integrity when analyzing external, unknown bloodstains.
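The donor-wise external cross-validation described above, in which all spectra of one donor are held out at a time, can be sketched as an index generator. The function name and donor identifiers are hypothetical, and the model fitting and prediction inside each split are left out.

```python
def leave_one_donor_out(donor_ids):
    """Yield (held_out_donor, train_indices, test_indices) once per unique donor."""
    for donor in sorted(set(donor_ids)):
        test = [i for i, d in enumerate(donor_ids) if d == donor]
        train = [i for i, d in enumerate(donor_ids) if d != donor]
        yield donor, train, test


# Hypothetical dataset: six spectra belonging to three donors.
donor_ids = ["d1", "d1", "d2", "d2", "d2", "d3"]
splits = list(leave_one_donor_out(donor_ids))
```

Splitting by donor rather than by spectrum means no donor contributes spectra to both the refit model and the set it is asked to predict.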
Discrimination of blood donors is possible based on differences in concentration of blood components between groups (Virkler et al., “Raman Spectroscopic Signature of Blood and its Potential Application to Forensic Body Fluid Identification,” Anal. Bioanal. Chem. 396(1):525-534 (2010), which is hereby incorporated by reference in its entirety). The main approach of this example was to develop ATR-FTIR spectroscopy as an analytical technique capable of detecting these changes. Although all blood infrared spectra looked very similar, as can be seen in
Spectral Analysis and Training Dataset
Blood infrared spectra (
aKanagathara et al., “FTIR and UV-Visible Spectral Study on Normal Blood Samples,” Int. J. Pharm. Biol. Sci. 1: 74-81 (2011); Elkins, K. M., “Rapid Presumptive “Fingerprinting” of Body Fluids and Materials by ATR FT-IR Spectroscopy,” J. Forensic Sci. 56(6): 1580-1587 (2011), which are hereby incorporated by reference in their entirety.
Collected blood infrared spectra were assigned in two ways for gender and race differentiation. After spectra were preprocessed and analyzed by GA for variable selection, they were used to build a PCA model to detect outliers and exclude them from the classification process. For this purpose, Hotelling T2 and Q residuals analyses were used (Varmuza et al., Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press (2008), which is hereby incorporated by reference in its entirety). This made it possible to limit the dataset to the spectra which were not influenced by divergence within the dataset. Subsequently, selected spectra were used to construct a new set, which was divided into a training dataset (26 donors) and an external, blind, dataset (4 donors). Different models were created based on the training set for classification processes. PLS-DA was chosen as the most applicable model for use in predictions, which was determined based on the results of internal prediction performance obtained by the models, as well as the lowest error values of created models. The next step of this approach was validation tests.
Validation Tests for Gender
The first step of validation was external CV. The spectra from one subject were removed from the original calibration dataset, a new PLS-DA model was built, and the previously excluded spectra were used to test the new model. The process was repeated so that each subject appeared once in the validation set, and the class labels of all subjects were thereby predicted. Based on these predictions for all 26 donors (13 per class of male and female) contained in the training dataset, the AUC and number of misclassifications were obtained. Prediction performance of the PLS-DA models was measured by ROC (
For the next validation step, the model was built on the original 26 samples using an optimal number of components. Next, the class labels of all blind spectra of four donors were predicted using the model and trained threshold. Of the 39 spectra collected from the four blind samples, 36 were classified correctly (
Validation Tests for Race
Spectra separated into racial groups (10 donors per class) were treated in the same manner as the dataset used for external CV in gender predictions. Based on these predictions for all 26 donors included in the training dataset, the AUC and number of misclassifications were obtained. Prediction performance of the PLS-DA models was assessed by ROC (
The class labels of all blind spectra of four donors were predicted in the same manner as for the gender dataset. Again, 36 of 39 blind spectra were correctly classified, and all donors were classified correctly. Using this approach, the versatility and high performance of the optimized PLS-DA models were also demonstrated for determination of race based on FTIR spectra (
FTIR spectroscopy has already been utilized in forensic laboratories for drug analysis. Applying this approach to other forms of evidence would be very valuable, offering cost reduction among other benefits. Its nondestructive nature is one of its most desirable features in forensic investigations, since examined traces can still be subjected to further analysis. This also helps address the problem of the minuscule size of trace evidence found at crime scenes. The method does not require protein extraction, like most current forensic methods for bloodstain analysis, in order to gain information about the donor. In this study, infrared spectra were collected from 30 donors in total. PLS-DA classification models were successfully utilized for discrimination between genders, which resulted in a 91% probability of correct donor classification, and between races, which resulted in a 94% average probability of correct donor classification, based on external CV. The main classification models were also validated with four external blind samples, giving 100% accuracy for each donor's classification. The combination of FTIR spectroscopy with chemometrics showed a strong ability to discriminate human gender and race from dry blood traces in forensic analysis. Portable FTIR instruments facilitate investigation and allow results to be obtained at a crime scene.
In order to investigate the ability to differentiate gender and race using Raman spectra collected from saliva samples, a proof of concept study was designed and implemented. Saliva samples from 60 donors were analyzed by Raman spectroscopy and chemometrics. Two SVM-DA models were built using preprocessed spectra. The models classified the spectra according to race (Caucasian, Black, or Asian) and gender (male or female). The average accuracy of the race differentiation model was 65.4%, and the average accuracy of the gender differentiation model was 82.8%.
Experimental Work
Saliva samples were purchased from Biological Specialty Corp. and Lee BioSolutions. The sample population included saliva from 60 donors, with an equal number of male and female subjects. The 60 donors represented three racial groups: Caucasian (n=20), Black (n=20), and Asian (n=20). All samples were prepared by depositing 10 μL onto a microscope slide covered with aluminum foil and air dried overnight. Samples were analyzed with a Renishaw inVia Raman spectrometer, equipped with a Leica optical microscope and a PRIOR automatic stage. The samples were irradiated with a 785 nm excitation laser and spectra were collected with a 50× long range objective in the range of 300-1800 cm−1. A 60 μm×60 μm area was mapped to collect 25 spectra per sample.
Analytical Work
After collection, the data was imported into the MATLAB workspace. Here, spectral datasets for each sample were preprocessed for visualization and further data analysis. Initial preprocessing steps included assigning classes (race, gender, etc.), baseline correction using an air-PLS algorithm (Zhang et al., “Baseline Correction Using Adaptive Iteratively Reweighted Penalized Least Squares,” Analyst 135(5):1138-1146 (2010); which is hereby incorporated by reference in its entirety), removing spectra that exhibited cosmic rays, and interpolating axes so that all datasets have the same axis scale. Two calibration datasets were built for gender and race differentiation, using the preprocessed spectra, each containing 1,357 spectra.
After the initial preprocessing was completed, statistical modeling was carried out with the PLS Toolbox. Additional preprocessing steps, such as normalization and mean centering, were incorporated immediately before modeling. Two SVM-DA models were built for classification based on the two calibration datasets. Both models were internally cross-validated by Venetian blinds.
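A Venetian blinds split assigns every k-th sample to the same fold. The sketch below shows that index pattern under the assumption of one sample per blind. The function name is hypothetical, and this is not the PLS Toolbox routine.

```python
def venetian_blinds(n_samples, n_splits):
    """Yield (train_indices, test_indices); fold k tests samples k, k+n_splits, ..."""
    for k in range(n_splits):
        test = list(range(k, n_samples, n_splits))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test


# Hypothetical example: 10 samples split into 3 interleaved folds.
folds = list(venetian_blinds(10, 3))
```

When consecutive spectra come from the same donor, an interleaved split of this kind leaves spectra from each donor on both sides of the split, which is one way an internal cross-validation differs from a donor-wise external validation.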
Results
Race Determination
Saliva is a very heterogeneous body fluid, consisting of water, mucus, electrolytes, enzymes, and antibacterial compounds (Virkler et al., “Raman Spectroscopic Signature of Blood and its Potential Application to Forensic Body Fluid Identification,” Anal. Bioanal. Chem. 396(1):525-34 (2010), which is hereby incorporated by reference in its entirety). This complexity is reflected by the Raman spectra of saliva, showing contributions from several different chemical species. Glycoproteins from the mucus are made evident by the amide I peak at 1653 cm−1 and the aromatic breathing peak at 1002 cm−1. Low frequency peaks, 323-521 cm−1, are due to polysaccharides. The averaged spectra for each donor and race can be seen in
All of the collected spectra were combined to create the calibration dataset, from which an SVMDA model was built, with 10-fold cross validation splits. A confusion matrix, displaying the cross-validated predictions and accuracy, is shown in Table 8. A total of 469 spectra, out of 1,357, were misclassified. The overall accuracy of the model is 65.4%.
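The overall accuracy quoted above follows directly from the two counts given in the text, as this short check shows. The function name is hypothetical.

```python
def accuracy_from_errors(n_total, n_misclassified):
    """Fraction of spectra classified correctly."""
    return (n_total - n_misclassified) / n_total


# Counts taken from the text: 469 of 1,357 spectra were misclassified.
race_accuracy = accuracy_from_errors(1357, 469)
print(f"{race_accuracy:.1%}")  # prints 65.4%
```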
The class predictions are visualized two ways in
Gender Determination
The ability to differentiate saliva samples according to donor gender was also investigated using chemometrics and the same dataset used for race differentiation. As described above, saliva is a heterogeneous body fluid and its Raman spectra indicated the presence of several biochemical components. The average Raman spectra from female and male donors are shown in
Just as with the race differentiation example, an SVMDA model was built using the calibration dataset, with 10-fold cross validation splits. Table 9 shows the confusion matrix for this calibration model. Out of all 1,357 spectra used to build the model, only 233 were misclassified. The cross-validated sensitivity and specificity rates of the model are 88.4% and 77.4%, respectively.
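For a two-class model, sensitivity and specificity are computed from the four cells of the confusion matrix. The sketch below uses hypothetical counts, not the values of Table 9, and it makes no assumption about which gender is treated as the positive class, since the text does not state this.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
```

The number of misclassified spectra is `fn + fp`, so the two rates and the class sizes together determine the overall error count.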
The prediction results of the model are shown in
Two preliminary SVMDA models built on spectra collected from 60 saliva donors were constructed to differentiate donor gender and race. The donor population consisted of Caucasian, Black, and Asian donors, with an equal number of males and females. The average cross-validated accuracy of the preliminary race-based calibration model is 65.4%. The cross-validated sensitivity and specificity rates of the preliminary gender-based model are 77.4% and 88.4%, respectively. These results were all obtained through internal cross-validation. None of the models described above were submitted to external validation, a key step in method development.
This study looked into the potential to use Raman spectroscopy as a tool to determine an individual's race and gender using a sample of sweat. Raman spectra were collected from 20 sweat donors, and used to build two chemometric classification models. The cross-validated PLS-DA model built to differentiate race had an average sensitivity and specificity of 98.7 and 99.4%, respectively. The SVM-DA model that differentiated the genders of sweat donors had a 93.7% cross-validated sensitivity rate, and a 98.6% cross-validated specificity rate.
Experimental Work
A total of 20 sweat samples were purchased from Lee Biosolutions. The donor population consisted of 10 Caucasian, 7 Black, 2 Hispanic, and 1 Asian donor. The gender breakdown was 13 males, and 7 females. Sweat samples were prepared by depositing 10 μL onto an aluminum foil covered microscope slide, and allowed to dry overnight. Samples were analyzed via Raman mapping, with a 785 nm excitation laser and a 50× objective. Spectra were collected in the range of 300-1800 cm−1, with three 10-second accumulations. Two mapping procedures were utilized. First, three areas on the sample were mapped, each containing 35 points/spectra, for a total of 105 spectra. In the interest of time efficiency, this was changed to one map consisting of 117 points/spectra. Because none of the irradiation, excitation, or collection parameters were altered, the spectral information obtained remained constant.
Analytical Work
Spectra were imported into MATLAB for preprocessing, and used to build models with the PLS Toolbox. First, spectra were assigned class labels, such as race, and gender. Next, spectra were truncated to reduce the spectral range to 500-1700 cm−1. Lastly, spectra were filtered through PCA modeling to exclude outliers. A PCA model was constructed using all of the collected spectra, and those with high Hotelling T2 scores outside of the 95% confidence interval were excluded from the calibration dataset.
The preprocessed calibration dataset was then used to build chemometric models to differentiate the spectra on the basis of donor race or gender. Two SVM-DA calibration models were built. Final preprocessing steps executed during the model calibration phase included smoothing, normalization, and mean centering.
Results
Race Determination
The first chemometric model constructed attempted to separate spectra from donors of differing races. Five latent variables were used to separate the four groups.
Gender Determination
The same calibration dataset was used to build another SVM-DA model to differentiate donors by gender.
The present study sought to explore the potential to use Raman spectroscopy to identify a donor's race and gender using their sweat. The SVM-DA model built to differentiate race had an average cross-validated accuracy rate of 98.7%, while the SVM-DA model built to differentiate gender had an accuracy rate of 96.2%. The results reported in the present application do not include external validation of the models, a key step in method development.
The main objective of this study was to develop a new method that can differentiate Raman spectra from dried semen traces based on the race of the donors. Raman spectra were acquired from human semen samples, from donors of three races (Caucasian, Black, and Hispanic). The spectra in the original dataset showed significant variation within and between donors, demonstrating semen's heterogeneous nature. Multivariate statistical analysis of Raman spectra was employed on the collected data to evaluate composition of semen samples, which varies with race. A PCA model was used to remove outliers (through Q residuals and Hotelling T2). ANN classification models reveal that the developed methodology has the definite potential to differentiate races.
Experimental Work
A total of 36 semen samples were acquired from Bioreclamation, LLC for this project. The population included 12 Caucasian, 12 Black, and 12 Hispanic donors. Samples were prepared by depositing 10 μL of semen onto an aluminum foil covered microscope slide and allowed to dry overnight. Samples were then analyzed the following day using a Renishaw inVia Raman spectrometer, equipped with a Leica microscope and PRIOR automatic stage. Data was collected by a 785 nm excitation laser in the range of 300-1800 cm−1. Each semen sample was mapped to collect 64 spectra across a 2 mm2 area, where each spectrum was the result of seven 10-second accumulations.
Analytical Work
The experimental spectra were imported into MATLAB. All of the spectra were labeled according to donor number and race. The PLS Toolbox and the R project were used for spectral preprocessing and modeling. Spectra were preprocessed by baseline correction using an airPLS algorithm. All spectra were normalized by total area and mean centered. PCA models were created to detect outliers, i.e., spectra with the most abnormal Hotelling T2 and Q residuals. The dataset was then split into training and test data according to the donors that were randomly selected for testing at the beginning of the statistical analysis. Because of the small sample size, the data could not be partitioned into large, similarly sized training and test datasets. Thus, the challenge was to find a reasonable balance between training and test dataset size. A slight increase in the prediction error for the test dataset might be acceptable in order to minimize the variability of the error estimate (to achieve a stable model). After careful consideration, the test dataset size was set at 3 donors. The training data was used to build three binary models and one tertiary model for classification and discrimination between all three races using the ANN approach.
After creating a test dataset by moving spectra from three donors from a set of available data into an independent data set (never to be touched during cross-validation), the remaining dataset (33 donors in tertiary model, or 21 donors in binary models) was used to build the classification models. For unbiased assessment and to rule out the effect of the data set size, all four original training datasets were externally cross-validated. For each classification model, the original training datasets were randomly split into training (75%) and validating (25%) data subsets in 20 repetitions. The R Neuralnet package (Fritsch et al., “Neuralnet: Training of Neural Networks.” R package version 1.31 (2010), which is hereby incorporated by reference in its entirety) was used to design and train all models of artificial neural networks. Different network topologies have been tried in an attempt to find the optimum network architecture. Among them, the resilient backpropagation algorithm showed the best accuracy for the validation sets. Optimal network architecture was determined by varying the number of hidden layers and number of neurons in each layer between 10 and 600. For each classification model, its performance was reported and averaging was used to obtain an aggregate measure from these models. Thus, CV results are reported as the performance over all validation sets. Generalizability and predictive potential given by the external CV was additionally assessed by validation of the models with the test dataset containing the three donors that were set aside at the beginning of modeling. This step was used to make sure the model trained on the calibration data is generalizable and will correctly classify external, unknown, spectra.
Results
Despite the high heterogeneity observed both within and between donors, the mean spectra of semen from each group were found to be very similar: no significant spectral differences were identified, and each mean spectrum showed the characteristic bands typical of semen.
The modeling process was carried out in six steps. First, the original dataset of 36 donors was divided into a training dataset of 33 donors, and a testing dataset of 3 donors. The test donors were set aside until the final step of validation. Second, the training dataset of 33 donors was divided further in an effort to avoid overfitting and to build a robust ANN model. The training dataset was randomly split so that the bulk of the spectra (75%) was put into a training data subset, and the remaining spectra (25%) were put into a testing data subset. Third, the training data subset was used to calibrate an ANN model, which was then validated with the testing data subset. Steps 2 and 3 were repeated several times, each time with both a new random split and a new architecture scheme, until the ANN model parameters were optimized. Fourth, the “optimal” model architecture was cross-validated 20 times with new training and testing data subsets. The results from all 20 repetitions were recorded and used to make an average confusion matrix for the cross-validation phase. Fifth, the original training dataset, created in the first step, was used to train the final ANN model according to the optimal architecture scheme. The sixth and final step was external validation. The original testing dataset, containing the 3 donors set aside at the beginning, was used to externally validate the final ANN model.
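The data partitioning in the six steps above can be sketched as follows. This is a minimal illustration of the splitting logic only; ANN training is omitted. The helper names, the random seed, and the figure of 10 spectra per donor are hypothetical and are not taken from the study.

```python
import random


def hold_out_donors(donor_ids, n_test, rng):
    """Step 1: set aside n_test whole donors for the final external validation."""
    shuffled = list(donor_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


def repeated_random_splits(items, n_repeats, train_fraction, rng):
    """Steps 2-4: yield (training subset, testing subset) from fresh random splits."""
    n_train = int(len(items) * train_fraction)
    for _ in range(n_repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_donors, test_donors = hold_out_donors(range(1, 37), 3, rng)

# Hypothetical: each training donor contributes 10 spectra, identified here
# by (donor, index) pairs rather than real spectral data.
train_spectra = [(donor, i) for donor in train_donors for i in range(10)]

# 20 repetitions of a 75%/25% split at the spectrum level, as in step 4.
splits = list(repeated_random_splits(train_spectra, 20, 0.75, rng))
```

The held-out donors are removed before any spectrum-level splitting, so no spectrum from a test donor can reach the training or cross-validation subsets.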
In order to build binary models, all of the donors from one race were removed from the original dataset, and then all six modeling steps were carried out exactly as outlined above. This was done for three binary models, Caucasian vs. Black, Caucasian vs. Hispanic, and Black vs. Hispanic. A total of four final ANN models were built and externally validated.
The results from the cross-validation phases of all four models are shown in Table 12. During the cross-validation phase, the tertiary model achieved 89% accuracy in its predictions. For the binary models, the Caucasian vs. Black model achieved 96% accuracy, the Caucasian vs. Hispanic model achieved 94%, and the Black vs. Hispanic model achieved 91%.
The confusion matrices from the external validation of all four models are reported in Table 13. After external validation, the tertiary model achieved 82% accuracy in its predictions. For the binary models, the Caucasian vs. Black model achieved 98% accuracy, the Caucasian vs. Hispanic model achieved 99%, and the Black vs. Hispanic model achieved 80%. A threshold of 50% was then used, such that if 50% or more of a particular donor's spectra are classified to a single race, the donor is ultimately classified to that race. Using this threshold, all three external validation donors were classified correctly by all four models.
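The 50% donor-level threshold can be illustrated with a short sketch. The function name and the example labels are hypothetical. Returning `None` when no class reaches the threshold is an assumption, since the text does not say how such a donor would be handled.

```python
from collections import Counter


def classify_donor(spectrum_predictions, threshold=0.5):
    """Assign a donor to the class predicted for at least `threshold` of its spectra.

    Returns None if no class reaches the threshold. On an exact 50/50 tie
    between two classes, the class encountered first is returned.
    """
    if not spectrum_predictions:
        raise ValueError("no spectrum-level predictions supplied")
    label, count = Counter(spectrum_predictions).most_common(1)[0]
    if count / len(spectrum_predictions) >= threshold:
        return label
    return None


# Hypothetical donor: 6 of 10 spectra predicted as one class.
donor_call = classify_donor(["Caucasian"] * 6 + ["Black"] * 4)
```

A donor can therefore be classified correctly even when a sizeable minority of its individual spectra are misclassified, which is why the donor-level results can exceed the spectrum-level accuracies.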
Raman spectroscopy was used to analyze human semen samples and a new analytical approach was developed to determine a donor's race based on the spectroscopic data obtained. ANN models were used to classify each spectrum as belonging to one of the three races studied, Caucasian, Black, or Hispanic. After extensive cross-validation, the accuracy scores for three binary models, Caucasian vs. Black, Caucasian vs. Hispanic, and Black vs. Hispanic, were reported as 96%, 94%, and 91%, respectively. After external validation, these rates were 98%, 99%, and 80%. The tertiary model achieved an accuracy rate of 89% during cross-validation, and a rate of 82% during external validation. Finally, applying a threshold of 50% to the spectral predictions resulted in all three external validation donors being classified correctly. This was true for the tertiary model, as well as all three binary models.
The intention of this study is to develop a method capable of differentiating donors' races based on Raman spectra collected from dry human menstrual blood. All instrumental parameters were selected based on preliminary studies. PLS-DA and SVM-DA were chosen to construct simple classification models using a training dataset containing Raman spectra from five Caucasian and ten African American donors. One additional PLS-DA and SVM-DA model was built using only specific peaks selected by GA analysis. The number of components for each model was selected by choosing a local minimum of total data variance captured using a scree plot. All models were internally cross-validated and three of the four were externally validated.
Experimental Work
All menstrual blood samples were kept frozen until sample preparation. For each blood sample, approximately 10 μL was placed on an aluminum-covered microscope slide and allowed to dry overnight prior to analysis. A Renishaw inVia confocal Raman spectrometer and a PRIOR automatic stage were used for data collection for all menstrual blood samples. The instrument was calibrated with a silicon standard before all measurements. Spectra were accumulated with a 20× long working distance objective and 785 nm excitation laser in the spectral range of 300-1800 cm−1. Laser power at the sample was approximately 4.0 mW. A Raman map consisting of 15 spectra was collected from each of the samples. WiRE software version 3.2 was used to operate the instrument.
Analytical Work
All data preparation and construction of statistical models were performed with the PLS Toolbox 7.5.3 (Eigenvector Research, Inc., Wenatchee, Wash.) operating in MATLAB and Statistics Toolbox Release R2012b (Mathworks, Inc., Natick, Mass.). For each sample, the 15 spectra were smoothed with a second-order polynomial and filter width of 15, baseline corrected with a 6th order polynomial, and normalized by total area. After the preprocessing steps, the spectra were mean centered before models were calculated.
In order to eliminate the non-informative and redundant variables from the datasets, GA, an evolutionary feature selection method, was applied. GA considers all of the variables within a Raman spectral dataset and their significance, or contribution, to the discrimination process. This allows for a reduction of the original Raman spectra to a smaller subset(s) of wavenumbers in order to improve prediction performance. The technique is especially helpful in cases when the spectral dataset consists of hundreds or thousands of variables. A detailed explanation of GA for variable selection and its applications was published by Niazi and Leardi (Niazi et al., “Genetic Algorithms in Chemometrics,” Journal of Chemometrics 26(6):345-351 (2012), which is hereby incorporated by reference in its entirety). The population size was set to 72, the maximum number of generations was set to 100, the breeding crossover rule was set to double crossover, and the default mutation rate was used (0.005). Finally, a total of 200 runs were performed.
Two PLS-DA models were constructed, one for 214 preprocessed spectra (11 outliers removed) and another using the genetic algorithm selected peaks for all 225 spectra. Two SVM-DA models were also constructed, one using the 225 total spectra and the other using the genetic algorithm selected peaks for all 225 spectra. All models were internally cross-validated using the venetian blinds method and three were externally cross-validated using a donor-wise leave-one-out approach.
Results
For each menstrual blood sample, a Raman spectral map of 15 points was collected. The spectra were preprocessed by smoothing, baseline correction and normalization by total area. They were also averaged by donor and race to study the differences within the peaks. The training dataset for the first PLS-DA model consisted of 214 preprocessed spectra. The 225 total preprocessed spectra used for model building are shown in
The first PLS-DA model was constructed using a training dataset containing only 214 of the 225 total preprocessed spectra. Eleven of the 225 spectra were outside the 95% confidence interval on the Hotelling T2 and Q Residuals scores plot and were removed from the original training dataset to improve the results. The model was built using four LVs. The cross-validated prediction results for the African American class for the first PLS-DA model can be seen in
An external validation was made for the model by taking out one single donor, rebuilding the model and making predictions for the donor removed. All donors were removed one by one and the model was rebuilt each time. The true positive (TP) and false negative (FN) results for race predictions are displayed in Table 15. The average TP and FN values for the donor-wise external validation for the first PLS-DA model (built with 214 spectra) were 0.64 and 0.37 for the African American class and 0.33 and 0.67 for the Caucasian class, respectively.
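The donor-wise leave-one-out procedure described above can be sketched as follows. Model fitting is omitted and the function names are hypothetical. The sketch only shows how folds are formed and how a per-donor true positive (TP) fraction could be computed, with the false negative (FN) fraction as its complement.

```python
def leave_one_donor_out(donor_of_spectrum):
    """Yield (held_out_donor, train_indices, test_indices), one fold per donor."""
    for donor in sorted(set(donor_of_spectrum)):
        test = [i for i, d in enumerate(donor_of_spectrum) if d == donor]
        train = [i for i, d in enumerate(donor_of_spectrum) if d != donor]
        yield donor, train, test


def true_positive_fraction(true_label, predicted_labels):
    """Fraction of a held-out donor's spectra assigned to the donor's true class."""
    hits = sum(1 for p in predicted_labels if p == true_label)
    return hits / len(predicted_labels)


# Hypothetical layout: 3 donors with 15 spectra each.
donor_of_spectrum = [1] * 15 + [2] * 15 + [3] * 15
folds = list(leave_one_donor_out(donor_of_spectrum))
```

In each fold the model would be rebuilt on `train_indices` and applied to `test_indices`, so every prediction is made for a donor the model has never seen.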
The second PLS-DA model was constructed using the GA selected peaks. The model was built using two LVs. The cross-validated prediction results for the African American class for the second PLS-DA model can be seen in
The first SVM-DA model was constructed using a training dataset containing 225 preprocessed spectra. The model was built using two LVs. The African American class prediction probability plot for this model can be seen in
An external validation was carried out for the first SVM-DA model using the same principle described above for the PLS-DA model. The results for race predictions, TP and FN assignments are displayed in Table 18. The average TP and FN values for the donor-wise external validation for the first SVM-DA model were 0.69 and 0.31 for the African American class, and 0.28 and 0.72 for the Caucasian class, respectively.
The second SVM-DA model was constructed using only the specific peaks selected by the GA analysis. The model was built using two LVs. The African American class prediction probability plot for this model can be seen in
Four different statistical models were constructed using a training dataset of Raman spectral data collected from menstrual blood from ten African American donors and five Caucasian donors. The models constructed with the entire spectral range showed better internal classification when compared to the models constructed using the GA selected peaks. Furthermore, the models were tested via external validation of individual donors, which were excluded from the training dataset one by one. The PLS-DA model built with GA selected peaks was not subjected to the external validation because it did not show promising results for the internal classification. The results obtained for the external validation of the PLS-DA and SVM-DA models constructed with all preprocessed spectra were similar to each other. However, the PLS-DA model showed better sensitivity and specificity for the Caucasian class while the SVM-DA model showed better results for the African American class. With the number of samples analyzed, and the parameters chosen for using Raman spectroscopy combined with statistical modeling, it was not possible to sufficiently differentiate between menstrual blood from African American and Caucasian donors.
ATR-FTIR spectroscopy was applied to distinguish between genders and races from human blood. The sample collection included donors of both genders, and Caucasian, Black and Hispanic races. A calibration dataset of thirty donors was used to build models. The final SVM-DA models classified donors with 87% accuracy for gender and for race.
Experimental Work
The examination was performed on blood collected from 30 donors in total. The collection included 16 males and 14 females with 10 donors per race. For all the blood samples, a 20 μL drop was deposited onto a microscope slide and allowed to dry overnight. Each sample was scraped off the glass slide and placed onto the instrument's crystal for data collection. A PerkinElmer Spectrum 100 FT-IR spectrometer with a diamond/ZnSe crystal was used for analysis. Spectra were recorded in the range 600-4000 cm−1 with a spectral resolution of 4 cm−1. Prior to placing the sample on the crystal for each measurement, a background check was performed. Samples were scanned 10 times, with 32 accumulations per spectrum.
Analytical Work
For data treatment and advanced statistical analyses, R (The R project. “R: A language and environment for statistical computing,” R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, [Available from: www.R-project.org] (February 2016); package pROC, Robin X., et al., “pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves,” BMC Bioinformatics 12(1):77 (2011)) and MATLAB software (MATLAB and Statistics Toolbox Release R2012b (Mathworks, Inc., Natick, Mass.)) were used. For all 300 collected spectra, conversion from transmission to absorbance (log (1/T)), second-order derivative, normalization by total area, and mean centering were applied as pretreatment. After these preprocessing procedures, statistical analysis was performed using the PLS Toolbox 7.5.3 (Eigenvector Research, Inc., Wenatchee, Wash.). Spectral fingerprint regions were identified via GA analysis. PCA and SVM-DA were used to distinguish the race and gender of different donors. After the necessary preprocessing steps, multivariate outlier removal was carried out through PCA.
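The transmission-to-absorbance conversion, log (1/T), can be illustrated with a few lines of Python. Two assumptions are made here that the text does not state: the logarithm is taken as base 10 (the usual convention for absorbance), and transmittance is given as a fraction between 0 and 1 rather than as a percentage. The function name is hypothetical.

```python
import math


def transmittance_to_absorbance(transmittance):
    """Convert fractional transmittance values T to absorbance A = log10(1/T)."""
    absorbance = []
    for t in transmittance:
        if t <= 0:
            raise ValueError("transmittance must be positive")
        absorbance.append(math.log10(1.0 / t))
    return absorbance


# Hypothetical values: 100%, 10% and 1% transmission.
example = transmittance_to_absorbance([1.0, 0.1, 0.01])
```

With these inputs the absorbance values are approximately 0, 1 and 2, showing that each tenfold drop in transmitted light adds one absorbance unit.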
Results
In order to optimize the prediction results of SVM, the GA was again used to progressively reduce the wavenumber selection and the number of latent variables to be included. The population size was set to 70, the maximum number of generations was set to 100, the breeding crossover rule was set to double crossover, and the default mutation rate was used (0.005). Finally, a total of 100 runs were performed.
SVM modeling was applied to distinguish between races using the input features selected by GA from the original infrared spectra. For this study, a Radial Basis Function (RBF) kernel was used, and its parameters were optimized by a combined approach of Venetian blind cross-validation (five samples out) and a systematic grid search. To evaluate the subject-independent accuracy of the SVM-DA models, all data from all subjects were divided subject-wise, so that spectra from one subject were set aside from the training set and served as a test set. The model was refit to each training set and validated in a blind manner on the corresponding test set. Validation results are reported as the average performance over all test sets. For all 30 subjects, the probabilities of each spectrum and subject belonging to each class were recorded. ROC curves and AUC values were computed using SVM models to estimate the discriminatory power. Note that in the case of race differentiation, ROC analysis produced three ROC curves, one for each of the three classes compared to the others by binary models.
The principle of ROC analysis was used to assess the diagnostic accuracy of the SVM models in external donor-wise cross-validation. The AUCs of the ROC curves were estimated by the trapezoidal method of integration, with the corresponding 95% CIs evaluated by the method described by DeLong et al. (DeLong et al., “Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach,” Biometrics 837-845 (1988), which is hereby incorporated by reference in its entirety). Results of ROC analysis for race differentiation are depicted in
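For illustration, the trapezoidal estimate of the area under an ROC curve can be sketched in plain Python. This is not the pROC implementation used in the study, and it does not compute the DeLong confidence intervals. The function names and the example scores are hypothetical. Labels are assumed to be 1 for the class of interest and 0 for all others.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for a score sweep.

    Spectra with tied scores are added together so that ties produce a
    single diagonal segment.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def trapezoidal_auc(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical class probabilities for four spectra.
auc = trapezoidal_auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

For the three-class race problem, the same calculation would be run three times, once per class against the other two, matching the one-versus-rest ROC curves described in the text.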
A new technique has been applied to discriminate race and gender from human blood traces. ATR-FTIR with chemometrics has successfully distinguished between donors. Based on the two models that were built for gender and race differentiation, 26 of the 30 donors were classified correctly. Statistical parameters, as well as sensitivity and specificity values, were calculated for each model. The initial results show promise and validation testing is underway. This study demonstrates the great potential of FTIR spectroscopy combined with advanced statistics for forensic analysis of biological stains. To strengthen the results and validate the models, a blind test with unknown blood samples should be performed in future work.
Although preferred embodiments have been depicted and described in detail herein, it will be apparent to those skilled in the relevant art that various modifications, additions, substitutions, and the like can be made without departing from the spirit of the invention and these are therefore considered to be within the scope of the invention as defined in the claims which follow.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/199,079, filed Jul. 30, 2015, which is hereby incorporated by reference in its entirety.
This invention was made with government support under Award No. 2011-DN-BX-K551 awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The government has certain rights in this invention.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2016/044807 | 7/29/2016 | WO | 00

Number | Date | Country
---|---|---
62199079 | Jul 2015 | US