The inventive technology includes compositions, devices, processes, methods, and systems directed to rapid and accurate optical fingerprinting, identification, and sequencing of amino acid sequences and other macromolecules. Additional inventive aspects include novel systems and methods for bioinformatics algorithms capable of using high-throughput k-mer content for rapid, broad-spectrum identification of genetic biomarkers.
Single-molecule sequencing and mapping of molecular variations in polynucleotides, such as DNA and RNA, and in polypeptides can lead to significant improvements in precise diagnosis and treatment of a variety of diseases. First, sequencing of low-copy-number cells without amplification could prove vital for pathogen identification, prenatal care, and diagnosis of circulating tumor cells. Second, an integrated platform capable of single-molecule proteome, genome, transcriptome, and epigenome sequencing could lead to rapid and accurate disease biomarker identification. The lack of such studies at the single-cell level leads to extended controversies and an absence of clear evidence for molecular variations, sometimes at both the genetic and enzymatic levels, as causative agents of disease. An example of such impeded progress is the use of epigenetic markers for cancer identification. While several years of research have led to the identification of methylation as an epigenetic marker for cancer cells, its detection requires a separate and tedious bisulfite sequencing process, which suffers from issues such as incomplete conversion, DNA degradation, and an inability to distinguish between different 5-methylcytosine derivatives. Interconversion between 5-methylcytosine and 5-hydroxymethylcytosine, together with the lack of a direct identification method (current techniques use antibody-based immunofluorescence and immunohistochemistry approaches, immuno-dot blots, and liquid chromatography coupled with mass spectrometry), has prevented its confirmation as a biomarker and a better understanding of its role in stem cells and tumorigenesis. Further, identification of other new molecular markers and their role in cancer also requires protracted and indirect studies. Even for less prevalent or “rare” diseases (affecting fewer than 200,000 patients each year in the U.S.), in the past 25 years, only about 50% of the 7,000 rare monogenic disease-causing genes have been identified.
Together, this affects millions of patients who lack an accurate diagnostic method for identification and therapeutic treatment.
Unfortunately, current sequencing techniques rely on expensive and labor-intensive enzymatic amplification of samples, which introduces amplification bias and provides a statistically significant ensemble-averaged sequence that often fails to capture population heterogeneity and other information that can be vital for medical intervention. While studies in single-cell genomics have outlined the potential of single-molecule sequencing for medicine and non-invasive clinical applications, these studies involved enzymatic amplification of DNA and subsequent sequencing using traditional sequencing tools. To assess the sensitivity required for non-amplified samples, consider that a single prokaryotic cell (~10⁻¹⁵ liter) with one copy of DNA corresponds to a concentration of (1/(6.023×10²³))/10⁻¹⁵ mol/L, which is on the order of 1 nM, with a similar concentration magnitude for low-copy-number variants, compared with ~1 μM concentrations of other prevalent enzymes. Such low concentrations and large differences in magnitude pose a challenge for any amplification or statistically significant analysis using traditional sequencing tools.
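The concentration estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch using the cell volume and Avogadro value quoted in the text; the function and variable names are illustrative only.

```python
# Back-of-the-envelope check of the single-copy concentration quoted in the text.
AVOGADRO = 6.023e23      # molecules per mole (value as quoted in the text)
CELL_VOLUME_L = 1e-15    # approximate volume of one prokaryotic cell, in liters


def copies_to_molar(copies: float, volume_l: float = CELL_VOLUME_L) -> float:
    """Convert a number of molecules in a given volume to mol/L."""
    return copies / AVOGADRO / volume_l


# One DNA copy per cell, expressed in nM.
single_copy_nM = copies_to_molar(1) * 1e9

# Number of molecules implied by a ~1 uM enzyme concentration in the same volume.
enzyme_copies = 1e-6 * CELL_VOLUME_L * AVOGADRO

print(f"one DNA copy: {single_copy_nM:.2f} nM")
print(f"1 uM enzyme: about {enzyme_copies:.0f} molecules per cell")
```

Under these assumptions, one copy corresponds to roughly 1.7 nM, while a 1 μM species corresponds to several hundred molecules in the same volume, which illustrates the difference in magnitude noted in the text.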
To address these challenges, several recent efforts have been directed towards developing a new single-molecule sequencing method, using easily observable molecular fingerprints and a high-throughput and inexpensive technique. Optical sequence identification has emerged as an important candidate for a next-generation inexpensive and high-throughput sequencing technology and is potentially capable of identifying molecular sequences and variations in single molecules using their vibrational signatures. This approach also creates the potential for a single platform for combined proteomics, genomics, transcriptomics, and epigenomics. As such, there exists a need for a system for the optical sequence identification of single DNA, RNA and peptide molecules using individual SERS measurements and a molecular identification algorithm rooted in machine learning.
Building on the above-described sequencing methods, in the push for precision medicine, there is an increasing demand for inexpensive, non-specific assays capable of broad-spectrum diagnostics, where a single test can rapidly screen an array of biomarkers. One immediate application of such a technology is to address the growing threat of antibiotic resistance, a public health crisis that affects nearly two million people in the U.S. annually. Rapid, affordable identification of drug resistance in clinically relevant microbial strains is vital for prescribing appropriate treatment plans to patients, reducing mortality rates, and limiting the development of further resistance. Current resistance diagnostics and profiling assays are often performed only after initial antibiotics fail. Most of these assays rely on cell culturing, PCR amplification, and microarray analyses. Not only do these tests require hours to days and significant costs, but they are specific for detecting resistances of one or a few well-characterized strains. Next-generation, whole-genome sequencing approaches to resistance screening have shown promise; however, applications of this technology to diagnostics have been limited by a lack of standardized protocols and by the need for data interpretation, which leads to long diagnosis times.
A rapid, broad-spectrum diagnostic technique would also prove invaluable in the screening of cancers and other genetic diseases. Point-of-care diagnostic devices for sensitive and specific detection of cancer biomarkers have long been a goal of the bio-sensing community. Moreover, scientists and clinicians have long struggled to identify rare, novel, and undiagnosed disorders, as evidenced by initiatives such as the National Institutes of Health (NIH) Undiagnosed Diseases Network. For cancers and other genetic diseases, early detection is crucial for patient survival. Current and emerging diagnostics continue to rely on the identification of protein, peptide, or gene expression biomarkers. These diagnostic devices apply an array of nano-electronic and optical techniques but, like antibiotic resistance assays, are specific for detecting merely the one or few biomarkers for which the device is constructed.
As such, there exists a need for a novel and robust algorithmic platform, which may further be coupled with BOC technology as described below, to address the above-identified shortcomings in the prior art. Such algorithms may provide a single, inexpensive diagnostic test capable of rapidly identifying a wide range of genetic biomarkers.
The inventive technology described herein includes optical systems and methods for accurately discriminating between different nucleobases or amino acids within single DNA, RNA, and protein molecules. The novel method utilizes a silver-coated silicon nanopillar substrate to trap individual biomolecules in SERS hotspots, allowing high-throughput single-molecule optical reads. Using spectroscopic ‘fingerprints’ that were identified from the spectral libraries that have been collected, the present inventors developed a novel molecular identification algorithm to accurately identify DNA and RNA bases, as well as a subset of naturally occurring amino acids. The optical nature of the measurement combined with the ability to trap and isolate single molecules on the substrate allows for the potential to simultaneously collect spectra from many hotspots on the same substrate using high-resolution optical microscopy, which provides a distinct advantage over other single-molecule sequencing methods that read molecules sequentially. (Background information related to certain embodiments related to the identification of polynucleotides by the applicant's novel BOC system may be included in co-owned U.S. Provisional Application No. 62/595,551, and U.S. Non-Provisional application Ser. No. 16/211,817. Notably, the entirety of that application's specification, including figures, related to earlier iterations of its BOS systems and identification of nucleotide content in a portion of a polynucleotide is incorporated herein by reference). By combining this approach with more sophisticated machine learning identification algorithms as generally described herein, it may be possible to deconvolute the contribution of different nucleobases or amino acids within the same spectrum, enabling accurate measurement of sequence content in mixed sequences. 
This novel approach to high-throughput (epi)genomics, transcriptomics, and proteomics at the level of single cells is generally described below.
The inventive technology described herein includes a comprehensive and robust algorithmic platform generally referred to as block optical content scoring (BOCS), also referred to herein as the BOCS algorithm, that facilitates rapid, broad-spectrum genetic biomarker identification from DNA k-mer content. This algorithm builds upon novel systems and methods described below demonstrating the use of single-molecule Raman spectroscopy measurements for high-throughput, label-free detection of A-G-C-T content in DNA k-mers, called block optical sequencing (BOS). This BOS method is an alternative to single-letter sequencing and has the potential to measure DNA k-mer content from millions of fragments simultaneously, thereby converting it into useful genetic information. This approach is akin to sharing and streaming of large multimedia files across the World Wide Web using a combination of lossless and lossy data compression techniques. The present inventors' bioinformatics approach, BOCS, uses the DNA k-mer content for identification of genetic biomarkers through probabilistic mapping of the k-mer content to gene databases. Comprehensive simulations show accurate and specific recognition of antibiotic resistance genes, as well as cancer and other genetic disease genes, with less than full coverage of the genes and in the presence of sequencing error. The results described here for the BOCS algorithm pave the way for a single, inexpensive diagnostic test capable of rapidly identifying a wide range of genetic biomarkers, among other applications.
Supplementary Tables 1-16 show supplementary information tables of detailed results for the figures presented herein. This includes information on all of the individual genes used in the enabling simulations, as well as full simulation results for single-gene studies with and without entropy screening, varying k-mer lengths, and block errors; multiple-gene studies; and cancer and other genetic disease results. Supplementary information tables include:
Described herein are devices, techniques, and systems that employ multiplexed 3D plasmonic nanofocusing and optical signatures from nanometer-scale mode volumes to aid in identifying macromolecules, in particular DNA, RNA, and polypeptides. In one preferred embodiment, the inventive technology includes devices, methods, and systems for rapid and high-throughput sequencing of macromolecules, such as proteins, using optical methods to identify the amino acid content of a block of a polypeptide. The disclosed methods may include an inherent lossy compression of proteomic information, which can be used to rapidly identify specific target sequences, modifications, mutations, alternative splicing, and the like, as well as provide protein sequence information. In one embodiment, the disclosed methods and systems combine Raman spectroscopy with other optical methods, such as Fourier-transform infrared (FTIR) spectroscopy, to help increase the sensitivity and accuracy of fingerprinting as well as sequencing.
For example, described herein is the use of Raman spectroscopy and FTIR spectroscopy for label-free identification of protein amino acids, as well as RNA and DNA nucleobases. The disclosed method identifies characteristic molecular vibrations using optical spectroscopy, especially in the “fingerprinting region” for different molecules from ~400-1400 cm⁻¹, to determine, in one embodiment, the amino acid content of a block, or portion, of a polypeptide. These block fingerprints can then be analyzed and compared with other block fingerprints to identify a specific target polypeptide or protein sequence.
In one preferred embodiment, the invention may include devices, techniques, and systems that employ multiplexed 3D plasmonic nanofocusing and optical signatures from nanometer-scale mode volumes to aid in identifying amino acid content in peptide k-mer blocks. Here, surface-enhanced Raman spectroscopy is used for label-free identification of protein amino acids, as well as DNA and RNA nucleobases, with multiplexed 3D plasmonic nanofocusing. It is shown that the content of each amino acid in a peptide block can be used as a unique and high-throughput method for identifying sequences, mutations, and other markers as an alternative to single-letter peptide sequencing. Additionally, it is shown that coupling two complementary vibrational spectroscopy techniques (infrared and Raman spectroscopy) can improve block characterization. These results can pave the way for the development of a novel, high-throughput block optical sequencing method with lossy genomic and/or proteomic data compression using k-mer identification from multiplexed optical data acquisition.
The described devices, processes, and systems are useful in label-free, high-throughput block optical sequencing (BOS) with inherent lossy compression. In many of these embodiments, k-mer blocks of peptides are read using 3D nanofocusing of light. Since the different amino acids in peptides are biochemically distinct, their unique interactions with light (observable optical fingerprints) can be used to discriminate them. Surface-enhanced Raman spectroscopy (SERS) is an optical method routinely used for identification of unknown chemical and biochemical compounds from their vibrational fingerprints. In this technique, surface plasmon polaritons lead to 3D nanofocusing and enhancement of the near-field signal at the apex of rough features or patterned nanostructures. However, applying SERS, or the related tip-enhanced Raman spectroscopy (TERS), to reproducible single-molecule measurements, such as DNA sequence identification, has proven difficult. Previous studies have used SERS/TERS measurements on DNA for label-free chemical fingerprinting; however, mixing a large number of DNA molecules with metal nanoparticles provides an ensemble spectrum and introduces uncertainties in signal strengths. Furthermore, small molecules, such as polypeptides, have varied enhancement due to differences in their distance from the plasmonic antenna, and thus suffer from low reproducibility. Since the SERS/TERS signal falls off dramatically with distance from the plasmonic antenna, signal amplitudes are highly sensitive to the orientation and conformation of molecules with respect to the surface. While many of these effects are washed out in ensemble detection, it has been shown that SERS/TERS signal strength and reproducibility are severely affected by the packing fraction and by large, uncontrollable variation in molecular orientation with respect to the plasmonic nanostructure.
Thus, single-molecule label-free identification of amino acids remains an important and critical challenge.
As such, certain embodiments described herein use patterned nanopyramid probes on a multiplexed substrate to reproducibly enhance the “optical fingerprints” of peptide amino acids. Identifying the different molecular vibrations, bond stretches, and rocking motions in these reproducible spectra allowed differentiation of the peptide amino acids by their respective spectral fingerprints. In addition, the disclosed identification techniques may be improved by combining Raman with Fourier-transform infrared (FTIR) spectroscopy.
Probes for use with the disclosed methods and techniques may be fabricated using methods known to those of skill in the art to obtain a suitable shape for providing Raman scatter or FTIR absorbance information from a polypeptide. In some embodiments, the probes may be manufactured with a pyramidal shape of three or four sides, such that they end in a tip with significantly reduced surface area relative to the base of the shape. In other embodiments, the shape may be other than pyramidal, for example square, conical, or cylindrical.
In many embodiments, nanopyramidal probes may be fabricated from various compositions. In some embodiments, metal pyramids are used. In one embodiment, the periodicity of the nanopyramids may be about 2 μm, in various suitable patterns. For example, as described below, a square periodic pattern may be used with 2 μm periodicity in both the x and y directions. In many embodiments, this may help enhance the vibrational signal in the fingerprinting region of the mid-IR range. Probes may have characteristics that help to retain a polypeptide at the tip. In some embodiments, the composition of the material at the tip of the probe may have a charge that is opposite to that of the polypeptide to aid in retaining the polypeptide; for example, the tip may be positively charged to attract and retain negatively charged polypeptides. In some embodiments, other surfaces of the tip may be of a material that may repel or poorly interact with a polypeptide.
Probes for use with the disclosed methods and techniques may define a surface for accepting or interrogating a polypeptide. In some embodiments, the surface of the probe may be a tip of the probe that may be blunt or sharp. A blunt tip may define a surface that can accommodate a polypeptide of 1 to about 10 nm. In many embodiments, the polypeptide being interrogated may be longer than the surface of the tip. In some embodiments, the tip may have a diameter of about 1 to 10 nm, or about 2-7 nm, or about 2 nm, 3 nm, 4 nm, or 5 nm. In many embodiments, the tip may be designed to interrogate a portion or block of a polypeptide that is from about 2 to about 20 amino acids. In other embodiments, the tip may be designed to interrogate about 3 to about 10 amino acids.
A surface for use with the disclosed devices, methods, techniques, and systems may have a plurality of probes. In some embodiments, a surface may have about 1×10⁵ to about 1×10¹⁰ probes, for example 1×10⁶ or 1×10⁹ probes. In many embodiments, a plurality of probes may be analyzed simultaneously or sequentially by Raman scatter and FTIR to determine, in one preferred embodiment, the amino acid content of a polypeptide positioned on the tip of the probe.
Laser light may be directed at one or more probes to interrogate a polypeptide at, on, or near a tip of the probe. Light reflected from the portion of the polypeptide at the tip may be analyzed by various spectrophotometric methods. In some embodiments, scattered light is analyzed by a Raman spectrophotometer. In some embodiments, absorbance may be analyzed by FTIR spectrophotometer. In some embodiments, one or more filters may be used to analyze light within the wavenumber range.
The polypeptide may be applied to the surface, for example the probe tip, by various methods. In most embodiments wherein the portion of the polypeptide is interrogated on a probe tip, the tip may support or be in contact with a single polypeptide. In some embodiments, the polypeptide may be combed on the surface so that it is substantially linear.
The polypeptide may be treated prior to applying it to the surface. In one embodiment, the polypeptide is digested or fragmented by enzyme or chemical treatment, for example with a specific protease enzyme. In some embodiments, the fragmentation may provide a fragment size that is similar to, but generally larger than, that of the block size being analyzed. A portion, or block, of a polypeptide may be analyzed by the described method. In some embodiments, the block may comprise from about 2 to about 20 amino acids, for example 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amino acids. The number of amino acids in a block may be referred to as the “k” number. In most embodiments, a polypeptide comprises a plurality of blocks.
The disclosed methods, techniques, devices, and systems are useful in determining the amino acid composition of an interrogated block. In some embodiments, the disclosed methods may be useful in determining the relative or absolute number of each type of amino acid in a block. In many embodiments, this composition of a given block may represent a fingerprint for that block.
The disclosed methods and techniques for identification and sequencing of polypeptides may represent lossy compression. In the disclosed techniques and methods, the identity and order of amino acids within a given block are not determinable by analysis of the light from that tip alone. In some embodiments, fingerprints of multiple blocks at multiple tips may be combined to provide an overall sequence of a given polypeptide composed of the analyzed blocks.
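The notion of a block fingerprint as an order-free composition can be illustrated with a short sketch. This is a toy model under stated assumptions, not the measurement itself: the peptide string, the block length, and the helper names `block_fingerprint` and `split_into_blocks` are hypothetical.

```python
from collections import Counter


def block_fingerprint(block: str) -> tuple:
    """Order-free composition of a block: sorted (residue, count) pairs."""
    return tuple(sorted(Counter(block).items()))


def split_into_blocks(sequence: str, k: int) -> list:
    """Cut a sequence into consecutive, non-overlapping blocks of length k
    (the final block may be shorter)."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


# Hypothetical peptide written in one-letter amino acid codes.
peptide = "MKTAYIAKQR"
blocks = split_into_blocks(peptide, k=5)
fingerprints = [block_fingerprint(b) for b in blocks]

# Lossy step: two blocks with the same residues in a different order
# share one fingerprint, so the order cannot be recovered from it.
same = block_fingerprint("MKTAY") == block_fingerprint("YATKM")
```

In this sketch the fingerprint keeps the count of each residue and discards position, which is the sense in which a single block read is lossy.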
As noted herein, while in certain embodiments the inventive technology has been described with respect to the identification of polypeptides, such methods may also be applied to the identification of polynucleotides or amino acids as generally described herein.
The disclosed devices, methods, techniques, and systems may be used to sequence a plurality of polynucleotides or polypeptides by movement of the probe tip relative to the polynucleotide or polypeptide. In this embodiment, the polynucleotide or polypeptide may be applied to a surface other than a probe tip, and then a probe tip may be moved into proximity with the polynucleotide or polypeptide. When the tip is moved along the polynucleotide or polypeptide, the fingerprint will change as one nucleotide or amino acid at the end of the block is lost and a new nucleotide or amino acid is added to the beginning of the block.
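How the fingerprint changes as the tip moves by one position can be sketched as follows. This is an idealized, noise-free illustration; the example sequence and the function names `window_contents` and `step_changes` are hypothetical.

```python
from collections import Counter


def window_contents(sequence: str, k: int) -> list:
    """Content (residue counts) of every length-k window, one step apart."""
    return [Counter(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def step_changes(contents: list) -> list:
    """For each one-residue step, return (lost, gained); (None, None) when the
    residue leaving equals the residue entering and the content is unchanged."""
    changes = []
    for before, after in zip(contents, contents[1:]):
        lost = before - after      # Counter subtraction keeps positive counts only
        gained = after - before
        if not lost and not gained:
            changes.append((None, None))
        else:
            changes.append((next(iter(lost)), next(iter(gained))))
    return changes


dna = "ACGTTGCA"  # hypothetical sequence
changes = step_changes(window_contents(dna, k=4))
```

Each step reports which unit left the block and which entered it. When the two are the same, the content does not change, so that step alone cannot identify the unit.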
Additional embodiments of the current invention include a single, inexpensive diagnostic test capable of rapidly identifying a wide range of genetic biomarkers, which would prove invaluable in precision medicine. Previous work has demonstrated the potential for high-throughput, label-free detection of A-G-C-T content in DNA k-mers, providing an alternative to single-letter sequencing while also having inherent lossy data compression and massively parallel data acquisition. Here, the present inventors apply a new bioinformatics algorithm, block optical content scoring (BOCS), capable of using high-throughput k-mer content for rapid, broad-spectrum identification of genetic biomarkers. BOCS uses content-based sequence alignment for probabilistic mapping of k-mer contents to gene sequences within a biomarker database, resulting in a probability ranking of genes based on a content score. Enabling simulations of the BOCS algorithm reveal high accuracy for identification of single antibiotic resistance genes, even in the presence of significant sequencing errors (100% accuracy for no sequencing errors, and >90% accuracy for sequencing errors at 20%), and at well below full coverage of the genes. Simulations for detecting multiple resistance genes within a methicillin-resistant Staphylococcus aureus (MRSA) strain showed 100% accuracy at an average gene coverage of merely 0.416, when the k-mer lengths were variable and with 4% sequencing error within the k-mer blocks. Extension of BOCS to cancer and other genetic diseases met or exceeded the results for resistance genes. Combined with a high-throughput content-based sequencing technique, the BOCS algorithm enables a test capable of rapid diagnosis and profiling of genetic biomarkers ranging from antibiotic resistance to cancer and other genetic diseases.
The BOCS algorithm uses content-based alignment for probabilistic mapping of k-mer contents to gene sequences within a biomarker database. The algorithm applies elements from pattern recognition and machine learning to rank biomarkers based on a content score. Simulations of the BOCS algorithm showed 100% accurate and highly-specific identification of single antibiotic resistance genes at average coverages of merely 0.255±0.096. Further simulations demonstrated robust performance of the BOCS algorithm in the presence of variable k-mer lengths and high sequencing error rates. With errors as high as 20%, over 90% accuracy in gene identification was achieved at less than full gene coverages.
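The general idea of mapping k-mer contents to genes in a database and ranking the genes can be sketched in a few lines. The scoring rule below (fraction of measured blocks whose content occurs somewhere in a gene, normalized across genes) is a simplified stand-in chosen for illustration and is not the BOCS content score itself; the gene names, sequences, and function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def content(kmer: str) -> tuple:
    """A-C-G-T content of a k-mer as a tuple of counts (order is discarded)."""
    counts = Counter(kmer)
    return tuple(counts[b] for b in BASES)


def gene_contents(gene: str, k: int) -> set:
    """Set of contents of all length-k windows in a gene."""
    return {content(gene[i:i + k]) for i in range(len(gene) - k + 1)}


def rank_genes(block_contents: list, database: dict, k: int) -> list:
    """Rank genes by the fraction of measured blocks whose content occurs
    somewhere in the gene (an illustrative stand-in for a content score)."""
    raw = {}
    for name, seq in database.items():
        windows = gene_contents(seq, k)
        raw[name] = sum(c in windows for c in block_contents) / len(block_contents)
    total = sum(raw.values()) or 1.0
    # Normalize so the scores sum to 1 and can be read as relative weights.
    return sorted(((n, s / total) for n, s in raw.items()),
                  key=lambda item: item[1], reverse=True)


# Hypothetical two-gene database and blocks "measured" from gene_a.
database = {"gene_a": "ATGGCGTACGTTAGC", "gene_b": "ATGAAAAAATTTTTT"}
k = 5
blocks = [content(database["gene_a"][i:i + k]) for i in (0, 5, 10)]
ranking = rank_genes(blocks, database, k)
```

In this toy case every measured block content is found in `gene_a` and none in `gene_b`, so `gene_a` ranks first. A realistic score would also need to account for block errors, variable k-mer lengths, and contents shared between genes.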
Additionally, BOCS has the ability to identify multiple genes when the k-mer fragments from the multiple genes are randomly mixed. When applied to a clinically relevant MDR bacterial strain, the BOCS algorithm showed 100% accuracy with a low false positive rate for detection of two resistance genes (mecA and OXA for MRSA identification) at an average coverage of 0.416±0.296, with a block error rate of 4% and variable k-mer lengths. BOCS applied to cancer and other genetic diseases also showed detection at 100% accuracy with coverages at or below the values for resistance genes. When coupled with a high-throughput content-based sequencing platform, the BOCS algorithm can provide a biomarker detection tool applicable for rapid, broad-spectrum diagnostics.
As noted above, the disclosed BOCS algorithm, methods, techniques, and systems may be implemented in a digital computer system. Such a digital computer is well-known in the art and may include one or more of a central processing unit, one or more of memory and/or storage, one or more input devices, one or more output devices, one or more communications interfaces, and a data bus. In some embodiments, the memory may be RAM, ROM, hard disk, optical drives, removable drives, etc. In some embodiments, storage may also be included in the disclosed system. In some embodiments, storage may resemble memory that may be remotely integrated into the system. The input and output devices may be, for example one or more monitors, display units, video hardware, printers, speakers, lasers, spectrophotometers, filters, collectors, cameras, etc.
Optical sequencing of amino acids and nucleotides in proteins, DNA, and RNA from individual cells requires a strong enhancement of the optical signatures in order to accurately detect and characterize the signal from single molecules. Furthermore, individual proteins or nucleic acid molecules must be spatially isolated on a substrate such that their respective signals can be resolved. To achieve reproducible and high-density SERS enhancement on an inexpensive substrate, the present inventors used ‘leaning nanopillar’ substrates that were generated by reactive ion etching of silicon wafers followed by deposition of a thin coating of silver metal. These substrates, which can be generated at wafer scale and are commercially available, trap single molecules in nanoscale ‘hotspots’ that focus and intensify the local electromagnetic field, resulting in an easily observable optical signal enhanced by many orders of magnitude over the signals from molecules in the surrounding regions.
As illustrated in
In order to test the viability of using the leaning nanopillar substrates for identifying biomolecule sequence content from Raman spectra, the present inventors first carried out SERS measurements on short poly-(dC)5 DNA homopolymers adsorbed from solution droplets with varying DNA concentrations. To do this, water droplets containing DNA concentrations of 0, 1.0, 10, and 100 nM were deposited onto the substrate and allowed to dry. Then several hundred Raman measurements were acquired pointwise along a grid within the droplet area, with a grid point spacing of approximately 10 Examples of resulting spectra are shown in
When the DNA concentration was increased to 10 nM, the fraction of spectra showing significant peaks from cytosine increased to ˜20%, with a few measurements even showing DNA peaks with roughly twice the intensity relative to background peaks, indicating an increase in the number of molecules trapped in SERS hotspots per unit area. Further increase of the DNA concentration to 100 nM resulted in a larger fraction of the spectra showing significant DNA peaks; however, many spectra also displayed a very high intensity relative to the background, indicating that most measurements now contained multiple DNA molecules trapped in hotspots. To identify optical fingerprints from measurements on single molecules, the present inventors carried out all further measurements using a concentration of 10 nM, as it provides a good balance between minimizing the chances of measuring multiple molecules in a given spectrum and reducing the required number of raw measurements to achieve a statistically relevant sample size.
To further confirm that the collected Raman spectra do indeed arise from SERS signals of individual molecules, the present inventors next sought to use the relative intensity of the peaks in each measurement to estimate the number of molecules trapped in hotspots, or occupancy, for that measurement. To accomplish this, the present inventors first took the scaled average of the spectra that displayed significant non-background peaks and determined the vibrational mode to which each peak corresponds using peak positions previously reported in the literature. Spectra that did not display significant non-background peaks were considered to have no molecules trapped in hotspots within the measurement area (occupancy=0) and were not included in the following analysis. Of the remaining measurements, the present inventors calculated the median absolute deviation (MAD) of peak intensity for each peak in order to find the expected peak intensity range for single occupancy, assuming that multiple occupancy is relatively rare. Then, for each peak within a given spectrum, the ‘peak occupancy’ was determined by comparing the peak intensity to the MAD for that peak. The estimated occupancy for that spectrum was then taken as the largest peak occupancy. The results were then fit to a Poisson distribution using the following equation:
$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

where k is the occupancy number, λ is the mean, and P(k) is the probability of having occupancy k in a given measurement. The resulting occupancy histogram and the Poisson fit are shown in
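A Poisson fit to an occupancy histogram can be sketched as follows. This is a minimal illustration, assuming a least-squares grid search for the mean and a histogram that includes the zero-occupancy bin; the counts are hypothetical and are not measured data, and the function names are illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of occupancy k for mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def fit_poisson_mean(histogram: dict, grid_step: float = 0.001,
                     lam_max: float = 5.0) -> float:
    """Least-squares estimate of the Poisson mean from an occupancy histogram
    {occupancy: number of spectra}, found by a simple grid search."""
    total = sum(histogram.values())
    observed = {k: n / total for k, n in histogram.items()}
    best_lam, best_err = None, float("inf")
    steps = int(lam_max / grid_step)
    for i in range(1, steps + 1):
        lam = i * grid_step
        err = sum((p - poisson_pmf(k, lam)) ** 2 for k, p in observed.items())
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam


# Hypothetical occupancy counts (not measured data).
hypothetical_histogram = {0: 820, 1: 160, 2: 18, 3: 2}
lam_hat = fit_poisson_mean(hypothetical_histogram)
```

A small fitted mean indicates that most occupied measurements contain a single molecule, which is the condition the occupancy analysis is meant to check.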
Next, the present inventors sought to establish an optical fingerprint for each of the DNA and RNA nucleotides (adenine, A; guanine, G; cytosine, C; thymine, T; uracil, U; and 5-methylcytosine, 5mC) using sets of specific Raman peaks, in order to perform sequence identification of unknown DNA and RNA oligomers. Previous work from our group showed that characteristic sets of peaks in Raman spectra of DNA homopolymers on silver nanopyramid arrays could be used to distinguish the different DNA bases with high accuracy. Specifically, the present inventors sought to extend this approach in order to identify DNA and RNA nucleotides and epigenetic modifications from SERS measurements on the nanopillar substrates. To this end, the present inventors first generated a spectral library by carrying out SERS measurements on dilute solutions of poly-(dN)x and poly-(rN)x homopolymers (N=A, G, C, T, 5mC, or U), where the length of the oligomer x was 5-10 nucleotides. For each library experiment, the present inventors diluted the sample to 10 nM in water, deposited a ~0.1 μL droplet onto the substrate, and allowed it to dry completely before collecting Raman spectra. Average spectra from the library collection are shown in
where I is the intensity, ν̃ is the Raman shift (in cm⁻¹), μ is the mean, and σ is the standard deviation. From each Gaussian peak fit, the present inventors extracted the peak center position and full width at half maximum (FWHM), which were later used for classification of unknown spectra. The peak positions and FWHM values for the peaks of interest are shown in a table in
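The relationship between a Gaussian peak fit and the extracted parameters can be sketched as follows. This is a minimal illustration only: the amplitude parameter, the function names, and the three-point log-parabola estimator are assumptions introduced here, and the estimator presumes an evenly sampled, background-subtracted, noise-free peak. A least-squares fit over the whole peak would normally be used on real spectra. The FWHM follows from the standard deviation as FWHM = 2·sqrt(2·ln 2)·σ.

```python
import math


def gaussian(shift, amplitude, mu, sigma):
    """Gaussian peak; the amplitude parameter is an assumption of this sketch."""
    return amplitude * math.exp(-((shift - mu) ** 2) / (2.0 * sigma ** 2))


def fwhm_from_sigma(sigma):
    """Full width at half maximum of a Gaussian with standard deviation sigma."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma


def fit_gaussian_three_point(shifts, intensities):
    """Estimate (mu, sigma) from the highest interior sample and its neighbours.

    For a pure Gaussian, ln(I) is a parabola in the Raman shift, so three
    evenly spaced, positive samples determine the center and width exactly.
    """
    i = max(range(1, len(intensities) - 1), key=lambda j: intensities[j])
    h = shifts[i + 1] - shifts[i]
    y0, y1, y2 = (math.log(intensities[j]) for j in (i - 1, i, i + 1))
    curvature = y0 - 2.0 * y1 + y2  # equals -h**2 / sigma**2 for a Gaussian
    mu = shifts[i] + h * (y0 - y2) / (2.0 * curvature)
    sigma = math.sqrt(-h * h / curvature)
    return mu, sigma


# Hypothetical peak sampled every 2 cm^-1 (illustrative values only).
shifts = [700.0 + 2.0 * n for n in range(31)]
intensities = [gaussian(s, 1000.0, 731.3, 6.0) for s in shifts]
center, width = fit_gaussian_three_point(shifts, intensities)
peak_fwhm = fwhm_from_sigma(width)
```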
After identifying the characteristic peaks present in the library spectra, the present inventors next adapted a molecular identification algorithm to identify unknown DNA and RNA nucleobases from their individual Raman spectra. The algorithm is based on a previously developed method of identifying DNA bases from SERS measurements, and is outlined using an example spectrum in
To assess the accuracy of the molecular identification algorithm, the present inventors applied the algorithm to discriminate between the DNA bases A, G, C, and T, as well as the epigenetic modification 5mC, from a randomized library of Raman spectra collected on DNA homo-oligomers. Each ‘unknown’ spectrum was probabilistically classified as described above, and then the predicted class was compared to the actual class to generate a confusion matrix. The resulting (epi)genomics confusion matrix for DNA base calling is shown in
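The comparison of predicted and actual classes can be tabulated as in the following minimal sketch. The function names and the example labels are hypothetical; the classifier itself is not reproduced here.

```python
def confusion_matrix(actual, predicted, classes):
    """Count how often each actual class (row) is called as each class (column)."""
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return counts


def per_class_accuracy(matrix):
    """Fraction of each actual class that was called correctly."""
    result = {}
    for a, row in matrix.items():
        total = sum(row.values())
        result[a] = row[a] / total if total else float("nan")
    return result


# Hypothetical calls for a handful of 'unknown' spectra (illustrative only).
classes = ["A", "G", "C", "T", "5mC"]
actual = ["A", "A", "G", "C", "C", "T", "5mC", "5mC"]
predicted = ["A", "G", "G", "C", "5mC", "T", "5mC", "5mC"]
matrix = confusion_matrix(actual, predicted, classes)
accuracy = per_class_accuracy(matrix)
```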
Next, the present inventors tested the viability of using the same molecular identification algorithm for discriminating between the four nucleobases present in RNA—A, G, C, and U—as would be necessary for single-molecule transcriptomics. Using the same approach of classifying each ‘unknown’ spectrum in a randomized library and comparing the predicted and actual classes, the present inventors generated a transcriptomics confusion matrix, as shown in
Next, the present inventors sought to test the invention's optical fingerprinting and molecular identification method in the context of single-molecule sequencing. To this end, the present inventors generated random ‘unknown’ sequences of DNA or RNA bases and pulled corresponding single measurements from our spectral library for each base. The measurements were then fed into the molecular identification algorithm to predict the sequence of the unknown, which the present inventors then compared to the actual generated sequence to produce a sequencing trace plot. Representative segments of resulting trace plots for DNA and RNA sequencing are shown in
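The sequencing test described above can be sketched as follows, under the assumption that the spectral library is held as a mapping from each base to its list of single measurements and that the molecular identification algorithm is available as a callable. The names `simulate_sequencing`, `library`, and `classify` are hypothetical, and the toy classifier in the usage lines simply reads a label stored with each toy "spectrum".

```python
import random


def simulate_sequencing(library, classify, length, rng):
    """Generate a random sequence, draw one library measurement per base,
    classify each measurement, and flag which calls are correct."""
    bases = sorted(library)
    actual = [rng.choice(bases) for _ in range(length)]
    predicted = [classify(rng.choice(library[base])) for base in actual]
    correct = [a == p for a, p in zip(actual, predicted)]
    return actual, predicted, correct


# Toy usage: each 'spectrum' is a (label, id) pair and the classifier reads the label.
toy_library = {"A": [("A", 1), ("A", 2)], "C": [("C", 1)], "G": [("G", 1)], "U": [("U", 1)]}
rng = random.Random(1)
actual, predicted, correct = simulate_sequencing(toy_library, lambda s: s[0], 20, rng)
call_accuracy = sum(correct) / len(correct)
```

Plotting `actual` against `predicted` position by position gives the kind of trace plot referred to above.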
Finally, while the above work lays the foundation for single-molecule genomics and transcriptomics using SERS measurements, a similarly important challenge is to quickly identify individual protein molecules using optical measurements, which would enable translational profiling and proteomics at the level of single cells. Given the success in identifying nucleotides in single DNA and RNA molecules, the present inventors next sought to test whether this same approach could be extended to discriminate between different amino acids within peptides and proteins. The present inventors demonstrated discrimination between four different amino acids, histidine (His), methionine (Met), serine (Ser), and tyrosine (Tyr), to establish the feasibility of using the optical sequencing approach for single-molecule proteomics. To do this, the present inventors adsorbed small quantities of four different poly-(X)5 polypeptides (X=His, Met, Ser, Tyr) onto different areas of the nanopillar substrates from 0.1 μL solution droplets containing 10 nM polypeptide. Raman spectral grids were collected within each area and their spectra filtered to remove those showing only background peaks, forming the basis for the peptide library. The remaining library spectra were averaged, Gaussian peak fitting was performed on each average spectrum, and the peak fitting parameters (peak center position and FWHM) were extracted to identify characteristic peaks for each amino acid (
In order to test the invention's method for fingerprinting and identification of peptides, the present inventors next modified the molecular identification algorithm that was previously used for DNA/RNA base calling and applied it to differentiate between the four chosen amino acids. For this purpose, the present inventors again limited the chosen peaks for each molecule to an optimized subset of the characteristic peaks in order to improve classification and minimize overlap between the different peak sets. The present inventors then applied the algorithm to a randomized library of homopolypeptide spectra containing either His, Met, Ser, or Tyr, classified each ‘unknown’ spectrum as one of the four known classes, and compared the predicted classes to the actual classes to generate a confusion matrix. The results of this classification are shown in
Nanopillar Substrates:
All experiments were carried out using commercially available silver-coated leaning nanopillar ‘SERStrate’ substrates (Silmeco, Denmark). Substrates were received as ~16 mm² squares and were stored under an inert atmosphere until use. Substrates were used as received and no prior cleaning step was performed.
RNA Handling:
Precautions were taken to minimize enzymatic degradation of the RNA. All solutions coming into contact with RNA were prepared with ultrapure deionized (DI) water (Barnstead Thermolyne NANOpure Diamond purification system, water resistivity >18 MΩ·cm). Prior to handling RNA, the workbench, gloves, pipets and other surfaces were cleaned with RNaseZAP™ RNase inhibitor solution (Ambion, Inc, USA). RNA solutions were stored long-term at −80° C. and short-term at −20° C. in small aliquots and were thawed on ice immediately before use.
Biomolecule Adsorption:
The DNA, RNA, or peptide molecules were diluted to a concentration of 10 nM in ultrapure DI water (resistivity >18 MΩ·cm) and were adsorbed onto the substrate from a small droplet (~0.1 μL). The droplet was then allowed to evaporate completely, during which time the surface tension at the air/liquid/solid interface of the receding droplet caused the pillars to lean into one another and trap some of the molecules in hotspots between the pillars.
Raman Spectroscopy:
Data was acquired using a Horiba LABRAM HR Evolution Raman Spectrometer. For each sample droplet area, several hundred Raman measurements were acquired pointwise along a grid within the droplet area, with a grid spacing of approximately 10 μm. Excitation was achieved using a 532 nm laser operating at 5% power with 0.5 s acquisition times. Scattered light was collected through a 100× microscope objective and passed through a 600 gr/mm grating before reaching the detector.
Data Analysis:
The disclosed algorithms, methods, techniques, and systems may be implemented in a digital computer system (1). Such a digital computer is well-known in the art and may include one or more of a central processing unit, one or more of memory and/or storage, one or more input devices, one or more output devices, one or more communications interfaces, and a data bus. In some embodiments, the memory may be RAM, ROM, hard disk, optical drives, removable drives, etc. In some embodiments, storage may also be included in the disclosed system. In some embodiments, storage may resemble memory that may be remotely integrated into the system. The input and output devices may be, for example one or more monitors, display units, video hardware, printers, speakers, lasers, spectrophotometers, filters, collectors, cameras, etc.
The digital computer system (1), or computer(s) 1, may be generally described as general-purpose computers with elements that cooperate to achieve the multiple functions normally associated with such computers. For example, the hardware elements may include one or more central processing units (CPUs) for processing data. The computer 1 may further include one or more input devices (e.g., a mouse, a keyboard, etc.) and one or more output devices (e.g., a display device, a printer, etc.). The computers may also include one or more storage devices. By way of example, the storage device(s) may be disk drives, optical storage devices, or solid-state storage devices such as a random access memory (“RAM”) and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
Each of the computers and server described herein may include a computer-readable storage media reader; a communications peripheral (e.g., a modem, a network card (wireless or wired), an infra-red communication device, etc.); working memory, which may include RAM and ROM devices as described above. The server may also include a processing acceleration unit, which can include a DSP, a special-purpose processor and/or the like.
The computer-readable storage media reader can further be connected to a computer-readable storage medium, together (and, optionally, in combination with storage device(s)) comprehensively representing remote, local, fixed, and/or removable storage devices plus storage media for temporarily and/or more permanently containing computer-readable information. The computers and server permit data to be exchanged with a network (2) and/or any other computer, server, or mobile device.
The computers and server also comprise various software elements and an operating system and/or other programmable code such as program code implementing a web service connector or components of a web service connector. It should be appreciated that alternate embodiments of a computer may have numerous variations from that described above. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software (including portable software, such as applets), or both. Further, connection to other computing devices such as network input/output devices may be employed.
It should also be appreciated that the method described herein may be performed by hardware components or may be embodied in sequences of machine-executable instructions, which may be used to cause a machine, such as a general-purpose or special-purpose processor or logic circuits programmed with the instructions to perform the methods. These machine-executable instructions may be stored on one or more machine readable mediums, such as CD-ROMs or other type of optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, flash memory, or other types of machine-readable mediums suitable for storing electronic instructions. Alternatively, the methods may be performed by a combination of hardware and software.
The term “software” as used herein shall be broadly interpreted to include all information processed by a computer processor, a microcontroller, or processed by related computer executed programs communicating with the software. Software therefore includes computer programs, libraries, and related non-executable data, such as online documentation or digital media. Executable code makes up definable parts of the software and is embodied in machine language instructions readable by a corresponding data processor such as a central processing unit of the computer. The software may be written in any known programming language in which a selected programming language is translated to machine language by a compiler, interpreter, or assembler element of the associated computer.
Considering the foregoing exemplary computer and communications network and the elements described therein, in connection with one embodiment of the invention, it may be considered a software program or software platform with computer coded instructions that enable execution of the functionality associated with the systems and methods described generally in
In connection with another embodiment of the invention, it may be considered a combined software and hardware system including (a) a software program or software platform with computer coded instructions that enable execution of the functionality associated with the digital computer system (1) along with the execution of the BOCS algorithm to generate block optical content, and (b) hardware elements, such as optical hardware for surface-enhanced Raman spectroscopy (SERS) as generally described herein, that may be used to analyze a SERS substrate.
Given the capability of high-throughput single-molecule Raman spectroscopy measurements in determining DNA k-mer content, the need arises for a way to correlate these content measurements into meaningful genetic information. The potential for coupling a high-throughput measurement system with a broad-spectrum genetic biomarker identification method could lead to a diagnostic platform for rapid point-of-care genetic profiling. Direct applications range from providing clinicians with the information they need to effectively treat multidrug-resistant (MDR) bacterial infections to early detection of cancers and other genetic diseases that previously had no screening techniques. Therefore, the present inventors introduced the BOCS algorithm, which uses DNA k-mer content for broad-spectrum genetic biomarker recognition. In designing BOCS (schematic in
In a similar nature to these methods, the BOCS algorithm relies on probabilistic content alignments to reference sequences for genetic biomarkers. The BOCS algorithm requires 1) the log of all k-mer blocks and their content and 2) a database containing gene sequences for the genetic biomarkers being investigated (e.g., antibiotic resistance, cancer, or other genetic diseases). The algorithm cycles through each k-mer block and performs a content-based alignment with each gene sequence in the database, translating through the gene sequence one nucleotide at a time and tracking the number of match locations, where the k-mer block content matches the content of the k-length gene sequence. A probability is calculated for each gene after each block is aligned with it. This raw probability (PR) is simply the number of observed matches divided by the calculated number of matches that are statistically expected to occur randomly. It is based on the fundamental idea that genes in the database that are most similar to the k-mer blocks in terms of their content should have the most matches during alignment, and therefore deviate the most significantly from the random case. The raw probability is calculated from the number of match locations (m), the length of the k-mer block (k) and its content in terms of the number of A-G-C-T nucleotides, and the length of the gene (gL), shown below for an arbitrary gene (x):
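The content-based alignment and the ratio of observed to expected matches can be sketched as follows. This is an illustrative sketch only, not the disclosed equation: the expected number of random matches is assumed here to be the number of k-length windows in the gene, (gL − k + 1), multiplied by the probability that a uniformly random k-mer has the block's content, (k!/(A!G!C!T!))/4^k. All function names are hypothetical.

```python
from collections import Counter
from math import factorial

NUCLEOTIDES = "AGCT"


def content(seq):
    """A-G-C-T content of a sequence as a tuple of counts."""
    counts = Counter(seq)
    return tuple(counts[n] for n in NUCLEOTIDES)


def count_content_matches(block_content, gene):
    """Slide a k-length window along the gene one nucleotide at a time and
    count the positions whose content equals the block content (m)."""
    k = sum(block_content)
    if len(gene) < k:
        return 0
    window = Counter(gene[:k])
    matches = int(tuple(window[n] for n in NUCLEOTIDES) == block_content)
    for i in range(k, len(gene)):
        window[gene[i]] += 1
        window[gene[i - k]] -= 1
        if tuple(window[n] for n in NUCLEOTIDES) == block_content:
            matches += 1
    return matches


def expected_random_matches(block_content, gene_length):
    """Assumed random-match model: windows times P(random k-mer has this content)."""
    k = sum(block_content)
    denominator = 1
    for n in block_content:
        denominator *= factorial(n)
    permutations = factorial(k) // denominator
    return (gene_length - k + 1) * permutations / 4 ** k


def raw_probability(block_content, gene):
    """Observed matches divided by the matches expected at random."""
    m = count_content_matches(block_content, gene)
    return m / expected_random_matches(block_content, len(gene))
```

For example, a block with the content of "CA" matches the toy gene "ACGTAC" at two window positions, against 0.625 expected at random under the assumed model.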
In the case where no matches are found for a gene, the gene is given a penalty score in place of the raw probability (adjustable parameter for the algorithm, normally in the range of 0.01-0.10). After the analysis of a block (i.e., when the block has been content aligned to each gene in the database), this raw probability is normalized by the maximum raw probability observed for all genes (PR becomes PR*). While this raw probability itself is not the score on which biomarker identifications are made, it is the basis for many of the six probability factors that make up the overall content score.
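The penalty substitution and the normalization by the maximum can be sketched as follows. The function name and the default penalty of 0.05 are hypothetical (the text gives only a typical range of 0.01-0.10), and the sketch assumes that the penalty is substituted before the normalization step.

```python
def normalize_raw_probabilities(raw, penalty=0.05):
    """Replace zero-match raw probabilities with a penalty score, then
    divide by the maximum over all genes so that the top gene equals 1."""
    adjusted = {gene: (p if p > 0 else penalty) for gene, p in raw.items()}
    top = max(adjusted.values())
    return {gene: p / top for gene, p in adjusted.items()}


# Hypothetical raw probabilities for three genes after one block (illustrative only).
normalized = normalize_raw_probabilities({"gene_a": 2.0, "gene_b": 0.0, "gene_c": 1.0})
```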
After the content alignment of a block has been completed for all genes, and the raw probabilities are calculated for each gene, six probability factors (PF) that make up the content score (CS) are calculated for each gene. These PF values are designed as pattern recognition elements for a customized machine learning enhancement to the algorithm. They were designed to account for repeated trends observed throughout comprehensive analyses of match patterns during content alignment. The first probability factor (PF1) is the cumulative percent difference from average of the normalized raw probability (PDiff) multiplied by the normalized cumulative raw probability, shown below for an arbitrary gene (x) after an arbitrary block (bn) in terms of normalized raw probabilities:
The second probability factor (PF2) is the total number of blocks, up to the current block, having at least one match from the content alignment:
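$$PF_{2,x} = \sum_{b=1}^{b_n} \mathbf{1}\left[m_{x,b} \geq 1\right]$$
where m_{x,b} is the number of match locations found for gene x with block b, and the bracketed term equals 1 when the condition holds and 0 otherwise.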
The third probability factor (PF3) is the product of all normalized raw probabilities taken as the log base 2 sum. Since this leads to negative values, they are flipped by subtracting from the most negative value:
$$PF_{3,x} = \max\left(\left|\log_2 P^{*}_{R,\mathrm{all}}\right|\right) - \left|\log_2 P^{*}_{R,x}\right| \quad (5)$$
The fourth probability factor (PF4) is an exponential of the gene coverage (gcov), indicating the fractional number of nucleotides within the gene that have been matched during content alignment:
$$PF_{4,x} = \frac{\exp(500 \cdot g_{cov})}{\exp(500)} \quad (6)$$
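Because exp(500·gcov)/exp(500) is algebraically equal to exp(500·(gcov − 1)), the fourth probability factor can be evaluated with a single exponential, which avoids forming two very large intermediate values. A minimal sketch, with a hypothetical function name:

```python
import math


def pf4(gene_coverage):
    """Fourth probability factor, written as exp(500 * (gcov - 1)).

    Equivalent to exp(500 * gcov) / exp(500) for gene coverage in [0, 1].
    """
    return math.exp(500.0 * (gene_coverage - 1.0))
```

The factor equals 1 at full coverage and falls off steeply below it; at a gene coverage of 0.99 it is already exp(−5), or roughly 0.0067.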
The fifth probability factor (PF5) is the cumulative slope (SPF5) calculated from the percent difference from average of the normalized raw probability (PDiff, equation 2). The slope is calculated for the current block and the nine previous blocks; therefore, this factor does not take effect until the tenth block:
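The ten-block window can be illustrated with the following minimal sketch. The slope estimator is not specified in the text above, so an ordinary least-squares slope against block index is assumed here, and the function returns zero until ten blocks are available. The function name is hypothetical.

```python
def window_slope(values, window=10):
    """Least-squares slope of the last `window` values against block index.

    Returns 0.0 until `window` values exist, mirroring the statement that
    the factor does not take effect until the tenth block.
    """
    if len(values) < window:
        return 0.0
    ys = values[-window:]
    x_mean = (window - 1) / 2.0
    y_mean = sum(ys) / window
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    denominator = sum((x - x_mean) ** 2 for x in range(window))
    return numerator / denominator
```

Here `values` would hold the percent difference from average (PDiff) for one gene after each block analyzed so far.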
The sixth probability factor (PF6) is the cumulative difference from average of the normalized raw probability:
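$$PF_{6,x} = \sum_{b=1}^{b_n} \left(P^{*}_{R,x,b} - \overline{P^{*}_{R,b}}\right)$$
where P*_{R,x,b} is the normalized raw probability of gene x after block b and the overlined term is its average over all genes for that block.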
Each of the six PF values is normalized individually by the maximum PF observed for all genes (PF becomes PF*). This normalization by the maximum ensures equal weighting for the factors when they are added together to give the CS:
Notice that the CS is also normalized; however, here it is by the sum of CS values for all of the genes instead of the maximum as for the PFs. As each block is analyzed, the CS for each gene accumulates, leading to a probabilistic ranking of genes in the database. As demonstrated in the results, the compounded probabilistic content scoring is robust, and can often correlate the k-mer block contents to a positive genetic biomarker identification well below full coverage of the gene.
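The two normalizations described above (each factor by its maximum, the content score by its sum over genes) can be sketched as follows. The function names are hypothetical, the sketch covers a single block only, and the accumulation of the score across blocks is not modeled.

```python
def normalize_by_max(values):
    """Scale a gene -> value mapping so that the largest value is 1."""
    top = max(values.values())
    return {gene: (v / top if top > 0 else 0.0) for gene, v in values.items()}


def content_scores(probability_factors):
    """Sum the max-normalized probability factors per gene, then divide by
    the sum over all genes so that the content scores total 1.

    `probability_factors` is a list of gene -> PF mappings, one per factor.
    """
    genes = list(probability_factors[0])
    totals = {gene: 0.0 for gene in genes}
    for factor in probability_factors:
        for gene, value in normalize_by_max(factor).items():
            totals[gene] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {gene: 0.0 for gene in genes}
    return {gene: total / grand_total for gene, total in totals.items()}


# Two hypothetical factors for two genes (illustrative only; BOCS uses six).
scores = content_scores([{"x": 2.0, "y": 1.0}, {"x": 4.0, "y": 4.0}])
```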
The BOCS algorithm may be built into a simulation for large-scale analyses. Such a simulation takes gene sequences from a biomarker database and creates k-mer blocks of A-G-C-T content to simulate BOS reads. These simulated BOS reads are then run through the BOCS algorithm against the biomarker database. The goal of the simulation is to see how well the BOCS algorithm can identify the correct gene (out of all others in the database) using merely randomized k-mer blocks of A-G-C-T content. A specific gene from the database can be pulled or a random gene can be selected. The k-mer block lengths, gene coverage, and the number of errors within the blocks can all be set.
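The generation of simulated BOS reads can be sketched as follows for the simplest setting of constant k, single coverage, and no block errors. The function name, the random choice of split offset, and the discarding of incomplete end blocks are assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_bos_reads(gene, k, rng):
    """Split a gene into consecutive k-mer blocks starting at a random offset,
    keep only each block's A-G-C-T content, and shuffle the block order."""
    offset = rng.randrange(k)
    blocks = [gene[i:i + k] for i in range(offset, len(gene) - k + 1, k)]
    reads = []
    for block in blocks:
        counts = Counter(block)
        reads.append(tuple(counts[n] for n in "AGCT"))
    rng.shuffle(reads)
    return reads


# Toy 40-nucleotide 'gene' (illustrative only).
reads = simulate_bos_reads("ACGT" * 10, k=10, rng=random.Random(0))
```

Each repeat with a different random state gives different split locations and a different block order, which is what distinguishes one simulation repeat from another.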
For comprehensive testing of the BOCS algorithm, the present inventors used the MEGARes database of antimicrobial resistance, composed of 3824 total resistance gene sequences. Due to the phylogeny of annotated genes in MEGARes and other gene databases, the BOCS analysis uses three levels for gene detection. From broadest to most specific, these are class, sub-class, and specific gene. For example, a gene leading to resistance of tetracycline antibiotics could have a class: tetracycline ribosomal protection proteins, sub-class: TETO, and specific gene: TETO-x,y,z (where x, y, z are specific mutations of TETO). Note that deviations from the MEGARes three-level annotation system are possible, allowing wider applicability with other genetic databases (as demonstrated later). For our BOCS benchmarking analyses, the present inventors randomly selected 70 genes having unique sub-classes from the MEGARes database (see the Supplementary Information Table S1 for details of the genes) and ran 25 repeat simulations on each, where each simulation repeat represents different split locations for the k-mer blocks and a different randomized order in which the blocks are analyzed. In this first set of 1750 simulations, the k-mer blocks were set at k=10, single gene coverage, and no block errors (results are shown in
In analyzing the simulation results, the present inventors were interested in four main metrics: accuracy, coverage at which a gene is identified, false positives, and specificity. The accuracy is a measure of how often the selected gene, which has been fragmented into randomized k-mers of A-G-C-T content, can be identified. The coverage at which a gene is identified indicates how many fewer blocks than the total (all blocks correspond to a coverage=1.0) are needed, reflecting the rapid, robust nature of the algorithm. False positives are a measure of the sensitivity in detection (more false positives means less sensitive). The specificity shows how significantly the gene database can be narrowed as consecutive blocks are analyzed. All of these factors depend on when an identification is made, which is determined as the point where a gene within the database adopts the highest content score and remains there and/or separates itself probabilistically from the rest. False positives arise when genes other than the selected gene meet this identification criterion. Genes within the database can be eliminated when a block shows no content matches during the alignment (this elimination scheme can only be used when there is single coverage for the genes and no block errors). In this first simulation with 70 resistance genes, 100% accuracy (with no false positives) was achieved while requiring an average coverage of merely 0.271±0.064 (
Additionally, roughly 90% of the genes in the MEGARes database were eliminated by 0.20 coverage (
When looking at the content scoring for this first set of simulations on antibiotic resistance genes, the present inventors observed the most significant spikes in probabilities when the number of permutations for a particular block content was low (i.e., the value k!/(A! G! C! T!) was low). This led to the idea of preferentially analyzing these ‘low entropy’ blocks before others in a process the present inventors call entropy screening. In the simulation, entropy screening can be applied in a random fashion (in the random order in which the blocks are scattered) or an ideal fashion (in order of low entropy to high entropy). Moreover, the present inventors noticed that in the majority of simulations, genes within the database that had probabilistically become irrelevant were still being analyzed as potential candidates. To alleviate this, the present inventors implemented a thresholding system to remove genes with the lowest probability ranks after each round of block analyses. This type of thresholding based on content score ranking is also necessary to eliminate genes when there is more than one gene, more than single gene coverage, or sequencing errors, where eliminations based on no content matches to a block would lead to significant identification error and decrease the overall accuracy. In the simulation, thresholding can be implemented based on the rank of the content score, as well as each of the individual probability factors, and each can be multiplied by a factor to increase/decrease the sensitivity of thresholding. With the thresholding and entropy screening in place, the first simulation with 70 resistance genes was re-run (again with k-mer blocks set at k=10, single gene coverage, and no block errors, with 25 repeat simulations per gene). Looking at the results shown in
The present inventors next sought to test the limits of the BOCS algorithm by introducing sequencing variability in the form of fluctuating k-mer block lengths, block errors, and blocks from multiple genes. All of these settings can be input into the BOCS simulation, and each of the simulations was run with thresholding (using all probability factors and the content score) and random entropy screening. First looking at k-mer lengths, the present inventors ran two sets of simulations with constant k-mer lengths different from the k=10 case used previously: one with k=8 and another with k=12. Then another set of simulations was run for varying k-mer lengths centered around k=10. For this, the k-mer length for each block was randomly picked from a normal distribution centered around k=10, leading to a distribution of k-mer lengths in the range k=6-14. For each of these simulations, the same 70 MEGARes genes were used, again with 25 repeats. Results in
Next looking at block errors, a set of simulations (for the 70 resistance genes with 25 repeats) was run for each of four error rates within the blocks: 2, 5, 10, and 20%. Note that when using content as a sequencing platform, the error rates become double the rates that would normally be seen in single-letter sequencing. This is because a single point error within a k-mer block affects the resulting content of two nucleotides: the letter corresponding to the correct nucleotide and the letter corresponding to the incorrect nucleotide. In the BOCS simulation, the error rates are entered as fractional error rates for the gene sequence, not the content; therefore, the error rates shown here (2, 5, 10, and 20%) were entered as 0.01, 0.025, 0.05, and 0.10. The results in
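The doubling can be seen in a small worked example: one substitution in a block changes two content counts, so a content error rate corresponds to half that rate at the sequence level. The function names below are hypothetical.

```python
from collections import Counter


def content(seq):
    """A-G-C-T content of a sequence as a tuple of counts."""
    counts = Counter(seq)
    return tuple(counts[n] for n in "AGCT")


def changed_content_counts(true_block, read_block):
    """Number of nucleotide counts that differ between two block contents."""
    return sum(1 for a, b in zip(content(true_block), content(read_block)) if a != b)


def sequence_error_rate(content_error_rate):
    """Fractional sequence error rate corresponding to a content error rate."""
    return content_error_rate / 2.0


# One C -> T substitution lowers the C count and raises the T count.
affected = changed_content_counts("AAGCT", "AAGTT")
entered_rates = [sequence_error_rate(r) for r in (0.02, 0.05, 0.10, 0.20)]
```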
Lastly looking at using k-mer blocks from multiple genes instead of a single gene (and therefore trying to identify all genes from which the blocks are compiled), the present inventors ran two sets of simulations using sets of k-mer blocks from two and five genes. The 2-gene simulations are for 10 random 2-gene selections from the base set of 70 resistance genes, each with 25 repeats. The 5-gene simulations are for 5 random 5-gene selections from the base set of 70 resistance genes, each with 25 repeats.
The present inventors applied BOCS simulations towards the detection of a clinically relevant MDR bacterial strain. Methicillin-resistant Staphylococcus aureus (MRSA) has become a leading cause of bacterial infections in healthcare and the community. It is the most clinically relevant Staphylococcus species, with a large prevalence of tissue and bloodstream infections due to chronic skin conditions and surgical procedures. Through horizontal gene transfer, MRSA strains show resistance to most beta-lactam antibiotics, leading to endemic infections in healthcare facilities worldwide. Diagnosis is most commonly performed with phenotypic cell culture assays. These assays look for the presence of the mecA gene encoding the PBP2a penicillin-binding protein with a cefoxitin (a beta-lactam, with resistance being of the type OXA class D) antibiotic inducer. The culture tests must incubate for >24 hours, with overall time for testing usually being >46 hours.
To demonstrate detection of MRSA with BOCS, the present inventors designed a simulation looking for two genes: 1) mecA gene encoding the PBP2a penicillin-binding protein and 2) OXA beta lactamase (class D). The simulation used variable length k-mer blocks centered around k=10 (for a range of k=6-14), and a 4% error rate within the blocks. Thresholding (with multiplier and selected factors) and random entropy screening were also applied, and the simulation was run with 25 repeats. The BOCS algorithm once again showed powerful performance in identification of the two resistance genes of interest, leading to MRSA detection even in the presence of block errors and variable k-mer lengths (results in
Expanding BOCS to other areas benefiting from broad-spectrum diagnostics, the present inventors ran simulations with the COSMIC cancer database and a custom compiled database of other genetic diseases including many listed by the NIH Undiagnosed Diseases Network. Note for these databases, there is no class level identification, only sub-class and specific gene. For each database, 10 randomly-selected genes were run with 10 repeats, for 100 total simulations with constant k-mers at k=10, no block errors, and thresholding and entropy screening (results in
In one embodiment, the present inventors successfully coupled optical sequencing measurements with the content-scoring algorithm, or BOCS algorithm, for the characterization of a β-lactamase gene within the pathogen of origin. Specifically, we show that merely a few highly accurate measurements of DNA k-mer block content (<<full coverage of the gene) from silver nanoparticles can be used with the content-scoring algorithm to identify the correct OXA β-lactamase (class D) gene from a comprehensive antibiotic resistance database and confirm the Pseudomonas aeruginosa pathogen from which it originates. Although optical sequencing measurements can be multiplexed using silver-coated nanopyramid substrates for SERS, we utilized metallic nanoparticles here to demonstrate broader applicability across plasmonic substrates and varying resolution (single molecule versus ensemble). We also show extensions to transcriptomics and epigenomics. Ultimately, the results here demonstrate the use of an optical sequencing platform as a diagnostic for inexpensive and rapid identification of broad-spectrum genetic, transcriptomic, and epigenomic biomarkers.
In this study, we collected optical sequencing measurements from ssDNA k-mer blocks with positively charged, spermine-coated silver nanoparticles (Ag NPs) as the plasmonic substrate (
Within each SERS measurement, the PO₂⁻ stretching mode peak at 1089 cm⁻¹ due to the phosphate backbone is used as an internal standard for normalizing the relative peak intensities, as is consistent with other studies employing nanoparticle substrates. All signature peaks and the PO₂⁻ normalization peak are highlighted in
It is important to note that impactful extensions exist for transcriptomics and epigenomics by applying optical detection to RNA and chemically modified nucleobases. As shown in
To fully deconvolute the A-G-C-T content of an unknown mixed sequence DNA k-mer block for optical sequencing, it may be necessary to know the full range of intensity values for signature peaks of each nucleobase. Therefore, we used custom DNA k-mer blocks with a known content as standards for generating content calibrations. The 14 calibration blocks are provided in Table 5. These 14 ssDNA 10-mer calibration blocks span the range of 0-1 fractional content for each of the four nucleobases. Blocks Cal_1, Cal_2, Cal_3, and Cal_4 provided the Raman signatures shown in
The present inventors applied the calibrations toward identifying content within k-mer blocks from an actual gene sequence, for subsequent integration with the content-scoring algorithm. The 15 gene blocks are provided in Table 6. These 15 ssDNA 10-mer gene blocks are from an OXA β-lactamase (class D) gene found in P. aeruginosa. Although 10-mers were used throughout this study, SERS measurements can be collected from longer blocks. From SERS measurements on the 15 gene blocks, the present inventors measured the normalized intensity for signature peaks (averaged from three technical replicates).
Predicted content for all 15 gene blocks is provided in
For full integration into a diagnostic method, the high-accuracy optical sequencing reads were coupled with the content-scoring algorithm for genetic biomarker detection. With the optical sequencing platform, we set out to demonstrate the detection of a P. aeruginosa infection with the drug-resistant β-lactamase gene. P. aeruginosa is a clinical multidrug-resistant (MDR) pathogen of critical importance due to its prevalence for causing bloodstream, urinary, and pulmonary infections in hospital settings, especially for immunocompromised patients in intensive care settings. Due to the multiple mechanisms of inherent and acquired resistance of this organism, patients infected with P. aeruginosa have limited therapeutic options. It is, therefore, imperative to have more early-stage, rapid diagnostic techniques in place to screen for P. aeruginosa so that effective antibiotic regimens can be prescribed from the onset of infection.
The content-scoring BOCS algorithm was developed to perform genetic biomarker database searching from measurements of the nucleotide sequence content. It operates analogously to probability-based sequence analyzers such as those employed for peptide identification from mass spectrometry data and alignment programs used for mapping next-generation sequencing reads to reference genomes. In a similar fashion, the algorithm relies on probabilistic content alignments to database sequences of genetic biomarkers. Outlined in
In thorough simulations with antibiotic resistance, cancer, and other genetic disease databases, the BOCS algorithm proved robust, even with variable k-mer block lengths, high error rates, and blocks drawn from multiple genes. The present inventors ran the measured gene blocks, with predicted content at 93.3% accuracy, through the content-scoring algorithm against the MEGARes antibiotic resistance database of ~4,000 known resistance genes, including the OXA β-lactamase (class D) gene of the measured gene blocks. This analysis demonstrates the ability of optical sequencing to diagnose antibiotic resistances from unknown samples with no prior knowledge of the pathogen or strain. The table of gene blocks and their predicted content, which was provided to the algorithm, is shown in the lower portion of
Extending diagnostic applications further, the present inventors ran the measured gene blocks through the algorithm again after replacing the MEGARes database with the P. aeruginosa reference genome PAO1 containing the OXA β-lactamase (class D) gene. This analysis indicates the ability to confirm the pathogens and specific strains responsible for the infection. It also shows the robustness of the content-scoring algorithm in identifying specific genes against the background of an entire microbial genome.
Entropy Screening in the BOCS Algorithm:
The most significant spikes in raw probabilities occur when the number of permutations for a particular k-mer block is low (i.e., the value k!/(A! G! C! T!) is low, where A, G, C, and T denote the number of each nucleotide in the block). Preferentially analyzing these ‘low entropy’ blocks before others therefore enhances the BOCS algorithm by allowing genetic biomarker identification at lower coverages, a process the present inventors call entropy screening.
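As an illustration only, the ordering step of entropy screening can be sketched in Python as follows. The function names permutation_count and entropy_screen and the example blocks are hypothetical and are not taken from the BOCS implementation; each block is represented simply by its nucleotide counts.

```python
from collections import Counter
from math import factorial

def permutation_count(content):
    """Number of distinct sequences sharing this content: k!/(A! G! C! T!)."""
    k = sum(content.values())
    denominator = 1
    for count in content.values():
        denominator *= factorial(count)
    return factorial(k) // denominator

def entropy_screen(blocks):
    """Order blocks from fewest to most permutations ('low entropy' first)."""
    return sorted(blocks, key=permutation_count)

# Hypothetical 10-mer blocks, written as sequences only to derive their content.
blocks = [
    Counter("AAGGCCTTAG"),  # A3 G3 C2 T2
    Counter("AAAAAAAAAT"),  # A9 T1
    Counter("AAAAATTTTT"),  # A5 T5
]
ordered = entropy_screen(blocks)
counts = [permutation_count(b) for b in ordered]
```

In this sketch the A9 T1 block, with only 10 possible arrangements, would be analyzed first, and the A3 G3 C2 T2 block, with 25,200 arrangements, last.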
Thresholding in the BOCS Algorithm:
As more k-mer blocks are analyzed and content scores become compounded, genes within the biomarker database that have probabilistically become irrelevant need to be eliminated. When analyzing k-mer blocks from a single gene at single coverage with no errors, genes can be eliminated as soon as a block has no content match. However, this elimination scheme cannot be used in the presence of errors, at higher coverages, or when the k-mer blocks are drawn from multiple genes, as it leads to significant decreases in accuracy. To account for this, the present inventors implemented a thresholding system within BOCS that removes the genes with the lowest probability ranks after each consecutive round of block analyses. Thresholding is based on the rank of the content score, as well as each of the individual probability factors, and the threshold can be multiplied by a factor to increase or decrease the sensitivity of the eliminations being made.
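A minimal sketch of rank-based elimination is given below for illustration. It is an assumption-laden simplification: the function name threshold_genes, the keep_fraction and sensitivity parameters, and the example scores are hypothetical, and the sketch ranks genes by a single content score rather than by the individual probability factors described above.

```python
def threshold_genes(scores, keep_fraction=0.5, sensitivity=1.0, min_keep=1):
    """Keep only the top-ranked genes after a round of block analysis.

    scores: dict mapping gene name -> content score (higher is better).
    keep_fraction * sensitivity sets the fraction of genes retained
    (both parameters are hypothetical tuning knobs).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = round(len(ranked) * keep_fraction * sensitivity)
    n_keep = max(min_keep, min(len(ranked), n_keep))
    return {gene: scores[gene] for gene in ranked[:n_keep]}

# Hypothetical content scores after one round of block analysis.
scores = {"gene_a": 0.90, "gene_b": 0.50, "gene_c": 0.10, "gene_d": 0.05}
survivors = threshold_genes(scores)
```

Raising sensitivity above 1 retains more genes per round (more conservative elimination), while lowering it eliminates more aggressively; min_keep guards against emptying the candidate list.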
Accounting for Special Characters in the Genetic Databases:
Some genetic biomarker database FASTA files contain special nucleic acid code characters (e.g., N signifies that any of A, G, C, or T can be substituted into the sequence at that location). When performing content-based sequence alignment, this creates multiple possibilities for content within the two sequences being aligned (the k-mer block and the genetic biomarker sequence). To account for these special characters, the BOCS algorithm tests all possible substitutions of A, G, C, and T for the character code used, and a match is awarded if any of the possible substitutions leads to equal content between block and gene sequence.
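The substitution test can be sketched as follows. This is an illustrative example rather than the BOCS source: the function name content_matches is hypothetical, and the lookup table covers only a subset of the standard IUPAC ambiguity codes (R for A/G, Y for C/T, N for any base).

```python
from collections import Counter
from itertools import product

# Subset of the IUPAC nucleotide codes; extend for other ambiguity characters.
IUPAC = {"A": "A", "G": "G", "C": "C", "T": "T",
         "R": "AG", "Y": "CT", "N": "AGCT"}

def content_matches(block_content, window):
    """True if any substitution of the ambiguity codes in `window`
    gives the same A-G-C-T counts as `block_content`."""
    target = +Counter(block_content)  # unary plus drops zero counts
    choices = [IUPAC[ch] for ch in window]
    return any(Counter(candidate) == target for candidate in product(*choices))

# Hypothetical block content and database windows.
block = {"A": 2, "G": 1, "C": 1, "T": 0}
hit = content_matches(block, "ANGC")   # N -> A reproduces the block content
miss = content_matches(block, "ANGG")  # no substitution supplies the C
```

Enumerating every substitution mirrors the description above but grows as 4^n for n ambiguous positions, so an implementation handling long runs of N might compare counts directly instead.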
Making Genetic Biomarker Identifications:
The BOCS algorithm uses three levels for gene detection. In order from broadest to most specific, they are class, sub-class, and specific gene. For example, a gene leading to resistance to beta-lactam antibiotics could have a class: class A beta-lactamase, sub-class: TEM, and specific gene: TEM-x,y,z (where x, y, z are specific mutations of TEM). Depending on the level of phylogeny present in the genetic biomarker database, some or all of these levels are used. Each of these levels is tracked in terms of content score ranking throughout the k-mer block analysis, and an identification can be made for each level. Identification is determined as the point where a gene within the database adopts one of the n-highest content scores (for n genes comprising the blocks) and remains there and/or separates itself probabilistically from the rest. False positives arise when genes other than the selected gene(s) meet this identification criterion.
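The "adopts one of the n-highest content scores and remains there" criterion can be illustrated with the short sketch below. The function name identification_round and the ranking history are hypothetical, and the sketch does not model the alternative criterion of probabilistic separation from the rest of the database.

```python
def identification_round(rank_history, gene, n=1):
    """Return the 1-based round at which `gene` enters the top n and stays
    there through the final round, or None if it never does.

    rank_history: one list per round of genes ordered by content score,
    best first.
    """
    identified_at = None
    for round_number, ranking in enumerate(rank_history, start=1):
        if gene in ranking[:n]:
            if identified_at is None:
                identified_at = round_number
        else:
            identified_at = None  # fell out of the top n; reset
    return identified_at

# Hypothetical rankings after three rounds of block analysis.
history = [
    ["gene_b", "gene_a", "gene_c"],
    ["gene_a", "gene_b", "gene_c"],
    ["gene_a", "gene_c", "gene_b"],
]
found = identification_round(history, "gene_a", n=1)
```

Under this sketch, a gene other than the selected gene(s) for which the function returns a round number would count as a false positive.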
Implementing a BOCS Simulation:
To generate large amounts of data on which to benchmark the BOCS algorithm without the need for experimental data, the present inventors built the BOCS algorithm into a simulation. The simulation uses gene sequences from a biomarker database to create k-mer blocks of A-G-C-T content as would be output from high-throughput BOS experiments. The simulated BOS reads are then run through the BOCS algorithm against the biomarker database. The goal of the simulation is to see how well the BOCS algorithm can identify the correct gene (out of all others in the database) using merely randomized k-mer blocks of A-G-C-T content. A specific gene from the database can be pulled or a random gene can be selected. The k-mer block lengths, gene coverage, and the number of errors within the blocks can all be set.
Simulating DNA k-Mer Blocks:
Blocks of DNA k-mer content within the BOCS simulation are generated from one (or more, based on simulation inputs) of the gene sequences within the biomarker database being used. Prior to fragmenting a gene sequence into k-mer blocks, random errors can be added at any specified rate. The gene sequence is split into k-mers based on the set value of k and whether k-mers are to be of constant length or variable length. For the variable length setting, lengths are randomly chosen from a normal distribution centered around the set value for k (with restrictions limiting the length to deviate no more than ±4). Note that the first and last fragments of the gene sequence can deviate from the settings in order to include the entire gene. After errors have been added to the sequence and the gene has been split into k-mers, fractional content for each k-mer is calculated and logged. This process is repeated for however many genes are selected for the analysis and for whatever integer the coverage is set to (for each additional +1× coverage, split locations for the blocks are different). The k-mer block contents for all genes selected for the analysis and all coverages are combined into a single randomized pool to be introduced into the BOCS algorithm. For each repeat simulation, split locations for the k-mer blocks and their randomized ordering will vary.
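The block-generation procedure described above can be sketched as follows, purely for illustration. Several details are assumptions not stated in the text: errors are modeled as single-base substitutions, the standard deviation of the length distribution (sd) is arbitrary, a random first-fragment length is used to shift split locations between coverage passes, and all function names are hypothetical.

```python
import random
from collections import Counter

BASES = "AGCT"

def add_errors(seq, rate, rng):
    """Replace each base with a different random base with probability `rate`
    (substitution errors are an assumption)."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < rate else base
        for base in seq
    )

def split_blocks(seq, k, variable, rng, sd=2.0):
    """Split seq into consecutive blocks of length k, or of lengths drawn
    from a normal distribution around k and limited to k +/- 4."""
    blocks, i = [], 0
    while i < len(seq):
        n = k
        if variable:
            n = min(k + 4, max(k - 4, 1, int(round(rng.gauss(k, sd)))))
        blocks.append(seq[i:i + n])  # the last block may be shorter
        i += n
    return blocks

def fractional_content(block):
    """Fraction of A, G, C, and T in a block."""
    counts = Counter(block)
    return {b: counts[b] / len(block) for b in BASES}

def simulate_pool(genes, k=10, coverage=1, error_rate=0.0, variable=False, seed=0):
    """Build a shuffled pool of k-mer block contents for all genes and coverages."""
    rng = random.Random(seed)
    pool = []
    for gene in genes:
        for _ in range(coverage):
            noisy = add_errors(gene, error_rate, rng)
            offset = rng.randint(1, k)  # assumed: shifts split locations per pass
            pieces = [noisy[:offset]] + split_blocks(noisy[offset:], k, variable, rng)
            pool.extend(fractional_content(p) for p in pieces if p)
    rng.shuffle(pool)
    return pool

pool = simulate_pool(["AGCT" * 10], k=10, coverage=2)
```

Passing a different seed reproduces the behavior in which split locations and block ordering vary between repeat simulations.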
Simulation Inputs/Outputs:
The following inputs can be set and tuned when running the BOCS simulation (see the Supplementary Information for more details):
The BOCS simulation outputs a text file with the following data used for analysis (see the Supplementary Information for more details):
Gene Databases:
The following exemplary gene databases may be applicable to the BOCS system described herein:
Running the BOCS Simulation:
The following options for inputs/settings may be available in certain embodiments of the BOCS system. Within the main text figures, tables are shown summarizing the important inputs that were used for each of the simulations. These include 3, 4, 5, 6, 7, 8, and 9 below. The other inputs are not shown in the main text figures, and are merely user options dictating database options, file locations, output settings, and figure displays for further analysis.
BOCS Output
The following sections may be output in the results .txt file. The .txt files can be analyzed for overall simulation performance and metrics such as coverage at which the selected gene(s) was identified, accuracy, and false positives.
Synthesis of Positively-Charged Silver Nanoparticles (Ag NPs).
The synthesis protocol was adapted from van Lierop et al. Prior to synthesis, all glass vials were soaked overnight in a polyethyleneimine (PEI) solution (0.4% v/v) and then rinsed extensively with ultrapure DI water. For Ag NPs, silver nitrate solution (40 μL, 0.5 M) and spermine tetrahydrochloride solution (14 μL, 0.1 M) were mixed with ultrapure DI water (20 mL) and stirred for 20-30 min in the dark. Sodium borohydride solution (500 μL, 0.01 M) was then spiked into the mixture, with continued stirring for 5-10 min. The Ag NP colloids were allowed to sit overnight in the dark at room temperature, and the sediment at the bottom of the vial was then discarded.
Sample Preparation:
Prior to use, the Ag NPs were cleaned by collection with centrifugation at 9,000 rpm for 10 min, followed by redispersion in ultrapure DI water at half the original volume. Following mixture with DNA/RNA/amino acids (described below), the Ag NPs-analyte solution was centrifuged at 8,500 rpm for 5 min, 4/5 volume of supernatant was removed, and the sedimented sample was resuspended. Specific procedures for the different bio-analytes are described below.
SERS Measurements:
SERS measurements were collected with a 532 nm, 40 mW laser from Thorlabs, Inc. (diode-pumped solid state, operated at 15-20 mW) focused on the colloidal sample through a Zeiss Observer.A1m microscope with a 50× objective. Spectra were collected with a Princeton Instruments Acton SpectraPro SP-2500 spectrometer with a PIXIS 100 CCD camera, using a 30 s exposure time and 10 accumulations.
Signal Processing and Normalization:
Signal processing and normalization, including cosmic ray removal, average smoothing, and baseline subtraction, were performed as described in Korshoj, L. E.; Nagpal, P. Diagnostic Optical Sequencing. ACS Appl. Mater. Interfaces 2019, 11 (39), 35587-35596, the entirety of which (and specifically its materials and methods) is incorporated herein by reference.
Peak Analysis with p-Value Statistics:
The difference in Raman signal between the DNA and RNA nucleobases was quantified with a p-value analysis on the intensity values observed for all distinct signature peaks. To generate p-values, t-tests (two-sample assuming equal variances) were performed with the intensities of each nucleobase Raman signal for each of the signature peaks. For RNA, the p-values for the U signature were generated with a χ² analysis on a combination of two peaks in accordance with Fisher's method.
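For reference, Fisher's method combines n independent p-values into the statistic X = -2 Σ ln(p_i), which follows a χ² distribution with 2n degrees of freedom under the null hypothesis. The sketch below is illustrative only: the function name fisher_combined_p and the example p-values are hypothetical and are not measured values from this study. It uses the closed-form χ² survival function that holds for an even number of degrees of freedom.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Returns (X, combined_p), where X = -2 * sum(ln p_i) is compared against
    a chi-squared distribution with 2n degrees of freedom.
    """
    n = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    half = x / 2.0
    # For even degrees of freedom 2n:
    # P(chi2 > x) = exp(-x/2) * sum_{j=0}^{n-1} (x/2)^j / j!
    combined = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(n))
    return x, combined

# Hypothetical p-values from t-tests on two signature peaks.
statistic, combined_p = fisher_combined_p([0.05, 0.05])
```

With two peaks, the statistic is compared against a χ² distribution with four degrees of freedom, and the combined p-value is smaller than either individual p-value when both peaks show the same moderate evidence.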
Each of the below references is hereby incorporated by reference:
Appl. Mater. Interfaces 2009, 1, 1396.
Prim. 4, 18033 (2018).
This invention was made with support under a grant by the W. M. Keck Foundation, and through the National Science Foundation Soft Materials (MRSEC) at the University of Colorado through NSF Award DMR 1420736, and from the National Science Foundation Graduate Research Fellowship Program under Grant Nos. DGE 1144083 and 1650115. The government has certain rights in the invention.
Number | Date | Country
---|---|---
62775736 | Dec 2018 | US