This application contains a Sequence Listing XML, which has been submitted electronically and is hereby incorporated by reference in its entirety. Said XML Sequence Listing, created on May 7, 2023, is named RICEP0100US.xml and is 7,148 bytes in size.
The disclosure relates generally to the field of biology and signal processing. More particularly, it concerns compressed sensing methods and systems.
As data increasingly informs critical decision-making, efficient signal acquisition frameworks must keep pace. Modern signals of interest are often high-dimensional but can be efficiently recovered by exploiting their underlying structure through signal processing. The field of compressed sensing (CS), reviewed in (Baraniuk, 2007; Eldar & Kutyniok, 2012), focuses on the recovery of sparse signals from fewer measurements than the signal dimension. Concretely, an N-dimensional signal x* with at most k nonzero entries (in which case x* is said to be k-sparse) can be recovered from a measurement vector y acquired by M<N sensors. The sensors' linear responses to entries of x* define a sensing matrix Φ such that, compactly, y=Φx*. Recovering x* from y is known as the single measurement vector (SMV) problem (Chen et al., 1998; Donoho, 2006; Candes & Tao, 2005; Cohen et al., 2009). In the multiple measurement vector (MMV) problem (Chen & Huo, 2005; Chen & Huo, 2006; Tropp et al., 2006), D measurements are captured in an M×D matrix Y in order to recover X*, an N×D signal matrix. X* is jointly sparse such that only k rows contain at least some nonzero elements. CS has been applied extensively in imaging (Duarte et al., 2008; Willett et al., 2011; Lustig et al., 2008) and communications (Li et al., 2013; Zhu & Giannakis, 2011; Ragheb et al., 2008) and only recently in biosensing (Cleary et al., 2017; Koslicki et al., 2014; Aghazadeh et al., 2016; Ghosh et al., 2020).
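For illustration only (not part of the claimed methods), the SMV setup y=Φx* can be sketched in Python: a k-sparse signal is recovered from M&lt;N measurements using a minimal orthogonal matching pursuit (OMP) routine, one standard sparse recovery algorithm. All dimensions and values below are arbitrary assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 40, 20, 3                 # signal dimension, sensors (M < N), sparsity

Phi = rng.standard_normal((M, N)) / np.sqrt(M)     # random sensing matrix
x_true = np.zeros(N)
support_true = rng.choice(N, size=k, replace=False)
x_true[support_true] = rng.uniform(1.0, 2.0, size=k)  # k nonzero entries
y = Phi @ x_true                                   # noiseless measurements

def omp(Phi, y, k):
    """Greedy OMP: repeatedly pick the column most correlated with the residual."""
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        idx = int(np.argmax(np.abs(Phi.T @ residual)))
        support.append(idx)
        # least-squares fit on the selected columns
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat

x_hat = omp(Phi, y, k)
print(np.linalg.norm(x_hat - x_true))   # small: x* recovered from M < N measurements
```

With Gaussian sensing matrices and this modest sparsity level, OMP recovers the true support and coefficients with overwhelming probability.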
Emerging microfluidics technologies in the field of biosensing motivate a new MMV framework. With microfluidics, a single sample can be split into D small-volume partitions such as droplets or nanowells with D on the order of 10³ to 10⁷ (Guo et al., 2012). Microfluidic partitioning captures individual analytes (e.g., cells (Hosokawa et al., 2017; Mazutis et al., 2013); genes (Vogelstein & Kinzler, 1999); proteins (Kim et al., 2012; Yelleswarapu et al., 2019); etc.) in partitions, and analyte quantities across partitions are known to follow a Poisson distribution (Basu, 2017; Moon et al., 2011). The common method to detect a library of analytes with large N is to either dilute samples or split samples into more partitions such that the Poisson distributions reduce to either empty or single-analyte capture, i.e., that columns of X* satisfy ∥x*d∥0 ∈ {0,1} (Mazutis et al., 2013; Velez et al., 2017). This assumption motivates a straightforward N-class classification problem for each non-empty partition, but it necessitates clear separation between classes even under noise, some prior knowledge of sample concentration, and the generation of many wasteful, empty partitions.
The advent of microfluidics in biosensing has enabled portable, cost-effective, and automated assays to be performed on simple chips manufactured with the same platforms that spurred the computing revolution (Luka et al., 2015; Bashir, 2004). However, the core methods of biosensing have largely rested on the paradigm of designing sensors that are each specific to a target analyte. For situations in which many target analytes must be considered, this one-to-one principle scales poorly. Many sensors must be embedded on a single device, samples must be concentrated enough such that a representative subsample can be applied to each sensor, and cross-reactivity of sensors and analytes scales combinatorially (Palka-Santini et al., 2009; MacBeath, 2002). For example, bacterial and fungal infection diagnostics comprise dozens to hundreds of pathogens that may be responsible for a patient's generic symptoms, but samples may exhibit very low concentrations (Sinha et al., 2018; Opota et al., 2015).
Nonspecific or target-agnostic sensing modalities are needed for scalable coverage of many analytes. For example, metagenomic sequencing reads the contents of virtually any sample. Amplicon sequencing is an alternative for microbial diagnostics in both microbiome analysis and infections (Schlaberg, 2020; Wellinghausen et al., 2009). These approaches conduct PCR on ribosomal RNA genes (e.g., 16S for bacteria, 18S or 28S for fungi, among others), which are flanked by conserved regions for priming and exhibit internal sequence differences for taxonomic discrimination (Schlaberg, 2020; Johnson et al., 2019). While sequencing has made strides in clinical practice, its expense and required expertise have hindered its routine use (Schlaberg et al., 2017; Gardy & Loman, 2018), and its low sensitivity is insufficient for some clinical presentations such as bloodstream infections (Moragues-Solanas et al., 2021).
Another category of nonspecific sensing involves “fingerprinting” where a general sensing modality assigns unique signatures to target analytes such that an unknown sample can be read and matched against a database of analytes. Spectroscopic methods are common in this class, profiling wide ranges of proteins, metabolites, and cells. In clinical infections, mass spectrometry has been applied to rRNA amplicons (Ecker et al., 2008) although its use to identify clinical isolates from positive culture is gaining much more traction (Singhal et al., 2015, Ozenci et al., 2018). However, a major limitation of these approaches is the difficulty in analyzing mixtures of analytes (Zhang et al., 2019; Alseekh et al., 2021). For mass spectrometry, this limitation manifests as a need to analyze clinical isolates from polymicrobial samples one at a time.
Microfluidic partitioning technologies offer an avenue for the high throughput, sensitive, and quantitative characterization of heterogeneous samples via a fingerprinting approach (Zhang et al., 2019). These systems split an initial sample into thousands or millions of partitions such as droplets or nanowells (Guo et al., 2012). If the analytes are at a limiting dilution, most partitions will be empty with an occasional analyte isolated in its own partition. Formally, analytes are captured according to a Poisson distribution and the Poisson rate parameter λn for analyte n dictates the probability of capturing nonnegative integer copies (xn) of the analyte such that P(xn) = (λn^xn e^−λn)/xn!.
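For illustration only, Poisson capture at limiting dilution can be simulated in a few lines; the rate value and partition count below are arbitrary assumptions, chosen so that most partitions are empty with occasional single-analyte capture.

```python
import numpy as np
from math import exp

rng = np.random.default_rng(1)
lam, D = 0.1, 100_000                  # illustrative limiting dilution, many partitions
counts = rng.poisson(lam, size=D)      # copies of the analyte captured per partition

frac_empty = np.mean(counts == 0)
frac_single = np.mean(counts == 1)
# empirical fractions match the Poisson pmf P(x) = lam**x * exp(-lam) / x!
print(frac_empty, exp(-lam))           # P(0): most partitions are empty
print(frac_single, lam * exp(-lam))    # P(1): occasional isolated capture
```

At this dilution roughly nine in ten partitions are empty, which is the wastefulness noted above for single-analyte classification schemes.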
Given the probabilistic isolation of individual analytes, nonempty partitions can be classified one at a time against a database. Researchers have demonstrated classification with high resolution melt (HRM) curve analysis of individual 16S gene amplicons captured in droplets (Velez et al., 2017). Also, surface enhanced Raman spectroscopy (SERS) of isolated bacterial cells (Dina et al., 2017) can assign unique spectra to species, and digital SERS with microfluidic capture has been proposed (Zhang et al., 2019). However, all microfluidic partitioning technologies rest on the assumption that droplets must be individually classified. From a data perspective, N analyte classes must have reliable decision boundaries, making these systems highly sensitive to measurement noise (Langouche et al., 2020). Acquiring enough information from each partition for reliable classification limits the throughput of acquiring partition measurements, and therefore, the volume of sample that can be analyzed (Zhu & Fang, 2013). Moreover, in diagnostics, sample concentrations are rarely known a priori. Samples with higher concentrations can result in multianalyte capture in the same partition, causing errors in classification approaches that assume single-analyte capture. Thus, there is an unmet need for CS methods that can generalize the signal recovery strategy when samples are sparse, a common characteristic of biological samples. For example, samples may contain only a few microbes or mutations of interest among many possibilities (Kota et al., 2022; Quan et al., 2018).
In certain embodiments, the present disclosure provides a broad-range sensing method for detecting multiple target analytes in a sample comprising:
In particular aspects, the sensors are nonspecific sensors. In some aspects, the method further comprises the use of specific sensors, such as one or two specific sensors in addition to the nonspecific sensors.
In some aspects, statistical estimation comprises defining an objective function based on a modeled distribution and optimizing the objective by adjusting the inferred analyte quantities or concentrations. These approaches could involve maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP). In some aspects, the quantity of each analyte in each partition is an integer and follows a Poisson distribution with mean given by its true concentration in the sample, wherein the objective function comprises maximum likelihood estimation under the modeled probability distribution and maximum likelihood estimation comprises applying a gradient ascent or descent algorithm. Analyte quantities or concentrations may have a closed form solution, but if not, they can be iteratively solved for by applying a gradient ascent or descent algorithm, such as a SPoRe algorithm comprising:
In some aspects, the partitions are microfluidic partitions.
In some aspects, the method further comprises reading partial or full fingerprints and matching the fingerprints against a database of analytes. In certain aspects, the nonspecific sensors are nucleic acids, primers, probes, antibodies, mass spectra, optical spectra, or phenotypic indicators in various growth conditions. In particular aspects, the nonspecific sensors are nucleotide probes comprising 8-11 nucleotides.
In some aspects, detecting comprises measuring the concentration of said target analytes in said sample. In certain aspects, detecting comprises quantifying the microbial content of the sample.
In certain aspects, the microfluidic partitions comprise droplets, chambers or nanowells. In some aspects, the multiple target analytes in the sample are sparse. In some aspects, the method comprises fewer total sensors than the number of target analytes. For example, in the present studies, five nonspecific probes were used to quantify nine types of bacteria. In certain aspects, the partitions are less than 10 μL in volume. In some aspects, the target analytes are captured in the small volume partitions according to a Poisson distribution with the true Poisson rates of all target analytes totaling to less than 20, i.e., Σn λn* < 20.
In some aspects, the target analytes in the sample are sparse. In certain aspects, the sample is split into 50 to 10⁷ partitions.
In certain aspects, the target analytes comprise whole cells, genes, DNA, RNA, proteins, or metabolites. In some aspects, the partitions initially capture whole cells that are subsequently lysed. In certain aspects, the multiple target analytes comprise microbes or mutations of interest.
In some aspects, the microbes comprise bacterial pathogens or fungal pathogens.
In some aspects, the target analytes comprise ribosomal RNA genes selected from 16S, 18S, 23S, or 28S ribosomal RNA genes or other marker regions such as internal transcribed spacer (ITS) and interspace (IS) regions. In certain aspects, the nonspecific probes are nonspecific hydrolysis probes which react with the ribosomal RNA genes and assign binary barcodes to said ribosomal RNA genes. In some aspects, the nonspecific hydrolysis probes comprise 8-11 nucleotides with locked nucleic acid that bind to the ribosomal RNA genes. In some aspects, each ribosomal RNA gene elicits a binary barcode response to the nonspecific hydrolysis probes based on the presence or absence of the probe sequences in the gene.
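For illustration only, the binary barcoding logic can be sketched as follows: each short nonspecific probe "fires" if its sequence appears in the amplicon, so a panel of five probes assigns each gene a five-bit fingerprint. The probe and gene sequences below are entirely hypothetical, and substring matching is a simplification of hydrolysis probe chemistry.

```python
# hypothetical probe panel and gene fragments (not actual assay sequences)
probes = ["ACGTACGT", "TTGCAAGG", "GGATCCAA", "CCTAGGTT", "AATTCGCG"]
genes = {
    "species_A": "TTTTACGTACGTTTTTGGATCCAATTTT",
    "species_B": "TTTTTTGCAAGGTTTT",
    "species_C": "TTTTACGTACGTTTTCCTAGGTTTTTAATTCGCGTT",
}

def barcode(seq, probes):
    # 1 if the probe subsequence occurs in the gene, else 0
    return tuple(int(p in seq) for p in probes)

fingerprints = {name: barcode(seq, probes) for name, seq in genes.items()}
print(fingerprints)   # distinct binary barcodes distinguish the species
```

Because five binary probes can in principle yield up to 2⁵ − 1 = 31 distinct nonzero barcodes, a small nonspecific panel can cover many more analytes than a one-sensor-per-analyte design.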
In certain aspects, the method comprises splitting a sample into microfluidic partitions in a digital PCR (dPCR) system and reading signatures from each partition after performing PCR to detect the presence or absence of target nucleic acids.
In some aspects, the digital PCR readout comprises continuously valued, raw fluorescence measurements that characterize the intensity of the response of the nucleic acid fragments in the partition to the sensors. In certain aspects, the digital PCR reaction completes and a melt curve is evaluated from each partition to characterize the dissociation of nonspecific probes, including unstructured and structured probe types such as molecular beacons and hairpins, from the PCR products.
In some aspects, the target analytes comprise microbes associated with urinary tract infection, bacterial biofilms in chronic wounds, sepsis, meningitis, or a human microbiome. For example, the bacteria may comprise E. coli, S. aureus, K. pneumoniae, S. pneumoniae, A. baumannii, or Lactobacillus spp. In some aspects, the target analytes are the set of all relevant bacteria and fungi that may appear in a sample for the application such as the relevant organisms in the human gut, vaginal, oral, or other microbiota. In certain aspects, the target analytes include a priority set of bacteria and fungi that exhibit fingerprints that are different from other application-relevant bacteria and fungi, and in which samples containing bacteria and fungi beyond the priority set are flagged as indeterminate.
A further embodiment provides a method for evaluating a solution (e.g., an abundance profile of analytes) in a sample and evaluating pass/fail criteria for the abundance results comprising:
An additional embodiment provides a method for evaluating a candidate solution for the target analyte quantities or concentrations in a sample that has been split into multiple partitions and reporting whether the sample contains an exogenous analyte beyond the set of target analytes comprising:
In some aspects, the comparison of distributions is based on a chi-square goodness-of-fit test.
In certain aspects, all individual measurements can be adequately explained by the given panel of potential analytes, but the observed distribution may be sufficiently different from the expected distribution based on the forward model. In some aspects, the distributions are compared using a chi-square goodness-of-fit test.
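For illustration only, such a goodness-of-fit check can be sketched as follows: observed partition counts are binned by outcome and compared against the counts expected under a fitted Poisson forward model. The bin counts and fitted rate below are arbitrary assumptions, constructed so that an excess of multi-analyte partitions inflates the statistic.

```python
import numpy as np
from math import exp, factorial

observed = np.array([9050, 860, 70, 20])   # partitions per outcome bin (illustrative)
D = observed.sum()
lam = 0.1                                  # fitted total Poisson rate (illustrative)

# expected bin probabilities under Poisson capture: P(0), P(1), P(2), P(>=3)
p = [exp(-lam) * lam**k / factorial(k) for k in range(3)]
p.append(1.0 - sum(p))
expected = D * np.array(p)

chi2 = float(np.sum((observed - expected) ** 2 / expected))
dof = len(observed) - 1
print(chi2, dof)   # a statistic far above dof flags a poor model fit
```

A statistic far exceeding the chi-square critical value for the given degrees of freedom would flag the sample as containing an exogenous analyte or otherwise violating the forward model.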
It is contemplated that any method or composition described herein can be implemented with respect to any other method or composition described herein. For example, a compound synthesized by one method may be used in the preparation of a final compound according to a different method.
Other objects, features and advantages of the present disclosure will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating specific embodiments of the disclosure, are given by way of illustration only, since various changes and modifications within the spirit and scope of the disclosure will become apparent to those skilled in the art from this detailed description.
The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
Current diagnostic methods use specific sensors (e.g., primers/probes, antibodies, etc.) for detecting analytes (e.g., whole cells, target DNA/RNA, protein, etc.). These methods are specific and are designed to sense one analyte. Broad-range sensing can use a mechanism to detect multiple analytes, such as sample culture, sequencing, spectrometry, and melt curve analysis. There is an unmet need to extract more information from simple sensors for a rapid, efficient method for broad-range sensing.
A few nonspecific sensors can efficiently assign fingerprints to many analytes, and compressed sensing can unmix fingerprints to quantify analytes if samples are sparse. With microfluidic partitioning, analytes are captured according to a Poisson distribution. At times, the number of sensors required for distinguishing analyte fingerprints exceeds what can be resolved from single individual reactions. Thus, in certain embodiments, the present methods provide a framework for expanded multiplexing through asynchronous fingerprinting.
Compressed sensing (CS) problems typically follow the single-measurement vector (SMV) formulation (
Given Φ and having measured y, CS solves for x by accounting for assumptions on its structure. Most commonly, x is assumed to be sparse (most of its elements are zero) such that sparse recovery algorithms can recover x from y and Φ.
In the Multiple Measurement Vector (MMV) framework, a set of signals with the same support must be recovered from their corresponding measurements. The present studies demonstrate the first exploration of the MMV problem where signals are independently drawn from a sparse, multivariate Poisson distribution. The method may be applied to a suite of biosensing applications of microfluidics where analytes (such as whole cells or biomarkers) are captured in small volume partitions according to a Poisson distribution. The sparse parameter vector of Poisson rates is recovered through maximum likelihood estimation with the present novel Sparse Poisson Recovery (SPoRe) algorithm. SPoRe uses batch stochastic gradient ascent enabled by Monte Carlo approximations of otherwise intractable gradients. By uniquely leveraging the Poisson structure, SPoRe substantially outperforms a comprehensive set of existing and custom baseline CS algorithms. Notably, SPoRe can exhibit high performance even with one-dimensional measurements and high noise levels. This resource efficiency is not only unprecedented in the field of CS but is also particularly potent for applications in microfluidics in which the number of resolvable measurements per partition is often severely limited. The present studies showed the identifiability property of the Poisson model under such lax conditions, analytically developed insights into system performance, and confirmed these insights in simulated experiments. The present methods can be applied to biosensing and are generalizable to other applications featuring spatial and temporal Poisson signals.
Thus, in certain embodiments, the present methods comprise the use of a recovery algorithm that applies additional prior knowledge beyond sparsity for specialized scenarios, depicted in
MMV problems are well studied, but the present setting is a special case in which the nonzero elements in X also follow a Poisson distribution in each row. Each row Xn is described by its own Poisson parameter λn. At its simplest, this parameter represents the average value in any given row, but it also parametrizes the probability that a particular element Xn,d takes on the integer value x: P(Xn,d = x) = (λn^x e^−λn)/x!.
It can be assumed that each signal vector (column of X) is independently and identically distributed; this assumption makes the probability independent of the particular signal index d.
The present algorithm does not recover X but rather finds the Poisson rate vector λ that characterizes X (
To find λ, a maximum likelihood estimation (MLE) problem was set up: the λ that maximizes the likelihood of the measurements (Y) given Φ and the Poisson constraints was sought. An estimate of λ can be iteratively adjusted with stochastic gradient ascent (equivalently, descent on the negative log-likelihood) by processing batches of measurement vectors from Y. The present studies describe the mathematical details, the theory, and simulated results. The simulations indicated very high noise robustness and far better measurement efficiency (i.e., fewer required measurements, M) than conventional SMV CS. Another finding in the theory is that while signal sparsity is helpful in achieving stable recovery, it is not necessary for estimating λ, broadening the viable application of the present embodiments.
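For illustration only, the MLE principle above can be sketched on a tiny problem. SPoRe itself uses Monte Carlo approximations of otherwise intractable gradients; the sketch below instead enumerates a small count grid exactly, with one sensor (M=1), two analytes, noiseless measurements, and arbitrary illustrative rates. The key identity is that the gradient of log p(y|λ) with respect to λn is E[xn|y,λ]/λn − 1.

```python
import numpy as np
from itertools import product
from math import exp, factorial

rng = np.random.default_rng(2)
Phi = np.array([[1.0, 2.0]])            # one sensor, two analytes (illustrative)
lam_true = np.array([0.3, 0.1])         # true Poisson rates (illustrative)
D, xmax = 5000, 6                       # partitions; enumeration cap on counts

X = rng.poisson(lam_true, size=(D, 2))  # Poisson counts per partition
Y = (X @ Phi.T)[:, 0]                   # noiseless scalar measurements

grid = np.array(list(product(range(xmax), repeat=2)))   # candidate count vectors
grid_y = (grid @ Phi.T)[:, 0]

def joint_pmf(x, lam):
    # product of independent Poisson pmfs across analytes
    return np.prod([exp(-l) * l**int(xi) / factorial(int(xi))
                    for xi, l in zip(x, lam)])

vals, cnts = np.unique(Y, return_counts=True)
lam = np.array([0.5, 0.5])              # arbitrary initialization
for _ in range(300):                    # gradient ascent on the log-likelihood
    grad = np.zeros(2)
    for y, c in zip(vals, cnts):
        mask = np.isclose(grid_y, y)    # counts consistent with this measurement
        probs = np.array([joint_pmf(x, lam) for x in grid[mask]])
        if probs.sum() == 0:
            continue
        Ex = (grid[mask] * probs[:, None]).sum(axis=0) / probs.sum()
        grad += c * (Ex / lam - 1.0)    # d/dlam log p(y|lam) = E[x|y]/lam - 1
    lam = np.maximum(lam + 0.05 * grad / D, 1e-6)

print(lam)   # approaches lam_true despite M=1 and overlapping measurement values
```

Even though a single sensor cannot disambiguate individual partitions (e.g., y=2 is consistent with two copies of analyte 1 or one copy of analyte 2), the rates λ remain identifiable from the distribution of measurements across many partitions, mirroring the identifiability finding described above.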
As used herein, “essentially free,” in terms of a specified component, is used herein to mean that none of the specified component has been purposefully formulated into a composition and/or is present only as a contaminant or in trace amounts. The total amount of the specified component resulting from any unintended contamination of a composition is therefore well below 0.05%, preferably below 0.01%. Most preferred is a composition in which no amount of the specified component can be detected with standard analytical methods.
The term “essentially” is to be understood to mean that methods or compositions include only the specified steps or materials and those that do not materially affect the basic and novel characteristics of those methods and compositions.
The terms “substantially” or “approximately” as used herein may be applied to modify any quantitative comparison, value, measurement, or other representation that could permissibly vary without resulting in a change in the basic function to which it is related.
As used herein the specification, “a” or “an” may mean one or more. As used herein in the claim(s), when used in conjunction with the word “comprising,” the words “a” or “an” may mean one or more than one.
The use of the term “or” in the claims is used to mean “and/or” unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and “and/or.” As used herein “another” may mean at least a second or more.
The term “about” means, in general, within a standard deviation of the stated value as determined using a standard analytical technique for measuring the stated value. The term can also refer to plus or minus 5% of the stated value.
The phrase “effective amount” or “therapeutically effective” means a dosage of a drug or agent sufficient to produce a desired result. The desired result can be subjective or objective improvement in the recipient of the dosage, increased lung growth, increased lung repair, reduced tissue edema, increased DNA repair, decreased apoptosis, a decrease in tumor size, a decrease in the rate of growth of cancer cells, a decrease in metastasis, or any combination of the above.
“Subject” and “patient” refer to either a human or non-human, such as primates, mammals, and vertebrates. In particular embodiments, the subject is a human.
As used herein, the terms “treat,” “treatment,” “treating,” or “amelioration” when used in reference to a disease, disorder or medical condition, refer to therapeutic treatments for a condition, wherein the object is to reverse, alleviate, ameliorate, inhibit, slow down or stop the progression or severity of a symptom or condition. The term “treating” includes reducing or alleviating at least one adverse effect or symptom of a condition. Treatment is generally “effective” if one or more symptoms or clinical markers are reduced. Alternatively, treatment is “effective” if the progression of a condition is reduced or halted. That is, “treatment” includes not just the improvement of symptoms or markers, but also a cessation or at least slowing of progress or worsening of symptoms that would be expected in the absence of treatment. Beneficial or desired clinical results include, but are not limited to, alleviation of one or more symptom(s), diminishment of extent of the deficit, stabilized (i.e., not worsening) state of a tumor or malignancy, delay or slowing of tumor growth and/or metastasis, and an increased lifespan as compared to that expected in the absence of treatment.
A “nucleic acid molecule” or “nucleic acid sequence” refers to any single-stranded or double-stranded nucleic acid molecule including standard canonical bases, hypermodified bases, non-natural bases, or any combination of the bases thereof. For example and without limitation, the nucleic acid molecule contains the four canonical DNA bases—adenine, cytosine, guanine, and thymine, and/or the four canonical RNA bases—adenine, cytosine, guanine, and uracil. Uracil can be substituted for thymine when the nucleoside contains a 2′-deoxyribose group. The nucleic acid molecule can be transformed from RNA into DNA and from DNA into RNA. For example, and without limitation, mRNA can be created into complementary DNA (cDNA) using reverse transcriptase and DNA can be created into RNA using RNA polymerase. “Analogous” forms of purines and pyrimidines are well known in the art, and include, but are not limited to aziridinylcytosine, N4-acetylcytosine, 5-fluorouracil, 5-bromouracil, 5-carboxymethylaminomethyl-2-thiouracil, 5-carboxymethylaminomethyluracil, inosine, N6-isopentenyladenine, 1-methyladenine, 1-methylpseudouracil, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-methyladenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-d-mannosylqueosine, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid methyl ester, pseudouracil, queuosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid, and 2,6-diaminopurine. 
The nucleic acid molecule can also contain one or more hypermodified bases, for example and without limitation, 5-hydroxymethyluracil, 5-hydroxyuracil, α-putrescinylthymine, 5-hydroxymethylcytosine, 5-hydroxycytosine, 5-methylcytosine, 5-methylcytosine, 2-aminoadenine, α-carbamoylmethyladenine, N6-methyladenine, inosine, xanthine, hypoxanthine, 2,6-diaminopurine, and N7-methylguanine. The nucleic acid molecule can also contain one or more non-natural bases, for example and without limitation, 7-deaza-7-hydroxymethyladenine, 7-deaza-7-hydroxymethylguanine, isocytosine (isoC), 5-methylisocytosine, and isoguanine (isoG).
The nucleic acid molecule containing only canonical, hypermodified, non-natural bases, or any combinations the bases thereof, can also contain, for example and without limitation where each linkage between nucleotide residues can consist of a standard phosphodiester linkage, and in addition, may contain one or more modified linkages, for example and without limitation, substitution of the non-bridging oxygen atom with a nitrogen atom (i.e., a phosphoramidate linkage, a sulfur atom (i.e., a phosphorothioate linkage), or an alkyl or aryl group (i.e., alkyl or aryl phosphonates), substitution of the bridging oxygen atom with a sulfur atom (i.e., phosphorothiolate), substitution of the phosphodiester bond with a peptide bond (i.e., peptide nucleic acid or PNA), or formation of one or more additional covalent bonds (i.e., locked nucleic acid or LNA), which has an additional bond between the 2′-oxygen and the 4′-carbon of the ribose sugar. The term “2′-deoxyribonucleic acid molecule” means the same as the term “nucleic acid molecule” with the limitation that the 2′-carbon atom of the 2′-deoxyribose group contains at least one hydrogen atom. The term “ribonucleic acid molecule” means the same as the term “nucleic acid molecule” with the limitation that the 2′-carbon atom of the ribose group contains at least one hydroxyl group.
“Contacting” means a process whereby a substance is introduced by any manner to promote an interaction with another substance.
“Detecting a nucleic acid molecule” means using an analytical method that can determine the presence and/or quantity of the nucleic acid of interest or that can determine more detailed information regarding the nucleic acid sequence, alterations of a nucleic acid sequence when compared with a reference sequence, or the presence or absence of one or more copies of the nucleic acid sequence.
The term “label” or “probe sequence” as used herein means a molecule or moiety having a property or characteristic which is capable of detection. A label or probe sequence can be directly detectable, as with, for example, radioisotopes, fluorophores, chemiluminophores, enzymes, colloidal particles, and fluorescent microparticles; or a label may be indirectly detectable, as with, for example, specific binding members. It will be understood that directly detectable labels may require additional components such as, for example, substrates, triggering reagents, and light to enable detection of the label. When indirectly detectable labels are used, they are typically used in combination with a “conjugate.” A conjugate is typically a specific binding member which has been attached or coupled to a directly detectable label. Coupling chemistries for synthesizing a conjugate are well known in the art and can include, for example, any chemical means and/or physical means that does not destroy the specific binding property of the specific binding member or the detectable property of the label.
“Hybridize” or “hybridization” refers to the binding between nucleic acids. The conditions for hybridization can be varied according to the sequence homology of the nucleic acids to be bound. Thus, if the sequence homology between the subject nucleic acids is high, stringent conditions are used. If the sequence homology is low, mild conditions are used. When the hybridization conditions are stringent, the hybridization specificity increases, and this increase of the hybridization specificity leads to a decrease in the yield of non-specific hybridization products.
However, under mild hybridization conditions, the hybridization specificity decreases, and this decrease in the hybridization specificity leads to an increase in the yield of non-specific hybridization products.
A “nucleotide probe” or “probes” refers to a polynucleotide that is at least eight (8) nucleotides in length and which forms a hybrid structure with a target sequence, due to complementarity of at least one sequence in the probe with a sequence in the target region. The polynucleotide can be composed of DNA and/or RNA or any of the analogous or modified bases described above. Probes in certain embodiments, are detectably labeled. Probes can vary significantly in size. Generally, probes are, for example, at least 8 to 15 nucleotides in length. Other probes are, for example, at least 20, 30 or 40 nucleotides long. Still other probes are somewhat longer, being at least, for example, 50, 60, 70, 80, or 90 nucleotides long. Probes can be of any specific length that falls within the foregoing ranges as well. Preferably, the probe does not contain a sequence complementary to the sequence(s) used to prime for a target sequence during the polymerase chain reaction.
“Oligonucleotide” or “polynucleotide” refers to a polymer of a single-stranded or double-stranded deoxyribonucleotide or ribonucleotide, which may be unmodified RNA or DNA or modified RNA or DNA.
A “modified ribonucleotide” or deoxyribonucleotide refer to molecules that can be used in place of naturally occurring bases in nucleic acid and includes, but is not limited to, modified purines and pyrimidines, minor bases, convertible nucleosides, structural analogs of purines and pyrimidines, labeled, derivatized and modified nucleosides and nucleotides, conjugated nucleosides and nucleotides, sequence modifiers, terminus modifiers, spacer modifiers, and nucleotides with backbone modifications, including, but not limited to, ribose-modified nucleotides, phosphoramidates, phosphorothioates, phosphonamidites, methyl phosphonates, methyl phosphoramidites, methyl phosphonamidites, 5′-p-cyanoethyl phosphoramidites, methylenephosphonates, phosphorodithioates, peptide nucleic acids, achiral and neutral internucleotidic linkages.
A “primer” or “primer sequence” refers to an oligonucleotide that hybridizes to a target nucleic acid sequence (for example, a DNA template to be amplified) to prime a nucleic acid synthesis reaction. The primer may be a DNA oligonucleotide, an RNA oligonucleotide, or a chimeric sequence. The primer may contain natural, synthetic, or modified nucleotides. Both the upper and lower limits of the length of the primer are empirically determined. The lower limit on primer length is the minimum length that is required to form a stable duplex upon hybridization with the target nucleic acid under nucleic acid amplification reaction conditions. Very short primers (usually less than 3-4 nucleotides long) do not form thermodynamically stable duplexes with target nucleic acid under such hybridization conditions. The upper limit is often determined by the possibility of having a duplex formation in a region other than the pre-determined nucleic acid sequence in the target nucleic acid. Generally, suitable primer lengths are in the range of about 10 to about 40 nucleotides long. In certain embodiments, for example, a primer can be 10-40, 15-30, or 10-20 nucleotides long. A primer is capable of acting as a point of initiation of synthesis on a polynucleotide sequence when placed under appropriate conditions.
“Detection,” “detectable” and grammatical equivalents thereof refer to ways of determining the presence and/or quantity and/or concentration and/or identity of an analyte. In some embodiments, detection occurs during or after the amplification of the target nucleic acid sequence. In other embodiments, sequencing of the target nucleic acid can be characterized as “detecting” the target nucleic acid. A label attached to the probe can include any of a variety of different labels known in the art that can be detected by, for example, chemical or physical means.
Labels that can be attached to probes may include, for example, fluorescent and luminescence materials.
“Amplifying,” “amplification,” and grammatical equivalents thereof refer to any method by which at least a part of a target nucleic acid sequence is reproduced in a template-dependent manner, including without limitation, a broad range of techniques for amplifying nucleic acid sequences, either linearly or exponentially. Exemplary means for performing an amplifying step include ligase chain reaction (LCR), ligase detection reaction (LDR), ligation followed by Q-replicase amplification, PCR, primer extension, strand displacement amplification (SDA), hybridization chain reaction (HCR), hyperbranched strand displacement amplification, multiple displacement amplification (MDA), nucleic acid strand-based amplification (NASBA), two-step multiplexed amplifications, rolling circle amplification (RCA), recombinase-polymerase amplification (RPA) (TwistDx, Cambridge, UK), and self-sustained sequence replication (3SR), including multiplex versions or combinations thereof, for example but not limited to, OLA/PCR, PCR/OLA, LDR/PCR, PCR/PCR/LDR, PCR/LDR, LCR/PCR, PCR/LCR (also known as combined chain reaction, CCR), and the like. Descriptions of such techniques can be found in, among other places, Sambrook et al., Molecular Cloning, 3rd Edition.
The term “analyte” refers to the unit intended to be detected. In some embodiments, it includes nucleic acids such as a gene, gene region, or whole genome. In other embodiments, an analyte may include a polypeptide, metabolite, unique cell state, or an organism or group of organisms belonging to a particular taxonomic grouping.
The term “signal” refers to the true analyte quantities or concentrations, in a single reaction (e.g., a microfluidic partition) or the whole sample, that are inferred based on the acquired measurements. In our mathematical notation, the signal consists of N analyte quantities, concentrations, or Poisson parameters.
The term “sparse” refers to signals with relatively few nonzero elements. Particularly, an N-dimensional signal that characterizes the quantities or concentrations of N analytes is said to be k-sparse if N-k or more of its elements are zero or approximately zero. In some embodiments, many (N) analytes must be accounted for in a sample, but it can be assumed that any given sample only has a small subset (k or fewer) of the possible analytes. In these cases, the sparse signal is the analyte concentrations or quantities.
The term “measurement” refers to a unit of information directly acquired from each reaction (e.g., a microfluidic partition) represented by a scalar value.
The term “sensing channel” refers to a system component that translates the output of sensors into a measurement. In some embodiments, it is an optical wavelength band (e.g., from fluorescence, luminescence, reflectance, etc.) that captures the intensity of acquired light within that band. In similar embodiments, it may refer to any unit or range of physical properties (e.g., mass, mass-to-charge, temperature, etc.) that captures the absolute or relative abundance of analytes in the reaction at that unit or within the range. It may also be agnostic to a particular band of information, capturing the total incident light, mass, charge, voltage, current, or other physical or electrical unit. In other embodiments, a sensing channel is defined by its physical location; for instance, a sensor may be immobilized on an array such that the location is linked to the sensor's identity, and the resulting measurement corresponds with that location. In other embodiments, a sensing channel is defined by the unit of time at which the measurement is captured. In still other embodiments, it may represent a component of an arbitrarily long nucleic acid sequence read from a sequencing instrument. For example, the sensing channel may report the quantity of a certain subsequence among one or more sequencing reads (Koslicki et al., 2014).
The term “sensor” refers to that which captures information from a sample based on the analyte content and transduces it to a sensing channel. In many embodiments, the sensor and sensing channel are directly linked; that which captures the information, e.g., a photodetector capturing light is directly connected to a circuit that digitizes the value to translate the photons into a measurement. In optical spectrometry, a sensor reports the intensity of light corresponding with a wavelength or band of wavelengths. In some embodiments, a sensor could be a biosensor (nucleic acid probes, antibodies, etc.) that is optically labeled (e.g., with a fluorophore) such that its interactions with analytes result in a change in the light emitted within a certain optical wavelength band (the sensing channel). In some embodiments involving composite measurements (Cleary et al., 2017), a sensor could be a set of multiple biosensors that each target one or more separate analytes but are tagged with the same label such that all biosensors in the set report to the same sensing channel. The resulting measurement is the aggregate response of all sensors. The term “nonspecific sensor” refers to a sensor that is intentionally designed to report a nonzero measurement value through a sensing channel in response to more than one particular analyte such that a nonzero measurement does not indicate the presence or quantity of any single analyte with high probability. It could include nucleotide probes that bind to more than one type of genetic biomarker or organism, units of a spectral measurement system (wavelength, mass, temperature, etc.) that are nonunique to any single molecule or organism, or nonunique nucleotide sequences acquired by a sequencing instrument. It could also be a composite measurement as discussed above.
In other embodiments, it may be a cell growth or viability indicator for a cell or population of cells in a particular environment.
The term “fingerprint” of an analyte refers to the unique and expected measurements from a set of sensors measuring one copy, isolate, molecule, or unit of the analyte. By unique, it is meant that all analytes under consideration exhibit a distinct fingerprint. In some embodiments, a fingerprint is a single vector representing the expected measurements containing the same number of elements as the number of sensors. In some embodiments, the analyte has multiple states such that the fingerprint is a mixture distribution of vectors of expected measurements. For instance, in
The term “small volume” refers to relatively small fractions of an initial sample volume.
In some embodiments, an initial sample is split via microfluidics into hundreds of partitions or more such that each “small volume” partition is 1% or less of the initial sample volume. Although it could depend on the specific system in question, partition volumes are generally 10 μL or less.
In the specific studies below with digital PCR, partition volumes were 1 nL with initial sample volumes of 20 μL.
The term “fingerprinting” refers to a method for detecting analytes in which a set of sensors assigns unique fingerprints to all analytes under consideration, and the fingerprints are stored in a reference database.
The term “asynchronous fingerprinting” refers to a type of fingerprinting method for detecting analytes in which all of the sensors required for uniquely assigning fingerprints to the analytes under consideration are not applied simultaneously in any given reaction volume. Instead, a sample is split into multiple subsamples (or “groups”) with subsets of the sensors applied in each group. Asynchronous fingerprinting may generally be used when the number of sensing channels available for a given reaction volume is outnumbered by the number of sensors necessary for fingerprinting. For instance, in a microfluidic partition with a single copy of an analyte, the entire analyte fingerprint is not measured at once for the single copy. In some embodiments, this involves P nonspecific probes that assign fingerprints to genetic biomarkers, where the system acquires data from fewer than P (e.g., M) probes at a time. In the in vitro studies below, P=5 probes assign fingerprints to nine bacterial taxonomic groups, but subsets of M=2 probes are measured in four subsamples, wherein each subsample is split into thousands of microfluidic partitions (droplets).
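The asynchronous fingerprinting scheme above can be sketched in a few lines. The probe-to-group assignment below is illustrative, not the assignment used in the in vitro studies; it only demonstrates the constraint that each group measures at most M of the P probes while every probe is covered by some group.

```python
# Illustrative sketch of asynchronous fingerprinting group design
# (the specific probe-to-group assignment is hypothetical).
P = 5   # total probes required for unique fingerprints
M = 2   # sensing channels (probes measurable) per reaction
G = 4   # number of subsamples ("groups")

# One hypothetical assignment of probes (indexed 0..4) to groups:
groups = [(0, 1), (1, 2), (2, 3), (3, 4)]

# Every probe must be measured in at least one group so that the full
# fingerprint can be reassembled across groups.
covered = set(p for g in groups for p in g)
assert covered == set(range(P))

# Each group respects the per-reaction channel limit.
assert all(len(g) <= M for g in groups)
```

Measurements from all groups are then pooled so that, statistically, the full P-dimensional fingerprint of each analyte is recovered even though no single partition observes more than M probes.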
In other embodiments where the fingerprint is spectrally based, asynchronous fingerprinting refers to capturing portions of the spectra in each group. In another embodiment, the fingerprint may be a signature response of a cell type to multiple environments. Each group could feature a different set of reaction conditions (e.g., growth media) such that a cell type (e.g., microbial species) has a fingerprint combination of responses to all tested environments, but only one environment can be applied at a time in any given group.
The term “concentration” of an analyte is used as a general term for the number of analyte copies in a unit of volume, a standard definition to one skilled in the art. As a special case use of the term, when Poisson statistics are assumed such as in the distribution of capture of analyte copies across microfluidic partitions, the Poisson rate parameter (λ) of an analyte is often used interchangeably with the analyte's “concentration.”
In certain embodiments, the present methods comprise use of an algorithm that recovers sparse, high-dimensional Poisson signals from low-dimensional measurements. Further embodiments provide use of the algorithm for biomedical applications. Biomedical technologies are increasingly applying “digital microfluidics” to capture analytes (e.g., cells) in thousands of individual chambers for analysis. The count of analytes in each chamber follows a Poisson distribution. Analytes are considered to be sparse if there are only a few unique analytes (e.g., cell types) in a given sample relative to the library of possible analytes. Rather than quantifying each analyte with its own specific sensor, a few sensors may be utilized that broadly react with the library of analytes to give each analyte a fingerprint response represented by a low-dimensional vector. Low-dimensional measurements are collected from all chambers, and Poisson statistics are accounted for to quantify the concentrations of the few unique analytes in a given sparse sample.
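As an illustrative sketch of this setup, the snippet below simulates Poisson capture of a sparse analyte panel across chambers and the resulting low-dimensional measurements. All dimensions, concentrations, and the sensor matrix are hypothetical.

```python
# A minimal sketch of Poisson analyte capture across microfluidic chambers,
# assuming a hypothetical library of N analytes of which only k have nonzero
# concentration (the sparse Poisson parameter vector lam).
import numpy as np

rng = np.random.default_rng(0)

N, k, D = 10, 2, 5000            # library size, sparsity, number of chambers
lam = np.zeros(N)
lam[[1, 7]] = [0.05, 0.02]       # only k analytes present (copies per chamber)

# Each chamber's analyte counts are independent Poisson draws.
X = rng.poisson(lam, size=(D, N))            # D x N count matrix

# A hypothetical M x N nonspecific sensor matrix maps counts to M
# low-dimensional measurements per chamber (M << N).
M = 3
Phi = rng.uniform(0, 1, size=(M, N))
Y = X @ Phi.T                                # D x M measurement matrix
# Absent analytes never appear (Poisson(0) is identically 0), and the
# empirical mean count per analyte approaches lam as D grows.
```

The recovery task is then to infer the sparse vector lam from the D low-dimensional rows of Y, exploiting the Poisson structure of the counts.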
Emerging biomedical applications of “digital microfluidics” (DM) technologies enable researchers and clinicians to capture individual analytes (e.g., cells) from patient or environmental samples in thousands of chambers, and the rate of cell capture follows a Poisson distribution. The present algorithm may be used for diagnostic and prognostic applications where relatively few unique cell types are present (i.e., samples are sparse). For example, the present algorithm may be applied to clinical infection diagnostics where only one or a few unique pathogens are responsible for any given patient's condition.
DM is well known to be capable of single-cell resolution, an advantage in infection diagnostics where samples may contain less than ten copies (colony forming units, CFU) of a target pathogen (Table 1). Moreover, background contamination such as a human cell is naturally isolated in only one sample compartment. Any interference from contamination corrupts a single measurement among thousands in the MMV DM setup but can easily invalidate entire SMV “test tube” assays.
Despite these advantages, DM's utility in infection diagnostics has two limiting assumptions. First, like most rapid diagnostic assays, DM uses highly specific sensors to sense the presence or absence of target analytes in each chamber, which severely limits the scope of assays. Because of these sensor limitations, DM platforms can generally screen only a handful of analytes (e.g., genes, cell types) per sample, while infections can be caused by any of dozens to hundreds of pathogens. A framework that leverages CS enables exponential measurement efficiency for broad pathogen coverage. The present methods introduce Poisson statistical constraints to CS such that even a single sensor can distinguish between cell types.
Second, samples in DM are assumed to be dilute enough for single-cell capture (Σn λn* ≤ 0.1). More concentrated samples result in frequent multi-cell capture, which makes downstream quantification difficult. The present algorithm can handle much higher concentrations. Achieving single-cell characterization while tolerating higher loading capacities allows for infection diagnostics when the pathogen concentration in a patient sample is unknown. Table 1 shows that many relevant clinical scenarios present with vastly different concentrations. The present algorithm enables at least two orders of magnitude of improvement in dynamic range, the range of concentrations that can be covered in a single assay.
Nucleic acid (DNA or RNA) sequencing is the only competing technology that offers better breadth of coverage: virtually any organism can be identified with a single technology coupled with ever-growing public databases. However, sequencing still suffers from cost and time limitations that preclude its routine use in diagnostics. Importantly, run time and cost scale with sample concentration such that low concentrations result in very long times to results, hindering the time-sensitive diagnosis of infections to inform treatment selection.
In some embodiments, the present algorithm may be used for detecting ribosomal RNA, such as 16S rRNA. The 16S rRNA gene is shared by almost all bacteria but has internal sequence differences based on the bacterial genus or species. Usually, DNA sequencing of 16S genes is used to read them and match them to databases. The present methods can be used to skip the need for full DNA sequencing by probing particular “keywords” or “barcodes” of the gene (e.g., 8-11 nucleotides). These keywords assign barcode signatures to different 16S genes based on the presence or absence of the keyword in the gene. Bacteria have multiple 16S copies that are largely identical but occasionally have small differences. Thus, in some aspects, each bacterium does not get just one barcode, but a distribution of barcodes (i.e., a barcode signature) elicited by the probes. For example, 5 copies, 1 of which is a variant, lead to 80% of gene copies sharing one barcode while the variant (20% of copies) has a different barcode. The present studies use short (e.g., 8-11 nucleotide) hydrolysis probes spiked with locked nucleic acids. The probes may alternatively comprise a molecular beacon (Chakravorty et al., Rapid universal identification of bacterial pathogens from clinical cultures by using a novel sloppy molecular beacon melting temperature signature technique, J. Clin. Microbiol., 2010).
In some aspects, each target analyte (e.g., bacterium) is not given a single barcode; instead, the cluster positions of each bacterium are calibrated in the fluorescence channels (e.g., with a multivariate Gaussian fit). This method can be used to account for partial, non-binary hits and to capture more information with a few measurement channels, rather than hard thresholding and binarizing the data. For example, each bacterium has multiple 16S copies and can be modeled as a mixture distribution based on the responses of individual copies.
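A minimal sketch of this calibration approach follows, assuming hypothetical two-channel fluorescence data and a single Gaussian per analyte (rather than the full mixture over 16S copies). Cluster means and covariances are fit from labeled calibration droplets, and a new droplet is scored by Gaussian likelihood instead of hard thresholding.

```python
# Hedged sketch: calibrate per-analyte cluster positions in fluorescence
# space with a multivariate Gaussian fit, then classify by likelihood.
# All data below are simulated and hypothetical.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibration data: 2 fluorescence channels, 2 bacteria.
true_means = {"A": np.array([1.0, 3.0]), "B": np.array([3.0, 1.0])}
calib = {name: mu + 0.1 * rng.standard_normal((200, 2))
         for name, mu in true_means.items()}

# Fit mean and covariance for each calibrated cluster.
fits = {name: (pts.mean(axis=0), np.cov(pts.T)) for name, pts in calib.items()}

def log_gauss(y, mu, cov):
    # Log density of a multivariate normal at y.
    d = y - mu
    return -0.5 * (d @ np.linalg.solve(cov, d)
                   + np.log(np.linalg.det(cov)) + len(y) * np.log(2 * np.pi))

def classify(y):
    # Assign the droplet to the analyte with the highest Gaussian likelihood.
    return max(fits, key=lambda n: log_gauss(y, *fits[n]))

label = classify(np.array([1.05, 2.9]))   # a droplet near cluster "A"
```

Extending this sketch to the mixture case amounts to replacing each single Gaussian with a weighted sum of Gaussians, one component per barcode in the analyte's signature.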
A. Scalable Microbial Sensing with Microfluidic Partitioning
In certain embodiments, the present methods provide a new framework for microbial diagnostics that enables the quantification of large panels of microbes with fewer sensors than ever before. This resource efficiency could enable routine infection diagnostics at the point-of-care with same-day results, informing the best treatment course for patients and avoiding the unnecessary use of antibiotics. For example, target analytes may be useful in infection diagnostics of the urinary tract infection and chronic wounds, sepsis diagnostics and microbiome profiling.
The vast majority of microbial identification assays today rely on sample culture to grow enough of the target, unknown organism, which can take days or weeks and is prone to false negatives. In time-sensitive infections, the critical window to initiate therapy can be as short as six hours, forcing physicians to prescribe broad-spectrum antibiotics preemptively, even before confirming a bacterial agent. Unnecessary antibiotic use engenders drug resistance and can result in long-term complications for patients; it could be avoided with a reliable, rapid microbial identification test.
Rapid diagnostic tests conventionally rely on target-specific sensors that are highly specialized to sense one microbe without cross-reactivity. These tests fail to scale for applications where there are dozens to hundreds of microbes of interest, which has limited their general clinical utility. However, research in rapid biomarker diagnostics has developed a suite of technologies for targeted diagnostics where portable, point-of-care devices with a few on-board sensors are appropriate.
In certain aspects, the present methods can be used to detect large panels of microbes with fewer sensors than previously theoretically possible through the SPoRe algorithm, enabling broad-range diagnostics at the point-of-care. Decades of research into point-of-care diagnostics and microfluidics have matured the field of hardware development, providing a suite of tools for processing samples to separate target microbial cells from a background of human cells and other contaminants for downstream signal acquisition from sensors. Given the measurements from these sensors, the present compressed sensing technique offers a new signal processing approach to infer the microbial content of the sample very efficiently. The signal processing method can be coupled with established sensing technologies.
In infection diagnostics, signal sparsity arises because among a library of dozens to hundreds of possible pathogens, only one or a few pathogens are responsible for any given patient's condition. Rather than mixing sensors in with a bulk sample containing biomarkers (e.g., microbial cells or genes), emerging technologies in microfluidics can split a sample into thousands of tiny nanoliter droplets (
In particular aspects, the present methods allow for the use of a decreased number of sensors (M&lt;k), which does not merely result in cost savings. Several of the biological sensing methods that are amenable to rapid assays and simple devices are also fundamentally constrained in the number of measurements that can be resolved (Guo et al., 2012; Suea-Ngam et al., 2019). For example, with fluorescence measurements, only M=3 to 5 color channels can be reliably distinguished per reaction. With polymicrobial infections presenting with as many as k=5 microbes (Peters et al., 2012), the present compressed sensing framework enables a new rapid diagnostic test.
In further aspects, the present methods allow for modular sensing. Most compressed sensing approaches rely on nonspecific linear sensors, for example, a light signal that scales with the microbial content and reacts broadly with multiple microbes. Current options for nonspecific linear sensing have poor sensitivity. The present SPoRe algorithm can perform diagnostics with any sensing mechanism, which opens the floodgates for sensors that can couple with the present approach. The present studies showed that SPoRe can recover microbial content even from binary sensors, representing the worst case of information loss, from which clinically relevant performance can be achieved.
With the standard compressed sensing framework, microbial species outside of the expected panel of microbes can result in erroneous misclassifications. With conventional target-specific probes, such species result in false negatives. Both situations can be avoided with microfluidic partition measurements as out-of-library species would result in very low measurement likelihoods that can be statistically detected. These samples could be labeled as “unknown.”
The present experiments demonstrated that even binary measurements from partitions are sufficient to recover microbial quantities. A droplet digital PCR (ddPCR) instrument was used with the universal bacterial 16S gene that exhibits small sequence differences between species. A small set of probes were selected that can capture these differences to assign 9 pathogenic genera their own digital barcode.
In an initial demonstration with 18 samples, SPoRe recovered the bacterial content with an average cosine similarity of 0.96 (max of 1) which illustrates its ability to capture the relative abundance of bacteria (
While current methods depend on single-cell capture by diluting samples sufficiently to enforce that non-empty partitions have only one cell with high probability by Poisson statistics (Guo et al., 2012; Velez et al., 2017), SPoRe can tolerate multi-cell capture events to facilitate a broader dynamic range of sample concentrations. Dynamic range is valuable in diagnostics because the sample content is unknown, which makes the process of prior dilution blind. For example, the methods may be applied to bacterial and fungal infection diagnostics. These organisms have genome sizes on the order of 10^6-10^8 base pairs of DNA, making them more amenable to nonspecific genomic DNA sensing methods. Specifically, such DNA sensors can bind a sufficient number of times to each microbe to easily generate unique and robust fingerprint signals.
In some aspects, the present methods comprise the detection of pathogens for the diagnosis of urinary tract infections (UTIs). In the U.S., UTIs are responsible for over 10 million office visits and $3B in healthcare and societal costs (Flores-Mireles et al., 2015). The commercially available rapid diagnostic tests for UTIs cannot sense the variety of pathogens that are encountered in hospitals, necessitating sample culture and prophylactic antibiotics before confirming a bacterial agent (Marscha et al., 2012). There is a clear need for a diagnostic test that can distinguish between the bacterial and fungal pathogens that occur with low but nonetheless significant prevalence. UTI diagnosis generally requires a limit of detection on the order of 10^4 cells/mL of sample.
In some aspects, fungal pathogens may be detected with the simultaneous PCR amplification of the 16S (bacteria) and 18S (fungi) genes.
Further aspects provide methods for the detection of bacterial biofilms in chronic wounds. Bacterial biofilms in chronic wounds delay healing and require targeted treatment depending on bacterial identification (Wu et al., 2019). Bacterial density in samples tends to be at minimum 10^4 cells/mL (Woods et al., 2012; Dalton et al., 2011). Current diagnostics are limited by their inability to diagnose multiple species in the same sample, identify the dominant pathogen, or culture the microbes altogether (Wu et al., 2019). Advanced molecular assays rely on 16S for downstream sequencing. The present methods can be used to characterize the relative abundance of microbes in polymicrobial samples and can bypass the need for downstream sequencing, facilitating rapid species identification for further targeted subtyping. Once the pathogen species are identified, targeted transcriptomics can help further guide antibiotic therapy.
Sepsis Diagnostics: The need for a rapid assay is most urgent in sepsis diagnostics. Sepsis is one of the leading causes of death in the U.S., as improper therapy leads to rapid organ failure. Sepsis is particularly challenging in diagnostics as the limit of detection can be as low as one cell per mL of sample, but the total microbial genomic copies per mL could be on the order of 10^2-10^3 copies (Bacconi et al., 2014).
Microbiome Community Profiling: The composition of patients' gut microbiome is increasingly linked to health and disease (Clemente et al., 2012; Hall et al., 2017). Quantifying the abundance of hundreds of kinds of microbes in patients' guts is a costly and time-intensive process, usually achieved through DNA sequencing. Patients' microbiomes can rapidly change over time via changes in diet or therapy (David et al., 2014; Galloway-Pena et al., 2017), making temporal monitoring clinically useful. Routine monitoring applications would particularly benefit from a rapid, inexpensive test. While sample sparsity is needed for typical compressed sensing and not exhibited in the microbiome, the theory developed in the studies below indicates that sparsity is not necessary for estimating signals with microfluidic partitioning and Poisson constraints.
B. Detecting Out-of-Library Microbes with Microfluidics and Compressed Sensing
In some aspects, the present framework for microbial diagnostics comprises (1) the fingerprinting of microbial biomarkers with nonspecific nucleic acid sensors, (2) the capture of partial or full fingerprint measurements from droplets, and (3) the iterative maximization of a likelihood function by gradient ascent (maximum likelihood estimation, MLE). Specifically, for (3), the microbial concentrations λMLE can be estimated by maximizing the average log-likelihood of the droplet measurements, where yd represents the measurement(s) from the dth droplet:

λMLE = argmaxλ (1/D) Σd log p(yd|λ), where p(yd|λ) = Σx p(yd|x)p(x|λ)

and p(x|λ) is the product of independent Poisson probability mass functions with rate parameters λ.
Implicit in p(yd|x) is the sensing model from droplet signal x (microbial content) to measurements. The present Sparse Poisson Recovery (SPoRe) algorithm may be used to maximize this function with Monte Carlo estimation of the infinite sum over all integer vectors x. Note that if the set of relevant signals is finite, then gradients can be computed exactly.
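As a concrete sketch of step (3), the example below maximizes the average log-likelihood over a small, finite set of candidate integer signals, so the sums and gradients are exact rather than Monte Carlo estimates. All dimensions, the sensing matrix, the Gaussian noise model, and the λ-scaled step size are illustrative assumptions, not the published SPoRe implementation.

```python
# Hedged sketch of maximum-likelihood estimation of Poisson rates from
# droplet measurements, with a hypothetical sensing matrix and noise model.
import numpy as np
from math import factorial
from itertools import product

rng = np.random.default_rng(2)

N, M, D = 3, 2, 2000
lam_true = np.array([0.4, 0.0, 0.2])            # k-sparse true concentrations
Phi = np.array([[1.0, 0.2, 0.6],                # illustrative sensing matrix
                [0.3, 1.1, 0.9]])
sigma = 0.05                                    # assumed measurement noise

X = rng.poisson(lam_true, size=(D, N))          # per-droplet microbial counts
Y = X @ Phi.T + sigma * rng.standard_normal((D, M))

# Finite candidate set of integer signals x (0..3 copies per microbe).
support = np.array(list(product(range(4), repeat=N)))

# p(y_d | x) under the Gaussian noise model, computed once: D x |support|.
lik_yx = np.exp(-((Y[:, None, :] - support @ Phi.T) ** 2).sum(-1)
                / (2 * sigma ** 2))

def poisson_pmf(x, lam):
    # Product of independent Poisson pmfs, p(x | lam).
    return np.prod([lam[n] ** x[n] * np.exp(-lam[n]) / factorial(x[n])
                    for n in range(len(lam))])

def ascent_step(lam, step=0.2):
    # The gradient of the average log-likelihood is E[x_n | y_d, lam]/lam_n - 1;
    # a lam-scaled gradient step therefore reduces to the EM-style update below.
    prior = np.array([poisson_pmf(x, lam) for x in support])
    post = lik_yx * prior
    post /= post.sum(axis=1, keepdims=True)     # p(x | y_d, lam)
    Ex = post @ support                         # D x N posterior mean counts
    return np.clip(lam + step * (Ex.mean(axis=0) - lam), 1e-6, None)

lam_hat = np.full(N, 0.3)
for _ in range(100):
    lam_hat = ascent_step(lam_hat)
# lam_hat approaches lam_true as D grows.
```

SPoRe replaces the exhaustive sum over `support` with Monte Carlo sampling of integer vectors x, which is what makes the approach tractable when N is large.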
The dimension N of x covers the N microbes in the library (equivalent to panel, database, etc.). Maximizing this function can return some solution intended to be a diagnostic result. However, this solution can be evaluated for its quality based on its consistency with the observed measurements. This evaluation is important to detect microbes that exhibit fingerprints that are not included in the library—i.e., microbes that are unknown, mutated, or otherwise unaccounted for.
In the general case, the empirical distribution of measurements p(yobserved|λMLE) may be compared with the theoretical distribution of measurements p(y|λMLE) given the same MLE solution. In the present demonstration with droplet digital PCR (ddPCR), the measurement space is discretized with binary thresholding, so a χ-squared goodness-of-fit test may be used based on the expected and observed counts for each measurement. An exemplary process comprises:
In the present studies, with 5 samples of 2 bacteria, 6 samples of 3 bacteria, and 6 samples of 4 bacteria, there were 69 total cases for which the inventors evaluated χ-squared tests. The ability of the p-value from the χ-squared test to serve as a discriminator between samples with or without unknown microbial fingerprints was evaluated.
Depending on the nature of the measurements (e.g., discrete vs continuous) and function that maps signals to measurements (e.g., linear vs nonlinear), other statistical tests may be used to determine the presence of an unknown microbe.
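For the discretized ddPCR case, the χ-squared comparison of observed and expected barcode counts can be sketched as follows. The barcode probabilities below are hypothetical stand-ins for p(y|λMLE), and SciPy's `chisquare` is assumed to be available for the test.

```python
# Sketch of the goodness-of-fit check for out-of-library detection, assuming
# binary-thresholded droplet measurements so the measurement space is the
# finite set of 2^M barcodes. All probabilities below are hypothetical.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(3)
D = 5000

# Hypothetical model probabilities over the 4 barcodes of an M=2 assay,
# standing in for p(y | lambda_MLE).
p_model = np.array([0.90, 0.05, 0.04, 0.01])

# An in-library sample: droplet barcodes drawn from the model itself.
obs_in = rng.multinomial(D, p_model)
stat_in, p_in = chisquare(obs_in, f_exp=D * p_model)

# An out-of-library sample: a microbe with an unmodeled fingerprint shifts
# mass onto a barcode the model considers rare.
p_shifted = np.array([0.85, 0.05, 0.04, 0.06])
obs_out = rng.multinomial(D, p_shifted)
stat_out, p_out = chisquare(obs_out, f_exp=D * p_model)
# p_out should be far smaller than p_in, flagging the sample as "unknown".
```

In practice a p-value threshold would be calibrated on known in-library samples before labeling a new sample as containing an out-of-library microbe.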
C. Methods of Detection
In certain aspects, the present disclosure provides methods for detecting target analytes, such as pathogens or other microbes in a sample, such as for the diagnosis or prognosis of a disease. In some aspects, the present methods comprise selecting a treatment for a disease patient, as well as methods of treating a patient with the selected treatment. Methods of detection may include detecting DNA, such as with probes and PCR, or detecting polypeptides.
As used herein, the term “sample” refers to any sample suitable for the detection methods provided by the present invention. The sample may be any sample that includes material suitable for detection or isolation. Sources of samples include blood, pleural fluid, peritoneal fluid, urine, saliva, malignant ascites, broncho-alveolar lavage fluid, synovial fluid, and bronchial washes. In one aspect, the sample is a blood sample, including, for example, whole blood or any fraction or component thereof. A blood sample suitable for use with the present disclosure may be extracted from any source known that includes blood cells or components thereof, such as venous, arterial, peripheral, tissue, cord, and the like. For example, a sample may be obtained and processed using well-known and routine clinical methods (e.g., procedures for drawing and processing whole blood). In one aspect, an exemplary sample may be peripheral blood drawn from a subject with cancer. In some aspects, the biological sample comprises a plurality of cells. In certain aspects, the biological sample comprises fresh or frozen tissue. In specific aspects, the biological sample comprises formalin fixed, paraffin embedded tissue. In some aspects, the biological sample is a tissue biopsy, fine needle aspirate, nipple aspirate, blood, serum, plasma, cerebral spinal fluid, urine, stool, saliva, circulating tumor cells, exosomes, or aspirates and bodily secretions, such as sweat.
The sample could also be from a nonclinical source but could contain microbial content.
These sources could include but are not limited to soil, fermentation reactors, wastewater, and commercial microbial products such as probiotics. Samples could also be swab samples from surfaces in an environment.
“Prognosis” refers to a prediction of how a patient will progress, and whether there is a chance of recovery. Prognosis may also refer to a nonclinical situation, such as contaminant detection, quality assurance, or general microbiome profiling of various environments.
The sample obtained from an individual may contain cells affected by the disease, meaning that the cells express the disease-associated analyte. Thus, where the analyte is expressed in a cell-specific manner, the sample will contain the cell type in which the disease analyte is expressed.
The sample analyzed may be any body fluid sample, such as, for example, blood serum, cerebrospinal fluid, mucus, saliva, vaginal secretion, and urine, or may be a sample of the diseased tissue itself.
In some embodiments of the methods described herein, detecting the presence of a biomarker in a biological sample obtained from an individual comprises determining the presence of a modified polypeptide in the sample. A polypeptide can be detected by any of a number of means known to those of skill in the art, including analytical biochemical methods, such as electrophoresis, capillary electrophoresis, high performance liquid chromatography (“HPLC”), thin layer chromatography (“TLC”), hyperdiffusion chromatography, and the like, or various immunological methods, such as fluid or gel precipitation reactions, immunodiffusion (single or double), immunoelectrophoresis, radioimmunoassay (“RIA”), enzyme-linked immunosorbent assay (“ELISA”), immunofluorescent assays, flow cytometry, FACS, western blotting, and the like.
Other methods of sensing may be based on cellular responses to environmental changes, such as growth media. In these embodiments, the analytes are whole cells rather than individual molecules, genes, or other biomarkers. Different sensing groups may each contain a unique environment such that a cell's fingerprint is encoded in its unique combination of responses to the multiple environments. These responses may be measured by growth rate with metabolic assays such as resorufin-based fluorescence monitoring or turbidity monitoring. In other embodiments, they may involve single-cell RNA sequencing data or nucleotide probe-based measurements of RNA for gene expression analysis.
The following examples are included to demonstrate preferred embodiments of the disclosure. It should be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function well in the practice of the disclosure, and thus can be considered to constitute preferred modes for its practice.
However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit and scope of the disclosure.
The present studies provide a generally applicable framework for the MMV problem with Poisson signals (MMVP). Let each signal x*d be drawn independently from a multivariate Poisson distribution parametrized by the N-dimensional, k-sparse vector λ*. That is, x*n,d˜Poisson(λn*) are independent. This framework should not be confused with the well-studied “Poisson compressed sensing” problem in imaging, where the measurement noise, rather than the signal, follows a Poisson distribution (Raginsky et al., 2010; Harmany et al., 2012). In contrast to typical MMV problems, the present studies find an estimate {circumflex over (λ)}≈λ* from D observations rather than estimating {circumflex over (X)}. Each signal and measurement pair (x*d, yd) is in one of G different sensor groups, each with its own sensing matrix Φ(g) such that yd=Φ(g)x*d. The group g associated with each index d is known and deterministic. This measurement group multiplexing is similar to that found in the single pixel camera (SPC) (Duarte et al., 2008); however, in the SPC the x*d are assumed to be equal, whereas in MMVP they are independently sampled. In microfluidics, several sensor groups can be feasibly achieved by forking an input microfluidic channel into G reaction zones, each containing its own set of M sensors. Note that xd*˜Poisson(λ*) regardless of which group it is in. The statement Y=ΦX* is the special case without noise where G=1 and is illustrated in
Y=[Φ(1)X(1)* . . . Φ(G)X(G)*]. (1)
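As a concrete, hypothetical illustration of the measurement model in (1) (all names, dimensions, and parameter values here are ours, not from the disclosure), a toy simulation might be sketched as:

```python
# Hedged sketch: simulating k-sparse Poisson signals measured through
# G sensor groups. Variable names (lam_true, Phis, etc.) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
N, M, G, D, k = 20, 3, 2, 1000, 3

# k-sparse Poisson rate vector lambda*
lam_true = np.zeros(N)
lam_true[rng.choice(N, size=k, replace=False)] = rng.uniform(0.1, 1.0, size=k)

# One sensing matrix per sensor group
Phis = [rng.uniform(0, 1, size=(M, N)) for _ in range(G)]

# Each column x*_d ~ Poisson(lambda*); group assignment is deterministic
X_true = rng.poisson(lam_true[:, None], size=(N, D))
groups = np.arange(D) % G
Y = np.column_stack([Phis[groups[d]] @ X_true[:, d] for d in range(D)])
```

Each column of `Y` is a noiseless M-dimensional measurement of one partition's integer analyte counts, matching the block structure of equation (1).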
The present studies demonstrate the first exploration of the MMVP problem and develop a novel recovery algorithm and initial theoretical results. A maximum likelihood estimation (MLE) approach was taken, treating Y as a set of D observations from which to infer {circumflex over (λ)}. The contributions are: 1) the Sparse Poisson Recovery (SPoRe) algorithm, which tractably estimates {circumflex over (λ)}; 2) theoretical results on the identifiability of the MMVP model, proving that MLE can asymptotically recover λ* even when M=G=1 for any k, along with insights into MLE performance; and 3) simulations demonstrating SPoRe's superior performance over existing and custom baseline algorithms. The achievable measurement rate in the MMVP model is unprecedented in CS. Insights into the influence of various system parameters were derived analytically and confirmed in the simulated experiments. It was found that system designers should first maximize M and then increase G as necessary depending on the expected real-world conditions; using G>1 was helpful in cases with high noise or high Σnλ*n.
SPoRe's strong performance even with extremely low M∈{1,2,3} under very high measurement noise uniquely enables sensor-constrained applications in biosensing. Although microfluidics devices can rapidly generate a large number of partitions D at a tunable rate, most optical and electrochemical sensing modalities that can keep pace are limited in M (Guo et al., 2012; Suea-Ngam et al., 2019). Commonly, fluorescently tagged sensors reveal droplets' contents rapidly as they flow by a detector, but spectral overlap generally limits M to be five or less without highly specialized, system-specific approaches (Orth et al., 2018). High M alternatives such as various spectroscopic techniques limit throughput, necessitate additional instrumentation, or complicate workflows (Suea-Ngam et al., 2019, Zhu & Fang, 2013). These severe restrictions in M may have forestalled research into CS's potential role in microfluidics.
Background: The MMVP problem has not yet been explored, likely owing to the ongoing maturation of microfluidics and only recent application of CS to biosensing. The Poisson signal model constrains elements of X to be nonnegative integers under a set of defined probability mass functions. Some aspects of this structure have been studied tangentially, but not the MMVP structure directly.
The core MMV problem only imposes joint sparsity. Early greedy algorithms for this generalized scenario extend the classic Orthogonal Matching Pursuit (OMP) algorithm (Pati et al., 1993) into OMPMMV (Chen & Huo, 2006), simultaneously developed as Simultaneous OMP (S-OMP) (Tropp et al., 2006). Generally, OMP-based algorithms iteratively build a support set of an estimated sparse solution {circumflex over (x)} (or {circumflex over (X)}) by testing for the correlation between columns of Φ and the residuals between the measurements and previous estimates. A suite of greedy algorithms was recently developed that impose nonnegative constraints to a number of MMV approaches including OMP's analogues, and the nonnegative extensions outperformed their generalized counterparts (Kim & Haldar, 2016).
The application of integer constraints to the SMV problem has proven challenging in the literature. Some theory involving sensing matrix design includes (Dai & Milenkovic, 2009a; Fukshansky et al., 2019), but practical algorithms have required additional constraints on the possible integers, e.g., x∈{0,1}N or other finite alphabets (Zhu & Giannakis, 2011; Nakarmi & Rahnavard, 2012; Tian et al., 2009; Draper & Malekpour, 2009; Shim et al., 2014). A recent study verified that these problems, as well as those with unbounded integer signals, are NP-hard (Lange et al., 2020). Algorithms for the unconstrained integer SMV problem thus apply greedy heuristics such as OMP-based approaches (Flinth & Kutyniok, 2017; Sparrer & Fischer, 2016).
Additional structural constraints can also make these problems tractable. The communications problem of multi-user detection (MUD), reviewed in (Alam & Zhang, 2018a), bears some similarity to MMVP. Here, the activity of N users is the signal of interest and generally follows a Bernoulli model where each user is active with the same prior probability pa (Zhu & Giannakis, 2011). An alternative prior with Σn=1Nxn,d˜Poisson(λ) models the mean number of total active users in any given signal (Ji et al., 2015) although the authors solely explored an overdetermined system. Applying an MMV framework to MUD enables underdetermined (M<N) applications but has generally assumed that any active user is active for the entire frame of observation (a row of X is entirely zero or nonzero) (Monsees et al., 2014; Liu et al., 2015). Recently, the potency of sensor groups with a G=2 system was demonstrated in the MUD context (Alam & Zhang, 2018b). Despite some similarities to MMVP with an MMV framework and discrete signals, MUD most fundamentally differs from MMVP in its utilization of the probabilistic structure of X. In MUD, the model parameters governing user activities are assumed and leveraged in recovery of X, whereas in MMVP, the model parameters in λ* themselves are the target of recovery.
Regardless of the particular structural constraints imposed in the variants of MMV above, M>k is necessary to recover X* (Davies & Eldar, 2012). However, in MMV models where the rows of X* are statistically independent, the simpler task of support recovery has been proven to be possible below the M>k regime, as low as M=Ω(√{square root over (k)}) (Balkan et al., 2014; Pal & Vaidyanathan, 2015; Koochakzadeh et al., 2018). In particular, multiple measurement sparse Bayesian learning (M-SBL) (Balkan et al., 2014) takes an MLE approach in a setting very similar to MMVP, but where x*d have a Gaussian distribution rather than Poisson, and there M=Ω(√{square root over (k)}) is the limit for sparse support recovery even as D→∞.
In contrast, the present studies prove that in the MMVP problem, this is improved to M≥1 (i.e., no minimum number of measurements) regardless of k when D→∞, even when G=1. Furthermore, the MLE in this setting recovers λ* exactly. It was demonstrated empirically that λ* can be recovered well for finite D even for extremely low M such as M∈{1,2,3}.
Sparse Poisson Recovery (SPoRe) Algorithm
Notation: A probability mass function is denoted by P(⋅) and a probability density function by p(⋅). RN and ZN are used to represent the N-dimensional Euclidean space and integer lattice, respectively. R+N and Z+N denote the non-negative restrictions on these spaces. Script letters are used for sets unless otherwise described. Lowercase and uppercase boldface letters are used for vectors and matrices, respectively. Their dimensions are represented with uppercase letters (e.g., X∈Z+N×D) that are indexed by their lowercase counterparts. For example, xn,d is the element of X in the nth row and dth column, and the shorthands xn and xd represent the entire nth row and dth column vectors, respectively. Other lowercase letters (a, b, ϵ, etc.) may represent variable or constant scalars depending on context. λ* and X* refer to the true signal values, and estimates are denoted {circumflex over (λ)} and {circumflex over (X)}, with the source of the estimate (e.g., MLE, SPoRe, baseline algorithm) implicit from context. ∥⋅∥ is used to denote a norm: ∥⋅∥0 for the number of nonzero elements of a vector, ∥⋅∥1 for the ℓ1 vector norm, ∥⋅∥2 for the Euclidean norm, and ∥⋅∥F for the Frobenius norm. ∥X∥Rx≙Σn maxd|xn,d| is also used, a relaxation of the row ℓ0 quasi-norm defined in (Tropp, 2006). The null space of a matrix A is denoted by N(A). As one abuse of notation, for densities of the form p(y|x), the corresponding Φ(g) applied to x and the relevant noise model are implicit. Also, the division of two vectors represents element-wise division.
Algorithm. If the index d is in sensor group g, the inventors say that linear measurements are corrupted by an additive random noise vector bd:
yd=Φ(g)xd+bd. (2)
bd may be entirely independent of xd (e.g., additive white Gaussian noise (AWGN), as used in the simulations) or dependent on xd. With xd˜Poisson(λ*) i.i.d., the yd are independent across d as well. The MLE estimate maximizes the average log likelihood of the measurements:
Because the infinite sum over x yields an intractable posterior distribution, the popular expectation maximization (EM) algorithm (Dempster et al., 1977) cannot be applied to solve this MLE problem. Instead, the Sparse Poisson Recovery (SPoRe) algorithm (Algorithm 1) optimizes this function with batch stochastic gradient ascent, drawing B elements uniformly with replacement from {1, . . . , D} to populate a batch set. First, note that
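The marginal likelihood p(yd|λ) being maximized sums over the unobserved integer signals. Under AWGN, one way to approximate log p(yd|λ) is Monte Carlo averaging over Poisson samples, as in this hedged sketch (all names and parameter values are ours, for illustration only):

```python
# Hedged sketch: Monte Carlo estimate of log p(y | lambda) under AWGN,
# p(y|lam) = sum_x p(y|x) P(x|lam) ~= mean over x_s ~ P(x|lam).
import numpy as np

rng = np.random.default_rng(1)
M, N, S, sigma = 2, 5, 2000, 0.3
Phi = rng.uniform(0, 1, size=(M, N))
lam = np.array([0.5, 0.0, 0.3, 0.0, 0.0])   # hypothetical rate vector

x_true = rng.poisson(lam)
y = Phi @ x_true + rng.normal(0, sigma, size=M)

xs = rng.poisson(lam, size=(S, N))          # S Monte Carlo samples
resid = y[None, :] - xs @ Phi.T             # (S, M) residuals
log_py_x = (-0.5 * np.sum(resid**2, axis=1) / sigma**2
            - M * np.log(np.sqrt(2.0 * np.pi) * sigma))   # log N(y; Phi x, s^2 I)
m = log_py_x.max()                          # log-sum-exp trick for stability
log_lik = m + np.log(np.mean(np.exp(log_py_x - m)))
```

The log-sum-exp shift guards against the numerical underflow of p(yd|xs) discussed later for low-noise settings.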
Denoting the objective function from the right-hand side of (4) as , the gradient is
With gradient ascent, each iteration updates λ←λ+α∇λ with learning rate α. However, the summations over all of Z+N are clearly intractable. SPoRe approximates these quantities with a Monte Carlo (MC) integration over S samples of x, newly drawn for each batch gradient step from a sampling distribution Q: Z+N→R+, such that
The optimal choice of Q(xs) is beyond the scope of this work, but it was found that Q(xs)=P(xs|λ) simplifies the expression, is effective in practice, and draws inspiration from the EM algorithm. In other words, the sampling function is updated at each iteration based on the current estimate of λ. The gradient thus simplifies to
Note that if only one {circumflex over (x)}d∈Z+N satisfied p(yd|{circumflex over (x)}d)>0 for every yd, the objective would be concave with
i.e., the MLE solution if X* were directly observed. Of course, with compressed measurements and noise, multiple signals may vie to “explain” any single measurement, but SPoRe's key strength is that it jointly considers independent measurements to directly estimate {circumflex over (λ)}.
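A minimal sketch of a single SPoRe-style batch gradient step under these choices (Q(x)=P(x|λ) and AWGN) follows; the names, dimensions, and constants are hypothetical illustrations, not the disclosed implementation:

```python
# Hedged sketch of one Monte Carlo gradient-ascent step on lambda.
# With Q(x)=P(x|lam), the per-measurement gradient reduces to a
# p(y_d|x_s)-weighted mean of the samples, divided by lam, minus one.
import numpy as np

rng = np.random.default_rng(2)
M, N, D, S, B, sigma, alpha = 2, 5, 200, 500, 50, 0.3, 0.1
Phi = rng.uniform(0, 1, size=(M, N))
lam_true = np.array([0.8, 0.0, 0.4, 0.0, 0.0])
X = rng.poisson(lam_true[:, None], size=(N, D))
Y = Phi @ X + rng.normal(0, sigma, size=(M, D))

lam = np.full(N, 0.1)                            # lambda^(0) = v
batch = rng.choice(D, size=B, replace=True)      # batch drawn with replacement
xs = rng.poisson(lam, size=(S, N))               # shared samples x_s ~ P(x|lam)

resid = Y[:, batch].T[:, None, :] - (xs @ Phi.T)[None, :, :]   # (B, S, M)
w = np.exp(-0.5 * np.sum(resid**2, axis=2) / sigma**2)         # prop. to p(y_d|x_s)
den = np.maximum(w.sum(axis=1, keepdims=True), 1e-300)         # guard all-zero rows
post_mean = (w @ xs) / den                                     # weighted mean of x_s
grad = np.mean(post_mean / lam - 1.0, axis=0)                  # MC batch gradient
lam = np.maximum(lam + alpha * grad, 1e-3)                     # ascent + clip at eps
```

Sharing one set of `S` samples across the batch mirrors the vectorization simplification described below.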
It is noted that for finite samples, since the MC integration occurs inside a logarithm, the stochastic gradient is biased. However, since it converges in probability to the true gradient, results can be expected to be comparable to SGD with an unbiased gradient for sufficiently large S (Chen & Luss, 2019).
Algorithm 1 summarizes the implementation details of SPoRe. Even though λ∈R+N, λ≥ϵ can be enforced by clipping (ϵ=10−3 in the simulations) to maintain exploration of the parameter space. Note that in (8), gradients can become very large with finite sampling as some elements of λ approach zero. It was found that rescaling gradients to a maximum norm γ helps stabilize convergence. For rescaling, only the subvector δΓ of the α-scaled gradient δ was considered, defining the indices n∈Γ⊆{1, . . . , N} such that λn+δn>ϵ. This restriction ensures that rescaling is based solely on the indices still being optimized, excluding those clipped to ϵ.
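The clipping and rescaling safeguards just described might be sketched as follows (ϵ, γ, and the example numbers are ours, chosen only to illustrate the rule):

```python
# Hedged sketch: rescale the alpha-scaled gradient step delta to a maximum
# norm gamma, computed only over indices still being optimized, then clip.
import numpy as np

eps, gamma = 1e-3, 1.0
lam = np.array([0.5, 1e-3, 0.2])
delta = np.array([4.0, -2.0, 3.0])        # alpha-scaled gradient step

active = lam + delta > eps                # indices Gamma still being optimized
norm = np.linalg.norm(delta[active])
if norm > gamma:                          # rescale based on active indices only
    delta = delta * (gamma / norm)
lam = np.maximum(lam + delta, eps)        # clip to maintain exploration
```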
For the stopping criterion, a moving average of {circumflex over (λ)} was evaluated for convergence. The median of p(yd|λ) over the batch was also tracked, accounting for stochasticity in likelihood approximations and batch selections, such that the algorithm may reduce α if no improvement in the median has been seen within a patience window and terminate once α has been reduced five times. All experiments were conducted on commodity personal computing hardware. Ultimately, recovery of {circumflex over (λ)} takes a few minutes on a single core for the problem sizes considered (M≤15, N≤50), and SPoRe can be easily parallelized for faster performance.
Within an iteration, it was found that using the same S=1000 samples for all d in the batch helped vectorize the implementation, dramatically improving speed over sampling S times for each drawn yd. This simplification had no noticeable influence on performance. While random initializations with a small offset λ(0)˜Uniform(0,1)+v (with v=0.1) were found to be effective in general, a numerical issue was encountered under low-variance AWGN. Even though AWGN results in nonzero probabilities everywhere, p(yd|xs) may numerically evaluate to zero for all drawn samples in low-noise settings. These zeros across all samples result in undefined terms in the summation over the batch in (8). SPoRe simply ignores such undefined terms, but when this numerical issue occurs for the entire batch, SPoRe takes no gradient step. With very low noise and large N dampening the effectiveness of random sampling, SPoRe may stop prematurely as it appears to have converged. This problem did not arise with larger noise variances, where even inexact samples pushed {circumflex over (λ)} in the generally appropriate direction until better samples could be drawn (recall that Q(x)=P(x|{circumflex over (λ)}) at each iteration). Nonetheless, λ(0)=v was set for consistency across all simulated experiments. It was speculated that setting λ(0) to a small value encourages sampling sparse x's in early iterations, helping to find xs with nonzero p(yd|xs) and bypassing the numerical issue altogether.
Theory and analysis. The summation over x inside the logarithm of the objective function complicates precise analysis of the SPoRe algorithm. However, the asymptotic MMVP problem as D→∞ and its MLE solution can be considered to gain insight into the superior recovery performance of SPoRe. In this setting, the powerful result was obtained that λ* is exactly recoverable even when M=G=1 under a simple null space condition on Φ. The loss in Fisher Information for MMVP was then characterized, showing how losses accrue as more signals map to the same measurements, an effect that increases with k. Lastly, insights into the influence of sensor groups were derived through a small-scale analysis. From a system design standpoint, it was found that designers should first increase M as much as feasible and then increase G as needed. All proofs can be found in the Appendix below.
Identifiability of MMVP Models. The primary theoretical result is that the asymptotic MLE of the MMVP problem is exactly equal to λ* as long as a simple null space condition on Φ is satisfied. This result improves upon the M=Ω(√{square root over (k)}) limitations of correlation-aware support recovery methods (Balkan et al., 2014; Koochakzadeh et al., 2018), placing no restrictions on k and M. The reason for this improvement is that Φ transforms an integer lattice rather than a linear subspace, such that the resulting model is still identifiable.
Identifiability refers to the uniqueness of the model parameters that can give rise to a distribution of observations. A model {p(⋅|λ):λ∈R+N} is a collection of distribution functions indexed by the parameter λ; in the MMVP problem, each choice of Φ and noise gives rise to a different model. Through an optimization lens, if the model is identifiable, then λ* is the unique global optimum of the data likelihood as D→∞. Recall that
meaning that this model can be interpreted as each sensor group consisting of a mixture whose elements' positions are governed by Φ(g)x, distributions by the noise model, and weights by P(x|λ). This analysis focuses on G=1 since, as D→∞, at least one sensor group contains infinite measurements. If the corresponding Φ(g) satisfies the conditions described here, then the model is identifiable. Formally:
Definition III.1 (Identifiability). The model is identifiable if p(y|λ)=p(y|λ′) ∀y⇒λ=λ′ for all λ, λ′∈R+N.
The identifiability of mixtures is well-studied (Teicher, 1963; Tallis, 1969); if a mixture is identifiable, the mixture weights uniquely parametrize possible distributions. For finite mixtures, a broad set of distributions including multivariate exponential and Gaussian have been proven to be identifiable (Yakowitz & Spragins, 1968). A finite case may manifest in realistic MMVP systems where measurements y must eventually saturate; all sensors have a finite dynamic range of values they can capture. In the most general case, p(⋅|λ) is a countably infinite mixture.
Although less studied, countably infinite mixtures are identifiable under some classes of distributions (Yang & Wu, 2013). The AWGN used in the simulations yields identifiable mixtures in both the finite and countably infinite cases. Characterizing the full family of noise models that are identifiable under countably infinite mixtures is beyond the scope of this work. The contribution is that, given a noise model that yields identifiable mixtures, equal mixture weights induced by λ and λ′ imply λ=λ′. The sufficiency of the following simple conditions on Φ for identifiability was proven:
Theorem III.2 (Identifiability of Mixture Weights). Let b be additive noise drawn from a distribution for which a countably infinite mixture is identifiable. If N(Φ)∩R+N={0} and ϕn≠ϕn′ ∀n,n′∈{1, . . . , N} with n≠n′, then the model is identifiable.
The null space condition essentially says that any nonzero vector in N(Φ) must contain both positive and negative elements. Many practical Φ satisfy this constraint (e.g., any Φ with at least one strictly negative or positive row). The second condition is trivial: no two columns of Φ can be identical. A separate sufficient condition was also obtained: Φ drawn from any continuous distribution results in identifiability.
Corollary III.3 (Identifiability with Random Continuous Φ). Let b be additive noise drawn from a distribution for which a countably infinite mixture is identifiable. If the elements of Φ are independently drawn from any continuous distribution, then the model is identifiable.
The general result of Theorem III.2 was emphasized, since discrete sensing is common in biomedical systems. For example, sensors are often designed to bind to an integer number of known target sites and yield “digital” measurements (Basu, 2017; Zhang et al., 2012). Discrete Φ can give rise to what the inventors call collisions. Formally:
Definition III.4 (Collisions and Collision Sets). Let Φ∈RM×N be a sensing matrix applied to signals x∈Z+N. A collision occurs between x and x′ when Φx=Φx′. A collision set for an arbitrary u∈Z+N is the set Cu={x:Φx=Φu, x∈Z+N}.
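As a toy illustration of this definition (the matrix and the bound on entries are our own hypothetical choices), collision sets for a small discrete Φ can be enumerated directly:

```python
# Hedged sketch: group small integer signals x by their image Phi @ x;
# each group is a collision set in the sense of Definition III.4.
import itertools
import numpy as np

Phi = np.array([[1, 1, 2]])               # M=1, N=3 discrete sensing matrix

collisions = {}
for x in itertools.product(range(3), repeat=3):   # entries restricted to {0,1,2}
    key = tuple((Phi @ np.array(x)).tolist())
    collisions.setdefault(key, []).append(x)

# e.g., (0,0,1), (1,1,0), (2,0,0), (0,2,0) all collide at Phi x = 2
collision_set = collisions[(2,)]
```

With M=1 and integer entries, distinct sparse signals readily share a measurement, which is exactly why the mixture weights become probability masses over collision sets.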
If the distribution from which b is drawn is fixed (e.g., AWGN) or a function of Φx, then the mixture weights are the probability mass of each collision set. Let C denote the set of collision sets, with Cu∈C being an arbitrary collision set.
The weights of the mixture elements are governed by P(Cu|λ). Given a noise model that yields identifiable mixtures, the same distribution of observations y implies that the mixture weights are identical, i.e., P(Cu|λ)=P(Cu|λ′) ∀Cu. It was proven that P(Cu|λ)=P(Cu|λ′) ∀Cu implies λ=λ′, which implies the identifiability of the model under both the conditions of Theorem III.2 and Corollary III.3.
The proofs are based on the existence and implications of single-vector collision sets Cx={x}. When (9) holds, u indexes both the mixture elements and the collision sets. In the general case where b is dependent on x and not simply Φx, signals participating in the same mixture element may have different noise distributions. These differences can only further subdivide collision sets and leave single-vector collision sets unaffected. Thus, the results also cover the general noise case.
Fisher Information of MMVP Measurements. While identifiability confirms that λ* is a unique global optimum of the MLE problem given infinite observations, Fisher Information helps characterize the estimation of λ* as D increases. The Fisher Information matrix I is the (negative) Hessian of the expected log-likelihood function at the optimum λ*, and it is well-known that under a few technical conditions the MLE solution is asymptotically Gaussian with covariance I−1/D. Intuitively, higher Fisher Information implies a “sharper” optimum that needs fewer observations for stable recovery. For direct observations of Poisson signals x*d rather than yd, I is diagonal with In,n=1/λ*n. In MMVP with observations of noisy projections (yd), I and its inverse are difficult to analyze. However, the reduction in In,n caused by the noisy measurement of x*d can be characterized, a derivation that was confirmed empirically. Concretely, elements of I follow
The shorthand wx≙p(y|x)P(x|λ) is used, noting that Σxwx=p(y|λ). Following a similar derivation as for the partial derivatives in (8), the general expression for the diagonal elements In,n is
In the ideal scenario, x*d is observed directly such that
It can easily be shown that (13) reduces to the canonical 1/λ*n. The integral of p(y|x) evaluates to one, but the quantity can be manipulated algebraically and re-expressed as
Let Inloss≙In,nideal−In,n and let Σ(x′,χ) denote the sum over all pairs of signals x′, χ∈Z+N. Expanding Equations (12) and (14) and simplifying yields
Note that Inloss is non-negative such that In,n≤In,nideal and that pairs of signals with x′n≠χn can contribute to Inloss. Also note that wx′wχ=p(y|x′)p(y|χ)P(x′|λ*)P(χ|λ*) and that P(x|λ*)>0 only when supp(x)⊆supp(λ*). Thus, the Fisher Information is only reduced relative to the direct Poisson observation case when there are pairs of signals that are well-explained by the same y and by the same λ*. Clearly, λ* with higher k results in more such pairs, and this influence of k on the Fisher Information was confirmed empirically. Although further precise analysis via Fisher Information is challenging, a deeper analysis of the special case of the MMVP problem with small λ was performed through a different lens.
Small scale analysis. With identifiability, it is known that λ* uniquely maximizes the expected log-likelihood. However, because SPoRe uses stochastic gradient ascent to optimize the empirical log-likelihood, it will typically achieve a {circumflex over (λ)} that is near but not equal to λ*. How the neighborhood of λ* changes with the parameters of the problem was therefore studied. The natural way to do this for MLE problems is to consider the Fisher Information matrix as in the previous section, but the presence of a sum inside the logarithm makes analysis difficult. Instead, a particular {circumflex over (λ)} near λ* was considered that solves an optimization related to the original likelihood maximization problem. To further simplify the setting, the “small scale” case was considered, in which Σnλ*n is small enough that there is almost never a case where Σnx*n>1. It is emphasized that although this setting is simple, the MLE approach can still drastically outperform a trivial solution such as {circumflex over (λ)}=E[{circumflex over (x)}], where {circumflex over (x)}=argmaxx p(y|x), since with sufficient noise, {circumflex over (x)}≠x* with arbitrary probability.
At the small scale, the distribution of each x*n becomes Bernoulli with parameter λn*, and the probability that x*n=1 and x*n′=1 for n≠n′ vanishes. Let n*≙(the first nonzero index of x*, 0 if none), which has a categorical distribution with parameter λ*. Notation is abused so that ϕ0=0, λ0* is the probability that n*=0, and Σn=0Nλn*=1. Applying Jensen's inequality to the log-likelihood for the conditional expectation given n*,
was obtained. The right-hand side of this inequality was called the Jensen bound. This Jensen bound via the logarithm has the attractive property of having a gradient that is equal to a first-order Taylor approximation of the gradient of the original likelihood (The Taylor expansion is of ƒ(u,v)=u/v, for which a first-order approximation yields E[U/V]≈E[U]/E[V] for random variables U, V). To see this, consider the partial derivatives for a single λn:
Thus, the optimizer of the Jensen bound was expected to be close to λ* (this is particularly true as measurement noise vanishes and the bound becomes tight).
In the case where G=1 under AWGN, the following result characterizing the solution of the Jensen bound was found. Proposition III.5. If y˜N(ϕn*, σ2I) and
is invertible, then the maximizer {tilde over (λ)} of the Jensen bound satisfies
where s∈∂∥{tilde over (λ)}∥1 and for all n, μn≥0 and μn{tilde over (λ)}n=0.
In the case where all entries of {tilde over (λ)} are positive, s−μ=1. K has values of one along the diagonal and smaller values off the diagonal, so it mimics the identity matrix. Clearly, as K→I, {tilde over (λ)}→λ*. However, given nonzero σ2, K is bounded away from I. Furthermore, since it is impossible to find a set of more than M+1 equidistant points in RM, the off-diagonal values of K will differ when M<N, introducing distortion in the transformation.
However, even if M<N, if y is a measurement from a random sensor group, then the effect of this distortion can be mitigated such that {tilde over (λ)} is a reliable estimator of λ* from a support recovery perspective:
Theorem III.6. If y˜N(ϕn*(g), σ2I), g is distributed uniformly, and ϕn(g)˜N(0, I) i.i.d., then as G→∞, if all elements of the maximizer {tilde over (λ)} of the Jensen bound are strictly positive, there exist c1>0, c2∈R such that {tilde over (λ)}n=c1λn*+c2 for 1≤n≤N.
If {tilde over (λ)} has the same rank ordering as λ*, the exact support can be recovered. Therefore, an increase in G is expected to improve performance in tasks such as support recovery. From this result, however, gains due to increasing G are expected to be less immediate than those due to increasing M. To see this, contrast the asymptotic nature of Theorem III.6 in G with the fact that for a finite choice of M (specifically M=N) all ϕn can be selected to be equidistant (or, for even smaller M, Φ can be selected to satisfy a RIP with some acceptable distortion), obtaining the same reliability result.
In this section, comparisons of SPoRe against existing and custom baseline algorithms are presented, followed by focused experimentation on SPoRe's performance and limitations. For most baseline algorithms that find an estimate {circumflex over (X)}, the estimate of λ was set as
i.e., the canonical Poisson MLE if X* were observed directly. For a performance metric, cosine similarity between {circumflex over (λ)} and λ* was chosen, as it captures the relative distribution of elements of the solution, which was believed most useful to a user in biosensing applications. Although comparisons of cosine similarity mask differences in magnitude, estimates with high cosine similarity also exhibited low mean-squared error. Cosine similarity is plotted alone for brevity. In all simulations, AWGN was used and ϕm,n(g)˜Uniform(0,1) i.i.d. was set, since many sensors are restricted to nonnegative measurements. For each parameter combination, 50 trials were evaluated, drawing new Φ(g) and λ* for each trial. Due to high performance variability for some baseline algorithms, all error bars were scaled to ±½ standard deviation for consistency and readability.
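The evaluation metric just described is the standard cosine similarity; a minimal sketch (function name and example vectors are ours) is:

```python
# Hedged sketch of the evaluation metric: cosine similarity between the
# estimated and true rate vectors, near 1 for good relative recovery.
import numpy as np

def cosine_similarity(lam_hat, lam_true):
    """Cosine of the angle between two nonzero rate vectors."""
    return float(lam_hat @ lam_true /
                 (np.linalg.norm(lam_hat) * np.linalg.norm(lam_true)))

lam_true = np.array([1.0, 0.0, 0.5, 0.0])
lam_hat = np.array([0.9, 0.1, 0.6, 0.0])    # a hypothetical estimate
sim = cosine_similarity(lam_hat, lam_true)
```

As noted above, the metric is scale-invariant, so it masks magnitude errors while rewarding a correct relative distribution of analytes.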
Comparison against existing baselines. With no existing algorithm designed for Poisson signals, SPoRe was compared against a number of algorithms with various relevant structural assumptions, spanning both greedy and convex optimization approaches. First, DCS-SOMP (Baron et al., 2009) was used, a generalization of the common baseline Simultaneous Orthogonal Matching Pursuit (S-OMP) (Tropp et al., 2006) that assumes no structure and greedily solves MMV problems for any value of G. Next, NNS-SP and NNS-CoSaMP (Kim & Haldar, 2016) were used, two greedy algorithms for nonnegative MMV CS motivated by subspace pursuit (SP) (Dai & Milenkovic, 2009b) and compressive sampling matching pursuit (CoSaMP) (Needell & Tropp, 2009), which exhibited the best empirical performance in (Kim & Haldar, 2016). For integer-based recovery, PROMP (Flinth & Kutyniok, 2017), an SMV algorithm for unbounded integer sparse recovery, was used to recover an estimate for each signal xd. Two support recovery algorithms with M=Ω(√{square root over (k)}) measurement rates were also used: M-SBL (Balkan et al., 2014) and a convex, ℓ1-based variance recovery algorithm (Pal & Vaidyanathan, 2015). Support recovery is generally not quantitative, which would appear to make cosine similarity an inappropriate metric for comparison. However, these algorithms produce an estimate of the variance of x*n, which equals λn* for the Poisson distribution, coincidentally making them reasonable baseline approaches for MMVP directly.
For comparison against the best-possible performance of the baselines and to avoid hyperparameter search where possible (for regularization weights, stopping criteria, etc.), the baselines were armed with relevant oracle knowledge of λ* or X*. While NNS-SP and NNS-CoSaMP require k as an input, DCS-SOMP and PROMP, algorithms that iteratively and irreversibly select support elements, were also given knowledge of k and stopped after k elements had been chosen. Additionally, three oracle-enabled convex algorithms were created that minimize Σg=1G∥Y−Φ(g)X(g)∥F under norm constraints with oracle knowledge of the value of the norm. The ℓ1 norm is commonly used as a penalty for convex solvers to encourage sparsity in sparse recovery. The ℓ1-Oracles included SMV and MMV versions, where in the SMV case, Y was collapsed to a single vector by summing Σdyd, and a vector {circumflex over (x)} was recovered from which {circumflex over (λ)}={circumflex over (x)}/D. The ℓ1-Oracle SMV and ℓ1-Oracle MMV were both given Σn,dx*n,d. In (Tropp, 2006), ∥X∥Rx was suggested as a better alternative for MMV, so the Rx-Oracle was given ∥X*∥Rx. In the support recovery algorithm from (Pal & Vaidyanathan, 2015), the vector of sample variances (Var(xn))n=1N was recovered under a constraint on Σn Var(xn). Because the sum of variances was provided, this method is called the ΣVar-Oracle. Nonnegativity of the optimization variables was also enforced in all convex problems, which were solved using the convex optimization package CVX in Matlab (Grant & Boyd, 2014; Grant & Boyd, 2008). No oracle knowledge was used for M-SBL, but the same initialization was used as for SPoRe, and its fixed-point update was run until convergence.
Comparison against integer baselines: M<k. In the M<k regime, since with high probability the elements of X* can be bounded, the discrete nature of the problem might be expected to admit at least a brute-force solution for obtaining X̂ that can be used to obtain λ̂. Indeed, if measurement noise is low, then the integer signal that minimizes measurement error for yd is likely to be x*d. But a finite search space alone has not enabled integer-constrained CS research to achieve M<k in general.
One may wonder whether SPoRe is simply taking advantage of this practically finite search space and, by virtue of MC sampling over thousands of iterations, is effectively finding the right solution by brute force. To address this possibility, SPoRe was compared against an ℓ0-Oracle that was given k and the maximum value in X* in order to test all combinations of X*'s support. For each combination, it enumerates the (max(X*)+1)^k possibilities for each xd and selects x̂d=argmin_x ∥yd−Φx∥2. Finally, it returns the k-sparse solution with the lowest minimized measurement error. This algorithm is the only Poisson-free approach in this section.
Comparing SPoRe and other Poisson-enabled baselines against the ℓ0-Oracle characterized the effect of incorporating the Poisson assumption on recovery performance. An early solution for tackling MMVP, used here as a baseline, was an alternating optimization framework that updates estimates of X̂ and λ̂ in turn.
Noting that p(X|Y,λ̂)∝p(Y|X)p(X|λ̂), the MAP objective for solving for X under AWGN with variance σ² was
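The corresponding display can be reconstructed from the stated AWGN likelihood and Poisson prior as:

```latex
\hat{x}_d \;=\; \arg\max_{x \in \mathbb{Z}_+^N} \;
-\frac{1}{2\sigma^2} \left\lVert y_d - \Phi x \right\rVert_2^2
+ \sum_{n=1}^{N} \left( x_n \log \hat{\lambda}_n - \hat{\lambda}_n - \log \Gamma(x_n + 1) \right)
```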
where the Gamma function Γ(⋅) is the continuous extension of the factorial (Γ(xn,d+1)=xn,d!) and the objective is concave over the positive reals R+^N. The classic branch-and-bound (BB) algorithm was implemented (Land & Doig, 1960) to find the optimal integer-valued solution X̂ of the concave objective. Once an estimate X̂ was available, the update to λ̂ was also concave, with the closed-form solution λ̂n=Σd x̂n,d/D.
The biconcavity of this objective function in X and λ makes this approach attractive, but it is unclear how best to initialize λ̂. This alternating baseline algorithm is referred to with the prefix "Alt" followed by the method of initialization. For example, for Alt-Random, random initialization was used with a small offset (λ̂n~Uniform(0,1)+0.1) to avoid making any particular λn irretrievable from the start.
A few "guided" initialization processes were also explored. The quantity Σnλ*n can hypothetically be estimated from data if P(xd=0|λ) is significant and easily estimated from Y. In fact, quantification in microfluidics often relies on a clear identification of empty sample partitions (that is, where xd=0) (Basu, 2017). This motivates a strategy of relaxing the problem by optimizing X with a Poisson assumption on the sum of each column Σn xn,d rather than on each element of X individually. The Σnλn*-Oracle is given Σnλn* and optimizes for P(Σn xn,d|Σnλn*) in place of P(xd|λ) in (20). It is straightforward to show that this objective is also concave. Each estimate x̂d is solved via BB, from which λ̂ follows.
The inventors use the Σnλn*-Oracle as its own baseline and as an initialization to their alternating framework (Alt-Σnλn*). For the next guided initialization, Σnλn* was again used for an unbiased initialization in which the first estimate was λ̂n=(Σn′λn′*)/N for all n (Alt-Unbiased). Finally, the output of SPoRe was used as an initial value for λ̂ (Alt-SPoRe). Alt-SPoRe can be understood as a way to use SPoRe to estimate X̂ if needed.
Sparsity and Σnλn*. The limits of SPoRe's recovery performance were empirically tested under very challenging conditions of M=2, 3≤k≤7, N=50, σ²=10⁻². Here, D=1000 was set to better reflect the typical capabilities of biomedical systems, whereas D=100 in the baseline comparisons was due to a computational time budget strained by solving BB for each x̂d. From the analysis and previous simulations, it was expected that both k and the magnitudes of λn* would influence recovery.
However, note in (8) that SPoRe's gradients are defined by an average of samples xs weighted by p(yd|xs).
Although increasing D is feasible in microfluidics, it generally corresponds with a reduction in λ* since a sample's total analyte content is fixed.
Efficiency. For system design, it is helpful to know how many observations D are necessary and sufficient for stable estimation of λ̂. Such insight is often derived from analysis of the Fisher information. Recall that for direct observations of Poisson signals x*d (the ideal case), the MLE solution λ̂n=Σd x*n,d/D is an efficient estimator and achieves the Cramér-Rao bound such that var(λ̂n)=λn*/D.
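As a sanity check of this efficiency claim, a short simulation (with hypothetical parameter values, not from the source) can compare the empirical variance of the direct-observation MLE against λ/D:

```python
import numpy as np

# For direct observations of Poisson counts, the MLE lambda_hat = sum_d x_d / D
# is efficient: var(lambda_hat) should approach the Cramer-Rao bound lambda/D.
rng = np.random.default_rng(1)
lam, D, trials = 2.0, 1000, 4000

# Each trial draws D direct Poisson observations and forms the MLE
estimates = rng.poisson(lam, size=(trials, D)).mean(axis=1)
empirical_var = estimates.var()
crb = lam / D  # Cramer-Rao bound for this setting

print(empirical_var, crb)  # the two values should be close
```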
The reduction in the Fisher information entries was derived from MMVP measurements, and there is reason to expect that the reduction increases with k. Here, this effect was empirically characterized.
It was found that the structure in the MMVP problem can be readily exploited for substantial improvements in signal recovery. While compressed sensing of arbitrary integer signals has proven challenging in the past, Poisson constraints not only make the recovery problem tractable but render it significantly easier. Most inverse problems necessitate constraints that make the signal-to-measurement transformations nearly isometric: in compressed sensing, these manifest as restrictions on Φ, noise, and the relationship between M, N, and k. In MMVP, recovery of λ* is theoretically possible under very lax conditions on Φ (Theorem III.2) and practically achievable as shown in the inventors' simulations.
In practice, the new SPoRe algorithm exhibits high performance even under deliberately challenging circumstances of high noise and M<k. Here, recovery with G=1 appears particularly feasible even with noisy measurements if Σnλn* is small. Because the log-likelihood is not concave, SPoRe's gradient ascent approach is not theoretically guaranteed to find a global optimum since local optima may exist. However, if they exist, it was speculated that SPoRe is naturally poised to evade these traps due to stochasticity in its gradient steps from both batch draws and MC integrations.
A few scenarios were noted in which SPoRe's MC sampling appears to cause issues with convergence or early termination; these are generally associated with increases in k and Σnλn*. It was anticipated that further increases in N may also contribute to these effects. While k and N are entirely determined by the application, system designers can reduce Σnλn* by increasing the spatial or temporal sampling rate. In microfluidics, this adjustment translates to either generating more (smaller) partitions D given a fixed sample volume or diluting a sample prior to partitioning. The initial implementation of SPoRe uses S=1000, can easily run on personal computers, and is sufficient for systems with N<10 (the full code base is available at github.com/pavankkota/SPoRe with instructions on how to implement SPoRe with custom models). This scale is appropriate for most applications in biosensing, and future work with parallelized or adaptive sampling strategies could improve the reliability of recovery for larger systems. Moreover, it was found that increasing M and G appears to mitigate poor performance due to excessive sampling noise.
To date, CS implementations with low M and high G, in the inventors' notation, reformulate the acquired measurements into a canonical CS problem with GM measurements (Duarte et al., 2008). As discussed above, this reformulation implicitly assumes that the average signal in every group is identical.
The ability to recover signals in MMVP, even in the extreme case of M=G=1, is unprecedented in CS and offers a new paradigm for sensor-constrained applications. The current state-of-the-art method for achieving similar efficiency in microfluidics is to essentially guarantee single-analyte capture for classification by substantially increasing the sampling rate, whereas the MMVP framework is not reliant on such an intervention. Increasing G can make SPoRe reliable under harsh conditions, is straightforward with microfluidics, and offers a potent alternative to increasing the sampling rate. For example, diluting to a tolerable concentration is challenging with samples of unknown content such as in diagnostics applications, and increasing D either delays results or necessitates high throughput measurement acquisition which may not be feasible depending on the nature of the sensors.
The theoretical and empirical results showed the promise of MMVP, but there are many directions for further research. For instance, theoretical results that precisely relate M, N, D, k, λ* and noise such as in a recovery guarantee would be highly valuable. Moreover, SPoRe can accept any signal-to-measurement model p(y|x). While the inventors have proven identifiability under linear mappings with common noise models, SPoRe can be applied with any application-specific model even if proving identifiability of the Poisson mixture is difficult.
A. Proof of Theorem III.2
A direct proof was used: assume P(u|λ)=P(u|λ′) ∀u, and prove the resulting implication λ=λ′. Let z(x) be such that x+z(x)∈C(x), the collision set of x. By Definition III.4, x+z(x)∈Z+N and z(x)∈N(Φ).
Lemma A.1. If N(Φ)∩Z+N={0} and P(0|λ)=P(0|λ′), then Σnλn=Σnλ′n.
Proof. The null space condition on Φ means that C(0)={0}; there is no vector z(0) satisfying 0+z(0)∈Z+N other than z(0)=0. Therefore, P(0|λ)=P(0|λ′) ⇒ e^{−Σnλn}=e^{−Σnλ′n} ⇒ Σnλn=Σnλ′n. □
Attention was turned to the one-hot collision sets. Let ej denote the jth standard basis vector. By Definition III.4, C(ej)={x : Φx=ϕj, x∈Z+N}.
Lemma A.2. If N(Φ)∩Z+N={0} and C(ej)={ej}, then λj=λ′j.
Proof. The restriction on ej means P(ej|λ)=P(ej|λ′) ⇒ λj e^{−Σnλn}=λ′j e^{−Σnλ′n}. Combined with Lemma A.1, this yields λj=λ′j. □
By similar arguments to Lemmas A.1 and A.2, Corollary III.3 was proven under the assumption that there are no collisions instead of the null space condition. When the elements of Φ are independently drawn from continuous distributions, the collision of any particular x and x′ occurs with probability zero. Since Z+N×Z+N is countably infinite, there are no collisions almost surely. As such, C(ej)={ej} for every j almost surely.
For the conditions in Theorem III.2, the following Lemma states the existence of at least one j satisfying C(ej)={ej}.
Lemma A.3. If N(Φ)∩Z+N={0} and ϕn≠ϕn′ for all n,n′∈{1, . . . , N} with n≠n′, then ∃j such that C(ej)={ej}.
Proof. If C(ej)≠{ej}, then there exists z(ej)≠0 with ej+z(ej)∈Z+N and z(ej)∈N(Φ). By the null space condition, z(ej)∉Z+N, so z(ej) must contain a negative entry; because ej+z(ej)≥0, the only entry that can be negative is the jth, and its value must be −1.
Let P denote the number of indices j with C(ej)≠{ej}, and suppose for contradiction that P=N. Concatenate the vectors z(ej) as the columns of an N×N matrix Z. Note that each column z(ej) has −1 as its jth entry and nonnegative entries elsewhere.
First, consider the index set {1, . . . , N}. Because all off-diagonal components in Z are nonnegative and because any such z must have one negative value, one of the rows of Z must be entirely zero except for the −1 on the diagonal. Note that the ordering of the columns in Z is arbitrary, so without loss of generality, let this be the first row. Now consider the index set {2, 3, . . . , N}. The same logic holds: at least one row must contain all zeros except for the −1. Without loss of generality, set [Z]2,3, [Z]2,4, . . . , [Z]2,N=0. Repeating this process yields a lower triangular matrix:
However, examining the final column, z(eN)=−eN, which implies eN∈N(Φ) and hence N(Φ)∩Z+N≠{0}, a contradiction. Therefore P<N. □
Proof of Theorem III.2: Lemma A.3 confirms P<N, meaning that the concatenated matrix of the vectors z(ej) from the P nontrivial collision sets can, without loss of generality (reordering indices so that these are j∈{1, . . . , P}), be arranged in the lower triangular form (24).
Let P(C(eP)|λ)=Σ_{x∈C(eP)}P(x|λ) denote the total probability of the collision set; the hypothesis implies P(C(eP)|λ)=P(C(eP)|λ′),
where Lemma A.1 (Σnλn=Σnλ′n) yields the first equality, and Lemma A.2 (λi=λ′i ∀i>P) yields the second equality when combined with the fact that xi=0 for i<P due to (24). The only x∈C(eP) contributing a λP term is eP itself, from which λP=λ′P.
Now λi=λ′i ∀i>P−1. Following the same arguments, P(C(eP−1)|λ)=P(C(eP−1)|λ′) yields λP−1=λ′P−1, and continuing inductively down to j=1 gives λ=λ′, completing the proof. □
Proof of Proposition III.5
Proof. By straightforward integration,
Therefore, given the constraints ∥λ∥1≤1 and λn≥0, the first-order KKT condition is
where κ(n*)=(κ(n,n*))_{n=0}^{N}, s∈∂∥λ∥1, cn≥0, and λn≥0. By complementary slackness, cnλn=0 for all n. Because K is symmetric, the above was rewritten as
where the fraction represents element-wise division. Solving for λ and rescaling μ, the desired expression was obtained.
Proof of Theorem III.6
Proof. Let κg be defined the same as κ from (27) with ϕn(g). Then again by straightforward integration,
Let
and let
be the sub-matrix of {tilde over (K)} excluding n=0. Then
where
and J is a matrix of all ones. Leveraging the block matrix inverse, it was observed that the inverse takes the form
assuming the final column corresponds to n=0, for some scalars b, c, and d. Using the formula from Proposition III.5 and the fact that s−μ=1 by assumption, it was concluded that for n>0, {tilde over (λ)}n∝λn*+C for some C. Rote algebra verifies that the constant of proportionality is non-negative.
The inventors recently coupled advances in microfluidic partitioning with ideas from compressed sensing to address these challenges. Compressed sensing (CS) seeks to infer sparse signals efficiently: faster or with fewer sensors (Baraniuk, 2007; Donoho, 2006). In biosensing, samples are sparse when among many possible analytes, only a handful are present in any given sample. For instance, a patient could be infected with any of hundreds of pathogens, but only one or a few are responsible for the current infection (Peters et al., 2012). In this application, CS is analogous to quantifying analyte fingerprints from mixed measurements (Aghazadeh et al., 2016). The inventors recently developed new theory and a new Sparse Poisson Recovery (SPoRe) algorithm that couples principles of CS with microfluidic partitioning (Kota et al., 2022). SPoRe performs maximum likelihood estimation (MLE) with gradient ascent over a generalized likelihood function. While the inventors chose to explore this direction because microfluidics approaches the limit of sensitivity through single-molecule analysis, they also found fundamental advantages from a signal processing perspective. Most notably, leveraging the Poisson-distributed capture of analytes enables improved rates of multiplexing (fewer sensors, more analytes), tolerates multi-analyte capture in the same partition, withstands very high measurement noise, and can enable partial fingerprints to be captured separately in sensor-constrained systems. The inventors call this latter concept asynchronous fingerprinting.
The inventors' initial research focused on linear measurements to contextualize their research in the field of compressed sensing, but the inventors found that SPoRe can be readily applied to any sensing model. In this work, the inventors first derive broader sufficient conditions for the types of practical sensors that can apply the inventors' framework. Next, they present the first in vitro demonstration of the inventors' framework towards bacterial infection diagnostics, quantifying nine pathogen genera with only five orthogonal DNA probes in droplet digital PCR (ddPCR). The five probes assign binary barcodes to the 16S rRNA genes of these bacteria, functioning as the fingerprints in the inventors' system. However, given a limitation of only two fluorescence channels on their system, the inventors apply the concept of asynchronous fingerprinting. They split a sample into four groups of droplets, each with a subset of two probes, and the inventors infer the concentrations of the bacteria with SPoRe over the pooled dataset from all droplet groups. They characterize the performance of the inventors' assay with 18 samples each with a mixture of 2-4 bacteria and extend the inventors' characterization in simulated experiments. Finally, they show how their probabilistic framework enables the flagging of samples that contain 16S barcodes outside of the designed panel.
In ddPCR, microfluidics splits a sample into thousands of droplets to stochastically capture nucleic acids.
The inventors designed nonspecific 11-nucleotide hydrolysis probes spiked with locked nucleic acids (LNAs) that bind to the 16S genes of a panel of 12 bacterial species. Briefly, the inventors filter the set of 11-mers to avoid heterodimers and weak mismatches while ensuring a sufficiently high melting temperature. Each 16S gene elicits a binary barcode response to the set of five candidate probes based on the presence or absence of the probe sequences in the gene.
There are multiple 16S gene copies in each bacterial genome with slight sequence variability. Although the inventors attempted to design probes such that each genus had a unique, consistent barcode for all copies, E. cloacae appeared to exhibit a small proportion of variant barcodes, a fraction which the inventors computed experimentally (Methods 4.8). Such variation is likely inevitable, especially in larger scale systems, but can be accounted for. The inventors store each pathogen's fractional barcode distribution across its copies in a column of a matrix C.
If the analytes are the bacteria, then λn(bact) is the concentration of the nth bacterium's 16S genes, regardless of the particular barcodes of individual genes. Because λ(BC)=Cλ(bact), these results currently depend on a C matrix of rank N to readily convert between the barcode concentrations λ(BC) and the bacterial concentrations λ(bact). For context-dependent reasons, the inventors make use of both definitions of "analyte," carefully clarify which is in use at any time, and often drop the superscript notation.
Ideally, the inventors could estimate λ(BC) by simply capturing individual 16S genes in droplets with all five probes (Section 5.1). However, the unique clusters only hold for single-gene capture; if multiple genes appear (with distinct sequences) in the same droplet, an effect known as partition specific competition (PSC) occurs and fluorescent intensities can decrease (Whale et al., 2016). Second, unique clusters for every barcode cannot scale beyond this study if the eventual goal is to quantify dozens to hundreds of microbes.
Instead, the inventors generate four sensor groups of droplets, each with a different subset of two probes.
Generalized Maximum Likelihood Estimation with Microfluidic Partitioning. The standard for quantification in digital microfluidics data is based on MLE (Majumdar et al., 2015). The inventors generalize the MLE approach and show how the conventional equations can be derived from special-case assumptions (Section 5.3). Their particular approach with ddPCR is one realization of the inventors' generalized MLE framework. The general terms analyte, partition, and measurement vector used in this section correspond, respectively, with the physical concepts of a barcode or bacterium, a droplet, and the two pre-binarized measurements acquired from each droplet. Section 4.1.1 contains detailed clarification of the inventors' mathematical notation.
Let N and D define the number of unique analytes in the assay and the number of partitions, respectively. The inventors let xd be an N-dimensional nonnegative integer vector representing the quantities of each analyte in partition d. The inventors say λ is k-sparse if k elements are nonzero. With microfluidic partitioning, xd is distributed as Poisson(λ), where λ is the N-dimensional parameter vector that characterizes the rate of capture of each of N analytes. Let yd represent the measurement vector acquired from partition d (e.g., in the inventors' assay, yd∈{0,1}²). Note that while yd is observed directly, λ must be inferred and xd is latent. The inventors use an asterisk (λ*, x*d) to denote true values and a hat (λ̂, x̂d) to denote estimates. The goal in MLE is to find an estimate λ̂MLE that maximizes the likelihood of the observed measurements. A general form of the likelihood function is:
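The general likelihood referenced above can be reconstructed from the stated Poisson capture model as:

```latex
p(y_1, \ldots, y_D \mid \lambda) \;=\; \prod_{d=1}^{D} \sum_{x \in \mathbb{Z}_+^N} p(y_d \mid x)\, P(x \mid \lambda),
\qquad
P(x \mid \lambda) = \prod_{n=1}^{N} \frac{\lambda_n^{x_n} e^{-\lambda_n}}{x_n!}
```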
Denoting the likelihood function from the right-hand side of (2) as ℒ, the gradient is
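The gradient display can be reconstructed by differentiating the log of the likelihood termwise, using ∂P(x|λ)/∂λn = P(x|λ)(xn/λn − 1):

```latex
\frac{\partial \log \mathcal{L}}{\partial \lambda_n}
\;=\; \sum_{d=1}^{D}
\frac{\displaystyle \sum_{x \in \mathbb{Z}_+^N} p(y_d \mid x)\, P(x \mid \lambda) \left( \frac{x_n}{\lambda_n} - 1 \right)}
{\displaystyle \sum_{x \in \mathbb{Z}_+^N} p(y_d \mid x)\, P(x \mid \lambda)}
```

Each term is an average of candidate signals x weighted by p(yd|x)P(x|λ), consistent with the infinite summation over Z+N noted below.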
Although unwieldy, this equation leads to two commonly observed equations in specialized implementations of MLE that use digital fingerprinting or orthogonal assays (Section 5.3). As discussed, digital fingerprinting necessitates clear classification boundaries, and orthogonal assays in each partition restrict the number of analytes that can be practically quantified. In the inventors' ddPCR assay, no droplet can be assumed to have or lack a particular 16S barcode. Despite this ambiguity, the inventors' SPoRe algorithm iteratively solves Equation (2) via gradient descent.
Modifications to SPoRe Algorithm. The inventors summarize two particular changes to the original SPoRe algorithm, with implementation details left to Sections 4.7 and 5.2. First, they have noted the flexibility of SPoRe for any sensing model. This flexibility arises from the fact that p(yd|x), the sensing model that maps signals to measurements, is the only term that must be defined for a particular application. The inventors use a simple model for their ddPCR assay. For yd∈{0,1}² with M=2 for the two fluorescence channels, they say p(y|x)=Πm p(ym|x), with p(ym=1|x)=1 if x has at least one copy of a gene that contains the corresponding probe's sequence and p(ym=1|x)=0 otherwise. In their implementation, they define the analyte as the bacterial content and account for fractional barcode content in the gradient computations (Section 5.2).
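A minimal sketch of this presence/absence sensing model (the barcode matrix B below is hypothetical, not the assay's actual panel):

```python
import numpy as np

def sense(x, B):
    """Binary measurement model: channel m reads 1 iff the droplet holds at
    least one copy of any gene carrying probe m's sequence. B[n, m] = 1 if
    analyte n's gene contains the binding site for probe/channel m."""
    x = np.asarray(x)
    return ((x @ B) > 0).astype(int)

B = np.array([[1, 0],   # analyte 0 -> barcode "10"
              [0, 1],   # analyte 1 -> barcode "01"
              [1, 1]])  # analyte 2 -> barcode "11"

print(sense([2, 0, 0], B))  # only analyte 0 captured -> [1 0]
print(sense([0, 0, 0], B))  # empty droplet -> [0 0]
print(sense([1, 1, 0], B))  # two analytes together -> [1 1]
```

Note the saturating behavior: any nonzero quantity of a gene yields the same digital signal, matching the identifiability conditions discussed below.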
Second, note in Equation (3) that the gradients contain an infinite summation over all N-dimensional nonnegative integer vectors (Z+N). In their earlier work (Kota et al., 2022), the inventors used Monte Carlo approximations of this gradient on batches of measurements. They showed how these approximations became more variable with increases in k, and they speculate the same would occur with increases in N. While the estimation variance can be reduced with more samples in the Monte Carlo approximation, this approach would increasingly rely on parallelized high-performance computing as N or k increases. Fortunately, with the finite measurement space of ddPCR and the fact that binary probe measurements depend merely on the presence or absence of an amplicon with the complementary sequence, these gradients can be computed quickly and exactly over all measurements (Section 5.2). Given an exact gradient, the inventors use backtracking line search to speed up the convergence of gradient descent to estimate λMLE(bact). Such accelerations are less feasible if the gradient is loosely approximated. The exact gradient, while cumbersome to derive for the 2-channel ddPCR system, could be similarly calculated for any number of channels and is computationally cheap.
Identifiability of Model. Running gradient descent to convergence will find a solution, but the inventors need to develop some assurance that it is the correct solution. Proving the identifiability of a model ensures that there is a unique global optimum to the likelihood function given infinite data. Formally, identifiability states that if p(y|λ)=p(y|λ′) for all y in each sensor group, then λ=λ′.
The inventors formally define terms and prove sufficiency conditions for identifiability in Methods 4.1. Briefly, they find that λ can be inferred uniquely if the sensing functions that dictate the mapping of x to y are monotonic and saturating. Monotonic functions do not change direction in output; in biosensing, most outputs increase with increases in input and vice versa such that the sensors are monotonic increasing. Any additional analyte should not decrease the output measurement. By saturating, the inventors mean that once a sensor has reached a saturating point in its response to an analyte, no further addition of that analyte will change the measurement value. Lastly, the inventors impose a system-wide condition called fingerprint equivalence which (informally) states that analytes with the same single-molecule fingerprints in a given sensing group behave interchangeably.
To align with these conditions, the inventors define the analytes as the barcodes. Under reasonable PSC effects (not an overwhelming diversity of initial 16S genes in a droplet), the addition of new barcodes to a droplet cannot reduce the binarized measurements (monotonicity). The binary data is determined by the presence of a 16S amplicon with a complementary probe sequence, with any quantity greater than zero resulting in the same digital signal (saturating). Lastly, gene copies with the same combination of probe binding sites are interchangeable (fingerprint equivalence). Therefore, Theorem 4.7 establishes the conditions for λ(BC)=λ′(BC); with rank(C)=N, λ(BC)=λ′(BC)⇒λ(bact)=λ′(bact).
Here, the inventors provide an informal explanation and concrete example of the result of Theorem 4.7. This Theorem defines a matrix Z(g) whose rows indicate the positions of analytes with equal, nonzero fingerprint responses in the sensor group g. Stacking these matrices for each g yields a matrix Z, and if rank(Z)=N, then the system is identifiable.
This result implies that the inventors cannot arbitrarily assign probes to each group, and interestingly, using probes in multiple groups can be beneficial by adding rows to Z. It is not sufficient to simply capture each probe's information in at least one group. Also, for rank(Z) to be N, the inventors can derive a simple rule of thumb for binary ddPCR with M channels: G(2^M−1)≥N is necessary for the conditions of Theorem 4.7. Although the inventors had access to the two-channel Bio-Rad Qx200, this result also indicates promise for applying the inventors' framework to digital PCR systems with more than two channels: up to N=15G analytes on the 4-channel QxOne from Bio-Rad, or up to 63G analytes on the six-channel naica system (Stilla Technologies, Villejuif, France). Note that the inventors do not intend to rank these instruments, as other factors such as the volume that can be processed, the number of partitions that can be generated, automation of workflows, etc., are also important and application-dependent.
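The rule of thumb above can be sketched as a one-line check (the function name is illustrative, not from the source):

```python
# With M binary channels per group, each sensor group contributes at most
# 2**M - 1 distinct nonzero fingerprint patterns (rows of Z), so
# G * (2**M - 1) >= N is necessary for the identifiability conditions.
def max_panel_size(G, M):
    return G * (2 ** M - 1)

print(max_panel_size(4, 2))  # four 2-probe groups: up to 12 analytes (9 used here)
print(max_panel_size(1, 4))  # a 4-channel instrument: up to 15 analytes per group
print(max_panel_size(1, 6))  # a 6-channel instrument: up to 63 analytes per group
```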
However, the implications of identifiability must be approached with caution. From an optimization perspective, identifiability simply implies that given infinite data (D→∞) in each group, the true λ* is the global optimum of the likelihood function. These results fall short of a recovery guarantee for finite D. In their earlier work (Kota et al., 2020), the inventors derived the insight that less sparse λ* (more analytes with nonzero quantities) necessitates more partitions for stable recovery. Nonetheless, in contrast to typical applications of CS, there is no explicit maximum for the sparsity level, as any λ is identifiable under the conditions they derived.
Demonstration of Polymicrobial Diagnostics. The inventors tested SPoRe's ability to quantify bacterial loads in mixed samples of purified genomic DNA. They used reference wells with individual bacterial dilutions to estimate the ground truth concentration and assist with manual thresholding (Section 4.6) to binarize the data. The inventors passed this pre-binarized data to SPoRe.
Solving an optimization problem can risk returning a locally optimal solution. However, the inventors find that the likelihood of the converged solution is higher than that of the supposed ground truth.
Differences in concentration estimates for the bacteria truly in the sample are to be expected, as some of this error arises simply from pipetting volume variability and sampling variability in capturing 16S gene copies.
However, in some samples, SPoRe misses a bacteria of low abundance (a false negative) while including a bacteria that is absent from the sample (a false positive). The inventors hypothesize that these errors may be due to imperfect thresholding of the data misclassifying some droplets. Because pre-binarized data is passed to SPoRe, mistakes in thresholding propagate to SPoRe; that is, given "warped" data, SPoRe will return a warped solution, which could appear to have a higher mean likelihood than the estimated ground truth (6). Some degree of droplet misclassification is very likely with the inventors' current assay for three reasons.
Second, PSC appears to be blurring the boundaries between the typical binary clusters in droplets with more than one amplicon. Third, the negative droplets also exhibit some “lean” and “lift” most likely due to partial probe interactions with amplicons containing slightly mismatched sequences.
Recall that all 16S genes are expected to be exponentially amplified, and the probe signal the inventors receive is based on its interaction with the amplicons in each droplet. Overall, these combined effects squeeze the boundary in which they can draw thresholds. While these effects are common and popular tools exist to help disambiguate droplets (Jones et al., 2014; Brink et al., 2018), the inventors opted for manual clustering, as these tools are generally not designed for the conditions of this assay.
The inventors designed a simulated experiment to directly evaluate the effect of imperfect binarization of the raw measurements. Given the estimated ground truth concentrations and the droplet counts in each group, they simulated the underlying droplet gene content (X with xd~Poisson(λ*)) and the resulting binary measurements using the inventors' p(y|x) model. On this simulated data, SPoRe returned virtually perfect solutions with a mean cosine similarity of 0.9997.
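A minimal sketch of this simulation procedure (the barcode matrix and concentrations below are hypothetical; the actual assay uses nine genera and five probes):

```python
import numpy as np

# Simulate droplet contents x_d ~ Poisson(lambda*) and apply the binary
# presence/absence measurement model described above.
rng = np.random.default_rng(2)
B = np.array([[1, 0], [0, 1], [1, 1]])     # 3 analytes x 2 channels (hypothetical)
lam_true = np.array([0.5, 0.3, 0.2])       # hypothetical capture rates
D = 5000                                   # number of droplets

X = rng.poisson(lam_true, size=(D, 3))     # latent per-droplet gene counts
Y = (X @ B > 0).astype(int)                # observed binary measurements

# Every analyte lights at least one channel, so an all-negative droplet is
# empty; the empty fraction should be near exp(-sum(lam)) = exp(-1) ~ 0.368
frac_empty = np.mean(Y.sum(axis=1) == 0)
print(frac_empty)
```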
Characterization of Limit of Detection. In infection diagnostics, pathogen loads can vary by several orders of magnitude. Tolerating multi-gene capture reduces the risk that high concentration samples flood a system. The inventors designed samples such that total bacterial concentrations (Σnλn*) would be between 1 and 5 to illustrate this ability. Tolerating multigene capture allows design flexibility for microfluidics systems with fewer partitions (e.g., smaller form factors with nanowells instead of droplets). However, demonstrating this capability on the Bio-Rad Qx200 meant that the inventors' samples have 16S concentrations that are unrealistically high for most clinical presentations.
The inventors characterized the limit of detection in terms of 16S copy counts per sample for partitioning systems that may still result in multi-analyte capture by randomly subsampling the inventors' experimental data. For each sample, they subsampled 10%, 1%, 0.1% and 0.01% of the droplet data and passed it to SPoRe. The inventors estimate the 16S copy count in this data as the product of the number of subsampled droplets and the total estimated ground truth concentration Ds(Σnλn*).
Of course, a final system may have the flexibility to generate many partitions, meaning that at low 16S copy counts, non-empty partitions most likely have only one target molecule captured.
While single-molecule capture is unnecessary for the inventors' fingerprinting approach, low magnitudes of λ empirically help recovery (Kota et al., 2022). Intuitively, if the same sample is split over more partitions, signal inference can only stand to gain information by capturing measurements from more individual molecules rather than from their combined effects. Moreover, in ddPCR, single-molecule capture would avoid PSC altogether such that thresholding may be more reliable.
Flagging of Samples with Unknown Barcodes. A disadvantage of the inventors' system is that MLE will always report some solution, even if the sample contains a bacteria with a 16S barcode distribution outside of the panel given to SPoRe. However, the inventors can use the probabilistic nature of the framework to evaluate recovered solutions in useful ways. They can characterize the expected distribution of the discrete measurement vectors given a recovered λ̂ and perform a χ² goodness-of-fit test between the expected and observed distributions of measurements.
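A minimal sketch of this goodness-of-fit check for a two-channel binary model (the barcode matrix, concentrations, and droplet tallies below are hypothetical; the closed-form probabilities follow from Poisson capture and inclusion-exclusion):

```python
import math
import numpy as np

def expected_probs(lam_hat, B):
    """Expected distribution over binary measurement vectors: channel m is
    negative iff no captured gene carries probe m's site, which under Poisson
    capture happens with probability exp(-Lambda_m), where Lambda_m sums
    lambda over analytes whose barcode contains probe m."""
    lam_hat = np.asarray(lam_hat, dtype=float)
    L1 = lam_hat[B[:, 0] == 1].sum()                  # analytes hitting channel 1
    L2 = lam_hat[B[:, 1] == 1].sum()                  # analytes hitting channel 2
    Lu = lam_hat[(B[:, 0] == 1) | (B[:, 1] == 1)].sum()
    p00 = math.exp(-Lu)                               # both channels negative
    p01 = math.exp(-L1) - p00                         # y = (0, 1)
    p10 = math.exp(-L2) - p00                         # y = (1, 0)
    p11 = 1.0 - p00 - p01 - p10                       # both positive
    return np.array([p00, p01, p10, p11])

B = np.array([[1, 0], [0, 1], [1, 1]])                # hypothetical barcodes
lam_hat = np.array([0.5, 0.3, 0.2])                   # hypothetical recovered rates
probs = expected_probs(lam_hat, B)

observed = np.array([1840, 640, 1195, 1325])          # hypothetical droplet tallies
expected = probs * observed.sum()
stat = ((observed - expected) ** 2 / expected).sum()  # chi-square statistic
# With 4 outcome bins (df = 3), stat > 7.815 would reject the fit at alpha = 0.05
print(stat)
```

scipy.stats.chisquare could be used in place of the manual statistic to obtain an exact p value; here the statistic is compared against the df=3 critical value at α=0.05.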
For each tested polymicrobial sample, the inventors simulated the effect of having an "unknown" bacteria in it by removing each of the correct, present bacteria (one at a time) from SPoRe's database before running the algorithm. They repeated this process for both the manually thresholded and simulated data. In both cases, the p value of the test can be a highly reliable metric for flagging samples with out-of-panel bacteria, as indicated by the receiver operating characteristic (ROC) curves.
Reporting “unknown bacteria” is likely far more useful to a clinician than a “negative” result that would be returned from panels designed by specific sensors. This ability mirrors that of mass spectrometry and other fingerprinting systems, but the statistical underpinning could lead to theoretically grounded approaches with more research.
The inventors demonstrate a new scalable framework for infection diagnostics that leverages the sparsity of samples and the Poisson distribution of microfluidic capture. The inventors' nonspecific probes coupled with this framework enabled the quantification of 9 bacterial genera with 5 probes and 2 primers at levels as low as approximately 200 total copies of the 16S gene. The limit of detection appears to be sufficient for clinical applications, although the inventors' samples with only pre-extracted bacterial genomic DNA are highly idealized. Real-world samples with background nucleic acids are likely to hinder performance, but improvements in cluster separation through assay optimization may offset these effects.
More generally, the inventors present theoretical results that substantially broaden the types of sensors that allow λ to be identifiable. However, the inventors' SPoRe algorithm is modular for any user-defined sensing function, and the inventors' theoretical conditions are sufficient but not necessary. They encourage users to proceed with simulations even if their sensing model is outside the scope of the inventors' currently developed theory. As a nonconvex optimization process, the inventors' approach is not guaranteed to find the best solution. New theoretical results that provide convergence guarantees and worst-case error quantification would substantially increase user confidence in this approach. For the time being, the inventors have demonstrated a method for flagging solutions that are inconsistent with the observed data. This technique can help identify samples that contain analytes outside of the designed panel and solutions that have converged to poor local optima (if any). However, both the reliability of such flagging and the accuracy of estimated analyte quantities may be jeopardized if the model provided to SPoRe does not closely match real-world conditions.
The application of MLE to ddPCR data is not new, but the inventors' generalization offers unique insights and practical advantages for microfluidics-based diagnostics. These advantages stem from the inventors' tolerance of ambiguity from individual partition measurements.
Conventionally, to achieve scalable diagnosis of many analytes, single-analyte capture is enforced by a limiting dilution (Σnλn≤0.1) such that individual partition measurements can be directly classified. In diagnostics, sample concentrations are unknown and can vary over multiple orders of magnitude. The inventors show how their framework tolerates multi-analyte capture such that samples of higher concentrations do not flood the inventors' system. The ability to diagnose with higher λ also reduces the need to generate many partitions. Enabling diagnostics with, for instance, hundreds of microchambers could enable scalable multiplexing on static, point-of-care formats. Asynchronous fingerprinting offers a second practical advantage. Many biosensing techniques are limited in M, the number of orthogonal measurements acquired from each partition, either due to spectral overlap or sensor cross-reactivity. However, more sensors may be necessary to assign unique fingerprints to each analyte. Asynchronous fingerprinting bypasses this limitation by enabling reconstruction of the signal from partial fingerprint measurements. Still, the inventors' previous theoretical and simulated results indicate that users should first maximize M and increase G as necessary (Kota et al., 2022).
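Under Poisson loading, the chance that a non-empty partition holds more than one molecule grows quickly with the total rate, which is why limiting dilution is the conventional safeguard. A small sketch:

```python
import math

def p_multi_given_nonempty(total_lambda):
    """Probability that a non-empty partition captured more than one molecule,
    under Poisson(total_lambda) loading."""
    p_empty = math.exp(-total_lambda)
    p_single = total_lambda * p_empty
    return (1 - p_empty - p_single) / (1 - p_empty)
```

At a total rate of 0.1, roughly 5% of non-empty partitions are multi-analyte; at a rate of 1, the figure exceeds 40%, which is why tolerating multi-analyte capture matters for concentrated samples.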
The inventors' ddPCR assay has a few critical limitations that warrant future development. First and foremost, they designed probes to assign unique barcodes to the bacterial genera in the inventors' chosen panel. However, future LNA probe design must account for bacteria outside the panel that could plausibly appear in a sample to ensure that the designed barcodes are specific to the intended pathogens. These design requirements motivate new bioinformatics tools. In general, the inventors recommend that their framework be applied in applications where the scope of plausible analytes in the intended samples is well understood such that the specificity of the analytes' fingerprints can be verified. Second, they amplified the full 16S gene for maximum sequence flexibility in probe design. Amplifying the full-length, 1500-base-pair gene is well beyond the manufacturer recommendations. In ddPCR, longer extension times and high cycle counts can generally only help improve endpoint amplitudes and clustering. Here, the PCR protocol alone was over 8 hours, which is not ideal for infection diagnostics. The primary goal was to demonstrate the inventors' mathematical framework, such that the thorough optimization of this assay is beyond the scope of this report. Future iterations on the inventors' approach could either employ custom master mixes with faster polymerases or restrict the amplicon to a shorter 16S segment. Third, many clinical infections may be caused by bacteria or fungi. Multiplexing primers to include eukaryotic marker genes along with the 16S primers for bacteria could broaden the panel. All of the above lines of research should also consider the optimal marker genes to target with nonspecific probes; certainly, no single gene target will be optimal for all applications in microbial diagnostics.
While there is room for improvement in the ddPCR approach, expanding the measurement space with non-binary measurements would dramatically improve diagnostic performance as fewer sensors could assign unique fingerprints to microbes at a higher rate. Some of these non-binary effects may be readily present in the droplet “lean” and “lift” likely caused by weak interactions between probes and slightly mismatched amplicons. These potential sources of additional information were ignored by the hard thresholding employed in this work. Moreover, while the inventors present their framework in the context of microbial diagnostics where the inventors' research group is most familiar, they note the generality of the approach for other diagnostic applications. Combining conventional sensors with new techniques in microfluidics and signal processing will offer a suite of new interdisciplinary approaches to scalable, multiplexed biosensing.
Theory: Identifiability with Common Types of Sensors. For estimating λ, the property of identifiability means that there is a one-to-one correspondence between each realizable distribution of measurements and the Poisson rates λ: if p(y|λ)=p(y|λ′) for all y, then λ=λ′. From an optimization perspective, identifiability implies that λ* is the unique global optimum of the likelihood function in the limit of infinite measurements. Therefore, identifiability essentially confirms that the inventors' approach is possible and worth pursuing further.
Notation. The inventors use boldface upper and lower case letters for matrices and vectors, respectively. Non-bold, lower case letters represent scalars. They denote the vector of all zeros as 0 with its dimensionality dependent on context. They use script letters (A, B, etc.) to denote sets. They denote ej as the standard basis vector whose jth element is 1 and whose other elements are 0. Let a and b be two arbitrary vectors of the same dimension, and let ai and bi denote their ith elements. They use supp(a) to denote the support of vector a, defined as the index set where ai>0 for i∈supp(a). The inventors use the notation a⪰b to imply that ai≥bi ∀i, and they use a≻b to further imply the existence of at least one index i where ai>bi. A set A in the subscript of a vector, such as xA, refers to the subvector of x indexed by the elements of A. The inventors make frequent use of the shorthand Σa to denote the summation over the elements of a vector a.
Definitions and Assumptions. The inventors treat the dataset of measurements from all partitions in a sensor group as samples of a random variable y. The signal (i.e., analyte quantities in a partition) x is N-dimensional with x˜Poisson(λ*). Each signal is measured by M sensors to yield the observation vector y (e.g., M fluorescence measurements). They define the function ƒ: ℤ+N→ℝM that is composed of M scalar functions ƒm: ℤ+N→ℝ. A particular measurement value ym is determined by the sensor output ƒm(x) plus some additive, zero-mean random noise nm.
Note that the sensor functions fm are group-dependent. For example, each group may have different probes.
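The generative model above (Poisson capture followed by group-dependent sensing) can be sketched as follows. The fingerprint table, the rates, and the binary saturating sensor function are hypothetical stand-ins for illustration:

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's algorithm for a Poisson draw (adequate for the small rates here)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def measure(x, fingerprints, rng, noise_sd=0.0):
    """y_m = f_m(x) + noise. Here f is a hypothetical saturating sensor:
    channel m reads 1 if any captured analyte carries fingerprint bit m."""
    M = len(fingerprints[0])
    y = [float(any(x[n] > 0 and fingerprints[n][m] for n in range(len(x))))
         for m in range(M)]
    return [ym + rng.gauss(0, noise_sd) if noise_sd else ym for ym in y]

rng = random.Random(1)
lam_star = [0.2, 0.05, 0.1]                     # per-analyte Poisson rates
fingerprints = [(1, 0), (0, 1), (1, 1)]         # hypothetical binary fingerprints
x = [poisson_sample(l, rng) for l in lam_star]  # analyte counts in one partition
y = measure(x, fingerprints, rng)
```

Each sensor group would carry its own `fingerprints` table, reflecting its own probes.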
The inventors assume that the inventors' sensors are monotonic and saturating and that the inventors' system obeys fingerprint equivalence. Given these properties, they prove sufficient conditions for identifiability. Without loss of generality, they will say that all M sensor functions obey these properties. An additional sensor that does not follow these properties does not nullify the inventors' analysis.
Definition 4.1 (Monotonic Sensors). A sensor function ƒm: ℤ+N→ℝ is monotonic increasing if a⪰b⇒ƒm(a)≥ƒm(b) and monotonic decreasing if a⪰b⇒ƒm(a)≤ƒm(b).
Monotonic functions are very common and natural; for instance, many sensing modalities have a monotonic increasing sigmoidal response to their input. Notably, sensors may saturate in their response to a subset of analytes at which point increasing the analyte content does not change the output. The inventors define this formally:
Definition 4.2 (Saturating Sensors). If x⪰x′ and ƒ(x)=ƒ(x′) for a set of monotonic sensors, then for all z, z′ that satisfy z⪰z′⪰x′ and supp(z−z′)⊆supp(x−x′), ƒ(z)=ƒ(z′).
Any analytes added to x′ to yield x, indexed by supp(x−x′), did not influence sensor output measurements. These sensors are either saturated in their response to these analytes, or these analytes do not interact with the sensors at all. Because any addition of these components to x′ yields the same M measurements, any addition of them to a signal greater than or equal to x′ must also result in no change in sensor measurements for these sensors to be considered saturating.
The inventors define a final intuitive condition on the inventors' system called fingerprint equivalence. The fingerprint of analyte n is the measurement yielded by an isolated copy of the analyte, or ƒ(en). Among analytes with identical fingerprint responses within a sensor group, the total number of occurrences of these analytes dictates the output response. In other words, the sensors treat these analytes as interchangeable copies of each other.
Definition 4.3 (Fingerprint Equivalence). Let A⊆{1, . . . , N} be an index set of analytes with identical fingerprints, i.e., ƒ(ei)=ƒ(ej) ∀i,j∈A. The system exhibits fingerprint equivalence if, for any two signals x and x′ with supp(x−x′)⊆A and ΣxA=Σx′A, ƒ(x)=ƒ(x′).
Note that even if all analyte fingerprints are distinct, multiple signals can still map to the same measurement vector since the inventors allow for cases of multi-analyte capture in the same partition. The inventors define these signals as members of a collision set.
Definition 4.4 (Collision Sets). The collision set Cx for signal x is the set of all signals x′ that satisfy ƒ(x′)=ƒ(x).
The inventors define C as the set of unique collision sets. In 2-channel ddPCR with binarized measurements, there are four collision sets in each sensor group ({0,1}2). It will soon be clear that observations y are drawn from a mixture model. The inventors can define each mixture element as follows:
Definition 4.5 (Mixture Element). The mixture element εx for signal x is the set of all signals x′ that satisfy p(y|x)=p(y|x′) for all y.
Note that with any zero-mean noise, p(y|x)=p(y|x′)⇒ƒ(x)=ƒ(x′) such that εx⊆Cx. In some cases, such as additive white Gaussian noise, εx=Cx. The inventors define ν as the set of unique mixture elements with arbitrary element εv∈ν.
Proof of Identifiability. With G different sensor groups indexed by g, the inventors assume that the sensor group applied to a measurement y is known and deterministic. Each sensor group has a different function ƒ that maps x to M-dimensional space (e.g., different probes in ddPCR). Identifiability means that p(y|λ)=p(y|λ′) ∀y,g⇒λ=λ′. Each λ must yield a unique set of G distributions of measurements.
The inventors will use the notation Cug and εvg to specify the group g when necessary. For an arbitrary group, they can express p(y|λ) as:
If a mixture distribution is identifiable, it means that identical distributions must come from the same set of weights on the mixture elements; in this context, p(y|λ)=p(y|λ′)⇒P(εv|λ)=P(εv|λ′) ∀v. Many finite mixtures (what the inventors practically have in MMVP) and countably infinite mixtures with common noise distributions are identifiable (Yakowitz & Spragins, 1968; Yang & Wu, 2013), and they assume that the system noise characteristics lend to an identifiable mixture. However, the inventors need to prove the identifiability of MMVP, or that equal mixture weights imply equal Poisson parameters: P(εvg|λ)=P(εvg|λ′) ∀v,g⇒λ=λ′. Note that because εvg⊆Cug and unique collision sets are disjoint, P(εvg|λ)=P(εvg|λ′) ∀v,g⇒P(Cug|λ)=P(Cug|λ′) ∀u,g.
The inventors assume P(Cug|λ)=P(Cug|λ′) ∀u,g and prove the implication λ=λ′ given a set of monotonic, saturating sensors and a system exhibiting fingerprint equivalence in all G groups. They first focus on what can be concluded from a single, arbitrary sensor group (dropping the g superscript) with analytes potentially having nonunique fingerprints, and then they conclude with how multiple groups can be pooled to achieve identifiability. Without loss of generality, the inventors' analysis will focus on monotonic increasing sensors.
Define A⊆{1, . . . , N} indexed by a where there exists some m such that ƒm(ea)>ƒm(0). Define the complementary set B indexed by b where ƒ(eb)=ƒ(0).
Lemma 4.1. ƒ(x)=ƒ(0) if and only if xa=0 ∀a∈A.
Proof: Let one define the shorthand xB∈{x: xa=0 ∀a∈A}. Note supp(xB)⊆B. By definition, ƒ(eb)=ƒ(0) for any b∈B. Let c be an arbitrary positive integer. Monotonic functions allow ƒ(ceb)⪰ƒ(eb), which could lead to a nonzero sensor response. However, because the inventors' functions are also saturating, ƒ(ceb)=ƒ(0) since supp(ceb−eb)⊆supp(eb−0). Next, because of fingerprint equivalence, any signal xB with supp(xB)⊆B and with a sum ΣxB=c satisfies ƒ(xB)=ƒ(ceb). Therefore, ƒ(xB)=ƒ(0).
The inventors prove the forward condition, ƒ(x)=ƒ(0)⇒xa=0 ∀a∈A, by contradiction. Say ƒ(z)=ƒ(0) and let z satisfy za≥1 for some a∈A. By definition of A, ƒm(ea)>ƒm(0) for some m, and z⪰ea. With monotonic functions, ƒ(z)⪰ƒ(ea)≻ƒ(0), and they have arrived at a contradiction.
This result indicates that the values of xB do not influence the measurement output of any signal because sensors are saturating. In other words, for an arbitrary signal z and some xB, ƒ(z+xB)=ƒ(z) because supp(z+xB−z)=supp(xB−0) and ƒ(xB)=ƒ(0). The values of elements in xB are therefore entirely arbitrary for the analysis of collision sets.
Lemma 4.2. If P(C0|λ)=P(C0|λ′), then ΣλA=Σλ′A.
Proof: By Lemma 4.1, C0 contains all x with xA=0 and with arbitrary values on B. Therefore, P(C0|λ)=P(C0|λ′) implies
P(xA=0|λ)=P(xA=0|λ′) (8)
e−ΣλA=e−Σλ′A (9)
ΣλA=Σλ′A (10)
Lemma 4.3. If for x∈Cea, either xA=ea or xA=cea for arbitrary positive integers c, then P(Cea|λ)=P(Cea|λ′) implies λa=λ′a.
Proof: For identifiability, the inventors assume P(Cea|λ)=P(Cea|λ′). If the former condition holds, then
P(Cea|λ)=P(Cea|λ′) (11)
λae−ΣλA=λ′ae−Σλ′A (12)
Lemma 4.2 implies that λa=λ′a. Similarly, if the latter condition holds, then P(Cea|λ)=P(Cea|λ′) implies
P(Cea|λ)=P(Cea|λ′) (13)
(1−e−λa)e−ΣλA\a=(1−e−λ′a)e−Σλ′A\a (14)
(eλa−1)e−ΣλA=(eλ′a−1)e−Σλ′A (15)
Lemma 4.2 coupled with this last equality means that λa=λ′a, thus concluding the proof.
From here, the inventors first derive results for the special case where all analytes indexed by A have unique single-copy fingerprints. Afterwards, they generalize to multiple groups, allowing for equal nonzero fingerprints within a group. The next lemma guarantees at least one index a to which Lemma 4.3 can be applied.
Lemma 4.4. If ƒ(ei)≠ƒ(ej) ∀i,j∈A with i≠j, then ∃a∈A such that any x∈Cea satisfies xA=cea for some positive integer c.
Proof: First, with unique nonzero fingerprints in A and monotonic sensors, the fingerprint responses ƒm(ea) can be sorted. Without loss of generality, the inventors allow all functions to be monotonic increasing. Starting arbitrarily with m=1, they can select the minimal set A1⊆A that minimizes ƒ1(ea) such that ∀a∈A1, ∀j∈A\A1, ƒ1(ea)<ƒ1(ej). If |A1|>1, then the process can be repeated with m=2 (and so forth) on the subset A1 until there is one unique minimum and its corresponding index a.
For this ea, all i∈A\a satisfy ƒm(ei)>ƒm(ea) for at least one m. Therefore, signals in the collision set Cea cannot contain any analyte i∈A\a and must satisfy xA=cea for some positive integer c, the second condition of Lemma 4.3.
Next, the inventors show how this result chains to all analytes indexed in .
Lemma 4.5. If ƒ(ei)≠ƒ(ej) ∀i,j∈A with i≠j, then λa=λ′a ∀a∈A.
Proof: Lemmas 4.3 and 4.4 guarantee at least one a that yields λa=λ′a. Let one call this index a1 and define the subset S⊆A as the subset of indices for which λi=λ′i ∀i∈S. At this point, S={a1}. Repeating the process in the proof of Lemma 4.4, the inventors can find a new index a2 that satisfies ƒm(ei)>ƒm(ea2) for at least one m for all i∈A\(S∪{a2}).
For the direct proof of identifiability, the inventors assume P(Cea2|λ)=P(Cea2|λ′). Among signals x in Cea2, one type satisfies xA=cea2 for positive integers c. The second type of signals in Cea2 may additionally contain analytes indexed by S. The exponential terms involving λi for i∈S are equal on both sides of the equation by previous conclusions and can be distributed out of the sum and divided, yielding λa2=λ′a2. Because λi=λ′i ∀i∈S, this also yields λa2=λ′a2 in the latter case. Adding a2 to S, this process repeats until S=A.
Now, the inventors will extend this result to the case of having equal fingerprints in the same sensor group, i.e. that ƒ(ei)=ƒ(ej) for some pairs i,j.
Lemma 4.6. Define the disjoint sets A1, A2, . . . , AC indexed by c with ∪c=1C Ac=A such that for all i,j∈Ac, ƒ(ei)=ƒ(ej). Then, ΣλAc=Σλ′Ac ∀c.
Proof: Note that the Poisson distribution has the property that if xi are each independently drawn from Poisson(λi), then Σxi˜Poisson(Σλi). The inventors can then simply define dummy variables xc†=ΣxAc and λc†=ΣλAc such that xc†˜Poisson(λc†). This dummy variable represents an “analyte” that appears with a distribution governed by the total quantities of analytes with the same fingerprint. However, what matters for identifiability is the sensor functional values of signals, i.e., that ƒ(a)=ƒ(b) if for all c, ΣaAc=ΣbAc. Namely, xc†˜Poisson(λc†) follows by fundamental properties of the Poisson distribution, but it is only with the condition of fingerprint equivalence (Definition 4.3) that one may apply all previous results that are based on collision sets, i.e., sets of signals with equal functional values. These yield λc†=λ′c†, or ΣλAc=Σλ′Ac ∀c.
Theorem 4.7. Let g index G different sensor groups that satisfy fingerprint equivalence and that contain monotonic, saturating sensors. For each group g, define the row vector zcg of zeros and ones with ones in the indices associated with Ac. Define the N-column matrix Zg whose C rows are comprised of zcg ∀c. Define the N-column matrix Z as the vertical concatenation of Zg ∀g. If rank(Z)=N, then λ=λ′.
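The rank condition of Theorem 4.7 can be checked numerically. A sketch with a hypothetical two-group, three-analyte example in which neither group alone distinguishes all analytes but the stacked Z has full column rank:

```python
def matrix_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    A = [list(map(float, r)) for r in rows]
    rank, n_cols = 0, len(A[0]) if A else 0
    for col in range(n_cols):
        pivot = max(range(rank, len(A)), key=lambda r: abs(A[r][col]), default=None)
        if pivot is None or abs(A[pivot][col]) < tol:
            continue
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(len(A)):
            if r != rank and abs(A[r][col]) > tol:
                factor = A[r][col] / A[rank][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[rank])]
        rank += 1
    return rank

# Hypothetical example with N=3 analytes and G=2 groups. Group 1 cannot
# distinguish analytes 1 and 2 (shared fingerprint -> one row sums them);
# group 2 cannot distinguish analytes 2 and 3. Stacked, Z has rank 3, so
# lambda is identifiable even though neither group alone suffices.
Z = [
    [1, 1, 0],  # group 1: {1,2} share a fingerprint
    [0, 0, 1],  # group 1: {3} unique
    [1, 0, 0],  # group 2: {1} unique
    [0, 1, 1],  # group 2: {2,3} share a fingerprint
]
identifiable = matrix_rank(Z) == 3
```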
Proof: This theorem is a formal way of saying that Lemma 4.6 must yield N independent equations when applied to all groups where the sensing and system conditions hold. The inventors can consider the system of equations yielded by Lemma 4.6 and represented by Zλ=Zλ′, or Z(λ−λ′)=0. If rank(Z)=N, then it follows that λ=λ′. Therefore, they have P(Cug|λ)=P(Cug|λ′) ∀u,g⇒λ=λ′, concluding the inventors' proof of identifiability.
Bacterial Panel. The inventors ordered bacterial species' genomic DNA from the American Type Culture Collection (ATCC, Manassas, VA). The species' names and their ATCC identifiers are as follows: A. baumannii (BAA-1605), B. fragilis (25285), E. cloacae (13047), E. faecium (BAA-2320), E. coli (11775), K. pneumoniae (13883), P. aeruginosa (BAA-1744), S. aureus (12600), S. epidermidis (14990), S. saprophyticus (15305), S. agalactiae (13813), and S. pneumoniae (33400). Particular strains were selected based on their availability at the time of purchase and only if ATCC provided whole genome sequence information for the isolate.
Probe Design. All oligonucleotides were acquired from Integrated DNA Technologies (Coralville, IA) with HPLC purification. The inventors used ThermoBLAST from DNA Software (Plymouth, MI) to align the 16S primers (27F and 1492R from (Johnson et al., 2019), Table 1) against bacterial genomes and find amplicons. The inventors passed these amplicons to a custom Matlab script to design probes. They needed probes with a melting temperature (Tm) a few degrees higher than that of the primers, targeting at least 65° C. Probes for barcoding must hit multiple bacterial taxa, and shorter probes are naturally less specific. LNAs increase probe melting temperature in a highly position-specific manner. To avoid combinatorially increasing the inventors' probe search space, the inventors deferred LNA positioning until after sequence selection. They used heuristics to filter for probes that would have sufficient Tm with some flexibility in LNA positioning. They chose a sequence length of 11 nucleotides with 5-8 GC nucleotides, without four consecutive G's or C's, and without a G on the 5′ end to avoid self-quenching of the fluorophore. They used Smith-Waterman alignment in Matlab to pre-screen for probes that self-hybridize and to assess cross-hybridization of probes amongst the evolving candidate set. To assist in achieving near-binary measurements, the inventors considered perfect matches on all 16S copies to be “1” for a genome, and for imperfect homology, they filtered for sequences that neither had nine consecutive matches nor had only a single G-T mismatch. This latter filtering is a proxy for ensuring that probes have weak, negligible interactions against 16S sequences where they do not have perfect complementarity. The former filtering for positive hits was intended to avoid the issue of mixtures of barcodes for any particular bacterium for simplicity in the inventors' initial demonstration.
Given a set of filtered, candidate probes, the inventors used a coordinate ascent strategy to iteratively optimize a set. They hypothesized that barcoding the full-length 16S gene with probes could achieve genus-level resolution, as achieving below this level with sequencing of a subsection of the gene is often limited to genus information. As a result, the inventors encouraged barcode similarity among the three Staphylococcus species and the two Streptococcus species. Define S as a set of pairs of bacteria (bi,bj) within a genus that are similar. The complementary set Sc includes all other bacterial pairs. Let kp,i represent the 11-mer barcode of bacterium i with probes indexed by p. Coordinate ascent sought to solve
The first term is taken from research in metric learning (Xing et al., 2002), and the second term (with a weight of θ=10) highly rewards some nonzero separation between all bacterial pairs that are intended to be discriminated between. The inventors chose an initial random set of probes that passed the inventors' cross-hybridization check. They iteratively cycled through a shuffled order of the candidate set of probes, evaluating one probe at a time for replacement with any of the other probes that passed the initial filtering step. If replacing a probe improved the objective function, the probe set was updated and the search continued. The algorithm terminated when all probe sequences had been evaluated for replacement but not replaced. For the chosen set of sequences, the inventors evaluated the alignment against 16S genes with imperfect homology (the zeroes in the barcodes). As much as possible, they positioned LNAs at mismatch sites to improve the thermodynamic discrimination against these sequences. They evaluated all Tm's in IDT's OligoAnalyzer, positioning additional LNAs as necessary to reach a sufficient probe Tm. Final probe sequences are listed in Table 3.
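The swap-one-probe-at-a-time search can be sketched as follows. The objective here is a simplified stand-in (total pairwise Hamming separation plus a reward of θ for each separated pair, ignoring the within-genus similarity terms of the actual objective); the probe-hit table and probe names are hypothetical:

```python
from itertools import combinations

def barcode(probe_set, hits, bacterium):
    """Barcode of a bacterium = tuple of probe hit/miss bits over the chosen set."""
    return tuple(hits[p][bacterium] for p in probe_set)

def objective(probe_set, hits, n_bact, theta=10):
    """Simplified stand-in objective: total Hamming separation between all
    bacterial pairs, plus a large reward for each pair with nonzero separation."""
    score = 0.0
    for i, j in combinations(range(n_bact), 2):
        bi, bj = barcode(probe_set, hits, i), barcode(probe_set, hits, j)
        d = sum(a != b for a, b in zip(bi, bj))
        score += d + (theta if d > 0 else 0)
    return score

def coordinate_ascent(candidates, hits, n_bact, set_size, seed_set):
    """Swap one probe at a time, keeping any swap that improves the objective;
    stop when a full pass makes no replacement."""
    current = list(seed_set)
    improved = True
    while improved:
        improved = False
        for pos in range(set_size):
            for cand in candidates:
                if cand in current:
                    continue
                trial = current[:pos] + [cand] + current[pos + 1:]
                if objective(trial, hits, n_bact) > objective(current, hits, n_bact):
                    current, improved = trial, True
    return current

hits = {'p1': [1, 0, 0], 'p2': [0, 1, 0], 'p3': [1, 1, 1]}  # hypothetical probe-hit table
chosen = coordinate_ascent(['p1', 'p2', 'p3'], hits, n_bact=3, set_size=2,
                           seed_set=['p3', 'p1'])
```

The greedy swaps terminate because the objective strictly increases and the search space is finite, mirroring the termination condition described above.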
Preparation of Bacterial Samples. First, the inventors prepared monomicrobial dilutions of genomic DNA in MilliQ purified water. One dilution was prepared for each bacterium, approximately targeting a concentration λ between 0.2 and 2 (“Concentration 1”). They diluted each of these by ½ to yield a second dilution of “Concentration 2.” They used a custom script to assign random combinations of these bacterial dilutions to samples, generating five samples with k=2 unique bacteria and six samples each with k=3 and k=4. They reserved one sample to be a no template control (NTC, water alone). The probability of drawing each bacterium was adaptively weighted to encourage approximately even representation of each taxon across the samples (note, Staphylococcus and Streptococcus species were lumped and treated as one taxon each so as to not overrepresent their constituent species).
Droplet Digital PCR. Primers were at 900 nM and all probes were at 125 nM. The inventors used Bio-Rad's ddPCR Multiplex Supermix and prepared master mixes, generated droplets, and read out droplets according to the manufacturer's instructions. For PCR cycling, extension times were set to 7 minutes because of the long amplicon (approximately 1500 base pairs) that is atypical in ddPCR, partly following guidance from (Lasham et al., 2020) and internal data.
PCR cycling was as follows: 95° C. for 10 minutes (initial denaturation and hot-start deactivation), 60 cycles of 94° C. for 30 seconds (denaturation) and 60° C. for 7 minutes (annealing and extension), 98° C. for 10 minutes. Ramp rates during cycling were set to 2° C./second. Samples were refrigerated at 4° C. for 30 minutes prior to droplet readout.
Plate Arrangement and Manual Thresholding.
For manual thresholding, the inventors pooled the data from the reference wells for each group and overlaid a particular well's data (
Settings for SPoRe. The inventors set the initial value of λ to the constant vector 0.1 for an unbiased initialization. The inventors performed gradient descent to minimize the negative log likelihood with backtracking line search to adaptively tune the learning rate at each iteration. After computing the gradient and taking a step, SPoRe projects to λn=max(λn, 10−6); values of λn=0 cause numerical issues in gradient computations, hence the small offset. SPoRe then checks whether the negative log likelihood decreased; if it did not, the iteration's step size is cut in half. SPoRe only commits to an update to λ after confirming a reduction in the negative log likelihood, at which point it resets the learning rate and proceeds to the next iteration. SPoRe terminates when the log likelihood's relative change over the previous five iterations falls below 10−6. In contrast to the inventors' original implementation of SPoRe (Kota et al., 2022), the inventors use an exact gradient computation over the entire dataset, made feasible by the nature of ddPCR data (Section 5.2), rather than a Monte Carlo approximation over a batch of data.
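A minimal sketch of this optimization loop, assuming generic user-supplied functions for the negative log likelihood and its gradient (the single-rate usage example below is illustrative, not the inventors' multi-analyte model):

```python
import math

def spore_descent(nll, grad, n, lr0=1.0, floor=1e-6, tol=1e-6, max_iter=10000):
    """Projected gradient descent on the negative log likelihood with
    backtracking line search: constant initialization at 0.1, projection to a
    small positive floor, halving the step on any non-decrease, and
    termination on small relative change over five iterations."""
    lam = [0.1] * n
    history = [nll(lam)]
    for _ in range(max_iter):
        g, lr = grad(lam), lr0
        while True:
            trial = [max(l - lr * gi, floor) for l, gi in zip(lam, g)]
            if nll(trial) < history[-1]:
                lam = trial
                break
            lr /= 2
            if lr < 1e-12:          # no descent found; stop early
                return lam, history
        history.append(nll(lam))
        if len(history) > 5 and abs(history[-6] - history[-1]) < tol * abs(history[-6]):
            break
    return lam, history

# Usage sketch: recover a single rate from the fraction of positive droplets.
# With D = 1000 droplets and D0 = 600 negatives, the MLE is -log(0.6).
D, D0 = 1000, 600
nll = lambda lam: (D0 * lam[0] - (D - D0) * math.log(1 - math.exp(-lam[0]))) / D
grad = lambda lam: [D0 / D - (D - D0) / D * math.exp(-lam[0]) / (1 - math.exp(-lam[0]))]
lam_hat, history = spore_descent(nll, grad, n=1)
```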
Assessing E. cloacae Barcode Variability. E. cloacae's amplicons appeared to always interact with Probes 3 and 4, but a small subcluster appeared to lack the HEX response to Probe 1 (
Amplitude Multiplexing with More than Two Probes. Amplitude multiplexing is a technique to resolve more probes than the available number of color channels, but it is typically used with each unique probe participating in an orthogonal assay with its own primer pair (Whale et al., 2016). Probe concentrations can be adjusted to “move” the cluster positions. Here, the inventors adjust probe concentrations with a single pair of primers. Probes 1 and 5 were tagged with HEX, and Probes 2-4 were tagged with FAM (
Exact Gradient Computation and p(y|x) Model for ddPCR. The inventors first focus on the gradient resulting from a single probe group. In a single group, there are only four viable measurements with y∈{0,1}2. Let one define 𝒴={0,1}2, and py as the proportion of the D measurements that equal y. The inventors can then re-express Equation (2) as:
Here, the inventors will use λ≡λ(bact) since they will be optimizing over the bacterial concentrations directly. In this case, E. cloacae is the only bacterium with a fractional abundance of a probe binding site—approximately 87.5% of its copies interact with probes 1, 3, and 4, and 12.5% interact with only probes 3 and 4 (Section 4.8). Similarly to how the inventors defined C in Section 2.1, they can define C(g) for group g with each bacterium's fractional abundances of genes with a corresponding barcode.
The inventors define p(y|x)=Πmp(ym|x). The likelihood p(ym=1|x) is 1 if the droplet has at least one copy of a gene that interacts with probe m and 0 otherwise. Conversely, p(ym=0|x) is 1 if none of the genes in the droplet interact with probe m and 0 otherwise.
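A minimal sketch of this deterministic droplet likelihood, assuming a hypothetical interaction table in place of the actual barcode definitions:

```python
def p_y_given_x(y, x, interacts):
    """Deterministic droplet likelihood p(y|x) = prod_m p(y_m|x) for binarized
    measurements: channel m reads 1 iff the droplet holds at least one gene
    copy that interacts with probe m. interacts[n][m] is 1 when a copy of
    gene n turns on channel m (a hypothetical barcode table)."""
    prob = 1.0
    for m, ym in enumerate(y):
        on = any(x[n] > 0 and interacts[n][m] for n in range(len(x)))
        prob *= 1.0 if (ym == 1) == on else 0.0
    return prob
```

With two genes carrying barcodes (1,0) and (0,1), a droplet holding two copies of the first gene yields y=[1,0] with likelihood 1 and every other outcome with likelihood 0.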
However, with the analyte currently defined as a copy of the nth bacterium's 16S gene, the inventors must be careful. For instance, with index 3 corresponding with E. cloacae, if x3=1 in x, p(y1|x) may not be 1 since one copy of E. cloacae's 16S gene is not guaranteed to interact with probe 1. To resolve this, they will temporarily transform the problem to the space of gene barcodes for this group: λ(BC
The linearity of gradients allows one to treat this one y at a time, summing the contributions from each y at the end. In general, treat 00 as shorthand for [0,0], 01 for [0,1], etc. Let ∇λ00 be the component of the gradient from y=[0,0], ∇λ01 from y=[0,1], etc. The inventors will similarly define the mean log likelihood contributions as ℒ00, ℒ01, etc. Similarly, define the rows of C(1) as C00(1), C01(1), etc. By convention, λ and other vectors should be assumed to be column vectors, but the rows of C(1) are row vectors. Thus, the inventors have:
ℒ00=p00 log P(y=[0,0]|λ) (23)
=p00 log P(x01=0 and x10=0 and x11=0) (24)
=p00 log e−(C01(1)+C10(1)+C11(1))λ (25)
=p00(−C01(1)−C10(1)−C11(1))λ (26)
∇λ00=p00(−C01(1)−C10(1)−C11(1))T. (27)
In the first line, the inventors define the conditions for x(BC
A virtually identical simplification for ∇λ10 is omitted here. Lastly, for y=[1,1], the inventors have:
ℒ11=p11 log P(y=[1,1]|λ) (31)
=p11 log(P(x11≥1)+P(x11=0 and x01>0 and x10>0)) (32)
=p11 log((1−e−C11(1)λ)+e−C11(1)λ(1−e−C01(1)λ)(1−e−C10(1)λ)) (33)
The remaining algebraic steps are omitted, but the final result is
where Cij(1)λ can be substituted for any xij.
The inventors can now say that for group 1, ∇λ(1)=∇λ00+∇λ01+∇λ10+∇λ11. The above process can be repeated for any group g. Therefore, the final gradient vector (arbitrarily scaled) is
where pg is the proportion of total droplets that come from group g.
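The group-wise computation can be illustrated as follows; this sketch evaluates the mean negative log likelihood of one 2-channel group and its exact gradient directly from the four outcome proportions, with a hypothetical C(1) table (rows indexed by barcode, columns by bacterium):

```python
import math

def group_nll_and_grad(p, C, lam):
    """Mean negative log likelihood and exact gradient for one 2-channel group
    with binarized measurements. p maps outcomes '00','01','10','11' to
    observed droplet proportions; C maps the three nonzero barcodes (rows) to
    each bacterium's fractional abundance (columns), so the Poisson rate of
    barcode ij is dot(C[ij], lam)."""
    n = len(lam)
    dot = lambda row: sum(r * l for r, l in zip(row, lam))
    a, b, c = dot(C["11"]), dot(C["01"]), dot(C["10"])
    ea, eb, ec = math.exp(-a), math.exp(-b), math.exp(-c)

    P00 = math.exp(-(a + b + c))
    P01 = math.exp(-(a + c)) * (1 - eb)
    P10 = math.exp(-(a + b)) * (1 - ec)
    P11 = (1 - ea) + ea * (1 - eb) * (1 - ec)
    nll = -(p["00"] * math.log(P00) + p["01"] * math.log(P01)
            + p["10"] * math.log(P10) + p["11"] * math.log(P11))

    grad = [0.0] * n
    for i in range(n):
        c11, c01, c10 = C["11"][i], C["01"][i], C["10"][i]
        g = p["00"] * (c11 + c01 + c10)                   # -d log P00
        g += p["01"] * (c11 + c10 - c01 * eb / (1 - eb))  # -d log P01
        g += p["10"] * (c11 + c01 - c10 * ec / (1 - ec))  # -d log P10
        dP11 = (c11 * ea * (1 - (1 - eb) * (1 - ec))
                + ea * (c01 * eb * (1 - ec) + c10 * ec * (1 - eb)))
        g -= p["11"] * dP11 / P11                         # -d log P11
        grad[i] = g
    return nll, grad
```

Summing these per-group gradients, weighted by each group's share of droplets, gives the full gradient described above.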
Special Cases of MLE with ddPCR. From Equation (3) describing the generalized gradient in MLE, the inventors consider two commonly employed special cases. First, if samples are sufficiently dilute such that partitions are either empty (xd=0) or have only one analyte, the goal is often to identify each nonzero signal independently with a classification process. In other words, assays must be designed such that p(yd|x)>0 only for x*d—the measurements are unambiguous.
Setting the gradient equal to zero and simplifying leads to
In practice, clusters of classes must have reliable decision boundaries and concentrations are estimated by totaling the classification results.
The second specialized case is common for ddPCR, where each PCR assay is specific for a target analyte and assigned to a particular channel. With M channels, N=M and each measurement unambiguously determines the presence or absence of each target sequence. Precisely, p(yd|x) is one or zero, and considerations of each analyte's quantity xn can be simplified to xn=0 (absent) or xn>0 (present). Each analyte n can be inferred independently such that
λn=−log(D0,n/D), where D0,n is the number of droplets that do not contain analyte n. Each analyte's concentration is based on the proportion of droplets that do not contain it. This formula can be found by applying the above assumptions, setting Equation (3) to zero, and simplifying.
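This special case reduces to the standard ddPCR Poisson estimate; as a sketch:

```python
import math

def ddpcr_rate(d0_n, d_total):
    """Per-analyte Poisson rate from the fraction of droplets negative for
    analyte n: lambda_n = -log(D0_n / D)."""
    return -math.log(d0_n / d_total)
```

For example, if 600 of 1000 droplets are negative for an analyte, the estimated rate is −log(0.6)≈0.51 copies per droplet.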
All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. More specifically, it will be apparent that certain agents which are both chemically and physiologically related may be substituted for the agents described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.
The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.
This application claims benefit of priority to U.S. Provisional Application Ser. No. 63/279,484, filed Nov. 15, 2021, the entire contents of which are hereby incorporated by reference.
This invention was made with government support under Grant No. T15LM007093 awarded by the National Institutes of Health, Grant No. CBET 2017712 awarded by the National Science Foundation, and Grant No. N00014-18-1-2047 awarded by the Department of Defense and Office of Naval Research. The government has certain rights in the invention.