Throughout this application, various publications are referenced, including references in parentheses. The disclosures of all publications mentioned in this application are hereby incorporated by reference in their entireties into this application in order to more fully describe the state of the art to which this invention pertains.
This invention will be better understood by reference to the detailed description which follows, but those skilled in the art will readily appreciate that the specific experiments detailed are only illustrative of the invention as defined in the claims which follow thereafter.
The advance of biology over the last half-century is inextricably tied to DNA sequencing. Although there are dozens of major sequencing technologies, all rely on the serial identification of the four bases G/T/A/C from single molecules, and the spatial isolation of the signals arising from each molecule is the main determinant of throughput.
Recently, advances in next-generation sequencing have arisen from the spatial isolation of each molecule into a small volume, enabling many single-molecule sequencing reactions to run in parallel. The fundamental limit to throughput with this technique is the need to isolate individual molecules at a sufficient spatial scale that sequencing signals are not mixed.
Disclosed here is a method to accurately sequence complex mixtures of DNA and RNA species, in such a way as to reveal the underlying sequences that make up the mixture. This approach provides for a dramatic increase in the density of DNA molecules in a sequencing reaction for both in-vitro and in-situ techniques.
Aspects of the invention provided herein can be analogized to attempting to read a grocery list having overlapping text that is so densely written that two words are superimposed. For example, two overlapping words may appear as the position-by-position mixture G+P, E+R, A+A, C+P, E+H.
Unraveling this mixture is difficult. However, a dictionary of 5-letter fruits is useful to determine the identity of each word. In the above example, the dictionary may contain the fruits: [APPLE, GRAPE, LEMON, MANGO, MELON, PEACH, PRUNE]. Applying logical deduction or a combinatorial search reveals that the mixed signal provided above can be resolved in only one way: as the superimposition of GRAPE and PEACH.
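This combinatorial search is small enough to carry out directly. The sketch below is illustrative only; the fruit dictionary and the unordered-pair encoding of the mixed signal are taken from the example above:

```python
from itertools import combinations

# The hypothetical 5-letter fruit dictionary from the example above.
FRUITS = ["APPLE", "GRAPE", "LEMON", "MANGO", "MELON", "PEACH", "PRUNE"]

def demix(mixed):
    """Return every pair of dictionary words whose position-wise letter
    pairs reproduce the mixed signal (order within a position ignored)."""
    solutions = []
    for w1, w2 in combinations(FRUITS, 2):
        if all(sorted((a, b)) == sorted(col)
               for a, b, col in zip(w1, w2, mixed)):
            solutions.append((w1, w2))
    return solutions

# The superimposed signal: one unordered letter pair per position.
mixed_signal = [("G", "P"), ("E", "R"), ("A", "A"), ("C", "P"), ("E", "H")]
print(demix(mixed_signal))  # → [('GRAPE', 'PEACH')]: the unique solution
```

Only one pair of fruits survives the position-by-position check, mirroring the logical deduction above.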
This problem can also be framed as a linear algebra problem. (See
The invention provided herein demonstrates that it is possible to break this barrier and to generate accurate sequencing information while operating in a heavily mixed regime. To resolve these mixtures, the toolset of compressed sensing is used, which seeks to identify the simplest or ‘sparsest’ solution to mixed signals and efficiently ‘demixes’ the ambiguous original sequence information. This approach, which we term Mixture Sequencing (MIXSEQ), enables a massive increase in sequencing throughput for a variety of traditional sequencing platforms, as well as a new class of experiments for in-situ sequencing.
In comparison to the grocery list problem above, the problem of high-density DNA sequencing differs in form in a few important ways. First, there are only four letters (A,T,G,C) instead of 26. Second, a raw sequencing signal is not as categorical as letters from an alphabet, but instead varies in amplitude, and is contaminated by noise. Third, it is not always obvious how many signals are mixed together, and this must be determined in the process of demixing the signal. In general, however, these differences are addressed within the compressed sensing framework.
Importantly, the approach described here reframes the sequencing operation from a search for de novo sequences, into a regression problem in which sequences are matched to an existing dictionary.
Depending on the biological problem, this dictionary may represent a transcriptome, genome, a set of random DNA barcodes, a set of RNA or DNA aptamers, or any other set of oligonucleotides with biological relevance.
The form of this dictionary is important. Suppose one attempts to decode our ambiguous grocery signal described above—G+P, E+R, A+A, C+P, E+H—using the full English dictionary instead of a simple dictionary of fruits. Suddenly, two equivalent solutions may be found: two fruits (PEACH+GRAPE), or two generic nouns (PEACE+GRAPH). This kind of ambiguity also affects the DNA sequencing problem and arises directly as a function of dictionary size. In general, larger dictionaries make the demixing problem more difficult. However, as shown herein, the relevant dictionaries for a wide range of biological problems—including transcriptome sequencing, genome sequencing for CNV analysis, and single-cell barcoding—are readily available or can be readily applied. Moreover, multiple dictionaries can be applied individually to the same data set, and the results can be compared for differences, which in turn can be used to decide which result is most probably correct.
To properly frame the problem solved by the MIXSEQ approach, the general process of DNA sequencing in its current form is outlined. Typical approaches to DNA sequencing have three steps: (1) isolation of a single molecular species, (2) selective amplification, and (3) performance of the actual sequencing reaction through repeated measurement. The exact sequencing method varies, but common methods such as Sanger sequencing or sequencing by synthesis (e.g., Illumina) typically return a 4-channel measurement corresponding to each possible base at a given nucleotide position.
As an example, consider modern next-generation sequencing on the Illumina HiSeq platform. Individual molecules from the DNA sample are first isolated onto a glass flow cell and then amplified to form many small colonies. Each colony is subjected to “sequencing by synthesis,” in which the sequence is read via successive incorporation of fluorescent nucleotides. Using this platform, sequencing information is read out through 4-channel fluorescent microscopy by identifying the fluorescent signal associated with each colony.
The massive throughput of this method is enabled by the high density of the DNA colonies and the ability to multiplex millions of colonies on a single coverslip. Nonetheless, this density is limited by the need to avoid generating colonies too close together, because overlapping colonies generate ambiguously mixed sequence information. This limitation arises from the impossibility of performing traditional base-calling on mixed sequence information.
Modern methods for in-situ sequencing suffer from a similar problem. For instance, Fluorescent In-Situ Sequencing (FISSEQ) is a method for transcriptome sequencing that relies on transforming each RNA molecule into a small amplified RNA colony (“rolony”) in the physical context of the original cell. These rolonies can then be sequenced using fluorescent chemistry. The efficacy of this method is ultimately limited by rolony density, as overlapping rolonies provide an ambiguously mixed signal.
To move beyond this barrier, “mixture sequencing” (MIXSEQ) replaces the traditional base-calling step with an algorithmic demixing of overlapping signals. This approach operates directly on the superimposed fluorescent signal arising from multiple DNA sequences.
In order to resolve these signals, MIXSEQ relies on previous knowledge of a dictionary of known sequences, such as a previously sequenced genome or transcriptome. In many cases, the dictionary can also be selected based on the data itself.
By enabling the identification of sequence information from mixed samples, MIXSEQ allows for a new form of multiplexed sequencing that enhances the throughput of both in-vitro and in-situ sequencing and allows for new biological experiments in in-situ sequencing methods.
To further demonstrate the utility of applying MIXSEQ to in-situ sequencing methods, FISSEQ is first described here in more detail. Using the traditional FISSEQ technique, endogenous mRNAs are subjected to a five-step process. First, endogenous RNAs are subjected to reverse transcription, forming a short complementary DNA (cDNA) containing the “target sequence”. Second, the target sequence is selected and incorporated onto an exogenous nucleic acid backbone either via a gap-filling ligation using a padlock probe or via circLigase. Third, the circularized product including the target sequence is amplified via rolling circle amplification using Phi29 polymerase, which generates a rolling circle colony or “rolony”. Fourth, the target sequence is selectively read out by application of a flanking sequencing primer and well-known sequencing methods, such as the chemical processes involved in sequencing by ligation (SBL) or Illumina sequencing methods. This gives rise to a “sequencing signal”. Fifth, the sequencing signal is subjected to standard base-calling methods, which seek at each position to find the most likely nucleotide in the target sequence. For standard sequencing using four-color fluorescent methods, this base-calling happens by identifying the color channel with maximum intensity.
Theoretically, every RNA molecule in the sample is assumed to give rise to a maximum of one rolony, and one associated nucleotide sequence known as the target sequence. While some target sequences may be present in multiple rolonies, each rolony only contains one target sequence. Therefore, the base-calling algorithm may be run on each 2-dimensional pixel or 3-dimensional voxel individually or may be run on a collection of pixels such as an entire identified rolony. This base-calling algorithm operates by identifying and interpreting the fluorescent intensities arising from the sequencing operation (the sequencing signal) and assigning a single nucleotide base for each position in the molecule according to the relative fluorescent intensity in each channel.
A major practical limitation of FISSEQ is that in order to run standard base-calling algorithms, the set of rolonies in the sample must not overlap physically. Indeed, currently significant effort is taken to avoid rolony overlap. Rolony overlap can be avoided through several processes including (a) sequencing relatively few target sequences at a time (Ke et al., 2013), (b) physically expanding the tissue using “expansion microscopy,” (Chen et al., 2016), (c) reducing the physical size of rolonies, at the expense of a dimmer signal and less-reliable base-calling, (d) sequencing only a subset of rolonies in a given sample by careful selection of sequencing primers, or (e) improvements in microscopy, at the expense of imaging time. Each of these methods has costs in terms of the number of sequenced molecules, signal-to-noise ratio, imaging time, etc.
The need to prevent rolony overlap provides a fundamental limit to the usefulness of FISSEQ, because the size of the cell is quite limited relative to the size of a single rolony. For example, a spherical cell of 5 μm radius can only contain ˜1000 spherical rolonies of radius 0.5 μm. This number is insufficient to support many uses of FISSEQ, such as robust quantification of mRNA copy number for more than a small number of genes.
The inventive method described here provides an alternative method for addressing the problem of rolony overlap and increases the overall throughput of the sequencing reaction.
First, we alter the method of FISSEQ to intentionally generate rolonies with high levels of overlap. This overlap may be quantified by the average distance between rolonies, such that at least 5%, 10%, 25%, 50%, 90%, or 100% of rolonies are within 0.5 μm, 1 μm, or 2 μm of their nearest neighboring rolony. In another method of quantifying overlap without careful distance measurements, rolonies may be considered to overlap if, for at least 5%, 10%, 25%, 50%, 90%, or 100% of rolonies, at least 5%, 10%, 25%, 50%, 90%, or 100% of the pixels imaged for that rolony overlap with pixels from another rolony.
In the process of sequencing, the overlap of two or more rolonies gives rise to a “mixed sequencing signal,” which may contain information from two or more target sequences. For a given pixel or collection of pixels, the mixed sequencing signal may represent the summation of sequencing signals for two or more rolonies, either in equal proportion, or in unequal proportions. Using traditional base-calling, it is not possible to “demix” this signal to identify the original target sequences.
However, using MIXSEQ it is possible to “demix” the mixed sequencing signal to identify the set of target sequences that are present in overlapping rolonies. This method can be applied either to individual pixels in the image or it can be applied to collections of pixels. In both cases, we will refer to the input data as the “mixed sequencing signal.”
As described in this application, in order to identify the target sequences present in the set of overlapping rolonies that form the mixed sequencing signal, the signal is compared to a known dictionary, i.e. a database of known nucleic acid sequences that are potentially expressed or contained within the sample. In this application, such a dictionary is referred to as the “sequence dictionary,” and it can be drawn from, for example, known sequences from the transcriptome or genome of any species. The sequence dictionary may also contain a set of apparently random sequences of known length, e.g. “barcodes,” that may arise from exogenous sources such as virus infection or direct transfection and are found within a tissue as RNA or DNA molecules.
For the demixing process, the goal is to identify a combination of target sequences that adequately reconstructs the mixed sequencing signal. This combination is referred to as the “demixed solution” to this problem and defines a set of weights or probabilities for each sequence in the sequence dictionary, with these values corresponding to an estimate of the proportional contribution (or probability of contribution) to the mixed sequencing signal. This demixed solution can be found using a variety of algorithms, including regression, constrained regression, LASSO, combinatorial theory, compressed sensing, convex optimization, approximate message passing, belief propagation, logistic regression, deep learning, and others.
One useful approach is to seek a demixed solution that is “simple” in some way. In the context of the demixed solution, “simplest” may suggest that the smallest number of target sequences are used, or that the relative weights of several target sequences are relatively low, measured as the average weight, maximum weight, L1 weight, L2 weight, entropy of weights, etc. This approach is commonly used in other fields that seek to “demix” other signals such as image processing, radar, etc.
In the context of the demixed solution, a reconstruction may be understood as “adequate” if it is sufficient to explain most of the variability, or amplitude, of the mixed sequencing signal, with an error that is less than 90%, 75%, 50%, 25%, 10%, or 5%.
As the mixed sequencing signal may consist of signals from many pixels, the weights applied to each pixel may be different. The measurement of a “simple” solution may also be combined across pixels. For example, the solution to a many-pixel problem may be found by a “Group LASSO” or “multiple Gaussian” solver. Additional relational information about the many-pixel problem may arise, for example, if pixels are arranged spatially such that nearby pixels are likely to carry similar signals. Additional relational information about the many-pixel problem may also arise if groups of target sequences within the sequence dictionary are likely to show correlations in their presence or absence across pixels. Lastly, additional relational information about the many-pixel problem may be used when the goal is to identify deviations from the dictionary—for example, using the solutions of many such sequencing problems to identify deletions or single nucleotide polymorphisms (SNPs) in the target sequences.
The outline of the MIXSEQ approach has three parts: (1) a mathematical framework for sequence demixing using compressed sensing; (2) delineation of the limits of MIXSEQ and heuristics for identifying solvable problems; and (3) practical applications of this technique for sequencing genomes and transcriptomes on Illumina or FISSEQ platforms.
The MIXSEQ approach essentially replaces the traditional base-calling step of DNA sequencing. Traditionally, base-calling operates on a signal from one DNA species, consisting of an analogue value for each possible nucleotide i.e., G/T/A/C, at each position. The MIXSEQ approach replaces this step, and instead enables the processing of superimposed signals from many DNA species. To outline the MIXSEQ approach, the problem of base-calling must first be framed in terms of linear algebra (
For a general, non-sequencing related problem, a set of n measurements can be made and represented as a vector s. It is known that s is made up of several superimposed signals with differing features, drawn from a dictionary A consisting of p dictionary elements. Thus, s is a weighted sum of the elements, or columns, of A. We can write the problem as a simple regression problem:
s = Ax    (Equation 1)
Here x is the unknown set of weights or loadings that denote which dictionary elements, i.e., columns of A, have been mixed into the measurement s. Also, note that the problem here is simplified such that each measurement is a one-dimensional scalar, rather than a 4- or 26-dimensional vector.
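As a toy numerical illustration of Equation 1 (the Gaussian dictionary, sizes, and seed below are arbitrary assumptions), the measurement s is literally a weighted sum of dictionary columns, and in the well-determined regime ordinary least squares recovers the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 8, 4                               # n measurements, small dictionary
A = rng.normal(size=(n, p))               # toy dictionary: one element per column
x_true = np.array([0.0, 2.0, 0.0, 1.0])   # two elements mixed into the signal

s = A @ x_true                            # Equation 1: weighted sum of columns

# With n >= p and full-rank A, least squares recovers the loadings exactly.
x_hat, *_ = np.linalg.lstsq(A, s, rcond=None)
print(np.round(x_hat, 6))
```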
Using traditional linear algebra, we can solve for x as long as the number of measurements (n) is at least as large as the dictionary size (p), assuming A has full rank. However, this regime of n>=p is less useful for our purposes, as it requires a very small dictionary. The more likely regime is n<<p. Here, however, the problem is underdetermined—that is, it has infinitely many possible solutions.
The fundamental result of compressed sensing is that underdetermined problems can indeed be solved, as long as the solution is “sparse”—that is, made up of relatively few underlying components.
For the grocery list problem described previously, this is analogous to saying that a solution that combines two 5-letter words, e.g., PEACH/GRAPE, is preferred over one that combines several smaller words or letters, e.g., the six-component mixture [PEA-- + --APE + G---- + -R--- + ---C- + ----H].
Several methods are available to identify the “sparsest” solution.
The most obvious is to search over all combinations of dictionary elements in A. The sparsest solution might then be found by minimizing the number of non-zero elements in x—this is known as the L0 norm of x, written |x|_0. While effective, this is computationally impractical for all but the most trivial problems.
It has been shown that the sparsest solution can also be identified by minimizing the L1 norm of x, corresponding to the summed magnitude of elements of x. This approach identifies the same solution as the L0 norm, but is computationally tractable and efficient. A variety of algorithms are available for this approach and are used, for example, in radar, JPEG compression, and MRI (Blanchard, 2013). Approaching the large-dictionary problem with L1-norm minimization or related convex problems has proved to be a powerful and general method for resolving ambiguous mixtures.
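A minimal sketch of this idea, assuming non-negative weights (in which case L1 minimization reduces to a linear program) and an arbitrary Gaussian dictionary; this is an illustration, not the production demixing algorithm:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

n, p, k = 12, 40, 2                    # measurements, dictionary size, mixture
A = rng.normal(size=(n, p))            # assumed random Gaussian dictionary
x_true = np.zeros(p)
x_true[[5, 17]] = [1.0, 2.0]           # a k-sparse, non-negative ground truth
s = A @ x_true

# For non-negative x, min ||x||_1 subject to Ax = s is the linear program
#   min 1'x  s.t.  Ax = s, x >= 0.
res = linprog(c=np.ones(p), A_eq=A, b_eq=s, bounds=(0, None))
x_hat = res.x
print(np.flatnonzero(x_hat > 1e-6))    # indices of the recovered components
```

Despite the problem being heavily underdetermined (12 measurements, 40 dictionary elements), the L1-minimal solution lands on the true 2-sparse mixture.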
To adapt this approach to the problem of DNA sequencing, we must rewrite the measurement s to match the form of the signal coming from the sequencing machine. We start with a single molecule of DNA of length c. Under four-color fluorescent sequencing, the sequence of this molecule would be read as an n = 4c sequence matrix S. Each row of S corresponds to a single nucleotide position, and each column corresponds to a color channel associated with nucleotide A/T/G/C. (
However, for a mixture of molecules in a sequencing reaction, the fluorescence output at each nucleotide position will consist of multiple colors, and the sum S* of two or more sequencing matrices, i.e., S* = S1 + . . . + Sn, will be observed. (
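A small sketch of this encoding (the channel ordering and example sequences are assumptions for illustration): each sequence becomes a one-hot matrix, and overlapping rolonies simply add their matrices:

```python
import numpy as np

CHANNELS = "ATGC"   # assumed mapping of color channels to bases

def sequence_matrix(seq):
    """One-hot sequencing matrix S: one row per nucleotide position, one
    column per color channel (n = 4c entries for a length-c molecule)."""
    S = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        S[i, CHANNELS.index(base)] = 1.0
    return S

S1 = sequence_matrix("GATTA")    # illustrative sequences, not real targets
S2 = sequence_matrix("CATGA")
S_mix = S1 + S2                  # overlapping rolonies: intensities add
print(S_mix)                     # rows with two lit channels are mixed positions
```

Positions where the two sequences agree show a single channel at double intensity; positions where they differ show two channels lit, which is exactly the ambiguity traditional base-calling cannot resolve.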
s* = Ax    (Equation 2)
Here A is a dictionary of p known sequences relevant to the biological problem. As before, this problem is trivially solvable if n >= p. Modern sequencing technology restricts the size of n to approximately 1E2-1E3. What, then, is the size of p?
In the worst-case scenario, the dictionary would contain every possible n-mer, and we would seek the sparsest solution using combinatorial search. If we sequence only n nucleotides, there are a total of p = 4^n possible sequences and ~4^(kn) combinations of k sequences. Analyzing this problem combinatorially is clearly infeasible: for a measurement of length n = 20, testing one pairwise combination per nanosecond would require roughly twenty million years. Fortunately, two factors coincide to render this problem more tractable.
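The scale of this combinatorial search can be estimated directly; the back-of-the-envelope sketch below counts only pairwise (k = 2) combinations at one test per nanosecond:

```python
import math

n = 20
p = 4 ** n                       # every possible 20-mer
pairs = math.comb(p, 2)          # pairwise (k = 2) combinations to test
seconds = pairs * 1e-9           # one combination tested per nanosecond
years = seconds / (3600 * 24 * 365)
print(f"{p:.2e} sequences, {pairs:.2e} pairs, {years:.2e} years")
```

Even restricted to pairs, the search takes tens of millions of years; allowing k > 2 components makes it astronomically worse.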
The first is that, in typical biological applications, dictionary size is much smaller than 4^n because only a subset of possible n-mers are relevant to the problem. For example, of all 4^20 ≈ 1E12 nucleotide sequences of length 20, less than 0.4% (p ~ 1E10) are actually used in the human genome. (Liu et al., 2008). In the case of transcriptome sequencing, appropriate dictionaries can be built on the order of the number of genes (p ~ 1E5-1E6). Or, for truly random sequences such as those used for tissue barcoding, p is a directly tunable parameter (p ~ 1E4-1E10 for neural barcoding). Thus, the size of the working dictionary is much more manageable than at first glance.
Our second result is mathematical, arising from the field of compressed sensing. An important result is that the “sparsest” solution as measured by the L0 norm is, in many cases, also the sparsest when measured by the L1 norm. This problem can then be solved, as before, by seeking solutions that minimize the L1 norm of x.
Identifying the solution with the smallest L1 norm is significantly easier than the L0 norm because this problem is convex—any locally optimal solution is guaranteed to be the global optimum as well. In most cases and as assessed in detail below, the L1 solution is also the same as the L0 solution and can be found using a variety of algorithms that are efficient, robust, and resistant to noise.
To provide a flavor of these algorithms, one algorithm, known alternately as Matching Pursuit or stepwise regression, is briefly outlined below. It proceeds as follows: (1) identify the dictionary sequence (column of A) most strongly correlated with the current signal; (2) add that sequence to the solution, weighted by its correlation; (3) subtract its contribution from the signal to form a residual; and (4) repeat on the residual until little signal remains.
In practice this procedure will identify one DNA sequence from A for each iteration, repeating until the mixed signal is adequately explained.
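The steps above can be sketched compactly; this is plain Matching Pursuit over an assumed random dictionary, with sizes and seed chosen only for illustration:

```python
import numpy as np

def matching_pursuit(A, s, max_iter=10, tol=1e-6):
    """Greedy demixing sketch: repeatedly pick the dictionary column most
    correlated with the residual, add its loading to the solution, and
    subtract its contribution, until the signal is adequately explained."""
    A = A / np.linalg.norm(A, axis=0)          # unit-norm dictionary columns
    x = np.zeros(A.shape[1])
    r = s.astype(float).copy()
    for _ in range(max_iter):
        j = int(np.argmax(np.abs(A.T @ r)))    # best-matching element
        w = A[:, j] @ r                        # its loading on the residual
        x[j] += w
        r -= w * A[:, j]
        if np.linalg.norm(r) < tol * np.linalg.norm(s):
            break
    return x

# Toy demonstration on an assumed random dictionary of 200 "sequences".
rng = np.random.default_rng(2)
A = rng.normal(size=(60, 200))
s = A[:, 3] + A[:, 40]                         # two mixed dictionary elements
x = matching_pursuit(A, s, max_iter=6)
print(sorted(np.argsort(x)[-2:].tolist()))     # the two dominant loadings
```

Each iteration identifies one mixed component, matching the description above of one DNA sequence recovered per pass.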
To demonstrate the efficacy of this technique, a dictionary of 10,000 random barcodes was generated and used to attempt to resolve a mixed sequencing signal of 25 nucleotides generated from 10 randomly chosen barcodes. (See
In the next section, the conditions under which this approach is likely to be successful are outlined, and it is shown that many biological problems lie within the feasible range.
(2) Determining when the MIXSEQ Problem is Solvable in the Absence of Noise
The field of compressed sensing has devoted considerable energy to understanding the circumstances under which underdetermined problems (such as Equation 1) can be efficiently solved and when such methods recover the correct answer. A major result is the observation that the solvability of a compressed sensing problem shows a phase transition—shifting between “almost always” solvable and “almost always” insolvable—that depends on the size of the problem. This transition is governed by three parameters: the number of measurements n, the size of the dictionary p, and the number of mixed components k in the final solution. (Donoho and Elad, 2003). In general, problems are solvable for relatively large n, relatively small p, and relatively small k. This result is remarkably general and provides a roadmap for ensuring that any given problem is solvable.
To begin an analysis of this problem, we return to our one-dimensional problem (Equation 1) in which the dictionary A is derived from random Gaussian measurements, the k non-zero loadings in x are all equal in magnitude, and there is no noise. Problems of this form have the most permissive bounds for solvability.
To understand the behavior of this problem, two parameters are defined. First, the sparsity fraction (delta = k/n), which is the number of non-zero coefficients in x per measurement (n). Second, the undersampling fraction (rho = n/p), which is the number of measurements (n) divided by the dictionary size (p). When rho = 1, the problem is no longer underdetermined and is always solvable.
As an example, we then generated a random dictionary of size p = 1000 and ran 10,000 simulations. The results show a remarkable phase transition in solvability. Following Donoho, we report both the L2 error in the coefficients (that is, ∥x−x*∥) as well as the probability that this L2 error exceeds a small tolerance (eps = 1E-5). (See
Importantly, however, the success or failure of the algorithm here is not easily determined without knowledge of the ground truth {x}. Methods for assessing these errors are discussed below.
The location of this phase transition for different types of dictionaries is examined next. The problem is optimal when the dictionary has maximum entropy, as for the Gaussian distribution. However, similar dictionaries with less entropy, including Bernoulli dictionaries consisting of random values drawn from {0,1}, show similar results. (
The dictionary of DNA barcodes is indexed somewhat differently than others, because the 4-dimensional measurement for each nucleotide is concatenated into a single vector. These bounds are somewhat clearer when indexed by the number of bases sequenced. (See
One important feature of the problem is not incorporated into this representation—dictionary size (see
(3) Resistance to Noise
The MIXSEQ approach described herein also shows a remarkable resistance to noise (see
(4) Estimation of the False Positive Rate
For any practical application of this technique, it will be important to estimate the false detection rate. In this context the false detection rate is defined as the probability that, for a problem of a given size solved by a given algorithm, an incorrect dictionary sequence will be chosen. This probability can be approximated by adding a set of “bait” sequences to the sequence dictionary, which are known not to correspond to any biological sequence. Assuming that the dictionary is random, the likelihood that the demixing procedure will choose any given “bait” sequence is equal to the probability of choosing a false positive within the original sequence dictionary. As the inclusion of such “bait” sequences always decreases the overall probability of successful recovery, this FDR can be considered a conservative estimate.
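One way this bait-based estimate might be computed (the helper, its inputs, and the numbers below are hypothetical, not part of the method as claimed):

```python
def estimate_fdr(detected, bait_ids, n_bait, n_real):
    """Approximate the false detection rate from bait hits: the hit rate on
    baits (known to be absent) estimates the per-sequence false-positive
    probability, which is then scaled to the real dictionary size and
    normalized by the number of detections."""
    bait_hits = sum(1 for d in detected if d in bait_ids)
    per_seq_fp = bait_hits / n_bait
    expected_false = per_seq_fp * n_real
    return min(1.0, expected_false / max(len(detected), 1))

# Made-up example: 10 detections, one of which is a bait sequence.
detected = ["seq_%d" % i for i in range(9)] + ["bait_7"]
bait_ids = {"bait_%d" % i for i in range(100)}
fdr = estimate_fdr(detected, bait_ids, n_bait=100, n_real=900)
print(fdr)  # → 0.9 under these made-up numbers
```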
(5) Exploiting Multi-Pixel Correlations
The model used in Equation 2, s* = Ax, is actually a simplification of the sequencing signal. For either Illumina or FISSEQ sequencing, in which the signal is a multi-channel fluorescent signal observed through fairly traditional microscopy, there are strong correlations between neighboring pixels. This correlation can be exploited by altering the approach of Equation 2 to enable multi-pixel decoding, encouraging solutions that use the same dictionary elements across pixels. Here it is assumed that the measurement vector s is repeated, forming a measurement matrix S, with each column corresponding to an individual pixel. Similarly, the weight vector x is expanded to a weight matrix X, with each column corresponding to the weights of each dictionary sequence for a given pixel. It is assumed that the same set of non-zero weights in X is shared across pixels, although the magnitude of each may vary per pixel. Typically, it is also assumed that these weights are independent and uncorrelated across pixels; however, this assumption is frequently violated in real data with little detriment.
Thus, the problem can be reframed so as to minimize the reconstruction error of S = AX while identifying the sparsest set of weights X. The use of the same barcode across pixels is encouraged by first calculating the L2 norm of each barcode's loadings (across pixels) and subsequently calculating the L1 norm of these per-barcode totals. This reduces the penalty for using the same barcode across multiple pixels, encouraging solutions that reuse a common set of barcodes.
min(∥AX−S∥) s.t. |X|1,2 < epsilon    (Equation 3)
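The |X|1,2 penalty of Equation 3 can be sketched as follows (assuming rows index barcodes and columns index pixels); note how concentrating energy on one barcode across pixels scores lower than spreading it over many barcodes:

```python
import numpy as np

def l12_norm(X):
    """Mixed |X|_{1,2} penalty of Equation 3: the L2 norm of each barcode's
    loadings across pixels (one row per barcode), summed via an L1 norm."""
    return np.linalg.norm(X, axis=1).sum()

# One barcode lighting up 4 pixels vs. 4 barcodes lighting up 1 pixel each:
shared = np.zeros((4, 4)); shared[0, :] = 1.0
spread = np.eye(4)
print(l12_norm(shared), l12_norm(spread))  # 2.0 vs 4.0
```

Both matrices contain the same total signal, but the shared-barcode solution is cheaper under this penalty, which is exactly the bias toward reusing a common set of barcodes described above.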
Similarly, it is also possible to encourage solutions in which there is explicit spatial smoothness i.e., neighboring pixels are similar. We define f(X) as a measure of the spatial variation in X, and then solve the equation:
min(|X|1) s.t. ∥AX−S∥ < epsilon and f(X) < epsilon    (Equation 4)
Although both approaches are algorithmically complex, either can be solved using the flexible toolkit of convex optimization.
These approaches have two uses. The spatially smoothed approach is useful when the actual physical measurement, e.g., the signal arising from fluorescent microscopy, is spatially smooth; it exploits this smoothness to more reliably identify the correct sparse solution to the mixing problem. For the non-spatial approach, which simply encourages the consistent use of a common set of dictionary sequences, the assumption is that there is an intrinsic structure to the sequences themselves. This might be appropriate, for example, when identifying species communities from a collection of multiple mixed sequence signals that independently or differentially subsample the underlying population. (Amir and Zuk, 2010). We note that this work takes a similar approach to that described here but is restricted to a single measurement of one mixed sequencing signal.
(6) Identifying SNPs by Reliable Deviation from the Dictionary
In many cases, the biological question of interest involves deviations from an expected sequence, particularly including single nucleotide polymorphisms (SNPs). Intuitively, the presence of a SNP seems likely to derail the demixing process, as there is no correct dictionary match. In practice, however, the SNP variation will not disrupt identification of the correct template from the sequence dictionary.
Indeed, this approach can even be exploited to identify the presence of the SNP, without actually performing base-calling at any step (
More generally, we can define our problem as S = AX. With SNPs, the correct problem uses a slightly different, but unknown, dictionary A_SNP; thus S = A_SNP X_SNP. The set of SNPs that distinguish A_SNP from A is unknown. However, this problem can be rewritten as S = (A + Δ)X_SNP, where Δ = A_SNP − A. Assuming that the effect of SNPs on the resulting solution X and measurement matrix S is small, then X ≅ X_SNP, and the distributive property allows S = AX_SNP + ΔX_SNP. This can be solved piecewise by (1) solving S = AX; (2) calculating the residual R = S − AX; and (3) solving R = ΔX_SNP for the unknown Δ.
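A toy numerical sketch of this piecewise procedure (the reference sequences, one-hot encoding, and single-pixel setting are illustrative assumptions): regressing on the unmodified dictionary still selects the correct template, and the residual localizes the SNP:

```python
import numpy as np

def one_hot(seq, channels="ATGC"):
    """Flattened one-hot encoding of a DNA sequence (4 channels per base)."""
    M = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        M[i, channels.index(base)] = 1.0
    return M.ravel()

# Hypothetical dictionary A of known target sequences (one per column).
refs = ["GATTACA", "CCCGGGT", "TTTTAAA"]
A = np.column_stack([one_hot(r) for r in refs])

# Observed signal: the first reference carrying a SNP at position 3 (T->G).
s = one_hot("GATGACA")

# Steps (1)-(2): solve s = Ax by least squares, then form the residual.
x, *_ = np.linalg.lstsq(A, s, rcond=None)
r = s - A @ x
# Step (3), sketched: the position with the largest residual is the SNP.
pos = int(np.argmax(np.abs(r.reshape(-1, 4)).sum(axis=1)))
print(pos)  # → 3, identified without base-calling any position
```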
(7) Learning the Dictionary
Identifying SNPs is a specific case of the more general problem of learning the full dictionary A. Given the ability to exploit correlations in the signal between neighboring pixels, it is often possible to learn the dictionary directly. In this sense, the set of pixels showing mixed fluorescence can be thought of as delineating a subspace that is spanned by a few unknown sequences. These can be learned using a variety of subspace estimation algorithms that are similar to principal components analysis (PCA). For example, Non-Negative Matrix Factorization, Independent Components Analysis, and an appropriately formed and trained neural network can each effectively identify the correct dictionary sequences.
For non-negative matrix factorization, we attempt to learn both the dictionary matrix A and the weight matrix X. We do this by solving a related problem S = WH, where W is a dictionary representation and H is the weight matrix. One important constraint placed on this problem is that all entries in W and H must be non-negative. This problem and its variants can be reliably solved to identify the set of sparse components W that can be combined via the weight matrix H to reconstruct the signal S.
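A minimal sketch of such a factorization, using the classic Lee-Seung multiplicative updates (illustrative only; not the specific solver contemplated above, and the toy signal is made up):

```python
import numpy as np

def nmf(S, k, iters=1000, seed=0):
    """Sketch of Lee-Seung multiplicative updates: factor a non-negative
    signal matrix S into S ~ W @ H, with W (dictionary representation)
    and H (weights) kept non-negative throughout."""
    rng = np.random.default_rng(seed)
    n, m = S.shape
    W = rng.uniform(0.1, 1.0, (n, k))
    H = rng.uniform(0.1, 1.0, (k, m))
    for _ in range(iters):
        H *= (W.T @ S) / (W.T @ W @ H + 1e-9)   # update weights
        W *= (S @ H.T) / (W @ H @ H.T + 1e-9)   # update dictionary
    return W, H

# A rank-2 toy signal built from two hidden "sequences" mixed per pixel.
W_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
H_true = np.array([[1.0, 0.0, 2.0, 0.5], [0.0, 1.0, 1.0, 0.5]])
S = W_true @ H_true

W, H = nmf(S, k=2)
err = np.linalg.norm(S - W @ H) / np.linalg.norm(S)
print(round(err, 6))  # relative reconstruction error
```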
In the problem shown in
Importantly, a variety of such algorithms exist, and can exploit (a) explicit assumptions about spatial smoothness, (b) cross-validation strategies for identifying sequences, and (c) terms that encourage further sparsity in both W and H. One important aspect of this approach is that, in general, the matrix W is not explicitly constrained to have the form of a traditional DNA sequence. While a simple solution is to run base-calling on each sequence in W to identify the nearest valid DNA sequence, it is also possible to constrain the algorithm to search only in the space of valid sequences.
One important aspect of dictionary learning techniques such as NMF and ICA is the need to estimate the total number of sequences to learn. For this case, two methods are available—the L-curve (originally introduced in the context of ridge regression; see
(8) Deviations from the Assumption of a Random Dictionary
The reformulation of the sequencing problem via compressed sensing carries an important assumption about the form of the sequence dictionary. This assumption usually states that the dictionary is drawn from a Uniform Spherical Ensemble—that is, that dictionary elements can be understood as points on a unit sphere, distributed uniformly on that sphere. (See, e.g., Donoho et al., 2006).
In the original motivation for this work—DNA barcoding—the barcode dictionary consists of barcodes that are approximately random at each base. This is sufficient to provide for a Uniform Spherical Ensemble.
Typical synthesis procedures for barcode generation do not ensure equal distribution of each base at each position. This decreases the overall entropy of the barcode library and reduces the information gained in each round of sequencing. However, the magnitude of this effect is relatively small. For a perfectly random barcode library, we gain 2 bits per nucleotide; if one base is left out entirely, this drops to 1.6 bits. If each nucleotide in the mixture is drawn uniformly from a range of 0×-2× the target value, we receive 1.75 bits of information. Indeed, simulations suggest that for sequence dictionaries derived from existing genomes or transcriptomes, the dictionary is likely to deviate even further from random. For every such dictionary, a confusion matrix can be generated to identify the probability that two sequences will be swapped during the demixing procedure.
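The per-base information figures quoted above follow directly from the Shannon entropy of the base distribution; a minimal check (ours, not part of the disclosure):

```python
import math

def bits_per_base(counts):
    """Shannon entropy (in bits) of one base position, given the relative
    abundances of the four bases at that position."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

uniform = bits_per_base([1, 1, 1, 1])      # perfectly random: 2.0 bits
one_missing = bits_per_base([1, 1, 1, 0])  # one base absent: log2(3) ~ 1.58
```

The "0x-2x" figure of 1.75 bits quoted above is an average over random draws of the per-position distribution and can be checked by simulation in the same way.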
(9) Recovery of Copy Number Variation
In many cases, it is desirable to treat the recovered sequences as arising from a single linear chromosome and to identify duplications and deletions in this chromosome. This problem is similar to the circular binary segmentation problem, (Olshen et al., 2004), but, in this case, must be recovered from data in which all sequences are intermixed. We have recently demonstrated that this is possible through an additional constraint on the MIXSEQ approach, as discussed in detail below.
Firstly, copy number variation (CNV) is an important contributor to heritable and acquired genetic disorders such as cancer. However, analysis of copy number variation is expensive at the level of sequencing because each genome position must be sampled multiple times (30x+) in order to reliably recover its overall prevalence. As described herein, mixture sequencing (MIXSEQ) is an approach in which multiple DNA molecules are sequenced simultaneously, giving rise to superimposed signals that must be disambiguated using the toolbox of compressive sensing. This method lowers the overall cost of resequencing and is particularly well suited to the problem of copy number variation. Here, we outline the application of MIXSEQ to investigation of copy number variation. In particular, we attempt to derive a method in which a smoothed, or piecewise linear, estimate of CNV can be derived directly from the compressive sensing problem.
A model is derived from an extension of the degenerate oligonucleotide primed polymerase chain reaction (DOP-PCR) approach to linear whole-genome amplification and standard Illumina sequencing. In this technique, a degenerate primer is used to linearly amplify a small fraction of the genome for sequencing, which through various techniques supports highly reliable CNV calling. (Wang et al., 2016). The subset of the genome amplified using this technique is relatively small, e.g., 20,000 sequences, and can serve as a reasonable dictionary in a compressed sensing framework. It is also assumed that the sequencing operation is similar to standard 4-color Illumina sequencing, with many molecules sequenced across many thousands of pixels/clusters.
Under the assumption that the genome shows copy number variation, some fraction of subsequences along each chromosome will deviate from diploid copy number. Typical CNVs include duplications and deletions that lead to triploid/monoploid states, although more dramatic changes are possible.
The set of m sequences amplified by DOP-PCR is defined as the columns of a matrix A(n,m). Sequences of length n in A (indexed as A.,m for column m) are represented as length-4n sequence vectors, with bases chosen randomly at an A+T:G+C ratio of 0.5. A single sequence s=‘GTAC’ then has the sequence vector representation s=[G1 . . . G4, T1 . . . T4, A1 . . . A4, C1 . . . C4]T=[1 0 0 0, 0 1 0 0, 0 0 1 0, 0 0 0 1]T (here, commas are for convenience only). Although the rows of A are not strictly independent in this case, this is ignored here. In addition, it is assumed that sequences in A are ordered and evenly spaced along a single linear chromosome.
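The sequence-vector construction just described can be sketched as follows (illustrative Python; we assume the base-major block ordering G, T, A, C implied by the numeric example):

```python
import numpy as np

def sequence_vector(seq, order="GTAC"):
    """Base-major one-hot encoding: one length-n block per base, so a
    length-n sequence becomes a length-4n vector."""
    n = len(seq)
    v = np.zeros(4 * n, dtype=int)
    for i, base in enumerate(seq):
        v[order.index(base) * n + i] = 1
    return v

v = sequence_vector("GTAC")
```

Each block records the positions at which one base occurs, matching the [G1 . . . G4, T1 . . . T4, A1 . . . A4, C1 . . . C4] layout above.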
The loadings of each sequence in A across p pixels are denoted as X(m,p). To account for copy number variation, a reference vector c is defined as a stairstep alternating among diploid, triploid, and monoploid states (See
We seek to estimate “Copy Number Variation” c but, due to Poisson sampling, can only recover the noisy estimate we might call “Sequence Number Variation” ŝ, where ŝm≡∥Xm,.>0∥0. This is the quantity recoverable using traditional sequencing methods, in which there is one sequence per pixel (i.e., k=1) and traditional base-calling and sequence mapping may be used to populate X. After identifying ŝ by counting the occurrence of each sequence that maps to the genome, c is estimated as a piecewise-constant approximation to ŝ by circular binary segmentation. (Olshen et al., 2004). The goal is to identify piecewise-constant subregions of the chromosome that share a common copy number, with well-defined edges.
Our problem deviates from the traditional approach in two ways. First, it is assumed that k>1, and that the estimate X̂ must be derived by a sparse inversion, taking for granted that this is feasible on both a physical and algorithmic basis. Second, it is noted that correct recovery of X̂ allows an estimate of Sequence Number Variation ŝ that could be used to approximate Copy Number Variation c via circular binary segmentation. However, we assume that it is both possible and desirable to estimate ĉ directly by smoothing the estimate contained in X̂ using a variant of the Fused LASSO. (Tibshirani et al., 2005).
We begin by framing the problem as a multi-task LASSO, (Obozinski et al., 2006), or the related multiple measurement vector Basis Pursuit problem (MMV-BP):
We then apply a norm on X̂ to encourage the recovery of a smooth ĉ.
For a single measurement vector, i.e., a single pixel, the Fused LASSO (Tibshirani et al., 2005) constrains the approximate derivative of x to be piecewise constant, and can be implemented by applying the first-order differencing operator (D)
For our MMV-BP problem, the natural extension is to apply these operators to minimize ∥Dĉ∥1, where the row-cardinality ĉ=Σ1p Xp,.>0 is again the occurrence of each sequence in X. However, as threshold operations are non-convex, we instead apply the differencing operator D to the estimate Σ1p Xp,. (which we refer to as an L1 norm).
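The first-order differencing operator D, and the reason penalizing ∥Dc∥1 favors piecewise-constant copy-number profiles, can be illustrated as follows (toy profile ours):

```python
import numpy as np

m = 8
# First-order differencing operator: (D @ c)[i] = c[i+1] - c[i].
D = np.diff(np.eye(m), axis=0)  # shape (m - 1, m)

# A piecewise-constant profile: diploid -> triploid -> monoploid.
c = np.array([2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 1.0, 1.0])

# ||D c||_1 charges only the two jumps (sizes 1 and 2), so minimizing it
# encourages stairstep (fused-LASSO) solutions with well-defined edges.
tv = np.abs(D @ c).sum()
```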
To encourage a sparse X, we compare the effect of three sparsening norms: a total variation constraint ∥X∥1,1, an L2,1 norm that tolerates more small values of X, and a nuclear norm.
The motivation for the use of the nuclear norm is that, as the nonzeros in X arise from an i.i.d. sampling process on c, the approximation (1/p)Σ1p X.,i is itself a rank-one estimate of c via the central limit theorem. Constraints on the rank of X̂ may therefore encourage a useful kind of sparsity, but are nonconvex—instead, we can apply the nuclear norm, which constrains the sum of singular values of X̂. (Recht, 2007). For implementation, we express these constraints in Basis Pursuit format and optimize problems of the following form using the cvx package
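The relationship between the nuclear norm and low rank can be checked directly (illustrative only; the profile and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# Rank-one X: every pixel column is a scaled copy of a common copy-number
# profile c, as under i.i.d. sampling of c across pixels.
c = np.array([2.0, 2.0, 3.0, 1.0])
X = np.outer(c, rng.random(6))

# The nuclear norm is the sum of singular values; for rank-one X it
# collapses to the single nonzero singular value, so penalizing it
# drives the solution toward low rank while remaining convex.
sv = np.linalg.svd(X, compute_uv=False)
nuclear = sv.sum()
```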
In conclusion, copy number variation arising due to segmental duplications or deletions can be recovered by enforcing smoothness in the recovery matrix, as a fused LASSO problem using a differencing operator. (See
Applying MIXSEQ to Sequences Generated from FISSEQ
(a) Introduction
The MIXSEQ approach relies on a tight integration of molecular biology (i.e., sequencing) and math (compressed sensing) to enable the demixing of superimposed DNA sequences. Generally, we are interested in counting elements of DNA, rather than de novo sequencing. The design of primers for reverse transcription, amplification, and sequencing all happens in concert and plays an important role. The design of primers determines three critical factors: (1) which mRNAs will be sampled by the sequencing process; (2) the exact sequences that will be read via FISSEQ (target sequences); and (3) the contents of the sequence dictionary that will be used for demixing.
In all cases below, the result is a cDNA of some kind that contains a target sequence that will be amplified. These target sequences are equivalent to the sequences that would normally arise during de novo sequencing—the only difference is that they are known in advance.
Target sequences are intentionally chosen so that they are as different as possible from one another, according to a variety of metrics. In addition, these dictionary elements may be chosen to conform, or nearly conform, to a known error-correcting code such as a Hamming code or Levenshtein code. (See, e.g., Buschmann and Bystrykh, 2013). This choice defines the sequence dictionary and is critical to our technique.
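A greedy sketch of choosing mutually distant target sequences by pairwise Hamming distance (the sequence length and distance threshold below are illustrative; a production design would use a proper error-correcting code as noted above):

```python
import itertools

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pick_codes(candidates, min_dist):
    """Greedily keep candidates whose pairwise distance stays >= min_dist."""
    kept = []
    for cand in candidates:
        if all(hamming(cand, k) >= min_dist for k in kept):
            kept.append(cand)
    return kept

# All length-4 sequences over G/T/A/C, thinned to pairwise distance >= 3,
# so any two kept target sequences are as different as possible.
all_seqs = ["".join(p) for p in itertools.product("GTAC", repeat=4)]
codes = pick_codes(all_seqs, 3)
```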
(b) Experimental Outline
(c) Experimental Details
1) Design primers and sequencing approach—This is a critical step and must happen first. Details are folded into the text below.
2) Process tissue for FISSEQ—Although this process may vary, generally the steps include:
a) Lightly fixing the tissue;
b) Digesting away DNA;
c) Digesting away protein; and
d) Fixing the RNA via cross-linking (possibly during fixation, possibly after reverse transcription).
3) Perform reverse transcription—A critical feature of the MIXSEQ technique is that it allows RT/amplification to generate rolonies at high densities, such that they overlap optically and/or physically.
The methods used to do this can vary significantly (e.g., changing primers, changing RT conditions, using PADLOCK probes, etc.). However, the end result is a population of rolonies that arise at such high density that they would not normally provide useful sequencing information. The assumption here is that the techniques described herein give rise to this superposition.
4) Amplify cDNA—The cDNA containing our target sequence is then amplified, typically using rolling circle amplification (RCA). The result of this amplification is a rolony.
5) Run FISSEQ reaction to sequence the target sequence—the FISSEQ reaction uses either Illumina or SOLiD sequencing chemistry to generate a fluorescent signal that corresponds to the target sequence.
6) The intensities from multiple pixels are consolidated into a coherent measurement matrix—The measurements made during sequencing arise as a set of multi-color snapshots, with one snapshot for each base position. Each snapshot may be 2D or 3D, depending on whether a full volume is being imaged.
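Step 6 can be sketched as a simple reshape: each pixel contributes one column whose rows run over (base position, color channel) pairs, matching the sequence-vector layout (array sizes below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative snapshot stack: one multi-color image per base position,
# indexed as (position, channel, height, width).
n_pos, n_chan, h, w = 10, 4, 16, 16
snapshots = rng.random((n_pos, n_chan, h, w))

# Consolidated measurement matrix S: one row per (position, channel)
# pair, one column per pixel. A 3D volume would simply flatten an extra
# depth axis into the pixel dimension.
S = snapshots.reshape(n_pos * n_chan, h * w)
```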
Compositions of matter that give rise to mixed sequencing signals. In the general context of oligonucleotide sequencing, there are many compositions that give rise to mixed sequencing signals. In general, mixed sequencing signals arise whenever two or more unique target sequences are amplified and recovered during sequencing within one pixel or a set of contiguous pixels.
For example, a mixed sequencing signal may arise when two rolonies are amplified and sequenced in close proximity to one another, with one rolony arising from a PLA reaction associated with an oligonucleotide, and the second rolony arising by association with a protein (
For example, a mixed sequencing signal may arise when multiple subsequences within a single oligonucleotide are targeted by hybridization of an RNAScope-style set of hybridization probes (
A mixed sequencing signal may arise when a single amplicon such as a rolony is made into a double-stranded molecule, and then sequenced in two directions simultaneously. When sequenced in one direction (
In a modification of traditional PLA, amplification of one rolony (red) may be dependent on proximity to a second rolony (green). When only one rolony is amplified, this gives rise to an (unmixed) sequencing signal (
Under standard sequencing, a target sequence (for example, GTACGTCCGAC) has a corresponding sequence matrix that is not mixed. (
Under some circumstances, we may use a neural network to demix a set of mixed sequencing signals. One possible architecture for this neural network is shown in
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which this invention belongs.
As used herein, and unless stated otherwise or required otherwise by context, each of the following terms shall have the definition set forth below.
As used herein, the term “nucleotide” refers to a nucleotide molecule of any length, which can be DNA or RNA, can be linear, circular, or branched, and can be either single-stranded or double-stranded.
As used herein, the term “sequence” refers to the sequence information encoded by a nucleotide molecule.
As used herein, the term “gene” includes a DNA region encoding a gene product, as well as all DNA regions which regulate the production of the gene product, whether or not such regulatory sequences are adjacent to coding and/or transcribed sequences. Accordingly, a gene includes, but is not necessarily limited to, promoter sequences, terminators, translational regulatory sequences such as ribosome binding sites and internal ribosome entry sites, enhancers, silencers, insulators, boundary elements, replication origins and locus control regions.
As used herein, the term “target sequence” refers to the sequence of interest which is selected, amplified, and revealed via the sequencing operation. This sequence is represented in a traditional format via the oligonucleotide bases (e.g. G, T, A, C, and U) or in a similar textual format.
As used herein, the term “sequence matrix” of a given oligonucleotide sequence refers to a representation of the sequence content of an oligonucleotide in a matrix format that is appropriate for a given sequencing methodology. For example, a sequence matrix might be represented as a matrix where each entry represents the fluorescent intensity associated with a given sequencing step (for each row) and a given channel of the microscopy image (for each column). For example, using 4-color Illumina HiSeq chemistry this representation has an intuitive form: a given target sequence may be represented by the fluorescent signal expected during sequencing; that is, a numerical matrix where one dimension (e.g., rows) represents a position along the oligonucleotide sequence, and another dimension (e.g., columns) represents the possible nucleotide bases (G, T, A, or C). For Illumina HiSeq chemistry, the nucleotides might be represented as G=[1 0 0 0], T=[0 1 0 0], A=[0 0 1 0], C=[0 0 0 1] with appropriate ordering of fluorescent channels, and the sequence “GTAC” is represented as [[1 0 0 0], [0 1 0 0], [0 0 1 0], [0 0 0 1]]. The same target sequence may have differing sequence matrix representations depending on the chemistry used during sequencing. For example, Illumina NextSeq chemistry uses only two colors, and the bases are represented as G=[1 0], T=[0 1], A=[1 1], C=[0 0]. As another example, the SOLiD sequencing method does not have a one-to-one relationship between each sequencing reaction (i.e., each sequencing image) and a given position in the target sequence, but can still be represented as an appropriate sequence matrix. In this case, a representation of an oligonucleotide sequence appropriate for SOLiD sequencing might represent a sequence as a series of ligation steps in one dimension (for example, rows) and fluorescent output channels (for example, columns).
In addition, the sequence matrix representation may incorporate information about bleed-through between microscopy channels or expected intensity associated with each microscopy channel and may thus represent the expected output of a specific microscope or microscope configuration. In addition to these variations of sequencing chemistry, a sequence matrix may be reordered without losing or changing its content—for instance, by transposition, or transformation into a vector by concatenating the rows or columns of a sequence matrix.
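The HiSeq-style and NextSeq-style sequence matrices described above can be sketched as follows (illustrative Python; the encoding dictionaries transcribe the definitions given above, and the function name is ours):

```python
# 4-color (HiSeq-style) and 2-color (NextSeq-style) base encodings, as
# described above; rows are base positions, columns are color channels.
HISEQ = {"G": [1, 0, 0, 0], "T": [0, 1, 0, 0],
         "A": [0, 0, 1, 0], "C": [0, 0, 0, 1]}
NEXTSEQ = {"G": [1, 0], "T": [0, 1], "A": [1, 1], "C": [0, 0]}

def sequence_matrix(seq, encoding):
    """One row per base position, one column per fluorescent channel."""
    return [encoding[base] for base in seq]

m4 = sequence_matrix("GTAC", HISEQ)    # 4 x 4 sequence matrix
m2 = sequence_matrix("GTAC", NEXTSEQ)  # 4 x 2 sequence matrix
```

The same target sequence "GTAC" thus yields different sequence matrices under the two chemistries, as the definition notes.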
As used herein, the term “sequence vector” refers to a reshaping of a sequence matrix into a vector, either by concatenating the rows or columns of the sequence matrix or through some other reordering.
As used herein, the term “sequencing signal” refers to the signal arising from the sequencing reaction (i.e., the fluorescent output) for a single pixel or collection of pixels. The sequencing signal can be represented in “matrix format”, with each row corresponding to a position along the linear RNA/DNA molecule, and each column corresponding to a different channel arising from fluorescence. For example, using 4-color Illumina sequencing (HiSeq), each column would correspond to the fluorescent signal associated with one nucleotide base (G,T,A, or C). For instance, a blue fluorescent output signal indicates a CTP was incorporated into the strand being synthesized by the sequencing reaction. Similar correspondences can be made for sequencing methods that do not spectrally separate each nucleotide base in a trivial manner (such as two-color sequencing in NextSeq, or the more complex color scheme associated with SOLiD Sequencing). In general, the “sequencing signal” can be considered as a matrix or vector representation of the raw signal arising from the sequencing reaction, either directly or after some appropriate mathematical transformation. In addition, the “sequencing signal” may be transformed from a matrix as described above into a “sequence vector”.
As used herein, the term “sequence dictionary” refers to the set of reference sequences which may be present in a particular biological sample. The membership of this set is determined jointly by the biological sample and the processes used to select and amplify the DNA or RNA (cDNA). For instance, when MIXSEQ is applied to genome sequencing, in which a set of sequences derived from the genome are sequenced to a length of 250 base-pairs, the dictionary may be considered to be the set of all possible unique sequences of length 250 that are contained within the genome. As further explained below, the sequence dictionary may not be known in advance, or may be partially known in advance, with membership of the set of reference sequences determined by applying additional relational information about, inter alia, the mixed sequencing signals, the pixels, known reference sequences, and/or target sequences.
The set of reference sequences contained within a sequence dictionary is determined by factors such as the primers used for reverse transcription, circularization, and sequencing. Each reference sequence in a sequence dictionary may be represented in standard text form (for example, GTAC) or in the form of a sequence matrix appropriate for a given sequencing methodology.
As used herein, the term “mixed sequencing signal” refers to a sequence vector which represents sequencing information in which information from two or more individual sequences is superimposed: For example, the sequencing signal corresponding to a single pixel may generate a “mixed sequencing signal” if the field of view associated with that pixel contains two unique molecules with different sequences. As another example, the sequencing signal corresponding to a single pixel may generate a “mixed sequencing signal” if two isolated subsequences on a given oligonucleotide are sequenced at the same time.
As used herein, a reference sequence, reference sequence vector, or reference sequence matrix is considered “representative of” a mixed sequencing signal, mixed sequence matrix, or mixed sequence vector, where the reference sequence, reference sequence vector, or reference sequence matrix are sufficient to explain the variability of the mixed sequencing signal, mixed sequence matrix, or mixed sequence vector with an error less than 90%, 75%, 50%, 25%, 10%, 5% or less.
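The “representative of” criterion can be made concrete as a relative residual: the fraction of the mixed signal's variability left unexplained by the weighted reference (the toy vectors and weight below are ours, for illustration only):

```python
import numpy as np

def unexplained_error(signal, reference, weight):
    """Fraction of the signal's norm left unexplained by weight*reference."""
    residual = signal - weight * reference
    return np.linalg.norm(residual) / np.linalg.norm(signal)

# A signal that is mostly a scaled copy of the reference, plus small noise.
reference = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
signal = 2.0 * reference + 0.05 * np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
err = unexplained_error(signal, reference, 2.0)  # small: well-explained
```

Under this measure, a reference explaining the signal with error below, e.g., 10% would be “representative of” that signal at the 10% threshold.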
As used herein, the term “next generation sequencing” or “NGS” refers to any modern high-throughput sequencing technology. NGS includes, but is not limited to, sequencing technologies such as Illumina (Solexa) sequencing and SOLiD sequencing. A major advantage of NGS over previous sequencing technologies is the ability to perform massively parallel sequencing, in which many sequences are read in parallel but are not mixed. The invention herein provides a method, referred to herein as “MIXSEQ,” which allows for deconvolution of previously unusable, mixed data generated by massively parallel sequencing. MIXSEQ is particularly useful for in-situ sequencing methods such as FISSEQ.
As used herein, sequencing in parallel includes, at least, simultaneously sequencing regions originating from multiple distinct oligonucleotides, or simultaneously sequencing multiple regions of an oligonucleotide.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range, and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
minX ∥Y−AX∥2,1 s.t. ∥X∥&lt;β and ∥XG∥&lt;λ.
We then show recovered loadings for each dictionary element, which correspond to recovered cells.
Copy number variation along the chromosome was modeled as an alternating stairstep. Due to Poisson sampling of individual sequences along the chromosome, recoverable estimates of CNV are noisy (X̂, green line), and must be smoothed. The regularized estimate (X, magenta line) is identical to the ground truth.
This application claims benefit of U.S. Provisional Application No. 62/953,174, filed Dec. 23, 2019, the entire content of which is hereby incorporated by reference herein.
This invention was made with government support under grant number D16PC0008 awarded by the Intelligence Advanced Research Projects Activity (IARPA). The government has certain rights in the invention.
Filing Document: PCT/US2020/066853 | Filing Date: 12/23/2020 | Country: WO
Related Application: 62/953,174 | Date: Dec. 2019 | Country: US