Biological systems are incredibly complex, and are governed largely by fluctuations in the expression levels of a multitude of genes. Such differential expression reflects the way those cells interact with others and react to our world. The expression levels of all genes at a particular time point, or in a particular environmental situation, can represent one particular “state”. Gene expression levels can change very rapidly, and so therefore can the “state” of a particular biological system, for example a cell or tissue or organ. Determining the “state”, i.e. the relative expression of a number of genes at a particular point has clear utility in diagnostics, prognostics and in for example industrial biotechnology, since it is important to know whether a particular biological system is behaving as expected/desired.
Genes do not act in isolation, but as part of complex networks. Because there are so many interacting genes and separate gene networks, fully determining the state of a biological system, such as a cell, is itself highly complex. Although it is now possible to relatively routinely analyse the expression level of all genes within a biological system, for example via RNA-seq, this is neither cost-effective nor time-effective, both in terms of the sequencing and the subsequent bioinformatics, particularly since only a subset of genes is likely to be relevant for predicting or classifying whether a biological system is in one particular state or another, or is exhibiting a particular activity, for example a high protein production state. Determining such complex relationships requires pattern recognition, rather than simple algebraic thresholds.
The premise that particular gene networks can be approximated to relatively discrete units underlies much of modern diagnostics, and once a predictive relationship or differential gene regulation signature has been identified that utilises information from a small (or relatively small) subset of the total transcriptome, an assessment of the entire transcriptome of each test sample is not necessary to diagnose the sample as being in a particular state or not. For example, there are a large number of instances where gene expression data, for example transcriptome data, has been obtained from two or more different types of sample and has been analysed, using bioinformatics including machine learning, to identify particular subsets of genes/mRNAs that are under or overexpressed, and to different levels, between the two sample types. The identification of such diagnostic or predictive expression patterns has been used in for example cancer diagnostics, cancer prognostics, diagnosis of tuberculosis and sepsis, as well as veterinary uses such as diagnosing bovine tuberculosis and mastitis, and prediction of response to therapy.
The same types of diagnostic and predictive relationships, decision surface or differential gene regulation signatures based on the relative gene expression of a given set of genes can be used in cell and tissue engineering. For example, often, the goal of “regenerative medicine” is to guide stem cells to differentiate into a specific terminal cell type, or to shift the activity of differentiated cells towards one task or another. Through gene expression profiling, and specifically the idea of “molecular time”, it is possible to determine “How differentiated are the cells? How polarized are the cells?”. In addition, the field of synthetic biology presents a unique challenge. In a population of cells with highly engineered gene pathways, or several such populations cooperating towards a given task, the bioprocess engineer requires a means of determining whether the system is behaving the way it was designed to.
In the simplest instance, such a predictive relationship, decision surface or differential gene regulation signature can involve the assessment of the presence or absence of expression from a single gene. For example, the presence of mRNA from gene A in a sample predicts that the sample is in a state A (for example “has disease A”) and the absence of mRNA from gene A predicts that the sample is in a state B (for example “does not have disease A” i.e. has a different disease or has no disease).
However, most disease states, or other states such as particular regulatory states (for example states in which gene regulation occurs within tolerance windows defined by engineers for quality control) that may be relevant for the bioproduction of various compounds, can only be accurately predicted or diagnosed using the expression data from a larger number of genes. The requirement for the assessment of expression data from a larger number of genes means that even once the predictive relationship or differential gene regulation signature has been determined (e.g. from the analysis of a larger set of expression data to identify those “markers” that can be used to predict a particular state), specialised equipment and skilled bioinformaticians are required to analyse the diagnostic/predictive expression data and form the prediction/diagnosis. For example, the use of techniques such as microarrays, RNA-seq or Nanostring to determine the expression levels of a number of genes requires the use of a range of probes, for example with a range of labels, each requiring separate determination. Current methods of determining the diagnosis/prediction of a particular state therefore require extensive data handling and statistics after sample preparation and after obtaining the actual expression data, and are not suitable for, for example, point-of-care diagnostic situations.
It would be beneficial to simplify the use and output of a predictive relationship or differential gene regulation signature such that the end-user can perform simple assays that give one, or a low number of, outputs which are typically directly predictive of one of two or more particular states, for example “has disease” or “does not have disease”, and which do not require input from statisticians or complicated equipment.
The present invention solves at least the above-mentioned problems with the prior art methods of using predictive relationships or differential gene regulation signatures generated from biological data.
The inventors of the present invention have developed methods and components that can be used to significantly reduce the complexity of converting the pre-determined predictive relationship, decision surface or differential target oligonucleotide pattern (such as a gene regulation signature between gene expression pattern and a particular state) into a useful diagnostic or predictive result.
The methods described herein use the molecules of the assay themselves to reflect the complex math and artificial intelligence currently used to analyse the standard target oligonucleotide pattern (for example expression data) that is routinely obtained in, for example, medical diagnostics.
The methods disclosed are easy to use, with no requirement for particularly specialist instrumentation, and sample preparation is standard. Once the necessary components have been optimised through routine procedures, actually putting the methods into practice, for example in diagnostics/prognostics, is very simple and requires in some embodiments a simple multiplex PCR amplification reaction and the reading of two fluorophores. This is in contrast to currently used methods that require, for example, amplification of a number of RNA species using multiple fluorophores, determining the amount of each fluorophore, and subsequently feeding those data into a complicated bioinformatics system that compares the relative levels of each RNA species to determine the “state”. For example, if a predictive relationship, decision surface or differential target oligonucleotide pattern (such as a differential gene regulation signature) is based on the relative expression of 10 genes, currently either the expression level of each gene needs to be determined separately, so that the same fluorophore may be used; or 10 different fluorophores need to be used so that the amplification method can be multiplexed. Accordingly, in either case, at least 10 different readings are needed. A key advantage of the present invention is that it reduces the number of readings, in some cases down to a single reading of two different fluorophores (or of all fluorophores used), in a single tube.
The results produced by the methods of the invention are easy to obtain, are clear and can be interpreted by the laboratory researcher, the fermentation specialist and the bedside clinician.
The methods are typically centred around nucleic acid amplification, which the skilled person will understand is highly routine and can be performed with minimal equipment.
In addition, many of the prior art methods reduce the complex networks and predictive relationships, decision surfaces or target oligonucleotide patterns (such as differential gene regulation signatures) to simple linear relationships, for example more expression from one gene predicts a certain state, and more expression of a different gene predicts a different state. Such a reductionist approach does not accurately reflect biological systems and does not adequately capture and reflect the predictive relationships or differential gene regulation signatures that are capable of being identified and generated, for example through the use of AI.
For example, an AI system may determine that if the expression of gene A is above an arbitrary expression threshold of 10 and the expression of gene B is below a threshold of 5, and the expression of gene C is above a threshold of 7, then the sample is in a particular state, e.g. State A; whereas if the expression of gene A is above a threshold of 10 and the expression of gene B is above a threshold of 10 and the expression of C is below a threshold of 7 then the sample is in a different particular state, State B.
It will be clear that a larger number of different “states” can be determined and predicted based simply on the expression levels of three genes. Whether or not these different states represent clinically useful or biotechnologically useful states will be determined by the samples that the AI system is trained on. In any event, it is possible to see that expression of gene A above a threshold of 10 (i.e. “more” expression of gene A) does not simply reflect a single state. It is the relative expression levels of each of the genes in the particular network, or that have been identified as being part of the predictive network, that are important.
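The interdependence described above can be sketched in a few lines of Python. This is only an illustration of the two rules given in the example, using the same arbitrary thresholds; the function name classify and the "undetermined" fallback for combinations the example does not cover are assumptions made for this sketch.

```python
def classify(gene_a, gene_b, gene_c):
    """Apply the two illustrative threshold rules from the example.

    The thresholds (10, 5, 7, 10) are the arbitrary values used in the text.
    Any combination not covered by either rule is reported as
    "undetermined" (an assumption; the text does not assign it a state).
    """
    if gene_a > 10 and gene_b < 5 and gene_c > 7:
        return "State A"
    if gene_a > 10 and gene_b > 10 and gene_c < 7:
        return "State B"
    return "undetermined"


# Gene A is above its threshold in both calls, yet the state differs.
print(classify(12, 3, 8))   # State A
print(classify(12, 11, 5))  # State B
```

Both calls have the same gene A level, which shows that “more” expression of gene A alone does not identify a single state.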
The methods of the present invention are able to capture this complex interdependent relationship and condense it down to a single output which tells the user whether the sample is in, or is likely to be, State A or State B; or is in State A and not in State B or State C, for example.
The methods of the present invention can be termed Competitive Amplification Networks (CANs). The methods adapt RNA/DNA amplification technologies such as PCR to the recognition of complex gene expression patterns. As the name implies, the reaction is engineered with competitive interactions that translate the information provided by a given gene transcript or a set of transcripts into the relative probability of state A versus state B. In some embodiments which utilise fluorophore labelled probes, these probabilities combine to provide an overall diagnosis represented by two colours: interpretation is as simple as checking which colour is brighter. The networks are scalable to encompass a large number of genes without a significant increase in cost or operational complexity. Finally, these networks can be engineered to perform complex, nonlinear operations on multiple targets simultaneously. This technology provides a platform for engineering application-specific kits for disease diagnosis, therapeutics monitoring, regenerative medicine research, and quality control of bioprocess manufacturing.
Accordingly, the invention provides a method of translating the relative abundance of (or presence or absence of) at least two oligonucleotides, for example the relative expression of at least two genes, or presence or absence of at least two mutations, into the relative probability of a particular state, for example the relative probability of State A versus State B.
The invention also provides a method of combining the relative abundance of at least two oligonucleotides, for example the relative expression of at least two genes, or presence or absence of at least two mutations, into a single value.
The invention also provides:
Arriving at the “probability of a particular state” and the “predictive relationship”, “decision surface”, or “differential target oligonucleotide pattern” or “differential gene regulation signature” and the “statistical information” is within the means of the skilled person. Such information is typically obtained from microarray data or RNAseq data, for instance, followed by bioinformatics to produce a relationship between two or more markers that can be used to predict the probability of for example state A versus state B. Many examples of such predictive panels exist, see for example:
(1) Warsinske, H.; Vashisht, R.; Khatri, P. Host-Response-Based Gene Signatures for Tuberculosis Diagnosis: A Systematic Comparison of 16 Signatures. PLOS Medicine 2019, 16 (4), e1002786. https://doi.org/10.1371/journal.pmed.1002786.
(2) Sweeney, T. E.; Wong, H. R.; Khatri, P. Robust Classification of Bacterial and Viral Infections via Integrated Host Gene Expression Diagnostics. Science Translational Medicine 2016, 8 (346), 346ra91-346ra91. https://doi.org/10.1126/scitranslmed.aaf7165.
(3) Cardoso, F.; van't Veer, L. J.; Bogaerts, J.; Slaets, L.; Viale, G.; Delaloge, S.; Pierga, J.-Y.; Brain, E.; Causeret, S.; DeLorenzi, M.; Glas, A. M.; Golfinopoulos, V.; Goulioti, T.; Knox, S.; Matos, E.; Meulemans, B.; Neijenhuis, P. A.; Nitz, U.; Passalacqua, R.; Ravdin, P.; Rubio, I. T.; Saghatchian, M.; Smilde, T. J.; Sotiriou, C.; Stork, L.; Straehle, C.; Thomas, G.; Thompson, A. M.; van der Hoeven, J. M.; Vuylsteke, P.; Bernards, R.; Tryfonidis, K.; Rutgers, E.; Piccart, M. 70-Gene Signature as an Aid to Treatment Decisions in Early-Stage Breast Cancer. New England Journal of Medicine 2016, 375 (8), 717-729. https://doi.org/10.1056/NEJMoa1602253.
(4) Zaas, A. K.; Aziz, H.; Lucas, J.; Perfect, J. R.; Ginsburg, G. S. Blood Gene Expression Signatures Predict Invasive Candidiasis. Science Translational Medicine 2010, 2 (21), 21ra17-21ra17. https://doi.org/10.1126/scitranslmed.3000715.
By a predictive relationship, we include the meaning of any statistical classification technique that can be visualized as “decision surface” where each input dimension represents the concentration of a particular target sequence and each output dimension represents a different class. For example, the input domain could consist of two genes and the output domain two classes, healthy and sick. The “decision surface” is then a two-dimensional surface where a given point represents the concentration of the two gene transcripts and the height of the surface at that point corresponds to the probability of being sick if a patient's two genes are expressed at those respective levels. In another example, the input domain could consist of 10 distinct mutations observed in circulating tumour DNA (ctDNA) of a post-surgical prostate cancer patient and the output domain could consist of three categories: no recurrence, mild recurrence, and aggressive recurrence, each of which recommends to the physician a different course of action. The decision surface in this case is (more or less) a 10-dimensional cube, where each point translates a particular combination of mutation concentrations to a relative probability of the three categories, perhaps visualized with color as the relative intensities of the red, green, and blue components of an image.
While a 10-dimensional tricolored cube is difficult to visualize, arriving at such a representation would be routine for a biostatistician, bioinformatician, mathematician, statistician, or data scientist. The expert would begin with a dataset containing the measured concentrations of many potential targets, such as expression of various genes or mutational profile of post-surgical ctDNA, from many individuals, where each individual is known to belong to a different category (e.g., healthy/sick or no/mild/aggressive recurrence). The expert would then apply any of several classification algorithms to arrive at the decision surface, including but not limited to logistic regression, Gaussian process classification, artificial neural network classification, decision trees, random forests, naïve bayes, support vector machines, or nearest neighbours.
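For the two-gene healthy/sick example, a decision surface of the logistic regression type (one of the algorithms listed above) might be sketched as follows. The coefficients are hypothetical placeholders chosen for illustration; in practice they would be fitted to a labelled dataset, and the function name p_sick is likewise an assumption.

```python
import math

# Hypothetical coefficients for illustration only; real values would be
# obtained by fitting a classifier to labelled expression data.
INTERCEPT = -4.0
WEIGHT_GENE1 = 0.5
WEIGHT_GENE2 = 0.3


def p_sick(gene1, gene2):
    """Height of the decision surface: probability of "sick" at this point."""
    z = INTERCEPT + WEIGHT_GENE1 * gene1 + WEIGHT_GENE2 * gene2
    return 1.0 / (1.0 + math.exp(-z))


# Evaluate the surface on a coarse grid of transcript concentrations.
levels = [0, 5, 10]
surface = [[p_sick(g1, g2) for g2 in levels] for g1 in levels]
for g1, row in zip(levels, surface):
    print(g1, [round(p, 3) for p in row])
```

Each grid entry is the height of the surface for one pair of expression levels. A fitted surface from a nonlinear classifier would be evaluated in the same way.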
Alternatively, the decision surface may be constructed in a more manual, principled manner. For instance, the bioproduction engineer may know the optimal expression level and respective tolerance for each of several genes expressed by their engineered organism or population of organisms. For quality control and process-monitoring purposes, the engineer may wish to know if any of those genes is outside that tolerance window. In this instance, the decision surface could be represented as a multidimensional Gaussian distribution that extends from −1 to +1 in the output domain. Each dimension, as specified above, would represent the concentration of the particular gene transcript, and the marginal Gaussian distribution along that dimension would have its mean (peak) at that gene's ideal concentration and its standard deviation (width) correspond to the respective tolerance window. The competitive amplification network implementation of such a decision surface would exhibit one fluorescent color if all transcripts are at or near their ideal, and another if any transcript is too far beyond its tolerance window.
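A tolerance-window surface of this kind might be sketched as below. The sketch assumes that the genes are treated independently (no covariance terms) and that the Gaussian is rescaled so that its peak is +1 and its tails approach −1; the function name and the example ideals and tolerances are hypothetical.

```python
import math


def tolerance_surface(levels, ideals, tolerances):
    """Rescaled multidimensional Gaussian with output between -1 and +1.

    levels, ideals and tolerances are equal-length sequences giving, for each
    gene, the measured concentration, the ideal concentration (mean) and the
    tolerance (standard deviation). Independence between genes is an
    assumption of this sketch.
    """
    exponent = sum(
        ((x - mu) / sd) ** 2 for x, mu, sd in zip(levels, ideals, tolerances)
    )
    return 2.0 * math.exp(-0.5 * exponent) - 1.0


ideals = [5.0, 3.0]      # hypothetical ideal concentrations
tolerances = [1.0, 0.5]  # hypothetical tolerance windows

print(tolerance_surface([5.0, 3.0], ideals, tolerances))   # all genes at ideal
print(tolerance_surface([5.0, 6.0], ideals, tolerances))   # one gene far out
```

A positive value corresponds to the “all transcripts near their ideal” signal, and a negative value to the signal indicating that at least one transcript is outside its window.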
Another such principled decision surface could arise from personalized surveillance of circulating tumour DNA for the purposes of monitoring a post-surgical prostate cancer patient for early signs of relapse (Coombes et al Clinical Cancer Research 2019 25: DOI: 10.1158/1078-0432.CCR-18-3663). The target mutations of interest would be identified at the time of surgery by comparing the genome of the tumour to that of the patient's healthy tissue. The expert would then select a threshold concentration so that if any of the mutations are observed in the ctDNA above this threshold, the expert would conclude that the cancer has relapsed. The marginal decision surface for a given mutation in this case would consist of a transition from 0 in the absence of the mutation to +1 at that threshold concentration.
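As a rough sketch, such a threshold surface might be written as follows. The text specifies only the end points (0 with no mutation, +1 at the threshold), so the linear ramp between them is an assumption, as are the function name and the example concentrations; taking the maximum over mutations reflects the “if any of the mutations” rule.

```python
def relapse_surface(concentrations, threshold):
    """Overall surface height for a set of ctDNA mutation concentrations.

    Each marginal surface rises from 0 (mutation absent) to +1 at the
    threshold; the linear shape of the rise is an assumption. The maximum
    over mutations implements "any mutation above the threshold".
    """
    return max(min(c / threshold, 1.0) for c in concentrations)


print(relapse_surface([0.0, 0.0, 0.0], threshold=2.0))  # no mutation detected
print(relapse_surface([0.0, 3.0, 0.5], threshold=2.0))  # one mutation above threshold
```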
Having obtained a decision surface, or probabilistic relationship between targets of interest and classification, the expert would then design a competitive amplification network which approximates this relationship. A given signal (fluorophore color, such as FAM, or band intensity on a lateral flow strip) is designated arbitrarily as corresponding to the positive direction of the output and a second signal (such as HEX) is designated as the negative direction. The difference between the intensities of these two colors thus corresponds to the “height” of the decision surface. Alternatively, should the output domain consist of more than two categories, an appropriate number of signals can be chosen so that certain pairwise differences between them correspond to the probability of different output categories.
Having translated the output domain of the decision surface into the relative intensity of various signals, the expert would choose the architecture of the network. This architecture consists of determining how many synthetic competitors to include, how many primers to include, which oligonucleotide strands share which primers, and which strands are targeted by which probes. For each architecture, then, there are numerous combinations of amplification parameters for each oligo in the system. Choosing among architectures and parameter values would be done by simulating the surface produced by numerous different architectures, each at numerous different parameter values (see section “Simulating competitive amplification”), to identify the architecture and combination of parameter values that most closely resemble the pre-determined decision surface. There are many ways known in the art of performing this optimization task, including Evolutionary Algorithms and Simulated Annealing for the choice between architectures as well as Gradient Descent, Stochastic Gradient Descent, or Quasi-Newton methods for identifying ideal parameter combinations. Finally, the expert would design target and synthetic competitor oligonucleotides which exhibit the parameters identified here and share primers according to the selected architecture (see section “Testing and predicting competitor amplification behavior”). Further explanation is provided in the Examples below.
Each of these methods involves the amplification of one or more target polynucleotides in such a way that the amount of each product that indicates a first state can be cumulatively quantified, and each product that indicates a second state can be cumulatively quantified. Combining these two readings produces a single overall reading that indicates whether the sample is more likely to be in a first state or a second state, i.e. regardless of the number of genes under investigation, the difference between the total green intensity and the total orange intensity (for example) integrates the information from the whole system. For example, in one embodiment all products that are associated with a first state are labelled with a first fluorophore and all products that are associated with a second state are labelled with a second fluorophore. Provided that the relative contribution of each product to the overall predictive relationship or differential gene regulation signature is taken into account, summing the cumulative quantifications of each state produces an accurate and predictive value. The competitive polynucleotides of the invention that are used in the methods described herein are engineered, designed or tuned to reflect this predictive relationship or differential gene regulation signature.
Accordingly, the invention provides:
and
wherein the method comprises the step of amplifying one or more target polynucleotides in a sample.
The method of amplifying one or more target polynucleotides in a sample as described herein is itself provided by the invention.
Theoretically, every target molecule in solution should be replicated every cycle until the primers are used up, but, crucially to CAN design principles, i.e. the methods disclosed herein, perfect doubling is actually difficult to achieve. It is the tuned competitor polynucleotides, which comprise the appropriate features, that allow a single output to reflect a complex network of expression levels.
Target sequence characteristics such as GC content influence the proportion of molecules that are replicated each cycle and these features are deliberately built into the competitor polynucleotides used herein so that the target polynucleotide(s) is amplified with the appropriate efficiency where the efficiency is tailored to mimic the contribution of that particular target in the overall predictive relationship or differential gene regulation signature.
For example, in a hypothetical scenario where increased expression of two genes is predictive of disease:
If the expression of G1 and G2 is simply obtained and added together, without taking into account any individual predictive power, then a sample with a G1 expression level of 10 (predicting “non-disease”) and a G2 expression level of 7 (predicting “disease”) would have an overall expression level of “disease predicting genes” of 17; whereas a sample with a G1 expression level of 1 (predicting “non-disease”) and a G2 expression level of 10 would only have an overall expression level of “disease predicting genes” of 11. On the face of it, without taking the individual predictive power into account, then the first sample would appear to be more likely to be diseased than the second sample. However, when we take into account that G1 is only weakly predictive but G2 is strongly predictive, the actual prediction of disease may be much more likely for the second sample.
Accordingly, it is not enough to simply amplify all “disease associated genes” and add up the amount of product. However, adding up the amount of product is a simple means to obtain a cumulative and accurate prediction based on a number of expression level inputs. The inventors have managed to incorporate the individual predictive power into competitive polynucleotides, so that the relative amount of a target versus a corresponding competitor polynucleotide indicates the predictive power.
For example, taking the above hypothetical example, in one example:
For sample 2 which had effectively 1 G1 target molecule and 10 G2 target molecules, G1 may produce a green reading of 0.5 and an orange reading of 1; and G2 may produce a green reading of 9 and an orange reading of 2, with a cumulative reading of 9.5 green versus 3 orange.
It will be understood by the skilled person that in some situations an increased expression of one gene and a repressed expression of a different gene may be indicative of a particular state, for example a diseased state.
The predictive relationship or differential gene regulation signature derived from the original data set(s) (e.g. microarray data, RNAseq data) will provide a threshold of how “green” the overall cumulative fluorescence needs to be to result in a diagnosis of “state A” (i.e. “disease”).
If the targets were amplified in a 1:1 manner, then sample 1 would have an overall green reading of 17 and sample 2 would have a reading of 11 which does not accurately reflect how likely the samples are to be in that particular state, e.g. a diseased state.
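The arithmetic of the hypothetical example can be checked with a short sketch. It uses only the figures quoted in the text: the expression levels of the two samples for the naive 1:1 totals, and the per-gene green and orange readings given for sample 2. The variable names are illustrative.

```python
# Naive 1:1 totals: simply add the G1 and G2 expression levels.
naive_sample1 = 10 + 7
naive_sample2 = 1 + 10

# Tuned readings quoted for sample 2, as (green, orange) per gene.
sample2_readings = {"G1": (0.5, 1.0), "G2": (9.0, 2.0)}

total_green = sum(green for green, _ in sample2_readings.values())
total_orange = sum(orange for _, orange in sample2_readings.values())

print("naive totals:", naive_sample1, naive_sample2)
print("sample 2 tuned:", total_green, "green versus", total_orange, "orange")
print("green minus orange:", total_green - total_orange)
```

The naive totals rank sample 1 above sample 2, whereas the tuned readings for sample 2 give a clearly green-dominated signal, which can then be compared with the threshold derived from the original data set.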
Although the above is discussed in the context of relative gene expression, the skilled person will understand and appreciate that the same premise is true of situations in which the presence or absence of various mutations is indicative of a particular disease state, such as cancer, or the relative abundance of non-coding RNAs (so, strictly not “gene expression” in the context of protein coding genes, but transcription in general).
Accordingly, as discussed above, where reference is made to relative gene expression, this should be read as also applying to combinations of mutations, or relative transcription and production of for example non-coding RNAs.
Amplification progression can be monitored in real-time by inclusion of a fluorescently labelled probe oligonucleotide specific to a region of the target product or competitor product between the primer-binding sites (see
where F is the fluorescence intensity, r is the exponential growth rate (base e), K is the signal plateau, and m is the drift of this plateau. The key component here is the r, which, when expressed in base 2, represents the fraction of (probe-bound) target strands which replicate each cycle. The r can be changed by altering the sequence of the target between the primer regions, as demonstrated in
Note that, for the most part, all reactions with a given target have the same fluorescence intensity at the end, regardless of the starting quantity of the target. The endpoint of the reaction therefore gives minimal information about the sample, a drawback remedied by engineering competition into PCR according to the present invention.
The amplification is a “competitive” amplification that involves the use of a competitor polynucleotide that has been “tuned” to have particular features that are described herein. The skilled person will appreciate that prior art methods of competitive PCR are typically used for target nucleic acid quantification and the competitive polynucleotide used is designed to be as close in sequence to the target as possible, to avoid any discrepancies in amplification efficiency. The amount of target product is compared to the amount of competitor product, typically using gel electrophoresis, and from this the amount of starting target material can be quantified.
In contrast, the present invention specifically requires that the competitor polynucleotide be designed to have a sequence that intentionally results in a particular difference in amplification efficiency between amplification of the target and amplification of the competitor.
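The effect of such tuning can be illustrated with a deliberately simplified simulation. The sketch below is not the simulation method of the invention; it assumes a hypothetical model in which the target and the competitor draw on one shared primer pool, each species replicates a fixed fraction of its strands per cycle (its “efficiency”), one primer is consumed per new strand, and amplification stops when the pool is empty. The function name, efficiencies and starting quantities are all illustrative.

```python
def simulate_competition(start, efficiency, primers, max_cycles=60):
    """Toy model of competitive amplification from a shared primer pool.

    start      -- starting amount of each species (e.g. [target, competitor])
    efficiency -- fraction of strands of each species replicated per cycle
    primers    -- size of the shared primer pool (one primer per new strand)
    Returns the amount of each species when the pool is exhausted.
    """
    amounts = list(start)
    for _ in range(max_cycles):
        if primers <= 0:
            break
        new = [e * a for e, a in zip(efficiency, amounts)]
        demand = sum(new)
        if demand <= 0:
            break
        if demand >= primers:
            # Final cycle: share the remaining primers in proportion to demand.
            scale = primers / demand
            primers = 0.0
        else:
            scale = 1.0
            primers -= demand
        amounts = [a + n * scale for a, n in zip(amounts, new)]
    return amounts


def target_fraction(amounts):
    return amounts[0] / sum(amounts)


# Equal efficiencies: the endpoint simply preserves the starting ratio.
matched = simulate_competition([1.0, 100.0], [0.9, 0.9], primers=1e6)
# Competitor tuned to amplify less efficiently: the target's share grows.
tuned = simulate_competition([1.0, 100.0], [0.9, 0.6], primers=1e6)

print(round(target_fraction(matched), 4), round(target_fraction(tuned), 4))
```

In this toy model the endpoint depends on both the starting ratio and the difference in efficiency, which is the property that a tuned competitor is intended to exploit.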
In one embodiment then, the invention provides a method of amplifying one or more target polynucleotides in a sample, wherein the method comprises:
providing:
For the avoidance of doubt, the methods of the present invention are different to “toe-hold” methods in which a “toe hold” primer is initially bound to a shorter “protector” strand, so that this protector and the target compete for binding to the primer. In this case, the “protector” is not amplified (it is shorter than the primer).
Also, to be clear, the first tuned competitor polynucleotide is a polynucleotide that has been specifically designed, or “tuned”, to have particular properties and has been intentionally introduced into the amplification reaction. A competitor polynucleotide as described herein is considered to be distinct from, for example, other polynucleotides that just happen to also be present in the sample. For example, a competitor polynucleotide according to the invention is not simply another piece of genomic DNA that may compete for hybridisation to the primers, resulting in unwanted background amplification. In one embodiment then, the competitor polynucleotides described herein are intentionally amplified. In one embodiment the competitor polynucleotides described herein are not naturally present in the sample.
It is also important to note that the present method is distinct from prior art methods of competitive amplification whereby the competitor oligonucleotide is designed to intentionally have similar amplification kinetic properties to the target polynucleotide. Such methods are used in the art to estimate the concentration of the target polynucleotide, for example where a known amount of competitor polynucleotide is included in the amplification reaction. It is imperative in such methods that the rate of amplification of the competitor mirrors that of the target. It will be clear to the skilled person that this is not the case for the present invention. The present invention requires the tuned competitor oligonucleotide to have different amplification kinetics to the respective target polynucleotide so that the relative rate of amplification of the target and competitor results in products that match the predictive relationship, decision surface or differential target oligonucleotide pattern, such as a differential gene regulation signature, that is indicative of one of at least two states.
Accordingly in one embodiment the competitor polynucleotide does not have the same or does not have substantially similar amplification kinetics to the respective target polynucleotide.
The present methods are also distinct from methods such as 16S nested PCR, which first amplifies a genetic sequence common to most bacteria (a ribosomal subunit) before amplifying or sequencing species-specific sub-regions (Yu et al PLoS One 2015 10: e0132253). A similar approach is used to probe VDJ recombination in human B cells (Koning et al British Journal of Haematology 2016 178: 983-968). In both cases competition occurs, though only among natural sequences. Accordingly, in one embodiment the method is not a 16S nested PCR method, and/or is not a method used to probe VDJ recombination in human B cells.
The skilled person will appreciate that it is possible to amplify a given target sequence and/or tuned competitor sequence using just one primer, for example in asymmetric amplification or EXPAR, an exponential amplification reaction (see Reid et al Angewandte Chemie 2018 57: 11856-11866), or with two primers, for example as in standard PCR. It is not considered necessary that two primers are used to amplify a given target sequence or a given competitor sequence, though typically two primers will be used, arranged so that the first and second primer hybridise on opposite strands of a double stranded target sequence or competitor sequence, so as to result in the production of a target product or competitor product. Two primers may be used to amplify the target sequence, and/or may be used to amplify a portion of or all of the tuned competitor polynucleotide. The skilled person will understand what is required for an appropriate primer, for example length and sequence identity to a portion of the target/competitor sequence.
Accordingly, in some embodiments the method comprises providing a second primer.
In some embodiments the second primer is capable of hybridising to the first target polynucleotide, wherein the first and second primer hybridise on opposite strands of the target so as to result in the production of the first target product, optionally a first target polymerase chain reaction (PCR) product.
The skilled person will also understand that for a first primer to be capable of hybridising to a first target polynucleotide and to a first tuned competitor polynucleotide, a portion of the first target polynucleotide and a portion of the first tuned competitor will have the same, or substantially the same sequence, so as to allow a single primer to hybridise to the two different polynucleotides. The remaining sequence of the target and competitor can be entirely different.
In some instances, where the method comprises the use of a second primer that is capable of hybridising to the first target polynucleotide, the same second primer is also capable of hybridising to the first tuned competitor polynucleotide, wherein the first and second primer hybridise on opposite strands of the first tuned competitor polynucleotide so as to result in the production of the first tuned competitor product, optionally first tuned competitor PCR product. In this case, the first target polynucleotide and the first tuned competitor polynucleotide will share two regions that are identical, or that are substantially identical, so as to allow the hybridisation of the first and second primer to each polynucleotide. The skilled person will understand how similar two sequences need to be so as to allow hybridisation of the same primer.
This arrangement, whereby the first target and the first competitor polynucleotides are amplified using the same first and second primers is depicted in
When the target and the competitor are amplified in the same amplification reaction, they compete for the primers. Since primers are consumed by each replication of a target strand, the amplification of both sequences stops as soon as the primer pool is exhausted. The quantity of each amplification product at the end of the reaction depends on the relative starting quantity of the two targets. This is reflected in the resulting fluorescent signal (see for example
Testing and Predicting Competitor Amplification Behavior
To estimate parameters governing amplification behaviour, each competitor can be amplified in a reaction containing the appropriate primers, the relevant fluorophore-labelled probe, and standard qPCR master mix (TaqMan Fast Advanced Master Mix from ThermoFisher Scientific). The resulting fluorescence data should be fitted with one of a number of algorithms which the skilled person will be able to select, for example (herein referred to as the mechanistic model as used in the Examples) using standard non-linear least squares estimation,
where f is defined as
where r is the amplification rate, F0 is the initial fluorescence at the beginning of the reaction, m indicates the degree of drift of the steady-state fluorescence, and K gives the steady-state fluorescence in the absence of drift. The above equation is merely exemplary, other models which describe amplification behaviour may also be used. As described below, one way of estimating the parameters of this mechanistic model is via a Generalized Linear Model, specified as follows. To allow efficient estimation, the following variable substitution on F0 and r is first applied:
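The input parameters to the model are the length of the region of the sequence between the primers, in base pairs (BP), the GC content of that region in percent (GC), and the concentration of the sequence in copies (Q). The input and output (ρ, τ, K, and m) parameters are first put into “standardized” form (indicated by a circumflex, e.g. $\hat{Q}$) as follows:
$\log \mathrm{BP} = \widehat{\mathrm{BP}} \cdot \sigma_{\mathrm{BP}} + \mu_{\mathrm{BP}}$ (6)
$\operatorname{logit} \mathrm{GC} = \widehat{\mathrm{GC}} \cdot \sigma_{\mathrm{GC}} + \mu_{\mathrm{GC}}$ (7)
$\log_{10} Q = \hat{Q} \cdot \sigma_{Q} + \mu_{Q}$ (8)
$\operatorname{logit} \rho = \hat{\rho} \cdot \sigma_{\rho} + \mu_{\rho}$ (9)
$\tau = \hat{\tau} \cdot \sigma_{\tau} + \mu_{\tau} - (Q - \mu_{Q}) \cdot \log_{2} 10$ (10)
$\operatorname{logit} K = \hat{K} \cdot \sigma_{K} + \mu_{K}$ (11)
$\log m = \hat{m} \cdot \sigma_{m} + \mu_{m}$ (12)
$\operatorname{logit} \rho = \hat{\rho} \cdot \sigma_{\rho} + \mu_{\rho}$ (13)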
The regression model is then given by:
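$\hat{\rho} = \alpha_{\rho} + \beta_{\rho,\mathrm{BP}} \cdot \widehat{\mathrm{BP}} + \beta_{\rho,\mathrm{GC}} \cdot \widehat{\mathrm{GC}} + \epsilon_{\rho} + (\gamma_{\rho} + \zeta_{\rho,\mathrm{BP}} \cdot \widehat{\mathrm{BP}} + \zeta_{\rho,\mathrm{GC}} \cdot \widehat{\mathrm{GC}} + \epsilon_{\rho,Q}) \cdot \hat{Q}$ (14)
$\hat{\tau} = \alpha_{\tau} + \beta_{\tau,\mathrm{BP}} \cdot \widehat{\mathrm{BP}} + \beta_{\tau,\mathrm{GC}} \cdot \widehat{\mathrm{GC}} + \epsilon_{\tau} + (\gamma_{\tau} + \zeta_{\tau,\mathrm{BP}} \cdot \widehat{\mathrm{BP}} + \zeta_{\tau,\mathrm{GC}} \cdot \widehat{\mathrm{GC}} + \epsilon_{\tau,Q}) \cdot \hat{Q}$ (15)
$\hat{K} = \alpha_{K} + \beta_{K,\mathrm{BP}} \cdot \widehat{\mathrm{BP}} + \beta_{K,\mathrm{GC}} \cdot \widehat{\mathrm{GC}} + \epsilon_{K} + (\gamma_{K} + \zeta_{K,\mathrm{BP}} \cdot \widehat{\mathrm{BP}} + \zeta_{K,\mathrm{GC}} \cdot \widehat{\mathrm{GC}} + \epsilon_{K,Q}) \cdot \hat{Q}$ (16)
$\hat{m} = \alpha_{m} + \beta_{m,\mathrm{BP}} \cdot \widehat{\mathrm{BP}} + \beta_{m,\mathrm{GC}} \cdot \widehat{\mathrm{GC}} + \epsilon_{m} + (\gamma_{m} + \zeta_{m,\mathrm{BP}} \cdot \widehat{\mathrm{BP}} + \zeta_{m,\mathrm{GC}} \cdot \widehat{\mathrm{GC}} + \epsilon_{m,Q}) \cdot \hat{Q}$ (17)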
where α denotes the “typical” value of the given parameter across all sequences and concentrations, β indicates the dependence on the length or GC content of a given sequence, respectively, γ represents the “typical” dependence on concentration across all sequences, and ζ defines how the dependence on concentration varies with length and GC content. In the regression model, which seeks to estimate parameter values from observed data, ϵ represents the deviation of a given sequence's behavior from the global trend indicated by the remaining parameters; the prediction model, which supplies parameter values for new, untested sequences, is the same as the regression model but without the ϵ components.
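By way of illustration only, the prediction model described above may be sketched in Python as follows. The sketch assumes that the logarithm of the length is a natural logarithm and that the GC content in percent is converted to a fraction before the logit is taken; neither detail is stated above. The function names, the `SCALE` values and the `RHO_COEF` coefficients are hypothetical placeholders and are not fitted values from the Examples.

```python
import math

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def standardize_inputs(bp, gc_percent, copies, scale):
    """Invert equations (6)-(8): raw inputs -> standardized BP, GC and Q.

    `scale` maps each input name to a (mu, sigma) pair.
    Assumptions: natural log for length; GC percent converted to a fraction.
    """
    mu, sigma = scale["BP"]
    bp_hat = (math.log(bp) - mu) / sigma
    mu, sigma = scale["GC"]
    gc_hat = (logit(gc_percent / 100.0) - mu) / sigma
    mu, sigma = scale["Q"]
    q_hat = (math.log10(copies) - mu) / sigma
    return bp_hat, gc_hat, q_hat

def predict_standardized(coef, bp_hat, gc_hat, q_hat):
    """Prediction form of equations (14)-(17): the regression without epsilon."""
    intercept = coef["alpha"] + coef["beta_bp"] * bp_hat + coef["beta_gc"] * gc_hat
    slope = coef["gamma"] + coef["zeta_bp"] * bp_hat + coef["zeta_gc"] * gc_hat
    return intercept + slope * q_hat

# Hypothetical placeholder values; real values would come from the fitted model.
SCALE = {"BP": (math.log(100.0), 0.5), "GC": (0.0, 1.0), "Q": (5.0, 2.0)}
RHO_COEF = {"alpha": 0.2, "beta_bp": -0.3, "beta_gc": -0.1,
            "gamma": 0.05, "zeta_bp": 0.0, "zeta_gc": 0.0}

# Standardized rho for an untested 150 bp, 40% GC sequence at 10^4 copies.
bp_hat, gc_hat, q_hat = standardize_inputs(150, 40.0, 1e4, SCALE)
rho_hat = predict_standardized(RHO_COEF, bp_hat, gc_hat, q_hat)
```

The same `predict_standardized` function would serve for each of the four output parameters, with a separate coefficient set for each; converting the standardized output back to its natural scale would apply equations (9) to (12).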
As shown in the Examples, in one embodiment 16 different competitors ranging in length from 30 to 240 base pairs and GC content from 15% to 85% are amplified. Each competitor is amplified at seven different concentrations (i.e., the reaction contained 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, or 10⁸ copies of the competitor) in duplicate. The skilled person will be able to select an appropriate number of competitors, appropriate length, appropriate GC content and concentration, depending on the particular circumstances. The parameter values for the model above can be estimated using a Bayesian approach; however, other linear regression techniques could be used, including but not limited to maximum-likelihood estimation, least-squares estimation, ridge regression, and lasso regression.
The results of the regression of the 16 competitors described in the Examples are shown in
Besides a Generalized Linear Model, other regression techniques could be used, including but not limited to non-linear regression and non-parametric regression such as polynomial regression, Gaussian Processes, Artificial Neural Networks, Support Vector Machines, Nearest Neighbours, Decision Trees, Random Forests, and Naïve Bayes.
Simulating Competitive Amplification
The above equations describe the amplification of a given sequence in isolation. To simulate amplification behaviour when multiple oligos compete with one another, a more fine-grained model is used. Competitive amplification is modelled as an example of Monod growth (Monod, Jacques (1949). “The growth of bacterial cultures”. Annual Review of Microbiology. 3: 371-394. doi:10.1146/annurev.mi.03.100149.002103).
Commonly used to model growth of microorganisms, this approach describes replication at some maximal rate that is dampened as the limiting substrate is consumed. Each of the two strands of a given oligonucleotide is considered as a separate “organism” that generates its complement at the maximum rate described above as the sequence's amplification rate. In doing so, it consumes the corresponding primer; the decreasing concentration of this primer depresses the generation rate of new strands. The magnitude of this dampening is given by the ratio of the given primer concentration to the sum of that same concentration and the concentration of all strands which bind to the primer. For simple, non-competitive PCR (one target, two primers), the model consists of the following system of ordinary differential equations:
where A+ and A− are the concentrations of the positive and negative strands of a sequence A, p1 and p2 are the concentrations of the two primers, and r is the amplification rate for the sequence (note that the μ here is unrelated to the μ in the previous equations).
The model for direct competitive PCR (two targets WT and REF, two primers) is as follows:
A skilled person could thus describe all the competitive amplification systems contained herein in a similar manner. These systems of differential equations can be solved using any of many analytical or numerical techniques known in the art to yield curves which describe the concentration of each species in the reaction over time. To obtain curves of the signal from a given probe or set of probes over time, the practitioner would combine the concentrations of the strands cognate to those probes. For example, in the above example of direct competitive PCR, consider a case where a FAM-labeled probe was designed to bind to the WT− strand (i.e., it shares sequence identity with the WT+ strand), and a HEX-labeled probe was designed to bind to the REF− strand. The FAM signal is thus given by the concentration of the WT+ strand, and the HEX signal is given by the concentration of the REF+ strand. If an additional FAM-labeled probe was designed to bind to the REF+ strand, the FAM signal would be given by the sum of the WT+ and REF− strand concentrations.
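By way of illustration only, the direct competitive system may be sketched numerically in Python as follows. The rate expressions are one reading of the verbal description of the dampening term given above and are not the equations of the specification. It is further assumed, for the sketch, that primer p1 is consumed when a + strand is generated from a − template, that primer p2 is consumed when a − strand is generated from a + template, that time is measured in cycles, and that concentrations are in arbitrary units. A fixed-step Euler scheme is used purely for simplicity; the function name and all numerical values are hypothetical.

```python
def simulate_direct_competition(wt0, ref0, r_wt, r_ref,
                                primer0=1.0, cycles=60.0, dt=0.01):
    """Euler integration of a Monod-style model of direct competitive PCR.

    Each strand generates its complement at its maximal rate, dampened by
    primer / (primer + all strands that bind that primer).
    """
    wt_p = wt_m = wt0        # WT+ and WT- strand concentrations
    ref_p = ref_m = ref0     # REF+ and REF- strand concentrations
    p1 = p2 = primer0        # the two shared primers
    for _ in range(int(round(cycles / dt))):
        damp1 = p1 / (p1 + wt_m + ref_m)   # p1 binds the - strands (assumed)
        damp2 = p2 / (p2 + wt_p + ref_p)   # p2 binds the + strands (assumed)
        d_wt_p = r_wt * wt_m * damp1       # new WT+ made from WT- templates
        d_ref_p = r_ref * ref_m * damp1    # new REF+ made from REF- templates
        d_wt_m = r_wt * wt_p * damp2       # new WT- made from WT+ templates
        d_ref_m = r_ref * ref_p * damp2    # new REF- made from REF+ templates
        wt_p += dt * d_wt_p
        ref_p += dt * d_ref_p
        wt_m += dt * d_wt_m
        ref_m += dt * d_ref_m
        p1 -= dt * (d_wt_p + d_ref_p)      # each new + strand consumes one p1
        p2 -= dt * (d_wt_m + d_ref_m)      # each new - strand consumes one p2
    return {"WT+": wt_p, "WT-": wt_m, "REF+": ref_p, "REF-": ref_m,
            "p1": p1, "p2": p2,
            # Probe signals as in the FAM/HEX example above.
            "FAM": wt_p, "HEX": ref_p}

# Equal starting quantities, but a competitor tuned to amplify more slowly.
result = simulate_direct_competition(wt0=1e-6, ref0=1e-6, r_wt=0.6, r_ref=0.4)
```

In this sketch both products stop accumulating once the shared primers are depleted, so the final FAM and HEX values depend on both the starting quantities and the relative amplification rates.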
The scenario described so far, i.e. one target polynucleotide and one corresponding competitor polynucleotide, represents one of the simplest applications of the invention. However, assessing the expression level of one gene does not really represent a gene network. The expression level of multiple genes in a gene network can be assessed using a combination of amplifying more than one target polynucleotide and/or providing more than one competitor polynucleotide. The invention provides different combinations, some of which will be described in more detail, but the skilled person will understand that a large number of combinations of different target polynucleotides, different competitors and different arrangements of primers are possible, e.g. primers shared between target and competitor, shared between competitor and competitor, and/or shared between target and target.
Some of these methods are termed “indirect” methods, or indirect CAN.
The indirect CAN methods described herein are considered to be less expensive when larger gene signatures are to be analysed, since in the “direct” methods at least one if not two probes need to be designed for each transcript targeted. For gene signatures (e.g. gene expression levels, presence or absence of particular mutations, abundance of non-coding RNA) with 20-50 targets, iterating on sequence designs becomes prohibitively expensive. To address this issue, indirect CANs provide similar functionality at a more or less fixed cost regardless of the number of genes under investigation. Indirect competition also opens the possibility of higher-order networks capable of complex, non-linear analysis of multiple targets simultaneously. Finally, redundant targeting allows additional flexibility for all CAN architectures.
The direct competition methods described herein use competition between a probed target polynucleotide product and a probed competitor polynucleotide product. The indirect method uses an un-probed target polynucleotide simply to mediate the competition between competitor polynucleotides. Because both primers are necessary for exponential amplification of a given target, replication can be arrested by depletion of only one primer. In this embodiment of the invention a competitor polynucleotide, shown as REFH in
Accordingly, in some embodiments the method comprises providing a second tuned competitor polynucleotide.
In some embodiments the second primer is:
In other less preferred embodiments of the indirect method, the second primer is:
In some embodiments, the second primer is capable of hybridising to a second target polynucleotide, and is optionally not capable of hybridising to the first target polynucleotide.
It will be appreciated that, as described above, the method can be used in the context of more than one target polynucleotide. In some instances, the method is used to determine the expression of more than one gene, the presence or absence of more than one particular mutation, and/or the abundance of more than one non-coding RNA. In other embodiments, the skilled person will understand that the relevant primers may be designed so that the more than one target polynucleotide are part of the same actual RNA molecule. For example several primer pairs can be designed to amplify several different regions from a single mRNA. In conjunction with the appropriate competitor polynucleotides this embodiment of the methods of the invention is termed a “redundant” method.
Accordingly, in one embodiment the second target polynucleotide is part of the same polynucleotide molecule as the first target polynucleotide.
In other embodiments the second target polynucleotide is on a different polynucleotide molecule to the first target polynucleotide.
It will be appreciated that typically two primers are used to amplify each target. Accordingly, the methods of the invention may comprise more than two primers, for example at least 3, 4, 5, 6 or more primers.
For example, in one embodiment, the second primer is:
In other embodiments the method comprises providing a fourth primer, wherein the fourth primer is capable of hybridising to the first target polynucleotide, wherein the first and fourth primer hybridise on opposite strands of the target so as to permit formation of the first target product, optionally a first target PCR product.
As can be seen, any suitable arrangement of primers is provided by the methods of the invention, so that each relevant target or competitor is amplified, and so that each target and competitor compete appropriately for the relevant primers.
To further exemplify the different combinations of target, competitor and primer arrangement provided by the invention, in some embodiments the method comprises providing:
It will be clear that the fourth and fifth primers may bind to other target polynucleotides and/or to other competitor polynucleotides, expanding the complexity of the network that is assessed.
As described above, a key feature of the present invention is the use of one or more tuned competitor polynucleotides, each of which has an amplification rate that has been specifically tuned relative to the corresponding target polynucleotide or relative to the amplification rate of other target or competitor polynucleotides within the network. This tuning provides the discrimination in amplification that translates the predictive relationship, decision surface, or differential target oligonucleotide pattern (such as a differential gene regulation signature or presence or absence of particular mutations) into a relative abundance of each amplification product that can be simply interrogated, for example by using labelled nucleic acid probes.
Accordingly, in one embodiment the amplification rate of the first target polynucleotide is different to the amplification rate of the first tuned competitor polynucleotide. In other embodiments the amplification rate of a target polynucleotide is different to the amplification rate of its corresponding tuned competitor polynucleotide.
Typically, in prior art amplification methods, when trying to amplify a product the amplification rates are optimised, so that amplification is as efficient as possible. The skilled person is aware of techniques to increase the efficiency of amplification, for example altering the length of the product, altering the G/C content and changing the concentration of the primers. Since the skilled person knows how to improve amplification, the skilled person also knows how to make amplification less efficient, i.e. how to decrease the rate of amplification.
The skilled person will understand that it is the relative amplification rate between the target and the competitor (or in some cases between the target and competitors, or between the targets and competitor, or between the targets and competitors) that is important, not necessarily the absolute amplification rate. Accordingly, it is important that the most appropriate region of the target is chosen for amplification, for example the most appropriate 200 bp region of a particular target mRNA, so that the relative amplification rate between target and competitor is appropriate.
Accordingly, in one embodiment the amplification rate of any of the target polynucleotides or competitor polynucleotides can be altered by one or more of:
Accordingly, in one embodiment the amplification rate of the competitor polynucleotide can be altered by increasing or decreasing the number of base pairs of the competitor polynucleotide product.
In some embodiments the amplification rate of the competitor polynucleotide is:
As examples, the sequences of pairs of target product and corresponding competitor product, tuned to provide various relative rates of amplification and exemplified in the Examples, are provided below.
The amplification rate can be defined as the “r” estimated from fitting the following equation to a fluorescent trace of standard quantitative PCR run on the polynucleotide with only the primers capable of hybridizing to it, in the absence of any other polynucleotides:
This and other suitable equations are known in the art, see for example Spiess et al BMC Bioinformatics 9 article number 221 (2008); Rutledge NAR 2004 32: e178; and Liu et al Cell Culture and Tissue Engineering 2001 27: 1407-1414.
Where t is the cycle at which each fluorescence value was measured. A typical reaction would include commercially available qPCR master mix, 125 nM of each of the two primers, 250 nM of the respective probe, run for 60 cycles at 60° C. The curve fitting would typically be performed through a non-linear least-squares (NLLS) algorithm. Variations in this procedure, including substituting the probe with a fluorescent dye (e.g., Sybr Green, EvaGreen), altering the duration, temperature, or concentrations involved, or alternative statistical approaches such as Bayesian estimation are permissible as long as the same approach is used for all polynucleotides being evaluated. In a similar vein, different equations can be used to estimate “r”, including but not limited to:
Since the competitor polynucleotide is tuned to have a different amplification rate to the target polynucleotide, in a situation wherein the amplification reaction comprises the same or substantially the same number of initial target and competitor template molecules, the number of target product polynucleotides generated is different to the number of tuned competitor product polynucleotides generated. Accordingly, in one embodiment of the method, the number of target product polynucleotides generated is different to the number of tuned competitor product polynucleotides generated, when the initial number of target polynucleotides and the number of tuned competitor polynucleotides prior to primer extension is the same or is substantially the same.
The premise of tuning the competitor polynucleotide so that the target and competitor have particular relative amplification rates is ultimately to mimic the predictive relationship, decision surface or differential gene regulation signature or presence/absence of particular mutations that underlies the purpose of the method, for example in diagnostics, prognostics, or simply taking a snapshot of the current state of a system or gene network. Accordingly, in one embodiment the sequence of the first target polynucleotide to be amplified, and the sequence of the at least first tuned competitor polynucleotide, is selected so as to result in a final detectable signal that varies with the initial concentration of the first target polynucleotide in such a way that approximates, reproduces or matches the predictive relationship or differential gene regulation signature of the target to one or more states.
In this way, if a particular sample has a low level of expression of a gene, but that low expression is, for instance, highly predictive of a disease state (or has a particular mutation but that mutation is more highly predictive of a disease state than a second mutation), the final detectable level of the target product may be high (the corresponding competitor polynucleotide is designed to have a sequence that is a poor competitor); whereas a gene that has a high level of expression but is poorly predictive of a disease may have a lower final detectable level of target product (i.e. the corresponding competitor polynucleotide is designed to have a sequence that is highly competitive, converting the high gene expression to a lower amount of target product), since the competitor sequences are chosen to apply the correct weighting to the amplification of each target.
This same premise applies to direct methods, whereby each target polynucleotide is amplified by two primers, which also amplify a corresponding tuned competitor polynucleotide (keeping in mind that in each reaction it is possible to have a number of different targets and different corresponding competitor polynucleotides being amplified, as described below); and also applies to indirect methods whereby for example the target is amplified by two primers, one of which is also used to amplify a first competitor along with a second competitor primer, which itself is used to amplify a second competitor polynucleotide, e.g. -target-competitor1-competitor2-, wherein each “-” is a primer. The skilled person is able to generate such amplification networks that effectively encode the predictive relationship or differential gene regulation signature, such that the output, i.e. the amount of product of target and competitor, is diagnostic, prognostic, or otherwise predicts the probability of state A versus state B.
Accordingly, in one embodiment, the rate of amplification of a first target polynucleotide and the rate of amplification of a second target polynucleotide approximates, reproduces or matches a pre-defined weighting. The skilled person will understand that the weighting is derived from whatever is necessary for the assay signal to approximate, reproduce or match the predictive signal, which will typically be identified via simulation.
Prior art methods that involve competitive amplification require that the competitor be as close as possible in sequence to the target sequence—since the methods are used to quantify the amount of starting target template, any difference in amplification rate would skew the results. It is clear from the disclosure herein that the competitor polynucleotides of the present invention are intentionally designed to have a different amplification rate to the target. This can be achieved by having a different sequence to the target. In one embodiment, the sequence of the first tuned competitor polynucleotide to be amplified shares less than 95%, 90%, 88%, 85%, 80%, 75%, 70%, 65%, 60%, 55%, 50%, 45%, 40%, 35%, 30% sequence identity with the sequence of the first target polynucleotide to be amplified. It will be clear to the skilled person that the target sequence to be amplified is typically a subsequence within a larger polynucleotide, for example a 200 nucleotide region of a 500 nucleotide polynucleotide. The skilled person will understand that the requirement for a particular sequence identity, or amplification rate, applies only to this portion of the polynucleotide that is to be amplified, and the sequence of the flanking regions is largely irrelevant.
As described above, a different amplification rate can be achieved by altering the GC content of the sequence to be amplified. Accordingly, in one embodiment, the sequence of the first tuned competitor polynucleotide to be amplified (i.e. the sequence of the first tuned competitor product) comprises at least 15% GC, at least 25%, at least 35%, at least 55%, at least 65%, at least 75%, or at least 85% GC.
In the same or different embodiments the difference in GC content of the first target polynucleotide portion to be amplified and the first competitor polynucleotide to be amplified is at least 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% or at least 90% or 95%. For example, the first target polynucleotide portion to be amplified may comprise a sequence that is 20% GC, and the first competitor polynucleotide to be amplified may comprise a sequence that is 25% GC, resulting in a difference in GC content of 5%.
Altering the length of the product to be generated, i.e. the distance between the sites of hybridisation of the two primers used in any given amplification, can also be used (alone or in combination with other methods described here such as altering the GC content) to tune the amplification rate. Accordingly, in the same or different embodiment, the first tuned competitor product is at least 5 nucleotides longer than the first target product, optionally at least 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290 or at least 330 nucleotides longer than the first target product.
In some embodiments the first tuned competitor product is at least 5 nucleotides shorter than the first target product, optionally at least 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290 or at least 330 nucleotides shorter than the first target product.
In some embodiments the first tuned competitor product is at least 5 nucleotides longer than the first target product, optionally at least 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290 or at least 330 nucleotides longer than the first target product.
The skilled person will appreciate that any combination of one or all of the above parameters, i.e. GC content, sequence identity and length of amplicon can be used to produce an appropriately tuned competitor polynucleotide.
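By way of illustration only, two of the tuning parameters discussed above, GC content and amplicon length, can be compared for a candidate target/competitor pair with a short Python helper. The function names and the two sequences are hypothetical and are not sequences from the Examples. Sequence identity is omitted because it would require an alignment step.

```python
def gc_percent(seq):
    """Percentage of G and C bases in an amplicon sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = sum(1 for base in seq if base in "GC")
    return 100.0 * gc_count / len(seq)

def compare_amplicons(target_seq, competitor_seq):
    """Summarise GC and length differences between target and competitor."""
    return {
        "target_gc": gc_percent(target_seq),
        "competitor_gc": gc_percent(competitor_seq),
        # Absolute difference in GC content, in percentage points.
        "gc_difference": abs(gc_percent(target_seq) - gc_percent(competitor_seq)),
        # Positive when the competitor product is longer than the target product.
        "length_difference": len(competitor_seq) - len(target_seq),
    }

# Hypothetical 20-nucleotide amplicons: 20% GC target and 25% GC competitor.
target = "GCGC" + "AT" * 8
competitor = "GCGCG" + "A" * 15
summary = compare_amplicons(target, competitor)
```

For these two hypothetical sequences the helper reports a GC difference of 5 percentage points and no difference in length, mirroring the 20% versus 25% example given above.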
Following amplification, it will be apparent to the skilled person that the amplification products are detected. In some instances it is sufficient to detect the presence or absence of a particular product. In other instances determination of the actual or relative abundance of a product is required. Various means are available to the skilled person to determine the presence or amount of an amplification product, including gel based electrophoresis assays, affinity-based capture of the amplification products for example on lateral flow strips, and fluorescence labelled probe based assays.
The present invention is particularly powerful when used to determine the relative abundance of at least two target polynucleotides. Accordingly in some embodiments the one or more target products, optionally one or more target PCR products; and the one or more tuned competitor products, optionally one or more competitor polynucleotide PCR products are detected.
In preferred embodiments, each target product and each corresponding competitor product is detected. In particularly preferred embodiments, the detection involves the use of fluorescently labelled probes wherein no matter how many targets and competitors are detected, the detection only uses two different fluorophores. Summing the fluorescence from each probe (i.e. just a single reading of fluorescence from both fluorophores) produces a single overall value, i.e. which of the fluorescence labels is higher. In turn, this corresponds to a diagnosis or prognosis.
Accordingly, in some embodiments the method comprises providing one or more probe groups, wherein each probe group comprises at least one probe polynucleotide labelled with a first label and at least one probe polynucleotide labelled with a second label, and wherein the first and the second label are different.
In some instances the at least one probe labelled with the first label is capable of hybridising to the first target product; and the at least one probe labelled with a second label is capable of hybridising to the first tuned competitor product. In some embodiments neither probe is capable of hybridising to the first target product.
In other instances the at least one probe labelled with the first label is capable of hybridising to the first tuned competitor product; and the at least one probe labelled with the second label is capable of hybridising to the second tuned competitor product. In some embodiments neither probe is capable of hybridising to the first target product.
The above reflects the fact that some genes may be predictive or diagnostic when the expression level is increased as compared to a control (e.g. non-diseased) sample; and that some genes may be predictive or diagnostic when the expression level is decreased as compared to a control sample. The skilled person will be able to ensure that the correct label is assigned to the correct probe so that combining the total fluorescence takes into account the direction of gene expression. A key feature of the present invention is that it is the difference between labels that provides the information; which label provides the “positive” signal and which provides a “negative” signal is decided by the skilled person.
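By way of illustration only, the two-label readout can be sketched in Python as follows. The sketch assumes that the signal contributed by each probe is proportional to the amount of its cognate product, and that a product without a probe contributes no signal. The function name, product names and amounts are hypothetical; FAM and HEX are used simply as examples of a first and a second label.

```python
def read_two_label_output(product_amounts, probe_labels,
                          first_label="FAM", second_label="HEX"):
    """Sum the signal per label and report which label is higher.

    product_amounts: product name -> final amount (arbitrary units).
    probe_labels: product name -> label of the probe that detects it.
    """
    totals = {first_label: 0.0, second_label: 0.0}
    for product, amount in product_amounts.items():
        label = probe_labels.get(product)
        if label in totals:          # un-probed products are ignored
            totals[label] += amount
    if totals[first_label] > totals[second_label]:
        call = first_label
    elif totals[second_label] > totals[first_label]:
        call = second_label
    else:
        call = "tie"
    return totals, call

# Hypothetical end-point amounts for two targets and their competitors.
amounts = {"target1": 3.0, "competitor1": 1.0,
           "target2": 0.5, "competitor2": 1.5}
# target1 is informative when increased, target2 when decreased, so the
# labels are swapped for the second target/competitor pair.
labels = {"target1": "FAM", "competitor1": "HEX",
          "target2": "HEX", "competitor2": "FAM"}
totals, call = read_two_label_output(amounts, labels)
```

Swapping the labels for the second pair is how the direction of gene expression is encoded in this sketch: a fall in target2 raises the first-label total in the same way as a rise in target1.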
A particular probe group represents a set of probes that are each labelled with one of only two different labels. It will be clear that as described above, the methods may be used to detect a number of different target products and competitor products. Accordingly, in some embodiments, within a single probe group there are at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or at least 100 different probes each labelled with the first label. In the same or other embodiments within a single probe group there are at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or at least 100 different probes each labelled with the second label. The direct method described above will typically require one probe with one label that can hybridise to the target product, and a corresponding probe labelled with the second label that can hybridise to the corresponding competitor product, i.e. a 1:1 ratio of probes (though the labels may be swapped as described above depending on the predictive relationship or differential gene regulation signature). The indirect method does not necessarily require this 1:1 ratio, since for example a single target product may be associated with two or more competitor products.
Accordingly, in some embodiments, within a single probe group there are:
In some embodiments, appropriate probes are as follows:
As described above, the power in the methods comes at least from combining the detection of a number of different targets and competitors into two single readings (i.e. a reading of the first label and a reading of the second label, both of which can be done in one single reading), which themselves are combined into a single reading—how much first label versus how much second label.
However, if analysis of the expression of a larger number of genes is required, or the analysis of more complex networks, it is possible to use further probe groups, labelled with a third and fourth label for instance (or a third probe group labelled with a fifth and sixth label, etc.). In this way, one set of genes may be analysed using a first probe group (reading the first and second label, followed by how much first label versus how much second label) and a second probe group (reading the third and fourth label, followed by how much third label versus how much fourth label). If necessary the overall reading of first:second:third:fourth label can be taken. This will all depend on the predictive relationship or differential gene regulation signature that is being employed.
Accordingly, in some embodiments the method comprises providing at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or at least 100 different probe groups, wherein no particular label is used in more than one probe group.
In some embodiments the method comprises providing a number of labelled probe polynucleotides such that each target product has a corresponding labelled target probe polynucleotide and each tuned competitor product has a corresponding labelled competitor probe,
and wherein the labelled probes corresponding to the target product and the tuned competitor product are labelled with different labels.
In some embodiments the only labels present on the probes are the first label and the second label.
In some embodiments, each probe is labelled with a single type of label. For example, each probe is labelled only with HEX, or is only labelled with FAM, and is not labelled with both HEX and FAM. It will be clear to the skilled person however that each probe may be labelled with more than one molecule of the same label, for example may be labelled with 1, 2, 3, 4, 5 or more HEX molecules.
The probes may be labelled with any type of detectable label, for example an enzyme-based label that results in a colour change. Preferably, the label is a fluorophore. Accordingly, in some embodiments the first and second label are fluorophores. Examples of fluorophore labelled probes are “TaqMan” probes (that require degradation to release the fluorophore from proximity to a quencher), Hybeacons (which light up only when bound to the target), Molecular Beacons (which physically distance two fluorophores when bound to an amplicon though the fluorophores remain tethered through the probe), and Scorpion probes.
It will be clear then from the above that reference to a fluorophore does not mean that a quencher may not also be present. For example in some embodiments the probes are labelled with a first and a second fluorophore. However, each probe may also be labelled with an appropriate quencher, as will be understood by the skilled person.
Alternatively, probes may be labelled in a manner intended for affinity-based separation (see for example Abingdon probes for Nucleic acid lateral flow immunoassays https://www.abingdonhealth.com/other-products/nucleic-acid-detection-pcrd/ and the probes provided by Twistdx https://www.twistdx.co.uk/docs/default-source/Application-notes/app-note-001---pcrd-rpa-use-v1-7.pdf?sfvrsn=615403fc_46). As an example of one such embodiment, one probe is labelled with FAM and the other with the hapten digoxigenin (DIG). A primer for each of the target and the competitor is labelled with biotin; thus amplification produces some amplicons labelled at one end with biotin and at the other with FAM, as well as other amplicons labelled at one end with biotin and at the other with DIG. The amplicons are mixed with a solution of streptavidin-coated gold nanoparticles, which bind to the biotin to form nanoparticle-amplicon complexes, and are then allowed to flow up a lateral flow strip. Anti-FAM and anti-DIG antibodies printed in separate lines on this strip act as affinity purification agents, binding to the respective amplicons. This causes gold nanoparticles to be trapped at the printed lines, producing a dark red band visible to the naked eye. The relative intensity of these two bands provides the “signal” in the same manner as the relative intensity of two fluorophores described above.
The skilled person understands what is required of a probe that functions via hybridisation to a nucleic acid target. For example, the probe could have a sequence that is 100% identical to the relevant region of the target. However, the skilled person also understands that the sequences do not have to be 100% identical. Designing such hybridisation probes is entirely routine for the skilled person.
The skilled person will understand what is meant by a fluorophore and is capable of identifying appropriate fluorophores or fluorophore pairs. Preferably, the first and second fluorophore are chosen so that they have distinct emission spectra. Exemplary fluorophores are TAM, SUN, VIC, TET, JOE, the cyanine dyes (Cy3, Cy3.5, Cy5, Cy5.5), the Atto dyes, and the Alexa Fluors (see for example https://eu.idtdna.com/site/Catalog/modifications/dyes and https://www.trilinkbiotech.com/omi—
Particularly useful combinations are considered to be FAM and HEX; Cy3 and Cy5; and any combination of FAM, HEX, TET and Cy5.
A particularly useful pair of fluorophores is FAM and HEX.
Accordingly, in one embodiment, the first label is FAM and the second label is HEX. In another embodiment, the first label is HEX and the second label is FAM.
It is important that the probe that binds to the target product and the probe that binds to the corresponding competitor product are labelled with different labels, so the relative amounts of each product can be either determined, or incorporated into an overall determination of the amount of different target products and different competitor products.
Accordingly, in one embodiment, the at least one probe that is capable of hybridising to the first target product; and the at least one probe that is capable of hybridising to the first tuned competitor product are labelled with different labels.
In the same or different embodiment, the at least one probe that is capable of hybridising to the first tuned competitor product; and the at least one probe that is capable of hybridising to the second tuned competitor product are labelled with different labels.
In some embodiments where a group of genes are all predictive of the particular state (e.g. disease, prognosis) when the expression of the genes is increased relative to a control sample or control level, it is appropriate that each probe that is capable of hybridising to a target product is labelled with the same first label, and each probe that is capable of hybridising to a tuned competitor product is labelled with the same second label.
However, in some embodiments as described above, some genes are predictive of a particular state when the gene expression is repressed. Since many predictive relationships or differential gene regulation signatures and networks involve an increased expression of some genes and a concomitant repression of other genes, it is important that this can be reflected in the simple output from the method. Accordingly in some embodiments at least one of the probes that are capable of hybridising to a target product is labelled with a first label, and at least one of the probes that are capable of hybridising to a tuned competitor product is labelled with the same first label.
In some instances, within a given amplification reaction, there will be probes that are capable of hybridising to a target product that are labelled with a first label, probes that are capable of hybridising to a target product that are labelled with a second label, probes that are capable of hybridising to a competitor product that are labelled with a first label, and probes that are capable of hybridising to a competitor product that are labelled with a second label.
In some embodiments each probe that is capable of hybridising to a target polynucleotide product that is associated with a positive predictive relationship or differential gene regulation signature of a particular state is labelled with the first label, and the corresponding probe that is capable of hybridising to the tuned competitor polynucleotide product is labelled with the second label; and/or
wherein each probe that is capable of hybridising to a target polynucleotide product that is associated with a negative predictive relationship or differential gene regulation signature of the particular state is labelled with the second label, and the corresponding probe that is capable of hybridising to the tuned competitor polynucleotide product is labelled with the first label.
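The label-swapping rule set out above can be sketched in a few lines. This is a minimal illustration only: the function name `assign_labels` and the "positive"/"negative" strings are hypothetical, and the example signature (GBP6 upregulated, ARG1 and TMCC1 downregulated) is taken from the tuberculosis embodiment described elsewhere herein.

```python
FIRST, SECOND = "first label", "second label"

def assign_labels(direction: str) -> dict:
    """Return the labels for a target probe and its corresponding competitor probe.

    direction: "positive" if increased expression is predictive of the state,
               "negative" if repressed expression is predictive of the state.
    """
    if direction == "positive":
        return {"target_probe": FIRST, "competitor_probe": SECOND}
    if direction == "negative":
        return {"target_probe": SECOND, "competitor_probe": FIRST}
    raise ValueError(f"unknown direction: {direction!r}")

# Direction of each gene in the signature (tuberculosis example)
signature = {"GBP6": "positive", "ARG1": "negative", "TMCC1": "negative"}

plan = {gene: assign_labels(direction) for gene, direction in signature.items()}
for gene, labels in plan.items():
    print(gene, labels)
```

With this assignment, more first label always points towards the particular state, whether a given gene contributes by being induced or by being repressed.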
In some instances, following amplification, the actual amount of product detected by the first probe and the actual amount of product detected by the second probe are determined.
In other embodiments, it is the relative amounts of each probe that are determined. For instance in some embodiments the relative amounts of each probe are compared to a standard curve to determine the relative probability of one or more states.
Generating an appropriate standard curve is routine for the skilled person and will require calibration, either by the individual user or the manufacturer, to relate a raw signal (or, in this case, the difference between signals) to a prediction/diagnosis.
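As a hedged illustration of how a calibrated standard curve might be applied, the sketch below looks up a probability from a measured signal difference by piecewise-linear interpolation. The calibration points, the function name `interpolate` and the choice of linear interpolation are assumptions for this example; real calibration data would come from the user or the manufacturer.

```python
from bisect import bisect_right

def interpolate(curve, x):
    """Piecewise-linear lookup on a standard curve.

    curve: list of (signal_difference, probability) pairs sorted by signal.
    Values outside the calibrated range are clamped to the end points.
    """
    xs = [point[0] for point in curve]
    ys = [point[1] for point in curve]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical calibration: signal difference -> probability of the state
standard_curve = [(-2.0, 0.05), (-1.0, 0.20), (0.0, 0.50), (1.0, 0.80), (2.0, 0.95)]

probability = interpolate(standard_curve, 0.5)
print(f"probability of state: {probability:.2f}")
```

Clamping outside the calibrated range is a conservative choice for this sketch; extrapolating beyond the calibration points would not be supported by the data.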
An advantage of the present invention is that it allows the interrogation of a number of different expression patterns simultaneously, for example via multiplex PCR, and due to the use of only 2, or perhaps a small number, for example 3, 4, 5 or 6, different fluorophores, allows the abundance, or relative abundance, of each product to be condensed into a single reading, for example a single reading over multiple wavelengths (channels) to detect the amount of fluorescence from each probe label, or multiple readings performed in quick succession on the same sample.
It will be clear then that the methods of the invention translate the information provided by a given gene transcript or set of transcripts into the relative probability of a particular state.
The methods described herein capture the state of a portion of a gene expression network, optionally as a single value.
It will be clear to the skilled person that the target polynucleotide can be any nucleic acid from any source, provided that it is capable of being amplified. In one embodiment the target polynucleotide is RNA, optionally is an RNA transcript, optionally is an mRNA. In some embodiments the target polynucleotide is an miRNA, lncRNA or an siRNA.
The target polynucleotide may also be DNA. The DNA may be a modified form of DNA.
The sample may be any sample provided it comprises, or is expected to comprise, nucleic acid.
The methods of the present invention have both medical uses and biotechnological/bioproduct uses. The sample may be selected from the group comprising or consisting of: tissue, biopsy, blood, plasma, serum, pathogens, microbial cells, cell culture and cell lysate.
The sample may comprise any source of nucleic acid. In some examples the sample comprises any one or more of: cells, optionally white blood cells and/or red blood cells; exosomes; circulating tumour DNA (ctDNA); cell-free DNA (cfDNA); RNA; or pathogen nucleic acid.
The cells may be of any cell type. For example the cells may be mammalian cells, bacterial cells, yeast cells or plant cells. The mammalian cells may be human cells or may be derived from human cells.
The cells may be cultured cells, optionally primary patient-derived cells or immortalized cell lines.
The cells may be mammalian stem cells.
In some embodiments, the cells are engineered cells, optionally engineered cells used in the bioproduction of metabolites and compounds.
The cells may be yeast cells, optionally wherein the yeast cells are used in brewing.
As is clear from the above, the method of the invention is, in some preferred embodiments, for the amplification of at least a first and a second target polynucleotide.
In some embodiments, the method is for the amplification of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or at least 100 target polynucleotides.
As described above, the present methods also include what is termed a “redundant” model, whereby at least two or more portions of the same physical target polynucleotide molecule are amplified.
Accordingly, in some embodiments the first and the second target polynucleotides are target sequences within the same single polynucleotide.
In some particular embodiments, the method comprises amplification of a tuned competitor polynucleotide with at least one primer that is capable of hybridising to the first and to the second target polynucleotide and producing a first target product and a second target product.
In some embodiments the method comprises amplification of two tuned competitor polynucleotides, wherein the method comprises:
It will be clear that following amplification, detection of the product, for example detection of the signal produced by the fluorophore labelled probes, is indicative of any one or more of:
In some embodiments, (i), (ii), (iii) and/or (iv) above is indicative of one or more of:
As mentioned herein, the methods of the present invention can be used to determine whether a particular sample is more likely to be in a particular state A rather than a particular state B. The states are the states on which the predictive relationship or differential gene regulation signature is based. In some instances the states may be “particular disease” vs “no disease” or vs “other disease” or vs “not particular disease”.
Any of the methods provided by the invention can be for the diagnosis and/or prognosis of a disease or condition in a subject.
Accordingly, the invention also provides a method for the diagnosis and/or prognosis of a disease or condition in a subject.
In some instances, to diagnose a disease or condition requires the assessment of the relative expression levels of at least two genes, optionally requires the assessment of the relative expression levels of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or at least 100 genes.
In some embodiments, the disease or condition is selected from: human tuberculosis, human tuberculosis with HIV co-infection, human tuberculosis without HIV co-infection, cancer, optionally prostate cancer, sepsis, bloodstream candidiasis, bovine tuberculosis, bovine mastitis. In particular embodiments the disease is tuberculosis.
In very particular embodiments, the disease is tuberculosis, and the predictive relationship or differential gene regulation signature is identified from the white blood cells of the subject.
In some embodiments, where the disease is tuberculosis, the degree of differential regulation of GBP6, ARG1 and TMCC1 contributes to an overall probability of having tuberculosis as compared to having some “other disease”. The gene expression signature is upregulation of GBP6, and downregulation of ARG1 and TMCC1, compared to the levels of these genes in patients not having tuberculosis.
In the embodiments where the disease is tuberculosis and the degree of differential regulation of GBP6, ARG1 and TMCC1 contributes to an overall probability of having tuberculosis as compared to having some “other disease”, examples of the primers and competitor sequences that can be used are shown in
In
In one embodiment, where the target is TMCC1 and the target sequence is SEQ ID NO: 4, appropriate competitor sequences used to determine the optimum competitor are considered to be SEQ ID NO: 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34 and 36. Appropriate primers for amplification of the target and competitors are shown in SEQ ID NO: 1 and 3. Appropriate probes for detection of this target's contribution are shown in SEQ ID NO: 77 and 78.
In one embodiment, where the target is ARG1 and the target sequence is SEQ ID NO: 40, appropriate competitor sequences used to determine the optimum competitor are considered to be SEQ ID NO: 42, 44, 46 and 48. Appropriate primers for amplification of the target and competitors are shown in SEQ ID NO: 37 and 39. Appropriate probes for detection of this target's contribution are shown in SEQ ID NO: 79 and 78.
In one embodiment, where the target is GBP6 and the target sequence is SEQ ID NO: 52, appropriate competitor sequences used to determine the optimum competitor are considered to be SEQ ID NO: 54, 56, and 58. Appropriate primers for amplification of the target and competitors are shown in SEQ ID NO: 49 and 51. Appropriate probes for detection of this target's contribution are shown in SEQ ID NO: 80 and 77.
In other embodiments, the disease is cancer, for example is prostate cancer or breast cancer, optionally prostate cancer.
Where the disease is prostate cancer, the primers and probes that can be used are as follows:
In some embodiments, the disease is cancer, and the relative expression of a mutant version of a gene, particular allelic variant and/or cell-free tumour DNA is detected.
In any of the methods and embodiments described herein, the target polynucleotides may comprise SNPs, SNVs (single nucleotide variants), indels or copy-number variants (CNVs) associated with a disease state, optionally associated with the presence of a tumour and/or cancer, for example may comprise SNPs, SNVs or indels in cell-free tumour DNA.
In some embodiments the target is EGFR, in particular a SNP in EGFR. In some embodiments the target sequence is SEQ ID NO: 62, and appropriate competitor sequences are SEQ ID NO: 64, 67 and 71. Appropriate primer sequences are SEQ ID NO: 68 and 70.
In some methods, a blocker oligonucleotide is used, wherein the blocker oligonucleotide cannot undergo extension of its 3′ end, and wherein the blocker oligonucleotide is not complementary to the portion of the sequence in the at least one target polynucleotide containing the single-nucleotide polymorphism (SNP), optionally wherein the SNP is an SNV, but wherein the blocker oligonucleotide is complementary to the corresponding wild-type sequence, and wherein the sequence in the target polynucleotide that is complementary to the blocker oligonucleotide overlaps with at least a portion of the sequence complementary to one of the primers.
In some instances, appropriate blocker sequences are SEQ ID NO: 75 and 76.
In some instances, the sample is obtained from a subject that is already suspected of having a particular disease or condition. In other instances, the method may be used as part of a routine screening programme, in which case the target polynucleotide may be derived from a sample obtained from a subject not suspected of having a particular disease or condition. The subject may be considered to be at risk of a particular disease or condition, for example due to age or lifestyle.
As mentioned herein, in addition to medical uses, the present invention is useful in the field of bioengineering and industrial biotechnology. In some embodiments the detection of the relative expression of a specific gene or genes is indicative of the expression of specific natural and/or engineered genes in cells in culture and can, for example, allow the skilled person to determine whether a cell or system is behaving favourably or whether culture parameters need to be optimised.
As described above, any means of amplification is suitable for use with the present invention. However, preferred methods of amplification include the polymerase chain reaction (PCR) or recombinase polymerase amplification (RPA).
As can be seen above, the invention provides numerous methods for the amplification of one or more target polynucleotides. As indicated at the outset, the invention provides:
and
wherein the method comprises the step of amplifying one or more target polynucleotides in a sample. The step of amplifying one or more target polynucleotides can be performed according to any of the methods of amplification described herein.
The invention further provides a method of diagnosis or prognosis of a disease or condition in a subject wherein the method comprises any of the methods of amplification of the invention. In some embodiments the subject is diagnosed as having a disease or condition, or is given a prognosis of a disease or condition, when the relative amounts of the first label and the second label indicate the disease or condition or the prognosis thereof.
As described above, the disease or condition may be selected from: human tuberculosis, human tuberculosis with HIV co-infection, human tuberculosis without HIV co-infection, cancer optionally prostate or breast cancer, sepsis, bloodstream candidiasis, bovine tuberculosis, bovine mastitis. Preferences for the disease or condition are as described elsewhere herein.
The invention also provides various compositions and kits that can be used to put the methods of the invention into practice. For example, the invention provides a composition comprising one or more of:
The skilled person will appreciate that a composition for nucleic acid amplification may comprise one or more standard amplification components, such as a polymerase enzyme; appropriate amounts of each of four nucleotides A, C, T and G; a recombinase enzyme; a single stranded binding protein; and/or appropriate amounts of each of the nucleotides A, C, T, G and U.
The invention also provides a tuned competitor polynucleotide as defined herein. Preferences for features of the tuned competitor polynucleotide are described elsewhere herein.
The invention also provides a kit for carrying out any of the methods of the invention, for example wherein the kit comprises one or more of:
In particular embodiments the kit comprises;
The invention also provides a composition comprising any one or more of:
In one embodiment the composition comprises:
In one embodiment the composition comprises:
In one embodiment the composition comprises:
In one embodiment the composition comprises:
In one embodiment, the kit or composition comprises any one or more of the sequences shown in
In one embodiment, the kit or composition is for amplifying a portion of TMCC1 mRNA and comprises any one or more of the competitor sequences of SEQ ID NO: 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34 and 36. In some embodiments the kit or composition also comprises appropriate primers for amplification of the target and competitors, such as those of SEQ ID NO: 1 and 3.
In the same or different embodiment, the kit or composition is for, or is also for, amplifying a portion of ARG1 mRNA and comprises any one or more of the competitor sequences of SEQ ID NO: 42, 44, 46 and 48. In some embodiments the kit or composition also comprises appropriate primers for amplification of the target and competitors, such as those of SEQ ID NO: 37 and 39.
In the same or different embodiment, the kit or composition is for, or is also for, amplifying a portion of GBP6 mRNA and comprises any one or more of the competitor sequences of SEQ ID NO: 54, 56, and 58. In some embodiments the kit or composition also comprises appropriate primers for amplification of the target and competitors, such as those of SEQ ID NO: 49 and 51.
In other embodiments, the kit or composition is for amplifying a portion of EGFR genomic DNA, for example genomic DNA that is in a sample of ctDNA, for example in order to distinguish between the wild-type allele and a particular mutation, such as the L858R SNP, and comprises any one or more of the competitor sequences of SEQ ID NO: 64, 67 and 71. In some embodiments the kit or composition also comprises appropriate primers for amplification of the target and competitors, such as those of SEQ ID NO: 68 and 70.
The invention also provides a collection or kit that comprises at least two tuned competitor polynucleotides as described herein, wherein the collection comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 24, 25, 26, 28, 30, 32, 34, 35, 36, 38, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or at least 200 tuned competitor polynucleotides.
The invention also provides a collection or kit that comprises at least two tuned competitor polynucleotides and at least two corresponding labelled probes.
The invention also provides a collection or kit that comprises:
Further, the invention provides a collection or kit that comprises:
The invention also provides a method of tuning a first competitor polynucleotide that competes for hybridisation with at least a first primer with a first target polynucleotide and which results in amplification of a first target product and a first tuned competitor product, and wherein:
The method of tuning a competitor polynucleotide of the invention may also comprise:
In some instances said optimising comprises producing two or more test tuned competitor polynucleotides that following amplification result in:
In some instances, said optimising comprises producing at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 different test tuned competitor polynucleotides.
In some embodiments said optimising comprises performing at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 test amplification reactions with each test tuned competitor polynucleotide,
In a preferred embodiment, at least two replicates of five amplification reactions are performed, wherein each of the five amplification reactions employs a different tuned competitor polynucleotide.
In some instances, each test amplification using a particular test tuned competitor polynucleotide is performed using a different concentration and/or number of target polynucleotide templates.
In some embodiments the test amplification reactions are performed with a range of concentrations and/or number of target polynucleotide templates that span 10^0 copies/μL to 10^8 copies/μL.
As described herein, in some instances the test tuned competitor polynucleotides are designed to have different GC contents.
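For illustration, the GC content of candidate competitor sequences can be computed as below. The function name `gc_content` and the three short sequences are hypothetical and are not taken from the sequence listing.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

# Hypothetical "core" sequences designed to span a range of GC contents
candidate_cores = {
    "core_low": "ATATTAGCATTA",
    "core_mid": "ATGCATGCATGC",
    "core_high": "GCGGCCGCATGC",
}

gc_by_core = {name: gc_content(seq) for name, seq in candidate_cores.items()}
for name, gc in gc_by_core.items():
    print(f"{name}: {gc:.1f}% GC")
```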
Also provided by the present invention is a method of optimising a competitive amplification reaction according to any of the preceding claims, wherein said optimising comprises:
The invention also provides a method of multiplexed competitive amplification of at least two target polynucleotides wherein the method comprises at least one competitive polynucleotide and wherein the target amplification products are detected using probes labelled with the same label, optionally labelled with the same fluorophore, optionally wherein the competitive polynucleotide is a tuned competitive polynucleotide according to any of the preceding claims.
The invention also provides a method of determining the transcriptional state of a system wherein the method comprises competitive amplification according to any method of the invention.
The invention also provides a method of determining whether a system is in state A or in state B wherein the method comprises competitive amplification according to any method of the invention.
The invention also provides a method of simultaneous competitive amplification of at least two target polynucleotides in a sample wherein the method comprises providing
In some embodiments of the method of simultaneous competitive amplification of at least two target polynucleotides the method comprises providing
In some embodiments of the method of simultaneous competitive amplification of at least two target polynucleotides one of the labelled target probes is labelled with the second label and the corresponding labelled competitor probe is labelled with the first label.
In some embodiments of the method of simultaneous competitive amplification of at least two target polynucleotides the method further comprises simultaneously detecting the amount of the first label and the second label following multiplexed amplification.
The listing or discussion of an apparently prior-published document in this specification should not necessarily be taken as an acknowledgement that the document is part of the state of the art or is common general knowledge.
Preferences and options for a given aspect, feature or parameter of the invention should, unless the context indicates otherwise, be regarded as having been disclosed in combination with any and all preferences and options for all other aspects. For example, exemplary combinations of features provided by the invention include:
A summary of the overall approach that may be taken by the skilled person to put the invention into practice for specific applications is as follows:
A summary of an exemplary method of tuning a competitor polynucleotide is as follows:
Simulations were carried out to identify ideal parameter values describing optimal behaviour. Designing a competitor sequence which displays behaviour reflected by one or more of these parameter values is the goal of tuning. First, numerous amplicon sequences are designed and obtained with identical primer sequences and variable “core” sequences between the primers. These sequences are tested experimentally, and their behaviour is analysed to derive values for the descriptive parameters. Assuming none of these sequences displays ideal amplification behaviour, the data are used to rationally design a new sequence with the best chance of matching the target behaviour. To this end, regression is performed to determine how various sequence design parameters predict the parameters of interest describing amplification behaviour. Specifically, a Gaussian Process regressor can be trained to relate the length and GC content of the “core” sequence to the “amplification rate” parameter. This, or any other such regressor, can then be used to predict the behaviour of a given designed amplicon as well as to provide the sequence descriptors (length and GC content) most likely to achieve the desired objective. This process of simulation, design, experimentation, analysis, and regression is iterated for every sequence in the Competitive Amplification Network until a suitable sequence is found. Modifications of this approach include incorporating information on the primer sequences themselves within the regression. This allows determination of both a global relationship between design parameters and amplification parameters and the idiosyncrasies of that relationship specific to a given pair of primers.
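The regression step described above might be sketched as follows. This is a minimal, dependency-free Gaussian Process with a squared-exponential kernel, given for illustration only: the kernel choice, the length scales (50 nucleotides, 15 percentage points), the noise level, the class name `GPRegressor` and the five data points are all assumptions of this sketch and are not values from this work. In practice an established library implementation, such as the Gaussian Process regressor in scikit-learn, would more likely be used.

```python
import math

def rbf(a, b, scales):
    """Squared-exponential kernel with one length scale per design parameter."""
    d2 = sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales))
    return math.exp(-0.5 * d2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

class GPRegressor:
    """Relates (core length, GC %) to the observed amplification rate r."""

    def __init__(self, scales=(50.0, 15.0), noise=1e-2):
        self.scales, self.noise = scales, noise

    def fit(self, X, y):
        self.X = X
        self.mean = sum(y) / len(y)  # constant prior mean
        self.K = [[rbf(a, b, self.scales) + (self.noise if i == j else 0.0)
                   for j, b in enumerate(X)] for i, a in enumerate(X)]
        self.alpha = solve(self.K, [v - self.mean for v in y])
        return self

    def predict(self, x):
        """Return the predicted rate and its standard deviation at design x."""
        k = [rbf(x, xi, self.scales) for xi in self.X]
        mu = self.mean + sum(ki * ai for ki, ai in zip(k, self.alpha))
        v = solve(self.K, k)
        var = max(1.0 - sum(ki * vi for ki, vi in zip(k, v)), 0.0)
        return mu, math.sqrt(var)

# Hypothetical tested designs: (core length in nucleotides, GC content in %)
designs = [(60, 35), (60, 55), (100, 35), (100, 55), (140, 45)]
rates = [1.10, 0.95, 0.90, 0.80, 0.70]  # hypothetical observed r values

gp = GPRegressor().fit(designs, rates)
mu_new, sd_new = gp.predict((80, 45))
print(f"predicted r at (80 nt, 45% GC): {mu_new:.2f} +/- {sd_new:.2f}")
```

The returned standard deviation is what allows the practitioner to distinguish designs that are predicted to be good from designs that are merely untested.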
The invention is further described in the following numbered embodiment paragraphs:
The invention is also further defined by the following numbered embodiments:
Illustration of how regression enables tuning of the competitors to achieve a given target amplification rate r. A) A regression surface (far left) is generated, for example through Gaussian Process regression, that relates the two competitor design parameters of length (BP, in nucleotides) and GC content (in percent) to the observed amplification rate, along with the uncertainty in that relationship. Here, observed points (i.e., competitor sequences which have been designed and experimentally tested) are denoted by circles shaded by amplification rate. Filled contours represent the expected amplification rate at each point determined by the regression algorithm, and dashed lines represent iso-uncertainty contours (the square root of the variance returned by the regressor), indicated as a multiple of the standard deviation of all observed r values thus far. From this regression surface, a metric such as Expected Improvement can be calculated that indicates a new design likely to display the desired target amplification rate. Shown here are the Expected Improvement surfaces for different targets, lighter shades indicating a higher likelihood of achieving the goal. B) The regression surface and expected improvement surfaces, shown here for a target amplification rate of 1.0, change as new sequences are tested and added to the model. In this way, the practitioner can iteratively tune the competitor sequences to achieve the desired amplification rate: i) regression is performed on data obtained thus far, ii) a new design is proposed which has high likelihood of achieving the desired rate, iii) a new sequence based on this design is obtained and experimentally tested, iv) if observed behavior is suboptimal, the regression surface can be updated to incorporate this data, and v) yet another design can be proposed.
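A minimal sketch of the Expected Improvement calculation referred to in this legend is given below. It assumes, for simplicity, that the regressor has been fitted to the deviation d = |r - r_target| so that smaller values are better; that framing, the function name and the candidate predictions are assumptions of this sketch rather than the procedure used to generate the figure.

```python
import math

def expected_improvement(mu, sigma, best):
    """Expected Improvement for minimisation under a Gaussian prediction.

    mu, sigma: predicted mean and standard deviation of the objective at a
               candidate design; best: lowest objective value observed so far.
    """
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf

# Hypothetical predictions (mean, standard deviation) of d for three candidate
# designs, keyed by (length in nucleotides, GC content in percent)
candidates = {
    (80, 40): (0.12, 0.02),
    (100, 50): (0.08, 0.05),
    (120, 60): (0.15, 0.10),
}
best_so_far = 0.10  # smallest deviation among sequences tested so far

scores = {design: expected_improvement(mu, sd, best_so_far)
          for design, (mu, sd) in candidates.items()}
next_design = max(scores, key=scores.get)
print("next design to test:", next_design)
```

In this example the (120, 60) design scores higher than the (80, 40) design despite a worse predicted mean, because its larger uncertainty leaves more room for improvement; this is the trade-off the metric is intended to capture.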
Shown here are the real-time fluorescence traces for competitive amplification reactions between each synthetic amplicon shown in
List of examined sequences, design characteristics, and observed amplification parameters used in this work, any of which may be used as components of any CAN. Each sequence listed here was amplified using traditional PCR techniques and the resulting fluorescence curves were analyzed as described in this work. The measured parameters F0_lg, K, r, and m are those that appear in equations 2 and 3, and tau and rho appear in equations 4 and 5.
SEQ ID NOs: 1-80 are as set out in
The invention will now be described further by the following non-limiting Examples.
The core technology is a system of at least three natural target or competitor polynucleotides, used in a nucleic acid amplification reaction for evaluation of a certain combination of one or more sequences of interest. As the sequences are replicated, they compete for these shared primers, conferring unique characteristics to the resulting readout. For example, take a set of natural gene transcripts, each paired with an engineered synthetic competitor (
The “direct” competitive amplification network described above, comprising multiple pairs of natural and synthetic targets each competing for both primers, constitutes the simplest embodiment of this invention. However, the same competition principle applies to more complex networks. For example, a natural target could share one of its primers with one synthetic target, which in turn shares its other primer with a second synthetic target, making an “indirect” CAN (
Direct Competitive PCR
In competitive PCR, a competitor polynucleotide (REF) is included as a reference alongside the target (denoted in the figures as WT)(
When the target and the competitor are amplified in the same PCR reaction, they compete for the primers. Since primers are consumed by each replication of a target or competitor strand, the amplification of both sequences stops as soon as the primer pool is exhausted. The quantity of each amplification product at the end of the reaction depends on the relative starting quantity of the two targets. This is reflected in the resulting fluorescent signal (
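The primer-limited competition described here can be sketched as a toy simulation. This is a minimal illustration, not the amplification model used in this work: it assumes both sequences amplify with the same per-cycle efficiency, that each new strand consumes one primer from a single shared pool, and that all quantities are in arbitrary copy-number units. The function name and the numbers are hypothetical.

```python
def competitive_pcr(target0, competitor0, primers0, cycles=40, efficiency=1.0):
    """Amplify two templates that draw on one shared primer pool.

    Assumption for illustration: each new strand consumes one primer,
    so growth of both templates stops once the pool is empty.
    """
    target, competitor, primers = float(target0), float(competitor0), float(primers0)
    for _ in range(cycles):
        # Copies each template would make this cycle if primers were unlimited.
        want_target = efficiency * target
        want_competitor = efficiency * competitor
        demand = want_target + want_competitor
        if demand <= 0 or primers <= 0:
            break
        # Scale both down proportionally when the pool cannot cover the demand.
        scale = min(1.0, primers / demand)
        target += scale * want_target
        competitor += scale * want_competitor
        primers -= scale * demand
    return target, competitor


# Same competitor and primer pool, three different starting amounts of target.
for start in (1e2, 1e3, 1e4):
    t, c = competitive_pcr(start, 1e3, 1e9)
    print(f"target start {start:.0e}: end-point target fraction = {t / (t + c):.3f}")
```

Under these assumptions the end-point split between the two products simply mirrors their starting ratio, which is the behaviour the paragraph above describes. Sequences with unequal amplification rates would shift that split.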
Direct Competitive Amplification Networks
Now, a pair of competing targets is not much of a “network”, nor does a single gene target reflect the complexity of gene expression signatures. However, we can combine multiple competitive pairs in the same reaction, each producing HEX and FAM signals that reflect a different RNA transcript. Each competitive pair reports on how close the given gene is to its individual set point, and these signals will all simply stack on top of one another. The result is an aggregate measure of the overall similarity of all genes. Regardless of the number of genes under investigation, the difference between the total HEX intensity and the total FAM intensity integrates the information from the whole system. To illustrate why this is useful, let's look at how we can use such a network to diagnose tuberculosis by mimicking the statistical technique of logistic regression.
Case Study: Diagnosing Tuberculosis with a Direct CAN
More people die each year from tuberculosis than from any other infectious disease. 2018 saw 10 million new cases and 1.5 million deaths. Tuberculosis is particularly prevalent (and deadly) among those also infected with HIV, a population particularly difficult to diagnose with current TB tests. A gene expression signature was found in human white blood cells that can be used to diagnose TB (references 1 and 2):
(1) Kaforou, M.; Wright, V. J.; Oni, T.; French, N.; Anderson, S. T.; Bangani, N.; Banwell, C. M.; Brent, A. J.; Crampin, A. C.; Dockrell, H. M.; Eley, B.; Heyderman, R. S.; Hibberd, M. L.; Kern, F.; Langford, P. R.; Ling, L.; Mendelson, M.; Ottenhoff, T. H.; Zgambo, F.; Wilkinson, R. J.; Coin, L. J.; Levin, M. Detection of Tuberculosis in HIV-Infected and -Uninfected African Adults Using Whole Blood RNA Expression Signatures: A Case-Control Study. PLOS Medicine 2013, 10 (10), e1001538. https://doi.org/10.1371/journal.pmed.1001538.
(2) Gliddon, H. D.; Kaforou, M.; Alikian, M.; Habgood-Coote, D.; Zhou, C.; Oni, T.; Anderson, S. T.; Brent, A. J.; Crampin, A. C.; Eley, B.; Kern, F.; Langford, P. R.; Ottenhoff, T. H. M.; Hibberd, M. L.; French, N.; Wright, V. J.; Dockrell, H. M.; Coin, L. J.; Wilkinson, R. J.; Levin, M.; Consortium, on behalf of the I. Identification of Reduced Host Transcriptomic Signatures for Tuberculosis and Digital PCR-Based Validation and Quantification. bioRxiv 2019, 583674. https://doi.org/10.1101/583674.
Crucially, this test performs equally well in patients with and without HIV. However, the technology used to identify this signature—microarrays—is too cumbersome and expensive for use in the rural, poor regions of the world where such a test is needed most. A direct Competitive Amplification Network can evaluate the gene expression signature and translate the test to a rapid, inexpensive, and easy-to-use format.
Diagnosing with Statistics: Logistic Regression
To understand how we can use a CAN to diagnose TB, we first need to understand the statistical technique we are trying to mimic: logistic regression. Logistic regression models the probability of being in one group (infected with tuberculosis) compared to another (having some other disease, OD) by looking at the individual contributions of various determining factors (expression levels of various genes). It assumes that the log-odds, or relative probability, is given by a (linear) weighted sum of these factors:
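One generic way to write this relationship is sketched below. The symbols are assumptions for illustration: $x_i$ is the expression level of gene $i$ (typically log-transformed), $S_i$ is the weight assigned to that gene, $S_0$ is an intercept, and $n$ is the number of genes in the signature.

```latex
\log\frac{P(\mathrm{TB})}{P(\mathrm{OD})} = S_0 + \sum_{i=1}^{n} S_i\, x_i
```

Each term $S_i x_i$ is then the contribution of one gene to the overall log-odds.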
We can look at the contribution of individual genes to the overall classifier by finding the marginal log-odds for each (
To diagnose a patient based on logistic regression, we just add up the contribution of each individual gene. For example, a patient may have 10^3 copies of GBP6, contributing a marginal log-odds of +0.25. The same patient might have 10^4 and 10^4 copies of ARG1 and TMCC1, respectively, contributing −0.5 and −0.2. The overall log-odds of this patient having TB would be 0.25 − 0.5 − 0.2 = −0.45, so we can conclude that this patient is unlikely to have TB. Repeating this for every patient (
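The addition in the worked example above can be reproduced in a few lines. This is a minimal sketch: the three marginal log-odds values are those quoted in the example, while the decision threshold of zero (equal odds) and the function name are assumptions made for illustration.

```python
import math

# Marginal log-odds for one patient, taken from the worked example above.
marginal_log_odds = {"GBP6": 0.25, "ARG1": -0.5, "TMCC1": -0.2}


def classify(contributions, threshold=0.0):
    """Sum per-gene log-odds; return the total, P(TB) and a TB call.

    The threshold of 0.0 (equal odds) is an illustrative assumption.
    """
    total = sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-total))  # logistic function
    return total, probability, total > threshold


total, probability, is_tb = classify(marginal_log_odds)
print(f"log-odds = {total:+.2f}, P(TB) = {probability:.2f}, TB call = {is_tb}")
# prints: log-odds = -0.45, P(TB) = 0.39, TB call = False
```

A negative total corresponds to a probability below one half, which is why this patient is classed as unlikely to have TB.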
Mimicking the Statistics with a Direct CAN
We can use a direct CAN to recapitulate this statistical inference on a molecular level by designing a competitor for each of our three gene transcripts (
In order to choose an appropriate target region and design the synthetic target sequence, we use the results of logistic regression as an “objective function”: our goal is to find a pair of sequences that, when amplified together, give us an input-output response curve that approximates this objective. Thus, for each target, we try to approximate a line with the slope derived from the equation above (the respective S term) and which intercepts 0 at the mean concentration of that target observed in our data set. Using simulation, we can predict the behaviour of any two sequences amplified together, and so we can use standard curve-fitting algorithms known in the art to find the optimal parameters. In this case, those are the parameters that produce a response curve that matches the line specified above as closely as possible in the range of target concentrations observed in our dataset, then flattens as quickly as possible outside that range (See
Once suitable parameters are found, we then need to select sequences which exhibit them. Using the equations described above in the section “Testing and predicting competitor amplification behavior”, we can predict the combinations of length and GC content which provide these parameters. Note that our simulations do not include the drift term (m) or plateau term (K) found in our regression equations. This is because the simulations represent ideal behavior, and these two parameters describe deviations from that ideal. Thus, in choosing optimal length and GC content, we would seek to minimize drift and maximize the plateau, so that we select sequences as close to the ideal as possible.
It is likely that multiple sets of parameters could give nearly-optimal curves. It may be preferable that a suitable target sequence be identified a priori (due to external constraints) and its amplification parameters measured, with the curve-fitting algorithm then used to select only competitor amplification parameters which produce a nearly-optimal response when simulated along with the measured parameters. The simulation of the amplification behavior is described above; supplied with the suitable equations for simulation, the skilled person would be able to apply any of several optimization techniques and algorithms, including Gradient Descent, Stochastic Gradient Descent, and Quasi-Newton optimization, among others.
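The fitting step described in this section can be sketched as follows. The objective is built as described above: a line through zero at the mean concentration, held flat outside the observed range. The `response` function is a hypothetical saturating curve used only as a stand-in for the amplification simulation, which is not reproduced here. All numbers, parameter names and the choice of plain gradient descent with a finite-difference gradient are assumptions for illustration.

```python
import math


def objective(x, slope, x_mean, lo, hi):
    """Target response: a line through zero at x_mean, flat outside [lo, hi]."""
    clipped = min(max(x, lo), hi)
    return slope * (clipped - x_mean)


def response(x, params):
    """Hypothetical stand-in for a simulated CAN response (not the real model)."""
    amplitude, steepness, midpoint = params
    return amplitude * math.tanh(steepness * (x - midpoint))


def loss(params, xs, targets):
    """Mean squared difference between the stand-in response and the objective."""
    return sum((response(x, params) - t) ** 2 for x, t in zip(xs, targets)) / len(xs)


def gradient_descent(params, xs, targets, rate=0.05, steps=2000, eps=1e-6):
    """Plain gradient descent using a forward-difference gradient estimate."""
    params = list(params)
    for _ in range(steps):
        base = loss(params, xs, targets)
        grad = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += eps
            grad.append((loss(bumped, xs, targets) - base) / eps)
        params = [p - rate * g for p, g in zip(params, grad)]
    return params


# Illustrative numbers only: log10 copies from 2 to 6, observed range 3 to 5.
xs = [2 + 0.1 * i for i in range(41)]
targets = [objective(x, slope=0.5, x_mean=4.0, lo=3.0, hi=5.0) for x in xs]
start = [1.0, 1.0, 3.5]
fitted = gradient_descent(start, xs, targets)
print("loss before:", round(loss(start, xs, targets), 4))
print("loss after: ", round(loss(fitted, xs, targets), 4))
```

In practice the stand-in would be replaced by the simulated response of a target and competitor pair, and the fitted values would be the amplification parameters that the competitor sequence then has to exhibit.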
Limitations of Direct CANs
The direct networks presented above have two main drawbacks. First, they will get expensive quickly for larger gene signatures, since at least one if not two probes need to be designed for each transcript targeted. Economies of scale for DNA sequences are quite favourable for scale-up, but at a development scale each fluorescently-labelled probe costs ~£200 (for context, each primer costs ~£2 and each synthetic target ~£20). For gene signatures with 20-50 targets, iterating on sequence designs becomes prohibitively expensive. Second, direct CANs are somewhat limited in the response curves attainable. To address these issues, indirect CANs provide similar functionality at a more or less fixed cost regardless of the number of genes under investigation. Indirect competition also opens the possibility of higher-order networks capable of complex, non-linear analysis of multiple targets simultaneously. Finally, redundant targeting allows additional flexibility for all CAN architectures.
Indirect Competitive PCR
Instead of direct competition between a probed target and a probed competitor, an unprobed target can simply mediate the competition between competitor polynucleotides. Because both primers are necessary for exponential amplification of a given target, replication can be arrested by depletion of only one primer. So, we can design a synthetic target, REFH, that shares one primer with a natural sequence, WT, and its second primer with a second synthetic target, REFF (
Case Study: Diagnosing Cancer with an Indirect CAN
A promising avenue of early cancer diagnosis or monitoring of cancer treatment is through detection of tumour-derived DNA in the bloodstream (circulating tumour DNA, ctDNA), chromosomal fragments shed by the cells as they die. This is distinguishable from the ordinary milieu of cell-free DNA (cfDNA) through specific mutations, such as single nucleotide polymorphisms (SNPs) or insertion-deletions (indels). By detecting known pathogenic mutations, we may be able to diagnose someone before the tumour shows up on a scan. We can also look for ctDNA after or during treatment, to see if the patient is responding or if the cancer has come back. The difficulty is that these variants are much lower in concentration than the corresponding natural sequence. Furthermore, a single base change is hard to differentiate using ordinary PCR (indels are easier, so we'll focus on SNPs with the understanding that whatever works for SNPs will work even better for indels). While in some cases specific mutations can inform treatment decisions (namely targeted treatment susceptibility/resistance), in general the total ctDNA burden is all that is needed, and any of numerous mutations can act as proxies for that total, making this a good application for CANs.
To use CANs for ctDNA detection, we will adapt Blocker Displacement Amplification (Wu et al., 2017), a published approach for preferentially amplifying variant alleles over the corresponding wild-type (
Higher-Order Competitive Networks
The flexibility of the indirect CAN allows incorporation of multiple natural targets in a single closed network, enabling non-linear analysis of target combinations. For example,
The CANs shown above are limited in their response to a given target; the output is always monotonic or at least unimodal with regard to the target concentration. However, we can further exploit the additive nature of fluorescent signals by redundantly targeting a single sequence. Gene transcripts are typically several thousand nucleotides long, while only 50-300 nucleotides are needed for a PCR target. Accordingly, we can design independent CANs each targeting a different region of the same sequence. Their outputs will stack, producing powerful emergent behaviour. From a mathematical point of view, the individual networks become a library of “basis functions” from which theoretically any response relationship can be built, limited only by the number of target regions available within a given sequence.
Case Study: Dilution-Agnostic Comparator with a Redundant CAN
Biosensing faces a bit of a paradox: variation in the concentration of a biomolecule is used to infer disease state, yet there are many non-biological reasons a sample could vary in the concentration of targets. The patient could be more or less hydrated than expected, the sample volume could be inaccurate, or simple statistics could lead to variation in the number of cells obtained. A classic approach to accommodate these uncertainties is the use of an internal standard, something innate to the sample that shouldn't vary with disease condition. For analysis of RNA, this internal standard is typically a “housekeeping” gene, a transcript so fundamental to growth of a cell (controlling cytoskeleton or cell membrane metabolism, for example) that its concentration reflects only the number of cells analysed rather than their state. The concentration of truly interesting gene transcripts can be compared to the housekeeping gene(s) to produce a more reliable measure of their deviation from normality. Typically, these are either separate PCR reactions performed in parallel or multiple probes within a single reaction; in either case, this becomes very time-, resource-, and sample-intensive if, say, 16 genes of interest and 5 housekeeping genes are needed, with extensive post-processing required. Redundant targeting of indirect CANs offers a way to perform this calculation explicitly, on the molecular level, so the reported signal reflects the relative concentrations of two genes regardless of their absolute concentrations (
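The calculation that such a comparator is meant to perform can be written out numerically. The sketch below shows only the arithmetic and does not model the CAN itself: because dilution scales both transcripts by the same factor, the ratio of a gene of interest to a housekeeping gene is unchanged. The copy numbers and the function name are illustrative assumptions.

```python
import math


def log_ratio(gene_copies, housekeeping_copies):
    """log10 of a gene of interest relative to a housekeeping gene."""
    return math.log10(gene_copies / housekeeping_copies)


gene, housekeeping = 4.0e4, 1.0e5  # illustrative copy numbers only
for dilution in (1.0, 0.5, 0.1):
    # Dilution scales both transcripts by the same factor, so it cancels.
    ratio = log_ratio(gene * dilution, housekeeping * dilution)
    print(f"dilution {dilution}: log10 ratio = {ratio:.4f}")
```

Every dilution yields the same log-ratio, which is the sense in which the readout is dilution-agnostic.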
Further Applications
Two and a half decades of gene expression analysis have identified dozens or even hundreds of potentially diagnostic expression signatures. RT-PCR, Nanostring, and RNA-seq analyses have similarly produced useful insight. In addition to the signatures described above, the following reports present promising candidates for adaptation of the CAN platform:
The CAN platform could also solve a problem in bioprocessing, the industrial use of synthetic cells to produce a product such as a drug or to break down a material, such as petrochemicals or greenhouse gases. This involves coordination of several synthetic and natural gene systems and may involve more than one population of engineered cells grown simultaneously. Currently, system performance is verified through RNA-seq or microarrays, which are expensive and time consuming. Alternatively, engineers include genes that produce a “reporter” in conjunction with the desired product. However, doing so consumes raw materials that otherwise could be used for production of the desired compound while putting greater stress and uncertainty on the engineered cells. The CAN architecture would provide a way to get a snapshot of the transcriptional activity of all relevant genes simultaneously. A CAN could be designed to produce one colour if all genes are operating within a pre-specified window, but if any gene is above or below that window a different colour is produced.
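The decision rule in the last sentence can be stated as a short check. This sketch expresses only the logic of the desired readout in software, with no claim about how a CAN would implement it on a molecular level. The gene names, windows and colour labels are hypothetical.

```python
def can_readout(expression, windows):
    """Return one colour if every gene is inside its window, another otherwise.

    `expression` maps gene name to measured level; `windows` maps gene name
    to a (low, high) tuple. All names and values are illustrative.
    """
    out_of_window = [
        gene
        for gene, level in expression.items()
        if not (windows[gene][0] <= level <= windows[gene][1])
    ]
    colour = "colour A" if not out_of_window else "colour B"
    return colour, out_of_window


windows = {"geneA": (4.0, 6.0), "geneB": (3.0, 4.0)}
print(can_readout({"geneA": 5.0, "geneB": 3.5}, windows))
print(can_readout({"geneA": 5.0, "geneB": 2.0}, windows))
```

The second call reports the alternative colour because one gene falls below its window, even though the other is within range.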
Competitive Amplification Networks offer the potential to perform powerful calculations on a molecular level, explicitly performing analyte pattern recognition within a biosensor architecture. By leveraging the ubiquitous DNA amplification technology PCR, the CAN platform is fast, inexpensive, and, above all, easy to use. The data-driven nature of the technology is both its strength and its weakness: an adequate dataset is all that's necessary to design and test a CAN, but acquiring a sufficiently robust dataset may be a lengthy challenge. Fortunately, extensive literature exists on the topic, much with open-access data. The results here only begin to describe the potential of the technology; more work is needed to establish rules and algorithms for network design, target sequence selection, and experimental validation. As it is early stages yet, creating a CAN is a very manual process, but the whole process could be simplified through integration of modelling and automated instrumentation to iterate on the cycle of i) design a network for an application, ii) select competitor and primer sequences, iii) robotically assemble the competitors from building block oligos, iv) run an appropriate number of reactions, v) compare the results against the predicted response, vi) adjust the network or sequence design. Such a closed-loop development system will allow rapid deployment of the CAN platform for a wide range of biosensing applications.
Number | Date | Country | Kind |
---|---|---|---|
2015943.0 | Oct 2020 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2021/052594 | 10/7/2021 | WO |