The invention is in the field of analytical technology and relates to an improved procedure for determining the concentration or activity of an analyte in a sample. Specifically the invention provides an automated algorithm for the quality control of quantitative PCR (qPCR) assays.
The Polymerase Chain Reaction (PCR), developed in 1984 by Kary Mullis, is a means to amplify the amount of DNA or mRNA fragments, e.g. of a specific gene, in a (patient) sample. If mRNA fragments are to be amplified, they first have to be transcribed into cDNA in a reverse transcription (RT) step. In this case, the reaction is called RT-PCR.
The PCR takes place in small reaction tubes in a thermal cycler. The reaction mix consists of
The PCR process is a sequence of ~20-50 cycles, each of them consisting of the following three steps:
If the number of target molecules in the reaction mix at the beginning of the reaction is c0, the number of target molecules after n cycles is
c_n = c_0·(1+E)^n (1)
or equivalently
log_(1+E)(c_n) = n + log_(1+E)(c_0) (2)
where E denotes the efficiency of the PCR reaction.
Ideally the DNA is duplicated in each cycle and the efficiency equals one. Plotting log_(1+E)(c_n) against the cycle number would theoretically give a straight line with slope one and intercept log_(1+E)(c_0). Two different values of c_0 would lead to parallel straight lines, the line belonging to the higher value of c_0 lying above the one belonging to the lower value. In reality this idealization does not hold, because at the beginning and at the end of the PCR the process is inhibited for different reasons. During the first cycles there is only a small amount of template molecules, so the portion of the fluorescence signal caused by the template molecules is negligible. The end of the reaction is characterized by decreasing reactant concentrations, so the reaction rate saturates; a further problem that can occur is the deterioration of reactants.
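The ideal amplification law of equations (1) and (2) can be sketched numerically. This is an illustrative aid only, assuming a constant efficiency E over all cycles:

```python
# Sketch of the ideal amplification law (equations (1) and (2)),
# assuming a constant efficiency E over all cycles.
import math

def copies_after(c0: float, E: float, n: int) -> float:
    """Number of target molecules after n cycles, c_n = c0 * (1 + E)**n."""
    return c0 * (1.0 + E) ** n

# With perfect efficiency (E = 1) the DNA doubles each cycle:
c0 = 100.0
cn = copies_after(c0, 1.0, 10)   # 100 * 2**10

# In log base (1+E), the curve is a straight line with slope 1:
E, n = 1.0, 10
lhs = math.log(copies_after(c0, E, n), 1.0 + E)
rhs = n + math.log(c0, 1.0 + E)
assert abs(lhs - rhs) < 1e-9
```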
The PCR can also be used to quantify the amount of DNA or mRNA fragments in a sample. In this case, a real-time Polymerase Chain Reaction (qPCR) has to be carried out which relies on the basic PCR. It takes place in a TaqMan (Applied Biosystems) or MX3005 (Stratagene), for example.
The qPCR follows the same pattern as the basic PCR except that a probe has to be added to the reaction mix. This probe is labeled with two fluorophores, a reporter and a quencher, and has to be designed in such a way that it binds to the target DNA strands. This binding takes place during the primer annealing phase. If the probe is excited at a specific wavelength during this phase, the fluorescence of the reporter is suppressed due to the spatial vicinity of reporter dye and quencher dye, as the reporter transfers its energy to the quencher. The underlying concept of this energy transfer is called FRET (fluorescence resonance energy transfer). During the elongation phase the polymerase cleaves the probe, which is thereby released. Thus the distance between reporter dye and quencher dye increases and the reporter begins to fluoresce. The higher the number of templates, the higher the number of released reporter molecules; therefore the intensity of the fluorescence is a measure of the initial number of target molecules.
Plotting the fluorescence intensity after n cycles against the cycle number n leads to a so-called sigmoid function which consists of three different parts:
To quantify the amount of DNA or mRNA fragments in a sample, one compares the fluorescence intensity to a pre-defined threshold and determines the cycle at which this threshold is reached for the first time (linear interpolation is used between subsequent cycles to obtain fractional cycle numbers). The (fractional) cycle at which the threshold is reached for the first time is called the Ct value. The earlier this takes place, the higher the initial amount of target molecules in the reaction mix. It is important that the threshold is chosen in such a way that the Ct value is obtained during the exponential growth phase.
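The threshold-crossing computation described above can be sketched as follows. This is an illustrative simplification, not the SDS implementation:

```python
# Illustrative sketch (not the SDS implementation): determine a Ct value
# by linear interpolation between the two cycles bracketing the threshold.
def ct_value(fluorescence, threshold):
    """Return the fractional cycle at which `fluorescence` (one value per
    cycle, cycle 1 first) first reaches `threshold`, or None ("Undetermined")."""
    for n in range(1, len(fluorescence)):
        lo, hi = fluorescence[n - 1], fluorescence[n]
        if lo < threshold <= hi:
            # linear interpolation between cycles n and n+1
            return n + (threshold - lo) / (hi - lo)
    return None  # threshold never reached

# Example: threshold 0.6 is crossed between cycles 3 and 4
print(ct_value([0.1, 0.2, 0.4, 0.8, 1.6], 0.6))  # 3.5
```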
When using a TaqMan or an MX3005 to carry out the qPCR, the determination of the Ct values is done automatically by the associated software (SDS and MX Pro, respectively), except that the operator has to choose the threshold to be used when working with the TaqMan. If this threshold is not reached by the end of the reaction, the Ct value is called "Undetermined" (SDS software language). Due to contaminations within the reaction mix, failures of the laser or of the photo detector which measures the fluorescence intensity, or other problems occurring during the reaction, some qPCR curves show a behavior which deviates from the common sigmoid shape. These curves have to be filtered out by visual inspection by the operator, a process which is time-consuming and subjective.
The concept described in the second and third chapter of this document is a means to automate on the one hand the determination of Cq values and on the other hand the quality control of qPCR reactions carried out on any appropriate instrument. It has only been tested for reactions carried out on a TaqMan up to now. In this context, the term Cq value denotes—like the Ct value—a value which provides information about the initial amount of DNA or mRNA fragments in the sample and about the validity of a curve.
1) A Flexible Sigmoid Function of Determinate Growth
The paper addresses the determination of a function which is well suited to describe the sigmoid pattern of determinate growth in agricultural crops. A new function, the so-called "beta growth function", characterized by the three parameters t_m, t_e and w_max, is proposed, and its advantages (and disadvantages) in comparison to the logistic function, Richards' function, the Gompertz function, the Weibull function and two expolinear equations are worked out.
Difference to Our Invention:
No regularization was used to fit the model to the data.
No constraint like the linearityNorm was introduced to ensure robust fitting in case of zero target molecules.
No optimization of starting point for fitting routine is described.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
Agricultural crops are investigated, not fluorescence data of qPCR reactions.
2) A new method for robust quantitative and qualitative analysis of real-time PCR
The paper presents a new method for the analysis of real-time PCR data, the so-called “maxRatio method”. This method contains the following steps:
1) Computation of a ratio of consecutive background-subtracted fluorescence values, ratio(n) = F(n)/F(n−1), for each cycle n.
2) Determination of the maximal value of the mapping ratio(n), this value is called MR.
3) Determination of the fractional cycle n for which ratio(n)=MR holds. This fractional cycle is called FCN.
4) Determination of the position of the point (FCN, MR) in a plot which depicts the MR value against the FCN value for a set of reference curves and thus classification into “normal reactive” (large MR, large FCN), “abnormal reactive” (medium MR, medium FCN) and “nonreactive” (small MR, small FCN).
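The steps above can be sketched as follows. The ratio definition ratio(n) = F(n)/F(n−1) of consecutive fluorescence values is an assumption taken from the maxRatio literature, and the fractional-cycle interpolation for FCN is omitted for brevity:

```python
# Hedged sketch of the maxRatio steps summarized above. The ratio
# definition ratio(n) = F(n)/F(n-1) is an assumption; the published
# method additionally interpolates FCN to a fractional cycle.
def max_ratio(F):
    """Return (MR, FCN) for a fluorescence curve F (one value per cycle)."""
    ratios = {n: F[n] / F[n - 1] for n in range(1, len(F)) if F[n - 1] > 0}
    fcn = max(ratios, key=ratios.get)   # cycle index with maximal ratio
    return ratios[fcn], fcn

MR, FCN = max_ratio([1.0, 1.1, 1.5, 3.0, 5.0, 6.0, 6.2])
# the largest cycle-to-cycle ratio occurs at index 3 (3.0 / 1.5 = 2.0)
```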
Difference to Our Invention:
No model is fitted to the fluorescence data; therefore neither regularization nor constraints nor an optimization of the starting value is addressed.
3) A new real-time PCR method to overcome significant quantitative inaccuracy due to slight amplification inhibition
The paper compares four different methods (Ct method, second derivative (Cp) method, sigmoidal curve fitting (SCF) method, and Cy0 method) for the analysis of qPCR data. Of these, the Cy0 method (based on a nonlinear regression of Richards' equation) is the most accurate and precise method even in suboptimal amplification conditions.
Difference to Our Invention:
No regularization was used to fit the model to the fluorescence data.
No constraint like the linearityNorm was introduced to ensure robust fitting in case of zero target molecules.
No optimization of starting point for fitting routine is described.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
4) Quantitative real-time RT-PCR based transcriptomics: Improvement of evaluation methods
Difference to Our Invention:
No regularization was used to fit the model to the fluorescence data.
No constraint like the linearityNorm was introduced to ensure robust fitting in case of zero target molecules.
No optimization of starting point for fitting routine is described.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
5) Determination of stable housekeeping genes, differentially regulated target genes and sample integrity: BestKeeper—Excel-based tool using pair-wise correlations
The paper presents a software tool called "BestKeeper" intended to enhance standardization of RNA quantification results (for example, results gained by an RT-PCR). The tool chooses the best suited standards out of ten candidates and combines them into an index (as geometric mean) whose correlation to the expression levels of target genes can be computed. Cp (crossing point) values, obtained by the "second derivative maximum" method as computed by the LightCycler software, or Ct values are used.
Difference to Our Invention:
No model is fitted to the fluorescence data; therefore neither regularization nor constraints nor an optimization of the starting value is addressed.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
6) Inhibition of real-time RT-PCR quantification due to tissue-specific contaminants
The paper works on the influence of unknown tissue-specific factors on amplification kinetics. Various methods of Cp value acquisition (first derivative and second derivative maximum of the four-parameter sigmoid model, “fit point method” and “second derivative maximum method” computed by the LightCycler software, “Taqman threshold level” computation method) are analyzed for this purpose.
Difference to Our Invention:
No regularization was used to fit the model to the fluorescence data.
No constraint like the linearityNorm was introduced to ensure robust fitting in case of zero target molecules.
No optimization of starting point for fitting routine is described.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
7) Standardized determination of real-time PCR efficiency from a single reaction set-up
The paper works on a computing method for the estimation of real-time PCR amplification efficiency. Instead of using serial dilution steps the following procedure is applied:
Difference to Our Invention:
No regularization was used to fit the model to the fluorescence data.
No constraint like the linearityNorm was introduced to ensure robust fitting in case of zero target molecules.
No optimization of starting point for fitting routine is described.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
8) Tissue-specific expression pattern of bovine prion gene: quantification using real-time RT-PCR
The quantification of expression of bovine prion (proteinaceous infectious particle) in different organs via real-time RT-PCR is described and it is shown how the organs are involved in pathogenesis. Quantification is carried out via the “Second Derivative Maximum Method” as being implemented in the LightCycler software (second derivative maximum within exponential phase of amplification curve is linearly related to a starting concentration of the template DNA).
Difference to Our Invention:
No model is fitted to the fluorescence data; therefore neither regularization nor constraints nor an optimization of the starting value is addressed.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
9) Improving quantitative real-time RT-PCR reproducibility by boosting primer-linked amplification efficiency
The paper works on the impact of primer selection on the performance of real-time PCR reactions. For this purpose, a four-parametric sigmoid model is fitted to the fluorescence data. Via ANOVA it is shown that most of the variance between b (slope) parameters (which is a measure of the efficiency of the primer itself and of the variance of amplification efficiency) results from the use of different primers, not different tissues. Defining Cp as the maximum of the second derivative of the four-parametric sigmoid model used leads to Cp = n0 − 1.317·b. Since the Cp value is linearly related to b, the variability in Cp values is linearly related to the amplification efficiency. It is mentioned that other sigmoid models can be used, too.
Difference to Our Invention:
No regularization was used to fit the model to the fluorescence data.
No constraint like the linearityNorm was introduced to ensure robust fitting in case of zero target molecules.
No optimization of starting point for fitting routine is described.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
10) Inhibition of Taq Polymerase and MMLV Reverse Transcriptase by Tea Polyphenols (+)-Catechin and (−)-Epigallocatechin-3-Gallate (EGCG)
The effect of catechin and EGCG on the performance of (RT-) PCR reactions is investigated. This is done by fitting a mathematical model (the four-parametric sigmoid model) to the data and comparing the resulting fitting parameters for reactions with and without catechin and EGCG (ANOVA). Quantification is achieved by computing the maximum of the second derivative of the four-parametric sigmoid model or by using the “second derivative maximum method” implemented in the LightCycler software.
Difference to Our Invention:
No regularization was used to fit the model to the fluorescence data.
No constraint like the linearityNorm was introduced to ensure robust fitting in case of zero target molecules.
No optimization of starting point for fitting routine is described.
No rules are proposed to decide if a curve really shows a sigmoid pattern or not.
11) Genomic Health Patent (WO 2006/014509 A2)
Difference to Our Invention:
No Gompertz function is used to fit a model to the normalized fluorescence data of a qPCR curve. Instead, a linear regression analysis is performed in adjacent (possibly overlapping) regions of the data. A score relying on these regression data represents the quality of the well and determines its Pass/Fail status.
The computation of a quantification value does not rely on “overall estimated” parameters “inflexion point” and “slope”, but on a localized (for example quadratic) regression of the data in a predefined region and subsequent comparison with a threshold.
No description of
12) Automated Quality Control Method and System for Genetic Analysis
Difference to Our Invention:
Quality control metrics are used to determine the status of a well (e.g. empty well), but these metrics are not derived by fitting any models to the fluorescence data. The only exception is the derivation of the metric which tests if genetic material and probe dyes have been amplified. This metric fits a straight line to the amplification curve in order to get a baseline amplification curve.
13) WO 2010/025985
A linear regression is performed on the linear range of the fluorescence data of an RT-PCR reaction. Those values lying outside the linear range are compared to a threshold to determine if there is a signal at all. Afterwards, a (Gompertz) model is fitted to the background-subtracted data and rules relying on the fitting parameters are defined to classify a curve as valid or invalid. Quantification is achieved by using a cp value (maximum of second derivative) or a by value (intersection between background and tangent in the inflexion point).
Difference to Our Invention:
No regularization was used to fit the model to the fluorescence data.
No constraint like the linearityNorm was introduced to ensure robust fitting in case of zero target molecules.
No optimization of starting point for fitting routine is described.
The rules defined there are only univariate; in our invention there are also bivariate rules.
The invention provides an automated algorithm for the quality control of (RT-)qPCR reactions. Plotting the fluorescence intensity of a reporter dye divided by the fluorescence intensity of a passive reference dye against the cycle number leads to a so-called sigmoid function which is characterized by a background phase, an exponential growth phase and a plateau phase. Since the fluorescence intensity as a function of cycles relates to the initial number of template molecules in the sample, qPCR curves can be used to quantify the amount of RNA or DNA fragments in the sample by determination of a so-called Cq value. Due to contaminations within the reaction mix, unintended chemical reactions within wells, failures of the optical system of the PCR device which measures the fluorescence intensity, or other problems occurring during the reaction (e.g. air bubbles), some qPCR curves show a behavior which deviates from the common sigmoid shape. Information gained from these curves should not be used for further analyses. Therefore, quality control of qPCR curves consists of three steps:
1) Determination if a curve is valid at all.
2) Determination if the initial number of template molecules is zero or very small (below some LOD).
3) Determination of a Cq value which relates to the initial number of template molecules for all curves which are valid.
A mathematical model (based on the Gompertz function, which is suitable to describe sigmoid curves) is fitted to the data under nonlinear constraints and regularization. In detail, values for the parameters y0, r, a, b and n0 are chosen in such a manner that the deviation (normalized sum of squared errors) between the data and the model

f(x) = y0 + r·x + a·exp(−exp(−(x−n0)/b))

is minimized, where x denotes the cycle and f(x) is fitted to the normalized fluorescence signal. Parameter b is forced to be positive (by considering exp(beta) instead of b) and a is regularized to be approximately 3.9110. Additionally, parameter combinations for which the largest absolute difference between the exponential term on the right-hand side of the equation (called Gompertz term) and a straight line gained by a linear regression of the Gompertz term (called linearityNorm) is smaller than 10^-3 are forbidden by a constraint. The parameters n0 and b are used for the definition of a so-called AIP value, which is a measure of the amount of RNA fragments in the sample. The optimal AIP value was found to be
AIP = n0 − 0.72·b
The six features (y0, r, a, b, n0, linearityNorm) and the AIP value are used to define a set of rules which identify a curve as “Numeric” (quantification by AIP value), “Invalid” (curve is not reliable and should be ignored for further processing) or “Undetected” (initial number of molecules zero or very small).
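The model and the AIP score described above can be sketched as follows; the parameter values used are illustrative only:

```python
# Minimal sketch of the extended Gompertz model and the AIP score as
# described above. The parameter values below are illustrative only.
import math

def gompertz(x, y0, r, a, b, n0):
    """f(x) = y0 + r*x + a*exp(-exp(-(x - n0)/b)) -- extended Gompertz."""
    return y0 + r * x + a * math.exp(-math.exp(-(x - n0) / b))

def aip(n0, b):
    """AIP = n0 - 0.72 * b, a measure of the initial template amount."""
    return n0 - 0.72 * b

params = dict(y0=0.5, r=0.001, a=3.91, b=1.8, n0=25.0)
plateau = gompertz(60, **params)          # close to y0 + r*60 + a
score = aip(params["n0"], params["b"])    # 25.0 - 0.72*1.8 = 23.704
```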
There are several inventive steps that have to be combined to yield the advantages:
We used regularization of parameter a to ensure robustness of the fitting of the Gompertz model.
We introduced the linearityNorm constraint to ensure robust fitting in case of zero target molecules.
We optimized the starting point and customized the numeric optimization algorithm to yield meaningful parameter values (intended local minimum of objective function).
Based on real data we defined rules that rely on the features y0, r, a, b, n0, linearityNorm, AIP and that classify fluorescence curves into “Numeric”, “Invalid” or “Undetected”.
The first advantage which arises from the present invention is the objectivity and reproducibility of the method. This is especially advantageous for curves for which the decision between “Numeric” and “Invalid” is questionable and differs when asking different operators. In addition, tests have shown that when analyzing triplicate measurements the number of outliers is smaller when using the curve processor algorithm instead of Ct values. Furthermore, Cq values do not depend on the choice of a certain threshold.
The second advantage is saving of work time which currently is necessary to manually control fluorescence curves. Thus, costs are reduced.
The invention relates to a method for determining the concentration or activity of an analyte in a sample, the method comprising:
In this context, “extracting a score value reliably” refers to the ability to generate a similar score value in a duplicate experiment.
Said score value can be the above described Cq value.
When measuring how a signal changes over time, time can be measured as real time (in seconds, minutes or hours) or, in the case of cyclic amplification reactions such as PCR, in terms of amplification cycles.
According to an aspect of the invention the analyte is a nucleic acid, in particular DNA or RNA, in particular mRNA.
According to an aspect of the invention the analyte-dependent amplification reaction is a PCR reaction, in particular a Reverse Transcriptase PCR (RT-PCR) reaction.
According to an aspect of the invention the amplification of the analyte is detectable by a fluorescence or optical signal.
According to an aspect of the invention an absolute concentration can be determined by use of an internal standard of known concentration.
According to an aspect of the invention said regularization is performed on parameter a. In particular, said regularization may be realized by a summand additional to the objective function used for fitting, where the summand is a weighted square of the z-transformed parameter a. Parameters of the z-transformation, i.e. mean and standard deviation, are empirical estimates from samples of signal curves showing saturation within the observed time interval which were known to be fitted robustly and with sufficient confidence.
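The regularization summand described above can be sketched as follows. The values of mean_a, sd_a and the weight are hypothetical placeholders; in the invention they are estimated from well-behaved training curves:

```python
# Sketch of the regularization summand on parameter a described above:
# a weighted square of the z-transformed a is added to the least-squares
# objective. mean_a, sd_a and weight are hypothetical placeholders; the
# invention estimates them empirically from robustly fitted curves.
def regularized_objective(sse, a, mean_a=3.91, sd_a=1.0, weight=1.0):
    """Objective = normalized SSE + weight * ((a - mean_a)/sd_a)**2."""
    z = (a - mean_a) / sd_a
    return sse + weight * z * z

# A fit whose pedestal height a drifts far from the empirical mean is
# penalized, pulling a towards ~3.91 when the data underdetermine it:
print(regularized_objective(0.10, a=3.91))  # 0.10 (no penalty)
print(regularized_objective(0.10, a=10.0))  # 0.10 + 6.09**2, about 37.19
```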
According to an aspect of the invention parameters are constrained during fitting to ensure robust and confident estimation of parameters for curves showing no amplification behavior or amplification behavior in the very end only. Constraints may be uni- or multivariate, linear or non-linear. In particular, a constraint may be used comprising the following steps (linearityNorm):
(i) A linear model is fitted to the extended Gompertz model on the observed time interval. (Said extended Gompertz model is defined by some parameter set which may not be optimal.)
(ii) The deviation between the linear and the extended Gompertz model is calculated using some mathematical norm (e.g. Euclidean, Manhattan or max-norm) and based on the observed time interval.
(iii) Said deviation is compared to a suitable threshold such that a parameter set of the extended Gompertz model is said to be allowed if said deviation is above said threshold and the parameter set is said to be forbidden if said deviation is below said threshold.
Alternatively, instead of the extended Gompertz model one can use the Gompertz core only, which is defined as exp(−exp(−(x−n0)/b)).
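Steps (i)-(iii) above can be sketched as follows, here applied to the Gompertz core. The choice of the max-norm and the use of numpy are assumptions for illustration:

```python
# Sketch of the linearityNorm constraint, steps (i)-(iii) above, using the
# max-norm and a plain least-squares line fit (numpy assumed available).
import numpy as np

def linearity_norm(b, n0, x):
    """Max deviation between the Gompertz core and its best linear fit."""
    core = np.exp(-np.exp(-(x - n0) / b))           # Gompertz core on grid x
    slope, intercept = np.polyfit(x, core, 1)       # step (i): linear fit
    return np.max(np.abs(core - (slope * x + intercept)))  # step (ii)

x = np.arange(1, 41, dtype=float)
# step (iii): parameter sets with deviation below the threshold (10**-3
# in the description above) are forbidden, because the "sigmoid" is then
# indistinguishable from a straight line (no amplification behavior).
allowed = linearity_norm(b=1.8, n0=25.0, x=x) > 1e-3
```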
According to an aspect of the invention the fitting of the extended Gompertz model is realized by a gradient-based or local optimization algorithm and wherein the starting point for said optimization algorithm and its configuration is chosen such that the resulting local optimum corresponds to a fitted model, for which the technical interpretation of the curve corresponds to the meaning of the parameters: parameters y0 and r describe the beginning of the curve (background), parameters n0 and b describe the time point and velocity—respectively—of the amplification growth, and parameter a describes the height of the saturation level above background. In particular, parameters b and a must be positive.
According to an aspect of the invention the model fit is realized using a Euclidean distance measure, and wherein the optimization is nested by separating parameters: For fixed parameters b and n0 parameters y0, r, and a are optimized analytically by linear algebra operations since the objective function (including regularization) is a quadratic form of these parameters; and parameters b and n0 are optimized non-linearly in an outer loop. This approach is advantageous, because it needs fewer iterations, it is more robust, and starting values have to be defined only for parameters b and n0.
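The nested (separable) optimization described above can be sketched as follows. For fixed (b, n0) the model is linear in (y0, r, a) and solved analytically; the outer search is shown here as a coarse grid scan standing in for the nonlinear optimizer, and regularization is omitted for brevity:

```python
# Sketch of the nested least-squares idea above: for fixed (b, n0) the
# model y0 + r*x + a*g(x) is linear in (y0, r, a) and solved analytically;
# the outer search over (b, n0) is a coarse grid scan standing in for the
# nonlinear optimizer of the invention (regularization omitted).
import numpy as np

def inner_fit(x, y, b, n0):
    g = np.exp(-np.exp(-(x - n0) / b))              # Gompertz core
    A = np.column_stack([np.ones_like(x), x, g])    # columns: y0, r, a
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)   # analytic inner solve
    sse = float(np.sum((A @ theta - y) ** 2))
    return theta, sse

def fit(x, y, b_grid, n0_grid):
    best = min(((b, n0) for b in b_grid for n0 in n0_grid),
               key=lambda p: inner_fit(x, y, *p)[1])
    theta, sse = inner_fit(x, y, *best)
    return theta, best, sse

x = np.arange(1.0, 41.0)
true = 0.5 + 0.002 * x + 4.0 * np.exp(-np.exp(-(x - 25.0) / 2.0))
theta, (b, n0), sse = fit(x, true, np.arange(1.0, 4.1, 0.5),
                          np.arange(20.0, 30.1, 1.0))
# recovers b = 2.0, n0 = 25.0 and (y0, r, a) close to (0.5, 0.002, 4.0)
```

Only (b, n0) need starting values or a search grid, which mirrors the stated advantage of the nested approach.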
According to an aspect of the invention said quality control classification is realized by a decision tree, wherein each decision is based on at least one feature from the following list: said parameters (y0, r, a, b, n0), said score, a goodness-of-fit measure, the times of observation (in particular the bound of the interval) and features from constraints according to claims 8 or 9 if used. Each decision is derived from empirical training data by a data-driven method, wherein training curves are classified into said quality control classes by manual inspection, commercially available software or a combination of both.
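A rule-based classification of the kind described above can be sketched as follows. The specific rules and thresholds below are invented placeholders; the invention derives its decisions from empirical training data:

```python
# Hedged sketch of a rule-based classification into "Numeric", "Invalid"
# and "Undetected" as described above. The rules and thresholds are
# invented placeholders, not the empirically trained decision tree.
def classify(features):
    """features: dict with keys y0, r, a, b, n0, linearityNorm, aip, sse."""
    if features["linearityNorm"] < 1e-3:
        return "Undetected"            # curve indistinguishable from a line
    if features["sse"] > 0.05 or features["b"] <= 0:
        return "Invalid"               # poor fit or implausible parameters
    return "Numeric"                   # quantify via the AIP value

print(classify(dict(y0=0.5, r=0.001, a=3.9, b=1.8, n0=25.0,
                    linearityNorm=0.2, aip=23.7, sse=0.004)))  # Numeric
```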
According to an aspect of the invention said observed time interval may be restricted previous to described calculations in order to eliminate measurement outliers or parts of the curve showing behavior deviating from typical amplification behavior.
According to an aspect of the invention said decision tree is degenerated to the following linear list of rules:
The invention further relates to an apparatus which is capable of automatically carrying out the method according to any of claims 1 to 14 for determining the activity or concentration of an analyte, comprising
The invention further relates to a computer program product for carrying out the method according to the invention for determining the activity or concentration of an analyte, comprising:
In the following, the invention is described in exemplary fashion using the illustrative example of a quantitative RT-PCR (RT-qPCR) as amplification reaction. The method of invention may be applied to other amplification reactions yielding sigmoid growth curves.
The fluorescence data gained during an (RT-)qPCR reaction will be described, and it will be explained how they are normalized in order to obtain so-called Rn values which are used for further processing. Afterwards it will be illustrated how outliers are detected and eliminated from further analysis. The Gompertz function will be presented as an appropriate mathematical model to describe the curvature of the Rn values, and it will be explained why concepts like regularization and the definition of constraints are necessary to achieve a good fit. It will be pointed out that there are many examples of curves which show an awkward behavior in the first cycles, which makes it necessary to leave an appropriate number of cycles out during the optimization routine. Technical details of this routine will be explained afterwards, and it will be emphasized how important the choice of a skilled starting value is to force the routine to end up in the right minimum. Finally it will be explained how the Cq value (score value) is defined and which rules are necessary to distinguish valid curves from curves with very low or zero target concentration and from curves with an awkward behavior which cannot be trusted and have to be excluded from analysis.
Reference is made to the attached Figures.
As described above, plotting the fluorescence data of a reporter dye gained during an (RT-) qPCR reaction against the cycle number leads to a so-called sigmoid graph characterized by three different phases: background, exponential growth and plateau, as shown in
Normalizing these data by the fluorescence data of a passive reference dye which are almost constant hardly changes the shape of the graph, only the scale of the y-axis is altered. These normalized fluorescence data are called Rn, as shown in
A question that arises is how to deal with repeated measurements of one cycle. Depending on the instrument that is used during the reaction and on the settings chosen by the user, the data arising from an experiment can differ in the number of cycles and the number of measurements per cycle. For example, when using the TaqMan and the SDS software, three (or more) measurements are recorded for each of the 40 cycles; other protocols define different measurements, e.g. one measurement per cycle only for the MX3005/MxPro™. If the repeated measurements in each cycle are close together (which means the difference between them can be explained by measurement noise only), one can simply compute the mean of the three repeats (we use the SDS software language here) to obtain one Rn value per cycle. It also has to be defined what is to be done in the presence of one or more outlier(s), as shown in
By computing the mean of the three measurements during the first cycle one would get an Rn value of approximately three for the first cycle which does not suit the remaining curve. Therefore it would be better to detect the measurement resulting in an Rn value of approximately one as outlier and leave it out of the analysis by computing the mean of the remaining two measurements during the first cycle. For this purpose, a method to detect outliers was established. This method consists of the following steps:
res_i,j = Rn_i,j − cycleMean(i).

The j-th repeat of the i-th cycle is declared an outlier if |res_i,j| > 6·√(variance).
Coding missing repeat measurements as well as detected outliers as NaN (not a number) simplifies an implementation.
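The outlier rule above can be sketched as follows. The exact variance estimator is not specified in the description; the pooled variance of all residuals is used here as an assumption:

```python
# Sketch of the repeat-outlier rule above: the j-th repeat of cycle i is
# flagged when |res_ij| = |Rn_ij - cycleMean(i)| exceeds 6*sqrt(variance).
# The pooled variance of all residuals is an assumed estimator; the
# description leaves the estimator unspecified.
import numpy as np

def clean_repeats(rn):
    """rn: (cycles, repeats) array, missing repeats coded as NaN.
    Returns one Rn value per cycle with outlier repeats NaN'ed out."""
    cycle_mean = np.nanmean(rn, axis=1, keepdims=True)
    res = rn - cycle_mean
    sd = np.sqrt(np.nanvar(res))
    cleaned = np.where(np.abs(res) > 6.0 * sd, np.nan, rn)  # outliers -> NaN
    return np.nanmean(cleaned, axis=1)

rn = np.full((40, 3), 4.0)      # 40 cycles, 3 identical repeats each
rn[0, 2] = 1.0                  # one aberrant repeat in the first cycle
print(clean_repeats(rn)[0])     # 4.0 -- mean of the two surviving repeats
```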
Applying this method to the data shown in
There are three objectives that can now be defined:
Many different sigmoid functions can be found in the literature (e.g. the logistic function, the Gaussian error function, the Gompertz function, ...) and one cannot decide ad hoc which of them is the most appropriate means to describe the fluorescence data of (RT-)qPCR reactions. In detail, the aforementioned functions are of the form
As can be seen immediately, all of these functions contain the two parameters b and n0, which are necessary to describe the inflexion point and the sigmoid shape of the curve. Furthermore, a detailed analysis shows that the logistic function and the Gaussian error function are symmetric around their inflexion point, and all three functions fulfill the conditions lim_(x→−∞) ƒ(x) = 0 and lim_(x→+∞) ƒ(x) = 1
if b is greater than zero. Since fluorescence data of (RT-)qPCR reactions have a background that is different from zero and a plateau height which differs from one, the use of any of the three functions requires a transformation of the form
y0 + r·x + a·ƒ(x) (9)
with additional parameters y0, r, and a. Up to this stage of the analysis all three functions seem equally suited to describe the fluorescence data, because all of them are of sigmoid shape, need the same number of parameters (namely five) and show the appropriate convergence behavior. As mentioned before, the Gompertz function shows no symmetry, but this is no disadvantage because the fluorescence data do not indicate any need for symmetry. In order to find out which function to use, all of them were fitted to a set of example data and the residuals of the fits, i.e. the deviations between models and data, were examined. The results can be seen in
Thus, we chose the Gompertz function to describe the Rn values of (RT-)qPCR reactions:

Rn(x) = y0 + r·x + a·exp(−exp(−(x−n0)/b)) (10)
The meaning of the five parameters which characterize the Gompertz function can be described quite easily by geometric means, see
To explain the geometrical relevance of the sigmoid width b, some more considerations have to be made. First of all, subtracting the background in equation (10) and dividing by the pedestal height leads to the Gompertz term exp(−exp(−(x−n0)/b))
on the right-hand side of the equation. This term is bounded by zero from below and by one from above, still being of sigmoid shape, see
The next step is to determine those fractional cycles x_low and x_high which define the beginning and the end of the exponential growth phase, respectively. This can be done by resolving the equations exp(−exp(−(x_low−n0)/b)) = 0.001 and exp(−exp(−(x_high−n0)/b)) = 0.999.
Solving for x_low as an example, taking the natural logarithm twice yields

x_low = n0 − log(−log(0.001))·b (16)
Similarly, one gets
x_high = n0 − log(−log(0.999))·b (17)
Taking equations (16) and (17) together, one realizes that the width of the exponential growth phase is
x_high − x_low = (−log(−log(0.999)) + log(−log(0.001)))·b = 8.8399·b (18)
explaining why b is called “sigmoid width”. Since the length of the exponential growth phase should be positive, it is advisable to replace b by exp(β) and optimize β instead of b. Equation (10) is therefore replaced by
The objective of the fitting routine is thus the solution of the minimization problem
Some more considerations have to be made on the pedestal height a: given fluorescence data of experiments with low target concentrations, where the plateau phase is not reached within the range of observed cycles, the three parameters characterizing the exponential growth and the plateau phase are not well determined by the data and the resulting fit lacks robustness. This situation is shown exemplarily in
This problem can be solved by forcing the pedestal height to a predetermined value. Once the magnitude of a has been fixed, all remaining parameters can be estimated robustly, too. In optimization theory, this procedure is called regularization. Regularization of a raises the question which value is sensible for a. In order to figure this value out, the data of 26640 RT-qPCR reactions belonging to the HECOG0309 study were fitted by a preliminary fitting algorithm that will not be described here. The fitting algorithm resulted in a set of five parameters (background intercept, background slope, pedestal height, sigmoid width and inflexion point) and a mean squared error for each of the 26640 reactions. The results of 436 curves were sorted out because the corresponding Ct value was either an outlier or the operator responsible for the experiments marked the curve as “Bad”. A histogram of the values of the pedestal height for the remaining 26204 curves was plotted and is shown in
As can be seen when taking a closer look at this histogram, the majority of a values is smaller than or equal to ten. It is assumed that the values larger than ten come from curves where most of the cycles belong to the background phase and which are thus not fitted robustly. Therefore they are sorted out as well, leading to a histogram which indicates that the pedestal height a should be forced to a value of approximately four, see
Computing the median of the remaining a values leads to
â=3.9110 (22)
where â denotes the value the pedestal height shall be regularized to.
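The determination of the regularization target can be sketched as follows (Python for illustration; the original analysis was done in MATLAB, and the cutoff of ten is the one stated in the text):

```python
import statistics

def regularization_target(pedestal_heights, cutoff=10.0):
    """Median of the fitted pedestal heights after discarding values
    above `cutoff`, which are assumed to stem from non-robust fits
    (curves consisting mostly of background)."""
    kept = [a for a in pedestal_heights if a <= cutoff]
    return statistics.median(kept)
```

Applied to the 26204 retained HECOG0309 curves, this procedure yields the value â = 3.9110 given in equation (22).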
A general regularization approach can be realized by expanding the minimization problem (20) by an additional summand.
The purpose of the matrix weights is to determine how strict the regularization shall be. The larger an entry of weights, the harder the fitting routine tries to force the corresponding entry of c to equal its counterpart in ĉ, which contains the parameter values we want to regularize to. Since we only want to regularize parameter a in a univariate way, the matrix weights contains only one non-zero entry. To balance both summands in problem (23), we normalize them by estimates of their variances.
The first denominator in equation (24) was derived as follows: A histogram of the values of the mean squared errors of the same 26204 HECOG0309 curves as used for the determination of â was plotted and is shown in
As can be seen when taking a closer look at this histogram, the majority of mean squared error values is smaller than or equal to 0.01. Only considering these curves leads to a median value for the mean squared error of 5.4674·10−4, see
The usefulness of the regularization can be seen in the following example.
Here, a regularization of the pedestal height leads to parameters which are more trustworthy. In addition, convergence of the fitting routine is achieved, see
Please note that—compared to the unregularized approach—the shape of the model curve is nearly unchanged but parameter a is much more realistic.
One class of (RT-)qPCR curves which is difficult to handle consists of curves which mainly comprise background and are thus quite linear. A fitting concept that achieves good fitting results even for these fluorescence data curves is the application of a special constraint which forces the optimization routine to introduce a nonlinearity into the model. Once this constraint is specified, the fitting routine is forced to find parameter values such that the condition given by the constraint is fulfilled; whenever the constraint is violated, a new set of parameters has to be chosen. In the current situation, the constraint is chosen as follows:
linearityNorm ≥ 0.001
The next question that arises is how many cycles should be used for the fitting procedure. Preferably all cycles should be used, of course, but since many curves show awkward behavior in the first cycles (immoderate increase or decrease of Rn values, waves), the first two cycles are left out of the fitting process for all curves. Examples of this awkward behavior are shown in the following
While the data shown in
In other words
are computed where ƒ denotes the Gompertz function defined in equation (19) and x1 denotes the first cycle used in the corresponding fit (not the first cycle measured). Whenever one of the conditions
is fulfilled, the fit is doubted and the whole procedure (model fit and investigation of goodness-of-fit) is repeated for all data but the first three cycles. This process is iterated (leaving more and more first cycles out) until either the fit is good enough or no data are left. For the data shown in the figures above, the fit is not rejected when beginning with cycle three (
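This iterative refitting procedure can be sketched as follows (Python for illustration; `fit_model` and `fit_ok` are hypothetical stand-ins for the Gompertz fit and the goodness-of-fit conditions (28)/(29), which are not reproduced here):

```python
def fit_with_cycle_dropping(cycles, rn, fit_model, fit_ok):
    """Fit, check goodness-of-fit, and refit with one more leading
    cycle removed until the fit is accepted or no data are left.
    The first two cycles are always left out."""
    start = 2
    while start < len(cycles):
        params = fit_model(cycles[start:], rn[start:])
        if fit_ok(params, cycles[start:], rn[start:]):
            return params, cycles[start]  # parameters and first cycle used
        start += 1
    return None, None  # no acceptable fit found
```

The second return value is the first cycle actually used, which is also the parameter examined by the rough rule (46) below.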
To be able to fit the Gompertz function to each given set of normalized (RT-)qPCR fluorescence data and to obtain a set of parameters y0, r, a, β, and n0 which is really suited to describe the data, some more technical considerations have to be made. First of all, one has to decide whether it is better to optimize all parameters at once or to start with, for example, the two parameters characterizing the background (y0 and r) and optimize the parameters characterizing the exponential growth and plateau phase (a, β, and n0) later. Since there is no biological indication why one should fit background and exponential growth/plateau separately, the holistic approach should be preferred. Unfortunately, especially for curves representing experiments with low target concentrations, where most of the cycles belong to the background phase, this approach fails because the regression often converges to a meaningless set of parameters or doesn't converge at all. In order to solve this problem and force the regression routine to converge to meaningful parameter values, one has to take a closer look at the minimization problem given in (24),
When inserting the Gompertz function for ƒ one sees that the term that has to be minimized is a quadratic form of parameters y0, r, and a; thus these parameters can be obtained by a regularized linear regression for fixed β and n0. The technical advantage is that for fixed β and n0 parameters y0, r, and a can be optimized analytically by solving linear equations, see below. Taking this into account, the problem of fitting the Gompertz function to the Rn values can be reduced to a nonlinear optimization problem on only two parameters. In detail, the following is done: At first, the problem
is solved for fixed β and n0 by computing the first derivative of the term to be minimized and setting it to zero:
This leads via
Afterwards, β and n0 have to be chosen by the nonlinear fitting routine in such a manner that
gets minimal.
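The two-stage scheme just described can be sketched as follows (Python/NumPy for illustration; the weighting of the regularization penalty is simplified to a single factor `w` and does not reproduce the exact variance normalization of problem (24)):

```python
import numpy as np

def solve_linear_part(x, y, beta, n0, a_hat=3.9110, w=1.0):
    """For fixed beta and n0, solve the regularized linear least-squares
    problem for y0, r, a in  y ~ y0 + r*x + a*g(x)  with
    g(x) = exp(-exp(-(x - n0)/b)) and b = exp(beta)."""
    b = np.exp(beta)
    g = np.exp(-np.exp(-(x - n0) / b))
    A = np.column_stack([np.ones_like(x), x, g])
    # augment the system with one row encoding sqrt(w) * (a - a_hat) = 0,
    # which pulls the pedestal height toward its regularization target
    A_aug = np.vstack([A, [0.0, 0.0, np.sqrt(w)]])
    y_aug = np.append(y, np.sqrt(w) * a_hat)
    coeffs, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
    return coeffs  # (y0, r, a)
```

With this helper, the outer nonlinear routine only has to search over (β, n0), exactly as described above.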
Another question that arises is which MATLAB function should be used to fit the Gompertz function to the fluorescence data. Since the function has to be able to handle the constraint described in paragraph 2.3.3, a routine called “fmincon” is chosen. It is a gradient-based method for finding the minimum residual. Some more details on “fmincon” will be presented in paragraph 3.2, where, among other things, the fitting options that have to be used in the current situation are set. It shall be mentioned that without specifying the parameters “TypicalX”, “RelLineSrchBnd” and “RelLineSrchBndDuration” there is a risk of finding the wrong local minimum for res as defined in equation (39). This is illustrated exemplarily in
One crucial step in a fitting optimization is the choice of the starting values needed by all local optimization procedures. If this choice is not made carefully, the optimization may end in a wrong local minimum. The starting value for β can be set to −log(0.3) for each (RT-)qPCR curve; finding an appropriate starting value for n0 is more challenging. Considering the fluorescence data presented in
A closer look at the residuals of the given data for fixed β and n0 shows that this choice of starting values is sensible, because the minimum to which the optimization routine converges is suited to describe the data, see
Nevertheless, a problem becomes apparent: in addition to the minimum that specifies the best choice for β and n0 there is another local minimum in which the optimization routine could converge if the starting value for n0 were ten, for example. Converging in this second minimum is not desirable, because the resulting parameters would describe a Gompertz function with an inflexion point of approximately ten, which is implausible from a biological point of view: fluorescence data of (RT-)qPCR reactions don't reach the exponential phase at these early cycles. The next figure illustrates an example of fluorescence data for which the choice of starting values as described above leads to convergence in the wrong minimum, see
This problem can be solved by slightly altering the definition of the starting value for n0: The cycle with the largest step in Rn is replaced by the maximum of the aforementioned value and a constant value of 30. By forcing the algorithm to start with a value for n0 that is greater than or equal to 30, the right minimum is found because it is not possible for the optimizer to cross the “mountains” between both minima, see
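This starting-value heuristic can be sketched as follows (Python for illustration only; whether the largest step is attributed to the cycle before or after the increase is an implementation detail the text does not fix):

```python
def n0_start(cycles, rn, floor=30.0):
    """Starting value for n0: the cycle at the largest step in Rn,
    floored at 30 so the optimizer cannot fall into the spurious
    early-cycle local minimum."""
    steps = [rn[i + 1] - rn[i] for i in range(len(rn) - 1)]
    largest = cycles[steps.index(max(steps)) + 1]
    return max(largest, floor)
```

For curves whose largest Rn step occurs early, the floor of 30 dominates and keeps the optimizer on the correct side of the “mountains” between the two minima.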
The last step is to use the parameters gained by the optimization routine described in the last paragraphs to define a Cq value which provides information about the validity of an (RT-)qPCR curve and the amount of RNA or DNA in the sample. More precisely, there are four different classes the Cq value of each given example of fluorescence data can belong to, and these four classes have to be described by the parameters of the Gompertz function. In detail, a Cq value is either of
The first idea to calculate the AIP value was to determine the fractional cycle at which the tangent in the inflexion point crosses the background as illustrated in
Given any function f̃, the formula for the tangent at any point x0 is given by
g(x) = f̃(x0) + f̃′(x0)·(x − x0) (40)
To apply this to the inflexion point of the Gompertz function, one first has to compute the first derivative by applying the chain rule twice:
This leads to
and since the value of the Gompertz function in the inflexion point is
the formula for the tangent in the inflexion point is given by
For computing the AIP value one now has to equate equation (42) with the background:
This definition makes clear where the notation AIP value comes from: AIP is the abbreviation for “adjusted inflexion point”. The question that will be answered below is: Is this definition of the AIP value the best choice or is there a better one?
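The original definition can be verified numerically: for the Gompertz sigmoid, the tangent at the inflexion point x = n0 crosses the background exactly at n0 − b, i.e. the first idea corresponds to α = 1 in the generalized definition (44). A Python sketch using a central-difference derivative (for illustration only):

```python
import math

def gompertz_sigmoid(x, a, b, n0):
    """Sigmoid part of the model, i.e. the signal above the background."""
    return a * math.exp(-math.exp(-(x - n0) / b))

def tangent_background_crossing(a, b, n0, h=1e-6):
    """Fractional cycle where the tangent at the inflexion point x = n0
    crosses the background (signal = 0); analytically this is n0 - b."""
    f0 = gompertz_sigmoid(n0, a, b, n0)  # value a/e at the inflexion point
    d0 = (gompertz_sigmoid(n0 + h, a, b, n0)
          - gompertz_sigmoid(n0 - h, a, b, n0)) / (2 * h)  # slope a/(e*b)
    return n0 - f0 / d0
```

Since f0/d0 = b regardless of a and n0, the crossing is always n0 − b.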
The basic framework of the definition shall remain constant, this means the AIP value shall be computed by adjusting the inflexion point n0, but perhaps the definition
AIP = n0 − α·b
works better for some α ≠ 1. In order to determine the best value of α, one has to create a measure for comparing different versions of the AIP value definition. For this purpose, the 26640 RT-qPCR reactions belonging to the HECOG0309 study, which were already used for the determination of the regularization constants in section 2.3.2, were fitted by a preliminary fitting algorithm again. Of these 26640 reactions, three wells at a time contain the same material and form a so-called triplicate. This means that in each of these three reactions the same gene was measured for the same tissue sample, and therefore the results of these three measurements should be the same. Therefore, one could define that definition of the AIP value as best for which the sum of the standard deviations of the AIP value over all 8880 triplicates is minimal:
where AIP(α,i,j) denotes the AIP value of the jth member of the ith triplicate when using α. This approach is not entirely satisfying, because each of the 8880 triplicates gets the same weight. That is not sensible, because there are triplicates which contain large amounts of RNA as well as others which contain only few RNA molecules. For the latter triplicates a higher standard deviation of the AIP values is acceptable, while for triplicates with much RNA the standard deviation has to be small. Therefore, a weighting factor has to be introduced. This weighting factor makes use of the Ct values (from the SDS software) because they are the only criterion at hand. It is defined as follows:
(43)
Noise(
A total of 499 triplicates were additionally sorted out for different reasons, resulting in the following minimizing task to optimize the AIP value:
The minimum was found at α=−0.72, therefore the AIP value is defined by
AIP = n0 − 0.72·b (45)
The determination of α is illustrated in
The next step is to decide for each fluorescence curve on the basis of the parameters gained by the optimization routine to which of the three classes—“Numeric”, “Undetected”, or “Invalid”—the corresponding Cq value belongs. As described above, the class “Missing” does not depend on the parameters but only on the raw fluorescence data. Whenever a well is classified as “Missing”, it's actually unnecessary to perform the fit, because its results are not needed.
Basis for the determination of the rules were the fitting results of the 28800 fluorescence curves belonging to the HECOG0309 study. Of these, 3600 curves were sorted out, because in the corresponding wells one of the genes ALCAM (SC312), MAP4 (SC083) or SPP1 (SC309) was measured and the results of these genes have to be doubted for some reason. The remaining 25200 curves were classified into three groups—
Scatter plots of the distributions of parameter values resulting from the fitting routines were created, depicting curves “Good” by Ct as dark gray crosses, “Undetermined” by Ct as light gray crosses, and “Bad” by Ct as black crosses. These scatter plots were used to define the rules classifying curves into “Numeric” Cq, “Undetected” Cq, and “Invalid” Cq. Obviously, it is desirable to classify “Good” Ct as “Numeric” Cq, “Undetermined” Ct as “Undetected” Cq, and as few as possible curves as “Invalid” Cq. Nevertheless it is evident as well that this will not work for all 25200 curves. In addition there were also some rare cases where this equivalence between methods is not desirable, e.g. a curve with strongly increasing background, for which the Ct value is unreliable, but the AIP value is reliable. For some of the wells for which the classification based on Ct values on the one hand and on optimization results on the other hand differed we investigated whether or not this is problematic.
There were two kinds of rules that we defined—rough rules and fine rules. The purpose of the rough rules is to filter out curves for which at least one of the parameters is exceptionally high or low and classify them as “Invalid”. The fine rules separate curves whose parameters are in the normal range into the three classes; their determination was more challenging than just setting the rough rules.
The first parameter that was investigated was the first cycle that was used for the Gompertz fit. As described in paragraph 2.3.4, this is the first cycle greater than two for which neither inequality (28) nor inequality (29) is fulfilled. The distribution of this parameter is shown in
As can be seen in the histogram, for the vast majority of curves the first cycle which is used is smaller than ten; in detail, this is true for 99.96% of all curves. Those curves for which the fit starts with the tenth cycle or even later are classified as “Invalid”, because we evaluated these fits, relying on too few data, as not trustworthy:
First cycle used ≥ 10 ⇒ Cq = “Invalid” (46)
By this rule, nine curves are called “Invalid”; five of them are “Bad” curves, the rest are “Good” curves. One of the “Good” curves is exemplarily shown in
In the next step further rough rules are determined.
2.4.1. Background Intercept Vs. Background Slope
The second and third rough rules are based on the parameters characterizing the background of the fluorescence data:
|y0| > 20 ⇒ Cq = “Invalid” (47)
|r| > 1 ⇒ Cq = “Invalid” (48)
In the training data both rules flag exactly the same curve as “Invalid”; this curve was classified as “Bad” by the Ct value as well. The rules are also biologically sensible, because a rapid increase or decrease of the background (as given by a large absolute value of the background slope) as well as initial Rn values far from zero (as given by a large absolute value of the background intercept) indicate that something went wrong during the reaction.
2.4.2. Pedestal Height Vs. Sigmoid Width
The rough rules for the pedestal height a provide an upper and a lower bound, whereas the rough rule for the sigmoid width b only has to provide an upper bound, since b is bounded below by zero due to its definition as exp(β):
a > 25 ⇒ Cq = “Invalid”
a < −15 ⇒ Cq = “Invalid” (49)
b > 50 ⇒ Cq = “Invalid” (50)
Again, these rules are biologically sensible: a negative value of the pedestal height indicates that the plateau lies below the background level. This would mean a decrease of RNA or DNA molecules during the (RT-)qPCR reaction. Since this is impossible due to the principle of such a reaction, a strongly negative pedestal height has to be doubted. A sigmoid width b larger than 50 implies via equation (18) that the length of the exponential region exceeds 400 cycles. Such a curve has to be doubted, too. Exactly one curve fulfills rule (49); this curve has a “Good” Ct. As depicted in
No curve of the training data set is classified as “Invalid” by the rough rule depending on the sigmoid width.
2.4.3. Inflexion Point Vs. Logarithm of Linearity Norm
See
n0 > 65 ⇒ Cq = “Invalid”
n0 < 15 ⇒ Cq = “Invalid” (51)
From a biological point of view these rules make sense, too, because an inflexion point smaller than 15 would indicate that the reaction reached the exponential phase at approximately cycle ten or even earlier. None of the 25200 HECOG0309 curves is caught by this rule. Taking all rough rules together with the rule on the first cycle used for the fitting routine, one gets a total of ten curves which are classified as “Invalid” so far. These curves will be left out for the derivation of the fine rules below.
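Taken together, the rough rules (46)–(51) amount to the following check (Python sketch for illustration; the parameter names are descriptive stand-ins for the fitted Gompertz parameters):

```python
def rough_invalid(first_cycle_used, y0, r, a, b, n0):
    """Rough rules (46)-(51): flag a curve as 'Invalid' when any fitted
    parameter lies far outside its normal range."""
    return (first_cycle_used >= 10      # fit relies on too few cycles
            or abs(y0) > 20             # background intercept out of range
            or abs(r) > 1               # background slope out of range
            or a > 25 or a < -15        # pedestal height out of range
            or b > 50                   # sigmoid width out of range
            or n0 > 65 or n0 < 15)      # inflexion point out of range
```

Curves passing this check are then handed to the fine rules derived in the following paragraphs.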
2.4.4. Pedestal Height Vs. Logarithm of Linearity Norm
In
Since there are more parameters which can help to classify a curve as “Undetected”, the second option was chosen here:
log(linearityNorm) ≤ −3.5 ⇒ Cq = “Undetected” (52)
A total of 31 curves are classified as “Undetected” by this rule. Of these, 29 have “Undetermined” Ct; the other two fluorescence data plots are shown in
For this curve in
It's not wrong to classify this curve as “Undetected”, because all cycles belong to the background phase and the beginning of the exponential region. In addition, computing the AIP value of this curve leads to 47.18; the curve in
Those 31 curves being classified as “Undetected” by the rule on the logarithm of the linearity norm will be left out for the further analysis. Since
2.4.5. Logarithm of Linearity Norm Vs. Logarithm of Sigmoid Width
The
Of the 829 curves which are classified as “Undetected” by this rule, 827 have a Ct value which is “Undetermined”.
The other two curves are “Bad”; one of them is exemplarily shown in
Obviously, both classifications are sensible for this curve, for the same reason as for the data shown in
2.4.6. Pedestal Height Vs. Adjusted Inflexion Point
The scatter plot in
This is not surprising, because a low pedestal height indicates that the amounts of molecules in the background phase and the plateau phase do not differ much. This could mean that the exponential phase detected in the data does not imply that the curve really has a sigmoid shape; all cycles might as well belong to the background phase, and the detection of the exponential phase might be pure chance. This situation is shown exemplarily in
A total of 247 curves is classified as “Undetected” by this rule, 245 of them belong to the class “Undetermined” Ct. The other two curves have Ct values of 39.5865 and 38.0806 respectively. One of the curves is shown exemplarily in
The question that remains is what to do with curves that are characterized by a small pedestal height and a small AIP value. A closer look at
Rule (55) classifies 361 curves as “Undetected”. Of these, all except one curve have “Undetermined” Ct. The exceptional case is depicted in
Though a classification as “Invalid” would be better, it is defensible to characterize it as “Undetected”. A closer look at
The last rule that has to be defined is an upper bound for the AIP value. This bound is set to 40:
AIP ≥ 40 ⇒ Cq = “Undetected” (57)
A total of 355 curves were classified as “Undetected” by this rule. Only 103 of them belong to the class “Undetermined” Ct, 17 of them are “Bad”. As can be seen in
A comparison of the results of Cq values and Ct values results in the following table:
This means that for a total of 96.79% of the wells both methods result in equivalent classifications. Please note that “Numeric” Cq should match “Good” Ct and “Undetected” Cq should match “Undetermined” Ct; “Invalid” Cq and “Bad” Ct need not match, since the (subjective) criteria for a reliable Ct value are very different from the criteria for a reliable Cq value due to the very different calculation approaches.
When taking a closer look at those wells which the Cq method and the Ct method classify as “Numeric” and “Good”, respectively, one realizes that at the lower end of the scale AIP values tend to be greater, while at the upper end the situation is vice versa, see
Analyzing the correlation between both methods for separate genes reveals that for some genes the methods result in similar values, while for other genes there is a constant shift, and for the remaining genes the relation between Cq values and Ct values is non-linear, see
The purpose of the last chapter is to describe in detail the program that was implemented in the MATLAB language to export fluorescence data of TaqMan or MX3005 experiments and to fit a Gompertz model to those data in order to determine Cq values as a means to evaluate the results of an experiment. The program consists of three steps:
The MATLAB implementation codes missing numeric data as NaN (not-a-number), which is a special value of the numeric data type “double”. In particular NaN codes missing repeats, outliers within repeats, missing fluorescence values of single cycles (pure dyes as well as Rn values), “Invalid” or “Missing” Cq values, and “Bad” Ct values. “Undetected” Cq values and “Undetermined” Ct values are coded as Inf (positive infinity), which also is a special value of data type “double”.
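The same coding convention can be expressed outside MATLAB as well; a minimal Python sketch (the function name is illustrative, not part of the original program):

```python
import math

def encode_cq(cq_class, value=None):
    """Numeric coding of Cq results as described in the text:
    'Invalid' and 'Missing' -> NaN, 'Undetected' -> Inf,
    'Numeric' -> the computed Cq value."""
    if cq_class == "Numeric":
        return value
    if cq_class == "Undetected":
        return math.inf
    return math.nan  # 'Invalid' or 'Missing'
```

The advantage of this coding is that NaN and Inf propagate through subsequent numeric computations without requiring special-case handling.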
Data Export from SDS Software
The program described in the next section expects txt-files of a certain structure which have to be exported from the SDS software manually. To be able to use the program, one has to use version 2.2, 2.2.2 or 2.3 of the SDS software. Other versions are not supported. To export the data from the SDS software, the first step is to push the “Analyze” button (marked by a green arrow). Afterwards, one has to choose “Export” in the “File” menu. In the graphical user interface which pops up, three changes have to be made:
The last step is to push the “Export” button; afterwards the SDS software isn't needed anymore. The txt-file which is exported satisfies the following structure:
The program which imports the data included in the txt-file to MATLAB is called “ImportSDSFileFromTXT”. It is called with five input parameters and two output parameters:
[expressionData, msg]=ImportSDSFileFromTXT(basename, reporter, passiveReference, directory, outputFunction).
The parameters “msg” and “outputFunction” are optional parameters which are only needed when the program is started from a GUI (graphical user interface). They won't be considered here. All obligatory input parameters have to fulfill certain conditions:
The program first checks the input parameters; in detail the following checks are done:
Afterwards, the first fields of the structure “expressionData” are set: “expressionData.source” is defined as “SDS” and “expressionData.wellNames” becomes a 16*24 cell array consisting of the entries “A1”, . . . , “P24”. In addition “expressionData.basename”, “expressionData.reporter” and “expressionData.passiveReference” are assigned the associated input parameters. When this is done the txt-file “<basename>mc.txt” is opened and the function goes through the lines of the file and does the following:
In the case it begins with “SDS”, the message
The following steps are repeated until the end of the txt-file:
When the end of the txt-file is reached some amendments have to be made:
Data Export from MX Pro Software
The program described in the next section expects xls-files of a certain structure which have to be exported from the MX Pro software manually. Version 4.01 should be used, but since the version is not explicitly specified in the xls-file, the use of the appropriate version cannot be checked by the program which imports the data included in the xls-file to MATLAB. Therefore other versions are supported, too, but their use is not recommended. To export the data from the MX Pro software, one has to choose “Export Instrument Data/Export Instrument Data to Excel/Format 1—Columnar” in the menu “File”. The Excel file which pops up afterwards should be saved under the name of the corresponding MX Pro file, but with the extension “.xls” instead of “.mxp”. Afterwards the MX Pro software isn't needed anymore. The xls-file which is exported satisfies the following structure:
The program which fits the data and computes the Cq values is called “FitAndComputeBV”. It is called with two input parameters and two output parameters:
[fitData, msg]=FitAndComputeBV(expressionData, outputFunction).
The parameters “msg” and “outputFunction” are optional parameters which are only needed when the program is started from a GUI (graphical user interface). They won't be considered here. The input parameter “expressionData” is the MATLAB structure generated by the program “ImportSDSFileFromTXT” or “ImportMXProFileFromXLS”. The program first checks the validity of the input parameter “expressionData”. When using one of the programs described in the previous section to gain the variable “expressionData” some of the following errors can't occur, but they are described here nevertheless. In particular, the following steps are done:
File <basename>: Source <expressionData.source> unknown.
File <basename>: Passive reference has to be measured.
The program loops over all reporter dyes specified in “expressionData.reporter” and does the following:
[x, fval, exitflag]=fmincon(fun, x_0, [ ], [ ], [ ], [ ], [ ], [ ], nonlcon, options).
c(x) ≤ 0 (58)
fun=@(x)loc_Gompertz(x, regressionRange, Rn, ‘objective’)
nonlcon=@(x)loc_Gompertz(x, regressionRange, Rn, ‘constraint’)
[result, dummy]=loc_Gompertz(x, data_x, data_y, resultType).
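For readers working outside MATLAB, the constrained call can be approximated with SciPy (a rough analogue for illustration, not the author's implementation; note the sign convention: fmincon enforces c(x) ≤ 0, while SciPy's "ineq" constraints require g(x) ≥ 0, so g = −c is passed):

```python
import numpy as np
from scipy.optimize import minimize

def fmincon_like(fun, x0, c):
    """Approximate MATLAB's
    [x, fval] = fmincon(fun, x_0, [],[],[],[],[],[], nonlcon, options)
    for a single nonlinear inequality constraint c(x) <= 0."""
    res = minimize(fun, np.asarray(x0, dtype=float), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda x: -c(x)}])
    return res.x, res.fun
```

As with fmincon, SLSQP is a gradient-based local method, so the starting-value considerations discussed in section 2.3 apply here as well.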
After all these steps are done for all wells and all reporter dyes, all warning messages that were saved during the fitting process are printed as warnings on the screen.
Number | Date | Country | Kind |
---|---|---|---|
10160573.1 | Apr 2010 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2011/055406 | 4/7/2011 | WO | 00 | 3/29/2013 |