The present invention relates to methods for analyzing multi-channel profiles, e.g., gene expression profiles. The invention also relates to methods for comparing expression profiles obtained using different microarrays.
DNA array technologies have made it possible to monitor the expression level of a large number of genetic transcripts at any one time (see, e.g., Schena et al., 1995, Science 270:467-470; Lockhart et al., 1996, Nature Biotechnology 14:1675-1680; Blanchard et al., 1996, Nature Biotechnology 14:1649; Ashby et al., U.S. Pat. No. 5,569,588, issued Oct. 29, 1996). Of the two main formats of DNA arrays, spotted cDNA arrays are prepared by depositing PCR products of cDNA fragments with sizes ranging from about 0.6 to 2.4 kb, from full length cDNAs, ESTs, etc., onto a suitable surface (see, e.g., DeRisi et al., 1996, Nature Genetics 14:457-460; Shalon et al., 1996, Genome Res. 6:639-645; Schena et al., 1995, Proc. Natl. Acad. Sci. U.S.A. 93:10539-11286; and Duggan et al., Nature Genetics Supplement 21:10-14). Alternatively, high-density oligonucleotide arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface, are synthesized in situ on the surface by, for example, photolithographic techniques (see, e.g., Fodor et al., 1991, Science 251:767-773; Pease et al., 1994, Proc. Natl. Acad. Sci. U.S.A. 91:5022-5026; Lockhart et al., 1996, Nature Biotechnology 14:1675; McGall et al., 1996, Proc. Natl. Acad. Sci. U.S.A. 93:13555-13560; U.S. Pat. Nos. 5,578,832; 5,556,752; 5,510,270; and 6,040,138). Methods for generating arrays using inkjet technology for in situ oligonucleotide synthesis are also known in the art (see, e.g., Blanchard, International Patent Publication WO 98/41531, published Sep. 24, 1998; Blanchard et al., 1996, Biosensors and Bioelectronics 11:687-690; Blanchard, 1998, in Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J. K. Setlow, Ed., Plenum Press, New York at pages 111-123).
Efforts to further increase the information capacity of DNA arrays range from further reducing feature size on DNA arrays so as to further increase the number of probes in a given surface area to sensitivity- and specificity-based probe design and selection aimed at reducing the number of redundant probes needed for the detection of each target nucleic acid thereby increasing the number of target nucleic acids monitored without increasing probe density (see, e.g., Friend et al., International Publication No. WO 01/05935, published Jan. 25, 2001).
By simultaneously monitoring tens of thousands of genes, DNA array technologies have allowed, inter alia, genome-wide analysis of mRNA expression in a cell or a cell type or any biological sample. Aided by sophisticated data management and analysis methodologies, the transcriptional state of a cell or cell type as well as changes of the transcriptional state in response to external perturbations, including but not limited to drug perturbations, can be characterized on the mRNA level (see, e.g., Stoughton et al., International Publication No. WO 00/39336, published Jul. 6, 2000; Friend et al., International Publication No. WO 00/24936, published May 4, 2000). Applications of such technologies include, for example, identification of genes which are up regulated or down regulated in various physiological states, particularly diseased states. Additional exemplary uses for DNA arrays include the analyses of members of signaling pathways, and the identification of targets for various drugs. See, e.g., Friend and Hartwell, International Publication No. WO 98/38329 (published Sep. 3, 1998); Stoughton, International Publication No. WO 99/66067 (published Dec. 23, 1999); Stoughton and Friend, International Publication No. WO 99/58708 (published Nov. 18, 1999); Friend and Stoughton, International Publication No. WO 99/59037 (published Nov. 18, 1999); Friend et al., U.S. Pat. No. 6,218,122 (filed on Jun. 16, 1999).
The various characteristics of this analytic method make it particularly useful for directly comparing the abundance of mRNAs present in two cell types. For example, an array of cDNAs was hybridized with a green fluor-tagged representation of mRNAs extracted from a tumorigenic melanoma cell line (UACC-903) and a red fluor-tagged representation of mRNAs extracted from a nontumorigenic derivative of the original cell line (UACC-903 +6). Monochrome images of the fluorescent intensity observed for each of the fluors were then combined by placing each image in the appropriate color channel of a red-green-blue (RGB) image. In this composite image, one can see the differential expression of genes in the two cell lines. Intense red fluorescence at a spot indicates a high level of expression of that gene in the nontumorigenic cell line, with little expression of the same gene in the tumorigenic parent. Conversely, intense green fluorescence at a spot indicates high expression of that gene in the tumorigenic line, with little expression in the nontumorigenic daughter line. When both cell lines express a gene at similar levels, the observed array spot is yellow.
In some cases, visual inspection of such results is sufficient to identify genes which show large differential expression in the two samples. A more thorough study of the changes in expression requires the ability to discern quantitatively changes in expression levels and to determine whether observed differences are the result of random variation or whether they are likely to reflect changes in the expression levels of the genes in the samples. Assuming that DNA products from two samples have an equal probability of hybridizing to the probes, the intensity measurement is a function of the quantity of the specific DNA products available within each sample. Locally (or pixelwise), the intensity measurement is also a function of the concentration of the probe molecules. On the scanning side, the fluorescent light intensity also depends on the power and wavelength of the laser, the quantum efficiency of the photomultiplier tube, and the efficiency of other electronic devices. The resolution of a scanned image is largely determined by processing requirements and acquisition speed. The scanning stage imposes a calibration requirement, though it may be relaxed later. The image analysis task is to extract the average fluorescence intensity from each probe site (e.g., a cDNA region).
The measured fluorescence intensity for each probe site comes from various sources, e.g., background, cross-hybridization, hybridization with sample 1 or sample 2. The average intensity within a probe site can be measured by the median image value on the site. This intensity serves as a measure of the total fluors emitted from the sample mRNA targets hybridized on the probe site. The median is used as the average to mitigate the effect of outlying pixel values created by noise.
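By way of illustration only, the use of the median as a robust average for a probe site can be sketched as follows. The function name and the pixel values are hypothetical and are not taken from any particular image-analysis package.

```python
from statistics import median

def probe_site_intensity(pixels):
    """Return the median pixel value of one probe site (hypothetical helper)."""
    return median(pixels)

# Hypothetical pixel values for one probe site; 5000 stands in for a noise spike.
site = [100, 102, 98, 101, 5000]
robust = probe_site_intensity(site)   # barely affected by the spike
naive = sum(site) / len(site)         # arithmetic mean, pulled upward by the spike
print(robust, naive)
```

A single outlying pixel shifts the arithmetic mean substantially while leaving the median near the typical pixel value, which is the reason given above for preferring the median.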
Typically, in a two-color microarray gene expression experiment, the experiment sample is labeled in one dye color (Cy5, red) and the control sample is labeled in a different color (Cy3, green). The two samples are mixed and hybridized to a microarray slide. After hybridization, the expression intensity is measured with a laser scanner in each of the two colors. The experiment is conducted in a biology laboratory (wet lab). To obtain the expression profile, the logarithmic ratio of the two measured intensities (red and green) is computed.
There are various types of biases (errors), e.g., inter-slide bias and color bias, which may affect the accuracy of the ratio estimation. Inter-slide bias is the difference between two separate slides. The two-color technique avoids the inter-slide error by running the experiment on a single slide. However, different dyes can cause a difference between the two intensity measurements, so that the ratio is biased. To overcome this color bias problem, the experiment can be run twice, with the fluorescent dye labeling reversed between the two runs. The two expression ratios are then combined to cancel out the color bias. A method for calculating individual errors associated with each measurement made in repeated microarray experiments was also developed. The method offers an approach for minimizing the number of times a cellular constituent quantification experiment must be repeated in order to produce data that has acceptable error levels and for combining data generated in repeats of a cellular constituent quantification experiment based on rank order of up-regulation or down-regulation. See, e.g., Stoughton et al., U.S. Pat. No. 6,351,712.
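By way of illustration only, one way a fluor-reversed pair could be combined is sketched below. The text above states only that the two ratios are combined; the specific rule used here (half the difference of the two log ratios, assuming the color bias is additive in log space and identical in both runs) is an assumption, as are the function names.

```python
import math

def log_ratio(red, green):
    """Log10 ratio of the red-channel intensity to the green-channel intensity."""
    return math.log10(red / green)

def combine_fluor_reversed(forward, reverse):
    """Combine per-probe log ratios from a fluor-reversed pair (assumed rule).

    forward: log10(red/green) with the experiment sample in red
    reverse: log10(red/green) with the experiment sample in green
    A bias that adds the same amount to both log ratios cancels in the
    difference, while the true log ratio (which changes sign) is retained.
    """
    return [(f - r) / 2.0 for f, r in zip(forward, reverse)]

# Hypothetical example: true log ratio 0.3, additive color bias 0.1.
forward = [0.3 + 0.1]
reverse = [-0.3 + 0.1]
combined = combine_fluor_reversed(forward, reverse)
```

Under the stated assumption, the combined value recovers the true log ratio up to floating-point rounding.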
U.S. Pat. No. 6,691,042 discloses methods for generating differential profiles A vs. B, i.e., differential profiles between samples having been subject to condition A and condition B, from data obtained in separately performed experimental measurements A vs. C and B vs. D. When C and D are the same, i.e., common, the methods involve determination of systematic measurement errors or biases between measurements carried out in different experimental reactions, i.e., cross-experiment errors or biases, using data measured for samples under the common condition and for removal or reduction of such cross-experiment errors. U.S. Pat. No. 6,691,042 also provides methods for generating differential profiles A vs. B from data obtained in separately performed single-channel measurements A and B.
Discussion or citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention.
The invention provides a method for correcting errors in at least one of a plurality of pairs of profiles {Am, Cm}, Am being an experiment profile, Cm being a reference profile, where m=1, 2, . . . , M, M is the number of pairs of profiles, said method comprising (a) calculating an average reference profile $\bar{C}$ of reference profiles {Cm}, m=1, 2, . . . , M; (b) determining for at least one profile pair m ∈ {1, 2, . . . , M} a differential reference profile of Cm and $\bar{C}$; and (c) generating for said at least one profile pair m an error-adjusted experiment profile A′m by a method comprising adjusting said experimental profile Am using said differential reference profile determined for said profile pair m, thereby correcting errors in said at least one of said plurality of pairs of profiles; wherein for each m ∈ {1, 2, . . . , M}, said error-adjusted experiment profile A′m comprises data set {A′m(k)}, said experiment profile Am comprises data set {Am(k)}, said reference profile Cm comprises data set {Cm(k)}, and said average reference profile $\bar{C}$ comprises data set {$\bar{C}(k)$}, wherein said data set {Am(k)} comprises measurements of a plurality of different cellular constituents measured in a sample having been subject to condition Am, said data set {Cm(k)} comprises measurements of said plurality of different cellular constituents measured in a sample having been subject to condition C, and wherein k=1, 2, . . . , N is an index of measurements of cellular constituents, N being the total number of measurements. Preferably, said steps (b) and (c) are performed for each profile pair m.
The invention also provides a method for correcting errors in at least one of a plurality of pairs of profiles {Am, Cm}, Am being an experiment profile, Cm being a reference profile, where m=1, 2, . . . , M, M is the number of pairs of profiles, said method comprising generating for at least one profile pair m ∈ {1, 2, . . . , M} an error-adjusted experiment profile A′m by a method comprising adjusting said experimental profile Am using a differential reference profile generated using Cm and an average reference profile $\bar{C}$ determined for said profile pair m, wherein said average reference profile $\bar{C}$ is an average of reference profiles {Cm}, m=1, 2, . . . , M; wherein for each m ∈ {1, 2, . . . , M}, said error-adjusted experiment profile A′m comprises data set {A′m(k)}, said experiment profile Am comprises data set {Am(k)}, said reference profile Cm comprises data set {Cm(k)}, and said average reference profile $\bar{C}$ comprises data set {$\bar{C}(k)$}, wherein said data set {Am(k)} comprises measurements of a plurality of different cellular constituents measured in a sample having been subject to condition Am, said data set {Cm(k)} comprises measurements of said plurality of different cellular constituents measured in a sample having been subject to condition C, and wherein k=1, 2, . . . , N is an index of measurements of cellular constituents, N being the total number of measurements.
The experiment profile Am and reference profile Cm are preferably measured in the same experimental reaction. In one embodiment, each said pair of profiles Am and Cm is measured in a two-channel microarray experiment. In one embodiment, said reference profiles {Cm}, m=1, 2, . . . , M, are measured with samples labeled with a same label. In another embodiment, at least one of said plurality of pairs of profiles {Am, Cm} is a virtual profile.
In a preferred embodiment, said $\bar{C}(k)$ is calculated according to equation
$$\bar{C}(k)=\frac{1}{M}\sum_{m=1}^{M}C_m(k)$$
said differential reference profile is calculated according to equation
$$C_{\mathrm{diff}}(m,k)=C_m(k)-\bar{C}(k)$$
and said error-adjusted profile is calculated according to equation
$$A'_m(k)=A_m(k)-C_{\mathrm{diff}}(m,k)$$
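By way of illustration only, the three steps above (average reference profile, differential reference profile, error-adjusted experiment profile) can be sketched as follows. The sketch assumes that the average reference profile is the arithmetic mean over the M reference profiles; the function names and the list-of-lists data layout are hypothetical.

```python
def average_reference(refs):
    """Average reference profile: mean over m of C_m(k), for each k.

    refs is a list of M reference profiles, each a list of N measurements.
    Assumption: the average is the plain arithmetic mean.
    """
    m_count = len(refs)
    return [sum(column) / m_count for column in zip(*refs)]

def error_adjust(experiments, refs):
    """Return error-adjusted profiles A'_m(k) = A_m(k) - (C_m(k) - C_bar(k))."""
    c_bar = average_reference(refs)
    adjusted = []
    for a_m, c_m in zip(experiments, refs):
        # Differential reference profile for pair m.
        c_diff = [c - cb for c, cb in zip(c_m, c_bar)]
        adjusted.append([a - d for a, d in zip(a_m, c_diff)])
    return adjusted

# Hypothetical data: two profile pairs, two measurements each.
experiments = [[10.0, 20.0], [12.0, 18.0]]
refs = [[5.0, 7.0], [7.0, 5.0]]
adjusted = error_adjust(experiments, refs)
```

In this hypothetical example the two reference profiles deviate from their average in opposite directions, and subtracting those deviations brings the two experiment profiles into agreement.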
In another preferred embodiment, the method further comprises a step of (d) calculating for at least one, preferably each profile pair m an error-corrected experiment profile A″m comprising data set {A″m(k)} by combining said error-adjusted experiment profile A′m with said experiment profile Am using a weighing factor {w(k)}, k=1, 2, . . . , N, wherein w(k) is a weighing factor for the k′th measurement. Preferably, said error-corrected experimental profile A″m is calculated according to equation
A″m(k)=(1−w(k))·Am(k)+w(k)·A′m(k)
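By way of illustration only, the weighted combination above can be sketched as follows. The weighing factor w(k) is taken as a given input here, and the function name is hypothetical.

```python
def error_corrected(a_m, a_adj, w):
    """Blend the raw and error-adjusted profiles per measurement:
    A''_m(k) = (1 - w(k)) * A_m(k) + w(k) * A'_m(k).
    """
    return [(1.0 - wk) * a + wk * ap for a, ap, wk in zip(a_m, a_adj, w)]

# Hypothetical values: w = 0 keeps the raw measurement, w = 1 takes the
# fully adjusted one, and intermediate weights interpolate between them.
a_m = [10.0, 20.0, 30.0]
a_adj = [12.0, 16.0, 33.0]
w = [0.0, 0.5, 1.0]
blended = error_corrected(a_m, a_adj, w)
```

The per-measurement weight lets the correction be applied fully to some measurements and partially or not at all to others.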
In one embodiment, said weighing factor w(k) is determined according to equation
where avg_bkgstd is an average background standard error. In one embodiment, said avg_bkgstd is determined according to equation
where bkgstd (m, k) is background standard error of Cm(k).
In a preferred embodiment, the method further comprises determining errors {σ′m} of said error-adjusted experiment profiles {A′m}. In one embodiment, said errors are determined according to equation
where σm(k) is the standard error of Am(k), mixed_σm(k) is determined according to equation
and where Cor(k) is a correlation coefficient between experiment profile and reference profile. In one embodiment, said Cor(k) is determined according to equation
where CorMax is a number between 0 and 1.
In still another embodiment, the method further comprises determining errors {σ″m} of said error-corrected experiment profile {A″m}. In one embodiment, said errors are determined according to equation
$$\sigma''_m(k)=\sqrt{[1-w(k)]\cdot\sigma_m^{2}(k)+w(k)\cdot\sigma'_m(k)}$$
where σm(k) is the standard error of Am(k), σ′m(k) is determined according to equation
where mixed_σm(k) is determined according to equation
and where Cor(k) is a correlation coefficient. In one embodiment, said Cor(k) is determined according to equation
where CorMax is a number between 0 and 1.
In another preferred embodiment, the plurality of pairs of profiles {Am, Cm} are transformed profiles comprising transformed measurements. In one embodiment, said transformed measurements are obtained according to equations
where experiment profile XAm comprises measured data set {XAm(k)}, said reference profile XCm comprises measured data set {XCm(k)}, where d is described by equation
and where a is the fractional error coefficient of said experiment, b is the Poisson error coefficient of said experiment, and c is the standard deviation of background noise of said experiment.
In another preferred embodiment, said experiment profile Am and reference profile Cm comprise measurements from which nonlinearity is removed. In one embodiment, said measurements from which nonlinearity is removed are obtained by a method comprising (i) determining an average profile of all experiment profiles {Am} and reference profiles {Cm}; and (ii) adjusting each Am or Cm based on a difference between said Am or Cm and said average profile. In one embodiment, said difference is determined using a subset of measurements in the profiles. In a preferred embodiment, said subset of measurements in the profiles consists of measurements that are ranked similarly between an experiment or reference profile and said average profile. In one embodiment, said adjusting in said step (ii) is carried out by a method comprising: (ii1) binning measurements in said subset into a plurality of bins, each said bin consisting of measurements having a value in a given range; (ii2) calculating mean difference between said Am or Cm and the average profile in each bin; (ii3) determining a curve of said mean difference as a function of values of measurements for said Am or Cm, nonlinear_Am or nonlinear_Cm, respectively; and (ii4) adjusting Am or Cm according to equations
Amcorr(k)=Am(k)−nonlinear_Am(k)
or
Cmcorr(k)=Cm(k)−nonlinear_Cm(k)
where k=1, . . . , N.
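By way of illustration only, the binning, mean-difference, curve, and adjustment steps can be sketched as follows. Several choices here are assumptions not specified above: equal-width bins, a piecewise-linear curve through the bin means that is held constant beyond the outermost bin centers, and the use of all measurements rather than a similarly ranked subset. All function names are hypothetical.

```python
def bin_mean_differences(profile, average, n_bins):
    """Bin measurements by value and compute the mean difference
    (profile minus average profile) in each non-empty bin."""
    lo, hi = min(profile), max(profile)
    width = (hi - lo) / n_bins or 1.0      # guard against a flat profile
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, avg in zip(profile, average):
        i = min(int((x - lo) / width), n_bins - 1)
        sums[i] += x - avg
        counts[i] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins) if counts[i]]
    means = [sums[i] / counts[i] for i in range(n_bins) if counts[i]]
    return centers, means

def curve_value(x, xs, ys):
    """Assumed curve: linear interpolation between bin centers,
    constant outside the first and last center."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])

def remove_nonlinearity(profile, average, n_bins=4):
    """Subtract the fitted mean-difference curve from each measurement."""
    xs, ys = bin_mean_differences(profile, average, n_bins)
    return [x - curve_value(x, xs, ys) for x in profile]
```

For a profile that differs from the average profile by a constant offset, every bin mean equals that offset, so the adjusted profile coincides with the average profile.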
In another preferred embodiment, each said experiment profile Am and reference profile Cm is a normalized profile. In one embodiment, said normalized profile is obtained by a method comprising normalizing experiment profile Am and reference profile Cm according to equation
where $\bar{A}_m$ is an average of profile {Am(k)}, and $\bar{C}_m$ is an average of profile {Cm(k)};
wherein $\overline{AC}$ is an average of all profiles calculated according to equation
The method of the invention can further comprise normalizing errors of said experiment profile Am and reference profile Cm according to equation
where σmA(k) and σmC(k) are the standard error of Am(k) and Cm(k), respectively, and σmNA(k) and σmNC(k) are normalized standard error of NAm(k) and NCm(k), respectively.
In another embodiment, the method further comprises normalizing background errors of said experiment profile Am and reference profile Cm according to equation
where bkgstdmA(k) and bkgstdmC(k) are the standard background error of Am(k) and Cm(k), respectively, and bkgstdmNA(k) and bkgstdmNC(k) are normalized standard background error of NAm(k) and NCm(k), respectively.
In a preferred embodiment, said $\bar{A}_m$ and $\bar{C}_m$ are an average of measurements in profile {Am(k)} and {Cm(k)}, respectively, excluding measurements having values among the highest 10%.
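By way of illustration only, an average that excludes the highest 10% of measurements can be sketched as follows. The function name is hypothetical, and rounding the number of excluded measurements down to a whole number is an assumption.

```python
def trimmed_average(values, top_fraction=0.10):
    """Mean of the measurements after dropping the highest top_fraction.

    Assumption: the number of dropped values is rounded down.
    """
    ordered = sorted(values)
    n_drop = int(len(ordered) * top_fraction)
    kept = ordered[:len(ordered) - n_drop]
    return sum(kept) / len(kept)

# Hypothetical profile with one very bright feature.
profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]
avg = trimmed_average(profile)
```

Excluding the brightest measurements keeps a small number of saturated or strongly regulated features from dominating the profile average used for normalization.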
The invention also provides a method of correcting errors in a plurality of pairs of profiles {XAm, XCm}, XAm being an experiment profile, XCm being a reference profile, where m=1, 2, . . . , M, M is the number of pairs of profiles, said method comprising (a) processing said profiles to obtain a plurality of pairs of processed profiles {Am, Cm}, Am being a processed experiment profile, Cm being a processed reference profile; (b) calculating an average reference profile $\bar{C}$ of reference profiles {Cm}, m=1, 2, . . . , M; (c) determining for each profile pair m a differential reference profile of Cm and $\bar{C}$; and (d) generating for each profile pair m an error-adjusted experiment profile A′m by a method comprising adjusting said experimental profile Am using said differential reference profile determined for said profile pair m, thereby correcting errors in said plurality of pairs of profiles; wherein for each m ∈ {1, 2, . . . , M}, said error-adjusted experiment profile A′m comprises data set {A′m(k)}, said processed experiment profile Am comprises data set {Am(k)}, said processed reference profile Cm comprises data set {Cm(k)}, and said average reference profile $\bar{C}$ comprises data set {$\bar{C}(k)$}, said experiment profile XAm comprises data set {XAm(k)}, said reference profile XCm comprises data set {XCm(k)}, wherein said data set {XAm(k)} comprises measurements of a plurality of different cellular constituents measured in a sample having been subject to condition Am, said data set {XCm(k)} comprises measurements of said plurality of different cellular constituents measured in a sample having been subject to condition C, and where k=1, 2, . . . , N is an index of measurements of cellular constituents, N being the total number of measurements. The experiment profile XAm and reference profile XCm are preferably measured in the same experimental reaction.
In one embodiment, each said pair of profiles XAm and XCm is measured in a two-channel microarray experiment. Preferably, said reference profiles {XCm}, m=1, 2, . . . , M, are measured with samples labeled with a same label. In another embodiment, at least one of said pair of profiles {XAm, XCm} is a virtual profile.
In one embodiment, said step (a) of the method comprises normalizing each said experiment profile XAm and reference profile XCm. In a preferred embodiment, said normalizing is carried out according to equation
where NAm and NCm denote normalized experiment and normalized reference profiles, respectively, where $\overline{XA_m}$ is an average of profile {XAm}, and $\overline{XC_m}$ is an average of profile {XCm}; wherein $\overline{XAC}$ is an average of all profiles calculated according to equation
In another embodiment, the method of the invention further comprises normalizing errors of said experiment profile XAm and reference profile XCm according to equation
where σmXA(k) and σmXC(k) are the standard error of XAm(k) and XCm(k), respectively, and σmA(k) and σmC(k) are normalized standard error of Am(k) and Cm(k), respectively.
In still another embodiment, the method of the invention further comprises normalizing background errors of said experiment profile XAm and reference profile XCm according to equation
where bkgstdmXA(k) and bkgstdmXC(k) are the standard background error of XAm(k) and XCm(k), respectively, and bkgstdmA(k) and bkgstdmC(k) are normalized standard background error of Am(k) and Cm(k), respectively.
Preferably, said $\overline{XA_m}$ and $\overline{XC_m}$ are an average of measurements in profile {XAm} and {XCm}, respectively, excluding measurements having values among the highest 10%.
In still another embodiment, said step (a) of the invention further comprises transforming said normalized profiles to obtain transformed profiles. In one embodiment, said transforming is carried out according to equations
where experiment profile XAm comprises measured data set {XAm(k)}, said reference profile XCm comprises measured data set {XCm(k)}, where d is described by equation
and where a is the fractional error coefficient of said experiment, b is the Poisson error coefficient of said experiment, and c is the standard deviation of background noise of said experiment.
In still another embodiment, said step (a) of the invention further comprises removing nonlinearity from each said transformed experiment profile TAm and transformed reference profile TCm. In one embodiment, said removing nonlinearity is carried out by a method comprising (a1) determining an average transformed profile of all transformed experiment profiles {TAm} and transformed reference profiles {TCm}; and (a2) adjusting each TAm or TCm using a difference between said TAm or TCm and said average transformed profile. In a preferred embodiment, said difference is determined using a subset of measurements in said transformed profiles. In one embodiment, said subset of measurements in said transformed profiles consists of measurements that are ranked similarly between an experiment or reference profile and said average profile. In one embodiment, said adjusting in said step (a2) is carried out by a method comprising: (a2i) binning measurements in said subset into a plurality of bins, each said bin consisting of measurements having a value in a given range; (a2ii) calculating mean difference between said TAm or TCm and the average profile in each bin; (a2iii) determining a curve of said mean difference as a function of values of measurements for said TAm or TCm, nonlinear_TAm or nonlinear_TCm, respectively; and (a2iv) adjusting TAm or TCm according to equations
TAmcorr(k)=TAm(k)−nonlinear_TAm(k)
or
TCmcorr(k)=TCm(k)−nonlinear_TCm(k)
where k=1, . . . , N.
In one embodiment, said $\bar{C}(k)$ is calculated according to equation
$$\bar{C}(k)=\frac{1}{M}\sum_{m=1}^{M}C_m(k)$$
wherein said differential reference profile is calculated according to equation
$$C_{\mathrm{diff}}(m,k)=C_m(k)-\bar{C}(k)$$
and wherein said error-adjusted profile is calculated according to equation
$$A'_m(k)=A_m(k)-C_{\mathrm{diff}}(m,k).$$
In another embodiment, the method further comprises (d) calculating for at least one, preferably each profile pair m an error-corrected experiment profile A″m comprising data set {A″m(k)} by combining said error-adjusted experiment profile A′m with said experiment profile Am using a weighing factor {w(k)}, k=1, 2, . . . , N, wherein w(k) is a weighing factor for the k′th measurement.
In a preferred embodiment, said error-corrected experimental profile A″m is calculated according to equation
A″m(k)=(1−w(k))·Am(k)+w(k)·A′m(k).
In one embodiment, said weighing factor is determined according to equation
where avg_bkgstd is an average background noise. In one embodiment, said avg_bkgstd is determined according to equation
where bkgstd (m, k) is background standard error of Cm(k).
In another embodiment, the method further comprises determining errors {σ′m} of said error-adjusted experiment profile {A′m}. In one embodiment, said errors are determined according to equation
where σm(k) is the standard error of Am(k), mixed_σm(k) is determined according to equation
and where Cor(k) is a correlation coefficient between experiment profile Am and reference profile Cm. In one embodiment, said Cor(k) is determined according to equation
where CorMax is a number between 0 and 1.
In another embodiment, the method further comprises determining errors {σ″m} of said error-corrected experiment profile {A″m}. In one embodiment, said errors are determined according to equation
$$\sigma''_m(k)=\sqrt{[1-w(k)]\cdot\sigma_m^{2}(k)+w(k)\cdot\sigma'_m(k)}$$
where σm(k) is the standard error of Am(k), σ′m(k) is determined according to equation
where mixed_σm(k) is determined according to equation
and where Cor(k) is a correlation coefficient. In one embodiment, said Cor(k) is determined according to equation
where CorMax is a number between 0 and 1.
The invention further provides a method for generating a differential profile A vs. B from differential profiles A vs. CA and B vs. CB, comprising calculating said differential profile A vs. B according to equation
lratioAB(k)=polarityAC·lratioAC(k)−polarityBC·lratioBC(k)
where k=1, 2, . . . , N, is the index of measurements in a profile, N being the total number of measurements; wherein lratioAC(k)=Log{A(k)/CA(k)}, if polarityAC=1, and lratioAC(k)=Log{CA(k)/A(k)}, if polarityAC=−1, where A(k) and CA(k) are the k′th measurements from samples A and CA, respectively; wherein lratioBC(k)=Log{B(k)/CB(k)}, if polarityBC=1, and lratioBC(k)=Log{CB(k)/B(k)}, if polarityBC=−1, where B(k) and CB(k) are the k′th measurements from samples B and CB, respectively; wherein {A(k)} represents measurements of a plurality of different cellular constituents measured in a sample having been subject to condition A, {B(k)} represents measurements of said plurality of different cellular constituents measured in a sample having been subject to condition B, and {CA(k)} and {CB(k)} each represent measurements of said plurality of different cellular constituents measured in a sample having been subject to condition C. In one embodiment, A vs. CA and B vs. CB are experimentally measured profiles. In another embodiment, at least one of A vs. CA and B vs. CB is a virtual profile.
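By way of illustration only, the re-ratio calculation above can be sketched as follows. The function names are hypothetical, and log10 is used as one admissible choice of logarithm.

```python
import math

def lratio(x, c, polarity):
    """Log ratio as stored for a profile: Log{x/c} when polarity is 1,
    Log{c/x} when polarity is -1."""
    return math.log10(x / c) if polarity == 1 else math.log10(c / x)

def reratio(lratio_ac, polarity_ac, lratio_bc, polarity_bc):
    """lratioAB(k) = polarityAC * lratioAC(k) - polarityBC * lratioBC(k)."""
    return [polarity_ac * ac - polarity_bc * bc
            for ac, bc in zip(lratio_ac, lratio_bc)]

# Hypothetical single-measurement example. The B profile was recorded with
# reversed polarity, so multiplying by its polarity restores Log{B/CB}.
a, c_a, b, c_b = 200.0, 100.0, 50.0, 100.0
ac = [lratio(a, c_a, 1)]
bc = [lratio(b, c_b, -1)]
ab = reratio(ac, 1, bc, -1)
```

When CA(k) and CB(k) are equal, the reference terms cancel and the result equals Log{A(k)/B(k)}.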
In one embodiment, the method further comprises calculating an error of differential profile A vs. B according to equation
wherein σlratioAC(k) and σlratioBC(k) are errors of lratioAC(k) and lratioBC(k), respectively, and wherein CorMax is an estimated maximum correlation coefficient between errors of A/C and B/C.
The invention also provides a computer system comprising a processor and a memory coupled to said processor and encoding one or more programs, wherein said one or more programs cause the processor to carry out any one of the methods of the invention.
The invention also provides a computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, said computer program product comprising a computer readable storage medium having a computer program mechanism encoded thereon, wherein said computer program mechanism may be loaded into the memory of said computer and cause said computer to carry out any one of the methods of the invention.
FIGS. 44A-B are all-signature-ROC plots of (A) Ratio-Splitter and (B) Re-Ratioer. All detected differentially expressed feature-level signatures are included in the study. Both of them have the near common reference pools. The thick solid black line is the ROC curve of the fluor-reversal combined real ratio experiments of the original data. The thin solid black line is the ROC curve of the real single red-vs-green experiment without fluor-reversal combination. These two lines are the same in (A) and (B). They are the reference ROC curves in the all-signature comparison. The dotted thin black straight line is the random decision ROC curve where there is no statistical power.
FIGS. 45A-B are weak-signature-ROC plots of (A) Ratio-Splitter and (B) Re-Ratioer. Strong signatures of more than 1.2-fold in the real combined experiments are excluded in the study. Both of them have the near common reference pools. The thick solid black line is the ROC curve of the fluor-reversal combined real ratio experiments of the original data. The thin solid black line is the ROC curve of the real single red-vs-green experiment without fluor-reversal combination. These two lines are the same in (A) and (B). They are the reference ROC curves in the weak-signature comparison.
FIGS. 46A-B are all-signature-ROC plots of (A) Ratio-Splitter and (B) Re-Ratioer. Both of them have the distant common reference pools.
FIGS. 47A-B are weak-signature-ROC plots of (A) Ratio-Splitter and (B) Re-Ratioer. Both of them have the distant common reference pools.
FIGS. 48A-B are (A) all-signature-ROC plot and (B) weak-signature-ROC plot of Ratio-Splitter without common reference controls. Neither has ISEC applied.
The present invention provides methods for analyzing multi-channel profiles, e.g., two-channel profiles. For example, an R-channel profile 1A/2A/ . . . /R−1A/C (R is an integer) comprises measurements of a plurality of samples 1A, 2A, . . . , R−1A, and C, where measurements of each sample constitute one channel. Thus, a multi-channel profile can comprise a plurality of profiles each representing measurements of one sample. A frequently encountered multi-channel profile is a two-channel profile, e.g., a two-color intensity profile. Herein, for simplicity, methods for analyzing multi-channel profiles are often discussed with reference to two-channel profiles. It will be understood that such methods are readily applicable to multi-channel profiles.
A two-channel profile A vs. C comprises measurements of two samples A and C, where measurements from each sample constitute one channel. Thus, a two-channel profile can comprise a pair of profiles each representing measurements of one sample. A two-channel profile can also be a differential profile. As used herein, a differential profile refers to a collection of changes of measurements of cellular constituents, e.g., changes in expression levels of nucleic acid species or changes in abundances of protein species, in cell samples under different conditions, e.g., under the perturbations of different drugs, under different environmental conditions, and so on. The pair of profiles may be measured concurrently in one experiment. Such a two-channel profile is also referred to as an experimental two-channel profile. A person skilled in the art will understand that a two-channel profile can be a pair of profiles selected from a multi-channel profile having additional profiles. For example, a two-channel profile consisting of a green channel profile and a red channel profile may be obtained from a three-channel profile which also comprises a blue channel. The pair of profiles may also be measured separately and combined together. Methods for combining separately measured profile data sets are described in this application and in U.S. Pat. Nos. 6,351,712 and 6,691,042, each of which is incorporated herein by reference in its entirety. A two-channel profile that comprises a pair of separately measured profiles is also referred to as a virtual two-channel profile. In preferred embodiments, C in a two-channel profile, either experimental or virtual, is a reference sample. In such cases, measurements of sample C are also referred to as the reference channel, and the corresponding measurements of sample A are also referred to as the experiment channel.
The invention provides a method for correcting systematic cross-profile (cross-experiment) errors among a plurality of multi-channel profiles having a common reference channel. A common reference channel or common reference profile refers to profiles measured using reference samples that are nominally the same, i.e., prepared the same way. The method involves estimating the cross-experiment errors using profiles in the common reference channel, and removing such cross-experimental errors from profiles in the experiment channels. In one embodiment, an average reference profile is obtained by averaging the profiles of the common reference channel. The systematic cross-experiment error in each individual multi-channel profile is then determined by comparing the reference channel profile in the multi-channel profile with the average reference profile. Such systematic cross-experiment error can be represented as an error profile. The systematic cross-experiment error can then be removed from the experiment channel, e.g., by subtracting the error profile from the experiment profile. The obtained error-corrected experiment channel data can then be used in comparison with each other, e.g., in generating virtual differential profiles between pairs of experiment channels.
Profiles of measurements of cellular constituents, e.g., measured expression levels of nucleic acid species, in a cell sample having been subject to a particular condition, e.g., conditions A, B, or C, are represented as sets of data {A(k)}, {B(k)}, and {C(k)}, respectively, in which k=1, 2, . . . , N, and N is the number of measurements of cellular constituents or, equivalently, the number of probes used to carry out the measurement. Herein, for convenience, such data sets are often referred to as A, B, or C. It will be understood by one of ordinary skill in the art that a profile of measurements may comprise redundant measurements. For example, the same probe may be printed at more than one location on an array. A profile obtained from such an array comprises more than one measurement of the probe, each obtained from the probe at a different probe site. Herein, each such measurement is also referred to as a feature. The changes in measurements of cellular constituents, e.g., expression levels, can be characterized by any convenient metric, e.g., arithmetic difference, ratio, log(ratio), etc. The mathematical operation log can be any logarithm operation. Preferably, it is the natural log or log10. As used herein, a differential profile A vs. B is defined as a profile representing changes of cellular constituents, e.g., expression levels of nucleic acid species or abundances of protein species, from A to B, e.g., B-A, when an arithmetic difference is used, or B/A, when a ratio is used, where the difference or ratio is calculated for each feature. Differential profiles obtained from mathematical operations, e.g., arithmetic difference, ratio, log(ratio), etc., on the measured data sets, e.g., A, B, or C, are often referred to by short-hand symbols, e.g., A-B, A/B, or log(A/B).
It will be understood by one skilled in the art that when such short-hand symbols are used, they refer to data sets representing the differential profiles that contain data points resulting from the respective mathematical operation. For example, differential profile A-B refers to a differential profile comprising data set {A(k)−B(k)}, whereas differential profile log(B/A) refers to a differential profile comprising data set {log[B(k)/A(k)]}. Thus, for example, a differential profile A vs. B can comprise a collection of ratios of expression levels {B(k)/A(k)}, or log(ratio)'s, i.e., {log[B(k)/A(k)]}, and so on. It will be apparent to one skilled in the art that a differential profile can be a response profile as described in Section 5.1.2, infra.
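The feature-by-feature construction of a differential profile A vs. B described above can be illustrated with a minimal Python sketch. The function name `differential_profile` and the `metric` labels are hypothetical names chosen for illustration, and log10 is used only because it is one of the preferred logarithms named above.

```python
import math

def differential_profile(a, b, metric="log_ratio"):
    """Differential profile A vs. B: change from A to B for each feature k."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of features")
    if metric == "difference":
        return [bk - ak for ak, bk in zip(a, b)]            # {B(k) - A(k)}
    if metric == "ratio":
        return [bk / ak for ak, bk in zip(a, b)]            # {B(k) / A(k)}
    if metric == "log_ratio":
        return [math.log10(bk / ak) for ak, bk in zip(a, b)]  # {log[B(k)/A(k)]}
    raise ValueError(f"unknown metric: {metric}")

# Three features measured under conditions A and B (made-up intensities).
A = [100.0, 200.0, 50.0]
B = [1000.0, 200.0, 5.0]
log_ab = differential_profile(A, B)
```

In this made-up example the first feature increases ten-fold, the second is unchanged, and the third decreases ten-fold.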
The methods of the invention are applicable to any type of multi-channel profile, including but not limited to profiles of raw measurements, e.g., raw fluorescence intensities, or transformed profiles. Any type of suitably transformed profile can be used in the present invention. In one embodiment, log(intensity) is used. In a preferred embodiment, transformed profiles obtained by the methods described in U.S. patent application Ser. No. 10/354,664, filed on Jan. 30, 2003, which is incorporated by reference herein in its entirety, are used.
As used herein, a “same-type” or “same vs. same” profile or differential profile is often referred to. As used herein, a same-type profile or differential profile refers to a profile or differential profile for which the two conditions are the same, e.g., C vs. C. In a preferred embodiment, a same-type profile or differential profile contains data measured from a biological sample in a base-line state. As used herein, a “baseline state” refers to a state of a biological sample that is a reference or control state.
As used herein, a “single-channel measurement” refers broadly to any measurements of cellular constituents made on a sample having been subject to a given condition in a single experimental reaction, whereas a “two-channel measurement” refers to any measurements of cellular constituents made distinguishably and concurrently on two different samples in the same experimental reaction. The term “same experimental reaction” refers to use in the same reaction mixture, i.e., by contacting with the same reagents in the same composition at the same time (e.g., using the same microarray for nucleic acid hybridization to measure mRNA, cDNA or amplified RNA; or the same antibody array to measure protein levels). Data generated in a single-channel measurement of a sample subject to condition A are often represented as A, whereas data generated in a two-channel measurement of two samples having been subject to conditions A and B, respectively, are often represented as A vs. B. For example, measurement of the expression level of a gene in a cell sample having been subject to an environmental perturbation A obtained in a single-color microarray experiment is a single-channel measurement A. On the other hand, measurement of the expression levels of the genes in two cell samples, one having been subject to condition A and one having been subject to condition C, obtained in a single two-color fluorescence experiment is a two-channel measurement A vs. C. In some embodiments, a two-channel measurement such as A vs. C can be broken into two separate single-channel measurements A and C. In this invention, a pair of two-channel measurements comprising measurements of samples having been subject to a common condition in one of the two channels is often of interest.
In such cases, data associated with the common condition may further be identified by their association with the other condition in each two-channel measurement, e.g., CA identifying the data set measured using a sample having been subject to condition C in a two-channel measurement A vs. CA, and CB identifying the data set measured on a sample having been subject to condition C in a two-channel measurement B vs. CB. Any type of single-channel and/or two-channel measurement known in the art can be used in the invention. Preferably, when single-channel measurements are used for generation of a differential profile, the two single-channel measurements are of the same type, e.g., both fluorescence measurements. Expression measurements made distinguishably and concurrently on more than two different samples, e.g., N-color fluorescence experiments, where N is greater than two, can also be used in generation of differential expression profiles by the methods of the present invention.
Although the methods of the present invention are often described for microarray-based expression measurements, it will be apparent to one skilled in the art that the methods of the present invention can also be adapted for generating response profiles of other types of cellular constituents.
The state of a cell or other biological sample is represented by cellular constituents (any measurable biological variables) as defined in Section 5.1.1, infra. Those cellular constituents vary in response to perturbations, or under different conditions.
As used herein, the term “biological sample” is broadly defined to include any cell, tissue, organ or multicellular organism. A biological sample can be derived, for example, from cell or tissue cultures in vitro. Alternatively, a biological sample can be derived from a living organism or from a population of single cell organisms.
The state of a biological sample can be measured by the content, activities or structures of its cellular constituents. The state of a biological sample, as used herein, is taken from the state of a collection of cellular constituents, which are sufficient to characterize the cell or organism for an intended purpose including, but not limited to characterizing the effects of a drug or other perturbation. The term “cellular constituent” is also broadly defined in this disclosure to encompass any kind of measurable biological variable. The measurements and/or observations made on the state of these constituents can be of their abundances (i.e., amounts or concentrations in a biological sample), or their activities, or their states of modification (e.g., phosphorylation), or other measurements relevant to the biology of a biological sample. In various embodiments, this invention includes making such measurements and/or observations on different collections of cellular constituents. These different collections of cellular constituents are also called herein aspects of the biological state of a biological sample.
One aspect of the biological state of a biological sample (e.g., a cell or cell culture) usefully measured in the present invention is its transcriptional state. In fact, the transcriptional state is the currently preferred aspect of the biological state measured in this invention. The transcriptional state of a biological sample includes the identities and abundances of the constituent RNA species, especially mRNAs, in the cell under a given set of conditions. Preferably, a substantial fraction of all constituent RNA species in the biological sample are measured, but at least a sufficient fraction is measured to characterize the action of a drug or other perturbation of interest. The transcriptional state of a biological sample can be conveniently determined by, e.g., measuring cDNA abundances by any of several existing gene expression technologies. One particularly preferred embodiment of the invention employs DNA arrays for measuring mRNA or transcript level of a large number of genes. The other preferred embodiment of the invention employs DNA arrays for measuring expression levels of a large number of exons in the genome of an organism.
Another aspect of the biological state of a biological sample usefully measured in the present invention is its translational state. The translational state of a biological sample includes the identities and abundances of the constituent protein species in the biological sample under a given set of conditions. Preferably, a substantial fraction of all constituent protein species in the biological sample is measured, but at least a sufficient fraction is measured to characterize the action of a drug of interest. As is known to those of skill in the art, the transcriptional state is often representative of the translational state.
Other aspects of the biological state of a biological sample are also of use in this invention. For example, the activity state of a biological sample, as that term is used herein, includes the activities of the constituent protein species (and also optionally catalytically active nucleic acid species) in the biological sample under a given set of conditions. As is known to those of skill in the art, the translational state is often representative of the activity state.
This invention is also adaptable, where relevant, to “mixed” aspects of the biological state of a biological sample in which measurements of different aspects of the biological state of a biological sample are combined. For example, in one mixed aspect, the abundances of certain RNA species and of certain protein species, are combined with measurements of the activities of certain other protein species. Further, it will be appreciated from the following that this invention is also adaptable to other aspects of the biological state of the biological sample that are measurable.
The biological state of a biological sample (e.g., a cell or cell culture) is represented by a profile of some number of cellular constituents. Such a profile of cellular constituents can be represented by the vector S: S=[S1, . . . Si, . . . Sk], where Si is the level of the i′th cellular constituent, for example, the transcript level of gene i, or alternatively, the abundance or activity level of protein i.
In some embodiments, cellular constituents are measured as continuous variables. For example, transcriptional rates are typically measured as number of molecules synthesized per unit of time. Transcriptional rate may also be measured as percentage of a control rate. However, in some other embodiments, cellular constituents may be measured as categorical variables. For example, transcriptional rates may be measured as either “on” or “off”, where the value “on” indicates a transcriptional rate above a predetermined threshold and value “off” indicates a transcriptional rate below that threshold.
The responses of a biological sample to a perturbation, i.e., under a condition, such as the application of a drug, can be measured by observing the changes in the biological state of the biological sample. A response profile is a collection of changes of cellular constituents. In the present invention, the response profile of a biological sample (e.g., a cell or cell culture) to the perturbation m is defined as the vector v(m):
$$v^{(m)} = [v_1^{m}, \ldots, v_i^{m}, \ldots, v_k^{m}]$$
where vim is the amplitude of response of cellular constituent i under the perturbation m. In some particularly preferred embodiments of this invention, the biological response to the application of a drug, a drug candidate or any other perturbation, is measured by the induced change in the transcript level of at least 2 genes, preferably more than 10 genes, more preferably more than 100 genes and most preferably more than 1,000 genes. In another preferred embodiment of the invention, the biological response to the application of a drug, a drug candidate or any other perturbation, is measured by the induced change in the expression levels of a plurality of exons in at least 2 genes, preferably more than 10 genes, more preferably more than 100 genes and most preferably more than 1,000 genes.
In some embodiments of the invention, the response is simply the difference between biological variables before and after perturbation. In some preferred embodiments, the response is defined as the ratio of cellular constituents before and after a perturbation is applied.
In some preferred embodiments, vim is set to zero if the response of gene i is below some threshold amplitude or confidence level determined from knowledge of the measurement error behavior. In such embodiments, those cellular constituents whose measured responses are lower than the threshold are given the response value of zero, whereas those cellular constituents whose measured responses are greater than the threshold retain their measured response values. This truncation of the response vector is a good strategy when most of the smaller responses are expected to be greatly dominated by measurement error. After the truncation, the response vector v(m) also approximates a ‘matched detector’ (see, e.g., Van Trees, 1968, Detection, Estimation, and Modulation Theory Vol. I, Wiley & Sons) for the existence of similar perturbations. It is apparent to those skilled in the art that the truncation levels can be set based upon the purpose of detection and the measurement errors. For example, in some embodiments, genes whose transcript level changes are lower than two fold or more preferably four fold are given the value of zero.
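The truncation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the responses are taken to be log10 ratios, the threshold is applied to the magnitude of the change (so both induction and repression are retained), and the function name `truncate_response` is hypothetical.

```python
import math

def truncate_response(log_ratios, fold_threshold=2.0):
    """Zero out responses whose fold change is below the threshold."""
    cutoff = math.log10(fold_threshold)  # e.g., two-fold -> about 0.301
    return [v if abs(v) >= cutoff else 0.0 for v in log_ratios]

# Small responses are set to zero; large ones keep their measured values.
truncated = truncate_response([0.1, -0.5, 0.7, -0.2])
```

A stricter cutoff, such as the four-fold level mentioned above, is obtained by passing `fold_threshold=4.0`.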
In some preferred embodiments, perturbations are applied at several levels of strength. For example, different amounts of a drug may be applied to a biological sample to observe its response. In such embodiments, the perturbation responses may be interpolated by approximating each by a single parameterized “model” function of the perturbation strength u. An exemplary model function appropriate for approximating transcriptional state data is the Hill function, which has adjustable parameters a, u0, and n:
$$H(u) = \frac{a\,(u/u_0)^{n}}{1 + (u/u_0)^{n}}$$
The adjustable parameters are selected independently for each cellular constituent of the perturbation response. Preferably, the adjustable parameters are selected for each cellular constituent so that the sum of the squares of the differences between the model function (e.g., the Hill function) and the corresponding experimental data at each perturbation strength is minimized. This preferable parameter adjustment method is well known in the art as a least squares fit. Other possible model functions are based on polynomial fitting, for example by various known classes of polynomials. More detailed description of model fitting and biological response has been disclosed in Friend and Stoughton, Methods of Determining Protein Activity Levels Using Gene Expression Profiles, U.S. Pat. No. 6,324,479, which is incorporated herein by reference for all purposes.
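A least squares fit of this kind can be sketched as follows for a single cellular constituent. The sketch assumes the commonly used form of the Hill function, a·(u/u0)^n / (1 + (u/u0)^n), and uses a coarse grid search over u0 and n with the amplitude a solved in closed form; the grid search, the function names, and the synthetic data are illustrative assumptions, not the fitting procedure of the referenced patent.

```python
def hill(u, a, u0, n):
    """Hill function: a * (u/u0)^n / (1 + (u/u0)^n)."""
    x = (u / u0) ** n
    return a * x / (1.0 + x)

def fit_hill(strengths, responses, u0_grid, n_grid):
    """Least-squares fit: grid search over (u0, n), closed-form amplitude a."""
    best = None
    for u0 in u0_grid:
        for n in n_grid:
            shape = [hill(u, 1.0, u0, n) for u in strengths]
            denom = sum(s * s for s in shape)
            if denom == 0.0:
                continue
            # For fixed u0 and n the model is linear in a.
            a = sum(r * s for r, s in zip(responses, shape)) / denom
            sse = sum((r - a * s) ** 2 for r, s in zip(responses, shape))
            if best is None or sse < best[0]:
                best = (sse, a, u0, n)
    return best  # (sum of squared errors, a, u0, n)

# Synthetic responses generated from a=2, u0=1, n=2 at five drug strengths.
strengths = [0.1, 0.3, 1.0, 3.0, 10.0]
responses = [hill(u, 2.0, 1.0, 2.0) for u in strengths]
best_fit = fit_hill(strengths, responses, [0.5, 1.0, 2.0], [1.0, 2.0, 3.0])
```

In practice a continuous optimizer would usually replace the grid, and the fit would be repeated independently for each cellular constituent.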
The invention provides a method for generating a virtual ratio profile from two two-channel profiles. The two input two-channel profiles can be both experimental, both virtual, or one experimental and one virtual. In one embodiment, the invention provides a method termed “re-ratioer,” which takes two input ratio profiles A/C and B/C and generates a new “virtual” ratio profile or experiment A/B. It does not require the raw intensity information.
Data fields for input experiment C-vs-B (B/C) are similarly defined.
The re-ratioer computes data fields of the new virtual ratio experiment B-vs-A (A/B) as follows:
$$\mathrm{lratio}_{AB}(k) = \mathrm{Polarity}_{AC}\cdot\mathrm{lratio}_{AC}(k) - \mathrm{Polarity}_{BC}\cdot\mathrm{lratio}_{BC}(k) \qquad (1)$$
$$\mathrm{Polarity}_{AB} = +1 \qquad (3)$$
if $\mathrm{Polarity}_{AC} > 0$ and $\mathrm{Polarity}_{BC} > 0$:
$$\mathrm{Intensity1}_{AB}(k) = \sqrt{\mathrm{Intensity1}_{AC}(k)\cdot\mathrm{Intensity2}_{BC}(k)} \qquad (4)$$
$$\mathrm{Intensity2}_{AB}(k) = \sqrt{\mathrm{Intensity2}_{AC}(k)\cdot\mathrm{Intensity1}_{BC}(k)} \qquad (5)$$
if $\mathrm{Polarity}_{AC} < 0$ and $\mathrm{Polarity}_{BC} < 0$:
$$\mathrm{Intensity1}_{AB}(k) = \sqrt{\mathrm{Intensity2}_{AC}(k)\cdot\mathrm{Intensity1}_{BC}(k)} \qquad (6)$$
$$\mathrm{Intensity2}_{AB}(k) = \sqrt{\mathrm{Intensity1}_{AC}(k)\cdot\mathrm{Intensity2}_{BC}(k)} \qquad (7)$$
if $\mathrm{Polarity}_{AC} > 0$ and $\mathrm{Polarity}_{BC} < 0$:
$$\mathrm{Intensity1}_{AB}(k) = \sqrt{\mathrm{Intensity1}_{AC}(k)\cdot\mathrm{Intensity1}_{BC}(k)} \qquad (8)$$
$$\mathrm{Intensity2}_{AB}(k) = \sqrt{\mathrm{Intensity2}_{AC}(k)\cdot\mathrm{Intensity2}_{BC}(k)} \qquad (9)$$
if $\mathrm{Polarity}_{AC} < 0$ and $\mathrm{Polarity}_{BC} > 0$:
$$\mathrm{Intensity1}_{AB}(k) = \sqrt{\mathrm{Intensity2}_{AC}(k)\cdot\mathrm{Intensity2}_{BC}(k)} \qquad (10)$$
$$\mathrm{Intensity2}_{AB}(k) = \sqrt{\mathrm{Intensity1}_{AC}(k)\cdot\mathrm{Intensity1}_{BC}(k)} \qquad (11)$$
In Equation 2, the parameter CorMax is the estimated maximum correlation coefficient between errors of A/C and B/C. CorMax has a value in the range of 0 to 1. The default value of CorMax is 0.5. It is the only adjustable parameter shown in
The re-ratioer can be applied when the end result is a ratio experiment A/B and the available input ratio experiments have a common reference C. For example, in a pooled experiment design, there are real ratio experiments of compound-vs-pool and vehicle-vs-pool. The re-ratioer can be used to derive a virtual ratio experiment of compound-vs-vehicle. The re-ratioer can also be used in looped designs to derive distant ratios. For example, given real profiles A/B, B/D, and D/E, virtual experiment A/D can first be obtained from A/B and B/D. Virtual A/E can then be obtained from the virtual A/D and the real D/E.
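The log-ratio and intensity computations of Equations 1 and 3 through 11 can be sketched in Python as follows. The dictionary layout and the name `re_ratio` are hypothetical, polarities are assumed to be exactly +1 or −1, and the error computation of Equation 2 is omitted because it is not reproduced in this text.

```python
import math

def re_ratio(ac, bc):
    """Virtual ratio experiment from two ratio experiments sharing reference C.

    Each input is a dict with keys 'polarity' (+1 or -1), 'lratio',
    'intensity1' and 'intensity2' (lists with one value per feature k).
    """
    pa, pb = ac["polarity"], bc["polarity"]
    # Equation 1: polarity-signed difference of the input log ratios.
    lratio = [pa * x - pb * y for x, y in zip(ac["lratio"], bc["lratio"])]
    # Equations 4-11: the polarities decide which input channels are paired.
    a1 = ac["intensity1"] if pa > 0 else ac["intensity2"]
    a2 = ac["intensity2"] if pa > 0 else ac["intensity1"]
    b1 = bc["intensity2"] if pb > 0 else bc["intensity1"]
    b2 = bc["intensity1"] if pb > 0 else bc["intensity2"]
    return {
        "polarity": 1,  # Equation 3
        "lratio": lratio,
        "intensity1": [math.sqrt(x * y) for x, y in zip(a1, b1)],
        "intensity2": [math.sqrt(x * y) for x, y in zip(a2, b2)],
    }

# One-feature example with made-up values (PolarityAC > 0, PolarityBC < 0).
ac = {"polarity": 1, "lratio": [0.5], "intensity1": [4.0], "intensity2": [16.0]}
bc = {"polarity": -1, "lratio": [0.2], "intensity1": [9.0], "intensity2": [4.0]}
ab = re_ratio(ac, bc)
```

Because the output has the same fields as the inputs, it can be passed back into `re_ratio` to derive a more distant ratio in a looped design.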
The main advantage of the re-ratioer is its simplicity. The new ratio is directly derived from two input ratios (Equation 1). No normalization is needed, and intensities are not involved in the ratio computation. The user only needs to specify the two inputs: one is the numerator (experiment) of the new virtual ratio and the other is the denominator (baseline) of the new ratio. Either input can be a real or virtual ratio profile or experiment. Pre-combined ratio experiments can be used directly as inputs.
The re-ratioer has its limitations. The two input ratio experiments must have a common reference C. The common reference itself will introduce errors. This error will accumulate when distant ratios are derived along a looped design. The output of the re-ratioer is a new ratio experiment. It does not provide individual intensity experiments A, B, etc.
When sequences in the common reference C are expressed, the two intensity measurements of C in A/C and B/C effectively serve as control references to reduce the inter-slide variation between the two inputs when the new ratio A/B is calculated using Equation 1. However, when the expression of C is very weak, the noise in C may cause the control reference to fluctuate. When intensity C is near zero, it becomes a zero/zero situation. The resulting log-ratio becomes unstable. Examples in Section 6 demonstrate the limitation.
The invention provides a method for correcting errors in a plurality of pairs of profiles {Am, Cm}, where m=1, 2, . . . , M, and M is the number of pairs of profiles. Each pair of profiles consists of an experiment profile Am comprising data set {Am(k)} and a reference profile Cm comprising data set {Cm(k)}, where k=1, 2, . . . , N, and N is the number of measurements in each profile. In preferred embodiments, N is at least 10, at least 100, at least 1,000, or at least 10,000. Data set {Am(k)} comprises measurements or transformed measurements of a plurality of different cellular constituents measured in a sample having been subject to condition Am, and data set {Cm(k)} comprises measurements or transformed measurements of the plurality of different cellular constituents measured in a sample having been subject to condition C. Each pair of profiles can be a pair of profiles selected from a multi-channel profile having additional profiles. Preferably, experiment profile Am and reference profile Cm are measured in the same experimental reaction. For example, the pair of profiles {Am, Cm} can be a two-channel profile measured in the mth experimental reaction. The profiles can be measured profiles. The profiles can also be transformed profiles. For example, each Cm, m∈{1, 2, . . . , M}, can represent measurements or transformed measurements of a plurality of different cellular constituents measured in a sample having been subject to common condition C. The method of the invention involves determining a systematic error in each experiment profile Am based on the corresponding reference profile Cm, and removing such systematic error from the experiment profile. The obtained error-corrected experiment profiles can then be further analyzed, e.g., directly compared using a difference or ratio, used as input data in ANOVA, and so on.
In one embodiment, an average reference profile $\bar{C}$ of the M reference profiles {Cm} is first determined according to equation
$$\bar{C}(k) = \frac{1}{M}\sum_{m=1}^{M} C_m(k) \qquad (12)$$
This average reference profile $\bar{C}$ is then used as the common reference for the M profiles. The deviation of each reference profile Cm from $\bar{C}$ is calculated as a differential reference profile
$$C_{\mathrm{diff}}(m, k) = C_m(k) - \bar{C}(k) \qquad (13)$$
and is used as the systematic bias of Am. This differential reference profile can be used to correct Am according to equation
$$A'_m(k) = A_m(k) - C_{\mathrm{diff}}(m, k) \qquad (14)$$
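The averaging, differencing, and subtraction steps above can be sketched in Python as follows. The function name and the list-of-lists layout (one inner list per experimental reaction m, one entry per feature k) are illustrative assumptions, and the numbers are made up.

```python
def correct_with_common_reference(experiments, references):
    """Remove systematic cross-experiment error estimated from a common reference.

    experiments[m][k] and references[m][k] hold M pairs of profiles with N
    features each. Returns the average reference and the adjusted profiles.
    """
    m_count = len(references)
    n_count = len(references[0])
    # Average reference profile over the M reference channels.
    avg_ref = [sum(ref[k] for ref in references) / m_count for k in range(n_count)]
    corrected = []
    for a_m, c_m in zip(experiments, references):
        # Differential reference profile (Equation 13).
        c_diff = [c_m[k] - avg_ref[k] for k in range(n_count)]
        # Error-adjusted experiment profile (Equation 14).
        corrected.append([a_m[k] - c_diff[k] for k in range(n_count)])
    return avg_ref, corrected

# Two reactions measuring the same sample with opposite slide-specific biases.
experiments = [[10.0, 20.0], [12.0, 18.0]]
references = [[5.0, 8.0], [7.0, 6.0]]
avg_ref, adjusted = correct_with_common_reference(experiments, references)
```

In this example the two experiment channels differ only by the bias that also appears in their reference channels, so both adjusted profiles come out as [11.0, 19.0].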
The errors {σ′m} of the error-adjusted experiment profile {A′m} can be determined according to equation
where σm(k) is the standard error of Am(k), mixed_σm(k) is determined according to equation
and where Cor(k) is a correlation coefficient between the experiment channel and the corresponding reference channel. This correlation may be intensity dependent. For example, when intensity is high, the correlation is strong, whereas when intensity is low and near the background noise level, the correlation is weak. In one embodiment, a simple correlation model is built to estimate Cor(k):
CorMax defines the maximum correlation. In some embodiments, CorMax is taken to be 0.5. CorMax can have a value between 0 and 1. A small CorMax makes the error estimate more conservative, while a large CorMax produces a smaller, more aggressive error estimate.
In some cases, e.g., when one or more measurements in the common reference profiles, e.g., the common-reference intensity, are near or below the background noise level, the correlation between the experiment and the reference channels decreases significantly. In such cases, correction of systematic bias using the above-described differential reference profile may add noise to such measurements in the corrected Am rather than reduce it. Thus, in a preferred embodiment, a weighting model is used. The weighting model involves calculating an error-corrected experiment profile A″m comprising data set {A″m(k)}, k=1, 2, . . . , N, by combining the error-adjusted experiment profile A′m, e.g., A′m as determined by equation (14), with the experiment profile Am using a weighting factor {w(k)} in such a manner that correction of each measurement by the corresponding difference value in the differential reference profile is smoothly phased out when the measurement in the common-reference profile approaches or falls below the background noise level. In one embodiment, the weighting model calculates an error-corrected experiment profile A″m according to equation
A″m(k)=(1−w(k))·Am(k)+w(k)·A′m(k) (19)
where w(k) is a weighting factor. In a preferred embodiment, the weighting factor is determined according to equation
where avg_bkgstd is an average background standard error. In one embodiment, avg_bkgstd is determined according to equation
where bkgstd (m, k) is background standard error of Cm(k).
The errors {σ″m} of error-corrected experiment profile {A″m} can be determined according to equation
$$\sigma''_m(k) = \sqrt{[1 - w(k)]\cdot\sigma_m^{2}(k) + w(k)\cdot{\sigma'_m}^{2}(k)} \qquad (22)$$
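The blending of Equation 19 and the accompanying error estimate can be sketched as follows. The weights w(k) are taken as an input because the equation defining them is not reproduced in this text. The sketch also assumes that both terms under the square root in Equation 22 are variances (squared standard errors). The function name is hypothetical.

```python
import math

def blend_correction(a, a_adj, sigma, sigma_adj, w):
    """Blend raw and error-adjusted measurements with per-feature weights in [0, 1].

    a, a_adj: raw and error-adjusted measurements of one experiment profile.
    sigma, sigma_adj: their standard errors.
    w: weights, near 1 where the reference is well above background noise.
    """
    # Equation 19: weighted combination of raw and adjusted measurements.
    blended = [(1.0 - wk) * ak + wk * adj for ak, adj, wk in zip(a, a_adj, w)]
    # Error of the blend, assuming variances are combined with the same weights.
    errors = [math.sqrt((1.0 - wk) * s ** 2 + wk * s_adj ** 2)
              for s, s_adj, wk in zip(sigma, sigma_adj, w)]
    return blended, errors

# Three features: no correction (w=0), half correction, full correction (w=1).
blended, errors = blend_correction(
    [10.0, 10.0, 10.0], [12.0, 12.0, 12.0],
    [3.0, 3.0, 3.0], [4.0, 4.0, 4.0],
    [0.0, 0.5, 1.0],
)
```

With w(k)=0 the raw measurement and its error pass through unchanged, and with w(k)=1 the fully adjusted values are used.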
The experiment and reference profiles {Am, Cm} can be transformed profiles. Data in such transformed profiles are transformed measurements. Any suitable type of transformed data may be used in conjunction with the present invention. In a preferred embodiment, the transformed measurements are obtained using the error model based transformation described in Section 5.4., infra.
The experiment profile Am and reference profile Cm can also be normalized profiles. In one embodiment, a normalized profile is obtained by normalizing data from all channels, i.e., experiment profiles {Am} and reference profiles {Cm}, according to equations
where NAm(k) and NCm(k) denote normalized measurements in the experiment and reference channels, respectively, $\bar{A}_m$ is an average of all or a portion of measurements in profile {Am(k)}, and $\bar{C}_m$ is an average of all or a portion of measurements in profile {Cm(k)}; $\overline{AC}$ is an average of all channels:
The errors of the normalized experiment profile NAm and reference profile NCm can be determined according to equation
where σmA(k) and σmC(k) are the standard error of Am(k) and Cm(k), respectively, and σmNA(k) and σmNC(k) are normalized standard error of NAm(k) and NCm(k), respectively.
The background errors of the normalized experiment profile NAm and reference profile NCm can be determined according to equation
where bkgstdmA(k) and bkgstdmC(k) are the standard background error of Am(k) and Cm(k), respectively, and bkgstdmNA(k) and bkgstdmNC(k) are normalized standard background error of NAm(k) and NCm(k), respectively.
In a preferred embodiment, the average or median of measurements in an experiment or reference profile or channel, $\bar{A}_m$ or $\bar{C}_m$, e.g., the channel brightness, is the average of a portion of the measurements in the respective channel. In one embodiment, the portion of measurements used in determining the averages is obtained by eliminating measurements having values above a certain level, e.g., measurements having intensities in a chosen highest intensity range. In a preferred embodiment, measurements having values among the highest 5%, 10% or 20% are excluded from average determination.
The experiment and reference profiles {Am, Cm} can also be processed profiles in which nonlinearity is removed from raw or transformed experiment and reference profiles. Methods for nonlinearity removal are also called “detrending.” In detrending, the non-linearity that depends on the measurement value, e.g., intensity, is minimized in all channels. In one embodiment, an average feature intensity profile of all channels is first calculated. This average profile is then used as the reference for correcting non-linearity. Each channel profile (experiment or reference profile) is compared to the average profile. If there is non-linearity between the two, the channel profile is adjusted to minimize the non-linearity.
In a preferred embodiment, an invariant sub-set (ISS) of features, i.e., features that are considered unchanged between an individual channel and the average profile, is identified. In one embodiment, measurements are rank ordered and compared between a channel profile and the averaged profile. Features that rank similarly within a small range are considered unchanged. In a preferred embodiment, the method described in Schadt et al., 2001, J. Cell. Biochem. Supp. 37:120-125, which is incorporated by reference herein in its entirety, is employed to find ISS.
In a preferred embodiment, measurement values of all ISS features, both positive and negative, are cut into small range bins. The total number of bins can be defined by rounding the result of dividing the number of features by a chosen number, e.g., 1000. Preferably, the number of bins is between a minimum of about 2 for arrays with a small number of features and a maximum of about 12 for arrays with a large number of features. Mean difference between feature value in an individual channel and feature value in the average profile in each bin is calculated. The mean difference is placed as a point at the center of the bin (see, e.g.,
For all features, both invariant and variant, in each individual channel profile, the measurement values are corrected by the respective nonlinearity curve:
$$A_m^{\mathrm{corr}}(k) = A_m(k) - \mathrm{nonlinear\_}A_m(k) \qquad (30)$$
or
$$C_m^{\mathrm{corr}}(k) = C_m(k) - \mathrm{nonlinear\_}C_m(k) \qquad (31)$$
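One way to realize the binning and correction described above is sketched below in Python. Several choices are assumptions made only for illustration: features are binned by their value in the individual channel, bins are of equal width, the nonlinearity curve is obtained by linear interpolation between the bin-center points (held constant beyond the outermost centers), and the invariant sub-set is supplied as a list of feature indices. All function names are hypothetical.

```python
def nonlinearity_curve(channel, average, iss, n_bins):
    """Bin centers and mean (channel - average) difference over ISS features."""
    values = [channel[k] for k in iss]
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # guard against a zero-width range
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for k in iss:
        b = min(int((channel[k] - low) / width), n_bins - 1)
        sums[b] += channel[k] - average[k]
        counts[b] += 1
    centers, means = [], []
    for b in range(n_bins):
        if counts[b]:  # skip empty bins
            centers.append(low + (b + 0.5) * width)
            means.append(sums[b] / counts[b])
    return centers, means

def interpolate(x, xs, ys):
    """Piecewise-linear interpolation, constant outside the range of xs."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def detrend(channel, average, iss, n_bins=4):
    """Subtract the interpolated nonlinearity from every feature (Eqs. 30-31)."""
    centers, means = nonlinearity_curve(channel, average, iss, n_bins)
    return [c - interpolate(c, centers, means) for c in channel]

# A channel offset from the average profile by a constant 2.0 (made-up data).
average = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
channel = [x + 2.0 for x in average]
detrended = detrend(channel, average, iss=list(range(8)))
```

The correction is applied to all features, invariant or not, while the curve itself is estimated only from the ISS features.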
In one embodiment, the invention provides a computer program for splitting a plurality of multi-channel profiles into individual profiles. The program is also referred to as a ratio-splitter.
As an example, the ratio scans A/CA, B/CB, D/CD and E/CE may or may not have common reference controls. If they do, samples CA, CB, CD and CE are the same. Otherwise, samples CA, CB, CD and CE are different. Preferably, the ratio scans are first sent to the technology-specific error model. In one embodiment, the error model used is the same error model used for creating ratio profiles of a given microarray technology. The error model provides intensity error estimates for the red and the green channels to the ratio-splitter. When creating regular ratio profiles, the error model only uses the estimated intensity errors internally. For a given scan, e.g., CA-vs-A, the error model provides the following quantities:
Intensity data from the error model are then sent to group preprocessing that includes one or more of the following: normalization, intensity transformation, and detrending. Group preprocessing reduces certain systematic biases in the data, such as gain biases and non-linearity.
If there are no common reference controls, i.e., samples CA, CB, CD and CE are different, the ratio-splitter inversely transforms the intensity data and outputs 2*N intensity profiles. If the user indicates that there are common references, the ratio-splitter uses the common reference to estimate and correct inter-slide errors. The intensity data are then inversely transformed. In this case, the ratio-splitter outputs N intensity profiles.
There are three components in the group processing: group normalization, intensity transformation, and group detrending.
In group normalization, the average brightness of all intensity channels is made the same. In the ratio-splitter a global normalization is used. The channel brightness, Brightness(n), is the average of intensities from all positive features in the n′th channel, preferably after excluding the brightest 10% of spots, which are often saturated. Assuming there are N ratio scans (2*N channels), and there are K features on each chip, the intensity of the k′th feature (k: 1-K) on the n′th channel (n: 1-2*N) is normalized as
is the average brightness of all channels. In Eq. 34, bkgstdnorm(k) is the normalized standard background error of the k′th feature.
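The global normalization described above can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and it assumes that each channel is multiplied by the ratio of the average brightness of all channels to its own brightness, with the background error scaled by the same factor. The normalization equations themselves are not reproduced in this section.

```python
import numpy as np

def channel_brightness(intensity, top_fraction=0.10):
    """Mean intensity of the positive features of one channel,
    after dropping the brightest top_fraction (often saturated)."""
    positive = np.sort(intensity[intensity > 0])
    keep = len(positive) - int(np.floor(top_fraction * len(positive)))
    return positive[:keep].mean()

def group_normalize(channels, bkgstd):
    """Scale every channel so that all channels share the same brightness.

    channels: array of shape (2*N, K), one row per intensity channel.
    bkgstd:   array of the same shape holding background standard errors.
    Assumption: intensity and background error use the same scale factor.
    """
    brightness = np.array([channel_brightness(row) for row in channels])
    scale = brightness.mean() / brightness  # one factor per channel
    return channels * scale[:, None], bkgstd * scale[:, None]
```

After this step every channel has the same brightness by construction, so remaining differences between channels reflect feature-level effects rather than an overall gain bias.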
To simplify detrending and inter-slide error correction, an intensity forward transformation can be applied. A preferred transformation is the error-model based transformation that is described in Section 5.4., infra, and in U.S. patent application Ser. No. 10/354,664, filed on Jan. 30, 2003, which is incorporated by reference herein in its entirety. In the transformed domain, the intensity variance is more homogeneous across all intensity levels.
In the detrending step, the intensity-dependent non-linearity in all channels is minimized. In one embodiment, an average feature intensity profile of all intensity channels is first calculated. This average profile is then used as the reference in correcting non-linearity. Each intensity channel profile is compared to the average profile. If there is non-linearity between the two, the channel profile, but not the average profile, is adjusted to minimize the non-linearity.
In a preferred embodiment, an invariant sub-set (ISS) of features, i.e., features that are considered unchanged between the individual channel and the average profile, is identified. In one embodiment, intensities are rank ordered and compared among channel profiles and the averaged profile. Features that rank similarly within a small range are considered unchanged. In a preferred embodiment, the method described in Schadt et al., 2001, J. Cell. Biochem. Supp. 37:120-125, which is incorporated by reference herein in its entirety, can be employed to find ISS.
In one embodiment, a smoothing spline method is used to obtain the non-linearity curve of the intensity difference vs. mean intensity of the channel profile and the average profile (Schadt et al., 2001, J. Cell. Biochem. Supp. 37:120-125). In another embodiment, a piece-wise linear method is used to fit the non-linearity curve: one point is computed per intensity bin, and straight lines connect these points from one bin to the next. In a preferred embodiment, transformed intensities of all ISS features, both positive and negative, are divided into bins that each cover a small range of values. The total number of bins can be defined by rounding the result of dividing the number of features by a chosen number, e.g., 1000. Preferably, the number of bins is between a minimum of about 2 for arrays with a small number of features and a maximum of about 12 for arrays with a large number of features. For each bin, the mean difference between the transformed feature intensities of an individual channel and those of the average profile is calculated. The mean difference is placed as a point at the center of the bin (see
For all features in each individual channel profile, their transformed intensities are corrected by the nonlinearity curve:
corr_trans_I(n, k)=trans_I(n, k)−nonlinear_diff(trans_I(n, k)) (36)
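The piece-wise linear detrending described above can be sketched in Python as follows. The function names are hypothetical, and the sketch makes two assumptions that the text does not settle: the bins have equal width in the channel's transformed intensity, and the curve is held flat beyond the first and last bin centers. The curve is fitted on the ISS features only and then applied to all features, as in Equation 36.

```python
import numpy as np

def number_of_bins(n_features, per_bin=1000, lo=2, hi=12):
    """Rounded n_features / per_bin, limited to the range [lo, hi]."""
    return int(min(max(round(n_features / per_bin), lo), hi))

def fit_nonlinearity(channel_iss, average_iss, n_bins):
    """Per-bin mean of (channel - average), placed at the bin centers.

    channel_iss, average_iss: transformed intensities of the ISS features.
    Assumption: equal-width bins over the channel intensity range.
    """
    edges = np.linspace(channel_iss.min(), channel_iss.max(), n_bins + 1)
    idx = np.clip(np.digitize(channel_iss, edges) - 1, 0, n_bins - 1)
    centers, diffs = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # skip empty bins
            centers.append(0.5 * (edges[b] + edges[b + 1]))
            diffs.append((channel_iss[mask] - average_iss[mask]).mean())
    return np.array(centers), np.array(diffs)

def detrend(channel, centers, diffs):
    """Equation 36: subtract the interpolated non-linearity curve.
    np.interp joins the bin points with straight lines and holds the
    end values outside the fitted range (an assumption of this sketch)."""
    return channel - np.interp(channel, centers, diffs)
```

In use, `fit_nonlinearity` would be called once per channel with that channel's ISS features, and `detrend` would then be applied to every feature of the same channel.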
When two-color ratio arrays are used to compare two samples, imperfections in the microarray slides are largely corrected. Many unwanted microarray measurement variations come from manufacturing quality variation and hybridization process variation. These imperfections are usually spot and chip dependent. Often, the variations have similar effects on both the red and the green measurements. When ratios of the red and the green intensities of the same chip are computed, the effects caused by the slide imperfections often cancel. As a result, the spot/chip dependent variations have relatively small effects on intra-slide differential expression measurements in ratios or log-ratios of the two-color arrays.
But when the two channels are split and used as individual intensity profiles together with split profiles from other two-color microarrays, the spot/chip dependent variations may no longer cancel out. Intensity measurement errors caused by the imperfections reduce the precision of the inter-slide intensity comparison.
When common control samples are hybridized in one channel of the two-color microarrays, such as in the pooled design, the reference channel can be used to reduce the inter-slide error significantly. An inter-slide error correction method was first introduced in U.S. Pat. No. 6,691,042 for building one virtual ratio profile from two two-channel profiles. In the ratio-splitter of this disclosure, two-channel profiles are split to provide intensity profiles instead of ratio profiles.
As an example to demonstrate the concept of inter-slide error correction,
However, when the two same-vs-same differences in
In one embodiment, when some of the input ratio scans have common reference controls in the green channel and others have common controls in the red channel, to avoid mixing the fluor bias in inter-slide error estimation, the scans of common controls in different fluorescence colors are processed separately (
In ISEC, the mean and the standard-deviation of the reference intensity are first computed:
where n is the index of chips, k is the index of features, Nref is the total number of reference channels in a given color.
The difference of the individual common reference intensity and the averaged reference intensity is:
ref_diff(n, k)=trans_I_ref(n, k)−avg_ref(k) (39)
The adjusted experiment intensity is calculated by subtracting the difference from the original intensity:
adj_I(n, k)=trans_I_exp(n, k)−ref_diff(n, k) (40)
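The core of the inter-slide error correction (Equations 39 and 40) can be sketched in Python as follows. The function name is hypothetical, and the use of the sample standard deviation (divisor Nref−1, which requires at least two reference channels) is an assumption, since the equations for the mean and standard deviation are not reproduced in this section. The error estimation and the low-intensity weighting described below are not included.

```python
import numpy as np

def isec_adjust(trans_ref, trans_exp):
    """Inter-slide error correction for one fluor color.

    trans_ref: array (Nref, K), transformed common-reference intensities.
    trans_exp: array (Nref, K), transformed experiment intensities from
               the same chips, in the same chip order.
    Returns the adjusted experiment intensities together with the mean
    and standard deviation of the reference per feature.
    """
    avg_ref = trans_ref.mean(axis=0)          # mean over chips, per feature
    std_ref = trans_ref.std(axis=0, ddof=1)   # assumption: sample std
    ref_diff = trans_ref - avg_ref            # Equation 39
    adj_exp = trans_exp - ref_diff            # Equation 40
    return adj_exp, avg_ref, std_ref
```

If a spot-dependent error affects the reference and experiment channels of a chip equally, the subtraction replaces it with the average error over all chips, so the chip-to-chip scatter caused by that error is removed from the adjusted intensities.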
The error of the adjusted experiment intensity is then determined. When Nref is large, std_ref(k) in Equation 38 is an unbiased estimation of the standard deviation of the common reference. However, when Nref is small, std_ref(k) is not reliable. In one embodiment, to stabilize the error estimation for the common reference, the scattered error std_ref(k) is combined with the error model estimated error σtrans
The error of the adjusted experiment intensity in Equation 40 can be estimated as:
In Equation 42, Cor(k) is an estimated correlation coefficient between the experiment and the reference channels.
where the average background standard error avg_bkgstd is computed as
Parameter CorMax in Equation 43 defines the maximum correlation; by default, CorMax=0.5. CorMax can have a value between 0 and 1. A smaller CorMax makes the error estimation more conservative, whereas a larger CorMax produces a smaller error estimate, which is more aggressive.
When the common-reference intensity is very low, e.g., near or below the background noise level, the correlation between the experiment and the reference channels decreases significantly. In this case, the ISEC adjustment in Equation 40 may no longer be desirable and may add noise to the result. Thus, it is preferable that ISEC be phased out as the intensity approaches zero. In one embodiment, a weighting model is used in the ratio splitter to smoothly phase out ISEC. In a preferred embodiment, the weighting function is:
When avg_ref(k) is large, Weights(k) is one. When avg_ref(k) is below avg_bkgstd, Weights(k) is near zero. The original transformed intensity is combined with the adjusted intensity to get the final transformed experiment intensity:
The ratio splitter gives users of two-color microarrays maximum flexibility in analyzing their data. The split intensity profiles can be compared using ANOVA, trend, and clustering methods. Profiles from the ratio-splitter output can also be used to build new intensity or ratio experiments in any combination.
It is shown in the Examples that the ISEC method makes the quality of split intensity profiles significantly better. It is preferable that common reference controls are employed whenever possible to allow achieving more accurate results in splitting the ratio data. In addition, with common references available, the commonly used fluor-reversal procedure may become unnecessary. If all experimental samples are in one color and all common reference controls in the other color, the color bias will have no effect in differential analysis of the split intensities. This may permit a saving of up to 50 percent of chips.
In the fluor-reversal case, to avoid mixing the fluorescent color bias in the ISEC process, two-channel data with red references and two-channel data with green references are processed in two separate groups. After the ratio split, the intensity replicates of the two colors can be combined to form an intensity experiment free of color bias. If the different colors are not carefully separated or combined, the color bias will affect downstream analyses. Methods for combining fluor-reversed pairs of profiles are known in the art; see, e.g., U.S. Pat. No. 6,691,042.
Preferably, the ratio splitter is used to process ratio data that have the raw scan data with an internal error model. The internal error model not only provides the intensity error estimation, but also the parameters for intensity transformation applied in the ratio splitter. It is less preferred to apply the ratio splitter to data loaded from an external error model or without an error model.
The methods of the invention can be used to analyze transformed measurements. Measured data obtained in a microarray experiment often contain errors due both to the inherent stochastic nature of gene expression and to measurement errors from various external sources. The many sources of measurement error that may occur in a measured signal fall into three categories: additive error, multiplicative error, and Poisson error. The signal magnitude-independent or intensity-independent additive error includes errors resulting from, e.g., background fluctuation or spot-to-spot variations in signal intensity among negative control spots. The signal magnitude-dependent or intensity-dependent multiplicative error, which is assumed to be directly proportional to the signal intensity, includes errors resulting from, e.g., the scatter observed for ratios that should be unity. The multiplicative error is also termed fractional error. The third type of error is a result of variation in the number of available binding sites in a spot. This type of error depends on the square root of the signal magnitude, e.g., measured intensity. It is also called the Poisson error, because it is believed that the number of binding sites on a microarray spot follows a Poisson distribution, and has a variance which is proportional to the average number of binding sites.
In a preferred embodiment, measured data are first transformed by an error model based transformation before being analyzed by the improved ANOVA method of the invention. The results from the ANOVA analysis can be transformed back by an appropriate inverse transformation. An error model based data transformation method is described in U.S. patent application Ser. No. 10/354,664, filed on Jan. 30, 2003, which is incorporated by reference herein in its entirety.
Errors in measured data can be described by error models (see, e.g., Supplementary material to Roberts et al, 2000, Science, 287:873-880; and Rocke et al., 2001, J. Computational Biology 8:557-569). In preferred embodiments, an error model contains two or three error terms to describe the dominant error sources. In a two-term error model, a first error term is used to describe the low-level additive error which comes from, e.g., the background of the array chip. Since this additive error has a constant variance, in this disclosure, it is also called the constant error. The constant error is independent of the hybridization levels of individual spots on a microarray. It may come from scanner electronics noise and/or fluorescence due to nonspecific binding of fluorescent molecules to the surface of the microarray. In one embodiment, this constant additive error is taken to have a normal distribution with a mean bkg and a standard deviation σbkg. After background level subtraction, which is typically applied in microarray data processing, the additive mean bkg becomes zero. In this disclosure, it is often assumed that the background intensity offset has been corrected. An ordinarily skilled artisan will appreciate that in cases where the background mean is not corrected, the methods of the invention can be used with an additional step of making such a correction.
The second error source is the multiplicative error that is the combined result of the speckle noise inherent in the coherent laser scanner and the fluorescence dye related noise. The multiplicative error is also called fractional error because its level is directly proportional to the magnitude of the measured signal, e.g., the measured intensity level. It is the dominant error source at high intensity levels. In one embodiment in which the measured signal is obtained from a microarray experiment, the standard deviation of the fractional error in the k′th spot can be approximated as
σfrac(k)≈a·x(k) (48)
where x(k) is the measured intensity in the k′th spot. The constant a in Equation 48 is termed the fractional error coefficient, and describes the proportion of the fractional error to the intensity of the measured signal. In one embodiment, the constant has a value in the range of 0.1 to 0.2. This constant may vary depending on the particular microarray technology used for obtaining the measured signal and/or the particular hybridization protocol used in the measurement. In one embodiment, parameter a is determined during the error model building phase by measuring the variance of the log ratio near the high intensity side in a same-vs.-same ratio experiment where the intensities in the ratio numerator and denominator come from the same sample and treatment. At high intensities, the variance of log ratio x1 over x2 relates to parameter a:
when x1 and x2>>σbkg. In one embodiment, x1 and x2 are at least 4, 10, 50, 100, or 200 times σbkg.
In a two-term error model, the measurement error in a measured signal, e.g., measured intensity, x(k) can be defined as
σx(k)=√(σbkg(k)²+σfrac(k)²)≈√(σbkg(k)²+a²·x(k)²) (50)
In a preferred embodiment of the invention, the background noise variances in Equation 50 are taken as slightly different in different microarray spots or regions of a microarray chip. In one embodiment, the difference is less than 20%, 10%, 5%, or 1%.
In a three-term error model, an extra square-root term is included to describe measurement errors originated from variation in the number of available binding sites in a microarray spot. This term is also called the Poisson term. In one embodiment, without knowledge of actual number of binding sites in a microarray spot, the measured intensity is used to provide an estimate of the average number of binding sites. In such an embodiment, the Poisson error can be approximated as
σPoisson(k)≈b·√(x(k)) (51)
where parameter b is an overall proportional factor, termed the Poisson error coefficient. In a three-term error model, the measurement error in a measured signal, e.g., a measured fluorescence intensity, x(k) can be defined as
In a preferred embodiment, during error model development, when σbkg and parameter a have been determined, parameter b in Equation 52 is determined by measuring the intensity variance in the middle intensity ranges of the same-vs.-same experiments. In one embodiment, the intensity variance is measured in the 25 to 75 percentile range, 35 to 65 percentile range, or 45 to 50 percentile range for determination of b.
In a preferred embodiment, after the error model development phase, parameters a and b are fixed for an error model under a given microarray technology and experiment protocol. The background noise σbkg can be estimated for each particular microarray experiment. In another preferred embodiment, when a set of replicate experiments are carried out, the background noise σbkg for the set can be obtained by averaging the background noise estimated for each of the replicate experiments.
The two-term error model as described by Equation 50 can be seen as a simplified version of the three-term error model described by Equation 52, obtained by setting the Poisson parameter b to zero. In this disclosure, Equation 52 is used as the general mathematical description of error models. It will be apparent to an ordinarily skilled artisan that any results obtained based on Equation 52 are also applicable to a two-term error model by setting the Poisson parameter b to zero.
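The relationship between the two-term and three-term error models can be illustrated with a short Python sketch. The function name is hypothetical. The three-term model is read here as the quadrature sum of the background term, the Poisson term of Equation 51, and the fractional term of Equation 48, which is consistent with Equation 50 and with the statement that setting b to zero recovers the two-term model. Using the absolute value of x for negative, background-subtracted intensities is an assumption of the sketch.

```python
import numpy as np

def error_model_sigma(x, a, b, sigma_bkg):
    """Estimated standard deviation of a measured intensity x.

    a:         fractional error coefficient (Equation 48)
    b:         Poisson error coefficient (Equation 51); b = 0 gives
               the two-term model of Equation 50
    sigma_bkg: constant background standard deviation
    Assumption: |x| is used so that negative intensities are handled.
    """
    x = np.abs(np.asarray(x, dtype=float))
    return np.sqrt(sigma_bkg**2 + b**2 * x + a**2 * x**2)
```

At zero intensity the estimate reduces to the background error, and at high intensity it approaches a·x, so the fractional term dominates there as described above.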
It will be apparent to an ordinarily skilled artisan that other methods may also be used to determine an error model (see, e.g., Rocke et al., 2001, J. Computational Biology 8:557-569).
It is clear from Equation 52 that microarray intensity measurements do not meet the constant-variance requirement. The measurement error (or variance) differs at different intensities, i.e., the intensity error is a function of the intensity itself. To overcome this problem, a function f( ) is needed to transform measured data, e.g., the intensity data, x to a new domain y in which the variance becomes a constant. All analysis and data processing can then be carried out in the transformed domain. In a preferred embodiment, such a transformation is described as
y(k)=f(x(k)), for all x and (53)
σy(k)≈C, for all x where C is a constant. (54)
Preferably the transformation works for both positive and negative (e.g., negative signals obtained after background subtraction) x. More preferably the transformation meets the following additional constraints:
Still more preferably, an inverse transformation function g exists so that the transformed data in the transformed domain can be transformed back to the original domain. The inverse transformation does the following operation:
x(k)=g(y(k)), for all y (55)
Preferably, the inverse transformation function g meets the above four constraints as well. In one embodiment, the error in the inversely transformed intensity can be determined when the first derivative f′( ) of the forward transformation function f is available:
It is most preferable that the forward transformation function f, its first derivative f′, and the inverse transformation function g all have analytical closed forms.
A transformation based on an error model is provided and used to transform measured data obtained in an experiment to a transformed domain such that the measurement errors in transformed data are equal to the measurement errors in the measured data normalized by errors determined based on an error model. As used in this disclosure, such a measurement error, i.e., a measurement error which equals the measurement error in the measured signal normalized by an error determined based on an error model, is also referred to as a normalized error. Any suitable error model can be used in the invention. In a preferred embodiment, the error model is a two-term or a three-term error model described in Section 5.4.1.1. In a particularly preferred embodiment, the variance of the transformed data in the transformed domain is close to a constant. More preferably, the transformation meets all requirements discussed in Section 5.4.1.2. The basic concept of the new transformation method is to apply an error model to normalize errors in real measurements, e.g., standard deviations in measured data, such that the normalized errors are close to a constant. Then a transformation function f( ) is found by the integration of the normalization function. The methods are applicable to any set of measured data whose errors can be described by a particular error model.
In a specific embodiment, the real measurement standard deviation Δx is for the positive intensity x>0. The real standard deviation Δx is usually known before the transformation. The error model in Equation 52 provides σx, which is an estimate of the real standard deviation Δx for different intensities. In one embodiment, Δx is an error determined by the experiment. In another embodiment, Δx is calculated using an error model of the experiment. In a preferred embodiment, Δx is chosen to be the larger of an experimentally determined error or an error model-calculated error. Assuming the transformed standard deviation is Δy, the following approximation relates the two errors with the first derivative function of the transformation:
If the equation is rearranged, one obtains
Δy≈Δx·f′(x) (58)
Because Equation 52 is an approximation of Δx, if a normalization function y′ is defined as follows:
where a, b, and c are defined as in Section 5.4.1.1, one can expect that the variance of y is close to a constant.
Equation 59 provides an analytical form of the first derivative function of the desired transformation. To obtain the transformation function itself, both sides of Equation 59 are integrated:
The integral in Equation 60 does have an analytical solution. The solution is described by equation
Applying the zero intercept constraint (ii) in Section 5.4.1.2, i.e., y=0 when x=0, the constant d in Equation 61 is found to be
As indicated in Equation 55 in Section 5.4.1.2, preferably one finds the inverse transformation function g(y) so that the transformed intensity y can be converted back to the original x scale whenever necessary. By using linear algebra or a symbolic-solution software, such as Maple, one finds
To complete the forward and the inverse transformation pair for both intensity and its error, the standard deviation of the inversely transformed intensity can be estimated by using Equation 56.
In a specific embodiment, the transformation function can be further defined to be symmetric to zero for all x. When x<0, the absolute value |x| is used to replace x in the forward transformation in Equation 61 and to give a negative sign to the result y. In the inverse transformation in Equation 63, when y<0, the absolute value |y| is used to replace y and to give a negative sign to the result x. Under the forward transformation, the estimated transformed error σy is one over all intensity ranges of x or y, so that constant C=1 in Equation 54. The transformation also meets all other requirements and constraints described above. In addition, the transformation has several other interesting properties:
The transformation described in this section is applicable to any measured data in which the errors can be described by a three-term error model. In preferred embodiments, the measured data are measured in a microarray gene expression experiment. In other preferred embodiments, the measured data are measured in a protein array experiment or a 2D gel protein experiment.
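The construction described in this section can be illustrated with a Python sketch. The closed forms below are obtained from the standard table integral of 1/√(a²x²+b²x+c²), with the integration constant fixed by the zero-intercept constraint; they are offered as an illustration and may differ in form from Equations 61 to 63, which are not reproduced here. The function names are hypothetical, c is assumed to be the constant background error, and a is assumed to be positive.

```python
import math

def forward(x, a, b, c):
    """Variance-stabilizing transform y = f(x), with f'(x) = 1/sigma(x),
    where sigma(x) = sqrt(a^2 x^2 + b^2 x + c^2). Odd-symmetric about zero."""
    sign = math.copysign(1.0, x)
    x = abs(x)
    q = math.sqrt(a * a * x * x + b * b * x + c * c)
    # Subtracting log(2ac + b^2) enforces y = 0 at x = 0.
    y = (math.log(2 * a * q + 2 * a * a * x + b * b)
         - math.log(2 * a * c + b * b)) / a
    return sign * y

def inverse(y, a, b, c):
    """Inverse transform x = g(y), found by solving forward() for x."""
    sign = math.copysign(1.0, y)
    y = abs(y)
    u = (2 * a * c + b * b) * math.exp(a * y)
    x = ((u - b * b) ** 2 - 4 * a * a * c * c) / (4 * a * a * u)
    return sign * x
```

Because the derivative of the forward function equals one over the modeled error, an intensity error σx maps to a transformed error of about one at every intensity, which is the constant-variance property sought above.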
In one preferred embodiment, the measured data are signal data obtained in a microarray experiment in which two spots or probes on a microarray are used for obtaining each measured signal, one comprising the targeted nucleotide sequence, i.e., the target probe (TP), e.g., a perfect-match probe, and the other comprising a reference sequence, i.e., a reference probe (RP), e.g., a mutated mismatch probe. The RP probe is used as a negative control, e.g., to remove undesired effects from non-specific hybridization. In one embodiment, the measured signal obtained in such a manner is defined as the difference between the intensities of the TP and RP, xTP-xRP. In such an embodiment, the fractional error, the Poisson error, and the background constant error σbkg are described respectively according to equations
The transformation described in this section remains applicable if Equations 66-68 are used to calculate the fractional error, the Poisson error and the background constant error, respectively. In one embodiment, the TP probe is a perfect-match probe (PM), and the RP probe is a mismatch probe (MM) (see, e.g., Lockhart et al., 1996, Nature Biotechnology 14: 1675). In another embodiment, the RP probe is a reverse probe of the TP probe, i.e., the RP probe has a sequence that is the reverse complement of the TP probe (see, Shoemaker et al., U.S. patent application Ser. No. 09/781,814, filed on Feb. 12, 2001; and Shoemaker et al., U.S. patent application Ser. No. 09/724,538, filed on Nov. 28, 2000).
It will be apparent to one skilled in the art that although the transformations as described by equations 61 and 63 are preferably carried out using parameters a, b, and c chosen based on a three-term error model, the transformations as described by equations 61 and 63 can also be used by replacing parameters a, b, and c with other parameters. Embodiments using such parameters are also encompassed by the present invention.
Another transformation that can be used to transform the data before ANOVA analysis is a logarithm transformation:
y(k)=f(x(k))=ln(x(k)), for x>0 (69)
In Equation 52, when intensity x is very high, the fractional error is the dominant error source. In this case, the standard deviation of y is approximately a constant:
When intensity x is low, the standard deviation of y is inversely proportional to x, and is approaching infinity:
Still another transformation that can be used to transform the data is a piecewise hybrid transformation (see, e.g., D. Holder, et al, “Quantitation of Gene Expression for High-Density Oligonucleotide Arrays: A SAFER Approach”, presented in Genelogic Workshop on Low Level Analysis of Affymetrix Genechip® data, Nov. 19, 2001, Bethesda, MD., http://oz.berkeley.edu/users/terry/zarray/Affy/GL_Workshop/Holder.ppt). This hybrid transformation uses a linear function at the low intensity side and a logarithm function for high intensities. An arbitrary parameter c′ defines the boundary between the linear and the logarithmic functions. Equation 72 is the mathematical definition of the hybrid transformation function.
y(k)=f(x(k))=x(k), for 0≤x(k)<c′
y(k)=f(x(k))=c′·ln(x(k)/c′)+c′, for x(k)≥c′
y(k)=f(x(k))=0, for x(k)<0 (72)
In one embodiment, parameter c′ in Equation 72 is chosen to be 20. Errors of the hybrid-transformed intensities can be estimated as
σy(k)≈σx(k)·f′(x(k))=σx(k), for 0≤x(k)<c′
σy(k)≈σx(k)·f′(x(k))=c′·σx(k)/x(k), for x(k)≥c′ (73)
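Equations 72 and 73 can be written directly as a short Python sketch. The function names are hypothetical, and the default c′=20 follows the embodiment mentioned above. Equation 73 does not cover negative intensities; returning a zero error there, because the transform is constant below zero, is an assumption of the sketch.

```python
import math

def hybrid_transform(x, c=20.0):
    """Equation 72: linear below c, logarithmic at and above c, zero for x < 0."""
    if x < 0:
        return 0.0
    if x < c:
        return x
    return c * math.log(x / c) + c

def hybrid_error(x, sigma_x, c=20.0):
    """Equation 73: propagate sigma_x through the derivative of the transform."""
    if x < 0:
        return 0.0  # assumption: f is constant for x < 0, so f'(x) = 0
    if x < c:
        return sigma_x          # f'(x) = 1 in the linear region
    return c * sigma_x / x      # f'(x) = c / x in the logarithmic region
```

The two branches meet at x=c′, where both give y=c′ and a slope of one, so the transformed value and its error are continuous across the boundary.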
The analytical methods of the present invention can preferably be implemented using a computer system, such as the computer system described in this section, according to the following programs and methods. Such a computer system can also preferably store and manipulate a compendium of the present invention which comprises a plurality of perturbation response profiles and which can be used by a computer system in implementing the analytical methods of this invention. Accordingly, such computer systems are also considered part of the present invention.
An exemplary computer system suitable for implementing the analytic methods of this invention is illustrated in
The external components can include a mass storage 4904. This mass storage can be one or more hard disks that are typically packaged together with the processor and memory. Such hard disks are typically of 1 GB or greater storage capacity and more preferably have at least 6 GB of storage capacity. For example, in a preferred embodiment, described above, wherein a computer system of the invention comprises several nodes, each node can have its own hard drive. The head node preferably has a hard drive with at least 6 GB of storage capacity whereas each sibling node preferably has a hard drive with at least 9 GB of storage capacity. A computer system of the invention can further comprise other mass storage units including, for example, one or more floppy drives, one or more CD-ROM drives, one or more DVD drives or one or more DAT drives.
Other external components typically include a user interface device 4905, which is most typically a monitor and a keyboard together with a graphical input device 4906 such as a “mouse.” The computer system is also typically linked to a network link 4907 which can be, e.g., part of a local area network (“LAN”) to other, local computer systems and/or part of a wide area network (“WAN”), such as the Internet, that is connected to other, remote computer systems. For example, in the preferred embodiment, discussed above, wherein the computer system comprises a plurality of nodes, each node is preferably connected to a network, preferably an NFS network, so that the nodes of the computer system communicate with each other and, optionally, with other computer systems by means of the network and can thereby share data and processing tasks with one another.
Loaded into memory during operation of such a computer system are several software components that are also shown schematically in
Software component 4912 comprises any analytic methods of the present invention described supra, preferably programmed in a procedural language or symbolic package. For example, software component 4912 preferably includes programs that cause the processor to implement steps of accepting a plurality of measured expression profiles and storing the profiles in the memory. For example, the computer system can accept exon expression profiles that are manually entered by a user (e.g., by means of the user interface). More preferably, however, the programs cause the computer system to retrieve measured expression profiles from a database. Such a database can be stored on a mass storage (e.g., a hard drive) or other computer readable medium and loaded into the memory of the computer, or the compendium can be accessed by the computer system by means of the network 4907.
In addition to the exemplary program structures and computer systems described herein, other, alternative program structures and computer systems will be readily apparent to the skilled artisan. Such alternative systems, which do not depart from the above described computer system and program structures either in spirit or in scope, are therefore intended to be comprehended within the accompanying claims.
The following examples are presented by way of illustration of the present invention, and are not intended to limit the present invention in any way.
To verify the re-ratioer and the ratio splitter, the microarray data as described in He et al., 2003, Bioinformatics 19:956-965 were used. In this data set, replicated and fluor-reversed two-color Agilent microarrays were hybridized to many different tissue samples in a pooled-looped design.
In the rest of the example section, “Pool 1+εC” will be referred to as sample C and “Pool 1+εD” will be referred to as sample D. As discussed in the following examples, the “virtual D/C” from the re-ratioer or the ratio-splitter was compared to the real D/C measured from direct hybridizations. Some of the real ratio experiments that were used as verification references are shown in
In order to verify the accuracy of the re-ratioer, a reference standard is needed. A combined fluor-reversal real C-vs-D experiment (+97, −98) was used as the standard.
Results shown in the previous section came from data of a near pool, i.e. sample C and sample D were part of the pooled sample (Pool 1). In this example results from data with a distant pool as the common reference, i.e., sample C and sample D were not included in the reference pool, are described.
When a distant pool is used, the ratio-splitter may also suffer from the same problem of low precision and low accuracy as the re-ratioer. In this example, the ratio-splitter is verified in data either with a common near pool or without a common pool.
In the re-ratioer and ratio-splitter verification examples discussed above, common reference controls were employed, i.e., there was either a near pool or a distant pool in one of the two channels. The common controls were used as references to reduce inter-slide variations. However, when the common controls are not available, the inter-slide error correction (ISEC) is preferably not used during ratio splitting. Ratio-splitter results without leveraging common reference pools are shown in this example.
The precision and accuracy of the re-ratioer and the ratio-splitter were discussed in previous examples. In this example, the sensitivity and specificity are examined. Sensitivity is the ability to detect expression changes. Generally, the higher the sensitivity, the better the detection method. Specificity can be defined as one minus the false positive rate. False positives are those features or sequences that are detected as differentially expressed but that are actually not differentially expressed. The lower the false positive rate, the better the detection method. Sensitivity and false positive rate may trade off against each other. For example, increasing sensitivity by using higher p-value thresholds may increase the false positive rate. ROC (receiver operating characteristic) analysis allows consideration of both sensitivity and false positive rate when comparing different gene expression detection methods.
ROC curves are plots in which the X-axis corresponds to false positive rate and the Y-axis corresponds to sensitivity. For each p-value threshold level, e.g., p-value<0.01, the false positive rate is measured from same-vs-same experiments and the sensitivity is measured from different-vs-different experiments. The measured false positive rate (FPR) and total positive rate (TPR) together define one point on the ROC curve. By varying the threshold from very low levels to very high levels, the entire ROC curve can be obtained. For a given test data set, a detection method having its ROC curve closer to the upper-left corner of the ROC plot has higher statistical power in differential expression analysis. In this example, the total positive rate was used instead of the true positive rate because the true positive rate is hard to measure. The true positive rate is related to the total positive rate, which includes both true positives and false positives. A method that is superior in terms of a ROC of total-positive vs. FPR is normally also superior in terms of a ROC of true-positive vs. FPR.
In all of the following ROC plots, the ROC curves are the averaged results of two different sets of same-vs-same and different-vs-different data. The false positive rate is the number of signature features for a given p-value threshold in a same-vs-same experiment divided by the total number of features in a chip. The total positive rate is the number of signature features for a given p-value threshold in a different-vs-different experiment divided by the total number of features in a chip.
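The rate definitions above can be illustrated with a minimal Python sketch. It assumes that each experiment yields one p-value per feature and that a feature counts as a signature when its p-value falls below the threshold. The function names `roc_points` and `averaged_roc`, the use of NumPy, and the toy p-values are illustrative assumptions, not part of the described method.

```python
import numpy as np


def roc_points(p_same, p_diff, thresholds):
    """Return (FPR, total positive rate) arrays, one entry per p-value threshold.

    p_same: per-feature p-values from a same-vs-same experiment (hypothetical input).
    p_diff: per-feature p-values from a different-vs-different experiment.
    """
    p_same = np.asarray(p_same, dtype=float)
    p_diff = np.asarray(p_diff, dtype=float)
    # Fraction of features called as signatures = signature count / total features.
    fpr = np.array([(p_same < t).mean() for t in thresholds])
    tpr = np.array([(p_diff < t).mean() for t in thresholds])
    return fpr, tpr


def averaged_roc(pairs, thresholds):
    """Average ROC points over several (same-vs-same, different-vs-different) data sets."""
    curves = [roc_points(s, d, thresholds) for s, d in pairs]
    fpr = np.mean([c[0] for c in curves], axis=0)
    tpr = np.mean([c[1] for c in curves], axis=0)
    return fpr, tpr


# Toy data: four features per experiment, two thresholds.
thresholds = [0.01, 0.1]
p_same = [0.5, 0.2, 0.005, 0.9]
p_diff = [0.001, 0.004, 0.3, 0.02]
fpr, tpr = roc_points(p_same, p_diff, thresholds)
```

Sweeping `thresholds` over a wide range (for example, logarithmically spaced values) traces out the whole curve. Plotting `fpr` against `tpr` on log-log axes emphasizes the low-FPR region.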
The different-vs-different data are those C-vs-D experiments shown in previous sections. Sample C and sample D had moderately strong differential expressions. In addition to including all signatures in the ROC analysis, separate ROC curves for which features of more than 1.2-fold up- or down-regulation in both real combined C-vs-D experiments were excluded are also provided in
FIGS. 44(a) and (b) compare the all-signature ROC curves of the ratio-splitter and the re-ratioer having the near common reference pool (Pool 1) used in ISEC. These ROC curves are plotted on log-log scales to make the differences at low FPR easier to compare. ROC curves of real ratio experiments (black lines) are shown as references for comparison with the results of virtual experiments from the ratio-splitter and the re-ratioer. At the medium FPR levels (0.001<FPR<0.1), the real combined fluor-reversal experiments have higher ROC curves than the virtual combined experiments as shown by the dark dashed lines. At low FPR levels (FPR<0.001), both ratio-splitter and re-ratioer combined experiments have similar or higher ROC curves than the real combined experiments. Using the ROC curve of the combined real experiments (thick solid black lines) as a reference, it can be seen that the ratio-splitter had a slightly higher ROC curve than the re-ratioer in the virtual combined experiments.
With the ratio-splitter and the re-ratioer, ratio experiments of the same color (red-red or green-green) can be formed. Because there is no color bias in the same-color virtual experiments, the ROC curves of the same-color experiments without combining are significantly higher than the ROC curves from the real two-color chips in FIGS. 44(a) and (b) (thin solid black lines). The virtual two-color experiments exhibit the lowest ROC curves (thin dashed lines).
FIGS. 45(a) and (b) are ROC curves of weak signatures. When signatures of strong differential expression were excluded, all ROC curves moved down. The real combined experiments still had the highest ROC curves in the medium FPR range. The ratio-splitter still outperformed the real experiments in the low FPR range. In the low FPR range, the ROC curves of the same-color re-ratioer experiments are higher than the curves of the ratio-splitter. For both the re-ratioer and the ratio-splitter, the ROC curves of red single-color experiments with green common controls are higher than the ROC curves of green experiments with red common controls. This result indicates that green (Cy3) fluorescence is preferably used to label the common near reference pool if fluor-reversal pairs are not to be obtained. This is particularly important when differential expression is weak.
It was shown in the previous examples that when distant pools were used, the precision and accuracy of the ratio-splitter and re-ratioer decreased. Distant pools also decrease the sensitivity and specificity of differential expression detection by the ratio-splitter or re-ratioer. FIGS. 46(a) and (b) are the all-signature ROC curves with the distant Pool 2 as the common reference in ISEC. Comparing
Re-ratioer and ratio-splitter with ISEC are preferably not used if there is no common reference control in one of the two channels of the original data. In such cases, the ratio-splitter only provides intensity profiles without inter-slide error correction (see
As these examples demonstrate, the re-ratioer and the ratio-splitter provide additional flexibility in analyzing two-color microarray data. The ratio-splitter allows the use of two-color microarrays to generate intensity profiles as alternatives to single-channel microarrays, such as those from Affymetrix. The inter-slide error correction method (ISEC) significantly reduces slide-to-slide variations when a common reference control sample is hybridized to one of the two channels of the two-color microarrays. The following summarizes observations from method verifications described in the Example Section:
All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
Many modifications and variations of the present invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific embodiments described herein are offered by way of example only, and the invention is to be limited only by the terms of the appended claims along with the full scope of equivalents to which such claims are entitled.