Label integrity verification of chemical array data

BACKGROUND OF THE INVENTION

Researchers use experimental data obtained from arrays and other similar research test equipment to cure diseases, develop medical treatments, understand biological phenomena, and perform other tasks relating to the analysis of such data. However, the conversion of useful results from this raw data is restricted by physical limitations of, e.g., the nature of the tests and the testing equipment. All biological measurement systems leave their fingerprint on the data they measure, distorting the content of the data, and thereby influencing the results of the desired analysis. For example, systematic biases can distort array analysis results and thus conceal important biological effects sought by the researchers. Biased data can cause a variety of analysis problems, including signal compression, aberrant graphs, and significant distortions in estimates of differential expression.

Gradient effects or patterns are those in which there is a pattern of expression signal intensity which corresponds with specific physical locations and/or sequence properties within a chemical array and which are characterized by a smooth change in the expression values from one end of the array to another and/or across sequence properties of probes. This can be caused by variations in array design, manufacturing, dye-bias, probe affinity and/or hybridization procedures.

In dual-channel systems, it is well known that the two dyes used to evaluate the binding of target molecules to probes on an array do not always perform equally efficiently, for equivalent target concentrations, uniformly across the whole array. This is sometimes referred to as dye-related, signal correlation bias. For example, for dual-channel systems in which probes have been labeled using cyanine3 (Cy3)- and cyanine5 (Cy5)-dyes, the red channel (detecting Cy5 labeling) often demonstrates higher signal intensity than the green channel at higher target abundances. Even when comparing results from two single-channel experiments, there may be differences in dye performances, even when the same dye is used, such as when different experimental conditions, either intended or unintended, occur when running each of the experiments. Also, the label intensity may not follow an ideal performance curve over the range of analyte concentration. For example, for drug discovery experiments, label intensity may not follow the ideal dose-response curve over the range of the analyte (e.g., mRNA) concentration being used as a marker of drug efficacy. For example, red dye (e.g., Cy5) tends to amplify brightness in an accelerated manner with respect to an increase in concentration, at high concentrations beyond the typical sigmoidal profile.

The degree to which the intensity of dye signals fails to report the concentration of target being measured is not easily quantified, and is therefore difficult to address. Dye-swap normalization experiments are sometimes run in which a first set of experiments assigns the red dye label to a first set of probes and the green dye label to a second set of probes. A second set of experiments is run against the same target solution, but in which the green dye label is assigned to the first set of probes and the red dye label is assigned to the second set of probes. By comparing the output of the first set with that of the second set, the bias attributable to the effects of the red versus green dye can be measured. However, this is a time-consuming process and significantly increases the cost of experimentation, as twice the number of arrays and twice the amount of reagents, target and processing are required.
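By way of illustration only, the dye-swap comparison may be sketched as follows. The sketch assumes that each probe yields one positive, background-subtracted signal when read with the red label and one when read with the green label in the swapped run, and it assumes a per-probe log2 ratio as the measure of bias. The function name and the signal values are hypothetical and are not taken from any particular system.

```python
import math
from statistics import median

def dye_swap_bias(red_signals, green_signals):
    """Return per-probe log2(red/green) for the same probes read in swapped runs.

    Assumes all signals are positive and background-subtracted.
    """
    if len(red_signals) != len(green_signals):
        raise ValueError("signal lists must have the same length")
    return [math.log2(r / g) for r, g in zip(red_signals, green_signals)]

# Hypothetical signals for three probes hybridized to the same target solution.
red = [200.0, 800.0, 3200.0]     # probe read with the red label
green = [100.0, 400.0, 1600.0]   # same probe read with the green label
bias = dye_swap_bias(red, green)
median_bias = median(bias)       # one summary of the red-versus-green bias
```

A median of zero would indicate no systematic difference between the dyes under this measure; the choice of the median as a summary is likewise an assumption of the sketch.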

In addition to fluorescent labels, other types of labeling, such as radioactive labels, phosphorescent labels, visible light labels, ultraviolet labels, and others, are also susceptible to causing signal correlation bias.

Also, results that appear to have labeling bias may be due to other technical errors. For example, for a single channel system, the system may be erroneously reporting probe signals, even though the results appear to be caused by dye bias. Since there is only one channel, and no control channel, it is not possible to distinguish between systematic reader error and dye bias in this instance.

Thus there remains a need for improved systems and methods for normalizing biological data to address dye-related, signal correlation bias and other types of labeling bias as data is read from arrays.

SUMMARY OF THE INVENTION

Methods, systems and computer readable media are provided for checking label integrity of labeled biopolymers in a single sample assayed by chemical array analysis. In one embodiment, at least first and second labels are incorporated into biopolymers in the single sample to produce a multi-labeled, single sample. The multi-labeled, single sample is hybridized to probes on a chemical array, and signal values are read from a probe on the chemical array bound to a set of biopolymer sequences labeled with the at least first and second labels. First-labeled signal values from the probe bound to biopolymer having the first label incorporated therein are compared with second-labeled signal values from the probe bound to biopolymer having the second label incorporated therein. The steps of reading signal values and comparing first-labeled signal values with second-labeled signal values are repeated for at least one additional probe on the chemical microarray bound to a set of different biopolymer sequences labeled with the at least first and second labels. Label integrity is determined to be of acceptable quality if divergence between the first-labeled signal values read from the probes and the second-labeled signal values read from the same probes, over the set of probes read and compared, is less than a predetermined threshold value.

In another embodiment, a chemical array is provided that has had a multi-labeled sample contacted thereto so that multi-labeled biopolymers from the sample have hybridized with probes on the chemical array. Methods, systems and computer readable media are provided for reading signal values from a probe on the chemical array bound to a set of biopolymer sequences labeled with at least first and second labels; comparing first-labeled signal values from the probe bound to biopolymer having the first label incorporated therein with second-labeled signal values from the probe bound to biopolymer having the second label incorporated therein; and repeating the reading signal values and comparing first-labeled signal values with second-labeled signal values for at least one additional probe on the chemical microarray bound to a set of different biopolymer sequences labeled with the at least first and second labels. Label integrity is determined to be of acceptable quality if divergence between the first-labeled signal values read from the probes and the second-labeled signal values read from the same probes, across all probes read, is less than a predetermined threshold value.
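The determination described above may be sketched, for illustration only, as follows. The divergence measure used here (the mean absolute log2 ratio between the two labels over the probes read) is an assumption made for the example, as are the function names, the signal values and the threshold values; other divergence measures could equally be substituted.

```python
import math

def label_divergence(first_signals, second_signals):
    """Mean absolute log2 ratio between two labels read from the same probes.

    Assumes positive signal values, one pair per probe.
    """
    if len(first_signals) != len(second_signals) or not first_signals:
        raise ValueError("need equal-length, non-empty signal lists")
    ratios = [abs(math.log2(a / b)) for a, b in zip(first_signals, second_signals)]
    return sum(ratios) / len(ratios)

def label_integrity_acceptable(first_signals, second_signals, threshold):
    """True when the divergence is less than the predetermined threshold."""
    return label_divergence(first_signals, second_signals) < threshold

# Hypothetical signals read from three probes for each of two labels.
first = [100.0, 400.0, 1600.0]
second = [100.0, 200.0, 1600.0]
divergence = label_divergence(first, second)
acceptable = label_integrity_acceptable(first, second, threshold=0.5)
```

In this sketch only the second probe diverges (by a factor of two), so the mean over the three probes falls below the hypothetical threshold of 0.5.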

These and other advantages and features of the invention will become apparent to those persons skilled in the art upon reading the details of the methods, systems and computer readable media as more fully described below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of a chemical array.

FIG. 2 is an enlarged view of a portion of the array shown in FIG. 1.

FIG. 3 shows a flowchart of events that may be carried out in processing a sample with multiple different labels.

FIG. 4 schematically illustrates a linear amplification method for producing multiple antisense cRNA sequences from a sample mRNA sequence.

FIG. 5 schematically illustrates a process for incorporating two fluorescent dye nucleotides into an antisense RNA strand.

FIG. 6 illustrates a process of incorporating two different fluorescent dyes into a single sample, cDNA target.

FIG. 7 illustrates another approach to incorporating two different dye labels into cRNA.

FIG. 8 is a graphical representation of the number of features provided on the arrays for each of samples in an example described herein.

FIG. 9 shows a plot of the distribution of log ratio values for the signals obtained from scanning arrays in an example experiment described herein.

FIGS. 10A-10C show plots of inter-array coefficient of variation (CV) values calculated for background-subtracted, dye-normalized signals read from arrays in an example experiment described herein.

FIGS. 11A-11C show plots of inter-array coefficient of variation (CV) values (relative noise) similar to FIGS. 10A-10C, except that the signals used for calculations to generate FIGS. 11A-11C were background subtracted, but not dye-normalized.

FIG. 11D shows a plot of inter-array coefficient of variation (CV) values (relative noise) corresponding to the plot of FIG. 11C, except in this case, the signals have been weighted.

FIG. 12 illustrates a typical computer system in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Before the present systems, methods, kits and computer readable media are described, it is to be understood that this invention is not limited to particular examples described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.

It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a probe” includes a plurality of such probes and reference to “the array” includes reference to one or more arrays and equivalents thereof known to those skilled in the art, and so forth.

The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.

Definitions

In the present application, unless a contrary intention appears, the following terms refer to the indicated characteristics.

A “biopolymer” is a polymer of one or more types of repeating units. Biopolymers are typically found in biological systems and particularly include polysaccharides (such as carbohydrates), and peptides (which term is used to include polypeptides and proteins) and polynucleotides as well as their analogs such as those compounds composed of or containing amino acid analogs or non-amino acid groups, or nucleotide analogs or non-nucleotide groups. This includes polynucleotides in which the conventional backbone has been replaced with a non-naturally occurring or synthetic backbone, and nucleic acids (or synthetic or naturally occurring analogs) in which one or more of the conventional bases has been replaced with a group (natural or synthetic) capable of participating in Watson-Crick type hydrogen bonding interactions. Polynucleotides include single or multiple stranded configurations, where one or more of the strands may or may not be completely aligned with another.

A “nucleotide” refers to a sub-unit of a nucleic acid and has a phosphate group, a 5-carbon sugar and a nitrogen containing base, as well as functional analogs (whether synthetic or naturally occurring) of such sub-units which in the polymer form (as a polynucleotide) can hybridize with naturally occurring polynucleotides in a sequence-specific manner analogous to that of two naturally occurring polynucleotides. For example, a “biopolymer” includes DNA (including cDNA), RNA, oligonucleotides, and PNA and other polynucleotides as described in U.S. Pat. No. 5,948,902 and references cited therein (all of which are incorporated herein by reference), regardless of the source. An “oligonucleotide” generally refers to a nucleotide multimer of about 10 to 100 nucleotides in length, while a “polynucleotide” includes a nucleotide multimer having any number of nucleotides. A “biomonomer” references a single unit, which can be linked with the same or other biomonomers to form a biopolymer (for example, a single amino acid or nucleotide with two linking groups one or both of which may have removable protecting groups).

“Technical factors” refer to all patterns in the signal data that are not representative of the biological information in the target sample, but are rather caused by technical sources, such as hybridization bubbles (caused by uneven distribution of the sample to all probes during mixing by a bubbler), temperature gradients, sequence-composition gradients, writer/pen anomalies causing uneven patterns in the amounts deposited across the array, label kit biases, dye differences, bulk chemical solution effects, flow-cell dynamics, wash deposits, auto-fluorescence, oxidation gradients, and the like.

“Incorporation” of a label, into biopolymers or nucleotides, for example, refers to any known technique for labeling a biopolymer or nucleotide, including, but not limited to primer extension using labeled nucleotides and/or labeled primers, labeling during an amplification procedure, chemical conjugation, labeling by binding a labeled moiety that binds to the biopolymer, etc.

“Label integrity”, as used herein refers to a property of labels incorporated into biopolymers wherein signals that are read from the label-incorporated biopolymers can be consistently and stably reproduced across multiple experiments. Also, different labels vary proportionally over a range of signals, so that they can be reliably compared with one another, as measuring the same signal levels for the same sample, or correct ratios between different samples. Labels that lack label integrity are considered unstable, and this leads to amplified array noise and the inability to accurately compare signals from the same biopolymers labeled with different labels. Stability with respect to time (e.g., “shelf life”) is also a desirable property for maintaining label integrity.
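As one hedged illustration of the proportionality property noted above, signals from two labels read from the same probes may be compared in log space, where proportional variation appears as a slope near one. The least-squares fit, the function name and the signal values below are assumptions made for the example and are not prescribed by this description.

```python
import math

def log_log_slope(first_signals, second_signals):
    """Least-squares slope of log2(second) against log2(first).

    A slope near 1 suggests the two labels vary proportionally.
    Assumes positive signal values, one pair per probe.
    """
    if len(first_signals) != len(second_signals):
        raise ValueError("signal lists must have the same length")
    xs = [math.log2(v) for v in first_signals]
    ys = [math.log2(v) for v in second_signals]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("first-label signals must span more than one value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical signals: the second label reads half the first at every probe.
first = [100.0, 200.0, 400.0, 800.0]
proportional = [50.0, 100.0, 200.0, 400.0]
compressed = [v ** 0.5 for v in first]   # a label whose response is compressed

slope_proportional = log_log_slope(first, proportional)
slope_compressed = log_log_slope(first, compressed)
```

The first pair of labels differs only by a constant factor, which leaves the slope at approximately one; the compressed label yields a slope of approximately one half.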

When one item is indicated as being “remote” from another, this means that the two items are not at the same physical location, e.g., the items are at least in different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart.

“Communicating” information references transmitting the data representing that information as electrical signals over a suitable communication channel (for example, a private or public network).

“Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data.

A “processor” references any hardware and/or software combination which will perform the functions required of it. For example, any processor herein may be a programmable digital microprocessor such as available in the form of a mainframe, server, or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic or optical disk may carry the programming, and can be read by a suitable disk reader communicating with each processor at its corresponding station.

Reference to a singular item, includes the possibility that there are plural of the same items present.

“May” means optionally.

Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as the recited order of events.

A “chemical array”, “array”, “microarray” or “bioarray” unless a contrary intention appears, includes any one-, two- or three-dimensional arrangement of addressable regions bearing a particular chemical moiety or moieties (for example, biopolymers such as polynucleotide sequences) associated with that region. An array is “addressable” in that it has multiple regions of different moieties (for example, different polynucleotide sequences) such that a region (a “feature” or “spot” of the array) at a particular predetermined location (an “address”) on the array will detect a particular target or class of targets (although a feature may incidentally detect non-targets of that feature). Array features are typically, but need not be, separated by intervening spaces. In the case of an array, the “target” will be referenced as a moiety in a mobile phase (typically fluid), to be detected by probes (“target probes”) which are bound to the substrate at the various regions. However, either of the “target” or “target probes” may be the one which is to be evaluated by the other (thus, either one could be an unknown mixture of polynucleotides to be evaluated by binding with the other).

An “array layout” refers to one or more characteristics of the features, such as feature positioning on the substrate, one or more feature dimensions, and an indication of a moiety at a given location.
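For illustration only, an array layout of the kind just defined may be represented in software as a mapping from addresses to features. The class name, field names, feature dimensions and probe sequences below are hypothetical and are chosen solely for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    row: int              # feature position on the substrate
    column: int
    width_um: float       # one feature dimension, in micrometers
    probe_sequence: str   # indication of the moiety at this location

def build_layout(features):
    """Map each (row, column) address to its feature, rejecting duplicates."""
    layout = {}
    for feature in features:
        address = (feature.row, feature.column)
        if address in layout:
            raise ValueError(f"duplicate address {address}")
        layout[address] = feature
    return layout

# Hypothetical three-feature layout.
features = [
    Feature(0, 0, 100.0, "ACGTACGTAC"),
    Feature(0, 1, 100.0, "TTGACCGATA"),
    Feature(1, 0, 100.0, "GGCATTCAGT"),
]
layout = build_layout(features)
probe = layout[(0, 1)].probe_sequence   # look up the moiety at an address
```

Keying the mapping on the address reflects the addressable nature of the array: a predetermined location identifies the moiety expected to detect a particular target.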

“Hybridizing” and “binding”, with respect to polynucleotides, are used interchangeably.

A “pulse jet” is a device which can dispense drops in the formation of an array. Pulse jets operate by delivering a pulse of pressure to liquid adjacent an outlet or orifice such that a drop will be dispensed therefrom (for example, by a piezoelectric or thermoelectric element positioned in a same chamber as the orifice).

A “subarray” or “subgrid” is a subset of an array. Typically, a number of subgrids are laid out on a single slide and are separated by a greater spacing than the spacing that separates features or spots or dots.

Any given substrate (e.g., slide) may carry one, two, four or more arrays disposed on a front surface of the substrate. Depending upon the use, any or all of the arrays may be the same or different from one another and each may contain multiple spots or features. A typical array may contain more than ten, more than one hundred, more than one thousand, more than ten thousand features, or even more than one hundred thousand features, in an area of less than 20 cm² or even less than 10 cm². For example, features may have widths (that is, diameter, for a round spot) in the range from 10 μm to 1.0 cm. In other embodiments each feature may have a width in the range of 1.0 μm to 1.0 mm, usually 5.0 μm to 500 μm, and more usually 10 μm to 200 μm. Non-round features may have area ranges equivalent to that of circular features with the foregoing width (diameter) ranges. At least some, or all, of the features are of different compositions (for example, when any repeats of each feature composition are excluded the remaining features may account for at least 5%, 10%, or 20% of the total number of features).

Interfeature areas will typically (but not essentially) be present which do not carry any polynucleotide (or other biopolymer or chemical moiety of a type of which the features are composed). Such interfeature areas typically will be present where the arrays are formed by processes involving drop deposition of reagents but may not be present when, for example, photolithographic array fabrication processes are used. It will be appreciated though, that the interfeature areas, when present, could be of various sizes and configurations.

Each array may cover an area of less than 100 cm², or even less than 50 cm², 10 cm² or 1 cm². In many embodiments, the substrate carrying the one or more arrays will be shaped generally as a rectangular solid (although other shapes are possible; for example, some manufacturers are currently working on flexible substrates), having a length of more than 4 mm and less than 1 m, usually more than 4 mm and less than 600 mm, more usually less than 400 mm; a width of more than 4 mm and less than 1 m, usually less than 500 mm and more usually less than 400 mm; and a thickness of more than 0.01 mm and less than 5.0 mm, usually more than 0.1 mm and less than 2 mm and more usually more than 0.2 mm and less than 1 mm. With arrays that are read by detecting fluorescence, the substrate may be of a material that emits low fluorescence upon illumination with the excitation light. Additionally in this situation, the substrate may be relatively transparent to reduce the absorption of the incident illuminating laser light and subsequent heating if the focused laser beam travels too slowly over a region. For example, a substrate may transmit at least 20%, or 50% (or even at least 70%, 90%, or 95%), of the illuminating light incident on the front as may be measured across the entire integrated spectrum of such illuminating light or alternatively at 532 nm or 633 nm.

Arrays can be fabricated using drop deposition from pulse jets of either polynucleotide precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained polynucleotide. Such methods are described in detail in, for example, the previously cited references including U.S. Pat. Nos. 6,242,266; 6,232,072; 6,180,351; 6,171,797; and 6,323,043, and in U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. As already mentioned, these references are incorporated herein, in their entireties, by reference thereto. Other drop deposition methods can be used for fabrication, as previously described herein. Also, instead of drop deposition methods, photolithographic array fabrication methods may be used. Interfeature areas need not be present particularly when the arrays are made by photolithographic methods.

Following receipt by a user of an array made by an array manufacturer, it will typically be exposed to a sample (for example, a fluorescently labeled polynucleotide or protein containing sample) and the array then read. Reading of the array may be accomplished by illuminating the array and reading the location and intensity of resulting fluorescence at multiple regions on each feature of the array. For example, a scanner may be used for this purpose which is similar to the AGILENT MICROARRAY SCANNER manufactured by Agilent Technologies, Palo Alto, Calif. Other suitable apparatus and methods are described in U.S. Pat. Nos. 6,406,849; 6,371,370; and 6,756,202; and in U.S. Patent Publication No. 2003/0160183 titled “Reading Dry Chemical Arrays Through The Substrate” by Dorsel et al. However, arrays may be read by any other method or apparatus than the foregoing, with other reading methods including other optical techniques (for example, detecting chemiluminescent or electroluminescent labels) or electrical techniques (where each feature is provided with an electrode to detect hybridization at that feature in a manner disclosed in U.S. Pat. Nos. 6,251,685 and 6,221,583 and elsewhere). A result obtained from the reading followed by a method of the present invention may be used in that form or may be further processed to generate a result such as that obtained by forming conclusions based on the pattern read from the array (such as whether or not a particular target sequence may have been present in the sample, or whether or not a pattern indicates a particular condition of an organism from which the sample came). A result of the reading (whether further processed or not) may be forwarded (such as by communication) to a remote location if desired, and received there for further use (such as further processing).

The term “stringent assay conditions” or “stringent conditions” as used herein refers to conditions that are compatible to produce binding pairs of nucleic acids, e.g., surface bound and solution phase nucleic acids, of sufficient complementarity to provide for the desired level of specificity in the assay while being less compatible to the formation of binding pairs between binding members of insufficient complementarity to provide for the desired specificity. Stringent assay conditions are the summation or combination (totality) of both hybridization and wash conditions.

A “stringent hybridization” and “stringent hybridization wash conditions” in the context of nucleic acid hybridization (e.g., as in array, Southern or Northern hybridizations) are sequence dependent, and are different under different experimental parameters. Stringent hybridization conditions that can be used to identify nucleic acids within the scope of the invention can include, e.g., hybridization in a buffer comprising 50% formamide, 5×SSC, and 1% SDS at 42° C., or hybridization in a buffer comprising 5×SSC and 1% SDS at 65° C., both with a wash of 0.2×SSC and 0.1% SDS at 65° C. Exemplary stringent hybridization conditions can also include a hybridization in a buffer of 40% formamide, 1 M NaCl, and 1% SDS at 37° C., and a wash in 1×SSC at 45° C. Alternatively, hybridization to filter-bound DNA in 0.5 M NaHPO₄, 7% sodium dodecyl sulfate (SDS), 1 mM EDTA at 65° C., and washing in 0.1×SSC/0.1% SDS at 68° C. can be employed. Yet additional stringent hybridization conditions include hybridization at 60° C. or higher and 3×SSC (450 mM sodium chloride/45 mM sodium citrate) or incubation at 42° C. in a solution containing 30% formamide, 1M NaCl, 0.5% sodium sarcosine, 50 mM MES, pH 6.5. Those of ordinary skill will readily recognize that alternative but comparable hybridization and wash conditions can be utilized to provide conditions of similar stringency.
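The SSC strengths recited above scale linearly from the 1×SSC composition of 150 mM sodium chloride and 15 mM sodium citrate, consistent with the 3×SSC figures given (450 mM/45 mM). A minimal sketch of that arithmetic follows; the function name is hypothetical.

```python
def ssc_concentrations_mM(fold):
    """Return (sodium chloride mM, sodium citrate mM) for a given fold of SSC.

    Based on 1x SSC = 150 mM sodium chloride and 15 mM sodium citrate.
    """
    if fold < 0:
        raise ValueError("fold must be non-negative")
    return (150.0 * fold, 15.0 * fold)

# 3x SSC, as recited in the stringent hybridization conditions.
nacl_3x, citrate_3x = ssc_concentrations_mM(3)   # (450.0, 45.0)

# 5x SSC hybridization buffer and 0.2x SSC wash, for comparison.
nacl_5x, citrate_5x = ssc_concentrations_mM(5)
nacl_wash, citrate_wash = ssc_concentrations_mM(0.2)
```

Lower fold values correspond to lower salt concentrations, which is why the dilute SSC washes recited herein are the more stringent ones.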

In certain embodiments, the stringency of the wash conditions sets forth the conditions which determine whether a nucleic acid is specifically hybridized to a surface bound nucleic acid. Wash conditions used to identify nucleic acids may include, e.g.: a salt concentration of about 0.02 molar at pH 7 and a temperature of at least about 50° C. or about 55° C. to about 60° C.; or, a salt concentration of about 0.15 M NaCl at 72° C. for about 15 minutes; or, a salt concentration of about 0.2×SSC at a temperature of at least about 50° C. or about 55° C. to about 60° C. for about 15 to about 20 minutes; or, the hybridization complex is washed twice with a solution with a salt concentration of about 2×SSC containing 0.1% SDS at room temperature for 15 minutes and then washed twice by 0.1×SSC containing 0.1% SDS at 68° C. for 15 minutes; or, equivalent conditions. Stringent conditions for washing can also be, e.g., 0.2×SSC/0.1% SDS at 42° C.

A specific example of stringent assay conditions is rotating hybridization at 65° C. in a salt based hybridization buffer with a total monovalent cation concentration of 1.5 M (e.g., as described in U.S. patent application Ser. No. 09/655,482 filed on Sep. 5, 2000, the disclosure of which is herein incorporated by reference) followed by washes of 0.5×SSC and 0.1×SSC at room temperature.

Stringent assay conditions are hybridization conditions that are at least as stringent as the above representative conditions, where a given set of conditions are considered to be at least as stringent if substantially no additional binding complexes that lack sufficient complementarity to provide for the desired specificity are produced in the given set of conditions as compared to the above specific conditions, where by “substantially no more” is meant less than about 5-fold more, typically less than about 3-fold more. Other stringent hybridization conditions are known in the art and may also be employed, as appropriate.

As noted above, conventional bioassays use one dye label per signal channel, with no direct onboard way to assure integrity of the label dyes. Examples of widely-used single-channel platforms include GeneChip®, by Affymetrix (http://www.affymetrix.com/products/arrays/index.affx) and the CodeLink System from GE Healthcare (http://www.affymetrix.com/products/arrays/index.affx). A gradient pattern that results from reading such an array does not necessarily imply a dye-biasing error, but could be due to other factors arising during production of the array and/or from hybridization conditions, as noted above. Further, with single-channel systems, since there is only one channel being analyzed, it is not possible to run dye-swap experiments, as there is typically only one set of probes and one dye used.

The present invention provides solutions that include onboard verification of labeling, even for single-channel systems. Multiple labels may be incorporated into one sample, such that the probes on an array read by a single channel of a system will get information from multiple labels. For example, for dye-biasing, both red and green dye labels may be incorporated in biopolymers in the same sample, and the multi-labeled sample is then exposed to the probes on an array under stringent hybridization conditions. The resulting signals read by an array scanner will then reflect the same sample labeled with green dye, as well as with red dye. Thus, a two-channel, or two color scanner may be used to process a single sample in this instance, with one channel of signal measurement.

FIGS. 1-2 illustrate an exemplary array, where the array shown in this representative embodiment includes a contiguous planar substrate 110 carrying an array 112 disposed on a surface 111b of substrate 110. It will be appreciated though, that more than one array (any of which are the same or different) may be present on surface 111b, with or without spacing between such arrays. That is, any given substrate may carry one, two, four or more arrays disposed on a surface of the substrate and depending on the use of the array, any or all of the arrays may be the same or different from one another and each may contain multiple spots or features. The one or more arrays 112 usually cover only a portion of the surface 111b, with regions of the surface 111b adjacent the opposed sides 113c, 113d and leading end 113a and trailing end 113b of slide 110, not being covered by any array 112. An opposite surface 111a of the slide 110 typically does not carry any arrays 112. Each array 112 can be designed for testing against any type of sample, whether a trial sample, reference sample, a combination of them, or a known mixture of biopolymers such as polynucleotides. Substrate 110 may be of any shape, as mentioned above.

As mentioned above, array 112 contains multiple spots or features 116 of oligomers, e.g., in the form of polynucleotides, and specifically oligonucleotides. As mentioned above, all of the features 116 may be different, or some or all could be the same. The interfeature areas 117 could be of various sizes and configurations. Each feature carries a predetermined oligomer such as a predetermined polynucleotide (which includes the possibility of mixtures of polynucleotides). It will be understood that there may be a linker molecule (not shown) of any known types between the surface 111b and the first nucleotide.

Substrate 110 may carry on surface 111a an identification code, e.g., in the form of a bar code (not shown) or the like, printed on a substrate in the form of a paper label attached by adhesive or any convenient means. The identification code may contain information relating to array 112, where such information may include, but is not limited to, an identification of array 112, i.e., layout information relating to the array(s), etc.

In the case of an array in the context of the present application, the “target” may be referenced as a moiety in a mobile phase (typically fluid), to be detected by “probes” which are bound to the substrate at the various regions.

A “scan region” refers to a contiguous (preferably, rectangular) area in which the array spots or features of interest, as defined above, are found or detected. Where fluorescent labels are employed, the scan region is that portion of the total area illuminated from which the resulting fluorescence is detected and recorded. Where other detection protocols are employed, the scan region is that portion of the total area queried from which resulting signal is detected and recorded. For the purposes of this invention and with respect to fluorescent detection embodiments, the scan region includes the entire area of the slide scanned in each pass of the lens, between the first feature of interest, and the last feature of interest, even if there exist intervening areas that lack features of interest.

FIG. 3 shows a flowchart of events that may be carried out in processing a sample with multiple different labels. At event 302, multiple different labels are incorporated into a single sample containing target nucleic acids into which the labels are incorporated. The labels are combined with the single sample in amounts such that each label incorporates into the nucleic acids of the sample to produce proportional signals across probes on an array to which the labeled nucleic acids are to be hybridized. Although specific examples described herein are directed to dye labeling, and incorporation of two different dye labels into the same sample, it is again noted here that the principles and methods described herein are equally applicable to other label types. For example, biopolymers (e.g., nucleic acids) in the same sample may be labeled with either Cy3-dye or Cy5-dye and also with a radioactive label, or with two radioactive labels (radioactive isomers), biotinylated dyes, or with two different labels of any known types, as long as a system or systems are available for reading the signals associated with such labels.

It should be further noted here that the present invention is not limited to incorporation of only two different labels into biopolymers (e.g., nucleic acids) in the same sample, as more than two different labels may be incorporated into the biopolymers to perform the functions described herein, and which would be processed similarly. By incorporating a mixture of multiple (two or more) different labels into the biopolymers (e.g., nucleic acids) of a single sample, the signal values read from a probe bound to biopolymers incorporating a first label may be compared to the signal values read from the same probe bound to biopolymers incorporating a second label, as well as against signal values from the probe bound to biopolymers incorporating a third, fourth or fifth label, etc., and these comparisons can be made across a plurality or even all probes on an array that bind to the target sample, to compare the performance of one label versus another label for the same nucleic acids across a plurality of probes binding to different biopolymers. The degree to which the first and second-labeled signals (or first and third, first, second and third, or however many different signals are compared, depending upon the number of labels incorporated) are proportional to one another across a plurality of different probes (e.g., across the probes on the array) may be characterized by a divergence metric, thereby providing a check of integrity of the labels as a quantitative measurement of label integrity and hence, fidelity of the signals read as they are influenced by the labels incorporated therein.
For example, if incorporation of one particular label, for example a dye, results in signal levels read from probes bound to nucleic acids having the dye incorporated therein, that when plotted against the positions of the features/probes from which the signals were read, presents an unusual gradient in the surface characterizing the plotted signal levels, as compared to surface plots produced from signals read from the same corresponding probes bound to nucleic acids having other labels incorporated therein, respectively, then this is direct evidence that that dye has a lack of integrity across the range of signal levels read. For example, Cy5 label (red) is more susceptible to ozone degradation than Cy3 label (green). Another example is that auto-fluorescence can influence signals from biopolymers (e.g., nucleic acids) having Cy3 dye label incorporated therein much more than signals from the same biopolymers (e.g., nucleic acids) having Cy5 dye label incorporated therein. In situations such as these, the signals read from the biopolymers (e.g., nucleic acids) labeled with red dye and the signals read from the corresponding biopolymers (e.g., nucleic acids) labeled with green dye result in a mutually divergent pattern when the signals are plotted with regard to the positions of the features on the array to produce response surface plots, since chemical differences are amplified by unstable conditions.
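As one hedged illustration of how such a divergence metric might be computed, the sketch below assumes two lists of per-probe signals, one per label, and uses the coefficient of variation of the per-probe signal ratios as the metric. The function name `ratio_cv`, the choice of the coefficient of variation, and the signal values are assumptions made for illustration only; this description does not prescribe a particular metric.

```python
import statistics

def ratio_cv(signals_a, signals_b):
    """Coefficient of variation of per-probe ratios a/b (hypothetical metric).

    If the two labels report proportional signals across probes, every ratio
    is the same constant and the CV is 0; larger values mean more divergence.
    """
    if len(signals_a) != len(signals_b):
        raise ValueError("both labels must be read from the same probes")
    ratios = [a / b for a, b in zip(signals_a, signals_b) if b > 0]
    if len(ratios) < 2:
        raise ValueError("need at least two probes with nonzero signal")
    return statistics.pstdev(ratios) / statistics.mean(ratios)

# Made-up signals from the same five probes, read in two channels.
green = [100.0, 200.0, 400.0, 800.0, 1600.0]
red_proportional = [2.0 * g for g in green]             # fixed 2:1 ratio
red_divergent = [150.0, 380.0, 900.0, 2000.0, 4800.0]   # ratio grows with signal

cv_ok = ratio_cv(red_proportional, green)
cv_bad = ratio_cv(red_divergent, green)
```

In this toy data the proportional pair yields a metric of zero, while the pair whose ratio rises with signal level yields a clearly nonzero value, which is the kind of pattern the divergence metric is meant to expose.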

The labels are incorporated into the molecules in the sample at a fixed ratio across all the molecules into which the labels are incorporated, such that signals that are read from the labeled molecules will be at a fixed ratio across molecules, when comparing one label versus another. Both the normal substrate (for example, dCTP) and a dye-modified dNTP (for example, Cye-dCTP) may be present in the reaction. A fixed ratio of the normal substrate to the dye substrate (derivative) dictates how much dye is incorporated into the sample and this does not change over time, as long as both substrates are present in excess and the effective concentration does not change as a function of time. So, for example, when two dyes are to be incorporated into the same sample, the amount of each substrate for the two dyes, respectively should be at a fixed ratio, and as long as the reactants (dyes not yet incorporated into sample) are available, the enzyme drives incorporation of the dyes into the sample at a fixed rate, and in quantities that are at the fixed ratio determined as described above. Examples of dyes that may be incorporated include those dyes used for fluorescent labeling in which fluorescently tagged nucleotides, (e.g., Cy3-CTP) are incorporated into an antisense RNA, or, for example, Cy3-dCTP are incorporated into cDNA (from a first strand synthesis or a non-amplification method) product during the transcription step. Fluorescent moieties which may be used to tag nucleotides for producing labeled samples include: fluorescein, the cyanine dyes, such as Cy3, Cy5, Alexa 542, Bodipy 630/650, and the like. Other labels may also be employed as are known in the art.

One approach for incorporating multiple fluorescent dye labels into the same sample employs linear amplification techniques. According to this approach, mRNA in the sample molecules are linearly amplified into antisense RNA. Thus amplified amounts of antisense RNA are produced by amplification of an initial amount of mRNA. By amplified amounts is meant that for each initial mRNA, multiple corresponding antisense RNAs, where the term antisense RNA is defined here as ribonucleic acid complementary to the initial mRNA, are produced. By corresponding is meant that the antisense RNA shares a substantial amount of sequence identity with the sequence complementary to the mRNA (i.e. the complement of the initial mRNA), where substantial amount means at least 95% usually at least 98% and more usually at least 99%, where sequence identity is determined using the BLAST algorithm. Further information regarding this step can be found in U.S. Pat. Nos. 6,132,997 and 6,916,633, each of which is incorporated herein, in its entirety, by reference thereto. Generally, the number of corresponding antisense RNA molecules produced for each initial mRNA during the subject linear amplification methods will be at least about 10, usually at least about 50 and more usually at least about 100, where the number may be as great as 600 or greater, but often does not exceed about 1000.

FIG. 4 schematically illustrates an mRNA sequence 400 from the sample to be labeled with multiple labels. The sample is subjected to a series of enzymatic reactions under conditions sufficient to ultimately produce double-stranded DNA for each initial mRNA in the sample that is amplified. An RNA polymerase promoter region (e.g., T7 promoter 410) is next incorporated into the resultant product, which region is critical for the transcription step described in greater detail below. The poly T region of the primer (promoter) binds with the poly-A tail of the mRNA, as shown (where “T” and “A” represent base components of RNA, as is well-known).

The initial mRNA may be present in a variety of different samples, where the sample will typically be derived from a physiological source. The physiological source may be derived from a variety of eukaryotic sources, with physiological sources of interest including sources derived from single-celled organisms such as yeast and multicellular organisms, including plants and animals, particularly mammals, where the physiological sources from multicellular organisms may be derived from particular organs or tissues of the multicellular organism, or from isolated cells derived therefrom. In obtaining the sample of RNA to be analyzed from the physiological source from which it is derived, the physiological source may be subjected to a number of different processing steps, where such processing steps might include tissue homogenization, cell isolation and cytoplasm extraction, nucleic acid extraction and the like, where such processing steps are known to those of skill in the art. Methods of isolating RNA from cells, tissues, organs or whole organisms are known to those of skill in the art. Alternatively, at least some of the initial steps of the subject methods may be performed in situ, as described in U.S. Pat. No. 5,514,545, which is hereby incorporated herein, in its entirety, by reference thereto.

Depending on the nature of the primer employed during first strand synthesis, amplified amounts of antisense RNA can be produced corresponding to substantially all of the mRNA present in the initial sample, or to a proportion or fraction of the total number of distinct mRNAs present in the initial sample. By substantially all of the mRNA present in the sample is meant more than 90%, usually more than 95%, where that portion not amplified is solely the result of inefficiencies of the reaction and not intentionally excluded from amplification.

The promoter-primer employed in the amplification reaction includes: (a) a poly-dT region for hybridization to the poly-A tail of the mRNA; and (b) an RNA polymerase promoter region 5′ of the poly-dT region that is in an orientation capable of directing transcription of antisense RNA. In certain embodiments, the primer will be a “lock-dock” primer, in which immediately 3′ of the poly-dT region is either a “G”, “C”, or “A” such that the primer has the configuration of 3′-XTTTTTTT . . . 5′, where X is “G”, “C”, or “A”. The poly-dT region is sufficiently long to provide for efficient hybridization to the poly-A tail, where the region typically ranges in length from 10 to 50 nucleotides, usually from 10 to 25 nucleotides, and more usually from 14 to 20 nucleotides.
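The layout of a lock-dock promoter-primer can be sketched as a simple string construction. The example below is illustrative only: the promoter string is the commonly cited T7 promoter core, used here as a placeholder because this description does not specify a promoter sequence, and the function name and default poly-dT length are assumptions.

```python
# Commonly cited T7 promoter core, used only as a placeholder promoter;
# the description does not give a specific promoter sequence.
T7_PROMOTER = "TAATACGACTCACTATAGGG"

def lock_dock_promoter_primers(promoter=T7_PROMOTER, poly_dt_len=18):
    """Return the three primers, written 5'->3': promoter + poly-dT + anchor X.

    The promoter sits 5' of the poly-dT region, and the anchor base X
    (G, C or A) sits immediately 3' of it, as in the lock-dock layout.
    """
    if not 10 <= poly_dt_len <= 50:
        raise ValueError("poly-dT length outside the 10 to 50 nt range")
    return [promoter + "T" * poly_dt_len + x for x in "GCA"]

primers = lock_dock_promoter_primers()
```

Because the anchor base is never T, each primer can only seat at the junction between the poly-A tail and the upstream message, which is the purpose of the lock-dock design.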

A number of RNA polymerase promoters may be used for the promoter region of the first strand cDNA primer, i.e. the promoter-primer. Suitable promoter regions will be capable of initiating transcription from an operably linked DNA sequence in the presence of ribonucleotides and an RNA polymerase under suitable conditions. The promoter will be linked in an orientation to permit transcription of antisense RNA. A linker oligonucleotide between the promoter and the DNA may be present and, if present, will typically comprise between about 5 and 20 bases, but may be smaller or larger as desired. The promoter region will usually comprise between about 15 and 250 nucleotides, preferably between about 17 and 60 nucleotides, from a naturally occurring RNA polymerase promoter or a consensus promoter region. In general, prokaryotic promoters are preferred over eukaryotic promoters, and phage or virus promoters are most preferred. As used herein, the term “operably linked” refers to a functional linkage between the affecting sequence (typically a promoter) and the controlled sequence (the mRNA binding site). The promoter regions that find use are regions where RNA polymerase binds tightly to the DNA and that contain the start site and signal for RNA synthesis to begin. A wide variety of promoters are known and many are very well characterized. Representative promoter regions of particular interest include T7, T3 and SP6 as described in Chamberlin and Ryan, The Enzymes (ed. P. Boyer, Academic Press, New York) (1982) pp 87-108.

The promoter-primer described above and throughout this specification may be prepared using any suitable method, such as, for example, the known phosphotriester and phosphite triester methods, or automated embodiments thereof. In one such automated embodiment, dialkyl phosphoramidites are used as starting materials and may be synthesized as described by Beaucage et al. (1981), Tetrahedron Letters 22, 1859. One method for synthesizing oligonucleotides on a modified solid support is described in U.S. Pat. No. 4,458,066. It is also possible to use a primer that has been isolated from a biological source (such as a restriction endonuclease digest). The primers herein are selected to be “substantially” complementary to each specific sequence to be amplified, i.e., the primers should be sufficiently complementary to hybridize to their respective targets. Therefore, the primer sequence need not reflect the exact sequence of the target, and can, in fact, be “degenerate.” Non-complementary bases or longer sequences can be interspersed into the primer, provided that the primer sequence has sufficient complementarity with the sequence of the target to be amplified to permit hybridization and extension.

Reverse transcriptase is then used to make a cDNA strand 412. The RNA strand 400 is next degraded using RNaseH, and a primer 414 is added. An exogenous primer can be added (e.g., random hexamer) or priming can occur by synthesis from residual RNA that is still bound to the DNA or snap back priming from the cDNA strand made during first strand synthesis. An enzyme is used to make a copy of cDNA strand 412 according to known techniques, to synthesize double-stranded cDNA 412,412′. After hybridizing the oligonucleotide promoter-primer 410 with an initial mRNA sample 400, the primer-mRNA hybrid is converted to a double-stranded cDNA product that is recognized by an RNA polymerase, as noted. The promoter-primer is contacted with the mRNA under conditions that allow the poly-dT site to hybridize to the poly-A tail present on most mRNA species. The catalytic activities required to convert primer-mRNA hybrid to double-stranded cDNA are an RNA-dependent DNA polymerase activity, an RNaseH activity, and a DNA-dependent DNA polymerase activity. Most reverse transcriptases, including those derived from Moloney murine leukemia virus (MMLV-RT), avian myeloblastosis virus (AMV-RT), bovine leukemia virus (BLV-RT), Rous sarcoma virus (RSV) and human immunodeficiency virus (HIV-RT) catalyze each of these activities. These reverse transcriptases are sufficient to convert primer-mRNA hybrid to double-stranded DNA in the presence of additional reagents which include, but are not limited to: dNTPs; monovalent and divalent cations, e.g. KCl, MgCl.sub.2; sulfhydryl reagents, e.g. dithiothreitol; and buffering agents, e.g. Tris-Cl. Alternatively, a variety of proteins that catalyze one or two of these activities can be added to the cDNA synthesis reaction. For example, MMLV reverse transcriptase lacking RNaseH activity (described in U.S. Pat. No. 5,405,776), which catalyzes RNA-dependent DNA polymerase activity and DNA-dependent DNA polymerase activity, can be added with a source of RNaseH activity, such as the RNaseH purified from cellular sources, including Escherichia coli. These proteins may be added together during a single reaction step, or added sequentially during two or more substeps. Finally, additional proteins that may enhance the yield of double-stranded DNA products may also be added to the cDNA synthesis reaction. These proteins include a variety of DNA polymerases (such as those derived from E. coli, thermophilic bacteria, archaebacteria, phage, yeasts, Neurosporas, Drosophilas, primates and rodents), and DNA Ligases (such as those derived from phage or cellular sources, including T4 DNA Ligase and E. coli DNA Ligase).

Conversion of primer-mRNA hybrid to double-stranded cDNA by reverse transcriptase proceeds through an RNA:DNA intermediate which is formed by extension of the hybridized promoter-primer by the RNA-dependent DNA polymerase activity of reverse transcriptase. The RNaseH activity of the reverse transcriptase then hydrolyzes at least a portion of the RNA:DNA hybrid, leaving behind RNA fragments that can serve as primers for second strand synthesis (Meyers et al., Proc. Nat'l Acad. Sci. USA (1980) 77:1316 and Olsen & Watson, Biochem. Biophys. Res. Comm. (1980) 97:1376). Extension of these primers by the DNA-dependent DNA polymerase activity of reverse transcriptase results in the synthesis of double-stranded cDNA. Other mechanisms for priming of second strand synthesis may also occur, including “self-priming” by a hairpin loop formed at the 3′ terminus of the first strand cDNA and “non-specific priming” by other DNA molecules in the reaction, i.e. the promoter-primer.

The second strand cDNA synthesis results in the creation of a double-stranded promoter region. The second strand cDNA includes not only a sequence of nucleotide residues that comprise a DNA copy of the mRNA template, but also additional sequences at its 3′ end which are complementary to the promoter-primer used to prime first strand cDNA synthesis. The double-stranded promoter region serves as a recognition site and transcription initiation site for RNA polymerase, which uses the second strand cDNA as a template for multiple rounds of RNA synthesis, as noted.

Using the promoter (e.g., T7 promoter), RNA polymerase is added, which binds to the promoter to generate a cRNA (complementary RNA) strand (antisense RNA) 416, as a copy of strand 412, and this copying process repeats itself 418 to produce hundreds, possibly about a thousand cRNA copies of the cDNA strand. The cRNA generated is an exact copy of strand 412, or a reverse complement of strand 412′. Antisense RNA is made, which contains TTT (i.e., a poly-T tail). It is not a copy of the mRNA that was started with, which contains AAA (i.e., a poly-A tail).
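The strand relationships described above can be traced with a toy string model. This is a sketch under stated assumptions: the mRNA sequence is hypothetical, the function names are illustrative, and the antisense strand is written in the RNA alphabet, so the stretch complementary to the poly-A tail appears as a run of U.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement_dna(seq):
    """Reverse complement of a DNA string written 5'->3'."""
    return seq.translate(DNA_COMPLEMENT)[::-1]

def first_strand_cdna(mrna):
    """First-strand cDNA (5'->3') complementary to an mRNA given 5'->3'."""
    return reverse_complement_dna(mrna.replace("U", "T"))

def antisense_rna(mrna):
    """Antisense RNA: the RNA-alphabet version of the first-strand cDNA."""
    return first_strand_cdna(mrna).replace("T", "U")

mrna = "AUGGCCUUAGAAAAAAAA"   # hypothetical message with a short poly-A tail
cdna = first_strand_cdna(mrna)
crna = antisense_rna(mrna)
```

The antisense product begins with the complement of the poly-A tail and is not a copy of the starting mRNA, consistent with the description of strand 416 above.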

The antisense RNA resultant from the double-stranded cDNA is produced by transcribing by RNA polymerase to yield antisense RNA, which is complementary to the initial mRNA target from which it is amplified. This step is carried out in the presence of reverse transcriptase which is present in the reaction mixture. Thus, this technique does not involve a step in which the double-stranded cDNA is physically separated from the reverse transcriptase following double-stranded cDNA preparation. The reverse transcriptase that is present during the transcription step is rendered inactive, and thus, the transcription step is carried out in the presence of a reverse transcriptase that is unable to catalyze RNA-dependent DNA polymerase activity, at least for the duration of the transcription step. As a result, the antisense RNA products of the transcription reaction cannot serve as substrates for additional rounds of amplification, and the amplification process cannot proceed exponentially.
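The difference between linear and exponential amplification can be made concrete with a small arithmetic sketch. The function names and the input numbers are illustrative assumptions; the exponential case is shown only for contrast and is not part of the method described here.

```python
def linear_yield(n_mrna, copies_per_template):
    """Transcripts are never re-used as templates, so yield scales once."""
    return n_mrna * copies_per_template

def exponential_yield(n_template, fold_per_round, rounds):
    """Contrast case: products of each round become templates for the next."""
    return n_template * fold_per_round ** rounds

# Hypothetical input: 1,000 mRNA molecules, 100 transcripts per template.
linear = linear_yield(1_000, 100)
exponential = exponential_yield(1_000, 100, 3)
```

With the reverse transcriptase inactive, only the first function applies, so the relative abundance of each message in the product tracks its abundance in the starting sample.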

The reverse transcriptase present during the transcription step may be rendered inactive using any convenient protocol. The transcriptase may be irreversibly or reversibly rendered inactive. Where the transcriptase is irreversibly rendered inactive, the transcriptase is physically or chemically altered so as to be no longer able to catalyze RNA-dependent DNA polymerase activity. The transcriptase may be irreversibly inactivated by any convenient means. Thus, the reverse transcriptase may be heat inactivated, in which the reaction mixture is subjected to heating to a temperature sufficient to inactivate the reverse transcriptase prior to commencement of the transcription step. In these embodiments, the temperature of the reaction mixture and therefore the reverse transcriptase present therein is typically raised to 55° C. to 70° C. for 5 to 60 minutes, usually to about 65° C. for 15 to 20 minutes. Alternatively, reverse transcriptase may be irreversibly inactivated by introducing a reagent into the reaction mixture that chemically alters the protein so that it no longer has RNA-dependent DNA polymerase activity. In yet other embodiments, the reverse transcriptase is reversibly inactivated. In these embodiments, the transcription may be carried out in the presence of an inhibitor of RNA-dependent DNA polymerase activity. Any convenient reverse transcriptase inhibitor may be employed which is capable of inhibiting RNA-dependent DNA polymerase activity a sufficient amount to provide for linear amplification. However, these inhibitors should not adversely affect RNA polymerase activity. Reverse transcriptase inhibitors of interest include ddNTPs, such as ddATP, ddCTP, ddGTP or ddTTP, or a combination thereof, where the total concentration of the inhibitor typically ranges from about 50 μM to 200 μM.

For this transcription step, the presence of the RNA polymerase promoter region on the double-stranded cDNA is exploited for the production of antisense RNA. To synthesize the antisense RNA, the double-stranded DNA is contacted with the appropriate RNA polymerase in the presence of the four ribonucleotides, under conditions sufficient for RNA transcription to occur, where the particular polymerase employed will be chosen based on the promoter region present in the double-stranded DNA, e.g. T7 RNA polymerase, T3 or SP6 RNA polymerases, E. coli RNA polymerase, and the like. Suitable conditions for RNA transcription using RNA polymerases are known in the art. As mentioned above, a critical feature of the subject methods is that this transcription step is carried out in the presence of a reverse transcriptase that has been rendered inactive, e.g. by heat inactivation or by the presence of an inhibitor.

Because of the nature of the steps described, all of the necessary polymerization reactions, i.e., first strand cDNA synthesis, second strand cDNA synthesis and antisense RNA transcription, may be carried out in the same reaction vessel at the same temperature, such that temperature cycling is not required. As such, these methods are particularly suited for automation, as the requisite reagents for each of the above steps need merely be added to the reaction mixture in the reaction vessel, without any complicated separation steps being performed, such as phenol/chloroform extraction.

The resultant antisense RNA may next be labeled with multiple different labels. As noted, labels may include any known types that are designed to be interpreted, scanned or read during processing of the sample after its hybridization on a chemical array, including radioactive labeling, dye labeling, etc.

FIG. 5 illustrates how two fluorescent dye nucleotides can be incorporated into antisense RNA. Starting with double-stranded cDNA 412,412′ as described above with regard to FIG. 4, in the absence of dye nucleotides, the reaction described with regard to FIG. 4 results in antisense cRNA sequence 416, as noted. In the presence of dye-CTP 602 and amino-allyl ATP, the double-stranded cDNA 412,412′ generates the dye labeled nucleotide 420. During the transcription reaction, two modified nucleotides are present. For example, the first modified nucleotide in FIG. 5 may be dye-CTP 602, which will result in the dye fluorophore directly incorporating into the cRNA during its synthesis (see 420). The second modified nucleotide present in the transcription reaction may contain a chemical reactive group that allows for a dye attachment during a chemical conjugation step after the transcription reaction. The second dye label 604 is incorporated by first incorporating a nucleotide derivative that has a chemical reactive group (e.g., amino-allyl or biotin), and then, in a secondary step, the second dye, which has been provided with a chemical reactive group (e.g., NHS-ester or streptavidin), is added, wherein the two chemical reactive groups (e.g., NHS-ester and amino-allyl or biotin and streptavidin) react to bind the second dye, thereby incorporating dye 604 into the sequence as a dye conjugate (see 422).

FIG. 6 illustrates a process of incorporating two different fluorescent dyes into a single sample, where the target generated is a fluorescently labeled cDNA, starting from mRNA sample 516. For 5′ end-labeling of the target (see mRNA template 518), a first dye may be bound to an oligo-dT primer 502 (e.g., 5′-dye-TTTTVN-3′ or 5′-dye-TTTn-3′), in which case synthesis of the complementary DNA (cDNA target) will begin at the 3′ end of the mRNA, or to a random primer 504 (e.g., 5′-dye-NNNNNN-3′), in which case synthesis of the cDNA target can be initiated randomly across the mRNA (see template mRNA at 520). Random primers may be used, for example, in splicing applications, where it is desired to generate fluorescently labeled cDNA copies of the mRNA (use of oligo-dT for the primer generates cDNAs that are biased to the 3′ end of the mRNA). For example, for a random 7-mer primer, a total of 4.sup.7 (i.e., 16,384) different sequences of primers would be provided, each bound to the first dye. Alternatively, the first dye may be provided in the form of a dye nucleotide, in which case more than one dye molecule of the first dye may be incorporated into the cDNA strand. The second dye may be incorporated in the form of a dye conjugated nucleotide.

In the above annotations, “N” represents any base (i.e., A, T, G or C), and the number of N's represents the number of bases. For example, an oligo-dT primer may include from about 12 to about 20 nucleotides (bases) and a random primer 504 may include from about 6 to about 12 nucleotides. Alternatively, oligo-dT primer 502 may be a lock-docked type primer (e.g., 5′-TTT..VN-3′), wherein “V” represents A, G or C base.
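The primer notation can be expanded mechanically, which also shows how quickly the number of distinct primers grows with the number of degenerate positions. The sketch below is illustrative: the names `CODES` and `expand` are assumptions, and only the two degenerate codes used in this description are handled.

```python
from itertools import product

# Degenerate-base codes used in the text: N = any base, V = A, G or C.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "V": "ACG", "N": "ACGT"}

def expand(pattern):
    """List every concrete sequence matched by a degenerate pattern."""
    return ["".join(p) for p in product(*(CODES[ch] for ch in pattern))]

random_hexamers = expand("NNNNNN")       # 4**6 sequences
random_heptamers = expand("N" * 7)       # 4**7 sequences
lock_docked = expand("T" * 12 + "VN")    # 3 * 4 sequences
```

A random 6-mer pool therefore contains 4,096 sequences and a random 7-mer pool 16,384, whereas a lock-docked oligo-dT primer ending in VN has only 12 variants.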

If it is desired to incorporate only one molecule of each dye per cDNA strand, then incorporation of a first dye with an oligo-dT primer 502 or random primer 504, as noted above, guarantees that only one molecule of the first dye is incorporated into the target cDNA strand 516 as shown at 518. The second dye is then provided either in the form of a dye-dideoxy nucleotide (dye-ddCTP), which acts as a chain terminator so that only one molecule of the second dye is incorporated into the target cDNA, or in the form of a dye-deoxy nucleotide (dye-dCTP), in which case multiple molecules of the second dye may be incorporated into the target cDNA. After the dye incorporation process, the mRNA template is degraded, leaving dye-labeled cDNA target sequence 524 (if dye-ddCTP was used as the second dye) or 524′ (if dye-dCTP was used as the second dye). The resultant dye-labeled sequences resulting from processing of the respective mRNA to fluorescently labeled cDNA are then used as the target for hybridizing a chemical array having probes designed to bind to the molecules of the sample that the dye-labeled sequences represent.

For dye-labeling using a random primer 504, the second dye may be provided in the form of a dye-dideoxy nucleotide to act as a chain terminator, and only one molecule of the second dye is then incorporated into the target cDNA sequence, or a dye-deoxy nucleotide (dye-dCTP) may be provided, in which multiple molecules of the second dye may be incorporated into the target cDNA.

After the dye incorporation process, the mRNA template is degraded, leaving dye-labeled sequence 528 (if dye-dCTP was used for the second dye) or 528′ (if dye-ddCTP was used as the second dye). The resultant dye-labeled sequences resulting from processing of the respective mRNA to fluorescently labeled cDNA are then used as the target for hybridizing a chemical array having probes designed to bind to the molecules of the sample that the dye-labeled sequences represent.

FIG. 7 illustrates another approach to incorporating two different dye labels into cRNA. Using this approach, the first dye label 702 is directly incorporated into the antisense cRNA sequence during the in vitro transcription reaction in the same way as described with regard to FIG. 5 above. In the presence of dye1-CTP 702 (the first dye), the double-stranded cDNA 412,412′ generates the dye labeled nucleotide cRNA 720. After labeling with the first dye 702, the labeled, antisense cRNA is then fragmented, providing segments 720s of the dye-labeled strand. The fragmented segments 720s currently labeled with the first dye 702 are next labeled with the second dye by a 3′-end labeling process, using poly-A polymerase and dye-ATP so that the second dye is incorporated as an end label 704 at the 3′-end of each fragmented cRNA 720s.

As an alternative to the use of linear amplification techniques for multiple labeling of a sample, non-amplification techniques may be used. FIG. 6, discussed above, illustrates an example of a non-amplification technique that may be used to generate fluorescently labeled targets that contain two different fluorophores. FIG. 6 schematically illustrates how an mRNA sequence 516 from the sample is converted to a representative cDNA that contains multiple labels. The sample is subjected to a series of enzymatic reactions under conditions sufficient to ultimately produce a first strand cDNA synthesis for each initial mRNA in the sample to be labeled, using techniques known in the art.

Reverse transcriptase is then used to make a cDNA strand. The RNA strand is next degraded using RNase, leaving the cDNA strand (single-stranded cDNA). The cDNA strand may be labeled with multiple labels 802,804 in any of the manners described above during the description of the linear amplification processes (e.g., incorporating a dye nucleotide and/or incorporating a modified nucleotide with subsequent conjugation of dye, with or without fragmentation, etc.). The multi-labeled cDNA sequences are then used as the target sample for further processing as described below.

Referring now back to FIG. 3, after multiple labels have been incorporated into a single sample according to any of the techniques described above, at event 304, the multi-labeled sample is hybridized with probes on an array having probes designed to bind with polynucleotides that are expected to be present in the sample. Replicates of probes may be provided on the array. Upon hybridizing the array with the target, multi-labeled sample, each probe is expected to bind with numbers or concentrations of each label to produce the proportional signals or scanner counts, as incorporated in the specific biopolymer (e.g., polynucleotide) that that probe is designed to bind with, since labels were applied to the sample by such design. Ideally, equal signals are produced for each different label incorporated into the same biopolymer (e.g., nucleic acid), but this is not necessary, since a comparison of patterns (e.g., gradients) across the signals received from the probes is what is important in determining the degree of divergence, not a comparison of signal magnitudes per se. Conversion methods can be applied when comparing unequal signal magnitudes, as taught in U.S. Pat. No. 6,188,969 and/or in U.S. Patent Publication No. 2005/0143935, both of which are incorporated herein, in their entireties, by reference thereto.

After washing and other typical processing steps, the array is then processed at event 306 to read the array (such as by scanning, or the like) to obtain signals from the probes with regard to each different label, respectively. The signal values associated with each of the different labels for each probe may then be used as a measure of label integrity, i.e., to measure the fidelity of the signals as affected by one label versus the others. Additionally, the signal values associated with each of the different labels may be used to improve quantitation and reproducibility of signal quantitation results, as will be described below. Thus, the techniques described herein provide an onboard diagnostic test of the labels employed, which may be used in experimental arrays for improving quality of results from arrays actually used in running experiments.

Since each label is expected to be incorporated into the nucleic acids in the sample in proportions designed to produce proportional signal levels on the same probe, across probes on the array, each set of signals for each label, respectively, is expected to measure the same biopolymers (e.g., polynucleotides) in equal concentrations across probes. Thus, a comparison of the signals associated with each label provides a reliable measure of whether the labels are distorting the signal readings, since all other technical factors do not vary (e.g., array to array differences, lot to lot differences, hybridization conditions, array manufacturing conditions, etc., factors that may typically be causes of gradients and other pattern variations when comparing two samples contacted to two different arrays).

The signal intensity values associated with the different labels are then compared at event 308 to identify label-induced errors (i.e., errors resulting from a lack of label integrity) in the signal intensities, or to confirm label integrity. One technique for comparison involves calculating (and optionally, plotting) response surfaces for each set of signals (where each set is associated with a different label) against the locations of the probes on the array from which the signals were obtained. Response surfaces may be plotted using any of a number of known techniques. The response surfaces should generally follow the same contour to confirm that label integrity exists, since the other technical factors (e.g., hybridization differences, array production and processing differences, etc., between experiments) are effectively eliminated by processing the same single sample on the same array, with respect to all labels. If a response surface associated with any particular label diverges from the response surfaces associated with the other label or labels, then this is an indication of error induced by one or more of the labels. A divergence threshold may be set that defines acceptable performance as defined by customer microarray markets. For example, if customers require the median inter-array coefficient of variation percentages (% CV) to be 12% or less, then it would be reasonable to set a threshold at 0.12 or less (e.g., 0.10) and, when set at 0.10, for example, a volatile, non-persistent ratio gradient between response surfaces produced from signals associated with first and second labels, respectively, with % CV>10% would be determined to be not acceptable, for lack of label integrity.

Thus, for example, if the response surfaces generated from signals associated with labels 2, 3 and 4, respectively, generally follow the same contours, but the response surface generated from signals associated with label 1 follows significantly different contours along all or a portion of the response surface, then this is an indication that there may be a problem with the label integrity of label 1. When only two labels are used, it may be indeterminate as to whether one or the other label (or both) is lacking in integrity. However, in any of the preceding instances, the result is the same, in that the results of an array experiment would be unreliable or unacceptable for lack of label integrity.
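As a minimal sketch of the response-surface comparison described above, the following Python fits a first-order surface (a plane) to the log signal of each label over probe row and column, and flags any label whose slopes depart from the median slopes of all labels. The grid size, signal values, label names, helper names (fit_plane, diverging_labels) and the tolerance are illustrative assumptions and are not taken from the example experiment; a second-order surface could be compared in the same manner.

```python
import statistics

def fit_plane(points):
    """Least-squares fit of z = a + b*row + c*col to (row, col, z) points."""
    n = len(points)
    mr = sum(p[0] for p in points) / n
    mc = sum(p[1] for p in points) / n
    mz = sum(p[2] for p in points) / n
    srr = sum((p[0] - mr) ** 2 for p in points)
    scc = sum((p[1] - mc) ** 2 for p in points)
    src = sum((p[0] - mr) * (p[1] - mc) for p in points)
    srz = sum((p[0] - mr) * (p[2] - mz) for p in points)
    scz = sum((p[1] - mc) * (p[2] - mz) for p in points)
    det = srr * scc - src ** 2
    b = (srz * scc - scz * src) / det
    c = (scz * srr - srz * src) / det
    return mz - b * mr - c * mc, b, c

def diverging_labels(surfaces, tolerance):
    """Labels whose row or column slope differs from the median slope by more than tolerance."""
    med_b = statistics.median(s[1] for s in surfaces.values())
    med_c = statistics.median(s[2] for s in surfaces.values())
    return [name for name, (_, b, c) in surfaces.items()
            if abs(b - med_b) > tolerance or abs(c - med_c) > tolerance]

# Hypothetical 10 x 10 array: every label sees the same technical gradient,
# labels differ by a constant offset, and label 1 adds a gradient of its own.
grid = [(r, c) for r in range(10) for c in range(10)]
shared = {(r, c): 0.02 * r + 0.01 * c for r, c in grid}
offsets = {"label1": 5.0, "label2": 5.3, "label3": 4.9, "label4": 5.1}
extra_row_slope = {"label1": 0.03, "label2": 0.0, "label3": 0.0, "label4": 0.0}

surfaces = {}
for name in offsets:
    pts = [(r, c, offsets[name] + shared[(r, c)] + extra_row_slope[name] * r)
           for r, c in grid]
    surfaces[name] = fit_plane(pts)

flagged = diverging_labels(surfaces, tolerance=0.01)
print(flagged)
```

In this sketch the constant offsets between labels do not trigger a flag, since only the contours (slopes) are compared, consistent with the comparison of patterns rather than signal magnitudes.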

Another technique for comparison includes calculating log ratios of intensity signal pairs associated with different labels (label-incorporated biopolymers) but the same probe. Signal pair ratios may be calculated for all possible combinations of different pairs of different labels, for each probe. For any given probe, each different label referred to is incorporated in the same target biopolymer (for example, the same nucleic acid) of the sample which that probe is designed to bind with. In this case, the ratios calculated are not expression ratios or ratios indicating other signals characterizing the sample (e.g., indicating copy number, as in a CGH assay, or transcription factor binding sites, as in a location analysis assay), but rather are ratios of the same signal reading, where each intensity signal of a probe is associated with a different label (i.e., the same biopolymer sequences bind to a probe, but the sequences have different labels). Assuming that the labels perform equally, the calculated log ratios should have a value of zero. However, there may be some bias between labels. For example, dye bias is known to be possible, such that a red dye associated with the same polynucleotide as a green dye may result in a higher signal intensity reading with regard to the polynucleotide incorporating the red dye relative to the polynucleotide incorporating the green dye. In these instances, the data may be processed to remove label biasing, by any variety of known techniques. However, with or without processing to remove label biasing, the log ratio values should remain fairly consistent across all probes on the array if there is label integrity. That is, even with dye bias being present, the log ratio of signal values associated with two different labels, from a first probe should be the same as the log ratio of signal values associated with those same two different labels from a second probe, if label integrity exists.
In other words, the difference between the log ratio of signal values associated with two different labels, from a first probe, and the log ratio of signal values associated with those same two different labels from any other probe on the array should be zero, or within a predetermined threshold value (positive difference less than the threshold value, negative difference greater than the negative of the threshold value), if label integrity exists. Another example is that if other technical factors exist that would cause a gradient in the surface response for signal intensities associated with label 1, then those technical factors will also exist with regard to the signal intensities associated with label 2, so that although the surface response associated with each of labels 1 and 2 will each show a gradient, a response surface generated from the ratios or log ratios of the signal associated with label 1 to the signals associated with label 2 (or vice versa) will not have the gradient, indicating that the gradient in the response surfaces associated with the single labels is induced by technical factors other than the labels themselves.
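The log ratio consistency check described above may be sketched as follows. The signal values, the drift model for the unstable label, the function names (log_ratios, has_label_integrity) and the threshold of 0.1 are hypothetical and chosen only for illustration.

```python
import math

def log_ratios(signal_a, signal_b):
    """Natural-log ratio of label A to label B signal for each probe."""
    return [math.log(a / b) for a, b in zip(signal_a, signal_b)]

def has_label_integrity(ratios, threshold):
    """True when every probe-to-probe difference of log ratios is within +/- threshold."""
    # All pairwise differences are within the threshold exactly when the
    # spread between the largest and smallest log ratio is.
    return max(ratios) - min(ratios) < threshold

# Hypothetical background-subtracted signals for five probes.
green = [100.0, 400.0, 1600.0, 6400.0, 25600.0]
red_biased = [g * 1.2 for g in green]                               # constant dye bias only
red_unstable = [g * (1.0 + 0.1 * i) for i, g in enumerate(green)]   # bias drifts from probe to probe

stable = log_ratios(red_biased, green)
unstable = log_ratios(red_unstable, green)
print(has_label_integrity(stable, threshold=0.1))
print(has_label_integrity(unstable, threshold=0.1))
```

The first set passes even though its log ratios are not zero, because a constant dye bias leaves the probe-to-probe differences at zero; the second set fails because the log ratio changes from probe to probe.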

After comparison of the signal intensity readings associated with the different labels, a determination may be made, based on such comparison, as to whether the fidelity of the signal intensity readings, as impacted by the labels used, is reliable. If it is determined that one or more labels lack integrity, such as by observing significant divergence of response surfaces, or variation in the differences between ratios across the array, then label integrity is determined to be absent at event 310 and the data is considered to be unreliable at event 312. Unstable labeling tends to amplify all differences such as the chemical differences between two different label dyes, for example. On the other hand, if label integrity is found to exist at event 310, then the data (signal intensity readings) may be considered reliable, at least to the extent that the labels used are not distorting the signal intensity readings.

It has been further discovered that the signal intensity readings associated with the different labels may be combined to form a composite or average signal intensity level for a probe, which may be more accurate, reliable and reproducible across experiments than if any single signal intensity level associated with any single label associated with the experiment were used. Such processing may optionally be carried out at event 316. The technique can average out small inconsistencies that may be present with various different types of labels. For example, labels such as dyes may exhibit a small amount of abundance-dependence, such as when dyes are incorporated into RNA according to the number of opportunities present (i.e., the number of nucleic acids that are present and complementary to the labeled nucleic acids). By averaging the signals, the effects of abundance dependence of one of the labels are reduced by the values associated with the other labels that are not abundance dependent in that range of signal levels. As a simple example, if label 1 amplifies the signal somewhat at lower abundances and thus provides stronger signals at lower signal levels reflective of lower abundance of the sample on a probe and label 2 does not, then by averaging the signals the amplification is reduced.
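The simple two-label example above can be put in numbers with a short sketch. The arithmetic mean is assumed here as the composite; the signal level, the 20% amplification figure and the function name composite_signal are hypothetical.

```python
def composite_signal(signals_by_label):
    """Arithmetic mean across labels of the signals read from one probe."""
    return sum(signals_by_label.values()) / len(signals_by_label)

true_signal = 50.0  # hypothetical low-abundance probe
readings = {
    "label1": true_signal * 1.20,  # assumed 20% amplification at low abundance
    "label2": true_signal,         # assumed no abundance dependence
}
combined = composite_signal(readings)
print(combined / true_signal)
```

Under these assumptions the composite overstates the signal by about half as much as label 1 alone.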

An example where different labels were incorporated into separate, equal aliquots of the same sample, then mixed into a single (multi-label) sample and hybridized to probes on an array, follows. Although the specific example is directed to dye labeling, it is again noted here that the principles and methods described herein are equally applicable to other label types. For example, the same sample may be labeled with either Cy3- or Cy5-dye and labeled with a radioactive label as well, or with two radioactive labels (radioactive isomers), biotinylated dyes, or with two different labels of any known types, as long as a system or systems are available for reading the signals associated with such labels. Further, as noted, the present invention may be carried out by incorporating multiple different dyes into a single aliquot of a sample.

The example experiment was conducted on self-self arrays in which equivalent proportions of cyanine3-(Cy3) and cyanine5-(Cy5) dye were separately incorporated into nucleic acids in equal, but separate quantities of the same sample, and both labeled samples were then combined and hybridized, as a single combined sample having both labels, under the same conditions to the same array configured for two channel processing, commonly referred to as “self-self hyb”, in order to demonstrate post processing techniques that would be the same for a single sample having had multiple different labels applied thereto. Further details about this simulation may be found in co-pending, commonly owned Application Serial No. (Application Serial No. not yet assigned, Attorney's Docket No. 10051059-1) filed concurrently herewith and titled “Label Integrity Verification of Chemical Array Data”, which is hereby incorporated herein, in its entirety, by reference thereto.

The “self-self hyb” examples were subject to the following conditions: For a self-self hybridization, 1 μg of HeLa or K562 total RNA was amplified and Cy3- and Cy5-labeled using Agilent's Low Input RNA Fluorescent Linear Amplification Kit (5184-3523, Agilent Technologies, Inc., Palo Alto, Calif.) in separate reactions, following the protocol described in the user's manual of the kit. Hybridizations were performed using Agilent's Human 1A (V2) Oligo Microarrays (G4110B, Agilent Technologies Inc., Palo Alto, Calif.) and the in-situ Hybridization Plus Kit (5184-3568, Agilent Technologies, Inc., Palo Alto, Calif.). 750 ng of Cy3- and 750 ng of Cy5-labeled cRNA were co-hybridized to each microarray, as described in the microarray user manual (G4140-90030, Agilent Technologies, Inc., Palo Alto, Calif.). Slides were scanned on an Agilent Microarray Scanner (Model G2505B, Agilent Technologies, Inc., Palo Alto, Calif.) and the raw images were processed using Agilent's Feature Extraction (v7.5.1, Agilent Technologies, Inc., Palo Alto, Calif.).

This experiment was closely controlled to provide the same technical factors to both samples on the same array, to validate the usefulness of providing two or more labels to the same sample to monitor label integrity as described herein. Table 1 lists the four Agilent oligo, two-color arrays (self3, self4, self7 and self8) that were prepared for the experiment. The arrays self3 and self7 used HeLa_—11 as the sample for both red and green dyes in equal proportions, and the arrays self4 and self8 used K562_—12 as the sample for both red and green dyes in equal proportions.

TABLE 1

Array | Barcode | Red-Sample | Green-Sample | Description
self3 | 16011877010 | Cy5 HeLa | Cy3 HeLa | Cy3 HeLa + Cy5 HeL
self4 | 16011877010 | Cy5 K562 | Cy3 K562 | Cy3 K562 + Cy5 K562
self7 | 16011877010 | Cy5 HeLa | Cy3 HeLa | Cy3 HeLa + Cy5 HeL
self8 | 16011877010 | Cy5 K562 | Cy3 K562 | Cy3 K562 + Cy5 K562

FIG. 8 is a graphical representation 900 of the number of features provided on the arrays for each of samples HeLa_—11 and K562_—12, as an overall count for arrays self3, self4, self7 and self8 combined, as well as the numerical totals for each and the total overall. As noted in FIG. 8, there were 71,944 probes designed for the HeLa_—11 sample and 71,944 probes designed for the K562_—12 sample. As noted above, the signal intensity ratios between red and green labeled signals for the same probe measure the integrity of the dye, rather than expression ratios. More specifically, these ratios measure dye parallelism, where a plot of log ratio values from probe to probe should be fairly constant (with the exception of random noise), even if the log ratio values are not zero.

Upon hybridizing each array with the target samples as indicated above, each probe was ideally expected to bind with equal concentrations of Cy3-labeled polynucleotides and Cy5-labeled polynucleotides of the specific polynucleotide that it is designed to bind with.

After washing and other typical processing steps, the arrays were scanned with a two-channel Agilent scanner to obtain signals from the probes for both the Cy3-labeled target as well as the Cy5-labeled target on the two channels, respectively. The ratios of the signal values from the two channels for each probe were then analyzed as a measure of dye integrity, i.e., to measure the fidelity of the signals as affected by one dye versus the other. Since both channels were expected to measure the same biopolymers (e.g., labeled polynucleotides) present in equal concentrations for each probe, a comparison of the signals from each channel with the processing described herein provides a reliable measure of whether the labels are distorting the signal readings, since all other technical factors do not vary (e.g., one or more of: array to array differences, lot to lot differences, hybridization conditions, array manufacturing conditions, etc., that may typically be causes of gradients and other pattern variations when comparing two samples contacted to two different arrays).

By providing multiple labels in a manner described with a universal reference (i.e., a reference designed to use for a broad coverage of different gene expression studies, e.g., see http://www.stratagene.com/products/displayProduct.aspx?pid=439), label integrity can be checked by comparison of signals as described, as read from the biopolymers on the universal reference that have been labeled with multiple labels, thus providing an experimenter with assurance that the labels associated with experimentation are not a significant source of error and assay instability.

FIG. 9 shows a plot 1000 of the distribution of log ratio values for the signals obtained from scanning all four of the arrays identified in Table 1 above, where each log ratio value is the log ratio of an intensity signal associated with the red dye to the intensity signal associated with green dye, for the same probe/target on the same array. It can be observed that the distribution of the log ratio values shows that the log ratio values are centered around zero, as expected. The associated statistics shown in FIG. 9 indicate that the median ratio value is zero, with the 25th and 75th percentile values being within 0.063 of zero, with a tight distribution, indicating a relatively low amount of random noise.

As one approach to analysis of the array data from scanning the arrays identified in Table 1, ANOVA analysis of the signal data obtained from the arrays was performed using JMP*SAS software (http://www.jmp.com/) to characterize the response surfaces and check for relative dye patterns in the signal intensities, as measured by natural log ratios of dye-normalized, background subtracted signals (LnRatiOrgDNS) for red to green ratios from the probes/targets on the arrays. The ratios were analyzed to look for patterns of divergence caused by differences in performance of the red and green dyes. The analysis performed was standard ANOVA analysis to measure the dye integrity for the arrays noted. Further information regarding ANOVA analysis can be found in co-pending, commonly assigned application Ser. No. 11/198,362, filed Aug. 4, 2005 and Ser. No. 11/026,484, filed Dec. 30, 2004. Both application Ser. No. 11/198,362 and application Ser. No. 11/026,484 are hereby incorporated herein, in their entireties, by reference thereto. Table 2 shows summary results for the surface fit and the Analysis of Variance Results as determined by the ANOVA processing.

TABLE 2

Analysis of Variance
Source | DF | SSQ | Mean Square | F Ratio | Prob > F
Model | 23 | 32.4955 | 1.41285 | 100.6756 | 0.0000
Error | 143731 | 2017.0715 | 0.01403 | |
C. Total | 143754 | 2049.5670 | | |

Summary of Fit
RSquare | 0.015855
RSquare Adj | 0.015697
RMS Error | 0.118464
Mean of Resp | 0.000467
Sum Wgts | 143755

Table 2 reports well-known, established standard statistics for an ANOVA analysis. In the “Summary of Fit” portion of Table 2 above, “RSquare” measures the proportion of the variation around the mean explained by the linear or polynomial model. The remaining variation is attributed to random error. RSquare is 1 if the model fits perfectly. An RSquare value of zero indicates that the fit is no better than a simple mean model. RSquare is the standard regression result of one minus the ratio of the residual sum of squares to the total sum of squares about the mean. “RSquare Adj.” adjusts the RSquare value to make it more comparable over models with different numbers of parameters by using the degrees of freedom in its computation. Thus it is based on a ratio of mean squares instead of sums of squares.

“RMS Error”, or “Root Mean Square Error” estimates the standard deviation of the random error. RMS Error is calculated as the square root of the mean square for Error in the Analysis of Variance table shown in the “Analysis of Variance” portion of Table 2. “Mean of Response” is the sample mean (arithmetic average) of the response variable. This is the predicted response when no model effects are specified. “Sum of Weights”, or “Observations”, indicates the number of observations used to estimate the fit, in this case, the number of rows of data that were inputted.

In the “Analysis of Variance” portion of Table 2 above, “DF” refers to the degrees of freedom for each calculation reported. The Total Error DF is the degrees of freedom figure reported at the “Error” entry of the Analysis of Variance portion of Table 2, and is the difference between the “C. Total” DF value and the “Model” DF value. The Sum of Squares or “SSQ” records an associated sum of squares for each source of error. The Total Error “SSQ” is the sum of square value reported on the “Error” line of the Analysis of Variance portion of Table 2.

“Mean Square” is the sum of squares divided by its associated degrees of freedom, i.e., SSQ/DF. This computation converts the sum of squares to an average (mean square). “F Ratio” is the ratio of the mean square for the Model to the mean square for Error (1.41285/0.01403 in Table 2). The F Ratio tests the hypothesis that all of the model parameters, other than the intercept, are zero. F-ratios for statistical tests are the ratios of mean squares. “Prob>F” is the observed significance probability (p-value) of obtaining a greater F-ratio value by chance alone if the specified model fits no better than the overall response mean (i.e., probability of a noise effect). Observed significance probabilities (Prob>F) of 0.05 or less are often considered evidence of a regression effect.
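As a check on the definitions above, the following sketch recomputes the derived statistics of Table 2 from the reported degrees of freedom and sums of squares. Only the DF and SSQ values come from Table 2; the variable names are illustrative, and the F Ratio is taken as the Model mean square divided by the Error mean square, which is the computation that reproduces the reported value.

```python
import math

# Values reported in the Analysis of Variance portion of Table 2.
model_df, model_ssq = 23, 32.4955
error_df, error_ssq = 143731, 2017.0715
total_df, total_ssq = 143754, 2049.5670

model_ms = model_ssq / model_df          # Mean Square = SSQ / DF
error_ms = error_ssq / error_df
f_ratio = model_ms / error_ms            # Model mean square over Error mean square
r_square = 1.0 - error_ssq / total_ssq   # one minus residual SSQ over total SSQ
r_square_adj = 1.0 - error_ms / (total_ssq / total_df)  # same form, using mean squares
rms_error = math.sqrt(error_ms)          # square root of the Error mean square

print(f"Mean Square (Model) = {model_ms:.5f}")
print(f"Mean Square (Error) = {error_ms:.5f}")
print(f"F Ratio             = {f_ratio:.4f}")
print(f"RSquare             = {r_square:.6f}")
print(f"RSquare Adj         = {r_square_adj:.6f}")
print(f"RMS Error           = {rms_error:.6f}")
```

The recomputed values agree with Table 2 to within the rounding of the reported inputs.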

Table 3 shows the parameter estimates that were calculated for performing the ANOVA analysis. The nominal terms inputted were the self-self arrays (ArraySelf3, ArraySelf4 and ArraySelf7) with the array self8 (ArraySelf8) serving as the intercept term, as one of the nominal terms (levels) becomes the designated dependent effect to be left out of the model to avoid singularity problems. This parameter becomes the negative of the sum of all other level parameters and therefore absorbs the singularity. The “Estimate” column lists the parameter (term) estimates of the linear model. The prediction formula is the linear combination of these estimates with the values of their corresponding variables. “Std. Err.” lists the estimates of the standard errors of the parameter estimates. These Std. Err. estimates are used for constructing tests and confidence intervals.

The “t Ratio” column lists the test statistics for the hypothesis that each parameter is zero. The t Ratio is the ratio of the parameter estimate to its standard error. If the hypothesis is true, then this statistic has a Student's t-distribution. Looking for a t Ratio greater than 2 in absolute value is a common rule of thumb for judging significance because it approximates the 0.05 significance level.

The final column labeled “Prob>|t|” lists the observed significance probability calculated from each t Ratio. Prob>|t| is the probability of getting, by chance alone, a t Ratio greater (in absolute value) than the computed value, given a true hypothesis. Often, a value below 0.05 (or sometimes 0.01) is interpreted as evidence that the effect of the parameter considered is significantly different from zero. The different values in this column for the nominal variables ArraySelf3, ArraySelf4 and ArraySelf7 indicate LnRatio shifts due to variation in the amount of response of the red dye relative to the green dye for the same probe/target, over all of the probes on the arrays among the arrays, respectively. ANOVA nominal variables are composed of dummy values which represent shifts as estimated by their parameters. The shifts were considered to be within an acceptable range in this example. An acceptable range may be preset to make this determination. For example, in this example, the range was preset for a determination that a shift was in an acceptable range if the p-value was less than 0.05, which is a typical threshold setting for significance.
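The relationship between the “Estimate”, “Std. Err.”, “t Ratio” and “Prob>|t|” columns can be illustrated with the three array terms reported in Table 3. This is a sketch only: the helper names are hypothetical, and the p-value is computed with the normal approximation to Student's t-distribution, an assumption that is reasonable here only because the error degrees of freedom (143731 in Table 2) are very large.

```python
import math

# (term, estimate, standard error, reported t Ratio, reported Prob>|t|) from Table 3.
rows = [
    ("ArraySelf3", 0.0033311, 0.001014, 3.29, 0.0010),
    ("ArraySelf4", 0.0013103, 0.001014, 1.29, 0.1963),
    ("ArraySelf7", 0.0013831, 0.001014, 1.36, 0.1726),
]

def t_ratio(estimate, std_err):
    """Ratio of a parameter estimate to its standard error."""
    return estimate / std_err

def two_sided_p_normal(t):
    """Two-sided tail probability under the normal approximation to Student's t."""
    return math.erfc(abs(t) / math.sqrt(2.0))

for term, est, se, t_rep, p_rep in rows:
    t = t_ratio(est, se)
    p = two_sided_p_normal(t)
    print(f"{term}: t = {t:.2f} (reported {t_rep}), p = {p:.4f} (reported {p_rep})")
```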

The second grouping of terms in Table 3 (i.e., Row&RS, Col&RS, (Row-103.983)*(Row-103.983), (Row-103.983)*(Col-215.455), and (Col-215.455)*(Col-215.455)), are scaled or covariate terms, minus their average value (to improve numerical and statistical properties), and provide the statistical results that characterize the global, persistent (array-independent pattern) effects, to the second order, of the row and column positions of the probes on the arrays with respect to all four of the arrays (ArraySelf3, ArraySelf4, ArraySelf7 and ArraySelf8) considered together, upon the outcome of the signal levels (natural log ratios of dye-normalized, background subtracted signals, in this example). Note that the numerical values “103.983” and “215.455” are the average row and column positions on an x-y grid, as measured on the array by the analysis software, and that these values are subtracted from each row and column position, respectively, to center the data for performance of the analysis, thereby reducing effect correlations. Specifically, in this example, Row&RS characterizes the effect of the row positions, Col&RS characterizes the effect of the column positions, (Row-103.983)*(Row-103.983) characterizes the second order effect of row positions, or row-row interaction (i.e., row²), (Row-103.983)*(Col-215.455) characterizes the effect of row and column interaction, and (Col-215.455)*(Col-215.455) characterizes the second order effect of column positions, or column-column interaction (i.e., column²). The extremely low p-values in the last column for these terms indicate that persistent gradients apply to all the arrays considered, in the LnRatiOrgDNS data, but these gradients are very small, as indicated by the small parameter estimates for these terms.
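The reduction in effect correlations obtained by centering can be seen in a small sketch. The row indices 1 through 207 are a hypothetical grid chosen so that the mean row is close to the value 103.983 noted above; they are not the actual probe layout, and the helper name pearson is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rows = list(range(1, 208))          # hypothetical row indices 1..207
mean_row = sum(rows) / len(rows)    # 104.0 for this hypothetical grid

# Correlation between the first-order and second-order row terms,
# before and after subtracting the mean row position.
raw_corr = pearson(rows, [r * r for r in rows])
centered = [r - mean_row for r in rows]
centered_corr = pearson(centered, [d * d for d in centered])

print(f"corr(Row, Row^2) = {raw_corr:.3f}")
print(f"corr(Row - mean, (Row - mean)^2) = {abs(centered_corr):.3f}")
```

Without centering, the first-order and second-order terms are almost collinear; after centering on this symmetric grid they are essentially uncorrelated, which makes the individual parameter estimates easier to separate.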

The third grouping of terms in Table 3 (i.e., (Row-103.983)*ArraySelf3, (Row-103.983)*ArraySelf4, (Row-103.983)*ArraySelf7, (Col-215.455)*ArraySelf3, (Col-215.455)*ArraySelf4, (Col-215.455)*ArraySelf7, (Row-103.983)*(Row-103.983)*ArraySelf3, (Row-103.983)*(Row-103.983)*ArraySelf4, (Row-103.983)*(Row-103.983)*ArraySelf7, (Row-103.983)*(Col-215.455)*ArraySelf3, (Row-103.983)*(Col-215.455)*ArraySelf4, (Row-103.983)*(Col-215.455)*ArraySelf7, (Col-215.455)*(Col-215.455)*ArraySelf3, (Col-215.455)*(Col-215.455)*ArraySelf4, and (Col-215.455)*(Col-215.455)*ArraySelf7) are scaled or covariate terms, per array, that characterize the changes in LnRatiOrgDNS values for each array, on a per array basis, respectively, as affected by row and column positions of the probes/targets on the arrays. These parameters indicate the shift in the persistent parameters for each array for all gradient effects.

TABLE 3

Parameter Estimates
Term | Estimate | Std. Err. | t Ratio | Prob > |t|
Intercept | 0.0232386 | 0.000972 | 23.91 | <.0001
ArraySelf3 | 0.0033311 | 0.001014 | 3.29 | 0.0010
ArraySelf4 | 0.0013103 | 0.001014 | 1.29 | 0.1963
ArraySelf7 | 0.0013831 | 0.001014 | 1.36 | 0.1726
Row & RS | −0.000085 | 0.000005 | −16.09 | <.0001
Col & RS | −0.000018 | 0.000003 | −7.23 | <.0001
(Row-103.983)*(Row-103.983) | 5.4806e−7 | 9.907e−8 | 6.63 | <.0001
(Row-103.983)*(Col-215.455) | 6.8524e−7 | 4.263e−8 | 16.07 | <.0001
(Col-215.455)*(Col-215.455) | −7.786e−7 | 2.271e−8 | −34.28 | <.0001
(Row-103.983)*ArraySelf3 | 0.0000458 | 0.000009 | 5.01 | <.0001
(Row-103.983)*ArraySelf4 | 0.0000496 | 0.000009 | 5.44 | <.0001
(Row-103.983)*ArraySelf7 | −0.000001 | 0.000009 | −0.15 | 0.8841
(Col-215.455)*ArraySelf3 | −0.000019 | 0.000004 | −4.42 | <.0001
(Col-215.455)*ArraySelf4 | −0.000032 | 0.000004 | −7.23 | <.0001
(Col-215.455)*ArraySelf7 | −0.000021 | 0.000004 | −4.83 | <.0001
(Row-103.983)*(Row-103.983)*ArraySelf3 | 1.9264e−7 | 1.716e−7 | 1.12 | 0.2616
(Row-103.983)*(Row-103.983)*ArraySelf4 | −0.000001 | 1.716e−7 | −6.14 | <.0001
(Row-103.983)*(Row-103.983)*ArraySelf7 | 5.5539e−7 | 1.716e−7 | 3.24 | 0.0012
(Row-103.983)*(Col-215.455)*ArraySelf3 | −4.804e−8 | 7.383e−8 | −0.65 | 0.5152
(Row-103.983)*(Col-215.455)*ArraySelf4 | −3.04e−8 | 7.385e−8 | −0.41 | 0.6806
(Row-103.983)*(Col-215.455)*ArraySelf7 | 2.1317e−8 | 7.384e−8 | 0.29 | 0.7728
(Col-215.455)*(Col-215.455)*ArraySelf3 | −6.149e−8 | 3.934e−8 | −1.56 | 0.1180
(Col-215.455)*(Col-215.455)*ArraySelf4 | 1.0122e−8 | 3.934e−8 | 2.57 | 0.0101
(Col-215.455)*(Col-215.455)*ArraySelf7 | −8.415e−8 | 3.934e−8 | −2.14 | 0.0324

Specifically, the terms (Row-103.983)*ArraySelf3, (Row-103.983)*ArraySelf4 and (Row-103.983)*ArraySelf7 characterize the row effect shift upon any gradient that may be observed in arrays self3, self4 and self7, respectively. The terms (Col-215.455)*ArraySelf3, (Col-215.455)*ArraySelf4 and (Col-215.455)*ArraySelf7 characterize the column effect shift upon any gradient that may be observed in arrays self3, self4 and self7, respectively. The terms (Row-103.983)*(Row-103.983)*ArraySelf3, (Row-103.983)*(Row-103.983)*ArraySelf4 and (Row-103.983)*(Row-103.983)*ArraySelf7 characterize the second-order row effect shift (shift/correction relative to the persistent array-independent pattern noted above) upon any gradient that may be observed in arrays self3, self4 and self7, respectively. The terms (Row-103.983)*(Col-215.455)*ArraySelf3, (Row-103.983)*(Col-215.455)*ArraySelf4 and (Row-103.983)*(Col-215.455)*ArraySelf7 characterize the row and column interaction effect shift (shift/correction relative to the persistent array-independent pattern noted above) upon any gradient that may be observed in arrays self3, self4 and self7, respectively. The terms (Col-215.455)*(Col-215.455)*ArraySelf3, (Col-215.455)*(Col-215.455)*ArraySelf4 and (Col-215.455)*(Col-215.455)*ArraySelf7 characterize the second-order column effect shift upon any gradient that may be observed in arrays self3, self4 and self7, respectively.

That is, these metrics provide a measure of array-dependent gradients, i.e., the variation of the gradient pattern from array to array, relative to the persistent, array-independent pattern (estimated as the pattern averaged over all array-specific patterns). Based upon the significance values (<0.05) relative to the parameter sizes, it was determined that the array-dependent gradients are significant, but very small.

Because of the large number of data points (LnRatiOrgDNS values) used in this analysis, a lot of statistical leverage was provided and it was possible to detect very small changes in gradient, much less than a level that was considered significant (i.e., where significance was considered for values of p<0.05). Therefore, it was concluded that the gradient levels were significant and, if the consequential percent CV levels are above thresholds considered acceptable, then the arrays fail market requirements. The Ln Ratio, array-dependent gradients are also significant, but very small as indicated by the third grouping of parameters and associated statistics.

Table 4 shows the combined statistics for all of the terms described above in Table 3. Rather than reporting p-values for array shifts separately, Table 4 combines the effects over all arrays and provides p-values that were calculated for each term over all arrays. Thus, the information in Table 4 is provided to answer the question as to whether there is an array effect of one or more terms on the LnRatiOrgDNS data. Table 4 reports ensemble significance, that is, the significance of all levels of each term considered together. Terms may also be custom-combined in a manner as taught in co-pending, commonly assigned application Ser. No. 11/198,362.

“Source” lists each of the variables/terms that were considered in performing the ANOVA calculations. “DF” lists the degrees of freedom for the calculations performed for the variable listed in the same row, respectively. For nominal variables, the DF value was the total number of levels (nominal variables) minus one, to account for the intercept, as noted above, and further discussed in application Ser. No. 11/198,362. The Sum of Squares calculations divided by DF, respectively, provide the relative weights attributed to the effect of each variable on the LnRatiOrgDNS data. An F-ratio value was calculated for each Sum of Squares term and reported in the next adjacent column. From these F-ratio values, p-values were calculated to show the probability that each effect is due to noise, rather than actually due to the term/variable considered. A p-value of 1 means that there is no evidence at all to suggest that there is a systematic effect caused by the variable/term for which the p-value is calculated. Conversely, a p-value less than 0.0001 means that the result is highly significant, and that the effect (mean sum of squares term, versus the residual mean sum of squares term) calculated for that term is due predominantly to the term considered, and not to random noise. Thus, the lower the p-value, the more significant is the result (i.e., the calculated sum of squares value is more likely to actually be due to the term considered, rather than predominantly to noise). The low Prob>F values in Table 4 imply statistically significant impact, but do not imply unacceptable arrays according to typical market requirements, since the % CV impact of the effect estimates is small and less than 12%.

TABLE 4
Effect Tests

Source            DF   Sum of Squares Term   F Ratio    Prob > F
Array             3    0.522313              12.4062    <.0001
Row & RS          1    3.633771              258.9326   <.0001
Col & RS          1    0.734277              52.3226    <.0001
Row*Row           1    0.429448              30.6013    <.0001
Row*Col           1    3.625657              258.3544   <.0001
Col*Col           1    16.492148             1175.185   <.0001
Row*Array & RS    3    1.695285              40.2671    <.0001
Col*Array & RS    3    3.863817              91.7750    <.0001
Row*Row*Array     3    0.553207              13.1400    <.0001
Row*Col*Array     3    0.013416              0.3187     0.8119
Col*Col*Array     3    0.156992              3.7289     0.0108
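The relationship among the columns of Table 4 can be checked with a short sketch. Each F ratio is the mean square of the term (sum of squares divided by DF) divided by the residual mean square. The residual mean square is not reproduced in this passage, so the sketch backs it out from the tabulated rows; the function names are illustrative assumptions and not part of the described method.

```python
# Rows transcribed from Table 4: (source, DF, sum of squares, reported F ratio).
TABLE_4 = [
    ("Array", 3, 0.522313, 12.4062),
    ("Row & RS", 1, 3.633771, 258.9326),
    ("Col & RS", 1, 0.734277, 52.3226),
    ("Row*Row", 1, 0.429448, 30.6013),
    ("Row*Col", 1, 3.625657, 258.3544),
    ("Col*Col", 1, 16.492148, 1175.185),
    ("Row*Array & RS", 3, 1.695285, 40.2671),
    ("Col*Array & RS", 3, 3.863817, 91.7750),
    ("Row*Row*Array", 3, 0.553207, 13.1400),
    ("Row*Col*Array", 3, 0.013416, 0.3187),
    ("Col*Col*Array", 3, 0.156992, 3.7289),
]


def mean_square(sum_sq, df):
    """Mean square of a term: its sum of squares divided by its DF."""
    return sum_sq / df


def implied_residual_mean_square(sum_sq, df, f_ratio):
    """Back out the residual mean square from F = (SS / DF) / MS_residual."""
    return mean_square(sum_sq, df) / f_ratio


def f_ratio(sum_sq, df, residual_ms):
    """F ratio of a term against a given residual mean square."""
    return mean_square(sum_sq, df) / residual_ms


# Every row should imply (nearly) the same residual mean square.
implied = [implied_residual_mean_square(ss, df, f) for _, df, ss, f in TABLE_4]
residual_ms = sum(implied) / len(implied)

for source, df, ss, reported_f in TABLE_4:
    recomputed = f_ratio(ss, df, residual_ms)
    print(f"{source:<16} reported F = {reported_f:>9.4f}  recomputed F = {recomputed:>9.4f}")
```

All eleven rows imply essentially the same residual mean square, which suggests that the flattened table values were parsed consistently. Converting an F ratio to a Prob > F value additionally requires the residual degrees of freedom, which are not given in this passage.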

The total (mean-adjusted) sum of squares calculated was 2049.5670, as indicated in Table 2. The sum of squares calculations for each of the terms considered, as shown in Table 4, are very small relative to the total sum of squares. Thus, although the effects of these terms are statistically significant, as shown by the p-values in the last column of Table 4, the terms considered do not account for the large majority of variation in the signal values. Therefore, the overall variation in the signal values analyzed is not due to dye integrity issues.

Based on the small gradients indicated by the magnitudes of the parameter estimates that model the contour plots, as characterized by the results of the ANOVA testing, it was concluded that the signals associated with red dye and the respective signals associated with green dye were behaving in parallel (i.e., any effect on the signal caused by red dye, if any, was nearly the same as the effect on the signal caused by green dye, if any, across all probes on all arrays, showing inter-array consistency of the dye labels), and that dye integrity was sufficient so as not to affect the reliability of the signal data representing the actual targets binding to probes. Therefore the labeling (red and green dyes) passed the quality test. That is, the dye effect estimates on the signal data were significant, but small and acceptable as to expected consequential impact, as measured by % CV. Statistical significance of the dye effects does not, by itself, imply unacceptable label integrity; it is, however, a necessary condition for concluding that integrity is unacceptable when the effect estimates exceed a valid threshold value.
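The comparison of each term against the total sum of squares can be made concrete with a short sketch. It uses the sum of squares values from Table 4 and the total of 2049.5670 stated above; the variable names are illustrative assumptions. Because the terms of the model need not be orthogonal, the summed share is only a rough indicator, not an exact decomposition of the variation.

```python
# Total (mean-adjusted) sum of squares stated in the text.
TOTAL_SS = 2049.5670

# Sum of squares per term, transcribed from Table 4.
TERM_SS = {
    "Array": 0.522313,
    "Row & RS": 3.633771,
    "Col & RS": 0.734277,
    "Row*Row": 0.429448,
    "Row*Col": 3.625657,
    "Col*Col": 16.492148,
    "Row*Array & RS": 1.695285,
    "Col*Array & RS": 3.863817,
    "Row*Row*Array": 0.553207,
    "Row*Col*Array": 0.013416,
    "Col*Col*Array": 0.156992,
}

# Fraction of the total sum of squares attributed to each term.
shares = {name: ss / TOTAL_SS for name, ss in TERM_SS.items()}
largest = max(shares, key=shares.get)
combined_share = sum(shares.values())

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<16} {100 * share:6.3f}% of total sum of squares")
print(f"Largest term: {largest}; all terms together: {100 * combined_share:.2f}%")
```

Under these figures, the largest single term (Col*Col) accounts for less than 1% of the total sum of squares, and all listed terms together account for less than 2%, which is consistent with the conclusion that the effects are statistically significant but small.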

As briefly referred to above, it was determined that the signal intensity readings associated with the different labels may be combined to form a composite or average signal intensity level for a probe, which may be more accurate, reliable and reproducible across experiments than any single signal intensity level associated with any single label used in the experiment. FIGS. 10A-10C show plots of inter-array coefficient of variation (CV) values (relative noise) 1100A, 1100B and 1100C, respectively, plotted for the signals associated with the green dye (Cy3) (FIG. 10A), the signals associated with the red dye (Cy5) (FIG. 10B), and average signals computed, for each probe, from the signal associated with the red dye and the signal associated with the green dye (FIG. 10C) (CVgLnDNS, CVrLnDNS and CVgrLnDNS, respectively). In each case the signals were the dye-normalized, background-subtracted signals described with regard to the example above for which ANOVA analysis was performed.

Table 5 reports the numerical quantile statistics and moments calculated from the data shown in FIGS. 10A-10C. N represents the total number of data points (number of probes over two different targets) analyzed in each instance.

The median CV values (array-to-array variability in signal) for Cy3 and Cy5 are 0.1719 and 0.1792, respectively, or 17.19% and 17.92%, which are considered to be unacceptable levels. For example, a typical threshold % CV value currently considered to be acceptable is about 12% or less, sometimes 10% or less. The median CV for the combined signal (FIG. 10C) is 0.1733 or 17.33%, which indicates that the inter-array coefficient of variation for the combined signals is about as good as for the individual signals, in terms of population statistics. However, the CV for the combined signal is also considered to be unacceptable, as being too high.
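The inter-array CV comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the CV is taken as the sample standard deviation divided by the mean of a probe's signal across arrays, the combined signal is a simple per-array average of the two channels, and the probe names and signal values are hypothetical. The text does not specify the exact estimator or signal scale used to produce FIGS. 10A-10C.

```python
import statistics

# "About 12% or less" per the text; 0.10 is sometimes used instead.
ACCEPTABLE_CV = 0.12


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    return statistics.stdev(values) / statistics.mean(values)


def combine_channels(green, red):
    """Per-array average of the green and red signals for one probe."""
    return [(g + r) / 2.0 for g, r in zip(green, red)]


def median_inter_array_cv(signals_by_probe):
    """Median, over probes, of each probe's CV across arrays."""
    cvs = [coefficient_of_variation(v) for v in signals_by_probe.values()]
    return statistics.median(cvs)


def is_acceptable(median_cv, threshold=ACCEPTABLE_CV):
    """True when the median CV is at or below the acceptance threshold."""
    return median_cv <= threshold


# Hypothetical signals for two probes measured on four arrays.
green = {
    "probe_a": [100.0, 110.0, 95.0, 105.0],
    "probe_b": [2000.0, 2300.0, 1900.0, 2100.0],
}
red = {
    "probe_a": [120.0, 100.0, 90.0, 115.0],
    "probe_b": [2100.0, 2200.0, 2000.0, 2150.0],
}
combined = {probe: combine_channels(green[probe], red[probe]) for probe in green}

for label, data in (("green", green), ("red", red), ("combined", combined)):
    cv = median_inter_array_cv(data)
    print(f"{label:<9} median CV = {cv:.4f}  acceptable: {is_acceptable(cv)}")
```

Applied to the medians reported in the text, `is_acceptable(0.1719)` and `is_acceptable(0.1792)` are both false under the 12% threshold.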

FIGS. 11A-11C show plots of inter-array coefficient of variation (CV) values (relative noise) 1200A, 1200B and 1200C, respectively (CVgLnBSS, CVrLnBSS and CVrgLnBSS, respectively), corresponding to the plots of FIGS. 10A-10C, except in this case, the signals analyzed were not dye-normalized, although they were background-subtracted in the same manner as the signals that are the subject matter of FIGS. 10A-10C.

TABLE 5

Quantiles         FIG. 10A     FIG. 10B     FIG. 10C
100.0%  max       4.9136       4.2909       4.2824
99.5%             1.3502       1.4050       1.3743
97.5%             0.8980       0.9610       0.9311
90.0%             0.5269       0.5742       0.5443
75.0%   qtle      0.3977       0.4270       0.4132
50.0%   med       0.1719       0.1792       0.1733
25.0%   qtle      0.0789       0.0828       0.0800
10.0%             0.0314       0.0344       0.0328
2.5%              0.0078       0.0088       0.0082
0.5%              0.0015       0.0016       0.0017
0.0%    min       5.59e-6      0.00001      3.12e-6

Moments           FIG. 10A     FIG. 10B     FIG. 10C
Mean              0.2562217    0.2742067    0.2640669
Std. Dev.         0.2448092    0.2622933    0.2533214
Std. Err. Mean    0.0009133    0.0009784    0.0009448
Uppr 95% Mean     0.2580117    0.2761242    0.2659187
Lwr 95% Mean      0.2544317    0.2722891    0.2622151
N                 71856        71876        71892
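The moments reported in Table 5 are related in the usual way: the standard error of the mean is the standard deviation divided by the square root of N, and the 95% bounds are the mean plus or minus approximately 1.96 standard errors. The sketch below checks this for the three columns; the function names and the use of the normal quantile 1.96 (reasonable for N of this size) are assumptions made for illustration.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation over sqrt(N)."""
    return std_dev / math.sqrt(n)


def mean_confidence_interval(mean, std_dev, n, z=1.96):
    """Approximate 95% interval for the mean using the normal quantile z."""
    half_width = z * standard_error(std_dev, n)
    return mean - half_width, mean + half_width


# (label, mean, std. dev., N) transcribed from the Moments rows of Table 5.
TABLE_5_MOMENTS = [
    ("FIG. 10A", 0.2562217, 0.2448092, 71856),
    ("FIG. 10B", 0.2742067, 0.2622933, 71876),
    ("FIG. 10C", 0.2640669, 0.2533214, 71892),
]

for label, mean, std_dev, n in TABLE_5_MOMENTS:
    se = standard_error(std_dev, n)
    lower, upper = mean_confidence_interval(mean, std_dev, n)
    print(f"{label}: SE = {se:.7f}, 95% interval = ({lower:.7f}, {upper:.7f})")
```

The recomputed values agree with the Std. Err. Mean, Uppr 95% Mean and Lwr 95% Mean rows of Table 5 to within rounding.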

Table 6 reports the numerical quantile statistics and moments calculated from the data shown in FIGS. 11A-11C. N represents the total number of data points analyzed in each instance.

The median CV values (array-to-array variability in signal) for Cy3 and Cy5 are 0.1166 and 0.1204, respectively, or 11.66% and 12.04%, in this case. The median CV for the combined signal (CVrgLnBSS in FIG. 11C) is 0.1143 or 11.43%, which indicates that, for signals that have not been dye-normalized, the inter-array coefficient of variation for the combined signals is even better than for the individual signals. The reason for the better performance may be that, if one of the dyes performs better at relatively lower signal levels and the other dye performs better at relatively higher signal levels, then averaging both dye-related signals at all levels of the spectrum allows the impact of the poorer performing dye to be averaged out somewhat by the better performing dye.

TABLE 6

Quantiles         FIG. 11A     FIG. 11B     FIG. 11C
100.0%  max       5.1631       4.5634       4.1838
99.5%             1.5810       1.8231       1.6959
97.5%             1.1269       1.3813       1.2556
90.0%             0.5545       0.7870       0.5772
75.0%   qtle      0.2331       0.2938       0.2537
50.0%   med       0.1166       0.1204       0.1143
25.0%   qtle      0.0530       0.0521       0.0510
10.0%             0.0210       0.0202       0.0199
2.5%              0.0052       0.0049       0.0048
0.5%              0.00098      0.00099      0.00092
0.0%    min       0.0000       0.00001      0.0000

Moments           FIG. 11A     FIG. 11B     FIG. 11C
Mean              0.2154316    0.2660332    0.2369707
Std. Dev.         0.288496     0.3651846    0.3259648
Std. Err. Mean    0.0010762    0.0013621    0.0012157
Uppr 95% Mean     0.217541     0.2687029    0.2393535
Lwr 95% Mean      0.2133221    0.2633634    0.2345879
N                 71856        71876        71892

The background-subtracted, but not dye-normalized, signals were weighted according to their performances at different relative signal intensities. From experience, it was known that the green dye (Cy3) performs with better integrity (i.e., better reproducibility, less variation, relative to that observed in signals associated with the red dye Cy5) with signals of relatively lower intensity and that the red dye (Cy5) performs with better integrity (i.e., better reproducibility, less variation, relative to that observed in signals associated with the green dye Cy3) with signals of relatively higher intensity. Accordingly, for signals higher than the average signal, rather than just calculating the Ln average of the signal associated with the red dye and the signal associated with the green dye for a probe, the signal associated with the red dye was weighted more heavily than the signal associated with the green dye. Conversely, for signal intensities less than the average signal intensity, the signal associated with the green dye for a probe was weighted more heavily than the signal associated with the red dye for the same probe, and then a log average of these signals was calculated. Thus, signals associated with green dye and having less than the median signal intensity were weighted at a factor of greater than 0.5 and signals associated with red dye having less than the median signal intensity were weighted at a factor of less than 0.5, wherein the weighting factors for red and green associated signals from the same probe sum to a total of one. Weighting was performed conversely for the signals having greater than the median signal intensity. A weighting curve was empirically developed to optimize the weighting values applied.
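The weighting scheme described above can be sketched as follows. The empirically developed weighting curve is not disclosed in this passage, so the logistic curve, its `steepness` parameter, and the function names below are hypothetical stand-ins chosen only because they satisfy the stated constraints: the green weight exceeds 0.5 below the median intensity, falls below 0.5 above it, and the green and red weights for a probe sum to one.

```python
import math


def green_weight(ln_signal, median_ln_signal, steepness=1.0):
    """Hypothetical weighting curve (logistic), NOT the empirical curve.

    Returns a value above 0.5 when the probe's signal is below the median,
    below 0.5 when it is above the median, and exactly 0.5 at the median.
    """
    return 1.0 / (1.0 + math.exp(steepness * (ln_signal - median_ln_signal)))


def weighted_log_average(green, red, median_ln_signal, steepness=1.0):
    """Weighted average of the Ln green and Ln red signals for one probe."""
    ln_green, ln_red = math.log(green), math.log(red)
    # Use the unweighted Ln average to locate the probe relative to the median.
    w_green = green_weight(0.5 * (ln_green + ln_red), median_ln_signal, steepness)
    w_red = 1.0 - w_green  # weights for one probe sum to one
    return w_green * ln_green + w_red * ln_red


# Hypothetical example: median signal of 100, one dim probe and one bright probe.
median_ln = math.log(100.0)
dim = weighted_log_average(10.0, 20.0, median_ln)
bright = weighted_log_average(1000.0, 2000.0, median_ln)
print(f"dim probe:    weighted Ln signal = {dim:.4f} (leans toward green)")
print(f"bright probe: weighted Ln signal = {bright:.4f} (leans toward red)")
```

With equal weights of 0.5 the function reduces to the plain Ln average used for the combined signals of FIG. 11C; the steeper the assumed curve, the more strongly each probe's result follows the dye that is expected to perform better at that intensity.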

FIG. 11D shows a plot of inter-array coefficient of variation (CV) values (relative noise) 1200D (CVwrgLnBSS), corresponding to the plot of FIG. 11C, except that in this case the signals have been weighted in the manner described above. Table 7 reports the numerical quantile statistics and moments calculated from the data shown in FIG. 11D. N represents the total number of data points analyzed.

TABLE 7

Quantiles         FIG. 11D
100.0%  max       5.1631
99.5%             1.5858
97.5%             1.1296
90.0%             0.5772
75.0%   qtle      0.2508
50.0%   med       0.1092
25.0%   qtle      0.0487
10.0%             0.0193
2.5%              0.0047
0.5%              0.00087
0.0%    min       0.0000

Moments           FIG. 11D
Mean              0.2194569
Std. Dev.         0.294073
Std. Err. Mean    0.001097
Uppr 95% Mean     0.2216071
Lwr 95% Mean      0.2173067
N                 71856

Note that the median CV value for CVwrgLnBSS is 0.1092 or 10.92%, which is even better (i.e., exhibits less array-to-array variation) than that of the combined signals of FIG. 11C (CVrgLnBSS), in which equal weighting was applied to signals associated with red dye and signals associated with green dye.

Accordingly, providing multiple labels for a single sample to be analyzed on an array, and interpreting one channel of signals from the array, offers a unique ability to verify the integrity of each label in a manner that eliminates other production or hybridization factors that may otherwise be confused with effects caused by lack of label integrity. Further, by combining the signals associated with the multiple labels and a particular probe/target, a composite signal can be used for measurement of the target. Such a composite signal may be more reliable and reproducible than a signal that is associated with any one of the multiple different labels applied to the same sample. Further, weighting may be performed to further emphasize the advantages in the performances of the labels, based on signal intensity.

If unacceptable divergence is identified among the labels, then a user may either have to do the experimentation over (redo the experimentation with new arrays, or strip arrays and repeat the processing) or may be able to identify the bad label and use the results associated with one or more labels that have been determined to be reliable.

FIG. 12 illustrates a typical computer system in accordance with an embodiment of the present invention. The computer system 1300 includes any number of processors 1302 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 1306 (typically a random access memory, or RAM) and primary storage 1304 (typically a read only memory, or ROM). As is well known in the art, primary storage 1304 acts to transfer data and instructions uni-directionally to the CPU and primary storage 1306 is used typically to transfer data and instructions in a bi-directional manner. Both of these primary storage devices may include any suitable computer-readable media such as those described above. A mass storage device 1308 is also coupled bi-directionally to CPU 1302 and provides additional data storage capacity and may include any of the computer-readable media described above. Mass storage device 1308 may be used to store programs, data and the like and is typically a secondary storage medium such as a hard disk that is slower than primary storage. It will be appreciated that the information retained within the mass storage device 1308 may, in appropriate cases, be incorporated in standard fashion as part of primary storage 1306 as virtual memory. A specific mass storage device such as a CD-ROM or DVD-ROM 1314 may also pass data uni-directionally to the CPU. Alternatively, device 1314 may be connected for bi-directional data transfer, such as in the case of a CD-RW or DVD-RW, for example.

CPU 1302 is also coupled to an interface 1310 that may include one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. Finally, CPU 1302 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 1312. With such a network connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.

The hardware elements described above may implement the instructions of multiple software modules for performing the operations of this invention. For example, instructions for calculating sum of squares terms and/or for calculating metrics may be stored on mass storage device 1308 or 1314 and executed on CPU 1302 in conjunction with primary memory 1306.

In addition, embodiments of the present invention further relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. The media and program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM, CD-RW, DVD-ROM, or DVD-RW disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.

While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.
