Researchers use experimental data obtained from arrays and other similar research test equipment to cure diseases, develop medical treatments, understand biological phenomena, and perform other tasks relating to the analysis of such data. However, the extraction of useful results from this raw data is restricted by physical limitations of, e.g., the nature of the tests and the testing equipment. All biological measurement systems leave their fingerprint on the data they measure, distorting the content of the data, and thereby influencing the results of the desired analysis. For example, systematic biases can distort array analysis results and thus conceal important biological effects sought by the researchers. Biased data can cause a variety of analysis problems, including signal compression, aberrant graphs, and significant distortions in estimates of differential expression.
Gradient effects or patterns are those in which there is a pattern of expression signal intensity which corresponds with specific physical locations and/or sequence properties within a chemical array and which are characterized by a smooth change in the expression values from one end of the array to another and/or across sequence properties of probes. This can be caused by variations in array design, manufacturing, dye-bias, probe affinity and/or hybridization procedures.
In dual-channel systems, it is well known that the two dyes used to evaluate the binding of target molecules to probes on an array do not always perform equally efficiently, for equivalent target concentrations, uniformly across the whole array. This is sometimes referred to as dye-related, signal correlation bias. For example, for dual-channel systems in which probes have been labeled using cyanine3 (Cy3)- and cyanine5 (Cy5)-dyes, the red channel (detecting Cy5 labeling) often demonstrates higher signal intensity than the green channel at higher target abundances. Even when comparing results from two single-channel experiments, there may be differences in dye performances, even when the same dye is used, such as when different experimental conditions, either intended or unintended, occur when running each of the experiments. Also, the label intensity may not follow an ideal performance curve over the range of analyte concentration. For example, for drug discovery experiments, label intensity may not follow the ideal dose-response curve over the range of the analyte (e.g., mRNA) concentration being used as a marker of drug efficacy. For example, red dye (e.g., Cy5) tends to amplify brightness in an accelerated manner with respect to an increase in concentration, at high concentrations beyond the typical sigmoidal profile.
The degree to which the intensity of dye signals fails to report the concentration of target being measured is not easily quantified, and is therefore difficult to address. Dye-swap normalization experiments are sometimes run in which a first set of experiments assigns the red dye label to a first set of probes and the green dye label to a second set of probes. A second set of experiments is run against the same target solution, but in which the green dye label is assigned to the first set of probes and the red dye label is assigned to the second set of probes. By comparing the output of the first set with that of the second set, the bias attributable to the effects of the red versus green dye can be measured. However, this is a time-consuming process and significantly increases the cost of experimentation, as twice the number of arrays and twice the amount of reagents, target and processing are required.
In addition to fluorescent labels, other types of labeling, such as radioactive labels, phosphorescent labels, visible light labels, ultraviolet labels, and others, are also susceptible to causing signal correlation bias.
Also, results that appear to have labeling bias may be due to other technical errors. For example, for a single-channel system, the system may be erroneously reporting probe signals, even though the results appear to be caused by dye bias. Since there is only one channel, and no control channel, it is not possible to distinguish between systematic reader error and dye bias in this instance.
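By way of illustration only, the arithmetic behind a dye-swap comparison can be sketched as follows. This sketch assumes the commonly used two-sample convention in which each probe yields a log2 red/green ratio in a forward experiment and in a swapped experiment; the function name `dye_swap_estimates` and the input layout are hypothetical and are not taken from the description above.

```python
import math

def dye_swap_estimates(forward, swapped):
    """Estimate per-probe dye bias from a dye-swap pair (illustrative sketch).

    forward: list of (red, green) intensities with sample A in red, B in green.
    swapped: list of (red, green) intensities for the same probes, with
             sample A in green and sample B in red.
    Returns a list of (dye_bias, corrected_log_ratio) tuples in log2 units.
    """
    results = []
    for (r1, g1), (r2, g2) in zip(forward, swapped):
        m_fwd = math.log2(r1 / g1)  # log2(A/B) plus dye bias
        m_swp = math.log2(r2 / g2)  # log2(B/A) plus dye bias
        dye_bias = (m_fwd + m_swp) / 2.0   # biological term cancels
        corrected = (m_fwd - m_swp) / 2.0  # dye term cancels
        results.append((dye_bias, corrected))
    return results
```

Averaging the two log ratios cancels the biological difference and leaves the dye term, while halving their difference cancels the dye term. The second hybridization required for this calculation is the source of the doubled cost noted above.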
Thus there remains a need for improved systems and methods for normalizing biological data to address dye-related, signal correlation bias and other types of labeling bias as data is read from arrays.
Methods, systems and computer readable media are provided for checking label integrity of labeled biopolymers in a single sample assayed by chemical array analysis. In one embodiment, at least first and second labels are incorporated into biopolymers in the single sample to produce a multi-labeled, single sample. The multi-labeled, single sample is hybridized to probes on a chemical array, and signal values are read from a probe on the chemical array bound to a set of biopolymer sequences labeled with the at least first and second labels. First-labeled signal values from the probe bound to biopolymer having the first label incorporated therein are compared with second-labeled signal values from the probe bound to biopolymer having the second label incorporated therein. The steps of reading signal values and comparing first-labeled signal values with second-labeled signal values are repeated for at least one additional probe on the chemical microarray bound to a set of different biopolymer sequences labeled with the at least first and second labels. Label integrity is determined to be of acceptable quality if divergence between the first-labeled signal values read from the probes and the second-labeled signal values read from the same probes, over the set of probes read and compared, is less than a predetermined threshold value.
In another embodiment, a chemical array is provided that has had a multi-labeled sample contacted thereto so that multi-labeled biopolymers from the sample have hybridized with probes on the chemical array. Methods, systems and computer readable media are provided for reading signal values from a probe on the chemical array bound to a set of biopolymer sequences labeled with at least first and second labels; comparing first-labeled signal values from the probe bound to biopolymer having the first label incorporated therein with second-labeled signal values from the probe bound to biopolymer having the second label incorporated therein; and repeating the reading signal values and comparing first-labeled signal values with second-labeled signal values for at least one additional probe on the chemical microarray bound to a set of different biopolymer sequences labeled with the at least first and second labels. Label integrity is determined to be of acceptable quality if divergence between the first-labeled signal values read from the probes and the second-labeled signal values read from the same probes, across all probes read, is less than a predetermined threshold value.
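The threshold test summarized above can be sketched in code. The description does not fix a particular divergence measure, so this minimal sketch assumes one possible choice: the population standard deviation of the per-probe log2 ratios, which is zero when the two labels report strictly proportional signals. The function names and the default threshold of 0.25 are hypothetical.

```python
import math
import statistics

def label_divergence(first_signals, second_signals):
    """Spread of per-probe log2(first/second) ratios (assumed metric).

    Both lists hold positive signal values read from the same probes,
    in the same probe order.
    """
    if len(first_signals) != len(second_signals) or len(first_signals) < 2:
        raise ValueError("need two equal-length lists covering at least two probes")
    log_ratios = [math.log2(a / b) for a, b in zip(first_signals, second_signals)]
    return statistics.pstdev(log_ratios)

def label_integrity_ok(first_signals, second_signals, threshold=0.25):
    """Return True when divergence is below the predetermined threshold."""
    return label_divergence(first_signals, second_signals) < threshold
```

For example, signals that differ by a constant factor across all probes give a divergence of zero and pass the check, whereas signals whose ratio drifts with intensity give a larger divergence and fail it.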
These and other advantages and features of the invention will become apparent to those persons skilled in the art upon reading the details of the methods, systems and computer readable media as more fully described below.
Before the present systems, methods, kits and computer readable media are described, it is to be understood that this invention is not limited to particular examples described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.
It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a probe” includes a plurality of such probes and reference to “the array” includes reference to one or more arrays and equivalents thereof known to those skilled in the art, and so forth.
The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
Definitions
In the present application, unless a contrary intention appears, the following terms refer to the indicated characteristics.
A “biopolymer” is a polymer of one or more types of repeating units. Biopolymers are typically found in biological systems and particularly include polysaccharides (such as carbohydrates), and peptides (which term is used to include polypeptides and proteins) and polynucleotides as well as their analogs such as those compounds composed of or containing amino acid analogs or non-amino acid groups, or nucleotide analogs or non-nucleotide groups. This includes polynucleotides in which the conventional backbone has been replaced with a non-naturally occurring or synthetic backbone, and nucleic acids (or synthetic or naturally occurring analogs) in which one or more of the conventional bases has been replaced with a group (natural or synthetic) capable of participating in Watson-Crick type hydrogen bonding interactions. Polynucleotides include single or multiple stranded configurations, where one or more of the strands may or may not be completely aligned with another.
A “nucleotide” refers to a sub-unit of a nucleic acid and has a phosphate group, a 5-carbon sugar and a nitrogen containing base, as well as functional analogs (whether synthetic or naturally occurring) of such sub-units which in the polymer form (as a polynucleotide) can hybridize with naturally occurring polynucleotides in a sequence-specific manner analogous to that of two naturally occurring polynucleotides. For example, a “biopolymer” includes DNA (including cDNA), RNA, oligonucleotides, and PNA and other polynucleotides as described in U.S. Pat. No. 5,948,902 and references cited therein (all of which are incorporated herein by reference), regardless of the source. An “oligonucleotide” generally refers to a nucleotide multimer of about 10 to 100 nucleotides in length, while a “polynucleotide” includes a nucleotide multimer having any number of nucleotides. A “biomonomer” references a single unit, which can be linked with the same or other biomonomers to form a biopolymer (for example, a single amino acid or nucleotide with two linking groups one or both of which may have removable protecting groups).
“Technical factors” refer to all patterns in the signal data that are not representative of the biological information in the target sample, but are rather caused by technical sources, such as hybridization bubbles (caused by uneven distribution of the sample to all probes during mixing by a bubbler), temperature gradients, sequence-composition gradients, writer/pen anomalies causing uneven patterns in the amounts deposited across the array, label kit biases, dye differences, bulk chemical solution effects, flow-cell dynamics, wash deposits, auto-fluorescence, oxidation gradients, and the like.
“Incorporation” of a label, into biopolymers or nucleotides, for example, refers to any known technique for labeling a biopolymer or nucleotide, including, but not limited to primer extension using labeled nucleotides and/or labeled primers, labeling during an amplification procedure, chemical conjugation, labeling by binding a labeled moiety that binds to the biopolymer, etc.
“Label integrity”, as used herein refers to a property of labels incorporated into biopolymers wherein signals that are read from the label-incorporated biopolymers can be consistently and stably reproduced across multiple experiments. Also, different labels vary proportionally over a range of signals, so that they can be reliably compared with one another, as measuring the same signal levels for the same sample, or correct ratios between different samples. Labels that lack label integrity are considered unstable, and this leads to amplified array noise and the inability to accurately compare signals from the same biopolymers labeled with different labels. Stability with respect to time (e.g., “shelf life”) is also a desirable property for maintaining label integrity.
When one item is indicated as being “remote” from another, this means that the two items are not at the same physical location, e.g., the items are at least in different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart.
“Communicating” information references transmitting the data representing that information as electrical signals over a suitable communication channel (for example, a private or public network).
“Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data.
A “processor” references any hardware and/or software combination which will perform the functions required of it. For example, any processor herein may be a programmable digital microprocessor such as available in the form of a mainframe, server, or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic or optical disk may carry the programming, and can be read by a suitable disk reader communicating with each processor at its corresponding station.
Reference to a singular item includes the possibility that there are plural of the same items present.
“May” means optionally.
Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as the recited order of events.
A “chemical array”, “array”, “microarray” or “bioarray” unless a contrary intention appears, includes any one-, two- or three-dimensional arrangement of addressable regions bearing a particular chemical moiety or moieties (for example, biopolymers such as polynucleotide sequences) associated with that region. An array is “addressable” in that it has multiple regions of different moieties (for example, different polynucleotide sequences) such that a region (a “feature” or “spot” of the array) at a particular predetermined location (an “address”) on the array will detect a particular target or class of targets (although a feature may incidentally detect non-targets of that feature). Array features are typically, but need not be, separated by intervening spaces. In the case of an array, the “target” will be referenced as a moiety in a mobile phase (typically fluid), to be detected by probes (“target probes”) which are bound to the substrate at the various regions. However, either of the “target” or “target probes” may be the one which is to be evaluated by the other (thus, either one could be an unknown mixture of polynucleotides to be evaluated by binding with the other).
An “array layout” refers to one or more characteristics of the features, such as feature positioning on the substrate, one or more feature dimensions, and an indication of a moiety at a given location.
“Hybridizing” and “binding”, with respect to polynucleotides, are used interchangeably.
A “pulse jet” is a device which can dispense drops in the formation of an array. Pulse jets operate by delivering a pulse of pressure to liquid adjacent an outlet or orifice such that a drop will be dispensed therefrom (for example, by a piezoelectric or thermoelectric element positioned in a same chamber as the orifice).
A “subarray” or “subgrid” is a subset of an array. Typically, a number of subgrids are laid out on a single slide and are separated by a greater spacing than the spacing that separates features or spots or dots.
Any given substrate (e.g., slide) may carry one, two, four or more arrays disposed on a front surface of the substrate. Depending upon the use, any or all of the arrays may be the same or different from one another and each may contain multiple spots or features. A typical array may contain more than ten, more than one hundred, more than one thousand, more than ten thousand, or even more than one hundred thousand features, in an area of less than 20 cm2 or even less than 10 cm2. For example, features may have widths (that is, diameter, for a round spot) in the range from 10 μm to 1.0 cm. In other embodiments each feature may have a width in the range of 1.0 μm to 1.0 mm, usually 5.0 μm to 500 μm, and more usually 10 μm to 200 μm. Non-round features may have area ranges equivalent to that of circular features with the foregoing width (diameter) ranges. At least some, or all, of the features are of different compositions (for example, when any repeats of each feature composition are excluded the remaining features may account for at least 5%, 10%, or 20% of the total number of features).
Interfeature areas will typically (but not essentially) be present which do not carry any polynucleotide (or other biopolymer or chemical moiety of a type of which the features are composed). Such interfeature areas typically will be present where the arrays are formed by processes involving drop deposition of reagents but may not be present when, for example, photolithographic array fabrication processes are used. It will be appreciated though, that the interfeature areas, when present, could be of various sizes and configurations.
Each array may cover an area of less than 100 cm2, or even less than 50 cm2, 10 cm2 or 1 cm2. In many embodiments, the substrate carrying the one or more arrays will be shaped generally as a rectangular solid (although other shapes are possible; for example, some manufacturers are currently working on flexible substrates), having a length of more than 4 mm and less than 1 m, usually more than 4 mm and less than 600 mm, more usually less than 400 mm; a width of more than 4 mm and less than 1 m, usually less than 500 mm and more usually less than 400 mm; and a thickness of more than 0.01 mm and less than 5.0 mm, usually more than 0.1 mm and less than 2 mm and more usually more than 0.2 and less than 1 mm. With arrays that are read by detecting fluorescence, the substrate may be of a material that emits low fluorescence upon illumination with the excitation light. Additionally in this situation, the substrate may be relatively transparent to reduce the absorption of the incident illuminating laser light and subsequent heating if the focused laser beam travels too slowly over a region. For example, a substrate may transmit at least 20%, or 50% (or even at least 70%, 90%, or 95%), of the illuminating light incident on the front as may be measured across the entire integrated spectrum of such illuminating light or alternatively at 532 nm or 633 nm.
Arrays can be fabricated using drop deposition from pulse jets of either polynucleotide precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained polynucleotide. Such methods are described in detail in, for example, the previously cited references including U.S. Pat. Nos. 6,242,266; 6,232,072; 6,180,351; 6,171,797; and 6,323,043, and in U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. As already mentioned, these references are incorporated herein, in their entireties, by reference thereto. Other drop deposition methods can be used for fabrication, as previously described herein. Also, instead of drop deposition methods, photolithographic array fabrication methods may be used. Interfeature areas need not be present particularly when the arrays are made by photolithographic methods.
Following receipt by a user of an array made by an array manufacturer, it will typically be exposed to a sample (for example, a fluorescently labeled polynucleotide or protein containing sample) and the array then read. Reading of the array may be accomplished by illuminating the array and reading the location and intensity of resulting fluorescence at multiple regions on each feature of the array. For example, a scanner may be used for this purpose which is similar to the AGILENT MICROARRAY SCANNER manufactured by Agilent Technologies, Palo Alto, Calif. Other suitable apparatus and methods are described in U.S. Pat. Nos. 6,406,849; 6,371,370; and 6,756,202; and in U.S. Patent Publication No. 2003/0160183 titled “Reading Dry Chemical Arrays Through The Substrate” by Dorsel et al. However, arrays may be read by any other method or apparatus than the foregoing, with other reading methods including other optical techniques (for example, detecting chemiluminescent or electroluminescent labels) or electrical techniques (where each feature is provided with an electrode to detect hybridization at that feature in a manner disclosed in U.S. Pat. Nos. 6,251,685 and 6,221,583 and elsewhere). A result obtained from the reading followed by a method of the present invention may be used in that form or may be further processed to generate a result such as that obtained by forming conclusions based on the pattern read from the array (such as whether or not a particular target sequence may have been present in the sample, or whether or not a pattern indicates a particular condition of an organism from which the sample came). A result of the reading (whether further processed or not) may be forwarded (such as by communication) to a remote location if desired, and received there for further use (such as further processing).
The term “stringent assay conditions” or “stringent conditions” as used herein refers to conditions that are compatible to produce binding pairs of nucleic acids, e.g., surface bound and solution phase nucleic acids, of sufficient complementarity to provide for the desired level of specificity in the assay while being less compatible to the formation of binding pairs between binding members of insufficient complementarity to provide for the desired specificity. Stringent assay conditions are the summation or combination (totality) of both hybridization and wash conditions.
A “stringent hybridization” and “stringent hybridization wash conditions” in the context of nucleic acid hybridization (e.g., as in array, Southern or Northern hybridizations) are sequence dependent, and are different under different experimental parameters. Stringent hybridization conditions that can be used to identify nucleic acids within the scope of the invention can include, e.g., hybridization in a buffer comprising 50% formamide, 5×SSC, and 1% SDS at 42° C., or hybridization in a buffer comprising 5×SSC and 1% SDS at 65° C., both with a wash of 0.2×SSC and 0.1% SDS at 65° C. Exemplary stringent hybridization conditions can also include a hybridization in a buffer of 40% formamide, 1 M NaCl, and 1% SDS at 37° C., and a wash in 1×SSC at 45° C. Alternatively, hybridization to filter-bound DNA in 0.5 M NaHPO4, 7% sodium dodecyl sulfate (SDS), 1 mM EDTA at 65° C., and washing in 0.1×SSC/0.1% SDS at 68° C. can be employed. Yet additional stringent hybridization conditions include hybridization at 60° C. or higher and 3×SSC (450 mM sodium chloride/45 mM sodium citrate) or incubation at 42° C. in a solution containing 30% formamide, 1M NaCl, 0.5% sodium sarcosine, 50 mM MES, pH 6.5. Those of ordinary skill will readily recognize that alternative but comparable hybridization and wash conditions can be utilized to provide conditions of similar stringency.
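The SSC strengths recited above scale linearly from the 1× composition. A brief sketch of that arithmetic follows, using the 3×SSC figures given above (450 mM sodium chloride/45 mM sodium citrate), which correspond to 150 mM and 15 mM at 1×. The function name `ssc_composition` is hypothetical.

```python
# Composition of 1x SSC, consistent with the 3x SSC figures recited above.
SSC_1X_NACL_MM = 150.0
SSC_1X_CITRATE_MM = 15.0

def ssc_composition(fold):
    """Return (NaCl mM, sodium citrate mM) for an SSC buffer of given strength."""
    if fold < 0:
        raise ValueError("fold concentration must be non-negative")
    return fold * SSC_1X_NACL_MM, fold * SSC_1X_CITRATE_MM
```

Under this scaling, a 0.2×SSC wash contains approximately 30 mM sodium chloride and 3 mM sodium citrate.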
In certain embodiments, the stringency of the wash conditions sets forth the conditions which determine whether a nucleic acid is specifically hybridized to a surface bound nucleic acid. Wash conditions used to identify nucleic acids may include, e.g.: a salt concentration of about 0.02 molar at pH 7 and a temperature of at least about 50° C. or about 55° C. to about 60° C.; or, a salt concentration of about 0.15 M NaCl at 72° C. for about 15 minutes; or, a salt concentration of about 0.2×SSC at a temperature of at least about 50° C. or about 55° C. to about 60° C. for about 15 to about 20 minutes; or, the hybridization complex is washed twice with a solution with a salt concentration of about 2×SSC containing 0.1% SDS at room temperature for 15 minutes and then washed twice by 0.1×SSC containing 0.1% SDS at 68° C. for 15 minutes; or, equivalent conditions. Stringent conditions for washing can also be, e.g., 0.2×SSC/0.1% SDS at 42° C.
A specific example of stringent assay conditions is rotating hybridization at 65° C. in a salt based hybridization buffer with a total monovalent cation concentration of 1.5 M (e.g., as described in U.S. patent application Ser. No. 09/655,482 filed on Sep. 5, 2000, the disclosure of which is herein incorporated by reference) followed by washes of 0.5×SSC and 0.1×SSC at room temperature.
Stringent assay conditions are hybridization conditions that are at least as stringent as the above representative conditions, where a given set of conditions is considered to be at least as stringent if substantially no additional binding complexes that lack sufficient complementarity to provide for the desired specificity are produced in the given set of conditions as compared to the above specific conditions, where by “substantially no additional” is meant less than about 5-fold more, typically less than about 3-fold more. Other stringent hybridization conditions are known in the art and may also be employed, as appropriate.
As noted above, conventional bioassays use one dye label per signal channel, with no direct onboard way to assure integrity of the label dyes. Examples of widely-used single-channel platforms include GeneChip®, by Affymetrix (http://www.affymetrix.com/products/arrays/index.affx) and the CodeLink System from GE Healthcare (http://www.affymetrix.com/products/arrays/index.affx). A gradient pattern that results from reading such an array does not necessarily imply a dye-biasing error, but could be due to other factors arising during production of the array and/or to hybridization conditions, as noted above. Further, with single-channel systems, since there is only one channel being analyzed, it is not possible to run dye-swap experiments, as there is typically only one set of probes and one dye used.
The present invention provides solutions that include onboard verification of labeling, even for single-channel systems. Multiple labels may be incorporated into one sample, such that the probes on an array read by a single channel of a system will get information from multiple labels. For example, for dye-biasing, both red and green dye labels may be incorporated in biopolymers in the same sample, and the multi-labeled sample is then exposed to the probes on an array under stringent hybridization conditions. The resulting signals read by an array scanner will then reflect the same sample labeled with green dye, as well as with red dye. Thus, a two-channel, or two color scanner may be used to process a single sample in this instance, with one channel of signal measurement.
As mentioned above, array 112 contains multiple spots or features 116 of oligomers, e.g., in the form of polynucleotides, and specifically oligonucleotides. As mentioned above, all of the features 116 may be different, or some or all could be the same. The interfeature areas 117 could be of various sizes and configurations. Each feature carries a predetermined oligomer such as a predetermined polynucleotide (which includes the possibility of mixtures of polynucleotides). It will be understood that there may be a linker molecule (not shown) of any known types between the surface 111b and the first nucleotide.
Substrate 110 may carry on surface 111a, an identification code, e.g., in the form of bar code (not shown) or the like printed on a substrate in the form of a paper label attached by adhesive or any convenient means. The identification code may contain information relating to array 112, where such information may include, but is not limited to, an identification of array 112, i.e., layout information relating to the array(s), etc.
In the case of an array in the context of the present application, the “target” may be referenced as a moiety in a mobile phase (typically fluid), to be detected by “probes” which are bound to the substrate at the various regions.
A “scan region” refers to a contiguous (preferably, rectangular) area in which the array spots or features of interest, as defined above, are found or detected. Where fluorescent labels are employed, the scan region is that portion of the total area illuminated from which the resulting fluorescence is detected and recorded. Where other detection protocols are employed, the scan region is that portion of the total area queried from which resulting signal is detected and recorded. For the purposes of this invention and with respect to fluorescent detection embodiments, the scan region includes the entire area of the slide scanned in each pass of the lens, between the first feature of interest, and the last feature of interest, even if there exist intervening areas that lack features of interest.
It should be further noted here that the present invention is not limited to incorporation of only two different labels into biopolymers (e.g., nucleic acids) in the same sample, as more than two different labels may be incorporated into the biopolymers to perform the functions described herein, and which would be processed similarly. By incorporating a mixture of multiple (two or more) different labels into the biopolymers (e.g., nucleic acids) of a single sample, the signal values read from a probe bound to biopolymers incorporating a first label may be compared to the signal values read from the same probe bound to biopolymers incorporating a second label, as well as against signal values from the probe bound to biopolymers incorporating a third, fourth or fifth label, etc., and these comparisons can be made across a plurality or even all probes on an array that bind to the target sample, to compare the performance of one label versus another label for the same nucleic acids across a plurality of probes binding to different biopolymers. The degree to which the first and second-labeled signals (or first and third, first, second and third, or however many different signals are compared, depending upon the number of labels incorporated) are proportional to one another across a plurality of different probes (e.g., across the probes on the array) may be characterized by a divergence metric, thereby providing a check of integrity of the labels as a quantitative measurement of label integrity and hence, fidelity of the signals read as they are influenced by the labels incorporated therein.
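Where more than two labels are incorporated, the comparisons described above may be organized pairwise. The following is a minimal sketch under stated assumptions: the divergence measure (spread of per-probe log2 ratios) is one possible choice rather than a measure prescribed herein, and the names `pairwise_divergence` and `most_divergent_label` are hypothetical. The second function is meaningful only when three or more labels are compared.

```python
import math
import statistics
from itertools import combinations

def log_ratio_spread(x, y):
    """Population standard deviation of per-probe log2(x/y) (assumed metric)."""
    return statistics.pstdev([math.log2(a / b) for a, b in zip(x, y)])

def pairwise_divergence(signals_by_label):
    """Map each pair of labels to its divergence.

    signals_by_label: dict of label name -> list of positive signal values,
    all lists covering the same probes in the same order.
    """
    return {
        (a, b): log_ratio_spread(signals_by_label[a], signals_by_label[b])
        for a, b in combinations(sorted(signals_by_label), 2)
    }

def most_divergent_label(signals_by_label):
    """Return the label whose summed divergence to all other labels is largest."""
    totals = {label: 0.0 for label in signals_by_label}
    for (a, b), d in pairwise_divergence(signals_by_label).items():
        totals[a] += d
        totals[b] += d
    return max(totals, key=totals.get)
```

When two of three labels remain proportional to one another and the third departs from both, the third label accumulates the largest total and is flagged as the likely source of the divergence.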
For example, if incorporation of one particular label, for example a dye, results in signal levels read from probes bound to nucleic acids having the dye incorporated therein, that when plotted against the positions of the features/probes from which the signals were read, presents an unusual gradient in the surface characterizing the plotted signal levels, as compared to surface plots produced from signals read from the same corresponding probes bound to nucleic acids having other labels incorporated therein, respectively, then this is direct evidence that that dye has a lack of integrity across the range of signal levels read. For example, Cy5 label (red) is more susceptible to ozone degradation than Cy3 label (green). Another example is that auto-fluorescence can influence signals from biopolymers (e.g., nucleic acids) having Cy3 dye label incorporated therein much more than signals from the same biopolymers (e.g., nucleic acids) having Cy5 dye label incorporated therein. In situations such as these, the signals read from the biopolymers (e.g., nucleic acids) labeled with red dye and the signals read from the corresponding biopolymers (e.g., nucleic acids) labeled with green dye result in a mutually divergent pattern when the signals are plotted with regard to the positions of the features on the array to produce response surface plots, since chemical differences are amplified by unstable conditions.
The labels are incorporated into the molecules in the sample at a fixed ratio across all the molecules into which the labels are incorporated, such that signals that are read from the labeled molecules will be at a fixed ratio across molecules, when comparing one label versus another. Both the normal substrate (for example, dCTP) and a dye-modified dNTP (for example, Cy3-dCTP) may be present in the reaction. A fixed ratio of the normal substrate to the dye substrate (derivative) dictates how much dye is incorporated into the sample and this does not change over time, as long as both substrates are present in excess and the effective concentration does not change as a function of time. So, for example, when two dyes are to be incorporated into the same sample, the amounts of the substrates for the two dyes, respectively, should be at a fixed ratio, and as long as the reactants (dyes not yet incorporated into sample) are available, the enzyme drives incorporation of the dyes into the sample at a fixed rate, and in quantities that are at the fixed ratio determined as described above. Examples of dyes that may be incorporated include those dyes used for fluorescent labeling in which fluorescently tagged nucleotides (e.g., Cy3-CTP) are incorporated into an antisense RNA during the transcription step, or in which fluorescently tagged nucleotides (e.g., Cy3-dCTP) are incorporated into a cDNA product (from a first strand synthesis or a non-amplification method). Fluorescent moieties which may be used to tag nucleotides for producing labeled samples include: fluorescein, the cyanine dyes, such as Cy3, Cy5, Alexa 542, Bodipy 630/650, and the like. Other labels may also be employed as are known in the art.
One approach for incorporating multiple fluorescent dye labels into the same sample employs linear amplification techniques. According to this approach, mRNA in the sample is linearly amplified into antisense RNA. Thus, amplified amounts of antisense RNA are produced by amplification of an initial amount of mRNA. By amplified amounts is meant that for each initial mRNA, multiple corresponding antisense RNAs are produced, where the term antisense RNA is defined here as ribonucleic acid complementary to the initial mRNA. By corresponding is meant that the antisense RNA shares a substantial amount of sequence identity with the sequence complementary to the mRNA (i.e., the complement of the initial mRNA), where substantial amount means at least 95%, usually at least 98% and more usually at least 99%, where sequence identity is determined using the BLAST algorithm. Further information regarding this step can be found in U.S. Pat. Nos. 6,132,997 and 6,916,633, each of which is incorporated herein, in its entirety, by reference thereto. Generally, the number of corresponding antisense RNA molecules produced for each initial mRNA during the subject linear amplification methods will be at least about 10, usually at least about 50 and more usually at least about 100, where the number may be as great as 600 or greater, but often does not exceed about 1000.
The initial mRNA may be present in a variety of different samples, where the sample will typically be derived from a physiological source. The physiological source may be derived from a variety of eukaryotic sources, with physiological sources of interest including sources derived from single-celled organisms such as yeast and multicellular organisms, including plants and animals, particularly mammals, where the physiological sources from multicellular organisms may be derived from particular organs or tissues of the multicellular organism, or from isolated cells derived therefrom. In obtaining the sample of RNA to be analyzed from the physiological source from which it is derived, the physiological source may be subjected to a number of different processing steps, where such processing steps might include tissue homogenization, cell isolation and cytoplasm extraction, nucleic acid extraction and the like, where such processing steps are known to those of skill in the art. Methods of isolating RNA from cells, tissues, organs or whole organisms are known to those of skill in the art. Alternatively, at least some of the initial steps of the subject methods may be performed in situ, as described in U.S. Pat. No. 5,514,545, which is hereby incorporated herein, in its entirety, by reference thereto.
Depending on the nature of the primer employed during first strand synthesis, amplified amounts of antisense RNA can be produced corresponding to substantially all of the mRNA present in the initial sample, or to a proportion or fraction of the total number of distinct mRNAs present in the initial sample. By substantially all of the mRNA present in the sample is meant more than 90%, usually more than 95%, where that portion not amplified is solely the result of inefficiencies of the reaction and not intentionally excluded from amplification.
The promoter-primer employed in the amplification reaction includes: (a) a poly-dT region for hybridization to the poly-A tail of the mRNA; and (b) an RNA polymerase promoter region 5′ of the poly-dT region that is in an orientation capable of directing transcription of antisense RNA. In certain embodiments, the primer will be a “lock-dock” primer, in which immediately 3′ of the poly-dT region is either a “G”, “C”, or “A” such that the primer has the configuration of 3′-XTTTTTTT . . . 5′, where X is “G”, “C”, or “A”. The poly-dT region is sufficiently long to provide for efficient hybridization to the poly-A tail, where the region typically ranges in length from 10 to 50 nucleotides, usually from 10 to 25 nucleotides, and more usually from 14 to 20 nucleotides.
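The layout of such a lock-dock promoter-primer can be illustrated with a short sketch. The following Python fragment is only an illustration under stated assumptions: the function name `lock_dock_promoter_primers` is hypothetical, the default poly-dT length of 18 is an arbitrary choice within the 14 to 20 nucleotide range given above, and the promoter string in the usage line is a commonly cited T7 promoter sequence used here as a placeholder, not a sequence taken from this description. Sequences are written 5′ to 3′, so the anchor base X appears at the right-hand (3′) end.

```python
ANCHOR_BASES = "GCA"  # the lock-dock base X immediately 3' of the poly-dT region


def lock_dock_promoter_primers(promoter, poly_dt_length=18):
    """Return the three lock-dock primer variants, written 5' to 3'.

    Layout (5' to 3'): promoter region, poly-dT region, anchor base X.
    `promoter` is whatever promoter sequence the user supplies.
    """
    if not 10 <= poly_dt_length <= 50:
        raise ValueError("poly-dT region is expected to be 10 to 50 nucleotides")
    poly_dt = "T" * poly_dt_length
    return [promoter + poly_dt + anchor for anchor in ANCHOR_BASES]


# Usage: placeholder T7 promoter sequence (an assumption for illustration).
for primer in lock_dock_promoter_primers("TAATACGACTCACTATAGGG"):
    print("5'-" + primer + "-3'")
```

Supplying an equimolar mixture of the three variants is one way such primers are typically used, since the anchor base forces hybridization at the junction between the poly-A tail and the rest of the mRNA.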
A number of RNA polymerase promoters may be used for the promoter region of the first strand cDNA primer, i.e., the promoter-primer. Suitable promoter regions will be capable of initiating transcription from an operably linked DNA sequence in the presence of ribonucleotides and an RNA polymerase under suitable conditions. The promoter will be linked in an orientation to permit transcription of antisense RNA. A linker oligonucleotide between the promoter and the DNA may be present and, if present, will typically comprise between about 5 and 20 bases, but may be smaller or larger as desired. The promoter region will usually comprise between about 15 and 250 nucleotides, preferably between about 17 and 60 nucleotides, from a naturally occurring RNA polymerase promoter or a consensus promoter region. In general, prokaryotic promoters are preferred over eukaryotic promoters, and phage or virus promoters are most preferred. As used herein, the term “operably linked” refers to a functional linkage between the affecting sequence (typically a promoter) and the controlled sequence (the mRNA binding site). The promoter regions that find use are regions where RNA polymerase binds tightly to the DNA and contain the start site and signal for RNA synthesis to begin. A wide variety of promoters are known and many are very well characterized. Representative promoter regions of particular interest include T7, T3 and SP6 as described in Chamberlin and Ryan, The Enzymes (ed. P. Boyer, Academic Press, New York) (1982) pp 87-108.
The promoter-primer described above and throughout this specification may be prepared using any suitable method, such as, for example, the known phosphotriester and phosphite triester methods, or automated embodiments thereof. In one such automated embodiment, dialkyl phosphoramidites are used as starting materials and may be synthesized as described by Beaucage et al. (1981), Tetrahedron Letters 22, 1859. One method for synthesizing oligonucleotides on a modified solid support is described in U.S. Pat. No. 4,458,066. It is also possible to use a primer that has been isolated from a biological source (such as a restriction endonuclease digest). The primers herein are selected to be “substantially” complementary to each specific sequence to be amplified, i.e., the primers should be sufficiently complementary to hybridize to their respective targets. Therefore, the primer sequence need not reflect the exact sequence of the target, and can, in fact, be “degenerate.” Non-complementary bases or longer sequences can be interspersed into the primer, provided that the primer sequence has sufficient complementarity with the sequence of the target to be amplified to permit hybridization and extension.
Reverse transcriptase is then used to make a cDNA strand 412. The RNA strand 400 is next degraded using RNaseH, and a primer 414 is added. An exogenous primer can be added (e.g., random hexamer), or priming can occur by synthesis from residual RNA that is still bound to the DNA or by snap-back priming from the cDNA strand made during first strand synthesis. An enzyme is used to make a copy of cDNA strand 412 according to known techniques, to synthesize double-stranded cDNA 412,412′. After hybridizing the oligonucleotide promoter-primer 410 with an initial mRNA sample 400, the primer-mRNA hybrid is converted to a double-stranded cDNA product that is recognized by an RNA polymerase, as noted. The promoter-primer is contacted with the mRNA under conditions that allow the poly-dT site to hybridize to the poly-A tail present on most mRNA species. The catalytic activities required to convert primer-mRNA hybrid to double-stranded cDNA are an RNA-dependent DNA polymerase activity, an RNaseH activity, and a DNA-dependent DNA polymerase activity. Most reverse transcriptases, including those derived from Moloney murine leukemia virus (MMLV-RT), avian myeloblastosis virus (AMV-RT), bovine leukemia virus (BLV-RT), Rous sarcoma virus (RSV) and human immunodeficiency virus (HIV-RT), catalyze each of these activities. These reverse transcriptases are sufficient to convert primer-mRNA hybrid to double-stranded DNA in the presence of additional reagents which include, but are not limited to: dNTPs; monovalent and divalent cations, e.g., KCl, MgCl2; sulfhydryl reagents, e.g., dithiothreitol; and buffering agents, e.g., Tris-Cl. Alternatively, a variety of proteins that catalyze one or two of these activities can be added to the cDNA synthesis reaction. For example, MMLV reverse transcriptase lacking RNaseH activity (described in U.S. Pat. No. 5,405,776), which catalyzes RNA-dependent DNA polymerase activity and DNA-dependent DNA polymerase activity, can be added with a source of RNaseH activity, such as the RNaseH purified from cellular sources, including Escherichia coli. These proteins may be added together during a single reaction step, or added sequentially during two or more substeps. Finally, additional proteins that may enhance the yield of double-stranded DNA products may also be added to the cDNA synthesis reaction. These proteins include a variety of DNA polymerases (such as those derived from E. coli, thermophilic bacteria, archaebacteria, phage, yeasts, Neurosporas, Drosophilas, primates and rodents), and DNA Ligases (such as those derived from phage or cellular sources, including T4 DNA Ligase and E. coli DNA Ligase).
Conversion of primer-mRNA hybrid to double-stranded cDNA by reverse transcriptase proceeds through an RNA:DNA intermediate which is formed by extension of the hybridized promoter-primer by the RNA-dependent DNA polymerase activity of reverse transcriptase. The RNaseH activity of the reverse transcriptase then hydrolyzes at least a portion of the RNA:DNA hybrid, leaving behind RNA fragments that can serve as primers for second strand synthesis (Meyers et al., Proc. Nat'l Acad. Sci. USA (1980) 77:1316 and Olsen & Watson, Biochem. Biophys. Res. Comm. (1980) 97:1376). Extension of these primers by the DNA-dependent DNA polymerase activity of reverse transcriptase results in the synthesis of double-stranded cDNA. Other mechanisms for priming of second strand synthesis may also occur, including “self-priming” by a hairpin loop formed at the 3′ terminus of the first strand cDNA and “non-specific priming” by other DNA molecules in the reaction, i.e. the promoter-primer.
The second strand cDNA synthesis results in the creation of a double-stranded promoter region. The second strand cDNA includes not only a sequence of nucleotide residues that comprise a DNA copy of the mRNA template, but also additional sequences at its 3′ end which are complementary to the promoter-primer used to prime first strand cDNA synthesis. The double-stranded promoter region serves as a recognition site and transcription initiation site for RNA polymerase, which uses the second strand cDNA as a template for multiple rounds of RNA synthesis, as noted.
Using the promoter (e.g., T7 promoter), RNA polymerase is added, which binds to the promoter to generate a cRNA (complementary RNA) strand (antisense RNA) 416, as a copy of strand 412, and this copying process repeats itself 418 to produce hundreds, possibly about a thousand, cRNA copies of the cDNA strand. The cRNA generated is an RNA copy of strand 412, or a reverse complement of strand 412′. The antisense RNA that is made contains UUU (i.e., a poly-U stretch). It is not a copy of the mRNA that was started with, which contains AAA (i.e., a poly-A tail).
The antisense RNA is produced by transcribing the double-stranded cDNA with RNA polymerase, and is complementary to the initial mRNA target from which it is amplified. This step is carried out in the presence of reverse transcriptase which is present in the reaction mixture. Thus, this technique does not involve a step in which the double-stranded cDNA is physically separated from the reverse transcriptase following double-stranded cDNA preparation. The reverse transcriptase that is present during the transcription step is rendered inactive, and thus, the transcription step is carried out in the presence of a reverse transcriptase that is unable to catalyze RNA-dependent DNA polymerase activity, at least for the duration of the transcription step. As a result, the antisense RNA products of the transcription reaction cannot serve as substrates for additional rounds of amplification, and the amplification process cannot proceed exponentially.
The reverse transcriptase present during the transcription step may be rendered inactive using any convenient protocol. The transcriptase may be irreversibly or reversibly rendered inactive. Where the transcriptase is irreversibly rendered inactive, the transcriptase is physically or chemically altered so as to be no longer able to catalyze RNA-dependent DNA polymerase activity. The transcriptase may be irreversibly inactivated by any convenient means. Thus, the reverse transcriptase may be heat inactivated, in which the reaction mixture is subjected to heating to a temperature sufficient to inactivate the reverse transcriptase prior to commencement of the transcription step. In these embodiments, the temperature of the reaction mixture and therefore the reverse transcriptase present therein is typically raised to 55° C. to 70° C. for 5 to 60 minutes, usually to about 65° C. for 15 to 20 minutes. Alternatively, reverse transcriptase may be irreversibly inactivated by introducing a reagent into the reaction mixture that chemically alters the protein so that it no longer has RNA-dependent DNA polymerase activity. In yet other embodiments, the reverse transcriptase is reversibly inactivated. In these embodiments, the transcription may be carried out in the presence of an inhibitor of RNA-dependent DNA polymerase activity. Any convenient reverse transcriptase inhibitor may be employed which is capable of inhibiting RNA-dependent DNA polymerase activity to an extent sufficient to provide for linear amplification. However, these inhibitors should not adversely affect RNA polymerase activity. Reverse transcriptase inhibitors of interest include ddNTPs, such as ddATP, ddCTP, ddGTP or ddTTP, or a combination thereof, where the total concentration of the inhibitor typically ranges from about 50 μM to 200 μM.
For this transcription step, the presence of the RNA polymerase promoter region on the double-stranded cDNA is exploited for the production of antisense RNA. To synthesize the antisense RNA, the double-stranded DNA is contacted with the appropriate RNA polymerase in the presence of the four ribonucleotides, under conditions sufficient for RNA transcription to occur, where the particular polymerase employed will be chosen based on the promoter region present in the double-stranded DNA, e.g. T7 RNA polymerase, T3 or SP6 RNA polymerases, E. coli RNA polymerase, and the like. Suitable conditions for RNA transcription using RNA polymerases are known in the art. As mentioned above, a critical feature of the subject methods is that this transcription step is carried out in the presence of a reverse transcriptase that has been rendered inactive, e.g. by heat inactivation or by the presence of an inhibitor.
Because of the nature of the steps described, all of the necessary polymerization reactions, i.e., first strand cDNA synthesis, second strand cDNA synthesis and antisense RNA transcription, may be carried out in the same reaction vessel at the same temperature, such that temperature cycling is not required. As such, these methods are particularly suited for automation, as the requisite reagents for each of the above steps need merely be added to the reaction mixture in the reaction vessel, without any complicated separation steps being performed, such as phenol/chloroform extraction.
The resultant antisense RNA may next be labeled with multiple different labels. As noted, labels may include any known types that are designed to be interpreted, scanned or read during processing of the sample after its hybridization on a chemical array, including radioactive labeling, dye labeling, etc.
In the above annotations, “N” represents any base (i.e., A, T, G or C), and the number of N's represents the number of bases. For example, an oligo-dT primer may include from about 12 to about 20 nucleotides (bases) and a random primer 504 may include from about 6 to about 12 nucleotides. Alternatively, oligo-dT primer 502 may be a lock-docked type primer (e.g., 5′-TTT..VN-3′), wherein “V” represents A, G or C base.
If it is desired to incorporate only one molecule of each dye per cDNA strand, then incorporation of a first dye with an oligo-dT primer 502 or random primer 504, as noted above, guarantees that only one molecule of the first dye is incorporated into the target cDNA strand 516 as shown at 518. The second dye is then provided either in the form of a dye-dideoxy nucleotide (dye-ddCTP), which acts as a chain terminator so that only one molecule of the second dye is incorporated into the target cDNA, or in the form of a dye-deoxy nucleotide (dye-dCTP), in which case multiple molecules of the second dye may be incorporated into the target cDNA. After the dye incorporation process, the mRNA template is degraded, leaving dye-labeled cDNA target sequence 524 (if dye-ddCTP was used as the second dye) or 524′ (if dye-dCTP was used as the second dye). The resultant dye-labeled sequences resulting from processing of the respective mRNA to fluorescently labeled cDNA are then used as the target for hybridizing a chemical array having probes designed to bind to the molecules of the sample that the dye-labeled sequences represent.
For dye-labeling using a random primer 504, the second dye may be provided in the form of a dye-dideoxy nucleotide to act as a chain terminator, and only one molecule of the second dye is then incorporated into the target cDNA sequence, or a dye-deoxy nucleotide (dye-dCTP) may be provided, in which multiple molecules of the second dye may be incorporated into the target cDNA.
After the dye incorporation process, the mRNA template is degraded, leaving dye-labeled sequence 528 (if dye-dCTP was used as the second dye) or 528′ (if dye-ddCTP was used as the second dye). The resultant dye-labeled sequences resulting from processing of the respective mRNA to fluorescently labeled cDNA are then used as the target for hybridizing a chemical array having probes designed to bind to the molecules of the sample that the dye-labeled sequences represent.
Alternative to the use of linear amplification techniques for multiple labeling of a sample, non-amplification techniques may be used.
Reverse transcriptase is then used to make a cDNA strand. The RNA strand is next degraded using RNase, leaving the cDNA strand (single-stranded cDNA). The cDNA strand may be labeled with multiple labels 802,804 in any of the manners described above during the description of the linear amplification processes (e.g., incorporating a dye nucleotide and/or incorporating a modified nucleotide with subsequent conjugation of dye, with or without fragmentation, etc.). The multi-labeled cDNA sequences are then used as the target sample for further processing as described below.
Referring now back to
After washing and other typical processing steps, the array is then processed at event 306 to read the array (such as by scanning, or the like) to obtain signals from the probes with regard to each different label, respectively. The signal values associated with each of the different labels for each probe may then be used as a measure of label integrity, i.e., to measure the fidelity of the signals as affected by one label versus the others. Additionally, the signal values associated with each of the different labels may be used to improve quantitation and reproducibility of signal quantitation results, as will be described below. Thus, the techniques described herein provide an onboard diagnostic test of the labels employed, which may be used in experimental arrays for improving quality of results from arrays actually used in running experiments.
Since each label is expected to be incorporated into the nucleic acids in the sample in proportions designed to produce proportional signal levels on the same probe, across probes on the array, each set of signals for each label, respectively, is expected to measure the same biopolymers (e.g., polynucleotides) in equal concentrations across probes. Thus, a comparison of the signals associated with each label provides a reliable measure of whether the labels are distorting the signal readings, since all other technical factors do not vary (e.g., array to array differences, lot to lot differences, hybridization conditions, array manufacturing conditions, etc., which are factors that may typically be causes of gradients and other pattern variations when comparing two samples contacted to two different arrays).
The signal intensity values associated with the different labels are then compared at event 308 to identify label-induced errors (i.e., errors resulting from a lack of label integrity) in the signal intensities, or to confirm label integrity. One technique for comparison involves calculating (and optionally, plotting) response surfaces for each set of signals (where each set is associated with a different label) against the locations of the probes on the array from which the signals were obtained. Response surfaces may be plotted using any of a number of known techniques. The response surfaces should generally follow the same contour to confirm that label integrity exists, since the other technical factors (e.g., hybridization differences, array production and processing differences, etc., between experiments) are effectively eliminated by processing the same single sample on the same array, with respect to all labels. If a response surface associated with any particular label diverges from the response surfaces associated with the other label or labels, then this is an indication of error induced by one or more of the labels. A divergence threshold may be set that defines acceptable performance as defined by customer microarray markets. For example, if customers require the median inter-array coefficient of variation percentages (% CV) to be 12% or less, then it would be reasonable to set a threshold at 0.12 or less (e.g., 0.10) and, when set at 0.10, for example, a volatile, non-persistent ratio gradient between response surfaces produced from signals associated with first and second labels, respectively, with % CV>10% would be determined to be not acceptable, for lack of label integrity.
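The comparison of response surfaces against a divergence threshold can be sketched in code. The following Python fragment is a minimal sketch only, not the method of this description: it assumes that each label's signals are available as a rectangular grid indexed by feature row and column, approximates each response surface by tile-wise mean signals (a deliberately crude stand-in for any of the known surface-fitting techniques), and uses the coefficient of variation of the tile-wise ratio as a hypothetical divergence metric compared against the example threshold of 0.10. All function names and the synthetic data are illustrative assumptions.

```python
from statistics import mean, pstdev


def block_means(grid, block):
    """Coarse response surface: mean signal in each block-by-block tile."""
    rows, cols = len(grid), len(grid[0])
    surface = []
    for r0 in range(0, rows, block):
        surface_row = []
        for c0 in range(0, cols, block):
            tile = [grid[r][c]
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            surface_row.append(mean(tile))
        surface.append(surface_row)
    return surface


def ratio_surface_cv(grid_a, grid_b, block=2):
    """Coefficient of variation (std / mean) of the tile-wise ratio A / B."""
    surf_a = block_means(grid_a, block)
    surf_b = block_means(grid_b, block)
    ratios = [a / b
              for row_a, row_b in zip(surf_a, surf_b)
              for a, b in zip(row_a, row_b)]
    return pstdev(ratios) / mean(ratios)


def labels_agree(grid_a, grid_b, threshold=0.10, block=2):
    """True when the ratio surface varies no more than the threshold."""
    return ratio_surface_cv(grid_a, grid_b, block) <= threshold


# Synthetic 4 x 4 array with a gradient shared by both labels.
green = [[100 + 10 * r + 5 * c for c in range(4)] for r in range(4)]
red_ok = [[1.5 * v for v in row] for row in green]             # constant dye bias
red_bad = [[v * (1 + 0.2 * c) for c, v in enumerate(row)]      # bias drifts by column
           for row in green]

print(labels_agree(green, red_ok))   # shared gradient, constant ratio
print(labels_agree(green, red_bad))  # ratio surface itself has a gradient
```

In this sketch a constant bias between labels leaves the ratio surface flat and passes the check, whereas a label whose bias drifts across the array produces a ratio gradient and fails it.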
Thus, for example, if the response surfaces generated from signals associated with labels 2, 3 and 4, respectively, generally follow the same contours, but the response surface generated from signals associated with label 1 follows significantly different contours along all or a portion of the response surface, then this is indication that there may be a problem with the label integrity of label 1. When only two labels are used, it may be indeterminate as to whether one or the other label (or both) are lacking in integrity. However, in any of the preceding instances, the result is the same, in that the results of an array experiment would be unreliable or unacceptable for lack of label integrity.
Another technique for comparison includes calculating log ratios of intensity signal pairs associated with different labels (label-incorporated biopolymers) but the same probe. Signal pair ratios may be calculated for all possible combinations of different pairs of different labels, for each probe. For any given probe, each different label referred to is incorporated in the same target biopolymer (for example, the same nucleic acid) of the sample which that probe is designed to bind with. In this case, the ratios calculated are not expression ratios or ratios indicating other signals characterizing the sample (e.g., indicating copy number, as in a CGH assay, or transcription factor binding sites, as in a location analysis assay), but rather are ratios of the same signal reading, where each intensity signal of a probe is associated with a different label (i.e., the same biopolymer sequences bind to a probe, but the sequences have different labels). Assuming that the labels perform equally, the calculated log ratios should have a value of zero. However, there may be some bias between labels. For example, dye bias is known to be possible, such that a red dye associated with the same polynucleotide as a green dye may result in a higher signal intensity reading with regard to the polynucleotide incorporating the red dye relative to the polynucleotide incorporating the green dye. In these instances, the data may be processed to remove label biasing, by any of a variety of known techniques. However, with or without processing to remove label biasing, the log ratio values should remain fairly consistent across all probes on the array if there is label integrity. That is, even with dye bias being present, the log ratio of signal values associated with two different labels from a first probe should be the same as the log ratio of signal values associated with those same two different labels from a second probe, if label integrity exists.
In other words, the difference between the log ratio of signal values associated with two different labels, from a first probe, and the log ratio of signal values associated with those same two different labels from any other probe on the array should be zero, or within a predetermined threshold value (positive difference less than the threshold value, negative difference greater than the negative of the threshold value), if label integrity exists. Another example is that if other technical factors exist that would cause a gradient in the surface response for signal intensities associated with label 1, then those technical factors will also exist with regard to the signal intensities associated with label 2, so that although the surface response associated with each of labels 1 and 2 will each show a gradient, a response surface generated from the ratios or log ratios of the signal associated with label 1 to the signals associated with label 2 (or vice versa) will not have the gradient, indicating that the gradient in the response surfaces associated with the single labels is induced by technical factors other than the labels themselves.
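The log-ratio consistency check described above can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions: the function names are hypothetical, base-2 logarithms are an arbitrary choice, and the threshold of 0.2 is an assumed example value rather than a value given in this description. Requiring every pairwise difference between probes to fall within the threshold is equivalent to requiring that the spread (maximum minus minimum) of the log ratios not exceed the threshold, which is what the sketch computes.

```python
import math


def log_ratios(signals_a, signals_b):
    """Per-probe log2 ratio of label A signal to label B signal."""
    return [math.log2(a / b) for a, b in zip(signals_a, signals_b)]


def log_ratio_spread(signals_a, signals_b):
    """Largest difference between the log ratios of any two probes."""
    ratios = log_ratios(signals_a, signals_b)
    return max(ratios) - min(ratios)


def has_label_integrity(signals_a, signals_b, threshold=0.2):
    """True when all pairwise log-ratio differences are within threshold."""
    return log_ratio_spread(signals_a, signals_b) <= threshold


# Synthetic example: red reads 1.5 times green on every probe (pure dye bias).
green = [100, 200, 400, 800]
red_biased = [150, 300, 600, 1200]
# Same data, except the last probe reads three times green.
red_divergent = [150, 300, 600, 2400]

print(has_label_integrity(red_biased, green))     # constant bias is tolerated
print(has_label_integrity(red_divergent, green))  # probe-dependent ratio is flagged
```

Note that a constant dye bias shifts every log ratio by the same amount and therefore does not affect the spread, which matches the observation that log ratios should remain consistent across probes even when dye bias is present.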
After comparison of the signal intensity readings associated with the different labels, a determination may be made, based on such comparison, as to whether the fidelity of the signal intensity readings, as impacted by the labels used, is reliable. If it is determined that one or more labels lack integrity, such as by observing significant divergence of response surfaces, or variation in the differences between ratios across the array, then label integrity is determined to be absent at event 310 and the data is considered to be unreliable at event 312. Unstable labeling tends to amplify all differences such as the chemical differences between two different label dyes, for example. On the other hand, if label integrity is found to exist at event 310, then the data (signal intensity readings) may be considered reliable, at least to the extent that the labels used are not distorting the signal intensity readings.
It has been further discovered that the signal intensity readings associated with the different labels may be combined to form a composite or average signal intensity level for a probe, which may be more accurate, reliable and reproducible across experiments than if any single signal intensity level associated with any single label associated with the experiment were used. Such processing may optionally be carried out at event 316. The technique can average out small inconsistencies that may be present with various different types of labels. For example, labels such as dyes may exhibit a small amount of abundance-dependence, such as when dyes are incorporated into RNA according to the number of opportunities present (i.e., the number of nucleic acids that are present and complementary to the labeled nucleic acids). By averaging the signals, the effects of abundance dependence of one of the labels are reduced by the values associated with the other labels that are not abundance dependent in that range of signal levels. As a simple example, if label 1 amplifies the signal somewhat at lower abundances, and thus provides stronger signals at lower signal levels reflective of lower abundance of the sample on a probe, and label 2 does not, then by averaging the signals the amplification is reduced.
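The averaging described above can be illustrated with a short sketch. The following Python fragment is an illustration only: the function name `composite_signals`, the dictionary layout, and the synthetic intensities are assumptions, and a plain arithmetic mean per probe is used as one simple choice of composite; other combinations (for example, averaging after removal of label bias) would be equally consistent with the description.

```python
from statistics import mean


def composite_signals(signals_by_label):
    """Average the per-probe signals across labels.

    `signals_by_label` maps a label name to a list of intensities,
    with every list given in the same probe order.
    """
    lengths = {len(values) for values in signals_by_label.values()}
    if len(lengths) != 1:
        raise ValueError("every label must provide one signal per probe")
    return [mean(per_probe) for per_probe in zip(*signals_by_label.values())]


# Synthetic example: label 1 reads high at low abundance, label 2 does not.
signals = {
    "label_1": [130, 520, 1000],
    "label_2": [100, 500, 1000],
}
print(composite_signals(signals))
```

With these synthetic values, the 30% excess of label 1 on the lowest-abundance probe is halved in the composite, which is the reduction in abundance-dependent amplification that the averaging is intended to provide.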
An example where different labels were incorporated into separate, equal aliquots of the same sample, then mixed into a single (multi-label) sample and hybridized to probes on an array, follows. Although the specific example is directed to dye labeling, it is again noted here that the principles and methods described herein are equally applicable to other label types. For example, the same sample may be labeled with either Cy3- or Cy5-dye and labeled with a radioactive label as well, or with two radioactive labels (radioactive isotopes), biotinylated dyes, or with two different labels of any known types, as long as a system or systems are available for reading the signals associated with such labels. Further, as noted, the present invention may be carried out by incorporating multiple different dyes into a single aliquot of a sample.
The example experiment was conducted on self-self arrays in which equivalent proportions of cyanine3-(Cy3) and cyanine5-(Cy5) dye were separately incorporated into nucleic acids in equal, but separate quantities of the same sample, and both labeled samples were then combined and hybridized, as a single combined sample having both labels, under the same conditions to the same array configured for two channel processing, commonly referred to as “self-self hyb”, in order to demonstrate post processing techniques that would be the same for a single sample having had multiple different labels applied thereto. Further details about this simulation may be found in co-pending, commonly owned Application Serial No. (Application Serial No. not yet assigned, Attorney's Docket No. 10051059-1) filed concurrently herewith and titled “Label Integrity Verification of Chemical Array Data”, which is hereby incorporated herein, in its entirety, by reference thereto.
The “self-self hyb” examples were subject to the following conditions: For a self-self hybridization, 1 μg of HeLa or K562 total RNA was amplified and Cy3- and Cy5-labeled using Agilent's Low Input RNA Fluorescent Linear Amplification Kit (5184-3523, Agilent Technologies, Inc., Palo Alto, Calif.) in separate reactions, following the protocol described in the user's manual of the kit. Hybridizations were performed using Agilent's Human 1A (V2) Oligo Microarrays (G4110B, Agilent Technologies Inc., Palo Alto, Calif.) and the in-situ Hybridization Plus Kit (5184-3568, Agilent Technologies, Inc., Palo Alto, Calif.). 750 ng of Cy3- and 750 ng of Cy5-labeled cRNA were co-hybridized to each microarray, as described in the microarray user manual (G4140-90030, Agilent Technologies, Inc., Palo Alto, Calif.). Slides were scanned on an Agilent Microarray Scanner (Model G2505B, Agilent Technologies, Inc., Palo Alto, Calif.) and the raw images were processed using Agilent's Feature Extraction (v7.5.1, Agilent Technologies, Inc., Palo Alto, Calif.).
This experiment was closely controlled to provide the same technical factors to both samples on the same array, to validate the usefulness of providing two or more labels to the same sample to monitor label integrity as described herein. Table 1 lists the four Agilent oligo, two-color arrays (self3, self4, self7 and self8) that were prepared for the experiment. The arrays self3 and self7 used HeLa—11 as the sample for both red and green dyes in equal proportions, and the arrays self4 and self8 used K562—12 as the sample for both red and green dyes in equal proportions.
Upon hybridizing each array with the target samples as indicated above, each probe was ideally expected to bind with equal concentrations of Cy3-labeled polynucleotides and Cy5-labeled polynucleotides of the specific polynucleotide that it is designed to bind with.
After washing and other typical processing steps, the arrays were scanned with a two-channel Agilent scanner to obtain signals from the probes for both the Cy3-labeled target as well as the Cy5-labeled target on the two channels, respectively. The ratios of the signal values from the two channels for each probe were then analyzed as a measure of dye integrity, i.e., to measure the fidelity of the signals as affected by one dye versus the other. Since both channels were expected to measure the same biopolymers (e.g., labeled polynucleotides) present in equal concentrations for each probe, a comparison of the signals from each channel with the processing described herein provides a reliable measure of whether the labels are distorting the signal readings, since all other technical factors do not vary (e.g., one or more of: array to array differences, lot to lot differences, hybridization conditions, array manufacturing conditions, etc., that may typically be causes of gradients and other pattern variations when comparing two samples contacted to two different arrays).
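By way of illustration only, the per-probe comparison of the two channels may be sketched as follows. The use of the natural logarithm of the red-to-green ratio is an assumption made here for the sketch, and the function names `ln_ratios` and `summarize` are hypothetical. For a self-self hybridization with ideal labels, each log ratio would be expected to be near zero.

```python
import math
import statistics


def ln_ratios(red, green):
    """Natural log of the red/green signal ratio for each probe (signals must be > 0)."""
    if len(red) != len(green):
        raise ValueError("red and green must contain one signal per probe")
    return [math.log(r / g) for r, g in zip(red, green)]


def summarize(ratios):
    """Mean and sample standard deviation of the log ratios (needs >= 2 probes)."""
    return {"mean": statistics.mean(ratios), "stdev": statistics.stdev(ratios)}


# Hypothetical signals for four probes on one self-self array.
red_signals = [105.0, 980.0, 52.0, 4100.0]
green_signals = [100.0, 1000.0, 50.0, 4000.0]
summary = summarize(ln_ratios(red_signals, green_signals))
```

A mean log ratio that departs from zero, or a log ratio that varies systematically with probe position, would suggest that one label is distorting the signal relative to the other.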
By providing multiple labels in a manner described with a universal reference (i.e., a reference designed to use for a broad coverage of different gene expression studies, e.g., see http://www.stratagene.com/products/displayProduct.aspx?pid=439), label integrity can be checked by comparison of signals as described, as read from the biopolymers on the universal reference that have been labeled with multiple labels, thus providing an experimenter with assurance that the labels associated with experimentation are not a significant source of error and assay instability.
As one approach to analysis of the array data from scanning the arrays identified in Table 1, ANOVA analysis of the signal data obtained from the arrays was performed using JMP*SAS software (http://www.jmp.com/) to characterize the response surfaces and check for relative dye patterns in the signal intensities, as measured by natural log ratios of dye-normalized, background subtracted signals (LnRatiOrgDNS) for red to green ratios from the probes/targets on the arrays. The ratios were analyzed to look for patterns of divergence caused by differences in performance of the red and green dyes. The analysis performed was standard ANOVA analysis to measure the dye integrity for the arrays noted. Further information regarding ANOVA analysis can be found in co-pending, commonly assigned application Ser. No. 11/198,362, filed Aug. 4, 2005 and Ser. No. 11/026,484, filed Dec. 30, 2004. Both application Ser. No. 11/198,362 and application Ser. No. 11/026,484 are hereby incorporated herein, in their entireties, by reference thereto. Table 2 shows summary results for the surface fit and the Analysis of Variance Results as determined by the ANOVA processing.
Table 2 reports well-known, established standard statistics for an ANOVA analysis. In the “Summary of Fit” portion of Table 2 above, “RSquare” measures the proportion of the variation around the mean explained by the linear or polynomial model. The remaining variation is attributed to random error. RSquare is 1 if the model fits perfectly. An RSquare value of zero indicates that the fit is no better than a simple mean model. RSquare is the standard regression result of one minus the ratio of the residual sum of squares to the total sum of squares about the mean. “RSquare Adj.” adjusts the RSquare value to make it more comparable over models with different numbers of parameters by using the degrees of freedom in its computation. Thus it is a ratio of mean squares instead of sums of squares.
“RMS Error”, or “Root Mean Square Error” estimates the standard deviation of the random error. RMS Error is calculated as the square root of the mean square for Error in the Analysis of Variance table shown in the “Analysis of Variance” portion of Table 2. “Mean of Response” is the sample mean (arithmetic average) of the response variable. This is the predicted response when no model effects are specified. “Sum of Weights”, or “Observations”, indicates the number of observations used to estimate the fit, in this case, the number of rows of data that were inputted.
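By way of illustration only, the Summary of Fit statistics described above may be computed as in the following sketch. The function name `summary_of_fit` and the convention that `n_params` counts the model parameters excluding the intercept are assumptions of this sketch, and the small data set is hypothetical.

```python
import math


def summary_of_fit(y, y_hat, n_params):
    """Standard regression summary statistics.

    y        : observed responses
    y_hat    : responses predicted by the fitted model
    n_params : number of model parameters, excluding the intercept (assumed convention)
    """
    n = len(y)
    if n != len(y_hat) or n - n_params - 1 <= 0:
        raise ValueError("need matching lengths and positive error degrees of freedom")
    mean_y = sum(y) / n
    ss_error = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_total = sum((a - mean_y) ** 2 for a in y)            # total sum of squares about the mean
    ms_error = ss_error / (n - n_params - 1)                # mean square for error
    ms_total = ss_total / (n - 1)
    return {
        "RSquare": 1.0 - ss_error / ss_total,
        "RSquare Adj": 1.0 - ms_error / ms_total,  # ratio of mean squares
        "RMS Error": math.sqrt(ms_error),
        "Mean of Response": mean_y,
        "Observations": n,
    }


# Hypothetical data: five observations and a one-parameter model.
fit = summary_of_fit([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.0], n_params=1)
```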
In the “Analysis of Variance” portion of Table 2 above, “DF” refers to the degrees of freedom for each calculation reported. The Total Error DF is the degrees of freedom figure reported at the “Error” entry of the Analysis of Variance portion of Table 2, and is the difference between the “C. Total” DF value and the “Model” DF value. The Sum of Squares or “SSQ” records an associated sum of squares for each source of error. The Total Error “SSQ” is the sum of squares value reported on the “Error” line of the Analysis of Variance portion of Table 2.
“Mean Square” is the sum of squares divided by its associated degrees of freedom, i.e., SSQ/DF. This computation converts the sum of squares to an average (mean square). “F Ratio” is the ratio of mean square for lack of fit to mean square for pure error. The F-Ratio tests the hypothesis that the lack of fit error is zero. F-ratios for statistical tests are the ratios of mean squares. “Prob>F” is the observed significance probability (p-value) of obtaining a greater F-ratio value by chance alone if the specified model fits no better than the overall response mean (i.e., probability of a noise effect). Observed significance probabilities (Prob>F) of 0.05 or less are often considered evidence of a regression effect.
Table 3 shows the parameter estimates that were calculated for performing the ANOVA analysis. The nominal terms inputted were the self-self arrays (ArraySelf3, ArraySelf4 and ArraySelf7) with the array self8 (ArraySelf8) serving as the intercept term, as one of the nominal terms (levels) becomes the designated dependent effect to be left out of the model to avoid singularity problems. This parameter becomes the negative of the sum of all other level parameters and therefore absorbs the singularity. The “Estimate” column lists the parameter (term) estimates of the linear model. The prediction formula is the linear combination of these estimates with the values of their corresponding variables. “Std. Err.” lists the estimates of the standard errors of the parameter estimates. These Std. Err. estimates are used for constructing tests and confidence intervals.
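By way of illustration only, the treatment of the nominal array terms described above, in which the parameter of the left-out level is the negative of the sum of the other level parameters, may be sketched as a sum-to-zero coding. The function names and the parameter estimates shown are hypothetical and are not taken from Table 3.

```python
LEVELS = ["ArraySelf3", "ArraySelf4", "ArraySelf7", "ArraySelf8"]


def effect_code(level, levels=LEVELS):
    """Dummy values for every level except the last; the last level is coded -1 throughout."""
    if level not in levels:
        raise ValueError(f"unknown level: {level}")
    *kept, last = levels
    if level == last:
        return [-1] * len(kept)
    return [1 if level == k else 0 for k in kept]


def implied_last_parameter(estimates):
    """Parameter of the left-out level: the negative of the sum of the other estimates."""
    return -sum(estimates)


# Hypothetical estimates for ArraySelf3, ArraySelf4 and ArraySelf7.
self8_shift = implied_last_parameter([0.02, -0.01, 0.03])
```

With this coding the level parameters sum to zero, so only three of the four array terms need to be estimated, which avoids the singularity noted above.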
The “t Ratio” column lists the test statistics for the hypothesis that each parameter is zero. The t Ratio is the ratio of the parameter estimate to its standard error. If the hypothesis is true, then this statistic has a Student's t-distribution. Looking for a t Ratio greater than 2 in absolute value is a common rule of thumb for judging significance because it approximates the 0.05 significance level.
The final column labeled “Prob>|t|” lists the observed significance probability calculated from each t Ratio. Prob>|t| is the probability of getting, by chance alone, a t Ratio greater (in absolute value) than the computed value, given a true hypothesis. Often, a value below 0.05 (or sometimes 0.01) is interpreted as evidence that the effect of the parameter considered is significantly different from zero. The different values in this column for the nominal variables ArraySelf3, ArraySelf4 and ArraySelf7 indicate LnRatio shifts due to variation in the amount of response of the red dye relative to the green dye for the same probe/target, over all of the probes on the arrays among the arrays, respectively. ANOVA nominal variables are composed of dummy values which represent shifts as estimated by their parameters. The shifts were considered to be within an acceptable range in this example. An acceptable range may be preset to make this determination. For example, in this example, the range was preset for a determination that a shift was in an acceptable range if the p-value was less than 0.05, which is a typical threshold setting for significance.
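By way of illustration only, the t Ratio and an approximate Prob>|t| may be computed as in the following sketch. The p-value here uses a large-sample normal approximation to the Student's t-distribution, which is an assumption of this sketch (reasonable only when the error degrees of freedom are large) and is not necessarily how the analysis software computes it. The function names and input values are hypothetical.

```python
import math


def t_ratio(estimate, std_err):
    """Ratio of a parameter estimate to its standard error."""
    if std_err <= 0:
        raise ValueError("standard error must be positive")
    return estimate / std_err


def approx_prob_gt_abs_t(t):
    """Two-sided p-value from a normal approximation (assumes large degrees of freedom)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def is_significant(t, alpha=0.05):
    """True when the approximate two-sided p-value is below alpha."""
    return approx_prob_gt_abs_t(t) < alpha


# Hypothetical parameter estimate and standard error.
t = t_ratio(0.5, 0.1)
p = approx_prob_gt_abs_t(t)
```

Under this approximation an absolute t Ratio of about 1.96 corresponds to a p-value of about 0.05, consistent with the rule of thumb of looking for a t Ratio greater than 2 in absolute value.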
The second grouping of terms in Table 3 (i.e., Col&RS, (Row-103.983)*(Row-103.983), (Row-103.983)*(Col-215.455), and (Col-215.455)*(Col-215.455)) are scaled or covariate terms, minus their average value (to improve numerical and statistical properties), and provide the statistical results that characterize the global, persistent (array-independent pattern) effects, to the second order, of the row and column positions of the probes on the arrays with respect to all four of the arrays (ArraySelf3, ArraySelf4, ArraySelf7 and ArraySelf8) considered together, upon the outcome of the signal levels (natural log ratios of dye-normalized, background subtracted signals, in this example). Note that the numerical values “103.983” and “215.455” are the average row and column positions on an x-y grid, as measured on the array by the analysis software, and that these values are subtracted from each row and column position, respectively, to center the data for performance of the analysis, thereby reducing effect correlations. Specifically, in this example, Col&RS characterizes the effect of the column positions, (Row-103.983)*(Row-103.983) characterizes the second order effect of row positions, or row-row interaction (i.e., row²), (Row-103.983)*(Col-215.455) characterizes the effect of row and column interaction, and (Col-215.455)*(Col-215.455) characterizes the second order effect of column positions, or column-column interaction (i.e., column²). The extremely low p-values in the last column for these terms indicate that persistent gradients apply to all the arrays considered, in the LnRatiOrgDNS data, but that these gradients are very small, as indicated by the small parameter estimates for these terms.
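By way of illustration only, the centered first- and second-order position terms may be constructed for a probe as in the following sketch. The mean row and column values are those stated above; the function name, the dictionary keys, and the inclusion of a first-order row term alongside the column term are assumptions of this sketch.

```python
ROW_MEAN = 103.983  # average row position stated above
COL_MEAN = 215.455  # average column position stated above


def response_surface_terms(row, col, row_mean=ROW_MEAN, col_mean=COL_MEAN):
    """Centered linear, squared, and cross-product terms for one probe position."""
    r = row - row_mean  # centering reduces correlation between the linear and squared terms
    c = col - col_mean
    return {
        "Row": r,
        "Col": c,
        "Row*Row": r * r,
        "Row*Col": r * c,
        "Col*Col": c * c,
    }


def predicted_gradient(terms, estimates):
    """Linear combination of the terms with their (hypothetical) parameter estimates."""
    return sum(estimates[name] * value for name, value in terms.items())


# Hypothetical probe position one row and two columns away from the array center.
terms = response_surface_terms(104.983, 217.455)
```

Multiplying each term by its parameter estimate and summing, as in `predicted_gradient`, gives the contribution of the persistent gradient to the log ratio at that position.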
The third grouping of terms in Table 3 (i.e., (Row-103.983)*ArraySelf3, (Row-103.983)*ArraySelf4, (Row-103.983)*ArraySelf7, (Col-215.455)*ArraySelf3, (Col-215.455)*ArraySelf4, (Col-215.455)*ArraySelf7, (Row-103.983)*(Row-103.983)*ArraySelf3, (Row-103.983)*(Row-103.983)*ArraySelf4, (Row-103.983)*(Row-103.983)*ArraySelf7, (Row-103.983)*(Col-215.455)*ArraySelf3, (Row-103.983)*(Col-215.455)*ArraySelf4, (Row-103.983)*(Col-215.455)*ArraySelf7, (Col-215.455)*(Col-215.455)*ArraySelf3, (Col-215.455)*(Col-215.455)*ArraySelf4, and (Col-215.455)*(Col-215.455)*ArraySelf7) are scaled or covariate terms, per array, that characterize the changes in LnRatiOrgDNS values for each array, on a per array basis, respectively, as effected by row and column positions of the probes/targets on the arrays. These parameters indicate the shift in the persistent parameters for each array for all gradient effects.
Specifically, (Row-103.983)*ArraySelf3 characterizes the row effect shift upon any gradient that may be observed in array self3, (Row-103.983)*ArraySelf4 characterizes the row effect shift upon any gradient that may be observed in array self4, (Row-103.983)*ArraySelf7 characterizes the row effect shift upon any gradient that may be observed in array self7, (Col-215.455)*ArraySelf3 characterizes the column effect shift upon any gradient that may be observed in array self3, (Col-215.455)*ArraySelf4 characterizes the column effect shift upon any gradient that may be observed in array self4, (Col-215.455)*ArraySelf7 characterizes the column effect shift upon any gradient that may be observed in array self7, (Row-103.983)*(Row-103.983)*ArraySelf3 characterizes the second-order row effect shift (shift/correction relative to the persistent array-independent pattern noted above) upon any gradient that may be observed in array self3, (Row-103.983)*(Row-103.983)*ArraySelf4 characterizes the second-order row effect shift upon any gradient that may be observed in array self4, (Row-103.983)*(Row-103.983)*ArraySelf7 characterizes the second-order row effect shift upon any gradient that may be observed in array self7, (Row-103.983)*(Col-215.455)*ArraySelf3 characterizes the row and column interaction effect shift (shift/correction relative to the persistent array-independent pattern noted above) upon any gradient that may be observed in array self3, (Row-103.983)*(Col-215.455)*ArraySelf4 characterizes the row and column interaction effect shift upon any gradient that may be observed in array self4, (Row-103.983)*(Col-215.455)*ArraySelf7 characterizes the row and column interaction effect shift upon any gradient that may be observed in array self7, (Col-215.455)*(Col-215.455)*ArraySelf3 characterizes the second-order column effect shift upon any gradient that may be observed in array self3, (Col-215.455)*(Col-215.455)*ArraySelf4 characterizes the second-order column effect shift upon any gradient that may be observed in array self4, and (Col-215.455)*(Col-215.455)*ArraySelf7 characterizes the second-order column effect shift upon any gradient that may be observed in array self7.
That is, these metrics provide a measure of array-dependent gradients, i.e., the variation of the gradient pattern from array to array, relative to the persistent, array-independent pattern (estimated as the pattern averaged over all array-specific patterns). Based upon the significance values (<0.05) relative to the parameter sizes, it was determined that the array-dependent gradients are significant, but very small.
Because of the large number of data points (LnRatiOrgDNS values) used in this analysis, a lot of statistical leverage was provided and it was possible to detect very small changes in gradient, much less than a level that was considered significant (i.e., where significance was considered for values of p<0.05). Therefore, it was concluded that the gradient levels were significant and, if the consequential percent CV levels are above thresholds considered acceptable, then the arrays fail market requirements. The Ln Ratio, array-dependent gradients are also significant, but very small as indicated by the third grouping of parameters and associated statistics.
Table 4 shows the combined statistics for all of the terms described above in Table 3. Rather than reporting p-values for array shifts separately, Table 4 combines the effects over all arrays and provides p-values that were calculated for each term over all arrays. Thus, the information in Table 4 is provided to answer the question as to whether there is an array effect of one or more terms on the LnRatiOrgDNS data. Table 4 reports ensemble significance, that is, the significance of all levels of each term considered together. Terms may also be custom-combined in a manner as taught in co-pending, commonly assigned application Ser. No. 11/198,362.
“Source” lists each of the variables/terms that were considered in performing the ANOVA calculations. “DF” lists the degrees of freedom for the calculations performed for the variable listed in the same row, respectively. For nominal variables, the DF value was the total number of levels (nominal variables) minus one, to account for the intercept, as noted above, and further discussed in application Ser. No. 11/198,362. The Sum of Squares calculations divided by DF, respectively, provide the relative weights attributed to the effect of each variable on the LnRatiOrgDNS data. An F-ratio value was calculated for each Sum of Squares term and reported in the next adjacent column. From these F-ratio values, p-values were calculated to show the probability that each effect is due to noise, or actually due to the term/variable considered. A p-value of 1 means that there is no evidence at all to suggest that there is a systematic effect caused by the variable/term for which the p-value is calculated. Conversely, a p-value less than 0.0001 means that the result is highly significant, and that the effect (mean sum of squares term, versus the residual mean sum of squares term) calculated for that term is due predominantly to the term considered, and not to random noise. Thus, the lower the p-value, the more significant is the result (i.e., the calculated sum of squares value is more likely to actually be due to the term considered, rather than predominantly to noise). The low Prob>F values in Table 4 imply statistically significant impact, but acceptable arrays according to typical market requirements, since the % CV impact of the effect estimates is small and less than 12%.
The total (mean-adjusted) sum of squares calculated was 2049.5670, as indicated in Table 2. The sum of squares calculations for each of the terms considered, as shown in Table 4, are very small relative to the total sum of squares. Thus, although the effects of these terms are statistically significant, as shown by the p-values in the last column of Table 4, the effects are very small compared to the total sum of squares calculation. Thus, the terms considered are not accounting for the large majority of variation in the signal values. Therefore, the overall variation in the signal values analyzed is not due to dye integrity issues. Based on the small gradients as indicated by the magnitudes of the parameter estimates that model the contour plots, as characterized by the results of the ANOVA testing, it was concluded that the signals associated with red dye versus the respective signals associated with green dye were behaving in parallel (i.e., any effect on the signal caused by red dye, if any, was nearly the same as the effect on the signal caused by green dye, if any, across all probes on all arrays, showing inter-array consistency of the dye labels), and that dye integrity was sufficient so as not to affect the reliability of the signal data representing the actual targets binding to probes. Therefore the labeling (red and green dyes) passed the quality test. That is, the dye effect estimates on the signal data were significant, but small and acceptable as to expected consequential impact, as measured by % CV. Statistical significance of the dye effects, by itself, does not imply unacceptable label integrity, but is necessary when the effect estimates exceed a valid threshold value that would imply unacceptable integrity.
As briefly referred to above, it was determined that the signal intensity readings associated with the different labels may be combined to form a composite or average signal intensity level for a probe, which may be more accurate, reliable and reproducible across experiments than if any single signal intensity level associated with any single label associated with the experiment were used.
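By way of illustration only, the reproducibility of a signal across arrays may be summarized by a coefficient of variation (CV) per probe and a median CV over all probes, as in the following sketch. The use of the sample standard deviation, the function names, and the signal values are assumptions of this sketch; the 12% threshold is the typical acceptable value noted herein.

```python
import statistics


def probe_cv(signals_across_arrays):
    """Coefficient of variation (sample standard deviation / mean) of one probe's signal."""
    mean = statistics.mean(signals_across_arrays)
    if mean == 0:
        raise ValueError("mean signal must be non-zero")
    return statistics.stdev(signals_across_arrays) / mean


def median_cv(probes):
    """Median of the per-probe CV values; `probes` holds one list of signals per probe."""
    return statistics.median(probe_cv(signals) for signals in probes)


def is_acceptable(cv, threshold=0.12):
    """True when the CV does not exceed the threshold (12% assumed by default)."""
    return cv <= threshold


# Hypothetical signals for two probes, each measured on three arrays.
example = [[100.0, 110.0, 90.0], [200.0, 200.0, 200.0]]
example_median_cv = median_cv(example)
```

The same calculation may be applied to the signal of each label separately and to the composite signal, so that their median CV values can be compared.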
Table 5 reports the numerical quantile statistics and moments calculated from the data shown in
The median CV values (array-to-array variability in signal) for Cy3 and Cy5 are 0.1719 and 0.1792, respectively, or 17.19% and 17.92%, which are considered to be unacceptable levels. For example, a typical threshold % CV value considered to be acceptable currently is about 12% or less, sometimes 10% or less. The median CV for the combined signal (
Table 6 reports the numerical quantile statistics and moments calculated from the data shown in
The median CV values (array-to-array variability in signal) for Cy3 and Cy5 are 0.1166 and 0.1204, respectively, or 11.66% and 12.04%, in this case. The median CV for the combined signal (CVrgLnBSS in
The background-subtracted, but not dye-normalized signals were weighted according to their performances at different relative signal intensities. From experience, it was known that the green dye (Cy3) performs with better integrity (i.e., better reproducibility, less variation, relative to that observed in signals associated with the red dye Cy5) with signals of relatively lower intensity and that the red dye (Cy5) performs with better integrity (i.e., better reproducibility, less variation, relative to that observed in signals associated with the green dye Cy3) with signals of relatively higher intensity. Accordingly, for signals higher than the average signal, rather than just calculating the Ln average of the signal associated with the red dye and the signal associated with the green dye for a probe, the signal associated with the red dye was weighted more heavily than the signal associated with the green dye. Conversely, for signal intensities less than the average signal intensity, the signal associated with the green dye for a probe was weighted more heavily than the signal associated with the red dye for the same probe, and then a log average of these signals was calculated. Thus, signals associated with green dye and having less than the median signal intensity were weighted at a factor of greater than 0.5 and signals associated with red dye having less than the median signal intensity were weighted at a factor of less than 0.5, wherein the weighting factors for red and green associated signals from the same probe sum to a total of one. Weighting was performed conversely for the signals having greater than the median signal intensity. A weighting curve was empirically developed to optimize the weighting values applied.
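By way of illustration only, a weighted log average of this kind may be sketched as follows. The empirically developed weighting curve is not reproduced here; the logistic curve, its `steepness` parameter, and the function names are hypothetical stand-ins chosen only because they give the green signal a weight above 0.5 below the median intensity, a weight below 0.5 above it, and weights that sum to one.

```python
import math


def green_weight(ln_signal, ln_median, steepness=1.0):
    """Hypothetical weighting curve: 0.5 at the median, rising toward 1 at lower intensity."""
    return 1.0 / (1.0 + math.exp(steepness * (ln_signal - ln_median)))


def weighted_log_average(red, green, median_signal, steepness=1.0):
    """Weighted average of ln(red) and ln(green) for one probe (signals must be > 0)."""
    ln_red, ln_green = math.log(red), math.log(green)
    ln_mean = 0.5 * (ln_red + ln_green)  # unweighted log average locates the probe
    w_green = green_weight(ln_mean, math.log(median_signal), steepness)
    w_red = 1.0 - w_green                # weights for the same probe sum to one
    return w_red * ln_red + w_green * ln_green


# Hypothetical probes below and above an assumed median signal of 100 units.
low = weighted_log_average(10.0, 20.0, 100.0)       # green weighted more heavily
high = weighted_log_average(1000.0, 2000.0, 100.0)  # red weighted more heavily
```

In this sketch the result for the low-intensity probe lies closer to the green log signal and the result for the high-intensity probe lies closer to the red log signal, as described above.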
Note that the median CV value for CVwrgLnBSS is 0.1092 or 10.92%, which is even better (i.e., exhibits less array-to-array variation) than the combined signals of
Accordingly, providing multiple labels for a single sample to be analyzed on an array by interpreting one channel of signals from the array offers a unique ability to verify the integrity of each label in a manner that eliminates other production or hybridization factors that may otherwise be confused with effects caused by lack of label integrity. Further, by combining the signals associated with the multiple labels and a particular probe/target, a composite signal can be used for measurement of the target. Such a composite signal may be more reliable and reproducible than a signal that is associated with any one of the multiple different labels applied to the same sample. Further, weighting may be performed to further emphasize the advantages in the performances of the labels, based on signal intensity.
If unacceptable divergence is identified among the labels, then a user may either have to do the experimentation over (redo the experimentation with new arrays, or strip arrays and repeat the processing) or may be able to identify the bad label and use the results associated with one or more labels that have been determined to be reliable.
CPU 1302 is also coupled to an interface 1310 that may include one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. Finally, CPU 1302 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 1312. With such a network connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.
The hardware elements described above may implement the instructions of multiple software modules for performing the operations of this invention. For example, instructions for calculating sum of squares terms and/or for calculating metrics may be stored on mass storage device 1308 or 1314 and executed on CPU 1302 in conjunction with primary memory 1306.
In addition, embodiments of the present invention further relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. The media and program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM, CD-RW, DVD-ROM, or DVD-RW disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.