The present invention is related to processing of microarray data and, in particular, to a method and system for partitioning microarray data based on displacement vectors calculated for feature positions so that the partitions represent blocks or zones of features that are commonly aligned.
The present invention is related to one of a number of initial steps in processing microarray data concerned with identifying, as accurately as possible, the positions of features on the two-dimensional surface of the microarray. A general background of microarray technology is first provided, in this section, to facilitate discussion of microarray-data processing, in following subsections. It should be noted that microarrays are also referred to as “molecular arrays” and simply as “arrays.” These alternate terms may be used interchangeably in the context of microarrays and microarray technologies. Art described in this section is not admitted to be prior art to this application.
Array technologies have gained prominence in biological research and are likely to become important and widely used diagnostic tools in the healthcare industry. Currently, microarray techniques are most often used to determine the concentrations of particular nucleic-acid polymers in complex sample solutions. Molecular-array-based analytical techniques are not, however, restricted to analysis of nucleic acid solutions, but may be employed to analyze complex solutions of any type of molecule that can be optically or radiometrically scanned and that can bind with high specificity to complementary molecules synthesized within, or bound to, discrete features on the surface of an array. Because arrays are widely used for analysis of nucleic acid samples, the following background information on arrays is introduced in the context of analysis of nucleic acid solutions following a brief background of nucleic acid chemistry.
Deoxyribonucleic acid (“DNA”) and ribonucleic acid (“RNA”) are linear polymers, each synthesized from four different types of subunit molecules. The subunit molecules for DNA include: (1) deoxy-adenosine, abbreviated “A,” a purine nucleoside; (2) deoxy-thymidine, abbreviated “T,” a pyrimidine nucleoside; (3) deoxy-cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) deoxy-guanosine, abbreviated “G,” a purine nucleoside. The subunit molecules for RNA include: (1) adenosine, abbreviated “A,” a purine nucleoside; (2) uridine, abbreviated “U,” a pyrimidine nucleoside; (3) cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) guanosine, abbreviated “G,” a purine nucleoside.
The DNA polymers that contain the organization information for living organisms occur in the nuclei of cells in pairs, forming double-stranded DNA helices. One polymer of the pair is laid out in a 5′ to 3′ direction, and the other polymer of the pair is laid out in a 3′ to 5′ direction. The two DNA polymers in a double-stranded DNA helix are therefore described as being anti-parallel. The two DNA polymers, or strands, within a double-stranded DNA helix are bound to each other through attractive forces including hydrophobic interactions between stacked purine and pyrimidine bases and hydrogen bonding between purine and pyrimidine bases, the attractive forces emphasized by conformational constraints of DNA polymers. Because of a number of chemical and topographic constraints, double-stranded DNA helices are most stable when deoxy-adenylate subunits of one strand hydrogen bond to deoxy-thymidylate subunits of the other strand, and deoxy-guanylate subunits of one strand hydrogen bond to corresponding deoxy-cytidylate subunits of the other strand.
FIGS. 2A-B illustrate the hydrogen bonding between the purine and pyrimidine bases of two anti-parallel DNA strands.
Two DNA strands linked together by hydrogen bonds form the familiar helix structure of a double-stranded DNA helix.
Double-stranded DNA may be denatured, or converted into single-stranded DNA, by changing the ionic strength of the solution containing the double-stranded DNA or by raising the temperature of the solution. Single-stranded DNA polymers may be renatured, or converted back into DNA duplexes, by reversing the denaturing conditions, for example by lowering the temperature of the solution containing complementary single-stranded DNA polymers. During renaturing or hybridization, complementary bases of anti-parallel DNA strands form Watson-Crick (“WC”) base pairs in a cooperative fashion, leading to reannealing of the DNA duplex. Strictly A-T and G-C complementarity between anti-parallel polymers leads to the greatest thermodynamic stability, but partial complementarity including non-WC base pairing may also occur to produce relatively stable associations between partially-complementary polymers. In general, the longer the regions of consecutive WC base pairing between two nucleic acid polymers, the greater the stability of hybridization between the two polymers under renaturing conditions.
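The strict A-T and G-C complementarity between anti-parallel strands described above can be illustrated with a minimal sketch. The function names `reverse_complement` and `is_fully_complementary` are hypothetical and are introduced here only for illustration; the sketch checks exact complementarity and does not model partial complementarity or thermodynamic stability.

```python
# Watson-Crick pairing for DNA bases (A-T and G-C), as described in the text.
WC_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the anti-parallel complementary strand, written 5' to 3'."""
    return "".join(WC_PAIR[base] for base in reversed(strand.upper()))


def is_fully_complementary(strand_a: str, strand_b: str) -> bool:
    """True when every base of strand_a pairs with strand_b in anti-parallel order."""
    return reverse_complement(strand_a) == strand_b.upper()
```

For example, a probe with sequence ATGC, read 5′ to 3′, would be fully complementary to a target with sequence GCAT, read 5′ to 3′.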
The ability to denature and renature double-stranded DNA has led to the development of many extremely powerful and discriminating assay technologies for identifying the presence of DNA and RNA polymers having particular base sequences or containing particular base subsequences within complex mixtures of different nucleic acid polymers, other biopolymers, and inorganic and organic chemical compounds. One such methodology is the array-based hybridization assay.
Once an array has been prepared, the array may be exposed to a sample solution of target DNA or RNA molecules (410-413 in
Finally, as shown in
One, two, or more than two data subsets within a data set can be obtained from a single microarray by scanning the microarray for one, two or more than two types of signals. Two or more data subsets can also be obtained by combining data from two different arrays. When optical scanning is used to detect fluorescent or chemiluminescent emission from chromophore labels, a first set of signals, or data subset, may be generated by scanning the microarray at a first optical wavelength, a second set of signals, or data subset, may be generated by scanning the microarray at a second optical wavelength, and additional sets of signals may be generated by scanning the microarray at additional optical wavelengths. Different signals may be obtained from a microarray by radiometric scanning to detect radioactive emissions at one, two, or more than two different energy levels. Target molecules may be labeled with either a first chromophore that emits light at a first wavelength, or a second chromophore that emits light at a second wavelength. Following hybridization, the microarray can be scanned at the first wavelength to detect target molecules, labeled with the first chromophore, hybridized to features of the microarray, and can then be scanned at the second wavelength to detect target molecules, labeled with the second chromophore, hybridized to the features of the microarray. In one common microarray system, the first chromophore emits light at a red visible-light wavelength, and the second chromophore emits light at a green visible-light wavelength.
The data set obtained from scanning the microarray at the red wavelength is referred to as the “red signal,” and the data set obtained from scanning the microarray at the green wavelength is referred to as the “green signal.” While it is common to use one or two different chromophores, it is possible to use one, three, four, or more than four different chromophores and to scan a microarray at one, three, four, or more than four wavelengths to produce one, three, four, or more than four data sets.
When a microarray is scanned, data may be collected as a two-dimensional digital image of the microarray, each pixel of which represents the intensity of phosphorescent, fluorescent, chemiluminescent, or radioactive emission from an area of the microarray corresponding to the pixel. A microarray data set may comprise a two-dimensional image, a list of numerical or alphanumerical pixel intensities, or any of many other computer-readable data sets. An initial series of steps employed in processing scanned, digital microarray images includes constructing a regular coordinate system for the digital image of the microarray by which the features within the digital image of the microarray can be indexed and located. For example, when the features are laid out in a periodic, rectilinear pattern, a rectilinear coordinate system is commonly constructed so that the positions of the centers of features lie as closely as possible to intersections between horizontal and vertical gridlines of the rectilinear coordinate system. Then, regions of interest (“ROIs”) are computed, based on the initially estimated positions of the features in the coordinate grid, and centroids for the ROIs are computed in order to refine the positions of the features. Once the position of a feature is refined, feature pixels can be differentiated from background pixels within the ROI, and the signal corresponding to the feature can then be computed by integrating the intensity over the feature pixels.
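The centroid refinement and signal integration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by any particular feature-extraction program: the function name `refine_feature`, the half-open ROI convention, and the use of a simple intensity threshold to separate feature pixels from background pixels are all hypothetical choices made for the example.

```python
def refine_feature(image, roi, threshold):
    """Refine a feature position within an ROI and integrate its signal.

    image:     list of rows, each a list of pixel intensities
    roi:       (row0, row1, col0, col1), half-open bounds (assumed convention)
    threshold: pixels brighter than this are treated as feature pixels
               (an assumed, simplistic feature/background rule)
    Returns ((centroid_row, centroid_col), signal), or None for an empty ROI.
    """
    row0, row1, col0, col1 = roi
    total = 0.0
    weighted_row = 0.0
    weighted_col = 0.0
    signal = 0.0
    for r in range(row0, row1):
        for c in range(col0, col1):
            intensity = image[r][c]
            # Intensity-weighted centroid over every pixel in the ROI.
            total += intensity
            weighted_row += intensity * r
            weighted_col += intensity * c
            # Integrate intensity over pixels classified as feature pixels.
            if intensity > threshold:
                signal += intensity
    if total == 0:
        return None
    return (weighted_row / total, weighted_col / total), signal
```

The centroid returned for each ROI would serve as the observed feature position that is later compared against the position expected from the coordinate grid.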
In general, microarrays are manufactured with the intent of positioning features as exactly periodically and regularly spaced as possible. Accurately positioning features on the surface of the microarray greatly facilitates extracting data from a scanned, digital image of a microarray produced by a microarray scanner. However, despite great care and attention paid to accurately positioning features onto the surface of microarrays during microarray manufacture, indications of feature-position errors are observed in microarray data-processing steps. Thus, designers, manufacturers, and users of microarrays have recognized the need for methods for detecting and accounting for feature-position errors in microarray data.
One embodiment of the present invention provides a method and system for detecting block and zone misalignments of feature positions within a microarray-data set and for correcting feature positions for block or zone misalignment. In a described embodiment of the present invention, displacement vectors representing the vector differences between observed positions of features and expected positions for the features of a microarray are calculated, based on an initially determined coordinate system. Features within a microarray data set are then partitioned with respect to the calculated vector displacements, so that features misaligned by a common rotation or translation are partitioned into a separate partition. A correction for the common misalignment of the features of each partition can then be calculated and applied to the features of the partition.
FIGS. 2A-B illustrate the hydrogen bonding between the purine and pyrimidine bases of two anti-parallel DNA strands.
FIGS. 8A-B illustrate one class of feature-location anomalies that is observed in manufactured microarrays.
FIGS. 9A-B illustrate a first step undertaken in various embodiments of the present invention.
FIGS. 10A-B illustrate the partitioning of the initial region, illustrated in
FIGS. 11A-B illustrate that the partitions from
One embodiment of the present invention provides a method and system for detecting and correcting for block and zone misalignments within a microarray data set. In a first subsection, below, additional information about molecular arrays is provided. Those readers familiar with molecular arrays may skip over this first subsection. In a second subsection, embodiments of the present invention are provided through examples, graphical representations, and with reference to several flow-control diagrams.
An array may include any one-, two- or three-dimensional arrangement of addressable regions, or features, each bearing a particular chemical moiety or moieties, such as biopolymers, associated with that region. Any given array substrate may carry one, two, or four or more arrays disposed on a front surface of the substrate. Depending upon the use, any or all of the arrays may be the same or different from one another and each may contain multiple spots or features. A typical array may contain more than ten, more than one hundred, more than one thousand, more than ten thousand, or even more than one hundred thousand features, in an area of less than 20 cm² or even less than 10 cm². For example, square features may have widths, or round features may have diameters, in the range from 10 μm to 1.0 cm. In other embodiments each feature may have a width or diameter in the range of 1.0 μm to 1.0 mm, usually 5.0 μm to 500 μm, and more usually 10 μm to 200 μm. Features other than round or square may have area ranges equivalent to that of circular features with the foregoing diameter ranges. At least some, or all, of the features may be of different compositions (for example, when any repeats of each feature composition are excluded the remaining features may account for at least 5%, 10%, or 20% of the total number of features). Inter-feature areas are typically, but not necessarily, present. Inter-feature areas generally do not carry probe molecules. Such inter-feature areas typically are present where the arrays are formed by processes involving drop deposition of reagents, but may not be present when, for example, photolithographic array fabrication processes are used. When present, inter-feature areas can be of various sizes and configurations.
Each array may cover an area of less than 100 cm², or even less than 50 cm², 10 cm² or 1 cm². In many embodiments, the substrate carrying the one or more arrays will be shaped generally as a rectangular solid having a length of more than 4 mm and less than 1 m, usually more than 4 mm and less than 600 mm, more usually less than 400 mm; a width of more than 4 mm and less than 1 m, usually less than 500 mm and more usually less than 400 mm; and a thickness of more than 0.01 mm and less than 5.0 mm, usually more than 0.1 mm and less than 2 mm and more usually more than 0.2 mm and less than 1 mm. Other shapes are possible, as well. With arrays that are read by detecting fluorescence, the substrate may be of a material that emits low fluorescence upon illumination with the excitation light. Additionally in this situation, the substrate may be relatively transparent to reduce the absorption of the incident illuminating laser light and subsequent heating if the focused laser beam travels too slowly over a region. For example, a substrate may transmit at least 20%, or 50% (or even at least 70%, 90%, or 95%), of the illuminating light incident on the front as may be measured across the entire integrated spectrum of such illuminating light or alternatively at 532 nm or 633 nm.
Arrays can be fabricated using drop deposition from pulsejets of either polynucleotide precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained polynucleotide. Such methods are described in detail in, for example, U.S. Pat. Nos. 6,242,266, 6,232,072, 6,180,351, 6,171,797, 6,323,043, U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. Other drop deposition methods can be used for fabrication, as previously described herein. Also, instead of drop deposition methods, photolithographic array fabrication methods may be used. Inter-feature areas need not be present particularly when the arrays are made by photolithographic methods as described in those patents.
A microarray is typically exposed to a sample including labeled target molecules, or, as mentioned above, to a sample including unlabeled target molecules followed by exposure to labeled molecules that bind to unlabeled target molecules bound to the array, and the array is then read. Reading of the array may be accomplished by illuminating the array and reading the location and intensity of resulting fluorescence at multiple regions on each feature of the array. For example, a scanner similar to the AGILENT MICROARRAY SCANNER manufactured by Agilent Technologies, Palo Alto, Calif., may be used for this purpose. Other suitable apparatus and methods are described in U.S. patent application Ser. No. 10/087447 “Reading Dry Chemical Arrays Through The Substrate” by Corson et al., and in U.S. Pat. Nos. 6,518,556; 6,486,457; 6,406,849; 6,371,370; 6,355,921; 6,320,196; 6,251,685; and 6,222,664. However, arrays may be read by any other method or apparatus than the foregoing, with other reading methods including other optical techniques, such as detecting chemiluminescent or electroluminescent labels, or electrical techniques, where each feature is provided with an electrode to detect hybridization at that feature in a manner disclosed in U.S. Pat. Nos. 6,251,685, 6,221,583 and elsewhere.
A result obtained from reading an array may be used in that form or may be further processed to generate a result such as that obtained by forming conclusions based on the pattern read from the array, such as whether or not a particular target sequence may have been present in the sample, or whether or not a pattern indicates a particular condition of an organism from which the sample came. A result of the reading, whether further processed or not, may be forwarded, such as by communication, to a remote location if desired, and received there for further use, such as for further processing. When one item is indicated as being remote from another, this means that the two items are at least in different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart. Communicating information refers to transmitting the data representing that information as electrical signals over a suitable communication channel, for example, over a private or public network. Forwarding an item refers to any means of getting the item from one location to the next, whether by physically transporting that item or, in the case of data, physically transporting a medium carrying the data or communicating the data.
As pointed out above, array-based assays can involve other types of biopolymers, synthetic polymers, and other types of chemical entities. A biopolymer is a polymer of one or more types of repeating units. Biopolymers are typically found in biological systems and particularly include polysaccharides, peptides, and polynucleotides, as well as their analogs such as those compounds composed of, or containing, amino acid analogs or non-amino-acid groups, or nucleotide analogs or non-nucleotide groups. This includes polynucleotides in which the conventional backbone has been replaced with a non-naturally occurring or synthetic backbone, and nucleic acids, or synthetic or naturally occurring nucleic-acid analogs, in which one or more of the conventional bases has been replaced with a natural or synthetic group capable of participating in Watson-Crick-type hydrogen bonding interactions. Polynucleotides include single or multiple-stranded configurations, where one or more of the strands may or may not be completely aligned with another. For example, a biopolymer includes DNA, RNA, oligonucleotides, and PNA and other polynucleotides as described in U.S. Pat. No. 5,948,902 and references cited therein, regardless of the source. An oligonucleotide is a nucleotide multimer of about 10 to 100 nucleotides in length, while a polynucleotide includes a nucleotide multimer having any number of nucleotides.
As an example of a non-nucleic-acid-based microarray, protein antibodies may be attached to features of the array that would bind to soluble labeled antigens in a sample solution. Many other types of chemical assays may be facilitated by array technologies. For example, polysaccharides, glycoproteins, synthetic copolymers, including block copolymers, biopolymer-like polymers with synthetic or derivatized monomers or monomer linkages, and many other types of chemical or biochemical entities may serve as probe and target molecules for array-based analysis. A fundamental principle upon which arrays are based is that of specific recognition, by probe molecules affixed to the array, of target molecules, whether by sequence-mediated binding affinities, binding affinities based on conformational or topological properties of probe and target molecules, or binding affinities based on spatial distribution of electrical charge on the surfaces of target and probe molecules.
Scanning of a microarray by an optical scanning device or radiometric scanning device generally produces a scanned image comprising a rectilinear grid of pixels, with each pixel having a corresponding signal intensity. These signal intensities are processed by an array-data-processing program that analyzes data scanned from an array to produce experimental or diagnostic results which are stored in a computer-readable medium, transferred to an intercommunicating entity via electronic signals, printed in a human-readable format, or otherwise made available for further use. Molecular array experiments can indicate precise gene-expression responses of organisms to drugs, other chemical and biological substances, environmental factors, and other effects. Molecular array experiments can also be used to diagnose disease, for gene sequencing, and for analytical chemistry. Processing of microarray data can produce detailed chemical and biological analyses, disease diagnoses, and other information that can be stored in a computer-readable medium, transferred to an intercommunicating entity via electronic signals, printed in a human-readable format, or otherwise made available for further use.
One embodiment of the present invention provides a method and system for detecting block and zone misalignments of feature positions within a microarray-data set and for correcting feature positions for block or zone misalignment. A block or zone can be a physically contiguous set of features or can be a set of features defined by print-tip or nozzle membership, such as, for example, a set of features printed from the same print-tip or nozzle among other printing devices. Block and zone misalignments are sets of features translated, rotated, or rotated and translated with respect to initially expected positions for the features based on an initially determined coordinate grid derived from the determined positions of a subset of features. The terms “block misalignment” and “zone misalignment” are essentially interchangeable, although the term “block misalignment” may be more appropriate for relatively smaller regions of misalignment, while the term “zone misalignment” may be more appropriate for relatively larger regions of misalignment. The term “block/zone misalignment” is used, below, to refer to either a block or zone misalignment. In one embodiment of the present invention, displacement vectors representing the vector differences between observed positions of features and expected positions for the features of a microarray are calculated, based on an initially determined coordinate system. Features within a microarray data set are then partitioned with respect to the calculated vector displacements, so that features misaligned by a common rotation or translation are partitioned into a separate partition. A correction for the common misalignment of the features of the block or zone can then be calculated and applied to the features of the block or zone.
FIGS. 8A-B illustrate one class of feature-location anomalies that is observed in manufactured microarrays. As shown in
As discussed above, each feature can be separately analyzed in order to reconcile the observed location, or location calculated from observed locations of nearby known features, with the location expected from the most recently calculated coordinate axes. However, a feature-by-feature repositioning approach may fail to take into account systematic error information for misaligned feature blocks and feature zones that can potentially provide greater accuracy for feature location, particularly in the case of features with low signal-to-noise ratios.
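A block-level correction of the kind motivated above can be sketched for the purely translational case. The sketch assumes, as an illustration only, that the common misalignment of a block is estimated as the mean of the per-feature displacements and that every feature in the block, including features with low signal-to-noise ratios, is then repositioned by that common offset. The function names are hypothetical, and rotational misalignment is not handled.

```python
def block_translation(observed, expected):
    """Estimate a common translation for a block as the mean displacement.

    observed, expected: equal-length lists of (x, y) feature positions.
    """
    n = len(observed)
    dx = sum(o[0] - e[0] for o, e in zip(observed, expected)) / n
    dy = sum(o[1] - e[1] for o, e in zip(observed, expected)) / n
    return dx, dy


def apply_block_translation(expected, offset):
    """Shift every expected position in the block by the common offset."""
    return [(x + offset[0], y + offset[1]) for x, y in expected]
```

Because the offset is pooled over the whole block, a weakly detected feature inherits a position estimate supported by its better-detected neighbors rather than relying on its own noisy centroid.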
FIGS. 9A-B illustrate a first step undertaken in various embodiments of the present invention.
Of course, displacement vectors may be oriented from the expected feature position to the observed feature position, or may be oriented in an exactly opposite direction from the observed feature position to the expected feature position, depending on the order of positions in the subtraction operation used to generate a displacement vector. Displacement vectors may be appropriately scaled for visual display and displayed superimposed over feature positions in a displayed image of the feature positions, in order to assist an experimenter in visually identifying block misalignments.
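The displacement-vector calculation and the two possible orientations described above can be sketched as follows. The function names and the `observed_minus_expected` flag are hypothetical, chosen only to make the orientation choice explicit; the scaling helper reflects the suggestion that vectors may be enlarged for visual display.

```python
def displacement_vectors(observed, expected, observed_minus_expected=True):
    """Compute one displacement vector per feature.

    With observed_minus_expected=True, each vector points from the expected
    position to the observed position; with False, the orientation is reversed.
    """
    sign = 1.0 if observed_minus_expected else -1.0
    return [
        (sign * (ox - ex), sign * (oy - ey))
        for (ox, oy), (ex, ey) in zip(observed, expected)
    ]


def scale_for_display(vectors, factor):
    """Uniformly scale vectors so small displacements are visible in a plot."""
    return [(factor * dx, factor * dy) for dx, dy in vectors]
```

Whichever orientation is chosen, it would need to be used consistently, since the metrics computed from the vectors depend on their directions relative to one another.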
Several different cumulative displacement-vector-based metrics can be calculated from the displacement vectors for the partition. A first metric is the vector sum of the displacement vectors for a region within a microarray, $\mu_v$, calculated by summing all displacement vectors $d_i$ in the region, as follows:
A second metric that can be calculated is the length, or magnitude, of the vector sum of the displacement vectors, $\bar{\mu}_v$, calculated as:
$$\bar{\mu}_v = \sqrt{\mu_v \cdot \mu_v}$$
Finally, a third metric that can be calculated is the average length of the displacement vectors within the region, $\bar{\mu}_s$, calculated as follows:
In
Once the vector sum of the displacement vectors, $\mu_v$, the length of the vector sum, $\bar{\mu}_v$, and the sum of the length of the displacement vectors, $\bar{\mu}_s$, are computed for the overall region, shown in
When the ratio
falls significantly below 1.0, there is a strong indication of a rotational misalignment within the partition, since the displacement vectors about a rotation point, such as rotation point 912, are symmetrical and tend to cancel each other out in the vector sum. The fact that the average displacement-vector length $\bar{\mu}_s$ is greater in partition 1004 than in the original region indicates that the partitioning has potentially isolated a block misalignment within the partition, since the features within the partition have, in general, a greater average displacement-vector length than the features of the initial region. Similarly, for partition 1006, the increase in the average displacement-vector length from 0.4 R to 0.75 R indicates the presence, within the partition, of a block misalignment, but the increase in the ratio
from 0.5 to 1.0 indicates that a rotational misalignment present in the initial region is no longer present, or is present to a much smaller extent, in partition 1006. The fact that the average displacement-vector lengths for partitions 1005 and 1007 have markedly decreased indicates that these partitions do not include block misalignments.
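The three metrics and the ratio test discussed above can be sketched as follows. One point is an explicit assumption of this sketch rather than something stated in the text: the ratio is computed as the length of the vector sum divided by the sum of the individual displacement-vector lengths, a normalization chosen so that the ratio is 1.0 for a pure translation and approaches 0 when the vectors cancel, as for a rotation. The function name and the dictionary keys are hypothetical.

```python
import math


def displacement_metrics(vectors):
    """Compute cumulative metrics for a list of (dx, dy) displacement vectors."""
    n = len(vectors)
    if n == 0:
        raise ValueError("at least one displacement vector is required")
    sum_x = sum(dx for dx, _ in vectors)
    sum_y = sum(dy for _, dy in vectors)
    # Length (magnitude) of the vector sum.
    vector_sum_length = math.hypot(sum_x, sum_y)
    # Sum and average of the individual vector lengths.
    total_length = sum(math.hypot(dx, dy) for dx, dy in vectors)
    average_length = total_length / n
    # Assumed normalization: bounded by 1.0, equal to 1.0 for pure translation.
    ratio = vector_sum_length / total_length if total_length > 0 else 0.0
    return {
        "vector_sum": (sum_x, sum_y),
        "vector_sum_length": vector_sum_length,
        "average_length": average_length,
        "ratio": ratio,
    }
```

Under this normalization, four features displaced identically give a ratio of 1.0, while four unit displacements arranged symmetrically about a rotation point give a ratio of 0.0 with the same average length.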
FIGS. 11A-B illustrate that the partitions from
With regard to partition 1004, the increase in the ratio
for each of the subpartitions within partition 1004 indicates that the subpartitions have partitioned a block rotational misalignment, and thus the partition 1004 is probably best not further partitioned in order that partition 1004 fully includes the block rotational misalignment. With regard to the partition 1006, the fact that the average vector-displacement length $\bar{\mu}_s$ markedly increases for the lower-left subpartition 1108 indicates that subpartition 1108 has isolated a block misalignment more effectively than partition 1006, in which it is included. However, the fact that the average vector-displacement length $\bar{\mu}_s$ has decreased for the upper two subpartitions 1110 and 1112 indicates that the ratio of features exhibiting a block misalignment within those subpartitions to all features within the subpartitions is less than the ratio for the larger including partition 1006. Thus, a repartitioning of partition 1006, or repartitioning of all but subpartition 1108, may be needed in order to isolate the block translational misalignment.
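The comparison of subpartitions against their parent partition described above can be sketched as follows. Several choices here are assumptions made for illustration: subpartitions are formed by splitting the parent region into four equal quadrants, “lower” is taken to mean smaller y values, and the ratio is normalized as the length of the vector sum over the sum of vector lengths. All function names are hypothetical.

```python
import math


def metrics(vectors):
    """Return (ratio, average_length) for a non-empty list of (dx, dy)."""
    sum_x = sum(dx for dx, _ in vectors)
    sum_y = sum(dy for _, dy in vectors)
    total = sum(math.hypot(dx, dy) for dx, dy in vectors)
    ratio = math.hypot(sum_x, sum_y) / total if total > 0 else 0.0
    return ratio, total / len(vectors)


def split_quadrants(features, bounds):
    """Group displacement vectors by quadrant (assumed partitioning method).

    features: list of ((x, y), (dx, dy)); bounds: (x0, y0, x1, y1).
    """
    x0, y0, x1, y1 = bounds
    x_mid, y_mid = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = {"lower_left": [], "lower_right": [],
                 "upper_left": [], "upper_right": []}
    for (x, y), vector in features:
        vertical = "lower" if y < y_mid else "upper"
        horizontal = "left" if x < x_mid else "right"
        quadrants[vertical + "_" + horizontal].append(vector)
    return quadrants


def compare_subpartitions(features, bounds):
    """Report, per non-empty quadrant, how its metrics moved versus the parent."""
    parent_ratio, parent_avg = metrics([v for _, v in features])
    report = {}
    for name, vectors in split_quadrants(features, bounds).items():
        if not vectors:
            continue
        ratio, avg = metrics(vectors)
        report[name] = {
            # A rising ratio suggests a rotational misalignment was split.
            "ratio_increased": ratio > parent_ratio,
            # A rising average length suggests a block misalignment was isolated.
            "avg_length_increased": avg > parent_avg,
        }
    return report
```

In this sketch, a quadrant whose average length rises relative to the parent would be a candidate for having isolated a block misalignment, while quadrants whose average length falls would be candidates for repartitioning.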
With the principles of microarray-region partitioning and vector-displacement-based metrics described with reference to
for the currently considered partition is compared to that for the parent partition. If the ratio
for the currently considered partition is greater than that for the parent partition, then partitioning of a larger rotational misalignment is indicated, and control flows to step 1411 where the routine “partition” determines whether there are any additional partitioning methods to try. If so, then control flows back to step 1406. If not, then the parent partition is added to a list of partitions, and the routine “partition” returns, in step 1416. If the ratio
is less than or equal to the ratio for the parent partition, then, in step 1412, the routine “partition” determines whether the average displacement-vector length, $\bar{\mu}_s$, for the currently considered subpartition is less than for the parent partition or whether the average displacement-vector length, $\bar{\mu}_s$, for the currently considered subpartition is less than a threshold value. If so, then control flows to step 1413, where the local variable “count” is incremented, indicating that the currently considered subpartition is a candidate for further inclusion in a final list of partitions. Next, in step 1414, the routine “partition” determines whether there are more subpartitions to evaluate. If so, control flows back to step 1409. Otherwise, control flows to step 1415, in which the routine “partition” determines whether the local variable “count” is greater than zero. If so, then the routine “partition” calls the routine “add subpartitions” in step 1418.
for the combined pair of partitions is less than or equal to the ratio for each partition, then, in step 1607, the routine “coalesce” increments the local variable “num” and, in step 1608, merges the currently considered pair of partitions together into a single partition. If there are more pairs of partitions to consider, as determined in step 1609, then control flows back to step 1605. Otherwise, if the local variable “num” is greater than zero, as determined in step 1610, then the routine “coalesce” repeats the for-loop to attempt to additionally merge partitions. If no mergers occur in the current iteration, as detected in step 1610, the routine “coalesce” returns in step 1612. It should be emphasized that the coalescing step represented by the routine “coalesce” may, in many cases, prove to increase computational overhead more than the benefit obtained, and would be, in those cases, either omitted or used only in particular situations.
A method for calculating displacement vectors for the features in a partition is next provided. The location coordinates $(x_i, y_i)$ for a feature can be calculated from the row and column index for the feature $(r_i, c_i)$ as follows:
$$x_i = r_i m_{xx} + c_i m_{xy} + O_x$$
$$y_i = r_i m_{yx} + c_i m_{yy} + O_y$$
where
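The mapping from row and column indices to location coordinates given above can be sketched as a small function. The sketch assumes that the y-coordinate uses its own coefficients and offset (written here as `m_yx`, `m_yy`, and `o_y`), mirroring the form of the x-coordinate equation; the function name and parameter names are hypothetical.

```python
def feature_position(row, col, m_xx, m_xy, m_yx, m_yy, o_x, o_y):
    """Map a feature's (row, col) index to expected (x, y) coordinates.

    The coefficients form a linear transformation of the index pair and
    (o_x, o_y) is the origin offset, following the form of the equations
    in the text. The y-equation coefficients are an assumed reading.
    """
    x = row * m_xx + col * m_xy + o_x
    y = row * m_yx + col * m_yy + o_y
    return x, y
```

Subtracting the position returned by such a mapping from the observed position of the same feature would yield that feature's displacement vector.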
Although the present invention has been described in terms of a particular embodiment, it is not intended that the invention be limited to this embodiment. Modifications within the spirit of the invention will be apparent to those skilled in the art. For example, many different partitioning methods may be employed in step 1404 of
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the invention. The foregoing descriptions of specific embodiments of the present invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in view of the above teachings. The embodiments are shown and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents: