Pharmaceutical, biotechnology, or genomics companies use polynucleotide arrays (such as DNA or RNA arrays), for example, as diagnostic or screening tools. Such arrays or microarrays include regions (sometimes referenced as spots or features) of usually different sequence polynucleotides arranged in a predetermined configuration on a substrate such as a microchip. The arrays, when exposed to a sample, will exhibit a binding pattern. This binding pattern can be observed, for example, by labeling all polynucleotide targets (for example, DNA) in the sample with a suitable label (such as a fluorescent compound), and accurately observing the fluorescent signal on the array. Assuming that the different sequence polynucleotides were correctly deposited in accordance with the predetermined configuration, then the observed binding pattern will be indicative of the presence and/or concentration of one or more polynucleotide components of the sample.
Biopolymer arrays can be fabricated using either in situ synthesis methods or deposition of the previously obtained biopolymers. The deposition methods basically involve depositing biopolymers at predetermined locations on a substrate which are suitably activated such that the biopolymers can link thereto. Biopolymers of different sequence may be deposited at different regions of the substrate to yield the completed array. Washing or other additional steps may also be used. Procedures known in the art for deposition of polynucleotides, particularly DNA such as whole oligomers or cDNA, include touching drop dispensers to a substrate or use of an ink jet type head to fire drops onto the substrate.
A scanner is then used to read the fluorescence of these resultant surface bound molecules under illumination with suitable (most often laser) light. The scanner acts like a large field fluorescence microscope in which the fluorescent pattern caused by binding of labeled molecules is scanned on the chip. In particular, a laser induced fluorescence scanner provides for analyzing large numbers of different target molecules of interest, e.g., genes/mutations/alleles, in a biological sample.
The scanning equipment typically used for the evaluation of microarrays includes a scanning fluorometer. A number of different types of such devices are commercially available from different sources, such as Axon Instruments in Union City, Calif.; Perkin Elmer of Wellesly, Mass.; and Agilent Technologies, Inc. of Palo Alto, Calif. Analysis of the data, (i.e., collection, reconstruction of image, comparison and interpretation of data) is performed with associated computer systems and commercially available software, such as GenePix by Axon Instruments, QuantArray by Perkin Elmer or Feature Extraction by Agilent of Palo Alto, Calif.
In such scanning devices, a laser light source generates a “most often collimated” beam. The collimated beam sequentially illuminates small surface regions of known locations on an array substrate. The resulting fluorescence signals from the surface regions are collected either confocally (employing the same lens used to focus the laser light onto the array) and/or off-axis (using a separate lens positioned to one side of the lens used to focus the laser onto the array). The collected signals are then transmitted through appropriate spectral filters to an optical detector. A recording device, such as a computer memory, records the detected signals and builds up a raster scan file of intensities as a function of position, or time as it relates to the position. Such intensities, as a function of position, are typically referred to in the art as “pixels” or “pixel values.” Collectively, the pixels make up a microarray scan image having a multiplicity of feature cells or “features”, wherein each feature cell is comprised of a group of pixels.
In array fabrication, the quantities of DNA available for the array are usually very small and expensive. Sample quantities available for testing are usually also very small and it is therefore desirable to simultaneously test the same sample against a large number of different probes on an array. These conditions require use of arrays with large numbers of very small, closely spaced spots or “features”.
The use of microarray technologies to conduct experiments that measure thousands of genes and proteins simultaneously and under different conditions are becoming the norm in both academia and pharmaceutical/biotech companies. Microarray technology is leading to greater feature density as well as to extremely high-resolution scanning. In their largest capacities, such as in a full human genome catalog array, there may be as many as three or four 25,000 to 50,000-feature cells. This results in increasingly large amounts of both image and feature analysis data which can be problematic for several reasons. First the higher the density of features on an array, the increasingly more difficult it becomes to accurately extract these features. Higher accuracy and precision of the scanning apparatus becomes necessary. Even more importantly, higher accuracy and precision of the manufacturing techniques, preparation techniques, and associated apparatus are required, so that at the user end, the user can locate the information to be read and distinguish it from noise.
Currently, arrays from different sources and/or manufacturers vary greatly in quality. Layouts of the pixels may vary, as array formats are not standardized. In the early period of microarray analysis, microarrays were arranged in a single, rectangular matrix of dots or features. Currently, however, there are a multitude of layouts, including microarrays having a plurality of subarrays formed on a single chip, where each subarray is made up of a rectangular matrix of dots or features, and each subarray is spaced or separated from the other subarrays. These spacings may be due to the arrangement of pens used to deposit the features, wherein the pens are spaced far apart and a collection of all the dots (features) made by each single pen is what constitutes each subgrid. In this way, with each pass of a pen matrix, a single feature is added to each subgrid.
Further variations in microarray patterns occur due to poor quality or errors in the application of the features to the chip. Ideally, when the features are dots, each should be well-formed (e.g., a substantially perfect circle) and uniformly spaced. With the wide variation of manufacturers now available, however, the features are not always so homogeneous. For example, “doughnuts” (i.e., a dot only filled circumferentially along the perimeter, with a blank or hole in the center) may be formed in some instances, rather than a fully filled circle. Other partially formed or mis-formed features or manufacturing may also occur, such as crescent-shaped features; irregular boundaries (perimeters) of the features; misaligned rows or columns of features; misalignment between consecutive features, along a row and/or a column; variations in the size or circumferences of the dots; and others.
Still other error factors may exist on microarrays which make accurate interpretation of the data more difficult. These include dust (such as that existing during or after preparation of the microarray); gasket marks, which may leave high intensity noise on the array, drying stains, scratches, spots or spurious pixels or the like.
Currently, users who examine microarrays of a variety of array designs from different origins typically spend between several minutes up to an hour to adapt their feature extraction programs to each array scan and especially each time the array layout changes, or when the quality of an array is substandard.
For all of the above reasons, a universal system for feature extraction and method of feature extraction is needed to shorten the time that is required by the user to begin viewing the actual data contained on a microarray.
Methods, systems and computer readable media for aligning an array for determining a layout of features of the array, which are provided in a two-dimensional array. A first block of features of the array is selected, the first block having a predefined width in the direction of one of the X and Y axes, and a length sufficient to extend over all of the features in the direction of the other of the X and Y axes. A second block of features is selected, the second block having a predefined width in the same direction as the width of the first block, and a length sufficient to extend over all of the features in the direction of the other of the X and Y axes, wherein the second block contains features not contained by the first block. The features contained in the first block are projected in the direction of the width of the first block and the features contained in the second block are projected in the direction of the width of the second block. The peaks resultant from the projection of the features contained in the first block are correlated with peaks resultant from the projection of the features contained in the second block. Offset values for pairs of peaks identified by the correlating are calculated, and an offset value is selected from the offset values to represent the offset of features in the direction of the other of the X and Y axes.
These and other advantages and features of the invention will become apparent to those persons skilled in the art upon reading the details of the methods, systems and computer readable media as more fully described below.
Before the present systems and methods are described, it is to be understood that this invention is not limited to particular methods, method steps, algorithms, software or hardware described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.
It must be noted that as used herein and in the appended claims, the singular forms “a”, “and”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a pixel” includes a plurality of such pixels cells and reference to “the row” includes reference to one or more rows and equivalents thereof known to those skilled in the art, and so forth.
The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
Definitions
In the present application, unless a contrary intention appears, the following terms refer to the indicated characteristics.
“Projecting” or “projection” of a two dimensional image onto a one dimensional line refers to adding all the values or selected values within the same row or column index of a matrix to yield a one-dimensional dataset with the same length as one of the dimensions of the matrix or the length of the matrix taking into account the selected values, i.e. number of rows or number of columns. “Simple projection” may be performed without any image rotation or additional image processing. Simple projection works well when the features are well aligned and well-formed, as a line of such features along the projection direction will form a sharp peak on the projection. Because simple projection is linear, its result is dominated by large intensity signals on the image being projected, regardless of whether the large intensity signals are caused by features or spurious pixels (e.g., scratches, drying traces, gasket traces, etc.) Further, the projection of a doughnut shaped feature will appear as a double peak, separated by the inner diameter (i.e., “doughnut hole”) of the doughnut shaped feature. If the features are poorly separated, or if the rows and columns of the grid are not exactly aligned along the rows and columns of the image, the projection will appear blurred, as the expected peaks will bleed into those formed by neighboring features.
“Non-linear projecting” or “non-linear projection” is the same as “projecting” or “projection”, except that preprocessing is done before the projection. Such preprocessing may include computing local minima or maxima, taking the logarithm of the local minima or maxima, computing projections along rotated axes as needed, and/or dithering sums over the one-dimensional data set.
“Projecting based on orthogonal projection” includes any of the above projection techniques wherein the row or columns that are summed are only those that are near a local maximum in the other dimension (row or column). Typically, only the middle half (or some other predefined central portion) of those maxima are considered.
“Gauss filtering”, “Gaussian Filtering” or “Gaussian Integration” involves a classical application of a correlation with a Gauss kernel.
“Large trace removing” involves addressing and filtering artifacts caused by gasket traces, drying traces, or other large artifacts on the slide/chip which, if not treated, tend to show up as very bright, large areas when viewed.
“Zero rank filtering” involves removing the local baseline under the one-dimensional data after it has been projected.
“Peak picking” a one-dimensional dataset involves finding all the local maxima, and further processing the local maxima to determine which are features to be kept for data interpretation.
“Spacing estimation” involves the computation of the most frequent distance between adjacent peak centers.
“Peak height selection” involves statistical processing to weed out peaks that are created by image artifacts.
“Block finding” involves forming groups of peaks from the total population, based on sets which are generally equally spaced according to a given or measured spacing. Blocks are intended to define subarrays, subgrids or zones in a microarray.
“Block size computing” involves computing to choose the block size that involves the maximum number of peaks, after peak picking has been performed.
“Block fixing” attempts to force a given block size on blocks that are either smaller or larger than the computed block size.
A “biopolymer” is a polymer of one or more types of repeating units. Biopolymers are typically found in biological systems and particularly include polysaccharides (such as carbohydrates), and peptides (which term is used to include polypeptides and proteins) and polynucleotides as well as their analogs such as those compounds composed of or containing amino acid analogs or non-amino acid groups, or nucleotide analogs or non-nucleotide groups. This includes polynucleotides in which the conventional backbone has been replaced with a non-naturally occurring or synthetic backbone, and nucleic acids (or synthetic or naturally occurring analogs) in which one or more of the conventional bases has been replaced with a group (natural or synthetic) capable of participating in Watson-Crick type hydrogen bonding interactions. Polynucleotides include single or multiple stranded configurations, where one or more of the strands may or may not be completely aligned with another.
A “nucleotide” refers to a sub-unit of a nucleic acid and has a phosphate group, a 5 carbon sugar and a nitrogen containing base, as well as functional analogs (whether synthetic or naturally occurring) of such sub-units which in the polymer form (as a polynucleotide) can hybridize with naturally occurring polynucleotides in a sequence specific manner analogous to that of two naturally occurring polynucleotides. For example, a “biopolymer” includes DNA (including cDNA), RNA, oligonucleotides, and PNA and other polynucleotides as described in U.S. Pat. No. 5,948,902 and references cited therein (all of which are incorporated herein by reference), regardless of the source. An “oligonucleotide” generally refers to a nucleotide multimer of about 10 to 100 nucleotides in length, while a “polynucleotide” includes a nucleotide multimer having any number of nucleotides. A “biomonomer” references a single unit, which can be linked with the same or other biomonomers to form a biopolymer (for example, a single amino acid or nucleotide with two linking groups one or both of which may have removable protecting groups).
When one item is indicated as being “remote” from another, this is referenced that the two items are not at the same physical location, e.g., the items are at least in different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart.
“Communicating” information references transmitting the data representing that information as electrical signals over a suitable communication channel (for example, a private or public network).
“Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data.
A “processor” references any hardware and/or software combination which will perform the functions required of it. For example, any processor herein may be a programmable digital microprocessor such as available in the form of a mainframe, server, or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic or optical disk may carry the programming, and can be read by a suitable disk reader communicating with each processor at its corresponding station.
Reference to a singular item, includes the possibility that there are plural of the same items present.
“May” means optionally.
Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as the recited order of events.
A “chemical array”, “array”, “microarray” or “bioarray” unless a contrary intention appears, includes any one-, two- or three-dimensional arrangement of addressable regions bearing a particular chemical moiety or moieties (for example, biopolymers such as polynucleotide sequences) associated with that region. An array is “addressable” in that it has multiple regions of different moieties (for example, different polynucleotide sequences) such that a region (a “feature” or “spot” of the array) at a particular predetermined location (an “address”) on the array will detect a particular target or class of targets (although a feature may incidentally detect non-targets of that feature). Array features are typically, but need not be, separated by intervening spaces. In the case of an array, the “target” will be referenced as a moiety in a mobile phase (typically fluid), to be detected by probes (“target probes”) which are bound to the substrate at the various regions. However, either of the “target” or “target probes” may be the one which is to be evaluated by the other (thus, either one could be an unknown mixture of polynucleotides to be evaluated by binding with the other).
An “array layout” refers to one or more characteristics of the features, such as feature positioning on the substrate, one or more feature dimensions, and an indication of a moiety at a given location.
“Hybridizing” and “binding”, with respect to polynucleotides, are used interchangeably.
A “pulse jet” is a device which can dispense drops in the formation of an array. Pulse jets operate by delivering a pulse of pressure to liquid adjacent an outlet or orifice such that a drop will be dispensed therefrom (for example, by a piezoelectric or thermoelectric element positioned in a same chamber as the orifice).
A “subarray” or “subgrid” is a subset of an array. Typically, a number of subgrids are laid out on a single slide and are separated by a greater spacing than the spacing that separates features or spots or dots.
Any given substrate (e.g., slide) may carry one, two, four or more arrays disposed on a front surface of the substrate. Depending upon the use, any or all of the arrays may be the same or different from one another and each may contain multiple spots or features. A typical array may contain more than ten, more than one hundred, more than one thousand more ten thousand features, or even more than one hundred thousand features, in an area of less than 20 cm2 or even less than 10 cm2. For example, features may have widths (that is, diameter, for a round spot) in the range from a 10 μm to 1.0 cm. In other embodiments each feature may have a width in the range of 1.0 μm to 1.0 mm, usually 5.0 μm to 500 μm, and more usually 10 μm to 200 μm. Non-round features may have area ranges equivalent to that of circular features with the foregoing width (diameter) ranges. At least some, or all, of the features are of different compositions (for example, when any repeats of each feature composition are excluded the remaining features may account for at least 5%, 10%, or 20% of the total number of features).
Interfeature areas will typically (but not essentially) be present which do not carry any polynucleotide (or other biopolymer or chemical moiety of a type of which the features are composed). Such interfeature areas typically will be present where the arrays are formed by processes involving drop deposition of reagents but may not be present when, for example, photolithographic array fabrication processes are used. It will be appreciated though, that the interfeature areas, when present, could be of various sizes and configurations.
Each array may cover an area of less than 100 cm2, or even less than 50 cm2, 10 cm2 or 1 cm2. In many embodiments, the substrate carrying the one or more arrays will be shaped generally as a rectangular solid (although other shapes are possible; for example, some manufacturers are currently working on flexible substrates), having a length of more than 4 mm and less than 1 m, usually more than 4 mm and less than 600 mm, more usually less than 400 mm; a width of more than 4 mm and less than 1 m, usually less than 500 mm and more usually less than 400 mm; and a thickness of more than 0.01 mm and less than 5.0 mm, usually more than 0.1 mm and less than 2 mm and more usually more than 0.2 and less than 1 mm. With arrays that are read by detecting fluorescence, the substrate may be of a material that emits low fluorescence upon illumination with the excitation light. Additionally in this situation, the substrate may be relatively transparent to reduce the absorption of the incident illuminating laser light and subsequent heating if the focused laser beam travels too slowly over a region. For example, a substrate may transmit at least 20%, or 50% (or even at least 70%, 90%, or 95%), of the illuminating light incident on the front as may be measured across the entire integrated spectrum of such illuminating light or alternatively at 532 nm or 633 nm.
Arrays can be fabricated using drop deposition from pulse jets of either polynucleotide precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained polynucleotide. Such methods are described in detail in, for example, the previously cited references including U.S. Pat. Nos. 6,242,266; 6,232,072; 6,180,351; 6,171,797; and 6,323,043, and in U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. As already mentioned, these references are incorporated herein, in their entireties, by reference thereto. Other drop deposition methods can be used for fabrication, as previously described herein. Also, instead of drop deposition methods, photolithographic array fabrication methods may be used. Interfeature areas need not be present particularly when the arrays are made by photolithographic methods.
Following receipt by a user of an array made by an array manufacturer, it will typically be exposed to a sample (for example, a fluorescently labeled polynucleotide or protein containing sample) and the array then read. Reading of the array may be accomplished by illuminating the array and reading the location and intensity of resulting fluorescence at multiple regions on each feature of the array. For example, a scanner may be used for this purpose which is similar to the AGILENT MICROARRAY SCANNER manufactured by Agilent Technologies, Palo Alto, Calif. Other suitable apparatus and methods are described in U.S. Pat. Nos. 6,406,849; 6,371,370; and 6,756,202; and in U.S. Patent Publication No. 2003/0160183 titled “Reading Dry Chemical Arrays Through The Substrate” by Dorsel et al. However, arrays may be read by any other method or apparatus than the foregoing, with other reading methods including other optical techniques (for example, detecting chemiluminescent or electroluminescent labels) or electrical techniques (where each feature is provided with an electrode to detect hybridization at that feature in a manner disclosed in U.S. Pat. Nos. 6,251,685 and 6,221,583 and elsewhere). A result obtained from the reading followed by a method of the present invention may be used in that form or may be further processed to generate a result such as that obtained by forming conclusions based on the pattern read from the array (such as whether or not a particular target sequence may have been present in the sample, or whether or not a pattern indicates a particular condition of an organism from which the sample came). A result of the reading (whether further processed or not) may be forwarded (such as by communication) to a remote location if desired, and received there for further use (such as further processing).
When analyzing the results of experiments performed on a microarray, the interpretation of the data is facilitated when the slide on which the microarray is produced is of known origin and accompanied by complete detailed information that specifies the positions, spacing and all details of the pixels deposited to form the microarray. For example, detailed layout information may be provided to direct the analysis to the exact positions of the dots or pixels making up the array, including spacings of microarrays, when present. An example of such layout information is that contained in a GAL (GenePix Array Layout) file (GenePix, of Union City, Calif.), or an Agilent design file (Agilent, Palo Alto, Calif.)
However, on many occasions, incomplete information (or no information at all) is available to assist in viewing a microarray. For example, the GAL file, even if provided, may be incomplete and not include “x” and “y” coordinates of the pixels on the slide. Alternatively, only a text file may be provided which includes a list of all the names of the genes being experimented on, possibly with an assigned order, such as the column and row that each gene appears in, for example, possibly also with a subgrid number, but with no “x” and “y” coordinates given, or spacing between pixels, or spacing between subgrids. In the worst case, there may be no information accompanying a microarray that needs to be viewed.
Since, as noted above, arrays from different sources and/or manufacturers vary greatly in quality and layouts of the pixels may vary, as array formats are not standardized, the present invention provides means for identifying the grid upon which a microarray has been formed, and identifying the features to be examined, as well as identifying spurious information which is to be disregarded.
A goal, when examining a slide containing microarray data, is to locate the layout of the information contained on the slide, i.e., the “dots”, “spots” or “features” as they are arranged and spaced on the slide, including whether they form a single array or grid, or a series of subarrays or subgrids. By identifying where the features reside, this makes it further possible to ignore or filter out other information which is not located where the features are.
An initial approach to locating the features on a grid aims to locate the centers of the features. To begin with, the slide containing the array is read to sum the rows and sum the columns of the array to create the projections of the two dimensional image formed by the slide along one dimensional lines. By condensing the data from a two dimensional image to two vectors of one dimensional data, this greatly reduces processing time, since the processing time required to process one dimensional data is the square root of the time for processing the two-dimensional image data.
If the dots or features are thought of as bumps or hills, the projection process according to the present invention endeavors to look in the plane of these features to determine the skyline or topography of the features. If there are a few missing features here or there, it doesn't matter to the projection, because there are enough present, so that statistically, all of the projections will have about the same height. Also, even if most of the spots are faint, the sum of all those values is going to be significantly higher than the background signal. This provides an additional advantage over two-dimensional processing because of the increased signal to noise ratio produced by summing the features.
By locating the centers of all the features (peaks which are determined to represent features) in one dimension, and the centers of all those peaks in the second dimension, the present invention identifies the grid of data represented as features on the microarray. By finding centers, this gives the “x” and “y” coordinates for each feature which are then used to identify the grid.
The convention used for the grid on slide 100 in
Projection for X:
where
Projection for Y:
where
The result of performing the projections reduces the matrix of intensity values provided by the grid on slide 100 to two vectors of values.
A smoothing function may also be applied to the projection to get rid of the higher frequency minor points (“jitters”) 136 which may be superimposed upon the major peaks. The smoothing of the peaks makes it easier to discern the actual peaks that are representative of the features. For example, a correlation with a Gaussian kernel that is a few points wide (typically three to 5 points wide, using 10 micron pixels or points) may yield appropriate smoothing.
After smoothing, the local maxima of the plots are determined. Then, an interval is taken around each local maximum, and a Gaussian curve is fitted in the interval. The center of the Gaussian 136, which may be different from the peak maximum 137, is found using a centering algorithm. Typically the centroid algorithm (i.e., where the center of gravity of the peak is computed) gives satisfactory results. Additionally, the area under the curve defined by the peak within the interval is calculated, as well as the peak width (half-width maximum) 138 (see
Next the peak shapes are statistically processed in an effort to recognize and filter out the peaks which do not fit the shape of the general population (i.e., filter out the outliers) and which are therefore most likely to be representative of noise caused by artifacts, rather than illumination caused by features. Statistics may be done on the area, as well as width of the peaks, in an effort to filter out the peaks that do not fit the general population (i.e., to identify the outliers, which are most probably noise masquerading as peaks). The median value of the areas under the peaks and the median peak width are calculated, and peaks that have a significant variation from these median values are discarded from the set of peaks to be considered for viewing as features. Peaks that are determined to show a significant variation are those peaks that have an area that is more than a predetermined amount less that the median area (for example, twenty times smaller than the median area) as well as peaks whose width is more than a predetermined amount greater (e.g., at least about 50% greater) than the calculated median width.
It is not practical to attempt to identify peaks having a height that is significantly higher than the general population, because the data may be such that most of the features are very faintly illuminated, with one or a few being very intense. Of course, these very intense features are features which should be considered. Also, there is no need to remove peaks that are too narrow, because such peaks are also usually too small in area, so as to be effectively filtered by the median area filter.
Next the system endeavors to find the spacing between peaks. The spaces between each pair of adjacent peaks are calculated and tabulated, after which, the median difference between adjacent peaks is calculated. The median value is then set to be the feature spacing, i.e., distance between adjacent features in the dimension being considered. Although the median is the preferred measure for determining peak spacing, it is noted that other statistical measurements could be substituted for peak spacing. For example, some other form of “average” calculation could be employed to determine peak spacing, although some approaches may not be as accurate, since the subgrid spacing distances will generally be calculated along with the interfeature distances. Using the median measure in
The peak spacing may be further used to determine group spacing, e.g., distance between subgrids of features. In the example shown in
By the foregoing techniques, the system determines that the peaks shown in
All of the preceding procedures may then be repeated in the second dimension to determine peaks and subgrid spacing in the other dimension. For example, if projection, etc. is performed in the X-direction first, then the procedures are repeated in the Y-direction, or vice versa.
The projections are easiest to calculate when all of the features are well-formed and consistent, and the grid/subgrids are all aligned with each other as well as with the slide. However, in reality, many discrepancies from this ideal layout occur. For example, the entire grid may be rotated with respect to the X and/or Y axes. Additionally or alternatively, one or more subgrids may be rotated with respect to the other subgrids in the grid. Also, rows and/or columns of one or more subgrids may be misaligned with rows and/or columns of adjacent subgrids.
One way of dealing with this problem is to reduce the size of the features 110F. To do so, the program, according to the present system, filters the reading of the features during the projection process, by recording the minimum value within a window at each pixel position as it passes over a feature 110F. The window function has a width that is smaller than the width that the features 110F are usually produced to have. For example, features 110F may be formed to have a width of about 15 to 30 pixels, and the window used may have a width of about 7 pixels.
This filtering process results in a much narrower, more well-defined peak representative for each feature 110F read.
The present invention is not concerned with determining an accurate reading of the intensity value of a feature 110F, or even of the particular shape of each feature. Rather, the process is performed for targeting the locations of the features 110F. Therefore, the logarithm of the intensity values are used during processing to curb the overall effect of the large intensity values (privilege or weight the general population of intensity values versus those which are abnormally high). This may be useful to downgrade the importance of anomalous sources of illumination which may present with higher intensity than the features.
A further refinement of projection processing may be performed to increase the signal to noise ratio of the resulting projections. This refinement involves computing projections of all of the pixels only for the first projection performed, whether it be in the Y direction or the X direction. After obtaining the first projection plot in the manner described above, the system identifies only the location where peaks representing features 110F are suspected. The locations of the peak maximums are the result of the peak-picking algorithm that has been performed in the current dimension of processing.
Once the suspected peak locations are identified in the first dimension, only those pixel lines (columns or rows) corresponding to the identified peaks (and a predetermined distance on either side of each peak (e.g., for typical feature sizes and using 10 microns pixels, about 3-8 pixels on each side, preferably about 4-6 pixels on each side) are processed for a projection in the other (X or Y direction). This not only reduces processing time, but eliminates a lot of the background noise in between the rows or columns of the features 110F which need not be processed. For example, if the first projection is in the X direction, then the identification of peaks from this first projection narrows down the rows of pixels which are to be considered. Then, when subsequently performing the projection in the Y direction, only those column values which lie in the identified row positions are considered during the projection. The same process can be applied if the first projection is done in the Y direction, wherein, when doing the subsequent X projection, only those row values which lie in the identified column positions would be considered.
Returning to the first example, after projection in the X direction, selection of peaks, and then projecting in the Y direction using only selected row positions, the projection in the Y direction can be processed as described above, to find peak centers of the features 110F (e.g., using window 320), to determine the spacing between features 110F and to determine the layout of the subgrids. Then another projection is done in the X direction using only the identified peak locations from the Y-projection to limit the column positions of the row pixels that are projected.
The system may further apply the process steps described above in order to determine the rotation (if any) of one or more subgrids in the array.
The location patterns of features 110F of each of these portions is then compared to determine the offset, in both the X and Y directions. Using the offset values, the degree of rotation can be readily calculated. For example, in
The above-described method for rotation correction works well when only minor rotation(s) are present. When one or more arrays or subarrays is tilted or rotated by larger amounts however, the above technique may have difficulties, as it may be difficult to discern the origins needed to calculate the angle of rotation.
In these situations, a better approach to identifying and correcting for rotation involves a comparison of the overall projections 424a, 424b with one another, without regard to trying to locate the origins y1 and y2, in contrast to the method described previously. Because of the regular geometric layout of the features of an array, the waveforms produced by the projections at 424a and 424b will be very similar, if not the same, with a phase difference that reflects the offset of the features in the second block sampled 410B relative to the first block sampled 410A.
For this procedure, however, the blocks 410A and 410B are selected close to one another, such that the offset between the origin features in the blocks being analyzed is less than half the offset between adjacent features in the same block in the same direction in which the offset is being determined. This is to ensure that the peak matching between the peaks in the projections 424a and 424b are matched against features in the same respective row (or column, depending upon which direction the projections are taken in) of features for each block. For example, if the blocks are selected so that the origins are offset by one feature spacing (i.e., distance between adjacent features in the same direction for which the offset is to be determined), then the resultant projections 424a and 424b may line up with each other and indicate that there is no offset between the features in the two selected blocks, and therefore no rotation. If the offset is more than one half the feature spacing, but less than one, it is possible to measure the angle of offset in the opposite direction. It is desirable therefore, to select the block 410A, 410B so that the offset between origins, or other features corresponding to the same respective positions in each block is less than half the separation between adjacent features in the direction for which the offset is sought, to avoid matching adjacent peaks between projections 424a, 424b in the wrong direction which would result in measuring the angle of rotation in the opposite direction.
Thus, initially, the blocks 410A and 410B are typically selected very close together.
If the rotation of the array being analyzed is substantial, then the calculation described above, by itself, is very robust. However, if the rotation is only slight (such as in cases described above with regard to
Iterations of this process may be carried out, each time, moving block 410B a further distance (i.e., to the right in
Blocks 410A and 410B are typically chosen to be fairly narrow in width, so as to get clearer peak formations in the projections 424a, 424b. For example, the block width may be of sufficient distance to capture two columns (or rows) of features, which is useful when features are closely packed as in the example of
The initial placement of block 4101B is typically directly adjacent to block 410A, as described above, but need not be. For example, the initial placement of block 410B may be separated from block 410A by one, two or more columns (or rows) of features along the array. Also, each time the block 410A it may be moved to another position spaced from the previous position, alternative to the adjacent placements already discussed. Each iteration is typically performed based upon movements of the block 4101B by equal distances each time, but this is also not required. Further alternatively, the system may determine, based upon the results of the first iteration (first placement of block 4101B and processing with regard to block 410A) how far to move the block 410B for the next (and potentially, subsequent iterations). For example, based on the evaluation of the offset or rotation from the first iteration, the system can estimate what is the maximum desirable separation between blocks 410A and 4101B and then place block 4101B at a position safely below this maximum, but far enough to increase precision, for example, ½, to ⅞, typically about ¾ times the estimated maximum. This reduces processing time, since it is apparent, from the first round, that there is little or no rotation, and that a large separation distance between blocks 410A and 410B will be needed to measure the offset, if indeed there is even any offset at this distance. Alternatively, if the initial offset measurement is within a preset percentage of the threshold, then the system may make the next placement of the block 410B immediate adjacent the current placement, as not much more separation between blocks may be tolerated before the offset threshold is exceeded. Intermediate of these two extreme examples, other preset distances for movement distance for a subsequent movement of block 410B may be programmed, based upon offset distances calculated for the present iteration, relative to the offset threshold, as would be readily apparent to one of ordinary skill in the art after reading the above description. Typically, a first movement offset threshold may be set very near to zero, e.g., 0.02 to 0.10 of the center to center distance between features in the same direction that offset is being determined. If the first movement offset threshold is not met or exceeded by the offset determined from the first iteration, then block 4101B is moved a greater distance from block 410A than the system is programmed to normally move the block 410B. For example, block 410B may be moved as close to the end of the array as possible. On the other hand, if the first movement offset threshold is met or exceeded by the offset determined from the first iteration, then the system may place the block 410B immediately adjacent the current placement (or move it by some preset default distance) for conducting the next iteration.
Still further, a second movement offset threshold may be set that is close to the offset threshold used to determine when to end the iteration processing. For example, the second movement offset threshold may be about 0.18 to 0.22 of the center to center distance between features in the same direction that offset is being determined. If the first movement offset threshold is not met or exceeded by the offset determined from the first iteration, then block 410B is moved a greater distance from block 410A than the system is programmed to normally move the block 410B. For example, block 410B may be moved as close to the end of the array as possible, as described above. If the second movement offset threshold is met or exceeded by the offset determined from the first iteration, then block 410B, but the offset threshold for ending iterations is not met or exceeded, then the system may place the block 410B immediately adjacent the current placement. If the first movement offset threshold is exceeded, but the second movement offset threshold is not met, then the block may be moved by some preset default distance, which may be equal to or greater than the distance for adjacent placement) for conducting the next iteration. The same controls may be provided for subsequent movements of the block 410B, for subsequent iterations.
Once an offset distance (e.g., selection of a median or other average offset value from comparing the correlating peaks, as described above) is determined in any iteration to meet or exceed the set threshold offset value, or the last iteration has been completed, the offset value selected during the iteration when the threshold value was met or exceeded, or the offset value selected during the last iteration is use to calculate the angle of rotation of the array being analyzed, where the selected offset value for the examples discussed above is characterized by (y2−y1) (The offset would be the value for (x2−x1 if the analysis were carried out along the other dimension of the array). The angle of rotation may be determined by calculating a tangent, i.e., (y2−y1)/(x2−x1). The rotation is then subtracted out.
This technique may be carried out prior to, or after, processing for peaks, etc discussed above. Reduction of feature sizes may be performed first. Typically, processing for rotation by the procedures discussed immediately above are carried out prior to processing to find peaks for locating the features (rows and columns) of the array. If there is no, or no significant rotation of the array, then this technique will determine a very small or no detectable angle of rotation, and further processing can be carried out according to the other techniques disclosed previously, without performing a rotation of the array.
Assuming a rotation is performed, after subtracting out the rotations determined along the first dimension, processing in the same manner may be repeated, but in the other dimension, using the iterative technique described above. Skew may be optionally performed, and may not be performed since skew is not typically according to a very large angle. The rotational results in this dimension, combined with the rotational results in the first dimension (described in detail above) determine the skew of the pattern.
Baseline Processing
An example of baseline processing according to the present invention involves filtering sources of illumination which have a period (width) that is substantially greater (e.g., twice, or some other predetermined multiple) than the width (or expected width) of the peak spacing (i.e., distance between centers of the features 110F. The predetermined multiple may vary depending upon how much information is known about the grid prior to processing. For example, if no information is known, the predetermined multiple may be about twice the expected peak spacing. If the peak spacing has already been specified prior to processing, the predetermined multiple may be about 1.5 times the peak spacing, or even equal to the peak spacing. In one example where no information is known, a window spacing of 31 points (pixels) is used (assuming pixel size of 10 microns).
This baseline filtering process may employ a window function that operates conceptually similarly to the window function used for reducing the peak size, as discussed above with regard to
As noted above, the window 530 for the window function is selected to be no smaller than the peak spacing or expected peak spacing, and is generally about 1 to 2.5 times the peak (or expected peak) spacing. As the window 530 is passed over the projection 500, the minimum value observed in the window is obtained for each progressive position of the window 530 over the projection 500. Window 530 may be advanced by as little as one pixel between each position for which a minimum value is obtained, or a larger incremental movement may be employed for faster processing. However, by performing the projections as noted, this typically reduces the number of points to consider to somewhere in the neighborhood of about 6000. With this reduction, it is possible to advance one pixel at a time and still complete the processing very quickly, as the reduced information for the entire grid (array) can be loaded into a processor cache.
To remove the remnant block portions 510B, a reverse transformation is employed, wherein maximum values of plot 540 are obtained using the same window function. The plot 550 resulting from the maximum value filtering step is also shown in
Peak Width Measurement and Gaussian Fit
When finding the peak centers as described above with regard to
Accordingly, once the peak spacing has been determined by locating the peak centers using the relatively narrow window, processing may be returned for iterations on finding a Gaussian fit, for a more accurate fit. Since the spacing is now known, a window which is about half the peak spacing can be used to do the Gaussian integration to fit the Gaussian curves for the peaks.
Grouping the Peaks
After the peak centers, peak spacing and peak widths have been established, according to the above methods, the system further processes the data to establish peak grouping. Peak grouping relates to the features as they are arranged in subgrids, for example. In certain situations, a consistent repeating pattern of peak grouping may be observable in the data, while one or more such groups may deviate slightly from the established pattern. For example, 10A shows a repeating pattern of four peaks consistently spaced, with each group being relatively consistently spaced by a larger distance than the distance between peaks within the group. The peaks 600A and 600C in groups A and C fit this pattern well. However, group B shows only three peaks 600B and is spaced somewhat further from group A than it is from group C. The program, in this type of situation, recognizes the established pattern of groups, the group spacing and the peak spacing. By comparing the average group spacings over the entire data set, and comparing this with the group spacing and number of peaks in group B as compared with the other groups, group B will be arranged to account for the missing peak (first peak in the group in this instance) to maintain proper alignment of this group with the other subgrids.
Another anomalous situation that may occur is like that shown in
An example of a process which was determined to perform well for microarray images containing features with varied sizes and geometries is as follows:
A projection along the smaller dimension of the microarray (which was usually the sum of the columns) was performed. The sum (i.e., projection) was peak picked, and then a non-linear projection along the other dimension (usually, the rows) was performed, based on the picked peaks. The non-linear projection was then analyzed. Analyzing is meant to include any or all of: performing large trace removal, Gauss filtering (Gaussian peak fitting), zero rank filtering, peak picking with arbitrary width integration (in these examples, five points (pixels) on each side of the peak), estimating peak spacing, redoing the Gaussian peak fitting using a width of half the estimated peak spacing on each side of the peak center, block finding, block size estimating, block fixing, recomputing and validating peak spacing and block spacing using peaks in valid blocks (i.e., those block which did not need block fixing), and estimating the origin of the pattern of blocks using all valid blocks.
After analyzing, the number of zones and number of features are located (“filled in”) on the grid in the first dimension (usually X axis). Next, a non-linear projection was performed along the smaller dimension was performed, based on the picked peaks in the larger dimension, wherein only even numbered picked peaks were used. This projection was then analyzed, and then the number of zones and number of features are located (“filled in”) on the grid in the second dimension (usually Y axis) to give the packed array offset information.
Once all of the subgrids have been located, processing was carried out as above, individually on each subgrid. Additionally, processing to determine rotation and skew, if any were carried out on the subgrids, in the manner described above.
While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.
This application is a continuation-in-part application of application Ser. No. 10/449,175, filed May 30, 2003, which is hereby incorporated herein, in its entirety by reference thereto, and to which application we claim priority under 35 USC §120.
Number | Date | Country | |
---|---|---|---|
Parent | 10449175 | May 2003 | US |
Child | 11221312 | Sep 2005 | US |