The present invention relates to the processing of data scanned from a molecular array and, in particular, to a method and system for automatically detecting outlying signals scanned from features and feature backgrounds based on an estimated scanned data variance calculated from the scanned data and on a maximum variance threshold calculated from the scanned data and from a model variance.
The present invention is related to processing of data scanned from molecular arrays. Molecular array technologies have gained prominence in biological research and are likely to become important and widely used diagnostic tools in the healthcare industry. Currently, molecular-array techniques are most often used to determine the concentrations of particular nucleic-acid polymers in complex sample solutions. Molecular-array-based analytical techniques are not, however, restricted to analysis of nucleic acid solutions, but may be employed to analyze complex solutions of any type of molecule that can be optically or radiometrically scanned and that can bind with high specificity to complementary molecules synthesized within, or bound to, discrete features on the surface of a molecular array. Because molecular arrays are widely used for analysis of nucleic acid samples, the following background information on molecular arrays is introduced in the context of analysis of nucleic acid solutions following a brief background of nucleic acid chemistry.
Deoxyribonucleic acid (“DNA”) and ribonucleic acid (“RNA”) are linear polymers, each synthesized from four different types of subunit molecules. The subunit molecules for DNA include: (1) deoxy-adenosine, abbreviated “A,” a purine nucleoside; (2) deoxy-thymidine, abbreviated “T,” a pyrimidine nucleoside; (3) deoxy-cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) deoxy-guanosine, abbreviated “G,” a purine nucleoside. The subunit molecules for RNA include: (1) adenosine, abbreviated “A,” a purine nucleoside; (2) uridine, abbreviated “U,” a pyrimidine nucleoside; (3) cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) guanosine, abbreviated “G,” a purine nucleoside.
The DNA polymers that contain the organization information for living organisms occur in the nuclei of cells in pairs, forming double-stranded DNA helices. One polymer of the pair is laid out in a 5′ to 3′ direction, and the other polymer of the pair is laid out in a 3′ to 5′ direction. The two DNA polymers in a double-stranded DNA helix are therefore described as being anti-parallel. The two DNA polymers, or strands, within a double-stranded DNA helix are bound to each other through attractive forces including hydrophobic interactions between stacked purine and pyrimidine bases and hydrogen bonding between purine and pyrimidine bases, the attractive forces emphasized by conformational constraints of DNA polymers. Because of a number of chemical and topographic constraints, double-stranded DNA helices are most stable when deoxy-adenylate subunits of one strand hydrogen bond to deoxy-thymidylate subunits of the other strand, and deoxy-guanylate subunits of one strand hydrogen bond to corresponding deoxy-cytidylate subunits of the other strand.
Two DNA strands linked together by hydrogen bonds form the familiar helix structure of a double-stranded DNA helix.
Double-stranded DNA may be denatured, or converted into single-stranded DNA, by changing the ionic strength of the solution containing the double-stranded DNA or by raising the temperature of the solution. Single-stranded DNA polymers may be renatured, or converted back into DNA duplexes, by reversing the denaturing conditions, for example by lowering the temperature of the solution containing complementary single-stranded DNA polymers. During renaturing or hybridization, complementary bases of anti-parallel DNA strands form Watson-Crick (“WC”) base pairs in a cooperative fashion, leading to regions of DNA duplex. Strict A-T and G-C complementarity between anti-parallel polymers leads to the greatest thermodynamic stability, but partial complementarity including non-WC base pairing may also occur to produce relatively stable associations between partially-complementary polymers. In general, the longer the regions of consecutive WC base pairing between two nucleic acid polymers, the greater the stability of hybridization between the two polymers under renaturing conditions.
The ability to denature and renature double-stranded DNA has led to development of many extremely powerful and discriminating assay technologies for identifying the presence of DNA and RNA polymers having particular base sequences or containing particular base subsequences within complex mixtures of different nucleic acid polymers, other biopolymers, and inorganic and organic chemical compounds. These methodologies include molecular-array-based hybridization assays.
Once a molecular array has been prepared, the molecular array may be exposed to a sample solution of DNA molecules that includes DNA molecules (410–413 in
Molecular-array-based hybridization techniques allow extremely complex solutions of DNA molecules to be analyzed in a single experiment. Molecular arrays may contain hundreds, thousands, or tens of thousands of different oligonucleotides, allowing for the detection of hundreds, thousands, or tens of thousands of different DNA polymers containing complementary nucleotide sub-sequences in the complex DNA solutions to which the molecular array is exposed. In order to perform different sets of hybridization analyses, molecular arrays containing different sets of bound oligonucleotides are manufactured by any of a number of complex manufacturing techniques. These techniques generally involve synthesizing the oligonucleotides within corresponding features of the molecular array through complex iterative synthetic steps.
As pointed out above, molecular-array-based assays can involve other types of biopolymers, synthetic polymers, and other types of chemical entities. For example, one might attach protein antibodies to features of the molecular array that would bind to soluble labeled antigens in a sample solution. Many other types of chemical assays may be facilitated by molecular array technologies. For example, polysaccharides, glycoproteins, synthetic copolymers, including block copolymers, biopolymer-like polymers with synthetic or derivatized monomers or monomer linkages, and many other types of chemical entities may serve as probe and target molecules for molecular-array-based analysis. A fundamental principle upon which molecular arrays are based is that of specific recognition, by probe molecules affixed to the molecular array, of target molecules, whether by sequence-mediated binding affinities, binding affinities based on conformational or topological properties of probe and target molecules, or binding affinities based on spatial distribution of electrical charge on the surfaces of target and probe molecules.
DNA, and other biological polymers, may be labeled with different chemical chromophores, radioactive nuclides, or other signal-generating entities, and may be optically scanned at different wavelengths of light, radiometrically scanned for different types of radioactive emission within different energy ranges, or scanned by other techniques appropriate to detect signals produced by other signal-generating entities. In the case of optical scanning, each different wavelength at which a molecular array is scanned produces a different signal. Thus, in optical scanning, it is common to describe the signal produced by scanning in terms of the color of the wavelength of light employed for the scan. For example, a red signal is produced by scanning a molecular array with light having a wavelength corresponding to that of visible red light.
Scanning of a feature by an optical scanning device or radiometric scanning device generally produces a scanned image comprising a rectilinear grid of pixels, with each pixel having a corresponding signal intensity.
It is desirable for the signal intensities, or counts, of pixels within the area of a pixel-based scanned image corresponding to a feature to be relatively uniform. Similarly, it is also desirable for the signal intensities within background regions surrounding features to be relatively uniform. Non-uniform signal intensity distributions generally indicate the occurrence of one or more error or noise conditions that may prevent meaningful data from being collected from the feature.
Currently, outlier features, or feature backgrounds, are commonly identified by using negative control features manufactured into molecular arrays and by manual inspection of scanned images. However, control-feature-based outlier detection may be insensitive to various types of non-uniformities and significantly adds to the cost of molecular array manufacture and molecular array scanning and data processing. Manual outlier detection suffers from the inaccuracies and deficiencies well-known to occur in most human-dependent tasks, and is also quite slow and economically inefficient. Thus, designers, manufacturers, and users of molecular arrays have recognized the need for a more accurate, automated technique for recognizing outlier features and outlier feature backgrounds in scanned images of molecular arrays.
The present invention is directed towards a method and system for identifying outlier features and outlier feature backgrounds in scanned images of molecular arrays. The method and system of the present invention employ pixel-based, signal-intensity data contained within areas of a scanned image of a molecular array corresponding to features and feature backgrounds in order to determine whether or not the features or feature backgrounds have non-uniform signal intensities and are thus outlier features and outlier feature backgrounds. A calculated, estimated variance for the signal intensities within a feature or feature background is compared to a maximum allowable variance calculated for the feature or feature background based on a signal intensity variance model. When the experimental variance is less than or equal to the maximum allowable variance, the feature or feature background is considered to have acceptable signal-intensity uniformity. Otherwise, the feature or feature background is flagged as an outlier feature or outlier feature background.
The present invention is directed to identifying outlier features and outlier feature backgrounds within scanned images of molecular arrays. The variance of signal intensities within a feature or feature background is compared to a maximum allowable variance calculated based on a variance model in order to determine whether or not the region of a scanned image of a molecular array corresponding to a feature or feature background contains adequately uniform pixel-based signal intensities. In the following, a description of the variance model and the fundamental statistical concepts and distributions on which it is based is provided with reference to
Data processing techniques employed in outlier detection involve application of various statistical measurements on the per-pixel counts, or pixel-based signal intensities measured for a particular feature or feature background and included in a digital representation of the scanned image of the molecular array. A molecular array scanner produces a raw digital representation including a count, or signal intensity, for each pixel within the digital representation. As a first step in processing the raw data, net signals “$s_{net}$” are calculated from measured signals “$s_{measured}$” via a subtractive process:
$s_{net} = s_{measured} - s_{offset}$
For each measured per-pixel count, or pixel-based signal intensity, the net signal is obtained by subtracting a signal offset “$s_{offset}$” from the measured signal “$s_{measured}$.” The signal offset may be automatically provided by the scanner device or may be empirically determined by identifying a minimal signal in the digital representation of the molecular array produced by scanning the molecular array and processing the scanned data. An estimate of the variance of the per-pixel counts within the area of a digital representation of a molecular array corresponding to a feature or feature background is obtained as follows:
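As a minimal sketch of the two steps just described, the following C++ fragment subtracts an offset from each measured pixel count and then estimates the variance of the net counts. It assumes the conventional unbiased sample variance with an n − 1 denominator, computed from a running total and a running total of squares. The function names `netSignals` and `estimatedVariance` are illustrative and are not taken from the described embodiment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Net signal per pixel: s_net = s_measured - s_offset.
std::vector<double> netSignals(const std::vector<double>& measured, double offset) {
    std::vector<double> net;
    net.reserve(measured.size());
    for (double s : measured) {
        net.push_back(s - offset);
    }
    return net;
}

// Estimated variance of the net pixel counts in one area.
// Assumption: unbiased sample variance, (sum of squares - total^2 / n) / (n - 1).
double estimatedVariance(const std::vector<double>& net) {
    assert(net.size() > 1);  // at least two pixels are needed
    const double n = static_cast<double>(net.size());
    double total = 0.0;
    double totalOfSquares = 0.0;
    for (double s : net) {
        total += s;
        totalOfSquares += s * s;
    }
    return (totalOfSquares - total * total / n) / (n - 1.0);
}
```

For measured counts 12, 14, 16, and 18 with an offset of 10, the net counts are 2, 4, 6, and 8, and the estimated variance is 20/3.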
In order to determine whether the pixel counts or pixel-based signal intensities within a feature or feature background are sufficiently uniform, the calculated variance “S2s
$\hat{\sigma}^2 = \hat{\sigma}^2_{\text{labeling and feature synthesis}} + \hat{\sigma}^2_{\text{counting}} + \hat{\sigma}^2_{\text{noise}}$
The model variance “$\hat{\sigma}^2_{\text{labeling and feature synthesis}}$” is the variance expected for non-uniformities associated with target-molecule labeling, feature synthesis, and other solution and surface and chemistry effects. The model variance “$\hat{\sigma}^2_{\text{counting}}$” is the variance expected in scanning measurement, or counting, error. The model variance “$\hat{\sigma}^2_{\text{noise}}$” is the expected variance due to electronic noise in the scanner, background-level signal noise produced by the glass substrate of the molecular array, and other such noise.
In one embodiment of the present invention, the non-uniformity associated with labeling and feature synthesis is considered to be normally distributed.
In the described embodiment, the model variance “$\hat{\sigma}^2$” is alternatively expressed as:
$\hat{\sigma}^2 = A\bar{s}_{net}^2 + B\bar{s}_{net} + C$
For the scanner Poissonian noise, the signal to noise ratio is estimated, in the described embodiment, based on the number of molecules of chromophores and the number of photons produced by each molecule, as follows:
$S/N = \sqrt{m}\,\sqrt{p}/\sqrt{p+1}$
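The signal-to-noise expression above can be evaluated directly, as in the following sketch. It assumes that m is the number of chromophore molecules and p is the number of photons produced by each molecule, which is one reading of the preceding sentence; the function name `poissonSignalToNoise` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// S/N = sqrt(m) * sqrt(p) / sqrt(p + 1).
// Assumption: m = number of chromophore molecules, p = photons per molecule.
double poissonSignalToNoise(double molecules, double photonsPerMolecule) {
    assert(molecules >= 0.0 && photonsPerMolecule >= 0.0);
    return std::sqrt(molecules) * std::sqrt(photonsPerMolecule)
           / std::sqrt(photonsPerMolecule + 1.0);
}
```

With 100 molecules each producing 3 photons, the function returns about 8.66. As p grows, the ratio approaches the square root of m.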
In the described embodiment, the constant “C” is found, through scanning experiments, to have a value of 144. The estimated values of constants “A,” “B,” and “C” obviously vary with varying experimental conditions, target and probe biopolymers, molecular array substrates, chromophores, and scanning and data reduction equipment.
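The quadratic form of the model variance can be sketched as a small function of the mean net signal. Only the value C = 144 comes from the text; the struct name `VarianceModel`, the function name `modelVariance`, and the values of A and B used in the note after the code are illustrative assumptions.

```cpp
#include <cassert>

// Coefficients of the model variance A * s^2 + B * s + C (hypothetical struct).
struct VarianceModel {
    double A;  // coefficient of the squared mean net signal
    double B;  // coefficient of the mean net signal
    double C;  // signal-independent term
};

// Model variance for an area whose mean net signal is meanNet.
double modelVariance(const VarianceModel& model, double meanNet) {
    const double variance = model.A * meanNet * meanNet + model.B * meanNet + model.C;
    // A negative result would indicate constants used outside their valid range.
    assert(variance >= 0.0);
    return variance;
}
```

With the illustrative values A = 0.01 and B = 1, together with C = 144, a mean net signal of 100 gives a model variance of 344.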
Using the above-described variance model, a threshold value, or $\hat{\sigma}^2_{max}$, can be estimated using an assumption that the following expression is distributed according to a $\chi^2$ distribution with $n-1$ degrees of freedom, where $n$ is the number of feature or feature background pixels:
where $\sigma^2$ is the true feature or feature background variance under the assumption that the model is valid, and the feature or feature background is not an outlier.
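The threshold computation can be sketched as follows. The sketch assumes the standard statistical result that (n − 1) times the estimated variance, divided by the true variance, follows a χ² distribution with n − 1 degrees of freedom, and it substitutes the model variance for the true variance. The chosen χ² point is passed in rather than computed; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Maximum allowable variance for an area of numPixels pixels.
// Assumption: (n - 1) * S^2 / sigma^2 is chi-squared with n - 1 degrees of
// freedom, so S^2 exceeds chiSquaredXPoint * sigma^2 / (n - 1) only with the
// small probability associated with the chosen chi-squared point.
double maxAllowableVariance(double chiSquaredXPoint, double modelVariance,
                            std::size_t numPixels) {
    assert(numPixels > 1);
    return chiSquaredXPoint * modelVariance / static_cast<double>(numPixels - 1);
}

// An area is flagged when its estimated variance exceeds the threshold.
bool isOutlier(double estimatedVariance, double maxVariance) {
    return estimatedVariance > maxVariance;
}
```

An estimated variance equal to the threshold is accepted, matching the "less than or equal to" rule stated for the method.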
A representative χ2 distribution is shown in
It should be noted that, although the above-described variance model has been found to provide an effective basis for outlier detection, many other types of variance models are possible. Additional terms can be included, to account for other types of variances, terms may be modified, to more precisely describe the variances, and terms may be deleted from the above expression for the model variance. The techniques of the present invention may use any of the many possible model variances for outlier detection.
A C++-like pseudocode implementation showing an embodiment of the present invention is provided below. Note that the pseudocode implementation is not intended to describe a complete data processing program for molecular array data, but only to provide sufficient detail to illustrate one possible embodiment of the above-described outlier identification methodology as the embodiment might occur within a molecular array data processing program, or in molecular array scanning and data processing equipment. The molecular array data processing program including the techniques of the present invention analyzes data scanned from a molecular array to produce experimental or diagnostic results which are stored in a computer-readable medium, transferred to an intercommunicating entity via electronic signals, printed in a human-readable format, or otherwise made available for further use.
First, the pseudocode implementation includes several constants and enumerations:
Next, the pseudocode implementation includes the class “scannedData,” provided below:
An instance of the class “scannedData” describes the pixel-based signal intensities, or counts, for a particular background or feature area of a scanned molecular array. The pixels are assumed to be rectilinearly oriented, with the shape of the area having a major horizontal axis, or row, that intersects with all columns of pixels within the area. Thus, the pseudocode implementation can model square features, disk-shaped features, elliptically shaped features, and other similar symmetrical closed forms. The class “scannedData” contains the following data members: (1) “data,” a pointer to the pixel counts; (2) “rowSize,” the size, in columns, of the major horizontal axis, or major row; (3) “colSize,” a pointer to the sizes of columns that include each pixel of the major row; (4) “total,” a total number of counts for the area of the scanned image; and (5) “outlying,” a Boolean value indicating whether or not the distribution of counts within the area is non-uniform. The class “scannedData” includes various member functions for setting and retrieving the values of the above-described data members, a member function “getPixelCount” that returns the per-pixel count measured by a scanning device for the pixel with row and column coordinates supplied as arguments, and a constructor “scannedData” that takes raw data as input. Implementations of the member function “getPixelCount” and the constructor are not provided, as they are quite dependent on the format of the raw data and implementation of other portions of the data processing package, and are outside the scope of the present invention.
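The data layout described above can be illustrated with the following sketch. It is not the listing referred to in the text: the class name `ScannedDataSketch`, the column-by-column storage order, and the use of `std::vector` in place of raw pointers are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the class "scannedData".
class ScannedDataSketch {
public:
    // colSizes[c] is the number of pixels in column c of the area; counts
    // holds the pixel counts column by column (an assumed storage order).
    ScannedDataSketch(std::vector<std::size_t> colSizes, std::vector<int> counts)
        : colSize_(std::move(colSizes)), data_(std::move(counts)) {
        std::size_t start = 0;
        colStart_.reserve(colSize_.size());
        for (std::size_t size : colSize_) {
            colStart_.push_back(start);  // index of the first pixel of each column
            start += size;
        }
        assert(start == data_.size());  // column sizes must account for every pixel
    }

    std::size_t getRowSize() const { return colSize_.size(); }
    std::size_t getColSize(std::size_t col) const { return colSize_.at(col); }
    std::size_t getNumPixels() const { return data_.size(); }

    // Count for the pixel at position row within column col.
    int getPixelCount(std::size_t row, std::size_t col) const {
        assert(row < colSize_.at(col));
        return data_[colStart_[col] + row];
    }

    double getTotal() const { return total_; }
    void setTotal(double total) { total_ = total; }
    bool getOutlying() const { return outlying_; }
    void setOutlying(bool outlying) { outlying_ = outlying; }

private:
    std::vector<std::size_t> colSize_;   // pixels per column of the major row
    std::vector<int> data_;              // pixel counts
    std::vector<std::size_t> colStart_;  // offset of each column within data_
    double total_ = 0.0;                 // total net counts for the area
    bool outlying_ = false;              // true when the area is non-uniform
};
```

Column sizes of 1, 3, and 1 describe a small disk-like area of five pixels, and equal column sizes describe a square area.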
Next, the pseudocode implementation includes a declaration of the class “feature,” provided below:
An instance of the class “feature” describes a feature of the molecular array, and includes a pointer to an array of instances of the class “scannedData,” described above, for the areas corresponding to the feature and to the feature background scanned at red and green visible wavelengths. The class “feature” includes the following data members: (1) “x_coordinate,” the x coordinate of the feature in a rectilinear grid of features that comprises the molecular array; (2) the y coordinate of the feature; and (3) “features,” a pointer to an array of instances of the class “scannedData.” The class “feature” includes the following member functions: (1) “outlier,” declared and implemented on lines 9 and 10, above, which returns a Boolean value indicating whether or not the area of the feature corresponding to argument “a” is an outlier with respect to the signal provided by argument “c;” (2) “getCount,” declared and implemented above on lines 11–12, which returns the total net signal for either the feature background or the feature, scanned at a particular wavelength; and (3) “feature,” a constructor for the feature.
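The accessors described above can be illustrated with the following sketch, which indexes per-area results by area type and scan color. It is not the listing referred to in the text, and the line numbers cited there do not apply to it. The names `Area`, `Color`, `AreaSummary`, and `FeatureSketch`, and the flat-array indexing scheme, are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical enumerations for the area type and the scan color.
enum class Area : std::size_t { Feature = 0, Background = 1 };
enum class Color : std::size_t { Red = 0, Green = 1 };
constexpr std::size_t kNumAreas = 2;
constexpr std::size_t kNumColors = 2;

// Per-area result: total net signal and outlier flag.
struct AreaSummary {
    double total = 0.0;
    bool outlying = false;
};

// Hypothetical stand-in for the class "feature".
class FeatureSketch {
public:
    FeatureSketch(int x, int y) : x_(x), y_(y) {}

    // True when the given area is an outlier for the given scan color.
    bool outlier(Area a, Color c) const { return scans_[index(a, c)].outlying; }

    // Total net signal for the given area and scan color.
    double getCount(Area a, Color c) const { return scans_[index(a, c)].total; }

    void setSummary(Area a, Color c, const AreaSummary& s) { scans_[index(a, c)] = s; }

    int getX() const { return x_; }
    int getY() const { return y_; }

private:
    // Maps an (area, color) pair onto a position in the flat array.
    static std::size_t index(Area a, Color c) {
        const std::size_t i =
            static_cast<std::size_t>(a) * kNumColors + static_cast<std::size_t>(c);
        assert(i < kNumAreas * kNumColors);
        return i;
    }

    int x_;
    int y_;
    std::array<AreaSummary, kNumAreas * kNumColors> scans_{};
};
```

Storing the four area and color combinations in one flat array mirrors the single array of “scannedData” instances described above.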
The constructor for the class “feature” contains the code relevant to one embodiment of the present invention. An implementation for the constructor “feature” is provided below:
The constructor “feature” takes the following arguments: (1) “data,” a pointer to an array of instances of the class “scannedData;” (2) “offsets,” a pointer to an array of offsets, corresponding to the term “$s_{offset}$” in the above-described expression for the net signal “$s_{net}$”; (3) “A,” “B,” and “C,” pointers to arrays of constants for each type of scanned area, e.g., feature or feature background scanned in red or green light, where the constants in the arrays correspond to the constants “A,” “B,” and “C,” in the above-described expression for the model variance “$\hat{\sigma}^2$;” (4) “chiSquaredXPoint,” the threshold variance value “$\chi^2_x$,” described above; and (5) “x” and “y,” the x and y coordinates for the feature. On lines 5–13, a number of local variables are declared. These local variables include: (1) “total,” the total pixel counts obtained from an area associated with a feature during a particular scan; (2) “total2,” the square of the total pixel counts; (3) “num,” the number of pixels in the area; (4) “count,” a particular net count for a pixel, “$s_{net}$;” (5) “s_net,” the average value of the net signals from an area; (6) “s_net2,” the square of the average net signals from an area; (7) “s2_model,” the calculated model variance for an area under a particular scan; (8) “s2_max,” the threshold value “$\hat{\sigma}^2_{max}$,” described above; and (9) “s2,” the estimated variance for the pixel intensities within the area. On lines 15–17, member data for the class “feature” are initialized to the values of supplied arguments. In the nested for-loops of lines 19–50, each of the instances of the class “scannedData” describing scans of areas associated with the feature is processed according to the above-described technique for obtaining net signals and determining whether or not the uniformity of the signal intensities within an area is acceptable. Thus, the code of lines 22–48 is executed for each scan of each area associated with the feature.
In the case of the described embodiment, instances of the class “scannedData” represent red and green scans of the feature background and the feature. In the for-loop of lines 26–35, the square of the total net signals, the total net signals, and the number of pixels in an area are calculated for the area. On line 36, the value $\bar{s}_{net}$ is calculated. On line 37, the value $\bar{s}_{net}^2$ is calculated. On line 38, the value $\hat{\sigma}^2$ is calculated. On line 39, the value $\hat{\sigma}^2_{max}$ is calculated. On line 40, the estimated variance for the pixel counts within the area is calculated. On lines 41–42, the member data “outlying” for the instance of the class “scannedData” is set to “false” if the estimated variance is less than or equal to the threshold variance $\hat{\sigma}^2_{max}$, and is set to “true” otherwise. On line 43, the member data “total” is set to the total net signal count for the area. Finally, on lines 44–48, array pointers are incremented for the next iteration of the nested for-loops.
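The per-area processing described above can be sketched as a single function. This is not the constructor listing referred to in the text, and the cited line numbers do not apply to it. The sketch assumes that the running quantity used for the variance is the sum of squared net counts, that the estimated variance is the unbiased sample variance, and that the threshold is the χ² point times the model variance divided by n − 1. The names `AreaResult` and `evaluateArea` are hypothetical; the local variable names follow those listed in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Outcome of processing one scanned area (hypothetical struct).
struct AreaResult {
    double total;   // total net signal for the area
    double s2;      // estimated variance of the net pixel counts
    double s2_max;  // maximum allowable variance
    bool outlying;  // true when s2 exceeds s2_max
};

AreaResult evaluateArea(const std::vector<double>& measured, double offset,
                        double A, double B, double C, double chiSquaredXPoint) {
    assert(measured.size() > 1);
    double total = 0.0;   // sum of net counts
    double total2 = 0.0;  // sum of squared net counts (assumed meaning)
    std::size_t num = 0;  // number of pixels
    for (double raw : measured) {
        const double count = raw - offset;  // net signal for this pixel
        total += count;
        total2 += count * count;
        ++num;
    }
    const double n = static_cast<double>(num);
    const double s_net = total / n;                             // mean net signal
    const double s_net2 = s_net * s_net;                        // its square
    const double s2_model = A * s_net2 + B * s_net + C;         // model variance
    const double s2_max = chiSquaredXPoint * s2_model / (n - 1.0);
    const double s2 = (total2 - total * total / n) / (n - 1.0);
    return AreaResult{total, s2, s2_max, s2 > s2_max};
}
```

In a complete program this function would be called once for each combination of area (feature or feature background) and scan color, with the offset and the constants A, B, and C chosen for that combination.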
Although the present invention has been described in terms of a particular embodiment, it is not intended that the invention be limited to this embodiment. Modifications within the spirit of the invention will be apparent to those skilled in the art. For example, an almost limitless number of different implementations of the outlier detection method of the present invention can be written in any of many different programming languages, embodied in firmware, embodied in hardware circuitry, or embodied in a combination of one or more of firmware, hardware, or software, for inclusion in molecular array data processing equipment employing a computational processing engine to execute software or firmware instructions encoding techniques of the present invention or including logic circuits that embody both a processing engine and instructions. Various different variance models can be employed, including models with additional model variance terms corresponding to observed errors, defects, and noises different from, in addition to, or in place of those used in the described embodiment. Use of statistical variance modeling for generating variance thresholds for outlier detection can be applied to many different types of molecular arrays, and to many other molecular-array-like scientific and diagnostic devices. In the described embodiment, the techniques of the present invention are employed to detect outlier features and feature backgrounds, but the same techniques may be applied to identify non-uniformity in other regions of a scanned image of a molecular array. The techniques of the present invention may be applied to scanned images of molecular arrays, regardless of the wavelength of light used in an optical scan, energy levels of emitted radiation detected, or other type of signal detection employed to generate the scanned image.
Of course, each different type of scanning device, molecular array, type of signal detected, and other variations will need a corresponding variance model for calculating useful variance thresholds.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the invention. The foregoing descriptions of specific embodiments of the present invention are presented for purpose of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously many modifications and variations are possible in view of the above teachings. The embodiments are shown and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents:
Number | Date | Country |
---|---|---|
20030081819 A1 | May 2003 | US |