The present teachings are directed to electrophoresis equipment that identifies migrating species based on an analysis of detected fluorescence levels. More particularly, the present teachings are directed to such equipment having an in-situ calibration capability, which permits various dye sets to be used, and a three-dimensional graphical representation of results, which allows for simplified base calling.
In the detector system in accordance with U.S. Pat. No. 6,027,627, fluoresced light from migrating species within a plurality of capillaries aligned in parallel passes through a filter, a transmission grating beam splitter and a lens before it impinges on a CCD detector array. In the preferred embodiment, the CCD detector array comprises 1024×256 pixels. The first dimension (1024 pixels) covers 96 parallel capillaries, each capillary being focused onto at least one of the 1024 rows, although the number of rows per capillary can be increased by selecting a lens with a different focal length or changing other optical parameters. The second dimension (256 pixels) covers the fluorescence spectrum spread by the transmission grating.
In this prior art system, both the first order and second order components can be focused onto the detector array, although this is not an absolute requirement. What is required, however, is that a spectrum (such as represented by the 1st order components) be created for each capillary and detected. The spectrum of interest should include the wavelengths of light at which the dyes are known to fluoresce. The spectrum of interest for each capillary is spread over P contiguous pixels and these are divided into R channels of Q contiguous pixels, R=P/Q. R should be at least as large as the number of dyes M being used and preferably is greater than this number.
The detector of this prior art system outputs a spectrum comprising R light intensity values for each capillary, each time that data is provided to the associated processor. The processor then maps the spectrum of R intensity values for each capillary, onto values which help determine which dye has been detected in that capillary. This is typically done by multiplying calibration coefficients by the vector of intensity values, for each capillary.
The principle behind the calibration coefficients is that a spectrum of received light intensities in each of the channels is caused by the spectrum of a single dye (tagging a corresponding base) weighted by the effects (calibration coefficients) of the detection system. If I0(n), I1(n), . . . , I9(n) represent the measured intensities of the R=10 channels at the nth set of outputs from the CCD (after preprocessing including detection, binning and baseline subtraction), B0(n), B1(n), . . . , B3(n) is a vector representing the contribution (presence 1 or absence 0) from the M=4 bases, and Cij are coefficients of a known 10×4 matrix which maps the bases onto the detected channels, we then have the following relationship:
$I_i(n) = \sum_{j=0}^{3} C_{ij}\,B_j(n), \quad i = 0, 1, \ldots, 9$ (Eq. 1)
Eq. 1 can thus be rewritten as:
I(n)=CB(n) (Eq. 2)
Given a vector of intensities output by a CCD for each separation lane, the theory of determining the presence or absence of each of the M=4 bases from the R=10 wavelength channels is fairly well established. This is simply a particular case of an over-determined system in which a smaller number of unknowns is determined from a greater number of equations. After mathematical transformation, Eq. 2 can be written as:
$B(n) = (C^{T}C)^{-1}C^{T}I(n)$ (Eq. 3)
where B0(n), . . . , B3(n) now represent the unknown values of the individual bases as functions of time index n, each value reflecting the relative likelihood that the dye tagging the corresponding base is present; I0(n), I1(n), . . . , I9(n) are the fluorescence intensities of the ten channels; Cij is the coefficient of wavelength channel i under known base j; $C^{T}$ is the transpose of the matrix C; and $A = (C^{T}C)^{-1}C^{T}$ is the pseudo-inverse of matrix C. While C is a 10×4 matrix in the above analysis because a total of ten channels and four bases are used, in the general case C is an R×M matrix wherein R ≥ M, and R and M are both integers greater than 2.
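The least-squares solution of Eq. 3 can be illustrated with a minimal sketch in plain Python. The helper names (`transpose`, `matmul`, `solve`, `deconvolve`) and the small 5×3 matrix are hypothetical and chosen only for illustration; a real system would use the 10×4 matrix C determined by calibration.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b for a square matrix a by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("C^T C is singular; dye spectra are not independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def deconvolve(c, intensities):
    """Return B = (C^T C)^-1 C^T I for an R x M matrix c and R intensities."""
    ct = transpose(c)
    ctc = matmul(ct, c)  # M x M normal matrix
    cti = [sum(x * y for x, y in zip(row, intensities)) for row in ct]
    return solve(ctc, cti)


# Hypothetical 5-channel, 3-dye calibration matrix (each column is a dye spectrum).
C = [
    [1.0, 0.1, 0.0],
    [0.6, 0.5, 0.0],
    [0.2, 1.0, 0.1],
    [0.0, 0.4, 0.7],
    [0.0, 0.1, 1.0],
]
I = [row[1] * 200.0 for row in C]  # noise-free signal from the second dye only
B = deconvolve(C, I)               # second entry dominates, the others are near zero
```

Solving the normal equations directly, rather than forming the pseudo-inverse A explicitly, is a design choice of this sketch; when many samples share one matrix C, precomputing A once would avoid repeating the elimination.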
Typically, in prior art systems, the calibration matrix C is determined at the time the system is created. More particularly, calibration matrix C is specific to a set of dyes that are used, and is constant for all separation lanes in a system. If such a prior art system is then modified, such as by upgrading to a new set of optical filters, the calibration matrix C needs to be re-calibrated.
In general, different dye sets have different spectra. As a consequence, each dye set has a different calibration matrix. Consequently, a further disadvantage of using a single calibration matrix for a multi-lane separation system is that one cannot run multiple dye sets in different separation lanes.
In one aspect, the present teachings disclose a multi-lane electrophoretic separation apparatus which simultaneously utilizes multiple calibration matrices which calibrate for different dyes used to tag migrating species.
In another aspect, the present teachings disclose a multi-lane electrophoretic separation apparatus in which a calibration matrix is calculated in-situ.
In yet another aspect, the present teachings disclose a multi-lane electrophoretic separation apparatus in which a calibration matrix is calculated for each lane.
In yet another aspect, the present teachings disclose a method for calculating a calibration matrix from data acquired from a sample. The method comprises the steps of detecting emitted fluorescence spectra from a plurality of tagged migrating species, clustering the detected peaks into a number of groups, and then calculating calibration coefficients representative of at least some of the groups. After detection, and prior to grouping, the peaks may be culled to ensure that only peaks having a high probability of being associated with a particular group are used to calculate the calibration coefficients.
These and other features of the present teachings are set forth herein.
The skilled artisan will understand that the drawings described below are for illustration purposes only. The drawings are not intended to limit the scope of the present teachings in any way.
FIGS. 4a and 4b show intermediate results from peak spacing determination;
a represents plots for each cluster of nucleotides;
b represents histograms for the isolated peaks;
a shows coefficient plots for the three dyes used in conjunction with
b shows a histogram for the clustered peaks corresponding to three dyes of
FIGS. 11a-11c present experimental results for identifying proteins;
FIGS. 13a-13b present calibration coefficient matrices for each of four dye sets commonly used in DNA sequencing;
A system on which the present teachings can be used is an automated capillary electrophoresis system, such as is described in U.S. Pat. No. 6,027,627. A detector arrangement for such a system is shown in U.S. Pat. No. 5,998,796. The contents of both of these are incorporated by reference to the extent necessary to understand the present teachings.
The present teachings are described with reference to a detector system in which a total of P=30 pixels are binned into R=10 wavelength channels of Q=3 pixels each. The binning is done onboard the CCD array chip under software control. For DNA sequencing, the number of dyes M is 4—one for each nucleotide—and the spectrum of interest is in the range of 520 nm to 670 nm. Thus, the spectral resolution of the 10 wavelength channels is about 15 nm each. During data collection, for each of the 96 capillaries, 10 data points are offloaded each time the CCD array is read out and these values are stored for subsequent analysis. Furthermore, during an electrophoresis run, data from the CCD array is offloaded periodically, at a sample rate of f samples per second. Thus, during a run which lasts time T, a total of N=fT samples are taken.
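The binning arithmetic described above can be checked with a short sketch. In the described system the binning is performed on the CCD chip; the software function `bin_channels` below is a hypothetical stand-in used only to illustrate how P=30 pixels map onto R=10 channels of Q=3 pixels, and the pixel values are placeholders.

```python
def bin_channels(pixels, q):
    """Sum every q contiguous pixel values into one wavelength channel."""
    if len(pixels) % q != 0:
        raise ValueError("pixel count must be a multiple of the bin width")
    return [sum(pixels[i:i + q]) for i in range(0, len(pixels), q)]


P, Q = 30, 3
pixels = list(range(P))                  # placeholder pixel intensities 0..29
channels = bin_channels(pixels, Q)       # R = P / Q channels
R = len(channels)
channel_width_nm = (670 - 520) / R       # spectral width covered by each channel
```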
Step 204. Data Smoothing and Baseline Subtraction. The raw data are smoothed by the Savitzky-Golay method over a small window of adjacent points, e.g., 1, 3, 5, 7 or 9 points, as selected by the user; choosing 1 point leaves the data unsmoothed. The baselines of the smoothed data in the ten channels are then subtracted by software that runs on the processor associated with the detector system. For each channel, the software finds the local minimum of every local section, a section being, for example, 300 data points. A straight line, the baseline, connects the minima of two consecutive sections, and the baseline value is subtracted from the data values between the two minima. The new values after smoothing and baseline subtraction are stored for further processing. The order of data smoothing and baseline subtraction can be reversed.
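The baseline subtraction of Step 204 might be sketched as follows for a single channel. The function name `subtract_baseline` is hypothetical, and holding the baseline flat before the first minimum and after the last minimum is an assumption of this sketch, since the text does not say how the ends of the trace are handled.

```python
def subtract_baseline(trace, section=300):
    """Subtract a piecewise-linear baseline drawn through per-section minima."""
    # Locate the minimum of every section (the last section may be shorter).
    anchors = []
    for start in range(0, len(trace), section):
        chunk = trace[start:start + section]
        offset = min(range(len(chunk)), key=chunk.__getitem__)
        anchors.append(start + offset)

    corrected = []
    seg = 0
    for j, value in enumerate(trace):
        # Advance to the pair of minima that brackets point j.
        while seg + 1 < len(anchors) - 1 and j > anchors[seg + 1]:
            seg += 1
        if len(anchors) == 1 or j <= anchors[0]:
            base = trace[anchors[0]]       # assumption: flat before the first minimum
        elif j >= anchors[-1]:
            base = trace[anchors[-1]]      # assumption: flat after the last minimum
        else:
            left, right = anchors[seg], anchors[seg + 1]
            frac = (j - left) / (right - left)
            base = trace[left] + frac * (trace[right] - trace[left])
        corrected.append(value - base)
    return corrected
```

For a trace with a slow linear drift and one peak, the drift between the two section minima is removed while the peak height is preserved.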
Step 206—Peak-Picking in Time Domain. The properties of each wavelength channel after baseline subtraction are calculated before peak-picking. These properties include global average signal intensity, global average intensity deviation between two consecutive points, local maximum and local average deviation in a predetermined number of sections, preferably 40.
where Ij represents the intensity at point j, m is the total number of data points, s is the number of data points in a local section, and k is the starting point in the section.
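One way to compute the four per-channel quantities is sketched below. The exact normalizations are assumptions of this sketch (absolute differences between consecutive points, averaged over the number of consecutive pairs), and the function name `channel_statistics` is hypothetical; the symbols m, s and k follow the definitions above.

```python
def channel_statistics(trace, k, s):
    """Return (global average, global average deviation,
    section maximum, section average deviation) for one channel.

    trace: baseline-subtracted intensities I_j of the channel (m points)
    k:     index of the first point of the local section
    s:     number of points in the local section (at least 2)
    """
    m = len(trace)
    g_ave = sum(trace) / m
    # Assumed definition: mean absolute difference of consecutive points.
    g_dev = sum(abs(trace[j + 1] - trace[j]) for j in range(m - 1)) / (m - 1)
    section = trace[k:k + s]
    s_max = max(section)
    pairs = len(section) - 1
    l_dev = sum(abs(section[j + 1] - section[j]) for j in range(pairs)) / pairs
    return g_ave, g_dev, s_max, l_dev
```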
The above four parameters for each of the ten channels, at appropriate points along the sampled intensity values, are used in a heuristic algorithm for determining peaks. A point Ij in a given channel is considered to be a peak if it meets the following criteria:
(1) $I_j$ is a local maximum among five consecutive points: $I_j > I_{j-1} > I_{j-2}$ and $I_j > I_{j+1} > I_{j+2}$;
(2) $I_j$ is greater than 20% of the section maximum and is also greater than 40% of the global average intensity: $I_j > 0.2\,I_{s,max}$ and $I_j > 0.4\,I_{g,ave}$;
(3) At least one of the two edge deviations on either side of $I_j$ is greater than 70% of the section average deviation and greater than 20% of the global average deviation, i.e.,
right edge deviation: $(I_{j+1} - I_{j+3})/2 > 0.7\,I_{l,dev}$ and $(I_{j+1} - I_{j+3})/2 > 0.2\,I_{g,dev}$, or
left edge deviation: $(I_{j-1} - I_{j-3})/2 > 0.7\,I_{l,dev}$ and $(I_{j-1} - I_{j-3})/2 > 0.2\,I_{g,dev}$, or both;
(4) Peak assembly. This process removes any peak that occurs in only one channel (which is not physically sound), treats peak maxima that are shifted by one frame between channels as the same peak, and then determines the band location in the time domain. For a true band, peak maxima occur in more than one channel at a specific time, so at least two channels must show a peak at that time. Because baseline subtraction is carried out separately for each channel, a peak maximum may sometimes shift by one frame in the time domain; peaks whose positions differ by one frame in different color channels are therefore considered to be the same peak. Peak intensities in all of the channels are summed in the time domain, as shown in the figures.
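Criteria (1) through (3) above can be expressed compactly in code. The sketch below tests a single point of a single channel; the function name `is_peak` and the choice to return False near the ends of the trace, where the neighbouring points do not exist, are assumptions of this illustration. The four statistics are passed in as arguments.

```python
def is_peak(trace, j, s_max, g_ave, l_dev, g_dev):
    """Apply criteria (1)-(3) to point j of one baseline-subtracted channel."""
    if j < 3 or j + 3 >= len(trace):
        return False  # assumption: not enough neighbours to evaluate the edges
    i = trace
    # (1) local maximum among five consecutive points
    if not (i[j] > i[j - 1] > i[j - 2] and i[j] > i[j + 1] > i[j + 2]):
        return False
    # (2) above 20% of the section maximum and 40% of the global average
    if not (i[j] > 0.2 * s_max and i[j] > 0.4 * g_ave):
        return False
    # (3) at least one edge deviation exceeds both deviation thresholds
    right = (i[j + 1] - i[j + 3]) / 2
    left = (i[j - 1] - i[j - 3]) / 2

    def steep(edge):
        return edge > 0.7 * l_dev and edge > 0.2 * g_dev

    return steep(right) or steep(left)
```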
(a) Peak spacing in a local section. In the local section, peak spacing can be considered as a constant. After all of the peaks are determined from the last step shown in
(b) Identifying the overlapped peaks by peak-fitting software. After these peaks are identified, peak widths can be determined with peak-fitting software. In most electrophoretic separations, the peaks that appear in the early section of the electropherogram are usually very sharp and the peaks in the late section of the separation are usually wide. However, the peak widths within a small local section, for example, 300 frames, are essentially the same. This observation is very important for resolving temporally overlapping peaks in a local section. In DNA analysis, completely overlapping bands of different DNA sizes are rare in the time domain. Most of the overlap is confined to the rising or tailing edges of the peaks, where one band is entering the detection window while the other is moving out of it. In DNA separations, overlapping peaks are often 30% wider than single peaks. If the intensity of a peak in a channel is small, e.g., 20% of the local maximum intensity, the peak width is not calculated for that channel. The peak width and spacing at a specific moment can be calculated from the ten traces of the data.
Step 210. Peak Filtering & Spike Rejection. The width of a normal peak is usually between 4 and 20 frames. In contrast, spikes usually occupy a single frame and appear as very sharp peaks. Spikes can result from cosmic ray pickup by the camera, thermal noise due to overheating of the camera, and sample impurities.
Spacing criteria: If the spacing between two peaks is greater than 75% of the average peak spacing, the two peaks are retained for the coefficient calculation. Another technique is to use both the peak width and the spacing: if the average of the widths of two adjacent peaks at half intensity is bigger than the peak spacing, the two peaks are rejected from the calculation of the matrix coefficients. There are two cases of overlapping peaks, each with its own rationale. In the first case, the two peaks are from the same dye tagging the DNA molecules, and they are not separated because of poor separation resolution. We found that this case does not cause any problem in the matrix calculation since it involves the same dye. Nonetheless, we prefer to reject these peaks in the matrix calculation under the general peak-width rule. In the second case, the two peaks are from different dyes tagging DNA molecules whose sizes differ by 1 base pair. We found that these two peaks are usually somewhat separated in the time domain, but not completely resolved, so that the peak positions in the channels differ by a few (2-3) frames. Peak fitting will consider them to be overlapping peaks. Rejecting these bands is important for the matrix calculation.
Intensity criteria: If the maximum intensity of a peak is less than 20% of the average peak intensity in a local section, the peak is rejected from the calculation of the matrix coefficients, because such a small peak would cause significant errors in the matrix coefficients.
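The spacing, width and intensity criteria of Step 210 could be combined as in the following sketch. Several points are assumptions rather than statements of the described method: the spacing criterion is read as a fraction (75%) of the average spacing, the whole input list is treated as one local section, and the name `filter_peaks` and the tuple layout are hypothetical.

```python
def filter_peaks(peaks, min_spacing_frac=0.75, min_intensity_frac=0.2):
    """Return the peaks that survive the spacing, width and intensity criteria.

    peaks: list of (frame, half_height_width, intensity) tuples sorted by
           frame, all taken from one local section.
    """
    if len(peaks) < 2:
        return list(peaks)
    spacings = [b[0] - a[0] for a, b in zip(peaks, peaks[1:])]
    avg_spacing = sum(spacings) / len(spacings)
    avg_intensity = sum(p[2] for p in peaks) / len(peaks)

    rejected = set()
    for idx, gap in enumerate(spacings):
        mean_width = (peaks[idx][1] + peaks[idx + 1][1]) / 2
        too_close = gap <= min_spacing_frac * avg_spacing   # spacing criterion
        overlapping = mean_width > gap                       # width vs. spacing
        if too_close or overlapping:
            rejected.update((idx, idx + 1))                  # reject both peaks
    for idx, peak in enumerate(peaks):
        if peak[2] < min_intensity_frac * avg_intensity:     # intensity criterion
            rejected.add(idx)
    return [p for idx, p in enumerate(peaks) if idx not in rejected]
```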
Step 212. Band Categorization (Clustering). A band that has passed the above-described filtering process proceeds to the band categorizing (clustering) process. The band intensity is determined from a data channel that represents the total intensity over all of the wavelength channels. In most cases, this channel signal is taken from the 0th order of the grating, which is not spectrally dispersed. Alternatively, this channel can be created by summing the intensities over all of the wavelength channels.
Step 602—Normalizing Intensities. The following example is a set of data extracted from
Step 604. Band Clustering Starting with the Strongest Bands. The process of band pattern recognition starts from the strongest band, then moves to the next strongest, and so on. If a band shows up as peaks in several channels at a specific time, the intensity in each channel is normalized relative to the intensities in the other channels to give a matrix coefficient. There are advantages to processing the bands in order of decreasing intensity. Because of instrument noise, the coefficients calculated from the strongest bands are more accurate than those calculated from low-intensity bands. In addition, the leading and trailing portions of spurious peaks have a smaller overall effect on a stronger band than on a weaker band.
Step 606. Intensity culling; noise effects; low intensity and coefficients. In various implementations, the overall noise level from all noise sources, such as shot noise, CCD read noise, and CCD dark noise, is on the order of about 50 counts. Mathematical manipulation of the raw data, such as baseline subtraction and smoothing, can also introduce noise into the data. In some embodiments, the minimum usable data intensity is chosen to be about three times the noise level (150 counts), and this value is selected as a threshold. This criterion is consonant with conventional statistical principles. Thus, if the data intensity is lower than 150 counts, it preferably is not used for band categorizing. For example, in Table 1, the data in channels 0, 1, 2 and 3 are less than 150 counts, and so their coefficients 0.0114, 0.021, 0.0198, and 0.0158 are not used for categorizing. Such coefficients, which are likely to cause calculation errors, are called un-comparable coefficients and are discarded.
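Normalization and intensity culling for one band can be sketched as below. The function name `normalize_band` and the sample intensities in the usage note are hypothetical (they are not the values of Table 1); the 150-count threshold is the one given in the text.

```python
NOISE_THRESHOLD = 150  # counts; about three times the ~50-count noise level


def normalize_band(intensities, threshold=NOISE_THRESHOLD):
    """Return (coefficients, comparable) for the channel intensities of one band.

    coefficients[i] is intensity i divided by the band maximum;
    comparable[i] is False where the raw intensity is below the threshold,
    marking that coefficient as un-comparable for band categorizing.
    """
    peak = max(intensities)
    if peak <= 0:
        raise ValueError("band has no positive intensity")
    coefficients = [value / peak for value in intensities]
    comparable = [value >= threshold for value in intensities]
    return coefficients, comparable
```

For instance, a made-up band `[100, 140, 600, 3000, 6000, 4500, 1500, 600, 300, 120]` is normalized to a maximum of 1.0 in channel 4, and the coefficients of channels 0, 1 and 9 are flagged as un-comparable.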
Step 608—Band categorizing. If the difference in the comparable coefficients of two bands is less than 5% of maximum intensity (or 0.05 units), the two bands are clustered as being in the same category. Table 2 shows an example with 7 sets of coefficients, each set having been individually normalized. In the bands shown in Table 2, bands 1, 3, 4 are in the same category, because none of their coefficients differ by more than 0.05. However, band 1 and band 2 have coefficient differences of more than 0.05 units and so are considered to be in different categories. Using the 5% rule, it is evident that bands 5 and 6 are in the same category and band 7 forms its own category.
Upon considering the data in Table 2, one may think it adequate to always categorize bands based on the maximum normalized peak. This, however, is not always the best approach. In some cases, the channel having the maximum intensity can be in either of two close channels for the same type of bands. For example, if two bands have their coefficients of 0.9948, 1 and 1, 0.982 in, say, channels 2 and 3, respectively, one might consider the two bands to belong to different categories, if only a maximum intensity rule is used. However, a system using the 5% of the maximum intensity rule will always take these two peaks as the same type of bands.
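The 5% rule can be illustrated with a short sketch that uses the two-band example above. The functions `same_category` and `cluster_bands` are hypothetical names; comparing each new band against the first (strongest) member of a cluster is an assumption of this sketch, and all coefficients other than 0.9948, 1, 1 and 0.982 are invented for the illustration.

```python
def same_category(a, b, tol=0.05, comparable=None):
    """True if every comparable coefficient pair differs by less than tol."""
    if comparable is None:
        comparable = [True] * len(a)
    return all(abs(x - y) < tol for x, y, ok in zip(a, b, comparable) if ok)


def cluster_bands(bands, tol=0.05):
    """Greedy clustering; bands are assumed to be ordered strongest first."""
    clusters = []
    for band in bands:
        for cluster in clusters:
            # Assumption: compare against the strongest band of the cluster.
            if same_category(cluster[0], band, tol):
                cluster.append(band)
                break
        else:
            clusters.append([band])
    return clusters


# Channels 2 and 3 carry the example values; the other entries are made up.
band_a = [0.20, 0.60, 0.9948, 1.0]
band_b = [0.21, 0.58, 1.0, 0.982]
band_c = [1.0, 0.50, 0.10, 0.02]
clusters = cluster_bands([band_a, band_b, band_c])
```

Here `band_a` and `band_b` peak in different channels, yet they fall into the same cluster because no coefficient differs by 0.05 or more, whereas a maximum-channel rule would have separated them.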
On occasion, a computer may automatically cluster the bands into more dye spectra than the number of dyes used in the electrophoresis. This results in a fake cluster 720, as seen in
Step 610—Standard deviation rejection. The average and the standard deviation of each set of coefficients are calculated after the band categorizing process. If the deviations of the normalized coefficients for a given set are greater than 130% of the standard deviation, the corresponding band should be rejected for the coefficient calculation.
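A possible form of the standard deviation rejection is sketched below. The use of the population standard deviation, the per-channel comparison, and the function name `reject_outlier_bands` are assumptions of this sketch; only the 130% factor comes from the text.

```python
from statistics import mean, pstdev


def reject_outlier_bands(bands, factor=1.3):
    """Keep the bands of one cluster whose normalized coefficients all lie
    within factor * (standard deviation) of the cluster mean, per channel."""
    channels = list(zip(*bands))
    means = [mean(channel) for channel in channels]
    stds = [pstdev(channel) for channel in channels]  # assumption: population std
    kept = []
    for band in bands:
        if all(abs(v - mu) <= factor * sd for v, mu, sd in zip(band, means, stds)):
            kept.append(band)
    return kept
```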
Step 612—Coefficient calculation. After clustering, the coefficients of the sets within each of four clusters (one cluster for each nucleotide) can be plotted, as seen in
Step 216. Color (Spectral) Deconvolution. During use, the pseudo-inverse of coefficient matrix C calculated for each separation lane is used to map a detected set of intensities from that separation lane, onto a decision vector B, as given in Eq. 3. The position of the highest value in decision vector B corresponds to the identity of the dye.
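Selecting the dye from the decision vector amounts to taking the position of its largest entry. In the sketch below, the function name `call_base` and the ordering of the bases along the vector are hypothetical; the actual ordering depends on how the columns of C were assigned for the dye set in use.

```python
def call_base(decision, bases="GATC"):
    """Return the base whose entry in the decision vector B is largest."""
    index = max(range(len(decision)), key=decision.__getitem__)
    return bases[index]
```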
Applications of the teachings described above are now illustrated using various examples.
Experimental conditions: capillary ID 75 μm, OD 200 μm, total length 80 cm, effective length (from injection end to detection window) 55 cm. Separation voltage 150 V/cm (12 kV). 96 capillaries are arranged in parallel on a plane to form a capillary array.
Injection: 6 kV for 1 min. DNA sequencing sample: labeled with PE Biosystem BigDye.
Excitation: an all-line Ar ion laser emitting at 450-520 nm (514.5 nm and 488 nm are the two strongest emission lines). Laser light is spread over the 96-capillary array by cylindrical lenses. Detection: a Nikon camera lens with a focal length of 85 mm and F1.4 is used to collect the fluorescence from the capillary array. The fluorescence then passes through a longpass optical filter (cutoff 525 nm) (Optical Omaga Inc., CT) and a transmission grating (Edmund Scientific, NY) and impinges on a CCD camera (PixelVision, WA). The resolution of the system is about 5 nm/pixel. Every three consecutive pixels are binned, so each channel represents the fluorescence intensity over 15 nm.
Gel and separation conditions. The gel is a 5% linear polymer gel with 7 M. The DNA in
a shows the spectra profiles of several resulting DNA bands. The bands are classified into four categories, each of which corresponds to one of the four bases.
A similar setup has been used for capillary zone electrophoresis. The protein samples are injected into the individual capillaries of a 96-capillary array. Capillaries of ID 50 μm, OD 150 μm, total length 35 cm and effective length 25 cm are used for the experiment. The separation takes place at 150 V/cm. A borate buffer at pH 10.5 is the separation medium. The samples are mixtures of proteins injected with a vacuum (hydrodynamic injection). One standard with an emission spectrum different from those of the proteins is added to the sample for quantitative analysis. The data of 6 wavelengths are collected to resolve the two unknowns as in
The teachings discussed herein have been used to automatically obtain the calibration coefficients for different dye sets commonly used in DNA sequencing. The methodology includes peak classification, initial peak rejection, coefficient determination, refined peak rejection, and color de-convolution.
(1) Peak classification. To automatically calibrate a single dye set, a tagged DNA sample was introduced into a single capillary and electrophoresed. Approximately 500 bases in a single electropherogram were detected, each base giving rise to a peak within the set of 10 channels. The peaks were then classified according to the channel in which their intensity was a maximum. First, peak positions and intensities were recorded and metrics such as average peak spacing in the time domain and average peak intensity were also calculated. In general, when a peak shows up in one channel, a peak often shows up in the other channels in the time domain at the same time. This is because each member within a dye family causes some overlap among the 10 contiguous channels. At the specific time that a peak shows up, the intensities of the peaks over the ten wavelength channels were compared to determine in which of the 10 channels a peak exhibited maximum intensity. The channel numbers in which the maximum intensity of a peak was found were recorded for each peak, and this was histogrammed.
(2) Initial peak rejection. Three types of peaks were rejected prior to the calculation of calibration coefficients. First, peaks whose maximum intensities did not fall into any of channels 2, 4, 6, 8 were rejected and eliminated from consideration. Second, peaks which overlapped in the time domain were also rejected. Two peaks were considered to overlap if the spacing between two adjacent peaks in time domain was smaller than 80% of average spacing distance between peaks. Third, low intensity peaks, defined as those peaks having a maximum peak intensity less than 20% of the average peak intensity, were also rejected from further consideration. After initial peak rejection, only about 300 of the original 500 peaks remained as candidates for use in calculating calibration coefficients.
(3) Calculation of the average coefficients and their standard deviation. The maximum intensity of the remaining 300 or so peaks was first normalized to 1.0000, the normalization being done in the wavelength domain. In other words, if the maximum for a peak was in channel 2, indicating a “G” base for a particular set of dyes, the 10 coefficients for the “G” base for this particular peak were calculated as the ratio of the intensity in each of the 10 channels to the intensity found in channel 2 for that peak. Thus, the set of calibration coefficients for base G is derived from those remaining 300 peaks whose maximum intensity was found in channel 2 by normalizing each such peak in the wavelength domain and taking the averages of each of the 10 sets of coefficients. Similarly, the set of calibration coefficients for the A, T and C bases were calculated from those remaining 300 peaks whose maximum intensities were found in channels 4, 6 and 8, respectively. The 10 group coefficient averages and the 10 group standard deviations for each of the four groupings (G, A, T and C) are then calculated for further processing.
(4) Additional peak rejection. If the difference between any one of the 10 normalized coefficients for a peak within a particular group (G, A, T or C) and the group average for that coefficient is greater than a predetermined multiple (e.g., 1.5 times) of the group standard deviation for that coefficient, that peak is rejected and not used in coefficient calculations.
(5) Matrix formation. After the additional peak rejections have been performed, the average coefficients for each group are calculated to establish the calibration matrix.
(6) Color deconvolution. Given the output from the detector, Equation 3 is used in conjunction with the appropriate calibration matrix to calculate the four base intensities. This results in color deconvolution of the signals.
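The classification, initial rejection by maximum channel, normalization and averaging steps described above can be sketched together. The function name `build_calibration_matrix` is hypothetical, the overlap, low-intensity and standard-deviation rejections are omitted for brevity, and the default channel tuple (2, 4, 6, 8) simply mirrors the example in step (2).

```python
from collections import Counter, defaultdict


def build_calibration_matrix(peaks, dye_channels=(2, 4, 6, 8)):
    """Return (matrix, histogram) from per-peak channel intensity lists.

    matrix is R x M: column m is the averaged, normalized spectrum of the
    peaks whose maximum intensity falls in dye_channels[m].
    histogram counts how many peaks have their maximum in each channel.
    """
    histogram = Counter()
    groups = defaultdict(list)
    for spectrum in peaks:
        top = max(range(len(spectrum)), key=spectrum.__getitem__)
        histogram[top] += 1
        if top in dye_channels:  # peaks with maxima elsewhere are rejected
            groups[top].append([value / spectrum[top] for value in spectrum])
    n_channels = len(peaks[0])
    columns = []
    for channel in dye_channels:
        members = groups[channel]
        if not members:
            raise ValueError(f"no peaks found with maximum in channel {channel}")
        columns.append([sum(m[i] for m in members) / len(members)
                        for i in range(n_channels)])
    matrix = [list(row) for row in zip(*columns)]  # transpose to R x M
    return matrix, histogram
```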
Calibration coefficient matrices were calculated for the SpectruMedix Model SCE 9610 Genetic Analysis System for each of the following dye sets: ABI BigDye terminator dye set, ABI Rhodamine terminator dye set, Amersham ET primer dye set, and Baylor Bodipy dye set. The resulting matrices are shown in
As is known to those skilled in the art, Bodipy dyes have a narrow emission spectrum and small wavelength spacing (20 nm) between adjacent dyes. To accommodate Bodipy dyes, only two adjacent pixels, rather than three, were binned so as to give high spectral resolution. The new matrix, which is based on two-pixel binning for each channel, dramatically enhances results using Bodipy dyes for DNA sequencing.
Because each lane in a multi-lane electrophoretic separation system can have its own calibration matrix, one can use multiple dye sets at the same time, only a single dye set being used to tag the sample in each lane. This allows one to divide a sample into two or more moieties, tag each moiety with a different dye set, and compare the results of performing separation of the sample, as tagged with different dye sets. Thus, one can directly compare the performance of different dye sets without changing instrument set-up, such as by using a different set of filters. In samples that have been separated using an array of capillaries, different combinations of the dye sets have been used to tag samples, with each capillary having therein a sample tagged with only one dye set.
In the present teachings, various heuristic and statistical techniques are used to select peaks whose underlying data are used to form calibration matrices, especially in DNA sequencing applications. An alternative approach for selecting peaks to be used for coefficient calculation is to identify solitary peaks in topographic plots of time-frequency plots.
The plots of
Identification of the solitary peaks, and direct basecalling, can either be performed visually by humans, or automatically by using machine-based image processing or pattern recognition techniques well known to those skilled in the art of computer vision. Thus, in the case of machine-based processing, morphological filters can be used as templates to identify the features seen in
The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described in any way.
While the present teachings are described in conjunction with various embodiments, it is not intended that the present teachings be limited to such embodiments. On the contrary, the present teachings encompass various alternatives, modifications, and equivalents, as will be appreciated by those of skill in the art.
This application is a continuation of U.S. patent application Ser. No. 11/047,355, filed Jan. 31, 2005, which claims priority to U.S. patent application Ser. No. 09/676,526, now U.S. Pat. No. 6,863,791, filed Oct. 2, 2000, and to U.S. Provisional Application No. 60/231,574, filed Sep. 11, 2000.
Number | Date | Country |
---|---|---|
60231574 | Sep 2000 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 11047355 | Jan 2005 | US |
Child | 12154598 | | US |
Parent | 09676526 | Oct 2000 | US |
Child | 11047355 | | US |