The present invention is related to analysis of experimental data and, in particular, to a method and system for identifying biopolymer-sequence abnormalities, including amplifications and deletions of subsequences of the DNA sequence of a chromosomal DNA, in samples of interest compared to control samples by array-based comparative hybridization.
A great deal of basic research has been carried out to elucidate the causes and cellular mechanisms responsible for transformation of normal cells to a precancerous or cancerous state, and for the growth of cancerous tissues and metastasis of cancerous tissues. Enormous strides have been made in understanding various causes and cellular mechanisms of cancer, and this detailed understanding is currently providing new and useful approaches for preventing, detecting, and treating cancer.
There are myriad different types of causative events and agents associated with the development of cancer. Moreover, there are many different types of cancer, and many different patterns of cancer development for each of the many different types of cancer. Although initial hopes and strategies were predicated on finding one or a few basic, underlying causes and mechanisms, researchers have, over time, recognized that, in fact, the term “cancer” encompasses a very large number of different diseases. Nonetheless, there do appear to be certain common cellular phenomena associated with cancer. One common phenomenon, evident in many different types of cancer, is the onset of genetic instability in precancerous tissues, and progressive genomic instability as cancerous tissue develops. While there are many different types and manifestations of genomic instability, a change in the number of copies of particular DNA subsequences within a cancerous cell may be a fundamental indication of genomic instability. Various techniques have been developed to detect and at least partially quantify amplification and deletion of chromosomal DNA subsequences in cancerous cells. One technique is referred to as “comparative genomic hybridization.” Comparative genomic hybridization (“CGH”) can offer striking, visual indications of chromosomal-DNA-subsequence amplification and deletion, in certain cases, but, like many biological and biochemical analysis techniques, is subject to significant noise and sample variation, leading to problems in quantitative analysis of CGH data. 
Research scientists, diagnosticians, and medical personnel have recognized the need for CGH-data analysis techniques to more accurately quantify DNA-subsequence-copy variation in diseased tissue samples, including cancerous cells, as well as techniques for analyzing CGH-data, and visualizing analytical results, obtained by applying CGH techniques to samples from multiple sources in order to identify possible genetic bases for various observed characteristics and conditions related to the sources.
Embodiments of the present invention include methods and systems for analysis of comparative hybridization data, including comparative genomic hybridization (“CGH”) data, such as CGH data obtained from microarray experiments. Various embodiments of the present invention include parametric and non-parametric normalization methods for CGH data and methods for identifying sets of one or more contiguous chromosomal DNA subsequences that are amplified or deleted in cells from particular tissue samples. When combined with well-designed microarray-based experimental systems, method embodiments of the present invention provide markedly increased quantitative precision in the identification of chromosomal abnormalities, including amplified and deleted DNA subsequences, based on CGH data. Additional embodiments of the present invention are directed to detecting, by comparative hybridization, deletions, amplifications, and other changes to general biopolymer sequences, including biopolymers other than DNA.
FIGS. 18A-F show screen captures that illustrate a user interface developed to provide visual and interactive access to methods of CGH data analysis and results of the analysis as part of a CGH-data-analysis system.
Embodiments of the present invention provide methods and systems for analysis of comparative genomic hybridization (“CGH”) data. The methods and systems are general, and applicable to comparative hybridization data obtained from a variety of different experimental approaches and protocols. Described embodiments, below, are particularly applicable to microarray-based CGH data, obtained from high-resolution microarrays containing oligonucleotide probes that provide relatively uniform and closely-spaced coverage of the DNA sequence or sequences representing one or more chromosomes. One application for methods of the present invention is for detecting amplified and deleted genes. Examples are discussed below. However, any subsequence of chromosomal DNA may be amplified or deleted, and CGH techniques may be applied to generally detect amplification or deletion of chromosomal DNA subsequences. Comparative hybridization methods can be used to detect amplification or deletion of subsequences of any information-containing biopolymer, and other sequence changes and abnormalities.
Prominent information-containing biopolymers include deoxyribonucleic acid (“DNA”), ribonucleic acid (“RNA”), including messenger RNA (“mRNA”), and proteins.
In cells, DNA is generally present in double-stranded form, in the familiar DNA-double-helix form.
A gene is a subsequence of deoxyribonucleotide subunits within one strand of a double-stranded DNA polymer. A gene can be thought of as an encoding that specifies, or a template for, construction of a particular protein.
In eukaryotic organisms, including humans, each cell contains a number of extremely long, DNA-double-strand polymers called chromosomes. Each chromosome can be thought of, abstractly, as a very long deoxyribonucleotide sequence. Each chromosome contains hundreds to thousands of subsequences corresponding to genes. The exact correspondence between a particular subsequence identified as a gene and the protein encoded by the gene can be somewhat complicated, for reasons outside the scope of the present invention. However, for the purposes of describing embodiments of the present invention, a chromosome may be thought of as a linear DNA sequence of contiguous deoxyribonucleotide subunits that can be viewed as a linear sequence of DNA subsequences. In certain cases, the subsequences are genes, each gene specifying a particular protein. But these embodiments are far more general. Amplification and deletion of any DNA subsequence or group of DNA subsequences can be detected by the described methods, regardless of whether or not the DNA subsequences correspond to protein-sequence-specifying, biological genes, to DNA subsequences specifying various types of non-protein-encoding RNAs, or to other regions with defined biological roles. Moreover, these methods may be applied to other types of biopolymers to detect changes in biopolymer-subsequence occurrence. The term “gene” is used in the following as a notational convenience, and should be understood as simply an example of a “biopolymer subsequence.” Similarly, although the described embodiments are directed to analyzing DNA chromosomal sequences, the sequences of any information-containing biopolymer are analyzable by methods of the present invention. Therefore, the term “chromosome,” and related terms, are used in the following as a notational convenience, and should be understood as an example of a biopolymer or biopolymer sequence.
As shown in
Although differences between genes and mutations of genes may be important in the predisposition of cells to various types of cancer, and related to cellular mechanisms responsible for cell transformation, cause-and-effect relationships between different forms of genes and pathological conditions are often difficult to elucidate and prove, and very often indirect. However, other genomic abnormalities are more easily associated with pre-cancerous and cancerous tissues. Two prominent types of genomic aberrations include gene amplification and gene deletion.
Generally, deletion of multiple, contiguous genes is observed, corresponding to the deletion of a substantial subsequence from the DNA sequence of a chromosome. Much smaller subsequence deletions may also be observed, leading to mutant and often nonfunctional genes. A gene deletion may be observed in only one of the two chromosomes of a chromosome pair, in which case a gene deletion is referred to as being heterozygous. A second chromosomal abnormality in the altered genome shown in
Changes in the number of gene copies, either by amplification or deletion, can be detected by comparative genomic hybridization (“CGH”) techniques.
CGH data may be obtained by a variety of different experimental techniques. In one technique, DNA fragments are prepared from tissue samples and labeled with a particular chromophore. The labeled DNA fragments are then hybridized with single-stranded chromosomal DNA from a normal cell, and the single-stranded chromosomal DNA is then visually inspected via microscopy to determine the intensity of light emitted from labels associated with hybridized fragments along the length of the chromosome. Areas with relatively increased intensity reflect regions of the chromosome amplified in the corresponding tissue chromosome, and regions of decreased emitted signal indicate deleted regions in the corresponding tissue chromosome. In other techniques, normal DNA fragments labeled with a first chromophore are competitively hybridized to a normal single-stranded chromosome with fragments isolated from abnormal tissue, labeled with a second chromophore. Relative binding of normal and abnormal fragments can be detected by ratios of the intensities of light emitted at the two different wavelengths corresponding to the two different chromophore labels.
A third type of CGH is referred to as microarray-based CGH (“aCGH”).
The microarray may be exposed to sample solutions containing fragments of DNA. In one version of aCGH, an array may be exposed to fragments, labeled with a first chromophore, prepared from abnormal tissue and to fragments, labeled with a second chromophore, prepared from normal tissue. The normalized ratio of signal emitted from the first chromophore versus signal emitted from the second chromophore for each feature provides a measure of the relative abundance of the portion of the normal chromosome corresponding to the feature in the abnormal tissue versus the normal tissue. In the hypothetical microarray 1002 of
Microarray-based CGH data obtained from well-designed microarray experiments provide a relatively precise measure of the relative or absolute number of copies of genes in cells of a sample tissue. Sets of aCGH data obtained from pre-cancerous and cancerous tissues at different points in time can be used to monitor genome instability in particular pre-cancerous and cancerous tissues. Quantified genome instability can then be used to detect and follow the course of particular types of cancers. Moreover, quantified genome instabilities in different types of cancerous tissue can be compared in order to elucidate common chromosomal abnormalities, including gene amplifications and gene deletions, characteristic of different classes of cancers and pre-cancerous conditions. Unfortunately, biological data can be extremely noisy, with the noise obscuring underlying trends and patterns. Scientists, diagnosticians, and other professionals have therefore recognized a need for statistical methods for normalizing and analyzing aCGH data, in particular, and CGH data in general, in order to identify signals and patterns indicative of chromosomal abnormalities that may be obscured by noise arising from many different kinds of experimental and instrumental variations.
One approach to ameliorating the effects of high noise levels in CGH data involves, as a first step, normalizing sample-signal data by using control-signal data. In many aCGH experiments, control samples, including chromosomal DNA fragments or copies of chromosomal DNA fragments isolated from normal tissues, are hybridized to arrays along with DNA fragments or copies isolated or produced from abnormal or diseased tissues for which a measure of chromosomal alterations or abnormalities is sought. Often, multiple control samples are available. Therefore, rather than simply using the log ratio of the signal generated by hybridization of fragments from diseased tissue to the signal generated from one control sample, the signal generated from diseased tissue can be normalized using multiple control-sample-derived signals. It should be noted that the methods of the present invention may be applied to normalization of any signals produced from any type of sample, including diseased-tissue samples, samples produced by particular experiments, samples produced at particular times during particular experiments, and other samples of interest. The phrase “diseased tissue sample” is therefore interchangeable, in the following discussions, with the phrase “sample of interest.”
In a more general case, an aCGH array may contain a number of different features, each feature generally containing a particular type of probe, each probe targeting a particular chromosomal DNA subsequence indexed by index k that represents a genomic location. A subsequence indexed by index k is referred to as “subsequence k.” One can define the signal generated for subsequence k by either a control or diseased-tissue sample j as the sum of the log-ratio signals from the different probes targeting subsequence k divided by the number of probes targeting subsequence k or, in other words, the average log-ratio signal value generated from the probes targeting subsequence k, as follows:
$$C(k,j)=\frac{1}{\mathrm{num\_features}_k}\sum_{b\,\in\,\mathrm{features}(k)}C(b,j)$$
where $\mathrm{num\_features}_k$ is the number of features that target the subsequence k;
features(k) is the set of features that target the subsequence k; and
C(b,j) is the normalized signal log ratio for sample j at feature b.
In the case where a single probe targets a particular subsequence, k, then no averaging is needed. In the following discussion, normalization of signals for a solution of interest is discussed, such as a solution of DNA fragments obtained from a particular tissue or experiment. A solution of interest may be subject to a single CGH analysis, or a number of identical samples derived from the solution of interest may be each separately subject to CGH analysis, and the signals produced by the analysis for each subsequence k may be averaged to produce a single, averaged, signal data set for the solution of interest.
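The per-subsequence averaging described above can be illustrated with a minimal C++ sketch. The function name `averageLogRatio` and the representation of the feature signals as a vector of doubles are assumptions made for illustration only and are not part of the described embodiments.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (name assumed for illustration): averages the
// log-ratio signals C(b, j) of all features b that target a single
// subsequence k, for one sample j, giving C(k, j).
double averageLogRatio(const std::vector<double>& featureLogRatios) {
    // At least one feature must target the subsequence.
    assert(!featureLogRatios.empty());
    double sum = 0.0;
    for (double c : featureLogRatios) {
        sum += c;
    }
    return sum / static_cast<double>(featureLogRatios.size());
}
```

When a single feature targets the subsequence, the function simply returns that feature's log ratio, consistent with the observation that no averaging is needed in that case.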
To re-emphasize, each aCGH data point is generally a log ratio of signals read from a particular feature of a microarray that contains probes targeting a particular subsequence, the log ratio representing the ratio of the signal emitted from a first label used to label fragments of a diseased tissue to the signal emitted from a second label used to label fragments of a normal, control tissue. Both the diseased-tissue fragments and the normal, control fragments hybridize to normal-tissue-derived probe molecules on the microarray. A normal tissue or sample may be any tissue or sample selected as a control tissue or sample for a particular experiment. The term “normal” does not necessarily imply that the tissue or sample represents a population average, a non-diseased tissue, or any other subjective or objective classification.
Having averaged signals produced from features containing identical probes, and having obtained a single, or a single averaged, data set for a solution of interest, such as for a particular diseased tissue, and having obtained multiple, control data sets, the multiple, control data sets can be used together to normalize the data set for the solution of interest in order to generate better signal-to-noise ratios for subsequence amplification and deletion indications, and indications of other sequence abnormalities. Using multiple control data sets for normalization, rather than a single control data set, produces more statistically reliable indications of sequence abnormalities.
Next, a mean control signal for a particular subsequence k can be computed from the signal generated for subsequence k by a number J of control samples 1, . . . , J as follows:
$$\mu_k=\frac{1}{J}\sum_{j=1}^{J}C(k,j)$$
where J is the number of normal, control samples.
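Similarly, the standard deviation for the J control signals for subsequence k can be computed as follows, where the sample form with divisor J−1 is shown and the population form with divisor J may alternatively be used:
$$\sigma_k=\sqrt{\frac{1}{J-1}\sum_{j=1}^{J}\big(C(k,j)-\mu_k\big)^{2}}$$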
Using $\mu_k$ and $\sigma_k$, a normalized signal for a particular subsequence k generated by a diseased-tissue sample s can be computed as:
$$C_z(k,s)=\frac{C(k,s)-\mu_k}{\sigma_k}$$
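The mean-and-standard-deviation-based normalization can be sketched in C++ as follows. This is an illustrative sketch only: the names `ControlStats`, `controlStats`, and `zNormalize` are hypothetical, and the use of the sample standard deviation (divisor J−1) is an assumption, since other ways of computing variances may equally be used.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for the control statistics of one subsequence k.
struct ControlStats {
    double mean;    // mu_k
    double stddev;  // sigma_k
};

// Computes mu_k and sigma_k from the J control signals C(k, 1..J).
// Assumption: the sample standard deviation (divisor J - 1) is used.
ControlStats controlStats(const std::vector<double>& controls) {
    assert(controls.size() >= 2);  // need at least two controls for J - 1
    const double J = static_cast<double>(controls.size());
    double sum = 0.0;
    for (double c : controls) sum += c;
    const double mean = sum / J;
    double squares = 0.0;
    for (double c : controls) squares += (c - mean) * (c - mean);
    return {mean, std::sqrt(squares / (J - 1.0))};
}

// Computes C_z(k, s) = (C(k, s) - mu_k) / sigma_k.
double zNormalize(double sampleSignal, const ControlStats& stats) {
    assert(stats.stddev > 0.0);  // identical controls give no usable scale
    return (sampleSignal - stats.mean) / stats.stddev;
}
```

For example, control signals 1.0, 2.0, and 3.0 give a mean of 2.0 and a standard deviation of 1.0, so a sample signal of 4.0 normalizes to 2.0.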
In cases where there are not a sufficient number of control-sample signals to compute a reliable mean and standard deviation for generation of the normalized signal Cz(k,s) for a particular diseased-tissue sample, a rank-ordering-based normalization may be carried out. First, the position of an element q within an ordered set of values X, such that q ∈ X, is defined, as follows:
$$\mathrm{position}(q,X)=i$$
where $X=\{x_1,x_2,\ldots,x_m\}$;
$x_1\le x_2\le\cdots\le x_m$; and
$q=x_i$.
The normalized signal produced by diseased-tissue sample s for a particular subsequence k is the position, or rank, of the signal generated for the subsequence k by diseased-tissue sample s within the ordered set C that includes a number of signals generated by control samples j1, . . . , jJ as well as by the diseased-tissue sample s, as follows:
$$C_r(k,s)=\mathrm{position}(C(k,s),C)$$
where s is a particular sample; and
$C=\{C(k,j_1),\ldots,C(k,j_J),C(k,s)\}$, ordered by increasing signal value.
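The rank-order-based normalization can be illustrated by the following minimal C++ sketch. The function name `rankNormalize` is hypothetical, and the handling of ties (a tied sample signal receives the lowest applicable rank) is an assumption, since tie handling is not specified above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns the 1-based position of the sample signal
// C(k, s) within the ordered set formed by the control signals for
// subsequence k together with the sample signal itself.
// Assumption: on ties, the sample receives the lowest applicable rank.
int rankNormalize(double sampleSignal, const std::vector<double>& controls) {
    assert(!controls.empty());  // at least one control sample is required
    int rank = 1;
    for (double c : controls) {
        if (c < sampleSignal) {
            ++rank;  // each strictly smaller control pushes the sample up
        }
    }
    return rank;
}
```

With J control samples the returned rank lies between 1 and J+1, so no explicit sort is needed to obtain the position.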
Thus, as discussed above, one can compute either a mean-and-standard-deviation-based normalized diseased-tissue signal for a particular subsequence k, Cz, or a rank-order-based normalized signal generated from a diseased-tissue sample s, Cr. The former normalization is used when there is a sufficient number of control samples to determine a statistically reliable mean and standard deviation. Otherwise, the rank-order method is employed.
Subsequence deletions and amplifications generally span a number of contiguous subsequences of interest, such as genes, control regions, or other identified subsequences, along a chromosome. It therefore makes sense to analyze aCGH data in a chromosome-by-chromosome fashion, statistically considering groups of consecutive subsequences along the length of the chromosome in order to more reliably detect amplification and deletion. Specifically, it is assumed that the noise of measurement is independent for each subsequence along the chromosome, and independent for distinct probes. Statistical measures are employed to identify sets of consecutive subsequences for which deletion or amplification is relatively strongly indicated. This tends to ameliorate the effects of spurious, single-probe anomalies in the data. A parametric approach can be used when the measurement noise along the chromosome is independent for distinct probes and approximately normally distributed. A non-parametric approach is used when these assumptions cannot be made.
For either method, one considers the measured, normalized, or otherwise processed signals for subsequences along the chromosome of interest to be a vector V as follows:
$$V=\{v_1,v_2,\ldots,v_n\}$$
where $v_k=C_z(k,s)$ or $v_k=C_r(k,s)$.
Note that the vector, or set V, is sequentially ordered by position of subsequences along the chromosome. In the parametric approach, a statistic S is computed for each interval I of subsequences with fixed size along the chromosome as follows:
$$S(I)=\frac{1}{\sqrt{j-i+1}}\sum_{k=i}^{j}v_k$$
where $I=\{v_i,\ldots,v_j\}$; and
$j-i+1$ is the number of subsequences in interval I.
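Under a null model assuming no sequence aberrations, the statistic S has a normal distribution of values with mean=0 and variance=1, independent of the number of probes included in each interval I. The statistical significance of the normalized signals for the subsequences in an interval I can be computed by a standard probability calculation based on the area under the normal distribution curve:
$$\mathrm{Prob}\big(S(I)>z\big)=\frac{1}{\sqrt{2\pi}}\int_{z}^{\infty}e^{-t^{2}/2}\,dt$$
where z is the observed value of the statistic for interval I. By symmetry of the normal distribution, the same expression gives the lower-tail probability Prob(S(I) < −z).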
Alternatively, the magnitude of S(I) can be used as a basis for determining alteration.
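The parametric calculation can be illustrated with the following C++ sketch, which assumes that the statistic is the sum of the z-normalized signals in the interval divided by the square root of the interval length, so that it has unit variance under the null model. The function names and the 0-based start index are assumptions made for illustration; the tail area is obtained from the standard library function `std::erfc`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes S(I) for the interval of `length`
// consecutive entries of v beginning at 0-based index `start`.
// Assumption: v holds z-normalized signals C_z(k, s), so that dividing
// the sum by sqrt(length) gives unit variance under the null model.
double intervalStatistic(const std::vector<double>& v,
                         std::size_t start, std::size_t length) {
    assert(length > 0 && start + length <= v.size());
    double sum = 0.0;
    for (std::size_t k = start; k < start + length; ++k) {
        sum += v[k];
    }
    return sum / std::sqrt(static_cast<double>(length));
}

// Upper-tail area of the standard normal distribution,
// P(Z > z) = 0.5 * erfc(z / sqrt(2)).
double upperTailProbability(double z) {
    return 0.5 * std::erfc(z / std::sqrt(2.0));
}
```

A large positive statistic with a small upper-tail probability suggests amplification. For a large negative statistic, the lower-tail probability may be obtained as `upperTailProbability(-s)`, by symmetry.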
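A non-parametric approach employs the rank-order-based normalized signal values for a diseased-tissue sample and a number of control samples. A rank sum, rank(I), can be computed for a given interval I by adding together the rank-order-based normalized signals for each of the subsequences in the interval, and the expected value of rank(I) under the null model is straightforwardly computed, as follows:
$$\mathrm{rank}(I)=\sum_{v_k\in I}v_k,\qquad v_k=C_r(k,s)$$
$$E\big[\mathrm{rank}(I)\big]=\frac{|I|\,(m+1)}{2}$$
where |I| is the number of subsequences in interval I; and
m is the number of samples ranked together, namely the control samples plus the diseased-tissue sample.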
In order to statistically consider and evaluate intervals for putative amplification and deletion, one needs to compute the probability of large deviations from the expected value. To do this, the r-fold convolution of the uniform distribution on {1, . . . ,m} is computed. The probability Tm(r,z) is the probability that r independent random variables uniformly distributed in {1, . . . ,m} sum to exactly the value z. This probability can be recursively computed as follows:
$$T_m(1,z)=\begin{cases}1/m, & 1\le z\le m\\ 0, & \text{otherwise}\end{cases}$$
$$T_m(r,z)=\frac{1}{m}\sum_{x=1}^{m}T_m(r-1,\,z-x),\qquad r>1$$
The exact probabilities Tm(r,z) can be used to compute the probability that a sum of r independent random variables X1, . . . , Xr uniformly distributed in {1, . . . ,m} is greater than a particular value y, r≤y≤r·m, as follows:
$$\mathrm{Prob}\Big(\sum_{i=1}^{r}X_i>y\Big)=\sum_{z=y+1}^{r\cdot m}T_m(r,z)$$
A similar sum of Tm(r,z) exact probabilities can be used to compute the probability that a sum of r independent random variables uniformly distributed in {1, . . . ,m} is less than a particular value y, r≤y≤r·m, or within an arbitrary range of values.
In a fashion similar to the probability computation using the parametric approach, discussed above, the probability that a sum of random variables, each uniformly distributed from 1 to m, is greater than an observed rank(I) can be used to compute the statistical significance of a relatively high rank(I) value corresponding to an amplification of subsequences within an interval I, as follows:
$$\mathrm{Prob}\Big(\sum_{i=1}^{|I|}X_i>\mathrm{rank}(I)\Big)=\sum_{z=\mathrm{rank}(I)+1}^{|I|\cdot m}T_m(|I|,z)$$
where |I| is the number of subsequences in interval I.
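Similarly, the probability that a sum of random variables, each uniformly distributed from 1 to m, is less than an observed rank(I) can be used to compute the significance of a relatively low rank(I) value indicating deletion of the subsequences in interval I, as follows:
$$\mathrm{Prob}\Big(\sum_{i=1}^{|I|}X_i<\mathrm{rank}(I)\Big)=\sum_{z=|I|}^{\mathrm{rank}(I)-1}T_m(|I|,z)$$
where |I| is the number of subsequences in interval I.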
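The exact rank-sum probabilities can also be computed iteratively rather than recursively. The following C++ sketch builds the distribution of the sum of r independent ranks, each uniform on {1, . . . ,m}, by repeated convolution, and then sums the tails. The function names `rankSumDistribution`, `probGreater`, and `probLess` are hypothetical, and the sketch is one possible approach rather than the pseudocode of the described embodiments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a table t with t[z] = T_m(r, z), the probability that r
// independent ranks, each uniform on {1..m}, sum to exactly z.
// Entries for impossible sums (z < r) are zero.
std::vector<double> rankSumDistribution(int m, int r) {
    assert(m >= 1 && r >= 1);
    std::vector<double> prev(static_cast<std::size_t>(m) + 1, 0.0);
    for (int z = 1; z <= m; ++z) prev[z] = 1.0 / m;  // T_m(1, z) = 1/m
    for (int n = 2; n <= r; ++n) {
        std::vector<double> next(static_cast<std::size_t>(n) * m + 1, 0.0);
        for (int z = n; z <= n * m; ++z) {
            double sum = 0.0;
            for (int x = 1; x <= m; ++x) {
                const int rest = z - x;  // sum required of the other n-1 ranks
                if (rest >= n - 1 && rest <= (n - 1) * m) sum += prev[rest];
            }
            next[z] = sum / m;
        }
        prev = next;
    }
    return prev;
}

// Probability that the rank sum is strictly greater than y.
double probGreater(int m, int r, int y) {
    const std::vector<double> t = rankSumDistribution(m, r);
    double p = 0.0;
    for (int z = (y + 1 > r ? y + 1 : r); z <= r * m; ++z) p += t[z];
    return p;
}

// Probability that the rank sum is strictly less than y.
double probLess(int m, int r, int y) {
    const std::vector<double> t = rankSumDistribution(m, r);
    double p = 0.0;
    for (int z = r; z < y && z <= r * m; ++z) p += t[z];
    return p;
}
```

As a sanity check, with m=6 and r=2 the distribution is that of the sum of two dice, so `probGreater(6, 2, 10)` equals 3/36. The iterative table requires on the order of r²·m² additions, whereas exhaustive enumeration considers all m^r rank combinations.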
It should be noted that various different interval lengths may be used, iteratively, to compute amplification and deletion probabilities over a particular biopolymer sequence. In other words, a range of interval sizes can be used to refine amplification and deletion indications over the biopolymer.
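One way in which a range of interval sizes might be scanned is sketched below in C++ for the parametric statistic. The sketch assumes that the statistic for an interval is the sum of its z-normalized signals divided by the square root of the interval length, and it simply reports the interval with the largest absolute statistic. The names `Interval` and `strongestInterval`, the 0-based indexing, and the choice of selection criterion are illustrative assumptions only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical record of a scored interval (0-based start index).
struct Interval {
    std::size_t start;
    std::size_t length;
    double score;  // S(I) for the interval
};

// Scans all intervals with lengths in [minLen, maxLen] and returns the
// one whose statistic has the largest magnitude. Prefix sums make each
// interval sum an O(1) lookup.
Interval strongestInterval(const std::vector<double>& v,
                           std::size_t minLen, std::size_t maxLen) {
    assert(minLen >= 1 && minLen <= maxLen && maxLen <= v.size());
    std::vector<double> prefix(v.size() + 1, 0.0);
    for (std::size_t i = 0; i < v.size(); ++i) {
        prefix[i + 1] = prefix[i] + v[i];
    }
    Interval best{0, minLen, 0.0};
    bool found = false;
    for (std::size_t len = minLen; len <= maxLen; ++len) {
        const double scale = std::sqrt(static_cast<double>(len));
        for (std::size_t start = 0; start + len <= v.size(); ++start) {
            const double s = (prefix[start + len] - prefix[start]) / scale;
            if (!found || std::fabs(s) > std::fabs(best.score)) {
                best = Interval{start, len, s};
                found = true;
            }
        }
    }
    return best;
}
```

A positive score for the returned interval points toward amplification and a negative score toward deletion. In practice every interval whose probability falls below a chosen threshold would be retained, not only the strongest one.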
As an example of the computation of the above-described probabilities for determining significance values for computed interval ranks, the following C++-like pseudocode can be used to determine the probability of observing a rank(I) value, for some number of control samples plus a diseased-tissue sample and for an arbitrary number of subsequences in I, within a range of rank(I) values. This concise C++-like pseudocode is included in order to illustrate one approach to computing probabilities of ranges of rank(I) values, in turn used to estimate the significance of an observed rank(I) value in an experimental procedure. It is not presented as the most efficient or most elegant approach to the problem.
First, a small number of constants are declared:
These constants specify the maximum number of samples and subsequences that can be specified as initial values with a probability determination.
Next, a declaration for a simple class “createTable” is provided:
The class “createTable” creates a table of counts of the number of possible rank combinations that lead to a particular rank(I) value for a given number of subsequences in interval I for a particular number of samples m. The private data members for the class “createTable” include: (1) rank, a particular rank(I) value; (2) nGenes, the number of subsequences in an interval I; (3) nSamples, the number of samples in the experiment; (4) accumulator, an integer used to accumulate counts in a recursive routine, described below; (5) probs, a table of probabilities obtained by dividing the number of combinations of ranks leading to a particular rank(I) value by the total number of possible combinations of subsequence-rank values; and (6) sampleSizePtrs, a table of indexes into the table “probs,” described above. The class “createTable” includes the following function members: (1) compute, a routine that computes the probability of a particular rank(I) for a particular number of subsequences over a particular number of samples; (2) recCompute, a recursive routine called by the routine “compute” for computing the counts of the combinations of subsequence-rank values that sum to a particular rank(I) value; (3) pTable, a routine that computes the probability values stored in the table “probs,” described above; and (4) Prob, a routine that computes the probability that an observed rank(I) value falls within a range of rank(I) values specified as arguments for a particular number of subsequences over a particular number of samples. Next, an implementation of the recursive routine “recCompute” is provided:
The recursive routine “recCompute” recursively computes the number of combinations of subsequence-rank values that can produce a particular rank(I) value. It recursively considers the possible subsequence-rank values for each subsequence in an interval.
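The kind of recursive counting described above can be illustrated by the following self-contained C++ sketch. It is not the routine “recCompute” itself: the free function `countRankCombinations` and its signature are hypothetical, and only the parameter names echo the data members described above.

```cpp
#include <cassert>

// Hypothetical illustration: counts the ordered assignments of ranks,
// each in {1..nSamples}, to nGenes subsequences such that the ranks sum
// to exactly `rank`.
long long countRankCombinations(int rank, int nGenes, int nSamples) {
    assert(nGenes >= 0 && nSamples >= 1);
    if (nGenes == 0) {
        return rank == 0 ? 1 : 0;  // all subsequences assigned; sum must match
    }
    if (rank < nGenes || rank > nGenes * nSamples) {
        return 0;  // target sum is unreachable with the remaining subsequences
    }
    long long count = 0;
    for (int r = 1; r <= nSamples; ++r) {
        // Assign rank r to the current subsequence and recurse on the rest.
        count += countRankCombinations(rank - r, nGenes - 1, nSamples);
    }
    return count;
}
```

Dividing such a count by the total number of rank assignments, nSamples raised to the power nGenes, gives the probability of the corresponding rank sum. For example, `countRankCombinations(7, 2, 6)` returns 6, matching the six ways in which two dice can sum to seven.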
Next, an implementation for the routine “compute” is provided:
The routine “compute” returns either 0, in the case that the specified rank does not fall within the range of possible ranks for the specified number of subsequences and samples, or otherwise calls recursive routine “recCompute” to compute the number of combinations of subsequence-rank values leading to a particular rank, specified as an argument. Next, an implementation for the routine “pTable” is provided:
This routine computes the probabilities of observing a particular rank(I) value by dividing the number of combinations for the rank(I) value, computed by the routine “compute” on line 13, by the total number of combinations of subsequence-rank values, computed on line 14.
Next, an implementation of the routine “Prob” is provided:
This routine simply sums the probabilities of individual rank(I) values within a range of rank(I) values in order to compute the probability of observing a rank(I) value within that range.
Finally, a simple main routine is provided to indicate how a probability is computed using an instance of the class “createTable”:
After the probabilities for observing either the parametric, statistical value for intervals or the rank values for intervals are computed, those intervals with computed probabilities outside of a reasonable range of expected probabilities under the null hypothesis of no amplification or deletion are identified, and redundancies in the list of identified intervals are removed.
FIGS. 18A-F show screen captures that illustrate a user interface developed to provide visual and interactive access to methods of CGH data analysis and results of the analysis as part of a CGH-data-analysis system. Features of the user interface, as shown in
The data-analysis-representation display area 1806 displays, along selected regions of a chromosome or entire genome, in the case of DNA biopolymer analysis, a heat-map representation of the results of a CGH data analysis for each of a number of samples, indicating with increasing intensity of one color, such as green, the likelihood that a region is deleted, and indicating with increasing intensity of a different color, such as red, the likelihood that a region is amplified. In the heat-map representation, regions in which neither amplification nor deletion is indicated may be represented in a neutral color, such as white or grey. The CGH analysis is undertaken, as described above, to use control data, and to compute deletion and amplification statistics that factor in indications of adjoining subsequences and the various diseased tissue samples selected in the sample-selection window 1810. As
FIGS. 18C-F show different display formats for single sample signals, and sample signals in the context of control data. In
Although the present invention has been described in terms of a particular embodiment, it is not intended that the invention be limited to this embodiment. Modifications within the spirit of the invention will be apparent to those skilled in the art. For example, an almost limitless number of different implementations of computer programs and computer-program routines can be created to implement the above-described methods for analyzing chromosomal aberrations in diseased-tissue samples when a number of control samples are available. Although recursive methods are indicated in the above discussion, and used in the above C++-like pseudocode implementation, non-recursive algorithms can be employed to compute the desired statistics more efficiently. The above-described methods can be easily modified to encompass experimental data from many different organisms having different numbers of chromosomes, different numbers of subsequences per chromosome, and other genetic differences. In each component of the above-described method, many mathematically similar, but alternative, approaches may be employed. For example, different methods for computing means and variances can be used, as well as different statistical parameters used to characterize particular distributions. Many different types of user-interface implementations, in addition to the user-interface implementation discussed above with reference to FIGS. 18A-F, can be employed to allow for convenient selection of parameters that control CGH analysis and various different CGH-data-analysis-results display formats.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the invention. The foregoing descriptions of specific embodiments of the present invention are presented for purpose of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously many modifications and variations are possible in view of the above teachings. The embodiments are shown and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents:
This application claims the benefit of provisional application No. 60/541,711, filed Feb. 3, 2004.
Number | Date | Country
---|---|---
60541711 | Feb 2004 | US