The present invention relates to a method for performing spectral DNA analysis, i.e. DNA sequences being represented in spectral space using Fourier transform. The invention also relates to a corresponding computer program product.
DNA Spectrogram method from DNA sequence has been described in the past, cf. Benson et al. in Nucleic Acid Research. 18 (21), p. 6305-6310, and 18 (10), 3001-3006, 1990, for earlier references on this topic.
A DNA spectrogram is generated by converting a DNA sequence to binary indicator sequence and then applying short-time Fourier transform and mapping to a color space in order to visualize the output. In order to allow the phylogenetic and biological comparison of a large number of long sequences in frequency domain, these sequences need to be visualized in such a way that the similarities are (easily) detectable, even by a human observer. Therefore, strategies are required that group together sequences with similar patterns in frequency.
An important advantage of performing DNA analysis in the spectral domain is that the N2-scaling of conventional sequence to sequence matching is avoided, N being the number nucleotide bases in the sequence. U.S. Pat. No. 6,287,773 discloses e.g. a frequency domain based comparison method which scale as N log(N) which may very significantly lower the computational time for long sequences, e.g. longer than 10.000 nucleotide bases.
Even with the advantages of present spectral analysis for DNA analysis, there is still a need for even faster and/or more efficient analysis tools because of the huge amount of data. For example, the entire chromosome 1 of the human genome is 247 millions nucleotides long, and accordingly viewing the DNA spectrograms as so-called spectra video as recently suggested by N. Dimitrova et al., “Analysis and visualization of DNA spectrograms: open possibilities for genome research,” in ACM MM., Santa Barbara, Calif., October 2006, may also be a tedious task.
Moreover, despite efforts to date, a need remains for systems and methods that facilitate expeditious analysis of DNA sequence information. Also there remains a need for tools that can identify structurally or compositionally similar patterns that exhibit similar spectral properties. Such tools are to be contrasted with conventional sequence alignment tools that seek to align sequences in linear order or by nucleotide appearance.
Clustering algorithms currently used for sequence alignment are not suitable for spectral analysis, where we need to analyze the content at individual frequencies. Standard clustering methods involve a global distance metric, which in this case would be applied over all frequencies considered in the spectrogram. While such a method would be able to detect strong patterns in many frequencies, it would screen out strong patterns in individual frequencies. However, there is no relation between patterns on different frequencies to consider them in a single distance metric. In spectral analysis strong (long) patterns on single frequencies are relevant.
Hence, an improved method for analysis of DNA sequences would be advantageous, and in particular a more efficient and/or reliable method would be advantageous.
Accordingly, the invention preferably seeks to mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination. In particular, it may be seen as an object of the present invention to provide a method that solves the above mentioned problems of the prior art with analysis of DNA sequences.
This object and several other objects are obtained in a first aspect of the invention by providing a method for analyzing a DNA sequence, the method comprising:
The invention is particularly, but not exclusively, advantageous for obtaining a method for providing a user with a much improved ability to see unique strong patterns in vast amount of DNA sequence data. It is further possible to extracting the strength of a pattern and evaluate which is the strongest pattern on an individual frequency or a set of frequencies, or all patterns on all frequencies in the DNA sequences to analyze.
The invention may beneficially be implemented with a fully or semi-automated pattern search through all the DNA spectra coupled with an annotation and/or a visualization environment.
The use of binning functions (BF) may allow for a flexible measure of “similarity” which can be adapted to the dataset in order to detect all relevant patterns, coping with variation in the DNA sequences.
Additionally, the invention is scalable and suitable for parallel implementation that makes feasible search through vast genomic data spaces such as genomes of different species.
This method may efficiently and effectively compare multiple large genomic sequences based on their spectral patterns for deriving the gene homology and hence phylogenetic relationships
A common spectral pattern among sequences could for instance identify periodic repeating of nucleotides in the sequences and would help discover novel repeat elements in coding and non-coding DNA which may not otherwise be “visible” due to periodicity of only specific nucleotides after randomly arranged nucleotides in the periodic interval.
In the context of the present invention, other method for spectral analysis may also be beneficially applied, as for example the method described in PCT Application PH008112WO1 (Attorney Reference), IB2008/051434 (PCT Application number).
The binning function may comprise truncation, rounding up, rounding down, a modular function, and/or a threshold function or any other relevant binning function available to the skilled person that may implemented in connection with the present invention.
Typically, the binning function (BF) is defined for all channels (X). Thus, for DNA channels X={A, T, C, and G} may be modified, but alternatively only a subset of channels may be modified depending on the requirements of the analysis.
Beneficially, the finding of substantially equal modified Fourier coefficients (Usk_X(k)) within the said portion of the plurality of spectra may comprise a quantitative analysis of a distribution of the modified Fourier coefficients (Usk_X(k)) with respect to the said binning function (BF). Thus, it may comprise plotting the said distribution, e.g. plotting in a histogram, as it will be explained in more detail below, or other kind of graphs.
Typically, the method is repeated for a set of frequencies (K_i), e.g. all frequencies, or an interval, continuously or not continuously i.e. disjoint, depending on the requirements of the desired analysis.
It should be noted, the method may equally be applied for analyzing a RNA sequence or an amino acid sequence instead of a DNA sequence. The application of the present invention is thus not limited to applications in connection with analysis of DNA sequences, but could also be applied on similar sequences of relevance within biochemistry e.g. RNA sequences and amino acid sequences.
One may create a binary indicator representation for amino acids (20 of them) and then we apply STFT to convert the BIS sequences to Fourier domain space. Then the rest of the procedure for implementing the invention would be the same. Here is a list of the aminoacids:
alanine-ala-A
arginine-arg-R
asparagine-asn-N
aspartic acid-asp-D
cysteine-cys-C
glutamine-gln-Q
glutamic acid-glu-E
glycine-gly-G
histidine-his-H
isoleucine-ile-I
leucine-leu-L
lysine-lys-K
methionine-met-M
phenylalanine-phe-F
proline-pro-P
serine-ser-S
threonine-thr-T
tryptophan-trp-W
tyrosine-tyr-Y
valine-val-V
The 20 different amino acids can be mapped to 20 different colors in Red-Green-Blue (RGB) (or Hue Saturation Value—HSV space). Either one of these spaces can be quantized into 20 colors—one for each of the amino acids. Thus, the teaching of the present invention is not limited to DNA analysis, but may be extended to RNA and/or amino acid analysis with the relevant modifications readily recognized by a skilled person in this field.
Preferably, the set of binary indicator sequences may be reduced into a smaller set of BIS using a merge function, the merge function may preferably comprise a logical AND function.
The set of found substantially equal modified Fourier coefficients (Usk_X(k)) within the said portion of the plurality of spectra may be defined to constitute a pattern. In one embodiment, a first group of spectra (S) with the largest set of substantially equal modified Fourier coefficients (Usk_X(k)) in any frequency and/or channel may be found and separated from the remaining spectra, the remaining spectra forming a second group of spectra. The term “largest set” means collective group with the highest number of re-occurring modified Fourier coefficients. Additionally, the largest set of substantially equal modified Fourier coefficients (Usk_X(k)) may be found and separated within the second group of spectra. Moreover, the separation of spectra into a first and a second group of spectra may be repeated disregarding the previously found longest set of modified Fourier coefficients (Usk_X(k)), thus finding next longest. The separation of spectra into a first and a second group may be repeated i) until a pre-defined threshold for the longest set of modified Fourier coefficients (Usk_X(k)) is found, ii) until a pre-defined number of separations into a first and a second group of spectra is performed, or iii) until the first and/or the second group of spectra contains a single sequence, so as to provide an end to the separation.
In another embodiment, a first group of spectra (S) with the largest set of substantially equal modified Fourier coefficients (Usk_X(k)) in any frequency and/or channel may be found and marked. The set may preferably be displayed for analysis. Furthermore, a second group of spectra with the largest set of substantially equal modified Fourier coefficients (Usk_X(k)) in any frequency and/or channel may be found, disregarding the previously found longest set of modified Fourier coefficients (Usk_X(k)), and marked. The set may also preferably be displayed to a user for analysis. Additionally, the first and/or the next group of spectra may be re-ordered, and preferably displayed, taking the marking into account. In this way, the longest pattern in any frequency and/or channel may be found. Finally, the longest set may be found and the group of spectra may be reordered i) until a pre-defined threshold for the length of longest set of the modified Fourier coefficients (Usk_X(k)) is found, ii) until a pre-defined number of longest set is found, or iii) until the longest set contains a single sequence, so as to provide an end to the process of this embodiment.
In yet another embodiment, all groups of spectra (S) having length of a pattern of found modified Fourier coefficients (Usk_X(k)) being above a first predefined threshold (N_thres1), or all groups of spectra containing the k longest patterns, k being an integer, may be found and separated from the remaining spectra, the remaining spectra forming a second group of spectra. The groups of spectra selected are not necessarily disjoint. Each group of spectra thus separated may be further separated using a second predefined threshold (N_thres2) for the length of the pattern of modified Fourier coefficients (Usk_X(k)), or using the j longest patterns, j being an integer equal or different from k. To provide an end to the separation, the separation of spectra into groups may be repeated i) until a pre-defined threshold for length of the patterns of modified Fourier coefficients (Usk_X(k)) is found, ii) until a pre-defined number of separations into a first and a second group of spectra is performed, or iii) until the first and/or the second group of spectra contains sequences of modified Fourier coefficients (Usk_X(k)) having length equal to one.
In a second aspect, the invention relates to a computer program product being adapted to enable a computer system comprising at least one computer to implement the method according to the first aspect of the invention.
This aspect of the invention is particularly, but not exclusively, advantageous in that the present invention may be implemented by a computer program product enabling a computer system to perform the operations of the second aspect of the invention. Thus, it is contemplated that some known computer system may be changed to operate according to the present invention by installing a computer program product on a computer system controlling the said optical recording apparatus. Such a computer program product may be provided on any kind of computer readable medium, e.g. magnetically or optically based medium, or through a computer based network, e.g. the Internet.
The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention or some features of the invention can be implemented as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different units and processors.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.
The present invention will now be explained, by way of example only, with reference to the accompanying Figures, where
DNA spectrograms can be generated in a conventional manner, as described in greater detail herein below with reference to
(i) Formation of binary indicator sequences (BISs) uA[n], uT[n], uC[n] and uG[n] for the four nucleotide bases. An exemplary BIS pattern is reproduced in
(ii) Discrete Fourier Transform (DFT) on BISs. The frequency spectrum of each base is obtained by computing the DFT of its corresponding BIS using Equation (1):
As illustrated in
(iii) Mapping of DTF values to RGB colors. The four DFT sequences are reduced to three sequences in the RGB space by a set of linear equations which are reproduced below:
X
r
[k]=a
r
U
A
[k]+t
r
U
T
[k]+c
r
U
C
[k]+g
r
U
G
[k]
X
g
[k]=a
g
U
A
[k]+t
g
U
T
[k]+c
g
U
C
[k]+g
g
U
G
[k]
X
b
[k]=a
b
U
A
[k]+t
b
U
T
[k]+c
b
U
C
[k]+g
b
U
G
[k] (2)
where (ar, ag, ab), (tr, tg, tb), (cr, cg, cb) and (gr, gg, gb) are the colour mapping vectors for the nucleotide bases A, T, C and G, respectively. The resultant pixel color ((Xr[k], Xg[k], Xb[k]) is thus a superposition of the colour mapping vectors weighted by the magnitude of the frequency component of their respective nucleotide base as indicated in the right side of
(iv) Normalizing the pixel values. Before rendering the colored spectrogram 30, the RGB values of each pixel are generally normalized so as to fall between 0 and 1. Numerous normalization procedures are readily available to the skilled person once the general principle of the invention is recognized.
(v) Short-time Fourier Transform (STFT). A plurality of DNA spectra 20 i.e. a spectrogram 30 is formed by a concatenation of individual DNA sequence spectra 20 (“strips”), where each strip or spectrum generally depicts the frequency spectrum of a local DNA segment as shown in
The spectrogram shown in
The appearance of a spectrogram 30 is very much affected by the choice of the STFT window W size, the length of the overlapping sequence between adjacent windows W, and the color mapping vectors, cf. Equation (2). The window size determines the effective range of a pixel value in a spectrogram 30. A larger window results in a spectrogram that reveals statistics collected from longer DNA segments. In general, the window W size should be made several times larger than the length of the repetitive pattern of interest and smaller than the size of the region that contain the pattern of interest. For exploratory purposes, it is recommended to try a range of window sizes. The window overlap determines the length of the DNA segment common to two adjacent STFT windows. Therefore, the larger the overlap, the more gradual is the transition of the frequency spectrum from one STFT window to the next. A higher image resolution makes it easier to extract features by image processing or visual inspection.
Viewing large amounts of sequence data requires an efficient method for information analysis and visualization. In order to optimize the viewing of spectra derived from very large sequences or spectra containing many small windows, the spectra can be rendered a video as shown by the present inventor; N. Dimitrova, et al. “Analysis and visualization of DNA spectrograms: open possibilities for genome research,” in ACM MM., Santa Barbara, Calif., October 2006, which is hereby incorporated by reference in its entirety.
In situation A, a binning function BF is applied on one frequency indicated by vector U_1 and as a result of the binning function BF, the Fourier coefficients Usk_X(k) of U_1 are modified and therefore the vector changed as shown.
In situation B, a binning function BF is applied on two frequencies indicated by vectors U_2 and U_3 and as a result of the binning function BF, the Fourier coefficients Usk_X(k) of both U_2 and U_3 are modified into vectors U_2′ and U_3′, respectively. In this particular case, the binning function BF has the effect that U_2′ is equal to U_3′. This may for example be the case of a binning function BF that significantly alters values e.g. a harsh rounding down or similar. Thus, information is lost but more easy and/or improved analysis may be performed.
In situation C, a binning function BF is applied on two frequencies indicated by vectors U_4 and U_5 and as a result of the binning function BF, the Fourier coefficients Usk_X(k) of both U_4 and U_5 are modified into vectors U_4′ and U_5′, respectively. In this particular case, the binning function BF has the effect of turning, in the vector space, the two vectors U_4 and U_5.
There is then defined a binning function BF for a frequency K′, here K′=2, applicable to the Fourier coefficients Usk_X(k) with respect to the relevant channels X. Thus, the binning function can for example comprises truncation, rounding up, rounding down, a modular function, and/or a threshold function or other relevant mathematical function relevant for the purpose of the present invention. In one embodiment, a truncation is performed. Typically, the binning function (BF) is defined for all channels X, thus, X={A, T, C, and G}, but for some application one or a subset, e.g. C and G, can be channels to be analyzed. In
There after substantially equal modified Fourier coefficients Usk_X(k) within the said portion of the plurality of spectra, e.g. s=1 and upwards, is found and preferably marked or tagged for further analysis. Thus by finding is meant for instance counting how many entries having a certain value of the modified Fourier coefficients Usk_X(k), e.g. 10. The term “substantially equal” is meant to take into account of numerical errors introduced after application of the binning function BF.
For each frequency, values that are “similar”, i.e. substantially equal according to the applied binning function BF, are grouped together and histograms showing the number of values falling in each bin are built. The values of A, C, G, T for an individual frequency can be compared independently, or combined in a common measure taking into account similarities on all four nucleotides to find similarities in that frequency.
Next, for each frequency one or more histogram bins (e.g. the largest) are selected, according to a chosen strategy. In the following three such strategies: Top Down Hierarchical Sorting (TDHS), Independent Iterative Sorting (IIS), and Lattice Sorting (LS), are further explained, but other methods are readily available to the skilled person within the context and teaching of the present invention. The domain may then be split according to the chosen strategy and taking into account the histogram bins, and the process is repeated in each of the sub-domains until a stopping criterion is reached.
For instance, when the largest bin is selected, it provides the largest number of sequences that share a “similar” value according to the binning function BF in that specific frequency for one of the nucleotides. The frequency for the largest value in all histogram bins across all frequencies (for each frequency there is a single histogram) is selected and the sequences contributing to that histogram are grouped together. The whole domain of sequences is split this way into the group of sequences sharing a similarity in that frequency and the rest, obtaining two “clusters” (although this is not a clustering algorithm in the strict sense of the word this terminology may be adopted), and a specific selection and processing strategy is applied on each of the two clusters. Next, histograms of the values are built again or the computed histograms bins are updated to reflect the split into clusters; the longest histogram is selected, and according to that histogram the domain is split into two clusters again. The iterations stop when the size of the longest histogram is below a predefined threshold, when the user-defined number of long patterns to be extracted was reached, or when each of the two clusters contains a single sequence. Other stop criterion may also be applied.
Next, in each of the two clusters, or the first and second group, the (next) longest pattern is found and the clusters are again each split or subdivided into clusters or groups containing the long pattern and the rest. This is illustrated in
This hierarchical sorting is illustrated by the “sorting three” to lower left in
This algorithm stops when a threshold for the longest pattern or for the number of steps was reached, or when each of the two clusters or groups contains a single sequence, e.g. spectrum s=2 in
Furthermore, a so-called lattice sorting (LS) algorithm may be implemented in connection with the present invention. Initially, for all patterns longer than a given size N_thres1 (or alternatively for the k longest patterns) form clusters by selecting the rows or spectra including those patterns and discarding the rest. Afterwards, there is performed the same selection iteratively in each cluster or group until no suitable patterns are found, i.e. until all patterns are shorter than N_thres2 (or all patterns left are of length 1). With this strategy the clusters can be overlapping, and each cluster has one child. Unlike TDHS, LS never misses long patterns. Also with this strategy fully coexisting patterns will always appear.
All the above strategies of TDHS, IIS, and LS can implemented interactive, in the sense that at each step the patterns can be visualized and the user can decide which branches in the hierarchies of clusters or groups to explore.
Next, the spectra may be stacked on top of each other in a new representation called sorted video as shown in
In addition, this present invention is conducive to parallelization unlike other clustering methods known in the art (say hierarchical clustering). For sorting, the histograms are built for each frequency, which makes it easy to split the domain of Fourier values among several processes and execute them in parallel, on a parallel or distributed system, or on grids.
At the end, the present invention provides a visualization method (as in
S1 providing a DNA sequence,
S2 creating a plurality of spectra 20 based on the DNA sequence by converting the DNA sequence into a plurality of binary indicator sequences (BIS), and applying short term Fourier transform (STFT) on the binary indicator sequences, each spectrum comprising corresponding frequencies k and Fourier coefficients Usk_X(k), where each kind of Fourier coefficient constitutes a channel X,
S3 defining a binning function BF for a frequency K′ applicable to the Fourier coefficients Usk_X(k) with respect to one or more channels X,
S4 applying the binning function BF on at least a portion of the plurality of spectra and thereby modifying the corresponding Fourier coefficients Usk_X(k), and
S5 finding substantially equal modified Fourier coefficients Usk_X(k) within the said portion of the plurality of spectra.
The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention or some features of the invention can be implemented as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different units and processors.
Although the present invention has been described in connection with the specified embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. In the claims, the term “comprising” does not exclude the presence of other elements or steps. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. In addition, singular references do not exclude a plurality. Thus, references to “a”, “an”, “first”, “second” etc. do not preclude a plurality. Furthermore, reference signs in the claims shall not be construed as limiting the scope.
Number | Date | Country | Kind |
---|---|---|---|
08158610.9 | Jun 2008 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB2009/052517 | 6/12/2009 | WO | 00 | 5/16/2011 |