This invention pertains in general to the field of biology and bioinformatics. More particularly the invention relates to the field of categorization of cancer tumours and even more particularly to identifying methylated sites, which may aid in categorization of cancer tumours.
Worldwide, breast cancer is the fifth most common cause of cancer death, after lung cancer, stomach cancer, liver cancer, and colon cancer. Among women, breast cancer is the most common cancer and the most common cause of cancer death.
Breast cancer is diagnosed by the pathological examination of surgically removed breast tissue. Following diagnosis, it is important to analyze the tumour type in order to aid clinicians when choosing the right therapy. Within the art, such analysis is performed according to two categories.
The first category involves the use of immuno-histopathological variables, such as tumour size, ER/PR status, lymph node negativity, etc., to define a clinical prognostic index such as the Nottingham Prognostic Index (NPI). The problem with such an index is that it has been shown to be very conservative, thus typically causing patients to receive aggressive therapy even when they are at low risk of disease recurrence.
The second category involves measuring the expression levels of a large number of genes, typically around 500, and calculating the probability of a subtype based on the relative expression levels of the genes. This method is very costly in terms of tissue handling requirements. It is also hard to perform in a clinical setting, owing to the laboratory equipment it demands.
DNA methylation, a type of chemical modification of DNA that can be inherited and subsequently removed without changing the original DNA sequence, is the most well studied epigenetic mechanism of gene regulation. Areas in DNA that are rich in sites where a cytosine nucleotide occurs next to a guanine nucleotide in the linear sequence of bases are called CpG islands.
CpG islands are generally heavily methylated in normal cells. However, during tumorigenesis, hypomethylation occurs at these islands, which may result in the expression of certain repeats. These hypomethylation events also correlate to the severity of some cancers. Under certain circumstances, which may occur in pathologies such as cancer, imprinting, development, tissue specificity, or X chromosome inactivation, gene associated islands may be heavily methylated. Specifically, in cancer, methylation of islands proximal to tumour suppressors is a frequent event, often occurring when the second allele is lost by deletion (Loss of Heterozygosity, LOH). Some tumour suppressors commonly seen with methylated islands are p16, Rassf1a, and BRCA1.
There are reported epigenetic markers for colorectal and prostate cancer. For example, Epigenomics AG (Berlin, Germany) has Septin 9 as a marker for colorectal cancer screening in blood plasma. A method for using methylation sites to predict differential therapy responses in cancer and recommending an appropriate therapy has been disclosed in US20050021240A1. However, the results predicted by this method are limited, since they cannot be directly applied in clinical practice. Therefore, it would be advantageous to have a method for the analysis of breast cancer disorders that is time efficient, reliable and cost-effective.
Accordingly, the present invention preferably seeks to mitigate, alleviate or eliminate one or more of the above-identified deficiencies in the art and disadvantages singly or in any combination and solves at least the above mentioned problems by providing a method for the analysis of breast cancer disorders according to the appended patent claims.
According to an aspect a method for analysis of breast cancer disorders is disclosed. The method comprises determining the genomic methylation status of one or more CpG dinucleotides in a sequence selected from the group of sequences consisting of SEQ ID NO. 1 to SEQ ID NO. 600. The method provides for improved abilities to characterize cancer tumours using methylation patterns.
The regions of interest of the sequences SEQ ID NO. 1 to 600 are designated in Table 1 (as “start” and “end” on the respective “chromosome”).
This aspect presents improvements over the state of the art in that it enables a highly specific classification of breast cell proliferative disorders.
In an aspect a computer program product is disclosed. The computer program product is stored on a computer-readable medium comprising software code adapted to perform the steps of the method according to an aspect when executed on a data-processing apparatus.
In an aspect a device is disclosed. The device comprises means adapted to carry out methods according to some embodiments. An advantage of this is that it supports a clinician.
Herein, the sequences claimed also encompass the sequences that are the reverse complements of the sequences designated.
These and other aspects, features and advantages of which the invention is capable of will be apparent and elucidated from the following description of embodiments of the present invention, reference being made to the accompanying drawings, in which
Several embodiments of the present invention will be described in more detail below with reference to the accompanying drawings in order for those skilled in the art to be able to carry out the invention. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. The embodiments do not limit the invention, but the invention is only limited by the appended patent claims. Furthermore, the terminology used in the detailed description of the particular embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention.
An idea according to some embodiments is a method using a small selection of DNA sequences to analyze breast cancer disorders. The analysis is done by determining the genomic methylation status of one or more CpG dinucleotides in either a sequence disclosed herein or its reverse complement.
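The relation between a sequence, its reverse complement and the CpG dinucleotides they contain may be illustrated by the following minimal sketch. The function names and the example sequence are hypothetical and are not taken from the sequence listing; the sketch only locates CpG dinucleotides and does not determine any methylation status.

```python
from typing import List

# Illustrative helpers only; the names are assumptions, not part of the disclosure.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def cpg_positions(seq: str) -> List[int]:
    """Return the 0-based index of every C that is immediately followed by a G."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i] == "C" and s[i + 1] == "G"]


example = "ACGTTCGA"  # made-up sequence, not one of SEQ ID NO. 1 to 600
print(reverse_complement(example))
print(cpg_positions(example))
print(cpg_positions(reverse_complement(example)))
```

Because the CpG dinucleotide is its own reverse complement, a sequence and its reverse complement always contain the same number of CpG sites.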
It was surprisingly found that some DNA sequences, SEQ ID NO: 1 to SEQ ID NO: 600, act as epigenetic markers that may be used to analyze breast cancer by subtyping tumours. In the prior art, it is possible to subtype breast cancer based on gene expression. Five different subtypes have been reported: luminal A, luminal B, basal, ERBB2 overexpressing, and normal-like. The inventors have identified the same subtypes using DNA methylation.
The DNA sequences SEQ ID NO: 1 to SEQ ID NO: 600 were identified by analysing 150,000 individual genomic loci for methylation across a set of 83 breast tumours. The availability of clinical information regarding the tumour specimens allowed for an investigation of DNA methylation in the context of breast cancer subtypes, histology and tumour aggressiveness. The five major breast cancer molecular subtypes (luminal A and B, basal, ERBB2 overexpressing, and normal-like) were identified. First, it was investigated whether or not unsupervised clustering of the tumour set using methylation recapitulates the major luminal and basal classes that were identified by expression analysis. A filtering criterion was used to identify the features to be used in clustering. This criterion selected the top 500 loci that varied most across the 83 tumour samples. Then, the top 100 loci that distinguished tumours from normal tissues were added. These 600 features, displayed in Table 1, were used to cluster the 83 tumours for which the expression subtype data was available. Hierarchical clustering with Pearson correlation and complete linkage of the samples based on these six hundred loci gave a dendrogram that is surprisingly similar to the one produced by expression analysis.
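The variance filter and the clustering step described above may be sketched as follows. This is a simplified illustration under stated assumptions: the data are a made-up four-sample toy set, the function names are hypothetical, the distance is taken as one minus the Pearson correlation, and the addition of the loci distinguishing tumours from normal tissue is not shown.

```python
from math import sqrt
from statistics import pvariance
from typing import List, Sequence


def top_variable_loci(samples: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k loci whose values vary most across samples."""
    n_loci = len(samples[0])
    variances = [pvariance([s[j] for s in samples]) for j in range(n_loci)]
    return sorted(range(n_loci), key=lambda j: variances[j], reverse=True)[:k]


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance between two profiles, assumed here to be 1 - Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # constant profile: treat as uncorrelated
    return 1.0 - cov / (sx * sy)


def complete_linkage(profiles: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Merge the two closest clusters until n_clusters remain.

    Under complete linkage the distance between two clusters is the largest
    pairwise distance between their members.
    """
    n = len(profiles)
    dist = {(i, j): pearson_distance(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = [[i] for i in range(n)]

    def linkage(a: List[int], b: List[int]) -> float:
        return max(dist[(min(i, j), max(i, j))] for i in a for j in b)

    while len(clusters) > n_clusters:
        _, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


# Toy methylation values: 4 samples (rows) by 4 loci (columns), made up for illustration.
samples = [
    [0.9, 0.1, 0.5, 0.8],
    [0.8, 0.2, 0.5, 0.7],
    [0.1, 0.9, 0.5, 0.2],
    [0.2, 0.8, 0.5, 0.1],
]
loci = sorted(top_variable_loci(samples, 3))
reduced = [[s[j] for j in loci] for s in samples]
print(loci, complete_linkage(reduced, 2))
```

In the toy data the third locus is constant across samples and is therefore dropped by the filter, and the remaining loci separate the first two samples from the last two.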
In an embodiment a method 10 is provided, according to
Selecting 100 a feature subset may be performed based on hierarchical clustering with Pearson correlation and complete linkage to characterize the fitness of each feature subset, given a dataset with a methylation characterization for each sample (si, i=1 . . . M) in the form of a vector mi of N values, where mi,j provides the methylation status for the i-th sample and the j-th probe. Typically, some statistical analysis of the measured signal will produce a set of probes (features) to be input to the hierarchical clustering method above.
The feature subset selection 100 uses a Genetic Algorithm (GA), which repetitively evaluates feature subsets based on a fitness function that in some way characterizes some property of the feature subset. In an embodiment, hierarchical clustering with Pearson correlation and complete linkage is used as the fitness function to assess how good a feature subset is.
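The general shape of such a GA may be sketched as follows. This is a minimal illustration and not the implementation used by the inventors: the population size, selection scheme, crossover and mutation operators, and all names are assumptions, and the toy fitness function merely stands in for a clustering-based fitness.

```python
import random
from typing import Callable, FrozenSet, List, Tuple

Subset = FrozenSet[int]


def run_ga(n_features: int, subset_size: int, fitness: Callable[[Subset], float],
           population_size: int = 20, generations: int = 30,
           mutation_rate: float = 0.2, seed: int = 0) -> Tuple[Subset, float]:
    """Evolve fixed-size feature subsets and return the best one with its fitness."""
    rng = random.Random(seed)

    def random_subset() -> Subset:
        return frozenset(rng.sample(range(n_features), subset_size))

    def crossover(a: Subset, b: Subset) -> Subset:
        # The child draws its features from the union of both parents.
        return frozenset(rng.sample(sorted(a | b), subset_size))

    def mutate(s: Subset) -> Subset:
        # With probability mutation_rate, swap one feature for an unused one.
        if rng.random() >= mutation_rate:
            return s
        outside = [f for f in range(n_features) if f not in s]
        if not outside:
            return s
        members = sorted(s)
        members[rng.randrange(len(members))] = rng.choice(outside)
        return frozenset(members)

    population: List[Subset] = [random_subset() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, population_size // 2)]  # fitter half survives
        children = [mutate(crossover(*rng.sample(parents, 2)))
                    for _ in range(population_size - len(parents))]
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)


def toy_fitness(subset: Subset) -> float:
    """Placeholder fitness: counts how many of features 0, 1, 2 are selected."""
    return float(len(subset & {0, 1, 2}))


best_subset, best_score = run_ga(n_features=10, subset_size=3, fitness=toy_fitness)
print(sorted(best_subset), best_score)
```

In the embodiment described above, the placeholder fitness would be replaced by a score derived from clustering the samples on the features of the subset.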
The following example is used to illustrate the principle.
Next, clustering may be performed.
After having clustered the datasets, a ranking of all clustering results is performed. In one embodiment, a cluster analysis method is used for the ranking. For example, it is possible to characterize and rank individual clusters based on their validity, for example in terms of cluster cohesion or separation. This may be done in one of multiple ways well known to a person skilled in the art. Thus, it is possible to rank two or more feature subsets based on the quality of the clusters they generate when used to cluster the samples.
In another embodiment, some property of the samples (e.g. cancer subtype based on pathology) is used for ranking. From this property, the same or related subtypes are grouped together. For example, if the five samples from
In an embodiment, two clustering outputs, D1 and D2, are compared based on their clusters. First, N clusters (C1, C2, . . . CN) are obtained from the dendrogram produced by the clustering. Then, a property is computed for each cluster, such as the popular silhouette width, SIL(Ci). A single-number characterization of a clustering is then obtained by the formula:
AVGSIL(D)=(SUM[i=1 . . . N]SIL(Ci))/N
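The formula above may be sketched in code as follows. It is assumed here that SIL(Ci) is the mean silhouette width of the members of cluster Ci and that singleton clusters receive a width of zero; the function names, the distance function and the toy data are illustrative only.

```python
from statistics import fmean
from typing import Callable, List, Sequence


def cluster_silhouettes(clusters: Sequence[Sequence[float]],
                        distance: Callable[[float, float], float]) -> List[float]:
    """Return SIL(Ci) for each cluster (requires at least two clusters)."""
    sils = []
    for ci, cluster in enumerate(clusters):
        widths = []
        for xi, x in enumerate(cluster):
            if len(cluster) == 1:
                widths.append(0.0)  # assumed convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = fmean(distance(x, y) for yi, y in enumerate(cluster) if yi != xi)
            # b: mean distance to the members of the nearest other cluster
            b = min(fmean(distance(x, y) for y in other)
                    for oi, other in enumerate(clusters) if oi != ci)
            widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        sils.append(fmean(widths))
    return sils


def avg_sil(clusters: Sequence[Sequence[float]],
            distance: Callable[[float, float], float]) -> float:
    """AVGSIL(D): the mean of SIL(Ci) over all N clusters."""
    sils = cluster_silhouettes(clusters, distance)
    return sum(sils) / len(sils)


def abs_distance(x: float, y: float) -> float:
    return abs(x - y)


# Two ways of grouping the same four one-dimensional toy values.
d1 = [[0.0, 1.0], [10.0, 11.0]]
d2 = [[0.0, 10.0], [1.0, 11.0]]
print(avg_sil(d1, abs_distance), avg_sil(d2, abs_distance))
```

In this toy case the first grouping yields the higher average silhouette width and would therefore be the preferred clustering.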
By comparing AVGSIL(D1) and AVGSIL(D2), it may be determined which clustering is preferable. In another embodiment, a data structure G is built in the form of a matrix with dimensions N×L, where L is the number of distinct labels available for the samples. For example, with labels {X, Y}, L=2, and with labels {normal, aggressive cancer, non-aggressive cancer}, L=3. Then, for each cluster i (i=1 . . . N), L values are obtained in the following manner for each element gij of G:
gij=count(sample in cluster i and has label j)
Now, it is possible to compute uniformity of each cluster Ci:
UNIFORMITY(Ci)=max(counts in row i in G)/sum(counts in row i in G)
Finally, the clustering is characterized with:
AVGUNIFORMITY(D)=SUM[i=1 . . . N](UNIFORMITY(Ci))/N
as a single-number characterization of a clustering. By comparing AVGUNIFORMITY(D1) and AVGUNIFORMITY(D2), it may be determined which clustering is preferable.
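The label-based characterization may be sketched as follows. The sample identifiers, the labels and the function names are hypothetical, and every cluster is assumed to be non-empty.

```python
from collections import Counter
from typing import Dict, List, Sequence


def count_matrix(clusters: Sequence[Sequence[str]], labels: Dict[str, str]) -> List[List[int]]:
    """Build G: one row per cluster, one column per distinct label."""
    label_order = sorted(set(labels.values()))
    g = []
    for cluster in clusters:
        counts = Counter(labels[sample] for sample in cluster)
        g.append([counts[label] for label in label_order])
    return g


def uniformity(row: Sequence[int]) -> float:
    """UNIFORMITY(Ci): share of the most frequent label in cluster i."""
    return max(row) / sum(row)


def avg_uniformity(g: Sequence[Sequence[int]]) -> float:
    """AVGUNIFORMITY(D): the mean of UNIFORMITY(Ci) over all N clusters."""
    return sum(uniformity(row) for row in g) / len(g)


# Made-up labels for five samples, and two alternative clusterings D1 and D2.
labels = {"s1": "X", "s2": "X", "s3": "Y", "s4": "Y", "s5": "X"}
d1 = [["s1", "s2", "s5"], ["s3", "s4"]]
d2 = [["s1", "s3"], ["s2", "s4", "s5"]]

g1 = count_matrix(d1, labels)
g2 = count_matrix(d2, labels)
print(g1, avg_uniformity(g1))
print(g2, avg_uniformity(g2))
```

Here D1 places only samples of one label in each cluster, so its average uniformity is higher than that of D2 and D1 would be preferred.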
Iterative repetition of this selection process gradually refines the quality of the clustering of the feature subsets discovered by the GA. After a number of repetitions, all evaluated feature subsets can be further filtered based on their performance during the GA execution. In one embodiment, feature subsets are sorted by their average clustering performance in stratification of the clinical samples. In another embodiment, feature subsets are filtered, in addition to the average performance, based on their persistent re-evaluation. In other words, feature subsets that are repeatedly selected for further evaluation are preferred to feature subsets that are dropped from consideration after only a few iterations. The final output of the GA feature subset selection is obtained by running multiple instances with different initial conditions and merging the filtered feature subsets from each of these instances. Feature subsets from one such evaluation are listed in Table 3A. Furthermore, a cumulative characterization of a collection of GA runs can be obtained and used to generate feature subsets that aggregate the feature subsets in a single set of subsets. In one embodiment, the appearance of each feature in the feature subsets is counted and a total histogram is obtained, giving the degree of utilization of each of the 600 features. Based on this information and, for example, in one embodiment, the frequencies of the pairwise occurrences of the 600 features, feature subsets are built that summarize the GA run in a single set of subsets, a so-called trend pattern. Table 3B provides such feature subsets of lengths 45 and 60.
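The utilization histogram and the pairwise occurrence counts may be sketched as follows. The description does not specify how a trend pattern is assembled from these counts, so the greedy rule in trend_pattern is purely a hypothetical illustration, as are all names and the toy subsets.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set


def feature_usage(subsets: Iterable[Set[int]]) -> Counter:
    """Histogram: in how many subsets does each feature appear?"""
    return Counter(f for s in subsets for f in set(s))


def pair_usage(subsets: Iterable[Set[int]]) -> Counter:
    """How often does each (smaller, larger) feature pair occur together?"""
    return Counter(pair for s in subsets
                   for pair in combinations(sorted(set(s)), 2))


def trend_pattern(subsets: List[Set[int]], length: int) -> List[int]:
    """Hypothetical greedy summary of a collection of subsets.

    Starts from the most frequent pair and repeatedly adds the feature that
    co-occurs most often with the features already chosen.
    """
    usage = feature_usage(subsets)
    pairs = pair_usage(subsets)
    if not pairs:
        return sorted(usage, key=lambda f: (-usage[f], f))[:length]
    (a, b), _ = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
    chosen = [a, b]
    candidates = set(usage) - {a, b}
    while len(chosen) < length and candidates:
        def score(f: int) -> int:
            return sum(pairs[tuple(sorted((f, c)))] for c in chosen)
        best = min(candidates, key=lambda f: (-score(f), -usage[f], f))
        chosen.append(best)
        candidates.remove(best)
    return chosen[:length]


# Made-up feature subsets standing in for the merged output of several GA runs.
ga_subsets = [{1, 2, 3}, {1, 2, 4}, {2, 3, 5}, {1, 2}]
print(feature_usage(ga_subsets).most_common(3))
print(pair_usage(ga_subsets).most_common(2))
print(trend_pattern(ga_subsets, 3))
```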
Examples of feature subsets are provided in Tables 2, 3A and 3B. Thus, in an embodiment, the feature subset comprises the CpG dinucleotides according to one of the selections listed in Table 2.
In an embodiment, the feature subset comprises the CpG dinucleotides according to one of the selections listed in Table 3A.
In an embodiment, the feature subset comprises the CpG dinucleotides according to one of the selections listed in Table 3B.
In an embodiment the method 10 comprises determining 110 the methylation status of one or more CpG dinucleotides in a sequence selected from the group of sequences corresponding to the marker panel, resulting in a methylation classification list. There are numerous methods for determining 110 the methylation status of a DNA molecule of a subject, corresponding to the feature subset. The DNA may be obtained by any method for purifying DNA known to a person skilled in the art. In an embodiment the methylation status is determined 110 by means of one or more of the methods selected from the group consisting of bisulfite sequencing, pyrosequencing, methylation-sensitive single-strand conformation analysis (MS-SSCA), high resolution melting analysis (HRM), methylation-sensitive single nucleotide primer extension (MS-SnuPE), base-specific cleavage/MALDI-TOF, methylation-specific PCR (MSP), microarray-based methods, and MspI cleavage.
In an embodiment, the method 10 also comprises statistically analyzing 120 the methylation classification list, thus obtaining a category of the breast cancer of the subject. This may be done by jointly clustering the subject methylation data and the samples from the clinical study. The resulting clustering is then split into N groups (e.g. by cutting the clustering dendrogram into N sub-trees). The sub-tree containing the subject is evaluated for the categories of breast cancer present in the study samples, and the subject sample is assigned the category of the majority of the samples in the sub-tree.
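The majority assignment may be sketched as follows, assuming the joint clustering has already been cut into groups. The sample identifiers, the categories and the function name are hypothetical, and ties are resolved arbitrarily in favour of the category encountered first.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def assign_category(groups: Sequence[Sequence[str]],
                    known_categories: Dict[str, str],
                    subject_id: str) -> Optional[str]:
    """Return the majority category of the study samples sharing the subject's group.

    Returns None when the subject's group contains no categorized study sample.
    """
    for group in groups:
        if subject_id in group:
            votes = Counter(known_categories[s] for s in group
                            if s != subject_id and s in known_categories)
            if not votes:
                return None
            return votes.most_common(1)[0][0]
    raise ValueError("subject is not present in any group")


# Made-up outcome of cutting a joint dendrogram into N=2 sub-trees.
groups = [["t1", "t2", "subject", "t3"], ["t4", "t5"]]
known = {"t1": "luminal A", "t2": "luminal A", "t3": "basal",
         "t4": "basal", "t5": "basal"}
print(assign_category(groups, known, "subject"))
```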
In an embodiment, the method 10 further comprises classifying 130 the subject as belonging to one of the five major subtypes of breast cancer.
In an embodiment according to
In an embodiment according to
The invention may be implemented in any suitable form including hardware, software, firmware or any combination of these. However, preferably, the invention is implemented as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different units and processors.
Although the present invention has been described above with reference to specific embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the invention is limited only by the accompanying claims, and embodiments other than those specifically described above are equally possible within the scope of these appended claims.
In the claims, the term “comprises/comprising” does not exclude the presence of other elements or steps. Furthermore, although individually listed, a plurality of means, elements or method steps may be implemented by e.g. a single unit or processor. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. In addition, singular references do not exclude a plurality. The terms “a”, “an”, “first”, “second” etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB10/50316 | 1/25/2010 | WO | 00 | 9/20/2011 |

Number | Date | Country |
---|---|---|
61148413 | Jan 2009 | US |