This application claims the benefit of Korean Patent Application No. 10-2012-0089667, filed on Aug. 16, 2012, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field
The present disclosure relates to methods and apparatuses for analyzing personalized multi-omics data by combining different types of genetic information into a single representation.
2. Description of the Related Art
A genome is the entirety of a living organism's genetic information. As techniques for sequencing the genome of an individual have continued to evolve, various novel sequencing methods such as Next Generation Sequencing and Next Next Generation Sequencing are being developed. Genetic information, including nucleic acid sequences and protein sequences, is widely used to identify genes causing diseases such as diabetes and cancer or to detect correlations between genetic variations and characteristics expressed in an individual. Genetic information collected from an individual is crucial for identifying the genetic characteristics of the individual related to the onset or progression of different symptoms or diseases. Thus, by providing information about a present illness or the future likelihood of some diseases, personal genome information such as nucleic acid sequences or protein sequences plays an important role in determining the best treatment at the early stages of a disease, if one is present, or in preventing the occurrence of disease. Due to its growing importance, research is being conducted on techniques for precisely analyzing personal genome information using a genome detecting device such as a DNA chip or microarray for detecting single nucleotide polymorphisms (SNP) and copy number variations (CNV) as genomic information of a living organism.
Provided are methods and apparatuses for analyzing personalized multi-omics data by integrating different types of biological data. Also provided is a computer readable recording medium having recorded thereon a computer program for executing the above methods.
According to an aspect of the present invention, a method of analyzing personalized multi-omics data includes: acquiring a plurality of biological data groups containing different types of genome data from an individual's gene sample; estimating indices indicating a degree of genetic abnormalities in each of the different types of genomic data for each of the plurality of biological data groups; and generating a combined index which evaluates the degree of genetic abnormalities for the entire biological data groups by using an analysis algorithm for generalizing the estimated indices.
According to another aspect of the present invention, a method of analyzing personalized multi-omics data includes: estimating indices indicating a degree of genetic abnormalities for each of a plurality of different biological data groups obtained from an individual's gene sample; obtaining a confidence value for each of the plurality of biological data groups from genome data measurement platforms used to obtain the plurality of biological data groups; and reflecting the confidence values in the estimated indices to generalize the estimated indices and generating a combined index which evaluates the degree of genetic abnormalities for the entire biological data groups.
According to another aspect of the present invention, a non-transitory computer-readable recording medium having recorded thereon a program for executing the method of analyzing personalized multi-omics data is provided.
According to another aspect of the present invention, an apparatus for analyzing personalized multi-omics data includes: a data acquisition unit for acquiring a plurality of biological data groups containing different types of genome data from an individual's gene sample; an index estimation unit for estimating indices indicating a degree of genetic abnormalities in each of the different types of genomic data for each of the plurality of biological data groups; and a combined index generation unit for generating a combined index which evaluates the degree of genetic abnormalities for the entire biological data groups by using an analysis algorithm for generalizing the estimated indices.
According to another aspect of the present invention, an apparatus for analyzing personalized multi-omics data includes: an index estimation unit for estimating indices indicating a degree of genetic abnormalities for each of a plurality of different biological data groups obtained from an individual's gene sample; a data acquisition unit for obtaining a confidence value for each of the plurality of biological data groups from genome data measurement platforms used to obtain the plurality of biological data groups; and a combined index generation unit for reflecting the confidence values in the estimated indices to generalize the estimated indices and generating a combined index which evaluates the degree of genetic abnormalities for the entire biological data groups.
As described above, the method and apparatus for analyzing personalized multi-omics data allow personalization of genomic information obtained from an individual's gene sample for analysis, thereby providing precise detection of genetic abnormalities in an individual's genome. The method and apparatus may also combine or merge different kinds of genome information derived from an individual's gene sample for analysis, thereby allowing more precise and efficient analysis of an individual's genome information compared to the use of a single type of data.
These and/or other aspects will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In this regard, the present embodiments may have different forms and should not be construed as being limited to the descriptions set forth herein. Accordingly, the embodiments are merely described below, by referring to the figures, to explain aspects of the present description.
The system 1 uses microarrays 21 and 22 such as DNA chips and a sequencing tool 23 such as Genotype Console or Expression Console to obtain various types of genome information including nucleic acid sequences and protein sequences from the gene sample of a patient 2. The gene sample can be any type of sample containing genetic information (e.g., DNA, RNA, or protein), such as blood, saliva, or other samples (e.g., tissue or fluid samples) of the body. Thus, the system 1 may use different measurement platforms to obtain various types of genome information.
The details of the processes of obtaining various kinds of genome information about nucleic acids and protein contained in a sample by using measurement platforms such as the microarrays 21 and 22 and the sequencing tool 23 are known to those of ordinary skill in the art, and a detailed description thereof is accordingly omitted.
The system 1 may employ measurement platforms other than the microarrays 21 and 22 and the sequencing tool 23 so long as they can obtain various types of genome information such as information about nucleic acids and protein.
Nucleic acids contain genome information about an individual and are divided into two types: DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA). DNA is a genetic material, i.e., a gene, containing an individual's genome information. A DNA sequence contains information about cells and tissues of an individual, and bases in the DNA sequence represent information about the order in which 20 types of amino acids in a protein of an individual are joined together or aligned. That is, the protein is a product produced from nucleic acid and expressed in various types according to an individual's DNA sequence.
Genome information such as an individual's DNA sequence and protein is useful for understanding biological phenomena and obtaining information about an individual's disease. Thus, comparing a DNA sequence in a patient's gene with a DNA sequence from a normal gene for analysis may prevent occurrence of an individual's illness or facilitate choosing the best treatment at the early stages of a disease.
The system 1 analyzes the patient's genome information to detect genetic abnormalities. To achieve this, the apparatus 10 for analyzing personalized multi-omics data in the system 1 personalizes biological data groups related to various types of genome information such as information about nucleic acids and protein derived from the gene sample 20 and combines the results for analysis.
‘Omics’ refers to a field of study in biology, encompassing, e.g., genomics, proteomics, transcriptomics, and metabolomics. Multi-omics refers to genetic information gathered from multiple sources. For instance, multi-omics data might include information regarding DNA (e.g., sequence, single nucleotide polymorphism, mutation, copy number variation, etc.), RNA (e.g., sequence, mutation, copy number variation, etc.), and/or protein (e.g., sequence, mutation, expression level, etc.) relating to a gene or group of genes.
A biological data group, as used herein, refers to a data group comprising genome data (i.e., genomic data or “omic” data) from a given measurement platform or source, together with its quality score or confidence indicator. The plurality of biological data groups described in the present embodiment each contain different types of omics data sets originating from the gene sample 20 and, thus, collectively contain multi-dimensional genetic information, for instance, Single Nucleotide Polymorphism (SNP), Copy Number Variation (CNV), mutation information, mRNA expression data, the results of proteome analysis to identify genetic phenomena such as how a gene functions after the gene is turned into a protein, or the results of transcriptome analysis to identify genetic phenomena such as how a gene will function during the transition from a gene to a protein.
In one embodiment, each of the plurality of biological data groups contains different omics data regarding a particular gene or group of genes. More specifically, the plurality of data groups may include two or more different data groups each comprising data about mutation, SNP, CNV, insertion, deletion, gene expression, DNA methylation, protein expression, protein targeting, protein phosphorylation, and protein binding.
The system 1 and the apparatus 10 according to the present embodiment personalize the biological data groups and integrally combine or merge the results for analysis. By relying upon multiple different types of omics data, the system, apparatus, and method described herein enable more precise, accurate, and/or efficient detection of abnormalities in an individual's genome.
The system 1 and the apparatus 10 combine or merge the plurality of biological data groups by using confidence values of the data included in the biological data groups. The details of this process are described by reference to embodiments of the invention in the following paragraphs.
The data acquisition unit 100 acquires a plurality of biological data groups, at least two of which contain different kinds of genetic information (e.g., different types of omics data, as discussed above), from the patient's gene sample 20.
The data acquisition unit 100 also obtains a confidence value for each biological data group, which may be a measure of precision and/or accuracy for the data of the biological data group. More specifically, each of the biological data groups is acquired from a particular platform or software, e.g., a sequencing tool 23, such as Genotype Console and Expression Console, together with a confidence value or quality measure describing how reliable (e.g., precise and/or accurate) the acquired data is. That is, the confidence value may be information based on a quality score produced by the measurement platforms used to obtain the different types of biological data groups.
In the present embodiment, the confidence value is used as a weight assigned to an index for each of the different types of biological data groups. As will be described later, if data sets are acquired by different sequencing tools 23 and then normalized based on confidence values, as described above, the data sets may be compared with each other.
For example, when SNP or CNV calling is performed using Affymetrix SNP6.0, a confidence value may be obtained for each gene site, together with the corresponding data. The confidence value may lie between 0 and 1 and may be converted into a percentile in order to normalize the data. When Affymetrix U133 is used instead of SNP6.0, a detection p-value is acquired. The detection p-value indicates how reliable the absent (A), marginal (M), and present (P) calls for each probe are. Likewise, the detection p-value may be converted into a percentile so as to normalize the data.
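The percentile conversion described above can be sketched as follows. This is only an illustrative sketch and not the disclosed implementation: the function name `to_percentiles`, the `higher_is_better` flag, and the choice of a "fraction of values less than or equal to" rank are assumptions made for the example, and the input numbers are made up.

```python
from bisect import bisect_right


def to_percentiles(values, higher_is_better=True):
    """Return, for each value, its percentile rank in (0, 1].

    Assumption: the percentile is the fraction of values that are at most
    as reliable as the given one. For detection p-values, where a smaller
    number means a more reliable call, pass higher_is_better=False.
    """
    keyed = [v if higher_is_better else -v for v in values]
    ordered = sorted(keyed)
    n = len(ordered)
    return [bisect_right(ordered, k) / n for k in keyed]


# Hypothetical confidence values for four gene sites.
confidences = [0.90, 0.50, 0.99, 0.75]
print(to_percentiles(confidences))  # [0.75, 0.25, 1.0, 0.5]
```

Because both a confidence value and a detection p-value end up on the same 0 to 1 percentile scale, weights derived from different platforms become comparable, which is the purpose of the normalization.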
In describing the present embodiment, it is assumed that the plurality of biological data groups include only a biological data group related to mutation, a biological data group related to mRNA expression, and a biological data group related to CNV. However, the plurality of biological data groups are not limited thereto, and may include other types of biological data groups.
In order to obtain a biological data group related to mutation, the gene sample 20 reacts with a DNA chip (e.g., SNP 6.0), and the data acquisition unit 100 acquires the result produced by the sequencing tool 23 (e.g., Genotype Console) and its corresponding confidence value. In order to obtain a biological data group related to mRNA expression, the gene sample 20 reacts with a DNA chip (e.g., U133 Plus2.0), and the data acquisition unit 100 acquires the result produced by the sequencing tool 23 (e.g., Expression Console) and its corresponding confidence value. Furthermore, in order to obtain a biological data group related to CNV, the gene sample 20 reacts with a DNA chip (e.g., U133 Plus2.0), and the data acquisition unit 100 acquires the result produced by the sequencing tool 23 (e.g., Expression Console) and its corresponding confidence value. Thus, the data acquisition unit 100 obtains a plurality of biological data groups, including different types of genetic information about a gene or set of genes, and corresponding confidence values.
For each of the acquired biological data groups, the index estimation unit 200 estimates (calculates) indices indicating an estimated degree of genetic abnormality in each of the different types of genetic data contained therein. For convenience in describing the present embodiment, the estimated indices are p-values for statistically testing the significance of the degree of genetic abnormalities. However, other statistical indices may be used.
The index estimation unit 200 statistically compares genetic data contained in the acquired biological data groups with corresponding control groups and calculates indices for the biological data groups. The control groups may be data obtained from public databases corresponding to the biological data groups (i.e., the same type of data corresponding to the same gene or set of genes), but the present invention is not limited thereto.
The index estimation unit 200 may compare genetic data with corresponding control groups by using a normal distribution or an empirical distribution. In particular, the index estimation unit 200 compares the genetic data of each of the biological data groups with a corresponding control group by using the same type of distribution.
The index estimation unit 200 may perform the above-described processes on each gene within the genetic data contained in the biological data groups.
Processes of calculating or estimating indices in the index estimation unit 200 according to the present invention will now be described more fully.
A DNA chip (SNP 6.0) provides the result of a reaction with a gene sample (301).
A sequencing tool (Genotype Console) performs a Genotype Call on the result of the reaction (302).
The sequencing tool (Genotype Console) carries out annotation on the result obtained in operation 302 (303). In this case, the sequencing tool (Genotype Console) translates the result obtained in operation 302 into the name of a gene containing a mutation. For example, the sequencing tool (Genotype Console) may convert the result to an annotation such as ‘hg19.position.ref.change’.
A sequencing tool, Mutation Assessor, developed by Memorial Sloan Kettering Cancer Center (MSKCC), calculates an FI score (functional impact score) and a confidence value for each gene (304).
The data acquisition unit 100 obtains a biological data group related to the mutation, together with the FI score and a confidence value of the biological data group related to the mutation (305).
The index estimation unit 200 fits the obtained FI score to a normal distribution (like a z-score) and calculates an index p-valuem (306). The process of calculating an index p-valuem is described in greater detail below. The index p-valuem may be obtained for each gene contained in the biological data group related to the mutation. The index p-valuem obtained for the biological data group related to the mutation from the index estimation unit 200 as described above may be used as an index that is personalized to the patient 2 for mutation.
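As a rough illustration of operation 306, the sketch below standardizes a score against a control group and reads a p-value from the standard normal distribution. Several points are assumptions of this example and are not stated in the text: the control scores are taken to be approximately normal, a one-sided upper-tail test is used (a higher functional impact score is treated as more abnormal), and the function name `normal_p_value` and the sample numbers are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_p_value(score, control_scores):
    """One-sided p-value of `score` under a normal fit to the controls.

    Assumption: larger scores indicate a greater degree of abnormality,
    so the upper tail of the distribution is used.
    """
    mu = mean(control_scores)
    sigma = stdev(control_scores)      # sample standard deviation
    z = (score - mu) / sigma           # z-score of the patient's value
    return 1.0 - NormalDist().cdf(z)   # upper-tail probability


# Hypothetical control scores for one gene and a patient score of 4.
print(normal_p_value(4, [1, 2, 3]))
```

A score equal to the control mean gives a p-value of 0.5, and the p-value shrinks as the score moves further above the controls.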
A DNA chip (U133 Plus2.0) provides the result of a reaction with a gene sample (311).
A sequencing tool (Expression Console) performs an Expression Call on the result of the reaction (312).
The sequencing tool (Expression Console) uses a MicroArray Suite 5.0 (MAS5) algorithm to detect an initial p-value for each ProbeSetID from the result obtained in operation 312 and calculates a corresponding confidence value (313).
The data acquisition unit 100 obtains a biological data group related to mRNA expression, and the initial p-value and confidence value of the biological data group related to mRNA expression (314).
The index estimation unit 200 fits the obtained initial p-value to a normal distribution or an empirical distribution and estimates an index p-valueR (315). The process of calculating an index p-valueR is described in greater detail below. The index p-valueR may be obtained for each gene contained in the biological data group related to mRNA expression.
The index estimation unit 200 uses the Gene Symbol corresponding to each ProbeSetID to perform annotation on the index p-valueR (316). If several probe sets map to the same gene, the index estimation unit 200 estimates the final index p-valueR and its corresponding confidence value based on the index p-valueR having the smallest value.
The index p-valueR obtained from the index estimation unit 200 for the biological data group related to mRNA expression as described above may be used as an index that is personalized to the patient 2 for mRNA expression.
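The gene-level annotation of operation 316 can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(probe_set_id, p_value, confidence)`, the function name `collapse_to_genes`, and the identifiers in the example are all hypothetical, and probe sets without a gene symbol are simply skipped.

```python
def collapse_to_genes(probe_records, probe_to_gene):
    """Keep, for each gene, the probe set with the smallest p-value.

    probe_records: iterable of (probe_set_id, p_value, confidence).
    probe_to_gene: mapping from probe set ID to gene symbol.
    Returns {gene_symbol: (p_value, confidence)}.
    """
    best = {}
    for probe_id, p_value, confidence in probe_records:
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # assumption: unannotated probe sets are ignored
        if gene not in best or p_value < best[gene][0]:
            best[gene] = (p_value, confidence)
    return best


# Hypothetical probe sets; two of them map to the same gene.
records = [("probe_1", 0.20, 0.8), ("probe_2", 0.03, 0.6), ("probe_3", 0.50, 0.9)]
mapping = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
print(collapse_to_genes(records, mapping))
# {'GENE_A': (0.03, 0.6), 'GENE_B': (0.5, 0.9)}
```

The confidence value travels with the selected probe set, so the weight later applied to the gene-level index belongs to the same measurement as the retained p-value.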
A DNA chip (SNP 6.0) provides the result of a reaction with a gene sample (321).
A sequencing tool (Genotype Console) performs a Genotype Call on the result of the reaction (322).
The sequencing tool (Genotype Console) carries out annotation on the result obtained in operation 322 (323). In this case, the sequencing tool (Genotype Console) may perform annotation (hg 18 version) on genes within the result that are found in, or partially correspond to, a CNV region.
The sequencing tool (Genotype Console) converts the result obtained in operation 323 for each gene and removes data for duplicate genes (324).
The data acquisition unit 100 obtains a biological data group related to CNV, and a confidence value of the biological data group related to CNV (325).
The index estimation unit 200 fits the obtained biological data group to an empirical distribution and estimates an index p-valuec (326). The process of calculating an index p-valuec is described in greater detail below. The index p-valuec obtained for the biological data group related to CNV from the index estimation unit 200 as described above may be used as an index that is personalized to the patient 2 for CNV.
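One common way to obtain a p-value from an empirical distribution, shown here only as a sketch of what operation 326 might look like, is to count how many control observations are at least as extreme as the patient's value. The one-sided comparison, the add-one correction that keeps the p-value above zero, and the name `empirical_p_value` are assumptions of this example and are not taken from the text.

```python
def empirical_p_value(observed, control_values):
    """Upper-tail empirical p-value of `observed` against the controls.

    Assumption: the add-one correction (extreme + 1) / (n + 1) is used so
    that the result is never exactly zero, which keeps later steps that
    take logarithms or powers of the p-value well defined.
    """
    extreme = sum(1 for v in control_values if v >= observed)
    return (extreme + 1) / (len(control_values) + 1)


# Hypothetical copy-number measurements from a control group.
controls = [1, 2, 3, 4]
print(empirical_p_value(5, controls))  # 0.2
print(empirical_p_value(0, controls))  # 1.0
```

For copy number data, where both gains and losses may count as abnormal, a two-sided variant that measures deviation from the control median would be a natural alternative.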
More specifically, the index standardization unit 310 incorporates (reflects) the confidence value for each of the biological data groups obtained by the data acquisition unit 100 into the indices calculated by the index estimation unit 200, and normalizes the indices for each of the biological data groups. The combined index calculating unit 320 then generalizes the normalized indices by using an analysis algorithm for generalizing the estimated indices and produces a combined index p-valuecombine.
The analysis algorithm used in the combined index generation unit 300 may be a meta-analysis algorithm. Examples of the generally known meta-analysis algorithm include a Fisher's inverse chi-square method, a Tippett's method (minimum p method), a Stouffer's inverse normal method, a George's method (logit method), and The Cancer Genome Atlas (TCGA) method.
The meta-analysis algorithm is used to obtain a representative p-value from a plurality of p-values. The precise methodology for applying the algorithms will be readily apparent to those of ordinary skill in the art. Furthermore, it will be understood by those of ordinary skill in the art that the combined index generation unit 300 may use any meta-analysis algorithm so long as the algorithm is designed for obtaining a representative p-value from among a plurality of p-values given for the same sample.
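For illustration, three of the generally known methods named above can be written in a few lines using only the Python standard library. This is a sketch of the textbook, unweighted forms and not the disclosed implementation; the function names are hypothetical, and every input p-value is assumed to lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def fisher_combined(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    # For even degrees of freedom 2k, the chi-square survival function is
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def tippett_combined(p_values):
    """Tippett's minimum-p method: 1 - (1 - min p)^k."""
    return 1.0 - (1.0 - min(p_values)) ** len(p_values)


def stouffer_combined(p_values):
    """Stouffer's inverse normal method with equal weights."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - nd.cdf(z)


p = [0.05, 0.05]
print(fisher_combined(p), tippett_combined(p), stouffer_combined(p))
```

All three reduce a list of p-values for the same sample to a single representative p-value; they differ in how strongly one very small p-value dominates the result.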
By way of further illustration, the combined index generation unit 300 may apply a meta-analysis algorithm as described below.
The index standardization unit 310 applies a weight corresponding to a confidence value (e.g., a confidence value converted to a percentile) for each of the biological data groups to the estimated indices and converts the estimated indices. The combined index calculating unit 320 combines or merges the indices obtained by the index standardization unit 310 and produces a combined index p-valuecombine. This process is expressed by Equation (1):
$$p_{combine} = p_m^{w_m} \cdot p_R^{w_R} \cdot p_c^{w_c}, \qquad (w_m + w_R + w_c = 1) \tag{1}$$
pm=personalized p-value in mutation data
pR=personalized p-value in mRNA expression data
pc=personalized p-value in CNV data
wm=percentiled QC measure in mutation data
wR=percentiled QC measure in mRNA expression data
wc=percentiled QC measure in CNV data
As is evident from Equation (1), the index standardization unit 310 applies (reflects) a weight corresponding to a confidence value wm of a mutation biological data group to an index p-value pm estimated from the biological data group. Similarly, the index standardization unit 310 also applies weights corresponding to confidence values wR and wc of an mRNA expression biological data group and a CNV biological data group to indices pR and pc estimated from the biological data groups, respectively.
The combined index generation unit 300 then multiplies the weighted indices in order to generalize the indices and generates a combined index pcombine.
In this case, if a weight (confidence value) cannot be obtained for a biological data group, a weight w is randomly set using the following Equation (2):
For example, when the weight (confidence value wR) cannot be obtained for the mRNA expression biological data group in Equation (1), and three biological data groups are used in the analysis, the weight wR is assumed to have a value of 1/√3, according to Equation (2).
Furthermore, if an index p-value cannot be estimated from a biological data group, the index p-value may be set to 1.
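The weighted combination of Equation (1), together with the two fallback rules just described, can be sketched as follows. This is an illustrative sketch only: the function name `combine_indices`, the dictionary keys, and the numbers are hypothetical. Because Equation (2) is not reproduced here, the fallback weight is passed in by the caller as `default_weight` rather than computed, and the sketch does not rescale the weights to sum to 1.

```python
import math


def combine_indices(p_values, weights, default_weight):
    """Weighted product of p-values: the product of p_i ** w_i over all groups.

    p_values: {group: p-value, or None if it could not be estimated}.
    weights:  {group: percentile confidence value, or None if unavailable}.
    default_weight: weight used when a confidence value is unavailable
                    (supplied by the caller; see Equation (2) in the text).
    """
    combined = 1.0
    for group, p in p_values.items():
        if p is None:
            p = 1.0               # a missing index is set to 1
        w = weights.get(group)
        if w is None:
            w = default_weight    # a missing confidence uses the fallback
        combined *= p ** w
    return combined


# Hypothetical indices for one gene; the CNV index could not be estimated.
p = {"mutation": 0.01, "mrna": 0.04, "cnv": None}
w = {"mutation": 0.5, "mrna": 0.5, "cnv": None}
print(combine_indices(p, w, default_weight=1 / math.sqrt(3)))  # approximately 0.02
```

Setting a missing p-value to 1 makes that data group neutral in the product, since 1 raised to any weight is 1, so the combined index is driven only by the data types that were actually measured.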
The apparatus 10 for analyzing personalized multi-omics data outputs a combined index pcombine (or p-value pcombine) that is obtained by combining indices for different types of biological data groups in the manner described above.
The combined index pcombine may be used as input data for a variety of different purposes, such as regression analysis, gene classification, and/or gene clustering analysis. For instance, it may be used to analyze the relationship between a receptor, such as c-MET, and an oncogene, thereby allowing precise diagnostics for c-MET in patients with cancer. The method and apparatus described herein are believed to be particularly useful as a companion diagnostic for a particular course of therapy (e.g., anti-c-Met therapy). Thus, the method described herein may further comprise administering a therapeutic agent, particularly an anti-cancer agent (e.g., a c-Met antagonist), before or after performing the method.
Thereafter, the apparatus applies a meta-analysis algorithm to the estimated indices pm, pc, and pR to generalize or merge the indices (604). In this case, as an example of a meta-analysis, the apparatus 10 generalizes or merges the estimated indices pm, pc, and pR by applying weights wm, wc, and wR based on confidence values and combining the weighted values. The apparatus 10 outputs a combined index pcombine (605).
The data acquisition unit 100 obtains a plurality of biological data groups containing different types of genome information from an individual's gene sample (701).
The index estimation unit 200 estimates an index indicating the degree of genetic abnormalities in the different types of genome information for each of the biological data groups (702).
The combined index generation unit 300 uses an analysis algorithm for generalizing the estimated indices to generate a combined index for evaluating genetic abnormalities for the entire biological data groups (703).
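Operations 701 to 703 can be tied together in a compact sketch for a single gene. This is a simplified illustration under several assumptions: every data type is compared with its control group using a normal fit and an upper-tail test, the combination is the weighted product form described earlier in this description, and the names `DataGroup`, `estimate_index`, and `combined_index` as well as all numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import NormalDist, mean, stdev
from typing import Optional, Sequence


@dataclass
class DataGroup:
    """One biological data group for one gene (operation 701)."""
    name: str
    value: float                  # patient's measured value
    controls: Sequence[float]     # control-group values of the same type
    confidence: Optional[float]   # percentile confidence, if available


def estimate_index(group):
    """Operation 702: p-value of the patient's value versus the controls."""
    z = (group.value - mean(group.controls)) / stdev(group.controls)
    return 1.0 - NormalDist().cdf(z)


def combined_index(groups, default_weight):
    """Operation 703: weighted product of the per-group p-values."""
    combined = 1.0
    for g in groups:
        w = default_weight if g.confidence is None else g.confidence
        combined *= estimate_index(g) ** w
    return combined


groups = [
    DataGroup("mutation", 4.0, [1.0, 2.0, 3.0], 0.5),
    DataGroup("mrna", 2.0, [1.0, 2.0, 3.0], 0.5),
]
print(combined_index(groups, default_weight=0.5))
```

Keeping estimation and combination as separate functions mirrors the split between the index estimation unit 200 and the combined index generation unit 300, so either step could be swapped for another distribution or another meta-analysis algorithm.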
The above embodiments of the present invention may be recorded in programs (non-transient computer readable medium) that can be executed on a computer and be implemented through general purpose digital computers that can run the programs using a computer readable recording medium. Data structures described in the above embodiments may be recorded on a medium in a variety of ways, with examples of the medium including recording media, such as magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.) and optical recording media (e.g., CD-ROMs, or DVDs), and transmission media such as Internet transmission media.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.
The use of the terms “a” and “an” and “the” and “at least one” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term “at least one” followed by a list of one or more items (for example, “at least one of A and B”) is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
Preferred embodiments of this invention are described herein, including the best mode known to the inventors for carrying out the invention. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context.
Number | Date | Country | Kind |
---|---|---|---|
10-2012-0089667 | Aug 2012 | KR | national |