This application claims the benefit of Korean Patent Application No. 10-2013-0118120, filed on Oct. 2, 2013, in the Korean Intellectual Property Office, the entire disclosure of which is hereby incorporated by reference.
1. Field
The present disclosure relates to methods and apparatuses for diagnosing diseases, such as cancer, by using genetic information of a subject.
2. Description of the Related Art
A genome denotes all genetic information of one organism. Various technologies, such as a deoxyribonucleic acid (DNA) chip and next generation sequencing technology, are being developed to sequence a genome of a person. Genetic information, such as a nucleic acid sequence, is widely used to find genes associated with diseases, such as diabetes, cancer, etc., or to determine a correlation between genetic diversity and an expression characteristic of an individual. In particular, genetic information collected from a person is important in determining genetic features of a person associated with different symptoms or a progression of a disease. Therefore, genetic information, such as a nucleic acid sequence, of a person is fundamental data for determining current and future disease-related information to prevent a disease or to select an optimal treatment method at the initial stage of a disease. Regarding genetic information of an organism, research is being done into technologies that accurately analyze genetic information of a person and diagnose a disease of the person by using a genome detection apparatus, such as a DNA chip, a microarray, or the like, which detects single nucleotide polymorphisms (SNPs) and copy number variations (CNVs).
Provided are methods and apparatuses for diagnosing diseases, such as cancer, by using genetic information of a subject.
Provided is a non-transitory computer-readable storage medium storing a program for executing the methods.
These and/or other aspects will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings in which:
Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments.
According to an aspect of the present invention, a method of diagnosing cancer by using genetic information includes: acquiring first gene expression data of a subject, for whom cancer is to be diagnosed, for a gene marker set including at least one gene marker; and determining a possibility of cancer of the subject by using the acquired first gene expression data and pre-stored second gene expression data of a normal person group and a cancer patient group, wherein the gene marker set includes at least one gene marker selected from the group consisting of pyrroline-5-carboxylate reductase 1 (PYCR1), phosphoglycerate dehydrogenase (PHGDH), glutaminase 2 (liver, mitochondrial) (GLS2), glutaminase (GLS), glutamate dehydrogenase 1 (GLUD1), glutamate-ammonia ligase (GLUL), glutamic-oxaloacetic transaminase 1 and soluble (aspartate aminotransferase 1) (GOT1), glutamic-oxaloacetic transaminase 2 and mitochondrial (aspartate aminotransferase 2) (GOT2), glutamic-pyruvate transaminase (alanine aminotransferase) (GPT), glutamic pyruvate transaminase (alanine aminotransferase 2) (GPT2), phosphoserine aminotransferase 1 (PSAT1), asparagine synthetase (glutamine-hydrolyzing) (ASNS), ornithine aminotransferase (OAT), phosphoserine phosphatase (PSPH), aldehyde dehydrogenase 18 family and member A1 (ALDH18A1), and cysteine conjugate-beta lyase cytoplasmic (CCBL1).
The gene marker set may include PYCR1, and comprise at least one gene marker selected from the group consisting of PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1.
The gene marker set may comprise one gene marker PYCR1.
The gene marker set may comprise all the gene markers PYCR1, PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1.
The method may further comprise preprocessing the first gene expression data including first gene expression levels, based on a distribution of second gene expression levels included in the pre-stored second gene expression data, wherein the determining may include determining the presence of cancer by using the preprocessed first gene expression data and the pre-stored second gene expression data.
The preprocessing may comprise calculating ratios of the second gene expression levels and the first gene expression levels in units of a gene marker to preprocess the first gene expression data.
The preprocessing may comprise normalizing or standardizing the first gene expression levels in units of a gene marker to preprocess the first gene expression data in comparison with the second gene expression levels.
The determining may comprise applying the preprocessed first gene expression data to a discriminant model, pre-generated from the pre-stored second gene expression data, to determine the possibility of cancer.
The discriminant model may be pre-generated by using a regression model, which has a univariate representing the gene marker set or a multi-variate corresponding to two or more of the gene markers included in the gene marker set, for the pre-stored second gene expression data.
The determining may include: calculating an index, indicating a degree of expression of the first gene expression levels in the preprocessed first gene expression data, for the second gene expression levels; and applying the calculated index to the pre-generated discriminant model to calculate a statistical significance level indicating a presence probability of cancer, and the possibility of cancer may be determined based on the calculated statistical significance level.
The calculating of an index may include calculating the index by using at least one of the following methods: a Fisher exact test, a binomial test, a geneset enrichment analysis (GSEA), a Mahalanobis distance, a Euclidean distance, a Manhattan distance, a maximum distance, a minimum distance, and a correlation coefficient.
The calculating of an index may include estimating a representative expression pattern that is obtained by summarizing distributions of third gene expression levels of the normal person group in the second gene expression data, and the index may be calculated based on the degree of expression of the first gene expression levels in the preprocessed first gene expression data for the estimated representative expression pattern.
The determining may include: calculating an index, indicating a degree of expression of the first gene expression levels in the preprocessed first gene expression data, for a representative expression pattern that is obtained by summarizing distributions of third gene expression levels of the normal person group; and calculating a statistical significance level indicated by the calculated index by using an empirical distribution of degrees of expression of the third gene expression levels for the representative expression pattern, and the possibility of cancer may be determined based on the calculated statistical significance level.
The determining may further include comparing the calculated statistical significance level and a threshold value, which is used to determine the presence of cancer or a degree of occurrence of cancer, by using the pre-generated discriminant model, and the possibility of cancer may be determined based on the compared result.
According to another aspect of the present disclosure, an apparatus for diagnosing cancer by using genetic information comprises: a gene expression data acquiring unit that acquires first gene expression data of a subject, for whom cancer is to be diagnosed, for a gene marker set including at least one gene marker; and a determination unit that determines a possibility of cancer of the subject by using the acquired first gene expression data and pre-stored second gene expression data of a normal person group and a cancer patient group, wherein the gene marker set includes at least one gene marker selected from the group consisting of pyrroline-5-carboxylate reductase 1 (PYCR1), phosphoglycerate dehydrogenase (PHGDH), glutaminase 2 (liver, mitochondrial) (GLS2), glutaminase (GLS), glutamate dehydrogenase 1 (GLUD1), glutamate-ammonia ligase (GLUL), glutamic-oxaloacetic transaminase 1 and soluble (aspartate aminotransferase 1) (GOT1), glutamic-oxaloacetic transaminase 2 and mitochondrial (aspartate aminotransferase 2) (GOT2), glutamic-pyruvate transaminase (alanine aminotransferase) (GPT), glutamic pyruvate transaminase (alanine aminotransferase 2) (GPT2), phosphoserine aminotransferase 1 (PSAT1), asparagine synthetase (glutamine-hydrolyzing) (ASNS), ornithine aminotransferase (OAT), phosphoserine phosphatase (PSPH), aldehyde dehydrogenase 18 family and member A1 (ALDH18A1), and cysteine conjugate-beta lyase cytoplasmic (CCBL1).
The gene marker set may comprise PYCR1, and include at least one gene marker selected from the group consisting of PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1.
The determination unit may comprise a preprocessor that preprocesses the first gene expression data including first gene expression levels, based on a distribution of second gene expression levels included in the pre-stored second gene expression data, and the determination unit may determine the presence of cancer by using the preprocessed first gene expression data and the pre-stored second gene expression data.
The apparatus may further comprise a storage unit that stores a discriminant model pre-generated from the pre-stored second gene expression data, wherein the determination unit may determine the possibility of cancer by using the pre-generated discriminant model and the preprocessed first gene expression data.
The determination unit may further include a calculator that calculates an index, indicating a degree of expression of the first gene expression levels in the preprocessed first gene expression data, for the second gene expression levels, and applies the calculated index to the pre-generated discriminant model to calculate a statistical significance level indicating a presence probability of cancer, and the determination unit may determine the possibility of cancer, based on the calculated statistical significance level.
The determination unit may further include a comparator that compares the calculated statistical significance level and a threshold value, which is used to determine the presence of cancer or a degree of occurrence of cancer, by using the pre-generated discriminant model, and the determination unit may determine the possibility of cancer, based on the compared result.
According to another aspect of the present disclosure, a method of detecting cancer using genetic information comprises: acquiring first gene expression data of a subject for a gene marker set including at least one gene marker; and comparing the acquired first gene expression data to pre-stored second gene expression data of a normal person group and a cancer patient group by calculating statistical similarity of the first gene expression data to the pre-stored second gene expression data, wherein cancer in the subject is indicated if the first gene expression data is more similar to the pre-stored second gene expression data of the cancer patient group than to the pre-stored second gene expression data of the normal person group, and wherein the gene marker set comprises at least one gene marker selected from the group consisting of pyrroline-5-carboxylate reductase 1 (PYCR1), phosphoglycerate dehydrogenase (PHGDH), glutaminase 2 (liver, mitochondrial) (GLS2), glutaminase (GLS), glutamate dehydrogenase 1 (GLUD1), glutamate-ammonia ligase (GLUL), glutamic-oxaloacetic transaminase 1 and soluble (aspartate aminotransferase 1) (GOT1), glutamic-oxaloacetic transaminase 2 and mitochondrial (aspartate aminotransferase 2) (GOT2), glutamic-pyruvate transaminase (alanine aminotransferase) (GPT), glutamic pyruvate transaminase (alanine aminotransferase 2) (GPT2), phosphoserine aminotransferase 1 (PSAT1), asparagine synthetase (glutamine-hydrolyzing) (ASNS), ornithine aminotransferase (OAT), phosphoserine phosphatase (PSPH), aldehyde dehydrogenase 18 family and member A1 (ALDH18A1), and cysteine conjugate-beta lyase cytoplasmic (CCBL1), wherein the statistical similarity is calculated using at least one of a Fisher exact test, a binomial test, a geneset enrichment analysis (GSEA), a Mahalanobis distance, a Euclidean distance, a Manhattan distance, a maximum distance, a minimum distance, and a correlation coefficient.
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In this regard, the present embodiments may have different forms and should not be construed as being limited to the descriptions set forth herein. Accordingly, the embodiments are merely described below, by referring to the figures, to explain aspects of the present description. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Although not shown in
That is, the cancer diagnostic system 100 of
A nucleic acid, such as DNA, of an individual corresponds to genetic material (i.e., a gene) including genetic information. A nucleic acid sequence includes information about tissue and cells constituting an individual. Therefore, research into information about a complete nucleic acid sequence of a person is frequently carried out while investigating a vital phenomenon, developing a new drug, diagnosing and preventing a disease, and researching the heredity of humans.
Recently, due to the advance of genome research, a functional correlation between genes included in a genome has gradually been disclosed, and thus, an analysis of a gene network between genes is attracting much attention. This may be because most physiological phenomena happening in an organism are caused by a reaction between a plurality of genes instead of one gene.
The gene network is expressed as a network in which genes are intricately connected to each other, and may be acquired from a database, known by one of ordinary skill in the art, such as that of the National Center for Biotechnology Information (NCBI). However, due to the advance of gene analysis technology, new gene networks are continuously disclosed, and the existing gene network is updated with the new gene networks. Therefore, a gene network to be described in the present embodiment is not limited to the gene network acquired from the known database.
In the cancer diagnostic system 100, as described above, the cancer diagnostic apparatus 10 performs an analysis based on a correlation between genetic information and the cancer occurrence of a person to increase an efficiency and accuracy of cancer diagnosis.
To date, a correlation between various kinds of cancers and gene pathways has been elucidated through much research into the human genome. Among this research, there is research indicating that glutamate metabolism is relevant to cancer metabolism. Such research is disclosed in the paper of Munoz-Pinedo C. et al. published in the Cell Death and Disease Journal (2012) and the paper of Kara L. Cerveny published in the Cell Journal (2013). Thus, an influence of glutamate metabolism on cancer metabolism is not described here in detail.
Glutamate metabolism greatly affects the growth or metastasis of a cancer cell, and particularly affects a process in which glutamine is converted to alpha-ketoglutarate, which serves as an energy source and a building block of a cancer cell.
According to the present embodiment, the cancer diagnostic apparatus 10 uses, as gene markers (e.g., bio markers), various kinds of gene networks associated with glutamate metabolism and genes belonging to a gene pathway highly associated with the occurrence of cancer among a plurality of gene pathways in diagnosing cancer, thus increasing the accuracy or efficiency of cancer diagnosis.
In particular, the cancer diagnostic apparatus 10 may use, as a gene marker set, genes belonging to a pathway of amino acid synthesis and interconversion (transamination) among various kinds of gene networks associated with glutamate metabolism.
In more detail, when the genes belonging to the pathway of amino acid synthesis and interconversion (transamination) are abnormal, the cancer diagnostic apparatus 10 may determine glutamate metabolism as being abnormal. Furthermore, when glutamate metabolism is abnormal, the cancer diagnostic apparatus 10 may determine there to be a high possibility of lung cancer (adenocarcinoma of the lung).
However, it can be understood by one of ordinary skill in the art that the present embodiment is not limited to lung cancer (adenocarcinoma of the lung), and may be used to diagnose various kinds of cancers including but not limited to breast cancer, colon cancer, ovarian cancer, etc.
That is, the cancer diagnostic apparatus 10 may determine whether the genes belonging to the pathway of amino acid synthesis and interconversion (transamination) are abnormal, and determine the presence of cancer or a degree of cancer occurrence.
Referring to
Referring to
Regarding glutamate metabolism, as described above, various kinds of gene networks or gene pathways intervene in the occurrence of cancer (e.g., adenocarcinoma of the lung).
According to the present embodiment, however, the pathway of amino acid synthesis and interconversion (transamination) among the gene pathways is used. The reason will be described with reference to a simulation result of
Referring to
Here, 587 gene pathways acquired from Reactome, which is a database of biological pathways, are selected as the gene pathways to be verified against gene expression data of a cancer patient group and a normal person group.
Regarding the 587 gene pathways, gene expression data is analyzed by using an Affymetrix U133A platform and an Affymetrix U133 plus 2.0 platform.
In a detailed simulation setting, as shown in a table 401, among gene expression data of a group of 60 lung cancer patients and gene expression data of a group of 60 normal persons which are acquired from a GSE19804 data set (DataSet) of the GEO database, the selected 587 gene pathways are analyzed by using the Affymetrix U133 plus 2.0 platform. Furthermore, among gene expression data of a group of 27 lung cancer patients and gene expression data of a group of 27 normal persons which are acquired from the GSE19804 data set (DataSet), the selected 587 gene pathways are analyzed by using the Affymetrix U133A platform. Also, among gene expression data of a group of 33 lung cancer patients and gene expression data of a group of 33 normal persons which are acquired from the GSE19804 data set (DataSet), the selected 587 gene pathways are analyzed by using the Affymetrix U133A platform.
In this case, in order to analyze a degree of association between cancer and the 587 gene pathways, a degree of association between a gene pathway and a possibility of lung cancer is analyzed or simulated by using five kinds of methods, namely, the Fisher exact test, the binomial test, the geneset enrichment analysis (GSEA), the Mahalanobis distance, and the Euclidean distance.
The simulation result shows that, among various gene pathways such as the pathway of amino acid synthesis and interconversion (transamination), a pathway of unwinding of DNA, a pathway of O-linked glycosylation of mucins, a pathway of APC Cdc20 mediated degradation of Nek2A, a pathway of synthesis and interconversion of nucleotide di- and triphosphates, a pathway of G1/S specific transcription, and a pathway of kinesins, the pathway of amino acid synthesis and interconversion (transamination) has the highest accuracy of 0.871.
In other words, the simulation result denotes that among various gene pathways associated with glutamate metabolism, a gene pathway having the highest correlation with a possibility of cancer is the pathway of amino acid synthesis and interconversion (transamination).
Therefore, according to the present embodiment, since at least one gene marker selected from the group consisting of PYCR1, PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1, which belong to the pathway of amino acid synthesis and interconversion (transamination), is used as described above with reference to
Moreover, an accuracy of cancer diagnosis is increased by a specific gene marker. Alternatively, an accuracy of cancer diagnosis is increased according to the number of gene markers included in the gene marker set.
Referring to
As a result, the frequency number of the gene marker PYCR1 is shown as being highest, which denotes that when the gene marker PYCR1 is included in a gene marker set for cancer diagnosis, an accuracy of cancer diagnosis increases.
Referring to an AUC distribution graph 503, three AUC distributions are compared: an AUC distribution of a gene marker set 505 including all the sixteen gene markers PYCR1, PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1; an AUC distribution of gene marker sets 506 which are obtained by combining fifteen gene markers except PYCR1; and an AUC distribution of gene marker sets 507 which are obtained by combining the sixteen gene markers including PYCR1. The comparison shows that the AUC distribution of the gene marker sets 506 is lower than the AUC distribution of the gene marker set 505 and the AUC distribution of the gene marker sets 507.
That is, similarly to the analysis of the graph 501, an analysis of the AUC distribution graph 503 may denote that when the gene marker PYCR1 is included in the gene marker set for cancer diagnosis, an accuracy of cancer diagnosis increases.
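The AUC comparisons above rest on the area under the receiver operating characteristic curve of a score that separates the cancer patient group from the normal person group. As a minimal sketch, and not the simulation code behind the graphs, the AUC may be computed as the probability that a randomly chosen cancer sample scores higher than a randomly chosen normal sample (the Mann-Whitney formulation). The function name `auc` and the example scores are hypothetical.

```python
def auc(cancer_scores, normal_scores):
    """AUC as P(cancer score > normal score), counting ties as one half."""
    wins = 0.0
    for c in cancer_scores:
        for n in normal_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(normal_scores))


# Hypothetical index values, for illustration only.
print(auc([2.1, 1.7, 3.0], [0.4, 1.9, 0.2]))
```

An AUC of 0.5 corresponds to a score that does not separate the two groups, and an AUC of 1.0 to a score that separates them perfectly.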
Referring to an AUC distribution graph 601 shown at a left side of
Referring to an AUC distribution graph 603 shown at a right side of
Here, a discovery set of the AUC distribution graph 603 denotes a plurality of pieces of the gene expression data used in the simulation of
To summarize the descriptions above with reference to
In particular, as described above with reference to
Moreover, as described above, when gene markers are combined in order for a gene marker set to at least include PYCR1, the accuracy of cancer diagnosis increases.
In addition, as described above with reference to
Hereinafter, a function and an operation of the cancer diagnostic apparatus 10 for diagnosing cancer of the subject 3 by using the gene markers will be described in more detail.
Referring to
In the cancer diagnostic apparatus 10, the gene expression data acquiring unit 110 and the determination unit 120 may be implemented as a generally-used processor 100. That is, the processor 100 may be implemented with an array of a plurality of logic gates, or may be implemented as a combination of a general-use microprocessor and a memory which stores a program executed in the microprocessor. Also, the processor 100 may be implemented as a module type of an application program. Furthermore, it can be understood by one of ordinary skill in the art that the processor 100 may be implemented as hardware capable of performing operations to be described in the present embodiment.
The cancer diagnostic apparatus 10 of
The gene expression data acquiring unit 110 acquires first gene expression data of the subject 3 for a gene marker set including at least one gene marker.
Here, as described above, the gene marker set may include at least one gene marker selected from the group consisting of PYCR1, PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1. Alternatively, the gene marker set necessarily includes the gene marker PYCR1, and may further include at least one gene marker selected from the group consisting of PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1. Alternatively, the gene marker set may include only one gene marker PYCR1. Alternatively, the gene marker set may include all the gene markers PYCR1, PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1. That is, it can be seen by one of ordinary skill in the art that a combination of gene markers included in the gene marker set according to the present embodiment is not limited thereto.
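The gene marker set options described above may be sketched as simple data checks. The following is an illustrative sketch only; the names `PATHWAY_MARKERS`, `is_valid_marker_set`, and `marker_sets_with_pycr1` are hypothetical and do not appear in the present embodiment.

```python
from itertools import combinations

# The sixteen gene markers of the pathway of amino acid synthesis and
# interconversion (transamination), as listed in the description.
PATHWAY_MARKERS = [
    "PYCR1", "PHGDH", "GLS2", "GLS", "GLUD1", "GLUL", "GOT1", "GOT2",
    "GPT", "GPT2", "PSAT1", "ASNS", "OAT", "PSPH", "ALDH18A1", "CCBL1",
]


def is_valid_marker_set(markers, require_pycr1=False):
    """Check that a set is non-empty, drawn from the pathway markers, and
    (optionally) necessarily includes PYCR1."""
    chosen = set(markers)
    if not chosen or not chosen <= set(PATHWAY_MARKERS):
        return False
    return "PYCR1" in chosen if require_pycr1 else True


def marker_sets_with_pycr1(size):
    """Enumerate all marker sets of the given size that include PYCR1."""
    others = [m for m in PATHWAY_MARKERS if m != "PYCR1"]
    return [("PYCR1",) + combo for combo in combinations(others, size - 1)]
```

For example, `marker_sets_with_pycr1(1)` yields the single-marker set containing only PYCR1, and `marker_sets_with_pycr1(16)` yields the set of all sixteen markers.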
The first gene expression data acquired by the gene expression data acquiring unit 110 may be image data that is obtained by analyzing biological samples, which are collected from the subject 3 and have undergone a hybridization reaction in the microarray 4, in the image analysis apparatus such as a high content cell imaging apparatus, a high content screening apparatus, or a high throughput screening apparatus. Alternatively, the first gene expression data may be statistical data that is obtained by digitizing a plurality of pieces of gene expression data analyzed from the image data.
A detailed operation, which acquires expression data from the biological samples by using the microarray 4 and the image analysis apparatuses, is known to one of ordinary skill in the art, and thus, its detailed description is not provided.
The storage unit 130 pre-stores second gene expression data of the normal person group 1 and the cancer patient group 2 and a discriminant model that is pre-generated from the second gene expression data.
The discriminant model is a model used to determine a possibility of cancer. When the second gene expression data is divided into gene expression data of the normal person group 1 and gene expression data of the cancer patient group 2, the discriminant model is a statistical model that analyzes and predicts into which of the normal person group 1 and the cancer patient group 2 individual gene expression levels included in the second gene expression data are classified.
Therefore, by using the discriminant model, when arbitrary gene expression data (for example, the first gene expression data of the subject 3) is newly inputted (acquired), whether the input gene expression data belongs to the normal person group 1 or the cancer patient group 2 may be analyzed by using a statistical probability.
The discriminant model may be generated by using a regression model, which has a univariate representing a gene marker set or a multi-variate corresponding to two or more gene markers, and the second gene expression data. Here, the regression model may be a model based on logistic regression. Such a logistic regression model having the univariate is expressed as the following Equation (1):
$$\operatorname{logit}(p) = a_1 \cdot X_{\text{Gene Marker Set}} + a_2 \tag{1}$$
where $a_1$ and $a_2$ denote coefficients, and $X_{\text{Gene Marker Set}}$ denotes an independent variable. Also, $\operatorname{logit}(p) = \ln\bigl(p/(1-p)\bigr)$ is a dependent variable, where $p$ denotes a probability that gene expression data is classified into the normal person group 1 (or the cancer patient group 2).
A logistic regression model having the multi-variate is expressed as the following Equation (2):
$$\operatorname{logit}(p) = b_1 \cdot X_{\text{PYCR1}} + b_2 \cdot X_{\text{ALDH18A1}} + \cdots + b_n \cdot X_{\text{Gene Marker}} \tag{2}$$
where each of $b_1, b_2, \ldots, b_n$ denotes a coefficient, and each of $X_{\text{PYCR1}}$ and $X_{\text{ALDH18A1}}$ denotes an independent variable corresponding to a gene marker. Also, the number of independent variables may be equal to or less than the number of gene markers included in a gene marker set, and $\operatorname{logit}(p) = \ln\bigl(p/(1-p)\bigr)$, a dependent variable, is defined for a probability $p$ that gene expression data is classified into the normal person group 1 (or the cancer patient group 2).
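As a minimal sketch of how Equations (1) and (2) may be evaluated, the following code inverts the logit to recover the probability p. The function names and any coefficient values passed to them are hypothetical; Equation (2) is evaluated as written, that is, without an intercept term.

```python
import math


def logit_to_probability(z):
    """Invert logit(p) = z, giving p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def univariate_probability(x_set, a1, a2):
    """Equation (1): logit(p) = a1 * X_GeneMarkerSet + a2."""
    return logit_to_probability(a1 * x_set + a2)


def multivariate_probability(x_by_marker, b_by_marker):
    """Equation (2): logit(p) = sum of b_i * X_i over the gene markers used."""
    z = sum(b_by_marker[m] * x for m, x in x_by_marker.items())
    return logit_to_probability(z)


# Hypothetical coefficients and index values, for illustration only.
print(univariate_probability(1.2, a1=0.8, a2=-0.5))
print(multivariate_probability({"PYCR1": 1.5, "ALDH18A1": 0.3},
                               {"PYCR1": 0.9, "ALDH18A1": -0.2}))
```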
An index value to be calculated by a below-described calculator 1203 may be substituted into the independent variables of Equations (1) and (2).
An operation, which generates a regression model for each of the two groups (the normal person group 1 and the cancer patient group 2) with data given thereto as in Equations (1) and (2), is known to one of ordinary skill in the art, and thus, its detailed description is not provided.
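Although the generation of the regression model is not described in detail here, one common way to obtain the coefficients of Equation (1) is maximum-likelihood fitting. The following is a sketch under that assumption, using plain gradient ascent; the function name, learning rate, step count, and training values are all hypothetical and are not taken from the second gene expression data.

```python
import math


def fit_univariate_logistic(xs, labels, lr=0.1, steps=5000):
    """Fit logit(p) = a1 * x + a2 by gradient ascent on the log-likelihood."""
    a1, a2 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g1 = g2 = 0.0
        for x, y in zip(xs, labels):
            p = 1.0 / (1.0 + math.exp(-(a1 * x + a2)))
            g1 += (y - p) * x   # gradient with respect to a1
            g2 += (y - p)       # gradient with respect to a2
        a1 += lr * g1 / n
        a2 += lr * g2 / n
    return a1, a2


# Hypothetical index values; label 1 stands for the cancer patient group
# and label 0 for the normal person group.
xs = [0.2, 0.5, 0.9, 1.1, 2.0, 2.4, 2.9, 3.3]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
a1, a2 = fit_univariate_logistic(xs, labels)
print(a1, a2)
```

The example labels overlap deliberately, since perfectly separable data would drive the coefficients toward infinity under unregularized fitting.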
In more detail, the discriminant model may be generated by using the logistic regression model, but is not limited thereto. It can be understood by one of ordinary skill in the art that the discriminant model may be generated by using another statistical analysis method such as analysis of variance (ANOVA) or correlation analysis.
That is, as described above, the storage unit 130 pre-stores the pre-generated discriminant model and the second gene expression data of the normal person group 1 and the cancer patient group 2 based on generation of the discriminant model.
The determination unit 120 determines a possibility of cancer of the subject 3 by using the second gene expression data of the normal person group 1 and the cancer patient group 2 and the first gene expression data of the subject 3. Here, the determination unit 120 may determine the presence of cancer as the determined result. Also, the determination unit 120 may determine a degree of occurrence of cancer, for example, a high risk or a low risk, as the determined result.
A detailed operation of determining a possibility of cancer in the determination unit 120 will be described in detail with reference to
Referring to
The determination unit 120 of
First, the determination unit 120 reads out the second gene expression data, acquired from the normal person group 1 and the cancer patient group 2, and the discriminant model from the storage unit 130.
The preprocessor 1201 preprocesses the first gene expression data including first gene expression levels on the basis of a distribution of the second gene expression levels included in the second gene expression data.
To provide a more detailed description, the preprocessor 1201 may calculate ratios of the second gene expression levels and the first gene expression levels in units of a gene marker, thereby preprocessing the first gene expression data. Such a preprocessing method may be applied to a case in which the first gene expression data or the second gene expression data is data obtained by using a real-time polymerase chain reaction (RT-PCR), but is not limited thereto.
Moreover, the preprocessor 1201 may normalize or standardize the first gene expression levels in comparison with the second gene expression levels, thereby preprocessing the first gene expression data. Such a preprocessing method may be applied to a case in which the first gene expression data or the second gene expression data is data obtained by using the microarray 4, but is not limited thereto.
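The two preprocessing options may be sketched as follows. This is an illustrative sketch only: the direction of the ratio (subject level divided by the mean reference level) and the use of a z-score for standardization are assumptions, and the function names are hypothetical.

```python
import statistics


def ratio_preprocess(first_levels, second_levels_by_marker):
    """Per gene marker, the ratio of the subject's expression level to the
    mean of the pre-stored reference levels (assumed direction)."""
    return {
        marker: level / statistics.mean(second_levels_by_marker[marker])
        for marker, level in first_levels.items()
    }


def standardize_preprocess(first_levels, second_levels_by_marker):
    """Per gene marker, the z-score of the subject's expression level against
    the distribution of the pre-stored reference levels."""
    result = {}
    for marker, level in first_levels.items():
        reference = second_levels_by_marker[marker]
        mean = statistics.mean(reference)
        spread = statistics.stdev(reference)  # sample standard deviation
        result[marker] = (level - mean) / spread
    return result


# Hypothetical expression levels, for illustration only.
reference = {"PYCR1": [1.0, 2.0, 3.0]}
print(ratio_preprocess({"PYCR1": 4.0}, reference))
print(standardize_preprocess({"PYCR1": 4.0}, reference))
```

With standardization, a reference level equal to the reference mean maps to 0, so preprocessed levels are expressed relative to a gene expression level of 0.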
In
However, the newly acquired first gene expression data of the subject 3 is raw data 901, and can be accurately analyzed only after the first gene expression data is preprocessed in comparison with the second gene expression data. The preprocessor 1201 converts the raw data 901 of the first gene expression data into preprocessed first gene expression data 903 so as to have a pattern normalized or standardized with respect to the gene expression level of 0, as in the second gene expression data of the normal person group 1 and the cancer patient group 2.
The calculator 1203 calculates an index, indicating a degree of expression of the preprocessed first gene expression levels, for the second gene expression levels.
To describe an embodiment of an operation of calculating the index, the calculator 1203 may calculate the index by using a method such as the Fisher exact test, the binomial test, the geneset enrichment analysis (GSEA), or the like.
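As one hedged illustration of a binomial-test index, the following sketch counts the gene markers whose preprocessed expression level deviates from 0 by more than a cutoff and computes the upper-tail probability of that count under a null rate. The cutoff, the null rate `p0`, and the function names are assumptions made for illustration; the present embodiment does not specify how the test is set up.

```python
from math import comb


def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X following a Binomial(n, p0) distribution."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def binomial_index(preprocessed_levels, cutoff=1.0, p0=0.5):
    """Index from the number of markers deviating beyond the cutoff."""
    n = len(preprocessed_levels)
    k = sum(1 for level in preprocessed_levels.values() if abs(level) > cutoff)
    return binomial_upper_tail(k, n, p0)


# Hypothetical preprocessed levels, for illustration only.
print(binomial_index({"PYCR1": 2.3, "GLS": -1.4, "OAT": 0.2}))
```

A small index indicates that more markers deviate than would be expected under the assumed null rate.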
In another embodiment of an operation of calculating the index, first, the calculator 1203 estimates a representative expression pattern that is obtained by summarizing distributions of gene expression levels of the normal person group 1 among the second gene expression data. The calculator 1203 calculates a representative value (or a centroid) based on an average, a weight average, a median value of gene expression levels for each gene marker, thereby estimating a representative expression pattern of each gene marker.
Subsequently, the calculator 1203 calculates an index for the estimated representative expression pattern on the basis of a degree of expression of the preprocessed first gene expression levels. Here, the calculator 1203 may calculate and analyze the index by using the Mahalanobis distance, the Euclidean distance, the Manhattan distance, the maximum distance, the minimum distance, or a correlation coefficient. In addition, the calculator 1203 may calculate the index indicating a degree of expression by using various other methods.
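The centroid-and-distance approach can be sketched as follows. This is an illustrative example under stated assumptions, not the calculator 1203 itself: samples are assumed to be dictionaries keyed by gene marker, the function names are hypothetical, and only the Euclidean, Manhattan, and maximum distances are shown.

```python
from math import sqrt
from statistics import mean, median


def centroid(samples, summarize=mean):
    """Representative expression pattern: one summary value per gene marker.

    Pass summarize=median to use the median instead of the average.
    """
    markers = samples[0].keys()
    return {g: summarize([s[g] for s in samples]) for g in markers}


def euclidean(x, c):
    """Euclidean distance between a subject pattern x and a centroid c."""
    return sqrt(sum((x[g] - c[g]) ** 2 for g in c))


def manhattan(x, c):
    """Sum of absolute per-marker differences."""
    return sum(abs(x[g] - c[g]) for g in c)


def maximum(x, c):
    """Largest absolute per-marker difference."""
    return max(abs(x[g] - c[g]) for g in c)


# Hypothetical preprocessed levels for two normal samples and one subject.
normals = [{"PYCR1": 0.0, "GLS": 0.0}, {"PYCR1": 2.0, "GLS": 4.0}]
subject = {"PYCR1": 4.0, "GLS": 6.0}
index = euclidean(subject, centroid(normals))
```

Any of the three distances can serve as the index; a larger value means the subject's expression pattern lies farther from the representative pattern of the normal person group.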
The calculator 1203 applies or substitutes the calculated index into the discriminant model read out from the storage unit 130 to calculate a statistical significance level indicating a presence probability of cancer.
The comparator 1205 compares the statistical significance level calculated by the calculator 1203 and a threshold value, which is used to determine the presence of cancer or a degree of occurrence (for example, the high risk or the low risk) of cancer.
As a result, the determination unit 120 may determine a possibility of cancer depending on whether glutamate metabolism of the subject 3 is abnormal, on the basis of the comparison result from the comparator 1205.
The above-described index, statistical significance level, or threshold value may be one of a probability, a cumulative probability, a priority, a quantile, and a deviation.
Furthermore, the calculator 1203 may calculate the statistical significance level indicating a presence probability of cancer even without the discriminant model stored in the storage unit 130.
The calculator 1203, as described above, estimates the representative expression pattern that is obtained by summarizing the distributions of the gene expression levels of the normal person group 1.
Subsequently, the calculator 1203 calculates the index, indicating a degree of expression of the preprocessed first gene expression levels, for the estimated representative expression pattern. That is, an operation of calculating the index may be similar to the above-described operation.
Then, the calculator 1203 calculates the statistical significance level indicated by the calculated index by using an empirical distribution of degrees of expression of gene expression levels of the normal person group 1 for the estimated representative expression pattern, instead of the discriminant model. Here, the present embodiment is not limited to the empirical distribution, and may use another kind of null distribution.
The comparator 1205 compares the statistical significance level calculated by the calculator 1203 and the threshold value which is used to determine the presence of cancer or a degree of occurrence of cancer. The determination unit 120 may determine a possibility of cancer of the subject 3 on the basis of the comparison result from the comparator 1205.
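The use of an empirical distribution in place of the discriminant model can be sketched as below. This is an assumption-laden example: the significance level is taken here to be the fraction of normal-group indices that are at least as large as the subject's index, and the threshold of 0.05 is a hypothetical value, not one given in the text.

```python
def empirical_significance(subject_index, null_indices):
    """Fraction of null (normal-group) indices >= the subject's index.

    Assumption: a larger index means a stronger deviation from the
    representative expression pattern of the normal person group.
    """
    exceed = sum(1 for value in null_indices if value >= subject_index)
    return exceed / len(null_indices)


def exceeds_threshold(significance, threshold=0.05):
    """Comparator step: True when the level falls below the threshold."""
    return significance < threshold


# Hypothetical distances of normal samples to their own centroid.
null_indices = [1.0, 2.0, 3.0, 4.0]
level = empirical_significance(3.0, null_indices)
flagged = exceeds_threshold(level)
```

In practice the null distribution would be built from many normal samples, so that small significance levels can be resolved.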
According to the present embodiment, in order to calculate the statistical significance level, the discriminant model may be used, or instead of the discriminant model, the empirical distribution may be used. That is, the present embodiment is not limited to either approach.
Referring to
Here, the pre-generated discriminant model may be pre-stored in the storage unit 130.
The following Equation (3) is merely an example which is arbitrarily assumed for convenience of description, and the present embodiment is not limited thereto. It can be understood by one of ordinary skill in the art that various discriminant models may be applied by using various kinds of samples.
logit(p) = −6.1036 + 0.1335·Index    (3)
where Index denotes an index calculated by the calculator 1203. A regression curve 1001 may be generated according to the discriminant model expressed as Equation (3).
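For illustration, the example discriminant model of Equation (3) can be evaluated by inverting the logit, p = 1 / (1 + exp(−logit)). The sketch below uses the coefficients from Equation (3); the function name is hypothetical, and the coefficients are only the arbitrary example values given in the description.

```python
from math import exp

# Example coefficients taken from Equation (3).
INTERCEPT = -6.1036
SLOPE = 0.1335


def cancer_probability(index):
    """Invert the logit of Equation (3) to obtain a probability p in (0, 1)."""
    logit = INTERCEPT + SLOPE * index
    return 1.0 / (1.0 + exp(-logit))


# The probability crosses 0.5 where the logit is zero.
midpoint_index = -INTERCEPT / SLOPE
```

Because the slope is positive, the probability increases monotonically with the index, which is the shape a regression curve of this form would follow.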
In
Therefore, the determination unit 120 determines a possibility of cancer according to a section including the score, on the basis of a comparison result of the score and threshold values 39 and 52 compared by the comparator 1205. For example, when score<39, the determination unit 120 determines the subject 3 as having low-risk cancer, and when 39≦score<52, the determination unit 120 determines the subject 3 as having intermediate-risk cancer. Also, when 52≦score, the determination unit 120 determines the subject 3 as having high-risk cancer.
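The sectioning of the score by the two example threshold values can be sketched as a simple comparison chain. The function name and the returned labels are hypothetical; the threshold values 39 and 52 are the example values stated above.

```python
def classify_risk(score, low_cut=39, high_cut=52):
    """Map a score to a risk section using the example thresholds 39 and 52."""
    if score < low_cut:
        return "low risk"
    if score < high_cut:
        return "intermediate risk"
    return "high risk"
```

A score exactly equal to a threshold falls into the higher section, matching the inequalities 39≦score<52 and 52≦score.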
In
Moreover, it can be understood by one of ordinary skill in the art that a threshold value is not limited to the threshold values described above with reference to
In operation 1110, the gene expression data acquiring unit 110 acquires the first gene expression data of the subject 3 for a gene marker set including at least one gene marker.
The gene marker set used in operation 1110 includes at least one gene marker selected from the group consisting of PYCR1, PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1.
In operation 1120, the determination unit 120 determines a possibility of cancer of the subject 3 by using the acquired first gene expression data and the second gene expression data of the normal person group 1 and the cancer patient group 2. Here, the second gene expression data may be pre-stored in the storage unit 130 along with the discriminant model which is pre-generated based on the second gene expression data. The determination unit 120 reads out the second gene expression data or the discriminant model pre-stored in the storage unit 130, and determines a possibility of cancer by using the read-out second gene expression data or discriminant model.
In operation 1210, the gene expression data acquiring unit 110 acquires the first gene expression data of the subject 3 for a gene marker set including at least one gene marker.
The gene marker set used in operation 1210 includes PYCR1, and includes at least one gene marker selected from the group consisting of PHGDH, GLS2, GLS, GLUD1, GLUL, GOT1, GOT2, GPT, GPT2, PSAT1, ASNS, OAT, PSPH, ALDH18A1, and CCBL1.
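The composition rule for the gene marker set of operation 1210 can be expressed as a short check. This is only a sketch: the function name and the representation of the set as strings are assumptions, while the marker symbols are those listed above.

```python
# Markers other than PYCR1, as listed for operation 1210.
OTHER_MARKERS = frozenset({
    "PHGDH", "GLS2", "GLS", "GLUD1", "GLUL", "GOT1", "GOT2", "GPT",
    "GPT2", "PSAT1", "ASNS", "OAT", "PSPH", "ALDH18A1", "CCBL1",
})


def is_valid_marker_set(markers):
    """True if the set contains PYCR1 and at least one of the other markers."""
    selected = set(markers)
    return "PYCR1" in selected and len(selected & OTHER_MARKERS) >= 1
```

For example, a set consisting of PYCR1 and GLS satisfies the rule, whereas PYCR1 alone or GLS alone does not.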
In operation 1220, the determination unit 120 determines a possibility of cancer of the subject 3 by using the acquired first gene expression data and the second gene expression data of the normal person group 1 and the cancer patient group 2. Here, the second gene expression data may be pre-stored in the storage unit 130 along with the discriminant model which is pre-generated based on the second gene expression data. The determination unit 120 reads out the second gene expression data or the discriminant model pre-stored in the storage unit 130, and determines a possibility of cancer by using the read-out second gene expression data or discriminant model.
In operation 1310, the gene expression data acquiring unit 110 acquires the first gene expression data of the subject 3 for a gene marker set including at least one gene marker.
The gene marker set used in operation 1310 includes at least one of the gene markers belonging to the pathway of amino acid synthesis and interconversion (transamination).
In operation 1320, the determination unit 120 determines a possibility of cancer of the subject 3 by using the acquired first gene expression data and the second gene expression data of the normal person group 1 and the cancer patient group 2. Here, the second gene expression data may be pre-stored in the storage unit 130 along with the discriminant model which is pre-generated based on the second gene expression data. The determination unit 120 reads out the second gene expression data or the discriminant model pre-stored in the storage unit 130, and determines a possibility of cancer by using the read-out second gene expression data or discriminant model.
As described above, according to one or more of the above embodiments of the present invention, cancer is diagnosed by using only genetic information of a person. In particular, cancer is accurately diagnosed by using gene markers highly associated with the occurrence of a disease.
The above-described embodiments of the present invention may be written as computer programs and may be implemented in general-use digital computers that execute the programs using a computer-readable recording medium. Also, a structure of data used in the aforementioned embodiments may be recorded in computer-readable recording media through various means. Examples of the computer-readable recording medium include magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.), optical recording media (e.g., CD-ROMs or DVDs), and transmission media such as Internet transmission media.
It should be understood that the exemplary embodiments described herein should be considered in a descriptive sense only and not for purposes of limitation. Descriptions of features or aspects within each embodiment should typically be considered as available for other similar features or aspects in other embodiments.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.
The use of the terms “a” and “an” and “the” and “at least one” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term “at least one” followed by a list of one or more items (for example, “at least one of A and B”) is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
Preferred embodiments of this invention are described herein, including the best mode known to the inventors for carrying out the invention. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context.
Number | Date | Country | Kind |
---|---|---|---|
10-2013-0118120 | Oct 2013 | KR | national |