This application claims priority from prior Japanese Patent Application Publication No. 2017-136368, filed on Jul. 12, 2017, entitled “METHOD FOR BUILDING A DATABASE”, the entire contents of which are incorporated herein by reference.
The present invention relates to a method for building a database and a system for building a database.
In recent years, attempts have been made to determine a treatment policy based on the molecular level of a patient, such as gene expression level, centering on breast cancer. For example, Japanese Patent Application Publication No. 2011-223957 describes a method for predicting the prognosis of breast cancer that is negative for lymph node metastasis and positive for estrogen receptor based on the expression of 95 genes.
The background for such prognostic predictions has been the rapid development of next generation sequencing and detection technologies and analytical techniques by microarrays and the like for comprehensively analyzing expression of genes across all genes.
With next generation sequencing analysis and microarray analysis, it is now possible to analyze the expression levels of numerous genes and sequence variations in DNA. Publicly available databases such as the NCBI Gene Expression Omnibus are also being constructed. On the other hand, since the data accumulated in each database have not necessarily been collected and analyzed under standardized conditions, the database may contain analytical errors and the like, so that the state of gene expression and the like in the database is unlikely to genuinely reflect the gene expression of the samples. Further, neither the state of the individual collected samples nor the clinical contexts are homogeneous.
While the number of genes used to predict the prognosis of a disease and to predict the therapeutic effect of a drug is limited, in next-generation sequencing analysis and microarray analysis, genes and proteins that do not require measurement are also analyzed in large quantities.
In view of such problems in next-generation sequencing analysis and microarray analysis, the present invention provides a method to effectively utilize data reflecting the expression of measurement-target genes and non-target genes or functions of the gene products acquired by next generation sequencing analysis and microarray analysis.
A first embodiment of the invention for solving these problems is a method for constructing a database of gene related information including gene related measurement data reflecting expression of a gene in a biological sample or a function of a gene product, wherein the database is used for searching for a candidate for a new marker, the method comprising: a step of acquiring information specifying an analysis target gene; a step of acquiring gene-related measurement data for a non-analysis target gene other than the analysis target gene; a step of outputting gene-related information of the non-analysis target gene to a database; and a step of storing, in the database, the gene related information of the non-analysis target gene and biological sample related information, which is information related to the biological sample from which the gene-related measurement data were acquired.
A second embodiment of the invention for solving these problems is a method for searching for a candidate for a new marker based on gene related information including gene related measurement data reflecting the expression of the gene in the biological sample or the function of the gene product, wherein the method includes a step of acquiring information specifying an analysis target gene, a step of acquiring gene-related measurement data for a non-analysis target gene other than the analysis target gene, a step of outputting gene related information of the non-analysis target gene to a database, a step of storing in the database the gene related information of the non-analysis target gene and biological sample related information which is information related to the biological sample from which the gene related measurement data were obtained, a step of associating the gene related information with the biological sample related information, a step of acquiring, for each gene, a numerical value indicating the strength of relevance between the gene-related measurement data included in the gene-related information and the biological sample-related information, and a step of determining a candidate for a new marker as a gene strongly related to the biological sample related information based on the numerical value.
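The last two steps of the second embodiment can be illustrated by a minimal sketch. The sketch assumes, purely for illustration, that the numerical value indicating the strength of relevance is a Pearson correlation coefficient between the per-sample measurement values of a gene and a numeric item of the biological sample related information; the specification does not fix the statistic. The gene names, data values, the threshold of 0.8, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no relevance information
    return cov / sqrt(var_x * var_y)


def rank_marker_candidates(expression_by_gene, sample_values, threshold=0.8):
    """Score every gene, then keep genes whose |score| reaches the threshold."""
    scores = {
        gene: pearson(values, sample_values)
        for gene, values in expression_by_gene.items()
    }
    candidates = [gene for gene, r in scores.items() if abs(r) >= threshold]
    candidates.sort(key=lambda gene: abs(scores[gene]), reverse=True)
    return scores, candidates


# Hypothetical measurement data: one value per biological sample.
expression_by_gene = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 2.1, 1.9, 2.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
}
# Hypothetical sample related information (0 = no recurrence, 1 = recurrence).
recurrence = [0.0, 0.0, 1.0, 1.0]

scores, candidates = rank_marker_candidates(expression_by_gene, recurrence)
```

In this sketch both positively and negatively related genes are retained, because the absolute value of the score is compared with the threshold.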
The 3-1th embodiment of the invention for solving these problems is a system 500 for constructing a database of gene related information including gene related measurement data reflecting the expression of a gene in a biological sample or the function of a gene product, wherein the database is used for searching candidates for a new marker, the system including a laboratory facility information processing apparatus 20 and a laboratory facility database storage apparatus 100, wherein the laboratory facility information processing apparatus 20 acquires information specifying the analysis target gene, acquires the gene-related measurement data for a non-analysis target gene other than the analysis target gene, and stores the gene related information of the non-analysis target gene in the laboratory facility database storage apparatus, and the laboratory facility database storage apparatus 100 outputs gene related information of the non-analysis target gene and receives and stores biological sample-related information which is information related to the biological sample from which the gene-related measurement data were obtained.
The 3-2nd embodiment of the invention for solving these problems is a system 600 for constructing a database of gene related information including gene related measurement data reflecting the expression of a gene in a biological sample or the function of a gene product, wherein the database is used for searching candidates for a new marker, and the system includes a medical facility information processing apparatus 50, a laboratory facility information processing apparatus 20, and a medical facility database storage apparatus 101, wherein the laboratory facility information processing apparatus acquires information for specifying an analysis target gene, acquires the gene-related measurement data for a non-analysis target gene other than the analysis target gene, and outputs the gene related information of the non-analysis target gene to the medical facility database storage apparatus 101, and the medical facility information processing apparatus 50 outputs the biological sample related information which is information related to the biological sample from which the gene related measurement data were acquired to the medical facility database storage apparatus 101, and the medical facility database storage apparatus receives and stores the gene related information of the non-analysis target gene and the biological sample related information.
The 3-3rd embodiment of the invention for solving the problem is a system 700 for constructing a database of gene-related information including gene-related measurement data reflecting the expression of a gene in a biological sample or the function of a gene product, wherein the database is used for searching candidates for new markers, and the system includes a medical facility information processing apparatus 50, a laboratory facility information processing apparatus 20, and a database storage apparatus 102, and the laboratory facility information processing apparatus 20 acquires the information for specifying the analysis target gene, acquires the gene-related measurement data for a non-analysis target gene other than the analysis target gene, and outputs the gene related information of the non-analysis target gene to the database storage apparatus, and the medical facility information processing apparatus 50 outputs the biological sample related information which is information related to the biological sample from which the gene related measurement data were acquired to the database storage apparatus, and the database storage apparatus 102 receives and stores the gene related information of the non-analysis target gene and the biological sample related information.
According to embodiments 1, 2, 3-1, 3-2, and 3-3, data reflecting the expression of the measurement target gene and a gene other than the measurement target gene or the function of a gene product obtained by next-generation sequencing analysis and microarray analysis can be effectively utilized.
A fourth embodiment of the invention for solving these problems is a method for constructing a database of gene related information including gene related measurement data reflecting expression of a gene in a biological sample or a function of a gene product, wherein the data stored in the database are used as training data or verification data of artificial intelligence for searching for a new marker, the method including a step of acquiring information specifying a measurement target gene, a step of acquiring gene related measurement data of the measurement target gene, a step of storing gene-related information of the measurement target in a database, and a step of storing information related to the biological sample from which the gene-related measurement data were acquired in the database. According to the present invention, a large amount of artificial intelligence training data or verification data can be provided.
A fifth embodiment of the invention for solving these problems is a method for constructing a database of gene related information including gene related measurement data reflecting expression of a gene in a biological sample or a function of a gene product, wherein the database is used for searching for a candidate for a new marker, the method including a step of acquiring gene-related information obtained for a plurality of genes including non-analysis target genes other than the analysis target gene from a laboratory facility information processing apparatus and/or a medical facility information processing apparatus, a step of acquiring biological sample related information which is information related to the biological sample from which the gene related measurement data were acquired from the laboratory facility information processing apparatus and/or the medical facility information processing apparatus, and a step of storing the gene related information and the biological sample related information in the database.
A sixth embodiment of the invention for solving these problems is a system 500, 600, 700 for constructing a database of gene related information including gene related measurement data reflecting expression of a gene in a biological sample or a function of a gene product, wherein the database is used for searching candidates for a new marker, the system including a database storage apparatus 100, 101, 102, wherein the database storage apparatus acquires gene-related information obtained for a plurality of genes including non-analysis target genes other than the analysis target gene from a laboratory facility information processing apparatus 20 and/or a medical facility information processing apparatus 50, acquires biological sample related information, which is information related to the biological sample from which the gene-related information was obtained, from the laboratory facility information processing apparatus 20 and/or the medical facility information processing apparatus 50, and stores the gene-related information and the biological sample-related information. According to the fifth and sixth embodiments, data reflecting the expression of a measurement target gene and genes other than the measurement target gene, or the function of the gene product acquired by next-generation sequencing analysis or microarray analysis can be effectively utilized.
According to the invention, it is possible to effectively utilize data reflecting the expression of measurement target genes and genes other than the measurement target genes, or functions of the gene products acquired by next-generation sequencing analysis or microarray analysis.
Hereinafter, embodiments of the invention will be described in detail with reference to the accompanying drawings. Note that the method of constructing a database, the system for constructing a database, and the database storage apparatus according to the present invention are not limited to the specific embodiments described below. In the following description, the same reference numerals are assigned to the same components. Therefore, descriptions of each component denoted by the same reference numeral can be shared between the same reference numerals. Furthermore, for terms commonly used in each embodiment, the explanation of the terms in each embodiment also applies to the other embodiments.
1. Database Construction Method
First, an outline of an embodiment of the present invention will be described with reference to
In addition, these databases can be used to provide training data and verification data for performing artificial intelligence machine learning when searching for the new marker or the like using artificial intelligence. The database also can be used to provide verification data for searching for new markers using statistical methods.
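As a minimal sketch of how records stored in such a database might be divided into training data and verification data, the following example splits a list of sample codes reproducibly. The sample codes, the verification fraction of 0.25, the fixed seed, and the function name are assumptions made for illustration; the specification does not prescribe how the division is performed.

```python
import random


def split_training_verification(sample_codes, verification_fraction=0.25, seed=0):
    """Shuffle sample codes with a fixed seed and split them into two groups."""
    codes = sorted(sample_codes)           # fixed starting order for reproducibility
    random.Random(seed).shuffle(codes)     # private generator; global state untouched
    n_verification = int(len(codes) * verification_fraction)
    verification = codes[:n_verification]
    training = codes[n_verification:]
    return training, verification


# Hypothetical sample codes S001 to S008.
codes = [f"S{i:03d}" for i in range(1, 9)]
training, verification = split_training_verification(codes)
```

Splitting by sample code rather than by individual measurement keeps all data from one biological sample in the same group.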
The first embodiment of the present invention relates to a method for constructing a database used for re-profiling for searching candidates for new markers. Specifically, the database stores, in a non-volatile manner, gene-related information including gene-related measurement data reflecting the expression of a gene or the function of a gene product in the biological sample.
The novel marker is, for example, a disease biomarker or a target molecule for the treatment of a disease. The disease biomarker can be used for disease risk assessment, screening, differential diagnosis, prognosis prediction, recurrence prediction and the like. The target molecule for the treatment of the disease also is a molecule that can prevent disease, treat disease, or delay disease progression by controlling the function of the target molecule. The target molecule also may be used to predict therapeutic effect.
Next, referring to
In the embodiment, the biological sample is not limited insofar as it is collected from a living body. For example, the biological sample may be a blood sample (whole blood, plasma, serum or the like), urine, body fluids (sweat, secretions from the skin, tears, saliva, spinal fluid, abdominal fluid, and pleural effusion), and tissues (fresh tissue, frozen tissue, fixed tissues, and tissues embedded in embedding agents such as paraffin).
It also is preferable that the biological sample is collected from at least one lesion selected from a group consisting of a predetermined disease, a predetermined disease type and a stage of a predetermined disease. The disease is not limited, but is preferably a tumor (a benign epithelial tumor, a benign non-epithelial tumor, a malignant epithelial tumor, a malignant non-epithelial tumor), more preferably a malignant epithelial tumor, or a malignant non-epithelial tumor, even more preferably malignant epithelial tumor, and yet more preferably a breast cancer. Most preferred is lymph node metastasis negative and estrogen receptor (ER) positive breast cancer.
Preferably, a plurality of biological samples are used, and the plurality of biological samples are collected from lesions of different patients. More preferably, the plurality of biological samples are collected from lesions of the same disease in different patients, and still more preferably are collected from lesions of the same stage in different patients.
In a biological sample, a tissue considered to be normal which may serve as a negative control for the lesion site also may be collected. In this case, the tissue considered to be normal is preferably a normal part of the tissue to which the lesion site belongs. The normal part of the tissue to which the lesion site belongs may be taken from a plurality of patients or from a person not having the lesion.
The biological sample can be collected at the time of surgery or biopsy in a medical facility or the like to which the patient belongs. The collected biological sample is contained in a container such as a tube. A storage solution such as RNAlater (registered trademark) made by ThermoFisher Scientific Co., Ltd. or a fixative such as formaldehyde may be contained in the container. The biological sample contained in the container may be refrigerated or frozen. Known preservatives or fixatives can be used for the preservation solution or the fixation solution; however, from the viewpoint of preventing degradation and structural change of molecules in the biological sample during storage or transportation and keeping the biological sample in a certain state to some extent, it is preferable to use a commercially available kit or commercially available reagent. For example, a container attached to Curebest (registered trademark) 95GC Breast (Sysmex Corporation) can be used as a container for collecting a biological sample and a container for a biological sample. The biological sample contained in the container is pretreated in order to acquire gene-related measurement data at a medical facility or a laboratory facility that accepts an examination.
Examples of the gene-related measurement data reflecting the expression of the gene or the function of the gene product include the expression level of RNA (mRNA and/or microRNA) for each gene, the base sequence information of RNA, DNA (genomic DNA and/or mitochondrial DNA) methylation level, base sequence information of DNA (genomic DNA and/or mitochondrial DNA), abundance of gene product protein (monomer protein, complex protein, monomeric peptide, and complex peptide), glycosylation modification information of proteins (including monomeric proteins, complex proteins, monomeric peptides, and complex peptides), and the like. For example, when the gene-related measurement data is the methylation amount of DNA, the gene-related measurement data includes at least the methylation amount of DNA in each gene and at least the position information of the methylation site of the DNA. When the gene-related measurement data is DNA sequence information, the gene-related measurement data also include not only base sequence information but also at least the occurrence of a deletion, substitution, fusion, copy number mutation, or insertion in the DNA base sequence of each gene, and information on the position thereof. The sequence information of the DNA also includes genetic polymorphism information such as single nucleotide polymorphism, double nucleotide polymorphism, triple nucleotide polymorphism and the like. When the gene-related measurement data is information on glycosylation modification of a protein, the gene-related measurement data also may include not only the presence or absence of modification of each protein but also the modification position of each protein and information on the type of sugar chain of the modified protein.
Therefore, the pretreating of the biological sample from which the gene-related measurement data are acquired is not limited insofar as the RNA, DNA or protein of the measurement sample can be extracted in order to obtain the above-mentioned gene-related measurement data.
For example, when RNA is used to acquire gene-related measurement data, RNA can be obtained from a biological sample by a known method. Commercially available kits such as Qiagen RNeasy kit (registered trademark) manufactured by Qiagen can also be used for RNA extraction from a biological sample. When DNA is used to acquire gene-related measurement data, DNA also can be obtained from a biological sample by a known method. Commercially available kits such as QIAamp DNA Mini Kit (registered trademark) manufactured by Qiagen can also be used for DNA extraction from a biological sample. When proteins are used to obtain gene-related measurement data, proteins also can be extracted from biological samples by a known method. Commercially available reagents such as Mammalian Protein Extraction Buffer (trade name) from GE Healthcare Japan KK and the like can be used for extracting proteins from biological samples. In the case where the biological sample is embedded in paraffin, it is possible to extract DNA from the biological sample using QIAamp DNA FFPE Tissue Kit (registered trademark) manufactured by Qiagen.
Regarding pretreating of biological samples, it is preferable to use commercially available kits or commercially available reagents from the viewpoint of preventing degradation of RNA and DNA in the process, structural change of proteins and the like, and homogenizing the sample for measurement.
Next, prior to acquiring the gene-related measurement data, the measurement sample may be pretreated as necessary. The pretreatment includes adding fluorescent labels, biotin labels or the like necessary for detection when acquiring gene-related measurement data to the RNA, DNA, or protein of the measurement sample, or the pretreatment product of the measurement sample described below. For example, when the measurement sample is RNA, the pretreatment of the measurement sample may include synthesizing cDNA or cRNA using RNA of the measurement sample as a template. Amplification of the cDNA or cRNA by PCR also may be included. In the case where the sample for measurement is DNA, the pretreatment of the sample for measurement may include amplifying the DNA of the sample for measurement by PCR if necessary. The pretreatment of the measurement sample also may include cutting, with a restriction enzyme, the DNA of the measurement sample or the PCR product amplified using the DNA of the measurement sample as a template. Where the sample for measurement is a protein, the pretreatment also may include treatment with a surfactant such as sodium dodecyl sulfate, NP-40, Triton X-100, or Tween-20 and/or a reducing agent such as β-mercaptoethanol, dithiothreitol or the like. The pretreatment methods are well known.
Also known is a method of labeling by fluorescence or biotin on the RNA, DNA, or protein of the measurement sample, or the pretreatment product of the measurement sample described below. For example, 3' IVT PLUS Reagent Kit (trade name) manufactured by Thermo Fisher Scientific Co., Ltd. can be used.
The pretreatment product of the pretreated measurement sample according to the above method is subjected to measurement to acquire gene related measurement data.
It is desirable that the above-described collection of a biological sample, extraction of a sample for measurement from a biological sample, and pretreatment of a sample for measurement are carried out using a commercially available kit or commercially available reagents in unified form to manage quality in the various steps for the purpose of constructing a homogenized database.
Next, each step for acquiring gene-related measurement data will be described with reference to
From the examination request form which the medical facility first fills in, the examiner or the processing unit 21 of the laboratory facility information processing apparatus 20 (to be described later) acquires information for specifying the gene to be analyzed (step S1). For example, the analysis target gene may be one or a plurality of genes to be used for at least one analysis selected from a group consisting of disease risk determination, screening, differential diagnosis, prognosis prediction, recurrence prediction, efficacy prediction, and disease monitoring. It is also preferable that the analysis target gene is determined beforehand according to the analysis to be performed on each gene, for example, for each disease and for each disease stage in a laboratory and/or a medical facility. For example, taking Curebest (registered trademark) 95GC Breast as an example, a dedicated examination request form is attached to Curebest (registered trademark) 95GC Breast. The examination request form filled in with the required matter is sent by mail or on-line or the like from the medical facility to the laboratory facility. By receiving the examination request form, the examiner of the laboratory facility recognizes the Curebest (registered trademark) 95GC Breast as the examination item and, if necessary, the processing unit 21 accepts the input information to start examination of the Curebest (registered trademark) 95GC Breast. Curebest (registered trademark) 95GC Breast is defined so that the 95 genes described in
Here, the “probe set.ID” described in
Next, in step S2, the examiner or the processing unit 21 acquires the gene-related measurement data by a predetermined measurement method. Methods for acquiring gene related measurement data are not limited. When the gene-related measurement data is the RNA expression level, RNA base sequence information, DNA methylation amount, or DNA base sequence information, it can be measured by sequencing and/or microarray. More specifically, in order to measure the expression level of RNA, RNA-seq analysis (Illumina, Inc.) using a next generation sequencer, or a microarray capable of RNA expression analysis such as the Human Genome U133 Plus 2.0 Array (by Thermo Fisher Scientific Inc.), can be used. In order to measure the amount of DNA methylation, Infinium Methylation EPIC Kit (Illumina, Inc.) using microarrays or the like can be used. In addition, in order to measure (or detect) DNA sequence information, microarray measurement using Genome-Wide Human SNP Array 6.0 or GeneChip (registered trademark) Human Genome U133 Plus 2.0 Array manufactured by Thermo Fisher Scientific Co., or exon sequencing or whole genome sequencing by a next generation sequencer, can be used.
When the gene-related measurement data is the amount of protein present, it also can be measured by microarray and/or ELISA (including EIA). More specifically, it can be measured using an array of antibodies (C-series, G-series, L-series, Quantibody) and Protein Array series manufactured by RayBiotech.
Furthermore, when the gene-related measurement data is sugar chain modification of the protein, it can be measured by microarray and/or ELISA (including EIA). More specifically, it can be measured using a lectin array or the like manufactured by RayBiotech.
In step S2, if the sample for measurement or the product obtained by pretreating the sample is a nucleic acid, it may include thermal denaturation of these nucleic acids before performing the measurement.
From the viewpoint of maintaining the homogeneity of the acquired gene-related measurement data, it is preferable to select a measurement method in which the reproducibility of the gene-related measurement data is secured. For example, it is preferable to use a microarray and other measurement reagents consistently. In this way, by homogenizing the measuring method together with homogenization of the measurement sample and/or the pretreated product of the measurement sample, the quality of the gene-related measurement data can be kept constant. The laboratory that acquires the gene-related measurement data also is preferably a single facility (including a branch laboratory maintaining a certain examination accuracy) or one or more facilities that maintain consistent accuracy. The laboratory facility may be installed in a medical facility.
The acquisition of the gene-related measurement data by the above measuring method can be carried out by a measuring apparatus 10, which will be described later, suitable for measuring a signal such as fluorescence in each of the above measuring methods. The apparatus 10 acquires a signal in the above measurement and calculates the intensity of the signal. The intensity of the signal also may be converted to the amount of RNA (copy number), the amount of protein, the DNA methylation level or methylation percentage, the rate of change in the base sequence of RNA, the rate of change in the base sequence of DNA, or the rate of protein glycosylation modification to acquire gene-related measurement data.
As shown in
Acquisition of the above-described gene-related measurement data may be performed only for non-analysis target genes other than the analysis target gene. Alternatively, measurement may be performed for all analysis targets mounted on the microarray, or total RNA, total DNA, or total protein may be measured; in that case, for example, only the gene-related measurement data of the non-analysis target gene may be extracted from the acquired gene related measurement data. In step S5 of
It is preferable that the gene-related measurement data are acquired for a plurality of non-analysis target genes and/or a plurality of analysis target genes. The plurality of non-analysis target genes may be selected, for example, not only as genes to be analyzed but also genes suggested to be associated with a predetermined disease, a predetermined disease type, or a stage of a predetermined disease. The non-analysis target gene is a gene other than the analysis target gene and also is a gene which is analyzable in each of the above measuring methods.
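The extraction of non-analysis target gene data from a measurement covering all measurable genes can be sketched as follows. This is a minimal illustration only: the gene identifiers, the measurement values, and the function name are hypothetical, and the measurement is assumed to be available as a mapping from gene identifier to value.

```python
def split_by_analysis_target(measurements, analysis_target_ids):
    """Separate measured genes into analysis targets and non-analysis targets."""
    targets = set(analysis_target_ids)
    target_data = {g: v for g, v in measurements.items() if g in targets}
    non_target_data = {g: v for g, v in measurements.items() if g not in targets}
    return target_data, non_target_data


# Hypothetical measurement of four genes, two of which are analysis targets.
measurements = {"GENE_A": 7.1, "GENE_B": 3.4, "GENE_C": 9.8, "GENE_D": 5.0}
analysis_targets = ["GENE_A", "GENE_C"]

target_data, non_target_data = split_by_analysis_target(measurements, analysis_targets)
```

Every measured gene falls into exactly one of the two groups, so no measured value is discarded by the split.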
According to the above method, the examiner or the processing unit 21 may also acquire the gene-related measurement data of the analysis target gene (step S9). Similarly to the gene-related measurement data of the non-analysis target gene, the gene-related measurement data of the analysis target gene is linked with other gene related information (step S10), and output to the first database storage apparatus 100, the second database storage apparatus 101, or the third database storage apparatus 102 (step S10).
The gene related data may be normalized or standardized and stored in the first database storage apparatus 100, the second database storage apparatus 101, or the third database storage apparatus 102. When the measurement method is a microarray, examples of the normalization method include global normalization such as total intensity normalization, Lowess normalization, and/or local normalization. More specifically, the data can be normalized by the RMA algorithm, the MAS5 algorithm, the PLIER algorithm, or the like. As the analysis software using the RMA algorithm, the product Affymetrix Expression Console software (Thermo Fisher Scientific) may be mentioned. When the measurement method is a method using the next generation sequencer, Reads Per Million mapped reads (RPM), Reads per kilobase of exon model per million mapped reads (RPKM), the Trimmed mean of M values (TMM) method and the like may be mentioned.
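For the next generation sequencer case, RPM and RPKM can be sketched as below. The sketch assumes that the total number of mapped reads equals the sum of the read counts supplied, and the gene identifiers, read counts, and gene lengths are hypothetical values chosen for illustration.

```python
def rpm(read_counts):
    """Reads Per Million mapped reads: count * 1e6 / total mapped reads."""
    total = sum(read_counts.values())  # assumption: all mapped reads are listed
    return {gene: count * 1e6 / total for gene, count in read_counts.items()}


def rpkm(read_counts, gene_lengths_bp):
    """Reads per kilobase of exon model per million mapped reads.

    count * 1e9 / (total mapped reads * exon model length in base pairs)
    """
    total = sum(read_counts.values())
    return {
        gene: count * 1e9 / (total * gene_lengths_bp[gene])
        for gene, count in read_counts.items()
    }


# Hypothetical read counts and exon model lengths (base pairs).
read_counts = {"GENE_A": 500, "GENE_B": 1500}
gene_lengths_bp = {"GENE_A": 2000, "GENE_B": 1000}

rpm_values = rpm(read_counts)
rpkm_values = rpkm(read_counts, gene_lengths_bp)
```

RPM corrects only for sequencing depth, whereas RPKM additionally corrects for gene length, so the two genes above differ by a factor of three in RPM but by a factor of six in RPKM.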
The standardization of the above-mentioned gene-related data may be carried out, for example, by a method of comparing with the data of housekeeping genes (GAPDH: glyceraldehyde-3-phosphate dehydrogenase, β-actin, β2-microglobulin, HPRT 1: hypoxanthine phosphoribosyltransferase 1 and the like), or by a method of comparing the values of the gene-related measurement data based on the expression levels of gene products and performing statistical processing to determine a Z score, significance probability (p value), or likelihood, using as reference values data recorded in the gene expression information database NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/), such as the microarray experiment DataSet Record GDS 3834 (Multiple normal tissues). It is also preferable that the data serving as the reference value is acquired by a homogenized method.
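Two of the standardization approaches mentioned above can be sketched as follows: expressing each value relative to a housekeeping gene, and computing a Z score against a set of reference values. The sketch assumes that the Z score uses the sample standard deviation of the reference values; the measurement values, reference values, and function names are hypothetical.

```python
from statistics import mean, stdev


def relative_to_housekeeping(values, housekeeping_gene="GAPDH"):
    """Divide every measured value by the value of the housekeeping gene."""
    reference = values[housekeeping_gene]
    return {gene: value / reference for gene, value in values.items()}


def z_score(value, reference_values):
    """Z score of a value against reference data (sample standard deviation)."""
    return (value - mean(reference_values)) / stdev(reference_values)


# Hypothetical measurement of one biological sample.
sample = {"GAPDH": 200.0, "GENE_A": 50.0}
relative = relative_to_housekeeping(sample)

# Hypothetical reference values for GENE_A, e.g. from normal tissues.
reference_values = [8.0, 10.0, 12.0]
z = z_score(13.0, reference_values)
```

The ratio to a housekeeping gene removes sample-to-sample differences in overall signal, while the Z score places a value on a common scale relative to the reference population.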
Examples of combinations of a plurality of genes to be analyzed include, for example, at least one selected from a group consisting of Curebest (registered trademark) 95GC Breast analysis target gene, Oncotype (registered trademark) DX analysis target gene, Mamma Print analysis target gene, Blue Print analysis target gene, PAM 50 analysis target gene, SureSelect Human All Exon V6 analysis target gene, SureSelect Human All Exon V6+COSMIC analysis target gene, SureSelect Human All Exon V6+UTR analysis target gene, SureSelect Human All Exon V5 target gene, SureSelect Human All Exon V5+UTRs target gene, SureSelect Human All Exon V5+IncRNA target gene, SureSelect Human All Exon V5+Regulatory target gene, TruSight Cancer target gene, TruSight Tumor 15 target gene, and TruSight Tumor 170 target gene.
Generally, the analysis target genes number about 20 to about 100 genes. However, the genes actually measured in microarrays and the like number about 38,500, and 50,000 or more gene products, including variants of gene products and the like, are analyzed. Therefore, when the analysis target genes are measured, the amount of gene related information acquired for the non-analysis target genes, together with the corresponding biological sample related information, becomes extremely large. A database that collects this information therefore holds a very large amount of information and is useful.
In acquiring the above-described gene-related measurement data, it is preferable to determine examination criteria beforehand, such as what type of biological sample is to be collected from a patient with which disease or at which stage, what measurement method is used to acquire the gene related measurement data, the collection site, the amount of sample to collect, how to collect the biological sample, and how to preserve the biological sample until the measurement, and to acquire the gene related measurement data for a biological sample in conformance with these criteria. The examination criteria are at least one type selected from a group consisting of the medical diagnosis related information, the medical treatment related information, the type of the biological sample, the measurement method, the amount of the biological sample to be measured, the biological sample collection method, and the biological sample storage method. The criteria may be determined by a laboratory facility and/or a medical facility.
(3) Construction of Database
The processing unit 201 of the first database storage apparatus 100, the second database storage apparatus 101, or the third database storage apparatus 102 that stores the gene related information also acquires the gene related information output at step 6 of
The gene related information and the biological sample related information 5 can be associated with each other using a code for specifying a biological sample as a key. Therefore, in the first database storage apparatus 100, the second database storage apparatus 101, or the third database storage apparatus 102, the gene related information and the biological sample related information 5 need not be combined in a single file, although they may be. As another aspect, the gene-related information and the biological sample related information 5 also may be individually stored in two database storage apparatuses that are accessible from a terminal of a user of a database, for example, via a network.
Furthermore, the database constructed in the present embodiment also may be stored in a storage medium such as a hard disk, a semiconductor memory element such as a flash memory, or an optical disk. The storage format of the database on the storage medium is not limited as long as the display device can read the database. Storage in the storage medium is preferably nonvolatile. In this case, the database construction method can be read as a method of manufacturing a storage medium storing the database.
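The association by a sample-specifying code described above can be sketched, under stated assumptions, as a simple key-based join. The field names, sample codes, and record contents below are hypothetical and serve only to show that the two kinds of information can be kept separately and combined on demand.

```python
# Hypothetical gene related information, one record per sample and gene
gene_info = [
    {"sample_code": "S001", "gene": "GENE_A", "value": 7.2},
    {"sample_code": "S001", "gene": "GENE_B", "value": 3.1},
    {"sample_code": "S002", "gene": "GENE_A", "value": 6.8},
]

# Hypothetical biological sample related information, keyed by sample code
sample_info = {
    "S001": {"disease": "breast cancer", "sample_type": "tissue"},
    "S002": {"disease": "breast cancer", "sample_type": "blood"},
}


def join_by_sample_code(gene_records, sample_records):
    """Attach the sample related information to each gene record via the code."""
    joined = []
    for record in gene_records:
        sample = sample_records.get(record["sample_code"])
        if sample is None:
            continue  # skip gene records that have no matching sample record
        joined.append({**record, **sample})
    return joined


joined = join_by_sample_code(gene_info, sample_info)
```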
(4) Other Embodiments
In the above database construction method, a step may be included in which reports 3 and 4 are prepared to report the gene related information 2 of the analysis target gene obtained in 1-1. (2) above, or the gene related information 2 of the analysis target gene and the gene related information 1 of the non-analysis target gene to a medical facility. The reports 3 and 4, for example, as shown in
In the present embodiment, each step (step S1 to step S6, or step S1 to step S6, step S9 and step S10) performed by the processing unit 21 of the laboratory facility information processing apparatus 20 is executed by a computer program. Each step (steps S7, S12 and S8) performed by the processing unit 201 of the first database storage apparatus 100, the second database storage apparatus 101, or the third database storage apparatus 102 is also executed by a computer program. The computer program may be stored in a storage medium such as a hard disk, a semiconductor memory element such as a flash memory, or an optical disk. The storage format of the program in the storage medium is not limited insofar as the display apparatus can read the program. Storage in the storage medium is preferably nonvolatile.
In one example of the present embodiment, the biomarker searched for by re-profiling may be a biomarker of a disease different from the disease of the patient from whom the biological sample was taken, or may be a biomarker of the same disease as the disease of the patient from whom the biological sample was taken.
According to the present embodiment, it is also possible to conduct the measurement under conditions that control the quality of the measurement sample and the gene related measurement data so as to homogenize the steps from collection of the measurement sample to the construction of the database. Since there is no need to consider quality defects of the measurement sample due to the preservation state of the biological sample, the gene-related measurement data acquired under conditions in which quality is controlled in this manner reflect the state of the diseased tissue of the patient from whom the biological sample was collected. Thus, the database constructed according to the first embodiment is more reliable than other databases in that it reflects the condition of the patient's diseased tissue.
1-2. Construction of Database for Training Data and Verification Data
According to a second aspect of the invention, a method is provided for constructing a database that provides training data (also called teacher data or learning data) for artificial intelligence such as discriminant analysis, decision trees, the nearest neighbor method, support vector machines, neural networks, and machine learning such as deep learning, and verification data (test data) for determining whether the constructed learning model is valid. The database constructed in this embodiment also can be used for verification (validation) of a mathematical model obtained by statistical methods such as regression analysis, multiple regression analysis, variance analysis, principal component analysis and the like.
In the method for constructing a database of the invention as described in the first embodiment, it is possible to conduct the measurements under conditions that control the quality of the gene-related measurement data and the measurement sample so as to homogenize the steps from collection of the measurement sample to the construction of the database. Therefore, the gene related measurement data of the analysis target genes and the non-analysis target genes acquired in accordance with the collection of a biological sample, the pretreatment of the biological sample, the pretreatment method of a measurement sample obtained by such pretreatment, and the method of acquiring gene-related measurement data described in the first embodiment have higher reliability than the data of other databases. Therefore, highly reliable data can be provided as training data or as verification data for determining whether the constructed learning model is effective.
Specifically, the second embodiment as shown in
Furthermore, the database constructed in the present embodiment also may be stored in a storage medium such as a hard disk, a semiconductor memory element such as a flash memory, or an optical disk. The storage format of the database on the storage medium is not limited as long as the display device can read the database. Storage in the storage medium is preferably nonvolatile. In this case, the database construction method can be read as a method of manufacturing a storage medium storing the database.
In the second embodiment, the examiner or the processing unit 21 may acquire the gene-related measurement data for the non-analysis target gene in step S22, output the gene related information 1 of the non-analysis target gene to the first database storage apparatus 100, the second database storage apparatus 101, or the third database storage apparatus 102 in step S23, and store the gene related information 1 of the non-analysis target gene in the first database storage apparatus 100, the second database storage apparatus 101, or the third database storage apparatus 102 in step S24. Also in the second embodiment, the database may be constructed from only the gene-related information 1 of the non-analysis target gene from step S22 to step S25.
In the present embodiment, each step (step S21 to step S23, or step S1 to step S23, step S26 and step S27) performed by the processing unit 21 of the laboratory facility information processing apparatus 20 is executed by a computer program. Each step (steps S24, S26, and S25) performed by the processing unit 201 of the first database storage apparatus 100, the second database storage apparatus 101, or the third database storage apparatus 102 is also executed by a computer program. The computer program may be stored in a storage medium such as a hard disk, a semiconductor memory element such as a flash memory, or an optical disk. The storage format of the program in the storage medium is not limited insofar as the display apparatus can read the program. Storage in the storage medium is preferably nonvolatile.
The database constructed by the above method can be used for artificial intelligence learning or to verify a model constructed by artificial intelligence. Depending on the purpose, the artificial intelligence may be caused to learn either or both of the gene related information 2 of the analysis target gene and the gene related information 1 of the non-analysis target gene stored in the database. For example, for one disease, the gene related information 2 of an analysis target gene and the biological sample related information 5 corresponding thereto, which are stored in the database, may be divided into two groups, one used as training data and the other used as verification data. Alternatively, when Leave-One-Out Cross-Validation is performed for a single disease using all the gene related information 2 of the analysis target gene stored in the database as training data, the gene related information 2 of the analysis target gene held out in each round of the Leave-One-Out Cross-Validation and the biological sample related information 5 corresponding thereto can be handled as verification data. In this section, the gene related information 2 of the analysis target gene can be replaced with the gene related information 1 of the non-analysis target gene.
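The two ways of deriving training data and verification data described above can be sketched as follows. This is an illustrative example only: the function names, the training fraction, the fixed random seed, and the integer records standing in for database entries are all assumptions.

```python
import random


def split_records(records, training_fraction=0.5, seed=0):
    """Divide records into a training group and a verification group."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


def leave_one_out(records):
    """Yield (training, held_out) pairs, holding out each record once."""
    for i, held_out in enumerate(records):
        training = records[:i] + records[i + 1:]
        yield training, held_out


# Hypothetical records, each standing for the gene related information and
# the corresponding biological sample related information of one sample
records = list(range(10))

training, verification = split_records(records)
folds = list(leave_one_out(records))
```

In the leave-one-out case every record serves as training data in all rounds but one, and as verification data in exactly one round.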
2. System for Constructing Databases
The third embodiment of the present invention relates to a system for constructing the database described in the first embodiment and the second embodiment.
The third embodiment includes the 3-1st embodiment for constructing a database in a laboratory facility, the 3-2nd embodiment for constructing a database in a medical facility, and the 3-3rd embodiment in which a laboratory facility and a medical facility collaborate to construct the database. Below, a schematic view of the system shown in
2-1. Configuration of Hardware
The laboratory facility information processing apparatus 20 shown in
The laboratory facility information processing apparatus 20 includes a processing unit (CPU) 21, a main storage unit 22, a ROM (read only memory) 23, an auxiliary storage unit 24, a communication interface (I/F) 25, an input I/F 26, an output I/F 27, a media I/F 28, and a bus 29. The laboratory facility information processing apparatus 20 also includes an input unit 30 and a display unit 31. The laboratory facility information processing apparatus 20 also may include the storage medium 32.
The medical facility information processing apparatus 50 includes a processing unit (CPU) 51, a main storage unit 52, a ROM 53, an auxiliary storage unit 54, a communication I/F 55, an input I/F 56, an output I/F 57, a media I/F 58, and a bus 59. The medical facility information processing apparatus 50 also includes an input unit 60 and a display unit 61. The medical facility information processing apparatus 50 also may include the storage medium 62.
The first database storage apparatus (laboratory facility database storage apparatus) 100, the second database storage apparatus (medical facility database storage apparatus) 101, and the third database storage apparatus 102 each include a processing unit (CPU) 201, a main storage unit 202, a ROM 203, an auxiliary storage unit 204, a communication I/F 205, an input I/F 206, an output I/F 207, a media I/F 208, and a bus 209. The first database storage apparatus 100, the second database storage apparatus 101, and the third database storage apparatus 102 each have an input unit 210 and a display unit 211. The first database storage apparatus 100, the second database storage apparatus 101, and the third database storage apparatus 102 also may include the storage medium 212.
The CPUs 21, 51, and 201 control each unit based on the programs stored in the ROMs 23, 53, and 203 and the auxiliary storage units 24, 54, and 204. The CPUs 21, 51, and 201 also may be MPUs 21, 51, and 201.
The ROMs 23, 53, and 203 are configured by a mask ROM, a PROM, an EPROM, an EEPROM, and the like, and store programs and settings related to the hardware operation of the apparatuses and boot programs executed by the CPUs 21, 51, 201 during activation of the laboratory facility information processing apparatus 20, the medical facility information processing apparatus 50, the first database storage apparatus 100, the second database storage apparatus 101, and the third database storage apparatus 102.
The main storage units 22, 52, and 202 are configured by a RAM such as SRAM or DRAM, and volatilely store information received from the input units 30, 60, and 210. The auxiliary storage units 24, 54, and 204 store application software and information input or generated during operation of the respective devices 20, 50, 100, 101, 102 in a nonvolatile manner (nonvolatile storage is also referred to as “recording”). The auxiliary storage units 24, 54, and 204 are configured by a hard disk, a semiconductor memory element such as a flash memory, an optical disk, or the like.
The communication I/Fs 25, 55, 205 receive information from an external device and also transmit information stored or generated by each device 20, 50, 100, 101, 102 to the outside. The communication I/Fs 25, 55, and 205 are serial interfaces such as USB, IEEE 1394, RS-232C and the like, parallel interfaces such as SCSI, IDE, IEEE 1284 and the like, analog interfaces including a D/A converter and an A/D converter, a network interface controller (NIC), and the like.
The input I/Fs 26, 56, and 206 accept character input, click input, voice input and the like from the input units 30, 60, and 210. For example, the input I/Fs 26, 56, and 206 are serial interfaces such as USB, IEEE 1394, and RS-232C, parallel interfaces such as SCSI, IDE, and IEEE 1284, and analog interfaces including a D/A converter and an A/D converter and the like. The accepted input content is stored in the main storage unit 22, 52, 202 or the auxiliary storage unit 24, 54, 204.
For example, the output I/Fs 27, 57, 207 are composed of the same interface as the input I/Fs 26, 56, 206, and output the information generated by the CPUs 21, 51, 201 to the display units 31, 61, 211. The output I/Fs 27, 57, 207 output the information generated by the CPUs 21, 51, 201 and stored in the auxiliary storage units 24, 54, 204 to the display units 31, 61, 211. Here, the display units 31, 61, and 211 may be a display or a projector, but may also be a printer.
The media I/Fs 28, 58, 208 read, for example, application software or the like stored in the storage media 32, 62, 212. The read application software and the like are stored in the main storage units 22, 52, 202 or the auxiliary storage units 24, 54, 204. The media I/Fs 28, 58, and 208 also write information generated by the CPUs 21, 51, and 201 to the storage media 32, 62, and 212. The media I/Fs 28, 58, and 208 write the information generated by the CPUs 21, 51, 201 and stored in the auxiliary storage units 24, 54, and 204 to the storage media 32, 62, 212. The storage media 32, 62, and 212 are configured by a flexible disk, a CD-ROM, a DVD-ROM, or the like. The storage media 32, 62, and 212 are connected to media I/Fs 28, 58, and 208 by a flexible disk drive, a CD-ROM drive, a DVD-ROM drive, or the like. The control of each hardware configuration by the CPU 21, 51, 201 is transmitted to each hardware configuration by buses 29, 59, 209.
2-2. System for Constructing a Database in a Laboratory Facility
As shown in
The processing unit 21 of the laboratory facility information processing apparatus 20 acquires information specifying the analysis target gene, for example, by input from the input unit 30 or via the communication I/F 25, or the media I/F 28, and stores the information in the main storage unit 22, ROM 23, or the auxiliary storage unit 24. The processing unit 21 also acquires the gene-related measurement data from the measurement apparatus 10. Next, the processing unit 21 acquires gene related measurement data concerning the analysis target gene and/or the non-analysis target gene other than the analysis target gene, and generates gene related information for each gene. Subsequently, the processing unit 21 outputs the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene to the first database storage apparatus 100 via the communication I/F 25.
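By way of illustration only, the separation of acquired gene-related measurement data into data for the analysis target gene and data for the non-analysis target gene can be sketched as below. The function name, gene names, and values are hypothetical, and the data structure is an assumption rather than the format used by the processing unit 21.

```python
def partition_by_target(measurements: dict, analysis_targets: set):
    """Split measurement data into analysis target and non-analysis target parts."""
    target = {g: v for g, v in measurements.items() if g in analysis_targets}
    non_target = {g: v for g, v in measurements.items() if g not in analysis_targets}
    return target, non_target


# Hypothetical measurement data for one biological sample
measurements = {"GENE_A": 7.2, "GENE_B": 3.1, "GENE_C": 5.5}
analysis_targets = {"GENE_A"}  # information specifying the analysis target gene

target_info, non_target_info = partition_by_target(measurements, analysis_targets)
```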
The processing unit 201 of the first database storage apparatus 100 acquires the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene via the communication I/F 205. The processing unit 201 of the first database storage apparatus 100 also acquires biological sample related information 5, which is information related to the biological sample from which the gene related measurement data were acquired, via input from the input unit 210 or through the communication I/F 205 or media I/F 208. The processing unit 201 of the first database storage apparatus 100 stores the acquired gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene and the biological sample related information 5 in the auxiliary storage unit 204.
Here, the processing unit 21 of the laboratory facility information processing apparatus 20 also may store the information in the storage medium 32 in order to output the gene-related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene to the first database storage apparatus 100. The processing unit 201 of the first database storage apparatus 100 may acquire the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene via the media I/F 208. The processing unit 21 of the laboratory facility information processing apparatus 20 acquires the biological sample related information 5 and outputs the biological sample related information 5 together with the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene to the first database storage apparatus 100. The description of each step of “1-1. Construction of database for re-profiling” is hereby incorporated by reference.
2-3. System for Constructing a Database in a Medical Facility
As shown in
The processing unit 21 of the laboratory facility information processing apparatus 20 acquires information specifying the analysis target gene, for example, by input from the input unit 30 or via the communication I/F 25, or the media I/F 28, and stores the information in the main storage unit 22, ROM 23, or the auxiliary storage unit 24. The processing unit 21 also acquires the gene-related measurement data from the measurement apparatus 10. Next, the processing unit 21 acquires the gene-related measurement data concerning the analysis target gene and/or the non-analysis target gene other than the analysis target gene, and generates gene related information for each gene. Subsequently, the processing unit 21 outputs the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene to the second database storage apparatus 101 via the communication I/F 25.
The processing unit 51 of the medical facility information processing apparatus 50 receives the biological sample related information 5, which is information related to the biological sample from which the gene related measurement data were acquired, input from the input unit 60 by a doctor or the like in a medical facility, and outputs the biological sample related information 5 to the second database storage apparatus 101 via the communication I/F 55.
The processing unit 201 of the second database storage apparatus 101 acquires the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene via the communication I/F 205. The processing unit 201 of the second database storage apparatus 101 also acquires the biological sample related information 5 via the communication I/F 205 or the like. The processing unit 201 of the second database storage apparatus 101 stores the acquired gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene and the biological sample related information 5 in the auxiliary storage unit 204.
Here, the processing unit 21 of the laboratory facility information processing apparatus 20 also may store the gene-related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene in the storage medium 32 for output to the second database storage apparatus 101. The processing unit 51 of the medical facility information processing apparatus 50 also may store the biological sample related information 5 in the storage medium 62 in order to output the biological sample related information 5 to the second database storage apparatus 101. The processing unit 201 of the second database storage apparatus 101 acquires the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene and the biological sample related information 5 via the media I/F 208. The description of each step of “1-1. Construction of database for re-profiling” is hereby incorporated by reference.
2-4. System for Constructing Databases by Collaboration Between Laboratories and Medical Facilities
As shown in
The processing unit 21 of the laboratory facility information processing apparatus 20 acquires information specifying the analysis target gene, for example, by input from the input unit 30 or via the communication I/F 25, or the media I/F 28, and stores the information in the main storage unit 22, ROM 23, or the auxiliary storage unit 24. The processing unit 21 also acquires the gene-related measurement data from the measurement apparatus 10. Next, the processing unit 21 acquires the gene-related measurement data concerning the analysis target gene and/or the non-analysis target gene other than the analysis target gene, and generates gene related information for each gene. Subsequently, the processing unit 21 outputs the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene to the third database storage apparatus 102 via the communication I/F 25.
The processing unit 51 of the medical facility information processing apparatus 50 receives the biological sample related information 5, which is information related to the biological sample from which the gene related measurement data were obtained, input from the input unit 60 by a doctor or the like in a medical facility, and outputs the biological sample related information 5 to the third database storage apparatus 102 via the communication I/F 55.
The processing unit 201 of the third database storage apparatus 102 acquires the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene via the communication I/F 205. The processing unit 201 of the third database storage apparatus 102 acquires the biological sample related information 5 via the communication I/F 205 or the like. The processing unit 201 of the third database storage apparatus 102 stores the acquired gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene and the biological sample related information 5 in the auxiliary storage unit 204.
Here, the processing unit 21 of the laboratory facility information processing apparatus 20 also may store the gene-related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene in the storage medium 32 for output to the third database storage apparatus 102. The processing unit 51 of the medical facility information processing apparatus 50 also may store the biological sample related information 5 in the storage medium 62 in order to output the biological sample related information 5 to the third database storage apparatus 102. The processing unit 201 of the third database storage apparatus 102 acquires the gene related information 2 of the analysis target gene and/or the gene related information 1 of the non-analysis target gene and the biological sample related information 5 via the media I/F 208.
The description of each step of “1-1. Construction of database for re-profiling” is hereby incorporated by reference.
In the 3-1st embodiment, the 3-2nd embodiment, and the 3-3rd embodiment, the processing unit 21 of the laboratory facility information processing apparatus 20 also may determine whether to generate reports 3 and 4 regarding the analysis target gene and/or non-analysis target gene.
3. Method for Searching for New Marker Candidate
The fourth embodiment of the invention relates to a method of searching for candidates for a new biomarker by re-profiling gene-related information, including gene-related measurement data reflecting the expression of a gene in the biological sample or the function of the gene product, using the database constructed according to the first embodiment. Therefore, for the terms and description of the present embodiment that are common to the first embodiment, reference is made to the description of the first embodiment. The fourth embodiment also may be implemented by the new marker search apparatus 80 according to a fifth embodiment to be described later.
As shown in
In statistical processing, for example, data such as DataSet Record GDS3834 (Multiple normal tissues) or the like also can be used when reference data of a healthy tissue are required. When the statistical processing requires reference data for a disease, data registered in NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) also can be used. Preferably, reference data of healthy tissue or tissue from a disease lesion may be acquired according to the method of obtaining the gene-related measurement data in the first embodiment in order to obtain homogenized data.
Subsequently, the examiner or the processing unit 81 of the new marker search apparatus 80 determines candidates for a new marker based on the numerical value with respect to each biological sample related information. Specifically, when the numerical value is an absolute value, the examiner or the processing unit 81 of the new marker search apparatus 80, for example, sorts the gene-related measurement data corresponding to the absolute value on the basis of the absolute value (step S33), and determines which of the genes has a high absolute value (step S34). Then, the examiner or the processing unit 81 of the new marker search apparatus 80 determines a gene having a high absolute value as a candidate for a new marker (step S35), and determines that the gene is a non-candidate for a new marker if the absolute value is low (step S36). The number of new markers may be plural.
In the case of obtaining the relevance between each biological sample related information and a plurality of genes, the relevance can be obtained by subjecting the numerical values to statistical processing or the like. For example, for a plurality of genes ranked highest in a predetermined ranking of the genes arranged based on the absolute values of the numerical values in step S33, multiple comparison procedures such as false discovery rate control, family-wise error rate control, the Bonferroni method, the Holm method and the like may be performed, or genes having relevance to the biological sample related information (for which a significant difference is recognized) may be estimated by a resampling method such as a permutation test, the bootstrap method, cross-validation or the like.
It is also possible to classify each gene by biological function (for example, apoptosis-related genes and the like) and obtain the relationship between the function in the living body and each diagnosis related information or each treatment related information or the like. Such an association can be obtained by Gene Set Enrichment Analysis or the like. Alternatively, after a group of genes strongly related to the biological sample related information is selected by the hypergeometric distribution or the like, the relevance between each gene and the biological sample related information can be obtained by using the degree of overlap with each gene group classified based on in vivo function as an index.
A candidate for a new marker also may be searched for based on the strength of the association between a gene and medical information such as the presence or absence of a family history, or treatment related information such as whether the prognosis of the disease is good. Such a search can be performed by statistical processing such as regression analysis, variance analysis, principal component analysis or the like using numerical values showing the relationship between the obtained gene-related measurement data and the biological sample related information. Alternatively, a mathematical model may be obtained by cluster analysis such as hierarchical clustering, k-means, mean-shift and the like and validated using a part of the obtained numerical values, and a plurality of genes having strong relevance to the biological sample related information may be determined from the validation data.
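A minimal sketch of steps S33 to S36, assuming that "high" and "low" are decided by a fixed threshold, might look as follows. The threshold value, the gene names, and the numerical values are hypothetical; the source does not specify how the boundary between high and low absolute values is set.

```python
def select_candidates(scores: dict, threshold: float):
    """Rank genes by absolute value and split them into candidates and non-candidates."""
    # Step S33: sort the genes in descending order of the absolute value
    ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    # Steps S34 to S36: high absolute value -> candidate, otherwise non-candidate
    candidates = [gene for gene, value in ranked if abs(value) >= threshold]
    non_candidates = [gene for gene, value in ranked if abs(value) < threshold]
    return candidates, non_candidates


# Hypothetical numerical values relating each gene to one item of
# biological sample related information
scores = {"GENE_A": -2.5, "GENE_B": 0.3, "GENE_C": 1.8}

candidates, non_candidates = select_candidates(scores, threshold=1.5)
```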
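As one possible illustration of the multiple comparison procedures named above, the following sketch computes adjusted p values by the Bonferroni method, the Holm method, and, as an assumed example of a false discovery rate procedure not named in the source, the Benjamini-Hochberg method. The p values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p values (controls the family-wise error rate)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p values (controls the family-wise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values for four top-ranked genes
p_values = [0.01, 0.04, 0.03, 0.005]
```

A gene would then be regarded as having a significant relevance when its adjusted p value falls below the chosen significance level.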
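The use of the hypergeometric distribution to evaluate the degree of overlap between a selected gene group and a functionally classified gene group can be sketched as follows. The function name and the counts are hypothetical, and the upper-tail probability shown here is one common way of scoring such an overlap rather than the specific procedure of the embodiment.

```python
from math import comb


def hypergeom_upper_tail(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: total number of genes measured
    successes:  number of genes in the functional gene group
    draws:      number of genes selected as strongly related
    observed:   number of selected genes that belong to the functional group
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


# Hypothetical counts: 10 genes measured, 4 in the functional group,
# 3 genes selected, 2 of which fall in the functional group
p_overlap = hypergeom_upper_tail(10, 4, 3, 2)
```

A small probability indicates that the observed overlap is unlikely to arise by chance, which suggests an association between the functional gene group and the biological sample related information.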
In the present embodiment, the processing unit 81 of the new marker search apparatus 80 performs each step (step S31 to S36) by executing a computer program. The computer program may be stored in a storage medium such as a hard disk, a semiconductor memory element such as a flash memory, or an optical disk. The storage format of the program in the storage medium is not limited insofar as the display apparatus can read the program. Storage in the storage medium is preferably nonvolatile.
4. New Marker Candidate Search Apparatus
The new marker search apparatus 80 shown in
The new marker search apparatus 80 includes a processing unit (CPU) 81, a main storage unit 82, a ROM 83, an auxiliary storage unit 84, a communication I/F 85, an input I/F 86, an output I/F 87, and a media I/F 88. The new marker search apparatus 80 also includes an input unit 90 and a display unit 91. The new marker search apparatus 80 also may include the storage medium 92. The description of “2-1. Configuration of Hardware” is incorporated herein for the description of each configuration.
20 laboratory facility information processing apparatus; 50 medical facility information processing apparatus; 100 first database storage apparatus; 101 second database storage apparatus; 102 third database storage apparatus; 500, 600, 700 system.
Number | Date | Country | Kind |
---|---|---|---|
2017-136368 | Jul 2017 | JP | national |