This invention relates to a Literature Information Processing System that analyzes literature information by natural language processing and provides an output of the analysis result.
Recent advances in gene analysis technology are gradually making it possible to reveal the function and structure of genes. Among gene analysis methods, DNA microarray technology is noted for its advantages. A DNA microarray consists of many different DNA fragments (probes) arranged at very high density on the surface of a flat substrate (glass, silicon, plastic, etc.). cDNA, short-chain nucleotides (20-30 bases), and the like are ordinarily used as probes.
DNA microarrays are based on hybridization, i.e. the formation of hydrogen bonds between A (adenine) and T (thymine) and between G (guanine) and C (cytosine). Target DNA or RNA that has been labeled with a fluorescent material is captured on the DNA microarray by hybridization with the probes. The captured targets are detected as a fluorescence signal from each spot. By analyzing this data with a computer, the state of one thousand to several tens of thousands of DNA sequences can be observed at once, so the expression of a large number of genes can be monitored at one time.
Numerous studies on the functions of elements such as genes and proteins have already been conducted, and the articles reporting these studies are stored in databases. The information on interactions between genes and proteins contained in the text of the articles is important, but because the databases hold an enormous number of articles, it is difficult for users to examine each sentence and find these interactions. Consequently, approaches that automatically search articles stored electronically in a database and pick out the names of the elements described in them have become an important topic in natural language processing. Furthermore, using natural language processing, these approaches can extract the connection between two elements (for instance, co-occurrence), called a binary relation, and draw the combined network of such connections as a pathway map.
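As a minimal sketch of the co-occurrence idea described above, the following Python example treats two element names that appear in the same sentence as a binary relation and combines the relations into an adjacency map. The sample sentences, the element names, and the function names are hypothetical and are chosen only for illustration; practical systems use much more careful name recognition.

```python
from itertools import combinations
from collections import defaultdict

def cooccurrence_relations(sentences, element_names):
    """Return the set of name pairs that appear together in at least one sentence."""
    relations = set()
    for sentence in sentences:
        # Crude tokenization: strip basic punctuation, then split on whitespace.
        tokens = set(sentence.replace(".", " ").replace(",", " ").split())
        found = sorted(name for name in element_names if name in tokens)
        for a, b in combinations(found, 2):
            relations.add((a, b))
    return relations

def to_network(relations):
    """Combine binary relations into an undirected adjacency map."""
    network = defaultdict(set)
    for a, b in relations:
        network[a].add(b)
        network[b].add(a)
    return dict(network)

# Hypothetical example sentences.
sentences = [
    "RAF1 phosphorylates MEK1 in stimulated cells.",
    "MEK1 activates ERK2.",
    "ERK2 was measured alone.",
]
names = {"RAF1", "MEK1", "ERK2"}
relations = cooccurrence_relations(sentences, names)
network = to_network(relations)
```

The adjacency map is only one possible representation of the combined network; a drawing step would lay out its nodes and edges as a pathway map.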
There is a system that analyzes the pathways of proteins and genes, which is necessary for understanding biological processes (see http://www.infocom.co.jp/bio/bioinfo.pathway. html). In addition, there is also a network that shows the connections between biological molecules searched via disorder name (see http://www.immd.co.jp/keymolnet/027k6d2x40/Key Molnet0305Rla.pdf).
In existing systems, pathway analysis and pathway map drawing are performed one protein or gene at a time, so analyzing and drawing the pathways of the many proteins and genes obtained from a DNA microarray experiment takes a large amount of time and effort. Moreover, even more time and work is required to analyze and understand the complex relationships among the proteins and genes obtained as results from these existing pathway analysis tools.
The purpose of this invention, referred to henceforth as “the Literature Information Processing System,” is to provide a system that can easily analyze the interactions of a large number of element names and draw a pathway map.
The Literature Information Processing System has the following characteristics: 1) the dictionary that stores multiple element names and the verbs that indicate the interactions between element names, 2) the literature database that stores multiple pieces of literature information, 3) the input means to enter element names, 4) the multi-body interactions extracting means to extract the multi-body interactions of every element name entered in reference to the above dictionary and the above literature database, and 5) the pathway map drawing means to draw a pathway map of the multi-body interactions extracted by the above multi-body interactions extracting means.
With this Literature Information Processing System, the multi-body interactions of every element name entered are extracted in reference to the dictionary and the literature database, and pathway maps of the extracted multi-body interactions are drawn. In other words, the system can extract the multi-body interactions of multiple element names at the same time and draw the pathway maps. Consequently, the system can quickly extract the multi-body interactions and draw the pathway map for each of the multiple element names entered.
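The dictionary-based extraction described above can be sketched as follows. This is an illustrative simplification, not the method of the invention: it assumes that an interaction is expressed as a dictionary verb standing between two dictionary element names in one sentence, and the dictionary contents, sentences, and function names are hypothetical.

```python
import re

# Hypothetical dictionary contents.
ELEMENT_NAMES = {"RAF1", "MEK1", "ERK2"}
INTERACTION_VERBS = {"activates", "inhibits", "binds"}

def extract_relations(sentences):
    """Extract (subject, verb, object) triples: a dictionary verb between two dictionary names."""
    triples = []
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        for i, token in enumerate(tokens):
            if token not in INTERACTION_VERBS:
                continue
            left = [t for t in tokens[:i] if t in ELEMENT_NAMES]
            right = [t for t in tokens[i + 1:] if t in ELEMENT_NAMES]
            if left and right:
                # Nearest name before the verb and nearest name after it.
                triples.append((left[-1], token, right[0]))
    return triples

def relations_for(names, sentences):
    """For every entered name, collect the triples in which it takes part."""
    triples = extract_relations(sentences)
    return {name: [t for t in triples if name in (t[0], t[2])] for name in names}

# Hypothetical literature sentences.
literature = ["RAF1 activates MEK1.", "MEK1 activates ERK2 and ERK2 inhibits RAF1."]
result = relations_for(["MEK1", "ERK2"], literature)
```

All entered names are processed in a single pass over the literature, which mirrors the idea of handling multiple element names at once rather than one by one.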
The Literature Information Processing System has the following characteristics: 1) the dictionary to store multiple element names and the verbs that indicate the interactions between element names, 2) the literature database to store multiple pieces of literature information, 3) the input means to enter element names, 4) the decision making means to determine whether the multi-body interactions of the above element names have already been extracted or not, 5) the multi-body interactions extracting means to extract, in reference to the above dictionary and the above literature database, the multi-body interactions of the element names determined by the above decision making means not to have been extracted, and 6) the pathway map drawing means to draw a pathway map on the basis of the multi-body interactions extracted by the above multi-body interactions extracting means.
The Literature Information Processing System determines, for each of the multiple element names entered, whether its multi-body interactions have already been extracted, then extracts the multi-body interactions only for the element names whose extraction is incomplete, in reference to the dictionary and the literature database. Then, it draws the pathway maps based on the extracted multi-body interactions. As a result, the system does not redundantly extract multi-body interactions, and can therefore extract multi-body interactions and draw pathway maps very quickly for each of the multiple element names entered.
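The decision step can be illustrated with a small cache: extraction runs only for names that have no stored result yet. The sketch below is an assumption about one way to realize this; `fake_extract` is a hypothetical stand-in for the lookup against the dictionary and the literature database.

```python
def extract_with_skip(entered_names, extract_fn, already_extracted=None):
    """Run extract_fn only for names whose relations have not been extracted yet."""
    cache = {} if already_extracted is None else already_extracted
    calls = []  # names for which extraction was actually performed
    for name in entered_names:
        if name in cache:  # decision step: extraction already done, skip it
            continue
        cache[name] = extract_fn(name)
        calls.append(name)
    return cache, calls

def fake_extract(name):
    """Hypothetical extractor returning canned relations."""
    table = {
        "RAF1": [("RAF1", "activates", "MEK1")],
        "MEK1": [("MEK1", "activates", "ERK2")],
    }
    return table.get(name, [])

# "RAF1" is entered twice but extracted only once.
cache, calls = extract_with_skip(["RAF1", "MEK1", "RAF1"], fake_extract)
```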
In addition, the above dictionary of the Literature Information Processing System also stores the noun phrases and the adjective phrases that indicate the interactions between the element names. Because this greatly increases the vocabulary stored in the dictionary, the system can extract a wide range of precise multi-body interactions.
Furthermore, the Literature Information Processing System has the following characteristics: 1) the literature database to store multiple pieces of literature information, 2) the input means to enter element names, 3) the multi-body interactions extracting means to extract the multi-body interactions of each of the multiple element names entered in reference to the above literature database on the basis of the verbs indicating the interactions between the above element names, 4) the overlapping part extracting means to extract the overlapping parts of the multi-body interactions extracted for every element name, and 5) the pathway map drawing means to draw the overlapping parts extracted by the above overlapping part extracting means as one unit of information.
The Literature Information Processing System extracts the multi-body interactions of every element name entered in reference to the literature database and draws a pathway map of the extracted multi-body interactions. In other words, the system can extract the multi-body interactions of the multiple element names at the same time and draw the pathway map in reference to the literature database alone. Consequently, even without a dictionary that stores multiple element names and the verbs that indicate the interactions between them, the system can very quickly extract the multi-body interactions and draw the pathway maps of the multiple element names entered, using a simple system architecture.
Further, in the Literature Information Processing System, the above multi-body interactions extracting means also extracts multi-body interactions based on the noun phrases and the adjective phrases that indicate the interactions between the element names. Because the system extracts multi-body interactions based not on verbs alone but also on noun phrases and adjective phrases, it can extract a large number of precise multi-body interactions.
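One possible reading of the overlapping part extraction is sketched below: when the same relation is extracted for more than one entered element name, it is merged and kept once, so that it can be drawn as a single unit. This interpretation, the data, and the function name are assumptions made for illustration.

```python
def merge_overlaps(relations_by_name):
    """Merge relations extracted per name; a relation found for several names is kept once.

    Returns a dict mapping each relation to the set of entered names it was extracted for.
    """
    merged = {}
    for name, triples in relations_by_name.items():
        for triple in triples:
            merged.setdefault(triple, set()).add(name)
    return merged

# Hypothetical per-name extraction results.
relations_by_name = {
    "MEK1": [("RAF1", "activates", "MEK1"), ("MEK1", "activates", "ERK2")],
    "ERK2": [("MEK1", "activates", "ERK2")],
}
merged = merge_overlaps(relations_by_name)
# Relations extracted for more than one entered name are the overlapping parts.
overlapping = {t for t, owners in merged.items() if len(owners) > 1}
```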
The Literature Information Processing System has the following additional features: 1) the literature database to store multiple pieces of literature information, 2) the input means to enter element names, 3) the decision making means to determine whether the multi-body interactions of the above element names have already been extracted based on the verbs that indicate the interactions between the above element names, 4) the multi-body interactions extracting means to extract, in reference to the above literature database, the multi-body interactions of the element names determined by the above decision making means not to have been extracted, and 5) the pathway map drawing means to draw the pathway map of the multi-body interactions extracted by the above multi-body interactions extracting means.
The Literature Information Processing System determines whether the multi-body interactions of each of the multiple element names entered have already been extracted, and extracts, in reference to the literature database, the multi-body interactions of the element names for which they have not been extracted. It then draws the pathway maps based on the extracted multi-body interactions. Consequently, even without a dictionary that stores the multiple element names and the verbs that indicate the interactions between them, the system can very quickly extract the multi-body interactions and draw the pathway maps of every element name entered, using a simple system architecture.
The decision making means of the Literature Information Processing System also determines whether the multi-body interactions have been extracted based on the noun phrases and the adjective phrases that indicate the interactions between the element names. Because the determination covers extraction based not only on verbs but also on noun phrases and adjective phrases, the system can extract a large number of precise multi-body interactions.
The multi-body interactions extracting means of the Literature Information Processing System extracts not only the multi-body interactions of the element names entered by the above input means but also the multi-body interactions of the element names extracted as having multi-body interactions with the entered element names.
The Literature Information Processing System also has the extraction range specifying means, which specifies the range over which the above multi-body interactions extracting means extracts the multi-body interactions for the element names entered by the above input means.
The Literature Information Processing System can draw a simple pathway map or a detailed pathway map according to need because the system can specify the extraction range of the multi-body interactions on the element names entered.
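The effect of the extraction range can be illustrated with a breadth-first expansion limited to a number of steps from the entered names. Expressing the range as a hop count is an assumption made here for illustration, and the graph and names are hypothetical.

```python
from collections import deque

def extract_within_range(graph, seeds, max_hops):
    """Breadth-first expansion from the entered names, limited to max_hops steps."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    edges = set()
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the specified range
        for neighbor in graph.get(node, ()):
            edges.add(tuple(sorted((node, neighbor))))
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance, edges

# Hypothetical stored relations as an adjacency map.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
simple_nodes, simple_edges = extract_within_range(graph, ["A"], 1)
detailed_nodes, detailed_edges = extract_within_range(graph, ["A"], 3)
```

A small range yields a simple map around the entered names, while a larger range yields a more detailed one.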
The pathway map drawing means of the Literature Information Processing System also distinguishes between, and displays, the element names entered by the above input means and the element names that the above multiple relations extracting means extracted starting from the entered element names.
The Literature Information Processing System makes the drawn pathway maps easy to understand because it distinguishes between the element names entered by the input means and the element names extracted from them, and shows both on the pathway maps.
Another characteristic of the Literature Information Processing System is that it has the multiple relation indicating means to show the multiple relations extracted by the above multiple relation extracting means. This multiple relation indicating means distinguishes between positive and negative relations when showing them.
The Literature Information Processing System makes the multiple relations shown easy to understand because the system can distinguish between positive and negative relations when showing them.
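A simple way to separate positive and negative relations is to look up the verb of each relation in two word lists, as sketched below. The word lists and function names are hypothetical; the text does not specify how the distinction is made.

```python
# Hypothetical verb lists for illustration only.
POSITIVE_VERBS = {"activates", "induces", "upregulates"}
NEGATIVE_VERBS = {"inhibits", "represses", "downregulates"}

def classify(triple):
    """Label a (subject, verb, object) relation by the sign of its verb."""
    verb = triple[1]
    if verb in POSITIVE_VERBS:
        return "positive"
    if verb in NEGATIVE_VERBS:
        return "negative"
    return "unclassified"

def split_by_sign(triples):
    """Group relations so that each group can be displayed in its own style."""
    groups = {"positive": [], "negative": [], "unclassified": []}
    for triple in triples:
        groups[classify(triple)].append(triple)
    return groups

groups = split_by_sign([
    ("RAF1", "activates", "MEK1"),
    ("ERK2", "inhibits", "RAF1"),
    ("MEK1", "binds", "ERK2"),
])
```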
The Literature Information Processing System of this invention has the following further characteristics: 1) the dictionary to store multiple element names and the verbs that indicate the interactions between the element names, 2) the literature database to store multiple pieces of literature information, 3) the first multi-body interactions extracting means to extract the multi-body interactions of each of the multiple element names in reference to the above dictionary and the above literature database, 4) the multi-body interactions storing means to store the multi-body interactions extracted by the first multi-body interactions extracting means, 5) the input means to enter element names, 6) the second multi-body interactions extracting means to extract the multi-body interactions of every element name entered in reference to the multi-body interactions stored by the above multi-body interactions storing means, 7) the overlapping part extracting means to extract the overlapping parts of the multi-body interactions extracted by the above second multi-body interactions extracting means, and 8) the pathway map drawing means to draw the overlapping parts extracted by the above overlapping part extracting means as one unit of information.
The Literature Information Processing System extracts the multi-body interactions of each of the multiple element names entered in reference to the multi-body interactions storing means, in which multi-body interactions extracted in advance are stored, and draws the pathway map on the basis of the extracted multi-body interactions. In other words, the system can extract the multi-body interactions of the multiple element names at the same time and draw the pathway map. Consequently, the system can extract the multi-body interactions and draw the pathway map for every element name entered very quickly.
The Literature Information Processing System of this invention has the following characteristics: 1) the dictionary to store multiple element names and the verbs that indicate the interactions between the element names, 2) the literature database to store multiple pieces of literature information, 3) the first multi-body interaction extracting means to extract the multi-body interactions of each of the multiple element names in reference to the above dictionary and the above literature database, 4) the multi-body interaction storing means to store the multi-body interactions extracted by the above first multi-body interaction extracting means, 5) the input means to enter element names, 6) the decision making means to decide whether the multi-body interactions of the above element names have already been extracted or not, 7) the second multi-body interaction extracting means to extract, in reference to the multi-body interactions stored by the above multi-body interaction storing means, the multi-body interactions of the element names determined by the above decision making means not to have been extracted, and 8) the pathway map drawing means to draw the pathway maps on the basis of the multi-body interactions extracted by the second multi-body interaction extracting means.
The Literature Information Processing System determines whether the multi-body interactions of each of the multiple element names entered have been extracted or not, then extracts the multi-body interactions of the element names not yet extracted in reference to the multi-body interaction storing means, in which multi-body interactions extracted in advance are stored, and draws the pathway map on the basis of the multi-body interactions extracted. Consequently, the system can extract the multi-body interactions and draw the pathway map very quickly because it does not extract the multi-body interactions of element names redundantly.
Another characteristic of the Literature Information Processing System is that the above dictionary stores the noun phrases and adjective phrases that indicate the interactions between the element names. The Literature Information Processing System can extract vast numbers of precise multi-body interactions because the system can considerably increase vocabulary and expressions stored in the dictionary.
In addition, the Literature Information Processing System also extracts the element names considered to have multi-body interactions with the element names entered by the above input means, and extracts the multi-body interactions of the element names so extracted.
The Literature Information Processing System of this invention has the extraction range specifying means to specify the range over which the above second multi-body interaction extracting means extracts the multi-body interactions on the basis of the element names entered by the above input means.
The Literature Information Processing System can draw a simple pathway map or a detailed pathway map according to need because the system can specify the range of the multi-body interactions to extract on the basis of the element names entered.
The Literature Information Processing System of this invention has the characteristic that the above pathway map drawing means distinguishes between the element names entered by the above input means and the element names that the above second multi-body interactions extracting means extracted from the entered element names.
The Literature Information Processing System can make it easy to understand the pathway maps drawn because the system can discriminate between the element names entered by the input means and the element names extracted from the element names entered using the input means.
The Literature Information Processing System of this invention has the following characteristics: the multi-body interaction categorizing means to categorize the multi-body interactions stored by the above multi-body interaction storing means on the basis of the verbs that indicate the interactions between the above element names, and the reliability assessment means to assess the reliability of the multi-body interactions for every verb on the basis of the multi-body interactions of all the verbs categorized by the above multi-body interaction categorizing means.
The Literature Information Processing System has the characteristic that the above reliability assessment means has the graph drawing means, which treats the above element names as nodes and the connections between the above element names as edges and draws a graph indicating the connections between the nodes and the edges, and a means to assess the reliability on the basis of the graph drawn by the graph drawing means.
The Literature Information Processing System categorizes the multi-body interactions stored by the multi-body interactions storing means on the basis of the verbs that indicate the interactions between the element names, and assesses the reliability of the multi-body interactions of every verb on the basis of the multi-body interactions categorized for every verb. Consequently, the system can draw the pathway map on the basis of multi-body interactions whose reliability is ensured, which increases the reliability of the pathway map.
In the Literature Information Processing System, the literature information also includes Internet information, so the system can extract multi-body interactions and draw pathway maps based on the latest literature information.
The Literature Information Processing System has the characteristic that the above element names are protein names and gene names and it can expeditiously draw the pathway maps that indicate the interactions between the protein/gene names, signaling pathways, and metabolic pathways.
The Literature Information Processing System also has the detection result input means to enter the element name based on the detection result by the DNA microarray analysis device.
The Literature Information Processing System's detection result input means enters the element names obtained as the results of at least two experiments performed with the above DNA microarray analysis device.
The Literature Information Processing System can directly enter the element names based on the detection results of the DNA microarray analysis device, extract the multi-body interactions of the element names entered, and draw a pathway map. In other words, the system can draw the pathway map very quickly on the basis of the detection results of the DNA microarray analysis device. In addition, because the system can enter the element names obtained from two or more experiments at the same time and extract the multi-body interactions of the entered element names simultaneously, it can very quickly draw a pathway map that covers the detection results of several experiments.
The Literature Information Processing System's pathway map drawing means identifies and indicates the element names drawn on the pathway map on the basis of each experiment. The Literature Information Processing System can make it easy to figure out pathway maps because the system identifies and indicates the element names drawn on the pathway map on the basis of each experiment.
The Literature Information Processing System's pathway map drawing means indicates all the element names based on each experiment as element names drawn on the pathway map.
The Literature Information Processing System's pathway map drawing means indicates the intersection of the element names based on each experiment as element names drawn on the pathway map.
The Literature Information Processing System's pathway map drawing means indicates the differences between the element names based on each experiment as the element names drawn on the pathway map.
The Literature Information Processing System makes the detection results indicated on the pathway map easy to understand because the system can change the element names indicated on the pathway map according to need. For example, the system can indicate all the element names based on each experiment, the intersection of the element names based on each experiment, or the differences between the element names based on each experiment as the element names drawn on the pathway map.
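The three display options map naturally onto set operations. In the sketch below, "difference" is interpreted as the names that are not common to all experiments, which is an assumption; the experiment labels and element names are hypothetical.

```python
def names_to_display(experiments, mode):
    """Select the element names to draw, given per-experiment name lists."""
    sets = [set(names) for names in experiments.values()]
    everything = set().union(*sets)
    common = set.intersection(*sets) if sets else set()
    if mode == "all":
        return everything           # every name from every experiment
    if mode == "intersection":
        return common               # names detected in all experiments
    if mode == "difference":
        return everything - common  # names not shared by all experiments
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical results of two experiments.
experiments = {
    "exp1": ["RAF1", "MEK1", "ERK2"],
    "exp2": ["MEK1", "ERK2", "JUN"],
}
shown_all = names_to_display(experiments, "all")
shown_common = names_to_display(experiments, "intersection")
shown_diff = names_to_display(experiments, "difference")
```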
The Literature Information Processing System of this invention has the following characteristics: 1) the multi-body interactions storing means to store the multi-body interactions extracted for each of multiple element names, 2) the input means to enter the element names, 3) the extraction range specifying means to specify the range over which to extract the multi-body interactions on the basis of the element names entered using the above input means, 4) the multi-body interaction extracting means to extract, for each element name entered and in reference to the above multi-body interactions storing means, the multi-body interactions within the range specified by the above extraction range specifying means as well as the multi-body interactions existing between the element names already extracted within that range, and 5) the pathway map drawing means to draw the pathway map on the basis of the multi-body interactions extracted by the above multi-body interactions extracting means.
Because this Literature Information Processing System specifies the extraction range and extracts the multi-body interactions within that range, and in addition extracts the multi-body interactions existing between the element names already extracted, needless element names are excluded without losing necessary information. No new element names need to be extracted when the multi-body interactions between already extracted element names are extracted, so the necessary information can be easily grasped visually from the pathway map. The processing time for extracting the multi-body interactions can also be shortened, and the resources making up the Literature Information Processing System can be reduced. Furthermore, by specifying the extraction range based on, for example, specific element names, attributes that characterize elements, or the kinds of verbs that indicate interactions, the range for extracting the necessary information can be set appropriately.
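The following sketch illustrates the two-part extraction described above: names are collected up to the specified range, and then every stored relation between names already collected is added without pulling in new names. The hop-count range, the stored pairs, and the function name are assumptions for illustration.

```python
def extract_range_and_internal(stored, seeds, max_hops):
    """Collect names within max_hops of the seeds, then keep all stored
    relations whose two ends are both among the collected names."""
    nodes = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        reached = {b for a, b in stored if a in frontier}
        reached |= {a for a, b in stored if b in frontier}
        frontier = reached - nodes
        nodes |= frontier
    # Relations between already extracted names; no new names are added here.
    internal = {(a, b) for a, b in stored if a in nodes and b in nodes}
    return nodes, internal

# Hypothetical stored binary relations.
stored = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}
nodes, edges = extract_range_and_internal(stored, ["A"], 1)
```

With a range of one step from "A", the relation between "B" and "C" is kept even though neither is the entered name, while "D" stays outside the map.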
The Literature Information Processing System of this invention has the following characteristics: 1) the relation pattern storage to store the relation patterns between the element names, 2) the verification means to verify the relationships between element names on pathway maps drawn by the above pathway map drawing means in reference to the relation patterns stored in the above relation pattern storage.
The Literature Information Processing System has the following characteristics: 1) the multi-body interactions storage means to store the multi-body interactions extracted for each multiple element name, 2) the input means to enter element names, 3) the defined condition entering means to enter the defined conditions that limit the range of the pathway map displayed, 4) the multi-body interaction extracting means to extract the multi-body interactions for every multiple element name entered in reference to the multi-body interactions storing means, and 5) the pathway map drawing means to draw pathway maps on the basis of the multi-body interactions extracted by the multi-body interaction extracting means and the defined conditions entered by the above defined condition entering means.
The Literature Information Processing System draws a pathway map on the basis of the defined conditions entered. Consequently, the system reduces the risk that necessary information is buried and hard to identify because a large number of element names are displayed, and makes it easy to find the necessary information accurately in the pathway map drawn.
The Literature Information Processing System also has the specific element name storage to store specific element names that interact with a large number of element names. The above pathway map drawing means changes the display of the multi-body interactions involving the specific element names in reference to the specific element names stored in the above specific element name storage.
The Literature Information Processing System's pathway map drawing means displays the information indicating the relationship of each element name when the multi-body interactions extracted by the above multi-body interaction extracting means includes at least three element names.
The Literature Information Processing System has a supplementary information storage area that stores supplementary information about the above pathway map, and the pathway map drawing means draws the above pathway map in reference to the stored supplementary information.
In the Literature Information Processing System, the above supplementary information includes information indicating predefined element names that are to be displayed in abbreviated form and information indicating the predefined figures used when displaying those predefined element names. When displaying the predefined element names, the pathway map drawing means draws the pathway map using the predefined figures in reference to the supplementary information.
In the Literature Information Processing System, the above supplementary information includes information on material names that have predefined connections with the interactions between the above element names, and the above pathway map drawing means draws the pathway map including the above material names in reference to the above supplementary information.
The Literature Information Processing System has the following characteristics: 1) the literature database to store multiple pieces of literature information, 2) the gene expression information database to store gene expression information, 3) the input means to enter element names, 4) the multi-body interactions extracting means to extract the multi-body interactions for each of the multiple element names entered by the input means in reference to the literature database and the gene expression information database, and 5) the pathway map drawing means to draw the pathway map on the basis of the multi-body interactions extracted by the multi-body interaction extracting means.
The Literature Information Processing System extracts the multi-body interactions in reference to the literature information and the gene expression information and draws the pathway map.
The Literature Information Processing System includes Internet information in the above literature information.
The Literature Information Processing System has the characteristic that the above element names are protein names or gene names.
The Literature Information Processing System evaluates whether the multi-body interactions that are extracted by the multi-body interactions extracting means are direct interactions or not in reference to the supplementary information storage area that stores the supplementary information that indicates the domain structure of the predefined proteins and the collateral relations between the domain structures of each protein in case the above element name is a protein.
The Literature Information Processing System has the following characteristics: 1) the binary relation storage area to store the binary relations extracted for each of multiple protein names and gene names, 2) the input means to enter protein names and gene names, 3) the defined condition input means to enter, as defined conditions, the following binary relations: a) the binary relation indicating that a first protein has a first interaction with a gene transcription factor, which is a gene, b) the binary relation indicating that the above transcription factor has a second interaction with a probe gene, and c) the binary relation indicating that the above probe gene has a third interaction with a second protein, 4) the binary relation extracting means to extract the binary relations for each protein name and gene name entered in reference to the binary relation storage area, and 5) the pathway map drawing means to draw the pathway map on the basis of the binary relations extracted by the binary relation extracting means and the defined conditions entered by the defined condition input means.
The above defined condition input means of the Literature Information Processing System also enters information that limits the verbs describing the binary relations to specific verbs.
The Literature Information Processing System defines the subject-predicate relations of the interactions between protein names and gene names as a condition that limits the pathway map displayed. In addition, as a defined condition, the system accepts information that limits the verbs describing the binary relations to specific verbs. Consequently, the system can draw pathway maps on the basis of the protein names and gene names that satisfy the relations given as defined conditions. Also, by limiting the verbs describing the binary relations (for example, to “bind” or “interact”), the system can show only the defined relations and draw pathway maps that indicate only the necessary information.
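The defined condition can be pictured as a chain pattern over stored binary relations: first protein, then transcription factor, then probe gene, then second protein, optionally restricted to specific verbs. The sketch below is an illustrative assumption; the names, relation data, and function name are hypothetical.

```python
def find_chains(relations, transcription_factors, probe_genes, allowed_verbs=None):
    """Find chains protein1 -> transcription factor -> probe gene -> protein2
    among (subject, verb, object) binary relations."""
    if allowed_verbs is not None:
        # Defined condition: keep only relations described by the specified verbs.
        relations = [r for r in relations if r[1] in allowed_verbs]
    by_subject = {}
    for subj, verb, obj in relations:
        by_subject.setdefault(subj, []).append((verb, obj))
    chains = []
    for p1, _, tf in relations:
        if tf not in transcription_factors:
            continue
        for _, gene in by_subject.get(tf, []):
            if gene not in probe_genes:
                continue
            for _, p2 in by_subject.get(gene, []):
                chains.append((p1, tf, gene, p2))
    return chains

# Hypothetical stored binary relations.
relations = [
    ("P1", "bind", "TF1"),
    ("TF1", "bind", "G1"),
    ("G1", "interact", "P2"),
    ("P3", "inhibit", "TF1"),
]
all_chains = find_chains(relations, {"TF1"}, {"G1"})
limited = find_chains(relations, {"TF1"}, {"G1"}, allowed_verbs={"bind", "interact"})
```

Limiting the verbs removes the chain that starts with the "inhibit" relation, so only the relations matching the defined condition would be drawn.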
The Literature Information Processing System has the following characteristics: 1) the multi-body interactions storage area to store the binary relations that indicate the relationship between two element names and the multi-body interactions that indicate the relationship between three or more element names, 2) the input means to enter element names, 3) the multi-body interaction extracting means to extract the multi-body interactions for each of the multiple element names entered by the input means in reference to the multi-body interaction storage area, 4) the binary relation extracting means to extract, in reference to the multi-body interaction storage area, the binary relations for each element name that has multi-body interactions with the entered element names, and 5) the pathway map drawing means to draw the pathway map on the basis of the extracted multi-body interactions and the extracted binary relations.
The Literature Information Processing System's multi-body interaction extracting means extracts the multi-body interactions that indicate the relationship between 3, 4, 5, or 6 element names as the multi-body interactions.
The Literature Information Processing System extracts the multi-body interactions that indicate the relationship between three or more element names, and then extracts the binary relations for each element name that has one of the extracted multi-body interactions, in order to draw the pathway map. The number of element names that have multi-body interactions among three or more element names is generally smaller than the number of element names that have binary relations. For this reason, the element names that have multi-body interactions among three or more element names are extracted first, and the binary relations for the extracted element names are extracted next, so that a restricted set of objects can be analyzed comprehensively. In addition, by extracting the multi-body interactions that indicate the relationship between 3, 4, 5, or 6 element names, an appropriately sized range of element names can be analyzed.
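The two-stage extraction described above can be sketched as follows. The data layout (tuples of element names) and the function names are assumptions made for illustration; the specification does not state how the storage area is organized.

```python
def select_multibody(multibody, entered, min_size=3, max_size=6):
    """Keep multi-body interactions of 3 to 6 element names that
    involve at least one of the entered element names."""
    entered = set(entered)
    return [m for m in multibody
            if min_size <= len(m) <= max_size and entered & set(m)]


def binary_for_members(binary, selected):
    """Keep binary relations that touch any element name appearing
    in the selected multi-body interactions."""
    members = set().union(*selected)
    return [(a, b) for a, b in binary if a in members or b in members]


# Placeholder element names, for illustration only.
multibody = [("A", "B", "C"), ("D", "E"), ("A", "F", "G", "H", "I", "J", "K")]
binary = [("B", "X"), ("Y", "Z"), ("C", "A")]

selected = select_multibody(multibody, ["A"])
related = binary_for_members(binary, selected)
```

Restricting the first stage to small multi-body interactions keeps the set of element names passed to the second stage limited, which is the effect the paragraph above describes.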
And below, we will explain the Biomedical Literature Information Processing System of the implementation of this invention in reference to the drawings.
Data Control Unit 10 is connected to Literature Database (DB) 14, Dictionary 16, Data Storage Unit 18, and Binary Relation Storage Unit (also Multiple Relation Storage Unit) 19. Literature DB14 stores the literature information of the MEDLINE database, a public database of biomedical literature information.
Dictionary 16 stores protein names and gene names (including their abbreviations), as well as noun phrases, adjective phrases, and other expressions that function similarly to verbs. For protein names, both the official names and their synonyms are stored, because protein names have a large number of synonyms and the style of expression differs depending on the authors of the articles. The variations of synonyms are: 1) variations in abbreviation and in upper or lower case, 2) synonyms whose names indicate roles (even when the same function is explained, it may be expressed in various ways), and 3) synonyms that include prepositions and conjunctions (in which the modification relations are more complicated).
The official names of genes and their synonyms are stored in the same way, together with the verbs that indicate interactions between proteins and genes. Noun phrases, adjective phrases, and similar expressions that carry the meaning of such verbs are also stored. These terms and phrases are stored in Dictionary 16 (the terms are collected by analyzing, manually or by computer, the literature information stored in public databases). Data Storage Unit 18 stores the element names (protein names, gene names, etc.) entered from Input Unit 12 and the element names (protein names, gene names, etc.) of the experimental results transmitted from DNA microarray analysis device 26. Binary Relation Storage Unit 19 stores the data of the binary relations extracted by this Biomedical Literature Information Processing System.
Data Control Unit 10 is connected to Display Unit 20 and Print Unit 22. Display Unit 20 displays the entry screens for entering element names and binary relations, and the pathway maps that are drawn. Print Unit 22 prints the pathway maps that are drawn.
Additionally, Data Control Unit 10 is connected to Communication Control Unit 24, and receives the information on element names or probe names based on the detection result of DNA microarray analysis device 26. Communication Control Unit 24 functions as a detection result input unit.
At this time, DNA microarray 40 is set on scanning XY stage 46 and moved in the X and Y directions. DNA microarray 40 is thereby scanned in the X and Y directions by the laser emitted from Laser Light Source 30, and conversion element 44 outputs an electrical signal in response to the laser irradiation. Process Device 48 performs A/D conversion on the electrical signal from conversion element 44 and acquires it as scanning image data.
The scanning image data obtained in this way is first saved to Data Storage Unit 50 in a general-purpose image format such as bitmap, then read out by dedicated analysis software, and the data is processed according to the user's request to identify the expressed probes; here, probes are fragments of DNA. We can then acquire a probe ID, which is an identifier of a DNA fragment (the part of the DNA microarray where the generated DNA is located), the name of the generated DNA, and analysis data such as the names of proteins that interact with the generated DNA. These analyzed data are stored in Data Storage Unit 50, and transferred to Data Control Unit 10 via Communication Control Unit 52 and Communication Control Unit 24.
Next, we would like to explain using the microarray experimental data, supposing it is performed by DNA microarray analysis device 26.
In the experiment, four female rats were first given feed containing soybeans and alfalfa (which contains genistein).
Next, on the day of ovulation, the female rats were mated with a male rat; this day is counted as day 0. After mating, the feed of two of the four rats was changed to feed that did not contain soybeans or alfalfa.
Next, on the 11th day after fertilization (GD11), of the two rats fed with soybeans, one was given 17α-estradiol dissolved in peanut oil once a day, and the other was given peanut oil only as a control. Of the other two rats, whose feed did not contain soybeans, one was given feed with genistein dissolved in DMSO once a day, and the other was given only DMSO as a control.
Next, on the 20th day after fertilization, the ovaries and uteri of the rat fetuses were removed to extract RNA, and microarray analysis was performed using the Affymetrix Rat Genome U34A chip.
Supposing the result of this microarray analysis is obtained in our system, the result of this microarray analysis should be transmitted to Data Control Unit 10 of the Biomedical Literature Information Processing System via Communication Control Unit 52 of DNA microarray analysis device 26, and stored in Data Storage Unit 18.
A microarray analysis device for analyzing ordinary gene expression recognizes the probe partitions in its image scanning device to calculate the signal intensity, and subtracts the background signal intensity, including noise, to obtain the signal. Furthermore, the device fits a statistical model of probe expression to find outliers, and determines the averaging method used to obtain a reliable estimate. For the Affymetrix example, the protocol for handling the data is described at: http:www.affymetrix.com/support/technical/technotes/statica 1_reference_guide.pdf
To compare two different microarray experiments, we scale the results of the experiments against each other on the assumption that the total amount of RNA is constant, for example by monitoring the expression of housekeeping genes, that is, genes whose expression is necessary to maintain the fundamental function or structure of a cell and is therefore considered to be constant. The expression values of all genes are multiplied by a factor that keeps the values of the housekeeping genes constant across the experiments, which reduces the influence of differing experimental conditions on the expression values. The difference in expression values is usually called the fold change, because the change in expression between experiments is relative and is therefore expressed as a ratio. From the fold change obtained by microarray analysis, we can recognize whether a gene is up-regulated, down-regulated, or unchanged. To do so, we must choose a threshold value by which we decide whether a fold change is caused by noise or not. If the fold change of the expression of a probe is above (below) a certain threshold value, we recognize that the gene represented by the probe is up-regulated (down-regulated) and that the change is meaningful, not just experimental noise. In practice, stating a threshold change without indicating whether it refers to up-regulation or down-regulation sometimes causes misunderstanding. Therefore, whether a gene is up-regulated, down-regulated, or unchanged must be examined with established statistical methods such as the t-test or ANOVA. The details are documented in “Guide to Analysis of DNA Microarray Data” by Steen Knudsen (John Wiley & Sons, 2002).
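The scaling and thresholding described above can be illustrated with a small sketch. The function names, the default threshold of 2.0, and the numeric values are assumptions chosen for illustration; the sketch scales by the mean of the housekeeping genes and omits the statistical tests (t-test, ANOVA) that the text recommends.

```python
def scale_factor(reference, sample, housekeeping):
    """Factor that makes the mean housekeeping expression of `sample`
    equal to that of `reference`."""
    ref_mean = sum(reference[g] for g in housekeeping) / len(housekeeping)
    smp_mean = sum(sample[g] for g in housekeeping) / len(housekeeping)
    return ref_mean / smp_mean


def classify(reference, sample, housekeeping, threshold=2.0):
    """Label each gene 'up', 'down' or 'unchanged' from its fold change."""
    k = scale_factor(reference, sample, housekeeping)
    calls = {}
    for gene, ref_value in reference.items():
        fold = (sample[gene] * k) / ref_value  # scaled sample / reference
        if fold >= threshold:
            calls[gene] = "up"
        elif fold <= 1.0 / threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Made-up expression values, for illustration only.
reference = {"HK": 100.0, "G1": 50.0, "G2": 80.0, "G3": 40.0}
sample = {"HK": 200.0, "G1": 400.0, "G2": 40.0, "G3": 80.0}
calls = classify(reference, sample, housekeeping=["HK"])
```

Using a symmetric cutoff (`threshold` for up-regulation, `1 / threshold` for down-regulation) makes explicit which direction a reported change refers to.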
The result of a microarray analysis is thus a set of up-regulated genes or a set of down-regulated genes. To compare data across many experiments, clustering is used, such as hierarchical clustering, which assumes a virtual distance between genes and categorizes the genes accordingly. For example,
In most experiments, disturbances such as heat, stream, stress, medicine, or chemical reaction are applied, and we observe the differences between the static states and the transient or perturbed states of normal cells and of disease sample cells (or cells of knockout mice). Thus, microarray data fall into four types: 1) static-normal, 2) static-disease, 3) perturbed-normal, and 4) perturbed-disease.
In a different type of microarray, called a genome array, variants of DNA sequence such as human SNPs (Single Nucleotide Polymorphisms) are detected by DNA probes that consist of aligned fragments of genome sequence. With this microarray we can detect changes in the copy numbers of genes. We can estimate the contribution of the copy number change to gene expression, subtract that value from the expression value obtained in a gene expression microarray experiment, and thereby evaluate the net expression values of genes, which leads to network analysis of gene expression using this information. In these analyses, it is expected that a DNA region that should normally have a function may lose it when a portion of the region containing genes or promoter regions is removed, or, conversely, that a DNA region may gain an additional function when a portion of DNA is added to the original region. This invention makes it easy to identify the parts responsible for the change in gene function, by comparing the pathway obtained by this invention from the gene expression results of a normal sample with the pathway obtained from the gene expression results of samples with specific DNA movements.
Analyzing all probe data from the experiments directly takes much time, and when the purpose of the analysis is not clear, misunderstandings may arise that lead to severe errors. To avoid this, in this invention we plot the results of two expression experiments that cluster near each other on the vertical and horizontal axes, compare the variation of the expression values at each position on the genome using the hypergeometric distribution, and use the EIM method (literature: Kano et al., Physiol. Genomics 10, 1152(2003)), which classifies regions of the genome according to the level of the expression values.
The results of experiment 1-3 are transmitted from the DNA microarray analysis device to the Biomedical Literature Information Processing System, and entered into the system via Communication Control Unit 24. In addition, the result of experiment 1-3 can be entered with Input Unit 12.
In the Display Unit, the user interface (not shown) is composed of the following parts: 1) a part for selecting data from a view that shows the location of the data, 2) a part that indicates the date, medical status, conditions, and organism species of the experimental data, 3) a part that indicates the relation between groups of probe IDs and expression values, and 4) a part that indicates thresholds and displays the up-regulated and down-regulated genes, as well as the lists of genes common and not common to different experimental data.
In addition,
Data Control Unit 10 of the Biomedical Literature Information Processing System stores the results of experiments 1-3 received from Communication Control Unit 24 in Data Storage Unit 18 (Step S10). The results of experiments 1-3 are groups of gene names selected by setting a threshold on the gene expression level, as discussed previously.
Next, we extract mutual binary relations of gene names and protein names in reference to Dictionary 16 and Literature DB14 for the gene names indicated in the result of experiment 1 (step S11). That is, we extract the binary relations between gene names and protein names indicated as “noun A (gene name)”, “verb”, and “noun B (gene name)” using natural language processing for the first name of gene names shown in the result of experiment 1.
Then, for each “noun B (gene name)” extracted as having a binary relation with “noun A (gene name)”, we also extract the mutual binary relations of gene names and protein names indicated as “noun B (gene name)”, “verb”, and “noun C (gene name)”. That is, we extract the binary relations of each gene name that was itself extracted as having a binary relation with a gene name input as an experimental result. This extraction or search of binary relations is performed within a predetermined range (a predetermined number of hierarchy levels), for example up to the third hierarchy level from the entered gene name, or until gene names that are directly involved in functions have been extracted.
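The level-limited extraction described above can be sketched as a breadth-first search. The `lookup` function is a hypothetical stand-in for the natural language processing step that consults Dictionary 16 and Literature DB14; here it simply reads a small hand-written table with placeholder names.

```python
from collections import deque


def extract_relations(start_names, lookup, max_depth=3):
    """Collect (noun A, verb, noun B) relations level by level,
    starting from the entered names and stopping after max_depth levels."""
    seen = set(start_names)
    queue = deque((name, 1) for name in start_names)
    found = []
    while queue:
        name, depth = queue.popleft()
        for verb, partner in lookup(name):
            found.append((name, verb, partner))
            # Expand the partner only if the level limit allows it.
            if partner not in seen and depth < max_depth:
                seen.add(partner)
                queue.append((partner, depth + 1))
    return found


# Toy stand-in for the literature search, for illustration only.
toy_table = {
    "A": [("bind", "B")],
    "B": [("inhibit", "C")],
    "C": [("activate", "D")],
    "D": [("bind", "E")],
}


def toy_lookup(name):
    return toy_table.get(name, [])


relations = extract_relations(["A"], toy_lookup, max_depth=3)
```

With `max_depth=3` the relation starting from "D" is never searched, which corresponds to stopping at the third hierarchy level from the entered name.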
The extracted binary relations are stored in Binary Relation Storage Unit 19 (Step S12). Next, the system evaluates whether the extractions of binary relations for all the gene names shown on the result of experiment 1 are finished or not (Step S13). In case that the extractions are decided not to be finished, the system goes back to Step S11 to extract binary relations of next gene names.
In Step S13, if the extractions of the binary relations for all the gene names shown on the result of experiment 1 are deemed to be finished, we extract the binary relations of gene names shown on the result of experiment 2 in reference to Dictionary 16 and Literature DB14 (Step S14) to store the extracted binary relations in Binary Relation Storage Unit 19 (Step S15). Here, the process of extracting binary relations in Step S14 is the same as the process of extracting binary relations in Step S11.
If the extractions of the binary relations for all the gene names shown on the result of experiment 2 are finished (Step S16), we extract the mutual binary relations of gene/protein names shown on the result of experiment 3 in reference to Dictionary 16 and Literature DB14 (Step S17) to store the extracted binary relations in Binary Relation Storage Unit 19 (Step S18). Here, the process of extracting binary relations in Step S17 is the same as the process of extracting binary relations in Step S11.
If the extraction or search of the binary relations for all the gene names appearing in the result of experiment 3 is finished (Step S19), we detect the overlapping parts among the binary relations stored in Binary Relation Storage Unit 19 (Step S20). That is, some of the binary relations extracted for the gene names shown in the results of the experiments are counted redundantly, because the experimental results include the same gene names. Consequently, when overlapping parts are found, they are removed, and the pathway map is drawn with each set of overlapped binary relations treated as one unit of information (Step S21).
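One way to treat overlapping binary relations as a single unit of information is to key them by their content and record which experiments produced them. This is a minimal sketch under that assumption; the function name and the experiment labels are hypothetical.

```python
def merge_relations(per_experiment):
    """Map each distinct relation to the list of experiments it came from,
    so that a relation found in several experiments is stored only once."""
    merged = {}
    for experiment, relations in per_experiment.items():
        for relation in relations:
            sources = merged.setdefault(relation, [])
            if experiment not in sources:
                sources.append(experiment)
    return merged


# Placeholder relations, for illustration only.
per_experiment = {
    "experiment 1": [("A", "bind", "B"), ("B", "inhibit", "C")],
    "experiment 2": [("A", "bind", "B")],
    "experiment 3": [("C", "activate", "D"), ("C", "activate", "D")],
}
merged = merge_relations(per_experiment)
```

Keeping the list of source experiments alongside each relation also makes it possible to show later which experiment each part of the pathway map came from.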
Here we explain how effective our data analysis on the microarray analysis: assuming that we have probe information of two up-regulated gene lists for microarray, and considering the case where in drawing interaction relationships with simple method. For probe ‘a’, for example, the interaction relations between probe ‘a’ and proteins are searched just one time, the interaction relations between the proteins of probe a and other proteins (the first interaction around probe ‘a’) will be g-h, g-c-a, and g-b-a as shown on
(1) When pathways are combined, the union of the different pathways is always taken. (2) Sets of pathways are stored in advance as pathway templates, so that when one gene (or protein) or one interaction is obtained, a group of sequential pathways can be generated automatically. (3) The search is performed recursively: the partner proteins (or genes) obtained as search results for the previous input proteins (or genes) are used as the next input set. In this way the region of intersection of the networks for different input sets of probes (or proteins) increases. Our system can repeat this recursive network generation many times. However, in a real implementation the recursively generated network becomes too large if the recursion is repeated too many times, so restrictions are needed on the region or on the number of recursive searches. Multiple counts in the intersection can be removed by a graph-theoretical homology search over at least two networks, identifying nodes by their names. (4) Further branches from the nodes of the protein pathways are predicted stochastically and statistically by generating networks with the Monte Carlo method or a Bayesian network. (5) The pathways for proteins (or genes) are predicted statistically using their motif patterns in the database. Using methods (1) to (5) and their combinations, our system can generate the possible network for the nodes in the restricted region, and can provide portions of the possible network in response to user input or an instruction from outside the system.
In addition to the previous information, supplementary information (for example, the gene names on which 17α-estradiol acts or its modes of action, the gene names on which genistein acts or its modes of action, etc.) is input using Input Unit 12 to draw a pathway map.
A pathway map is drawn using the supplementary information entered by Input Unit 12 and the binary relations stored in Binary Relation Storage Unit 19. First, 17α-estradiol and the gene names on which 17α-estradiol acts are represented as nodes, and 17α-estradiol is linked to those gene names by edges. Next, the gene names of the interaction partners derived by the system as having binary relations with the gene names on which 17α-estradiol acts are represented as nodes, and are linked by edges to the gene names on which 17α-estradiol acts.
Likewise, genistein and the gene names on which genistein acts are represented as nodes, and genistein is linked to those gene names by edges. Next, the gene names of the interaction partners derived by the system as having binary relations with the gene names on which genistein acts are represented as nodes, and are linked by edges to the gene names on which genistein acts. Here, the shape of an edge that connects one gene name to another is determined by the interaction verb that indicates the interaction between the genes. The attribute of the edge corresponding to “bind” is defined as “-”, the attribute of the edge corresponding to “inhibit” is defined as “⊥”, and the attributes of the edges corresponding to other verbs are defined as “→”. Consequently, by using edges with these defined attributes, gene names are linked on the basis of the verbs in their binary relations. As just described, with gene names regarded as nodes, pathway maps of all the binary relations stored in Binary Relation Storage Unit 19 are drawn by linking each gene name by edges to the gene names with which it has binary relations.
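The mapping from interaction verbs to edge attributes can be written as a small lookup table. This is a minimal text-only sketch; the function names and the placeholder gene names are assumptions, and an actual implementation would pass the attribute to a graph drawing component rather than build strings.

```python
# Edge attributes as defined in the text: "bind" and "inhibit" have their
# own symbols, every other verb falls back to an arrow.
EDGE_STYLE = {"bind": "-", "inhibit": "⊥"}


def edge_symbol(verb):
    """Return the edge attribute for an interaction verb."""
    return EDGE_STYLE.get(verb, "→")


def draw_edges(relations):
    """Render each (node, verb, node) relation as 'node <symbol> node'."""
    return [f"{a} {edge_symbol(verb)} {b}" for a, verb, b in relations]


# Placeholder names, for illustration only.
lines = draw_edges([
    ("geneA", "bind", "geneB"),
    ("geneB", "inhibit", "geneC"),
    ("geneC", "activate", "geneD"),
])
```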
Furthermore, we can select gene names for drawing in a pathway map from gene names stored in Data Storage Unit 18 in Biomedical Literature Information Processing System concerning this embodiment. Consequently, the system can display as follows: 1) all the gene names based on each experiment as gene names drawn on a pathway map, 2) intersections of element names based on each experiment as gene names drawn on a pathway map, and 3) differences (exclusive OR) of element names based on each experiment as gene names drawn on a pathway map. That is, the system can draw pathway maps shown on
The system can distinguish and show the element names input from Input Unit 12 or from DNA microarray analysis device 26 via Communication Control Unit 24, and the element names of interaction partners having binary relations with them that are derived by the system. For example, on
The Biomedical Literature Information Processing System according to the first embodiment extracts binary relations in reference to Dictionary 16 and Literature DB14 for each of the plural element names entered, and draws a pathway map on the basis of the extracted binary relations. That is, the system can extract binary relations and draw pathway maps for each of the plural element names in parallel. Consequently, the system can extract binary relations and draw pathway maps for each of the plural element names entered very quickly. That is, the system can draw pathways of interactions between protein names and gene names, signaling pathways, and metabolic pathways very quickly.
The Biomedical Literature Information Processing System concerning this embodiment can draw either a simple pathway map or a detailed pathway map, according to need, because the system can specify the extraction range of binary relations based on element names entered.
The Biomedical Literature Information Processing System concerning this embodiment can make it easy to understand pathway maps drawn, because the system can discriminate the element names entered by the input means, and element names extracted from the element names entered by the input means, to show them on pathway maps.
The Biomedical Literature Information Processing System concerning this embodiment can extract binary relations and draw pathway maps based on the latest literature information, because the literature information includes Internet information.
And the Biomedical Literature Information Processing System concerning this embodiment can directly enter the element name based on the detection result of DNA microarray analysis device 26, extract binary relations of entered element names, and draw pathway maps. That is, the system can draw pathways on the basis of detection results very quickly, because the system can enter element names obtained by more than two experiments at the same time, and extract binary relations of entered element names to draw pathway maps in parallel.
The Biomedical Literature Information Processing System concerning this embodiment makes it easy to figure out pathway maps, because the system identifies and indicates the element name drawn on the pathway map based on each experiment. Furthermore, the system can make it easy to understand analysis results on pathway maps, because the system can change element names shown on pathway maps according to need (for example: 1) displaying all gene names based on each experiment as those drawn on pathway maps, 2) displaying intersections of gene names based on each experiment as those drawn on pathway maps, and 3) displaying differences of gene names based on each experiment as those drawn on pathway maps, etc.).
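The three display modes listed above correspond to set operations on the gene names of the experiments. The sketch below handles two experiments; the mode labels and the function name are hypothetical, and how the exclusive OR should be generalized to three experiments is not stated in the text, so it is left out.

```python
def select_names(names_1, names_2, mode):
    """Choose which gene names to draw for two experiments."""
    a, b = set(names_1), set(names_2)
    if mode == "all":
        return a | b   # every gene name from either experiment
    if mode == "common":
        return a & b   # intersection of the experiments
    if mode == "difference":
        return a ^ b   # exclusive OR: names found in only one experiment
    raise ValueError(f"unknown mode: {mode}")


# Placeholder gene names, for illustration only.
exp_1 = ["g1", "g2", "g3"]
exp_2 = ["g2", "g3", "g4"]
shown_all = select_names(exp_1, exp_2, "all")
shown_common = select_names(exp_1, exp_2, "common")
shown_difference = select_names(exp_1, exp_2, "difference")
```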
In addition, in the above embodiment, we can display binary relations stored in Binary Relation Storage Unit 19 before we draw pathway maps.
In the above embodiment, after obtaining results of experiment-1, experiment-2 and experiment-3, we can adjust the threshold values for selecting protein names and gene names that are used for pathway map drawing, and may draw pathway maps using selected gene and protein names on the basis of this adjusted threshold value. Here, the threshold value is determined by the degree of gene expressions, and defines the threshold for selecting genes. That is, as shown in
For the gene names shown in the result of experiment 1, the system extracts binary relations of gene/protein names in reference to Dictionary 16 and Literature DB14 (Step S212). The system stores the extracted binary relations in Binary Relation Storage Unit 19 (Step S213). The system then evaluates whether the extraction of binary relations is finished for all of the gene names extracted from the result of experiment 1 (Step S214). If the extraction is not finished, the system goes back to Step S212 to extract the binary relations of the next gene name. Because the process of Step S212-S214 is the same as that of Step S11-S13 (
For the gene names shown in the result of experiment 2, the system extracts binary relations of gene/protein names in reference to Dictionary 16 and Literature DB14 (Step S215). The system stores the extracted binary relations in Binary Relation Storage Unit 19 (Step S216). The system then evaluates whether the extraction of binary relations is finished for all of the gene names shown in the result of experiment 2 (Step S217). If the extraction is not finished, the system goes back to Step S215 to extract the binary relations of the next gene name. Because the process of Step S215-S217 is the same as that of Step S14-S16 (
For the gene names shown in the result of experiment 3, the system extracts binary relations of gene/protein names in reference to Dictionary 16 and Literature DB14 (Step S218). The system stores the extracted binary relations in Binary Relation Storage Unit 19 (Step S219). The system then evaluates whether the extraction of binary relations is finished for all of the gene names shown in the result of experiment 3 (Step S220). If the extraction is not finished, the system goes back to Step S218 to extract the binary relations of the next gene name. Because the process of Step S218-S220 is the same as that of Step S17-S19 (
In cases where: 1) the binary relations of all gene names extracted from the result of experiment 1 are deemed to be finished on Step S214, 2) the binary relations of all gene names shown on the result of experiment 2 are deemed to be finished on Step S217, and 3) the binary relations of all gene names shown on the result of experiment 3 are deemed to be finished on Step S220, the overlapping parts of binary relations stored in Binary Relation Storage Unit 19 are extracted (Step S221). When overlapping parts are extracted, the pathway map is drawn regarding the overlapped binary relations as a reference (Step S222). Because the process of Step S221-S222 is the same as that of Step S21-S22 (
Next, we evaluate whether the drawn pathway map is appropriate or not (Step S223). Here, the pathway map is evaluated either by the Data Control Unit of this Biomedical Literature Information Processing System or by the user of the system who intends to display the pathway map. The gene names shown by the result of experiment 1 are in many cases displayed close to one another on the pathway map. Therefore, if one of the gene names shown by the result of experiment 1 appears among those shown by the results of other experiments, the pathway map may not be appropriate and needs to be modified (Step S224). In that case, the system goes back to Step S211 to adjust the threshold values and geometrical threshold values, then draws a pathway map and evaluates it again (Step S211-Step S224). As just described, by adjusting the threshold value used to extract the gene names for drawing pathway maps, the system can appropriately determine whether gene expression is increasing or not, and can draw pathway maps that include the information analysts need.
In addition, for drawing a pathway map interpreted in
That is, as shown in
In cases where: 1) the extraction of the binary relations of all gene names shown in the result of experiment 1 is deemed to be finished in Step S314, 2) the extraction of the binary relations of all gene names shown in the result of experiment 2 is deemed to be finished in Step S318, and 3) the extraction of the binary relations of all gene names shown in the result of experiment 3 is deemed to be finished in Step S322, the overlapping parts of the binary relations stored in Binary Relation Storage Unit 19 are extracted (Step S323). When the overlapping parts are extracted, the pathway map is drawn with the overlapped binary relations as a reference (Step S324). The detailed explanation of the process is omitted because the process of Step S323-S324 is the same as that of Step S221-S222 on
Then, we will estimate whether the drawn pathway is appropriate or not (Step S325). If the pathway map needs to be modified, we go back to Step S311, Step S315, and Step S319 to adjust the configured threshold values of each experiment. Then we can draw a pathway map and evaluate it again. As just described, we can discriminate whether a gene is increased in expression for each experiment or not, and can draw a more appropriate pathway map by adjusting threshold values and geometrical threshold values for each experiment to extract gene names used for drawing pathway maps.
Now, we will explain the second embodiment. In the first embodiment, after extracting binary relations of gene names shown on each experiment result, we extract the overlapping parts of the binary relations and draw pathway maps, regarding the overlapping parts as a reference. In the second embodiment, we discriminate whether extractions of binary relations of gene names shown on each experimental result are finished or not. Then we extract the binary relations of the gene names whose binary relations were not extracted to draw pathway maps.
Data Control Unit 10 of the Biomedical Literature Information Processing System stores the results of experiment 1-3 obtained via Communication Control Unit 24 in Data Storage Unit 18 (Step S30). Then, we evaluate whether the extractions of the binary relations of the gene names shown on the result of experiment 1 are finished or not (Step S31). Consequently, we evaluate whether the binary relation is extracted and stored in Binary Relation Storage Unit 19 or not for the first gene name in the gene names shown on the result of experiment 1.
In Step S31, if the extraction of the binary relations is deemed not to be unfinished, we extract the binary relations of gene/protein names in reference to Dictionary 16 and Literature DB14 (Step S32) to store the extracted binary relations in Binary Relation Storage Unit 19 (Step S33). In addition, the extractions of binary relations in Step S32 and storage the binary relations in Step S33 are the same as Step S11 and S12 of the first embodiment.
In Step S32, if the extraction of the binary relations is deemed to be finished, we go to Step S34 and evaluate whether the extraction of binary relations of all the gene names shown in the result of experiment 1 are finished and stored in Binary Relation Storage Unit 19 or not. Here, in case where gene names whose binary relations are not extracted and should be extracted, we go back to Step S34 and extract the binary relations of the rest of the gene names.
In Step S34, if the extraction of binary relations of all the gene names that should be extracted in the result of experiment 1 are deemed to be finished, we evaluate whether the extraction of binary relations of the gene names shown in the result of experiment 2 are finished or not (Step S35), and extract the binary relation of gene/protein names for the gene names whose binary relations are not extracted in reference to Dictionary 16 and Literature DB14, then store the extracted binary relations in Binary Relation Storage Unit 19 (Step S37). Here, the process of extracting binary relations in Step S36 is the same as that in Step S32.
If the extractions of binary relations for all the gene names that should be extracted in the result of experiment 2 are finished (Step S38), we evaluate whether the extractions of binary relations for all the gene names shown in the result of experiment 3 are finished or not (Step S39), extract the binary relations of gene/protein names in reference to Dictionary 16 and Literature DB14 (Step S40), and then store the extracted binary relations in Binary Relation Storage Unit 19 (Step S41). Here, the process of extracting binary relations in Step S40 is the same as that in Step S32.
If the extractions of binary relations for all the gene names that should be extracted in the result of experiment 3 are finished (Step S42), we draw pathway maps of binary relations stored in Binary Relation Storage Unit 19 (Step S43).
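The flow of Steps S30 to S43 amounts to extracting binary relations only for gene names that have no entry yet in Binary Relation Storage Unit 19. The following is a minimal sketch in Python under stated assumptions: the storage unit is modeled as a dictionary keyed by gene name, and `extract_relations` is a hypothetical stand-in for the extraction in reference to Dictionary 16 and Literature DB14. Neither the names nor the toy data come from the embodiment.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed representation: (element name, verb, partner element name).
Relation = Tuple[str, str, str]


def extract_missing(
    experiments: Iterable[List[str]],
    store: Dict[str, List[Relation]],
    extract_relations: Callable[[str], List[Relation]],
) -> int:
    """Fill `store` for every gene name not yet present.

    Returns the number of extraction calls actually made.
    """
    calls = 0
    for gene_names in experiments:  # results of experiments 1, 2, 3 in order
        for gene in gene_names:
            if gene in store:  # already extracted and stored: skip
                continue
            store[gene] = extract_relations(gene)  # extract, then store
            calls += 1
    return calls


# Hypothetical toy data standing in for the literature-based extraction.
toy_literature = {
    "geneA": [("geneA", "bind", "geneB")],
    "geneB": [("geneB", "inhibit", "geneC")],
    "geneC": [],
}
store: Dict[str, List[Relation]] = {}
calls = extract_missing(
    [["geneA", "geneB"], ["geneB", "geneC"], ["geneA"]],
    store,
    lambda gene: toy_literature.get(gene, []),
)
```

In this toy run, five gene names are entered across the three experiments, but the extraction is called only once per distinct gene name, which is the redundancy the embodiment avoids.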
In addition, in the Biomedical Literature Information Processing System concerning this embodiment, we can select gene names to draw on pathway maps from the gene names stored in Data Storage Unit 18. That is, we can draw pathway map to show on
In addition, the system can distinguish the element names input from DNA microarray analysis device 26 via Communication Control Unit 24 from the element names that the system derives as interaction partners having binary relations with those entered gene names. Furthermore, if gene names based on more than two experimental results are entered via Communication Control Unit 24 from DNA microarray analysis device 26, the system can display the gene names on pathway maps so that they are distinguished for every experiment.
The Biomedical Literature Information Processing System concerning the second embodiment evaluates whether the extractions of binary relations for each of plural element names entered are finished or not, then extracts the binary relations of the element names whose binary relations are not extracted in reference to Dictionary 16 and Literature DB14, and draws the pathway maps on the basis of extracted binary relations. Consequently, the system can extract binary relations and draw pathway maps very quickly for each of entered plural element names because the system doesn't redundantly extract binary relations of element names. That is, the system can draw pathway maps that show interactions between protein/gene names, signaling pathways, and metabolic pathways very quickly.
In addition, the Biomedical Literature Information Processing System concerning this embodiment can draw simple pathway maps or detailed pathway maps because the system can decide the range of extracting binary relations on the basis of entered element names.
In addition, the Biomedical Literature Information Processing System concerning this embodiment can make it easy to understand the difference between the element names that were input and the element names derived by the system, by using different styles on the drawn pathway maps, because the system can distinguish element names entered by the input means from element names of interaction or relation partners that the system derives as having binary relations with the entered element names, and display those element names on pathway maps.
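One way to picture this distinction is to tag every node of the pathway map with its origin and with the experiments in which it was entered. The sketch below is illustrative only: the function name `classify_nodes`, the dictionary layout, and the toy gene names are assumptions, and the actual drawing styles are not specified in this description.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed representation: (element name, verb, partner element name).
Relation = Tuple[str, str, str]


def classify_nodes(
    inputs_by_experiment: Dict[int, Set[str]],
    relations: Iterable[Relation],
) -> Dict[str, Dict[str, object]]:
    """Tag each node as 'input' or 'derived' and list its experiments."""
    nodes: Dict[str, Dict[str, object]] = {}
    for source, _verb, target in relations:
        for name in (source, target):
            experiments = sorted(
                exp for exp, names in inputs_by_experiment.items() if name in names
            )
            nodes[name] = {
                "origin": "input" if experiments else "derived",
                "experiments": experiments,
            }
    return nodes


# Hypothetical input: gene names entered for experiments 1 and 2.
inputs = {1: {"geneA"}, 2: {"geneA", "geneB"}}
relations = [("geneA", "bind", "geneX"), ("geneB", "inhibit", "geneA")]
styles = classify_nodes(inputs, relations)
```

A drawing routine could then map `origin` and `experiments` to colors or shapes; that mapping is left open here.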
In addition, the Biomedical Literature Information Processing System concerning this embodiment can extract binary relations and draw pathway maps on the basis of the latest literature information because the literature information includes Internet information.
Moreover, the Biomedical Literature Information Processing System concerning this embodiment can directly input element names based on the detection result of DNA microarray analysis device, and extract the binary relations of the entered element names, and draw pathway maps. In addition, the system can enter the element names obtained from more than two experiments at one time and extract the binary relations of entered element names in parallel, then draw pathway maps. Consequently, the system can draw pathway maps based on the detection results of DNA microarray analysis device very quickly.
In addition, the Biomedical Literature Information Processing System concerning this embodiment can make it easy to understand pathway maps because the system discriminates and displays element names to draw on pathway maps on the basis of each experiment. Furthermore, the system can make it easy to understand analysis results because the system can change element names shown on pathway maps according to the instruction by the user.
In addition, in the Biomedical Literature Information Processing System concerning the second embodiment, we may adjust a threshold value used to select protein/gene names for drawing pathway maps and, after obtaining the results of experiments 1-3, draw pathway maps using the protein/gene names selected on the basis of this adjusted threshold value. We may also adjust a threshold value and select protein/gene names for each experiment on the basis of this adjusted threshold value, and then draw pathway maps with the selected protein/gene names.
Next, we will explain the third embodiment. In the above first embodiment, we consult Dictionary DB and Literature DB when extracting binary relations of gene names shown in each experimental result. In the third embodiment, however, we consult only Literature DB when extracting binary relations of gene names shown in each experiment. Consequently, the system architecture of the Biomedical Literature Information Processing System concerning the third embodiment is that of the first embodiment with the Dictionary removed.
The extracted binary relations are stored in Binary Relation Storage Unit 19 (Step S52). Next, we evaluate whether the extractions of binary relations are finished or not for all the gene names shown in the result of experiment 1 (Step S53). If not all the extractions are finished, we go back to Step S51 to extract the binary relations of the next gene names.
In Step S53, if the extraction of binary relations of all the gene names shown in the result of experiment 1 is deemed to be finished, we extract the mutual binary relations of gene/protein names for the gene names shown in the result of experiment 2 in reference to Literature DB14 using natural language processing (Step S54), and store the extracted binary relations in Binary Relation Storage Unit 19 (Step S55). Here, the process of extracting binary relations in Step S54 is the same as that in Step S51.
If the extraction of binary relations of all the gene names shown in the result of experiment 2 are deemed to be finished (Step S56), we extract the binary relations of gene/protein names for gene names shown in the result of experiment 3 in reference to Literature DB14 using natural language processing (Step S57), and store the extracted binary relations in Binary Relation Storage Unit 19 (Step S58). Here, the process of extracting binary relations in Step S57 is the same as that in Step S51.
If the extractions of binary relations of all the gene names shown in the result of experiment 3 are deemed to be finished (Step S59), we extract the overlapping parts of the binary relations stored in Binary Relation Storage Unit 19 (Step S60). If overlapping parts are detected, the pathway map is drawn regarding the overlapped binary relations as reference information (Step S61).
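The overlap extraction can be sketched as follows. This is one possible reading, not the definitive procedure: it assumes that a binary relation counts as overlapping when it was extracted for more than one experiment, and that each relation is a tuple of element name, verb, and partner element name. The function name `merge_with_overlap` and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed representation: (element name, verb, partner element name).
Relation = Tuple[str, str, str]


def merge_with_overlap(
    per_experiment: List[List[Relation]],
) -> Tuple[Dict[Relation, Set[int]], Set[Relation]]:
    """Merge the relations of all experiments and report the overlapping ones.

    Returns a mapping from each relation to the experiment numbers it came
    from, and the set of relations found for more than one experiment.
    """
    sources: Dict[Relation, Set[int]] = defaultdict(set)
    for number, relations in enumerate(per_experiment, start=1):
        for relation in relations:
            sources[relation].add(number)
    overlapping = {rel for rel, exps in sources.items() if len(exps) > 1}
    return dict(sources), overlapping


# Hypothetical relations extracted for experiments 1, 2 and 3.
r1 = ("geneA", "bind", "geneB")
r2 = ("geneB", "inhibit", "geneC")
r3 = ("geneC", "induce", "geneD")
sources, overlapping = merge_with_overlap([[r1, r2], [r2], [r3]])
```

Because the merged mapping holds each relation once, an overlapping relation is drawn as a single edge, while `overlapping` tells the drawing step which edges to treat as reference information.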
In addition, in the Biomedical Literature Information Processing System concerning this embodiment, we can select gene names to draw on pathway maps from the gene names stored in Data Storage Unit 18. That is, the same as the first embodiment, the system can draw pathway maps to show on
The system can also distinguish, on the pathway map, the element names entered from Input Unit 12 or from DNA microarray analysis device 26 via Communication Unit 23 from the element names that have binary relations with these entered element names. Furthermore, if gene names based on more than two experimental results are entered via Communication Control Unit 24 from DNA microarray analysis device 26, the system can display the gene names on pathway maps so that they are distinguished for every experiment.
The Biomedical Literature Information Processing System concerning the third embodiment extracts the binary relations for each of the plural element names entered in reference to the literature database, and draws the pathway maps based on the extracted binary relations. Consequently, for each of the plural element names, the system can extract binary relations in parallel, in reference to the literature database only, and draw pathway maps. That is, even without a dictionary that stores the verbs indicating interactions between plural element names and the element names (in other words, even with a simple system architecture), the system can extract binary relations and draw pathway maps very quickly for each of the plural element names entered. The system can thus draw pathways of interactions between protein names and gene names, signaling pathways, and metabolic pathways very quickly.
The Biomedical Literature Information Processing System concerning this embodiment can draw a simple pathway map or a detailed pathway map according to need because the system can specify the extraction range of binary relations on the basis of entered element names.
The Biomedical Literature Information Processing System concerning this embodiment can make it easy to understand pathway maps drawn because the system can discriminate the element names entered by the input means and element names extracted from the element names entered by the input means to show them on pathway maps.
The Biomedical Literature Information Processing System concerning this embodiment can extract binary relations and draw pathway maps on the basis of the latest literature information because the literature information includes Internet information.
The Biomedical Literature Information Processing System concerning this embodiment can directly enter the element name based on the detection result of DNA microarray analysis device, extract binary relations of entered element names, and draw pathway maps. That is, the system can draw pathways on the basis of detection results very quickly because the system can enter element names obtained by the more than two experiments at the same time and extract binary relations of entered element names in parallel to draw pathway maps.
In addition, the Biomedical Literature Information Processing System concerning this embodiment can make it easy to understand pathway maps because the system discriminates and displays element names to draw on pathway maps on the basis of each experiment. Furthermore, the system can make it easy to understand analysis results because the system can change element names shown on pathway maps according to the instruction by the user.
In addition, in the Biomedical Literature Information Processing System concerning the third embodiment, we may adjust a threshold value used to select protein/gene names for drawing pathway maps and, after obtaining the results of experiments 1-3, draw pathway maps using the protein/gene names selected on the basis of this adjusted threshold value. We can also adjust a threshold value and select protein/gene names for drawing pathway maps for each experiment on the basis of this adjusted threshold value, and then draw pathway maps with the selected protein/gene names.
Now, we will explain the fourth embodiment. In the above third embodiment, after extracting the binary relations of gene names shown in the results of each experiment, the system extracts the overlapping parts of the gene names and draws pathway maps regarding the overlap as one unit of information. Meanwhile, in the fourth embodiment, the system evaluates whether the binary relations of gene names shown in each experimental result have been extracted or not, then extracts the binary relations of the gene names whose binary relations have not been extracted, and draws the pathway maps.
Data Control Unit 10 of Biomedical Literature Information Processing System stores the results of experiment 1-3 obtained via Communication Control Unit 24 in Data Storage Unit 18 (Step S70). Next, we evaluate whether the binary relations of the gene names shown in the results of experiment 1 are extracted or not (Step S71). That is, for the first gene name of those shown in the results of experiment 1, we evaluate whether the binary relation of the gene names is extracted and stored in Binary Relation Storage Unit 19 or not.
If the extraction of the binary relations is deemed not to be finished in Step S71, we extract the binary relations between gene/protein names in reference to Literature DB14, using natural language processing (Step S72), and store the extracted binary relations in Binary Relation Storage Unit 19 (Step S73). The process of extracting binary relations in Step S72 is the same as that in Step S51 of the third embodiment.
On the other hand, in Step S71, if the extraction of the binary relations is deemed to be finished, we go to Step S74 and evaluate whether the extraction of binary relations of all the gene names shown in the result of experiment 1 is finished and the relations are stored in Binary Relation Storage Unit 19 or not. If gene names whose binary relations have not been extracted remain, we go back to Step S71 and extract the binary relations of the rest of the gene names.
In Step S74, if the extraction of binary relations of all the gene names shown in the result of experiment 1 is deemed to be finished, we evaluate whether the extraction of binary relations of the gene names shown in the result of experiment 2 is finished or not (Step S75), extract the binary relations of gene/protein names for the gene names whose binary relations have not been extracted in reference to Literature DB14 with natural language processing (Step S76), and then store the extracted binary relations in Binary Relation Storage Unit 19 (Step S77). Here, the process of extracting binary relations in Step S76 is the same as that in Step S72.
If the extractions of binary relations for all the gene names shown in the result of experiment 2 are finished (Step S78), we evaluate whether the extractions of binary relations for all the gene names shown in the result of experiment 3 are finished or not (Step S79). If the extraction for the gene names in the result of experiment 3 is deemed not to be finished, the system extracts the binary relations of gene/protein names for the unfinished ones in reference to Literature DB14 with natural language processing (Step S80), and then stores the extracted binary relations in Binary Relation Storage Unit 19 (Step S81). Here, the process of extracting binary relations in Step S80 is the same as that in Step S72.
If the extractions of binary relations for all the gene names shown in the result of experiment 3 are finished (Step S82), we draw pathway maps of binary relations stored in Binary Relation Storage Unit 19 (Step S83).
In the Biomedical Literature Information Processing System concerning this embodiment, we can select gene names to draw on pathway maps from the gene names stored in Data Storage Unit 18. That is, we can draw the same pathway map to show on
When depicting them, the system can distinguish the element names entered from DNA microarray analysis device 26 via Communication Control Unit 24 from the element names extracted as partner elements having binary relations with those entered gene names. Furthermore, if gene names based on more than two experimental results are entered via Communication Control Unit 24 from DNA microarray analysis device 26, the system can display the gene names on pathway maps so that they are distinguished for every experiment.
The Biomedical Literature Information Processing System concerning the fourth embodiment evaluates whether the extractions of binary relations for each of plural element names entered are finished or not, then extracts the binary relations of the element names whose binary relations are not extracted in reference to Literature DB14, and draws the pathway maps on the basis of the extracted binary relations. Consequently, the system can extract binary relations and draw pathway maps very quickly for each of entered plural element names because the system does not extract binary relations of element names redundantly. That is, the system can draw pathway maps that show interaction between protein/gene names, signaling pathways, and metabolic pathways very quickly.
The Biomedical Literature Information Processing System concerning this embodiment can draw a simple pathway map or a detailed pathway map according to the needs because the system can specify the extraction range of binary relations on the basis of entered element names.
The Biomedical Literature Information Processing System concerning this embodiment can make it easy to understand pathway maps, because the system can discriminate the element names entered by the input means and element names extracted by the system from the element names entered by the input means when showing them on pathway maps.
The Biomedical Literature Information Processing System concerning this embodiment can extract binary relations and draw pathway maps based on the latest literature information, because the literature information includes information from the Internet.
The Biomedical Literature Information Processing System concerning this embodiment can directly enter the element name based on the detection result of DNA microarray analysis device, extract binary relations of entered element names, and draw pathway maps. That is, the system can draw pathways on the basis of detection results very quickly because the system can enter element names obtained by the more than two experiments at the same time and extract binary relations of entered element names in parallel to draw pathway maps.
In addition, the Biomedical Literature Information Processing System concerning this embodiment can make it easy to understand pathway maps because the system discriminates and displays element names to draw on pathway maps on the basis of each experiment. Furthermore, the system can make it easy to understand analysis results because the system can change element names shown on pathway maps according to need.
In addition, in the Biomedical Literature Information Processing System concerning the fourth embodiment, we may adjust a threshold value used to select protein/gene names for drawing pathway maps and, after obtaining the results of experiments 1-3, draw pathway maps using the protein/gene names selected on the basis of this adjusted threshold value. We can also adjust a threshold value and select protein/gene names for drawing pathway maps for each experiment, based on this adjusted threshold value, and then draw pathway maps with the selected protein/gene names.
Now, we will explain the fifth embodiment. At the beginning of the fifth embodiment, in reference to Dictionary 16 and Literature DB 14, for the gene names stored in Dictionary 16, we extract the binary relations between protein/gene names (nouns and verbs) by natural language processing and determine the reliability of the extracted binary relations. In addition, we skip the detailed explanation because the system architecture of the Biomedical Literature Information Processing System concerning the fifth embodiment is the same as that concerning the first embodiment.
First of all, the determination process of the reliability of the binary relations in the fifth embodiment is explained as follows. Data Control Unit 10 extracts the binary relations between the element names (protein names, gene names, etc.) for each of element names (nouns and verbs) stored in Dictionary 16 in reference to the literature information stored in Literature DB14. The extracted binary relations are stored in Binary Relation Storage Unit 19.
Next, we categorize the binary relations stored in Binary Relation Storage Unit 19 on the basis of the verbs in the binary relations between element names. For example, we categorize them by verbs that represent interactions between element names, such as “bind”, “inhibit”, “interact”, “phosphorylate”, “mediate”, “modulate”, “induce”, “associate”, etc.
Next, for each categorized binary relation (that is, for each verb that indicates an interaction between element names), we draw a graph that indicates the relation between nodes and edges (representing an element name as a node and a relationship between element names as an edge).
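The categorization by verb and the node and edge data behind such a graph can be sketched as follows. This is a sketch under assumptions: it reads the graph as a degree distribution (for each verb, how many nodes have a given number of edges), treats relations as undirected, and uses the hypothetical function name `degree_distribution_by_verb` and toy data.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Assumed representation: (element name, verb, partner element name).
Relation = Tuple[str, str, str]


def degree_distribution_by_verb(
    relations: Iterable[Relation],
) -> Dict[str, Dict[int, int]]:
    """For each verb, map an edge count to the number of nodes having it."""
    edges_by_verb: Dict[str, Set[FrozenSet[str]]] = defaultdict(set)
    for source, verb, target in relations:
        # Undirected edge; duplicates of the same pair collapse into one edge.
        edges_by_verb[verb].add(frozenset((source, target)))

    result: Dict[str, Dict[int, int]] = {}
    for verb, edges in edges_by_verb.items():
        degree: Counter = Counter()
        for edge in edges:
            for node in edge:
                degree[node] += 1
        result[verb] = dict(Counter(degree.values()))
    return result


# Hypothetical binary relations.
relations = [
    ("geneA", "bind", "geneB"),
    ("geneA", "bind", "geneC"),
    ("geneA", "bind", "geneD"),
    ("geneB", "inhibit", "geneC"),
]
distribution = degree_distribution_by_verb(relations)
```

For the verb "bind" in this toy data, one node has three edges and three nodes have one edge each.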
When a network that has a scale-free nature is displayed for visual analysis, specific nodes in the network called hubs have an overwhelming number of edges. The network therefore has very many edges around hubs, for example more than 1000 edges for some of the top hub nodes, and the network diagram becomes too complex to find the important interaction relations if we draw the network as it is. To avoid such complication, if a node is a hub, we can divide the interactions around the node and draw the network in separate parts. To this end, the top hubs are identified, the number of edges around each hub node is calculated in advance, and these data are stored in storage. Then, when we encounter a hub node having Nh edges, we draw only the relations around the hub node, showing only Npre edges at a time. In this case, we can draw the edges of the hub node part by part, from part 1 to part int(Nh/Npre)+1, while monitoring which part of the interactions is being drawn, so that the partitioned pictures are drawn int(Nh/Npre)+1 times. Here ‘int’ means the operation of taking the integer value. With this function, the user no longer needs to worry about an explosive network drawing. Without this method, a network that contains hubs suddenly acquires an explosive number of edges, but this system can be used without that worry and inconvenience.
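The partitioned drawing of a hub can be illustrated with a short sketch. The function name `hub_edge_pages` and the toy edge list are assumptions. The number of parts follows the int(Nh/Npre)+1 expression of the text; when Nh is an exact multiple of Npre, that expression yields one part with no edges, which this sketch simply drops.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]  # assumed representation: (hub node, partner node)


def hub_edge_pages(hub_edges: Sequence[Edge], n_pre: int) -> List[List[Edge]]:
    """Split the edges of one hub node into parts of at most n_pre edges."""
    if n_pre <= 0:
        raise ValueError("n_pre must be a positive integer")
    n_h = len(hub_edges)
    n_parts = n_h // n_pre + 1  # int(Nh/Npre)+1 as in the text
    parts = [list(hub_edges[i * n_pre:(i + 1) * n_pre]) for i in range(n_parts)]
    # Drop the empty trailing part that appears when Nh is a multiple of Npre.
    return [part for part in parts if part]


# Hypothetical hub with Nh = 5 edges, drawn Npre = 2 edges at a time.
edges = [("HUB", f"gene{i}") for i in range(5)]
pages = hub_edge_pages(edges, 2)
```

Each element of `pages` corresponds to one partitioned picture, so a drawing loop can report which part is being drawn by its index.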
In addition, in the graphs shown on
Based on the nature of the drawn graphs of nodes and edges, we can determine the reliability of the extracted binary relations. That is, the reliability of the extracted binary relations is guaranteed when the data points of the drawn graphs are grouped near the ideal curve, but the reliability is not guaranteed when any data point of the drawn graph is remarkably far from the ideal curve. In such a case, for example, we correct the content stored in Dictionary 16 and add words, then extract the binary relations again. For the re-extracted binary relations, regarding element names as nodes and relationships between element names as edges, we draw the relation between edges and nodes for each verb that indicates an interaction between element names. The reliability of the extracted binary relations for each verb is guaranteed when the data points of the drawn graphs are grouped near the ideal curve.
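A possible way to quantify "near the ideal curve" is sketched below. The form of the ideal curve is not given in this passage, so the sketch assumes a power law, which is the usual signature of a scale-free network: it fits a straight line to the data in log-log space by least squares and reports the largest deviation from that line. The function name `max_log_deviation`, the tolerance of 1.0, and the toy data are all hypothetical.

```python
import math
from typing import Dict


def max_log_deviation(distribution: Dict[int, int]) -> float:
    """Largest log-space distance of any data point from a fitted power law.

    `distribution` maps an edge count k to the number of nodes having k edges.
    """
    points = [
        (math.log(k), math.log(count))
        for k, count in distribution.items()
        if k > 0 and count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two data points")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all data points share the same edge count")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    intercept = mean_y - slope * mean_x
    return max(abs(y - (intercept + slope * x)) for x, y in points)


# Hypothetical data: an exact power law, and the same data with one outlier.
clean = {1: 1024, 2: 256, 4: 64, 8: 16}
noisy = {1: 1024, 2: 256, 4: 64, 8: 1600}
TOLERANCE = 1.0  # assumed threshold, to be tuned
clean_ok = max_log_deviation(clean) <= TOLERANCE
noisy_ok = max_log_deviation(noisy) <= TOLERANCE
```

A verb whose data fail such a check would be a candidate for correcting Dictionary 16 and re-extracting, as described above.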
Next, we explain the extractions of the binary relations in the fifth embodiment in reference to
The extracted binary relations are stored in Binary Relation Storage Unit 19 (Step S92). Next, we evaluate whether the extractions of binary relations are finished or not for all the gene names shown in the result of experiment 1 (Step S93). If not all the extractions are finished, we go back to Step S91 to extract the binary relations of the next gene names.
In Step S93, if the extraction of binary relations of all the gene names shown in the result of experiment 1 is deemed to be finished, we extract the binary relations of gene/protein names for the gene names shown in the result of experiment 2 in reference to Binary Relation Storage Unit 19 (Step S94), and store the extracted binary relations in Binary Relation Storage Unit 19 (Step S95). Here, the process of extracting binary relations in Step S94 is the same as that in Step S91.
If the extraction of binary relations of all the gene names shown in the result of experiment 2 are deemed to be finished (Step S96), we extract the binary relations of gene/protein names for gene names shown in the result of experiment 3 in reference to Binary Relation Storage Unit 19 (Step S97), and store the extracted binary relations in Binary Relation Storage Unit 19 (Step S98). Here, the process of extracting binary relations in Step S97 is the same as that in Step S91.
If the extractions of the binary relations for all the genes shown in the result of experiment 3 are finished (Step S99), the overlapping parts of the binary relations (the binary relations extracted in Step S91 and stored in Step S92, the binary relations extracted in Step S94 and stored in Step S95, and the binary relations extracted in Step S97 and stored in Step S98) are extracted (Step S100). If overlapping parts are extracted, the pathway map is drawn regarding the overlapped binary relations as reference information (Step S101). Here, the processes of Step S100 and Step S101 are the same as those of Step S20 and Step S21 in the first embodiment (in reference to
In addition, in the Biomedical Literature Information Processing System concerning this embodiment, we can select gene names to draw on pathway maps from the gene names stored in Data Storage Unit 18. That is, the same as the first embodiment, the system can draw pathway maps to show on
The system can also distinguish, on pathway maps, the element names entered from Input Unit 12 or from DNA microarray analysis device 26 via Communication Unit 23 from the element names that have binary relations with these entered element names. Furthermore, if gene names based on more than two experimental results are entered via Communication Control Unit 24 from DNA microarray analysis device 26, the system can display the gene names on pathway maps so that they are distinguished for every experiment.
The Biomedical Literature Information Processing System concerning the fifth embodiment extracts the binary relations for each of the plural element names entered in reference to Binary Relation Storage Unit 19, in which binary relations extracted beforehand are stored, and draws the pathway maps on the basis of the extracted binary relations. Consequently, for each of the plural element names, the system can extract the binary relations in parallel and draw the pathway maps, and it can do so very quickly.
The Biomedical Literature Information Processing System concerning this embodiment categorizes binary relations stored in Binary Relation Storage Unit on the basis of verbs that indicate interactions between element names, and determines the reliability of binary relation for each verb on the basis of binary relations for each of categorized verb. Consequently, the system can draw a pathway map on the basis of binary relations whose reliabilities are guaranteed, and improve the reliability of a pathway map.
In addition, in the Biomedical Literature Information Processing System concerning this embodiment, we may adjust a threshold value that is used to select protein and gene names for drawing pathway maps and, after obtaining the results of experiments 1-3, draw pathway maps using the protein and gene names selected on the basis of this adjusted threshold value. We may also adjust a threshold value and select protein and gene names for drawing pathway maps for each experiment based on this adjusted threshold value, and then draw pathway maps with the selected protein/gene names.
Next, we will explain the sixth embodiment. In the above fifth embodiment, after extracting the binary relations of gene names shown in the results of each experiment, the system extracts the overlapping parts of the gene names and draws pathway maps regarding the overlapping parts as one unit of information. In the sixth embodiment, the system evaluates whether the binary relations of gene names shown in each experimental result have been extracted or not, then extracts the binary relations of the gene names whose binary relations have not been extracted, and draws the pathway maps.
Data Control Unit 10 of Biomedical Literature Information Processing System stores the results of experiment 1-3 obtained via Communication Control Unit 24 in Data Storage Unit 18 (Step S110). Next, we evaluate whether the binary relations of the gene names shown in the results of experiment 1 are extracted or not (Step S111). That is, for the first gene name of those shown in the results of experiment 1, we evaluate whether the binary relation of the gene names is extracted and stored in Binary Relation Storage Unit 19 or not.
If the extraction of the binary relations is deemed not to be finished in Step S111, we extract the binary relations between gene/protein names in reference to Binary Relation Storage Unit 19 (Step S112) and store the extracted binary relations in Binary Relation Storage Unit 19 (Step S113). In Step S111, if the extraction of the binary relations is deemed to be finished, we go to Step S114 and evaluate whether the extraction of binary relations of all the gene names shown in the result of experiment 1 is finished and the relations are stored in Binary Relation Storage Unit 19 or not. Here, if gene names whose binary relations have not been extracted are left, we go back to Step S111 and extract the binary relations of the rest of the gene names.
In Step S114, if the extraction of binary relations of all the gene names shown in the result of experiment 1 are deemed to be finished, we evaluate whether the extraction of binary relations of the gene names shown in the result of experiment 2 are finished or not (Step S115), and extract the binary relation of gene/protein names for the gene names whose binary relations are not extracted in reference to Binary Relation Storage Unit 19 (Step S116), then store the extracted binary relations in Binary Relation Storage Unit 19 (Step S117). Here, the process of extracting binary relations in Step S116 is the same as that in Step S112.
If the extractions of binary relations for all the gene names shown in the result of experiment 2 are finished (Step S118), we evaluate whether the extractions of binary relations for all the gene names shown in the result of experiment 3 are finished or not (Step S119), and in case where the extractions are not finished, we extract the binary relations of those gene/protein names in reference to Binary Relation Storage Unit 19 (Step S120), then store those extracted binary relations in Binary Relation Storage Unit 19 (Step S121). Here, the process of extracting binary relations in Step S120 is the same as that in Step S112.
If the extractions of binary relations for all the gene names shown in the result of experiment 3 are finished (Step S122), we draw pathway maps of the binary relations stored in Binary Relation Storage Unit 19 (the binary relations extracted in Step S112 and stored in Step S113, the binary relations extracted in Step S116 and stored in Step S117, and the binary relations extracted in Step S120 and stored in Step S121) (Step S123). Here, the process of extracting binary relations in Step S123 is the same as that in Step S20 (in reference to
In the Biomedical Literature Information Processing System concerning this embodiment, we can select gene names to draw on pathway maps from the gene names stored in Data Storage Unit 18. That is, we can draw pathway map to show in
The system can distinguish the element names entered from DNA microarray analysis device 26 via Communication Control Unit 24 from the element names that the system derives as interaction partners having binary relations with them. If gene names based on more than two experimental results are entered via Communication Control Unit 24 from DNA microarray analysis device 26, the system can display the gene names on pathway maps so that they are distinguished for every experiment.
The Biomedical Literature Information Processing System concerning this embodiment evaluates whether the extraction of binary relations is finished for each of the plural element names entered, extracts binary relations only for the element names whose binary relations have not yet been extracted, in reference to Binary Relation Storage Unit 19, which stores the binary relations extracted beforehand, and draws the pathway maps on the basis of the extracted binary relations. Consequently, the system can extract binary relations and draw the pathway maps very quickly for the plural element names entered, because it does not redundantly extract binary relations of element names.
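The check against Binary Relation Storage Unit 19 described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the names `extract_missing`, `toy_extract`, and `storage`, and the gene names, are hypothetical, and a plain dictionary stands in for the storage unit.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Relation = Tuple[str, str, str]  # (element name, verb, element name)


def extract_missing(
    names: Iterable[str],
    storage: Dict[str, Set[Relation]],
    extract: Callable[[str], Set[Relation]],
) -> Dict[str, Set[Relation]]:
    """Extract relations only for names that the storage does not hold yet."""
    for name in names:
        if name not in storage:            # extraction not finished for this name
            storage[name] = extract(name)  # extract, then store
    return storage


# Toy stand-in for literature extraction; it records which names were processed.
calls = []


def toy_extract(name: str) -> Set[Relation]:
    calls.append(name)
    return {(name, "bind", "partner_of_" + name)}


# Three experiments with overlapping (hypothetical) gene names.
storage: Dict[str, Set[Relation]] = {}
for experiment in (["geneA", "geneB"], ["geneB", "geneC"], ["geneA", "geneC"]):
    extract_missing(experiment, storage, toy_extract)
```

In this sketch each gene name is passed to the extraction function once, even though it appears in two experiments.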
Moreover, in the Biomedical Literature Information Processing System concerning this embodiment, the binary relations stored in Binary Relation Storage Unit 19 are categorized on the basis of the verbs that indicate interactions between element names, and the reliability of the binary relations for each verb is determined on the basis of the binary relations categorized under that verb. Consequently, we can draw the pathway map on the basis of binary relations whose reliability is assured, and thereby improve the reliability of the pathway maps.
In addition, the above embodiment has a dictionary that stores verbs indicating the interaction between plural element names or element names, and a literature database that stores multiple literature information, and extracts the binary relations for each of the plural element names entered in reference to the dictionary and the literature database. Alternatively, with only a database that stores a large amount of literature information, we can extract the binary relations for each of the plural element names entered in reference to that database.
In addition, in the Biomedical Literature Information Processing System concerning the sixth embodiment, after obtaining the results of experiments 1-3, we can adjust a threshold value for selecting protein/gene names and draw pathway maps using the protein/gene names selected on the basis of this adjusted threshold value. We can also adjust the threshold value for each experiment, select protein/gene names for each experiment on the basis of the adjusted threshold value, and draw pathway maps with the selected protein/gene names.
The Biomedical Literature Information Processing System concerning each embodiment, as noted above, makes it easy to compare experiments whose conditions are different, because the system is able to process a large amount of data at the same time. Whether in the field of diagnosis or in clinics, the system can analyze experimental data from microarray analysis very quickly because it can gather experimental results and literature information at the same time, and it can be used in the fields of drug discovery, elucidation of disease, and molecular biology.
In the above embodiment, we extract binary relations from biomedical literature, regarding proteins and genes as nodes (elements), and draw pathway maps. In addition, we can extract multiple relations, such as three-body, four-body, and higher many-body relations, from biomedical literature in the same way and draw pathway maps. Even when the analysis is generalized to extracting pathway information attributed to many-body interactions between multiple proteins and genes, this invention is as useful as in the case of binary relations. As an example, we take transcriptional control as a cooperative operation of many-body interactions between multiple proteins. In the T cell receptor α gene enhancer, AML-1 and Ets-1 first bind to the transcription start sites of genes, ATF binds to DNA in the same way, and then the DNA is bent by about 130 degrees by LEF-1 binding to DNA. The transcription thus starts after the binding of ATF, AML-1, and Ets-1. We can clearly understand the function from the viewpoint of multiple relations involving 6 elements (including DNA). This invention is advantageous in analyzing complicated phenomena in life (such as transcription initiation) that involve multiple proteins and multiple interaction relations.
In addition, a three-body interaction relation means an interaction between gene and protein names expressed as, for example, “A (gene name) associate (verb) with B (gene name) and C (gene name)”, or “cooperative interactions among A (gene name), B (gene name) and C (gene name)”. A four-body interaction relation means an interaction between gene and protein names expressed as, for example, “A (gene name)-B (gene name)-C (gene name)-D (gene name) complex”. By extracting the multiple interaction relations just described, we can study phenomena caused by complex interactions between multiple genes and proteins, such as transcription activity, epigenetic effects such as methylation, and protein complexes.
In the above embodiment, we extracted from literature information binary relations, which are a special case of multiple relations consisting of a single verb and two nouns (“noun-verb-noun”), and analyzed them to draw a pathway map. We can also extract from literature information multiple relations in which the same or different combinations of element names and verbs are repeated, such as “noun-verb-noun-verb-noun”, or longer repetitions of noun and verb combinations. These multiple relations improve the extraction results and the accuracy of searching literature information, and convey the meaning of the results extracted from the literature more accurately.
In the field of molecular biology, the time sequences of signaling in cells, which can be represented by combinations of nouns that indicate proteins and verbs that indicate interactions between proteins, are time series of specific events involving many interacting proteins. In this case, the specific order of a specific set of verbs is important. In the case of “noun-verb-noun”, it is often observed in the literature that the function of a protein is induced after another protein binds to it. In particular, using NFkB as an example, NFkB in the cell cytoplasm moves into the nucleus and begins to function:
Here is another example, in which a protein in a cell membrane translocates to the nucleus:
A further example is a protein in the cytoplasm that moves to the Golgi, where a portion of it is cleaved and that portion moves to the nucleus:
Expressions of time flow in the biology literature can be found in terms such as G1 phase, S phase, or M phase of the cell cycle. In many cases, however, time flow is represented by the order of multiple events, such as the order of interactions and movements of specific proteins. Therefore, extracting from a sentence in literature information the same or different combinations of protein (or gene) names and verbs, such as “protein nouns that indicate protein names-verb of an interaction-protein noun-verb of an interaction-verb that indicates a function”, provides significant sentences relating to time-dependent complex phenomena. This leads to a deep understanding of life that we cannot obtain from the extraction of binary relations alone.
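The ordered “noun-verb-noun-verb-noun” extraction described above can be sketched as follows. This is only an illustration under simplifying assumptions: the function name `extract_chain`, the table `VERB_FORMS`, and the placeholder names `ProtA`, `ProtB`, and `ProtC` are hypothetical, and a plain dictionary lookup stands in for the natural language processing of the system. When two nouns or two verbs follow each other directly, this sketch keeps only the first.

```python
import re
from typing import Dict, List, Set

# Hypothetical table mapping inflected verb forms to dictionary verbs.
VERB_FORMS: Dict[str, str] = {
    "bind": "bind", "binds": "bind",
    "activate": "activate", "activates": "activate",
    "inhibit": "inhibit", "inhibits": "inhibit",
}


def extract_chain(sentence: str, element_names: Set[str],
                  verb_forms: Dict[str, str]) -> List[str]:
    """Return the alternating noun/verb chain of a sentence, in text order."""
    chain = []  # list of (kind, value) pairs
    for token in re.findall(r"[A-Za-z0-9\-]+", sentence):
        if token in element_names:
            kind, value = "noun", token
        elif token.lower() in verb_forms:
            kind, value = "verb", verb_forms[token.lower()]
        else:
            continue  # neither an element name nor an interaction verb
        if not chain:
            if kind == "noun":  # a chain must start with a noun
                chain.append((kind, value))
        elif chain[-1][0] != kind:  # keep strict noun/verb alternation
            chain.append((kind, value))
    if chain and chain[-1][0] == "verb":  # a chain must end with a noun
        chain.pop()
    values = [value for _, value in chain]
    return values if len(values) >= 3 else []  # at least noun-verb-noun


chain = extract_chain("ProtA binds ProtB which then activates ProtC",
                      {"ProtA", "ProtB", "ProtC"}, VERB_FORMS)
```

Because the chain is kept in text order, the order of the verbs, which the description treats as the carrier of time flow, is preserved.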
In the same way, by extracting from a text a noun that indicates a cell name or a localization in a cell together with the above noun-verb-noun, on the grounds that they appear in the text at the same time, we can clearly specify the place in a cell where the protein interaction occurs. Here, we can replace a verb by a noun phrase or an adjective phrase. From the extracted binary relations, we can mathematically analyze correlations between protein and gene names as a scalar field. We can also analyze the correlation matrix as a vector or tensor field for the results of extracted multiple (or binary) relations.
Additionally, we can store the list that indicates relationships from probe IDs obtained as experimental results by microarray analysis device to the substantial mRNAs or genes, and the relationships from protein/gene names that have the reverse relations to probe IDs.
Next, we will explain the seventh embodiment.
Data Control Unit 10 of the Biomedical Literature Information Processing System stores the experimental results obtained via Communication Control Unit 24 in Data Storage Unit 18 (Step S130). In the following, we explain by taking as an example the case in which protein A is obtained as an experimental result in the DNA microarray analysis device.
Next, we specify the extraction range of binary relations on the basis of protein A stored as an experimental result (Step S131). Consequently, we specify the range (hierarchy) of proteins that are extracted as having binary relations with protein A.
Next, in the range specified in Step S131, we extract binary relations between gene names and protein names for the protein names stored as experimental results, in reference to Dictionary 16 and Literature DB 14 (Step S132). That is, for protein A, using natural language processing, we extract binary relations of protein/gene names indicated by “noun (protein A)”, “verb”, and “noun (protein name)”.
In addition, for each “noun (protein name)” extracted as having a binary relation with “noun (protein A)”, we extract binary relations of protein/gene names indicated by “noun (protein name)”, “verb”, and “noun (protein name)”. That is, we extract not only the binary relations of protein names obtained as experimental results, but also those of protein names extracted as having a binary relation with the protein name (protein A) obtained as an experimental result. Within the extraction range (the range of extracted hierarchy) specified in Step S131, this extraction of binary relations is completed, for example, within the range of the second hierarchy from the entered protein name (protein A), or within the range of extracting protein names that are directly involved in functions.
Here, in the case where a pathway map is drawn using protein A and the proteins (of the first hierarchy) that have binary relations with protein A, regarding protein A (black circle in
On the other hand, in the case of extracting the proteins (of the second hierarchy) that have binary relations with proteins of the first hierarchy, the binary relations between proteins of the first hierarchy, which are not extracted when extracting proteins of the first hierarchy, are extracted. That is, as shown in
Consequently, in Step S132, proteins are extracted up to the hierarchy specified as the extraction range, starting from protein A obtained as an experimental result. At the same time, the binary relations between the proteins of the hierarchies already extracted are extracted. In the case where the extraction range is limited to the second hierarchy, for example, the system extracts proteins up to the second hierarchy and, in parallel, extracts the binary relations that exist between the already extracted proteins of the second hierarchy.
The binary relations extracted in Step S132 are stored in Binary Relation Storage Unit 19 (Step S133). Next, we draw a pathway map on the basis of the binary relations stored in Binary Relation Storage Unit 19 (Step S134). Here, even when the necessary pathway map only needs to cover the binary relations between the proteins of the second hierarchy, the usual extraction procedure cannot draw the edges that indicate those binary relations without extracting up to the third hierarchy. Consequently, as shown in
With that, as defined in the above Step S132, by extracting binary relations that exist between proteins that are already extracted as well as extracting binary relations from protein A in the range of specified extraction, the pathway map as shown in
The Biomedical Literature Information Processing System concerning the seventh embodiment, when extracting the multiple relations that exist between element names already extracted as having multiple relations (binary relations), extracts only the relations between those element names without extracting new element names. Consequently, the system makes it easy to grasp necessary information visually from the pathway map, because necessary information is not buried under drawings of unneeded proteins.
The Biomedical Literature Information Processing System concerning the seventh embodiment extracts binary relations in the specified range of extraction based on protein A obtained as an experimental result, and also extracts the binary relations that exist between proteins already extracted, then draws a pathway map. Consequently, there is no need to extract proteins of a further hierarchy in order to extract the binary relations that exist between proteins already extracted. Therefore we can shorten the processing time of extracting binary relations and reduce the resources required by the Biomedical Literature Information Processing System.
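The hierarchy-limited extraction of the seventh embodiment can be sketched as a bounded breadth-first search followed by a pass that collects relations only between names already extracted. This is a minimal sketch under stated assumptions: `extract_hierarchy`, `partners`, and the toy graph are hypothetical, edges are treated as undirected, and the lookup function stands in for extraction from the literature.

```python
from typing import Callable, Dict, FrozenSet, Set, Tuple


def extract_hierarchy(
    start: str,
    partners: Callable[[str], Set[str]],
    depth: int,
) -> Tuple[Dict[str, int], Set[FrozenSet[str]]]:
    """Collect names up to `depth` hierarchies from `start`, plus all
    relations that exist between the collected names."""
    level = {start: 0}
    frontier = [start]
    for d in range(1, depth + 1):
        next_frontier = []
        for name in frontier:
            for partner in partners(name):
                if partner not in level:  # new name of hierarchy d
                    level[partner] = d
                    next_frontier.append(partner)
        frontier = next_frontier
    # Second pass: relations between names already extracted only;
    # partners outside `level` are ignored, so no new names are added.
    edges = set()
    for name in level:
        for partner in partners(name):
            if partner in level and partner != name:
                edges.add(frozenset((name, partner)))
    return level, edges


# Hypothetical toy relations (symmetric adjacency).
toy = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}
level1, edges1 = extract_hierarchy("A", lambda n: toy.get(n, set()), depth=1)
```

With `depth=1`, the relation between B and C (both of the first hierarchy) is included, while D, which would belong to the second hierarchy, is not drawn.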
In addition, for the Biomedical Literature Information Processing System concerning the above seventh embodiment, we gave an explanation using the example in which protein A is obtained as an experimental result. We can also obtain plural proteins, such as protein A and protein B, as an experimental result. In that case, we specify a range of extraction for each of protein A and protein B (for example, for protein A, the extraction range up to the proteins of the second hierarchy and the binary relations that exist between the proteins of the second hierarchy; for protein B, the extraction range up to the proteins of the second hierarchy) and extract binary relations. After extracting the overlaps of the extracted binary relations, we can draw the pathway map, regarding the overlapped binary relations as one unit of information.
Here, for protein A and protein B, in the case of extracting in the range to the second hierarchy, the pathway map is drawn as shown in
In the above seventh embodiment, we input (obtain) protein names into the system, but we can input the protein names obtained from probe IDs as an experimental result (for example, the gene cluster selected by limiting the threshold of gene expression amount) provided by DNA microarray analysis device 26.
In addition, in the Biomedical Literature Information Processing System concerning the above seventh embodiment, we extract binary relations in reference to a dictionary and Literature DB, but we can extract binary relations only in reference to Literature DB.
We can verify the reliability of drawn pathway maps based on relationships between nodes and edges. By setting the ‘number k-1’, ‘number k’, and ‘number k+1’ to the edges in the k−1, k, and k+1 hierarchy of the binary relations between protein names, we observe that the relationships as shown in
In the Biomedical Literature Information Processing System concerning this embodiment, we can mathematically verify the reliability of pathway maps by mapping (or homology mapping) the relation patterns stored in Relationship Pattern Storage 18a to the relations between nodes and edges in the drawn pathway map in Data Control Unit 10 where it functions as verification. For example, in the pathway map shown in
Now, we will explain the Biomedical Literature Information Processing System concerning the eighth embodiment.
Data Control Unit 10 of the Biomedical Literature Information Processing System receives experimental results (Step S140). The detailed explanation of the process in Step S140 is omitted because the process is the same as that of Step S130 in
Next, we input the defined conditions that are used for drawing pathway maps (Step S141). For example, we input plural protein names (gene names) as element names of experimental results, then the system provides plural protein names as interacting partners that have binary relations with each input protein name, and also provides, by recursive searching, plural protein names as interacting partners that have binary relations with the first-extracted protein names by inputting first-extracted protein names. The number of total extracted protein names for drawing in a pathway map, as shown in
As shown in
Consequently, as shown in
Next, for the protein names stored as experimental results, we extract binary relations between gene and protein names in reference to Dictionary 16 and Literature DB 14 (Step S142), and the extracted binary relations are stored in Binary Relation Storage Unit 19 (Step S143). The detailed explanation is omitted because the processes of Steps S142 and S143 are the same as those of Steps S221 and S222.
Next, for all of the gene names shown on experimental results, the system evaluates whether the extractions of binary relations are finished or not (Step S144). In cases where the extractions are not finished, the system goes back to Step S142 to extract binary relations of next protein names.
In the case where extractions of binary relations for all the protein names shown on experimental results are deemed to be finished, the pathway map is drawn based on the binary relations stored in Binary Relation Storage Unit 19 and the defined conditions stored in Data Storage Unit 18 (Step S145). In the case where the direction of edges is defined as one direction, for example, the pathway map (small one) is drawn as shown in
The Biomedical Literature Information Processing System concerning the eighth embodiment draws pathway maps based on defined conditions that define the drawing range of pathway maps. Consequently, the system can draw pathway maps using necessary information from extracted binary relations by specifying appropriate defined conditions.
In the Biomedical Literature Information Processing System concerning the eighth embodiment, using defined conditions for the pathway map, we can extract the binary relations for a smaller region as shown in
In addition, the Biomedical Literature Information Processing System concerning the eighth embodiment can shorten processing time and draw pathway maps very quickly, because the system draws small pathway maps that include the necessary information based on the defined conditions. The system makes it easy to visually understand the binary relations between protein names shown as a pathway map.
In addition, in the Biomedical Literature Information Processing System concerning the eighth embodiment, a smaller pathway map can be drawn by restricting the direction of edges. The system provides a much smaller pathway map when more defined conditions that restrict the direction of edges are imposed.
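One possible reading of a defined condition that restricts the direction of edges is sketched below: starting from an input protein, only edges followed in one direction are kept. The function `restrict_by_direction`, the direction labels, the optional `verbs` condition, and the toy relations are all assumptions made for illustration; the description does not prescribe this particular procedure.

```python
from typing import Optional, Set, Tuple

Relation = Tuple[str, str, str]  # (source name, verb, target name)


def restrict_by_direction(
    relations: Set[Relation],
    start: str,
    direction: str = "downstream",
    verbs: Optional[Set[str]] = None,
) -> Set[Relation]:
    """Keep only relations reachable from `start` in one edge direction.

    `verbs`, if given, is a further defined condition on the verb."""
    kept = set()
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for src, verb, dst in relations:
            if verbs is not None and verb not in verbs:
                continue
            if direction == "downstream" and src == node:
                nxt = dst
            elif direction == "upstream" and dst == node:
                nxt = src
            else:
                continue
            kept.add((src, verb, dst))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return kept


# Hypothetical toy relations.
toy_relations = {
    ("A", "activate", "B"),
    ("B", "inhibit", "C"),
    ("C", "bind", "E"),
    ("D", "bind", "A"),
}
downstream = restrict_by_direction(toy_relations, "A")
```

Adding a verb condition on top of the direction condition shrinks the result further, which matches the observation that more defined conditions give a smaller pathway map.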
Here, in MEDLINE, a public database that stores biomedical literature information, a database is formed that stores information (MeSH terms) (for example, which disease the genes (proteins) and organs that are included in literature information are related to, or which cytoma (internal organ) the genes and the organs are related to, etc.). Consequently, we can store these MeSH terms in Literature DB 14 and specify defined conditions using the stored MeSH terms (in reference to
In the Biomedical Literature Information Processing System concerning the above eighth embodiment, from the pathway map whose direction of the edge is restricted, we can extract the pathway map whose range is more restricted. That is, we can draw pathway maps with the direction of edges and other defined conditions, such as restricting specific verbs in the binary relations. For example, for the pathway map shown in
In addition, using multiple relations only, we can extract a small pathway map from a big pathway map. A large number of sentences in the literature provide binary relations, but fewer sentences provide multiple relations that include more than three proteins and genes. Consequently, extracting the sentences that include more than three element names, and the mutual interactions thus obtained, provides a smaller pathway map. In addition, by restricting the interaction verbs to those concerning control, such as “induce”, “inhibit”, or “activate”, when extracting multiple relations, we obtain information concerning control mechanisms that indicate non-physical, long-ranged, and semantic interactions. Alternatively, we can obtain information concerning protein complexes by using the verbs that indicate physical interactions, such as “bind”, “interact”, or “cooperative”. By using multiple relations, we can extract a small pathway map from a big pathway map while restricting the range of the network composed by extracting binary relations. That is, in the Biomedical Literature Information Processing System that is shown in
Suppose that we extract multiple relations, for instance k-body relations (where k is a positive integer) and (k+1)-body relations. The more element names compose a multi-body (or multiple) relation, the more complex the sentences that provide information about it, and the less frequently such sentences appear. Therefore, the range of the network of (k+1)-body relations becomes narrower than that of k-body relations. But if k becomes larger than some threshold value, the number of sentences becomes so small that we cannot see the behavior of the network composed of k-body interaction relations. Consequently, the value of k in the k-body relation should be k=3, 4, 5, or 6 to obtain meaningful analysis results.
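The selection of sentences by the number k of element names they contain can be sketched as follows. This is an illustration only: `find_elements`, `k_body_sentences`, and the placeholder names are hypothetical, and a simple dictionary match stands in for the natural language processing of the system. The default bounds follow the range k=3 to 6 stated above.

```python
import re
from typing import List, Set, Tuple


def find_elements(sentence: str, element_names: Set[str]) -> List[str]:
    """Return the distinct element names in a sentence, in order of first occurrence."""
    hits = []
    for name in element_names:
        # Match the name only when it is not part of a longer alphanumeric token.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
        match = re.search(pattern, sentence)
        if match:
            hits.append((match.start(), name))
    return [name for _, name in sorted(hits)]


def k_body_sentences(
    sentences: List[str],
    element_names: Set[str],
    k_min: int = 3,
    k_max: int = 6,
) -> List[Tuple[Tuple[str, ...], str]]:
    """Keep sentences that mention between k_min and k_max distinct element names."""
    result = []
    for sentence in sentences:
        found = find_elements(sentence, element_names)
        if k_min <= len(found) <= k_max:
            result.append((tuple(found), sentence))
    return result


names = {"ProtA", "ProtB", "ProtC", "ProtD"}
selected = k_body_sentences(
    [
        "ProtA binds ProtB",
        "ProtA associates with ProtB and ProtC",
        "ProtA-ProtB-ProtC-ProtD complex",
    ],
    names,
)
```

In this toy input, the two-name sentence is dropped and the three-body and four-body sentences are kept.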
In addition, to draw a pathway map, we can restrict the display to the multiple relations related to specific element names that interact with plural element names (for example, display the protein names that have binary relations with specific protein names). Here, protein names are taken as nodes and interactions between protein names as edges. It is well known that specific protein nodes in the network have a vast number of edges, and these nodes are called hubs. The list representing hub proteins (the list of hub proteins) is stored in advance in Specific Element Name Storage 18b set within Data Storage Unit 18, as shown in
Here, for example, top 70 proteins in all proteins (in order of the number of edges) are stored as hub proteins (the list of hub proteins) in Data Storage Unit 18 as shown in
In addition, in Biomedical Literature Information Processing System concerning the above embodiment, in the case where multiple relations that include more than three element names are extracted, we can clarify the relationships between element names. For example, in the case where the interactions of the extracted multiple relations include more than three element names, the list that indicates relationships between element names is drawn up, and the list is stored in Data Storage Unit 18. That is, as shown in
In addition, in the Biomedical Literature Information Processing System concerning the above embodiment, in the case where multiple relations that include more than three element names are extracted, we can allocate nodes according to the number of edges and categorize genes and proteins into groups of pathway function for drawing a pathway map. That is, when drawing a pathway map in Data Control Unit 10, which functions as a pathway map drawing means, we count the number of edges (multiple relations) that each node (gene or protein) has, and allocate the node that has the largest number of edges at the center. Next, around the node already located (on a circle centered on the node already located), we allocate nodes at even intervals in descending order of the number of edges. That is, the fewer edges a node has, the farther from the central node the circle on which it is located.
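The allocation of nodes by number of edges can be sketched as a concentric layout. This is a rough sketch, not the drawing means of the system: the function `concentric_layout` and its parameters `ring_spacing` and `per_ring` (how many nodes the first circle holds, with the n-th circle holding n times as many) are assumptions introduced here, since the description does not state how many nodes go on each circle.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def concentric_layout(
    edges: Iterable[Tuple[str, str]],
    ring_spacing: float = 1.0,
    per_ring: int = 6,
) -> Dict[str, Tuple[float, float]]:
    """Place the node with the most edges at the center and the remaining
    nodes on circles, evenly spaced, in descending order of edge count."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Descending edge count; ties are broken by name for reproducibility.
    ordered = sorted(degree, key=lambda n: (-degree[n], n))
    positions: Dict[str, Tuple[float, float]] = {}
    if not ordered:
        return positions
    positions[ordered[0]] = (0.0, 0.0)
    rest = ordered[1:]
    ring, i = 1, 0
    while i < len(rest):
        capacity = per_ring * ring  # assumed capacity of this circle
        members = rest[i:i + capacity]
        radius = ring * ring_spacing
        for j, node in enumerate(members):
            angle = 2 * math.pi * j / len(members)  # even interval
            positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
        i += capacity
        ring += 1
    return positions


# Hypothetical toy map: H has 3 edges, A and B have 2, C has 1.
layout = concentric_layout(
    [("H", "A"), ("H", "B"), ("H", "C"), ("A", "B")], per_ring=2
)
```

With `per_ring=2`, A and B land on the first circle and C, which has the fewest edges, on the second.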
In a similar way, we can modify the arrangement of nodes so that nodes are placed closer together according to the degree of the interaction represented by the verb. Here the distances between nodes are adjusted according to the interaction strength obtained from the literature information. By locating the nodes in this way, pathway maps are drawn as sets of groups, so that the nodes in each group share a defined relationship, such as specific functions for the multiple relations, specific interactions that explain control, or similar functions. Then, within the drawn pathway map, taking the number of edges between nodes and the verb that shows the relationship between nodes as parameters, we cluster nodes by general algorithms to form functions or clusters (groups), as shown in
Furthermore, in the case where nodes are separated into groups that have a defined function or groups that explain a defined control, etc., we can display the nodes in the same group by cell type, for example. Within the group that explains the sense of time (such as cell cycle or circadian rhythms), the nodes are separated, in reference to MeSH terms, into nodes related to brain and nodes related to liver. Next, the pathway map consisting of nodes related to brain (brain pathway map) and the pathway map consisting of nodes related to liver (liver pathway map) are drawn. Then, the nodes common to the brain pathway map and the liver pathway map (nodes in common) are specified and located at the same position, and the pathway maps are arranged to overlap in an identifiable state.
In addition, in Biomedical Literature Information Processing System concerning the above embodiment, we can draw a pathway map in reference to the supplementary information related to pathway maps. That is, as shown in
For example, we can display specific element names identifiable from other element names in reference to supplementary information. That is, the famous genes, such as Estrogen Receptor and Androgen Receptor, are often noted in two or three letters like “ER” or “AR” in literatures, but such omitted notations often differ in each field. Therefore, even if “ER” is noted in a literature, there is a possibility that “ER” does not always mean Estrogen Receptor.
Consequently, we collect element names whose number of characters is two or three beforehand, search cited literatures for each element name, categorize by field, and hierarchies by co-occurrence of element names and year of publication of the reference journal, etc. By sub typing, using statistics of frequency and graph theoretical analysis of element name network of more than 100 specific professionals who are users of literature information, and by synthesizing hierarchical element name information, we register beforehand supplementary information that handles element names in biomedical field as a whole in Supplementary Information Storage 18c set up in Data Storage Unit 18. Then we can refer to the supplementary information stored in Supplementary Information Storage 18c when drawing a pathway map, and we can draw user's attention by showing the configuration different from other genes in the case where extracted element names are included in supplementary information.
In addition, by using a different form of figure for displaying specific element names such as “ER” and “AR”, we can make it possible to understand visually that the gene names may erroneously indicate other elements. That is, for the element names that have a high probability of error in searching literature information, we make up a table as shown in
In addition, in Biomedical Literature Information Processing System concerning the above embodiment, we can display the important materials (not proteins or genes) in the process of interaction identifiable from proteins and genes. That is, we make the list that indicates the important materials in the process of interaction between genes/proteins (for example, the effects on interactions of phosphorylated, ubiquitination, methylation, mutation evolution, monoprotic polymorphism, permutation on chromosome, lipid, and carbohydrate) as supplementary information beforehand, and store the list in Supplementary Information Storage 18c set up in Data Storage Unit 18 (refer to
In addition, in Biomedical Literature Information Processing System concerning the above embodiment, we can draw a pathway map that includes interactions between element names that are omitted in literature information. For example, in the case of using the verbs such as “inhibit” or “induce”, when protein A interacts with E via protein B, C, and D as shown in
In addition, in the Biomedical Literature Information Processing System concerning the above embodiment, we can draw a pathway map that allows different experimental results to be compared. For example, we conduct experiments with 17 α estradiol concentrations of 0.5 μg/kg and 1.0 μg/kg, and extract multiple relations based on each experimental result. Here, we calculate the union of the sets of nodes and edges shown by the multiple relations extracted on the basis of the experimental results for concentration 0.5 μg/kg and those for concentration 1.0 μg/kg. Then, we draw the pathway map that allocates the common node in one position in the pathway map of the union of sets, that is, the node shown in the case of concentration 0.5 μg/kg and that in the case of concentration 1.0 μg/kg (refer to
As just described, by displaying two pathway maps in a superimposed condition, we can make it easy to understand visually 1) the common edges and nodes, 2) the nodes and edges that emerge only in the case of concentration 0.5 μg/kg, and 3) the nodes and edges that emerge only in the case of concentration 1.0 μg/kg. In the above example, we distinguish the two pathway maps by displaying edges in solid and broken lines, but we can also use colors; for example, we can display the edges that compose the pathway map of concentration 0.5 μg/kg in blue and the edges that compose the pathway map of concentration 1.0 μg/kg in purple.
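The comparison of two pathway maps described above reduces to set operations on their nodes and edges. The following is a minimal sketch; the function names `nodes_of` and `compare_maps` and the placeholder node names are hypothetical, and each edge is assumed to be stored as a pair of node names written in a consistent order.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def nodes_of(edges: Set[Edge]) -> Set[str]:
    """Return every node that appears in at least one edge."""
    return {node for edge in edges for node in edge}


def compare_maps(edges_a: Set[Edge], edges_b: Set[Edge]) -> Dict[str, set]:
    """Split the union of two pathway maps into common and map-specific parts."""
    nodes_a, nodes_b = nodes_of(edges_a), nodes_of(edges_b)
    return {
        "union_edges": edges_a | edges_b,    # the superimposed pathway map
        "common_edges": edges_a & edges_b,   # 1) shared by both experiments
        "only_a_edges": edges_a - edges_b,   # 2) first experiment only
        "only_b_edges": edges_b - edges_a,   # 3) second experiment only
        "common_nodes": nodes_a & nodes_b,   # drawn once, at a single position
        "only_a_nodes": nodes_a - nodes_b,
        "only_b_nodes": nodes_b - nodes_a,
    }


# Hypothetical toy maps for the two concentrations.
low = {("P1", "P2"), ("P2", "P3")}
high = {("P2", "P3"), ("P3", "P4")}
comparison = compare_maps(low, high)
```

A drawing routine could then choose the line style or color of each edge according to which of the three edge sets it belongs to.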
In addition, for the experimental results in the case where 17 α estradiol concentrations differ, for example, we can display a specific node in a visually comprehensible condition from the experimental results for concentration 0.5 μg/kg and for concentration 1.0 μg/kg. That is, we draw a pathway map by allocating the node with a single edge (displayed as a white circle in the figure) outside the prescribed circle (refer to
In addition, when multiple relations, such as binary relations between proteins, are extracted in the Biomedical Literature Information Processing System stated above, for the verb “bind” it is often unclear whether two proteins are directly connected or are connected via other proteins. For example, even if “protein A” actually binds to “protein C” and “protein C” binds to “protein B”, often only “protein A”, “bind”, “protein B” is described in the literature. In addition, in some cases the experimental result is recognized as “protein A”, “bind”, “protein B”, but it is not clear whether any proteins intervene, and only the clear part (“protein A”, “bind”, “protein B”) is described. Consequently, in cases where the verb that indicates multiple relations is “bind”, we can display, together with the pathway map, information that shows whether the interaction is direct or indirect (via other proteins).
Here, proteins have domain structures (refer to
In addition, in the Biomedical Literature Information Processing System concerning the above embodiment, we can display the pathway of interactions leading to the proteins input as experimental results. That is, once binary relations (multiple relations) have been extracted and stored in the Binary Relation (Multiple Relation) Storage Unit, we can display a pathway of interactions by referring to the binary (multiple) relations stored there. For example, as shown in
Next, protein D is found as a protein that acts on protein B1 or protein B2, and protein C is found as a protein that acts on protein B3. At this point, as described above, the search for proteins that act on protein D ends because no protein acts on protein D. At the same time, we search for proteins that act on protein C by referring to the binary (multiple) relations stored in the Binary Relation (Multiple Relation) Storage Unit. As shown in
In addition, even if protein B is extracted as having a binary relation with protein A, other proteins may intervene between protein A and protein B, as described above. In such a case, we can display the interaction pathways that may lie between protein A and protein B by referring to the binary relations (multiple relations) stored in the Binary Relation (Multiple Relation) Storage Unit (refer to
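The two searches described above (collecting the proteins that act on an input protein, and listing pathways that may intervene between two proteins) can be sketched as graph traversals over the stored relations. The following is only a minimal sketch under stated assumptions: the relation triples, the function names `upstream_pathway` and `intervening_paths`, the `max_len` limit, and the protein E are hypothetical illustrations, and relations are treated as directed from the acting protein to the protein acted upon.

```python
from collections import defaultdict, deque

# Hypothetical contents of the Binary Relation Storage Unit:
# (acting protein, verb, protein acted upon). Illustrative data only.
RELATIONS = [
    ("B1", "activate", "A"), ("B2", "bind", "A"), ("B3", "inhibit", "A"),
    ("D", "activate", "B1"), ("D", "bind", "B2"),
    ("C", "activate", "B3"), ("E", "bind", "C"),
]


def upstream_pathway(relations, start):
    """Breadth-first search for every stored relation that acts, directly
    or through other proteins, on `start`. A branch ends when no stored
    relation acts on the protein reached."""
    acts_on = defaultdict(list)  # protein acted upon -> [(actor, verb)]
    for actor, verb, target in relations:
        acts_on[target].append((actor, verb))
    edges, seen, queue = [], {start}, deque([start])
    while queue:
        target = queue.popleft()
        for actor, verb in acts_on.get(target, []):
            edges.append((actor, verb, target))
            if actor not in seen:  # expand each protein only once
                seen.add(actor)
                queue.append(actor)
    return edges


def intervening_paths(relations, source, target, max_len=3):
    """List simple paths source -> ... -> target that contain at least one
    intermediate protein and at most `max_len` edges."""
    successors = defaultdict(set)
    for actor, _verb, acted_upon in relations:
        successors[actor].add(acted_upon)
    paths = []

    def walk(node, path):
        if len(path) - 1 >= max_len:  # edge budget already used up
            return
        for nxt in sorted(successors.get(node, ())):
            if nxt == target:
                if len(path) >= 2:  # skip the direct source -> target edge
                    paths.append(path + [nxt])
            elif nxt not in path:  # keep the path simple (no cycles)
                walk(nxt, path + [nxt])

    walk(source, [source])
    return paths
```

Calling `upstream_pathway(RELATIONS, "A")` returns the edges to be drawn for protein A, and `intervening_paths` can be used to list candidate indirect routes for a relation whose verb is “bind”. Whether such a route actually explains the reported binding is not decided by this sketch.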
In addition, in the Biomedical Literature Information Processing System concerning the above embodiment, we can display the nodes that counteract interactions in a distinguishable manner. For example, a specific pathway map (pathway map of medicine A) is drawn for medicine 1, indicating the binary relations extracted based on the proteins expressed in response to medicine A, and a specific pathway map (pathway map of medicine B) is drawn for medicine 2, indicating the binary relations extracted based on the proteins expressed in response to medicine B. Here, as shown in
Next, we explain the ninth embodiment.
Next, we explain the process of drawing pathway maps in the Biomedical Literature Information Processing System concerning the ninth embodiment in reference to the flow chart of
First, Data Control Unit 10 of the Biomedical Literature Information Processing System obtains experimental results (Step S150). A detailed explanation is omitted because the process of Step S150 is the same as that of Step S130 in
Next, we extract binary relations in reference to Literature DB and Gene Expression Information DB 28 (Step S151), and store the extracted binary relations in Binary Relation Storage Unit 19 (Step S152). A detailed explanation is omitted because the process of Steps S151-S152 is the same as that of Steps S132-S133 in
Next, we evaluate whether the extraction of binary relations is finished for all the probes shown in the experimental results (Step S153). If it is not finished for all the probes, we go back to Step S151 to extract the binary relations of the next probe.
When the extraction of the binary relations for all the protein names shown in the experimental results is judged in Step S153 to be finished, the pathway map is drawn based on the binary relations stored in Binary Relation Storage Unit 19 (Step S154). For example, the representation for organs A-C concerning probes 1-5 is as shown in
In the Biomedical Literature Information Processing System concerning the ninth embodiment, we can examine actual experimental results against literature information, because the system draws pathway maps based on the multiple relations extracted in reference to Literature DB and Gene Expression Information DB, which stores gene expression information. That is, when organ-specific pathway maps and pathway maps dependent on cell origin are drawn, we can perform various analyses by analyzing and organizing the drawn pathway maps. For example, we can extract the differences and common points among the pathway maps of each organ, or between pathway maps of cancer and those of non-cancer. Consequently, we can draw the pathway map of probes expressed in specific organs by combining experimental data (for example, the Gene Expression Information Database) with the literature database (the database of literature information).
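The flow of Steps S151 to S154 and the comparison of organ-specific pathway maps can be illustrated as follows. This is a minimal sketch, not the implementation of the embodiment: the data, the function names, and in particular the rule that an edge belongs to an organ only when both of its elements are expressed in that organ are assumptions made for illustration.

```python
# Hypothetical stand-in for Gene Expression Information DB:
# probe -> organs in which the probe is expressed. Illustrative data only.
EXPRESSION_DB = {
    "probe1": {"organA", "organB"},
    "probe2": {"organA"},
    "probe3": {"organB", "organC"},
}

# Hypothetical stand-in for relations already extracted from Literature DB.
LITERATURE_RELATIONS = [
    ("probe1", "activate", "probe2"),
    ("probe1", "bind", "probe3"),
    ("probe2", "inhibit", "probe3"),
]


def extract_relations(probe, literature):
    """Step S151 (sketch): relations in which the probe takes part."""
    return [r for r in literature if probe in (r[0], r[2])]


def draw_organ_maps(probes, expression_db, literature):
    """Steps S151-S154 (sketch): return one set of edges per organ."""
    storage = set()  # plays the role of Binary Relation Storage Unit 19
    for probe in probes:  # repeated until all probes are done (Step S153)
        storage.update(extract_relations(probe, literature))  # S151, S152
    maps = {}
    for a, verb, b in storage:  # Step S154
        # Assumption: keep an edge for an organ only when both elements
        # are expressed in that organ.
        shared = expression_db.get(a, set()) & expression_db.get(b, set())
        for organ in shared:
            maps.setdefault(organ, set()).add((a, verb, b))
    return maps


def compare_maps(map_x, map_y):
    """Common and differing edges of two pathway maps (sets of edges)."""
    return {
        "common": map_x & map_y,
        "only_x": map_x - map_y,
        "only_y": map_y - map_x,
    }
```

Representing each pathway map as a set of edges keeps the extraction of common and different points between two organs, or between cancer and non-cancer maps, to plain set operations.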
In addition, in the above embodiment, we have extracted multiple relations from literature in the biomedical field, based on the verbs that indicate interactions between elements, and have drawn pathway maps by setting protein and gene names as elements (nodes). We can also draw interactions between elements (nodes) on pathway maps for literature in the field of social science. In this case, we can indicate human relationships (relative, blood relationship, lover, married couple, friends, and family name) and personal connections on pathway maps by setting a “human” in the literature as an element (node), extracting multiple relations based on the verbs that indicate interactions between elements, and drawing pathway maps. These pathway maps can be used effectively as information for understanding human relationships and personal connections in fields such as sports, movies, and politics.
In addition, we can draw interactions between elements (nodes) on pathway maps for literature in the economic field. In this case, we can indicate relationships between companies (capital, business tie-up, flow of money, and personal relationships), capital ties, etc. on pathway maps by setting a company name in the literature as an element (node), extracting multiple relations based on the verbs that indicate interactions between elements, and drawing pathway maps. These pathway maps can be used effectively as one item of information for making decisions in business and the stock market.
In addition, we can draw interactions between elements (nodes) on pathway maps for literature in the military field. In this case, we can indicate the background between cases, organs, cultures, economy, personal relationships, etc. on pathway maps by setting a case name in the literature as an element (node), extracting multiple relations based on the verbs that indicate interactions between elements, and drawing pathway maps. These pathway maps can be used effectively as information for analyzing information, analyzing historical information, and making decisions.
In addition, we can draw interactions between elements (nodes) on pathway maps for literature in the urban planning field. In this case, we can indicate relationships of electric power, water lines, sewage, oil, and traffic on pathway maps by setting a city name in the literature as an element (node), extracting multiple relations based on the verbs that indicate interactions between elements, and drawing pathway maps. These pathway maps can be used effectively as information for making decisions in business and the stock market.
In addition, we can draw interactions between elements (nodes) on pathway maps for literature in the legal field. In this case, we can indicate relationships between the letter of the law and systems of law on pathway maps by setting a law name in the literature as an element (node), extracting multiple relations based on the verbs that indicate interactions between elements, and drawing pathway maps. These pathway maps can be used effectively as information for making decisions in business and politics.
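The common point of these applications is that only the element dictionary and the list of interaction verbs change from field to field. As a rough sketch of this idea, assuming a simple pattern match rather than the natural language processing of the embodiments, the following hypothetical function `extract_triples` takes the element names and verb stems as parameters. The company names and verbs in the usage note are invented for illustration.

```python
import re


def extract_triples(sentences, elements, verbs):
    """Return (element, verb, element) for each sentence in which a known
    interaction verb appears between two known element names.

    Simplification: only the first match per sentence is kept, and a verb
    matches when the word starts with the given stem (e.g. "acquire"
    matches "acquired")."""
    # Longest names first, so that a longer name wins over its own prefix.
    names = "|".join(
        re.escape(e) for e in sorted(elements, key=len, reverse=True)
    )
    stems = "|".join(
        re.escape(v) for v in sorted(verbs, key=len, reverse=True)
    )
    pattern = re.compile(
        rf"\b({names})\b.*?\b({stems})\w*\b.*?\b({names})\b"
    )
    triples = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match:
            triples.append((match.group(1), match.group(2), match.group(3)))
    return triples
```

For example, with the elements `{"Acme Corp", "Beta Ltd"}` and the verbs `{"acquire"}`, the sentence "Acme Corp acquired Beta Ltd in 2003." yields one triple that can be drawn as an edge between two company nodes. Replacing the dictionaries with person names, city names, or law names gives the corresponding maps for the other fields.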
In the above explanation of this invention, we have described English-language literature, but the invention can be applied in the same way to various languages used historically or at the present day (for example, Russian, Chinese, Korean, Japanese, Latin, etc.) by using standard current natural language processing technology.
The present disclosure relates to content contained in Japanese Patent Application No. 2004-097914 filed on Mar. 30, 2004, the entire disclosure of which is incorporated here by reference.
As stated above, the literature information processing system of this invention is suitable for analyzing literature information by natural language processing and expeditiously outputting the analysis results.
Number | Date | Country | Kind |
---|---|---|---|
2004-097914 | Mar 2004 | JP | national |

Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP05/06025 | Mar 2005 | US |
Child | 11528452 | Sep 2006 | US |