Exemplary embodiments according to the present invention will be explained in detail with reference to the accompanying drawings in the following sections 1 to 4:
1. general outline of an interaction evaluating apparatus (
2. a sub-unit forming unit in the interaction evaluating apparatus (
3. a learning unit in the interaction evaluating apparatus (
4. a prediction-target generating unit and an executing unit in the interaction evaluating apparatus (
With regard to a general outline of an interaction evaluating apparatus, description will be made of a hardware configuration, a functional configuration, etc., of the interaction evaluating apparatus.
As shown in
The CPU 101 is responsible for overall control of the interaction evaluating apparatus. The ROM 102 stores programs such as a boot program. The RAM 103 is used for a work area of the CPU 101. The HDD 104 controls read/write of data from/to the HD 105 under the control of the CPU 101. The HD 105 stores data written under the control of the HDD 104.
The FDD 106 controls read/write of data from/to the FD 107 under the control of the CPU 101. The FD 107 stores data written under the control of the FDD 106 and allows the interaction evaluating apparatus to read the data stored in the FD 107.
The removable recording medium may be a compact-disc read-only memory (CD-ROM), a compact-disc recordable (CD-R), a compact-disc rewritable (CD-RW), a magneto-optical (MO) disk, a digital versatile disk (DVD), or a memory card, in addition to the FD 107. The display 108 displays a cursor, icons, or tool boxes as well as data such as documents, images, and function information. The display 108 may be, for example, a cathode ray tube (CRT), a thin film transistor (TFT) liquid crystal display, or a plasma display.
The I/F 109 is connected via a communication line to a network 114 such as the Internet and is connected to other apparatuses via this network 114. The I/F 109 is responsible for interfacing the network 114 with the inside of the apparatus and controls input/output of data from/to an external apparatus. The I/F 109 may be a modem or a LAN adaptor, for example.
The keyboard 110 is disposed with keys for entering characters, numeric characters, various instructions, etc., to enter data. A touch-panel type input pad, a numeric keypad, etc., may be used instead. The mouse 111 moves a cursor, selects an area or moves and resizes a window, etc. A trackball or joystick may be used instead, as long as similar functions for a pointing device are included.
The scanner 112 reads an image optically and captures image data into the interaction evaluating apparatus. The scanner 112 may have an OCR function. The printer 113 prints image data and document data. The printer 113 may be a laser printer or ink-jet printer, for example.
The family DB 210 is a database that stores families, each of which is a group of proteins having a similar nature. In other words, proteins belonging to one family have a similar nature, and it is believed that proteins in a protein complex are replaceable among the proteins in one family. A representative example of such a database is InterPro (http://www.ebi.ac.uk).
The sub-unit forming unit 201 performs a sub-unit formation processing on the complex pair information 3300 as shown in
The aforementioned family has a hierarchical structure and includes proteins belonging to different families. The sub-unit forming unit 201 focuses on a rather large family and divides proteins in the large family into mutually exclusive families to categorize a group of proteins included in a protein complex as a sub-unit, which is an exclusive group. This exclusive group is referred to as an exclusive family. The complex pair information categorized in terms of this exclusive family is referred to as sub-unit complex pair information 230.
The gene ontology describes protein attributes, such as biological processes, cellular localization, and molecular functions, that characterize proteins and are annotated by humans. The GODB 220 stores information relating to such protein attributes.
To the learning unit 202, the sub-unit complex pair information 230 is input, and a prediction rule set 240 is output from the learning unit. Specifically, the learning unit 202 adds protein attributes to the sub-units in the sub-unit complex pair information 230 based on the GODB 220. Thus, a structure to distinguish a sub-unit pair including a targeted interaction attribute from a sub-unit pair not including the targeted interaction attribute is obtained.
This structure is a prediction rule for the interaction attribute for each sub-unit. The prediction rule is expressed by “condition→conclusion”. The condition is set as “a protein attribute of a sub-unit in a protein complex is XXX” and the conclusion is obtained as “an interaction type is YYY”. The learning unit 202 outputs the prediction rules to build the prediction rule set 240. The prediction rule set 240 is stored in a recording medium such as the RAM 103 and the HD 105 shown in
That is, if the prediction rule is established for any combination of sub-units in a protein complex pair, the prediction rule is assumed to be applied to the entire protein complex pair and it is considered that the interaction attribute corresponding to the prediction rule exists.
To the prediction-target generating unit 203, complex pair information 2400 of a prediction target is input. The complex pair information 2400 includes information on protein complex pairs with known interaction attributes and protein complex pairs with unknown interaction attributes. The prediction-target generating unit 203 performs a sub-unit formation processing on the complex pair information 2400 to generate prediction target data 250.
To the executing unit 204, the prediction target data 250 obtained from the prediction-target generating unit 203 is input. The executing unit 204 calculates an attribute score as an execution result based on the prediction rule set 240. The attribute score is a validity evaluation of an interaction attribute of a sub-unit pair. The prediction target data 250 is data identified by the complex pair information 2400 of which an interaction attribute between protein complexes or between sub-units is unknown.
By calculating the attribute score, for a protein complex pair of which an interaction attribute is known, the responsible sub-unit pair can be estimated. For a protein complex pair of which an interaction attribute is unknown, both the interaction attribute and the responsible sub-unit pair can be estimated at a time.
The family DB 210 and the GODB 220 realize the functions thereof with a recording medium such as the ROM 102, the RAM 103, and the HD 105 shown in
Description has been made for the general outline of the interaction evaluating apparatus with reference to
The sub-unit forming unit 201 forms sub-units of proteins in each protein complex identified by the complex pair information 3300.
In the example shown in
In the example shown in
To the exclusive-family generating unit 501, the family list FLi is input. The exclusive-family generating unit 501 identifies a family of the highest conception that represents the nature of the protein Pi. The identified family is referred to as an exclusive family. Specifically, the exclusive-family generating unit 501 includes a family-list extracting unit 511, a lower-bound-list generating unit 512, a tracking/linking unit 513, and an exclusive-family identifying unit 514.
The family-list extracting unit 511 extracts the family list FLi of the protein Pi from the family DB 210. Specifically, the extraction is performed in the order from the protein P1 with the gene ID: i=1.
The lower-bound-list generating unit 512 generates a lower-bound list from the family list FLi extracted by the family-list extracting unit 511. Specifically, the lower-bound list is generated by sequentially adding the family list FLi being extracted, and by sorting the lists in ascending order of the families, for example, in the order of alphabetical letters a, b, . . . , added to the families Fa, Fb, . . . .
The tracking/linking unit 513 performs a tracking (tracing) process and a linking process. The tracking process is a process of correlating families in one family list FLi. Specifically, families are correlated by tracking a higher-order family from a family in the family list FLi sorted in ascending order.
The linking process is a process of correlating different family lists. The linking process is performed on family lists not overlapping with each other. In the linking process, when a family list that overlaps with both of the family lists not overlapping with each other is extracted, the highest-order families in the family lists not overlapping with each other are correlated by performing the tracking process.
The exclusive-family identifying unit 514 identifies the exclusive family for each protein Pi from the lower-bound list including families correlated by the tracking/linking unit 513. For example, the highest-order family of the family list FLi of the protein Pi is identified as the exclusive family.
If the highest-order family in the family list FLi is used as a correlated source for the correlation with another family, the correlated destination family is identified as the exclusive family. If a single family belongs to the family list FLi and if the family is correlated with no families, the family is directly identified as the exclusive family. The identified exclusive family is stored in the exclusive family DB 500 along with the gene ID: i of the protein Pi.
The lower-bound list 602 is an intermediate product for creating the exclusive family and is updated every time the family list FLi is extracted. For example, when the family list FL1 of the protein P1 is extracted, a lower-bound list including only the family list FL1 is acquired.
When the family list FL2 of the protein P2 is extracted, the family list FL2 is added to the lower-bound list including only the family list FL1. When the family list FL3 of the protein P3 is extracted, the family list FL3 is added to the lower-bound list including the family lists FL1 and FL2. When the family list FL4 of the protein P4 is extracted, the family list FL4 is added to the lower-bound list including the family lists FL1 to FL3. Thus, the lower-bound list 602 is acquired.
In the lower-bound list 602, the family list FL4 overlaps with the family list FL1. That is, a family Fb is a family belonging to the family lists FL1 and FL4. Therefore, the tracking/linking unit 513 correlates the family Fb with a family Fa by the tracking from the family Fb to the family Fa, which is higher in the ascending order in the family list FL1 (an arrow Tba in
Similarly, in the lower-bound list 602, the family list FL4 overlaps with the family list FL2. A family Fe in the family list FL4 is a family belonging to the family lists FL2 and FL4. Therefore, the tracking/linking unit 513 correlates the family Fe with a family Fc by the tracking from the family Fe to the family Fc, which is higher in the ascending order in the family list FL2 (an arrow Tec in
Since the family list FL2 includes a family Ff, which is lower than the family Fe in the ascending order, the tracking/linking unit 513 correlates the family Ff with the family Fe by the tracking from the family Ff to the family Fe (an arrow Tfe in
In the lower-bound list 602, the family list FL1 and the family list FL2 do not overlap, while the family list FL4 overlaps with both the family lists FL1 and FL2. Therefore, the family list FL1 and the family list FL2 can be linked through the family list FL4.
Therefore, the tracking/linking unit 513 correlates the family list FL2 with the family list FL1 by the linking from the family Fc, which is high in the ascending order in the family list FL2, to the family Fa, which is high in the ascending order in the family list FL1 (an arrow Lca in
A chart 603 on the right side in
For the family list FL2 of the protein P2, FL2={Fc, Fe, Ff}; the family Ff is correlated with the higher-order family Fe in the tracking process (the arrow Tfe in
For the family list FL3 of the protein P3, FL3={Fd}, and since the family Fd is correlated with no family, the family Fd is directly defined as the exclusive family of the protein P3.
For the family list FL4 of the protein P4, FL4={Fb, Fe}, and each of the families Fb and Fe is correlated with the family Fa as described above. Therefore, the exclusive family of the protein P4 is the family Fa.
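The tracking and linking steps described above can be approximated by a union-find structure in which every family in one family list is merged into a single set, and the family that sorts first in ascending order is kept as the representative. The following is a minimal sketch, not the apparatus's actual procedure. The function name `exclusive_families` is hypothetical, and the contents of FL1 are assumed to be {Fa, Fb}, since the text states only that Fb belongs to FL1 and is tracked to Fa.

```python
def exclusive_families(family_lists):
    """Map each protein to one representative (exclusive) family.

    Assumption: the highest-order family is the one that sorts first
    in ascending order (Fa before Fb, and so on).
    """
    parent = {}

    def find(family):
        parent.setdefault(family, family)
        while parent[family] != family:
            parent[family] = parent[parent[family]]  # path halving
            family = parent[family]
        return family

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            higher, lower = sorted((root_a, root_b))
            parent[lower] = higher  # keep the highest-order family as root

    # Tracking: correlate families inside one list.
    # Linking happens implicitly when a list overlaps two other lists.
    for families in family_lists.values():
        ordered = sorted(families)
        if not ordered:
            continue
        find(ordered[0])
        for family in ordered[1:]:
            union(ordered[0], family)

    # A protein with an empty family list gets no exclusive family.
    return {
        protein: (find(min(families)) if families else None)
        for protein, families in family_lists.items()
    }


family_lists = {
    "P1": ["Fa", "Fb"],        # assumed contents of FL1
    "P2": ["Fc", "Fe", "Ff"],  # FL2
    "P3": ["Fd"],              # FL3
    "P4": ["Fb", "Fe"],        # FL4
}
result = exclusive_families(family_lists)
print(result)
```

Under these assumptions the proteins P1, P2, and P4 all resolve to the family Fa, while the protein P3 keeps the family Fd, which agrees with the example above.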
The exclusive-family generating unit 501 stores “gene ID”, “protein (name)”, and “exclusive family” that constitute one record for each protein, in the exclusive family DB 500.
The complex-pair-information acquiring unit 502 shown in
Specifically, the exclusive family can be identified by using information of a protein included in the protein complexes CL1 and CR2 (e.g., gene ID: i and protein (name) Pi) as a clue to extract the exclusive family of the protein from the exclusive family DB 500.
The group processing unit 504 executes grouping on proteins from which the exclusive families are identified, and makes groups of proteins for each exclusive family. The group of proteins is the sub-unit.
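The grouping performed by the group processing unit 504 can be illustrated as a simple group-by operation keyed on the exclusive family. The following is a minimal sketch; the function name `form_sub_units`, the dictionary standing in for the exclusive family DB 500, and the sample protein identifiers are all hypothetical.

```python
from collections import defaultdict


def form_sub_units(proteins, exclusive_family_db):
    """Group proteins by exclusive family; each group is one sub-unit."""
    sub_units = defaultdict(list)
    unassigned = []  # proteins with no record in the exclusive family DB
    for protein in proteins:
        family = exclusive_family_db.get(protein)
        if family is None:
            unassigned.append(protein)
        else:
            sub_units[family].append(protein)
    return dict(sub_units), unassigned


# Hypothetical stand-in for the exclusive family DB 500.
db = {"P101": "F10", "P102": "F10", "P111": "F11"}
sub_units, unassigned = form_sub_units(["P101", "P111", "P102", "P999"], db)
print(sub_units, unassigned)
```

Proteins without a record are kept in a separate list here rather than dropped; how such proteins are handled in practice is a design choice this sketch does not settle.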
As shown in
An exclusive family F10 is identified for the proteins P101 to P104; an exclusive family F11 is identified for the proteins P111 to P113; an exclusive family F20 is identified for the proteins P201 to P203; an exclusive family F21 is identified for the proteins P221, P231; and an exclusive family is not identified for the proteins P221, P231 since the exclusive family DB 500 has no corresponding exclusive family.
As shown in
The exclusive family is then extracted from the exclusive family DB 500 for each protein of the other protein complex CR2 (step S905) and the group processing unit 504 forms the sub-units by using the exclusive families to organize the proteins with the identified exclusive families (step S906).
The lower-bound-list generating unit 512 generates (updates) the lower-bound list from the group of the extracted family lists FLi (step S1003). The tracking/linking unit 513 performs the tracking process and the linking process of the lower-bound list (step S1004) and the gene ID: i is incremented (step S1005).
If i>n is not satisfied (step S1006: NO), the procedure goes back to step S1002. On the other hand, if i>n is satisfied (step S1006: YES), the lower-bound list is completed and gene ID: i is defined as i=1 again (step S1007). The exclusive-family identifying unit 514 identifies the exclusive family of the protein Pi (step S1008).
The identified exclusive family and the information (gene ID: i and protein name) of the protein Pi are output to the exclusive family DB 500 as a record (step S1009). The gene ID: i is then incremented (step S1010). If i>n is not satisfied (step S1011: NO), the procedure goes back to step S1008. On the other hand, if i>n is satisfied (step S1011: YES), the procedure goes to step S902.
Since the aforementioned sub-unit forming unit 201 can classify groups of proteins included in the protein complexes CL1 and CR2 into the sub-units that are exclusive groups, the sub-units can be identified even when the sub-units, i.e., the groups of proteins constituting a variant, are unknown. By acquiring the sub-units, the learning unit 202 can extract the prediction rules with high accuracy.
As described above, the learning unit 202 uses the sub-unit complex pair information 230 as input information and refers to the GODB 220 to output the prediction rule set 240.
A GO term list GOi is attribute information of the protein Pi and has a hierarchical tree structure. Each node in the GO term list GOi represents the protein attribute information of the protein Pi. Numeric characters in the nodes are attribute identification information (attribute number) j (j=1 to m). The protein attribute information is indicated by Aj.
A node with hatching shown in
To the learning data generator 1201, the sub-unit complex pair information 230 is input, and the learning data generator 1201 generates learning data from which the prediction rule is extracted based on the GODB 220. Specifically, the learning data generator 1201 includes a sub-unit extracting unit 1211, a protein-attribute detecting unit 1212, a sub-unit attribute generating unit 1213, and a learning-data generating unit 1214.
The sub-unit extracting unit 1211 extracts a sub-unit from the sub-unit complex pair information 230. For example, if the extraction source is the sub-unit complex pair information 230 shown in
The protein-attribute detecting unit 1212 detects from GODB 220 the protein attribute information of the proteins belonging to the sub-unit extracted by the sub-unit extracting unit 1211. For example, if the protein Pi is included in the extracted sub-unit, the protein attribute information A1 to A3, A5, A6, and A10 is detected from the GO term list GOi shown in
The sub-unit attribute generating unit 1213 generates the protein attribute information relating to the sub-unit (hereinafter, “sub-unit attribute information”) from the protein attribute information Aj detected by the protein-attribute detecting unit 1212. Specifically, when focusing on all of the proteins in the sub-unit, the sub-unit attribute information for protein attribute information Aj can be acquired by aggregating certain protein attribute information Aj.
For example, when a flag is set to “1” if certain protein attribute information Aj is detected for a protein in the sub-unit and the flag is set to “0” if the information is not detected, the flags of all the proteins in the sub-unit can be aggregated using an aggregating condition such as logical multiplication, logical addition, or majority decision, and the aggregation result can be used as the sub-unit attribute information for the protein attribute information Aj.
For example, with regard to the detection result of the protein attribute information A1, since the proteins P101, P103, P104 are “1” and the protein P102 is “0”, the aggregation result is “0” if the aggregating condition is logical multiplication (AND); the aggregation result is “1” if the aggregating condition is logical addition (OR); and the aggregation result is “1” if the aggregating condition is majority decision. The aggregated protein attribute information Aj will hereinafter be indicated by sub-unit attribute information Bj.
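The three aggregating conditions can be sketched as follows, using the detection flags of the protein attribute information A1 given above. The function name `aggregate` is hypothetical, and the majority decision is assumed to require strictly more than half of the flags to be “1”, since the text does not say how a tie is resolved.

```python
def aggregate(flags, condition):
    """Aggregate per-protein detection flags (0 or 1) into one sub-unit flag."""
    flags = list(flags)
    if condition == "and":       # logical multiplication
        return int(all(flags))
    if condition == "or":        # logical addition
        return int(any(flags))
    if condition == "majority":  # assumption: a tie counts as 0
        return int(sum(flags) * 2 > len(flags))
    raise ValueError(f"unknown aggregating condition: {condition}")


# Detection result of A1 for the proteins P101 to P104, as in the text.
flags_a1 = {"P101": 1, "P102": 0, "P103": 1, "P104": 1}
b1_and = aggregate(flags_a1.values(), "and")
b1_or = aggregate(flags_a1.values(), "or")
b1_majority = aggregate(flags_a1.values(), "majority")
print(b1_and, b1_or, b1_majority)
```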
The learning-data generating unit 1214 shown in
The learning data 1410 include aggregation result information 1411 and 1412. The learning data 1420 include aggregation result information 1421 and 1422. The learning data 1430 include aggregation result information 1431 and 1432.
For example, in the learning data 1410, the protein complex CL1 has the sub-units SL10 and SL11, and the protein complex CR2 has the sub-units SR20 to SR23. Therefore, the learning-data generating unit 1214 establishes eight (2×4) sub-unit pairs between the protein complexes CL1 and CR2.
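The combination construction amounts to a Cartesian product of the sub-units of the two protein complexes. The following is a minimal sketch; the variable names are hypothetical, and the intermediate names SR21 and SR22 are inferred from the range “SR20 to SR23”.

```python
from itertools import product

sub_units_cl1 = ["SL10", "SL11"]
sub_units_cr2 = ["SR20", "SR21", "SR22", "SR23"]

# Every sub-unit of CL1 is paired with every sub-unit of CR2.
sub_unit_pairs = list(product(sub_units_cl1, sub_units_cr2))
print(len(sub_unit_pairs))
```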
In
The learning data 1410, 1420 and 1430 include interaction attribute information in addition to the aggregation result information. The interaction attribute information is taken over from the source complex pair information 3300. The interaction attribute information includes interaction attribute type information.
Specifically, a pair of the protein complexes CL1 and CR2 is associated with interaction attribute type information 1413 in the learning data 1410; a pair of the protein complexes CL3 and CR4 is associated with interaction attribute type information 1423 in the learning data 1420; and a pair of the protein complexes CL5 and CR6 is associated with interaction attribute type information 1433 in the learning data 1430. A circle mark in the interaction attribute type information indicates a relevant interaction type.
For example, the interaction type of the learning data 1410 is an interaction type INk; the interaction type of the learning data 1420 is an interaction type INk; and the interaction type of the learning data 1430 is an interaction type INK. An interaction type ID is indicated by k (k=1 to K).
The interaction attribute information includes interaction direction information. Referring to
The prediction-rule extracting unit 1202 extracts the prediction rule from the learning data set 1210. Specifically, the prediction-rule extracting unit 1202 includes a rule-match processing unit 1221 and a prediction-rule determining unit 1222. The prediction rule is represented by “condition→conclusion” and three types of the conditions are assumed because a protein complex pair is concerned.
The three types include the case in which only the sub-unit attribute information of the sub-units in the protein complex giving the interaction is used in the “condition”, the case in which only the sub-unit attribute information of the sub-units in the protein complex receiving the interaction is used in the “condition”, and the case in which the sub-unit attribute information of the sub-units in both protein complexes is used in the “condition”.
The rule-match processing unit 1221 applies the aforementioned three types of the “conditions” to perform a rule match process. In the rule match process, so-called association analysis is performed. A parameter relating to the association analysis is obtained and this parameter is used to calculate a credibility degree and a support degree.
The rule match process result of
The rule match process result of
First, the detection number of the sub-unit is counted for each piece of the sub-unit attribute information Bj. Specifically, when focusing on the sub-unit attribute information B1 of the protein complex CL1 in the aggregation result information 1411 of the learning data 1410, the flag of the sub-unit SL10 is “0” since the sub-unit attribute information B1 is not detected for the sub-unit SL10 and the flag of the sub-unit SL11 is “1” since the sub-unit attribute information B1 is detected for the sub-unit SL11.
The total number of the sub-units is two in the aggregation result information 1411 (the sub-unit SL10 and the sub-unit SL11), and since the detected sub-unit with the flag of “1” is the sub-unit SL11, the detection number is one. In
The detection number of the sub-unit for a plurality of pieces of the sub-unit attribute information is counted for each protein complex CL1, CL3, and CL5. Specifically, when focusing on the sub-unit attribute information B1 and Bj of the protein complex CL1 in the aggregation result information 1411 of the learning data 1410, the flag of the sub-unit SL10 is “0” since the sub-unit attribute information B1 and Bj is not detected for the sub-unit SL10, and the flag of the sub-unit SL11 is “1” since the sub-unit attribute information B1 and Bj is detected for the sub-unit SL11.
The total number of the sub-units is two in the aggregation result information 1411 (the sub-unit SL10 and the sub-unit SL11), and since the detected sub-unit with the flag of “1” is the sub-unit SL11, the detection number is one. In
A parameter for calculating the credibility degree is calculated. The credibility degree is a rate of the occurrence of “conclusion” when “condition” is generated and can be expressed by the following equation.
COjk=xjk/Xjk (1)
In the case of the sub-unit attribute information Bj and the interaction type INk, COjk is the credibility degree, xjk is the detection number satisfying both “condition” and “conclusion”, and Xjk is the detection number satisfying “condition”.
Specifically, the detection number Xjk is the total detection number of the sub-unit attribute information Bj, which is the condition. For example, in the sub-unit attribute information Bj, the detection number of the protein complex CL1 is “2”; the detection number of the protein complex CL3 is “1”; the detection number of the protein complex CL5 is “1”; and, therefore, Xjk=4 is obtained.
On the other hand, the detection number xjk must also satisfy “conclusion”. Therefore, in
Although it is important to acquire the credibility degree COjk to judge the value of the extracted prediction rule, even when the credibility degree COjk is high, the extracted prediction rule has an extremely low number of occurrences if a support degree SUjk is low. Therefore, it is also important to calculate and evaluate the support degree SUjk.
The support degree SUjk is a rate of the detection number concurrently satisfying “condition” and “conclusion” to the total sub-unit number and can be expressed by the following Equation 2.
SUjk=xjk/Njk (2)
In the case of the sub-unit attribute information Bj and the interaction type INk, Njk is the total sub-unit number in the sub-unit attribute information Bj. Since the total sub-unit number of each protein complex CL1, CL3, and CL5 is “2”, the total sub-unit number Njk in the sub-unit attribute information Bj is Njk=6. On the other hand, njk is the number of “conclusion” corresponding to “condition”. In
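Equations 1 and 2 can be evaluated directly once the three counts are known. The following is a minimal sketch using Xjk=4 and Njk=6 from the text; the value xjk=2 is a hypothetical placeholder, because the count taken from the figure is not reproduced here, and the function names are likewise hypothetical.

```python
def credibility_degree(x_jk, X_jk):
    """Equation 1: COjk = xjk / Xjk (0.0 when the condition never occurs)."""
    return x_jk / X_jk if X_jk else 0.0


def support_degree(x_jk, N_jk):
    """Equation 2: SUjk = xjk / Njk (0.0 when there are no sub-units)."""
    return x_jk / N_jk if N_jk else 0.0


X_jk = 4  # total detection number of Bj, from the text
N_jk = 6  # total sub-unit number, from the text
x_jk = 2  # hypothetical count satisfying both condition and conclusion

co_jk = credibility_degree(x_jk, X_jk)
su_jk = support_degree(x_jk, N_jk)
print(co_jk, su_jk)
```

Returning 0.0 for a zero denominator is a choice made for this sketch only; the text does not describe how an attribute that is never detected is treated.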
In the example shown in
In the example shown in
More specifically, for example, in the protein complex pair {CL1, CR2}, with regard to the number of sub-unit pairs satisfying that the sub-unit attribute information B1 exists in the protein complex CL1 and that the sub-unit attribute information Bj exists in the protein complex CR2, referring to
The prediction-rule determining unit 1222 determines the prediction rule based on the credibility degree COjk and the support degree SUjk acquired by the rule-match processing unit 1221. Specifically, in the case of the sub-unit attribute information Bj and the interaction type INk, it is determined whether the credibility degree COjk is equal to or greater than a threshold value COt with regard to a rule meaning that “if sub-unit attribute information of one sub-unit is Bj, the interaction type is INk” (hereinafter, “Bj→INk”). If the credibility degree COjk is equal to or greater than the threshold value COt, “Bj→INk” is determined as the prediction rule.
The prediction accuracy is improved by considering the support degree SUjk. Therefore, if the credibility degree COjk is equal to or greater than the threshold value COt, it may be determined whether the support degree SUjk is equal to or greater than a threshold value SUt. If the credibility degree COjk is equal to or greater than the threshold value COt and if the support degree SUjk is equal to or greater than the threshold value SUt, “Bj→INk” may be determined as the prediction rule.
The score calculating unit 1203 calculates a score of the prediction rule determined by the prediction-rule determining unit 1222. Specifically, for example, the score calculating unit 1203 calculates a log-of-odds (LOD) score. In the case of the sub-unit attribute information Bj and the interaction type INk, the rate of the interaction type INk is njk/Njk. The LOD score is a score for evaluating how great the credibility degree COjk is relative to the rate of the interaction type INk (njk/Njk).
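The two-stage threshold test can be sketched as a filter over candidate rules. The following is a minimal sketch; the function name `determine_rules`, the candidate values, and the thresholds are all hypothetical, and the support check is applied only when a threshold SUt is supplied, mirroring the optional wording above.

```python
def determine_rules(candidates, co_t, su_t=None):
    """Keep the rules "Bj -> INk" whose degrees reach the thresholds.

    candidates maps (attribute, interaction_type) to (COjk, SUjk).
    """
    rules = []
    for (attribute, interaction_type), (co, su) in candidates.items():
        if co < co_t:
            continue  # credibility degree below COt
        if su_t is not None and su < su_t:
            continue  # support degree below SUt
        rules.append((attribute, interaction_type))
    return rules


# Hypothetical candidates and thresholds, for illustration only.
candidates = {
    ("B1", "IN1"): (0.9, 0.05),
    ("B2", "IN1"): (0.8, 0.30),
    ("B3", "IN1"): (0.4, 0.50),
}
by_credibility = determine_rules(candidates, co_t=0.7)
by_both = determine_rules(candidates, co_t=0.7, su_t=0.1)
print(by_credibility, by_both)
```

In this illustration the rule for B1 passes the credibility test alone but is rejected once the support threshold is applied, which is the situation the support degree is meant to catch.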
That is, the LOD score represents the extent of abnormality about likelihood representing how frequently the prediction rule occurs, and the greater the LOD score is, the better the prediction rule reflects characteristics. The LOD score can be calculated by the following Equation 3.
The score calculating unit 1203 sorts the prediction rules in the order from the highest calculated score to rank the prediction rules.
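The scoring and ranking can be sketched as follows. Equation 3 itself is not reproduced in this text, so the form used here, the logarithm of the ratio of the credibility degree COjk to the rate njk/Njk, is an assumption drawn only from the description that the score evaluates how great COjk is relative to njk/Njk. The natural logarithm, the function names, and the sample values are also assumptions.

```python
import math


def lod_score(co_jk, n_jk, N_jk):
    """Assumed form: log(COjk / (njk / Njk)); not confirmed by the text."""
    return math.log(co_jk / (n_jk / N_jk))


def rank_rules(rules):
    """Sort rules from the highest LOD score to the lowest."""
    scored = [(name, lod_score(co, n, N)) for name, (co, n, N) in rules.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical rules: name -> (COjk, njk, Njk).
rules = {
    "B2->IN1": (0.5, 3, 6),
    "B1->IN1": (0.8, 2, 6),
}
ranked = rank_rules(rules)
for name, score in ranked:
    print(name, round(score, 3))
```

Under this assumed form, a rule whose credibility degree merely equals the background rate scores zero, and a rule that occurs more often than the background rate scores above zero.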
Specifically, for example, in the learning data set 1210 shown in
For example, in the learning data set 1210 shown in
The score calculating unit 1203 calculates the LOD score and sorts the prediction rules in the order from the highest score to rank the prediction rules (step S1908). The ranked prediction rule set 240 is stored (step S1909).
The attribute number j of the protein attribute information Aj is set to j=1 (step S2003), and by referring to the GODB 220, the protein-attribute detecting unit 1212 detects the protein attribute information Aj of the proteins in the extracted sub-unit (step S2004). It is determined whether j=m is achieved (step S2005), and if j=m is not achieved (step S2005: NO), j is incremented (step S2006) and the procedure goes back to step S2004.
On the other hand, if j=m is achieved (step S2005: YES), the procedure goes back to step S2001. At step S2001, if the unprocessed sub-unit does not exist (step S2001: NO), it is determined whether an unprocessed sub-unit exists for the generation of the sub-unit attribute information Bj (step S2007). If the unprocessed sub-unit exists (step S2007: YES), the unprocessed sub-unit is extracted (step S2008).
The attribute number j of the sub-unit attribute information Bj is set to j=1 (step S2009), and the sub-unit attribute generating unit 1213 generates the sub-unit attribute information Bj (step S2010).
It is then determined whether j=m (m is the maximum attribute number) is achieved (step S2011), and if the j=m is not achieved (step S2011: NO), j is incremented (step S2012) and the procedure goes back to step S2010.
On the other hand, if j=m is achieved (step S2011: YES), the procedure goes back to step S2007. At step S2007, if the unprocessed sub-unit does not exist (step S2007: NO), the learning-data generating unit 1214 can perform combination construction (step S2013) to acquire the learning data set 1210 shown in
The prediction-rule determining unit 1222 performs the prediction rule determination process (step S2103). It is determined whether k=K is achieved (step S2104), and if k=K is not achieved (step S2104: NO), k is incremented (step S2105) and the procedure goes back to the rule match process at step S2102. On the other hand, if k=K is achieved (step S2104: YES), the procedure goes to step S1904.
If this prediction rule extraction process is a process performed at step S1905, the procedure goes to step S1906, and if this process is performed at step S1907, the procedure goes to step S1908.
The detection number xjk, the detection number Xjk, and the total sub-unit number Njk are counted (step S2203). These parameters are used to calculate the credibility degree COjk (step S2204) and the support degree SUjk (step S2205).
It is then determined whether j=m is achieved (step S2206), and if j=m is not achieved (step S2206: NO), j is incremented (step S2207) and the procedure goes back to step S2202. On the other hand, if j=m is achieved (step S2206: YES), the procedure goes to step S2103.
On the other hand, if COjk≧COt is achieved (step S2302: YES), it is determined whether SUjk≧SUt is achieved (step S2303). If SUjk≧SUt is not achieved (step S2303: NO), the procedure goes to step S2305.
If SUjk≧SUt is achieved (step S2303: YES), the rule “Bj→INk” is determined as the prediction rule (step S2304), and the procedure goes to step S2305. At step S2305, it is determined whether j=m is achieved, and if j=m is not achieved (step S2305: NO), j is incremented (step S2306) and the procedure goes back to step S2302. If j=m is achieved (step S2305: YES), the procedure goes to step S2104.
In the aforementioned rule match process (step S2102), for the convenience of description, the number of sub-units with the rule match is detected for one sub-unit attribute information Bj at step S2202, and the case of using a plurality of pieces of the sub-unit attribute information shown in
In this way, the aforementioned learning unit 202 can extract reliable rules from the rules acquired from the sub-unit complex pair information 230.
As described above, to the prediction-target generating unit 203, the complex pair information 2400 of a prediction target is input. The prediction-target generating unit 203 forms sub-units from the complex pair information 2400 and finally creates the prediction target data 250.
To the executing unit 204, the prediction target data 250 is input, and the executing unit 204 refers to the prediction rule set 240 acquired by the learning unit 202 to calculate the execution result, i.e., the attribute score, which is a validity evaluation of an interaction attribute of a sub-unit pair.
As described above, the sub-unit forming unit 201 generates sub-unit complex pair information 2410 from the prediction target complex pair information 2400.
The learning data generator 1201 uses the sub-unit complex pair information 2410 as input information and refers to the GODB 220 to generate the prediction target data 250 by the same process as that for the learning data. Therefore, the prediction target data 250 has the same data structure as the aforementioned learning data.
The executing unit 204 includes a prediction-target acquiring unit 2401, a highest-order-rule extracting unit 2402, a conformity determining unit 2403, an attribute-credibility calculating unit 2404, a responsible sub-unit pair/interaction attribute identifying unit 2405, and an output unit 2406. The prediction-target acquiring unit 2401 acquires the prediction target data 250.
The highest-order-rule extracting unit 2402 shown in
The conformity determining unit 2403 determines whether the prediction target data 250 acquired by the prediction-target acquiring unit 2401 conforms to the prediction rule extracted by the highest-order-rule extracting unit 2402. Specifically, it is determined whether the aggregation result information of the prediction target data 250 includes the sub-unit attribute information Bj that is identical to the sub-unit attribute information Bj constituting the condition of the prediction rule. If the prediction target data 250 includes the interaction type information, it may also be determined whether the interaction type is identical.
On the other hand, in the aggregation result information 2701 of the protein complex CLy giving the interaction among the prediction target data 250, since the sub-unit SLy0 has the sub-unit attribute information Bj, a rule match is generated for the prediction rule 2800 between the protein complexes CLy and CRz. In this case, both interaction types are phosphorylation (INk) and are identical. Therefore, even if the interaction type is considered in the conformity determination, a rule match is generated for the prediction rule 2800.
The attribute-credibility calculating unit 2404 shown in
PCk=COr×RC (4)
In Equation 4, PCk is the prediction attribute credibility degree relating to the prediction rule generating a rule match; COr is the credibility degree COjk relating to the prediction rule generating a rule match; and RC is a remaining credibility degree. The initial value of the remaining credibility degree RC is RC=1, and the calculated prediction attribute credibility degree PCk is subtracted from the remaining credibility degree RC every time the prediction attribute credibility degree PCk is calculated. That is, the remaining credibility degree RC is a coefficient that decreases with the rank, counted from the highest LOD score, of the prediction rule after the conformity determination. Therefore, a prediction rule at a higher rank has a greater effect on the prediction attribute credibility degree PCk.
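As a worked illustration of Equation 4, the following minimal sketch applies the update to a sequence of matched prediction rules. It assumes that a single remaining credibility degree RC is shared by all rules applied to one prediction target and that the rules are supplied from the highest rank downward; the function name and the input values are hypothetical.

```python
def prediction_attribute_credibilities(matched_credibilities):
    """Apply PCk = COr x RC to each matched rule, highest rank first.

    matched_credibilities: credibility degrees COr of the prediction rules
    that generated a rule match, ordered from the highest rank downward.
    Returns the list of PC values and the final remaining credibility RC.
    """
    remaining = 1.0                # initial value RC = 1
    results = []
    for co in matched_credibilities:
        pc = co * remaining        # Equation 4
        results.append(pc)
        remaining -= pc            # subtract PC from RC before the next rule
    return results, remaining


# Illustrative credibility degrees (assumed values).
pcs, rc = prediction_attribute_credibilities([0.9, 0.8])
```

With these assumed inputs the first rule receives approximately 0.9 and the second approximately 0.08, because only about 0.1 of the credibility remains after the first rule is applied. This shows why a rule at a higher rank dominates the result.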
The responsible sub-unit pair/interaction attribute identifying unit 2405 shown in
Specifically, for a protein complex pair with a known interaction attribute, a sub-unit pair with the highest prediction attribute credibility degree PC is identified as the responsible sub-unit pair. In the example shown in
For a protein complex pair with an unknown interaction attribute, since it is not known for what interaction type INk the prediction attribute credibility degree PC should be focused on, the prediction attribute credibility degree PCk equal to or greater than a threshold value PCt is detected, and the interaction attribute is identified with the interaction type INk thereof. Since the interaction type INk is identified, the responsible sub-unit pair can be identified at the same time as is the case with the known interaction attribute.
Specifically, in the example of
A sub-unit pair {SLy0, SRz1} with the prediction attribute credibility degree PC1=0.9 is identified as the responsible sub-unit pair. Similarly, a sub-unit pair {SLy2, SRz1} with the prediction attribute credibility degree PCK=0.8 is identified as the responsible sub-unit pair.
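The two identification cases may be sketched as follows. This is an illustration under stated assumptions: the prediction attribute credibility degrees are assumed to be held in a mapping keyed by sub-unit pair and interaction type, the function names are hypothetical, and the threshold value PCt=0.5 and the third entry of the mapping are invented for the example.

```python
def identify_known(scores, interaction_type):
    """Known interaction attribute: return the sub-unit pair with the
    highest PC for the given interaction type, or None if there is none."""
    candidates = {
        pair: pc for (pair, kind), pc in scores.items() if kind == interaction_type
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def identify_unknown(scores, pc_threshold):
    """Unknown interaction attribute: return every (pair, type, PC) whose
    PC is equal to or greater than the threshold PCt."""
    return [
        (pair, kind, pc)
        for (pair, kind), pc in scores.items()
        if pc >= pc_threshold
    ]


# Keys are (sub-unit pair, interaction type); values are PC (illustrative).
scores = {
    (("SLy0", "SRz1"), "IN1"): 0.9,
    (("SLy2", "SRz1"), "INK"): 0.8,
    (("SLy1", "SRz0"), "IN1"): 0.1,  # assumed low-scoring pair
}
known_pair = identify_known(scores, "IN1")
unknown_hits = identify_unknown(scores, pc_threshold=0.5)
```

In the unknown case each returned entry carries both the interaction type and the sub-unit pair, which reflects that the interaction attribute and the responsible sub-unit pair are identified at the same time.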
The output unit 2406 outputs an execution result, that is, the responsible sub-unit pair and the interaction attribute identified by the responsible sub-unit pair/interaction attribute identifying unit 2405. The output format may be any form such as screen display, print output, or data storage. The execution result using the sub-unit complex pair information 2410 shown in
The prediction-target acquiring unit 2401 acquires the created prediction target data 250 (step S3202). The initial value of the remaining credibility degree RC is set to RC=1 (step S3203) and it is determined whether all the prediction rules in the prediction rule set 240 have been applied to the rule match (step S3204).
If unapplied prediction rules exist (step S3204: NO), the highest-order-rule extracting unit 2402 extracts the prediction rule ranked at the highest order among the unapplied prediction rules (step S3205). The conformity determining unit 2403 determines whether a rule match is generated (step S3206).
If a rule match is not generated (step S3206: NO), the procedure goes back to step S3204. On the other hand, if a rule match is generated (step S3206: YES), the attribute-credibility calculating unit 2404 calculates the prediction attribute credibility degree PCk for the prediction rule generating the rule match (step S3207). The calculated prediction attribute credibility degree PCk is subtracted from the current remaining credibility degree RC to update the remaining credibility degree RC (step S3208) and the procedure goes back to step S3204.
If all the prediction rules are applied at step S3204 (step S3204: YES), it is determined whether the interaction attribute of the prediction target is known (step S3209). If the interaction attribute is known (step S3209: YES), the responsible sub-unit pair/interaction attribute identifying unit 2405 identifies the responsible sub-unit pair (step S3210) that is output as the execution result (step S3212).
On the other hand, if the interaction attribute is unknown (step S3209: NO), the responsible sub-unit pair/interaction attribute identifying unit 2405 identifies the interaction attribute between prediction target protein complexes and the responsible sub-unit pair thereof (step S3211) that are output as the execution result (step S3212).
Thus, according to the prediction-target generating unit 203 and the executing unit 204 described above, the responsible sub-unit pair can be deduced for the protein complex pair with the known interaction attribute. The interaction attribute and the responsible sub-unit pair can be deduced at the same time for the protein complex pair with the unknown interaction attribute.
As described above, according to the protein complex interaction evaluating program, the recording medium recording the program, the interaction evaluating apparatus, and the protein complex interaction evaluating method, the validation evaluation of the interaction attribute can be achieved effectively and highly accurately.
The protein complex interaction evaluating method described in the embodiment can be realized by executing a program prepared in advance with a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as an HD, an FD, a CD-ROM, an MO, or a DVD and is read from the recording medium by the computer for execution. The program may be a transmission medium that can be distributed through a network such as the Internet.
According to the embodiments described above, validity evaluation can be performed for an interaction attribute effectively and highly accurately.
Although the invention has been described with respect to a specific embodiment for a complete and clear disclosure, the appended claims are not to be thus limited but are to be construed as embodying all modifications and alternative constructions that may occur to one skilled in the art which fairly fall within the basic teaching herein set forth.
Number | Date | Country | Kind |
---|---|---|---|
2006-150672 | May 2006 | JP | national |