The present invention relates to a method and a system for calculating an interaction between feature amounts.
Research on the human gut microbiota using metagenomic analysis technology has attracted a great deal of international attention. One of the main reasons for this is that a close relationship between the human gut microbiota and disease has become clear. For example, it has been reported that, in addition to colon-related diseases such as pseudomembranous colitis, the human gut microbiota is associated with obesity, diabetes, various autoimmune diseases, colon cancer, liver cancer, renal failure, heart failure, and nervous system diseases, as well as mental and brain functions such as autism, which are related to lifestyle and eating habits. Thus, recent studies have revealed that the structure of the gut microbiota is involved in systemic function regardless of the organ. By paying attention to the relationship between the gut microbiota and disease, new treatments and preventions different from conventional ones are expected to become possible for various diseases.
The gut microbiota has a very complicated flora structure in which a large number of bacterial species interact with each other, and interacts with the health condition of the host and the nutrients ingested by the host to affect the physiological function of the host. As a result, the gut microbiota is believed to be involved in the development of various diseases. Therefore, when analyzing the association between gut microbiota and disease, it is important to consider the interaction between many factors including external factors such as health status and nutrient intake, in addition to the factors inside the gut microbiota. Traditional statistical methods are often used in the association analysis in gut microbiota studies. However, since multiple tests are a problem when dealing with a large number of factors in traditional statistical methods, machine learning methods that are excellent in analyzing a large number of factors and interactions thereof have been attracting attention in recent years.
JP-T-2020-520510 discloses a pharmacological phenotypic prediction platform for individuals and cohorts, in which “in patients who have or may have a primary or comorbid disease, pharmacological phenotypes can be predicted from a collection of panomics data, physiomics data, environmental data, sociomics data, demographic data, and outcome phenotypic data over a certain period of time. The machine learning engine generates statistical models based on training data from the patients, whereby pharmacological phenotypes can be predicted, including drug response and administration, adverse drug events, disease and comorbid disease risk, drug-gene interactions, drug-drug interactions, and multidrug therapy interactions. Then, the model is applied to new patient data to predict the pharmacological phenotypes thereof and to allow clinical and research decision-making, including drug selection and dosages, changing dosing regimens, optimizing multidrug therapy, monitoring, and the like, to benefit from additional predictive power, thereby avoiding adverse events and substance abuse, improving drug response, bringing better patient outcomes, lower treatment costs, and public health benefits, and increasing the effectiveness of research in pharmacology and other biomedical fields” (see abstract).
However, in the method presented in JP-T-2020-520510, the machine learning model is used to predict the pharmacological phenotype of the patient based on the data of the new patient. Therefore, it is not possible to extract important factors in the prediction of pharmacological phenotype from the model.
The present invention was made in view of such a background and an object of the present invention is to easily grasp the association between feature amounts and events.
In order to solve the above-mentioned problems, the present invention is characterized in that an arithmetic device executes a model construction step of acquiring data including a feature amount vector which is a set of numerical values of feature amounts and is an explanatory variable, and information of an event which is an objective variable, and constructing a classification and prediction model having a tree structure for classifying and predicting the event based on the feature amount vector, an interaction score calculation step of calculating an interaction score in which a degree of association of an interaction between the feature amounts with the event is scored based on a position of the feature amount appearing in a node constituting the classification and prediction model, and a position of the feature amount in the classification and prediction model in which the position of the feature amount appearing in the node has been shuffled, and an output step of outputting the calculated interaction score to an output unit.
Other solutions will be described as appropriate in the embodiments.
According to the present invention, the association between the feature amount and the event can be easily grasped.
Next, modes for carrying out the present invention (referred to as “embodiments”) will be described in detail with reference to the drawings as appropriate. However, the present embodiment is merely an example to implement the present invention and does not limit the present invention.
The first embodiment shows an example of extracting an interaction associated with pollinosis in an analysis of the association among the gut microbiota, ingested nutrients, and the presence or absence of pollinosis. In the first embodiment and the second embodiment, the presence or absence of pollinosis is used as the objective variable, but the objective variable is not limited to the presence or absence of pollinosis as long as it can be classified.
The arithmetic system 1 includes an arithmetic device 100 and a database 200.
The arithmetic device 100 includes a central processing unit (CPU) 101, a storage device 102 such as a hard disk (HD), a communication device 103, and a memory 110.
The program stored in the storage device 102 is loaded into the memory 110. Then, the CPU 101 executes the loaded program. As a result, an acquisition unit 111, a model construction unit 112, an interaction score calculation unit 113, and an output processing unit 114 are embodied. Further, an input device 121 such as a keyboard, a mouse, and the like and a display device 122 are connected to the arithmetic device 100.
The acquisition unit 111 acquires feature amount vector data 211 (see
The model construction unit 112 constructs a classification and prediction model with a tree structure using a random forest or the like based on the acquired feature amount vector data 211 and event data 212.
The interaction score calculation unit 113 calculates the interaction score based on the classification and prediction model with a tree structure constructed by the model construction unit 112. The method of calculating the interaction score will be described later, but the interaction score is a scored degree of association of the interaction between the feature amounts with the event, based on the position of the feature amount appearing in the node that constitutes the classification and prediction model with a tree structure.
The output processing unit 114 displays the calculated interaction score on the display device 122.
The communication device 103 is connected to the database 200, receives the information of the database 200, and transmits the received information to the memory 110.
The training data 210 (see
The arithmetic system 1 may be in the form of a cloud service by using the arithmetic device 100 as a cloud server.
With reference to
First, the acquisition unit 111 acquires, from the database 200, feature amount vector data 211 (see
Further, the acquisition unit 111, from the database 200, acquires event data 212 (see
Next, the model construction unit 112 constructs a classification and prediction model with a tree structure that classifies and predicts persons with pollinosis and persons without pollinosis based on the feature amount vector using the feature amount vector data 211 and the event data 212 (S111). The classification and prediction model with a tree structure can be constructed by any algorithm, including decision trees, random forests, gradient boosting decision trees, and the like. In the present embodiment, a random forest is used.
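As a non-limiting sketch of step S111, the model can be constructed with an off-the-shelf random forest implementation; the library (scikit-learn), the feature names, and the data below are illustrative assumptions, not part of the present embodiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["Prevotella", "Ruminococcus", "RTN", "Zn"]  # illustrative
X = rng.random((60, len(feature_names)))   # feature amount vectors (data 211)
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # event: pollinosis yes/no (data 212)

# step S111: construct the classification and prediction model as a forest
# of decision trees with a tree structure
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Each element of `model.estimators_` is one decision tree of the classification and prediction model, which the later steps traverse.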
After that, the interaction score calculation unit 113 calculates the number of combinations of all the feature amounts (K) based on the feature amount vector (S112). The interaction score calculation unit 113 temporarily stores the number of combinations of all the feature amounts as K in the memory 110.
Subsequently, the interaction score calculation unit 113 initializes k indicating the combination number to “0” (k = 0: S113).
Then, the interaction score calculation unit 113 adds “1” to k (k ← k + 1: S114).
Next, the interaction score calculation unit 113 calculates the interaction score for the k-th feature amount combination (S120). The method of calculating the interaction score will be described later.
Subsequently, the interaction score calculation unit 113 determines whether or not k = K (S141). K is the total number of combinations of feature amounts. That is, in step S141, the interaction score calculation unit 113 determines whether or not the interaction score has been calculated for all the combinations of the feature amounts.
When the interaction score has not been calculated for all the combinations of the feature amounts (S141 - No), the interaction score calculation unit 113 returns the process to step S114.
When the interaction score is calculated for all the combinations of feature amounts (S141 → Yes), the output processing unit 114 outputs the interaction score for the predetermined combination of feature amounts to the display device 122 (S142).
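The enumeration over feature amount combinations in steps S112 to S141 can be sketched as follows, assuming a hypothetical `score_pair()` stands in for the interaction score calculation of step S120.

```python
from itertools import combinations

def score_pair(x, y):
    # hypothetical stand-in for the interaction score calculation (S120)
    return abs(x - y)

features = [0, 1, 2, 3]                  # feature amount indices
pairs = list(combinations(features, 2))  # S112: all pairwise combinations
K = len(pairs)                           # total number of combinations
scores = {pair: score_pair(*pair) for pair in pairs}  # loop S114-S141
```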
The training data 210 is stored in the database 200 and has information on bacterial species composition, information on nutrient intake, and information on events. The information on the bacterial species composition is the structure of the gut microbiota in each subject, and specifically, the relative abundance of each intestinal bacterium is stored. The nutrient intake information stores the intake of nutrients ingested by the subject. In addition, the event information stores information on whether the subject has pollinosis (a predetermined category based on a qualitative variable having a nominal scale). A qualitative variable is a variable whose value is discrete, such as sex, name, “1st, 2nd, 3rd” and the like. In addition, the nominal scale is a scale in which only the differences in categories such as sex and name are shown, and the order between the categories is meaningless. Incidentally, a scale is a classification standard based on the nature of the data.
Information on the bacterial species composition of the gut microbiota can be obtained, for example, by meta 16S analysis of the gut microbiota genome. In addition, information on the bacterial species composition of the gut microbiota may be obtained from the gene composition or the like obtained from the metagenomic analysis. Further, as the information on the ingested nutrients, the food intake may be used in addition to the nutrient intake. Food intake is collected using a brief self-administered dietary history questionnaire (BDHQ) or the like. Nutrient intake can be calculated by a dedicated calculation program using BDHQ.
Of the information, regarding the bacterial species composition and nutrient intake, the numerical values for (relative abundance of Prevotella) ... (relative abundance of Ruminococcus), (intake of RTN), (intake of Zn) are listed for each subject. Such numerical values are called feature amounts, and a list of numerical values is called a feature amount vector. The information on the bacterial species composition and the ingested nutrients is the feature amount vector data 211 in
First, the interaction score calculation unit 113 substitutes “0” into the variable “h” indicating the current number of shuffles (S121).
Next, the interaction score calculation unit 113 performs a first simultaneous appearance number calculation process (S122). In step S122, the interaction score calculation unit 113 calculates the number of times that two feature amounts appear simultaneously in the same search branch in the decision tree in the classification and prediction model constructed in step S111 of
Then, the interaction score calculation unit 113 performs a first addition process based on the result of the first simultaneous appearance number calculation process (S123). In step S123, the interaction score calculation unit 113 adds up the number of simultaneous appearances calculated in step S122 for the entire classification and prediction model.
Subsequently, the interaction score calculation unit 113 adds 1 to h and substitutes it for h (h ← h + 1: S124).
Then, the interaction score calculation unit 113 shuffles the classification and prediction model (S125). In step S125, the interaction score calculation unit 113 randomly shuffles the positions of the feature amounts while maintaining the topology of the decision tree. Shuffle will be described later.
Next, the interaction score calculation unit 113 performs a second simultaneous appearance number calculation process (S126). In step S126, the interaction score calculation unit 113 performs the same process as step S122 on the classification and prediction model subjected to the shuffle process. As a result, the interaction score calculation unit 113 calculates the number of times that two feature amounts appear simultaneously in the same search branch in the shuffle-processed classification and prediction model.
Subsequently, the interaction score calculation unit 113 performs a second addition process (S127). In step S127, the interaction score calculation unit 113 adds up, over the entire classification and prediction model, the number of simultaneous appearances calculated in step S126.
Next, the interaction score calculation unit 113 determines whether or not h = H (S128). Here, H is the number of times the interaction score calculation unit 113 shuffles.
When h = H is not satisfied (S128 → No), the interaction score calculation unit 113 returns the process to step S124.
When h = H is satisfied (S128 → Yes), the interaction score calculation unit 113 calculates the mean value and standard deviation of the number of simultaneous appearances in the shuffle-processed classification and prediction model, using the results of step S127 accumulated for each shuffle and the number of shuffles (H) (S129).
After that, the interaction score calculation unit 113 calculates the interaction score using the result of step S123 and the result of step S129 (S130). The calculation of the interaction score will be described later.
With reference to
Further, in
Although three decision trees generated by the random forest are shown here, in reality, thousands to tens of thousands of decision trees are generated, each constructed using randomly sampled data and feature amounts.
Further, in
Further, in
Further, the branch node located at the highest level (“Node #0” in
Since the classification and prediction model with a tree structure such as a random forest divides the data by conditional branching, it is possible to capture the dependency between a plurality of feature amounts. In addition, the classification and prediction model with a tree structure has the feature that the dependency between a plurality of feature amounts is expressed in each branch of the decision tree.
Here, the branch is the route from the root node to the leaf node. For example, in the decision tree shown in
In the route, the root node side is defined as upstream and the leaf node side is defined as downstream.
For example, in the example shown in
The intensity of the interaction between the feature amounts can be evaluated based on the number of simultaneous appearances. The number of simultaneous appearances will be described later. In the present embodiment, the intensity of the interaction between the feature amounts is shown as the interaction score. Then, in the present embodiment, the interaction score for the combination of x and y, which are any feature amounts, is defined by the following equation (1).
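The equation itself is not reproduced in the text here; reconstructed from the definitions of N(x,y), M(x,y), E(M(x,y)), and σ(M(x,y)) given in the following description, Equation (1) can be written as:

```latex
I(x,y) \;=\; \frac{N(x,y) - E\bigl(M(x,y)\bigr)}{\sigma\bigl(M(x,y)\bigr)} \tag{1}
```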
In Equation (1), I(x,y) is an interaction score for a combination of x and y, which are any feature amounts. N(x,y) is the number of times (the number of simultaneous appearances) that the feature amounts x and y appear simultaneously in the same search branch in the classification and prediction model before shuffling. The search branch will be described later. Further, M(x,y) is the number of simultaneous appearances when the positions of the feature amounts are randomly shuffled while maintaining the topology of the tree. Further, E(M(x,y)) indicates the mean of M(x,y), and σ(M(x,y)) is the standard deviation of M(x,y).
First, the calculation method of N(x,y) in Equation (1) will be described.
In the present embodiment, the search branch is defined as the route, followed downstream from the root node, up to the node at which all of the feature amounts of interest have appeared.
For example, if attention is paid to the feature amounts “A” and “B” in the decision tree shown in
Then, in this example, the number of times that the feature amounts “A” and “B” appear simultaneously in the decision tree shown in
That is, when the search branch is defined as described above, counting the number of times that two feature amounts appear simultaneously in the same search branch (the number of simultaneous appearances) is synonymous with counting the number of such search branches in each decision tree.
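Counting search branches as defined above can be sketched for a single decision tree as follows, assuming a scikit-learn tree as an illustrative representation (not the embodiment's actual implementation); the traversal stops at the first node where both feature amounts of interest have appeared, so the returned count equals the number of search branches, i.e. the number of simultaneous appearances in that tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def count_search_branches(tree, fx, fy):
    """Number of search branches in which features fx and fy both appear."""
    t = tree.tree_
    def dfs(node, seen_x, seen_y):
        if t.children_left[node] == -1:      # leaf reached: pair incomplete
            return 0
        f = t.feature[node]
        seen_x = seen_x or f == fx
        seen_y = seen_y or f == fy
        if seen_x and seen_y:                # the search branch ends here
            return 1
        return (dfs(t.children_left[node], seen_x, seen_y)
                + dfs(t.children_right[node], seen_x, seen_y))
    return dfs(0, False, False)

rng = np.random.default_rng(0)
X = rng.random((80, 4))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)  # features 0 and 1 interact
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
n_AF = count_search_branches(tree, 0, 1)     # per-tree simultaneous appearances
```

The count is symmetric in the two feature amounts, matching the definition of the search branch.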
Based on the above, with reference to
In the decision tree shown in
In the decision tree shown in
Then, in the decision tree shown in
In this way, the process of calculating the number of search branches (that is, the number of simultaneous appearances) in each decision tree is the process corresponding to step S122 in
N(A,F) in Equation (1) is the number of times that the feature amount “A” and the feature amount “F” appear simultaneously across all the decision trees. Therefore, assuming that the decision trees shown in
Next, M(x,y), E(M(x,y)), and σ(M(x,y)) in Equation (1) will be described.
As described above, in Equation (1), M(x,y) is the number of simultaneous appearances when the positions of the feature amounts are randomly shuffled while maintaining the topology of the tree. Further, E(M(x,y)) indicates the mean of M(x,y), and σ(M(x,y)) is the standard deviation of M(x,y).
Here, a process of randomly shuffling the position of the feature amount while maintaining the topology of the tree (shuffle process: step S125 in
A shuffle is performed based on the following rules.
Hereinafter, the shuffle process will be described with reference to
In the whole decision tree shown in
The interaction score calculation unit 113 randomly shuffles the positions of the feature amounts in “A (#0), B (#2), C (#3), E (#4), D (#8), F (#10)”. For example, it is assumed that “B (#0), D (#2), F (#3), C (#4), A (#8), E (#10)” are obtained as a result of shuffling. When such a result is obtained, the interaction score calculation unit 113 assigns the feature amount “B” to the branch node (root node) “Node #0” and assigns the feature amount “D” to the branch node “Node #2”. The interaction score calculation unit 113 also assigns other feature amounts to the branch nodes in the same manner.
Further, in the whole decision tree shown in
Similarly, in the whole decision tree shown in
Such shuffling creates a state in which the information on the dependency between feature amounts is lost in the decision tree.
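The shuffle of step S125 can be sketched as a within-tree permutation of the branch-node feature amounts, assuming each decision tree is represented by the flat list of feature ids at its branch nodes (a hypothetical representation chosen for illustration).

```python
import numpy as np

def shuffle_features(trees_features, rng):
    """Permute each tree's branch-node features, keeping its topology."""
    return [list(rng.permutation(tree)) for tree in trees_features]

# branch-node feature ids per tree, in node order (e.g. #0, #2, #3, ...)
trees = [[0, 1, 2, 4, 3, 5], [1, 0, 3], [2, 5]]
shuffled = shuffle_features(trees, np.random.default_rng(0))
```

Because only the assignment of feature amounts to node positions changes, each tree keeps the same shape and the same multiset of feature amounts, while the dependency information between them is lost.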
Subsequently, the interaction score calculation unit 113 calculates, for each decision tree to which the result of shuffling the positions of the feature amounts has been assigned, the number of times the feature amount “A” and the feature amount “F” appear simultaneously in the same search branch (the number of simultaneous appearances). This process is performed in the same manner as before the shuffle process. Incidentally, this process corresponds to step S126 in
Then, the interaction score calculation unit 113 adds up the number of simultaneous appearances obtained for each decision tree in all the decision trees. This result is M(A,F) of Equation (1). This process corresponds to step S127 in
The interaction score calculation unit 113 performs such shuffling a plurality of times (for example, about 10 times). Then, the interaction score calculation unit 113 divides the accumulation of M(A,F) for each shuffle by the number of shuffles to calculate E(M(A,F)) of Equation (1), which is the mean value of M(A,F). Further, the interaction score calculation unit 113 calculates σ(M(A,F)) of Equation (1), which is the standard deviation of M(A,F) based on M(A,F) and E(M(A,F)). This process corresponds to step S129 in
Subsequently, the interaction score calculation unit 113 substitutes the calculated M(A,F), E(M(A,F)), and σ(M(A,F)) into Equation (1) to calculate I(A,F) (interaction score). This process corresponds to step S130 in
The interaction score calculation unit 113 calculates the interaction score for each combination of all the feature amounts (corresponding to steps S114 to S141 in
In Equation (1), normalization is performed according to the state in which the information on the dependency between the feature amounts is lost (M(x,y)). In this way, by normalizing with respect to the state in which the information on the dependency between feature amounts is lost, the intensity of the interaction is well reflected.
N(A,F) indicates the number of simultaneous appearances in each decision tree generated by model construction. The number of simultaneous appearances indicates the intensity of the interaction between the feature amount “A” and the feature amount “F” in the decision tree. However, if the feature amount “A” and the feature amount “F” simply appear in large numbers in each decision tree, the value of N(A,F) becomes large even if there is little interaction between them. That is, N(A,F) includes the number of times the feature amount “A” and the feature amount “F” appear simultaneously in the same search branch by chance.
Therefore, in the present embodiment, the number of simultaneous appearances (M(A,F)) in the state where the information on the dependency between the feature amounts is lost in each decision tree by the shuffle process is subtracted from N(A,F). That is, M(A,F) indicates the number of times the feature amount “A” and the feature amount “F” happen to appear in the same search branch.
Therefore, the result of subtracting M(A,F) from N(A,F) shows the value (intensity) of the true interaction between the feature amount “A” and the feature amount “F”. However, since the value of M(A,F) changes depending on the result of shuffling, shuffling is performed a plurality of times, and E(M(A,F)), obtained by dividing the sum of M(A,F) over the shuffles by the number of shuffles, is used.
Furthermore, in Equation (1), data with different scales can be compared by dividing by σ(M(x,y)). However, the division by σ(M(x,y)) may not be performed in Equation (1).
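Putting the pieces together, the score of Equation (1), as reconstructed from the description, can be sketched as follows; the counts passed in are hypothetical, and the fallback when the standard deviation is zero reflects the option of omitting the division by σ(M(x,y)).

```python
import statistics

def interaction_score(n_xy, m_samples):
    """I(x,y) = (N(x,y) - E(M(x,y))) / sigma(M(x,y)), per the description."""
    e = statistics.fmean(m_samples)    # E(M(x,y)) over the H shuffles
    s = statistics.pstdev(m_samples)   # sigma(M(x,y))
    # if sigma is zero, return the unnormalized difference N - E(M)
    return (n_xy - e) / s if s > 0 else n_xy - e

# hypothetical counts: N(A,F) = 12, M(A,F) over H = 10 shuffles
score = interaction_score(12, [4, 6, 5, 5, 4, 6, 5, 5, 4, 6])
```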
By using the interaction score shown in Equation (1), the interaction between the feature amounts having a high interaction score in the classification and prediction of pollinosis patients and non-pollinosis patients, that is, the interaction between the feature amounts having a high degree of association with pollinosis, can be extracted. Further, the interaction score of Equation (1) can be similarly calculated for a combination of any number of two or more feature amounts.
In the results shown in
In the example shown in
As shown in
In the graph display area 510, the interaction score is shown as a bar graph, and the combination of feature amounts is shown in the order of the degree of association (interaction score) (ascending order). The combination of the feature amounts is shown in the graph display area 510 in the form of “(Cu, Ruminococcus)” or the like.
In the list display area 520, the combination of feature amounts and the degree of association with pollinosis (interaction score) are shown in ascending order. The display contents of the list display area 520 are the same as those in
The description and setting area 530 has a calculation formula explanation area 531 and a threshold value setting window 532.
In the calculation formula explanation area 531, an explanation regarding the calculation formula of the interaction score is displayed. The calculation formula explanation area 531 can be omitted.
In the threshold value setting window 532, the threshold value of the interaction score displayed in the list display area 520 is set as described above. As described above, in the example shown in
As described above, the output screen 500 in the present embodiment can present a list of the interactions between feature amounts having a high degree of association extracted by a predetermined threshold value (threshold value set by default) or a threshold value specified by the user in the threshold value setting window 532.
According to the first embodiment, only a combination of highly associated feature amounts extracted by a method using a classification and prediction model having a tree structure (random forest in the example shown in the first embodiment) can be analyzed. This makes it possible to avoid multiple tests, which is a problem in statistical methods. That is, according to the example shown in
In addition, since the importance used in a general classification and prediction model with a tree structure is evaluated in the presence of all other feature amounts, it is an index that also takes into account the effects of interactions between feature amounts. However, since the importance is calculated for each feature amount, the information on the interaction with respect to the combination of feature amounts is not given. On the other hand, according to the interaction score in the present embodiment, it is possible to obtain information on the interaction with respect to the combination of feature amounts. That is, the interaction score according to the present embodiment can directly evaluate what kind of interaction between feature amounts is important in classification and prediction.
In this embodiment, the nutrient intake is used as the feature amount, but the health information obtained by the health examination may also be used as the feature amount. In this case, health information may be used as a feature amount instead of the nutrient intake, or both the nutrient intake and the health information may be used as a feature amount.
As described above, in the first embodiment, a classification and prediction model with a tree structure is used based on various metadata such as information on the bacterial species composition of the gut microbiota, information on nutrient intake, health information, and the like to analyze the association between the gut microbiota and disease in consideration of the interactions between numerous feature amounts. Then, as a result, the interaction between the feature amounts associated with the disease can be extracted. That is, in the first embodiment, the degree of association of the interaction between the feature amounts with the phenotype (event) is converted into a score as an interaction score and output by using the classification and prediction model having a tree structure. This makes it possible to extract the interaction between feature amounts related to the phenotype (event). As a result, in the association analysis between the gut microbiota and the disease (presence or absence of pollinosis in the first embodiment), the interaction of the feature amount highly associated with the disease can be extracted.
Next, a second embodiment of the present invention will be described with reference to
By combining the machine learning method with another statistical method, it is possible to evaluate whether the interaction between the highly associated feature amounts extracted by the machine learning method is positively or negatively associated with pollinosis. In the positive association, the higher the value, the higher the probability of pollinosis, and in the negative association, the smaller the value, the higher the probability of pollinosis.
For example, by examining the sign of the coefficient corresponding to each feature amount using logistic regression in addition to the random forest, it is possible to evaluate whether the association between each feature amount and pollinosis is positive or negative. However, the method is not limited to this method, and a plurality of other statistical methods may be combined. The explanatory variable used for logistic regression is the feature amount vector data 211.
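The sign check can be sketched as follows, assuming scikit-learn's LogisticRegression and hypothetical feature amount vectors in which the first feature amount is positively, and the third negatively, associated with the event.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.random((120, 3))               # hypothetical feature amount vectors
# event driven positively by feature 0 and negatively by feature 2
y = (X[:, 0] - X[:, 2] + 0.1 * rng.standard_normal(120) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
signs = np.sign(clf.coef_[0])          # +1: positive, -1: negative association
```

The sign of each coefficient gives the value stored in the “+/-” column of the list display area.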
The output screen 500a shown in
In the list display area 520a of
Incidentally, if the coefficients having a positive sign and the coefficients having a negative sign are equal in number, “0” is stored in the “+/-” column. In this case, it means that it is not possible to evaluate whether the association between each feature amount and pollinosis is positive or negative.
Further, in the list display area 520a, the combination of the feature amounts related to the negative is shown in shading, and the combination of the feature amounts related to the positive is shown without shading. Incidentally, the numerical value of the combination of the feature amounts displayed in the list display area 520a and the degree of association (interaction score) with pollinosis is the same as that shown in the list display area 520 of
According to the second embodiment, it is possible to show the relationship between the combination of feature amounts and the probability of developing a symptom.
In the second embodiment, the random forest and the logistic regression are combined, but the analysis combined with the random forest is not limited to the logistic regression as long as it is regression analysis. For example, the random forest and multiple regression analysis may be combined.
In the first embodiment and the second embodiment, event data 212 is used in which events such as the presence or absence of disease (presence or absence of pollinosis) can be classified into a predetermined category (predetermined category based on a qualitative variable having a nominal scale).
On the other hand, in the third embodiment, important interactions are extracted in the analysis for predicting some numerical values indicating the health condition of the patient. In such a case, as shown in
Hereinafter, a specific example of the third embodiment will be described with reference to
That is, the training data 210b shown in
That is, in the training data shown in
The model construction unit 112 (see
According to the third embodiment, the same effect as that of the first embodiment can be obtained even for an event having a (discrete) numerical value such as a severity score of pollinosis.
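For the third embodiment, a regression forest can stand in for the model construction unit when the event is a numeric severity score; the library and the severity values below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.random((60, 4))                          # feature amount vectors
severity = (3 * X[:, 0] + 2 * X[:, 1]).round()   # hypothetical severity score

# the classification and prediction model with a tree structure, built
# for a numeric (discrete) event rather than a categorical one
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, severity)
```

The interaction score is then calculated from the resulting decision trees exactly as in the first embodiment.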
In the example shown in
Further, the second embodiment and the third embodiment may be combined.
Further, in the present embodiment, the case of the combination of two feature amounts is described in the calculation of the interaction score, but the combination of three or more feature amounts is also possible.
The present invention is not limited to the above-described embodiments and includes various modifications. For example, the above-described embodiments have been described in detail in order to explain the present invention in an easy-to-understand manner and are not necessarily limited to those having all the described configurations. Further, it is possible to replace a part of the configuration of one embodiment with the configuration of another embodiment, and it is also possible to add the configuration of another embodiment to the configuration of one embodiment. Further, it is possible to add, delete, and/or replace a part of the configuration of each embodiment with another configuration.
Further, each of the above-mentioned configurations, functions, parts 111 to 114, the storage device 102, the database 200, and the like may be achieved by hardware, for example, by designing a part or all of them by an integrated circuit or the like. Further, as shown in
Further, in each embodiment, the control lines and information lines are shown as necessary for the explanation, and not all the control lines and information lines are shown in the product. In practice, almost all configurations may be considered to be interconnected.
Number | Date | Country | Kind |
---|---|---|---
2021-157769 | Sep 2021 | JP | national |