The present invention relates to a data analysis apparatus, method, and program which analyze data by a statistical method.
In the field of statistical data analysis, the partial least squares (PLS) method, a type of supervised dimension reduction, is used for multivariate analysis in areas such as metabolomics, which comprehensively analyzes metabolites in a biological body. PLS is used for various purposes such as visualization, regression, and construction of discrimination models. Recently, methods that improve on PLS have been proposed (for example, Patent Document 1).
Patent Document 1 discloses OPLS (orthogonal partial least squares), in which the OSC (orthogonal signal correction) method is applied to PLS. In a model which predicts a variable Y from an input data set X by PLS, the OPLS of Patent Document 1 separates the systematic variation in X into a variation orthogonal to (uncorrelated with) Y and a variation which enables prediction of Y. Accordingly, among the variations of the multiple statistical samples included in the data set X, those uncorrelated with Y are filtered out. Consequently, it is possible to obtain a model which is more easily interpreted without undermining the prediction accuracy for Y.
Recently, in the field of metabolomics, there are cases where a plurality of individuals (statistical samples) from which metabolite data is sampled are classified into several groups according to bloodline or dosing status. For such cases, research on metabolic variation patterns that vary according to a specific order among the groups has been reported (Non-Patent Document 1).
An object of the present invention is to provide a data analysis apparatus, method, and program which can perform various data analysis by taking into account a group order among statistical samples.
A data analysis apparatus according to the present invention performs multivariate analysis with a plurality of data items on a plurality of statistical samples. The data analysis apparatus includes a storage and a controller. The storage records statistical data for managing the plurality of data items per statistical sample, and group information indicating an arrangement order of groups formed by the plurality of statistical samples. The controller performs predetermined calculation processing based on the statistical data and the group information. The controller calculates, based on the statistical data, a kernel matrix in which a matrix element represents a predetermined relationship between a statistical sample corresponding to a row number of the matrix element in the plurality of statistical samples and a statistical sample corresponding to a column number of the matrix element. The controller performs calculation processing based on a partial least squares method under a predetermined condition defined by the kernel matrix and the group information, to calculate a score for the statistical samples.
The data analysis apparatus according to the present invention can perform integrated analysis and nonlinear analysis on various types of statistical data based on the kernel matrix while reflecting a group order in scores based on group information. Consequently, it is possible to perform various types of data analysis while taking into account the group order among statistical samples.
Hereinafter, embodiments of a data analysis apparatus, method, and program according to the present invention will be described with reference to the accompanying drawings. In each following embodiment, the same components will be assigned the same reference numerals.
An outline of statistical analysis of a data analysis method according to a first embodiment of the present invention will be described with reference to FIGS. 1A to 5.
Metabolomics is a research field which comprehensively analyzes low molecular weight metabolites (compounds whose molecular weight is 1000 or less) in a biological body.
where n represents a sample size (the number of individuals), and p represents the number of measured metabolites (the number of measurement items). According to the formula (1), items of measurement data (statistical data) of p metabolites measured from individuals corresponding to row numbers are recorded per row.
The metabolome data illustrated in
As in the above example, the metabolome data contains measurement data of several hundred to several thousand metabolites. This makes it difficult to visually express the behavior of each sample directly from the metabolome data (for example, when normal mice and disease model mice are the analysis targets, how the metabolome data of their liver samples differ). Thus, scores are generated from the multiple variables by multivariate analysis, and the behavior of the samples is visually expressed by a scatter plot of the scores. The scatter plot is used to check the relationship between samples (such as a difference between the two groups of the normal mice and the disease model mice) as illustrated in
Here, in the examples illustrated in
In the above case, obtaining such scores that three groups are arranged in a predetermined order is useful for biological observation related to the order and its verification. With respect to the metabolome data obtained from each individual, which is managed per type of biological sample as illustrated in
First, a general theory of multivariate analysis in metabolomics will be described. Multivariate analysis of metabolome data generally uses principal component analysis and PLS. By using group information in addition to the metabolome data, PLS can readily obtain scores in which the groups are clearly separated. Classical multivariate analysis methods that use group information include, for example, canonical correlation analysis. However, it is difficult to apply this method when the number (p) of variables (measurement items) in the data is larger than the sample size (n) (p>>n). In contrast, PLS is applicable even in the case of p>>n.
By using PLS, it is possible to obtain scores in which the groups are separated. In some cases, a predetermined order among groups is assumed, for example, where variations related to a drug concentration are of interest, or where attention is paid to metabolites related to tastiness indices in sensory evaluation. However, PLS does not reflect the group order information in the scores, and thus cannot obtain the expected results. Therefore, the inventor proposed a method called PLS-ROG (Rank Order of Groups), which is a modification of PLS (see Non-Patent Document 2). Using PLS-ROG makes it possible to obtain scores in which the groups are ordered. Further, PLS-ROG enables selection of metabolites related to the scores by using statistical hypothesis testing.
In metabolomics, multiple types of metabolome data may be obtained from one individual. For example, administration of a particular drug to an animal is likely to influence the metabolism of multiple organs. In such a case, samples (specimens) of a plurality of organs, plasma, and urine are collected from the same individual, and metabolome data is obtained for each. In addition, data other than the metabolome data, such as gene expression levels and protein levels, is often measured simultaneously with the metabolome data from the same individual. By integrating the plural measurement data obtained from the same individual and calculating common scores by multivariate analysis, it is possible to specify metabolites which vary in common across a plurality of organs, or metabolites and genes which vary in common within the same individual.
If the above multivariate analysis can integrate different types of measurement data while reflecting the group order among individuals, various types of data analysis become possible, for example, specifying metabolites which vary in common across individuals according to the group order, or the causal relationships among them. Therefore, the inventor devised “kernel PLS-ROG (kernel order partial least squares method)” by introducing the concept of the kernel method into the above PLS-ROG. The kernel PLS-ROG enables various analyses, such as integrated analysis of various types of measurement data and nonlinear data analysis, while taking into account the order among the groups. Hereinafter, the PLS-ROG and the kernel PLS-ROG will be described.
The PLS-ROG can be formulated by using the data matrix X (formula (1)) having n rows and p columns, a dummy matrix Y having n rows and g columns, and an explanatory variable t and an objective variable s (each n-dimensional vector). Here, n represents a sample size, p represents the number of measurement items (data items), and g represents the number of groups. The dummy matrix Y is a matrix for setting group information indicating the group order (see
The explanatory variable t is related to the data matrix X by a weight vector wx (p-dimensional vector), and the objective variable s is related to the dummy matrix Y by a weight vector wy (g-dimensional vector), as follows.
t=Xwx (2)
s=Ywy (3)
By using the above X, Y, t, and s, the PLS-ROG is formulated as the following optimization problem (finding the weight vectors wx and wy).
[Math. 2]
max cov(t,s) (4)
subject to wx′wx=1, (5)
(1−κ)wy′wy+κwy′Y′P′D′DPYwy=1 (6)
In the above formulas, cov(t, s) is the covariance of the explanatory variable t and the objective variable s, and κ is a constant parameter indicating a penalty for the group order among individuals. The matrix P is a matrix of g rows and n columns indicating weights corresponding to the numbers of individuals (numbers of samples) n1, n2, . . . , and ng included in the respective groups. The matrix D is a matrix of (g−1) rows and g columns for smoothing among groups. Concrete forms of the matrices P and D are exemplified below.
According to the formulas (4) to (6), the PLS-ROG poses an optimization problem which maximizes the covariance cov(t, s) under the conditions expressed by the formulas (5) and (6). The conditional formula (5) sets the magnitude of the weight vector wx to 1. The conditional formula (6) shifts the magnitude of the weight vector wy from 1, to a degree set by the constant κ, according to the second term on the left side. This second term is the penalty term, which gives a penalty according to the group order of the dummy matrix Y.
The scores according to the PLS-ROG are calculated as the synthetic variables (t, s) obtained by substituting the weight vectors wx and wy found by the optimization problem into the formulas (2) and (3). The PLS-ROG can reflect the group order set by the dummy matrix Y in the scores according to the penalty term of the conditional formula (6).
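As a minimal sketch of the quantities appearing in the formulas (3) and (6), the following Python code builds a dummy matrix Y from ordered group labels and evaluates the left side of the conditional formula (6). The concrete forms used here for P (group averaging with entries 1/nk) and D (first differences between adjacent groups) are assumptions chosen to match the verbal description above, and all function names are hypothetical.

```python
import numpy as np

def dummy_matrix(labels, g):
    """n x g indicator matrix; labels[i] in 0..g-1 is the ordered group of sample i."""
    Y = np.zeros((len(labels), g))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def averaging_matrix(labels, g):
    """Assumed form of P (g x n): row k holds 1/n_k for the samples of group k."""
    Y = dummy_matrix(labels, g)
    return (Y / Y.sum(axis=0)).T

def difference_matrix(g):
    """Assumed form of D ((g-1) x g): row k is e_k - e_(k+1)."""
    return np.eye(g)[:-1] - np.eye(g)[1:]

def constraint_lhs(wy, Y, P, D, kappa):
    """Left side of the conditional formula (6) for a given weight vector wy."""
    DPYw = D @ P @ Y @ wy
    return (1.0 - kappa) * (wy @ wy) + kappa * (DPYw @ DPYw)

# Nine samples in three ordered groups of three (illustrative values only).
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
Y = dummy_matrix(labels, 3)
P = averaging_matrix(labels, 3)
D = difference_matrix(3)
```

Under these assumed forms, a weight vector wy whose elements change little between adjacent groups incurs a small penalty term, which is one way to read the "smoothing among groups" role of D.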
The kernel PLS-ROG, which is a statistical data analyzing method according to the present embodiment, will be described below.
First, the formulation of the kernel PLS-ROG will be described. In the formulas (2) to (6) obtained by formulating the PLS-ROG, the following formula (9) is adopted instead of the formula (2). Along with this, a kernel matrix K having n rows and n columns and an n-dimensional vector αx are introduced (formulas (10) and (11)).
t=Φwx (9)
wx=Φ′αx (10)
K=ΦΦ′ (11)
In the above formulas, Φ represents a matrix (mapping) corresponding to the data matrix X. A concrete matrix representation of Φ (n rows and p columns) need not be given. The kernel matrix K is a matrix in which each matrix element is a kernel function k(xi, xj) taking, as arguments, two of the measurement data xi (p-dimensional vectors) recorded for each sample in the data matrix X. The kernel function k(xi, xj) represents an inner product in the feature space to which xi and xj are mapped by Φ, and has a concrete form which can be calculated from a pair of measurement data xi and xj. Details of the kernel matrix K and the kernel function k(xi, xj) will be described later. The vector αx is used in place of the weight vector wx.
With the formulas (9) to (11), the explanatory variable t can be expressed as the following formula by using the vector αx and the kernel matrix K without using wx and Φ.
t=Kαx (12)
Further, the formula (5) is expressed as the following formula by using the vector αx and the kernel matrix K based on the formula (10).
αx′Kαx=1 (13)
The formula (13) expresses a condition that the inner product, via the kernel matrix K, of the vector αx with itself is 1. As a result, the kernel PLS-ROG poses the optimization problem in which the condition of the formula (13) replaces the formula (5) among the formulas (4) to (6) that formulate the PLS-ROG. Because the weight vector wx is eliminated, the kernel PLS-ROG is described by the formulas (4), (6), and (13) without using the concrete form of Φ.
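The substitution from the formulas (9) to (11) into the formulas (12) and (13) can be checked numerically. The short Python sketch below uses a small random matrix as a stand-in for Φ (an explicit Φ is needed only for this check) and confirms that Φwx equals Kαx and that wx′wx equals αx′Kαx. The sizes and random values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4
Phi = rng.normal(size=(n, p))    # explicit feature matrix, used here only for checking
alpha = rng.normal(size=n)       # stands in for the vector alpha_x

K = Phi @ Phi.T                  # formula (11)
w = Phi.T @ alpha                # formula (10)

t_primal = Phi @ w               # formula (9)
t_dual = K @ alpha               # formula (12)

norm_primal = w @ w              # left side of formula (5)
norm_dual = alpha @ K @ alpha    # left side of formula (13), before scaling to 1
```

Both pairs agree up to floating-point error, which is why the concrete form of Φ is not needed once K is available.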
The kernel PLS-ROG formulated as above can be described as the optimization problem of the following Lagrangian function J (λx and λy represent parameters) by using the method of Lagrange multipliers.
By partially differentiating the function J with respect to αx and wy and rearranging the two resulting formulas, the kernel PLS-ROG finally reduces to the generalized eigenvalue problem of the following formulas (calculation of the eigenvalue λ and the eigenvectors αx and wy).
The generalized eigenvalue problem of the formulas (15) and (16) has (g−1) non-zero eigenvalues λ, each with corresponding eigenvectors αx and wy. In the present embodiment, the score is set to the value of the explanatory variable t obtained by substituting the eigenvector αx of each eigenvalue λ into the formula (12).
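For orientation, one way to write out the Lagrangian and the resulting eigenvalue problem is sketched below. This is a reconstruction from the formulas (4), (6), (12), and (13), and is not a reproduction of the formulas (14) to (16). It assumes that t is centered so that cov(t, s) is proportional to t′s with a constant c (for example, 1/n), it abbreviates the matrix appearing in the formula (6) as M, and it takes the particular solution in which αx is proportional to Ywy.

```latex
\begin{aligned}
M &= (1-\kappa)I + \kappa\,Y'P'D'DPY,\\
J &= c\,\alpha_x' K Y w_y
     - \frac{\lambda_x}{2}\left(\alpha_x' K \alpha_x - 1\right)
     - \frac{\lambda_y}{2}\left(w_y' M w_y - 1\right),\\
\frac{\partial J}{\partial \alpha_x} = 0 &\;\Rightarrow\; c\,K Y w_y = \lambda_x K \alpha_x,\\
\frac{\partial J}{\partial w_y} = 0 &\;\Rightarrow\; c\,Y' K \alpha_x = \lambda_y M w_y,\\
\lambda_x = \lambda_y = \lambda &\;\Rightarrow\; c^2\,Y' K Y w_y = \lambda^2 M w_y,
\qquad \alpha_x = \frac{c}{\lambda}\,Y w_y .
\end{aligned}
```

Multiplying the first stationarity condition from the left by αx′ and the second by wy′, and then applying the two constraints, gives λx = λy = c αx′KYwy, which is why a single eigenvalue λ appears in the final problem.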
The formulas (15) and (16) are used by a data analysis apparatus 50 (see
Details of the kernel matrix and the kernel function will be described below.
The (i, j) element of the kernel matrix K is expressed as the kernel function k(xi, xj) related to the pair of the measurement data xi and xj of the ith and jth samples in the data matrix X. Various concrete forms of the kernel function k(xi, xj) can be used. For example, the following linear kernel kL(xi, xj) (formula (17)), polynomial kernel kp(xi, xj) (formula (18)), and Gaussian kernel kG(xi, xj) (formula (19)) can be used as the kernel function k(xi, xj).
In the formula (18), m represents an arbitrary real number, q represents an arbitrary natural number, and a in the formula (19) represents a positive real number. By forming the kernel matrix K based on the nonlinear kernels such as the formulas (18) and (19), it is possible to perform nonlinear data analysis while taking into account the group order.
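The three kernels can be sketched in Python as follows. The forms below are the common textbook ones (inner product, inner product plus a constant raised to a power, and exponential of the negative squared distance), and are an assumption: the exact scaling in the formulas (18) and (19) may differ. The parameter name sigma for the positive real number of the Gaussian kernel and the function names are hypothetical.

```python
import numpy as np

def linear_kernel(X):
    """K[i, j] = xi' xj for the rows xi of X."""
    return X @ X.T

def polynomial_kernel(X, m=1.0, q=2):
    """K[i, j] = (xi' xj + m) ** q, with real m and natural number q."""
    return (X @ X.T + m) ** q

def gaussian_kernel(X, sigma=1.0):
    """K[i, j] = exp(-||xi - xj||^2 / (2 sigma^2)), with sigma > 0."""
    sq = np.sum(X ** 2, axis=1)
    # Squared distances from the inner products; clip tiny negative rounding errors.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 samples, 3 measurement items (illustrative)
K_lin = linear_kernel(X)
K_pol = polynomial_kernel(X, m=1.0, q=2)
K_gau = gaussian_kernel(X, sigma=1.0)
```

Whatever the number of measurement items, each function returns an n-by-n matrix, so the rest of the calculation does not depend on which kernel is chosen.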
Further, when a plurality of types of measurement data xi(L), xi(H), xi(B), and xi(P) is obtained per individual as in the case of metabolome data derived from a plurality of organs or biological body fluids (see
Assuming N types of measurement data are obtained per individual, the measurement items of each type are recorded in a separate data matrix X(1), X(2), . . . , and X(N), and the columns of these matrices do not correspond to one another. Even in this case, calculating a kernel matrix based on the kernel function for each type of measurement data xi(1), . . . , and xi(N) yields kernel matrices K(1), K(2), . . . , and K(N) which all have n rows and n columns. The kernel matrix K for integrated analysis is composed of a predetermined average of the kernel matrices K(1), K(2), . . . , and K(N) of all types. The predetermined average may be an arithmetic mean, a weighted mean with appropriately selected weights, or a geometric mean taken per matrix element.
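The averaging of per-type kernel matrices described above might be sketched as follows. The function name, the mode strings, and the choice to normalize the weights so that they sum to 1 are assumptions for illustration. The element-wise geometric mean is computed only when every element is positive, which is a restriction of this sketch rather than something stated in the text.

```python
import numpy as np

def integrate_kernels(kernels, weights=None, mode="arithmetic"):
    """Combine N kernel matrices of shape (n, n) into one (n, n) matrix."""
    Ks = np.stack(kernels)                            # shape (N, n, n)
    if mode == "arithmetic":
        return Ks.mean(axis=0)
    if mode == "weighted":
        if weights is None:
            raise ValueError("weights are required for the weighted mean")
        w = np.asarray(weights, dtype=float)
        return np.tensordot(w / w.sum(), Ks, axes=1)  # sum_k w_k K(k)
    if mode == "geometric":
        if np.any(Ks <= 0):
            raise ValueError("this sketch needs positive elements for the geometric mean")
        return np.exp(np.log(Ks).mean(axis=0))        # element-wise geometric mean
    raise ValueError("unknown mode: " + mode)

# Two data types measured on the same 9 individuals, with different item counts.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(9, 5))
X2 = rng.normal(size=(9, 7))
K1 = X1 @ X1.T
K2 = X2 @ X2.T
K = integrate_kernels([K1, K2])
```

Although X1 and X2 have different numbers of columns, K1 and K2 are both 9-by-9, which is what makes the average well defined.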
The theory of the kernel PLS-ROG configured as described above can be realized by a computer for calculating the kernel matrix K based on the data matrix X indicating measurement data of a plurality of statistical samples, and calculating the formulas (15) and (16) based on the kernel matrix K and group information related to a group order among statistical samples. In this way, it is possible to obtain scores which take into account the group order among statistical samples on the computer, so as to visualize the scores by plot display, and integrally analyze a plurality of types of data matrices X(1), X(2), . . . , and X(N). Hereinafter, the data analysis apparatus, method, and program which realize the kernel PLS-ROG will be described.
The configuration of the data analysis apparatus according to the present embodiment will be described with reference to
The data analysis apparatus 50 performs calculation according to the kernel PLS-ROG (PLS under the conditions of the formulas (6) and (13)) based on the data matrix X indicating the measurement data of a plurality of statistical samples to calculate scores (t), and displays a plot image of the scores (see
The controller 51 is configured by, for example, a CPU or an MPU which realizes predetermined functions in cooperation with software, and controls the entire operation of the data analysis apparatus 50. The controller 51 reads data and programs stored in the storage 52 to perform various calculation processing and realize various functions. For example, the controller 51 executes data analysis processing which realizes data analysis according to the above-described kernel PLS-ROG. The program for executing the data analysis processing may be packaged software. The controller 51 may be a hardware circuit such as a dedicated electronic circuit designed to realize a predetermined function or a reconfigurable electronic circuit. The controller 51 may be configured by various semiconductor integrated circuits such as a CPU, an MPU, a microcomputer, a DSP, an FPGA, and an ASIC.
The storage 52 is a storage medium which stores necessary programs and data to realize the functions of the data analysis apparatus 50, and includes, for example, a hard disk (HDD) or a semiconductor storage device (SSD). The storage 52 may include, for example, a semiconductor device such as a DRAM or an SRAM, to temporarily store data and function as a working area of the controller 51. For example, the storage 52 stores arithmetic expressions (formulas (15) and (16)) of the kernel PLS-ROG, the data matrix X (indicating measurement data of a plurality of measurement items per statistical sample), the dummy matrix Y (indicating group information related to a group order among statistical samples), and the kernel matrix K. When N types of measurement data are obtained per statistical sample for the data matrix X, the storage 52 manages a plurality of types of data matrices X(1), X(2), . . . , and X(N).
The operation unit 53 is a user interface through which a user performs an operation. The operation unit 53 is configured by, for example, a keyboard, a touch pad, a touch panel, buttons, switches, and combinations thereof. The operation unit 53 is an exemplary obtaining unit which obtains various information input by the user.
The display 54 is configured by, for example, a liquid crystal display or an organic EL display. The display 54 displays various information such as information input from the operation unit 53, for example.
The device interface 55 is a circuit (module) which connects other devices to the data analysis apparatus 50. The device interface 55 performs communication according to a predetermined communication standard. Examples of the predetermined standard include USB, HDMI (registered trademark), IEEE 1394, Wi-Fi, and Bluetooth (registered trademark).
The network interface 56 is a circuit (module) which connects the data analysis apparatus 50 to the network via a wireless or wired communication line. The network interface 56 performs communication conforming to predetermined communication standards. The predetermined communication standards include communication standards such as IEEE802.3, and IEEE802.11a/11b/11g/11ac.
The operation of the data analysis apparatus 50 according to the present embodiment will be described with reference to
The flowcharts illustrated in
In the following operation example, the storage 52 stores in advance the various data matrices X(L), X(H), X(B), and X(P) indicating the metabolome data of each type of the liver, the heart, the brain, and the plasma (see
In the flowchart of
Returning to
Next, the controller 51 performs the kernel PLS-ROG calculation processing based on the obtained data matrices X(L), X(H), X(B), and X(P) and dummy matrix Y (S3). The kernel PLS-ROG calculation processing is processing for calculating the formulas (15) and (16) of the kernel PLS-ROG described in the above 2-2-2. This processing will be described with reference to
Here, the kernel PLS-ROG calculation processing (S3) will be described with reference to the flowchart of
Next, the controller 51 selects one type of a plurality of types of data matrices X(L), X(H), X(B), and X(P) (e.g. a liver sample) (S11).
Next, based on the measurement data (of the liver sample) xi(L) and xj(L) (i, j=1 to 9) for a pair of individuals among all nine individuals, the controller 51 calculates the kernel function k(xi(L), xj(L)) as the (i, j) element of the kernel matrix K(L) of the selected type (S12).
The controller 51 performs the calculation of step S12 on all pairs of combinations of all nine individuals, to calculate every matrix element of the kernel matrix K(L) per type (S13). For example, in the case of linear kernel, the controller 51 calculates the matrix elements of the kernel matrix K(L) based on the inner product of a pair of measurement data xi(L) and xj(L) as illustrated in
The controller 51 performs the processing of steps S11 to S13 on each type of the data matrix X(L), X(H), X(B), and X(P) (S14), to calculate all types of kernel matrices K(L), K(H), K(B), and K(P).
When all types of the kernel matrices K(L), K(H), K(B), and K(P) have been calculated (Yes in S14), the controller 51 takes an average of types according to the arithmetic expression illustrated in
Next, the controller 51 reads, from the storage 52, the arithmetic expressions (15) and (16) in the theory of the kernel PLS-ROG described in the above section 2-2-2, and substitutes the averaged kernel matrix K and dummy matrix Y in the arithmetic expression (S16).
Next, the controller 51 calculates the eigenvalue λ of the generalized eigenvalue problem with the substituted arithmetic expression, and the eigenvectors αx and wy corresponding to each eigenvalue λ (S17).
Next, the controller 51 calculates the explanatory variable t (n=9-dimensional vector) corresponding to each of the calculated (g−1) eigenvectors based on the calculated kernel matrix K (formula (12)), to calculate the scores for each individual (S18).
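Steps S16 to S18 can be sketched in Python as below. This is not the arithmetic expression stored in the storage 52: it solves a generalized symmetric eigenvalue problem derived from the formulas (4), (6), (12), and (13) under several assumptions, namely that the kernel matrix is double-centered, that the covariance uses the factor 1/(n−1), that P averages within groups, that D takes differences between adjacent groups, and that κ is less than 1. The function name kernel_pls_rog and the demo data are hypothetical.

```python
import numpy as np

def kernel_pls_rog(K, Y, P, D, kappa):
    """Return scores T (n x (g-1)), eigenvalues lam, and the vectors Alpha and Wy."""
    n, g = Y.shape
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                   # assumption: double-centered kernel
    c = 1.0 / (n - 1)                                # assumption: covariance factor
    A = c ** 2 * (Y.T @ Kc @ Y)
    DPY = D @ P @ Y
    M = (1.0 - kappa) * np.eye(g) + kappa * (DPY.T @ DPY)   # matrix of formula (6)
    L = np.linalg.cholesky(M)                        # positive definite when kappa < 1
    Linv = np.linalg.inv(L)
    mu, V = np.linalg.eigh(Linv @ A @ Linv.T)        # equivalent symmetric problem
    order = np.argsort(mu)[::-1][: g - 1]            # keep the g-1 largest eigenvalues
    lam = np.sqrt(mu[order])
    Wy = Linv.T @ V[:, order]                        # each column satisfies wy' M wy = 1
    Alpha = c * (Y @ Wy) / lam                       # each column satisfies formula (13)
    T = Kc @ Alpha                                   # formula (12)
    return T, lam, Alpha, Wy

# Illustrative data: 9 individuals in 3 ordered groups, 20 measurement items.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 3)
Y = np.eye(3)[labels]
P = (Y / Y.sum(axis=0)).T
D = np.eye(3)[:-1] - np.eye(3)[1:]
X = rng.normal(size=(9, 20)) + labels[:, None] * 1.0
T, lam, Alpha, Wy = kernel_pls_rog(X @ X.T, Y, P, D, kappa=0.5)
```

Setting kappa to 0 reduces the formula (6) to wy′wy=1, so the same function then computes scores without the group-order penalty.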
Returning to
Next, the controller 51 receives a user's operation via the operation unit 53, and determines whether or not the user selects any one of the components of the displayed scores for further data analysis (S5). For example, the user can select the score component which reflects the group order from a plot image of the score displayed on the display 54 (see
When it is determined that the user does not select any score component (No in S5), the controller 51 ends the present processing.
On the other hand, when it is determined that the user selects any one of the score components (Yes in S5), the controller 51 analyzes a correlation between metabolites of each of the various data matrices X(L), X(H), X(B), and X(P) and the score of the selected component (S6). More specifically, the controller 51 calculates a correlation coefficient (a coefficient indicating the correlation between statistical distributions of both data (see
The analysis lists La to Ld illustrated in
According to the above data analysis processing, it is possible to realize the kernel PLS-ROG which enables integrated analysis among types according to the kernel matrix K while taking into account the group order among individuals. Hereinafter, the analysis result of the data analysis processing will be described.
A biological research (Non-Patent Document 1) suggests that the concentrations of metabolites of metabolic intermediates (N, N-Dimethylglycine and Betaine) and purine metabolism in a glycine biosynthesis pathway of the liver increase/decrease in the order of the wild rabbits (first group), the WHHL rabbits with dosing (second group), and the WHHL rabbits (third group). From this point of view, the data analysis processing (
Further, in the data analysis processing (
Further, in order to determine stochastically whether the above correlation can actually have a meaning (significant) or is a mere coincidence on the data, the p value is used in the hypothesis testing. In this analysis, as illustrated in
In the tables of 13A to 13D, information indicated by the analysis lists La to Ld is illustrated together in the case of the kernel PLS (
As illustrated in
Further, regarding the purine metabolism, as illustrated in
As for the other metabolites, as illustrated in
As described above, the data analysis apparatus 50 according to the present embodiment can generate a common score by taking into account the group order based on the kernel PLS-ROG, and thereby integrally analyze each metabolism of liver, heart, brain, and plasma samples. Further, the data analysis apparatus 50 can change the setting of the value of κ, and thereby compare the correlation in the case of the kernel PLS-ROG and the correlation in the case of the kernel PLS as described above, and perform various data analysis.
As described above, the data analysis apparatus 50 according to the present embodiment performs multivariate analysis related to a plurality of measurement items on a plurality of statistical samples, based on measurement data obtained by measuring the plurality of measurement items per statistical sample. The data analysis apparatus 50 includes the storage 52 and the controller 51. The storage 52 records the data matrix X composed of the measurement data obtained by measuring the plurality of measurement items per statistical sample, and the dummy matrix Y representing group information which indicates a predetermined order of the groups formed by the plurality of statistical samples. The controller 51 performs predetermined calculation processing based on the data matrix X and the dummy matrix Y. The controller 51 calculates the predetermined kernel function k(xi, xj), which takes the measurement data of a pair of statistical samples among the plurality of statistical samples as the arguments xi and xj. Based on the calculation result of the kernel function k(xi, xj) and the group information, the controller 51 calculates the scores of the plurality of statistical samples according to the partial least squares method (kernel PLS-ROG) under a predetermined condition defined by the dummy matrix Y and by the kernel matrix K, whose matrix elements are the values of the kernel function k(xi, xj) for each pair of statistical samples.
The data analysis apparatus 50 according to the present invention can perform integrated analysis and nonlinear analysis on various types of measurement data based on the kernel matrix K while reflecting a group order in scores based on group information (dummy matrix Y). Consequently, it is possible to perform various types of data analysis while taking into account the group order among statistical samples.
In the present embodiment, the storage 52 manages a plurality of types of measurement data x(L)i, x(H)i, x(B)i, and x(P)i per statistical sample as the various data matrices X(L), X(H), X(B), and X(P). The various types of measurement data x(L)i, x(H)i, x(B)i, and x(P)i are, for example, metabolome data whose measurement items are a plurality of metabolites in the biological body. The controller 51 calculates the kernel matrix K from the average of the kernel functions related to the respective types of the measurement data x(L)i, x(H)i, x(B)i, and x(P)i. Consequently, it is possible to integrally analyze a plurality of types of measurement data x(L)i, x(H)i, x(B)i, and x(P)i which are separately managed.
In the present embodiment, the score calculated by the data analysis apparatus 50 increases or decreases according to the group order indicated by the dummy matrix Y. Therefore, by using the calculated scores, analysis of data by taking into account the group order can be facilitated. For example, in the present embodiment, the controller 51 analyzes the correlation between data of each measurement item in the measurement data and the calculated score.
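The correlation analysis between each measurement item and a selected score component can be sketched as below. The code computes the Pearson correlation coefficient of every column of a data matrix with a score vector and ranks the items by absolute correlation. The function name and the synthetic data are hypothetical, and the sketch assumes that no column is constant, since a constant column would cause a division by zero.

```python
import numpy as np

def score_correlations(X, t):
    """Pearson correlation of every column of X (n x p) with the score vector t (n)."""
    Xc = X - X.mean(axis=0)
    tc = t - t.mean()
    # Assumes no column of X is constant (a zero norm would divide by zero).
    return (Xc.T @ tc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(tc))

# Synthetic example: item 0 rises with the score, item 1 falls, item 2 is noise.
rng = np.random.default_rng(1)
t = np.arange(9, dtype=float)
X = np.column_stack([
    t + 0.1 * rng.normal(size=9),
    -t + 0.1 * rng.normal(size=9),
    rng.normal(size=9),
])
r = score_correlations(X, t)
ranking = np.argsort(-np.abs(r))   # items ordered by strength of correlation
```

For significance testing, the statistic r·sqrt((n−2)/(1−r²)) follows a t distribution with n−2 degrees of freedom under the null hypothesis of zero correlation for normally distributed data, so a p value could be obtained from it with any statistics library.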
In the present embodiment, the predetermined condition includes a first condition and a second condition. The first condition is that the inner product, via the kernel matrix K, of a first vector αx with itself is set to a predetermined value, the first vector αx being related to the explanatory variable t, which is one of the explanatory variable t and the objective variable s in the partial least squares method (formula (13)). The second condition is that the magnitude of a second vector wy is shifted from the predetermined value according to a predetermined penalty term based on the group information, the second vector wy being related to the objective variable s (formula (6)).
The data analysis method (kernel PLS-ROG) according to the present invention is also useful for analysis of metagenome data having an order among groups of samples, and for integrated analysis of metagenome data and metabolome data. An example of integrated analysis of metagenome data and metabolome data according to the kernel PLS-ROG will be described below.
This example applies integrated analysis according to the kernel PLS-ROG to the metagenome data and metabolome data disclosed in Non-Patent Document 3. Non-Patent Document 3 is a study which uses metagenome data and metabolome data of human breast milk. Breast milk, which is an important source of bacteria for the growth of an infant, is known to influence the composition of intestinal bacteria in newborns. Non-Patent Document 3 discloses that an analysis of the bacterial flora and metabolites in the breast milk of a mother undergoing chemotherapy for Hodgkin's lymphoma showed the influence of the chemotherapy on these profiles.
According to Non-Patent Document 3, two samples of breast milk were collected at each of the 0th, 2nd, 4th, 6th, 10th, 12th, 14th, and 16th weeks from the start of chemotherapy from mothers who underwent chemotherapy for Hodgkin's lymphoma. For each sample, 16S rRNA metagenome analysis using a next generation sequencer and metabolome analysis using a gas chromatography-mass spectrometer were performed. Further, known UniFrac analysis (see, for example, Non-Patent Document 3) was performed on the data of the metagenome analysis result to obtain data constituting a similarity matrix D. The similarity matrix D is a matrix in which each element expresses the similarity between samples, and is expressed by the following formula (20), where m is the number of samples.
In the formula (20), di,j represents a similarity (i, j=1 to m) which is the degree of similarity between the ith sample and the jth sample. Note that di,j has a value within the range of 0 to 1, and as di,j is closer to 0, the sample i and the sample j are more similar. The similarity matrix D and the metagenome analysis result data are examples of metagenome data indicating information related to the gene sequence of the bacterial flora.
Further, the metabolome data obtained by the metabolome analysis of Non-Patent Document 3 constitutes a data matrix X with 16 rows, one per sample, and 225 columns, one per substance.
The statistical data of Non-Patent Document 3 as described above is open to the public. In this example, the data analysis apparatus 50 performed integrated analysis by applying the kernel PLS-ROG and the kernel PLS to the statistical data of 14 samples, obtained by removing the defective part of the data from the above statistical data.
Based on the above similarity matrix D, the data analysis apparatus 50 generated the kernel matrix Kg of the metagenome data as follows. Each non-diagonal element of the kernel matrix Kg was set to the reciprocal of the corresponding element of the similarity matrix D. Further, each diagonal element of the kernel matrix Kg was set to 20 as a predetermined value.
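The construction of the kernel matrix Kg described above can be sketched in Python with NumPy as follows. This is only a minimal illustration and not the implementation of the data analysis apparatus 50. The function name kernel_from_similarity and the 3-sample matrix are hypothetical, and the sketch assumes that every non-diagonal element of D is nonzero.

```python
import numpy as np

def kernel_from_similarity(D, diagonal_value=20.0):
    """Build Kg from D: reciprocal off the diagonal, a preset value on it."""
    D = np.asarray(D, dtype=float)
    m = D.shape[0]
    off_diag = ~np.eye(m, dtype=bool)          # True for every non-diagonal element
    Kg = np.full((m, m), float(diagonal_value))
    # Assumption: no non-diagonal element of D is zero (otherwise 1/0 occurs).
    Kg[off_diag] = 1.0 / D[off_diag]
    return Kg

# Hypothetical 3-sample matrix D; the values are illustrative only.
D = np.array([[0.0, 0.5, 0.25],
              [0.5, 0.0, 0.8],
              [0.25, 0.8, 0.0]])
Kg = kernel_from_similarity(D)
print(Kg)
```

Because D is symmetric, Kg is symmetric as well. The diagonal of D is never read, so the predetermined value alone determines the diagonal of Kg.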
In addition, the data analysis apparatus 50 generated the kernel matrix Km of the metabolome data by using the linear kernel of the data matrix X (Km=XX′). The data analysis apparatus 50 calculated the kernel matrix K obtained by integrating metagenome data and metabolome data based on the average between the kernel matrices Kg and Km as in the following formula (21).
K=(½)Kg+(½)Km (21)
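The linear kernel Km=XX′ and the averaging of the formula (21) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names linear_kernel and integrate_kernels and all numerical values are hypothetical, the rows of X are assumed to correspond to samples, and any preprocessing of X such as centering or scaling is omitted.

```python
import numpy as np

def linear_kernel(X):
    """Linear kernel Km = X X' for a (samples x substances) data matrix X."""
    X = np.asarray(X, dtype=float)
    return X @ X.T

def integrate_kernels(kernels):
    """Equal-weight average of kernel matrices that share the same shape."""
    return np.mean(np.stack(kernels), axis=0)

# Hypothetical data: 3 samples, 2 substances.
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])
Km = linear_kernel(X)

# Hypothetical metagenome kernel for the same 3 samples.
Kg = np.array([[20.0, 2.0, 4.0],
               [2.0, 20.0, 1.25],
               [4.0, 1.25, 20.0]])

K = integrate_kernels([Kg, Km])   # K = (1/2)Kg + (1/2)Km
print(K)
```

Both kernel matrices must be indexed by the same samples in the same order, since the integration is carried out element by element.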
Based on the above kernel matrix K and a dummy matrix indicating the group order, the data analysis apparatus 50 performed data analysis by the kernel PLS-ROG (κ=0.5) and by the kernel PLS (κ=0.5) to calculate scores. In the dummy matrix, every two samples form one group corresponding to the period from the start of chemotherapy at which the samples were collected.
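A dummy matrix of this kind can be sketched as follows. The sketch assumes that the dummy matrix is a 0/1 matrix with one row per sample and one column per group, with the columns arranged in the group order. The function name dummy_matrix and the three-group example are hypothetical and do not reproduce the actual 14 samples.

```python
import numpy as np

def dummy_matrix(group_labels, group_order):
    """Return an (m, g) 0/1 matrix; column j marks the j-th group in group_order."""
    column_of = {group: j for j, group in enumerate(group_order)}
    Y = np.zeros((len(group_labels), len(group_order)))
    for i, label in enumerate(group_labels):
        Y[i, column_of[label]] = 1.0
    return Y

# Hypothetical example: weeks from the start of treatment, two samples per week.
group_order = [0, 2, 4]
group_labels = [0, 0, 2, 2, 4, 4]
Y = dummy_matrix(group_labels, group_order)
print(Y)
```

Each row contains exactly one 1, and the left-to-right order of the columns carries the order of the groups.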
As illustrated in
On the other hand, according to the kernel PLS-ROG, regarding the metagenome data as illustrated in
As described above, the data analysis method of the kernel PLS-ROG according to the present invention is applicable to metagenome data for analysis of bacterial flora in breast milk or bacterial flora of intestinal bacteria. The data analysis method of the kernel PLS-ROG according to the present invention can integrally analyze metagenome data and metabolome data.
In the first embodiment described above, an example in which a data analysis apparatus 50 is configured by an information processing device such as a PC is described. However, the present invention is not limited thereto, and the data analysis apparatus 50 may be a server device such as an ASP server. For example, the data analysis apparatus 50 may obtain, from a network interface (an example of an obtaining unit), information indicating a data matrix X and a dummy matrix Y input via a network, and execute data analysis processing. Further, the data analysis apparatus 50 may transmit information indicating a score generated by the data analysis processing via the network.
In the first embodiment, an example of application of this data analysis method to metabolomics is described. This data analysis method is not limited to metabolomics, and may be applied to various types of omics analysis and multivariate analysis of chemometrics. In this case, measurement data may be data obtained by omics analysis or chemometrics in an identical biological body.
In the first embodiment, integrated analysis of a plurality of types of metabolome data is described. This data analysis method is applicable to various situations which require integrated analysis. For example, this data analysis method may be used to integrate metabolome data and gene expression data, or to integrate and analyze analysis data obtained from a plurality of measurement platforms.
In the first embodiment, the metabolome data illustrated in
In the first embodiment, the correlation of the score component selected by the user is analyzed (step S5 of
In the first embodiment, hypothesis testing is performed based on an analysis result of the data analysis processing. The data analysis apparatus 50 may perform the hypothesis testing. For example, threshold values of a correlation coefficient and a p value may be preset in the storage 52, and the controller 51 may extract metabolites satisfying a predetermined condition (e.g., an absolute value of the correlation coefficient of “0.6” or more and a p value of “0.05” or less) to analyze the correlation with specific score components.
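The extraction under the predetermined condition can be sketched as follows. This is a minimal sketch and not the processing of the controller 51: it assumes that the correlation is a Pearson correlation and that the p value is the two-sided p value returned by scipy.stats.pearsonr, neither of which is specified above. The function name select_correlated_items, the item names, and the data are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def select_correlated_items(X, score, item_names, r_min=0.6, p_max=0.05):
    """Keep data items whose correlation with the score meets both thresholds."""
    X = np.asarray(X, dtype=float)
    selected = []
    for j, name in enumerate(item_names):
        r, p = pearsonr(X[:, j], score)      # Pearson r and two-sided p value
        if abs(r) >= r_min and p <= p_max:
            selected.append((name, r, p))
    return selected

# Hypothetical score component for 6 samples and two hypothetical data items.
score = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([
    [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],      # rises with the score
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],   # alternates, weakly related to the score
])
selected = select_correlated_items(X, score, ["metabolite_A", "metabolite_B"])
print([name for name, r, p in selected])
```

Only the data items that satisfy both thresholds are passed on to the subsequent correlation analysis.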
Hereinafter, aspects of the present invention are exemplified.
1st aspect of the present invention is a data analysis apparatus which performs multivariate analysis with a plurality of data items on a plurality of statistical samples. The data analysis apparatus comprises a storage and a controller. The storage records statistical data for managing the plurality of data items per statistical sample, and group information indicating an arrangement order of groups formed by the plurality of statistical samples. The controller performs predetermined calculation processing based on the statistical data and the group information. The controller calculates, based on the statistical data, a kernel matrix in which a matrix element represents a predetermined relationship between a statistical sample corresponding to a row number of the matrix element in the plurality of statistical samples and a statistical sample corresponding to a column number of the matrix element. The controller performs calculation processing based on a partial least squares method under a predetermined condition defined by the kernel matrix and the group information, to calculate a score for the statistical samples.
2nd aspect of the present invention is the data analysis apparatus according to 1st aspect, wherein the storage manages a plurality of types of measurement data per statistical sample in the statistical data. The controller generates a kernel matrix with respect to each of the plurality of types of measurement data, and calculates an integrated kernel matrix based on an average of kernel matrices for the plurality of types of measurement data.
3rd aspect of the present invention is the data analysis apparatus according to 1st or 2nd aspect, wherein the predetermined relationship is defined by a kernel function based on data with respect to the statistical sample corresponding to the row number in the statistical data, and data with respect to the statistical sample corresponding to the column number.
4th aspect of the present invention is the data analysis apparatus according to any one of 1st to 3rd aspects, wherein the score increases or decreases according to the order of the groups indicated by the group information.
5th aspect of the present invention is the data analysis apparatus according to any one of 1st to 4th aspects, wherein the controller analyzes a correlation between data per data item in the statistical data and the calculated score.
6th aspect of the present invention is the data analysis apparatus according to any one of 1st to 5th aspects, wherein the predetermined condition includes a first condition and a second condition. The first condition is for setting an inner product via the kernel matrix between a first vector and itself to a predetermined value, the first vector being related to the explanatory variable, out of the explanatory variable and the objective variable in the partial least squares method. The second condition is for shifting a magnitude of a second vector from a predetermined value by a predetermined penalty term based on the group information, the second vector being related to the objective variable.
7th aspect of the present invention is the data analysis apparatus according to any one of 1st to 6th aspects, wherein the statistical data includes metabolome data in which the data items are a plurality of metabolites in biological bodies.
8th aspect of the present invention is the data analysis apparatus according to any one of 1st to 7th aspects, wherein the statistical data includes metagenome data indicating information related to a gene sequence of a bacterial flora.
9th aspect of the present invention is the data analysis apparatus according to any one of 1st to 8th aspects, wherein the statistical data includes data obtained by omics analysis or chemometrics in an identical biological body.
10th aspect of the present invention is a data analysis method for causing a computer to perform multivariate analysis with a plurality of data items on a plurality of statistical samples. A storage of the computer records statistical data for managing the plurality of data items per statistical sample, and group information indicating an arrangement order of groups formed by the plurality of statistical samples. The data analysis method comprises, by the computer, calculating, based on the statistical data, a kernel matrix in which a matrix element represents a predetermined relationship between a statistical sample corresponding to a row number of the matrix element in the plurality of statistical samples and a statistical sample corresponding to a column number of the matrix element. The data analysis method comprises, by the computer, performing calculation processing based on a partial least squares method under a predetermined condition defined by the kernel matrix and the group information, to calculate a score for the statistical samples.
11th aspect of the present invention is a program for causing a computer to execute the data analysis method according to 10th aspect.
Number | Date | Country | Kind |
---|---|---|---|
2015-230862 | Nov 2015 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2016/084509 | 11/21/2016 | WO | 00 |