The present invention relates to a data processing device, a data processing method, a data processing program, and a non-transitory recording medium, and particularly to a technology for classifying input data.
A technology using a “self-organizing map (SOM)” is known as a method of classifying input data into groups of data having similar properties. The self-organizing map (hereinafter sometimes referred to as the “SOM”) is a map generated by a machine learning method devised by T. Kohonen. The map is generated in a self-organizing manner by repeating an operation of searching for the lattice point whose reference data is closest to the input data and reflecting information on the input data on the lattice points near that lattice point. The SOM maps a large number of pieces of input data from a high-dimensional space to a low-dimensional space while maintaining the similarity between the pieces of data. A technology for creating a structure map showing three-dimensional structures of molecules is known as a data classification technology using the SOM (for example, see JP2007-277234A). In the case of a molecule, the dimension of the data can be represented by, for example, the number of dihedral angles, and a molecule having a complicated three-dimensional structure has high-dimensional data.
In the creation of the SOM described above, the arrangement of data in the map changes as learning proceeds. How the arrangement changes is illustrated in, for example, “self-organizing—automatic classification algorithm”, [online], Yuji Ikegaya, [searched on May 7, 2018], Internet (http://gaya.jp/spiking_neuron/som.htm).
In a case where the SOM described in JP2007-277234A and “self-organizing—automatic classification algorithm”, [online], Yuji Ikegaya, [searched on May 7, 2018], Internet (http://gaya.jp/spiking_neuron/som.htm) is used, lattice points having substantially the same reference data may appear at separated locations on the map. For example, in the simulation result shown in “self-organizing—automatic classification algorithm”, [online], Yuji Ikegaya, [searched on May 7, 2018], Internet (http://gaya.jp/spiking_neuron/som.htm), in which the data at each lattice point (cell) is a three-dimensional color vector having red (R), green (G), and blue (B) components, “yellow” cells appear at separated locations on the map. That is, even pieces of data that originally have similar properties may be classified in the SOM as pieces of data whose features are greatly different.
As described above, the related art cannot appropriately classify a plurality of pieces of high-dimensional data.
The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a data processing device, a data processing method, a data processing program, and a non-transitory recording medium capable of appropriately classifying a plurality of pieces of high-dimensional data.
In order to achieve the aforementioned object, a data processing device according to a first aspect of the present invention comprises a data input unit that inputs a plurality of pieces of data, an initial value setting unit that sets initial values of reference vectors to all lattice points of a lattice point space including a plurality of lattice points based on the plurality of pieces of data, one lattice point being coupled to all other lattice points in the lattice point space, a distance calculation unit that calculates inter-lattice-point distances between one lattice point and the other lattice points by using a designated distance function based on the initial values of the reference vectors, a search unit that calculates distances between an input vector constituted by components of the plurality of pieces of data and the reference vectors for the lattice points based on the distance function, and searches for a nearest lattice point which is the lattice point of which the distance from the input vector is the shortest based on a result of the calculation, a data allocation unit that allocates, as data for the nearest lattice point, each of the plurality of pieces of data based on a result of the search, a correction vector calculation unit that calculates correction vectors for the reference vectors by using a reflection function for reflecting information on the plurality of pieces of data on the nearest lattice point and the lattice points near the nearest lattice point, a distance update unit that corrects the reference vectors by adding the correction vectors to the reference vectors of the lattice points, and updates the inter-lattice-point distances by using the plurality of pieces of data allocated to the lattice points and the reference vectors, a repetition controller that repeats processing in the search unit, the data allocation unit, the correction vector calculation unit, and the distance update unit for all the plurality of pieces of data 
and all the plurality of lattice points until a designated end condition is satisfied, and an information output unit that outputs information indicating the inter-lattice-point distances updated by the repetition.
The inventors of the present application have made extensive studies on the problems of the related art (SOM) described above, and have found that pieces of similar data appear at separated locations on the map because a specific shape (a 10×10 square lattice in “self-organizing—automatic classification algorithm”, [online], Yuji Ikegaya, [searched on May 7, 2018], Internet (http://gaya.jp/spiking_neuron/som.htm)) is set for the lattice point space, so that the information on the input data is not reflected on lattice points that are far away. Based on this finding, in the data processing device according to the first aspect, the similarity between the pieces of reference data (the data allocated to each lattice point) is regarded as the inter-lattice-point distance, without assuming a specific shape for the lattice point space. Specifically, in the SOM, distances between a lattice point 801 and lattice points 802, 803, and 804 are 1, 2^(1/2), and 2, respectively, as shown in
As described above, in the present invention, since one lattice point is coupled to all other lattice points (at the inter-lattice-point distance corresponding to the similarity between the pieces of reference data) and there is no “lattice point that is distant geometrically”, information on input data can be reflected on all the lattice points, and the lattice points of substantially the same reference data do not appear at separated locations in the lattice point space. Accordingly, the data processing device according to the first aspect can appropriately classify the plurality of pieces of high-dimensional data. The coupling of the lattice points may be maintained while the update of the inter-lattice-point distance between the lattice points is repeated, and may be decoupled at a stage at which the update is ended and information (for example, a two-dimensional or three-dimensional map) is output.
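The processing flow of the first aspect (search, allocation, correction, and distance update over a fully coupled lattice point space) might be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names `ward` and `sctm_sketch`, the learning rate `rate`, the fixed number of passes used as the end condition, and the update rule r + w(x − r) are hypothetical choices that do not appear in this description, and the inter-lattice-point distances are recomputed here from the reference vectors alone.

```python
import math


def ward(a, b):
    """Distance between two single vectors, taken as 0.5 * |a - b|^2."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(a, b))


def sctm_sketch(data, initial_refs, sigma=1.0, rate=0.1, epochs=20):
    """Illustrative loop only; rate, epochs and the update rule are assumptions."""
    refs = [list(r) for r in initial_refs]
    n = len(refs)
    allocation = [[] for _ in range(n)]
    for _ in range(epochs):  # assumed end condition: a fixed number of passes
        # Every lattice point is coupled to every other lattice point, so
        # "nearness" is defined only by these pairwise distances.
        dist = [[ward(refs[i], refs[j]) for j in range(n)] for i in range(n)]
        allocation = [[] for _ in range(n)]
        for x in data:
            # Search step: lattice point whose reference vector is closest to x.
            nearest = min(range(n), key=lambda i: ward(x, refs[i]))
            # Allocation step.
            allocation[nearest].append(x)
            # Correction step: reflect x on all lattice points, weighted by a
            # reflection function that decays with inter-lattice-point distance.
            for j in range(n):
                weight = rate * math.exp(-dist[nearest][j] / sigma)
                refs[j] = [r + weight * (xi - r) for r, xi in zip(refs[j], x)]
    # Final inter-lattice-point distances, corresponding to the output information.
    dist = [[ward(refs[i], refs[j]) for j in range(n)] for i in range(n)]
    return refs, allocation, dist
```

Because no grid shape is assumed, the only structure in this sketch is the matrix `dist`, which plays the role of the inter-lattice-point distance information that is output at the end.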
In the first aspect and the following aspects, the “reference vector” is a vector indicating the feature of the data belonging to the lattice point. Data whose distance from a reference vector is short is stored at the lattice point having that reference vector, which means that data having features similar to the reference vector is collected at the lattice point. The initial values of the reference vectors can be set randomly. However, in that case, the result obtained by updating the inter-lattice-point distances may differ from run to run even though the input data is the same. The initial values of the reference vectors are therefore preferably set according to a predetermined standard, and more preferably reflect the spatial distribution of the input data.
In the first aspect and the following aspects, any function d(x, y) that satisfies the following four conditions for any two points (x, y) or two data groups (X, Y) can be used as the “distance function” (the conditions are the same for the function d(X, Y)).
Condition (1): d(x, y) is a non-negative real number
Condition (2): if d(x, y)=0, then x=y
Condition (3): d(x, y)=d(y, x)
Condition (4): d(x, z)+d(z, y)≥d(x, y)
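Conditions (1) to (4) can be spot-checked numerically for a candidate distance function on a finite sample of points. The following is a minimal sketch, not part of the described device: the helper name `satisfies_conditions`, the tolerance `tol`, and the sample points are hypothetical, and the Euclidean distance is used only as an example candidate. Passing the check on a sample does not prove that the conditions hold for all points.

```python
import itertools
import math


def euclid(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.dist(x, y)


def satisfies_conditions(d, points, tol=1e-12):
    """Check Conditions (1) to (4) for d on every combination of sample points."""
    for x, y in itertools.product(points, repeat=2):
        value = d(x, y)
        if value < 0:                      # Condition (1): non-negative
            return False
        if value == 0 and x != y:          # Condition (2): d = 0 implies x = y
            return False
        if abs(value - d(y, x)) > tol:     # Condition (3): symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) + d(z, y) < d(x, y) - tol:  # Condition (4): triangle inequality
            return False
    return True


sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
print(satisfies_conditions(euclid, sample))
```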
In accordance with a data processing device according to a second aspect, in the first aspect, the distance calculation unit calculates the inter-lattice-point distances by using the reference vector of the one lattice point and the reference vectors of the other lattice points, and the search unit calculates the distances by using the input vector and the reference vectors. The second aspect defines one aspect of a method of calculating the inter-lattice-point distances and the distances between the input vector and the reference vectors.
In accordance with a data processing device according to a third aspect, in the first or second aspect, the initial value setting unit sets the initial values of the reference vectors based on statistical features of the data. The third aspect defines one aspect of the initial value setting method, and for example, an average, a variance, and a correlation can be used as the “statistical features”. However, the present invention is not limited to these examples. A principal component analysis, a regression analysis, a kernel principal component analysis, and the like can be used as a specific method. In a case where the principal component analysis is used, the initial value setting unit can set the initial values of the reference vectors based on an average vector of the input data, a maximum eigenvalue of a variance-covariance matrix, and an eigenvector corresponding to the maximum eigenvalue. The initial values of the reference vectors may be set by further considering second and third principal components in addition to the maximum eigenvalue (first principal component).
In accordance with a data processing device according to a fourth aspect, in any one of the first to third aspects, the distance function is a function for obtaining a distance between the pieces of data. The “distance between the pieces of data” includes not only a distance for any two points (x, y) but also a distance for two data groups (X, Y). Specifically, for example, a Ward distance, a Euclid distance, a Mahalanobis distance, and other functions used in a cluster analysis can be used as the distance function. These functions are specific examples of the distance functions that satisfy the conditions described above in the first aspect, but the distance function in the data processing device according to the embodiment of the present invention is not limited thereto.
In accordance with a data processing device according to a fifth aspect, in any one of the first to fourth aspects, the correction vector calculation unit calculates the correction vectors by using, as the reflection function, a function of which a value decreases as the inter-lattice-point distance increases. That is, in the fifth aspect, the degree to which information is reflected decreases as the inter-lattice-point distance increases. Specifically, for example, in a case where the inter-lattice-point distance is d and the range in which the data is reflected is σ (a constant that defines the influence range of the input data), the correction vectors can be calculated by using the function represented by exp(−d/σ) as the reflection function. However, the present invention is not limited to such an aspect.
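As a small illustration of the example reflection function exp(−d/σ), the following sketch evaluates it for a few inter-lattice-point distances. The function name `reflection` and the sample values of d and σ are hypothetical and chosen only for illustration.

```python
import math


def reflection(d, sigma):
    """Example reflection function exp(-d / sigma); sigma sets the influence range."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-d / sigma)


# The weight is 1 at distance 0 and decays as the distance grows.
for d in (0.0, 0.5, 1.0, 2.0):
    print(d, reflection(d, sigma=1.0))
```

A larger σ makes the decay slower, so information on one piece of input data is reflected on lattice points that are farther away.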
In accordance with a data processing device according to a sixth aspect, in any one of the first to fifth aspects, the initial value setting unit sets the initial values of the reference vectors to the lattice points of the lattice point space in which the number of lattice points is less than the number of the plurality of pieces of data. In the sixth aspect, such a condition is set in order to cluster the data.
In accordance with a data processing device according to a seventh aspect, in any one of the first to sixth aspects, the information output unit creates and outputs a lattice point distribution map on which a distribution of the lattice points and the plurality of pieces of data allocated to the lattice points are represented in a two-dimensional space or a three-dimensional space based on the information indicating the inter-lattice-point distances. In the seventh aspect, since the lattice point distribution map in which the distribution of the lattice points is represented in the two-dimensional space or the three-dimensional space (low-dimensional space) is created and output, the user can easily understand the data distribution even though the input data is high-dimensional.
In accordance with a data processing device according to an eighth aspect, in the seventh aspect, the information output unit sets an initial arrangement of the lattice points in the two-dimensional space or the three-dimensional space, minimizes a designated evaluation function by adjusting the arrangement of the lattice points, and creates and outputs the lattice point distribution map based on the adjusted arrangement. The eighth aspect defines one aspect of a method of creating a low-dimensional lattice point distribution map, and for example, a multidimensional scaling method can be used. However, the invention is not limited thereto. For example, the steepest descent method can be adopted for minimizing the evaluation function, but the invention is not limited thereto.
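The arrangement adjustment of the eighth aspect might be sketched as follows, assuming that the evaluation function is the sum of squared differences between the distances in the low-dimensional arrangement and the inter-lattice-point distances (a stress function of the kind used in multidimensional scaling) and that it is minimized by a plain steepest descent with a fixed step. The names `stress` and `minimize_stress`, the step size `lr`, and the number of steps are hypothetical.

```python
import math


def stress(pos, target):
    """Sum over pairs of (distance in the arrangement - target distance)^2."""
    n = len(pos)
    return sum((math.dist(pos[i], pos[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def minimize_stress(initial_pos, target, steps=2000, lr=0.01):
    """Steepest descent on the stress; returns the adjusted arrangement."""
    pos = [list(p) for p in initial_pos]
    n, dim = len(pos), len(pos[0])
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dij = math.dist(pos[i], pos[j])
                if dij == 0.0:
                    continue  # gradient undefined for coincident points; skip
                coef = 2.0 * (dij - target[i][j]) / dij
                for k in range(dim):
                    grad[i][k] += coef * (pos[i][k] - pos[j][k])
        for i in range(n):
            for k in range(dim):
                pos[i][k] -= lr * grad[i][k]
    return pos


# Three lattice points whose target distances form a 3-4-5 triangle.
target = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
start = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
layout = minimize_stress(start, target)
print(stress(start, target), stress(layout, target))
```

The initial arrangement is passed in explicitly so that the same input always yields the same map; other evaluation functions or minimization methods could be substituted without changing the overall flow.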
In accordance with a data processing device according to a ninth aspect, in any one of the first to sixth aspects, the data input unit inputs local stable structures of a compound and energies of the local stable structures in association with each other, and the repetition controller repeats extraction processing of extracting the local stable structures of the compound based on the updated inter-lattice-point distances and decoupling processing of decoupling the lattice points according to the inter-lattice-point distances until a designated number of local stable structures are extracted. The ninth aspect defines one aspect of processing in a case where the local stable structures of the compound are extracted.
In general, compounds can have different structures depending on the environment (temperature, pH, and the like), but a stable structure (a structure with low energy) is desired to be acquired in a case where compounds as medicine candidates are searched for, for example. However, since the compound may not have the most stable structure (the structure having the lowest energy) depending on the surrounding environment or the like, it is effective to acquire a large number of local stable structures and extract a plausible structure from the local stable structures. In the data processing device according to the embodiment of the present invention, the lattice points of substantially the same reference data do not appear at separated locations in the lattice point space as described above for the first aspect. In the case of the compound, a situation in which one actual local stable structure appears at a plurality of lattice points therefore does not occur, and the local stable structures can be accurately extracted.
In the ninth aspect, the “local stable structure” corresponds to the lowest energy between the energy corresponding to one lattice point and the energy of another lattice point directly coupled to the one lattice point. At the start of the extraction processing and the decoupling processing, since one lattice point is coupled to all the other lattice points, the local stable structure is only one most stable structure. However, the number of local stable structures increases as the extraction processing and the decoupling processing are repeated. Thus, the extraction processing and the decoupling processing are repeated until a desired number of local stable structures are extracted. The extraction processing and the decoupling processing can be performed as the processing of the data processing device without creating the map indicating the lattice point space (the map visually recognizable by the user).
In the ninth aspect, any “energy” can be used as long as the energy (or free energy) is derived from the three-dimensional structure of the compound. For example, in the case of quantum scientific calculation, a total electron energy can be used.
In accordance with a data processing device according to a tenth aspect, in the ninth aspect, the repetition controller performs, as the extraction processing, processing of setting, as a representative energy of one lattice point, a minimum energy among the energies of the local stable structures allocated to the one lattice point for the one lattice point, comparing the representative energies between the one lattice point and all other lattice points coupled to the one lattice point, and extracting the local stable structure corresponding to the minimum representative energy based on a result of the comparison. The tenth aspect defines the specific contents of the extraction processing.
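The extraction processing of the tenth aspect, combined with the repetition described for the ninth aspect, might be sketched as follows. The description only states that lattice points are decoupled "according to the inter-lattice-point distances"; removing the longest remaining coupling first is an assumption made for this sketch, as are the function names, the dictionary-based data layout, and the treatment of ties (a lattice point whose representative energy equals that of a coupled lattice point still counts as a minimum here).

```python
def representative_energies(allocated):
    """Minimum energy among the structures allocated to each lattice point."""
    return {p: min(e) for p, e in allocated.items() if e}


def local_minima(rep, couplings):
    """Lattice points whose representative energy is lowest among coupled points."""
    result = []
    for p, energy in rep.items():
        coupled = [q for q in rep if q != p and frozenset((p, q)) in couplings]
        if all(energy <= rep[q] for q in coupled):
            result.append(p)
    return result


def extract(rep, dist, wanted):
    """Repeat extraction and decoupling until `wanted` minima are found."""
    points = sorted(rep)
    # Start fully coupled: every lattice point is coupled to every other one.
    couplings = {frozenset((i, j)) for i in points for j in points if i < j}
    # Assumption: decouple the most distant pair first.
    order = sorted(couplings, key=lambda c: dist[c], reverse=True)
    minima = local_minima(rep, couplings)
    while len(minima) < wanted and order:
        couplings.discard(order.pop(0))
        minima = local_minima(rep, couplings)
    return minima


# Hypothetical energies and distances for four lattice points.
allocated = {0: [-5.0, -4.5], 1: [-3.0], 2: [-4.0, -2.0], 3: [-1.0]}
dist = {frozenset((0, 1)): 1.0, frozenset((0, 2)): 5.0, frozenset((0, 3)): 6.0,
        frozenset((1, 2)): 4.0, frozenset((1, 3)): 3.0, frozenset((2, 3)): 2.0}
rep = representative_energies(allocated)
print(extract(rep, dist, wanted=1), extract(rep, dist, wanted=2))
```

With all couplings intact, only the lattice point holding the overall lowest energy is returned, which matches the statement that the local stable structure is initially the single most stable structure.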
In accordance with a data processing device according to an eleventh aspect, in the tenth aspect, the information output unit displays an energy distribution map indicating a correspondence between the lattice points and the representative energies of the lattice points on a display device, the lattice point space being projected in the two-dimensional space or the three-dimensional space according to an arrangement of the lattice points and the inter-lattice-point distances on the energy distribution map. In the eleventh aspect, since the energy distribution map on which the lattice point space is projected in two dimensions or three dimensions (a space of lower dimension than the input data) is displayed on the display device, the user can easily grasp the energy distribution (the positions of the local stable structures) visually.
In accordance with a data processing device according to a twelfth aspect, in the eleventh aspect, the information output unit displays the energy distribution map by using a symbol having a size corresponding to the number of local stable structures allocated to the lattice point and a color corresponding to the representative energy of the lattice point. The twelfth aspect defines a specific display aspect of the energy distribution map, and the user can more easily grasp the energy distribution visually.
In order to achieve the aforementioned object, a data processing method according to a thirteenth aspect of the present invention is a data processing method of a data processing device that includes a data input unit which inputs data, a data processing unit that processes the input data, and an information output unit that outputs information regarding the processed data. The method comprises a data input step of inputting, by the data input unit, a plurality of pieces of data, an initial value setting step of setting, by the data processing unit, initial values of reference vectors to all lattice points in a lattice point space including a plurality of lattice points based on the plurality of pieces of data, one lattice point being coupled to all other lattice points in the lattice point space, a distance calculation step of calculating, by the data processing unit, inter-lattice-point distances between one lattice point and other lattice points by using a designated distance function based on the initial values of the reference vector, a search step of calculating, by the data processing unit, distances between an input vector constituted by components of the plurality of pieces of data and the reference vectors for the lattice points based on the distance function, and searching for a nearest lattice point which is the lattice point of which the distance from the input vector is the shortest based on a result of the calculation, a data allocation step of allocating, by the data processing unit, as data for the nearest lattice point, the plurality of pieces of data based on a result of the search, a correction vector calculation step of calculating, by the data processing unit, correction vectors for the reference vectors by using a reflection function for reflecting information on the plurality of pieces of data on the nearest lattice point and the lattice points near the nearest lattice point, a distance update step of correcting, by the data processing unit, 
the reference vectors by adding the correction vectors to the reference vectors of the lattice points, and updating the inter-lattice-point distances by using the data allocated to the lattice points and the reference vectors, a repetition control step of repeating, by the data processing unit, processing in the search step, the data allocation step, the correction vector calculation step, and the distance update step for all the plurality of pieces of data and for all the plurality of lattice points until a designated end condition is satisfied, and an information output step of outputting, by the information output unit, information indicating the inter-lattice-point distances updated by the repetition.
According to the thirteenth aspect, it is possible to appropriately classify the plurality of pieces of high-dimensional data as in the first aspect. The configuration similar to the second to twelfth aspects may be further included in the thirteenth aspect.
In order to achieve the aforementioned object, a data processing program according to a fourteenth aspect of the present invention causes a computer to execute a data input step of inputting a plurality of pieces of data, an initial value setting step of setting initial values of reference vectors to all lattice points of a lattice point space including a plurality of lattice points based on the data, one lattice point being coupled to all other lattice points in the lattice point space, a distance calculation step of calculating inter-lattice-point distances between one lattice point and other lattice points by using a designated distance function based on the initial values of the reference vectors, a search step of calculating distances between an input vector constituted by a plurality of components of the data and the reference vectors for the lattice points based on the distance function, and searching for a nearest lattice point which is the lattice point of which the distance from the input vector is the shortest based on a result of the calculation, a data allocation step of allocating, as data for the nearest lattice point, the data based on a result of the search, a correction vector calculation step of calculating correction vectors for the reference vectors by using a reflection function for reflecting information on the data on the nearest lattice point and the lattice points near the nearest lattice point, a distance update step of correcting the reference vectors by adding the correction vectors to the reference vectors of the lattice points and updating the inter-lattice-point distances by using the data allocated to the lattice points and the reference vectors, a repetition control step of repeating processing in the search step, the data allocation step, the correction vector calculation step, and the distance update step for all the plurality of pieces of data and all the plurality of lattice points until a designated end condition is satisfied, 
and an information output step of outputting information indicating the inter-lattice-point distances updated by the repetition.
According to the fourteenth aspect, it is possible to appropriately classify the plurality of pieces of high-dimensional data as in the first and thirteenth aspects. The configuration similar to the second to twelfth aspects may be further included in the fourteenth aspect. The “computer” in the fourteenth aspect can be realized by using one or more various processors such as a central processing unit (CPU).
In order to achieve the aforementioned object, a non-transitory recording medium according to a fifteenth aspect of the present invention is a non-transitory recording medium having a computer-readable code of the data processing program according to the fourteenth aspect recorded thereon. In the non-transitory recording medium according to the fifteenth aspect, a code for a program further including the configurations according to the second to twelfth aspects in addition to the fourteenth aspect may be recorded.
As described above, in accordance with the data processing device, the data processing method, the data processing program, and the non-transitory recording medium according to the embodiment of the present invention, it is possible to appropriately classify the plurality of pieces of high-dimensional data.
Hereinafter, a data processing device, a data processing method, a data processing program, and a non-transitory recording medium according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings. In the following description, the data processing method according to the embodiment of the present invention may be referred to as a “self-constructing topological map (SCTM) method”.
<Configuration of Processing Unit>
Functions of the units of the processing unit 100 described above can be realized by using various processors. The various processors include, for example, a CPU that is a general-purpose processor that realizes various functions by executing software (program). The various processors described above also include a graphics processing unit (GPU) specialized for image processing and a programmable logic device (PLD), such as a field programmable gate array (FPGA), which is a processor capable of changing a circuit configuration after manufacture. A dedicated electric circuit which is a processor having a circuit configuration specifically designed to execute specific processing, such as an application specific integrated circuit (ASIC), is also included in the various processors described above.
The functions of the units may be realized by one processor, or may be realized by a plurality of processors of the same type or different types (for example, a plurality of FPGAs, a combination of a CPU and an FPGA, or a combination of a CPU and a GPU). A plurality of functions may be realized by one processor. As a first example in which the plurality of functions is realized by one processor, one processor is constituted by a combination of one or more CPUs and software, as represented by a computer such as a client or a server, and this processor realizes the plurality of functions. As a second example, a processor that realizes the functions of the entire system by one integrated circuit (IC) chip is used, as represented by a system on chip (SoC). As described above, various functions are realized by using one or more of the various processors described above as a hardware structure. The hardware structure of these various processors is, more specifically, electric circuitry in which circuit elements such as semiconductor elements are combined.
In a case where the processor or electric circuitry described above executes software (program), a processor-readable code (computer-readable code) of the software to be executed is stored in a non-transitory recording medium such as the ROM 122 (see
<Configuration of Storage Unit>
The storage unit 200 is constituted by a non-transitory recording medium such as a digital versatile disk (DVD), a hard disk, and various semiconductor memories, and a controller thereof, and can store, for example, information (input data 202, reference vector information 204, distance function information 206, reflection function information 208, inter-lattice-point distance information 210, a lattice point distribution map 212, and an energy distribution map 214) shown in
<Configuration of Display Unit and Operation Unit>
The display unit 300 includes a monitor 310 (display device), and can display an input image, information stored in the storage unit 200, a result of processing performed by the processing unit 100, and the like. The operation unit 400 includes a keyboard 410 and a mouse 420 as an input device and/or a pointing device, and a user can perform operations necessary for executing the data processing method according to the embodiment of the present invention via these devices and screens of the monitor 310. Operations executable by the user can include, for example, designations of a method of setting initial values of reference vectors, a distance function, and a reflection function.
<Procedure of Data Processing Method>
Iris data is publicly known data regarding sepals and petals of three types of irises (setosa, versicolor, and virginica) (for example, available from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/), and includes a total of 150 pieces of data, 50 pieces for each type of iris.
<Input of Data>
The data input unit 102 inputs the iris data described above (step S100: data input step). The iris data stored as the input data 202 in the storage unit 200 may be input, or may be read from a recording medium (not shown). Alternatively, the iris data may be acquired from the external server 500 and the external database 510 via the network 1000.
<Setting of Initial Value of Reference Vector>
The initial value setting unit 104 sets the initial values of the reference vectors to all lattice points in a lattice point space including a plurality of lattice points based on a plurality of pieces of data, and in the lattice point space, one lattice point is coupled to all other lattice points (step S110: initial value setting step). The reference vector is a vector representing a feature of the data belonging to the lattice point. In Example 1, the number of lattice points is four. From the viewpoint of classifying (clustering) data, it is preferable that (number of lattice points<number of data). A scene of the lattice point space in a case where the number of lattice points is four is as shown in
The initial value setting unit 104 can set the initial values of the reference vectors based on, for example, a result of a principal component analysis of the input data. In the case of the iris data, an average vector <x>, a maximum eigenvalue (first principal component) λ of a variance-covariance matrix, and the eigenvector X corresponding to the maximum eigenvalue are as in the following Equations (1) to (3). The principal component analysis is an example of the method of setting the initial values based on statistical features of the input data, and the average vector, the first principal component, the eigenvector, and the like are examples of the statistical features of the input data.
[Equation 1]
$\langle x \rangle = (5.84, 3.05, 3.76, 1.20)$ (1)
[Equation 2]
λ=4.22 (2)
[Equation 3]
X=(−0.36,0.08,−0.86,−0.36) (3)
Reference vectors for lattice points 1 to 4 are set as in the following Equation (4) by using these equations. In Equation (4), N=4, and i is an integer of 1 to 4.
The reference vectors for the lattice points 1 to 4 can be specifically expressed as in the following Equations (5) to (8).
[Equation 5]
$r_1 = \langle x \rangle - \sqrt{\lambda}\,X$ (5)
[Equation 6]
$r_2 = \langle x \rangle - \frac{1}{3}\sqrt{\lambda}\,X$ (6)
[Equation 7]
$r_3 = \langle x \rangle + \frac{1}{3}\sqrt{\lambda}\,X$ (7)
[Equation 8]
$r_4 = \langle x \rangle + \sqrt{\lambda}\,X$ (8)
Although only the first principal component is considered in the above-described example, the initial values may be set in consideration of second and third principal components. A method other than the principal component analysis (for example, a method based on statistical features of data such as regression analysis or kernel principal component analysis) may be used to set the initial values. The methods and setting conditions used for setting the initial values may be determined according to an operation of the user. It is preferable that the method of setting the initial value reflects a spatial distribution of the input data.
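The initial reference vectors of Equations (5) to (8) can be reproduced from the values in Equations (1) to (3) with the following sketch. The general coefficient (2i − N − 1)/(N − 1), which spaces the N lattice points evenly between −√λ and +√λ along the eigenvector, is inferred here from Equations (5) to (8) and is an assumption about the general form; the function name `initial_reference_vectors` is hypothetical.

```python
import math

# Values quoted from Equations (1) to (3) for the iris data.
mean = (5.84, 3.05, 3.76, 1.20)
lam = 4.22
eigvec = (-0.36, 0.08, -0.86, -0.36)


def initial_reference_vectors(mean, lam, eigvec, n):
    """Place n (>= 2) reference vectors along the first principal axis."""
    scale = math.sqrt(lam)
    refs = []
    for i in range(1, n + 1):
        # Assumed coefficient: -1, -1/3, +1/3, +1 for n = 4.
        t = (2 * i - n - 1) / (n - 1)
        refs.append(tuple(m + t * scale * v for m, v in zip(mean, eigvec)))
    return refs


for i, r in enumerate(initial_reference_vectors(mean, lam, eigvec, 4), start=1):
    print(i, tuple(round(c, 3) for c in r))
```

Because the vectors are derived deterministically from statistics of the input data, repeated runs on the same data start from the same initial values.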
<Calculation of Inter-Lattice-Point Distance>
The distance calculation unit 106 calculates an inter-lattice-point distance between one lattice point and another lattice point by using a designated distance function based on the initial values of the reference vectors set in step S110 (step S120: distance calculation step). The distance function is a function for obtaining a distance between the pieces of data (including a distance for two data groups (X, Y) in addition to a distance between any two points (x, y)). A function for obtaining a Ward distance is considered as the distance function D in Example 1. However, a function for obtaining a Euclid distance, a Mahalanobis distance, or another function used in cluster analysis may be used. The distance function to be used may be determined according to the operation of the user.
In a case where a lattice point i and a lattice point j are given, the number of pieces of data belonging to the lattice points are Ni and Nj, respectively, and in a case where centers of mass of the data belonging to the lattice points are ci and cj, the Ward distance is given in the following Equation (9). Equation (9) means that the distance between the data groups belonging to the lattice points i and j is obtained.
From the definition, in a case where the vector is regarded as the lattice point at which the number of data is 1, a Ward distance between a vector a and a vector b is given in the following Equation (10).
[Equation 10]
$D(a,b)=\tfrac{1}{2}(a-b)^{2}$ (10)
Since only the pieces of data of the reference vectors r1 to r4 are respectively allocated to the lattice points at a point in time of step S120, the distance calculation unit 106 can calculate the Ward distance by the following Equation (11). Equation (11) indicates the distance between the lattice point 1 and the lattice point 2, but the distances for other lattice points can be similarly calculated.
[Equation 11]
$D(i,j)=D(r_1,r_2)=\tfrac{1}{2}(r_1-r_2)^{2}$ (11)
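As a hedged illustration, the Ward distance between two data groups can be sketched as below. The code uses the standard Ward form, NiNj/(Ni+Nj) times the squared distance between the centers of mass. This is assumed to correspond to Equation (9), and it reduces to Equation (10) when each group holds a single vector. The helper names `centroid` and `ward_distance` are hypothetical.

```python
def centroid(points):
    """Center of mass of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def ward_distance(group_i, group_j):
    """Standard Ward distance between two groups of vectors (assumed form)."""
    ni, nj = len(group_i), len(group_j)
    ci, cj = centroid(group_i), centroid(group_j)
    squared = sum((a - b) ** 2 for a, b in zip(ci, cj))
    return ni * nj / (ni + nj) * squared


# Two single vectors: the result equals half the squared difference, as in Equation (10).
a = [1.0, 2.0]
b = [4.0, 6.0]
d_single = ward_distance([a], [b])

# A group of two vectors against a group of one vector.
d_groups = ward_distance([[0.0, 0.0], [2.0, 0.0]], [[5.0, 0.0]])
print(d_single, d_groups)
```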
<Search for Nearest Lattice Point>
The search unit 108 searches, for each piece of input data, for a nearest lattice point, that is, the lattice point at the shortest distance, based on the distance between the input vector constituted by the components of the input data and the reference vector of each lattice point (step S130: search step). For example, for the input vector of the first iris data item (x1=(5.1, 3.5, 1.4, 0.2)), the Ward distances from the reference vectors of the lattice points can be calculated as in the following Equations (12) to (15).
[Equation 12]
D(x1,r1)=11.29 (12)
[Equation 13]
D(x1,r2)=5.73 (13)
[Equation 14]
D(x1,r3)=2.05 (14)
[Equation 15]
D(x1,r4)=0.25 (15)
From Equations (12) to (15), the lattice point having the reference vector of which the distance from the input vector x1 is the shortest is the lattice point 4. That is, the nearest lattice point which is the lattice point of which the distance from the input vector x1 is the shortest is the lattice point 4. The distances from the lattice points 1 to 4 are similarly calculated for second to 150th data (input vectors), and the nearest lattice point is searched for.
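The search of step S130 for the first input vector can be sketched as follows. This is an illustrative sketch only: the reference vectors are rebuilt from the rounded values of Equations (1) to (3) using the coefficients of Equations (5) to (8), so the distances agree with Equations (12) to (15) only to within rounding. The function names are hypothetical.

```python
import math


def half_squared_distance(a, b):
    """Ward distance between two single vectors, as in Equation (10)."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_lattice_point(x, reference_vectors):
    """Return the 0-based index of the closest reference vector and all distances."""
    distances = [half_squared_distance(x, r) for r in reference_vectors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances


# Reference vectors rebuilt from rounded inputs (Equations (1) to (3), (5) to (8)).
mean = [5.84, 3.05, 3.76, 1.20]
axis = [-0.36, 0.08, -0.86, -0.36]
scale = math.sqrt(4.22)
coefficients = [-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0]
refs = [[m + c * scale * e for m, e in zip(mean, axis)] for c in coefficients]

x1 = [5.1, 3.5, 1.4, 0.2]
best, distances = nearest_lattice_point(x1, refs)
print("nearest lattice point:", best + 1)
print([round(d, 2) for d in distances])
```

Running the same function over the second to 150th input vectors and collecting the indices per lattice point gives the allocation used in the next step.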
<Allocation of Data>
The data allocation unit 110 allocates the input vector x1 (first input data) as data for the lattice point 4 which is the nearest lattice point based on the result of the search in step S130 (step S140: data allocation step). The second to 150th data are also allocated as the data for the nearest lattice point. As a result, the input data is classified as shown in a table of
<Calculation of Correction Vector and Update of Inter-Lattice-Point Distance>
The correction vector calculation unit 112 calculates correction vectors for the reference vectors by using the reflection function that reflects information on the input data (the plurality of pieces of data) on the nearest lattice point and the lattice points near the nearest lattice point (step S150: distance update step). Specifically, the correction vector calculation unit 112 calculates the correction vectors by the following Equation (16).
In Equation (16), ND is the total number of data (=150), and Cj is a set of the input data stored in the lattice point j (for example, a set of 58th, 60th, . . . and 99th data for the lattice point 3 in
In Equation (16), f is a reflection function that reflects the information on the input data (the plurality of pieces of data) on the nearest lattice point and the lattice points near the nearest lattice point and is an exponential function defined by the following Equation (17) in Example 1. However, this function is not limited thereto.
The reflection function of Equation (17) is a function of which the value decreases as d (the inter-lattice-point distance) increases. Since the Ward distance described above has the dimension of the square of the Euclid distance, the definition of Equation (17) can be regarded as a Gauss function of the Euclid distance. σ is a constant that defines the range of influence of the input data, and can be given by the following Equation (18) by using an appropriate coefficient rσ, which is itself an arbitrary constant. In Example 1, rσ=0.1 is used, but the coefficient is not limited to this value.
[Equation 18]
$\sigma = r_{\sigma}\,\max\bigl(D(x_1,x_2),\,D(x_1,x_3),\,\ldots\bigr)$ (18)
The reflection function may be a function of which the value decreases in inverse proportion to the distance, such as (1/d), instead of the exponential function shown in Equation (17). With a function such as (1/d), the value decreases more slowly as the inter-lattice-point distance increases than with the exponential function shown in Equation (17), so the influence of the input data reaches more distant lattice points.
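The difference between the two kinds of reflection function can be illustrated with a small sketch. The form exp(−d/σ) is an assumption standing in for Equation (17), which is not reproduced here. `influence_range` follows Equation (18) by taking the maximum over all pairs of input data, and the guard `eps` in the (1/d) variant is a hypothetical addition to avoid division by zero.

```python
import math
from itertools import combinations


def half_squared_distance(a, b):
    """Ward distance between two single vectors, as in Equation (10)."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(a, b))


def influence_range(data, r_sigma=0.1):
    """sigma = r_sigma * (largest pairwise distance), following Equation (18)."""
    return r_sigma * max(half_squared_distance(a, b) for a, b in combinations(data, 2))


def exponential_reflection(d, sigma):
    """Assumed exponential form exp(-d / sigma)."""
    return math.exp(-d / sigma)


def inverse_reflection(d, eps=1e-12):
    """Inverse-proportional alternative (1/d); eps is a hypothetical guard."""
    return 1.0 / (d + eps)


data = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [4.0, 0.0]]
sigma = influence_range(data)
for d in (0.5, 1.0, 5.0):
    print(d, exponential_reflection(d, sigma), inverse_reflection(d))
```

At distances well beyond σ the exponential weight is already much smaller than the (1/d) weight, which is the behavior the paragraph above describes.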
<Update of Inter-Lattice-Point Distance>
The distance update unit 114 corrects the reference vectors by adding the correction vectors to the reference vectors of the lattice points (step S150: distance update step). For example, the reference vector r1 of the lattice point 1 is corrected as in the following Equation (19).
[Equation 19]
$r_1=(6.59, 2.88, 5.52, 1.94) \rightarrow r_1+\delta r_1=(6.47, 2.91, 5.26, 1.83)$ (19)
The distance update unit 114 similarly corrects the reference vectors r2 to r4 of the lattice points 2 to 4 (step S150: distance update step).
The distance update unit 114 updates the inter-lattice-point distance by using the input data (the plurality of pieces of data) allocated to the lattice points 1 to 4 and the reference vectors r1 to r4 (step S150: distance update step). For example, the distance between the lattice points 1 and 2 (D(1, 2)) will be described. As shown in
The distance update unit 114 similarly updates other inter-lattice-point distances (step S150: distance update step).
<Repetition Control>
Until a designated end condition is satisfied (until the determination in step S160 becomes YES), the repetition controller 116 repeats the processing of the search unit 108, the data allocation unit 110, the correction vector calculation unit 112, and the distance update unit 114 (the search step, the data allocation step, the correction vector calculation step, and the distance update step) for all the pieces of input data (the plurality of pieces of data) and all the plurality of lattice points (step S160: repetition control step). The data to be allocated to each lattice point changes as these kinds of processing are repeated (see
In a case where the value of the reflection function decreases rapidly as the inter-lattice-point distance increases (for example, in a case where the reflection function is the exponential function of Equation (17)), lattice points of which the inter-lattice-point distance is sufficiently large may be decoupled, and the processing of reflecting the influence of the input data between these lattice points may be skipped. However, since the lattice points need to be recoupled in a case where the inter-lattice-point distance becomes small again, the inter-lattice-point distance itself needs to be calculated for all the combinations every time. Such decoupling and recoupling can reduce the computation cost.
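One possible way to realize such decoupling and recoupling is sketched below. It is an assumption-laden illustration: the weight exp(−d/σ) stands in for Equation (17), and the threshold `cutoff` and the function name `coupled_pairs` are hypothetical. The set of coupled pairs is recomputed from the full table of inter-lattice-point distances on every repetition, so a pair that moves closer is recoupled automatically.

```python
import math


def coupled_pairs(distance, sigma, cutoff=1e-6):
    """Return the lattice-point pairs whose assumed weight exp(-d / sigma) is at least cutoff.

    distance: dict mapping a pair (i, j) with i < j to its inter-lattice-point distance.
    """
    return {pair for pair, d in distance.items() if math.exp(-d / sigma) >= cutoff}


# First repetition: lattice points 1 and 3 are far apart and are decoupled.
distance = {(1, 2): 0.5, (1, 3): 40.0, (2, 3): 3.0}
active_before = coupled_pairs(distance, sigma=1.0)

# Later repetition: the updated distance between 1 and 3 is small, so the pair is recoupled.
distance[(1, 3)] = 2.0
active_after = coupled_pairs(distance, sigma=1.0)
print(sorted(active_before), sorted(active_after))
```

Only the pairs in the returned set would then take part in reflecting the input data, while the distance table itself is still updated for every pair.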
<Output of Information on Inter-Lattice-Point Distance>
The information output unit 118 outputs information indicating the inter-lattice-point distance updated by the repetition described above (step S170: information output step). The output can be performed by a combination of characters, numbers, figures, symbols, colors, and the like, can be stored in the storage unit 200 (for example, stored as the lattice point distribution map 212; see
<Creation and Display of Lattice Point Distribution Map>
A case where the lattice point distribution map is created as the information indicating the inter-lattice-point distance in step S170 will be described. The lattice point distribution map is a map in which the distribution of the lattice points (arrangement and distance) and the input data allocated to the lattice points are represented in a two-dimensional space or a three-dimensional space based on the information indicating the inter-lattice-point distance by the following method. In Example 1, a case where the two-dimensional distribution map is created by a multidimensional scaling method will be described.
[Equation 21]
$E(i,j)=\bigl(d(i,j)-D(i,j)\bigr)^{2}$ (21)
In Equation (21), d(i, j) is the Euclid distance on the lattice point distribution map, and D(i, j) is the Ward distance described above.
The information output unit 118 adjusts the arrangement (initial arrangement) of the lattice points, and minimizes the designated evaluation function (step S174: minimization step). For example, the function represented by the following Equation (22) can be used as the evaluation function.
For example, a steepest descent method can be used as the method of minimizing the evaluation function, but the invention is not limited thereto. Various methods can be used as a method for solving a minimization problem.
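A minimal sketch of such a minimization by steepest descent is given below. It assumes that the evaluation function is the sum of E(i, j) of Equation (21) over all pairs of lattice points; the exact form of the evaluation function, the step size `rate`, the number of steps, and the function names are assumptions made for illustration.

```python
import math


def stress(points, target):
    """Sum over pairs of (d(i, j) - D(i, j))^2, with d the distance on the map."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) - target[i][j]) ** 2
    return total


def steepest_descent_layout(points, target, rate=0.05, steps=500):
    """Move every point against the gradient of the stress for a fixed number of steps."""
    pts = [list(p) for p in points]
    n, dim = len(pts), len(pts[0])
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(pts[i], pts[j])
                if d == 0.0:
                    continue  # gradient is undefined for coincident points
                factor = 2.0 * (d - target[i][j]) / d
                for k in range(dim):
                    grads[i][k] += factor * (pts[i][k] - pts[j][k])
        for i in range(n):
            for k in range(dim):
                pts[i][k] -= rate * grads[i][k]
    return pts


# Three lattice points whose target distances form a 3-4-5 triangle.
target = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
start = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
layout = steepest_descent_layout(start, target)
print(stress(start, target), stress(layout, target))
```

The resulting arrangement is determined only up to rotation, translation, and reflection, since the evaluation function depends on the distances alone.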
The information output unit 118 creates the lattice point distribution map based on the adjusted arrangement (step S176: lattice point distribution map creation step), and outputs the created lattice point distribution map (step S178: lattice point distribution map output step). In the creation of the lattice point distribution map in step S176, each lattice point can be represented by a symbol (here, a circle) having a size corresponding to the number of input data allocated to the lattice points. The lattice points can be colored in any color. For example, in the distribution map of the iris data (reference originals shown in the source of
The lattice point distribution map created in this manner is shown in
<Comparison with SOM>
A result obtained by comparing the classification result according to the embodiment of the present invention with a classification result of the related art (SOM) will be described. Although the input data is the iris data described above,
<Scene of Progress of Classification>
As described above, in the data processing method according to the embodiment of the present invention, these kinds of processing of search, data allocation, correction vector calculation, and distance update (steps S130 to S150 in
As described above, according to the data processing device 10, the data processing method, the data processing program, and the non-transitory recording medium according to the first embodiment, it is possible to appropriately classify a plurality of pieces of high-dimensional data.
In general, compounds (molecules) can have different structures depending on the environment (temperature, pH, and the like), but a stable structure (structure with low energy) is desired to be acquired in a case where compounds as medicine candidates are searched for, for example. However, since the compound may not have the most stable structure (a structure having the lowest energy) depending on the surrounding environment or the like, it is effective to acquire a large number of local stable structures and extract a plausible structure from the local stable structures. Since the acquisition of the local stable structures can be achieved by, for example, a method to be described later, a “method of extracting a plausible structure from the acquired local stable structures” becomes a problem.
<From Input of Data to Update of Inter-Lattice-Point Distance>
The data input unit 102 inputs the local stable structure of the compound and the energy of the local stable structure in association with each other (step S100: data input step). The local stable structure of the compound and the energy thereof can be obtained, for example, by a method to be described later (see the section “Search for Local Stable Structure of Compound”). The dimension of the data to be input is the number of internal coordinates, such as dihedral angles, of the compound (molecule), and the more complicated the structure, the higher the dimension of the data. Any energy (or free energy) can be used as the energy of the compound as long as the energy originates from the three-dimensional structure of the compound. The number of data (local stable structures) to be input is optional, but, for example, about 1000 to 10000 pieces of data can be input. In a case where N local stable structures are input, the number of lattice points is preferably smaller than N. For example, in a case where N=1000, the number of lattice points can be 100. However, the number of lattice points is not limited thereto.
Since these kinds of processing of steps S110 to S150 can be performed similarly to Example 1 described above except for a difference in the dimension of the data and the number of data, detailed description will be omitted.
<Extraction of Local Stable Structure>
The local stable structure in the lattice point space (a state in which the inter-lattice-point distance is updated by the processing up to step S150) is defined as a structure that has (1) the lowest energy among the structures allocated to a certain lattice point and has (2) lower energy than the structures belonging to all the other lattice points connected to the certain lattice point. For example, in the case of the lattice point space shown in
In a case where the local stable structure is extracted based on the comparison result of the representative energies in the state shown in
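The two-part definition of the local stable structure in the lattice point space can be sketched as follows. The sketch assumes that the representative energy of a lattice point is the lowest energy among the structures allocated to it. The data layout (dictionaries keyed by lattice point) and the function name are hypothetical.

```python
def extract_local_stable(structures_by_point, neighbors):
    """Pick structures satisfying conditions (1) and (2) of the definition.

    structures_by_point: {lattice_point: [(structure_id, energy), ...]}
    neighbors: {lattice_point: set of lattice points coupled to it}
    """
    # Condition (1): the lowest-energy structure at each non-empty lattice point.
    representative = {
        point: min(items, key=lambda item: item[1])
        for point, items in structures_by_point.items()
        if items
    }
    result = []
    for point, (structure_id, energy) in representative.items():
        # Condition (2): lower energy than every coupled lattice point.
        coupled = [q for q in neighbors.get(point, ()) if q in representative]
        if all(energy < representative[q][1] for q in coupled):
            result.append((point, structure_id, energy))
    return result


structures_by_point = {
    "A": [("a1", -3.0), ("a2", -1.0)],
    "B": [("b1", -2.0)],
    "C": [("c1", -2.5), ("c2", 0.5)],
}
neighbors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
local_stable = extract_local_stable(structures_by_point, neighbors)
print(local_stable)
```

In this toy chain A-B-C, the structures at A and C are extracted because each is lower in energy than the representative of B, while B is not a local stable point.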
<Creation and Display of Energy Distribution Map>
The information output unit 118 creates the energy distribution map in which the lattice point space is projected in the two-dimensional space or the three-dimensional space according to the arrangement of the lattice points and the inter-lattice-point distance, and displays the energy distribution map on the monitor 310. The energy distribution map indicates the correspondence between the lattice point and the representative energy of the lattice point (step S170: information output step). The information output unit 118 can create and display the energy distribution map by using symbols having a size corresponding to the number of local stable structures allocated to the lattice points and a color corresponding to the representative energy of the lattice point. For example, the larger the number of local stable structures allocated to the lattice points, the larger the symbol indicating the lattice point. The lattice points having high representative energy can be displayed in red, and the lattice points having low representative energy can be displayed in blue. The distance between the lattice points on the energy distribution map may be the inter-lattice-point distance updated by the processing up to step S150, and the lattice points to be coupled may be coupled by a line. The arrangement of the lattice points can be determined by using, for example, the multidimensional scaling method as in the case of Example 1 described above.
An example of the classification for alanine dipeptide (a peptide formed by coupling two alanines) will be described.
<Classification of Three-Dimensional Structures by Related Art and Method According to Embodiment of Present Invention>
<Classification Result Using SOM>
The result obtained by classifying the three-dimensional structures shown in
<Classification Result Using SCTM Method>
The result obtained by classifying the three-dimensional structures shown in
<Search for Local Stable Structure of Compound>
<Search Device of Molecular Stable Structure>
In Example 2 described above, a plurality of local stable structures of the compound (and the energies thereof) are input. A method of searching for the stable structures to be input will now be described. Specifically, the stable structures can be searched for by a search device of a molecular stable structure that includes, for example, a structural formula acquisition unit that acquires a structural formula of a compound, a three-dimensional structure generation unit that generates one or more three-dimensional structures, a local stable structure acquisition unit that changes internal coordinates of the three-dimensional structure and obtains a local stable structure which is a structure having low energy, an energy acquisition unit that obtains internal coordinates of the local stable structure and an energy of the local stable structure at the internal coordinates, an energy distribution function calculation unit that calculates, for each internal coordinate of each atom constituting the compound, an energy distribution function indicating a distribution of the energy of the local stable structure for the internal coordinates of the local stable structure, a probability distribution function calculation unit that calculates, from the energy distribution function, a probability distribution function for increasing a probability of the low-energy internal coordinates, and an output unit that outputs the local stable structure. In this search device, the three-dimensional structure generation unit generates the three-dimensional structure based on the acquired structural formula of the compound or the probability distribution function.
<Configuration of Search Device>
The search device of the molecular stable structure described above can be realized by the same configuration as that of the data processing device 10 shown in
The energy acquisition unit 138 acquires the energy of the local stable structure acquired by the local stable structure acquisition unit 134. The energy distribution function calculation unit 140 calculates the energy distribution function indicating the distribution of the energy of the local stable structure (structural energy) for each of the internal coordinates of the local stable structure. The energy distribution function is calculated for each internal coordinate constituting the compound. The probability distribution function calculation unit 142 calculates the probability distribution function for increasing the probability of the internal coordinates having low energy from the energy distribution function.
The output unit 144 outputs the local stable structure acquired by the local stable structure acquisition unit 134. The most stable structure acquisition unit 136 outputs the most stable structure. The display controller 146 controls the display of the acquired information and the processing result on the monitor 310. The details of the processing of the search method of the molecular stable structure using these functions of the processing unit 100 will be described later. The functions of the units of the processing unit 100 related to the search for the molecular stable structure can be realized by using various processors similar to the flowchart described above with reference to
The storage unit 200 stores the information shown in
<Configuration of Display Unit and Operation Unit>
The user can perform the operations necessary for executing the search method of the molecular stable structure via the screens of the monitor 310 by using the keyboard 410 and the mouse 420 shown in
<Search Method of Molecular Stable Structure>
In the device having the configuration described above, searching of a molecular stable structure can be performed by a search method of a molecular stable structure including a structural formula acquisition step of acquiring a structural formula of a compound, a first three-dimensional structure generation step of generating one or more three-dimensional structures in which internal coordinates of the structural formula are randomly set, a local stable structure acquisition step of changing the internal coordinates of the three-dimensional structure and obtaining the local stable structure which is the structure having low energy, an energy acquisition step of obtaining the internal coordinates of the local stable structure and the energy of the local stable structure in the internal coordinates, an energy distribution function calculation step of calculating a one-dimensional or a multidimensional energy distribution function calculated for one or each of a plurality of internal coordinates constituting the compound, the energy distribution function indicating a distribution of the energy of the local stable structure for the internal coordinates of the local stable structure, a probability distribution function calculation step of calculating a probability distribution function for increasing a probability of the low-energy internal coordinates from the energy distribution function, a second three-dimensional structure generation step of simultaneously changing one or more internal coordinates based on the probability distribution function and generating one or more three-dimensional structures by using the determined internal coordinates, a repetition step of repeating the local stable structure acquisition step, the energy acquisition step, the energy distribution function calculation step, the probability distribution function calculation step, and the second three-dimensional structure generation step by using the three-dimensional structure generated in the second three-dimensional structure generation step, and an output step of outputting at least any one of a plurality of local stable structures obtained in the local stable structure acquisition step or the structure having the lowest energy from the plurality of local stable structures.
In the search method described above, first, the three-dimensional structure is generated from the structural formula, the local stable structure is acquired by changing the internal coordinates, and the energy distribution function and the probability distribution function for increasing the probability of the low-energy internal coordinates are calculated from the obtained local stable structure. The probability of the internal coordinates with which the structure having low energy is obtained can be increased by generating the three-dimensional structure based on this probability distribution function, acquiring the local stable structure, and reflecting the internal coordinates of this local stable structure and the value of the energy on the probability distribution function. Accordingly, the local stable structure having low energy can be easily acquired. The local stable structure having lower energy can be obtained by increasing the number of times of the repetition step. Accordingly, the structure having the lowest energy (the most stable structure) can be acquired in a short time from the plurality of obtained local stable structures. The search method described above is not a conformational search based on local structural deformation. However, since the structure is searched for while simultaneously changing one or more internal coordinates based on the probability distribution function, various local stable structures can be obtained in a short time.
<Procedure of Search Method>
The search method further includes an energy distribution function calculation step of calculating an energy distribution function indicating a distribution of energy of the local stable structure for the internal coordinates of the local stable structure in each internal coordinate in a case where it is determined in step S18 that the desired structure, the desired number of local stable structures, or the most stable structure is not obtained (step S20), a probability distribution function calculation step of calculating a probability distribution function for increasing a probability of the low-energy internal coordinates from the energy distribution function (step S22), and a second three-dimensional structure generation step of generating one or more three-dimensional structures based on the probability distribution function (step S24). As the energy distribution function, a one-dimensional energy distribution function may be calculated for each internal coordinate constituting the compound. A two-dimensional energy distribution function may be calculated by using two internal coordinates, or a multidimensional energy distribution function may be calculated by using a plurality of internal coordinates. In the probability distribution function calculation step, it is preferable that a function for accelerating the calculation is added. The function for accelerating the calculation can include a white noise, but is not limited thereto.
After the three-dimensional structure is generated in step S24, the processing returns to step S14. The local stable structure is acquired from this three-dimensional structure, and the internal coordinates and the value of the energy of the local stable structure are acquired. The internal coordinates and the value of the energy of this local stable structure are reflected on the energy distribution function and the probability distribution function up to now. The probability distribution function obtained in step S22 can be the probability distribution function having a high probability of the internal coordinates at which the low energy is obtained by repeating steps S14 to S24. The probability with which the local stable structure having lower energy is obtained can be increased by using this probability distribution function.
The search method further includes an output step of outputting the most stable structure, which has the lowest energy among the plurality of obtained local stable structures, and the local stable structures in a case where it is determined in step S18 that a target structure, a target number of local stable structures, or the most stable structure is obtained (step S26). The plurality of local stable structures can be obtained by repeating steps S14 to S24. The most stable structure among the obtained structures can be obtained by selecting the structure having the lowest energy from the local stable structures. Except for a specific compound, it is not possible to objectively determine whether or not the obtained most stable structure is truly the most stable. However, the larger the number of times steps S14 to S24 are repeated, the higher the probability that the obtained most stable structure is truly the most stable. It is also possible to estimate, to some extent, whether the obtained most stable structure is truly the most stable from the state of convergence of the probability distribution function. The molecular stable structure can be determined by obtaining the most stable structure among the obtained structures. In a case where the most stable structure is not adopted as the actual three-dimensional structure of the compound, outputting the plurality of local stable structures allows a candidate for the next three-dimensional structure to be selected from the local stable structures.
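The repetition of steps S14 to S24 followed by the output can be illustrated with a toy sketch. Everything concrete in it is an assumption made for illustration and is not taken from the description: the energy function `toy_energy`, the coordinate-descent stand-in for local optimization, the binning of each internal coordinate into 12 bins, the Boltzmann-like weight exp(−(E − Emin)/T), and the constant `noise` added to every bin as a simple counterpart of the white noise mentioned above.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def toy_energy(angles):
    """Hypothetical stand-in for the structural energy (not a force field)."""
    return sum(math.cos(3.0 * a) + 0.5 * math.cos(a - 1.0) for a in angles)


def local_minimum(angles, step=0.05, max_sweeps=200):
    """Crude coordinate descent standing in for the local optimization (step S14)."""
    current, best = list(angles), toy_energy(angles)
    for _ in range(max_sweeps):
        improved = False
        for k in range(len(current)):
            for delta in (step, -step):
                trial = list(current)
                trial[k] += delta
                energy = toy_energy(trial)
                if energy < best:
                    current, best, improved = trial, energy, True
        if not improved:
            break
    return current, best


def probability_tables(records, n_bins, temperature, noise):
    """Lowest energy per bin for each coordinate (step S20) turned into weights (step S22)."""
    e_min = min(energy for _, energy in records)
    tables = []
    for k in range(len(records[0][0])):
        lowest = [None] * n_bins
        for angles, energy in records:
            b = int((angles[k] % TWO_PI) / TWO_PI * n_bins) % n_bins
            if lowest[b] is None or energy < lowest[b]:
                lowest[b] = energy
        tables.append([
            noise + (0.0 if v is None else math.exp(-(v - e_min) / temperature))
            for v in lowest
        ])
    return tables


def sample_structure(tables, rng):
    """Draw all internal coordinates at once from their weight tables (step S24)."""
    angles = []
    for weights in tables:
        n_bins = len(weights)
        b = rng.choices(range(n_bins), weights=weights)[0]
        angles.append((b + rng.random()) * TWO_PI / n_bins)
    return angles


def search(n_coords=3, n_initial=10, n_rounds=5, n_per_round=10, seed=0):
    rng = random.Random(seed)
    structures = [[rng.uniform(0.0, TWO_PI) for _ in range(n_coords)]
                  for _ in range(n_initial)]
    records = []  # (internal coordinates, energy) of every local stable structure
    for round_no in range(n_rounds + 1):
        records.extend(local_minimum(s) for s in structures)
        if round_no == n_rounds:
            break
        tables = probability_tables(records, n_bins=12, temperature=1.0, noise=0.05)
        structures = [sample_structure(tables, rng) for _ in range(n_per_round)]
    return records


records = search()
most_stable = min(records, key=lambda rec: rec[1])
print(len(records), most_stable)
```

The list `records` plays the role of the plurality of local stable structures, and `most_stable` is the lowest-energy entry among them; a real implementation would replace the toy energy and the coordinate descent with an actual structure optimization.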
Although the embodiments of the present invention have been described above, the present invention is not limited to the above-described aspects, and various modifications can be made without departing from the spirit of the present invention. For example, general data other than the iris data and the three-dimensional structures of the compound described in Examples 1 and 2 can also be classified.
Number | Date | Country | Kind |
---|---|---|---|
2018-119116 | Jun 2018 | JP | national |
The present application is a Continuation of PCT International Application No. PCT/JP2019/021542 filed on May 30, 2019 claiming priority under 35 U.S.C § 119(a) to Japanese Patent Application No. 2018-119116 filed on Jun. 22, 2018. Each of the above applications is hereby expressly incorporated by reference, in its entirety, into the present application.
 | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2019/021542 | May 2019 | US |
Child | 17106829 | | US |