This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2012-075189, filed on Mar. 28, 2012, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are directed to an information conversion device, an information search device, an information conversion method, an information search method, and a computer-readable recording medium.
In the past, there has been known a technique of searching, from among a plurality of pieces of data registered in a database, for data whose level of similarity or relevance to input query data satisfies a predetermined condition. As an example of such a technique, there has been known a neighbor search technique in which a level of similarity or relevance between pieces of data is represented by a distance between feature quantity vectors in a multi-dimensional space, and a predetermined number of pieces of data are selected from data whose distance from the query data is within a threshold value or data near to the query data.
Here, in the case in which a large number of pieces of data are registered in a database, if distances between the query data and all pieces of data registered in the database are calculated, a computation cost for executing a neighbor search increases. In this regard, there has been known a technique in which a computation cost for executing a neighbor search is reduced such that data of a search target is limited using an index of a feature quantity vector space which is generated in advance or an index based on a distance from a specific feature quantity vector. However, in this technique, it is difficult to reduce the computation cost when the dimension number of a feature quantity vector increases.
In this regard, as a technique of reducing a computation cost in a search process, there has been known a technique of speeding up a search process such that stringency of a search result is mitigated, and then a set of similar data approximate to query data is acquired. For example, a match retrieval or a calculation of a Hamming distance between binary strings is performed at a higher speed than a calculation of a distance between vectors. In this regard, there has been known a technique of reducing a computation cost such that a feature quantity vector is converted into a binary string while maintaining a distance relation between feature quantity vectors, and a match retrieval or a Hamming distance with a binary string converted from query data is calculated.
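For illustration only (no programming language is part of the embodiments discussed herein), the speed advantage of comparing binary strings can be sketched in Python: a Hamming distance between two equal-length binary codes packed into integers reduces to an XOR followed by a population count, which is far cheaper than a floating-point vector distance.

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two equal-length binary strings packed
    into integers: XOR marks differing bit positions, and counting the
    set bits of the result gives the distance."""
    return bin(a ^ b).count("1")

# Toy 6-bit codes standing in for binarized feature quantity vectors.
print(hamming_distance(0b110100, 0b110110))  # 1
```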
Here, a technique of converting a feature quantity vector of a database into binary data by applying a random projection function has been known as a technique of converting a feature quantity vector into a binary string. In addition, there has been known a technique of deciding a projection function in which the distribution of data is considered using previously obtained registration data and converting a feature quantity vector into binary data through the decided projection function in order to perform conversion in a state in which a distance relation of original feature quantity vectors is maintained.
Next, an example of a method of converting a feature quantity vector into a binary string and searching for data similar to query data will be described.
For example, an information processing device stores a feature quantity vector indicated by a white circle in
As a result, each feature quantity vector is converted into any of “01,” “11,” “00,” and “10.” Further, when a binary string converted from query data is “11” as indicated by (C) in
However, in the technique of converting a feature quantity vector into a binary string as described above, since one feature quantity vector is mapped to one binary string, the Hamming distance between binary strings converted from similar feature quantity vectors may increase, and thus there is a problem in that search omission may occur.
According to an aspect of an embodiment, an information conversion device includes a memory and a processor coupled to the memory. The processor executes a process including converting a feature quantity vector of data which is a target of a search process using a Hamming distance into a symbol string including a binary symbol and a wild card symbol that causes a Hamming distance from the binary symbol to be zero (0).
According to another aspect of an embodiment, an information search device includes a memory and a processor coupled to the memory. The processor executes a process including converting a feature quantity vector of data which is a target of a search process using a Hamming distance into a symbol string including a binary symbol and a wild card symbol that causes a Hamming distance from the binary symbol to be zero (0). The process includes searching, from among the data, for data for which a Hamming distance between a symbol string converted at the converting and a binary string converted from query data is a predetermined value or less.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
Preferred embodiments of the present invention will be explained with reference to accompanying drawings.
A first embodiment will be described below in connection with an information search device that searches for neighbor data of query data using a binarized feature quantity vector with reference to
In addition, the information search device 1 is connected with a client device 2 through which query data is input. Here, when query data is received from the client device 2, the information search device 1 searches for neighbor data of the received query data, and transmits the searched neighbor data to the client device 2. Here, the information search device 1 searches for data such as an image or a voice or biological data in biometric authentication using a fingerprint pattern or a vein pattern as a search target.
Here, when biological data is input from the client device 2 as query data, the information search device 1 extracts a feature quantity vector representing a feature quantity of the input biological data, and searches for registration biological data having a feature quantity vector similar to the extracted feature quantity vector. In other words, the information search device 1 determines whether or not registration biological data of the user who has input the query data remains registered.
Further, the information search device 1 calculates a Hamming distance between a symbol string converted from the feature quantity vector of the registration biological data and a symbol string obtained by binarizing a feature quantity vector in the biological data input as the query data. Then, the information search device 1 extracts registration biological data whose Hamming distance is a predetermined threshold value or less as a candidate of a search target. Thereafter, the information search device 1 executes a stringent matching process of the searched registration biological data and the biological data input as the query data, and outputs an execution result.
As described above, the information search device 1 narrows down data of a search target by converting a feature quantity vector representing a feature of registration biological data of a search target into a symbol string and calculating a Hamming distance from a symbol string of the query data. Then, the information search device 1 performs matching in biometric authentication by performing matching between the narrowed-down data and the query data.
Here, when the input biological data or registration biological data is an image, for example, a feature quantity vector is obtained by converting, into a vector, numerical values such as the density, the coordinates of a feature point such as a ridge ending or a ridge bifurcation, the direction or length of a ridge in a specific region of the image, or a gradient. Further, when the input biological data or registration biological data is a voice, for example, the feature quantity vector is obtained by converting a numerical value such as the distribution, intensity, or a peak value of a frequency component into a vector.
Here, when registration biological data of a search target is converted into a binary string including “0” or “1,” there are cases in which a distance relation between feature quantity vectors is not reflected. In this regard, the information search device 1 performs conversion into a symbol string including a wild card symbol in which a Hamming distance from a binary symbol is “0” and a binary symbol. Then, the information search device 1 searches for registration biological data in which a Hamming distance between the symbol string including the binary symbol and the wild card symbol and the symbol string converted from the feature quantity vector of the query data is a predetermined threshold value or less as a candidate of a search target, and thus the accuracy of search is improved.
The process executed by the information search device 1 illustrated in
Here, an example of information stored in the feature quantity vector storage unit 10 will be described with reference to
As described above, the feature quantity vector storage unit 10 stores feature quantity vectors of a plurality of pieces of registration biological data for each data ID, that is, for each user who has registered the registration biological data. In the following description, feature quantity vectors associated with the same data ID, that is, feature quantity vectors of the registration biological data registered by the same user are described as feature quantity vectors belonging to the same class.
Referring back to
Further, although not illustrated in
Referring back to
Specifically, when a certain component of a feature quantity vector belonging to a certain class falls within a predetermined range from the boundary with a feature quantity vector of a different class, the conversion function learning unit 12 generates a conversion function of converting this component into a wild card symbol. Further, when a certain component of a feature quantity vector belonging to a certain class does not fall within a predetermined range from the boundary with a feature quantity vector of a different class, the conversion function learning unit 12 generates a conversion function of converting this component into a binary symbol corresponding to a value of this component.
In detail, the conversion function learning unit 12 calculates a product of a feature quantity vector and a predetermined conversion matrix, and when a certain component of the calculated product falls within a predetermined range, the conversion function learning unit 12 generates a conversion function of converting the certain component into a wild card symbol. Further, the conversion function learning unit 12 calculates a product of a feature quantity vector and a predetermined conversion matrix, and when a certain component of the calculated product does not fall within a predetermined range, the conversion function learning unit 12 generates a conversion function of converting the certain component into a binary symbol corresponding to a value of the certain component.
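As a hedged illustration of the conversion just described (not the embodiment's exact function, which uses the parameters a1, a2, b1, and b2 introduced later), the following Python sketch collapses the "predetermined range" around the boundary into a single scalar `margin`: components of the product Wx whose magnitude falls within the margin become the wild card symbol, and the remaining components become a binary symbol according to their sign.

```python
import numpy as np

def to_symbol_string(x, W, margin):
    """Convert a feature quantity vector x into a symbol string over
    {'0', '1', '*'}: components of Wx within `margin` of the boundary
    Wx = 0 become the wild card '*'; the rest become '1' or '0'
    according to their sign."""
    u = W @ x
    symbols = []
    for v in u:
        if abs(v) <= margin:      # near the class boundary
            symbols.append("*")
        elif v > 0:
            symbols.append("1")
        else:
            symbols.append("0")
    return "".join(symbols)

# Hypothetical 2x2 conversion matrix for a 2-dimensional feature vector.
W = np.array([[1.0, -1.0], [0.5, 0.5]])
print(to_symbol_string(np.array([0.9, 1.0]), W, margin=0.2))  # *1
```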
Then, the conversion function learning unit 12 converts the feature quantity vector stored in the feature quantity vector storage unit 10 into a symbol string using the generated conversion function, and stores the converted symbol string in the symbol string data index storage unit 11.
In addition, the conversion function learning unit 12 generates a conversion function using a feature quantity vector previously stored in the feature quantity vector storage unit 10. Specifically, the conversion function learning unit 12 extracts two feature quantity vectors stored in the feature quantity vector storage unit 10, regards one feature quantity vector as query data, and regards the other feature quantity vector as a feature quantity vector of data of a search target.
Then, the conversion function learning unit 12 calculates a Euclidean distance (norm) between the extracted two feature quantity vectors. Further, the conversion function learning unit 12 converts the extracted feature quantity vector into a symbol string using a predetermined conversion function, and calculates a Hamming distance in the converted symbol string. Then, the conversion function learning unit 12 evaluates the conversion function that has converted the feature quantity vector based on the calculated Euclidean distance and the Hamming distance. Thereafter, the conversion function learning unit 12 changes a parameter of the conversion function based on the evaluation result of the conversion function.
Further, the conversion function learning unit 12 extracts two feature quantity vectors again, and converts the extracted feature quantity vectors into a symbol string using the conversion function having the changed parameter. Further, the conversion function learning unit 12 evaluates the conversion function based on the Euclidean distance of the re-extracted feature quantity vectors and the Hamming distance in the symbol string, and changes a parameter of the conversion function based on the evaluation result.
Then, by repeating the above-described process twice or more, the conversion function learning unit 12 optimizes the parameter of the conversion function. Thereafter, the conversion function learning unit 12 converts the feature quantity vector stored in the feature quantity vector storage unit 10 into a symbol string using the conversion function having the optimized parameter, and stores the converted symbol string in the symbol string data index storage unit 11.
Next, the conversion function generated by the conversion function learning unit 12 will be described with reference to
For example, in the method according to the related art, a feature quantity vector included in a range at the right of the straight line in
In this regard, the information search device 1 converts a feature quantity vector included in a predetermined range from the boundary in which the product of the conversion matrix W and the feature quantity vector x is “0” into a wild card symbol “*.” Here, the distance between the wild card symbol “*” and the binary symbol “1” or “0” is determined to be “0” in a calculation of the Hamming distance. For this reason, the information search device 1 causes a feature quantity vector present near the boundary line in which the product of the conversion matrix W and the feature quantity vector x is “0” to be included in the search result, and thus can prevent search omission.
For example, a feature quantity vector indicated by thin hatching in
Next, a process by which the conversion function learning unit 12 optimizes the conversion function by repeatedly evaluating the conversion function and changing the parameter will be described with reference to
As illustrated in
Specifically, the conversion function learning unit 12 updates the conversion function such that the Hamming distance in the converted symbol string is decreased when the Euclidean distance between the feature quantity vectors is short, but the Hamming distance in the converted symbol string is increased when the Euclidean distance between the feature quantity vectors is long. Further, when the extracted feature quantity vectors belong to the same class, the Euclidean distance between the feature quantity vectors is short. Thus, by decreasing the Hamming distance in the symbol string when the Euclidean distance between the feature quantity vectors is short, the conversion function learning unit 12 can decrease the Hamming distance between symbol strings converted from feature quantity vectors belonging to the same class.
As a result, the conversion function learning unit 12 updates the conversion function such that the feature quantity vector belonging to each class is successfully divided by the boundary line as illustrated at the right side of
In addition, the conversion function learning unit 12 updates the conversion function using the feature quantity vector stored in the feature quantity vector storage unit 10. Thus, the conversion function learning unit 12 can obtain a conversion function optimized for the data of a search target. Furthermore, the conversion function learning unit 12 may optimize the conversion function in view of the class to which the extracted feature quantity vectors belong as well as the Euclidean distance between the extracted feature quantity vectors or the Hamming distance of the symbol strings converted from the extracted feature quantity vectors.
Next, a concrete example by which the conversion function learning unit 12 updates a predetermined conversion function and generates an optimized conversion function will be described. In the following description, the conversion function generated by the conversion function learning unit 12 will be first described, and then a process of changing parameters of the conversion function based on the evaluation result of the conversion function and optimizing the conversion function will be described.
First, the description will proceed with the conversion function generated by the conversion function learning unit 12. For example, when the conversion function learning unit 12 converts the feature quantity vector into the symbol string having the binary symbol and the wild card symbol, a converted symbol string c is represented by the following Formula (1). In Formula (1), p represents the number of symbols (a dimension number) of the symbol string.
c ∈ C ≡ {0, 1, *}^p (1)
Next, a Hamming distance mij between a symbol string ci and a symbol string cj is defined as in the following Formula (2). Here, in Formula (2), s(c_k^i, c_k^j) is a value represented by the following Formula (3), and c_k is the k-th symbol in a symbol string c.
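The distance of Formulas (2) and (3) can be sketched in Python as follows (an illustrative aid, not part of the embodiment): the per-symbol score s contributes 1 only when both symbols are binary and differ, and the wild card “*” always contributes 0, so that the wild card matches either binary symbol at distance zero.

```python
def symbol_distance(ci: str, cj: str) -> int:
    """Hamming distance m_ij of Formula (2): the sum over positions k of
    s(c_k^i, c_k^j), where s is 1 only when both symbols are binary and
    differ, and the wild card '*' contributes 0 (Formula (3))."""
    def s(a: str, b: str) -> int:
        if a == "*" or b == "*":
            return 0
        return int(a != b)
    return sum(s(a, b) for a, b in zip(ci, cj))

print(symbol_distance("110100", "1001*0"))  # 1
```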
Here, various variations can be made on the conversion function, but, for example, the conversion function learning unit 12 sets the conversion function represented by the following Formula (4). Here, u_k is the k-th value in a symbol string u.
Further, the symbol string u is a symbol string defined by the following Formula (5). In Formula (5), a bold-faced x is an n-dimensional feature quantity vector, and a bold-faced W is an n×p conversion matrix. In Formula (5), bold-faced a1, a2, b1, and b2 are p-dimensional vectors. Further, a1, a2, b1, and b2 are parameters of the conversion function used to decide a range used for conversion into a wild card symbol, and each element is assumed to have a value of zero (0) or more. Furthermore, bold-faced h+ and h− are p-dimensional vectors in which each element is “0” or “1,” and bold-faced g+ and g− are p-dimensional vectors in which each element is “0” or “−1.”
In other words, the conversion function learning unit 12 obtains h+, h−, g+, and g− that maximize each term in Formula (5), in which each parameter is applied to the product of the conversion matrix and the feature quantity vector, and calculates the vector u using the obtained h+, h−, g+, and g−.
Here,
For example, as illustrated in
Further, a feature quantity vector included in a range from the boundary satisfying −Wx + a2 − b2 = 0 to the boundary satisfying Wx − a2 − b2 = 0 is converted into the binary symbol “0.” Further, a feature quantity vector included in a range in which −Wx − a1 + b1 is zero (0) or more or a range in which −Wx + a2 − b2 is zero (0) or more is converted into the wild card symbol “*.”
Next, the description will proceed with a process by which the conversion function learning unit 12 changes the parameters a1, a2, b1, and b2 of the conversion function based on the evaluation result of the conversion function and optimizes the conversion function. For example, a conversion function that converts a feature quantity vector into a symbol string while maintaining a distance relation in an original feature quantity vector space as much as possible is preferably used as the conversion function used by the information search device 1.
In this regard, for example, the conversion function learning unit 12 can evaluate the conversion function using an evaluation function expressed by the following Formula (6). Here, in Formula (6), dij is an Euclidean distance between a feature quantity vector i and a feature quantity vector j. Further, in Formula (6), S is a data set of the feature quantity vector stored in the feature quantity vector storage unit 10.
In other words, the conversion function learning unit 12 evaluates the conversion function as being high when it is determined that similarity between a relation of a Euclidean distance in a feature quantity space and a relation of a distance between symbol strings is high using Formula (6). As another example, the conversion function learning unit 12 evaluates the conversion function using the following Formula (7). Here, in Formula (7), l2(mij, tij) is a value expressed by the following Formula (8). Further, in Formulas (7) and (8), tij is “1” when the feature quantity vector i and the feature quantity vector j belong to the same class but is zero (0) when the feature quantity vector i and the feature quantity vector j belong to different classes.
In other words, the conversion function learning unit 12 causes the Hamming distance between the symbol strings to be smaller than “ρ” on feature quantity vectors of the same class and causes the Hamming distance between the symbol strings to be “ρ” or more on feature quantity vectors of different classes using Formulas (7) and (8). The following description will proceed with an example in which the conversion function learning unit 12 evaluates the conversion function using Formulas (7) and (8).
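Since Formula (8) is not reproduced here, the following Python sketch is only a hedged stand-in for l2(mij, tij): a squared-hinge penalty that is zero when a same-class pair is already closer than “ρ” and a different-class pair is already “ρ” or more apart, consistent with the behavior described above.

```python
def pair_loss(m_ij: float, t_ij: int, rho: float) -> float:
    """Hypothetical stand-in for l2(m_ij, t_ij) of Formula (8): a
    squared-hinge penalty that is low when same-class pairs (t_ij = 1)
    have Hamming distance below rho and different-class pairs
    (t_ij = 0) have Hamming distance rho or more."""
    if t_ij == 1:                       # same class: want m_ij < rho
        return max(0.0, m_ij - rho) ** 2
    return max(0.0, rho - m_ij) ** 2    # different class: want m_ij >= rho

print(pair_loss(2, 1, rho=5))  # 0.0  (same class, already close)
print(pair_loss(2, 0, rho=5))  # 9.0  (different class but too close)
```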
Here, Formulas (7) and (8) have a low value on the conversion function that causes the Hamming distance between the symbol strings to be smaller than “ρ” on feature quantity vectors of the same class and causes the Hamming distance between the symbol strings to be “ρ” or more on feature quantity vectors of different classes. For this reason, the conversion function learning unit 12 preferably optimizes the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function such that the value of Formula (7) serving as the evaluation function is reduced.
Here, Formula (7) serving as the evaluation function is a discontinuous function. For this reason, let us consider a case of minimizing an upper limit of Formula (7). For example, the conversion function learning unit 12 regards the feature quantity vector i as registration data and the feature quantity vector j as query data. Here, a conversion formula used to convert query data into a binary string is defined by the following Formula (9). In Formula (9), xq is a feature quantity vector serving as query data.
In this case, the upper limit of Formula (7) serving as the evaluation function can be expressed by the following Formula (10).
Referring to a first term of Formula (10), l2(mij, tij) is a value unrelated to hi+, hi−, hj, gi+, and gi−, and thus can be expressed as in the following Formula (11).
Here, when each of hi+, hi−, hj, gi+, and gi− that satisfy a calculation expressed by Formula (11) is represented by a symbol with a wavy line thereabove, the right side of Formula (10) can be expressed by the following Formula (12):
Here, for a maximum calculation of hi+, hi−, hj, gi+, and gi−, conversion expressed by the following Formulas (13) to (17) has been performed.
Next, the conversion function learning unit 12 optimizes the conversion matrix of Formula (12) and the parameters using a stochastic gradient descent (SGD) technique. Specifically, the conversion function learning unit 12 sequentially updates the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function using the following Formulas (18) to Formula (22), and minimizes the upper limit of Formula (7). In Formulas (18) to (22), η is a parameter representing a learning rate.
W^(t+1) = W^t − η{h̃_i^+ x_i^T − h̃_i^− x_i^T + g̃_i^+ x_i^T + g̃_i^− x_i^T + h̃_j x_j^T − h_i'^+ x_i^T + h_i'^− x_i^T − g_i'^+ x_i^T + g_i'^− x_i^T − h_j' x_j^T} (18)
a1^(t+1) = a1^t − η{h̃_i^+ − h̃_i^− − h_i'^+ + h_i'^−} (19)
a2^(t+1) = a2^t − η{−g̃_i^+ + g̃_i^− + g_i'^+ − g_i'^−} (20)
b1^(t+1) = b1^t − η{h̃_i^+ + h̃_i^− − h_i'^+ − h_i'^−} (21)
b2^(t+1) = b2^t − η{−g̃_i^+ − g̃_i^− + g_i'^+ + g_i'^−} (22)
As described above, the conversion function learning unit 12 extracts a feature quantity vector from the feature quantity vector storage unit 10, and repeats a process of calculating Formulas (18) to (22) a predetermined number of times. Then, the conversion function learning unit 12 calculates the conversion matrix and the parameters that minimize the upper limit of Formula (7) by sequentially updating the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function. In other words, the conversion function learning unit 12 optimizes the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function.
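The overall shape of this stochastic gradient descent loop can be sketched generically in Python (the analytic gradients of Formulas (18) to (22) are abstracted into a caller-supplied `grad_fn`, so this is a structural illustration rather than the embodiment's exact updates):

```python
import random

def sgd_optimize(pairs, params, grad_fn, eta=0.01, iterations=1000):
    """Generic stochastic-gradient loop mirroring the update scheme of
    Formulas (18)-(22): draw one training pair at a time and step each
    parameter against its gradient, scaled by the learning rate eta.
    `grad_fn(pair, params)` stands in for the analytic gradients."""
    for _ in range(iterations):
        pair = random.choice(pairs)       # one extracted vector pair
        grads = grad_fn(pair, params)
        for name in params:               # update W, a1, a2, b1, b2, ...
            params[name] = params[name] - eta * grads[name]
    return params

# Demo on a toy objective (w - 3)^2, whose gradient is 2(w - 3); the
# real gradients of Formulas (18)-(22) would replace this grad_fn.
toy = sgd_optimize([None], {"w": 0.0},
                   lambda pair, p: {"w": 2 * (p["w"] - 3.0)},
                   eta=0.1, iterations=500)
print(round(toy["w"], 3))  # 3.0
```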
Thereafter, the conversion function learning unit 12 converts the feature quantity vector stored in the feature quantity vector storage unit 10 into a symbol string using the optimized conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function, and stores the converted symbol string in the symbol string data index storage unit 11. Further, the conversion function learning unit 12 notifies the feature quantity converting unit 13 of the optimized conversion matrix W.
The above description has been made in connection with an example in which the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function are optimized using the stochastic gradient descent technique, but the conversion function learning unit 12 may minimize the upper limit of Formula (7) using another optimization algorithm.
In addition, the conversion function learning unit 12 optimizes the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function by repeating the above-described process by a predetermined number of times. However, the conversion function learning unit 12 may determine that the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function have been optimized when a predetermined condition is satisfied. For example, the conversion function learning unit 12 may determine that the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function have been optimized when the value of the evaluation function expressed by Formula (7) is a predetermined threshold value or less.
Referring back to
Here, when the feature quantity vector and the binary string bq are received from the feature quantity converting unit 13, the search unit 14 executes the following process. First, the search unit 14 calculates the Hamming distance between the received binary string bq and each symbol string stored in the symbol string data index storage unit 11. For example, when the received binary string bq is “110100” and the symbol string is “110110,” the search unit 14 calculates “1” as the Hamming distance. Further, since the Hamming distance between the wild card symbol and the binary symbol is “0,” when the received binary string bq is “110100” and the symbol string is “1001*0,” the search unit 14 calculates “1” as the Hamming distance.
Then, the search unit 14 extracts a symbol string whose Hamming distance is a predetermined value or less, that is, a symbol string of a feature quantity vector which is a neighbor candidate of query data. Further, the search unit 14 acquires a feature quantity vector which is a source of the extracted symbol string from the feature quantity vector storage unit 10, and compares the extracted feature quantity vector with the feature quantity vector acquired from the feature quantity vector storage unit 10.
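The two-stage search just described can be sketched in Python as follows (an illustrative simplification; the `index` and `vectors` dictionaries are hypothetical stand-ins for the symbol string data index storage unit 11 and the feature quantity vector storage unit 10): candidates are first narrowed down by the wild-card-aware Hamming distance, and only the survivors are re-checked with the Euclidean distance of the original feature quantity vectors.

```python
import numpy as np

def search(query_code, query_vec, index, vectors, ham_thresh, euc_thresh):
    """Two-stage search: narrow candidates by symbol-string Hamming
    distance (where '*' matches any binary symbol at distance 0), then
    re-check candidates by Euclidean distance of the stored vectors."""
    def dist(ci, cj):
        return sum(a != b and a != "*" and b != "*"
                   for a, b in zip(ci, cj))

    candidates = [d for d, code in index.items()
                  if dist(query_code, code) <= ham_thresh]
    return [d for d in candidates
            if np.linalg.norm(vectors[d] - query_vec) <= euc_thresh]

# Hypothetical registered data: data IDs mapped to symbol strings and
# to the original feature quantity vectors.
index = {"u1": "1101*0", "u2": "000111"}
vectors = {"u1": np.array([1.0, 0.1]), "u2": np.array([5.0, 5.0])}
print(search("110100", np.array([1.0, 0.0]), index, vectors,
             ham_thresh=1, euc_thresh=0.5))  # ['u1']
```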
Thereafter, when a feature quantity vector matching with the feature quantity vector acquired from the feature quantity converting unit 13 or a feature quantity vector whose Euclidean distance is a predetermined threshold value or less is present among the feature quantity vectors acquired from the feature quantity vector storage unit 10, the search unit 14 executes the following process. In other words, the search unit 14 notifies the client device 2 of the fact that the query data matches with the registration biological data.
However, when a feature quantity vector matching with the feature quantity vector acquired from the feature quantity converting unit 13 or a feature quantity vector whose Euclidean distance is a predetermined threshold value or less is not present among the feature quantity vectors acquired from the feature quantity vector storage unit 10, the search unit 14 executes the following process. In other words, the search unit 14 notifies the client device 2 of the fact that the query data does not match with the registration biological data. As a result, the client device 2 can perform biometric authentication of the user who has inputted the query data.
Here, a process by which the search unit 14 extracts a symbol string of a feature quantity vector serving as a neighbor candidate of query data will be described with reference to
In other words, the information search device 1 converts a feature quantity vector which is present within a predetermined range from the boundary of a threshold value used for conversion into a symbol string into a symbol string including a wild card symbol. For example, when a feature quantity vector indicated by (E) in
As a result, the search unit 14 excludes feature quantity vectors indicated by white circles in a lower portion of
In addition, the search unit 14 extracts a feature quantity vector serving as the neighbor candidate of the query data by calculating the Hamming distance between the binary string converted from the query data and the symbol string converted from the feature quantity vector. Then, the search unit 14 calculates a Euclidean distance between the extracted feature quantity vector and the feature quantity vector of the query data. As a result, the search unit 14 can reduce a search cost for executing the search process.
In addition, the search unit 14 may further increase the speed of the search process using a hash table. In this regard, an example in which the search unit 14 performs a search process using a hash table will be described with reference to
Further, the search unit 14 generates a hash table associated with a data ID of a feature quantity vector present near a feature quantity vector which is a conversion source of a source symbol string on the generated binary string. Then, when the binary string converted from the feature quantity vector of the query data is received, the search unit 14 acquires a data ID associated with the received binary string from the hash table. Thereafter, the search unit 14 acquires a feature quantity vector associated with the data ID acquired from the hash table from the feature quantity vector storage unit 10, and calculates the Euclidean distance from the feature quantity vector of the query data.
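One plausible reading of this hash-table construction can be sketched in Python (a hedged illustration; the expansion of wild cards into binary strings is an assumption about how the "generated binary string" is obtained, and the data IDs are hypothetical): each symbol string containing "*" is expanded into every binary string it matches, and each resulting binary string keys the associated data IDs for constant-time lookup.

```python
from itertools import product

def expand(symbols: str):
    """Expand a symbol string containing '*' into every binary string
    it matches, e.g. '1*0' -> '100' and '110'."""
    options = [("0", "1") if s == "*" else (s,) for s in symbols]
    return ["".join(bits) for bits in product(*options)]

def build_hash_table(index):
    """Build a hash table from binary strings to the data IDs whose
    symbol strings match them (`index` maps data ID -> symbol string)."""
    table = {}
    for data_id, symbols in index.items():
        for code in expand(symbols):
            table.setdefault(code, []).append(data_id)
    return table

# A query binary string is then resolved with a single dictionary lookup.
table = build_hash_table({"u1": "1*0", "u2": "110"})
print(table["110"])  # ['u1', 'u2']
```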
As described above, the search unit 14 stores the hash table in which the symbol string is associated with the data ID of the feature quantity vector present near the feature quantity vector which is the source of the symbol string. As a result, the search unit 14 can execute the search process at a high speed.
For example, the conversion function learning unit 12, the feature quantity converting unit 13, and the search unit 14 include an electronic circuit. Here, an integrated circuit (IC) such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA), a central processing unit (CPU), or a micro processing unit (MPU) is applied as the electronic circuit.
Further, the feature quantity vector storage unit 10 and the symbol string data index storage unit 11 are storage devices such as a semiconductor memory device (for example, a random access memory (RAM) or a flash memory), a hard disk, or an optical disk.
Next, the flow of a process by which the information search device 1 generates the conversion function will be described with reference to
First, the information search device 1 extracts two arbitrary feature quantity vectors from the feature quantity vector storage unit 10 as learning data (step S101). Next, the information search device 1 initializes the conversion function (step S102). In other words, the information search device 1 sets the conversion matrix W of the conversion function and the values of the parameters a1, a2, b1, and b2 of the conversion function to predetermined initial values. Then, the information search device 1 evaluates the current conversion function (step S103). In other words, the information search device 1 converts the extracted learning data into symbol strings using the current conversion function, and evaluates the current conversion function using the Hamming distance between the converted symbol strings and the Euclidean distance between the pieces of learning data.
Then, the information search device 1 updates the conversion matrix W of the current conversion function and the values of the parameters a1, a2, b1, and b2 of the conversion function using the evaluation result in step S103 (step S104). Next, the information search device 1 determines whether or not an end condition has been satisfied (step S105). For example, the information search device 1 determines whether or not the process of steps S103 to S104 has been executed a predetermined number of times or whether or not the evaluation value represented by Formula (7) is a predetermined threshold value or less.
Here, when it is determined that the end condition has been satisfied (Yes in step S105), the information search device 1 converts the feature quantity vector using the updated conversion function (step S106), and ends the process. However, when it is determined that the end condition has not been satisfied (No in step S105), the information search device 1 executes the process of step S103.
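For illustration only, the learning flow of steps S101 through S106 can be sketched in Python as below. The sketch substitutes a simple random-search update for the stochastic gradient descent of the embodiment, and its surrogate evaluation stands in for Formulas (1) to (22), which are not reproduced here; `convert`, `ham`, and the loop structure are illustrative assumptions, not the claimed implementation.

```python
import random

def evaluate(params, pairs, convert, ham):
    # Surrogate for step S103: penalize same-class pairs whose symbol
    # strings land far apart, and different-class pairs that land close.
    loss = 0.0
    for x, y, same_class in pairs:
        sx, sy = convert(x, params), convert(y, params)
        d = ham(sx, sy)
        loss += d if same_class else max(0, len(sx) - d)
    return loss

def learn(pairs, convert, ham, init_params, n_iter=1000, step=0.2, seed=0):
    rng = random.Random(seed)
    params = list(init_params)                    # S102: initialize
    best = evaluate(params, pairs, convert, ham)  # S103: evaluate
    for _ in range(n_iter):                       # S105: end condition (iteration count)
        trial = [p + rng.uniform(-step, step) for p in params]
        loss = evaluate(trial, pairs, convert, ham)
        if loss < best:                           # S104: keep only strict improvements
            params, best = trial, loss
    return params
```

After the loop terminates (step S105 satisfied), the learned parameters would be used to convert the stored feature quantity vectors (step S106).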
Effects of First Embodiment
As described above, the information search device 1 converts a feature quantity vector of data which is a target of the search process using the Hamming distance into a symbol string including a wild card symbol and binary symbols. Thus, the information search device 1 includes, as a search candidate, a feature quantity vector present near the threshold value used for conversion into a symbol string, and thereby prevents search omission.
Further, when a certain component of a feature quantity vector falls within a predetermined range from the boundary with a feature quantity vector of a different class, the information search device 1 converts this component into the wild card symbol “*.” Further, when a certain component of a feature quantity vector does not fall within a predetermined range from the boundary with a feature quantity vector of a different class, the information search device 1 converts this component into a binary symbol. Thus, the information search device 1 can convert a feature quantity vector into a symbol string such that search omission does not occur.
In addition, when a certain component of a product of a conversion matrix and a feature quantity vector falls within a predetermined range, the information search device 1 converts this component into the wild card symbol “*,” but when the certain component does not fall within a predetermined range, the information search device 1 converts this component into a binary symbol corresponding to a value of this component. Thus, when a conversion matrix according to the distribution of feature quantity vectors is selected, the information search device 1 converts a feature quantity vector into a symbol string in a state in which a positional relation of feature quantity vectors is maintained while preventing search omission.
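For illustration only, the component-wise conversion rule above can be sketched in Python as follows. The use of 0 as the binarization threshold with a symmetric margin is a simplifying assumption; the embodiment parameterizes the boundaries through a1, a2, b1, and b2, which this sketch does not reproduce.

```python
def to_symbol_string(vec, W, margin):
    # Project the feature quantity vector with the conversion matrix W;
    # a component within +/-margin of the threshold becomes the wild
    # card "*", and any other component becomes a binary symbol by sign.
    projected = [sum(w * x for w, x in zip(row, vec)) for row in W]
    return "".join("*" if abs(p) <= margin else ("1" if p > 0 else "0")
                   for p in projected)
```

Components near the decision boundary thus match both "0" and "1" at search time, which is how the positional relation of feature quantity vectors is maintained while preventing search omission.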
Further, the information search device 1 extracts two feature quantity vectors from the feature quantity vector storage unit 10, and evaluates a predetermined conversion function based on the Euclidean distance between the extracted feature quantity vectors and the Hamming distance between the symbol strings converted from the feature quantity vectors by the predetermined conversion function. Then, the information search device 1 updates the conversion matrix W of the predetermined conversion function and the values of the parameters a1, a2, b1, and b2 of the conversion function based on the evaluation result. Thus, the information search device 1 converts the feature quantity vector into the symbol string using the optimized conversion function for each distribution of the feature quantity vectors stored in the feature quantity vector storage unit 10.
In addition, the information search device 1 decreases the evaluation value of the conversion function when the feature quantity vectors extracted from the feature quantity vector storage unit 10 are feature quantity vectors of the same class and the Hamming distance between the converted symbol strings is a predetermined value or less at the time of evaluation of the conversion function. Further, the information search device 1 decreases the evaluation value of the conversion function when the feature quantity vectors extracted from the feature quantity vector storage unit 10 are feature quantity vectors of different classes and the Hamming distance between the converted symbol strings is a predetermined value or more at the time of evaluation of the conversion function.
In other words, when feature quantity vectors registered by the same user are converted into symbol strings, the information search device 1 decreases the evaluation value of the conversion function when the Hamming distance is a predetermined value or less. Further, when feature quantity vectors registered by different users are converted into symbol strings, the information search device 1 decreases the evaluation value of the conversion function when the Hamming distance is a predetermined value or more. Then, the information search device 1 updates the conversion matrix W of the predetermined conversion function and the values of the parameters a1, a2, b1, and b2 of the conversion function such that the upper limit of the evaluation value is decreased. Thus, the information search device 1 can automatically generate the optimal conversion function according to the distribution of the feature quantity vectors stored in the feature quantity vector storage unit 10.
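For illustration only, the per-pair evaluation rule above can be sketched in Python as follows; the specific +/-1.0 magnitudes are illustrative assumptions, not the values prescribed by the embodiment's formulas.

```python
def pair_evaluation(hamming_dist, same_class, threshold):
    # Lower is better. A same-class pair is rewarded (value decreased)
    # when its Hamming distance is at or below the threshold; a
    # different-class pair is rewarded when its distance is at or above
    # the threshold. The +/-1.0 magnitudes are illustrative only.
    if same_class:
        return -1.0 if hamming_dist <= threshold else 1.0
    return -1.0 if hamming_dist >= threshold else 1.0
```

Summing this term over sampled pairs yields an evaluation value whose decrease drives the update of W and the parameters a1, a2, b1, and b2.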
In addition, the information search device 1 stores the feature quantity vector in association with the converted symbol string. Specifically, the information search device 1 stores the feature quantity vector and the converted symbol string in the feature quantity vector storage unit 10 and the symbol string data index storage unit 11 in association with the same data ID. Then, the information search device 1 searches for a feature quantity vector associated with a symbol string that causes the Hamming distance from the binary string converted from the query data to be a predetermined value or less. Thus, the information search device 1 can reduce the computation cost of searching for a feature quantity vector positioned near the query data.
The embodiment of the present invention has been described so far, but embodiments of various forms can be made in addition to the above-described embodiment. In this regard, other embodiments of the present invention will be described below as a second embodiment.
(1) Regarding Formulas
The above-described information search device 1 performs conversion of the feature quantity vector, conversion of the query data, evaluation of the conversion function, and optimization of the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function using Formulas (1) to (22). However, the embodiment is not limited to this example.
In other words, the information search device 1 may appropriately employ a conversion function of performing conversion into a symbol string including a wild card symbol at the time of conversion of a feature quantity vector. Further, the information search device 1 does not need to convert a feature quantity vector of query data using an optimized conversion matrix W and may convert a feature quantity vector of query data into a binary string using an arbitrary conversion matrix.
Further, the information search device 1 decreases the upper limit of the evaluation function using the stochastic gradient descent technique and optimizes the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function. However, the embodiment is not limited to this example, and the information search device 1 may optimize the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function using an arbitrary technique.
For example, when the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function are optimized such that the upper limit of the evaluation function is decreased, the information search device 1 decreases the evaluation value of the conversion function when the Hamming distance between the feature quantity vectors of the same user is a predetermined value or less. In other words, the information search device 1 optimizes the conversion matrix W and the parameters a1, a2, b1, and b2 of the conversion function by decreasing the evaluation value of a conversion function that more appropriately converts a feature quantity vector into a symbol string. However, the information search device 1 may instead increase the evaluation value of a conversion function that more appropriately converts a feature quantity vector into a symbol string, and employ the conversion function when the evaluation value exceeds a predetermined threshold value.
(2) Regarding Evaluation of Conversion Function
At the time of evaluation of the conversion function, the above-described information search device 1 extracts two feature quantity vectors from the feature quantity vector storage unit 10, regards one of the extracted two feature quantity vectors as query data and the other as the registered feature quantity vector, and evaluates the conversion function. However, the embodiment is not limited to this example. For example, the information search device 1 may extract a plurality of feature quantity vectors, regard one of the extracted feature quantity vectors as query data and the remaining feature quantity vectors as the registered feature quantity vectors, and evaluate the conversion function.
(3) Regarding Embodiment of Invention
The above-described information search device 1 extracts candidates of feature quantity vectors positioned near the feature quantity vector of query data based on the Hamming distance, and determines whether or not data similar to the feature quantity vector of the query data is present among the extracted candidates. However, the embodiment of the present invention is not limited to this example.
In other words, the determination as to whether or not data similar to query data is present can be made by an information search device according to the related art. In this regard, the present invention may be implemented as an information converting program or an information conversion device that converts a registered feature quantity vector into a symbol string including the wild card symbol "*" and binary symbols, and the search of a feature quantity vector may be undertaken by the information search device according to the related art. In the case of this embodiment, the information search device according to the related art treats the Hamming distance between the wild card symbol and a binary symbol as "0."
Further, the information search device 1 transmits information about whether or not data similar to a feature vector of query data is present to the client device 2. However, the embodiment is not limited to this example. For example, the information search device 1 may extract a candidate of a feature quantity vector positioned near a feature quantity vector of query data using a Hamming distance, and may transmit the extracted feature quantity vector to the client device 2. Alternatively, the information search device 1 may transmit a feature quantity vector, which is a source of a symbol string that causes a Hamming distance from a binary string of a feature quantity vector of query data to be a predetermined threshold value or less, to the client device 2. Further, the information search device 1 may transmit feature quantity vectors to the client device 2 in the ascending order of Hamming distances.
(4) Regarding Feature Quantity Vector
The above-described information search device 1 stores a feature quantity vector of biological data. However, the embodiment is not limited to this example, and the information search device 1 may store a feature quantity vector of arbitrary information and determine whether or not a feature quantity vector similar to the feature quantity vector of query data is stored.
(5) Program
Meanwhile, the information search device 1 according to the first embodiment has been described in connection with the example in which various kinds of processes are implemented using hardware. However, the embodiment is not limited to this example and may be implemented such that a previously prepared program is executed by a computer included in the information search device 1. In this regard, an example of a computer that executes a program having the same function as the information search device 1 according to the first embodiment will be described with reference to
A computer 100 illustrated in
The HDD 120 stores a feature quantity vector table 121 in which the same information as the information stored in the feature quantity vector storage unit 10 is stored and a symbol string table 122 in which the same information as the information stored in the symbol string data index storage unit 11 is stored. Further, an information converting program 131 is stored in the RAM 130 in advance. In the example illustrated in
The information converting program described in the present embodiment may be implemented such that a previously prepared program is executed by a computer such as a personal computer or a workstation. The program may be distributed via a network such as the Internet. Further, the program may be stored in a computer readable recording medium such as a hard disk, a flexible disk (FD), a compact disc read only memory (CD-ROM), a magneto optical disc (MO), or a digital versatile disc (DVD). Furthermore, the program may be read from a recording medium and executed by a computer.
According to an aspect of the present invention, the accuracy of search when a feature quantity vector is converted into a binary string is improved.
All examples and conditional language recited herein are intended for pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---
2012-075189 | Mar 2012 | JP | national |