The present application claims priority from Japanese application serial No. 2011-041268, filed on Feb. 28, 2011, the entire contents of which are hereby incorporated by reference into this application.
1. Technical Field of the Invention
The present invention relates to a method and a system for similarity search on inputted unstructured data.
2. Description of Related Art
Searching for unstructured data similar to inputted unstructured data such as an image, a moving picture, a document, binary data, or biological body information is called similarity search. Similarity search is typically performed by extracting, from raw unstructured data (hereinafter called raw data), information called features that is used for distance calculation (or similarity calculation), and by considering that a smaller distance between the features (the distance indicating a degree of disagreement) or a greater degree of similarity between the features (the degree of similarity indicating a degree of agreement) indicates a greater degree of resemblance. The distance (or degree of similarity) between the features is called a score.
Examples include: a method (k-Nearest Neighbor Search) of calculating the distance (or degree of similarity) between raw data inputted at the time of search (hereinafter called search data) and raw data enrolled in a database (hereinafter called enrolled data), selecting K pieces of the enrolled data in ascending order of distance (or descending order of degree of similarity), and outputting information related thereto as search results; and a method (Range Search) of outputting, as search results, information related to the enrolled data whose distance (or degree of similarity) is smaller (or larger) than a threshold value r.
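For illustration only, the following is a minimal sketch of these two search modes over feature vectors, assuming NumPy arrays and the Euclidean distance as the score (the array sizes and function names are illustrative, not from the original text):

```python
import numpy as np

def knn_search(query, enrolled, k):
    # k-Nearest Neighbor Search: k enrolled items in ascending order of distance
    dists = np.linalg.norm(enrolled - query, axis=1)  # one score per enrolled item
    return np.argsort(dists)[:k]

def range_search(query, enrolled, r):
    # Range Search: enrolled items whose distance is smaller than the threshold r
    dists = np.linalg.norm(enrolled - query, axis=1)
    return np.where(dists < r)[0]

enrolled = np.random.rand(1000, 64)  # N=1000 enrolled features
query = np.random.rand(64)           # search data feature
print(knn_search(query, enrolled, k=5))
print(range_search(query, enrolled, r=1.0))
```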
When the total number of pieces of enrolled data is N, calculating scores for all of them requires N score calculations. Score calculation typically takes a significant amount of time, so the search time grows almost in proportion to N. To address this, distance-based indexing has been suggested: scores between the pieces of enrolled data are calculated in advance, the order in which to select the pieces of enrolled data for score calculation is determined by using those precomputed scores, and score calculation is stopped partway through, thereby reducing the number of score calculations.
For example, in E. Chavez, K. Figueroa and G. Navarro, "Effective Proximity Retrieval by Ordering Permutations," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 30, No. 9, pp. 1647-1658 (2008), M (M<N) pieces of enrolled data (hereinafter called pivots) are selected, for example randomly, from the N pieces of enrolled data; a distance between each piece of enrolled data and each pivot is calculated; a vector (hereinafter called a first index vector) used at the time of search is obtained for each piece of enrolled data from these distances; at the time of search, the distance between the inputted search data and each pivot is calculated to obtain a second index vector of the search data; and the order in which to select the remaining pieces of enrolled data (hereinafter called non-pivots) is then determined in ascending order of the distance between the first and second index vectors. As the index vector, this literature uses a vector in which the IDs of the pivots are arranged in ascending order of distance.
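The permutation-based ordering of that literature can be sketched as follows; this is a hedged illustration assuming NumPy, with arbitrary sizes (M=8 pivots drawn at random), not code from the literature itself:

```python
import numpy as np

def permutation_vector(x, pivots):
    # IDs of the pivots arranged in ascending order of distance from x
    return np.argsort(np.linalg.norm(pivots - x, axis=1))

def spearman_rho(perm_a, perm_b):
    # sum of squared rank differences between two permutations of pivot IDs
    inv_a, inv_b = np.argsort(perm_a), np.argsort(perm_b)  # rank of each pivot ID
    return int(np.sum((inv_a - inv_b) ** 2))

rng = np.random.default_rng(0)
pivots = rng.random((8, 16))        # M=8 pivots
non_pivots = rng.random((100, 16))  # remaining enrolled data
query = rng.random(16)

perm_q = permutation_vector(query, pivots)                          # second index vector
first_index = [permutation_vector(x, pivots) for x in non_pivots]   # first index vectors
order = np.argsort([spearman_rho(perm_q, p) for p in first_index])  # selection order
```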
In the literature cited above, the order in which to select the non-pivots is determined in ascending order of the distance between the first and second index vectors. However, this method leaves room for improvement in terms of search accuracy: when the calculation of scores for the non-pivots is stopped partway through, the expected number of non-pivots that are not score-calculated (and thus not found), despite their score from the search data being smaller than the threshold value r, should be reduced.
It is an object of the present invention to theoretically minimize the expected number of non-pivots for which the score is not calculated, and which are thus not found, despite their score from the search data being smaller than the threshold value r.
To achieve the object described above, the present invention is characterized by having: a pivot determination unit that determines pivots from enrolled data; a raw data acquisition unit that acquires raw data; a feature extraction unit that extracts features from the raw data; a score calculation unit that calculates a score as one of a distance and a degree of similarity between the features; an index vector generation unit that generates an index vector by using the scores for the pivots; a Δ score calculation unit that calculates a Δ score as one of a distance and a degree of similarity between the index vectors; a non-pivot-specific parameter training unit that trains, by using training data, a parameter of each non-pivot including a regression coefficient; a non-pivot selection order determination unit that determines, by using the Δ score between inputted search data and each non-pivot as well as the regression coefficient, the order in which to select the non-pivots in descending order of posterior probability through logistic regression; a search result output unit that outputs a search result based on the score between the search data and the enrolled data; and a database that holds the features of the enrolled data, pivot information indicating which pieces of the enrolled data are the pivots, an index including the index vector of each non-pivot, and the parameter of each non-pivot.
With the present invention, the order in which to select the non-pivots is determined, by using non-pivot-specific regression coefficients, in descending order of posterior probability through logistic regression. This makes it possible to theoretically minimize the expected number of non-pivots for which the score is not calculated, and which are thus not found, despite their score from the search data being smaller than the threshold value r. This consequently provides the effect of dramatically improving accuracy.
Hereinafter, the first embodiment will be described with reference to the accompanying drawings.
A similarity search system of this embodiment is a similar image search system that, when the user inputs an image, searches a database in a server terminal for a similar image. Unstructured data such as a moving picture, music, a document, or binary data may be used instead of an image. The similarity search system of this embodiment uses a color histogram as the features of an image and uses the Euclidean distance as the score between the features.
The similarity search system of this embodiment preselects M pivots from the N pieces of enrolled data; one method of selecting the pivots is, for example, random selection. Next, the system calculates a score between each piece of the remaining enrolled data (each non-pivot) and each of the pivots and, based on these scores, obtains a first index vector used at the time of search for each non-pivot. At the time of search, the system calculates a score between the inputted search data and each pivot and, based on these scores, obtains a second index vector of the search data. The index vector serves as a clue indicating the positional relationship between each non-pivot and the search data without obtaining a score directly. Calculating the score between the search data and each piece of enrolled data typically takes a great deal of time, but the number of score calculations can be reduced (that is, high-speed search can be performed) by determining the order in which to select the non-pivots by using a distance (or degree of similarity) between the index vectors (hereinafter called a Δ score), performing score calculation against the non-pivots T (<N−M) times (where T is an upper limit value predefined by the system manager or the like), and then stopping the score calculation partway through.
As the index vector, a vector formed of the scores from each pivot (hereinafter called a score vector) may be used, or a vector in which the IDs of the pivots are arranged in ascending order of distance (or descending order of degree of similarity) (hereinafter called a permutation vector) may be used. A collection of the first index vectors of the different non-pivots is called an index.
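As a sketch of the offline step (assuming NumPy; the sizes and the random pivot choice are illustrative assumptions), a score-vector index can be built as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
enrolled = rng.random((1000, 64))                        # N enrolled features
piv_ids = rng.choice(len(enrolled), 16, replace=False)   # M=16 pivots, chosen randomly
pivots = enrolled[piv_ids]
non_piv_ids = np.setdiff1d(np.arange(len(enrolled)), piv_ids)

# first index vector of each non-pivot: its scores against all M pivots
index = np.array([np.linalg.norm(pivots - enrolled[i], axis=1) for i in non_piv_ids])
```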
For the Δ score (the distance or degree of similarity between the index vectors), when the score vector is used as the index vector, for example, the Manhattan distance, the Euclidean distance, or the like is assumed, and when the permutation vector is used, for example, Spearman Rho or the like is assumed. Alternatively, for example, the value obtained by subtracting the aforementioned distance from its maximum possible value may be used as the degree of similarity.
For example, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, the Euclidean distance De(Sq, Si) between a score vector Sq of the search data and a score vector Si of enrolled data Xi,

De(Sq, Si) = sqrt(Σz=1..M (Sq(z) − Si(z))²) [Formula 1]

is obtained. Here, Si(z) denotes the z-th element of the score vector Si.
When the permutation vector is used as the index vector and Spearman Rho is used as the Δ score, the Spearman Rho Dρ(Tq, Ti) between a permutation vector Tq of the search data and a permutation vector Ti of the enrolled data Xi,

Dρ(Tq, Ti) = Σz=1..M (Tq−1(Ti(z)) − z)² [Formula 2]

is obtained. Here, Ti(z) denotes the suffix number (pivot ID) of the z-th element in the permutation vector Ti. For example, where Ti=(X2, XM, X1, . . . , X3)T, Ti(1)=2, Ti(2)=M, Ti(3)=1, . . . , Ti(M)=3. Tq−1(i) denotes the place at which the element Xi is located in the permutation vector Tq. For example, where Tq=(XM, X1, X2, . . . , X3)T, Tq−1(1)=2, Tq−1(2)=3, Tq−1(3)=M, . . . , Tq−1(M)=1.
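As a small worked check of this notation (a hypothetical M=4 case, not from the original text), the Spearman Rho of Formula 2 can be evaluated as:

```python
# Ti = (X2, X4, X1, X3) means Ti(1)=2, Ti(2)=4, Ti(3)=1, Ti(4)=3.
Ti = [2, 4, 1, 3]
Tq = [4, 1, 2, 3]
inv_q = {pid: pos + 1 for pos, pid in enumerate(Tq)}         # Tq^-1
rho = sum((inv_q[Ti[z - 1]] - z) ** 2 for z in range(1, 5))  # Formula 2
print(rho)  # (3-1)^2 + (1-2)^2 + (2-3)^2 + (4-4)^2 = 6
```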
A first characteristic of the similarity search system of this embodiment is that an index vector size (the number of dimensions of the index vector) of each non-pivot is uniquely determined (trained) before search by using prepared data (training data). A method of training the index vector size will be described in detail below.
Parts (b1) and (b2) of the drawing show examples of the indexes in which the index vectors are held in correspondence with the trained index vector size of each non-pivot, for the cases where the score vector and the permutation vector are used as the index vectors, respectively. In this case, when the score vector is used as the index vector, the score vector is rearranged so that it holds a number of elements equal to the score vector size in ascending or descending order of score, and, in order to tell to which pivot each score corresponds, a permutation vector of the same length is also held.
As described above, in this embodiment, the non-pivot-specific index vector size is trained by using the training data, and the index vector of each non-pivot is saved in correspondence with that non-pivot's index vector size. This makes it possible to reduce the index vector size for each non-pivot, which reduces the size of the index saved into the database and provides the effect of system weight reduction. Details of the method of training the index vector size will be described below.
In this case, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, the Euclidean distance De(Sq, Si, Ti, Zi) between the score vector Sq of the search data and the score vector Si of the enrolled data Xi (whose permutation vector is Ti and whose score vector size is Zi) is indicated as:

De(Sq, Si, Ti, Zi) = sqrt(Σz=1..Zi (Sq(Ti(z)) − Si(z))²) [Formula 3]
Moreover, when the permutation vector is used as the index vector and Spearman Rho is used as the Δ score, the Spearman Rho Dρ(Tq, Ti, Zi) between the permutation vector Tq of the search data and the permutation vector Ti of the enrolled data Xi (whose permutation vector size is Zi),

Dρ(Tq, Ti, Zi) = Σz=1..Zi (Tq−1(Ti(z)) − z)² [Formula 4]

is obtained.
As described above, the Δ score between the search data and each non-pivot is calculated in correspondence with the index vector size (that is, as a distance between Zi-dimensional vectors). This requires a shorter time than calculating a Δ score corresponding to the number M of pivots (that is, a distance between M-dimensional vectors), and consequently provides the effect of improving speed.
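A minimal sketch of Formula 3 (assuming NumPy and 0-based pivot IDs; the function name is illustrative):

```python
import numpy as np

def delta_score_truncated(Sq, Si, Ti, Zi):
    # Formula 3: Euclidean Δ score over only the Zi stored elements.
    # Sq: the search data's full M-dimensional score vector;
    # Si: the non-pivot's stored (rearranged) scores; Ti[z]: pivot ID of Si[z].
    return float(np.sqrt(sum((Sq[Ti[z]] - Si[z]) ** 2 for z in range(Zi))))
```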
A second characteristic of the similarity search system of this embodiment is that, after the Δ scores ΔSq,M+1, . . . , ΔSq,N for the non-pivots are obtained in this manner, the order in which to select the non-pivots is determined, by using logistic regression, in descending order of the posterior probability P(sq,i<r|ΔSq,i) (M+1≦i≦N) that the score sq,i from the search data is smaller than a threshold value r. The posterior probability P(sq,i<r|ΔSq,i) can be transformed by use of Bayes' theorem as follows:

P(sq,i<r|ΔSq,i) = P(ΔSq,i|sq,i<r)P(sq,i<r) / (P(ΔSq,i|sq,i<r)P(sq,i<r) + P(ΔSq,i|sq,i≧r)P(sq,i≧r)) = σ(ai) [Formula 5]

where σ( ) is the logistic sigmoid function σ(a)=1/(1+exp(−a)), and ai is:

ai = ln(P(ΔSq,i|sq,i<r)P(sq,i<r) / (P(ΔSq,i|sq,i≧r)P(sq,i≧r))) [Formula 6]
The logistic sigmoid function σ( ) is monotonically increasing, and thus determining the order in which to select the non-pivots in descending order of ai amounts to determining it in descending order of the posterior probability P(sq,i<r|ΔSq,i). The value ai can be obtained by using logistic regression, in which ai is obtained in an approximate manner by:
ai≈wi,1ΔSq,i+wi,0 [Formula 7]
Here wi,1 and wi,0 are non-pivot-specific regression coefficients of the logistic regression (M+1≦i≦N). A value common to all non-pivots could be adopted as the regression coefficients, but since the regression coefficients naturally differ in value from one non-pivot to another, ai can be obtained more properly by using the non-pivot-specific regression coefficients. Moreover, according to Formula 7, ai can be obtained approximately by performing one multiplication and one addition on the Δ score ΔSq,i, so calculating ai takes little time. The regression coefficients are uniquely determined (trained) before search by using prepared data (training data), as is the case with the index vector size. Details of the method of training the regression coefficients will be described below.
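Determining the selection order via Formula 7 thus reduces to one multiply-add per non-pivot followed by a sort, as in this sketch (NumPy assumed; names illustrative):

```python
import numpy as np

def selection_order(delta_scores, w1, w0):
    # a_i = w_{i,1} * ΔS_{q,i} + w_{i,0} (Formula 7), per-non-pivot coefficients
    a = w1 * delta_scores + w0
    return np.argsort(-a)  # descending a_i = descending posterior probability
```

Since wi,1 is typically negative (a larger Δ score lowers the chance of similarity), descending ai tends to favor small Δ scores, but with a per-non-pivot weighting that plain Δ score ordering lacks.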
Assume here that the aggregation of the Δ scores ΔSq,M+1, . . . , ΔSq,N for the non-pivots is ΔSq and that the non-pivot determined as the e-th (1≦e≦N−M) place is Xm(e) (M+1≦m(e)≦N). Then the expected number of non-pivots for which the score is consequently not calculated (that is, which are not found), despite their score from the search data being smaller than the threshold value r, after calculating the score against the non-pivots T (<N−M) times can be denoted as:

Σe=T+1..N−M P(sq,m(e)<r|ΔSq)
≈ Σe=T+1..N−M P(sq,m(e)<r|ΔSq,m(e)) [Formula 8]
Note, however, that the approximation in the second line of Formula 8 uses the fact that what has the greatest influence on the posterior probability of the non-pivot Xm(e) is its own Δ score ΔSq,m(e). As Formula 8 shows, the expected number can be approximated by the sum of the posterior probabilities P(sq,m(e)<r|ΔSq,m(e)) of the non-pivots Xm(e) for which score calculation has not been performed, and this sum is minimized when the scores against the non-pivots are calculated T times in descending order of the posterior probability P(sq,m(e)<r|ΔSq,m(e)).
Therefore, in this embodiment, the order in which to select the non-pivots is determined in descending order of the posterior probability through logistic regression by using the non-pivot-specific regression coefficients, and this makes it possible to theoretically minimize the expected number of non-pivots for which the score is not calculated, and which are thus not found, despite their score from the search data being smaller than the threshold value r. This consequently provides the effect of dramatically improving accuracy. Details of the method of training the regression coefficients will be described below.
This system is composed of: an enrollment terminal 100 that transmits to a server terminal enrollment information acquired from a user; a server terminal 200 that saves the enrollment information, generates supplementary information from the enrollment information, and performs similarity search on raw search data by using the enrollment information and the supplementary information; a client terminal 300 that transmits to the server terminal 200 the raw search data inputted by the user; and a network 400.
The number of each of the enrollment terminal 100, the server terminal 200, and the client terminal 300 may be one or more. The enrollment terminal 100 may be the same terminal as the server terminal 200 or as the client terminal 300, and is not necessarily provided. The server terminal 200 may be the same terminal as the client terminal 300. The network 400 may be a network such as a WAN or a LAN, inter-device communication using USB, IEEE 1394, or the like, or wireless communication such as a portable phone network or Bluetooth.
For example, an assumed configuration is one in which the enrollment terminal 100 includes a plurality of PCs in a firm, the server terminal 200 is a server in a data center operated by the firm, the client terminal 300 includes a plurality of users' individual PCs, and the network 400 is the Internet; the assumed operation is that an employee of the firm performs image enrollment. Alternatively, the enrollment terminal 100 may be a server in the data center, so that a server manager can perform image enrollment; or the enrollment terminal 100 may be provided in a user's individual PC, so that the user can perform image enrollment; or, without providing the enrollment terminal 100, the server terminal 200 may automatically collect images from the Internet. Alternatively, the enrollment terminal 100, the server terminal 200, and the client terminal 300 may all be provided in a user's individual PC, so that image enrollment, supplementary information generation, and search can be performed on that PC.
The enrollment terminal 100 is composed of: a raw data acquisition unit 101 that acquires raw data; and a communication I/F 102.
The server terminal 200 is composed of: a pivot determination unit 201 that determines M pivots from N pieces of enrolled data; a feature extraction unit 202 that extracts features from raw data; a score calculation unit 203 that calculates a score as a distance (or a degree of similarity) between the features; an index vector generation unit 204 that generates an index vector by using a score for a non-pivot or a pivot of search data; a Δ score calculation unit 205 that calculates a distance (or degree of similarity) (hereinafter called Δ score) between the index vectors; a non-pivot-specific parameter training unit 206 that trains a non-pivot-specific parameter by using training data; a non-pivot selection order determination unit 207 that determines the order to select the non-pivots by using a Δ score between the inputted search data and the non-pivot; a search result output unit 208 that outputs search results based on a score between the search data and the enrolled data; a communication I/F 209, and a database 210.
The database 210 holds master data 220. The master data 220 holds enrollment information 230 of each enrolled user and supplementary information 240. The enrollment information 230 holds, for each piece of the enrolled data, an enrolled data ID 231, raw data 232, and a feature 233. The supplementary information 240 holds: pivot information 241 that indicates which piece of the enrolled data is a pivot; an index 242; and a non-pivot-specific parameter 250. The index 242 holds an index vector 243 for each non-pivot. The non-pivot-specific parameter 250 holds, for each non-pivot, an index vector size 251 and a regression coefficient 252 that is used for logistic regression.
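One possible in-memory layout mirroring this database structure (a sketch with assumed types; the text does not prescribe an implementation):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Enrollment:                 # enrollment information 230
    data_id: int                  # enrolled data ID 231
    raw: bytes                    # raw data 232
    feature: np.ndarray           # feature 233

@dataclass
class NonPivotParam:              # non-pivot-specific parameter 250
    size: int                     # index vector size 251 (Zi)
    w1: float = 0.0               # regression coefficient 252 (w_{i,1})
    w0: float = 0.0               # regression coefficient 252 (w_{i,0})

@dataclass
class MasterData:                 # master data 220
    enrollments: list = field(default_factory=list)  # enrollment information 230
    pivot_ids: list = field(default_factory=list)    # pivot information 241
    index: dict = field(default_factory=dict)        # index 242: id -> index vector 243
    params: dict = field(default_factory=dict)       # id -> NonPivotParam
```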
The client terminal 300 is composed of: a raw data acquisition unit 301 that acquires raw data; and a communication I/F 302.
The enrollment terminal 100 acquires raw enrolled data from the user (step S101).
The enrollment terminal 100 transmits the raw enrolled data to the server terminal 200 (step S102).
The server terminal 200 extracts features for enrollment from the raw enrolled data (step S103).
The server terminal 200 saves into the database 210 the enrollment information 230 including the enrolled data ID 231 specific to the enrolled data, the raw data 232 for enrollment, and the feature 233 for enrollment (step S104).
The server terminal 200 acquires the enrollment information 230 of each enrolled user from the database 210 to newly generate supplementary information, or acquires the added enrollment information 230 from the database 210 to update the supplementary information (step S201).
To newly generate supplementary information, the server terminal 200 newly determines M pivots from among the raw data 232 of the N pieces of enrollment information 230 (step S202). To update the supplementary information, this step is omitted and the raw data 232 of the added enrollment information 230 is treated as a non-pivot. Methods of determining a pivot include, for example: random selection; and selecting as a pivot, at every pivot selection, the piece of data that has the smallest (or largest) sum of scores or Δ scores from the pivots determined by that time.
To newly generate supplementary information, the server terminal 200 obtains a score between each pivot and each of the (N−M) non-pivots to generate the index vectors 243; to update the supplementary information, it obtains a score between each pivot and each added non-pivot to generate its index vector 243 (step S203).
The server terminal 200 uniquely determines (trains) the non-pivot-specific parameter 250 composed of the index vector size 251 and the regression coefficient 252 used for the logistic regression by using prepared data (training data) for each of the N−M non-pivots to newly generate supplementary information and for each added non-pivot to update the supplementary information (step S204). Details of a method of training the non-pivot-specific parameter 250 composed of the index vector size 251 and the regression coefficient 252 will be described below.
To newly generate supplementary information, the server terminal 200 saves into the database 210, as the supplementary information 240, the pivot information 241 indicating which pieces of the enrolled data are pivots, the index 242 composed of the index vectors 243 of the N−M non-pivots, and the non-pivot-specific parameters 250 composed of the trained index vector size 251 and regression coefficient 252 of each non-pivot. To update the supplementary information, the server terminal 200 adds the generated index vector 243 to the index 242 of the database 210 and adds the trained index vector size 251 and regression coefficient 252 of each added non-pivot to the non-pivot-specific parameters 250. At this point, the index vector 243 of each non-pivot is saved or added in correspondence with the index vector size 251 of the concerned non-pivot (step S205).
The server terminal 200 acquires the master data 220 from the database 210 (step S301).
The client terminal 300 acquires raw search data from the user (step S302).
The client terminal 300 transmits the raw search data to the server terminal 200 (step S303).
The server terminal 200 extracts a feature for search from the raw search data (step S304).
The server terminal 200 calculates a score between the search data and each pivot (step S305).
The server terminal 200, based on the score between the search data and each pivot, generates an index vector of the search data (step S306).
The server terminal 200, by using the index vector of the search data, the index 242 including the index vector of each non-pivot, and the index vector size 251 of each non-pivot, calculates a Δ score between the search data and each of the non-pivots (step S307).
The server terminal 200, based on the Δ scores ΔSq,M+1, . . . , ΔSq,N and by using the regression coefficients wi,1 and wi,0 of the logistic regression of each non-pivot, obtains by Formula 7 the value ai that is related in a monotonically increasing manner to the posterior probability P(sq,i<r|ΔSq,i) (M+1≦i≦N) that the score sq,i from the search data is smaller than the threshold value r, and determines the order in which to select the non-pivots in descending order of ai (step S308).
The server terminal 200 initializes to 0 the number of times t of calculating the score between the search data and the non-pivots (step S309).
The server terminal 200 calculates a score between the search data and the non-pivot selected in accordance with the order to select the non-pivots determined at step S308 (step S310).
The server terminal 200 increases the number of times t of calculating the score between the search data and the non-pivot by an increment of 1 (step S311).
The server terminal 200 proceeds to step S310 if the number of times t of calculating the score between the search data and the non-pivot is equal to or smaller than an upper limit value T and proceeds to step S313 if it is larger than the upper limit value T (step S312).
The server terminal 200 transmits the raw data 232 as search results to the client terminal 300 (step S313). At this point, a method of selecting k pieces of enrolled data in ascending order (or descending order) of score and providing them as search results (k-Nearest Neighbor Search) may be adopted, or a method of providing as search results the enrolled data for which the score is smaller (or larger) than the threshold value r (Range Search) may be adopted.
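Steps S309 through S313 can be summarized in the following sketch (assuming NumPy; `order` comes from step S308 and the k-Nearest-Neighbor output variant is shown):

```python
import numpy as np

def search(q_feature, enrolled, order, T, k):
    scored = []
    for t, i in enumerate(order):                    # order determined at step S308
        if t >= T:                                   # early stop check (step S312)
            break
        s = np.linalg.norm(q_feature - enrolled[i])  # score calculation (step S310)
        scored.append((s, i))
    scored.sort()                                    # ascending score = most similar first
    return [i for _, i in scored[:k]]                # k-NN results (step S313)
```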
The client terminal 300 displays the raw data 232 as the search results (step S314).
Hereinafter, details of the method of training by using the training data the parameter 250 composed of the index vector size 251 and the regression coefficient 252 for each non-pivot in step S204 will be described. As the training data, (N−1) non-pivots other than the concerned non-pivot for which the parameter is trained may be used, or data previously prepared separately from the enrolled data may be used.
First, the method of training the regression coefficients wi,1 and wi,0 when the index vector size Zi is fixed at a certain value will be described. Assume that the training data are Q1, Q2, . . . , QN′ (where N′ is the number of pieces of training data). Moreover, the Δ score between the training data Qj (1≦j≦N′) and the non-pivot Xi (M+1≦i≦N) is denoted ΔSj,i, and the aggregation of the Δ scores for the non-pivot Xi over the training data Qj (1≦j≦N′) is expressed by:
ΔSi={ΔSj,i|1≦j≦N′} [Formula 9]
For example, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, ΔSj,i can be expressed as De(Sqj, Si, Ti, Zi) (where Sqj is the score vector of the training data Qj) and can be calculated by Formula 3. When the permutation vector is used as the index vector and Spearman Rho is used as the Δ score, ΔSj,i is Dρ(Tqj, Ti, Zi) (where Tqj is the permutation vector of the training data Qj) and can be calculated by Formula 4.
Further, assume that a label Lji is defined that takes 1 when the score sj,i between the training data Qj (1≦j≦N′) and the non-pivot Xi (M+1≦i≦N) is smaller than the threshold value r and takes 0 otherwise, and that the aggregation of the labels for the non-pivot Xi over the training data Qj (1≦j≦N′) is expressed by:

Li={Lji|1≦j≦N′} [Formula 10]
Furthermore, the regression coefficients wi,1 and wi,0 of the non-pivot Xi can be arranged in a vector form:

wi=(wi,1,wi,0)T [Formula 11]
In this embodiment, the aggregation ΔSi of the Δ scores for the non-pivot Xi and the aggregation Li of the labels are used for training the regression coefficient wi.
As the method of training the regression coefficient, there are methods using maximum a posteriori probability estimation and maximum likelihood estimation. To train the regression coefficient wi through the maximum a posteriori probability estimation by using the aggregation ΔSi of the Δ scores for the non-pivot Xi and the aggregation Li of the labels, a parameter wiMAP is obtained through:

wiMAP = argmaxwi P(wi|ΔSi, Li)
= argmaxwi P(Li|ΔSi, wi)P(wi|ΔSi)/P(Li|ΔSi)
= argmaxwi P(Li|ΔSi, wi)P(wi|ΔSi)
= argmaxwi P(Li|ΔSi, wi)P(ΔSi|wi)P(wi)/P(ΔSi)
= argmaxwi P(Li|ΔSi, wi)P(ΔSi)P(wi)/P(ΔSi)
= argmaxwi P(Li|ΔSi, wi)P(wi) [Formula 12]

and this is provided as the training result. Here, Bayes' theorem (together with dropping denominators that do not depend on wi) is used for the transformations on the second to fourth lines; the independence of ΔSi and wi (that is, P(ΔSi|wi)=P(ΔSi)) is used for the transformation from the fourth to the fifth line; and the fact that P(ΔSi) is fixed, not depending on wi, is used for the transformation from the fifth to the sixth line. Moreover, argmax f(x) denotes the x that maximizes f(x). To train the regression coefficient wi through the maximum likelihood estimation, a parameter wiML is obtained through:

wiML = argmaxwi P(Li|ΔSi, wi) [Formula 13]

and this is provided as the training result.
As shown by Formulae 12 and 13, the maximum a posteriori probability estimation differs from the maximum likelihood estimation in that the regression coefficient is trained in view of the prior probability P(wi) of the regression coefficient wi. By considering this prior probability, the maximum a posteriori probability estimation is capable of training the regression coefficient more robustly than the maximum likelihood estimation, even when the number of pieces of training data is small. In particular, in this embodiment, the number of labels Lji taking 1 (that is, the number of pieces of training data Qj similar to the non-pivot Xi) is typically very small, and thus the regression coefficient may not be trained appropriately through the maximum likelihood estimation. Even in such a case, the regression coefficient can be trained appropriately through the maximum a posteriori probability estimation.
P(Li|ΔSi, wi) can be obtained by:

P(Li|ΔSi, wi) = Πj=1..N′ σ(aj,i)^Lji (1−σ(aj,i))^(1−Lji) [Formula 14]
Note, however, that this uses the fact that the label Lji takes 1 when the score sj,i between the training data Qj and the non-pivot Xi is smaller than the threshold value r and takes 0 otherwise, and that Lji depends on the Δ score ΔSj,i. Moreover, aj,i is:

aj,i = ln(P(ΔSj,i|sj,i<r)P(sj,i<r) / (P(ΔSj,i|sj,i≧r)P(sj,i≧r))) [Formula 15]
By using the logistic regression described above,
aj,i≈wi,1ΔSj,i+wi,0 [Formula 16]
can be obtained.
Assuming that P(wi) is, for example, a normal distribution with average vector 0 and variance-covariance matrix Σ0, there is a method of obtaining:

P(wi)=N(0,Σ0) [Formula 17]

For Σ0, there are, for example, a method of presetting it at an adequate value and a method of automatically determining it by the empirical Bayes method based on the training data. Moreover, an average vector other than 0 may be used, and, for example, an exponential distribution or a gamma distribution other than the normal distribution may be used as the distribution model.
At this point, the regression coefficient wiMAP or wiML obtained through the maximum a posteriori probability estimation or the maximum likelihood estimation (that is, the one that maximizes Formula 12 or 13) can be calculated by using, for example, the Newton-Raphson method. This is a method of obtaining the value wiMAP of the maximum a posteriori probability estimation or the value wiML of the maximum likelihood estimation sequentially with the following procedure.
1. An initial value wi(0) of wi is set appropriately; for example, wi(0)=0. Set τ←0.
2. wi(τ+1) is obtained as shown below, where τ is the number of iterations:
wi(τ+1)=wi(τ)−(∇∇E(wi(τ)))−1∇E(wi(τ)) [Formula 18]
Note that E(wi(τ)) is the negative logarithm of the posterior probability or of the likelihood; this is called an error function. The symbol ∇ is a differential operator vector. In the case of the maximum a posteriori probability estimation,
E(wi(τ))=−log P(Li|ΔSi,wi(τ))P(wi(τ)) [Formula 19]
and in the case of the maximum likelihood estimation,
E(wi(τ))=−log P(Li|ΔSi,wi(τ)) [Formula 20]
Moreover, ∇E(wi(τ)) and ∇∇E(wi(τ)) are the first-order differential column vector (gradient) and the second-order differential matrix (Hessian), respectively. For example, in the case of the maximum a posteriori probability estimation, when Formulae 14, 16, and 17 are employed,

∇E(wi(τ)) = Σj=1..N′ (σ(aj,i(τ))−Lji)xj + Σ0−1wi(τ) [Formula 21]

∇∇E(wi(τ)) = Σj=1..N′ σ(aj,i(τ))(1−σ(aj,i(τ)))xjxjT + Σ0−1 [Formula 22]

can be obtained, where
aj,i(τ)≈wi,1(τ)ΔSj,i+wi,0(τ) [Formula 23]
xj=(ΔSj,i,1)T [Formula 24]
3. When the difference between wi(τ+1) and wi(τ) is sufficiently small, or when τ exceeds a fixed value, the iteration ends and wi(τ+1) is provided as wiMAP or wiML. Otherwise, set τ←τ+1 and return to step 2.
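Putting Formulae 14 through 24 together, the MAP training of wi by Newton-Raphson can be sketched as follows (NumPy assumed; the toy data, the prior Σ0, and the stopping constants are illustrative choices):

```python
import numpy as np

def train_map(dS, L, Sigma0, max_iter=50, tol=1e-6):
    # dS: Δ scores ΔS_{j,i} of the N' training data; L: labels L_{ji} in {0,1}
    X = np.column_stack([dS, np.ones_like(dS)])  # rows x_j = (ΔS_{j,i}, 1)^T (Formula 24)
    w = np.zeros(2)                              # step 1: w_i^(0) = 0
    P0 = np.linalg.inv(Sigma0)                   # Σ0^-1 from the prior (Formula 17)
    for _ in range(max_iter):                    # step 2, repeated
        a = X @ w                                # a_{j,i} (Formula 23)
        y = 1.0 / (1.0 + np.exp(-a))             # σ(a_{j,i})
        grad = X.T @ (y - L) + P0 @ w            # ∇E (Formula 21)
        hess = X.T @ (X * (y * (1 - y))[:, None]) + P0  # ∇∇E (Formula 22)
        w_new = w - np.linalg.solve(hess, grad)  # Formula 18
        if np.linalg.norm(w_new - w) < tol:      # step 3: converged
            return w_new
        w = w_new
    return w

rng = np.random.default_rng(1)
dS = rng.random(200)
L = (dS < 0.2).astype(float)  # toy labels: small Δ score -> similar
w_map = train_map(dS, L, Sigma0=10.0 * np.eye(2))
```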
Next, a method of training the index vector size Zi will be described. To this end, the aforementioned operation is performed while the index vector size Zi is varied over various values (for example, 1 to M), and the wiMAP or wiML for which the error function is as small as possible, together with the Zi that achieves this, may be provided as the training result. This makes it possible to obtain the best parameter in terms of accuracy.
Alternatively, the non-pivot-specific parameters may be trained so that the sum of the error functions over the non-pivots becomes as small as possible while the index size is kept equal to or smaller than a fixed value. To this end, the wiMAP with which the sum of the error functions over the non-pivots becomes smallest while the Zi of each non-pivot is varied over various values in a range where the index size is equal to or smaller than the fixed value, together with the Zi that realizes this, may be provided as the training results (M+1≦i≦N). This makes it possible to realize, when a required value is set for the size of the supplementary information, the most excellent performance in terms of accuracy within a range that satisfies this requirement.
Moreover, in this embodiment, obtaining the labels Lji (1≦j≦N′, M+1≦i≦N) requires calculation of a total of (N−M)×N′ scores, which typically takes a great deal of time. Thus, the Δ score between each non-pivot and each of the N′ pieces of training data may be obtained, υ′ (<N′) pieces of the training data may be selected in ascending order of Δ score (where υ′ is a value predefined by a system manager or the like), and only those may be used for training. A piece of training data with a small Δ score is likely to be similar to the non-pivot, and this makes it possible to reduce the number of score calculations required for the training to (N−M)×υ′ while suppressing, as much as possible, reduction in the number of labels Lji that take 1 (that is, that indicate similarity to the non-pivot Xi). This consequently provides the effect of high-speed training.
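A sketch of this pre-selection (NumPy assumed; υ′ is written as v here):

```python
import numpy as np

def select_training_subset(delta_by_nonpivot, v):
    # keep, per non-pivot, only the v training items with the smallest Δ scores,
    # so that only (N-M)*v raw scores need to be computed for the labels
    return {i: np.argsort(ds)[:v] for i, ds in delta_by_nonpivot.items()}
```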
Moreover, for example, in a case where the pieces of enrolled data form several clusters in the feature space, parameters such as the index vector size may take similar or identical values within each cluster.
Therefore, in this embodiment, clustering may be performed on the non-pivots, and the non-pivot-specific parameters may be trained so that some or all of the parameters are common within each cluster. As the clustering method, any of the hierarchical methods such as the nearest neighbor method, the farthest neighbor method, the group average method, and the Ward method may be used. Training a common parameter for each cluster in this manner makes it possible to reduce the size of the parameters, which consequently provides the effect of further system weight reduction.
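A minimal sketch of such clustering with the Ward method (assuming SciPy, and assuming the index vectors are used as the clustering features, which is our own choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

index_vectors = np.random.rand(100, 16)           # index vectors of 100 non-pivots
Z = linkage(index_vectors, method="ward")          # Ward hierarchical clustering
labels = fcluster(Z, t=10, criterion="maxclust")   # assign 10 clusters
# one shared (Zi, wi) parameter set would then be trained per cluster label
```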
Moreover, in a case where the enrolled data is used as the training data, when a piece of enrolled data has been added, it is possible that the parameter training is not performed successfully since the index vector size of the training data is small. However, training a common parameter for each cluster as described above makes it possible to easily perform the parameter training by using the common parameter of the cluster to which the concerned enrolled data belongs.
Hereinafter, with reference to the accompanying drawings, the second embodiment will be described. A similarity search system of this embodiment is a biological body identification system which, when a user who attempts authentication (hereinafter referred to as an authenticated user) inputs biological body information, searches a database in a server terminal for similar biological body information, thereby identifies to which user enrolled in the database (hereinafter referred to as an enrolled user) the authenticated user corresponds, and performs authentication based on the result of this identification.
In this embodiment, raw data is biological body information.
This system is composed of: an enrollment terminal 100 that transmits to a server terminal a feature of biological body information obtained from the user; a server terminal 200 that saves enrollment information, generates supplementary information from the enrollment information, and performs biological body identification on a feature for authentication by using the enrollment information and the supplementary information; a client terminal 300 that transmits to the server terminal 200 a group ID and the feature for the authentication inputted by the user; and a network 400.
For example, for an information access control system or an attendance management system of a firm, it is possible to form the enrollment terminal 100 with a plurality of PCs in the firm, the server terminal 200 with one server in a data center operated by the firm, the client terminal 300 with a plurality of employees' PCs, and the network 400 with the Internet. Moreover, for an entrance and exit management system in the firm, it is possible to form the enrollment terminal 100, the server terminal 200, and the client terminal 300 within the same entrance and exit management device. A group ID 221 may be a value specific to the business place to which the user belongs, or may be set to be specific to each client terminal 300 or each base. In the former case, a possible operation is for the user to input the group ID at the time of authentication. In the latter case, the user is not required to input the group ID at the time of authentication.
The enrollment terminal 100 further has: a group ID/user name acquisition unit 103 that acquires a group ID and a user name; and a feature extraction unit 104 that extracts a feature from raw data.
The server terminal 200 does not have the feature extraction unit 202 but has a group narrowing unit 209a, and holds master data 220 for each group ID. The master data 220 has a group ID 221. The enrollment information 230 does not have the raw data 232 but has a user name 234 for each piece of enrollment information.
Possible features of the biological body information are, for example, minutiae for a fingerprint, an iris code for an iris, and a cepstrum for a voiceprint. Possible scores between two pieces of biological body information are the number or ratio of corresponding minutiae for the fingerprint, a Hamming distance for the iris, and a Mahalanobis distance for the voiceprint.
The client terminal 300 further has: a group ID acquisition unit 303 that acquires a group ID; and a feature extraction unit 304 that extracts a feature from raw data.
Hardware configuration of the enrollment terminal 100, the server terminal 200, and the client terminal 300 according to this embodiment is the same as that of the first embodiment.
The enrollment terminal 100 acquires a group ID and a user name from the user (step S101a).
The enrollment terminal 100 extracts a feature for enrollment from raw enrolled data (step S102a).
The enrollment terminal 100 transmits to the server terminal 200 the group ID, the user name, and the feature for enrollment (Step S103a).
The server terminal 200, if the master data 220 corresponding to the group ID is in the database 210, adds to that master data 220 the enrollment information 230 including an enrolled data ID 231 specific to the enrolled data, the user name 234, and a feature 233 for enrollment. If there is no such master data 220, master data 220 including the group ID 221, together with the enrollment information 230 including the enrolled data ID 231 specific to the enrolled data, the user name 234, and the feature 233 for enrollment, is newly created (step S104a).
Processing procedures and a data flow of the supplementary information generation processing according to this embodiment are the same as those of the first embodiment.
The server terminal 200 acquires the master data 220 for each group ID from the database 210 (step S301a).
The client terminal 300 acquires the group ID from the user (step S302a). The group ID may instead be a value specific to each client terminal 300 or each base; in that case, it is not acquired from the user.
The client terminal 300 extracts a feature for search from raw search data (step S303a).
The client terminal 300 transmits the group ID and the feature for search to the server terminal 200 (step S304a).
The server terminal 200 narrows the target of search to the master data corresponding to the acquired group ID (step S305a).
As described above, in this embodiment, the enrolled data is narrowed by using the group ID. This makes it possible to dramatically reduce the number of pieces of enrolled data for which the score is calculated, and consequently provides the effect of further improving speed.
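A trivial sketch of the narrowing step (the group IDs and data are placeholders):

```python
# master data 220 held per group ID 221; search only within the matching group
master_data = {
    "branch_A": ["enrollment records for branch A"],
    "branch_B": ["enrollment records for branch B"],
}

def narrowed_targets(group_id):
    return master_data.get(group_id, [])  # target of search (step S305a)

print(narrowed_targets("branch_A"))
```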
The server terminal 200 transmits, as a search result, the user name 234 corresponding to the enrolled data to the client terminal 300 (step S313a).
The client terminal 300 displays the user name 234 corresponding to the enrolled data as the search result (step S314a).
The present invention is applicable to any application that performs similarity search on unstructured data such as an image, a moving picture, music, a document, binary data, or biological body information. For example, the invention is applicable to a similar image search system, a similar moving picture search system, a similar music search system, a similar document search system, a similar file search system using fuzzy hash, an information access control system, an attendance management system, and an entrance and exit management system.