The present application claims priority from Japanese application serial No. 2011-041268, filed on Feb. 28, 2011, the entire contents of which are hereby incorporated by reference into this application.
1. Technical Field of the Invention
The present invention relates to a method and a system for similarity search on inputted unstructured data.
2. Description of Related Art
Searching for unstructured data similar to inputted unstructured data such as an image, a moving picture, a document, binary data, or biological body information is called similarity search. Similarity search is typically performed by extracting, from raw unstructured data (hereinafter called raw data), information called features that is used for distance calculation (or similarity calculation), and by considering that a smaller distance between the features (the distance indicating a degree of disagreement) or a greater degree of similarity between the features (the degree of similarity indicating a degree of agreement) indicates a greater degree of resemblance. The distance (or degree of similarity) between the features is called a score.
Examples include: a method (k-Nearest Neighbor Search) of calculating the distance (or degree of similarity) between raw data inputted at the time of search (hereinafter called search data) and raw data enrolled in a database (hereinafter called enrolled data), selecting K pieces of the enrolled data in ascending order of distance (or descending order of degree of similarity), and outputting information related thereto as search results; and a method (Range Search) of outputting, as search results, information related to the enrolled data whose distance (or degree of similarity) is smaller (or larger) than a threshold value r.
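For illustration only, the following is a minimal sketch of these two search modes over feature vectors, assuming NumPy arrays and the Euclidean distance as the score (the array sizes and function names are illustrative, not from the original text):

```python
import numpy as np

def knn_search(query, enrolled, k):
    # k-Nearest Neighbor Search: k enrolled items in ascending order of distance
    dists = np.linalg.norm(enrolled - query, axis=1)  # one score per enrolled item
    return np.argsort(dists)[:k]

def range_search(query, enrolled, r):
    # Range Search: enrolled items whose distance is smaller than the threshold r
    dists = np.linalg.norm(enrolled - query, axis=1)
    return np.where(dists < r)[0]

enrolled = np.random.rand(1000, 64)  # N=1000 enrolled features
query = np.random.rand(64)           # search data feature
print(knn_search(query, enrolled, k=5))
print(range_search(query, enrolled, r=1.0))
```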
When the total number of pieces of enrolled data is N, calculating scores for all of them requires N score calculations. Score calculation typically takes a significant amount of time, so the search time grows almost in proportion to N. To address this, distance-based indexing has been suggested: scores between the pieces of enrolled data are calculated in advance, the order in which to select the pieces of enrolled data for score calculation is determined by using those precomputed scores, and score calculation is stopped partway through, thereby reducing the number of score calculations.
For example, in E. Chavez, K. Figueroa and G. Navarro, "Effective Proximity Retrieval by Ordering Permutations," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 30, No. 9, pp. 1647-1658 (2008), M (M<N) pieces of enrolled data (hereinafter called pivots) are selected, for example randomly, from the N pieces of enrolled data; a distance between each piece of enrolled data and each pivot is calculated; a vector (hereinafter called a first index vector) used at the time of search is obtained for each piece of enrolled data from these distances; at the time of search, the distance between the inputted search data and each pivot is calculated to obtain a second index vector of the search data; and the order in which to select the remaining pieces of enrolled data (hereinafter called non-pivots) is then determined in ascending order of the distance between the first and second index vectors. As the index vector, this literature uses a vector in which the IDs of the pivots are arranged in ascending order of distance.
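The permutation-based ordering of that literature can be sketched as follows; this is a hedged illustration assuming NumPy, with arbitrary sizes (M=8 pivots drawn at random), not code from the literature itself:

```python
import numpy as np

def permutation_vector(x, pivots):
    # IDs of the pivots arranged in ascending order of distance from x
    return np.argsort(np.linalg.norm(pivots - x, axis=1))

def spearman_rho(perm_a, perm_b):
    # sum of squared rank differences between two permutations of pivot IDs
    inv_a, inv_b = np.argsort(perm_a), np.argsort(perm_b)  # rank of each pivot ID
    return int(np.sum((inv_a - inv_b) ** 2))

rng = np.random.default_rng(0)
pivots = rng.random((8, 16))        # M=8 pivots
non_pivots = rng.random((100, 16))  # remaining enrolled data
query = rng.random(16)

perm_q = permutation_vector(query, pivots)                          # second index vector
first_index = [permutation_vector(x, pivots) for x in non_pivots]   # first index vectors
order = np.argsort([spearman_rho(perm_q, p) for p in first_index])  # selection order
```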
In the literature cited above, the order in which to select the non-pivots is determined in ascending order of the distance between the first and second index vectors. However, this method leaves room for improvement in terms of search accuracy: when the calculation of scores for the non-pivots is stopped partway through, the expected number of non-pivots that are not score-calculated (and thus not found), despite their score from the search data being smaller than the threshold value r, should be reduced.
It is an object of the present invention to theoretically minimize the expected number of non-pivots for which the score is not calculated, and which are thus not found, despite their score from the search data being smaller than the threshold value r.
To achieve the object described above, the present invention is characterized by having: a pivot determination unit that determines pivots from enrolled data; a raw data acquisition unit that acquires raw data; a feature extraction unit that extracts features from the raw data; a score calculation unit that calculates a score as one of a distance and a degree of similarity between the features; an index vector generation unit that generates an index vector by using the scores for the pivots; a Δ score calculation unit that calculates a Δ score as one of a distance and a degree of similarity between the index vectors; a non-pivot-specific parameter training unit that trains, by using training data, a parameter of each non-pivot including a regression coefficient; a non-pivot selection order determination unit that determines, by using the Δ score between inputted search data and each non-pivot as well as the regression coefficient, the order in which to select the non-pivots in descending order of posterior probability through logistic regression; a search result output unit that outputs a search result based on the score between the search data and the enrolled data; and a database that holds the features of the enrolled data, pivot information indicating which pieces of the enrolled data are the pivots, an index including the index vector of each non-pivot, and the parameter of each non-pivot.
With the present invention, the order in which to select the non-pivots is determined, by using non-pivot-specific regression coefficients, in descending order of posterior probability through logistic regression. This makes it possible to theoretically minimize the expected number of non-pivots for which the score is not calculated, and which are thus not found, despite their score from the search data being smaller than the threshold value r. This consequently provides the effect of dramatically improving accuracy.
Hereinafter, the first embodiment will be described with reference to the accompanying drawings.
A similarity search system of this embodiment is a similar image search system that, when the user inputs an image, searches a database in a server terminal for a similar image. Unstructured data such as a moving picture, music, a document, or binary data may be used instead of an image. The similarity search system of this embodiment uses a color histogram as the features of an image and uses the Euclidean distance as the score between the features.
The similarity search system of this embodiment preselects M pivots from the N pieces of enrolled data; one method of selecting the pivots is, for example, random selection. Next, the system calculates a score between each piece of the remaining enrolled data (each non-pivot) and each of the pivots and, based on these scores, obtains a first index vector used at the time of search for each non-pivot. At the time of search, the system calculates a score between the inputted search data and each pivot and, based on these scores, obtains a second index vector of the search data. The index vector serves as a clue indicating the positional relationship between each non-pivot and the search data without obtaining a score directly. Calculating the score between the search data and each piece of enrolled data typically takes a great deal of time, but the number of score calculations can be reduced (that is, high-speed search can be performed) by determining the order in which to select the non-pivots by using a distance (or degree of similarity) between the index vectors (hereinafter called a Δ score), performing score calculation against the non-pivots T (<N−M) times (where T is an upper limit value predefined by the system manager or the like), and then stopping the score calculation partway through.
As the index vector, a vector formed of the scores from each pivot (hereinafter called a score vector) may be used, or a vector in which the IDs of the pivots are arranged in ascending order of distance (or descending order of degree of similarity) (hereinafter called a permutation vector) may be used. A collection of the first index vectors of the different non-pivots is called an index.
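As a sketch of the offline step (assuming NumPy; the sizes and the random pivot choice are illustrative assumptions), a score-vector index can be built as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
enrolled = rng.random((1000, 64))                        # N enrolled features
piv_ids = rng.choice(len(enrolled), 16, replace=False)   # M=16 pivots, chosen randomly
pivots = enrolled[piv_ids]
non_piv_ids = np.setdiff1d(np.arange(len(enrolled)), piv_ids)

# first index vector of each non-pivot: its scores against all M pivots
index = np.array([np.linalg.norm(pivots - enrolled[i], axis=1) for i in non_piv_ids])
```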
For the Δ score (the distance or degree of similarity between the index vectors), when the score vector is used as the index vector, for example, the Manhattan distance, the Euclidean distance, or the like is assumed, and when the permutation vector is used, for example, Spearman Rho or the like is assumed. Alternatively, for example, the value obtained by subtracting the aforementioned distance from its maximum possible value may be used as the degree of similarity.
For example, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, the Euclidean distance De(Sq, Si) between a score vector Sq of the search data and a score vector Si of enrolled data Xi,

De(Sq, Si) = sqrt(Σz=1..M (Sq(z) − Si(z))²) [Formula 1]

is obtained. Here, Si(z) denotes the z-th element of the score vector Si.
When the permutation vector is used as the index vector and Spearman Rho is used as the Δ score, the Spearman Rho Dρ(Tq, Ti) between a permutation vector Tq of the search data and a permutation vector Ti of the enrolled data Xi,

Dρ(Tq, Ti) = Σz=1..M (Tq−1(Ti(z)) − z)² [Formula 2]

is obtained. Here, Ti(z) denotes the suffix number (pivot ID) of the z-th element in the permutation vector Ti. For example, where Ti=(X2, XM, X1, . . . , X3)T, Ti(1)=2, Ti(2)=M, Ti(3)=1, . . . , Ti(M)=3. Tq−1(i) denotes the place at which the element Xi is located in the permutation vector Tq. For example, where Tq=(XM, X1, X2, . . . , X3)T, Tq−1(1)=2, Tq−1(2)=3, Tq−1(3)=M, . . . , Tq−1(M)=1.
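As a small worked check of this notation (a hypothetical M=4 case, not from the original text), the Spearman Rho of Formula 2 can be evaluated as:

```python
# Ti = (X2, X4, X1, X3) means Ti(1)=2, Ti(2)=4, Ti(3)=1, Ti(4)=3.
Ti = [2, 4, 1, 3]
Tq = [4, 1, 2, 3]
inv_q = {pid: pos + 1 for pos, pid in enumerate(Tq)}         # Tq^-1
rho = sum((inv_q[Ti[z - 1]] - z) ** 2 for z in range(1, 5))  # Formula 2
print(rho)  # (3-1)^2 + (1-2)^2 + (2-3)^2 + (4-4)^2 = 6
```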
A first characteristic of the similarity search system of this embodiment is that an index vector size (the number of dimensions of the index vector) of each non-pivot is uniquely determined (trained) before search by using prepared data (training data). A method of training the index vector size will be described in detail below.
Parts (b1) and (b2) of the drawing show examples of the indexes in which the index vectors are held in correspondence with the trained index vector size of each non-pivot, for the cases where the score vector and the permutation vector are used as the index vectors, respectively. In this case, when the score vector is used as the index vector, the score vector is rearranged so that it holds a number of elements equal to the score vector size in ascending or descending order of score, and, in order to tell to which pivot each score corresponds, a permutation vector of the same length is also held.
As described above, in this embodiment, the non-pivot-specific index vector size is trained by using the training data, and the index vector of each non-pivot is saved in correspondence with that non-pivot's index vector size. This makes it possible to reduce the index vector size for each non-pivot, which reduces the size of the index saved into the database and provides the effect of system weight reduction. Details of the method of training the index vector size will be described below.
In this case, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, the Euclidean distance De(Sq, Si, Ti, Zi) between the score vector Sq of the search data and the score vector Si of the enrolled data Xi (whose permutation vector is Ti and whose score vector size is Zi) is indicated as:

De(Sq, Si, Ti, Zi) = sqrt(Σz=1..Zi (Sq(Ti(z)) − Si(z))²) [Formula 3]
Moreover, when the permutation vector is used as the index vector and Spearman Rho is used as the Δ score, the Spearman Rho Dρ(Tq, Ti, Zi) between the permutation vector Tq of the search data and the permutation vector Ti of the enrolled data Xi (whose permutation vector size is Zi),

Dρ(Tq, Ti, Zi) = Σz=1..Zi (Tq−1(Ti(z)) − z)² [Formula 4]

is obtained.
As described above, the Δ score between the search data and each non-pivot is calculated in correspondence with the index vector size (that is, as a distance between Zi-dimensional vectors). This requires a shorter time than calculating a Δ score corresponding to the number M of pivots (that is, a distance between M-dimensional vectors), and consequently provides the effect of improving speed.
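A minimal sketch of Formula 3 (assuming NumPy and 0-based pivot IDs; the function name is illustrative):

```python
import numpy as np

def delta_score_truncated(Sq, Si, Ti, Zi):
    # Formula 3: Euclidean Δ score over only the Zi stored elements.
    # Sq: the search data's full M-dimensional score vector;
    # Si: the non-pivot's stored (rearranged) scores; Ti[z]: pivot ID of Si[z].
    return float(np.sqrt(sum((Sq[Ti[z]] - Si[z]) ** 2 for z in range(Zi))))
```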
A second characteristic of the similarity search system of this embodiment is that, after the Δ scores ΔSq,M+1, . . . , ΔSq,N for the non-pivots are obtained in this manner, the order in which to select the non-pivots is determined, by using logistic regression, in descending order of the posterior probability P(sq,i<r|ΔSq,i) (M+1≦i≦N) that the score sq,i from the search data is smaller than a threshold value r. The posterior probability P(sq,i<r|ΔSq,i) can be transformed by use of Bayes' theorem as follows:

P(sq,i<r|ΔSq,i) = P(ΔSq,i|sq,i<r)P(sq,i<r) / (P(ΔSq,i|sq,i<r)P(sq,i<r) + P(ΔSq,i|sq,i≧r)P(sq,i≧r)) = σ(ai) [Formula 5]

where σ( ) is the logistic sigmoid function σ(a)=1/(1+exp(−a)), and ai is:

ai = ln(P(ΔSq,i|sq,i<r)P(sq,i<r) / (P(ΔSq,i|sq,i≧r)P(sq,i≧r))) [Formula 6]
The logistic sigmoid function σ( ) is monotonically increasing, and thus determining the order in which to select the non-pivots in descending order of ai amounts to determining it in descending order of the posterior probability P(sq,i<r|ΔSq,i). The value ai can be obtained by using logistic regression, in which ai is obtained in an approximate manner by:
ai≈wi,1ΔSq,i+wi,0 [Formula 7]
Here wi,1 and wi,0 are non-pivot-specific regression coefficients of the logistic regression (M+1≦i≦N). A value common to all non-pivots could be adopted as the regression coefficients, but since the regression coefficients naturally differ in value from one non-pivot to another, ai can be obtained more properly by using the non-pivot-specific regression coefficients. Moreover, according to Formula 7, ai can be obtained approximately by performing one multiplication and one addition on the Δ score ΔSq,i, so calculating ai takes little time. The regression coefficients are uniquely determined (trained) before search by using prepared data (training data), as is the case with the index vector size. Details of the method of training the regression coefficients will be described below.
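Determining the selection order via Formula 7 thus reduces to one multiply-add per non-pivot followed by a sort, as in this sketch (NumPy assumed; names illustrative):

```python
import numpy as np

def selection_order(delta_scores, w1, w0):
    # a_i = w_{i,1} * ΔS_{q,i} + w_{i,0} (Formula 7), per-non-pivot coefficients
    a = w1 * delta_scores + w0
    return np.argsort(-a)  # descending a_i = descending posterior probability
```

Since wi,1 is typically negative (a larger Δ score lowers the chance of similarity), descending ai tends to favor small Δ scores, but with a per-non-pivot weighting that plain Δ score ordering lacks.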
Assume here that the aggregation of the Δ scores ΔSq,M+1, . . . , ΔSq,N for the non-pivots is ΔSq and that the non-pivot determined as the e-th (1≦e≦N−M) place is Xm(e) (M+1≦m(e)≦N). Then the expected number of non-pivots for which the score is consequently not calculated (that is, which are not found), despite their score from the search data being smaller than the threshold value r, after calculating the score against the non-pivots T (<N−M) times can be denoted as:

Σe=T+1..N−M P(sq,m(e)<r|ΔSq)
≈ Σe=T+1..N−M P(sq,m(e)<r|ΔSq,m(e)) [Formula 8]
Note, however, that the approximation in the second line of Formula 8 uses the fact that what has the greatest influence on the posterior probability of the non-pivot Xm(e) is its own Δ score ΔSq,m(e). As Formula 8 shows, the expected number can be approximated by the sum of the posterior probabilities P(sq,m(e)<r|ΔSq,m(e)) of the non-pivots Xm(e) for which score calculation has not been performed, and this sum is minimized when the scores against the non-pivots are calculated T times in descending order of the posterior probability P(sq,m(e)<r|ΔSq,m(e)).
Therefore, in this embodiment, the order in which to select the non-pivots is determined in descending order of the posterior probability through logistic regression by using the non-pivot-specific regression coefficients, and this makes it possible to theoretically minimize the expected number of non-pivots for which the score is not calculated, and which are thus not found, despite their score from the search data being smaller than the threshold value r. This consequently provides the effect of dramatically improving accuracy. Details of the method of training the regression coefficients will be described below.
This system is composed of: an enrollment terminal 100 that transmits to a server terminal enrollment information acquired from a user; a server terminal 200 that saves the enrollment information, generates supplementary information from the enrollment information, and performs similarity search on raw search data by using the enrollment information and the supplementary information; a client terminal 300 that transmits to the server terminal 200 the raw search data inputted by the user; and a network 400.
The number of each of the enrollment terminal 100, the server terminal 200, and the client terminal 300 may be one or more. The enrollment terminal 100 may be the same terminal as the server terminal 200 or as the client terminal 300, and is not necessarily provided. The server terminal 200 may be the same terminal as the client terminal 300. The network 400 may be a network such as a WAN or a LAN, inter-device communication using USB, IEEE 1394, or the like, or wireless communication such as a portable phone network or Bluetooth.
For example, an assumed configuration is one in which the enrollment terminal 100 includes a plurality of PCs in a firm, the server terminal 200 is a server in a data center operated by the firm, the client terminal 300 includes a plurality of users' individual PCs, and the network 400 is the Internet; the assumed operation is that an employee of the firm performs image enrollment. Alternatively, the enrollment terminal 100 may be a server in the data center, so that a server manager can perform image enrollment; or the enrollment terminal 100 may be provided in a user's individual PC, so that the user can perform image enrollment; or, without providing the enrollment terminal 100, the server terminal 200 may automatically collect images from the Internet. Alternatively, the enrollment terminal 100, the server terminal 200, and the client terminal 300 may all be provided in a user's individual PC, so that image enrollment, supplementary information generation, and search can be performed on that PC.
The enrollment terminal 100 is composed of: a raw data acquisition unit 101 that acquires raw data; and a communication I/F 102.
The server terminal 200 is composed of: a pivot determination unit 201 that determines M pivots from N pieces of enrolled data; a feature extraction unit 202 that extracts features from raw data; a score calculation unit 203 that calculates a score as a distance (or a degree of similarity) between the features; an index vector generation unit 204 that generates an index vector by using a score for a non-pivot or a pivot of search data; a Δ score calculation unit 205 that calculates a distance (or degree of similarity) (hereinafter called Δ score) between the index vectors; a non-pivot-specific parameter training unit 206 that trains a non-pivot-specific parameter by using training data; a non-pivot selection order determination unit 207 that determines the order to select the non-pivots by using a Δ score between the inputted search data and the non-pivot; a search result output unit 208 that outputs search results based on a score between the search data and the enrolled data; a communication I/F 209, and a database 210.
The database 210 holds master data 220. The master data 220 holds enrollment information 230 of each enrolled user and supplementary information 240. The enrollment information 230 holds, for each piece of the enrolled data, an enrolled data ID 231, raw data 232, and a feature 233. The supplementary information 240 holds: pivot information 241 that indicates which piece of the enrolled data is a pivot; an index 242; and a non-pivot-specific parameter 250. The index 242 holds an index vector 243 for each non-pivot. The non-pivot-specific parameter 250 holds, for each non-pivot, an index vector size 251 and a regression coefficient 252 that is used for logistic regression.
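One possible in-memory layout mirroring this database structure (a sketch with assumed types; the text does not prescribe an implementation):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Enrollment:                 # enrollment information 230
    data_id: int                  # enrolled data ID 231
    raw: bytes                    # raw data 232
    feature: np.ndarray           # feature 233

@dataclass
class NonPivotParam:              # non-pivot-specific parameter 250
    size: int                     # index vector size 251 (Zi)
    w1: float = 0.0               # regression coefficient 252 (w_{i,1})
    w0: float = 0.0               # regression coefficient 252 (w_{i,0})

@dataclass
class MasterData:                 # master data 220
    enrollments: list = field(default_factory=list)  # enrollment information 230
    pivot_ids: list = field(default_factory=list)    # pivot information 241
    index: dict = field(default_factory=dict)        # index 242: id -> index vector 243
    params: dict = field(default_factory=dict)       # id -> NonPivotParam
```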
The client terminal 300 is composed of: a raw data acquisition unit 301 that acquires raw data; and a communication I/F 302.
The enrollment terminal 100 acquires raw enrolled data from the user (step S101).
The enrollment terminal 100 transmits the raw enrolled data to the server terminal 200 (step S102).
The server terminal 200 extracts features for enrollment from the raw enrolled data (step S103).
The server terminal 200 saves into the database 210 the enrollment information 230 including the enrolled data ID 231 specific to the enrolled data, the raw data 232 for enrollment, and the feature 233 for enrollment (step S104).
The server terminal 200 acquires the enrollment information 230 of each enrolled user from the database 210 to newly generate supplementary information, or acquires the added enrollment information 230 from the database 210 to update the supplementary information (step S201).
To newly generate supplementary information, the server terminal 200 newly determines M pivots from among the raw data 232 of the N pieces of enrollment information 230 (step S202). To update the supplementary information, this step is omitted and the raw data 232 of the added enrollment information 230 is treated as a non-pivot. Methods of determining a pivot include, for example: random selection; and selecting as a pivot, at every pivot selection, the piece of data that has the smallest (or largest) sum of scores or Δ scores from the pivots determined by that time.
To newly generate supplementary information, the server terminal 200 obtains a score between each pivot and each of the (N−M) non-pivots to generate the index vectors 243; to update the supplementary information, it obtains a score between each pivot and each added non-pivot to generate its index vector 243 (step S203).
The server terminal 200 uniquely determines (trains) the non-pivot-specific parameter 250 composed of the index vector size 251 and the regression coefficient 252 used for the logistic regression by using prepared data (training data) for each of the N−M non-pivots to newly generate supplementary information and for each added non-pivot to update the supplementary information (step S204). Details of a method of training the non-pivot-specific parameter 250 composed of the index vector size 251 and the regression coefficient 252 will be described below.
To newly generate supplementary information, the server terminal 200 saves into the database 210, as the supplementary information 240, the pivot information 241 indicating which pieces of the enrolled data are pivots, the index 242 composed of the index vectors 243 of the N−M non-pivots, and the non-pivot-specific parameters 250 composed of the trained index vector size 251 and regression coefficient 252 of each non-pivot. To update the supplementary information, the server terminal 200 adds the generated index vector 243 to the index 242 of the database 210 and adds the trained index vector size 251 and regression coefficient 252 of each added non-pivot to the non-pivot-specific parameters 250. At this point, the index vector 243 of each non-pivot is saved or added in correspondence with the index vector size 251 of the concerned non-pivot (step S205).
The server terminal 200 acquires the master data 220 from the database 210 (step S301).
The client terminal 300 acquires raw search data from the user (step S302).
The client terminal 300 transmits the raw search data to the server terminal 200 (step S303).
The server terminal 200 extracts a feature for search from the raw search data (step S304).
The server terminal 200 calculates a score between the search data and each pivot (step S305).
The server terminal 200, based on the score between the search data and each pivot, generates an index vector of the search data (step S306).
The server terminal 200, by using the index vector of the search data, the index 242 including the index vector of each non-pivot, and the index vector size 251 of each non-pivot, calculates a Δ score between the search data and each of the non-pivots (step S307).
The server terminal 200, based on the Δ scores ΔSq,M+1, . . . , ΔSq,N and by using the regression coefficients wi,1 and wi,0 of the logistic regression of each non-pivot, obtains by Formula 7 the value ai that is related in a monotonically increasing manner to the posterior probability P(sq,i<r|ΔSq,i) (M+1≦i≦N) that the score sq,i from the search data is smaller than the threshold value r, and determines the order in which to select the non-pivots in descending order of ai (step S308).
The server terminal 200 initializes to 0 the number of times t of calculating the score between the search data and the non-pivots (step S309).
The server terminal 200 calculates a score between the search data and the non-pivot selected in accordance with the order to select the non-pivots determined at step S308 (step S310).
The server terminal 200 increases the number of times t of calculating the score between the search data and the non-pivot by an increment of 1 (step S311).
The server terminal 200 proceeds to step S310 if the number of times t of calculating the score between the search data and the non-pivot is equal to or smaller than an upper limit value T and proceeds to step S313 if it is larger than the upper limit value T (step S312).
The server terminal 200 transmits the raw data 232 as search results to the client terminal 300 (step S313). At this point, a method of selecting k pieces of enrolled data in ascending order (or descending order) of score and providing them as search results (k-Nearest Neighbor Search) may be adopted, or a method of providing as search results the enrolled data for which the score is smaller (or larger) than the threshold value r (Range Search) may be adopted.
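Steps S309 through S313 can be summarized in the following sketch (assuming NumPy; `order` comes from step S308 and the k-Nearest-Neighbor output variant is shown):

```python
import numpy as np

def search(q_feature, enrolled, order, T, k):
    scored = []
    for t, i in enumerate(order):                    # order determined at step S308
        if t >= T:                                   # early stop check (step S312)
            break
        s = np.linalg.norm(q_feature - enrolled[i])  # score calculation (step S310)
        scored.append((s, i))
    scored.sort()                                    # ascending score = most similar first
    return [i for _, i in scored[:k]]                # k-NN results (step S313)
```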
The client terminal 300 displays the raw data 232 as the search results (step S314).
Hereinafter, details of the method of training by using the training data the parameter 250 composed of the index vector size 251 and the regression coefficient 252 for each non-pivot in step S204 will be described. As the training data, (N−1) non-pivots other than the concerned non-pivot for which the parameter is trained may be used, or data previously prepared separately from the enrolled data may be used.
First, the method of training the regression coefficients wi,1 and wi,0 when the index vector size Zi is fixed at a certain value will be described. Assume that the training data are Q1, Q2, . . . , QN′ (where N′ is the number of pieces of training data). Moreover, the Δ score between the training data Qj (1≦j≦N′) and the non-pivot Xi (M+1≦i≦N) is denoted ΔSj,i, and the aggregation of the Δ scores for the non-pivot Xi over the training data Qj (1≦j≦N′) is expressed by:
ΔSi={ΔSj,i|1≦j≦N′} [Formula 9]
For example, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, ΔSj,i can be expressed as De(Sqj, Si, Ti, Zi) (where Sqj is the score vector of the training data Qj) and can be calculated by Formula 3. When the permutation vector is used as the index vector and Spearman Rho is used as the Δ score, ΔSj,i is Dρ(Tqj, Ti, Zi) (where Tqj is the permutation vector of the training data Qj) and can be calculated by Formula 4.
Further, assume that a label Lji is defined that takes 1 when the score sj,i between the training data Qj (1≦j≦N′) and the non-pivot Xi (M+1≦i≦N) is smaller than the threshold value r and takes 0 otherwise, and that the aggregation of the labels for the non-pivot Xi over the training data Qj (1≦j≦N′) is expressed by:

Li={Lji|1≦j≦N′} [Formula 10]
Furthermore, the regression coefficients wi,1 and wi,0 of the non-pivot Xi can be arranged in a vector form:

wi=(wi,1,wi,0)T [Formula 11]
In this embodiment, the aggregation ΔSi of the Δ scores for the non-pivot Xi and the aggregation Li of the labels are used for training the regression coefficient wi.
As the method of training the regression coefficient, there are methods using maximum a posteriori probability estimation and maximum likelihood estimation. To train the regression coefficient wi through the maximum a posteriori probability estimation by using the aggregation ΔSi of the Δ scores for the non-pivot Xi and the aggregation Li of the labels, a parameter wiMAP is obtained through:

wiMAP = argmaxwi P(wi|ΔSi, Li)
= argmaxwi P(Li|ΔSi, wi)P(wi|ΔSi)/P(Li|ΔSi)
= argmaxwi P(Li|ΔSi, wi)P(wi|ΔSi)
= argmaxwi P(Li|ΔSi, wi)P(ΔSi|wi)P(wi)/P(ΔSi)
= argmaxwi P(Li|ΔSi, wi)P(ΔSi)P(wi)/P(ΔSi)
= argmaxwi P(Li|ΔSi, wi)P(wi) [Formula 12]

and this is provided as the training result. Here, Bayes' theorem (together with dropping denominators that do not depend on wi) is used for the transformations on the second to fourth lines; the independence of ΔSi and wi (that is, P(ΔSi|wi)=P(ΔSi)) is used for the transformation from the fourth to the fifth line; and the fact that P(ΔSi) is fixed, not depending on wi, is used for the transformation from the fifth to the sixth line. Moreover, argmax f(x) denotes the x that maximizes f(x). To train the regression coefficient wi through the maximum likelihood estimation, a parameter wiML is obtained through:

wiML = argmaxwi P(Li|ΔSi, wi) [Formula 13]

and this is provided as the training result.
As shown by Formulae 12 and 13, the maximum a posteriori probability estimation differs from the maximum likelihood estimation in that the regression coefficient is trained in view of the prior probability P(wi) of the regression coefficient wi. By considering this prior probability, the maximum a posteriori probability estimation is capable of training the regression coefficient more robustly than the maximum likelihood estimation, even when the number of pieces of training data is small. In particular, in this embodiment, the number of labels Lji taking 1 (that is, the number of pieces of training data Qj similar to the non-pivot Xi) is typically very small, and thus the regression coefficient may not be trained appropriately through the maximum likelihood estimation. Even in such a case, the regression coefficient can be trained appropriately through the maximum a posteriori probability estimation.
P(Li|ΔSi, wi) can be obtained by:

P(Li|ΔSi, wi) = Πj=1..N′ σ(aj,i)^Lji (1−σ(aj,i))^(1−Lji) [Formula 14]
Note, however, that this uses the fact that the label Lji takes 1 when the score sj,i between the training data Qj and the non-pivot Xi is smaller than the threshold value r and takes 0 otherwise, and that Lji depends on the Δ score ΔSj,i. Moreover, aj,i is:

aj,i = ln(P(ΔSj,i|sj,i<r)P(sj,i<r) / (P(ΔSj,i|sj,i≧r)P(sj,i≧r))) [Formula 15]
By using the logistic regression described above,
aj,i≈wi,1ΔSj,i+wi,0 [Formula 16]
can be obtained.
Assuming that P(wi) is, for example, a normal distribution with average vector 0 and variance-covariance matrix Σ0, there is a method of obtaining:

P(wi)=N(0,Σ0) [Formula 17]

For Σ0, there are, for example, a method of presetting it at an adequate value and a method of automatically determining it by the empirical Bayes method based on the training data. Moreover, an average vector other than 0 may be used, and, for example, an exponential distribution or a gamma distribution other than the normal distribution may be used as the distribution model.
At this point, the regression coefficient wiMAP or wiML obtained through the maximum a posteriori probability estimation or the maximum likelihood estimation (that is, the one that maximizes Formula 12 or 13) can be calculated by using, for example, the Newton-Raphson method. This is a method of obtaining the value wiMAP of the maximum a posteriori probability estimation or the value wiML of the maximum likelihood estimation sequentially with the following procedure.
1. An initial value wi(0) of wi is set appropriately; for example, wi(0)=0. Set τ←0.
2. wi(τ+1) is obtained as shown below, where τ is the number of iterations:
wi(τ+1)=wi(τ)−(∇∇E(wi(τ)))−1∇E(wi(τ)) [Formula 18]
Note that E(wi(τ)) is the negative logarithm of the posterior probability or of the likelihood; this is called an error function. The symbol ∇ is a differential operator vector. In the case of the maximum a posteriori probability estimation,
E(wi(τ))=−log P(Li|ΔSi,wi(τ))P(wi(τ)) [Formula 19]
and in the case of the maximum likelihood estimation,
E(wi(τ))=−log P(Li|ΔSi,wi(τ)) [Formula 20]
Moreover, ∇E(wi(τ)) and ∇∇E(wi(τ)) are the first-order differential column vector (gradient) and the second-order differential matrix (Hessian), respectively. For example, in the case of the maximum a posteriori probability estimation, when Formulae 14, 16, and 17 are employed,

∇E(wi(τ)) = Σj=1..N′ (σ(aj,i(τ))−Lji)xj + Σ0−1wi(τ) [Formula 21]

∇∇E(wi(τ)) = Σj=1..N′ σ(aj,i(τ))(1−σ(aj,i(τ)))xjxjT + Σ0−1 [Formula 22]

can be obtained, where
aj,i(τ)≈wi,1(τ)ΔSj,i+wi,0(τ) [Formula 23]
xj=(ΔSj,i,1)T [Formula 24]
3. When the difference between wi(τ+1) and wi(τ) is sufficiently small, or when τ exceeds a fixed value, the iteration ends and wi(τ+1) is provided as wiMAP or wiML. Otherwise, set τ←τ+1 and return to step 2.
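Putting Formulae 14 through 24 together, the MAP training of wi by Newton-Raphson can be sketched as follows (NumPy assumed; the toy data, the prior Σ0, and the stopping constants are illustrative choices):

```python
import numpy as np

def train_map(dS, L, Sigma0, max_iter=50, tol=1e-6):
    # dS: Δ scores ΔS_{j,i} of the N' training data; L: labels L_{ji} in {0,1}
    X = np.column_stack([dS, np.ones_like(dS)])  # rows x_j = (ΔS_{j,i}, 1)^T (Formula 24)
    w = np.zeros(2)                              # step 1: w_i^(0) = 0
    P0 = np.linalg.inv(Sigma0)                   # Σ0^-1 from the prior (Formula 17)
    for _ in range(max_iter):                    # step 2, repeated
        a = X @ w                                # a_{j,i} (Formula 23)
        y = 1.0 / (1.0 + np.exp(-a))             # σ(a_{j,i})
        grad = X.T @ (y - L) + P0 @ w            # ∇E (Formula 21)
        hess = X.T @ (X * (y * (1 - y))[:, None]) + P0  # ∇∇E (Formula 22)
        w_new = w - np.linalg.solve(hess, grad)  # Formula 18
        if np.linalg.norm(w_new - w) < tol:      # step 3: converged
            return w_new
        w = w_new
    return w

rng = np.random.default_rng(1)
dS = rng.random(200)
L = (dS < 0.2).astype(float)  # toy labels: small Δ score -> similar
w_map = train_map(dS, L, Sigma0=10.0 * np.eye(2))
```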
Next, a method of training the index vector size Zi will be described. To this end, the aforementioned operation is performed while the index vector size Zi is varied over various values (for example, 1 to M), and the wiMAP or wiML for which the error function is as small as possible, together with the Zi that achieves this, may be provided as the training result. This makes it possible to obtain the best parameter in terms of accuracy.
Alternatively, the non-pivot-specific parameters may be trained so that the sum of the error functions over the non-pivots becomes as small as possible while the index size is kept equal to or smaller than a fixed value. To this end, the wiMAP with which the sum of the error functions over the non-pivots becomes smallest while the Zi of each non-pivot is varied over various values in a range where the index size is equal to or smaller than the fixed value, together with the Zi that realizes this, may be provided as the training results (M+1≦i≦N). This makes it possible to realize, when a required value is set for the size of the supplementary information, the most excellent performance in terms of accuracy within a range that satisfies this requirement.
Moreover, in this embodiment, obtaining the labels Lji (1≦j≦N′, M+1≦i≦N) requires calculation of a total of (N−M)×N′ scores, which typically takes a great deal of time. Thus, the Δ score between each non-pivot and each of the N′ pieces of training data may be obtained, υ′ (<N′) pieces of the training data may be selected in ascending order of Δ score (where υ′ is a value predefined by a system manager or the like), and only those may be used for training. A piece of training data with a small Δ score is likely to be similar to the non-pivot, and this makes it possible to reduce the number of score calculations required for the training to (N−M)×υ′ while suppressing, as much as possible, reduction in the number of labels Lji that take 1 (that is, that indicate similarity to the non-pivot Xi). This consequently provides the effect of high-speed training.
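A sketch of this pre-selection (NumPy assumed; υ′ is written as v here):

```python
import numpy as np

def select_training_subset(delta_by_nonpivot, v):
    # keep, per non-pivot, only the v training items with the smallest Δ scores,
    # so that only (N-M)*v raw scores need to be computed for the labels
    return {i: np.argsort(ds)[:v] for i, ds in delta_by_nonpivot.items()}
```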
Moreover, for example, in a case where the pieces of enrolled data form several clusters in the feature space, parameters such as the index vector size may take similar or identical values within each cluster.
Therefore, in this embodiment, clustering may be performed on the non-pivots, and the non-pivot-specific parameters may be trained so that some or all of the parameters are common within each cluster. As the clustering method, any of the hierarchical methods such as the nearest neighbor method, the farthest neighbor method, the group average method, and the Ward method may be used. Training a common parameter for each cluster in this manner makes it possible to reduce the size of the parameters, which consequently provides the effect of further system weight reduction.
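A minimal sketch of such clustering with the Ward method (assuming SciPy, and assuming the index vectors are used as the clustering features, which is our own choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

index_vectors = np.random.rand(100, 16)           # index vectors of 100 non-pivots
Z = linkage(index_vectors, method="ward")          # Ward hierarchical clustering
labels = fcluster(Z, t=10, criterion="maxclust")   # assign 10 clusters
# one shared (Zi, wi) parameter set would then be trained per cluster label
```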
Moreover, in a case where the enrolled data is used as the training data, when a piece of enrolled data has been added, it is possible that the parameter training is not performed successfully since the index vector size of the training data is small. However, training a common parameter for each cluster as described above makes it possible to easily perform the parameter training by using the common parameter of the cluster to which the concerned enrolled data belongs.
Hereinafter, with reference to the accompanying drawings, the second embodiment will be described. A similarity search system of this embodiment is a biological body identification system which, when a user who attempts authentication (hereinafter referred to as an authenticated user) inputs biological body information, searches a database in a server terminal for similar biological body information, thereby identifies to which user enrolled in the database (hereinafter referred to as an enrolled user) the authenticated user corresponds, and performs authentication based on the result of this identification.
In this embodiment, raw data is biological body information.
This system is composed of: an enrollment terminal 100 that transmits to a server terminal a feature of biological body information obtained from the user; a server terminal 200 that saves enrollment information, generates supplementary information from the enrollment information, and performs biological body identification on a feature for authentication by using the enrollment information and the supplementary information; a client terminal 300 that transmits to the server terminal 200 a group ID and the feature for the authentication inputted by the user; and a network 400.
For example, for an information access control system or an attendance management system of a firm, it is possible to form the enrollment terminal 100 with a plurality of PCs in the firm, the server terminal 200 with one server in a data center operated by the firm, the client terminal 300 with a plurality of employees' PCs, and the network 400 with the Internet. Moreover, for an entrance and exit management system in the firm, it is possible to form the enrollment terminal 100, the server terminal 200, and the client terminal 300 within the same entrance and exit management device. A group ID 221 may be a value specific to the business place to which the user belongs, or may be set to be specific to each client terminal 300 or each base. In the former case, a possible operation is for the user to input the group ID at the time of authentication. In the latter case, the user is not required to input the group ID at the time of authentication.
The enrollment terminal 100 further has: a group ID/user name acquisition unit 103 that acquires a group ID and a user name; and a feature extraction unit 104 that extracts a feature from raw data.
The server terminal 200 does not have the feature extraction unit 202 but has a group narrowing unit 209a, and holds master data 220 for each group ID. The master data 220 has a group ID 221. The enrollment information 230 does not have the raw data 232 but has a user name 234 for each piece of enrollment information.
Possible features of the biological body information are, for example, minutiae for a fingerprint, an iris code for an iris, and a cepstrum for a voiceprint. Possible scores between two pieces of biological body information are the number or ratio of corresponding minutiae for the fingerprint, a Hamming distance for the iris, and a Mahalanobis distance for the voiceprint.
The client terminal 300 further has: a group ID acquisition unit 303 that acquires a group ID; and a feature extraction unit 304 that extracts a feature from raw data.
Hardware configuration of the enrollment terminal 100, the server terminal 200, and the client terminal 300 according to this embodiment is the same as that of the first embodiment.
The enrollment terminal 100 acquires a group ID and a user name from the user (step S101a).
The enrollment terminal 100 extracts a feature for enrollment from raw enrolled data (step S102a).
The enrollment terminal 100 transmits to the server terminal 200 the group ID, the user name, and the feature for enrollment (Step S103a).
The server terminal 200, if the master data 220 corresponding to the group ID is in the database 210, adds to that master data 220 the enrollment information 230 including an enrolled data ID 231 specific to the enrolled data, the user name 234, and a feature 233 for enrollment. If there is no such master data 220, master data 220 including the group ID 221, together with the enrollment information 230 including the enrolled data ID 231 specific to the enrolled data, the user name 234, and the feature 233 for enrollment, is newly created (step S104a).
Processing procedures and a data flow of the supplementary information generation processing according to this embodiment are the same as those of the first embodiment.
The server terminal 200 acquires the master data 220 for each group ID from the database 210 (step S301a).
The client terminal 300 acquires the group ID from the user (step S302a). The group ID may instead be a value specific to each client terminal 300 or each base; in that case, it is not acquired from the user.
The client terminal 300 extracts a feature for search from raw search data (step S303a).
The client terminal 300 transmits the group ID and the feature for search to the server terminal 200 (step S304a).
The server terminal 200 narrows the target of search to the master data corresponding to the acquired group ID (step S305a).
As described above, in this embodiment, the enrolled data is narrowed by using the group ID. This makes it possible to dramatically reduce the number of pieces of enrolled data for which the score is calculated, and consequently provides the effect of further improving speed.
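A trivial sketch of the narrowing step (the group IDs and data are placeholders):

```python
# master data 220 held per group ID 221; search only within the matching group
master_data = {
    "branch_A": ["enrollment records for branch A"],
    "branch_B": ["enrollment records for branch B"],
}

def narrowed_targets(group_id):
    return master_data.get(group_id, [])  # target of search (step S305a)

print(narrowed_targets("branch_A"))
```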
The server terminal 200 transmits, as a search result, the user name 234 corresponding to the enrolled data to the client terminal 300 (step S313a).
The client terminal 300 displays the user name 234 corresponding to the enrolled data as the search result (step S314a).
The present invention is applicable to any application that performs similarity search on unstructured data such as an image, a moving picture, music, a document, binary data, or biological body information. For example, the invention is applicable to a similar image search system, a similar moving picture search system, a similar music search system, a similar document search system, a similar file search system using fuzzy hash, an information access control system, an attendance management system, and an entrance and exit management system.