The present invention relates to a technique associated with authentication using an image.
Conventionally, as a learning method of an authentication technique using an image, there is provided a method that uses the similarity of a pair (Japanese Patent Laid-Open No. 2020-504891). Furthermore, to improve accuracy at a predetermined erroneous determination level, there is provided a learning method that improves accuracy by focusing on pairs near a determination threshold (J. Liu, H. Qin, Y. Wu, and D. Liang, “AnchorFace: Boosting TAR@FAR for Practical Face Recognition”, AAAI, vol. 36, no. 2, pp. 1711-1719, June 2022).
However, the technique described in Japanese Patent Laid-Open No. 2020-504891 has a problem that a pair is used for learning regardless of the similarity of the pair. Furthermore, the technique described in J. Liu, H. Qin, Y. Wu, and D. Liang, “AnchorFace: Boosting TAR@FAR for Practical Face Recognition”, AAAI, vol. 36, no. 2, pp. 1711-1719, June 2022 has a problem that the learning improvement effect is small for pairs whose similarity is much higher or lower than the threshold.
The present invention provides a technique for improving the authentication accuracy by learning other cases while focusing on a pair whose similarity is close to a threshold.
According to an aspect of the present invention, there is provided a learning apparatus comprising: one or more processors; and one or more memories storing executable instructions which, when executed by the one or more processors, cause the learning apparatus to function as: a first acquisition unit configured to acquire at least one pair of an image and a label; a first calculation unit configured to calculate a feature vector from the image using a feature extractor; a second calculation unit configured to calculate, as a positive pair similarity, a similarity between a first feature vector calculated by the first calculation unit from the image acquired by the first acquisition unit and a second feature vector associated with the same label as the label acquired by the first acquisition unit, and calculate, as a negative pair similarity, a similarity between the first feature vector calculated by the first calculation unit from the image acquired by the first acquisition unit and a third feature vector associated with a label different from the label acquired by the first acquisition unit; a decision unit configured to decide a similarity threshold; a third calculation unit configured to calculate a loss value with respect to the positive pair similarity lower than the threshold; a fourth calculation unit configured to calculate a loss value with respect to the negative pair similarity higher than the threshold; and a learning unit configured to learn a parameter of the feature extractor that decreases the loss value calculated by the third calculation unit and the loss value calculated by the fourth calculation unit, wherein one of the third calculation unit and the fourth calculation unit uses a loss function including a function such that an absolute value of a gradient is larger near a predetermined threshold and is non-zero but smaller even at a point away from the threshold.
Further features of the present invention will become apparent from the following description of exemplary embodiments with reference to the attached drawings.
Hereinafter, embodiments will be described in detail with reference to the attached drawings. Note, the following embodiments are not intended to limit the scope of the claimed invention. Multiple features are described in the embodiments, but limitation is not made to an invention that requires all such features, and multiple such features may be combined as appropriate. Furthermore, in the attached drawings, the same reference numerals are given to the same or similar configurations, and redundant description thereof is omitted.
First, an example of the hardware arrangement of a computer apparatus 100a applicable to a learning apparatus or an inference apparatus will be described with reference to a block diagram shown in
A CPU (Central Processing Unit) 101a executes various kinds of processes using computer programs and data stored in a ROM 102a and a RAM 103a. Thus, the CPU 101a performs operation control of the overall computer apparatus 100a, and also executes or controls various kinds of processes to be described as processes performed by the learning apparatus or the inference apparatus.
The ROM (Read Only Memory) 102a stores setting data of the computer apparatus 100a, a computer program and data associated with activation of the computer apparatus 100a, a computer program and data associated with the basic operation of the computer apparatus 100a, and the like.
The RAM (Random Access Memory) 103a includes an area used to store a computer program and data loaded from the ROM 102a or an external storage device 104a. Furthermore, the RAM 103a includes an area used to store a computer program and data externally received via a communication interface 107a. The RAM 103a also includes a work area used by the CPU 101a to execute various kinds of processes. As described above, the RAM 103a can provide various kinds of areas, as appropriate.
The external storage device 104a is a mass information storage device such as a hard disk drive device. The external storage device 104a stores an OS (Operating System), and computer programs and data for causing the CPU 101a to execute or control various kinds of processes to be described as processes performed by the learning apparatus or the inference apparatus. The computer programs and data stored in the external storage device 104a are loaded into the RAM 103a, as appropriate, under the control of the CPU 101a, and processed by the CPU 101a.
Note that the external storage device 104a may include a Flexible Disk (FD), an optical disk such as a Compact Disk (CD), a magnetic or optical card, an IC card, a memory card, or the like that is detachable from the computer apparatus 100a.
An input device 109a is connected to an input device interface 105a. The input device 109a is a user interface such as a keyboard, a mouse, or a touch panel screen, and can input various kinds of instructions to the CPU 101a when it is operated by the user.
A monitor 110a is connected to an output device interface 106a. The monitor 110a includes a liquid crystal screen or a touch panel screen, and displays a processing result of the CPU 101a by an image or characters. Note that in addition to or instead of the monitor 110a, a projection apparatus such as a projector may be provided.
The communication interface 107a is an interface used to perform data communication with the outside, and is connected to a network line 111a such as a LAN or the Internet in
The captured image (the image of each frame of the moving image or the periodically or nonperiodically captured still image) captured by the network camera 112a is received by the external storage device 104a or the RAM 103a via the network line 111a and the communication interface 107a.
The CPU 101a, the ROM 102a, the RAM 103a, the external storage device 104a, the input device interface 105a, the output device interface 106a, and the communication interface 107a are all connected to a system bus 108a.
An example of the functional arrangement of a learning apparatus 100b according to this embodiment will be described with reference to a block diagram shown in
The learning apparatus according to this embodiment learns a face authentication task for determining whether the faces included in two images belong to the same person. More specifically, learning is performed so that feature vectors are calculated from the respective images and the similarity between the two feature vectors becomes high in a case where the persons included in the respective images are identical. Note that in this embodiment, the cosine similarity (to be referred to as the cos similarity hereinafter) between the vectors is used as the similarity. However, the Euclidean distance or the like may be used as the similarity, and the definition of the similarity is not limited to a specific form.
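As a concrete illustration of the similarity computation (a minimal sketch; the function name and the use of NumPy are choices made for this example, not part of the embodiment):

```python
import numpy as np

def cos_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine similarity between two feature vectors; 1.0 means identical direction."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```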
An acquisition unit 101b acquires one or more pairs (learning data) each including an image and the label (person label) of a person included in the image. The one or more pairs (pairs of images and person labels) obtained by the acquisition unit 101b will be referred to as a mini batch hereinafter. A pair of an image and a person label will sometimes be referred to as a sample hereinafter. The image includes the face image of the person. The person label is information representing the ID of the person, and the same label value indicates the same person. The pair of the image and the person label is stored in the external storage device 104a or the like. Note that the person label may be expressed by a folder structure: images of the same person may collectively be stored in one folder, and images may be determined to be of the same person because they are stored in the same folder. The method of giving the person label is not limited to these.
A calculation unit 102b calculates, from an image, a feature vector to be used for authentication. To calculate a feature vector, a neural network is used as a feature extractor. For example, a Convolutional Neural Network (CNN), a kind of neural network, is used as the feature extractor. The CNN extracts abstract information from an input image by repeatedly applying processing including convolutional processing, activation processing, and pooling processing to the input image. A processing unit of the convolutional processing, activation processing, and pooling processing is often called a layer. Many methods are known for the activation processing; for example, a method called a Rectified Linear Unit (ReLU) may be used. Many methods are also known for the pooling processing; for example, a method called Max pooling may be used. As the structure of the CNN, for example, ResNet introduced by a non-patent literature (K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016) and the like may be used. Alternatively, a neural network known as a Vision Transformer (ViT) and described in a non-patent literature (Alexey Dosovitskiy, et al. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021) may be used. Note that the structure of the neural network is not limited to these. The calculation unit 102b holds information such as the weights and the structure of the neural network in the RAM 103a or the like.
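For illustration only, such a feature extractor might be set up as follows (a minimal PyTorch sketch; the ResNet-18 backbone and the 512-dimensional output are assumptions made for this example, not requirements of the embodiment):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FeatureExtractor(nn.Module):
    """CNN backbone mapping a face image to an L2-normalized feature vector."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Replace the classification head with a projection to the feature dimension.
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)
        # Unit-length vectors, so a dot product directly gives the cos similarity.
        return nn.functional.normalize(feat, dim=1)
```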
A memory bank unit 103b holds a feature vector in association with a person label. The memory bank unit 103b holds m feature vectors for each person label where m represents a predetermined constant. The memory bank unit 103b holds the person labels and the feature vectors in the RAM 103a or the like.
A calculation unit 104b acquires the feature vector acquired by the calculation unit 102b from the image acquired by the acquisition unit 101b. Then, the calculation unit 104b obtains, as a “positive pair similarity”, the similarity of a pair (positive pair) of the acquired feature vector and the feature vector of the same person label as that of the image among the feature vectors held in the memory bank unit 103b. Furthermore, the calculation unit 104b obtains, as a “negative pair similarity”, the similarity of a pair (negative pair) of the acquired feature vector and the feature vector of a person label different from that of the image among the feature vectors held in the memory bank unit 103b.
A threshold decision unit 105b acquires a threshold used for loss calculation (to be described later). For example, the threshold decision unit 105b may acquire a threshold held in advance in the RAM 103a. Alternatively, for example, the threshold decision unit 105b may acquire, as a threshold, the similarity at the boundary of the top N% of the negative pair similarities obtained by the calculation unit 104b, where N represents a predetermined constant. Furthermore, for example, the threshold decision unit 105b may acquire a threshold corresponding to a false acceptance rate (for example, 0.001%) assumed at the time of the operation of face authentication. The threshold acquisition method is not limited to these.
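For example, deciding the threshold from the top N% of negative pair similarities could look like the following (a sketch; the percentile formulation is an assumption consistent with the description above):

```python
import numpy as np

def decide_threshold(negative_sims: np.ndarray, top_percent: float = 1.0) -> float:
    """Similarity value below which all but the top `top_percent`% of negative pairs fall."""
    return float(np.percentile(negative_sims, 100.0 - top_percent))
```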
A calculation unit 106b calculates a loss with respect to the positive pair similarity lower than the threshold obtained by the threshold decision unit 105b. A calculation unit 107b calculates a loss with respect to the negative pair similarity higher than the threshold obtained by the threshold decision unit 105b.
A calculation unit 108b learns, for each person label, representative vectors each having the same number of dimensions as the feature vector calculated by the calculation unit 102b. Then, the calculation unit 108b calculates a loss such that the feature vector is close to the representative vector of the same person label and far from the representative vectors of different person labels. The calculation unit 108b calculates the loss using, for example, a loss function such as ArcFace described in a non-patent literature (J. Deng, J. Guo, N. Xue, and S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 2019).
A learning unit 109b updates (learns) the neural network used by the calculation unit 102b so that the loss is minimized. As the loss, the loss calculated by each of the above-described calculation units 106b, 107b, and 108b is used. To update (learn) the neural network, general back propagation or the like is used.
An updating unit 110b updates the feature vectors held in the memory bank unit 103b using the feature vector obtained from the image acquired by the acquisition unit 101b. The memory bank unit 103b holds up to m feature vectors for each person label. If the number of feature vectors held in the memory bank unit 103b in correspondence with the person label of the image is smaller than m, the new feature vector is stored in the memory bank unit 103b. If m feature vectors are held in the memory bank unit 103b in correspondence with the person label of the image, the oldest registered feature vector is deleted and the new feature vector is stored.
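The memory bank behavior described above (at most m feature vectors per person label, oldest registration evicted first) might be sketched as follows (an illustrative implementation, not the claimed one):

```python
from collections import defaultdict, deque
import numpy as np

class MemoryBank:
    """Holds up to m feature vectors per person label, evicting the oldest first."""
    def __init__(self, m: int):
        # A bounded deque drops its oldest entry automatically when full.
        self.bank = defaultdict(lambda: deque(maxlen=m))

    def update(self, label: int, feature: np.ndarray) -> None:
        self.bank[label].append(feature)

    def vectors(self, label: int) -> list:
        return list(self.bank[label])
```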
Learning loop processing according to this embodiment will be described with reference to a flowchart shown in
Step S202a is the start of a mini batch loop. The number of samples forming a mini batch is determined in advance, and the learning data is divided into mini batches by that number of samples. Each mini batch is assigned a number starting from 1. To refer to the number using a variable j, the CPU 101a first initializes the value of the variable j to 1. If the value of the variable j is equal to or smaller than the total number of mini batches, the process advances to step S203a; otherwise, the process exits from the loop and advances to step S206a.
In step S203a, the jth mini batch among the learning data is acquired. More specifically, the acquisition unit 101b acquires a pair of a learning image and a person label. Note that the acquisition unit 101b may perform image processing known as Data Augmentation for the acquired image. For example, the acquisition unit 101b may perform processing such as change of the tint of the image or addition of noise to pixel values.
In step S204a, learning is performed using the mini batch obtained in step S203a. Details of the processing in step S204a will be described later with reference to a flowchart shown in
Step S205a is the end of the mini batch loop, in which the CPU 101a adds 1 to the value of the variable j, and the process advances to step S202a. Step S206a is the end of the epoch loop, in which the CPU 101a adds 1 to the value of the variable i, and the process returns to step S201a.
Details of the learning step processing performed in step S204a will be described with reference to the flowchart shown in
In step S202b, the calculation unit 104b obtains the similarity (positive pair similarity) of a pair (positive pair) of the acquired feature vector and the feature vector, among the feature vectors held in the memory bank unit 103b, of the same person label as that of the image included in the mini batch acquired by the acquisition unit 101b. Furthermore, the calculation unit 104b obtains the similarity (negative pair similarity) of a pair (negative pair) of the acquired feature vector and the feature vector, among the feature vectors held in the memory bank unit 103b, of a person label different from that of the image.
More specifically, for each sample of the mini batch, the calculation unit 104b acquires the feature vector of the same person label as that included in the sample from the memory bank unit 103b, and obtains the similarity between the feature vectors as the positive pair similarity. Furthermore, the calculation unit 104b also obtains the feature vector of a person label different from that included in the sample, and obtains the similarity between the feature vectors as the negative pair similarity.
In step S203b, the threshold decision unit 105b acquires a threshold to be used for loss calculation (to be described later). In this embodiment, the threshold decision unit 105b acquires a predetermined threshold th.
In step S204b, the calculation unit 106b calculates the loss of the positive pair. More specifically, the calculation unit 106b calculates the loss of the positive pair using equations (1) and (2) below.
Equation (1) is an equation for obtaining the loss (positive loss) of one positive pair, and equation (2) is an equation for summing the losses of all the positive pairs, where lp represents the loss of a given positive pair, simp represents the similarity of the positive pair, th represents the threshold acquired in step S203b, r represents a preset constant such that as r is smaller, the loss with respect to a pair away from the threshold is larger, and N represents the index of the radical root, for which a fixed value such as N=2 is used. If the similarity of the positive pair is lower than the threshold, the upper radical root expression in equation (1) is applied; otherwise, the loss of the positive pair is zero.
Furthermore, Lp represents the loss of all the positive pairs. Np represents the number of positive pairs each having the similarity lower than the threshold. That is, Np represents the number of positive pairs to which the upper expression in equation (1) is applied.
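The original equation images are not reproduced here; a form consistent with the above description (a radical root of index N applied only below the threshold, with the averaging over Np pairs being our assumption) would be:

```latex
l_p =
\begin{cases}
  \sqrt[N]{\dfrac{th - sim_p}{r}} & (sim_p < th) \\
  0 & (\text{otherwise})
\end{cases}
\quad (1)
\qquad
L_p = \frac{1}{N_p} \sum l_p \quad (2)
```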
In step S205b, the calculation unit 107b calculates the loss of the negative pair. More specifically, the calculation unit 107b calculates the loss of the negative pair using equations (3) and (4) below.
Equation (3) is an equation for obtaining the loss (negative loss) of one negative pair, and equation (4) is an equation for summing the losses of all the negative pairs, where ln represents the loss of a given negative pair, simn represents the similarity of the negative pair, th represents the threshold acquired in step S203b, r represents the preset constant such that as r is smaller, the loss with respect to the pair away from the threshold is larger, and N represents the index of the radical root, for which a fixed value such as N=2 is used. If the similarity of the negative pair is higher than the threshold, the upper radical root expression in equation (3) is applied; otherwise, the loss of the negative pair is zero. Ln represents the loss of all the negative pairs. Nn represents the number of negative pairs each having the similarity higher than the threshold. That is, Nn represents the number of negative pairs to which the upper expression in equation (3) is applied.
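Analogously to equations (1) and (2), a plausible reconstruction of equations (3) and (4) from the description (again with the averaging over Nn pairs assumed) is:

```latex
l_n =
\begin{cases}
  \sqrt[N]{\dfrac{sim_n - th}{r}} & (sim_n > th) \\
  0 & (\text{otherwise})
\end{cases}
\quad (3)
\qquad
L_n = \frac{1}{N_n} \sum l_n \quad (4)
```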
In step S206b, the calculation unit 108b learns, for each person label, the representative vectors each having dimensions the number of which is equal to that of the feature vector calculated by the calculation unit 102b, thereby obtaining a loss such as ArcFace described above.
In step S207b, the learning unit 109b updates (learns) the neural network used by the calculation unit 102b by back propagation so as to minimize a loss L given by:
L = Lcls + λ1×Lp + λ2×Ln (5)
where Lcls represents the loss (representative vector loss) obtained in step S206b, and λ1 and λ2 represent weighting parameters and are constants defined in advance.
Note that in back propagation, the update amount of each weight of the neural network is calculated from the value of a gradient as a first-order derivative of the loss function. At this time, since simp=th in equation (1) and simn=th in equation (3) are non-differentiable points, if the similarity takes this value, learning is advanced by setting the value of the gradient to 0 (the same applies to non-differentiable points of the equation of another loss function). In step S208b, the updating unit 110b updates the feature vectors held in the memory bank unit 103b using the feature vector of the mini batch.
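Under the reconstructed forms of equations (1) to (5) above, the loss computation might be sketched as follows (illustrative only; computing the N-th root as a power, averaging over the contributing pairs, and excluding the non-differentiable point via strict inequalities are assumptions consistent with the text):

```python
import torch

def total_loss(sim_p, sim_n, th, r=1.0, N=2, lam1=1.0, lam2=1.0, loss_cls=0.0):
    """Positive/negative pair losses per equations (1)-(4), combined per equation (5)."""
    # Positive pairs: penalize similarities strictly below the threshold.
    pos_mask = sim_p < th
    lp = ((th - sim_p[pos_mask]) / r) ** (1.0 / N)
    Lp = lp.mean() if pos_mask.any() else sim_p.new_zeros(())
    # Negative pairs: penalize similarities strictly above the threshold.
    neg_mask = sim_n > th
    ln = ((sim_n[neg_mask] - th) / r) ** (1.0 / N)
    Ln = ln.mean() if neg_mask.any() else sim_n.new_zeros(())
    # Pairs at exactly sim == th are excluded, matching the gradient of 0 at that point.
    return loss_cls + lam1 * Lp + lam2 * Ln
```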
The positive loss and the negative loss described in equations (1) and (3) will be explained in detail with reference to
Note that the radical root is used in the above equations. However, another function may be used. For example, a log function may be used. For example, if a log function is used for the loss function of each of equations (1) and (3), equations (6) and (7) below are obtained.
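A plausible log-based form of equations (6) and (7), reconstructed from the description (the exact constants are assumptions), is:

```latex
l_p = \log\!\left(\frac{th - sim_p}{r} + 1\right) \;\; (sim_p < th) \quad (6)
\qquad
l_n = \log\!\left(\frac{sim_n - th}{r} + 1\right) \;\; (sim_n > th) \quad (7)
```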
Unlike equations (1) and (3), 1 is added to prevent the argument of the log function from becoming zero. Alternatively, a negative inverse proportional function may be used, as given by:
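The original equations (8) and (9) are not reproduced; one plausible shape that is zero at the threshold, grows with distance from it, and has a negative-exponent derivative as discussed below would be:

```latex
l_p = 1 - \left(\frac{th - sim_p}{r} + 1\right)^{-1} \;\; (sim_p < th) \quad (8)
\qquad
l_n = 1 - \left(\frac{sim_n - th}{r} + 1\right)^{-1} \;\; (sim_n > th) \quad (9)
```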
Alternatively, an arc tangent function may be used, as given by:
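A plausible arc tangent form of equations (10) and (11), consistent with the note below that it is differentiable and non-zero over the entire similarity range (the π/2 offset is our assumption to keep the loss positive), is:

```latex
l_p = \tan^{-1}\!\left(\frac{th - sim_p}{r}\right) + \frac{\pi}{2} \quad (10)
\qquad
l_n = \tan^{-1}\!\left(\frac{sim_n - th}{r}\right) + \frac{\pi}{2} \quad (11)
```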
Note that a loss function for which an arc tangent function is used has no non-differentiable points, and non-zero values are obtained over the entire range of the similarity. The above examples of the function all have the characteristic that “the absolute value of the gradient is large (larger) near (in proximity to) the predetermined threshold and is small (smaller) even at a point away from the threshold”. This is because the gradient (the derivative of the loss) is a power function with a negative exponent. A power function is a function of the form x^a, where a is the exponent. A power function with a negative exponent remains a power function with a negative exponent after differentiation, and thus has the effect of abruptly suppressing the output with respect to an increase in the input. Therefore, although the gradient is large near the threshold, the gradient abruptly decreases as the similarity moves farther away from the threshold. For example, if the loss function is a log function log(x), the first-order derivative is x^(-1) and the second-order derivative is -x^(-2). As the input x increases, the absolute value |x^(-1)| of the first-order derivative decreases; thus, as x increases, the gradient of the log function also decreases. In addition, since the absolute value |-x^(-2)| of the second-order derivative also decreases, the gradient of the log function decreases “abruptly” as x increases. For the radical root and the inverse proportional function in the above equations as well, the first-order derivative is a power function with a negative exponent, and thus the same trend holds. Since the first-order derivative of the arc tangent function tan^(-1)(x) is (1+x^2)^(-1), which again has a negative exponent, the same trend holds. Note that the first-order derivative of the loss function using the arc tangent function of equation (11) is known as the Cauchy distribution. The function whose first-order derivative is a power function with a negative exponent is not limited to the radical root, the inverse proportional function, the log function, the arc tangent function, and the like.
Furthermore, consider the sigmoid function used in J. Liu, H. Qin, Y. Wu, and D. Liang, “AnchorFace: Boosting TAR@FAR for Practical Face Recognition”, AAAI, vol. 36, no. 2, pp. 1711-1719, June 2022 for comparison: the gradient of the sigmoid function is formed by an exponential function, and therefore decreases more abruptly than that of a power function and then effectively disappears. Thus, by using a power function with a negative exponent, the characteristic that “the absolute value of the gradient is small but non-zero even at a point away from the threshold” is obtained.
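For reference, a sigmoid-based loss of the kind compared against here would take a form such as the following (a reconstruction; the exact expression in the cited work may differ):

```latex
l = \frac{1}{1 + \exp\!\left(-\dfrac{sim - th}{t}\right)} \quad (12)
```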
where t represents a parameter for controlling the inclination of the sigmoid function.
Note that “a function such that the absolute value of the gradient is large near the predetermined threshold and is non-zero but small even at a point away from the threshold” may be combined with the sigmoid function. For example, a log function or the like may be used in a predetermined domain: the abrupt loss of the sigmoid function may be used near the threshold, and the log function may be used in a portion away from the threshold. Alternatively, the sum of the log function or the like and the sigmoid function may be set as a loss function; in this case, the log function or the like is used for a given term of a polynomial. The form in which “a function such that the absolute value of the gradient is large near the predetermined threshold and is non-zero but small even at a point away from the threshold” is included in the loss function is not limited to these.
In the above description, the same function is used for both the positive loss and the negative loss but different functions may be used for them. Alternatively, a function having no property that “the absolute value of the gradient decreases” may be used for one of the positive loss and the negative loss. That is, all the similarity pairs may be used for learning without decreasing the gradient as the similarity is farther away from the threshold. This makes it possible to concentratedly perform learning near the threshold with respect to one of the positive loss and the negative loss but perform learning over the entire area without limitation to an area near the threshold with respect to the other of the positive loss and the negative loss.
In the above example, the threshold decision unit 105b acquires only one threshold, and the same threshold is used for both the positive loss and the negative loss. However, different thresholds may be used. More specifically, the thresholds are given by equations (13) and (14) below. An example of using a log function will be described but a function such as a radical root may be used.
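A plausible reconstruction of equations (13) and (14), applying the log-based loss of equations (6) and (7) with separate thresholds, is:

```latex
l_p = \log\!\left(\frac{th_p - sim_p}{r} + 1\right) \;\; (sim_p < th_p) \quad (13)
\qquad
l_n = \log\!\left(\frac{sim_n - th_n}{r} + 1\right) \;\; (sim_n > th_n) \quad (14)
```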
where thp represents the positive threshold and thn represents the negative threshold.
In addition, a small loss may be applied to a positive pair having a similarity higher than the positive threshold 304d, as given by the equation below.
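The original equation (presumably equation (15), given the numbering in the surrounding text) is not reproduced; one form consistent with the description of u below, where the exponential decay and the small scale ε are purely our assumptions, would be:

```latex
l_p =
\begin{cases}
  \log\!\left(\dfrac{th_p - sim_p}{r} + 1\right) & (sim_p < th_p) \\[1mm]
  \epsilon \exp\bigl(-u\,(sim_p - th_p)\bigr) & (sim_p \ge th_p)
\end{cases}
\quad (15)
```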
where u represents a parameter that adjusts the degree of a loss applied to a pair having the similarity higher than the positive threshold. As u is increased, a loss is strongly applied to positive pairs near the threshold, and as u is decreased, a loss is widely applied to positive pairs away from the threshold.
Two or more positive thresholds or negative thresholds may be provided. For example, when two negative thresholds are provided, the negative loss is given by:
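The original equation (presumably equation (16)) is not reproduced; a form consistent with the description below (assuming th1n > th2n, with the loss anchored at the operating threshold th2n) would be:

```latex
l_n =
\begin{cases}
  \log\!\left(\dfrac{sim_n - th2_n}{r} + 1\right) & (th2_n < sim_n \le th1_n) \\
  0 & (\text{otherwise})
\end{cases}
\quad (16)
```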
where th1n represents the first negative threshold and th2n represents the second negative threshold. When the similarity exceeds the first negative threshold, the loss becomes zero. In the portion between the first negative threshold and the second negative threshold, the same loss as the above-described negative loss is applied. This prevents a negative pair having a similarity higher than the predetermined similarity from being used for learning. Therefore, it is possible to ignore the high similarity that arises when an actually positive pair is processed as a negative pair because of an error in a person label or the like. Similarly, it is possible to ignore a low similarity caused by noise by setting the positive loss to zero in a case where the similarity is lower than a predetermined value. Note that when learning is performed as described above, the face authentication system is operated using the negative threshold th2n. Note that an example in which log functions are used for equations (13) to (16) has been explained, but a radical root and the like may be used.
In the above description, the representative vector loss is also used as a loss, but it need not be used. Alternatively, the representative vector loss may be combined with another loss for face authentication. For example, a triplet loss described in a non-patent literature (Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015) may be used. At this time, the calculation unit 108b may be removed and the arrangement changed to, for example, calculate a triplet loss instead. Alternatively, other losses may be combined, and the combination of losses is not limited to these.
Furthermore, in the above description, a pair is formed from the feature vector obtained from the mini batch and a feature vector in the memory bank. However, a pair may be formed only from the mini batch. That is, a positive pair is formed from the feature vectors of the same person label in the mini batch, and a negative pair is formed from the feature vectors of different person labels. At this time, the arrangement may be changed by excluding the memory bank unit 103b and the updating unit 110b. Furthermore, in the above description, the acquisition unit 101b acquires one or more pairs of images and person labels. However, the acquisition unit 101b may acquire “a pair of images” and “the label of the pair”, where “the label of the pair” indicates whether the pair is a positive pair or a negative pair. At this time, the calculation unit 104b need not create pairs, and the arrangement is changed so that the calculation unit 104b obtains the similarity of each acquired pair.
In the above description, an old feature vector is also used for learning as long as it is not deleted from the memory bank unit 103b. However, an “effective time” may be set for each feature vector, and the feature vector exceeding the effective time may not be used for learning. More specifically, the memory bank unit 103b holds an effective time (integer value) for each feature vector. When the updating unit 110b registers a feature vector in step S208b, the effective time (predetermined positive integer value) is set for the feature vector, and 1 is subtracted from the effective time of the feature vector stored in the memory bank unit 103b with a lapse of time. Then, when the pair similarity is calculated in step S202b, a pair is generated by only feature vectors each having the effective time of 1 or more among the feature vectors stored in the memory bank unit 103b. This can prevent the old feature vector registered in the memory bank unit 103b from being used for learning.
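The “effective time” variation described above might be sketched as follows (illustrative; the class and method names are assumptions):

```python
from collections import defaultdict
import numpy as np

class TTLMemoryBank:
    """Holds (feature, effective_time) entries; expired entries are not used for pairs."""
    def __init__(self, m: int, ttl: int):
        self.m, self.ttl = m, ttl
        self.bank = defaultdict(list)  # label -> list of [feature, remaining_time]

    def update(self, label: int, feature: np.ndarray) -> None:
        entries = self.bank[label]
        if len(entries) >= self.m:
            entries.pop(0)                   # evict the oldest registration
        entries.append([feature, self.ttl])  # a new entry gets the full effective time

    def step(self) -> None:
        """Call once per learning step to age all stored features."""
        for entries in self.bank.values():
            for entry in entries:
                entry[1] -= 1

    def vectors(self, label: int) -> list:
        # Only features whose effective time is 1 or more participate in pairs.
        return [feat for feat, t in self.bank[label] if t >= 1]
```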
Next, an example of the functional arrangement of an inference apparatus 600a that determines, using the neural network learned by the learning apparatus 100b shown in
An acquisition unit 601a acquires a registered image and a person label. The person label is information indicating a person. The registered image and the person label may be acquired from the external storage device 104a. Alternatively, a captured image transmitted from the network camera 112a may be acquired as a registered image and “the person label of a person included in the captured image” input by operating the input device 109a by the user may be acquired.
An acquisition unit 602a acquires a collation image. The acquisition unit 602a may acquire, as a collation image, an image stored in the external storage device 104a, or may acquire, as a collation image, a captured image transmitted from the network camera 112a.
Using “the neural network used by the calculation unit 102b” learned by the learning apparatus 100b, a calculation unit 603a calculates the feature vector of the registered image acquired by the acquisition unit 601a and the feature vector of the collation image acquired by the acquisition unit 602a. A database unit 604a holds the feature vector obtained by the calculation unit 603a with respect to the registered image acquired by the acquisition unit 601a and the corresponding person label in association with each other.
A collation unit 605a collates the feature vectors. More specifically, the collation unit 605a calculates the cos similarity between the feature vector obtained by the calculation unit 603a with respect to the collation image acquired by the acquisition unit 602a and each feature vector registered in the database unit 604a, and outputs each calculated cos similarity.
Note that the collation unit 605a may output information representing whether each obtained cos similarity exceeds a predetermined threshold. A person corresponding to a registered feature vector whose cos similarity to the feature vector of the collation image is equal to or higher than the threshold can be determined to be the same person as the person in the collation image. On the other hand, a person corresponding to a registered feature vector whose cos similarity to the feature vector of the collation image is lower than the threshold can be determined not to be the same person as the person in the collation image. Therefore, the collation unit 605a may perform this determination processing, and output the result of the determination processing.
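In outline, this collation flow might look as follows (a sketch; the data structures and names are assumptions):

```python
import numpy as np

def collate(query_feat: np.ndarray, database: dict, threshold: float) -> list:
    """Return (person_label, similarity) for registered features at or above threshold."""
    matches = []
    for person_label, reg_feat in database.items():
        sim = float(np.dot(query_feat, reg_feat) /
                    (np.linalg.norm(query_feat) * np.linalg.norm(reg_feat)))
        if sim >= threshold:
            matches.append((person_label, sim))
    return matches
```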
The collation unit 605a may also output the person label held in the database unit 604a in association with a registered feature vector whose cos similarity to the feature vector of the collation image is equal to or higher than the threshold.
In the above description, the acquisition unit 601a acquires the person label. However, the acquisition unit 601a may acquire only the registered image. In this case, the database unit 604a holds only the feature vector. With this arrangement, the collation unit 605a outputs information representing whether the registered image of the same person as the person in the collation image is registered in the database unit 604a, without outputting the person label of the collated feature vector.
Furthermore, the example of registering persons in advance has been explained above, but the arrangement may be configured to collate two images, that is, a registered image and a collation image. In this case, the database unit 604a is unnecessary. That is, the acquisition units 601a and 602a acquire a registered image and a collation image, respectively, and the calculation unit 603a obtains the feature vector of the registered image and that of the collation image. The collation unit 605a obtains the similarity between the feature vector of the registered image and that of the collation image, and outputs information representing whether the similarity exceeds a predetermined threshold. Alternatively, the collation unit 605a may be configured to output the obtained similarity.
As described above, according to this embodiment, since learning of pairs each having a similarity close to a threshold is concentratedly performed, it is possible to improve the accuracy of authentication when an operation is performed near the threshold. In particular, by using a loss function including “a function such that the absolute value of the gradient is large near the predetermined threshold and is non-zero but small even at a point away from the threshold”, it is possible to gradually learn cases away from the threshold while concentratedly learning cases near the threshold.
Furthermore, by using different thresholds for the positive loss and the negative loss, when the authentication system is operated using the negative threshold, the possibility that the similarity of a positive pair erroneously falls below the threshold decreases, and it is possible to prevent non-authentication.
In addition, by applying a small loss to a positive pair having a similarity higher than the positive threshold, it is possible to prioritize learning of positive pairs each having a similarity close to or lower than the threshold while maintaining the similarities of the positive pairs high.
This embodiment will describe the difference from the first embodiment; it is assumed to be the same as the first embodiment unless specifically stated otherwise. In the first embodiment, learning data is a pair of an image and a person label. This embodiment will describe a case where an attribute is used as learning data in addition to an image and a person label. In this embodiment, only pairs of a predetermined attribute combination are formed using the attribute, and the predetermined attribute combination is concentratedly learned. In addition, by increasing the gradient of the loss for a specific attribute, the loss is strengthened and learning of the specific attribute is similarly concentratedly performed.
A learning apparatus that performs learning by forming pairs of the same race, using “race” as an attribute, will be described first. Since persons of the same race are similar to each other and difficult to identify, this makes it possible to concentratedly learn such difficult pairs. An example of the functional arrangement of a learning apparatus 400a according to this embodiment will be described with reference to a block diagram shown in
An acquisition unit 401a acquires one or more groups (learning data) of images, person labels, and attributes. One or more “groups of images, person labels, and attributes” obtained by the acquisition unit 401a will be referred to as a mini batch hereinafter. “A group of an image, a person label, and an attribute” will sometimes be referred to as a sample hereinafter. The attribute is metadata of an image, and holds a “race” of a face in the image in this embodiment. The race includes values of “East Asia”, “South East Asia”, “Caucasian”, “Black”, and the like. “A group of an image, a person label, and an attribute” is stored in an external storage device 104a or the like.
A memory bank unit 403a holds a feature vector in association with a person label and an attribute. The memory bank unit 403a holds m (the upper limit number) feature vectors for each person label, where m represents a predetermined constant. Note that m feature vectors may instead be held for each pair of a person label and an attribute. This makes it possible to hold feature vectors for each person label without bias toward any attribute. Note that at this time, when the updating unit 110b additionally stores a feature vector, if the number of feature vectors for the pair of the person label and the attribute would exceed m, the oldest feature vector is deleted before the new feature vector is stored. The memory bank unit 403a holds the person labels, the attributes, and the feature vectors in a RAM 103a or the like.
A filter unit 410a specifies the feature vector held in the memory bank unit 403a based on the attribute for each image acquired by the acquisition unit 401a. More specifically, for each image obtained by the acquisition unit 401a, the filter unit 410a acquires the feature vector of the same attribute as that corresponding to the image from the memory bank unit 403a. For example, if the attribute of the image acquired by the acquisition unit 401a is “Caucasian”, the filter unit 410a acquires the feature vector corresponding to the attribute “Caucasian” from the memory bank unit 403a.
For each image acquired by the acquisition unit 401a, a calculation unit 404a acquires the feature vector acquired by the filter unit 410a for the image. Then, for the acquired feature vector, the calculation unit 404a obtains the similarity to the feature vector of the same person label as a positive similarity, and obtains the similarity to the feature vector of a different person label as a negative similarity.
In this embodiment as well, similar to the first embodiment, processing according to the flowchart shown in
Step S401b is the start of a loop of a learning sample. Each sample of the mini batch is assigned with a number from 1. To refer to the number using a variable k, a CPU 101a first initializes the value of the variable k to 1. If the value of the variable k is equal to or smaller than the number of samples, the process advances to step S402b. If the value of the variable k exceeds the number of samples, the process exits from the loop, and ends.
In step S402b, the acquisition unit 401a acquires the kth sample of the mini batch. In step S403b, the filter unit 410a specifies an attribute to form a pair with the attribute of the sample acquired in step S402b. More specifically, the filter unit 410a specifies the attribute of the sample. That is, the filter unit 410a specifies “Caucasian” in a case where the attribute of the sample is “Caucasian”.
In step S404b, the filter unit 410a acquires, from the memory bank unit 403a, the feature vector corresponding to the attribute acquired in step S403b. For example, if “Caucasian” is specified in step S403b, the filter unit 410a acquires the feature vector corresponding to “Caucasian” from the memory bank unit 403a.
In step S405b, the calculation unit 404a obtains the positive pair similarity and the negative pair similarity using the feature vector of the sample and the feature vectors acquired in step S404b. More specifically, the calculation unit 404a obtains, as the positive pair similarity, the similarity between the feature vector of the sample and a feature vector of the same person label. Furthermore, the calculation unit 404a obtains, as the negative pair similarity, the similarity between the feature vector of the sample and a feature vector of a different person label. Step S406b is the end of the loop of the sample, in which the CPU 101a adds 1 to the variable k, and the process returns to step S401b.
In the above description, only “race” is used as person information. However, another attribute such as “sex” or “age” may be used, and “birthplace” may be used instead of “race”. Note that a combination of these attributes may be used as an attribute. That is, if the race and the sex are used, only a pair of the same race and sex is formed. Furthermore, with respect to the age or the like, category values obtained by dividing the age into 10-year periods may be used. The person information is not limited to these.
In the above description, a pair of the same attribute of the person information is formed. However, a pair of similar pieces of person information may be formed. For example, in a case where the attribute of the race includes “East Asia”, “South East Asia”, “Caucasian”, and “Black”, “East Asia” and “South East Asia” are similar races, and thus a pair of “East Asia” and “South East Asia” may be formed. Furthermore, in the case of the age or the like, a pair may be formed by considering ages within a predetermined range such as ±5 years as similar ages. In addition, since young people change greatly with age, the similarity determination range may be narrowed to, for example, ±3 years under the age of 20. More specifically, the filter unit 410a is configured to acquire a feature vector whose attribute is similar to that of the sample from the memory bank unit 403a. At this time, the filter unit 410a is configured to obtain all similar attributes in step S403b of
In the above description, the filter unit 410a supplies the feature vector to the calculation unit 404a. However, the filter unit 410a may supply, to the calculation unit 404a, “identifier information for specifying a feature vector meeting the condition”. Then, the calculation unit 404a may select a pair similarity based on the identifier information. That is, the calculation unit 404a may calculate the similarity for each of all the pairs, and then acquire, using the identifier information, the pair similarities to be used. This arrangement has the merit that a pair selection method based on the similarity can be used at the same time. That is, a negative pair indicating a similarity higher than a predetermined similarity or a positive pair indicating a similarity lower than a predetermined similarity can be a difficult pair, and if such difficult pairs are also selected, it is possible to concentratedly learn them. In a case where such a selection method is used at the same time, it is more efficient to select the pairs of the same attribute after obtaining the similarities of all the pairs. The method of transmitting the feature vector specified by the filter unit 410a is not limited to these.
A method of concentratedly learning a specific attribute by strengthening the loss of the specific attribute will be described next. Even in a case where learning is performed by focusing on the same attribute, a weak point/strong point occurs for each value of the attribute. For example, the similarity of the negative pair is relatively high for the weak attribute. By changing r of equation (3) above in accordance with the attribute, the loss is strengthened/weakened in accordance with the attribute.
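Concretely, using equation (3) with an attribute-dependent constant would give a form such as the following (the subscript notation r_a is ours; a smaller r_a strengthens the negative loss for attribute a):

```latex
l_n = \sqrt[N]{\frac{sim_n - th}{r_a}} \;\; (sim_n > th)
```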
Note that, as described above, r is changed in accordance with the attribute of the sample acquired by the acquisition unit 401a. However, the r corresponding to a given attribute may instead be used when both attributes of the pair match that attribute, or alternatively when either one of the attributes of the pair matches it. The case of the negative loss has been exemplified above, but the same may apply to the case of the positive loss of equation (1).
Next, an example of the functional arrangement of a learning apparatus 500a that performs learning by forming a pair so that one image of the pair includes a mask by assuming “presence/absence of mask” as an attribute will be described with reference to a block diagram shown in
When a face authentication system for determining whether persons included in two images are identical to each other is operated, one image is a registered image and thus does not include a mask. Therefore, only a pair of “absence of mask and presence of mask” and a pair of “absence of mask and absence of mask” are evaluated at the time of the operation, and a pair of “presence of mask and presence of mask” is not evaluated in many cases. Thus, pairs are formed so that at most one image of each pair is an image of “presence of mask”, and learning is performed with such pairs. Note that in
A filter unit 510a includes a determination unit 511a. The determination unit 511a determines the quality of an image acquired by the acquisition unit 401a based on the attribute of the image. For example, the determination unit 511a determines “high quality” in a case where the attribute of the image is “absence of mask”, and determines “low quality” in a case where the attribute of the image is “presence of mask”. If the image acquired by the acquisition unit 401a is an image of “low quality”, the filter unit 510a acquires, from the memory bank unit 403a, a feature vector for which “high quality” is determined. On the other hand, if the image acquired by the acquisition unit 401a is an image of “high quality”, the filter unit 510a acquires feature vectors of “low quality” and “high quality” from the memory bank unit 403a.
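In outline, this filtering rule might be coded as follows (illustrative; the attribute encoding is an assumption):

```python
def quality(attributes: dict) -> str:
    """'low' if a mask is present, otherwise 'high' (the criterion described above)."""
    return "low" if attributes.get("mask", False) else "high"

def qualities_to_pair_with(sample_quality: str) -> list:
    """A low-quality sample pairs only with high-quality features; high pairs with both."""
    return ["high"] if sample_quality == "low" else ["low", "high"]
```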
In this embodiment as well, similar to the first embodiment, the processing according to the flowchart shown in
In step S503b, the determination unit 511a determines the quality based on the attribute of the sample acquired in step S402b. More specifically, the determination unit 511a determines “low quality” in a case where the attribute of the sample is “presence of mask”, and determines “high quality” in a case where the attribute of the sample is “absence of mask”.
In step S504b, the filter unit 510a obtains the quality to form a pair with the quality determined in step S503b. More specifically, the filter unit 510a obtains “high quality” in a case where “low quality” is determined for the sample in step S503b, and obtains both “low quality” and “high quality” in a case where “high quality” is determined for the sample in step S503b.
In step S505b, the filter unit 510a acquires, from the memory bank unit 403a, the feature vector of the corresponding quality obtained in step S504b. More specifically, the filter unit 510a causes the determination unit 511a to determine the quality based on the attribute of each feature vector in the memory bank unit 403a, and acquires the feature vectors of the quality obtained in step S504b. Note that the quality determination result for the memory bank unit 403a may be cached and reused. The example of using “mask” as an attribute has been explained above. However, the quality may be determined in accordance with each attribute shown in
A shielding attribute is an attribute concerning shielding of the face of a person included in an image. In addition to the presence/absence of a mask, the presence/absence of “sunglasses” or “hat” can be used. Furthermore, with respect to “make-up”, normal make-up may be treated as high quality, but low quality may be determined in a case of an “abnormality” such as face paint worn when, for example, watching a soccer match. It is also conceivable that part of the face is shielded by another item or a hand. In this case, the quality may be decided in accordance with the ratio of the shielded area.
A reflection attribute is an attribute concerning how a person included in an image is taken. “Face size” indicates the length of a short side of a rectangular region of a face, and the quality is determined in accordance with whether the length is longer than a predetermined size such as 100 pixels. “Interpupillary distance” indicates the number of pixels between both eyes, and the quality is determined in accordance with whether the interpupillary distance is larger than a predetermined number of pixels. “Eye closing” is an attribute indicating whether eyes are closed, and high quality is determined when eyes are open. “Facial expression” indicates the facial expression of the person, and high quality is determined for a facial expression close to a straight face. On the other hand, low quality may be determined for an abnormality such as a mouth opened excessively wide or a smile. “Face direction” indicates the direction of the face. If the pitch, roll, and yaw rotation angles are equal to or smaller than a predetermined threshold (±5°), high quality is determined; otherwise, low quality is determined.
An image quality attribute is an attribute concerning the image quality. “Brightness” indicates the brightness of the reflection of the face, and is determined by performing threshold processing (determination processing of comparing a value with the threshold) for a numerical value such as a luminance. “Image capturing device” indicates information concerning a device that has captured the image. If the device is a predetermined camera, high quality may be determined; otherwise, low quality may be determined. In this example, if a single-lens reflex camera is used to capture the image, high quality is determined; otherwise, low quality is determined. “Noise” indicates blurring/shaking or the like, and if the noise is equal to or larger than a predetermined amount, low quality is determined. Alternatively, the determination processing may be performed by additionally considering another kind of noise such as salt-and-pepper noise. The attributes for determining the quality are not limited to these. The image quality attribute may further include a resolution.
Furthermore, not all these items need to be used, and these items may selectively be used in combination. As a combining method, the items may be combined as an AND condition. For example, when only the two items “mask” and “sunglasses” are used, “high quality” may be determined only in a case where high quality is determined for both “mask” and “sunglasses”. If “low quality” is determined for one of “mask” and “sunglasses”, “low quality” may be determined. However, if many conditions are combined as an AND condition, the number of images for which high quality is determined decreases excessively, and the number of pairs usable for learning decreases. To cope with this, the feature vectors to be left may be determined for each attribute and combined under an OR condition. That is, the feature vectors to be left are obtained from the memory bank unit 403a in accordance with the presence/absence of “mask”. Next, the feature vectors to be left are obtained from the memory bank unit 403a in accordance with the presence/absence of “sunglasses”. The feature vectors determined to be left under either condition may then be left to form pairs. Alternatively, a plurality of attributes may be processed as AND conditions to form several groups of attributes, and the feature vectors to be left may be decided under an OR condition across the groups. For example, “mask” and “sunglasses” are used as one AND condition, and “interpupillary distance” and “brightness” are used as another AND condition. Then, the feature vectors to be left are obtained under the two AND conditions, and the results are combined under an OR condition, thereby deciding the feature vectors to be left. The method of combining the attributes is not limited to these.
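The AND/OR combination described above might be expressed as follows (a sketch; the attribute keys and the 50-pixel example value are assumptions taken from this description):

```python
def keep_feature(attrs: dict) -> bool:
    """Keep a feature if it passes either AND-group of conditions (OR across groups)."""
    group1 = (not attrs.get("mask", False)) and (not attrs.get("sunglasses", False))
    group2 = attrs.get("interpupillary_px", 0) >= 50 and attrs.get("brightness_ok", False)
    return group1 or group2
```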
In the above description, pairs are formed by excluding a pair of “low quality and low quality”. However, pairs may be formed by excluding a pair of “high quality and high quality”. More specifically, if the image acquired by the acquisition unit 401a is an image of “high quality”, only the feature vectors of “low quality” are acquired from the memory bank unit 403a. Since a pair of “high quality and high quality” is easy to collate, learning can be performed by focusing on pairs of “low quality and high quality” that are difficult to collate. This can improve the accuracy of the difficult pairs.
Furthermore, the acquisition unit 401a also holds the attribute in the external storage device 104a in the above description, but may be configured to obtain the attribute from an image using an attribute determiner. For example, a neural network for estimating the presence/absence of a mask for an image is prepared for the attribute of the presence/absence of a mask, and a result estimated by the neural network is used. Alternatively, for the attribute of brightness, the luminance value of an image or the like is calculated. The arrangement of an attribute estimator is not limited to them.
Furthermore, the quality is estimated based on the attribute in the above description. However, the quality may be estimated directly from an image. That is, a neural network may be learned to output the quality when an image is input, and the output from the neural network may be obtained as a quality attribute. It is possible to calculate a quality score or the like for an image by a neural network that classifies a data set of high-quality images and low-quality images, and the quality score can be held as an attribute. In this case, the determination unit 511a holds a threshold for the quality score. If the quality score exceeds the threshold, the determination unit 511a determines high quality; otherwise, the determination unit 511a determines low quality. The attribute may include the quality, and the method of obtaining the quality from the image is not limited to these.
An example of the functional arrangement of an inference apparatus 600b that determines, using a neural network learned by the learning apparatus according to this embodiment, whether faces included in two images are the face of the same person will be described next with reference to
A determination unit 601b obtains the attribute of an image acquired by an acquisition unit 601a. More specifically, the determination unit 601b may be configured to obtain the attribute of the image from the image using an attribute determiner or the like. For example, the attribute determiner estimates the presence/absence of a mask in the image. Alternatively, the determination unit 601b may be configured to hold, in advance, the attribute of the image in the external storage device 104a, and acquire it. The method of obtaining the attribute of the image by the determination unit 601b is not limited to them.
A determination unit 602b determines the quality of the image by the same criterion as that of the determination unit 511a used by the above-described learning apparatus. More specifically, the determination unit 602b performs the same processing as that of the determination unit 511a. For example, if it is configured to determine the quality based on the presence/absence of a mask, low quality is determined for the presence of the mask, and high quality is determined for the absence of the mask.
Alternatively, the determination unit 602b may use a stricter criterion than that of the determination unit 511a. For example, the criterion for a numerical value range may be made stricter. That is, while the determination unit 511a determines high quality when the “interpupillary distance” is 50 pixels or more, the determination unit 602b may determine high quality only when the “interpupillary distance” is equal to or larger than a stricter value (for example, 70 pixels). Alternatively, the number of attributes used may be increased to make the criterion stricter. That is, the determination unit 602b may use attributes that are not used by the determination unit 511a, and determine “high quality” only when high quality is determined for all the attributes. This allows higher-quality images to be held in a database unit 604a, thereby improving the accuracy of collation.
When the determination unit 602b determines low quality, a notification unit 603b makes a notification of the result of the determination processing. More specifically, the notification unit 603b displays, on a monitor 110a or the like, information representing that registration has failed because the registered image is a low-quality image. At this time, the feature vector of the image acquired by the acquisition unit 601a is not registered in the database unit 604a.
Note that the database unit 604a is a constituent element in the above description but is not essential. In a case where there is no database unit 604a, a collation unit 605a obtains the similarity between the feature vectors of the images obtained by the acquisition unit 601a and an acquisition unit 602a. However, if the determination unit 602b determines low quality, the collation unit 605a need not calculate the similarity. Alternatively, the processing from the calculation of the feature vector of the image by a calculation unit 603a onward need not be performed.
As described above, according to this embodiment, in the example shown in “Formation of Pair by Person Information of Race and Like”, it is possible to concentratedly learn difficult pairs with respect to pairs of similar pieces of person information, thereby improving accuracy. In the example shown in “Strengthening of Loss by Attribute”, it is possible to concentratedly perform learning of a weak attribute, thereby reducing the variation of accuracy caused by the attribute. In the example shown in “Formation of Pair by Image Quality Based on Mask and Like”, it is possible to concentratedly learn pairs appearing at the time of the operation of the face authentication system, thereby improving accuracy at the time of the operation. In addition, when the inference apparatus checks the quality of the registered image by the criterion equal to or stricter than the image quality used at the time of learning, collation of a pair that has not been learned is avoided, thereby making it possible to prevent erroneous authentication/non-authentication.
The learning apparatus and the inference apparatus according to the above embodiments process a face authentication task as a target, but may be applied to an authentication task that uses another form of distance (metric) learning. For example, they can be applied to another kind of biometric authentication in which, for example, it is determined, based on eye images such as iris images, whether two such images belong to the same person. The type of authentication task processed by the learning apparatus and the inference apparatus is not limited to these.
Numerical values, processing timings, processing orders, main constituents of processing, and acquisition methods/transmission destinations/transmission sources/storage locations of data (information) used in the above-described embodiments are merely examples given for a detailed explanation, and the present invention is not limited to these examples.
Some or all of the above-described embodiments may be used in combinations as needed. Alternatively, some or all of the above-described embodiments may selectively be used.
Embodiment(s) of the present invention can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a ‘non-transitory computer-readable storage medium’) to perform the functions of one or more of the above-described embodiment(s) and/or that includes one or more circuits (e.g., application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiment(s), and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s) and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more processors (e.g., central processing unit (CPU), micro processing unit (MPU)) and may include a network of separate computers or separate processors to read out and execute the computer executable instructions. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD™)), a flash memory device, a memory card, and the like.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
This application claims the benefit of Japanese Patent Application No. 2022-175029, filed Oct. 31, 2022, which is hereby incorporated by reference herein in its entirety.