The present disclosure relates to a learning system, a learning method, and a program.
Hitherto, machine learning has been used in various fields, for example, image analysis and natural language processing. In supervised machine learning, it takes time to prepare training data, and therefore it is required to increase the accuracy of the learning model through use of a smaller amount of training data. For example, in Non Patent Literature 1, there is described a method called “few-shot object detection” which creates a learning model capable of recognizing data having an unknown label based on a very small amount of training data.
[NPL 1] Leonid Karlinsky, Joseph Shtok, Sivan Harary, Eli Schwartz, Amit Aides, Rogerio Feris, Raja Giryes, and Alex M. Bronstein. RepMet: Representative-based metric learning for classification and few-shot object detection. In CVPR, 2019.
For multi-label data, it is particularly time-consuming to prepare the training data because there are many combinations of labels. However, the method of Non Patent Literature 1 targets data having a single label, and therefore this method is not applicable to multi-label data. For this reason, with the method of the related art, unless a larger amount of training data is prepared, it is not possible to increase the accuracy of a learning model which is capable of recognizing multi-label data.
An object of the present disclosure is to increase accuracy of a learning model capable of recognizing multi-label data through use of a small amount of training data.
According to one aspect of the present disclosure, there is provided a learning system including: first calculation means configured to calculate, when multi-label query data is input to a learning model, a first loss based on an output of the learning model and a target output; feature amount acquisition means configured to acquire a feature amount of the multi-label query data and a feature amount of support data corresponding to the multi-label query data, which are calculated based on a parameter of the learning model; second calculation means configured to calculate a second loss based on the feature amount of the multi-label query data and the feature amount of the support data; and adjustment means configured to adjust the parameter based on the first loss and the second loss.
According to the present disclosure, it is possible to increase the accuracy of the learning model capable of recognizing the multi-label data by using a small amount of training data.
Description is now given of an example of an embodiment of a learning system according to the present disclosure.
The server 10 is a server computer. The server 10 includes a control unit 11, a storage unit 12, and a communication unit 13. The control unit 11 includes at least one microprocessor. The storage unit 12 includes a volatile memory, for example, a RAM, and a nonvolatile memory, for example, a hard disk drive. The communication unit 13 includes at least one of a communication interface for wired communication and a communication interface for wireless communication.
The creator terminal 20 is a computer to be operated by a creator. The creator is a person creating data to be input to the learning model. In this embodiment, an image is described as an example of the data. For this reason, in this embodiment, the term “image” can be read as “data”. The data to be input to the learning model is not limited to images. Examples of other data are described in modification examples described later.
For example, the creator terminal 20 is a personal computer, a smartphone, or a tablet terminal. The creator terminal 20 includes a control unit 21, a storage unit 22, a communication unit 23, an operation unit 24, and a display unit 25. Physical components of the control unit 21, the storage unit 22, and the communication unit 23 may be similar to those of the control unit 11, the storage unit 12, and the communication unit 13, respectively. The operation unit 24 is an input device such as a mouse or a touch panel. The display unit 25 is a liquid crystal display or an organic EL display.
The learning terminal 30 is a computer for executing learning by a learning model. For example, the learning terminal 30 is a personal computer, a smartphone, or a tablet terminal. The learning terminal 30 includes a control unit 31, a storage unit 32, a communication unit 33, an operation unit 34, and a display unit 35. Physical components of the control unit 31, the storage unit 32, the communication unit 33, the operation unit 34, and the display unit 35 may be similar to those of the control unit 11, the storage unit 12, the communication unit 13, the operation unit 24, and the display unit 25, respectively.
Programs and data described as being stored into the storage units 12, 22, and 32 may be supplied thereto via the network N. Further, the respective hardware configurations of the server 10, the creator terminal 20, and the learning terminal 30 are not limited to the above-mentioned examples, and various types of hardware can be applied thereto. For example, the hardware configuration may include at least one of a reading unit (e.g., an optical disc drive or a memory card slot) for reading a computer-readable information storage medium, and an input/output unit (e.g., a USB port) for inputting and outputting data to/from an external device. For instance, at least one of the program and the data that are stored on the information storage medium may be supplied via at least one of the reading unit and the input/output unit.
In this embodiment, description is given of processing of the learning system S by taking, as an example, a case in which an image of an article for sale to be sold through a website is input to the learning model. For example, the creator is a clerk at a shop selling the article for sale. The creator edits a photograph of the article for sale by using image editing software installed on the creator terminal 20, and creates an image to be posted on the website.
The image editing software is used to add artificial objects to the photograph of the article for sale. Each object is a component of the image. The article for sale being a subject of the image is also one of the objects. The objects added to the photograph by the image editing software are electronic images. For example, for the purpose of promoting sales of the article for sale, the creator adds at least one of a digital text, a digital frame, and a color bar to the photograph of the article for sale.
The digital text is text added to the photograph by using the image editing software. The digital text is different from a natural text. The natural text is text included in the article for sale itself. In other words, the natural text is the text included in the photograph before editing. For example, the natural text is a name of the article for sale or a brand name printed on the article for sale.
The digital frame is a frame added to the photograph by using the image editing software. In this embodiment, there is described a case in which a digital frame of 1 pixel and a digital frame of 2 pixels or more are present, but the digital frame may have any thickness. The digital frame is different from a natural frame. The natural frame is a frame included in the article for sale itself. In other words, the natural frame is the frame included in the photograph before editing. For example, the natural frame is an edge of a box of the article for sale.
The color bar is an image showing a color variation of the article for sale. The color bar includes a bar of each of a plurality of colors. For example, in the case of an item of clothing having 10 color variations, the color bar includes bars for 10 colors.
When the creator has created an image by editing the photograph of the article for sale, the creator uploads the edited image to the server 10. The uploaded image is stored in an image database of the server 10 and posted on the website.
In an image I3, a digital frame DF30 of 1 pixel and a digital text DT31 are added to an image of a bag. In an image I4, a digital text DT40 is added to an image of a pair of gloves. In an image I5, a digital text DT50 and a color bar CB51 including bars for nine colors are added to an image of an item of clothing.
As in this embodiment, when the creator can freely edit the image, an image which has a poor design and does not improve a customer's willingness to purchase may be uploaded. Conversely, an image which is well designed and does improve the customer's willingness to purchase may be uploaded. For this reason, it is important to identify the edited content (artificially decorated portions) of the image.
In view of this, the learning terminal 30 creates a learning model for labeling the edited content made to the image. The learning model is a model which uses machine learning. Various methods can be used for the machine learning itself. For example, a convolutional neural network or a recurrent neural network can be used. The learning model in this embodiment is a supervised model or a semi-supervised model, but an unsupervised model may be used.
As in this embodiment, a learning model performing labeling is sometimes referred to as a "classification learner." Labeling refers to the conferring of labels on input images. The labels are classifications of the images. In this embodiment, a label means the edited content made to the image. As examples of the labels, the following labels 0 to 6 are described, but the labels are not limited to the examples of this embodiment, and any labels can be set.
An image with label 0 does not include any edited content, an image with label 1 includes a digital text, an image with label 2 includes a natural text, an image with label 3 includes a digital frame of 2 pixels or more, an image with label 4 includes a digital frame of 1 pixel, an image with label 5 includes a natural frame, and an image with label 6 includes a color bar. Label 0 means that the image does not correspond to any of labels 1 to 6.
In this embodiment, the output of the learning model includes seven binary values indicating whether or not the image belongs to each of labels 0 to 6. As an example, a case in which the output of the learning model is expressed in a vector format is described, but the output of the learning model may have any format. For example, the output of the learning model may have an array format, a matrix format, or a single numerical value. As another example, in place of the above-mentioned seven values, the output of the learning model may be a numerical value of from 0 to 6 indicating the label to which the image belongs. In this case, when the image belongs to label 2 and label 5, the output of the learning model is a combination of the numerical values of 2 and 5.
For example, when the value of a certain label is 0, this means that the image does not belong to that label. For example, when the value of a certain label is 1, this means that the image belongs to that label. For example, when the output of the learning model is [0, 1, 0, 0, 1, 0, 0], this means that the image belongs to label 1 and label 4. The output of the learning model is not required to be a binary value of 0 or 1, and an intermediate value may exist. The intermediate value indicates a probability (likelihood) of the image belonging to the label. For example, when the value of a certain label is 0.9, this means that there is a 90% probability that the image belongs to that label.
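As a minimal illustration of this output format (assuming a 0.5 decision threshold and PyTorch tensors, which are choices for the example and not part of the embodiment), the seven sigmoid outputs can be converted into binary labels as follows:

```python
import torch

# Hypothetical sigmoid outputs of the learning model for one image:
# one probability per label (label 0 to label 6).
output = torch.tensor([0.02, 0.91, 0.10, 0.05, 0.88, 0.07, 0.03])

# Thresholding at 0.5 (an assumed threshold) yields the binary vector
# [0, 1, 0, 0, 1, 0, 0], meaning the image belongs to label 1 and label 4.
labels = (output >= 0.5).int().tolist()
print(labels)  # [0, 1, 0, 0, 1, 0, 0]
```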
The image database DB includes a mixture of single-label images and multi-label images. A single label indicates that an image belongs to only one label, and a multi-label indicates that an image belongs to two or more labels.
As described regarding the related art, attempting to create a learning model capable of handling multi-labels requires a lot of time and effort to prepare the training data because there are a large number of label combinations. For this reason, it is difficult to create a learning model capable of handling multi-labels. Further, images such as those in this embodiment are difficult to label for the following two reasons.
The first reason is that the images stored in the image database DB not only include images of popular articles for sale, but also include many images of less popular articles for sale. Such a distribution is referred to as “long-tail distribution.” A population having a long-tail distribution includes a wide variety of images. For this reason, even when a large amount of training data is prepared, the training data includes a wide variety of patterns as the shapes of the articles for sale, and hence it is difficult for the learning model to recognize the features of the images.
The second reason is that most of the images stored in the image database DB are images of the external appearance of an article for sale, and portions such as digital text are small and inconspicuous, like fine grains. For this reason, it is difficult for the learning model to recognize features such as digital text. Multi-label images are even more difficult because several of such inconspicuous, fine-grained features are required to be recognized. Such a problem can also be referred to as a "fine-grained multi-label classification problem." Further, in images like those in this embodiment, there are also problems in that it is difficult to distinguish between digital text and natural text, and between digital frames and natural frames.
In view of the above, the learning system S of this embodiment creates a learning model capable of handling multi-labels by applying few-shot learning which is based on a contrastive learning approach. As a result, even in cases in which the images have a long-tail distribution and features that are not noticeable, like fine grains (even when the above-mentioned first and second reasons exist), the accuracy of the learning model is increased by using less training data. The details of the learning system S are now described.
In the server 10, a data storage unit 100 is implemented. The data storage unit 100 is mainly implemented by the storage unit 12. The data storage unit 100 stores the data required for learning by the learning model. For example, the data storage unit 100 stores the image database DB described above.
In this embodiment, the images stored in the image database DB have a predetermined format (for example, size, resolution, number of bits of color, and filename extension), but the image database DB may store images of any format. Further, the images stored in the image database DB are downloaded to the learning terminal 30 and then labeled by the user of the learning terminal 30, but labeled images may be stored in the image database DB.
The data storage unit 300 stores the data required for learning by learning models M1 and M2. When the learning model M1 and the learning model M2 are not distinguished in the following description, the learning models are simply referred to as “learning model M.” For example, the data storage unit 300 stores a data set DS for learning. The data set DS stores each of a plurality of images conferred with a label that is a correct answer.
In this embodiment, there is described a case in which a part of the images in the image database DB are stored in the data set DS, but all of the images in the image database DB may be stored in the data set DS. For example, the user of the learning terminal 30 accesses the server 10, and downloads a part of the images in the image database DB. The user displays the downloaded images on the display unit 35, and confers the labels that are the correct answers to create the data set DS.
For example, it is assumed that the image database DB contains about 200 million images, and that the user has randomly sampled and labeled about 40,000 to about 50,000 images from among those images. As a general rule, the images in this embodiment can be freely edited, and hence there may be some edits that creators tend to perform and some edits that creators are less likely to perform. For this reason, the labels of the randomly sampled images may have a long-tail distribution.
When there is a mixture of single-label images and multi-label images as in this embodiment, each combination of at least one label corresponds to a class. Each image belongs to exactly one class, and not to any other class. For example, a multi-label image belongs to a plurality of labels, but belongs only to the one class corresponding to that combination of labels. When there are 41 label combinations in the population of randomly sampled images, this means that there are 41 classes in the population.
Further, the method of conferring the label that is the correct answer to an image is not limited to the example described above, and any method can be used. For example, the user may use a known clustering method to confer the correct-answer label to an image. Moreover, for example, the user may use a learning model M that has learned a single label image to confer the correct-answer label to an image.
The data storage unit 300 stores not only the data set DS, but also the learning model M (actual data of the learning model M). The learning model M includes a program and a parameter. As the format itself of the program and the parameter of the learning model M, various formats used in machine learning can be used. For example, the program of the learning model M includes code defining the processing (for example, convolution, embedded vector calculation, and pooling) of each of a plurality of layers. Further, for example, the parameter of the learning model M includes a weighting coefficient and a bias. The parameter of the learning model M is referred to by the program of the learning model M.
In this embodiment, two learning models are used: a learning model M1 to which a query image is input, and a learning model M2 to which support images are input.
The parameter of the learning model M1 and the parameter of the learning model M2 are shared. That is, the parameter of the learning model M1 and the parameter of the learning model M2 are the same. The program of the learning model M1 and the program of the learning model M2 are also the same, and the internal structure, for example, the layers, is also the same. That is, any one of the learning model M1 and the learning model M2 is a copy of the other.
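As a sketch of one way to realize this sharing in code (assuming PyTorch; the small encoder stands in for the actual learning model M and is not the embodiment's architecture), using a single module instance in both roles makes the parameters shared by construction:

```python
import torch
import torch.nn as nn

# Placeholder encoder standing in for the learning model M.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))

# "M1" and "M2" refer to the same module object, so their parameters
# are shared: adjusting one is adjusting the other.
m1 = encoder
m2 = encoder

x_q = torch.randn(1, 3, 64, 64)  # one query image
x_s = torch.randn(5, 3, 64, 64)  # five support images
f_q = m1(x_q)                    # embedded vector of the query image
f_s = m2(x_s)                    # embedded vectors of the support images
```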
The data stored by the data storage unit 300 is not limited to the example described above. It is sufficient that the data storage unit 300 store the data required for the learning by the learning model M. For example, the data storage unit 300 may store the data set DS divided into three parts: a training data set, a verification data set, and a test data set. Further, for example, the data storage unit 300 may store the same database as the image database DB.
The data acquisition unit 301 acquires the images to be used for the learning by the learning model M. In this embodiment, the data acquisition unit 301 acquires the query image xQ and the support image xS from an image group having a long-tail distribution in multi-labels. The data acquisition unit 301 may also acquire the query image xQ and the support image xS from an image group which does not have a long-tail distribution.
The image group is a collection of a plurality of images. In this embodiment, the image group is stored in the image database DB having a long-tail distribution. When the number of samples in the data set DS reaches a certain level, the data set DS may also have a long-tail distribution, and therefore a collection of a plurality of images stored in the data set DS may correspond to the above-mentioned image group.
The long-tail distribution is a distribution like that described above, in which a small number of classes account for most of the images while many other classes each include only a small number of images.
The learning model M in this embodiment is a model which recognizes objects included in an image, and therefore a multi-label query image xQ is described as an example of query data. Further, an example of support data is the support image xS corresponding to the query image xQ. The query image xQ and the support image xS are each images used in few-shot learning.
The query image xQ is an image of a new class that has not been learned by the learning model M. The query image xQ is sometimes referred to as “test image.” The support image xS is an image of the same class as the query image xQ or of a different class from the query image xQ. For example, when general classes have been learned by the learning model M through use of a training data set for general object recognition, the class to be learned through use of the query image xQ and the support image xS is, in principle, a class which has not been learned by the learning model M.
In this embodiment, the data acquisition unit 301 randomly samples images from the image group stored in the image database DB, and stores pairs of the individual acquired images and the labels that are the correct answers in the data set DS.
The data acquisition unit 301 randomly acquires the query image xQ and the support images xS from the data set DS for each episode. An episode is a part of a series of processes in few-shot learning. In few-shot learning, a plurality of episodes are repeated. For example, for each episode, there is an image set of at least one query image xQ and at least one support image xS.
The few-shot learning in this embodiment is performed by following a setting called "N-way K-shot." Here, N means the number of classes per episode, K means the number of support images per class, and N and K are natural numbers. In general, as N becomes smaller, the accuracy of the learning model M becomes higher, and as K becomes larger, the accuracy of the learning model M becomes higher. In this embodiment, a case in which N is 1 and K is 5 (that is, a case of 1-way 5-shot) is described, but N and K may be any values.
In this embodiment, there is described a case in which there is an episode corresponding to a part of the combinations of labels that are possible for the multi-labels, but there may be episodes corresponding to all possible combinations. As an example, there is now described a case in which there are 15 episodes corresponding to the respective 15 classes described above.
For example, episode 1 is an episode for learning the images of the class (class having only label 1) having the highest total number of images in the long-tail distribution. The data acquisition unit 301 randomly samples six images of this class (images having labels [0, 1, 0, 0, 0, 0, 0]) from the data set DS, and uses one of the six images as the query image xQ and the remaining five as support images xS.
Further, for example, episode 2 is an episode for learning the images of the class (class having label 1 and label 2) having the second highest total number of images. The data acquisition unit 301 randomly samples six images of this class (images having labels [0, 1, 1, 0, 0, 0, 0]) from the data set DS. The data acquisition unit 301 uses one of the six images as the query image xQ and the remaining five as support images xS.
Similarly, for each of the other episodes 3 to 15, the data acquisition unit 301 randomly samples six images of the class corresponding to the episode and acquires those images as the query image xQ and the support images xS. That is, the data acquisition unit 301 acquires six images of the class corresponding to a certain episode as an image set of the query image xQ and the support images xS of the episode.
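The episodic sampling described above might be sketched as follows; the data layout (a list of image/label pairs) and the function name are assumptions for illustration, not the actual implementation:

```python
import random

def sample_episode(dataset, label_combination, k=6):
    """Randomly sample k images of one class (one label combination) and
    split them into one query image and k - 1 support images.
    `dataset` is assumed to be a list of (image, label_vector) pairs."""
    candidates = [img for img, y in dataset if y == label_combination]
    chosen = random.sample(candidates, k)
    return chosen[0], chosen[1:]  # query image, support images

# Example: episode 2 learns the class having label 1 and label 2.
# query, supports = sample_episode(ds, (0, 1, 1, 0, 0, 0, 0))
```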
Further, when the value of N is 2 or more, support images xS of a plurality of classes are included in one episode. In this case, only the query image xQ of one of the plurality of classes may be included in one episode, or a plurality of query images xQ corresponding to the respective classes may be included in one episode. That is, when the value of N is 2 or more, the number of query images xQ is not limited to one.
The number of episodes may be specified by the user, or may be automatically determined from a statistical value in the image database DB or data set DS. For example, the user may specify the classes to be learned by the learning model M, and the episodes corresponding to that number may be set. Further, for example, classes having a total number of images in the image database DB or data set DS equal to or more than a threshold value may be automatically identified, and the episodes corresponding to that number may be set. The data acquisition unit 301 is only required to acquire the number of images corresponding to the episodes.
The first calculation unit 302 calculates, when the multi-label query image xQ is input to the learning model M1, a first loss LBCE based on the output of the learning model M1 and a target output. That is, the first calculation unit 302 calculates the first loss LBCE based on the parameter of the learning model M1.
The output of the learning model M1 is the actual output obtained from the learning model M1. The target output is the content that the learning model M1 is supposed to output. In this embodiment, the label that is the correct answer stored in the data set DS corresponds to the target output.
The first loss LBCE shows an error (difference) between the output of the learning model M1 and the target output. The first loss LBCE is an index which can be used to measure the accuracy of the learning model M1. A high first loss LBCE means a large error and a low accuracy. A low first loss LBCE means a small error and a high accuracy. In this embodiment, there is described a case in which the first loss LBCE is a multi-label cross-entropy loss, but the first loss LBCE can be calculated by using any method. It is sufficient that the first loss LBCE can be calculated based on a predetermined loss function.
A set of the individual query images xQ included in a certain episode is hereinafter written as uppercase "XQ". In this embodiment, the set XQ of query images xQ of a certain episode consists of one query image xQ. In this embodiment, there is described a case in which N in N-way K-shot is 1, but there may be cases in which N is 2 or more. In those cases, the query image may be written as xQi, in which "i" is a natural number equal to or less than N. Here, i ∈ {1, ..., N}, and xQi ∈ XQ.
The first calculation unit 302 inputs the query image xQ to the learning model M1.
For example, when the query image xQ is input to the learning model M1, an embedded function f(x) calculates f(xQ), which is an embedded vector of the query image xQ. In f(x), "x" means any image. The embedded function f(x) may be a part of the program of the learning model M1, or may be an external program called by the learning model M1. The embedded vector is acquired by the feature amount acquisition unit 303 described later.
The first calculation unit 302 uses a sigmoid function σ(z) = 1/(1 + e^(−z)) to acquire a binary output of each class based on the embedded vector f(xQ). For example, the first calculation unit 302 calculates the first loss LBCE based on Expression 1 and Expression 2 below. Expression 1 and Expression 2 are examples of loss functions, but any function can be used as the loss function itself. When a loss other than a multi-label cross-entropy loss is to be used, a loss function corresponding to the loss can be used.
LBCE(σ(z), yQ) = {l1, ···, lN}T [Expression 1]

ln = −yQn·log σ(z) − (1 − yQn)·log(1 − σ(z)) [Expression 2]
In Expression 2, yQn is the binary label of the query image xQ for each label n, and yQn ∈ yQ. Here, yQ is the combination of labels corresponding to each input. As the error between the actual output corresponding to the query image xQ and the target output of the query image xQ becomes smaller, the first loss LBCE becomes smaller, and as the error becomes larger, the first loss LBCE becomes larger.
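A sketch of Expressions 1 and 2, assuming PyTorch; BCEWithLogitsLoss applies the sigmoid internally, and the concrete logits and the reduction to a scalar by averaging are choices for the example:

```python
import torch
import torch.nn as nn

# z: raw outputs (logits) of the learning model M1 for one query image;
# y_q: the target output (correct-answer labels) for the seven labels.
z = torch.tensor([-3.2, 2.5, -1.1, -2.0, 2.1, -1.8, -2.7])
y_q = torch.tensor([0., 1., 0., 0., 1., 0., 0.])

# With reduction="none", this returns the per-label losses l_1, ..., l_N
# of Expression 1; each l_n follows Expression 2 (the sigmoid is applied
# internally). Averaging the vector reduces it to a scalar L_BCE.
per_label = nn.BCEWithLogitsLoss(reduction="none")(z, y_q)
l_bce = per_label.mean()
```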
The learning model M in this embodiment can recognize three or more labels, and for each combination of labels (i.e., for each episode), there is an image set which includes a query image xQ and support images xS. There are three or more labels, and hence there are two or more label combinations.
The first calculation unit 302 calculates, for each combination of labels (that is, for each episode), the first loss LBCE based on the query image xQ corresponding to the combination. The method of calculating the first loss LBCE of the individual episodes is as described above. In this embodiment, there are 15 episodes, and therefore the first calculation unit 302 calculates the first loss LBCE corresponding to each of the 15 episodes.
In the learning model M in this embodiment, the last layer of a model which has learned labels other than the plurality of labels to be recognized is replaced with a layer corresponding to the plurality of labels. The last layer is the output layer. For example, the last layer of a learning model M that has learned the shapes of general objects by using ResNet50 is replaced with a layer corresponding to the multi-labels (in this embodiment, a layer outputting seven values from label 0 to label 6). As a result, the combination of the labels to be recognized by the learning model M is output. The first calculation unit 302 calculates the first loss LBCE based on the output of the learning model M having the replaced layer corresponding to the plurality of labels and the target output.
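A sketch of this layer replacement, assuming PyTorch/torchvision with ImageNet pre-trained weights standing in for the pre-learning (the `weights` argument assumes a recent torchvision):

```python
import torch.nn as nn
from torchvision import models

# A ResNet50 that has learned the shapes of general objects (pre-learning).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the last (output) layer with a layer corresponding to the
# seven multi-labels of this embodiment (label 0 to label 6).
model.fc = nn.Linear(model.fc.in_features, 7)
```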
The feature amount acquisition unit 303 acquires a feature amount of the query image xQ and a feature amount of each support image xS corresponding to the query image xQ, which are calculated based on the parameter of the learning model M. The parameter is the current parameter of the learning model M. That is, the parameter is the parameter before adjustment by the adjustment unit 305 described later. When pre-learning is performed by using ResNet50, for example, the feature amounts are acquired based on the parameter after the pre-learning.
The feature amounts are information indicating a feature of the image. In this embodiment, there is described a case in which the embedded vector corresponds to the feature amount. For this reason, the term “embedded vector” in this embodiment can be read as “feature amount.” The feature amounts can be expressed in any format, and are not limited to vector formats. The feature amounts may be expressed in another format, for example, an array format, a matrix format, or a single numerical value.
The feature amount acquisition unit 303 inputs each support image xS to the learning model M2, and acquires the embedded vector f(xS) calculated by the learning model M2.
In this embodiment, there are a plurality of support images xS per episode, and therefore the feature amount acquisition unit 303 acquires the embedded vector of each of the plurality of support images xS. Further, the value of K is 5 and there are five support images xS per episode, and hence the feature amount acquisition unit 303 inputs each of the five support images xS to the learning model M2 and acquires five embedded vectors. When the value of N is 2 or more, it is sufficient that the number of embedded vectors of the support images xS acquired by the feature amount acquisition unit 303 correspond to the value of N.
The feature amount acquisition unit 303 acquires, for each combination of labels (that is, for each episode), the embedded vector of the query image xQ corresponding to the combination and the embedded vector of each support image xS corresponding to the combination. In this embodiment, there are 15 episodes, and therefore the feature amount acquisition unit 303 acquires the embedded vector of one query image xQ and the embedded vector of each of the five support images xS for each of the 15 episodes.
The second calculation unit 304 calculates a second loss LCL based on the embedded vector of the query image xQ and the embedded vector of each support image xS.
The second loss LCL shows an error (difference) between the embedded vector of the query image xQ and the embedded vector of each support image xS. The second loss LCL is an index which can be used to measure the accuracy of the learning models M1 and M2. A high second loss LCL means a large error and a low accuracy. A low second loss LCL means a small error and a high accuracy. In this embodiment, there is described a case in which the second loss LCL is a contrastive loss, but the second loss LCL can be calculated by using any method. It is sufficient that the second loss LCL can be calculated based on a predetermined loss function.
A contrastive loss is a loss used in contrastive learning. Contrastive learning is used to learn whether or not a pair of images is similar. For example, the Euclidean distance between the embedded vectors of a pair of images {X1, X2} is used as a distance metric DW.
For example, when a similarity label indicating the similarity of the image pair is Y ∈ {0, 1}, the contrastive loss is calculated based on Expression 3 below. When Y is 0, this means that an image X1 and an image X2 are similar (the image X1 and the image X2 have the same label). When Y is 1, this means that the image X1 and the image X2 are not similar (the image X1 and the image X2 have different labels). Expression 3 is an example of a loss function, but any function can be used as the loss function itself. In Expression 3, M is a constant (margin) for adjusting the loss generated when Y is 1.
LCL(X1, X2, Y) = ½{(1 − Y)(DW(X1, X2))² + Y·{max(0, M − DW(X1, X2))}²} [Expression 3]
In order to apply contrastive learning like that described above to the method of this embodiment, in place of comparing the similarity between two images, the two embedded vectors calculated from the support images xS and the query image xQ are input. In this embodiment, the support images xS and the query image xQ here have the same label, and therefore the similarity label Y is 0. For example, the second calculation unit 304 calculates the second loss LCL based on Expression 4 below. In Expression 4, the element having a line drawn above "f(xS)" (written below as f̄(xS)) is the average value of the embedded vectors of the support images xS. Expression 4 is an example of a loss function, but any function can be used as the loss function itself.
LCL(f̄(xS), f(xQ)) = ½{DW(f̄(xS), f(xQ))}² [Expression 4]
In this embodiment, the query image xQ and the support images xS have at least one label which is the same. There is described here a case in which all of their labels are the same, but the labels may be partial matches rather than exact matches. The second calculation unit 304 calculates the second loss LCL so that, as the difference between the embedded vector of the query image xQ and the embedded vector of the support image xS becomes larger, the second loss LCL becomes larger. The difference between the embedded vectors may be expressed by an index other than the distance. The relationship between the difference and the second loss LCL is defined in the loss function.
In this embodiment, K is 2 or more and there are a plurality of support images xS per episode, and hence the second calculation unit 304 calculates an average feature amount (in Expression 4, f̄(xS)) based on the embedded vector of each of the plurality of support images xS, and acquires the second loss LCL based on the embedded vector of the query image xQ and the average embedded vector. The average embedded vector may be weighted in some manner in place of being a simple average over the five support images xS. When the value of N is 2 or more, an average feature amount extending across classes may be calculated.
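Under the setting described above (same labels, hence Y = 0 in Expression 3), the second loss reduces to half the squared Euclidean distance between the query embedding and the average support embedding. A sketch, assuming PyTorch and the tensor shapes noted in the comments:

```python
import torch

def second_loss(f_q, f_s):
    """Contrastive loss of Expression 4 for same-label pairs (Y = 0):
    half the squared Euclidean distance D_W between the embedded vector
    of the query image and the average embedded vector of the supports.
    f_q: (d,) query embedding; f_s: (k, d) support embeddings."""
    f_s_mean = f_s.mean(dim=0)             # average embedded vector
    d_w = torch.norm(f_q - f_s_mean, p=2)  # Euclidean distance
    return 0.5 * d_w ** 2
```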
The second calculation unit 304 calculates, for each combination of labels (that is, for each episode), the second loss LCL based on the embedded vector of the query image xQ corresponding to the combination and the embedded vector of each support image xS corresponding to the combination. In this embodiment, there are 15 episodes, and therefore the second calculation unit 304 calculates the second loss LCL based on the embedded vector of one query image xQ and the embedded vector of each of the five support images xS for each of the 15 episodes.
The adjustment unit 305 adjusts the parameter of the learning model M based on the first loss LBCE and the second loss LCL. Adjusting the parameter has the same meaning as executing learning by the learning model M. As the method itself of adjusting the parameter based on losses, various methods can be used. For example, an error backpropagation method or a gradient descent method may be used. The adjustment unit 305 adjusts the parameter of the learning model M so that the first loss LBCE and the second loss LCL each become smaller.
When the parameter of the learning model M is adjusted so that the first loss LBCE becomes smaller, the error between the output of the learning model M and the label that is the correct answer is reduced. That is, the probability that the learning model M outputs the correct answer increases. In other words, the output of the learning model M becomes closer to the label that is the correct answer.
When the parameter of the learning model M is adjusted such that the second loss LCL becomes smaller, the learning model M calculates the embedded vectors such that the difference between the embedded vector of the query image xQ and the embedded vector of the support image xS similar to the query image xQ is reduced.
Contrary to this embodiment, in a case in which a support image xS not similar to the query image xQ is used, when the parameter of the learning model M is adjusted so that the second loss LCL becomes smaller, the learning model M calculates embedded vectors so that the difference between the embedded vector of the query image xQ and the embedded vector of the support image xS not similar to the query image xQ becomes larger.
In this embodiment, the adjustment unit 305 calculates a total loss Ltotal based on the first loss LBCE and the second loss LCL, and adjusts the parameter of the learning model M based on the total loss Ltotal. The total loss Ltotal is calculated based on Expression 5 below. Expression 5 is an example of a loss function, but any function can be used as the loss function itself. For example, in place of the simple, equally weighted sum of Expression 5, the total loss Ltotal may be calculated by using a weighted sum based on weighting coefficients.
Ltotal = LCL + LBCE [Expression 5]
In this embodiment, the learning model M1 and the learning model M2 exist, and the parameter is shared between the learning model M1 and the learning model M2. For this reason, the adjustment unit 305 adjusts the parameter of the learning model M1 and the parameter of the learning model M2. In this embodiment, the adjustment unit 305 adjusts the parameter of the learning model M1 by using the total loss Ltotal and copies the adjusted parameter of the learning model M1 to the learning model M2.
Contrary to the case described above, the adjustment unit 305 may adjust the parameter of the learning model M2 by using the total loss Ltotal, and copy the adjusted parameter of the learning model M2 to the learning model M1. Further, in place of copying the parameter, the adjustment unit 305 may adjust the parameter of the learning model M1 by using the total loss Ltotal, and adjust the parameter of the learning model M2 by using the same total loss Ltotal. As a result of this method, the parameter is shared as well.
In this embodiment, the adjustment unit 305 adjusts the parameter of the learning model M based on the first loss LBCE and the second loss LCL calculated for each combination of labels (that is, for each episode). In this embodiment, there are 15 episodes, and therefore the adjustment unit 305 adjusts the parameter of the learning model M based on 15 loss pairs (a pair of first loss LBCE and second loss LCL) corresponding to the respective 15 episodes.
For example, the adjustment unit 305 calculates 15 total losses Ltotal corresponding to the respective 15 episodes. The adjustment unit 305 adjusts the parameter of the learning model M for each of the 15 total losses Ltotal by using the error backpropagation method, for example. The adjustment unit 305 may adjust the parameter of the learning model M by combining all or a part of the 15 total losses Ltotal into one loss.
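One episode of this adjustment might look as follows. This is a sketch rather than the actual implementation: the assumption that the model returns both logits and an embedded vector, and the helper names `bce_loss` and `second_loss`, are illustrative.

```python
def train_episode(model, optimizer, x_q, y_q, x_s, bce_loss, second_loss):
    """One episode of parameter adjustment. `model` is assumed to return
    (logits, embedding) for a batch of images and plays the roles of both
    the learning model M1 and the learning model M2 (shared parameter)."""
    z_q, f_q = model(x_q.unsqueeze(0))       # query image through "M1"
    _, f_s = model(x_s)                      # support images through "M2"
    l_bce = bce_loss(z_q.squeeze(0), y_q)    # first loss (Expressions 1, 2)
    l_cl = second_loss(f_q.squeeze(0), f_s)  # second loss (Expression 4)
    l_total = l_bce + l_cl                   # total loss (Expression 5)
    optimizer.zero_grad()
    l_total.backward()                       # error backpropagation
    optimizer.step()                         # adjust the shared parameter
    return l_total.item()
```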
The adjustment unit 305 may adjust the parameter of the learning model M without calculating the total loss Ltotal. For example, the adjustment unit 305 may adjust the parameter of the learning model M so that the first loss LBCE becomes smaller, and then adjust the parameter of the learning model M so that the second loss LCL becomes smaller. As another example, the adjustment unit 305 may adjust the parameter of the learning model M so that the second loss LCL becomes smaller, and then adjust the parameter of the learning model M so that the first loss LBCE becomes smaller.
The adjustment unit 305 may also combine the first loss LBCE for a certain episode with the first loss LBCE for another episode into one loss, and then adjust the parameter of the learning model M. The adjustment unit 305 may also combine the second loss LCL for a certain episode with the second loss LCL for another episode into one loss, and then adjust the parameter of the learning model M.
The data set DS is stored in advance in the storage unit 32. Further, the order of the episodes to be processed and the classes corresponding to the individual episodes are specified in advance. For example, the episodes corresponding to each of the 15 classes in the long-tail distribution described above are specified in advance.
The learning terminal 30 acquires the query image xQ and the five support images xS of the episode to be processed from the data set DS (Step S1). The learning terminal 30 inputs the query image xQ to the learning model M1 (Step S2). The learning terminal 30 calculates the first loss LBCE based on the output of the learning model M1 and the label that is the correct answer (Step S3).
The learning terminal 30 inputs each of the five support images xS of the episode to be processed to the learning model M2 (Step S4). The learning terminal 30 acquires the embedded vector of the query image xQ calculated by the learning model M1 and the embedded vector of each of the five support images xS calculated by the learning model M2 (Step S5). The learning terminal 30 calculates the average value of the embedded vectors of the five support images xS (Step S6).
The learning terminal 30 calculates the second loss LCL based on the embedded vector of the query image xQ and the average value calculated in Step S6 (Step S7). The learning terminal 30 calculates the total loss Ltotal based on the first loss LBCE and the second loss LCL (Step S8). The learning terminal 30 adjusts the parameter of each of the learning model M1 and the learning model M2 based on the total loss Ltotal (Step S9).
The learning terminal 30 determines whether or not all episodes have been processed (Step S10). When there is an episode that has not yet been processed (Step S10: N), the process returns to Step S1, and the next episode is processed. When it is determined that processing has been executed for all episodes (Step S10: Y), the learning terminal 30 determines whether or not the learning has been repeated a predetermined number of times (Step S11). This number of repetitions is referred to as an "epoch."
When it is not determined that the learning has been repeated the predetermined number of times (Step S11: N), the learning terminal 30 repeats the adjustment of the parameter of each of the learning model M1 and the learning model M2 (Step S12). In Step S12, the processing from Step S1 to Step S9 is repeated for each of the 15 episodes. Meanwhile, when it is determined that the learning has been repeated the predetermined number of times (Step S11: Y), the processing is ended.
According to the learning system S of this embodiment, by adjusting the parameter of the learning model M based on the first loss LBCE and the second loss LCL, the accuracy of the learning model M which is capable of recognizing multi-label data can be increased by using less training data. For example, when an attempt is made to adjust the parameter of the learning model M by using only the first loss LBCE, which is a multi-label cross-entropy loss, it is required to prepare an extremely large amount of training data. Further, for example, when an attempt is made to adjust the parameter of the learning model M by using only the second loss LCL, which is a few-shot learning-based contrastive loss, it is possible to reduce the amount of training data, but due to the above-mentioned first and second reasons and the like, the accuracy of the learning model M capable of handling multi-labels may not be sufficiently increased. Through use of the first loss LBCE and the second loss LCL together, a reduction in training data and an improvement in the accuracy of the learning model M can both be achieved. According to the inventors' own research, it has been confirmed that the labeling accuracy for labels having a relatively small total number of images in a long-tail distribution (labels 0, 4, 5, and 6 described above) is particularly increased.
Moreover, the learning system S can cause the learning model M to learn the features of images which are similar to each other by calculating the second loss LCL so that the second loss LCL becomes larger as the difference between the embedded vector of the query image xQ and the embedded vector of the support image xS having at least one label which is the same becomes larger. For example, the accuracy of the learning model M can be increased by adjusting the parameter of the learning model M so that the embedded vector of the query image xQ becomes closer to the embedded vectors of the support images xS.
Further, the learning system S can increase the number of the support images xS and effectively increase the accuracy of the learning model M by acquiring the second loss LCL based on the embedded vector of the query image xQ and the average value of the embedded vectors of the plurality of support images xS. That is, the second loss LCL can be accurately calculated even when the number of support images xS is increased. Moreover, one second loss LCL may be calculated by combining the embedded vectors of a plurality of support images xS into a single average value. As a result, it is not required to calculate a large number of second losses LCL, and therefore the processing load on the learning terminal 30 can be reduced, and the learning can be accelerated.
Further, by calculating the total loss Ltotal and adjusting the parameter based on the first loss LBCE and the second loss LCL, the learning system S can effectively increase the accuracy of the learning model M through use of one index which comprehensively considers the first loss LBCE and the second loss LCL. Moreover, the processing required during learning can be simplified by combining the first loss LBCE and the second loss LCL into one total loss Ltotal. That is, by combining two losses into one, the learning processing can also be combined into one. As a result, the processing load on the learning terminal 30 can be reduced, and the learning can be accelerated.
Further, the learning system S has an image set which includes the query image xQ and support images xS for each combination of labels (that is, for each episode). Through adjustment of the parameter of the learning model M based on the first loss LBCE and the second loss LCL calculated for each label combination, the features of various label combinations can be learned by the learning model M, and the accuracy of the learning model M can be increased. Moreover, even when there are many label combinations for multi-labels, it is possible to create a learning model M capable of recognizing those combinations.
Further, the learning system S can execute the calculation of the embedded vectors in parallel and accelerate the learning processing by inputting the query image xQ to the learning model M1 and inputting the support images xS to the learning model M2.
Further, even when the population to be processed by the learning model M has a long-tail distribution, the learning system S can reduce training data and maximize the accuracy of the learning model M by acquiring the query image xQ and the support images xS from a data group having a long-tail distribution for multi-labels. For example, in learning performed by using classes having a large total number of images and classes having a small total number of images, by making the number of images to be used in the learning (the number of images included per episode) to be the same, the features of all the classes can be learned universally by the learning model M.
The learning system S can prepare a learning model M having a certain degree of accuracy at the beginning of learning and can also increase the accuracy of the ultimately obtained learning model M by replacing the last layer of a model which has learned another label other than the plurality of labels to be recognized with a layer corresponding to the plurality of labels. For example, when pre-learning is executed by using a general ResNet50, the learning model M obtained by the pre-learning can recognize the features of a general object to a certain degree. That is, this learning model M can recognize to a certain degree what part of the image to focus on so that the object can be classified. Through use of such a learning model M to perform learning like that in this embodiment, a higher-accuracy learning model M can be obtained. Further, the number of times for which learning is required to be executed in order to obtain a learning model M having a certain degree of accuracy can be reduced, and the processing load on the learning terminal 30 can be reduced. Further, the learning can be accelerated.
Further, the learning system S can enhance the accuracy of a learning model M capable of recognizing multi-label images through use of a small amount of training data by using images as the data to be processed by the learning model M.
The present disclosure is not limited to the embodiment described above, and can be modified suitably without departing from the spirit of the present disclosure.
(1) For example, the adjustment unit 305 may calculate the total loss Ltotal based on the first loss LBCE, the second loss LCL, and a weighting coefficient specified by the user. The user can specify at least one weighting coefficient for the first loss LBCE and the second loss LCL. The user may specify weighting coefficients for both of the first loss LBCE and the second loss LCL, or for only one of the first loss LBCE and the second loss LCL. The weighting coefficients specified by the user are stored in the data storage unit 300. The adjustment unit 305 acquires, as the total loss Ltotal, a value obtained by multiplying each of the first loss LBCE and the second loss LCL by its weighting coefficient and adding the products. The processing of the adjustment unit 305 after the total loss Ltotal is acquired is the same as in the embodiment.
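A one-function sketch of this weighted total loss; the default coefficients of 1.0 (which reduce to Expression 5) are an assumption for the example:

```python
def weighted_total_loss(l_bce, l_cl, w_bce=1.0, w_cl=1.0):
    """Total loss of Modification Example (1): each loss is multiplied by
    a user-specified weighting coefficient and the products are added."""
    return w_bce * l_bce + w_cl * l_cl
```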
According to Modification Example (1) of the present disclosure, the accuracy of the learning model M can be effectively increased by calculating the total loss Ltotal based on the first loss LBCE, the second loss LCL, and the weighting coefficients specified by the user. For example, the weighting coefficients can be set depending on the objective of the user by, for example, increasing the weighting coefficient of the first loss LBCE when a major class in a long-tail distribution is to be preferentially learned, and increasing the weighting coefficient of the second loss LCL when a minor class in a long-tail distribution is to be preferentially learned.
(2) Further, for example, the second calculation unit 304 may acquire the second loss LCL based on the embedded vector of the query image xQ, the embedded vector of each support image xS, and a coefficient corresponding to a label similarity between the query image xQ and each support image xS. The label similarity is the number or proportion of the same label. When the number or proportion of the same label is larger or higher, this means that the label similarity is higher.
In the embodiment, there is described a case in which the labels of the query image xQ and the labels of the support image xS completely match (case in which the class of the query image xQ and the class of the support image xS are the same), but in this modification example, there is described a case in which the label of the query image xQ and the label of the support image xS partially match and do not completely match (case in which the class of the query image xQ is similar to the class of the support image xS).
For example, when the query image xQ is a multi-label image belonging to the three labels of label 1, label 2, and label 4, and the support image xS is a multi-label image belonging to the three labels of label 1, label 3, and label 4, two of the three labels match between the query image xQ and the support image xS. The coefficient corresponding to the similarity is thus 0.67. The second calculation unit 304 calculates the second loss LCL by multiplying this coefficient by Expression 4.
When the number or proportion of the same label between the query image xQ and the support image xS becomes larger or higher, the coefficient becomes larger. The relationship between the number or proportion of the labels and the coefficient may be determined in advance in an expression or the data of a table, for example. When the second calculation unit 304 calculates the second loss LCL in an episode, the second calculation unit 304 identifies the number or proportion of the same label between the query image xQ and the support image xS of the episode, and acquires the coefficient corresponding to the number or proportion. The second calculation unit 304 calculates the second loss LCL based on the coefficient.
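A sketch of this coefficient, assuming the labels are given as sets and defining the proportion as the share of the query image's labels that the support image also has (consistent with the 2/3 ≈ 0.67 example above):

```python
def similarity_coefficient(labels_q, labels_s):
    """Coefficient corresponding to the label similarity: the proportion
    of the query image's labels that the support image also has.
    E.g., {1, 2, 4} vs. {1, 3, 4} -> 2 / 3 (about 0.67)."""
    return len(labels_q & labels_s) / len(labels_q)

# The second loss of this modification example is Expression 4 scaled by
# the coefficient, e.g.:
# l_cl = similarity_coefficient({1, 2, 4}, {1, 3, 4}) * second_loss(f_q, f_s)
```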
According to Modification Example (2) of the present disclosure, the accuracy of the learning model M can be effectively increased through use of less training data by acquiring a second loss LCL based on a coefficient corresponding to a label similarity between the query image xQ and the support image xS. For example, there are cases in which it is difficult to find another image having exactly the same labels as that of a certain image, but it is easy to obtain an image having similar labels. In this case, by acquiring the second loss LCL based on a coefficient corresponding to the label similarity, it is not required to obtain another image having the exact same labels, and the time and effort expended by the user can be reduced.
(3) Further, for example, the modification examples described above may be combined.
For example, parameter adjustment may be executed without calculating the average value of the embedded vector of each of the plurality of support images xS. In this case, the adjustment unit 305 may execute parameter adjustment by calculating, for each support image xS, a total loss Ltotal based on the first loss LBCE of the query image xQ and the second loss LCL of the support image xS.
Further, for example, there has been described a case in which the parameter of the learning model M is adjusted based on the first loss LBCE and the second loss LCL, but the learning system S may adjust the parameter of the learning model M based only on the second loss LCL without calculating the first loss LBCE. Conversely, the learning system S may adjust the parameter of the learning model M based only on the first loss LBCE without calculating the second loss LCL. This is because even when the learning system S is configured in such a way, it is possible to create a learning model M having a certain degree of accuracy.
Further, for example, the object to be recognized by the learning model M may be any object included in the image, and is not limited to a digital text, for example. For example, the learning model M may recognize a multi-label image in which a plurality of objects, such as a dog and a cat, appear. That is, the labels conferred by the learning model M are not limited to digital text or the like, and may indicate a subject in the image. It is sufficient that the label be some kind of classification of an object in the image.
Further, for example, the data input to the learning model M is not limited to images. That is, the learning system S is also applicable to a learning model M which performs recognition other than image recognition. For example, the learning system S may also be applicable to a learning model M for performing speech recognition. In this case, the data input to the learning model M is voice data. As another example, the learning system S is also applicable to a learning model M in natural language processing. In this case, the data input to the learning model M is document data. As still another example, the learning system S is also applicable to a learning model M which recognizes various human behaviors or phenomena in the natural world, for example. The data input to the learning model M may be data corresponding to the application of the learning model M.
Further, for example, all or part of the functions included in the learning terminal 30 may be implemented on another computer. For example, each of the data acquisition unit 301, the first calculation unit 302, the feature amount acquisition unit 303, the second calculation unit 304, and the adjustment unit 305 may be included in the server 10. In this case, each of those functions is implemented mainly by the control unit 11. Further, for example, each of those functions may be shared by a plurality of computers. The learning system S may include only one computer. Further, for example, the data described as being stored in the data storage units 100 and 300 may be stored in another computer or information storage medium different from the server 10 or the learning terminal 30.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2020/045416 | 12/7/2020 | WO |