This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2022-207686, filed on Dec. 23, 2022, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an alert generation program, and the like.
An image recognition technology for recognizing a specific object from an image has been widely used. With this technology, for example, a region of the specific object included in the image is specified as a bounding box (Bbox). In addition, there is also a technology for performing image recognition that identifies an object by using machine learning. Furthermore, it is conceivable to apply this type of image recognition technology to, for example, monitoring a motion of purchasing performed by a customer in a store or managing work performed by workers in a factory.
In stores, such as supermarkets and convenience stores, self-service checkout registers are becoming widely used. A self-service checkout register is a point of sale (POS) checkout register system in which a user who purchases commodity products performs, by himself or herself, a series of processes from reading the bar code assigned to each of the commodity products to calculating a payment amount. For example, by installing self-service checkout registers, it is possible to mitigate a labor shortage caused by a decrease in population and to suppress labor costs. The related technology is described, for example, in Japanese Laid-open Patent Publication No. 2019-29021.
According to an aspect of an embodiment, a non-transitory computer-readable recording medium stores therein an alert generation program that causes a computer to execute a process including acquiring a video image of a person who is scanning a code of a commodity product at an accounting machine, specifying, by analyzing the acquired video image, from among a plurality of commodity product candidates that are set in advance, a commodity product candidate that corresponds to the commodity product that is included in the video image, acquiring an item of the commodity product that has been registered to the accounting machine by the scan of the code of the commodity product at the accounting machine, and generating, based on an item of the specified commodity product candidate and the item of the commodity product acquired from the accounting machine, an alert that indicates an abnormality of the commodity product that has been registered to the accounting machine.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
At the self-service checkout register described above, there is an aspect in which it is difficult to detect a fraudulent act because a scan of a commodity product code and calculation of a payment amount are entrusted to the user himself or herself. For example, even if image recognition Artificial Intelligence (AI) is used in an aspect of detecting the fraudulent act described above, a huge amount of training data is needed to train the image recognition AI. However, in stores, such as supermarkets and convenience stores, many types of commodity products are present and, in addition, the life cycle of each of the commodity products is short, so that commodity products are frequently replaced. It is therefore difficult to tune the image recognition AI in accordance with the life cycle of these commodity products, or to train a new image recognition AI.
In addition, at the self-service checkout register described above, a scan of a commodity product code and calculation of a payment amount are entrusted to the user himself or herself, so that there is another aspect in which it is difficult to suppress a fraudulent act, that is, a so-called label switch, conducted by swapping the label attached to a high-priced commodity product for the label of a low-priced commodity product and scanning the label of the low-priced commodity product.
Accordingly, it is an object in one aspect of an embodiment of the present invention to provide an alert generation program, an alert generation method, and an information processing apparatus capable of suppressing a label switch conducted at a self-service checkout register.
Preferred embodiments will be explained with reference to the accompanying drawings. Furthermore, the present invention is not limited by the embodiments. In addition, the embodiments can be used in any appropriate combination as long as they do not conflict with each other.
The information processing apparatus 100 is one example of a computer that is connected to the camera 30 and the self-service checkout register 50. The information processing apparatus 100 is connected to the administrator terminal 60 via a network 3. The network 3 may be various communication networks that are used regardless of a wired or wireless connection. In addition, the camera 30 and the self-service checkout register 50 may be connected to the information processing apparatus 100 via the network 3.
The camera 30 is one example of an image capturing device that captures a video image of a region including the self-service checkout register 50. The camera 30 transmits data on the video image to the information processing apparatus 100. In a description below, the data on the video image is sometimes referred to as “video image data”.
In the video image data, a plurality of image frames obtained in time series are included. A frame number is assigned to each of the image frames in an ascending order in time series. A single image frame is image data of a still image that is captured by the camera 30 at a certain timing.
The self-service checkout register 50 is one example of an accounting machine at which a user 2 who purchases a commodity product registers the commodity product to be purchased and calculates a payment amount (payment) by himself or herself, and is called a “self-checkout”, “automated checkout”, “self-checkout machine”, “self-check-out register”, or the like. For example, if the user 2 moves a commodity product that is targeted for a purchase to a scan region included in the self-service checkout register 50, the self-service checkout register 50 scans a code that is printed on or attached to the commodity product, and registers the commodity product that is targeted for the purchase. Hereinafter, a process in which a commodity product is registered to the self-service checkout register 50 is sometimes referred to as “registered at a checkout register”. In addition, the “code” mentioned here may be a bar code that meets the standards defined by the Japanese Article Number (JAN), the Universal Product Code (UPC), the European Article Number (EAN), or the like, or may be another two-dimensional code.
The user 2 repeatedly performs a motion of registering at a checkout register described above, and, when a scan of each of the commodity products has been completed, the user 2 operates a touch panel of the self-service checkout register 50, and makes a request for calculation of a payment amount. When the self-service checkout register 50 receives the request for calculation of the payment amount, the self-service checkout register 50 presents the number of commodity products targeted for the purchase, an amount of money for the purchase, and the like, and then, performs a process of calculation of the payment amount. The self-service checkout register 50 registers, as self-service checkout register data (commodity product information) in a storage unit, information on the commodity products scanned in a period of time between a point at which the user 2 starts the scan and a point at which the user 2 makes the request for calculation of the payment amount, and then, transmits the information to the information processing apparatus 100.
The administrator terminal 60 is one example of a terminal device that is used by an administrator of the store. For example, the administrator terminal 60 may be a mobile terminal device carried by the administrator of the store. In addition, the administrator terminal 60 may be a personal computer, such as a desktop personal computer or a laptop personal computer. In this case, the administrator terminal 60 may be arranged in, for example, a backyard of the store, or may be arranged in an office located outside of the store. As one aspect, the administrator terminal 60 receives various notifications from the information processing apparatus 100. In addition, here, a terminal device that is used by the administrator of the store is cited as an example; however, the administrator terminal 60 may be a terminal device that is used by anyone who is involved in the store.
With this configuration, the information processing apparatus 100 acquires a video image of a person who is scanning a code of a commodity product at the self-service checkout register 50. Then, the information processing apparatus 100 specifies, on the basis of the acquired video image and a machine learning model (zero-shot image classifier), from among a plurality of commodity product candidates (texts) that are set in advance, a commodity product candidate that corresponds to the commodity product that is included in the video image. After that, the information processing apparatus 100 acquires the item of the commodity product that has been identified by the self-service checkout register 50 as a result of the scan of the code of the commodity product at the self-service checkout register 50. After that, the information processing apparatus 100 generates, on the basis of an item of the specified commodity product candidate and the item of the commodity product acquired from the self-service checkout register 50, an alert that indicates an abnormality of the commodity product that has been registered to the self-service checkout register 50.
As a result, as one aspect, the information processing apparatus 100 is able to output an alert at the time of detection of a label switch conducted in the self-service checkout register 50, and is thus able to suppress the label switch conducted in the self-service checkout register 50.
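The alert condition described above reduces, at its core, to a comparison between the item registered by the scan and the item (or item group) specified from the video image. A minimal sketch in Python, assuming both are available as plain strings (the function and argument names are illustrative, not from the embodiment):

```python
def detect_label_switch(scanned_item, specified_item, item_group):
    """Return True (a label switch is suspected) when the item registered
    by the scan matches neither the commodity product item specified from
    the video image nor the item group at the upper level of its hierarchy."""
    return scanned_item != specified_item and scanned_item not in item_group

# An alert would be generated only when the check returns True.
```

In this sketch, a scan of "low-priced grapes" against a recognized "high-priced grapes" would trigger the alert, while a matching pair would not.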
The communication unit 101 is a processing unit that controls communication with another device and is implemented by, for example, a communication interface or the like. For example, the communication unit 101 receives video image data from the camera 30, and outputs a processing result obtained by the control unit 110 to the administrator terminal 60.
The storage unit 102 is a processing unit that stores therein various kinds of data, a program executed by the control unit 110, or the like, and is implemented by, for example, a memory, a hard disk, or the like. The storage unit 102 stores therein a training data DB 103, a machine learning model 104, a hierarchical structure DB 105, a video image data DB 106, and a self-service checkout register data DB 107.
The training data DB 103 is a database that stores therein data that is used for a training of a first machine learning model 104A. For example, a case will be described by using an example illustrated in
In the correct answer information, a class of a person and an object that are the detection target, a class that indicates an interaction between a person and an object, and a bounding box (Bbox indicating region information on an object) that indicates a region of each of the classes are set. For example, as the correct answer information, region information on a Something class that indicates an object that is a commodity product or the like and that is other than a checkout bag, region information on a class of a person that indicates a user who purchases a commodity product, and a relationship (grip class) that indicates an interaction between the Something class and the class of the person are set. In other words, as the correct answer information, information on an object that is being gripped by a person is set.
In addition, as the correct answer information, region information on a class of a checkout bag that indicates a checkout bag, region information of a class of a person that indicates a user who uses the checkout bag, and a relationship (grip class) that indicates an interaction between the class of the checkout bag and the class of the person are set. In other words, as the correct answer information, information on the checkout bag that is being gripped by the person is set.
In general, if a Something class is learned by using object identification (object recognition), all of the backgrounds, clothes, small goods, and the like that are not related to a task are consequently detected. In addition, all of these items correspond to Something, so that a lot of Bboxes are merely identified in the image data and nothing meaningful is recognized. In the case of the HOID, it is possible to recognize a special relationship (there may be another case indicating sitting, operating, etc.) that indicates an object that is held by a person, so that the result can be used as meaningful information for a task (for example, a fraud detection task to be performed at the self-service checkout register). After an object has been detected as Something, a checkout bag or the like is identified as a unique class represented by Bag (checkout bag). The checkout bag is valuable information for the fraud detection task performed at the self-service checkout register, but is not valuable information for other tasks. Accordingly, it is worthwhile to use the HOID on the basis of knowledge unique to the fraud detection task performed at the self-service checkout register, namely, that a commodity product is taken out of a basket (shopping basket) and put into the bag in the course of the motion, and thus a useful effect is obtained.
A description will be given here by referring back to
The first machine learning model 104A may be implemented by, as merely one example, the HOID described above. In this case, the first machine learning model 104A identifies, from the input image data, a person, a commodity product, and a relationship between the person and the commodity product, and then outputs an identification result. For example, the items of “the region information on a class of a person, the region information on a class of a commodity product (object), and an interaction between the person and the commodity product” are output. In addition, here, a case will be described as an example in which the first machine learning model 104A is implemented by the HOID; however, the first machine learning model 104A may also be implemented by a machine learning model using various neural networks or the like.
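The HOID output described above can be thought of as a set of person-object records. The following sketch shows one possible shape of such a record and a filter for objects that are being gripped; the class and field names are assumptions for illustration, not the actual output format of the model:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HoidDetection:
    person_bbox: Tuple[int, int, int, int]  # region information on the class of a person (x, y, w, h)
    object_bbox: Tuple[int, int, int, int]  # region information on the class of an object
    object_class: str                       # e.g. "Something" (commodity product) or "Bag" (checkout bag)
    interaction: str                        # relationship between the person and the object, e.g. "grip"

def gripped_objects(detections: List[HoidDetection]) -> List[HoidDetection]:
    """Keep only the objects that a person is gripping."""
    return [d for d in detections if d.interaction == "grip"]
```

Filtering on the interaction class is what distinguishes the commodity product in the user's hand from unrelated detections in the frame.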
The second machine learning model 104B may be implemented by, as merely one example, a zero-shot image classifier. In this case, the second machine learning model 104B is constituted such that a list of texts and an image are used as an input, and the text that has the maximum degree of similarity to the image from among the texts included in the list is output as a label of the input image.
Here, as an example of the zero-shot image classifier described above, Contrastive Language-Image Pre-training (CLIP) may be cited. The CLIP mentioned here implements embedding of a plurality of types of data, namely, images and texts, what is called multi-modal embedding, into a feature space. In other words, in CLIP, by training an image encoder and a text encoder, it is possible to implement embedding in which the distance between vectors is closer for a pair of an image and a text that have a closer meaning. For example, the image encoder may be implemented by a Vision Transformer (ViT), or may be implemented by a convolutional neural network, such as a Residual Neural Network (ResNet). In addition, the text encoder may be implemented by a Transformer constituted of Generative Pre-trained Transformer (GPT) based architectures, or may be implemented by a recurrent neural network, such as Long Short-Term Memory (LSTM).
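The zero-shot classification step itself can be sketched as follows: both encoders map their inputs into the shared feature space, and the label of the text vector closest to the image vector (by cosine similarity) is returned. The toy vectors below stand in for encoder outputs; this illustrates only the selection rule, not CLIP itself:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_vec, text_vecs, labels):
    """Return the label whose text embedding is most similar to the image embedding."""
    sims = [cosine(image_vec, t) for t in text_vecs]
    return labels[max(range(len(sims)), key=sims.__getitem__)]
```

For example, when the image vector lies closest to the embedding of the text “fruit”, that text is output as the label of the input image.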
The hierarchical structure DB 105 is a database that stores therein a hierarchical structure in which an attribute of a commodity product is listed for each of a plurality of hierarchies. The hierarchical structure DB 105 stores data that is generated by a data generation unit 112, which will be described later, and corresponds to one example of reference source data that is referred to by the zero-shot image classifier that is used as one example of the second machine learning model 104B. For example, for the text encoder included in the zero-shot image classifier, the list in which the texts that correspond to the respective attributes of the commodity products belonging to the same hierarchy are listed is referred to in order from a hierarchy at an upper level, that is, a shallow hierarchy, from among the hierarchies that are included in the hierarchical structure DB 105.
The video image data DB 106 is a database that stores therein the video image data that has been captured by the camera 30 that is installed in the self-service checkout register 50. For example, the video image data DB 106 stores therein image data, in units of frames, that is acquired by the camera 30 on the basis of each of the self-service checkout registers 50 or each of the cameras 30, an output result of the HOID obtained by inputting the acquired image data to the HOID, and the like.
The self-service checkout register data DB 107 is a database that stores therein various kinds of data acquired from the self-service checkout register 50. For example, the self-service checkout register data DB 107 stores therein, for each of the self-service checkout registers 50, an item name of a commodity product and the number of purchased commodity products that have been registered at a checkout register as the commodity products targeted for a purchase, a billing amount that is a total amount of money of all of the commodity products targeted for the purchase, and the like.
The control unit 110 is a processing unit that manages the entirety of the information processing apparatus 100 and is implemented by, for example, a processor or the like. The control unit 110 includes a machine learning unit 111, the data generation unit 112, a video image acquisition unit 113, a self-service checkout register data acquisition unit 114, a fraud detection unit 115, and an alert generation unit 118. In addition, the machine learning unit 111, the data generation unit 112, the video image acquisition unit 113, the self-service checkout register data acquisition unit 114, the fraud detection unit 115, and the alert generation unit 118 are implemented by an electronic circuit that is included in a processor or implemented by a process or the like that is executed by the processor.
The machine learning unit 111 is a processing unit that performs machine learning on the machine learning model 104. As one aspect, the machine learning unit 111 performs machine learning on the first machine learning model 104A by using each of the pieces of training data that are stored in the training data DB 103.
As another aspect, the machine learning unit 111 performs machine learning on the second machine learning model 104B. Here, an example in which the second machine learning model 104B is trained by the machine learning unit 111 included in the information processing apparatus 100 will be described; however, a trained second machine learning model 104B may be released on the Internet or the like, so that machine learning does not always need to be performed by the machine learning unit 111. In addition, the machine learning unit 111 is able to fine-tune the second machine learning model 104B in the case where the accuracy of the second machine learning model 104B is insufficient after the trained second machine learning model 104B is used in an operation of the self-service checkout register system 5.
From among these pairs of the images and the texts, the images are input to an image encoder 10I, and the texts are input to a text encoder 10T. The image encoder 10I, to which the images have been input, outputs vectors in each of which the received image is embedded in a feature space. In contrast, the text encoder 10T, to which the texts have been input, outputs vectors in each of which the received text is embedded in the feature space.
For example,
Here, for a training of the CLIP model 10, various formats are used for a caption of a text on a Web and thus the labels are undefined, so that an objective function called Contrastive objective is used.
In Contrastive objective, in the case of an ith image in the mini batch, the ith text corresponds to the correct pair, so that the ith text is set as a positive example, whereas all of the other texts are set as negative examples. In other words, a single positive example and N−1 negative examples are set for each of the pieces of training data, so that N positive examples and N²−N negative examples are consequently generated in the entirety of the mini batch. For example, in the case of the example of the similarity matrix M1, the elements of the N diagonal components that are indicated by an inverted display of black and white are set as positive examples, and the N²−N elements that are displayed in white are set as negative examples.
In the similarity matrix M1 having such a structure, the parameters of the image encoder 10I and the text encoder 10T are trained so as to maximize the degree of similarity of the N pairs corresponding to the positive examples and to minimize the degree of similarity of the N²−N pairs corresponding to the negative examples.
For example, taking the 1st image as an example, the 1st text is set as a positive example and the 2nd text and the subsequent texts are set as negative examples, and then a loss, for example, a cross entropy error, is calculated in the row direction of the similarity matrix M1. By calculating this kind of loss for each of the N images, the losses related to the respective images are obtained. In contrast, taking the 2nd text as an example, the 2nd image is set as a positive example and all of the images other than the 2nd image are set as negative examples, and then each of the losses is calculated in the column direction of the similarity matrix M1. By calculating this kind of loss for each of the N texts, the losses related to the respective texts are obtained. The parameters of the image encoder 10I and the text encoder 10T are then updated so as to minimize a statistical value of the losses related to these images and the losses related to these texts, for example, the average of the losses.
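The symmetric loss described above can be sketched in a few lines, assuming the N×N similarity matrix M1 is given as a nested list: a cross entropy error is taken along each row (per image) and each column (per text), and the average of the two is the value to minimize:

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross entropy error of a softmax over logits against the target index."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target] / sum(exps))

def contrastive_loss(sim):
    """Symmetric Contrastive objective over an N x N similarity matrix:
    for row i the ith text is the positive example; for column j, the jth image."""
    n = len(sim)
    image_loss = sum(softmax_cross_entropy(sim[i], i) for i in range(n)) / n
    text_loss = sum(softmax_cross_entropy([sim[i][j] for i in range(n)], j)
                    for j in range(n)) / n
    return (image_loss + text_loss) / 2
```

A similarity matrix with large diagonal components yields a small loss, which is exactly the state toward which the training drives the two encoders.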
As a result of training the image encoder 10I and the text encoder 10T that minimize this type of Contrastive objective, the CLIP model 10 that has been trained is generated.
A description will be given here by referring back to
More specifically, the data generation unit 112 acquires a commodity product list of commodity products provided in stores, such as supermarkets or convenience stores. Acquiring this kind of the commodity product list is implemented by acquiring a list of commodity products that have been registered to commodity product master in which, as merely one example, the commodity products provided in a store are stored as a database. Consequently, as merely one example, the commodity product list illustrated in
Furthermore, as merely one example, the data generation unit 112 acquires a template of a hierarchical structure illustrated in
Subsequently, the data generation unit 112 adds the attribute, that is, for example, the attribute related to a “price”, or the like, that has been assigned by the system definition or the user definition to each element that is included in the lowermost hierarchy, that is, for example, the first hierarchy at this point of time, in the template of the hierarchical structure. Hereinafter, the attribute related to the “price” is sometimes referred to as a “price attribute”. In addition, in the following, as merely one example of the attribute, the price attribute is cited as an example and details thereof will be described later; however, as an additional remark, another attribute related to, for example, a “color”, a “shape”, a “quantity in stock”, or the like may also be added.
Then, the data generation unit 112 extracts, for each element included in the lowermost hierarchy constituted in the hierarchical structure that is being generated, that is, for each element k corresponding to the price attribute belonging to the second hierarchy at the present moment, a commodity product item in which the degree of similarity to the element k is equal to or larger than a threshold th1.
After that, the data generation unit 112 calculates, for each element n included in the mth hierarchy from the first hierarchy to the M−1th hierarchy, excluding the Mth hierarchy that is the lowermost hierarchy among all of the M hierarchies included in the hierarchical structure that is being generated, a variation in prices V of the commodity product items belonging to the element n. After that, the data generation unit 112 determines whether or not the variation in prices V is equal to or less than a threshold th2. At this time, if the variation in prices V is equal to or less than the threshold th2, the data generation unit 112 decides to abort the retrieval of the hierarchies that are ranked lower than the hierarchy to which the element n belongs. In contrast, if the variation in prices V is not equal to or less than the threshold th2, the data generation unit 112 increments a loop counter m of the hierarchy by one, and iterates the calculation of the variation in prices and the threshold determination on the variation for each of the elements that belong to the hierarchy that is ranked one level lower than the hierarchy to which the element n belongs.
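The abort decision for each element n can be sketched as below. The concrete definition of the variation in prices V is not fixed by the description above; the sketch assumes the spread between the highest and lowest price as one possible choice:

```python
def price_variation(prices):
    """One possible definition of the variation in prices V: max minus min."""
    return max(prices) - min(prices)

def should_abort_descent(prices, th2):
    """Abort the retrieval of lower-ranked hierarchies when the variation
    in prices of the commodity product items is equal to or less than th2."""
    return price_variation(prices) <= th2
```

Under this choice, an element whose commodity items are all similarly priced becomes a stopping point, since mistaking one item for another there causes only a small difference in the billed amount.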
As merely one example, a case in which the first hierarchy illustrated in
In the following, a case in which the second hierarchy illustrated in
Furthermore, a case in which the second hierarchy illustrated in
After that, the data generation unit 112 iterates the search until it is decided to abort the search that is started for each element included in the first hierarchy, or until all of the elements included in the M−1th hierarchy have been searched. Then, the data generation unit 112 decides the depth of each of the routes in the hierarchical structure on the basis of the determination result of the variation in prices obtained at the time of the search described above.
As merely one example, if an element in which the variation in prices of the commodity product item is equal to or less than the threshold th2 is present in a route between the element at the highest level and the element at the lowermost hierarchy included in all of the M hierarchies constituted in the hierarchical structure, the data generation unit 112 sets the subject element as a terminal node. In contrast, if an element in which the variation in prices of the commodity product item is equal to or less than the threshold th2 is not present in a route between the element at the highest level and the element at the lowermost hierarchy included in all of the M hierarchies constituted in the hierarchical structure, the data generation unit 112 sets the element corresponding to the commodity product item as the terminal node.
For example, in the example illustrated in
In the following, in the example illustrated in
In this way, the hierarchical structure illustrated in
A list of class captions is input to the zero-shot image classifier that is one example of the second machine learning model 104B in accordance with the hierarchical structure as described above. For example, as the list of the class captions of the first hierarchy, a list of the text “fruit”, the text “fish”, and the like is input to the text encoder 10T included in the CLIP model 10. At this time, it is assumed that “fruit” is output by the CLIP model 10 as the label of the class that corresponds to an input image given to the image encoder 10I. In this case, as the list of the class captions of the second hierarchy, a list of the text “high-priced grapes” and the text “low-priced grapes” is input to the text encoder 10T that is included in the CLIP model 10.
In this way, a list in which the texts corresponding to the attributes of the commodity products belonging to the same hierarchy are listed in the order of the upper level hierarchy constituted in the hierarchical structure is input as the class captions that are used in the CLIP model 10. Consequently, it is possible to allow the CLIP model 10 to narrow down the candidates for the commodity product items in units of hierarchies. Consequently, it is possible to reduce the processing cost for task implementation as compared to a case in which a list of the texts corresponding to all of the commodity product items in a store is input as the class captions that are used in the CLIP model 10.
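The narrowing described above amounts to a walk down the hierarchy: at each level the classifier chooses one label from the class captions of that level, and the walk stops at a terminal node. A sketch, with a stub standing in for the CLIP model 10 and hypothetical node names:

```python
def narrow_down(tree, image, classify):
    """tree maps a label to the list of its child labels; a label with no
    entry (or no children) in the tree is a terminal node."""
    label = classify(image, tree["root"])
    while tree.get(label):
        label = classify(image, tree[label])
    return label

# Stub classifier: picks the candidate contained in the image's label set.
# A real system would call the zero-shot image classifier here instead.
def stub_classify(image_labels, candidates):
    return next(c for c in candidates if c in image_labels)
```

With a tree such as `{"root": ["fruit", "fish"], "fruit": ["high-priced grapes", "low-priced grapes"]}`, only the captions of one branch are ever passed to the text encoder at each level, rather than the captions of every commodity item in the store.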
Furthermore, in the hierarchical structure that is to be referred to by the CLIP model 10, an element that belongs to the hierarchy that is ranked lower than the hierarchy that includes the element in which the variation in prices of the commodity product item is equal to or less than the threshold th2 is omitted, so that it is possible to perform clustering on the commodity product items in each of which the amount of damage is small at the time of occurrence of a fraudulent act. Consequently, it is possible to implement a further reduction in the processing cost for the task implementation.
In addition, in stores, such as supermarkets and convenience stores, a large number of types of commodity products are present and the life cycle of each of the commodity products is short, so that commodity products are frequently replaced.
The hierarchical structure data that is to be referred to by the CLIP model 10 covers the plurality of commodity product candidates that are arranged inside the store at the present moment, from among the candidates for the large number of types of commodity products that are targeted for replacement. That is, only a part of the hierarchical structure referred to by the CLIP model 10 needs to be updated in accordance with the replacement of the commodity products that are arranged inside the store. Consequently, it is possible to easily manage, from among the candidates for the large number of types of commodity products that are targeted for replacement, the plurality of commodity product candidates that are arranged inside the store at the present moment.
A description will be given here by referring back to
The self-service checkout register data acquisition unit 114 is a processing unit that acquires, as the self-service checkout register data, information on the commodity product that has been registered at a checkout register of the self-service checkout register 50. The registration “at a checkout register” mentioned here can be implemented not only by scanning a commodity product code that is printed on or attached to the commodity product, but also by the user 2 manually inputting the commodity product code. In this case, the user interface may include a field for inputting the number of commodity products. As described in the latter case, the reason for allowing the user 2 to manually input the commodity product code is that it is not always possible to print or attach labels of codes on all of the respective commodity products. In this way, the self-service checkout register data that has been acquired in response to the registration operation performed at a checkout register of the self-service checkout register 50 is stored in the self-service checkout register data DB 107.
The fraud detection unit 115 is a processing unit that detects various fraudulent acts on the basis of the video image data obtained by capturing the surrounding area of the self-service checkout register 50. As illustrated in
The first detection unit 116 is a processing unit that detects a fraudulent act, called a label switch, that is performed by replacing the label attached to a high-priced commodity product with a label for a low-priced commodity product and scanning the low-priced label.
As one aspect, the first detection unit 116 starts a process in the case where a new commodity product code has been acquired by way of a scan performed at the self-service checkout register 50. In this case, the first detection unit 116 retrieves, from among the frames stored in the video image data DB 106, a frame that corresponds to the time at which the commodity product code was scanned. Then, the first detection unit 116 generates an image of the commodity product that is being gripped by the user 2 on the basis of the output result obtained from the HOID for the frame in which a hit occurs in the retrieval. Hereinafter, the image of the commodity product that is being gripped by the user 2 is sometimes referred to as a “gripped commodity product image”.
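The frame retrieval and gripped commodity product image generation can be sketched as follows. The frame fields (`timestamp`, `image`, and the HOID `product_bbox`) are assumed names for illustration and are not the actual data layout.

```python
# Hypothetical sketch: find the stored frame at the scan time, then crop the
# commodity product Bbox reported by the HOID output for that frame.

def find_frame_at(frames, scan_time):
    """Return the stored frame whose timestamp is closest to the scan time."""
    return min(frames, key=lambda f: abs(f["timestamp"] - scan_time))

def crop_gripped_product(frame):
    """Crop the commodity product Bbox that the HOID reported for the frame."""
    x1, y1, x2, y2 = frame["hoid"]["product_bbox"]
    return [row[x1:x2] for row in frame["image"][y1:y2]]

frames = [
    {"timestamp": 10.0, "image": [[0] * 8 for _ in range(8)],
     "hoid": {"product_bbox": (1, 1, 4, 4)}},
    {"timestamp": 12.5, "image": [[1] * 8 for _ in range(8)],
     "hoid": {"product_bbox": (2, 2, 6, 6)}},
]
hit = find_frame_at(frames, 12.4)                # the frame at t=12.5 is closest
gripped_image = crop_gripped_product(hit)        # a 4x4 crop from that frame
```

In practice the image would be a decoded video frame rather than a list of pixel rows, but the selection and cropping steps are the same.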
After the gripped commodity product image has been generated in this way, the first detection unit 116 inputs the generated gripped commodity product image to the zero-shot image classifier that is one example of the second machine learning model 104B. Furthermore, the first detection unit 116 inputs to the zero-shot image classifier, in accordance with the hierarchical structure stored in the hierarchical structure DB 105, a list of the texts corresponding to the attributes of the commodity products belonging to the same hierarchy, in order from the upper level hierarchy. Consequently, the candidates for the commodity product items are narrowed down as the hierarchy of the text that is input to the zero-shot image classifier becomes deeper. After that, the first detection unit 116 determines whether or not the commodity product item that has been registered at a checkout register by way of the scan matches the commodity product item that has been specified by the zero-shot image classifier or a commodity product item group that is included in the attribute at the upper level thereof. At this time, if the two commodity product items do not match, it is possible to detect that a label switch has been performed. In addition, specifying the commodity product item by using the zero-shot image classifier will be described in detail later by using
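The final match determination may be sketched as below. The grouping of items under attribute nodes is an assumed toy example; when the classifier stops at an attribute node such as “low-priced grapes”, any item grouped under that attribute is treated as a match.

```python
# Hypothetical match check for the fraud determination: the registration is
# legitimate when the registered item equals the specified result itself or
# is one of the items grouped under that attribute. Data is an assumed toy.

ITEMS_BY_ATTRIBUTE = {
    "low-priced grapes": ["inexpensive grapes A", "inexpensive grapes B",
                          "imperfect grapes A"],
    "high-priced grapes": ["Shine Muscat grapes", "premium Kyoho grapes"],
}

def is_fraud(registered_item, specified):
    """True when the registered item matches neither the specified result
    nor the commodity product item group included in its attribute."""
    allowed = {specified} | set(ITEMS_BY_ATTRIBUTE.get(specified, ()))
    return registered_item not in allowed

fraud_detected = is_fraud("imperfect grapes A", "Shine Muscat grapes")  # mismatch
legitimate = is_fraud("imperfect grapes A", "low-priced grapes")        # in group
```

The same check serves both detection units; only the way the registered item was entered (scan versus manual input) differs.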
The second detection unit 117 is a processing unit that detects a fraudulent act, called the banana trick, that is performed by registering a low-priced commodity product at a checkout register instead of a high-priced commodity product that has no label. An operation of registering a commodity product without a label at a checkout register in this way is performed by a manual input by the user 2.
As merely one example, the self-service checkout register 50 sometimes receives an operation of registering a commodity product without a label at a checkout register by way of an operation performed on the selection screen of a commodity product without a code illustrated in
As another example, the self-service checkout register 50 sometimes receives an operation of registering a commodity product without a label at a checkout register by way of an operation performed on a retrieval screen for the commodity product without a code illustrated in
In the case where a manual input of a commodity product without a label is received by way of the selection screen 200 for a commodity product without a code or the retrieval screen 210 for a commodity product without a code, the manual input is not always performed on the self-service checkout register 50 while the user 2 is gripping the commodity product.
From this point of view, the second detection unit 117 starts the following process in the case where a new commodity product code has been acquired by way of a manual input performed on the self-service checkout register 50. As merely one example, the second detection unit 117 retrieves, from among the frames stored in the video image data DB 106, the latest frame in which a grip class has been detected by the HOID, going back from the time at which the manual input of the commodity product code was performed. Then, the second detection unit 117 generates a gripped commodity product image related to the commodity product without a label on the basis of the output result obtained from the HOID corresponding to the frame in which a hit occurs in the retrieval.
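The backward search for the latest grip frame can be sketched as follows; the frame fields are assumed names for illustration.

```python
# Hypothetical sketch of the backward search: pick the most recent frame, at
# or before the manual-input time, whose HOID output contains a grip class.

def latest_grip_frame(frames, input_time):
    candidates = [f for f in frames
                  if f["timestamp"] <= input_time and "grip" in f["hoid_classes"]]
    return max(candidates, key=lambda f: f["timestamp"], default=None)

frames = [
    {"timestamp": 1.0, "hoid_classes": ["grip"]},
    {"timestamp": 2.0, "hoid_classes": []},          # product already released
    {"timestamp": 3.0, "hoid_classes": ["grip"]},
    {"timestamp": 4.0, "hoid_classes": []},          # time of the manual input
]
hit = latest_grip_frame(frames, input_time=4.0)      # the frame at t=3.0
```

Returning `None` when no grip frame exists lets the caller skip the detection rather than analyze an unrelated frame.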
After the gripped commodity product image has been generated in this way, the second detection unit 117 inputs the generated gripped commodity product image to the zero-shot image classifier that is one example of the second machine learning model 104B. Furthermore, the second detection unit 117 inputs to the zero-shot image classifier, in accordance with the hierarchical structure stored in the hierarchical structure DB 105, a list of the texts corresponding to the attributes of the commodity products belonging to the same hierarchy, in order from the upper level hierarchy. Consequently, the candidates for the commodity product items are narrowed down as the hierarchy of the text that is input to the zero-shot image classifier becomes deeper. After that, the second detection unit 117 determines whether or not the commodity product item that has been registered at a checkout register by way of the manual input matches the commodity product item that has been specified by the zero-shot image classifier or a commodity product item group that is included in the attribute at the upper level thereof. At this time, if the two commodity product items do not match, it is possible to detect that a banana trick has been performed.
In the following, a process of specifying a commodity product item by using the zero-shot image classifier will be described by giving a case example.
As illustrated in
In contrast, the text encoder 10T included in the CLIP model 10 receives, as a list of the class captions, an input of the texts “fruit”, “fish”, “meat”, and “dairy products” that correspond to the elements included in the first hierarchy, in accordance with the hierarchical structure illustrated in
At this time, these texts of “fruit”, “fish”, “meat”, and “dairy products” can be input to the text encoder 10T without any change, but it is also possible to indirectly perform “prompt engineering” by converting the form of a class caption at the time of inference to the form of a class caption at the time of training. For example, given a template text of “a photograph of an {object}”, it is possible to input a text of “a photograph of fruit” by inserting a text that corresponds to the attribute of a commodity product, for example “fruit”, into the {object} portion.
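This prompt engineering step can be sketched as a simple template substitution; the template string below follows the example form in the text and is not a prescribed format.

```python
# Sketch of the prompt engineering above: each attribute text is inserted
# into the caption template used at training time before being encoded.

def to_class_caption(attribute, template="a photograph of {object}"):
    return template.replace("{object}", attribute)

captions = [to_class_caption(a)
            for a in ("fruit", "fish", "meat", "dairy products")]
```

The resulting captions, rather than the bare attribute texts, are what would be handed to the text encoder 10T.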
As a result, the text encoder 10T outputs an embedded vector T1 of the text “fruit”, an embedded vector T2 of the text “fish”, an embedded vector T3 of the text “meat”, . . . , and an embedded vector TN of the text “dairy products”.
Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 20 and each of the embedded vector T1 of the text “fruit”, the embedded vector T2 of the text “fish”, the embedded vector T3 of the text “meat”, and the embedded vector TN of the text “dairy products” is calculated.
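The similarity calculation can be sketched with toy vectors as follows. A real CLIP model outputs high-dimensional embeddings, so the three-dimensional vectors here are only stand-ins for I1 and T1..TN.

```python
import math

# Minimal sketch of the similarity calculation between the image embedding I1
# and the text embeddings T1..TN, using cosine similarity on stand-in vectors.

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

image_vec = [0.9, 0.1, 0.1]                       # stand-in for I1
text_vecs = {"fruit": [1.0, 0.0, 0.0],            # stand-ins for T1..TN
             "fish": [0.0, 1.0, 0.0],
             "meat": [0.0, 0.0, 1.0],
             "dairy products": [0.1, 0.9, 0.0]}
best = max(text_vecs, key=lambda t: cosine_similarity(image_vec, text_vecs[t]))
```

The element with the maximum similarity becomes the prediction result for the hierarchy being scored.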
As indicated by the inverted display of black and white illustrated in
The prediction result of “fruit” in the first hierarchy obtained in this way is not the terminal node in the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “high-priced grapes” and the embedded vector T2 of the text “low-priced grapes”. Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 20 and each of the embedded vector T1 of the text “high-priced grapes” and the embedded vector T2 of the text “low-priced grapes” is calculated.
As indicated by the inverted display of black and white illustrated in
The prediction result of “high-priced grapes” included in the second hierarchy obtained in this way is not the terminal node in the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “Shine Muscat grapes” and the embedded vector T2 of the text “premium Kyoho grapes”. Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 20 and each of the embedded vector T1 of the text “Shine Muscat grapes” and the embedded vector T2 of the text “premium Kyoho grapes” is calculated.
As indicated by the inverted display of black and white illustrated in
As described above, in the case example 1, the commodity product candidates are narrowed down to “fruit” by inputting, as the class captions to the text encoder 10T, the list of the attributes of the commodity products that correspond to the elements included in the first hierarchy. Then, the commodity product candidates are narrowed down to “high-priced grapes” by inputting, as the class captions to the text encoder 10T, the list of the attributes of the commodity products, from among the elements included in the second hierarchy, that belong to the hierarchy ranked below the element “fruit” that corresponds to the prediction result obtained in the first hierarchy. Furthermore, the commodity product candidates are narrowed down to “Shine Muscat grapes” by inputting, as the class captions to the text encoder 10T, the list of the attributes of the commodity products, from among the elements included in the third hierarchy, that belong to the hierarchy ranked below the element “high-priced grapes” that corresponds to the prediction result obtained in the second hierarchy. By performing this kind of narrowing process, as compared to a case in which the texts corresponding to all of the commodity product items in the store are input to the text encoder 10T, it is possible to specify that the commodity product item included in the gripped commodity product image 20 is “Shine Muscat grapes” while reducing the processing cost of task implementation.
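The narrowing process of the case example 1 can be sketched as a descent through the hierarchy in which only the children of the previous prediction are scored, so the texts for all items in the store are never input at once. The tree and the similarity scores below are assumed toy values standing in for the zero-shot image classifier.

```python
# Sketch of the hierarchical narrowing: at each level, score only the
# children of the previous prediction and keep the most similar element.

TREE = {
    "root": ["fruit", "fish", "meat", "dairy products"],
    "fruit": ["high-priced grapes", "low-priced grapes"],
    "high-priced grapes": ["Shine Muscat grapes", "premium Kyoho grapes"],
}

MOCK_SIMILARITY = {"fruit": 0.9, "fish": 0.1, "meat": 0.1, "dairy products": 0.1,
                   "high-priced grapes": 0.8, "low-priced grapes": 0.2,
                   "Shine Muscat grapes": 0.7, "premium Kyoho grapes": 0.3}

def narrow_down(node="root"):
    """Descend the hierarchy, keeping the most similar element per level,
    until a node with no children (a terminal node) is reached."""
    path = []
    while node in TREE:
        node = max(TREE[node], key=MOCK_SIMILARITY.get)
        path.append(node)
    return node, path

specified_item, path = narrow_down()
```

With the toy scores above, the descent visits “fruit”, then “high-priced grapes”, then stops at the terminal node “Shine Muscat grapes”.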
As merely one example, in the case where a commodity product item that has been registered at a checkout register by way of a manual input is “imperfect grapes A”, the commodity product item does not match the commodity product item “Shine Muscat grapes” that has been specified by the zero-shot image classifier. In this case, it is possible to detect that a banana trick has been performed.
As illustrated in
In contrast, in accordance with the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “fruit”, the embedded vector T2 of the text “fish”, the embedded vector T3 of the text “meat”, . . . , and the embedded vector TN of the text “dairy products”.
Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 21 and each of the embedded vector T1 of the text “fruit”, the embedded vector T2 of the text “fish”, the embedded vector T3 of the text “meat”, and the embedded vector TN of the text “dairy products” is calculated.
As indicated by the inverted display of black and white illustrated in
The prediction result of “fruit” included in the first hierarchy obtained in this way is not the terminal node in the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “high-priced grapes” and the embedded vector T2 of the text “low-priced grapes”. Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 21 and each of the embedded vector T1 of the text “high-priced grapes” and the embedded vector T2 of the text “low-priced grapes” is calculated.
As indicated by the inverted display of black and white illustrated in
The prediction result of “low-priced grapes” that is included in the second hierarchy obtained in this way is the terminal node in the hierarchical structure illustrated in
As described above, in the case example 2, as compared to the case example 1 described above, it is possible to omit a process of inputting, as the class captions, the three elements of “inexpensive grapes A”, “inexpensive grapes B”, and “imperfect grapes A” that are included in the third hierarchy and for which the variation in prices of the commodity product items is equal to or less than the threshold th2. Therefore, according to the case example 2, it is possible to implement a further reduction in the processing cost for the task implementation.
For example, in the case where the commodity product item that has been registered at a checkout register by way of a manual input is “imperfect grapes A”, the registered commodity product item matches the commodity product item “imperfect grapes A” that is included in the attribute of the commodity product “low-priced grapes” that has been specified by the zero-shot image classifier. In this case, it is possible to determine that a banana trick is not performed.
A description will be given here by referring back to
As one aspect, in the case where a fraud has been detected by the fraud detection unit 115, the alert generation unit 118 is able to generate an alert addressed to the user 2. This type of alert addressed to the user 2 can include the commodity product item that has been registered at a checkout register and the commodity product item that has been specified by the zero-shot image classifier.
As another aspect, in the case where a fraud has been detected by the fraud detection unit 115, the alert generation unit 118 is able to generate an alert addressed to the persons who are involved in the store, for example, an administrator. This sort of alert addressed to the administrator of the store can include the category of the fraud, identification information on the self-service checkout register 50 at which the fraud has been detected, an estimated amount of damage caused by the fraudulent act, and the like.
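The contents of the two alert forms can be sketched as follows; the field names and values are assumptions for illustration, not a prescribed alert format.

```python
# Hypothetical sketch of the two alert payloads generated by the alert
# generation unit: one addressed to the user, one to the store administrator.

def build_user_alert(registered_item, specified_item):
    """Alert shown to the user 2: the item registered versus the item seen."""
    return {"registered_item": registered_item,
            "specified_item": specified_item}

def build_admin_alert(fraud_category, register_id, damage_estimate):
    """Alert sent to the administrator: fraud category, where it occurred,
    and the estimated amount of damage."""
    return {"fraud_category": fraud_category,
            "register_id": register_id,
            "estimated_damage": damage_estimate}

user_alert = build_user_alert("imperfect grapes A", "Shine Muscat grapes")
admin_alert = build_admin_alert("banana trick", "self-checkout-50", 1700)
```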
In the following, the flow of processes performed by the information processing apparatus 100 according to the present embodiment will be described. Here, (1) a data generation process, (2) a video image acquisition process, (3) a first detection process, (4) a second detection process, and (5) a specifying process that are performed by the information processing apparatus 100 will be described in this order.
As illustrated in
Then, the data generation unit 112 performs a loop process 1 that iterates the process at Step S103 described below by the number of times corresponding to the element count K of the elements that are included in the lowermost hierarchy of the hierarchical structure and to which the respective attributes have been added in the template at Step S102. In addition, an example in which the process at Step S103 is iterated is described here; however, the process at Step S103 may be performed in parallel.
In other words, the data generation unit 112 extracts, from the commodity product list that has been acquired at Step S101, the commodity product items whose degree of similarity to the element included in the lowermost hierarchy of the hierarchical structure, that is, the element k related to the price attribute, is equal to or larger than the threshold th1 (Step S103).
As a result obtained from the loop process 1 performed in this way, the commodity product items belonging to the element k are accordingly clustered for each element k related to the price attribute.
After that, the data generation unit 112 performs a loop process 2 that iterates the processes from Step S104 to Step S106 described below on the hierarchies from the first hierarchy to the (M−1)th hierarchy, excluding the Mth hierarchy that is the lowermost hierarchy, from among all of the M hierarchies of the hierarchical structure after the clustering process performed at Step S103. Furthermore, the data generation unit 112 performs a loop process 3 that iterates the processes from Step S104 to Step S106 described below by the number of times corresponding to the element count N of the elements that are included in the mth hierarchy. In addition, an example in which the processes at Step S104 to Step S106 are iterated is described here; however, the processes at Step S104 to Step S106 may be performed in parallel.
In other words, the data generation unit 112 calculates a variation in prices V of the commodity product items belonging to the element n in the mth hierarchy (Step S104). After that, the data generation unit 112 determines whether or not the variation in prices V is equal to or less than the threshold th2 (Step S105).
At this time, if the variation in prices V is equal to or less than the threshold th2 (Yes at Step S105), the data generation unit 112 decides to abort retrieval of the hierarchy that is ranked lower than the hierarchy to which the element n belongs (Step S106). In contrast, if the variation in prices V is not equal to or less than the threshold th2 (No at Step S105), the retrieval of the hierarchy that is ranked lower than the hierarchy to which the element n belongs is continued, so that the process at Step S106 is skipped.
By performing the loop process 2 and the loop process 3 in this way, the retrieval is iterated until it is decided to abort the retrieval process that is started for each element included in the first hierarchy, or until all of the elements included in the (M−1)th hierarchy have been retrieved.
After that, the data generation unit 112 decides the depth of each of the routes in the hierarchical structure on the basis of the determination results of the variation in prices obtained during the retrieval process performed at Step S104 to Step S106 (Step S107).
As a result of the depth of each of the routes in the M hierarchies in the hierarchical structure being decided in this way, the hierarchical structure is determined. The hierarchical structure that has been generated in this way is stored in the hierarchical structure DB 105 included in the storage unit 102.
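The depth decision at Steps S104 to S106 can be sketched as follows; the prices, the groupings, and the threshold th2 are assumed toy values, and the standard deviation is used here as one possible measure of the variation in prices V.

```python
import statistics

# Sketch of the pruning decision: for each element, compute the variation in
# prices V of the items grouped under it and prune the lower hierarchy when
# V is at or below the threshold th2.

PRICES = {"Shine Muscat grapes": 2500, "premium Kyoho grapes": 1800,
          "inexpensive grapes A": 300, "inexpensive grapes B": 320,
          "imperfect grapes A": 280}

GROUPS = {"high-priced grapes": ["Shine Muscat grapes", "premium Kyoho grapes"],
          "low-priced grapes": ["inexpensive grapes A", "inexpensive grapes B",
                                "imperfect grapes A"]}

def prunes_lower_hierarchy(element, th2=100.0):
    """True when the price variation (standard deviation here) is <= th2,
    so the element becomes a terminal node of the hierarchical structure."""
    variation = statistics.pstdev(PRICES[i] for i in GROUPS[element])
    return variation <= th2
```

With these toy values, “low-priced grapes” becomes a terminal node (its items have nearly equal prices), while the hierarchy below “high-priced grapes” is retained, matching the behavior of the case examples.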
After that, the video image acquisition unit 113 associates, for each frame, the image data in the frame with the output result of the HOID related to the frame, stores the associated data in the video image data DB 106 (Step S203), and returns to the process at Step S201.
Then, the first detection unit 116 generates the gripped commodity product image of the commodity product that is gripped by the user 2 on the basis of the output result obtained from the HOID that corresponds to the frame in which a hit occurs in the retrieval at Step S302 (Step S303).
Then, the first detection unit 116 performs a “specifying process” for specifying a commodity product item by inputting the gripped commodity product image to the zero-shot image classifier, and inputting the list of the texts corresponding to the attributes of the commodity products included in each of the plurality of hierarchies to the zero-shot image classifier (Step S500).
After that, the first detection unit 116 determines whether or not the commodity product item that has been registered at a checkout register by way of a scan matches the commodity product item that has been specified at Step S500 or the commodity product item group that is included in the attribute of the commodity product located at the upper level of the specified commodity product item (Step S304).
At this time, if the two commodity product items do not match (No at Step S305), it is possible to detect that a label switch has been performed. In this case, the alert generation unit 118 generates an alert of the label switch that has been detected by the first detection unit 116, outputs the alert (Step S306), and then returns to the process at Step S301. In addition, if the two commodity product items match (Yes at Step S305), the alert generation unit 118 skips the process at Step S306 and returns to the process at Step S301.
Then, the second detection unit 117 generates a gripped commodity product image related to the commodity product without a label on the basis of the output result obtained from the HOID that corresponds to the frame in which a hit occurs in the retrieval specified at Step S402 (Step S403).
Then, the second detection unit 117 performs the “specifying process” for specifying the commodity product item by inputting the gripped commodity product image to the zero-shot image classifier, and inputting the list of the texts corresponding to the attributes of the commodity products included in each of the plurality of hierarchies to the zero-shot image classifier (Step S500).
After that, the second detection unit 117 determines whether or not the commodity product item that has been registered at a checkout register by way of a manual input matches the commodity product item that has been specified at Step S500 or the commodity product item group that is included in the attribute at the upper level of the specified commodity product item (Step S404).
At this time, if the two commodity product items do not match (No at Step S405), it is possible to detect that a banana trick has been performed. In this case, the alert generation unit 118 generates an alert of the banana trick that has been detected by the second detection unit 117, outputs the generated alert (Step S406), and then returns to Step S401. In addition, if the two commodity product items match (Yes at Step S405), the alert generation unit 118 skips the process at Step S406 and returns to Step S401.
Then, the fraud detection unit 115 performs the loop process 1 that iterates the processes at Step S503 to Step S505 described below over the range between the uppermost hierarchy and the lowermost hierarchy of the hierarchical structure that is referred to at Step S502. In addition, an example in which the processes at Step S503 to Step S505 are iterated is described here; however, the processes at Step S503 to Step S505 may be performed in parallel.
Furthermore, the fraud detection unit 115 performs the loop process 2 that iterates the processes at Step S503 and Step S504 described below by the number of times corresponding to the element count N of the elements that are included in the mth hierarchy. In addition, an example in which the processes at Step S503 and Step S504 are iterated is described here; however, the processes at Step S503 and Step S504 may be performed in parallel.
That is, the fraud detection unit 115 inputs the text corresponding to the element n included in the mth hierarchy to the text encoder 10T included in the zero-shot image classifier (Step S503). Then, the fraud detection unit 115 calculates the degree of similarity between the vector that is output by the image encoder 10I to which the gripped commodity product image has been input at Step S501 and the vector that is output by the text encoder 10T to which the text has been input at Step S503 (Step S504).
As a result obtained from the loop process 2, a similarity matrix between the N elements that are included in the mth hierarchy and the gripped commodity product image is generated. After that, the fraud detection unit 115 selects, from the similarity matrix, the element whose degree of similarity to the gripped commodity product image is the maximum among the N elements included in the mth hierarchy (Step S505).
After that, the fraud detection unit 115 increments the loop counter m of the hierarchy by one and iterates the loop process 1 on the N elements that belong to the hierarchy ranked one level lower, that is, the elements under the element selected at Step S505.
As a result obtained from the loop process 1, the text that is output by the zero-shot image classifier when the texts corresponding to the elements included in the lowermost hierarchy of the hierarchical structure are input is obtained as the result of specifying the commodity product item.
As described above, the information processing apparatus 100 acquires the video image that includes an object. Then, the information processing apparatus 100 inputs the acquired video image to the machine learning model (zero-shot image classifier) that refers to the reference source data in which an attribute of an object is associated with each of the plurality of hierarchies. Accordingly, the information processing apparatus 100 specifies the attribute of the object that is included in the video image from among the attributes of the objects that are included in the first hierarchy (melon and apple). After that, the information processing apparatus 100 specifies, by using the specified attribute of the object, the attribute of the object that is included in the second hierarchy (expensive melon and inexpensive melon) that is located below the first hierarchy. After that, the information processing apparatus 100 specifies, by inputting the acquired video image to the machine learning model (zero-shot image classifier), the attribute of the object that is included in the video image from among the attributes of the objects that are included in the second hierarchy.
Therefore, with the information processing apparatus 100, it is possible to implement detection of a fraudulent act conducted at a self-service checkout register by using the machine learning model (zero-shot image classifier), which needs neither preparation of a large amount of training data nor re-tuning in accordance with the life cycle of each of the commodity products.
In addition, the information processing apparatus 100 acquires the video image of the person who is scanning a code of a commodity product to the self-service checkout register 50. Then, the information processing apparatus 100 specifies, by inputting the acquired video image to the machine learning model (zero-shot image classifier), the commodity product candidate that corresponds to the commodity product that is included in the video image from among the plurality of commodity product candidates (texts) that are set in advance. After that, the information processing apparatus 100 acquires the item of the commodity product that has been identified by the self-service checkout register 50 by scanning the code of the commodity product to the self-service checkout register 50. After that, the information processing apparatus 100 generates an alert that indicates an abnormality of the commodity product that has been registered to the self-service checkout register 50 on the basis of the item of the specified commodity product candidate and the item of the commodity product that has been acquired from the self-service checkout register 50.
Therefore, with the information processing apparatus 100, as one aspect, it is possible to output an alert at the time of detection of a label switch conducted at the self-service checkout register 50, so that it is possible to suppress the label switch conducted at the self-service checkout register 50.
In addition, the information processing apparatus 100 acquires the video image on the person who grips the commodity product to be registered to the self-service checkout register 50. Then, the information processing apparatus 100 specifies, by inputting the acquired video image to the machine learning model (zero-shot image classifier), from among the plurality of commodity product candidates (texts) that have been set in advance, the commodity product candidate that corresponds to the commodity product included in the video image. After that, from among the plurality of commodity product candidates that are output by the self-service checkout register 50, the information processing apparatus 100 acquires the item of the commodity product that has been input by the person. After that, on the basis of the item of the acquired commodity product and the specified commodity product candidate, the information processing apparatus 100 generates an alert indicating an abnormality of the commodity product that has been registered to the self-service checkout register 50.
Therefore, according to the information processing apparatus 100, as one aspect, it is possible to output an alert at the time of detection of a banana trick performed at the self-service checkout register 50, so that it is possible to suppress a banana trick performed at the self-service checkout register 50.
In addition, the information processing apparatus 100 acquires the commodity product data and generates the reference source data in which an attribute of a commodity product is associated with each of the plurality of hierarchies on the basis of a variation relationship among the attributes of the commodity products that are included in the acquired commodity product data. After that, the information processing apparatus 100 sets the generated reference source data as the reference source data that is referred to by the zero-shot image classifier.
Therefore, according to the information processing apparatus 100, it is possible to implement a reduction in the number of pieces of data that are referred to by the zero-shot image classifier that is used to detect a fraudulent act conducted in the self-service checkout register 50.
In the above explanation, a description has been given of the embodiment of the device disclosed in the present invention; however, the present invention may also be implemented with various kinds of embodiments other than the embodiments described above.
First, an application example 1 of the hierarchical structure that has been described above in the first embodiment will be described. For example, in the hierarchical structure, in addition to the attributes of the commodity products, a label that indicates the number of commodity products and a label that indicates a unit of the number of items may be included.
As illustrated in
In this way, if the label that indicates the number of commodity products and the label that indicates a unit of the commodity products are included in the hierarchical structure, it is possible to implement, in addition to detection of the label switch described above, detection of a fraud, caused by a label switch, that is conducted by scanning a number of commodity products that is less than the actual number of purchased commodity products. Hereinafter, this fraud is sometimes referred to as a “label switch (the number of items)”.
A process of specifying the commodity product item performed at the time of detection of this type of label switch (the number of items) will be described by giving a case example.
As illustrated in
In contrast, the text encoder 10T included in the CLIP model 10 receives, as the list of the class captions, an input of the texts “fruit”, “fish”, “meat”, and “beverage” that correspond to the elements included in the first hierarchy, in accordance with the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “fruit”, the embedded vector T2 of the text “fish”, the embedded vector T3 of the text “meat”, . . . , and the embedded vector TN of the text “beverage”.
Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 22 and each of the embedded vector T1 of the text “fruit”, the embedded vector T2 of the text “fish”, the embedded vector T3 of the text “meat”, and the embedded vector TN of the text “beverage” is calculated.
As indicated by the inverted display of black and white illustrated in
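As merely one non-limiting illustrative sketch (not part of the embodiment), the similarity calculation between the embedded vector I1 of the gripped commodity product image and the embedded vectors T1 to TN of the class captions may be expressed as follows. The use of cosine similarity over L2-normalized vectors and the random stand-in embeddings are assumptions for illustration only; the actual encoders of the CLIP model 10 would supply the real vectors.

```python
import numpy as np

def similarity_to_captions(image_vec, text_vecs):
    """Degree of similarity between one image embedding and each text
    embedding (assumed here to be cosine similarity)."""
    img = image_vec / np.linalg.norm(image_vec)
    txts = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    return txts @ img  # one score per class caption

# Hypothetical stand-ins: I1 for the gripped commodity product image,
# T1..TN for the class captions of the first hierarchy.
rng = np.random.default_rng(0)
image_vec = rng.normal(size=512)        # stands in for encoder output I1
text_vecs = rng.normal(size=(4, 512))   # stands in for T1..TN

captions = ["fruit", "fish", "meat", "beverage"]
scores = similarity_to_captions(image_vec, text_vecs)
# The class caption with the highest degree of similarity is the prediction
# result for this hierarchy.
predicted = captions[int(np.argmax(scores))]
```

The prediction result for the hierarchy is simply the caption whose score is largest, which corresponds to the inverted black-and-white display in the drawings.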
The prediction result “beverage” included in the first hierarchy obtained in this way is not the terminal node in the hierarchical structure that is illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “canned beer A” and the embedded vector T2 of the text “canned beer B”. Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 22 and each of the embedded vector T1 of the text “canned beer A” and the embedded vector T2 of the text “canned beer B” is calculated.
As indicated by the inverted display of black and white illustrated in
The prediction result “canned beer A” in the second hierarchy obtained in this way is not the terminal node in the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “a single piece of canned beer A” and the embedded vector T2 of the text “a set of six canned beers A”. Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 22 and each of the embedded vector T1 of the text “a single piece of canned beer A” and the embedded vector T2 of the text “a set of six canned beers A” is calculated.
As indicated by the inverted display of black and white illustrated in
As a result of the narrowing down process performed as described above, it is possible to specify that the commodity product item included in the gripped commodity product image 22 is “canned beer A”, and also, it is possible to specify that the number of gripped commodity products is “6 pieces”. From the aspect of practical use, the first detection unit 116 performs the following determination, in addition to the determination of the label switch described above. That is, the first detection unit 116 determines whether or not the number of commodity product items that has been registered at a checkout register by way of a scan is less than the number of commodity product items that has been specified by the image analysis obtained by the zero-shot image classifier. At this time, if the number of commodity product items registered at a checkout register by way of a scan is less than the number of commodity product items specified by the image analysis, it is possible to detect a fraud that is conducted by scanning a number of commodity product items that is less than the actual number of purchased commodity product items, that is, a label switch (the number of items).
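The narrowing down process over the hierarchical structure may be sketched, as merely one non-limiting illustration, as the following loop. The tree representation, the `classify` stand-in, and the precomputed scores are all hypothetical; in the embodiment, the classification at each hierarchy would be performed by the CLIP model 10 as shown above.

```python
# Hypothetical hierarchical structure: each node maps a class caption to its
# children; an empty dict is a terminal node such as "a set of six canned beers A".
HIERARCHY = {
    "beverage": {
        "canned beer A": {
            "a single piece of canned beer A": {},
            "a set of six canned beers A": {},
        },
        "canned beer B": {},
    },
    "fruit": {},
}

def classify(image, captions):
    """Stand-in for the zero-shot image classifier: returns the caption
    with the highest degree of similarity. Here, 'image' is faked as a
    dict of precomputed scores for illustration only."""
    return max(captions, key=lambda c: image.get(c, 0.0))

def narrow_down(image, tree):
    """Descend the hierarchy, re-running the classifier on each level's
    class captions until a terminal node is reached."""
    node, path = tree, []
    while node:
        caption = classify(image, list(node.keys()))
        path.append(caption)
        node = node[caption]
    return path

fake_scores = {"beverage": 0.9, "canned beer A": 0.8,
               "a set of six canned beers A": 0.7}
path = narrow_down(fake_scores, HIERARCHY)
```

The last element of the resulting path is the terminal node, from which both the commodity product item and the number of items can be read.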
In the case where a fraud conducted by cheating on the number of commodity product items to be purchased has been detected in this way, that is, if the label switch (the number of items) has been detected by the first detection unit 116, the alert generation unit 118 is able to generate an alert addressed to the user 2. This type of alert addressed to the user 2 is able to include the number of commodity product items that has been registered at a checkout register and the number of commodity product items that has been specified by the image analysis obtained from the zero-shot image classifier.
As another aspect, if a label switch (the number of items) has been detected by the first detection unit 116, the alert generation unit 118 is able to generate an alert addressed to the persons who are involved in the store, for example, addressed to an administrator. As this sort of alert addressed to the administrator of the store, it is possible to include identification information on a category of a fraud and the self-service checkout register 50 in which the fraud has been detected, an estimated amount of damage caused by a fraudulent act, and the like.
In the following, a process of detecting the label switch (the number of items) described above will be described.
As illustrated in
Namely, if the commodity product items match (Yes at Step S305), the first detection unit 116 determines whether or not the number of commodity product items that has been registered at a checkout register by way of a scan is less than the number of commodity product items that has been specified by image analysis (Step S601).
Here, if the number of commodity product items that has been registered at a checkout register by way of a scan is less than the number of commodity product items that has been specified by image analysis (Yes at Step S601), it is possible to detect a label switch (the number of items), that is, a fraud conducted by scanning a number of commodity product items that is less than the actual number of purchased commodity product items. In this case, the alert generation unit 118 generates an alert of the label switch (the number of items) that has been detected by the first detection unit 116, outputs the generated alert (Step S602), and then returns to the process at Step S301.
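The determination at Step S601 and the alert generation at Step S602 may be sketched, as merely one non-limiting illustration, as follows. The data classes, field names, and message format are assumptions introduced for this sketch; the same comparison applies analogously to the manual-input case of the banana trick (the number of items).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Registration:
    item: str    # commodity product item registered at the checkout register
    count: int   # number of items registered by way of a scan

@dataclass
class ImageResult:
    item: str    # item specified by the zero-shot image classifier
    count: int   # number of items specified by image analysis

def detect_count_fraud(reg: Registration, img: ImageResult) -> Optional[str]:
    """Return an alert message when a label switch (the number of items) is
    suspected: the items match, but fewer items were registered than were
    specified by image analysis."""
    if reg.item != img.item:
        return None  # handled by the ordinary label-switch determination
    if reg.count < img.count:
        return (f"alert: {reg.count} item(s) of '{reg.item}' registered, "
                f"but image analysis specified {img.count}")
    return None

alert = detect_count_fraud(Registration("canned beer A", 1),
                           ImageResult("canned beer A", 6))
```

When the registered count equals or exceeds the count specified by image analysis, no alert is generated and the process simply returns to monitoring.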
As described above, it is possible to implement detection of the label switch (the number of items) by performing the first detection process in accordance with the hierarchical structure according to the application example 1.
In addition to the application example 1 described above, as another example of the hierarchical structure in which the element of the label that indicates the number of commodity products and the element of the label that indicates a unit of the number of commodity products are included, the hierarchical structure according to the application example 2 will be described as an example.
As illustrated in
In this way, if the label that indicates the number of commodity products and the label that indicates a unit of the commodity products are included in the hierarchical structure, in addition to the banana trick described above, it is possible to implement detection of a fraud conducted by manually inputting a number of commodity products that is less than the actual number of commodity products to be purchased. Hereinafter, such a fraud conducted by manually inputting a number of commodity products that is less than the actual number of commodity products to be purchased is sometimes referred to as a “banana trick (the number of items)”.
In this way, registration of a commodity product without a label performed at a checkout register is performed by a manual input by the user 2. As merely one example, the self-service checkout register 50 sometimes receives registration of a commodity product without a label by way of an operation performed on the selection screen of the commodity product without a code illustrated in
A process of specifying a commodity product item performed at the time of detection of the banana trick (the number of items) in this way will be described by giving a case example.
As illustrated in
In contrast, in the text encoder 10T included in the CLIP model 10, the texts of “fruit”, “fish”, “meat”, and “dairy products” that correspond to the elements included in the first hierarchy are input as the list of the class captions in accordance with the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “fruit”, the embedded vector T2 of the text “fish”, the embedded vector T3 of the text “meat”, . . . , and the embedded vector TN of the text “dairy products”.
Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 23 and each of the embedded vector T1 of the text “fruit”, the embedded vector T2 of the text “fish”, the embedded vector T3 of the text “meat”, and the embedded vector TN of the text “dairy products” is calculated.
As indicated by the inverted display of black and white illustrated in
The prediction result of “fruit” included in the first hierarchy obtained in this way is not the terminal node in the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “grapes A” and the embedded vector T2 of the text “grapes B”. Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 23 and each of the embedded vector T1 of the text “grapes A” and the embedded vector T2 of the text “grapes B” is calculated.
As indicated by the inverted display of black and white illustrated in
The prediction result “grapes A” included in the second hierarchy obtained in this way is not the terminal node in the hierarchical structure illustrated in
As a result, the text encoder 10T outputs the embedded vector T1 of the text “single bunch of grapes A” and the embedded vector T2 of the text “two bunches of grapes A”. Then, the degree of similarity between the embedded vector I1 of the gripped commodity product image 23 and each of the embedded vector T1 of the text “single bunch of grapes A” and the embedded vector T2 of the text “two bunches of grapes A” is calculated.
As indicated by the inverted display of black and white illustrated in
As a result of the narrowing down process performed as described above, it is possible to specify that the commodity product item included in the gripped commodity product image 23 is “grapes A”, and also, it is possible to specify that the number of gripped commodity products is “two bunches”. From the aspect of practical use, the second detection unit 117 performs the following determination, in addition to the determination of the banana trick described above. That is, the second detection unit 117 determines whether or not the number of commodity product items that has been registered at a checkout register by way of a manual input is less than the number of commodity product items that has been specified by the image analysis obtained by the zero-shot image classifier. At this time, if the number of commodity product items that has been registered at a checkout register by way of a manual input is less than the number of commodity product items that has been specified by the image analysis, it is possible to detect a fraud that is conducted by manually inputting a number of commodity product items that is less than the actual number of purchased commodity product items, that is, a banana trick (the number of items).
In the case where a fraud conducted by cheating on the number of commodity product items to be purchased has been detected in this way, that is, if a banana trick (the number of items) has been detected by the second detection unit 117, the alert generation unit 118 is able to generate an alert addressed to the user 2. This type of alert addressed to the user 2 is able to include the number of commodity product items that has been registered at a checkout register and the number of commodity product items that has been specified by the image analysis obtained from the zero-shot image classifier.
As another aspect, if a banana trick (the number of items) has been detected by the second detection unit 117, the alert generation unit 118 is able to generate an alert addressed to the persons who are involved in the store, for example, addressed to an administrator. As this sort of alert addressed to the administrator of the store, it is possible to include identification information on a category of a fraud and the self-service checkout register 50 in which the fraud has been detected, an estimated amount of damage caused by a fraudulent act, and the like.
In the following, a process of detecting the banana trick (the number of items) described above will be described.
As illustrated in
Namely, if the commodity product items match (Yes at Step S405), the second detection unit 117 determines whether or not the number of commodity product items that has been registered at a checkout register by way of a manual input is less than the number of commodity product items that has been specified by image analysis (Step S701).
Here, if the number of commodity product items that has been registered at a checkout register by way of a manual input is less than the number of commodity product items that has been specified by image analysis (Yes at Step S701), it is possible to detect a banana trick (the number of items), that is, a fraud conducted by manually inputting a number of commodity product items that is less than the actual number of purchased commodity product items. In this case, the alert generation unit 118 generates an alert of the banana trick (the number of items) that has been detected by the second detection unit 117, outputs the generated alert (Step S702), and then returns to the process at Step S401.
As described above, it is possible to implement detection of the banana trick (the number of items) by performing the second detection process in accordance with the hierarchical structure according to the application example 2.
In the application example 1 and the application example 2 described above, an example in which the element of the label that indicates the number of commodity products and the element of the label that indicates a unit of the number of commodity products are included in the third hierarchy has been described; however, the element of the label that indicates the number of commodity products and the element of the label that indicates a unit of the number of commodity products may be included in any one of the hierarchies.
As illustrated in
In this way, even when the element of the label that indicates the number of commodity products and the element of the label that indicates a unit of the number of commodity products are included in any one of the hierarchies, it is possible to detect a fraud by cheating the number of purchased commodity products conducted by the above described label switch (the number of items), the above described banana trick (the number of items), and the like.
In the first embodiment described above, as an example of the attribute of the commodity product, an example in which the price attribute is added to the template in addition to the category (the large classification and the small classification) has been described; however, the attribute of the commodity product is not limited to this. For example, from the aspect of enhancing the accuracy of embedding a text of a class caption included in the zero-shot image classifier into a feature space, an attribute of “color”, “shape”, or the like may be added to the template. In addition to this, from the viewpoint of suppressing a shortage of stock in a store, an attribute of “quantity in stock” or the like may be added to the template.
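Composing a class caption from a template with such attributes may be sketched, as merely one non-limiting illustration, as follows. The template wording and the attribute fields are assumptions introduced for this sketch, not the wording of the embodiment.

```python
def build_caption(item, category, price=None, color=None, shape=None):
    """Compose a class caption from a template; optional attributes such as
    price, color, and shape sharpen the text embedding in the feature space."""
    parts = [f"a photo of {item}, a kind of {category}"]
    if color:
        parts.append(f"colored {color}")
    if shape:
        parts.append(f"shaped like a {shape}")
    if price is not None:
        parts.append(f"priced at {price} yen")
    return ", ".join(parts)

caption = build_caption("canned beer A", "beverage",
                        price=200, color="gold", shape="cylinder")
```

The resulting caption string would then be fed to the text encoder of the zero-shot image classifier in place of the bare item name.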
In this way, as a result of the element of “color”, “shape”, or the like being added to the template as an example of the attribute of the commodity product, it is possible to enhance the accuracy of embedding a text of a class caption included in the zero-shot image classifier into a feature space.
In the first embodiment described above, the hierarchical structure data has been indicated as one example of the reference source data in which the attribute of the commodity product is associated with each of the plurality of hierarchies, and an example in which one or a plurality of commodity product candidates are specified by the zero-shot image classifier referring to the hierarchical structure data has been indicated. In addition, an example in which, in the hierarchical structure data, from among the candidates for a large number of types of commodity products that are targeted for the replacement, the class captions that correspond to the plurality of commodity product candidates that are arranged in the inside of a store at the present moment are listed has been indicated as merely one example; however, the example is not limited to this.
As merely one example, the hierarchical structure data may be generated, for each time of year, on the basis of the commodity products that arrive at the store at that time of year. For example, in the case where the replacement of the commodity products in the store is performed every month, the data generation unit 112 generates the hierarchical structure data as follows at each time of year. Namely, the hierarchical structure data is generated for each time of year in a scheme of, for example, the hierarchical structure data related to the commodity products arriving in November 2022, the hierarchical structure data related to the commodity products arriving in December 2022, the hierarchical structure data related to the commodity products arriving in January 2023, and the like. After that, the fraud detection unit 115 refers to the hierarchical structure data corresponding to the time at which the commodity product item is specified from among the pieces of hierarchical structure data that are stored for each time of year, and inputs the obtained result to the text encoder included in the zero-shot image classifier. Consequently, it is possible to change the reference source data that is referred to by the zero-shot image classifier in conformity to the replacement of the commodity products performed in the store. As a result, even if the life cycle of each of the commodity products stocked in the store is short, it is possible to implement stable accuracy of specifying a commodity product item both before and after the replacement of the commodity products.
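Selecting the hierarchical structure data that corresponds to the time at which the commodity product item is specified may be sketched, as merely one non-limiting illustration, as a lookup keyed by year and month. The storage layout and the placeholder hierarchies are assumptions introduced for this sketch.

```python
import datetime

# Hypothetical store of hierarchical structure data, one entry per time of year.
hierarchy_by_month = {
    (2022, 11): {"fruit": {}, "beverage": {}},
    (2022, 12): {"fruit": {}, "beverage": {}, "gift set": {}},
    (2023, 1):  {"fruit": {}, "dairy products": {}},
}

def hierarchy_for(ts: datetime.datetime) -> dict:
    """Return the hierarchical structure data for the time at which the
    commodity product item has been specified."""
    return hierarchy_by_month[(ts.year, ts.month)]

tree = hierarchy_for(datetime.datetime(2022, 12, 23, 10, 30))
```

The returned tree would then be used to build the list of class captions input to the text encoder, so that the reference source data tracks the monthly replacement of commodity products.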
The number of self-service checkout registers and cameras, examples of numerical values, examples of the training data, the number of pieces of training data, the machine learning models, each of the class names, the number of classes, the data formats, and the like that are used in the embodiment described above are only examples and may be arbitrarily changed. Furthermore, the flow of the processes described in each of the flowcharts may be changed as long as the processes do not conflict with each other. In addition, a model generated from various algorithms, such as a neural network, may be used for each of the models.
In addition, regarding a scan position and a position of a shopping basket, the information processing apparatus 100 is also able to use a known technology, such as another machine learning model for detecting a position, an object detection technology, or a position detection technology. For example, the information processing apparatus 100 is able to detect the position of a shopping basket on the basis of a difference between frames (image data) and a change in frames in time series, and may also generate another model by using the difference between frames and the change in frames in time series. Furthermore, by designating a size of the shopping basket in advance, the information processing apparatus 100 is also able to identify the position of the shopping basket in the case where an object with that size has been detected from the image data. In addition, the scan position is a position that is fixed to an extent, so that the information processing apparatus 100 is also able to identify the position designated by an administrator or the like as the scan position.
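Detection based on a difference between frames may be sketched, as merely one non-limiting illustration, as follows. The threshold value, array sizes, and grayscale representation are assumptions introduced for this sketch; a practical system would operate on the actual video frames.

```python
import numpy as np

def frame_difference(prev_frame: np.ndarray, curr_frame: np.ndarray,
                     threshold: int = 30) -> np.ndarray:
    """Return a boolean mask of pixels that changed between consecutive
    frames; a stand-in for locating a moving object such as a shopping
    basket from inter-frame motion."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

prev = np.zeros((4, 4), dtype=np.uint8)   # previous grayscale frame
curr = prev.copy()
curr[1:3, 1:3] = 255                      # a region that changed between frames
mask = frame_difference(prev, curr)
changed_pixels = int(mask.sum())
```

The connected region of changed pixels (here, the 2x2 block) would then be compared against a designated size to decide whether it is the shopping basket.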
The flow of the processes, the control procedures, the specific names, and the information containing various kinds of data or parameters indicated in the above specification and drawings can be arbitrarily changed unless otherwise stated. Furthermore, specific examples, distributions, numerical values, and the like described in the embodiment are only examples and can be arbitrarily changed.
Furthermore, the specific shape of a separate or integrated device is not limited to the drawings. For example, the video image acquisition unit 113 and the fraud detection unit 115 may be integrated, or the fraud detection unit 115 may be separated into the first detection unit 116 and the second detection unit 117. In other words, all or part of the device can be configured by functionally or physically separating or integrating any of the units in accordance with various loads or use conditions. In addition, all or any part of each of the processing functions performed by each of the devices can be implemented by a CPU and programs analyzed and executed by the CPU, or implemented as hardware by wired logic.
The communication device 100a is a network interface card or the like, and communicates with another device. The HDD 100b stores therein the programs and the DB that operate the functions illustrated in
The processor 100d operates the process that executes each of the functions described above in
In this way, the information processing apparatus 100 is operated as an information processing apparatus that executes an information processing method by reading and executing the programs. In addition, the information processing apparatus 100 is also able to implement the same functions as those described above in the embodiment by reading the programs described above from a recording medium by a medium reading device and executing the read programs. In addition, the programs described in this embodiment are not limited to being executed by the information processing apparatus 100. For example, the embodiment described above may also be similarly used in a case in which another computer or a server executes a program, or in a case in which another computer and a server cooperatively execute the program with each other.
The programs may be distributed via a network, such as the Internet. Furthermore, the programs may be executed by recording the programs in a computer-readable recording medium, such as a hard disk, a flexible disk (FD), a CD-ROM, a magneto-optical disk (MO), or a digital versatile disk (DVD), and causing a computer to read the programs from the recording medium.
In the following, the self-service checkout register 50 will be described.
The communication interface 400a is a network interface card or the like, and communicates with another device. The HDD 400b stores therein programs and data that operate each of the functions of the self-service checkout register 50.
The processor 400d is a hardware circuit that operates the process that executes each of the functions of the self-service checkout register 50 by reading the program that executes the process of each of the functions of the self-service checkout register 50 from the HDD 400b or the like and loading the read program in the memory 400c. In other words, the process executes the same function as that performed by each of the processing units included in the self-service checkout register 50.
In this way, by reading and executing the program for executing the process of each of the functions of the self-service checkout register 50, the self-service checkout register 50 is operated as an information processing apparatus that performs an operation control process. Furthermore, the self-service checkout register 50 is also able to implement each of the functions of the self-service checkout register 50 by reading the programs from a recording medium by a medium reading device and executing the read programs. In addition, the programs described in this embodiment are not limited to being executed by the self-service checkout register 50. For example, the present embodiment may also be similarly used in a case in which another computer or a server executes a program, or in a case in which another computer and a server cooperatively execute a program with each other.
Furthermore, the programs that execute the process of each of the functions of the self-service checkout register 50 can be distributed via a network, such as the Internet. Furthermore, these programs can be executed by recording the programs in a computer-readable recording medium, such as a hard disk, a flexible disk (FD), a CD-ROM, a magneto-optical disk (MO), or a digital versatile disk (DVD), and causing a computer to read the programs from the recording medium.
The input device 400e detects various input operations performed by a user, such as an input operation performed with respect to the programs executed by the processor 400d. Examples of the input operation include a touch operation or the like. In a case of the touch operation, the self-service checkout register 50 further includes a display unit, and the input operation detected by the input device 400e may be a touch operation performed on the display unit. The input device 400e may be, for example, a button, a touch panel, a proximity sensor, or the like. In addition, the input device 400e reads a bar code. The input device 400e is, for example, a bar code reader. The bar code reader includes a light source and an optical sensor and scans the bar code.
The output device 400f outputs data that is output from the program executed by the processor 400d via an external device, such as an external display device, that is connected to the self-service checkout register 50. In addition, in the case where the self-service checkout register 50 includes a display unit, the self-service checkout register 50 need not include the output device 400f.
According to an aspect of an embodiment, it is possible to suppress a label switch conducted in a self-service checkout register.
All examples and conditional language recited herein are intended for pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventors to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2022-207686 | Dec 2022 | JP | national |