This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2023-104038, filed on Jun. 26, 2023, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to an information processing program, an information processing method, and an information processing apparatus that identifies a person who performs an abnormal behavior and the behavior from a video.
For example, a technology for identifying a person who performs an abnormal behavior, such as shoplifting, from a monitoring video that is captured in any kind of facility, such as a store, by causing a computer to perform image recognition, and issuing an alert for giving a notice of abnormality is known. With this technology, it is possible to prevent occurrence of an incident.
In the technology as described above, for example, bounding boxes (Bboxes) that are rectangles enclosing areas including an object and a person are extracted from a video by using a machine learning model, and it is determined whether the person is performing an abnormal behavior by a positional relationship of the Bboxes.
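As a non-limiting illustration of such a positional determination, the overlap between two Bboxes may be measured by, for example, intersection over union (IoU); the function name, coordinates, and threshold logic below are hypothetical and are not part of the embodiment itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two Bboxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A person Bbox that overlaps a product Bbox may indicate that the
# person is interacting with the product (illustrative values only).
person = (100, 100, 200, 300)
product = (150, 200, 210, 260)
print(iou(person, product) > 0.0)  # → True
```

A system of this kind may, for example, treat an IoU above some tuned threshold as a candidate interaction to be examined further.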
However, it is difficult to detect a person or an object located in a blind spot of a monitoring camera, and therefore, it is not easy to accurately determine an abnormal behavior of the person from a video. Meanwhile, the blind spot of the monitoring camera may be, for example, a blind spot that occurs between imaging ranges of a plurality of cameras when the plurality of cameras capture images of different areas or a blind spot that occurs on the outside of an imaging range of a single camera, such as a swinging camera.
According to an aspect of an embodiment, a computer-readable recording medium has stored therein an information processing program that causes a computer to execute a process including: acquiring a video that is captured by one or more camera apparatuses; identifying a relationship for identifying a correlation between an object and a person included in the video by analyzing the acquired video; determining whether the person has performed an abnormal behavior on a product on an outside of an imaging range of the camera apparatus based on the identified relationship; and giving an alert based on a determination result on whether the person has performed the abnormal behavior on the product on the outside of the imaging range.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
Preferred embodiments of the present invention will be explained with reference to accompanying drawings. The present embodiment is not limited by the exemplary modes below. Further, each of the exemplary modes may be combined appropriately as long as no contradiction is derived.
First, image capturing performed by a monitoring camera and a blind spot of an imaging range will be described.
More than a few blind spots occur with respect to the imaging ranges of the camera apparatuses 110, although it depends on the number of the installed camera apparatuses 110 or a size of the facility in which image capturing is performed. In
The imaging range of the camera apparatus 120 moves as described above, and therefore, a position that is captured at a certain time is located out of the imaging range and located in a blind spot at a different time. For example, the person P is captured in the imaging range of the camera apparatus 120 on the left side in
An information processing system for implementing the present embodiment will be described below.
As the network 50, for example, it is possible to adopt any kind of communication network, such as an intranet or the Internet, that is used in any kind of facility, such as a store, regardless of whether the network is wired or wireless. Further, the network 50 need not always be a single network, but may be configured with, for example, an intranet and the Internet via a network device, such as a gateway, or a different device (not illustrated). Meanwhile, “in a facility” does not always indicate the inside of the facility, but may include the outside of the facility.
The information processing apparatus 10 is, for example, an information processing apparatus, such as a desktop personal computer (PC), a notebook PC, or a server computer, that is installed in any kind of facility, such as a store, and used by a security guard or the like. Alternatively, the information processing apparatus 10 may be a cloud computer apparatus that is managed by a service provider who provides a cloud computing service.
The information processing apparatus 10 receives, from the camera apparatus 100, a captured video of a predetermined imaging range in any kind of facility, such as a store, for example. Meanwhile, in a strict sense, the video includes a plurality of captured images that are captured by the camera apparatus 100, that is, a series of frames of moving images.
Further, the information processing apparatus 10 extracts an object including a person in any kind of facility, such as a store, from the video captured by the camera apparatus 100 by using a well-known technology, for example. Meanwhile, extraction of an object from the video may be extraction of a bounding box (Bbox) that is a rectangle enclosing an area including an object and a person from the video, for example. Furthermore, the information processing apparatus 10 identifies a relationship for identifying a correlation between an object and a person, such as the person holding or carrying the object, for example.
Moreover, the information processing apparatus 10 determines whether the person has performed an abnormal behavior on a product based on the identified relationship between the object and the person, for example. The abnormal behavior described herein indicates, for example, shoplifting or the like, and in particular, the information processing apparatus 10 determines whether the person has performed an abnormal behavior on the product on an outside of an imaging range of the camera apparatus 100 based on the identified relationship between the object and the person.
Furthermore, when determining that the person has performed an abnormal behavior on the product, the information processing apparatus 10 issues an alert related to appearance of the person who has performed the abnormal behavior, for example. Meanwhile, the alert is a mere warning, and a suspicious person who may have performed the abnormal behavior may be included in the person who has performed the abnormal behavior, for example. Moreover, the alert may be, for example, output of a voice, display of a message notification on a screen, or the like. Furthermore, a notification destination of the alert may be, for example, an output device included in the information processing apparatus 10, an externally-attached device, or a different output device that is communicably connected to the information processing apparatus 10 via the network 50.
Moreover, the information processing apparatus 10 may identify a location of the person who has performed an abnormal behavior based on an installation location of the camera apparatus 100 that has captured a video in which the relationship between the object and the person is identified or the like, and limit a notification destination of the alert, for example. Meanwhile, limitation of the notification destination of the alert indicates that, for example, a notice of the alert is limited to an information processing terminal that is carried by a security guard who is present near the person who has performed the abnormal behavior, a PC that is installed near the location of the person who has performed the abnormal behavior, or the like.
Furthermore, a security guard or the like in any kind of facility, such as a store, is able to receive a notice of the alert and prevent occurrence of shoplifting or the like by paying attention to the person who has performed the abnormal behavior and stopping the abnormal behavior.
Meanwhile, in
The camera apparatus 100 is a monitoring camera that is installed in, for example, any kind of facility, such as a store. The camera apparatus 100 may be, for example, the camera apparatuses 110 that are a plurality of monitoring cameras, the camera apparatus 120 that is a swinging camera, or the like as explained above with reference to
A functional configuration of the information processing apparatus 10 will be described below.
The communication unit 11 is a processing unit that controls communication with a different apparatus, such as the camera apparatus 100, and is, for example, a communication interface, such as a network interface card.
The storage unit 12 has a function to store various kinds of data and a program that is executed by the control unit 20, and is implemented by, for example, a storage apparatus, such as a memory or a hard disk. The storage unit 12 stores therein an image capturing DB 13, a camera installation DB 14, a model DB 15, a rule DB 16, and the like. Meanwhile, DB is an abbreviation of a database.
The image capturing DB 13 stores therein a plurality of captured images that are a series of frames captured by the camera apparatus 100. The plurality of captured images that are captured by the camera apparatus 100, that is, a video, is transmitted from the camera apparatus 100 as needed, received by the information processing apparatus 10, and stored in the image capturing DB 13.
The camera installation DB 14 stores therein information for identifying a place in which each of the camera apparatuses 100 is installed, for example. The information stored here may be set in advance by an administrator of the information processing system 1 or the like, for example.
The model DB 15 stores therein, for example, information on a machine learning model for identifying areas including an object and a person and a relationship between the object and the person from a video that is captured by the camera apparatus 100, and a model parameter for constructing the machine learning model. The machine learning model is generated by machine learning by using, for example, the video that is captured by the camera apparatus 100, that is, the captured image, as input data and using the areas including the object and the person and a type of the relationship between the object and the person as ground truth labels. Meanwhile, the type of the relationship between the object and the person may be, for example, the person holding the object, the person carrying the object, or the like, but embodiments are not limited to this example. Further, the areas including the object and the person may be bounding boxes (Bboxes) that are rectangles enclosing the areas on the captured image, for example.
Furthermore, the model DB 15 stores therein, for example, information on a machine learning model for acquiring a type of an object for generating a scene graph from a video and a relationship between objects, and a model parameter for constructing the machine learning model. Meanwhile, the type of an object for generating a scene graph may be referred to as a “class”, and the relationship between objects may be referred to as a “relation”. Moreover, the machine learning model is generated by machine learning by using a video that is captured by the camera apparatus 100, that is, a captured image, as input data, and locations (Bboxes) of objects included in the captured image, types of the objects, and a relationship between the objects as ground truth labels.
Furthermore, the model DB 15 stores therein, for example, information on a machine learning model for generating an Attention map (to be described later), and a model parameter for constructing the machine learning model. The machine learning model is generated by, for example, training by using a feature value of an object detected from the captured image as input data and an important area in the image as a ground truth label. Meanwhile, various kinds of machine learning models may be trained and generated by the information processing apparatus 10 or by a different information processing apparatus.
The rule DB 16 stores therein, for example, information on a rule for determining that a person has performed an abnormal behavior on a product. The information stored herein may be set in advance by, for example, an administrator or the like of the information processing system 1.
For example, if a person who is detected from a video indicates a relationship that is set in the “relationship” with respect to an object that is set in the “object” in the rule DB 16, the information processing apparatus 10 is able to determine that the person is likely to perform an abnormal behavior on the product. More specifically, for example, as indicated by a rule ID=1 in
Moreover, if the relationship of “hold” between the person and the product disappears in a video that is temporally later than the subject video, the information processing apparatus 10 is able to determine that, for example, the person may have performed an abnormal behavior, such as shoplifting, on the product. Meanwhile, when the relationship of “hold” between the person and the product disappears, the person may have put the product in a shopping basket or a shopping cart, and therefore, it may be possible to further add a determination condition for a case in which the person is not carrying a shopping basket or the like. The determination condition may also be stored in, for example, the rule DB 16 or the like because it is possible to perform determination based on the relationship between the person and the object, such as a relationship in which it is not indicated that the person “holds” a shopping basket or the like, for example.
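The determination described above can be sketched as follows; the per-frame relationship records, the person identifiers, and the helper function below are hypothetical and merely illustrate the rule of a disappearing “hold” relationship combined with the absence of a shopping basket.

```python
# Hypothetical relationship records: each dict maps a person ID to the
# set of (relation, object) pairs identified for that person in a frame.
def flag_suspicious(earlier, later):
    """Flag persons who held a product earlier, no longer hold it later,
    and are not carrying a shopping basket in the later frame."""
    suspects = []
    for person, rels in earlier.items():
        held_product = ("hold", "product") in rels
        later_rels = later.get(person, set())
        still_holds = ("hold", "product") in later_rels
        has_basket = ("hold", "basket") in later_rels
        if held_product and not still_holds and not has_basket:
            suspects.append(person)
    return suspects

earlier = {"P1": {("hold", "product")}, "P2": {("hold", "basket")}}
later = {"P1": set(), "P2": {("hold", "basket")}}
print(flag_suspicious(earlier, later))  # → ['P1']
```

In this toy example, P1 held a product, later holds nothing and carries no basket, and is therefore flagged, while P2 is not.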
Furthermore, if a behavior of putting the product into a bag of the person appears in the video, the information processing apparatus 10 is able to identify a location of occurrence of shoplifting or the like. However, even when the behavior does not appear, the information processing apparatus 10 is able to estimate a location of occurrence of shoplifting or the like from, for example, an installation location of the camera apparatus 100 that has captured the video in which the relationship between the person and the object is identified. More specifically, for example, explanation will be given using
Moreover, as an example of the abnormal behavior, such as shoplifting, for example, there may be a case in which the person once puts the product in a shopping basket or the like and then puts the product in a bag of the person on the outside of the imaging range of the camera apparatus 100. Therefore, for example, as indicated by a rule ID=2 in
Moreover, as another example of the abnormal behavior, such as shoplifting, for example, there may be a case in which the person may take away the product by cutting a security chain by using a tool, such as a chain cutter. Therefore, if a relationship between the person and a chain cutter detected from the video indicates “hold” or “carry”, the information processing apparatus 10 is able to determine that the person is likely to perform the abnormal behavior on the product. Furthermore, if the relationship between the person and the product is “hold” or “carry” in a video that is temporally later than the subject video, the information processing apparatus 10 is able to determine that, for example, the person may have performed the abnormal behavior, such as shoplifting, on the product.
Moreover, it may be possible to determine the abnormal behavior based on a relationship between the person detected from the video and a bag carried by the person, rather than the relationship between the person and the product itself. For example, if the relationship between the person and the bag detected from the video indicates “hold”, the information processing apparatus 10 is able to determine that the person is likely to perform the abnormal behavior on the product. Furthermore, if the relationship between the person and the bag indicates “hold” and the bag has become bigger in a video that is temporally later than the subject video, the information processing apparatus 10 is able to determine that, for example, the person may have performed the abnormal behavior, such as shoplifting, on the product.
In this manner, by setting the relationship between the person and the object that may lead to an abnormal behavior and that is alarming in the rule DB 16, it is possible to determine whether the person has performed the abnormal behavior from the relationship between the person and the object detected from the video. Meanwhile, the setting information in the rule DB 16 illustrated in
Furthermore, the information stored in the storage unit 12 as described above is a mere example, and the storage unit 12 may store various kinds of information other than the information as described above.
The control unit 20 is a processing unit that controls the entire information processing apparatus 10, and is, for example, a processor or the like. The control unit 20 includes an acquisition unit 21, an identification unit 22, a determination unit 23, and a notification unit 24. Meanwhile, each of the processing units is one example of an electronic circuit included in the processor or a process executed by the processor.
The acquisition unit 21 acquires a video that is captured by, for example, one or more camera apparatuses 100 in any kind of facility, such as a store, from the image capturing DB 13. Meanwhile, the video that is captured by the camera apparatus 100 is transmitted by the camera apparatus 100 to the information processing apparatus 10, received by the information processing apparatus 10, and stored in the image capturing DB 13 as needed.
The identification unit 22 analyzes, for example, the video that is acquired by the acquisition unit 21, and identifies a relationship for identifying a correlation between an object and a person included in the video. Meanwhile, each of the object and the person included in the video may be, for example, a first area that includes the object and a second area that includes the person. Further, the first area and the second area may be, for example, bounding boxes (Bboxes). Furthermore, the relationship to be identified may include, for example, a type of the relationship indicating that the person holds the object or the person carries the object. Moreover, the identification process as described above may include a process of generating a scene graph in which the first area, the second area, and the relationship are identified for each person included in the video by inputting the video that is acquired by the acquisition unit 21 to the machine learning model, for example. Generation of the scene graph will be described in detail below with reference to
In the example illustrated in
However, even the scene graph has a problem, and therefore, by solving the problem, the identification unit 22 is able to more accurately identify the relationship between the object and the person included in the video.
Therefore, in the present embodiment, an area that is important in terms of a context is adaptively extracted from the entire image with respect to each of a target Subject and a target Object for which a relationship is to be estimated, and then a target relationship is recognized. The area that is important for recognition of the relationship is extracted by, for example, generating a map (hereinafter, referred to as an “Attention map”) in which values from 0 to 1 are assigned in accordance with an importance level.
Estimation of a relationship between objects using the Attention map 180 will be described in detail below with reference to
First, feature extraction from the captured image performed by the image feature extraction unit 41 will be described.
Object detection from the image feature value performed by the object detection unit 42 will be described below.
Meanwhile, a rectangle of the Bbox may be represented by, for example, four real values, such as coordinates (x1, y1) of an upper left of the rectangle and coordinates (x2, y2) of a lower right of the rectangle. Further, the class that is output by the object detection unit 42 is, for example, a probability value indicating that the object detected in the Bbox is a detection target object that is determined in advance. More specifically, for example, if detection target objects are {cat, table, car}, in the example illustrated in
Feature values of a pair of the detected objects, which is obtained by the pair feature value generation unit 43, will be described below.
Moreover, the pair feature value generation unit 43 forms pairs as combinations of all of the detected objects while adopting one of the objects as a Subject and the other one of the objects as an Object. In a pair feature value 182 illustrated on the right side in
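The pair formation described above may be sketched as follows; the array shapes and the function below are illustrative assumptions, not the actual implementation of the pair feature value generation unit 43.

```python
import numpy as np

def make_pair_features(obj_feats):
    """Concatenate the features of every ordered (Subject, Object) pair.

    obj_feats: array of shape (N, D) -- one D-dim feature per detection.
    Returns an array of shape (N*(N-1), 2*D), one row per ordered pair.
    """
    n = obj_feats.shape[0]
    rows = []
    for s in range(n):            # Subject index
        for o in range(n):        # Object index
            if s != o:            # pair every detection with every other
                rows.append(np.concatenate([obj_feats[s], obj_feats[o]]))
    return np.stack(rows)

feats = np.arange(6, dtype=float).reshape(3, 2)  # 3 detections, 2-dim feats
pairs = make_pair_features(feats)
print(pairs.shape)  # → (6, 4)
```

With three detections, six ordered Subject–Object pairs are formed, each row holding the Subject feature followed by the Object feature.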
Extraction of feature values indicating a relationship between the detected objects as a pair, which is performed by the relationship feature extraction unit 44, will be described below.
First, as illustrated in
Subsequently, the relationship feature extraction unit 44 generates, by the Attention map generation unit, the Attention map 180 by making a correlation with the image feature value that is converted by the conversion unit (1) for each row of the pair feature value 182 that is generated by the pair feature value generation unit 43. Meanwhile, each row of the pair feature value 182 indicates each pair of the Subject and the Object. Further, the relationship feature extraction unit 44 may convert the Attention map 180 by MLP or Layer normalization after making a correlation between the pair feature value 182 and the image feature value that is converted by the conversion unit (1).
A correlation process between the single pair feature value 182 and the image feature value that is converted by the conversion unit (1) will be described in detail below. Meanwhile, it is assumed that the pair feature value 182 is adjusted to a C-dimensional vector through a previous process. Further, it is assumed that the image feature value converted by the conversion unit (1) is a tensor of H×W with a C-dimensional channel direction. Furthermore, attention is paid to a pixel (x, y) of the image feature value converted by the conversion unit (1), and the pixel is adopted as a pixel of interest. The pixel of interest is represented by 1×1×C, and therefore, is regarded as a C-dimensional vector. Moreover, the Attention map generation unit makes a correlation between the C-dimensional vector of the pixel of interest and the pair feature value 182 that is adjusted to the C-dimensional vector, and calculates a correlation value (scalar). Accordingly, the correlation value at the pixel of interest (x, y) is determined. The Attention map generation unit performs the above-described process on all of the pixels and generates the Attention map 180 of H×W×1.
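The correlation process described above may be sketched on a small toy tensor as follows; the function name and the shapes are illustrative only.

```python
import numpy as np

def attention_map(pair_feat, img_feat):
    """Correlate one C-dim pair feature with an H x W x C image feature.

    The value at pixel (y, x) of the returned H x W x 1 map is the dot
    product (correlation value) of that pixel's C-dim vector with the
    pair feature.
    """
    amap = np.einsum("hwc,c->hw", img_feat, pair_feat)
    return amap[..., None]  # shape (H, W, 1)

img_feat = np.ones((4, 5, 3))       # toy H=4, W=5, C=3 image feature
pair_feat = np.array([1.0, 2.0, 3.0])
amap = attention_map(pair_feat, img_feat)
print(amap.shape)  # → (4, 5, 1)
```

Each pixel of the toy map holds the same correlation value 6.0 here because the image feature is uniform; in practice the map would highlight the areas most correlated with the pair.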
Further, the relationship feature extraction unit 44 multiplies the image feature value converted by the conversion unit (2) by the generated Attention map 180 to obtain a weighted sum, and extracts a feature value of an important area in the entire image corresponding to the pair of the Subject and the Object. Meanwhile, the weighted sum is obtained in the entire image, and therefore, the feature value that takes the weighted sum is a C-dimensional feature value for a single pair of the Subject and the Object.
Furthermore, a weighted sum between the Attention map 180 and the image feature value converted by the conversion unit (2) will be described in detail below. Meanwhile, it is assumed that the image feature value converted by the conversion unit (2) is a tensor of H×W×C. First, the relationship feature extraction unit 44 multiplies the image feature value converted by the conversion unit (2) by the Attention map 180. In this case, the Attention map 180 is represented by H×W×1, and therefore, a channel is copied in a C-dimension. Moreover, the relationship feature extraction unit 44 sums up, over all of the pixels, the C-dimensional vectors obtained by the multiplication. Accordingly, a single C-dimensional vector is generated. In other words, one C-dimensional vector is generated for each single Attention map 180. Furthermore, in reality, the same number of the Attention maps 180 as the number of the pair feature values 182 are generated, and therefore, the same number of the C-dimensional vectors as the number of the pair feature values 182 are generated. Through the process as described above, the relationship feature extraction unit 44 obtains a weighted sum of the image feature values converted by the conversion unit (2) by using the Attention map 180 as a weight.
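The weighted sum described above may be sketched as follows; the toy shapes and the uniform map values are illustrative assumptions.

```python
import numpy as np

def weighted_sum(amap, img_feat):
    """Weight an H x W x C image feature by an H x W x 1 Attention map
    (broadcast over the C channels) and sum over all pixels, yielding
    one C-dim vector per map."""
    return (amap * img_feat).sum(axis=(0, 1))

amap = np.full((4, 5, 1), 0.5)  # toy uniform Attention map, H=4, W=5
img_feat = np.ones((4, 5, 3))   # toy H=4, W=5, C=3 image feature
vec = weighted_sum(amap, img_feat)
print(vec)  # → [10. 10. 10.]
```

Here each channel sums 0.5 over 20 pixels, so the single C-dimensional vector is (10, 10, 10); one such vector would be produced per Attention map, i.e., per pair feature value.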
Moreover, the relationship feature extraction unit 44 synthesizes, by the synthesis unit, the feature value of the important area that is extracted by the Attention map 180 and the pair feature value 182 that is generated by the pair feature value generation unit 43, and outputs a relationship feature value 183. More specifically, the relationship feature extraction unit 44 is able to use a value in which the feature value of the important area and the pair feature value 182 are connected in a dimensional direction. Furthermore, the relationship feature extraction unit 44 may connect the feature value of the important area and the pair feature value 182, and thereafter, convert the connected feature value by MLP or the like to adjust the number of dimensions.
Estimation of a relationship of each pair of the Subject and the Object performed by the relationship estimation unit 45 will be described below.
The processes for estimating the relationship between the objects by using the Attention map 180 as described above are collectively performed by the identification unit 22, as a relationship identification process for each of the objects, by using the NN 40.
First, the identification unit 22 extracts, from a video, a first feature value that corresponds to a first area including an object included in the video or a second area including a person included in the video, for example. Meanwhile, for example, the video may be a video that is captured by the camera apparatus 100 in any kind of facility, such as a store, and the first area and the second area may be Bboxes. Furthermore, the extraction process as described above corresponds to the process that is performed by the image feature extraction unit 41 for extracting the image feature value 181 from the captured image 170 as explained above with reference to
Subsequently, the identification unit 22 detects an object and a person included in the video from the extracted first feature value, for example. A process of detecting the object and the person as described above corresponds to the process that is performed by the object detection unit 42 for detecting Bboxes and classes of an object and a person from the image feature value 181 that corresponds to the first feature value.
Then, the identification unit 22 generates a second feature value that is a combination of the plurality of detected objects, the plurality of detected persons, and the first feature value of one of the object and the person in at least a single pair of the object and the person, for example. The generation process as described above corresponds to the process that is performed by the pair feature value generation unit 43 for generating the pair feature value 182 in which the feature values of the detected objects and the detected persons corresponding to the first feature value are arranged for each of the pairs as explained above with reference to
Subsequently, the identification unit 22 generates a first map that indicates the plurality of objects, the plurality of persons, and the relationship for identifying at least a single correlation between the object and the person based on the first feature value and the second feature value, for example. The generation process as described above corresponds to the process that is performed by the relationship feature extraction unit 44 for generating the Attention map 180 based on the image feature value 181 that corresponds to the first feature value and the pair feature value 182 that corresponds to the second feature value as explained above with reference to
Then, the identification unit 22 extracts a fourth feature value based on a third feature value that is obtained by converting the first feature value and based on the first map, for example. The extraction process as described above corresponds to the process that is performed by the relationship feature extraction unit 44 for extracting the relationship feature value 183 based on the feature value that is converted by the conversion unit (2) and the Attention map 180 that corresponds to the first map as described above with reference to
Further, the identification unit 22 identifies the relationship for identifying the correlation between the object and the person from the fourth feature value, for example. The identification process as described above corresponds to the process that is performed by the relationship estimation unit 45 for estimating and identifying a relationship (relation) between the object and the person from the relationship feature value 183 that corresponds to the fourth feature value as explained above with reference to
Furthermore, the identification unit 22 identifies a first person for whom the identified relationship between the object and the person temporally changes from a first relationship to a second relationship, based on the video that is acquired by the acquisition unit 21. Here, for example, the first person is a person who may have performed an abnormal behavior.
Moreover, for example, it is assumed that the relationship between the person detected from the video and a product that is one example of the object is that the person “holds” or “carries” the product, and this relationship is adopted as the first relationship. Furthermore, for example, it is assumed that the first relationship of “hold” or “carry” between the person and the product disappears in a video that is temporally later than the video in which the first relationship is identified, and a relationship between the person and the product in which the first relationship disappears is adopted as the second relationship.
Moreover, the identification unit 22 identifies, as the first person, a person for whom the relationship between the person and the object identified from the video temporally changes from the first relationship to the second relationship, that is, for example, a person who once held the product but who did not have the product at a later time. This is to identify the person as the first person who may have performed an abnormal behavior by assuming that the held product is subjected to shoplifting or the like, that is, an abnormal behavior may have been performed. Meanwhile, the identification unit 22 is able to identify the first relationship and the second relationship and then identify the first person by analyzing the scene graph, for example.
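The identification of the first person from temporally successive scene graphs may be sketched as follows; the triple representation and the helper function below are hypothetical simplifications of the scene graph analysis.

```python
# Hypothetical scene graphs per frame as (subject, relation, object) triples.
def find_first_person(graph_t0, graph_t1, relation="hold", obj="product"):
    """Return persons for whom (relation, obj) holds at t0 but not at t1,
    i.e., the first relationship has disappeared in the later frame."""
    held_before = {s for (s, r, o) in graph_t0 if r == relation and o == obj}
    holds_after = {s for (s, r, o) in graph_t1 if r == relation and o == obj}
    return held_before - holds_after

g0 = [("P1", "hold", "product"), ("P2", "hold", "product")]
g1 = [("P2", "hold", "product")]
print(sorted(find_first_person(g0, g1)))  # → ['P1']
```

In this toy example, P1 once held the product but no longer does in the later scene graph, so P1 would be identified as the first person.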
Furthermore, the first relationship and the second relationship that are changed from one to the other need not always be relationships between the same object and the person. For example, when the person cuts a security chain by using a tool, such as a chain cutter, and takes away the product, the first relationship is a relationship of “hold” between the person and the chain cutter, and the second relationship is a relationship of “hold” between the person and the product.
Moreover, the first relationship and the second relationship that are changed from one to the other may be relationships between the same object and the person, but a state of the object may be changed. For example, when the product is taken away by being put into a bag, the first relationship is a relationship of “hold” between the person and an empty bag, and the second relationship is a relationship of “hold” between the person and the bag filled with the product and other contents. Here, the information processing apparatus 10 is able to determine states, such as an empty bag and a bag filled with contents, by, for example, a change in the size of the bag.
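The temporal change from the first relationship to the second relationship described above can be sketched, for example, as a scan over per-frame scene-graph triples. The function name, relation labels, and data layout below are illustrative assumptions for explanation, not the actual implementation of the identification unit 22.

```python
# Illustrative sketch: identify a "first person" whose relationship with an
# object changes from a first relationship (e.g. "hold"/"carry") to a second
# relationship in which that relation has disappeared.

def find_first_persons(frames):
    """frames: list of scene-graph edge sets in time order, each a set of
    (person_id, relation, object_label) triples."""
    candidates = set()
    first_seen = {}  # (person_id, object_label) -> frame index of first "hold"
    for t, edges in enumerate(frames):
        current = {(p, o) for (p, r, o) in edges if r in ("hold", "carry")}
        # record newly observed first relationships
        for key in current:
            first_seen.setdefault(key, t)
        # a person who once held an object but no longer holds it
        for (p, o), t0 in first_seen.items():
            if t > t0 and (p, o) not in current:
                candidates.add(p)
    return candidates

frames = [
    {("person1", "hold", "product"), ("person2", "stand", "floor")},
    {("person1", "hold", "product")},
    {("person1", "hold", "bag")},  # the product is no longer held
]
print(find_first_persons(frames))  # {'person1'}
```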
Furthermore, the process of identifying the first relationship and the second relationship may include a process of identifying, from the video, an area including the object, an area including the person, the first relationship, and the second relationship by inputting the video acquired by the acquisition unit 21 to a machine learning model.
For example, the identification unit 22 inputs the video to the machine learning model, and identifies, from the video, the first area including an object, the second area including a person, and the first relationship for identifying a correlation between the object included in the first area and the person included in the second area. Further, the identification unit 22 inputs the video to the machine learning model, and identifies, from the video, a third area including an object, a fourth area including a person, and the second relationship for identifying a correlation between the object included in the third area and the person included in the fourth area.
Furthermore, the identification unit 22 identifies, for example, a first area in which an abnormal behavior on the product is performed based on the camera apparatus 100 that has performed image capturing. More specifically, the identification unit 22 identifies the first area in which the person has performed an abnormal behavior on the product from an installation location or an imaging range of the camera apparatus 100 that has captured the video in which the relationship between the first person and the product is identified, for example.
Moreover, for example, the identification unit 22 generates skeleton information on the person included in the video by analyzing the video that is acquired by the acquisition unit 21, and identifies the relationship for identifying a correlation between the object and the person included in the video based on the generated skeleton information. More specifically, the identification unit 22 extracts a bounding box (Bbox) that encloses an area including a person in a rectangle from the video that is acquired by the acquisition unit 21, for example. Then, the identification unit 22 generates the skeleton information by inputting, for example, image data of the extracted Bbox of the person to a trained machine learning model that is constructed by using an existing algorithm, such as DeepPose or OpenPose. For example, the identification unit 22 identifies a behavior of the person holding a predetermined object that is used for shoplifting of a product, based on the generated skeleton information. Furthermore, the identification unit 22 identifies a behavior of the person holding the product based on the skeleton information.
Furthermore, the identification unit 22 is able to determine a posture of the whole body of the person, such as stand, walk, squat, sit, or sleep, by using, for example, a machine learning model that is trained in advance for a skeleton pattern. For example, the identification unit 22 is able to determine the closest posture of the whole body by using a machine learning model that is trained by using Multilayer Perceptron for an angle between some joints in the skeleton information as illustrated in
Furthermore, the identification unit 22 is able to detect a motion of each of parts by determining a posture of the part based on a three-dimensional (3D) joint posture of the body. More specifically, the identification unit 22 is able to convert a two-dimensional (2D) joint coordinate to a 3D joint coordinate by using an existing algorithm, such as a 3D-baseline method.
Moreover, with respect to a part “arm”, for example, the identification unit 22 is able to detect whether left and right arms are oriented in any direction among forward, backward, leftward, rightward, upward, and downward directions (six types) by determining whether an angle between forearm orientation and each directional vector is equal to or smaller than a threshold. Meanwhile, the identification unit 22 is able to detect the arm orientation by a vector that is defined such that “a start point is an elbow and an end point is a wrist”.
Furthermore, with respect to a part “leg”, for example, the identification unit 22 is able to detect whether left and right legs are oriented in any direction from among forward, backward, leftward, rightward, upward, and downward directions (six types) by determining whether an angle between a lower leg orientation and each directional vector is equal to or smaller than a threshold. Meanwhile, the identification unit 22 is able to detect the lower leg orientation by a vector that is defined such that “a start point is a knee and an end point is an ankle”.
Moreover, with respect to a part “elbow”, for example, the identification unit 22 is able to detect that the elbow is extended if an angle of the elbow is equal to or larger than a threshold and the elbow is flexed if the angle is smaller than the threshold (two types). Meanwhile, the identification unit 22 is able to detect the angle of the elbow by an angle between a vector A that is defined such that “a start point is an elbow and an end point is a shoulder” and a vector B that is defined such that “a start point is an elbow and an end point is a wrist”.
Furthermore, with respect to a part “knee”, for example, the identification unit 22 is able to detect that the knee is extended if an angle of the knee is equal to or larger than a threshold and the knee is flexed if the angle is smaller than the threshold (two types). Meanwhile, the identification unit 22 is able to detect the angle of the knee by an angle between a vector A that is defined such that “a start point is a knee and an end point is an ankle” and a vector B that is defined such that “a start point is a knee and an end point is a hip”.
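The elbow and knee determinations above reduce to computing the angle between two keypoint vectors and comparing it with a threshold. A minimal sketch follows; the 150-degree threshold and the 2D keypoints are illustrative assumptions, not values from the embodiment.

```python
import math

def angle_between(v1, v2):
    """Angle in degrees between two vectors (2D or 3D)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def elbow_state(shoulder, elbow, wrist, threshold=150.0):
    """'extended' if the elbow angle is at or above the threshold,
    otherwise 'flexed'. Keypoints are (x, y) coordinates."""
    vec_a = (shoulder[0] - elbow[0], shoulder[1] - elbow[1])  # elbow -> shoulder
    vec_b = (wrist[0] - elbow[0], wrist[1] - elbow[1])        # elbow -> wrist
    return "extended" if angle_between(vec_a, vec_b) >= threshold else "flexed"

# straight arm along the x-axis: the elbow angle is 180 degrees
print(elbow_state((0, 0), (1, 0), (2, 0)))  # extended
# wrist raised perpendicular to the upper arm: 90 degrees
print(elbow_state((0, 0), (1, 0), (1, 1)))  # flexed
```

The knee determination is the same calculation with the knee as the vertex and the ankle and hip as the endpoints.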
Moreover, with respect to a part “hip”, the identification unit 22 is able to detect left twist and right twist (two types) by determining whether an angle between the hip and the shoulder is equal to or smaller than a threshold, and is able to detect that the hip is oriented forward if the angle is smaller than the threshold. Furthermore, the identification unit 22 is able to detect the angle between the hip and the shoulder from a rotation angle about an axial vector C that is defined such that “a start point is a midpoint of both hips and an end point is a midpoint of both shoulders”. Meanwhile, the angle between the hip and the shoulder is detected for each of a vector A that is defined such that “a start point is a left shoulder and an end point is a right shoulder” and a vector B that is defined such that “a start point is a left hip (hip (L)) and an end point is a right hip (hip (R))”, for example.
Moreover, the identification unit 22 identifies a position of a person included in each of the videos that are captured by the respective camera apparatuses 100 by a first index that is different for each of the camera apparatuses 100, for example. The first index is, for example, an image coordinate system in which the coordinates of a pixel at a left corner of an image that is a single frame of the video captured by the camera apparatus 100 are adopted as an origin (0, 0). The image coordinate system is different for each of the camera apparatuses 100, and therefore, the same coordinates in the images captured by the plurality of camera apparatuses 100 do not indicate the same position in a real space. Therefore, the identification unit 22 identifies the positions of the persons identified by the first indices by a second index that is common among the plurality of camera apparatuses 100, for example. The second index is a coordinate system that is common among the plurality of camera apparatuses 100 and that is obtained by, for example, converting the image coordinate system that is the first index by a projective transformation (homography) coefficient, and is referred to as a “floor map coordinate system” in contrast to the image coordinate system. Transformation from the image coordinate system to the floor map coordinate system will be described in detail below.
Calculation of the projective transformation coefficient that is used for transformation from the image coordinate system to the floor map coordinate system will be described below.
Furthermore, the identification unit 22 transforms the positions of the persons identified by the image coordinate system to the floor map coordinate system by using the calculated projective transformation coefficient and identifies the position, for example.
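The transformation above can be sketched as solving for a 3x3 homography from four assumed correspondences between image pixels and floor-map coordinates, then applying it to a person's position. The correspondence values below are hypothetical examples; a real deployment would calibrate them per camera.

```python
import numpy as np

def homography_from_points(src_pts, dst_pts):
    """Solve for the 3x3 projective transformation mapping four
    image-coordinate points to four floor-map points (direct linear
    transformation via SVD)."""
    rows = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.array(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def to_floor_map(h, point):
    """Apply the homography to an image-coordinate point."""
    x, y = point
    u, v, w = h @ np.array([x, y, 1.0])
    return (u / w, v / w)

# assumed correspondences: 640x480 image corners -> a 10 m x 8 m floor area
src = [(0, 0), (640, 0), (640, 480), (0, 480)]
dst = [(0, 0), (10, 0), (10, 8), (0, 8)]
h = homography_from_points(src, dst)
print(to_floor_map(h, (320, 240)))  # approximately (5.0, 4.0)
```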
Referring back to explanation of
More specifically, the first person is a person who indicates, for example, the first relationship of “hold” or “carry” with the product that is the object, and indicates the second relationship in which the first relationship disappears in a video that is temporally later than the video in which the first relationship is identified. With respect to the first person as described above, the determination unit 23 is able to determine that the person has performed an abnormal behavior, such as shoplifting, by, for example, putting the held product in a bag on the outside of the imaging range of the camera apparatus 100. In other words, the determination unit 23 determines whether the person has performed an abnormal behavior, such as shoplifting or a behavior that leads to shoplifting, on the product on the outside of the imaging range of the camera apparatus 100 based on the first relationship and the second relationship that are identified by the identification unit 22, for example.
Furthermore, for example, it is assumed that the first relationship of “hold” or the like is indicated between the person and a chain cutter that is the object, and the second relationship of “hold” or the like is indicated between the person and the product that is the object in a video that is temporally later than the video in which the first relationship is identified. In this case, the determination unit 23 is able to determine that the person has performed an abnormal behavior on the product.
Moreover, for example, it is assumed that the first relationship of “hold” or the like is indicated between the person and an empty bag that is the object, and the second relationship of “hold” or the like is indicated between the person and the bag filled with contents in a video that is temporally later than the video in which the first relationship is identified. In this case, the determination unit 23 is able to determine that the person has performed an abnormal behavior on the product.
Furthermore, if the person that is included in the second area and the person included in the fourth area, which are identified by the identification unit 22 from the video, are identical, the determination unit 23 compares the first relationship and the second relationship that are identified by the identification unit 22 and a rule that is set in advance. Here, the rule that is set in advance may be, for example, a rule that is set in the rule DB 16. Moreover, the determination unit 23 determines whether the person has performed an abnormal behavior on the product on the outside of the imaging range of the camera apparatus 100 based on, for example, a comparison result among the first relationship, the second relationship, and the rule that is set in advance.
Furthermore, the determination unit 23 determines whether the person included in each of the videos is an identical person based on the position of the person that is identified by the identification unit 22 by using the second index, for example. For example, the second index is the floor map coordinate system that is common among the plurality of camera apparatuses 100. Therefore, for example, when the floor map coordinate system indicated by the position of the person included in each of the videos captured by the plurality of camera apparatuses 100 is the same or located nearby in a predetermined range, the determination unit 23 is able to determine that the person included in each of the videos is an identical person.
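The identical-person determination above can be sketched as a distance check in the common floor map coordinate system. The 0.5 m radius below is an illustrative assumption for the "predetermined range", not a value from the embodiment.

```python
import math

def same_person(pos_a, pos_b, radius=0.5):
    """Treat two floor-map positions (expressed in the second, camera-common
    index) as the same person when they coincide or lie within a
    predetermined range of each other."""
    return math.dist(pos_a, pos_b) <= radius

# positions of a person seen by two cameras, mapped onto the floor map
print(same_person((3.2, 4.1), (3.4, 4.2)))  # True: within range
print(same_person((3.2, 4.1), (8.0, 1.0)))  # False: far apart
```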
Referring back to explanation of
Furthermore, the notification unit 24 gives an alert indicating that an abnormality has occurred on a first object, in association with the first area, for example. Meanwhile, the first area is, for example, an area that is identified by the identification unit 22 as an area in which the person may have performed an abnormal behavior on the product. Moreover, the alert may include information on the position of the first area, for example.
The flow of an abnormal behavior notification process performed by the information processing apparatus 10 will be described below.
First, as illustrated in
Furthermore, the information processing apparatus 10 inputs the videos acquired at Step S101 to the machine learning model and identifies, from the videos, an area including an object, an area including a person, and a relationship between the object and the person, for example (Step S102). Specifically, the information processing apparatus 10 analyzes a video in which the first area is captured among the plurality of acquired videos, and identifies a first relationship that identifies a correlation between the object and the person included in the video, for example. Moreover, the information processing apparatus 10 analyzes a video in which the second area is captured among the plurality of acquired videos, and identifies the second relationship that identifies a correlation between the object and the person included in the video. For example, the person appears in the first area. Further, the person moves to the second area through an area that is on the outside of the imaging ranges of the camera apparatuses 100. In this case, for example, the information processing apparatus 10 analyzes the video in which the first area is captured, and recognizes the first relationship indicating that the person “holds” each of the product and a predetermined object (for example, a bag) that is used for shoplifting of the product. Furthermore, for example, the information processing apparatus 10 analyzes the video in which the second area is captured, and recognizes the second relationship indicating that the person “holds” the predetermined object that is used for shoplifting of the product. In other words, the product that is “held” by the person in the first relationship is not “held” in the second relationship. Moreover, the predetermined object that is used for shoplifting of the product that is “held” by the person in the first relationship is continuously “held” in the second relationship. 
Meanwhile, the areas including the object and the person may be, for example, bounding boxes (Bbox) that are rectangles enclosing the object and the person in the video. Furthermore, the relationship between the object and the person may be that, for example, the person holds the product or the person carries the product. Meanwhile, a time of the video in which the second area is captured is later than a time of the video in which the first area is captured.
Subsequently, the information processing apparatus 10 determines whether the person has performed an abnormal behavior on the product based on, for example, the relationship between the object and the person that is identified at Step S102 (Step S103). Specifically, the information processing apparatus 10 determines whether the person has performed an abnormal behavior on the product on the outside of the imaging range of the camera apparatus 100 between the first area and the second area based on, for example, the first relationship and the second relationship. Specifically, when a temporal combination of the first relationship and the second relationship matches a rule that is set in advance, the information processing apparatus 10 determines that the person has performed an abnormal behavior on the product. For example, the information processing apparatus 10 recognizes that the predetermined object that is “held” by the person in the first relationship and that is used for shoplifting of the product is continuously “held” in the second relationship. Further, the information processing apparatus 10 recognizes that the product that is “held” by the person in the first relationship is not “held” in the second relationship. In this case, the information processing apparatus 10 determines that the person has performed an abnormal behavior on the product in an area that is located on the outside of the imaging ranges of the plurality of camera apparatuses 100. The abnormal behavior described herein indicates, for example, shoplifting or a behavior that leads to shoplifting on the outside of the imaging ranges of the camera apparatuses 100. If it is determined that the person has not performed an abnormal behavior on the product (Step S104: No), the abnormal behavior notification process illustrated in
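The rule comparison of Step S103 can be sketched as matching the temporal combination of objects "held" in the first and second relationships against a preset rule. The rule contents and labels below are hypothetical examples standing in for entries of a rule database.

```python
# Each rule: (labels required to be held in the first relationship,
#             labels required to still be held in the second relationship,
#             labels that must no longer be held in the second relationship).
RULES = [
    # held the product and a bag earlier; later holds only the bag
    ({"product", "bag"}, {"bag"}, {"product"}),
]

def is_abnormal(first_held, second_held):
    """first_held / second_held: sets of object labels the person 'holds'
    in the earlier and later videos, respectively."""
    for req_first, req_second, absent_second in RULES:
        if (req_first <= first_held
                and req_second <= second_held
                and not (absent_second & second_held)):
            return True
    return False

# product and bag held in the first area; only the bag in the second -> alert
print(is_abnormal({"product", "bag"}, {"bag"}))             # True
print(is_abnormal({"product", "bag"}, {"bag", "product"}))  # False
```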
In contrast, if it is determined that the person has performed an abnormal behavior on the product (Step S104: Yes), the information processing apparatus 10 gives an alert, for example (Step S105). Specifically, the information processing apparatus 10 gives an alert indicating that shoplifting or a behavior that leads to shoplifting has occurred in an area that is located between the first area and the second area and that is located outside of the imaging ranges of the plurality of camera apparatuses 100, for example. After execution of Step S105, the abnormal behavior notification process illustrated in
A flow of the relationship estimation process performed by the information processing apparatus 10 will be described below.
First, the information processing apparatus 10 acquires, for example, a video in which a predetermined imaging range is captured in any kind of facility, such as a store, by the camera apparatus 100, that is, an input image, from the image capturing DB 13 (Step S201). Meanwhile, the input image includes an image of a single frame of a video, and when a video is stored in the image capturing DB 13, a single frame is acquired as the input image from the video.
Furthermore, the information processing apparatus 10 extracts the image feature value 181 as an image feature of the input image, from the input image that is acquired at Step S201, for example (Step S202).
Subsequently, the information processing apparatus 10 detects, for example, a Bbox that indicates a location of each of objects included in the video and a class that indicates a type of each of the objects from the image feature value 181 that is extracted at Step S202 by using an existing technology (Step S203). Meanwhile, each of the objects detected herein may include a person, and in the following explanation, each of the objects may include a person.
Then, the information processing apparatus 10 generates, as the pair feature value 182, the second feature value that is a combination of the first feature values included in the objects in combinations of the objects detected at Step S203, for example (Step S204).
Subsequently, the information processing apparatus 10 synthesizes, for example, the feature value of an area that is important for estimation of the relationship and that is extracted by the Attention map 180 and the pair feature value 182, and extracts the relationship feature value 183 (Step S205). Meanwhile, the Attention map 180 is generated from the pair feature value 182 that is extracted at Step S204.
Furthermore, the information processing apparatus 10 estimates the relationship between the objects detected from the image, based on the relationship feature value 183 that is extracted at Step S205, for example (Step S206). Meanwhile, estimation of the relationship may be calculation of a probability for each type of the relationship, for example. After execution of Step S206, the relationship estimation process as illustrated in
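The overall flow of Steps S201 through S206 can be sketched structurally as a chain of stages. Every function below is a stand-in with hardcoded placeholder outputs; a real implementation would use trained networks for feature extraction, detection, the Attention map, and relationship estimation.

```python
import itertools

def extract_image_feature(image):          # S202: image feature value 181
    return {"feature": image}

def detect_objects(feature):               # S203: Bboxes and classes
    # placeholder detections: (class, bbox, first feature value)
    return [("person", (10, 10, 50, 120), [0.1, 0.9]),
            ("product", (40, 60, 60, 80), [0.8, 0.2])]

def pair_features(detections):             # S204: pair feature value 182
    return [(a, b, a[2] + b[2])
            for a, b in itertools.combinations(detections, 2)]

def relationship_features(pairs):          # S205: attention-weighted fusion
    return [(a[0], b[0], sum(f)) for a, b, f in pairs]

def estimate_relationships(rel_feats):     # S206: per-type probabilities
    # placeholder probabilities for each relationship type
    return [(subj, obj, {"hold": 0.7, "carry": 0.3})
            for subj, obj, _ in rel_feats]

pairs = pair_features(detect_objects(extract_image_feature("frame")))
print(estimate_relationships(relationship_features(pairs)))
```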
As described above, the information processing apparatus 10 acquires a video that is captured by the one or more camera apparatuses 100, identifies a relationship for identifying a correlation between an object and a person included in the video by analyzing the acquired video, determines whether the person has performed an abnormal behavior on a product on an outside of an imaging range of the camera apparatus 100 based on the identified relationship, and gives an alert based on a determination result on whether the person has performed the abnormal behavior on the product on the outside of the imaging range.
In this manner, the information processing apparatus 10 identifies a relationship between an object and a person from a video, determines whether the person has performed an abnormal behavior, such as shoplifting, on the outside of an imaging range of the camera apparatus 100 based on the identified relationship, and gives an alert. With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior and give an alert.
Furthermore, the information processing apparatus 10 acquires a plurality of videos that are captured by the plurality of camera apparatuses 100 installed in a store and that include different areas captured by the plurality of camera apparatuses 100, identifies a first relationship for identifying a correlation between an object and the person included in a video in which a first area is captured by analyzing the video in which the first area is captured among the plurality of acquired videos, identifies a second relationship for identifying a correlation between an object and the person included in a video in which a second area is captured by analyzing the video in which the second area is captured among the plurality of acquired videos, determines whether the person has performed an abnormal behavior on a product in an area that is located between the first area and the second area and that is located on an outside of imaging ranges of the plurality of camera apparatuses 100 based on the first relationship and the second relationship, and gives an alert if it is determined that the person has performed the abnormal behavior.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior.
Moreover, a time of the video in which the second area is captured is later than a time of the video in which the first area is captured.
With this configuration, the information processing apparatus 10 is able to more accurately identify, from the video, an area in which the person is performing an abnormal behavior.
Furthermore, the first relationship indicates that the person holds the product and a predetermined object that is used for shoplifting of the product, the second relationship indicates that the person holds the predetermined object that is used for shoplifting of the product, and when the predetermined object that is held by the person in the first relationship is also held in the second relationship and when the product that is held by the person in the first relationship is not held in the second relationship, the information processing apparatus 10 determines that the person has performed an abnormal behavior on the product in an area that is located between the first area and the second area and that is located on an outside of imaging ranges of the plurality of camera apparatuses 100.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior.
Moreover, the information processing apparatus 10 identifies a first person for whom the identified relationship temporally changes from a first relationship to a second relationship based on the acquired video, and the process that is performed by the information processing apparatus 10 for determining whether the person has performed an abnormal behavior on the product includes a process of determining whether the first person has performed an abnormal behavior on the product on the outside of the imaging range based on the identified relationship.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior.
Furthermore, the process that is performed by the information processing apparatus 10 for identifying the relationship includes a process of identifying, from the video, a first area including the object, a second area including the person, and a first relationship for identifying a correlation between the object included in the first area and the person included in the second area by inputting the acquired video to a machine learning model, and identifying, from the video, a third area including the object, a fourth area including the person, and a second relationship for identifying a correlation between the object included in the third area and the person included in the fourth area by inputting the acquired video to a machine learning model, and the process for determining whether the person has performed an abnormal behavior on the product includes, when the person included in the second area and the person included in the fourth area are identical, a process of determining whether the person has performed an abnormal behavior on the product by comparing the identified first relationship, the identified second relationship, and a rule that is set in advance.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior.
Moreover, the information processing apparatus 10 identifies an area that is an area in which the person has performed an abnormal behavior on the product, that is located between the first area and the second area, and that is located on an outside of imaging ranges of the plurality of camera apparatuses 100, based on the plurality of the camera apparatuses 100 that have performed image capturing, and the process that is performed by the information processing apparatus 10 for giving the alert includes a process of giving the alert indicating occurrence of abnormality on the product in association with the identified area that is located on the outside of the imaging ranges of the plurality of camera apparatuses 100.
With this configuration, the information processing apparatus 10 is able to more accurately give a notice indicating that the person is performing an abnormal behavior from the video.
Furthermore, the process that is performed by the information processing apparatus 10 for determining whether the person has performed an abnormal behavior on the product includes a process of determining whether the person has performed an abnormal behavior including one of shoplifting and a behavior that leads to shoplifting on the product on an outside of the imaging range of the camera apparatus 100 based on the identified first relationship and the identified second relationship.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior, such as shoplifting.
Moreover, the process that is performed by the information processing apparatus 10 for identifying the first person includes a process of generating a scene graph that identifies the relationship for each of the persons included in the video by inputting the acquired video to a machine learning model, and identifying the first person by analyzing the scene graph.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior.
Furthermore, the process that is performed by the information processing apparatus 10 for identifying the relationship includes a process of extracting a first feature value that corresponds to one of the object and the person from the video, detecting the object and the person included in the video from the extracted first feature value, generating a second feature value that is a combination of the plurality of detected objects, the plurality of detected persons, and the first feature value of one of the object and the person in at least a single pair of the object and the person, generating a first map that indicates the plurality of objects, the plurality of persons, and the relationship for identifying at least a single correlation between the object and the person based on the first feature value and the second feature value, extracting a fourth feature value based on a third feature value that is obtained by converting the first feature value and based on the first map, and identifying the relationship from the fourth feature value.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior.
Moreover, the process that is performed by the information processing apparatus 10 for identifying the relationship includes generating skeleton information on the person by analyzing the acquired video, identifying the first relationship based on the generated skeleton information, and identifying the second relationship based on the generated skeleton information.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior.
Furthermore, the information processing apparatus 10 identifies a position of the person included in each of the videos that are captured by the respective camera apparatuses 100 by a first index that is different for each of the camera apparatuses 100, identifies the positions of the persons identified by the first indices by using a second index that is common among the plurality of camera apparatuses 100, and determines whether the persons included in the respective videos are an identical person based on the positions of the persons identified by using the second index.
With this configuration, the information processing apparatus 10 is able to more accurately determine, from the video, that the person is performing an abnormal behavior.
The processing procedures, control procedures, specific names, and information including various kinds of data and parameters illustrated in the above-described document and drawings may be arbitrarily changed unless otherwise specified. In addition, specific examples, distributions, values, and the like described in the embodiments are examples, and may be changed arbitrarily.
Furthermore, specific forms of distribution and integration of the components of each of the apparatuses are not limited to those illustrated in the drawings. In other words, all or part of the components may be functionally or physically distributed or integrated in arbitrary units depending on various loads or use conditions. Moreover, for each processing function performed by each apparatus, all or any part of the processing function may be implemented by a CPU and a program analyzed and executed by the CPU or may be implemented as hardware by wired logic.
The communication apparatus 10a is a network interface card or the like and performs communication with a different information processing apparatus. The HDD 10b stores therein a program for operating the functions illustrated in
The processor 10d is a hardware circuit that reads a program for executing the same processes as each of the processing units illustrated in
In this manner, the information processing apparatus 10 operates as an information processing apparatus that executes an operation control process by reading the program for performing the same processes as those of each of the processing units illustrated in
Furthermore, the program that executes the same processes as those of each of the processing units illustrated in
According to one aspect, it is possible to more accurately determine that a person is performing an abnormal behavior from a video and give a notice.
All examples and conditional language recited herein are intended for pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventors to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment of the present invention has been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2023-104038 | Jun 2023 | JP | national |