This application is based upon and claims the benefit of priority from Japanese patent application No. 2022-88420, filed on May 31, 2022, the disclosure of which is incorporated herein in its entirety by reference.
The present invention relates to an image processing system, an apparatus, a processing method, and a program.
A technique related to the present invention is disclosed in Patent Documents 1 to 3 and Non-Patent Document 1.
Patent Document 1 (International Patent Publication No. WO2021/084677) discloses a technique for computing a feature value of each of a plurality of keypoints of a human body included in an image, searching for an image including a human body having a similar pose and a human body having a similar movement, based on the computed feature value, and grouping and classifying the similar poses and the similar movements together. Further, Non-Patent Document 1 (Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, P. 7291-7299) discloses a technique related to skeleton estimation of a person.
Patent Document 2 (Japanese Patent Application Publication No. 2021-60736) discloses a technique for estimating skeleton data about a person included in an image, computing a weight of a joint, based on a degree of reliability of an estimation result of each joint, and computing, by using the computed weight of each joint, a degree of similarity between the estimated skeleton data and skeleton data estimated from predetermined image data.
Patent Document 3 (International Patent Publication No. WO2022/009327) discloses a technique for computing a degree of similarity of a pose of a human body by using a feature value of each of a plurality of keypoints of a human body included in an image and a weight of each of the keypoints.
Faster search processing is required in the search processing of a similar image, based on a feature value of each of a plurality of keypoints of a human body included in an image. For example, in a scene in which a condition of the search processing (for example, a threshold value of a degree of similarity, a weight of each keypoint, and the like) is set, an operator repeatedly performs the search processing while adjusting the condition, and appropriately adjusts the condition while referring to the search result of each execution.
When a lot of time is required for the search processing in a scene in which such search processing is repeatedly performed, work efficiency is reduced.
Although Patent Document 1 and Non-Patent Document 1 disclose search processing of a similar image, based on a feature value of each of a plurality of keypoints of a human body included in an image, Patent Document 1 and Non-Patent Document 1 do not disclose a challenge to achieve faster search processing and a solving means thereof.
Although Patent Documents 2 and 3 disclose a technique for computing a degree of similarity by using a weight of each keypoint, Patent Documents 2 and 3 do not disclose a challenge to achieve faster search processing and a solving means thereof.
One example of an object of the present invention is, in view of the problem described above, to provide an image processing system, an apparatus, a processing method, and a program that solve a challenge to achieve faster search processing in the search processing of a similar image, based on a feature value of each of a plurality of keypoints of a human body included in an image.
One aspect of the present invention provides an image processing system including:
One aspect of the present invention provides an apparatus including:
One aspect of the present invention provides a processing method including,
One aspect of the present invention provides a program causing a computer to function as:
One aspect of the present invention achieves an image processing system, an apparatus, a processing method, and a program that solve a challenge to achieve faster search processing in the search processing of a similar image, based on a feature value of each of a plurality of keypoints of a human body included in an image.
The above-described object, the other objects, features, and advantages will become more apparent from the suitable example embodiments described below and the following accompanying drawings.
Hereinafter, example embodiments of the present invention will be described with reference to the drawings. Note that, in all of the drawings, a similar component has a similar reference sign, and description thereof will be appropriately omitted.
The target image acquisition unit 11 acquires a target image. The skeleton structure detection unit 12 performs processing of detecting a keypoint of a human body included in the target image. The first verification unit 13 extracts a first reference image whose relationship with the target image satisfies a first extraction condition from among a plurality of reference images, based on the detected keypoint. The second verification unit 14 extracts a second reference image whose relationship with the target image satisfies a second extraction condition from among the first reference images, based on the detected keypoint.
The image processing system 10 having such a configuration solves a challenge to achieve faster search processing in the search processing of a similar image, based on a feature value of each of a plurality of keypoints of a human body included in an image.
An image processing system 10 according to the present example embodiment is acquired by further embodying the image processing system 10 according to the first example embodiment. The image processing system 10 according to the present example embodiment performs, in two steps, processing of searching for a desired reference image from among a plurality of reference images. In other words, reference images are narrowed down to some extent in a first step, and a desired reference image is then searched for from among the narrowed-down reference images in a second step.
As illustrated in
In the present example embodiment, the server 1 performs the first step described above. In other words, the server 1 extracts a first reference image whose relationship with a target image satisfies a first extraction condition from among a plurality of reference images. Then, the client terminal 2 performs the second step described above. In other words, the client terminal 2 extracts a second reference image whose relationship with the target image satisfies a second extraction condition from the extracted first reference images (narrowed reference images). Hereinafter, a configuration of the image processing system 10 will be described in detail.
Next, one example of a hardware configuration of the image processing system 10 will be described. Each functional unit of the image processing system 10 is achieved by any combination of hardware and software, centering on a central processing unit (CPU) of any computer, a memory, a program loaded into the memory, a storage unit such as a hard disk that stores the program (which can also store a program downloaded from a storage medium such as a compact disc (CD), a server on the Internet, and the like, in addition to a program previously stored at a stage of shipping of the apparatus), and a network connection interface. It is understood by a person skilled in the art that there are various modification examples of the achievement method and the apparatus thereof.
The bus 5A is a data transmission path for the processor 1A, the memory 2A, the peripheral circuit 4A, and the input/output interface 3A to transmit and receive data to and from one another. The processor 1A is an arithmetic processing apparatus such as a CPU and a graphics processing unit (GPU), for example. The memory 2A is a memory such as a random access memory (RAM) and a read only memory (ROM), for example. The input/output interface 3A includes, for example, an interface for acquiring information from an input apparatus, an external apparatus, an external server, an external sensor, a camera, and the like, and an interface for outputting information to an output apparatus, an external apparatus, an external server, and the like. The input apparatus is, for example, a keyboard, a mouse, a microphone, a physical button, a touch panel, and the like. The output apparatus is, for example, a display, a speaker, a printer, a mailer, and the like. The processor 1A can output an instruction to each of the modules, and perform an arithmetic operation, based on an arithmetic result of the modules.
Next, a functional configuration of the image processing system 10 according to the present example embodiment will be described in detail.
The client terminal 2 can communicate with the server 1 via, for example, special-purpose software and a special-purpose application being preinstalled, or a program (such as a Web page) provided by the server 1, can also perform various types of processing, and can achieve a function of the target image acquisition unit 11 and the second verification unit 14. Hereinafter, a configuration of the functional unit of the image processing system 10 will be described.
The target image acquisition unit 11 acquires a target image. The target image is a still image being a target of processing performed by the skeleton structure detection unit 12, the first verification unit 13, and the second verification unit 14.
The target image acquisition unit 11 may receive a user input for specifying one from still images being stored in a predetermined accessible storage apparatus, and acquire the specified still image as a target image. In addition, the target image acquisition unit 11 may acquire, as a target image, a frame image specified by a user from a moving image. The moving image may be captured in the past, or may be a live image. For example, the target image acquisition unit 11 may receive a user input during reproduction of a moving image, and acquire, as a target image, a frame image displayed on a screen at a point in time at which the user input is received. In addition, the target image acquisition unit 11 may acquire, as a target image, a plurality of frame images in order at a time interval specified by a user from a moving image. Note that, the processing of acquiring a target image described herein is merely one example, which is not limited thereto.
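As a concrete illustration of acquiring a plurality of frame images at a time interval specified by a user, the following is a minimal sketch; it assumes OpenCV for decoding, and the file name, interval value, and function name are hypothetical and not part of the example embodiment.

```python
# Minimal sketch of acquiring target images from a moving image at a
# user-specified time interval (illustrative helper; not part of the embodiment).
import cv2

def acquire_target_images(video_path, interval_sec):
    """Return frame images sampled every `interval_sec` seconds from a moving image."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if the frame rate is unknown
    step = max(int(fps * interval_sec), 1)
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                  # each sampled frame is a candidate target image
        index += 1
    capture.release()
    return frames

# Example: sample one target image every 2 seconds from "input.mp4" (hypothetical file).
# target_images = acquire_target_images("input.mp4", interval_sec=2.0)
```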
As described above, in the present example embodiment, the client terminal 2 includes the target image acquisition unit 11. The target image acquisition unit 11 of the client terminal 2 receives an input for specifying a target image as described above via an input device (such as a touch panel, a physical button, a keyboard, a mouse, and a microphone) of the own apparatus. Then, the target image acquisition unit 11 stores the acquired target image in a storage apparatus in the client terminal 2. Further, the target image acquisition unit 11 transmits the acquired target image to the server 1.
The skeleton structure detection unit 12 performs processing of detecting a keypoint of a human body included in the target image. The skeleton structure detection unit 12 detects N (N is an integer of two or more) keypoints of a human body included in the target image. The processing by the skeleton structure detection unit 12 is achieved by using the technique disclosed in Patent Document 1. Although details will be omitted, in the technique disclosed in Patent Document 1, detection of a skeleton structure is performed by using a skeleton estimation technique such as OpenPose disclosed in Non-Patent Document 1. A skeleton structure detected in the technique is formed of a “keypoint” being a characteristic point such as a joint and a “bone (bone link)” indicating a link between keypoints.
For example, the skeleton structure detection unit 12 extracts a feature point that may be a keypoint from an image, refers to information acquired by performing machine learning on the image of the keypoint, and detects N keypoints of a human body. The detected N keypoints are predetermined. There is variety in the number of keypoints to be detected (that is, the value of N) and in which portions of a human body are detected as keypoints, and various variations can be adopted.
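To make the assumed shape of the detection result concrete, the following is a minimal sketch; the keypoint names, the confidence field, and the detect_keypoints function standing in for an OpenPose-style skeleton estimation technique are all illustrative assumptions.

```python
# Sketch of the detection output assumed above: N predetermined keypoints,
# each either detected (with image coordinates and a confidence) or missing.
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical set of N predetermined keypoints (the names are illustrative only).
KEYPOINT_NAMES = ["head", "neck", "right_shoulder", "left_shoulder",
                  "right_hand", "left_hand", "right_foot", "left_foot"]

@dataclass
class Keypoint:
    x: float        # image x coordinate
    y: float        # image y coordinate
    score: float    # detection confidence for this keypoint

def detect_keypoints(image) -> Dict[str, Optional[Keypoint]]:
    """Stand-in for an OpenPose-style skeleton estimation technique.

    Returns a mapping from every predetermined keypoint name to a Keypoint,
    or to None for a keypoint that could not be detected in the image.
    """
    raise NotImplementedError("replace with an actual skeleton estimation technique")

def detected_names(keypoints: Dict[str, Optional[Keypoint]]):
    """Names of the keypoints that were actually detected (used by later conditions)."""
    return {name for name, kp in keypoints.items() if kp is not None}
```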
For example, as illustrated in
The first verification unit 13 extracts a first reference image whose relationship with the target image satisfies a first extraction condition from among a plurality of reference images being preregistered, based on the keypoint detected by the skeleton structure detection unit 12.
The first extraction condition is a condition that a “degree of similarity of a pose of a human body included in an image” computed by a “first computation method” is “equal to or more than a first reference value”. In other words, the first verification unit 13 computes a degree of similarity between a pose of a human body included in the target image and a pose of a human body included in each reference image by the first computation method. Then, the first verification unit 13 extracts, as a first reference image, the reference image whose computed degree of similarity is equal to or more than the first reference value.
The second verification unit 14 extracts a second reference image whose relationship with the target image satisfies a second extraction condition from among the first reference images extracted by the first verification unit 13, based on the keypoint detected by the skeleton structure detection unit 12. In other words, the second verification unit 14 performs verification of the target image with, as verification targets, the reference images (first reference images) narrowed down by the first verification unit 13, and extracts a second reference image from among the first reference images.
The second extraction condition is a condition that a “degree of similarity of a pose of a human body included in an image” computed by a “second computation method” is “equal to or more than a second reference value”. In other words, the second verification unit 14 computes a degree of similarity between a pose of a human body included in the target image and a pose of a human body included in each first reference image by the second computation method. Then, the second verification unit 14 extracts, as a second reference image, the first reference image whose computed degree of similarity is equal to or more than the second reference value.
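As a rough sketch of the two verification steps described above, the following assumes placeholder similarity functions and reference values supplied by the caller; it is a sketch of the structure, not a definitive implementation.

```python
# Minimal sketch of the two verification steps: the first verification unit
# narrows all reference images down by the first extraction condition, and the
# second verification unit extracts second reference images from that narrowed
# set by the second extraction condition.

def extract_first_reference_images(target_keypoints, reference_images,
                                   similarity_first, first_reference_value):
    """First step: keep reference images whose degree of similarity with the target
    (first computation method) is equal to or more than the first reference value."""
    return [ref for ref in reference_images
            if similarity_first(target_keypoints, ref["keypoints"]) >= first_reference_value]

def extract_second_reference_images(target_keypoints, first_reference_images,
                                    similarity_second, second_reference_value):
    """Second step: verify only the narrowed-down first reference images against the
    target, using the second computation method and the second reference value."""
    return [ref for ref in first_reference_images
            if similarity_second(target_keypoints, ref["keypoints"]) >= second_reference_value]
```

Because the second function receives only the output of the first, it can be rerun with a different computation method or reference value without repeating the first step, which is the property used in the later example embodiments.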
The first computation method and the second computation method may be different from each other or may be the same. For example, in the first computation method and the second computation method, at least one of the number of keypoints and a kind of a keypoint being referred when a degree of similarity of a pose of a human body is computed may be different from each other.
Further, in the first computation method and the second computation method, setting contents of a weight of each keypoint being referred when a degree of similarity of a pose of a human body is computed may be different from each other. For example, in the first computation method, a degree of similarity of a pose of a human body may be computed by setting the same weight for all keypoints, and, in the second computation method, a degree of similarity of a pose of a human body may be computed based on a weight being set for each keypoint. Further, in the first computation method and the second computation method, weights of various keypoints may be different from each other.
Further, a first reference value and a second reference value may be set separately and independently. Thus, the first reference value and the second reference value may be the same value or may be different values.
Further, the first extraction condition and the second extraction condition may include other conditions different from each other. For example, at least one of the first extraction condition and the second extraction condition may include at least one of conditions that
A “predetermined number (minimum detection point)” and a “predetermined keypoint (necessary detection keypoint)” of the conditions may be predetermined, or may be able to be set by a user.
For example, the first extraction condition may include the condition, or the second extraction condition may include the condition.
In addition, both of the first extraction condition and the second extraction condition may include the condition. In that case, contents may be different from each other.
For example, when both of the first extraction condition and the second extraction condition include the condition that a “predetermined number or more of keypoints being referred when a degree of similarity of a pose of a human body is computed is detected”, the predetermined number may be able to be set separately and independently. In this case, the predetermined number in the first extraction condition and the predetermined number in the second extraction condition can be the same value or can be different values.
Further, when both of the first extraction condition and the second extraction condition include the condition that a “predetermined keypoint of keypoints being referred when a degree of similarity of a pose of a human body is computed is detected”, the predetermined keypoint may be able to be set separately and independently. In this case, a kind and the number of the predetermined keypoint in the first extraction condition and the predetermined keypoint in the second extraction condition can have the same content or can have different contents.
Further, in the second extraction condition, at least one of the plurality of items (the number of keypoints being referred when a degree of similarity of a pose of a human body is computed, a kind of a keypoint, a weight of each keypoint, a minimum detection point, and a necessary detection keypoint) described above may be able to be changed by a user input. Then, in the first extraction condition, the plurality of items described above may be fixed.
Herein, a specific example of the first extraction condition and the second extraction condition will be described. Note that, the example herein is merely one example, and the first extraction condition and the second extraction condition according to the present example embodiment are not limited to this.
The first extraction condition is a condition that a “degree of similarity of a pose of a human body computed based on all N keypoints is equal to or more than a first reference value”. Note that, the degree of similarity of the first extraction condition is computed with the same weight for all of the N keypoints.
The second extraction condition is a condition that a “degree of similarity of a pose of a human body computed based on some of N keypoints is equal to or more than a second reference value”. Note that, the degree of similarity of the second extraction condition is computed based on a weight being set for each keypoint.
Then, in the second extraction condition, the number of keypoints, a kind of a keypoint, and a weight of each keypoint being referred when a degree of similarity of a pose of a human body is computed can be changed by a user input. On the other hand, in the first extraction condition, the number of keypoints, a kind of a keypoint, and a weight of each keypoint being referred when a degree of similarity of a pose of a human body is computed are fixed.
Further, in the second extraction condition, the second reference value can be changed by a user input. In the first extraction condition, the first reference value may be able to be changed by a user input, or may be a fixed value.
Further, the second extraction condition includes at least one of conditions that
A “predetermined number” and a “predetermined keypoint” of the conditions may be predetermined, or may be able to be changed by a user input. Note that, the first extraction condition does not include the conditions.
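A sketch of how this specific second extraction condition could be evaluated is shown below: a weighted degree of similarity computed over some of the N keypoints must be equal to or more than the second reference value, a predetermined number or more of the referred keypoints must be detected (minimum detection point), and every necessary detection keypoint must be detected. The keypoint names, the weighting scheme, and the per-keypoint similarity input are illustrative assumptions.

```python
# Sketch of checking the specific second extraction condition described above.

def satisfies_second_extraction_condition(per_keypoint_similarity, detected, weights,
                                          second_reference_value,
                                          minimum_detection_point,
                                          necessary_keypoints):
    """per_keypoint_similarity: similarity of each keypoint between target and reference.
    detected: names of keypoints detected in the image.
    weights: weight set for each keypoint; a weight of 0 means the keypoint is not referred to."""
    referred = [name for name, weight in weights.items() if weight > 0]
    if len([name for name in referred if name in detected]) < minimum_detection_point:
        return False                              # minimum detection point not reached
    if not all(name in detected for name in necessary_keypoints):
        return False                              # a necessary detection keypoint is missing
    usable = [name for name in referred if name in detected and name in per_keypoint_similarity]
    total_weight = sum(weights[name] for name in usable)
    if total_weight == 0:
        return False
    score = sum(weights[name] * per_keypoint_similarity[name] for name in usable) / total_weight
    return score >= second_reference_value
```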
Herein, one example of processing of computing a degree of similarity between a pose of a human body detected from a target image and a pose of a human body indicated by a preregistered reference image, based on a keypoint detected by the skeleton structure detection unit 12, will be described.
There are various ways of computing a degree of similarity of a pose of a human body, and various techniques can be adopted. For example, the technique disclosed in Patent Document 1 may be adopted. Hereinafter, one example will be described, which is not limited thereto.
As one example, by computing a feature value of a skeleton structure indicated by a detected keypoint, and computing a degree of similarity between a feature value of a skeleton structure of a human body detected from a target image and a feature value of a skeleton structure of a human body indicated by a reference image, the degree of similarity between the poses of the two human bodies may be computed.
The feature value of the skeleton structure indicates a feature of a skeleton of a person, and is an element for classifying a pose of the person, based on the skeleton of the person. This feature value normally includes a plurality of parameters. The feature value being referred in computation of a degree of similarity may be a feature value of the entire skeleton structure, may be a feature value of a part of the skeleton structure, or may include a plurality of feature values, such as a feature value of each portion of the skeleton structure. A method for computing a feature value may be any method such as machine learning and normalization, and, as the normalization, a minimum value and a maximum value may be acquired. As one example, the feature value is a feature value acquired by performing machine learning on the skeleton structure, a size of the skeleton structure from a head to a foot on an image, a relative positional relationship among a plurality of keypoints in the up-down direction in a skeleton region including the skeleton structure on the image, a relative positional relationship among a plurality of keypoints in the left-right direction in the skeleton structure, and the like. The size of the skeleton structure is a height in the up-down direction, an area, and the like of the skeleton region including the skeleton structure on the image. The up-down direction (a height direction or a vertical direction) is a direction (Y-axis direction) of up and down in an image, and is, for example, a direction perpendicular to the ground (reference surface). Further, the left-right direction (a horizontal direction) is a direction (X-axis direction) of left and right in an image, and is, for example, a direction parallel to the ground.
Note that, in order to perform a search desired by a user, a feature value having robustness with respect to search processing is preferably used. For example, when a user desires a search that does not depend on an orientation and a body shape of a person, a feature value that is robust with respect to the orientation and the body shape of the person may be used. A feature value that does not depend on an orientation and a body shape of a person can be acquired by learning skeletons of persons facing in various directions with the same pose and skeletons of persons having various body shapes with the same pose, and extracting a feature only in the up-down direction of a skeleton. One example of the processing of computing a feature value of a skeleton structure is disclosed in Patent Document 1.
In this example, the feature value of the keypoint indicates a relative positional relationship among a plurality of keypoints in the up-down direction in a skeleton region including a skeleton structure on an image. Since the keypoint A2 of the neck is the reference point, a feature value of the keypoint A2 is 0.0 and a feature value of a keypoint A31 of a right shoulder and a keypoint A32 of a left shoulder at the same height as the neck is also 0.0. A feature value of a keypoint A1 of a head higher than the neck is −0.2. A feature value of a keypoint A51 of a right hand and a keypoint A52 of a left hand lower than the neck is 0.4, and a feature value of the keypoint A81 of the right foot and the keypoint A82 of the left foot is 0.9. When the person raises the left hand from this state, the left hand is higher than the reference point as in
There are various ways of computing a degree of similarity of a pose indicated by such a feature value. For example, after a degree of similarity between feature values is computed for each keypoint, a degree of similarity between poses may be computed based on the degrees of similarity between the feature values of the plurality of keypoints. For example, an average value, a maximum value, a minimum value, a mode, a median value, a weighted average value, a weighted sum, and the like of the degrees of similarity between the feature values of the plurality of keypoints may be computed as a degree of similarity between poses. When a weighted average value or a weighted sum is computed, a weight of each keypoint may be able to be set by a user, or may be predetermined.
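To make the example above concrete, the following is a minimal sketch that first computes the height-based feature value of each keypoint (the keypoint of the neck as the reference point, normalized here by the height of the skeleton region, which is an assumption) and then aggregates per-keypoint similarities into one degree of similarity between poses as a weighted average; the per-keypoint similarity used (one minus the absolute difference of the feature values, clipped at zero) is likewise an illustrative assumption.

```python
# Sketch of the feature value and similarity computations described above.
# Assumptions: y grows downward in image coordinates, the neck keypoint is the
# reference point (feature value 0.0), normalization uses the height of the
# skeleton region, and the per-keypoint similarity is 1 - |difference|, clipped at 0.

def height_feature_values(keypoints, reference_name="neck"):
    """keypoints: mapping from keypoint name to (x, y) image coordinates.
    Returns the relative position of each keypoint in the up-down direction."""
    ys = [y for (_, y) in keypoints.values()]
    region_height = (max(ys) - min(ys)) or 1.0        # height of the skeleton region on the image
    reference_y = keypoints[reference_name][1]
    return {name: (y - reference_y) / region_height for name, (_, y) in keypoints.items()}

def pose_similarity(features_a, features_b, weights):
    """Weighted average of per-keypoint similarities over keypoints present in both poses."""
    total, weight_sum = 0.0, 0.0
    for name, weight in weights.items():
        if weight > 0 and name in features_a and name in features_b:
            keypoint_similarity = max(0.0, 1.0 - abs(features_a[name] - features_b[name]))
            total += weight * keypoint_similarity
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```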
Herein, in
The reference image identification information is information that identifies a plurality of reference images from each other.
The data name is information provided to each reference image. The same data name can be provided to a plurality of reference images. Further, a plurality of data names can be provided to one reference image. The data name can be associated with a content (such as a pose of a human body, and a perspective of a target object) of an image, and the like. The illustrated “wheelchair/bird's-eye view” is provided to a reference image that includes a person in a wheelchair, captured in such a way that the person is looked down on from above. In addition, a data name of “cellular phone/right hand/bird's-eye view” may be provided to a reference image including a person who is holding a cellular phone with a right hand and talking on the phone. For example, a data name of “wheelchair/bird's-eye view” and a data name of “cellular phone/right hand/bird's-eye view” may be provided to a reference image including a person in a wheelchair who is holding a cellular phone with a right hand and talking on the phone.
A feature value is a feature value (for example: a set of feature values of each keypoint) of a pose of a human body included in each reference image.
Note that, the client terminal 2 may receive a user input for specifying a data name in addition to a user input for specifying a target image. Then, the client terminal 2 may transmit, to the server 1, a content of the user input for specifying the data name in addition to the specified target image. In this case, the first verification unit 13 may extract a reference image associated with the specified data name from among reference images, and then extract a first reference image that satisfies the first extraction condition from among the extracted reference images. In a case of such a configuration, reference images being search targets can be narrowed down by a data name, and faster search processing is achieved.
Note that, a content of the “user input for specifying a data name” described above can adopt various configurations. For example, the client terminal 2 may receive an input for directly specifying one or a plurality of data names as the “user input for specifying a data name” described above. In addition, the server 1 may create a group by putting together a plurality of data names having a common point, and manage each group by associating a label name with the group. For example, a label name of “use of cellular phone” may be associated with a group acquired by putting together data names such as “cellular phone/right hand/bird's-eye view” and “cellular phone/left hand/bird's-eye view”. Then, the client terminal 2 may receive an input for selecting a label name as the “user input for specifying a data name” described above. In this case, a data name associated with a group of the selected label name is specified.
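The following is a small sketch of the reference image information and of narrowing by a specified data name or label name before the first extraction; the record fields follow the description above (identification information, data names, feature value), and the concrete names and the label grouping are illustrative only.

```python
# Sketch of the reference image records and narrowing by label name.
reference_images = [
    {"id": 19, "data_names": ["wheelchair/bird's-eye view"], "feature": {}},
    {"id": 20, "data_names": ["cellular phone/right hand/bird's-eye view"], "feature": {}},
]

# A label name is associated with a group of data names that have a common point.
label_groups = {
    "use of cellular phone": ["cellular phone/right hand/bird's-eye view",
                              "cellular phone/left hand/bird's-eye view"],
}

def narrow_by_label(reference_images, label_groups, label_name):
    """Keep only reference images whose data name belongs to the selected label group;
    the first extraction condition is then applied only to this narrowed set."""
    selected = set(label_groups.get(label_name, []))
    return [ref for ref in reference_images if selected.intersection(ref["data_names"])]
```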
Herein, one example of a flow of processing of the image processing system 10 formed of the server 1 and the client terminal 2 will be described by using a sequence diagram in
First, the client terminal 2 receives a user input for specifying a target image (S10). Next, the client terminal 2 transmits the specified target image to the server 1 (S11).
The server 1 performs processing of detecting a keypoint of a human body included in the target image, and then extracts a first reference image whose relationship with the target image satisfies a first extraction condition from among a plurality of reference images, based on the detected keypoint (S12). Next, the server 1 transmits, to the client terminal 2, the first reference image, information (for example: a feature value, and the like) about a keypoint of a human body detected from each first reference image (see
The client terminal 2 extracts a second reference image whose relationship with the target image specified in S10 satisfies a second extraction condition from among the received first reference images, based on the information about the keypoint of the human body detected from each of the received first reference images and the information about the keypoint of the human body detected from the target image (S14). Then, the client terminal 2 displays the extracted second reference image (S15). The display is achieved by display on a display, projection of a video using a projection apparatus, and the like.
Note that, the client terminal 2 can store the data (image and information) received in S13 in the storage apparatus of the client terminal 2, and repeatedly perform the processing in S14 and S15 by using the data.
For example, a user may perform an input for changing a second extraction condition on the client terminal 2. Then, the client terminal 2 may extract a second reference image whose relationship with the target image specified in S10 satisfies the second extraction condition after the change from among the received first reference images (S14), and may display the extracted second reference image (S15). The processing will be described in detail in a fifth example embodiment.
In addition, a plurality of second extraction conditions may be set in advance. Then, the client terminal 2 may extract a second reference image whose relationship with the target image specified in S10 satisfies each of the plurality of second extraction conditions from among the received first reference images (S14), and may separately display the second reference image being extracted based on each of the plurality of second extraction conditions (S15).
In this way, in a case where extraction based on a second extraction condition is performed a plurality of times, a processing load on a computer increases and a processing speed is reduced when both the extraction based on a first extraction condition and the extraction based on the second extraction condition are performed each time. As in this example, with the configuration in which the first extraction processing (S12) and the second extraction processing (S14) are separated, and the second extraction processing can be performed a plurality of times in association with a single execution of the first extraction processing, a processing load on the computer is reduced, and a processing speed also increases.
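The following is a minimal sketch of this division of work: the server performs the first extraction once (S12) and sends its result (S13); the client stores that result and can then repeat the second extraction (S14) locally, for a changed second extraction condition or for several preset ones, without repeating the first step. Transport between server and client is omitted, and the names are illustrative.

```python
# Sketch of the client-side repetition of the second extraction processing.
class ClientSecondStep:
    def __init__(self, target_keypoints, first_reference_images):
        # Data received from the server in S13 and stored on the client.
        self.target_keypoints = target_keypoints
        self.first_reference_images = first_reference_images

    def extract(self, second_extraction_condition):
        """Repeatable second step (S14): verify only the stored first reference images."""
        return [ref for ref in self.first_reference_images
                if second_extraction_condition(self.target_keypoints, ref)]
```

For example, when a user performs an input for changing the second extraction condition, only the extract method needs to be called again with the changed condition.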
The image processing system 10 according to the present example embodiment can perform, in two separate steps, extraction processing (search processing) of a similar image, based on a feature value of each of a plurality of keypoints of a human body included in an image. In other words, reference images being search targets can be narrowed down in the first step, and an image similar to a target image can be then searched from among the narrowed reference images in the second step. In this way, by performing, in the two separate steps, the search processing of a similar image, based on a feature value of each of a plurality of keypoints of a human body included in an image, faster search processing can be achieved.
For example, in a case of the two separate steps, by storing a result of the first step, the second step can be performed a plurality of times by using the result. In other words, the second step can be performed a plurality of times in association with a single execution of the first step. In contrast, when the extraction processing is not divided into two steps, all the extraction processing needs to be performed every time. With the image processing system 10 according to the present example embodiment, as compared with such a comparative example, a processing load on a computer is reduced, and a processing speed also increases.
Further, in the present example embodiment, in the first extraction condition in the first step, the number of keypoints, a kind of a keypoint, and a weight of each keypoint being referred when a degree of similarity of a pose of a human body is computed can be fixed, and, in the second extraction condition in the second step, these items can be changed by a user input. As a technique related to speeding up a search, there is a technique for storing data in a database while clustering the data, and performing a search by narrowing down, at a time of the search, to clusters similar to a query. However, when a search is performed while the search condition is changed each time, the similarity between pieces of data changes depending on the search condition, and thus the technique described above cannot be used and the search becomes slower. To address this problem, by fixing the first extraction condition in the first step and allowing a change in the second extraction condition in the second step, the first step (the step of narrowing down from a massive amount of data) can be made faster while the search condition (second extraction condition) in the second step can still be changed, and thus a targeted search can be performed at a high speed.
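As an illustration of the point above, the following sketch shows one way a fixed first extraction condition could be exploited: because the feature value and the weighting used in the first step never change, the reference images can be clustered offline, and only the reference images in the clusters closest to the query need to be verified at search time. The clustering itself and the Euclidean distance used here are assumptions for illustration, not features stated by the example embodiment.

```python
# Sketch of narrowing by precomputed clusters under a fixed first extraction condition.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def narrow_by_clusters(query_vector, clusters, clusters_to_keep=2):
    """clusters: list of (centroid_vector, member_reference_images) built offline
    under the fixed first extraction condition."""
    ranked = sorted(clusters, key=lambda cluster: euclidean(query_vector, cluster[0]))
    candidates = []
    for _, members in ranked[:clusters_to_keep]:
        candidates.extend(members)        # only these candidates go through the first verification
    return candidates
```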
As illustrated in
In the second example embodiment, extraction processing (search processing) of a similar image, based on a feature value of each of a plurality of keypoints of a human body included in an image is divided into two steps, the server 1 performs the first step, and the client terminal 2 performs the second step. In contrast, in the present example embodiment, the server 1 performs both of the first step and the second step. Details will be described below.
Herein, one example of a flow of processing of the image processing system 10 formed of the server 1 and the client terminal 2 will be described by using a sequence diagram in
First, the client terminal 2 receives a user input for specifying a target image (S20). Next, the client terminal 2 transmits the specified target image to the server 1 (S21).
The server 1 performs processing of detecting a keypoint of a human body included in the target image, and then extracts a first reference image whose relationship with the target image satisfies a first extraction condition from among a plurality of reference images, based on the detected keypoint (S22). Next, the server 1 extracts a second reference image whose relationship with the target image received in S21 satisfies a second extraction condition from among the first reference images extracted in S22, based on the detection result of the keypoint in S22 (S23). Then, the server 1 transmits the extracted second reference image to the client terminal 2 (S24).
Subsequently, the client terminal 2 displays the received second reference image (S25). The display is achieved by display on a display, projection of a video using a projection apparatus, and the like.
Note that, the server 1 can store the data (image and information) acquired in the processing in S22 in the storage apparatus of the own apparatus, and repeatedly perform the processing in S23 and S24 by using the data. Then, when the client terminal 2 newly receives a second reference image (S24), the client terminal 2 can display the newly received second reference image.
For example, a user may perform an input for changing a second extraction condition on the client terminal 2. Then, the client terminal 2 may transmit the second extraction condition after the change to the server 1. Then, the server 1 may extract a second reference image whose relationship with the target image received in S21 satisfies the second extraction condition after the change from among the first reference images extracted in S22 (S23), and may transmit the extracted second reference image to the client terminal 2 (S24). The processing will be described in detail in the fifth example embodiment.
In addition, a plurality of second extraction conditions may be set in advance. Then, the server 1 may extract a second reference image whose relationship with the target image specified in S20 satisfies each of the plurality of second extraction conditions from among the first reference images extracted in S22 (S23), and may transmit, to the client terminal 2, the second reference image extracted based on each of the plurality of second extraction conditions, in an identifiable manner from each other (S24).
In this way, in a case where extraction based on a second extraction condition is performed a plurality of times, a processing load on a computer increases and a processing speed is reduced when both the extraction based on a first extraction condition and the extraction based on the second extraction condition are performed each time. As in this example, with the configuration in which the first extraction processing (S22) and the second extraction processing (S23) are separated, and the second extraction processing can be performed a plurality of times in association with a single execution of the first extraction processing, a processing load on the computer is reduced, and a processing speed also increases.
Another configuration of the image processing system 10 according to the present example embodiment is similar to the configuration of the image processing system 10 according to the first and second example embodiments.
The image processing system 10 according to the present example embodiment achieves an advantageous effect similar to that of the image processing system 10 according to the first and second example embodiments. Further, according to the image processing system 10 in the present example embodiment, a processing load on the client terminal 2 is reduced.
An image processing system 10 according to the present example embodiment is formed of one apparatus physically and/or logically. One example of a functional block diagram of the image processing system 10 according to the present example embodiment is illustrated in
Another configuration of the image processing system 10 according to the present example embodiment is similar to the configuration of the image processing system 10 according to the first to third example embodiments. The image processing system 10 according to the present example embodiment also achieves an advantageous effect similar to that of the image processing system 10 according to the first to third example embodiments.
An image processing system 10 according to the present example embodiment has a function of changing a second extraction condition. Details will be described below.
The display control unit 15 displays a second reference image extracted by the second verification unit 14 on a display apparatus. For example, when the image processing system 10 is formed of a server 1 and a client terminal 2 as in the second and third example embodiments, the display control unit 15 displays a second reference image on a display apparatus (such as a display and a projection apparatus) of the client terminal 2. Further, when the image processing system 10 is formed of one apparatus physically and/or logically as in the fourth example embodiment, the display control unit 15 displays a second reference image on a display apparatus (such as a display and a projection apparatus) of the one apparatus.
The change reception unit 16 receives an input for changing a second extraction condition. For example, the change reception unit 16 may receive an input for changing at least one of a second reference value defined in the second extraction condition, the number of keypoints being referred when a degree of similarity of a pose of a human body is computed, a kind of a keypoint being referred when a degree of similarity of a pose of a human body is computed, a weight of each keypoint being referred when a degree of similarity of a pose of a human body is computed, a minimum detection point, and a necessary detection keypoint.
The minimum detection point is a predetermined number in a condition that a “predetermined number or more of keypoints being referred when a degree of similarity of a pose of a human body is computed is detected” that can be included in the second extraction condition described in the second example embodiment.
The necessary detection keypoint is a predetermined keypoint in a condition that a “predetermined keypoint of keypoints being referred when a degree of similarity of a pose of a human body is computed is detected” that can be included in the second extraction condition described in the second example embodiment.
When the image processing system 10 is formed of the server 1 and the client terminal 2 as in the second and third example embodiments, the change reception unit 16 can receive an input for changing a second extraction condition via an input apparatus (such as a touch panel, a physical button, a keyboard, a mouse, and a microphone) of the client terminal 2. Further, when the image processing system 10 is formed of one apparatus physically and/or logically as in the fourth example embodiment, the change reception unit 16 can receive an input for changing a second extraction condition via an input apparatus (such as a touch panel, a physical button, a keyboard, a mouse, and a microphone) of the one apparatus.
Note that, in response to reception of an input for changing a second extraction condition by the change reception unit 16, the second verification unit 14 newly extracts a second reference image whose relationship with a target image satisfies the second extraction condition after the change from among first reference images. Then, the display control unit 15 changes a content to be displayed on the display apparatus from the second reference image that satisfies the second extraction condition before the change to the second reference image that satisfies the second extraction condition after the change.
Next, one example of a flow of processing of the image processing system 10 formed of the server 1 and the client terminal 2 will be described by using a sequence diagram in
First, the client terminal 2 receives a user input for specifying a target image (S30). Next, the client terminal 2 transmits the specified target image to the server 1 (S31).
The server 1 performs processing of detecting a keypoint of a human body included in the target image, and then extracts a first reference image whose relationship with the target image satisfies a first extraction condition from among a plurality of reference images, based on the detected keypoint (S32). Next, the server 1 transmits, to the client terminal 2, the first reference image, information (for example: a feature value, and the like) about a keypoint of a human body detected from each first reference image (see
The client terminal 2 stores the data (image and information) received in S33 in the storage apparatus of the client terminal 2, and extracts a second reference image whose relationship with the target image specified in S30 satisfies a second extraction condition from among the first reference images received in S33 (S34). Then, the client terminal 2 displays the extracted second reference image (S35). The display is achieved by display on a display, projection of a video using a projection apparatus, and the like.
Subsequently, a user performs an input for changing the second extraction condition while referring to a search result (second reference image) displayed on the client terminal 2. The client terminal 2 receives the input for changing the second extraction condition (S36). Then, in response to the reception of the input, the client terminal 2 newly extracts a second reference image whose relationship with the target image specified in S30 satisfies the second extraction condition after the change from among the first reference images received in S33 (S37). Note that, the client terminal 2 performs the extraction processing in S37, based on the data received in S33 and stored in the storage apparatus of the client terminal 2. Next, the client terminal 2 changes a content to be displayed on the display apparatus from the second reference image that satisfies the second extraction condition before the change to the second reference image that satisfies the second extraction condition after the change (S38).
The client terminal 2 can repeatedly perform the processing in S36 to S38.
Next, another example of a flow of processing of the image processing system 10 formed of the server 1 and the client terminal 2 will be described by using a sequence diagram in
First, the client terminal 2 receives a user input for specifying a target image (S40). Next, the client terminal 2 transmits the specified target image to the server 1 (S41).
The server 1 performs processing of detecting a keypoint of a human body included in the target image, and then extracts a first reference image whose relationship with the target image satisfies a first extraction condition from among a plurality of reference images, based on the detected keypoint (S42). Then, the server 1 stores the data (image and information) acquired in the processing in S42 in the storage apparatus of the own apparatus.
Next, the server 1 extracts a second reference image whose relationship with the target image received in S41 satisfies a second extraction condition from among the first reference images extracted in S42, based on the detection result of the keypoint in S42 (S43). Then, the server 1 transmits the extracted second reference image to the client terminal 2 (S44).
The client terminal 2 displays the received second reference image (S45). The display is achieved by display on a display, projection of a video using a projection apparatus, and the like.
Subsequently, a user performs an input for changing the second extraction condition while referring to a search result (second reference image) displayed on the client terminal 2. The client terminal 2 receives the input for changing the second extraction condition (S46). Then, the client terminal 2 transmits the second extraction condition after the change to the server 1 (S47).
Next, the server 1 newly extracts a second reference image whose relationship with the target image specified in S40 satisfies the second extraction condition after the change from among the first reference images extracted in S42 (S48). Note that, the server 1 performs the extraction processing in S48, based on the data acquired in the processing in S42 and stored in the storage apparatus of the own apparatus. Next, the server 1 transmits the second reference image that satisfies the second extraction condition after the change to the client terminal 2 (S49).
Then, the client terminal 2 changes a content to be displayed on the display apparatus from the second reference image that satisfies the second extraction condition before the change to the second reference image that satisfies the second extraction condition after the change (S50).
The server 1 and the client terminal 2 can repeatedly perform the processing in S46 to S50.
Next, another example of a flow of processing of the image processing system 10 formed of one apparatus physically and/or logically will be described by using a flowchart in
First, the image processing system 10 receives a user input for specifying a target image (S60). Next, the image processing system 10 performs processing of detecting a keypoint of a human body included in the target image, and then extracts a first reference image whose relationship with the target image satisfies a first extraction condition from among a plurality of reference images, based on the detected keypoint (S61). Then, the image processing system 10 stores the data (image and information) acquired in the processing in S61 in the storage apparatus of the own apparatus.
Next, the image processing system 10 extracts a second reference image whose relationship with the target image specified in S60 satisfies a second extraction condition from among the first reference images extracted in S61, based on the detection result of the keypoint in S61 (S62). Then, the image processing system 10 displays the extracted second reference image (S63). The display is achieved by display on a display, projection of a video using a projection apparatus, and the like.
Subsequently, a user performs an input for changing the second extraction condition while referring to a search result (second reference image) displayed on the image processing system 10. The image processing system 10 receives the input for changing the second extraction condition (S64).
Next, the image processing system 10 newly extracts a second reference image whose relationship with the target image specified in S60 satisfies the second extraction condition after the change from among the first reference images extracted in S61 (S65). Note that, the image processing system 10 performs the extraction processing in S65, based on the data acquired in the processing in S61 and stored in the storage apparatus of the own apparatus. Next, the image processing system 10 changes a content to be displayed on the display apparatus from the second reference image that satisfies the second extraction condition before the change to the second reference image that satisfies the second extraction condition after the change (S66).
The image processing system 10 can repeatedly perform the processing in S64 to S66.
Another configuration of the image processing system 10 according to the present example embodiment is similar to the configuration of the image processing system 10 according to the first to fourth example embodiments. The image processing system 10 according to the present example embodiment also achieves an advantageous effect similar to that of the image processing system 10 according to the first to fourth example embodiments.
Further, the image processing system 10 according to the present example embodiment can increase the speed of the search processing in work in which the search processing is repeatedly performed and a second extraction condition is changed while a search result of the search processing is confirmed.
An image processing system 10 according to the present example embodiment receives a change in a second extraction condition via a characteristic user interface (UI) screen. Details will be described below.
The change reception unit 16 receives an input for changing a second extraction condition via a characteristic setting screen (UI screen). When the image processing system 10 is formed of a server 1 and a client terminal 2 as in the second and third example embodiments, the client terminal 2 displays the setting screen. Further, when the image processing system 10 is formed of one apparatus physically and/or logically as in the fourth example embodiment, the one apparatus displays the setting screen.
In the illustrated setting screen, a moving image is reproduced and displayed in a region M. The moving image may be a live image being currently captured by any camera, or may be a moving image being captured in the past and stored.
“Rotational angle” is a UI part for rotating an image in the region M. For example, 0 degree, 90 degrees, 180 degrees, and 270 degrees are selectable, and an image displayed in the region M is rotated by a selected angle. For example, when “90 degrees” is selected in the illustrated state, an image displayed in the region M is rotated clockwise by 90 degrees.
“Detection threshold value” is a first reference value of a first extraction condition.
“Label name” is as described in the second example embodiment. A user can select a label name via the UI part.
“Color of frame line”, “initial selection”, “select all check items”, and “display unused pose as well” will be described below.
The UI part for receiving an input for changing a second extraction condition is displayed under the region in which the items described above are displayed. In response to selection of a label name, a current setting content corresponding to one or each of a plurality of data names associated with a group of the label name is displayed. A user can change the setting content to a desired content. For example, when a user selects “wheelchair” as a label name as illustrated, a current setting content corresponding to “wheelchair/bird's-eye view” being a data name associated with a group of the label name is displayed. Further, although not illustrated, for example, when a user selects “use of cellular phone” as a label name, a current setting content corresponding to each of data names such as “cellular phone/right hand/bird's-eye view” and “cellular phone/left hand/bird's-eye view” associated with a group of the label name is displayed. In other words, for each of data names such as “cellular phone/right hand/bird's-eye view” and “cellular phone/left hand/bird's-eye view”, a human model, a second threshold value, a minimum detection point, and the like as illustrated are displayed.
A human model formed of N keypoints is displayed in a region R. Then, a keypoint being referred when a degree of similarity of a pose of a human body is computed and a keypoint not being referred are displayed in an identifiable manner. In a case of the illustrated example, a keypoint K1 indicated by a white dot is referred when a degree of similarity of a pose of a human body is computed, and a keypoint K2 indicated by a black dot is not referred when a degree of similarity of a pose of a human body is computed.
A user can select one from among the N keypoints, and change a weight of the keypoint. In a case of the illustrated example, a keypoint surrounded by a mark Q is selected by the user. A name of the keypoint is “joint3”. In response to selection of the one keypoint, as illustrated, a name of the selected keypoint and the UI part that changes a weight thereof are displayed. In a case of the illustrated example, the weight of joint3 is “0.0”. This indicates that the keypoint is not referred when a degree of similarity of a pose of a human body is computed.
The user can change a weight of the selected keypoint by an operation of an illustrated slide bar, a direct input of a numerical value, or the like, for example. For example, the weight of joint3 can be changed from “0” to a “numerical value different from 0”. Then, in response to the change, joint3 is switched from the keypoint not being referred when a degree of similarity of a pose of a human body is computed to a keypoint being referred. In response to this, a display of joint3 in the region R is switched from the black dot to the white dot.
Note that, a keypoint (keypoint K1 indicated by the white dot) being referred when a degree of similarity of a pose of a human body is computed can be selected, and a weight of the keypoint can be changed to “0”. In response to the change, the keypoint is switched from the keypoint being referred when a degree of similarity of a pose of a human body is computed to a keypoint not being referred. In response to this, a display of the keypoint in the region R is switched from the white dot to the black dot.
In addition, a keypoint (keypoint K1 indicated by the white dot) being referred when a degree of similarity of a pose of a human body is computed can be selected, and a weight of the keypoint can be changed within a range of values different from “0”.
“ID19: wheelchair/bird's-eye view” is the “data name” described in the second example embodiment. In the present example embodiment, a second extraction condition is set for each data name. By referring to a display of a data name such as “ID19: wheelchair/bird's-eye view”, the user can recognize which data name the second extraction condition currently being displayed and set corresponds to.
“Second threshold value” is a second reference value of the second extraction condition.
“Minimum detection point” is as described in the fifth example embodiment. In a case of the example, the second extraction condition includes a condition that a “predetermined number or more of keypoints being referred when a degree of similarity of a pose of a human body is computed is detected”. In a case of the illustrated example, six keypoints (keypoint K1 indicated by the white dot) are “keypoints being referred when a degree of similarity of a pose of a human body is computed”, and a minimum detection point is “2”. In this case, detection of two or more of the six keypoints is a condition for satisfying the second extraction condition.
The change reception unit 16 can receive the input for changing the second extraction condition via the setting screen that displays a human model (the human model displayed in the region R) formed of such a plurality of keypoints, receives an input for selecting, on the human model, a keypoint being a setting target, and receives an input for changing a weight of the selected keypoint (the keypoint surrounded by the mark Q).
Further, the change reception unit 16 can receive the input for changing the second extraction condition via the setting screen that emphasizes and displays the selected keypoint (emphasizes and displays the mark Q) in the human model described above.
Further, the change reception unit 16 can receive the input for changing the second extraction condition via the setting screen that displays, in different manners, a keypoint (keypoint K1 indicated by the white dot) whose set weight is greater than a threshold value (for example: 0) and the other keypoint (keypoint K2 indicated by the black dot) in the human model described above.
Note that, when a “save setting” button on an upper left of the screen is pressed, the setting content changed as described above is saved.
When an “analyze” button on the upper left of the screen is pressed, the target image acquisition unit 11 acquires, as a target image, a frame image displayed in the region M at that point in time. Subsequently, the skeleton structure detection unit 12, the first verification unit 13, and the second verification unit 14 perform the processing described in the first to fifth example embodiments on the target image. Then, as illustrated, a verification result is displayed on the screen.
Note that, as illustrated, the display control unit 15 can switch an image displayed in the region M from an original moving image to the specified target image (still image) in response to the specification of the target image (pressing of the “analyze” button on the upper left of the screen).
The display control unit 15 may further superimpose and display, on the target image, a keypoint of a human body detected in the target image. The superimposition and the display are achieved based on a detection result by the skeleton structure detection unit 12. Note that, in the superimposition and the display, all keypoints may be displayed in the same display manner, or may be displayed in different display manners. For example, a keypoint of a right side of a body and a keypoint of a left side of the body may be displayed in display manners different from each other, or a keypoint of an upper half of the body and a keypoint of a lower half of the body may be displayed in display manners different from each other. Further, a keypoint that is referred to when a degree of similarity of a pose of a human body is computed may be emphasized and displayed. Furthermore, when one keypoint is selected in the region R, the selected keypoint may be emphasized and displayed in a human model superimposed and displayed on the target image.
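For example, such superimposition and display of keypoints in different display manners may be sketched as follows, assuming an OpenCV-based drawing routine. The colors, radii, and names are illustrative choices and are not part of the present description.

```python
import cv2


def draw_keypoints(image, keypoints, referred_names, selected_name=None):
    """Superimpose detected keypoints on the target image (illustrative sketch).

    keypoints: dict of keypoint name -> (x, y) pixel coordinates from the detector.
    referred_names: keypoints referred to in the similarity computation are emphasized,
    the others are drawn smaller; selected_name, if given, is highlighted further.
    """
    canvas = image.copy()
    for name, (x, y) in keypoints.items():
        if name == selected_name:
            cv2.circle(canvas, (int(x), int(y)), 8, (0, 0, 255), 2)       # selected keypoint: red ring
        elif name in referred_names:
            cv2.circle(canvas, (int(x), int(y)), 5, (255, 255, 255), -1)  # referred: filled white dot
        else:
            cv2.circle(canvas, (int(x), int(y)), 3, (50, 50, 50), -1)     # not referred: small dark dot
    return canvas
```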
The user can perform the input for changing the second extraction condition while referring to the verification result. For example, it is assumed that the user changes the minimum detection point from the state described above.
When a check is placed in “display unused pose as well” as illustrated in
As illustrated in
Herein, processing performed when the items of “still image”, “capturing”, “Live”, and “setting” are selected in the region on the left end of the UI screen is described.
When “still image” is selected, a screen for selecting a processing image from among images stored in a storage apparatus is displayed. When one image is selected as the processing image, the skeleton structure detection unit 12, the first verification unit 13, and the second verification unit 14 perform the processing described in the first to fifth example embodiments on the processing image. Note that, the first verification unit 13 and the second verification unit 14 extract a first reference image and a second reference image, based on a setting content of a first extraction condition and a second extraction condition at that point in time. Then, the extracted second reference image is displayed as a verification result on the screen.
When “capturing” is selected, a screen for selecting a processing image from a live image currently being captured by any camera or a moving image captured in the past is displayed. The live image or the moving image captured in the past is reproduced and displayed on the screen. Then, a user performs a capturing operation at any timing during the reproduction. Then, a frame image displayed at that timing is selected as the processing image. When one image is selected as the processing image, the skeleton structure detection unit 12, the first verification unit 13, and the second verification unit 14 perform the processing described in the first to fifth example embodiments on the processing image. Note that, the first verification unit 13 and the second verification unit 14 extract a first reference image and a second reference image, based on a setting content of a first extraction condition and a second extraction condition at that point in time. Then, the extracted second reference image is displayed as a verification result on the screen.
When “Live” is selected, a screen for selecting a processing image from a live image currently being captured by any camera or a moving image captured in the past is displayed. The live image or the moving image captured in the past is reproduced and displayed on the screen. Then, a user performs an input for specifying a time interval for selecting the processing images. Then, a plurality of frame images are selected as the processing images at the specified time interval. The skeleton structure detection unit 12, the first verification unit 13, and the second verification unit 14 successively perform the processing described in the first to fifth example embodiments on each of the plurality of selected processing images. Note that, the first verification unit 13 and the second verification unit 14 extract a first reference image and a second reference image, based on a setting content of a first extraction condition and a second extraction condition at that point in time. Then, the extracted second reference image is displayed as a verification result on the screen.
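The three items differ only in how the processing image is selected; the subsequent flow is common. A minimal sketch of that common flow is shown below (Python), with hypothetical function names standing in for the units described above.

```python
def verify_processing_images(processing_images, first_condition, second_conditions,
                             detect_skeleton, first_verification, second_verification):
    """Common flow behind "still image", "capturing", and "Live" (illustrative sketch).

    Each mode only differs in how processing_images is chosen; afterwards, skeleton
    detection and the two verification steps run with the conditions set at that
    point in time. The parameters are hypothetical stand-ins for the units in the text.
    """
    results = []
    for image in processing_images:
        skeletons = detect_skeleton(image)                                 # skeleton structure detection unit 12
        first_refs = first_verification(skeletons, first_condition)        # extracts first reference images
        second_refs = second_verification(first_refs, second_conditions)   # extracts second reference images
        results.append((image, second_refs))                               # displayed as a verification result
    return results
```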
Note that, when any item of “still image”, “capturing”, and “Live” is selected, a user also selects at least one label name. For example, a check box associated with each of a plurality of label names is displayed on the screen. The user selects at least one label name by placing a check in the check box of a desired label name. Then, the image processing system performs the extraction processing using a second extraction condition (whose setting has been saved) corresponding to a data name associated with the group of the selected label name, and displays an extracted second reference image as a verification result on the screen.
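For example, expanding the selected label names into their associated data names and running the extraction processing with each saved second extraction condition may look like the following sketch (Python; names are hypothetical, and extract_with_condition stands in for the second verification under one saved condition).

```python
def run_selected_labels(target_image, selected_labels, groups, extract_with_condition):
    """Run the extraction processing for every data name grouped under the selected labels.

    groups: mapping of label name -> {data name: saved second extraction condition}.
    extract_with_condition: callable performing extraction under one saved condition.
    """
    verification_result = {}
    for label in selected_labels:
        for data_name, condition in groups[label].items():
            # Each data name is verified with its own saved second extraction condition.
            verification_result[data_name] = extract_with_condition(target_image, condition)
    return verification_result
```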
Herein, “initial selection” in the setting screen (see
Another configuration of the image processing system 10 according to the present example embodiment is similar to the configuration of the image processing system 10 according to the first to fifth example embodiments. The image processing system 10 according to the present example embodiment also achieves an advantageous effect similar to that of the image processing system 10 according to the first to fifth example embodiments.
Further, the image processing system 10 according to the present example embodiment can receive an input for changing a second extraction condition via the characteristic setting screen described above. A user can efficiently and more accurately set a desired second extraction condition by performing the input for changing the second extraction condition via the characteristic setting screen described above.
While the example embodiments of the present invention have been described with reference to the drawings, the example embodiments are only exemplification of the present invention, and various configurations other than the above-described example embodiments can also be employed. The configurations of the example embodiments described above may be combined together, or a part of the configuration may be replaced with another configuration. Further, various modifications may be made in the configurations of the example embodiments described above without departing from the scope of the present invention. Further, the configurations and the processing disclosed in each of the example embodiments and the modification examples described above may be combined together.
Further, the plurality of steps (pieces of processing) are described in order in the plurality of flowcharts used in the above description, but an execution order of steps performed in each of the example embodiments is not limited to the described order. In each of the example embodiments, an order of illustrated steps may be changed to the extent that the change does not interfere with the contents. Further, each of the example embodiments described above can be combined to the extent that the contents do not contradict one another.
A part or the whole of the above-described example embodiments may also be described as in the supplementary notes below, but is not limited thereto.