The present invention relates to a technique for detecting an intended target from an image.
In recent years, object detection algorithms such as You Only Look Once (YOLO) and Single Shot MultiBox Detector (SSD) have been used to detect targets in various business fields. To detect a target correctly, a model for detecting an object is to be trained with a sufficient volume of high-quality supervised data.
A technique known as data augmentation artificially augments the number of data pieces by processing an image in the supervised data with, for example, translation, scaling, rotation, or noise addition.
For example, Patent Literature 1 describes a technique for such data augmentation. Patent Literature 1 describes an information processing apparatus for selecting an image suitable for machine learning. The information processing apparatus includes an identifier and a selector. When a composite image generated by superimposing multiple element images of a target element on a background image includes element images overlapping one another, the identifier identifies a shielding degree indicating a degree by which a first element image placed in the back is shielded by a second element image placed in the front. When the shielding degree is less than or equal to an upper limit value specified by the complexity of the first element image, the selector selects the composite image as supervised data in machine learning for generating a recognition model to detect the target element.
Patent Literature 1: Japanese Unexamined Patent Application Publication No. 2022-26456
Artificial intelligence (AI) for object detection separates a target from an image including both the target and a background image. Machine learning relies on, in addition to the features of the target, information (knowledge) about the background image. Ideally, the knowledge learned about the background image matches the background image in actual use. However, the background image in actual use may be unavailable.
One or more aspects of the present invention are directed to a technique for generating supervised data without a background image in actual use.
Supervised data (refer to
For example, an application display includes multiple types of buttons or icons. For a user interface (UI) or user experience (UX), the style (colors and patterns) of such buttons or icons tends to approximate the style of the entire background of the display including them. Thus, a background image generated with multiple types of icons can easily reflect the features (e.g., color distribution or pattern) of the background of the application, increasing the training efficiency.
Referring to
The trained model 10 is, for example, an object detection algorithm such as You Only Look Once (YOLO) or Single Shot MultiBox Detector (SSD).
The image 11 includes the targets 12 to be recognized by the trained model 10. The image 11 may be a still image or a video. A video may be a set of images captured at predetermined intervals. Examples of such images include screen captures and images captured with, for example, a camera, an in-vehicle camera, or a surveillance camera.
A target is an object to be detected in the image 11. A target detected by the trained model is indicated with, for example, a frame (not shown) surrounding the target. The target may instead be indicated with an arrow or a color to be distinguishable from other portions.
A supervised data generation apparatus will be described with reference to
A training apparatus 4 includes the supervised data generation apparatus 1 and a trained model generator 5.
The background image generator 2 selects a first target image 13a from an image group 14 including multiple different target images 13 and processes the first target image 13a through a transformation process including, for example, enlargement, reduction, rotation, and inversion to generate a background image 15.
The target images 13 are images of targets included in the image 11.
The first target image 13a includes one or more target images 13 used to generate the background image 15.
The image group 14 includes the multiple target images 13 to be identified. The target images 13 in the image group 14 are identifiable from one another by the trained model 10. The image group 14 is stored in a storage 31 (described later). The image group 14 may be stored in an external server.
The background image 15 includes transformed images 16 arranged in a background frame 15a. In the present embodiment, the transformed images 16 are randomly arranged until the background frame 15a is filled. The transformed images 16 may or may not overlap one another. The rectangles labeled 16 in the background frame 15a represent some of the arranged transformed images 16.
The transformed images 16 are generated through the transformation process of the first target image 13a selected from the image group 14. The transformation process includes enlargement, reduction, rotation, vertical inversion, and lateral inversion. The transformation process may further include, for example, noise addition and projective transformation. For example, typical data augmentation may also be used.
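As a minimal sketch of the transformation process, assuming the Pillow library and illustrative parameter ranges (the embodiment does not specify scale factors or rotation angles), a transformed image 16 might be generated as follows:

```python
import random
from PIL import Image

def transform(image: Image.Image) -> Image.Image:
    """Apply a randomized transformation (enlargement or reduction,
    rotation, vertical or lateral inversion) to a target image.
    The parameter ranges here are illustrative assumptions."""
    # Enlargement or reduction.
    scale = random.uniform(0.5, 2.0)
    w, h = image.size
    image = image.resize((max(1, int(w * scale)), max(1, int(h * scale))))

    # Rotation (expand=True keeps the whole rotated image).
    image = image.rotate(random.uniform(0, 360), expand=True)

    # Vertical and lateral inversion, each applied at random.
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_TOP_BOTTOM)
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
    return image
```

Noise addition or projective transformation could be appended to the same function in implementations that use them.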
The transformation process corresponds to processes R2 and R3 in
The supervised data generator 3 selects second target images 13b from the image group 14 and combines the second target images 13b with the background image 15 to generate supervised data 17 (refer to
The second target images 13b include one or more target images 13 to be combined with the background image 15.
The supervised data 17 is used in supervised training in machine learning. In the present embodiment, the supervised data 17 includes the background image 15, the target images 13, and positional information 18. The supervised data 17 will be described in detail later with reference to
The hardware configuration of the supervised data generation apparatus 1 and the training apparatus 4 will now be described with reference to
The hardware configuration of the supervised data generation apparatus 1 will be described. The hardware configuration of the training apparatus 4, which is substantially the same as the hardware configuration of the supervised data generation apparatus 1, will not be described.
As shown in
The storage 31 stores programs 36 and 36a for generating the supervised data and the trained model. The storage 31 may also store a browser program 37 and an operating system (OS) 38. The programs 36 and 36a are, for example, installed from the storage device 32. The target images 13 (the image group 14), the supervised data 17, and the trained model 10 stored in the storage 31 in the present embodiment may be stored in an external server.
In the present embodiment, the program 36 may cooperate with the OS 38 while using its functions, and the program 36a may cooperate with the browser program 37 while using its functions. The programs 36 and 36a may instead operate independently of the browser program 37 and the OS 38 without using their functions.
With the hardware configuration described above, the programs 36 and 36a implement the functions shown in the functional block diagram in
The steps below are described for generating a single piece of supervised data 17. In practice, the steps are repeated to generate many different pieces of supervised data 17. Multiple pieces of supervised data 17 may also be generated simultaneously with parallel processing.
In preprocessing S0 (not shown), the file format of the image data of the target images 13 to be processed with the program 36 is converted to a predetermined format.
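A minimal sketch of preprocessing S0, assuming PNG as the predetermined format (the actual format is not specified here):

```python
from pathlib import Path
from PIL import Image

def convert_to_png(src_dir: str, dst_dir: str) -> None:
    """Convert every target image in src_dir to PNG files in dst_dir.
    PNG as the predetermined format is an assumption."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).iterdir():
        if path.suffix.lower() in {".jpg", ".jpeg", ".bmp", ".gif"}:
            Image.open(path).convert("RGBA").save(out / (path.stem + ".png"))
```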
The CPU 30 (refer to
The background image generation will now be described with reference to
The first target image 13a is randomly selected from the image group 14 (refer to
Either the vertical direction or the lateral direction of the selected first target image 13a is chosen at random. The vertical and lateral directions are predetermined for each target image 13. A division number of three, four, or five is also chosen at random, and the image is divided into the chosen number of pieces parallel to the chosen direction. A single image is then randomly selected from the three to five images generated by dividing the first target image 13a.
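For illustration, this dividing step might be sketched as follows; the mapping from the chosen direction to the strip orientation is an interpretation of "parallel to the selected direction" and is therefore an assumption:

```python
import random
from PIL import Image

def pick_divided_piece(image: Image.Image) -> Image.Image:
    """Divide a target image into 3 to 5 pieces parallel to a randomly
    chosen direction and return one piece at random (step R2)."""
    n = random.choice([3, 4, 5])          # division number
    w, h = image.size
    if random.random() < 0.5:             # vertical direction: vertical strips
        step = w // n
        boxes = [(i * step, 0, (i + 1) * step, h) for i in range(n)]
    else:                                 # lateral direction: horizontal strips
        step = h // n
        boxes = [(0, i * step, w, (i + 1) * step) for i in range(n)]
    return image.crop(random.choice(boxes))
```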
The selected divided image is processed through the transformation process to generate each transformed image 16 (refer to
The transformed images 16 are arranged in the background frame 15a. In the present embodiment, the transformed images 16 are randomly arranged in the background frame 15a and may overlap one another. The transformed images 16 arranged later are superimposed on those arranged earlier.
When the background frame 15a is filled with the transformed images 16, the processing advances to the subsequent step. When the background frame 15a is yet to be filled, the processing returns to step R1. The processing may instead advance to the subsequent step when a predetermined portion of the background frame 15a, rather than the entire frame, is filled. The predetermined portion is, for example, 90 to 95% of the area of the background frame 15a.
The background image 15 is complete.
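Putting steps R1 to R4 together, the background generation might be sketched as follows, reusing the hypothetical `pick_divided_piece` and `transform` helpers sketched above; the 640 x 640 frame and the 95% fill threshold are illustrative assumptions:

```python
import random
from PIL import Image

def generate_background(image_group, size=(640, 640), fill_ratio=0.95):
    """Fill the background frame 15a with transformed images 16
    (steps R1 to R4); overlapping arrangements are allowed."""
    background = Image.new("RGBA", size, (0, 0, 0, 0))
    total = size[0] * size[1]
    while True:
        # R1 to R3: select, divide, and transform a first target image.
        piece = transform(pick_divided_piece(random.choice(image_group)))
        # R4: random placement; later pieces cover earlier ones.
        x = random.randint(0, max(0, size[0] - piece.width))
        y = random.randint(0, max(0, size[1] - piece.height))
        background.paste(piece, (x, y), piece)
        # Advance once the frame is (almost) filled with opaque pixels.
        filled = sum(1 for a in background.getchannel("A").getdata() if a > 0)
        if filled >= fill_ratio * total:
            return background
```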
Referring back to
When a specific second target image 13b has been selected many times for other pieces of supervised data 17, the probability of this second target image 13b being selected is adjusted to decrease. When the specific second target image 13b has been selected fewer times, the probability of this second target image 13b being selected is adjusted to increase. Across a supervised dataset 20, the number of times each second target image 13b is selected is thus adjusted to be the same or substantially the same among different pieces of supervised data.
Although the number of times a second target image 13b is selected is adjusted with, for example, a roulette wheel selection method in the present embodiment, another known method may be used instead.
When the processes for generating different pieces of supervised data 17 are performed in parallel to one another, the numbers of times the specific second target image 13b is selected in the individual processes are added up, and the probability of the second target image 13b being selected in the subsequent selection in each process is increased or decreased accordingly, to dynamically adjust and balance the probability of each second target image 13b being selected.
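A minimal sketch of this balancing, assuming weights inversely related to the running selection counts (the exact weighting rule used by the embodiment is not specified):

```python
import random

def select_second_target(image_group: list, counts: list):
    """Roulette wheel selection of a second target image 13b.
    counts[i] is how many times image_group[i] has been selected so far,
    summed across parallel generation processes where applicable."""
    weights = [1.0 / (1 + counts[i]) for i in range(len(image_group))]
    i = random.choices(range(len(image_group)), weights=weights, k=1)[0]
    counts[i] += 1
    return image_group[i]
```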
The second target images 13b are randomly arranged on the resulting background image 15 and are combined so as to be superimposed on it.
The positional information 18 about the second target images 13b with respect to the background frame 15a is stored. The starting point of each second target image 13b to which the positional information 18 refers may be the center, the center of gravity, or a corner of its rectangle.
The information about the second target images 13b to be stored with the positional information 18 may include, for example, the shape of each second target image 13b such as a square, a rectangle, or a circle, or the dimensions such as the lengths of the sides or the diameter.
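For illustration only, a record of the positional information 18 might look like the following; the field set (a class identifier, a starting point, and shape and dimension attributes) follows the description above, while the coordinate convention is an assumption:

```python
from dataclasses import dataclass

@dataclass
class PositionalInfo:
    """Positional information 18 for one combined second target image 13b."""
    class_id: int        # which target image 13 the record refers to
    x: float             # starting point (e.g., center or a corner) in frame 15a
    y: float
    shape: str           # e.g., "square", "rectangle", or "circle"
    width: float = 0.0   # side lengths, or diameter for a circle
    height: float = 0.0
```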
Whether the combined second target images 13b are displayed appropriately on the background image 15 is then determined. For example, when none of conditions (a), (b), and (c) holds, the images are determined to be displayed appropriately, and the processing advances to the subsequent step.
When any one of conditions (a), (b), and (c) holds, the combining in step S3 is canceled, and the processing returns to step S2.
The positional information 18 about the combined second target images 13b is linked to the background image 15 and stored into the storage 31.
When the number of second target images 13b is less than a predetermined number, the processing returns to step S2. When the number of second target images 13b reaches the predetermined number, the processing ends.
In the present embodiment, the number of second target images 13b is randomly determined between 10 and 20. The number of second target images 13b is determined between step S1 and step S2. The number of second target images 13b may be determined in another step.
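Combining steps S2 to S5, the compositing loop might be sketched as follows; `is_displayed_appropriately` is a placeholder for the check against conditions (a) to (c), which are defined elsewhere in the specification, and `select_second_target` is the hypothetical helper sketched above:

```python
import random
from PIL import Image

def is_displayed_appropriately(composite, records, x, y, target) -> bool:
    """Placeholder for the step S4 check against conditions (a) to (c)."""
    return True

def combine_targets(background: Image.Image, image_group, counts):
    """Combine second target images 13b with the background image 15
    (steps S2 to S5) and collect the positional information 18."""
    records = []
    n_targets = random.randint(10, 20)   # determined between steps S1 and S2
    while len(records) < n_targets:      # S6: repeat until the count is reached
        target = select_second_target(image_group, counts)   # S2
        x = random.randint(0, max(0, background.width - target.width))
        y = random.randint(0, max(0, background.height - target.height))
        candidate = background.copy()
        candidate.paste(target, (x, y), target)              # S3
        if not is_displayed_appropriately(candidate, records, x, y, target):
            continue                     # S4 fails: cancel S3, back to S2
        background = candidate
        records.append((x, y, target.width, target.height))  # S5
    return background, records
```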
The processes described above performed by randomly selecting or randomly arranging the images may be partially or entirely performed by selecting or placing the images in a preset order.
The dividing process in R2 may be eliminated. In this case, however, the second target images 13b to be identified may not be easily distinguishable from the transformed images 16 in the background image. Transformations consisting of enlargement or reduction alone, or of rotation or inversion alone, may thus be eliminated from the transformation process.
When the background image is generated, the transformed images 16 may be arranged without overlapping one another.
The illustrated supervised data 17 is used in machine learning to generate the trained model 10 that outputs a result from identifying a target 12 in response to input image data of the image 11 including the target 12 corresponding to the target image 13. The supervised data 17 includes a supervised image 19 and the positional information 18.
The supervised image 19 includes the second target images 13b selected from the image group 14 of the multiple target images 13 corresponding to the targets 12 and to be identified from one another by the trained model 10, and the background image 15 located around the second target images 13b. The background image 15 includes the transformed images 16 resulting from the transformation process performed on the first target image 13a selected from the image group 14.
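Structurally, one piece of supervised data 17 can be viewed as the pair sketched below; the dataclass is an illustrative representation, not a format prescribed by the embodiment:

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class SupervisedData:
    """One piece of supervised data 17: the supervised image 19 and the
    positional information 18 for each second target image 13b in it."""
    supervised_image: Image.Image   # background image 15 with targets combined
    positional_info: list           # e.g., PositionalInfo records as above
```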
Referring back to
The detection number in the table identifies the supervised dataset 20. The number of generated images is the number of pieces of supervised data in the supervised dataset 20. The epoch number is the number of iterations in the iterative learning. The training duration is the time taken to generate the trained model. The number of correctly detected targets is the number of targets detected correctly. The number of undetected targets is the number of targets left undetected. The number of incorrectly detected targets is the number of detections that erroneously identify something other than the target as the target.
Background 1 is the background image 15 (refer to
Referring back to
For detection number 2, 800 pieces of training data with the background image of Background 1 are prepared and learned. For detection numbers 3, 4, and 5, respectively, 800 pieces of training data with the background images of Backgrounds 2, 3, and 4 are prepared and learned, similarly to detection number 2.
For detection number 6, 400 pieces of training data with the background image of each of Backgrounds 1 and 2 are prepared, and a total of 800 pieces of supervised data are learned. For detection numbers 8 and 9, 400 pieces of training data with the background image of each of Backgrounds 1 and 2 are prepared, similarly to detection number 6, and a total of 800 pieces of supervised data are learned for each detection number. For detection number 7, a total of 1600 pieces of supervised data are learned.
As indicated by the data for detection numbers 2 and 3, the numbers of correctly detected targets show good results when Background 1 or Background 2 is used as the training data. In other words, the background image 15 in the present embodiment provides substantially the same results as when open data is used as the background image.
The data for detection numbers 6 to 9 shows good results when Backgrounds 1 and 2 are both used as the training data. More specifically, using Backgrounds 1 and 2 together as the training data increases the number of correctly detected targets.
Other Aspects of the Background Image
In the supervised data 17, the target images 13 included in the background image 15 are selected randomly, and the transformation process such as enlargement is randomly performed on the target images 13. The number of transformed images 16 to be used for the background image 15 in the background frame 15a can thus vary.
Another embodiment will now be described. The supervised data 17 in the modification described below is substantially the same as the supervised data 17 described above. Like reference numerals denote like elements, and such elements will not be described.
The background image in the modification includes, in addition to the target images 13, images other than the target images.
The items additionally described in the above embodiments may be combined as appropriate.
The background image 15 resembles the target images 13 and is thus not easily distinguishable from them, providing abundant background information to a trained model and increasing the training efficiency. The background image 15 can be generated with the target images 13 alone, without preparing open data.
The trained model generated with the supervised data 17 using the background image 15 generated with the target images 13 can have the same capability of detecting the target 12 as the trained model generated with the supervised data using the background image 21a generated with open data.
Foreign application priority data: Japanese Patent Application No. 2023-124769, filed July 2023 (JP, national).