This application claims the benefit of Taiwan application Serial No. 106140836, filed Nov. 23, 2017, the subject matter of which is incorporated herein by reference.
The present invention relates to an image processing method, an electronic device, and a non-transitory computer-readable storage medium, and more particularly to an image annotation method, an electronic device and a non-transitory computer-readable storage medium.
Deep learning combined with computer vision has become a major development trend in artificial intelligence (AI). However, a deep learning network requires a large number of annotated image samples for training in order to achieve high accuracy.
At present, most image annotation is done manually. The operator needs to select the objects one by one in each image frame of the video data and key in the associated annotation. When the video data contains a large number of target objects, such a manual annotation method is time-consuming and labor-intensive.
The present invention relates to an image annotation method, an electronic device and a non-transitory computer-readable storage medium, which can automatically filter out highly repetitive, redundant image frame samples in the video data, extract key image frames exhibiting object structure diversity, and provide the key image frames to the user for browsing, so that the user can add and/or modify annotation items, thereby improving the annotation result and reducing the labor required for image annotation. In addition, the technique proposed in the present invention involves an expert experience feedback mechanism to enhance the accuracy and robustness of key image frame extraction.
According to an aspect of the present invention, an image annotation method implemented by an electronic device including a processor is provided. The image annotation method includes the following steps. A sequence of image frames including a plurality of image frames is acquired from video data by the processor. An object detecting and tracking procedure is performed on the sequence of image frames by the processor, so as to identify and track one or more target objects from the image frames. A plurality of candidate key image frames are selected from the image frames according to a first selection condition by the processor, wherein the first selection condition comprises when a target object in the one or more target objects starts to appear or disappears in an image frame of the image frames, selecting the image frame as one of the candidate key image frames. A plurality of first similarity indexes of the candidate key image frames are determined by the processor, wherein each of the first similarity indexes is determined by the processor through a similarity calculation according to a first covariance value of a corresponding one of the candidate key image frames and a plurality of first variation values statistically calculated in different directions of the corresponding candidate key image frame. A plurality of second similarity indexes of a plurality of adjacent image frames are determined by the processor, wherein each of the adjacent image frames is adjacent to at least one of the candidate key image frames, and each of the second similarity indexes is determined by the processor through the similarity calculation according to a second covariance value of a corresponding one of the adjacent image frames and a plurality of second variation values statistically calculated in different directions of the corresponding adjacent image frame. The candidate key image frames as well as the adjacent image frames that meet a second selection condition are selected as a plurality of key image frames, wherein the second selection condition comprises when a difference between a corresponding second similarity index of an adjacent image frame of the adjacent image frames and a corresponding first similarity index of a candidate key image frame adjacent to the adjacent image frame exceeds a similarity threshold, selecting the adjacent image frame as one of the key image frames. The key image frames are presented on a graphical user interface and annotation information for the one or more target objects is displayed through the graphical user interface by the processor.
According to another aspect of the present invention, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores one or more instructions executable by a processor to cause an electronic device including the processor to perform the image annotation method of the present invention.
According to yet another aspect of the present invention, an electronic device is provided. The electronic device includes a memory and a processor. The processor is coupled to the memory and is configured to: acquire a sequence of image frames comprising a plurality of image frames from video data; perform an object detecting and tracking procedure on the sequence of image frames, so as to identify and track one or more target objects from the image frames; select a plurality of candidate key image frames from the image frames according to a first selection condition, wherein the first selection condition comprises when a target object in the one or more target objects starts to appear or disappears in an image frame of the image frames, selecting the image frame as one of the candidate key image frames; determine a plurality of first similarity indexes of the candidate key image frames, wherein each of the first similarity indexes is determined by the processor through a similarity calculation according to a first covariance value of a corresponding one of the candidate key image frames and a plurality of first variation values statistically calculated in different directions of the corresponding candidate key image frame; determine a plurality of second similarity indexes of a plurality of adjacent image frames, wherein each of the adjacent image frames is adjacent to at least one of the candidate key image frames, and each of the second similarity indexes is determined by the processor through the similarity calculation according to a second covariance value of a corresponding one of the adjacent image frames and a plurality of second variation values statistically calculated in different directions of the corresponding adjacent image frame; select the candidate key image frames as well as the adjacent image frames that meet a second selection condition as a plurality of key image frames, wherein the second selection condition comprises when a difference between a corresponding second similarity index of an adjacent image frame of the adjacent image frames and a corresponding first similarity index of a candidate key image frame adjacent to the adjacent image frame exceeds a similarity threshold, selecting the adjacent image frame as one of the key image frames; and present the key image frames on a graphical user interface and display annotation information for the one or more target objects through the graphical user interface.
For a better understanding of the above and other aspects of the present invention, embodiments are described below in detail with reference to the accompanying drawings:
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In other instances, well-known structures and devices are schematically shown in order to simplify the drawing.
The present invention provides an image annotation method, an electronic device, and a non-transitory computer-readable storage medium. Image annotation may refer to, for example, recognizing one or more specific objects in video data through computer vision technology and assigning corresponding names or semantic descriptions to the identified objects. Taking unmanned autonomous vehicles as an example, video sensors on a vehicle may capture a video stream of driving images. Through image annotation technology, the automatic driving system may identify objects around the vehicle, such as pedestrians, vehicles, cats, and dogs, and react to the identified environmental objects and the corresponding annotations, for example by dodging a pedestrian that suddenly appears in front of the vehicle.
The image annotation method of the present invention may be implemented by an electronic device. For example, the electronic device includes a memory and a processor. The memory may store programs, instructions, data or files that the processor may obtain or execute. The processor is coupled to the memory and is configured to execute the image annotation method according to the embodiments of the present invention. The processor may, for example, be implemented as a microcontroller, a microprocessor, a digital signal processor, an application specific integrated circuit (ASIC), a digital logic circuit, a field programmable gate array (FPGA), or any other hardware element having processing functions. The image annotation method of the present invention may also be implemented as a software program, which can be stored on a non-transitory computer-readable storage medium such as a hard disk, a compact disc, a flash drive, or a memory. When the processor loads the software program from the non-transitory computer-readable storage medium, the image annotation method of the present invention may be executed.
In step 102, the processor performs video decompression to obtain an image frame sequence from the video data. The image frame sequence includes a plurality of image frames.
In step 104, the processor searches for candidate key image frames from the acquired image frames. In an embodiment, the processor may execute an object detecting and tracking procedure on the image frame sequence to identify and track one or more target objects from the image frames, and if it is determined that a change of the structural feature of a target object in an image frame exceeds a preset threshold, the image frame is selected as a candidate key image frame.
In step 106, the processor determines key image frames from the image frames. In addition to the candidate key image frames selected in step 104, the key image frames may also include the image frames adjacent to the candidate key image frames that meet particular conditions. Here, two image frames being “adjacent” to each other means that the two image frames are adjacent to each other in the time sequence of a consecutive image frame sequence (e.g., a video stream). For example, two image frames adjacent to each other may be obtained at two consecutive sampling time points.
In step 108, the processor presents the key image frames on a graphical user interface (GUI) and displays annotation information about the target objects through the GUI. The annotation information may include, for example, the name or the semantic description of the target objects, such as “pedestrian”, “moving car” and the like.
The GUI may also allow the user to select a new unidentified object from the key image frames and annotate it. For example, some objects may not be identified or tracked in an image frame containing a complex background. In this case, the user may manually select the unidentified object from the key image frames and annotate it. The object image selected by the user is called “a user-selected object”.
The term “user” as used herein includes, for example, a person or entity that owns an electronic device that is capable of performing the image annotation method of the present invention; a person or entity that operates or utilizes the electronic device; or a person or entity that is otherwise associated with the electronic device. It is contemplated that the term “user” is not intended to be limiting and may include various examples beyond those described.
In step 110, the processor performs object tracking on the user-selected object. This step can be done with any known object tracking algorithm.
In step 112, the processor obtains an annotation result. For example, the processor may receive a user operation via the GUI provided in step 108 and generate the annotation result in response to the user operation. The annotation result may include, for example, the user-selected objects and the annotation information about the user-selected objects. The user-selected objects may be extracted from the image contents of the key image frames. For example, the user may select a person's image in a key image frame as a user-selected object and key-in the corresponding annotation information as “pedestrian” through the GUI.
In an embodiment, the image annotation method may further include step 114. In step 114, the features of the user-selected object are extracted and enhanced. The results of feature extraction and enhancement may be provided as training samples to train and update the classifiers in step 104 for executing object detection, so that the performance of image annotation can be enhanced through the feedback of expert experience.
In step 202, the processor may detect the target object from a plurality of consecutive image frames in the video data. In an embodiment, the object detection procedure may be performed by using a hybrid variable-window object detection algorithm implemented by an image pyramid algorithm in combination with a classifier pyramid algorithm. The above hybrid algorithm will be described with reference to the accompanying drawings.
In step 204, the processor tracks the detected target object. In an embodiment, a histogram of oriented gradient (HOG) feature based kernelized correlation filter (KCF) object tracking procedure may be used to track the target objects.
For example, the processor may convert the target object image into a grayscale image so as to retrieve the HOG features of the target object, and perform a frequency domain transform on the HOG features to obtain HOG frequency domain features. The processor may execute a KCF object tracking procedure to track the HOG frequency domain features so as to track the target object. The frequency domain transform may be, for example, a Fourier transform, which can be expressed as follows:
In Equation 1, β represents the bin component stored in each HOG cell; and x and y represent the block coordinates for calculating the Fourier transform region.
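As an illustration only, the following Python sketch shows how HOG-style cell features can be taken into the frequency domain and correlated there, which is the core idea behind KCF-style tracking. The `hog_cell_map` helper, the cell size, the bin count, and the linear-correlation simplification are assumptions made for this sketch; the actual KCF procedure additionally involves kernelized ridge-regression training, which is not reproduced here.

```python
import numpy as np

def hog_cell_map(gray, cell=8, bins=9):
    """Very small HOG-like descriptor: one orientation histogram per cell.
    (Illustrative only; a production system would use a full HOG implementation.)"""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned gradient orientation
    h, w = gray.shape
    ch, cw = h // cell, w // cell
    cells = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            for b in range(bins):
                cells[i, j, b] = mag[sl][bin_idx[sl] == b].sum()
    return cells

def freq_correlation(template_cells, search_cells):
    """Correlate template and search region per HOG bin in the frequency domain
    (a linear-correlation simplification of KCF-style tracking)."""
    resp = np.zeros(search_cells.shape[:2])
    for b in range(search_cells.shape[2]):
        T = np.fft.fft2(template_cells[:, :, b], s=search_cells.shape[:2])
        S = np.fft.fft2(search_cells[:, :, b])
        resp += np.real(np.fft.ifft2(np.conj(T) * S))
    return np.unravel_index(np.argmax(resp), resp.shape)  # peak = estimated shift (in cells)

# Usage sketch with random data standing in for grayscale frames:
rng = np.random.default_rng(0)
frame = rng.random((128, 128))
template = frame[32:64, 32:64]
print(freq_correlation(hog_cell_map(template), hog_cell_map(frame)))
```

In practice one would typically rely on an existing tracker implementation (for example, the KCF tracker provided in OpenCV's contrib modules) rather than this hand-rolled sketch.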
In addition to the above, step 204 may also be implemented by any known object tracking algorithm, such as a detection-window algorithm or a correlation filter algorithm.
In step 206, the processor may determine whether an image frame meets a first selection condition. If yes, in step 208, the processor picks the image frame that meets the first selection condition as a candidate key image frame. If not, the processor proceeds to evaluate the next image frame. The first selection condition may include, for example, selecting an image frame as one of the candidate key image frames if a target object starts to appear or disappear in the image frame. The “appearing” or “disappearing” of an object refers to the situation where a change of the structural feature of the object exceeds a predetermined threshold. For example, if a pedestrian in the video data turns from the front to the back, the processor may determine that the object corresponding to the front of the person disappears and the object corresponding to the back of the person appears.
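For illustration, a minimal sketch of the first selection condition is given below, assuming the detect-and-track step produces a set of tracked object IDs for each frame (a representation not specified in the text).

```python
def select_candidate_key_frames(track_ids_per_frame):
    """Pick frame indices where the set of tracked object IDs changes,
    i.e. a target object starts to appear or disappears (first selection
    condition). `track_ids_per_frame` is a list of sets of object IDs,
    one set per image frame, produced by the detect-and-track step."""
    candidates = []
    previous = set()
    for idx, current in enumerate(track_ids_per_frame):
        appeared = current - previous
        disappeared = previous - current
        if appeared or disappeared:
            candidates.append(idx)
        previous = current
    return candidates

# Example: object 2 appears in frame 1 and disappears in frame 3.
print(select_candidate_key_frames([{1}, {1, 2}, {1, 2}, {1}, {1}]))  # -> [0, 1, 3]
```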
In step 402, the processor may calculate first similarity indexes of the candidate key image frames. For example, the first similarity index may be determined by the processor through a similarity calculation according to a first covariance value (σ1xy) of the corresponding candidate key image frame and a plurality of first variation values (σ1x, σ1y) statistically calculated in different directions (e.g., x and y directions) of the corresponding candidate key image frame. In an embodiment, the first similarity index (S1(x, y)) may be expressed as follows:
Np represents the total number of patches that an image frame is divided into, Nx represents the total number of block columns along the x direction in a patch, Ny represents the total number of block rows along the y direction in the patch, and μi represents the pixel average of the ith block in the patch.
In step 404, the processor obtains second similarity indexes of the adjacent image frames (where each of the adjacent image frames is adjacent to at least one of the candidate key image frames). The second similarity index may be, for example, determined by the processor through the similarity calculation according to a second covariance value (σ2xy) of the corresponding adjacent image frame and a plurality of second variation values (σ2x, σ2y) statistically calculated in different directions (e.g., x and y directions) of the corresponding adjacent image frame. In an embodiment, the second similarity index (S2(x, y)) may be expressed as follows:
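Since the exact expressions of the first and second similarity indexes (Equations 2 and 3) are not reproduced in this text, the sketch below uses an SSIM-style structural term built from the same kinds of quantities (a covariance value and variation values in the x and y directions, averaged over the patches of a frame) purely as an illustrative stand-in, not as the claimed formula.

```python
import numpy as np

def similarity_index(frame, patch=32, c=1e-6):
    """Patch-averaged similarity-style index built from a covariance value and
    variation values measured in the x and y directions of each patch.
    The exact expression of Equations 2 and 3 is not reproduced in this text,
    so an SSIM-like structural term is used here purely as an illustration."""
    h, w = frame.shape
    scores = []
    for r in range(0, h - patch + 1, patch):
        for col in range(0, w - patch + 1, patch):
            p = frame[r:r + patch, col:col + patch].astype(np.float64)
            gy, gx = np.gradient(p)                 # variation along y and x
            sigma_x, sigma_y = gx.std(), gy.std()   # directional variation values
            sigma_xy = np.mean((gx - gx.mean()) * (gy - gy.mean()))  # covariance
            scores.append((sigma_xy + c) / (sigma_x * sigma_y + c))
    return float(np.mean(scores))                   # average over all patches Np

# Usage with a random frame standing in for real image data:
print(similarity_index(np.random.default_rng(0).random((96, 96))))
```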
The similarity calculation used in steps 402 and 404 may also be implemented by other algorithms capable of measuring the degree of similarity between objects, such as a Euclidean distance algorithm, a cosine similarity algorithm, a Pearson correlation algorithm, or an inverse user frequency (IUF) similarity algorithm.
In step 406, the processor determines whether the adjacent image frame meets the second selection condition. The second selection condition may include, for example, when a difference between a corresponding second similarity index (S2(x, y)) of an adjacent image frame and a corresponding first similarity index (S1(x, y)) of a candidate key image frame adjacent to the adjacent image frame exceeds a similarity threshold (i.e., there is a large difference in the object structure between the two image frames), the adjacent image frame is selected as one of the key image frames.
In step 408, the processor selects the adjacent image frames of the candidate key image frames that meet the second selection condition as the key image frames.
Conversely, in step 410, an adjacent image frame that does not meet the second selection condition is not selected as a key image frame.
Thereafter, in step 412, the processor outputs all the candidate key image frames as well as the adjacent image frames which meet the second selection condition as the key image frames.
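A minimal sketch of steps 406 to 412 is given below; the numeric similarity threshold and the list-of-indices representation are assumptions made for illustration.

```python
def select_key_frames(candidate_idx, sim_index, threshold=0.15):
    """Output all candidate key image frames plus those adjacent image frames
    whose similarity index differs from the neighbouring candidate's index by
    more than `threshold` (second selection condition). `sim_index[i]` is the
    similarity index of frame i; `candidate_idx` lists candidate frame indices."""
    candidates = set(candidate_idx)
    key_frames = set(candidates)
    for c in candidates:
        for adj in (c - 1, c + 1):                       # temporally adjacent frames
            if adj in candidates or adj < 0 or adj >= len(sim_index):
                continue
            if abs(sim_index[adj] - sim_index[c]) > threshold:
                key_frames.add(adj)
    return sorted(key_frames)

# Frames F1..F7 of the example described below (0-based indices): F1, F4, F5, F6
# are candidates, F3 differs strongly from F4 and is added, while F2 and F7 are not.
print(select_key_frames([0, 3, 4, 5], [0.9, 0.88, 0.5, 0.92, 0.9, 0.91, 0.93]))
# -> [0, 2, 3, 4, 5], i.e. F1, F3, F4, F5, F6
```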
Next, a determination is made on the adjacent image frames F2, F3, F7 of the candidate key image frames F1 and F4 to F6. Since the adjacent image frames F2 and F7 are similar to the adjacent candidate key image frames F1 and F6, respectively, the adjacent image frames F2 and F7 are excluded from being selected as the key image frames. Since the adjacent image frame F3 and the adjacent candidate key image frame F4 are quite different, the adjacent image frame F3 is selected as one of the key image frames.
Finally, the outputted key image frames may include the image frames F1 and F3 to F6. The key image frames may, for example, be sorted into a sequence and displayed in a GUI.
The key image frame display area 602 may display a sequence of M key image frames KF1 to KFM, where M is a positive integer. The user may click on any of the key image frames in the key image frame display area 602, and the selected key image frame may be displayed in the main operation area 604.
The user may select an unidentified object in the main operation area 604.
The user may annotate the user-selected object by assigning it a corresponding name or semantic description. The related annotation information may, for example, be displayed in the annotation area 606A.
The annotation information of the identified target objects may be displayed in the annotation area 606B.
The GUI 600 may further include one or more operation keys 608. For example, after the operation key 608 (“+add object”) is clicked, the user is allowed to select a user-selected object from the content of the key image frame displayed in the main operation area 604 and add the corresponding annotation for the user-selected object. The operation key 608 may also be implemented as a drop-down menu. The menu may, for example, include a preset annotation description and/or an annotation description that has been used.
It should be noted that the above example is merely illustrative, and the present invention is not limited thereto.
Step 702 may be implemented with the object detecting and tracking algorithm used in step 104 described above.
Taking the enhanced HOG features as an example, the processor may execute a feature enhancement procedure as follows. The user-selected object is divided into a plurality of blocks. A to-be-processed block is selected from the blocks. A HOG feature extraction procedure is executed, so that a plurality of first HOG features of the to-be-processed block and a plurality of second HOG features of a plurality of adjacent blocks adjacent to the to-be-processed block are obtained. A norm operation is performed on a feature set including the first HOG features and the second HOG features to obtain a normalization parameter. The first HOG features are normalized according to the normalization parameter, so that a plurality of enhanced first HOG features for executing object detection in the object detecting and tracking procedure are obtained.
The HOG feature extraction procedure includes, for example, the following steps (an illustrative code sketch is given after the steps):
(1) Calculate the edge strength (Mi) of each pixel position in the block:
Mi = √((x−1 − x1)² + (y−1 − y1)²)  (Equation 4)
In Equation 4, x1 and x−1 represent pixel grayscale values in front and back of the target pixel position in the x direction, respectively, and y1 and y−1 represent pixel grayscale values above and below the target pixel position in the y direction, respectively.
(2) Calculate the sum of all the edge strengths in the block (Msum):
Msum = M1 + M2 + . . . + Mn  (Equation 5)
In Equation 5, n represents the total number of pixels in the block.
(3) Calculate the direction component (Bi) stored in each bin:
In Equation 6, Mb represents the number of edge strengths classified in a bin.
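The three steps above can be sketched as follows. Since Equation 6 is not reproduced in this text, the sketch simply accumulates edge strengths into orientation bins and divides by Msum; the bin count and the use of central differences are likewise assumptions made for illustration.

```python
import numpy as np

def block_hog(block, bins=9):
    """HOG direction components of one block, following the three steps above:
    (1) per-pixel edge strength Mi from the horizontal and vertical neighbours,
    (2) the block-wise sum Msum, and (3) one direction component per bin.
    How Equation 6 combines the bin contents is not reproduced here, so the
    bin sums are simply normalized by Msum as an illustrative assumption."""
    b = block.astype(np.float64)
    # Central differences: (x-1 minus x1) and (y-1 minus y1) around each pixel.
    dx = np.zeros_like(b)
    dy = np.zeros_like(b)
    dx[:, 1:-1] = b[:, :-2] - b[:, 2:]
    dy[1:-1, :] = b[:-2, :] - b[2:, :]
    strength = np.sqrt(dx ** 2 + dy ** 2)            # Equation 4
    m_sum = strength.sum()                           # Equation 5
    angle = np.mod(np.arctan2(dy, dx), np.pi)        # unsigned gradient direction
    bin_idx = np.minimum((angle / np.pi * bins).astype(int), bins - 1)
    hist = np.zeros(bins)
    for k in range(bins):
        hist[k] = strength[bin_idx == k].sum()
    return hist / (m_sum + 1e-12)                    # direction components Bi

# Usage: an 8x8 block with a vertical edge concentrates energy in one bin.
block = np.zeros((8, 8))
block[:, 4:] = 255.0
print(np.round(block_hog(block), 3))
```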
In addition, when normalizing a to-be-processed block, the features of the blocks adjacent to the to-be-processed block are taken into consideration to determine which vectors/edges are primary or continuous. The normalization is then executed for the prominent or important edge vectors.
In an embodiment, the normalization parameter may be expressed as follows:
|x| = √(x1² + x2² + . . . + xn²)  (Equation 7)
In Equation 7, x1 to xn represent the HOG features to be normalized; for example, the HOG features may include all of the first HOG features and the second HOG features. Next, the HOG feature normalization result may be obtained by dividing the pre-normalization result by the normalization parameter, that is, H(x, y)/|x|, where H(x, y) represents a pre-normalization result of the HOG features of the to-be-processed block.
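The neighbor-aware normalization can be sketched as follows, assuming the per-block HOG histograms are arranged in a (rows, columns, bins) array and that "adjacent blocks" means the 8-connected neighbours (the text does not specify the neighbourhood).

```python
import numpy as np

def enhance_block_features(hog_grid, row, col):
    """Enhanced (normalized) HOG features of the block at (row, col).
    `hog_grid` is assumed to be a (rows, cols, bins) array of per-block HOG
    histograms, e.g. as produced by the `block_hog` sketch above.  The
    normalization parameter is the L2 norm over the feature set formed by the
    block and its adjacent blocks (Equation 7); the block's own features are
    then divided by this parameter to obtain the normalization result."""
    rows, cols, _ = hog_grid.shape
    feature_set = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols:
                feature_set.append(hog_grid[r, c])
    norm = np.sqrt(np.sum(np.concatenate(feature_set) ** 2))  # Equation 7
    return hog_grid[row, col] / (norm + 1e-12)                # normalized features

# Usage: normalize the centre block of a random 3x3 grid of 9-bin histograms.
rng = np.random.default_rng(1)
grid = rng.random((3, 3, 9))
print(enhance_block_features(grid, 1, 1))
```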
In an embodiment, the processor may omit step 702 and train the classifier directly using the features of the user-selected object as training samples.
In the manner described above, the primary edge direction features of consecutive blocks can be enhanced. In an embodiment, the processor may arrange and store the calculated feature values according to the order of accessing the features when the object is detected/tracked, so as to accurately obtain the features of the user-selected object.
In step 904, the processor selects a classifier from the classifiers and provides the classifier with a plurality of training samples to establish a plurality of parameter ranges for a plurality of classes, wherein the classes correspond to classifications for the target objects and the user-selected objects.
In step 906, the processor searches in the parameter ranges for a distinguishable parameter range that does not overlap with other parameter ranges, and marks the corresponding class for the distinguishable parameter range as a distinguishable class.
In step 908, the processor selects a to-be-distinguished class from the classes, wherein the corresponding parameter range for the to-be-distinguished class overlaps with other parameter ranges in the parameter ranges. In an embodiment, the class whose parameter range overlaps with the largest number of other parameter ranges is selected as the to-be-distinguished class.
In step 910, the processor selects another classifier that is able to mark the to-be-distinguished class as the distinguishable class from the classifiers.
In step 912, the processor removes the parameter range corresponding to the to-be-distinguished class from the parameter ranges.
In step 914, the processor determines whether all of the selected classifiers in the classifiers allow each of the classes to be marked as a distinguishable class. If yes, the flow continues to step 916 to delete the unselected classifiers from the classifiers. If not, the flow goes back to step 906 to continue the adaptive training process until the selected classifiers allow each of the classes to be marked as a distinguishable class.
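A simplified greedy sketch of the adaptive classifier-selection flow (steps 904 to 916) is given below. It assumes the per-class parameter ranges of every classifier have already been established, and it resolves unresolved classes in a fixed order rather than by the amount of overlap; it is an illustration under these assumptions, not the claimed procedure.

```python
def overlaps(a, b):
    """True if two (min, max) parameter ranges overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def select_classifiers(ranges_per_classifier, classes):
    """Greedy sketch of the adaptive training flow: keep adding classifiers
    until every class is 'distinguishable', i.e. its parameter range under some
    selected classifier overlaps no other class's range for that classifier.
    `ranges_per_classifier[k][c]` is the (min, max) range of class c under
    classifier k. Returns the list of selected classifier indices."""
    selected, distinguishable = [], set()

    def distinguishes(k, c):
        rng = ranges_per_classifier[k][c]
        return all(not overlaps(rng, ranges_per_classifier[k][other])
                   for other in classes if other != c)

    while distinguishable != set(classes):
        # Pick an unresolved class, then a classifier that separates it.
        target = next(c for c in classes if c not in distinguishable)
        candidates = [k for k in range(len(ranges_per_classifier))
                      if k not in selected and distinguishes(k, target)]
        if not candidates:
            raise ValueError(f"no classifier can distinguish class {target}")
        best = candidates[0]
        selected.append(best)
        distinguishable.update(c for c in classes if distinguishes(best, c))
    return selected        # unselected classifiers can be discarded (step 916)

# Classifier 0 separates class A from B and C; classifier 1 additionally
# separates B and C from each other, so both classifiers end up selected.
ranges = [{"A": (0.0, 1.0), "B": (2.0, 3.0), "C": (2.5, 3.5)},
          {"A": (0.0, 1.0), "B": (1.5, 2.0), "C": (2.5, 3.0)}]
print(select_classifiers(ranges, ["A", "B", "C"]))   # -> [0, 1]
```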
In an embodiment, the processor may provide a plurality of particular training samples for a particular class to a classifier to obtain a plurality of distance values, and determine a particular parameter range for the particular class according to an average value of the distance values and a standard deviation of the distance values. Below, the details are described in conjunction with
In addition, according to the following embodiments, the training samples for an untrained object class (e.g., an object class corresponding to the user-selected object) are used as positive samples for the classifiers, and the training samples for other object classes are used as negative samples for the classifiers.
di,j(k) = −ρk + wk · xi,j
where wk and ρk represent the weight vector and the offset of the kth SVM classifier, respectively, xi,j represents the jth training sample of the ith class, and sti represents the number of training samples for the ith class.
In this manner, different classes can be projected onto a one-dimensional space, wherein OSHk represents the distance value reference point for the kth SVM classifier.
Based on the corresponding distance average values (μ1(k) and μ2(k)) and the standard deviations (σ1(k) and σ2(k)) of the respective classes LP1 and LP2, the upper limit of each parameter range can be expressed, for example, as follows:
maxi(k) = μi(k) + σi(k)  (Equation 12)
The lower limit of each parameter range may be expressed as follows for example:
mini(k) = μi(k) − σi(k)  (Equation 13)
Although in the above example the upper limit and the lower limit of the parameter range are each separated from the average value by one standard deviation, the present invention is not limited thereto. The size of the parameter range can be adjusted depending on the application.
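For illustration, the parameter range of Equations 12 and 13 can be computed from a class's distance values as follows; the decision values in the usage example are made up.

```python
import numpy as np

def class_parameter_range(distance_values, n_std=1.0):
    """Parameter range of one class under one classifier, built from the
    average and the standard deviation of that class's distance values
    (Equations 12 and 13). `distance_values` would be the decision values
    di,j(k) of the class's training samples for the kth SVM classifier; the
    one-standard-deviation width is the default used in the text and can be
    adjusted depending on the application."""
    d = np.asarray(distance_values, dtype=np.float64)
    mu, sigma = d.mean(), d.std()
    return mu - n_std * sigma, mu + n_std * sigma     # (mini(k), maxi(k))

# Usage with made-up decision values for two classes of one classifier:
lp1 = class_parameter_range([1.2, 1.4, 1.1, 1.3])
lp2 = class_parameter_range([-0.9, -1.1, -1.0, -0.8])
print(lp1, lp2)    # non-overlapping ranges -> both classes are distinguishable
```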
Based on the above, the present invention provides an image annotation method, an electronic device and a non-transitory computer-readable storage medium, which can automatically filter out highly repetitive, redundant image frame samples in the video data, extract key image frames exhibiting object structure diversity, and provide the key image frames to the user for browsing, so that the user can add and/or modify annotation items, thereby improving the annotation result and reducing the labor required for image annotation. In addition, the technique proposed in the present invention involves an expert experience feedback mechanism to enhance the accuracy and robustness of key image frame extraction.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed embodiments. It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims and their equivalents.