This application claims the priority benefit of Taiwan application no. 109135458, filed on Oct. 14, 2020. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to an image recognition method and an image recognition system.
In the field of image recognition, neural networks have been widely used. However, different types of recognition often require different neural network architectures. Therefore, traditionally, recognizing multiple features requires constructing multiple sets of neural networks. How to recognize multiple features with a single neural network architecture so as to improve performance is thus a goal pursued by those skilled in the art.
The disclosure provides an image recognition method and an image recognition system, which can simultaneously output recognition results corresponding to different detection tasks according to obtained features.
The disclosure provides an image recognition method for a plurality of detection tasks. The image recognition method includes: obtaining an image to be recognized by an image sensor; inputting the image to be recognized to a single convolutional neural network; obtaining a first feature map of a first detection task and a second feature map of a second detection task according to an output result of the single convolutional neural network, wherein the first feature map and the second feature map have a shared feature; using an end-layer network module to generate a first recognition result corresponding to the first detection task from the image to be recognized according to the first feature map, and to generate a second recognition result corresponding to the second detection task from the image to be recognized according to the second feature map; and outputting the first recognition result corresponding to the first detection task and the second recognition result corresponding to the second detection task.
The disclosure provides an image recognition system, which includes an image sensor, a storage device, an output device and a processor. The image sensor obtains an image to be recognized. The processor is coupled to the image sensor, the storage device and the output device. The processor inputs the image to be recognized to a single convolutional neural network. The storage device stores the single convolutional neural network. The processor obtains a first feature map belonging to a first detection task and a second feature map belonging to a second detection task according to an output result of the single convolutional neural network, wherein the first feature map and the second feature map have a shared feature. The processor uses an end-layer network module to generate a first recognition result corresponding to the first detection task from the image to be recognized according to the first feature map, and to generate a second recognition result corresponding to the second detection task from the image to be recognized according to the second feature map. The output device outputs the first recognition result corresponding to the first detection task and the second recognition result corresponding to the second detection task.
Based on the above, the image recognition method and system of the exemplary embodiments of the disclosure can obtain the recognition results of different detection tasks by using the single convolutional neural network.
To make the aforementioned more comprehensible, several embodiments accompanied with drawings are described in detail as follows.
Referring to
The image recognition system 100 includes an image sensor 110, a storage device 120, an output device 130 and a processor 140.
The image sensor 110 is configured to obtain an image to be recognized. In this exemplary embodiment, the image sensor 110 is, for example, a camcorder, or a camera of a mobile device.
The storage device 120 is configured to store the single convolutional neural network, the image to be recognized and the recognition results. In this exemplary embodiment, the storage device 120 may be, for example, a fixed or movable device in any possible form, including a random access memory (RAM), a read-only memory (ROM), a flash memory, a hard drive or other similar devices, or a combination of the above-mentioned devices.
The output device 130 is a device or an element configured to output an image recognition result. The output device 130 is, for example, a display. For instance, when the image to be recognized is a road image (e.g., including images of trees, vehicles and pedestrians), the output device 130 can output the images of trees, vehicles and pedestrians in the image to be recognized according to the image recognition method of the disclosure.
The processor 140 is coupled to the image sensor 110, the storage device 120 and the output device 130 to control operations of the image recognition system 100. In this exemplary embodiment, the processor 140 is, for example, a general-purpose processor, a special-purpose processor, a conventional processor, a digital signal processor, a plurality of microprocessors, one or more microprocessors combined with a digital signal processor core, a controller, a microcontroller, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), any other kind of integrated circuit, a state machine, a processor based on an Advanced RISC Machine (ARM), or similar products.
In this exemplary embodiment, the processor 140 can run an image recognition module based on the single convolutional neural network to perform recognitions of multiple detection tasks at the same time.
Referring to
In this exemplary embodiment, the processor 140 runs the image preprocessing module 2002 to perform a preprocessing on the image to be recognized obtained by the image sensor 110.
Referring to
Specifically, it is assumed that the dimensions of the neural network model include a width W_M and a height H_M, as shown by an image padding in an operation 310 in
Further, in another example, as shown by an operation 320 and an operation 330 and Formula 1 below, the image preprocessing module 2002 can also resize the raw image 321 by using a bicubic interpolation to obtain an input image 322.
I(u,v) = Σ_{x,y=0}^{3} a_{xy} · I_I(u_x, v_y)   (Formula 1)
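As a non-authoritative illustration of the preprocessing described above, the following sketch either pads a raw image up to the model dimensions (operation 310) or resizes it with bicubic interpolation (operations 320 and 330, cf. Formula 1). The use of OpenCV, the function name preprocess and its parameters are assumptions for illustration only, not the exact implementation of the image preprocessing module 2002.

```python
# Illustrative preprocessing sketch (hypothetical helper, not the patent's code).
import cv2
import numpy as np

def preprocess(raw_image: np.ndarray, model_w: int, model_h: int,
               use_padding: bool = True) -> np.ndarray:
    h, w = raw_image.shape[:2]
    if use_padding:
        # Operation 310: pad the raw image up to the model dimensions W_M x H_M.
        pad_right = max(model_w - w, 0)
        pad_bottom = max(model_h - h, 0)
        return cv2.copyMakeBorder(raw_image, 0, pad_bottom, 0, pad_right,
                                  cv2.BORDER_CONSTANT, value=0)
    # Operations 320/330: resize with bicubic interpolation (cf. Formula 1).
    return cv2.resize(raw_image, (model_w, model_h),
                      interpolation=cv2.INTER_CUBIC)
```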
Referring to
Specifically, the backbone architecture module 2004 is configured with a plurality of convolution layers in the single convolutional neural network, and the processor 140 runs the backbone architecture module 2004 to extract features corresponding to the detection tasks from the image to be recognized, so as to generate the feature maps. Then, the processor 140 runs the end-layer network module 2006 to perform the detection tasks and recognitions.
In this exemplary embodiment, for example, a first detection task is a 2D object detection task, and a second detection task is an image segmentation detection task. Accordingly, the backbone architecture module 2004 outputs a first feature map corresponding to the first detection task and a second feature map corresponding to the second detection task, and the end-layer network module 2006 performs an object detection for the first detection task and a point detection for the second detection task according to the feature maps.
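Purely for illustration, a minimal sketch of this shared-backbone arrangement is given below, assuming a PyTorch-style model in which one set of convolution layers (playing the role of the backbone architecture module 2004) feeds two small prediction heads (playing the role of the end-layer network module 2006). All layer counts, channel sizes and names are hypothetical assumptions, not the claimed architecture.

```python
# Minimal sketch: one shared convolutional backbone, two task-specific heads.
import torch
import torch.nn as nn

class SharedBackboneNet(nn.Module):
    def __init__(self, num_classes: int = 3, boxes_per_cell: int = 3):
        super().__init__()
        # Shared convolution layers (role of backbone architecture module 2004).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Task-specific prediction layers (role of end-layer network module 2006).
        det_ch = boxes_per_cell * (4 + 1 + num_classes)  # x, y, w, h, conf, classes
        seg_ch = 2 + 1 + num_classes                     # x, y, conf, classes
        self.det_head = nn.Conv2d(128, det_ch, 1)        # first detection task (2D objects)
        self.seg_head = nn.Conv2d(128, seg_ch, 1)        # second detection task (points)

    def forward(self, image: torch.Tensor):
        shared = self.backbone(image)                    # shared feature map
        return self.det_head(shared), self.seg_head(shared)
```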
Referring to
For example, in this exemplary embodiment, because the first detection task is the 2D object detection task, the first feature map of the first detection task can include coordinates of a bounding box, a width and a height of the bounding box, a detection confidence level of the bounding box and a class probability of the bounding box. Because the second detection task is the image segmentation detection task, the second feature map of the second detection task can include the coordinates of the bounding box, the detection confidence level of the bounding box and the class probability of the bounding box.
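The per-cell channel layout implied by the above description might be decoded as in the following sketch. The channel ordering and the softmax over class scores are assumptions for illustration, not the patent's exact encoding.

```python
# Hedged sketch of splitting one grid cell's channels into the fields listed above.
import numpy as np

def decode_detection_cell(cell: np.ndarray) -> dict:
    """cell: 1-D array [x, y, w, h, confidence, class scores...] (assumed order)."""
    x, y, w, h, conf = cell[:5]
    class_prob = np.exp(cell[5:]) / np.exp(cell[5:]).sum()   # softmax over classes
    return {"box": (x, y, w, h), "confidence": conf, "class_prob": class_prob}

def decode_segmentation_cell(cell: np.ndarray) -> dict:
    """cell: 1-D array [x, y, confidence, class scores...] (assumed order)."""
    x, y, conf = cell[:3]
    class_prob = np.exp(cell[3:]) / np.exp(cell[3:]).sum()
    return {"point": (x, y), "confidence": conf, "class_prob": class_prob}
```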
Referring to
Referring to
As shown in
After calculating the vector analysis between the neighboring points GS_n and GS_{n+1}, the processor 140 can repeatedly execute GS_n = GS_n + Δ_xy for each GS_n ≤ GS_{n+1}, and convert every point that first encounters negative cell data into positive data. In this way, the processor 140 can obtain a positive data grid map 520 from the image segmentation ground truth 510. That is, the positive data grid map 520 in which all cells containing the points GS_0, GS_1, GS_2, GS_3, GS_4, GS_5, GS_6 and GS_7 are positive may be obtained.
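As a hedged sketch of this step, the following assumes that the vector analysis amounts to stepping from each ground-truth point toward its neighbor in small increments Δ_xy and marking every grid cell visited as positive. The function name, step count and closed-contour assumption are illustrative only.

```python
# Illustrative construction of a positive data grid map from ordered contour points.
import numpy as np

def positive_data_grid_map(points, grid_h: int, grid_w: int, steps: int = 32):
    """points: ordered list of (x, y) ground-truth points (assumed closed contour)."""
    grid = np.zeros((grid_h, grid_w), dtype=bool)
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        dx, dy = (x1 - x0) / steps, (y1 - y0) / steps      # increment delta_xy
        x, y = float(x0), float(y0)
        for _ in range(steps + 1):
            cx, cy = int(round(x)), int(round(y))
            if 0 <= cy < grid_h and 0 <= cx < grid_w:
                grid[cy, cx] = True                        # negative cell -> positive
            x, y = x + dx, y + dy
    return grid
```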
Referring to
In an exemplary embodiment, a first loss function may be configured for the first detection task, and a second loss function may be configured for the second detection task. The first loss function is configured to measure an error between a first recognition result and a first reference result corresponding to the first detection task, and the second loss function is configured to measure an error between a second recognition result and a second reference result corresponding to the second detection task.
According to the image segmentation ground truth 510 (i.e., GS(x,y,c)) and the point predictions RS(x, y, c, cl, p) generated by the method shown in
wherein λ_pt denotes a normalization weight for a positive xy-prediction; i_pt denotes a positive point prediction; λ_nopt denotes a normalization weight for a negative (null value) xy-prediction; i_nopt denotes a negative point prediction; GS_i(cl)=1; GS_i(p(c))=1; m_f denotes a batch number; and f denotes a frame index.
The backbone architecture module 2004 of the processor 140 can obtain a 2D ground truth 810 from the input image (e.g., the input image 312). According to the 2D ground truth 810 (i.e., GD(x, y, w, h, c)) and the detection box predictions RD(x, y, w, h, c, cl, p) generated by the method shown by
wherein λ_xy denotes a normalization weight for a positive xy-prediction; i,j_bb denotes a positive detection prediction; λ_nobb denotes a normalization weight for a negative (null value) xy-prediction; i,j_nobb denotes a negative detection prediction; GD_i(a) is 1; GD_i(p(c)) is 1; A denotes a total number of boxes; m_f denotes a batch number; and f denotes a frame index.
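The bodies of the loss formulas are not reproduced in this text. Purely as an assumption-laden sketch, a weighted sum-of-squared-errors term of the kind suggested by the normalization weights above (λ_pt/λ_nopt for point predictions, λ_xy/λ_nobb for box predictions) could be written as follows; this is not Formulas 4 to 12 themselves, and every name here is illustrative.

```python
# Generic weighted positive/negative squared-error term (illustrative only).
import torch

def weighted_sse(pred: torch.Tensor, target: torch.Tensor,
                 pos_mask: torch.Tensor, lam_pos: float, lam_neg: float) -> torch.Tensor:
    """Sum of squared errors with separate weights for positive and null-value cells."""
    err = (pred - target) ** 2
    pos = (err * pos_mask).sum()            # cells containing a ground-truth box/point
    neg = (err * (1.0 - pos_mask)).sum()    # negative (null value) cells
    return lam_pos * pos + lam_neg * neg
```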
In an exemplary embodiment, the processor 140 can configure a plurality of prediction layers in the end-layer network module 2006, and process the shared feature map according to the first loss function corresponding to the first detection task and the second loss function corresponding to the second detection task. In other words, after the processor 140 extracts the shared feature from the image to be recognized (the input image 312) and generates the shared feature map of the first feature maps 420-1, 420-2 and 420-3 and the second feature map 430 through the backbone architecture module 2004, the processor 140 can process the shared feature map according to the first loss function corresponding to the first detection task obtained from Formulas 4 to 7 above and the second loss function corresponding to the second detection task obtained from Formulas 8 to 12 above.
In an exemplary embodiment, the processor 140 can use a plurality of normalization weights to balance a range of loss values of the second detection task against a range of loss values of the first detection task, so as to adjust a learnable weight of the backbone architecture module 2004. For example, after the first loss function (the segmentation loss Seg_loss) and the second loss function (the 2D object detection loss Obj_loss) are obtained, a combined loss Final_loss may be obtained by using Formula 13 below. The processor 140 can use the combined loss Final_loss to adjust the learnable weight of the backbone architecture module 2004.
denotes a minimum segmentation loss when the backbone architecture module 2004 was trained only with the image segmentation detection task;
denotes a minimum 2D object detection loss when the backbone architecture module was trained only with the 2D object detection task.
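A hedged reading of Formula 13, consistent with the two minimum losses described above, is that each task loss is normalized by the minimum loss reached when the backbone architecture module 2004 was trained on that task alone, so that both terms fall in a comparable range before being summed. The sketch below encodes only that reading and is not the formula itself.

```python
# Illustrative combination of the two task losses, each normalized by the
# minimum loss obtained under single-task training (assumed interpretation).
def combined_loss(seg_loss: float, obj_loss: float,
                  seg_loss_min: float, obj_loss_min: float) -> float:
    return seg_loss / seg_loss_min + obj_loss / obj_loss_min
```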
In an exemplary embodiment, the processor 140 can use a computer vision technique in the end-layer network module to cluster and connect the recognition result corresponding to the second detection task.
Referring to
In this exemplary embodiment, the operation in which the processor 140 uses the computer vision technique to obtain the segmentation map can be divided into three stages. In the first stage, the processor 140 can recognize a start point st_pt and an end point ed_pt by using Formula 14 and Formula 15 below.
wherein a_pt denotes positive point predictions, and b_pt denotes bottom positive point predictions.
In the second stage, the processor 140 continues to find out an order index of the points.
Referring to
Starting from the point 910-1, the kernel searches whether there are positive cells among the neighboring cells in the positive data grid map 930. Here, the kernel may be a kernel 940 with both length and width being 3 in
When no neighboring positive cell can be found in the positive data grid map 930 by using the kernel 940 (i.e., K_n), another kernel 950 with both length and width being 5 (K_gap) may be used to search the neighboring cells in the feature map 920 and the positive data grid map 930, and such operation can be expressed by Formula 18 to Formula 20 below.
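The following sketch illustrates one way this second-stage ordering could proceed under the description above: starting from the start point, the next unvisited positive cell is sought within a 3×3 neighborhood (K_n), falling back to a 5×5 neighborhood (K_gap) when none is found. Formulas 18 to 20 are not reproduced here, so the details below are assumptions rather than the claimed procedure.

```python
# Illustrative ordering of positive cells by 3x3 / 5x5 neighborhood search.
import numpy as np

def order_positive_cells(grid: np.ndarray, start):
    """grid: boolean positive data grid map; start: (row, col) of the start point."""
    ordered, visited, current = [start], {start}, start
    while True:
        nxt = _find_neighbor(grid, current, visited, radius=1)      # 3x3 kernel K_n
        if nxt is None:
            nxt = _find_neighbor(grid, current, visited, radius=2)  # 5x5 kernel K_gap
        if nxt is None:
            return ordered                  # no further connected positive cells
        ordered.append(nxt)
        visited.add(nxt)
        current = nxt

def _find_neighbor(grid, center, visited, radius):
    cy, cx = center
    h, w = grid.shape
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            if 0 <= y < h and 0 <= x < w and grid[y, x] and (y, x) not in visited:
                return (y, x)
    return None
```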
After the second stage is processed, the sorted points may then be used to draw a contour as shown in
Referring to
In step S1020, the processor 140 performs a preprocessing on the image.
In step S1030, the processor 140 can input the image to be recognized (i.e., the input image 312 or the input image 322) to the single convolutional neural network.
Next, in step S1040, the processor 140 obtains a first feature map belonging to a first detection task and a second feature map belonging to a second detection task according to the single convolutional neural network.
In step S1050, the processor 140 can generate a first recognition result (i.e., a 2D object detection result) corresponding to the first detection task (a 2D object detection) from the image to be recognized according to the first feature map (the first feature maps 420-1, 420-2 and 420-3), and generate a second recognition result (i.e., the image segmentation result) corresponding to the second detection task (the image segmentation) from the image to be recognized according to the second feature map (the second feature map 430).
In step S1060, the output device 130 outputs the first recognition result (i.e., the 2D object detection result) corresponding to the first detection task (the 2D object detection) and the second recognition result (i.e., the image segmentation result) corresponding to the second detection task (i.e., the image segmentation).
In summary, the image recognition method and system of the exemplary embodiments of the disclosure can obtain the recognition results of different detection tasks by simply using the single convolutional neural network when there is a shared feature between the feature maps of the different detection tasks. As a result, the time required for image recognition can be reduced and the accuracy of image recognition can be improved.
Although the present disclosure has been described with reference to the above embodiments, it is apparent to one of the ordinary skill in the art that modifications to the described embodiments may be made without departing from the spirit of the present disclosure. Accordingly, the scope of the present disclosure will be defined by the attached claims not by the above detailed descriptions.