This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2020-142879, filed Aug. 26, 2020, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate to an inference apparatus, an inference method, a non-transitory computer readable medium, and a learning apparatus.
In the image classification field, such as intruder detection based on images from a security camera and product anomaly detection, discrimination processing with a neural network has been adopted. Specifically, discrimination processing with a neural network is used for purposes such as detection of an intruder appearing at a small size in an image photographed with a security camera, and detection of a small defect of a product from an appearance inspection image in a factory.
In generally adopted discrimination processing with a neural network, when the discrimination processing is executed for an image with a large number of pixels, the image size is reduced in the first half of the convolution processing. For this reason, the resolution of the image is reduced, and the discrimination accuracy deteriorates. In addition, generation of an interest map requires processing in addition to the detection processing, and the resulting processing quantity and/or delay poses a problem in situations in which the discrimination result must be acquired in a short time. There is also the problem that the interest map is insufficient as grounds for identification, because the interest map does not appear in the process of the discrimination processing itself.
In general, according to one embodiment, an inference apparatus includes a processor. The processor generates an intermediate signal by processing an input signal with a convolutional neural network. The processor extracts one or more intermediate partial signals each serving as part of the intermediate signal from the intermediate signal. The processor calculates a statistic of the one or more intermediate partial signals. The processor outputs an inference result relating to the input signal and corresponding to the statistic.
An inference apparatus, an inference method, a non-transitory computer readable medium, and a learning apparatus according to the present embodiment will now be explained in detail with reference to the drawings. In the following embodiments, elements denoted by the same reference numerals execute the same operations, and overlapping explanations thereof will be omitted.
An inference apparatus according to a first embodiment will be explained with reference to a block diagram of
An inference apparatus 10 according to the first embodiment includes an extraction unit 101, a convolution processing unit 102, a calculation unit 103, an output unit 104, and a display controller 105.
The extraction unit 101 receives an input signal. The input signal is, for example, an image signal. The image signal may be a still image, or a moving image including a predetermined number of time-series images. As another example, the input signal may be a one-dimensional time-series signal. The one-dimensional time-series signal is, for example, a sound signal and/or an optical signal acquired for a predetermined time.
The extraction unit 101 extracts one or more partial signals, each of which is a different part of the input signal. For example, when the input signal is a still image, a partial signal is a partial image acquired by extracting a predetermined part of the still image. The partial signals may have the same size or different sizes. When the extraction unit 101 extracts a plurality of partial signals from the input signal, the extracted partial signals may overlap one another or may be mutually non-overlapping.
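For illustration only, extraction of partial images of this kind can be sketched as follows. This is a minimal sketch rather than part of the embodiment itself; the patch coordinates, image size, and function name are hypothetical.

```python
import numpy as np

def extract_patches(image: np.ndarray, boxes):
    """Extract rectangular partial images from a 2-D image.

    `boxes` is a list of (top, left, height, width) tuples; the regions
    may overlap or be disjoint, and need not share a single size.
    """
    return [image[t:t + h, l:l + w] for (t, l, h, w) in boxes]

# Hypothetical usage: four 64x64 partial images, two of them overlapping.
image = np.random.rand(512, 512)
patches = extract_patches(image, [(0, 0, 64, 64), (32, 32, 64, 64),
                                  (100, 300, 64, 64), (400, 200, 64, 64)])
```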
The convolution processing unit 102 includes a convolutional neural network having a layered structure formed of a plurality of convolution layers. The convolution processing unit 102 receives the one or more partial signals from the extraction unit 101, and processes each of the partial signals with the convolutional neural network to generate one or more intermediate partial signals corresponding to the one or more partial signals.
A plurality of convolution processing units 102 may be provided in one-to-one correspondence with the partial signals extracted by the extraction unit 101. When a plurality of convolution processing units 102 are provided, the convolutional neural networks included in the respective convolution processing units 102 may have the same parameters or mutually different parameters, such as the weight coefficients and the bias values. Alternatively, only one convolution processing unit 102 may be provided. In this case, it suffices that the partial signals are processed successively in a time-division manner.
The calculation unit 103 receives the intermediate partial signals from the convolution processing unit 102, and executes statistical processing for the intermediate partial signals to calculate a statistic.
The output unit 104 receives the statistic from the calculation unit 103, and outputs an inference result relating to the input signal and corresponding to the statistic.
The display controller 105 executes emphasis processing corresponding to the statistic for the intermediate partial signals, and superimposes and displays the emphasized intermediate partial signals as an interest map on at least one of the input signal and the partial signals. The display controller 105 is illustrated as part of the inference apparatus 10, but the structure is not limited to this. The display controller 105 may be a unit separate from the inference apparatus 10.
The following is an explanation of an operation example of the inference apparatus 10 according to the first embodiment with reference to a flowchart of
At Step S201, the extraction unit 101 extracts a plurality of partial signals from the input signal.
At Step S202, the convolution processing unit 102 executes convolution processing for each of the partial signals with the convolutional neural network to generate a plurality of intermediate partial signals.
At Step S203, the calculation unit 103 calculates a statistic of the intermediate partial signals. In this example, the calculation unit 103 calculates the mean value of each of the intermediate partial signals.
At Step S204, the calculation unit 103 calculates a maximum value from the mean values.
At Step S205, the output unit 104 applies a function to the maximum value, and outputs, as an inference result relating to the input signal, for example, the probability that the input signal corresponds to a class of an inference target.
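The flow of Steps S201 to S205 can be summarized in the following minimal sketch. It assumes a PyTorch-style convolutional network `cnn` whose last layer outputs one channel, as described later; the function name and data shapes are hypothetical, and the sketch is an illustration rather than the embodiment's definitive implementation.

```python
import torch

def infer(partial_images, cnn):
    """Steps S202 to S205: convolve each partial image, take the mean of
    each intermediate partial image, take the maximum of the means, and
    map the maximum to a probability with a sigmoid function."""
    means = []
    for patch in partial_images:                # each patch: shape (1, H, W)
        intermediate = cnn(patch.unsqueeze(0))  # S202: shape (1, 1, H, W)
        means.append(intermediate.mean())       # S203: one mean per intermediate image
    maximum = torch.stack(means).max()          # S204: maximum among the means
    return torch.sigmoid(maximum)               # S205: probability of the class
```

Because the mean and the maximum both pass gradients through (the maximum with respect to its selected element), the same pipeline can also be trained end to end by error back propagation, as used in the fourth embodiment.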
In the flowchart of
The following is an explanation of an example of the input signal supposed in the first embodiment with reference to
For example, the extraction unit 101 extracts four partial images, that is, partial images 601, 602, 603, and 604 having the same size, such as rectangular parts each enclosed with broken lines in
Although detection of a foreign substance is easier in partial images having similar image patterns, such as the partial image 601 and the partial image 602, it is also possible to process partial images whose image patterns differ due to a difference in shape, such as the partial image 603 and the partial image 604. This is because, in the training of the neural network described later, the network can be trained not to react to a mere difference in image pattern as a defect. This structure enables the inference apparatus 10 to process the extracted partial images together.
The following is an explanation of a first example of convolution processing in the convolution processing unit 102 with reference to
In the example of
The kernel is supposed to be moved one pixel at a time (that is, with a stride of one). At the end portions of the partial image 701 on which the convolutional operation is performed, and of the subsequent intermediate partial image 703, the peripheral pixels are secured by zero padding or by copying the pixel values of the end portion. This structure maintains the size of the intermediate partial image input to each subsequent convolution layer at the original partial image size, without changing the numbers of vertical and horizontal pixels, even when the convolutional operation is performed with a stride. Specifically, the number of pieces of sampling data in the intermediate partial image (intermediate partial signal) is the same as that in the partial image (partial signal).
In addition to a product-sum operation, a predetermined bias value may be added to the sum of products. The bias value may be fixed for the whole image space in the same manner as the weight coefficient.
In addition, an activation layer may be inserted between layers of a plurality of convolution layers. The activation layer executes activation processing by applying a predetermined function, such as ReLU (Rectified Linear Unit), to the intermediate partial image 703 serving as the output from the convolution layer and acquired by a product-sum operation and addition of the bias value.
The activation layer is not necessarily applied after every convolution layer. Specifically, a pattern in which convolution layers are connected successively without an activation layer interposed between them and a pattern in which an activation layer is connected after a convolution layer may exist in a mixed manner.
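As one illustration of such a layered structure, a size-preserving stack of convolution layers may be sketched as follows. The channel counts and kernel size are hypothetical; a 3x3 kernel with a stride of one and zero padding of one keeps the vertical and horizontal pixel counts unchanged, and an activation layer follows some convolution layers but not all, in line with the mixed pattern described above.

```python
import torch.nn as nn

# Hypothetical size-preserving CNN: 3x3 kernels, stride 1, zero padding 1,
# so an H x W input stays H x W through every layer. A bias value is added
# by default in each convolution layer.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),                                             # activation layer
    nn.Conv2d(8, 8, kernel_size=3, stride=1, padding=1),   # no activation after this layer
    nn.Conv2d(8, 1, kernel_size=3, stride=1, padding=1),   # last layer: one channel
)
```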
The following is an explanation of a second example of the convolution processing with the convolution processing unit 102 with reference to
The intermediate partial image 703 generated with the convolution layer may be formed of a plurality of channels. For example, in the case of a color image, the intermediate partial image 703 is an image of three channels corresponding to RGB signals. A convolution layer including a plurality of channels has a higher degree of freedom of processing, and is capable of dealing with various images. The example of
In addition, the weight coefficients and the bias values of the kernels used for the channels differ between the channels. Specifically, even when the position of the kernel is the same, that is, even at the same pixel position in the intermediate partial image 703 including a plurality of channels, the pixel values differ between the channels.
The following is an explanation of an operation example of the inference apparatus 10 according to the first embodiment illustrated in
The extraction unit 101 extracts a partial image 601 and a partial image 602 from an input image 900 serving as the identification target.
The convolution processing unit 102 executes convolution processing for each of the partial image 601 and the partial image 602 using the convolutional neural network. The last layer of the convolutional neural network, that is, the last convolution layer generating the output from the convolution processing unit 102, is designed to have a one-channel output. As illustrated in
The calculation unit 103 calculates the mean value 901 of the pixels of each intermediate partial image 706 acquired with the convolution processing unit 102. Specifically, one mean value 901 is calculated from one intermediate partial image 706. The calculation unit 103 then calculates the maximum value 902 among the calculated mean values 901. The calculation unit 103 is not limited to calculating mean values; it may instead select the maximum pixel value among the pixels of the whole intermediate partial image 706 as the maximum value 902.
The output unit 104 applies a function to the maximum value 902. In this example, a sigmoid function is applied to the maximum value 902 to output an inference result 903. The inference result 903 is, for example, the probability that the input image 900 has a defect. By applying the sigmoid function, the output takes a value between zero and one. For this reason, when the output value is output as it is, it indicates the probability that the input image has a defect. As another example, the value “0.5” may be set as a threshold, and a result of binary determination may be output as the inference result 903. For example, a result “a defect exists” (“defective”) is output when the output value of the sigmoid function is equal to or higher than the threshold, and a result “no defect exists” (“not defective”) is output when the output value of the sigmoid function is smaller than the threshold.
As illustrated in
In
An input to the softmax function is similar to that of the second modification illustrated in
The following is an explanation of a display example of an interest map serving as the grounds for the inference result acquired by the inference apparatus 10, with reference to
The intermediate partial image that is acquired in the process of the inference processing with the inference apparatus 10 and that serves as the source of the mean value selected as the maximum value can be used, without any additional processing, as an interest map relating to defects.
For example,
As illustrated in
By contrast, because the intermediate partial image 1402 corresponding to the partial image 604 includes no foreign substance, the pixel values are uniformly small in the region of the intermediate partial image 1402. Because no defect exists inside the intermediate partial image 1402, the mean value of its luminance values is small, the maximum value is accordingly small, and consequently the possibility that the image is inferred as “defective” decreases.
Accordingly, by displaying the intermediate partial image generated with the inference apparatus 10 as the interest map in association with the input image, the user is enabled to check the partial image inferred as “defective”.
The position information (coordinate information) of the partial image within the input image may be provided to the partial image as a label or the like, and may be attached to the intermediate partial image as it is even after the partial image is processed with the convolution processing unit 102. As another example, the calculation unit 103 may receive the position information of the partial image, and associate the position information with the intermediate partial image serving as the output from the convolution processing unit 102.
In
The check image described above may be displayed on an external display device when the display controller 105 receives an instruction to display the interest map or the check image from the user. As another example, when an inference result “defective” is acquired, the check image may be displayed on the external display device. When the display controller 105 is a unit separated from the inference apparatus 10, the interest map may be transmitted from the inference apparatus 10 to the display controller 105, and processing to display the interest map and the check image may be executed.
As another example, the display controller 105 may execute processing of coloring the foreign substance 402 with a color that is not used in the input image according to the pixel value of the interest map to further highlight the foreign substance 402 on the image. As another example, the display controller 105 may display a mark, such as an arrow, indicating the region of the foreign substance 402 or cause the region of the foreign substance to blink to enable the user to easily recognize the defect. As another example, the display controller 105 may perform control to display a message “defective” or the like. As another example, the display controller 105 may perform control to display the region including the foreign substance part in an enlarged state in response to the user's click or touch on the region around the defect in the image illustrated in
Specifically, any method may be used as long as the display mode enables the display controller 105 to display the intermediate partial image as the interest map in an emphasized manner.
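As a minimal sketch of such superimposed emphasis display, assuming NumPy and Matplotlib, with a hypothetical (top, left) position at which the corresponding partial image was extracted:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_interest_map(input_image, interest_map, top, left):
    """Superimpose an intermediate partial image on the input image.

    `interest_map` is the intermediate partial image whose mean was
    selected as the maximum; (top, left) is the position at which the
    corresponding partial image was extracted from the input image.
    """
    h, w = interest_map.shape
    overlay = np.zeros_like(input_image, dtype=float)
    overlay[top:top + h, left:left + w] = interest_map
    plt.imshow(input_image, cmap="gray")
    plt.imshow(overlay, cmap="jet", alpha=0.4)  # emphasis by color per pixel value
    plt.show()
```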
According to the first embodiment described above, partial images are extracted, and a convolutional operation is executed for each of the partial images. This structure prevents rapid reduction in image size, enables a convolutional operation while the resolution of the image is maintained, and enables acquisition of high discrimination accuracy. In addition, even when the original image has a large size, the image is processed as partial images acquired by extracting parts of the image. This structure has the merit that an increase in the processing quantity and/or the required memory quantity is prevented while the resolution is maintained.
In addition, because the intermediate partial image is acquired by extracting a part of the image and subjected to a convolutional operation without changing the image size, the intermediate partial image can be used as the interest map without any processing. This structure removes the necessity for processing of generating an interest map separately, unlike the conventional art. In addition, because presence/absence of defects can be directly recognized on the basis of the pixel value in the intermediate partial image serving as the interest map, the grounds for identification are clear even when a neural network is used. As a result, the inference apparatus according to the first embodiment enables achievement of classification processing with high accuracy.
The first embodiment illustrates one-class classification of presence/absence of defects. In a modification of the first embodiment, the inference apparatus 10 executes multi-class classification, that is, classification into a plurality of classes as inference targets. The multi-class classification in the present modification is supposed to identify the type of the defect, such as adhesion of a foreign substance, deformation of the component, and scratches, for example, in a defect inspection.
An operation example of the inference apparatus 10 according to the modification of the first embodiment will be explained hereinafter with reference to the conceptual diagram of
In
Intermediate partial images 1701 illustrated in
The number of channels of each of the intermediate partial images 1701 output from the last layer of the convolutional neural network is not one but set to the same number as the number of classes to be classified by inference processing. In this example, because it is supposed to execute four-class classification, the intermediate partial images 1701 each having four channels (a first channel Ch1, a second channel Ch2, a third channel Ch3, and a fourth channel Ch4) are generated.
The calculation unit 103 calculates the mean value 901 of the pixel values of each intermediate partial image 1701 for each of the channels, and calculates a statistic based on the intermediate partial images 1701 for each of the channels.
In the example of
The output unit 104 applies a sigmoid function to the maximum value of the first channel Ch1 to generate an inference result of the first class. For example, the output unit 104 outputs the probability of presence/absence of a foreign substance as an inference result of the first class.
In the same manner, for the intermediate partial images of the second channel to the fourth channel, the output unit 104 outputs the probabilities of the second class to the fourth class as the inference results. The output unit 104 may prepare a plurality of functions in accordance with the number of classes to output inference results separately for the respective classes, or may apply one function a plurality of times to output the inference results of the respective classes.
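The per-channel calculation can be illustrated with the following sketch, under the same hedges as before: the network `cnn` is assumed to output one channel per class from its last convolution layer, and the function name and shapes are hypothetical.

```python
import torch

def infer_multiclass(partial_images, cnn):
    """Per-channel statistics: the mean of each intermediate partial image
    is computed channel by channel, the maximum over the partial images is
    taken per channel, and a sigmoid gives one probability per class."""
    means = [cnn(p.unsqueeze(0)).mean(dim=(2, 3)).squeeze(0)  # shape (num_classes,)
             for p in partial_images]                         # cnn output: (1, num_classes, H, W)
    max_per_class = torch.stack(means).max(dim=0).values      # shape (num_classes,)
    return torch.sigmoid(max_per_class)                       # one probability per class
```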
Each of the modifications described above according to the first embodiment may be applied to the calculation unit 103 and the output unit 104.
According to the modification of the first embodiment described above, an output of the last layer of the convolutional neural network in the convolution processing unit is set to be intermediate partial images each having a plurality of channels. The inference apparatus calculates a statistic for each of the channels in the same manner as the first embodiment, and outputs inference results of classes in accordance with the statistics. This structure achieves classification including classes of the number corresponding to the number of channels, that is, multi-class classification.
The second embodiment is different from the first embodiment in that extraction processing with the extraction unit 101 is executed for the output of the last layer of the convolutional neural network.
An operation example of the inference apparatus according to the second embodiment will be explained hereinafter with reference to the flowchart of
At Step S1801, the convolution processing unit 102 executes convolution processing for the input signal with the convolutional neural network to generate an intermediate signal.
At Step S1802, the extraction unit 101 extracts a plurality of intermediate partial signals from the intermediate signal. With respect to the positions at which the intermediate partial signals are extracted, because the convolution processing preserves the signal size, the method for extracting partial signals from the input signal described in the first embodiment is applicable as it is, and the intermediate partial signals can be extracted from the intermediate signal in the same manner.
Processing from Step S203 to Step S205 is the same as that in
The following is an explanation of an operation example of the inference apparatus according to the second embodiment illustrated in
The convolution processing unit 102 executes convolution processing using the convolutional neural network for the input image 1901 to generate an intermediate image 1902.
The extraction unit 101 receives the intermediate image 1902 from the convolution processing unit 102, and extracts a plurality of intermediate partial images 1904 from the intermediate image 1902.
The calculation unit 103 calculates mean values 1905 of the respective intermediate partial images 1904, and calculates the maximum value 902 in the mean values 1905.
The output unit 104 applies a sigmoid function to the maximum value 902 and outputs, for example, the probability of “defective” as the inference result 903, in the same manner as in the first embodiment.
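For illustration, the reordered flow of the second embodiment can be sketched as follows, again with a hypothetical PyTorch-style network and hypothetical extraction coordinates in the same (top, left, height, width) box format used earlier.

```python
import torch

def infer_second_embodiment(input_image, cnn, boxes):
    """S1801: convolve the whole input image; S1802: extract intermediate
    partial images from the intermediate image; then mean, maximum, and
    sigmoid as in the first embodiment."""
    intermediate = cnn(input_image.unsqueeze(0))[0, 0]  # shape (H, W); size preserved
    means = [intermediate[t:t + h, l:l + w].mean() for (t, l, h, w) in boxes]
    return torch.sigmoid(torch.stack(means).max())
```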
According to the second embodiment described above, the inference apparatus subjects the input signal to convolution processing with the convolutional neural network, and executes extraction processing for an intermediate signal output from the last layer of the convolutional neural network to generate intermediate partial signals. Even when the timing of extraction processing is different, classification processing with high accuracy is achieved in the same manner as the first embodiment.
In the first embodiment and the second embodiment, an extracted image that includes the whole defect enables easier detection of the defect. By contrast, when the extracted image is too large, information other than the defect relatively increases and makes detection of the defect difficult. For this reason, when the size of the defect can be estimated in advance, the size of the extracted image may be set in accordance with the size of the defect. For example, the magnification of the extraction size may be set to, for example, twice or four times the size of the defect lengthwise and breadthwise. Specifically, it suffices that the extraction unit 101 receives information relating to the size of the defect, for example, from an external device, and extracts partial images in the first embodiment, or intermediate partial images in the second embodiment, in a size acquired by multiplying the size of the defect by the set magnification of the extraction size. This structure is expected to improve the detection accuracy.
The third embodiment illustrates the case of using a one-dimensional signal as the input signal with reference to
When a pulse 2001 to be measured is specified, ambient light 2002, such as sunlight, other than the pulse 2001 is mixed in as noise, and the measurement accuracy may deteriorate.
The inference processing with the inference apparatus 10 is also applicable to such distance measurement with a distance measurement apparatus. Convolution processing in the convolution processing unit for a one-dimensional signal will be explained hereinafter with reference to the conceptual diagram of
The extraction unit 101 extracts a plurality of partial signals from the input signal acquired by sampling the received light. For example, it suffices that the extraction unit 101 extracts partial signals 2101 at predetermined time intervals. The example of
The convolution processing unit 102 executes one-dimensional convolution for the partial signal 2101. Specifically, the convolution processing unit 102 applies a one-dimensional kernel to the partial signal 2101, and executes a product-sum operation for the sampling values of the partial signal 2101 and the weight coefficient to generate an intermediate partial signal 2102. In the example of
Although it is not illustrated, the calculation unit 103 calculates the mean values of the intermediate partial signals output from the convolution processing unit 102 in the same manner as in the case of an image, and calculates the maximum value among the mean values. The output unit 104 can output, as the inference result, the probability that the position (time) at which the partial signal serving as the origin of the calculated maximum value was extracted is the position of the pulse.
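A one-dimensional counterpart can be sketched in the same hedged manner; the layer sizes, kernel width, and function names are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical 1-D network: zero padding of 2 with a width-5 kernel and a
# stride of one keeps the sampling count of each intermediate partial
# signal equal to that of its input partial signal.
cnn_1d = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.Conv1d(4, 1, kernel_size=5, stride=1, padding=2),
)

def locate_pulse(partial_signals):
    """Return the index of the partial signal most likely to contain the
    pulse, together with the corresponding probability."""
    means = torch.stack([cnn_1d(s.view(1, 1, -1)).mean() for s in partial_signals])
    index = int(means.argmax())
    return index, torch.sigmoid(means[index])
```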
When the input signal is a signal including a plurality of channels, it suffices that the convolution processing unit 102 executes convolution processing for the one-dimensional signal for each of the channels, and executes processing such that the signal is reduced to one channel at the last layer of the convolutional neural network, in the same manner as in the case of an image according to the first embodiment.
The third embodiment described above achieves classification processing with high accuracy even when the input signal is a one-dimensional signal, in the same manner as the case where the input signal is an image.
The fourth embodiment illustrates a learning apparatus that trains the convolutional neural network included in the inference apparatus 10 explained in the first to third embodiments.
A learning system including the learning apparatus according to the fourth embodiment is illustrated in the block diagram of
The training data storage 22 stores therein training data to train the inference apparatus 10, specifically, to train the convolutional neural network included in the inference apparatus 10. The training data is sample data with a correct label (teaching data). For example, when the training data is for a defect inspection, the training data should be formed of pairs each formed of a normal product image and a correct label (for example, “0”) of a classification result indicating that the product is normal, or pairs each formed of an anomalous product image and a correct label (for example, “1”) of a classification result indicating that the product is anomalous.
The learning controller 211 calculates an error between the inference result output from the output unit 104 when the training data is input to the inference apparatus 10 and the correct label of the training data. Specifically, suppose for example that the probability of “defective” is output as the inference result from the output unit 104. The probability of “defective” and the probability of “not defective”, acquired by subtracting the probability of “defective” from 1, are expressed as a vector. For example, the output unit 104 outputs a vector (the probability of “defective”, the probability of “not defective”) as the inference result for the image of the input training data.
By contrast, the vector of the correct label of the training data is expressed as “(1, 0)” in the case of “defective”, and as “(0, 1)” in the case of “not defective”. The learning controller 211 calculates an error between the vector output from the output unit 104 and the vector of the correct label by, for example, cross entropy.
The learning controller 211 updates and optimizes the weight coefficients and the bias values by the stochastic gradient descent method or the like, while tracing the positions of the pixels used for the convolution processing and the position of the data acquired as the maximum value through the network in the backward direction by the error back propagation method, and updates the parameters in the convolutional neural network until the training is finished. The same methods as those used in ordinary training processing, such as the error back propagation method, can be used as the machine learning method for the neural network, and a specific explanation thereof is omitted.
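A training loop of this kind might be sketched as follows. The optimizer choice, learning rate, and data format are assumptions; `infer` is the patch-wise pipeline sketched in the first embodiment, and binary cross entropy is used for the one-class case.

```python
import torch

def train(cnn, dataset, epochs=10):
    """Update the CNN parameters by error back propagation; `dataset`
    yields (list_of_partial_images, label) pairs with label 1 for
    "defective" and 0 for "not defective"."""
    optimizer = torch.optim.SGD(cnn.parameters(), lr=1e-3)  # stochastic gradient descent
    loss_fn = torch.nn.BCELoss()                            # cross entropy for two classes
    for _ in range(epochs):
        for patches, label in dataset:
            prob = infer(patches, cnn)                      # forward pass through the pipeline
            loss = loss_fn(prob.view(1), torch.tensor([float(label)]))
            optimizer.zero_grad()
            loss.backward()   # gradients flow back through the maximum and the means
            optimizer.step()
    return cnn
```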
In the case of multi-class classification, it suffices that the vector of the correct label is expressed as a one-hot vector whose elements correspond to the classes. For example, the training data includes pairs each formed of an anomalous product image and a correct label being a one-hot vector in which the element of the class indicating the type of the anomaly is set to 1 and the elements of the other classes are set to 0. Specifically, suppose that vectors acquired by classifying the types of anomaly into three types (scratch, adhesion of a foreign substance, and deformation of the component) are set as correct labels. When a product image is found by visual observation to include a scratch, a pair of the product image and a correct label of the vector (1, 0, 0), in which the element indicating a scratch is set to 1 and the other elements are set to 0, is set as training data. When the product image includes a plurality of types of anomaly, the correct label may be a vector in which all the elements of the corresponding types are set to 1.
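Construction of such correct-label vectors can be illustrated briefly; the class ordering below (scratch, foreign substance, deformation) is the hypothetical one used in the example above.

```python
import torch

CLASSES = ["scratch", "foreign substance", "deformation"]

def make_label(anomaly_types):
    """Return a one-hot (or, for multiple anomalies, multi-hot) correct label."""
    label = torch.zeros(len(CLASSES))
    for name in anomaly_types:
        label[CLASSES.index(name)] = 1.0
    return label

make_label(["scratch"])                 # tensor([1., 0., 0.])
make_label(["scratch", "deformation"])  # tensor([1., 0., 1.])
```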
The learning controller 211 calculates, for each of the elements corresponding to the types of anomaly, an error between the correct label of the training data and the vector, having dimensions corresponding to the number of types of anomaly, that is output from the output unit 104 when the product image of the training data is input to the inference apparatus 10.
The fourth embodiment described above achieves the inference apparatus according to the first to third embodiments by training the convolutional neural network with training data in which a correct label is provided for the input signal rather than for each of the partial images.
For example, in a neural network in which each part of an image is simply extracted and presence/absence of anomaly is independently inspected for each part, it is required to set presence/absence of anomaly of each part as correct data and to prepare as many pieces of correct data as there are parts. By contrast, in the training data in the fourth embodiment, the intermediate partial images of the partial images extracted from the input image are integrated after convolution processing, and classification is executed for the whole input image as to whether an anomaly exists in any part of the original input image. For this reason, it suffices that one correct label is provided for the image. This structure enables easy preparation of correct data by visual observation.
The inference apparatus 10 and the learning apparatus 21 include a CPU (Central Processing Unit) 31, a RAM (Random Access Memory) 32, a ROM (Read Only Memory) 33, a storage 34, a display 35, an input device 36, and a communication device 37 that are connected with a bus.
The CPU 31 is a processor executing arithmetic processing and control processing and the like in accordance with programs. The CPU 31 executes various types of processing in cooperation with the programs stored in the ROM 33 and the storage 34 and the like, with a predetermined region of the RAM 32 used as the working area.
The RAM 32 is a memory, such as a SDRAM (Synchronous Dynamic Random Access Memory). The RAM 32 functions as a working area for the CPU 31. The ROM 33 is a memory storing programs and various types of information therein in an unrewritable manner.
The storage 34 is a device that writes and reads data to and from a storage medium, such as a magnetically recordable storage medium such as an HDD (Hard Disk Drive), a semiconductor storage medium such as a flash memory, or an optically recordable storage medium. The storage 34 executes writing and reading of data to and from the storage medium under the control of the CPU 31.
The display 35 is a display device, such as an LCD (Liquid Crystal Display). The display 35 displays various types of information on the basis of a display signal from the CPU 31.
The input device 36 is an input device, such as a mouse and a keyboard. The input device 36 receives information input by a user's operation as an instruction signal, and outputs the instruction signal to the CPU 31.
The communication device 37 communicates with an external apparatus via a network under the control of the CPU 31.
The flowcharts of the embodiments illustrate methods and systems according to the embodiments. It is to be understood that the embodiments described herein can be implemented by hardware, circuits, software, firmware, middleware, microcode, or any combination thereof. It will be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be loaded onto a computer or other programmable apparatus to produce a machine, such that the instructions which execute on the computer or other programmable apparatus create means for implementing the functions specified in the flowchart block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus so as to produce a computer-implemented process which provides steps for implementing the functions specified in the flowchart block or blocks.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel apparatuses, methods and computer readable media described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the apparatuses, methods and computer readable media described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.