The present invention relates to an image processing apparatus and an image processing method.
A method for determining whether the sound of a device is normal or abnormal is known (Patent Literature 1). The invention described in Patent Literature 1 makes this determination using a locus vector indicating intensity features in all time directions and a previously learned identification parameter.
When a machine learning model for determining an abnormality from sound data is generated, it is conceivable to quantify the sound data by physical quantities, such as sound pressure (dB), which indicates the magnitude of the sound, and frequency (Hz), which indicates its pitch. Although a machine learning model can be generated effectively by imaging sound data, the invention described in Patent Literature 1 does not mention imaging of sound data.
In response to the above issue, an object of the present invention is to provide an image processing apparatus and an image processing method for imaging sound data.
An image processing apparatus according to one aspect of the present invention calculates a fundamental frequency component included in sound data and a harmonic component corresponding to the fundamental frequency component, converts the fundamental frequency component and the harmonic component into image data, and generates a sound image where the fundamental frequency component and the harmonic component that have been converted into the image data are arranged adjacent to each other.
The present invention enables a machine learning model of sound data to be generated by imaging sound data.
Embodiments of the present invention are described below with reference to the drawings. In the drawings, the same parts are denoted by the same reference numerals, and the description thereof is omitted.
(Configuration Example of Image Processing Apparatus)
A configuration example of an image processing apparatus 1 according to a first embodiment is described with reference to the drawings.
The controller 20 acquires sound data via a microphone 10 and analyzes the acquired sound data. The controller 20 is a general-purpose microcomputer including a CPU (central processing unit), a memory, and an input/output unit. A computer program is installed on the microcomputer to make it function as the image processing apparatus 1. By executing the computer program, the microcomputer functions as the multiple information processing circuits provided in the image processing apparatus 1. Note that although this example realizes the multiple information processing circuits by software, dedicated hardware for executing each type of information processing described below may instead be prepared to configure the information processing circuits. The multiple information processing circuits may also be configured by individual pieces of hardware. The controller 20 includes, as the multiple information processing circuits, a sound data acquisition unit 21, a frequency characteristics analysis unit 22, a fundamental frequency calculation unit 23, a sound data image generation unit 24, and a machine learning model generation unit 25.
As described above, the sound data acquisition unit 21 acquires sound data via the microphone 10. The sound data acquired by the sound data acquisition unit 21 is converted into an electric signal and treated as time-series data. The acquired sound data is given a label indicating normal or abnormal and is used as teacher data (training data) for machine learning. Note that the sound data is, for example, sound data of a machine used in a factory.
The frequency characteristics analysis unit 22 analyzes frequency characteristics of the sound data acquired by the sound data acquisition unit 21. As the analysis method, signal processing represented by the FFT (Fast Fourier Transform) is used, for example. The FFT converts time-series data into frequency data to provide “frequency-energy characteristics”.
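Purely as an illustration, this conversion from time-series data to “frequency-energy characteristics” might be sketched as follows with NumPy's real FFT; the function name, the windowing choice, and the energy definition are assumptions, not details taken from the embodiment.

```python
# A minimal sketch, assuming a mono signal sampled at fs Hz (names assumed).
import numpy as np

def frequency_energy(signal, fs):
    """Convert time-series sound data into frequency-energy characteristics."""
    window = np.hanning(len(signal))           # reduce spectral leakage
    spectrum = np.fft.rfft(signal * window)    # one-sided FFT of the signal
    energy = np.abs(spectrum) ** 2             # energy per frequency bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs, energy
```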
The fundamental frequency calculation unit 23 calculates a fundamental frequency of the sound data using the “frequency-energy characteristics” acquired by the frequency characteristics analysis unit 22. The fundamental frequency calculation unit 23 calculates harmonics corresponding to the calculated fundamental frequency.
The sound data image generation unit 24 generates an image of the sound data using the fundamental frequency calculated by the fundamental frequency calculation unit 23.
The machine learning model generation unit 25 generates an optimum machine learning model for determining normality and abnormality in the sound data using the image generated by the sound data image generation unit 24. The machine learning model generation unit 25 evaluates the performance of determining normality and abnormality of sound data for multiple machine learning algorithms using cross-validation or the like and outputs the model with the best performance. Note that the machine learning model generation unit 25 may also output other models together.
The storage device 40 stores the “frequency-energy characteristics” acquired by the frequency characteristics analysis unit 22, the sound data image generated by the sound data image generation unit 24, the machine learning model generated by the machine learning model generation unit 25, and the like.
The display 50 displays the machine learning model generated by the machine learning model generation unit 25, its prediction performance, and the like.
Next, an example of the image generated by the sound data image generation unit 24 is described with reference to the drawings.
An image 60 is generated as follows.
The frequency characteristics analysis unit 22 analyzes the electric signal using the FFT.
The fundamental frequency calculation unit 23 calculates the fundamental frequency of the sound data using the “frequency-energy characteristics” acquired by the frequency characteristics analysis unit 22.
As another calculation method, the frequency spacing between peaks may be calculated as the fundamental frequency.
As another calculation method, the difference of frequencies between peaks may be acquired, and when the difference agrees with the minimum peak frequency, that frequency may be calculated as the fundamental frequency.
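For illustration only, the two peak-based methods above could be sketched as follows with SciPy's find_peaks; the peak-height threshold and the 5% tolerance are assumed values, and the helper function is hypothetical.

```python
# Hypothetical sketch of the peak-based fundamental-frequency estimates above.
import numpy as np
from scipy.signal import find_peaks

def estimate_fundamental(freqs, energy):
    """Estimate f0 from spectral peaks (two of the methods described above)."""
    peaks, _ = find_peaks(energy, height=0.1 * energy.max())  # assumed threshold
    peak_freqs = freqs[peaks]
    if peak_freqs.size < 2:
        return float(peak_freqs[0]) if peak_freqs.size else 0.0
    spacing = float(np.median(np.diff(peak_freqs)))  # frequency spacing of peaks
    lowest = float(peak_freqs.min())                 # minimum peak frequency
    # When the spacing agrees with the lowest peak, take that as the fundamental.
    return lowest if np.isclose(spacing, lowest, rtol=0.05) else spacing
```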
Next, the fundamental frequency calculation unit 23 calculates harmonics corresponding to the calculated fundamental frequency. A harmonic is a higher-order frequency component at an integral multiple of a wave having a certain frequency component (here, the fundamental frequency component). Harmonics are sometimes called overtones in the field of music. For example, when the fundamental frequency is 400 Hz, the harmonics are 800 Hz, 1200 Hz, and so on.
The sound data image generation unit 24 generates a two-dimensional image, such as a so-called heat map, using the calculated fundamental frequency and harmonics.
In this way, the sound data image generation unit 24 converts the fundamental frequency component and the harmonic component calculated by the fundamental frequency calculation unit 23 into image data. The fundamental frequency component converted into image data is represented as the pixel 60a (400 Hz). The harmonic component converted into image data is represented as the pixel 60b (800 Hz). The pixel 60a and the pixel 60b are arranged adjacent to each other in the image 60.
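A minimal sketch of this imaging step, assuming the sound image is built from the energies at the fundamental and its integer-multiple harmonics and that pixel brightness follows amplitude (the function name, pixel layout, and 8-bit scaling are assumptions):

```python
import numpy as np

def sound_image_row(freqs, energy, f0, n_harmonics=8):
    """Place the fundamental and its harmonics in adjacent pixels."""
    pixels = np.empty(n_harmonics)
    for k in range(1, n_harmonics + 1):
        idx = np.argmin(np.abs(freqs - k * f0))   # nearest FFT bin to k * f0
        pixels[k - 1] = energy[idx]               # brightness follows amplitude
    # Scale to 0-255 so the row can be rendered as a grayscale heat map.
    return (255 * pixels / max(pixels.max(), 1e-12)).astype(np.uint8)
```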
The image 60 is illustrated in a rectangular shape but is not limited to this. For example, the image 60 may have a thin comb shape for increased resolution. Each of the pixels is also illustrated in a rectangular shape but is not limited to this. Further, the respective pixels are illustrated as discretely arranged at a distance from each other, but they are not limited thereto and may be continuously arranged without any distance between them. The vertical axis and the horizontal axis may be interchanged.
Next, a machine learning model is described with reference to the drawings.
The machine learning model generation unit 25 generates a machine learning model using a well-known machine learning algorithm. Machine learning algorithms to be used include Decision Tree, Random Forest, Gradient Boosted Tree (GBT), Generalized Linear Regression (GLR), Support Vector Machine (SVM), and Deep Learning. However, the present invention is not limited thereto, and any algorithm capable of supervised learning may be used.
The machine learning prediction performance is compared using any one of, or a combination of, metrics indicating the performance of a machine learning algorithm, such as Accuracy, Precision, and Recall. The model with the highest prediction performance is selected based on the comparison result.
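As one hedged illustration of this selection step, the comparison could look like the following scikit-learn sketch; the candidate set is a simplified subset of the algorithms listed above, and the choice of the Accuracy metric and default cross-validation is an assumption.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_best_model(X, y):
    """Cross-validate several candidate models and return the best scorer."""
    candidates = {
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(),
        "gbt": GradientBoostingClassifier(),
        "svm": SVC(),
    }
    scores = {name: cross_val_score(model, X, y, scoring="accuracy").mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return candidates[best], scores
```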
Next, an operation example of the image processing apparatus 1 according to the first embodiment is described with reference to a flowchart.
In step S101, the sound data acquisition unit 21 acquires sound data via the microphone 10. The acquired sound data is converted into an electric signal and treated as time-series data. The process proceeds to step S103, and the frequency characteristics analysis unit 22 analyzes the electric signal acquired in step S101 using the FFT. The FFT provides the “frequency-energy characteristics”.
The process proceeds to step S105, and the fundamental frequency calculation unit 23 calculates a fundamental frequency of the sound data using the “frequency-energy characteristics” acquired in step S103. The above-described method is used for calculating the fundamental frequency. The process proceeds to step S107, and the sound data image generation unit 24 generates the image 60, such as a heat map, using the fundamental frequency calculated in step S105.
The process proceeds to step S109, and the image 60 generated in step S107 is displayed on the display 50. Each of the pixels forming the image 60 is set to a brightness or color corresponding to the amplitude of the sound data. Thus, an operator who sees the image 60 can grasp the intensity, normality, abnormality, and the like of the sound data at a glance.
The process proceeds to step S111, and the machine learning model generation unit 25 selects an algorithm for generating a machine learning model. The selectable algorithms include Decision Tree, Random Forest, Gradient Boosted Tree (GBT), Generalized Linear Regression (GLR), Support Vector Machine (SVM), and Deep Learning.
The process proceeds to step S113, and the machine learning model generation unit 25 generates a machine learning model using the algorithm selected in step S111. The process proceeds to step S115, and the machine learning model generation unit 25 displays the generated machine learning model and its prediction performance on the display 50.
As described above, the image processing apparatus 1 according to the first embodiment provides the following advantageous effects.
The fundamental frequency calculation unit 23 calculates a fundamental frequency component included in sound data and a harmonic component corresponding to the fundamental frequency component. The sound data image generation unit 24 converts the fundamental frequency component and the harmonic component calculated by the fundamental frequency calculation unit 23 into image data. The sound data image generation unit 24 generates the image 60 (sound image) in which the fundamental frequency component (pixel 60a) and the harmonic component (pixel 60b) converted into the image data are arranged adjacent to each other. This enables a machine learning model of the sound data to be generated by imaging the sound data.
The fundamental frequency component (pixel 60a) and the harmonic component (pixel 60b) have an overtone relationship. Monophonic tones, such as those of automobile horns and stringed instruments, are influenced by their overtones. According to the first embodiment, such a relationship can be displayed as a sound image.
The sound data image generation unit 24 arranges the fundamental frequency component (pixel 60a) and the harmonic component (pixel 60b) adjacent to each other on the vertical axis or the horizontal axis of the sound image. This enables the relationship between the fundamental frequency and the harmonic component to be displayed as a two-dimensional sound image.
The sound data image generation unit 24 generates a sound image by converting the fundamental frequency component and the harmonic component into image data where brightnesses or colors corresponding to amplitudes of the sound data are set. This enables the intensity of the sound or the like to be clearly displayed as a sound image.
The sound data image generation unit 24 arranges multiple frequency components of the sound data in the order of frequency on the vertical axis or the horizontal axis of the sound image. This enables the sound data to be displayed as a two-dimensional sound image.
Next, a second embodiment of the present invention is described with reference to the drawings.
The scale setting unit 26 uses the “frequency-energy characteristics” acquired by the frequency characteristics analysis unit 22 to set a twelve-tone scale of “C, C#, D, D#, E, F, F#, G, G#, A, A#, B, (C)”, which corresponds to the “do, re, mi, fa, sol, la, ti (do)” used in music, and octaves (octaves 1 to 10), which are overtone relationships. The scale setting unit 26 classifies sounds of the respective frequencies into the twelve-tone scale using the scale and the octaves.
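For example, under the common assumption of equal temperament with A4 = 440 Hz (an assumption; the embodiment does not fix a tuning), the classification into pitch name and octave could be sketched as:

```python
import math

PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def classify_pitch(freq_hz):
    """Map a frequency to a twelve-tone pitch name and an octave number."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))  # nearest tempered note
    return PITCH_NAMES[midi % 12], midi // 12 - 1       # e.g. 440.0 -> ("A", 4)
```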
Next, an example of an image generated by the sound data image generation unit 24 is described with reference to the drawings.
An image 61 generated in the second embodiment is described below.
In the second embodiment, the fundamental frequency is not limited to 400 Hz and may be any value. The pitch name corresponding to the pixel 60a indicating the fundamental frequency is “A”. The pitch name corresponding to the pixel 60b indicating a harmonic of the fundamental frequency is also “A”.
The image 61 is also illustrated in a rectangular shape like the image 60 but is not limited to this.
The machine learning model generation unit 25 generates an optimum machine learning model for determining normality and abnormality in sound data using the image 61. Since the details are the same as those of the first embodiment, the description thereof is omitted.
Next, an operation example of the image processing apparatus 1 according to the second embodiment is described with reference to a flowchart.
In step S207, the scale setting unit 26 sets the twelve-tone scale and octaves that are overtones thereof using the “frequency-energy characteristics” acquired in step S203. The scale setting unit 26 classifies sounds of respective frequencies into the twelve-tone scale using the twelve-tone scale and octaves.
In the second embodiment, the fundamental frequency component and the harmonic component have the same scale. The second embodiment enables music-related events, such as a scale and an octave, to be displayed as an image.
Next, a third embodiment of the present invention is described with reference to the drawings.
The critical band setting unit 27 sets the twelve-tone scale of “C, C#, D, D#, E, F, F#, G, G#, A, A#, B, (C)”, which corresponds to the “do, re, mi, fa, sol, la, ti (do)” used in music, and a critical band (band numbers 1 to 24), which reflects human hearing characteristics, using the “frequency-energy characteristics” acquired by the frequency characteristics analysis unit 22. The critical band setting unit 27 classifies sounds of the respective frequencies into the twelve-tone scale using the scale and the critical band.
The critical band is defined as the maximum frequency bandwidth within which the perceived intensity of band noise with a constant band sound pressure level remains constant regardless of the bandwidth. By another definition, the critical band is the minimum bandwidth of band noise at which a pure tone at the center frequency of the band noise is just audible when the bandwidth is increased while the spectrum level of the band noise is kept constant.
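As one common approximation of this mapping (Zwicker's formula; the choice of formula is an assumption, since the text does not fix one), the critical-band number of a frequency could be computed as:

```python
import math

def bark_band(freq_hz):
    """Approximate critical-band (Bark) number, 1 to 24 (Zwicker's formula)."""
    z = 13.0 * math.atan(0.00076 * freq_hz) \
        + 3.5 * math.atan((freq_hz / 7500.0) ** 2)
    return min(24, int(z) + 1)   # e.g. 1000 Hz falls in band 9
```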
Next, an example of an image generated by the sound data image generation unit 24 is described with reference to the drawings.
An image 62 generated in the third embodiment is described below.
In the third embodiment, the fundamental frequency is not limited to 400 Hz and may be any value.
The image 62 is also illustrated in a rectangular shape like the image 60 but is not limited to this.
The machine learning model generation unit 25 generates an optimum machine learning model for determining normality and abnormality in sound data using the image 62. The details are the same as those of the first embodiment, and thus the description thereof is omitted.
Next, an operation example of the image processing apparatus 1 according to the third embodiment is described with reference to a flowchart.
In step S307, the critical band setting unit 27 sets the twelve-tone scale and the critical band using the “frequency-energy characteristics” acquired in step S303. The critical band setting unit 27 classifies sounds of respective frequencies into the twelve-tone scale using the twelve-tone scale and the critical band.
In the third embodiment, the fundamental frequency component and the harmonic component have a relationship with respect to the critical band of human hearing. The third embodiment enables such a relationship to be displayed as an image.
Next, a fourth embodiment of the present invention is described with reference to the drawings.
The time-specific image generation unit 28 generates the image 60 and images 63 to 65 at predetermined intervals, and the three-dimensional image generation unit 29 generates a three-dimensional image 70 by arranging the images 60 and 63 to 65 in time order.
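A minimal sketch of this stacking step, assuming each time-specific image is a 2D NumPy array of the same shape (the variable names are hypothetical):

```python
import numpy as np

def stack_images(images):
    """Stack per-interval sound images into a 3D volume along a time axis."""
    return np.stack(images, axis=0)   # shape: (time, height, width)

# e.g. volume = stack_images([image_60, image_63, image_64, image_65])
```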
The machine learning model generation unit 25 generates an optimum machine learning model for determining normality and abnormality in sound data using the three-dimensional image 70. The details are the same as those of the first embodiment, and thus the description thereof is omitted.
Next, an operation example of the image processing apparatus 1 according to the fourth embodiment is described with reference to a flowchart.
In step S407, the time-specific image generation unit 28 generates the images 60 and 63 to 65 at predetermined intervals using the fundamental frequency and harmonics acquired in step S405. The process proceeds to step S409, and the three-dimensional image generation unit 29 generates the three-dimensional image 70 using the images 60 and 63 to 65 generated in step S407. The process proceeds to step S411, and the three-dimensional image 70 generated in step S409 is displayed on the display 50. Each pixel forming the three-dimensional image 70 is set to a brightness or color corresponding to the amplitude of the sound data. Thus, an operator who sees the three-dimensional image 70 can grasp the intensity, normality, abnormality, and the like of the sound data at a glance.
The fourth embodiment enables sound data to be displayed as a three-dimensional image.
Next, a fifth embodiment of the present invention is described with reference to the drawings.
The frequency setting unit 30 sets a frequency to be extracted from the three-dimensional image 70. The frequency set by the frequency setting unit 30 may be any frequency; for example, the fundamental frequency may be set.
The image cutout unit 31 cuts out pixels related to the frequency set by the frequency setting unit 30. Specifically, the image cutout unit 31 cuts out the pixels corresponding to the set frequency from the three-dimensional image 70 along the time axis and generates a spectrogram from the cutout pixels.
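Assuming the 3D image is the (time, frequency, width) volume sketched for the fourth embodiment, the cut-out might look like the following; the axis layout is an assumption, not a detail fixed by the text:

```python
import numpy as np

def cut_out(volume, freq_index):
    """Cut the pixels of one frequency row out of the 3D image, across time."""
    # volume shape assumed (time, frequency, width); result: (time, width),
    # a time-frequency slice usable as a spectrogram-like view.
    return volume[:, freq_index, :]
```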
The machine learning model generation unit 25 generates an optimum machine learning model for determining normality and abnormality in sound data using the spectrogram. The details are the same as those of the first embodiment, and thus the description thereof is omitted.
Next, an operation example of the image processing apparatus 1 according to the fifth embodiment is described with reference to a flowchart.
In step S511, the frequency setting unit 30 sets a frequency to be extracted from the three-dimensional image 70. The process proceeds to step S513, and the image cutout unit 31 cuts out pixels relating to the frequency set in step S511. The image cutout unit 31 generates a spectrogram using the cutout pixels.
The fifth embodiment enables analysis to be performed using a spectrogram.
Next, a sixth embodiment of the present invention is described with reference to the drawings.
The new sound data acquisition unit 32 acquires new sound data via a microphone 11 different from the microphone 10. The microphones 10 and 11 are attached to machines of the same type. The sound data image generation unit 24 generates an image of the new sound data.
The image processing unit 33 uses the image of the new sound data as input data of the machine learning model generated by the machine learning model generation unit 25. The image processing unit 33 outputs an index, such as the degree of agreement of images, using a predetermined image processing method.
The determination unit 34 compares the output value output by the image processing unit 33 with a preset threshold value. When the output value is greater than the threshold value, the determination unit 34 determines that the sound data is normal. In contrast, when the output value is equal to or less than the threshold value, the determination unit 34 determines that the sound data is abnormal. Note that the determination method is not limited to a method using a threshold value, and other methods may be used.
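One possible realization of the agreement index and the threshold test is sketched below; normalized cross-correlation is used purely as an assumed example of “a predetermined image processing method”, and the 0.9 threshold is an assumed value.

```python
import numpy as np

def degree_of_agreement(image_a, image_b):
    """Normalized cross-correlation between two sound images (-1 to 1)."""
    a = (image_a - image_a.mean()).ravel()
    b = (image_b - image_b.mean()).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_normal(new_image, reference_image, threshold=0.9):  # assumed threshold
    """Determine normal (True) when the agreement exceeds the threshold."""
    return degree_of_agreement(new_image, reference_image) > threshold
```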
The determination result by the determination unit 34 is displayed on the display 50. Although not illustrated, the determination result by the determination unit 34 may be notified by voice through a speaker. When the determination result by the determination unit 34 is abnormal, a red rotating light may be turned on.
Next, an operation example of the image processing apparatus 1 according to the sixth embodiment is described with reference to flowcharts.
In step S615, the new sound data acquisition unit 32 acquires new sound data via the microphone 11. The acquired new sound data is converted into an electric signal and treated as time series data. The process proceeds to step S617, and the frequency characteristics analysis unit 22 analyzes the electric signal acquired in step S615 using the FFT. The FFT provides the “frequency-energy characteristics”.
The process proceeds to step S619, and the fundamental frequency calculation unit 23 calculates a fundamental frequency of the new sound data using the “frequency-energy characteristics” acquired in step S617. The process proceeds to step S621, and it is determined whether the fundamental frequency calculated in step S619 agrees with the fundamental frequency of the machine learning model generated in step S613. The reason for this determination is that when the fundamental frequencies are different, the machine learning model cannot perform the normal and abnormal determination processing. Note that “the fundamental frequencies agree with each other” means substantial agreement.
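For instance, “substantial agreement” could be checked with a relative tolerance; the 2% tolerance below is an assumed value, not one specified in the embodiment.

```python
import math

def fundamentals_agree(f0_new, f0_model, rel_tol=0.02):  # assumed tolerance
    """True when the two fundamental frequencies substantially agree."""
    return math.isclose(f0_new, f0_model, rel_tol=rel_tol)
```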
When the fundamental frequencies do not agree with each other (NO in step S621), “Determination processing not possible due to disagreement of fundamental frequencies” is displayed on the display 50, and the process proceeds to step S631. In contrast, when the fundamental frequencies agree with each other (YES in step S621), the process proceeds to step S623, and the sound data image generation unit 24 generates an image of the new sound data using the fundamental frequency calculated in step S619.
The process proceeds to step S625, and the image processing unit 33 uses the image of the new sound data generated in step S623 as input data for the machine learning model. The image processing unit 33 outputs an index, such as the degree of agreement of images, using a predetermined image processing method. The determination unit 34 compares the output value output by the image processing unit 33 with a preset threshold value to determine whether the sound data is normal or abnormal.
The process proceeds to step S627, and the determination result of step S625 is displayed on the display 50. The process proceeds to step S629, and a file name of the new sound data, a name of the machine learning model, a processing execution time, a value of the fundamental frequency, a determination result, and the like are stored in the storage device 40. A series of processing is repeatedly executed until completion (step S631). Note that when the processing is completed, a notice “End of normal/abnormal determination processing” may be displayed on the display 50.
The sixth embodiment makes it possible to determine whether other sound data are normal or abnormal using a machine learning model acquired by imaging sound data.
Each of the functions described in the above embodiments may be implemented by one or more processing circuits. The processing circuits include a programmed processing device, such as a processing device including an electric circuit. The processing circuits also include devices such as an application-specific integrated circuit (ASIC) and circuit components arranged to perform the described functions.
While embodiments of the present invention have been described as above, the statements and drawings that form part of this disclosure should not be understood as limiting the invention. Various alternative embodiments, examples, and operating techniques will become apparent to those skilled in the art from this disclosure.
In the above-described embodiments, a machine learning model is used as a method for determining normality and abnormality in other sound data, but the method is not limited thereto. For example, an abnormality in the image (sound data) may be determined by comparing the fundamental frequency component and harmonic component with other frequency components. This makes it possible to determine whether the sound data is normal or abnormal in a case where there is no overtone relation, such as a critical band.
Further, the determination unit 34 may determine an abnormality in a predetermined sound included in sound data using the image 60 (sound image).
The image 60 (sound image) may be made from a two-dimensional matrix including a fundamental frequency component and harmonic component converted into image data, and other frequency components converted into image data, wherein a predetermined area is set for each frequency component. Note that the other frequency components mean frequency components other than the fundamental frequency component and the harmonic component.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2020/000097 | 2/20/2020 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2021/165712 | 8/26/2021 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
4956999 | Bohannan | Sep 1990 | A |
5808225 | Corwin | Sep 1998 | A |
6490359 | Gibson | Dec 2002 | B1 |
7126876 | Rowland | Oct 2006 | B1 |
11227626 | Krishnan Gorumkonda | Jan 2022 | B1 |
11635411 | Ito | Apr 2023 | B2 |
20050149321 | Kabi | Jul 2005 | A1 |
20060011046 | Miyaki | Jan 2006 | A1 |
20060195500 | Joublin | Aug 2006 | A1 |
20090249879 | Jeyaraman | Oct 2009 | A1 |
20130129097 | Park | May 2013 | A1 |
20130151245 | Stark | Jun 2013 | A1 |
20160103038 | Lacaille | Apr 2016 | A1 |
20170110135 | Disch | Apr 2017 | A1 |
20180061382 | Summers | Mar 2018 | A1 |
20200019855 | Kato | Jan 2020 | A1 |
20200042285 | Choi | Feb 2020 | A1 |
20210074267 | Higurashi | Mar 2021 | A1 |
20210116293 | Yang | Apr 2021 | A1 |
20210256991 | Jun | Aug 2021 | A1 |
20210304786 | Fujii | Sep 2021 | A1 |
20220130411 | Lu | Apr 2022 | A1 |
20220254006 | Jin | Aug 2022 | A1 |
20230222711 | Hirose | Jul 2023 | A1 |
Number | Date | Country |
---|---|---|
H09-166483 | Jun 1997 | JP |
2013076909 | Apr 2013 | JP |
2017521705 | Aug 2017 | JP |
2015068446 | May 2015 | WO |
2019176029 | Sep 2019 | WO |
Entry |
---|
Yu et al., Fault Diagnosis Based on an Approach Combining a Spectrogram and a Convolutional Neural Network with Application to a Wind Turbine System, 2018 (Year: 2018). |
Prego et al., Audio Anomaly Detection on Rotating Machinery Using Image Signal Processing, 2016 (Year: 2016). |
Sakamoto, Takayuki, “Introduction to Programming Data Analysis AI Made With MXNet”, Jul. 2, 2018, pp. 175-196 (58 pages). |
Ortiz-Echeverri, C.J. et al.; “An Approach to STFT and CWT learning through music hands-on labs”; Computer Applications in Engineering Education, Wiley Periodicals, Inc., vol. 26, No. 6, Apr. 25, 2018, pp. 2026-2035 (10 pages). |
Ozaki, Kentaro; “Machine Learning and Vibration Power Generation for Vibration Monitoring: Railway Research Institute conducts practical research on fault detection”; Nikkei Monozukuri, vol. 769, Oct. 1, 2018, pp. 35-36 (8 pages). |
Number | Date | Country | |
---|---|---|---|
20230222711 A1 | Jul 2023 | US |