This application claims the priority benefit of Taiwan application serial no. 109132502, filed on Sep. 21, 2020. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to a machine learning technology, and particularly relates to a model construction method for audio recognition.
Machine learning algorithms can analyze a large amount of data to infer the regularity of these data, thereby predicting unknown data. In recent years, machine learning has been widely used in the fields of image recognition, natural language processing, medical diagnosis, or voice recognition.
It is worth noting that, for voice recognition and other types of audio recognition technologies, the operator labels the type of the sound content (for example, a female voice, a baby's voice, or an alarm bell) during the training process of the model, so as to produce the correct output results in the training data, wherein the sound content serves as the input data in the training data. When labeling an image, the operator can recognize the object in a short time and provide the corresponding label. For an audio label, however, the operator may need to listen to a long sound file before labeling, and the content of the sound file may be difficult to identify because of noise interference. It can be seen that current training operations are quite inefficient for operators.
In view of this, the embodiments of the disclosure provide a model construction method for audio recognition, which provides simple inquiry prompts to facilitate operator marking.
The model construction method for audio recognition according to the embodiment of the disclosure includes (but is not limited to) the following steps: audio data is obtained. A predicted result of the audio data is determined by using a classification model which is trained by a machine learning algorithm. The predicted result includes a label defined by the classification model. A prompt message is provided according to a loss level of the predicted result. The loss level is related to a difference between the predicted result and a corresponding actual result. The prompt message is used to query a correlation between the audio data and the label. The classification model is modified according to a confirmation response to the prompt message, and the confirmation response is related to a confirmation of the correlation between the audio data and the label.
Based on the above, the model construction method for audio recognition in the embodiment of the disclosure can determine the difference between the predicted result obtained by the trained classification model and the actual result, and provide a simple prompt message to the operator based on the difference. The operator can complete the marking by simply responding to this prompt message, and further modify the classification model accordingly, thereby improving the identification accuracy of the classification model and the marking efficiency of the operator.
In order to make the aforementioned features and advantages of the disclosure more comprehensible, embodiments accompanying figures are described in detail below.
In an embodiment, the audio data is obtained by performing audio processing on original audio data (implementations and types of the audio processing are described in the following paragraphs).
There are many ways to reduce noise from audio. In an embodiment, the server can analyze the properties of the original audio data to determine the noise component (i.e., interference to the signal) in the original audio data. Audio-related properties are, for example, changes in amplitude, frequency, energy, or other physical properties, and noise components usually have specific properties.
In an embodiment, the original audio data can be subjected to empirical mode decomposition (EMD) or other signal decomposition based on time-scale characteristics to obtain the corresponding intrinsic mode function components (i.e., mode components). The mode components include local characteristic signals of different time scales on the waveform of the original audio data in the time domain.
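As an illustrative sketch (not the claimed implementation), such a decomposition can be reproduced with the open-source PyEMD package; the sampling rate, the toy input signal, and the use of PyEMD are assumptions for illustration only.

```python
# Sketch: decompose audio into intrinsic mode function (IMF) components
# with empirical mode decomposition. PyEMD (pip install EMD-signal) is
# one open-source implementation; its use here is an assumption.
import numpy as np
from PyEMD import EMD

fs = 16000                                   # assumed sampling rate (Hz)
t = np.arange(fs) / fs                       # one second of audio
signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(fs)  # toy input

imfs = EMD().emd(signal)                     # shape: (num_modes, num_samples)
print(f"{imfs.shape[0]} mode components extracted")
```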
It should be noted that, in some embodiments, each intrinsic mode function component may be subjected to the Hilbert transform (as in the Hilbert-Huang Transform (HHT)) to obtain the corresponding instantaneous frequency and/or amplitude.
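A minimal sketch of that analysis step, assuming the mode components from the previous sketch and using scipy's hilbert function:

```python
# Sketch: per-component instantaneous amplitude and frequency via the
# Hilbert transform (the analysis step of the Hilbert-Huang Transform).
import numpy as np
from scipy.signal import hilbert

def instantaneous(imf, fs):
    analytic = hilbert(imf)                   # analytic signal
    amplitude = np.abs(analytic)              # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic))     # continuous phase
    freq = np.diff(phase) * fs / (2 * np.pi)  # instantaneous frequency (Hz)
    return amplitude, freq
```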
The server may further determine the autocorrelation of each mode component (step S330). For example, Detrended Fluctuation Analysis (DFA) can be used to determine the statistical self-similarity (i.e., autocorrelation) of a signal, and the slope of each mode component can be obtained by linear fitting through the least squares method. In another example, an autocorrelation operation is performed on each mode component.
The server can select one or more mode components as the noise components of the original audio data according to the autocorrelation of those mode components. Taking the slope obtained by DFA as an example, if the slope of a first mode component is less than the slope threshold (for example, 0.5 or another value), the first mode component is anti-correlated and is taken as a noise component; if the slope of a second mode component is not less than the slope threshold, the second mode component is correlated and is not regarded as a noise component.
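The following sketch illustrates the DFA slope computation and the 0.5 slope threshold described above; the window scales and the imfs array from the earlier sketch are assumptions.

```python
# Sketch: detrended fluctuation analysis (DFA) slope per mode component;
# components with slope < 0.5 are anti-correlated and treated as noise.
import numpy as np

def dfa_slope(x, scales=(16, 32, 64, 128, 256)):
    y = np.cumsum(x - np.mean(x))                 # integrated profile
    fluct = []
    for n in scales:
        m = len(y) // n                           # number of full windows
        segs = y[:m * n].reshape(m, n)
        t = np.arange(n)
        f2 = 0.0
        for seg in segs:
            a, b = np.polyfit(t, seg, 1)          # least-squares linear trend
            f2 += np.mean((seg - (a * t + b)) ** 2)
        fluct.append(np.sqrt(f2 / m))
    slope, _ = np.polyfit(np.log(scales), np.log(fluct), 1)
    return slope

slopes = np.array([dfa_slope(imf) for imf in imfs])
is_noise = slopes < 0.5                           # slope threshold from the text
```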
In other embodiments, with other types of autocorrelation analysis, if the autocorrelation of a third mode component is the smallest, the second smallest, or otherwise relatively small, the third mode component may also be regarded as a noise component.
After determining the noise component, the server can reduce the noise component from the original audio data to generate the audio data. Taking mode decomposition as an example, the server can remove the mode components regarded as noise and reconstruct the audio data from the remaining mode components.
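Continuing the earlier sketch, the reconstruction can be as simple as summing the retained mode components:

```python
# Sketch: drop the noise components and sum the remaining mode
# components to obtain the noise-reduced audio data.
denoised = imfs[~is_noise].sum(axis=0)
```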
It should be noted that the noise reduction of audio is not limited to the aforementioned mode decomposition and autocorrelation analysis, and other noise reduction techniques may also be applied in other embodiments. For example, a filter configured with a specific or variable threshold, or spectral subtraction, etc. may also be used.
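As one such alternative, here is a minimal spectral subtraction sketch using scipy's STFT; the frame size and the assumption that the first 0.25 seconds are noise-only are illustrative.

```python
# Sketch of basic spectral subtraction: estimate the noise magnitude
# spectrum from a presumed noise-only lead-in, subtract it from every
# frame, and resynthesize the signal.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(signal, fs, noise_seconds=0.25, nperseg=512):
    f, t, Z = stft(signal, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    hop = nperseg // 2                               # scipy's default overlap
    noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, 0.0)     # floor at zero
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return clean
```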
On the other hand, there are many methods for segmenting audio. In an embodiment, the server first extracts one or more sound features (for example, the zero crossing rate or the short time energy) from the audio data.
After obtaining the sound features, the server can determine the target segment and the non-target segment in the audio data according to the sound features (step S530). Specifically, the target segment represents a sound segment of one or more designated sound types, and the non-target segment represents a sound segment of a type other than the designated sound types. The sound type is, for example, music, ambient sound, voice, or silence. The value of a sound feature can correspond to a specific sound type. Taking the zero crossing rate as an example, the zero crossing rate of voice is about 0.15, the zero crossing rate of music is about 0.05, and the zero crossing rate of ambient sound changes dramatically. Taking short time energy as an example, the energy of voice is about 0.15 to 0.3, the energy of music is about 0 to 0.15, and the energy of silence is 0. It should be noted that the values and ranges adopted by different types of sound features for determining the sound type may differ, and the foregoing values serve only as examples.
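A minimal sketch of the two features, computed frame by frame; the frame and hop lengths (25 ms and 10 ms at 16 kHz) are assumptions:

```python
# Sketch: frame-wise zero crossing rate (ZCR) and short time energy,
# the two sound features used above to distinguish voice, music, and silence.
import numpy as np

def frame_features(signal, frame_len=400, hop=160):  # 25 ms / 10 ms at 16 kHz
    zcr, energy = [], []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        energy.append(np.mean(frame ** 2))
    return np.array(zcr), np.array(energy)
```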
In an embodiment, it is assumed that the target segment is voice content (that is, the sound type is voice), and the non-target segment is not voice content (for example, ambient sound or music). The server can determine the endpoints of the target segment in the audio data according to the short time energy and the zero crossing rate of the audio data. For example, in the audio data, an audio signal whose zero crossing rate is lower than the zero crossing threshold is regarded as voice, a sound signal whose energy is greater than the energy threshold is regarded as voice, and a sound segment whose zero crossing rate is lower than the zero crossing threshold or whose energy is greater than the energy threshold is regarded as the target segment. The beginning and end points of a target segment in the time domain form its boundary, and the sound segments outside the boundary may be non-target segments. For example, the short time energy is used first to roughly determine the endpoints of the voiced region, and the zero crossing rate is then used to detect the actual beginning and end of the voice segment.
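A sketch of this two-pass endpoint detection; the thresholds are illustrative assumptions, not values fixed by the disclosure:

```python
# Sketch: pass 1 uses short time energy to find a rough voiced region;
# pass 2 extends its boundary while the zero crossing rate stays low.
import numpy as np

def voice_boundaries(zcr, energy, zcr_th=0.15, en_th=0.15):
    idx = np.where(energy > en_th)[0]              # pass 1: energetic frames
    if idx.size == 0:
        return None                                # no voiced region found
    start, end = idx[0], idx[-1]
    while start > 0 and zcr[start - 1] < zcr_th:   # pass 2: refine beginning
        start -= 1
    while end < len(zcr) - 1 and zcr[end + 1] < zcr_th:  # pass 2: refine end
        end += 1
    return start, end                              # frame indices of target segment
```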
In an embodiment, the server may retain the target segments of the original audio data or the noise-reduced audio data and remove the non-target segments, so as to obtain the final audio data. In other words, the final audio data includes one or more target segments and no non-target segments. Taking a voice target segment as an example, if the segmented audio data is played, only human speech can be heard.
It should be noted that, in other embodiments, either or both of the noise reduction and audio segmentation steps (steps S210 and S230) may be omitted.
Referring to the training procedure, the server first provides an initial prompt message for each target segment, and the operator marks each target segment with the corresponding label in response (step S610).
After all the target segments are marked, the server can train the classification model according to the initial confirmation response of the initial prompt message (step S630). The initial confirmation response includes the label corresponding to the target segment. That is, the target segment serves as the input data in the training data, and the corresponding label serves as the output/predicted result in the training data.
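A minimal training sketch under stated assumptions: target_segments and labels are hypothetical names for data gathered from the operator's initial confirmation responses, the hand-crafted features reuse the frame_features sketch above, and the choice of a random forest is illustrative since the disclosure leaves the algorithm open.

```python
# Sketch: train a classifier on operator-labeled target segments.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def segment_features(segment):
    zcr, energy = frame_features(segment)       # reuse the earlier sketch
    return [zcr.mean(), zcr.std(), energy.mean(), energy.std()]

# target_segments: list of 1-D arrays; labels: operator-confirmed strings
X = np.array([segment_features(seg) for seg in target_segments])
y = np.array(labels)
model = RandomForestClassifier(n_estimators=100).fit(X, y)
```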
The server can use a machine learning algorithm that is preset or selected by the user.
After the classification model is trained, inputting audio data to the classification model yields a predicted result. The predicted result includes one or more labels defined by the classification model. The labels are, for example, female voices, male voices, a baby's voice, crying, laughter, the voice of a specific person, or alarm bells, and the labels can be changed according to the needs of the user. In some embodiments, the predicted result may further include the predicted probability of each label.
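Continuing the sketch, the per-label probabilities can come from the classifier directly; new_segment is a hypothetical unlabeled recording:

```python
# Sketch: infer the predicted result, including a probability per label.
probs = model.predict_proba([segment_features(new_segment)])[0]
predicted = dict(zip(model.classes_, probs))
print(predicted)  # e.g. {'alarm_bell': 0.19, 'baby_crying': 0.81}
```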
The server then determines a loss level of the predicted result. The loss level is related to the difference between the predicted result and the corresponding actual result: the larger the difference, the higher the loss level.
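One plausible realization (an assumption, since the disclosure does not fix the loss function) is the per-sample cross-entropy between the predicted probabilities and the known actual label; samples whose loss exceeds a threshold would trigger the prompt message described next:

```python
# Sketch: per-sample cross-entropy loss level; actual_label and the 0.7
# threshold are illustrative assumptions.
import numpy as np

def loss_level(probs, classes, actual_label):
    p = probs[list(classes).index(actual_label)]
    return -np.log(max(p, 1e-12))            # cross-entropy for one sample

needs_prompt = loss_level(probs, model.classes_, actual_label) > 0.7
```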
In the embodiment of the disclosure, the server further provides a prompt message to the operator. The prompt message is provided to query the correlation between the audio data and the label. In an embodiment, the prompt message includes the audio data and inquiry content, and the inquiry content queries whether the audio data belongs to a label (or whether it is related to a label). The server can play the audio data through a speaker and provide the inquiry content through the speaker or a display. For example, the display presents the option of whether it is a baby's crying sound, and the operator simply needs to select either “Yes” or “No”. In addition, if the audio data has been subjected to the audio processing described above, its content is shorter and clearer, which makes it easier for the operator to identify and respond.
It should be noted that, in some embodiments, the prompt message may also be an option presenting a query of multiple labels. For example, the message content may be “is it a baby's crying sound or adult's crying sound?”
The server can modify the classification model according to the confirmation response to the prompt message (step S170). Specifically, the confirmation response is related to a confirmation of the correlation between the audio data and the label. The correlation is, for example, belonging, not belonging, or a level of correlation. In an embodiment, the server may receive an input operation (for example, pressing or clicking) of the operator through an input device (for example, a mouse, a keyboard, a touch panel, or a button). The input operation corresponds to an option of the inquiry content, and the option indicates that the audio data either belongs or does not belong to the label. For example, a prompt message is presented on the display and provides the two options “Yes” and “No”. After listening to the target segment, the operator can select “Yes” through the button corresponding to that option.
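A sketch of the modification step, assuming the hypothetical names operator_says_yes and queried_label for the UI response; here the confirmed sample is folded back into the training set and the model is refit:

```python
# Sketch: modify the classification model with the confirmation response.
if operator_says_yes:                      # operator confirmed the label
    X = np.vstack([X, segment_features(new_segment)])
    y = np.append(y, queried_label)        # audio data belongs to the label
    model.fit(X, y)                        # retrain with the corrected data
```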
In other embodiments, the server may also generate a confirmation response through other voice recognition methods such as preset keyword recognition, preset acoustic feature comparison, and the like.
If the correlation is that the audio data belongs to the label in question or its correlation level is higher than the level threshold, it can be confirmed that the predicted result is correct (that is, the predicted result is equal to the actual result). On the other hand, if the correlation is that the audio data does not belong to the label in question or its correlation level is lower than the level threshold, it can be confirmed that the predicted result is incorrect (that is, the predicted result is different from the actual result).
It can be seen that the embodiment of the disclosure evaluates whether the prediction ability of the classification model meets expectations or whether it needs to be modified through two stages, namely loss level and confirmation response, thereby improving training efficiency and prediction accuracy.
In addition, the server can also provide the classification model to other devices for use. In terms of hardware, the server 30 includes (but is not limited to) a communication interface 31, a memory 33, and a processor 35.
The communication interface 31 can support wired networks such as optical fiber, Ethernet, or cable, and may also support wireless networks such as Wi-Fi, mobile networks (for example, fifth generation (5G) or later), Bluetooth (for example, Bluetooth Low Energy (BLE)), Zigbee, and Z-Wave. In an embodiment, the communication interface 31 is used to transmit or receive data, for example, to receive audio data or to transmit the classification model.
The memory 33 can be any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, or the like, and is used to record program codes, software modules, audio data, classification models and related parameters thereof, and other data or files.
The processor 35 is coupled to the communication interface 31 and the memory 33. The processor 35 may be a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), or other similar component, or a combination of the above components. In the embodiment of the disclosure, the processor 35 is configured to execute all or part of the operations of the server 30, such as training the classification model, audio processing, or data modification.
In summary, in the model construction method for audio recognition in the embodiment of the disclosure, a prompt message is provided according to the loss level between the predicted result obtained by the classification model and the actual result, and the classification model is modified according to the corresponding confirmation response. For the operator, marking can be easily completed by simply responding to the prompt message. In addition, the original audio data can be processed by noise reduction and audio segmentation to make it easier for the operator to listen to. In this way, the recognition accuracy of the classification model and the marking efficiency of the operator can be improved.
Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that it is still possible to modify the technical solutions described in the foregoing embodiments, or equivalently replace some or all of the technical features; these modifications or replacements do not make the nature of the corresponding technical solutions deviate from the scope of the technical solutions in the embodiments of the present disclosure.