This disclosure relates to technical fields of a voice detection apparatus, a voice detection method, and a recording medium that are configured to detect a voice segment that appears in a voice signal.
Patent Literature 1 describes an example of a voice detection apparatus that is configured to detect a voice segment that appears in a voice signal. In addition, as prior art documents related to this disclosure, Patent Literature 2 to Patent Literature 4 and Non-Patent Literature 1 are cited.
It is an example object of this disclosure to provide a voice detection apparatus, a voice detection method, and a recording medium that are intended to improve the techniques/technologies described in the Citation List.
A voice detection apparatus according to an example aspect of this disclosure includes: a beginning determination unit that determines a beginning of a voice segment including a voice that appears in a voice signal; an end determination unit that determines an end of the voice segment by determining whether or not a length of a non-voice segment that appears after the beginning is determined, is greater than or equal to a threshold; and a setting unit that sets the threshold on the basis of a property of a provisional voice segment starting from the beginning.
A voice detection method according to an example aspect of this disclosure includes: determining a beginning of a voice segment including a voice that appears in a voice signal; determining an end of the voice segment by determining whether or not a length of a non-voice segment that appears after the beginning is determined, is greater than or equal to a threshold; and setting the threshold on the basis of a property of a provisional voice segment starting from the beginning.
A recording medium according to an example aspect of this disclosure is a recording medium on which a computer program that allows a computer to execute a voice detection method is recorded, the voice detection method including: determining a beginning of a voice segment including a voice that appears in a voice signal; determining an end of the voice segment by determining whether or not a length of a non-voice segment that appears after the beginning is determined, is greater than or equal to a threshold; and setting the threshold on the basis of a property of a provisional voice segment starting from the beginning.
Hereinafter, with reference to the drawings, a voice detection apparatus, a voice detection method, and a recording medium according to example embodiments will be described.
First, a voice detection apparatus, a voice detection method, and a recording medium according to a first example embodiment will be described. With reference to
As illustrated in
As described above, the voice detection apparatus 1000 in the first example embodiment is configured to set (i.e., change) the threshold TH on the basis of a length Lt of the provisional voice segment. Therefore, the voice detection apparatus 1000 is capable of detecting the voice segment with an appropriate length for a post-processing operation (e.g., a voice recognition operation, a voice authentication operation, or an emotion recognition operation) performed after the voice segment is detected.
Next, a voice detection apparatus, a voice detection method, and a recording medium according to a second example embodiment will be described. The following describes the voice detection apparatus, the voice detection method, and the recording medium according to the second example embodiment, by using a voice detection apparatus 1 to which the voice detection apparatus, the voice detection method, and the recording medium according to the second example embodiment are applied.
The voice detection apparatus 1 is an apparatus that performs voice activity detection (VAD). The voice activity detection is an operation of detecting a voice segment, from a voice signal indicating a voice uttered by a speaker. In other words, the voice activity detection is an operation of distinguishing the voice segment that appears in the voice signal, from the non-voice segment that appears in the voice signal. The voice segment is a segment including the voice uttered by the speaker. That is, the voice segment is a segment in which the speaker is speaking. On the other hand, the non-voice segment is different from the voice segment. Typically, the non-voice segment is a segment in which the speaker is not speaking.
Hereinafter, the voice detection apparatus 1 that performs such voice activity detection will be described.
First, with reference to
As illustrated in
The arithmetic apparatus 11 includes, for example, at least one of a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), and an FPGA (Field Programmable Gate Array). The arithmetic apparatus 11 reads a computer program. For example, the arithmetic apparatus 11 may read a computer program stored in the storage apparatus 12. For example, the arithmetic apparatus 11 may read a computer program stored by a computer-readable and non-transitory recording medium, by using a not-illustrated recording medium reading apparatus provided in the voice detection apparatus 1. The arithmetic apparatus 11 may acquire (i.e., download or read) a computer program from a not-illustrated apparatus disposed outside the voice detection apparatus 1, through the communication apparatus 13 (or another communication apparatus). The arithmetic apparatus 11 executes the read computer program. Consequently, a logical functional block for performing an operation to be performed by the voice detection apparatus 1 (e.g., the voice activity detection described above) is realized or implemented in the arithmetic apparatus 11. That is, the arithmetic apparatus 11 is allowed to function as a controller for realizing or implementing the logical functional block for performing an operation (in other words, processing) to be performed by the voice detection apparatus 1.
The symbol generation unit 111 generates symbol data from the voice signal. Specifically, the symbol generation unit 111 outputs a symbol for each voice frame SF (e.g., voice frame SF of several tens of milliseconds) obtained by subdividing/segmenting the voice signal. The symbol may include a character symbol representing the voice uttered by the speaker in the voice frame SF, as a character. One character symbol may represent one character (e.g., one alphabetical letter, one Hiragana character, one Hangul character, or one Kanji character). As an example, one character symbol may represent the single alphabetical letter “a”. One character symbol may represent a plurality of characters (e.g., a plurality of alphabetical letters, a plurality of Hiragana characters, a plurality of Hangul characters, or a plurality of Kanji characters). As an example, one character symbol may represent a plurality of alphabetical letters “pat.” The symbol may include a blank symbol indicating that the speaker is not speaking in the voice frame SF. As a result, the symbol generation unit 111 generates symbol data in which a plurality of outputted symbols are arranged along time series. The character symbol may be a symbol representing a character itself (e.g., a Hiragana character or an alphabetical letter), or may be a symbol representing a phoneme that is the smallest unit of the character.
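As a rough illustration of symbol data arranged along time series, the following Python sketch selects one symbol per voice frame SF from per-frame scores. The function name `frames_to_symbol_data`, the `BLANK` marker "_", the score values, and the per-frame highest-score selection are all assumptions made for illustration; they are not taken from this disclosure or from Non-Patent Literature 1.

```python
from typing import List, Sequence

BLANK = "_"  # hypothetical marker standing in for the blank symbol


def frames_to_symbol_data(frame_scores: Sequence[Sequence[float]],
                          symbols: Sequence[str]) -> List[str]:
    """Return one symbol per voice frame SF, in time order.

    Assumption: each frame simply takes its highest-scoring symbol.
    """
    symbol_data = []
    for scores in frame_scores:
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        symbol_data.append(symbols[best_index])
    return symbol_data


# Hypothetical symbol inventory and per-frame scores (five voice frames).
symbols = [BLANK, "a", "b"]
frame_scores = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.60, 0.10, 0.30],
    [0.10, 0.10, 0.80],
]
symbol_data = frames_to_symbol_data(frame_scores, symbols)
print(symbol_data)  # ['_', 'a', 'a', '_', 'b']
```

In this sketch, each list element corresponds to one voice frame SF, so the position of a symbol in the list serves as its time index.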
In the second example embodiment, the symbol generation unit 111 generates the symbol data from the voice signal by using a CTC (Connectionist Temporal Classification) model. A method of generating the symbol data from the voice signal by using the CTC model is described in Non-Patent Literature 1. Therefore, a detailed description of the method of generating the symbol data from the voice signal by using the CTC model will be omitted, but an outline thereof will be briefly described below with reference to
In
Referring again to
First, as illustrated in
Thereafter, the voice activity detection unit 112 determines the end of the voice segment. Specifically, as illustrated in
Referring back to
The storage apparatus 12 is configured to store desired data. For example, the storage apparatus 12 may temporarily store a computer program to be executed by the arithmetic apparatus 11. The storage apparatus 12 may temporarily store data that are temporarily used by the arithmetic apparatus 11 when the arithmetic apparatus 11 executes the computer program. The storage apparatus 12 may store data that are stored by the voice detection apparatus 1 for a long time. The storage apparatus 12 may include at least one of a RAM (Random Access Memory), a ROM (Read Only Memory), a hard disk apparatus, a magneto-optical disk apparatus, an SSD (Solid State Drive), and a disk array apparatus. That is, the storage apparatus 12 may include a non-transitory recording medium.
The communication apparatus 13 is configured to communicate with an external apparatus of the voice detection apparatus 1.
The input apparatus 14 is an apparatus that receives an input of information to the voice detection apparatus 1 from an outside of the voice detection apparatus 1. For example, the input apparatus 14 may include an operating apparatus (e.g., at least one of a keyboard, a mouse, and a touch panel) that is operable by an operator of the voice detection apparatus 1. For example, the input apparatus 14 may include a reading apparatus that is configured to read information recorded as data on a recording medium that is externally attachable to the voice detection apparatus 1.
The output apparatus 15 is an apparatus that outputs information to the outside of the voice detection apparatus 1. For example, the output apparatus 15 may output information as an image. That is, the output apparatus 15 may include a display apparatus (a so-called display) that is configured to display an image indicating the information that is desirably outputted. For example, the output apparatus 15 may output information as audio/sound. That is, the output apparatus 15 may include an audio apparatus (a so-called speaker device) that is configured to output the audio/sound. For example, the output apparatus 15 may output information onto a paper surface. That is, the output apparatus 15 may include a print apparatus (a so-called printer) that is configured to print desired information on the paper surface.
Next, with reference to
As illustrated in
Thereafter, the voice activity detection unit 112 determines the beginning of the voice segment on the basis of the symbol data generated in the step S100 (step S101). Thereafter, the voice activity detection unit 112 determines the end of the voice segment on the basis of the symbol data generated in the step S100 (the step S103 to step S104). That is, the voice activity detection unit 112 determines whether or not the blank symbol number BSN is greater than or equal to the threshold TH (step S103). The voice activity detection unit 112 determines the end of the voice segment on the basis of a determination result in the step S103 (step S104).
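The flow of the steps S101, S103, and S104 can be sketched in Python as follows, here with a fixed threshold TH for simplicity. This is a minimal sketch under stated assumptions: the beginning is taken to be the first character symbol, the end is placed at the first frame of the run of blank symbols that reaches the threshold, and the names `detect_voice_segment` and `BLANK` are hypothetical.

```python
from typing import List, Optional, Tuple

BLANK = "_"  # hypothetical marker standing in for the blank symbol


def detect_voice_segment(symbol_data: List[str],
                         threshold: int) -> Optional[Tuple[int, int]]:
    """Return (beginning, end) frame indices, end exclusive, or None."""
    # Step S101 (assumption): the beginning is the first character symbol.
    beginning = next(
        (i for i, s in enumerate(symbol_data) if s != BLANK), None)
    if beginning is None:
        return None

    blank_symbol_number = 0  # BSN: length of the current run of blanks
    for i in range(beginning, len(symbol_data)):
        if symbol_data[i] == BLANK:
            blank_symbol_number += 1
        else:
            blank_symbol_number = 0
        # Step S103: is BSN greater than or equal to the threshold TH?
        if blank_symbol_number >= threshold:
            # Step S104 (assumption): the end is where the blank run starts.
            return beginning, i - blank_symbol_number + 1
    return None  # the end is not yet determined


symbol_data = list("__ab_c___d")
print(detect_voice_segment(symbol_data, threshold=3))  # (2, 6)
```

With a threshold of 3, the single blank between "b" and "c" does not end the segment, whereas the run of three blanks after "c" does.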
Especially in the second example embodiment, from when the beginning of the voice segment is determined to when the end of the voice segment is determined, the threshold setting unit 113 sets the threshold TH to be used in the step S103 (step S102). Specifically, the threshold setting unit 113 sets (i.e., changes) the threshold TH on the basis of the property of the provisional voice segment starting from the beginning determined in the step S101.
The provisional voice segment means a provisional voice segment in which the end is not yet determined. Specifically, as illustrated in
In the second example embodiment, an example is described in which the length Lt of the provisional voice segment is used as the property of the provisional voice segment. The following describes an example in which the number of voice frames SF included in the provisional voice segment (i.e., the number of symbols included in the provisional voice segment) is used as the length Lt of the provisional voice segment. In this instance, the threshold setting unit 113 sets the threshold TH on the basis of the length Lt of the provisional voice segment. Specifically, the threshold setting unit 113 changes the threshold TH on the basis of the length Lt of the provisional voice segment. For example, the threshold setting unit 113 may set the threshold TH such that the threshold TH set when the length Lt of the provisional voice segment is a first length, is different from the threshold TH set when the length Lt of the provisional voice segment is a second length that is different from the first length.
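One possible way to make the threshold TH depend on the length Lt is a step-wise lookup, sketched below in Python. The boundary values, the threshold values, the choice of a threshold that shrinks as the provisional voice segment grows, and the name `set_threshold` are hypothetical and chosen only for illustration; this disclosure does not specify concrete numbers.

```python
from bisect import bisect_right

# Hypothetical length boundaries (in voice frames SF) and thresholds.
LENGTH_BOUNDS = [50, 200]    # Lt < 50, 50 <= Lt < 200, Lt >= 200
THRESHOLDS = [30, 20, 10]    # TH (in blank symbols) for each range


def set_threshold(provisional_length: int) -> int:
    """Return the threshold TH for a given length Lt (step S102)."""
    if provisional_length < 0:
        raise ValueError("provisional_length must be non-negative")
    # bisect_right counts how many boundaries are <= the length,
    # which is the index of the range the length falls into.
    return THRESHOLDS[bisect_right(LENGTH_BOUNDS, provisional_length)]


print(set_threshold(10))   # 30
print(set_threshold(50))   # 20
print(set_threshold(500))  # 10
```

A continuous function of Lt would serve equally well; the step-wise table is used here only because it keeps the mapping easy to inspect.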
Especially in the second example embodiment, the threshold setting unit 113 may set the threshold TH such that the threshold TH set when the length Lt of the provisional voice segment is the first length, is greater than the threshold TH set when the length Lt of the provisional voice segment is the second length that is longer than the first length. For example, as illustrated in
Referring back to
As described above, the voice detection apparatus 1 in the second example embodiment is configured to set (i.e., change) the threshold TH on the basis of the length Lt of the provisional voice segment. Therefore, the voice detection apparatus 1 is capable of detecting the voice segment with an appropriate length for the post-processing operation (e.g., the voice recognition operation, the voice authentication operation, or the emotion recognition operation) performed after the voice segment is detected. Hereinafter, a specific reason why the voice segment with an appropriate length for the post-processing operation can be detected will be described with reference to
First, a voice detection apparatus in a comparative example in which the threshold TH is fixed independently of the length Lt of the provisional voice segment, may detect an unnecessarily short voice segment. For example, in a case where the speaker takes a short pause after speaking for a short time, the voice detection apparatus in the comparative example is more likely to detect a short voice segment including the voice uttered in the short time. For example.
In the second example embodiment, however, the voice detection apparatus 1 determines the threshold TH on the basis of the length Lt of the provisional voice segment. Therefore, as compared with the voice detection apparatus in the comparative example, the voice detection apparatus 1 is less likely to detect an unnecessarily short voice segment. For example.
On the other hand, the voice detection apparatus in the comparative example in which the threshold TH is fixed independently of the length Lt of the provisional voice segment, may detect an unnecessarily long voice segment, in addition to or in place of an unnecessarily short voice segment. For example, in a case where the speaker is continuously speaking fast, the voice detection apparatus in the comparative example is likely to detect an unnecessarily long voice segment. For example.
In the second example embodiment, however, the voice detection apparatus 1 determines the threshold TH on the basis of the length Lt of the provisional voice segment. Therefore, the voice detection apparatus 1 is less likely to detect an unnecessarily long voice segment, as compared with the voice detection apparatus in the comparative example. For example.
As described above, the voice detection apparatus 1 is less likely to detect an unnecessarily short or long voice segment for the post-processing operation performed after the voice segment is detected, as compared with the voice detection apparatus in the comparative example. That is, the voice detection apparatus 1 is capable of detecting the voice segment with an appropriate length for the post-processing operation performed after the voice segment is detected.
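The difference from the comparative example can be illustrated with the following Python sketch, which runs the same end determination once with a fixed threshold and once with a threshold that depends on the length Lt. The symbol data, the threshold values, the definition of Lt as the number of frames from the beginning up to the current frame, and the names `find_end`, `fixed`, and `length_dependent` are assumptions for illustration only.

```python
from typing import Callable, List, Optional

BLANK = "_"  # hypothetical marker standing in for the blank symbol


def find_end(symbol_data: List[str], beginning: int,
             threshold_for: Callable[[int], int]) -> Optional[int]:
    """Return the end frame index (exclusive), or None if not yet found."""
    blank_run = 0
    for i in range(beginning, len(symbol_data)):
        blank_run = blank_run + 1 if symbol_data[i] == BLANK else 0
        # Assumption: Lt counts the frames from the beginning up to frame i.
        provisional_length = i - beginning + 1
        if blank_run >= threshold_for(provisional_length):
            return i - blank_run + 1
    return None


def fixed(_length: int) -> int:
    """Comparative example: TH does not depend on Lt."""
    return 3


def length_dependent(length: int) -> int:
    """Hypothetical setting: larger TH while the segment is still short."""
    return 5 if length < 8 else 2


# A short utterance, a short pause, then a longer utterance and a pause.
symbol_data = list("ab___cdefgh___")
print(find_end(symbol_data, 0, fixed))             # 2
print(find_end(symbol_data, 0, length_dependent))  # 11
```

In this example, the fixed threshold cuts the segment after "ab", while the length-dependent threshold keeps the segment open across the short pause and then closes it after only two blank symbols once the segment has become long.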
In view of the above-described technical effect, it is preferable that the voice detection apparatus 1 sets the threshold TH on the basis of the length Lt of the provisional voice segment so as to achieve both the effect of detecting the voice segment with a length long enough to understand the context of a sentence indicated by the voice uttered by the speaker and the effect of providing an appropriate calculation amount required for the post-processing operation.
In addition, the voice detection apparatus 1 detects the voice segment by using the symbol data generated by using the CTC model. Therefore, the voice detection apparatus 1 is capable of properly detecting the voice segment.
In the example illustrated in
In the above description, the voice activity detection unit 112 determines the end of the voice segment on the basis of the symbol data including a plurality of symbols that constitute the character string having the highest posterior probability. The voice activity detection unit 112, however, may determine the end of the voice segment on the basis of the symbol data including a plurality of symbols that constitute a character string having a posterior probability that is not the highest, but relatively high. For example, the voice activity detection unit 112 may determine the end of the voice segment on the basis of the symbol data including a plurality of symbols that constitute a character string having an Nth highest posterior probability (where N is an integer of 1 or more). That is, the voice activity detection unit 112 may determine whether or not the length Lb of the non-voice segment is greater than or equal to the predetermined threshold TH by using the symbol data including the plurality of symbols that constitute the character string having the Nth highest posterior probability. Even in this case, the voice activity detection unit 112 is capable of properly setting the end of the voice segment.
Next, a voice detection apparatus, a voice detection method, and a recording medium according to a third example embodiment will be described. With reference to
As illustrated in
The threshold setting unit 113b is different from the threshold setting unit 113 in that a different property from the length Lt is used as the property of the provisional voice segment used to set the threshold TH. Other features of the threshold setting unit 113b may be the same as those of the threshold setting unit 113.
For example, the threshold setting unit 113b may use the number of characters included in the provisional voice segment (e.g., the number of characters represented by the character symbol), as the property of the provisional voice segment. Here, as the length Lt of the provisional voice segment becomes longer, there may be a larger number of characters included in the provisional voice segment. Therefore, the number of characters included in the provisional voice segment has a correlation with the length Lt of the provisional voice segment. Therefore, an operation of setting the threshold TH on the basis of the number of characters included in the provisional voice segment, may be regarded as substantially equivalent to the operation of setting the threshold TH on the basis of the length Lt of the provisional voice segment. In this case, the threshold setting unit 113b may set the threshold TH on the basis of the number of characters included in the provisional voice segment, in the same manner as in the case of setting the threshold TH on the basis of the length Lt of the provisional voice segment. For example, the threshold setting unit 113b may set the threshold TH such that the threshold TH set when the number of characters included in the provisional voice segment is a first number, is greater than the threshold TH set when the number of characters included in the provisional voice segment is a second number that is greater than the first number.
For example, the threshold setting unit 113b may use the number of words included in the provisional voice segment, as the property of the provisional voice segment. Since the word is a combination of characters, the voice detection apparatus 1 is capable of detecting the word on the basis of the character symbol included in the symbol data. Specifically, the threshold setting unit 113b is capable of detecting the word by performing morphological analysis on the character symbols included in the symbol data. Therefore, the threshold setting unit 113b is capable of calculating the number of words included in the provisional voice segment. Here, as the length Lt of the provisional voice segment becomes longer, there may be a larger number of words included in the provisional voice segment. Therefore, the number of words included in the provisional voice segment has a correlation with the length Lt of the provisional voice segment. Therefore, an operation of setting the threshold TH on the basis of the number of words included in the provisional voice segment, may be regarded as substantially equivalent to the operation of setting the threshold TH on the basis of the length Lt of the provisional voice segment. In this case, the threshold setting unit 113b may set the threshold TH on the basis of the number of words included in the provisional voice segment, in the same manner as in the case of setting the threshold TH based on the length Lt of the provisional voice segment. For example, the threshold setting unit 113b may set the threshold TH such that the threshold TH set when the number of words included in the provisional voice segment is a first number, is greater than the threshold TH set when the number of words included in the provisional voice segment is a second number that is greater than the first number.
For example, the threshold setting unit 113b may use a speaking speed of the voice that appears in the provisional voice segment, as the property of the provisional voice segment. As the speaking speed is higher, there may be a larger number of character symbols included in a certain voice segment. As a result, as the number of character symbols included in the voice segment increases, a larger calculation amount is required for the post-processing operation. Therefore, in view of the calculation amount required for the post-processing operation, it is preferable that as the speaking speed is higher, the length of the voice segment is shorter (resulting in a smaller number of character symbols included in the voice segment). Therefore, the threshold setting unit 113b may set the threshold TH such that the threshold TH is smaller/shorter in length as the speaking speed is higher. For example, the threshold setting unit 113b may set the threshold TH such that the threshold TH set when the speaking speed in the provisional voice segment is a first speed, is smaller than the threshold TH set when the speaking speed in the provisional voice segment is a second speed that is less than the first speed.
As the speaking speed is higher, there are a larger number of characters (i.e., a larger number of character symbols) per unit time. Furthermore, as the speaking speed is higher, there are a larger number of words per unit time. In addition, as the speaking speed is higher, there are a smaller number of blank symbols per unit time. Therefore, the threshold setting unit 113b may calculate at least one of the number of characters (i.e., the number of character symbols) per unit time, the number of words per unit time, and the number of blank symbols per unit time, as an index value representing the speaking speed.
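Two of these index values can be computed directly from the symbol data, as in the following Python sketch. The frame duration of 25 milliseconds, the name `speaking_speed_indices`, and the dictionary keys are assumptions for illustration; the number of words per unit time is omitted because it would additionally require morphological analysis.

```python
from typing import Dict, List

BLANK = "_"  # hypothetical marker standing in for the blank symbol


def speaking_speed_indices(symbol_data: List[str],
                           frame_seconds: float) -> Dict[str, float]:
    """Return character symbols and blank symbols per second."""
    duration = len(symbol_data) * frame_seconds
    if duration <= 0:
        raise ValueError("symbol data and frame duration must be non-empty")
    character_count = sum(1 for s in symbol_data if s != BLANK)
    blank_count = len(symbol_data) - character_count
    return {
        "characters_per_second": character_count / duration,
        "blanks_per_second": blank_count / duration,
    }


# Hypothetical provisional voice segment of eight 25 ms voice frames.
indices = speaking_speed_indices(list("ab__c___"), frame_seconds=0.025)
print(indices)
```

A higher `characters_per_second` or a lower `blanks_per_second` would indicate a higher speaking speed, for which a smaller threshold TH may be set.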
For example, the threshold setting unit 113b may use the number of character symbols included in the provisional voice segment, as the property of the provisional voice segment. Here, as the length Lt of the provisional voice segment becomes longer, there may be a larger number of character symbols included in the provisional voice segment. Therefore, the number of character symbols included in the provisional voice segment has a correlation with the length Lt of the provisional voice segment. Therefore, an operation of setting the threshold TH on the basis of the number of character symbols included in the provisional voice segment, may be regarded as substantially equivalent to the operation of setting the threshold TH on the basis of the length Lt of the provisional voice segment. In this case, the threshold setting unit 113b may set the threshold TH on the basis of the number of character symbols included in the provisional voice segment, in the same manner as in the case of setting the threshold TH on the basis of the length Lt of the provisional voice segment. For example, the threshold setting unit 113b may set the threshold TH such that the threshold TH set when the number of character symbols included in the provisional voice segment is a first number, is greater than the threshold TH set when the number of character symbols included in the provisional voice segment is a second number that is greater than the first number.
The voice detection apparatus 1b in the third example embodiment can enjoy the same effects as the effects that can be enjoyed by the voice detection apparatus 1 in the second example embodiment.
Next, a voice detection apparatus, a voice detection method, and a recording medium according to a fourth example embodiment will be described. With reference to
As illustrated in
The threshold setting unit 113c is different from at least one of the threshold setting units 113 and 113b described above, in that the threshold TH is set on the basis of the speaker information 121c, in addition to or in place of the property of the provisional voice segment. Other features of the threshold setting unit 113c may be the same as those of at least one of the threshold setting units 113 and 113b.
The speaker information 121c includes information about characteristics of the voice uttered by the speaker. For example, the storage apparatus 12 may include first speaker information including information about characteristics of a voice uttered by a first speaker, and second speaker information including information about characteristics of a voice uttered by a second speaker.
The speaker information 121c may include information about a result of the voice detection operation that is performed on the basis of the voice signal indicating a voice uttered by a certain speaker, as the information about the characteristics of the voice uttered by the speaker. For example, the speaker information 121c may include at least one of information about an average of the length of the voice segment detected (or other arithmetic values, and hereinafter the same shall apply), information about an average of the length of the non-voice segment detected, information about an average of the number of characters uttered per unit time, information about an average of the number of words uttered per unit time, and information about the speaking speed.
The threshold setting unit 113c may identify the speaker from whom the voice signal inputted to the voice detection apparatus 1c is acquired, may acquire the speaker information 121c corresponding to the identified speaker from the storage apparatus 12, and may set the threshold TH on the basis of the acquired speaker information 121c. For example, as the average of the length of the voice segment indicated by the speaker information 121c becomes longer, the threshold setting unit 113c may set the threshold TH to be a larger value such that a relatively long voice segment is detected. For example, the threshold setting unit 113c may set the threshold TH to the average of the length of the non-voice segment indicated by the speaker information 121c, or to a value close to the average. For example, the threshold TH may be set to a lower value such that as the average of the number of characters indicated by the speaker information 121c increases, a relatively short voice segment (resulting in a voice segment in which the number of included characters is not excessively large) is detected. For example, the threshold TH may be set to a lower value such that as the average of the number of words indicated by the speaker information 121c increases, a relatively short voice segment (resulting in a voice segment in which the number of included words is not excessively large) is detected. For example, the threshold TH may be set to a lower value such that as the speaking speed indicated by the speaker information 121c is higher, a relatively short voice segment (resulting in a voice segment in which the number of included characters is not excessively large) is detected.
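As one illustration of the second of these examples, the following Python sketch looks up the speaker information for an identified speaker and sets the threshold TH to a value close to that speaker's average non-voice segment length. The `SpeakerInfo` class, its field name, the speaker identifiers, the stored values, and the fallback to a default threshold for an unknown speaker are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SpeakerInfo:
    # Hypothetical field: average non-voice segment length, in voice frames.
    mean_non_voice_frames: float


def threshold_for_speaker(speaker_id: str,
                          store: Dict[str, SpeakerInfo],
                          default_threshold: int) -> int:
    """Return a threshold TH close to the speaker's average pause length."""
    info = store.get(speaker_id)
    if info is None:
        # Assumption: fall back to a default when the speaker is unknown.
        return default_threshold
    # Round to a whole number of frames and keep the threshold at least 1.
    return max(1, round(info.mean_non_voice_frames))


# Hypothetical stored speaker information for two speakers.
store = {
    "speaker_1": SpeakerInfo(mean_non_voice_frames=12.4),
    "speaker_2": SpeakerInfo(mean_non_voice_frames=27.8),
}
print(threshold_for_speaker("speaker_1", store, default_threshold=20))  # 12
print(threshold_for_speaker("speaker_2", store, default_threshold=20))  # 28
print(threshold_for_speaker("unknown", store, default_threshold=20))    # 20
```

The other examples in the paragraph above could be handled in the same way by adding fields such as an average speaking speed and mapping them to a smaller or larger threshold.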
The voice detection apparatus 1c in the fourth example embodiment can enjoy the same effect as the effect that can be enjoyed by at least one of the voice detection apparatus 1 in the second example embodiment to the voice detection apparatus 1b in the third example embodiment. In addition, the voice detection apparatus 1c is capable of setting the threshold TH that matches the characteristics of the voice uttered by the speaker. Therefore, the voice detection apparatus 1c is capable of more properly detecting the voice segment in view of a difference in the characteristics of the voice uttered by the speaker.
Next, a voice detection apparatus, a voice detection method, and a recording medium according to a fifth example embodiment will be described. With reference to
As illustrated in
The text generation unit 111d is different from the symbol generation unit 111 that generates the symbol data by using the CTC model, in that it generates, from the voice signal, text data representing the voice uttered by the speaker as characters, without using the CTC model. For example, the text generation unit 111d calculates the posterior probability of the character string by using an acoustic model, a pronunciation dictionary, and a language model, and generates, as the text data, serial data on a plurality of texts that constitute the character string having the highest posterior probability. Even in this case, the voice activity detection unit 112 may determine the beginning of the voice segment from the generated text data, and may then determine the end of the voice segment by comparing the length Lb of the non-voice segment with the threshold TH. Furthermore, the threshold setting unit 113 may set the threshold TH on the basis of the length Lt of the provisional voice segment. As a consequence, it is possible to enjoy the above-described benefit even when the CTC model is not used.
In a case where the text generation unit 111d generates the text data by using the pronunciation dictionary (i.e., dictionary data), the threshold setting unit 113 may set the threshold TH on the basis of a property of the pronunciation dictionary. For example, in a case where the pronunciation dictionary has a property of generating the text data including many kanji characters, the threshold setting unit 113 may set the threshold TH to a smaller value than a standard value such that a relatively short voice segment (resulting in a voice segment in which the number of included characters is not excessively large) is detected.
The voice detection apparatus 1d in the fifth example embodiment described above can enjoy the same effect as the effect that can be enjoyed by at least one of the voice detection apparatus 1 in the second example embodiment to the voice detection apparatus 1c in the fourth example embodiment. In addition, the voice detection apparatus 1d is capable of setting the threshold TH on the basis of the pronunciation dictionary. Therefore, the voice detection apparatus 1d is capable of more properly detecting the voice segment in view of a difference in an operation of converting the voice signal into the text data.
With respect to the example embodiment described above, the following Supplementary Notes are further disclosed.
A voice detection apparatus including:
The voice detection apparatus according to Supplementary Note 1, wherein the property of the provisional voice segment includes a length of the provisional voice segment.
The voice detection apparatus according to Supplementary Note 2, wherein the setting unit sets the threshold such that the threshold set when the length of the provisional voice segment is a first length, is greater than the threshold set when the length of the provisional voice segment is a second length that is longer than the first length.
The voice detection apparatus according to any one of Supplementary Notes 1 to 3, wherein the property of the provisional voice segment includes at least one of a number of characters of the voice included in the provisional voice segment, a number of words of the voice included in the provisional voice segment, and a speaking speed of the voice included in the provisional voice segment.
The voice detection apparatus according to any one of Supplementary Notes 1 to 4, wherein
The voice detection apparatus according to Supplementary Note 5, wherein the property of the provisional voice segment includes a number of character symbols included in the provisional voice segment.
The voice detection apparatus according to any one of Supplementary Notes 1 to 6, wherein
The voice detection apparatus according to any one of Supplementary Notes 1 to 7, wherein
A voice detection method including:
A recording medium on which a computer program that allows a computer to execute a voice detection method is recorded, the voice detection method including:
At least a part of the constituent components of each of the example embodiments described above can be combined with at least another part of the constituent components of each of the example embodiments described above, as appropriate. A part of the constituent components of each of the example embodiments described above may not be used. Furthermore, to the extent permitted by law, all the references (e.g., publications) cited in this disclosure are incorporated by reference as a part of the description of this disclosure.
This disclosure is allowed to be changed, if desired, without departing from the essence or spirit of this disclosure which can be read from the claims and the entire specification. A voice detection apparatus, a voice detection method, and a recording medium with such changes are also intended to be within the technical scope of this disclosure.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2022/013089 | 3/22/2022 | WO |