The present application claims priority from Japanese patent application JP 2011-252425 filed on Nov. 18, 2011, the content of which is hereby incorporated by reference into this application.
The present invention relates to a system for retrieving voice data.
A large amount of voice data has been accumulated in recent years as storage devices have grown in capacity. In many conventional voice databases, each piece of voice data is managed by attaching the time at which the voice was recorded, and desired voice data is retrieved based on that time information. However, retrieval based on time information requires knowing in advance the time at which the desired voice was spoken, and is therefore unsuitable for retrieving voice in which a specific keyword is included in the speech. To retrieve voice in which a specific keyword is included in the speech, it would otherwise be necessary to listen to the voice from start to end.
Hence, technologies have been developed for automatically detecting the time at which a specific keyword is spoken in a voice database. According to the sub-word retrieval method, which is one of the representative methods, voice data is first converted into a sub-word sequence by sub-word recognition. Here, "sub-word" is a technical term denoting a unit smaller than a word, such as a phoneme or a syllable. When a keyword is input, the time at which the keyword is spoken in the voice data is detected by comparing the sub-word expression of the keyword with the sub-word recognition result of the voice data and detecting portions at which the degree of agreement between them is high (Japanese Unexamined Patent Application Publication No. 2002-221984; Kohei Iwata, et al., "Verification of Effectiveness of New Sub-word Model and Sub-word Acoustic Distance in Vocabulary Free Acoustic Document Retrieving Method", Information Processing Society of Japan Journal, Vol. 48, No. 5, 2007). According to the word spotting method shown in Tatsuya Kawahara, Toshihiko Munetsugu, Shuji Dooshita, "Word Spotting in Conversation Voice Using Heuristic Language Model", Journal of Information & Communication Research, D-II, Information•System, II-Information Processing, Vol. 78, No. 7, pp. 1013-1020, 1995, the time at which a keyword is spoken in voice data is detected by creating an acoustic model of the keyword from phoneme-unit acoustic models and matching that keyword acoustic model against the voice data.
However, all of the above-described technologies are affected by variations in speech (regional accent, differences in speaker attributes, and the like) and by noise, so errors are included in the retrieval result, and times at which the keyword is not actually spoken appear in it. To remove such erroneous retrieval results, a user therefore needs to reproduce the voice data from each time of speaking the keyword provided by retrieval and determine by listening whether the keyword has truly been spoken. Technologies for assisting this correct/incorrect determination have also been proposed. Japanese Unexamined Patent Application Publication No. 2005-38014 discloses a technology of reproducing the time of detecting the keyword with highlighting so that the user can determine by listening whether the keyword is truly spoken.
However, a problem is that such correct/incorrect determination by listening is frequently difficult in a situation where the user does not sufficiently understand the language of the voice data constituting the retrieval object. For example, when a user carries out retrieval with the keyword "play", a time at which "pray" is actually spoken may be detected. In this case, there is a possibility that a Japanese user who does not sufficiently understand English determines "pray" to be "play". This problem cannot be resolved by the technology of reproducing the keyword detection position with highlighting proposed in Japanese Unexamined Patent Application Publication No. 2005-38014.
It is an object of the present invention to resolve this problem and enable correct/incorrect determination of a retrieval result to be carried out easily in a voice data retrieval system.
To resolve the above-described problem, the present invention adopts the configurations described in, for example, the scope of the claims.
As an example of a voice data retrieval system according to the present invention, there is provided a voice data retrieval system including: an inputting device for inputting a keyword; a phoneme converting unit for converting the input keyword into a phoneme expression; a voice data retrieving unit for retrieving, based on the keyword in the phoneme expression, a portion of voice data at which the keyword is spoken; a comparison keyword creating unit for creating, based on the keyword in the phoneme expression, a set of comparison keywords distinct from the keyword which the user may confuse with the keyword when listening; and a retrieval result presenting unit for presenting the retrieval result from the voice data retrieving unit and the comparison keywords from the comparison keyword creating unit to the user.
As an example of a program product of the present invention, there is provided a computer readable medium storing a program causing a computer to execute a process for functioning as a voice data retrieval system, the process including the steps of: converting an input keyword into a phoneme expression; retrieving, based on the keyword in the phoneme expression, a portion of voice data at which the keyword is spoken; creating, based on the keyword in the phoneme expression, a set of comparison keywords distinct from the keyword which the user may confuse with the keyword when listening; and presenting the retrieval result and the comparison keywords to the user.
According to the present invention, in the voice data retrieval system, correct/incorrect determination of the retrieval result can be carried out easily by creating, based on the keyword input by the user, a comparison keyword set which the user may confuse with the keyword when listening, and presenting it to the user.
An embodiment of the present invention will be explained with reference to the attached drawings.
The voice data retrieval system can be realized by a computer in which a CPU loads a prescribed program onto a memory and executes the program loaded onto the memory. The prescribed program may be loaded directly onto the memory from a storage medium storing the program via a reading device, or from a network via a communication device, or may be loaded onto the memory after being stored once in an external storage device, although these devices are not illustrated.
The program product according to the present invention is a program product which is integrated into a computer in this way and operates the computer as a voice data retrieval system. A voice data retrieval system shown in the block diagrams of
A description will now be given of the flow of processing of the respective constituent elements.
When a user inputs a keyword as text from the inputting device 112 (processing 301), the phoneme converting unit 106 first converts the keyword into a phoneme expression (processing 302). For example, when the user inputs the keyword "play", "play" is converted into "pleI". This conversion is known as morphological analysis processing and is well known to the skilled person, and therefore an explanation thereof will be omitted.
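For illustration only, the text-to-phoneme conversion can be sketched as a dictionary lookup; the dictionary entries below are hypothetical examples, not the system's actual lexicon or conversion method.

```python
# Minimal sketch of the phoneme converting unit (processing 302),
# assuming a small hypothetical pronunciation dictionary. A real system
# would use a full lexicon or a grapheme-to-phoneme model.
PRONUNCIATION_DICT = {
    "play": "pleI",
    "pray": "preI",
    "clay": "kleI",
}

def to_phonemes(keyword: str) -> str:
    """Convert a text keyword into its phoneme expression."""
    try:
        return PRONUNCIATION_DICT[keyword.lower()]
    except KeyError:
        raise ValueError(f"no pronunciation known for {keyword!r}")

print(to_phonemes("play"))  # pleI
```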
A keyword can also be input by voice, by using a microphone as the inputting device and having the user speak the keyword into the microphone. In this case, the voice waveform can be converted into a phoneme expression by utilizing speech recognition technology as the phoneme converting unit.
Subsequently, the voice data retrieving unit 105 detects the time at which the keyword is spoken in the voice data accumulated in the voice data accumulating device 102 (processing 303). For this processing, the word spotting processing presented in, for example, Tatsuya Kawahara, Toshihiko Munetsugu, Shuji Dooshita, "Word Spotting in Conversation Voice Using Heuristic Language Model", Journal of Information & Communication Research, D-II, Information•System, II-Information Processing, Vol. 78, No. 7, pp. 1013-1020, 1995, can be used. Alternatively, a method of pretreating the voice data accumulating device in advance, as in Japanese Unexamined Patent Application Publication No. 2002-221984 or Kohei Iwata, et al., "Verification of Effectiveness of New Sub-word Model and Sub-word Acoustic Distance in Vocabulary Free Acoustic Document Retrieving Method", Information Processing Society of Japan Journal, Vol. 48, No. 5, 2007, can also be used. A practitioner may select either of these means.
Subsequently, the comparison keyword creating unit 107 creates a comparison keyword set which the user may confuse with the keyword when listening (processing 304). In the following explanation, the keyword is input in English while the user speaks Japanese as a native language. However, the language of the keyword and the language of the user are not limited to English and Japanese, and any combination of languages is acceptable.
An edit distance defines a distance scale between a character string A and a character string B, and is defined as the minimum operation cost of converting character string A into character string B by applying the operations of substitution, insertion, and deletion to character string A. For example, when character string A is abcde and character string B is acfeg as shown in
According to the embodiment, the cost of inserting a phoneme X is set to Matrix (SP, X), the cost of deleting a phoneme X is set to Matrix (X, SP), and the cost of substituting a phoneme X with a phoneme Y is set to Matrix (X, Y). Thereby, an edit distance which reflects the phoneme confusion matrix can be calculated. For example, consider calculating the edit distance between the phoneme expression "pleI" of the keyword "play" and the phoneme expression "preI" of the word "pray" in accordance with the phoneme confusion matrix of
Incidentally, dynamic programming, an efficient method of calculating an edit distance, is well known to the skilled person, and therefore only pseudo-code is shown here.
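The dynamic-programming calculation with costs drawn from a phoneme confusion matrix can be sketched as follows; the cost values below (e.g. a cheap l/r substitution) are illustrative assumptions, not the actual matrix of the embodiment.

```python
# Weighted edit distance by dynamic programming. matrix[(x, y)] is the
# substitution cost Matrix(X, Y); matrix[(SP, x)] the insertion cost
# Matrix(SP, X); matrix[(x, SP)] the deletion cost Matrix(X, SP).
SP = "SP"  # special symbol used for insertion/deletion costs

def edit_distance(a, b, matrix):
    """Minimum cost of converting phoneme sequence a into b."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + matrix[(a[i - 1], SP)]  # delete a[i-1]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + matrix[(SP, b[j - 1])]  # insert b[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + matrix[(a[i - 1], SP)],            # deletion
                d[i][j - 1] + matrix[(SP, b[j - 1])],            # insertion
                d[i - 1][j - 1] + matrix[(a[i - 1], b[j - 1])],  # substitution
            )
    return d[n][m]

# Assumed illustrative costs: identical phonemes cost 0, the l/r pair is
# cheap to substitute (highly confusable for the assumed user), and all
# other operations cost 1.
def cost(x, y):
    if x == y:
        return 0.0
    if {x, y} == {"l", "r"}:
        return 0.2
    return 1.0

class ConfusionMatrix(dict):
    def __missing__(self, key):
        return cost(*key)

m = ConfusionMatrix()
print(edit_distance(list("pleI"), list("preI"), m))  # 0.2 (only l -> r)
```

With these assumed costs, "pleI" and "preI" end up close, which is exactly what makes "pray" a useful comparison keyword for "play".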
As a definition of an edit distance separate from the above, an edit distance can also be defined as the minimum operation cost of making character string A be included in character string B by applying the operations of substitution, insertion, and deletion to character string A. For example, consider a case where character string A is abcde and character string B is xyzacfegklm as shown in
In creating a comparison keyword, either of the two definitions described above may be used as the definition of the edit distance. Any method other than those described above can also be utilized as long as it measures a distance between character strings.
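The second definition, matching A somewhere inside B, differs from the first only in that the surrounding characters of B are free; a sketch with assumed unit costs:

```python
# Sketch of the second edit-distance definition: the minimum cost of
# matching character string A somewhere inside character string B.
# Insertions of B characters before and after the match are free; unit
# costs are assumed here for simplicity.
def substring_edit_distance(a, b):
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)  # deleting characters of A costs 1 each
    # d[0][j] stays 0.0: the match may start anywhere in B for free.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = min(d[i - 1][j] + 1.0,      # delete a[i-1]
                          d[i][j - 1] + 1.0,      # insert b[j-1]
                          d[i - 1][j - 1] + sub)  # substitute or match
    return min(d[n][j] for j in range(m + 1))  # match may end anywhere

print(substring_edit_distance("abcde", "xyzacfegklm"))  # 2.0
```

Here abcde matches the substring "acfe" of xyzacfegklm by deleting b and substituting d with f, for a cost of 2.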
Not only a word Wi but also a word sequence W1 . . . WN may be used in processes 403 and 404 of
An implementation is also possible in which, in processing 403, not only the edit distance Ed (K, W1 . . . WN) but also the probability P (W1 . . . WN) of generating the word sequence W1 . . . WN is calculated, and in processing 404, when the edit distance is equal to or less than its threshold and P (W1 . . . WN) is equal to or more than a threshold, C←C ∪ {W1 . . . WN}. In this case, the comparison keyword set also includes word sequences. Incidentally, as a method of calculating P (W1 . . . WN), for example, an N-gram model, which is well known in the field of language processing, can be utilized. Details of the N-gram model are well known to the skilled person, and therefore an explanation thereof will be omitted here.
An arbitrary scale combining Ed (K, W1 . . . WN) and P (W1 . . . WN) other than the above can also be utilized. For example, in processing 404, the scale Ed (K, W1 . . . WN)/P (W1 . . . WN) or P (W1 . . . WN)*(length (K)−Ed (K, W1 . . . WN))/length (K) can be utilized. Incidentally, length (K) is the number of phonemes included in the phoneme expression of keyword K.
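The two example scales above reduce to simple arithmetic; the numeric values below are assumed purely for illustration.

```python
# Illustrative computation of the two combined scales described above,
# with assumed values for Ed(K, W1..WN), P(W1..WN), and length(K).
def scale_ratio(ed, p):
    """Ed(K, W1..WN) / P(W1..WN): smaller values favor the candidate."""
    return ed / p

def scale_weighted(ed, p, length_k):
    """P(W1..WN) * (length(K) - Ed(K, W1..WN)) / length(K):
    larger values favor the candidate."""
    return p * (length_k - ed) / length_k

ed, p, length_k = 1.0, 0.01, 4  # assumed example values
print(scale_ratio(ed, p))               # 100.0
print(scale_weighted(ed, p, length_k))  # 0.0075
```

The first scale penalizes distance and rewards probable word sequences; the second additionally normalizes the distance by the keyword's phoneme length, so longer keywords tolerate larger edit distances.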
The phoneme confusion matrix used for creating comparison keywords can be switched according to the native language or a usable language of the user. In this case, the user inputs information on his or her native language or usable language to the system via the language information inputting unit 114. Upon receiving this input from the user, the phoneme confusion matrix creating unit 115 outputs a phoneme confusion matrix for the native language of the user. For example, although
The phoneme confusion matrix creating unit can switch the phoneme confusion matrix not only by the native language of the user but also by information on a language which the user can understand.
In a case where a user can understand plural languages, the phoneme confusion matrix creating unit 115 can also create a phoneme confusion matrix which combines these pieces of language information. As one embodiment, for a user who can understand language α and language β, a confusion matrix can be created whose element at row i, column j is the larger of the element at row i, column j of the phoneme confusion matrix for an α-language user and the element at row i, column j of the phoneme confusion matrix for a β-language user. In a case where a user can understand three or more languages as well, the largest element at row i, column j among the phoneme confusion matrices of the respective languages may be selected for each matrix element.
For example, for a user who can understand Japanese and Chinese, a phoneme confusion matrix of
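The element-wise combination described above can be sketched as follows; the 2×2 matrices and their values are hypothetical placeholders, not the actual Japanese or Chinese confusion matrices.

```python
# Sketch of combining per-language phoneme confusion matrices for a user
# who understands several languages: each element of the combined matrix
# is the largest corresponding element among the per-language matrices.
def combine_confusion_matrices(matrices):
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[max(m[i][j] for m in matrices) for j in range(cols)]
            for i in range(rows)]

# Hypothetical 2x2 matrices over two phonemes, for illustration only.
japanese = [[0.0, 0.8],
            [0.8, 0.0]]
chinese  = [[0.0, 0.3],
            [0.6, 0.0]]

combined = combine_confusion_matrices([japanese, chinese])
print(combined)  # [[0.0, 0.8], [0.8, 0.0]]
```

Taking the element-wise maximum means a phoneme pair is treated as confusable if it is confusable in any of the user's languages, which errs on the side of presenting more comparison keywords.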
A user can also adjust the values of the phoneme confusion matrix by directly operating on the matrix.
Incidentally, the creation of a phoneme confusion matrix can be carried out at an arbitrary timing before the comparison keyword creating unit operates.
By operating the comparison keyword checking unit 108, it is selected whether each of the comparison keyword candidates created by the comparison keyword creating unit 107 is to be presented to the user. Unnecessary comparison keyword candidates are thereby removed.
(i) Cut out voice X including the start and end of the time at which the keyword is spoken (processing 703).
(ii) Execute word spotting processing on the voice for all comparison keyword candidates Wi (i=1, . . . , N) (processing 705).
(iii) Set flag (Wi)=1 for each word Wi whose word spotting score P (*Wi*|X) exceeds a threshold (processing 706).
Incidentally, in the word spotting processing, the probability P (*Wi*|X) of the keyword Wi being spoken in voice X is calculated in accordance with Equation 1.
Here, h0 designates an element of the set of arbitrary phoneme sequences which include the phoneme expression of the keyword, and h1 designates an element of the set of arbitrary phoneme sequences. Details are given in Tatsuya Kawahara, Toshihiko Munetsugu, Shuji Dooshita, "Word Spotting in Conversation Voice Using Heuristic Language Model", Journal of Information & Communication Research, D-II, Information•System, II-Information Processing, Vol. 78, No. 7, pp. 1013-1020, 1995, and the like, and are well known to the skilled person; therefore, a further explanation thereof will be omitted here.
In a case where the word spotting value P (*W*|X) calculated in checking a comparison keyword exceeds a threshold, the corresponding retrieval result can also be removed from the retrieval results.
Incidentally, the processing of checking comparison keyword candidates may be omitted.
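The checking steps (i)-(iii) above can be sketched as a simple flag-setting loop. The word-spotting score function here is a hypothetical stand-in with hard-coded example values; a real implementation would compute P (*Wi*|X) from the acoustic models as in Equation 1.

```python
# Sketch of the comparison-keyword checking unit (processings 703-706).
# The spotting_score function and its values are hypothetical placeholders
# for the acoustic word spotting score P(*Wi*|X).
THRESHOLD = 0.5  # assumed threshold for flag(Wi)

def spotting_score(candidate, voice_segment):
    # Hypothetical scores for illustration only.
    scores = {("pray", "segment-1"): 0.7, ("clay", "segment-1"): 0.2}
    return scores.get((candidate, voice_segment), 0.0)

def check_candidates(candidates, voice_segment):
    """Return flag(Wi)=1 for candidates whose word-spotting score in the
    cut-out voice segment X exceeds the threshold, else flag(Wi)=0."""
    return {w: 1 if spotting_score(w, voice_segment) > THRESHOLD else 0
            for w in candidates}

flags = check_candidates(["pray", "clay"], "segment-1")
print(flags)  # {'pray': 1, 'clay': 0}
```

A flagged candidate is one that is plausibly what was actually spoken in the segment, so it is worth presenting to the user; unflagged candidates can be dropped as unnecessary.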
Both the comparison keyword candidates and the keyword input by the user are converted into voice waveforms by the voice synthesizing unit 109. Voice synthesis technology for converting text into a voice waveform is well known to the skilled person, and therefore details thereof will be omitted.
Finally, the retrieval result presenting unit 110 presents information on the retrieval result and the comparison keywords to the user via the display device 111 and the voice outputting device 113.
A user can retrieve portions at which the keyword is spoken in the voice data accumulated in the voice data accumulating device 102 by inputting a retrieval keyword into a retrieval window 801 and pressing a button 802. In an example of
The retrieval result consists of a voice file name 805 in which the keyword input by the user is spoken and the time 806 at which the keyword is spoken in that voice file; voice is reproduced from that time of the file via the voice outputting device 113 by clicking the "reproduce from keyword" portion 807. Also, voice is reproduced from the start of the file by the voice outputting device 113 by clicking the "reproduce from start of file" portion 808.
A voice synthesis of the keyword is reproduced via the voice outputting device 113 by clicking the "listen to keyword voice synthesis" portion 803. Thereby, the user can listen to the correct pronunciation of the keyword, which serves as a reference for determining whether the retrieval result is correct.
As candidates of the comparison keyword, "pray" and "clay" are displayed at 804 of