This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2022-134502, filed on Aug. 25, 2022, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a dictionary selection program, a dictionary selection method, and a dictionary selection device.
Speech-to-text conversion, which is so-called dictation, is in some cases applied to voice data of a moving image. For example, voice data of a moving image in which a conference held through a chat function, such as a voice call or a video call, has been recorded by a recording function is input to a voice recognition engine, whereby speech-to-text transcription of conference minutes or the like is implemented.
Japanese Laid-open Patent Publication No. 2013-50605 is disclosed as related art.
According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores a dictionary selection program for causing a computer to execute a process including: determining a genre indicated by moving image data for each of a plurality of sections in the moving image data, based on each of voice data and image data of the moving image data; determining quality of voice and quality of an image for each of the plurality of sections in the moving image data, based on each of the voice data and the image data; and selecting a specified voice recognition dictionary from among a plurality of voice recognition dictionaries, based on the determination results for the genre and the determination results for the quality of the voice and the quality of the image for each of the plurality of sections.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
The accuracy of voice recognition in such speech-to-text transcription is affected by the word dictionary in which the vocabulary to be recognized by the voice recognition engine is registered. For example, a language model switching device is known as one technique for choosing one voice recognition dictionary from among a plurality of voice recognition dictionaries. In a case where a plurality of language models each adapted to a topic is prepared, the language model switching device estimates the topic currently in progress using a voice recognition result, and sequentially switches the language model to one matching the estimated topic.
However, the above-described language model switching device has difficulty in selecting a voice recognition dictionary corresponding to the genre of a moving image.
For example, in the above language model switching device, switching to a language model having low relevance to the topic in the moving image is executed due to factors such as those exemplified below, and as a result, the voice recognition accuracy sometimes decreases. For example, when background music (BGM) or noise is superimposed on the input voice, it is difficult to grasp the features of the voice, and consequently, the estimation accuracy for the topic decreases. Similarly, in a case where a plurality of speakers talk over one another, that is, crosstalk occurs, it is difficult to grasp the features of the voice, and the estimation accuracy for the topic decreases. Furthermore, when the input voice includes a high percentage or frequency of silent sections, the information for grasping the features of the voice is insufficient, and consequently, errors are likely to increase in estimating the topic. As described above, in the above language model switching device, the estimation accuracy for the topic decreases when unexpected voice data such as noise, BGM, or crosstalk is input, and it is difficult to choose the voice recognition dictionary corresponding to the topic.
In one aspect, an object of the embodiments is to provide a dictionary selection program, a dictionary selection method, and a dictionary selection device capable of implementing selection of a voice recognition dictionary corresponding to a genre of a moving image.
Hereinafter, embodiments of a dictionary selection program, a dictionary selection method, and a dictionary selection device according to the present application will be described with reference to the accompanying drawings. Each of the embodiments merely illustrates an example or aspect, and such exemplification does not limit numerical values, the range of functions, usage scenes, and the like. Furthermore, the embodiments can be appropriately combined within a range that does not cause contradiction between processing contents.
<System Configuration>
As a mere example of a usage scene of such a speech-to-text transcription function, a scene can be mentioned in which a record of statements is generated by inputting, to a voice recognition engine, voice data of a moving image in which a conference or a lecture held through a chat function, such as a voice call or a video call, has been recorded.
The server device 10 is an example of a computer that provides the above-mentioned speech-to-text transcription function. For example, the server device 10 can be implemented as a server that provides the above speech-to-text transcription function on-premises. Additionally, by implementing the server device 10 as a platform as a service (PaaS) type or software as a service (SaaS) type application, the server device 10 can provide the above speech-to-text transcription function as a cloud service.
As illustrated in
The client terminal 30 corresponds to an example of a computer provided with the above speech-to-text transcription function. For example, the client terminal 30 may be implemented by a portable terminal device such as a smartphone, a tablet terminal, or a wearable terminal as well as a personal computer.
Note that
<One Aspect of Problem>
As described in the background part, the above language model switching device has difficulty in selecting a voice recognition dictionary corresponding to the genre of a moving image.
For example, in the above language model switching device, switching to a language model having low relevance to the topic in the moving image is executed due to factors such as those exemplified below, and as a result, the voice recognition accuracy sometimes decreases. For example, when BGM or noise is superimposed on the input voice, it is difficult to grasp the features of the voice, and consequently, the estimation accuracy for the topic decreases. Similarly, in a case where a plurality of speakers talk over one another, that is, crosstalk occurs, it is difficult to grasp the features of the voice, and the estimation accuracy for the topic decreases. Furthermore, when the input voice includes a high percentage or frequency of silent sections, the information for grasping the features of the voice is insufficient, and consequently, errors are likely to increase in estimating the topic. As described above, in the above language model switching device, the estimation accuracy for the topic decreases when unexpected voice data such as noise, BGM, or crosstalk is input, and it is difficult to choose the voice recognition dictionary corresponding to the topic.
<One Aspect of Problem Solving Approach>
Thus, the speech-to-text transcription function according to the present embodiment is equipped with a dictionary selection function that selects, based on the quality of the voice and the quality of the image in a moving image, a voice recognition dictionary corresponding to one of the genre determined from the voice and the genre determined from the image.
With such a dictionary selection function, for example, a voice recognition dictionary corresponding to the genre determined from the voice may be selected when the image in the moving image has poor quality, or a voice recognition dictionary corresponding to the genre determined from the image may be selected when the voice in the moving image has poor quality. Alternatively, for example, a voice recognition dictionary corresponding to the genre determined from the image may be selected when the image in the moving image has good quality, or a voice recognition dictionary corresponding to the genre determined from the voice may be selected when the voice in the moving image has good quality.
Therefore, according to the dictionary selection function according to the present embodiment, selection of a voice recognition dictionary corresponding to a genre of a moving image may be implemented.
<Configuration of Server Device 10>
Next, a functional configuration example of the server device 10 according to the present embodiment will be described. In
As illustrated in
The acceptance unit 11, the voice extraction units 12A and 12B, the image extraction unit 13, the first genre determination unit 14A, the second genre determination unit 14B, the first quality determination unit 15A, the second quality determination unit 15B, the selection unit 16, the character extraction unit 17, the dictionary generation unit 18, the voice recognition unit 19, and the like will be referred to as functional units. Such functional units can be implemented by a hardware processor. Examples of the hardware processor include a central processing unit (CPU), a micro processing unit (MPU), a graphics processing unit (GPU), and a processor for general-purpose computing on GPUs (GPGPU). Additionally, the above functional units may be implemented by hard-wired logic such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
The storage unit such as the dictionary storage unit 16A can be implemented by various storages such as a hard disk drive (HDD), an optical disc, or a solid state drive (SSD), or by allocating a part of the storage area of such a storage.
The acceptance unit 11 is a processing unit that accepts various requests from the client terminal 30. As a mere example, the acceptance unit 11 can accept a speech-to-text transcription request for demanding the execution of speech-to-text transcription, from the client terminal 30.
When accepting such a speech-to-text transcription request, the acceptance unit 11 can also accept, for example, designation of a moving image to be subjected to speech-to-text transcription. As one aspect, the acceptance unit 11 can accept a moving image to be subjected to speech-to-text transcription from the client terminal 30 via the network NW. As another aspect, the acceptance unit 11 can also accept the designation from among moving images stored in a file server (not illustrated) or the like.
Both of the voice extraction units 12A and 12B are processing units that extract voice data from a moving image. The voice extraction units 12A and 12B differ in the sections of voice data that they separate from the moving image.
As one aspect, the voice extraction unit 12A extracts, for each section, voice data corresponding to a specified analysis frame length from among all sections of the moving image. As a mere example, the voice extraction unit 12A extracts frames having a specified time length, one frame per frame period, in order from the head of the voice data separated from all sections of the moving image, and applies a window function such as the Hanning window. At this time, from the aspect of reducing information loss due to the window function, the voice extraction unit 12A may cause the preceding and succeeding analysis frames to overlap at an arbitrary percentage. For example, by setting a fixed length such as 512 samples as the analysis frame length and a frame period of 256 samples, the overlap ratio can be set to 50%. The voice data extracted for each section corresponding to the analysis frame length in this manner is output to the first genre determination unit 14A and the first quality determination unit 15A, which will be described later.
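For reference, the framing just described can be sketched in a few lines. The following is a minimal illustration, assuming a monaural signal held in a NumPy array; the function and parameter names are illustrative and not taken from the embodiment.

```python
import numpy as np

def extract_frames(voice: np.ndarray, frame_len: int = 512, hop: int = 256):
    """Split a 1-D voice signal into overlapping, Hanning-windowed frames.

    With frame_len = 512 and hop = 256, adjacent frames overlap by 50%,
    matching the example ratio given above.
    """
    window = np.hanning(frame_len)
    frames = [voice[s:s + frame_len] * window
              for s in range(0, len(voice) - frame_len + 1, hop)]
    return np.stack(frames)  # shape: (number of frames, frame_len)
```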
As another aspect, the voice extraction unit 12B extracts voice data corresponding to all sections of the moving image. The voice data corresponding to all sections obtained in this manner is output to the voice recognition unit 19 to be described later.
The image extraction unit 13 is a processing unit that extracts image data from a moving image. As a mere example, the image extraction unit 13 extracts image data corresponding to a section from which voice data is extracted by the voice extraction unit 12A, in synchronization with the voice extraction unit 12A for each of the sections. The image data extracted for each section in this manner is output to the second genre determination unit 14B to be described later and the second quality determination unit 15B to be described later.
The first genre determination unit 14A is a processing unit that determines a genre of the moving image, based on the voice data extracted by the voice extraction unit 12A. As a mere example, such genre determination can be implemented by a machine learning model that executes a class classification task of outputting a confidence level for each genre class with the voice data as input. For example, the machine learning model may be implemented by a neural network used for voice recognition tasks, such as a long short-term memory (LSTM) or a convolutional neural network (CNN). Hereinafter, from the aspect of distinguishing the above machine learning model from other machine learning models, the above machine learning model will sometimes be referred to as the “first genre determination model”.
For example, in the training phase, the first genre determination model m11 can be trained in accordance with any machine learning algorithm such as deep learning, with the section voice signal as an explanatory variable for the first genre determination model m11 and the label as an objective variable for the first genre determination model m11. A trained first genre determination model M11 is thereby obtained.
In the inference phase, the section voice signal extracted by the voice extraction unit 12A is input to the first genre determination model M11. The first genre determination model M11 to which the section voice signal has been input in this manner outputs the confidence levels for each genre class. For example, “80%” is output as the confidence level of the genre class “weather forecast”. Furthermore, “15%” is output as the confidence level of the genre class “cultural program”. Furthermore, “5%” is output as the confidence level of the genre class “sports”. In this case, as a mere example, the class “weather forecast” having the highest confidence level can be regarded as the determination result for the genre.
Note that, although an example of inputting the voice signal to the first genre determination model M11 has been given here, the input is not limited to the raw voice signal; for example, a feature or the like obtained by feature extraction from the voice signal may be input to the first genre determination model M11 instead.
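For illustration only, a first genre determination model of the kind described above might look as follows. This is a hedged sketch that assumes a sequence of acoustic feature vectors as input (per the note above) and three genre classes; none of the sizes or class names is fixed by the embodiment.

```python
import torch
import torch.nn as nn

GENRES = ["weather forecast", "cultural program", "sports"]

class AudioGenreModel(nn.Module):
    """LSTM-based class classifier: section voice features -> genre confidences."""
    def __init__(self, feat_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, len(GENRES))

    def forward(self, x):                    # x: (batch, time, feat_dim)
        _, (h, _) = self.lstm(x)             # final hidden state per sequence
        return torch.softmax(self.head(h[-1]), dim=-1)

# Inference: the class with the highest confidence becomes the genre label,
# e.g. confidences [0.80, 0.15, 0.05] -> "weather forecast".
model = AudioGenreModel()
conf = model(torch.randn(1, 100, 40))
genre = GENRES[int(conf.argmax(dim=-1))]
```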
The second genre determination unit 14B is a processing unit that determines a genre of the moving image, based on the image data extracted by the image extraction unit 13. As a mere example, such genre determination can be implemented by a machine learning model that executes a class classification task of outputting a confidence level for each genre class with the image data as input. For example, the machine learning model may be implemented by a convolutional neural network used for image recognition tasks, that is, a so-called CNN-based neural network. Hereinafter, from the aspect of distinguishing the machine learning model described in this part from other machine learning models, it will sometimes be referred to as the “second genre determination model”.
For example, in the training phase, the second genre determination model m12 can be trained in accordance with any machine learning algorithm such as deep learning, with the image as an explanatory variable for the second genre determination model m12 and the label as an objective variable for the second genre determination model m12. A trained second genre determination model M12 is thereby obtained.
In the inference phase, the image extracted by the image extraction unit 13 is input to the second genre determination model M12. The second genre determination model M12 to which the image has been input in this manner outputs the confidence levels for each genre class. For example, “80%” is output as the confidence level of the genre class “weather forecast”. Furthermore, “15%” is output as the confidence level of the genre class “cultural program”. Furthermore, “5%” is output as the confidence level of the genre class “sports”. In this case, as a mere example, the class “weather forecast” having the highest confidence level can be regarded as the determination result for the genre.
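The second genre determination model can be sketched in the same spirit with a CNN; again, the layer sizes and the number of classes are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ImageGenreModel(nn.Module):
    """CNN-based class classifier: one section image -> genre confidences."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling to a 16-dim vector
        )
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):                     # x: (batch, 3, height, width)
        z = self.features(x).flatten(1)
        return torch.softmax(self.head(z), dim=-1)
```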
Note that, although
The first quality determination unit 15A is a processing unit that determines the quality relating to the voice in the moving image, based on the voice data extracted by the voice extraction unit 12A. As a mere example, such determination of the voice quality can be implemented by a machine learning model that executes a class classification task of outputting a confidence level for each quality class relating to the voice, with the voice data as input. For example, the machine learning model may be implemented by a neural network such as an LSTM or a CNN. Hereinafter, from the aspect of distinguishing the machine learning model described in this part from other machine learning models, it will sometimes be referred to as the “first quality determination model”.
For example, in the training phase, the first quality determination model m21 can be trained in accordance with any machine learning algorithm such as deep learning, with the section voice signal as an explanatory variable for the first quality determination model m21 and the label as an objective variable for the first quality determination model m21. A trained first quality determination model M21 is thereby obtained.
In the inference phase, the section voice signal extracted by the voice extraction unit 12A is input to the first quality determination model M21. The first quality determination model M21 to which the section voice signal has been input in this manner outputs the confidence levels for each voice quality class. For example, “10%” is output as the confidence level of the voice quality class “OK”, whereas “90%” is output as the confidence level of the voice quality class “NG”. In this case, as a mere example, the class “NG” having the highest confidence level can be regarded as the determination result for the voice quality.
Note that, although an example of inputting the voice signal to the first quality determination model M21 has been given here, the input is not limited to the raw voice signal; for example, a feature or the like obtained by feature extraction from the voice signal may be input to the first quality determination model M21 instead. In addition, although an example in which the machine learning task of the first quality determination model M21 is two-class classification that classifies the voice data into the two classes of OK and NG has been given here, the machine learning task of the first quality determination model M21 is not limited to this. For example, the machine learning task of the first quality determination model M21 may be multi-class classification that classifies the voice data into three or more classes, such as noise, BGM, crosstalk, and normal.
The second quality determination unit 15B is a processing unit that determines the quality relating to the image in the moving image, based on the image data extracted by the image extraction unit 13. As a mere example, such determination of the image quality can be implemented by a machine learning model that executes a class classification task of outputting a confidence level for each image quality class with the image data as input. For example, the machine learning model may be implemented by a CNN-based neural network. Hereinafter, from the aspect of distinguishing the machine learning model described in this part from other machine learning models, it will sometimes be referred to as the “second quality determination model”.
For example, in the training phase, the second quality determination model m22 can be trained in accordance with any machine learning algorithm such as deep learning, with the image as an explanatory variable for the second quality determination model m22 and the label as an objective variable for the second quality determination model m22. A trained second quality determination model M22 is thereby obtained.
In the inference phase, the image extracted by the image extraction unit 13 is input to the second quality determination model M22. The second quality determination model M22 to which the image has been input in this manner outputs the confidence levels for each image quality class. For example, “10%” is output as the confidence level of the image quality class “OK”, whereas “90%” is output as the confidence level of the image quality class “NG”. In this case, as a mere example, the class “NG” having the highest confidence level can be regarded as the determination result for the image quality.
Note that, although an example in which the machine learning task of the second quality determination model M22 is two-class classification that classifies the image data into the two classes of OK and NG has been given here, the machine learning task of the second quality determination model M22 is not limited to this. For example, the machine learning task of the second quality determination model M22 may be multi-class classification that classifies the image data into three or more classes, such as scene change, out-of-focus, and normal.
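When a multi-class quality model is used, its output still has to be reduced to an OK/NG decision with a confidence level for the comparison by the selection unit 16 described next. One possible reduction, with assumed class names, is sketched below.

```python
def to_ok_ng(confidences: dict, ok_classes=frozenset({"normal"})):
    """Reduce per-class quality confidences to an (OK/NG, confidence) pair."""
    label, conf = max(confidences.items(), key=lambda kv: kv[1])
    return ("OK" if label in ok_classes else "NG"), conf

# e.g. a voice section dominated by noise:
print(to_ok_ng({"noise": 0.7, "BGM": 0.1, "crosstalk": 0.1, "normal": 0.1}))
# -> ('NG', 0.7)
```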
The selection unit 16 is a processing unit that selects a specified voice recognition dictionary from among a plurality of voice recognition dictionaries, based on the determination result for the genre and the determination results for the voice quality and image quality, for each of a plurality of sections.
For example, the selection unit 16 executes the following processing for each section. The selection unit 16 compares the voice quality determined by the first quality determination unit 15A and the image quality determined by the second quality determination unit 15B, thereby selecting the medium with the better quality from among the two media of voice and image. Then, the selection unit 16 selects the genre corresponding to the medium with the better quality from among the genre determined from the voice by the first genre determination unit 14A and the genre determined from the image by the second genre determination unit 14B. After that, the selection unit 16 selects the voice recognition dictionary corresponding to the genre having the highest selection frequency among the genres selected for each section, from among the plurality of voice recognition dictionaries stored in the dictionary storage unit 16A.
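Under the assumption that each determination result arrives as a (label, confidence) pair, this selection logic can be sketched as follows. An OK label is treated as better than NG, and confidences break ties within the same label, which mirrors the sections (a) to (e) walked through below; the tie-break in favor of voice when both ranks are equal is an added convention, not something the embodiment specifies.

```python
from collections import Counter

def pick_genre(voice_genre, image_genre, voice_quality, image_quality):
    """Per-section genre selection from (label, confidence) pairs."""
    if voice_genre[0] == image_genre[0]:
        return voice_genre[0]                 # same label: comparison may be skipped
    rank = lambda q: (q[0] == "OK", q[1])     # OK beats NG, then higher confidence
    return voice_genre[0] if rank(voice_quality) >= rank(image_quality) \
        else image_genre[0]

def select_dictionary(section_results, dictionaries):
    """Majority vote over per-section genres, then dictionary lookup."""
    votes = Counter(pick_genre(*result) for result in section_results)
    return dictionaries[votes.most_common(1)[0][0]]
```

Applied to the determination results of the five sections (a) to (e) described below, pick_genre returns “weather forecast” four times and “cultural program” once, so the dictionary for the weather forecast genre would be selected.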
Such a dictionary storage unit 16A may store, as a mere example, a voice recognition dictionary specialized for each genre. The “voice recognition dictionary” mentioned here may include a “word dictionary” listing the vocabulary, for example, a set of words, to be recognized by the voice recognition engine. Additionally, the “voice recognition dictionary” may also include, for example, a “language model” in which a grammar of the language, an occurrence probability of a word string, or the like is defined, and an “acoustic model” in which a feature pattern of an acoustic sound is defined in units of phonemes or the like. For example, by generating a word dictionary, a language model, and an acoustic model based on a corpus corresponding to a specified genre, a voice recognition dictionary for the specified genre can be generated. In accordance with the examples illustrated in
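As one conceivable in-memory representation of such a per-genre dictionary (the field names, the sample vocabulary, and the inclusion of a user dictionary used later in this description are all assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class VoiceRecognitionDictionary:
    genre: str
    word_dictionary: set            # vocabulary to be recognized by the engine
    user_dictionary: set = field(default_factory=set)  # words added later
    language_model: object = None   # e.g. word-string occurrence probabilities
    acoustic_model: object = None   # e.g. per-phoneme acoustic feature patterns

dictionaries = {
    "weather forecast": VoiceRecognitionDictionary(
        genre="weather forecast",
        word_dictionary={"temperature", "typhoon", "precipitation"}),
}
```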
An example of selecting a genre by the selection unit 16 will be described with reference to
For example, in
The first genre determination model M11 to which the section voice signal 21A has been input in this manner outputs the genre label “weather forecast (confidence level of 80%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the second genre determination model M12 to which the image 21B has been input outputs the genre label “movie (confidence level of 80%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the first quality determination model M21 to which the section voice signal 21A has been input outputs the voice quality label “OK (confidence level of 80%)” having the highest confidence level among the confidence levels for each voice quality class, to the selection unit 16. Furthermore, the second quality determination model M22 to which the image 21B has been input outputs the image quality label “NG (confidence level of 70%)” having the highest confidence level among the confidence levels for each image quality class, to the selection unit 16.
The selection unit 16 that has accepted such input compares the voice quality label “OK (confidence level of 80%)” output by the first quality determination model M21 and the image quality label “NG (confidence level of 70%)” output by the second quality determination model M22. In this case, since the voice quality is superior to the image quality, the medium “voice” having the better quality is selected from among the two media of voice and image. As a result, the genre “weather forecast” corresponding to the medium “voice” having the better quality is selected from among the genre label “weather forecast” output by the first genre determination model M11 and the genre label “movie” output by the second genre determination model M12.
Next, in
The first genre determination model M11 to which the section voice signal 22A has been input in this manner outputs the genre label “weather forecast (confidence level of 80%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the second genre determination model M12 to which the image 22B has been input outputs the genre label “weather forecast (confidence level of 80%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the first quality determination model M21 to which the section voice signal 22A has been input outputs the voice quality label “OK (confidence level of 80%)” having the highest confidence level among the confidence levels for each voice quality class, to the selection unit 16. Furthermore, the second quality determination model M22 to which the image 22B has been input outputs the image quality label “NG (confidence level of 80%)” having the highest confidence level among the confidence levels for each image quality class, to the selection unit 16.
The selection unit 16 that has accepted such input compares the voice quality label “OK (confidence level of 80%)” output by the first quality determination model M21 and the image quality label “NG (confidence level of 80%)” output by the second quality determination model M22. In this case, since the voice quality is superior to the image quality, the medium “voice” having the better quality is selected from among the two media of voice and image. As a result, the genre “weather forecast” corresponding to the medium “voice” having the better quality is selected from among the genre label “weather forecast” output by the first genre determination model M11 and the genre label “weather forecast” output by the second genre determination model M12. Note that, as in the example of the section (b), when the first genre determination model M11 and the second genre determination model M12 output the same genre label, the comparison between the voice quality and the image quality may be skipped.
Next, in
The first genre determination model M11 to which the section voice signal 23A has been input in this manner outputs the genre label “weather forecast (confidence level of 80%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the second genre determination model M12 to which the image 23B has been input outputs the genre label “cultural program (confidence level of 80%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the first quality determination model M21 to which the section voice signal 23A has been input outputs the voice quality label “OK (confidence level of 80%)” having the highest confidence level among the confidence levels for each voice quality class, to the selection unit 16. Furthermore, the second quality determination model M22 to which the image 23B has been input outputs the image quality label “OK (confidence level of 70%)” having the highest confidence level among the confidence levels for each image quality class, to the selection unit 16.
The selection unit 16 that has accepted such input compares the voice quality label “OK (confidence level of 80%)” output by the first quality determination model M21 and the image quality label “OK (confidence level of 70%)” output by the second quality determination model M22. In this case, since the voice quality is superior to the image quality, the medium “voice” having the better quality is selected from among the two media of voice and image. As a result, the genre “weather forecast” corresponding to the medium “voice” having the better quality is selected from among the genre label “weather forecast” output by the first genre determination model M11 and the genre label “cultural program” output by the second genre determination model M12.
Next, in
The first genre determination model M11 to which the section voice signal 24A has been input in this manner outputs the genre label “sports program (confidence level of 40%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the second genre determination model M12 to which the image 24B has been input outputs the genre label “cultural program (confidence level of 40%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the first quality determination model M21 to which the section voice signal 24A has been input outputs the voice quality label “NG (confidence level of 80%)” having the highest confidence level among the confidence levels for each voice quality class, to the selection unit 16. Furthermore, the second quality determination model M22 to which the image 24B has been input outputs the image quality label “OK (confidence level of 90%)” having the highest confidence level among the confidence levels for each image quality class, to the selection unit 16.
The selection unit 16 that has accepted such input compares the voice quality label “NG (confidence level of 80%)” output by the first quality determination model M21 and the image quality label “OK (confidence level of 90%)” output by the second quality determination model M22. In this case, since the image quality is superior to the voice quality, the medium “image” having the better quality is selected from among the two media of voice and image. As a result, the genre “cultural program” corresponding to the medium “image” having the better quality is selected from among the genre label “sports program” output by the first genre determination model M11 and the genre label “cultural program” output by the second genre determination model M12.
Finally, in
The first genre determination model M11 to which the section voice signal 25A has been input in this manner outputs the genre label “sports program (confidence level of 40%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the second genre determination model M12 to which the image 25B has been input outputs the genre label “weather forecast (confidence level of 100%)” having the highest confidence level among the confidence levels for each genre class, to the selection unit 16. Furthermore, the first quality determination model M21 to which the section voice signal 25A has been input outputs the voice quality label “NG (confidence level of 90%)” having the highest confidence level among the confidence levels for each voice quality class, to the selection unit 16. Furthermore, the second quality determination model M22 to which the image 25B has been input outputs the image quality label “OK (confidence level of 100%)” having the highest confidence level among the confidence levels for each image quality class, to the selection unit 16.
The selection unit 16 that has accepted such input compares the voice quality label “NG (confidence level of 90%)” output by the first quality determination model M21 and the image quality label “OK (confidence level of 100%)” output by the second quality determination model M22. In this case, since the image quality is superior to the voice quality, the medium “image” having the better quality is selected from among the two media of voice and image. As a result, the genre “weather forecast” corresponding to the medium “image” having the better quality is selected from among the genre label “sports program” output by the first genre determination model M11 and the genre label “weather forecast” output by the second genre determination model M12.
The selection results of selecting genres in each of these sections (a) to (e) are as illustrated in
Returning to the description of
Here, the character extraction unit 17 further extracts characters from the image extracted from the moving image by the image extraction unit 13, from the aspect of adding the vocabulary of a specified genre to the voice recognition dictionary selected by the selection unit 16, for example, to the word dictionary or to a user dictionary, which is a sort of word dictionary. For this reason, the character extraction unit 17 may operate every time an image is extracted by the image extraction unit 13, such as for each section, but can also operate exclusively when the confidence level of the image quality determined by the second quality determination unit 15B is equal to or higher than a threshold value Th, such as 70%.
This makes it possible, when a character string having low relevance to the genre of the moving image is displayed as on-screen subtitles or the like of the moving image, to restrain words or phrases relating to the on-screen subtitles from being registered in the voice recognition dictionary, and to restrain resources from being consumed in extraction processing for such characters. For example, there is a case where on-screen subtitles of a prompt report, such as a prompt report of an earthquake, a prompt report of an election, or a prompt report of an extra, are inserted into the moving image. Since there is a high possibility that such on-screen subtitles have no connection with the purpose of creating or distributing the moving image, there is technical significance in restraining words or phrases relating to the on-screen subtitles from being registered in the voice recognition dictionary.
The dictionary generation unit 18 is a processing unit that generates the voice recognition dictionary to be applied to the voice recognition engine, based on the selection result for the voice recognition dictionary by the selection unit 16 and the character extraction result by the character extraction unit 17. As a mere example, the dictionary generation unit 18 acquires the voice recognition dictionary selected by the selection unit 16 from among the plurality of voice recognition dictionaries stored in the dictionary storage unit 16A. Then, the dictionary generation unit 18 registers the words or phrases corresponding to the characters extracted by the character extraction unit 17 from the sections in which the confidence level of the image quality determined by the second quality determination unit 15B is equal to or higher than the threshold value Th, in the voice recognition dictionary selected by the selection unit 16, for example, in the word dictionary or the user dictionary thereof.
For example, as for the examples of the sections (a) and (b) illustrated in
For example, the filtering results for the character extraction results by the character extraction unit 17 are as illustrated in
Note that the condition that the confidence level of the image quality is equal to or higher than the threshold value Th has been given here as an example of the condition for registering words or phrases in the voice recognition dictionary, but this is not restrictive. For example, the registration of words or phrases in the voice recognition dictionary may be carried out by focusing on sections in which words that frequently appear in on-screen subtitles, such as “earthquake”, “election”, “extra”, and “prompt report”, are not included in the character extraction result by the character extraction unit 17.
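Combining the two registration conditions just described, the registration step of the dictionary generation unit 18 might be sketched as follows. The threshold value, the word list, and the function names are assumptions, and the OCR step itself is left abstract.

```python
TH = 0.70                                     # threshold value Th (70%)
BULLETIN_WORDS = {"earthquake", "election", "extra", "prompt report"}

def register_words(user_dictionary: set, extracted_words,
                   image_quality_confidence: float,
                   filter_bulletins: bool = True) -> None:
    """Register OCR-extracted words, subject to the conditions above."""
    if image_quality_confidence < TH:
        return                                # image quality too uncertain
    if filter_bulletins and BULLETIN_WORDS & set(extracted_words):
        return                                # likely a news-bulletin subtitle
    user_dictionary.update(extracted_words)
```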
The voice recognition unit 19 is a processing unit that executes voice recognition. Such voice recognition may be implemented by any voice recognition engine. As a mere example, the voice recognition unit 19 inputs the voice data of all sections extracted from the moving image by the voice extraction unit 12B to the voice recognition engine to which the voice recognition dictionary generated by the dictionary generation unit 18 is applied. This causes the voice recognition engine to output, as a voice recognition result, text data obtained by converting the voice data of all sections of the moving image into text. Such a voice recognition result is supplied to the client terminal 30 as a response.
<Processing Flow>
Thereafter, loop processing 1, in which the processing from step S101 below to step S107 below is repeated, is executed a number of times corresponding to the number M of sections obtained by dividing the moving image accepted in step S100. Note that, although
For example, the voice extraction unit 12A extracts the voice data corresponding to an m-th section of the moving image accepted in step S100 (step S101A). Then, the first genre determination unit 14A determines the genre of the m-th section, based on the voice data extracted in step S101A (step S102A). Furthermore, the first quality determination unit 15A determines the voice quality of the m-th section, based on the voice data extracted in step S101A (step S103A).
Note that, although an example in which the processing is executed in the order of steps S102A and S103A has been given in
In parallel with the processing from step S101A to step S103A, the processing from step S101B to step S105B is executed (a concurrency sketch is given after the description of these steps).
For example, the image extraction unit 13 extracts the image data corresponding to the m-th section of the moving image accepted in step S100 (step S101B). Then, the second genre determination unit 14B determines the genre of the m-th section, based on the image data extracted in step S101B (step S102B). Furthermore, the second quality determination unit 15B determines the image quality of the m-th section, based on the image data extracted in step S101B (step S103B).
At this time, when the confidence level of the image quality is equal to or higher than the threshold value Th such as 70% (Yes in step S104B), the character extraction unit 17 extracts characters from the image data extracted in step S101B (step S105B). Note that, when the confidence level of the image quality is lower than the threshold value Th such as 70% (No in step S104B), the processing in step S105B is skipped.
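Since the voice-side steps S101A to S103A and the image-side steps S101B to S105B touch different data, an implementation could dispatch them concurrently, for example with a thread pool. The stub functions below merely stand in for the determination units (their return values reuse the figures of section (a) above) and are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def voice_path(section_voice):
    # Stand-in for steps S102A and S103A: (genre, conf), (voice quality, conf).
    return ("weather forecast", 0.8), ("OK", 0.8)

def image_path(section_image):
    # Stand-in for steps S102B and S103B; character extraction (S105B) omitted.
    return ("movie", 0.8), ("NG", 0.7)

with ThreadPoolExecutor(max_workers=2) as pool:
    voice_future = pool.submit(voice_path, None)
    image_future = pool.submit(image_path, None)
    (vg, vq), (ig, iq) = voice_future.result(), image_future.result()
```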
Thereafter, the selection unit 16 compares the voice quality determined in step S103A and the image quality determined in step S103B, thereby selecting a medium with the better quality from among the two media of voice and image (step S106).
Then, the selection unit 16 selects the genre corresponding to the medium with the better quality selected in step S106 from among the genre determined from the voice in step S102A and the genre determined from the image in step S102B (step S107).
By repeating such loop processing 1, the selection results for the genre are obtained for each of the M sections obtained by dividing the moving image.
After that, the selection unit 16 selects the voice recognition dictionary corresponding to the genre having the highest selection frequency among the genres selected for each section, from among the plurality of voice recognition dictionaries stored in the dictionary storage unit 16A (step S108).
Then, the dictionary generation unit 18 registers the word or phrase corresponding to the characters extracted in step S105B, in the voice recognition dictionary selected in step S108, such as the word dictionary or the user dictionary of the word dictionary, among the plurality of voice recognition dictionaries in the dictionary storage unit 16A (step S109).
<One Aspect of Effects>
As described above, the server device 10 according to the present embodiment determines, based on the quality of the voice and the quality of the image, which of the genre determined from the voice and the genre determined from the image in the moving image is used for selection of the voice recognition dictionary.
This may enable, for example, a voice recognition dictionary corresponding to the genre determined from the voice to be selected when the image in the moving image has poor quality, or a voice recognition dictionary corresponding to the genre determined from the image to be selected when the voice in the moving image has poor quality. Alternatively, for example, a voice recognition dictionary corresponding to the genre determined from the image may be selected when the image in the moving image has good quality, or a voice recognition dictionary corresponding to the genre determined from the voice may be selected when the voice in the moving image has good quality.
Therefore, according to the server device 10 according to the present embodiment, selection of a voice recognition dictionary corresponding to a genre of a moving image may be implemented.
Incidentally, while the embodiments relating to the disclosed device have been described above, the present technique may be carried out in a variety of different modes apart from the embodiments described above. Thus, in the following, other embodiments included in the present disclosure will be described.
The first embodiment described above has given an example in which the genre is selected based on the voice quality and the image quality when selecting the genre for each section, but the embodiments are not limited to this. For example, it is also possible to select the genre having the higher confidence level from among the confidence level of the genre output by the first genre determination model M11 and the confidence level of the genre output by the second genre determination model M12. In this case, the determination of the voice quality and the image quality does not necessarily have to be executed.
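A sketch of this confidence-only variant, under the same (label, confidence) convention as the earlier sketches:

```python
def pick_genre_by_confidence(voice_genre, image_genre):
    """Select whichever genre label carries the higher confidence level."""
    return max(voice_genre, image_genre, key=lambda g: g[1])[0]

print(pick_genre_by_confidence(("weather forecast", 0.8), ("movie", 0.6)))
# -> 'weather forecast'
```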
<Distribution and Integration>
In addition, each component of each of the illustrated devices does not necessarily have to be physically configured as illustrated in the drawings. For example, specific modes of distribution and integration of each device are not limited to those illustrated, and the whole or a part of each device may be configured by being functionally or physically distributed or integrated in any unit depending on various loads, use situations, or the like. For example, the acceptance unit 11, the voice extraction units 12A and 12B, the image extraction unit 13, the first genre determination unit 14A, the second genre determination unit 14B, the first quality determination unit 15A, the second quality determination unit 15B, the selection unit 16, the character extraction unit 17, the dictionary generation unit 18, or the voice recognition unit 19 may be coupled through a network as an external device of the server device 10. In addition, different devices may each include one of the acceptance unit 11, the voice extraction units 12A and 12B, the image extraction unit 13, the first genre determination unit 14A, the second genre determination unit 14B, the first quality determination unit 15A, the second quality determination unit 15B, the selection unit 16, the character extraction unit 17, the dictionary generation unit 18, and the voice recognition unit 19, and be coupled to a network to cooperate with each other, thereby implementing the function of the server device 10.
<Hardware Configuration>
In addition, various types of processing described in the above embodiments can be implemented by a computer such as a personal computer or a workstation executing a program prepared in advance. Thus, in the following, an example of a computer that executes a dictionary selection program having functions similar to the functions in the first and second embodiments will be described with reference to
As illustrated in
Under such an environment, the CPU 150 reads the dictionary selection program 170a from the HDD 170 and then loads the read dictionary selection program 170a into the RAM 180. As a result, the dictionary selection program 170a functions as a dictionary selection process 180a as illustrated in
In addition, the dictionary selection program 170a described above does not necessarily have to be stored in the HDD 170 or the ROM 160 from the beginning. For example, the dictionary selection program 170a may be stored in a “portable physical medium” to be inserted into the computer 100, such as a flexible disk, which is a so-called FD, a compact disc (CD)-ROM, a digital versatile disc (DVD), a magneto-optical disk, or an integrated circuit (IC) card. Then, the computer 100 may acquire and execute the dictionary selection program 170a from those portable physical media. In addition, the dictionary selection program 170a may be stored in another computer, a server device, or the like coupled to the computer 100 via a public line, the Internet, a LAN, a wide area network (WAN), or the like. The computer 100 may be caused to download the dictionary selection program 170a stored in this manner and then caused to execute the downloaded dictionary selection program 170a.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.