1. Technical Field
The present disclosure relates to a multimedia recording system, and particularly to a multimedia recording system which is capable of translating spoken words into text and tagging a multimedia file corresponding to the spoken words according to the text.
2. Description of Related Art
Meeting minutes are generally made by manually transcribing the spoken words of the participants into text in a paper file or an electronic file. However, errors such as misinterpretation are liable to occur during manual transcription, and text-only files make it difficult for a reader to grasp the content of a meeting. In addition, although multimedia items such as audio/video recordings can present the content of a meeting in an intuitive manner, a user cannot locate a particular topic within a multimedia item without searching through it.
Thus, there is room for improvement in the art.
Many aspects of the present disclosure can be better understood with reference to the drawings. The components in the drawing(s) are not necessarily drawn to scale, the emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawing(s), like reference numerals designate corresponding parts throughout the several views.
The storage module 110 includes a device for storing and retrieving digital information, such as a random access memory, a non-volatile memory, or a hard disk drive, and stores the received multimedia data D as a multimedia file 1110. The recognition module 120 converts the audio content of the multimedia file 1110, corresponding to the audio content of the multimedia data D, into text. When the multimedia file 1110 includes video content, the recognition module 120 may reference the video content during the conversion, thereby ensuring the correctness or enhancing the accuracy of the conversion. For instance, the recognition module 120 can detect the lip movements of a speaker through the video content, determine the pronunciations corresponding to those movements, and reference the pronunciations when converting the audio content into the text, thereby compensating for inadequately captured sounds. In addition, the recognition module 120 can determine the identity or the mood of a speaker through the video content, thereby describing the identity or the mood of the speaker in the text. The recognition module 120 may also reference the text content of a document file during the conversion. For instance, the multimedia recording system 100 can receive meeting materials such as presentation documents as input, such that the recognition module 120 can use the phrase(s) in the text content of the meeting materials as key words for converting the audio content into the text, thereby enhancing the correctness of the conversion.
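The keyword-referencing idea above can be pictured with a minimal sketch. This is an illustration only, not the disclosed implementation: the function names (`extract_keywords`, `pick_transcript`), the frequency threshold, and the score bonus are all assumptions made for the sketch, and a real recognizer would supply the scored candidates.

```python
import re
from collections import Counter

def extract_keywords(document_text, min_length=4, min_count=2):
    # Collect candidate key words from meeting materials (e.g. a
    # presentation document) by simple frequency counting.
    words = re.findall(rf"[A-Za-z]{{{min_length},}}", document_text.lower())
    return {w for w, c in Counter(words).items() if c >= min_count}

def pick_transcript(candidates, keywords, bonus=0.2):
    # Given (transcript, score) candidates from a recognizer, prefer
    # candidates containing key words found in the meeting materials.
    def adjusted(item):
        text, score = item
        return score + bonus * sum(w in keywords for w in text.lower().split())
    return max(candidates, key=adjusted)[0]
```

In this sketch, a slightly lower-scoring candidate that matches the meeting vocabulary can outrank an acoustically favored but implausible one, which is the sense in which the key words "enhance the correctness of the conversion."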
In the illustrated embodiment, the recognition module 120 includes a pronunciation recognition database 1210 and an audio-to-text mapping database 1220. The pronunciation recognition database 1210 stores pronunciation recognition principles. The audio-to-text mapping database 1220 stores audio-to-text mapping data. The recognition module 120 converts the audio content of the multimedia data D into waveform signal(s), identifies sound portion(s) such as vowels and consonants by analyzing the waveform signal(s) according to the pronunciation recognition principles in the pronunciation recognition database 1210, produces pronunciation data according to the sound portion(s), and produces the text by comparing the pronunciation data with the audio-to-text mapping data in the audio-to-text mapping database 1220.
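The four-stage flow above can be sketched in miniature, with the two databases stood in by ordinary dictionaries. Every symbol and mapping below is invented for illustration; actual pronunciation recognition operates on acoustic features of the waveform rather than on these toy labels.

```python
# Pronunciation recognition database (toy): waveform-analysis
# result -> sound portion (e.g. a vowel or consonant).
PRONUNCIATION_DB = {"w1": "h", "w2": "aI"}

# Audio-to-text mapping database (toy): pronunciation data -> text.
AUDIO_TO_TEXT_DB = {"haI": "hi"}

def convert_audio_to_text(waveform_segments):
    # Identify sound portions by "analyzing" each waveform segment.
    sounds = [PRONUNCIATION_DB[s] for s in waveform_segments
              if s in PRONUNCIATION_DB]
    # Produce pronunciation data from the sound portions.
    pronunciation = "".join(sounds)
    # Produce the text by comparing the pronunciation data with the
    # audio-to-text mapping data.
    return AUDIO_TO_TEXT_DB.get(pronunciation, "")
```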
Table 1, below, shows an embodiment of tag information I produced by the tagging module 130 shown in
The multimedia recording system 100 may be selectively operated in different scenarios. For instance, in a meeting scenario, the storage module 110 stores related information of a meeting, for example, the organization and the content (including the text, see
In step S1110, the multimedia data D with audio content is received through the computer network 2000. In the illustrated embodiment, the multimedia data D includes audio content and video content.
In step S1120, the multimedia file 1110 corresponding to the multimedia data D is stored.
In step S1130, the audio content of the multimedia file 1110, corresponding to the audio content of the multimedia data D, is converted into the text. In the illustrated embodiment, the video content of the multimedia data D can be referenced during the conversion. In addition, a document file can be referenced during the conversion.
In step S1140, the tag information I corresponding to portion(s) of the multimedia file 1110 is produced according to the text and the predetermined topic list. The tag information I includes topic(s) corresponding to the predetermined topic list, wherein each of the topics corresponds to a beginning of a portion of the multimedia file 1110 corresponding to the topic. In the illustrated embodiment, the tag file 1120 corresponding to the multimedia file 1110 is created according to the tag information I. In other embodiments, the related information can be integrated with the multimedia file 1110 according to the tag information I.
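One way to picture how the tag information I could associate each topic with the beginning of its portion, assuming the converted text arrives as timestamped segments. The segment format, function name, and substring matching are assumptions made for this sketch.

```python
def produce_tag_info(segments, topic_list):
    # segments: list of (start_seconds, text) pairs from the transcript.
    # Returns tag information mapping each topic in the predetermined
    # topic list to the beginning of the first portion mentioning it.
    tags = {}
    for start, text in segments:
        lowered = text.lower()
        for topic in topic_list:
            if topic not in tags and topic.lower() in lowered:
                tags[topic] = start
    return tags
```

A tag file corresponding to the multimedia file could then be created from the returned mapping, so that selecting a topic seeks the playback position to the recorded beginning of that portion.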
In the illustrated embodiment, a network service such as a web service is provided through the computer network 2000, wherein the network service is capable of providing the editing interface Fe (see
In step S1131, the audio content of the multimedia data D is converted into waveform signal(s).
In step S1132, sound portion(s) such as vowels and consonants are identified by analyzing the waveform signal(s) according to pronunciation recognition principles.
In step S1133, pronunciation data is produced according to the sound portion(s).
In step S1134, the text is produced by comparing the pronunciation data with audio-to-text mapping data.
The multimedia recording system and the multimedia recording method are capable of translating spoken words into text and tagging a multimedia file corresponding to the spoken words according to the text, thereby producing computer files for multimedia items such as multimedia meeting minutes or audio/video recordings, in which a user can readily locate a topic.
While the disclosure has been described by way of example and in terms of preferred embodiments, the disclosure is not limited thereto. On the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
Number | Date | Country | Kind |
---|---|---|---|
101130202 | Aug 2012 | TW | national |