1. Technical Field
The present disclosure relates to an electronic device capable of generating a tag file for a media file based on speaker recognition.
2. Description of Related Art
With the increasing number of broadcasts, meeting recordings, and voice mails collected every year, there is a need for an electronic device and a method for processing such content to generate tag files based on speaker recognition, such that a user can search for a media file associated with a specific speaker based on the tag files.
Many aspects of the embodiments can be better understood with reference to the following drawings. The components in the drawings are not necessarily drawn to scale, the emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
Embodiments of the present disclosure will be described with reference to the accompanying drawings.
Referring to FIG. 1, the electronic device 100 includes a processor 10, a storage unit 20, a speaker recognition unit 30, and a speech-to-text converting unit 40. The storage unit 20 stores a number of acoustic models. The speaker recognition unit 30 extracts acoustic features from speech content of a media file received from the video/audio recording device 200 or other devices. The speaker recognition unit 30 compares each extracted acoustic feature with the acoustic models of the storage unit 20 to determine identities of speakers.
In the embodiment, the speaker recognition unit 30 divides the media file into a number of segments of equal length. The length of each segment is sufficiently small that each segment of the media file includes speech content of only one speaker. The speaker recognition unit 30 extracts an acoustic feature from the speech content of each segment, and compares the extracted acoustic feature with the acoustic models of the storage unit 20. If the extracted acoustic feature matches one of the acoustic models, the identity of one speaker is determined. The identities of all the speakers whose speech content is included in the media file are thus determined.
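The disclosure does not specify the feature type or the matching method; the following minimal Python sketch is one illustrative possibility, assuming normalized spectral features and cosine similarity against stored model vectors. The function names, the segment length, and the similarity threshold are assumptions, not part of the disclosure.

```python
# Illustrative sketch of per-segment speaker recognition (assumed details:
# spectral features, cosine similarity, 10-second segments).
import numpy as np

SEGMENT_SECONDS = 10  # chosen small enough that a segment holds one speaker

def split_into_segments(samples: np.ndarray, sample_rate: int):
    """Divide the audio samples into equal-length segments."""
    step = SEGMENT_SECONDS * sample_rate
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def extract_acoustic_feature(segment: np.ndarray) -> np.ndarray:
    """Placeholder feature extractor; a real device might use MFCCs."""
    spectrum = np.abs(np.fft.rfft(segment, n=1024))
    return spectrum / (np.linalg.norm(spectrum) + 1e-9)  # unit vector

def identify_speaker(feature: np.ndarray, acoustic_models: dict,
                     threshold: float = 0.7):
    """Return the identity whose acoustic model best matches the feature,
    or None when no stored model exceeds the threshold."""
    best_name, best_score = None, threshold
    for name, model in acoustic_models.items():
        score = float(np.dot(feature, model))  # cosine similarity
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```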
The processor 10 records a relationship between each segment and the identity of the corresponding speaker, and can thus determine one or more time durations during which each speaker is speaking. For example, for a test audio file having a duration of 110 seconds, the speaker recognition unit 30 may divide the test audio file into 11 segments, each 10 seconds long. The speaker recognition unit 30 performs speaker recognition on each of the segments, and determines that the speech content of segments A, B, C, E, and F corresponds to speaker Jon, the speech content of segment D corresponds to speaker Bran, the speech content of segments G and H corresponds to speaker Tommen, and the speech content of segments I, J, and K corresponds to speaker Arya. The processor 10 can then determine the time durations during which each of the speakers Jon, Bran, Tommen, and Arya is speaking. It is noted that the number of segments can be varied according to need.
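As a concrete illustration of this bookkeeping, the sketch below replays the 110-second example: each 10-second segment carries one identity, and consecutive segments with the same identity are merged into time durations. The merging logic is an assumption about one straightforward way the processor 10 could derive the durations.

```python
# Segment identities from the example above: A..K map to indices 0..10.
segment_speakers = ["Jon", "Jon", "Jon", "Bran", "Jon", "Jon",
                    "Tommen", "Tommen", "Arya", "Arya", "Arya"]

def speaking_durations(speakers, segment_seconds=10):
    """Merge consecutive same-speaker segments into (start, end) spans."""
    durations = {}  # speaker -> list of (start_second, end_second)
    for i, name in enumerate(speakers):
        start, end = i * segment_seconds, (i + 1) * segment_seconds
        spans = durations.setdefault(name, [])
        if spans and spans[-1][1] == start:      # contiguous with last span
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return durations

print(speaking_durations(segment_speakers))
# {'Jon': [(0, 30), (40, 60)], 'Bran': [(30, 40)],
#  'Tommen': [(60, 80)], 'Arya': [(80, 110)]}
```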
The processor 10 generates a tag file including the identities of the speakers and the time durations, and associates the tag file with the media file. In one embodiment, the processor 10 may insert into the tag file a hyperlink that points to the media file. The tag file is stored in the storage unit 20 and is accessible by the user. When the user clicks on the hyperlink, he/she is directed to the media file. The tag file is editable, and the user is allowed to insert other information, such as the location where the media file was recorded and the date when the media file was recorded.
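The disclosure does not fix an on-disk format for the tag file; the snippet below writes one illustrative JSON layout in which the "media_file" entry plays the role of the hyperlink pointing back to the media file, and the "location" and "date" fields are left blank for the user to edit. The file names and path are hypothetical.

```python
# Hypothetical tag-file layout (JSON is an assumption, not the disclosure's).
import json

tag = {
    "media_file": "recordings/meeting.wav",  # serves as the hyperlink target
    "speakers": {                            # identity -> speaking durations (s)
        "Jon": [[0, 30], [40, 60]],
        "Bran": [[30, 40]],
        "Tommen": [[60, 80]],
        "Arya": [[80, 110]],
    },
    "location": None,  # user-editable
    "date": None,      # user-editable
}

with open("meeting.tag.json", "w") as f:
    json.dump(tag, f, indent=2)
```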
The speech-to-text converting unit 40 converts the speech content of each of the segments of the media file into text. The processor 10 can then determine the text corresponding to each speaker. In one embodiment, the tag file may include the text corresponding to each speaker.
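A minimal sketch of gathering the per-speaker text, where transcribe is a placeholder for whatever recognizer the speech-to-text converting unit 40 uses:

```python
def text_per_speaker(segments, speakers, transcribe):
    """Concatenate the transcript of every segment spoken by each speaker."""
    texts = {}
    for segment, name in zip(segments, speakers):
        texts.setdefault(name, []).append(transcribe(segment))
    return {name: " ".join(parts) for name, parts in texts.items()}
```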
Referring to FIG. 2, the electronic device 100 may provide a user interface 60 through which a user can search for media files associated with a specific speaker based on the tag files; the matching media files and the time durations during which the speaker is speaking are displayed in a search result area 62.
The displayed time durations are clickable, and the processor 10 plays the corresponding portion of the corresponding media file when a time duration is clicked. For example, the processor 10 plays media file 2, from 50 minutes to 1 hour, when the time duration “0:50-1:00” is clicked. The user interface 60 may include playback control buttons 623 to control the playback of the media files. The user interface 60 further includes a text displaying field 64 for displaying the text corresponding to the media file being played.
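One way the click handling could work is sketched below: the displayed label is parsed into start and stop offsets in seconds and handed to a playback routine. The "H:MM-H:MM" label format and the player.play signature are assumptions for illustration.

```python
def on_duration_clicked(label: str, media_file: str, player):
    """Parse a clicked label such as "0:50-1:00" and play that portion."""
    def to_seconds(text):
        hours, minutes = text.split(":")
        return int(hours) * 3600 + int(minutes) * 60
    start_text, stop_text = label.split("-")
    player.play(media_file, start=to_seconds(start_text),
                stop=to_seconds(stop_text))
```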
In the embodiment, the user interface 60 further includes a download button. A user can select some content in the search result area 62 and then click the download button, and the processor 10 creates a single file including the selected content. For example, the user may select the “0:20-0:50” portion of media file 1 and the “0:50-1:00” portion of media file 2, and the processor 10 creates a single file including both selected portions.
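A sketch of the download action, under the simplifying assumptions that the media files are raw PCM audio at a known sample rate and that load_samples is a placeholder loader; a real device would also handle container and video formats.

```python
import numpy as np

def build_download(selections, load_samples, sample_rate=16000):
    """selections: list of (media_file, start_second, end_second) chosen
    by the user; returns the selected portions joined into one array."""
    parts = []
    for media_file, start_s, end_s in selections:
        samples = load_samples(media_file)  # placeholder loader
        parts.append(samples[start_s * sample_rate:end_s * sample_rate])
    return np.concatenate(parts)

# e.g. the example above: "0:20-0:50" of media file 1 and "0:50-1:00" of
# media file 2, expressed in seconds:
# combined = build_download([("media1.wav", 1200, 3000),
#                            ("media2.wav", 3000, 3600)], load_samples)
```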
While various embodiments have been described and illustrated, the disclosure is not to be construed as being limited thereto. Various modifications can be made to the embodiments by those skilled in the art without departing from the true spirit and scope of the present disclosure as defined by the appended claims.
Number | Date | Country | Kind
---|---|---|---
101138642 | Oct 2012 | TW | national