Users of digital media content come from vast and diverse markets and cultures throughout the world. Accessibility, therefore, is an essential component in the development of digital media content because the products that can be accessed by the most markets will generally garner the greatest success. By providing a multiple audio language product, a far wider audience can be reached to experience the digital media presentation.
Conventional media development technology enables presentations to be developed in multiple languages. Computerized multi-media presentations, such as e-Learning, have been developed with narration. This narration may also be associated with on-screen, closed-caption text, and synchronized with video or animations, through programs such as Macromedia Flash tools. For the presentation to play in different languages, the video would typically need to be synchronized with each audio track in the presentation. This can result in several different versions of the presentation, one for each audio track. Typically, for each audio track the presentation would need to be synchronized by manually adjusting the timing of the video (e.g., animation) to match the audio (or vice versa), resulting in audio and video that are synchronized and thus have equal amounts of play time.
In general, after media content has been synchronized with audio, closed-caption script may be attached using time-codes. Time-codes, for example, may be specified in units of fractional seconds or video frame count, or a combination of these. The time-codes can provide instructions as to when each segment of closed-caption script is to be displayed in a presentation. Once computed, these time codes can be used to segment the entire presentation, perhaps to drive a visible timeline with symbols, such as a bull's-eye used between timeline segments whose length is proportional to the running time of the associated segment.
Once a presentation (e.g., movie, e-learning presentation, etc.) has had its visual media synchronized with its audio, it can be difficult to make changes that affect either the audio or video streams without disrupting the synchronization. For instance, the substitution of new audio, such as a different human language, or the replacement of rough narration with professional narration, typically results in a different run time for the new audio track that replaces the old audio track, and thus a loss of synchronization. Unfortunately, re-working the animations or video in order to restore synchronization is labor intensive and, consequently, expensive.
Due to the problems of the prior art, there is a need for techniques to synchronize video and audio. A multiple audio language product (presentation) can be produced containing a video stream that is automatically synchronized to whichever audio the viewer selects. Video to audio synchronization can be substantially maintained even though new audio streams are added to the presentation.
A system for synchronizing media content can be provided. A media segment has a media duration. A first audio segment corresponds to the media segment. The first audio segment has a first audio duration. A second audio segment corresponds to the media segment. The second audio segment has a second audio duration. A processor compares the first audio duration with the second audio duration. Based on the comparison, the media duration is adjusted to substantially equal the second audio duration.
The first audio stream can reflect an initial (draft) version of the audio. Alternatively, the first audio stream can be directed to a specific language. The second audio stream can reflect a final version of the first audio stream. Alternatively, the second audio stream can be directed to another language. For example, the first audio stream can correspond to a first language and the second audio stream can correspond to a second language.
A video stream can be initially synchronized to a first audio stream. The video stream and first audio stream are partitioned into logical segments, respectively. The end-points of the segments can be specified by time-codes. Closed-caption script can be assigned to each audio segment. Once the video stream has been synchronized to the first audio stream and the video stream and first audio stream have been partitioned into segments, the video stream can be quickly and easily synchronized, automatically, to any other audio streams that have been partitioned into corresponding segments. At run-time, for example, the video stream can be substantially synchronized to another audio stream. This can be accomplished by comparing the duration of the first audio stream with the second audio stream, and adjusting the duration of the video stream based on this comparison. In particular, the duration of a segment in the first audio stream is compared with the duration of a corresponding segment in the second audio stream. If the duration of the first segment is greater than the duration of the second segment, then frames from the media stream are dropped at regular intervals. If the duration of the first segment is less than the duration of the second segment, then frames in the media stream are repeated at regular intervals.
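A minimal sketch of this per-segment comparison follows. It assumes each audio stream is represented as a list of segment durations (in seconds) and each media segment as a list of frames; the names and data model are illustrative, not the claimed implementation.

```python
# Illustrative data model (assumed, not from the source): each audio stream is a
# list of per-segment durations in seconds; each media segment is a list of frames.

def synchronize_segments(media_segments, first_durations, second_durations):
    """Adjust each media segment so its run time tracks the selected audio stream."""
    adjusted = []
    for frames, d1, d2 in zip(media_segments, first_durations, second_durations):
        ratio = d2 / d1
        # ratio < 1: the selected audio segment is shorter, so frames are dropped
        # at regular intervals; ratio > 1: it is longer, so frames are repeated.
        adjusted.append(resample(frames, ratio))
    return adjusted

def resample(frames, ratio):
    """Drop (ratio < 1) or repeat (ratio > 1) frames at roughly even intervals."""
    target = max(1, round(len(frames) * ratio))
    return [frames[min(int(i / ratio), len(frames) - 1)] for i in range(target)]

# Example: a 10-frame segment whose new audio runs 20% shorter loses 2 frames.
print(len(resample(list(range(10)), 0.8)))   # -> 8
```

Here, a ratio below one means the selected audio segment is shorter than the original, so frames are skipped at roughly even intervals, while a ratio above one repeats frames at roughly even intervals.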
The video stream (e.g. media stream) and the first and second audio streams can be processed into a plurality of media and audio segments, respectively. Each media segment, for example, can correspond to a sentence in the audio and closed-caption text, or the segment can correspond to a “thought” or scene in the presentation. The media and audio streams can be defined into segments using time-codes. The time-codes may include information about the duration of each segment. The durational information may be stored in an XML file that is associated with the presentation.
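The text does not specify a schema for the durational information beyond noting that it may be stored in an XML file. The sketch below assumes a hypothetical layout (element and attribute names are invented) and shows how per-segment durations for each audio language might be read.

```python
# Hypothetical layout for the durational information; the element and attribute
# names are invented for illustration -- the source does not specify a schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<presentation>
  <segment id="1">
    <audio lang="en" duration="4.2"/>
    <audio lang="vi" duration="5.1"/>
  </segment>
  <segment id="2">
    <audio lang="en" duration="3.0"/>
    <audio lang="vi" duration="2.6"/>
  </segment>
</presentation>
"""

def load_durations(xml_text, lang):
    """Return the per-segment durations (seconds) for one audio language."""
    root = ET.fromstring(xml_text)
    return [float(seg.find(f"audio[@lang='{lang}']").get("duration"))
            for seg in root.findall("segment")]

print(load_durations(SAMPLE, "en"))  # -> [4.2, 3.0]
print(load_durations(SAMPLE, "vi"))  # -> [5.1, 2.6]
```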
The media stream in the presentation can be synchronized with the first audio stream at development time. Closed-caption text can be time-coded to the first audio stream (and thus to the associated video). Even though the media stream has not been substantially synchronized to the second audio stream, at run-time, for example, a viewer may select the second audio stream to be played in the presentation. The video stream can be automatically substantially synchronized to the second audio stream in the presentation with no manual steps. In particular, each segment in the media stream can be substantially synchronized to each segment in the second audio stream by comparing the respective durations of a segment from the first audio stream and a corresponding segment from the second audio stream and by adjusting the duration of the corresponding media segment based on the comparison. Thus, a single video stream may be played and substantially synchronized at run-time to any selected audio stream from the plurality of audio streams.
If, for example, the duration of the second audio segment is greater than the duration of the first audio segment, then additional frames can be added to the corresponding media segment. By adding one or more frames to the media segment, the duration of the media segment can be increased. One or more frames can be added to the media segment by causing the media segment to repeat (or copy) a few of its frames. Every Nth frame of the media segment can be repeated or copied to increase the duration of the media segment. If, for instance, the duration of the second audio segment is approximately ten percent greater than the duration of the first audio segment, then every tenth frame of the media segment can be repeated.
If, for example, the duration of the second audio segment is less than the duration of the first audio segment, then one or more frames from the media segment can be removed. By removing one or more frames from the media segment, the duration of the media segment can be decreased. Every Nth frame from the media segment can be deleted to decrease the duration of the media segment. If, for instance, the duration of the second audio segment is approximately twenty percent less than the duration of the first audio segment, then every twentieth frame from the media segment can be dropped.
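The two preceding paragraphs describe the same every-Nth-frame rule from both directions (repeating frames when the new audio is longer, dropping frames when it is shorter). A small arithmetic sketch, with an assumed helper name and rounding choice, follows.

```python
# Assumed helper: compute the N in "every Nth frame" from the two durations.

def nth_frame_interval(first_duration, second_duration):
    """Return N so that repeating (or dropping) every Nth frame offsets the change."""
    change = abs(second_duration - first_duration) / first_duration
    if change == 0:
        return None  # durations already match; no frames added or removed
    return max(1, round(1 / change))

# Second audio ~10% longer than the first: repeat roughly every 10th media frame.
print(nth_frame_interval(20.0, 22.0))  # -> 10
# Second audio ~10% shorter than the first: drop roughly every 10th media frame.
print(nth_frame_interval(20.0, 18.0))  # -> 10
```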
The media segment can be modified by adding or dropping frames at any time. For example, the media segment can be modified by a processor at run-time, such that the media segment includes copied or deleted frames. In this way, the media segment can be substantially synchronized with the audio segment at run-time (play-time). In another embodiment, frames can be added to or deleted from the media segment at development time, for example, using a processor. In this way, as the audio streams are processed in connection with the media segment, synchronization can be preserved by automatically modifying the media segment to compensate for any losses or gains in overall duration.
The media segment and first and second audio segments can be defined as segments using time-codes. The media and audio segments reflect a portion of a file, respectively (e.g., a portion of a video file, first audio file, second audio file). The media and first and second audio streams can be segmented with time-codes. The time-codes can define the segments by specifying where each segment begins and ends in the stream. In addition, markers may be inserted into the audio and media segments. These markers may be used to determine which segment is currently being processed. When a marker is processed, it can trigger an event. For example, at run-time (e.g., upon playback), if a marker is processed, an event can be fired.
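As a rough sketch of how regularly spaced markers could identify the segment currently being processed and trigger an event, the example below assumes ten markers per second and invented names for the segment boundaries and callback.

```python
# Assumed names and interval: ten markers per second, segment boundaries taken
# from the time-codes, and a callback that is fired when a marker is processed.
import bisect

MARKER_INTERVAL = 0.1                      # seconds between markers (10 per second)
segment_starts = [0.0, 4.2, 7.2, 12.9]     # begin time-codes of each segment

def on_marker(marker_index, fire_event):
    """Work out which segment the marker falls in, then fire an event."""
    position = marker_index * MARKER_INTERVAL
    segment = bisect.bisect_right(segment_starts, position) - 1
    fire_event({"segment": segment, "position": position})

on_marker(45, print)   # -> {'segment': 1, 'position': 4.5}
```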
Developer tools can be provided for creating a presentation that includes the synchronized media and audio streams. The developer tools can include a time-coder, which is used to associate closed-caption text with audio streams. The developer tools can include an electronic table having rows and columns, where the intersection of a respective column and row defines a cell. Cells in the table can be used to specify media, such as an audio file, time-code information, closed-captioning text, and any associated media or audio files. Any cells associated with the audio file cell can be used to specify the time-coding information or closed-captioning text. For example, a first cell in a column may specify the file name of an audio file, and time-code information associated with the audio file may be specified in the cells beneath the audio file cell, which are in the same column. The time-coding information may define the respective audio segments for the audio file. A cell that is adjacent to a cell with time-coding information that defines the audio segment can be used to specify media, such as closed-captioning text that should be presented when the audio segment is played. Further, the cells may also specify video segments (e.g., animations) that should be presented when the audio segment is played. In this way, video segments and closed-captioning text, and the relationships between them, may be specified using cells of a table. A developer, for instance, using the table can specify that a specific block of text (e.g., the closed-captioning text) should be displayed while an audio segment is being played. The use of cells in a table as a tool for developing the presentation facilitates a thought-by-thought (segment-by-segment) development process.
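A hypothetical, row-major rendering of such a table is sketched below; the file names, time-codes, and caption text are invented, and the column layout is only one way the described arrangement could look.

```python
# Invented example of the cell layout, stored row-major: the first row names the
# files; each later row is one "thought", pairing a time-coded audio segment with
# the caption block and video segment to present while that segment plays.
table = [
    ["narration_en.wma",   "English captions",                 "lesson.swf"],
    ["00:00.0-00:04.2",    "Welcome to the course.",           "animation segment 1"],
    ["00:04.2-00:07.2",    "Today we cover synchronization.",  "animation segment 2"],
    ["00:07.2-00:12.9",    "Let's begin with time-codes.",     "animation segment 3"],
]

for timecode, caption, media in table[1:]:
    print(f"{timecode}: show '{caption}' with {media}")
```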
The contents of the electronic table can be stored in an array. For example, an engine, such as a builder, can be used to process the contents of the electronic table and store the specified media and time-coding information into one or more arrays. The arrangement of the cells and their respective contents can be preserved in the cells of the arrays. The arrays can be accessed by, for example, a player, which processes the arrays to generate a presentation. The builder can generate an XML file that includes computer readable instructions that define portions of the presentation. The XML file can be processed by the player, in connection with the arrays, to generate the presentation.
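The sketch below suggests how a builder pass might walk the example table above and produce parallel arrays for the player; the output structure is an assumption, not the commercial builder's actual format.

```python
# Assumed builder pass: split the developer table into parallel arrays that a
# player can index segment by segment (output structure is illustrative only).

def build_arrays(table):
    header, rows = table[0], table[1:]
    return {
        "files":     header,
        "timecodes": [row[0] for row in rows],
        "captions":  [row[1] for row in rows],
        "media":     [row[2] for row in rows],
    }

# With the example table from the previous sketch:
# build_arrays(table)["captions"][0]  -> "Welcome to the course."
```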
By processing portions of media streams into segments, a presentation can be developed according to a thought-by-thought developmental approach. Each segment (e.g., thought) can be associated with respective audio segment, video segment and block of closed-captioning text. The audio segment and closed-captioning text can be revised and the synchronization of the audio, closed-caption text and video segment can be computationally maintained. The durational properties of the video segment can be modified by adding or dropping frames. In this way, a multiple audio language product can be developed and the synchronization of audio/visual content can be computationally maintained to whichever audio the viewer selects.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Consider the situation, for example, where a developer creates a presentation that includes a video stream that is time-coded to an English audio stream. Later, the developer wants to revise the presentation so that instead of having an English audio stream, it has a Vietnamese audio stream. In the past, a developer in this situation typically had to rework the video against the new Vietnamese audio in order to ensure that the video and the new Vietnamese audio are substantially synchronized. The developer would generally have been required to synchronize the video to the new Vietnamese audio even though the presentation was previously synchronized with the English audio stream. In accordance with particular embodiments of the invention, however, changes to the audio streams in a presentation can be made and the content of the presentation can still be substantially synchronized.
A presentation can be developed that has a plurality of different audio streams that can be selected. One audio stream, the “first audio stream” can reflect an initial (draft) version of the audio. Alternatively, the first audio stream can be directed to a specific language, such as English. Another audio stream, the “second audio stream” can reflect a final version of the first audio stream. Alternatively, the second audio stream can be directed to a different language, such as Vietnamese. For example, the first audio stream can correspond to an English version and the second audio stream can correspond to the Vietnamese version.
A video stream can be substantially synchronized to a first audio stream; however, this can be difficult because it typically must be done manually. Once the media stream and the first audio stream are substantially synchronized, the media stream can be automatically synchronized to whichever audio stream a viewer may select.
The first audio stream can be partitioned into logical segments (such as thoughts, phrases, sentences, or paragraphs). The logical segments can be specified by, for example, time-codes, which can also be used to assign closed-caption script to each audio segment.
A second audio stream (such as a second language) can be created and easily partitioned into logical segments that have a one-to-one correspondence to, but different durations than, the logical segments of the first audio stream. (If this were a different language, one might add closed-caption script in the new language.) It is desirable that the video be substantially synchronized with the second audio. The invention does this automatically, without difficulty. Once the video stream has been synchronized to the first audio stream and the first audio stream has been partitioned into logical segments, the video stream can be automatically synchronized to any other audio streams that have been partitioned into corresponding logical segments.
At run-time, for example, the video stream can be substantially synchronized to another audio stream. This can be accomplished by comparing the duration of the first audio stream with the second audio stream, and adjusting the duration of the video stream based on this comparison. In particular, the duration of a segment in the first audio stream is compared with the duration of a corresponding segment in the second audio stream. If the duration of the first segment is greater than the duration of the second segment, then frames from the media stream are dropped at regular intervals. If the duration of the first segment is less than the duration of the second segment, then frames in the media stream are repeated at regular intervals.
Closed-caption text can be time-coded to the audio at development time.
In this example, file 130 could contain both the video and the original audio to which the video was already synchronized. However, because current animation players cannot play the animation while muting the audio, at development time the English audio might be stripped out, leaving only the video. The English audio (if needed) could be provided in a separate file (not shown).
A developer can use the time-coder 140 to partition the second audio file into segments. The segments can correspond to thoughts, sentences, or paragraphs. For example, the media content can include the audio file 110, closed-captioned text 120, 122, and video file 130. To process the second audio file into segments, a developer can select one of the cells 110-1, 110-2, . . . , 110-5 under the audio file cell 110, and then start 140-1 and stop 140-2 the audio file 110 to define the audio segment.
For example, cell 110-4 is a selected cell. The time-coder controls 140 can be used to indicate time-coding information that associates the closed-captioned text with the audio file 110 in the selected cell 110-4.
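A minimal sketch of such a start/stop time-coder is shown below; the class name, the audio-position callback, and the idea of returning the captured segment for the selected cell are all assumptions made for illustration.

```python
# Assumed shape of a start/stop time-coder: it reads the playing audio's current
# position through a callback and records one (begin, end) pair per selected cell.

class TimeCoder:
    def __init__(self, get_audio_position):
        self.get_audio_position = get_audio_position   # returns seconds into the audio
        self.segments = []
        self._begin = None

    def start(self):
        """Mark the beginning of the segment (e.g., a control such as 140-1)."""
        self._begin = self.get_audio_position()

    def stop(self):
        """Mark the end of the segment (e.g., a control such as 140-2) and record it."""
        segment = (self._begin, self.get_audio_position())
        self.segments.append(segment)
        return segment   # begin/end time-codes written into the selected cell
```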
Referring to
Closed-caption text 120-1, 120-2, 120-3 is associated with audio segments 110-1, 110-2, 110-3, respectively. For example, the content of audio segment 110-1 can correspond to the sentence in closed-captioned text cell 120-1. In addition, a specific column in the table 105 can be associated with closed-captioned text of a particular language. In the example shown in
The animation 130 is processed into segments 130-1, . . . , 130-5. The segments 130-1, . . . , 130-5 correspond to other media segments. For instance, animation segment 130-1 corresponds to blocks of closed-captioned text 120-1, 122-1 and to audio segment 110-1. In one embodiment, each segment corresponds to a thought or sentence in the presentation. In another embodiment, each segment corresponds to a unit of time in the media file.
By processing the animation 130 and the audio 110 into segments and by providing the closed-captioned text 120, 122 as blocks of text, each audio segment 110-1 can be associated with a respective media segment or segments, such as the animation segment 130-1 and the block of closed-caption text 120-1 or 122-1. As discussed in more detail below, processing the audio and visual media into segments facilitates the synchronization process.
Before the process 200 can be invoked, the second audio is processed into segments. Each of the second audio segments corresponds to a respective video segment. When the second audio is processed into segments, the durational properties of each segment are determined. At 205, the durational properties of the first audio segments and the second audio segments are processed and stored into respective arrays. At 210, the durational properties of the first and second audio segments are accessed from their respective arrays. At 215, the data from the arrays is used to generate thought nodes on the animation control/status bar.
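Generating thought nodes whose spacing is proportional to segment running time (consistent with the proportional timeline described earlier) could look roughly like the following; the function name and percentage representation are assumptions.

```python
# Assumed helper: place thought nodes so that each timeline segment's width is
# proportional to its running time, expressed as a percentage of the full bar.

def thought_node_positions(segment_durations):
    total = sum(segment_durations)
    positions, elapsed = [], 0.0
    for duration in segment_durations:
        elapsed += duration
        positions.append(round(100 * elapsed / total, 1))
    return positions

print(thought_node_positions([4.2, 3.0, 5.7]))   # -> [32.6, 55.8, 100.0]
```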
A depiction of an animation control/status bar 400 is shown in
Referring back to
In general, the skipping or repetition of an occasional video frame is not noticeable to viewers. Typically, the standard frame rate in Flash animations is 12 frames per second (fps); depending on the format, film runs at 24 fps, television at 29.97 fps, and some three-dimensional games at 62 fps. If the process 200 causes certain video frames to be dropped, the human eye, accustomed to motion blur, would not notice a considerable difference in the smoothness of the animation. Similarly, when 12 or more frames are played in a second and some of those frames are repeated, the repeated frames are substantially unapparent because the repetition occurs in a mere fraction of a second.
In one embodiment, when the video and audio files are processed into segments, each audio segment corresponds to a spoken sentence reflected in the audio file. The process 200 works particularly well when the sentence structures in the first language and the second language are similar. If the sentence structure of the second language used in the audio is similar to the first language, even if the sentences are substantially longer or shorter, then the process 200 can produce automatic synchronization. This is the case, for example, with Vietnamese and English.
If the sentence structure of the second language is different from the first language, the synchronization may not be seamless for every word; however, synchronization is maintained across sentences. The resultant synchronization is adequate for many applications. If necessary, the video for certain sentences could be reworked manually, taking advantage of the automatic synchronization for the remainder of the sentences (e.g., segments).
In general, a presentation is developed that includes media, such as an animation and several audio tracks. Any of these audio tracks can be played with the presentation. Although the animation is initially synchronized to a first audio track at development time, at run-time the animation can be substantially synchronized to a second audio track. The animation and the first audio file are time-coded and processed into corresponding segments. The second audio file is also processed into corresponding segments. The time-coding information associated with the video and first audio streams, and the durational properties associated with the second audio stream, are stored in an XML file associated with the presentation.
The first and second audio tracks are processed with Microsoft's Windows Media command line encoder, producing a new .wma audio file for each. Microsoft's asfchop.exe can be used to insert hidden markers at regular intervals into the newly encoded audio file (10 markers per second, for example). At run-time, the marker events are fired at a rate of 10 times per second. A handler that is responsive to a marker event communicates with the player, in order to ensure that the video file is substantially synchronized with the second audio file. This process is discussed in more detail below, in reference to
As described in
The handler is responsive to the MarkerHit event, and in communication with the player. The player determines (i) the time value of the current position of the second audio track (“Current Audio Thought Value”), (ii) animation frame rate, e.g. 15 frames per second, (“Animation Frame Rate”), (iii) overall duration of first audio file and its current segment compared with the overall duration of the second audio file and its current segment (“Current Thought Dual Media Ratio”), (iv) current marker that triggered the MarkerHit event (“Current Marker”), and (v) the frame number (“n”). These values are processed using the following formula to substantially synchronize the animation with the second audio track.
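The synchronization formula itself is not reproduced in this text. The sketch below shows one plausible way the named values could be combined to pick a target animation frame; it is an assumption offered for illustration, not the formula referred to above.

```python
# Assumption: one plausible combination of the values named above. The ratio is
# taken as (first-audio segment duration / second-audio segment duration), so a
# longer replacement segment advances the animation more slowly.

def target_animation_frame(current_audio_value, segment_audio_start,
                           segment_frame_start, dual_media_ratio, frame_rate):
    """Estimate the animation frame "n" for the current second-audio position."""
    seconds_into_segment = current_audio_value - segment_audio_start
    return segment_frame_start + round(seconds_into_segment * dual_media_ratio * frame_rate)

# 2 s into a segment whose new audio runs 25% longer than the original (ratio 0.8),
# with the segment starting at frame 60 and a 15 fps animation:
print(target_animation_frame(6.0, 4.0, 60, 0.8, 15))   # -> 84
```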
The animation control/status bar is also updated. The following formula is used to update the animation control/status bar.
((CurrentMarkerIn)/AudioFileDuration)*100
It should be noted that in the event that marker frequency is less than the animation frame rate, a secondary algorithm can be invoked to aesthetically “smooth” the progress of the Animation Control/Status bar.
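Reading CurrentMarkerIn as the time value of the current marker (an assumption), the bar update and a simple linear smoothing pass might look like the following; the source only states that a secondary smoothing algorithm can be invoked, so the interpolation shown is illustrative.

```python
# Formula from above, with CurrentMarkerIn read as the current marker's time value
# (assumption), plus an illustrative linear smoothing step for low marker rates.

def progress_percent(current_marker_time, audio_file_duration):
    """((CurrentMarker time) / AudioFileDuration) * 100"""
    return (current_marker_time / audio_file_duration) * 100

def smoothed_progress(last_percent, target_percent, frames_until_next_marker):
    """Advance the bar a little each frame instead of jumping at every marker."""
    step = (target_percent - last_percent) / max(1, frames_until_next_marker)
    return last_percent + step

pct = progress_percent(45.0, 180.0)      # -> 25.0
print(smoothed_progress(24.5, pct, 5))   # -> 24.6 (one-fifth of the way to 25.0)
```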
At 275, synchronization is maintained. Thus, the time-coding process 250 allows the designer to generate two or more sets of time-codes for the same animation. This allows for the support of several language tracks for a single animation/video.
Embodiments of the invention are commercially available, such as the Automatic e-Learning Builder™, from Automatic e-Learning, LLC of St. Marys, Kans.
It will be apparent to those of ordinary skill in the art that methods involved herein can be embodied in a computer program product that includes a computer usable medium. For example, such a computer usable medium can include a readable memory device, such as a hard drive device, a CD-ROM, a DVD-ROM, or a computer diskette, having computer readable program code segments stored thereon. The computer readable medium can also include a communications or transmission medium, such as a bus or a communications link, either optical, wired, or wireless, having program code segments carried thereon as digital or analog data signals.
It will further be apparent to those of ordinary skill in the art that, as used herein, “presentation” can be broadly construed to mean any electronic simulation with text, audio, animation, video or media.
In addition, it will be further apparent to those of ordinary skill that, as used herein, "synchronized" can be broadly construed to mean any matching or correspondence. In addition, it should be understood that the video can be synchronized to the audio, or the audio can be synchronized to the video.
While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.
This application claims the benefit of U.S. Provisional Application No. 60/530,457, filed on Dec. 17, 2003, and is a Continuation-in-Part of U.S. patent application Ser. Nos. 10/287,441, filed Nov. 1, 2002, 10/287,464, filed Nov. 1, 2002, and 10/287,468, filed Nov. 1, 2002, all of which claim priority to Provisional Patent Application Nos. 60/334,714, filed Nov. 1, 2001, and 60/400,606, filed Aug. 1, 2002.
Provisional Applications:

Number | Date | Country
---|---|---
60/530,457 | Dec. 2003 | US
60/334,714 | Nov. 2001 | US
60/400,606 | Aug. 2002 | US
Parent/Child Application Data:

Relation | Number | Date | Country
---|---|---|---
Parent | 10/287,441 | Nov. 2002 | US
Child | 11/016,552 | Dec. 2004 | US
Parent | 10/287,464 | Nov. 2002 | US
Child | 11/016,552 | Dec. 2004 | US
Parent | 10/287,468 | Nov. 2002 | US
Child | 11/016,552 | Dec. 2004 | US