The present invention relates to the field of generating multimedia content and in particular to a system and method for generating multimedia content from a text or audio input.
The present field of multimedia content requires a developer to act almost like a producer, in that the developer must first develop a text, convert the text to speech, determine what visual content is to be added, and then adjust the resultant output so as to fit a predetermined time slot. Such a process is labor intensive, and is thus not generally economical.
In the area of news information, wherein the facts and story line are constantly changing, text based information remains the leading source. A certain amount of multi-media content is sometimes added, usually by providing a single fixed image, or by providing some video of the subject matter. Unfortunately, in the ever changing landscape of news development, resources to properly develop a full multi-media presentation are rarely available.
A number of sources are arranged to push news information to registered clients, thus keeping them up to date regarding pre-selected areas of interest. The vast majority of these sources are text based, and are not provided with multi-media information.
Wibbitz, of Tel Aviv, Israel, provides a text-to-video platform as a software engine which matches a visual representation for the text, adds a computer generated voice-over narration and generates a multi-media video responsive to the provided text. Unfortunately, the computer generated voice-over narration is often unnatural. Additionally, the tool provided is primarily for publishers, requiring a text input, and is not suitable for use with an audio input.
Accordingly, it is a principal object of the present invention to overcome at least some of the disadvantages of prior art methods of multi-media content generation. Certain embodiments provide for a system arranged to generate multimedia content, the system comprising: a textual input module arranged to receive a textual input; an audio input module arranged to receive an audio input, wherein the received audio input is a human generated audio and the textual input is a textual representation of the human generated audio; a contextual analysis module in communication with the textual input module and arranged to extract metadata from the received textual input; a media asset collection module arranged to retrieve a plurality of media assets responsive to the metadata of the received textual input; an alignment module in communication with the audio input module and the textual input module, the alignment module arranged to determine time markers in the received audio input for predetermined words in the received textual input; a video creation module arranged to create a video clip responsive to the received audio input, the determined time markers and the retrieved plurality of media assets of the media asset collection module; and an output module arranged to output the created video clip.
Additional features and advantages of the invention will become apparent from the following drawings and description.
For a better understanding of the invention and to show how the same may be carried into effect, reference will now be made, purely by way of example, to the accompanying drawings in which like numerals designate corresponding elements or sections throughout.
With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice. In the accompanying drawings:
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.
Textual input module 20 is in communication with at least one data provider and/or database and optionally with optional summarization module 100. Contextual analysis module 30 is in communication with textual input module 20, optionally via summarization module 100, and with media asset collection module 40. In the event that summarization module 100 is not provided, contextual analysis module 30 is in communication with textual input module 20. Media asset collection module 40 is further in communication with filtering module 50. In one embodiment, media asset collection module 40 communicates with the one or more media asset databases. In one embodiment, the communication is over the Internet. Filtering module 50 is further in communication with video creation module 80 and optionally with alignment module 70. Audio input module 60 is in communication with alignment module 70, video creation module 80 and interim output module 105. Alignment module 70 is further in communication with textual input module 20 and optionally in communication with summarization module 100. Video creation module 80 is further in communication with alignment module 70, template storage 90 and output module 110. Interim output module 105 is in communication with narration station 107 and memory 115 is in communication with output module 110. Template storage 90 is further in communication with contextual analysis module 30.
Each of textual input module 20; contextual analysis module 30; media asset collection module 40; filtering module 50; audio input module 60; alignment module 70; video creation module 80; template storage 90; optional summarization module 100; interim output module 105; narration station 107; and output module 110 may be constituted of special purpose hardware, or may be a general computing device programmed to provide the functionality described herein. The instructions for the general computing device may be stored on a portion of memory 115 without limitation.
The operation of system 10 will now be described by the high level flow chart of
In one embodiment, in the event the length of the received textual input exceeds a predetermined value, optional summarization module 100 is arranged to summarize the received input. In one non-limiting embodiment, the received textual input is summarized to contain about 160 words. In another embodiment, the received textual input is summarized to contain about 15 words. In one embodiment, the summarization is responsive to a text summarization technique known to those skilled in the art, and thus in the interest of brevity will not be further detailed.
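By way of non-limiting illustration only, a minimal extractive sketch of such a summarization is given below in Python; the function name, the sentence-splitting heuristic and the word budget are illustrative assumptions and do not limit the summarization technique employed.

```python
import re

def summarize_extractive(text: str, word_budget: int = 160) -> str:
    """Illustrative summarization stand-in: keep leading sentences until
    roughly `word_budget` words have been accumulated."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    kept, count = [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if kept and count + words > word_budget:
            break
        kept.append(sentence)
        count += words
    return " ".join(kept)
```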
In one embodiment, as will be described below, a plurality of textual inputs are received, such as a plurality of news articles, optionally from a plurality of news providers. In another embodiment, as will be described below, a plurality of textual inputs are received and textual input module 20 is arranged to: identify a set of textual inputs which are related to the same topic; and create a single textual record from the related set of textual inputs.
In stage 1010, the textual input of stage 1000 is analyzed by contextual analysis module 30. In the embodiment where the input of stage 1000 is summarized, the summarized input is analyzed by contextual analysis module 30. In one embodiment, the analysis is performed by Natural Language Processing (NLP).
Contextual analysis module 30 is arranged to extract metadata from the analyzed textual input. In one embodiment, the extracted metadata comprises at least one entity, such as one or more persons, locations, events, companies or speech quotes. In another embodiment, the extracted metadata further comprises values for one or more of the extracted entities. In another embodiment, the extracted metadata further comprises relationships between extracted entities. For example, a relationship is determined between a person and a company, the relationship being that the person is an employee of the company. In another embodiment, the extracted metadata comprises social tags arranged to provide general topics related to the analyzed textual input. Examples of social tags can include, without limitation: manmade disaster; gastronomy; television series; and technology news. In one embodiment, the social tags are created responsive to the analysis of the textual input. In another embodiment, the metadata further comprises extracted information such as the date and time of publication, the author and the title.
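Purely as an illustrative, non-limiting example, entity-style metadata of the kind described could be extracted with an off-the-shelf NLP library; the sketch below assumes the spaCy library and its standard English model, and the mapping of entity labels to persons, locations and companies is an illustrative choice rather than a requirement of the system.

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_metadata(text: str) -> dict:
    """Illustrative contextual analysis: collect person, location and
    company entities from the textual input."""
    doc = nlp(text)
    metadata = {"persons": set(), "locations": set(), "companies": set()}
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            metadata["persons"].add(ent.text)
        elif ent.label_ in ("GPE", "LOC"):
            metadata["locations"].add(ent.text)
        elif ent.label_ == "ORG":
            metadata["companies"].add(ent.text)
    return metadata
```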
In stage 1020, media asset collection module 40 is arranged to retrieve a plurality of media assets from one or more media asset databases, responsive to the extracted metadata of stage 1010. In one embodiment, the retrieved media assets comprise, without limitation, one or more of: images, such as editorial images or created images; videos, such as editorial videos or created videos; and audio portions, such as music and sound effects. In one embodiment, the media assets are selected by comparing the extracted metadata of stage 1010 to the metadata of the media assets. In one embodiment, media asset collection module 40 is further arranged to compare the extracted metadata of stage 1010 to properties of the video templates and template components stored on template storage 90 and select a particular video template and particular template components responsive to the comparison. In the event that one or more media assets were extracted by textual input module 20 in stage 1000, the one or more media assets are added to the media assets retrieved by media asset collection module 40 as potential media assets. In another embodiment, the plurality of media assets are retrieved responsive to the length of the input of stage 1000. For example, a larger number of media assets are retrieved for a longer textual input than for a shorter textual input.
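As a non-limiting sketch, the comparison of the extracted metadata to the metadata of the media assets could be implemented as a simple term-overlap ranking; the catalog structure, the "tags" field and the maximum asset count below are illustrative assumptions.

```python
def retrieve_assets(extracted_terms: set, asset_catalog: list, max_assets: int = 10) -> list:
    """Illustrative media asset collection: rank catalog entries by the
    overlap between extracted metadata terms and each asset's own tags."""
    def overlap(asset: dict) -> int:
        return len(extracted_terms & set(asset.get("tags", [])))
    ranked = sorted(asset_catalog, key=overlap, reverse=True)
    return [asset for asset in ranked if overlap(asset) > 0][:max_assets]

# Example: assets tagged with entities extracted from the textual input.
assets = retrieve_assets({"weather", "london"},
                         [{"tags": ["london", "skyline"], "path": "img1.jpg"},
                          {"tags": ["finance"], "path": "img2.jpg"}])
```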
In stage 1030, interim output module 105 is arranged to output the received textual input of stage 1000 to narration station 107 and in the embodiment where the textual input is summarized, interim output module 105 is arranged to output the summarized text to narration station 107. The received text is then narrated by a narrator, preferably a human narrator, the narration being received by narration station 107 and transmitted to interim output module 105 as a voice articulated record. The voice articulated record is then fed to audio input module 60.
Contextual analysis, and search results based thereon, although generally accurate, can exhibit some degree of error and ambiguity. In addition, such techniques do not weight human comprehension, opinion or emotion. The retrieved media assets may therefore contain irrelevant or inaccurate media assets with respect to the input, or summarized input, of stage 1000. In one embodiment, interim output module 105 is further arranged to output the retrieved media assets of stage 1020 to narration station 107. A user associated with narration station 107, preferably the human narrator, is able to delete any of the received media assets responsive to a user input. Narration station 107 thus allows a user to delete irrelevant or inaccurate media assets. Additionally, in one embodiment, narration station 107 is arranged to rank the received media assets in order of relevancy, responsive to a user input. Narration station 107 is arranged to output the adjusted set of media assets to filtering module 50 via interim output module 105. In the embodiment where a video template and template components were selected in stage 1020, narration station 107 is arranged to change the selection of the video template and/or template components responsive to a user input, and the adjustments are output to video creation module 80 via interim output module 105.
In stage 1040, alignment module 70 is arranged to determine time markers in the voice articulated record of stage 1030 for predetermined words in the received textual input of stage 1000. Each time marker represents the point in the voice articulated record in which a particular portion of the text begins. In one embodiment, a time marker is determined for each word in the text. In one embodiment, the time markers are determined responsive to a forced alignment algorithm.
In stage 1050, filtering module 50 is arranged to select a set of media assets from the retrieved plurality of media assets of stage 1020, or the adjusted set of media assets of stage 1030, and the optionally extracted media assets of stage 1000. In one embodiment, the selection is performed responsive to the analysis of stage 1010. In another embodiment, the selection is performed responsive to the length of the input of stage 1000, or the length of the summarized input of stage 1000. In another embodiment, the selection is performed responsive to the length of the narration of stage 1030. In one embodiment, the media assets are selected responsive to the relevancy of the media assets to the text and responsive to the length of the text such that appropriate media assets are selected for the particular length. In the embodiment where in stage 1030 the media assets were ranked according to relevancy, the media assets are further selected responsive to the rankings. In one embodiment, the media assets are further selected responsive to the determined time markers of stage 1040 such that an appropriate media asset is selected for each portion of text associated with the respective time marker. Thus, advantageously, the media assets are selected responsive to the speed of speech of the voice articulated record. In particular, the media assets are thus selected responsive to the actual narration of the voice articulated record of stage 1030, which is preferably a human articulated voice.
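By way of non-limiting illustration only, the selection of an appropriate media asset for each text portion could proceed as sketched below, where each segment carries the time markers of stage 1040 and the assets are assumed to be pre-ranked by relevancy; the segment and asset data structures are illustrative assumptions.

```python
def select_assets_for_segments(segments, ranked_assets):
    """Illustrative filtering: assign one media asset to each narration
    segment (start_ms, end_ms, text), preferring assets whose tags overlap
    the segment text and not reusing an asset."""
    selection, used = [], set()
    for start_ms, end_ms, text in segments:
        words = set(text.lower().split())
        best = None
        for rank, asset in enumerate(ranked_assets):
            if rank in used:
                continue
            score = len(words & {t.lower() for t in asset.get("tags", [])})
            if best is None or score > best[0]:
                best = (score, rank, asset)
        if best is not None:
            used.add(best[1])
            selection.append({"start": start_ms, "end": end_ms, "asset": best[2]})
    return selection
```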
In stage 1060, video creation module 80 is arranged to create a video clip responsive to: the received voice articulated record of stage 1030 or the received audio of stage 1000; the determined time markers of stage 1040; and the selected set of media assets of stage 1050. Each media asset is inserted into the video clip at a particular time marker and in one embodiment the media assets are edited responsive to the optionally selected video template and template components of stage 1020. In the event that the selected video template and template components were adjusted in stage 1030, the media assets are edited responsive to the adjusted video template and template components. Advantageously, editing the media assets responsive to the video template and template components provides a video clip which is more accurately correlated with the textual input.
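In one purely illustrative, non-limiting sketch, the insertion of media assets at their respective time markers over the narration could be implemented with a general purpose video library; the example below assumes the moviepy 1.x API, still-image assets and hypothetical file paths, none of which are requirements of the described video creation module.

```python
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

def build_clip(selection, narration_path, out_path="clip.mp4"):
    """Illustrative video creation: show each selected asset for the span
    between its time marker and the next, with the narration underneath."""
    narration = AudioFileClip(narration_path)
    parts = [ImageClip(item["asset"]["path"])
             .set_duration((item["end"] - item["start"]) / 1000.0)
             for item in selection]
    video = concatenate_videoclips(parts, method="compose")
    video = video.set_audio(narration)
    video.write_videofile(out_path, fps=24)
```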
In stage 1070, the created video clip of stage 1060 is output by output module 110. In one embodiment, the created video clip is output to be displayed on a user display. The user display may be associated with a computer, cellular telephone or other computing device arranged to receive the output of output module 110 over a network, such as the Internet, without limitation.
In one embodiment, memory 115 has stored thereon user parameters associated with a plurality of users. Output module 110 is arranged to output the created video clip to a display responsive to the stored user parameters. In particular, output module 110 is in communication with a plurality of user systems, each user system associated with a particular user and comprising a display. Output module 110 is arranged to output the video clip to one or more of the plurality of user systems responsive to the stored user parameters. In one further embodiment, the stored user parameters comprise one or more video clip topics requested by each user system and output module 110 is arranged to output the created video clip to any of the user systems associated with the topic of the created video clip. For example, the created video clip is about the weather and output module 110 is arranged to output the created weather video clip to all of the user systems which have requested video clips about the weather. Thus, a user is presented with a personalized video clip channel.
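As a non-limiting sketch, the topic-based distribution of a created video clip according to stored user parameters could be as simple as the following; the parameter structure standing in for memory 115 is an illustrative assumption.

```python
def recipients_for_clip(clip_topic: str, user_parameters: dict) -> list:
    """Illustrative output routing: return the user systems whose stored
    parameters include the topic of the created video clip."""
    return [user_id for user_id, topics in user_parameters.items()
            if clip_topic in topics]

# Example: a weather clip is pushed only to users who requested weather clips.
targets = recipients_for_clip("weather",
                              {"user-1": {"weather", "sports"},
                               "user-2": {"finance"}})
```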
In one embodiment, a video clip is created for each of a plurality of textual records and for each textual record a video clip is further created for a summarized version of the particular textual record. In one embodiment, the video clips of the summarized versions of the textual records are output by output module 110 to a user display and responsive to a user input, such as a gesture on a portion of a touch screen associated with a particular display of a video clip, the associated video clip of the textual record is displayed. In another embodiment, responsive to the user input the textual record is displayed. In another embodiment, a link to the associated textual record is stored on a memory to be later viewed. In one embodiment, a single video clip is created for a plurality of the summarized textual records and a textual record, or a video clip representation thereof, is displayed responsive to a user input at a particular point in the single video clip. In one non-limiting embodiment, the textual record is a news article and the summarized version is a headline associated with the news article. In one particular embodiment, a plurality of news articles are received from a news publisher and a video clip is created for a series of headlines. As described above, in one embodiment responsive to a user input during a particular temporal point in the news headline video clip where a particular news headline is displayed, the full news article, or a video clip thereof, is displayed. In another particular embodiment, a plurality of news articles are received from a plurality of news publishers and a video clip is created for a plurality of news headlines, as described above. In one embodiment, textual input module 20 is further arranged to select particular news articles from the plurality of received news articles, optionally responsive to user information stored on memory 115, as described above. In one embodiment, the particular articles are selected responsive to areas of interest of a user and/or preferred news providers, whereby the user is presented with a video clip of preferred news headlines. In another non-limiting embodiment, the textual record is a description of tourist properties of a particular geographical location.
In one embodiment, a video clip is created for a plurality of summarized textual records, each associated with the respective complete textual record. In one embodiment, the summarized textual records are search results of a search engine. The video clips of the summarized textual records are output by output module 110 to a user display and responsive to a user input, such as a gesture on a portion of a touch screen associated with a particular display of a video clip, the associated complete textual record, or other information associated therewith, is displayed on the user display.
In one embodiment, information regarding the displayed video clips is stored on memory 115 and output module 110 is arranged to output video clips responsive to the information stored on memory 115 such that a video clip is not displayed twice to the same user. In another embodiment, memory 115 has stored thereon user parameters associated with a plurality of users and output module 110 is arranged to output video clips responsive to the parameters associated with the user viewing the video clips. In one further embodiment, the source of the textual inputs is selected responsive to the information stored on memory 115.
In one embodiment, output module 110 is arranged to replace the output video clip with another video clip responsive to a user input on a user display displaying a video clip. In another embodiment, output module 110 is arranged to adjust the speed of display of the output video clip responsive to a user input on a user display displaying the video clip. In another embodiment, output module 110 is arranged to adjust the point in the output video clip currently being displayed, responsive to a user input on a user display displaying the video clip.
In one embodiment, as described above, a plurality of textual inputs are received by textual input module 20 and textual input module 20 is arranged to: identify a set of textual inputs which are related to the same topic; and create a single textual record from the related set of textual inputs. As described above in relation to stages 1010-1060, a video clip is then created for the single textual record. In one embodiment, one or more portions of each selected textual input is selected, the single textual record being created from the plurality of selected portions. In one embodiment, the plurality of textual inputs are news articles and a set of news articles are selected, each of the selected news articles relating to the same news item. In one further embodiment, portions of each news article are selected and a single news article is created from the selected portions. In one yet further embodiment, each of the selected portions of the news articles relate to a different aspect of the particular news item.
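By way of non-limiting illustration only, the identification of textual inputs relating to the same topic could rely on a standard document-similarity measure; the sketch below assumes scikit-learn's TF-IDF vectorizer, and the similarity threshold and greedy grouping strategy are illustrative choices rather than part of the described system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def group_related(articles, threshold=0.3):
    """Illustrative topic grouping: greedily cluster article indices whose
    pairwise TF-IDF cosine similarity exceeds the threshold."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)
    similarity = cosine_similarity(vectors)
    groups, assigned = [], set()
    for i in range(len(articles)):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, len(articles))
                       if j not in assigned and similarity[i, j] >= threshold]
        assigned.update(group)
        groups.append(group)
    return groups
```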
In optional stage 1080, output module 110 is arranged to output a plurality of video clips responsive to a plurality of received textual inputs of stage 1000, each of the plurality of video clips related to a different topic. In particular, stages 1000-1060 as described above are repeated for a plurality of textual inputs, or a plurality of sets of textual inputs, to create a plurality of video clips. As described above, each video clip is created responsive to at least one textual input. In one embodiment, at least one of the plurality of textual inputs is used for more than one video clip. Optionally, each video clip is created responsive to a plurality of textual inputs received from a plurality of sources. Contextual analysis module 30 is arranged to determine which topics relate to each textual input and a textual input relating to a plurality of topics is used for creating a plurality of video clips. In one embodiment, the textual input and associated topic tags are stored on memory 115 to be later used for creating another video clip.
In one further embodiment, a plurality of video clips are output to each of a plurality of user systems responsive to user parameters stored on memory 115, as described above in relation to stage 1070. Thus, each user is provided with their own video clip channel constantly providing updated video clips relating to the topics desired by the user.
Each of audio input module 210; speech to text converter 220; textual input module 230; contextual analysis module 30; media asset collection module 40; filtering module 50; alignment module 70; video creation module 80; template storage 90; and optional summarization module 100 may be constituted of special purpose hardware, or may be a general computing device programmed to provide the functionality described herein. The instructions for the general computing device may be stored on a portion of memory 115 without limitation.
The operation of system 200 will now be described by the high level flow chart of
In one embodiment, audio input module 210 is further arranged to receive a textual input comprising a textual representation of the received audio. In another embodiment, optional speech to text converter 220 is arranged to convert the received audio into a textual representation of the received audio. The received textual representation, or the converted textual representation, is output to textual input module 230. In one embodiment, in the event the length of the received input exceeds a predetermined value, optional summarization module 100 is arranged to summarize the textual representation of the received audio input, as described above in relation to stage 1000 of
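Purely as an illustrative, non-limiting example, the conversion of the received audio into a textual representation could use any available speech-to-text service; the sketch below assumes the SpeechRecognition Python package and its binding to Google's free web recognizer, both of which are illustrative choices only.

```python
import speech_recognition as sr

def transcribe(audio_path: str) -> str:
    """Illustrative speech-to-text conversion of a received audio file
    (expects a WAV, AIFF or FLAC file)."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)
```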
In stage 2010, the textual representation of the audio input of stage 2000 is analyzed by contextual analysis module 30 and metadata is extracted, as described above in relation to stage 1010. In the embodiment where the input of stage 2000 is summarized, the summarized input is analyzed by contextual analysis module 30.
In stage 2020, as described above in relation to stage 1020, media asset collection module 40 is arranged to retrieve a plurality of media assets from one or more media asset databases, responsive to the extracted metadata of stage 2010. In one embodiment, as described above, media asset collection module 40 is further arranged to compare the extracted metadata of stage 2010 to properties of the video templates and template components stored on template storage 90 and select a particular video template and particular template components responsive to the comparison. In one embodiment, as described above in relation to stage 1030, the retrieved media assets are output to a narration station, such as narration station 107 of system 10, and are adjusted responsive to a user input.
In stage 2030, alignment module 70 is arranged to determine time markers in the audio input of stage 2000 for predetermined words in the textual representation of the audio input, as described above in relation to stage 1040.
In stage 2040, as described above in relation to stage 1050, filtering module 50 is arranged to select a set of media assets from the retrieved plurality of media assets of stage 2020, or the adjusted set of media assets, responsive to the analysis of stage 2010. As indicated above, an adjusted set of media assets may be supplied responsive to a narration station 107 as described above in relation to system 10. As described above, in one embodiment the selection is further performed responsive to the determined time markers of stage 2030.
In stage 2050, as described above in relation to stage 1060, video creation module 80 is arranged to create a video clip responsive to: the received audio of stage 2000; the determined time markers of stage 2030; and the selected set of media assets of stage 2040. Each media asset is inserted into the video clip at a particular time marker and in one embodiment the media assets are edited responsive to the optionally selected video template and template components of stage 2020. In stage 2060, the created video clip is output by output module 110. In one embodiment, the created video clip is output to be displayed on a user display. In another embodiment, the created video clip is output to a data provider to be later displayed on a user display.
As described above, in one embodiment information regarding the displayed video clips is stored on memory 115 and output module 110 is arranged to output video clips responsive to information stored on memory 115 such that a video clip is not displayed twice to the same user. In another embodiment, memory 115 has stored thereon information regarding a plurality of users and output module 110 is arranged to output video clips responsive to the information associated with the user viewing the video clips. In one further embodiment, the source of the audio inputs is adjusted responsive to the information stored on memory 115. In one embodiment, output module 110 is arranged to replace the output video clip with another video clip responsive to a user input on a user display displaying the output video clip. In another embodiment, output module 110 is arranged to adjust the speed of display of the output video clip responsive to a user input on a user display displaying the output video clip.
Each of textual input module 310; audio input module 320; contextual analysis module 30; media asset collection module 40; alignment module 70; video creation module 80; and output module 110 may be constituted of special purpose hardware, or may be a general computing device programmed to provide the functionality described herein. The instructions for the general computing device may be stored on a portion of a memory (not shown), without limitation.
The operation of system 300 will now be described by the high level flow chart of
In stage 3010, the textual input of stage 3000 is analyzed by contextual analysis module 30 and metadata is extracted, as described above in relation to stage 1010. In stage 3020, as described above in relation to stage 1020, media asset collection module 40 is arranged to retrieve a plurality of media assets from one or more media asset databases, responsive to the extracted metadata of stage 3010.
In stage 3030, alignment module 70 is arranged to determine time markers in the audio input of stage 3000 for predetermined words in the textual input of stage 3000, as described above in relation to stage 1040.
In stage 3040, as described above in relation to stage 1060, video creation module 80 is arranged to create a video clip responsive to: the received audio input of stage 3000; the determined time markers of stage 3030; and the retrieved media assets of stage 3020. Each media asset is inserted into the video clip at a particular time marker and in one embodiment the media assets are edited responsive to predetermined editing rules, as described above. In stage 3050, the created video clip is output by output module 110. In one embodiment, the created video clip is output to be displayed on a user display. In another embodiment, the created video clip is output to a data provider to be later displayed on a user display.
In prior art methods of synchronizing text and human generated audio, speech to text engines are utilized. In particular, a portion of the audio signal containing speech is analyzed and matched with a word in the text. The next portion in the audio signal is then analyzed and matched with the next word in the text, and so on. Unfortunately, this method suffers from significant inaccuracies as the word represented by the portion of the audio signal is not always correctly identified. Additionally, when such an error occurs the method cannot correct itself and a cascade of errors follows. The following synchronization method overcomes some of these disadvantages.
In stage 4010, a weight value is determined for each of a plurality of portions of the textual representation of stage 4000. Optionally, a plurality of element types are defined, each element type corresponding to different text characters. A weight is assigned to each element type. The weight of each element type represents the relative time during which such an element type is assumed to be heard when spoken, or the relative duration of silence assumed when it is encountered while reading, e.g. a period. In one non-limiting embodiment, the element types and their weights are:
1. Word (weight=1);
2. Space (any type of white space, including space, TAB, etc.) (weight=0);
3. Number (weight=1);
4. Comma (weight=1);
5. Semicolon (weight=0.75);
6. Period (weight=3);
7. Paragraph marker (Carriage Return+Line Feed) (weight=2);
8. Characters which have no speech representation, such as ‘!’, ‘?’, ‘(’ and ‘)’ (weight=0);
9. Characters which have a particular word representation, such as ‘@’, ‘$’, ‘%’ and ‘&’ (weight=1).
The weight value of each portion of the textual representation is responsive to the element types and their weights. In one embodiment, a token weight is determined for each element type in the text, the token weight determined responsive to the element type weight and length. For example, if an element type Word has a weight of 1, a 5 letter word will be assigned a token weight of 5. For characters which have a particular word representation, such as ‘$’, the character is assigned the token weight of the representing word. The weight value of each text portion is thus defined as the sum of the token weights of all of the element types in the text portion. In one non-limiting embodiment, each text portion is defined as comprising only a single element type.
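By way of non-limiting illustration only, the token weight computation described above may be sketched as follows; the tokenization regular expression and the particular representing words assumed for characters such as ‘$’ are illustrative assumptions.

```python
import re

ELEMENT_WEIGHTS = {"word": 1, "comma": 1, "semicolon": 0.75,
                   "period": 3, "paragraph": 2}
# Assumed representing words for characters spoken as a word.
WORD_REPRESENTATIONS = {"@": "at", "$": "dollars", "%": "percent", "&": "and"}

def token_weights(text: str):
    """Illustrative stage 4010: split the text into tokens and assign each
    its token weight (element type weight scaled by length for words and
    numbers, fixed weights for punctuation, zero for silent characters)."""
    tokens = re.findall(r"\r?\n\s*\r?\n|\w+|[^\w\s]", text)
    weighted = []
    for tok in tokens:
        if "\n" in tok:                       # paragraph marker
            weighted.append((tok, ELEMENT_WEIGHTS["paragraph"]))
        elif tok.isalnum():                   # Word or Number element types
            weighted.append((tok, ELEMENT_WEIGHTS["word"] * len(tok)))
        elif tok == ",":
            weighted.append((tok, ELEMENT_WEIGHTS["comma"]))
        elif tok == ";":
            weighted.append((tok, ELEMENT_WEIGHTS["semicolon"]))
        elif tok == ".":
            weighted.append((tok, ELEMENT_WEIGHTS["period"]))
        elif tok in WORD_REPRESENTATIONS:     # token weight of the representing word
            weighted.append((tok, len(WORD_REPRESENTATIONS[tok])))
        else:                                 # '!', '?', '(', ')' and similar
            weighted.append((tok, 0))
    return weighted
```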
In optional stage 4020, a representation of the length of time of the human generated audio of stage 4000 is adjusted responsive to the determined weight values of stage 4010. In one embodiment, the sum of the weight values of all of the portions of the text is determined. The duration of the recorded text is then divided by the weight value sum to define a unit of time equivalent to one unit duration weight. Each portion of the text is then assigned its appropriate calculated duration responsive to the defined unit duration weight and the length of the portion. For example, if the recorded text duration is 12 seconds (or 12,000 milliseconds), and the sum of weight values of the text is 1,400, then a single unit duration weight is defined as 12,000/1,400≈85.71 milliseconds. If the text portion comprises a 5 letter word, its calculated duration would be 85.71×5≈428.6 milliseconds.
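Continuing the sketch above, and again purely by way of non-limiting illustration, the unit duration weight and the resulting calculated durations of optional stage 4020 may be derived as follows; the data layout is an illustrative assumption.

```python
def calculated_durations(weighted_tokens, audio_duration_ms: float):
    """Illustrative stage 4020: divide the recorded duration by the sum of
    the weight values to obtain the unit duration weight, then assign each
    token a start marker and a calculated duration."""
    total_weight = sum(weight for _, weight in weighted_tokens)
    unit_ms = audio_duration_ms / total_weight
    timeline, elapsed = [], 0.0
    for tok, weight in weighted_tokens:
        duration = weight * unit_ms
        timeline.append({"token": tok, "start_ms": elapsed, "duration_ms": duration})
        elapsed += duration
    return timeline

# Worked example from the text: a 12,000 ms recording with a weight value
# sum of 1,400 yields a unit duration of about 85.71 ms, so a 5 letter
# word (token weight 5) spans roughly 428.6 ms.
```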
In stage 4030, a respective portion of the human generated audio is associated to a particular portion of the textual representation of stage 4000 responsive to the determined weight values of stage 4010. In one embodiment, as described in relation to optional stage 4020, the respective portion of the human generated audio is associated to a particular portion of the textual representation responsive to the determined calculated durations of each text portion.
In the event that there is a miscalculation of the weight of a particular text portion, there will be a misalignment between the speech and the particular text portion; however, the method of optional stage 4020 will cause the misalignment to correct itself as the speech progresses, because the miscalculated weight is absorbed into the weight value sum of the entire text and the resultant adjustment of the unit duration weight accumulates to compensate for the misalignment. For example, in the event that a particular text portion is determined to be longer than it really is, the weight value sum of the entire text will be greater than it really is. Therefore, the single unit duration weight will be shorter than it should be and will slowly compensate for the misalignment at the text portion which exhibits the error. For a short text, the accumulating compensation will be greater for each text portion than in a longer text. In any event, the beginning of the first token and the end of the last token will be synchronized with the audio.
Thus, the above described method of synchronizing text with narrated speech provides improved speech to text synchronization without any information about the recorded audio other than its duration, and is independent of language.
In stage 4040, a set of media assets is selected, as described above in relation to stage 1050 of
In one embodiment, client module 420 comprises a software application, optionally a web widget. Client module 420 is associated with a particular topic and comprises a predetermined time limit for the amount of time video clips are to be displayed by client module 420. In one embodiment, the video clip time limit is determined by an administrator of the web site comprising client module 420, responsive to a client time limit input on client module 420. Each user display 430 is in communication with an associated user system, preferably comprising a user input device. System 400 is illustrated as comprising system 10 of
In stage 5000, system 10 is arranged to retrieve a plurality of textual inputs, as described above in relation to stage 1000 of
In optional stage 5030, client module 420 is arranged to detect a user adjust input thereat and communicate the user adjust input to system 10. Responsive to the detected user adjust input, system 10 is arranged to: output to client module 420 information associated with the output video clips; or adjust the output video clips. In particular, in one non-limiting embodiment a user can choose any of: skipping to the next video clip; opening a web window which will display the original source article of one or more textual inputs associated with the displayed video clip; viewing the textual representation of the human generated audio of the video clip, i.e. the textual input; and skipping to another temporal point in the displayed video clip, optionally responsive to a selection of a particular word in the displayed textual representation.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as are commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods similar or equivalent to those described herein can be used in the practice or testing of the present invention, suitable methods are described herein.
All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the patent specification, including definitions, will prevail. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described herein above. Rather the scope of the present invention is defined by the appended claims and includes both combinations and sub-combinations of the various features described hereinabove as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not in the prior art.
Number | Date | Country
---|---|---
61640748 | May 2012 | US
61697833 | Sep 2012 | US