The present invention relates generally to video distribution systems and more specifically to generation of video recommendations based upon user preferences.
News aggregation sites such as the Google News service provided by Google, Inc. of Mountain View, Calif. and the Yahoo News service provided by Yahoo, Inc. of Sunnyvale, Calif. have garnered significant attention in recent years. These services provide a user interface via which users can customize the types of news stories they want to read. Furthermore, the sites can progressively learn each user's preferences from their reading history to improve future selections.
A great deal of news information is distributed in the form of video content. Although the term “video content” references video information, the term is typically utilized to encompass a combination of video, audio, and text data. In many instances, video content can also include and/or reference sources of metadata. While video news has traditionally been broadcast over-the-air or transmitted via cable networks, video content is increasingly being distributed via the Internet. Therefore, video news stories can be obtained from a variety of sources.
Next-generation media consumption is likely to be more personalized, device agnostic, and pooled from many different sources. Systems and methods in accordance with embodiments of the invention can provide users with personalized video content feeds providing the video content that matters most to them. In several embodiments, a multi-modal segmentation process is utilized that relies upon cues derived from video, audio and/or text data present in a video data stream. In a number of embodiments, video streams from a variety of sources are segmented. Links are identified between video segments and between video segments and online articles containing additional information relevant to the video segments. The additional information obtained by linking a video segment to an additional source of data, such as an online article, can be utilized in the generation of personalized video playlists for one or more users. In several embodiments, the personalized video playlists are utilized to playback video segments via a television, personal computer, tablet computer, and/or mobile device such as (but not limited to) a smartphone, or a media player. In many embodiments, viewing histories and user interactions can be utilized to continuously optimize the personalization. In the context of video streams containing news programming, the dynamic mixing and aggregation of news videos from multiple sources can greatly enrich the news watching experience by providing more comprehensive coverage and varying perspectives. In several embodiments, processes for linking video segments to additional sources of data can be implemented as part of a video search engine service that constructs indexes including inverted indexes relating keywords to video segments to facilitate the retrieval of video segments relevant to a search query.
One embodiment of the invention includes a playlist generation server system, incorporating: at least one processor; and memory containing an indexing application, a playlist generation application, user preference data, and video segment metadata referencing a set of video segments containing content. In addition, the indexing application configures at least one processor to: annotate the video segments in the set of video segments with metadata describing the content of the video segments based upon text data and video data contained within the video segments; and identify relationships between video segments and keywords based upon the metadata describing the content of the video segments. Furthermore, the playlist generation application configures at least one processor to generate personalized playlists identifying video segments from the set of video segments based upon user preference data, video segment metadata, and relationships between the video segments and keywords, and the playlist generation application configures the at least one processor to weigh both content coverage and user preferences in the generation of personalized playlists.
In a further embodiment, the memory also contains user viewing history data; and the playlist generation application configures at least one processor to generate personalized playlists identifying video segments from the set of video segments based upon user preference data, the user viewing history data, video segment metadata and the relationships between the video segments and keywords.
In another embodiment, the playlist generation application configures at least one processor to generate a playlist identifying video segments having a combined playback duration that is less than a predetermined total playlist duration.
In a still further embodiment, the playlist generation application configures at least one processor to generate personalized playlists identifying video segments from the set of video segments based upon user preference data, video segment metadata, and the relationships between the video segments and keywords using an integer linear programming optimization that employs an objective function that weighs both content coverage and user preferences in the generation of a personalized playlist.
In still another embodiment, the linear programming optimization is formulated as follows:
where n is the number of video segments in the set of video segments,
wcoverage represents a weighting applied to content coverage relative to the weight applied to user preferences (cTx),
x is a vector including an element for each identified video segment in the set of video segments, where for i ∈ [1 . . . n], xi ∈ {0,1} is 1 if the ith video segment is selected,
y is a vector including an element for each identified video segment in the set of video segments, where for i ∈ [1 . . . n], yi ∈ {0,1} is 1 if xi is covered by a video segment that has been already selected,
c is a vector representing a set of personalization weights ci determined with respect to each video segment xi in the set of video segments based upon user preferences,
R ∈ {0,1}n×n denotes an adjacency matrix, where 1 represents a relationship between video segments,
di is the duration of the video segment xi in the set of video segments, and
t is a maximum playlist duration.
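The formulation itself is not reproduced in this text. As a minimal sketch only, assume the objective is to maximize wcoverage·Σyi + cTx subject to Σdi·xi ≤ t, with yi equal to 1 when segment i is selected or is related through R to a selected segment. Under that assumption a small instance can be solved by exhaustive enumeration. The function name, the objective as written, and the sample data are hypothetical; a practical system would more likely pass the same model to an integer linear programming solver or use an approximate solution.

```python
from itertools import product

def select_playlist(c, d, R, t, w_coverage):
    """Exhaustively search x in {0,1}^n for an ASSUMED objective.

    Assumed (not taken from the source text):
        maximize   w_coverage * sum(y) + sum(c[i] * x[i])
        subject to sum(d[i] * x[i]) <= t
    where y[i] = 1 if segment i is selected, or if R[i][j] == 1 for
    some selected segment j.
    """
    n = len(c)
    best_score, best_x = float("-inf"), None
    for x in product((0, 1), repeat=n):
        if sum(d[i] * x[i] for i in range(n)) > t:
            continue  # violates the maximum playlist duration
        y = [1 if x[i] or any(R[i][j] and x[j] for j in range(n)) else 0
             for i in range(n)]
        score = w_coverage * sum(y) + sum(c[i] * x[i] for i in range(n))
        if score > best_score:
            best_score, best_x = score, list(x)
    return best_x, best_score

# Hypothetical data: segments 0 and 1 report the same story,
# segment 2 reports a different one; each lasts 60 seconds.
c = [0.9, 0.8, 0.3]
d = [60, 60, 60]
R = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
playlist, score = select_playlist(c, d, R, t=120, w_coverage=1.0)
```

With these sample values the search selects segments 0 and 2 rather than the two most preferred segments (0 and 1), because the second copy of the same story adds no coverage.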
In a yet further embodiment, the playlist generation application configures at least one processor to determine the personalization weights c based upon factors comprising a user's preferences with respect to sources and/or categories of video segments (ssource,scategory), recency (stime), and viewing history (shistory).
In yet another embodiment, the playlist generation application configures at least one processor to determine the personalization weight ci for a video segment vi as follows:
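\[ c_i = w_{\text{source}} \cdot s_{\text{source}}(v_i) + w_{\text{category}} \cdot s_{\text{category}}(v_i) + w_{\text{time}} \cdot s_{\text{time}}(v_i) + w_{\text{history}} \cdot s_{\text{history}}(v_i) \]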
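The weighted sum above can be illustrated with a short sketch. The dictionary layout, the function name, and the numeric weights and scores are assumptions chosen for the example; the text does not specify value ranges for the individual scores.

```python
def personalization_weight(scores, weights):
    """Weighted sum c_i of the four per-segment scores s_k(v_i)."""
    factors = ("source", "category", "time", "history")
    return sum(weights[k] * scores[k] for k in factors)

# Hypothetical weights w_k and scores s_k(v_i) for one segment v_i.
weights = {"source": 0.4, "category": 0.3, "time": 0.2, "history": 0.1}
scores = {"source": 1.0, "category": 0.5, "time": 0.8, "history": 0.0}
c_i = personalization_weight(scores, weights)
```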
In a further embodiment again, the playlist generation application configures at least one processor to determine stime(vi) and shistory(vi) as follows:
where, Videos is a set of video segments, and
related(vi,w)∈{0,1} is 1 if video segments vi and w are related.
In another embodiment again, the playlist generation application configures the at least one processor to order video segments in the playlist based upon importance.
In a further additional embodiment, the playlist generation application configures the at least one processor to score the importance of a given video segment based upon factors comprising the number of video segments related to the same content as the given video segment.
In another additional embodiment, the playlist generation application configures the at least one processor to score the importance of a video segment based upon factors comprising the number of video segments related to the same content as the given video segment published within a predetermined time period.
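As an illustration of the importance ordering described above, the following sketch counts, for each segment, the related segments published within a time window and sorts the playlist by that count. The data layout (a mapping from segment identifiers to related identifiers and a mapping to publication times in seconds) and all names are hypothetical.

```python
def importance(segment_id, related, published, now, window_seconds):
    """Count segments related to `segment_id` published within the window."""
    return sum(1 for other in related.get(segment_id, ())
               if now - published[other] <= window_seconds)

def order_by_importance(playlist, related, published, now, window_seconds):
    """Return the playlist sorted from most to least important."""
    return sorted(
        playlist,
        key=lambda s: importance(s, related, published, now, window_seconds),
        reverse=True,
    )

# Hypothetical data: publication times are seconds on an arbitrary clock.
related = {"a": ["b", "c"], "d": ["e"], "f": []}
published = {"b": 900, "c": 950, "e": 990}
ordered = order_by_importance(["f", "d", "a"], related, published,
                              now=1000, window_seconds=200)
```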
In a still yet further embodiment, the playlist generation application configures the at least one processor to filter the list of video segments in the playlist based upon category.
In still yet another embodiment, the playlist generation application configures the at least one processor to record user interactions with video segments identified in a personalized playlist in a user viewing history.
A still further embodiment again also includes at least one playback device, comprising: at least one processor; memory containing a client application; wherein the client application configures the processor to: obtain a personalized playlist from the playlist generation server system; and playback video segments from the personalized playlist.
In still another embodiment again, the personalized playlist contains at least a portion of the metadata annotations of the video segments in the personalized playlist, and the client application configures the processor to generate a user interface including metadata annotations describing the video segments in the personalized playlist.
In a still further additional embodiment, the client application of a first of the at least one playback devices configures the at least one processor to generate a message to the client application of a second of the at least one playback devices, where the message directs the playback of video segments from the personalized playlist on the second playback device.
In still another additional embodiment, the personalized playlist identifies additional sources of relevant data that are relevant to the video segments identified in the personalized playlist, and the client application configures the processor to generate a user interface via which a user can access an additional source of relevant data identified within the personalized playlist.
In a yet further embodiment again, the indexing application configures at least one processor to identify video segments related to the same content based upon the metadata describing the content of the video segments.
In yet another embodiment again, the indexing application configures at least one processor to identify whether the content of a first video segment is cumulative of the content of a second video segment based upon the metadata describing the content of the first and second video segments.
In a yet further additional embodiment, the indexing application configures at least one processor to identify whether two video segments are related to the same content based upon keywords associated with the video segments.
In yet another additional embodiment, the indexing application configures at least one processor to calculate a term frequency-inverse document frequency (tf-idf) histogram intersection score (S(Ha,Hb)) for the keywords associated with the two video segments as follows:
where, Ha(w) and Hb(w) are the L1 normalized histograms of the words in the two sets of words; and
{f(w)} is the set of estimated relative word frequencies.
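The scoring expression is not reproduced in this text. One plausible reading of the definitions, offered here only as an assumption, is a histogram intersection in which each shared word contributes min(Ha(w), Hb(w)) weighted by an inverse-frequency term derived from f(w). The sketch below uses −log f(w) as that weight; the exact weighting used in a given embodiment may differ, and the function names are hypothetical.

```python
import math
from collections import Counter

def l1_histogram(words):
    """L1-normalized word histogram: the values sum to 1."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def tfidf_intersection(words_a, words_b, f):
    """ASSUMED form: sum over shared words of idf(w) * min(Ha(w), Hb(w)).

    `f` maps each word to its estimated relative frequency in (0, 1];
    idf(w) = -log(f[w]) is an assumption made for this sketch.
    """
    ha, hb = l1_histogram(words_a), l1_histogram(words_b)
    shared = ha.keys() & hb.keys()
    return sum(-math.log(f[w]) * min(ha[w], hb[w]) for w in shared)

# Hypothetical word frequencies and keyword sets.
f = {"the": 0.5, "election": 0.01, "vote": 0.02, "rain": 0.02}
score = tfidf_intersection(
    ["the", "election", "vote", "the"],
    ["the", "election", "rain", "rain"],
    f,
)
```

Under this weighting a rare shared word such as "election" contributes far more to the score than a common shared word such as "the".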
In a further additional embodiment again, the indexing application configures at least one processor to determine that the two video segments relate to the same content when the term frequency-inverse document frequency (tf-idf) histogram intersection score exceeds a first threshold and the number of named entities associated with each of the video segments exceeds a second threshold.
In another additional embodiment again, the content coverage of a personalized playlist is determined based upon the number of video segments identified within the personalized playlist that relate to the same content.
In a still yet further embodiment again, video segments that relate to the same content form a set of video segments, and the content coverage of a personalized playlist is determined based upon the number of sets of video segments covered by the video segments identified within the personalized playlist.
In still yet another embodiment again, the memory also includes a video segmentation application; and the video segmentation application configures the at least one processor to perform a multi-modal segmentation of a video data stream by: identifying segmentation cues from video data, audio data, and text data contained within the video data stream; and identifying segmentation boundaries using the identified segmentation cues.
In another further embodiment, the video data stream includes: a sequence of frames of video; at least one audio track time synchronized with the sequence of frames of video; and closed caption textual data; the video segmentation application configures the at least one processor to: identify segmentation cues from video data, audio data, and text data contained within the video data stream by identifying: visual segmentation cues within the sequence of frames of video; audio segmentation cues within the at least one audio track; and textual segmentation cues within the closed caption textual data; fuse the visual segmentation cues, the audio segmentation cues, and the textual segmentation cues to form a stream of segmentation cues time synchronized with the sequence of frames of video; and identify segmentation boundaries between frames of video within the sequence of frames of video using at least one classifier based upon the stream of segmentation cues.
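The fusion step described above can be sketched as a merge of per-modality cue lists into one time-ordered stream. The `Cue` record, its field names, and the cue labels are assumptions made for the example; the classifier that consumes the fused stream is not shown.

```python
import heapq
from typing import NamedTuple

class Cue(NamedTuple):
    time: float     # seconds from the start of the video stream
    modality: str   # "visual", "audio", or "text"
    kind: str       # hypothetical label, e.g. "anchor_frame" or "pause"

def fuse_cues(visual, audio, text):
    """Merge cue lists (each already sorted by time) into one stream."""
    return list(heapq.merge(visual, audio, text, key=lambda cue: cue.time))

visual = [Cue(12.0, "visual", "anchor_frame"), Cue(75.5, "visual", "logo_frame")]
audio = [Cue(11.6, "audio", "pause")]
text = [Cue(12.3, "text", "cc_marker")]
stream = fuse_cues(visual, audio, text)
```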
In still another further embodiment, the at least one classifier is selected from the group consisting of a support vector machine, a neural-network classifier, and a decision tree classifier.
In yet another further embodiment, the textual segmentation cues include “>>>” markers within the closed caption textual data.
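A minimal sketch of extracting such markers follows, assuming the closed caption data is available as (time, text) pairs. That input layout and the function name are hypothetical.

```python
def caption_story_cues(captions):
    """Return timestamps of caption lines that contain a '>>>' marker.

    `captions` is a list of (time_in_seconds, text) pairs; this layout
    is an assumption made for the example.
    """
    return [time for time, text in captions if ">>>" in text]

captions = [
    (0.0, ">>> GOOD EVENING. OUR TOP STORY TONIGHT"),
    (4.2, ">> THANK YOU. POLICE SAY"),   # two-character marker: not a cue here
    (61.5, ">>> IN SPORTS, THE SEASON OPENER"),
]
cue_times = caption_story_cues(captions)
```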
In another further embodiment again, the video segmentation application configures the at least one processor to perform automatic speech recognition on an audio track from the at least one audio track to produce audio track textual data that is time synchronized with the sequence of frames of video.
In another further additional embodiment, the video segmentation application configures the at least one processor to: match at least a portion of the closed caption textual data with the audio track textual data; and time synchronize the closed caption textual data to the sequence of frames of video data based upon the time synchronization of the matching audio track textual data.
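The matching and time synchronization described above can be sketched with a generic sequence alignment. The example below uses Python's `difflib.SequenceMatcher` as a stand-in for whatever matching process an embodiment employs, and assumes that both texts are already tokenized into words and that each recognized word carries a timestamp; the function name is hypothetical.

```python
from difflib import SequenceMatcher

def sync_captions(caption_words, asr_words, asr_times):
    """Give each caption word the timestamp of its matching ASR word.

    Caption words with no match keep a timestamp of None.
    """
    matcher = SequenceMatcher(None, caption_words, asr_words, autojunk=False)
    times = [None] * len(caption_words)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.a + k] = asr_times[block.b + k]
    return times

caption_words = ["good", "evening", "our", "top", "story"]
asr_words = ["good", "evening", "and", "our", "top", "story"]
asr_times = [0.0, 0.4, 0.9, 1.1, 1.3, 1.6]
caption_times = sync_captions(caption_words, asr_words, asr_times)
```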
In still yet another further embodiment, the visual segmentation cues include anchor frames.
In still another further embodiment again, the video segmentation application configures the at least one processor to detect anchor frames by: detecting frames in the sequence of frames of video containing a face using a face detector; determining color histograms for the detected faces; clustering the color histograms; and identifying anchor frames as frames that contain a face having a color histogram from within a dominant cluster of color histograms.
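The anchor frame steps above can be sketched as follows, assuming a face detector has already produced one L1-normalized color histogram per detected face. The text does not name a clustering algorithm, so the greedy threshold clustering used here is an assumed stand-in, as are the similarity threshold and all names.

```python
def histogram_intersection(h1, h2):
    """Similarity of two L1-normalized histograms, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def detect_anchor_frames(face_histograms, min_similarity=0.8):
    """Return frame indices whose face falls in the largest cluster.

    `face_histograms` is a list of (frame_index, histogram) pairs.
    Clustering is greedy: a face joins the first cluster whose
    representative histogram is similar enough, else starts a new one.
    """
    clusters = []
    for frame_index, hist in face_histograms:
        for cluster in clusters:
            if histogram_intersection(cluster["rep"], hist) >= min_similarity:
                cluster["frames"].append(frame_index)
                break
        else:
            clusters.append({"rep": hist, "frames": [frame_index]})
    if not clusters:
        return []
    dominant = max(clusters, key=lambda cl: len(cl["frames"]))
    return dominant["frames"]

# Hypothetical three-bin histograms for faces found in four frames.
faces = [
    (0, [0.70, 0.20, 0.10]),
    (10, [0.10, 0.20, 0.70]),
    (20, [0.68, 0.22, 0.10]),
    (30, [0.72, 0.18, 0.10]),
]
anchor_frames = detect_anchor_frames(faces)
```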
In still another further additional embodiment, the visual segmentation cues include logo frames.
In yet another further embodiment again, the video segmentation application configures the at least one processor to detect that a given frame from the sequence of frames of video is a logo frame by performing feature matching between a set of logo images and the given frame.
In yet another further additional embodiment, the video segmentation application configures the at least one processor to detect that a series of frames from the sequence of frames of video is a logo animation by performing feature matching between each of a series of logo animation frames and the corresponding frame in the series of frames.
In another further additional embodiment again, the visual segmentation cues include dark frames.
In still yet another further embodiment again, the audio segmentation cues include pauses in speech having a duration exceeding a threshold.
In still yet another further additional embodiment, at least some of the time stamped segmentation cues include confidence scores, and the video segmentation application configures the at least one processor to identify segmentation boundaries between frames of video within the sequence of frames of video using at least one classifier based upon the stream of time stamped segmentation cues and the confidence scores.
In yet another further additional embodiment again, the indexing application further configures at least one processor to index the set of video segments by extracting text data from the video segments in the set of video segments.
In still yet another further additional embodiment again, the text data is selected from the group consisting of closed caption text data, subtitle text data, and text data generated by applying an Automatic Speech Recognition process to an audio track within the video segment.
In a still further embodiment, the indexing application further configures at least one processor to use the extracted text data to identify additional sources of relevant data and identify relationships between video segments and keywords based upon keywords contained in the additional sources of relevant data.
In still another embodiment, the indexing application further configures at least one processor to use keywords from the extracted text data to identify candidate sources of relevant data based upon keywords contained within the candidate sources of relevant data based upon bag-of-words histogram comparisons that enable matching of text segments from the extracted text data with similar distributions of words in a candidate source of relevant data.
In a yet further embodiment, the indexing application further configures at least one processor to calculate a term frequency-inverse document frequency (tf-idf) histogram intersection score (S(Ha,Hb)) as follows:
where, Ha(w) and Hb(w) are the L1 normalized histograms of the words in the two sets of words; and
{f(w)} is the set of estimated relative word frequencies.
In yet another embodiment, the indexing application further configures at least one processor to determine that a candidate source of relevant data is an additional source of relevant data when the tf-idf histogram intersection score (S(Ha,Hb)) exceeds a predetermined threshold.
In a further embodiment again, the indexing application further configures at least one processor to identify candidate sources of relevant data by providing at least some of the keywords extracted from a selected video segment to a search engine.
In another embodiment again, the indexing application further configures at least one processor to identify a title from text extracted from the selected video segment and to identify candidate sources of relevant data by providing the extracted title to the search engine as the keyword.
In a further additional embodiment, the identified relationships between video segments and keywords comprise relationships between named entities and video segments.
In another additional embodiment, the indexing application further configures at least one processor to: identify named entities within the text data extracted from the video segments; and identify an additional source of relevant data when a predetermined number of named entities are present within both the additional source of relevant data and the text data extracted from a selected video segment.
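The named entity test described above reduces to counting shared entities. The sketch below assumes named entities have already been extracted from both texts and normalized to comparable strings; the function name and the threshold value are hypothetical.

```python
def is_additional_source(segment_entities, source_entities, min_shared):
    """True when at least `min_shared` named entities appear in both sets."""
    shared = set(segment_entities) & set(source_entities)
    return len(shared) >= min_shared

# Hypothetical entity sets for a video segment and a candidate article.
segment_entities = {"Angela Merkel", "Berlin", "European Union"}
article_entities = {"Angela Merkel", "European Union", "Brussels"}
linked = is_additional_source(segment_entities, article_entities, min_shared=2)
```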
In a still yet further embodiment, the indexing application further configures at least one processor to identify additional named entities by performing object recognition within frames from the selected video segment.
In still yet another embodiment, the indexing application further configures at least one processor to index the set of video segments by extracting text from frames of video within the set of video segments using automatic text recognition processes.
In a still further embodiment again, the indexing application further configures at least one processor to index the set of video segments by identifying objects within frames of video within the set of video segments using object detection processes.
In still another embodiment again, the object detection processes include facial recognition processes.
In a still further additional embodiment, the indexing application further configures at least one processor to index the set of video segments by matching image portions associated with additional sources of relevant data to frames of video within the set of video segments to identify additional sources of relevant data.
In still another additional embodiment, the indexing application further configures at least one processor to identify relationships between a given video segment and at least one keyword based upon at least one keyword contained in an additional source of data identified as relevant to the given video segment based upon a match between an image portion associated with the source of additional data and a frame of video from the given video segment.
In a yet further embodiment again, the indexing application further configures at least one processor to determine that a given frame of video contains a region that includes a geometrically and photometrically distorted version of a portion of an image obtained from an additional source of relevant data.
In yet another embodiment again, the indexing application further configures at least one processor to index the set of video segments by: extracting text data from a selected video segment in the set of video segments and using keywords from the extracted text data to identify candidate sources of relevant data based upon keywords contained within the candidate sources of relevant data; and identifying images from the candidate sources of relevant data, where at least a portion of the image matches at least a portion of a frame of video from within the selected video segment; identifying additional sources of relevant data from the candidate sources of relevant data based upon the extracted keywords, the keywords in the candidate sources of relevant data and the identified images; and generate an index of keywords relevant to the selected video segment using the extracted keywords and the keywords contained within the additional sources of relevant data.
In a yet further additional embodiment, the indexing application further configures at least one processor to identify additional sources of relevant data from the candidate sources of relevant data based upon the extracted keywords, the keywords in the candidate sources of relevant data, the identified images, and timestamps associated with the selected video segment and the candidate sources of relevant data.
In yet another additional embodiment, the indexing application further configures at least one processor to use keywords from the extracted text data to identify candidate sources of relevant data based upon keywords contained within the candidate sources of relevant data based upon bag-of-words histogram comparisons that enable matching of text segments from the extracted text data with similar distributions of words in a candidate source of relevant data.
In a further additional embodiment again, the indexing application further configures at least one processor to calculate a term frequency-inverse document frequency (tf-idf) histogram intersection score (S(Ha,Hb)) as follows:
where, Ha(w) and Hb(w) are the L1 normalized histograms of the words in the two sets of words; and
{f(w)} is the set of estimated relative word frequencies.
In another additional embodiment again, the indexing application further configures at least one processor to determine that a candidate source of relevant data is an additional source of relevant data when the tf-idf histogram intersection score (S(Ha,Hb)) exceeds a predetermined threshold.
In a still yet further embodiment again, the indexing application further configures at least one processor to: identify named entities within the text data extracted from the selected video segment; and determine that a candidate source of relevant data is an additional source of relevant data when a predetermined number of named entities are present within both the candidate source of relevant data and the text data extracted from the selected video segment.
In still yet another embodiment again, the indexing application further configures at least one processor to identify additional named entities by performing object recognition.
In a still yet further additional embodiment, the indexing application further configures at least one processor to identify candidate sources of relevant data by providing at least some of the keywords extracted from the selected video segment to a search engine.
In still yet another additional embodiment, the indexing application further configures at least one processor to identify a title from text extracted from the selected video segment and to identify candidate sources of relevant data by providing the extracted title to the search engine as the keyword.
In another further embodiment, the indexing application further configures at least one processor to identify at least a portion of an image from a candidate source of relevant data that matches at least a portion of a frame of video from within the selected video segment by determining that a given frame of video contains a region that includes a geometrically and photometrically distorted version of a portion of an image obtained from the candidate source of relevant data.
An embodiment of the method of the invention includes: crawling media sources using a playlist generation system to identify a set of video segments containing content; annotating video segments with metadata describing the content of the video segments by linking video segments in the set of video segments to sources of additional data based upon text data and video data contained within the video segments; identifying relationships between video segments in the set of video segments and keywords based upon the metadata describing the content of the video segments using the playlist generation system; generating personalized playlists identifying video segments from the set of video segments based upon user preference data, and the relationships between the video segments and keywords using the playlist generation system, where the process of generating the personalized playlists weighs both content coverage and user preferences; and recording user interactions with video segments identified in personalized playlists as user viewing history data using the playlist generation system.
A further embodiment of the method of the invention also includes aggregating video data from content sources in a content storage system using the playlist generation system.
In another embodiment, aggregation of video data comprises segmenting video data streams using a video segmentation system.
A still further embodiment also includes transcoding the video segments to target profiles.
Turning now to the drawings, systems and methods for generating personalized video playlists for video content aggregated from a variety of content sources in accordance with embodiments of the invention are illustrated. In many embodiments, data streams of video content are aggregated from various sources. Relationships are identified between various segments of the video content and/or between segments of the video content and other relevant sources of information including (but not limited to) metadata databases, web pages and/or social media services. Relevant information concerning the video segments can then be utilized to generate personalized playlists of video content based upon each user's viewing history and preferences. Users can then utilize the playlists to playback segments of video content via any of a variety of playback devices. In a number of embodiments, the user interface presented to the user via the playback device and/or via a second screen can display and/or provide users with links to information related to the displayed video segment.
Online sources of video content, such as news websites, typically provide video content in individual segments. By contrast, traditional broadcast sources typically provide video content in continuous streams. In many embodiments, the process of aggregating video content from various sources can include segmentation of continuous data streams of video content. In the context of a news personalization service, the streams of video content can be segmented into individual news stories. In other contexts, the streams of video content can be segmented in accordance with other criteria including (but not limited to) commercial breaks, repeated events, slow motion sequences, camera shots, sentences, and/or anchor frames. In the specific context of sporting events, repeated sequences, slow motion sequences, and shots of the crowd are often indicative of important activity and can be utilized as segmentation boundaries. In addition, certain camera angles are typically utilized to capture video of important regions of a sports field. Therefore, camera angle can also be utilized to identify segmentation boundaries. As can readily be appreciated, any of a variety of segmentation cues can be utilized to identify specific segmentation boundaries that are appropriate to the requirements of a given application. In a number of embodiments, the segmentation process is a multi-modal segmentation process that detects segmentation cues in video, audio, and/or text data available in the data stream. Multi-modal segmentation processes in accordance with certain embodiments of the invention utilize specific text segmentation cues contained within closed caption text data. In a number of embodiments, specific video segmentation cues such as the recognition of a recurring face (e.g. an anchorperson), and/or a recurring logo or logo animation are utilized to assist video segmentation.
In other embodiments, any of a variety of segmentation techniques can be utilized as appropriate to the requirements of specific applications.
In a number of embodiments, segments of video content are analyzed to identify links between the segments and other relevant sources of information including (but not limited to) metadata databases, web pages and/or short messages posted via social media services such as the Facebook service provided by Facebook, Inc. of Menlo Park, Calif. and the Twitter service provided by Twitter, Inc. of San Francisco, Calif. In several embodiments, a multi-modal search for relevant additional data sources is performed that utilizes textual analysis and visual analysis of the video segments to identify relevant sources of additional data. In a number of embodiments, the textual analysis involves extracting keywords from text data such as closed caption and/or subtitles. The extracted keywords can then be utilized to locate relevant text data. In certain embodiments, the visual analysis involves recognizing elements within individual frames of video such as (but not limited to) text, faces, images and/or image patterns (e.g. clothing, scene background). In several embodiments, visual analysis can also involve object detection and/or detection of specific object events (e.g. gestures or specific object movements). Text and faces of named entities can be extracted as metadata describing the video segment and utilized to locate sources of relevant text data. In several embodiments, some or all of a frame of video can be compared to images related to additional sources of data and matching images used to identify relevant sources of additional data. In other embodiments, any of a variety of text and/or visual analysis can be performed to identify relevant sources of additional information.
In a number of embodiments, a multi-modal video search engine service is provided that creates an index of video segments that are relevant to specific keywords based upon relevant keywords identified through the textual and visual analysis of the video segments. In several embodiments, the list of relevant keywords for a particular video segment can be expanded by identifying keywords from additional sources of data identified through the textual and visual analysis of the video segment. Once generated, the index can be utilized to generate a list of video segments that are relevant to a text search query. In several embodiments, an image, a video segment, and/or a Universal Resource Locator (URL) identifying a data source such as (but not limited to) an image, a video sequence, a web page, and/or an online article can be provided as an input to the search engine (as opposed to a text query) to generate a list of related video segments. In other embodiments, any of a variety of multi-modal search engine services can be implemented as appropriate to the requirements of specific applications.
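An index relating keywords to video segments can be sketched as an inverted index. In the example below, the segment identifiers, the keyword sets, and the choice to require every query term to match (AND semantics) are assumptions made for illustration; ranking of the results is not shown.

```python
from collections import defaultdict

def build_inverted_index(segment_keywords):
    """Map each lower-cased keyword to the set of segments it describes."""
    index = defaultdict(set)
    for segment_id, keywords in segment_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(segment_id)
    return index

def search(index, query):
    """Return segments associated with every term of a text query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical keywords, e.g. drawn from captions and linked articles.
segment_keywords = {
    "seg1": ["election", "senate", "vote"],
    "seg2": ["election", "storm"],
    "seg3": ["storm", "flood"],
}
index = build_inverted_index(segment_keywords)
hits = search(index, "Election vote")
```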
With specific regard to the generation of personalized playlists, the ability to identify related video segments can be useful in generating a playlist having a specified duration that provides the greatest coverage of the content of a set of video segments. The ability to identify related and/or duplicate content in a set of video segments can be utilized in the selection of video segments to include in a playlist. In the context of news stories, a personalized playlist can be constructed by selecting video segments of news stories that provide the greatest coverage of the stories taking into consideration an individual user's preferences concerning factors such as (but not limited to) content source, content category, anchorperson and/or any other factors appropriate to specific applications. As discussed further below, many embodiments of the invention utilize an integer linear programming optimization or a suitable approximate solution that employs an objective function that weighs both content coverage and user preferences in the generation of a personalized playlist. However, any of a variety of techniques for recommending video segments can be utilized in accordance with embodiments of the invention including (but not limited to) processes that generate playlists using video segments that do not contain cumulative content.
Systems and methods for generating personalized video playlists, performing multi-modal video data stream segmentation, and generating video search results using multi-modal analysis of video segments in accordance with embodiments of the invention are discussed further below.
Playlist generation systems in accordance with embodiments of the invention perform multi-modal analysis of video segments to generate personalized playlists based upon factors including (but not limited to) a user's preferences, and/or viewing history. In a number of embodiments, the user's preferences can touch upon topic, content provider, and total playlist duration. A playlist generation system configured to generate personalized playlists of news stories in accordance with an embodiment of the invention is conceptually illustrated in
The playlist generation system 100 analyzes and indexes (108) the video segments. In several embodiments, a multi-modal process that performs textual and visual analysis is utilized to analyze and index the video segments. In a number of embodiments, the multi-modal process identifies keywords from text sources within the video segment including (but not limited to) closed caption and subtitles. Keywords can also be extracted based upon text recognition and object recognition. In certain embodiments, various object recognition processes are utilized including facial recognition processes to identify named entities. The set of keywords associated with a video segment can then be utilized to identify additional sources of data. Examples of additional sources of data include (but are not limited to) online articles, websites, and postings to social media services. In certain embodiments, comparisons can be performed between frames of a video segment and images associated with additional sources of data as an additional modality for determining the extent of the relevance of an additional source of data. In other embodiments, any of a variety of analysis and indexing processes can be utilized as appropriate to the requirements of specific applications. Analysis and indexing processes that are utilized by various playlist generation systems in accordance with embodiments of the invention are discussed further below.
The indexed video segments can be utilized by the playlist generation system 100 to generate personalized playlists (110). Any of a variety of processes can be utilized to generate personalized playlists in accordance with embodiments of the invention. Several particularly effective processes for generating personalized playlists are described below. A number of embodiments are directed toward the generation of playlists in the context of news stories and select video segments that provide the greatest coverage of recent news stories in a manner that is informed by user preferences. In several embodiments, the selection process is further constrained by the need to generate a playlist having a playback duration that does not exceed a duration specified by the user.
Personalized playlists can be provided by the playlist generation system to playback devices. In a number of embodiments, the playlist can take the form of JSON playlist metadata. In other embodiments, any of a variety of data transfer techniques can be utilized including the creation of a top level index file such as (but not limited to) a SMIL file, or an MPEG-DASH file. Client applications on playback devices can generate a user interface (112) that enables the user to obtain and playback the video segments identified within the playlist. In many instances, the user may simply enable the playback device to continuously play through the playlist. In several embodiments, the user interface provides the user with the ability to select video segments, express sentiment toward video segments (e.g. like/dislike), skip video segments, reorder and/or delete video segments from the playlist, and share video segments via email, messaging services, and/or social media services. In a number of embodiments, the playlist generation system 100 logs user interactions via the user interface and uses the interactions to infer user preferences. In this way, the system can learn over time information about a user's preferences including (but not limited to) preferred content categories, content services, and/or anchorpeople. In a number of embodiments, playback devices can generate a so-called “second screen” user interface that can enable control of playback of a playlist on another playback device and/or provide information that complements a video segment and/or playlist being played back by another playback device. As can readily be appreciated, the specific user interface generated by a playback device is typically only limited by the capabilities of the playback device and the requirements of a specific application.
Although specific playlist generation systems are described above with reference to
A video distribution system incorporating a playlist generation server system in accordance with an embodiment of the invention is illustrated in
Playlist generation server systems 202 in accordance with many embodiments of the invention utilize multi-modal analysis of video segments to identify additional relevant sources of data accessible via the content storage system 204, a content distribution network 206, a web server system 208 and/or a social media server system 210. In several embodiments, the playlist generation server system 202 annotates video segments with metadata extracted from the video segment and/or from additional sources of relevant data. The metadata describing the video segments can be stored in a database 216 and utilized to generate personalized playlists based upon user preferences that can also be stored in the database.
Playback client applications installed on a variety of playback devices 218 can be utilized to request personalized playlists from a playlist generation server system 202 via a network 220 such as (but not limited to) the Internet. The playback client applications can configure the playback devices 218 to display a user interface that enables a user to view and interact with the video segments identified in the user's personalized playlist. In a number of embodiments, the playlist generation server system and the playback devices can support multi-screen user interfaces. For example, a first playback device can be utilized to playback video segments identified in the playlist and a second playback device can be utilized to provide a “second screen” user interface enabling control of playback of video segments on the first playback device and/or additional information concerning the video segments and/or playlist being played back on the first playback device. In the illustrated embodiment, the playback devices 218 are personal computers and mobile phones. As can be readily appreciated, playback client applications can be created for any of a variety of playback devices including (but not limited to) network connected consumer electronics devices such as televisions, game consoles, and media players, tablet computers and/or any other class of device that is typically utilized to view video content obtained via a network connection.
A process for generating a personalized playlist of video segments drawn from different content sources based upon user preferences in accordance with an embodiment of the invention is illustrated in
In order to generate a playlist of video segments personalized to a user's preferences, the process 300 seeks to annotate the video segments with metadata describing the content of the segments. In a number of embodiments, a video segment linking process (306) is performed that seeks to identify additional sources of relevant data that describe the content of the video segment. In a number of embodiments, the video segment linking process (306) also seeks to identify relationships between video segments. In various contexts, including in the generation of personalized playlists of news stories, knowledge concerning the relationship between video segments can be useful in identifying video segments that contain cumulative content and can be excluded from a playlist without significant loss of information or content coverage. Information concerning the number of related stories can also provide an indication of the importance of the story.
Metadata describing a set of video segments can be utilized to generate (308) personalized playlists for one or more users. As is described in detail below, a variety of processes can be utilized in the generation of a personalized playlist based upon the metadata generated by process 300. In the context of news stories, a number of embodiments utilize an integer linear programming optimization and/or an approximation of an integer linear programming optimization that employs an objective function that weighs both content coverage including (but not limited to) measured trending topics (e.g. breaking news, or popular stories) and user preferences in the generation of a personalized playlist. However, any of a variety of processes for recommending video segments can be utilized as appropriate to the requirements of specific applications in accordance with embodiments of the invention.
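As a hedged illustration of an approximate solution, the following sketch greedily selects segments under a duration budget. It is not the integer linear programming formulation itself. The objective used here (number of newly covered topics plus a weighted preference score, divided by segment duration) and all field names are assumptions made for the sketch.

```python
def greedy_playlist(segments, max_duration, preference_weight=1.0):
    """Greedy approximation: repeatedly pick the segment with the best
    (new topic coverage + weighted preference) per second that still fits.

    segments: list of dicts with hypothetical fields 'id', 'duration'
    (seconds, assumed positive), 'topics' (set) and 'preference' (float).
    """
    covered, playlist = set(), []
    remaining = float(max_duration)
    candidates = list(segments)
    while True:
        best, best_score = None, 0.0
        for seg in candidates:
            if seg["duration"] > remaining:
                continue  # would exceed the user-specified duration
            gain = len(seg["topics"] - covered)
            gain += preference_weight * seg["preference"]
            score = gain / seg["duration"]
            if score > best_score:
                best, best_score = seg, score
        if best is None:
            break  # nothing fits, or nothing adds coverage or preference
        playlist.append(best["id"])
        covered |= best["topics"]
        remaining -= best["duration"]
        candidates.remove(best)
    return playlist


# Hypothetical news segments; "b" duplicates a topic already covered by "a".
stories = [
    {"id": "a", "duration": 60, "topics": {"election", "senate"}, "preference": 0.5},
    {"id": "b", "duration": 60, "topics": {"election"}, "preference": 0.0},
    {"id": "c", "duration": 90, "topics": {"storm"}, "preference": 1.0},
    {"id": "d", "duration": 120, "topics": {"sports"}, "preference": 0.2},
]
playlist = greedy_playlist(stories, max_duration=180)
```

In this example, segment "b" is never selected once "a" has been chosen, because it contributes no new coverage and carries no preference weight. This mirrors the exclusion of cumulative content discussed above.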
In many embodiments, video segments are streamed to playback devices. Many of the standards that exist for encoding video specify profiles and playback devices are typically constructed in a manner that enables playback of content encoded in accordance with one or more of the profiles specified within the standard. The same profile may not, however, be suitable or desirable for playing back content on different classes of playback device. For example, mobile devices are typically unable to support playback of profiles designed for home theaters. Similarly, a network connected television may be capable of playing back content encoded in accordance with a mobile profile. However, playback quality may be significantly reduced relative to the quality achieved with a profile that demands the resources that are typically available in a home theater setting. Accordingly, processes for generating personalized video playlists in accordance with many embodiments of the invention involve transcoding video segments into formats and/or profiles suitable for different classes of device. As can readily be appreciated, the transcoding of media into target profiles can be performed in parallel with the processes utilized to perform video segment linking (306) and personalized playlist generation (308).
As discussed above, personalized playlists can be utilized by playback devices to obtain (312) and playback video segments identified within the playlists. In a number of embodiments, the video segments are streamed to the playback device and any of a variety of streaming technologies can be utilized including any of the common progressive playback or adaptive bitrate streaming protocols utilized to stream video content over a network. In several embodiments, a playback device can download the video segments using a personalized video playlist for disconnected (or connected) playback. The personalized playlists are generated based upon user preferences. Therefore, the process of generating personalized playlists can be continuously improved by collecting information concerning user interactions with video segments identified in a personalized playlist. The interactions can be indicative of implicit user preferences and may be utilized to update explicit user preferences obtained from the user.
Although specific processes for generating personalized video playlists are described above with reference to
In a number of embodiments, computers and television tuners are utilized to continually record media content from over-the-air broadcasts and cable television transmissions. In the context of a playlist generation system configured to generate personalized video playlists of news stories, the recorded programs can include national morning and evening news programs (e.g., TODAY Show, ABC World News), investigative journalism (e.g., 60 Minutes), and late-night talk shows (e.g., The Tonight Show). In many embodiments, the closed caption (CC) and/or any subtitles and metadata that may be available within the broadcast data stream are recorded along with the media content for use in subsequent processing of the recorded media content. In other contexts, content sources appropriate to the requirements of specific applications can be recorded. In several embodiments, segmentation is performed in real-time prior to storage. In a number of embodiments, the video data streams are recorded and segmentation is performed on the recorded data streams.
A video segmentation system configured to aggregate and segment over-the-air broadcasts and cable television transmissions in accordance with an embodiment of the invention is illustrated in
In the illustrated embodiment, the tuners 408 connect to a central storage system 410 via a high bandwidth digital switch 412. The data streams are recorded to the central storage system 410 and then a video segmentation server system 414 can commence the process of segmenting the data stream into discrete video segments.
A similar process is utilized to record and segment data streams obtained from over-the-air broadcasts. In the illustrated embodiment, tuner boxes 416 are utilized to tune to and demodulate digital television signals that are provided via a network 418 to the video segmentation server system 414 for segmentation. In many embodiments, the video segmentation server system records the over-the-air data streams to the central storage system 410 and then processes the recorded data streams. In a number of embodiments, the video segmentation server system 414 performs video segmentation in real-time and the video segments are recorded to the central storage system 410. In a number of embodiments, local machines 420 can be utilized to administer the aggregation and segmentation of video and/or view video segments.
Although specific systems for performing video aggregation and segmentation are described above with reference to
Due to the diversity of video content generated by various broadcast and online content sources, video segmentation systems in accordance with many embodiments of the invention can utilize a variety of cues to reliably segment content. In a typical data stream of video content, the sources of information concerning the structure of the content include (but are not limited to) image data in the form of frames of video, audio data in the form of time synchronized audio tracks, text data in the form of closed caption and/or subtitles, and/or additional sources of video, audio, and/or text information indicated by metadata contained within the data stream (e.g. in a time synchronized metadata track). In the context of video data streams, the term structure can often be used to describe a common progression of content within a data stream. For example, many data streams include content interrupted by advertising. At a more sophisticated level many news services structure transitions between news stories to incorporate shots of an anchorperson, which can be referred to as anchor frames, and/or transition animations that often include a station logo. The goal of video segmentation is to use information concerning the structure of content to divide a continuous video data stream into logical video segments such as (but not limited to) discrete news stories. In a number of embodiments, video segmentation is performed using multi-modal fusion of a variety of visual, auditory and textual cues. By combining cues from different types of data contained within the data stream, the segmentation process has a greater likelihood of correctly identifying structure within the content indicative of logical boundaries between video segments.
A multi-modal video segmentation server system in accordance with an embodiment of the invention is illustrated in
Although specific multi-modal video segmentation server systems are described above with reference to
Multi-modal video segmentation processes can utilize a variety of different types of data contained within a video data stream to identify cues indicative of the structure of the data stream. A multi-modal video segmentation process that utilizes textual, audio and visual cues to identify segmentation boundaries in accordance with an embodiment of the invention is conceptually illustrated in
Some of the most important cues for story boundaries can be found in closed caption textual data incorporated within a video data stream. Often, >>> and >> markers are inserted to denote changes in stories or changes in speakers, respectively. Due to human errors, relying solely on these markers can provide inaccurate segmentation results. Therefore, segmentation analysis of closed caption data can be enhanced by looking for additional cues including (but not limited to) commonly used transition phrases that occur at segmentation boundaries. In several embodiments, string searches are performed within closed caption textual data and all >>> markers and transition phrases are identified as potential segmentation boundaries. In a number of embodiments, the list of transition phrases includes “Now, we turn to . . . ” and “Stephanie Gross, NBC News, Seattle”. In other embodiments, any of a variety of text tags and/or phrases can be utilized as textual segmentation cues as appropriate to the requirements of specific applications.
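The string searches described above can be sketched as follows. The two regular expressions are hypothetical generalizations of the example transition phrases, and the (timestamp, text) input format is an assumption made for illustration.

```python
import re

# Hypothetical patterns generalizing the example transition phrases.
TRANSITION_PATTERNS = [
    re.compile(r"\bnow,? we turn to\b", re.IGNORECASE),
    re.compile(r"\bnbc news, \w+", re.IGNORECASE),  # reporter sign-off
]


def find_textual_cues(caption_lines):
    """Return (timestamp, cue_type) pairs for potential story boundaries.

    caption_lines: iterable of (timestamp_seconds, text) tuples.
    """
    cues = []
    for timestamp, text in caption_lines:
        # '>>>' marks a story change; a bare '>>' (speaker change) is ignored.
        if ">>>" in text:
            cues.append((timestamp, "story_marker"))
        if any(p.search(text) for p in TRANSITION_PATTERNS):
            cues.append((timestamp, "transition_phrase"))
    return cues


captions = [
    (12.0, ">>> Good evening. We begin tonight in Washington."),
    (45.5, ">> Thank you for having me."),
    (90.2, "Stephanie Gross, NBC News, Seattle."),
    (93.0, ">>> Now, we turn to the weather."),
]
cues = find_textual_cues(captions)
```

Each cue is reported only as a potential boundary, so a downstream multi-modal process remains free to combine it with visual and auditory cues.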
In many instances, there is a delay between the video and closed caption text that varies randomly even within the same segment of video content. Indeed, delays of the order of tens of seconds have been observed. In a number of embodiments, automatic speech recognition can be performed with respect to the audio track and the timestamps of the audio track used to align the audio track textual data output by the automatic speech recognition process with text in the accompanying closed caption textual data. In several embodiments, the text data output by the automatic speech recognition process can also be analyzed to detect the presence of transition phrases. In other embodiments, the uncertainty in the time alignment between the closed caption text and the video content can be accommodated by the multi-modal segmentation process and a separate time alignment process is not required.
A process for identifying textual segmentation cues in accordance with an embodiment of the invention is illustrated in
Visual boundaries in video content can provide information concerning transitions in content that cannot be discerned from analysis of closed caption textual data alone. In several embodiments, an analysis of video content for visual cues indicative of segmentation boundaries can be utilized to identify additional segmentation boundaries and to confirm and/or improve the accuracy of boundaries identified using closed caption textual data.
In the context of segmentation of news stories, several embodiments of the invention rely upon one or more of a set of visual cues as strong indicators of a segmentation boundary. In a number of embodiments, the set of visual cues includes (but is not limited to) anchor frames, logo frames, logo animation sequences and/or dark frames. In other embodiments and/or contexts, any of a variety of visual cues can be utilized as appropriate to the requirements of specific applications.
The term anchor frame refers to a frame in which an anchorperson appears. Typically, one or more anchorpersons appear between stories to provide a graceful transition. In several embodiments, a face detector is applied to some or all of the video frames in a video data stream. In certain embodiments, a face detector that can detect the presence of a face (without performing identification) is utilized to identify candidate anchor frames and then a facial recognition process is applied to the candidate anchor faces to detect anchor frames. In other embodiments, any of a variety of techniques can be used to identify the presence of a specific person's face within a frame in a video data stream as appropriate to the requirements of specific applications.
A process for detecting anchor frames in a data stream in accordance with an embodiment of the invention is conceptually illustrated in
When no faces are detected (756), then the frame is determined not to be an anchor frame. When a determination (756) is made that a face is present, then a face identification process (758) can be performed within the region containing the detected face. In several embodiments, face identification is performed by generating a color histogram for a region containing a candidate face. In several embodiments, an elliptical region is utilized. In a number of embodiments, confidence information generated by the face detection process is utilized to define the region from which to form a histogram. The color histograms can be clustered from candidate anchor frames across the video data stream and dominant clusters identified as corresponding to an anchorperson. The dominant clusters can then be used to identify candidate anchor frames that contain a face having a color histogram that is close to one of the dominant “anchor” color histograms. In certain embodiments, similarity is determined using the L1 distance between the color histograms. In other embodiments, any of a variety of metrics can be utilized as appropriate to the requirements of specific applications, including metrics that consider the color histogram of a potential anchor face over more than one frame.
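The histogram comparison can be illustrated with the minimal sketch below. It assumes that the pixels of the face region have already been extracted and that the dominant anchor histograms are already known. The elliptical region and the clustering step are omitted, and the bin count and distance threshold are hypothetical values.

```python
def color_histogram(pixels, bins_per_channel=4):
    """L1-normalized RGB histogram of (r, g, b) pixels with values 0..255."""
    step = 256 // bins_per_channel
    top = bins_per_channel - 1
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Clamp so the bin index stays valid if 256 is not divisible by bins.
        ri, gi, bi = min(r // step, top), min(g // step, top), min(b // step, top)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
        count += 1
    return [v / count for v in hist] if count else hist


def l1_distance(h1, h2):
    """Sum of absolute bin differences (0 for identical, 2 for disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def is_anchor_face(face_hist, anchor_hists, threshold=0.5):
    """True if the face histogram is close to any dominant anchor histogram.

    The threshold is a hypothetical value chosen for illustration.
    """
    return any(l1_distance(face_hist, h) <= threshold for h in anchor_hists)


# Hypothetical face regions represented as flat lists of RGB pixels.
anchor = color_histogram([(200, 150, 120)] * 10)
similar = color_histogram([(210, 160, 125)] * 8 + [(20, 20, 20)] * 2)
different = color_histogram([(20, 20, 200)] * 10)
```

Because the histograms are L1 normalized, the distance is bounded by 2, which makes a fixed threshold comparable across face regions of different sizes.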
When a determination (760) is made that an anchorperson's face is present, an anchor frame is detected (762). In several embodiments, factors including (but not limited to) the L1 distance and the number of adjacent frames in which the anchor face is detected are utilized to generate a confidence score that can be used by a multi-modal segmentation process in combination with information concerning other cues to determine the likelihood of a transition indicative of a segmentation boundary.
Many news programs insert a program logo or transition animation between stories or segments. Logo appearance and position can vary unpredictably over time. In a number of embodiments, feature matching is performed between a set of logo images and frames from a video data stream. A set of logo images can be obtained by periodically crawling the websites of news organizations and/or other appropriate sources. Feature matching can also be performed between sequences of images in a transition animation and frames from a video data stream. Similarly, new transition animations can be periodically observed in video data streams generated by specific content sources and added to a library of transition animations.
Feature matching between logo images and frames of video in accordance with an embodiment of the invention is illustrated in
A specific process for performing feature matching is illustrated in
The localized features can be utilized to generate (906) global signatures and the selected frames ranked by comparing their global signatures to the global signature of the reference image. The ranking can be utilized to select (908) a set of candidate frames that are compared in a pairwise fashion (910) with the logo image. In several embodiments, the pairwise comparisons can utilize the techniques described in D. Chen, S. Tsai, V. Chandrasekhar, G. Takacs, R. Vedantham, R. Grzeszczuk, and B. Girod, “Residual enhanced visual vector as a compact signature for mobile visual search,” Signal Processing, 2012. When the pairwise comparison yields a match exceeding a predetermined threshold, a match is identified (912). As noted above, a match may represent that the candidate frame incorporates a logo and/or that the candidate frame corresponds to a frame from a transition animation. In many embodiments, the process of determining a match also involves determining a confidence metric that can also be utilized in the segmentation of a video data stream.
Although specific processes are described above with reference to
Dark frames are frequently inserted at the boundaries of commercials and hence provide another valuable visual cue for segmentation. In several embodiments, dark frames are detected by converting some or all frames in a video data stream to gray scale and comparing the mean and standard deviation of the pixel intensities. In many embodiments, a frame is determined to be a dark frame if the mean is below μb and the standard deviation is below σb. In several embodiments, values of μb=40 and σb=10 can be utilized for gray levels in the range [0, 255]. In other embodiments, any of a variety of processes can be utilized to identify dark frames in accordance with embodiments of the invention, including (but not limited to) processes that identify sequences of multiple dark frames and/or processes that provide a confidence measure that can be utilized by a multi-modal segmentation process in combination with information concerning other cues to determine the likelihood of a transition indicative of a segmentation boundary.
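The dark frame test can be sketched as follows using the example thresholds μb=40 and σb=10. The gray scale conversion weights (ITU-R BT.601 luma coefficients) and the representation of a frame as a flat list of RGB pixels are assumptions made for illustration.

```python
from statistics import fmean, pstdev

MU_B = 40.0     # mean threshold for gray levels in [0, 255]
SIGMA_B = 10.0  # standard deviation threshold


def to_gray(pixel):
    """Convert an (r, g, b) pixel to gray scale (assumed BT.601 weights)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def is_dark_frame(rgb_pixels, mu_b=MU_B, sigma_b=SIGMA_B):
    """A frame is dark if both the mean and the spread of intensity are low."""
    gray = [to_gray(p) for p in rgb_pixels]
    return fmean(gray) < mu_b and pstdev(gray) < sigma_b


# Hypothetical frames represented as flat lists of RGB pixels.
black_frame = [(5, 5, 5)] * 100
bright_frame = [(200, 200, 200)] * 100
high_contrast_frame = [(0, 0, 0)] * 50 + [(60, 60, 60)] * 50
```

The standard deviation test rejects frames such as `high_contrast_frame`, whose mean intensity is low but which still contain visible structure.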
In a number of embodiments, an audio track within a data stream can also be utilized as a source of segmentation cues. Anchorpersons commonly pause momentarily or take a long breath before introducing a new story. In several embodiments, significant pauses in an audio track are utilized as a segmentation cue. In many embodiments, a significant pause is defined as a pause in speech having a duration of 0.3 seconds or longer. In other embodiments, any of a variety of classifiers can be utilized to detect pauses indicative of a segmentation boundary in accordance with embodiments of the invention including processes that provide a confidence measure that can be utilized by a multi-modal segmentation process in combination with information concerning other cues to determine the likelihood of a transition indicative of a segmentation boundary. Pauses are not the only auditory cues that can be utilized in the detection of segmentation boundaries. In many embodiments, specific changes in tone and/or pitch can be utilized as indicative of segmentation boundaries as can musical accompaniment that is indicative of a transition to a commercial break and/or between segments.
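A minimal sketch of pause detection is shown below. It assumes that a separate speech activity detector (not shown) has already produced a list of time intervals in which speech is present, and it reports the gaps between those intervals that last 0.3 seconds or longer.

```python
MIN_PAUSE_SECONDS = 0.3  # example definition of a significant pause


def find_significant_pauses(speech_intervals, min_pause=MIN_PAUSE_SECONDS):
    """Return (pause_start, pause_end) gaps of at least min_pause seconds.

    speech_intervals: (start, end) times in seconds where speech was
    detected by an assumed upstream speech activity detector.
    """
    pauses = []
    current_end = None
    for start, end in sorted(speech_intervals):
        if current_end is not None and start - current_end >= min_pause:
            pauses.append((current_end, start))
        # Track the latest end time so overlapping intervals are handled.
        current_end = end if current_end is None else max(current_end, end)
    return pauses


# Hypothetical speech intervals: a 0.1 s gap and a 0.5 s gap.
speech = [(0.0, 4.2), (4.3, 9.0), (9.5, 15.0)]
pauses = find_significant_pauses(speech)
```

Only the 0.5 second gap qualifies as a significant pause in this example. The duration of each reported gap could also be used as a simple confidence measure.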
Although various systems and methods that utilize a variety of segmentation cues in the multi-modal segmentation of video data streams are described above with reference to
Playlist generation systems in accordance with many embodiments of the invention are configured to index sets of video segments and generate personalized playlists based upon user preferences. The user preferences can be explicit preferences specified by the user, and/or can be inferred based upon user interactions with previously recommended video segments (i.e. the user's viewing history). In many embodiments, the playlist generation system also generates playlists that are subject to time constraints in recognition of the limited time available to a user to consume content.
A playlist generation server system configured to index video segments and generate personalized playlists in accordance with an embodiment of the invention is illustrated in
The non-volatile memory 1030 can also contain a playlist generation application 1034 that configures the processor 1010 to generate personalized playlists for individual users based upon information collected by the playlist generation server system 1000 concerning user preferences and viewing histories 1036. Various processes for generating personalized video playlists in accordance with embodiments of the invention are discussed further below.
Although specific playlist generation server system implementations are described above with reference to
Metadata describing video segments can be utilized as inputs to a personalized video playlist generation system and to populate the user interfaces of playback devices with descriptive information concerning the video segments. A great deal of metadata describing a video segment can be derived from the video segment itself. Analysis of text data such as closed caption and subtitle text data can be utilized to identify relevant keywords. Analysis of visual data using techniques such as (but not limited to) text recognition, object recognition, and facial recognition can be utilized to identify the presence of keywords and/or named entities within the content. In many instances video segments can also include a metadata track that describes the content of the video segment.
Metadata describing video segments can also be obtained by matching the video segments to additional sources of relevant data. In the context of news stories, video segments can be matched to online articles related to the content of the video segment. In a number of embodiments, visual analysis is used to match portions of images associated with online articles to frames of video as an indication of the relevance of the online article. These sources of additional data (e.g. online news articles or Wikipedia pages) can be used to identify additional keywords describing the content. In addition, online articles matched to specific video segments can be utilized to generate titles for video segments and provide thumbnail images that can be used within user interfaces of playback devices. Hyperlinks to the online articles can also be provided via the user interfaces to enable a user to link to the additional content. In other contexts, any of a variety of data sources appropriate to the requirements of the specific application can be utilized in the generation of user interfaces and/or personalized playlists in accordance with embodiments of the invention.
In several embodiments, visual analysis and text analysis are utilized to match video segments to additional sources of data. A process for matching a segment of video to an online news article in accordance with an embodiment of the invention is conceptually illustrated in
In a number of embodiments, computational complexity can be reduced by initially performing text analysis to identify candidate sources of additional data. Images related to the candidate sources of additional data can then be utilized to perform visual analysis and the final ranking of the candidate sources of additional data determined based upon the combination of the text and visual analysis. In other embodiments, the text and visual analysis can be performed in alternative sequences and/or independently. Processes for performing text analysis and visual analysis to identify additional sources of data relevant to the content of video segments in accordance with embodiments of the invention are discussed further below.
In a number of embodiments, sources of text within a video segment including (but not limited to) closed caption, subtitles, text generated by automatic speech recognition processes, and text generated by text recognition (optical character recognition) processes can be utilized to annotate video segments and identify additional sources of relevant data. In the context of video segments that have a temporal relevancy component (e.g. news stories), time stamp metadata associated with additional sources of data and/or dates and/or times contained within text forming part of an additional source of data can be utilized in limiting the sources of additional data considered when determining relevancy. In many instances, the presence of common dates and/or times in text extracted from a video segment and text from an additional data source can be considered indicative of relevance.
In a number of embodiments, bag-of-words histogram comparisons enable matching of text segments with similar distributions of words. In certain embodiments, a term frequency-inverse document frequency (tf-idf) histogram intersection score S(H_a, H_b) is computed as follows:
where H_a(w) and H_b(w) are the L1-normalized histograms of the words in the two sets of words (i.e. the text from the video segment and the additional data source); and
{f(w)} is the set of estimated relative word frequencies.
In many embodiments, a candidate additional data source is considered to have been identified when the tf-idf histogram intersection score S(H_a, H_b) exceeds a predetermined threshold.
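By way of illustration only, a tf-idf weighted histogram intersection of the kind described above can be sketched as follows. The sketch assumes an inverse document frequency weight of log(1/f(w)) applied to the minimum of the two normalized histogram values for each shared word; the exact weighting, the function names, and the fallback frequency for unseen words are assumptions made for the example and are not taken from the description.

```python
import math
from collections import Counter


def l1_histogram(words):
    """Return an L1-normalized bag-of-words histogram (values sum to 1)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def tfidf_intersection(ha, hb, rel_freq, default_freq=1e-6):
    """Histogram intersection weighted by an ASSUMED idf(w) = log(1 / f(w)).

    ha, hb:    L1-normalized histograms for the two texts.
    rel_freq:  estimated relative word frequencies f(w).
    default_freq is a hypothetical fallback for words missing from rel_freq.
    """
    score = 0.0
    for w in ha.keys() & hb.keys():  # only shared words contribute
        idf = math.log(1.0 / rel_freq.get(w, default_freq))
        score += idf * min(ha[w], hb[w])
    return score


segment_hist = l1_histogram("the senate passed the budget".split())
article_hist = l1_histogram("the budget vote".split())
frequencies = {"the": 0.5, "budget": 0.01}
score = tfidf_intersection(segment_hist, article_hist, frequencies)
```

Under this assumed weighting, a rare shared word such as "budget" contributes far more to the score than a common shared word such as "the", which is the behavior a tf-idf weighting is intended to produce.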
In a number of embodiments, the process of identifying relevant sources of additional data places particular significance upon named entities. A database of named entities can be built using sources such as (but not limited to) Wikipedia, Twitter, the Stanford Named Entity Recognizer, and/or Open Calais. String searches can then be utilized to identify named entities in text extracted from a video segment and a potential source of additional data, such as an online article. In several embodiments, the presence of a predetermined number of common named entities is used to identify a source of additional data that is relevant to a video segment. In certain embodiments, the presence of five or more named entities in common is indicative of a relevant source of additional data. In other embodiments, any of a variety of processes can be utilized to determine relevancy based upon named entities including processes that utilize a variety of matching rules such as (but not limited to) number of matching named entities, number of matching named entities that are people, number of matching named entities that are places and/or combinations of numbers of matching named entities that are people and number of matching named entities that are places.
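A minimal sketch of the named entity test described above follows. It assumes the named entity database is available as a simple collection of strings and uses whole-phrase, case-insensitive string searches; the function names and the example database are hypothetical, and the default threshold of five common entities mirrors the certain embodiments mentioned above.

```python
import re


def find_named_entities(text, entity_db):
    """Return the entities from entity_db that occur in text as whole phrases."""
    lowered = text.lower()
    found = set()
    for entity in entity_db:
        # Lookarounds prevent matching inside a longer word (e.g. "Parisian").
        pattern = r"(?<!\w)" + re.escape(entity.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(entity)
    return found


def is_relevant(segment_text, article_text, entity_db, min_common=5):
    """Treat the article as relevant when enough named entities are shared."""
    common = find_named_entities(segment_text, entity_db) & find_named_entities(
        article_text, entity_db
    )
    return len(common) >= min_common, common
```

Rules that count only people or only places, as contemplated above, could be layered on top by tagging each database entry with an entity type and filtering the common set before applying the threshold.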
A process for performing text analysis of video segments to identify relevant sources of additional data in accordance with an embodiment of the invention is illustrated in
In a number of embodiments, the relevancy of additional sources of data to specific video segments can be confirmed by identifying (1208) named entities in text data describing a video segment, identifying (1210) named entities referenced in candidate additional sources of data that share common terms with the video segment, and determining (1212) that an additional source of data relates to the content of a video segment when a predetermined number of named entities are referenced in the text data extracted from the video segment and the additional source of data. As is discussed further below, named entities associated with a video segment can be identified within text data extracted from the video segment and/or by performing object detection and/or facial recognition processes with respect to frames from the video segment.
Although specific processes are described above with reference to
The frames of a video segment can contain a variety of visual information including images, faces, and/or text. In a number of embodiments, the text analysis processes similar to those described above can be augmented using relevant keywords identified through analysis of the visual information (as opposed to text data) within a video segment. In several embodiments, text recognition processes are utilized to identify text that is visually represented within a frame of video and relevant keywords can be extracted from the identified text. In a number of embodiments, additional relevant keywords can also be extracted from a video segment by performing object detection and/or facial recognition.
Text extraction processes can be used to detect and recognize letters forming words within frames in a video segment. In several embodiments, the text can be utilized to identify keywords that annotate the video segment. In the context of news stories, keywords such as (but not limited to) “breaking news” can be utilized to categorize news stories both for the purpose of detecting additional sources of data and during the generation of personalized playlists.
In a number of embodiments, text is extracted from frames of video and filtered to identify text that describes the video segment. News stories commonly include title text and identification of the title text can be useful for the purpose of incorporating the title into a user interface and/or for using keywords in the title to identify relevant additional sources of data. In many embodiments, an extracted title is provided to a search engine to identify additional sources of potentially relevant data. In the context of video segments within a specific category or vertical (e.g. news stories), the title can be provided as a query to a vertical search engine (e.g. the Google News search engine service provided by Google, Inc. of Mountain View, Calif.) to identify additional sources of potentially relevant data. In many embodiments, the ranking of the search results is utilized to determine relevancy. In several embodiments, the search results are separately scored to determine relevancy.
Processes for extracting relevant keywords from video segments for use in the annotation of video segments in accordance with embodiments of the invention are illustrated in
A process for extracting relevant keywords from frames of video using automatic text recognition in accordance with an embodiment of the invention is illustrated in
Referring again to the process 1400 shown in
Although specific processes for extracting additional relevant keywords from frames of video by performing automatic text recognition are described above with reference to
A variety of techniques are known for performing object detection including various face recognition processes. Processes for detecting anchor faces are described above with respect to video segmentation. As can readily be appreciated, recognizing the people appearing in video segments can be useful in identifying additional sources of data that are relevant to the content of the video segments. In a number of embodiments, similar processes can be utilized to identify a larger number of faces (i.e. more named entities than simply anchorpeople). In other embodiments, any of a variety of processes can be utilized to perform face recognition including processes that have high recognition precision across a large population of faces.
A process for performing face recognition based upon localized features during the annotation of a video segment in accordance with an embodiment of the invention is conceptually illustrated in
Although specific processes for annotating video segments with named entity keywords by performing automatic face recognition are described above with reference to
Video segments and additional sources of data, such as online articles, often utilize the same image, different portions of the same image, or different images of the same scene. In a number of embodiments, an image portion within one or more frames in a video segment can be matched to an image associated with additional sources of information to assist with establishing the relevancy of additional sources of data. In several embodiments, matching is performed by determining whether the frame of video contains a region that includes a geometrically and photometrically distorted version of a portion of an image obtained from the additional data source. As noted previously, processes similar to those described above with reference to
Once a set of video segments is annotated, an index can be generated using keywords extracted from the video segment and/or additional sources of data that are relevant to the content of the video segment. The resulting index and metadata can be utilized in the generation of personalized video playlists. Playlist personalization is a complex problem that can consider user preferences, viewing history, and/or story relationships in choosing the video segments that are most likely to form the set of content that is of most interest to a user. In many embodiments, processes for generating personalized playlists for users involve consideration of a recommended set of content in recognition of the limited amount of time an individual user may have to view video segments. Accordingly, processes in accordance with a number of embodiments of the invention can attempt to select a set of video segments having a combined duration less than a predetermined time period and spanning the content that is most likely to be of interest to the user. In several embodiments, the video segments can be further sorted into a preferred order. In a number of embodiments, the order can be determined based upon relevancy and/or based upon heuristics concerning sequences of content categories that make for “good television”. In certain embodiments, the process of generating playlists involves the generation of multiple playlists including a personalized playlist and “channels” of content filtered by categories such as “technology” or keywords such as “Barack Obama”. Within categories, user preferences can still be considered in the generation of the playlist. Effectively, the process for generating a personalized video playlist is simply applied to a smaller set of video segments.
In the context of news stories, processes for generating personalized playlists in accordance with many embodiments of the invention attempt to provide a comprehensive view of the day's news in a way that avoids duplicate or near-duplicate stories. Additionally, more recent video segments can receive higher weightings. Intuitively, this formulation chooses trending video segments that originated from news programs the user prefers and that are associated with categories in which the user is interested.
In many embodiments, the process of generating a personalized playlist is treated as a maximum coverage problem. A maximum coverage problem typically involves a number of sets of elements, where the sets of elements can intersect (i.e. a single element can belong to multiple sets). Solving a maximum coverage problem involves finding the fixed number of elements that cover the largest number of sets of elements. In the context of generating a personalized playlist, the elements are the video segments and video segments that relate to the same content are treated as belonging to the same set. Therefore, the concept of content coverage can be used to refer to the amount of different content covered by a set of video segments. As noted above, video segments can be compared to determine whether the content is related or unrelated. In the context of news stories, many embodiments attempt to span the major news stories of the day and an objective function for solving the maximum coverage problem can be weighted by a linear combination of several personalization factors. These factors can include (but are not limited to) explicit preferences specified by a user, personal information provided by the user and/or obtained from secondary sources including (but not limited to) online social networks, and implicit preferences obtained by analyzing a user's viewing history. Information concerning implicit preferences may be derived by analyzing a user's viewing history with respect to playlists generated by a playlist generation server system. In other embodiments, implicit preferences can be derived from additional sources of information including (but not limited to) a user's browsing activity (especially with respect to online articles relevant to video segment content), activity within an online social network, and/or viewing history with respect to video and/or audio content provided by one or more additional services.
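The selection problem described above can be illustrated with a small exhaustive search. This is a sketch under stated assumptions rather than a description of any particular embodiment: it enumerates every subset of segments, so it is only practical for a handful of segments, and the function name, the argument names, and the convention that each segment is related to itself are assumptions made for the example.

```python
from itertools import combinations


def best_playlist(durations, weights, related, time_limit, w_coverage=1.0):
    """Exhaustively pick the subset maximizing coverage plus preference weight.

    durations[i]:  length of segment i (e.g. in seconds).
    weights[i]:    personalization weight of segment i.
    related[i][j]: 1 if segments i and j relate to the same content
                   (assumed to include related[i][i] == 1).
    """
    n = len(durations)
    best_score, best_subset = 0.0, ()
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            # Skip subsets that exceed the viewing time budget.
            if sum(durations[i] for i in subset) > time_limit:
                continue
            # A segment is covered if any selected segment is related to it.
            covered = sum(
                1 for i in range(n) if any(related[i][j] for j in subset)
            )
            score = w_coverage * covered + sum(weights[i] for i in subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score
```

Because the number of subsets grows exponentially with the number of segments, a practical system would be expected to hand the same objective to an integer programming solver or to use an approximation.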
A process for generating personalized playlists from metadata describing a set of video segments based upon user preferences in accordance with an embodiment of the invention is illustrated in
Personalized playlists can be provided to playback devices, which can utilize the playlists to stream (1712), or otherwise obtain, the video segments identified in the playlist and to enable the user to interact with the video segments. In several embodiments, the playback devices and/or the playlist generation server system collect analytic data based upon user interactions with the video segments and/or additional data sources identified within the playlist. The analytic information can be utilized to improve the manner in which personalization ratings are determined for specific users so that the playlist generation process can provide more relevant content recommendations over time.
Although specific processes for performing personalized playlist generation with respect to a set of video segments based upon user preferences are described above with reference to
As is discussed in further detail below, playlist generation processes in accordance with many embodiments of the invention rely upon information concerning the relationships between the content in video segments to identify the greatest amount of information that can be conveyed within the shortest or a specified time period. In the context of video segments extracted from news programming, related video segments can be considered to be video segments that relate to the same news story. In many embodiments, care is taken when classifying two video segments relating to the same content as “related” to avoid classifying a video segment that includes updated information as related in the sense of being cumulative. In many embodiments, a video segment that contains additional information can be identified as a primary video segment and a video segment containing an earlier version of the content and/or a subset of the content can be classified as a related or cumulative video segment. In this way, a related classification can be considered hierarchical or one directional. Stated another way, the classification of a first segment as related to a second segment does not imply that the second segment is related to (cumulative of) the first segment. In many embodiments, however, only bidirectional relationships are utilized.
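The one-directional nature of the cumulative classification can be illustrated with a simple keyword containment test. The test below is an assumption made for the purposes of illustration (the description does not specify the comparison): a first segment is treated as cumulative of a second segment when most of the first segment's keywords also annotate the second segment. The function name and the 0.8 threshold are hypothetical.

```python
def is_cumulative_of(first_keywords, second_keywords, min_overlap=0.8):
    """Return True if the first segment adds little beyond the second.

    ASSUMED rule: the fraction of the first segment's keywords that also
    appear in the second segment must reach min_overlap.
    """
    first, second = set(first_keywords), set(second_keywords)
    if not first:
        return False
    overlap = len(first & second) / len(first)
    return overlap >= min_overlap


earlier_report = {"earthquake", "chile", "coast"}
updated_report = {"earthquake", "chile", "coast", "tsunami", "warning", "evacuation"}
```

With these example keyword sets, the earlier report is cumulative of the updated report, but the updated report is not cumulative of the earlier report because half of its keywords are new.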
A process for identifying whether a first video segment is cumulative of the content in a second video segment based upon keywords associated with the video segments in accordance with an embodiment of the invention is illustrated in
Although specific processes for identifying whether one video segment is cumulative of another are described above with respect to
In several embodiments, personalized playlists are generated by formalizing the problem of generating a playlist for a user as an integer linear programming optimization problem, or more specifically a maximum coverage problem, as follows:
where n is the number of today's videos,
w_coverage represents a weighting applied to the news story coverage relative to user preferences,
x is a vector including an element for each identified video segment, where for i ∈ [1 . . . n], x_i ∈ {0,1} is 1 if the ith video segment is selected,
y is a vector including an element for each identified video segment, where for i ∈ [1 . . . n], y_i ∈ {0,1} is 1 if x_i is covered by a video segment that has already been selected,
c is a vector representing a set of personalization weights c_i determined with respect to each video segment x_i based upon user preferences, and
R ∈ {0,1}^(n×n) denotes an adjacency matrix, where 1 represents a link between news stories.
In the above formulation, the duration of each news story and the time limit are represented by d_i and t, respectively. As can readily be appreciated, the above objective function maximizes a weighted combination of the coverage of the day's news stories achieved within a specified time period (w_coverage Σ_{i=1}^{n} y_i) and the user's preferences (c^T x).
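For reference, one formulation consistent with the symbol definitions above is set out below. The objective follows directly from the description; the two constraints (a total duration limit and a coverage constraint tying y to x through R) are a reconstruction from the definitions of d_i, t and R, and assume that R has ones on its diagonal so that a selected segment covers itself.

```latex
\begin{aligned}
\max_{x,\,y}\quad & w_{\mathrm{coverage}} \sum_{i=1}^{n} y_i + c^{T} x \\
\text{subject to}\quad & \sum_{i=1}^{n} d_i x_i \le t, \\
& y_i \le \sum_{j=1}^{n} R_{ij} x_j, \quad i \in [1 \ldots n], \\
& x_i \in \{0,1\},\; y_i \in \{0,1\}, \quad i \in [1 \ldots n].
\end{aligned}
```

Under these assumed constraints, y_i can only equal 1 when at least one selected segment is linked to segment i, so the first term of the objective counts the stories covered by the selection.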
In a number of embodiments, factors including (but not limited to) a user's preferences with respect to sources and/or categories of video segments (s_source, s_category), recency (s_time), and viewing history (s_history) are considered in calculating the personalization weights c. In several embodiments, viewing history (s_history) can be determined based upon the number of related news stories that were previously watched by the user. In several embodiments, processes for detecting related and/or similar stories similar to those described above with respect to
c_i = w_source · s_source(v_i) + w_category · s_category(v_i) + w_time · s_time(v_i) + w_history · s_history(v_i)
As can readily be appreciated, the weights can be selected arbitrarily and updated manually and/or automatically based upon user feedback.
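The linear combination above can be sketched as follows. The source and category scores are assumed to be looked up from per-user preference tables, the recency score is assumed to decay by half every `half_life_hours`, and the history score is assumed to be the raw count of related stories the user has already watched. These specific forms, the `Segment` fields, and the parameter names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    category: str
    age_hours: float       # time since the segment was published
    related_watched: int   # related stories the user has already watched


def personalization_weight(seg, source_prefs, category_prefs, w, half_life_hours=12.0):
    """Compute c_i as a weighted sum of source, category, time and history scores."""
    s_source = source_prefs.get(seg.source, 0.0)
    s_category = category_prefs.get(seg.category, 0.0)
    s_time = 0.5 ** (seg.age_hours / half_life_hours)  # ASSUMED recency decay
    s_history = float(seg.related_watched)             # ASSUMED history score
    return (
        w["source"] * s_source
        + w["category"] * s_category
        + w["time"] * s_time
        + w["history"] * s_history
    )


segment = Segment(source="source_a", category="technology", age_hours=12.0, related_watched=2)
weights = {"source": 1.0, "category": 2.0, "time": 1.0, "history": 0.5}
c_i = personalization_weight(segment, {"source_a": 0.8}, {"technology": 1.0}, weights)
```

Keeping the four weights in a single mapping makes it straightforward to update them manually or automatically in response to user feedback.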
In certain embodiments, s_time(v_i) and s_history(v_i) are defined as follows:
where Videos is the set of all video segments (i.e. not just the recent segments v).
The function related(v_i, w) ∈ {0,1} is 1 if video segments v_i and w are linked. In several embodiments, a process similar to any of the processes described above with respect to
Once a set of video segments is identified, a variety of choices can be made with respect to the ordering of the set of video segments to generate a playlist. In a number of embodiments, the “importance” of a video segment can be scored and utilized to determine the order in which the video segments are presented in a playlist. In several embodiments, importance can be scored based upon factors including (but not limited to) the number of related video segments. In the context of news stories, the number of related video segments within a predetermined time period can be indicative of breaking news. Therefore, the number of related video segments to a video segment within a predetermined time period can be indicative of importance. In other embodiments, any of a variety of techniques can be utilized to measure the importance of a video segment as appropriate to the requirements of specific applications. In a number of embodiments, the content of the video segments is utilized to determine the order of the video segments in a personalized video playlist. In several embodiments, sentiment analysis of metadata annotating a video segment can be utilized to estimate the sentiment of the video segment and heuristics utilized to order video segments based upon sentiment. For example, a playlist may start with the most important story. Where the story has a negative sentiment (a dispatch from a warzone), the process can select a second story that has more uplifting sentiment. As can readily be appreciated, machine learning techniques can be utilized to determine processes for ordering stories from a set of stories to create a personalized playlist as appropriate to the requirements of specific applications in accordance with embodiments of the invention.
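One way to score importance in the manner described above is to count the related video segments published within a time window and to sort the selected segments by that count. The sketch below makes that assumption; the function names, the use of hours as the time unit, and the 24 hour default window are hypothetical, and sentiment-based heuristics are not modeled.

```python
def importance(index, publish_times, related, window_hours=24.0):
    """Count related segments published within window_hours of segment index."""
    t0 = publish_times[index]
    return sum(
        1
        for j, t in enumerate(publish_times)
        if j != index and related[index][j] and abs(t - t0) <= window_hours
    )


def order_playlist(selected, publish_times, related, window_hours=24.0):
    """Order selected segment indices from most to least important.

    The sort is stable, so segments with equal importance keep the order
    in which they appear in `selected`.
    """
    return sorted(
        selected,
        key=lambda i: importance(i, publish_times, related, window_hours),
        reverse=True,
    )
```

A segment with many related segments published close together in time is placed first, which corresponds to leading the playlist with a breaking story.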
Although specific processes are described above for generating personalized video playlists using an integer linear programming optimization process, any of a variety of processes can be utilized to generate personalized video playlists using a set of video segments based upon user preferences in accordance with embodiments of the invention including processes that indirectly consider viewing history by modifying source and category weightings. Furthermore, processes in accordance with many embodiments of the invention consider other user preferences including (but not limited to) keyword and/or named entity preferences.
Personalized video playlists can be provided to a host of playback devices to enable viewing of video segments and/or additional data sources identified in the playlists. In a number of embodiments, a playback device is configured via a client application to render a user interface based upon metadata describing video segments obtained using the playlist. Playback devices can also be configured to provide a “second screen” display that can enable control of playback of video segments on another playback device and/or viewing of additional video segments and/or data related to the video segment being played back on the other playback device. As can readily be appreciated, the user interfaces that can be generated by playback devices are largely only limited by the capabilities of the playback device and the requirements of specific applications.
A playback device in accordance with an embodiment of the invention is illustrated in
Although specific playback device implementations are illustrated in
The user interface generated by a playback device based upon a personalized playlist is typically determined by the capabilities of a playback device. In many embodiments, instructions for generating a user interface can be provided to a playback device by a remote server. In several embodiments, the instructions can be in a markup and/or scripting language that can be rendered by the rendering engine of a web browser application on a computing device. In a number of embodiments, the remote server provides structured data to a client application on a playback device and the client application utilizes the structured data to populate a locally generated user interface. In other embodiments, any of a variety of approaches to generating a user interface can be utilized in accordance with an embodiment of the invention.
A user interface rendered by the rendering engine of a web browser application in accordance with an embodiment of the invention is illustrated in
In the illustrated embodiment, the player region 2002 includes user interface buttons for sharing a link to the current story 2012, skipping to the previous 2014 or next story 2016 and expressing like 2018 or dislike 2020 toward the story being played back within the player region 2002. In other embodiments, additional user interface affordances can be provided to facilitate user interaction including (but not limited to) user interface mechanisms that enable the user to select an option to follow stories related to the story currently being played back within the player region 2002.
The user interface also includes a personalized playlist 2022 filled with tiles 2024 that each include a description 2025 of a video segment intended to interest the user and an accompanying image 2026. In many embodiments, tiles 2024 in the playlist 2022 can also be easily reordered or removed. In the illustrated embodiment, the tile at the bottom of the list 2028 contains a description of the video segment being played back in the player region. The tile also contains sliders 2030 indicating categories, sources, and/or keywords for which a user has or can provide an explicit user preference. In this way, the user is prompted to modify previously provided user preference information and/or provide additional user preference information during playback of the video segment. In other embodiments, any of a variety of affordances can be utilized to directly obtain user preference information via a user interface in which video segments identified within a playlist are played back as appropriate to the requirements of specific applications.
Beneath the player region 2002, there are several menus for video segment exploration showing: video segments related to the current video segment 2032, other (recent) video segments from the same source 2034, video segments from “channels” (i.e. playlists) generated around a specific category and/or keyword(s) 2036, and news briefs 2038 (i.e. aggregations of video segments across one or more sources to provide a news summary). As can readily be appreciated, any of a variety of playlists can be generated utilizing video segment metadata annotations generated in accordance with embodiments of the invention. Various processes for generating news brief video segments in accordance with embodiments of the invention are discussed further below.
At the top of the displayed user interface 2000, there is a search bar 2040 for receiving a search query. In several embodiments, the query is executed by comparing keywords from the query to keywords contained within the segment of video content (e.g. speech, closed caption, metadata). In a number of embodiments, the query is executed by also considering the presence of keywords in additional sources of information that were determined to be related to the video segment during the process of generating the personalized playlist. As can readily be appreciated, indexes relating keywords to video segments that are constructed as part of the process of generating personalized playlists can also be utilized to generate lists of video segments in response to text based search queries in accordance with embodiments of the invention. Implementations of various video search engines in accordance with embodiments of the invention are described further below.
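An index relating keywords to video segments, and its use in answering a text query, can be sketched as follows. The segment identifiers, the whitespace tokenization of the query, and the ranking by number of matched query terms are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(segment_keywords):
    """Map each keyword to the set of segment ids annotated with it."""
    index = defaultdict(set)
    for segment_id, keywords in segment_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(segment_id)
    return index


def search(index, query):
    """Rank segments by the number of distinct query terms they match."""
    hits = defaultdict(int)
    for term in set(query.lower().split()):
        for segment_id in index.get(term, ()):
            hits[segment_id] += 1
    # Most matches first; ties are broken by segment id for determinism.
    return sorted(hits, key=lambda s: (-hits[s], s))
```

Keywords drawn from additional sources of data that were matched to a segment can be added to the same index simply by including them in that segment's keyword list.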
The displayed user interface 2000 also includes an option 2042 to enter a settings menu for adjusting preferences toward different categories of video content and/or sources of video content. A settings menu user interface in accordance with an embodiment of the invention is illustrated in
The display and input capabilities of a playback device can inform the user interface provided by the playback device. A user interface for a touch screen computing device, such as (but not limited to) a tablet computer, in accordance with an embodiment of the invention is illustrated in
In a number of embodiments, a mobile computing device such as (but not limited to) a mobile phone or tablet computer can act as a second display enabling control of playlist playback on another playback device and/or providing additional information concerning a video segment being played back on a playback device. A screen shot of a “second screen” user interface generated by a tablet computing device in accordance with an embodiment of the invention is illustrated in
A screen shot of a “second screen” user interface generated by a tablet computing device enabling control of playback of video segments identified in a personalized playlist on another playback device in accordance with an embodiment of the invention is illustrated in
Although specific user interfaces are illustrated in
The user interaction information that can be logged by a personalized playlist generation system in accordance with embodiments of the invention is typically only limited by the user interface generated by a playback device and the input modalities available to the playback device. An example of a user interaction log generated based upon user interactions with a user interface generated to enable playback of video segments identified within a personalized playlist in accordance with an embodiment of the invention is illustrated in
Although specific processes are described above with respect to the logging of user interactions with user interfaces and the use of user interaction information to continuously update and improve personalized video playlist generation, any of a variety of techniques can be utilized to infer user preferences from user interactions and incorporate the user preferences in the generation of personalized playlists as appropriate to the requirements of specific applications in accordance with embodiments of the invention.
The ability to identify related video segments enables the generation of summaries of a number of related video segments or news briefs. Text data extracted from video segments in the form of closed caption or subtitle data, or through use of automatic speech recognition, can be utilized to identify sentences that include keywords that are not present in related video segments. The portions of some or all of the related video segments in which the sentences containing the “unique” keywords occur can then be combined to provide a summary of the related video segments. In the context of news stories, the news brief can be constructed in time sequence order so that the news brief provides a sense of how a particular story evolved over time. In several embodiments, the video segments that are combined can be filtered based upon factors including (but not limited to) user preferences and/or proximity in time. In other embodiments, any of a variety of criteria can be utilized in the filtering and/or ordering of related video segments in the creation of a video summary sequence.
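The selection of sentences containing “unique” keywords can be sketched as follows. The sketch operates on sentence text only and assumes that each sentence can later be mapped back to the portion of video in which it occurs; the stop word list, the tokenization, and the function names are assumptions made for the example.

```python
import re

# Hypothetical stop word list; a real system would use a fuller list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "are"}


def keywords(text):
    """Return the lowercase non-stop-word tokens in text."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}


def build_news_brief(segments):
    """Select sentences with keywords absent from the other related segments.

    segments: list of (publish_time, [sentence, ...]) for related segments.
    Returns the selected sentences in time sequence order.
    """
    brief = []
    for idx, (publish_time, sentences) in enumerate(segments):
        # Collect every keyword used by the other related segments.
        others = set()
        for j, (_, other_sentences) in enumerate(segments):
            if j != idx:
                for sentence in other_sentences:
                    others |= keywords(sentence)
        for sentence in sentences:
            if keywords(sentence) - others:  # at least one unique keyword
                brief.append((publish_time, sentence))
    brief.sort(key=lambda item: item[0])  # stable: keeps in-segment order
    return [sentence for _, sentence in brief]
```

Sentences that are repeated verbatim across related segments contain no unique keywords and are therefore omitted, so the resulting brief emphasizes what each segment added to the story.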
A process for generating a summary of related video segments in accordance with an embodiment of the invention is illustrated in
The techniques described above for annotating video segments and utilizing the annotations to generate indexes relating keywords to video segments are not limited to the generation of personalized playlists, but can be utilized in a myriad of applications including the provision of a video search engine service. A system for accessing video segments utilizing a video search engine service in accordance with an embodiment of the invention is illustrated in
A multi-modal video search engine server system that can be utilized to index video segments and respond to search queries in accordance with an embodiment of the invention is illustrated in
The non-volatile memory 2630 can also contain a search engine application 2634 that configures the processor 2610 to generate a user interface via which a user can provide a search query. As noted above, a search query can be in the form of a text string, an image, and/or a video sequence. The search engine application can utilize the inverted index to identify video segments relevant to text queries and can utilize the processes described above for locating image portions within frames of video to identify video segments relevant to images and/or video segments provided as search queries. In a number of embodiments, relevant video segments can also be found by comparing query images or frames to images or frames of video obtained from additional data sources known to be relevant to one or more video segments. In several embodiments, text data can be extracted from images and/or video sequences provided as search queries to the search engine application and a multi-modal search can be performed utilizing the extracted text and searches for portions of images within frames of indexed video segments. As can readily be appreciated, identification of a video segment can also be utilized to identify other relevant video segments using the processes for identifying relationships between video segments described above with reference to
As can readily be appreciated, the functions of crawling, indexing, and responding to search queries can be distributed across a number of different servers in a video search engine server system. Furthermore, depending upon the number of video segments indexed, the size of the database(s) utilized to store the metadata annotations and/or the inverted index may be sufficiently large as to necessitate the splitting of the database table across multiple computing devices utilizing techniques that are well known in the provision of search engine services. Accordingly, although specific architectures for providing online video search engine services are described above with reference to
A process for generating multi-modal video search engine results in accordance with an embodiment of the invention is illustrated in
In many embodiments, video segments are scored based upon a variety of factors including the number of related stories. Analysis of news story video segments reveals that related stories tend not to form fully connected graphs. Therefore, the number of related video segments (stories) can be indicative of the importance of the video segment. Time can also be an important measure of importance: the number of related video segments published within a predetermined time period can provide an even stronger indication of the relevance of a story to a particular query. In several embodiments, the relevance of a video segment to a search query can also be ranked based upon common keywords, frequency of common keywords, and/or common images. In several embodiments, a search query that includes an image, video sequence, and/or URL can be related to sources of additional data including (but not limited to) other video segments, and/or online articles. The sources of additional data can be utilized to perform keyword expansion, and the expanded set of keywords can be utilized in scoring the relevance of a specific video segment to the search query.
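By way of illustration only, the scoring factors described above can be sketched as a weighted sum of the number of related segments, the number of related segments published within a predetermined time window, and the frequency of query keywords in the segment. The `Segment` record, the `score_segment` function, the weights, and the two-day window are assumptions made for this example and are not specified by the embodiments described herein.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


@dataclass
class Segment:
    """Hypothetical indexed video segment record (illustrative only)."""
    segment_id: str
    published: datetime
    keywords: Dict[str, int]                      # keyword -> frequency in the segment
    related_ids: List[str] = field(default_factory=list)


def score_segment(segment: Segment,
                  query_keywords: Iterable[str],
                  index: Dict[str, Segment],
                  window: timedelta = timedelta(days=2),
                  w_related: float = 1.0,
                  w_recent: float = 2.0,
                  w_keyword: float = 1.0) -> float:
    """Combine related-story count, time-windowed count, and keyword frequency.

    The weights and the window are assumed values chosen for illustration.
    """
    # Related segments that are actually present in the index.
    related = [index[r] for r in segment.related_ids if r in index]
    # Related segments published within the predetermined time period.
    recent = [r for r in related
              if abs(r.published - segment.published) <= window]
    # Summed frequency of the (possibly expanded) query keywords in the segment.
    keyword_hits = sum(segment.keywords.get(k, 0) for k in query_keywords)
    return (w_related * len(related)
            + w_recent * len(recent)
            + w_keyword * keyword_hits)
```

In this sketch, an expanded keyword set obtained from additional data sources would simply be passed as `query_keywords`; segments related within the time window contribute to both counts, so that recent coverage is weighted more heavily than older coverage.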
In a number of embodiments, search result scores can be personalized based upon similar factors to those discussed above with respect to the generation of personalized video playlists. In this way, the most relevant search result for a specific user can be informed by factors including (but not limited to) a user's preferences with respect to content source, anchor people, and/or actors. In other embodiments, video search results can be scored and/or personalized in any of a variety of ways appropriate to the requirements of specific applications.
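As a minimal sketch of one way such personalization might be applied, a base relevance score can be scaled upward when a segment's content source or anchor people match a user's stated preferences. The function name `personalize`, the dictionary keys, and the `boost` value are hypothetical and are chosen only for this example.

```python
from typing import Any, Dict


def personalize(base_score: float,
                segment_meta: Dict[str, Any],
                prefs: Dict[str, set],
                boost: float = 0.5) -> float:
    """Scale a base search score using assumed user preference sets.

    segment_meta: e.g. {"source": "...", "anchors": ["...", ...]}
    prefs:        e.g. {"sources": {...}, "anchors": {...}}
    """
    multiplier = 1.0
    # Boost segments from a content source the user prefers.
    if segment_meta.get("source") in prefs.get("sources", set()):
        multiplier += boost
    # Boost segments featuring at least one preferred anchor person.
    if prefs.get("anchors", set()) & set(segment_meta.get("anchors", [])):
        multiplier += boost
    return base_score * multiplier
```

A multiplicative adjustment is used here so that personalization reorders results without promoting segments that have no relevance to the query; other combinations are equally possible.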
In several embodiments, analytics are collected (2712) concerning user interactions with video segments selected by users. In several embodiments, metrics including (but not limited to) percentage of playback duration watched can be utilized to infer information concerning the relevancy of the video segment to the search query and update (2714) relevance parameters associated with an indexed video by a video search engine service. In other embodiments, any of a variety of analytics can be collected and utilized to improve the performance of the search results in accordance with embodiments of the invention.
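For example, the update of a relevance parameter from the percentage of playback duration watched could be sketched as an exponential moving average. The function `update_relevance`, the `learning_rate` parameter, and the choice of a moving average are assumptions made for illustration; the embodiments described herein do not prescribe a particular update rule.

```python
def update_relevance(current: float,
                     watched_seconds: float,
                     duration_seconds: float,
                     learning_rate: float = 0.1) -> float:
    """Blend the stored relevance with the fraction of the segment watched.

    Returns the unchanged value when the duration is not positive.
    """
    if duration_seconds <= 0:
        return current
    # Fraction of playback duration watched, clamped to [0, 1].
    watched_fraction = min(max(watched_seconds / duration_seconds, 0.0), 1.0)
    # Exponential moving average: recent interactions shift the estimate gradually.
    return (1.0 - learning_rate) * current + learning_rate * watched_fraction
```

Under this sketch, a segment that users consistently abandon early drifts toward a low relevance value for the associated query, while a segment that is watched to completion drifts toward one.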
Although certain specific features and aspects of personalized video playlist generation systems, multi-modal video segmentation systems, and video search engine systems have been described herein, many additional modifications and variations would be apparent to those skilled in the art. For example, the features and aspects described herein may be implemented independently, cooperatively or alternatively without deviating from the spirit of the disclosure. It is therefore to be understood that the systems and methods disclosed herein may be practiced otherwise than as specifically described. Accordingly, the scope of the invention should be determined not by the described embodiments, but by the appended claims and their equivalents.
The current application claims priority to U.S. Provisional Patent Application No. 61/978,988, filed Apr. 14, 2014, entitled “Systems and Methods for Generating Personalized Video Playlists”, to Chen et al., the disclosure of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
61/978,988 | Apr 2014 | US