This invention relates generally to online video or streaming services, and in particular to improving video recommendations by identifying and removing highly similar video thumbnails.
Online systems store, index, and make available for consumption various forms of media content to Internet users. This content may take a variety of forms; in particular, video content, including streaming video, is widely available across the Internet. Online video systems allow users to view videos uploaded by other users. Popular online content systems for videos include YouTube™.
A common feature of online video systems is the ability to recommend videos to users based on current or previously watched videos and a variety of other factors, examples of which include the video title, content, upload date, author or source, video language, user information, and inter-user connection information. These types of recommendation features take a number of different forms and are referred to by a number of different names; some examples include “Watch Next” lists or “Recommended for You” lists. These video recommendation lists generally consist of links to other videos also available on the online video system. By virtue of operation of the recommendation feature, these videos are understood by the online video system to be related in some substantial way to a current or recently-watched video and thus are referred to as “related videos”.
Video recommendations are intended to increase user engagement by encouraging users to watch more videos. Most popular online video systems generate revenue by serving advertisements before, during, or after displaying a video to a user. Increased user engagement—in other words, users watching more videos—directly translates to increased revenue for the online video system as well as for content-producers and other partners.
A persistent problem in video recommendations has to do with providing recommendations in a way that effectively interests users and encourages them to view more videos. A video recommendation in a “Watch Next” or “Recommended for You” list usually takes the form of a static image (or thumbnail) accompanied by a limited amount of text, often the video title or description. The thumbnail thus represents the related video, and can be highly determinative of user interest in the related video. Interaction with the thumbnail from a user causes the linked video to be played back. In many cases, thumbnail images displayed on the webpage of an online video system are very similar (if not identical) to one another, making it difficult for a user to decide which related video to watch next. This problem is evident across a wide range of video categories. One example is sports, where thumbnails corresponding to highlight videos of a particular sport often look nearly the same. For instance, thumbnail images for videos of two different soccer matches may both feature players scattered over a green background. To take another example, news videos featuring a particular news anchor are generally represented by thumbnails that feature the same man or woman sitting behind a desk. Although two different videos may feature the anchor speaking about completely different topics, the thumbnails look nearly identical and offer little to no utility to a user in deciding which video to select from the recommendation list. Thus, highly similar thumbnails reduce the utility of video recommendations in online video systems.
Embodiments of the invention include a system and method for improving the utility of video recommendation lists in an online content system by de-duplicating highly similar thumbnail images. A video is added by a user to a front-end server of a content system. A back-end server of the system contains a thumbnail generator, a compression module, and a de-duplication module. The thumbnail generator of the content system produces a thumbnail image representative of the video. The compression module then receives the thumbnail image from the thumbnail generator and computes a compressed representation of the thumbnail image. The compression module stores the video, the thumbnail image and its associated compressed representation in a back-end database of the content system.
Asynchronously, the content system displays videos to a user upon request as the user navigates through one or more webpages of the content system. For each video displayed to a user, the content system generates a video recommendation list including related videos that the content system determined to be relevant to the current video. For each video in the list, the de-duplication module retrieves the thumbnail image and its associated compressed representation from the back-end database. The de-duplication module then computes a measure of visual distance for each unique pair of compressed representations in the set of representations. The module compares each computed measure of visual distance against a threshold value, and distances below the threshold value are identified. Based on the set of measures of visual distance below a threshold, the module removes selected representations from the set of representations in order to reduce similarity of the thumbnail images to an acceptable level. Subsequently, the de-duplication module provides to the front-end server an identification of the videos and thumbnail images corresponding to the remaining representations. The front-end server displays the thumbnail images, each thumbnail linking to its associated video, on a webpage of the content system.
In another embodiment, subsequent to reducing similarity via selective removal of highly similar representations, the de-duplication module may itself provide the remaining representations and/or thumbnail images to the front-end server of the content system. The front-end server of the content system may then provide the received thumbnail images in a video recommendation list as part of a webpage provided to a user via a client computing device.
A typical web computing environment contains a variety of different types of media that are accessible, through many different types of computing devices, to a user accessing the Internet through a software application. This media could be news, entertainment, video (streaming or otherwise), or other types of data commonly made available on the Web. Media in the form of video may be streamed and/or uploaded by users to a content system for viewing by other users of the content system. YouTube™ is one example of a video content system available on the Internet. Such systems allow users to browse through and view videos covering a wide variety of topics.
A content system designed to serve videos to users in an online environment includes a number of hardware and logical components.
The front-end server 220 receives videos uploaded by users and allows users to browse and view uploaded videos. Videos may be fixed-length or be streamed. The thumbnail generator 230 takes as input a new or updated video and generates a thumbnail image describing the video. The thumbnail image is displayed as a link on a webpage served by the front-end server 220; a user clicking on the thumbnail image causes its corresponding video to be played by the content system 110.
The compression module 240 interfaces with the thumbnail generator 230 by receiving generated thumbnail images, and generating for each thumbnail image a compressed representation. The de-duplication module 260 takes as input a set of compressed representations, each representation corresponding to a different thumbnail image, and compares them to identify and remove highly similar thumbnail images. The back-end database stores thumbnail images and their corresponding representations.
In typical embodiments, the thumbnail generator, compression module, and de-duplication module can each communicate independently with the back-end database and retrieve thumbnail images or compressed representations as needed. Information may also be transferred between the components as required.
The content system performs video digestion to make the video available for consumption by users.
The compression module 240 takes the thumbnail image as input and performs a series of computations to produce a compressed representation corresponding to the inputted thumbnail image. In typical embodiments, the compressed representation is expressed as a feature vector containing multiple parameters. The parameters collectively describe spatial and graphical properties of the thumbnail image using fewer bits of data than would be required to store the thumbnail image itself. In one embodiment, the computations performed by the compression module 240 to produce each compressed representation include one or more dimensionality reduction or quantization steps. In some embodiments, the technique of principal component analysis may be employed to produce a compressed representation, in conjunction with the previously described techniques. Once the compressed representation has been computed, it is stored by the compression module 240 in the back-end database 250 along with its corresponding thumbnail image and the uploaded/updated video itself.
In some embodiments, the previously described video digestion process occurs asynchronously with respect to requests from the front-end server of the content system for a selection of thumbnail images to be displayed on a video page. Generation of thumbnail images and compressed representations, by the thumbnail generator and compression module respectively, may occur in real-time upon addition of new videos by users to the content system. Alternatively, the thumbnail images and compressed representations may be generated offline, for example in batch mode, to ensure availability from the moment videos are made available to users of the content system.
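By way of illustration only, the following minimal sketch shows one possible combination of a dimensionality reduction step and a quantization step. It is not the compression module 240 itself. The function name `compress_thumbnail`, the parameters `grid` and `levels`, the grayscale input format, and the choice of block averaging (rather than, for example, principal component analysis) are all assumptions made for this example.

```python
from typing import List


def compress_thumbnail(pixels: List[List[int]], grid: int = 4, levels: int = 16) -> List[int]:
    """Reduce a grayscale image (rows of 0-255 values) to grid*grid quantized block means.

    Assumes the image is at least `grid` pixels wide and `grid` pixels tall,
    so that every block contains at least one pixel.
    """
    height, width = len(pixels), len(pixels[0])
    features: List[int] = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            # Dimensionality reduction: summarize the whole block by its mean brightness.
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            mean = sum(block) / len(block)
            # Quantization: map the 0-255 mean onto one of `levels` integer buckets.
            features.append(min(levels - 1, int(mean * levels / 256)))
    return features


# Hypothetical 4x4 thumbnail: dark left half, bright right half.
image = [[0, 0, 255, 255] for _ in range(4)]
vector = compress_thumbnail(image, grid=2, levels=16)
```

In this sketch a 4x4 image of 16 pixel values is reduced to a feature vector of 4 small integers; a production embodiment would likely use richer features.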
As users navigate through the webpages or mobile application screens provided by the content system, they click on videos that are then provided by the front-end server of the content system. On some webpages or screens on which a video is displayed, the front-end server generates a list of recommended videos for further viewing. The list may be ordered by relevance, using a relevance score associated with each video that has been calculated by the content system. For each video, an associated thumbnail image link is displayed on the webpage. Highly similar videos can have similar thumbnails, reducing utility for the user. The content system 110 de-duplicates thumbnail images in order to ensure visual diversity in the recommendation list.
After collecting the appropriate set of compressed representations of the thumbnails corresponding to the identified videos, the de-duplication module 260 compares the visual distance between the compressed representations for the recommended list of videos. Visual distance, as introduced previously, is a quantitative measure of how alike two images are. In the content system, visual distance is computed between compressed representations and not between the original thumbnail images because computation of visual distance between thumbnail images would be computationally intensive, in terms of both computational cycles and storage medium access time, due to the size of each thumbnail image. More specifically, although any given computation of distance between images may not be computationally intensive, it would be computationally intensive to perform such calculations in aggregate across the entirety of the content system, including every time a list of recommended videos must be provided.
De-duplication based on comparison of compressed representations, instead of thumbnail images, offers significant performance advantages. In typical embodiments, computation of a compressed representation according to previously described techniques takes between 100 ms and 500 ms, while comparison between two such compressed representations takes approximately 1 microsecond. Therefore, given a typical recommendation list consisting of 20 thumbnail images, de-duplication of the list by comparing previously prepared compressed representations can be performed over 1,000 times faster than by comparing the thumbnail images themselves.
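The figures above can be worked through as simple arithmetic. The sketch below is one possible reading of them, under the assumption that the alternative to using previously prepared representations is to compute all 20 representations at request time; the variable names are chosen for this example only.

```python
from math import comb

LIST_SIZE = 20             # thumbnails in a typical recommendation list
COMPARE_US = 1             # approx. microseconds per pairwise comparison
COMPRESS_US_MIN = 100_000  # 100 ms per compressed representation
COMPRESS_US_MAX = 500_000  # 500 ms per compressed representation

# Number of unique pairs in a list of 20 thumbnails.
pairs = comb(LIST_SIZE, 2)

# Request-time cost when the representations were prepared in advance.
compare_total_us = pairs * COMPARE_US

# Additional cost (assumed scenario) if the representations had to be
# computed when the recommendation list is requested.
compress_total_us_min = LIST_SIZE * COMPRESS_US_MIN
compress_total_us_max = LIST_SIZE * COMPRESS_US_MAX

ratio_min = compress_total_us_min / compare_total_us
```

Under these assumptions there are 190 unique pairs, so all comparisons take roughly 190 microseconds, whereas computing the representations on demand would take between 2 and 10 seconds.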
Compressed representations are suitable for performing similarity comparisons at a vast volume without overtaxing available computing power. For example, in a typical embodiment, a visual distance may be computed between two compressed representations by taking the Euclidean distance between their feature vectors. A Euclidean distance of 0 between two representations indicates that they are associated with identical images. The greater the distance, the more dissimilar the images. This visual distance may then be used for purposes of comparison. Such a simple calculation is not possible with the original thumbnail images.
In practice, the de-duplication module 260 computes a measure of visual distance as described above for each unique pair of representations in the set of compressed representations for the recommended list of videos. Therefore, for a set of n compressed representations, the de-duplication module computes “n choose 2”, or nC2 = n(n-1)/2, quantitative measures of visual distance, each corresponding to one unique pair in the set. In other words, a measure of visual distance is computed exactly once for every unique pair of compressed representations in the set; the distance between a first and a second representation is not computed again for the same pair in the reverse order. The de-duplication module 260 then evaluates each measure of visual distance to determine if it indicates an excessive similarity between two representations. In one embodiment, this is accomplished by defining a threshold value and marking each measure of visual distance that does not exceed the threshold value.
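As a minimal sketch of the Euclidean distance computation just described, assuming the feature vectors are plain numeric sequences of equal length, the standard-library function `math.dist` may be used. The wrapper name `visual_distance` is an assumption made for this example.

```python
import math
from typing import Sequence


def visual_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    # math.dist raises ValueError if the two vectors differ in dimension.
    return math.dist(a, b)


same = visual_distance([3, 7, 1], [3, 7, 1])  # identical representations
apart = visual_distance([0, 0], [3, 4])       # dissimilar representations
```

Here `same` evaluates to 0.0 and `apart` to 5.0, consistent with the interpretation that a larger distance indicates more dissimilar images.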
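The pairwise evaluation can be sketched as follows. This is an illustrative example only: the function name `similar_pairs`, the dictionary of video identifiers to feature vectors, and the threshold value are assumptions, and Euclidean distance is used as the measure of visual distance.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def similar_pairs(
    representations: Dict[str, Sequence[float]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return every unique pair whose distance does not exceed the threshold."""
    flagged = []
    # combinations(..., 2) yields each unordered pair exactly once (n choose 2 pairs).
    for (id_a, vec_a), (id_b, vec_b) in combinations(representations.items(), 2):
        distance = math.dist(vec_a, vec_b)
        if distance <= threshold:
            flagged.append((id_a, id_b, distance))
    return flagged


# Hypothetical compressed representations for three recommended videos.
reps = {"a": [0, 0], "b": [0, 1], "c": [10, 10]}
flagged = similar_pairs(reps, threshold=2.0)
```

With three representations, three pairs are evaluated, and only the pair ("a", "b") is marked as excessively similar.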
Based on the evaluation, the de-duplication module 260 identifies a subset of measures of visual distance that are considered insufficient. As previously described, each of these measures corresponds to a pair of compressed representations. The de-duplication module 260 selectively removes from each pair one of the representations. For example, if two representations are identified as similar, only one of those representations is removed, so that at least one representation, as well as its corresponding thumbnail and its associated video, remains in consideration for inclusion in the list of related videos. This technique can be extended to alternate embodiments in which more than two compressed representations are considered too similar to one another. In such a situation, only one compressed representation will be retained. It should be noted that removal of a compressed representation and its associated thumbnail image and video only refers to removal from the recommendation list. The compressed representation, thumbnail, and video are retained in the content system for future use.
The representations are removed in such a way as to prioritize more relevant videos over less relevant videos. For example, each of two compressed representations corresponds to a related video as previously described. Of the two videos, one may be considered more relevant to the “currently watched” video than the other, for example based on the relevance score previously calculated by the content system 110. If the visual distance between the compressed representations is below the threshold value, one of the videos must be excluded from the recommendation list. The video considered less relevant will be excluded, and the more relevant video will remain in the list.
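One way the relevance-prioritized removal might be realized is a greedy pass over the candidates in descending order of relevance, keeping a candidate only if it is sufficiently distant from every candidate already kept. The greedy strategy, the function name `deduplicate`, and the tuple layout are assumptions made for this sketch rather than a description of the de-duplication module 260.

```python
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[str, float, Sequence[float]]  # (video id, relevance score, feature vector)


def deduplicate(candidates: List[Candidate], threshold: float) -> List[str]:
    """Return ids of videos kept in the recommendation list, most relevant first."""
    kept: List[Tuple[str, Sequence[float]]] = []
    # Visit more relevant videos first so they win over less relevant near-duplicates.
    for video_id, _score, vector in sorted(candidates, key=lambda c: c[1], reverse=True):
        # Keep the video only if it is farther than the threshold from every kept video.
        if all(math.dist(vector, kept_vec) > threshold for _, kept_vec in kept):
            kept.append((video_id, vector))
    return [video_id for video_id, _ in kept]


# Hypothetical candidates: two near-identical news thumbnails and one sports thumbnail.
candidates = [
    ("news2", 0.8, [0, 1]),
    ("news1", 0.9, [0, 0]),
    ("sports", 0.7, [10, 10]),
]
remaining = deduplicate(candidates, threshold=2.0)
```

In this example the less relevant of the two news videos is excluded from the list, while the excluded video itself is untouched and remains available in the content system.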
The de-duplication module 260 then returns the thumbnails corresponding to those representations and corresponding videos that have not been removed, or an identification thereof, to the front-end server 220. The front-end server 220 displays the thumbnails to one or more of the users 310.
In another embodiment, measures of visual distance corresponding to commonly occurring pairs of representations may be persisted in order to reduce computational load during subsequent iterations of the de-duplication process. These commonly-occurring measures may be retained by the de-duplication module itself, or else stored in the back-end database and retrieved as required for de-duplication purposes.
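A minimal in-memory sketch of persisting previously computed measures is shown below. The class name `DistanceCache` and its interface are hypothetical; for simplicity the sketch retains every pair it sees rather than only commonly occurring pairs, and a real embodiment might instead store the measures in the back-end database.

```python
import math
from typing import Dict, Sequence, Tuple


class DistanceCache:
    """Remembers visual distances keyed by an unordered pair of video ids."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], float] = {}
        self.computations = 0  # number of distances actually computed

    def distance(self, id_a: str, vec_a: Sequence[float],
                 id_b: str, vec_b: Sequence[float]) -> float:
        # Order the ids so that (a, b) and (b, a) share one cache entry.
        key = (id_a, id_b) if id_a <= id_b else (id_b, id_a)
        if key not in self._store:
            self._store[key] = math.dist(vec_a, vec_b)
            self.computations += 1
        return self._store[key]


cache = DistanceCache()
first = cache.distance("a", [0, 0], "b", [3, 4])
second = cache.distance("b", [3, 4], "a", [0, 0])  # served from the cache
```

The second lookup returns the stored value without recomputing the distance, which is the reduction in computational load that this embodiment is directed to.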
As described in previous embodiments, thumbnail de-duplication improves the utility of thumbnail images in video recommendation lists by reducing their similarity. Users may select from videos represented by mostly dissimilar thumbnail images, enhancing the uniqueness of each video suggestion and driving user engagement on the content system. In content systems without de-duplication, video recommendation lists will often end up having multiple videos listed with very similar thumbnails. For example, thumbnail images corresponding to news or sports videos often feature the same or a markedly similar image repeated across multiple thumbnail images, with only slight differences in size, zoom, or cropping of the thumbnail image. This greatly reduces the ability of a user to use the thumbnail image as a means to determine which video to watch next, which has the consequence, in aggregate, of reducing users' engagement with the content system.