The present disclosure relates generally to multimedia content delivery and, more specifically, to an enhanced natural language processing search engine for media content searches.
Locating video assets using natural search phrases poses challenges. Previously existing natural language processing (NLP) engines typically rely on tags associated with media content such as video assets, e.g., the titles, synopses, character names, genres, etc. of movies. However, media content is associated with a rich set of semantics. As such, merely relying on the text from the tags may lead to inaccurate search results, e.g., results that are not in the intended domain.
So that the present disclosure can be understood by those of ordinary skill in the art, a more detailed description may be had by reference to aspects of some illustrative embodiments, some of which are shown in the accompanying drawings.
In accordance with common practice the various features illustrated in the drawings may not be drawn to scale. Accordingly, the dimensions of the various features may be arbitrarily expanded or reduced for clarity. In addition, some of the drawings may not depict all of the components of a given system, method, or device. Finally, like reference numerals may be used to denote like features throughout the specification and figures.
Numerous details are described in order to provide a thorough understanding of the example embodiments shown in the drawings. However, the drawings merely show some example aspects of the present disclosure and are therefore not to be considered limiting. Those of ordinary skill in the art will appreciate that other effective aspects and/or variants do not include all of the specific details described herein. Moreover, well-known systems, methods, components, devices, and circuits have not been described in exhaustive detail so as not to obscure more pertinent aspects of the example embodiments described herein.
An enhanced natural language processing (NLP) search engine described herein solves the aforementioned problems of locating media assets using a natural search phrase. The enhanced NLP search engine ingests data from multiple sources, e.g., including not only tags and/or keywords associated with media content, but also additional data from sources such as content recognition of videos, audio, subtitles, posters, film databases, online knowledge bases, etc. Moreover, the enhanced NLP search engine learns from actual searches and views to dynamically update itself. The dynamic updates create more associations based on user inputs and/or responses. As a result, a model for the enhanced NLP search engine is retrained using the ingested data, which includes more information than conventional NLP engine models use, as well as user feedback, thus improving the correlation of data and improving the accuracy of media content search results.
In some embodiments, the model is a vector generator that creates vectors based on the trained similarities, e.g., based on the similarities among the metadata. As new similarities are added, e.g., domain specific similarities, similarities based on ingested data, and/or similarities based on user inputs, the accuracy of the model improves and the model generates more meaningful vector values for more accurate search results. As such, the solution described herein relies on creating additional descriptions at the time the content is ingested and also retraining the model as new search strings are submitted by end users. Accordingly, the additional data relevant to the media content enable users to search for media content with better results.
In accordance with various embodiments, a media content search method is performed at a device that includes a processor and a non-transitory memory, where the device hosts a natural language processing (NLP) search engine with a model pretrained to derive sentence embeddings. The method includes obtaining additional data related to media content. The method further includes providing the additional data to the model to retrain the model, including modifying parameters of the model of the NLP search engine to correlate vectors representing the additional data with the sentence embeddings derived by the model prior to the retraining. The method also includes storing the vectors for searches of the media content.
A key part of identifying content that a user attempts to describe relies on a model in natural language processing (NLP) for making logical correlations. Previously existing NLP models typically are trained for a specific language and/or use certain types of documents, e.g., publications. Such models do not perform well when searching for media content. For example, tags or titles of movies do not always use terms found in an English dictionary. Accordingly, using an NLP model trained as a spell checker for media content searches may return results that mistakenly correct a non-English movie title. An enhanced NLP engine described herein addresses the aforementioned issues by using a model that is optimized for media content searches. The enhanced NLP engine thus improves the accuracy and relevancy of media asset search results from a natural language search.
Once the ingestor 110 receives the metadata from the plurality of sources 101, the ingestor 110 sends the metadata to other components of the enhanced NLP search engine 130. In some embodiments, the enhanced NLP search engine 130 includes a model 132 that is pretrained, e.g., pretrained in one or more natural languages, and retrained and/or enhanced using the metadata received from the plurality of sources 101 via the ingestor 110. Further, in some embodiments, the model 132 is further retrained and/or enhanced using user inputs and feedback so that the improved model 132 builds vector representations of text, e.g., generating a plurality of vectors 134 (sometimes also referred to herein as “vectors 134” or “vectors repository 134” for the sake of brevity) that represent text associated with searches.
In some embodiments, the model 132 is a pretrained NLP model, e.g., Sentence BERT (SBERT). Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for NLP pre-training. BERT includes various techniques for pretraining a general purpose language representation model. The general purpose pretrained models can then be fine-tuned on smaller task-specific datasets. SBERT is a modification of the pretrained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings, which can then be compared using cosine similarity, for example. Other NLP models can be used in place of or in conjunction with SBERT for the model 132.
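For illustration only, comparing sentence embeddings by cosine similarity, as described above, can be sketched as follows. The toy four-dimensional vectors stand in for the much higher-dimensional embeddings an actual SBERT model would produce; the vector values are invented for this sketch.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "sentence embeddings"; a real SBERT model would
# produce, e.g., 384- or 768-dimensional vectors.
emb_missing_dog = np.array([0.9, 0.1, 0.0, 0.2])
emb_lost_dog    = np.array([0.8, 0.2, 0.1, 0.3])
emb_space_opera = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(emb_missing_dog, emb_lost_dog))     # high (near 1)
print(cosine_similarity(emb_missing_dog, emb_space_opera))  # low
```

Semantically close phrases such as “missing dog” and “lost dog” thus yield a higher similarity score than unrelated phrases, which is the property the vectors 134 rely on.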
As will be described in further detail below, in some embodiments, before retraining, the exemplary system 100 uses a general purpose pretrained model as the initial model for the model 132 for deriving sentence embeddings, e.g., an embedding for “missing dog”. As used herein, a sentence embedding is a collective name for a set of techniques in NLP where sentences are mapped to vectors of numbers. As such, the terms “sentence embedding”, “document embedding”, “embedding”, “vector representation of a document”, and “vector” are used interchangeably.
In some embodiments, the model 132 is retrained using additional data, e.g., using the metadata from the ingestor 110 and/or the enrichment engine 120. Once retrained using the metadata, the model 132 learns to associate a movie title such as “Lassie” with the embedding “missing dog”, e.g., by adjusting parameters such as weights of the model 132 to establish stronger correlations between “Lassie” and “missing dog”. Further, in some embodiments, certain correlations are defined during retraining so that certain search phrases, e.g., “man bitten by an insect”, have a strong correlation to certain media content, such as the movie “Spider-Man”. Additionally, in some embodiments, the enhanced NLP search engine 130 saves the retrained model 132 (e.g., saving the parameters of the model 132) and continues the retraining process as new search phrases are received from the feedback database 137.
In some embodiments, the enhanced NLP search engine 130 also saves at least a portion of the output vectors 134 from the retrained model 132 into the results database 135. For example, when a new search is performed, the search phrase is provided to the model 132 to generate an embedding, the similarity or uniqueness of the embedding relative to the existing document embeddings is evaluated, parameters are adjusted, and the closest matches are returned as the results. In another example, upon a good search result, e.g., when a specific movie is selected from a search result, the search phrase and the selected movie title are also used to retrain the model 132. With the continued retraining process, the enhanced NLP search engine 130 improves the model 132 for media content searches with each item of ingested metadata and each user input.
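A minimal sketch of the search flow described above, assuming the stored document embeddings (the vectors 134) are held in a simple in-memory dictionary; the titles and vector values are toy data for this illustration:

```python
import numpy as np

def top_matches(query_vec, catalog, k=2):
    """Rank catalog titles by cosine similarity to the query embedding
    and return the k closest matches."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(catalog.items(), key=lambda kv: cos(query_vec, kv[1]), reverse=True)
    return [title for title, _ in scored[:k]]

# Hypothetical stored document embeddings standing in for the vectors 134.
catalog = {
    "Lassie":     np.array([0.9, 0.1, 0.1]),
    "Spider-Man": np.array([0.1, 0.9, 0.2]),
    "Jaws":       np.array([0.2, 0.3, 0.9]),
}
query = np.array([0.85, 0.15, 0.1])  # toy embedding for "missing dog"
print(top_matches(query, catalog, k=1))  # ['Lassie']
```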
In some embodiments, the ingestor 110 and/or an enrichment engine 120 of the enhanced NLP search engine process additional data about the media content to create more correlations in the model 132. For example, the ingestor 110 and/or the enrichment engine 120 can obtain the media content from an origin 105-1, e.g., obtaining videos, audio, and/or text. In particular, in some embodiments, when analyzing a movie, the ingestor 110 and/or the enrichment engine 120 segment the movie into chapters. In some embodiments, each chapter is delimited by a duration, a logical scene, a change of music, and/or upon identifying black frames, etc. In some embodiments, the ingestor 110 and/or the enrichment engine 120 then process each chapter's audio and/or subtitles to generate short descriptions of the scene (e.g., scene summaries) as the additional data. In another example, the ingestor 110 and/or the enrichment engine 120 can obtain movie posters 105-2, e.g., processing images of movie posters, extracting text from the processed movie posters, and generating poster descriptions as the additional data.
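As one illustration of segmenting upon identifying black frames, the sketch below flags frames with near-zero mean luminance as candidate chapter boundaries. The threshold value and the toy frames are assumptions for this example.

```python
import numpy as np

def chapter_boundaries(frames, black_threshold=10.0):
    """Return indices of frames whose mean luminance falls below the
    threshold, treating each as a candidate chapter boundary."""
    return [i for i, f in enumerate(frames) if float(np.mean(f)) < black_threshold]

# Toy 8x8 grayscale "frames": two bright scenes separated by a black frame.
bright = np.full((8, 8), 128.0)
black = np.zeros((8, 8))
frames = [bright, bright, black, bright]
print(chapter_boundaries(frames))  # [2]
```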
In some embodiments, to generate the additional text description, the enrichment engine 120 includes several sub engines, e.g., sub engine 1 122-1, sub engine 2 122-2, etc., collectively referred to hereinafter as the sub engines 122. For example, sub engine 1 122-1 can be configured to process text, sub engine 2 122-2 can be configured to process images, and another sub engine (not shown) can be configured to process videos and perform tasks such as extracting context from videos, etc. In another example, sub engine 1 122-1 as a scene summary sub engine can be configured to segment movies into chapters and generate scene summaries, and sub engine 2 122-2 as a poster description sub engine can be configured to process movie posters and generate poster descriptions, etc. In some embodiments, the sub engines 122 receive the additional data (e.g., the multimedia content from the origin 105-1 and/or the movie posters 105-2) from the ingestor 110 and generate the additional text description for updating the model 132 of the enhanced NLP search engine 130, e.g., generating more vectors and/or enhancing the correlations in the model. Using the additional text description derived from the additional data to fine tune the model thus enables better media content search results.
In some embodiments, the enhanced NLP search engine 130 stores certain search results in a results database 135 and a results processor 139 processes the results before sending the results to a client device 140 for rendering, e.g., segmenting, categorizing, filtering, and/or ranking the results. In some embodiments, the enhanced NLP search engine 130 maintains a feedback database 137 for storing user feedback from the client device 140, e.g., search strings, clicks and/or playbacks of the selected item indicating search result selections, etc. For example, an actual playback of a media content item in the search result for a duration (e.g., longer than a few seconds) indicates a good result and potentially new or revised correlations in the model 132. The feedback data in the feedback database 137 allow the enhanced NLP search engine 130 to learn from the actual searches and views to dynamically update the model 132 and/or create more associations within the model 132 based on the user responses, e.g., generating the vectors 134 and/or updating associations for the model 132. The generated vectors 134 and/or updated correlations in the model 132 allow better media content search results.
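The feedback selection rule described above, e.g., treating a playback longer than a few seconds as a good result, could be sketched as follows. The event fields `query`, `title`, and `playback_seconds` are hypothetical names chosen for this illustration.

```python
def positive_feedback_pairs(events, min_seconds=5.0):
    """Keep (search phrase, selected title) pairs where the user actually
    watched the selected item for longer than a few seconds."""
    return [(e["query"], e["title"]) for e in events
            if e.get("playback_seconds", 0.0) > min_seconds]

events = [
    {"query": "missing dog", "title": "Lassie", "playback_seconds": 600.0},
    {"query": "missing dog", "title": "Jaws", "playback_seconds": 2.0},
]
print(positive_feedback_pairs(events))  # [('missing dog', 'Lassie')]
```

Pairs surviving this filter would then be candidates for new or strengthened correlations in the model 132.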
In some embodiments, a results processor 139 obtains the results from the enhanced NLP search engine 130, e.g., retrieving them from the results database 135, and prepares the results for the client device 140, e.g., segmenting, categorizing, ranking, and/or filtering the results. In some embodiments, because the result list can be very long, e.g., with many results related to the search phrase, the results processor 139 analyzes the common groupings among the results and dynamically re-groups the list according to detected categories. For example, the results processor 139 can group the search results into categories such as crime movies, filmed in NYC, released in the 90's, released in the 20's, etc. The grouping helps the user quickly refine the search by selecting the relevant group.
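A minimal sketch of the dynamic re-grouping step, assuming each result item carries a list of detected categories (the titles and category labels are toy data):

```python
from collections import defaultdict

def group_results(results):
    """Re-group a flat result list by each item's detected categories."""
    groups = defaultdict(list)
    for item in results:
        for category in item["categories"]:
            groups[category].append(item["title"])
    return dict(groups)

results = [
    {"title": "Heat",        "categories": ["crime movies", "released in 90's"]},
    {"title": "Taxi Driver", "categories": ["crime movies", "filmed in NYC"]},
]
print(group_results(results)["crime movies"])  # ['Heat', 'Taxi Driver']
```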
It should be noted that the components are represented in the exemplary media content search system 100 for illustrative purposes. Other configurations can be used and/or included in the exemplary media content search system 100. Further, components can be divided, combined, and/or re-configured to perform the functions described herein. For example, at least a portion of the results processor 139 can be part of the enhanced NLP search engine 130, such that the search results returned by the enhanced NLP search engine 130 are segmented, categorized, ranked, and/or filtered. In another example, the ingestor 110 can be a part of the enhanced NLP search engine 130 or a separate component (e.g., on a separate and/or distinct device) that provides the ingested media content and/or media content metadata to the enhanced NLP search engine 130. In another example, each of the components, e.g., the ingestor 110, the enrichment engine 120, the model 132, the vectors 134, the results database 135, and/or the feedback database 137, can reside on the same server or be distributed over multiple distinct servers. Various features of implementations described herein with reference to
For example, movies from different genres are assigned different weights when being associated with the release dates domain 210-2. As such, using the domain information, a search for “new releases” can return a list of newly released movies and the newest released movie in a series of titles, e.g., the latest movie in “Spider-Man”, “Spider-Man 2”, “Spider-Man 3” series, would be returned. Similarly, movies with famous cast members, e.g., on the front page of multiple recent news outlets, are assigned higher weights when being associated with the casts domain 210-1. As such, when searching for movies based on the name of a cast member, the movies with the cast member mentioned in recent news would be closer to the top of the search results. In another example, the box office number domain 210-N can be used to locate movies that have high box office numbers.
Once the domain information is captured in the model, when a search string is a combination of keywords from multiple domains, the enhanced NLP search engine can locate media assets based on the associations with the multiple domains. When a search string is a combination of keyword searches, e.g., “Morgan Freeman has Superpower”, previously existing search engines often have difficulties separating the keywords in the search string and merging the search results from different domains. In contrast, using the domain information added in the model, the enhanced NLP search engine can locate movies with “Morgan Freeman” as a cast member according to the casts domain 210-1, merge those with results according to a different domain in the vector space 220, e.g., any movies related to superpowers, including God, from a semantic match, and possibly rank by the release dates domain 210-2 and/or the box office domain 210-N to generate search results that are closer to the user's intention.
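The multi-domain merge described above could be sketched as follows, with toy cast, semantic, and release-date indices standing in for the casts domain 210-1, the semantic vector space 220, and the release dates domain 210-2 (the indices are invented for this example):

```python
def multi_domain_search(cast_index, semantic_hits, release_dates, cast, theme):
    """Intersect a cast-domain match with semantic-domain hits, then rank
    the merged results by release date (newest first)."""
    cast_titles = set(cast_index.get(cast, []))
    merged = [t for t in semantic_hits.get(theme, []) if t in cast_titles]
    return sorted(merged, key=lambda t: release_dates[t], reverse=True)

cast_index = {"Morgan Freeman": ["Bruce Almighty", "Se7en", "Lucy"]}
semantic_hits = {"superpower": ["Bruce Almighty", "Lucy", "Spider-Man"]}
release_dates = {"Bruce Almighty": 2003, "Se7en": 1995, "Lucy": 2014, "Spider-Man": 2002}
print(multi_domain_search(cast_index, semantic_hits, release_dates,
                          "Morgan Freeman", "superpower"))
# ['Lucy', 'Bruce Almighty']
```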
Once a movie such as “Lassie” is ingested and/or processed by the enrichment engine, additional vectors representing the additional descriptions such as “lost dog”, “runaway dog”, “missing dog” are added along with the vector representing “Lassie” to the vector space 300A and a vector space 300B in
For example, by analyzing objects in a series of videos, the enrichment engine segments the series into chapters or episodes, e.g., chapter 1 410-1, chapter 2 410-2, . . . , chapter N 410-N, collectively referred to hereinafter as the chapters 410. In some embodiments, the enrichment engine uses any image processing techniques to identify objects in each chapter 410, e.g., identifying object 1, object 2, object 3, . . . in chapter 1 410-1, identifying object a, object b, object c, . . . in chapter 2 410-2, and/or identifying object A, object B, object C, . . . in chapter 3 410-3, etc. Further, using any image processing techniques, the enrichment engine labels the identified objects with tags, e.g., generating tag 1, tag 2, tag 3, . . . for object 1, object 2, object 3, etc., generating tag a, tag b, tag c, . . . for object a, object b, object c, etc., and/or generating tag A, tag B, tag C, . . . for object A, object B, object C, etc. Further, in some embodiments, the enrichment engine applies filters to remove the metadata that are associated with similar scene descriptions, e.g., removing tag 2, tag c, and tag a during the filtering process. Additionally, in some embodiments, weights are added that are based on the number of similar descriptions, the similarity of the descriptions to the existing metadata descriptions, and/or the uniqueness of the descriptions as compared to other descriptions that exist in the entire corpus (the uniqueness relative to the vectors 134 in
In some embodiments, the tags along with the weights are added to the model 132 (
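For illustration, the filtering of duplicate tags and the rarity-based weighting described above could be sketched as follows. This uses a simplified count-based notion of uniqueness; a real system would compare embeddings rather than exact tag strings.

```python
from collections import Counter

def filter_and_weight_tags(chapter_tags, corpus_tags):
    """Drop duplicate tags within a chapter, then weight each surviving
    tag by its rarity across the whole corpus (rarer = more unique)."""
    corpus_counts = Counter(corpus_tags)
    unique = list(dict.fromkeys(chapter_tags))  # de-duplicate, keep order
    total = len(corpus_tags)
    return {tag: 1.0 - corpus_counts[tag] / total for tag in unique}

corpus = ["dog", "car", "dog", "dog", "house"]
weights = filter_and_weight_tags(["dog", "dog", "house"], corpus)
print(weights["house"] > weights["dog"])  # True: "house" is rarer
```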
In some embodiments, the enhanced NLP search engine 130 segments the video assets into chapters 510 (e.g., chapter 1 510-1, chapter 2 510-2, . . . , chapter N 510-N) as described above with reference to
As shown in
The method 700 begins with the enhanced NLP search engine obtaining additional data related to media content as represented by block 720. In some embodiments, as represented by block 722, the additional data related to the media content include one or more of posters, objects in the videos, scene positions in the videos, casts, release dates, box office numbers, news, and social media postings. For example, in
As represented by block 724, in some embodiments, to extract the metadata from the information received from the sources and/or the additional sources, obtaining the additional data related to the media content includes dividing videos into chapters and obtaining one or more of audio data and subtitle data corresponding to each of the chapters, and generating descriptions of the videos as the additional data based on one or more of the audio data and the subtitle data. For example, as shown in
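A sketch of generating per-chapter descriptions from subtitle data, assuming subtitles arrive as (timestamp, text) pairs and chapters are time ranges; all values are toy data for this illustration:

```python
def chapter_descriptions(subtitles, chapter_bounds):
    """Collect subtitle text falling inside each chapter's time range
    into a short per-chapter description."""
    descriptions = []
    for start, end in chapter_bounds:
        lines = [text for t, text in subtitles if start <= t < end]
        descriptions.append(" ".join(lines))
    return descriptions

# (timestamp in seconds, subtitle text)
subtitles = [(5, "Where is Lassie?"), (70, "She ran away!"), (130, "Found her.")]
chapters = [(0, 60), (60, 120), (120, 180)]
print(chapter_descriptions(subtitles, chapters))
# ['Where is Lassie?', 'She ran away!', 'Found her.']
```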
In some embodiments, to extract the metadata from the information received from the sources and/or the additional sources, as represented by block 726, obtaining the additional data related to the media content includes ingesting videos to identify objects in the videos, generating metadata associated with the objects, and extracting descriptions from the metadata associated with the object as the additional data. For example, as shown in
The method 700 continues, as represented by block 730, with the enhanced NLP search engine providing the additional data to the model to retrain the model, including modifying parameters of the model of the NLP search engine to correlate vectors representing the additional data with the sentence embeddings derived by the model prior to the retraining. In some embodiments, as represented by block 732, modifying the parameters of the model includes identifying a domain in the additional data, and modifying the parameters of the model to correlate the vectors to the domain. For example, in
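As a highly simplified sketch of the correlation step above, the snippet below nudges a title's vector toward a query embedding so their cosine similarity increases. An actual retraining would adjust model weights via gradient descent on a similarity loss rather than edit stored vectors directly; the vectors and learning rate here are toy values.

```python
import numpy as np

def correlate(title_vec, query_vec, lr=0.5):
    """One toy 'retraining' step: move the title's vector toward the
    query embedding so their cosine similarity rises."""
    return title_vec + lr * (query_vec - title_vec)

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

lassie = np.array([0.2, 0.9])       # toy vector for "Lassie"
missing_dog = np.array([0.9, 0.1])  # toy embedding for "missing dog"
before = cos(lassie, missing_dog)
lassie = correlate(lassie, missing_dog)
after = cos(lassie, missing_dog)
print(after > before)  # True
```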
In some embodiments, as represented by block 734, modifying the parameters of the model includes determining a similarity score for a respective description relative to descriptions derived from the additional data, and updating the parameters based on the similarity score. In some embodiments, as represented by block 736, modifying the parameters of the model includes determining a uniqueness score for a respective description relative to descriptions derived from the additional data, and updating the parameters based on the uniqueness score. For example, in
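One way the similarity and uniqueness scores could be computed relative to the existing corpus of description vectors is sketched below. Defining uniqueness as the complement of the best cosine match is an assumption made for this example.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_and_uniqueness(desc_vec, corpus_vecs):
    """Similarity score = best cosine match against existing description
    vectors; uniqueness score = its complement."""
    similarity = max(cos(desc_vec, v) for v in corpus_vecs)
    return similarity, 1.0 - similarity

corpus_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
sim, uniq = similarity_and_uniqueness(np.array([1.0, 0.0]), corpus_vecs)
print(sim, uniq)  # an exact duplicate has zero uniqueness
```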
Turning to
For example, as shown in
Still referring to
In some embodiments, the communication buses 804 include circuitry that interconnects and controls communications between system components. The memory 806 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices; and, in some embodiments, includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. The memory 806 optionally includes one or more storage devices remotely located from the CPU(s) 802. The memory 806 comprises a non-transitory computer readable storage medium. Moreover, in some embodiments, the memory 806 or the non-transitory computer readable storage medium of the memory 806 stores the following programs, modules and data structures, or a subset thereof, including an optional operating system 830, a storage module 833, an ingestor 840, an enrichment engine 850, and a results processor 860. In some embodiments, one or more instructions are included in a combination of logic and non-transitory memory. The operating system 830 includes procedures for handling various basic system services and for performing hardware dependent tasks.
In some embodiments, the storage module 833 stores parameters of a model 835 (e.g., the model 132,
In some embodiments, ingestor 840 (e.g., the ingestor 110,
In some embodiments, the enrichment engine 850 (e.g., the enrichment engine 120 in
In some embodiments, the results processor 860 (e.g., the results processor 139,
Although the storage module 833, the ingestor 840, the enrichment engine 850, and the results processor 860 are illustrated as residing on a single computing device 800, it should be understood that any combination of the storage module 833, the ingestor 840, the enrichment engine 850, and the results processor 860 can reside in separate computing devices in various embodiments. For example, in some embodiments, each of the storage module 833, the ingestor 840, the enrichment engine 850, and the results processor 860 resides on a separate computing device.
Moreover,
While various aspects of implementations within the scope of the appended claims are described above, it should be apparent that the various features of implementations described above may be embodied in a wide variety of forms and that any specific structure and/or function described above is merely illustrative. Based on the present disclosure one skilled in the art should appreciate that an aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method may be practiced using any number of the aspects set forth herein. In addition, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to or other than one or more of the aspects set forth herein.
It will also be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first device could be termed a second device, and, similarly, a second device could be termed a first device, without changing the meaning of the description, so long as all occurrences of the “first device” are renamed consistently and all occurrences of the “second device” are renamed consistently. The first device and the second device are both devices, but they are not the same device.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the claims. As used in the description of the embodiments and the appended claims, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in accordance with a determination” or “in response to detecting”, that a stated condition precedent is true, depending on the context. Similarly, the phrase “if it is determined [that a stated condition precedent is true]” or “if [a stated condition precedent is true]” or “when [a stated condition precedent is true]” may be construed to mean “upon determining” or “in response to determining” or “in accordance with a determination” or “upon detecting” or “in response to detecting” that the stated condition precedent is true, depending on the context.