The embodiments herein generally relate to the field of digital media content and, more particularly, to computer-implemented digital media content search and recommendation.
The volume of digital media content available on the internet is growing rapidly, and recommendation systems play an important role in determining who will consume which content and how. From Amazon's product recommendations to Netflix's movie recommendations, such systems influence what products people buy and what movies they watch. Given their importance, there is an increasing focus on developing intelligent recommendation systems that can guide people in making choices based on their interests.
Content providers deploy recommendation systems to help people discover content of interest to them. Media recommendation is a field where the system can recommend media items based either on view history or on a specific query. Most media recommendation systems today employ mainly two techniques: one is like-based and the other is static metadata-based.
In a like-based system, media items are related to one another based on whether they are liked by the same person. If two movies are liked by the same person, and this is observed for a large number of people, then it is deduced that those two movies have one or more common attributes and that they may appeal to the same taste.
In a metadata-based system, items are tagged with metadata (attributes) to enable cataloguing and searching. The metadata is created statically at the time of cataloguing and does not evolve with time. For example, a movie can be tagged by the content provider as belonging to the “action” genre and, by that definition, it can be related to other movies which also belong to the “action” genre.
The relevance of recommendations from a like-based system generally improves with time as viewing history accumulates. However, relevance may be adversely affected if the disparate viewing histories of a large number of people are combined to generate recommendations. In such cases, a like-based system may not always capture the attributes of media correctly. For example, a horror movie might get related to a science fiction movie just because they have the same actors. Furthermore, like-based recommendation systems do not provide rich search capabilities.
Static metadata-based systems offer better search capabilities than like-based systems; however, they have their own drawbacks. First, the metadata is created by a few individuals, and hence the choice of metadata may be subjective and may not represent the views of a larger audience. For example, a critic may classify a movie as belonging to the “Action” genre, whereas other viewers may classify it as “Comedy”, given the combination of action and comedy content in the movie. Second, the richness of the metadata depends on the creativity of the metadata designer. For example, a metadata designer may categorize a movie's genre only as “Action”, although it could be further subcategorized as “spy”, “war”, or “comedy” to enable more refined content search. A static metadata-based recommendation system does not evolve with time and, most importantly, does not accommodate the views of the end users.
The embodiments of this invention are illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
The following description and the drawings sufficiently illustrate specific embodiments to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. Embodiments set forth in the claims encompass all available equivalents of those claims. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein can be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
The embodiments discussed below include systems and methods that provide a review-based digital media content search and recommendation system. According to examples of the preferred embodiments, the digital media content comprises movies. Another object of the present invention is to enhance the relevance of the recommendation results by dynamically discovering attributes of the digital media content. Yet another object of the present invention is to vastly improve user experience by making it easier for users to find their desired digital media content.
Referring now to the drawings, and more particularly to
Referring now to
In an embodiment, the training phase 101 starts with the review collection system 910 configured to collect review data for all the movies for which reviews are available in public domains (110). The public domains can be at least one of IMDb, Rotten Tomatoes, and the like. A database comprising the collection of all the movies along with their reviews is thereby created, and is referred to hereinafter as the Global movie set 111. The data for the Global movie set 111 is saved in the global movie set database 941.
From the Global movie set 111, a plurality of movies is selected to create a second database of movies and their reviews (120). This second database is referred to hereinafter as the Training movie set 121 and is used to train the system 901. The data for the Training movie set 121 is saved in the training movie set database 942. The number of items in the Training movie set 121 can be less than or equal to the number of items in the Global movie set 111. The Global movie set 111 is updated as soon as reviews of a new movie item are added in any of the considered public domains. However, the Training movie set 121 is updated intermittently, and the update interval can be configured by the system.
In an embodiment, the review processing and attribute tagging system 920 is configured to process the reviews in the Training movie set 121 to determine the most talked-about attributes, and thereafter to create a Global dictionary 131 and one or more attribute-specific dictionaries (130).
In the tagging phase 102, the most talked-about attributes for each movie in the Global movie set 111 are identified (140) and then each movie in the Global movie set 111 is tagged by the review processing and attribute tagging system 920 with the corresponding attributes, identified in step 140 (150).
In the search and recommendation phase 103, movies considered relevant for a user are identified and recommended to the user by the search and recommendation system 930. In an embodiment, the process begins by fetching a plurality of movie data from the user (160). The movie data from the user is fetched either by accessing the user's view history or by processing the keywords input by the user as a search query. The system then searches for similar or relevant movies within the Global movie set 111 by matching attributes of the movie data fetched from the user in step 160 individually with attributes of each movie in the Global movie set 111 (170). The movies identified as similar or relevant in step 170 are recommended to the user on his/her media access device through a web application. The media access device can be, but is not limited to, one of the following devices: a smart phone, a laptop, a smart TV, a desktop, etc.
Referring now to
After cleaning the text 211, n-gram collocation lists are created (230). This is done by using the collocation-finding algorithms of the Natural Language Toolkit (NLTK) Python library. The collocation algorithm finds each order of n-gram separately; e.g., bi-grams are collocations of two words identified based on how often these words occur together. According to an example of the preferred embodiment, the frequency filter was set to six occurrences, which means that collocations are picked up only if they occur more than five times in the text. Each order of n-gram is saved as a separate list, and the list also includes the frequency of occurrence of each attribute.
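By way of a non-limiting illustration, the frequency filter described above can be sketched as follows. The sketch is a plain-Python frequency count that stands in for the NLTK collocation finders; it does not reproduce NLTK's association measures. The function name `ngram_collocations`, the regular-expression tokenizer, and the sample text are assumptions made for illustration only.

```python
from collections import Counter
import re

MIN_OCCURRENCES = 6  # filter value stated in the text: keep n-grams seen more than five times


def ngram_collocations(text, n, min_occurrences=MIN_OCCURRENCES):
    """Return {n-gram tuple: frequency} for n-grams seen at least min_occurrences times."""
    # Hypothetical tokenizer: lower-case words made of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of length n over the word list and count each window.
    counts = Counter(zip(*(words[i:] for i in range(n))))
    return {gram: freq for gram, freq in counts.items() if freq >= min_occurrences}


# Illustrative cleaned review text (invented for this sketch).
cleaned_text = "car chase " * 6 + "slow start " * 2
bigrams = ngram_collocations(cleaned_text, 2)
```

In this sample only the bi-gram ("car", "chase") reaches six occurrences, so it is the only collocation retained; each order of n-gram would be produced by a separate call with a different `n`.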
Following lists are created for cleaning up the attributes:
N-gram collocation lists are then created by identifying most frequently occurring collocations of bi-grams, tri-grams . . . n-grams (330). These attributes are n-grams as already described with reference to
Further, these genre specific attributes are compared with the Global dictionary 131 to determine the importance of each attribute to each genre through an algorithm called “term frequency-inverse document frequency” (TF-IDF). If the specific attribute is not listed in the Global dictionary 131, then it is discarded. If it exists, then its score is calculated (340) based on the following formula:
Attribute_Score = (No. of occurrences in genre-specific list) / (No. of occurrences in Global dictionary)
Lists of cleaned collocations along with their scores are saved as Genre dictionary 311 for each genre and for each n-gram.
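As a minimal sketch of the scoring step (340), assuming that both the genre-specific list and the Global dictionary 131 are available as mappings from an n-gram to its number of occurrences, the Attribute_Score ratio and the discard rule could be expressed as below. The function name `score_genre_attributes` and the sample counts are hypothetical.

```python
def score_genre_attributes(genre_counts, global_counts):
    """Score each genre attribute as genre occurrences / Global dictionary occurrences."""
    scores = {}
    for attribute, genre_occurrences in genre_counts.items():
        global_occurrences = global_counts.get(attribute, 0)
        if global_occurrences == 0:
            # Attribute is not listed in the Global dictionary: discard it.
            continue
        scores[attribute] = genre_occurrences / global_occurrences
    return scores


# Invented sample data for illustration.
global_dictionary = {("car", "chase"): 40, ("secret", "agent"): 10}
action_counts = {("car", "chase"): 30, ("secret", "agent"): 5, ("tax", "return"): 7}
genre_dictionary = score_genre_attributes(action_counts, global_dictionary)
```

Here ("tax", "return") is dropped because it does not appear in the Global dictionary, while the remaining attributes receive scores of 0.75 and 0.5.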
Referring now to
In an embodiment, each item of the Genre dictionary 311 for a particular genre is searched for words that match items in the sub-genre word list 421 (430). Each matched item of the Genre dictionary 311 is listed in the Sub-genre dictionary 411 for that particular sub-genre.
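The matching step (430) can be illustrated with the following sketch, which assumes that a Genre dictionary maps n-gram tuples to scores and that a sub-genre word list is a simple collection of words. The function name `build_subgenre_dictionary` and the sample entries are assumptions for illustration.

```python
def build_subgenre_dictionary(genre_dictionary, subgenre_words):
    """Keep the genre attributes that contain at least one word from the sub-genre word list."""
    wanted = {word.lower() for word in subgenre_words}
    return {
        attribute: score
        for attribute, score in genre_dictionary.items()
        if wanted.intersection(word.lower() for word in attribute)
    }


# Invented sample data for illustration.
action_dictionary = {
    ("secret", "agent"): 0.5,
    ("car", "chase"): 0.75,
    ("world", "war"): 0.9,
}
spy_words = ["agent", "spy", "espionage"]
spy_dictionary = build_subgenre_dictionary(action_dictionary, spy_words)
```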
N-gram collocation lists are created by identifying the most frequently occurring collocations of bi-grams, tri-grams . . . n-grams (530). The number of occurrences of each collocation is also recorded. The collocations are then compared with the Global dictionary 131 and scored according to the TF-IDF algorithm (540). The scoring is done with the following formula:
Attribute_Score = (No. of occurrences in that movie) / (No. of occurrences in Global dictionary)
The attribute score is again normalized for each movie, such that the sum of all attribute scores for any movie is 1. This procedure is performed for each movie in the Global movie set 111, and the attribute lists are saved along with the number of occurrences and the attribute score. This list is saved as the Movie attributes list 551 (550), and every movie in the Global movie set 111 is tagged with its corresponding Movie attributes list 551.
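A minimal sketch of the per-movie scoring and normalization follows. It assumes that attributes absent from the Global dictionary 131 are skipped, as in the genre-level step; the text does not state this for the movie-level step, so it is an assumption. The function name `movie_attribute_list` and the sample counts are hypothetical.

```python
def movie_attribute_list(movie_counts, global_counts):
    """Score each movie attribute against the Global dictionary and normalize the scores to sum to 1."""
    raw_scores = {}
    for attribute, occurrences in movie_counts.items():
        global_occurrences = global_counts.get(attribute, 0)
        if global_occurrences:  # assumption: skip attributes missing from the Global dictionary
            raw_scores[attribute] = occurrences / global_occurrences
    total = sum(raw_scores.values())
    if total == 0:
        return {}
    # Store the occurrence count alongside the normalized score, as the text describes.
    return {
        attribute: {"occurrences": movie_counts[attribute], "score": raw / total}
        for attribute, raw in raw_scores.items()
    }


# Invented sample data for illustration.
global_counts = {("car", "chase"): 40, ("secret", "agent"): 10}
movie_counts = {("car", "chase"): 8, ("secret", "agent"): 6}
attributes = movie_attribute_list(movie_counts, global_counts)
```

With these sample counts the raw scores are 0.2 and 0.6, which normalize to approximately 0.25 and 0.75.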
Referring now to
In addition, a polarization score is calculated and recorded for each movie (640), the polarization score being a measure of how confident the system is in the score and how polarized the movie is towards a single genre.
Where total_movie_attrib_occ_found is the summation of occurrences for all the attributes in that movie.
Referring now to
According to the examples of the preferred embodiment, another attribute for movies is Movie Sentiment. The method for identifying the one or more sentiments associated with a particular movie is described hereafter.
Following lists are made for deducing the sentiments of a movie:
Yet another attribute for movies according to the examples of the preferred embodiment is Movie Rating. The method for finding the rating of a particular movie is described hereafter. Following lists are made for finding the rating of the movie:
Each bi-gram item of the Movie attributes list 551 for said movie is checked to determine whether the bi-gram is a combination of one word from Positive_Word_List and another word from Movie_Specific_Word_List. The same procedure is followed for Negative_Word_List. The movie gets a positive score every time an attribute matches Positive_Word_List, and the positive score is increased by an amount equal to the number of occurrences of that attribute in the Movie attributes list 551. A similar procedure is followed with Negative_Word_List to find the negative score.
In addition to the positive and negative scores, a confidence score is calculated and recorded for each movie. The confidence score is a measure of how confident the system is in the score, and it is based on the number of negative or positive words found and the number of attributes the movie has. The confidence score is calculated using the following code:
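The bi-gram matching described above can be sketched as follows, under the interpretation that each match adds the attribute's number of occurrences to the corresponding score. The word lists are represented as Python sets, and the function names and sample data are assumptions for illustration.

```python
def _combines(bigram, words_a, words_b):
    """True if the bi-gram pairs one word from words_a with another word from words_b."""
    first, second = bigram
    return (first in words_a and second in words_b) or (
        second in words_a and first in words_b
    )


def sentiment_scores(movie_bigrams, positive_words, negative_words, movie_specific_words):
    """Return (pos_score, neg_score), each match weighted by the bi-gram's occurrence count."""
    pos_score = 0
    neg_score = 0
    for bigram, occurrences in movie_bigrams.items():
        if _combines(bigram, positive_words, movie_specific_words):
            pos_score += occurrences
        if _combines(bigram, negative_words, movie_specific_words):
            neg_score += occurrences
    return pos_score, neg_score


# Invented sample data for illustration.
movie_bigrams = {("great", "acting"): 4, ("boring", "plot"): 1, ("car", "chase"): 9}
positive_words = {"great", "brilliant"}
negative_words = {"boring", "weak"}
movie_specific_words = {"acting", "plot", "story"}
pos_score, neg_score = sentiment_scores(
    movie_bigrams, positive_words, negative_words, movie_specific_words
)
```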
Confidence = math.sqrt((pos_score + neg_score) / total_attribs * math.sqrt(len(attributes)))
where:
pos_score is the score of the positive keywords;
neg_score is the score of the negative keywords;
total_attribs is the sum of occurrences of all attributes of that movie; and
len(attributes) is the total number of attributes for that movie.
The movie rating is deduced as the percentage of the positive score in the sum of the positive and negative scores. This score is then normalized to a scale of 10 and listed as Movie_Score. In addition, when displaying the actual rating of the movie to the user, the system takes the confidence score into consideration. As the confidence tends to zero, the displayed movie rating tends to 5, which is the average rating.
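The rating calculation can be illustrated with the sketch below. The confidence expression follows the code given above, and Movie_Score is the positive share scaled to 10. The text does not specify how the confidence score adjusts the displayed rating, so the linear blend towards 5 in `displayed_rating`, the clamping of the confidence to the range 0 to 1, and the fallback of 5 when no sentiment words are found are all assumptions made only for illustration.

```python
import math


def confidence_score(pos_score, neg_score, total_attribs, attribute_count):
    """Confidence as given in the text; attribute_count corresponds to len(attributes)."""
    return math.sqrt((pos_score + neg_score) / total_attribs * math.sqrt(attribute_count))


def movie_score(pos_score, neg_score):
    """Positive share of the sentiment score, scaled to 10."""
    total = pos_score + neg_score
    if total == 0:
        return 5.0  # assumption: fall back to the average rating when nothing matched
    return 10 * pos_score / total


def displayed_rating(score, confidence):
    """Assumed blend: the rating moves towards 5 as the confidence tends to zero."""
    weight = min(max(confidence, 0.0), 1.0)  # assumption: clamp confidence to [0, 1]
    return 5 + weight * (score - 5)


# Invented sample values for illustration.
confidence = confidence_score(pos_score=4, neg_score=1, total_attribs=100, attribute_count=16)
score = movie_score(pos_score=4, neg_score=1)
rating = displayed_rating(score, confidence)
```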
The next step is to construct a single genre score for the input movie set 811 (830). The genre score of each movie is fetched, and a single genre score is constructed from them. For each genre, the single genre score is the sum of that genre's scores over all the input movies, averaged over the total number of input movies.
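A minimal sketch of this averaging step (830) is given below, assuming that the genre score of each input movie is available as a mapping from genre name to score and that a genre missing from a movie contributes zero. The function name `single_genre_score` and the sample scores are hypothetical.

```python
def single_genre_score(input_genre_scores):
    """Average each genre's score over all movies in the input movie set."""
    totals = {}
    for scores in input_genre_scores:
        for genre, value in scores.items():
            totals[genre] = totals.get(genre, 0.0) + value
    movie_count = len(input_genre_scores)
    # A genre absent from a movie contributes zero to that genre's sum.
    return {genre: total / movie_count for genre, total in totals.items()}


# Invented sample data for illustration: genre scores of two input movies.
input_genre_scores = [
    {"action": 0.8, "comedy": 0.2},
    {"action": 0.4, "drama": 0.6},
]
combined = single_genre_score(input_genre_scores)
```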
Next, another parameter, called Genre Consistency (GC), is determined for the input movie set 811. This parameter describes how consistently the user's taste is directed towards particular genres when choosing the input movies. A higher GC denotes that the user chooses movies aligned with a particular genre distribution. A lower GC means that the user does not care much about the genre of a movie and that the input movie set spans varied genres. To calculate GC, the standard deviation of each genre score (gsd) is calculated over the input movie set. The standard deviation of the polarization strength (psd) is also calculated.
If the number of movies in the input movie set is one, then GC is set to 0.75. The combined attribute list 821 is compared with the Movie attributes list 551 of each movie in the Global movie set 111 (840), and the single genre score is compared with the genre score of each movie in the Global movie set 111 (850), to find a matching score. The weightage of the genre score in finding matching movies is scaled by the Genre Consistency factor.
For each target movie, the Movie attributes list 551 of that movie is compared with the combined attribute list 821 of the input movie set 811. A parameter called TnaTnb is calculated, which is the number of matched attributes. For each n-gram, an attribute match score is calculated, which is the sum, over all matched attributes, of the product of their scores in the two lists.
For each target movie, a parameter called Tnb is determined, which is the total number of attributes for that movie for a particular n-gram in its Movie attributes list 551. Also, the polarization strength of the target movie is saved as Tgb_pol.
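The comparison for a single n-gram order can be sketched as follows, under the interpretation that the match score is the sum of the products of the two scores of each matched attribute. Both lists are assumed to be mappings from attribute to normalized score; the function name `attribute_match` and the sample data are assumptions for illustration.

```python
def attribute_match(input_attributes, target_attributes):
    """Return (TnaTnb, match score) for one n-gram order."""
    matched = input_attributes.keys() & target_attributes.keys()
    tna_tnb = len(matched)  # number of attributes present in both lists
    # Interpretation: multiply the two scores of each matched attribute and sum the products.
    score = sum(input_attributes[a] * target_attributes[a] for a in matched)
    return tna_tnb, score


# Invented sample data for illustration.
combined_attribute_list = {("car", "chase"): 0.5, ("secret", "agent"): 0.5}
target_attribute_list = {("car", "chase"): 0.25, ("love", "story"): 0.75}
tna_tnb, bigram_match_score = attribute_match(combined_attribute_list, target_attribute_list)
```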
Total attribute list and total matched attribute list for input set and target movies are found out with the following formula.
The attribute match score is obtained by adding the matched scores of each n-gram, each with a weightage.
The attribute match score is then biased by the popularity of the target movie and of the input movie set 811.
For each target movie, the genre list of that movie is compared with the combined genre list of the input set. A genre match score is calculated, which is the sum, over all matched genres, of the product of their scores in the two lists.
The final matched score is obtained by adding the attribute_match_score and the genre_match_score, with the genre_match_score weighted by GC:
movie_match_score = attribute_match_score + gc * genre_match_score
Based on this movie_match_score, movies are recommended (860) for the input movie set 811 in descending order of matched score.
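The final scoring and ordering step (860) can be sketched as below, assuming that the attribute match score and genre match score of each target movie have already been computed. The function name `recommend`, the `top_n` cut-off, and the sample scores are hypothetical.

```python
def recommend(candidates, gc, top_n=10):
    """Rank target movies by attribute_match_score + gc * genre_match_score, highest first."""
    scored = [
        (title, attribute_match_score + gc * genre_match_score)
        for title, attribute_match_score, genre_match_score in candidates
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [title for title, _ in scored[:top_n]]


# Invented sample data: (title, attribute_match_score, genre_match_score).
candidates = [
    ("Movie A", 0.30, 0.20),
    ("Movie B", 0.10, 0.90),
    ("Movie C", 0.25, 0.10),
]
ranked_with_genre = recommend(candidates, gc=0.75)
ranked_without_genre = recommend(candidates, gc=0.0)
```

With a GC of 0.75 the strong genre match lifts Movie B to the top, whereas with a GC of zero the ordering is driven by the attribute match alone.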
Further, the user is also enabled to search for particular movies based on certain parameters. The following options are available for the user:
The user can search for sentiments, genres, or keywords separately, or can search on a parameter based on a mix of all three. The keywords are the n-gram attributes from the Global dictionary 131, which are auto-completed as the user types. The user search parameter can also include a percentage of any particular genre. For example, the user can search for movies with 80% action and 20% comedy content.
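By way of illustration only, keyword auto-completion against the Global dictionary could be sketched as a prefix match. The text does not describe the matching or ordering rule, so the prefix test, the ordering by frequency, the `limit` parameter, and the representation of keywords as space-joined strings are all assumptions.

```python
def autocomplete(prefix, keyword_frequencies, limit=5):
    """Suggest keywords starting with the typed prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [
        (keyword, frequency)
        for keyword, frequency in keyword_frequencies.items()
        if keyword.startswith(prefix)
    ]
    # Assumed ordering: higher frequency first, ties broken alphabetically.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [keyword for keyword, _ in matches[:limit]]


# Invented sample data: keywords from the Global dictionary with their frequencies.
keyword_frequencies = {"car chase": 40, "car crash": 12, "secret agent": 10}
suggestions = autocomplete("car", keyword_frequencies)
```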
The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the network elements. The network elements shown in
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.
Number | Date | Country | Kind |
---|---|---|---|
201741030023 | Aug 2017 | IN | national |