This application is based upon and claims the benefit of priority from prior Japanese Patent Application No. 2007-209729, filed on Aug. 10, 2007, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to a program searching apparatus and program searching method for searching for a program similar to a specified program (group) on a television receiving/accumulating/replaying system that permits viewing of broadcast programs on multiple channels and utilization of meta-information about broadcast program contents in the form of an Electronic Program Guide (EPG).
2. Related Art
In recent years, BS/CS broadcasting has become widely available in addition to traditional terrestrial TV broadcasting, ushering in a true multi-channel era. Against this background, systems and services have been proposed that recommend programs to a user based on program metadata, including genre. Some of these systems and services learn a user's preferences from his/her viewing history and the like, and recommend programs in accordance with the learned preferences. A function of searching for programs similar to a certain program can be offered, for example, by a program searching apparatus that provides the function as its primary feature. The function can also be used by a program recommending apparatus to improve the appropriateness of its recommendations: the apparatus identifies programs similar to a program (B) that was not watched even though it was recommended and/or a program (W) that was watched even though it was not recommended, and takes the identified programs into consideration when making recommendations. Such a search for similar programs can be realized by applying similar document search, which has been developed in the field of information retrieval, to program metadata.
However, the conventional techniques outlined above have the following drawbacks.
Information retrieval generally defines similarity among documents by assigning a weight to a word based on “tf-idf” (term frequency-inverse document frequency) to vectorize a document, but “tf” (in-document term frequency) is often meaningless in a short document like an EPG (Electronic Program Guide), thus making the information retrieval approach less effective.
Also, an EPG involves a category that is obtained based on document structure (e.g., a performer's name) in addition to a word/phrase category that results from natural language processing, such as a part of speech or a semantic class. However, the former information cannot be exploited just by employing an approach of information retrieval in a simple manner.
In addition, some programs appearing in an EPG carry little program information, for example an extremely short description. A similarity search that uses such a program as the search query has low reliability, which leads users to complain about the capability of the program searching apparatus. Also, program recommendation that takes into account programs similar to the program “B” and/or “W” excessively generalizes the program “B” and/or “W”, which may degrade the appropriateness of recommendation.
According to an aspect of the present invention, there is provided a program searching apparatus, comprising:
According to an aspect of the present invention, there is provided a program searching method, comprising:
Embodiments of the present invention are described below with respect to drawings.
The usage flow of this program searching apparatus differs between the first and second embodiments discussed below, but the similarity search processing and its preliminary processing are common to both. Accordingly, this processing and the blocks that pertain to it, namely an EPG (Electronic Program Guide) data storage 1, a natural language processing keyword extractor 2, a structural keyword extracting unit 3, an inverted file storage 4, and an element count storage 5, are described first.
First, the program searching apparatus acquires a new EPG at an appropriate time, such as at midnight every day, and stores the acquired EPG data in the EPG data storage 1 (S21). The EPG may be acquired from SI signals of digital broadcasting or from a website on the Internet that provides EPGs. The EPG data storage 1 may include an EPG acquiring unit for acquiring EPG data via a network.
The EPG is structured with tags, such as <TITLE>, which represents a title, and <CATEGORY>, which represents a genre. Denotations such as “[Cast]” and “[Screenplay]” explicitly show what the character strings that follow them represent. Such denotations can be seen in the description between <SHORT_DESC> and </SHORT_DESC>, which holds the short program contents, and, within the description between <LONG_DESC> and </LONG_DESC>, which holds the long program contents, in the portions following “Cast” and “Original/Screenplay”. A characteristic of an EPG is that the amount of description is small and the same word rarely appears more than once. Information on cast or a screenplay writer such as shown in
The structural keyword extracting unit 3 extracts information on the genre, cast and screenplay of a program by extracting character strings that lie between tags and character strings that follow denotations as keywords based on such tags and denotations in the EPG (structural KW or keyword extraction) (S22). When the same keyword appears a plurality of times, only one of them has to be extracted. The program genre, cast, and screenplay writer are examples of categories.
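The structural keyword extraction at S22 can be sketched as follows. This is a minimal illustration only: the sample entry, the comma-separated name lists, and the names `SAMPLE_EPG` and `extract_structural_keywords` are assumptions introduced here, not the format of any actual EPG.

```python
import re

# Hypothetical EPG entry: tag names follow the description, contents are invented.
SAMPLE_EPG = """<TITLE>Evening Drama</TITLE>
<CATEGORY>Drama</CATEGORY>
<LONG_DESC>A family story. [Cast] Taro Yamada, Hanako Sato [Screenplay] Jiro Suzuki</LONG_DESC>"""

TAG_CATEGORIES = {"CATEGORY": "genre"}                       # tag -> category
DENOTATION_CATEGORIES = {"Cast": "cast", "Screenplay": "screenplay"}

def extract_structural_keywords(epg_text):
    """Return a set of (category, keyword) pairs; the set keeps each keyword once."""
    keywords = set()
    # Character strings that lie between an opening tag and its closing tag.
    for tag, category in TAG_CATEGORIES.items():
        for value in re.findall(rf"<{tag}>(.*?)</{tag}>", epg_text, re.S):
            keywords.add((category, value.strip()))
    # Character strings that follow a denotation, up to the next "[" or "<".
    for label, category in DENOTATION_CATEGORIES.items():
        for run in re.findall(rf"\[{re.escape(label)}\]([^\[<]*)", epg_text):
            for name in run.split(","):      # assumed: names separated by commas
                if name.strip():
                    keywords.add((category, name.strip()))
    return keywords

structural_keywords = extract_structural_keywords(SAMPLE_EPG)
```

Storing the result as a set reflects the point that a keyword appearing a plurality of times only has to be extracted once.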
The natural language processing keyword extractor 2 applies a known technique such as morphological analysis or semantic class analysis to the content description and title of a program in the EPG so as to extract keywords that cannot be extracted by structural keyword extraction (NLP KW extraction) (S23). That is, as morphological analysis can obtain separations between words and the part-of-speech of words in a sentence, keywords can be obtained by specifying the part-of-speech of a word which should be extracted as a keyword, e.g., as a noun or adjective. With semantic class analysis, which performs semantically more advanced processing than morphological analysis, it is possible to extract a word or phrase having a category name (a semantic class) from a sentence, such as “Japanese prefecture” or “professional baseball team”. Thus, keywords can be also obtained by specifying a semantic class that should be extracted as a keyword. When the same keyword appears a plurality of times, only one of them has to be extracted. Morphological analysis or semantic class analysis may use a dictionary that maps keywords to categories for defining the category of a keyword.
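The selection step of S23 can be sketched as follows. The sketch does not perform morphological analysis itself; it assumes an analyzer has already produced (surface form, part of speech) pairs, and the token list, the dictionary `SEMANTIC_CLASSES`, and the function name are hypothetical. Letting a semantic class take precedence over the part of speech as the category is likewise a choice made for this illustration.

```python
# Assumed analyzer output: (surface form, part of speech) pairs.
TOKENS = [("Giants", "noun"), ("win", "verb"), ("in", "particle"),
          ("Tokyo", "noun"), ("exciting", "adjective"), ("Tokyo", "noun")]

# Hypothetical dictionary that maps keywords to semantic classes.
SEMANTIC_CLASSES = {"Tokyo": "Japanese prefecture",
                    "Giants": "professional baseball team"}

# Parts of speech specified as those to be extracted as keywords.
TARGET_POS = {"noun", "adjective"}

def extract_nlp_keywords(tokens):
    """Return a set of (category, keyword) pairs; duplicates collapse in the set."""
    keywords = set()
    for surface, pos in tokens:
        if pos in TARGET_POS:
            # Use the semantic class as the category when the dictionary has one.
            category = SEMANTIC_CLASSES.get(surface, pos)
            keywords.add((category, surface))
    return keywords

nlp_keywords = extract_nlp_keywords(TOKENS)
```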
As processing at S22 and S23 reveals the keywords contained in the EPG (or program), the inverted file storage 4 stores data that shows the correspondence between the program and the keywords contained in that program (S24). This data may use a straightforward format that maintains a keyword list for each program ID, but it is advantageously maintained in a known format called an inverted file for the sake of efficiency in subsequent search processing.
An inverted file maintains, for a keyword, a list of program IDs that contain the keyword. A portion of an exemplary inverted file is shown in
The present example assumes that data showing the correspondence between programs and keywords is stored in the form of an inverted file. The inverted file storage 4 updates the inverted file using the data obtained at S22 and S23, which shows the correspondence between program IDs and the keywords contained in the corresponding programs. The inverted file storage 4 includes a first calculating unit, for example.
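The conversion from per-program keyword lists to an inverted file (S24) can be sketched as follows. The program IDs, the keywords, and the function name `build_inverted_file` are invented for illustration, and keying the file by (category, keyword) pairs is an assumption of this sketch.

```python
from collections import defaultdict

def build_inverted_file(program_keywords):
    """Map each (category, keyword) to the sorted list of program IDs containing it."""
    inverted = defaultdict(set)
    for program_id, keywords in program_keywords.items():
        for key in keywords:
            inverted[key].add(program_id)
    return {key: sorted(ids) for key, ids in inverted.items()}

# Hypothetical output of S22 and S23: a keyword list for each program ID.
programs = {
    "P1": {("cast", "Taro Yamada"), ("noun", "baseball")},
    "P2": {("noun", "baseball"), ("genre", "Sports")},
    "P3": {("cast", "Taro Yamada")},
}
inverted_file = build_inverted_file(programs)
```

With this layout, the number of programs that contain a keyword is simply the length of the list stored for that keyword.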
The element count storage 5 counts the number of different keywords in each category and stores the number of different keywords for each category (S25). This is carried out by doing nothing if a keyword extracted at S22 and S23 is already present in the inverted file, or incrementing a counter prepared for each category (e.g., noun, cast and the like) if the keyword is not present in the file yet. For example, when the inverted file is as illustrated in
Next, a description is given of the processing of searching for programs similar to a given program group that includes one or more programs (similarity search processing). Hereinafter, such a program group is called a search query or just a query, and each program contained in the query may be referred to as a query program. This similarity search processing is performed by a similarity search unit 8. The similarity search unit 8 includes a weight calculating unit, a detecting unit, a similarity calculating unit, and a similar program calculating unit, for example. In the following, the flow of similarity search processing is illustrated in the flowchart of
First, a variable (or a score) that represents the level of similarity to the query is initialized for all programs (S51). The programs subject to this initialization may include the query itself (which is made up of one or more query programs), and this example assumes that they do. A program subject to the initialization, namely a program covered by the search, is a search target program, for example.
Then, a weight is calculated for each keyword contained in the query (a query keyword); when the query includes a number of query programs, the query keywords are the logical sum (union) of the keywords contained in the individual query programs. For each program, the sum of the weights of the keywords that the program has in common with the query (common keywords) is calculated as its score (or similarity level). Specifically, the processing described below can be performed based on the inverted file storage 4, for example.
First, for each keyword contained in the inverted file, the number of query programs among the programs that contain that keyword (the programs on the right-hand part of the inverted file) is counted, and this number is set as “N” (S52). If N>0 (YES at S52), that is, the keyword is a query keyword, a weight “W(kw)” for that keyword “KW” is calculated according to the formula below (S53). If N=0 (NO at S52), the flow proceeds to the next keyword without calculating a weight.
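$$W(kw) = N \times \frac{idf(kw)}{f(CS(c))}$$

where “idf(kw)” is the “idf” (inverse document frequency) value, namely the “idf” weight of the keyword “KW”, and this value is generally defined as:

$$idf(kw) = \log\frac{A}{\text{number of programs that contain KW}}$$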
with the total number of programs as “A”. In embodiments of the present invention, however, various modifications may be made, such as not using a logarithm or adding a positive constant to the denominator, as long as the value is a monotonically increasing function of the inverse of the number of programs that contain the keyword “KW”. Since the inverted file is employed, the number of programs that contain the keyword “KW” is determined as the number of programs on the right side.
Also, “c” is a category to which the keyword “KW” belongs and “CS(c)” is the number of different keywords that belong to the category “c”. “f” is an arbitrary monotonically increasing function, but typically a formula:
or a similar formula can be used.
Thus, the weight “W(kw)” of the keyword “KW” is a value determined by adjusting (e.g., dividing) the “idf” weight with respect to the number of different keywords that belong to the category “c” of the keyword “KW”, and further weighting it with “N”, the number of query programs that contain the keyword “KW”. For example, suppose that the category of a keyword “1” is “Place”, the category of another keyword “2” is “Baseball Team”, the number of different keywords in the category “Place” is 5000, and the number of different keywords in the category “Baseball Team” is 12. The “idf” values of keywords in the category “Place” naturally tend to be larger than those of keywords in the category “Baseball Team”, but the weight “W(kw)” is corrected, such as by dividing the former by 5000 and the latter by 12.
After the weight “W(kw)” thus determined is added to the variable (or score) of each program that contains the keyword “KW” (S54), the flow proceeds to the next keyword in the inverted file. The scores of all programs have been obtained once every keyword in the inverted file has been processed.
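The “Place” and “Baseball Team” example can be checked numerically with a short sketch. It assumes that the weight takes the form N × idf(kw) / f(CS(c)), that “f” is the identity function (consistent with the division by 5000 and by 12 in the example), and that the logarithm is natural. The total of 10000 programs and the document frequencies 10 and 500 are invented values, and `keyword_weight` is a hypothetical name.

```python
import math

def keyword_weight(n_query, total_programs, doc_freq, category_size, f=lambda x: x):
    """Weight of one keyword: idf adjusted by category size, scaled by N."""
    idf = math.log(total_programs / doc_freq)     # idf(kw) = log(A / doc_freq)
    return n_query * idf / f(category_size)       # assumed form of W(kw)

# A rare place name versus a common team name, each in one query program.
place_weight = keyword_weight(1, 10000, 10, 5000)
team_weight = keyword_weight(1, 10000, 500, 12)
```

Under these assumptions the place name has the larger “idf” value, yet after the correction by category size the team name receives the larger weight.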
Thereafter, the programs are sorted in descending order of score, and in accordance with a predetermined threshold value “M”, the top M programs (or alternatively, the top M programs except the query program) are output as similar programs to a similar program outputting unit 13, which is a displaying unit for displaying an image for the user, for example (S55). Alternatively, with reference to the score of the query (when the query includes a number of query programs, the maximum, minimum, median, or average value of scores of those query programs may be used as the score of the query), and in accordance with a predetermined percentage R%, programs having a score equal to or greater than R% of the query score may be output as similar programs to the similar program outputting unit 13.
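Steps S51 to S55 can be put together in a minimal sketch. The same assumptions apply as before: the weight is taken as N × idf(kw) / f(CS(c)) with “f” as the identity function and a natural logarithm. The sample inverted file, the category sizes, and the function name `similarity_search` are hypothetical, and only the top-M output variant that excludes the query programs is shown.

```python
import math

def similarity_search(inverted_file, all_program_ids, query_ids, category_sizes,
                      top_m, f=lambda x: x):
    """Score every program against the query; return the top M non-query programs."""
    total = len(all_program_ids)
    scores = {pid: 0.0 for pid in all_program_ids}              # S51
    for (category, _keyword), program_ids in inverted_file.items():
        n = sum(1 for pid in program_ids if pid in query_ids)    # S52
        if n == 0:
            continue                                             # not a query keyword
        idf = math.log(total / len(program_ids))
        weight = n * idf / f(category_sizes[category])           # S53 (assumed form)
        for pid in program_ids:
            scores[pid] += weight                                # S54
    candidates = [pid for pid in all_program_ids if pid not in query_ids]
    candidates.sort(key=lambda pid: scores[pid], reverse=True)   # S55
    return candidates[:top_m], scores

inverted_file = {
    ("cast", "Taro Yamada"): ["P1", "P3"],
    ("noun", "baseball"): ["P1", "P2"],
    ("genre", "Sports"): ["P2", "P4"],
    ("noun", "cooking"): ["P4"],
}
category_sizes = {"cast": 1, "noun": 2, "genre": 1}
similar, scores = similarity_search(
    inverted_file, ["P1", "P2", "P3", "P4"], {"P1"}, category_sizes, top_m=2)
```

Because the loop walks the inverted file keyword by keyword, programs that share no keyword with the query are never touched and keep their initial score of zero.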
First, a program information amount calculator 6 calculates the information amounts of all programs (S61).
This is carried out by calculating weights for all keywords contained in each of the programs included in the EPG and adding or summing the weights. A flow of specific processing is illustrated in the flowchart of
First, one program is picked out and a score that represents the program information amount of the program in question is initialized (S71).
Then, for all keywords contained in the program in question, the following processing is repeated with reference to the inverted file.
The weight “W(kw)” of the keyword “KW” is calculated according to the formula (S72):
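$$W(kw) = \frac{idf(kw)}{f(CS(c))}$$

where “idf(kw)” is the idf value of the keyword “KW” and is generally defined as:

$$idf(kw) = \log\frac{A}{\text{number of programs that contain KW}}$$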
with the total number of programs as “A”. However, various modifications may be made, such as not using a logarithm or adding a positive constant to the denominator, as long as the value is a monotonically increasing function of the inverse of the number of programs that contain the keyword “KW”. Since the inverted file is employed, the number of programs that contain the keyword “KW” is determined as the number of programs on the right side corresponding to the keyword “KW” in the inverted file. Also, “c” is a category to which the keyword “KW” belongs and “CS(c)” is the number of different keywords that belong to the category “c”. “f” is an arbitrary monotonically increasing function, but typically a formula:
or a similar formula can be used.
The weight “W(kw)” value thus determined is added to the score of the program in question (S73), and the flow proceeds to the next keyword. After weights of all keywords are calculated and added to the score, the final sum (total) obtained for the program is stored in the EPG program information amount storage 7 as its program information amount.
By performing the above-described processing (S71 to S73) on all the other programs, program information amounts are obtained and stored in the EPG program information amount storage 7 for all the programs.
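The per-program calculation of S71 to S73 can be sketched as follows, assuming that the weight takes the form idf(kw) / f(CS(c)) with “f” as the identity function and a natural logarithm. The sample data and the function name `information_amount` are invented for illustration.

```python
import math

def information_amount(keywords, inverted_file, total_programs, category_sizes,
                       f=lambda x: x):
    """Sum the weights of all keywords contained in one program."""
    score = 0.0                                              # S71
    for category, keyword in keywords:
        doc_freq = len(inverted_file[(category, keyword)])   # programs containing KW
        idf = math.log(total_programs / doc_freq)
        score += idf / f(category_sizes[category])           # S72 and S73
    return score

program_keywords = {
    "P1": [("cast", "Taro Yamada"), ("noun", "baseball")],
    "P2": [("noun", "baseball"), ("genre", "Sports")],
    "P3": [("cast", "Taro Yamada")],
    "P4": [("noun", "cooking"), ("genre", "Sports")],
}
inverted_file = {
    ("cast", "Taro Yamada"): ["P1", "P3"],
    ("noun", "baseball"): ["P1", "P2"],
    ("genre", "Sports"): ["P2", "P4"],
    ("noun", "cooking"): ["P4"],
}
category_sizes = {"cast": 1, "noun": 2, "genre": 1}
amounts = {pid: information_amount(kws, inverted_file, len(program_keywords),
                                   category_sizes)
           for pid, kws in program_keywords.items()}
```

In this sample, the program with only one keyword (“P3”) receives the smallest information amount, and the program that contains a keyword found in no other program (“P4”) receives the largest.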
Referring back to
The search query specifying interface 9 selects K programs having a large program information amount from among those programs that meet the user-specified condition based on the EPG program information amount storage 7, and presents the selected programs as query candidates (S63). For example, the selected K programs (query candidates) are presented on a GUI with checkboxes as shown in
The search query specifying interface 9 accepts one or more programs selected by the user as queries (S64) and stores the accepted queries in a query storage 12. The search query specifying interface 9 is an example of a specifying unit for designating a query.
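The selection of the K query candidates at S63 can be sketched as follows. The information amounts, the condition callback, and the function name `query_candidates` are hypothetical; how the user-specified condition is actually evaluated is not modeled here.

```python
import heapq

def query_candidates(info_amounts, matches_condition, k):
    """Return up to K program IDs with the largest information amounts
    among the programs that meet the user-specified condition."""
    eligible = [pid for pid in info_amounts if matches_condition(pid)]
    return heapq.nlargest(k, eligible, key=lambda pid: info_amounts[pid])

# Invented information amounts, as they might be read from storage 7.
info_amounts = {"P1": 1.04, "P2": 0.20, "P3": 0.69, "P4": 1.39}

# Assumed condition: the user's condition happens to exclude program P4.
candidates = query_candidates(info_amounts, lambda pid: pid != "P4", k=2)
```

Presenting only programs with a large information amount steers the user toward queries for which the similarity search is more reliable.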
The similarity search unit 8 searches for programs that are similar to the queries stored in the query storage 12 (S65), and outputs data on programs found in the search to the similar program outputting unit 13 (S66). The similar program outputting unit 13 displays the program data inputted from the similarity search unit 8 on a screen.
As described, the first embodiment of the invention uses the keyword weight “W(kw)” to determine the similarity among programs in conformity with the characteristics of an EPG (e.g., the amount of description is small and the same word rarely appears more than once), and thereby realizes a program similarity search function with a higher demonstration effect.
First, the program information amount calculator 6 calculates the program information amounts of all programs (S91). This is carried out by calculating weights of all keywords contained in each of the programs included in the EPG and adding or summing the weights. A flow of specific processing is illustrated in the flowchart of
First, a score that represents the program information amount of each program is initialized (S101).
Then, for the logical sum (union) of the keywords contained in all the programs, the following processing is repeated with reference to the inverted file.
The weight “W(kw)” of the keyword “KW” is calculated according to the formula (S102):
$$W(kw) = \frac{idf(kw)}{f(CS(c))}$$

where “idf(kw)” is the idf value of the keyword “KW” and is generally defined as:

$$idf(kw) = \log\frac{A}{\text{number of programs that contain KW}}$$

with the total number of programs as “A”. However, various modifications may be made, such as not using a logarithm or adding a positive constant to the denominator, as long as the value is a monotonically increasing function of the inverse of the number of programs that contain the keyword “KW”. Since the inverted file is employed, the number of programs that contain the keyword “KW” is determined as the number of programs on the right side corresponding to the keyword “KW” in the inverted file.
“c” is a category to which the keyword “KW” belongs and “CS(c)” is the number of different keywords that belong to the category “c”. “f” is an arbitrary monotonically increasing function, but typically a formula:
or a similar formula can be used.
The weight “W(kw)” value thus determined is added to the score of programs that have the keyword “KW” (programs on the right-hand part corresponding to the keyword “KW” in the inverted file) (S103). Then, the present maximum score is maintained in “Smax” (S104), and the flow proceeds to the next keyword.
When processing for all keywords is completed, the program information amounts are normalized to the range from 0 to 1 inclusive ([0, 1]) by dividing the score of each program by “Smax” (S105). The normalized score of each program is then stored in the EPG program information amount storage 7 as its program information amount.
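The keyword-driven variant of S101 to S105 can be sketched as follows, assuming that the weight takes the form idf(kw) / f(CS(c)) with “f” as the identity function and a natural logarithm. The sample inverted file and the function name are invented for illustration.

```python
import math

def normalized_information_amounts(inverted_file, all_program_ids, category_sizes,
                                   f=lambda x: x):
    """Accumulate keyword weights per program, then scale the scores into [0, 1]."""
    total = len(all_program_ids)
    scores = {pid: 0.0 for pid in all_program_ids}               # S101
    s_max = 0.0
    for (category, _keyword), program_ids in inverted_file.items():
        weight = (math.log(total / len(program_ids))
                  / f(category_sizes[category]))                 # S102
        for pid in program_ids:
            scores[pid] += weight                                # S103
            s_max = max(s_max, scores[pid])                      # S104
    if s_max == 0.0:
        return scores                      # every score is zero, nothing to scale
    return {pid: score / s_max for pid, score in scores.items()}  # S105

inverted_file = {
    ("cast", "Taro Yamada"): ["P1", "P3"],
    ("noun", "baseball"): ["P1", "P2"],
    ("genre", "Sports"): ["P2", "P4"],
    ("noun", "cooking"): ["P4"],
}
category_sizes = {"cast": 1, "noun": 2, "genre": 1}
normalized = normalized_information_amounts(
    inverted_file, ["P1", "P2", "P3", "P4"], category_sizes)
```

Because no weight is negative here, scores only grow, so the running maximum kept at S104 equals the final maximum when the loop ends.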
Referring to
A determining unit 11 determines whether the program information amount of the program “P” is smaller than a predetermined threshold “T”. If it is smaller than the threshold “T” (NO at S93), the determining unit 11 does not perform search processing, so as to avoid a meaningless similarity search; it determines that there is no program similar to the program “P” and passes a notice to that effect to a similar B/W outputting unit 14 (S96). The similar B/W outputting unit 14 then notifies the program recommending system that there is no program similar to the program “P”. On receiving this notification, the program recommending system recommends programs in a conventional manner; that is, it does not update the recommendation list.
On the other hand, if the program information amount of the program “P” is equal to or greater than the threshold “T” (YES at S93), the program “P” is stored in the query storage 12 as a query, and the similarity search unit 8 performs a similarity search based on the query in the query storage 12 (S94) and passes information on a program that has been found in the similarity search to the similar B/W outputting unit 14. The similar B/W outputting unit 14 provides the information on the program passed from the similarity search unit 8 back to the program recommending system (S95). The program recommending system uses the information received from the similar B/W outputting unit 14 to update the recommendation list. Specifically, when the program “P” is a program “B”, the program recommending system deletes the program indicated in the received information from the recommendation list, and when the program “P” is a program “W”, it adds the similar program indicated in the received information to the recommendation list. This realizes highly satisfactory recommendation.
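The branch at S93 and the resulting list update can be sketched as follows. The function name, the `find_similar` callback that stands in for the similarity search unit 8, and all sample values are hypothetical.

```python
def update_recommendation_list(recommended, program, kind, info_amounts,
                               threshold, find_similar):
    """Return the updated recommendation list for a program P of kind "B" or "W"."""
    if info_amounts[program] < threshold:        # NO at S93
        return list(recommended)                 # S96: list is left unchanged
    similar = find_similar(program)              # S94
    if kind == "B":                              # recommended but not watched
        return [p for p in recommended if p not in similar]
    if kind == "W":                              # watched but not recommended
        return list(recommended) + [p for p in similar if p not in recommended]
    raise ValueError("kind must be 'B' or 'W'")

# Invented normalized information amounts and similarity search results.
info_amounts = {"P1": 0.75, "P3": 0.10}
similar_of = {"P1": ["P2", "P5"], "P3": ["P4"]}

def find(program):
    return similar_of[program]

after_b = update_recommendation_list(["P2", "P4"], "P1", "B", info_amounts, 0.3, find)
after_w = update_recommendation_list(["P2", "P4"], "P1", "W", info_amounts, 0.3, find)
skipped = update_recommendation_list(["P2", "P4"], "P3", "B", info_amounts, 0.3, find)
```

In the last call the information amount of “P3” is below the threshold, so no search is performed and the list is returned as it was.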
As described above, the second embodiment of the invention can realize generation of a recommendation list that is closer to the user's preference without requiring a long learning time by avoiding meaningless similarity search on programs with a small program information amount.
Number | Date | Country | Kind |
---|---|---|---|
2007-209729 | Aug 2007 | JP | national |