The present invention relates to an information processing apparatus, an information processing method, and a program, which select a content associated with a document viewed by a user and display the content together with the document.
In order to add a content (such as an advertisement) to a document viewed by a user and present the content, it is important to select a content associated with a target document appropriately according to the user's taste. Patent Document 1 discloses a terminal device capable of providing an advertisement optimum for a user.
[Patent Document 1] Japanese Patent Application Publication No. 2015-22561
Patent Document 1 discloses such a terminal device that assigns higher priority to an advertisement high in degree of user's interest corresponding to the attributes of a target document and displays the advertisement by changing the display position. Thus, the advertisement optimum for the user can be provided to the user.
It is known that accessible documents are acquired to identify the attributes of a target document based on a database in which the appearance frequencies of words included in each document are counted up. It is also known that a history of operations to each document is acquired to identify a degree of user's interest corresponding to the attributes of the document based on a database in which the appearance frequencies of words included in the document are counted up.
In a database in which the appearance frequencies of words included in documents are counted up, clustering may be performed in such a manner that words similar in appearance tendency in each document are grouped and documents similar in appearance tendency of each word are grouped. Since clustering makes it possible to identify the attributes of the documents from information on a grouped cluster, there is no need to keep detailed information on each document.
The results of clustering in the database in which the appearance frequencies of words in accessible documents are counted up may be used to grasp a degree of user's interest. Specifically, a word included in a document accessed by a user is positioned in associated information (cluster) between words and documents, which is created based on accessible documents. In this case, since there is no need to create, for each user, the associated information between words and documents, the degree of user's interest can be grasped efficiently.
When the target documents are various documents accessible via a network, such as news site articles on the Internet, documents are added from day to day. Further, the meaning of each word used in documents changes with the times. For example, if an entertainer who debuted as a pop idol later becomes a movie actor, the cluster to which the name of the entertainer belongs will change from the pop idol cluster to the movie actor cluster.
In order to continue providing appropriate contents, there is a need to update such a database that counts up documents as the meaning of each word changes. To this end, there is a database update method in which documents generated after the creation of an old database are added to create a new database while keeping all documents used to create the old database.
According to this method, since the database is created based on documents accessible at the creation time, a database that properly reflects the meaning of each word at the creation time can be created. However, this method puts pressure on the data storage capacity because an ever-increasing number of documents must be kept, and it increases the load on the resources used to create the database from an enormous number of documents, so that more time is required to create the database.
Another database update method can also be considered, in which documents are discarded while keeping only cluster information of the old database, and new documents are added to the cluster information. Since the cluster information can be defined by the range of each cluster (e.g., by the center coordinates and radius of the cluster), the amount of data can be made very small compared with that of the original documents.
However, this method cannot follow the changes of each word with time. In the above example, since the name of the entertainer who is now the movie actor continues to be associated with the pop idol at the time of creating the database, a content appropriate for the user cannot be presented.
Especially, when the degree of user's interest is grasped based on the associated information between words in accessible documents and the documents as mentioned above, there is a problem that the degree of user's interest cannot be grasped correctly if the database on the degree of user's interest is not updated in cooperation with updating of the associated information between words in accessible documents and the documents. For example, if only the associated information (cluster) on the accessible documents is updated, the range of the cluster when accessed documents are positioned can be updated later. If the content of the cluster is not consistent before and after the updating, information on documents accessed in the past cannot be used to identify the attributes of a currently targeted document.
The present invention has been made to solve the problems with updating of such a database, and it is an object thereof to provide an information processing apparatus capable of updating a database without increasing the load excessively and presenting, to a user, a content associated with a document appropriately.
In order to solve the above problems, the information processing apparatus according to the present invention includes:
a document storage section that stores each of documents acquired via a network in association with an acquisition time of the document;
a two-dimensional cluster generating section that generates, in terms of the documents and terms as words appearing in the documents, a two-dimensional cluster in which the documents similar in appearance tendency of the terms are grouped and the terms similar in appearance tendency in the documents are grouped;
a one-dimensional cluster generating section that generates a one-dimensional cluster in which the terms similar in appearance tendency in the documents are grouped;
a document updating section that adds, to the document storage section, a new document in terms of the acquisition time, and deletes, from the document storage section, an old document in terms of the acquisition time;
a two-dimensional cluster updating section that causes the two-dimensional cluster generating section to generate the two-dimensional cluster based on the documents stored in the updated document storage section after the document updating section adds and deletes the documents; and
a one-dimensional cluster updating section that updates the one-dimensional cluster based on the old document in terms of the acquisition time, which is deleted from the document storage section.
According to the present invention, there can be provided an information processing apparatus capable of updating a database without increasing the load excessively and presenting, to a user, a content associated with a document appropriately.
An embodiment of the present invention will be described in detail below.
The communication unit 10 of the information processing apparatus 1 connects the information processing apparatus 1 to the network 3 to send and receive information. Specifically, the communication unit 10 can be composed of a wired LAN interface, a wireless LAN interface, and a mobile telephone communication interface (none of which are illustrated), together with control software or firmware therefor.
The processing unit 11 of the information processing apparatus 1 performs processing on various pieces of information. This processing includes, in addition to the execution of software specified by the user through an unillustrated input unit, processing that is not explicitly specified by the user, such as the control of each of the units constituting the information processing apparatus 1. The processing unit 11 can be composed of a CPU and a memory (neither of which is illustrated).
The display unit 12 of the information processing apparatus 1 displays the information processing results by the processing unit 11 in such a manner that the user can view the results. The display unit 12 can be a display unit including a liquid crystal display panel, or a projector.
The data storage unit 13 of the information processing apparatus 1 stores various data in a nonvolatile manner. The various data may be received from the network 3 through the communication unit 10, or input through the unillustrated input unit. Further, the various data can be processing targets of the processing unit 11. The data storage unit 13 can be a nonvolatile storage device, such as a hard disk drive or an SSD (Solid State Drive).
The communication unit 20 of the document server 2 connects the document server 2 to the network 3 to send and receive information. Specifically, the communication unit 20 can be composed of a wired LAN interface, a wireless LAN interface, and a mobile telephone communication interface (none of which are illustrated), together with control software or firmware therefor.
In response to a document request accepted by the communication unit 20 via the network 3, the document providing unit 21 of the document server 2 provides a document to a requestor via the network 3. The document may be provided by transmitting a preformed and stored page, or a page dynamically generated for each request.
The document storage section 100 stores each of documents acquired via a network in association with the acquisition time. The document storage section 100 may store, as targets, documents acquirable via the network regardless of the presence or absence of user accesses, or store, as targets, documents identified based on user operations on the information processing apparatus.
An example of data stored in the document storage section 100 is illustrated in
In terms of documents and terms as words appearing in the documents, the two-dimensional cluster generating section 110 generates a two-dimensional cluster in which documents similar in appearance tendency of the terms are grouped, and terms similar in appearance tendency in the documents are grouped.
The two-dimensional cluster can be generated by grouping documents and terms based on the documents stored in the document storage section 100. Further, a two-dimensional cluster targeting documents identified based on user operations on the information processing apparatus (hereinafter also referred to as UM (User Model)) can be generated by positioning terms appearing in those documents, which are stored in the document storage section 100, in a two-dimensional cluster generated by targeting documents accessible via the network (hereinafter also referred to as LM (Language Model)).
Referring to
Using the UM thus generated, it can be grasped which of clusters based on the appearance tendency of each word in all documents accessible via the network each user prefers. When the LM is generated on a server and the UM is generated on a user terminal, this procedure is suitable because preference information can be accumulated for each user after the LM cluster information commonly used for all users is generated collectively, but the embodiment of the present invention is not limited to this procedure.
An example of a two-dimensional cluster generated by the two-dimensional cluster generating section 110 is illustrated in
The one-dimensional cluster generating section 120 generates a one-dimensional cluster in which terms similar in appearance tendency in documents are grouped. An example of the one-dimensional cluster generated by the one-dimensional cluster generating section 120 is illustrated in
The document updating section 130 adds, to the document storage section 100, a new document in terms of the acquisition time, and deletes, from the document storage section 100, an old document in terms of the acquisition time. In this case, the added document and the deleted document may be controlled to make the capacities constant, controlled to make the range of acquisition times constant (e.g., one week), or controlled based on any other criterion. When the documents are controlled to make the capacities constant, the memory capacity required by the document storage section 100 can be maintained constant.
Further, the addition and deletion of the documents may be performed simultaneously or sequentially. If the deletion is done first, the memory capacity required by the document storage section 100 can be prevented from increasing during updating. The document updating section 130 can be implemented by the processing unit 11 executing a predetermined program.
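The update by acquisition time described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `StoredDocument` and `update_documents`, the in-memory list used as the store, and the one-week default window are hypothetical and are not defined in this specification. The sketch removes old documents before adding new ones, and returns the removed documents so that they remain available for updating the one-dimensional cluster.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class StoredDocument:
    # Hypothetical record: a document stored with its acquisition time.
    doc_id: str
    text: str
    acquired_at: datetime


def update_documents(
    store: List[StoredDocument],
    new_docs: List[StoredDocument],
    now: datetime,
    window: timedelta = timedelta(days=7),  # assumed range of acquisition times
) -> Tuple[List[StoredDocument], List[StoredDocument]]:
    """Return (kept, deleted) after applying a fixed acquisition-time window."""
    cutoff = now - window
    # Deletion is done first so that the stored volume does not grow during updating.
    deleted = [d for d in store if d.acquired_at < cutoff]
    kept = [d for d in store if d.acquired_at >= cutoff]
    # Then documents that are new in terms of the acquisition time are added.
    kept.extend(d for d in new_docs if d.acquired_at >= cutoff)
    return kept, deleted
```

A control that keeps the capacity constant instead of the time range would replace the cutoff test with a size budget; that variant is not shown here.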
When the document updating section 130 adds and deletes the documents, the two-dimensional cluster updating section 140 causes the two-dimensional cluster generating section 110 to generate the two-dimensional cluster based on the updated documents stored in the document storage section 100. The two-dimensional cluster updating section 140 can be implemented by the processing unit 11 executing the predetermined program.
The one-dimensional cluster updating section 150 updates the one-dimensional cluster based on the old document in terms of the acquisition time deleted from the document storage section 100. The update processing for the one-dimensional cluster performed by the one-dimensional cluster updating section 150 will be described later. The one-dimensional cluster updating section 150 can be implemented by the processing unit 11 executing the predetermined program.
Based on the two-dimensional cluster, the first term identification section 160 identifies a term associated with a content including at least a word. The term identification processing performed by the first term identification section 160 will be described later. The first term identification section 160 can be implemented by the processing unit 11 executing the predetermined program.
When no term is identified by the first term identification section 160, the second term identification section 170 identifies a term associated with the content based on the one-dimensional cluster. The term identification processing performed by the second term identification section 170 will be described later. The second term identification section 170 can be implemented by the processing unit 11 executing the predetermined program.
The display section 180 displays, together with the content, an additional content associated with the term identified by the first term identification section 160 or the second term identification section 170. The display section 180 can transmit, as a keyword, the identified term to an additional content providing server connected to the network 3 to make a request in order to acquire the additional content. The content and the additional content are displayed on the display unit 12 of the information processing apparatus 1. The display section 180 can be implemented by the processing unit 11 executing the predetermined program to control the communication unit 10 and the display unit 12.
Referring next to
Referring to
First, the two-dimensional cluster generating section 110 morphologically analyzes the content of each document stored in the document storage section 100 to decompose the content of the document into words. Then, the two-dimensional cluster generating section 110 counts up the appearance frequency of each word in the document. In this case, words other than nouns, such as postpositional particles and adjectives, may be excluded because their appearance tendencies do not vary with the field to which the document relates. Further, heavy weight may be placed on proper nouns, whose appearance tendencies tend to vary markedly with the field to which the document relates.
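The counting step described above can be illustrated with the following sketch. It is a simplification under stated assumptions: a regular-expression tokenizer and a small stop-word list stand in for the morphological analysis and the exclusion of words other than nouns, and the names `tokenize`, `term_frequencies`, and `STOP_WORDS` are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumed stand-in for excluding words whose appearance tendency does not
# vary from field to field (particles, adjectives, and the like).
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "and", "is", "to"})


def tokenize(text: str) -> List[str]:
    # Simplified substitute for morphological analysis: lowercase word split.
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def term_frequencies(documents: Dict[str, str]) -> Dict[str, Counter]:
    # Count the appearance frequency of each word, document by document.
    return {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
```

The resulting per-document counts form the document-by-term table on which the grouping of documents and terms operates.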
Next, the two-dimensional cluster generating section 110 groups documents similar in appearance tendency of each word, and groups terms similar in appearance tendency in the documents. Through this grouping processing, a two-dimensional cluster in which similar documents and terms are grouped is generated. The two-dimensional cluster corresponds to a predetermined area when the documents and the terms are arranged in a two-dimensional table. When being approximated by a circle, this area can be defined by the center and radius of the circle.
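As a rough illustration of a cluster approximated by a circle, the sketch below keeps only a center and a radius and tests whether a position in the two-dimensional table falls inside the cluster. The class name `CircularCluster`, its field names, and the use of table coordinates as plain numbers are assumptions made for this example only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CircularCluster:
    # Hypothetical compact cluster description: center and radius only.
    center_x: float  # position along the document axis (assumed)
    center_y: float  # position along the term axis (assumed)
    radius: float

    def contains(self, x: float, y: float) -> bool:
        # A point belongs to the cluster if it lies within the circle.
        return math.hypot(x - self.center_x, y - self.center_y) <= self.radius
```

Because only three numbers are held per cluster, such a description is far smaller than the documents from which the cluster was derived.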
In the example of
Next, the information processing apparatus 1 generates a one-dimensional cluster as advance preparation (step S62). The one-dimensional cluster is generated by the one-dimensional cluster generating section 120. For example, the one-dimensional cluster can be generated in the following procedure.
From the two-dimensional cluster generated in step S61, the one-dimensional cluster generating section 120 extracts the terms, the appearance frequencies of the terms, and the TCs to generate the one-dimensional cluster that does not include the document category information illustrated in
The processing steps S61 and S62 described above are advance preparation steps and need to be executed once before the subsequent series of processes is executed. There is no need to execute these steps again after the two-dimensional cluster and the one-dimensional cluster are generated. Note, however, that the two-dimensional cluster and the one-dimensional cluster may also be regenerated in response to a trigger such as a user's instruction or the lapse of a predetermined time.
Then, the information processing apparatus 1 updates the documents stored in the document storage section 100, i.e., the information processing apparatus 1 adds a new document in terms of the acquisition time to the document storage section 100, and deletes an old document in terms of the acquisition time from the document storage section 100 (step S63). The documents may be updated every predetermined period of time, updated when the capacity for documents to be updated reaches a threshold value, or updated based on any other criterion. It is also possible to update the documents based on a user operation. The documents are updated by the document updating section 130.
Next, the information processing apparatus 1 updates the two-dimensional cluster (step S64). The two-dimensional cluster is updated by the two-dimensional cluster updating section 140 in such a manner as to cause the two-dimensional cluster generating section 110 to generate a two-dimensional cluster based on the updated documents stored in the document storage section 100. The existing two-dimensional cluster is replaced by the two-dimensional cluster generated in this process.
Then, the information processing apparatus 1 updates the one-dimensional cluster (step S65). The one-dimensional cluster is updated by the one-dimensional cluster updating section 150 in the following manner: First, the content of an old document in terms of the acquisition time to be deleted from the document storage section 100 is morphologically analyzed and decomposed into words. Next, the one-dimensional cluster updating section 150 determines the frequency of appearance of each of the words decomposed from the old document in terms of the acquisition time to be deleted, and adds the determined appearance frequency to the appearance frequency of each corresponding term in the existing one-dimensional cluster. When the probability (the appearance frequency of a term/the appearance frequencies of all terms) is used as the appearance frequency, the updated probability is determined based on the figures obtained by adding the appearance frequency of the corresponding term in the existing one-dimensional cluster to both the denominator and the numerator.
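The update in step S65 can be sketched as follows. This is one possible reading rather than a definitive implementation: the one-dimensional cluster is represented here simply as a mapping from term to raw appearance count, terms that are absent from the existing cluster are ignored, and probabilities are derived from the accumulated counts. The function names `update_one_dimensional_cluster` and `probabilities` are hypothetical.

```python
from collections import Counter
from typing import Dict


def update_one_dimensional_cluster(term_counts: Counter, deleted_doc_counts: Counter) -> Counter:
    """Add word counts from a document being deleted to the existing term counts."""
    updated = Counter(term_counts)  # copy, so the existing cluster is left untouched
    for term, count in deleted_doc_counts.items():
        # Assumption: only terms already present in the cluster are updated.
        if term in updated:
            updated[term] += count
    return updated


def probabilities(term_counts: Counter) -> Dict[str, float]:
    # Appearance frequency of a term divided by the appearance frequencies of all terms.
    total = sum(term_counts.values())
    return {t: n / total for t, n in term_counts.items()} if total else {}
```

Keeping raw counts and deriving the probability on demand is a design choice of this sketch; it keeps both the numerator and the denominator consistent after each update.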
Referring next to
The information processing apparatus 1 first identifies a term associated with a content including at least a word based on the two-dimensional cluster (step S71). The term based on the two-dimensional cluster is identified by the first term identification section 160. Specifically, the first term identification section 160 morphologically analyzes a content to decompose the content into words. Next, the first term identification section 160 identifies a document (category) having a term appearance tendency similar to the appearance tendency of a word in this content. Then, the first term identification section 160 identifies a term high in appearance frequency in the document (category) as a term associated with the content. In this case, if the appearance tendency of the term associated with the content does not vary from document (category) to document (category), or the difference in appearance frequency between terms in the identified document (category) is not large, it will be difficult to identify a term sufficiently associated with the content. In such a case, the information processing apparatus 1 does not identify any term.
Next, the information processing apparatus 1 determines whether a term is identified in step S71 based on the two-dimensional cluster (step S72). As described in step S71, no term may be identified based on the two-dimensional cluster depending on the content. The first term identification section 160 determines whether a term is identified based on the two-dimensional cluster.
When it is determined that a term is identified based on the two-dimensional cluster in step S71 (Y in step S72), the information processing apparatus 1 performs additional content acquisition processing (step S74) to be described later. On the other hand, when it is determined that no term is identified based on the two-dimensional cluster (N in step S72), the information processing apparatus 1 identifies a term associated with the content based on the one-dimensional cluster (step S73). The second term identification section 170 identifies a term based on the one-dimensional cluster.
Specifically, the second term identification section 170 acquires a word obtained by decomposing the content. Here, the second term identification section 170 may morphologically analyze the content to decompose the content, or may use the decomposing results of the first term identification section in step S71. Next, the second term identification section 170 identifies a TC in which the word included in the content appears prominently. Then, the second term identification section 170 identifies a term high in appearance frequency in the TC as a term associated with the content.
When the term is identified based on the two-dimensional cluster (Y in step S72), or when the term is identified based on the one-dimensional cluster (step S73), the information processing apparatus 1 acquires an additional content associated with the identified term, and displays the additional content together with the content (step S74). The additional content is acquired and displayed by the display section 180.
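The flow of steps S71 to S73 can be sketched as below. The scoring rule, the `min_margin` threshold used to decide that no term is sufficiently associated, and all function names are assumptions made for illustration; the specification does not fix them. Document categories and term clusters (TCs) are represented as mappings from a label to a `Counter` of term frequencies.

```python
from collections import Counter
from typing import Dict, List, Optional


def identify_from_2d(words: List[str], categories: Dict[str, Counter], min_margin: int = 2) -> Optional[str]:
    # Step S71: score each document category by how often the content's words appear in it.
    scores = {c: sum(freq[w] for w in words) for c, freq in categories.items()}
    if not scores:
        return None
    ranked = sorted(scores.values(), reverse=True)
    # Assumed rule: give up when no category matches or the best one barely stands out.
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] - ranked[1] < min_margin):
        return None
    best = max(scores, key=scores.get)
    return categories[best].most_common(1)[0][0]


def identify_from_1d(words: List[str], term_clusters: Dict[str, Counter]) -> Optional[str]:
    # Step S73: pick the TC in which the content's words appear most prominently.
    scores = {tc: sum(freq[w] for w in words) for tc, freq in term_clusters.items()}
    if not scores or max(scores.values()) == 0:
        return None
    best = max(scores, key=scores.get)
    return term_clusters[best].most_common(1)[0][0]


def identify_term(words: List[str], categories: Dict[str, Counter], term_clusters: Dict[str, Counter]) -> Optional[str]:
    term = identify_from_2d(words, categories)          # step S71
    if term is None:                                    # N in step S72
        term = identify_from_1d(words, term_clusters)   # step S73
    return term
```

The identified term would then be used as the keyword for acquiring the additional content in step S74.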
Through the processing described above, the information processing apparatus 1 can identify a term associated with a content, and acquire an additional content associated with the identified term to present, to the user, the additional content together with the content.
Since the most recent document information is reflected in the two-dimensional cluster and relatively old document information is reflected in the one-dimensional cluster, these two clusters can be used to identify an appropriate term in association with the content.
When the UM generated in a manner as illustrated in
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the specific embodiment, and various modifications and changes are possible within the gist of the present invention as set forth in the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2016-139751 | Jul 2016 | JP | national |