The present invention relates to an information processing apparatus, an information processing system, and an information processing method, which select a content associated with a document viewed by a user and display the content together with the document.
When a user has only a limited amount of time to view the countless pieces of information transmitted over the Internet every day, it is extremely important for the user to select which information to view. Patent Document 1 describes a technique that collects information associated with the information being viewed and displays it on the same screen to enable efficient information viewing.
[Patent Document 1] Japanese Patent Application Publication No. 2014-215949
In Patent Document 1, information acquired by making a search using, as search words, a keyword extracted from target content information and an additional word defined for a category to which the target content information belongs is displayed in a screen area. Thus, the information associated with the target content information is displayed to enable efficient information viewing.
The keyword can be extracted from the content information by referring to a proper noun dictionary or the like, but the keyword may not appropriately represent the content information. Further, the same keyword may mean different things to different users, as in the case of homonyms or the name of a person who is active in multiple fields. In such cases, information associated with the target content cannot be selected and displayed appropriately.
It is an object of the present invention to provide an information processing apparatus which appropriately acquires a content associated with a document and displays the content together with the document.
In order to solve the above-mentioned problem, an information processing apparatus according to the present invention includes:
a database section which stores, for documents accessible via a network and terms that are words appearing in the documents, document clusters in each of which documents similar in appearance tendency of the terms are grouped;
a word extraction section which extracts a word from a specified document;
a document cluster identifying section which identifies, based on the extracted word, a document cluster associated with the specified document;
a keyword selection section which selects, as a keyword, a term appearing in the identified document cluster;
a content acquisition section which acquires, from the network, a content associated with the selected keyword; and
a display section which displays the acquired content together with the specified document.
According to the present invention, there can be provided an information processing apparatus which appropriately acquires a content associated with a document and displays the content together with the document.
Embodiments of the present invention will be described in detail below.
The communication unit 10 of the information processing apparatus 1 connects the information processing apparatus 1 to the network 3 to send and receive information. Specifically, the communication unit 10 can be composed of a wired LAN interface and a wireless LAN interface (not illustrated), together with control software or firmware for them.
The processing unit 11 of the information processing apparatus 1 performs processing on various pieces of information. This processing includes the execution of software specified by the user through an input unit (not illustrated), as well as processing that is not explicitly specified by the user, such as the control of each of the units constituting the information processing apparatus 1. The processing unit 11 can be composed of a CPU and a memory (not illustrated).
The display unit 12 of the information processing apparatus 1 displays the information processing results by the processing unit 11 in such a manner that the user can view the results. The display unit 12 can be a display unit including a liquid crystal display panel and the like.
The data storage unit 13 of the information processing apparatus 1 stores various data in a nonvolatile manner. The various data may be received from the network 3 through the communication unit 10, or created based on user input through the unillustrated input unit. Further, the various data can be processing targets of the processing unit 11. The data storage unit 13 can be a nonvolatile storage device, such as a hard disk drive or an SSD (Solid State Drive).
The communication unit 20 of the retrieval server 2 connects the retrieval server 2 to the network 3 to send and receive information. Specifically, the communication unit 20 can be composed of a wired LAN interface and a wireless LAN interface (not illustrated), together with control software or firmware for them.
The searching unit 21 of the retrieval server 2 performs a search in response to a search request accepted by the communication unit 20 via the network 3, and sends the search results to a requestor via the network 3. The search here is made to identify information having predetermined association with a keyword included in the search request. Such a search may be made based on data held in the retrieval server 2, or may be made by making a request to an information holding server different from the retrieval server 2.
The database section 100 stores, for documents accessible via the network and terms that are words appearing in the documents, term clusters in each of which terms similar in appearance tendency in the documents are grouped, and document clusters in each of which documents similar in appearance tendency of the terms are grouped.
Examples of data stored in the database section 100 are illustrated in
In
In
For example, it can be read from
The database section 100 may also store a degree of interest identified for each term based on the history of operations carried out on the information processing apparatus 1 by its user. The degree of interest is an estimated value of the user's interest in the term. It can be calculated, for example, as follows: each time the user carries out an operation on a document, such as viewing it, a score corresponding to the operation is given to each term appearing in the document, and the scores are accumulated for each term.
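The accumulation of operation scores described above can be sketched as follows. This is only an illustrative sketch: the operation names, their weights, and the function name `accumulate_interest` are assumptions, since the description states only that a score corresponding to the operation is given to each term.

```python
from collections import Counter

# Hypothetical operation weights (not specified in the description).
OPERATION_SCORES = {"view": 1.0, "bookmark": 3.0, "share": 5.0}

def accumulate_interest(history, doc_terms, scores=OPERATION_SCORES):
    """Sum, for each term, the scores of the operations carried out on
    documents in which the term appears."""
    interest = Counter()
    for doc_id, operation in history:
        for term in doc_terms.get(doc_id, ()):
            interest[term] += scores[operation]
    return dict(interest)

# Illustrative data: which terms appear in which document, and a user history.
doc_terms = {"doc1": {"Suzuki", "Jacket"}, "doc2": {"Suzuki", "Fukuoka"}}
history = [("doc1", "view"), ("doc2", "bookmark"), ("doc1", "view")]
interest = accumulate_interest(history, doc_terms)
print(interest)
```

With this history, "Suzuki" accumulates the scores of all three operations, while "Jacket" and "Fukuoka" accumulate only the scores of operations on the documents in which they appear.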
An example of data stored in the database section 100 in this case is illustrated in
The way of calculating the degree of interest is not limited to that mentioned above. The degree of interest can also be calculated by obtaining the appearance frequencies of terms in the documents accessible via the network and in the documents actually accessed by the user, and comparing the two. In other words, a term whose appearance frequency is higher in the documents actually accessed by the user than in the documents accessible via the network can be determined to be of higher interest to the user.
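One possible way to make this comparison is to take the ratio of the two appearance probabilities, as in the sketch below. The use of a ratio, the function name `interest_by_frequency_ratio`, and all counts are assumptions made for illustration; the description requires only that the two frequencies be compared.

```python
def interest_by_frequency_ratio(user_counts, global_counts):
    """Return, per term, (appearance probability in documents accessed by the
    user) / (appearance probability in documents accessible via the network).
    A value above 1 suggests a higher degree of interest."""
    user_total = sum(user_counts.values())
    global_total = sum(global_counts.values())
    ratios = {}
    for term, global_count in global_counts.items():
        if global_count == 0 or user_total == 0:
            continue  # no basis for comparison
        p_user = user_counts.get(term, 0) / user_total
        p_global = global_count / global_total
        ratios[term] = p_user / p_global
    return ratios

# Hypothetical counts, for illustration only.
global_counts = {"Suzuki": 40, "Jacket": 20, "Fukuoka": 40}
user_counts = {"Suzuki": 2, "Jacket": 6, "Fukuoka": 2}
ratios = interest_by_frequency_ratio(user_counts, global_counts)
print(sorted(ratios, key=ratios.get, reverse=True))
```

In this example "Jacket" is over-represented in the documents the user accessed, so it would be treated as the term of highest interest.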
Suppose that
The database section 100 stores predetermined data in the data storage unit 13, and can be implemented by the processing unit 11 executing a predetermined database management program.
The word extraction section 110 extracts a word from a specified document. Here, the document means a content having corresponding text, such as a web page with a news article. The term “specified” here means that the document is selected from multiple targets. The document may be selected by the user, or by the apparatus according to a predetermined algorithm.
For example, the word can be extracted by performing morphological analysis on the text corresponding to the specified document. The word extraction section 110 can be implemented by the processing unit 11 executing a predetermined program.
The document cluster identifying section 120 identifies a document cluster associated with the specified document based on the extracted word. For example, a document cluster in which the appearance frequency of a term corresponding to the extracted word is high and the appearance frequencies of terms other than the extracted word are low can be identified as the associated document cluster. Alternatively, a document cluster for which the distance between a vector of the extracted words and a vector of the appearance frequencies of the terms in the document cluster is small can be identified as the associated document cluster.
Suppose that “Suzuki” and “Jacket” are extracted from the specified document to identify a document cluster associated with this document from data illustrated in
First, consider the case where a document cluster in which the terms corresponding to the extracted words appear frequently and the other terms appear infrequently is identified as the associated document cluster. The appearance-frequency ranks of “Suzuki” and “Jacket”, which correspond to the extracted words, are second and third in document cluster A, second and fourth in B, third and first in C, and second and third in D. The ranks of the other terms, “Derek” and “Fukuoka”, are first and fourth in A, third and first in B, fourth and second in C, and third and first in D. Four points are given for first place, three points for second place, two points for third place, and one point for fourth place. For each document cluster, the points of the extracted words and the points of the other terms are each summed, and the sum for the other terms is multiplied by minus one before the two sums are added together. As a result, A scores 0 points, B scores −2 points, C scores 2 points, and D scores −1 point. Thus, the document cluster C, which has the highest score, is identified as the associated document cluster.
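The rank-based scoring in this example can be reproduced with the short sketch below. The ranks are those stated in the text; the function and variable names are illustrative assumptions.

```python
# Appearance-frequency ranks stated in the text (1 = most frequent).
EXTRACTED_RANKS = {"A": (2, 3), "B": (2, 4), "C": (3, 1), "D": (2, 3)}  # "Suzuki", "Jacket"
OTHER_RANKS = {"A": (1, 4), "B": (3, 1), "C": (4, 2), "D": (3, 1)}      # "Derek", "Fukuoka"

def rank_points(rank, n_terms=4):
    """Four points for first place down to one point for fourth place."""
    return n_terms + 1 - rank

def cluster_scores(extracted_ranks, other_ranks):
    """Points of the extracted words minus points of the other terms."""
    scores = {}
    for cluster in extracted_ranks:
        plus = sum(rank_points(r) for r in extracted_ranks[cluster])
        minus = sum(rank_points(r) for r in other_ranks[cluster])
        scores[cluster] = plus - minus
    return scores

scores = cluster_scores(EXTRACTED_RANKS, OTHER_RANKS)
best = max(scores, key=scores.get)
print(scores, best)  # {'A': 0, 'B': -2, 'C': 2, 'D': -1} C
```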
Next, consider the case where a document cluster for which the distance between a vector of the extracted words and a vector of the appearance frequencies of the terms in the document cluster is small is identified as the associated document cluster. When the words “Suzuki” and “Jacket” are extracted, the vector of the extracted words, normalized so that its components sum to 1.0, is (0.5, 0, 0, 0.5). Similarly, the normalized vectors of the appearance frequencies of the respective terms in each document cluster are (0.38, 0.42, 0.00, 0.21) in A, (0.32, 0.27, 0.36, 0.05) in B, (0.22, 0.06, 0.28, 0.44) in C, and (0.25, 0.00, 0.75, 0.00) in D. When the distance between two vectors is taken to be the sum of the absolute values of the differences between the values corresponding to the respective terms, the distances are 0.83 for A, 1.27 for B, 0.67 for C, and 1.50 for D. In this case, the document cluster C, which has the smallest distance, is identified as the associated document cluster.
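The distance calculation can be sketched as follows, using the rounded vectors given in the text. The term order (Suzuki, Derek, Fukuoka, Jacket) is inferred from the rankings stated earlier and is an assumption, as are the function names. Because the inputs are already rounded to two decimal places, the computed distances for B and C differ from the stated 1.27 and 0.67 by about 0.01, which does not affect the ordering.

```python
# Term order inferred from the stated rankings (an assumption).
TERMS = ("Suzuki", "Derek", "Fukuoka", "Jacket")

# Normalized appearance-frequency vectors as given in the text (rounded).
CLUSTER_VECTORS = {
    "A": (0.38, 0.42, 0.00, 0.21),
    "B": (0.32, 0.27, 0.36, 0.05),
    "C": (0.22, 0.06, 0.28, 0.44),
    "D": (0.25, 0.00, 0.75, 0.00),
}

def normalize(vector):
    """Scale a vector so that its components sum to 1.0."""
    total = sum(vector)
    return [x / total for x in vector]

def l1_distance(u, v):
    """Sum of absolute differences between corresponding components."""
    return sum(abs(a - b) for a, b in zip(u, v))

extracted = {"Suzuki", "Jacket"}
query = normalize([1 if term in extracted else 0 for term in TERMS])
distances = {name: l1_distance(query, vec) for name, vec in CLUSTER_VECTORS.items()}
nearest = min(distances, key=distances.get)
print({name: round(d, 2) for name, d in distances.items()}, nearest)
```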
In either case, the method of calculating the scores or distances is just an example, and any other calculation method can be applied. For example, the Euclidean distance may be used as the distance between the vectors, or the cosine similarity may be used.
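As an illustration of these alternatives, the sketch below applies the Euclidean distance and the cosine similarity to the same rounded vectors. The term order (Suzuki, Derek, Fukuoka, Jacket) and the function names are assumptions. With these values, both measures also select the document cluster C.

```python
import math

QUERY = (0.5, 0.0, 0.0, 0.5)  # normalized vector for "Suzuki" and "Jacket"
CLUSTER_VECTORS = {
    "A": (0.38, 0.42, 0.00, 0.21),
    "B": (0.32, 0.27, 0.36, 0.05),
    "C": (0.22, 0.06, 0.28, 0.44),
    "D": (0.25, 0.00, 0.75, 0.00),
}

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Smaller distance means more closely associated.
by_euclidean = min(CLUSTER_VECTORS, key=lambda c: euclidean_distance(QUERY, CLUSTER_VECTORS[c]))
# Larger similarity means more closely associated.
by_cosine = max(CLUSTER_VECTORS, key=lambda c: cosine_similarity(QUERY, CLUSTER_VECTORS[c]))
print(by_euclidean, by_cosine)
```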
The document cluster identifying section 120 can be implemented by the processing unit 11 executing the predetermined program. Although the case of identifying a document cluster from data in
The keyword selection section 130 selects, as a keyword, a term appearing in the identified document cluster. For example, a term with a high appearance frequency in the identified document cluster can be selected as the keyword. A term whose appearance probability in the identified document cluster is high compared with its appearance probability in all documents can also be selected as the keyword. Further, when the database section 100 stores the degree of interest, a term with a high degree of interest in the identified document cluster can be selected as the keyword.
Consider a case where “Suzuki” and “Jacket” are extracted from a specified document, and terms appearing in the document cluster C identified as a document cluster associated with this document from data illustrated in
The terms appearing in the document cluster C in
Among them, “Jacket” and “Fukuoka” appear at high frequency in documents belonging to the document cluster C, and are therefore suitable for being selected as keywords to acquire a content to be added to the documents.
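Selection by appearance frequency can be sketched as a simple top-k selection. The values below are the normalized frequencies of the document cluster C given earlier in the text; the assignment of values to terms is inferred from the stated rankings, and the function name `top_terms` and the choice of k = 2 are assumptions.

```python
# Normalized appearance frequencies in document cluster C (term assignment inferred).
CLUSTER_C = {"Suzuki": 0.22, "Derek": 0.06, "Fukuoka": 0.28, "Jacket": 0.44}

def top_terms(frequencies, k=2):
    """Return the k terms with the highest appearance frequency."""
    return sorted(frequencies, key=frequencies.get, reverse=True)[:k]

keywords = top_terms(CLUSTER_C)
print(keywords)  # ['Jacket', 'Fukuoka']
```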
Further, each appearance probability in the document cluster C and each appearance probability in all documents can be compared to select a keyword. The appearance probability in the document cluster C can be calculated by dividing the appearance frequency of each term in the document cluster C by the total appearance frequency in the document cluster C. When the values of the appearance frequencies of respective terms illustrated in
When these are compared, the appearance probability of the term “Jacket” in the document cluster C is 0.44, whereas the appearance probability of the term in all the documents is 0.21. Thus, the appearance probability of the term “Jacket” is higher in the document cluster C than in the documents overall. Since such a term appears in the identified document cluster at a characteristically high frequency, it is suitable for being selected as the keyword to acquire a content to be added to the documents. When the selection is made in this way, a keyword can be selected appropriately even if the documents include many common words (postpositional particles, etc.) that appear in the document cluster at high frequency but do not characterize it, because such common words also appear at high frequency in all the documents.
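The effect on common words can be illustrated with the sketch below, which ranks terms by the ratio of the appearance probability in the document cluster to that in all documents. The use of a ratio, the function name `probability_ratio`, and every count shown (including the stand-in common word "the") are hypothetical and are not taken from the figures of this description.

```python
def probability_ratio(cluster_counts, all_counts):
    """Return, per term, (appearance probability in the cluster) /
    (appearance probability in all documents)."""
    cluster_total = sum(cluster_counts.values())
    all_total = sum(all_counts.values())
    return {
        term: (count / cluster_total) / (all_counts[term] / all_total)
        for term, count in cluster_counts.items()
        if all_counts.get(term, 0) > 0
    }

# Hypothetical counts; "the" stands in for a common word.
cluster_counts = {"the": 50, "Jacket": 8, "Fukuoka": 5, "Suzuki": 4}
all_counts = {"the": 500, "Jacket": 20, "Fukuoka": 25, "Suzuki": 30}

ratios = probability_ratio(cluster_counts, all_counts)
ranked = sorted(ratios, key=ratios.get, reverse=True)
most_frequent = max(cluster_counts, key=cluster_counts.get)
print(ranked, most_frequent)
```

Under these assumed counts, the common word has the highest raw frequency in the cluster but the lowest ratio, so it is not selected as a keyword.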
Further, when the values of the appearance frequencies of respective terms illustrated in
In selecting a term as a keyword from a document cluster, whether the term is also extracted from the specified document can be taken into account. When a term that appears in documents belonging to the document cluster and also appears in the specified document is selected as a keyword, the acquired content can better reflect the subject of the document and the user's degree of interest than when the content to be added is acquired based only on the words extracted from the specified document.
The keyword selection section 130 can be implemented by the processing unit 11 executing the predetermined program.
The content acquisition section 140 acquires, from the network, a content associated with a selected keyword. The content associated with the keyword is acquired, for example, by sending a search request with the keyword as a search word to the retrieval server 2 connected through the network 3, and receiving, from the retrieval server 2, the retrieval results as information having predetermined association with the keyword. The content acquisition section 140 can be implemented by the processing unit 11 executing the predetermined program, and the communication unit 10 performing communication through the network 3 as needed.
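A minimal sketch of such a search request is shown below. The endpoint URL, the query parameter name `q`, and the JSON response shape with a `results` list are all hypothetical; the description does not specify the protocol of the retrieval server 2. The transport is passed in as a function so that the sketch can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint of the retrieval server (not specified in the description).
SEARCH_ENDPOINT = "http://retrieval-server.example/search"

def build_search_url(keyword, endpoint=SEARCH_ENDPOINT):
    """Build a search request URL with the keyword as the search word."""
    return f"{endpoint}?{urlencode({'q': keyword})}"

def http_fetch(url):
    """Default transport: perform an HTTP GET and return the body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")

def acquire_content(keyword, fetch=http_fetch):
    """Send the search request and return the list of retrieval results."""
    body = fetch(build_search_url(keyword))
    return json.loads(body)["results"]

# Usage with a stand-in transport instead of a real server.
def fake_fetch(url):
    return json.dumps({"results": [{"title": "Example article", "url": url}]})

contents = acquire_content("Jacket", fetch=fake_fetch)
print(contents[0]["title"])
```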
The display section 150 displays the acquired content together with the specified document. Since the specified document and the acquired content are displayed together, the user can access the associated content together with the document.
The content may be displayed in a screen area different from the area of the document, or may be displayed by inserting the content into the document. When the document does not fit on one screen, the content may be added to and displayed in the part of the document that lies outside the screen. In this case, the user views the entire content by performing a scroll operation, but can still easily grasp that the content is displayed in association with the document.
The display section 150 can be implemented by the processing unit 11 executing the predetermined program to control the display content of the display unit 12. Even if the information processing apparatus 1 does not have the display unit 12, the display section 150 can also be implemented by controlling the display content of a display device (not illustrated) connected to the information processing apparatus 1.
Referring next to
First, in the information processing apparatus 1, the word extraction section 110 extracts a word from a specified document (step S41). Then, in the information processing apparatus 1, the document cluster identifying section 120 identifies, based on the word extracted in step S41, a document cluster associated with the specified document from among document clusters stored in the database section 100 (step S42).
Next, in the information processing apparatus 1, the keyword selection section 130 selects, as a keyword, a term appearing in the document cluster identified in step S42 (step S43). Then, in the information processing apparatus 1, the content acquisition section 140 acquires, from the network, a content associated with the keyword selected in step S43 (step S44).
Finally, in the information processing apparatus 1, the display section 150 displays the content acquired in step S44 together with the specified document (step S45).
Thus, the content having predetermined association with the content of the specified document can be acquired and displayed together with the document by executing the processing steps mentioned above.
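The flow of steps S41 to S45 can be summarized in the following sketch, in which each section is represented by a function passed in as an argument. The function name `run_pipeline` and the stand-in implementations are illustrative assumptions and do not represent the actual processing of each section.

```python
def run_pipeline(document, extract_words, identify_cluster,
                 select_keywords, acquire_content, display):
    words = extract_words(document)          # step S41: word extraction
    cluster = identify_cluster(words)        # step S42: document cluster identification
    keywords = select_keywords(cluster)      # step S43: keyword selection
    contents = [item for keyword in keywords
                for item in acquire_content(keyword)]  # step S44: content acquisition
    return display(document, contents)       # step S45: display with the document

# Stand-in implementations, for illustration only.
KNOWN_TERMS = {"Suzuki", "Derek", "Fukuoka", "Jacket"}

result = run_pipeline(
    "Suzuki wore a Jacket",
    extract_words=lambda doc: [w for w in doc.split() if w in KNOWN_TERMS],
    identify_cluster=lambda words: "C",
    select_keywords=lambda cluster: ["Jacket", "Fukuoka"],
    acquire_content=lambda keyword: [f"article about {keyword}"],
    display=lambda doc, contents: {"document": doc, "related": contents},
)
print(result)
```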
Next, a second embodiment of the present invention will be described.
A counting server 4 counts the terms, that is, the words appearing in each of the documents accessible via the network, and provides the results to the information processing apparatus 1. The counting server 4 is configured to include a communication unit 40, a counting unit 41, and a data storage unit 42.
The communication unit 40 of the counting server 4 connects the counting server 4 to the network 3 to send and receive information. Specifically, the communication unit 40 can be composed of a wired LAN interface and a wireless LAN interface (not illustrated), together with control software or firmware for them.
The counting unit 41 of the counting server 4 counts up data received by the communication unit 40 from the network 3. Specific counting processing will be described later. The counting unit 41 can be configured by an unillustrated processor executing a predetermined program.
The data storage unit 42 of the counting server 4 stores various data in a nonvolatile manner. The various data may be data obtained by the counting unit 41 counting up the data received by the communication unit 40 from the network 3. The data storage unit 42 can be a nonvolatile storage device such as a hard disk drive or an SSD (Solid State Drive).
The counting unit 41 stores, in terms of documents accessible via the network and terms appearing in the documents, term clusters in each of which terms similar in appearance tendency in the documents are grouped, and document clusters in each of which documents similar in term appearance tendency are grouped.
Suppose here that multiple apparatuses similar to the information processing apparatus 1 exist on the network 3 and these apparatuses are operated by different users. In this case, data stored in the database section 100 can, of course, be constructed individually by each information processing apparatus 1. However, the appearance tendency of terms in the documents accessible via the network is the same among all the information processing apparatuses 1. Therefore, if the data are constructed by the counting server 4 and at least some pieces of data are delivered to the information processing apparatus 1 through the network 3, the load on the information processing apparatus 1 can be reduced efficiently.
Further, the tendencies of the user who operates each information processing apparatus 1 are grasped first by that information processing apparatus 1. Therefore, each information processing apparatus 1 can build a database in which the degree of the user's interest in each term, as grasped by the information processing apparatus 1, is added to the data on the common appearance tendencies between documents and terms received from the counting server 4, and can thereby acquire and display a content that matches the user's taste.
Alternatively, the number of times the user viewed each document, grasped based on the history of user operations on the information processing apparatus 1, may be stored in categories according to the data on the common appearance tendencies between the documents and the terms received from the counting server 4. In this case, the appearance frequencies in the documents accessible via the network and the appearance frequencies in the documents actually accessed by the user can be compared, so that a degree of interest can be determined.
While the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the specific embodiments, and various modifications and changes are possible within the gist of the present invention as set forth in the appended claims.