This application is based on and claims priority under 35 USC 119 from Japanese Patent Application No. 2015-039955 filed Mar. 2, 2015.
The present invention relates to an information processing apparatus, an information processing method, and a non-transitory computer readable medium.
The gist of the present invention resides in an aspect of the present invention as described below.
According to an aspect of the invention, there is provided an information processing apparatus including a first extracting unit, a second extracting unit, and a third extracting unit. The first extracting unit applies a topic model to target text information and extracts topic distributions for words constituting the text information. The second extracting unit extracts a first topic for the text information from the topic distributions extracted by the first extracting unit. The third extracting unit extracts, as a context word in the text information, a word satisfying a predetermined condition from at least one word having the first topic extracted by the second extracting unit.
Exemplary embodiments of the present invention will be described in detail based on the accompanying figures.
Various exemplary embodiments suitable to embody the present invention will be described below on the basis of the drawings.
In general, a module refers to a logically separable component, such as software (a computer program) or hardware. Thus, a module in the exemplary embodiments refers not only to a module in terms of a computer program but also to a module in terms of a hardware configuration. Consequently, the description of the exemplary embodiments also serves as a description of a system, a method, and a computer program which cause the hardware configuration to function as a module (a program that causes a computer to execute procedures, a program that causes a computer to function as units, or a program that causes a computer to implement functions). For convenience of explanation, the terms “to store something” and “to cause something to store something”, and equivalent terms, are used. When the exemplary embodiments are achieved by using computer programs, these terms mean that a storage apparatus stores something, or that a storage apparatus is controlled so as to store something. One module may correspond to one function. However, in the implementation, one program may constitute one module, one program may constitute multiple modules, or multiple programs may constitute one module. Additionally, multiple modules may be executed by one computer, or one module may be executed by multiple computers in a distributed or parallel processing environment. One module may include another module. Hereinafter, the term “connect” refers to logical connection, such as transmission/reception of data, an instruction, or a reference relationship between pieces of data, as well as physical connection. The term “predetermined” refers to a state in which determination has been made before a target process; this includes determination made in accordance with the situation or state at that time, or the situation or state before that time, not only before the processes according to the exemplary embodiments start, but also before the target process starts even after the processes according to the exemplary embodiments have started. When multiple “predetermined values” are present, they may be different from each other, or two or more of the values (including, of course, all of the values) may be the same. A description having the meaning “when A is satisfied, B is performed” is used as having the meaning “whether or not A is satisfied is determined and, when it is determined that A is satisfied, B is performed”, except in a case where the determination of whether or not A is satisfied is unnecessary.
A system or an apparatus refers to one in which multiple computers, pieces of hardware, devices, and the like are connected to each other by using a communication unit such as a network which includes one-to-one communication connection, and also refers to one which is implemented by using a computer, a piece of hardware, a device, or the like. The terms “apparatus” and “system” are used as terms that are equivalent to each other. As a matter of course, the term “system” does not include what is nothing more than a social “mechanism” (social system) operating on man-made agreements.
In each of the processes performed by modules, or in each of the processes performed in a module, target information is read out from a storage apparatus, and after the process is performed, the processing result is written into a storage apparatus. Accordingly, descriptions of reading from the storage apparatus before a process and writing into the storage apparatus after the process may be omitted. Examples of the storage apparatus may include a hard disk, a random access memory (RAM), an external storage medium, a storage apparatus accessed via a communication line, and a register in a central processing unit (CPU).
An information processing apparatus 100 according to the first exemplary embodiment extracts context words for the first topic (which may be hereinafter referred to as the main topic) for target text information.
Terms used in the description in the exemplary embodiment will be described below.
The term “polarity” means a property of a document or a word based on a certain pole. In the description in the exemplary embodiment, “polarity” indicates a property of the positive sensibility pole or the negative sensibility pole.
The term “target” means a target for which context information is to be extracted. Examples of a target include a person name, an organization name, a place name, and a product name.
The term “topic” means a multinomial word distribution which is output by using a topic modeling technique, such as latent Dirichlet allocation (LDA) or Labeled LDA. In a topic, a word having a stronger relationship with the topic has a higher probability value. The terms “cluster”, “latent class”, and the like may also be used as aliases of the term “topic”.
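As a concrete illustration (the words and probability values below are hypothetical), a topic may be represented in code as a word-to-probability mapping:

```python
# A topic as a multinomial distribution over words (hypothetical values);
# words more strongly related to the topic receive higher probabilities,
# and the probabilities over the whole vocabulary sum to 1.
topic_T1 = {"food": 0.21, "flavor": 0.17, "selling": 0.12, "store": 0.08}
```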
The term “model” means data obtained as a learning result using a machine learning technique. In the description in the exemplary embodiment, it indicates a learning result using a topic modeling technique. For example, a resulting model obtained by subjecting a text set to learning using a topic modeling technique may be used to estimate a topic distribution for a word.
The term “supervised signal” means data showing a correct result produced for certain input data on the basis of some criterion. For example, a supervised signal may be used as data representing a correct classification result for a certain input data example in a learning process. Learning using combinations of such input data and supervised signals representing the correct classification results enables a model to be generated.
In a determination process, applying a model obtained through machine learning to input data whose classification is not known enables the classification of the input data to be estimated. Thus, a supervised signal may indicate correct output result data which is determined on the basis of a certain criterion and which is produced for input data.
In techniques of the related art, syntax information is used to obtain context information for a target. In a technique using syntax information, when the target is text having plenty of noise that decreases the accuracy of syntactic analysis (for example, colloquial text such as social media posts, wording used by young people that includes new words, and sentences having grammatical errors), performance decreases due to errors in syntactic analysis.
The model generating module 105 includes a document database (DB) 110, a topic modeling module 115, and a model output module 120. The model generating module 105 applies a topic modeling technique to a text set, and generates a topic model. Examples of a text set include writings posted in a social networking service (SNS), such as tweets.
The contextual processing module 150 includes a document/target receiving module 155, a word topic estimating module 160, a main topic extracting module 165, a context information determining module 170, and a context information output module 190. The contextual processing module 150 applies the topic model generated by the model generating module 105 to the text to be analyzed, and obtains a topic distribution for each word. The contextual processing module 150 extracts a topic, for example, the one having the highest probability, as the main topic on the basis of the topic distribution for the target. Then, the contextual processing module 150 extracts, for example, words other than the target whose highest-probability topic is the main topic, as context information for the target.
The document DB 110 is connected to the topic modeling module 115. The document DB 110 is used to store text collected in advance. For example, text collected from an SNS is stored.
The topic modeling module 115 is connected to the document DB 110 and the model output module 120. From multiple texts stored in the document DB 110, the topic modeling module 115 extracts words constituting the texts. The topic modeling module 115 applies the topic modeling technique to the extracted words, and generates a topic model. The topic modeling module 115 transmits the generated topic model to the model output module 120.
The model output module 120 is connected to the topic modeling module 115 and the model storage apparatus 125. The model output module 120 stores the topic model generated by the topic modeling module 115, in the model storage apparatus 125.
The model storage apparatus 125 is connected to the model output module 120 and the word topic estimating module 160. The model storage apparatus 125 stores the topic model which is output from the model output module 120 (the topic model generated by topic modeling module 115). The model storage apparatus 125 supplies the topic model to the word topic estimating module 160 of the contextual processing module 150.
The document/target receiving module 155 is connected to the word topic estimating module 160. The document/target receiving module 155 receives a target and a target text. The target text is a text which is a target from which context words for the topic are extracted. For example, the target text may be a text created through a user operation using a mouse, a keyboard, a touch panel, voice, a line of sight, a gesture, or the like, or may be a text obtained by reading out a text stored in a storage apparatus such as a hard disk (including a storage apparatus included in a computer, and a storage apparatus connected via a network) or the like.
The word topic estimating module 160 is connected to the model storage apparatus 125, the document/target receiving module 155, and the main topic extracting module 165. The word topic estimating module 160 applies the topic model to the target text, and extracts topic distributions for the words constituting the text. The expression “words constituting text information” means words included in the text information. The term “topic distribution” means the probability of a topic indicated by a target word; in the case where multiple topics may be attached to one word, it means the probabilities of those topics. For example, as described below, for the word “Food A”, the probability that the topic indicated by the word is “T1” is 100%. For the word “selling”, the topics indicated by the word are “T1” and “T2”; the probability that the topic is “T1” is 66.7%, and the probability that the topic is “T2” is 33.3%. That is, in the specific data structure of a topic distribution, a word corresponds to one or more sets (pairs) of a topic indicated by the word and a probability value for the topic.
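In code, such a topic distribution may be represented, for example, as a mapping from each word to its (topic, probability value) pairs; the following sketch reproduces the example above (the variable name is illustrative only):

```python
# Per-word topic distributions: each word maps to one or more
# (topic, probability value) pairs, as in the example above.
topic_distributions = {
    "Food A":  [("T1", 1.000)],
    "selling": [("T1", 0.667), ("T2", 0.333)],
}
```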
The main topic extracting module 165 is connected to the word topic estimating module 160 and the context information determining module 170. The main topic extracting module 165 extracts the main topic for the target text from the topic distributions extracted by the word topic estimating module 160. Specifically, the main topic extracting module 165 extracts the topic having the highest probability value, from the topic distributions as the main topic for the target.
The context information determining module 170 is connected to the main topic extracting module 165 and the context information output module 190. The context information determining module 170 extracts a word satisfying a predetermined condition, from words having the main topic extracted by the main topic extracting module 165, as a context word in the text. The “predetermined condition” may be, for example, (1) a condition that a word is regarded as a context word when its highest-probability topic is the main topic, (2) a condition that a word is regarded as a context word when its probability value for the main topic is equal to or higher than a predetermined threshold, or (3) a condition that a word is regarded as a context word when its highest-probability topic is the main topic and that probability value is equal to or higher than a predetermined threshold. Multiple words may be extracted as context words.
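As a sketch of these three alternative conditions (a hypothetical helper function, not part of the specification):

```python
def is_context_word(dist, main_topic, condition=1, threshold=0.5):
    """dist: list of (topic, probability) pairs for one word.
    (1) the word's highest-probability topic is the main topic;
    (2) the word's probability for the main topic is >= threshold;
    (3) both (1) and the threshold test hold for the highest-probability topic."""
    best_topic, best_prob = max(dist, key=lambda tp: tp[1])
    if condition == 1:
        return best_topic == main_topic
    if condition == 2:
        return any(t == main_topic and p >= threshold for t, p in dist)
    return best_topic == main_topic and best_prob >= threshold
```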
The context information output module 190 is connected to the context information determining module 170. The context information output module 190 receives the context word (word set) extracted by the context information determining module 170, and outputs the context word. Examples of outputting the context word include printing the context word using a printing apparatus such as a printer, displaying the context word on a display apparatus such as a display, writing the context word into a storage apparatus such as a database, storing the context word into a storage medium such as a memory card, and transmitting the context word to another information processing apparatus. As the information to be output, not only the context word but also a correspondence between the target text and the context word may be output.
The post-processing of the information processing apparatus 100 is, for example, as follows. For SNS posts in which evaluation texts for a certain product (the target) are written, the information processing apparatus 100 extracts words for the main topic from each sentence; a downstream process then receives the information output by the context information output module 190, determines the polarity of each word for the main topic, and determines whether the product is evaluated as positive (affirmative) or negative (critical).
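A minimal sketch of such polarity-based post-processing, assuming a hypothetical polarity lexicon (the specification does not define one):

```python
# Hypothetical polarity lexicon; a real system would use a sentiment dictionary.
POLARITY = {"selling": +1, "in short supply": -1, "popular": +1}

def evaluate(context_words):
    # Sum the polarity of each context word; unknown words count as neutral.
    score = sum(POLARITY.get(w, 0) for w in context_words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(evaluate(["Flavor B", "selling", "our store"]))  # -> positive
```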
The information processing apparatus 100, a document processing apparatus 210, a context-information application processing apparatus 250, and a user terminal 280 are connected with one another via a communication line 290. The communication line 290 may be wireless, wired, or a combination of these; for example, it may be the Internet or an intranet serving as a communication infrastructure. The document processing apparatus 210 provides a service such as an SNS and collects texts; alternatively, the document processing apparatus 210 collects texts from an information processing apparatus providing a service such as an SNS. The information processing apparatus 100 extracts context information by using the texts collected by the document processing apparatus 210. The context-information application processing apparatus 250 performs processing using the context information. The user terminal 280 receives processing results produced by the information processing apparatus 100 and the context-information application processing apparatus 250, and presents them to a user. The functions of the information processing apparatus 100, the document processing apparatus 210, and the context-information application processing apparatus 250 may be implemented as cloud services. The document processing apparatus 210 may include the model generating module 105 and the model storage apparatus 125; in this case, the information processing apparatus 100 receives a topic model from the document processing apparatus 210. The user terminal 280 may be a portable terminal.
In step S302, the topic modeling module 115 extracts a document set. The topic modeling module 115 extracts a document set from the document DB 110. In the document DB 110, for example, a document table 400 is stored.
In step S304, the topic modeling module 115 extracts words. The topic modeling module 115 extracts words from each text. In extraction of words, a part of speech (POS) tagger or the like is used when the text is English, and a morphological analyzer or the like is used when the text is Japanese.
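For English text, a sketch of this step using the NLTK part-of-speech tagger (one common choice; the specification does not name a specific tool):

```python
import nltk  # requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

text = "Food A of Flavor B is selling very well."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)  # list of (word, POS tag) pairs

# Keep content words (nouns, verbs, adjectives) as input for topic modeling.
words = [w for w, tag in tagged if tag[0] in ("N", "V", "J")]
# For Japanese text, a morphological analyzer such as MeCab would be used instead.
```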
In step S306, the topic modeling module 115 performs topic modeling. The topic modeling module 115 applies the topic modeling technique to the word set for each text. Specifically, a technique such as latent Dirichlet allocation (LDA) is used.
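A minimal sketch of this step, assuming the gensim library; the toy word sets stand in for the extraction results of step S304:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy word sets per text, standing in for the output of step S304.
texts = [["food_a", "flavor_b", "selling", "in_short_supply"],
         ["food_a", "our_store", "in_stock"],
         ["selling", "already", "in_short_supply"]]

dictionary = Dictionary(texts)                   # word <-> integer id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words per text

# Learn a topic model with latent Dirichlet allocation (LDA).
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, passes=20)
```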
In step S308, the model output module 120 outputs a topic model. The model output module 120 outputs the generated topic model.
In step S502, the document/target receiving module 155 receives a target. The document/target receiving module 155 receives input of a target for which context information is to be extracted. For example, the word “Food A” is received.
In step S504, the document/target receiving module 155 receives a document which is a text. The document/target receiving module 155 receives input of a text from which context information for the target is to be extracted. For example, a text which means “Food A of Flavor B is selling very well, and is already in short supply. Our store has it in stock.” is received.
In step S506, the word topic estimating module 160 extracts words from the text. In the above-described example, the extraction result corresponds to “Food A / Flavor B / selling / already / in short supply / our store / in stock / has”. The symbol “/” indicates a word separator.
In step S508, the word topic estimating module 160 receives a model. That is, the word topic estimating module 160 reads the topic model generated according to the model generation flowchart described above (steps S302 to S308).
In step S510, the word topic estimating module 160 estimates word topics. That is, the word topic estimating module 160 estimates a topic for each word by using the topic modeling technique.
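Continuing the gensim-based sketch from step S306 (the names dictionary and lda are carried over from that sketch), one way to obtain per-word topic information for a target text:

```python
# Per-word topic estimation for one target text.
bow = dictionary.doc2bow(["food_a", "flavor_b", "selling", "our_store"])
doc_topics, word_topics, phi_values = lda.get_document_topics(
    bow, per_word_topics=True, minimum_probability=0.0)

for word_id, topics in word_topics:
    print(dictionary[word_id], topics)  # topic ids relevant to this word
```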
A word extraction result 600 shows the words extracted in step S506.
As a result of the process performed by the word topic estimating module 160, topic distributions are estimated as follows: “100% for Topic 1” for “Food A”; “100% for Topic 1” for “Flavor B”; “66.7% for Topic 1 and 33.3% for Topic 2” for “selling”; “55.6% for Topic 3 and 11.1% for Topic 1” for “already”; “77.8% for Topic 3” for “in short supply”; “55.6% for Topic 1 and 22.2% for Topic 4” for “our store”; “33.3% for Topic 3 and 11.1% for Topic 1” for “in stock”; and “22.2% for Topic 1 and 22.2% for Topic 3” for “has”.
In step S512, the main topic extracting module 165 extracts the main topic. Specifically, the main topic extracting module 165 extracts a topic having the highest probability value among the topics for the word corresponding to the target, as the main topic. In the above-described example, the target is “Food A”. Since the topic distribution of “Food A” is “100% for Topic 1”, Topic 1 is extracted as the main topic.
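Reusing the hypothetical topic_distributions mapping from the earlier sketch, this selection is a one-liner:

```python
# Pick the highest-probability topic of the target word as the main topic.
main_topic = max(topic_distributions["Food A"], key=lambda tp: tp[1])[0]  # "T1"
```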
In step S514, the context information determining module 170 determines context words. The context information determining module 170 determines a word whose highest-probability topic is the main topic (Topic 1) to be a context word. In the above-described example, the words “Flavor B”, “selling”, and “our store” satisfy this condition and are determined to be context words (the target “Food A” itself is excluded).
In step S516, the context information output module 190 outputs the context information for the target. In the above-described example, the words “Flavor B”, “selling”, and “our store” are output.
An information processing apparatus 700 includes the model generating module 105, the model storage apparatus 125, and a contextual processing module 750. The contextual processing module 750 includes the document/target receiving module 155, the word topic estimating module 160, the main topic extracting module 165, the document topic estimating module 770, the subtopic extracting module 775, the context information determining module 780, and the context information output module 190. Components similar to those in the first exemplary embodiment are designated with identical reference numerals, and repeated description will be avoided.
The model storage apparatus 125 is connected to the model output module 120, the word topic estimating module 160, and the document topic estimating module 770.
The main topic extracting module 165 is connected to the word topic estimating module 160 and the document topic estimating module 770.
The document topic estimating module 770 is connected to the model storage apparatus 125, the main topic extracting module 165, and the subtopic extracting module 775. The document topic estimating module 770 applies the topic modeling technique to the target text, and extracts topic distributions in the text.
The subtopic extracting module 775 is connected to the document topic estimating module 770 and the context information determining module 780. The subtopic extracting module 775 extracts a second topic (which may be hereinafter referred to as a subtopic) for the text from the topic distributions extracted by the document topic estimating module 770. By taking a subtopic for the target into consideration, context information covering a wider range may be extracted.
The context information determining module 780 is connected to the subtopic extracting module 775 and the context information output module 190. The context information determining module 780 extracts a word satisfying a predetermined condition among words having the subtopic extracted by the subtopic extracting module 775, as a context word. Further, the process performed by the context information determining module 170 in the first exemplary embodiment may be performed.
The context information output module 190 is connected to the context information determining module 780.
In step S802, the document/target receiving module 155 receives a target.
In step S804, the document/target receiving module 155 receives a document.
In step S806, the word topic estimating module 160 extracts words.
In step S808, the word topic estimating module 160 receives the model.
In step S810, the word topic estimating module 160 estimates word topics.
In step S812, the main topic extracting module 165 extracts the main topic.
In step S814, the document topic estimating module 770 extracts document topics. The document topic estimating module 770 estimates topics for the document by using the topic modeling technique. A document topic is obtained by normalizing the sum of the topic distributions for the words. In the normalization, for example, the sum of the topic distributions may be divided by the number of words (or the number of words used in the addition). An example is a topic-distribution table 900.
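A sketch of this normalization (a hypothetical helper that averages the per-word topic distributions):

```python
from collections import defaultdict

def document_topics(word_distributions):
    """word_distributions: {word: [(topic, probability), ...]}.
    Returns the document topic distribution: the per-topic sum of the
    word distributions divided by the number of words."""
    totals = defaultdict(float)
    for dist in word_distributions.values():
        for topic, prob in dist:
            totals[topic] += prob
    n = len(word_distributions)
    return {topic: s / n for topic, s in totals.items()}
```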
In step S816, the subtopic extracting module 775 extracts a subtopic. The subtopic extracting module 775 extracts a subtopic for the target. Specifically, for example, a topic having the largest ratio is extracted from the document topics, as in the topic-distribution table 900 described above.
In step S818, the context information determining module 780 determines context words. Similarly to step S514 in the first exemplary embodiment, the context information determining module 780 determines a word whose highest-probability topic is the subtopic to be a context word.
In step S820, the context information output module 190 outputs the context information. In the above-described example, the words determined in step S818 are output as context words for the subtopic. Further, the context words for the main topic may also be output.
The following subtopic extraction method may be employed for the process in step S816. A subtopic (surrounding topic) which is likely to surround the target may be extracted by using Expression (1) described below.
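One plausible form of Expression (1), inferred from the worked example below (a reconstruction, not necessarily the original formula), averages the probability of a topic t over the set W of words surrounding the target:

```latex
\mathrm{score}(t) = \frac{1}{\lvert W \rvert} \sum_{w \in W} P(t \mid w) \qquad (1)
```

Here P(t | w) is the probability of topic t in the topic distribution of surrounding word w.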
In this example, T5 is the topic having the highest score, because score(T5) = (0.7 + 0.2 + 0.4) / 3 ≈ 0.433 by Expression (1). Therefore, T5 is regarded as the subtopic.
An information processing apparatus 1100 includes the model generating module 1105, the model storage apparatus 125, and the contextual processing module 150. The model generating module 1105 includes the supervised document DB 1110, the supervised topic modeling module 1115, and the model output module 120.
The supervised document DB 1110 is connected to the supervised topic modeling module 1115. The supervised document DB 1110 is used to store multiple texts which serve as supervised data and which are collected in advance.
The supervised topic modeling module 1115 is connected to the supervised document DB 1110 and the model output module 120. From the multiple texts in the supervised document DB 1110, the supervised topic modeling module 1115 extracts words constituting the texts. The supervised topic modeling module 1115 applies a topic modeling technique to the extracted words, and generates a topic model. The multiple texts which are stored in the supervised document DB 1110 and which serve as supervised data are used as texts for machine learning, and a supervised topic modeling technique is applied as the topic modeling technique.
The model output module 120 is connected to the supervised topic modeling module 1115 and the model storage apparatus 125. The model output module 120 stores the topic model generated by the supervised topic modeling module 1115 in the model storage apparatus 125.
In step S1202, the supervised topic modeling module 1115 extracts a document set.
In step S1204, the supervised topic modeling module 1115 extracts words.
In step S1206, the supervised topic modeling module 1115 performs supervised topic modeling. That is, the supervised topic modeling module 1115 applies the supervised topic modeling technique to the word set in each text in the supervised document DB 1110. For example, labeled latent Dirichlet allocation (LLDA) is used as a specific method. An example of the contents of the supervised document DB 1110 is described below.
In the ID column 1310, information (an ID) for uniquely identifying a text in the text column 1320 is stored. In the text column 1320, a text is stored. In the supervised signal column 1330, one or more supervised signals for the text are stored. For example, by using the word “eating” as a supervised signal, a text which means “I ate curry rice with pork cutlet and ramen.” is subjected to machine learning. By using the words “eating” and “toy” as supervised signals, a text which means “Recently, I often eat Food A to get a giveaway.” is subjected to machine learning.
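As an illustration of supervised topic modeling over such labeled texts, the following sketch uses the Labeled LDA implementation of the tomotopy library (an assumption; the specification does not name a library), with English stand-ins for the Japanese texts:

```python
import tomotopy as tp  # tomotopy provides a Labeled LDA (LLDA) model

mdl = tp.LLDAModel()

# Each training text is a word list plus its supervised signals (labels).
mdl.add_doc(["curry", "rice", "pork", "cutlet", "ramen"], labels=["eating"])
mdl.add_doc(["recently", "often", "food_a", "giveaway"], labels=["eating", "toy"])

for _ in range(100):  # run Gibbs sampling in small batches
    mdl.train(10)
```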
In step S1208, the model output module 120 outputs the topic model generated in step S1206, to the model storage apparatus 125.
An information processing apparatus 1400 includes the model generating module 1105, the model storage apparatus 125, and the contextual processing module 750.
The model generating module 1105 includes the supervised document DB 1110, the supervised topic modeling module 1115, and the model output module 120. The supervised document DB 1110 is connected to the supervised topic modeling module 1115. The supervised topic modeling module 1115 is connected to the supervised document DB 1110 and the model output module 120. The model output module 120 is connected to the supervised topic modeling module 1115 and the model storage apparatus 125.
The model storage apparatus 125 is connected to the model output module 120, the word topic estimating module 160, and the document topic estimating module 770.
The contextual processing module 750 includes the document/target receiving module 155, the word topic estimating module 160, the main topic extracting module 165, the document topic estimating module 770, the subtopic extracting module 775, the context information determining module 780, and the context information output module 190.
The document/target receiving module 155 is connected to the word topic estimating module 160. The word topic estimating module 160 is connected to the model storage apparatus 125, the document/target receiving module 155, and the main topic extracting module 165. The main topic extracting module 165 is connected to the word topic estimating module 160 and the document topic estimating module 770. The document topic estimating module 770 is connected to the model storage apparatus 125, the main topic extracting module 165, and the subtopic extracting module 775. The subtopic extracting module 775 is connected to the document topic estimating module 770 and the context information determining module 780. The context information determining module 780 is connected to the subtopic extracting module 775 and the context information output module 190. The context information output module 190 is connected to the context information determining module 780.
An example of the hardware configuration of a computer that executes programs serving as the exemplary embodiments is illustrated in the accompanying figure.
For an exemplary embodiment achieved by using computer programs among the above-described exemplary embodiments, the computer programs, which are software, are read into a system having this hardware configuration, and the software and the hardware resources cooperate with each other to achieve the above-described exemplary embodiment.
The hardware configuration described above is merely a configuration example; the exemplary embodiments are not limited to this configuration as long as the modules described in the exemplary embodiments are executable.
The programs described above may be provided through a recording medium which stores the programs, or may be provided through a communication unit. In these cases, for example, the programs described above may be interpreted as an invention of “a computer-readable recording medium that stores a program”.
The term “a computer-readable recording medium that stores a program” refers to a computer-readable recording medium that stores programs and that is used for, for example, the installation and execution of the programs and the distribution of the programs.
Examples of the recording medium include a digital versatile disk (DVD) having a format of “DVD-recordable (DVD-R), DVD-rewritable (DVD-RW), DVD-random access memory (DVD-RAM), or the like” which is a standard developed by the DVD forum or having a format of “DVD+recordable (DVD+R), DVD+rewritable (DVD+RW), or the like” which is a standard developed by the DVD+RW alliance, a compact disk (CD) having a format of CD read only memory (CD-ROM), CD recordable (CD-R), CD rewritable (CD-RW), or the like, a Blu-ray® Disk, a magneto-optical disk (MO), a flexible disk (FD), a magnetic tape, a hard disk, a ROM, an electrically erasable programmable ROM (EEPROM®), a flash memory, a RAM, and a secure digital (SD) memory card.
The above-described programs or some of them may be stored and distributed by recording them on the recording medium. In addition, the programs may be transmitted through communication, for example, by using a transmission medium of, for example, a wired network which is used for a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), the Internet, an intranet, an extranet, and the like, a wireless communication network, or a combination of these. Instead, the programs may be carried on carrier waves.
The above-described programs may be included in other programs, or may be recorded on a recording medium along with other programs. Instead, the programs may be recorded on multiple recording media by dividing the programs. The programs may be recorded in any format, such as compression or encryption, as long as it is possible to restore the programs.
The foregoing description of the exemplary embodiments of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to understand the invention for various embodiments and with the various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.