The present invention relates to a word latent topic estimation device and a word latent topic estimation method that estimate latent topics for words in document data.
In the field of natural language processing, there is a demand for dealing with the meanings behind words rather than treating text data as mere symbol sequences. Recently, much attention has been given to devices that estimate latent topics (hereinafter referred to as latent topic estimation devices).
A topic is data representing a notion, meaning or field that lies behind each word. A latent topic is not a topic manually defined beforehand but instead is a topic that is automatically extracted on the basis of the assumption that “words that have similar topics are likely to co-occur in the same document” by taking an input of document data alone. In the following description, a latent topic is sometimes simply referred to as a topic.
Latent topic estimation is processing that takes an input of document data, posits that k latent topics lie behind words contained in the document, and estimates a value representing whether or not each of the words relates to each of the 0-th to (k−1)-th latent topics.
Known latent topic estimation methods include latent semantic analysis (LSA), probabilistic latent semantic analysis, and latent Dirichlet allocation (LDA).
LDA in particular will be described here. LDA is a latent topic estimation method that assumes that each document is a mixture of k latent topics. LDA is predicated on a document generative model based on this assumption and can estimate, in accordance with the generative model, a probability distribution representing the relations between words and latent topics.
A word generative model in LDA will be described first. Generation of documents in LDA is determined by the following two parameters.
α_{t} is a parameter of a Dirichlet distribution that generates a topic t. β_{t, v} represents the probability of a word v being chosen from a topic t (the word topic probability). Note that _{t, v} represents that the subscripts t and v are written below β.
A generative model in LDA is a model for generating words by using the following procedure in accordance with these parameters. The generative model first determines a mixture ratio θ_{j, t} (0≦t<k) of latent topics for a document in accordance with the Dirichlet distribution of parameter α. Then, generation of a word is repeated a number of times equal to the length of the document in accordance with the mixture ratio. Each word is generated by choosing one topic t in accordance with the topic mixture ratio θ_{j, t} and then choosing a word v in accordance with the probabilities β_{t, v}.
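To make the generative procedure concrete, a minimal Python sketch follows; the function name, array shapes, and use of NumPy are illustrative assumptions rather than part of the described method.

```python
import numpy as np

def generate_document(alpha, beta, doc_length, rng=None):
    """Sketch of the LDA generative model described above.

    alpha: Dirichlet parameters alpha_t, shape (k,)
    beta:  word topic probabilities beta_{t, v}, shape (k, V)
    """
    rng = rng or np.random.default_rng()
    theta = rng.dirichlet(alpha)                  # mixture ratio theta_{j, t} for this document
    words = []
    for _ in range(doc_length):
        t = rng.choice(len(alpha), p=theta)       # choose one topic t from the mixture
        v = rng.choice(beta.shape[1], p=beta[t])  # choose a word v from topic t
        words.append(v)
    return words
```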
LDA allows α and β to be estimated by assuming such a generative model as described above and giving document data. The estimation is based on the maximum likelihood principle and is accomplished by computing α_{t} and β_{t, v} that are likely to replicate a set of document data.
LDA differs from the other latent topic estimation methods in that LDA deals with the latent topics of a document using the mixture ratio θ_{j, t}, and therefore a document can have multiple topics. A document written in a natural language often contains multiple topics. LDA can therefore estimate word topic probabilities more accurately than the other latent topic estimation methods.
NPL 1 describes a method for estimating α and β incrementally (every time a document is added). A latent topic estimation device to which the method described in NPL 1 is applied repeats computations of the parameters given below when a document j is given, thereby estimating the word topic probabilities β.
The latent topic estimation device illustrated in
γ_{j, t}̂{k} is a parameter (document topic parameter) of a Dirichlet distribution representing the probability of a topic t appearing in document j. Note that ̂{k} represents that the superscript k is written above γ. φ_{j, i, t}̂{k} is the probability (document word topic probability) of the i-th word in document j being assigned to topic t. n_{j, t}̂{k} is an expected value of the number of assignments to topic t in document j (the number of document topics). n_{j, t, v}̂{k} is an expected value of the number of times a word v is assigned to topic t in document j (the number of word topics).
The latent topic estimation device illustrated in
A flow of processing in the latent topic estimation device illustrated in
First, when a document including one or more words is added to the document data addition unit 501, the latent topic estimation device illustrated in
The processing by the topic estimation unit 502 illustrated in
When a document j made up of N_{j} words is added, the topic estimation unit 502 computes initial values of the following parameters (step n1).
n_{j, t}̂{old} is the initial value of the number of document topics and is computed according to Equation 2′. n_{j, t, v}̂{old} is the initial value of the number of word topics and is computed according to Equation 2′. γ_{j, t}̂{k} is the initial value of the document topic parameter and is computed according to Equation 3. β_{t, v}̂{k} is the initial value of the word topic probability and is computed according to Equation 4′.
Note that φ_{j, i, t}̂{old} is the initial value of the document word topic probability and is randomly assigned.
The function I(condition) in Equations 2 and 2′ returns 1 when the condition is satisfied, and otherwise returns 0. w_{j, i} represents the i-th word in document j.
Then, the topic estimation unit 502 performs processing for updating the values of φ_{j, i, t}̂{k}, β_{t, v}̂{k} and γ_{j, t}̂{k} for each topic t (0≦t<k) for each word (step n2). The update processing is accomplished by computing Equations 1, 2, 3 and 4 in order.
In Equation 1, ψ(x) represents the digamma function and exp(x) represents the exponential function. A_{t, v} in Equation 4 is stored in the topic distribution storage unit 504. Note that when there is no corresponding value in the topic distribution storage unit 504, such as when the first document is added, 0 is assigned to A_{t, v}.
When the parameter updating for all of the words is completed, the topic estimation unit 502 replaces φ_{j, i, t}̂{old}, n_{j, t}̂{old}, and n_{j, t, v}̂{old} with the values φ_{j, i, t}̂{k}, n_{j, t}̂{k}, and n_{j, t, v}̂{k} computed at the current topic estimation in preparation for next update processing. Then the topic estimation unit 502 performs update processing in accordance with Equations 1 to 4 again for each word.
The topic estimation unit 502 then determines whether to end the processing (step n3). The number of iterations of step n2 performed after a document is added is stored and, upon completion of a certain number of iterations (Yes, at step n3), the topic estimation unit 502 ends the processing.
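As an illustrative sketch only, the per-document loop of steps n1 to n3 can be written as follows. Equations 1 to 4 themselves appear only in the original figures, so the update expressions below use the usual online variational LDA forms as a stand-in; all names and shapes are assumptions.

```python
import numpy as np
from scipy.special import digamma

def estimate_added_document(doc_words, alpha, A, n_iter=20, rng=None):
    """Sketch of steps n1-n3 for one added document j.

    doc_words: word ids w_{j, i} of the document
    alpha:     Dirichlet parameters, shape (k,)
    A:         accumulated counts A_{t, v} from the topic distribution storage, shape (k, V)
    """
    rng = rng or np.random.default_rng()
    k, V = A.shape
    N = len(doc_words)
    phi = rng.dirichlet(np.ones(k), size=N)   # step n1: random initial phi_{j, i, t}
    n_t = phi.sum(axis=0)                     # n_{j, t}: expected number of document topics
    gamma = alpha + n_t                       # gamma_{j, t}: document topic parameter
    for _ in range(n_iter):                   # step n3 ends after a fixed number of iterations
        for i, v in enumerate(doc_words):     # step n2: update each word in order
            beta_v = (A[:, v] + phi[i] + 1e-12) / (A.sum(axis=1) + n_t + 1e-12)
            weight = np.exp(digamma(gamma)) * beta_v   # Equation-1-style weight
            phi[i] = weight / weight.sum()
        n_t = phi.sum(axis=0)                 # Equation-2-style expected counts
        gamma = alpha + n_t                   # Equation-3-style parameter update
    n_tv = np.zeros_like(A)
    for i, v in enumerate(doc_words):
        n_tv[:, v] += phi[i]                  # n_{j, t, v}: the number of word topics
    return phi, gamma, n_tv                   # n_tv is later added to A (Equation 5)
```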
The data update unit 503 updates the value in the topic distribution storage unit 504 on the basis of the number of word topics n_{j, t, v} among the values computed by the topic estimation unit 502. The update is performed according to Equation 5.
[Math. 7]
A_{k, t, v} = A_{k, t, v} + n_{j, t, v}̂{k} (Eq. 5)
The word topic distribution output unit 505 is called by a user operation or an external program. The word topic distribution output unit 505 outputs β_{t, v} according to Equation 6 on the basis of the value in the topic distribution storage unit 504.
This method does not store all documents and does not repeat the estimation processing for all documents; instead, it repeats estimation only for an added document when the document is added. This method is known to be capable of efficient probability estimation and to operate faster than common LDA. However, the speed is still insufficient: processing time proportional to the number of topics is required, and therefore much time is required when the number of topics k is large. This problem may be addressed by using hierarchical clustering.
NPL 2 describes a hierarchical clustering method. In order to estimate latent topics in document data, this method recursively performs processing in which clustering (= topic estimation) with two clusters (= the number of topics) is performed to divide the data into two. This enables topics to be assigned to each document on the order of log(k). Although the method is a technique in a similar field, it is merely a technique to assign topics to documents and cannot estimate the probability of a topic being assigned to a word. Furthermore, a single topic is assigned to each piece of data, so a mixture of multiple topics cannot be represented.
Latent topic estimation methods that are capable of dealing with a mixture of topics require much processing time proportional to the number of topics. One approach that could perform the processing efficiently when the number of topics is large is a hierarchical latent topic estimation method. A topic tree and hierarchical latent topic estimation are defined below.
A topic tree is data in the form of a W-ary tree of depth D whose nodes represent topics and whose edges represent semantic inclusive relations between topics. Each of the topics in the topic tree has an ID (topic ID) that is unique at each level.
Hierarchical latent topic estimation means that topic estimation for each word is performed at each of the levels in such a manner that there are no inconsistencies in semantic inclusive relations between topics in a topic tree represented as described above.
However, it is difficult to estimate latent topics of words by the hierarchical latent topic estimation method while taking into consideration a mixture of topics.
An object of the present invention is to provide a word latent topic estimation device and a word latent topic estimation method that are capable of hierarchical processing and fast estimation for latent topics of words while taking into consideration a mixture of topics.
A word latent topic estimation device according to the present invention includes a document data addition unit inputting a document including one or more words, a level setting unit setting the number of topics at each level in accordance with a hierarchical structure of topics for hierarchically estimating a latent topic of a word, a higher-level constraint creation unit creating, for a word in the document on the basis of a result of topic estimation at a given level, a higher-level constraint indicating an identifier of a topic that is likely to be assigned to the word and the probability of the word being assigned to the topic, and a higher-level-constraint-attached topic estimation unit which, when estimating the probability of each word in the input document being assigned to each topic, refers to the higher-level constraint and uses the probability of the word being assigned to a parent topic at a higher level as a weight to perform estimation processing for a lower-level topic.
A word latent topic estimation method according to the present invention includes inputting a document including one or more words, setting the number of topics at each level in accordance with a hierarchical structure of topics for hierarchically estimating a latent topic of a word, creating, for a word in the document on the basis of a result of topic estimation at a given level, a higher-level constraint indicating an identifier of a topic that is likely to be assigned to the word and the probability of the word being assigned to the topic, and when estimating the probability of each word in the input document being assigned to each topic, referring to the higher-level constraint and using the probability of the word being assigned to a parent topic at a higher level as a weight to perform estimation processing for a lower-level topic.
The present invention enables hierarchical processing and allows latent topics of words to be estimated fast while taking into consideration a mixture of the topics.
The document data addition unit 101 adds document data which includes one or more words and is input by a user operation or an external program.
The level setting unit 102 sets the number of topics k on the basis of width W and depth D, which are preset parameters, and calls processing of the higher-level-constraint-attached topic estimation unit 103.
The higher-level-constraint-attached topic estimation unit 103 takes inputs of the number of topics k set in the level setting unit 102, document data provided from the document data addition unit 101, and a higher-level constraint in the higher-level-constraint buffer 107, and performs topic estimation for k topics.
The data update unit 104 updates data in the topic distribution storage unit 106 on the basis of the number of word topics computed by the higher-level-constraint-attached topic estimation unit 103.
The higher-level constraint creation unit 105 is called after the processing by the data update unit 104 and creates a higher-level constraint on the basis of a document word topic probability computed by the higher-level-constraint-attached topic estimation unit 103. The higher-level constraint creation unit 105 stores the created higher-level constraint in the higher-level-constraint buffer 107 and calls the level setting unit 102.
The topic distribution storage unit 106 stores the number of word topics provided from the data update unit 104. The topic distribution storage unit 106 in this exemplary embodiment holds the number of word topics using a word v, the number of topics k, and a topic t as a key. The topic distribution storage unit 106 stores information having a data structure given below.
An example of data stored in the topic distribution storage unit 106 is given below.
The example indicates that, in the topic estimation where the number of topics is 4, the number of word topics for the 0-th topic of the word “children” is 2.0 and the number of word topics for the 1st topic of the word “children” is 1.0.
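The listing referred to above is not reproduced here; as a purely illustrative sketch, the storage can be pictured as a mapping keyed by the number of topics, the topic, and the word.

```python
# Hypothetical in-memory layout for the topic distribution storage unit 106:
# key = (number of topics k, topic id t, word v) -> number of word topics A_{k, t, v}.
topic_distribution = {
    (4, 0, "children"): 2.0,   # with k=4, word "children", topic 0: count 2.0
    (4, 1, "children"): 1.0,   # with k=4, word "children", topic 1: count 1.0
}
```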
The higher-level-constraint buffer 107 stores a higher-level constraint created by the higher-level constraint creation unit 105. In this exemplary embodiment, the higher-level-constraint buffer 107 holds a topic ID that is likely to be assigned to the i-th word in a document in the last higher-level topic estimation performed and a document word topic probability φ for the topic. The higher-level-constraint buffer 107 stores information having the following data structure.
An example of data stored in the higher-level-constraint buffer 107 is given below.
This example indicates that the probability of the fifth word being assigned to the 0th topic is 0.3 and the probability of the fifth word being assigned to the eighth topic is 0.7.
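Again as a purely illustrative sketch of this data structure (the concrete listing is not reproduced here), the buffer can be pictured as a mapping from a word position to its allowable higher-level topics and their probabilities.

```python
# Hypothetical layout for the higher-level-constraint buffer 107:
# key = word position i in the document -> {allowable higher-level topic id: phi}.
higher_level_constraint = {
    5: {0: 0.3, 8: 0.7},   # 5th word: topic 0 with probability 0.3, topic 8 with 0.7
}
```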
The word topic distribution output unit 108, when called by a user operation or an external program, computes a word topic probability on the basis of the number of word topics in the topic distribution storage unit 106 and outputs the result.
Note that the document data addition unit 101, the level setting unit 102, the higher-level-constraint-attached topic estimation unit 103, the data update unit 104, the higher-level constraint creation unit 105 and the word topic distribution output unit 108 are implemented by a CPU or the like of the word latent topic estimation device.
The topic distribution storage unit 106 and the higher-level constraint buffer 107 are implemented by a storage device such as a memory of the word latent topic estimation device.
An operation of this exemplary embodiment will be described below.
Processing in this exemplary embodiment is generally made up of a flow of adding document data and an output flow.
The flow of adding document data is started when a user operation or an external program inputs document data including one or more words.
When document data is added, first the level setting unit 102 sets 1 as the initial value of the number of topics k (step u01). The level setting unit 102 then multiplies the number of topics k by W (step u02). The level setting unit 102 then determines on the basis of the value of k whether to end the processing.
When the value of k is greater than the D-th power of W (Yes at step u03), the word latent topic estimation device performs termination processing (step u07). In the termination processing, the higher-level-constraint buffer 107 is cleared in preparation for the addition of a next document. Otherwise (No at step u03), the word latent topic estimation device proceeds to step u04, where topic estimation processing for k topics is performed. At step u04, the higher-level-constraint-attached topic estimation unit 103 performs the latent topic estimation processing for k topics.
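For orientation, a minimal sketch of this level loop follows (steps u01 to u03 and u07); process_level and clear_buffer are hypothetical callbacks standing in for the per-level processing at steps u04 to u06 and the buffer clearing at step u07.

```python
def add_document_flow(doc, W, D, process_level, clear_buffer):
    """Sketch of the level loop in the flow of adding document data."""
    k = 1                          # step u01: initial value of the number of topics
    while True:
        k *= W                     # step u02: move down one level of the topic tree
        if k > W ** D:             # step u03: deeper than depth D of the topic tree?
            clear_buffer()         # step u07: termination processing
            return
        process_level(doc, k)      # steps u04-u06 at this level (k topics)

# With W=4, D=2 this calls process_level for k=4 and k=16, then clears the buffer.
add_document_flow(["children", "year"], 4, 2,
                  process_level=lambda doc, k: print("estimating", k, "topics"),
                  clear_buffer=lambda: print("buffer cleared"))
```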
When the higher-level-constraint buffer 107 is empty at this point in time, the higher-level-constraint-attached topic estimation unit 103 performs the same processing performed by the topic estimation unit 502 illustrated in
The determination as to whether a topic satisfies a higher-level constraint is made on the basis of whether or not the parent topic of that topic in the topic tree is included in the higher-level constraint. Specifically, the determination is made on the basis of whether or not the topic ID divided by W is included in the higher-level constraint. For example, consider topic estimation where W=4 and the number of topics is 16. If the higher-level constraint for a given word (the assignment in the estimation for 4 topics) is {0, 2}, the eight topics {0, 1, 2, 3, 8, 9, 10, 11} among topics 0 to 15 satisfy the higher-level constraint. A topic that satisfies a higher-level constraint is hereinafter referred to as an allowable topic, and a list of allowable topics is referred to as an allowable topic list.
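A short sketch of this rule (the function name is illustrative):

```python
def allowable_topics(constraint_topics, k, W):
    """Topics among 0..k-1 whose parent (topic id // W) appears in the
    higher-level constraint, following the rule described above."""
    return [t for t in range(k) if t // W in constraint_topics]

# Example from the text: W=4, 16 topics, higher-level constraint {0, 2}
# -> allowable topics [0, 1, 2, 3, 8, 9, 10, 11].
print(allowable_topics({0, 2}, k=16, W=4))
```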
When a document j made up of N_{j} words is added, the higher-level-constraint-attached topic estimation unit 103 first refers to the higher-level-constraint buffer 107 to obtain a higher-level constraint for every position in the document (step u041). When the higher-level-constraint buffer 107 is empty (Yes at step u042), the higher-level-constraint-attached topic estimation unit 103 proceeds to the processing at step n1 illustrated in
At step u043, the higher-level-constraint-attached topic estimation unit 103 compares the topic ID at the higher level that is included in the higher-level constraint with each of topic IDs 0 to k−1 divided by W and generates an allowable topic list.
The higher-level-constraint-attached topic estimation unit 103 then computes initial values for the probability parameters (step u044). This processing is the same as the processing at step n1 illustrated in
The higher-level-constraint-attached topic estimation unit 103 then updates the values of Equations 7, 2, 3 and 4 for each word in the document (step u045).
Note that the update processing is performed only for the topics that satisfy the higher-level constraint. This process is accomplished by updating φ_{j, i, t}̂{k}, β_{t, v}̂{k}, n_{j, t}̂{k}, n_{j, t, v}̂{k} and γ_{j, t}̂{k} in order by using Equations 7, 2, 3 and 4.
Note that cons in Equation 7 represents the set of IDs of allowable topics. φ_{j, i, t/W}̂{k/W} represents the document word topic probability of the parent topic included in the higher-level constraint. Equation 7 differs from Equation 1 in that the probability values of topics other than the allowable topics are fixed at 0 in Equation 7 and that the document word topic probability φ_{j, i, t/W}̂{k/W} of the parent topic is used as a weight in Equation 7. Equation 7 enables probability assignment in the estimation for k topics that takes into consideration the result of the probability estimation for k/W topics.
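A sketch of this constrained update for a single word position follows. Equation 7 itself is given only in the original figure, so the code follows the structure described in the text: probabilities outside the allowable topics are fixed at 0, allowable topics are weighted by the parent probability, and the result is normalized. The base term of Equation 9 is assumed to be already available as an array.

```python
import numpy as np

def constrained_phi_update(base, parent_phi, allowable, k, W):
    """Sketch of the Equation-7 style update for one word position.

    base:       per-topic term of Equation 9, shape (k,)  (assumed precomputed)
    parent_phi: document word topic probabilities at the higher level, shape (k // W,)
    allowable:  list of allowable topic ids (topics satisfying the higher-level constraint)
    """
    phi = np.zeros(k)
    for t in allowable:
        phi[t] = parent_phi[t // W] * base[t]   # weight by phi_{j, i, t/W}^{k/W}
    return phi / phi.sum()                      # non-allowable topics stay at 0

# Matches the worked example later in this description: k=16, W=4, base = 1/16 everywhere,
# word "year" with higher-level constraint {1: 0.25, 2: 0.75} -> 1/16 and 3/16.
parent = np.array([0.0, 0.25, 0.75, 0.0])
phi16 = constrained_phi_update(np.full(16, 1 / 16), parent, list(range(4, 12)), 16, 4)
print(phi16[4], phi16[8])
```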
Upon completion of the update processing for all of the words in document j, the higher-level-constraint-attached topic estimation unit 103 replaces φ_{j, i, t}̂{old}, n_{j, t}̂{old}, and n_{j, t, v} ̂{old} with the values φ_{j, i, t}̂{k}, n_{j, t}̂{k}, and n_{j, t, v}̂{k} computed in the current topic estimation in preparation for next update processing.
Upon completion of the processing described above, the higher-level-constraint-attached topic estimation unit 103 performs determination processing at step u046 as to whether to end the processing. This processing is the same as the processing at step n3 illustrated in
At step u05, the data update unit 104 updates the values in the topic distribution storage unit 106 on the basis of the number of word topics n_{j, t, v}̂{k} among the values computed by the higher-level-constraint-attached topic estimation unit 103. The update is performed according to Equation 5.
When the processing by the data update unit 104 ends, the higher-level constraint creation unit 105 creates a higher-level constraint on the basis of the document word topic probabilities φ computed by the higher-level-constraint-attached topic estimation unit 103 (step u06). The processing is performed as follows.
First, the higher-level constraint creation unit 105 clears the higher-level-constraint buffer 107 at this point in time.
Then, the higher-level constraint creation unit 105 performs the following processing for each word. The higher-level constraint creation unit 105 checks the document word topic probabilities φ_{j, i, t}̂{k} for the i-th word in document j for t=0 to k−1 in order. The higher-level constraint creation unit 105 extracts the IDs of topics that have document word topic probabilities greater than a threshold value TOPIC_MIN (for example 0.2) and adds the IDs to the allowable topic list cons(j, i).
The higher-level constraint creation unit 105 then updates the values of φ_{j, i, t}̂{k} for the topics t in the allowable topic list according to Equation 8. The higher-level constraint creation unit 105 then adds the IDs of the topics included in cons(j, i) and the values of φ_{j, i, t}̂{k} for those topics to the higher-level-constraint buffer 107.
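A sketch of this constraint creation for one word position, assuming that Equation 8 simply renormalizes the surviving probabilities (consistent with the worked example later in this description):

```python
def create_constraint_for_word(phi_k, topic_min=0.2):
    """Sketch of step u06 for one word position: keep topics whose document word
    topic probability exceeds TOPIC_MIN and renormalize them (Equation-8 style)."""
    cons = [t for t, p in enumerate(phi_k) if p > topic_min]   # allowable topic ids
    total = sum(phi_k[t] for t in cons)
    return {t: phi_k[t] / total for t in cons}                 # topic id -> renormalized phi

# Example from the text: "year" at position 1 after estimation for 4 topics
# -> {1: 0.25, 2: 0.75} (up to floating-point rounding).
print(create_constraint_for_word([0.01, 0.225, 0.675, 0.09]))
```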
When the higher-level constraint creation unit 105 completes the processing described above for all positions in document j, the word latent topic estimation device returns to the processing at step u02 and performs the processing at the next level.
The output flow will be described next. The output flow is started when the word topic distribution output unit 108 is called by a user operation or an external program. The word topic distribution output unit 108 computes a word topic probability for each topic t for all words v at each number of topics k by using Equation 6 on the basis of data stored in the topic distribution storage unit 106 and outputs the word topic probabilities.
An operation of this exemplary embodiment will be described using a specific example. In this example, the word latent topic estimation device performs hierarchical latent topic estimation with W=4 and D=2. It is assumed that the following parameters have been set beforehand.
It is also assumed that 1000 documents have already been added and the following data has been stored in the topic distribution storage unit 106.
A flow of adding document data performed when a document 555 made up of only two words, “children” and “year”, is added to the document data addition unit 101 will be described.
When the document data is added, the level setting unit 102 sets the number of topics k=4 (step u01, step u02). The level setting unit 102 determines at step u03 whether or not to end the processing. Since k (=4) does not exceed the D-th power of W (=16), the word latent topic estimation device proceeds to the processing at step u04. At step u04, latent topic estimation is performed for 4 topics. Since the higher-level-constraint buffer 107 is empty at this point in time, the same repetitive processing as that performed by the topic estimation unit 502 illustrated in
The description of the computation will be omitted here and it is assumed that the following results have been obtained.
The document word topic probabilities for “children” at position 0
φ_{555, 0, 0}̂{4}=0.8
φ_{555, 0, 1}̂{4}=0.1
φ_{555, 0, 2}̂{4}=0.01
φ_{555, 0, 3}̂{4}=0.09
The document word topic probabilities for “year” at position 1
φ_{555, 1, 0}̂{4}=0.01
φ_{555, 1, 1}̂{4}=0.225
φ_{555, 1, 2}̂{4}=0.675
φ_{555, 1, 3}̂{4}=0.09
At step u06, the higher-level constraint creation unit 105 creates a constraint for the lower level on the basis of these probabilities φ. The processing will be described.
First, among the document word topic probabilities φ_{555, 0, t}̂{4} for “children” at position 0, only φ_{555, 0, 0}̂{4}=0.8 has a value greater than TOPIC_MIN (=0.2). Accordingly, only 0 is added to the allowable topic list cons(555, 0). The higher-level constraint creation unit 105 updates φ_{555, 0, 0}̂{4} to 1 according to Equation 8. As a result, the higher-level constraint creation unit 105 adds the following higher-level constraint to the higher-level-constraint buffer 107.
Next, among the document word topic probabilities φ—{555, 1, t}̂{4} for “year” at position 1, the following two document word topic probabilities have values equal to or greater than TOPIC_MIN (0.2).
φ_{555, 1, 1}̂{4}=0.225
φ_{555, 1, 2}̂{4}=0.675
Therefore, {1, 2} is added to the allowable topic list cons(555, 1).
The higher-level constraint creation unit 105 updates the values of φ_{555, 1, 1}̂{4} and φ_{555, 1, 2}̂{4} according to Equation 8 as follows.
φ_{555, 1, 1}̂{4}=0.225/(0.225+0.675)=0.25
φ_{555, 1, 2}̂{4}=0.675/(0.225+0.675)=0.75
As a result, the higher-level constraint creation unit 105 adds the following higher-level constraint to the higher-level constraint buffer 107.
When the creation of the higher-level constraint is completed, the word latent topic estimation device returns to step u02, where the number of topics k is updated to 16 (step u02). Since k=16 does not satisfy the end condition at step u03, the word latent topic estimation device proceeds to step u04, i.e. the processing by the higher-level-constraint-attached topic estimation unit 103.
The higher-level-constraint-attached topic estimation unit 103 first retrieves the following data from the higher-level constraint buffer 107 at step u041.
The higher-level-constraint-attached topic estimation unit 103 then generates an allowable topic list for each position at step u043. For position 0, the higher-level constraint topic is 0, so the numbers among 0 to 15 that give 0 when divided by W are chosen to generate the allowable topic list {0, 1, 2, 3}. For position 1, the higher-level constraint topics are {1, 2}, so the numbers among 0 to 15 that give 1 when divided by W and the numbers that give 2 when divided by W are chosen, yielding {4, 5, 6, 7} and {8, 9, 10, 11}, from which the allowable topic list {4, 5, 6, 7, 8, 9, 10, 11} is generated. The higher-level-constraint-attached topic estimation unit 103 then computes initial values of φ, γ, β and n (step u044).
The higher-level-constraint-attached topic estimation unit 103 then performs processing for updating φ, γ, β and n that correspond to the topics in the allowable topic lists (step u045). The description here focuses on computation of φ—{555, i, t}̂{16}. For simplicity, it is assumed that the term base_{j, i, t}̂{k} given by Equation 9 has been computed as follows.
base_{555, i, t}̂{16}=1/16 (i=0, 1; t=0 to 15)
For the word “children” at position 0, the computation is performed as follows. From the higher-level constraint 0→0: 1, the document word topic probabilities φ_{555, 0, t}̂{4} at the higher level can be considered as follows.
φ_{555, 0, 0}̂{4}=1
φ_{555, 0, 1}̂{4}=0
φ_{555, 0, 2}̂{4}=0
φ_{555, 0, 3}̂{4}=0
Allowable topics are 0 to 3, and the subsequent computations need to be performed only for topics 0 to 3. Specifically, in the computation of φ_{555, 0, t}̂{16}, only the following computations are performed.
φ_{555, 0, 0}̂{4} multiplied by base_{555, 0, 0}̂{16}=1/16
φ_{555, 0, 0}̂{4} multiplied by base_{555, 0, 1}̂{16}=1/16
φ_{555, 0, 0}̂{4} multiplied by base_{555, 0, 2}̂{16}=1/16
φ_{555, 0, 0}̂{4} multiplied by base_{555, 0, 3}̂{16}=1/16
Computations beyond the above can be ignored because φ_{555, 0, 1}̂{4}, φ_{555, 0, 2}̂{4}, and φ_{555, 0, 3}̂{4} are 0. φ_{555, 0, t}̂{16} is computed as follows.
φ_{555, 0, 0}̂{16}=1/4
φ_{555, 0, 1}̂{16}=1/4
φ_{555, 0, 2}̂{16}=1/4
φ_{555, 0, 3}̂{16}=1/4
φ_{555, 0, t}̂{16}=0 (4≦t<16)
In this way, the use of the result of estimation for 4 topics as a higher-level constraint can reduce the number of updates of φ required for each topic. Specifically, the number of computations required for ordinary latent topic estimation for 16 topics is 16 topics × the number of iterations. In contrast, when a higher-level constraint created from the result of estimation for k=4 is used, only (4 higher-level topics + 4 lower-level topics) × the number of iterations are required for “children” at position 0.
Computations for the word “year” at position 1 are performed next. Again, the description focuses on the computation of φ. Since the higher-level constraint for position 1 is “1→1: 0.25, 2: 0.75”, the document word topic probabilities φ_{555, 1, t}̂{4} at the higher level can be considered as follows.
φ_{555, 1, 0}̂{4}=0
φ_{555, 1, 1}̂{4}=0.25 (=1/4)
φ_{555, 1, 2}̂{4}=0.75 (=3/4)
φ_{555, 1, 3}̂{4}=0
The subsequent computations are performed only for the allowable topics {4, 5, 6, 7, 8, 9, 10, 11}. Specifically, in the computation of φ_{555, 1, t}̂{16}, only the following computations are performed.
φ_{555, 1, 1}̂{4} multiplied by base_{555, 1, 4}̂{16}=1/64
φ_{555, 1, 1}̂{4} multiplied by base_{555, 1, 5}̂{16}=1/64
φ_{555, 1, 1}̂{4} multiplied by base_{555, 1, 6}̂{16}=1/64
φ_{555, 1, 1}̂{4} multiplied by base_{555, 1, 7}̂{16}=1/64
φ_{555, 1, 2}̂{4} multiplied by base_{555, 1, 8}̂{16}=3/64
φ_{555, 1, 2}̂{4} multiplied by base_{555, 1, 9}̂{16}=3/64
φ_{555, 1, 2}̂{4} multiplied by base_{555, 1, 10}̂{16}=3/64
φ_{555, 1, 2}̂{4} multiplied by base_{555, 1, 11}̂{16}=3/64
φ_{555, 1, t}̂{16} is computed as follows.
φ_{555, 1, 4}̂{16}=1/16
φ_{555, 1, 5}̂{16}=1/16
φ_{555, 1, 6}̂{16}=1/16
φ_{555, 1, 7}̂{16}=1/16
φ_{555, 1, 8}̂{16}=3/16
φ_{555, 1, 9}̂{16}=3/16
φ_{555, 1, 10}̂{16}=3/16
φ_{555, 1, 11}̂{16}=3/16
φ_{555, 1, t}̂{16}=0 (t<4 or 12≦t)
As with the computation for “children” at position 0, the conventional method requires 16 computations of φ whereas the computation method described above requires only 4+8=12 computations of φ to complete the processing. The same computation method can be applied to processing for updating γ, β and n to achieve the same advantageous effect.
The processing is followed by the processing at steps u05 and u06, and then the level setting unit 102 updates k to 64 at step u02. Since this results in k>16, the termination processing at step u07 is performed and the processing flow ends.
As has been described above, a higher-level constraint enables topic estimation to be hierarchically performed without performing estimation processing for excess topics while taking into consideration the probabilities of a mixture of multiple topics. For example, when estimation is performed for 100 topics, ordinary latent topic estimation requires estimation for 100 topics for each word. On the other hand, the latent topic estimation according to the present invention requires estimation for only 10 to several tens of topics for each word when, for example, D=2 and W=10 are set, and thus enables efficient estimation.
A second exemplary embodiment of the present invention will be described with reference to drawings.
The initial value storage unit 201 stores an initial value of the number of topics k set by a level setting unit 102. Specifically, the initial value storage unit 201 stores the initial value initK of k. It is assumed that initK has been set to Ŵ(D−1) before addition of a document.
The initial value update unit 202 is called by the level setting unit 102 and updates the initial value of the number of topics k in the initial value storage unit 201 on the basis of a document word topic probability computed by a higher-level-constraint-attached topic estimation unit 103 and the number of documents that have been added.
Note that the initial value update unit 202 is implemented by a CPU or the like of the word latent topic estimation device. The initial value storage unit 201 is implemented by a storage device such as a memory of the word latent topic estimation device, for example.
The other components in the second exemplary embodiment are the same as those in the first exemplary embodiment and therefore the description of those components will be omitted.
An operation of this exemplary embodiment will be described below.
A flow of adding document data in the second exemplary embodiment will be described here.
The flow of adding document data in the second exemplary embodiment is started when document data including one or more words is input by a user operation or an external program. When document data is added, first the level setting unit 102 retrieves the initial value initK of k from the initial value storage unit 201 and sets the value as the initial value of k (step u101). The subsequent processing (steps u102 to u105) is the same as the processing at steps u02 to u05 of the first exemplary embodiment.
Then, at step u106, a data update unit 104 updates the higher-level topics with the number of word topics n_{j, t, v}̂{k} for each word. This processing is performed only when k is not equal to W and k=initK*W. The reason is as follows. When k=W, there are no higher-level topics and therefore the processing is unnecessary. When k is not equal to initK*W, assignment to the higher-level topics for the document has already been performed in the topic estimation for k/W topics and therefore the processing does not need to be performed here.
As illustrated in
The processing at step u0624 is performed for each t. For example, if W=4, D=4, initK=16 and k=64, the value n_{j, 18, v} for topic ID=18 is added to A_{16, 4, v}, where k=16 (=64/4) and t=4 (=18/4, by integer division), and to A_{4, 1, v}, where k=4 (=64/16) and t=1 (=18/16).
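A sketch of this propagation to the skipped higher levels follows, using a dictionary keyed as in the topic distribution storage; names and types are illustrative, and k is assumed to be initK*W as stated above.

```python
def propagate_counts_to_ancestors(n_k, k, W, A):
    """Sketch of step u106 (loop u0621-u0624): add the word topic counts obtained
    at level k to the storage entries of every higher level that was skipped.

    n_k: dict (t, v) -> n_{j, t, v}^{k} for the current document
    A:   dict (p, t, v) -> accumulated count A_{p, t, v} (topic distribution storage)
    """
    p = W                                   # step u0621: start at the topmost level
    while p != k:                           # step u0622: stop when the current level is reached
        for (t, v), count in n_k.items():   # step u0624: map topic t at level k to level p
            parent_t = t // (k // p)        # integer division, as in the example above
            A[(p, parent_t, v)] = A.get((p, parent_t, v), 0.0) + count
        p *= W                              # next (finer) level
    return A

# Example from the text: W=4, k=64, topic 18 maps to t=4 at p=16 and t=1 at p=4.
A = propagate_counts_to_ancestors({(18, "school"): 1.0}, k=64, W=4, A={})
print(A)   # {(4, 1, 'school'): 1.0, (16, 4, 'school'): 1.0}
```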
When the data update unit 104 ends the process at step u106, the word latent topic estimation device performs the higher-level constraint creation processing at step u107 and then proceeds to the termination processing (step u108). These processing steps are the same as those of the flow of adding document data in the first exemplary embodiment. However, in the flow of adding document data in this exemplary embodiment, the initial value update unit 202 is called after the level setting unit 102 has performed the termination processing.
The initial value update unit 202 updates the initial value initK of the number of topics k in the initial value storage unit 201 in preparation for the next and subsequent documents (step u109). How much computation could be reduced by first performing estimation at one level higher and filtering the lower-level estimation with the resulting constraint is calculated here, and initK is set to a smaller value in accordance with the reduction effect. The reduction effect E can be calculated according to the following equation.
E = nCost(initK*W) − upCost(initK*W)
The function nCost(k) represents the amount of computation required when latent topic estimation for k topics is performed by an ordinary computation method and can be computed according to Equation 10.
[Math. 12]
nCost(k) = k · len_{j} (Eq. 10)
In Equation 10, len_{j} represents the number of words included in document j. When initK is used without change, the number of topics k is equal to initK*W and the topic estimation requires the amount of computation that is equal to the number of words included in document j multiplied by k.
The function upCost(k) represents the amount of computation required when latent topic estimation for k/W topics at one level higher is performed and then the higher-level constraint obtained as a result of that estimation is used. upCost(k) can be computed according to Equation 11.
[Math. 13]
upCost(k) = nCost(k/W) + F(k/W) (Eq. 11)
The first term of Equation 11 represents the amount of computation required for performing latent topic estimation at one level higher, i.e. latent topic estimation for k/W topics. The second term is F(k/W), where F(k) represents the amount of computation required for performing latent topic estimation for k*W topics by using a higher-level constraint created in latent topic estimation for k topics. F(k) can be computed according to Equation 12.
In order to compute F(k/W) in Equation 11, φ_{j, i, t}̂{k/W}, i.e. the value in the topic estimation at one level higher, is required. The value can be calculated according to Equation 13 from the value of φ_{j, i, t}̂{k} computed by the higher-level-constraint-attached topic estimation unit 103. In Equation 13, the function c(p) represents the set of topics that are child topics of topic p in the topic tree.
The initial value update unit 202 computes E and, when E is greater than a threshold value E_MIN (for example 0), updates initK in the initial value storage unit 201 to initK/W.
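A sketch of this reduction-effect calculation follows. Equations 12 and 13 appear only in the original figures, so F and the aggregation of φ to the parent level are written here in the form implied by the worked examples below (F counts W lower-level updates per parent topic whose aggregated probability reaches TOPIC_MIN); all names are assumptions.

```python
def reduction_effect(phi_k, k, W, topic_min=0.2):
    """Sketch of step u109: E = nCost(k) - upCost(k) for the current document.

    phi_k: list over word positions of per-topic probabilities phi_{j, i, t}^{k},
           each of length k, as computed at step u104.
    """
    len_j = len(phi_k)
    n_cost = k * len_j                                  # Equation 10: k * len_j

    # Equation-13 style: aggregate phi to the parent level k/W by summing over child topics.
    def parent_phi(phi):
        return [sum(phi[t] for t in range(k) if t // W == p) for p in range(k // W)]

    # Equation-12 style: W lower-level updates per parent topic reaching TOPIC_MIN.
    f = sum(sum(1 for p in parent_phi(phi) if p >= topic_min) * W for phi in phi_k)
    up_cost = (k // W) * len_j + f                      # Equation 11: nCost(k/W) + F(k/W)
    return n_cost - up_cost

# First example in the text: uniform phi of 1/16 for two words -> E = 32 - 40 = -8.
print(reduction_effect([[1 / 16] * 16, [1 / 16] * 16], k=16, W=4))
```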
Note that while the reduction effect has been calculated here in order to update the initial value initK, initK may instead be updated in accordance with the number of documents that have been added, if the relationship between the number of added documents and the achievable reduction effect can be established empirically. For example, initK may be updated to initK/W when the number of documents that have been added reaches 10000 or more. According to such a method, the determination as to which of hierarchical topic estimation and ordinary topic estimation, i.e. topic estimation for all topics, is to be performed can be made on the basis of the number of documents alone.
While the reduction effect E is calculated each time a document is added in the description above, the reduction effect E may instead be calculated only when the number of documents that have been added reaches X, thereby reducing the processing time required for calculating the reduction effect E.
While the determination as to whether or not to update initK is made on the basis of the reduction effect E for one document, the mean of the reduction effects E for a plurality of documents, for example Y documents, mean(E), may be calculated, and initK may be updated when the mean exceeds the threshold value E_MIN.
An operation of this exemplary embodiment will be described with a specific example. In this example, a document 300 made up only of “school” and “children” is added in a situation where W=4 and D=2 and initK in the initial value storage unit 201 has not been updated from its initial value. When document data is added, the level setting unit 102 first refers to the initial value storage unit 201 and retrieves initK (=Ŵ(D−1))=4 (step u101). As a result, the number of topics k=16 is set at step u102 and estimation processing for 16 topics is performed at step u104. It is assumed here that as a result, the following φ and n have been obtained.
φ_{300, i, t}̂{16}=1/16 (i=0, 1; 0≦t<16)
n_{300, t, “school”}̂{16}=1/16 (0≦t<16)
n_{300, t, “children”}̂{16}=1/16 (0≦t<16)
After the processing at step u105, the data update unit 104 performs the processing at step u106.
First, processing for “school” will be described. As a result of the processing at steps u0621 and u0622, p=4 is set. Since k (=16) is not equal to p, the data update unit 104 proceeds to the processing at step u0624. At step u0624, the data update unit 104 performs the following computations for 0≦t<16.
Add n_{300, 0, “school”}̂{16} to A_{4, 0, “school”}
Add n_{300, 1, “school”}̂{16} to A_{4, 0, “school”}
Add n_{300, 2, “school”}̂{16} to A_{4, 0, “school”}
Add n_{300, 3, “school”}̂{16} to A_{4, 0, “school”}
Add n_{300, 4, “school”}̂{16} to A_{4, 1, “school”}
Add n_{300, 5, “school”}̂{16} to A_{4, 1, “school”}
Add n_{300, 6, “school”}̂{16} to A_{4, 1, “school”}
(and so on, up to adding n_{300, 15, “school”}̂{16} to A_{4, 3, “school”})
Then, the data update unit 104 returns to the processing at step u0622, where p=16 is set. Since the value is equal to k (=16), the processing ends.
Topic estimation for higher-level topics is not performed in an early stage of learning in the second exemplary embodiment. However, the processing enables the number of word topics in a parent topic to be computed. Accordingly, word topic probabilities can be computed without decreasing the degree of accuracy when topic estimation for topics at the higher level is started. When the data update unit 104 has completed the processing at step u106, step u107 is performed and the flow proceeds to termination processing (step u108). Then the initial value update unit 202 determines whether or not to update initK.
The processing at step u109 will now be described. At step u109, on the basis of φ_{300, i, t}̂{16}=1/16 (i=0, 1; 0≦t<16), the reduction effect E is computed as nCost(16)−upCost(16). To compute the value, the initial value update unit 202 first computes φ_{300, i, t}̂{4} from φ_{300, i, t}̂{16} according to Equation 13.
This gives the following results.
φ_{300, i, t}̂{4}=1/4 (i=0, 1; 0≦t<4)
Since these values are all equal to or greater than TOPIC_MIN (0.2), the function I in Equation 12 returns 1. Thus the amount of computation F(4) when the higher-level constraint is used is found to be 32.
Therefore, nCost(16) is k × len_{300} = 16×2 = 32 according to Equation 10. From Equation 11, upCost(16) is 8+32=40. The reduction effect E is 32−40=−8, a negative value. This means that the number of updates of φ required in topic estimation for 16 topics using the higher-level constraint from 4 topics is 8 more than the number of updates required in simple topic estimation for 16 topics, and therefore a reduction effect cannot be achieved. Therefore, the initial value update unit 202 does not update initK.
On the other hand, assume that φ_{300, i, t}̂{16} estimated by the higher-level-constraint-attached topic estimation unit 103 has the following values.
φ_{300, 0, 0}̂{16}=7/28
φ_{300, 0, 15}̂{16}=7/28
φ_{300, 0, t}̂{16}=1/28 (1≦t<15)
φ_{300, 1, 0}̂{16}=7/28
φ_{300, 1, 1}̂{16}=7/28
φ_{300, 1, t}̂{16}=1/28 (2≦t<16)
Processing at step u109 performed in this case will be described. The initial value update unit 202 first computes φ_{300, i, t}̂{4} according to Equation 13. The computation yields the following results.
φ_{300, 0, 0}̂{4}=10/28
φ_{300, 0, 1}̂{4}=4/28
φ_{300, 0, 2}̂{4}=4/28
φ_{300, 0, 3}̂{4}=10/28
φ_{300, 1, 0}̂{4}=16/28
φ_{300, 1, 1}̂{4}=4/28
φ_{300, 1, 2}̂{4}=4/28
φ_{300, 1, 3}̂{4}=4/28
Since the values of 4/28 among these results are less than TOPIC_MIN (0.2), the corresponding parent topics are excluded, and the amount of computation F(4) when the higher-level constraint is used is computed as 3×4=12.
From Equation 10, nCost(16) is k × len_{300} = 16×2 = 32. From Equation 11, upCost(16) is 8+12=20. The reduction effect E is 32−20=12. This means that the higher-level constraint for 4 topics can reduce the number of updates of φ by 12.
At this point in time, the initial value update unit 202 updates initK to 4/4=1. Thus, the processing speed for new documents can be increased by performing latent topic estimation for 4 topics in the next and subsequent estimation.
As has been described above, in this exemplary embodiment, the reduction effect E is calculated and the initial value initK of the number of topics k is set in accordance with the reduction effect E. This enables selection between ordinary topic estimation and hierarchical latent estimation in accordance with the reduction effect E.
In the first exemplary embodiment, a higher-level constraint is created on the basis of the result of estimation for a higher-level topic in hierarchical latent topic estimation, thereby reducing the amount of estimation for lower-level topics. However, when the amount of document data is small, the bias in the results of estimation for a higher-level topic is small. Accordingly, the reduction effect cannot be achieved relative to the cost required for estimation at the higher level and the processing time is likely to increase or the degree of accuracy is likely to decrease.
In contrast, according to this exemplary embodiment, ordinary topic estimation is performed and probabilities for all topics are computed in an early stage of learning where the amount of document data is small and, once a reduction effect has been confirmed, switching can be made to the hierarchical latent topic estimation. Accordingly, a decrease in the degree of accuracy and an increase in processing time in the early stage of learning can be avoided.
According to the present invention, latent topics in an information management system that manages text information can be dealt with to automatically and quickly extract characteristic word information without using a dictionary. Accordingly, the present invention enables efficient document summarization and document search.
As illustrated in
The exemplary embodiments described above also disclose a word latent topic estimation device as described below.
According to such a configuration, when estimating the topic to which each word will be assigned, assignment to all topics does not need to be taken into consideration, while it is still taken into consideration that topic generation in each document is based on a Dirichlet-distributed mixture of topics. Accordingly, topics can be hierarchically estimated without performing estimation processing for excess topics, and thus efficient computation can be carried out.
According to such a configuration, how much computation can be reduced can be calculated and selection can be made between ordinary topic estimation and hierarchical latent topic estimation in accordance with the reduction effect E. Thus, a decrease in the degree of accuracy and an increase in processing time in an early stage of learning where the amount of document data is small can be avoided.
According to such a configuration, selection between hierarchical topic estimation and ordinary topic estimation, i.e. topic estimation for all topics, can be made on the basis of only the number of documents. Thus, the reduction effect does not need to be calculated and the processing load can be reduced.
This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2012-169986 filed on Jul. 31, 2012, the entire disclosure of which is incorporated herein.
While the present invention has been described with respect to exemplary embodiments thereof, the present invention is not limited to the exemplary embodiments described above. Various modifications that are apparent to those skilled in the art can be made to configurations and details of the present invention within the scope of the present invention.