There are many different techniques for discovering information using a computer network. One such technique is referred to as a community-based question and answering service (a cQA service). A cQA service is a web service through which people can post questions and also post answers to other people's questions on a web site. The growth of cQA has been significant, and such services have recently been offered by commercially available web search engines.
In current cQA services, a community of users either subscribes to the service, or simply accesses the service through a network. The users in the community can post questions that are viewable by other users in the community. The community users can also post answers to questions that were previously submitted by other users. Therefore, over time, cQA services build up very large archives of previous questions and answers posted for those previous questions. Of course, the number of questions and answers that are archived depends on the number of users in the community, and how frequently the users access the cQA services.
In any case, there is typically a lag time between the time when a user in the community posts a question and the time when other users of the community post answers to that question. In order to avoid this lag time, some cQA services automatically search the archive of questions and answers to see if the same question has previously been asked. If the question is found in the archives, then one or more previous answers can be provided in response to the current question, with very little delay. This type of searching for previous answers is referred to as “question search”.
By way of example, assume that a given question is “any cool clubs in Berlin or Hamburg?” A cQA service that has question search capability might return, in response to searching the questions in the archive, a previously posted question such as “what are the best/most fun clubs in Berlin?”, which is substantially semantically equivalent to the input question and can be expected to have the same answers as the input question.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
Another technique used to augment question search is referred to as question recommendation. Question recommendation is a technique by which a system automatically recommends additional questions to a user, based on an input question.
Questions submitted in a cQA service can be viewed as having a combination of a question topic and a question focus. The question topic generally presents the major context or constraint of a question, while the question focus presents certain aspects of the question topic. For instance, in the example given above, the question topic is “Berlin” or “Hamburg” while the question focus is “cool club.” When users ask questions in a cQA service, it is believed that they usually have a fairly clear idea about the question topic, but may not be aware that there exist several other aspects around the question topic (several question foci) that may be worth exploring.
The present system graphs topic terms in stored cQA questions and also converts a submitted question into a graph of topic terms. Topic terms that correspond to a question topic are delineated from topic terms that correspond to question focus. New questions are recommended to the user based on a comparison between the topics of the new questions and the topic of the submitted question as well as the focus of the new questions and the focus of the submitted question.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
The present system receives a question in a community question and answering system from a user. The present system then divides the question into its topic and focus, and recommends one or more additional questions that reflect different aspects (or different areas of focus) for the topic in the question input by the user. This can be illustrated in more detail as shown in
More specifically, the question 100 input by the user shown in
In
The present system can recommend questions to the user by retaining the question topic nodes in tree 102, but substituting different focus terms 108. In doing so, the present system identifies the focus of a question by beginning at root node 106, advancing toward leaf nodes 108, and deciding where to make a cut in tree 102 that divides the question focus of the questions represented by the tree from their question topic.
To accomplish this, the present system first represents the archive questions 104 and the input question 100 as one or more question trees (or graphs) of topic terms. The topic terms are not to be confused with the question topic. Topic terms are simply terms in the question input by the user, or the archived questions, that are content words, as opposed to non-content words. The question topic, as discussed above, is the topic of the question, as opposed to the focus of the question. Therefore, in order to represent each of the questions as a tree or graph of topic terms, the system first builds a vocabulary of topic terms such that the vocabulary adequately models both the input question 100 and the archived questions 104. Given that vocabulary of topic terms, a question tree (graph) is constructed. A tree cut is then performed to divide the tree among its question foci and question topic. Then, different question focus terms are substituted for those submitted in the input question 100, and those questions are ranked. The highest ranked questions are output as recommended questions for the user.
Using the example shown in
Therefore, the system can generate recommended questions to be provided to the user, such as “What to see between Hamburg and Berlin?” In that instance, the focus “what to see” is substituted for the focus “cool club”. Another recommended question might be “How far is it from Hamburg to Berlin?” In that instance, the focus “how far is it” is substituted for the focus “cool club”, and so on. Given all of the various questions that could be recommended to the user, the system then ranks those questions, as is discussed below.
More specifically, topic chain generator 206 first receives training data in the form of questions from community question data store 202. The training data questions 214 are illustratively questions which were previously submitted by a community of users in a given community question and answering system. This is indicated by block 250 in
In order to extract topic terms from questions 214, topic chain generator 206 is a two-phase system which first extracts a list of topic terms from the questions and then reduces that set of topic terms to represent the topics more compactly. Topic term acquisition component 208 thus first identifies the set of topic terms in the questions. This is indicated by block 252 in
There are many different ways that can be used to identify topic terms in questions. For instance, in one embodiment, linguistic units, such as words, noun phrases, and n-grams can be used to represent topics. The topic terms for a given sentence illustratively capture the overall topic of a sentence, as well as the more specific aspects of that topic identified in the sentence or question. It has been found that words are sometimes too specific to outline the overall topic of sentences or questions. Therefore, in one embodiment, topic term acquisition component 208 considers noun phrases and n-grams (multiword units) as candidates for topic terms.
In order to acquire noun phrases from the input questions 214, component 208 identifies base noun phrases as simple and non-recursive noun phrases. In many cases, the base noun phrases represent holistic and non-divisible concepts within the question 214. Therefore, topic term acquisition component 208 extracts base noun phrases (as opposed to noun phrases) as topic term candidates. The base noun phrases include both multi-word terms (such as “budget hotel”, “nice shopping mall”) and named entities (such as “Berlin”, “Hamburg”, “forbidden city”). There are many different known ways for identifying base noun phrases in sentences or questions, and one way uses a unified statistical model that is trained to identify base noun phrases in a given language. Of course, other statistical methods, or heuristic methods, could be used as well.
Another type of topic term used by topic term acquisition component 208 is the n-gram of words. There are also many ways of identifying n-grams using natural language processing, whether statistical, heuristic, or otherwise. In any case, it has been found that a particular type of n-gram (the wh-n-gram) is particularly useful in identifying topic terms in questions 214. Most meaningful n-grams are already extracted by component 208 once it has extracted base noun phrases. To complement the base noun phrase extraction, component 208 uses wh-n-grams, which are n-grams beginning with wh-words. For the sake of the present discussion, these include “when”, “what”, “where”, “why”, “which”, and “how”.
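The following Python sketch shows one way topic term candidates of both kinds could be collected. The function names, sample questions, and the trivial stand-in for base noun phrase chunking are illustrative assumptions, not part of the original description.

```python
import re
from collections import Counter

WH_WORDS = {"when", "what", "where", "why", "which", "how"}

def tokenize(question):
    # Simple lowercase word tokenizer; a real system would use an NLP tokenizer.
    return re.findall(r"[a-z0-9']+", question.lower())

def wh_ngrams(tokens, max_n=3):
    # Collect n-grams (n >= 2) that begin with a wh-word, e.g. "where to stay".
    grams = []
    for i, tok in enumerate(tokens):
        if tok in WH_WORDS:
            for n in range(2, max_n + 1):
                if i + n <= len(tokens):
                    grams.append(" ".join(tokens[i:i + n]))
    return grams

def topic_term_candidates(questions, base_np_extractor):
    # base_np_extractor stands in for a trained base-noun-phrase chunker.
    counts = Counter()
    for q in questions:
        counts.update(wh_ngrams(tokenize(q)))
        counts.update(p.lower() for p in base_np_extractor(q))
    return counts

# Crude stand-in chunker: capitalized words only (a real chunker would return
# multi-word base noun phrases such as "budget hotel").
naive_np = lambda q: re.findall(r"[A-Z][a-z]+", q)
questions = ["Any cool clubs in Berlin or Hamburg?",
             "Where to stay in Berlin on a budget?"]
print(topic_term_candidates(questions, naive_np).most_common(5))
```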
By way of example, Table 1 provides exemplary topic term candidates that are base noun phrases containing the word “hotel” and exemplary wh-n-grams containing the word “where”. It should be noted that the table does not include all the topic term candidates containing “hotel” or “where”, but only exemplary ones. The base noun phrases are listed separately from the wh-n-grams and the frequency of occurrence of each topic term, in the data store 202, is listed as well.
Having thus identified a preliminary set of topic terms (in block 252 in
To clarify this step, an example will be discussed. Assume that a topic term candidate containing the word “hotel” is the one in Table 2 that identifies “embassy suite hotel”. This topic term may be reduced to “suite hotel” because “embassy suite hotel” may be too sparse and unlikely to be hit by a new question posted by a user in the community question answering system. At the same time, it may be desirable to maintain “inexpensive western hotel” even though “western hotel” is also one of the topic terms.
Reducing the set of topic terms is discussed in greater detail below with respect to
Once the reduced set of topic terms has been extracted by component 208, topic term linking component 210 links the topic terms to construct a topic chain for each question 214. This is indicated by block 256 in
Topic chains are indicated by block 220 in
Reducing the topic terms (as briefly discussed above with respect to block 254 in
In order to perform reduction, a question tree is built (as discussed above with respect to
$M = (\Gamma, \Theta)$  Eq. 1

where $\Gamma$ and $\Theta$ are defined as follows:

$\Gamma = [C_1, C_2, \ldots, C_k], \quad \Theta = [p(C_1), p(C_2), \ldots, p(C_k)]$  Eq. 2

where $C_1, C_2, \ldots, C_k$ are the classes determined by a cut in the tree and $p(C_1), p(C_2), \ldots, p(C_k)$ are their probabilities, with $\sum_{i=1}^{k} p(C_i) = 1$.
A “cut” in a tree identifies any set of nodes that define a partition of all the nodes, viewing each node as representing the set of child nodes, as well as itself. For instance,
A straightforward way of determining a cut of the tree is to collapse nodes that occur less frequently in the training data into their parent node, and then to update the frequency of the parent node to include the frequency of the child nodes that are collapsed into it. For instance, node n24 in
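A minimal Python sketch of this frequency-based collapsing, assuming a simple node structure; the class layout and threshold are illustrative only.

```python
class Node:
    def __init__(self, term, freq=0):
        self.term, self.freq, self.children = term, freq, {}

def subtree_freq(node):
    # Total frequency of a node plus all of its descendants.
    return node.freq + sum(subtree_freq(c) for c in node.children.values())

def collapse_rare_children(node, min_freq):
    # Fold any child subtree whose total frequency falls below min_freq into
    # its parent, adding the collapsed subtree's frequency to the parent.
    for term, child in list(node.children.items()):
        if subtree_freq(child) < min_freq:
            node.freq += subtree_freq(child)
            del node.children[term]
        else:
            collapse_rare_children(child, min_freq)
```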
The MDL principle is a principle of data compression and statistical estimation from information theory. Given a sample $S$ and a tree cut $\Gamma$, maximum likelihood estimation is employed to estimate the parameters of the corresponding tree cut model $\hat{M} = (\Gamma, \hat{\Theta})$, where $\hat{\Theta}$ denotes the estimated parameters.
According to the MDL principle, the description length $L(\hat{M}, S)$ of the tree cut model $\hat{M}$ and the sample $S$ is the sum of the model description length $L(\Gamma)$, the parameter description length $L(\hat{\Theta}|\Gamma)$, and the data description length $L(S|\Gamma, \hat{\Theta})$. That is:
$L(\hat{M}, S) = L(\Gamma) + L(\hat{\Theta}|\Gamma) + L(S|\Gamma, \hat{\Theta})$  Eq. 3
The model description length $L(\Gamma)$ is a subjective quantity which depends on the coding scheme employed. In the present system, it is simply assumed that each tree cut model is equally likely, a priori. The parameter description length $L(\hat{\Theta}|\Gamma)$ is calculated as follows:

$L(\hat{\Theta}|\Gamma) = \dfrac{k}{2} \times \log|S|$

where $|S|$ denotes the sample size and $k$ denotes the number of parameters in the tree cut model, that is, $k$ = the number of nodes in $\Gamma$ minus 1.
The data description length $L(S|\Gamma, \hat{\Theta})$ is calculated as follows:

$L(S|\Gamma, \hat{\Theta}) = -\sum_{t \in S} \log \hat{p}(t), \qquad \hat{p}(t) = \dfrac{f(C)}{|C| \cdot |S|}$ for each topic term $t$ in class $C$

where $f(C)$ denotes the total frequency of topic terms in class $C$ in the sample $S$, and $|C|$ denotes the number of topic terms in class $C$.
With the description length defined as in Eq. 3 above, the tree cut model with the minimum description length is selected and output as the result of reduction.
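Assuming the MDL formulation above (parameter description length $\frac{k}{2}\log|S|$ and a data term driven by the class frequencies $f(C)$), the following Python sketch shows how the description length of a candidate cut could be computed and the minimum-length cut selected. The class-summary structure and function names are illustrative assumptions, not part of the original disclosure.

```python
import math

def description_length(cut, sample_size):
    # cut: a list of classes, each summarized as
    #   {"freq": f(C), "n_terms": number of distinct topic terms |C|}
    k = len(cut) - 1                                 # free parameters in the cut
    param_dl = (k / 2.0) * math.log2(sample_size)    # L(Theta_hat | Gamma)
    data_dl = 0.0                                    # L(S | Gamma, Theta_hat)
    for cls in cut:
        f_c, n_terms = cls["freq"], cls["n_terms"]
        if f_c == 0:
            continue
        # topic terms within a class are treated as equiprobable
        p_term = (f_c / sample_size) / n_terms
        data_dl -= f_c * math.log2(p_term)
    return param_dl + data_dl          # L(Gamma) is constant and omitted

def best_cut(candidate_cuts, sample_size):
    # MDL principle: choose the cut with the minimum description length.
    return min(candidate_cuts, key=lambda c: description_length(c, sample_size))
```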
In accordance with one embodiment, modifier portions of topic terms are ignored when reducing a topic term to another topic term. Therefore, the present system uses two types of reduction: removing the prefix of base noun phrases, and removing the suffix of wh-n-grams. A data structure referred to as a prefix tree (also sometimes referred to as a trie) is used for representing the base noun phrases and wh-n-grams.
The two types of reduction correspond to two types of prefix trees, namely a prefix tree of reversely ordered base noun phrases and a prefix tree of wh-n-grams. In order to generate the prefix tree for base noun phrases, the order of the terms (or words) in the extracted base noun phrases is first reversed. This is indicated by block 300 in
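A minimal Python sketch of building the two prefix trees, assuming topic term candidates have already been counted; the sample phrases and frequencies are purely illustrative.

```python
from collections import defaultdict

def make_node():
    return {"children": defaultdict(make_node), "freq": 0}

def add_phrase(root, words, freq=1):
    node = root
    for w in words:
        node = node["children"][w]
    node["freq"] += freq

def build_prefix_trees(base_noun_phrases, wh_ngrams):
    # Both arguments map a phrase to its frequency in the question archive.
    np_root, wh_root = make_node(), make_node()
    for phrase, freq in base_noun_phrases.items():
        # Reverse the word order so the head noun ("hotel") becomes the prefix.
        add_phrase(np_root, list(reversed(phrase.split())), freq)
    for phrase, freq in wh_ngrams.items():
        # wh-n-grams keep their natural order; the wh-word is already the prefix.
        add_phrase(wh_root, phrase.split(), freq)
    return np_root, wh_root

np_root, wh_root = build_prefix_trees(
    {"embassy suite hotel": 3, "suite hotel": 14, "budget hotel": 22},
    {"where to stay": 40, "where to eat": 25})
```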
Once prefix trees 450 and 454 are generated, a tree cut technique is used to select the best cut of each tree in order to reduce the topic terms to a desired level. As discussed above, in one embodiment, the MDL-based tree cut principle is used for selecting the best cut. Of course, a prefix tree can have a plurality of different cuts, which correspond to a plurality of different choices of topic terms.
In
Similarly, in one embodiment, the MDL-based tree cut technique cuts tree 454 in
Performing the tree cut and updating the frequency indicators is illustrated by block 306 in
In order to identify the set, or collection, of questions used to construct the tree, a topic profile $\Theta_t$ is first defined. The topic profile $\Theta_t$ of a topic term $t$ in a categorized text collection is a probability distribution over categories $\{p(c|t)\}_{c \in C}$, where $C$ is a set of categories.
$p(c|t) = \dfrac{\mathrm{count}(c,t)}{\sum_{c' \in C} \mathrm{count}(c',t)}$

where count(c, t) is the frequency of the topic term $t$ within the category $c$. Then, $\sum_{c \in C} p(c|t) = 1$.
By categorized questions, it is meant questions that are organized in a taxonomy tree. For example, in one embodiment, the question “How do I install my wireless router” is categorized as “Computers and Internet→Computer Networking”.
Identifying the topic profile for topic terms in a question set over a set of categories is indicated by block 308 in
Next, a specificity for the topic terms is defined. The specificity $s(t)$ of a topic term $t$ is the inverse of the entropy of the topic profile $\Theta_t$. More specifically:

$s(t) = \left(\epsilon - \sum_{c \in C} p(c|t)\,\log p(c|t)\right)^{-1}$  Eq. 8

where $\epsilon$ is a smoothing parameter used to cope with topic terms whose entropy is 0. In practice, the value of $\epsilon$ can be set empirically; in one embodiment, it is set to 0.001.
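A minimal Python sketch of computing a topic profile and its specificity as described above; the substring matching, log base, and sample data are simplifying assumptions.

```python
import math
from collections import Counter

def topic_profile(term, categorized_questions):
    # categorized_questions: iterable of (category, question_text) pairs.
    # Substring matching is a simplification of real topic term occurrence.
    counts = Counter(c for c, q in categorized_questions
                     if term.lower() in q.lower())
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

def specificity(profile, eps=0.001):
    # Inverse of the (smoothed) entropy of the topic profile.
    entropy = -sum(p * math.log2(p) for p in profile.values() if p > 0)
    return 1.0 / (entropy + eps)

data = [("Travel", "any cool clubs in Berlin or Hamburg?"),
        ("Travel", "where to stay in Hamburg?"),
        ("Dining Out", "best currywurst in Berlin?")]
print(specificity(topic_profile("Hamburg", data)))  # one category -> very specific
print(specificity(topic_profile("Berlin", data)))   # two categories -> less specific
```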
Specificity represents how specific a topic term is in characterizing the information needs of users who post questions. A topic term of high specificity (e.g., Hamburg, Berlin) usually specifies the question topic corresponding to the main context of a question. Thus, a good question recommendation should preserve such a question topic as much as possible, so that the recommendation stays within the same context. A topic term of low specificity (e.g., cool club, where to see) is usually used to represent the question focus, which is relatively volatile.
Calculating the specificity of the topic terms is indicated by block 310 in
After a topic profile and specificity have been calculated for all of the topic terms, topic chains are identified in each category for the questions in the question set, based on the calculated specificities. A topic chain $q^c$ of a question $q$ is a sequence of ordered topic terms $t_1 \rightarrow t_2 \rightarrow \ldots \rightarrow t_m$ such that

1) $t_i$ is included in $q$, $1 \le i \le m$;

2) $s(t_k) > s(t_l)$, $1 \le k < l \le m$.
For example, the topic chain of “any cool clubs in Berlin or Hamburg?” is “Hamburg→Berlin→cool club” because the specificities for “Hamburg”, “Berlin”, and “cool club” are 0.99, 0.62, and 0.36, respectively.
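A short sketch of forming a topic chain by ordering a question's topic terms by decreasing specificity, using the example values above.

```python
def topic_chain(question_terms, spec):
    # Order a question's topic terms by decreasing specificity:
    # the question topic comes first, the question focus last.
    return sorted(question_terms, key=lambda t: spec[t], reverse=True)

spec = {"Hamburg": 0.99, "Berlin": 0.62, "cool club": 0.36}
print(topic_chain(["cool club", "Berlin", "Hamburg"], spec))
# -> ['Hamburg', 'Berlin', 'cool club']
```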
Identifying the topic chains for the topic terms is indicated by block 312.
Once the topic chains have been identified for the set of questions, then a question tree for the set of questions can be generated.
A question tree of a question set $Q = \{q_i\}_{i=1}^{N}$ is a prefix tree built over the topic chains $Q^c = \{q_i^c\}_{i=1}^{N}$ of the question set $Q$. Clearly, if a question set contains only one question, its question tree will be exactly the same as the topic chain of the question.
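A minimal sketch of building such a question tree as a prefix tree over topic chains; the dictionary-based node layout and sample chains are implementation assumptions.

```python
def build_question_tree(topic_chains):
    # A question tree is a prefix tree built over the topic chains of a
    # question set; each node's count records how many chains pass through it.
    root = {"freq": 0, "children": {}}
    for chain in topic_chains:
        node = root
        for term in chain:
            node = node["children"].setdefault(term, {"freq": 0, "children": {}})
            node["freq"] += 1
    return root

tree = build_question_tree([
    ["Hamburg", "Berlin", "cool club"],
    ["Hamburg", "Berlin", "how far"],
    ["Hamburg", "Berlin", "what to see"],
])
# tree["children"]["Hamburg"]["freq"] == 3
```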
For instance, the topic chains associated with the questions in
From this description, it can be seen that the question tree 102 in
The topic chain generated for the input question is used by question collection component 406 to identify topic chains in index 204 that have a similar root node to the topic chain generated for input question 402. More specifically, the topic terms of low specificity in the topic chains in index 204 and in the topic chain for input question 402 usually represent the question focus, which is relatively volatile. These topic terms are discriminated from those of high specificity and then suggested as substitutions.
For instance, recall that the topic terms in the topic chain of a question are ordered according to their specificity values calculated above with respect to Eq. 8. A cut of a topic chain thus gives a decision which discriminates the topic terms of low specificity (representing question focus) from the topic terms of high specificity (representing question topic). Given a topic chain of a question consisting of M topic terms, there exist M−1 possible cuts. Each possible cut yields one kind of suggestion or substitution.
One method for recommending substitutions of topic terms (in order to generate recommended questions) is simply to take the M−1 cuts and then, on the basis of them, suggest M−1 kinds of substitutions. However, such a simple method can complicate the problem of ranking recommendation candidates (for recommended questions) because it introduces a relatively high level of uncertainty. Of course, if this level of uncertainty is acceptable in the ranking process, then this method can be used.
In another embodiment, the MDL-based tree cut model is used for identifying a best cut of a topic chain. Given a topic chain $q^c$ of a question $q$, a question tree of related questions is constructed as follows. First, a set of topic chains $Q^c = \{q_i^c\}_{i=1}^{n}$ is identified (as represented by block 408 in
Once the question tree 412 is generated by component 410, the topic/focus identifier component 414 (which can be implemented as an MDL-based tree cut model) performs a tree cut in the tree. Component 414 obtains a best cut of the question tree, which also gives a cut for each topic chain in the question tree, including $q^c$. In this way, the best cut is obtained by observing the distribution of topic terms over all the potential recommendations (all the questions in index 204 that are related to the input question 402), instead of only the input question 402.
A cut of a given topic chain $q^c$ separates the topic chain into two parts: the head and the tail. The head (denoted $H(q^c)$) is the sub-sequence of the original topic chain $q^c$ before (upstream of) the cut. The tail (denoted $T(q^c)$) is the sub-sequence of the original topic chain $q^c$ after (downstream of) the cut. Therefore, $q^c = H(q^c) \rightarrow T(q^c)$.
Performing a tree cut to obtain a head and tail for each topic chain in the question tree, including the topic chain for the input question, is indicated by block 508 in
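A short sketch of splitting a topic chain into its head and tail at a given cut position; the indices and example terms are illustrative.

```python
def split_chain(chain, cut_index):
    # cut_index k (1 <= k <= len(chain) - 1) splits the chain into the head
    # H(q^c) = chain[:k] (question topic) and the tail T(q^c) = chain[k:]
    # (question focus).
    return chain[:cut_index], chain[cut_index:]

chain = ["Hamburg", "Berlin", "cool club"]
possible_cuts = [split_chain(chain, k) for k in range(1, len(chain))]
head, tail = split_chain(chain, 2)
# head -> ['Hamburg', 'Berlin'], tail -> ['cool club']
```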
By way of example, one of the topic chains represented by question trees 102 in
In order to decide which questions to recommend to the user, component 414 calculates a recommendation score $r(\tilde{q}|q)$ for each of the substitution candidates (or recommendation candidates) represented by the other leaf nodes 108, as indicated by block 510 in
Given that the topic chain of an input question $q$ 402 is separated by a cut into its head and tail, $q^c = H(q^c) \rightarrow T(q^c)$, and that the topic chain of a recommendation candidate $\tilde{q}$ is likewise separated into a head and tail, $\tilde{q}^c = H(\tilde{q}^c) \rightarrow T(\tilde{q}^c)$, the recommendation score satisfies the following with respect to specificity and generality. First, the more similar the head of $q^c$ (i.e., $H(q^c)$) is to the head of the recommendation $\tilde{q}^c$ (i.e., $H(\tilde{q}^c)$), the greater the recommendation score $r(\tilde{q}|q)$. Second, the more similar the tail $T(q^c)$ is to the tail of the recommendation $T(\tilde{q}^c)$, the smaller the recommendation score $r(\tilde{q}|q)$.
These requirements with respect to specificity and generality, respectively, help to ensure that the substitutions given by the recommendation candidates focus on the tail part of the topic chain, which provides users with the opportunity of exploring different question foci around the same question topic. For instance, again using the example questions shown in
where $|q_1^c|$ represents the number of topic terms contained in $q_1^c$, and $\mathrm{PMI}(t_1, t_2)$ represents the pointwise mutual information of a pair of topic terms $t_1$ and $t_2$.
According to Eq. 9, the similarity between topic chains is basically determined by the associations between constituent topic terms. The PMI values of individual pairs of topic terms in Eq. 9 are weighted by the specificity of the topic terms occurring in $q_1^c$. It should be noted that the similarity so defined is asymmetric. With the similarity defined, the recommendation score $r(\tilde{q}|q)$ can be defined as follows, in order to meet all of the constraints discussed above:
$r(\tilde{q}|q) = \lambda \cdot \mathrm{sim}(H(\tilde{q}^c)\,|\,H(q^c)) - (1-\lambda) \cdot \mathrm{sim}(T(\tilde{q}^c)\,|\,T(q^c))$  Eq. 10
Eq. 10 balances the two requirements of specificity and generality by linear interpolation. A higher value of $\lambda$ implies that the recommendations tend to be similar to the input question 402; a lower value of $\lambda$ encourages the recommended questions to explore a question focus that differs from that of the input question 402.
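The following Python sketch implements a recommendation score in the spirit of Eq. 10. Because Eq. 9 is not reproduced here, the chain similarity below is one plausible specificity-weighted PMI average, not a verbatim transcription of the original formula; the `pmi` function and λ value are assumptions.

```python
def chain_similarity(cand_part, ref_part, spec, pmi):
    # sim(cand | ref): for every reference term, take its best PMI association
    # with a candidate term, weight it by the reference term's specificity,
    # and average.  This is one plausible reading of Eq. 9, not a verbatim copy.
    if not cand_part or not ref_part:
        return 0.0
    total = sum(spec[t] * max(pmi(t, t2) for t2 in cand_part) for t in ref_part)
    return total / len(ref_part)

def recommendation_score(q_chain, cand_chain, cut_q, cut_cand, spec, pmi, lam=0.7):
    # Eq. 10: reward candidates whose head matches the input's question topic,
    # and penalize candidates whose tail repeats the input's question focus.
    h_q, t_q = q_chain[:cut_q], q_chain[cut_q:]
    h_c, t_c = cand_chain[:cut_cand], cand_chain[cut_cand:]
    return (lam * chain_similarity(h_c, h_q, spec, pmi)
            - (1 - lam) * chain_similarity(t_c, t_q, spec, pmi))
```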
To calculate the scores, component 416 first selects a topic chain as a recommendation candidate. This is indicated by block 512 in
Recommendation scoring and ranking component 416 thus generates the recommendation score for each of the recommendation candidates based on the similarities calculated. This is indicated by block 520 in
Once component 416 generates the recommendation score for the recommendation candidates, the topic chains in each of the recommendation candidates can be ranked based on the recommendation scores calculated. This is indicated by block 522 in
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 910 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 910 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 910. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 930 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 931 and random access memory (RAM) 932. A basic input/output system 933 (BIOS), containing the basic routines that help to transfer information between elements within computer 910, such as during start-up, is typically stored in ROM 931. RAM 932 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 920. By way of example, and not limitation,
The computer 910 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 910 through input devices such as a keyboard 962, a microphone 963, and a pointing device 961, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 920 through a user input interface 960 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 991 or other type of display device is also connected to the system bus 921 via an interface, such as a video interface 990. In addition to the monitor, computers may also include other peripheral output devices such as speakers 997 and printer 996, which may be connected through an output peripheral interface 995.
The computer 910 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 980. The remote computer 980 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 910. The logical connections depicted in
When used in a LAN networking environment, the computer 910 is connected to the LAN 971 through a network interface or adapter 970. When used in a WAN networking environment, the computer 910 typically includes a modem 972 or other means for establishing communications over the WAN 973, such as the Internet. The modem 972, which may be internal or external, may be connected to the system bus 921 via the user input interface 960, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 910, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.