Searching for information and relevant documents, searching metadata describing documents, and searching within data bases is often time-consuming. Documentation-heavy application areas, such as news summarization, service analysis and fault tracking, customer feedback analysis, medical diagnosis and process report analysis, trend scouting or technical and scientific literature search, require efficient means for exploring and filtering the underlying textual information. Commonly, filtering of documents by topic segmentation is used to address this issue. Conventional approaches for clustering documents take into account only a single text corpus, i.e. a so-called foreground language model. The foreground language model is formed by a text corpus which comprises a selected cluster of documents. The disadvantage of conventional methods for clustering text documents is that they do not efficiently differentiate the documents of the selected document cluster from documents within other document clusters.
Accordingly, it is an object of the present invention to provide a method and an apparatus for performing a drill-down operation allowing a more specific exploration of documents, based on the use of language modelling.
The invention provides a method for performing a drill-down operation on a text corpus comprising documents using language models for key phrase weighting, said method comprising the steps of
weighting key phrases occurring both in a foreground language model which contains a selected document cluster of said text corpus and in a background language model which does not contain said selected document cluster by calculating for each key phrase a key phrase weight comprising a ratio between a foreground weight of said key phrase and a background weight of said key phrase; and
assigning documents of the foreground language model to cluster labels which are formed by key phrases having high calculated key phrase weights.
In an embodiment of the method according to the present invention, the foreground weight of said key phrase in the documents of the foreground language model which contains said selected document cluster and the background weight of said key phrase in the documents of the background language model which does not contain said selected document cluster are both calculated according to a predetermined weighting scheme.
In an embodiment of the method according to the present invention, the weighting scheme comprises a TF/IDF weighting scheme, an informativeness/phraseness measurement weighting scheme, a log-likelihood ratio test weighting scheme, a chi-square weighting scheme, a Student's t-test weighting scheme or a Kullback-Leibler divergence weighting scheme.
In an embodiment of the method according to the present invention, the foreground weight of the key phrase is calculated by using a TF/IDF weighting scheme depending on a term frequency and an inverse document frequency of said key phrase in the documents of the foreground language model which contains said selected document cluster.
In an embodiment of the method according to the present invention, the background weight of the key phrase is calculated by using a TF/IDF weighting scheme depending on a term frequency and an inverse document frequency of said key phrase in the documents of the background language model which does not contain said selected document cluster.
In an embodiment of the method according to the present invention, the key phrase weight w(k) is calculated by:
w(k)=[wfg(k)/wbg(k)]·log [wfg(k)+wbg(k)],
wherein wfg is the foreground weight of said key phrase (k), and
wherein wbg is the background weight of said key phrase (k).
In an embodiment of the method according to the present invention, the key phrase weight w(k) is calculated by:
w(k)=log [wfg(k)/wbg(k)]·log [wfg(k)+wbg(k)],
wherein wfg is the foreground weight of said key phrase (k), and
wherein wbg is the background weight of said key phrase (k).
In an embodiment of the method according to the present invention, the key phrase weight w(k) is calculated by:
wherein wfg is the foreground weight of said key phrase (k), and
wherein wbg is the background weight of said key phrase (k).
In an embodiment of the method according to the present invention, the key phrase weight w(k) is calculated by:
wherein wfg is the foreground weight of said key phrase (k), and
wherein wbg is the background weight of said key phrase (k).
In an embodiment of the method according to the present invention, the text corpus is a monolingual text corpus or a multilingual text corpus.
In an embodiment of the method according to the present invention, said weighting scheme for calculation of said foreground weight and of said background weight of a key phrase (k) in a document also weights said key phrase depending on whether it is a meta tag, a key phrase within a title of said document, a key phrase within an abstract of said document or a key phrase in a text of said document.
In an embodiment of the method according to the present invention, the document is an HTML document.
In an embodiment of the method according to the present invention, the cluster labels of the document clusters are displayed for selection of the corresponding document clusters on a screen.
In an embodiment of the method according to the present invention, the selection of the corresponding document cluster is performed by a user.
In an embodiment of the method according to the present invention, the documents of the selected document cluster are displayed to the user on said screen.
The invention further provides a method for performing a drill-down operation on a text corpus comprising documents using language models for key phrase weighting comprising the steps of
clustering said text corpus into clusters each including a set of documents;
selecting a cluster from among the clusters to generate a foreground language model containing the selected document cluster and a background language model which does not contain the selected document cluster;
weighting key phrases occurring both in the foreground language model and in the background language model by calculating for each key phrase a key phrase weight comprising a ratio between a foreground weight of said key phrase and a background weight of said key phrase;
sorting the weighted key phrases according to the respective key phrase weight in descending order;
selecting a configurable number of key phrases having the highest key phrase weights as cluster labels; and
assigning documents of the foreground language model to the selected cluster labels.
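By way of illustration only, these steps can be sketched in Python as follows. The document representation (a set of key phrases per document), the function names and the choice of combination formula are assumptions made for this sketch, not part of the claimed method.

```python
import math

def drill_down(documents, selected_cluster, weight, top_n=10):
    """One drill-down step along the lines of the steps listed above.
    'weight(k, corpus)' may be any per-corpus weighting scheme, e.g. TF/IDF;
    weights are assumed to be positive."""
    # Foreground language model: the selected document cluster;
    # background language model: the remaining documents of the text corpus.
    foreground = selected_cluster
    background = [d for d in documents if d not in selected_cluster]

    # Key phrases occurring both in the foreground and the background model.
    shared = set().union(*foreground) & set().union(*background)

    # Key phrase weight comprising the ratio between foreground and background
    # weight, here damped by the logarithm of the sum (one variant given below).
    score = {}
    for k in shared:
        w_fg, w_bg = weight(k, foreground), weight(k, background)
        score[k] = (w_fg / w_bg) * math.log(w_fg + w_bg)

    # Sort in descending order and keep a configurable number of key phrases
    # with the highest weights as cluster labels.
    labels = sorted(score, key=score.get, reverse=True)[:top_n]

    # Assign each foreground document to the cluster labels it contains.
    return {label: [d for d in foreground if label in d] for label in labels}
```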
In an embodiment of the method according to the present invention, the selected cluster labels are displayed on a screen for selection of subclusters.
In an embodiment of the method according to the present invention, the selection of the subclusters is performed by a user.
The invention further provides a user terminal for performing a drill-down operation on a text corpus comprising documents stored in at least one data base using language models for key phrase weighting, said user terminal comprising
a screen for displaying cluster labels of selectable document clusters each including a set of documents;
a calculation unit for weighting key phrases occurring both in a foreground language model which contains a selected document cluster of said text corpus and in a background language model which does not contain said selected document cluster by calculating for each key phrase (k) a key phrase weight w(k) comprising a ratio between a foreground weight wfg(k) of said key phrase (k) and a background weight wbg(k) of said key phrase (k) and for assigning documents of said foreground language model to cluster labels which are formed by key phrases (k) having high calculated key phrase weights w(k).
In an embodiment of the user terminal according to the present invention, the user terminal is connected via a network to said data base.
In an embodiment of the user terminal according to the present invention, the network is a local network.
In an embodiment of the user terminal according to the present invention, the network is formed by the Internet.

The invention further provides an apparatus for performing a drill-down operation on a text corpus comprising documents using language models for key phrase weighting, said apparatus comprising
means for weighting a key phrase (k) occurring both in a foreground language model which contains a selected document cluster of said text corpus and in a background language model which does not contain said selected document cluster by calculating for each key phrase (k) a key phrase weight w(k) comprising a ratio between a foreground weight wfg(k) of said key phrase (k) and a background weight wbg(k) of said key phrase (k); and
means for assigning documents of the foreground language model to cluster labels which are formed by key phrases (k) having high calculated key phrase weights w(k).
The invention further provides an apparatus for performing a drill-down operation on a text corpus comprising documents using language models for key phrase weighting, wherein said apparatus comprises
means for clustering said text corpus into clusters each including a set of documents;
means for selecting a cluster from among the clusters to generate a foreground language model which contains the selected document cluster and a background language model which does not contain the selected document cluster;
means for weighting key phrases (k) occurring both in the foreground language model and in the background language model by calculating for each key phrase (k) a key phrase weight w(k) comprising a ratio between a foreground weight wfg(k) of said key phrase and a background weight wbg(k) of said key phrase (k);
means for sorting the weighted key phrases (k) according to their key phrase weights w(k);
means for selecting a configurable number of key phrases having the highest key phrase weight as cluster labels; and
means for assigning documents of the foreground language model to the selected cluster labels.
In the following, possible embodiments of the method and apparatus according to the present invention are described with reference to the enclosed figures.
The initial document base dB comprises a plurality of text documents, wherein each text document has text words or key phrases. The terms or phrases of a text document can be sorted into an index vector including all words occurring in said document and a corresponding term vector indicating how often the respective word occurs in the respective text document. Usually, some words are not very significant because they occur very often in the document and/or have no significant meaning, such as articles (“a”, “the”). Therefore, a stop word removal is performed to get an index vector with a reduced set of significant phrases. The key phrases k are weighted using weighting schemes, such as TF/IDF weighting, and are then sorted in descending order, wherein the key phrases with the highest calculated weights w(k) are placed on top of a selection list. A predetermined number N of the sorted key phrases k, for example ten key words or key phrases, is then selected as cluster labels L for respective document clusters DC. Finally, the documents d of the data base dB are assigned to document clusters DC labelled by the selected key phrases k having the highest key phrase weights w(k). The clustering of documents d always comprises a labelling and an assignment step, wherein labelling of the document cluster can be performed before or after the assignment of the documents d to a document cluster DC.
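A minimal sketch of this indexing step in Python (the stop word list is abbreviated and purely illustrative):

```python
from collections import Counter

# Abbreviated illustrative stop word list; a real system would use a
# complete list for the language of the corpus.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "is", "to"}

def term_vector(text):
    """Index/term vector of a document: each significant word mapped to
    its number of occurrences, after stop word removal."""
    words = text.lower().split()
    return Counter(w for w in words if w not in STOP_WORDS)

term_vector("The car dealer sells a car and a vehicle")
# -> Counter({'car': 2, 'dealer': 1, 'sells': 1, 'vehicle': 1})
```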
After this initial clustering step, the found cluster labels L are displayed to a user on a screen. If the user is interested in a specific document cluster and its data content and would like to examine and explore the text documents contained in the respective document cluster, the user clicks on the cluster of interest and a further segmentation is triggered. This segmentation step is called a drill-down operation. Upon triggering, the drill-down operates only on documents associated with the cluster at hand, denoted C, which is selected for further segmentation. The referenced set of documents is denoted DC, wherein DC is a strict subset of the document set D of the data base dB.
After a drill-down operation, when the user has selected the cluster “car, vehicles, auto”, subclusters are displayed as shown in
For the drill-down operation, two key phrase weights are calculated for each key phrase k. As a first weight, which is referred to as foreground weight and denoted by wfg(k) of the key phrase k, a score is calculated for the set of documents of the current context, i.e. document set DC. Accordingly, the foreground weight is given by:
wfg(k)=w(k, DC)
As a second weight which is referred to as background weight and denoted by wbg(k) of the key phrase k, a score is calculated for the superset of documents, i.e. document set D. Accordingly, the background weight is given by:
wbg(k)=w(k, D).
Any weighting scheme w can be used, for example a TF/IDF weighting scheme, an informativeness/phraseness measurement weighting scheme, a binomial log-likelihood ratio test (BLRT) weighting scheme, a chi-square weighting scheme, a Student's t-test weighting scheme or a Kullback-Leibler divergence weighting scheme.
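In this notation, both weights result from applying one and the same weighting scheme w to different document sets, as the following sketch illustrates (w stands for any of the schemes named above; the function names are illustrative):

```python
def key_phrase_weights(k, w, DC, D):
    """Foreground and background weight of key phrase k (sketch).
    DC is the currently selected context, a strict subset of the
    document superset D; w(k, corpus) is the chosen weighting scheme."""
    w_fg = w(k, DC)  # foreground weight: score over the current context DC
    w_bg = w(k, D)   # background weight: same scheme over the superset D
    return w_fg, w_bg
```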
After calculating the foreground weight wfg and the background weight wbg, the ratio between the foreground weight wfg and the background weight wbg is calculated, indicating how specific the respective key phrase k is for the currently selected foreground model. To get cluster labels L which are typical for the context, i.e. the selected cluster, and which at the same time are atypical for a general background model or surrounding contexts, the ratio between the foreground weight wfg and the background weight wbg has to be maximized.
In a possible embodiment of the method according to the present invention, the key phrase weight w(k) is calculated by:
w(k)=[wfg(k)/wbg(k)]·log [wfg(k)+wbg(k)]
Accordingly, the weight w for the key phrase k is determined by calculating the ratio between the foreground and the background weight and by multiplying this ratio with the logarithm of the sum of both weights. The larger the ratio, the higher the final key phrase weight of the key phrase. The rationale behind taking the sum of the foreground and the background weight is to encourage key phrases k that have a high foreground weight and a high background weight, as opposed to key phrases k that have both a low foreground and a low background weight. When only taking the ratio between the foreground weight wfg and the background weight wbg, it can happen that a key phrase k occurs that has a low foreground weight wfg but an even lower background weight wbg (so that the ratio between both weights is again high), giving a large overall key phrase weight w. This is avoided by multiplying the ratio with the logarithm of the sum of both weights wfg and wbg. The logarithm as employed in the calculation of the key phrase weight also has a dampening effect.
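A small numerical illustration of this dampening effect, with invented weight values (natural logarithm and positive weights assumed):

```python
import math

def phrase_weight(w_fg, w_bg):
    # w(k) = [wfg(k)/wbg(k)] * log[wfg(k)+wbg(k)]
    return (w_fg / w_bg) * math.log(w_fg + w_bg)

# Two key phrases with the identical foreground/background ratio of 4:
phrase_weight(8.0, 2.0)    # high weights: 4 * log(10)  ≈  9.21
phrase_weight(0.08, 0.02)  # low weights:  4 * log(0.1) ≈ -9.21, suppressed
```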
The above formula is only one possible embodiment.
With the method according to the present invention, the rating is performed by computing two key phrase weights, i.e. the background weight wbg and the foreground weight wfg, and by combining both weights into one score based upon a ratio of both weights.
In a possible embodiment, the key phrase weight w(k) is calculated by:
w(k)=log [wfg(k)/wbg(k)]·log [wfg(k)+wbg(k)].
In a further embodiment of the method according to the present invention, the key phrase weight w(k) is calculated by:
In another embodiment of the method according to the present invention, the key phrase weight w(k) is calculated by:
As can be seen from the above formulas, the key phrase weight w(k) comprises in all embodiments a ratio between the foreground weight wfg(k) of the key phrase k and the background weight wbg(k) of the same key phrase k.
When using, for instance, a TF/IDF weighting scheme, the foreground weight wfg of the key phrase k is calculated depending on the term frequency TF and the inverse document frequency IDF of the key phrase k in the respective documents of the foreground language model which contains the selected document cluster.
In the same manner, when using the TF/IDF weighting scheme, the background weight wbg of the key phrase k is calculated depending on the term frequency TF and the inverse document frequency IDF of the key phrase k in the documents of the background language model which does not contain the selected document cluster.
The TF/IDF weighting scheme is used for information retrieval and text mining. This weighting scheme is a statistical measure to evaluate how important a phrase or term is to a document within a document collection or text corpus. The importance of a key phrase increases proportionally to the number of times the key phrase appears in a document but is offset by the frequency of the key phrase in the text corpus. The term frequency TF is the number of times a given key phrase or term appears in a document. The inverse document frequency IDF is a measure of the general importance of a key phrase k. The inverse document frequency IDF is the logarithm of the number of all documents divided by the number of documents containing the respective key phrase k or term.
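One common way to compute such TF/IDF scores is sketched below; the exact variant (e.g. normalization of TF, smoothing of IDF) is not fixed by the description, so the formula shown is an assumption:

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """TF/IDF of 'term' in 'document' relative to 'corpus'.
    'document' is a term-frequency Counter; 'corpus' a list of Counters."""
    tf = document[term]                          # term frequency TF
    df = sum(1 for d in corpus if d[term] > 0)   # documents containing the term
    idf = math.log(len(corpus) / max(df, 1))     # inverse document frequency IDF
    return tf * idf
```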
After the calculation of the key phrase weights w(k) of the key phrases k, the key phrases k are sorted in a further step as shown in
Then, in a further step, the configurable number N of key phrases k having the highest key phrase weights w(k) are selected as cluster labels L.
In a further step, the documents of the foreground language model are assigned to the selected cluster labels L as can be seen in
In a possible embodiment, the selected cluster labels L are displayed for the user on a screen, so that the user can select subclusters using the displayed cluster labels L.
In a possible embodiment, the selected cluster labels L are displayed on a touch screen of a user terminal. A user touches the screen at the displayed cluster label of the desired subcluster to perform the selection of the respective document cluster.
A further drill-down step into the selected cluster can be performed in the same manner as shown in
After a first drill-down operation, the data base dB is narrowed down to cluster C1. After a further drill-down operation, the set of documents is narrowed down to document cluster C2.
The foreground language model is formed by the document cluster C2.
The method according to the present invention for performing a drill-down operation allows in principle an infinitely deep drill-down into a document data base dB. From the user's perspective, drill-down operations are performed until the set of documents of the current context, i.e. the foreground model, is sufficiently small. In this case, the user has a look at the actual documents of the current context and does not perform a further drill-down operation.
With the method and apparatus for performing a drill-down operation according to the present invention, the intra-cluster similarity for each document cluster DC is maximized whereas the inter-cluster similarity across different document clusters is minimized. The method according to the present invention can be used for clustering text documents according to their content, extracting key phrases and supporting hierarchical drill-down operations for refining a currently focused document set in an effective way by using language models for weighting cluster labels L.
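These two quantities can be made concrete, for example, as average pairwise cosine similarities over term vectors; the following is a sketch under that assumption only, as the description does not prescribe a particular similarity measure:

```python
import math

def cosine(u, v):
    """Cosine similarity of two term vectors (dicts of term -> frequency)."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def intra_cluster_similarity(cluster):
    """Average pairwise similarity of the documents within one cluster."""
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def inter_cluster_similarity(c1, c2):
    """Average similarity between documents of two different clusters."""
    return sum(cosine(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```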
The method according to the present invention can be applied to text corpora containing a very large number of documents as well as to text corpora containing a small number of documents, e.g. sentences or short comments.
A user drills, for example, into the cluster “car, vehicles, auto” as shown in
With the method according to the present invention, by computing a weight ratio between a foreground and a background model, the key phrase weight w(k) of the key phrase k (for example, “Siemens”) is not very high, since the ratio between the foreground and the background weight is low.
When using another key phrase k, such as the term “steering wheel”, the weight with respect to the context DC is not as high as the weight of the key phrase “Siemens”. However, the key phrase “steering wheel” is typical for cars and therefore its occurrence in documents d other than those of the current context DC, i.e. documents d contained in the document set D but not in the context DC, is rather low. Consequently, the background weight wbg of the key phrase “steering wheel” is low and the foreground weight wfg of the key phrase “steering wheel” is high, resulting in an overall key phrase weight w(k) of the key phrase “steering wheel” which is much higher than the key phrase weight w(k) of the key phrase “Siemens”. Accordingly, with the method according to the present invention the key phrase “steering wheel” is more likely to become a subcluster of the current context DC than the key phrase “Siemens”. Accordingly, the method according to the present invention reflects what a user desires when drilling into a set D of documents d.
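With invented weight values, the contrast between the two key phrases can be illustrated as follows (reusing the phrase_weight sketch from above; all numbers are hypothetical):

```python
import math

def phrase_weight(w_fg, w_bg):
    return (w_fg / w_bg) * math.log(w_fg + w_bg)

# "Siemens": high foreground weight, but also high background weight.
phrase_weight(9.0, 8.0)   # ratio ≈ 1.1 -> w(k) ≈ 3.2  (poor cluster label)

# "steering wheel": somewhat lower foreground weight, but rare outside DC.
phrase_weight(6.0, 0.5)   # ratio = 12  -> w(k) ≈ 22.5 (good cluster label)
```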
Priority application: EP07006429, March 2007, EP (regional).