With the advent of computers and the internet, the world has seen an information explosion like never before. Gone are the days when print dominated as the medium of expression. The internet has changed the way people consume data. It is now common to find a digital version of almost every document that is printed. Such massive digitization, although immensely beneficial in many ways, has its own limitations. There is always the pressing problem of finding the right information or data. Therefore, document search remains one of the most challenging areas of research.
Keywords or key phrases offer a valuable mechanism for characterizing text documents. They offer a meaningful way of searching for information in a document or corpus of documents. Traditionally, keywords are manually specified by authors, librarians, professional indexers and catalogers. However, with thousands of documents getting digitized every day, manual specification is no longer feasible. Computer-based automatic keyword extraction was a natural corollary of this problem. A number of keyword extraction methods have been proposed in the past several years. In some methods, the problem is formulated as a supervised classification problem and a classifier is trained on a labeled training dataset. In other methods, keyword extraction is formulated as a ranking problem and candidate words are ranked according to some measure. The existing methods, however, have their own limitations. For example, they do not explicitly consider the semantic relationship between the candidate keywords and the document. Also, the extracted keywords are limited to the document content.
For a better understanding of the invention, embodiments will now be described, purely by way of example, with reference to the accompanying drawings, in which:
The following terms are used interchangeably throughout the document, including the accompanying drawings.
(a) “keyword” and “key phrase”
(b) “document” and “electronic document”
Embodiments of the present invention provide methods, computer executable code and computer storage medium for extracting keywords from a document which may be present in a corpus of documents. Specifically, the disclosed methods involve an in-document keyword extraction method and an in-corpus keyword extraction method. The former extracts keywords that appear in a single document; the latter extracts keywords that appear in the corpus but may not necessarily appear in the document itself.
The method begins in step 110. In step 110, a corpus of documents is obtained or accessed. The corpus of documents may be obtained from a repository, which could be an electronic database. The electronic database may be an internal database, such as, an intranet of a company, or an external database, such as, Wikipedia. Also, the electronic database may be stored on a standalone personal computer, or spread across a number of computing machines, networked together, with a wired or wireless technology. For example, the electronic database may be hosted on a number of servers connected through a wide area network (WAN) or the internet.
In step 120, a document is selected from the corpus of documents, and a set of words that appear as keywords in the document is determined. The method steps involved in the selection of a set of words that appear as keywords in the document are described in further detail with reference to
In step 130, a set of words that appear in the corpus of documents may be determined. Such a set of words may not necessarily appear in the document selected in step 120. The method steps involved in the determination of a second set of words that appear in the corpus of documents, but may not necessarily appear as keywords in the document selected earlier, are described in further detail with reference to
In step 140, a final set of keywords for the document is determined. The step involves combining the first set of words, determined in step 120, with the second set of words, determined in step 130. Once the method steps outlined for steps 120 and 130 are completed, two sets of keywords emerge that are used together to determine a final set of keywords for the document selected in step 120.
In step 210, a topic model is learned for a corpus of documents D by utilizing a statistical topic modelling method. Any statistical topic modelling method, such as, but not limited to, Probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA), may be used. The learnt model is represented by {P(w|z)}w,z, a set of multinomial distributions of words W over topics Z, and optionally {P(z|d)}z,d, a set of multinomial distributions of topics Z over documents D. Optionally, a pre-processing step may be performed, which may comprise stop word removal, word stemming, and transformation of the corpus into a word-by-document matrix. Step 210 may be executed just once for a corpus of documents. Once a model has been learnt, it may be applied directly in the following steps.
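As an illustration, the learning in step 210 can be sketched with a minimal PLSA expectation-maximization loop in plain NumPy. The `train_plsa` helper and the toy two-document corpus below are hypothetical, not the patent's implementation, but they produce the {P(w|z)} and {P(z|d)} multinomial distributions the step refers to.

```python
import numpy as np

def train_plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA distributions {P(w|z)} and {P(z|d)} by EM on a
    document-by-word count matrix (illustrative helper only)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) for every (document, word) pair
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # (d, z, w)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts
        weighted = counts[:, None, :] * post                # n(d,w)·P(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Toy word-by-document counts: two documents with disjoint vocabularies
counts = np.array([[5, 4, 0, 0],
                   [0, 0, 6, 3]], dtype=float)
p_w_z, p_z_d = train_plsa(counts, n_topics=2)
print(np.round(p_z_d, 2))
```

On such clearly separable data the EM loop drives each document's topic distribution toward a single dominant topic, which is what makes the main-topic selection of the later steps meaningful.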
In step 220, for a given document, a multinomial distribution of topics over the document is inferred, according to the statistical topic model, to determine the main topics of the document. To illustrate, in an embodiment, for a document d, the distribution of topics Z over the document d, i.e. {P(z|d)}z, is inferred according to the model learnt in step 210, and is used to determine the main topics T of the document by picking the top k topics with the largest probabilities, i.e. T = argtop_z P(z|d).
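Selecting the main topics T from an inferred {P(z|d)} reduces to a top-k selection; the sketch below assumes a hypothetical five-topic distribution for one document.

```python
import numpy as np

# Hypothetical inferred distribution P(z|d) over 5 topics for one document
p_z_d = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

k = 2
# Top-k topic indices by probability, i.e. T = argtop_z P(z|d)
main_topics = np.argsort(p_z_d)[::-1][:k]
print(main_topics.tolist())  # topics 1 and 3
```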
In step 230, posterior distributions of topics over words in the document are determined and used to assign topics to words in the document, resulting in a set of labeled words in triples. In an embodiment, the posterior distributions of topics over words in the document, i.e. {P(z|d,w)}z,w, are computed and used to assign topics to words by picking the topic with the largest posterior probability for each word, i.e. z*(d,w) = argmax_z P(z|d,w), resulting in a set of labeled words in <w, z*, P(z*|d,w)> triples.
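The posterior in step 230 follows from Bayes' rule, P(z|d,w) ∝ P(w|z)·P(z|d), normalised over topics. A small sketch with made-up model slices (the `p_w_z`, `p_z_d` and `vocab` values are illustrative only):

```python
import numpy as np

# Hypothetical model slices: P(w|z) for 3 topics x 4 words, and P(z|d)
p_w_z = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.10, 0.60, 0.20, 0.10],
                  [0.05, 0.05, 0.45, 0.45]])
p_z_d = np.array([0.5, 0.3, 0.2])
vocab = ["engine", "index", "topic", "model"]

# Posterior P(z|d,w) ∝ P(w|z) · P(z|d), normalised over topics z
joint = p_w_z * p_z_d[:, None]           # (z, w)
p_z_dw = joint / joint.sum(axis=0)       # each column sums to 1

# Label each word with its argmax topic: <w, z*, P(z*|d,w)> triples
triples = [(w, int(z), float(p_z_dw[z, i]))
           for i, w in enumerate(vocab)
           for z in [np.argmax(p_z_dw[:, i])]]
print(triples)
```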
In step 240, a set of noun phrases is extracted from the same document by utilizing a noun phrase chunking method. The step may optionally include a post-processing step for filtering leading articles (e.g. “a”, “an”, “the”) and pronouns (e.g. “his”, “her”, “your”, “that”, “those”, etc.).
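A real implementation would use a trained chunker; the toy stand-in below, `chunk_noun_phrases`, greedily collects determiner/adjective/noun runs from already POS-tagged tokens and then applies the article/pronoun filter described for step 240.

```python
def chunk_noun_phrases(tagged, stop=("a", "an", "the", "his", "her",
                                     "your", "that", "those")):
    """Toy DT/JJ/NN-run chunker over (word, POS-tag) pairs; a stand-in
    for a real noun phrase chunking method."""
    phrases, run = [], []
    for word, tag in tagged + [("", ".")]:   # sentinel flushes last run
        if tag == "DT" or tag.startswith("JJ") or tag.startswith("NN"):
            run.append((word, tag))
        else:
            # keep only runs that actually contain a noun
            if any(t.startswith("NN") for _, t in run):
                words = [w for w, _ in run if w.lower() not in stop]
                phrases.append(" ".join(words))
            run = []
    return phrases

tagged = [("the", "DT"), ("statistical", "JJ"), ("topic", "NN"),
          ("model", "NN"), ("is", "VBZ"), ("learnt", "VBN"),
          ("for", "IN"), ("a", "DT"), ("document", "NN"),
          ("corpus", "NN")]
print(chunk_noun_phrases(tagged))  # ['statistical topic model', 'document corpus']
```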
In step 250, the extracted noun phrases are scored according to the occurrence of words labeled with the main topics T, and sorted in descending order.
The scoring methods may be varied. For example, in one embodiment, the posterior probabilities of words labelled with the main topics of the document may be summed up as the score of a noun phrase. In another embodiment, the length of a noun phrase may be considered as a scoring factor by preferring bigram or trigram noun phrases.
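One possible instance of the scoring described above: a hypothetical `score_phrase` helper that sums the posterior probabilities of a phrase's words labelled with a main topic and applies a small bigram/trigram bonus (the bonus factor is an assumption, not specified by the embodiment).

```python
def score_phrase(phrase, triples, main_topics, prefer_len=(2, 3)):
    """Sum posterior probabilities of the phrase's words that are
    labelled with a main topic; boost bigrams/trigrams slightly."""
    labels = {w: (z, p) for w, z, p in triples}
    score = sum(p for w in phrase.split()
                for z, p in [labels.get(w, (None, 0.0))]
                if z in main_topics)
    if len(phrase.split()) in prefer_len:
        score *= 1.2   # illustrative length-preference bonus
    return score

# Hypothetical <word, topic, probability> triples and candidates
triples = [("topic", 1, 0.9), ("model", 1, 0.8), ("keyword", 0, 0.7)]
phrases = ["topic model", "keyword"]
ranked = sorted(phrases, key=lambda p: score_phrase(p, triples, {1}),
                reverse=True)
print(ranked)  # ['topic model', 'keyword']
```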
In step 260, the top m noun phrases with highest scores are provided as an output. The output is the first set of words that appear as keywords of the document.
In step 310, a statistical topic model with respect to a corpus of documents is learnt. Any statistical topic modelling method, such as, but not limited to, Probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA), may be utilized for learning the statistical topic model.
Once a statistical topic model has been determined, the following steps are performed for each document in the corpus.
In step 320, for each document in the corpus, posterior distributions of topics over words are determined and used to assign topics to the words, resulting in a set of labeled words in <word, topic, probability> triples.
In step 330, for each document in the corpus, noun phrases are extracted from the document by utilizing a noun phrase chunking method. Optionally, a post-processing step of removing the articles and pronouns as described earlier may be performed, resulting in a set of noun phrases.
In step 340, each extracted noun phrase is labeled by associating each word with a topic and a weight according to the triples. This results in a sequence of triples. An output of labeled noun phrases is provided into a repository. The repository may be an electronic database.
In step 350, labeled noun phrases are read out from the repository and indexed with the help of an index engine. While indexing, the index engine may organize the sequence of triples in a way that supports the word-based search and the topic-based search, and supports the search result ranking by considering the probability as a scoring factor (step 360). The Apache Lucene index engine, among others, may be customised to perform this task.
In step 370, for main topics of the document, a string query is composed. This may be done by concatenating the main topics of the document in a Boolean logic and then submitting the string query to the index engine. This results in a ranked list of matched noun phrases. The top n noun phrases are returned as keywords for the document. These are the second set of words that appear in the corpus of documents, but may not necessarily appear in the document.
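The indexing and querying of steps 350-370 can be mimicked with a toy in-memory index. The `TopicIndex` class below is a hypothetical stand-in for a customised index engine such as Lucene: labelled noun phrases are indexed by topic, the main-topic query is treated as a Boolean OR, and results are ranked by accumulated probability mass.

```python
from collections import defaultdict

class TopicIndex:
    """Toy in-memory stand-in for the index engine of steps 350-370."""

    def __init__(self):
        self.by_topic = defaultdict(dict)   # topic -> {phrase: score}

    def add(self, phrase, labelled_triples):
        # accumulate each word's probability under its assigned topic
        for _, z, p in labelled_triples:
            self.by_topic[z][phrase] = self.by_topic[z].get(phrase, 0.0) + p

    def query(self, topics, n=5):
        # Boolean OR over the main topics, ranked by probability mass
        scores = defaultdict(float)
        for z in topics:
            for phrase, s in self.by_topic[z].items():
                scores[phrase] += s
        return sorted(scores, key=scores.get, reverse=True)[:n]

idx = TopicIndex()
idx.add("topic model", [("topic", 1, 0.9), ("model", 1, 0.8)])
idx.add("index engine", [("index", 2, 0.7), ("engine", 2, 0.6)])
idx.add("keyword list", [("keyword", 1, 0.4), ("list", 0, 0.2)])
print(idx.query({1}, n=2))  # ['topic model', 'keyword list']
```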
The storage medium 420 (such as a hard disk) stores a number of programs including an operating system, application programs and other program modules. A user may enter commands and information into the computer system 400 through input devices, such as a keyboard 450, a touch pad (not shown) and a mouse 460. The monitor 440 is used to display textual and graphical information.
An operating system runs on processor 410 and is used to coordinate and provide control of various components within personal computer system 400 in
It would be appreciated that the hardware components depicted in
Further, the computer system 400 may be, for example, a desktop computer, a server computer, a laptop computer, or a wireless device such as a mobile phone, a personal digital assistant (PDA), a hand-held computer, etc.
The embodiment described provides an effective way of extracting keywords from a document by utilizing noun phrase chunking technology to extract high-quality keyword candidates, and statistical topic modelling technology to analyze the latent topics of text documents. The embodiment ranks the keyword candidates by considering the topic relevance between the candidate and the document as a scoring factor. By combining the in-document method and the in-corpus method, it generates a set of in-document keywords and a set of out-of-document keywords.
It will be appreciated that the embodiments within the scope of the present invention may be implemented in the form of a computer program product including computer-executable instructions, such as program code, which may be run on any suitable computing environment in conjunction with a suitable operating system, such as, Microsoft Windows, Linux or UNIX operating system. Embodiments within the scope of the present invention may also include program products comprising computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, such computer-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium which can be used to carry or store desired program code in the form of computer-executable instructions and which can be accessed by a general purpose or special purpose computer.
It should be noted that the above-described embodiment of the present invention is for the purpose of illustration only. Although the invention has been described in conjunction with a specific embodiment thereof, those skilled in the art will appreciate that numerous modifications are possible without materially departing from the teachings and advantages of the subject matter described herein. Other substitutions, modifications and changes may be made without departing from the spirit of the present invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2010/071758 | 4/14/2010 | WO | 00 | 10/12/2012 |