Embodiments are generally related to data-processing systems and methods. Embodiments are additionally related to the field of computers and similar technologies, and in particular to software utilized in this field. Embodiments are also related to keyword extraction methods and systems.
A keyword is a single word or a multi-word phrase within a document that characterizes and summarizes the topics the document covers. When documents are prepared, there is often a need to generate a list of keywords and phrases that represent the main concepts described in such documents. For example, a reader may utilize a list of keywords and phrases as a simple summary of a document when searching for and locating articles among academic documents such as technical papers, journal articles, etc. Similarly, due to the increased usage of the Internet, there is a need to provide keyword lists for electronic documents to facilitate searching for a particular document. Keyword extraction from a document has many potential applications, such as the creation of metadata for a document, facilitating the skimming of documents by highlighting keywords, use as index terms for searching document collections, and the analysis of usage patterns in Web server logs.
Keywords from a document can be generated manually by an author of the document or by a person skilled in indexing documents. The keywords may also be generated automatically by tagging words in documents by their part of speech, such as, for example, a noun, a verb, an adjective, etc. Similarly, the most frequent words in documents can be listed, excluding stop words such as “and,” “if,” “have,” etc. Stop words are commonly utilized insignificant words, such as “the,” which occur frequently in a document. Such prior art keyword extraction methods possess limited capabilities, which results in a relatively low-quality list of keywords. Such approaches are also usually highly labor intensive.
One prior art keyword extraction approach collects word frequencies with respect to a corpus of documents to determine average word frequencies. The same frequency counting method can be utilized to determine the word frequencies of a page or a document in question. The problem associated with such prior art approaches is that common words may occur more frequently in a given page or document than in the corpus, and may be incorrectly output as keywords. Similarly, if the given page possesses a small word count, quantization causes the word frequencies to be inaccurate, so that non-keywords appear to be more frequent in the page than in the corpus. One solution to this problem is to utilize a list of stop words composed of a predetermined set of common words. Hence, if a given word in the page or document is a stop word, it is not considered a keyword. Similarly, the raw frequency in the given page or document can be compared against the raw frequency in the corpus to generate keywords. Such methods, however, suffer from frequency quantization problems due to small sample sizes.
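The quantization problem described above can be illustrated with a minimal sketch of the raw-frequency comparison. The function name `raw_frequency_keywords` and the corpus frequencies below are hypothetical and chosen only for illustration; they do not come from any particular prior art system.

```python
from collections import Counter


def raw_frequency_keywords(page_words, corpus_freq):
    """Flag every word whose raw page frequency exceeds its corpus frequency.

    page_words  -- list of words on the page
    corpus_freq -- mapping of word -> relative frequency in the corpus
                   (hypothetical values in the example below)
    """
    counts = Counter(page_words)
    total = len(page_words)
    return sorted(
        word
        for word, count in counts.items()
        if count / total > corpus_freq.get(word, 0.0)
    )


# A six-word page: the smallest nonzero page frequency is 1/6, so every
# word that appears at all exceeds its (much smaller) corpus frequency.
page = "the cat sat on the mat".split()
corpus_freq = {"the": 0.06, "on": 0.01, "sat": 0.0003, "cat": 0.0002, "mat": 0.0001}
flagged = raw_frequency_keywords(page, corpus_freq)
```

In this example every word on the page is flagged, including the common words "the" and "on", because a page of six words cannot represent any frequency between 0 and 1/6.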
Based on the foregoing it is believed that a need exists for an improved automated method and system for simple keyword extraction, as described in greater detail herein.
The following summary is provided to facilitate an understanding of some of the innovative features unique to the present invention and is not intended to be a full description. A full appreciation of the various aspects of the embodiments disclosed herein can be gained by taking the entire specification, claims, drawings, and abstract as a whole.
It is, therefore, one aspect of the present invention to provide for an improved data-processing method, system and computer-usable medium.
It is another aspect of the present invention to provide for an improved method and system for automatically extracting keywords from a document to avoid frequency quantization problems.
It is a further aspect of the present invention to provide for an improved method for extracting keywords from a document utilizing a statistical measure.
The aforementioned aspects and other objectives and advantages can now be achieved as described herein. A frequency-based keyword extraction method and system utilizing a statistical measure are disclosed, which generate keywords within a page and/or document that distinguish the document from an average document. A simple frequency threshold parameter can be utilized to identify common stop words: a word in the document is treated as a common stop word if its frequency in a corpus is more than the threshold parameter. A statistical confidence interval of the frequency of a word in the document can be compared against a frequency confidence interval of the word in the corpus. The extracted keyword possesses a greater intra-document frequency confidence interval than the frequency confidence interval of the word within the corpus. A statistical hypothesis test can also be utilized to determine the keyword by calculating a test statistic and testing whether the test statistic is greater than some threshold. The test statistic possesses an approximately normal distribution.
The confidence intervals can be utilized to avoid frequency quantization problems caused by small sample sizes. Furthermore, the lower bound of the frequency confidence interval in the document must be greater than the upper bound of the frequency confidence interval in the corpus in order to generate keywords. The confidence interval utilized for the word in the document does not need to be the same as the interval utilized for the words in the corpus. The keywords produced are those words that are stressed more than in the average document. Such a method can be utilized for keyword extraction or utilized as an input to a more elaborate keyword extraction scheme.
The accompanying figures, in which like reference numerals refer to identical or functionally-similar elements throughout the separate views and which are incorporated in and form a part of the specification, further illustrate the present invention and, together with the detailed description of the invention, serve to explain the principles of the present invention.
The particular values and configurations discussed in these non-limiting examples can be varied and are cited merely to illustrate at least one embodiment and are not intended to limit the scope of such embodiments.
As depicted in
Illustrated in
In the depicted example, server 304 and server 306 connect to network 302 along with storage unit 308. In addition, clients 310, 312, and 314 connect to network 302. These clients 310, 312, and 314 may be, for example, personal computers or network computers. Data-processing apparatus 100 depicted in
In the depicted example, server 304 provides data, such as boot files, operating system images, and applications to clients 310, 312, and 314. Clients 310, 312, and 314 are clients to server 304 in this example. Network data processing system 300 may include additional servers, clients, and other devices not shown. Specifically, clients may connect to any member of a network of servers which provide equivalent content.
In the depicted example, network data processing system 300 is the Internet with network 302 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. The keyword can be generated for a list of electronic documents to facilitate searching for a document. Of course, network data processing system 300 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN).
The following description is presented with respect to embodiments of the present invention, which can be embodied in the context of a data-processing system such as data-processing apparatus 100, computer software system 150 and data processing system 300 and network 302 depicted respectively
Referring to
A determination can be made whether the intra-document frequency confidence interval is greater than the frequency confidence interval of the word within the corpus of documents, as depicted at block 460. If the intra-document frequency confidence interval is not greater, the word is not considered further for keyword extraction, as shown at block 440. The lower bound of the frequency confidence interval in the page or document must be greater than the upper bound of the frequency confidence interval in the corpus. Any confidence level can be utilized, and the interval utilized for the word in the document does not need to be the same as the interval utilized for words in the corpus. For example, 95% confidence can be utilized for both intervals. Reducing the confidence level for the word in the page leads to more words becoming keywords, at the expense of false detections, and increasing the confidence level leads to fewer keywords, at the potential expense of missing actual keywords. Thereafter, the keywords can be extracted and highlighted, as shown at blocks 480 and 490.
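The interval comparison can be sketched as follows. The description does not name a particular confidence interval formula, so this sketch assumes the Wilson score interval for a binomial proportion; the function names `wilson_interval` and `is_keyword_by_interval` and the example counts are likewise illustrative assumptions.

```python
import math
from statistics import NormalDist


def wilson_interval(count, total, confidence=0.95):
    """Wilson score interval for the proportion count/total.

    The choice of the Wilson interval is an assumption of this sketch;
    any binomial confidence interval could be substituted.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    p_hat = count / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)


def is_keyword_by_interval(x, m, y, n, doc_conf=0.95, corpus_conf=0.95):
    """True if the document interval lies entirely above the corpus interval.

    x, m -- occurrences of the word in the document, words in the document
    y, n -- occurrences of the word in the corpus, words in the corpus
    The two confidence levels are independent, as the description allows.
    """
    doc_low, _ = wilson_interval(x, m, doc_conf)
    _, corpus_high = wilson_interval(y, n, corpus_conf)
    return doc_low > corpus_high


# Hypothetical counts: a word seen once in a 20-word page has raw
# frequency 0.05, above a corpus frequency of 0.02, yet its interval is
# too wide to separate it from the corpus.
small_page = is_keyword_by_interval(1, 20, 20_000, 1_000_000)
larger_page = is_keyword_by_interval(10, 200, 100, 1_000_000)
```

With these counts the single occurrence on the small page is not accepted as a keyword, while ten occurrences in a 200-word page of a word that is rare in the corpus are accepted. Lowering `doc_conf` narrows the document interval and admits more words.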
p=x/m (1)
Consider y as the number of occurrences of the same word in a corpus of n words that does not include the document. The frequency of the word in the corpus can be determined as the count of the word in the corpus divided by the count of words in the corpus, as depicted at block 530. The calculated frequency for the word in the corpus can be expressed, for example, as equation (2) below.
q=y/n (2)
A hypothesis can be tested by calculating a test statistic, as illustrated at block 540. The word is a significant keyword for a document if its frequency of occurrence in the document is statistically greater than its frequency of occurrence in the corpus. This can be determined by a statistical hypothesis test. The null hypothesis is that the document frequency of the word is less than or equal to the corpus frequency of the word. The alternative hypothesis is that the document frequency of the word is greater than the corpus frequency of the word. The null hypothesis and the alternative hypothesis can be expressed, for example, as indicated by equation (3) and equation (4) respectively.
H0:p<=q (3)
Ha:p>q (4)
Consider r=(x+y)/(m+n). The test statistic possesses an approximate normal distribution and the test statistic can be written as indicated, for example, in equation (5) below:
z=(p−q)/sqrt(r*(1−r)*(1/m+1/n)) (5)
A determination can be made whether the test statistic is greater than a threshold, as illustrated at block 550. If the test statistic is greater than the threshold, the word is considered a keyword, as depicted at block 560. Otherwise, the word is not considered further for keyword extraction. For example, the hypothesis H0 can be rejected and the word can be considered a keyword for the document if N(z)<0.05, where N(z) is the probability that a standard normal random variable with mean 0 and standard deviation 1 is greater than or equal to z. The “alpha” value 0.05 can be chosen larger in order to allow more words or smaller in order to be more conservative and allow fewer words. Thereafter, the extracted keyword can be highlighted, as shown at block 570.
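The hypothesis test can be sketched as a one-sided two-proportion z-test built from equations (1), (2) and (5), with r=(x+y)/(m+n). The sketch follows the usual convention for such a test and rejects H0 when the upper-tail probability of z is below alpha. The function name `keyword_z_test`, the handling of the degenerate zero-variance case, and the example counts are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def keyword_z_test(x, m, y, n, alpha=0.05):
    """One-sided test of H0: p <= q against Ha: p > q.

    x, m -- occurrences of the word in the document, words in the document
    y, n -- occurrences of the word in the corpus, words in the corpus
    Returns (z, upper_tail_probability, is_keyword).
    """
    p = x / m                      # equation (1)
    q = y / n                      # equation (2)
    r = (x + y) / (m + n)          # pooled frequency
    variance = r * (1.0 - r) * (1.0 / m + 1.0 / n)
    if variance <= 0.0:
        # Assumption of this sketch: with no variation (r is 0 or 1)
        # the word cannot distinguish the document, so H0 is kept.
        return 0.0, 1.0, False
    z = (p - q) / math.sqrt(variance)          # equation (5)
    upper_tail = 1.0 - NormalDist().cdf(z)     # P(Z >= z) for standard normal Z
    return z, upper_tail, upper_tail < alpha


# Hypothetical counts: 10 occurrences in a 200-word document versus
# 100 occurrences in a 1,000,000-word corpus.
z, tail, accepted = keyword_z_test(10, 200, 100, 1_000_000)
```

Raising `alpha` in this sketch admits more words as keywords, and lowering it admits fewer.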
The particular set of words points out the importance of not using a fixed set of stop words to prevent common words from being output as keywords. The words “why” and “go” can generally be stop words, but they fall below the set threshold for corpus common words and they occur with greater frequency in the passage than in the corpus, which makes them keywords. The keywords produced are those that distinguish the document from the average document and are stressed more than in the average document. The keyword extraction methods 400 and 500 can each be utilized on their own as a form of keyword extraction or utilized as input to a more elaborate keyword extraction scheme. Such an approach allows stop words to be flagged as keywords if they appear more often than average.
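The threshold for corpus common words can be sketched as a filter computed from corpus counts rather than from a fixed stop word list. The function name `common_words`, the threshold value, and the counts below are hypothetical and serve only to illustrate the idea.

```python
def common_words(corpus_counts, corpus_size, threshold):
    """Return the words whose corpus frequency exceeds the threshold.

    Only these words are excluded as common words; everything else,
    including words found on conventional stop lists, stays eligible.
    """
    return {
        word
        for word, count in corpus_counts.items()
        if count / corpus_size > threshold
    }


# Hypothetical corpus of 1,000,000 words.
corpus_counts = {"the": 60_000, "go": 1_500, "why": 900, "keyword": 40}
excluded = common_words(corpus_counts, 1_000_000, threshold=0.005)
```

Here only "the" is excluded. "why" and "go" fall below the threshold, so they remain candidates and can become keywords if a statistical test finds them stressed in the document.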
It will be appreciated that variations of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also, various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.
Number | Date | Country | |
---|---|---|---|
20100005083 A1 | Jan 2010 | US |