The present disclosure generally relates to the field of document analysis, and more particularly to a method for generating an ordered list of signature terms occurring in at least a first document and a second document.
Traditional Term Frequency Inverse Document Frequency (TFIDF) approaches are computationally intensive methods for identifying the significant terms and phrases that differentiate specific documents from the rest of a corpus.
To find the unique terms and phrases for a set of search results, the TFIDF values of all of the documents in the result set are summed to create a result set frequency. This set is then sorted, and the top N terms or phrases are selected as the signature terms.
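The traditional calculation described above can be sketched as follows. This is a minimal illustration and not a definitive implementation: the function names, the toy corpus, and the particular weighting (relative term frequency times the natural log of N/DF) are assumptions chosen for the example, since the disclosure does not fix a TFIDF variant.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (assumed weighting)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

def baseline_signature_terms(vectors, result_ids, n):
    """Sum the weights over the result set, sort, and keep the top n terms."""
    totals = Counter()
    for i in result_ids:
        for term, weight in vectors[i].items():
            totals[term] += weight  # floating point sum over every term
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

# Hypothetical four-document corpus; documents 0 and 1 form the result set.
corpus = [
    "solar panel output drops in winter".split(),
    "solar inverter faults reduce panel output".split(),
    "quarterly revenue report for the board".split(),
    "board approves revenue forecast".split(),
]
vectors = tfidf_vectors(corpus)
top = baseline_signature_terms(vectors, result_ids=[0, 1], n=3)
```

Note that every term of every document in the result set takes part in the floating point sum, which is the cost the approach described below aims to avoid.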
Our approach seeks to reduce the complexity of this type of calculation through approximation and pre-computation. It is designed to work efficiently with modern relational database constructs for content management, and to enable the kinds of highly interactive data-driven visualizations that are the hallmark of third-generation business intelligence.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not necessarily restrictive of the present disclosure. The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate subject matter of the disclosure. Together, the descriptions and the drawings serve to explain the principles of the disclosure.
The numerous advantages of the disclosure may be better understood by those skilled in the art by reference to the accompanying figures in which:
Reference will now be made in detail to the subject matter disclosed, which is illustrated in the accompanying drawings.
We begin in the same manner as traditional TFIDF by computing the document frequency (DF). We then sort each document's list of TFIDF values to bring the signature terms of the document to the front.
We then approximate this list by setting the top Q entries of each sorted TFIDF list to 1 and the rest to 0. We can make this approximation because terms and phrases in English documents follow a Zipf's Law distribution. The approximation is also helpful from a computational standpoint because it moves the computation out of floating point math and replaces it with simple sums over integers.
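The approximation step can be sketched as below. The function name `binarize_top_q`, the tie-breaking rule (alphabetical), and the example weights are all hypothetical and serve only to illustrate replacing the top Q weights with 1 and the rest with 0.

```python
def binarize_top_q(weights, q):
    """Map {term: weight} to {term: 0 or 1}, with 1 for the q highest weights."""
    # Sort by descending weight; ties are broken alphabetically (an assumption).
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    keep = {term for term, _ in ranked[:q]}
    return {term: int(term in keep) for term in weights}

# Illustrative weights for one document, not taken from the disclosure.
weights = {"inverter": 0.42, "solar": 0.31, "panel": 0.12, "output": 0.05, "the": 0.0}
flags = binarize_top_q(weights, q=2)
# flags == {"inverter": 1, "solar": 1, "panel": 0, "output": 0, "the": 0}
```

Because the result holds only integers, any later aggregation over several documents is a matter of integer addition.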
We then truncate the zero entries, leaving just the top terms. At this point we could also use a thesaurus for further term reduction.
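A minimal sketch of the truncation, assuming the 0/1 entries of a document are held in a dictionary. The `thesaurus` argument is a hypothetical mapping from a term to a canonical synonym; the disclosure mentions a thesaurus only as an option and does not specify its form.

```python
def indicative_terms(flags, thesaurus=None):
    """Drop zero entries; optionally fold each surviving term into a canonical form."""
    thesaurus = thesaurus or {}
    return {thesaurus.get(term, term) for term, flag in flags.items() if flag == 1}

# Hypothetical 0/1 entries for one document.
flags = {"inverter": 1, "solar": 1, "photovoltaic": 1, "panel": 0, "the": 0}
terms = indicative_terms(flags, thesaurus={"photovoltaic": "solar"})
# terms == {"inverter", "solar"}
```

The resulting set is small and fixed per document, so it can be computed once and stored, for example as rows of a (document, term) table in a relational database.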
This technique is most useful in cases where a novel set of results is generated as part of a user interaction. This is typically the case when a search query is issued: a set of results is returned that a user may want to explore and understand. Given these search results, which constitute a document set within the corpus, we can take the union of all the sets of “indicative terms” for those documents. These terms are sorted to produce an ordered list of the signature terms of the result set, and the top M of them are returned when requested.
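The query-time step can be sketched as follows. The text does not state the sort key, so this example assumes that terms are ranked by the number of result documents in which they are indicative, with ties broken alphabetically; the function name and the sample data are likewise hypothetical.

```python
from collections import Counter

def signature_terms(indicative_sets, result_ids, m):
    """Union the stored indicative-term sets of the result set and rank the terms."""
    counts = Counter()
    for doc_id in result_ids:
        # Each set contributes 1 per term: an integer sum, no floating point.
        counts.update(indicative_sets[doc_id])
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:m]]

# Pre-computed indicative terms per document (illustrative values).
indicative_sets = {
    "doc1": {"solar", "inverter", "winter"},
    "doc2": {"solar", "inverter", "fault"},
    "doc3": {"solar", "tariff"},
    "doc4": {"revenue", "forecast"},
}
top = signature_terms(indicative_sets, ["doc1", "doc2", "doc3"], m=3)
# top == ["solar", "inverter", "fault"]
```

Only the stored sets of the documents in the result set are read, so the work at query time does not depend on the full vocabulary of those documents.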
In contrast to the traditional calculation, this computation is of order q*(k+1) for a result set of size k and a threshold of q (the cutoff Q used in the approximation) on the number of entries considered per document. This dramatic reduction in computation enables functionality that requires real-time interactivity, such as third-generation business intelligence.
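As a worked illustration of the stated bound, with values chosen purely as assumptions (q = 20 and k = 500 do not come from the disclosure):

```latex
q(k+1) = qk + q, \qquad q = 20,\ k = 500 \;\Rightarrow\; 20 \times 501 = 10{,}020
```

The expression grows linearly in the size of the result set and contains no term for the length of the documents or the size of the vocabulary.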
In the present disclosure, the methods disclosed may be implemented as sets of instructions or software readable by a device. Further, it is understood that the specific order or hierarchy of steps in the methods disclosed are examples of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the method can be rearranged while remaining within the disclosed subject matter. The accompanying method claims present elements of the various steps in a sample order, and are not necessarily meant to be limited to the specific order or hierarchy presented.
It is believed that the present disclosure and many of its attendant advantages will be understood by the foregoing description, and it will be apparent that various changes may be made in the form, construction and arrangement of the components without departing from the disclosed subject matter or without sacrificing all of its material advantages. The form described is merely explanatory, and it is the intention of the following claims to encompass and include such changes.
Number | Date | Country |
---|---|---|
20100070495 A1 | Mar 2010 | US |