Claims
- 1. A method for analyzing and characterizing a database of electronically formatted natural language based documents comprising the steps of:
a) subjecting the database to a sequence of word filters to eliminate terms in the database which do not discriminate document content, resulting in a filtered word set whose members are highly predictive of content;
b) defining a subset of the filtered word set as the topic set, said topic set being characterized as the set of filtered words which best discriminate the content of the documents which contain them;
c) forming a two dimensional matrix with the words contained within the topic set defining one dimension of said matrix and the words contained within the filtered word set comprising the other dimension of said matrix;
d) calculating matrix entries as the conditional probability that a document in the database will contain each word in the topic set given that it contains each word in the filtered word set; and
e) providing said matrix entries as vectors to interpret the document contents of said database.
- 2. The method of claim 1 wherein said sequence of word filters comprises a frequency filter, a topicality filter, and an overlap filter.
- 3. The method of claim 2 wherein said topicality filter comprises the steps of:
a) calculating the expected distribution of each word contained in said database;
b) measuring the actual distribution of each word contained in said database;
c) expressing the ratio of said actual distribution to said expected distribution; and
d) defining said set of topic words as those which fall below a predetermined value of said ratio.
- 4. The method of claim 2 wherein said frequency filter comprises the steps of:
a) defining a predetermined upper and lower limit for the frequency of said words in the database;
b) determining the frequency of occurrence of each word contained in said database; and
c) further defining said set of topic words as those words whose frequency of occurrence in the database is above said predetermined lower limit and below said predetermined upper limit.
- 5. The method of claim 2 wherein said overlap filter comprises the steps of:
a) defining a preset limit for joint distribution of word pairs occurring within said database;
b) calculating the joint distribution of word pairs occurring within said database;
c) defining the set of word pairs whose joint distribution falls above said preset limit; and
d) further defining said set of topic words as not containing one of those words for each set of word pairs whose joint distribution falls above said preset limit.
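
The filtering and matrix steps recited in claims 1 and 3 to 5 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in the specification. All function names are hypothetical. Documents are assumed to be pre-tokenized lists of words; "frequency" is taken to be the total occurrence count; the "expected distribution" is taken to be the number of documents a word would appear in if its occurrences were scattered uniformly at random; and the "joint distribution" of a pair is taken to be the share of documents containing both words. The claims do not fix any of these choices.

```python
from collections import Counter
from itertools import combinations


def doc_sets(docs):
    """Map each word to the set of document indices that contain it."""
    index = {}
    for i, doc in enumerate(docs):
        for word in set(doc):
            index.setdefault(word, set()).add(i)
    return index


def frequency_filter(docs, lower, upper):
    """Keep words whose total count lies strictly between the two limits.

    Assumption: 'frequency' means total occurrences across the database.
    """
    counts = Counter(word for doc in docs for word in doc)
    return {w for w, c in counts.items() if lower < c < upper}


def topicality_ratio(docs, word):
    """Actual document frequency divided by the expected document frequency.

    Assumption: 'expected' is the number of documents that would contain the
    word if its k occurrences fell uniformly at random over the n documents,
    i.e. n * (1 - (1 - 1/n) ** k). The word must occur at least once.
    """
    n = len(docs)
    k = sum(doc.count(word) for doc in docs)
    actual = sum(1 for doc in docs if word in doc)
    expected = n * (1 - (1 - 1 / n) ** k)
    return actual / expected


def overlap_filter(docs, words, limit):
    """Drop one word from every pair whose joint document share exceeds limit.

    Assumption: 'joint distribution' is the fraction of documents that
    contain both words of the pair.
    """
    index = doc_sets(docs)
    kept = set(words)
    for a, b in combinations(sorted(words), 2):
        if a in kept and b in kept:
            joint = len(index[a] & index[b]) / len(docs)
            if joint > limit:
                kept.discard(b)  # arbitrary choice: drop the later word
    return kept


def conditional_matrix(docs, topics, filtered):
    """matrix[t][w] = P(document contains t | document contains w)."""
    index = doc_sets(docs)
    return {
        t: {w: len(index[t] & index[w]) / len(index[w]) for w in filtered}
        for t in topics
    }


# Toy database of four tokenized documents (made up for illustration).
docs = [
    ["reactor", "reactor", "fuel", "the"],
    ["reactor", "fuel", "the"],
    ["market", "stock", "the"],
    ["market", "the", "the"],
]
filtered = sorted(frequency_filter(docs, 1, 5))  # ['fuel', 'market', 'reactor']
topics = [w for w in filtered if topicality_ratio(docs, w) < 1.0]  # ['reactor']
matrix = conditional_matrix(docs, topics, filtered)
# matrix == {'reactor': {'fuel': 1.0, 'market': 0.0, 'reactor': 1.0}}
```

In this toy database, "the" is removed for being too frequent and "stock" for being too rare. "reactor" is the only word whose occurrences are concentrated in fewer documents than chance would predict, so it becomes the sole topic word. Each row of the resulting matrix can then be read as a vector over the filtered word set, as in step e) of claim 1.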
Government Interests
[0001] This invention was made with Government support under Contract DE-AC06-76RLO 1830 awarded by the U.S. Department of Energy. The Government has certain rights in the invention.
Continuations (3)
| | Number | Date | Country |
|---|---|---|---|
| Parent | 09455849 | Dec 1999 | US |
| Child | 10298361 | Nov 2002 | US |
| Parent | 09191004 | Nov 1998 | US |
| Child | 10298361 | Nov 2002 | US |
| Parent | 08713313 | Sep 1996 | US |
| Child | 10298361 | Nov 2002 | US |