The technical field relates generally to the analysis of documents including text. More particularly, the field pertains to the prioritization of documents based on their texts.
A portion of the disclosure of this patent document includes material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings attached hereto: Copyright © 2008, David P. Fan, All Rights Reserved.
In current methods, documents including text can be held in an electronic database. Methods have been developed for searching the database to retrieve a document set that is especially relevant to a topic area. A human user may desire to focus on documents including a plurality of ideas in the topic area.
Systems, methods, and structures are discussed to support a text analysis for the prioritization of documents in a document set. According to one embodiment the documents include texts divided into paragraphs. The system may include a document set, communication means that allows access to the documents, and a text analysis engine for generating a prioritized order for a document set.
In the following detailed description of exemplary embodiments of the invention, reference is made to the accompanying drawings which form a part hereof, and in which is shown, by way of illustration, specific exemplary embodiments in which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, logical, electrical, and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
One embodiment of the present invention includes a text analysis system for analyzing a document set and outputting a prioritized document order. The prioritized order may be based on a number of factors, including, but not limited to, a plurality of keywords and user defined criteria.
In an embodiment, document 212 includes text that includes at least one paragraph. Each paragraph may include one or more sentences that have a sentence-terminating punctuation mark. For example, the sentence-terminating punctuation mark may be a period, question mark, or an exclamation point. In one embodiment the sentence-terminating punctuation mark is preceded by a word that is at least three characters in length.
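The sentence rule described above can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_sentences` is hypothetical, and the sketch assumes that a "word" is a run of letters, digits, or underscores, and that any trailing text lacking a qualifying terminator is discarded.

```python
import re

# Assumed rule: a sentence ends at '.', '?' or '!' only when the word
# immediately before the mark is at least three characters long.
_SENTENCE_END = re.compile(r"\b\w{3,}[.?!]")


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph at qualifying sentence-terminating marks."""
    sentences = []
    start = 0
    for match in _SENTENCE_END.finditer(paragraph):
        sentences.append(paragraph[start:match.end()].strip())
        start = match.end()
    # Text after the last qualifying mark is dropped (an assumption).
    return sentences
```

Under this rule the period in an abbreviation such as "Dr." does not end a sentence, because the preceding word has only two characters.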
In another embodiment, a text search method is used to search for an idea in the text. In one embodiment, the text search method is to search for a keyword 214. A keyword 214 may include a word stem. A word stem may include the initial characters of a word. As an example, a word stem “metal” matches the initial five characters of each of the words “metal,” “metals,” and “metallic.” In one embodiment, a word stem is a complete word. As an example, a word stem may be the complete word “the” with no trailing characters permitted. In a further embodiment, the text search method may be a Boolean search method where the search condition may be a match to the condition “(buy and (paper or clip)).” In one embodiment, a text search method may be a match to a regular expression such as “buy|metal*.” In one embodiment, the text search condition may be a fuzzy logic search method. For the Boolean, regular expression, and fuzzy logic search methods the resultant matches may be considered keywords that may then be entered into the system.
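As a hedged illustration of word stem matching, the sketch below treats a word stem as a prefix anchored at the start of a word, with an optional whole-word mode for stems such as “the.” The function name `stem_matches`, the `whole_word` flag, and the choice of case-insensitive matching are assumptions made for this example.

```python
import re


def stem_matches(stem: str, text: str, whole_word: bool = False) -> list[str]:
    """Return the words in `text` that begin with `stem`.

    With whole_word=True, no trailing characters are permitted.
    """
    tail = r"\b" if whole_word else r"\w*"
    pattern = r"\b" + re.escape(stem) + tail
    return re.findall(pattern, text, flags=re.IGNORECASE)
```

For instance, `stem_matches("metal", "Metal, metals and metallic parts")` yields the three words beginning with "metal", while `stem_matches("the", "the theory of the thing", whole_word=True)` yields only the two occurrences of the complete word.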
In an embodiment, controller 230 may include control logic to control a text analysis process. The control may include an entering logic 232 to control the task of entering documents (e.g., document 212) into the electronic data storage system 210. Inputting logic 234 may be used to control the task of inputting user-specific conditions for text analysis. Analyzing logic 236 may control the text analysis. Outputting logic 238 may control the output of the text analysis.
In an example embodiment, document set 302 includes at least one document umbrella 310. Document umbrella 310 may include representations of an entire document. Document umbrella 310 may also include data members such as a document identifier 312, a combination count 314, one or more document presences 316, a filtered indicator 318, and a retained indicator 320. Document set 302 may also include at least one paragraph 330. Paragraph 330 may include data members such as a paragraph identifier 332, text 334, reference 336, one or more paragraph presences 338, and a presence count 340.
In an embodiment, document identifier 312 uniquely identifies a document. This may for example be a randomly generated number. It may also be the next available number in a sequence. The data member combination count 314 represents the combination count of a document. A combination count 314 may represent the number of distinct combinations of paragraph presences 338 associated with a document. Data members document presences 316 may represent the extent to which a keyword is present in a document. The filtered indicator 318 data member may represent the selection of a document based on a filter keyword. In an example embodiment, retained indicator 320 represents the retention of a document based on a threshold combination count that may be set by a user.
In an embodiment, data member paragraph identifier 332 uniquely identifies a paragraph 330. Data member text 334 may represent the text of a paragraph 330. In one embodiment data member reference 336 represents the document to which a paragraph belongs. For instance, reference 336 may be the same as a document identifier 312. Data members paragraph presences 338 may represent the presence of the word stem of a keyword in the text of a paragraph. In an embodiment, data member presence count 340 represents the count of presences 338 with non-zero values for a paragraph.
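One possible in-memory representation of the document umbrella and paragraph data members is sketched below. The class and field names are hypothetical; the trailing comments map each field to the reference numeral used in the description. The presence count is derived from the paragraph presences rather than stored, which is a design choice made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    paragraph_id: int                                    # paragraph identifier 332
    text: str                                            # text 334
    reference: int                                       # reference 336 (a document identifier)
    presences: list[int] = field(default_factory=list)   # paragraph presences 338

    @property
    def presence_count(self) -> int:                     # presence count 340
        """Count of paragraph presences with non-zero values."""
        return sum(1 for p in self.presences if p != 0)


@dataclass
class DocumentUmbrella:
    document_id: int                                     # document identifier 312
    combination_count: int = 0                           # combination count 314
    presences: list[int] = field(default_factory=list)   # document presences 316
    filtered: int = 0                                    # filtered indicator 318
    retained: int = 0                                    # retained indicator 320
```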
In an embodiment, a keyword identifier 406 uniquely identifies a keyword 404. Data member word stem 408 may represent a word stem of a keyword 404. Data member target count 410 may represent the number of instances of the word stem 408 of a keyword 404 in a user specified target text. Data member document count 412 may represent the number of documents (e.g., data member 310) with the word stem 408 of a keyword 404. Data member worth 414 may represent the worth of a keyword 404. In an example embodiment, worth 414 may be the equivalent of target count 410 divided by document count 412, if the document count 412 is non-zero. Data member selected indicator 416 may represent whether or not the keyword 404 has been selected. For instance, a user may indicate that only the four highest worth keywords should be selected. If a keyword is in the top four of worth levels, then the selected indicator 416 may be set to a value indicating the keyword is selected (e.g., ‘1’).
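The keyword data members and the selection of the highest-worth keywords might be sketched as follows. The names `Keyword` and `mark_selected` are hypothetical, and the sketch assumes that ties in worth are broken by the original order of the keywords, a detail the description leaves open.

```python
from dataclasses import dataclass


@dataclass
class Keyword:
    keyword_id: int          # keyword identifier 406
    word_stem: str           # word stem 408
    target_count: int = 0    # target count 410
    document_count: int = 0  # document count 412
    worth: float = 0.0       # worth 414
    selected: int = 0        # selected indicator 416


def mark_selected(keywords: list[Keyword], top_n: int = 4) -> None:
    """Set selected=1 on the top_n keywords by worth and 0 on the rest."""
    # sorted() is stable, so equal worths keep their input order.
    ranked = sorted(keywords, key=lambda k: k.worth, reverse=True)
    chosen = {k.keyword_id for k in ranked[:top_n]}
    for k in keywords:
        k.selected = 1 if k.keyword_id in chosen else 0
```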
At block 804, in one embodiment, a plurality of keywords is generated from a target text. In an embodiment, there may be the generation of text analysis conditions from the target text. The target text may be a user selected set of words or paragraphs that the system uses to generate a set of keywords. The target text may be a representative paragraph that has many of the words or phrases that are important to the user in the eventual prioritization of a document set. This target text may, for example, be a patent claim. In one embodiment, the generation of text analysis conditions from the target text is omitted. In one embodiment, text analysis conditions are obtained through a process external to the processes of the present invention. In one embodiment, text analysis conditions are specified by a user. The conditions may include excluding one or more word stems from the plurality of keywords. The process of generating keywords is described in more detail below with reference to
At block 806, in an embodiment, the documents are filtered according to a filter keyword. The filter keyword may be selected from the plurality of keywords, the selecting including comparing the target count of each keyword in the plurality of keywords to a filter count. In an embodiment, a document set is filtered to only include documents that include the filter keyword. The filter count may be based on a preference of the user of the system. The count may be a number that the user may use to limit the number of documents that are ultimately prioritized. An example is described more fully below with reference to block 910 and
At block 808, the plurality of documents is prioritized into a prioritized document order according to combination counts. As described above, a combination count is the number of distinct combinations of paragraph presences for a document. A user may also specify a threshold combination count such that no documents below the threshold combination count are prioritized. The process of prioritization is more fully described below with reference to
At block 810, the prioritized document order is outputted. This output may be presented to a user in a variety of forms, including, but not limited to, a list of document identifiers or the full document text. In an embodiment, the paragraphs in the document set are placed in a prioritized paragraph order irrespective of the documents to which the paragraphs belong. In an embodiment, the prioritized document order is displayed to a user through a web browser. In one embodiment, the paragraphs are ordered based on at least one paragraph member. In one embodiment, the paragraphs are ordered by paragraph presence count. Further details regarding the presentation of a prioritized output may be found in U.S. patent application Ser. No. 11/275,947, entitled “POPULATION ANALYSIS USING LINEAR DISPLAYS”, filed on Feb. 6, 2006, the contents of which are hereby incorporated by reference for all purposes.
In one embodiment, at block 904, a target count 410 is entered for each keyword. The target count corresponds to the number of instances of the word stem 408 in the target text. Using example entry 702 again, a value of ‘3’ is entered for K.TargetCount for the word stem “buy” because the example target text has 3 instances of the word stem “buy.”
In an embodiment, at block 906, a document count 412 is entered for each keyword. This may include scanning the texts of all paragraphs of a document for the presence of at least one instance of a word stem of a keyword. The document count 412 may reflect the count of documents with at least one paragraph with at least one instance of the word stem. Looking at entry 704 in
In an embodiment, at block 908 a worth is entered for each keyword. A zero may be entered if a document count corresponding to the word stem is zero. If a document count is not zero, the worth value may be equivalent to the target count divided by the document count. Referring to example entry 702, the K.Worth value is equal to 1 because K.TargetCount and K.DocumentCount are the same (3/3=1). Example entry 704 shows a K.Worth of 0.04 (2/50) and example entry 706 has a K.Worth value of 0 because K.DocumentCount is 0.
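The target count, document count, and worth computations of blocks 904 through 908 can be sketched together. The helper names `count_stem` and `keyword_stats` are hypothetical, the sketch assumes case-insensitive prefix matching at word starts, and it assumes each document is given as a list of paragraph strings.

```python
import re


def count_stem(stem: str, text: str) -> int:
    """Count the words in `text` that begin with `stem`."""
    pattern = r"\b" + re.escape(stem) + r"\w*"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def keyword_stats(stem: str, target_text: str,
                  documents: list[list[str]]) -> tuple[int, int, float]:
    """Return (target count, document count, worth) for one word stem."""
    target_count = count_stem(stem, target_text)
    # A document counts once if any of its paragraphs contains the stem.
    document_count = sum(
        1 for paragraphs in documents
        if any(count_stem(stem, p) > 0 for p in paragraphs)
    )
    # Worth is zero when no document contains the stem.
    worth = target_count / document_count if document_count else 0.0
    return target_count, document_count, worth
```

With a target count of 3 and a document count of 3 the worth is 1, and with a target count of 2 and a document count of 50 the worth is 0.04, matching the arithmetic given for example entries 702 and 704.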
In yet another embodiment, at block 910 a filter keyword is selected. The user may specify a desired filter count corresponding to the desired number of documents to be analyzed. A filter keyword may be the keyword with a document count closest to the user specified desired filter count. As an example, with reference to example table 700, a user may specify a desired filter count of 40. Entry 704 with K.ID=4 and K.WordStem=“clip” may be selected as the filter keyword because the K.DocumentCount=50 is the value in the column closest to the desired filter count of 40.
Once the filter keyword is selected, the filtered indicator 318 can be set as a ‘1’ or a ‘0.’ A ‘1’ may be entered if at least one paragraph of a document has at least one word stem matching the word stem of the filter keyword. Otherwise, a ‘0’ may be entered. For example, in
Referring back to
Consider example table 600 and row 602 representing the paragraph with the text “Buy a clip.” The paragraph may be scanned for the presence of the word stem “buy” corresponding to the word stem “buy” in row 702 of table 700. Because “buy” is present, a ‘1’ may be entered into the paragraph presence indexed to the selected keyword buy, in this case P.Pres—1. The process may be repeated iteratively for each of the selected keywords and for each of the paragraphs in the document set.
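The scan for paragraph presences can be sketched as a list of 0/1 flags, one per selected keyword. The function name `paragraph_presences` and the particular list of selected stems are assumptions for this example, as is the use of case-insensitive matching (which is what allows “Buy” to match the stem “buy”).

```python
import re


def paragraph_presences(text: str, selected_stems: list[str]) -> list[int]:
    """Return 1 for each selected stem present in the text, else 0."""
    return [
        1 if re.search(r"\b" + re.escape(stem), text, flags=re.IGNORECASE) else 0
        for stem in selected_stems
    ]


# Hypothetical selected stems; only "buy" and "clip" occur in the paragraph.
presences = paragraph_presences("Buy a clip.", ["buy", "paper", "metal", "clip"])
```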
Referring back to
At block 1006, a combination count is entered for each document. A combination count is the number of distinct combinations of paragraph presences 338 for all the paragraphs of a document wherein at least one paragraph presence 338 is non-zero. As an example, the document in example row 504 in table 500 is represented by example rows 602, 604, 606, and 608 in table 600. There are three distinct combinations of paragraph presences because example rows 606 and 608 have the same combination of paragraph presences 1, 3, and 4 in presence columns 338. Therefore a combination count value of three may be entered into row 504.
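A combination count can be computed by collecting the distinct presence patterns in a set while skipping paragraphs with no presences at all. The sketch below uses the hypothetical name `combination_count`; the presence values for the first two rows are invented for illustration, and only the last two rows (presences 1, 3, and 4) follow the example above.

```python
def combination_count(paragraph_presences: list[list[int]]) -> int:
    """Count distinct non-empty combinations of paragraph presences."""
    combos = {
        tuple(1 if p else 0 for p in row)   # normalize to a 0/1 pattern
        for row in paragraph_presences
        if any(row)                          # ignore all-zero paragraphs
    }
    return len(combos)


rows = [
    [1, 0, 0, 1],  # hypothetical values
    [1, 1, 1, 0],  # hypothetical values
    [1, 0, 1, 1],  # presences 1, 3 and 4
    [1, 0, 1, 1],  # same combination as the previous row
]
count = combination_count(rows)  # three distinct combinations
```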
At block 1008, a retained indicator value is entered for each document. In an embodiment, a user may specify a minimum combination count. A value of one may be entered into the retained indicator data member when the filtered indicator has the value of one and at least one paragraph of the document has a presence count greater than or equal to the minimum combination count. A value of zero may be entered for all other documents. As an example, a user may specify three to be the minimum combination count. Row 504 has a retained indicator 320 value of one because filtered indicator 318 has the value of one and at least one of the rows 602, 604, 606, and 608 belonging to document 504 has a presence count 340 greater than or equal to three. The three rows matching this condition are 604, 606, and 608.
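The retention test described above reduces to a single Boolean condition. In this sketch the function name `retained_indicator` is hypothetical, and the presence counts used in the usage line are illustrative values chosen so that three of the four paragraphs meet a minimum of three.

```python
def retained_indicator(filtered: int, presence_counts: list[int],
                       minimum_combination_count: int) -> int:
    """Return 1 if the document passed the filter and at least one paragraph
    has a presence count at or above the user-specified minimum, else 0."""
    meets_minimum = any(c >= minimum_combination_count for c in presence_counts)
    return 1 if filtered == 1 and meets_minimum else 0


# Illustrative presence counts for four paragraphs of one document.
retained = retained_indicator(1, [2, 3, 3, 3], 3)
```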
At block 1010, document presences are entered for each document. In an embodiment, a document presence data member is computed from the paragraphs belonging to the document. The value of a document presence may be the sum of the corresponding paragraph presences for all paragraphs belonging to the document. As an example, consider example tables 500 and 600. Example row 504, representing document 1, has a value of four for document presence 1 (represented as D.Pres—1) because the sum is four for the P.Pres—1 values for the four rows, 602, 604, 606, and 608 which correspond to the paragraphs for document 1.
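Summing paragraph presences column by column gives the document presences. A minimal sketch, with the hypothetical name `document_presences` and illustrative presence rows in which every paragraph contains the first keyword:

```python
def document_presences(paragraph_presences: list[list[int]]) -> list[int]:
    """Sum each presence column over all paragraphs of one document."""
    return [sum(column) for column in zip(*paragraph_presences)]


# Illustrative rows for four paragraphs; the first column sums to four.
rows = [
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 1],
]
totals = document_presences(rows)
```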
At block 1012, in an embodiment, the documents are ordered. This may include, as a prioritized document set, all documents with document umbrellas with retained indicator 320 values equal to ‘1’. The prioritized document set may be ordered hierarchically beginning with the combination count followed by each document presence n 316 with n ordered by worth 414. As an example, example table 500 in
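The hierarchical ordering of block 1012 can be sketched as a multi-key sort. The function name `prioritize` and the dictionary layout are hypothetical, and the sketch assumes that larger values rank first at every level, a direction the description does not state explicitly.

```python
def prioritize(documents: list[dict], worths: list[float]) -> list[int]:
    """Return identifiers of retained documents in prioritized order.

    Each document is a dict with keys "id", "combination_count",
    "presences" (one value per keyword) and "retained".
    `worths` holds the worth of each keyword, in the same order as "presences".
    """
    # Keyword positions ordered from highest worth to lowest.
    by_worth = sorted(range(len(worths)), key=lambda n: worths[n], reverse=True)

    def sort_key(doc: dict) -> tuple:
        # Combination count first, then presences in order of keyword worth.
        return (doc["combination_count"],
                *(doc["presences"][n] for n in by_worth))

    retained = [d for d in documents if d["retained"] == 1]
    return [d["id"] for d in sorted(retained, key=sort_key, reverse=True)]
```

Documents with equal combination counts are separated by the presence of the highest-worth keyword, then the next highest, and so on.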
In the example embodiments discussed above, various modifications can be made without departing from the scope of the inventive subject matter.
Systems, methods, and structures have been discussed for prioritizing documents by combinations of keywords in paragraphs. Whereas prior ordering methods do not include the combinations, the embodiments discussed hereinbefore do include such combinations thereby providing a superior method of prioritizing documents.
Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is intended to cover any adaptations or variations of the inventive subject matter. It is to be understood that the above description is intended to be illustrative, and not restrictive. Combinations of the above embodiments and other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the inventive subject matter includes any other applications in which the above structures and fabrication methods are used. Accordingly, the scope of the claimed inventive subject matter should only be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
This patent application claims the benefit of priority, under 35 U.S.C. Section 119(e), to U.S. Provisional Patent Application Ser. No. 60/962,162, filed on Jul. 27, 2007, and to U.S. Provisional Patent Application Ser. No. 61/000,988, filed on Oct. 30, 2007, which applications are incorporated herein by reference in their entirety.
Number | Date | Country
---|---|---
60962162 | Jul 2007 | US
61000988 | Oct 2007 | US