Claims
- 1. A method for understanding and decomposing a document, the method comprising:
utilizing at least one of the following algorithms to understand and decompose the document: one or more pre-processing algorithms; one or more token identification algorithms; one or more token type identification algorithms; one or more column count identification algorithms; one or more column boundary identification algorithms; one or more column type identification algorithms; one or more token-to-column assignment algorithms; and one or more line merging algorithms, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, and no pre-created scripts are required to map contents of the document.
- 2. The method of claim 1, wherein the method is performed automatically by a computer system.
- 3. The method of claim 1, wherein the document comprises tabular information.
- 4. The method of claim 1, wherein the document comprises at least one of: an ASCII text document, an EBCDIC text document, a spreadsheet, a PDF file, a PostScript file, and an HTML document.
- 5. The method of claim 1, wherein the document comprises a financial statement.
- 6. The method of claim 5, wherein the financial statement comprises at least one of: a balance sheet, an income statement, and a cash flow statement.
- 7. The method of claim 1, wherein the document comprises an electronic document.
- 8. The method of claim 7, wherein the electronic document is obtained electronically via at least one of: the Internet, an electronic mail message, an intranet, an extranet, and a scanner.
- 9. The method of claim 1, wherein the one or more pre-processing algorithms comprise at least one of:
removing anomalous characters from the file and replacing at least some of the anomalous characters with other characters that will not change the meaning of the document; removing dollar signs; replacing tab characters with a predetermined number of spaces; removing sequences of multiple underscores; removing sequences of multiple periods; removing characters having non-ASCII values; and replacing runs of one or two dashes with a zero.
- 10. The method of claim 1, wherein the one or more token identification algorithms comprise at least one of:
identifying, as tokens, strings of non-space characters having no more than two consecutive internal space characters; identifying textual elements for each row of text that are a predetermined number of spaces from a left or right non-space neighbor; skipping single tokens that comprise only a “$” character; and establishing a predetermined white space threshold via statistical evaluation of the distribution of white space markers throughout the document.
- 11. The method of claim 1, wherein the one or more token type identification algorithms comprise:
identifying the token type as at least one of: numeric, text, and date.
- 12. The method of claim 1, wherein the one or more column count identification algorithms comprise:
determining a statistical average of the population of tokens in each row.
- 13. The method of claim 1, wherein the one or more column boundary identification algorithms comprise at least one of:
sequentially positioning the tokens within the columns identified by the one or more column count identification algorithms; establishing a start point of each column; establishing an end point of each column; and extending the start point and the end point of each column proportionately to the size of the columns to accommodate gaps between columns.
- 14. The method of claim 1, wherein the one or more column type identification algorithms comprise:
assigning default column types to columns in the document.
- 15. The method of claim 1, wherein the one or more token-to-column assignment algorithms comprise:
assigning each token to one or more columns based on the boundaries of the columns within which the token falls and adjusting the token assignments as necessary to accommodate tokens that span multiple cells.
- 16. The method of claim 1, wherein the one or more line merging algorithms comprise:
utilizing natural language processing to combine multiple tokens in consecutive rows that should actually be a single token.
- 17. A system for understanding and decomposing a document, the system comprising:
a means for utilizing at least one of the following algorithms to understand and decompose the document: one or more pre-processing algorithms; one or more token identification algorithms; one or more token type identification algorithms; one or more column count identification algorithms; one or more column boundary identification algorithms; one or more column type identification algorithms; one or more token-to-column assignment algorithms; and one or more line merging algorithms, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, and no pre-created scripts are required to map contents of the document.
- 18. The system of claim 17, wherein a computer system is used to automatically understand and decompose the document.
- 19. The system of claim 17, wherein the document comprises tabular information.
- 20. The system of claim 17, wherein the document comprises at least one of: an ASCII text document, an EBCDIC text document, a spreadsheet, a PDF file, a PostScript file, and an HTML document.
- 21. The system of claim 17, wherein the document comprises a financial statement.
- 22. The system of claim 21, wherein the financial statement comprises at least one of: a balance sheet, an income statement, and a cash flow statement.
- 23. The system of claim 17, wherein the document comprises an electronic document.
- 24. The system of claim 23, wherein the electronic document is obtained electronically via at least one of: the Internet, an electronic mail message, an intranet, an extranet, and a scanner.
- 25. The system of claim 17, wherein the one or more pre-processing algorithms comprise at least one of:
removing anomalous characters from the file and replacing at least some of the anomalous characters with other characters that will not change the meaning of the document; removing dollar signs; replacing tab characters with a predetermined number of spaces; removing sequences of multiple underscores; removing sequences of multiple periods; removing characters having non-ASCII values; and replacing runs of one or two dashes with a zero.
- 26. The system of claim 17, wherein the one or more token identification algorithms comprise at least one of:
identifying, as tokens, strings of non-space characters having no more than two consecutive internal space characters; identifying textual elements for each row of text that are a predetermined number of spaces from a left or right non-space neighbor; skipping single tokens that comprise only a “$” character; and establishing a predetermined white space threshold via statistical evaluation of the distribution of white space markers throughout the document.
- 27. The system of claim 17, wherein the one or more token type identification algorithms comprise:
identifying the token type as at least one of: numeric, text, and date.
- 28. The system of claim 17, wherein the one or more column count identification algorithms comprise:
determining a statistical average of the population of tokens in each row.
- 29. The system of claim 17, wherein the one or more column boundary identification algorithms comprise at least one of:
sequentially positioning the tokens within the columns identified by the one or more column count identification algorithms; establishing a start point of each column; establishing an end point of each column; and extending the start point and the end point of each column proportionately to the size of the columns to accommodate gaps between columns.
- 30. The system of claim 17, wherein the one or more column type identification algorithms comprise:
assigning default column types to columns in the document.
- 31. The system of claim 17, wherein the one or more token-to-column assignment algorithms comprise:
assigning each token to one or more columns based on the boundaries of the columns within which the token falls and adjusting the token assignments as necessary to accommodate tokens that span multiple cells.
- 32. The system of claim 17, wherein the one or more line merging algorithms comprise:
utilizing natural language processing to combine multiple tokens in consecutive rows that should actually be a single token.
- 33. A method for understanding and decomposing a document, the method comprising:
preprocessing text in the document; identifying a physical layout of the document by establishing tokens; characterizing the tokens in the document as at least one of: numeric, text, and date; establishing a column count of the number of columns in the document; establishing column boundaries for each column; establishing a column type for each column; assigning tokens to a column; identifying spanning tokens; identifying wrapping lines; identifying a table construct and a relationship between the tokens and table cells; identifying special rows and special cells in the document; identifying a logical layout of the document; interpreting text in the document; and applying validation rules to verify that totals and subtotals are correct.
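By way of non-limiting illustration only, and not as part of any claim, several of the steps recited above (pre-processing, token identification, token type identification, and column count identification) might be sketched in Python as follows. Every name in the sketch is hypothetical. The tab width of eight, the two date formats, the use of a rounded arithmetic mean as the "statistical average," the replacement of removed characters with spaces so that character offsets are preserved, and the restriction of the dash-to-zero rule to dashes that stand alone are all assumptions made for this example rather than details taken from the claims.

```python
import re
import statistics
from datetime import datetime

TAB_WIDTH = 8          # assumed "predetermined number of spaces" per tab
MAX_INTERNAL_GAP = 2   # a token may contain runs of at most two spaces

# A token: non-space text joined by gaps of 1..MAX_INTERNAL_GAP spaces.
TOKEN_RE = re.compile(r"\S+(?: {1,%d}\S+)*" % MAX_INTERNAL_GAP)
NUMERIC_RE = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?%?")
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")  # assumed formats, for illustration


def _blank(match):
    """Replace a match with spaces of equal length to keep offsets stable."""
    return " " * len(match.group())


def preprocess(line):
    """Apply the clean-up steps of claim 9 to one line of text."""
    line = line.replace("\t", " " * TAB_WIDTH)   # tabs -> fixed spaces
    line = re.sub(r"[^\x00-\x7F]", " ", line)    # drop non-ASCII characters
    line = line.replace("$", " ")                # drop dollar signs
    line = re.sub(r"_{2,}", _blank, line)        # drop runs of underscores
    line = re.sub(r"\.{2,}", _blank, line)       # drop runs of periods
    # Assumption: only a stand-alone run of one or two dashes becomes a zero,
    # so that a leading minus sign such as "-5" is left untouched.
    line = re.sub(r"(?<!\S)-{1,2}(?!\S)",
                  lambda m: "0".rjust(len(m.group())), line)
    return line


def tokenize(line):
    """Return (start_offset, text) pairs, skipping tokens that are only '$'."""
    return [(m.start(), m.group())
            for m in TOKEN_RE.finditer(line) if m.group() != "$"]


def token_type(text):
    """Classify a token as 'numeric', 'date', or 'text'."""
    if NUMERIC_RE.fullmatch(text):
        return "numeric"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    return "text"


def column_count(token_rows):
    """Estimate the column count as the rounded mean tokens per non-empty row."""
    counts = [len(row) for row in token_rows if row]
    return round(statistics.mean(counts)) if counts else 0


SAMPLE = [
    "Item                 2002       2001",
    "Cash                $1,200     $  950",
    "Accounts receivable    300         --",
    "Total assets ....... 1,500        950",
]

rows = [tokenize(preprocess(line)) for line in SAMPLE]
for row in rows:
    print([(text, token_type(text)) for _, text in row])
print("estimated columns:", column_count(rows))
```

Because removed characters are replaced with spaces in this sketch, the start offsets returned by `tokenize` still refer to positions in the original line, which is what a later column boundary step would need. A different embodiment could delete the characters outright and track offsets some other way.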
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This invention is related to commonly-owned, co-pending U.S. patent application Ser. No. ______, entitled “Automated Understanding, Extraction and Structured Reformatting of Information in Electronic Files,” filed herewith on Mar. 27, 2003, which is hereby incorporated in full by reference. This invention is also related to commonly-owned, co-pending U.S. patent application Ser. No. ______, entitled “Mathematical Decomposition of Table-Structured Electronic Documents,” filed herewith on Mar. 27, 2003, which is also hereby incorporated in full by reference.