Claims
- 1. A method for automatically understanding a document, the method comprising:
utilizing algorithms to automate the understanding of a document, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, and no pre-created scripts are required to map contents of the document.
- 2. The method of claim 1, wherein the algorithms comprise table decomposition algorithms, financial aspect identification algorithms, mathematical structure decomposition algorithms, accounting categorization algorithms, and validation algorithms.
- 3. The method of claim 2, wherein the table decomposition algorithms comprise algorithms for performing at least one of the following: token identification, token type identification, column count identification, column boundary identification, column type identification, token-to-column assignment, and line merging.
- 4. The method of claim 3, wherein the token identification comprises utilizing spacing information between words to identify which words should be grouped together as a single portion of the table.
- 5. The method of claim 3, wherein the token type identification comprises using special characters and alphanumeric combinations to determine whether the token represents text, a number, or a date.
- 6. The method of claim 3, wherein the column count identification comprises identifying an appropriate number of columns in the document based on statistical measures of a token count per row.
- 7. The method of claim 3, wherein the column boundary identification comprises identification of suitable column boundaries based on right-most and left-most position of all tokens assigned to each column.
- 8. The method of claim 3, wherein the column type identification comprises assigning a column type to each column based on a frequency of each token type within each column.
- 9. The method of claim 3, wherein the token-to-column assignment comprises assigning tokens from each row to their respective columns based on their sequential position within the row and their proximity to other tokens.
- 10. The method of claim 3, wherein the line merging comprises using key separator words to identify wrapping lines.
- 11. The method of claim 2, wherein the financial aspect identification algorithms comprise algorithms for performing at least one of the following: identification of date periods for the document, identification of audited/un-audited status, and identification of dollar units in the documents.
- 12. The method of claim 11, wherein the identification of date periods for the document comprises utilizing a set of heuristics to interrogate date portions throughout the document to assemble a picture of the date periods covered by each column in the document.
- 13. The method of claim 11, wherein the identification of audited/un-audited status comprises searching the document for key phrases that indicate whether or not the financial statement has been audited.
- 14. The method of claim 11, wherein the identification of dollar units in the documents comprises identifying key word patterns that indicate the dollar units in the document.
- 15. The method of claim 2, wherein the mathematical structure decomposition algorithms comprise algorithms for performing at least one of the following: table boundary identification, total identification, and subtotal identification.
- 16. The method of claim 15, wherein the table boundary identification comprises identifying key word patterns and mathematical relationships that identify a start and an end of the table.
- 17. The method of claim 15, wherein the total identification comprises identifying word patterns that indicate relevant totals of the document.
- 18. The method of claim 15, wherein the subtotal identification comprises at least one of the following: identifying lines that indicate subtotals, identifying lines that have no line item description, and identifying lines that are mathematical compositions of other line items within the document.
- 19. The method of claim 2, wherein the accounting categorization algorithms comprise algorithms for performing at least one of the following: hierarchy matching and assignment of the line items to accounting categories.
- 20. The method of claim 19, wherein the hierarchy matching comprises splitting the document into its hierarchical parts by using word patterns to identify key segments.
- 21. The method of claim 19, wherein the assignment of the line items to accounting categories comprises using a line item description and a row position related to a hierarchy header to determine a suitable categorization for each line item.
- 22. The method of claim 2, wherein the validation algorithms comprise algorithms for performing validation utilizing at least one of the following: generally accepted accounting principles (GAAP) and historical trends.
- 23. The method of claim 22, wherein validation comprises ensuring that the summation of the line items assigned to a given category equals a total given for that category.
- 24. The method of claim 1, wherein the steps are performed automatically by a computer system.
- 25. A method for understanding a document and converting it into an intermediate structured representation of the information contained therein, the method comprising:
obtaining a document; utilizing algorithms to automatically understand the document; and creating an intermediate structured representation of the information contained therein from the extracted information, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, no pre-created scripts are required to map contents of the document, and the intermediate structured representation of the information is capable of being exchanged across diverse hardware, operating systems and applications.
- 26. The method of claim 25, wherein the steps are performed automatically by a computer system.
- 27. The method of claim 25, wherein the algorithms used to automatically understand the document are capable of:
analyzing information contained in the document; decomposing the information contained in the document; extracting the decomposed information; categorizing the decomposed information; and validating the decomposed information.
- 28. The method of claim 27, wherein the steps are performed automatically by a computer system.
- 29. The method of claim 25, further comprising:
converting the intermediate structured representation of the information into a format capable of being used in one or more target systems.
- 30. The method of claim 29, wherein the converting step comprises utilizing an ETL tool to convert the intermediate structured representation of the information into a format capable of being used in one or more target systems.
- 31. The method of claim 25, wherein the document that is obtained is in the form of at least one of: an ASCII text document, an EBCDIC text document, a spreadsheet, a PDF file, a PostScript file, and an HTML document.
- 32. The method of claim 25, wherein the document that is obtained comprises a financial statement.
- 33. The method of claim 32, wherein the financial statement comprises at least one of: a balance sheet, an income statement, and a cash flow statement.
- 34. The method of claim 25, wherein the document that is obtained comprises an electronic document.
- 35. The method of claim 34, wherein the electronic document is obtained electronically via at least one of: the Internet, an electronic mail message, an intranet, an extranet, and a scanner.
- 36. The method of claim 25, wherein the method is utilized to analyze at least one of: a company's financial health and the integrity of the financial statement.
- 37. The method of claim 25, wherein the document that is obtained comprises tabular information.
- 38. A system for understanding a document and converting it into an intermediate structured representation of the information contained therein, the system comprising:
a means for obtaining a document; a means for utilizing algorithms to automatically understand the document; and a means for creating an intermediate structured representation of the information contained therein from the extracted information, wherein no prior identification of a document type is required, no prior identification of an expected format for the document type is required, no pre-created scripts are required to map contents of the document, and the intermediate structured representation of the information is capable of being exchanged across diverse hardware, operating systems and applications.
- 39. The system of claim 38, wherein the steps are performed automatically by a computer system.
- 40. The system of claim 38, wherein the means for utilizing algorithms to automatically understand the document further comprises:
a means for analyzing information contained in the document; a means for decomposing the information contained in the document; a means for extracting the decomposed information; a means for categorizing the decomposed information; and a means for validating the decomposed information.
- 41. The system of claim 40, wherein the steps are performed automatically by a computer system.
- 42. The system of claim 38, further comprising:
a means for converting the intermediate structured representation of the information into a format capable of being used in one or more target systems.
- 43. The system of claim 42, wherein the means for converting the intermediate structured representation of the information into a format capable of being used in one or more target systems comprises utilizing an ETL tool to convert the intermediate structured representation of the information into a format capable of being used in one or more target systems.
- 44. The system of claim 38, wherein the document that is obtained is in the form of at least one of: an ASCII text document, an EBCDIC text document, a spreadsheet, a PDF file, a PostScript file, and an HTML document.
- 45. The system of claim 38, wherein the document that is obtained comprises a financial statement.
- 46. The system of claim 45, wherein the financial statement comprises at least one of: a balance sheet, an income statement, and a cash flow statement.
- 47. The system of claim 38, wherein the document that is obtained comprises an electronic document.
- 48. The system of claim 47, wherein the electronic document is obtained electronically via at least one of: the Internet, an electronic mail message, an intranet, an extranet, and a scanner.
- 49. The system of claim 38, wherein the system is utilized to analyze at least one of: a company's financial health and the integrity of the financial statement.
- 50. The system of claim 38, wherein the document that is obtained comprises tabular information.
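By way of illustration only, and without limiting the claims, the following Python sketch shows one possible reading of several of the recited steps: token identification from spacing (claim 4), token type identification (claim 5), column count identification (claim 6), column type identification (claim 8), and summation-based validation (claim 23). Every function name, regular expression, and threshold here (`tokenize`, `token_type`, `column_count`, `column_types`, `validate_total`, the two-space gap, the use of the statistical mode, the rounding tolerance) is an assumption introduced for this example; the claims do not specify them.

```python
import re
from collections import Counter
from statistics import mode

# Hypothetical patterns; the claims do not prescribe the exact rules.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")
NUMBER_RE = re.compile(r"^\(?-?\$?\d[\d,]*(\.\d+)?\)?%?$")


def tokenize(line, min_gap=2):
    """Group words into tokens: a run of min_gap or more spaces ends a token."""
    parts = re.split(r"\s{%d,}" % min_gap, line.strip())
    return [p for p in parts if p]


def token_type(token):
    """Classify a token as 'date', 'number', or 'text' from its characters."""
    if DATE_RE.match(token):
        return "date"
    if NUMBER_RE.match(token):
        return "number"
    return "text"


def column_count(rows):
    """Pick the column count as the most common token count per row (assumed measure)."""
    return mode([len(row) for row in rows])


def column_types(rows, n_columns):
    """Assign each column the token type that occurs most often in it."""
    counters = [Counter() for _ in range(n_columns)]
    for row in rows:
        if len(row) != n_columns:
            continue  # rows with a different token count are skipped in this sketch
        for index, token in enumerate(row):
            counters[index][token_type(token)] += 1
    return [c.most_common(1)[0][0] if c else "text" for c in counters]


def parse_number(token):
    """Convert a numeric token to float; parentheses denote a negative value."""
    negative = token.startswith("(") and token.endswith(")")
    cleaned = token.strip("()").replace("$", "").replace(",", "").rstrip("%")
    value = float(cleaned)
    return -value if negative else value


def validate_total(line_items, stated_total, tolerance=0.005):
    """Check that the line items sum to the stated total within a tolerance."""
    return abs(sum(line_items) - stated_total) <= tolerance


# Made-up sample lines, for illustration only.
lines = [
    "Cash and equivalents      1,200      1,000",
    "Accounts receivable         800        750",
    "Total current assets      2,000      1,750",
]
rows = [tokenize(line) for line in lines]
n = column_count(rows)
types = column_types(rows, n)
first_period = [parse_number(row[1]) for row in rows]
is_valid = validate_total(first_period[:-1], first_period[-1])
```

In this sketch the column count is taken as the mode of the per-row token counts, which is only one of many statistical measures that could satisfy claim 6, and the validation check treats the last row as the stated total for the rows above it.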
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This invention is related to commonly-owned, co-pending U.S. patent application Ser. No. ______, entitled “Automated Understanding and Decomposition of Table-Structured Electronic Documents,” filed herewith, which is hereby incorporated in full by reference. This invention is also related to commonly-owned, co-pending U.S. patent application Ser. No. ______, entitled “Mathematical Decomposition of Table-Structured Electronic Documents,” filed herewith, which is also hereby incorporated in full by reference.