Claims
- 1. A document processing method, comprising:
processing documents derived from at least one of spontaneous and conversational expression and containing non-deterministic text with average word recognition precision below 50 percent, said processing utilizing non-textual differences between n-word sequences in the documents to resolve more than two decision options, where n is a positive integer.
- 2. A method as recited in claim 1, wherein said processing includes data mining of the documents.
- 3. A method as recited in claim 2, wherein said data mining includes retrieving at least one of the documents utilizing the non-textual differences between the n-word sequences in the documents.
- 4. A method as recited in claim 2, wherein said data mining includes extracting parameters from the documents, utilizing the non-textual differences between said n-word sequences.
- 5. A method as recited in claim 4, wherein said data mining further includes producing graphic results indicating the relations between the parameters extracted from the documents.
- 6. A method as recited in claim 4, wherein at least one of the parameters extracted from the documents is an assessment of relevance to a query based on the non-textual differences between the n-word sequences.
- 7. A method as recited in claim 4, wherein at least one of the extracted parameters is an assessment of a hidden variable that cannot be fully determined from information existing in the document.
- 8. A method as recited in claim 4, wherein at least one of the extracted parameters is the assessment of the document's relevance to a category.
- 9. A method as recited in claim 2, wherein said processing includes categorizing the documents.
- 10. A method as recited in claim 9, wherein said categorizing includes use of at least one algorithm to detect salient terms in the documents based on non-linguistic differences between the n-word sequences.
- 11. A method as recited in claim 2, further comprising clustering the documents.
- 12. A method as recited in claim 11, wherein said clustering includes discovering salient terms in the documents based on non-linguistic differences between the n-word sequences.
- 13. A method as recited in claim 11, wherein said clustering includes assessing a relation between the n-word sequences based on non-textual differences.
- 14. A method as recited in claim 4, wherein said data mining includes establishing relations between the parameters extracted from the documents.
- 15. A method as recited in claim 1, wherein the non-textual differences between the n-word sequences relate to recognition confidence of the n-word sequences.
- 16. A method as recited in claim 1, further comprising at least one of classifying and filtering the documents as the documents are received.
- 17. A method as recited in claim 1, further comprising labeling the documents as the documents are received.
- 18. A method as recited in claim 1, further comprising displaying information related to at least one of the documents, including at least some of the non-textual differences between the n-word sequences.
- 19. A method as recited in claim 18, wherein said displaying uses at least one of gray scaling, color, font-size and font style to indicate at least some of the non-textual differences between the n-word sequences.
- 20. A method as recited in claim 18, wherein said displaying selectively displays portions of the at least one of the documents based on confidence of accuracy of words displayed.
- 21. A method as recited in claim 18, wherein said displaying further displays salient terms in the at least one of the documents based on said processing of confidence levels of the salient terms that resolves more than two decision options.
- 22. A method as recited in claim 21, wherein a number of the salient terms are available for display and said displaying is further based on the number of the salient terms available for display and available space for display of the salient terms.
- 23. A method as recited in claim 1, further comprising:
receiving user input indicating errors in recognition; and replacing at least one word in the document with a corrected word based on the user input and setting the confidence levels of the corrected word to indicate high recognition accuracy.
- 24. A method as recited in claim 1, further comprising generating the documents by automatic speech recognition of audio signals received via a telephone system.
- 25. A method as recited in claim 1, further comprising generating the documents by automatic character recognition.
- 26. A method as recited in claim 1, further comprising generating the documents by a fact extraction system.
- 27. A method as recited in claim 1, wherein said processing includes
applying different data mining techniques, each of which does not indicate non-textual differences; and merging results of the different data mining techniques to obtain results that are dependent on the non-textual differences between the n-word sequences.
- 28. A method as recited in claim 27, wherein the different data mining techniques include at least one of retrieving, categorizing, filtering, classifying, labeling and clustering documents without utilization of any non-textual differences between the n-word sequences.
- 29. A method as recited in claim 27,
wherein said applying uses a plurality of different algorithms to transform non-deterministic text into standard text documents usable in text mining, and wherein the data mining techniques operate on the standard text documents.
- 30. A method as recited in claim 29,
wherein said processing further includes generating a plurality of indexes of the standard text documents, and wherein the data mining techniques operate on the indexes to obtain the results.
- 31. A method as recited in claim 30, wherein the data mining techniques include
receiving a query; and retrieving the results relevant to the query.
- 32. A method as recited in claim 30, wherein the data mining of at least some of the different indexes is performed by data mining software that does not output non-textual differences.
- 33. A method as recited in claim 29, wherein the different algorithms are thresholding algorithms using different confidence thresholds to determine omitted words that fall below the confidence thresholds.
- 34. A method as recited in claim 1, further comprising:
receiving user input indicating a change in labeling of at least one document; and replacing at least part of information provided by at least one label for the at least one document based on the user input.
- 35. A document processing method, comprising:
producing at least one index of n-word sequences in documents derived from at least one of spontaneous and conversational expression and containing non-deterministic text with average word recognition precision below 50 percent, utilizing non-textual differences between the n-word sequences, where n is a positive integer; and processing the documents based on the non-textual differences between the n-word sequences in the at least one index, where said processing resolves more than two decision options.
- 36. A method as recited in claim 35, wherein the non-textual differences between the n-word sequences relate to recognition confidence of the n-word sequences.
- 37. At least one computer readable medium storing instructions for controlling at least one computer system to perform a document processing method comprising:
processing documents derived from at least one of spontaneous and conversational expression and containing non-deterministic text with average word recognition precision below 50 percent, said processing utilizing non-textual differences between n-word sequences in the documents, where n is a positive integer and said processing resolves more than two decision options.
- 38. At least one computer readable medium as recited in claim 37, wherein said processing includes data mining of the documents.
- 39. At least one computer readable medium as recited in claim 38, wherein said data mining includes retrieving at least one of the documents utilizing the non-textual differences between the n-word sequences in the documents.
- 40. At least one computer readable medium as recited in claim 38, wherein said data mining includes
extracting parameters from the documents, utilizing the non-textual differences between said n-word sequences; and establishing relations between the parameters extracted from the documents.
- 41. At least one computer readable medium as recited in claim 40, wherein said data mining further includes producing graphic results indicating the relations between the parameters extracted from the documents.
- 42. At least one computer readable medium as recited in claim 40, wherein at least one of the parameters extracted from the documents is an assessment of relevance to a query based on the non-textual differences between the n-word sequences.
- 43. At least one computer readable medium as recited in claim 40, wherein at least one of the extracted parameters is an assessment of a hidden variable that cannot be fully determined from information existing in the document.
- 44. At least one computer readable medium as recited in claim 40, wherein at least one of the extracted parameters is the assessment of the document's relevance to a category.
- 45. At least one computer readable medium as recited in claim 38, wherein said processing includes categorizing the documents.
- 46. At least one computer readable medium as recited in claim 45, wherein said categorizing includes use of at least one algorithm to detect salient terms in the documents based on non-linguistic differences between the n-word sequences.
- 47. At least one computer readable medium as recited in claim 38, further comprising clustering the documents.
- 48. At least one computer readable medium as recited in claim 47, wherein said clustering includes discovering salient terms in the documents based on non-linguistic differences between the n-word sequences.
- 49. At least one computer readable medium as recited in claim 47, wherein said clustering includes assessing a relation between the n-word sequences based on non-textual differences.
- 50. At least one computer readable medium as recited in claim 37, wherein the non-textual differences between the n-word sequences relate to recognition confidence of the n-word sequences.
- 51. At least one computer readable medium as recited in claim 37, further comprising at least one of classifying and filtering the documents as the documents are received.
- 52. At least one computer readable medium as recited in claim 37, further comprising labeling the documents as the documents are received.
- 53. At least one computer readable medium as recited in claim 37, further comprising displaying information related to at least one of the documents, including at least some of the non-textual differences between the n-word sequences.
- 54. At least one computer readable medium as recited in claim 53, wherein said displaying uses at least one of gray scaling, color, font-size and font style to indicate at least some of the non-textual differences between the n-word sequences.
- 55. At least one computer readable medium as recited in claim 53, wherein said displaying selectively displays portions of the at least one of the documents based on confidence of accuracy of words displayed.
- 56. At least one computer readable medium as recited in claim 53, wherein said displaying further displays salient terms in the at least one of the documents based on said processing of confidence levels of the salient terms that resolves more than two decision options.
- 57. At least one computer readable medium as recited in claim 56, wherein a number of the salient terms are available for display and said displaying is further based on the number of the salient terms available for display and available space for display of the salient terms.
- 58. At least one computer readable medium as recited in claim 37, further comprising:
receiving user input indicating errors in recognition; and replacing at least one word in the document with a corrected word based on the user input and setting the confidence levels of the corrected word to indicate high recognition accuracy.
- 59. At least one computer readable medium as recited in claim 37, further comprising generating the documents by automatic speech recognition of audio signals received via a telephone system.
- 60. At least one computer readable medium as recited in claim 37, further comprising generating the documents by automatic character recognition.
- 61. At least one computer readable medium as recited in claim 37, further comprising generating the documents by a fact extraction system.
- 62. At least one computer readable medium as recited in claim 37, wherein said processing includes
applying different data mining techniques, each of which does not indicate non-textual differences; and merging results of the different data mining techniques to obtain the non-textual differences between the n-word sequences.
- 63. At least one computer readable medium as recited in claim 62, wherein the different data mining techniques include at least one of retrieving, categorizing, filtering, classifying, labeling and clustering documents without utilization of any non-textual differences between the n-word sequences.
- 64. At least one computer readable medium as recited in claim 62,
wherein said applying uses a plurality of different algorithms to transform non-deterministic text into standard text documents usable in text mining, and wherein the data mining techniques operate on the standard text documents.
- 65. At least one computer readable medium as recited in claim 64,
wherein said processing further includes generating a plurality of indexes of the standard text documents, and wherein the data mining techniques operate on the indexes to obtain the results.
- 66. At least one computer readable medium as recited in claim 65, wherein the data mining techniques include
receiving a query; and retrieving the results relevant to the query.
- 67. At least one computer readable medium as recited in claim 66, wherein the data mining of at least some of the different indexes is performed by data mining software that does not output non-textual differences.
- 68. At least one computer readable medium as recited in claim 64, wherein the different algorithms are thresholding algorithms using different confidence thresholds to determine omitted words that fall below the confidence thresholds.
- 69. At least one computer readable medium as recited in claim 37, further comprising:
receiving user input indicating a change in labeling of at least one document; and replacing at least part of information provided by at least one label for the at least one document based on the user input.
- 70. At least one computer readable medium for controlling at least one computer system to perform a document processing method, comprising:
producing at least one index of n-word sequences in documents derived from at least one of spontaneous and conversational expression and containing non-deterministic text with average word recognition precision below 50 percent, utilizing non-textual differences between the n-word sequences, where n is a positive integer; and processing the documents based on the non-textual differences between the n-word sequences in the at least one index, where said processing resolves more than two decision options.
- 71. At least one computer readable medium as recited in claim 70, wherein the non-textual differences between the n-word sequences relate to recognition confidence of the n-word sequences.
- 72. An apparatus for processing documents, comprising:
processing means for processing documents derived from at least one of spontaneous and conversational expression and containing non-deterministic text with average word recognition precision below 50 percent, said processing utilizing non-textual differences between n-word sequences in the documents, where n is a positive integer and said processing resolves more than two decision options.
- 73. An apparatus as recited in claim 72, wherein said processing means comprises index means for producing at least one index of the n-word sequences utilizing the non-textual differences between the n-word sequences.
- 74. An apparatus as recited in claim 73, wherein said processing means comprises data mining means for retrieving at least one of the documents utilizing the at least one index.
- 75. An apparatus as recited in claim 74,
wherein said data mining means comprises:
parameter extraction means for extracting parameters from the documents, utilizing the non-textual differences between said n-word sequences; and relations establishment means for establishing relations between the parameters extracted from the documents, and wherein said apparatus further comprises display means for producing graphic results indicating the relations between the parameters extracted from the documents.
- 76. An apparatus as recited in claim 75, wherein at least one of the extracted parameters is an assessment of a hidden variable that cannot be fully determined from information existing in the at least one of the documents.
- 77. An apparatus as recited in claim 72, wherein the non-textual differences between the n-word sequences relate to recognition confidence of the n-word sequences.
- 78. An apparatus as recited in claim 72, wherein said processing means comprises categorizing means for categorizing the documents utilizing at least one algorithm based on non-linguistic differences between the n-word sequences.
- 79. An apparatus as recited in claim 72, wherein said processing means comprises clustering means for clustering the documents by assessing a relation between the n-word sequences based on non-textual differences.
- 80. An apparatus as recited in claim 72, wherein said processing means comprises means for at least one of classifying and filtering the documents as the documents are received.
- 81. An apparatus as recited in claim 72, further comprising display means for displaying information related to at least one of the documents, including at least some of the non-textual differences between the n-word sequences.
- 82. An apparatus as recited in claim 81, wherein said display means selectively displays portions of the at least one of the documents based on confidence of accuracy of words displayed.
- 83. An apparatus as recited in claim 72,
further comprising input means for receiving user input indicating errors in recognition, and wherein said processing means comprises means for replacing at least one word in the at least one of the documents with a corrected word based on the user input and setting the confidence levels of the corrected word to indicate high recognition accuracy.
- 84. An apparatus as recited in claim 72, coupled to a telephone system and further comprising automatic speech recognition means for generating the documents by automatic speech recognition of audio signals received via the telephone system.
- 85. An apparatus as recited in claim 72, further comprising automatic character recognition means for generating the documents by automatic character recognition.
- 86. An apparatus as recited in claim 72, wherein said processing means comprises:
data mining means for applying different data mining techniques, each of which does not indicate non-textual differences; and merge means for merging results of the different data mining techniques to obtain the non-textual differences between the n-word sequences.
- 87. An apparatus as recited in claim 86, wherein said data mining means includes means for at least one of retrieving, categorizing, filtering, classifying, labeling and clustering documents without utilization of any non-textual differences between the n-word sequences.
- 88. An apparatus as recited in claim 87, wherein said data mining means uses a plurality of different algorithms to transform non-deterministic text into standard text documents usable in text mining and the data mining techniques operate on the standard text documents.
- 89. An apparatus as recited in claim 87,
further comprising indexing means for generating a plurality of indexes of the standard text documents, and wherein said data mining means uses the different indexes in applying the different data mining techniques.
- 90. An apparatus as recited in claim 89,
further comprising input means for receiving a query; and wherein said data mining means further includes retrieving means for retrieving the results relevant to the query.
- 91. A data processing system, comprising:
at least one server to process documents, derived from at least one of spontaneous and conversational expression and containing non-deterministic text with word recognition precision of less than 50 percent, utilizing non-textual differences between n-word sequences, where n is a positive integer.
- 92. A data processing system as recited in claim 91, wherein said at least one server includes an indexing server producing at least one index of the n-word sequences utilizing the non-textual differences between the n-word sequences.
- 93. A data processing system as recited in claim 92, wherein said indexing server retrieves at least one of the documents utilizing data mining of the at least one index.
- 94. A data processing system as recited in claim 91,
wherein said at least one server extracts parameters from the documents, utilizing the non-textual differences between said n-word sequences, and establishes relations between the parameters extracted from the documents, and wherein said data processing system further comprises at least one display device producing graphic results indicating the relations between the parameters extracted from the documents.
- 95. A data processing system as recited in claim 94, wherein at least one of the extracted parameters is an assessment of a hidden variable that cannot be fully determined from information existing in the at least one of the documents.
- 96. A data processing system as recited in claim 91, wherein the non-textual differences between the n-word sequences relate to recognition confidence of the n-word sequences.
- 97. A data processing system as recited in claim 97, further comprising at least one display device displaying information related to at least one of the documents, including at least some of the non-textual differences between the n-word sequences.
- 98. A data processing system as recited in claim 97, wherein said at least one display device selectively displays portions of at least one of the documents based on confidence of accuracy of words displayed.
- 99. A data processing system as recited in claim 91, wherein said at least one server applies different data mining techniques, each of which does not indicate non-textual differences and merges results of the different data mining techniques to obtain the non-textual differences between the n-word sequences.
- 100. A data processing system as recited in claim 99, wherein said at least one server uses a plurality of different algorithms to transform non-deterministic text into standard text documents usable in text mining and the data mining techniques operate on the standard text documents.
- 101. A data processing system as recited in claim 100, wherein said at least one server generates a plurality of indexes of the standard text documents and uses the different indexes in applying the different data mining techniques.
- 102. A data processing system as recited in claim 99, wherein the different data mining techniques include at least one of retrieving, categorizing, filtering, classifying, labeling and clustering documents without utilization of any non-textual differences between the n-word sequences.
- 103. A data processing system as recited in claim 102, wherein said at least one server uses a plurality of different algorithms to transform non-deterministic text into standard text documents usable in text mining and the data mining techniques operate on the standard text documents.
- 104. A data processing system as recited in claim 91,
further comprising at least one user terminal providing user input indicating errors in recognition in a document, and wherein said at least one server replaces at least one word in the document with a corrected word based on the user input and sets confidence levels of the corrected word to indicate high recognition accuracy.
- 105. A data processing system as recited in claim 91, further comprising at least one of an automatic speech recognition unit, an automatic character recognition unit and a fact extraction unit to generate the documents from data that on average produces word recognition precision of less than 50 percent.
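By way of illustration only, and without limiting any claim, the arrangement recited in claims 27 through 33 may be sketched as follows. The sketch assumes that the non-textual difference is a per-word recognition confidence between 0 and 1. The sample documents, the threshold values and all function names are hypothetical and do not appear in the claims. Each thresholding pass omits words below its confidence threshold to produce a standard text document, each standard text document is indexed without confidence data, and a boolean retrieval technique is applied to each index. Merging the yes/no results yields a graded result per document, which resolves more than two decision options.

```python
from collections import defaultdict

# Hypothetical recognizer output: each document is a list of
# (word, confidence) pairs. Values are invented for illustration.
DOCS = {
    "call-1": [("cancel", 0.9), ("my", 0.4), ("account", 0.7), ("today", 0.2)],
    "call-2": [("account", 0.35), ("balance", 0.8), ("cancel", 0.15)],
    "call-3": [("weather", 0.6), ("account", 0.95)],
}

# Assumed confidence thresholds; the claims do not specify any values.
THRESHOLDS = (0.3, 0.6, 0.9)


def to_standard_text(words, threshold):
    """Thresholding: omit every word whose confidence is below the threshold."""
    return [word for word, confidence in words if confidence >= threshold]


def build_index(docs, threshold):
    """Build a plain inverted index (word -> document ids) with no confidence data."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in to_standard_text(words, threshold):
            index[word].add(doc_id)
    return index


def retrieve(index, query):
    """Boolean retrieval: a document either matches the query word or does not."""
    return set(index.get(query, set()))


def merged_retrieve(docs, query, thresholds=THRESHOLDS):
    """Merge the boolean results of all indexes into a graded result.

    The grade of a document is the number of thresholded indexes that
    returned it, so with three thresholds the grade ranges from 0 to 3.
    Documents with grade 0 are simply absent from the returned mapping.
    """
    grades = defaultdict(int)
    for threshold in thresholds:
        for doc_id in retrieve(build_index(docs, threshold), query):
            grades[doc_id] += 1
    return dict(grades)


result = merged_retrieve(DOCS, "account")
```

In this sketch no single retrieval pass reports a confidence, yet the merged grade depends on the recognition confidence of the query word in each document: a document returned from all three indexes contains the word at a confidence at or above the highest threshold.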
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application is related to and claims priority to U.S. provisional application entitled METHOD FOR AUTOMATIC AND SEMI-AUTOMATIC CLASSIFICATION AND CLUSTERING OF NON-DETERMINISTIC TEXTS having serial No. 60/444,982, by Assaf ARIEL, Itsik HOROWITZ, Itzik STAUBER, Michael BRAND, Ofer SHOCHET and Dror ZIV, filed Feb. 5, 2003 and incorporated by reference herein. This application is also related to the application entitled AUGMENTATION AND CALIBRATION OF OUTPUT FROM NON-DETERMINISTIC TEXT GENERATORS BY MODELING ITS CHARACTERISTICS IN SPECIFIC ENVIRONMENTS by Michael BRAND, filed concurrently and incorporated by reference herein.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60444982 | Feb 2003 | US |