Claims
- 1. A method for generating an enlarged corpus of training entries for a particular application, given a set of k labels and an initial corpus of m training entries, where each of said entries includes at least a data portion, comprising the steps of:
for each label l of said k labels, creating an associated rule that specifies one or more conditions that said data portion of an applied entry x must meet in order for said rule to reach a conclusion that said label l attaches to said entry x, and also specifies a confidence measure p(x,l), associated with said conclusion, which measure is a number between 0 and 1;
creating an augmented corpus of m training entries, where each entry i in said augmented corpus is created from data portion of entry i in said initial corpus of training entries, i=1,2, . . . m, with each of said k labels attached to said data portion of said entry i, or not attached to said data portion of said entry i, based on whether a preselected variable Z is either a +1 or a 0, respectively, and with a confidence measure associated with each of said labels being U(x,l)=[Zηp(x,l)+(1−Z)η(1−p(x,l))] when said data portion of said entry i meets said conditions of said rule for label l, η being a preselected positive number, and being 1−U(x,l) when said data portion of said entry i fails to meet said conditions of said rule for label l; and
combining said augmented corpus of m training entries with said initial corpus of m training entries to form said enlarged corpus having 2m training entries.
- 2. The method of claim 1 where said rule is created based on prior knowledge of said particular application.
- 3. The method of claim 1 where said one or more conditions constitute a logical association of data elements that are expected in entries to a classifier constructed through interaction with said enlarged corpus of training entries.
- 4. The method of claim 3 where said particular application involves recognizing spoken speech, and said data elements are words.
- 5. The method of claim 1 where said variable Z is +1, further comprising a step of developing a second augmented corpus of m training entries, which step is identical to said step of creating, with said variable Z being equal to 0, and with said step of combining being enhanced to combine said augmented corpus of m training entries and said second augmented corpus of m training entries with said initial corpus of m training entries to form said enlarged corpus of 3m training entries.
- 6. The method of claim 1 where each entry x in said initial training corpus includes, in addition to said data portion, an indication of which of said k labels are attached to said entry x.
- 7. The method of claim 1 further comprising a step of attaching any number of said k labels to each of said m entries of said initial training corpus.
- 8. The method of claim 1 further comprising the step of creating a classifier from said enlarged corpus of training entries.
- 9. A method of understanding presented data comprising the steps of:
normalizing said data to reduce variations in said presented data, to develop normalized data; assigning portions of said normalized data to be instances of objects from a set of preselected objects when said portions of said normalized data meet predetermined conditions, thereby forming entity-extracted data; and classifying said entity-extracted data by determining whether any of a predetermined set of labels should be attached to said entity-extracted data.
- 10. The method of claim 9 where said data represents speech.
- 11. The method of claim 9 where said data represents real-time speech.
- 12. The method of claim 9 where said step of normalizing employs morphing, synonym detection, and disfluencies detection.
- 13. The method of claim 9 where said objects are related to a given classification application.
- 14. The method of claim 9 where said objects are taken from a set including at least date, time, numbers, money, and phone numbers.
- 15. The method of claim 9 where said objects are preselected based on the application for which said method is employed.
- 16. The method of claim 9 where said presented data is text data, or speech data converted to text data.
- 17. The method of claim 9 where said classifying is carried out with a boosting classifier.
- 18. The method of claim 9 where said classifying is carried out with a classifier represented by
- 19. The method of claim 18 where said understanding of presented data is relative to a particular application, and said classifier is developed from a constructed corpus of training entries that is at Mm in size, where m is size of a provided corpus of training entries, M is a positive integer, and said constructed corpus of training entries is derived from said provided corpus of training sequences.
- 20. The method of claim 19 where said classifier is developed with aid of a rule established with prior knowledge of said particular application.
- 21. The method of claim 19 where M is either 3 or 2.
- 22. A method for understanding applied data, relative to a particular application, by classifying said applied data with a classifier developed from an enhanced corpus of training entries, the improvement comprising:
developing said enhanced corpus of training entries by creating from said provided corpus of training entries a set of auxiliary training entries that are developed with aid of a rule that is based on prior knowledge of said particular application, said set of auxiliary training entries being combined with said provided corpus of training entries to form said enhanced corpus of training entries.
- 23. The method of claim 22 where M training entries are developed for said set of auxiliary training entries from each entry in said provided corpus of training entries, M being a positive integer.
- 24. The method of claim 22 where said classifier imparts said understanding of said data by providing information about whether labels from a set of k labels attach to said applied data.
- 25. The method of claim 24 where said provided corpus of training entries comprises m entries, each of which includes at least a data portion, and said set of auxiliary entries is developed by creating an associated rule, for each label l of said k labels, that specifies one or more conditions that said data portion of an applied entry x must meet in order for said rule to reach a conclusion that said label l attaches to said entry x, and also specifies a confidence measure p(x,l), associated with said conclusion, which measure is a number between 0 and 1;
creating an augmented corpus of m training entries, where each entry i in said augmented corpus of training entries is created from data portion of entry i in said provided corpus, i=1,2, . . . m, with each of said k labels attached to said data portion of said entry i, or not attached to said data portion of said entry i, based on whether a preselected variable Z is either a +1 or a 0, respectively, and with a confidence measure associated with each of said attached labels being U(x,l)=[Zηp(x,l)+(1−Z)η(1−p(x,l))] when said data portion of said entry i meets said conditions of said rule for label l, η being a preselected positive number, or being 1−U(x,l) when said data portion of said entry i fails to meet said conditions of said rule for label l.
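The corpus-enlargement steps recited in claims 1, 5, and 25 can be illustrated with a minimal Python sketch. It follows a literal reading of the claim language and is not the applicants' implementation. The names `Rule`, `AugmentedEntry`, `confidence`, `augment`, and `enlarge`, the use of plain strings as data portions, and the keyword rule for a label called "billing" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical per-label rule: a condition on the data portion plus p(x, l)."""
    condition: Callable[[str], bool]
    p: float  # confidence measure p(x, l), a number between 0 and 1


@dataclass
class AugmentedEntry:
    data: str                     # data portion copied from initial entry i
    attached: bool                # labels attached (Z = 1) or not attached (Z = 0)
    confidence: Dict[str, float]  # confidence measure for each of the k labels


def confidence(rule: Rule, data: str, z: int, eta: float) -> float:
    # U(x, l) = Z*eta*p(x, l) + (1 - Z)*eta*(1 - p(x, l))
    u = z * eta * rule.p + (1 - z) * eta * (1.0 - rule.p)
    # Literal reading of the claim: U when the rule's conditions are met,
    # 1 - U when they are not.
    return u if rule.condition(data) else 1.0 - u


def augment(corpus: List[str], rules: Dict[str, Rule], z: int, eta: float) -> List[AugmentedEntry]:
    # One augmented entry per initial entry, so the result has m entries.
    return [
        AugmentedEntry(
            data=data,
            attached=bool(z),
            confidence={label: confidence(rule, data, z, eta) for label, rule in rules.items()},
        )
        for data in corpus
    ]


def enlarge(corpus: List[str], rules: Dict[str, Rule], eta: float, both: bool = False) -> list:
    # Claim 1: initial corpus plus the Z = 1 augmented corpus (2m entries).
    enlarged = list(corpus) + augment(corpus, rules, z=1, eta=eta)
    if both:
        # Claim 5: also add a second augmented corpus with Z = 0 (3m entries).
        enlarged += augment(corpus, rules, z=0, eta=eta)
    return enlarged


# Hypothetical example: one label with a keyword rule and p = 0.9.
rules = {"billing": Rule(condition=lambda d: "bill" in d.lower(), p=0.9)}
corpus = ["I have a question about my bill", "hello"]
positive = augment(corpus, rules, z=1, eta=1.0)
negative = augment(corpus, rules, z=0, eta=1.0)
```

In this sketch η simply scales the rule's confidence, and with η = 1 the Z = 1 and Z = 0 copies of an entry carry complementary confidence values. A real system would also carry any labels already attached to the initial entries, which this sketch omits.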
RELATED APPLICATIONS
[0001] This application claims priority from provisional applications 60/306,283 filed Jul. 18, 2001 and 60/374,961 filed Apr. 23, 2002.
Provisional Applications (2)
| Number | Date | Country |
| --- | --- | --- |
| 60306283 | Jul 2001 | US |
| 60374961 | Apr 2002 | US |