The present invention deals with lexical data. More specifically, the present invention provides a system for developing lexical data for use in a wide variety of applications, such as natural language processing applications.
Lexical data forms a core part of most natural language processing systems. Lexical data typically includes a list of word forms with links to knowledge about the linguistic properties of those word forms. Such properties may include, for example, part-of-speech, morphology, pronunciation, stress, syllable boundaries, frequency, grammar, usage or any of a wide variety of other features that comprise knowledge about the word and how it behaves in different contexts. Such properties are often essential to the linguistic coverage and accuracy of a natural language processing application. The list of word forms, with optional links to knowledge about the linguistic property of the word forms, is referred to as a lexicon.
The acquisition and coding of lexical data is often a major part of the development of a natural language processing software application. This is because development of such lexical data (or lexicons) has typically required a linguist, or other person with a combination of linguistic and computational skills to generate and compile such data. Therefore, natural language software application development companies often outsource dictionary (or lexicon) development to companies that employ linguists or other highly skilled people. Such people typically manually enter words into the lexicon, from a corpus of text, in an ad hoc way. They are provided with no structure within which to enter the words, and the result is that the entries are often inconsistent and incomplete from one person to the next. They are inconsistent because the manual dictionary builders do not enter every form of every word (such as every form of every verb, noun, etc.). They are incomplete because the manual entering personnel fail to define all rules to completely define the morphology of the words.
Another problem associated with conventional lexicon builders is that different languages have different complexity in morphological rules. Depending on the language, affixes can reside at the beginning, middle, or at the end of words, or at the beginning and end of words, or multiple suffixes can be added to a single stem. This results in even higher development costs, and the employment of even more skilled personnel, such as linguists who are professionally trained in multiple languages and who have computational skills necessary to create the electronic lexicon.
In the past, these problems have required the lexicons (or lexical data) for many useful applications (such as spell checkers and word breakers) to be licensed at extremely high, long term, license rates.
The present invention provides a lexicon development tool which allows an author to first define templates and then assign words in an input word list to correct templates. The present invention can be used to automatically match a template to an input word or the words can be matched to templates manually. In addition, the present invention can provide a wide variety of different processing components to sort or otherwise process an input word list and to test and export a lexicon, once it has been authored.
The present invention deals with development of lexical data. The present description proceeds with respect to examples in the English language. However, the invention is not so limited. In various embodiments, the present invention may support all inflectional and agglutinative languages as well as Semitic languages. However, before describing the present invention in greater detail, one environment in which the present invention can be used will be described.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
The operation of system 200 is described in greater detail below. Briefly, however, word list creation component 202 receives a source of words from one of a variety of different sources, such as text corpora 220 or existing dictionaries 222. Word list creation component 202 then extracts a word list 224 from those sources. Alternatively, of course, word list 224 can be manually entered through a manual entry component 226. In addition, word list creation component can include more complex lexical data import utilities and word list manipulation utilities. It may also be linked to an external corpus by a key word in context (KWIC) viewing component. However, the present discussion simply proceeds with respect to component 202 creating a word list.
In any case, word list 224 is provided to lexicon creation component 204. The author has illustratively already created template classes and templates (described in greater detail below). Lexicon creation component 204 illustratively includes lexicon generator component 226 and lexicon test and export component 228. Therefore, lexicon generator 226 receives word list 224 and matches the words in the word list to templates in template store 208 to create the lexicon. As will be described in greater detail below, this can be done by manually selecting a template and entering the words, or by using automatic template matching. Either way, assigning a word to a template generates a lexical entry in the lexicon. Once the lexicon generator 226 has generated the lexicon, test and export component 228 tests the lexicon and configures it for export in one of a wide variety of desired formats. The lexicon 210 is then output in the desired format to applications 212.
Template authoring component 234 allows an author, through authoring interface 206, to create template classes and corresponding templates.
Therefore, by way of example, the noun part-of-speech in English is the category that contains words that can appear (among other places) as subjects and objects of verbs and prepositions (these are the grammatical functions of nouns), and that has singular, plural, and possessive forms (these are the inflectional characteristics of nouns). At this point, it is not important that different forms of nouns can be related in different ways for different words in English. All that needs to be determined is that each noun has four inflected forms, which are referred to as Singular, Plural, Possessive.Singular and Possessive.Plural. To put it another way, the standard template class for nouns in English will have four slots, which are used to store the four inflected forms of each noun. Note that the forms that fill each of the slots need not be different; the same form can fill more than one slot.
In some languages, there are one or more irregular words that intuitively belong to a particular part-of-speech but should be treated as a separate part-of-speech for the purposes of creating templates, because they have a different set of inflectional forms. In English, for example, the verb “be” fits into this category because, although it is clearly a verb, it has multiple past tense forms (“was” for singular subjects and “were” for plural subjects), unlike any other verb in the English language. This type of verb would therefore be assigned to a special part-of-speech, such as “verb_be”.
Once the major parts-of-speech are defined, the template classes and templates are created by the author, through authoring interface 206, for each of the identified major parts-of-speech. This is indicated by block 302 in
The template classes are categories to which many individual templates may belong. A template includes a list of stems and a set of rules, one rule for each of the generated forms in the template class to which the template belongs. Rules may be blank, indicating that no form exists in the template for the generated form slot the rule is associated with. While template classes are concerned with the various types of inflectional forms that each part-of-speech as a whole is associated with, the templates contain information about the specific ways in which words belonging to a particular part-of-speech relate the different inflectional forms to each other. That is, any particular word belongs to some template, and that template, in turn, belongs to some template class.
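The relationship between a template class and its member templates can be sketched as a small data structure. The following Python sketch is illustrative only: the names `TemplateClass`, `Template`, and `generate`, and the specific noun-regular rules, are assumptions made for this example. Only the four slot names and the `(n)` stem-reference notation come from this description.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TemplateClass:
    """A part-of-speech category and the generated-form slots it defines."""
    name: str
    slots: List[str]


@dataclass
class Template:
    """A member of a template class: a stem count plus one rule per slot."""
    name: str
    template_class: TemplateClass
    num_stems: int
    rules: List[str]  # same order as template_class.slots; "" means no form

    def generate(self, stems: List[str]) -> Dict[str, str]:
        """Apply every non-blank rule to the given stem values."""
        if len(stems) != self.num_stems:
            raise ValueError("wrong number of stems for this template")
        forms = {}
        for slot, rule in zip(self.template_class.slots, self.rules):
            if rule:
                # Replace each "(n)" reference with the n-th stem value.
                forms[slot] = re.sub(
                    r"\((\d+)\)", lambda m: stems[int(m.group(1)) - 1], rule
                )
        return forms


NOUN = TemplateClass(
    "Noun", ["Singular", "Plural", "Possessive.Singular", "Possessive.Plural"]
)
# Hypothetical rules for a regular English noun (assumed for illustration).
NOUN_REGULAR = Template("noun-regular", NOUN, 1, ["(1)", "(1)s", "(1)'s", "(1)s'"])

print(NOUN_REGULAR.generate(["dog"]))
```

Because the slot names live on the template class, every template in the class produces the same set of named forms, and only the rules differ from template to template.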
Continuing with the example of English nouns mentioned above, a high level view of the hierarchical nature of the template classes and templates is shown in
Below the notation of stems in
A concrete example may assist in understanding. Assume that the stem value equals “dog” in the noun-regular template shown in
In accordance with another embodiment, a stem macro can be used so that second, third, . . . , nth stems can be derived by applying the rule to the first stem. For instance, a set of rules used in a template may fall into two or more classes where the “interior” of the rule is the same for each class and these “interiors” are related to each other in a deterministic way. In order to eliminate redundancy in the textual representation of the rules, a macro can be defined for each of the interiors. In one particular implementation this macro looks like a stem (it has a number and is referred to in the same way) but it is not an independent piece of information. Instead, it is related by rule to a “real”, independent stem. For example, assume the following is an abbreviated template:
1: Stem
(1)abc
(1)def
(1)xyz
(1)xyzghi
(1)xyzjkl
The template can equivalently be expressed as
1: Stem
2: DependentStem=(1)xyz
(1)abc
(1)def
(2)
(2)ghi
(2)jkl
It can be seen that the stems are derived by referring to a rule and applying that rule to the stem.
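The equivalence of the two abbreviated templates above can be checked with a short sketch. The helper names `expand` and `generate` and the sample stem value are assumptions for illustration; the rule strings are copied from the two listings.

```python
import re


def expand(rule, stems):
    """Replace each "(n)" in a rule with the value of stem n."""
    return re.sub(r"\((\d+)\)", lambda m: stems[int(m.group(1))], rule)


def generate(rules, stem, dependent=None):
    """Generate all forms from one real stem plus optional dependent stems."""
    stems = {1: stem}
    # Dependent stems are not independent data: each is derived by rule
    # from stems that are already known.
    for number, rule in (dependent or {}).items():
        stems[number] = expand(rule, stems)
    return [expand(rule, stems) for rule in rules]


PLAIN = ["(1)abc", "(1)def", "(1)xyz", "(1)xyzghi", "(1)xyzjkl"]
WITH_MACRO = ["(1)abc", "(1)def", "(2)", "(2)ghi", "(2)jkl"]
DEPENDENT = {2: "(1)xyz"}

# Both representations produce the same generated forms for any stem value.
print(generate(PLAIN, "q") == generate(WITH_MACRO, "q", DEPENDENT))
```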
In accordance with another embodiment, suggestions for other stem values are generated by associating with a stem one or more rules that are based on other stems. For instance, in the English verb-irregular template, there are three slots, one for the present verb stem (e.g., drink), one for the past tense (drank) and one for the past participle (drunk). It is often the case that the past tense and the past participle are the same (such as with bring-brought-brought), and when they are not the same there is often a “ . . . a . . . ” --> “ . . . u . . . ” relationship (such as with swim-swam-swum). These relationships can be embodied as rules applied to other stems to suggest likely values for stems. Suggesting likely values for stems helps the user's data entry flow.
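A minimal sketch of such a suggestion rule for the past participle stem is given below. The function name and the choice to replace the last “a” in the past tense stem are assumptions made for this example; the description states only that the two relationships often hold.

```python
def suggest_past_participle(past):
    """Return likely past participle stems for a given past tense stem."""
    # Most common case: participle equals past tense (bring-brought-brought).
    suggestions = [past]
    # Otherwise try the "...a..." -> "...u..." relationship (swim-swam-swum).
    # Replacing the LAST "a" is an assumption of this sketch.
    index = past.rfind("a")
    if index != -1:
        suggestions.append(past[:index] + "u" + past[index + 1:])
    return suggestions


print(suggest_past_participle("drank"))
print(suggest_past_participle("brought"))
```

The suggestions are only proposals: the author remains free to overwrite them, which is why an incorrect guess costs little.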
Once the template classes and templates are created conceptually, they are entered into system 200 shown in
After the template classes and templates are stored in template store 208, the actual lexical entries are made. Entering the lexical entries is illustrated in greater detail below with respect to
The author first invokes template authoring component 234 which provides a display, (such as that shown in
Defining the template classes and templates conceptually is indicated by block 401 in
Entering the template edit mode illustratively causes template authoring component 234 to bring up a dialog box such as that shown in
The author then exits the template class edit mode by clicking the Done button 427. This is indicated by block 327 in
The next step is to create the three member templates of the “Noun” template class. Therefore, the author exits the template class view by clicking on the Back to Main Screen button 429. In one embodiment, member templates can be created from the template class view by clicking Create button 417, but the main screen is used to edit stems.
In any case,
Once the template edit mode has been entered, template authoring component 234 provides a display such as that shown in
The author then enters the rules associated with each generated form, in the generated forms table 418. This is indicated by block 430 in
Of course, the author can now enter the noun-regular.s.sh.ch.x and the noun-irregular templates by actuating the “Create” button 420, creating a new template. That template will inherit the generated form names from the noun template class and therefore the author simply needs to change the name, possibly the number of stems, and the rules associated with the generated forms. Once all of the templates have been added, template authoring component 234 provides a display such as that shown in
Once the template classes and templates have been defined and created by the author through template authoring component 234, lexical entries can be made to create a lexicon. In other words, in one embodiment, words are assigned to (or linked to) templates to create the lexicon. Template manager 236 and lexical data manager 240 (shown in
In one illustrative embodiment, there are two ways in which the author can make lexical entries. The first is simply to invoke template manager 236 and lexical data manager 240 to retrieve a desired template, assign a word manually to the template, and then restore the template. The second way is to use automatic template matching component 238. Both of these embodiments are described in greater detail below.
As the discussion proceeds, it will be noted that it may be desirable to define all template classes and templates prior to entering lexical data. In practice, however, this may be very difficult and the author will likely notice missing templates or template classes or other problems with existing templates while making lexical entries. As will be described below, in accordance with one embodiment of the present invention, it is relatively easy to change templates and template classes without negatively affecting existing lexical entries that use those templates.
Assume further, for the sake of example, that the user wishes to enter the word “dog” into the lexicon. The author types the stem value “dog” into the form column in stem table 416. This is indicated by block 502 in
At this point, the author can associate lexical data with this entry, other than that which has already been entered. For instance, in the embodiment illustrated in
The two items of information associated with the lexical entry shown in
It will be appreciated, of course, that other lexical attributes can be assigned to a lexical entry, template, rule, etc., as well. For instance, it may be desirable to associate data indicating whether the lexical entry is a named entity (such as a proper name, a city name, etc.), the part-of-speech of the stem in a lexical entry, the pronunciation, information for certain parts of speech (such as typical subjects and objects of verbs, or grammatical structures—prepositions and particles—that occur around verbs), sense tagging to identify a specific sense of a word, domain encoding to indicate association of a use of a word and a given domain, translation information, examples of usage extracted from corpora, etc. Other attributes can be used as well.
It will be noted that the author can also add constraints to the stem, under the “constraint” column shown in table 416. Stem constraints are simply regular expressions which can be set on the stem slots in table 416 so that a template is not proposed by the automatic template matching component 238 (described below with respect to
In any case, once the necessary data is entered by the author, the author adds the lexical entry to the lexicon. This is indicated by block 504 in
When the author has done this, lexical data manager 240 adds a new row to the lexical entries table 410 and changes the counts indicating the lexical entries in lexical entries table 410. This is illustrated in
The author then repeats these steps, as necessary, for all desired lexical entries. This is indicated by block 506 in
For an irregular word such as “sheep” the author enters multiple stem values in stem table 416. For instance, the author first chooses the noun-irregular template and enters the necessary stem values in table 416 and then adds the word to the lexicon. Once this has been done, lexical data manager 240 updates the lexicon as illustrated in
It will be noted that the “lexical entries” count is now 3-12-10 because, although there are twelve generated words (three noun lexical entries, and four generated forms per template), two of those words are duplicates: “sheep” and “sheep's” each appear in the list twice. Therefore, it can be seen that, depending on the lexical entries that are entered, the number of distinct generated forms may be lower than the total number of generated forms, and will never exceed it.
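The 3-12-10 count can be reproduced with a short sketch. The rules attached to each template name and the second lexical entry (“dish”) are assumptions for illustration; the description names the three templates and the entries “dog” and “sheep” but does not list the rules or the third entry here.

```python
import re

# Hypothetical rules for the three noun templates (assumed for illustration).
RULES = {
    "noun-regular": ["(1)", "(1)s", "(1)'s", "(1)s'"],
    "noun-regular.s.sh.ch.x": ["(1)", "(1)es", "(1)'s", "(1)es'"],
    "noun-irregular": ["(1)", "(2)", "(1)'s", "(2)'s"],
}

# Each lexical entry pairs a template with its stem values.
ENTRIES = [
    ("noun-regular", ["dog"]),
    ("noun-regular.s.sh.ch.x", ["dish"]),      # hypothetical third entry
    ("noun-irregular", ["sheep", "sheep"]),    # singular and plural stems
]


def expand(rule, stems):
    """Replace each "(n)" in a rule with the n-th stem value."""
    return re.sub(r"\((\d+)\)", lambda m: stems[int(m.group(1)) - 1], rule)


def lexicon_counts(entries):
    """Return (lexical entries, generated forms, distinct generated forms)."""
    generated = [
        expand(rule, stems)
        for template, stems in entries
        for rule in RULES[template]
    ]
    return len(entries), len(generated), len(set(generated))


print(lexicon_counts(ENTRIES))
```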
It can be seen that while choosing a template and entering stem values by hand certainly works to build a lexicon in accordance with the present invention, it may be relatively tedious and slow. Therefore, in accordance with another embodiment of the present invention, automatic template matching component 238 is configured to receive a word and match it against one or more templates to which the word most likely belongs.
Automatic template matching component 238 illustratively includes automatic template matcher 600, and template scoring component 602.
In accordance with the first embodiment, the user simply enters a word to be matched against the templates into a text box, and automatic template matching component 238 receives the word. This is indicated by block 612 in
Automatic template matching component 238 then loops over all selected templates stored in template store 208, and over all slots in each of those templates, in order to determine whether the input word matches any of the slots in any of the templates. This is indicated by block 614 in
In performing this exhaustive search, automatic template matcher 600 asks two questions. The first is whether, for each slot in each template, the associated rule can be reversed given the input word. This is indicated by block 616 in
For example, assume that the input word is “breathes”. Further, assume that the template being considered is the “verb-silent e” template. Assume also that the generated forms and rules for the “verb-silent e” template being considered by automatic template matcher 600 are as follows:
Present.Non3PersSing (1)e
Present.3PersSing (1)es
Participle.Present (1)ing
Past (1)ed
Automatic template matcher 600 first considers the rule associated with the first generated form. Since the input word “breathes” does not end in “e”, then this rule cannot be reversed given the input word. Therefore, for the first generated form, the answer to the question asked at block 616 is no. Thus, automatic template matcher 600 determines whether there are any additional slots in the template being considered and, if so, moves on to the next slot. This is indicated by block 618 in
Automatic template matcher 600 thus asks this question of the next generated form and associated rule. Since the input word “breathes” does end in “es”, the rule associated with the second generated form (which indicates that “es” is to be added onto a stem in order to obtain the third person singular form of the word) can be reversed. Therefore, the answer to the question asked at block 616 in
Since the answer is yes, automatic template matcher 600 reverses the rule to obtain a proposed stem, given the input word. This is indicated by block 620 in
Even though the rule in the third person singular slot can be reversed, that does not mean that the template under consideration (verb-silent e) should be provided as a possible analysis yet. That is because simply finding that the input word can possibly fit into a slot of a template does not provide enough information to make a possible analysis. A possible analysis is an entire lexical entry which includes both the choice of a template and the choice of a particular word to put into the stem values in that template.
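The slot-by-slot reversal described above can be sketched for the simple case of single-stem suffix rules. The function name `reverse_suffix_rule` is an assumption, and the sketch deliberately ignores prefix, infix, and multi-stem rules, which a full implementation would need to handle.

```python
def reverse_suffix_rule(rule, word):
    """Return the proposed stem, or None if the rule cannot be reversed.

    Only rules of the form "(1)<suffix>" are handled in this sketch.
    """
    prefix = "(1)"
    if not rule.startswith(prefix):
        raise ValueError("only single-stem suffix rules are supported here")
    suffix = rule[len(prefix):]
    if not word.endswith(suffix):
        return None  # the word could not have been produced by this rule
    stem = word[:len(word) - len(suffix)]
    return stem or None  # an empty stem is not a usable proposal


# Generated forms and rules of the "verb-silent e" template from the text.
VERB_SILENT_E = [
    ("Present.Non3PersSing", "(1)e"),
    ("Present.3PersSing", "(1)es"),
    ("Participle.Present", "(1)ing"),
    ("Past", "(1)ed"),
]

proposals = {slot: reverse_suffix_rule(rule, "breathes") for slot, rule in VERB_SILENT_E}
print(proposals)
```

Only the Present.3PersSing slot yields a proposed stem for “breathes”; the other three rules cannot be reversed for this input.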
As mentioned earlier with respect to
(1) stem .+<:cons:>
This indicates that a constraint is placed on the first stem value. The constraint is indicated by the “.+<:cons:>” term. The “.+” portion of the constraint represents any non-empty string and the <:cons:> indicates that the first stem value must end in a consonant.
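A stem constraint of this kind can be checked by expanding the macro and matching the whole proposed stem. In the sketch below, the expansion of `<:cons:>` into a character class of lowercase English consonants (including “y”) is an assumption; the description says only that the macro denotes a consonant.

```python
import re

# Assumed expansion of the consonant macro: every lowercase letter
# except a, e, i, o, u.
MACROS = {"<:cons:>": "[b-df-hj-np-tv-z]"}


def compile_constraint(constraint):
    """Expand macros and compile the constraint as a regular expression."""
    for name, pattern in MACROS.items():
        constraint = constraint.replace(name, pattern)
    return re.compile(constraint)


def satisfies(stem, constraint):
    """Return True if the entire stem matches the constraint."""
    return compile_constraint(constraint).fullmatch(stem) is not None


print(satisfies("breath", ".+<:cons:>"))   # stem ends in a consonant
print(satisfies("breathe", ".+<:cons:>"))  # stem ends in a vowel
```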
Automatic template matcher 600 then asks the question whether the proposed stem value derived by reversing the rule in the matched generated form slot meets any constraints on the stem associated with that rule. This is indicated by block 622 in
However, because in the present example, the constraint is met, then automatic template matcher 600 adds the template under analysis to the possible analysis list (or list of matched templates) 606. This is indicated by block 626 in
When all of the templates and all slots for each template have been searched, a full list of possible analyses 606 will have been generated from automatic template matcher 600. However, in any given set of templates in template store 208 which have been defined by an author, a relatively large number of them will likely contain rules that pass through input strings unchanged and therefore place no constraints on possible outputs. Further, a large number of them will likely have no, or very weak, stem constraints. Therefore, to the extent that these templates exist, they will be considered as possible analyses and the possible analysis list in possible analyses table 406 will be flooded with possible analyses that likely will not apply to the given input word.
Therefore, the present invention also illustratively provides template scoring component 602 in automatic template matching component 238. Template scoring component 602 illustratively scores each of the matched templates to indicate how likely the possible analysis associated with each matched template is to be a correct analysis. This is indicated by block 628 in
In one illustrative embodiment, template scoring component 602 uses three different factors in scoring each proposed analysis. The first is the amount of modification required to translate the input word back into the corresponding stem form. In other words, this factor identifies the amount of modification represented by the rule used to transform the input word into the corresponding stem value. It is believed that, if there is more modification required to transform the input word into the stem value, that tends to mean the rule being reversed is a more sophisticated and specialized rule. Therefore, if it actually does apply to the input word, it is more likely to be correct than a rule which requires very little modification of the input word.
In order to measure the amount of modification required, any conventional means can be used, and the present invention illustratively uses a measure indicative of edit distance between the input word and the stem, after the rule is applied. A bonus is added to the score for this possible analysis if the edit distance is non-zero. (The edit distance will be zero for rules such as “(1)” which do not transform the stem at all.)
A second factor that may illustratively be used to score possible analyses is based on whether the possible analysis has more than one stem associated with it. In one illustrative embodiment, if the possible analysis does have more than one stem associated with it, a penalty is applied to the score for that template. If more than one stem is required to describe a word, it is likely to be an irregular word and therefore it is less likely to occur than a regular word.
In English, for example, irregular forms include “buy/buys/buying/bought/bought” and “fight/fights/fighting/fought/fought”. Each of these words requires two stems (buy and bought; fight and fought) because the author must explicitly set out what these stems are, since the second cannot be predicted from the first. It is likely that these irregular, multi-stem templates will apply to a relatively small subset of words, and therefore a penalty is applied to those templates when they are suggested as possible analyses.
It will be noted that each of the factors used by template scoring component 602 can be weighted and the weights can be empirically determined or user defined. In one illustrative embodiment, the weight associated with the penalty for having more than one stem can be overcome by the bonus associated with the edit distance, if the edit distance is very large (such as adding 4-5 characters, for instance). This means that if the rule requires 4 or 5 characters to be removed from the input word in order to obtain the stem, it is likely a correct analysis even if it is irregular. In addition, in one embodiment, the author can be provided with a plurality of selectable scoring options. Of course, other weighting schemes can be used as well.
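The interaction between the edit-distance bonus and the multi-stem penalty can be sketched as follows. The weights, the function names, and the choice to make the bonus proportional to the edit distance are assumptions for illustration; the description states only that a bonus applies for a non-zero edit distance, that a penalty applies for more than one stem, and that a large edit distance can outweigh the penalty.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (standard dynamic program)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def score(input_word, proposed_stem, num_stems,
          bonus_per_edit=1.0, multi_stem_penalty=3.0):
    """Score one possible analysis; both weights are hypothetical values."""
    total = bonus_per_edit * edit_distance(input_word, proposed_stem)
    if num_stems > 1:
        total -= multi_stem_penalty  # irregular templates are less likely
    return total


print(score("breathes", "breath", num_stems=1))    # rule strips "es"
print(score("breathes", "breathes", num_stems=1))  # pass-through rule "(1)"
```

With these assumed weights, a two-stem analysis that removes five characters still scores above a single-stem pass-through analysis, which mirrors the behavior described for large edit distances.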
In accordance with another embodiment of the present invention, a third factor used in scoring each possible analysis is based on whether siblings (other generated forms from the template of the proposed analysis) are actually found in the input word list. In other words, if one of the rules generates a hypothetical word that is actually found in the input word list (inputting and processing of an input word list is discussed below), this provides a bonus for the possible analysis. For instance, assume again that the input word is “breathes”. The generated forms will likely be “breathe”, “breathes”, “breathing” and “breathed”. Assume also that the input word list has the words “breathe” and “breathing”. Then, this possible analysis will get a bonus for each of those generated forms because they are found in the word list. Assume further that the input word list does not include the form “breathed”. This does not necessarily mean that the possible analysis is incorrect, but may simply mean that the word list is incomplete. Therefore, the more data that is input to the system, the better the performance that may be achieved by automatic template matching component 238.
However, a complete absence of siblings in the input word list may indicate that the possible analysis is incorrect. Assume for example that a possible analysis for a verb “walk” is identified in a regular template “verb-regular”, and that the generated forms and rules associated with the generated forms are as follows:
Present.Non3PersSing (1)
Present.3PersSing (1)s
Participle.Present (1)ing
Past (1)ed
It can be seen that the verb “breathes” will still match this template because it matches the third person singular form. The input word ends in “s” and it goes back to the proposed stem “breathe” upon reversing the third person singular form rule. (This assumes that stem 1 has no stem constraint or has a stem constraint satisfied by “breathe”.) However, the remaining inflected siblings will be incorrect. For instance, applying the Participle.Present rule to the proposed stem would result in the word “breatheing”, and applying the Past form rule would result in the word “breatheed”, neither of which is correct. Therefore, if none of these generated siblings are found in the input word list, the possible analysis may well be incorrect.
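The sibling comparison between the two competing analyses of “breathes” can be sketched as below. The function names and the contents of the word list are assumptions for illustration; the rule lists are those given for the “verb-silent e” and “verb-regular” templates.

```python
VERB_SILENT_E = ["(1)e", "(1)es", "(1)ing", "(1)ed"]
VERB_REGULAR = ["(1)", "(1)s", "(1)ing", "(1)ed"]


def siblings(rules, stem, input_word):
    """Generated forms of the analysis other than the input word itself."""
    forms = [rule.replace("(1)", stem) for rule in rules]
    return [form for form in forms if form != input_word]


def sibling_hits(rules, stem, input_word, word_list):
    """Count how many siblings actually occur in the input word list."""
    return sum(1 for form in siblings(rules, stem, input_word) if form in word_list)


# Hypothetical input word list; "breathed" is deliberately missing.
WORD_LIST = {"breathe", "breathes", "breathing"}

print(siblings(VERB_REGULAR, "breathe", "breathes"))
print(sibling_hits(VERB_SILENT_E, "breath", "breathes", WORD_LIST))
print(sibling_hits(VERB_REGULAR, "breathe", "breathes", WORD_LIST))
```

In this sketch the verb-regular analysis still finds the pass-through form “breathe” in the word list, but its inflected siblings “breatheing” and “breatheed” are absent, so it collects fewer hits than the “verb-silent e” analysis.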
Another factor which can be considered in scoring a possible analysis is frequency information. Some input word lists may have frequency associated with each entry in the word list. The frequency is indicative of how often the given word is found in the corpus from which the word list was extracted. This indicates which words are common and which are uncommon. The template scoring component 602 can apply a number of different rules, using frequency information, to score each possible analysis. For instance, assume there are four generated forms associated with a possible analysis and two of the generated siblings are in the word list and have a large frequency (for instance, in the “breathes” example discussed above, the siblings “breathe” and “breathes” would both be found in the word list and may be relatively high frequency).
However, assume some of the siblings have zero instances occurring in the input word list (again, using the “breathes” example mentioned above, the terms “breatheed” and “breatheing” have zero instances in the word list). This is good evidence that the possible analysis is incorrect. It is likely that (in this example which is in the English language and assumes a relatively large input word list) if some of the siblings are very common in the input word list, all of them will at least be present. Therefore, if some of the siblings are very common in the input word list but some of them are completely missing from the input word list, then a penalty may be applied to that possible analysis. Of course, this scoring component may be omitted when processing languages such as Finnish with vast numbers of generated forms per template.
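One way such a frequency-based penalty might be computed is sketched below. The threshold for “very common”, the penalty weight and the frequency counts are assumptions chosen for illustration; the description does not specify particular values.

```python
def frequency_penalty(forms, frequencies, common_threshold=100, penalty_per_missing=1.0):
    """Penalize an analysis when some siblings are common but others never occur.

    `frequencies` maps each word in the input word list to its corpus count.
    `common_threshold` and `penalty_per_missing` are illustrative assumptions.
    """
    counts = [frequencies.get(form, 0) for form in forms]
    has_common_sibling = any(count >= common_threshold for count in counts)
    missing = sum(1 for count in counts if count == 0)
    return penalty_per_missing * missing if has_common_sibling else 0.0


# Hypothetical frequency data for the "breathes" example.
FREQUENCIES = {"breathe": 500, "breathes": 200, "breathing": 300, "breathed": 150}

bad_analysis = ["breathe", "breathes", "breatheing", "breatheed"]
good_analysis = ["breathe", "breathes", "breathing", "breathed"]

bad_penalty = frequency_penalty(bad_analysis, FREQUENCIES)
good_penalty = frequency_penalty(good_analysis, FREQUENCIES)
```

An analysis whose siblings are all rare receives no penalty under this sketch, reflecting the point that missing forms in a sparse word list are weak evidence.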
Other scoring techniques can be used as well. For instance, a “restrictiveness score” can be added to the regular expressions that serve as stem constraints, and possible analyses that involve proposed stems that satisfy stem constraints are awarded bonuses proportional to this restrictiveness score. So even a possible analysis that arises from a content-free rule like “(1)” might get a high score if stem 1 has a very specific constraint that is satisfied by the proposed stem.
In other words, a rule effectively contains the stem constraint as a subpart, and there can be a global score for how likely a given word is to be a possible output of that rule. Satisfying the stem constraint can be as much a part of this as satisfying the other requirements in the rule and successful possible analyses can be awarded points accordingly.
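The restrictiveness bonus can be illustrated with a short sketch. The data structure, the example constraint and the numeric score below are hypothetical; only the idea of awarding a bonus proportional to the restrictiveness of a satisfied stem constraint comes from the description above.

```python
import re
from dataclasses import dataclass


@dataclass
class StemConstraint:
    pattern: str            # regular expression the whole stem must satisfy
    restrictiveness: float  # author-assigned score; higher means more specific


def constraint_bonus(constraint, stem, weight=1.0):
    """Return a bonus proportional to the restrictiveness of a satisfied constraint."""
    if constraint is None:
        return 0.0
    if re.fullmatch(constraint.pattern, stem):
        return weight * constraint.restrictiveness
    return 0.0


# Hypothetical constraint: the stem ends in a consonant followed by "e".
ends_in_consonant_e = StemConstraint(pattern=r".+[^aeiou]e", restrictiveness=0.8)

bonus_breathe = constraint_bonus(ends_in_consonant_e, "breathe")
bonus_walk = constraint_bonus(ends_in_consonant_e, "walk")
```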
Again, it is worth noting that template scoring component 602 can score the possible analyses using these, different, or additional factors, as desired by the user. Outputting the rank ordered possible analyses 608 is illustrated by block 630 in
By selecting any of the possible analyses in the possible analysis table 404, lexical data manager 240 displays the full analysis proposed by automatic template matching component 238.
It can be seen that the next time the user enters “dishes” into the text box 400, not only will automatic template matching component 238 output the rank ordered possible analyses 608 in the possible analysis table 404, but it also outputs an actual analysis 610 in the actual analyses table 406. Because an actual analysis appears in actual analysis table 406, the user knows that this word is already in the lexicon for some part of speech (which is visible from the actual analysis template name).
By selecting the entry in the actual analysis table 406, lexical data manager 240 displays the various ways in which the input word has already been matched in the lexicon. In this case, “dishes” only matches in one way as shown in
It may be more common, however, for a word to match two or more different lexical entries rather than to match the same lexical entry twice. For instance, assume that the input word is “talks”. If this word has already been properly entered into the lexicon, it will appear as a “verb-regular” analysis (as in “he talks a lot”) and also as a “noun-regular” analysis (as in “I went to five talks at the conference”).
Recall that
After the word list is received, lexicon generator 226 can use sort and grouping utilities 230 or other input data analysis utilities 232 (shown in
In addition, the other input data analysis utilities 232 may include such things as language dependent heuristics which can be run on the word list. For instance, assume that the word list not only contains a list of words, but also multi-word expressions (such as phrases and idiomatic expressions) and semantic information derived from the corpus from which the words were extracted. One heuristic employed by block 232 may include, for instance, sorting the word list into all words that followed the word “the” in the corpus from which they were extracted. This can be applied, for instance, if the author wishes to concentrate on entering nouns and adjectives into the lexicon. If the input word in the word list followed the word “the” as it was used in the text corpus from which it was extracted, it is very likely to be either a noun or an adjective. Other heuristics can be employed as well, of course, and this is only by way of example.
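The “follows the” heuristic might be sketched as below, assuming the corpus is available as a simple list of tokens. The function name and the sample sentence are illustrative only.

```python
from collections import Counter


def words_following(tokens, trigger="the"):
    """Count the words that immediately follow `trigger` in a tokenized corpus."""
    followers = Counter()
    for previous, current in zip(tokens, tokens[1:]):
        if previous.lower() == trigger:
            followers[current.lower()] += 1
    return followers


tokens = "The dog chased the red ball near the dog".split()
candidates = words_following(tokens)

# Likely nouns and adjectives, most frequent first.
ranked = [word for word, _ in candidates.most_common()]
```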
In addition, presort and preprocessing of the word list can be done by automatic template matching component 238. For instance, as described above with respect to
In any case, once the word list has been received and it has been pre-sorted or pre-processed or otherwise analyzed (if desired), words in the word list can be quickly assigned to templates. Instead of typing a word into text box 400, the user simply selects one of the words in the word list (such as by clicking on it). This is indicated by block 704 in
A number of other things should be mentioned with respect to the present invention. First, the present invention will illustratively enable description and linking of complex morphological phenomena in a relatively simple and intuitive way. This can be done by extending the hierarchical nature of the template classes and templates discussed above. In this embodiment, the rows (such as rows defining rules and generated forms) in the template are hierarchically arranged. One row is deemed to be a child of a parent row (or a descendent row is a child of an ancestor row) if it contains everything in the ancestor plus something else.
This simplifies representation of words in languages that allow multiple affixes, one added to the next, at the end of a word for example, or circumfixes (where the stem has a prefix and suffix). In addition, information can be accommodated and other morphology affixation as well. If a hierarchical representation of the affixes were not allowed, the system would have to represent the templates as thousands of separate rows. However, by allowing the rows to be represented hierarchically, the hierarchical tree is able to represent all combinatorial possibilities of the affixes much more efficiently and simply.
More specifically, by representing each affix as a separate data structure an author can build up structured templates that encode all possible affix combinations without having to enter and maintain multiple copies of the same information.
Two data types can be used to implement this: affix group and affix group class. These are directly parallel to the previously discussed template and template class data types, with only a few small differences. They can be viewed in terms of inheritance as shown in
Assume a template class base has the following properties: a name, and a tree of generated form slots. In addition, assume a template class has a dictionary form index.
An affix group class, on the other hand, has no extra properties. It does have an extra capability, however. It can be embedded in the generated forms tree of some other template class base, while template classes cannot.
Assume further that a template base (shown in
In addition, assume that a template has a set of one or more stems.
A template must belong to a template class. An affix group, on the other hand, has no extra properties. It does have an extra capability, however—it can be embedded in the rules tree of some other template base, while templates cannot. An affix group must belong to an affix group class.
There are two other important differences between templates and affix groups. First, lexical entries can belong to templates but they cannot belong to affix groups. Second, the rules in templates must make reference to one of the template's stems. The syntax of a template rule is, then, prefix(function(stem))suffix, where any of prefix, function and suffix may be null. (When function is null there is only one set of parentheses.) Examples include (1), (1)ed, pre(2), un(Transform(1))ing. Since affix groups do not have stems, rules in affix groups naturally cannot make reference to stem numbers. Affix group rules are also prohibited from using functions. The syntax is, then, prefix( )suffix, where either or both of prefix and suffix may be null. Examples include: ( ), ( )ed, pre( ), un( )ing.
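The two rule syntaxes can be illustrated with a small parser. This is a sketch under the assumption that prefixes and suffixes contain no parentheses and that function names consist of word characters; the regular expressions and function names are not taken from the described system.

```python
import re

# prefix(function(stem))suffix, where function is optional.
TEMPLATE_RULE = re.compile(
    r"(?P<prefix>[^()]*)"
    r"\((?:(?P<function>\w+)\((?P<fstem>\d+)\)|(?P<stem>\d+))\)"
    r"(?P<suffix>[^()]*)"
)

# prefix( )suffix: no stem number and no function.
AFFIX_GROUP_RULE = re.compile(r"(?P<prefix>[^()]*)\(\s*\)(?P<suffix>[^()]*)")


def parse_template_rule(rule):
    """Return (prefix, function or None, stem number, suffix), or None if invalid."""
    match = TEMPLATE_RULE.fullmatch(rule)
    if match is None:
        return None
    stem = match.group("fstem") or match.group("stem")
    return (match.group("prefix"), match.group("function"), int(stem), match.group("suffix"))


def parse_affix_group_rule(rule):
    """Return (prefix, suffix), or None if the rule is not a valid affix group rule."""
    match = AFFIX_GROUP_RULE.fullmatch(rule)
    if match is None:
        return None
    return (match.group("prefix"), match.group("suffix"))
```

Under this sketch, “un(Transform(1))ing” parses to the prefix “un”, the function “Transform”, stem 1 and the suffix “ing”, while “( )ed” is rejected as a template rule because it names no stem.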
As mentioned above, the primary distinction between template classes and templates is that template classes encode the “what” (which slots in a morphological paradigm are logically present) and templates encode the “how” (how specific word forms to fill those slots are computed). This extends to affix group classes and affix groups: affix group classes capture the “what” (which affixes are being treated as one logical unit) while affix groups capture the “how” (how those affixes are realized in different environments).
In one embodiment, an important difference between template (classes) and affix group (classes) is that template (classes) are inherently independent and can meaningfully stand on their own, while affix group (classes) are inherently dependent and are made fully concrete only in the context of how they are referenced. This is reflected most clearly in the differences in rule syntax above: template rules have an explicit input (the stem value), while affix group rules do not. As described below, an affix group rule is “attached” to some other rule, meaning that the output of that rule will implicitly become the input of the affix group rule (and fill the “( )” placeholder).
The formal structure of the generated forms trees of template classes and affix group classes and the rules trees of templates and affix groups will now be described.
First note that a tree of depth one is equivalent to a list. All “flat” templates are therefore already degenerate “structured” templates (meaning that the affix-related features strictly add functionality rather than change existing functionality). Generated forms trees are the primary type of tree. (Rules trees are secondary since they cannot have their structures edited independently of the generated forms trees they reflect).
Any given template class base (i.e., either a template class or an affix group class) has a set of generated form nodes. These are the top-level nodes in the generated forms tree. Each node has a name, a flag indicating whether the node is “standalone” (which is described below), and zero or more children of type affix group class. As a concrete example we can examine the affix group classes tense and number. Templates for these classes are shown in
In the embodiment shown, there are several important things to note. First, adding a child reference to an affix group class results in new generated form rows being automatically generated with names computed by concatenating the name of the generated form node plus “.” with the names of the rows in the child affix group class.
Also, even though 14B depicts an affix group class in edit mode, the rows belonging to tense are shaded out in number's tree. This is because although the two references to tense belong to number, the actual contents of tense do not. The author therefore cannot edit the names (“Past”, “Present” and “Future”), standalone status (all “true”) or children (no children) of the three generated form nodes in tense.
In addition, the impact of “Standalone?” being false for the two generated form nodes in number will not be obvious until we exit edit mode, when we will see the display shown in
Note from
Non-standalone nodes can also have empty names. This is how affix group classes can be embedded “directly” (at the top level of) template class bases. The second generated form node in the verb template class provides an example of this, and is shown in
In one illustrative embodiment, adding, inserting and deleting generated form nodes and affix group class children is accomplished by right-clicking on the relevant nodes. New generated form nodes are standalone by default. Examples of context menus that are displayed when the author right-clicks on the “Plur” node and “tense” node are shown in
Rules trees will now be described in greater detail. In one illustrative embodiment, the rules tree for any template base (i.e., either a template or an affix group) has precisely the same structure as the generated forms tree of the template base's template class base. That is, it has the same number of top-level nodes, each of which has the same number of children as the corresponding node in the generated forms tree. The type of information that lives on the nodes and the type of the children differ, however.
A rules tree node carries a single rule of the appropriate type (e.g., template rules for rules trees in templates, affix group rules for rules trees in affix groups). Its children, if any, are references to affix groups rather than to affix group classes. It does not have a name or anything corresponding to the “standalone” flag.
As described above, outside of edit mode, the generated forms tree is reduced to a list of named slots corresponding to the checked nodes in edit mode. Some of these slots may have names that were fully specified in the template class base in question but if any generated form nodes had children some names will be computed by reference to information stored in the referenced affix group classes. In a similar fashion a template or affix group's rules tree is reduced to a list of rules, some of which are fully specified in the template or affix group, but some of which may be computed by reference to information stored in referenced affix groups.
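The reduction of a generated forms tree to a list of named slots might be sketched as follows. The class names are hypothetical, as is the node name “Sing”. The sketch also assumes that a non-standalone node contributes no slot of its own and that an empty node name contributes no “.” separator, which is one reading of the description rather than a stated rule.

```python
from dataclasses import dataclass, field


@dataclass
class GeneratedFormNode:
    name: str
    standalone: bool = True
    children: list = field(default_factory=list)  # referenced affix group classes


@dataclass
class TemplateClassBase:
    """Stands in for either a template class or an affix group class."""
    name: str
    nodes: list = field(default_factory=list)  # top-level GeneratedFormNode objects


def slot_names(class_base):
    """Flatten the generated forms tree into the list of named slots."""
    names = []
    for node in class_base.nodes:
        if node.standalone:
            names.append(node.name)
        for child in node.children:
            for child_name in slot_names(child):
                # Assumption: an empty node name adds no "." separator.
                names.append(f"{node.name}.{child_name}" if node.name else child_name)
    return names


tense = TemplateClassBase("tense", [GeneratedFormNode(n) for n in ("Past", "Present", "Future")])
number = TemplateClassBase("number", [
    GeneratedFormNode("Sing", standalone=False, children=[tense]),
    GeneratedFormNode("Plur", standalone=False, children=[tense]),
])
number_slots = slot_names(number)
```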
There are two generated form nodes in number, and both Number-AfterCons and Number-AfterVowel give rules for them. On the rows corresponding to the references to tense, however, they indicate a choice of affix group rather than a rule. The effect of any particular choice is to “cascade” the rule on the rules tree node down “into” the referenced affix group. This creates a “compound” rule in a manner parallel to how “compound” generated form slot names are created. In both cases, in accordance with one embodiment, the generated object is not editable but only the inputs are.
Each reference to a particular affix group class in a generated forms tree can be realized by a different choice of affix group in a corresponding rules tree. In both Number-AfterCons and Number-AfterVowel the first reference to tense is realized with Tense-AfterCons (because the rules ( )in and ( )yin end in consonants), while the second reference is realized with Tense-AfterVowel (since ( )o and ( )yo end in vowels.)
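The “cascade” that produces a compound rule can be sketched as simple string composition, in which the parent rule fills the “( )” placeholder of the attached affix group rule. The function name is hypothetical, and the tense suffix “di” used below is invented purely for illustration; only the rule “( )in” comes from the example above.

```python
import re

AFFIX_GROUP_RULE = re.compile(r"([^()]*)\(\s*\)([^()]*)")


def attach(parent_rule: str, affix_group_rule: str) -> str:
    """Build a compound rule: the parent rule's output becomes the affix rule's input."""
    match = AFFIX_GROUP_RULE.fullmatch(affix_group_rule)
    if match is None:
        raise ValueError(f"not an affix group rule: {affix_group_rule!r}")
    prefix, suffix = match.groups()
    return f"{prefix}{parent_rule}{suffix}"


# A number rule cascaded into a hypothetical tense rule "( )di".
compound_affix = attach("( )in", "( )di")

# The same mechanism attaches an affix group rule to a template rule.
compound_template = attach("(1)", compound_affix)
```

Because the compound rule is computed from its inputs, changing either input rule changes every compound rule derived from it, which is consistent with only the inputs being editable.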
Note that even non-standalone generated forms with null names typically have non-null rules.
The “(.+)” term represents some string in the above transformations. In the first transformation the “<p|b>” represents that the string indicated by the “(.+)” ends in either the letter “p” or the letter “b”. The replacement in the first transformation indicates that the “p” or “b” on the end of the string is replaced by the letter “m”. Let the function identified above be labeled “MyFunction”. In that case, applying the function would act as follows:
MyFunction(“bap”)=“bam”
MyFunction(“cat”)=“cat”
Because “bap” is some string that ends in the letter “p” it matches pattern 1 and the letter “p” is replaced by the letter “m” to obtain “bam”. However, because “cat” is a string that does not end in “p” or “b” it is simply copied over as “cat”.
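The behavior of “MyFunction” can be reproduced with an ordinary regular expression. In this sketch the “<p|b>” notation is rendered as the Python character class `[pb]`, and a string that matches no transformation is copied unchanged; the list-of-transformations representation is an assumption.

```python
import re

# Each transformation is a (pattern, replacement) pair; the first match wins.
TRANSFORMATIONS = [
    (re.compile(r"(.+)[pb]"), r"\1m"),  # a final "p" or "b" becomes "m"
]


def my_function(text: str) -> str:
    """Apply the first transformation whose pattern matches the whole string."""
    for pattern, replacement in TRANSFORMATIONS:
        match = pattern.fullmatch(text)
        if match:
            return match.expand(replacement)
    return text  # no pattern matched: copy the input over unchanged
```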
Rules can use functions by referring to them by name and optionally adding additional affixes. Take the rule “(MyFunction(1))”, which simply calls “MyFunction” and does not add any affixes onto the result. Note that rules are many-to-one relations between strings. In other words, all rules, including rules that use functions, have only one output value for any given input value. When rules are reversed, however, it is possible that there are multiple possible inputs for a given output. When the rule “(MyFunction(1))” is reversed against the word “bam”, for instance, there are three possible results as proposed stems—“bap”, “bab” and “bam”. All three of these strings are transformed into “bam” when passed to MyFunction and, hence, to the rule “(MyFunction(1))”.
In accordance with one embodiment of the present invention, each of the proposed stems generated by reversing the rule is given its own possible analysis. Therefore, at block 616 it will be determined that the rule applies to this input word. Then, at block 620, each of the transformations is applied to generate three proposed stems, and each proposed stem is associated with a separate possible analysis as processing continues at block 622.
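Reversal of a function-based rule might be sketched as below. The approach shown, which proposes candidate inputs and keeps those that the forward function maps back to the observed word, is an assumption for illustration and not necessarily how the described system reverses functions. “my_function” is redefined here so the sketch is self-contained.

```python
import re

FINAL_P_OR_B = re.compile(r"(.+)[pb]")


def my_function(text: str) -> str:
    """Replace a final "p" or "b" with "m"; otherwise copy the string."""
    match = FINAL_P_OR_B.fullmatch(text)
    return match.group(1) + "m" if match else text


def reverse_my_function(output: str):
    """Return every proposed stem that my_function maps to `output`."""
    candidates = {output}
    if len(output) > 1 and output.endswith("m"):
        candidates.update(output[:-1] + letter for letter in "pb")
    # Keep only candidates that really produce the observed word.
    return sorted(c for c in candidates if my_function(c) == output)


proposed_stems = reverse_my_function("bam")
# Each proposed stem would then be given its own possible analysis.
analyses = [{"rule": "(MyFunction(1))", "stem": stem} for stem in proposed_stems]
```

Note that a word such as “bap” yields no proposed stems under this sketch, because the function never outputs a string ending in “p”.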
In one embodiment, functions are provided which feed other functions. For instance, a nested function can be used, such as (MyFunctionA(MyFunctionB(MyFunctionC(1)))). In such an embodiment, all proposed stems generated by reversing the rules are given their own possible analyses.
It is also worth noting that, in one embodiment of the present invention, a dynamic full form lexicon is maintained. In other words, all of the lexical entries represented in terms of templates have associated word forms that are generated from those templates. A full form lexicon is simply a list of all the words represented by all of the lexical entries in the lexicon. In accordance with one embodiment of the present invention, lexical data manager 240 maintains this list. Then, whenever any additions, deletions or changes are made to any given template, the dynamic full form lexicon is updated to reflect those changes.
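A dynamic full form lexicon might be maintained along the following lines. The class and method names are hypothetical, rules are limited to the simple prefix(1)suffix form, and the full list is simply regenerated on every change, which is a simplification chosen for brevity rather than efficiency.

```python
class FullFormLexicon:
    """Keeps the set of all generated word forms in step with templates and entries."""

    def __init__(self):
        self._templates = {}  # template name -> list of rules such as "(1)ed"
        self._entries = []    # (stem, template name) pairs
        self._forms = set()

    def set_template(self, name, rules):
        """Add or change a template, then refresh the full form list."""
        self._templates[name] = list(rules)
        self._rebuild()

    def add_entry(self, stem, template_name):
        """Add a lexical entry, then refresh the full form list."""
        self._entries.append((stem, template_name))
        self._rebuild()

    def _rebuild(self):
        self._forms = {
            rule.replace("(1)", stem)
            for stem, template_name in self._entries
            for rule in self._templates.get(template_name, [])
        }

    def __contains__(self, word):
        return word in self._forms


lexicon = FullFormLexicon()
lexicon.set_template("verb-regular", ["(1)", "(1)s", "(1)ing", "(1)ed"])
lexicon.add_entry("walk", "verb-regular")
had_walked = "walked" in lexicon

# Changing the template is reflected immediately in the full form list.
lexicon.set_template("verb-regular", ["(1)", "(1)s", "(1)ing"])
has_walked_now = "walked" in lexicon
```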
One way in which the dynamic full form lexicon is used is to identify which words in the input word list are already in the lexicon, and to generate the check marks adjacent those words as shown in
Another note is that the present invention may illustratively enable fuzzy affix matching. For instance, it is common in some languages that multiple affixes are appended or prepended to a given word. Fuzzy affix matching allows possible analyses to be generated for a given word, even if the entire analysis is not located for that word. In other words, suppose a word “stemxyzabcdef” consists of the stem “stem” followed by the three suffixes “xyz”, “abc” and “def” appended to one another. Also, assume that a rule has been set up for a stem with the suffix “xyz”. Fuzzy affix matching would identify that rule as a possible analysis for the word “stemxyzabcdef” even though there is nothing in the rule corresponding to the affixes “abc” and “def”. This provides the author with a starting point from which to modify the template to add additional generated forms and rules corresponding to the input word.
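Fuzzy affix matching for suffixes can be sketched as a search for a rule's suffix anywhere after the start of the word, rather than only at its end. The function name and return shape are assumptions for illustration.

```python
def fuzzy_suffix_matches(word: str, rule_suffix: str):
    """Return (proposed stem, unmatched remainder) pairs for a rule of the form (1)suffix.

    An exact match has an empty remainder; a fuzzy match leaves trailing
    affixes that the rule does not yet account for.
    """
    if not rule_suffix:
        return []
    matches = []
    start = word.find(rule_suffix, 1)  # position 1 keeps the proposed stem non-empty
    while start != -1:
        matches.append((word[:start], word[start + len(rule_suffix):]))
        start = word.find(rule_suffix, start + 1)
    return matches


fuzzy = fuzzy_suffix_matches("stemxyzabcdef", "xyz")
```

The unmatched remainder (“abcdef” in this example) indicates what additional generated forms and rules the author may need to add to the template.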
It can thus be seen that the present invention provides significant advantages over prior art systems. The present invention provides an overall system and process for generating a lexicon from a data corpus or word list all the way through a lexicon usable by one or more natural language applications or other components. The present invention also provides a mechanism for authoring template classes and templates and managing those templates and associated data structures, as well as the functions and constraints found in those data structures. The present invention further provides a system for automatically matching input words against the authored templates to generate the lexicon. Further, the present invention provides a system for pre-analyzing an input word list based on context, by filtering the input list to remove unwanted words, and also by scoring the input words against templates using the automatic template matching component. Of course, other advantages are derived as well.
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.