The present invention relates generally to the field of natural language processing. More specifically, the present invention relates to word segmentation.
Word segmentation refers to the process of identifying the individual words that make up an expression of language, such as text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, and performing natural language parsing and understanding, all of which benefit from an identification of individual words.
Performing word segmentation of English text is rather straightforward, since spaces and punctuation marks generally delimit the individual words in the text. Consider the English sentence in Table 1 below.
By identifying each contiguous sequence of spaces and/or punctuation marks as the end of the word preceding the sequence, the English sentence in Table 1 may be straightforwardly segmented as shown in Table 2 below.
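The delimiter-based approach described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the described system: the function name `segment_english` and the sample sentence are assumptions, and the regular expression treats every run of non-word characters as a delimiter, so contractions such as "don't" would be split.

```python
import re


def segment_english(text: str) -> list:
    """Return the words of `text`, splitting at each contiguous run of
    spaces and/or punctuation marks (hypothetical helper)."""
    # \w+ matches a maximal run of letters, digits or underscores, so
    # every run of spaces/punctuation between matches acts as a delimiter.
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    print(segment_english("The committee discussed this problem, yesterday afternoon."))
```

No comparable one-line rule exists for text whose word boundaries are implicit, which motivates the model-based segmentation discussed below.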
In Chinese text, word boundaries are implicit rather than explicit. Consider the sentence in Table 3 below, meaning “The committee discussed this problem yesterday afternoon in Buenos Aires.”
Despite the absence of punctuation and spaces from the sentence, a reader of Chinese would recognize the sentence in Table 3 as being comprised of the words separately underlined in Table 4 below.
Many methods and systems have been devised to provide word segmentation for languages such as Chinese and Japanese. In some systems, models are trained based on a corpus of segmented text. The models describe the likelihood of various segments appearing in a text string and provide an output indicative thereof. Developing a corpus to train the models is time-consuming and expensive. In many instances, the quality of the output of an associated word segmentation system depends largely upon the quality of the corpus used to train the model. As a result, a method for evaluating and developing corpora will aid in providing quality word segmentation.
The present invention relates to a corpus for use in training a language model. The corpus includes a plurality of characters and a plurality of morphological tags associated with a plurality of sequences of characters. The plurality of morphological tags indicate a morphological type of an associated sequence of characters and a combination of parts forming a morphological subtype.
In another aspect, a computer readable medium having instructions for performing word segmentation is provided. The instructions include receiving an input of unsegmented text and accessing a language model to determine a segmentation of the text. A morphologically derived word is detected in the text and an output indicative of segmented text and an indication of a combination of parts that form the morphologically derived word is provided.
Prior to discussing the present invention in greater detail, an embodiment of an illustrative environment in which the present invention can be used will be discussed.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Those skilled in the art can implement the description and/or figures herein as computer-executable instructions, which can be embodied on any form of computer readable media discussed below.
The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user-input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
During processing, the language processing system 200 can access a language model 206 in order to determine a segmentation for the input text 202. Language model 206 can be constructed from an annotated corpus that defines various types of words as well as an indication of the specific type. As appreciated by those skilled in the art, language processing system 200 can be useful in various situations such as spell checking, grammar checking, synthesizing speech from text, speech recognition, information retrieval and performing natural language parsing and understanding to name a few. Additionally, language model 206 may be developed based on the particular application for which language processing system 200 is used.
In addition to providing segmentation, system 200 also provides an indication of word type for each of the segmented words. In one embodiment, Chinese words are defined as one of the following four types: (1) entries in a given lexicon (lexicon words or LWs hereafter), (2) morphologically derived words (MDWs), (3) factoids such as Date, Time, Percentage, Money, etc., and (4) named entities (NEs) such as person names (PNs), location names (LNs), and organization names (ONs). Various subtypes can also be defined. Given the definitions of these types of words, system 200 can provide an output indicative of segmentation and word type. For example, consider the unsegmented sentence in Table 5 below, meaning “Friends happily go to Professor Li Junsheng's home for lunch at twelve thirty.”
An exemplary output of system 200 is shown in Table 6 below. Square brackets indicate word boundaries and a “+” indicates a morpheme boundary. Tags are provided within the brackets to indicate the various types and subtypes of words within the sentence.
In order to provide segmentation, language model 206 detects word types in the input text 202. For lexicon words, word boundaries are detected if the word is contained in the lexicon. For morphologically derived words, morphological patterns are detected, e.g. (which means friend+s) is derived by affixation of the plural affix to the noun (MA_S is a tag that indicates a suffixation pattern), and (which means happily) is a reduplication of (happy) (MR_AABB is a tag that indicates an AABB reduplication pattern).
In the case of factoids, their types and normalized forms are detected, e.g. 12:30 is the normalized form of the time expression (TIME is a tag that indicates a time expression). For named entities, subtypes are detected, e.g. (Li Junsheng) is a person name (PN is a tag that indicates a person name).
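The two morphological patterns mentioned above (suffixation with the plural affix, tagged MA_S, and AABB reduplication, tagged MR_AABB) can be illustrated with a small rule-based sketch. The function name `detect_mdw`, the return shape, the requirement that the base form be in the lexicon, and the Chinese example strings are all assumptions made for illustration; the description does not specify how language model 206 detects these patterns.

```python
from typing import List, Optional, Set, Tuple

PLURAL_SUFFIX = "们"  # plural affix, as in 朋友们 (friend + s)


def detect_mdw(word: str, lexicon: Set[str]) -> Optional[Tuple[str, List[str]]]:
    """Return (tag, parts) if `word` matches one of two illustrative
    morphological patterns, else None (hypothetical helper)."""
    # Suffixation: a lexicon stem followed by the plural affix.
    if len(word) > 1 and word.endswith(PLURAL_SUFFIX):
        stem = word[:-1]
        if stem in lexicon:
            return ("MA_S", [stem, PLURAL_SUFFIX])
    # AABB reduplication: characters 1-2 are equal, characters 3-4 are
    # equal, and the base form AB is a lexicon word.
    if len(word) == 4 and word[0] == word[1] and word[2] == word[3]:
        base = word[0] + word[2]
        if base in lexicon:
            return ("MR_AABB", [base])
    return None


if __name__ == "__main__":
    lexicon = {"朋友", "高兴"}  # friend, happy
    print(detect_mdw("朋友们", lexicon))
    print(detect_mdw("高高兴兴", lexicon))
```

Returning the parts alongside the tag mirrors the idea of reporting both the morphological type and the combination of parts that form the derived word.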
Language model 206 can be created from an annotated corpus.
At step 258, the extracted list can be manually checked if desired to filter out any noise or errors within the list. It is then determined whether the list has sufficient coverage of the defined words and rules at step 260. In one embodiment, the list may be compared to a balanced, independent test corpus having a wide variety of domains and styles. For example, the domains and styles may include text related to culture, economy, literature, military, politics, science and technology, society, sports, computers and law to name a few. Alternatively an application specific corpus may be used having broad coverage of a particular application. If it is determined that the list has sufficient coverage, the corpus is then tagged at step 262. The tagging of the corpus can be performed as discussed below. At step 264, the tagged corpus can be checked and any errors may be corrected. At step 266, the resulting corpus is used as a seed corpus to tag a larger amount of text as a training or testing corpus. As a result, an annotated corpus is developed that can be evaluated using method 280 in
In order to evaluate a language model, the output of a word segmentation system using the model can be compared to a standard annotated testing corpus that serves as a standard output of a segmentation system. To achieve a reliable evaluation, a raw (unannotated) test corpus may be chosen that is independent, balanced and of appropriate size. An independent test corpus will have a relatively small overlap with the annotated corpus used to train the language model. A balanced corpus contains documents covering a wide variety of domains, styles and time periods. In order to be large enough, one embodiment of a test corpus includes approximately one million Chinese characters. After developing the test corpus, the corpus is manually annotated to be used as a standard output of a Chinese word segmentation system given the test corpus. The test corpus can be annotated using the tagging specification described below or another tagging specification.
Given the annotated test corpus, the performance of a language model can be evaluated quantitatively. Let “S” be the total number of word tokens in the standard test set, “E” the total number of word tokens in the output of the word segmentation system being evaluated on the test set, and “M” the number of word tokens in the output that exactly match word tokens in the standard test set. Equations 1-3 below define precision, recall and an F-score in terms of these values.
Precision=M/E (1)
Recall=M/S (2)
F=2×Precision×Recall/(Precision+Recall) (3)
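As an illustration of equations 1-3, the following sketch computes the three values from two segmentations of the same character string. It assumes that a token "exactly matches" when it covers the same character span in both segmentations; the function names `to_spans` and `evaluate` are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def evaluate(standard, output):
    """Return (precision, recall, f_score) per equations 1-3."""
    s_spans, e_spans = to_spans(standard), to_spans(output)
    m = len(s_spans & e_spans)                            # M: matching tokens
    precision = m / len(e_spans) if e_spans else 0.0      # M / E
    recall = m / len(s_spans) if s_spans else 0.0         # M / S
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Standard: ab | c | de    Output: ab | c | d | e
    print(evaluate(["ab", "c", "de"], ["ab", "c", "d", "e"]))
```

In the usage example S = 3, E = 4 and M = 2, so precision is 2/4, recall is 2/3 and the F-score is 4/7.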
Furthermore, the evaluation may be performed on various subtypes according to equations 1-3 above. For example, a person name performance evaluation may be conducted where S_PN is the total number of person name tokens in the standard test corpus, E_PN is the total number of person name tokens in the output of the word segmentation system being evaluated, and M_PN is the number of person name tokens in the output that exactly match person names in the standard test set. As a result, the performance equations are:
Precision_PN=M_PN/E_PN (4)
Recall_PN=M_PN/S_PN (5)
F_PN=2×Precision_PN×Recall_PN/(Precision_PN+Recall_PN) (6)
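The subtype evaluation of equations 4-6 can be sketched by restricting both token sets to a single tag before counting. The representation of a token as a `(start, end, tag)` tuple and the function name `evaluate_subtype` are assumptions for illustration; a token is counted as matching only when both its span and its tag agree.

```python
def evaluate_subtype(standard, output, tag):
    """Return (precision, recall, f_score) for tokens carrying `tag`.

    `standard` and `output` are iterables of (start, end, tag) tuples.
    """
    s_tokens = {t for t in standard if t[2] == tag}   # counted as S_tag
    e_tokens = {t for t in output if t[2] == tag}     # counted as E_tag
    m = len(s_tokens & e_tokens)                      # M_tag
    precision = m / len(e_tokens) if e_tokens else 0.0
    recall = m / len(s_tokens) if s_tokens else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    standard = [(0, 3, "PN"), (3, 5, "LW"), (5, 8, "PN")]
    output = [(0, 3, "PN"), (3, 5, "PN"), (5, 8, "LN")]
    print(evaluate_subtype(standard, output, "PN"))
```

Here the system output finds one of the two person names and adds one spurious person name, so precision, recall and F-score for PN are all 0.5.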
It is further useful to compare other system results in evaluating performance of language models. For example, it may be useful to only compare various portions of outputs of different word segmentation systems such as (1) person names, (2) location names, (3) organization names, (4) overlapping ambiguous strings and (5) covering ambiguous strings. By only evaluating a subset of the output of the segmentation systems, a better idea of where errors are occurring in segmentation can result.
In order to develop annotated corpora, a tagging specification is used to consistently tag the corpora given the definitions of Chinese word types described above. Lexicon words are delimited by brackets without additional tagging. Other types are tagged as provided below.
The format in
Split includes a set of expressions that are separate words at the syntactic level but single words at the semantic level. For example, a character string ABC may represent the phrase “already ate”, where the bi-character word AC represents the word “ate” and is split by the particle character B representing the word “already”. Split includes two subtypes. One subtype involves inserting a character or characters between a verb and an object, and the other inserts an object inside the phrase “qilai”. Merging occurs where one word consisting of two characters and another word consisting of two characters are combined to form a single word, and includes three subtypes. A head particle occurs when combining a verb character with other characters to form a word and includes two subtypes: one combines an adjective with a direction, and the other combines a verb with a direction.
The tagging format for named entities and factoids is presented in Table 7 below. Format-1 includes simple tags for various types and subtypes to help facilitate quick and easy tagging by a human. For example, the named entities for person, location and organization are simply tagged as P, L and O, respectively. Format-2 represents tagging using the Standard Generalized Markup Language (SGML) according to the Second Multilingual Entity Task Evaluation (MET-2). If desired, a transformation between format-1 and format-2 can be realized through a suitable transformation program.
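A transformation program of the kind mentioned above might look like the following sketch. It handles only the three named entity tags P, L and O, and it rests on two assumptions that should be checked against Table 7: that a format-1 tag is written as `[P text]` with the tag letter followed by a space, and that the format-2 counterpart is an SGML `ENAMEX` element with a `TYPE` attribute as used in MET-2. The function name `to_format2` is hypothetical.

```python
import re

# Assumed mapping from format-1 tag letters to MET-2 ENAMEX types.
ENAMEX_TYPES = {"P": "PERSON", "L": "LOCATION", "O": "ORGANIZATION"}

# Assumed format-1 syntax: "[P text]", with no nested brackets.
FORMAT1_TAG = re.compile(r"\[([PLO]) ([^\[\]]+)\]")


def to_format2(text: str) -> str:
    """Rewrite assumed format-1 entity tags as SGML ENAMEX elements."""
    def replace(match):
        entity_type = ENAMEX_TYPES[match.group(1)]
        return '<ENAMEX TYPE="{}">{}</ENAMEX>'.format(entity_type, match.group(2))
    return FORMAT1_TAG.sub(replace, text)


if __name__ == "__main__":
    print(to_format2("[P Li Junsheng] visited [L Beijing]"))
```

Subtype suffixes and factoid tags are deliberately left out, since their exact spelling depends on the table.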
Given the tagging format in Table 7, named entities and factoids within corpora can be easily tagged to provide annotated corpora. An example of tagging in format-1 and format-2 is provided below.
Tag in Format-1:
It is useful to provide general guidelines when tagging corpora to ensure consistency and accuracy. The following description provides these guidelines.
1. Proper Nouns are those NEs with objective and specific meanings; NEs with abstract and general meanings are not included.
E.g.: the expressions ‘foreigner’ and ‘girl’ are not Proper Nouns.
2. For a complex Proper Noun, embedded tagging is not allowed. That is, the maximum matching approach applies: the candidate having the greatest number of characters is the one that is tagged.
3. TIMES, NUMEX, MEASUREX and ADDRESS that are embedded in Person Name, Location Name and Organization Name are not to be tagged.
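Guidelines 2 and 3 both amount to keeping only the outermost entity when candidates are nested. A minimal sketch of that filtering step follows, assuming candidates are given as `(start, end, tag)` spans over the text; the function name `keep_maximal` and the span representation are illustrative and not taken from the specification.

```python
def keep_maximal(spans):
    """Drop every candidate span that is embedded in a strictly longer one.

    `spans` is a list of (start, end, tag) tuples with end exclusive.
    """
    kept = []
    for span in spans:
        start, end, _ = span
        embedded = any(
            other_start <= start and end <= other_end
            and (other_end - other_start) > (end - start)
            for other_start, other_end, _ in spans
        )
        if not embedded:
            kept.append(span)
    return kept


if __name__ == "__main__":
    # An organization name covering positions 0-4 that embeds a location
    # at positions 0-2, plus an independent location at positions 6-8.
    candidates = [(0, 4, "O"), (0, 2, "L"), (6, 8, "L")]
    print(keep_maximal(candidates))
```

The sketch does not cover the decomposable expressions discussed next, where an entity inside a larger expression is still tagged.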
If the annotators are not sure whether the expression is decomposable or not, then the expression is treated as decomposable, and the Entity within it is to be tagged. E.g. [L_ms “Hong Kong Foot”, with the same meaning as athlete's foot. The expression as a whole is non-decomposable. According to the guideline, the word ‘Hong Kong’ can be tagged as a Location name, ‘L_ms’. E.g. [ord “Forty-sixth Pacific Asia travel Association annual meeting”, in the guideline the expression is treated as decomposable:
‘Pacific Asia travel Association’ is tagged as an organization, while ‘Pacific Asia travel Association annual meeting’ is not an organization.
For an expression ‘Person Name+thought (or: theory, law, ideology)’, the whole expression is to be tagged as ‘p-ms’
In general, do not tag terms ending in “force” as ORGANIZATION. [L “West Africa peacekeeping force”, “military base” is to be tagged as LOCATION, NOT ORGANIZATION. [ “Peterson air military base”
9. For a Name Entity (Person name, Location name, Organization name), if it is a kind of multimedia (TV & Radio shows, movies and books), product or treaty, it is to be tagged with the “-ms” tag.
[P-ms “Deng Xiaoping (CL-for-film)'s release, i.e. the release of the film “Deng Xiaoping”
Since ‘Ding Xiao Ping’ is the title of a TV program, according to the guideline it is to be tagged as ‘P-ms’.
If a Name Entity is embedded in an acronym of an entity, then it is not to be tagged. [O, means no mark up for
1. Titles of Person
Titles and role names are not considered part of a person's name.
However, generational designators are considered part of a person's name.
When a person's title falls between the surname and the given name, include the title.
If personal names appear as the titles of multimedia (TV and radio shows, movies and books), products or treaties, the names are to be tagged as ‘p_ms’.
<<[P_ms “Mona Lisa”, as the title of a painting (or title of a book), is to be tagged “P_ms”.
In the following five cases, the proper names are not to be tagged as Person: laws named after people, court cases named after people, weather formations named after people, diseases named after people, and prizes named after people.
Generally, a Person Name consists of two parts: Family Name (FN) and Given Name (GN).
The strings that are tagged as LOCATION include: oceans, continents, countries, provinces, counties, cities, regions, streets, villages, towns, airports, military bases, roads, railways, bridges, rivers, seas, channels, sounds, bays, straits, sand beaches, lakes, parks, mountains, plains, meadows, mines, exhibition centers, etc., fictional or mythical locations, and certain structures, such as the Eiffel Tower and Lincoln Monument.
[L “Korea south and north dialogue”, tag on Korea but no tag on south/north. [L “conflict between Arab and Israel”, tag on Israel but no tag on Arab since it does not refer to a specific country.
“epicenter located at north 36.0 degrees east 95.9 degrees”.
1. If a Location entity is embedded in another Location entity, the whole entity is to be tagged.
Compound expressions in which place names are listed in succession are to be tagged as separate instances of Location. [L [L [L “Jilin province Yanbian Korean autonomous region Tumen municipality”.
3. Transnational Locative Entity Expressions
[L “west Africa country leader” [L “Asia & Pacific Rim”, tagged as one entity [L “western hemisphere countries” No mark up.
Subnational region names:
Do tag the location names of the form x-it, where x is a location. “using Sichuan words”, tag on Location on
7. Do not tag location names which are part of the names, ending in or of ethnic groups.
In the expressions and are not to be tagged as Location. However, in the expressions
8. Normal Pattern of Location
Proper names that are to be tagged as Organization include stock exchanges, multinational organizations, businesses, TV or radio stations, political parties, religious groups, orchestras, bands, or musical groups, unions, non-generic governmental entity names such as “congress” or “chamber of deputies,” sports teams and armies (unless designated only by country names, which are tagged as Location), as well as fictional organizations.
Corporate or organization designators are considered part of an organization name. A basic principle for Location tagging is to use the maximum matching approach.
Normal Pattern for Organization
1. National (or international) legislative bodies and departments or ministries are to be tagged as Organization.
In this case, tagging A is chosen by default.
2.6 In the case that annotators do not have enough knowledge to decide whether organization begins with a location.
E.g.: in the expression “ annotators are not sure whether is a location name. However, it is clear that once this string is removed, the remaining string has no specific referent. Therefore, according to 2.1, the expression is to be tagged as:
If the phrases “ . . . ” refer to “Congress” or “Chamber of deputies”, then they are to be tagged as Organization. Notice that session meetings of Congress (or Chamber of deputies) are not to be tagged as Organization, because they are events.
If the Embassy descriptor is contiguous with the country/district it represents, then the country/district is to be tagged as part of the Organization.
“go to Honduras Embassy in Hong Kong” If the Embassy descriptor is contiguous with the geographic location, then mark any locations separately as Location, and do not tag the embassy as an Organization.
[L [L “U.S. going through stationed at Kinshasa embassy and other normal channels”.
6. Manufacturer and Product
In cases where the manufacturer and the product are named, the manufacturer is to be tagged as Organization, while the product is not to be tagged. Products must be defined loosely to include manufactured products (e.g., vehicles), as well as computed products (e.g., stock indexes) and media products (e.g., television shows).
Do not mark the term “center” by itself as an Organization. However, do mark “party center” as an Organization.
The TIME type is defined as a temporal unit shorter than a full day, such as “second, minute, or hour”. The DATE sub-type is a temporal unit of a full day or longer, such as “day, week, month, quarter, year(s), century, etc.” The DURATION sub-type captures durations of time.
1. Date
For the form string duration, the entire phrase is tagged as dat_MET; because the duration is embedded in the DAT expression, the duration itself is not to be tagged.
Do not tag the “spring” in “Spring couplets”.
5. Special Case:
If two time expressions are of different sub-types, then they are to be tagged separately. If the two expressions are non-decomposable, then they are to be tagged together.
If a location entity is embedded in a time expression, the mark ‘MET’ is introduced to refer to the MET-2 guideline. “ER99” can be used to tag according to an alternative specification.
Expressions such as “last year”, “yesterday” and “this morning” are to be tagged according to MET-2. Annotators should pay attention to the difference between the two specifications and use the extra mark accordingly.
For the expression “this morning”, ER-99 treats it as a relative time entity that is not to be tagged, while in MET-2 relative time is to be tagged.
For the expression “quite a few years”, ER-99 treats it as a fixed time duration that is to be tagged, while “many years” is a non-fixed duration and is not to be tagged.
The expression “one year” is to be tagged as Duration.
The expression “each year”/“annual, yearly”
1. Percentage
If the integer/fraction/decimal has a number unit as a modifier, then the number unit is to be tagged.
[int “several ‘jia’ factories” [int 5 “one family with five ‘kou’ persons” [int 58 “58 times”.
4. Special case
MEASUREX includes: Age, Weight, Length, Temperature, Angle, Area, Capacity, Speed and Rate.
Note that other units of weights and measures in physics and chemistry are to be tagged as “mea”.
ADDRESX includes: Email, Phone, Fax, Telex, WWW.
A telephone or fax number is to be tagged only if there is a designator such as “tel,
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.