Natural language is one of the fundamental aspects of human behavior and an essential component of our lives. Human beings learn language by discovering the patterns and templates used to put together a sentence, a question, or a command. Natural language processing/understanding (NLP/U) assumes that if we can define those patterns and describe them to a computer, then we can teach a machine something of how we understand and communicate with each other. This work draws on research in a wide range of areas, most importantly computer science, linguistics, logic, psycholinguistics, and the philosophy of language. These different disciplines define their own sets of problems and the methods for addressing them. Linguists, for instance, study the structure of language itself and consider questions such as why certain combinations of words form sentences while others do not. Philosophers consider how words can mean anything at all and how they identify objects in the world. The goal of computational linguistics is to develop a computational theory of language, using the notions of algorithms and data structures from computer science. To build a computational model, one must take advantage of what is known in all the other disciplines.
Researchers work on many applications of natural language understanding, which can be divided into two major classes: text-based applications and dialogue-based applications.
Text-based applications involve the processing of written text, such as newspapers, reports, and manuals; these applications are reading-based. Text-based natural language research is ongoing in the applications listed below:
Dialogue-based applications involve communication between humans and computers, whether spoken or typed; that is, humans may use a microphone or a keyboard to interact and communicate with the computer. These applications include:
The essential task in these applications is to analyze, or parse, the texts in a system's database and the text that users input. That is, each sentence must be processed systematically and effectively. Most traditional approaches to parsing natural language sentences aim to recover complete, exact parses based on the integration of complex syntactic and semantic information. They search through the entire space of parses defined by the grammar and then seek the globally best parse by referring to heuristic rules or manual correction. For example, sentence (1a), taken from the Sinica Treebank (Sinica Treebank, 2002), is annotated as (1b).
(1) [The Chinese text of this example was not preserved in the source; the surviving annotation shows a bracketed structure whose labels include time:Dd, Head:VC2, goal, Head:Nac, and particle:Ta.]
The sentence structure in the Sinica Treebank is represented by employing the head-driven principle; that is, each sentence or phrase has a head leading it. A phrase consists of a head, arguments, and adjuncts. One can use the concept of the head to figure out the relationships among the phrases in a sentence. In example (1), the head of the NP (noun phrase), ‘he,’ is the agent of the verb ‘find’. Although the head-driven principle may prevent ambiguity in syntactic analysis (Chen et al., 1999), choosing the head of a phrase automatically may introduce errors. Another example (2) is extracted from the Penn Chinese TreeBank (The Penn Chinese Treebank Project, 2000).
(2) [Penn Chinese TreeBank parse tree; only the bracketing survives in the source — the Chinese text and node labels were not preserved.]
The Penn Chinese TreeBank provides a solid linguistic analysis of the selected text, based on current research in Chinese syntax and on the linguistic expertise of those involved in the Penn Chinese Treebank project, who annotate the text manually.
Another approach to parsing natural language sentences is based on shallow parsing, which is an inexpensive, fast, and reliable procedure. Shallow parsing (or chunking) does not deliver a full syntactic analysis but is limited to parsing smaller constituents such as noun phrases or verb phrases (Abney, 1996). For example (3), sentence (3a) can be processed as follows:
(3) [The Chinese text of this example was not preserved in the source. (3b) tags the words with the POS sequence N, Vt, Vt, N, De, N; (3c) groups them into three chunks of the form [NP …] [VP …] [NP …].]
In (3b), ‘N’ denotes a noun and ‘Vt’ denotes a transitive verb. In (3c), three chunks are generated: two NP chunks and one VP chunk. A chunk consists of syntactically correlated words in a sentence.
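As a minimal sketch of this kind of chunking (not the invention's own grammar), the following Python fragment groups a POS-tagged word sequence into NP and VP chunks by merging adjacent words of the same broad category. The words and tags are hypothetical illustrations; the tag set mirrors the N/Vt/De labels used in example (3):

```python
def chunk(tagged):
    """Toy shallow parser: group adjacent words into NP/VP chunks.
    Nouns (N...) open or extend NP chunks, verbs (V...) open or
    extend VP chunks; other tags (e.g. the particle De) simply
    attach to the chunk currently open."""
    chunks = []
    for word, pos in tagged:
        if pos.startswith("N"):
            label = "NP"
        elif pos.startswith("V"):
            label = "VP"
        else:
            label = None  # particles etc. join the open chunk
        if label is None and chunks:
            chunks[-1][1].append(word)
        elif chunks and chunks[-1][0] == label:
            chunks[-1][1].append(word)
        else:
            chunks.append([label or "NP", [word]])
    return chunks

# Hypothetical sentence with the POS pattern N Vt Vt N De N
print(chunk([("我", "N"), ("想", "Vt"), ("吃", "Vt"),
             ("新鲜", "N"), ("的", "De"), ("苹果", "N")]))
```

On this input the sketch yields two NP chunks and one VP chunk, matching the shape described for (3c).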
The present invention is a method for processing Chinese sentences that automatically transforms a Chinese sentence into a Triple representation based on shallow parsing, without manually annotating every sentence. Our method parses Chinese sentences by employing lexical and partial syntactic information to extract the more prominent entities in a Chinese sentence; the sentence is then transformed into a Triple representation. The lexical and syntactic information in our method refers to a lexicon with part-of-speech (POS) information and to phrase-level Chinese syntax, respectively. The Triple representation consists of three elements: the agent, predicate, and patient of a sentence.
The method of the present invention for processing Chinese sentences is divided into several steps, as shown in
The longest-word-first rule is simple and easy to implement. It works as follows: given a lexicon with POS information and a Chinese sentence, the leading substrings of the sentence are compared with the entries in the lexicon. The longest matched substring is then selected as a word, and the remaining substring becomes the string to be matched in the next round of matching, until the remaining substring is empty. In the word-filtering step (104), based on observations of real Chinese texts, the most important words are nouns and verbs. Therefore, words whose POS is Noun or Verb are kept; in addition, prepositions are reserved, since they serve as predicates other than verbs between noun phrases. For example (4), the relation sentence (4a) can be processed as (4b):
(4) [The Chinese text of this example was not preserved in the source; only the bracketing of the resulting word lists survives.]
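The segmentation step (102) and word-filtering step (104) described above can be sketched in Python as follows. The lexicon entries, words, and POS tags here are hypothetical illustrations, not data from the invention:

```python
def segment(sentence, lexicon):
    """Longest-word-first segmentation: repeatedly take the longest
    leading substring that appears in the lexicon as the next word."""
    words = []
    rest = sentence
    while rest:
        for n in range(len(rest), 0, -1):  # try the longest prefix first
            if rest[:n] in lexicon:
                words.append((rest[:n], lexicon[rest[:n]]))
                rest = rest[n:]
                break
        else:
            # character not covered by the lexicon: emit it unknown-tagged
            words.append((rest[0], "Unk"))
            rest = rest[1:]
    return words

def filter_words(tagged):
    """Word filtering: keep nouns, verbs, and prepositions only."""
    return [(w, pos) for w, pos in tagged if pos[0] in ("N", "V", "P")]

# Hypothetical lexicon mapping words to POS tags
lexicon = {"北京": "N", "北京大学": "N", "大学": "N",
           "研究": "Vt", "了": "De", "语言": "N", "在": "P"}
tagged = segment("北京大学研究了语言", lexicon)
# "北京大学" wins over "北京" because the longer match is preferred;
# the particle "了" (De) is then dropped by the filter
print(filter_words(tagged))
```

The longest-match loop makes the rule deterministic and cheap: each position is resolved in one pass, with no search over alternative segmentations.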
For parsing smaller constituents, such as noun phrases or verb phrases, in a Chinese sentence, the filtered word list is grouped into phrase-level chunks, as in example (5).
(5) [The Chinese text of this example was not preserved in the source; the surviving bracketing of (5c) shows word lists grouped into np and vp chunks.]
The present invention proposes a Triple representation, [A, Pr, Pa], which consists of three elements—agent, predicate, and patient—corresponding to the subject, verb/preposition, and object of a clause or sentence. The three elements, A, Pr, and Pa, are three word lists enclosed in square brackets [ ], as shown in (5c). In steps 102, 104, and 106, a sentence is processed into a sequence of word lists consisting of prominent words, as in (5b). Because Chinese is an SVO (Subject-Verb-Object) language (Li and Thompson, 1981), this simple syntax is employed to transform the output of phrase-level parsing into Triples. The Triple representation is illustrated in Definition 1.
Definition 1:
As illustrated in Definition 1, the Triple is a simple representation consisting of three elements, A, Pr, and Pa, which correspond to the Subject (noun phrase), Predicate (verb phrase), and Object (noun phrase), respectively, of a clause. No matter how many clauses a Chinese sentence contains, the Triples are extracted in order. For example (6), there are two Triples in (6b). In the second Triple of (6b), zero denotes a zero anaphor, which often occurs in Chinese texts.
(6) [The Chinese text of this example was not preserved in the source; the surviving bracketing shows two Triples, the second beginning [[zero], […], […]].]
The Triple Rule Set is built by referring to Chinese syntax. There are five kinds of Triples in the Triple Rule Set, which correspond to five basic clause patterns: subject+transitive verb+object, subject+intransitive verb, subject+preposition+object, preposition+noun phrase, and a noun phrase alone. The rules listed below are employed in order:
Triple Rule Set:
Triple1(A,Pr,Pa)→np(A), vtp(Pr), np(Pa).
Triple2(A,Pr,none)→np(A), vip(Pr).
Triple3(A,Pr,Pa)→np(A), prep(Pr), np(Pa).
Triple4(none,Pr,Pa)→prep(Pr), np(Pa).
Triple5(A,none,none)→np(A).
Here vtp(Pr) denotes that the predicate is a transitive verb phrase, which contains a transitive verb in the rightmost position of the phrase; likewise, vip(Pr) denotes that the predicate is an intransitive verb phrase, which contains an intransitive verb in the rightmost position of the phrase. In the rule Triple3, prep(Pr) denotes that the predicate is a preposition. If all the rules in the Triple Rule Set fail, the Triple Exception Rules, which refer to the phenomenon of zero anaphora in Chinese, are utilized:
Triple Exception Rules:
Triple1e1(zero,Pr,Pa)→vtp(Pr), np(Pa).
Triple1e2(A,Pr,zero)→np(A), vtp(Pr).
Triple1e3(zero,Pr,zero)→vtp(Pr).
Triple2e(zero,Pr,none)→vip(Pr).
Zero anaphora in Chinese generally occurs in the topic, subject, or object position. The rules Triple1e1, Triple1e3, and Triple2e reflect cases where the zero anaphor occurs in the topic or subject position. The rule Triple1e2 reflects cases where the zero anaphor occurs in the object position.
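The two rule sets above amount to an ordered pattern match over the chunk labels of a clause. The following Python sketch shows one way this could work; the function name, the input format (a clause as a list of (label, word-list) chunks), and the example words are assumptions for illustration, not the invention's implementation:

```python
def extract_triple(clause):
    """Apply the Triple Rule Set in order, then the Triple Exception
    Rules, to one clause given as a list of (label, words) chunks."""
    labels = tuple(label for label, _ in clause)
    words = [w for _, w in clause]
    # Triple Rule Set, tried in order
    if labels == ("np", "vtp", "np"):   # Triple1
        return [words[0], words[1], words[2]]
    if labels == ("np", "vip"):         # Triple2
        return [words[0], words[1], "none"]
    if labels == ("np", "prep", "np"):  # Triple3
        return [words[0], words[1], words[2]]
    if labels == ("prep", "np"):        # Triple4
        return ["none", words[0], words[1]]
    if labels == ("np",):               # Triple5
        return [words[0], "none", "none"]
    # Triple Exception Rules: fill zero anaphors when no rule matched
    if labels == ("vtp", "np"):         # Triple1e1: subject omitted
        return ["zero", words[0], words[1]]
    if labels == ("np", "vtp"):         # Triple1e2: object omitted
        return [words[0], words[1], "zero"]
    if labels == ("vtp",):              # Triple1e3: both omitted
        return ["zero", words[0], "zero"]
    if labels == ("vip",):              # Triple2e: subject omitted
        return ["zero", words[0], "none"]
    return None

# Hypothetical two-clause sentence: the second clause omits its
# subject, so Triple1e1 supplies a zero anaphor, as in example (6)
clauses = [[("np", ["他"]), ("vtp", ["找到"]), ("np", ["書"])],
           [("vtp", ["讀"]), ("np", ["它"])]]
print([extract_triple(c) for c in clauses])
```

Because the rules are tried in a fixed order, the main patterns always win over the exception patterns, matching the stated policy that the exception rules apply only when the Triple Rule Set fails.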