Methods and Systems for Automated Text Correction

Information

  • Patent Application
  • 20130325442
  • Publication Number
    20130325442
  • Date Filed
    September 23, 2011
  • Date Published
    December 05, 2013
Abstract
The present embodiments demonstrate systems and methods for automated text correction. In certain embodiments, the methods and systems may be implemented through analysis according to a single text correction model. In a particular embodiment, the single text correction model may be generated through analysis of both a corpus of learner text and a corpus of non-learner text.
Description
BACKGROUND

1. Field of the Invention


This invention relates to methods and systems for automated text correction.


2. Description of the Related Art


Text correction is often difficult and time consuming. Additionally, it is often expensive to edit text, particularly involving translations, because editing often requires the use of skilled and trained workers. For example, editing of a translation may require intensive labor to be provided by a worker with a high level of proficiency in two or more languages.


Automated translation systems, such as certain online translators, may alleviate some of the labor intensive aspects of translation, but they are still not capable of replacing a human translator. In particular, automated systems do a relatively good job of word to word translation, but the meaning of a sentence is often lost because of inaccuracies in grammar and punctuation.


Certain automated text editing systems do exist, but such systems generally suffer from inaccuracy. Additionally, prior automated text editing systems may require a relatively large amount of processing resources.


Some automated text editing systems may require training or configuration to edit text accurately. For example, certain prior systems may be trained using an annotated corpus of learner text. Alternatively, some prior art systems may be trained using a corpus of non-learner text that is not annotated. One of ordinary skill in the art will recognize the differences between learner text and non-learner text.


Outputs of standard automatic speech recognition (ASR) systems typically consist of utterances where important linguistic and structural information, such as true case, sentence boundaries, and punctuation symbols, is not available. Linguistic and structural information improves the readability of the transcribed speech texts, and assists in further downstream processing, such as in part-of-speech (POS) tagging, parsing, information extraction, and machine translation.


Prior punctuation prediction techniques make use of both lexical and prosodic cues. However, prosodic features, such as pitch and pause duration, are often unavailable without the original raw speech waveforms. In some scenarios where further natural language processing (NLP) tasks on the transcribed speech texts become the main concern, speech prosody information may not be readily available. For example, in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT), only manually transcribed or automatically recognized speech texts are provided, but the original raw speech waveforms are not available.


Punctuation insertion conventionally is performed during speech recognition. In one example, prosodic features together with language model probabilities were used within a decision tree framework. In another example, insertion in the broadcast news domain included both finite state and multi-layer perceptron methods for the task, where prosodic and lexical information was incorporated. In a further example, a maximum entropy-based tagging approach to punctuation insertion in spontaneous English conversational speech, including the use of both lexical and prosodic features, was exploited. In yet another example, sentence boundary detection was performed by making use of conditional random fields (CRF). The boundary detection was shown to improve over a previous method based on the hidden Markov model (HMM).


Some prior techniques consider the sentence boundary detection and punctuation insertion task as a hidden event detection task. For example, an HMM may describe a joint distribution over words and inter-word events, where the observations are the words, and the word/event pairs are encoded as hidden states. Specifically, in this task word boundaries and punctuation symbols are encoded as inter-word events. The training phase involves training an n-gram language model over all observed words and events with smoothing techniques. The learned n-gram probability scores are then used as the HMM state-transition scores. During testing, the posterior probability of an event at each word is computed with dynamic programming using the forward-backward algorithm. The sequence of most probable states thus forms the output which gives the punctuated sentence. Such an HMM-based approach has several drawbacks.
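By way of illustration only, the forward-backward computation of per-word event posteriors may be sketched as follows. This is a minimal sketch under simplifying assumptions, not the prior-art implementation: each hidden state is taken to be the event that follows a word, the state names (NONE, PERIOD) are hypothetical, and every probability is invented for the example rather than estimated from a smoothed n-gram language model.

```python
def forward_backward(obs, states, start, trans, emit):
    """Return, for each position, the posterior P(state | all observations)."""
    n = len(obs)
    # Forward pass: alpha[t][s] = P(obs[0..t], state at t is s)
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, n):
        alpha.append({
            s: emit[s][obs[t]] * sum(alpha[t - 1][r] * trans[r][s] for r in states)
            for s in states
        })
    # Backward pass: beta[t][s] = P(obs[t+1..] | state at t is s)
    beta = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in states:
            beta[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] for r in states
            )
    # Combine and normalize to obtain the posterior at each position.
    posteriors = []
    for t in range(n):
        z = sum(alpha[t][s] * beta[t][s] for s in states)
        posteriors.append({s: alpha[t][s] * beta[t][s] / z for s in states})
    return posteriors


# Toy model (all values are assumptions made for this example).
STATES = ["NONE", "PERIOD"]
START = {"NONE": 0.9, "PERIOD": 0.1}
TRANS = {
    "NONE": {"NONE": 0.7, "PERIOD": 0.3},
    "PERIOD": {"NONE": 0.9, "PERIOD": 0.1},
}
EMIT = {
    "NONE": {"i": 0.4, "am": 0.4, "here": 0.2},
    "PERIOD": {"i": 0.1, "am": 0.1, "here": 0.8},
}

posteriors = forward_backward(["i", "am", "here"], STATES, START, TRANS, EMIT)
# Most probable event after each word, chosen independently per position.
events = [max(p, key=p.get) for p in posteriors]
```

Selecting the most probable event at each position independently, as above, mirrors the description of taking the most probable state at each word.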


First, the n-gram language model is only able to capture surrounding contextual information. However, modeling of longer range dependencies may be needed for punctuation insertion. For example, the method is unable to effectively capture the long range dependency between the initial phrase “would you” which strongly indicates a question sentence, and an ending question mark. Thus, special techniques may be used on top of using a hidden event language model in order to overcome long range dependencies.


Prior examples include relocating or duplicating punctuation symbols to different positions of a sentence such that they appear closer to the indicative words (e.g., “how much” indicates a question sentence). One such technique suggested duplicating the ending punctuation symbol to the beginning of each sentence before training the language model. Empirically, the technique has demonstrated its effectiveness in predicting question marks in English, since most of the indicative words for English question sentences appear at the beginning of a question. However, such a technique is specially designed and may not be widely applicable in general or to languages other than English. Furthermore, a direct application of such a method may fail in the event of multiple sentences per utterance without clearly annotated sentence boundaries within an utterance.
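By way of illustration only, the preprocessing step of duplicating the ending punctuation symbol to the beginning of a sentence may be sketched as follows. The function name and the set of sentence-final marks are assumptions made for this example; the sketch also assumes one sentence per token list, which is exactly the condition under which the technique is said to apply.

```python
def duplicate_end_punct(tokens, marks=(".", "?", "!")):
    """Copy a sentence-final punctuation token to the front of the sentence.

    The returned list can then be used as language-model training text, so
    that the punctuation symbol sits next to sentence-initial cue words.
    """
    if tokens and tokens[-1] in marks:
        return [tokens[-1]] + list(tokens)
    # No recognized ending mark: leave the sentence unchanged.
    return list(tokens)


augmented = duplicate_end_punct(["how", "much", "is", "it", "?"])
```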


Another drawback associated with such an approach is that the method encodes strong dependency assumptions between the punctuation symbol to be inserted and its surrounding words. Thus, it lacks the robustness to handle cases where noisy or out-of-vocabulary (OOV) words frequently appear, such as in texts automatically recognized by ASR systems.


Grammatical error correction (GEC) has also been recognized as an interesting and commercially attractive problem in natural language processing (NLP), in particular for learners of English as a foreign or second language (EFL/ESL).


Despite the growing interest, research has been hindered by the lack of a large annotated corpus of learner text that is available for research purposes. As a result, the standard approach to GEC has been to train an off-the-shelf classifier to re-predict words in non-learner text. Learning GEC models directly from annotated learner corpora is not well explored, nor are methods that combine learner and non-learner text. Furthermore, the evaluation of GEC has been problematic. Previous work has either evaluated on artificial test instances as a substitute for real learner errors or on proprietary data that is not available to other researchers. As a consequence, existing methods have not been compared on the same test set, leaving it unclear where the current state of the art really is.


The de facto standard approach to GEC is to build a statistical model that can choose the most likely correction from a confusion set of possible correction choices. The way the confusion set is defined depends on the type of error. Work in context-sensitive spelling error correction has traditionally focused on confusion sets with similar spelling (e.g., {dessert, desert}) or similar pronunciation (e.g., {there, their}). In other words, the words in a confusion set are deemed confusable because of orthographic or phonetic similarity. Other work in GEC has defined the confusion sets based on syntactic similarity, for example all English articles or the most frequent English prepositions form a confusion set.
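By way of illustration only, choosing a correction from a confusion set may be sketched as follows. This is a deliberately simple stand-in for a statistical model: it counts how often each member of the confusion set was seen between a given left and right neighbor, and it keeps the writer's original word unless another member scores strictly higher. The function names and the training sentences are assumptions made for this example.

```python
from collections import Counter


def train_context_counts(sentences, confusion_set):
    """Count (left word, confusable word, right word) triples in training text."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            if tokens[i] in confusion_set:
                counts[(tokens[i - 1], tokens[i], tokens[i + 1])] += 1
    return counts


def choose(original, left, right, confusion_set, counts):
    """Return the confusion-set member that best fits the context.

    The original word is kept unless some other member has a strictly
    higher count in this context.
    """
    best, best_score = original, counts[(left, original, right)]
    for candidate in sorted(confusion_set):
        score = counts[(left, candidate, right)]
        if score > best_score:
            best, best_score = candidate, score
    return best


CONFUSION_SET = {"there", "their"}
counts = train_context_counts(
    ["They lost their keys", "Put it over there now"], CONFUSION_SET
)
suggestion = choose("there", "lost", "keys", CONFUSION_SET, counts)
```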


SUMMARY

The present embodiments demonstrate systems and methods for automated text correction. In certain embodiments, the methods and systems may be implemented through analysis according to a single text editing model. In a particular embodiment, the single text editing model may be generated through analysis of both a corpus of learner text and a corpus of non-learner text.


According to one embodiment, an apparatus includes at least one processor and a memory device coupled to the at least one processor, in which the at least one processor is configured to identify words of an input utterance. The at least one processor is also configured to place the words in a plurality of first nodes stored in the memory device. The at least one processor is further configured to assign a word-layer tag to each of the first nodes based, in part, on neighboring nodes of the plurality of first nodes. The at least one processor is also configured to generate an output sentence by combining words from the plurality of first nodes with punctuation marks selected, in part, on the word-layer tags assigned to each of the first nodes.


According to another embodiment, a computer program product includes a computer-readable medium having code to identify words of an input utterance. The medium also includes code to place the words in a plurality of first nodes stored in the memory device. The medium further includes code to assign a word-layer tag to each of the plurality of first nodes based, in part, on neighboring nodes of the plurality of first nodes. The medium also includes code to generate an output sentence by combining words from the plurality of first nodes with punctuation marks selected, in part, on the word-layer tags assigned to each of the first nodes.


According to yet another embodiment, a method includes identifying words of an input utterance. The method also includes placing the words in a plurality of first nodes. The method further includes assigning a word-layer tag to each of the first nodes in the plurality of first nodes based, in part, on neighboring nodes of the plurality of first nodes. The method also includes generating an output sentence by combining words from the plurality of first nodes with punctuation marks selected, in part, on the word-layer tags assigned to each of the first nodes.
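By way of illustration only, the final step of combining words with punctuation marks selected from word-layer tags may be sketched as follows. The sketch assumes the tags have already been assigned (the tagging step itself is not shown), and the tag names and their mapping to punctuation marks are hypothetical labels chosen for this example.

```python
# Hypothetical word-layer tag names; each tag names the mark that follows a word.
TAG_TO_MARK = {"NONE": "", "COMMA": ",", "PERIOD": ".", "QMARK": "?"}


def build_sentence(words, tags):
    """Combine each word with the punctuation mark selected by its tag."""
    if len(words) != len(tags):
        raise ValueError("each word needs exactly one word-layer tag")
    return " ".join(word + TAG_TO_MARK[tag] for word, tag in zip(words, tags))


sentence = build_sentence(
    ["would", "you", "like", "tea"], ["NONE", "NONE", "NONE", "QMARK"]
)
```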


Additional embodiments of a method include receiving a natural language text input, the text input comprising a grammatical error in which a portion of the input text comprises a class from a set of classes. This method may also include generating a plurality of selection tasks from a corpus of non-learner text that is assumed to be free of grammatical errors, wherein for each selection task a classifier re-predicts a class used in the non-learner text. Further, the method may include generating a plurality of correction tasks from a corpus of learner text, wherein for each correction task a classifier proposes a class used in the learner text. Additionally, the method may include training a grammar correction model using a set of binary classification problems that include the plurality of selection tasks and the plurality of correction tasks. This embodiment may also include using the trained grammar correction model to predict a class for the text input from the set of possible classes.


In a further embodiment, the method includes outputting a suggestion to change the class of the text input to the predicted class if the predicted class is different than the class in the text input. In such an embodiment, the learner text is annotated by a teacher with an assumed correct class. The class may be an article associated with a noun phrase in the input text. The method may also include extracting feature functions for the classifiers from noun phrases in the non-learner text and the learner text.


In another embodiment, the class is a preposition associated with a prepositional phrase in the input text. Such a method may include extracting feature functions for the classifiers from prepositional phrases in the non-learner text and the learner text.


In one embodiment, the non-learner text and the learner text have a different feature space, the feature space of the learner text including the word used by a writer. Training the grammar correction model may include minimizing a loss function on the training data. Training the grammar correction model may also include identifying a plurality of linear classifiers through analysis of the non-learner text. The linear classifiers further comprise a weight factor included in a matrix of weight factors.
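By way of illustration only, the selection tasks, the correction tasks, and their conversion into binary classification problems may be sketched as follows. This is a minimal sketch under stated assumptions: the class set, the feature names, and the one-versus-rest encoding with labels +1 and -1 are choices made for this example, and no classifier training is shown.

```python
ARTICLES = ("a", "the", "NONE")  # hypothetical class set for article errors


def selection_task(context_features, observed_class):
    """Task from non-learner text: re-predict the class the text already uses."""
    return {"features": dict(context_features), "label": observed_class}


def correction_task(context_features, writer_class, corrected_class):
    """Task from annotated learner text.

    The feature space additionally contains the word the writer used;
    the label is the class given by the annotation.
    """
    features = dict(context_features)
    features["writer=" + writer_class] = 1
    return {"features": features, "label": corrected_class}


def to_binary_problems(tasks, classes):
    """Expand each task into one binary example per class (one-vs-rest)."""
    problems = {c: [] for c in classes}
    for task in tasks:
        for c in classes:
            label = 1 if task["label"] == c else -1
            problems[c].append((task["features"], label))
    return problems


tasks = [
    selection_task({"head=car": 1, "prev=bought": 1}, "a"),
    correction_task({"head=sun": 1, "prev=under": 1}, "a", "the"),
]
problems = to_binary_problems(tasks, ARTICLES)
```

Each per-class list could then be passed to any linear classifier whose training minimizes a loss function over the examples.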


In one embodiment, training the grammar correction model further comprises performing a Singular Value Decomposition (SVD) on the matrix of weight factors. Training the grammar correction model may also include identifying a combined weight value that represents a first weight value element identified through the analysis of the non-learner text and a second weight value component that is identified by analyzing a learner text by minimizing an empirical risk function.


An apparatus is also presented for automated text correction. The apparatus may include, for example, a processor configured to perform the steps of the methods described above.


Another embodiment of a method is presented. The method may include correcting semantic collocation errors. One embodiment of such a method includes automatically identifying one or more translation candidates in response to analysis of a corpus of parallel-language text conducted in a processing device. Additionally, the method may include determining, using the processing device, a feature associated with each translation candidate. The method may also include generating a set of one or more weight values from a corpus of learner text stored in a data storage device. The method may further include calculating, using a processing device, a score for each of the one or more translation candidates in response to the feature associated with each translation candidate and the set of one or more weight values.


In a further embodiment, identifying one or more translation candidates may include selecting a parallel corpus of text from a database of parallel texts, each parallel text comprising text of a first language and corresponding text of a second language, segmenting the text of the first language using the processing device, tokenizing the text of the second language using the processing device, automatically aligning words in the first text with words in the second text using the processing device, extracting phrases from the aligned words in the first text and in the second text using the processing device, and calculating, using the processing device, a probability of a paraphrase match associated with one or more phrases in the first text and one or more phrases in the second text.


In a particular embodiment, the feature associated with each translation candidate is the probability of a paraphrase match. The set of one or more weight values may be calculated using, for example, a minimum error rate training (MERT) operation on a corpus of learner text.
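By way of illustration only, scoring each candidate from its features and a set of weight values may be sketched as follows. The weighted sum of log-probability features is an assumption made for this example, as are the feature names, the candidate words, and all numeric values; in practice the weights would come from a tuning procedure such as MERT rather than being set by hand.

```python
import math


def score(features, weights):
    """Weighted sum of feature values; missing weights count as zero."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rank(candidates, weights):
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=lambda c: score(c["features"], weights), reverse=True)


# Hypothetical candidates for replacing a miscollocated verb.
candidates = [
    {"text": "look", "features": {"log_paraphrase_prob": math.log(0.3),
                                  "log_lm_prob": math.log(0.1)}},
    {"text": "see", "features": {"log_paraphrase_prob": math.log(0.6),
                                 "log_lm_prob": math.log(0.5)}},
]
weights = {"log_paraphrase_prob": 1.0, "log_lm_prob": 0.5}  # assumed values
best = rank(candidates, weights)[0]["text"]
```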


The method may also include generating a phrase table having collocation corrections with features derived from spelling edit distance. In another embodiment, the method may include generating a phrase table having collocation corrections with features derived from a homophone dictionary. In another embodiment, the method may include generating a phrase table having collocation corrections with features derived from a synonym dictionary. Additionally, the method may include generating a phrase table having collocation corrections with features derived from native language-induced paraphrases.


In such embodiments, the phrase table comprises one or more penalty features for use in calculating the probability of a paraphrase match.
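By way of illustration only, deriving candidate corrections from spelling edit distance may be sketched as follows. The sketch uses the standard Levenshtein distance (insertions, deletions, and substitutions, each at cost one); the function names, the vocabulary, and the distance threshold are assumptions made for this example, and the resulting distance is only one possible feature for a phrase-table entry.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def spelling_candidates(word, vocabulary, max_distance=1):
    """Return (candidate, distance) pairs within max_distance of word.

    The word itself is excluded; results are sorted by distance, then
    alphabetically.
    """
    pairs = []
    for candidate in vocabulary:
        distance = edit_distance(word, candidate)
        if 0 < distance <= max_distance:
            pairs.append((candidate, distance))
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))


entries = spelling_candidates("advice", ["advise", "advice", "device", "apple"])
```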


An apparatus, comprising at least one processor and a memory device coupled to the at least one processor, in which the at least one processor is configured to perform the steps of the method described above, is also presented. A tangible computer readable medium comprising computer readable code that, when executed by a computer, causes the computer to perform the operations of the method described above is also presented.


The term “coupled” is defined as connected, although not necessarily directly, and not necessarily mechanically.


The terms “a” and “an” are defined as one or more unless this disclosure explicitly requires otherwise.


The term “substantially” and its variations are defined as being largely but not necessarily wholly what is specified as understood by one of ordinary skill in the art, and in one non-limiting embodiment “substantially” refers to ranges within 10%, preferably within 5%, more preferably within 1%, and most preferably within 0.5% of what is specified.


The terms “comprise” (and any form of comprise, such as “comprises” and “comprising”), “have” (and any form of have, such as “has” and “having”), “include” (and any form of include, such as “includes” and “including”) and “contain” (and any form of contain, such as “contains” and “containing”) are open-ended linking verbs. As a result, a method or device that “comprises,” “has,” “includes” or “contains” one or more steps or elements possesses those one or more steps or elements, but is not limited to possessing only those one or more elements. Likewise, a step of a method or an element of a device that “comprises,” “has,” “includes” or “contains” one or more features possesses those one or more features, but is not limited to possessing only those one or more features. Furthermore, a device or structure that is configured in a certain way is configured in at least that way, but may also be configured in ways that are not listed. Other features and associated advantages will become apparent with reference to the following detailed description of specific embodiments in connection with the accompanying drawings.





BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.



FIG. 1 is a block diagram illustrating a system for analyzing utterances according to one embodiment of the disclosure.



FIG. 2 is a block diagram illustrating a data management system configured to store sentences according to one embodiment of the disclosure.



FIG. 3 is a block diagram illustrating a computer system for analyzing utterances according to one embodiment of the disclosure.



FIG. 4 is a block diagram illustrating a graphical representation for linear-chain CRF.



FIG. 5 is an example tagging of a training sentence for the linear-chain conditional random fields (CRF).



FIG. 6 is a block diagram illustrating a graphical representation of a two-layer factorial CRF.



FIG. 7 is an example tagging of a training sentence for the factorial conditional random fields (CRF).



FIG. 8 is a flow chart illustrating one embodiment of a method for inserting punctuation into a sentence.



FIG. 9 is a flow chart illustrating one embodiment of a method for automatic grammatical error correction.



FIG. 10A is a graphical diagram illustrating the accuracy of one embodiment of a text correction model for correcting article errors.



FIG. 10B is a graphical diagram illustrating the accuracy of one embodiment of a text correction model for correcting preposition errors.



FIG. 11A is a graphical diagram illustrating an F1-measure for the method of correcting article errors as compared to ordinary methods using the DeFelice feature set.



FIG. 11B is a graphical diagram illustrating an F1-measure for the method of correcting article errors as compared to ordinary methods using the Han feature set.



FIG. 11C is a graphical diagram illustrating an F1-measure for the method of correcting article errors as compared to ordinary methods using the Lee feature set.



FIG. 12A is a graphical diagram illustrating an F1-measure for the method of correcting preposition errors as compared to ordinary methods using the DeFelice feature set.



FIG. 12B is a graphical diagram illustrating an F1-measure for the method of correcting preposition errors as compared to ordinary methods using the TetreaultChunk feature set.



FIG. 12C is a graphical diagram illustrating an F1-measure for the method of correcting preposition errors as compared to ordinary methods using the TetreaultParse feature set.



FIG. 13 is a flow chart illustrating one embodiment of a method for correcting semantic collocation errors.





DETAILED DESCRIPTION

Various features and advantageous details are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well known starting materials, processing techniques, components, and equipment are omitted so as not to unnecessarily obscure the invention in detail. It should be understood, however, that the detailed description and the specific examples, while indicating embodiments of the invention, are given by way of illustration only, and not by way of limitation. Various substitutions, modifications, additions, and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.


Certain units described in this specification have been labeled as modules, in order to more particularly emphasize their implementation independence. A module is “[a] self-contained hardware or software component that interacts with a larger system.” Alan Freedman, “The Computer Glossary” 268 (8th ed. 1998). A module comprises a machine or machines, or machine-executable instructions. For example, a module may be implemented as a hardware circuit comprising custom VLSI circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.


Modules may also include software-defined units or instructions, that when executed by a processing machine or device, transform data stored on a data storage device from a first state to a second state. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the module, and when executed by the processor, achieve the stated data transformation.


Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data may be identified and illustrated herein within modules, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices.


In the following description, numerous specific details are provided, such as examples of programming, software modules, user selections, network transactions, database queries, database structures, hardware modules, hardware circuits, hardware chips, etc., to provide a thorough understanding of the present embodiments. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, materials, and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.



FIG. 1 illustrates one embodiment of a system 100 for automated text and speech editing. The system 100 may include a server 102, a data storage device 106, a network 108, and a user interface device 110. In a further embodiment, the system 100 may include a storage controller 104, or storage server configured to manage data communications between the data storage device 106, and the server 102 or other components in communication with the network 108. In an alternative embodiment, the storage controller 104 may be coupled to the network 108.


In one embodiment, the user interface device 110 is referred to broadly and is intended to encompass a suitable processor-based device such as a desktop computer, a laptop computer, a personal digital assistant (PDA) or tablet computer, a smartphone, or other mobile communication device or organizer device having access to the network 108. In a further embodiment, the user interface device 110 may access the Internet or other wide area or local area network to access a web application or web service hosted by the server 102 and provide a user interface for enabling a user to enter or receive information. For example, the user may enter an input utterance or text into the system 100 through a microphone (not shown) or keyboard 320.


The network 108 may facilitate communications of data between the server 102 and the user interface device 110. The network 108 may include any type of communications network including, but not limited to, a direct PC-to-PC connection, a local area network (LAN), a wide area network (WAN), a modem-to-modem connection, the Internet, a combination of the above, or any other communications network now known or later developed within the networking arts which permits two or more computers to communicate, one with another.


In one embodiment, the server 102 is configured to store input utterances and/or input text. Additionally, the server may access data stored in the data storage device 106 via a Storage Area Network (SAN) connection, a LAN, a data bus, or the like.


The data storage device 106 may include a hard disk, including hard disks arranged in a Redundant Array of Independent Disks (RAID) array, a tape storage drive comprising a magnetic tape data storage device, an optical storage device, or the like. In one embodiment, the data storage device 106 may store sentences in English or other languages. The data may be arranged in a database and accessible through Structured Query Language (SQL) queries, or other database query languages or operations.
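By way of illustration only, storing sentences in a database and retrieving them through an SQL query may be sketched as follows. The sketch uses an in-memory SQLite database purely for convenience; the table name, the column names, and the sample sentences are hypothetical and are not taken from the described embodiments.

```python
import sqlite3


def create_store():
    """Create an in-memory database with a hypothetical sentences table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, body TEXT)"
    )
    return conn


def add_sentences(conn, rows):
    """Insert (language, sentence) pairs using a parameterized statement."""
    conn.executemany("INSERT INTO sentences (lang, body) VALUES (?, ?)", rows)
    conn.commit()


def sentences_in(conn, lang):
    """Return the stored sentences for one language, in insertion order."""
    cursor = conn.execute(
        "SELECT body FROM sentences WHERE lang = ? ORDER BY id", (lang,)
    )
    return [row[0] for row in cursor.fetchall()]


conn = create_store()
add_sentences(conn, [("en", "would you like tea"),
                     ("de", "wie viel kostet das"),
                     ("en", "how much is it")])
english = sentences_in(conn, "en")
```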



FIG. 2 illustrates one embodiment of a data management system 200 configured to store input utterances and/or input text. In one embodiment, the data management system 200 may include a server 102. The server 102 may be coupled to a data-bus 202. In one embodiment, the data management system 200 may also include a first data storage device 204, a second data storage device 206, and/or a third data storage device 208. In further embodiments, the data management system 200 may include additional data storage devices (not shown). In one embodiment, a corpus of learner text, such as the NUS Corpus of Learner English (NUCLE), may be stored in the first data storage device 204. The second data storage device 206 may store a corpus of, for example, non-learner texts. Examples of non-learner texts may include parallel corpora, news or periodical text, and other commonly available text. In certain embodiments, the non-learner texts are chosen from sources that are assumed to contain relatively few errors. The third data storage device 208 may contain computational data, input texts, and/or input utterance data. In a further embodiment, the described data may be stored together in a consolidated data storage device 210.


In one embodiment, the server 102 may submit a query to selected data storage devices 204, 206 to retrieve input sentences. The server 102 may store the consolidated data set in a consolidated data storage device 210. In such an embodiment, the server 102 may refer back to the consolidated data storage device 210 to obtain a set of data elements associated with a specified sentence. Alternatively, the server 102 may query each of the data storage devices 204, 206, 208 independently or in a distributed query to obtain the set of data elements associated with an input sentence. In another alternative embodiment, multiple databases may be stored on a single consolidated data storage device 210.


The data management system 200 may also include files for entering and processing utterances. In various embodiments, the server 102 may communicate with the data storage devices 204, 206, 208 over the data-bus 202. The data-bus 202 may comprise a SAN, a LAN, or the like. The communication infrastructure may include Ethernet, Fibre Channel Arbitrated Loop (FC-AL), Small Computer System Interface (SCSI), Serial Advanced Technology Attachment (SATA), Advanced Technology Attachment (ATA), and/or other similar data communication schemes associated with data storage and communication. For example, the server 102 may communicate indirectly with the data storage devices 204, 206, 208, 210, with the server 102 first communicating with a storage server or the storage controller 104.


The server 102 may host a software application configured for analyzing utterances and/or input text. The software application may further include modules for interfacing with the data storage devices 204, 206, 208, 210, interfacing with the network 108, interfacing with a user through the user interface device 110, and the like. In a further embodiment, the server 102 may host an engine, application plug-in, or application programming interface (API).



FIG. 3 illustrates a computer system 300 adapted according to certain embodiments of the server 102 and/or the user interface device 110. The central processing unit (“CPU”) 302 is coupled to the system bus 304. The CPU 302 may be a general purpose CPU or microprocessor, graphics processing unit (“GPU”), microcontroller, or the like that is specially programmed to perform methods as described in the following flow chart diagrams. The present embodiments are not restricted by the architecture of the CPU 302 so long as the CPU 302, whether directly or indirectly, supports the modules and operations as described herein. The CPU 302 may execute the various logical instructions according to the present embodiments.


The computer system 300 also may include random access memory (RAM) 308, which may be SRAM, DRAM, SDRAM, or the like. The computer system 300 may utilize RAM 308 to store the various data structures used by a software application having code to analyze utterances. The computer system 300 may also include read only memory (ROM) 306 which may be PROM, EPROM, EEPROM, optical storage, or the like. The ROM may store configuration information for booting the computer system 300. The RAM 308 and the ROM 306 hold user and system data.


The computer system 300 may also include an input/output (I/O) adapter 310, a communications adapter 314, a user interface adapter 316, and a display adapter 322. The I/O adapter 310 and/or the user interface adapter 316 may, in certain embodiments, enable a user to interact with the computer system 300 in order to input utterances or text. In a further embodiment, the display adapter 322 may display a graphical user interface associated with a software or web-based application or mobile application for generating sentences with inserted punctuation marks, grammar correction, and other related text and speech editing functions.


The I/O adapter 310 may connect one or more storage devices 312, such as one or more of a hard drive, a compact disk (CD) drive, a floppy disk drive, and a tape drive, to the computer system 300. The communications adapter 314 may be adapted to couple the computer system 300 to the network 108, which may be one or more of a LAN, WAN, and/or the Internet. The user interface adapter 316 couples user input devices, such as a keyboard 320 and a pointing device 318, to the computer system 300. The display adapter 322 may be driven by the CPU 302 to control the display on the display device 324.


The applications of the present disclosure are not limited to the architecture of computer system 300. Rather, the computer system 300 is provided as an example of one type of computing device that may be adapted to perform the functions of a server 102 and/or the user interface device 110. For example, any suitable processor-based device may be utilized, including without limitation personal data assistants (PDAs), tablet computers, smartphones, computer game consoles, and multi-processor servers. Moreover, the systems and methods of the present disclosure may be implemented on application specific integrated circuits (ASIC), very large scale integrated (VLSI) circuits, or other circuitry. In fact, persons of ordinary skill in the art may utilize any number of suitable structures capable of executing logical operations according to the described embodiments.


The schematic flow chart diagrams and associated description that follow are generally set forth as logical flow chart diagrams. As such, the depicted order and labeled steps are indicative of one embodiment of the presented method. Other steps and methods may be conceived that are equivalent in function, logic, or effect to one or more steps, or portions thereof, of the illustrated method. Additionally, the format and symbols employed are provided to explain the logical steps of the method and are understood not to limit the scope of the method. Although various arrow types and line types may be employed in the flow chart diagrams, they are understood not to limit the scope of the corresponding method. Indeed, some arrows or other connectors may be used to indicate only the logical flow of the method. For instance, an arrow may indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method. Additionally, the order in which a particular method occurs may or may not strictly adhere to the order of the corresponding steps shown.


Punctuation Prediction

According to one embodiment, punctuation symbols may be predicted from a standard text processing perspective, where only the speech texts are available, without relying on additional prosodic features such as pitch and pause duration. For example, a punctuation prediction task may be performed on transcribed conversational speech texts, or utterances. Unlike many other corpora such as broadcast news corpora, a conversational speech corpus may include dialogs where informal and short sentences frequently appear. In addition, due to the nature of conversation, it may also include more question sentences than other corpora.


One natural approach to relax the strong dependency assumptions encoded by the hidden event language model is to adopt an undirected graphical model, where arbitrary overlapping features can be exploited. Conditional random fields (CRF) have been widely used in various sequence labeling and segmentation tasks. A CRF may be a discriminative model of the conditional distribution of the complete label sequence given the observation. For example, a first-order linear-chain CRF which assumes first-order Markov property may be defined by the following equation:









$$p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t} \sum_{k} \lambda_k f_k(x, y_{t-1}, y_t, t) \right),$$




where x is the observation and y is the label sequence. A feature function f_k as a function of time step t may be defined over the entire observation x and two adjacent hidden labels. Z(x) is a normalization factor to ensure a well-formed probability distribution.
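By way of illustration only, the first-order linear-chain CRF probability may be sketched in Python by enumerating every tag sequence to obtain Z(x). This is practical only for very short inputs and is not the inference procedure of the disclosure. The tag names, the start label, the two toy feature functions, and the weights are hypothetical and chosen solely to make the sketch runnable.

```python
import itertools
import math

TAGS = ["NONE", "COMMA", "PERIOD", "QMARK"]  # hypothetical tag names
START = "<s>"  # hypothetical start label standing in for the label before the first word


def score(x, y, weights, feature_fns):
    """Unnormalized log-score: sum over t and k of lambda_k * f_k(x, y_{t-1}, y_t, t)."""
    total = 0.0
    prev = START
    for t, cur in enumerate(y):
        for lam, f in zip(weights, feature_fns):
            total += lam * f(x, prev, cur, t)
        prev = cur
    return total


def crf_probability(x, y, weights, feature_fns, tags=TAGS):
    """p(y | x), with Z(x) computed by brute-force enumeration of all tag sequences."""
    z = sum(
        math.exp(score(x, list(cand), weights, feature_fns))
        for cand in itertools.product(tags, repeat=len(x))
    )
    return math.exp(score(x, y, weights, feature_fns)) / z


# Two toy binary feature functions (illustrative assumptions, not from the disclosure).
def f_last_word_period(x, prev, cur, t):
    """Fires when the final word is tagged PERIOD."""
    return 1.0 if t == len(x) - 1 and cur == "PERIOD" else 0.0


def f_adjacent_punct(x, prev, cur, t):
    """Fires when two consecutive words both carry a punctuation tag."""
    return 1.0 if prev not in ("NONE", START) and cur != "NONE" else 0.0


FEATURES = [f_last_word_period, f_adjacent_punct]
WEIGHTS = [2.0, -1.5]  # arbitrary illustrative weights
```

Because Z(x) sums over every candidate sequence, the probabilities of all tag sequences for a given x add up to one, which is the role the normalization factor plays in the equation.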



FIG. 4 is a block diagram illustrating a graphical representation for linear-chain CRF. A series of first nodes 402a, 402b, 402c, . . . , 402n are coupled to a series of second nodes 404a, 404b, 404c, . . . , 404n. The second nodes may be events such as word-layer tags associated with the corresponding node of the first nodes 402. Punctuation prediction tasks may be modeled as a process of assigning a tag to each word. A set of possible tags may include none (NONE), comma (,), period (.), question mark (?), and exclamation mark (!). According to one embodiment, each word may be associated with one event. The event identifies which punctuation symbol (possibly NONE) should be inserted after the word.


Training data for the model may include a set of utterances where punctuation symbols are encoded as tags that are assigned to the individual words. The tag NONE means no punctuation symbol is inserted after the current word. Any other tag identifies a location for insertion of the corresponding punctuation symbol. The most probable sequence of tags is predicted and the punctuated text can then be constructed from such an output. An example tagging of an utterance may be illustrated in FIG. 5.



FIG. 5 is an example tagging of a training sentence for the linear-chain conditional random fields (CRF). A sentence 502 may be divided into words and a word-layer tag 504 assigned to each of the words. The word-layer tag 504 may indicate a punctuation mark that will follow the word in an output sentence. For example, the word “no” is tagged with “Comma” indicating a comma should follow the word “no.” Additionally, some words such as “please” are tagged with “None” to indicate no punctuation mark should follow the word “please.”
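The encoding of punctuation symbols as word-layer tags may be sketched as follows. The function name, the tag names, and the sample utterance are hypothetical; the utterance is merely in the spirit of FIG. 5 and is not asserted to reproduce it.

```python
# Hypothetical mapping from punctuation symbols to word-layer tag names.
PUNCT_TO_TAG = {",": "COMMA", ".": "PERIOD", "?": "QMARK", "!": "EMARK"}


def encode_tags(tokens):
    """Turn a tokenized, punctuated utterance into (word, tag) pairs.

    Each punctuation token becomes the tag of the word that precedes it;
    words not followed by a punctuation symbol receive the tag NONE.
    """
    pairs = []
    for tok in tokens:
        if tok in PUNCT_TO_TAG:
            if pairs:  # a symbol with no preceding word is dropped
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_TO_TAG[tok])
        else:
            pairs.append((tok, "NONE"))
    return pairs


if __name__ == "__main__":
    for word, tag in encode_tags("no , please do not . would you mind ?".split()):
        print(word, tag)
```

Training data in this form contains one tag per word, so the prediction task reduces to sequence labeling over the words alone.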


According to one embodiment, a feature of conditional random fields may be factorized as a product of a binary function on the assignment of the set of cliques at the current time step (in this case an edge), and a feature function solely defined on the observation sequence. n-gram occurrences surrounding the current word, together with position information, are used as binary feature functions, for n = 1, 2, 3. Words that appear within 5 words of the current word are considered when building the features. Special start and end symbols are used beyond the utterance boundaries. For example, for the word “do” shown in FIG. 5, example features include the unigram features “do” at relative position 0 and “please” at relative position −1, the bigram feature “would you” at relative positions 2 to 3, and the trigram feature “no please do” at relative positions −2 to 0.
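One possible way to enumerate such position-aware n-gram features is sketched below. The function name, the feature string format, and the boundary symbols `<s>` and `</s>` are assumptions made for illustration; the word sequence is chosen only so that the relative positions quoted above can be checked.

```python
def ngram_features(words, i, window=5, max_n=3, bos="<s>", eos="</s>"):
    """Binary n-gram features around words[i].

    Each feature is keyed by the n-gram order, the relative position of the
    n-gram's first word, and the n-gram text. Positions beyond the utterance
    boundaries are filled with special start and end symbols.
    """
    padded = [bos] * window + list(words) + [eos] * window
    center = i + window
    feats = set()
    for n in range(1, max_n + 1):
        # Start offsets for which the whole n-gram stays inside the window.
        for start in range(-window, window - n + 2):
            gram = padded[center + start: center + start + n]
            feats.add(f"{n}gram@{start}:{' '.join(gram)}")
    return feats


if __name__ == "__main__":
    words = "no please do not would you".split()
    for feat in sorted(ngram_features(words, 2)):
        print(feat)
```

Each returned string corresponds to one binary feature on the observation; in the factorized form described above it would be multiplied by an indicator on the tag assignment of the current edge.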


A linear-chain CRF model in this embodiment may be capable of modeling dependencies between words and punctuation symbols with arbitrary overlapping features. Thus, strong dependency assumptions in the hidden event language model may be avoided. The model may be further improved by including analysis of long range dependencies at a sentence level. For example, in the sample utterance shown in FIG. 5, the long range dependency between the ending question mark and the indicative words “would you”, which appear very far away, may not be captured by the linear-chain CRF.


A factorial-CRF (F-CRF), an instance of dynamic conditional random fields, may be used as a framework for providing the capability of simultaneously labeling multiple layers of tags for a given sequence. The F-CRF learns a joint conditional distribution of the tags given the observation. Dynamic conditional random fields may be defined as the conditional probability of a sequence of label vectors y given the observation x as:









$$p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t} \sum_{c \in C} \sum_{k} \lambda_k f_k(x, y_{(c,t)}, y_t, t) \right),$$




where cliques are indexed at each time step, C is a set of clique indices, and y_(c,t) is the set of variables in the unrolled version of a clique with index c at time t.



FIG. 6 is a block diagram illustrating a graphical representation of a two-layer factorial CRF. According to one embodiment, an F-CRF may have two layers of nodes as tags, where the cliques include the two within-chain edges (e.g., z2-z3 and y2-y3) and one between-chain edge (e.g., z3-y3) at each time step. A series of first nodes 602a, 602b, 602c, . . . , 602n are coupled to a series of second nodes 604a, 604b, 604c, . . . , 604n. A series of third nodes 606a, 606b, 606c, . . . , 606n are coupled to the series of second nodes and the series of first nodes. The nodes of the series of second nodes are coupled with each other to provide long range dependency between nodes.


According to one embodiment, the second nodes are word-layer nodes and the third nodes are sentence-layer nodes. Each sentence-layer node may be coupled with a respective word-layer node. Both sentence-layer nodes and word-layer nodes may be coupled with first nodes. Sentence layer nodes may capture long-range dependencies between word-layer nodes.


In an F-CRF, two groups of labels may be assigned to words in an utterance: word-layer tags and sentence-layer tags. Word-layer tags may include none, comma, period, question mark, and/or exclamation mark. Sentence-layer tags may include declaration beginning, declaration inner part, question beginning, question inner part, exclamation beginning, and/or exclamation inner part. The word-layer tags may be responsible for inserting a punctuation symbol (including NONE) after each word, while the sentence-layer tags may be used for annotating sentence boundaries and identifying the sentence type (declarative, question, or exclamatory).


According to one embodiment, tags from the word layer may be the same as those of the linear-chain CRF. The sentence-layer tags may be designed for three types of sentences: DEBEG and DEIN indicate the start and the inner part of a declarative sentence, respectively; QNBEG and QNIN do the same for question sentences; and EXBEG and EXIN do the same for exclamatory sentences. The example utterance discussed above for the linear-chain CRF may be tagged with two layers of tags, as shown in FIG. 7.



FIG. 7 is an example tagging of a training sentence for the factorial conditional random fields (CRF). A sentence 702 may be divided into words and each word tagged with a word-layer tag 704 and a sentence-layer tag 706. For example, the word “no” may be labeled with a comma word-layer tag and a declaration beginning sentence-layer tag.
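As a minimal sketch, sentence-layer tags for training data may be derived from the word-layer tags, since each sentence-ending tag fixes both the boundary and the type of the sentence it closes. The function name and the word-layer tag names are hypothetical, and treating an unterminated trailing segment as declarative is an assumption of this sketch rather than a statement of the disclosure.

```python
# Sentence-ending word-layer tags mapped to sentence-type prefixes.
SENTENCE_TYPE = {"PERIOD": "DE", "QMARK": "QN", "EMARK": "EX"}


def sentence_layer_tags(word_tags, default_type="DE"):
    """Derive BEG/IN sentence-layer tags from word-layer punctuation tags.

    The first word of each sentence receives <type>BEG and the remaining
    words receive <type>IN, where <type> is set by the ending punctuation tag.
    """
    out = []
    start = 0
    for i, tag in enumerate(word_tags):
        if tag in SENTENCE_TYPE or i == len(word_tags) - 1:
            # Assumption: a trailing segment without ending punctuation is declarative.
            prefix = SENTENCE_TYPE.get(tag, default_type)
            length = i - start + 1
            out.extend([prefix + "BEG"] + [prefix + "IN"] * (length - 1))
            start = i + 1
    return out


if __name__ == "__main__":
    tags = ["COMMA", "NONE", "NONE", "PERIOD", "NONE", "NONE", "QMARK"]
    print(sentence_layer_tags(tags))
```

With the two layers aligned word by word in this manner, a word such as “no” carries a comma tag on the word layer and a declaration beginning tag on the sentence layer, consistent with the example of FIG. 7.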


Analogous feature factorization and the n-gram feature functions used in linear-chain CRF may be used in F-CRF. When learning the sentence layer tags together with the word layer tags, the F-CRF model is capable of leveraging useful clues learned from the sentence layer about sentence type (e.g., a question sentence, annotated with QNBEG, QNIN, QNIN, or a declarative sentence, annotated with DEBEG, DEIN, DEIN), which can be used to guide the prediction of the punctuation symbol at each word, hence improving the performance at the word layer.


For example, consider jointly labeling the utterance shown in FIG. 7. When the evidence shows that the utterance consists of two sentences, a declarative sentence followed by a question sentence, the model tends to annotate the second half of the utterance with the sentence tag sequence QNBEG, QNIN. These sentence-layer tags help predict the word-layer tag at the end of the utterance as QMARK, given the dependencies between the two layers existing at each time step. According to one embodiment, during the learning process, the two layers of tags may be jointly learned. Thus, the word-layer tags may influence the sentence-layer tags, and vice versa. The GRMM package may be used for building both the linear-chain CRF (L-CRF) and the factorial CRF (F-CRF). The tree-based reparameterization (TRP) schedule for belief propagation is used for approximate inference.


The techniques described above may allow the use of conditional random fields (CRFs) to perform punctuation prediction in utterances without relying on prosodic cues. Thus, the methods described may be useful in post-processing of transcribed conversational utterances. Additionally, long-range dependencies may be established between words in an utterance to improve prediction of punctuation in utterances.


Experiments with the different methods are carried out on part of the corpus of the IWSLT09 evaluation campaign, where both Chinese and English conversational speech texts are used. Two multilingual datasets are considered: the BTEC (Basic Travel Expression Corpus) dataset and the CT (Challenge Task) dataset. The former consists of tourism-related sentences, and the latter consists of human-mediated cross-lingual dialogs in the travel domain. The official IWSLT09 BTEC training set consists of 19,972 Chinese-English utterance pairs, and the CT training set consists of 10,061 such pairs. Each of the two datasets may be randomly split into two portions, where 90% of the utterances are used for training the punctuation prediction models, and the remaining 10% for evaluating the prediction performance. For all the experiments, the default segmentation of Chinese may be used as provided, and English texts may be pre-processed with the Penn Treebank tokenizer. TABLE 1 provides statistics of the two datasets after processing.


The proportions of sentence types in the two datasets are listed. The majority of the sentences are declarative sentences. However, question sentences are more frequent in the BTEC dataset compared to the CT dataset. Exclamatory sentences contribute less than 1% for all datasets and are not listed. Additionally, the utterances from the CT dataset are much longer (with more words per utterance), and therefore more CT utterances actually consist of multiple sentences.









TABLE 1. Statistics of the BTEC and CT Datasets

| | BTEC Chinese | BTEC English | CT Chinese | CT English |
|---|---|---|---|---|
| Declarative sentence | 64% | 65% | 77% | 81% |
| Question sentence | 36% | 35% | 22% | 19% |
| Multiple sentences per utterance | 14% | 17% | 29% | 39% |
| Average number of words per utterance | 8.59 | 9.46 | 10.18 | 14.33 |









Additional experiments may be divided into two categories: with or without duplicating the ending punctuation symbol to the start of a sentence before training. This setting may be used to assess the impact of the proximity between the punctuation symbol and the indicative words for the prediction task. Under each category, two possible approaches are tested. The single pass approach performs prediction in one single step, where all the punctuation symbols are predicted sequentially from left to right. In the cascaded approach, the training sentences are formatted by replacing all sentence-ending punctuation symbols with special sentence boundary symbols first. A model for sentence boundary prediction may be learned based on such training data. According to one embodiment, this step may be followed by predicting the punctuation symbols.
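The duplication setting may be illustrated with the following sketch, which copies each sentence-ending symbol to the front of its sentence before language model training. The function name and the sample utterance are hypothetical, and leaving trailing words without an ending symbol unchanged is an assumption of this sketch.

```python
SENTENCE_END = {".", "?", "!"}


def duplicate_ending_punctuation(tokens):
    """Copy each sentence-ending symbol to the start of its sentence.

    The original ending symbol is kept in place, so the symbol appears both
    next to the sentence-initial words and at the end of the sentence.
    """
    out, sentence = [], []
    for tok in tokens:
        sentence.append(tok)
        if tok in SENTENCE_END:
            out.extend([tok] + sentence)
            sentence = []
    out.extend(sentence)  # trailing words with no ending symbol are left unchanged
    return out


if __name__ == "__main__":
    print(" ".join(duplicate_ending_punctuation(
        "no , please do not . would you mind ?".split())))
```

After this transformation, an n-gram model sees the question mark adjacent to words such as “would you”, which is the proximity effect the setting is intended to assess.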


Both trigram and 5-gram language models are tried for all combinations of the above settings. This provides a total of eight possible combinations based on the hidden event language model. When training all the language models, modified Kneser-Ney smoothing for n-grams may be used. To assess the performance of the punctuation prediction task, computations for precision (prec), recall (rec), and F1-measure (F1), are defined by the following equations:







$$\text{prec.} = \frac{\#\ \text{correctly predicted punctuation symbols}}{\#\ \text{predicted punctuation symbols}}$$

$$\text{rec.} = \frac{\#\ \text{correctly predicted punctuation symbols}}{\#\ \text{expected punctuation symbols}}$$

$$F_1 = \frac{2}{1/\text{prec.} + 1/\text{rec.}}$$
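As an illustrative sketch, the three measures may be computed from aligned sequences of predicted and reference word-layer tags. Precision divides the number of correctly predicted punctuation symbols by the number of predicted symbols, recall divides it by the number of symbols in the reference annotation, and F1 is their harmonic mean. The function name and tag names are hypothetical.

```python
def punctuation_prf(predicted, reference, none_tag="NONE"):
    """Precision, recall, and F1 over punctuation tags, ignoring the NONE tag."""
    if len(predicted) != len(reference):
        raise ValueError("tag sequences must have the same length")
    # A prediction counts as correct only if it is a punctuation tag
    # that matches the reference tag at the same word.
    correct = sum(1 for p, r in zip(predicted, reference) if p == r and p != none_tag)
    n_pred = sum(1 for p in predicted if p != none_tag)
    n_ref = sum(1 for r in reference if r != none_tag)
    prec = correct / n_pred if n_pred else 0.0
    rec = correct / n_ref if n_ref else 0.0
    # Algebraically equal to 2 / (1/prec + 1/rec) when both are nonzero.
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


if __name__ == "__main__":
    pred = ["COMMA", "NONE", "PERIOD", "NONE", "QMARK"]
    ref = ["COMMA", "NONE", "NONE", "PERIOD", "QMARK"]
    print(punctuation_prf(pred, ref))
```

Positions where both sequences carry the NONE tag do not contribute to any of the counts, so the scores reflect only inserted or expected punctuation symbols.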







The performance of punctuation prediction on both Chinese (CN) and English (EN) texts in the correctly recognized output of the BTEC and CT datasets is presented in TABLE 2 and TABLE 3, respectively. The performance of the hidden event language model heavily depends on whether the duplication method is used and on the actual language under consideration. Specifically, for English, duplicating the ending punctuation symbol to the start of a sentence before training is shown to be very helpful in improving the overall prediction performance. In contrast, applying the same technique to Chinese hurts the performance.


One explanation may be that an English question sentence usually starts with indicative words such as “do you” or “where” that distinguish it from a declarative sentence. Thus, duplicating the ending punctuation symbol to the start of a sentence so that it is near these indicative words helps to improve the prediction accuracy. However, Chinese presents quite different syntactic structures for question sentences.


First, in many cases, Chinese tends to use semantically vague auxiliary words at the end of a sentence to indicate a question. Such auxiliary words include custom-character, and custom-character. Thus, retaining the position of the ending punctuation symbol before training yields better performance. Another finding is that, unlike in English, other words that indicate a question sentence in Chinese can appear at almost any position in a Chinese sentence. Examples include custom-character . . . (where . . . ), . . . custom-character (what . . . ), or . . . custom-character . . . (how many/much . . . ). These pose difficulties for the simple hidden event language model, which only encodes simple dependencies over surrounding words by means of n-gram language modeling.









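TABLE 2. Punctuation Prediction Performance on Chinese (CN) and English (EN) Texts in the Correctly Recognized Output of the BTEC Dataset. Percentage Scores of Precision (Prec.), Recall (Rec.), and F1 Measure (F1) are Reported

The first eight score columns are hidden event language model settings, labeled by duplication setting (No dup. / Dup.), approach (single pass / cascaded), and LM order (3 or 5).

| | | No dup., single pass, 3 | No dup., single pass, 5 | No dup., cascaded, 3 | No dup., cascaded, 5 | Dup., single pass, 3 | Dup., single pass, 5 | Dup., cascaded, 3 | Dup., cascaded, 5 | L-CRF | F-CRF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CN | Prec. | 87.40 | 86.44 | 87.72 | 87.13 | 76.74 | 77.58 | 77.89 | 78.50 | 94.82 | 94.83 |
| CN | Rec. | 83.01 | 83.58 | 82.04 | 83.76 | 72.62 | 73.72 | 73.02 | 75.53 | 87.06 | 87.94 |
| CN | F1 | 85.15 | 84.99 | 84.79 | 85.41 | 74.63 | 75.60 | 75.37 | 76.99 | 90.78 | 91.25 |
| EN | Prec. | 64.72 | 62.70 | 62.39 | 58.10 | 85.33 | 85.74 | 84.44 | 81.37 | 88.37 | 92.76 |
| EN | Rec. | 60.76 | 59.49 | 58.57 | 55.28 | 80.42 | 80.98 | 79.43 | 77.52 | 80.28 | 84.73 |
| EN | F1 | 62.68 | 61.06 | 60.42 | 56.66 | 82.80 | 83.29 | 81.86 | 79.40 | 84.13 | 88.56 |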
















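TABLE 3. Punctuation Prediction Performance on Chinese (CN) and English (EN) Texts in the Correctly Recognized Output of the CT Dataset. Percentage Scores of Precision (Prec.), Recall (Rec.), and F1 Measure (F1) are Reported

The first eight score columns are hidden event language model settings, labeled by duplication setting (No dup. / Dup.), approach (single pass / cascaded), and LM order (3 or 5).

| | | No dup., single pass, 3 | No dup., single pass, 5 | No dup., cascaded, 3 | No dup., cascaded, 5 | Dup., single pass, 3 | Dup., single pass, 5 | Dup., cascaded, 3 | Dup., cascaded, 5 | L-CRF | F-CRF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CN | Prec. | 89.14 | 87.83 | 90.97 | 88.04 | 74.63 | 75.42 | 75.37 | 76.87 | 93.14 | 92.77 |
| CN | Rec. | 84.71 | 84.16 | 77.78 | 84.08 | 70.69 | 70.84 | 64.62 | 73.60 | 83.45 | 86.92 |
| CN | F1 | 86.87 | 85.96 | 83.86 | 86.01 | 72.60 | 73.06 | 69.58 | 75.20 | 88.03 | 89.75 |
| EN | Prec. | 73.86 | 73.42 | 67.02 | 65.15 | 75.87 | 77.78 | 74.75 | 74.44 | 83.07 | 86.69 |
| EN | Rec. | 68.94 | 68.79 | 62.13 | 61.23 | 70.33 | 72.56 | 69.28 | 69.93 | 76.09 | 79.62 |
| EN | F1 | 71.31 | 71.03 | 64.48 | 63.13 | 72.99 | 75.08 | 71.91 | 72.12 | 79.43 | 83.01 |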









By adopting a discriminative model which exploits non-independent, overlapping features, the L-CRF model generally outperforms the hidden event language model. By introducing an additional layer of tags for performing sentence segmentation and sentence type prediction, the F-CRF model further boosts the performance over the L-CRF model. Statistical significance tests are performed with bootstrap resampling. The improvements of F-CRF over L-CRF are statistically significant (p<0.01) on Chinese and English texts in the CT dataset, and on English texts in the BTEC dataset. The improvements of F-CRF over L-CRF on Chinese texts are smaller, probably because L-CRF is already performing quite well on Chinese. F1 measures on the CT dataset are lower than those on BTEC, mainly because the CT dataset consists of longer utterances and fewer question sentences. Overall, the proposed F-CRF model is robust and consistently works well regardless of the language and dataset it is tested on. This indicates that the approach is general and relies on minimal linguistic assumptions, and thus can be readily used on other languages and datasets.


The models may also be evaluated with texts produced by ASR systems. For evaluation, the 1-best ASR outputs of spontaneous speech of the official IWSLT08 BTEC evaluation dataset may be used, which is released as part of the IWSLT09 corpus. The dataset consists of 504 utterances in Chinese, and 498 in English. Unlike the correctly recognized texts described above, the ASR outputs contain substantial recognition errors (recognition accuracy is 86% for Chinese, and 80% for English). In the dataset released by the IWSLT 2009 organizers, the correct punctuation symbols are not annotated in the ASR outputs. To conduct the experimental evaluation, the correct punctuation symbols on the ASR outputs may be manually annotated. The evaluation results for each of the models are shown in TABLE 4. The results show that F-CRF still gives higher performance than L-CRF and the hidden event language model, and the improvements are statistically significant (p<0.01).









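TABLE 4. Punctuation Prediction Performance on Chinese (CN) and English (EN) Texts in the ASR Output of the IWSLT08 BTEC Evaluation Dataset. Percentage Scores of Precision (Prec.), Recall (Rec.), and F1 Measure (F1) are Reported

The first eight score columns are hidden event language model settings, labeled by duplication setting (No dup. / Dup.), approach (single pass / cascaded), and LM order (3 or 5).

| | | No dup., single pass, 3 | No dup., single pass, 5 | No dup., cascaded, 3 | No dup., cascaded, 5 | Dup., single pass, 3 | Dup., single pass, 5 | Dup., cascaded, 3 | Dup., cascaded, 5 | L-CRF | F-CRF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CN | Prec. | 85.96 | 84.80 | 86.48 | 85.12 | 66.86 | 68.76 | 68.00 | 68.75 | 92.81 | 93.82 |
| CN | Rec. | 81.87 | 82.78 | 83.15 | 82.78 | 63.92 | 66.12 | 65.38 | 66.48 | 85.16 | 89.01 |
| CN | F1 | 83.86 | 83.78 | 84.78 | 83.94 | 65.36 | 67.41 | 66.67 | 67.60 | 88.83 | 91.35 |
| EN | Prec. | 62.38 | 59.29 | 56.86 | 54.22 | 85.23 | 87.29 | 84.49 | 81.32 | 90.67 | 93.72 |
| EN | Rec. | 64.17 | 60.99 | 58.76 | 56.71 | 88.22 | 89.65 | 87.58 | 84.55 | 88.22 | 92.68 |
| EN | F1 | 63.27 | 60.13 | 57.79 | 55.20 | 86.70 | 88.45 | 86.00 | 82.90 | 89.43 | 93.19 |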









In another evaluation of the models, an indirect approach may be adopted to automatically evaluate the performance of punctuation prediction on ASR output texts by feeding the punctuated ASR texts to a state-of-the-art machine translation system, and evaluating the resulting translation performance. The translation performance is in turn measured by an automatic evaluation metric which correlates well with human judgments. Moses, a state-of-the-art phrase-based statistical machine translation toolkit, is used as a translation engine along with the entire IWSLT09 BTEC training set for training the translation system.


The Berkeley aligner is used for aligning the training bitext with the lexicalized reordering model enabled. This is because lexicalized reordering gives better performance than simple distance-based reordering. Specifically, the default lexicalized reordering model (msd-bidirectional-fe) is used. For tuning the parameters of Moses, we use the official IWSLT0[illegible] evaluation set where the correct punctuation symbols are present. Evaluations are performed on the ASR outputs of the IWSLT08 BTEC evaluation dataset, with punctuation symbols inserted by each punctuation prediction method. The tuning set and evaluation set include 7 reference translations. Following a common practice in statistical machine translation, we report BLEU-4 scores, which were shown to have good correlation with human judgments, with the closest reference length as the effective reference length. The minimum error rate training (MERT) procedure is used for tuning the model parameters of the translation system.


Due to the unstable nature of MERT, 10 runs are performed for each translation task, with a different random initialization of parameters in each run, and the BLEU-4 scores averaged over 10 runs are reported. The results are shown in TABLE 5. The best translation performances for both translation directions are achieved by applying F-CRF as the punctuation prediction model to the ASR texts. In addition, we also assess the translation performance when the manually annotated punctuation symbols are used for translation. The averaged BLEU scores for the two translation tasks are 31.58 (Chinese to English) and 24.16 (English to Chinese), respectively, which show that our punctuation prediction method gives competitive performance for spoken language translation.









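TABLE 5. Translation Performance on Punctuated ASR Outputs Using Moses (Averaged Percentage Scores of BLEU)

The first eight score columns are hidden event language model settings, labeled by duplication setting (No dup. / Dup.), approach (single pass / cascaded), and LM order (3 or 5).

| | No dup., single pass, 3 | No dup., single pass, 5 | No dup., cascaded, 3 | No dup., cascaded, 5 | Dup., single pass, 3 | Dup., single pass, 5 | Dup., cascaded, 3 | Dup., cascaded, 5 | L-CRF | F-CRF |
|---|---|---|---|---|---|---|---|---|---|---|
| CN→EN | 30.77 | 30.71 | 30.98 | 30.64 | 30.16 | 30.26 | 30.33 | 30.42 | 31.27 | 31.30 |
| EN→CN | 21.21 | 21.00 | 21.16 | 20.76 | 23.03 | 24.04 | 23.61 | 23.34 | 23.44 | 24.18 |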









According to the embodiments described above, an exemplary approach for predicting punctuation symbols for transcribed conversational speech texts is described. The proposed approach is built on top of a dynamic conditional random fields (DCRFs) framework, which performs punctuation prediction together with sentence boundary and sentence type prediction on speech utterances. The text processing according to DCRFs may be completed without reliance on prosodic cues. The exemplary embodiments outperform the widely used conventional approach based on the hidden event language model. The disclosed embodiments have been shown to be non-language specific and work well on both Chinese and English, and on both correctly recognized and automatically recognized texts. The disclosed embodiments also result in better translation accuracy when the punctuated automatically recognized texts are used in subsequent translation.



FIG. 8 is a flow chart illustrating one embodiment of a method for inserting punctuation into a sentence. In one embodiment, the method 800 starts at block 802 with identifying words of an input utterance. At block 804 the words are placed in a plurality of first nodes. At block 806 word-layer tags are assigned to each of the first nodes in the plurality of first nodes based, in part, on neighboring nodes of the plurality of first nodes. According to one embodiment, sentence-layer tags may also be assigned to each of the first nodes in the plurality of first nodes. According to another embodiment, sentence-layer tags and/or word-layer tags may be assigned to the first nodes based, in part, on boundaries of the input utterance. At block 808 an output sentence is generated by combining words from the plurality of first nodes with punctuation marks selected based, in part, on the word-layer tags assigned to each of the first nodes.
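The final step of the method 800, in which an output sentence is generated from the words and their word-layer tags, may be sketched as follows. The function name and the tag names are hypothetical, and the tags are supplied by hand here; in practice they would come from a trained tagger such as the CRF models described above.

```python
# Hypothetical mapping from word-layer tag names to punctuation marks.
TAG_TO_PUNCT = {"COMMA": ",", "PERIOD": ".", "QMARK": "?", "EMARK": "!"}


def build_output_sentence(words, word_tags):
    """Combine words with the punctuation marks selected by their word-layer tags."""
    if len(words) != len(word_tags):
        raise ValueError("each word needs exactly one word-layer tag")
    pieces = []
    for word, tag in zip(words, word_tags):
        # The NONE tag (or any unknown tag) inserts nothing after the word.
        pieces.append(word + TAG_TO_PUNCT.get(tag, ""))
    return " ".join(pieces)


if __name__ == "__main__":
    words = "no please do not would you mind".split()
    tags = ["COMMA", "NONE", "NONE", "PERIOD", "NONE", "NONE", "QMARK"]
    print(build_output_sentence(words, tags))
```

Capitalization is left untouched in this sketch, since the method as described only selects punctuation marks.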


Grammar Error Correction

There are differences between training on annotated learner text and training on non-learner text, namely whether the observed word can be used as a feature or not. When training on non-learner text, the observed word cannot be used as a feature. The word choice of the writer is “blanked out” from the text and serves as the correct class. A classifier is trained to re-predict the word given the surrounding context. The confusion set of possible classes is usually pre-defined. This selection task formulation is convenient as training examples can be created “for free” from any text that is assumed to be free of grammatical errors. A more realistic correction task is defined as follows: given a particular word and its context, propose an appropriate correction. The proposed correction can be identical to the observed word, i.e., no correction is necessary. The main difference is that the word choice of the writer can be encoded as part of the features.


Article errors are one frequent type of errors made by EFL learners. For article errors, the classes are the three articles a, the, and the zero-article. This covers article insertion, deletion, and substitution errors. During training, each noun phrase (NP) in the training data is one training example. When training on learner text, the correct class is the article provided by the human annotator. When training on non-learner text, the correct class is the observed article. The context is encoded via a set of feature functions. During testing, each NP in the test set is one test example. The correct class is the article provided by the human annotator when testing on learner text or the observed article when testing on non-learner text.
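The difference between the two kinds of training example may be sketched for articles as follows. The function names, the dictionary layout, and the context feature names are hypothetical and serve only to show where the writer's observed article ends up in each formulation.

```python
ARTICLES = ("a", "the", "")  # the empty string stands for the zero-article


def selection_example(context, observed_article):
    """Non-learner text: the observed article is the label and is hidden from the features."""
    return {"features": dict(context), "label": observed_article}


def correction_example(context, observed_article, annotated_article):
    """Learner text: the annotator's article is the label; the writer's choice is a feature."""
    features = dict(context)
    features["observed_article"] = observed_article
    return {"features": features, "label": annotated_article}


if __name__ == "__main__":
    ctx = {"head_noun": "university", "word_before": "at"}  # hypothetical context features
    print(selection_example(ctx, "the"))
    print(correction_example(ctx, "", "the"))
```

In the second example the writer omitted the article and the annotator supplied “the”, which corresponds to an article insertion correction.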


Preposition errors are another frequent type of errors made by EFL learners. The approach to preposition errors is similar to articles but typically focuses on preposition substitution errors. In this work, the classes are 36 frequent English prepositions (about, along, among, around, as, at, beside, besides, between, by, down, during, except, for, from, in, inside, into, of, off, on, onto, outside, over, through, to, toward, towards, under, underneath, until, up, upon, with, within, without). Every prepositional phrase (PP) that is governed by one of the 36 prepositions is one training or test example. PPs governed by other prepositions are ignored in this embodiment.
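A minimal sketch of the preposition confusion set and the corresponding filtering of prepositional phrases is given below. The list is transcribed from the paragraph above; the function name and the representation of a PP as a (preposition, phrase) pair are assumptions made for illustration.

```python
# The 36 prepositions that form the confusion set, as listed in the text.
PREPOSITIONS = frozenset("""
about along among around as at beside besides between by down during
except for from in inside into of off on onto outside over through to
toward towards under underneath until up upon with within without
""".split())


def preposition_examples(prepositional_phrases):
    """Keep only PPs whose governing preposition belongs to the confusion set.

    Each input item is a (preposition, phrase) pair; PPs governed by any
    other preposition are ignored.
    """
    return [
        (prep.lower(), phrase)
        for prep, phrase in prepositional_phrases
        if prep.lower() in PREPOSITIONS
    ]


if __name__ == "__main__":
    pps = [("In", "in the park"), ("despite", "despite the rain")]
    print(preposition_examples(pps))
```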



FIG. 9 illustrates one embodiment of a method 900 for correcting grammatical errors. In one embodiment, the method 900 may include receiving 902 a natural language text input, the text input comprising a grammatical error in which a portion of the input text comprises a class from a set of classes. This method 900 may also include generating 904 a plurality of selection tasks from a corpus of non-learner text that is assumed to be free of grammatical errors, wherein for each selection task a classifier re-predicts a class used in the non-learner text. Further, the method 900 may include generating 906 a plurality of correction tasks from a corpus of learner text, wherein for each correction task a classifier proposes a class used in the learner text. Additionally, the method 900 may include training 908 a grammar correction model using a set of binary classification problems that include the plurality of selection tasks and the plurality of correction tasks. This embodiment may also include using 910 the trained grammar correction model to predict a class for the text input from the set of possible classes.


According to one embodiment, grammatical error correction (GEC) is formulated as a classification problem and linear classifiers are used to solve the classification problem.


Classifiers are used to approximate the unknown relation between articles or prepositions and their contexts in learner text, and their valid corrections. The articles or prepositions and their contexts are represented as feature vectors X ∈ χ. The corrections are the classes Y ∈ γ.


In one embodiment, binary linear classifiers of the form u^T X, where u is a weight vector, are employed. The outcome is considered +1 if the score is positive and −1 otherwise. A popular method for finding u is empirical risk minimization with least square regularization. Given a training set {X_i, Y_i}, i=1, . . . , n, the goal is to find the weight vector that minimizes the empirical loss on the training data








$$\hat{u} = \arg\min_{u} \left( \frac{1}{n} \sum_{i=1}^{n} L\left(u^{T} X_i, Y_i\right) + \lambda \lVert u \rVert^{2} \right),$$




where L is a loss function. In one embodiment, a modification of Huber's robust loss function is used. The regularization parameter λ may be set to 10^−4 according to one embodiment. A multi-class classification problem with m classes can be cast as m binary classification problems in a one-vs-rest arrangement. The prediction of the classifier is the class with the highest score, Ŷ = arg max_{Y ∈ γ} (u_Y^T X).
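The regularized objective and the one-vs-rest prediction rule can be sketched as follows. The exact modification of Huber's loss is not spelled out in the text; the form below is the commonly used "modified Huber" loss and is an assumption, as are all function names.

```python
def dot(u, x):
    """Inner product u^T x of two equal-length lists."""
    return sum(ui * xi for ui, xi in zip(u, x))

def modified_huber(score, y):
    """Assumed loss form for a label y in {+1, -1}."""
    margin = score * y
    if margin >= -1.0:
        return max(0.0, 1.0 - margin) ** 2
    return -4.0 * margin

def regularized_risk(u, data, lam=1e-4):
    """Mean loss over (x, y) pairs plus lam * ||u||^2."""
    loss = sum(modified_huber(dot(u, x), y) for x, y in data) / len(data)
    return loss + lam * dot(u, u)

def predict(weights, x):
    """One-vs-rest: pick the class whose binary classifier scores highest.

    weights maps each class to its weight vector u_Y.
    """
    return max(weights, key=lambda cls: dot(weights[cls], x))
```

A training procedure would search for the `u` that minimizes `regularized_risk` for each class; the optimizer itself is left out of this sketch.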


Six feature extraction methods are implemented, three for articles and three for prepositions. The methods require different linguistic pre-processing: chunking, CCG parsing, and constituency parsing.


Examples of feature extraction for article errors include “DeFelice”, “Han”, and “Lee”. DeFelice: The system for article errors uses a CCG parser to extract a rich set of syntactic and semantic features, including part of speech (POS) tags, hypernyms from WordNet, and named entities. Han: The system relies on shallow syntactic and lexical features derived from a chunker, including the words before, in, and after the NP, the head word, and POS tags. Lee: The system uses a constituency parser. The features include POS tags, surrounding words, the head word, and hypernyms from WordNet.


Examples of feature extraction for preposition errors include “DeFelice”, “TetreaultChunk”, and “TetreaultParse”. DeFelice: The system for preposition errors uses a similar rich set of syntactic and semantic features as the system for article errors. In the re-implementation, a subcategorization dictionary is not used. TetreaultChunk: The system uses a chunker to extract features from a two-word window around the preposition, including lexical and POS n-grams, and the head words from neighboring constituents. TetreaultParse: The system extends TetreaultChunk by adding features derived from a constituency and a dependency parse tree.


For each of the above feature sets, the observed article or preposition is added as an additional feature when training on learner text.


According to one embodiment, Alternating Structure Optimization (ASO), a multi-task learning algorithm that takes advantage of the common structure of multiple related problems, can be used for grammatical error correction. Assume that there are m binary classification problems. Each classifier u_i is a weight vector of dimension p. Let Θ be an orthonormal h×p matrix that captures the common structure of the m weight vectors. It is assumed that each weight vector can be decomposed into two parts: one part that models the particular i-th classification problem and one part that models the common structure






$$u_i = w_i + \Theta^{T} v_i.$$


The parameters [{w_i, v_i}, Θ] can be learned by joint empirical risk minimization, i.e., by minimizing the joint empirical loss of the m problems on the training data









$$\sum_{l=1}^{m} \left( \frac{1}{n} \sum_{i=1}^{n} L\left( \left(w_l + \Theta^{T} v_l\right)^{T} X_i^{l}, Y_i^{l} \right) + \lambda \lVert w_l \rVert^{2} \right).$$





In ASO, the problems used to find Θ do not have to be the same as the target problems to be solved. Instead, auxiliary problems can be automatically created for the sole purpose of learning a better Θ.


Assuming that there are k target problems and m auxiliary problems, an approximate solution to the above equation can be obtained by performing the following algorithm:

    • 1. Learn m linear classifiers ui independently.
    • 2. Let U=[u1, u2, . . . um] be the p×m matrix formed from the m weight vectors.
    • 3. Perform Singular Value Decomposition (SVD) on U: U = V1 D V2^T. The first h column vectors of V1 are stored as rows of Θ.
    • 4. Learn wj and vj for each of the target problems by minimizing the empirical risk:








$$\frac{1}{n} \sum_{i=1}^{n} L\left( \left(w_j + \Theta^{T} v_j\right)^{T} X_i, Y_i \right) + \lambda \lVert w_j \rVert^{2}.$$






    • 5. The weight vector for the j-th target problem is:

$$u_j = w_j + \Theta^{T} v_j.$$
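The final step of the algorithm combines a problem-specific part w_j with a shared part Θ^T v_j, consistent with the term (w_j + Θ^T v_j)^T X_i in the risk of Step 4. The sketch below only illustrates this combination for a given Θ; it does not perform the SVD of Step 3, and the function names and the list-of-rows representation of Θ are assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def combine(w, theta, v):
    """Return u = w + Theta^T v.

    theta is an h x p matrix given as h rows of length p,
    w has length p and v has length h.
    """
    u = list(w)
    for row, v_k in zip(theta, v):
        for d, t in enumerate(row):
            u[d] += t * v_k
    return u

def score_decomposed(w, theta, v, x):
    """Compute w^T x + v^T (Theta x), which equals u^T x."""
    projected = [dot(row, x) for row in theta]
    return dot(w, x) + dot(v, projected)
```

The second function shows why Θ can be read as a shared projection: Θx maps the p-dimensional input to h shared features, and v_j weights those features for the j-th problem.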


Beneficially, the selection task on non-learner text is a highly informative auxiliary problem for the correction task on learner text. For example, a classifier that can predict the presence or absence of the preposition on can be helpful for correcting wrong uses of on in learner text, e.g., if the classifier's confidence for on is low but the writer used the preposition on, the writer might have made a mistake. As the auxiliary problems can be created automatically, the power of very large corpora of non-learner text can be leveraged.


In one embodiment, a grammatical error correction task with m classes is assumed. For each class, a binary auxiliary problem is defined. The feature space of the auxiliary problems is a restriction of the original feature space χ to all features except the observed word: χ\{X_obs}. The weight vectors of the auxiliary problems form the matrix U in Step 2 of the ASO algorithm, from which Θ is obtained through SVD. Given Θ, the vectors w_j and v_j, j=1, . . . , k, can be obtained from the annotated learner text using the complete feature space χ.


This can be seen as an instance of transfer learning, as the auxiliary problems are trained on data from a different domain (non-learner text) and have a slightly different feature space (χ\{X_obs}). The method is general and can be applied to any classification problem in GEC.


Evaluation metrics are defined for both experiments on non-learner text and learner text. For experiments on non-learner text, accuracy, which is defined as the number of correct predictions divided by the total number of test instances, is used as the evaluation metric. For experiments on learner text, the F1-measure is used as the evaluation metric. The F1-measure is defined as







$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$







where precision is the number of suggested corrections that agree with the human annotator divided by the total number of proposed corrections by the system, and recall is the number of suggested corrections that agree with the human annotator divided by the total number of errors annotated by the human annotator.
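As a minimal sketch of these definitions, precision, recall, and F1 can be computed from two mappings of instance identifiers to corrections. The function name and the dictionary representation are assumptions made for illustration.

```python
def f1_measure(proposed, gold):
    """Return (precision, recall, F1).

    proposed: instance id -> correction suggested by the system
    gold:     instance id -> correction given by the human annotator
    """
    agree = sum(1 for key, corr in proposed.items() if gold.get(key) == corr)
    precision = agree / len(proposed) if proposed else 0.0
    recall = agree / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Instances where the system keeps the observed word are simply absent from `proposed`, so they affect recall but not precision.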


A set of experiments was designed to test the correction task on NUCLE test data. The second set of experiments investigates the primary goal of this work: to automatically correct grammatical errors in learner text. The test instances were extracted from NUCLE. In contrast to the previous selection task, the observed word choice of the writer can be different from the correct class, and the observed word was available during testing. Two different baselines and the ASO method were investigated.


The first baseline was a classifier trained on the Gigaword corpus in the same way as described in the selection task experiment. A simple thresholding strategy was used to make use of the observed word during testing. The system only flags an error if the difference between the classifier's confidence for its first choice and the confidence for the observed word is higher than a threshold t. The threshold parameter t was tuned on the NUCLE development data for each feature set. In the experiments, the value for t was between 0.7 and 1.2.
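The thresholding strategy can be sketched as follows, assuming the classifier exposes one confidence score per class in a dictionary; the function name and the example scores are hypothetical.

```python
def flag_error(scores, observed, t):
    """Return a proposed correction, or None to keep the observed word.

    scores:   class -> classifier confidence
    observed: the word the writer actually used
    t:        threshold tuned on development data
    """
    best = max(scores, key=scores.get)
    if best != observed and scores[best] - scores[observed] > t:
        return best
    return None
```

A larger t makes the system more conservative: it trades recall for precision by proposing a correction only when the classifier strongly prefers another class.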


The second baseline was a classifier trained on NUCLE. The classifier was trained in the same way as the Gigaword model, except that the observed word choice of the writer is included as a feature. The correct class during training is the correction provided by the human annotator. As the observed word is part of the features, this model does not need an extra thresholding step. Indeed, thresholding is harmful in this case. During training, the instances that do not contain an error greatly outnumber the instances that do contain an error. To reduce this imbalance, all instances that contain an error were kept and a random sample of q percent of the instances that do not contain an error was retained. The under-sample parameter q was tuned on the NUCLE development data for each data set. In the experiments, the value for q was between 20% and 40%.
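The under-sampling step might look like the following sketch. The representation of an instance as an (observed, correct) pair, the function name, and the fixed seed are assumptions.

```python
import random

def undersample(instances, q, seed=0):
    """Keep all erroneous instances and q percent of the error-free ones.

    instances: list of (observed, correct) pairs; an instance contains an
               error when the two entries differ.
    """
    rng = random.Random(seed)
    errors = [inst for inst in instances if inst[0] != inst[1]]
    clean = [inst for inst in instances if inst[0] == inst[1]]
    kept = rng.sample(clean, round(len(clean) * q / 100))
    return errors + kept
```

With q between 20 and 40, the share of erroneous instances in the training data rises substantially while no annotated error is discarded.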


The ASO method was trained in the following way. Binary auxiliary problems for articles or prepositions were created, i.e., there were 3 auxiliary problems for articles and 36 auxiliary problems for prepositions. The classifiers for the auxiliary problems were trained on the complete 10 million instances from Gigaword in the same way as in the selection task experiment. The weight vectors of the auxiliary problems form the matrix U. Singular value decomposition (SVD) was performed to get U = V1 D V2^T. All columns of V1 were kept to form Θ. The target problems were again binary classification problems for each article or preposition, but this time trained on NUCLE. The observed word choice of the writer was included as a feature for the target problems. The instances that do not contain an error were undersampled and the parameter q was tuned on the NUCLE development data. The value for q is between 20% and 40%. No thresholding is applied.


The learning curves of the correction task experiments on NUCLE test data are shown in FIGS. 11 and 12. Each sub-plot shows the curves of three models as described in the last section: ASO trained on NUCLE and Gigaword, the baseline classifier trained on NUCLE, and the baseline classifier trained on Gigaword. For ASO, the x-axis shows the number of target problem training instances. We observe that training on annotated learner text can significantly improve performance. In three experiments, the NUCLE model outperforms the Gigaword model trained on 10 million instances. Finally, the ASO models show the best results. In the experiments where the NUCLE models already perform better than the Gigaword baseline, ASO gives comparable or slightly better results. In those experiments where neither baseline shows good performance (TetreaultChunk, TetreaultParse), ASO results in a large improvement over either baseline.


Semantic Collocation Error Correction

In one embodiment, the frequency of collocation errors caused by the writer's native or first language (L1) is analyzed. These types of errors are referred to as “L1-transfer errors.” L1-transfer errors are used to estimate how many errors in EFL writing can potentially be corrected with information about the writer's L1 language. For example, L1-transfer errors may be a result of imprecise translations between words in the writer's L1 language and English. In such an example, a word with multiple meanings in Chinese may not translate precisely to a word in, for example, English.


In one embodiment, the analysis is based on the NUS Corpus of Learner English (NUCLE). The corpus consists of about 1,400 essays written by EFL university students on a wide range of topics, like environmental pollution or healthcare. Most of the students are native Chinese speakers. The corpus contains over one million words which are completely annotated with error tags and corrections. The annotation is stored in a stand-off fashion. Each error tag consists of the start and end offset of the annotation, the type of the error, and the appropriate gold correction as deemed by the annotator. The annotators were asked to provide a correction that would result in a grammatical sentence if the selected word or phrase were replaced by the correction.


In one embodiment, errors which have been marked with the error tag wrong collocation/idiom/preposition are analyzed. All instances which represent simple substitutions of prepositions are automatically filtered out using a fixed list of frequent English prepositions. In a similar way, a small number of article errors which were marked as collocation errors are filtered out. Finally, instances where the annotated phrase or the suggested correction is longer than [text missing or illegible when filed] words are filtered out, as they contain highly context-specific corrections and are unlikely to generalize well (e.g., “for the simple reasons that these can help them”→“simply to”).


After filtering, 2,747 collocation errors and their respective corrections are obtained, which account for about 6% of all errors in NUCLE. This makes collocation errors the 7th largest class of errors in the corpus after article errors, redundancies, prepositions, noun number, verb tense, and mechanics. Not counting duplicates, there are 2,412 distinct collocation errors and corrections. Although there are other error types which are more frequent, collocation errors represent a particular challenge as the possible corrections are not restricted to a closed set of choices and they are directly related to semantics rather than syntax. The collocation errors were analyzed and it was found that they can be attributed to the following sources of confusion:


Spelling: An error can be caused by similar orthography if the edit distance between the erroneous phrase and its correction is less than a certain threshold.

Homophones: An error can be caused by similar pronunciation if the erroneous word and its correction have the same pronunciation. A phone dictionary was used to map words to their phonetic representations.

Synonyms: An error can be caused by synonymy if the erroneous word and its correction are synonyms in WordNet. WordNet 3.0 was used.

L1-transfer: An error can be caused by L1-transfer if the erroneous phrase and its correction share a common translation in a Chinese-English phrase table. The details of the phrase table construction are described herein. Although the method is used on Chinese-English translation in this particular embodiment, the method is applicable to any language pair where parallel corpora are available.

As the phone dictionary and WordNet are defined for individual words, the matching process is extended to phrases in the following way: two phrases A and B are deemed homophones/synonyms if they have the same length and the i-th word in phrase A is a homophone/synonym of the corresponding i-th word in phrase B.
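The spelling test and the word-by-word extension to phrases can be sketched as below. The thresholds follow the caption of Table 6 (one for phrases of up to six characters, two otherwise); treating the threshold as inclusive, the function names, and the caller-supplied `related` lookup are assumptions. A real system would back `related` with a phone dictionary or WordNet.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_spelling_confusion(phrase, correction):
    """Assumed rule: distance <= 1 for short phrases, <= 2 otherwise."""
    limit = 1 if len(phrase) <= 6 else 2
    return edit_distance(phrase, correction) <= limit

def phrases_match(phrase_a, phrase_b, related):
    """True if both phrases have the same length and every word pair is related.

    related(word_a, word_b) is a caller-supplied homophone or synonym test.
    """
    a, b = phrase_a.split(), phrase_b.split()
    return len(a) == len(b) and all(related(x, y) for x, y in zip(a, b))
```

For instance, "critics" and "criticism" are two edits apart, so the pair would be counted as a possible spelling confusion under the assumed rule.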









TABLE 6

Analysis of collocation errors. The threshold for spelling errors is one for phrases of up to six characters and two for the remaining phrases.

| Suspected Error Source                         | Tokens | Types |
|------------------------------------------------|--------|-------|
| Spelling                                       | 154    | 131   |
| Homophones                                     | 2      | 2     |
| Synonyms                                       | 74     | 60    |
| L1-transfer                                    | 1016   | 782   |
| L1-transfer w/o spelling                       | 954    | 727   |
| L1-transfer w/o homophones                     | 1015   | 781   |
| L1-transfer w/o synonyms                       | 958    | 737   |
| L1-transfer w/o spelling, homophones, synonyms | 906    | 692   |
















TABLE 7

Examples of collocation errors with different sources of confusion. The correction is shown in parentheses. For L1-transfer, the shared Chinese translation is also shown. The L1-transfer examples shown here do not belong to any of the other categories.

| Source      | Example                                                                                          |
|-------------|--------------------------------------------------------------------------------------------------|
| Spelling    | it received critics (criticism) as much as complaints                                            |
|             | budget for the aged to improvise (improve) other areas                                           |
| Homophones  | diverse spending can aide (aid) our country                                                      |
|             | insure (ensure) the safety of civilians                                                          |
| Synonyms    | rapid increment (increase) of the seniors                                                        |
|             | energy that we can apply (use) in the future                                                     |
| L1-transfer | and give (provide, [Chinese characters not reproduced]) reasonable fares to the public           |
|             | and concerns (attention, [Chinese characters not reproduced]) that the nation put on technology and engineering |









The results of the analysis are shown in Table 6. Tokens refer to running erroneous phrase-correction pairs including duplicates, and types refer to distinct erroneous phrase-correction pairs. As a collocation error can be part of more than one category, the rows in the table do not sum up to the total number of errors. The number of errors that can be traced to L1-transfer greatly outnumbers all other categories. The table also shows the number of collocation errors that can be traced to L1-transfer but not the other sources. 906 collocation errors with 692 distinct collocation error types can be attributed only to L1-transfer but not to spelling, homophones, or synonyms. Table 7 shows some examples of collocation errors for each category from our corpus. There are also collocation error types that cannot be traced to any of the above sources.


A method 1300 for correcting collocation errors in EFL writing is disclosed. One embodiment of such a method 1300 includes automatically identifying 1302 one or more translation candidates in response to analysis of a corpus of parallel-language text conducted in a processing device. Additionally, the method 1300 may include determining 1304, using the processing device, a feature associated with each translation candidate. The method 1300 may also include generating 1306 a set of one or more weight values from a corpus of learner text stored in a data storage device. The method 1300 may further include calculating 1308, using a processing device, a score for each of the one or more translation candidates in response to the feature associated with each translation candidate and the set of one or more weight values.


In one embodiment, the method is based on L1-induced paraphrasing. L1-induced paraphrasing with parallel corpora is used to automatically find collocation candidates from a sentence-aligned L1-English parallel corpus. As most of the essays in the corpus are written by native Chinese speakers, the FBIS Chinese-English corpus is used, which consists of about 230,000 Chinese sentences (8.5 million words) from news articles, each with a single English translation. The English half of the corpus is tokenized and lowercased. The Chinese half of the corpus is segmented using a maximum entropy segmenter. Subsequently, the texts are automatically aligned at the word level using the Berkeley aligner. English-L1 and L1-English phrases of up to three words are extracted from the aligned texts using a phrase extraction heuristic. The paraphrase probability of an English phrase e1 given an English phrase e2 is defined as







$$p(e_1 \mid e_2) = \sum_{f} p(e_1 \mid f)\, p(f \mid e_2)$$








where f denotes a foreign phrase in the L1 language. The phrase translation probabilities p(e1|f) and p(f|e2) are estimated by maximum likelihood estimation and smoothed using Good-Turing smoothing. Finally, only paraphrases with a probability above a certain threshold (set to 0.001 in this work) are kept.
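The pivoting computation can be sketched as follows, assuming the two phrase translation tables are already available as nested dictionaries. The function name, the toy probabilities, and the placeholder pivot phrases are hypothetical; smoothing is omitted, and skipping a phrase as its own paraphrase is an added assumption.

```python
from collections import defaultdict

def paraphrase_table(p_e_given_f, p_f_given_e, threshold=0.001):
    """Return table[e2][e1] = sum over f of p(e1|f) * p(f|e2).

    p_e_given_f[f][e] holds p(e|f); p_f_given_e[e][f] holds p(f|e).
    Entries at or below the threshold are dropped.
    """
    table = defaultdict(dict)
    for e2, pivots in p_f_given_e.items():
        scores = defaultdict(float)
        for f, p_f in pivots.items():
            for e1, p_e1 in p_e_given_f.get(f, {}).items():
                scores[e1] += p_e1 * p_f
        for e1, p in scores.items():
            if e1 != e2 and p > threshold:
                table[e2][e1] = p
    return dict(table)

# Toy data: "f1" and "f2" stand in for L1 phrases.
p_f_given_e = {"give": {"f1": 0.5, "f2": 0.5}}
p_e_given_f = {"f1": {"give": 0.75, "provide": 0.25},
               "f2": {"give": 0.5, "provide": 0.5}}
table = paraphrase_table(p_e_given_f, p_f_given_e)
```

Here "provide" becomes a paraphrase candidate for "give" because both are translations of the same pivot phrases.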


In another embodiment, the method of collocation correction may be implemented in the framework of phrase-based statistical machine translation (SMT). Phrase-based SMT tries to find the highest scoring translation e given an input sentence f. The decoding process of finding the highest scoring translation is guided by a log-linear model which scores translation candidates using a set of feature functions h_i, i=1, . . . , n







$$\text{score}(e \mid f) = \exp\left( \sum_{i=1}^{n} \lambda_i h_i(e, f) \right).$$





Typical features include a phrase translation probability p(e|f), an inverse phrase translation probability p(f|e), a language model score p(e), and a constant phrase penalty. The optimization of the feature weights λ_i, i=1, . . . , n, can be done using minimum error rate training (MERT) on a development set of input sentences and the reference translations.
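The log-linear score can be illustrated with a short sketch. The feature names "tm" and "lm", the use of log-probabilities as feature values, and the numbers are assumptions chosen for illustration; in practice a decoder computes these values internally.

```python
import math

def loglinear_score(features, weights):
    """exp(sum_i lambda_i * h_i) for one candidate.

    features: feature name -> h_i(e, f), typically a log-probability
    weights:  feature name -> lambda_i
    """
    return math.exp(sum(weights[name] * h for name, h in features.items()))

def best_candidate(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(candidates[c], weights))

weights = {"tm": 1.0, "lm": 0.5}
candidates = {
    "provide": {"tm": math.log(0.4), "lm": math.log(0.09)},
    "offer": {"tm": math.log(0.5), "lm": math.log(0.01)},
}
```

Because exp is monotonic, ranking candidates by the weighted sum alone gives the same order; the weights decide how much each knowledge source contributes.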


The phrase table of the phrase-based SMT decoder MOSES is modified to include collocation corrections with features derived from spelling, homophones, synonyms, and L1-induced paraphrases.


Spelling: For each English word, the phrase table contains entries consisting of the word itself and each word that is within a certain edit distance from the original word. Each entry has a constant feature of 1.0.

Homophones: For each English word, the phrase table contains entries consisting of the word itself and each of the word's homophones. Homophones are determined using the CuVPlus dictionary. Each entry has a constant feature of 1.0.

Synonyms: For each English word, the phrase table contains entries consisting of the word itself and each of its synonyms in WordNet. If a word has more than one sense, all its senses are considered. Each entry has a constant feature of 1.0.

L1-paraphrases: For each English phrase, the phrase table contains entries consisting of the phrase and each of its L1-derived paraphrases. Each entry has two real-valued features: a paraphrase probability and an inverse paraphrase probability.

Baseline: The phrase tables built for spelling, homophones, and synonyms are combined, where the combined phrase table contains three binary features for spelling, homophones, and synonyms, respectively.

All: The phrase tables from spelling, homophones, synonyms, and L1-paraphrases are combined, where the combined phrase table contains five features: three binary features for spelling, homophones, and synonyms, and two real-valued features for the L1-paraphrase probability and inverse L1-paraphrase probability.

Additionally, each phrase table contains the standard constant phrase penalty feature. The first four tables only contain collocation candidates for individual words. It is left to the decoder to construct corrections for longer phrases during the decoding process if necessary.


A set of experiments was carried out to test the methods of semantic collocation error correction. The data set used for the experiments was a randomly sampled development set of 770 sentences and a test set of 856 sentences from the corpus. Each sentence contained exactly one collocation error. The sampling was performed in a way that sentences from the same document cannot end up in both the development and the test set. In order to keep conditions as realistic as possible, the test set was not filtered in any way.


Evaluation metrics were also defined for the experiments to evaluate the collocation error correction. An automatic and a human evaluation were conducted. The main evaluation metric is mean reciprocal rank (MRR), which is the arithmetic mean of the inverse ranks of the first correct answer returned by the system






$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}(i)}$$









where N is the size of the test set. If the system did not return a correct answer for a test instance, 1/rank(i) is set to zero.
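A minimal sketch of the MRR computation follows; representing a missing correct answer as `None` is an assumption of this example.

```python
def mean_reciprocal_rank(ranks):
    """ranks: rank of the first correct answer per test instance, or None
    when the system returned no correct answer (contributes zero)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)
```

For four test instances with ranks 1, 2, none, and 4, the result is (1 + 0.5 + 0 + 0.25) / 4 = 0.4375.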


In the human evaluation, precision at rank k, k=1, 2, 3, was additionally reported, where the precision is calculated as follows:







$$P@k = \frac{\sum_{a \in A} \text{score}(a)}{\lvert A \rvert}$$







where A is the set of returned answers of rank k or less and score(•) is a real-valued scoring function between zero and one.
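Precision at rank k can be sketched as below, assuming each returned answer is stored as a (rank, score) pair pooled over all test instances; the function name and data layout are hypothetical.

```python
def precision_at_k(answers, k):
    """answers: list of (rank, score) pairs, with score between 0 and 1.

    Averages the scores of all answers whose rank is k or less.
    """
    top = [score for rank, score in answers if rank <= k]
    return sum(top) / len(top) if top else 0.0
```

With binary judge scores this reduces to the fraction of returned answers within the top k that were judged correct.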


In the collocation error experiments, automatic correction of collocation errors can conceptually be divided into two steps: i) identification of wrong collocations in the input, and ii) correction of the identified collocations. It was assumed that the erroneous collocation had already been identified.


In the experiments, the start and end offset of the collocation error provided by the human annotator was used to identify the location of the collocation error. The translation of the rest of the sentence was fixed to its identity. Phrase table entries where the phrase and the candidate correction are identical were removed, which practically forced the system to change the identified phrase. The distortion limit of the decoder was set to zero to achieve monotone decoding. For the language model, a 5-gram language model trained on the English Gigaword corpus with modified Kneser-Ney smoothing was used. All experiments used the same language model to allow a fair comparison.


MERT training with the popular BLEU metric was performed on the development set of erroneous sentences and their corrections. As the search space was restricted to changing a single phrase per sentence, training converges relatively quickly, after two or three iterations. After convergence, the model can be used to automatically correct new collocation errors.


The performance of the proposed method was evaluated on the test set of 85 sentences, each with one collocation error. Both an automatic and a human evaluation were conducted. In the automatic evaluation, the system's performance was measured by computing the rank of the gold answer provided by the human annotator in the n-best list of the system. The size of the n-best list was limited to the top 100 outputs. If the gold answer was not found in the top 100 outputs, the rank was considered to be infinity, or in other words, the inverse of the rank is zero. The number of test instances for which the gold answer was ranked among the top k answers, k=1, 2, 3, 10, 100, was reported. The results of the automatic evaluation are shown in Table 8.
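As a non-limiting illustration, the rank lookup and the per-cutoff counts described above may be sketched in Python as follows. Exact string matching against the gold answer, the function names, and the example n-best lists are assumptions introduced here for illustration.

```python
from typing import Dict, Optional, Sequence

N_BEST_LIMIT = 100  # the text limits the n-best list to the top 100 outputs


def gold_rank(n_best: Sequence[str], gold: str,
              limit: int = N_BEST_LIMIT) -> Optional[int]:
    """Return the 1-based rank of the gold answer, or None if it is absent.

    None stands for an infinite rank, whose inverse is zero.
    Assumption: a candidate matches only if it equals the gold string.
    """
    for position, candidate in enumerate(n_best[:limit], start=1):
        if candidate == gold:
            return position
    return None


def hits_within(ranks: Sequence[Optional[int]],
                cutoffs: Sequence[int] = (1, 2, 3, 10, 100)) -> Dict[int, int]:
    """Count the test instances whose gold answer is ranked within each cutoff."""
    return {k: sum(1 for r in ranks if r is not None and r <= k)
            for k in cutoffs}


# Illustrative n-best lists only, not data from the experiments.
ranks = [
    gold_rank(["heavy rain", "strong rain"], "heavy rain"),
    gold_rank(["big rain", "hard rain", "heavy rain"], "heavy rain"),
    gold_rank(["big rain"], "heavy rain"),
]
hits = hits_within(ranks)
```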









TABLE 8

Results of automatic evaluation. Columns two to six show the number of gold answers that are ranked within the top k answers. The last column shows the mean reciprocal rank in percentage. Bigger values are better.

Model            Rank = 1   Rank ≤ 2   Rank ≤ 3   Rank ≤ 10   Rank ≤ 100     MRR
Spelling               35         41         42          44           44    4.51
Homophones              1          1          1           1            1    0.11
Synonyms               32         47         52          60           61    4.98
Baseline               49         68         80          93           96    7.61
L1-paraphrases         93        133        154         216          243   15.43
All                   112        150        166         216          241   17.21
















TABLE 9

Inter-annotator agreement, P(E) = 0.5

P(A)     0.8076
Kappa    0.6152









For collocation errors, there is usually more than one possible correct answer. Therefore, automatic evaluation underestimates the actual performance of the system by only considering the single gold answer as correct and all other answers as wrong. A human evaluation for the systems BASELINE and ALL was carried out. Two English speakers were recruited to judge a subset of 500 test sentences. For each sentence, a judge was shown the original sentence and the 3-best candidates of each of the two systems. The human evaluation was restricted to the 3-best candidates, as the answers at a rank larger than three will not be very useful in a practical application. The candidates were displayed together in alphabetical order without any information about their rank, which system produced them, or the gold answer by the annotator. The difference between the candidates and the original sentence was highlighted. The judges were asked to make a binary judgment for each of the candidates on whether the proposed candidate was a valid correction of the original or not. Valid corrections were represented with a score of 1.0 and invalid corrections with a score of 0.0. Inter-annotator agreement is reported in Table 9. The observed agreement P(A) is the percentage of times that the annotators agree, and P(E) is the expected agreement by chance, which is 0.5 in this case. The Kappa coefficient is defined as






$$\mathrm{Kappa} = \frac{P(A) - P(E)}{1 - P(E)}$$








A Kappa coefficient of 0.6152 was obtained from the experiment, where a Kappa coefficient between 0.6 and 0.8 is considered as showing substantial agreement. To compute precision at rank k, the judgments were averaged. Thus, a system can receive a score of 0.0 (both judgments negative), 0.5 (judges disagree), or 1.0 (both judgments positive) for each returned answer.
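As a non-limiting illustration, the Kappa coefficient, taken as (P(A) - P(E)) / (1 - P(E)), and the averaging of the two binary judgments may be sketched in Python as follows. The function names are assumptions introduced here for illustration; the values 0.8076 and 0.5 are the P(A) and P(E) reported above.

```python
def kappa(p_agree: float, p_chance: float) -> float:
    """Kappa = (P(A) - P(E)) / (1 - P(E))."""
    if p_chance >= 1.0:
        raise ValueError("P(E) must be less than 1")
    return (p_agree - p_chance) / (1.0 - p_chance)


def averaged_score(judgment_a: bool, judgment_b: bool) -> float:
    """Average two binary judgments: valid scores 1.0, invalid scores 0.0."""
    return (float(judgment_a) + float(judgment_b)) / 2.0


# P(A) = 0.8076 and P(E) = 0.5 as reported for the human evaluation.
reported_kappa = kappa(0.8076, 0.5)

# The three scores a returned answer can receive.
both_negative = averaged_score(False, False)  # 0.0
disagreement = averaged_score(True, False)    # 0.5
both_positive = averaged_score(True, True)    # 1.0
```

With the reported values, the computed coefficient agrees with the stated 0.6152 up to floating-point rounding.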


All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the apparatus and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. In addition, modifications may be made to the disclosed apparatus and components may be eliminated or substituted for the components described herein where the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope, and concept of the invention as defined by the appended claims.

Claims
  • 1. An apparatus, comprising: at least one processor and a memory device coupled to the at least one processor, in which the at least one processor is configured: to identify words of an input utterance; to place the words in a plurality of first nodes stored in the memory device; to assign a word-layer tag to each of the plurality of first nodes based, in part, on neighboring nodes of the plurality of first nodes; and to generate an output sentence by combining words from the plurality of first nodes with punctuation marks selected, in part, on the word-layer tags assigned to each of the first nodes.
  • 2. The apparatus of claim 1, in which the word-layer tag is at least one of none, comma, period, question mark, and exclamation mark.
  • 3. The apparatus of claim 1, in which the plurality of first nodes is a first-order linear chain of conditional random fields.
  • 4. The apparatus of claim 1, in which each of the word-layer tags is placed in a node of a plurality of second nodes stored in the memory device, each of the second nodes coupled to at least one of the first nodes.
  • 5. The apparatus of claim 1, in which the at least one processor is further configured to assign a sentence-layer tag to each of the nodes in the plurality of first nodes based, in part, on boundaries of the input utterance, in which punctuation marks selected for the output sentence are selected, in part, on the sentence-layer tag, in which the sentence-layer tag is at least one of a declaration beginning, declaration inner, question beginning, question inner, exclamation beginning, and exclamation inner, and in which the plurality of first nodes and the plurality of second nodes comprise a two-layer factorial structure of dynamic conditional random fields.
  • 6-7. (canceled)
  • 8. A computer program product, comprising: a non-transitory computer-readable medium comprising: code to identify words of an input utterance; code to place the words in a plurality of first nodes stored in the memory device; code to assign a word-layer tag to each of the plurality of first nodes based, in part, on neighboring nodes of the plurality of first nodes; and code to generate an output sentence by combining words from the plurality of first nodes with punctuation marks selected, in part, on the word-layer tags assigned to each of the first nodes.
  • 9. The computer program product of claim 8, in which the word-layer tag is at least one of none, comma, period, question mark, and exclamation mark.
  • 10. The computer program product of claim 8, in which the plurality of first nodes is a first-order linear chain of conditional random fields.
  • 11. The computer program product of claim 8, in which each of the word-layer tags is placed in a node of a plurality of second nodes stored in the memory device, each of the second nodes coupled to one of the first nodes.
  • 12. The computer program product of claim 8, in which the medium further comprises code to assign a sentence-layer tag to each of the nodes in the first plurality of nodes based, in part, on boundaries of the input utterance, in which the code to generate the output sentence selects punctuation marks for the output sentence based, in part, on the sentence-layer tag, in which the sentence-layer tag is at least one of a declaration beginning, declaration inner, question beginning, question inner, exclamation beginning, and exclamation inner.
  • 13-33. (canceled)
  • 34. An apparatus, comprising: at least one processor and a memory device coupled to the at least one processor, in which the at least one processor is configured: to receive a natural language text input, the text input comprising a grammatical error in which a portion of the input text comprises a class from a set of classes; to generate a plurality of selection tasks from a corpus of non-learner text that is assumed to be free of grammatical errors, wherein for each selection task a classifier re-predicts a class used in the non-learner text; to generate a plurality of correction tasks from a corpus of learner text, wherein for each correction task a classifier proposes a class used in the learner text; to train a grammar correction model using a set of binary classification problems that include the plurality of selection tasks and the plurality of correction tasks; and to use the trained grammar correction model to predict a class for the text input from the set of possible classes.
  • 35. The apparatus of claim 34, in which the at least one processor is further configured to output a suggestion to change the class of the text input to the predicted class if the predicted class is different than the class in the text input.
  • 36. The apparatus of claim 34, wherein the learner text is annotated by a teacher with an assumed correct class.
  • 37. The apparatus of claim 34, wherein the class is an article associated with a noun phrase in the input text, and wherein the at least one processor is further configured to extract feature functions for the classifiers from noun phrases in the non-learner text and the learner text.
  • 38. (canceled)
  • 39. The apparatus of claim 34, wherein the class is a preposition associated with a prepositional phrase in the input text, and wherein the at least one processor is further configured to extract feature functions for the classifiers from prepositional phrases in the non-learner text and the learner text.
  • 40. (canceled)
  • 41. The apparatus of claim 34, wherein the non-learner text and the learner text have a different feature space, the feature space of the learner text including the word used by a writer.
  • 42. The apparatus of claim 34, wherein training the grammar correction model comprises minimizing a loss function on the training data.
  • 43. The apparatus of claim 34, wherein training the grammar correction model further comprises identifying a plurality of linear classifiers through analysis of the non-learner text, and wherein the linear classifiers further comprise a weight factor included in a matrix of weight factors, and wherein training the grammar correction model further comprises performing a Singular Value Decomposition (SVD) on the matrix of weight factors.
  • 44-55. (canceled)
  • 56. An apparatus, comprising at least one processor and a memory device coupled to the at least one processor, in which the at least one processor is configured to correct semantic collocation errors by performing the steps of: automatically identifying one or more translation candidates in response to analysis of a corpus of parallel-language text conducted in a processing device; determining, using the processing device, a feature associated with each translation candidate; generating a set of one or more weight values from a corpus of learner text stored in a data storage device; and calculating, using a processing device, a score for each of the one or more translation candidates in response to the feature associated with each translation candidate and the set of one or more weight values.
  • 57. (canceled)
  • 58. The apparatus of claim 56, in which the at least one processor is further configured to perform the steps of: selecting a parallel corpus of text from a database of parallel texts, each parallel text comprising text of a first language and corresponding text of a second language; segmenting the text of the first language using the processing device; tokenizing the text of the second language using the processing device; automatically aligning words in the first text with words in the second text using the processing device; extracting phrases from the aligned words in the first text and in the second text using the processing device; and calculating, using the processing device, a probability of a paraphrase match associated with one or more phrases in the first text and one or more phrases in the second text, wherein the feature associated with each translation candidate is the probability of a paraphrase match.
  • 59. The apparatus of claim 56, wherein the set of one or more weight values is calculated using a minimum error rate training (MERT) operation on a corpus of learner text.
  • 60. The apparatus of claim 56, wherein the at least one processor is further configured to perform the step of generating a phrase table having collocation corrections with features derived from at least one of a spelling edit distance, a homophone dictionary, a synonym dictionary, and native language-induced paraphrases.
  • 61. The apparatus of claim 60, wherein the phrase table comprises one or more penalty features for use in calculating the probability of a paraphrase match.
  • 62. A non-transitory tangible computer-readable medium comprising computer-readable code that, when executed by a computer, cause the computer to perform the operation of correcting semantic collocation errors comprising: automatically identifying one or more translation candidates in response to analysis of a corpus of parallel-language text conducted in a processing device; determining, using the processing device, a feature associated with each translation candidate; generating a set of one or more weight values from a corpus of learner text stored in a data storage device; and calculating, using a processing device, a score for each of the one or more translation candidates in response to the feature associated with each translation candidate and the set of one or more weight values.
  • 63. The non-transitory tangible computer-readable medium of claim 62, wherein the computer-readable code further comprises computer-readable code to cause the computer to perform the operations of: selecting a parallel corpus of text from a database of parallel texts, each parallel text comprising text of a first language and corresponding text of a second language; segmenting the text of the first language using the processing device; tokenizing the text of the second language using the processing device; automatically aligning words in the first text with words in the second text using the processing device; extracting phrases from the aligned words in the first text and in the second text using the processing device; and calculating, using the processing device, a probability of a paraphrase match associated with one or more phrases in the first text and one or more phrases in the second text, wherein the feature associated with each translation candidate is the probability of a paraphrase match.
  • 64. The non-transitory tangible computer-readable medium of claim 62, wherein the set of one or more weight values is calculated using a minimum error rate training (MERT) operation on a corpus of learner text.
  • 65. The non-transitory tangible computer-readable medium of claim 62, wherein the computer-readable code further comprises computer-readable code to cause the computer to perform the operation of generating a phrase table having collocation corrections with features derived from at least one of a spelling edit distance, a homophone dictionary, a synonym dictionary, and native language-induced paraphrases.
  • 66. The non-transitory tangible computer-readable medium of claim 65, wherein the phrase table comprises one or more penalty features for use in calculating the probability of a paraphrase match.
  • 67. A non-transitory tangible computer-readable medium comprising computer-readable code that, when executed by a computer, cause the computer: to receive a natural language text input, the text input comprising a grammatical error in which a portion of the input text comprises a class from a set of classes; to generate a plurality of selection tasks from a corpus of non-learner text that is assumed to be free of grammatical errors, wherein for each selection task a classifier re-predicts a class used in the non-learner text; to generate a plurality of correction tasks from a corpus of learner text, wherein for each correction task a classifier proposes a class used in the learner text; to train a grammar correction model using a set of binary classification problems that include the plurality of selection tasks and the plurality of correction tasks; and to use the trained grammar correction model to predict a class for the text input from the set of possible classes.
  • 68. The non-transitory tangible computer-readable medium of claim 67, wherein the computer-readable code further comprises computer-readable code that cause the computer to output a suggestion to change the class of the text input to the predicted class if the predicted class is different than the class in the text input.
  • 69. The non-transitory tangible computer-readable medium of claim 67, wherein the learner text is annotated by a teacher with an assumed correct class.
  • 70. The non-transitory tangible computer-readable medium of claim 67, wherein the class is an article associated with a noun phrase in the input text, and wherein the computer-readable code further comprises computer-readable code that cause the computer to extract feature functions for the classifiers from noun phrases in the non-learner text and the learner text.
  • 71. The non-transitory tangible computer-readable medium of claim 67, wherein the class is a preposition associated with a prepositional phrase in the input text, and wherein the computer-readable code further comprises computer-readable code that cause the computer to extract feature functions for the classifiers from prepositional phrases in the non-learner text and the learner text.
  • 72. The non-transitory tangible computer-readable medium of claim 67, wherein the non-learner text and the learner text have a different feature space, the feature space of the learner text including the word used by a writer.
  • 73. The non-transitory tangible computer-readable medium of claim 67, wherein training the grammar correction model comprises minimizing a loss function on the training data.
  • 74. The non-transitory tangible computer-readable medium of claim 67, wherein training the grammar correction model further comprises identifying a plurality of linear classifiers through analysis of the non-learner text, and wherein the linear classifiers further comprise a weight factor included in a matrix of weight factors, and wherein training the grammar correction model further comprises performing a Singular Value Decomposition (SVD) on the matrix of weight factors.
PCT Information

Filing Document      Filing Date   Country   Kind   371c Date
PCT/SG2011/000331    9/23/2011     WO        00     4/11/2013

Provisional Applications (3)

Number      Date       Country
61495902    Jun 2011   US
61386183    Sep 2010   US
61509151    Jul 2011   US