The present invention is related to a method and system for providing a pattern encoded dictionary for use in language processing in computer systems, such as Optical Character Recognition (OCR) systems or Automatic Speech Recognition systems (ASR), and especially to a pattern encoded dictionary using at least one pattern-classifier related to patterns of characters or phonemes representing language elements retrieved from a specific language dictionary.
Present state-of-the-art text recognition systems, often denoted Optical Character Recognition systems, are typically based on template matching against known fixed templates, on structural matching, or on recognizing characters by a fixed set of recognition rules applied to computed features extracted from the shapes of the characters. Each character is assigned a score or an a priori calculated probability for each character class or set. A dictionary is then used to check that each chain of proposed characters can form a word, and the most probable word is picked.
State-of-the-art text recognition systems usually fail when they encounter moderately to heavily degraded text images. Such degradation may result from photocopying an original document, from typewritten documents encountered when scanning older archive material, from newspapers, whose poor print and paper quality affects the quality of the text images, from faxes, which usually have poor resolution in the transmission channel and printing device, etc. These and similar problems are described in the book by Stephen Rice et al., “Optical Character Recognition: An Illustrated Guide to the Frontier”, Kluwer Academic Publishers 1999.
Current text recognition systems adapt only to a limited extent to a specific font or deformation of the text without a guided learning phase requiring human interaction, which slows down the process considerably. Electronic document handling, archive systems, electronic storage of printed material, etc. require scanning of an unlimited number of pages, which makes it impossible to rely on human interaction to succeed with such tasks.
In this respect, automatic speech recognition systems face problems that are very similar to those encountered with deformed images in OCR processing. The starting point for recognizing either OCR input or ASR input is the ability to identify some words and/or characters reliably. Based on such reliably identified items, the OCR or ASR process may continue in an adaptive fashion or as a configurable system. Whenever the OCR input is deformed, or the ASR input is deformed, for example due to the different voice patterns of different persons, the ability to recognize some characters or words reliably is diminished. However, according to the present invention, some characters or words may be reliably identified by providing a pattern encoded dictionary based on at least one identifiable pattern-classifier related to patterns found in images comprising text, even heavily deformed text, in OCR systems, or in digital images of phonemes in ASR systems.
PCT patent application WO 2005/050473 A2 discloses a method for semantic clustering of text segments in a language processing system based on their semantic meaning.
The clusters refer to one or several semantic topics and are used for the training of language models. The semantic clustering of the disclosure is based on a text emission model and cluster transmission model providing a probability estimate that a text is concerned with a specific semantic topic.
The aim of the present invention is to cluster words based on the physical, geometrical and structural similarity of language elements such as letters, syllables or phonemes when they are represented in a computer system. This is achieved by applying a pattern-classifier to cluster the words based on their similarity. By clustering all words with similar patterns in a dictionary, it is possible to perform a look-up in the pattern-classifier encoded dictionary based on a specific pattern and the dictionary will output a list of all words in the corresponding cluster. The semantic meaning or topic is irrelevant to the pattern-classifier of the present invention.
According to an aspect of the present invention, words form repeating patterns due to the repeating nature of the letters constituting words. In prior art, this phenomenon is used for example in cryptanalysis. According to the present invention, at least one pattern-classifier related to such repeating aspects of the letters constituting words in a specific language to be processed in an OCR program or in an ASR system is used to encode a pattern dictionary relating said at least one pattern-classifier to said words. This enables recognition of words of text input in said OCR or ASR system by identifying said at least one pattern-classifier in any OCR input (digital image of a scanned or digitally photographed document) or any input representing patterns of speech in digital form (digitized microphone input, file, etc.), and using said pattern encoded dictionary to recognize said OCR or ASR input.
According to another aspect of the present invention, several different pattern encoded dictionaries, encoding different aspects of said pattern-classifiers or providing encodings of different pattern-classifiers, may be used on the same OCR or ASR input to increase the amount of text recognized from said input.
A detailed description of a preferred embodiment of the present invention will be provided in the following sections by first describing some examples of pattern-classifiers related to patterns found in the OCR domain and the ASR domain, respectively, followed by a description of an example of a preferred method and system utilizing said method providing a pattern encoded dictionary.
As an example of pattern-classifiers related to patterns in images in OCR systems,
According to an example of embodiment of the present invention, letters extending above the x-height line 3, as shown in
According to another example of embodiment of the present invention, the encoding of words is extended to text that is not character segmented, where each fragment of a character is encoded in a similar manner as described above. As an example, the fragmented word ‘funny’ would be encoded as OMMMMMMUM, where the ‘u’ and each ‘n’ are coded as MM due to their dual stems and the ‘y’ is coded as UM. By this approach, a pattern-classifier encoded dictionary can be used as a tool to solve segmentation problems within text recognition if unique patterns can be identified.
An example of pattern that may be used is the ascender/descender pattern of the two words, which are OMMMMMMMOOM MOUMMMOOMM. In said example of an English dictionary comprising about 125 000 words, ‘cigarettes’ is the only word in this dictionary that has an ascender/descender pattern of MOUMMMOOMM. Without any other information than the segmentation of the characters and the ascender/descender pattern, the method according to the present invention is able to recognize the word as ‘cigarettes’ using this specific dictionary and said ascender/descender encoding.
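The ascender/descender encoding and the resulting dictionary look-up can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the letter classes in `ABOVE` and `BELOW` are inferred from the example patterns quoted in the text (note that the dotted ‘i’ is coded O), the treatment of ‘j’ is arbitrary, and the `FRAGMENTS` table is hypothetical apart from the entries for ‘u’, ‘n’ and ‘y’, which are taken from the ‘funny’ example. The five-word list merely stands in for the 125 000-word dictionary.

```python
from collections import defaultdict

# Assumed letter classes (not spelled out in the text):
# 'O' = extends above the x-height line, 'U' = extends below the baseline,
# 'M' = everything else. Coding 'j' as 'U' is an arbitrary choice.
ABOVE = set("bdfhiklt")
BELOW = set("gjpqy")

# Hypothetical per-fragment codes for letters that fall apart into several
# fragments; only 'u', 'n' and 'y' come from the 'funny' example.
FRAGMENTS = {"u": "MM", "n": "MM", "y": "UM"}


def letter_code(ch):
    """Return the ascender/descender code of a single letter."""
    if ch in ABOVE:
        return "O"
    if ch in BELOW:
        return "U"
    return "M"


def encode_word(word):
    """One code per segmented character."""
    return "".join(letter_code(ch) for ch in word.lower())


def encode_fragments(word):
    """One code per fragment, for text that is not character segmented."""
    return "".join(FRAGMENTS.get(ch, letter_code(ch)) for ch in word.lower())


def build_pattern_dictionary(words, encoder):
    """Cluster all words that share the same pattern."""
    clusters = defaultdict(list)
    for word in words:
        clusters[encoder(word)].append(word)
    return dict(clusters)


# Toy word list standing in for the full dictionary.
words = ["innumerable", "cigarettes", "funny", "darkest", "reindeer"]
pattern_dict = build_pattern_dictionary(words, encode_word)

print(encode_word("innumerable"), encode_word("cigarettes"))
print(pattern_dict["MOUMMMOOMM"])   # words sharing the pattern of 'cigarettes'
print(encode_fragments("funny"))
```

A pattern whose cluster holds a single word identifies that word directly; a cluster with several words yields a list of candidates.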
The ascender/descender pattern for the word ‘innumerable’ in
According to an aspect of the present invention, ascender/descender pattern of the three words in
According to another aspect of the present invention, more elaborate pattern encodings than the ascender/descender encoding used in the previous examples may be used for recognition purposes. An example of embodiment of the present invention uses an encoding of bows and stems as illustrated in table 1 below.
Applying the bow/stem encoding provided in table 1, the pattern FHE GFBHAFFG BHFFEFFBBFBFA will represent the sentence ‘any reindeer incapacitated’ illustrated in
According to yet another aspect of the present invention, topological properties of a character, expressed for example in mathematical terms as the number of elements and holes that constitute the character, may be used in providing a pattern encoded dictionary. An example is provided in table 2 below. In the example in table 2, we have used additional information about the position of holes within the character.
The text in
Applying the topological pattern encoding from table 2 and encoding the text in
Combining the topological properties from table 2 with the ascender/descender encoding described above, a unique pattern for the word ‘darkest’ is provided among the 125 000 words in said dictionary.
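How two encodings narrow the candidates when combined can be illustrated with a small Python sketch. The hole encoding used here is a simplified stand-in and not the encoding of table 2: it only marks whether a letter is assumed to enclose a hole (`ONE_HOLE` is a hypothetical, font-dependent set) and ignores the number of elements and the hole positions. The letter classes for the ascender/descender code are likewise assumptions, and the five-word list is a toy dictionary chosen for illustration.

```python
# Assumed letter classes for the ascender/descender code.
ABOVE = set("bdfhiklt")
BELOW = set("gjpqy")

# Hypothetical stand-in for a topological property: letters assumed to
# enclose a hole in a typical typeface. This is NOT table 2 of the text.
ONE_HOLE = set("abdegopq")


def ascender_descender(word):
    """'O' above the x-height, 'U' below the baseline, 'M' otherwise."""
    return "".join(
        "O" if ch in ABOVE else "U" if ch in BELOW else "M" for ch in word
    )


def hole_pattern(word):
    """'1' for a letter assumed to enclose a hole, '0' otherwise."""
    return "".join("1" if ch in ONE_HOLE else "0" for ch in word)


def candidates(words, encoder, pattern):
    """All dictionary words whose encoding equals the observed pattern."""
    return {w for w in words if encoder(w) == pattern}


# Toy dictionary; each single encoding leaves 'darkest' ambiguous.
words = ["darkest", "lockout", "backset", "parsecs", "darkens"]

by_shape = candidates(words, ascender_descender, ascender_descender("darkest"))
by_holes = candidates(words, hole_pattern, hole_pattern("darkest"))

print(sorted(by_shape))             # ambiguous on shape alone
print(sorted(by_holes))             # ambiguous on holes alone
print(sorted(by_shape & by_holes))  # intersection of both candidate sets
```

Intersecting the candidate sets is one simple way to combine encodings; concatenating the two codes per letter into a single, richer pattern would give the same result.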
Automatic speech recognition (ASR) systems face the same type of problems as found in OCR, in the sense that some patterns representing elements (letters, words, etc.) must be recognized reliably to be able to provide an adaptation or configuration of the ASR system. According to an aspect of the present invention, pattern encoding of sound may be based on phoneme types. For example, speech may be divided into vowels [sound patterns related to the letters aeiouy], nasals [nm], laterals [l], trills [r], fricatives [fsvz] and plosives [ptkbdg], with or without distinguishing between voiced and unvoiced sounds.
For example, the word ‘instruments’ is then encoded as VNFTVNVNPF where V represents vowels, N nasals, F fricatives, T trills and P plosives. This is only one example of an encoding, and different sound combinations can be used. Some sounds will represent several characters in speech, and other classifications of sounds can be used as well, according to the present invention.
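A phoneme-type encoding of this kind, with and without the voiced/unvoiced distinction, might be sketched as follows. Several simplifications are assumed: letters are used as a proxy for sounds and encoded one by one, whereas a real system classifies the sound signal and one sound may span several characters; the symbol `L` for laterals and the lowercase notation for voiced fricatives and plosives are hypothetical; letters outside the listed classes are coded `?`.

```python
# Sound classes from the text, keyed by an encoding symbol.
# 'L' for laterals is an assumed symbol; the text names none.
CLASSES = {
    "V": "aeiouy",   # vowels
    "N": "nm",       # nasals
    "L": "l",        # laterals
    "T": "r",        # trills
    "F": "fsvz",     # fricatives
    "P": "ptkbdg",   # plosives
}
SYMBOL = {ch: sym for sym, letters in CLASSES.items() for ch in letters}

# Voiced fricatives and plosives; the lowercase marking is hypothetical.
VOICED = set("vzbdg")


def encode(word, voicing=False):
    """Encode a word letter by letter (letters stand in for sounds)."""
    out = []
    for ch in word.lower():
        sym = SYMBOL.get(ch, "?")          # '?' = not covered by the classes
        if voicing and ch in VOICED:
            sym = sym.lower()              # mark voiced fricatives/plosives
        out.append(sym)
    return "".join(out)


print(encode("masterminds"))
print(encode("masterminds", voicing=True))

# Without the voicing distinction 'bad' and 'pat' share one pattern;
# with it they fall into different clusters.
print(encode("bad") == encode("pat"))
print(encode("bad", voicing=True) == encode("pat", voicing=True))
```

Distinguishing voiced from unvoiced sounds yields smaller clusters, at the cost of requiring that the distinction be detected reliably in the input.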
According to yet another aspect of the present invention, sound may also be encoded using an encoding scheme based on patterns found in images of sound patterns, for example the amplitude variation of sound output from a microphone with respect to time.
According to another aspect of the present invention, all vocoid sounds and some voiced and trilled consonants are quasi-stationary periodic sounds characterized by a repeating sound pattern with a small number of frequencies called formant frequencies. The vocal cords determine the fundamental frequency, and the position of the movable elements of the vocal tract determines the formant frequencies. In automatic speech recognition systems, these voiced sounds are often further classified based on the number of formants and the relative frequencies of the formants. A major source of errors is the large variation from speaker to speaker. It is, however, easy to distinguish quasi-stationary periodic sounds from other sound types, for example by Fourier transform of the sound signal as known to a person skilled in the art.
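One possible way to separate quasi-stationary periodic frames from other sound types with a Fourier transform is to measure how much of the spectral energy sits in a few frequency bins. The sketch below is an assumption-laden illustration, not a method taken from the text: the function names, the number of peaks and the 0.8 threshold are hypothetical, the test signals are synthetic, and a plain discrete Fourier transform is written out in pure Python for self-containment.

```python
import cmath
import math
import random


def dft_power(frame):
    """Power in the positive-frequency DFT bins (DC bin excluded)."""
    n = len(frame)
    power = []
    for k in range(1, n // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        power.append(abs(coeff) ** 2)
    return power


def spectral_concentration(frame, peaks=3):
    """Fraction of the spectral energy held by the strongest bins."""
    power = dft_power(frame)
    total = sum(power)
    if total == 0:
        return 0.0
    return sum(sorted(power, reverse=True)[:peaks]) / total


def is_quasi_periodic(frame, threshold=0.8):
    """Hypothetical decision rule: few bins carry most of the energy."""
    return spectral_concentration(frame) > threshold


n = 256
# Synthetic 'voiced' frame: two sinusoids standing in for a periodic sound.
periodic = [
    math.sin(2 * math.pi * 8 * t / n) + 0.5 * math.sin(2 * math.pi * 24 * t / n)
    for t in range(n)
]
# Synthetic noise frame standing in for a non-periodic sound.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]

print(is_quasi_periodic(periodic), is_quasi_periodic(noise))
```

Real speech frames would need windowing and a threshold tuned on recorded data; the sketch only shows that the two sound types separate clearly in the frequency domain.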
According to yet another aspect of the present invention, plosive sounds are characterised by their non-stationary transient sound signal as the vocal tract is closed and opened again during the speech. The plosives may contain a transient closing phase, a quelling phase (muted or voiced) and transient opening phase. Even though each of the plosives has a distinctly different sound signal, they are easily distinguished from other sound types, for example by Fourier transform as known to a person skilled in the art.
According to an example of embodiment of the present invention, an encoding scheme providing an N for the stationary non-periodic sounds [fsvz], P for the periodic sounds [aeiouynmlr] and T for the transient sounds [ptkbdg] may be used to encode a pattern encoded dictionary, for example for the English language. In an example of a dictionary comprising 125 000 words, the word ‘instruments’ is then encoded as PPNTPPPPPTN. Only 4 out of the 125 000 words in said dictionary match this pronunciation-based encoding pattern. These 4 words are instalments, instruments, masterminds and restaurants. By independently recognizing, for example, the trill (‘r’) in the word ‘instruments’, a unique pattern is provided. According to the example of embodiment of the present invention as depicted in
In addition to the specific encoding schemes to be used in the analyzing section 21, the storage location 25 may comprise encoding schemes incorporating results from statistical pattern-classifier analysis provided in section 24 and/or a priori knowledge about a specific language contained in section 26. In an example of embodiment, the statistical analysis section 24 receives text images or speech signals from section 28 and analyzes the input to estimate which set of pattern-classifiers is best represented in the input, enabling the analyzing section 21 to provide pattern-classifier patterns 22 that are significant in the actual text image or speech signal. In this manner, the pattern encoded dictionary may be provided as a “best fit” for the actual input 28. When said pattern-classifier patterns 22 have been analyzed, the output is communicated to the clustering section 23, which provides dictionary listings relating said patterns to words from the dictionary 20 in the storage location 27. A text image input or speech signal 28 is analyzed in section 29 to provide pattern-classifier patterns 30. The section 29 receives from the storage location 25 the same encoding schemes as used in the section 21. The output from section 30 is used to perform a dictionary lookup process in 31 by addressing the list provided in section 27 with the pattern-classifier pattern from section 30. The dictionary lookup section 31 outputs either a unique word or sound identification 32, or a list of candidates 33. The outputs 32 and 33 are used for adaptation or configuration of an OCR or ASR system.
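The flow from encoding scheme through clustering to look-up, ending in either a unique identification or a list of candidates, might be sketched as follows, using the N/P/T scheme and the four words named above. This is a minimal sketch under stated assumptions: letters stand in for sounds, the pattern of the input is simulated by encoding the spelled word rather than by analysing a signal, the function names and the code `R` for an independently recognized trill are hypothetical, and the five-word list stands in for the dictionary. The reference numerals in the comments indicate roughly which part of the description each step corresponds to.

```python
from collections import defaultdict


def make_scheme(classes):
    """Build an encoder from {code: letters}; a stand-in for a scheme in 25."""
    table = {ch: code for code, letters in classes.items() for ch in letters}
    return lambda word: "".join(table.get(ch, "?") for ch in word.lower())


# Scheme from the text: N = stationary non-periodic, P = periodic,
# T = transient. Letters are used as a proxy for sounds.
NPT = make_scheme({"N": "fsvz", "P": "aeiouynmlr", "T": "ptkbdg"})

# Refined scheme: the trill 'r' is recognized independently.
# The code 'R' is an assumed symbol.
NPT_R = make_scheme({"N": "fsvz", "P": "aeiouynml", "R": "r", "T": "ptkbdg"})


def cluster(dictionary, scheme):
    """Relate each pattern to its words (clustering 23, listing in 27)."""
    listing = defaultdict(list)
    for word in dictionary:
        listing[scheme(word)].append(word)
    return listing


def lookup(listing, pattern):
    """Dictionary lookup 31: unique identification 32 or candidates 33."""
    matches = listing.get(pattern, [])
    if len(matches) == 1:
        return ("unique", matches[0])
    return ("candidates", matches)


# Toy dictionary standing in for the 125 000-word dictionary 20.
dictionary = [
    "instalments", "instruments", "masterminds", "restaurants", "cigarettes",
]

coarse = cluster(dictionary, NPT)
fine = cluster(dictionary, NPT_R)

# The input pattern 30 is simulated here by encoding the spelled word.
print(lookup(coarse, NPT("instruments")))
print(lookup(fine, NPT_R("instruments")))
```

With the coarse scheme the look-up returns the four candidates; once the trill is recognized separately, each of the four words has its own pattern and the look-up returns a unique identification.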
According to an aspect of the present invention, the list of candidates 33 should be kept to a minimum. The ideal situation is to have only unique identifications 32. However, according to an example of embodiment of the present invention, the number of candidates 33 may be reduced by repeating, for example, the steps of the preferred embodiment as depicted in
Number | Date | Country | Kind |
---|---|---|---|
20052966 | Jun 2005 | NO | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/NO06/00227 | 6/14/2006 | WO | 00 | 3/6/2008 |