a is a flowchart of a method according to one embodiment.
b is a state diagram illustrating detection of a static state of a camera view.
No detailed description will be presented regarding the specific functions of the different blocks of the telephone 100. In short, however, as the person skilled in the art will realize, the processing unit 110 controls the overall function of the functional blocks in that it is capable of receiving input from the keyboard 105, audio information via the microphone 114, images via the camera 118 and suitably encoded and modulated data via the antenna 122 and transceiver 120. The processing unit 110 is also capable of providing output in the form of sound via the speaker 116, images via the display 107 and suitably encoded and modulated data via the transceiver 120 and antenna 122.
The terminal 100 is typically in connection with a communication network 126 via a radio interface 124. As the skilled person will realize, the network 126 illustrated in
A method according to the invention will now be described with reference to a flow chart in
In a viewfinder mode, started during a viewfinder start step 201, image sampling is performed at typically 15 frames per second with a typical frame size of 160×120 pixels, i.e. the sampling period is typically about 60 milliseconds per frame. Since 60 milliseconds is much shorter than the typical reaction time of a normal human user, the sampled frames are down-sampled to one frame out of every five frames. The display frequency is still 15 frames per second, which to a human user looks essentially continuous. During this step, a user aims the camera such that a text is viewed in the viewfinder, i.e. typically on the terminal display. Detection of the movement of the view in the viewfinder is performed, not at every frame, but typically once every 300 milliseconds in order to save computational power and smooth out noise. During the viewfinder mode, a guiding pattern is displayed, typically at the center of the view in the viewfinder, for aiding the user when aiming at the target.
Zooming of the camera is then performed in a zoom step 203. The camera settings are set by adjustment of the automatic digital zoom parameters. The purpose of the automatic digital zoom is to obtain a suitable target size in the viewfinder frame. For a camera terminal that has both digital zoom and optical zoom functionalities, it is difficult for a user to cross-adjust the zoom parameters to obtain a good quality image for OCR. Hence, intelligent digital zoom parameter estimation is used, which limits the capture distance to a small range and ensures a proper size of the target in the viewfinder. The end user only needs to trigger the optical zoom to make the image clear.
Movement detection 205 of the camera is realized by using any suitable motion tracking/detection algorithm known in the art. For simplicity, movement is detected only in the area close to the position at which the guiding pattern is displayed in the viewfinder. The movement detection algorithm is preferably tolerant to the small hand shaking that is inevitable for many human users. Thus, a hand-held shaking model is introduced to avoid false detection due to such hand shaking. The hand-held shaking model is typically one that has been established beforehand, for example by collecting two classes of samples: hand-held shaking movements and real movements during a search stage (i.e. during scanning movement across potential target texts). Statistical classification of the two classes can be built into the learning stage, thereby enabling the use of a fast decision tree during the operation of the invention.
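By way of a non-limiting illustration, the movement detection described above may be sketched as follows. The sketch assumes grayscale frames given as lists of pixel rows, uses a mean absolute frame difference as the movement feature, and replaces the decision tree by a single learned threshold (a tree of depth one). The names `roi_motion`, `fit_stump` and `is_real_movement`, the feature and the window size are hypothetical and are not taken from the description.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[Sequence[int]]  # grayscale pixel rows, values 0..255 (assumed format)


def roi_motion(prev: Frame, curr: Frame, center: Tuple[int, int], half: int) -> float:
    """Mean absolute pixel difference in a square window around the guiding pattern.

    `center` is (row, column) and is assumed to lie inside the frame.
    """
    cy, cx = center
    rows = range(max(0, cy - half), min(len(curr), cy + half + 1))
    cols = range(max(0, cx - half), min(len(curr[0]), cx + half + 1))
    total = sum(abs(curr[y][x] - prev[y][x]) for y in rows for x in cols)
    return total / (len(rows) * len(cols))


def fit_stump(shake: List[float], real: List[float]) -> float:
    """Learn one threshold separating hand-shake samples from real movements.

    Each candidate threshold is scored by its number of misclassified samples.
    """
    def errors(threshold: float) -> int:
        return sum(s > threshold for s in shake) + sum(r <= threshold for r in real)

    return min(sorted(set(shake + real)), key=errors)


def is_real_movement(score: float, threshold: float) -> bool:
    """Scores at or below the threshold are treated as tolerable hand shaking."""
    return score > threshold
```

In use, `fit_stump` would be run once on previously collected samples of both classes, and only `roi_motion` and `is_real_movement` would be evaluated at each movement check.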
Whether or not the view is in a static state is decided in a decision step 207, which is implemented using a state machine as illustrated by a state transition diagram in
The continued processing will start when entering state (0,1), that is the situation where the camera has been moved and then been focused on the target for a relatively long time period, e.g. several hundred milliseconds. If the camera remains unmoved for a longer time, no repeated starting of processing will be caused until the camera moves again and stops on another target. The state-based decision effectively avoids unnecessary processing and makes the dynamic recognition and any subsequent translation stable. Normally, OCR is sensitive to small changes of the input image if the character size is close to the lower size limit, so overlapped recognition of similar images might cause unstable results that would confuse the user.
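The state-based decision may, purely as an illustrative sketch, be expressed as below. It is assumed that state 0 denotes a moving view and state 1 a static view, and that the view is declared static after a number of consecutive movement checks without movement. The class name `StaticStateDetector` and the parameter `hold_checks` are hypothetical.

```python
class StaticStateDetector:
    """Tracks (previous, current) view states: 0 = moving, 1 = static (assumed encoding)."""

    def __init__(self, hold_checks: int = 2) -> None:
        self.hold_checks = hold_checks  # still checks required before the view counts as static
        self.state = 0
        self.still_count = 0

    def update(self, moved: bool) -> bool:
        """Feed one movement check; return True only on the (0, 1) transition."""
        previous = self.state
        if moved:
            self.still_count = 0
            self.state = 0
        else:
            self.still_count += 1
            if self.still_count >= self.hold_checks:
                self.state = 1
        return (previous, self.state) == (0, 1)
```

Because `update` returns True only once per stop, a camera that stays on the same target does not restart extraction and recognition until it has moved and come to rest again.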
When state (previous, current)=(0,1) has been determined in the determination step 207, processing of automatic object extraction is started in a record step 209. Here, the target text to be translated is extracted from the recorded image. Because the position of the guiding pattern has already provided prior knowledge about the position of the target, a connected-component-based algorithm is applied for object detection and segmentation. If the target is an isolated word, layout analysis gives the accurate block of the word; otherwise a relative region (e.g. a line of Chinese characters without splits) will be extracted.
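A minimal sketch of such an extraction is given below, under the assumptions that the recorded image has already been binarized (1 for text pixels, 0 for background) and that 4-connectivity is used. It returns the bounding box of the single component closest to the guiding pattern; merging neighbouring components into a word or a line, as layout analysis would do, is omitted. The function names are hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple

Pixel = Tuple[int, int]  # (row, column)


def components(binary: List[List[int]]) -> List[List[Pixel]]:
    """Collect 4-connected regions of foreground pixels using breadth-first search."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            found.append(pixels)
    return found


def nearest_box(binary: List[List[int]], guide: Pixel) -> Optional[Tuple[int, int, int, int]]:
    """Bounding box (top, left, bottom, right) of the component closest to `guide`."""
    gy, gx = guide
    regions = components(binary)
    if not regions:
        return None
    best = min(regions, key=lambda px: min((y - gy) ** 2 + (x - gx) ** 2 for y, x in px))
    ys = [p[0] for p in best]
    xs = [p[1] for p in best]
    return min(ys), min(xs), max(ys), max(xs)
```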
The extracted target text is then provided to an OCR process in step 211. The OCR processing involves a number of different procedures and considerations. For example, in Chinese-to-English translation, it is often a problem to identify which combination of characters composes a valid unit (word/phrase) to be translated. Therefore, if there is no layout information available, linguistic analysis should be used after OCR. Rule-based word association may be used to find the possible combinations of adjacent characters by using context sensing and linguistic rules. The valid combination whose position is nearest to the guiding pattern is typically selected as the intended target text.
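The selection of the valid combination nearest to the guiding pattern may be sketched as follows. The sketch assumes that OCR delivers the recognized characters in reading order together with their horizontal center positions, and it reduces the linguistic rules to a lookup in a word list. The function `pick_target`, the `lexicon` argument and the `max_len` limit are hypothetical simplifications.

```python
from typing import List, Optional, Set, Tuple


def pick_target(
    chars: List[Tuple[str, float]],
    lexicon: Set[str],
    guide_x: float,
    max_len: int = 4,
) -> Optional[str]:
    """Return the valid character combination whose center is nearest to `guide_x`.

    `chars` holds (character, x_center) pairs in reading order. Among equally
    near candidates, the longer combination is preferred.
    """
    best, best_key = None, None
    for start in range(len(chars)):
        last = min(len(chars), start + max_len)
        for end in range(start + 1, last + 1):
            word = "".join(c for c, _ in chars[start:end])
            if word not in lexicon:
                continue  # not a valid unit according to the word list
            center = sum(x for _, x in chars[start:end]) / (end - start)
            key = (abs(center - guide_x), -len(word))
            if best_key is None or key < best_key:
                best, best_key = word, key
    return best
```

For a recognized line of five characters, the same input can thus yield different target texts depending on where the guiding pattern is aimed.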
The recognized text may then be provided to a post processing procedure 213, which will be exemplified with reference to a flow chart in
The dish menu database is the main database consisting of Chinese and English names of dishes. The database is used to look up a Chinese dish name and retrieve the exact English translation. The ingredient database includes some key ingredients involved in dishes such as chicken, beef, fish etc. The database is used to check the ingredient(s) in a dish. Based on the information in the database, even if the interpretation fails to provide a correct dish name during the fuzzy translation, it can still give users a hint of the ingredient(s) of the dish that is of interest. For example, supposing a dish name, say “sauteed potato and steak” (in Chinese), cannot be found in the dish menu database by either the accurate or the fuzzy translation, it will be compared with the ingredients in the ingredient database automatically. In the ingredient database, the words for potato and steak can be found and the user will be informed that this dish may include some potatoes and steak.
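The ingredient lookup may be illustrated by the following sketch, which assumes that the ingredient database is a simple mapping from Chinese ingredient words to English names and that an ingredient is found when its word occurs in the recognized dish name. The entries in `INGREDIENTS`, the sample dish name and the function `ingredient_hint` are hypothetical examples, not the contents of an actual database.

```python
from typing import Dict, List

# Hypothetical ingredient database: Chinese ingredient word -> English name.
INGREDIENTS: Dict[str, str] = {
    "鸡": "chicken",
    "牛肉": "beef",
    "鱼": "fish",
    "土豆": "potato",
    "牛排": "steak",
}


def ingredient_hint(dish_name: str, ingredients: Dict[str, str]) -> List[str]:
    """Return the English names of all known ingredients occurring in the dish name."""
    return [english for word, english in ingredients.items() if word in dish_name]


# A dish name that is assumed to be missing from the dish menu database.
hint = ingredient_hint("土豆炒牛排", INGREDIENTS)
message = "This dish may include: " + ", ".join(hint)
```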
Hence, with reference to
A key issue in the fuzzy translation is the question of how to judge whether words are similar. A distance function is introduced here to calculate the distance between query words and records in the database. Mainly, such a function calculates two parts, i.e. the difference in word length and the number of matched characters. Because similar words should have nearly the same length, the difference in word length is the most important factor and is given a weight w1, which may be set three times as large as the weight w2 of the number of matched characters. Thus, the expression for the distance, Dist, is as follows.
Given a value for w1 of 300 and a value for w2 of 100, a threshold of 80 can be used to judge whether the two words are similar. If the distance is greater than 80, the two words are not similar. If the distance is 0, the two words are exactly the same. Hence, if all the distances between the word to be translated and the words in the dish menu database are greater than 80, ingredient translation is used. If there is a distance of 0 between one word in the database and the word to be translated, the accurate translation is used. Otherwise, the fuzzy translation mode is chosen.
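The choice between the three translation modes may be sketched as follows. The weights and the threshold are those given above, whereas the form of the distance itself is an assumption made only for this illustration: both parts are normalized by the length of the longer word, and characters are counted as matched when they agree position by position. The functions `distance` and `choose_mode` are hypothetical.

```python
from typing import List, Optional, Tuple

W1, W2, THRESHOLD = 300, 100, 80  # values given in the description


def distance(query: str, record: str) -> float:
    """Assumed distance: weighted length difference plus weighted share of unmatched characters."""
    longest = max(len(query), len(record))
    if longest == 0:
        return 0.0
    length_diff = abs(len(query) - len(record))
    matched = sum(a == b for a, b in zip(query, record))  # position-wise matches (assumption)
    return W1 * length_diff / longest + W2 * (1 - matched / longest)


def choose_mode(query: str, records: List[str]) -> Tuple[str, Optional[str]]:
    """Return (mode, best record), where mode is 'accurate', 'fuzzy' or 'ingredient'."""
    if not records:
        return "ingredient", None
    best = min(records, key=lambda r: distance(query, r))
    d = distance(query, best)
    if d == 0:
        return "accurate", best
    if d > THRESHOLD:
        return "ingredient", None  # no record is similar enough
    return "fuzzy", best
```

With this assumed distance, a distance of 0 is obtained only for identical words, in agreement with the accurate translation case, and a single misrecognized character in a five-character name still selects the fuzzy mode.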
Even though the above example uses translation of restaurant menu items, the invention is of course applicable in many other fields.
That is, the invention can be applied to any relevant target text, including street signs, restaurant name signs, etc. The non-click concept is useful, not least due to its simplicity from a user perspective, for automatic extraction of text from an image.
Examples of fields of use include translation of medicine terms, company name and company address translation. For example, main ingredients of medicines can be listed for understanding a kind of medicine in case of emergency and a database of the main districts and streets of a city can be constructed and used for locating a company.
Another good use case is for performing product/commodity search in a store such as a supermarket. Users can scan the brand/logo/specification of any goods and a specific data search/translation can be performed as described above.
Furthermore, a normal dictionary can be used for translation of the recognized text. The multi-level translation model then operates with a common dictionary for word translation from a first language to a second language. In fact, the invention should not only be considered as useful in connection with translation; it may be seen as a kind of “component-based search” method, for which the input method could be OCR-based as in the examples described above. The component-based matching method can be used for any specific database search; if accurate matching is not available, the fuzzy match and keyword/ingredient search will be used.