The present invention relates generally to operating a digital camera and, more particularly, to input and output control methods that make the process more user-friendly and improve the quality of the output.
Physical disabilities, reading problems, limited language proficiency, or other limitations often make it difficult, tedious, or impossible for some people to read printed matter. Among such people are those with low or no vision and dyslexic readers. People insufficiently fluent in the language of the printed matter often have similar difficulties. Various technologies exist for assisting such readers. Some devices ultimately convert text to speech. Other devices magnify the text image, often using a video or still camera. Yet other devices improve contrast, reverse color, or facilitate reading in other ways. Language translation software, such as Google Translate, is available. In many cases, instead of, or in addition to, a video stream, a still digital photographic image of the printed matter must be captured before further processing.
The present invention overcomes the problems and disadvantages associated with current techniques and designs and provides new systems and methods of control of input and output associated with processing text in an image.
One embodiment of the invention is directed to a system for processing an image. The system comprises a processor, an image capturing unit in communication with the processor, an inspection surface positioned so that at least a portion of the inspection surface is within a field of view (FOV) of the image capturing unit and an output device. The system further comprises software executing on the processor that monitors the FOV of the image capturing unit for at least one event. The image capturing unit is in a video mode while the software is monitoring for the at least one event. The inspection surface is capable of supporting an object of interest.
In a preferred embodiment, the software recognizes text in a captured image and converts the text into a computer readable format using OCR (optical character recognition). Preferably, the software directs the image capturing unit to capture an image upon detection of an event.
In a preferred embodiment, the processor is within a housing and an upper surface of the housing is the inspection surface. Preferably, there is at least one marker on the inspection surface, and an event is at least one of: the blocking of the at least one marker from the view of the image capturing unit, and the appearance of at least one marker within the FOV. In a preferred embodiment, the software directs the image capturing unit to capture an image upon (1) a detection of a marker becoming obscured (in other words, disappearing) from the view of the image capturing unit and (2) a subsequent detection of the absence of motion above a preset limit of motion level in the FOV of the image capturing unit for a preset time span.
In a preferred embodiment, an event is a hand gesture of a user within the FOV of the image capturing unit. Preferably, different hand gestures cause the processor to execute different commands. The different commands can be chosen from the group comprising capturing an image, stopping output flow, resuming output flow, rewinding output flow, fast forwarding output flow, pausing output flow, increasing output flow speed, reducing output flow speed, magnifying the output image on a display, shrinking the output image on a display, and highlighting at least a portion of the output image on a display.
In a preferred embodiment, the output device is a display device and text is displayed on the display device and/or the output device is a speaker and text is read aloud via the speaker using text-to-speech conversion software.
Another embodiment of the invention is directed to computer-readable media containing program instructions for processing an image. The computer-readable media causes a computer to monitor the field of view (FOV) of an image capturing unit for at least one event, capture an image upon detection of an event, process the image, and output at least a part of the processed image.
In a preferred embodiment, the computer-readable media causes the computer to extract text from a captured image and convert the text into a computer readable format. Preferably, an event is one of at least one marker being obscured from the view of said image capturing unit, and the appearance of at least one marker within the FOV of said image capturing unit. Preferably, the computer-readable media causes the image capturing unit to capture an image upon (1) a detection of a marker becoming obscured from the view of the image capturing unit and (2) the subsequent detection of the absence of motion in the FOV of the image capturing unit above a preset limit of motion level for a preset time span.
In a preferred embodiment, an event is a hand gesture of the user within the FOV of the image capturing unit. Preferably, different hand gestures cause the computer to execute different commands. The different commands can be chosen from the group comprising capturing an image, stopping output flow, resuming output flow, rewinding output flow, fast forwarding output flow, pausing output flow, increasing output flow speed, reducing output flow speed, magnifying the output image on a display, shrinking the output image on a display, and highlighting at least a portion of output on a display. In a preferred embodiment, the output is text displayed on a display device and/or is text read aloud via a speaker.
Another embodiment of the invention is directed to a method of processing an image. The method comprises the steps of monitoring the field of view (FOV) of an image capturing unit for at least one event, capturing an image upon detection of an event, processing said image into a user consumable format, and outputting at least a part of the processed image.
In a preferred embodiment, the method further comprises extracting text from a captured image and converting the text into a computer readable format. Preferably, an event is one of at least one marker being obscured from the view of the image capturing unit, and the appearance of the at least one marker within the FOV of said image capturing unit.
In a preferred embodiment, the method further comprises capturing an image upon (1) a detection of a marker becoming obscured from the view of the image capturing unit and (2) a subsequent detection of the absence of motion in the FOV of the image capturing unit above a preset limit of motion level for a preset time span.
In a preferred embodiment, an event is a hand gesture of the user within the FOV of the image capturing unit. Preferably, different hand gestures cause a computer to execute different commands. The different commands can be chosen from the group comprising capturing an image, stopping output flow, starting output flow, rewinding output flow, fast forwarding output flow, pausing output flow, increasing output flow speed, reducing output flow speed, magnifying the output image on a display, shrinking the output image on a display, and highlighting at least a portion of output on a display.
Preferably, the user consumable format is text displayed on a display device and/or is text read aloud via a speaker.
Another embodiment of the invention is directed to a system for processing an image. The system comprises a processor within a housing, an image capturing unit in communication with the processor, an inspection surface, and an output device. The system also comprises software executing on the processor, wherein the software monitors the FOV of the image capturing unit for at least one event and recognizes text in a captured image and converts the text into a computer readable format using OCR (optical character recognition). The image capturing unit is positioned so that at least a portion of the inspection surface is within a field of view (FOV) of the image capturing unit. In the preferred embodiment, the upper surface of the housing is the inspection surface.
Other embodiments and advantages of the invention are set forth in part in the description, which follows, and in part, may be obvious from this description, or may be learned from the practice of the invention.
The invention is described in greater detail by way of example only and with reference to the attached drawings.
As embodied and broadly described herein, the disclosures herein provide detailed embodiments of the invention. However, the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. Therefore, there is no intent that specific structural and functional details should be limiting, but rather the intention is that they provide a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention.
One object of the present invention is to provide user friendly control over the flow of information. This includes methods and systems for control at the input stage, such as triggering a digital camera to take a picture (capture a digital image) or changing the optical zoom of the camera. This also includes methods and devices for control at the output stage, whether audio, visual, Braille or other format. Such control can be, for example, changing digital zoom (e.g. magnification on the screen), color, contrast and/or other output characteristics, as well as the flow of the output information stream. Such flow of the output stream can be the flow of the output from OCR (optical character recognition). Examples of such OCR output are 1) speech generated from text, 2) OCR-processed magnified text on a screen, and/or 3) Braille-code streaming into a refreshable Braille display.
With reference to the drawings, an exemplary system includes a computing device 100 comprising a processor, memory, an optical input device 190, an output device 170, and a communications interface 180.
Although the exemplary environment described herein employs flash memory cards, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, hard disks, digital versatile disks, cartridges, random access memories (RAMs), read only memory (ROM), a cable or wireless signal containing a bit stream and the like, may also be used in the exemplary operating environment.
Unless specified otherwise, for the purpose of the present invention, the optical input device 190 is implied to be a camera (aka image capturing unit) in either video or still mode. However, any number of input mechanisms can be present in the system, such as external drives, devices connected to ports, USB devices, a microphone for speech, a touch-sensitive screen for gesture or graphical input, a keyboard, buttons, a camera, a mouse, motion input, and so forth. The output device 170 can be one or more of a number of output mechanisms known to those of skill in the art, for example, printers, monitors, projectors, speakers, and plotters.
In some instances, multimodal systems enable a user to provide multiple types of input to communicate with the computing device 100. The communications interface 180 generally governs and manages the input and system output. There is no restriction on the invention operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
For clarity of explanation, the illustrative system embodiment is presented as comprising individual functional blocks (including functional blocks labeled as a “processor”). The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. For example, the functions of one or more processors presented in the figures may be provided by a single shared processor or by multiple processors.
Embodiments within the scope of the present invention may also include computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions or data structures. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or combination thereof) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of the computer-readable media.
Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, objects, components, and data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
The system of the invention preferably comprises the following hardware devices: a high resolution camera (e.g. a CCD or CMOS camera) with a large field of view (FOV), a structure to support the camera (to keep it positioned), a computer equipped with a microprocessor (CPU) as well as memory of various types, an optional monitor (display) that provides a screen, and/or a speaker.
In a specific example, a 5-megapixel camera sensor is used. The camera is preferably fixed at about 40 cm above the inspection surface on which an object of interest is placed. The lens field of view is preferably 50°. That covers an 8½ by 11″ page plus about 15% margins. The aperture of the lens is preferably small relative to the focal length of the lens, e.g., the diameter of the aperture is one third of the focal length. The small aperture enables the camera to resolve details over a range of distances, so that it can image a single sheet of paper as well as a sheet of paper on a stack of sheets (for example, a thick book). LEDs or another source of light, whether visible or infrared, may be used to illuminate the observed object.
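As a sanity check of this geometry, the surface coverage at a given camera height can be computed as 2·h·tan(FOV/2). A minimal sketch in Python, assuming the 50° field of view is measured along the page's long dimension (the patent does not specify the axis):

```python
import math

# Figures from the text: camera about 40 cm above the surface, 50-degree lens.
camera_height_m = 0.40
fov_deg = 50.0
page_long_side_m = 11 * 0.0254  # long side of an 8.5 x 11 inch page, ~27.9 cm

# Extent of surface seen by the camera along the assumed axis.
coverage_m = 2 * camera_height_m * math.tan(math.radians(fov_deg / 2))
margin_each_side = (coverage_m - page_long_side_m) / 2 / page_long_side_m

print(f"coverage: {coverage_m * 100:.1f} cm")       # ~37.3 cm
print(f"margin per side: {margin_each_side:.0%}")   # ~17%, close to the stated ~15%
```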
Camera 201 feeds information to a digital information processor, referred to herein as a CPU.
Camera 201 produces either a monochrome or a raw Bayer image. If a Bayer image is produced, then the computer (CPU) converts the Bayer image to RGB. The standard color conversion is used in video mode. Conversion to grayscale may be used for still images. The grayscale conversion is optimized such that the sharpest detail is extracted from the Bayer data.
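A minimal sketch of these two conversions using OpenCV (an illustrative choice; the patent does not name a library, and the BG Bayer pattern is an assumption — the actual sensor may use RG, GR, or GB ordering):

```python
import cv2
import numpy as np

def bayer_to_outputs(raw: np.ndarray):
    """Demosaic a raw Bayer frame to color for video mode and convert it
    to grayscale for still images, as described above."""
    rgb = cv2.cvtColor(raw, cv2.COLOR_BayerBG2BGR)    # standard color conversion
    gray = cv2.cvtColor(raw, cv2.COLOR_BayerBG2GRAY)  # grayscale for stills
    return rgb, gray
```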
The system can work and present output in various modes:
In Video Mode, the CPU receives image frames from the camera in real time. If a monitor screen is included in the system, it may display those images in real time. If optical zoom and/or digital magnification are tunable, a sighted user can adjust them in Video Mode and watch the object of interest to a) inspect the magnified video image, i.e., read magnified text, and/or b) best fit the object of interest into the FOV (field of view) of the camera for taking a still picture of the object. The user can shift the object for either purpose.
Capture Mode, or Still Mode, allows the user to freeze the preview at the current frame and to capture a digitized image of the object into the computer memory, i.e., to take a picture. Here we assume that the object is a printed page of text. In this mode, a sighted user can view the captured image as a whole. One purpose of this mode of viewing is to verify that the whole text of interest (page, column) is within the captured image. Another is to verify that little or no extraneous material (parts of adjacent pages or columns, or pictures) is captured. If the captured image is found inadequate in this sense, the user can go back to Video Mode, move the object, change the optical zoom and/or digital magnification, and capture an image again.
OCR is well known in the art. OCR software converts an image file into a text file. Once OCR has been performed, its output can be presented to a user in various formats, for example speech (by text-to-speech software), Braille or artificial font text on the screen.
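A minimal sketch of this step (the patent does not name a specific OCR engine; the Tesseract engine via pytesseract, and the Otsu binarization step, are illustrative choices):

```python
import cv2
import pytesseract

def image_to_text(path: str) -> str:
    """Convert an image file of a printed page into a text string via OCR."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Light binarization often helps OCR on printed pages.
    _, binary = cv2.threshold(image, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)
```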
In the process of the presentation of text output to a user, the user can receive the text output in such formats as speech, Braille, or magnified text on the screen. The flow of the output presentation is preferably under the user's control in that, for example, the user can stop or resume this flow at will.
Spaces between words (or between characters, in a different embodiment) are identified at step 306 by determining the positions of valleys in a vertical projection of the line image, one text line at a time. Finding all of the spaces may not be necessary; only enough spaces need to be identified to choose new locations for line breaks when wrapping magnified lines on the screen without OCR.
Paragraph breaks are identified at step 307 by the presence of at least one of the following: i) an unusually wide valley in the horizontal (sideways) projection, ii) an unusually wide valley in the vertical projection at the end of a text line, and/or iii) an unusually wide valley in the vertical projection at the beginning of a text line. A sketch of the projection-valley idea behind steps 306 and 307 follows.
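A minimal sketch of finding spaces as projection valleys (step 306), under the assumptions that the line image is binarized with ink = 1 and background = 0 and that min_gap is a hypothetical tuning parameter. An "unusually wide" valley, as used in step 307, can then be flagged by comparing a gap's width with the typical gap width:

```python
import numpy as np

def find_spaces(line_img: np.ndarray, min_gap: int = 5):
    """Locate word spaces in one binarized text line (ink = 1, background = 0)
    as valleys in the vertical projection."""
    projection = line_img.sum(axis=0)        # amount of ink per pixel column
    spaces, start = [], None
    for x, ink in enumerate(projection):
        if ink == 0 and start is None:
            start = x                        # a valley begins
        elif ink > 0 and start is not None:
            if x - start >= min_gap:
                spaces.append((start, x))    # valley wide enough to be a space
            start = None
    if start is not None and line_img.shape[1] - start >= min_gap:
        spaces.append((start, line_img.shape[1]))  # valley at end of the line
    return spaces

def unusually_wide(spaces, factor: float = 3.0):
    """Flag valleys much wider than the median gap, one possible cue for
    paragraph breaks (step 307); factor is a hypothetical tuning value."""
    widths = [b - a for a, b in spaces]
    median = float(np.median(widths)) if widths else 0.0
    return [s for s, w in zip(spaces, widths) if median and w > factor * median]
```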
In the captured image, some portions of the text are accepted by the software for further processing, while some portions are rejected. The following is one example.
Rejection of a Column that is Captured in Part:
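The original example does not survive in the text available here; the following is a hypothetical illustration of one plausible rejection rule (not necessarily the patent's own example): a column whose bounding box runs into the image border was likely captured only in part and can be rejected.

```python
def column_fully_captured(col_box, img_w: int, img_h: int,
                          margin: int = 10) -> bool:
    """Hypothetical rule: accept a text column only if its bounding box
    (x0, y0, x1, y1) stays clear of the image border by `margin` pixels;
    otherwise it was probably cut off and should be rejected."""
    x0, y0, x1, y1 = col_box
    return (x0 >= margin and y0 >= margin and
            x1 <= img_w - margin and y1 <= img_h - margin)
```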
In embodiments where there is a visual output, options for displaying text on the screen include 1) showing a video image of the text in real time, 2) showing the photographed image of the text on the display (monitor, screen) while indicating (e.g., highlighting) the word (or line) being pronounced (read out) to the user, 3) showing the word being pronounced enlarged and optionally scrolling (moving horizontally) across the screen, so that the line that is being read out scrolls on the screen, entering on one side and exiting on the other side, and/or 4) the previous option without sound.
In distinguishing “still mode” from “video mode” of the camera, the following should be noted. Still mode is preferably used to take still pictures (capture images) and is usually characterized by a higher resolution compared to video mode. Video mode, also termed “idle mode,” is preferably used all the time that the camera is not taking a still picture. For some purposes, Video Mode is referred to as Motion-Detector Mode. In video mode, the camera preferably works at a frame rate characteristic of video cameras, such as 30 frames per second or the like.
In preferred embodiments, the system uses a motion detector mode. In this mode, a motion detector is active in software that processes the video stream from the camera. In some settings, “motion-detector mode” is synonymous with “Video Mode”. In such settings, Video Mode is essentially opposed to still mode, aka Capture Mode. Usually, a video stream has a lower resolution than a still picture taken with the same camera. This difference enables a higher frame rate in video than in still picture taking. The motion-detector software detects and monitors the level of motion captured by the camera, for example, by measuring the amount of movement in the camera's field of view from frame to frame. In one possible setting, e.g., for scanning a book, if such motion is above a preset limit (i.e., there is motion), the motion detector software continues to monitor the images. If the motion drops and stays below a preset limit for a preset time interval, such level of non-motion triggers the camera to take a still picture. Optionally, before the still picture is taken, the video image is analyzed for the presence of text lines in the image. The outcome of such analysis can affect the decision by the algorithm to take a still picture. After a still picture is taken, an increase of motion above the preset limit for longer than a preset time interval, followed by its drop below the preset limit for a preset time, triggers taking another still picture. This increase in motion typically happens when the user is turning a page, while a drop in motion is expected to mean that the page has been turned over and that a picture is to be taken. Optionally, in the motion-detector mode, the brightness of the field of view is monitored, at least at the moment before a still picture is taken. The monitored brightness helps optimize the amount of light to be captured by the camera sensor in the subsequent taking of a still picture, which amount is controlled by what is commonly called “exposure time” or “shutter speed”.
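A minimal sketch of this trigger logic. Using the mean absolute frame difference as the motion level, and all threshold values, are illustrative assumptions; take_still stands in for a hypothetical capture routine:

```python
import cv2

def scan_loop(cam, take_still, motion_limit=2.0,
              quiet_s=1.5, active_s=0.5, fps=30):
    """Capture a still once motion has stayed below a preset limit for a
    preset time, but only after enough motion (e.g., a page turn) has
    re-armed the detector, as described above."""
    quiet_needed, active_needed = int(quiet_s * fps), int(active_s * fps)
    prev, quiet, active, armed = None, 0, 0, True
    while True:
        ok, frame = cam.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            motion = cv2.absdiff(gray, prev).mean()  # simple motion level
            if motion >= motion_limit:
                active += 1
                quiet = 0
                if active >= active_needed:
                    armed = True                     # page turn observed
            else:
                quiet += 1
                active = 0
                if armed and quiet >= quiet_needed:
                    take_still(frame)                # take the still picture
                    armed = False                    # wait for next page turn
        prev = gray
```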
In preferred embodiments, the system can establish the absence of an object of interest under the camera. It is desirable that the camera does not take still pictures when no object of interest is in the field of view of the camera. One way to signal the absence of such an object in the field of view of the camera 201 is to place a predefined recognizable image 207 on the inspection surface: as long as image 207 is visible to the camera, the software assumes that no printed matter is covering the surface.
The system can have an audio indicator of the absence of an object of interest under the camera. An optional audio indicator can signal to the user that the predefined recognizable image, image 207, has appeared in the field of view of camera 201. This signal tells the user that the software assumes that there is no object of interest, such as printed matter, under the camera at this moment. For example, a recording can play the words “Please place a document” once image 207 has appeared in the view of camera 201.
Another use of the predefined image 207 is to signal the presence of printed matter under the camera. For example, covering the predefined recognizable image 207 with a document tells the software that an object of interest has been placed on the inspection surface: once image 207 disappears from the view of camera 201 and motion in the FOV subsequently subsides, the software can trigger the capture of a still picture, as described above.
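A minimal sketch of checking whether the predefined image is visible. Normalized template matching is an illustrative choice — the patent does not prescribe a recognition algorithm — and the threshold is a hypothetical tuning value:

```python
import cv2

def marker_visible(frame_gray, marker_gray, threshold: float = 0.8) -> bool:
    """Return True if the predefined image (e.g., image 207) is visible in
    the current frame, via normalized template matching."""
    result = cv2.matchTemplate(frame_gray, marker_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, _ = cv2.minMaxLoc(result)
    return max_val >= threshold

# Marker visible  -> inspection surface is uncovered (no document present).
# Marker vanishes -> printed matter has been placed over it.
```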
In preferred embodiments, the user can give commands by gestures. The printed text converted into magnified text on a monitor (for example, as a scrolling line), or into speech, is intended for the user's consumption. In the process of such output consumption, the user may wish to have control over the flow of the output text or speech. Specifically, such control may involve giving commands similar to what are called “Stop”, “Play”, “Fast-Forward” and “Rewind” commands in other consumer players. Commands such as “Zoom In” and “Zoom Out” can also be given by gestures, even though they may not be common in other consumer players. When such commands are to be given, the camera is usually in video mode, yet is not monitoring page turning as in the book-scanning setting. Thus, the camera can be used to sense a specific motion or an image that signals to the algorithm that the corresponding command should be executed. For example, moving a hand in a specific direction under the camera can signal one of the above commands. Moving a hand in a different direction under the camera can signal a different command. In another example, the field of view of the camera can be arranged to have a horizontal arrow that can be rotated by the user around a vertical axis. The image-processing algorithm can be pre-programmed to sense the motion and/or direction of the arrow. Such a motion can be detected, and a change in the direction of the arrow can be identified as a signal. Here we call such a signal a “gesture”. A common software algorithm for the identification of the direction of motion, known as the “Optical Flow” algorithm, can be utilized for such gesture recognition.
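A minimal sketch of such gesture sensing with dense optical flow (the Farneback variant and all parameter values are illustrative assumptions):

```python
import cv2

def dominant_direction(prev_gray, cur_gray, min_mag: float = 1.0):
    """Estimate the dominant direction of motion between two grayscale
    frames; returns 'left', 'right', 'up', 'down', or None."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[..., 0].mean(), flow[..., 1].mean()  # average flow vector
    if max(abs(dx), abs(dy)) < min_mag:
        return None                       # too little motion to be a gesture
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"
```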
The interpretation of a gesture can be pre-programmed to depend on the current state of the output flow. For example, gesture interpretation can differ between the states in which 1) the text is being read out (in speech) to the user, 2) the text reading has been stopped, 3) magnified text is being displayed, etc. For example, the gesture of moving a hand from right to left is interpreted as the “Stop” (aka “Pause”) command if the output text or speech is flowing. Yet, the same gesture can be interpreted as “Resume” (aka “Play”) if the flow has already stopped.
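This state dependence can be expressed as a simple lookup from (state, gesture) pairs to commands; the state names and table entries below are illustrative assumptions:

```python
# Hypothetical gesture-to-command table: the same gesture maps to different
# commands depending on the current output-flow state.
COMMANDS = {
    ("reading", "left"):  "pause",         # flow is running -> Stop/Pause
    ("paused",  "left"):  "play",          # flow is stopped -> Resume/Play
    ("reading", "right"): "fast_forward",
    ("paused",  "right"): "rewind",
}

def interpret(state: str, gesture: str):
    """Return the command for this gesture in this state, or None if the
    gesture is ignored in the current state."""
    return COMMANDS.get((state, gesture))
```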
Moving a hand in other manners can signal additional commands. For example, moving a hand back and forth (e.g. right and left), repeatedly, can signify a command, and repeating this movement a preset number of times within a preset time-frame can signify various additional commands.
Gestures can also be interpreted as commands in modes other than output flow consumption. For example, in Video Mode, a gesture can give a command to change optical zoom or digital magnification. For this purpose, it is desirable to distinguish motion of a hand from other motion, such as motion of printed matter under the camera.
Optionally, the software that processes the video stream can recognize shapes of human fingers or the palm of the hand. With this capability, the software can distinguish motion of the user's hands from motion of the printed matter.
In yet another mode, specifically during scanning of a book, alternating time intervals of motion and no motion can convey the process of turning pages, as described herein. Such time intervals of motion and no motion can be considered as gestures too, even if the motion direction is irrelevant for the interpretation of the gesture. Specifically, as a page of a book is being turned, motion is detected by the motion detector software via the camera. The detected motion may be either that of a hand or that of printed matter. The fact that the page has been turned over and is ready for photographing is detected by the motion detector as the subsequent absence of motion. In practice, if motion (as observed by the detector) has dropped and stayed below a preset level for a preset time interval, the software interprets the drop as the page having been turned over. This triggers taking a picture (photographing, capturing a digital image, a shot) and signaling this event to the user. Before the next shot is taken, the detector should see enough motion again and then a drop in motion for a long enough period of time. In this mode (e.g., book scanning), motion in any direction is monitored, unlike in specific hand gesture recognition during output consumption, where motion in different directions may mean different commands.
More than one predefined recognizable image can be drawn, painted, engraved, etc., on the surface, such as surface 205. Each such image can serve as a separate marker, so that covering and uncovering different images, or different combinations of images, can signal different commands.
Furthermore, time sequences of covered and uncovered images can be pre-programmed to encode various commands. A large number of commands can be encoded by such time sequences. Moving a hand above the surface of images in a specific manner can signal commands by way of covering and uncovering the images in various orders (sequences). For example, moving a hand back and forth (e.g., right and left) can signify a command. Repeating this movement a preset number of times within a preset time-frame can signify various additional commands. In such gesture recognition, the shape of a hand can be used to differentiate such hand gestures from movement of printed matter over the surface. Such shape can be indicated by the silhouette of the set of images covered at any single time. Also, image recognition algorithms can be used for the purpose of recognizing hands, fingers, etc. A sketch of decoding such sequences follows.
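A minimal sketch of mapping recent marker-cover events to commands. The marker names, the sequence table, and the time window are illustrative assumptions:

```python
import time

# Hypothetical table: ordered sequences of covered markers encode commands.
SEQUENCES = {
    ("A", "B", "A"): "volume_up",    # hand swept right, then back left
    ("B", "A", "B"): "volume_down",
}
WINDOW_S = 2.0  # only events within this window count toward a sequence

def decode(events):
    """events: list of (timestamp, marker_id) cover events, oldest first.
    Return the command encoded by the most recent sequence, or None."""
    recent = [m for t, m in events if time.time() - t <= WINDOW_S]
    for length in range(len(recent), 0, -1):   # prefer the longest match
        cmd = SEQUENCES.get(tuple(recent[-length:]))
        if cmd:
            return cmd
    return None
```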
A printed page may contain a set of predefined recognizable images. Just as with images on the surface, such as surface 205, covering and uncovering the images printed on the page can be detected by the software and used to signal commands.
Other embodiments and uses of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. All references cited herein, including all publications, U.S. and foreign patents and patent applications, are specifically and entirely incorporated by reference. It is intended that the specification and examples be considered exemplary only with the true scope and spirit of the invention indicated by the following claims. Furthermore, the term “comprising of” includes the terms “consisting of” and “consisting essentially of.”
The present application claims priority to Provisional U.S. Application No. 61/283,168 filed Nov. 30, 2009 and entitled “Arranging Text under a Camera and Handling Information Flow,” which is incorporated in its entirety.