The present invention relates generally to gesture recognition devices and methods, and more particularly to such devices and methods which include infrared sensors.
Various systems for translating hand position/movement into corresponding digital data to be input to a computing system are well-known. For example, digital pens, computer mice, and various kinds of touch screens and touch pads are known. Various other systems for translating hand position/movement into digital data which is input to a computer system to accomplish gesture recognition and/or writing and/or drawing based on touchless hand motion also are well-known. For example, see the article “Gesture Recognition with a Wii Controller” by Thomas Schlomer et al., TEI '08: Proceedings of the Second International Conference on Tangible and Embedded Interaction, 2008, ISBN: 978-1-60558-004-3; this article is incorporated herein by reference. The Schlomer article discloses the design and evaluation of a sensor-based gesture recognition system which utilizes the accelerometer contained in the well-known Wii controller (Wiimote™) as an input device. The system utilizes a Hidden Markov Model for training and recognizing user-chosen gestures, and includes filtering stages ahead of a data pipeline including a gesture recognition quantizer, a gesture recognition model, and a gesture recognition classifier. The quantizer applies a common k-means algorithm to the incoming vector data. The model is implemented by means of the Hidden Markov Model, and the classifier is chosen to be a Bayes classifier. The filters establish a minimum representation of each gesture before it is forwarded to the Hidden Markov Model, by eliminating all vectors which do not significantly contribute to the gesture and by eliminating vectors which are roughly equivalent to their predecessor vectors.
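Purely for illustration, such a filter-then-quantize front end might be sketched as follows in Python; the function names, thresholds, and data layout here are assumptions of this sketch and are not taken from the Schlomer article:

    import numpy as np

    def filter_vectors(vectors, min_norm=1.0, min_delta=0.2):
        """Drop vectors that contribute little to the gesture (small magnitude)
        or that are roughly equivalent to their predecessor vector."""
        kept, prev = [], None
        for v in vectors:
            v = np.asarray(v, dtype=float)
            if np.linalg.norm(v) < min_norm:
                continue                      # does not significantly contribute
            if prev is not None and np.linalg.norm(v - prev) < min_delta:
                continue                      # roughly equal to predecessor
            kept.append(v)
            prev = v
        return np.array(kept)

    def quantize(vectors, codebook):
        """Map each vector to the index of its nearest k-means centroid,
        yielding the discrete observation sequence fed to the HMM."""
        d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        return d.argmin(axis=1)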
In a prior art visual gesture recognition system, the output of face detection module 29 is applied via bus 30 as an input to skin color classification module 22. Stereo camera system 5 also outputs image disparity information on bus 6, which is input to head orientation module 9 and multi-hypothesis tracking module 26. (The term “disparity” refers to the difference between the images generated by the two stereo cameras.)
Three-dimensional head and hand tracking information generated by multi-hypothesis tracking module 26 is utilized along with head orientation information generated by module 9 to model the dynamic motion, rather than just the static position, of pointing gestures, thereby significantly improving gesture recognition accuracy.
Conventional Hidden Markov Models (HMMs) are utilized in gesture recognition module 21 to perform the gesture recognition based on the outputs of multi-hypothesis tracking module 26 and head orientation module 9. Based on the hand motion and the head motion and orientation, the HMM-based classifier in gesture recognition module 21 is trained to detect pointing gestures to provide significantly improved real-time gesture recognition performance which is suitable for applications in the field of human-robot interaction.
The head and hands of the subject making gestures are identified by means of human skin color clusters in a small region of the chromatic color space. Since a mobile robot has to cope with frequent changes in ambient light conditions, the color model needs to be continuously updated. In order to accomplish this, face detection module 29 searches for a face image in the camera image data by running a known fast face detection algorithm asynchronously with the main video loop, and a new color model is created based on the pixels within the face region whenever a face image is detected. That information is input via path 30 to skin color classification module 22, which then generates the skin map information 25 as an input to multi-hypothesis tracking module 26.
Multi-hypothesis tracking module 26 operates to find the best hypotheses for the positions of the subject's head and hands at each time frame “t”, based on the current camera observation and the hypotheses of past time frames. The best hypotheses are formulated by means of a probabilistic framework that includes an observation score, a posture score, and a transition score. With each new frame, all combinations of the three-dimensional skin cluster centroids are evaluated to find the hypothesis that exhibits the best result with respect to the product of the observation, posture, and transition scores. Accurate tracking of the relatively small, fast-moving hands is a difficult problem compared to the tracking of the head. Accordingly, multi-hypothesis tracking module 26 is designed to be able to correct its present decision instead of being tied to a previous wrong decision. It performs multi-hypothesis tracking to allow “rethinking” by keeping an n-best list of hypotheses at each time frame, wherein each hypothesis is connected within a tree structure to its predecessor. Multi-hypothesis tracking module 26 is therefore free to choose the path that maximizes the overall probability of a correct new decision based on the observation, posture, and transition scores.
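The following minimal Python sketch (with hypothetical names and scoring callbacks; the actual prior art implementation is not reproduced here) illustrates the idea of keeping an n-best list of hypotheses per frame, each linked to its predecessor, scored by the product of the three scores:

    import heapq

    def track_frame(prev_hyps, candidates, observe, posture, transition, n_best=10):
        """One frame of multi-hypothesis head/hand tracking.

        prev_hyps  : list of (score, hypothesis) pairs kept from frame t-1
        candidates : head/hand position combinations from skin-cluster centroids
        observe, posture, transition : scoring callbacks returning probabilities
        Keeps the n best (score, hypothesis) pairs; each new hypothesis records
        its predecessor so the best overall path can be re-chosen later
        ("rethinking") instead of committing to a previous wrong decision."""
        scored = []
        for prev_score, prev in prev_hyps:
            for cand in candidates:
                s = prev_score * observe(cand) * posture(cand) * transition(prev, cand)
                scored.append((s, {"state": cand, "parent": prev}))
        return heapq.nlargest(n_best, scored, key=lambda t: t[0])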
The resulting head/hand position information generated on bus 31 by multi-hypothesis tracking module 26 is provided as an input to both gesture recognition module 21 and head orientation module 9. Head orientation module 9 uses that information along with the disparity information 6 and RGB image information 10 to generate pan/tilt angle information input via bus 17 to gesture recognition module 21. Head orientation module 9 utilizes two neural networks, one for determining the pan angle of the subject's head and one for determining the tilt angle thereof based on the head's intensity data and disparity image data.
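As an illustration only, the forward pass of such a small angle-regression network might look as follows; the layer sizes, activation function, and names are assumptions of this sketch, not the prior art network's actual topology:

    import numpy as np

    def mlp_forward(x, weights, biases):
        """Forward pass of a small fully connected network. One such network
        could regress the pan angle and a second network the tilt angle from
        the head's intensity and disparity image data."""
        a = np.asarray(x, dtype=float)
        for w, b in zip(weights[:-1], biases[:-1]):
            a = np.tanh(a @ w + b)              # hidden layers
        return a @ weights[-1] + biases[-1]     # linear output: angle estimate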
Gesture recognition module 21 models the typical motion pattern of pointing gestures (rather than just the static posture of a person during the peak of the gesture) by decomposing the gesture into three distinct phases and modeling each phase with a dedicated Hidden Markov Model, to thereby provide improved pointing gesture recognition accuracy. (Note that use of the Hidden Markov Model for gesture recognition is a known technique.)
The above-mentioned gesture recognition quantizer in the prior art system applies the common k-means algorithm described above to quantize the incoming gesture vector data before it is passed to the gesture recognition model.
The above-mentioned gesture recognition model takes multiple sequential gesture vectors and determines their meanings using the Hidden Markov Model (HMM). Hidden Markov Models are especially known for their application in temporal pattern recognition, and they work well for gesture recognition. A detailed description of the Hidden Markov Model is included in the article that appears at the website http://en.wikipedia.org/wiki/Hidden_Markov_Model; a copy of that article is included with the Information Disclosure Statement submitted with this patent application and is incorporated herein by reference.
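For illustration, the forward algorithm at the core of such HMM-based recognition can be sketched as follows (hypothetical names; a real recognizer would use trained per-gesture models and select the gesture whose model assigns the observation sequence the highest likelihood):

    import numpy as np

    def forward_likelihood(pi, A, B, obs):
        """P(obs | HMM) via the forward algorithm.
        pi : (N,) initial state probabilities
        A  : (N, N) state transition matrix, A[i, j] = P(state j | state i)
        B  : (N, M) emission matrix over M discrete symbols
        obs: sequence of symbol indices (e.g., from the k-means quantizer)"""
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()

    def classify(models, obs):
        """Pick the gesture whose HMM gives the highest likelihood.
        models maps each gesture name to its (pi, A, B) parameters."""
        return max(models, key=lambda g: forward_likelihood(*models[g], obs))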
The above-mentioned gesture recognition classifier may use a naïve Bayes classifier to interpret the gesture series and determine the desired action represented by the gesture. Naïve Bayes classifiers have worked quite well in many complex real-world situations and can be trained very efficiently in a supervised setting. A detailed description of the naïve Bayes classifier appears in the article at the website http://en.wikipedia.org/wiki/Naive_Bayes_classifier; a copy of that article is included with the Information Disclosure Statement submitted with this patent application and is incorporated herein by reference.
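A minimal sketch of such a classifier follows; the discrete features and lookup-table layout are assumptions of this sketch, and the probability estimates would come from the supervised training mentioned above:

    import numpy as np

    def naive_bayes_predict(priors, likelihoods, features):
        """priors      : {action: P(action)}
        likelihoods : {action: [table_i, ...]}, where table_i[value] gives
                      P(feature_i = value | action)
        features    : observed discrete feature values (e.g., HMM outputs)
        Chooses the action maximizing P(action) * prod_i P(feature_i | action),
        computed in log space for numerical stability."""
        def log_posterior(action):
            lp = np.log(priors[action])
            for i, value in enumerate(features):
                lp += np.log(likelihoods[action][i][value])
            return lp
        return max(priors, key=log_posterior)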
A second prior art system tracks hand movement using images from two cameras. In this system, the output 41 of motion field estimation module 38 is input to motion modeling module 43, the output of which is input to 2D (two-dimensional) reference point tracking module 46. The output 42 of disparity map estimation module 39 is input to disparity modeling module 44, the output of which is input to 3D reference point tracking module 48. The output 47 of 2D reference point tracking module 46 is provided as another input to 3D reference point tracking module 48 and also is fed back as a Z−1 input to 2D reference point tracking module 46. The output of 3D reference point tracking module 48 is input to an incremental planar modeling module 49, the output of which is input to on-plane, off-plane analysis module 50. The output 51 of on-plane, off-plane analysis module 50 is provided as an input to 3D to 2D projection module 52 and also is fed back as a Z−1 input to on-plane, off-plane analysis module 50. The output of 3D to 2D projection module 52 is input to an output normalization module 53, the output 32 of which includes normalized coordinates of the movement of the hand centroids.
In this prior art system, region of interest selection modules 37R and 37L operate to remove the fingers and the arm from the hand image in the camera images, so that only the central region of the hand image (i.e., the palm or the back of the hand) remains. The disparity map estimation module 39 estimates a disparity map from the two camera images taken at each time instant, using a parametric planar model to cope with the nearly textureless surface of the selected portion of the hand image. Motion field estimation module 38 operates to estimate a monocular motion field from two consecutive time frames in a process that is similar to the estimating of the disparity map in module 39. Motion modeling module 43 operates to adjust parameters of the motion model to comply with the disparity model. The motion field then is used by 2D reference point tracking module 46 and 3D reference point tracking module 48 to track selected points throughout the sequence. At each time instant, the X, Y, and Z coordinates of the position and the yaw, pitch, and roll orientation angles of the hand image are calculated for a coordinate frame that is “attached” to the palm of the selected portion of the hand image. The 3D plane parameters are calculated by incremental planar modeling module 49 and on-plane, off-plane analysis module 50 from the disparity plane information established by disparity modeling module 44. For tracking the hand image over time, a set of 2D image points is extracted from the images of one of the two cameras 5R and 5L and its motion model. Then, using disparity models established by disparity modeling module 44 at different times, the motion coordinates of that hand image are mapped to the 3D domain to provide the trajectory of the hand image in space.
On-plane and off-plane analysis module 50 operates to determine when the centroid of the selected portion of the hand image undergoes a significant deviation from a computed plane fitted to the palm of the hand, indicating that the hand has been lifted from the virtual plane in order to indicate a particular drawing/writing movement. 3D to 2D projection module 52 operates to convert the set of 3D points to the best approximated set in two dimensions. Output normalization module 53 then operates to generate hand coordinate tracking data that represents on-plane writing or drawing performed by the user. The hand movement detection and tracking system thus translates touchless hand motion over a virtual plane into writing or drawing coordinates.
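For illustration, the on-plane/off-plane decision might be sketched as follows; the least-squares plane fit and the distance threshold are assumptions of this sketch, not details of the prior art module:

    import numpy as np

    def fit_plane(points):
        """Least-squares plane through 3D palm points; returns (centroid, unit normal)."""
        pts = np.asarray(points, dtype=float)
        c = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - c)
        return c, vt[-1]               # smallest singular vector = plane normal

    def is_off_plane(centroid_3d, plane_point, plane_normal, threshold=15.0):
        """Distance of the hand centroid from the fitted plane; a large
        deviation is interpreted as the hand lifting off the virtual plane.
        The threshold is in the same (assumed) units as the points, e.g. mm."""
        dist = abs(np.dot(centroid_3d - plane_point, plane_normal))
        return dist > threshold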
A significant shortcoming of the above described prior art is that the input sensor response times are much slower than is desirable for many hand movement tracking applications and/or for many gesture recognition applications, due to the amount of computer resources required. Also, variation in ambient lighting strongly influences the interpretation of image details and adds significant difficulty to image capture.
There is an unmet need for an improved, faster, less expensive, simpler, and more accurate way of translating various element movements such as hand movements and/or hand gestures into coordinate or vector information representing element or hand position/movement.
There also is an unmet need for an improved, faster, less expensive, and more accurate way of translating various hand movements and/or hand gestures into corresponding input signals for a computer system so there is no need for any part of the hand (or an instrument held by the hand) to actually touch any part of the computer system.
There also is an unmet need for a faster way of generating a vector in response to an element movement, a hand movement, or the like.
There also is an unmet need for a faster, lower cost, more accurate device and method for translating element or hand movement into digital input information for an operating system.
There also is an unmet need for a faster, lower cost, more accurate device and method for translating element or hand movement into digital input information which simplifies gesture recognition algorithms by avoiding use of external lighting and associated color filtering.
There also is an unmet need for a faster, lower cost, more accurate device and method for translating element or hand movement into digital input information which is very insensitive to ambient lighting conditions.
It is an object of the invention to provide an improved, faster, less expensive, simpler, and more accurate way of translating various element movements such as hand movements and/or hand gestures into coordinate or vector information representing element or hand position and/or movement.
It is another object of the invention to provide an improved, faster, less expensive, simpler, and more accurate way of translating various element movements such as hand movements and/or hand gestures into corresponding input signals for a computer system so that there is no need for any part of the hand (or an instrument held by the hand) to actually touch any part of the computer system.
It is another object of the invention to provide a faster way of generating a vector in response to an element or hand movement or the like.
It is another object of the invention to provide a faster, lower cost, more accurate device and method for translating element movement or hand movement or the like into digital input information for an operating system.
It is another object of the invention to provide a faster, lower cost, more accurate device and method for translating element or hand movement into digital input information which simplifies gesture recognition algorithms by avoiding use of external lighting and associated color filtering.
It is another object of the invention to provide a faster, lower cost, more accurate device and method for translating element or hand movement into digital input information which is very insensitive to ambient lighting conditions.
Briefly described, and in accordance with one embodiment, the present invention provides a system for generating tracking coordinate information in response to movement of an information-indicating element, including an array (55) of IR sensors (60-x,y) disposed along a surface (55A) of the array. Each IR sensor includes first (7) and second (8) thermopile junctions connected in series to form a thermopile (7,8) within a dielectric stack (3) of a radiation sensor chip (1). The first thermopile junction is more thermally insulated from a substrate (2) of the radiation sensor chip than the second thermopile junction. A sensor output signal generated between the first and second thermopile junctions is coupled to a bus (63). A processor (64) is coupled to the bus for operating on information that represents temperature differences between the first and second thermopile junctions of the various IR sensors, respectively, caused by the presence of the information-indicating element to produce the tracking coordinate information as the information-indicating element moves along the surface.
In one embodiment, the invention provides a system for generating tracking coordinate information in response to movement of an information-indicating element, including an array (55) of IR (infrared) sensors (60-x,y) disposed along a surface (55A) of the array (55). Each IR sensor (60-x,y) includes first (7) and second (8) thermopile junctions connected in series to form a thermopile (7,8) within a dielectric stack (3) of a radiation sensor chip (1). The first thermopile junction (7) is more thermally insulated from a substrate (2) of the radiation sensor chip (1) than the second thermopile junction (8). A sensor output signal between the first (7) and second (8) thermopile junctions is coupled to a bus (63), and a processing circuit (64) is coupled to the bus (63) to receive information representing temperature differences between the first (7) and second (8) thermopile junctions of the various IR sensors (60-x,y), respectively, caused by the presence of the information-indicating element. The processing circuit (64) operates on the information representing the temperature differences to produce the tracking coordinate information as the information-indicating element moves along the surface (55A).
In one embodiment, the surface (55A) lies along surfaces of the substrates (2) of the radiation sensor chips (1). Each first thermopile junction (7) is insulated from the substrate (2) by means of a corresponding cavity (4) between the substrate (2) and the dielectric stack (3). A plurality of bonding pads (28A) coupled to the thermopile (7,8) are disposed on the radiation sensor chip (1), and a plurality of bump conductors (28) are attached to the bonding pads (28A), respectively, for physically and electrically coupling the radiation sensor chip (1) to conductors (23A) on a circuit board (23).
In one embodiment, the dielectric stack (3) is a CMOS semiconductor process dielectric stack including a plurality of SiO2 sublayers (3-1,2 . . . 6) and various polysilicon traces, titanium nitride traces, tungsten contacts, and aluminum metalization traces between the various sublayers patterned to provide the first (7) and second (8) thermopile junctions connected in series to form the thermopile (7,8). Each IR sensor (60-x,y) includes CMOS circuitry (45) coupled between first (+) and second (−) terminals of the thermopile (7,8) to receive and operate on a thermoelectric voltage (Vout) generated by the thermopile (7,8) in response to infrared (IR) radiation received by the radiation sensor chip (1). The CMOS circuitry (45) also is coupled to the bonding pads (28A). The CMOS circuitry (45) converts the thermoelectric voltage (Vout) to digital information in an I2C format and sends the digital information to the processing circuit (64) via the bus (63). The processing circuit (64) operates on the digital information to generate a sequence of vectors (57) that indicate locations and directions of the information-indicating element as it moves along the surface (55A).
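Purely by way of illustration, a host-side scan of such an array might look as follows; the I2C driver call, addressing scheme, and function names are assumptions of this sketch, not part of the disclosed hardware:

    # Hypothetical host-side loop: read one temperature-difference word from
    # each IR sensor over the I2C bus, then turn successive peak-pixel
    # positions into motion vectors. "read_word(addr)" stands in for whatever
    # I2C driver call the host system actually provides.

    def scan_array(read_word, addresses, cols):
        """Return a row-major frame of pixel values from the sensor I2C addresses."""
        values = [read_word(a) for a in addresses]
        return [values[i:i + cols] for i in range(0, len(values), cols)]

    def motion_vector(prev_peak, cur_peak):
        """Vector (dx, dy) between the peak-signal pixels of consecutive frames."""
        return (cur_peak[0] - prev_peak[0], cur_peak[1] - prev_peak[1])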
In one embodiment, the information-indicating element includes at least part of a human hand, and the processing circuit (64) operates on the vectors to interpret gestures represented by the movement of the hand along the surface (55A).
In one embodiment, the IR sensors (60-x,y) are represented by measured pixels (60) which are spaced apart along the surface (55A). In one embodiment, the IR sensors (60-x,y) are disposed along a periphery of a display (72) to produce temperature differences between the first (7) and second (8) thermopile junctions of the various IR sensors (60-x,y) caused by the presence of the information-indicating element as it moves along the surface of the display (72). In one embodiment, the IR sensors (60-x,y) are represented by measured pixels (60) which are spaced apart along the surface (55A), and the processing circuit (64) interpolates values of various interpolated pixels (60A) located between various measured pixels (60).
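One simple interpolation scheme consistent with this arrangement is bilinear interpolation between the four nearest measured pixels, sketched below for illustration only (the frame layout and names are assumptions of this sketch):

    def interpolate_pixel(frame, x, y):
        """Bilinear interpolation between the four nearest measured pixels,
        estimating the value of an interpolated pixel at fractional (x, y)."""
        x0, y0 = int(x), int(y)
        x1 = min(x0 + 1, len(frame[0]) - 1)
        y1 = min(y0 + 1, len(frame) - 1)
        fx, fy = x - x0, y - y0
        top = frame[y0][x0] * (1 - fx) + frame[y0][x1] * fx
        bot = frame[y1][x0] * (1 - fx) + frame[y1][x1] * fx
        return top * (1 - fy) + bot * fy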
In one embodiment, the substrate (2) is composed of silicon to pass infrared radiation to the thermopile (7,8) and block visible radiation, and further includes a passivation layer (12) disposed on the dielectric stack (3) and a plurality of generally circular etchant openings (24) located between the various traces and extending through the passivation layer (12) and the dielectric layer (3) to the cavity (4) for introducing silicon etchant to produce the cavity (4) by etching the silicon substrate (2).
In one embodiment, the radiation sensor chip (1) is part of a WCSP (wafer chip scale package).
In one embodiment, the invention provides a method for generating tracking coordinate information in response to movement of an information-indicating element, including providing an array (55) of IR (infrared) sensors (60-x,y) disposed along a surface (55A) of the array (55), each IR sensor (60-x,y) including first (7) and second (8) thermopile junctions connected in series to form a thermopile (7,8) within a dielectric stack (3) of a radiation sensor chip (1), the first thermopile junction (7) being more thermally insulated from a substrate (2) of the radiation sensor chip (1) than the second thermopile junction (8), a sensor output signal between the first (7) and second (8) thermopile junctions being coupled to a bus (63); coupling a processing circuit (64) to the bus (63); operating the processing circuit (64) to receive information representing temperature differences between the first (7) and second (8) thermopile junctions of the various IR sensors (60-x,y), respectively, caused by the presence of the information-indicating element; and causing the processing circuit (64) to operate on the information representing the temperature differences to produce the tracking coordinate information as the information-indicating element moves along the surface (55A).
In one embodiment, the substrate (2) is composed of silicon to pass infrared radiation to the thermopile (7,8) and block visible radiation, wherein the method includes providing the surface (55A) along surfaces of the substrates (2) of the IR sensors (60-x,y) and providing a cavity (4) between the substrate (2) and the first thermopile junction (7) to thermally insulate the first thermopile junction (7) from the substrate (2).
In one embodiment, the method includes providing the radiation sensor chip (1) as part of a WCSP (wafer chip scale package).
In one embodiment, the bus (63) is an I2C bus, and the method includes providing I2C interface circuitry coupled between the I2C bus and first (+) and second (−) terminals of the thermopile (7,8). In one embodiment, the method includes providing CMOS circuitry (45) which includes the I2C interface circuitry in each IR sensor (60-x,y) coupled between the first (+) and second (−) terminals of the thermopile (7,8) to receive and operate on a thermoelectric voltage (Vout) generated by the thermopile (7,8) in response to infrared (IR) radiation received by the radiation sensor chip (1).
In one embodiment, the invention provides a system for generating tracking coordinate information in response to movement of an information-indicating element, including an array (55) of IR (infrared) sensors (60-x,y) disposed along a surface (55A) of the array (55), each IR sensor (60-x,y) including first (7) and second (8) thermopile junctions connected in series to form a thermopile (7,8) within a dielectric stack (3) of a radiation sensor chip (1), the first thermopile junction (7) being more thermally insulated from a substrate (2) of the radiation sensor chip (1) than the second thermopile junction (8), a sensor output signal between the first (7) and second (8) thermopile junctions being coupled to a bus (63); and processing means (64) coupled to the bus (63) for operating on information representing temperature differences between the first (7) and second (8) thermopile junctions of the various IR sensors (60-x,y), respectively, caused by the presence of the information-indicating element to produce the tracking coordinate information as the information-indicating element moves along the surface (55A).
The embodiments of the invention described below may be used to improve upon the previously described prior art by avoiding the cost and complexity of using video cameras to sense hand movement and also by avoiding the slowness of the data manipulation required by the use of the cameras. The described embodiments of the invention also avoid any need for external lighting and associated color filtering, to thereby significantly simplify hand movement and/or gesture recognition algorithms that may be needed in some applications.
Referring to the example of the accompanying drawings, IR sensor array 55 includes a plurality of IR sensors 60-x,y disposed along a surface 55A of the array, each of which is coupled by means of bus 63 to a processor 64.
In the example of the drawings, each IR sensor 60-x,y includes a radiation sensor chip 1 having first and second thermopile junctions 7 and 8 connected in series to form a thermopile 7,8 within a dielectric stack 3, as described in detail below.
The basic system described in the example of the drawings operates to translate movement of a hand or other information-indicating element along surface 55A into a sequence of vectors 57 that indicate the locations and directions of the element as it moves.
By way of definition, the term “gesture” as used herein is intended to encompass any hand movements utilized to communicate information to a computer or the like to enable it to interpret hand movements, perform writing operations, or perform drawing operations.
The various layers shown in dielectric stack 3, including polysilicon layer 13, titanium nitride layer 15, aluminum first metalization layer M1, aluminum second metalization layer M2, and aluminum third metalization layer M3, are each formed on a corresponding oxide sub-layer of dielectric stack 3. Thermopile 7,8 thus is formed within SiO2 stack 3. Cavity 4 in silicon substrate 2 is located directly beneath thermopile junction 7, and therefore thermally insulates thermopile junction 7 from silicon substrate 2. However, thermopile junction 8 is located directly adjacent to silicon substrate 2 and therefore is at essentially the same temperature as silicon substrate 2. A relatively long, narrow polysilicon trace 13 is disposed on a SiO2 sub-layer 3-1 of dielectric stack 3 and extends between tungsten contact 14-2 (in thermopile junction 7) and tungsten contact 14-1 (in thermopile junction 8). Titanium nitride trace 15 extends between tungsten contact 15-1 (in thermopile junction 8) and tungsten contact 15-2 (in thermopile junction 7). Thus, polysilicon trace 13 and titanium nitride trace 15 both function as parts of thermopile 7,8. Thermopile 7,8 is referred to as a poly/titanium-nitride thermopile, since the Seebeck coefficients of the various aluminum traces cancel, and the Seebeck coefficients of the various tungsten contacts 14-1, 14-2, 15-2, and 17 also cancel because the temperature difference across the various connections is essentially equal to zero.
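As a rough numerical illustration of the sensing principle, the open-circuit thermoelectric voltage of an N-junction thermopile is Vout = N·(S1 − S2)·(Thot − Tcold). The Seebeck coefficients and junction count in the sketch below are assumed values for illustration only, not process data:

    # Thermoelectric voltage of an N-junction poly/titanium-nitride thermopile:
    # Vout = N * (S_poly - S_tin) * (T_hot - T_cold).
    # The values below are illustrative assumptions, not process parameters.
    S_POLY = 180e-6   # V/K, assumed polysilicon Seebeck coefficient
    S_TIN  = -3e-6    # V/K, assumed titanium nitride Seebeck coefficient
    N_JUNCTIONS = 32  # assumed number of series junction pairs

    def thermopile_vout(delta_t_kelvin):
        """Open-circuit output for a given hot/cold junction temperature difference."""
        return N_JUNCTIONS * (S_POLY - S_TIN) * delta_t_kelvin

    # Example: a 0.1 K temperature difference gives roughly 0.59 mV.
    print(thermopile_vout(0.1))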
The right end of polysilicon layer 13 is connected to the right end of titanium nitride trace 15 by means of tungsten contact 14-2, aluminum trace 16-3, and tungsten contact 15-2 so as to form “hot” thermopile junction 7. Similarly, the left end of polysilicon layer 13 is connected by tungsten contact 14-1 to aluminum trace 11B, and the left end of titanium nitride trace 15 is coupled by tungsten contact 15-1, aluminum trace 16-2, and tungsten contact 17 to aluminum trace 11A, so as to thereby form “cold” thermopile junction 8. The series-connected combination of the two thermopile junctions 7 and 8 forms thermopile 7,8.
Aluminum metalization interconnect layers M1, M2, and M3 are formed on the SiO2 sub-layers 3-3, 3-4, and 3-5, respectively, of dielectric stack 3. A conventional silicon nitride passivation layer 12 is formed on another oxide sub-layer 3-6 of dielectric layer 3. A number of relatively small-diameter etchant holes 24 extend from the top of passivation layer 12 through dielectric stack 3 into cavity 4, between the various patterned metalization (M1, M2 and M3), titanium nitride, and polysilicon traces which form thermopile junctions 7 and 8.
Epoxy film 34 is provided on nitride passivation layer 12 to permanently seal the upper ends of etch openings 24 and to reinforce the “floating membrane” portion of dielectric layer 3. Although there may be some applications of the invention which do not require epoxy cover plate 34, the use of epoxy cover plate 34 is an important aspect of providing a reliable WCSP package configuration of the IR sensors of the present invention. In an embodiment of the invention under development, epoxy cover plate 34 is substantially thicker (roughly 16 microns) than the entire thickness (roughly 6 microns) of dielectric stack 3.
The IR sensor devices 60-x,y described above are arranged in array 55 and are each coupled through bus 63 to processor 64, which scans them to obtain the pixel data of the array during each time frame.
Thus, an array of infrared sensors may be used to detect hand motion, and the translated vector of the motion of that hand (or of a hand-held device such as a heated stylus) can be input into a display system that does not have touch-sensing capability, based on the temperature difference between the hand and the environment. The array of IR sensors can detect the positions and times at which an object such as a hand passes over the sensors and the direction of movement of the hand (or hand-held object or other object). The use of IR sensors means that no external light source or surface contact is needed. The array could be of any suitable dimensions and could be as small as a 2×1 array. And as previously mentioned, the IR sensor array surface may be planar, convex, or concave.
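For illustration, even a 2×1 array suffices to infer the direction of movement by comparing when the signal peaks at each sensor; the following is a sketch with hypothetical names, not part of the disclosed firmware:

    def direction_from_two_sensors(t_left, t_right):
        """Infer direction from a 2x1 array by comparing the times at which
        the warm object's signal peaks over the left and right sensors."""
        if t_left < t_right:
            return "left-to-right"
        if t_right < t_left:
            return "right-to-left"
        return "ambiguous"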
The use of long wavelength IR sensors means that no external lighting source is needed to generate the signal to the sensing array, and as previously mentioned, this may significantly simplify the required signal processing compared to the signal processing required in the prior art camera-based systems described above.
The processor determines the peak signal location and subtracts background levels for each time frame. It then tracks the locations of the peak signal in successive time frames and, if desired, calculates the appropriate hand/finger movement or gesture type. (For more information on conventional I2C systems, see “The I2C-Bus Specification, Version 2.1, January 2000”, which is incorporated herein by reference, and/or the article entitled “I2C”, which is cited in and included with the Information Disclosure Statement submitted with this application, is also incorporated herein by reference, and is available at http://en.wikipedia.org/wiki/I%C2%B2C.)
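A minimal sketch of this background-subtract-and-track-peak loop follows; the frame layout, the mean-as-background estimate, and the threshold are assumptions of this sketch, not the disclosed processor firmware:

    import numpy as np

    def track_peaks(frames, background=None, min_signal=0.0):
        """Per frame: subtract the background level, find the peak pixel, and
        accumulate its location; consecutive locations form the motion track."""
        track = []
        for frame in frames:
            f = np.asarray(frame, dtype=float)
            bg = f.mean() if background is None else background
            diff = f - bg                       # remove ambient/background level
            y, x = np.unravel_index(diff.argmax(), diff.shape)
            if diff[y, x] >= min_signal:        # ignore frames with no real peak
                track.append((x, y))
        return track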
It should be noted that each IR sensor in array 55 may be considered to be a “pixel” of array 55, so the I2C interface circuitry in each IR sensor 60-x,y generates output data that is considered to be output data for a corresponding pixel. Microprocessor 64 scans all of the IR sensors 60-x,y essentially simultaneously in order to obtain all of the pixel data of IR sensor array 55 during one time frame.
The space between the various pixels corresponding to the various IR sensors in 3×3 array 55 can be relatively large; as previously mentioned, processor 64 may interpolate values of interpolated pixels 60A located between the measured pixels 60.
The diagrams of the measured pixel data for successive time frames illustrate how processor 64 determines the peak signal location in each frame and tracks the movement of that peak as the hand or other element moves along surface 55A.
Although the above described embodiments of the invention refer to interpreting, translating, or tracking movement of a human hand, finger, or the like into useful digital information, the moving element being interpreted, translated, or tracked could be any element having a temperature difference relative to the thermopiles of the IR sensors. For example, the moving element may be a heated stylus held by the hand, or it may be anything having a temperature different from the background ambient temperature.
As a practical matter, the described technique uses the assignee's disclosed infrared detectors, which are described above in detail.
Advantages of the described embodiments of the invention include higher system operating speed, lower cost, and greater ease of use than the prior art systems for detecting and quantifying hand movement or the like to provide corresponding digital input information to a utilization system or device. One important advantage of using IR sensors for tracking movement of a hand, finger, or other element is that the IR sensors are insensitive to ambient lighting conditions. Another advantage of the IR sensors is that they do not have to be densely located in the screen or sensor surface. One likely application of the described embodiments is to replace a computer mouse, perhaps with a larger area of surface 55A than the surface area on which a typical mouse is used.
While the invention has been described with reference to several particular embodiments thereof, those skilled in the art will be able to make various modifications to the described embodiments of the invention without departing from its true spirit and scope. It is intended that all elements or steps which are insubstantially different from those recited in the claims but perform substantially the same functions, respectively, in substantially the same way to achieve the same result as what is claimed are within the scope of the invention.