There is a need for enhanced ways for people to interact with technology devices and access their varied functionality, beyond the conventional keyboard, mouse, and joystick. Ever more powerful computing and communication devices have further generated a need for effective tools for inputting text, choosing icons, and manipulating objects. This need is even more noticeable for small devices, such as mobile phones, personal digital assistants (PDAs) and hand-held consoles, which do not have room for a full keyboard.
Significant advances have been made in recent years in the application of gesture control for user interaction with electronic devices. Gestures can be used, for example, to control a television, for home automation, and to interact with tablets, personal computers, and mobile phones. As core technologies continue to improve and their costs decline, gesture control is destined to continue to play a major role in the ways in which people interact with electronic devices. The ability to accurately recognize a user's gestures depends on the quality and accuracy of the core tracking capabilities.
Furthermore, there is a need to more accurately identify the movements of people and objects. For example, in the field of vehicle safety systems, it would be beneficial to have a system that is able to better identify objects outside the vehicle, such as pedestrians and other automobiles, and track their movements. In the surveillance industry, there is a need to more accurately identify the movements of people in a (possibly prohibited) area.
Examples of a system for automatically defining and identifying movements are illustrated in the figures. The examples and figures are illustrative rather than limiting.
A system and method are provided for object tracking using depth data and amplitude data, depth data and intensity data, or depth data and both amplitude and intensity data. Time of flight (ToF) sensor data may be used to provide enhanced image processing, the method including: acquiring depth data for an object imaged by a ToF sensor; acquiring amplitude data and/or intensity data for the imaged object; applying an image processing algorithm to process the depth data and the amplitude data and/or the intensity data; and tracking object movement based on an analysis of the depth data and the amplitude data and/or the intensity data.
Various aspects and examples of the invention will now be described. The following description provides specific details for a thorough understanding and enabling description of these examples. One skilled in the art will understand, however, that the invention may be practiced without many of these details. Additionally, some well-known structures or functions may not be shown or described in detail, so as to avoid unnecessarily obscuring the relevant description.
The terminology used in the description presented below is intended to be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain specific examples of the technology. Certain terms may even be emphasized below; however, any terminology intended to be interpreted in any restricted manner will be overtly and specifically defined as such in this Detailed Description section.
The tracking of object movements, as may be performed, for example, by an electronic device responsive to gestures, requires the device to be able to recognize the movements or gesture(s) that a user or object is making. For the purposes of this disclosure, the term ‘gesture recognition’ is used to refer to a method for identifying specific movements or pose configurations performed by a user, such as a swipe on a mouse-pad in a particular direction at a particular speed, a finger tracing a specific shape on a touchscreen, or the wave of a hand. The device must decide whether a particular gesture was performed or not by analyzing data describing the user's interaction with a particular hardware/software interface. That is, there must be some way of detecting or tracking the object that is being used to perform or execute the gesture. In the case of a touchscreen, it is the combination of the hardware and software technologies necessary to detect the user's touch on the screen. In the case of a depth sensor-based system, it is generally the hardware and software combination necessary to identify and track the user's joints and body parts.
In the above examples of device interaction through gesture control, as well as object tracking in general, a tracking layer enables movement recognition and tracking. In the case of gesture tracking, gesture recognition may be distinct from the process of tracking, as the recognition of a gesture triggers a pre-defined behavior (e.g., a wave of the hand turns off the lights) in an application, device, or game that the user is interacting with.
The input to an object tracking system can be data describing a user's movements that originates from any number of different input devices, such as touch-screens (single-touch or multi-touch), movements of a user as captured with an RGB (red, green, blue) sensor, and movements of a user as captured using a depth sensor. In other applications, accelerometers and weight scales can provide useful data for movement or gesture recognition.
U.S. patent application Ser. No. 12/817,102, entitled “METHOD AND SYSTEM FOR MODELING SUBJECTS FROM A DEPTH MAP”, filed Jun. 16, 2010, describes a method of tracking a player using a depth sensor and identifying and tracking the joints of a user's body. U.S. patent application Ser. No. 12/707,340, entitled “METHOD AND SYSTEM FOR GESTURE RECOGNITION”, filed Feb. 17, 2010, describes a method of identifying gestures using a depth sensor. Both patent applications are hereby incorporated by reference in their entirety into the present disclosure.
Robust movement or gesture recognition can be quite difficult to implement. In particular, the recognition system needs to interpret the user's intentions accurately, take into account differences in movement between different users, and determine the context in which particular movements are active.
The above described challenges further emphasize the need for enhanced accuracy, speed and intelligence when sensing, identifying and tracking objects or users. Enhanced tracking may be used to enable movement recognition, and can also be applied to surveillance applications (for example, using three-dimensional sensors and the techniques described herein to track people moving around in a space, for purposes such as people counting or tailgating detection), or to further applications where monitoring people and understanding their movements is beneficial. Furthermore, there is a need to enable object tracking under problematic conditions, such as darkness, where enhanced movement tracking is still required.
The present disclosure describes the usage of depth, amplitude and intensity data to help track objects, thereby helping to more accurately identify and process user movements or gestures.
Object Tracking System.
An object tracking system needs to recognize and identify movements performed by a user or object being imaged, and to interpret the data to determine movements, signals or communication.
Gesture Recognition System.
A gesture recognition system is a system that recognizes and identifies pre-determined movements performed by a user in his or her interaction with some input device. Examples include interpreting data from a sensor or camera to recognize that a user has closed his hand, or interpreting the data to recognize a forward punch with the left hand.
Depth Sensors.
The present disclosure may be used for object tracking based on data acquired from depth sensors, which are sensors that generate three-dimensional data. There are several different types of depth sensors, such as sensors that rely on the time-of-flight principle, structured light, coded light, speckled pattern technology, and stereoscopic cameras. These sensors may generate an image with a fixed resolution of pixels, where each pixel has an integer value corresponding to the distance, from the sensor, of the portion of the scene projected onto that region of the image. In addition to this depth data, the depth sensors may be combined with conventional color cameras, and the color data can be combined with the depth data for use in processing.
Gesture.
A gesture is a unique, clearly distinctive motion or pose of one or more body joints or parts. The process of gesture recognition analyzes input data to determine whether a gesture was performed or not.
Classifier.
A process that identifies a given motion, for example by identifying a specific movement as a target gesture, or rejecting the motion if it is not identified as a target gesture.
Input Data.
The data generated by a depth sensor, and used as input into the tracking algorithms. For example, this data may be the depth sensor's representation of the capture of an object's or user's movements in front of the sensor.
ToF Sensor.
A sensor based on Time-of-Flight (ToF) technology, which measures the time that light emitted by an illumination unit requires to travel to an object and back to the sensor.
The present disclosure may be used for object tracking, whether of people, animals, vehicles or other objects, based on depth, amplitude and/or intensity data acquired from depth sensors. Amplitude (a), as used herein, may be defined, in some embodiments, according to the following formulas. According to the time of flight principle, the correlation of an incident optical signal, s (that is, the optical signal reflected from an object), with a reference signal, g, is defined as:

c(τ) = lim_{T→∞} (1/T) ∫_{−T/2}^{T/2} s(t) · g(t + τ) dt

For example, if g is an ideal sinusoidal signal, f_m is the modulation frequency, a is the amplitude of the incident optical signal, b is the correlation bias, and φ is the phase shift (corresponding to the object distance), the correlation would be given by:

c(τ) = (a/2) · cos(f_m · τ + φ) + b

Using four sequential phase images A_0, A_1, A_2, A_3 with different phase offsets, i.e., A_i = c(τ_i) with τ_i = i · π/(2 · f_m) for i = 0, 1, 2, 3, the phase shift, the intensity, and the amplitude of the signal can be determined:

φ = arctan( (A_3 − A_1) / (A_0 − A_2) )

I = (A_0 + A_1 + A_2 + A_3) / 4

a = (1/2) · √( (A_3 − A_1)² + (A_0 − A_2)² )
In practice, the input signal may be different from a sinusoidal signal. For example, the input may be a rectangular signal. Then the corresponding phase shift, intensity, and amplitude would be different from the idealized equations presented above.
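As a concrete illustration of the idealized sinusoidal case above, the following Python sketch recovers the phase shift, amplitude and intensity from four phase images. The function name, the use of NumPy, and the assumption of 0°, 90°, 180° and 270° phase offsets are illustrative rather than taken from the disclosure.

import numpy as np

def tof_phase_amplitude_intensity(a0, a1, a2, a3):
    """Recover phase shift, amplitude and intensity from four sequential
    phase images, assuming the idealized sinusoidal model described above."""
    a0, a1, a2, a3 = (np.asarray(a, dtype=np.float64) for a in (a0, a1, a2, a3))
    phase = np.arctan2(a3 - a1, a0 - a2)             # phase shift, proportional to object distance
    amplitude = 0.5 * np.sqrt((a3 - a1) ** 2 + (a0 - a2) ** 2)
    intensity = (a0 + a1 + a2 + a3) / 4.0            # average (offset) level of the signal
    return phase, amplitude, intensity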
Reference is now made to
The ToF sensor 115 can further include a depth processor module 120, which is adapted to process the received image signal and generate a depth map. The ToF sensor 115 can further include an amplitude processor module 125, which is adapted to process the received image signal and generate an amplitude map. As can be seen with reference to
System 100 may further include an image tracking module 135 for determining object tracking. In some embodiments a depth sensor processing algorithm may be applied by tracking module 135, and/or an amplitude sensor processing algorithm may be applied by tracking module 135, to enable system 100 to utilize both depth and amplitude data received from image sensor 110. In one example, the output of module 135, the tracking data, may correspond to the object's skeleton, or other features, whereby the tracking data can correspond to all of a user's joints or feature points as generated by the tracking module, or a subset of them. System 100 may further include an object data classification module 140, for classifying sensed data, thereby aiding in the determination of object movement. The classifying module may, for example, generate an output that can be used to determine whether an object is moving, gesticulating etc.
System 100 may further include an output module 145 for processing the processed gesture data to enable the data to be satisfactorily output to external platforms, consoles, etc. System 100 may further include a user device or application 150, on which a user may play a game, view an output, execute a function or otherwise make use of the processed movement data sensed by the depth sensor.
As can be seen with reference to
In accordance with further embodiments, amplitude and intensity data may be used to assist in tracking movements of joints or parts of objects or users, to help segment foreground from background for classification of images, to determine pose differentiation, to enable character detection, to aid multiple object monitoring, to facilitate 3D modeling, and/or perform various other functions.
Reference is now made to
In some embodiments, intensity data may be used, in place of, or in addition to, amplitude data, as described above. Accordingly, an intensity data processing module may be used to process intensity data as may be necessary, as shown in
Reference is now made to
At block 310, in some examples of implementation, initial image segmentation may be executed to separate the object of interest from the background. In some examples, a data mask, for example a binary mask (an image in which every pixel has a value of either 1 or 0, so that the mask conveys the shape of the object, each pixel being either on the object or part of the background) or two-dimensional (2D) subject mask, may be created from the depth data. At block 315 the mask may be used, together with the amplitude data or received image, to remove background data or pixels from the amplitude frame. This is essentially a binary "and" operation which, for example, interprets pixels above a certain threshold in the amplitude image that correspond to a value of one in the 2D subject mask as part of the object, with the rest of the pixels in the amplitude image corresponding to the background. The result of the step at block 315 may be a masked amplitude image, that is, an amplitude image in which all pixels not corresponding to the object of interest are equal to 0.
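A minimal sketch of the masking step described at blocks 310 and 315 is shown below, assuming a simple depth-threshold segmentation; the function name and the near/far thresholds are hypothetical placeholders for whatever segmentation the system actually applies.

import numpy as np

def masked_amplitude(depth, amplitude, near, far):
    """Build a binary subject mask from the depth frame and apply it to the
    amplitude frame, zeroing background pixels."""
    mask = (depth > near) & (depth < far)        # 1 where the object of interest is, 0 elsewhere
    return np.where(mask, amplitude, 0), mask    # amplitude image with background removed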
At block 320, on the masked amplitude image, descriptors may be computed, which are features specific to the object of interest. For example, if the object of interest is a hand, the descriptors may be edges of the fingertips. At block 325 the descriptors found from the masked amplitude image may be compared to a database of subject features, for example, depth features. If the result of the comparison is not sufficiently similar, the object of interest has not been found. Thus, it is assumed that the object is not present in the acquired image. The system returns to acquire additional depth and amplitude data frames at blocks 300 and 305 to continue searching for the object of interest.
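One possible, simplified form of the descriptor comparison at block 325 is sketched below; the cosine-similarity score and the similarity threshold are assumptions, since the disclosure does not specify how descriptors are compared to the database of subject features.

import numpy as np

def match_descriptors(descriptors, feature_database, similarity_threshold):
    """Compare a descriptor vector from the masked amplitude image against a
    database of subject feature vectors; report whether the object was found."""
    best = -1.0
    for template in feature_database:
        num = float(np.dot(descriptors, template))
        den = np.linalg.norm(descriptors) * np.linalg.norm(template) + 1e-12
        best = max(best, num / den)              # keep the best cosine similarity
    return best >= similarity_threshold          # True: object identified; False: keep searching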
If the result of the comparison is sufficiently similar, the system may assume that the object of interest and its position have been identified. In such a scenario, at block 330, after the position of the object of interest has been identified, the masked amplitude image may be used to compute the 2D positions of each tracked element, such as the 2D positions of a joint or element, from the amplitude data.
At block 335 the 2D positions of each joint or element may be used to sample the 3D depth values from the depth image, since there is a one-to-one mapping between the depth image and the amplitude image. At block 340, the 3D positions of the joints may be used to generate a 3D skeleton. Furthermore, in some embodiments, intensity data may be used in place of, or in addition to, amplitude data, as described above.
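The sampling and back-projection described at blocks 335 and 340 might look roughly like the following sketch, which assumes a pinhole camera model; the intrinsic parameters fx, fy, cx and cy are hypothetical, as the disclosure relies only on the one-to-one mapping between the depth and amplitude images.

import numpy as np

def joints_2d_to_3d(joints_2d, depth, fx, fy, cx, cy):
    """Sample the depth image at each tracked 2D joint position and
    back-project to 3D camera coordinates using a pinhole model."""
    points_3d = []
    for (u, v) in joints_2d:
        z = float(depth[int(v), int(u)])         # depth value at the joint's pixel
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points_3d.append((x, y, z))
    return np.array(points_3d)                   # input to 3D skeleton generation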
Reference is now made to
At block 310, in some examples of implementation, initial image segmentation may be executed, to separate the object of interest from the background. In some examples, a data mask, for example a binary mask or 2D subject mask, may be created from the depth data. At block 315 the mask may be used, together with the amplitude data or received image, to remove background data or pixels from the amplitude frame.
At block 350 the image may be processed using the amplitude data from the image, such that, at block 355, after the position of the object of interest has been identified, the masked amplitude image may be used to compute the 2D positions of each tracked element from the amplitude data. At block 360 the 2D positions of each joint or element may be used to sample the 3D depth values from the depth image. At block 365 the 3D positions of the joints may be used to generate a 3D skeleton. Furthermore, in some embodiments, intensity data may be used in place of, or in addition to, amplitude data, as described above.
In general, computer vision (or "image processing") algorithms can accept different types of input data, such as depth data from active sensor systems (e.g., Time of Flight (ToF), structured light), depth data from passive sensor systems (e.g., stereoscopic cameras), color data, amplitude data, etc. Amplitude, as described herein, relates specifically to the "amplitude of the incident optical signal", which is substantially equivalent to the strength of the received signal in a ToF sensor system. The particular algorithms most effective for processing the data depend on the character of the data. For example, depth data is more useful when there is a sharp difference in depth between objects that are adjacent in the image plane, and less useful when the differences in the depth values of adjacent objects are smaller. RGB data is more useful when the environmental lighting is stable, and has the advantage of typically much higher resolution than the depth data obtained from active sensor systems. In a similar vein, the amplitude data has the disadvantage of low resolution, substantially equivalent to that of the depth data; however, the amplitude data is robust to environmental lighting conditions and typically contains a much higher level of detail than the depth data. Furthermore, in some embodiments, intensity data may be used in place of, or in addition to, amplitude data, as described above.
Similarly, different image processing techniques may be effective for different types of data. For RGB data, tracking can be done based on the color of objects. A common example is to use the color of the skin for tracking exposed parts of the human body. When processing an amplitude image, it may be useful to track the gradients (edges), which indicate sharp discontinuities between objects.
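As an illustration of gradient-based processing of an amplitude image, the following sketch computes a simple gradient-magnitude (edge) image using central differences; it is one possible realization, not the specific edge tracking used by the system.

import numpy as np

def amplitude_edges(amplitude):
    """Compute a gradient-magnitude (edge) image from an amplitude frame,
    exposing the sharp discontinuities between objects mentioned above."""
    amp = np.asarray(amplitude, dtype=np.float64)
    gy, gx = np.gradient(amp)                    # per-pixel gradients along rows and columns
    return np.hypot(gx, gy)                      # gradient magnitude highlights object boundaries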
Reference is now made to
The center photograph shows the intensity image in which each pixel value corresponds to the intensity value I, as defined above. The right photograph shows the amplitude image in which each pixel value corresponds to the amplitude variable a, as defined above.
As can be seen in
According to some embodiments, data from different channels of the sensor may be combined, and consequently, the strengths of one channel can be used to compensate for the weaknesses of others. In one example, at block 400 the object tracking apparatus, platform or system may acquire and process depth data from a depth sensor. In parallel to block 400, at block 405 the object tracking apparatus, platform or system may acquire and process amplitude data from a depth sensor, where the amplitude signal value is determined on a per-pixel basis.
Because the amplitude data is assumed to provide an indication of the confidence level of the depth data values, at block 435, a decision is made whether to use the depth data based on the amplitude data values. If the amplitude signal value for a given pixel is determined to be substantially low, this indicates a low level of confidence in the accuracy of the pixel value, and at block 440, the depth data for the given pixel may be discarded. If the amplitude signal pixel value is determined to be substantially high, meaning that the amplitude level indicates a high level of confidence in the accuracy of the pixel value, then at block 445, the depth data and the amplitude data may be utilized to track objects in a scene. Alternatively, the depth data can be used by itself to track objects in a scene. Furthermore, in some embodiments, intensity data may be used in place of, or in addition to, amplitude data, as described above.
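A minimal sketch of the confidence gating described at blocks 435 and 440 is shown below, assuming a per-pixel amplitude threshold; the threshold value and the use of NaN to mark discarded depth pixels are illustrative choices.

import numpy as np

def filter_depth_by_amplitude(depth, amplitude, min_amplitude):
    """Discard depth pixels whose amplitude falls below a confidence threshold,
    keeping only depth values the amplitude data suggests are reliable."""
    filtered = np.asarray(depth, dtype=np.float64).copy()
    low_confidence = np.asarray(amplitude) < min_amplitude
    filtered[low_confidence] = np.nan            # depth data for low-confidence pixels is discarded
    return filtered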
In the above described process, the amplitude signal is substantially "free"; that is, it is computed as a component of the ToF calculations in any case. Therefore, using this signal does not add substantial processing requirements to the system.
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise”, “comprising”, and the like are to be construed in an inclusive sense (i.e., to say, in the sense of “including, but not limited to”), as opposed to an exclusive or exhaustive sense. As used herein, the terms “connected,” “coupled,” or any variant thereof means any connection or coupling, either direct or indirect, between two or more elements. Such a coupling or connection between the elements can be physical, logical, or a combination thereof. Additionally, the words “herein,” “above,” “below,” and words of similar import, when used in this application, refer to this application as a whole and not to any particular portions of this application. Where the context permits, words in the above Detailed Description using the singular or plural number may also include the plural or singular number respectively. The word “or,” in reference to a list of two or more items, covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
The above Detailed Description of examples of the invention is not intended to be exhaustive or to limit the invention to the precise form disclosed above. While specific examples for the invention are described above for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. While processes or blocks are presented in a given order in this application, alternative implementations may perform routines having steps performed in a different order, or employ systems having blocks in a different order. Some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified to provide alternative or sub-combinations. Also, while processes or blocks are at times shown as being performed in series, these processes or blocks may instead be performed or implemented in parallel, or may be performed at different times. Further, any specific numbers noted herein are only examples. It is understood that alternative implementations may employ differing values or ranges.
The various illustrations and teachings provided herein can also be applied to systems other than the system described above. The elements and acts of the various examples described above can be combined to provide further implementations of the invention.
Any patents and applications and other references noted above, including any that may be listed in accompanying filing papers, are incorporated herein by reference. Aspects of the invention can be modified, if necessary, to employ the systems, functions, and concepts included in such references to provide further implementations of the invention.
These and other changes can be made to the invention in light of the above Detailed Description. While the above description describes certain examples of the invention, and describes the best mode contemplated, no matter how detailed the above appears in text, the invention can be practiced in many ways. Details of the system may vary considerably in its specific implementation, while still being encompassed by the invention disclosed herein. As noted above, particular terminology used when describing certain features or aspects of the invention should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the invention with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the invention to the specific examples disclosed in the specification, unless the above Detailed Description section explicitly defines such terms. Accordingly, the actual scope of the invention encompasses not only the disclosed examples, but also all equivalent ways of practicing or implementing the invention under the claims.
While certain aspects of the invention are presented below in certain claim forms, the applicant contemplates the various aspects of the invention in any number of claim forms. For example, while only one aspect of the invention is recited as a means-plus-function claim under 35 U.S.C. §112, sixth paragraph, other aspects may likewise be embodied as a means-plus-function claim, or in other forms, such as being embodied in a computer-readable medium. (Any claims intended to be treated under 35 U.S.C. §112, ¶6 will begin with the words “means for.”) Accordingly, the applicant reserves the right to add additional claims after filing the application to pursue such additional claim forms for other aspects of the invention.