The present disclosure relates to a method for controlling a mobile or stationary terminal via a 3D sensor and a codeless hardware recognition device integrating a non-linear classifier, with or without a computer program assisting such a method. Specifically, the disclosure relates to facilitating hand- or face-gesture user input using one of multiple types of 3D image input (structured light, time-of-flight, stereoscopic, etc.) and a patented class of hardware-implemented non-linear classifiers.
Present day mobile and stationary terminal devices such as mobile phones or gaming platforms are equipped with image and/or IR sensors and are connected to display screens that display user input, or the user him/herself, in conjunction with a game or application being performed by the terminal. Such an arrangement is typically configured to receive input through interaction with a user via a user interface. Currently, such devices are not controlled by specific hand gestures (such as those of American Sign Language) or facial gestures processed by a zero-instruction-based (codeless) hardware non-linear classifier. The proposed approach results in a low-power, real-time implementation that can be made very inexpensive for integration into wall-powered and/or battery-operated platforms for industrial, military, commercial, medical, automotive, consumer and other applications.
One currently popular system uses gesture recognition with an RGB camera and an IR depth-field camera sensor to compute skeletal information and translate it into interactive commands, for gaming for instance. This embodiment introduces an additional hardware capability that can take real-time information about the hands and/or the face and give the user a new level of control over the system. This additional control could include, for instance, motioning the index finger for a mouse click, spreading or pinching the thumb and index finger to indicate expansion or contraction, or closing an open hand to grab. These recognized hand inputs can be combined with tracking of the hand's location to perform operations such as grabbing and manipulating virtual objects or drawing shapes or freeform images that are also recognized in real time by the hardware classifier in the system, greatly expanding the breadth of applications the user can enjoy and the interpretation of the gesture itself.
Secondarily, 3D information can be obtained in other ways, such as time-of-flight or stereoscopic input. The most cost-effective way is to use stereoscopic vision sensor input only and triangulate distance based on the shift of pixel information between the right and left cameras. Combining this with a non-linear hardware-implemented classifier can provide not only direct translation of an object's depth, but recognition of the object as well. Compared with instruction-based software simulation, these techniques allow significant reductions in cost, power, size, weight, development time and latency, enabling a wide range of pattern recognition capability in mobile or stationary platforms.
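By way of a hedged, non-limiting illustration (not part of the claimed hardware), the triangulation step can be expressed with the standard rectified-stereo relation Z = f·B/d, where f is the focal length in pixels, B is the camera baseline, and d is the disparity (pixel shift) between the left and right views; the function name and parameter values below are illustrative assumptions.

```python
# Minimal sketch of depth-from-disparity triangulation for a rectified
# stereo pair. Parameter values are illustrative placeholders.

import numpy as np

def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Return depth in meters for each disparity value (in pixels).

    Uses the standard rectified-stereo relation Z = f * B / d.
    Zero or negative disparities (no match) are mapped to infinity.
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(disparity_px, np.inf)
    valid = disparity_px > 0
    depth[valid] = focal_length_px * baseline_m / disparity_px[valid]
    return depth

# Example: a 60 mm baseline, a 700-pixel focal length, and a 35-pixel shift
# between the left and right images place the object at roughly 1.2 m.
print(depth_from_disparity([35.0], focal_length_px=700.0, baseline_m=0.06))
```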
The hardware non-linear classifier is a natively implemented radial basis function (RBF) Restricted Coulomb Energy (RCE) learning function and/or kNN (k-nearest-neighbor) machine learning device that can take in vectors, compare them in parallel against internally stored vectors (the database), apply a threshold function to the results, and then search and sort the outputs for a winner-take-all recognition decision, all without code execution. This technique implemented in silicon is covered by U.S. Pat. Nos. 5,621,863, 5,717,832, 5,701,397, 5,710,869 and 5,740,326. Specifically applying a device covered by these patents to solve hand/face gesture recognition from 3D input is the substance of this application.
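For clarity, the recognition flow described above (parallel vector comparison, thresholding, and a winner-take-all search and sort) can be sketched behaviorally in software. The sketch below is an illustration only, since the device itself performs these steps natively in silicon without executing instructions, and all names in it are assumed for the example.

```python
# Behavioral sketch of the classifier's recognition flow: compare an input
# vector against all stored vectors, apply each neuron's threshold, and
# return the winner-take-all (closest firing) match. Illustration only.

import numpy as np

def recognize(input_vec, stored_vecs, thresholds, categories):
    """stored_vecs: (N, D) array of stored model vectors.
    thresholds:  (N,) per-neuron thresholds (influence fields).
    categories:  (N,) category label for each neuron.
    Returns (category, distance) of the best firing neuron, or (None, None).
    """
    distances = np.abs(stored_vecs - input_vec).sum(axis=1)  # L1 distances, computed "in parallel"
    firing = distances < thresholds                          # threshold function
    if not firing.any():
        return None, None                                    # input not recognized
    best = np.argmin(np.where(firing, distances, np.inf))    # search/sort, winner take all
    return categories[best], distances[best]
```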
A system can be designed using 3D input with simulations of various algorithms run on traditional CPUs/GPUs/DSPs to recognize the input. The problem with these approaches is that they require many cores and/or threads to perform the function within the required latency. For accurate real-time interaction, many models must be evaluated simultaneously, which makes the end result cost- and power-prohibitive for consumer platforms in particular. By using the natively implemented, massively parallel, memory-based hardware non-linear classifier referred to above, this problem is mitigated, yielding a practical and robust solution for this class of applications. Real-time gesturing for game interaction, sign language interpretation, and computer control on hand-held battery appliances become practical via these techniques. Because recognition is low power, applications such as instant-on when a gesture or face is recognized can also be incorporated into the platform; a traditionally implemented approach would consume too much battery power to continuously look for such input.
The lack of finger recognition in current gesture-recognition gaming platforms creates a notable gap in the abilities of the system as compared to other motion devices which incorporate buttons. For example, there is no visual gesture option for quickly selecting an item, or for doing drag-and-drop operations. Game developers have designed around this omission by focusing on titles which recognize overall body gestures, such as dancing and sports games. As a result, there exists an untapped market of popular games which lend themselves to motion control but require the ability to quickly select objects or to grab, reposition, and release them. Currently this is done with a mouse input or buttons.
An object of this embodiment is to overcome at least some of the drawbacks relating to the compromise designs of prior art devices as discussed above. The ability to click on objects as well as to grab, re-position, and release objects is also fundamental to the user-interface of a PC. Performing drag-and-drop on files, dragging scrollbars or sliders, panning document or map viewers, and highlighting groups of items are all based on the ability to click, hold, and release the mouse.
Skeleton tracking of the overall body has been implemented successfully by Microsoft and others. One open source implementation identifies the joints by converting the depth camera data into a 3D point cloud and connecting adjacent points within a threshold distance of each other into coherent objects. The human body is then represented as a collection of 3D points, and appendages such as the head and hands can be found as extremities on that surface. To match the extremities to their body parts, the arrangement of extremities is compared against the expected proportions of the human body. A similar approach could theoretically be applied to the hand to identify the location of the fingers and their joints; however, the depth camera may lack the resolution and precision to do this accurately.
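As a hedged sketch of the point-cloud step described above (not drawn from any particular implementation), a depth frame can be back-projected into 3D points using standard pinhole-camera relations; the intrinsic parameters fx, fy, cx, cy are placeholders rather than values from a specific sensor.

```python
# Sketch: back-project a depth frame (in meters) into a 3D point cloud
# using pinhole intrinsics (fx, fy, cx, cy). Intrinsics are placeholders.

import numpy as np

def depth_to_point_cloud(depth_m, fx, fy, cx, cy):
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column/row indices
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop pixels with no depth reading

# Extremities (head, hands) could then be found as points farthest along
# the connected body surface, as described above; that step is omitted here.
```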
To overcome the coarseness of the fingers in the depth view, we will use hardware-based pattern matching to recognize the overall shape of the hand and fingers. The silhouette of the hand will be matched against previously trained examples in order to identify the gesture being made.
The use of pattern matching and example databases is common in machine vision. An important challenge to the approach, however, is that accurate pattern recognition can require a very large database of examples. The von Neumann architecture is not well suited to real-time, low-power pattern matching; the examples must be checked serially, and the processing time scales linearly with the number of examples to check. To overcome this, we will demonstrate pattern matching with the CogniMem CM1K (or any variant covered by the aforementioned patents) pattern matching chip. The CM1K is designed to perform pattern matching fully in parallel, simultaneously comparing the input pattern to every example in its memory with a response time of 10 microseconds. Each CM1K stores 1024 examples, and multiple CM1Ks can be used in parallel to increase the database size without affecting response time. Using the CM1K, the silhouette of the hand can be compared to a large database of examples in real time and at low power.
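To make the scaling property concrete, the sketch below estimates how many chips a given example database would require; because the chips match in parallel, the nominal response time remains approximately 10 microseconds regardless of chip count. The database size used in the example is hypothetical.

```python
# Sketch: sizing a parallel pattern-matching bank. Each chip holds 1024
# examples; adding chips grows capacity without growing response time.

import math

NEURONS_PER_CHIP = 1024
RESPONSE_TIME_US = 10  # nominal per-recognition latency

def chips_required(num_examples):
    return math.ceil(num_examples / NEURONS_PER_CHIP)

# Example: a hypothetical 10,000-example hand-silhouette database fits in
# 10 chips and is still searched in about 10 microseconds per input.
print(chips_required(10_000), "chips,", RESPONSE_TIME_US, "microseconds per match")
```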
The skeleton tracking information helps identify the coordinates of the hand joint within the depth frame. We first take a small square region around the hand from the depth frame, and then exclude any pixels which are outside of a threshold radius from the hand joint in real space. This allows us to isolate the silhouette of the hand against a white background, even when the hand is in front of the person's body (provided the hand is at least a minimum distance from the body), as shown in the accompanying drawings.
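A minimal sketch of this isolation step is given below, assuming the hand-joint pixel coordinate, its real-space position, and a per-pixel real-space map are available from the skeleton tracker; the window size and threshold radius are illustrative assumptions.

```python
# Sketch: isolate the hand silhouette from a depth frame. Pixels inside a
# small square window around the hand joint are kept only if they lie
# within a threshold radius of the hand joint in real space; everything
# else is pushed to the (white) background.

import numpy as np

def extract_hand_silhouette(points_3d, hand_px, hand_xyz,
                            window=64, radius_m=0.12):
    """points_3d: (H, W, 3) real-space coordinates of each depth pixel.
    hand_px:   (row, col) of the hand joint in the depth frame.
    hand_xyz:  (x, y, z) of the hand joint in real space.
    Returns a small grayscale image: hand pixels dark, background white.
    """
    r, c = hand_px
    half = window // 2
    r0, c0 = max(r - half, 0), max(c - half, 0)        # clamp at frame border
    crop = points_3d[r0:r + half, c0:c + half]
    dist = np.linalg.norm(crop - np.asarray(hand_xyz, dtype=float), axis=-1)
    return np.where(dist <= radius_m, 0, 255).astype(np.uint8)
```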
Samples of the extracted hand are recorded in different orientations and at different distances from the camera, as illustrated in the accompanying drawings.
The chip uses patented, hardware-implemented Radial Basis Function (RBF) and Restricted Coulomb Energy (RCE) or k-Nearest-Neighbor (kNN) algorithms to learn and recognize examples. For each example input, if the chip does not yet recognize the input, the example is added to the chip's memory (that is, a new “neuron” is committed) and a similarity threshold (referred to as the neuron's “influence field”) is set. The example stored by a neuron is referred to as the neuron's model.
Inputs are compared to all of the neurons (collectively referred to as the knowledge base) in parallel. An input is compared to a neuron's model by taking the Manhattan (L1) distance between the input and the neuron model. If the distance reported by a neuron is less than that neuron's influence field, then the input is recognized as belonging to that neuron's category.
If the chip is shown an image which it recognizes as the wrong category during learning, then the influence field of the neuron which recognized it is reduced so that it no longer recognizes that input.
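The learn-and-adjust cycle described above (committing a neuron when an example is not recognized, and shrinking an influence field when a wrong-category neuron fires) can be sketched behaviorally as follows; this plain-software illustration merely mirrors the stated rules and is not the chip's internal implementation.

```python
# Behavioral sketch of RCE-style learning with per-neuron influence fields.
# Recognition uses L1 (Manhattan) distance; illustration only.

import numpy as np

MAX_INFLUENCE = 4096   # illustrative initial influence field

class RCEKnowledgeBase:
    def __init__(self):
        self.models, self.influences, self.categories = [], [], []

    def _distances(self, x):
        return [int(np.abs(m - x).sum()) for m in self.models]   # L1 distance to each model

    def recognize(self, x):
        x = np.asarray(x)
        hits = [(d, c) for d, inf, c in
                zip(self._distances(x), self.influences, self.categories) if d < inf]
        return min(hits)[1] if hits else None      # winner take all: closest firing neuron

    def learn(self, x, category):
        x = np.asarray(x)
        dists = self._distances(x)
        # Shrink any wrong-category neuron that currently fires on this example.
        for i, (d, c) in enumerate(zip(dists, self.categories)):
            if c != category and d < self.influences[i]:
                self.influences[i] = d
        # Commit a new neuron if the example is still not correctly recognized.
        if self.recognize(x) != category:
            other = [d for d, c in zip(dists, self.categories) if c != category]
            self.models.append(x)
            self.influences.append(min(other) if other else MAX_INFLUENCE)
            self.categories.append(category)
```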
An example implementation of the invention can consist of a 3D sensor, a television or monitor, and a CogniMem hardware evaluation board, all connected to a single PC (or other computing platform). Software on the PC will extract the silhouette of the hand from the depth frames and will communicate with the CogniMem board to identify the hand gesture.
The mouse cursor on the PC will be controlled by the user's hand, with clicking operations implemented by finger gestures. A wide range of gestures can be taught, such as standard American Sign Language or user-defined hand/face gestures. Example user-input gestures, including the ability to click on objects, grab and reposition objects, and pan and zoom in or out on the screen, are appropriate for this example implementation. The user will be able to use these gestures to interact with various software applications, including both video games and productivity software.
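As a purely hypothetical sketch of how recognized gesture categories and the tracked hand position might be mapped to cursor actions, the category labels and the move_cursor/press/release callbacks below are assumptions standing in for whatever input layer the platform provides, not part of any specific operating-system API.

```python
# Hypothetical sketch: translate recognized hand-gesture categories and the
# tracked hand position into cursor actions. The callbacks (move_cursor,
# press, release) stand in for the platform's input layer.

def dispatch_gesture(category, hand_xy, state, move_cursor, press, release):
    """category: label returned by the hardware classifier (e.g. "open",
    "closed", "index_click"); hand_xy: tracked hand position mapped to
    screen coordinates; state: dict tracking whether the virtual mouse
    button is currently held (for grab/drag/release)."""
    move_cursor(*hand_xy)                        # hand position drives the cursor
    if category == "index_click":                # index-finger motion -> click
        press()
        release()
    elif category == "closed" and not state.get("held"):
        press()                                  # closed hand -> grab (button down)
        state["held"] = True
    elif category == "open" and state.get("held"):
        release()                                # open hand -> release (button up)
        state["held"] = False
```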
The present embodiment now will be described more fully hereinafter with reference to the accompanying drawings, in which some examples of the embodiments are shown. Indeed, these may be represented in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided by way of example so that this disclosure will satisfy applicable legal requirements.
In summary, the CPU of the PC (or other computing platform) extracts the hand silhouette from the 3D sensor data, passes it to the hardware non-linear classifier, and maps the recognized gesture to the corresponding user-interface action.
Many modifications and other embodiments versus those set forth herein will come to mind to one skilled in the art to which these embodiments pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the specific examples of the embodiments disclosed are not exhaustive and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
Number | Date | Country
---|---|---
61660583 | Jun 2012 | US