As human-machine interactions evolve from the simple touch of a finger on a touch-sensitive screen to more complex interactions such as multi-touch or touchless interaction, user expectations are building for new experiences that more closely mirror real life. For example, users expect devices to support real-life gestures such as grabbing an object like a sheet of paper and dropping it into a paper tray, or grabbing a photo and passing it to another person.
These real-life gestures are far more complex. They require hardware innovation to provide detection and tracking, and a high degree of software processing to compose those detections into a synthesized gesture such as a grab. Currently this type of technology is lacking.
While multi-touch technologies have been used in some personal digital assistant products, music player products and smart phone products to detect multiple-finger pinch gestures, these rely on comparatively expensive sensor technology that does not cost-effectively scale to larger sizes. Thus there remains a need for gesture recognition systems and methods that can be implemented with low-cost sensor arrays suitable for larger devices.
The present technology provides a cost-effective way to recognize complex gestures, such as grab and drop performed by the human hand. This technology can be scaled to accommodate very large displays and surfaces, such as large-screen TVs or other large control surfaces, where the conventional technology used in smaller personal digital assistants, music players or smart phones would be cost prohibitive.
In accordance with one aspect, the disclosed system and method employs an algorithm and computational model for detecting and tracking a human hand grabbing an object and dropping it in a 2-D or 3-D space. In this case the user can lift the hand completely off the surface and into the air, and then drop the object back onto the surface.
In accordance with another aspect, the disclosed system and method employs an algorithm and computational model for detecting and tracking a human hand grabbing an object on a surface, dragging it along the surface from one point to another, and then dropping it. In this case the user's hand is constantly in touch with the surface and is never lifted completely off it.
a is a graph showing exemplary capacitance readings of a single touch point, separately showing both X-axis and Y-axis sensor readings;
b is a graph showing exemplary capacitance readings of two touch points, separately showing both X-axis and Y-axis sensor readings;
a is a three-dimensional point cloud graph, showing exemplary grab and touch distributions of data from the X-axis sensor readings;
b is a three-dimensional point cloud graph, showing exemplary grab and touch distributions of data from the Y-axis sensor readings;
Human-machine interactions for consumer electronic devices are gravitating towards more intuitive methods based on touch and gestures and away from the existing mouse and keyboard approach. For many applications a touch-sensitive surface is used for users to interact with the underlying system. The same touch surface can also serve as the display. Consumer electronics displays are getting thinner and less expensive. Hence there is a need for a touch surface that is thin and inexpensive and provides a multi-touch experience.
The exemplary embodiment illustrated here uses a multi-touch surface based on capacitive sensor arrays that can be packaged in a very thin foil, at a fraction of the cost of sensors typically used for multi-touch solutions. Although inexpensive sensor technology is used, complex gestures like grab, drag and drop can still be accurately detected and tracked. Thus, while the illustrated embodiment uses capacitive sensors as the underlying technology to provide touch point detection and tracking, this invention can be readily implemented using other types of sensors, including but not limited to resistive, pressure, optical or magnetic sensors. As long as the touch points can be determined, using any available technology, the grab and drop gesture can be composed and detected easily using the algorithms disclosed herein.
As illustrated in
As illustrated in
The interactive foil is composed of capacitance sensors in both the vertical and horizontal directions, as shown in the magnified detail at 64. To simplify description, we refer here to the vertical direction as the y-axis and the horizontal direction as the x-axis. The capacitance sensor is sensitive to conductive objects, such as human body parts, when they are near the surface of the foil. The x-axis and y-axis readings of sensed capacitance values are, however, independent. When a human body part, e.g. a finger F, comes close enough to the surface, the capacitance values on the corresponding x- and y-axis sensors will increase (xa, ya). This makes it possible to detect single or multiple touch points. In our development sample, the foil is 32 inches long diagonally, and the ratio of the long and short sides is 16:9. Therefore, the corresponding sensor distance in the x-axis is about 22.69 mm and that in the y-axis is about 13.16 mm. Based on these specifications of the hardware, a set of algorithms is developed to detect and track the touch points and gestures like grab and drop, as will be described in the following sections.
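The geometry recited above can be checked with a short sketch (Python used here purely for illustration; the derived line counts are approximations implied by the stated pitches, not figures from the disclosure):

```python
import math

# Sketch (not from the disclosure): derive the foil's physical dimensions
# from the stated specifications -- a 32-inch diagonal and a 16:9 aspect
# ratio -- and relate them to the stated sensor pitches.
MM_PER_INCH = 25.4

def foil_dimensions_mm(diagonal_in=32.0, aspect=(16, 9)):
    """Return (width_mm, height_mm) of the sensing foil."""
    ax, ay = aspect
    d = math.hypot(ax, ay)
    width = diagonal_in * ax / d * MM_PER_INCH
    height = diagonal_in * ay / d * MM_PER_INCH
    return width, height

width_mm, height_mm = foil_dimensions_mm()
# With the stated pitches (~22.69 mm on x, ~13.16 mm on y), the foil
# spans roughly 31 sensor intervals along the long side and 30 along
# the short side (an inference from the stated numbers).
x_intervals = width_mm / 22.69
y_intervals = height_mm / 13.16
```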
It will be appreciated that the capacitance sensor can be implemented upon an optically clear substrate, using extremely fine sensing wires, so that the capacitive sensor array can be deployed over the top of or sandwiched within display screen components. Doing this allows the technology of this preferred embodiment to be used for touch screens, TV screens, graphical work surfaces, and the like. Of course, if see-through capability is not required, the sensor array may be fabricated using an opaque substrate.
When fingers touch or even come near enough to the surface of the sensor array, the capacitances of the nearby sensors will increase. By constantly reading or periodically polling the capacitance values of the sensors, the system can recognize and distinguish among different gestures. Using the process that will next be discussed, the system can distinguish the “touch” gesture from the “grab and drop” gesture. In this regard, the touch gesture involves the semantic of simple selection of a virtual object, by pointing to it with the fingertip (touch). The grab and drop gesture involves the semantic of selecting and moving a virtual object by picking up (grabbing) the object and then placing it (dropping) in another virtual location.
Distinguishing between the touch gesture and the grab and drop gesture is not as simple as it might seem at first blush, particularly with the capacitive sensor array of the illustrated embodiment. This is because the sensor array comprised of two separate X-coordinate and Y-coordinate sensor arrays cannot always discriminate between single touch and multiple touch (there are ambiguities in the sensor data). To illustrate, refer to
The system and method of the present disclosure is able to distinguish between the touch and the grab and drop gestures, despite these inherent shortcomings of the separate X-coordinate and Y-coordinate sensor arrays. It does this using trained model-based pattern recognition and trajectory recognition algorithms. By way of overview, when a touch is recognized, touch points are detected and each detected touch point is tracked individually as it moves. Because the algorithm treats grab and drop as a recognized gesture pair, when a grab is recognized it waits until a drop (the companion recognized gesture) is found or a timeout occurs. The user can also drag the grabbed object before dropping it.
The grab and drop algorithms and procedures address the ambiguity problem associated with capacitive sensors by using pattern recognition to infer where the touch points are (and thereby resolve the ambiguity). At any given instant, the inference may be incorrect; but over a short period of time, confidence in the inference drawn from the aggregate will grow to a degree where it can reasonably be relied upon. Another important advantage of such pattern recognition is that the system can infer gestural movements even when the data stream from the sensor array has momentarily ceased (because the user has lifted his hand far enough from the sensor array that it is no longer being capacitively sensed). When the user's hand again moves within sensor range, the recognition algorithm is able to infer whether the newly detected motion is part of the previously detected grab and drop operation by relying on the trained models. In other words, groups of sensor data that closely enough match the grab and drop trained models will be classified as a grab and drop operation, even though the data has dropouts or gaps caused by the user's hand being out of sensor range.
A data flow diagram of the basic process is shown in
If the detected gesture is recognized as a touch gesture, then further processing steps are performed. The data are first analyzed by the touch point classifier 24, which performs the initial assessment whether the touch corresponds to a single touch point, or a plurality of touch points. The classifier 24 uses models that are trained off-line to distinguish between single and multiple touch points.
Next the classification results are fed into a simplified Hidden Markov Model (HMM) 26 to update the posteriori probability. The HMM probabilistically smooths the data over time. Once the posteriori probability reaches the threshold, the corresponding number of touch points is confirmed and the peak detector 28 is applied to the readings to find the local maxima. The peak detector 28 analyzes the confirmed number of touch points to pinpoint more precisely where each touch point occurred. For a single touch point, the global maximum is detected; for multiple touch points, a set of local maxima is detected.
Finally, a Kalman tracker 30 associates the respective touch points from the X-axis and Y-axis sensors as ordered pairs. The Kalman filter is based on a constant speed model that is able to associate touch points at different time frames, as well as provide data smoothing as the detected points move during the gesture. The Kalman tracker 30 is optional: it is invoked only if plural touch points have been detected, in which case it resolves the ambiguity that arises when two points touch the sensor at the same time. If only one touch point was detected, it is not necessary to invoke the Kalman tracker.
The gesture recognizer 20 is preferably designed to recognize two categories of gestures, i.e. grab-and-drop and touch, and it is composed of two modules, a gesture classifier 70, and a confidence accumulator 72. See
To recognize the grab-and-drop and touch gestures, sample data are collected for offline training. The samples are collected by having a population of different people (representing different hand sizes and both left-handed and right-handed users) make repeated grab and drop gestures while the sensor data are recorded throughout the grab and drop sequence. The sample data are then stored as trained models 74 that the gesture classifier 70 uses to analyze new, incoming sensor data during system use. Notice that the grab-and-drop gesture is characterized by a grab followed by a drop; correct recognition of the grab is the critical part of this gesture. Hence, in the data collection, we focus on the grab data. Because the grab gesture precedes the drop gesture, we can analyze the collected capacitive readings of the training data and appropriately label the grab and drop regions within the data. With this focus, a reasonable feature set can be represented by the statistics of the capacitive readings.
To visualize the distribution of the two gestures, a point cloud is shown in
To select the number of normalized central moments used in the recognizer, we employ a k-fold cross-validation technique to estimate the classification error for different selections of the features as shown in
The estimate of the false positive and false negative rates as shown in
Let Sn be the gesture when the n-th readings are collected, and Wn be the classification result of the n-th reading. The performance of the classifier is modeled as P(Wn|Sn), which is estimated by k-fold cross validation during training. From Sn−1 to Sn, there is a transition probability P(Sn|Sn−1). Suppose at time n−1 we have the posteriori probability P(Sn−1|Wn−1, . . . , W0); after the classifier processes the n-th reading, the new posteriori probability P(Sn|Wn, . . . , W0) is then updated as

P(Sn|Wn, . . . , W0) ∝ P(Wn|Sn) ΣSn−1 P(Sn|Sn−1) P(Sn−1|Wn−1, . . . , W0).
As can be seen, the posteriori probability P(Sn|Wn, . . . , W0) accumulates when Wn's are collected. Once it is high enough, we confirm the corresponding gesture and the system goes to the follow-up procedures for that gesture.
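The accumulation of the posteriori probability described above can be sketched as a simple two-state Bayes filter (an illustrative Python sketch; the transition matrix T and the classifier confusion matrix C are hypothetical placeholders, not values from the disclosure):

```python
import numpy as np

# Two-state sketch (0 = touch, 1 = grab) of the recursive posterior
# update. T and C below are illustrative placeholders.
T = np.array([[0.95, 0.05],    # P(S_n | S_{n-1}), rows: previous state
              [0.05, 0.95]])
C = np.array([[0.85, 0.15],    # P(W_n | S_n), rows: true gesture state
              [0.20, 0.80]])

def update_posterior(post, w):
    """One step: predict through T, weight by P(w | S), renormalize."""
    pred = T.T @ post            # predicted state distribution
    unnorm = C[:, w] * pred      # multiply by likelihood of label w
    return unnorm / unnorm.sum()

post = np.array([0.5, 0.5])      # uninformative prior
for w in [1, 1, 1, 1]:           # a run of "grab" labels from the classifier
    post = update_posterior(post, w)
# Confidence in the grab state accumulates toward the confirmation threshold.
```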
If the gesture of grab is confirmed, the grab point needs to be estimated. The way the system estimates it is by thresholding and weighted averaging, which is discussed more fully below in connection with estimation of the drop point.
When a grab gesture is confirmed, the system waits until there is no contact with the sensor array before initializing the drop detector 22. Initialized in this way, the drop detector is very simple to implement: we simply need to detect the next time any human body part contacts the touch screen, which is done by applying a threshold c0 to the average capacitive readings.
To estimate the position of the grab point and the drop point, a threshold-and-averaging method is employed. The idea is to first find a threshold and then average the position of the readings that are over the threshold. In this implementation, the threshold is found by calculating a weighted average of the maximum reading and the average reading. Let cmax be the maximum reading and cavg be the average reading, the threshold ch is then set to
ch = w0·cavg + w1·cmax, subject to w0 + w1 = 1, w0, w1 > 0.
The position of the grab or drop point is then easily estimated as the average position of the readings that exceed the threshold ch. The drop ends when no contact with the touch screen is present, which is again detected using the threshold c0. After the drop gesture finishes, the system returns to the very beginning.
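The threshold-and-averaging estimator described above can be sketched along one axis as follows (illustrative Python; the equal weights w0 = w1 = 0.5 are an assumption, since the disclosure requires only that w0 + w1 = 1):

```python
import numpy as np

# Sketch of the threshold-and-averaging point estimator. The weights
# w0 and w1 are illustrative; the text only requires w0 + w1 = 1.
def estimate_point(readings, w0=0.5, w1=0.5):
    """Estimate the grab/drop position along one sensor axis.

    readings: 1-D array of capacitance values, one per sensor line.
    Returns the average index of the supra-threshold lines.
    """
    c = np.asarray(readings, dtype=float)
    c_h = w0 * c.mean() + w1 * c.max()   # threshold between mean and max
    positions = np.arange(len(c))
    return positions[c >= c_h].mean()    # average position over threshold

# A capacitance bump centred on sensor line 3 is localized there.
pos = estimate_point([0, 1, 4, 9, 4, 1, 0])
```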
If a touch is confirmed in the gesture recognizer, the capacitive readings are further passed to the touch point classifier. In this section, we describe how the touch point classifier works. To simplify the discussion, consider a scenario where at most two touch points can be present on the touch screen. The proposed algorithm, however, can be extended to handle more than two touch points by adding classes when training the classifier and by increasing the number of states in the simplified Hidden Markov Model described below. For example, to detect and track three points, we train the classifier with three classes and increase the simplified Hidden Markov Model to three states.
Sample capacitance readings for a single touch point and two touch points are shown in
A Gaussian density classifier is proposed here. Suppose the samples of each group are drawn from a multivariate Gaussian density N(μk, Σk), k = 1, 2. Let xik ∈ Rd be the i-th sample point for the k-th group, i = 1, . . . , Nk. For each group, the Maximum Likelihood (ML) estimates of the mean μk and covariance matrix Σk are

μ̂k = (1/Nk) Σi xik,  Σ̂k = (1/Nk) Σi (xik − μ̂k)(xik − μ̂k)^T.
With this estimation, the boundary is then defined as the equal Probabilistic Density Function (PDF) curve, and is given by
x^T Q x + Lx + K = 0,

where Q = Σ1^{−1} − Σ2^{−1}, L = −2(μ1^T Σ1^{−1} − μ2^T Σ2^{−1}), and K = μ1^T Σ1^{−1} μ1 − μ2^T Σ2^{−1} μ2 + log|Σ1| − log|Σ2|.
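A minimal sketch of such a classifier, with ML estimation per group and classification by comparing class log-densities (equivalent to testing which side of the equal-PDF boundary a point falls on), might look as follows; the synthetic clusters are illustrative stand-ins for the two feature groups, not data from the disclosure:

```python
import numpy as np

# Sketch of the two-class Gaussian density classifier: ML estimates of
# each class mean/covariance, then classification by comparing class
# log-densities (equivalent to the quadratic boundary x^T Q x + Lx + K = 0).
rng = np.random.default_rng(0)

def fit_gaussian(samples):
    """ML (biased) estimates of mean and covariance for one class."""
    x = np.asarray(samples, dtype=float)
    mu = x.mean(axis=0)
    centered = x - mu
    sigma = centered.T @ centered / len(x)
    return mu, sigma

def log_density(x, mu, sigma):
    """Gaussian log-density up to a class-independent constant."""
    d = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (logdet + d @ np.linalg.solve(sigma, d))

# Two synthetic, well-separated clusters standing in for the feature groups.
class1 = rng.normal([0.0, 0.0], 0.5, size=(200, 2))
class2 = rng.normal([4.0, 4.0], 0.5, size=(200, 2))
mu1, s1 = fit_gaussian(class1)
mu2, s2 = fit_gaussian(class2)

def classify(x):
    x = np.asarray(x, dtype=float)
    return 1 if log_density(x, mu1, s1) > log_density(x, mu2, s2) else 2
```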
The features we propose to use are the statistics of the capacitance readings, which are the mean, the standard deviation and the normalized higher order central moments. For feature selection, we use k-fold cross validation on the training dataset with features up to the 8th normalized central moment. The estimated false positive and false negative rates are shown in
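The feature extraction described above (mean, standard deviation, and normalized central moments up to the 8th) can be sketched for one frame of readings as follows (illustrative Python):

```python
import numpy as np

# Sketch of the proposed feature set: mean, standard deviation, and
# normalized central moments of one frame of capacitance readings,
# up to `max_order` (the text cross-validates up to the 8th).
def moment_features(readings, max_order=8):
    c = np.asarray(readings, dtype=float)
    mean = c.mean()
    std = c.std()
    feats = [mean, std]
    for k in range(3, max_order + 1):
        # k-th central moment normalized by std^k (scale invariant)
        feats.append(((c - mean) ** k).mean() / std ** k)
    return np.array(feats)

f = moment_features([1.0, 2.0, 3.0, 4.0, 5.0])
```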
To assess the classification results over time, we employ a simplified Hidden Markov Model (HMM) to implement a model-based probabilistic analyzer 26. The HMM exhibits the ability to smooth the detection over time in a probabilistic sense. In this regard, the output of the touch point classifier 24 can be thought of as a sequence of time-based classification decisions. The HMM 26 analyzes the sequence of data from the classifier 24 to determine how those classification decisions may best be connected to define a smooth sequence corresponding to the gestural motion. In this regard, it should be recognized that not all detected points necessarily correspond to the same gestural motion. Two simultaneously detected points could correspond to different gestural motions that happen to be ongoing at the same time, for example.
The structure of the HMM we are using is shown in
P(Zt+δ|Zt+δ−1) = P(Zt|Zt−1) and P(Xt+δ|Zt+δ) = P(Xt|Zt), ∀δ ∈ Z+.
Without any prior knowledge, it is reasonable to assume Z0 ~ Bernoulli (p = 0.5). Suppose at time t we have prior knowledge about Zt−1, i.e. P(Zt−1|Xt−1, . . . , X0), and the classifier gives the result Xt; the hidden state is then updated by the Bayesian rule

P(Zt|Xt, . . . , X0) ∝ P(Xt|Zt) ΣZt−1 P(Zt|Zt−1) P(Zt−1|Xt−1, . . . , X0).
Instead of maximizing the joint likelihood to find the best sequence, we make the decision based on the posteriori P(Zt|Xt, . . . , X0). Once the posteriori is higher than a predefined threshold, which we set very high, the state is confirmed and the number of touch points Nt is then passed to the peak detector to find the positions of the touch points.
From the confirmed number of touch points Nt, the peak detector finds the first Nt largest local maxima. If there is only one touch point, the search is straightforward, as we only need to find the global maximum. Otherwise, when there are two touch points, after finding the two local maxima we apply a ratio test: when the ratio of the values of the two peaks is very large, the lower one is deemed noise, and the two touch points coincide with each other on that dimension.
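The peak detection with ratio test described above can be sketched along one axis as follows (illustrative Python; the ratio threshold of 10 is a hypothetical placeholder, as the disclosure does not specify a value):

```python
import numpy as np

# Sketch of the peak detector: global maximum for one touch point;
# otherwise the two largest local maxima with a ratio test to reject
# a spurious second peak. The 10x threshold is illustrative only.
def detect_peaks(readings, n_points, ratio_threshold=10.0):
    c = np.asarray(readings, dtype=float)
    if n_points == 1:
        return [int(c.argmax())]          # single touch: global maximum
    # Local maxima: strictly above both neighbours.
    maxima = [i for i in range(1, len(c) - 1)
              if c[i] > c[i - 1] and c[i] > c[i + 1]]
    maxima.sort(key=lambda i: c[i], reverse=True)
    peaks = maxima[:2]
    if len(peaks) == 2 and c[peaks[0]] / c[peaks[1]] > ratio_threshold:
        # Second peak is noise: the touch points coincide on this axis.
        peaks = [peaks[0], peaks[0]]
    return sorted(peaks)

p = detect_peaks([0, 1, 8, 1, 0, 1, 7, 1, 0], n_points=2)
```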
To achieve subpixel accuracy, for each local maximum pair (xm, f(xm)), where xm is the position and f(xm) is the capacitance value, together with one point on either side, (xm−1, f(xm−1)) and (xm+1, f(xm+1)), we fit a parabola f(x) = ax^2 + bx + c. This is equivalent to solving the linear system in a, b and c obtained by evaluating the parabola at the three sample points.
Then the maximum point is refined to the vertex of the fitted parabola, x* = −b/(2a).
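This refinement can be sketched directly: solve the three-point linear system for a, b and c, and take the parabola's vertex (illustrative Python):

```python
import numpy as np

# Sketch of subpixel peak refinement: fit f(x) = a*x^2 + b*x + c
# through the peak sample and its two neighbours by solving the 3x3
# linear system, then take the vertex x* = -b / (2a).
def refine_peak(xs, fs):
    """xs, fs: the three positions and capacitance values around a peak."""
    A = np.array([[x * x, x, 1.0] for x in xs])
    a, b, c = np.linalg.solve(A, np.asarray(fs, dtype=float))
    return -b / (2.0 * a)

# A parabola with its true vertex at x = 2.3, sampled at integer
# positions 1, 2 and 3, is recovered exactly.
x_star = refine_peak(
    [1.0, 2.0, 3.0],
    [-(1 - 2.3) ** 2, -(2 - 2.3) ** 2, -(3 - 2.3) ** 2],
)
```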
As the two dimensions of the capacitive sensor are independent, positions on the x- and y-axis must be associated together to determine the touch point in the 2-D plane. When there are two peaks on each dimension, (x1, x2) and (y1, y2), there are two possible pairs of associations, (x1, y1), (x2, y2) and (x1, y2), (x2, y1), which have equal probability. This poses an ambiguity if there are two touch points at the very beginning. Hence the system is restricted to start from a single touch point.
To associate touch points at different time frames as well as smooth the movement, we employ a Kalman filter with a constant speed model. The Kalman filter evaluates the trajectory of touch point movement, to determine which x-axis and y-axis data should be associated as ordered pairs (representing a touch point).
Let us define the state vector st, containing the position and velocity of a touch point. The transition of the Kalman filter satisfies

st+1 = H st + w,
yt+1 = M st+1 + ν,

where in our problem H and M are the transition and measurement matrices of the constant speed model, and w ~ N(0, Q) and ν ~ N(0, R) are white Gaussian noises with covariance matrices Q and R.
Given prior information from past observations, i.e. the predicted mean μt and covariance Σ, the update upon receiving the measurement yt is

μpost = μt + ΣM^T(MΣM^T + R)^{−1}(yt − Mμt)
Σpost = Σ − ΣM^T(MΣM^T + R)^{−1}MΣ
μt+1 = Hμpost
Σ = HΣpostH^T + Q

where μpost and Σpost are the posterior mean and covariance after incorporating the measurement yt.
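The constant-speed tracker described above can be sketched for one axis as follows (illustrative Python; the noise covariances Q and R are hypothetical placeholders, not values from the disclosure):

```python
import numpy as np

# One-axis sketch of the constant-speed Kalman tracker. H advances
# position by velocity each frame; M observes position only.
H = np.array([[1.0, 1.0],    # position += velocity each frame
              [0.0, 1.0]])
M = np.array([[1.0, 0.0]])   # only position is measured
Q = np.eye(2) * 1e-3         # process noise covariance (placeholder)
R = np.array([[0.1]])        # measurement noise covariance (placeholder)

def kalman_step(mu, sigma, y):
    """One measurement update followed by one prediction, as in the text."""
    gain = sigma @ M.T @ np.linalg.inv(M @ sigma @ M.T + R)
    mu_post = mu + (gain @ (y - M @ mu)).ravel()
    sigma_post = sigma - gain @ M @ sigma
    return H @ mu_post, H @ sigma_post @ H.T + Q

mu, sigma = np.array([0.0, 0.0]), np.eye(2)
for y in [1.0, 2.0, 3.0, 4.0, 5.0]:   # a point moving one unit per frame
    mu, sigma = kalman_step(mu, sigma, np.array([y]))
# After a few frames the filter predicts the next position (about 6)
# and a velocity near 1.
```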
From the foregoing it will be seen that the technology described here will enable multi-touch interaction for many audio/video products. Because the capacitive sensors can be packaged in a thin foil, they can be used to produce very thin multi-touch displays at very small additional cost.
Number | Date | Country
---|---|---
61099332 | Sep 2008 | US