The disclosed embodiments relate to a method for mapping of gestures to particular functions of a communications terminal. In particular, it relates to a method for invoking an operation of a communication terminal in response to registering and interpreting a predetermined motion or pattern of an object. It furthermore relates to a computer program arranged to perform said method.
In interacting with electronic devices such as computer terminals, cameras, mobile phones, and television sets, people have become accustomed to entering information and operating these devices through keyboards, touch sensitive displays and the like.
With the increased popularity of hand held devices, and their miniaturization, usability problems caused by the decreasing size of the input means of these devices become apparent. Hence, an alternative solution for providing input to electronic devices, especially handheld ones, is sought. It is furthermore an aim to find a more natural interaction between humans and computing devices.
Various input techniques under experimentation include accessory sensor modalities connected to computing devices, such as motion sensors, surface muscle or nerve sensors, etc., for acquiring specified gestures. A drawback, however, is that the use of such sensors requires extensive computational power, which is associated with considerable cost.
Hence, it is desired to develop an input technology that is able to solve the usability problems brought about by the miniaturization of input devices.
In the following, a natural UI interaction system based on hand gestures captured from one or more cameras is presented. Integrated in mobile devices, the system will efficiently resolve the conflict between miniaturized hardware and maximized software input, while interaction by hand gestures will dramatically improve the usability of the mobile devices.
In one embodiment, a communication terminal is provided that is capable of establishing interaction with an external object by detecting and recognizing predetermined motions for controlling the communication terminal.
In another embodiment, a communication terminal is provided with proximity detection for activating the interaction with an external object for detection and recognition of predetermined motions.
In a further embodiment, a method comprises invoking an operation of a communication terminal in response to registering and interpreting a predetermined motion or pattern of an object. A convenient solution for command input to a communication terminal, such as a mobile phone, is thereby realized. As a further advantage, a direct solution for the conflict of device miniaturization and usability is provided. The interaction is more natural, and input is not limited by the miniaturization of device hardware. The term invoking may also be construed as associating.
The motion or pattern may advantageously be registered and interpreted visually, such as by capturing an image of an object. Advantageously, image input is readily provided by a camera, for instance integrated in the communication terminal.
According to one embodiment, the object comprises a hand and the predetermined motion or pattern comprises a hand gesture. As an advantage, a natural interaction between humans and computing devices can be achieved by using hand gestures for command input and navigation of the user interface of the devices. Furthermore, the user may move the hand according to predetermined patterns, which may have been set by the user on a previous occasion, and thereby invoke different operations of the mobile phone such as calling the sender of the message, going to the next message and so forth.
According to various embodiments, the wording registering may be construed as capturing image data, and the wording interpreting may be construed as recognizing an object as a hand and recognizing and associating a gesture of the hand with a reference gesture. According to one embodiment of the invention, the wording interpreting may be construed as comprising steps of identifying an object, recognizing the object, determining its orientation, and recognizing and associating it with a hand gesture. The interpretation may be performed by software of the terminal.
Furthermore, according to another embodiment of the method according to the invention, the operation involves provision of a command input to the communication terminal using a hand gesture, and the method comprises:
The wording capturing image data may be construed as simply taking a picture with an image capturing device, such as a camera of for instance a mobile phone.
The wording identifying an object in said image data may be construed as finding an object in the picture.
According to one embodiment, said identification involves classifying skin color. As an advantage, human-like objects, such as a hand, may be recognized from an image.
According to another embodiment, the skin color classification comprises performing Gaussian mixture modelling. Hence, the complex nature of human skin color and intensity spectra is imitated and, as an advantage, the precision of recognizing objects comprising human skin within an image is increased.
Advantageously, various techniques may be employed to improve the process of separating noisy regions from wanted regions of a gesture. For instance, according to one embodiment, the color classification may involve color space analysis and/or probability analysis.
Furthermore, according to another embodiment, the color space analysis may involve conversion of image data to chrominance plane (CbCr) color space image data.
According to still yet another embodiment, the object recognition may involve eliminating visual noise using connected component extraction.
According to one embodiment, the connected component extraction may comprise any of the following:
According to one embodiment, the association may involve a step of determining orientation of the hand and involving:
According to a further embodiment, the orientation determining arrives at one of the following:
According to a preferred embodiment in a common, general reference frame, the first, second, third, and fourth operations correspond to moving focus up, down, left, and right respectively, and said fifth, sixth, and seventh operations correspond to opening an item, such as a file, folder or image, closing a file, folder or image, and stopping the focus motion respectively. The wording focus refers to focus of an item, such as an image, a file, a contact, a detail entry, phone number, or the like.
Furthermore, according to one preferred embodiment, the first KL axis direction is vertically upwards, and the second KL axis direction horizontally to the left.
The wording essentially superpositioned is to be construed as meaning that the two centerpoints are in the vicinity of each other and not necessarily completely superpositioned.
According to one embodiment, the registering may be performed using a camera comprised by the communication terminal.
According to a further embodiment, the communication terminal may comprise a mobile phone.
The wording gesture should in this context be construed as a single formation or shape of a gesticulation produced with a hand, such as a closed fist, open hand, closed hand with thumb extended and pointing in a direction. The wording gesture is also to be construed as a group consisting of a sequence of single gestures after each other and furthermore, also as a gesture comprising a moving hand, such as a ticking-in-the-air with a finger.
The wording image data is to be construed as a still image or a series of still images, such as a video sequence.
According to yet another embodiment, the method further comprises a step of activation by proximity detection. Hence, equipped with a proximity sensor that detects range to nearby objects, means for registering motions may be activated by proximity detection, rendering it enough to approach the terminal with an object without bringing them into mechanical contact. Usable proximity switches may comprise inductive, capacitive, electromagnetic radiation or ultrasonic types. Detecting electromagnetic radiation includes optical sensing and infrared radiation as detected from emitted heat from, for instance, the hand of a user.
The above advantages and features, together with numerous other advantages and features which will become evident from the detailed description below, are obtained according to a second aspect of the disclosed embodiments by a computer-readable medium having computer-executable components, said computer-readable medium being adapted to invoke an operation of a communication terminal in response to registering and interpreting a predetermined motion or pattern of an object.
Especially, according to one embodiment, the computer-readable medium may further be adapted to:
In other words, the disclosed embodiments provide a method for controlling different operations of a communication terminal by recognition of predetermined motions of an object. In the case where a hand, such as the user's, is used as the object, the predetermined motions may comprise closing the hand into a fist, grabbing, waving, pointing with one or more fingers, or a pattern comprising, for instance, a series of motions. Hence, the predetermined motions may be coupled or paired with actions, commands or tasks which are executed by the communication terminal. The wording controlling is in this context also to be construed as invoking or executing different operations of the mobile communications terminal.
The predetermined motions may be recognized to control opening and/or closing items of media content, accessing the previous or next item of media content in a list or stack of items, deleting an item of media content, scrolling through the content of an item of media content, answering an incoming voice call, taking an action on an item selected from a list of items, calling the sender of an SMS or ending the projection.
The incoming communication may comprise a message, such as an SMS or MMS. The media content or message may comprise text, image, video or any combination thereof. Although these messaging services are the most frequently used today, the invention is also intended for use with other types of text or multimedia messages.
The method may further comprise a step of moving the object away from the projector along a projected cone of light until a preferred size of the image is obtained. By virtually holding the information in the hand, the user feels in control of the presentation, only revealing data to him or herself. The nature of the gesture is intuitive, giving the user the impression and feeling of taking the image with the hand out of the communication terminal and, after having reviewed the information, putting it back into the terminal again.
The method may further comprise a step of moving the object back to the device and/or a step of detecting a second tap to end projection of said image. Hence, in an intuitive manner, the user will perform the same steps as when initiating the process, only in a reverse order.
The object referred to may be the hand of, for instance, a user of the communication terminal. Among the advantages of using a hand is the direct possibility of slightly folding the hand to shield off the image from the environment. Other objects that can be used comprise a newspaper, a pencil or even an umbrella.
The predetermined motions may be detected and recognized by using an image-acquiring means. An image-acquiring means could be, for instance, any type of digital camera, such as a CMOS camera.
The wording interpreting may also be interpreted as recognizing.
A natural interaction between humans and computing devices can be achieved by using hand gestures for command input and navigation of the user interface of the devices. In particular, with the availability of mobile camera devices and powerful image/video content analysis and pattern recognition technologies, realizing command input by hand gestures through camera input is a convenient solution, expected to be highly appreciated by end users.
In other words, with the invention disclosed herein, input technology is able to provide one direct solution for the conflict of device miniaturization and usability. The interaction is more natural. Input is not limited by the miniaturization of device hardware. Hence, the way of interaction presented with this invention provides an advantageous, hands free solution with numerous benefits, especially for hand held communication devices.
The above, as well as additional objects, features and advantages of the disclosed embodiments, will be better understood through the following illustrative and non-limiting detailed description of preferred embodiments, with reference to the appended drawing, wherein:
In the following description of the various embodiments, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration various embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of the disclosed embodiments.
In
In a second step 202 of the method, one or more objects are identified from the image data. Further details of how the object identification is performed are outlined below in connection with steps 207 and 208, for skin color segmentation and for connected component labeling and mergence, respectively.
In a third step 203 of the method, it is investigated whether or not any of the objects corresponds to a hand. For this, a number of hand gesture requirements must be fulfilled, the details of which are given below in connection with step 209 for noise area elimination.
In a fourth step 204 of the method, the orientation of the hand is determined. This is done in an orientation-based geometric approach using Karhunen-Loeve orientation, which will be described in further detail below in connection with step 210.
In a fifth step 205 of the method, the gesture of the hand is recognized and associated with one of a set of predetermined gestures. The procedure for this is described in further detail below in connection with steps 211 to 217.
In a sixth step 206 of the method, an input corresponding to the recognized gesture is provided. The various input alternatives are described in greater detail below in connection with steps 218 to 224.
Further to the step 202 of the method as depicted in
RGB colour space is one of the most widely used color spaces for processing and storing colour image data, but it is normally not well suited to colour analysis and colour based recognition, owing to the high correlation between channels and the mixing of chrominance and luminance data.
Hue-saturation based color spaces such as HSV, HSI and HSL are models which are consistent with human intuitive perception and similar to how an artist actually mixes colours. Hue in particular is invariant to white light sources, ambient light and surface orientation.
YCbCr is a hardware-oriented model. In this colour space, the luminance is separated from the chrominance data. The Cb and Cr values are formed by subtracting luma from the RGB blue and red components, respectively. The transformation simplicity and explicit separation of luminance and chrominance components make this color space attractive for skin colour modelling [Hsu et al. 2002].
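By way of illustration only, the separation of luminance from chrominance may be sketched as follows. The luma weights and the scale factors 0.564 and 0.713 are those of the common full-range ITU-R BT.601 convention, which is an assumption here since the description does not state which YCbCr variant is used; the function name rgb_to_ycbcr is likewise hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr.

    Assumption: full-range BT.601 coefficients; the description does not
    specify the exact variant.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma
    cb = 128.0 + 0.564 * (b - y)            # blue minus luma, scaled and offset
    cr = 128.0 + 0.713 * (r - y)            # red minus luma, scaled and offset
    return y, cb, cr


def chrominance(r, g, b):
    """Return only the (Cb, Cr) pair, discarding the luma component."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return cb, cr
```

Because only the (Cb, Cr) pair is kept, two grey pixels of different brightness map to the same chrominance point, which is the luminance invariance sought for skin color detection.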
In order to select either a Hue-based color space or YCbCr space to make skin color detection invariant to luminance, YCbCr and HSV are each evaluated with a set of skin color training data, which is composed of 550 skin color samples extracted from various still images and video frames, covering a large range of skin color appearance (in total more than 20 million skin color pixels in the skin sample data).
In
For modelling the skin color segmentation, a Gaussian mixture model and Expectation Maximization (EM) estimation are used.
Gaussian density functions and a mixture of Gaussians are often used to model skin color [Yang et al. 2002]. The parameters in a unimodal Gaussian distribution are often estimated using maximum-likelihood. The motivation for using a mixture of Gaussians is based on the observation that the colour histogram for the skin of people with different ethnic backgrounds does not form a unimodal distribution, but rather a multimodal distribution.
With a unimodal Gaussian, the class-conditional probability density function (PDF) of skin color is approximated by a parametric functional form [Yang, Waibel 1996]:
$$p(x \mid \mathrm{skin}) = g(x; m_s, C_s) = (2\pi)^{-d/2}\,|C_s|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - m_s)^T C_s^{-1} (x - m_s)\right\} \qquad (1)$$
where $d$ is the dimension of the feature vector, $m_s$ is the mean vector and $C_s$ is the covariance matrix of the skin class. In the case of a multimodal distribution, skin color distributions are approximated by a GMM (Gaussian Mixture Model).
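As a non-limiting sketch, the standard multivariate Gaussian density may be evaluated for a two-element (Cb, Cr) feature vector as shown below, and a mixture may be formed as a weighted sum of such densities. The function names and the representation of the covariance as a nested 2x2 list are assumptions made for illustration.

```python
import math


def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian with covariance [[a, b], [b, c]]."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - m)^T C^{-1} (x - m), using the closed-form 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    # For d = 2 the normalisation (2*pi)^(-d/2) * |C|^(-1/2) is 1 / (2*pi*sqrt(det)).
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def mixture_pdf(x, weights, means, covs):
    """Weighted sum of Gaussian components (weights are assumed to sum to 1)."""
    return sum(w * gaussian_pdf_2d(x, m, c)
               for w, m, c in zip(weights, means, covs))
```

A pixel could then be classified as skin when mixture_pdf exceeds a chosen threshold; the description does not give such a threshold, so any value used would have to be tuned.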
The parameters of a Gaussian mixture (i.e., weights ω, means m, covariances C) are typically found using the Expectation Maximization (EM) algorithm [Bilmes 1998].
The EM algorithm is a general method of finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. The mixture-density parameter estimation problem is one of the most widely-used applications of the EM algorithm [Xu, Jordan 1996].
In the invention, YCbCr color space and GMM are used to implement skin colour classification. In order to build a GMM model, the K-means algorithm [Duda, Hart 2001] is used to set the cluster centres, and then the parameters of each Gaussian component are estimated with the EM algorithm.
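The two-stage fitting described above may be sketched, under stated assumptions, as follows. The sketch handles two-element samples, starts every covariance as the identity matrix, adds a small regularisation term reg to the diagonal, and runs fixed iteration counts; none of these choices is taken from the description, and the names kmeans and fit_gmm are hypothetical.

```python
import math


def _gauss2(x, m, c):
    """2-D Gaussian density with covariance [[a, b], [b, d]]."""
    a, b, d = c[0][0], c[0][1], c[1][1]
    det = a * d - b * b
    dx, dy = x[0] - m[0], x[1] - m[1]
    quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def kmeans(points, centres, iters=10):
    """Plain Lloyd iterations; returns refined cluster centres."""
    for _ in range(iters):
        groups = [[] for _ in centres]
        for p in points:
            k = min(range(len(centres)),
                    key=lambda j: (p[0] - centres[j][0]) ** 2
                    + (p[1] - centres[j][1]) ** 2)
            groups[k].append(p)
        # An empty cluster keeps its previous centre.
        centres = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                   if g else c for g, c in zip(groups, centres)]
    return centres


def fit_gmm(points, init_centres, em_iters=50, reg=1e-6):
    """K-means sets the means, then EM refines weights, means and covariances."""
    k, n = len(init_centres), len(points)
    means = kmeans(points, list(init_centres))
    weights = [1.0 / k] * k
    covs = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(k)]  # assumed starting value
    for _ in range(em_iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for p in points:
            dens = [w * _gauss2(p, m, c) for w, m, c in zip(weights, means, covs)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted samples.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component received no samples; leave it unchanged
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            sxx = sum(r[j] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nj
            syy = sum(r[j] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nj
            sxy = sum(r[j] * (p[0] - mx) * (p[1] - my)
                      for r, p in zip(resp, points)) / nj
            weights[j] = nj / n
            means[j] = (mx, my)
            covs[j] = [[sxx + reg, sxy], [sxy, syy + reg]]
    return weights, means, covs
```

For a model with many components and millions of training pixels, a vectorised numerical library would be the more practical choice; the pure-Python form is used here only to keep each step of the procedure visible.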
In this case, the GMM model for skin color classification consists of 20 Gaussian components. Each component is a 2-element (Cb and Cr element) Gaussian distribution. The parameters of the 20 Gaussian components are listed as follows.
After skin color classification, post-processing by connected component extraction [Gonzalez, Woods 2002] is needed for noise area removal.
In a step 208 of “connected component labeling and mergence” neighboring regions or components which should belong to one object are merged, and the size of the region is computed. Based on the size information of the labeled objects, a step 209 of “noise area elimination” is performed to remove noise-like small regions and regions with regular shapes (man-made objects).
Hence, after segmentation, the original image is turned into a black/white image in which the white regions stand for objects, while the black regions stand for background. However, at this point, the size and shape of the white regions are not known. With connected component labeling, the size and shape of the object regions are computed, and according to some given prior criteria, neighbouring object regions belonging to the same object are merged. After the step of labeling and merging, the step of noise area removal is performed to remove small regions and regions with regular shape (man-made objects).
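A minimal sketch of connected component labeling followed by size-based noise removal is given below. It assumes 4-connectivity and a hypothetical min_size threshold; the merging of neighbouring regions and the removal of regularly shaped regions depend on criteria that are not detailed here and are therefore not covered.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected white regions of a binary mask (rows of 0/1).

    Returns (labels, sizes): labels holds 0 for background and 1..K for
    regions; sizes maps each label to its pixel count.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not labels[y][x]:
                next_label += 1
                labels[y][x] = next_label
                queue = deque([(y, x)])
                count = 0
                while queue:  # breadth-first flood fill of one region
                    cy, cx = queue.popleft()
                    count += 1
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                sizes[next_label] = count
    return labels, sizes


def remove_small_regions(mask, min_size):
    """Keep only regions with at least min_size pixels (assumed criterion)."""
    labels, sizes = label_components(mask)
    return [[1 if lab and sizes[lab] >= min_size else 0 for lab in row]
            for row in labels]
```

With the region sizes available, the largest remaining region is a natural candidate for the hand, although that selection rule is an assumption rather than something stated in the description.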
According to one embodiment, there should be a unique hand region in any input gesture image. After skin color based segmentation, not only the hand region but also other noisy regions may sometimes be segmented. Thus, step 203 in which an object is recognized as a hand involves a step of noise elimination 209. Hence, if there are any noisy regions extracted, they are removed according to the following rules:
As a part of the step of associating the object with a predetermined object 204, the orientation of the hand is determined in a step 210 for determining Karhunen-Loeve (KL) orientation. This orientation-based geometric approach for hand gesture recognition comprises determining the Karhunen-Loeve (KL) orientation, and determining the centroids of the hand region and of its convex hull.
KL Orientation
The KL orientation is derived as follows:
Assuming that each pixel coordinate in the skin colour pixel set $P_s$ of the input gesture image is $(x_{si}, y_{si})$, then $P_s = [p_{s1}\ p_{s2}\ \dots\ p_{sN}]$, where $p_{si} = (x_{si}, y_{si})^T$, $i = 1 \dots N$, are the coordinates of the skin colour pixels. The mean of $P_s$ is
The corresponding covariance matrix is defined as
The eigenvalues $E_s = [e_{s1}\ e_{s2}]$ and the corresponding eigenvectors $Ev_s = [ev_{s1}\ ev_{s2}]$ are easily calculated from the covariance matrix $C_s$. Hence, the eigenvector $ev_{s\,max}$, corresponding to the larger eigenvalue $e_{s\,max}$, determines the KL orientation in the image coordinate plane; refer to the dashed lines 407 to 412 in
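The derivation above amounts to a principal axis computation on the pixel coordinates, which may be sketched as follows. The covariance is normalised by N here, which is an assumption (the choice of normalisation does not affect the direction of the eigenvectors), and the function name kl_orientation is hypothetical.

```python
import math


def kl_orientation(pixels):
    """Return (mean, unit vector along the KL axis) for (x, y) pixel coordinates."""
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / n
    syy = sum((y - my) ** 2 for _, y in pixels) / n
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / n
    # Larger eigenvalue of a symmetric 2x2 matrix in closed form.
    lam_max = 0.5 * (sxx + syy) + math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    if abs(sxy) > 1e-12:
        vx, vy = lam_max - syy, sxy     # solves (C - lam_max * I) v = 0
    elif sxx >= syy:
        vx, vy = 1.0, 0.0               # already axis-aligned, horizontal
    else:
        vx, vy = 0.0, 1.0               # already axis-aligned, vertical
    norm = math.hypot(vx, vy)
    return (mx, my), (vx / norm, vy / norm)
```

The sign of the returned vector is arbitrary, since an axis and its opposite describe the same orientation.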
Centroids of Hand Region and its Convex Hull
With the segmented hand region, shown in section d) of
i= . . . N is ith skin color pixel in the hand region.
C2(x2, y2) is derived as:
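Based on Green's theorem,
$$\iint_S x\,ds = \frac{1}{2}\oint_L x^2\,dy, \qquad \iint_S ds = \oint_L x\,dy,$$ where L is the perimeter of the polygon, traversed in the positive (counterclockwise) direction.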
For a polygon as a sequence of line segments, this can be reduced exactly to a sum,
The shape used for the second centroid C2 is created by “shortcutting” the edges connecting the hand region. The effect is thus to smear the contour of the hand region such that the thumb coalesces with the body of the hand, and the “centre of gravity” of the image object is displaced.
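For illustration, the two centroids may be computed as sketched below: the first as the mean of the region pixels, the second as the area centroid of the convex hull polygon, using the sum to which the boundary integrals reduce for straight edges. Andrew's monotone chain is used to build the hull; this is a swapped-in, well-known technique, since the description does not say how the hull is obtained, and all function names are hypothetical.

```python
def region_centroid(pixels):
    """C1: mean position of the (x, y) pixels of the region."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices without collinear points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_centroid(vertices):
    """C2: area centroid of a simple polygon given as an ordered vertex list."""
    twice_area = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Dividing by 3 * (2 * area) equals the usual 1 / (6 * area) factor;
    # the sign of the area cancels, so vertex order direction does not matter.
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))
```

For a square body of pixels with two extra pixels protruding to the left, the hull centroid computed this way lies to the left of the region centroid, which is the displacement the gesture recognition relies on.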
Further to the fifth step 205 of the method as depicted in
If the KL orientation of a hand region, and the centroids of the region and its convex hull have been computed, then the orientation of the hand shape can be estimated by the positional relationship of the two centroids relative to the KL orientation of the hand region.
The input alternatives that are available according to this outlined embodiment of the present invention are UP, DOWN, RIGHT, LEFT, OPEN, CLOSE, and STOP. However, other input alternatives may be employed. It is furthermore also possible to have other predetermined gestures to which provided gestures can be matched. A user may for instance provide individual gestures to the group of predetermined gestures recognized by the system. This provides a learning system capable of being individualized according to each user's choice and preferences.
The principle of matching an input gesture with a reference gesture object can be described as follows: A reference gesture object is selected from a predetermined number of available reference objects by eliminating less likely alternatives, such that the last one remaining is selected. That is, for instance, knowing that there are six different alternatives to choose from, the one with best correspondence is selected.
Referring to
In order to optimize the use of a limited number of gestures, various input can be associated with a single gesture. Hence, according to the present example, the operations CLOSE and STOP can both be associated with a closed fist. Depending on the previous action, or operation, the closed fist gesture in step 217 results in different operations, for instance CLOSE, as in step 223, if the last input was STOP and the last gesture was an open hand. Otherwise, the resulting operation is STOP indicated by step 224. In case the area of the convex hull of the gesture is at least twice the area of the previous gesture, as indicated by step 215, and the previous operation was STOP, as indicated by step 216, then the present operation is OPEN indicated by step 222. In case the last operation had not been OPEN in the last example, the present operation had been NO operation at all as indicated in step 216.
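One literal reading of the state-dependent rules above may be sketched as follows. The gesture labels FIST and OPEN_HAND, the function name and the argument layout are hypothetical, and the branch that returns no operation reflects only one interpretation of the last sentence of the preceding paragraph.

```python
def resolve_operation(gesture, hull_area, prev_hull_area, prev_operation, prev_gesture):
    """Map a fist or open-hand gesture to an operation using the previous state.

    Returns "CLOSE", "STOP", "OPEN", or None for no operation.
    """
    if gesture == "FIST":
        # A closed fist means CLOSE only directly after STOP with an open hand.
        if prev_operation == "STOP" and prev_gesture == "OPEN_HAND":
            return "CLOSE"
        return "STOP"
    if gesture == "OPEN_HAND":
        # The hull must have at least doubled in area since the previous gesture.
        grown = prev_hull_area is not None and hull_area >= 2.0 * prev_hull_area
        if grown and prev_operation == "STOP":
            return "OPEN"
        return None
    raise ValueError("unsupported gesture: %r" % (gesture,))
```

Keeping the previous operation and gesture as explicit arguments makes the dependence on history visible and lets the caller decide how that history is stored.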
Put slightly differently, if the KL orientation of the hand region is nearly horizontal and the two centroids are separated from one another, the gesture means LEFT or RIGHT, while in the case of a nearly vertical KL orientation, the gesture means UP or DOWN. The positional relationship of the two centroids is then used to determine the gesture meaning. The difference between the two centroids is caused by the extended thumb: if the thumb extends to the left, the convex hull's centroid lies to the left of the hand region's centroid. For the gestures RIGHT, UP and DOWN, the positional relationship of the two centroids is analogous to that of LEFT. In short, the centroid of the convex hull will be in a different position from that of the hand region whenever the hand has a protruding thumb.
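The directional decision may be sketched as below. The sketch assumes image coordinates with x increasing to the right and y increasing downwards, treats the KL axis as nearly horizontal when its x component dominates, and uses a hypothetical min_sep distance below which the two centroids are regarded as essentially superpositioned; none of these thresholds is specified in the description.

```python
import math


def classify_direction(kl_axis, c1, c2, min_sep=1.0):
    """Return "LEFT", "RIGHT", "UP", "DOWN", or None if the centroids coincide.

    kl_axis: unit vector of the KL orientation.
    c1: centroid of the hand region; c2: centroid of its convex hull.
    """
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    if math.hypot(dx, dy) < min_sep:
        return None  # no protruding thumb: handled as open hand or fist elsewhere
    if abs(kl_axis[0]) >= abs(kl_axis[1]):
        # Nearly horizontal KL axis: hull centroid shifts towards the thumb.
        return "LEFT" if dx < 0 else "RIGHT"
    # Nearly vertical KL axis; smaller y is higher up in image coordinates.
    return "UP" if dy < 0 else "DOWN"
```

A result of None would then be passed on to the handling of the open hand and closed fist gestures.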
According to another embodiment of the present invention, the following specifications apply:
An item may comprise a document, a folder, a contact, a recipient, multimedia content, such as an image, audio or video sequence, a reminder, a multimedia message, or the like.
If the two centroids C1 and C2 421 and 422 of a hand region are almost overlapping, as depicted with an open hand 405 and essentially vertical KL axis 411 in section e), and a closed fist 406 and essentially horizontal KL axis 412 in section f) of
In other words, a user interface interaction is enabled through provision of certain, defined hand gestures. Hence, hand gestures can be used for command input, and entry of letters and digits as well. In one application, namely media gallery navigation, “Up” is used to move the focus up, “Down” to move the focus down, “Left” to move the focus left, “Right” to move the focus right, “Stop” means the focus movement is stopped, “Open” is used to open a focused picture, and “Close” is used to close an opened picture in the gallery. The hand gestures can also be used for controlling the movement of an object on a graphical user interface, e.g. the movement of the worm in the well known greedy worm game.
According to one implementation, the communication terminal is configured to register and interpret motions of an object, preferably with a built-in camera combined with software that registers and analyses motions/patterns in front of it. The terminal is then configured to respond to predetermined motions or patterns of a user's hand, for instance to select and execute actions such as opening and/or closing items of media content, accessing the previous or next item of media content in a list or stack of items, deleting an item of media content, scrolling through the content of an item of media content, answering an incoming voice call, taking an action on an item selected from a list of items, calling the sender of an SMS or taking actions in connection with an incoming communication, such as an SMS (Short Message Service) or MMS (Multimedia Messaging Service). In the two last mentioned cases, the motions or patterns mentioned previously may comprise a closed fist, which may be interpreted by the communication terminal as a command to delete the message. Tilting of the hand may be used to go to the next message in the folder or list of messages: tilting upward may indicate going forward in the list and tilting downward going back in the list. A number of actions can be associated with different patterns by rotating, tilting, circling or simply moving the hand back and forth or up and down. A pattern may also comprise a series or sequence of motions. The communication terminal may be configured to recognize a number of pre-set motions. However, it may also be possible for a user to configure individual motions, or adjust the motions to better match existing patterns.
Hence, using proximity detection, a gesture of approaching the terminal with an object may trigger the terminal to activate the projector to present information of the incoming communication. A proximity sensor detects when something comes into its proximity. Such a sensor, which gives a switched output on detecting something coming into proximity, is called a proximity switch.
Finally, the above described embodiments provide a convenient and intuitive way of providing input to a communication terminal. It is well suited for provision in connection with devices of reduced size. It is also particularly convenient in situations and environments where the hands of a person are exposed to fluids or other substances, such that physical contact with the terminal is directly undesirable.
Number | Name | Date | Kind |
---|---|---|---|
5533110 | Pinard et al. | Jul 1996 | A |
5751843 | Maggioni et al. | May 1998 | A |
5946485 | Weeren et al. | Aug 1999 | A |
6002808 | Freeman | Dec 1999 | A |
6771294 | Pulli et al. | Aug 2004 | B1 |
6996460 | Krahnstoever et al. | Feb 2006 | B1 |
7123783 | Gargesha et al. | Oct 2006 | B2 |
7200266 | Ozer et al. | Apr 2007 | B2 |
20020021278 | Hinckley et al. | Feb 2002 | A1 |
20020118880 | Liu et al. | Aug 2002 | A1 |
20020167488 | Hinckley et al. | Nov 2002 | A1 |
20040056907 | Sharma et al. | Mar 2004 | A1 |
20040120581 | Ozer et al. | Jun 2004 | A1 |
20040189720 | Wilson et al. | Sep 2004 | A1 |
20050210418 | Marvit et al. | Sep 2005 | A1 |
20060136846 | Im et al. | Jun 2006 | A1 |
20060209021 | Yoo et al. | Sep 2006 | A1 |
20090135162 | Van De Wijdeven et al. | May 2009 | A1 |
Number | Date | Country |
---|---|---|
2007073303 | Jul 2007 | KR |
Entry |
---|
Park Y G, Priority-Data: (Jan. 4, 2006) Derwent-Acc-No. 2008-C59022 Title: System providing short message transmission service to mobile terminal by using motion recognition concerned with storing user's various motions by camera and transmitting short message to mobile terminal if identical motion is detected. |
Park Y G, “System providing short message transmission service to mobile terminal by using motion recognition concerned with storing user's various motions by camera and transmitting short message to mobile terminal if identical motion is detected” Derwent-Acc-No. 2008-C59022, Pub-No. KR2007073303A Appl-No. 2006KR-001049, Jan. 4, 2006. |
“eyeSight and CogniVue Demonstrate Gesture Recognition and Face Detection on Smallest Form Factor Lowest Power Smart Camera Module” Las Vegas, Nevada—Jan. 9, 2011—http://www.cognivue.com/news/news—12—01—09.php#sthash.1GAOGGXQ.dpuf. |
Kratz, Sven, Ballagas, Rafael, “Gesture Recognition Using Motion Estimation on Mobile Phones” , 2007. https://hci.rwth-aachen.de/materials/publications/kratz2007a.pdf. |
Jeff A. Bilmes, “A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models”, International Computer Science Institute, Apr. 1998, pp. 1-13, Berkeley, CA. |
Rein-Lien Hsu, et al., “Face Detection in Color Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, No. 5, May 2002, pp. 696-706. |
Lei Xu, et al., “On Convergence Properties of the EM Algorithm for Gaussian Mixtures”, Neural Computation 8, pp. 129-151, 1996, Massachusetts Institute of Technology. |
Jie Yang, et al., “A Real-Time Face Tracker”, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1996, pp. 142-147. |
Ming-Hsuan Yang, et al., “Detecting Faces in Images: A Survey”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, No. 1, Jan. 2002, pp. 34-58. |
International Search Report, Application No. PCT/IB2007/002766, mailed Mar. 14, 2008. |
Chinese Office Action dated Aug. 11, 2010. |
Chinese Office Action dated Nov. 17, 2010. |
Chinese Office Action dated Jan. 11, 2011. |
Number | Date | Country | |
---|---|---|---|
20080244465 A1 | Oct 2008 | US |