This invention relates generally to the gesture detection field, and more specifically to a new and useful system and method for gesture detection through local product maps in the gesture detection field.
Local binary pattern (LBP) operators are often used for texture classification in computer vision. LBP works by labeling the pixels of an image: each pixel in a support region is thresholded against a nucleus pixel (e.g., center pixel). Where a support pixel is greater than the center pixel, it is assigned a one; otherwise it is assigned a zero. A histogram of the frequency of the assigned binary pixel values is then computed, and the histogram may also be normalized. The resulting histogram is then used as a feature in classification processes. LBP operators have become a common feature for computer vision classification applications; however, LBP features are sensitive to varying lighting conditions and can prove not to be robust in certain applications. Thus, there is a need in the gesture detection field to create a new and useful system and method for gesture detection through local product maps. This invention provides such a new and useful system and method.
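For reference, the basic LBP labeling described above can be sketched as follows. This is a hypothetical minimal sketch for a 3x3 support region; the function name and bit ordering are illustrative and not part of any claimed method.

```python
def lbp_code(patch):
    # patch: a 3x3 grid of pixel values; patch[1][1] is the center (nucleus) pixel
    center = patch[1][1]
    # The eight support pixels, taken clockwise from the top-left corner
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    # Threshold each neighbor against the center and pack into an 8-bit label
    return sum((1 if p > center else 0) << i for i, p in enumerate(neighbors))
```

A histogram of these codes over an image region would then serve as the LBP feature. Note how the thresholding discards the magnitude of each pixel difference, which is the information loss the LPM feature described below is designed to avoid.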
The following description of preferred embodiments of the invention is not intended to limit the invention to these preferred embodiments, but rather to enable any person skilled in the art to make and use this invention.
As shown in
The imaging unit of a preferred embodiment functions to acquire image data for classification. The imaging unit may directly capture image data, or the imaging unit may alternatively interface with a source of image data. An image capture unit may capture video or still images with a camera such as an RGB digital camera, a web camera, or a camera phone, but may alternatively be implemented by any suitable imaging unit such as a stereo camera, 3D scanner, or IR camera. In one variation, the imaging unit can be directly connected to and/or integrated with a display, user interface, or other user components. Alternatively, the imaging unit can be a discrete element within a larger system that is not connected to any particular device, display, user interface, or the like. Preferably, the imaging unit is connectable to a controllable device, which can include for example a display and/or audio channel. Alternatively, the controllable device can be any suitable electronic device or appliance subject to control through electrical signaling. The image data collected from the imaging unit is preferably directed to the LPM feature extraction engine, where the data will be processed and classified. The imaging unit may include a preprocessor module to modify the image data prior to communicating it to the feature extraction engine.
The local product map (LPM) feature extraction engine functions to compute a feature based on a product map and probability for the local image data. The LPM feature extraction engine preferably retrieves image data from the imaging unit. The LPM feature extraction engine is preferably configured for computing dimensionality of a local support region, calculating a normalizing term of the dimensionality term, applying weighted patterns to the dimensionality term, mapping to a probabilistic model, and modeling probability distribution as the LPM feature, as described in further detail below. The LPM feature output of the LPM feature extraction engine is preferably a histogram representing probability distribution for various texture patterns. The LPM feature extraction engine preferably preserves the gradient nature of pixel values and maps those into a probability space. The LPM feature extraction engine may additionally account for directionality of the patterns. The LPM features preferably reflect a probability of a region of the image data being of a particular pattern.
The classifier engine of a preferred embodiment functions to use LPM features of the image data to identify gestures and/or objects. The classifier engine preferably includes at least one machine learning or classification module that operates on the LPM features of the LPM feature extraction engine. The classifier engine may be configured to classify any suitable number of gestures or objects. Additional features may be extracted through other engines. The feature outputs of other feature extraction engines may be used in combination with the LPM features. The output of the classifier engine is preferably used in generating gesture input, gesture identification, object identification, and/or any suitable image detection application.
As shown in
The computed LPM feature preferably preserves the gradient nature of pixel values and maps those into a probability space, as opposed to LBP, where pixel information is lost in the transformation into binary constructs. LBP uses a binary assignment of pixels in a support region based on the pixel-wise comparison of pixels, sacrificing image information. The method of a preferred embodiment uses a LPM feature that preserves relative pixel values (i.e., gradient information) and uses probability-based modeling to extract pattern maps. LPM is thus less sensitive to lighting conditions of an image, less sensitive to pixel value variance, and an overall more robust feature for gesture and/or object detection. As shown in
The method is preferably used in gesture and/or object detection applications. For example, the method can preferably detect gestures involving finger movement and hand position without sacrificing operational efficiency or increasing system requirements. In one exemplary application, the method is used as a user interface to a computing unit such as a personal computer, a mobile phone, an entertainment system, or a home automation unit. The method may be used for computer input, attention monitoring, mood monitoring, in an advertisement unit, and/or in any suitable application.
Step S100, which includes obtaining image data, functions to collect data representing the physical presence and actions of a user or scene. The images are the source from which gesture input will be generated. Image data may be accessed from a stored image source, or alternatively, image data may be directly captured by an imaging unit. Depending upon ambient light and other lighting effects such as exposure or reflection, pre-processing of the images may optionally be performed. The camera is preferably capable of capturing light in the visible spectrum, such as an RGB camera, which may be found in web cameras, web cameras over the internet or local Wi-Fi/home/office networks, digital cameras, smart phones, tablet computers, and other computing devices capable of capturing video. Any suitable imaging system may alternatively be used. A single camera is preferably used, but a combination of two or more cameras may alternatively be used. The captured images may be multi-channel images or any suitable type of image. For example, one camera may capture images in the visible spectrum, while a second camera captures near infrared spectrum images. Captured images may have more than one channel of image data such as RGB color data, near infra-red channel data, a depth map, or any suitable image data representing the physical presence of objects used to make gestures. Depending upon historical data spread over current and prior sessions, different channels of a source image may be used at different times.
Additionally, the method may control a light source when capturing images. Illuminating a light source may include illuminating a multi-spectrum light such as a near infrared light or visible light source. One or more channels of the captured image may be dedicated to the spectrum of a light source. The captured data may be stored or alternatively used in real-time processing. Pre-processing may include transforming the image color space to alternative representations such as the Lab or Luv color spaces. Any other mappings that reduce the impact of exposure might also be performed. This mapping may also be performed on demand and cached for subsequent use depending upon the input needed by subsequent stages. Additionally or alternatively, preprocessing may include adjusting the exposure rate and/or frame rate depending upon exposure in the captured images or from reading sensors of an imaging unit.
Step S200, which includes extracting a LPM feature from the image data, functions to generate pattern predictions from localized regions of pixels. LPM is well suited to texture pattern detection as it can be applied over regional areas. The image data is preferably segmented into different regions of an image. Each region of an image is preferably subdivided into support regions as shown in
As shown in
Step S210, which includes computing a dimensionality component of a local support region, functions to calculate the relative pixel values of pixels in a region compared to a center pixel. The dimensionality component is preferably a vector (i.e., a one-dimensional matrix) with n terms, one for each of the non-nucleus pixels in the support region, where there are n+1 pixels in the support region, the nucleus pixel accounting for the plus-one pixel. As shown in
Step S220, which includes calculating a normalizing term of the dimensionality component, functions to generate a factor by which subsequent calculations may be normalized. In one preferred embodiment, the normalization term is the absolute average of the terms in the dimensionality vector component as shown in
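Steps S210 and S220 can be sketched as follows. This is a hedged illustration; the signed-difference form of the dimensionality vector and the epsilon guard for flat regions are assumptions made for the sketch, not requirements of the method.

```python
def dimensionality_vector(support, nucleus):
    # Signed difference of each support pixel from the nucleus (center) pixel,
    # preserving gradient information rather than thresholding to 0/1 as LBP does
    return [p - nucleus for p in support]

def normalizing_term(d, eps=1e-8):
    # Absolute average of the dimensionality terms; eps guards against
    # division by zero in perfectly flat regions (an assumption of this sketch)
    return max(sum(abs(x) for x in d) / len(d), eps)
```

For example, a support region with pixels [10, 4, 7, 6] around a nucleus of 6 yields the dimensionality vector [4, -2, 1, 0] and a normalizing term of 1.75.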
Step S230, which includes applying weighted patterns to the dimensionality component to create a gradient vector, functions to map at least one pattern to the support region. Step S230 is preferably applied in combination with Step S240 for each weighted pattern of a pattern set, and Steps S230 and S240 are further repeated for each support region used to calculate the LPM feature. A weighted pattern is preferably a construct that is tailored to detecting a particular pixel pattern (e.g., a line or two lines intersecting). A first weighted pattern is preferably composed as an n by 1 vector. In a preferred variation, the values of the weighted pattern vector exist in the range of negative one to positive one, but the weighted pattern vector may have any suitable range (e.g., negative infinity to positive infinity). The weighted patterns of the pattern set may be the full permutation of pattern possibilities. Alternatively, the weighted patterns may be a selected subset of pattern possibilities. As another alternative, the weighted patterns may be algorithmically learned, wherein a selected subset of pattern possibilities is determined according to machine learning or other processes. The subset of pattern possibilities is preferably the set of patterns that were historically more influential in matching patterns. For example, weighted patterns may be selected to identify a corner as shown in
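Step S230 might be sketched as below, under the assumption that the gradient vector is the elementwise product of the weighted pattern with the normalized dimensionality terms; this particular combination is illustrative.

```python
def gradient_vector(d, pattern, norm):
    # d: dimensionality vector of the support region
    # pattern: weighted pattern vector, weights assumed in [-1, 1]
    # norm: normalizing term computed in Step S220
    return [w * x / norm for w, x in zip(pattern, d)]
```

A positive gradient term indicates agreement between a pixel difference and the corresponding pattern weight; a negative term indicates disagreement.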
Step S240, which includes mapping the gradient characterization to a probabilistic model, functions to transform the gradient characterization component to a probability space. The gradient characterization is preferably a gradient vector. Each weighted pattern from the pattern set preferably corresponds to one gradient vector, and thus a probabilistic model (e.g., a probability value) is preferably calculated to correspond to each weighted pattern. The gradient characterization generated in Step S230 is preferably transformed into a probability value ranging from zero to one, but the gradient characterization may alternatively be transformed to any suitable range or mapping. A probability value is preferably generated by calculating the product of an activation function of each term of the gradient vector. As shown in
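With the sigmoid as the activation function, Step S240 could be sketched as follows: the probability of the support region matching a pattern is the product of the per-term activations, each of which lies in (0, 1).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pattern_probability(grad):
    # Product of activations over the gradient vector; since each factor
    # lies in (0, 1), the product is itself a value in (0, 1)
    prob = 1.0
    for g in grad:
        prob *= sigmoid(g)
    return prob
```

Because the sigmoid varies smoothly, a small perturbation of a pixel value changes the probability only slightly, rather than flipping a hard binary label as in LBP.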
The sigmoid computation may also be sped up in the case when the weights are +1 and −1 by observing that sigmoid(−x)=1−sigmoid(x). Integrating such a computational observation into the calculation of a sigmoid can avoid expensive exponentiation in the sigmoid function. Exponentiation may be further optimized by using a lookup table. The sigmoid activation function may alternatively be any suitable variation of a logistic function, arctangent, hyperbolic tangent, algebraic function, or any suitable function transformed such that the minimum output of the activation function is preferably a non-negative value. Other activation functions may include max(0, Fi) or log(1+exp(Fi)) or any suitable activation function. Alternatively, any suitable transform may be applied. Such a probabilistic mapping functions to make pixel values more resilient to slight value variations. Instead of drastically altering pattern matching, the probability of matching a pattern changes in relationship to the variation of pixels conforming to a pattern. Such resilience makes LPM features more robust in situations such as varied lighting conditions.
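The identity sigmoid(−x) = 1 − sigmoid(x) means both signed activations can share a single exponentiation, for example:

```python
import math

def signed_sigmoids(x):
    # One exponentiation yields sigmoid(x); its complement is sigmoid(-x)
    s = 1.0 / (1.0 + math.exp(-x))
    return s, 1.0 - s
```

When a gradient term appears with both a +1 and a −1 weight across the pattern set, this halves the number of exponentiations needed for that term.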
Step S250, which includes condensing probabilistic models into a probabilistic distribution feature, functions to convert the probabilities to a histogram-based feature. The probability distribution is preferably a histogram-based representation of the probabilistic models over a cell region as shown in
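Step S250 can be sketched as accumulating the per-pattern probabilities over every support region in a cell and normalizing; the particular normalization (dividing by the total mass) is an assumption of this sketch.

```python
def lpm_histogram(region_probabilities, num_patterns):
    # region_probabilities: one probability vector per support region in the
    # cell, each of length num_patterns (one entry per weighted pattern)
    hist = [0.0] * num_patterns
    for probs in region_probabilities:
        for k, p in enumerate(probs):
            hist[k] += p
    total = sum(hist) or 1.0  # guard against an all-zero cell
    return [h / total for h in hist]
```

The resulting histogram, one bin per weighted pattern, is the LPM feature passed to the classifier.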
In one preferred implementation of extracting a LPM feature from the image data, the LPM feature may be efficiently computed in the case when the weighted patterns consist of all possible combinations of 1 and −1 (i.e., 2^n different weight patterns for a length-n pattern). While a simplistic implementation may involve n*2^n multiplications, a preferred implementation reduces multiplications to 2^(n+1) by using a dynamic programming based approach that reuses multiplications across different patterns. The approach may be visualized by a binary tree of height n. At a node at level k, the value for the left child is generated by multiplying the value at the node by a sigmoid (or an appropriate activation function) of the k-th value in the gradient vector weighted by 1. Similarly, a weight of −1 is used to generate the value at the right node. Starting with a value of 1 at the root node and generating the tree to depth n, the products corresponding to the 2^n weight patterns are then generated at the leaves of the tree. The number of multiplications can be further reduced by skipping computation of a sub-tree rooted at a node where the value has fallen substantially near zero, since all subsequent products will be zero too. Further, the entire tree need not be stored in memory, as each level of the tree can be computed from the previous level. Hence, the approach may not require additional memory and the 2^n products can be computed in-place. The approach is preferably applicable whenever it is possible to share multiplications across weight patterns, beyond the specific case of +1/−1 weights discussed here.
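Under the stated assumptions (all ±1 weight patterns, sigmoid activation), the level-by-level tree computation might look like this, where sig[k] is the sigmoid of the k-th gradient term and sigmoid(−x) is obtained as 1 − sigmoid(x):

```python
def all_pattern_products(sig, prune_eps=1e-12):
    # sig: sigmoid of each gradient term under weight +1; (1 - s) covers weight -1
    level = [1.0]  # value at the root of the binary tree
    for s in sig:
        nxt = []
        for v in level:
            if v < prune_eps:
                # Prune: every product below this node stays near zero
                nxt.extend((0.0, 0.0))
            else:
                nxt.extend((v * s, v * (1.0 - s)))  # left (+1) and right (-1) children
        level = nxt  # only one level is kept in memory at a time
    return level  # 2^n products, one per weight pattern
```

Each level doubles in size and costs two multiplications per surviving node, giving roughly 2 + 4 + ... + 2^n = 2^(n+1) - 2 multiplications in total. Note also that the 2^n products always sum to one, since each level multiplies by factors s and (1 − s) that sum to one.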
Step S300, which includes applying a classifier to the LPM features of the image data, functions to compute feature similarities and patterns. The classifier preferably uses common machine learning and/or other classification techniques to classify the image data. The classifier may be a support vector machine (SVM), a k-NN classifier, a self-organizing map (SOM), a statistical test, and/or any suitable machine learning technique or classifier. A classifier may additionally be applied for each gesture and/or object that the method is configured to detect. The classifiers may use a variety of techniques depending on the intended gesture and/or object. A variety of additional features and classification of those features may additionally be performed, and the image data may have a plurality of classifications based on a variety of features, at least one of the features being LPM features. A combination of classifiers may be used to detect a plurality of gestures and/or objects.
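As one concrete, hypothetical instance of Step S300, a k-NN classifier over LPM histogram features could be sketched as below; the Euclidean distance and majority vote are illustrative choices, and any of the classifiers named above could stand in.

```python
def knn_classify(feature, training_set, k=3):
    # feature: an LPM histogram; training_set: list of (histogram, label) pairs
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # Take the k nearest labeled examples and return the majority label
    nearest = sorted(training_set, key=lambda item: dist(feature, item[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```

The labels here (e.g., "open", "fist") are placeholders for whatever gestures or objects the method is configured to detect.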
Step S400, which includes detecting an object according to the result of the applied classifier, functions to respond to the classifications of the image data. In one preferred embodiment, gesture detection is used as a user-interface input mechanism. Upon detecting a configured gesture, a corresponding action or actions are performed by the application and/or device. The gesture input application is preferably substantially similar to the system and method described in U.S. patent application Ser. No. 13/159,379, filed 13 Jun. 2011, which is hereby incorporated in its entirety by this reference. In an alternative preferred embodiment, the gesture detection is used in an informational application. Upon detecting an object or gesture, a data object is augmented (e.g., created, edited, deleted) to reflect the detected gesture. Such an application may be used for tracking or logging detection of a gesture. The detected gesture may alternatively be used in any suitable application.
The system and method of the preferred embodiment and variations thereof can be embodied and/or implemented at least in part as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions are preferably executed by computer-executable components preferably integrated with an imaging unit, an LPM feature extraction engine, and a classification engine. The computer-readable instructions can be stored on any suitable computer-readable media such as RAMs, ROMs, flash memory, EEPROMs, optical devices (CD or DVD), hard drives, floppy drives, or any suitable device. The computer-executable component is preferably a general or application-specific processor, but any suitable dedicated hardware or hardware/firmware combination device can alternatively or additionally execute the instructions.
Although omitted for conciseness, the preferred embodiments include every combination and permutation of the various components and steps of the system and method of the preferred embodiments.
As a person skilled in the art will recognize from the previous detailed description and from the figures and claims, modifications and changes can be made to the preferred embodiments of the invention without departing from the scope of this invention defined in the following claims.