Various of the disclosed embodiments relate to automated gesture recognition processing for user-device interactions.
Human-computer interaction (HCI) systems are becoming increasingly prevalent in our society. This increasing prevalence has precipitated an evolution in the nature of such interactions. Punch cards have been surpassed by keyboards, which were themselves complemented by mice, which are themselves now complemented by touch screen displays, etc. Today, various machine vision approaches may even facilitate visual, rather than mechanical, user feedback. For example, machine vision techniques may allow computers to interpret images from their environment so as to recognize user faces and gestures. These systems may rely upon grayscale or color images exclusively, depth data exclusively, or a combination of both. Examples of sensor systems that may be used by these systems include, e.g., the Microsoft Kinect™, Intel RealSense™, Apple PrimeSense™, Structure Sensor™, Velodyne HDL-32E LiDAR™, Orbbec Astra™, etc.
While users increasingly desire to interact with these systems, such interactions may be hampered by ineffective system recognition of user gestures. Failing to recognize a gesture's performance may cause the user to assume that the system is not configured to recognize such gestures. Perhaps more frustratingly, misinterpreting one gesture as another may cause the system to perform an undesirable operation. Systems unable to distinguish subtle differences in user gesture movement cannot accurately infer the user's intentions. In addition, an inability to overcome this initial interfacing difficulty restricts the user's access to any downstream functionality of the system. For example, poor identification and reporting of gestures prevents users from engaging with applications running on the system as intended by the application designer. Consequently, poor gesture recognition limits the system's viability as a platform for third party developers. It may be especially difficult to recognize "swipe" hand gestures given the wide variety of user sizes, habits, and orientations relative to the system.
Consequently, there exists a need for refined gesture recognition systems and methods which consistently identify user gestures, e.g., swipe gestures. Such consistency should be accomplished despite the many obstacles involved, including the disparate character of user movements, disparate user body types, variable recognition contexts, variable recognition hardware, etc.
Various of the embodiments introduced herein may be better understood by referring to the following Detailed Description in conjunction with the accompanying drawings, in which like reference numerals indicate identical or functionally similar elements:
The specific examples depicted in the drawings have been selected to facilitate understanding. Consequently, the disclosed embodiments should not be restricted to the specific details in the drawings or the corresponding disclosure. For example, the drawings may not be drawn to scale, the dimensions of some elements in the figures may have been adjusted to facilitate understanding, and the operations of the embodiments associated with the flow diagrams may encompass additional, alternative, or fewer operations than those depicted here. Thus, some components and/or operations may be separated into different blocks or combined into a single block in a manner other than as depicted. The embodiments are intended to cover all modifications, equivalents, and alternatives falling within the scope of the disclosed examples, rather than limit the embodiments to the particular examples described or depicted.
Various of the disclosed embodiments may be used in conjunction with a mounted or fixed depth camera system to detect, e.g., user gestures.
A depth sensor 115a may be mounted upon or connected to or near the kiosk 125 so that the depth sensor's 115a field of depth capture 120a (also referred to as a "field of view" herein) encompasses gestures 110 made by the user 105. Thus, when the user points at, e.g., an icon on the display 125a by making a gesture within the field of depth data capture 120a, the depth sensor 115a may provide the depth values to a processing system, which may infer the selected icon or operation to be performed. The processing system may be configured to perform various of the operations disclosed herein and may be specifically configured, or designed, for interfacing with a depth sensor (indeed, it may be embedded in the depth sensor, or vice versa, in some embodiments). Accordingly, the processing system may include hardware, firmware, software, or a combination of these components. The processing system may be located within the depth sensor 115a, within the kiosk 125, at a remote location, etc., or distributed across locations. In some embodiments, the applications running on the kiosk 125 may simply receive an indication of the selected icon and may not be specifically designed to consider whether the selection was made via physical touch vs. depth-based determinations of the selection. Thus, the depth sensor 115a and the processing system may be an independent product or device from the kiosk 125 in some embodiments.
In situation 100b, a user 105 is standing in a home environment which may include one or more depth sensors 115b, 115c, and 115d each with their own corresponding fields of depth capture 120b, 120c, and 120d respectively. Depth sensor 115b may be located on or near a television or other display 130. The depth sensor 115b may be used to capture gesture input from the user 105 and forward the depth data to an application running on or in conjunction with the display 130. For example, a gaming system, computer conferencing system, etc. may be run using display 130 and may be responsive to the user's 105 gesture inputs. In contrast, the depth sensor 115c may passively observe the user 105 as part of a separate gesture or behavior detection application. For example, a home automation system may respond to gestures made by the user 105 alone or in conjunction with various voice commands. In some embodiments, the depth sensors 115b and 115c may share their depth data with a single application to facilitate observation of the user 105 from multiple perspectives. Obstacles and non-user dynamic and static objects, e.g. couch 135, may be present in the environment and may or may not be included in the fields of depth capture 120b-d.
Note that while the depth sensor may be placed at a location visible to the user 105 (e.g., attached on top or mounted upon the side of televisions, kiosks, etc. as depicted, e.g., with sensors 115a-c) some depth sensors may be integrated within another object. Such an integrated sensor may be able to collect depth data without being readily visible to user 105. For example, depth sensor 115d may be integrated into television 130 behind a one-way mirror and used in lieu of sensor 115b to collect data. The one-way mirror may allow depth sensor 115d to collect data without the user 105 realizing that the data is being collected. This may allow the user to be less self-conscious in their movements and to behave more naturally during the interaction.
While the depth sensors 115a-d may be positioned parallel to a wall, or with depth fields at a direction orthogonal to a normal vector from the floor, this may not always be the case. Indeed, the depth sensors 115a-d may be positioned at a wide variety of angles, some of which place the fields of depth data capture 120a-d at angles oblique to the floor and/or wall. For example, depth sensor 115c may be positioned near the ceiling and be directed to look down at the user 105 on the floor.
This relation between the depth sensor and the floor may be extreme and dynamic in some situations. For example, in situation 100c a depth sensor 115e is located upon the back of a van 140. The van may be parked before an inclined platform 150 to facilitate loading and unloading. The depth sensor 115e may be used to infer user gestures to direct the operation of the van (e.g., move forward, backward) or to perform other operations (e.g., initiate a phone call). Because the van 140 regularly enters new environments, new obstacles and objects 145a,b may regularly enter the depth sensor's 115e field of depth capture 120e. Additionally, the inclined platform 150 and irregularly elevated terrain may often place the depth sensor 115e, and corresponding field of depth capture 120e, at oblique angles relative to the “floor” on which the user 105 stands. Such variation can complicate assumptions made regarding the depth data in a static and/or controlled environment (e.g., assumptions made regarding the location of the floor).
Various embodiments may include a housing frame for one or more of the depth sensors (e.g., as described in U.S. patent application Ser. No. 15/478,209). The housing frame may be specifically designed to anticipate the inputs and behaviors of the users. In some embodiments, the display system may be integrated with the housing frame to form modular units.
Each of housing frames 220a-c may contain one or more depth sensors as described elsewhere herein. The computer system 205 may have transforms available to relate depth data acquired at each sensor to a global system of coordinates relative to display 235. These transforms may be achieved using a calibration process, or may, e.g., be preset with a factory default. Though shown here as separate frames, in some embodiments the frames 220a-c may be a single frame. The frames 220a-c may be affixed to the display 235, to a nearby wall, to a separate mounting platform, etc.
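By way of illustration, a minimal sketch of such a sensor-to-global transform step might take the following form in Python; the calibration matrix and names used here are hypothetical and not drawn from the figures.

```python
import numpy as np

def to_global(points, transform):
    """Map (N, 3) sensor-space depth points into the display's global frame.

    points:    (N, 3) array of (x, y, z) values from one depth sensor.
    transform: (4, 4) homogeneous matrix for that sensor, obtained via a
               calibration process or preset as a factory default.
    """
    points = np.asarray(points, dtype=float)
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (homogeneous @ transform.T)[:, :3]

# Hypothetical calibration for the sensor in frame 220a: shifted 1.2 m to one
# side of the display center, with no rotation.
T_220a = np.eye(4)
T_220a[0, 3] = -1.2
global_points = to_global([[0.1, 0.5, 2.0]], T_220a)
```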
While some embodiments specifically contemplate providing a display system connected with the housing frames, one will readily appreciate that systems may be constructed in alternative fashions to achieve substantially the same function. For example,
While
One will appreciate that the example dimensions provided above are merely used in connection with this specific example to help the user appreciate a specific embodiment. Accordingly, the dimensions may readily be varied to achieve substantially the same purpose.
Depth capture sensors may take a variety of forms, including RGB sensors using parallax to infer depth, range-based lidar, infrared pattern emission and detection, etc. Many of these systems may capture individual “frames” of depth data over time (i.e., the depth values acquired in the field of view at a given instant or over a finite period of time). Each “frame” may comprise a collection of three-dimensional values for depths measured in the field of view (though one will readily recognize multiple ways to represent, e.g., a time of flight analysis for depth determination). These three-dimensional values may be represented, e.g., as points in three-dimensional space, as distances for rays emitted at various angles from the depth sensor, etc.
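For example, a depth frame might be represented along the following lines. This is an illustrative sketch only; the field names and the range-to-point conversion are assumptions rather than a prescribed format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DepthFrame:
    """One depth capture; field names are illustrative only."""
    timestamp: float      # time at which the frame was acquired, in seconds
    points: np.ndarray    # (N, 3) array of (x, y, z) values in meters

    @classmethod
    def from_range_image(cls, timestamp, ranges, directions):
        """Convert per-ray range measurements into three-dimensional points.

        ranges:     (N,) distances reported along each emitted ray.
        directions: (N, 3) unit vectors giving the corresponding ray angles.
        """
        ranges = np.asarray(ranges, dtype=float)
        directions = np.asarray(directions, dtype=float)
        return cls(timestamp, ranges[:, None] * directions)
```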
To facilitate understanding, the side view 500b also includes a depiction of a depth sensor's field of view 535 at the time of the frame capture. The depth sensor's angle 530 at the origin is such that the user's upper torso, but not the user's legs have been captured in the frame. Again, this example is merely provided to accommodate the reader's understanding, and the reader will appreciate that some embodiments may capture the entire field of view without omitting any portion of the user. For example, the embodiments depicted in
Similarly, though
Many applications would like to infer the user's gestures from the depth data 505. Accomplishing this from the raw depth data may be quite challenging and so some embodiments may apply preprocessing procedures to isolate the depth values of interest. For example,
Perspective view 605c and side view 610c introduce a wall plane 620, which may also be assumed or estimated by the processing system. The floor and wall plane may be used as “clipping planes” to exclude depth data from subsequent processing. For example, based upon the assumed context in which the depth sensor is used, a processing system may place the wall plane 620 halfway to the maximum range of the depth sensor's field of view. Depth data values behind this plane may be excluded from subsequent processing. For example, the portion 520a of the background depth data may be excluded, but the portion 520b may be retained as shown in perspective view 605c and side view 610c.
Ideally, the portion 520b of the background would also be excluded from subsequent processing, since it does not encompass data related to the user. Some embodiments further exclude depth data by “raising” the floor plane 615 based upon context to a position 615a as shown in perspective view 605d and side view 610d. This may result in the exclusion of the portion 520b from future processing. These clipping operations may also remove portions of the user data 510d which will not contain gestures (e.g., the lower torso). Thus, in this example, only the portion 510c remains for further processing.
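For illustration, a clipping step of this kind might be sketched as follows; the axis conventions and numeric values are assumptions chosen only for the example.

```python
import numpy as np

def clip_frame(points, wall_z=2.5, floor_y=0.0, floor_raise=0.8):
    """Discard depth values behind the wall plane or below the raised floor.

    points:      (N, 3) array in a frame where +y is up and +z points away
                 from the sensor (axis conventions assumed for illustration).
    wall_z:      wall clipping plane, e.g., half the sensor's maximum range.
    floor_raise: amount by which the floor plane is raised to drop background
                 and lower-torso values unlikely to contain gestures.
    """
    points = np.asarray(points, dtype=float)
    keep = (points[:, 2] <= wall_z) & (points[:, 1] >= floor_y + floor_raise)
    return points[keep]
```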
As mentioned previously, the reader will appreciate that this example is provided merely to facilitate understanding and that in some embodiments clipping may be omitted entirely, or may occur only very close to the floor so that leg and even foot data are both still captured. One will recognize that
Following the isolation of the depth values (which may not occur in some embodiments), which may contain gesture data of interest, the processing system may classify the depth values into various user portions. These portions, or “classes”, may reflect particular parts of the user's body and can be used to infer gestures.
In contrast, the lower arm and hand may be very relevant to gesture determination and more granular classifications may be used. For example, a "right lower arm" class 740, a "right wrist" class 745, a "right hand" class 755, a "right thumb" class 750, and a "right fingers" class 760 may be used. Though not shown, complementary classes for the left lower arm may also be used. With these granular classifications, the system may be able to infer, e.g., a direction the user is pointing, by comparing the relative orientation of the classified depth points.
One will appreciate that "gestures" may be static (e.g., pointing a finger) or dynamic (e.g., swiping an arm). Consequently, some gestures may be recognized in a single frame, while others may require a collection of frames to be recognized. Classification may facilitate recognition by revealing the temporal and spatial relations among classes of pixels over time.
One will appreciate that between receipt of the initial depth values 800a and creation of the classified result 800b, minimal or no clipping may have been performed, e.g., as described above with respect to
During Classification 1015, the system may associate groups of depth values with one class (or in some embodiments, multiple classes) at block 1040. For example, the system may determine a classification using classes as discussed with respect to
During the Gesture Identification 1020 operations the system may perform gesture recognition at block 1050, using methods described below. During Publication 1025, at block 1055 the system may determine if new gesture data is available. For example, a new swipe gesture from right to left may have been detected at block 1050. If such a new gesture is present, the system may make the gesture data available to various applications, e.g., a kiosk operating system, a game console operating system, etc., at block 1060. At block 1065, the operations may be performed again for additional frames received.
Again, one will recognize that the process may be used to infer gestures across frames by comparing, e.g., the displacement of classes between frames (as, e.g., when the user moves their hand from left to right). Similarly, one will appreciate that not all the steps of this example pipeline (e.g., preprocessing 1010 or classification 1015) need be performed in every embodiment.
Ideally, interactions with the human-computer interface should be intuitive for the user. Accordingly, the system may readily recognize gestures and actions, such as: swipes (substantially linear hand motion in a direction, e.g., up, down, left, or right); pointing at the screen; rotation of the user's body; etc. In many instances, the most complicated gesture for the system to identify is the swipe gesture, particularly when determining whether the user intended to swipe in one of four general directions (e.g., UP, DOWN, LEFT, and RIGHT). Naively tracing the hand-classified depth values over a succession of frames may encounter a variety of problems. For example, users may need to move their hand into position before a swipe and to move their hand back to a rest position after a swipe. These movements do not convey the user's intent and may travel in directions other than the direction associated with the swipe gesture itself.
In addition, users often do not swipe in an exactly straight line, but instead move their hand in a curve. When combined with the motion to move their hands into position and back to rest, the complete hand motion may form a large arc, very little of which may be in the same direction the user intends the system to recognize. In addition to this natural deviation, the swipe gesture is often performed differently by different individuals. Some users may stretch out their arms during their gesture while others may keep their arms closer to their bodies. Similarly, some users swipe faster than others and some users perform all their gestures with the same hand, while other users may switch hands. These variations may cause the hand motion to vary dramatically from person to person. Indeed, even the same person may swipe differently over the course of a session.
To be able to interpret ambiguous gestures, some embodiments consider implementing various heuristics as part of the recognition process. Example heuristics include: consideration of a gesture zone; in-zone transitions; gesture angle boundaries; and dynamic versus static hand identification. These heuristics may be used, e.g., to hand-craft recognition solutions for different contexts and as the basis for features in machine learning. For example, training data may be used to learn these heuristics with a Support Vector Machine (SVM), neural network, or other learning system. As compared to other methods, the heuristics may reduce the amount of training data needed for accurate recognition.
A “gesture zone” is a region of space before the user, which may be used by the system to inform depth value assessment. For example, the position of the user's hand relative to the gesture zone dimensions may be used to indicate a phase of the gesture (e.g., the “Action” phase described below).
Computer system 1250 may anticipate the use of a “gesture box” 1270a before the user 1240. The gesture box 1270a may be raised above the floor (thus corresponding to a projection 1270b upon the floor). The gesture box 1270a may not be a physical object, visible to the user, but may instead reflect a region between the user and the display anticipated by computer processing system 1250 to facilitate gesture recognition as described herein. For example, motions by the user 1240, which place the user's hand 1245 within the box 1270a may receive different scrutiny by system 1250 as compared to motions outside the box. Use of gesture box 1270a may not only isolate gesture-related motions for processing, but may also provide a metric for identifying a given gesture.
Note that while the region associated with “gesture box” 1270a is shown as an actual box in
The box may have a front face 1305b, a left face 1305a, and a top face 1305c. Not visible in the diagram are a right face 1305d, bottom face 1305f and back face 1305e. The naming convention used here (“front,” “back,” etc.) is arbitrary and chosen to facilitate understanding. As will be discussed with reference to
The gesture box 1305 may be a distance 1330c before the user 1240. The gesture box 1305 may be a distance 1330q from the display 1235. The gesture box 1305 may be a distance 1330b from the floor. In some embodiments, placement of the box may depend upon the classification of depth values associated, e.g., with the user's torso. For example, the center position of the box may be placed at an offset towards the display from the center position of the depth values classified as “torso” (though, again, some embodiments may instead center the box at either the user's left or right shoulder-classified values as shown in
In some embodiments, the gesture zone is centered at the user's left 1515b or the right 1515a shoulder joint or point, for identifying gestures by the left or right hand respectively. This may be the user's physical joint location in some embodiments, but may instead be an approximation to the position of the joint from the depth values in some embodiments. The gesture zone's depth 1330d may be 20 cm from the user's torso when the gesture box abuts the torso (i.e., where the distance 1330c is zero and not some positive value as shown in the figure to facilitate understanding). Note that hand positions closer to the user's torso may be less likely to be associated with the user's intentions, as discussed below, and distance 1330c may be increased accordingly. There may be no bound on how far the hand can be from the user's torso in some embodiments (e.g., distance 1330q may be zero). In some embodiments, the width 1330f of the gesture zone may be 120 cm, centered at the left 1515b (or right 1515a) shoulder point. Accordingly, there may be 60 cm on either side of the point.
In some embodiments, the gesture zone's lower boundary may be 35 cm below the shoulder point (e.g., the distance 1330z) but with no bound on how high the hand can be (i.e. the height 1330e may be unconstrained, rather than finite, as shown). In some embodiments, the size and position of the gesture zone may adapt to the physical dimensions of the user. For example, a taller person may have a larger gesture zone compared to a shorter person. Since the taller person would have a shoulder joint higher above the ground, the taller person would also have a gesture zone positioned higher above the ground.
In some embodiments, rather than vary the gesture zone size using the height of the person, the system may measure the length of the person's arm following classification. This length may provide a more precise method for determining the size of the gesture zone. For gestures that involve two hands, the gesture zone may be a union of the two individual gesture zones, for each of the left and right hands, as described above.
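By way of example, a containment test for such a single-hand zone might be sketched as follows; the default dimensions follow the example values above, while the coordinate conventions are assumptions made for the sketch.

```python
def in_gesture_zone(hand, shoulder, torso_z, width=1.2, below=0.35, d_min=0.2):
    """Return True when a hand position lies inside a single-hand gesture zone.

    Positions are (x, y, z) in meters with +z running from the user's torso
    toward the display (an assumed convention). The zone is centered laterally
    on the shoulder point, extends `below` meters beneath it with no upper
    bound, and begins `d_min` meters in front of the torso with no far bound,
    mirroring the example dimensions above.
    """
    lateral_ok = abs(hand[0] - shoulder[0]) <= width / 2.0
    vertical_ok = hand[1] >= shoulder[1] - below
    depth_ok = (hand[2] - torso_z) >= d_min
    return lateral_ok and vertical_ok and depth_ok
```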
In some embodiments, the position of the gesture box may depend upon the user's orientation, while in some embodiments the box may remain fixed regardless of the user's orientation. For example, in views 1315a and 1315b the user 1240 has rotated 1325 their torso (rotating the shoulder classified depth values accordingly) to the left. However, in this example embodiment, the position and orientation of the box remains fixed. In contrast, in the embodiment illustrated with views 1320a and 1320b the box has “tracked” the user's torso movement to remain in a position roughly parallel with the lateral dimension of the user's torso. Rotation of the user's torso in this manner may precipitate a new angle 1350 between the centerline of the user's torso and the shortest distance to the display 1235. When the box does not track the user's movement, as in view 1315b, in some embodiments, the system may adjust the recognition process (e.g., via a transform) to recognize gestures in the user's new orientation. Similarly, even when the box does track the user's torso rotation, the system may appreciate that a movement relative to the user's centerline is no longer relative to the centerline of the display in the same manner.
In some embodiments, the computer system may adjust the box's position based upon torso movements exclusively (e.g., as here, where the user's feet remain stationary at their original position). Similarly, in some embodiments, the box may follow the user's torso when the user crouches, jumps, or otherwise changes elevation. In some embodiments, the user's head, alone or in conjunction with the user's torso, may instead be used to position the box.
As mentioned, while the gesture zone may be centered about the center of the user's torso in some embodiments, in this example, the zone is centered around the centroid of the shoulder-classified values (or the shoulder point at a boundary of values determined as discussed herein, etc.). Accordingly, the zone's center may track an arrow 1355 extending from the user at this centroid in the embodiments illustrated with views 1320a and 1320b (similarly, for a torso-centroid-oriented zone, the arrow would extend from the center of the user's torso and be similarly tracked by the zone). The origin of arrow 1355 may serve as the origin for a corresponding coordinate system (e.g., positions along the arrow away from the user reflecting increasingly positive coordinate positions in the Z-direction). Thus, the coordinate system may translate and not rotate as shown in the example of 1315b or may both translate and rotate as shown in the example of 1320b. Though the user's right shoulder is used in this example, one will appreciate that the zone may be located at the left shoulder for a left-hand gesture.
To facilitate understanding,
Prior to performing a gesture, a user may begin in an idle state 1405a, where the user may not be performing any actions. This state may occur, e.g., when the user's arms are resting by their side in a rest position as shown in the various views at time 1425a. As time progresses 1410 during the gesture's performance, the user may enter a prologue state 1405b at time 1425b. The prologue may reflect a preliminary motion by the user in preparation for a gesture, e.g., the user moving their hand into position for a horizontal “swipe” gesture motion as shown in the various views at time 1425b. In some embodiments, the prologue may include a motion leading to entry into the gesture box.
In the subsequent action state 1405c, the user may perform the actual gestural motions intended to convey the user's intent to the system. Here, for example, the user is “swiping” their hand from left to right at time 1425c. The epilogue 1405d state then includes any user motions after the user has provided sufficient information for the system to identify the intended gesture. Thus, the epilogue may reflect the remainder of the gesture not associated with the user's intention, such as motion of the user's hand back to the rest position as shown at time 1425d. Once the user's hand exits the gesture zone, the system may return 1415 the gesture state to idle 1405a. The process may then begin again with the same or different gesture.
As discussed in greater detail herein, in some embodiments the user may transition from one gesture into a new gesture. For example, the user's hand may remain in the gesture zone between the end of one gesture and the start of the next gesture. This scenario is discussed in greater detail herein with respect to
Identifying the beginning and end of the action phase may be important to properly recognizing a gesture. Unfortunately, the system may not know the exact points when the action phase begins and ends. Instead, the system may try to identify the start of the prologue and then repeatedly attempt to identify motions associated with intent, i.e., those in the action or prologue phases.
In some gestures, the user may transition from one gesture to another without bringing their hand to a complete standstill. By considering the distance of the user's hand from the user's torso, motions associated with the user's intentions may be distinguished from other motions unrelated to those intentions. For example,
Accordingly, in some embodiments, the system may consider hand movements further away from the user's body as conveying intent, while the system may infer that hand movements closer to the user's body are due to repositioning, e.g., in the prologue or epilogue phases. Stated differently, in this example, the system may construe the hand movements 1420a and 1420c as the action phases of two individual gestures separated by a repositioning hand movement 1420b. This may be accomplished in this example, at least in part, when the system finds that the hand in movement 1420b crosses the vertical boundary 1510b (described in greater detail below) at a shorter distance from the user's torso than the movements 1420a and 1420c. The system may then infer that the first half of 1420b is an epilogue after the action-related movement 1420a, and the second half of 1420b is a prologue prior to the action-related movement 1420c.
Some embodiments distinguish movements associated with gestures by determining whether depth values (or their “pixel” projection on a two-dimensional surface) of certain classes crossed various boundary planes during the user motion. This heuristic may be especially useful for distinguishing prologue, action and epilogue phases of a gesture. For example, when detecting swipe gestures, the system may seek to determine a swipe angle in the range of [−π, +π] representing the direction of the swipe. In natural gestures, this angle can vary widely between users and even with the same user. The system may be biased to infer that horizontal swipes are more likely to cross vertical plane boundaries centered at the user's shoulders, while vertical swipes are more likely to cross a horizontal plane boundary at the user's shoulder.
In some embodiments, the left and right shoulder point positions may simply be determined as being 18 cm above the torso centroid position and 10 cm to the left or right of that position. These offsets may be based on a person that is 165 cm tall and adapted to individuals of different heights. In still other embodiments, a left or right shoulder classifier may be used, to identify depth data corresponding to the shoulder point.
Thus, the point 1515a may be taken as the centroid of shoulder-classified values, as the centroid of "right upper arm" and "torso" classified values along their boundary, etc. A centroid for the torso-classified values may be determined as the point 1560. As an example set of dimensions, the horizontal distance 1565a between the torso centroid point 1560 and a shoulder point, e.g., point 1515a, may be 10 cm. A vertical distance 1565b between the torso centroid point 1560 and a shoulder point, e.g., point 1515a, may be 18 cm.
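For example, an illustrative sketch of such a shoulder-point estimate might be the following; the linear height scaling and the sign convention for left versus right are assumptions made for the sketch.

```python
import numpy as np

def shoulder_points(torso_centroid, user_height_m=1.65,
                    dx=0.10, dy=0.18, reference_height_m=1.65):
    """Estimate left and right shoulder points from the torso centroid.

    The 10 cm lateral and 18 cm vertical offsets correspond to a 165 cm tall
    person; linear scaling to other heights is an assumption, as is the sign
    convention distinguishing left from right.
    """
    torso_centroid = np.asarray(torso_centroid, dtype=float)
    scale = user_height_m / reference_height_m
    left = torso_centroid + np.array([dx * scale, dy * scale, 0.0])
    right = torso_centroid + np.array([-dx * scale, dy * scale, 0.0])
    return left, right
```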
For detecting left handed swipe gestures, a vertical left shoulder plane 1510b passing through the user's 1505 left shoulder 1515b may be considered. A horizontal plane 1510c may then pass through each of the user's shoulder points and be parallel with the floor. If the user swipes with the left hand, then the vertical plane 1510b is used and whether the hand crosses boundary 1510a may be irrelevant. Conversely, if the user swipes with the right hand, then the vertical plane 1510a is used and whether the hand crosses 1510b may be irrelevant. In some embodiments, when the user swipes with both hands simultaneously, only the left hand may continue to be assessed with reference to boundary 1510b and the right hand may only be assessed with respect to boundary 1510a. However, in embodiments permitting diagonal swipes, the event of crossing both boundaries 1510a and 1510c for the user's right hand or both boundaries 1510b and 1510c for the user's left hand may precipitate angle adjustments so as to favor diagonal swipe directions instead of non-diagonal swipe directions.
Each region may be associated with an angle between the region's boundaries with neighboring regions. For example, where the regions are divided by boundaries 1525a and 1525b, the LEFT region may be associated with the angle 1530a, the RIGHT region may be associated with the angle 1530b (right and left being here taken from the depth sensor's field of view), the TOP region may be associated with the angle 1530c, and the DOWN region may be associated with the angle 1530d. In some embodiments, each of angles 1530a-d may be set to an initial, default value of π/2 radians. In some embodiments, the center 1550 of the regions of division 1520 may be placed over the position at which a user's hand begins the action phase of the gesture. For example, if the user 1505 moves their left hand 1505a a distance from a start position to an end position represented by vector 1555a, the regions of division 1520 may be considered with the vector 1555a (representing a change in position or change in velocity as described herein) at its center 1550. One will appreciate that the vector may be considered in its original 3-dimensional form (and the boundaries considered to be planes) or in corresponding 2-dimensional projections (and the boundaries considered as lines). The two-dimensional version of the vector may be found by projecting the vector upon a plane parallel with the front face of the user (e.g., parallel with plane 1305b of the gesture box 1305), a plane parallel with the display, etc.
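To illustrate, a sketch of classifying a swipe vector against such regions of division might read as follows. With the default widths of π/2 the boundaries lie on the diagonals; the axis conventions are assumed for the example.

```python
import math

def classify_swipe(vx, vy, horizontal_width=math.pi / 2):
    """Classify a two-dimensional swipe vector into LEFT, RIGHT, UP, or DOWN.

    horizontal_width is the angular width of each of the LEFT and RIGHT
    regions; whatever remains is split between UP and DOWN. Axis conventions
    (+x to the right, +y up in the projection plane) are assumed here.
    """
    angle = math.atan2(vy, vx)                 # in [-pi, +pi]
    if abs(angle) <= horizontal_width / 2:
        return "RIGHT"
    if abs(angle) >= math.pi - horizontal_width / 2:
        return "LEFT"
    return "UP" if angle > 0 else "DOWN"
```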
As shown in
In some embodiments, the system may multiply this angle by a confidence measure that the swipe is horizontal. For example, the measure may be based upon the distance the hand traveled before and after crossing the vertical boundary 1510b. If the hand traveled from 10 cm on one side of the boundary 1510b to 10 cm on the other side of the boundary, that may produce a confidence measure of 1. This measure value may result in the angle increasing by π/8 radians at each end as previously described.
If instead the hand traveled from 40 cm on one side of the boundary to 40 cm on the other side of the boundary, this longer distance may indicate a more deliberate gesture and increase the confidence measure to 2. Correspondingly, the angle may be increased by 2*π/4. Such an increase may make it very likely the system will classify the gesture as either left or right swipe, rather than an up or down swipe.
Thus, motions of the hand-classified depth pixels at an angle that might previously, e.g., be classified as an “UP” swipe, would now be classified as a “LEFT” or “RIGHT” swipe. For example, if the path of the user's left hand 1505a over several frames during the action phase corresponds to the arrow 1555b, consequently crossing the boundary plane 1510b (but not boundary plane 1510c), then the gesture would be classified as a RIGHT swipe, even though the motion would be a DOWN swipe in the default regions of division 1520 of
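A corresponding sketch of this widening step might be the following; the clamp and the mapping from hand travel distance to a confidence value are assumptions made for the sketch.

```python
import math

def widened_horizontal_width(confidence, base=math.pi / 2, step=math.pi / 4):
    """Widen the LEFT/RIGHT regions after a vertical-boundary crossing.

    Each unit of confidence widens each horizontal region by pi/8 at each end
    (pi/4 in total), so a confidence of 1 yields 3*pi/4 and a confidence of 2
    yields pi, shrinking the UP/DOWN regions accordingly.
    """
    return min(base + confidence * step, math.pi)

# Example: a vector that falls in the default UP region may be classified as
# RIGHT once the horizontal regions have been widened.
# classify_swipe(0.3, 0.4)                                         -> "UP"
# classify_swipe(0.3, 0.4, widened_horizontal_width(confidence=2)) -> "RIGHT"
```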
Conversely,
As discussed above, the system may recognize a gesture based, at least in part, upon the user's motions during the action phase. However, if the gesture is not yet known, it may be difficult to identify the beginning and end of this phase, particularly as the conditions for the action phase of one gesture may not be the same as the conditions for another.
At block 1615, the system may determine the gesture state. In some embodiments, the gesture state may be represented as an integer variable number, e.g., 0-2 where: 0=Idle; 1=Prologue/Action; 2=Epilogue (though again, one will appreciate alternative possible classifications that, e.g., distinguish only between movements related and not related to the user's intent). Accordingly, determining the gesture state in block 1615 may simply involve determining the present integer value of a state variable as it was updated at blocks 1635, 1640 and 1650 (initially, the variable may be in the “Idle” state). If the system determines that the state is “Idle,” then the system may determine if a gesture's prologue has started at block 1620. If the system determines that the prologue has begun at block 1620, then the system may set the gesture state to “Prologue/Action” at block 1635. If the prologue has not started at block 1620, then the system may determine if the user's hand is in the gesture zone at block 1645. If the user's hand is in the gesture zone, then the process may return in anticipation of receiving new gesture data. Conversely, absence of the hand in the zone may indicate that the gesture has concluded. Consequently, at block 1650, the system may set the gesture state to “Idle” and may clear the gesture history at block 1655 in anticipation of a new gesture.
If, at block 1615, the system instead determines that the gesture is in the prologue phase or the action phase, then at block 1625 the system may determine if the gesture can be identified based upon the available data in this frame and the gesture history. If the gesture can be identified, then the system may set the gesture state to “Epilogue” at block 1640. The identified gesture may also be published for consumption by any listening applications at block 1660.
Block 1640 will result in this system determining in a subsequent iteration for a new frame that the gesture is in the epilogue phase, e.g., at block 1615. If the system then determines that the epilogue has ended at block 1630, then the system may transition to block 1650. To summarize, the example process depicted here transitions to the Idle state either when: the user retracts their hand from the gesture zone (a “NO” transition from block 1645); or the user's hand is in the gesture zone, but is moving away from the user's body (a determination that the Epilogue has ended at block 1630). As discussed above, one gesture may follow immediately upon another, so in some instances the next frame may result in a new prologue for the next gesture.
The system may detect the epilogue end transition using one of two methods in some embodiments. In the first method, the system simply concludes the epilogue once the user's hand exits the gesture zone. The second method relies on the expectation that the user will retract their hand in the epilogue. Thus, if the system observes the hand extending further into the gesture zone, it may construe the act as the prologue to a subsequent gesture, rather than an epilogue to the most recent gesture. When this occurs, the system may immediately clear the gesture history and set the gesture state to Idle. This may allow the system to recognize the subsequent gesture entries as the start of the prologue at block 1620 and proceed to identify this subsequent gesture.
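To facilitate understanding, the per-frame state update described above might be sketched as follows. The `detector` object is a hypothetical stand-in for the checks at blocks 1620, 1625, 1630, and 1645; all names are illustrative.

```python
IDLE, PROLOGUE_ACTION, EPILOGUE = 0, 1, 2

def update_gesture_state(state, history, frame, detector):
    """One per-frame pass of the example recognition loop (illustrative only).

    Returns the next state and any newly identified gesture to publish.
    """
    history.append(frame)
    if state == IDLE:
        if detector.prologue_started(history):
            return PROLOGUE_ACTION, None            # block 1635
        if not detector.hand_in_zone(frame):
            history.clear()                         # blocks 1650, 1655
        return IDLE, None
    if state == PROLOGUE_ACTION:
        gesture = detector.identify_gesture(history)
        if gesture is not None:
            return EPILOGUE, gesture                # blocks 1640, 1660
        return PROLOGUE_ACTION, None
    if detector.epilogue_ended(history):            # state == EPILOGUE
        history.clear()
        return IDLE, None
    return EPILOGUE, None
```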
Prologue detection is discussed in greater detail herein in relation to the example of
At block 1705, the system may determine if the user's hand is stationary. For example, the system may determine if the average speed of the user's hand over the past several frames has been below a threshold. If the system determines that the user's hand is stationary, then the system may determine that a pointing gesture has been detected at block 1710. An example of this determination is provided below with reference to
If the system does not determine that the hand is stationary, then the system may transition to block 1714 and estimate whether a swipe epilogue end will likely be detected at block 1630. An example process by which this estimation may be accomplished is described in greater detail below with respect to
The weighted average velocity V may be defined as shown in Equation 1:
where vi is the velocity of the hand at timestep i, wi is the weight of the velocity sample, and T is the present time. In some embodiments, wi is the distance of the hand from the body (though, again, in some embodiments, no weights may be applied). Accordingly, the further the user's hand is from the user's body, the more influence vi at that moment has on
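One plausible reading of such a weighted average, assumed here to be normalized by the sum of the weights, is sketched below.

```python
import numpy as np

def weighted_average_velocity(velocities, weights):
    """One plausible reading of Equation 1: a normalized weighted average.

    velocities: (T, 3) array of per-timestep hand velocities v_i.
    weights:    (T,) array of weights w_i, e.g., the hand's distance from the
                body, so samples far from the torso dominate the result.
    """
    velocities = np.asarray(velocities, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.sum() == 0.0:
        return np.zeros(3)
    return (weights[:, None] * velocities).sum(axis=0) / weights.sum()
```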
At block 1720, the system may determine if the hand is moving “quickly” or “slowly,” e.g., by comparing the weighted average velocity
When the hand is moving sufficiently fast to be a swipe gesture, then the system may determine the swipe vector at block 1730. In some embodiments, the swipe vector is the difference between the end and beginning centroid positions of the user's hand in the action or action and prologue phases. The system may use gesture samples since the prologue started until the current frame. In some embodiments, the swipe vector may instead be determined from the average velocity
where
At block 1735, the system may determine which boundary or boundaries the vector crosses (or if no boundary was crossed, though in some embodiments, a boundary crossing may be a requirement to transition from block 1720 to block 1730). As discussed with respect to
In various embodiments, the gesture history may be stored in a “stack” or “queue,” wherein frame or gesture data is ordered sequentially in time. In some embodiments, the queue may be implemented as a circular queue or circular buffer, as is known in the art, to improve performance.
Each entry may include a timestamp 1810a, a position of the user's left hand 1810b (e.g., a centroid as discussed in greater detail herein), a position of the user's right hand 1810c, a velocity of the user's left hand 1810d, and a velocity of the user's right hand 1810e (e.g., using successive centroid determinations as discussed in greater detail herein). In some embodiments, the gesture history may also include one or more data values 1810f associated with the heuristic results as discussed herein.
Where the buffer 1800 is a circular buffer, the buffer may comprise a finite region of data. As the end of the region is reached with sequential writes, the system may return to the initial entry 1820 and overwrite the oldest entry with the most recent captured data (for example, the capture at a time N+1 may be written at the position 1805a in the buffer that was previously storing the data for time 1). The system may track a reference to the most recent entry's position so that it may read the entries in sequential order.
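For illustration, such a history might be kept in a structure like the following; a deque with a maximum length mirrors the overwrite behavior of the circular buffer, and the field names are illustrative only.

```python
from collections import deque

# A gesture history with a fixed capacity; appending beyond the capacity
# silently drops the oldest entry, mirroring the overwrite behavior of the
# circular buffer described above.
history = deque(maxlen=256)

history.append({
    "timestamp": 0.033,
    "left_hand_pos": (0.10, 1.20, 0.45),
    "right_hand_pos": (0.55, 1.18, 0.40),
    "left_hand_vel": (0.00, 0.00, 0.00),
    "right_hand_vel": (0.02, 0.01, 0.15),
    "heuristics": {},          # optional per-frame heuristic results
})
```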
As discussed elsewhere herein, the centroid of depth values classified as being associated with the user's torso may be used in some embodiments, e.g., to determine the distance from the torso to the user's hand, for placement of the gesture zone, etc. These values may be received as an array of data points DT[i] for i=1 . . . NT, classified as corresponding to the torso. Each point may be a vector, e.g., of (x, y, z) coordinates.
The torso centroid CT may then be computed as:
Similar to the torso, the system may receive values classified as being associated with the user's shoulders. The left (or right) shoulder joint centroid may similarly be determined as:
where DLS[i] are those points classified as being associated with the left shoulder (again, one will appreciate alternatives using, e.g., boundary values, estimated offsets, etc.).
Left or right hand positions may similarly be determined as the centroid of their respective depth value collections DL[i] (having NL points) and DR[i] (having NR points). For example, the left hand centroid CL may be calculated as:
With these values, relative position of the left hand to the left shoulder may be taken as the difference in the centroids, i.e., PL=CL−CLS for the left hand and PR=CR−CRS for the right.
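By way of example, these centroid and relative-position computations might be sketched as follows; the small point sets are placeholders standing in for actual classified depth values.

```python
import numpy as np

def centroid(points):
    """Mean of an (N, 3) collection of depth points assigned to one class."""
    return np.asarray(points, dtype=float).mean(axis=0)

# Tiny illustrative point sets standing in for the classified depth values.
torso_points = [[0.00, 1.10, 0.0], [0.10, 1.20, 0.0], [-0.10, 1.30, 0.0]]
left_shoulder_points = [[0.12, 1.45, 0.0], [0.14, 1.47, 0.0]]
left_hand_points = [[0.30, 1.30, 0.40], [0.32, 1.31, 0.42]]

C_T = centroid(torso_points)             # torso centroid
C_LS = centroid(left_shoulder_points)    # left shoulder centroid
C_L = centroid(left_hand_points)         # left hand centroid
P_L = C_L - C_LS                         # left hand relative to left shoulder
```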
If the gesture entry is indeed the first received entry, as determined at block 1910, then at blocks 1915 and 1920, the system may set the velocities for the right and left hand (e.g., the value of the velocities 1810d and 1810e at the position 1805d in the circular buffer during a first iteration) to zero and return until the next gesture is received. At block 1935, any miscellaneous values that may be needed for future computation may be retained, not necessarily in the gesture history, but possibly in registers or variables. For example, the shoulder centroid values CLS and CRS for this kth time may be retained for use at a subsequent k+1 time (the hand centroids may already be retained in the gesture fields 1810b and 1810c for the preceding capture times).
When the next entry is received, since there will already be an item in the history at block 1910, the system will instead transition to blocks 1925 and 1930. As shown in block 1925 the distance from the hand to shoulder at each respective time may be used as a consistent reference for the velocity relative to the user, e.g.:
VL[k]=(CL[k]−CLS[k])−(CL[k−1]−CLS[k−1]) (6)
Note that in some embodiments multiple gesture history values may be received rather than the process run each time. For example, rather than only consider a single previous gesture record when determining the velocity, some embodiments may average the velocity over a window of preceding gesture records.
Prologue detection at block 1620 may proceed in some embodiments with consideration of the hand's relation to the gesture zone. For example,
At block 2005, the system may determine the most recent entries spanning 100 ms in the gesture history 2030. As illustrated, these are the entries between entries k1 and k2 inclusive in
At block 2010, the system may determine the average velocity
The average hand position
where P[k] is the hand position (e.g., the centroid) at entry k. Similarly, the averaged velocity
The system may then return true at block 2020 if both of blocks 2015a and 2015b are satisfied, and false at block 2025 otherwise.
Block 2015a determines whether P is within the gesture zone. Block 2015b instead isolates the z component,
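An illustrative sketch of such a prologue check might be the following; the velocity threshold, the entry field names, and the assumption that +z points toward the display are illustrative only.

```python
import numpy as np

def prologue_started(history, now, in_zone, window_s=0.100, vz_min=0.05):
    """Sketch of the prologue check of blocks 2005-2025.

    Over the most recent ~100 ms of the gesture history, the hand's averaged
    position must lie inside the gesture zone (`in_zone` is a containment test
    such as the one sketched earlier) and the averaged velocity must carry the
    hand further into the zone.
    """
    recent = [e for e in history if now - e["timestamp"] <= window_s]
    if not recent:
        return False
    avg_pos = np.mean([e["right_hand_pos"] for e in recent], axis=0)
    avg_vel = np.mean([e["right_hand_vel"] for e in recent], axis=0)
    return in_zone(avg_pos) and avg_vel[2] >= vz_min
```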
Epilogue detection performed at block 1630 may vary depending upon the gesture identified. With regard to swipe gesture epilogue end detection, similar to the process 2000 of
Note the rule
In contrast to the swipe gesture epilogue end detection of
H(P,Q)=√((PX−QX)²+(PY−QY)²+(PZ−QZ)²) (9)
At block 2140, the system may consider this distance H(Q,P) as well as the hand's Z-directional velocity VZ at the time the present position P was captured. If VZ>=0 mm/s and H(P, Q)>=100 mm at block 2140, the system may infer that the user is no longer pointing and consequently that the epilogue phase has concluded, returning true at block 2150 and false otherwise at block 2145.
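For example, this test might be sketched as follows; names are illustrative, and the distance is the Euclidean measure of Equation 9.

```python
import math

def pointing_epilogue_ended(p, q, vz):
    """Sketch of the pointing-gesture epilogue-end test of blocks 2140-2150.

    p:  present hand position in mm, q: hand position recorded when the
        pointing gesture was identified, vz: the hand's z-directional velocity
        in mm/s at the time p was captured. The epilogue is treated as over
        once the hand is at least 100 mm from where it pointed and the
        z-velocity condition VZ >= 0 is met.
    """
    h = math.dist(p, q)        # Euclidean distance of Equation 9
    return vz >= 0.0 and h >= 100.0
```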
As mentioned, gesture recognition at block 1625 may proceed as indicated in the process 1700 of
At block 2205, the system may find the most recent gesture history entries spanning approximately 400 ms, e.g., as shown between entries k1 and k2 in the history 2200a of
As mentioned, in addition to the full consideration of whether a swipe epilogue has concluded at block 1630, the system may also predict whether a swipe epilogue is likely to be detected at block 1714 as part of the gesture recognition process.
Block 1714 may use a combination of the heuristics to identify the transition from action to epilogue within operations 2400. At block 2405, the system may determine kS, kM and kE as shown in
In an initial iteration 2300a, kS may be set to the first received gesture item, kM to the item after 100 ms and kE to the final item of the 100 ms range following kM. Thus,
As will be discussed with reference to block 2425, the positions of kS, kM and kE may be incremented with each iteration. Thus, at the time of the second iteration 2300b, each of kS, kM and kE may be lowered to a more recent entry. This process may continue through successive iterations 2300c until a final iteration 2300d wherein kE exceeds the last entry (corresponding to block 2430).
During each iteration the system may compute the average positions
Note that in some embodiments, position coordinates may be considered relative to the shoulder point for the hand under consideration (e.g., the origin of the coordinate system is the shoulder joint). Consequently, a positive or negative x-value indicates a position on each side of the crossing boundary plane.
Note that if none of the crossing conditions are satisfied at blocks 2415a-d, the system may transition to block 2430 without recording any crossings. Where a crossing is detected, it is stored for future reference, e.g., in a crossing array (though one will appreciate that any suitable storage structure may suffice). The process may continue through successive iterations until kE is such that the second 100 ms range includes the most recently received gesture item (corresponding to iteration 2300d) at block 2430.
At this point, the system may consider a plurality of criteria in conjunction with decision blocks 2435a-c and 2445. Particularly, if at least two iterations included crossings, then block 2435a may transition to block 2450. At block 2435b, the system may confirm that the crossings satisfy directionality criteria. For example, if the last two entries of the crossing array contain crossings in opposite directions (e.g., one entry shows a crossing from right-to-left and the other shows a crossing from left-to-right), the system may transition to block 2450, as it is unlikely that a swipe gesture epilogue would include such behavior. Conversely, if the directionality condition is satisfied (i.e., no such opposite crossings occur), then the system may transition to block 2435c.
At block 2435c, the system may consider whether various torso relations are satisfied. For example, the system may consider whether the hand was further from the torso at the first crossing than at the second crossing. If this is not true, the system may transition to block 2450.
If blocks 2435a-c are satisfied, then the system may output true at block 2455. In contrast, if any of blocks 2435a-c are not satisfied, then at block 2440, the system may determine the average hand position when kE is at the end of the gesture history, that is, by averaging all the values between kS and kE at the final iteration. If the averaged hand position is outside the gesture zone, the system may transition to block 2455. Conversely, if the values remain within the zone, then the system may transition to block 2450.
At block 2450 the system may indicate that no swipe epilogue has been predicted (e.g., transitioning to block 1725). At block 2455, in contrast, the system may indicate that a swipe epilogue has been predicted (e.g., transitioning to block 1715).
The following description provides an example realization of the hand position relative to the user's torso heuristic 1110 discussed above. At block 1715, the system may compute the weighted average velocity VAVG or
where W[k] is:
and where Pz[k] is the distance of the hand from the torso (e.g., the torso centroid) and dmin is the minimum depth of the gesture zone (e.g., determined empirically). For example, a user whose torso centroid is 1170 mm from the ground may have a dmin=200 mm. That is, these example numbers correspond to the above-discussed embodiment wherein the user's torso centroid height from the ground is used as a proxy for the user's height in placement of the gesture box. For taller users whose torso centroid is higher above the ground, dmin will be larger. Conversely, for shorter users whose torso centroid is closer to the ground, dmin will be smaller. One will appreciate variations where other methods are used (e.g., the centroid of a shoulder classification).
Additionally, one will appreciate that Equation 10 is simply the more general Equation 1 in the form of gesture history entries specifically. Also note that W[k] is zero only if the hand is outside the gesture zone. Thus, in some embodiments, in order for the system to transition from idle to prologue/action, the gesture history must contain some entries with the hand inside the gesture zone (i.e., at least one non-zero W[k]).
Further note that the condition in Equation 11 that Pz[k] be inside the gesture zone implies that Pz[k]>=dmin and so W[k] is always non-negative. Equation 11 sets W[k]=0 when the hand is outside of the gesture zone so that the corresponding velocity V[k] is not used in the calculation of V when Pz[k]<dmin. Conversely, when Pz[k]>=dmin, and the hand is within the gesture zone, the values should be considered. The larger the value of dmin, the further away the hand is from the torso. W[k] is correspondingly larger when dmin is larger, giving V[k] more influence on
Because dmin adapts to the user's height (or arm length in some embodiments), the weight W[k] may also adapt to the user's height or arm length. The weight W[k] may also work regardless of whether a person swipes with an outstretched arm or in a more relaxed position closer to their torso.
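One plausible form of the weight, an assumed reading of Equations 10 and 11 consistent with the statements above, is sketched below.

```python
def weight(p_z, d_min):
    """One assumed reading of Equation 11: the weight equals the hand-to-torso
    distance Pz[k] while the hand lies inside the gesture zone (Pz[k] >= dmin)
    and is zero otherwise, so out-of-zone samples contribute nothing and
    far-from-torso samples dominate the weighted average of Equation 10."""
    return p_z if p_z >= d_min else 0.0

# Under this reading, Equation 10 would take the normalized form
#   V_avg = sum(weight(Pz[k], dmin) * V[k]) / sum(weight(Pz[k], dmin))
# over the entries k of the gesture history.
```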
At block 2505, the system may initialize each of the counter variables Nx, Ny, and N to zero. At block 2510, the system may then begin iterating through each P[k] 2545 in the gesture zone. Again, P[k] here represents the position of the hand at the kth entry of the gesture history. Iteration over all k values, accordingly corresponds to iteration over the entire gesture history.
For each position value P[k], the system may determine if the value's x component is greater than zero at block 2550 and increment the counter Nx at block 2555. Similarly, if the value's y component is greater than zero at block 2560, then at block 2565 the system may increment the counter Ny. The counter N may be incremented regardless of the component values at block 2570.
Once all the values within the gesture zone have been considered at block 2510, then the system may assess boundary crossings based on the values of the counter variables Nx and Ny. Particularly, if Nx is between one-sixth and five-sixths of N at block 2515, a vertical boundary crossing may be noted at block 2520. Similarly, if Ny is between one-sixth and five-sixths of N at block 2525, a horizontal boundary crossing may be noted at block 2530.
Thus, these calculations may be used to determine if the hand crossed the vertical or horizontal boundary. In principle, a hand that never crossed a boundary would yield Nx=0 or Nx=N. But because the hand positions may be noisy, a small number of hand positions may cross the boundary due to a noisy depth sensor. If the user gestures close to the boundary, the hand may also inadvertently cross the boundary.
Accordingly, some embodiments require that the number of hand samples on both sides of the boundary be above a threshold before declaring that the hand has crossed the boundary. For example, the threshold may be one-sixth of N. This threshold may be lower or higher depending upon how deliberate the swipe gesture must be in order to declare that it has crossed the boundary.
As discussed herein, once the boundary crossings have been determined then the division angle may be adjusted at block 2540 (e.g., increasing angles 1530a and 1530b, increasing angles 1530c and 1530d, or leaving all the angles equal). One will appreciate that these operations may be performed as part of blocks 1735, 1740, 1745 and 1750.
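An illustrative sketch of this counting test might be the following; the entry format is assumed, with positions taken relative to the shoulder point as described above.

```python
def detect_crossings(positions, low_frac=1.0 / 6, high_frac=5.0 / 6):
    """Sketch of the boundary-crossing test built from the Nx/Ny counts above.

    positions: hand positions (x, y, z) relative to the shoulder point, taken
    from gesture-history entries whose hand lies in the gesture zone. A
    vertical-boundary crossing is declared when the fraction of samples with
    x > 0 falls strictly between the two thresholds (and similarly for y and
    the horizontal boundary), so a handful of noisy samples near the boundary
    is not enough to declare a crossing.
    """
    n = len(positions)
    if n == 0:
        return False, False
    nx = sum(1 for p in positions if p[0] > 0)
    ny = sum(1 for p in positions if p[1] > 0)
    crossed_vertical = low_frac * n < nx < high_frac * n
    crossed_horizontal = low_frac * n < ny < high_frac * n
    return crossed_vertical, crossed_horizontal
```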
Various embodiments may incorporate some or all of the heuristics described herein into machine learning methods for gesture recognition (e.g., processing by a neural network, support vector machine, principal component analysis, etc.). For example, the Gesture Zone, distance from the user's hand to their torso, and hand motion relative to the swipe axes may be appended to feature vectors when training and testing.
In some embodiments, the machine learning method may be able to adequately identify test gesture histories when provided with large training gesture history datasets. However, incorporation of one or more of the heuristics into the machine learning process may reduce the size of the training data necessary to achieve the same accuracy. Reducing the necessary size of training data may be useful as obtaining correctly labeled and unbiased training data may be difficult or expensive.
Some of the heuristics may be particularly beneficial for this purpose as the heuristics may incorporate prior knowledge regarding the problem domain into the machine learning method. For example, in contrast to a “generic” machine learning dataset, handcrafted feature vectors exhibit a more direct mapping to the desired outcome. In addition, the heuristics artificially augment the limited training data, creating more “value” for each training data item. That is, handcrafting feature vectors may save the machine learning system some work in learning these features from training data.
In some embodiments, machine learning feature vectors may be extended to include: left (or right) hand position relative to the left (or right) shoulder joint as part of the boundary heuristic; left (or right) hand velocity relative to the left (and right) shoulder joint as part of the boundary heuristic; and a weight W derived from the distance from the user's torso Pz and the start of the gesture zone dmin.
As an example, to facilitate understanding,
To this original data 2605a may be appended vector data associated with the torso distance heuristic 2605b (e.g., V as determined in Equation 10, the boundary crossing 2605c,
Thus, the augmented training data vector may include: a timestamp; left/right hand position relative to shoulder point; left/right hand velocity;
The training data may be further enlarged by creating additional modified training vectors 2610 by augmenting 2615 either or both of torso 2605b and gesture zone data 2605d with augmented values 2610b and 2610d respectively. For example, the data values may be scaled using the scaling method discussed below. In this example, the boundary crossing data 2605c and original vector data 2605a may remain the same in the modified training vectors 2610. In some embodiments, the original data may be augmented as well. For example, when the hand positions are modified using Equations 12-14, the hand position and velocities may be updated so as to correspond with swipes of a different size.
The corresponding values 2620a-d in the test data 2620 (e.g., data acquired in-situ during actual interactions with the system) may be acquired in a fashion analogous to original data 2605, without the modifications 2615.
Some embodiments may augment the training data (e.g., as part of modifications 2615) by scaling hand positions from their original point PZ[k] to a new point P′z[k]:
P′z[k]=Pz[k]*A (12)
where a scaling factor A<1 brings the hand gesture position closer to the torso and A>1 brings the hand gesture position further away. Such scaling may facilitate the creation of dataset variations from a single training dataset, e.g., to avoid overfitting. The system may similarly consider larger or smaller swipes by scaling values perpendicular to the screen or torso, e.g.:
P′x[k]=Px[k]*B (13)
P′y[k]=Py[k]*B (14)
where B<1 makes the gesture smaller and B>1 makes the gesture larger.
Some embodiments may further augment the training data by creating faster or slower swipes. Such speed adjustment may be accomplished by adjusting the timestamp t that a gesture sample was received. This can be done by scaling
t′=t*L (15)
where L<1 speeds up the swipe and L>1 slows down the swipe.
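For illustration, these augmentations might be applied to recorded gesture entries roughly as follows; the entry field names are illustrative only.

```python
def augment_swipe(entries, a=1.0, b=1.0, t_scale=1.0):
    """Create a modified training swipe per Equations 12-15.

    a scales the hand's depth relative to the torso (a < 1 pulls the gesture
    toward the torso), b scales the in-plane components Px and Py (b < 1
    shrinks the swipe), and t_scale corresponds to L in Equation 15
    (t_scale < 1 yields a faster swipe).
    """
    out = []
    for e in entries:
        x, y, z = e["hand_pos"]
        out.append({**e,
                    "hand_pos": (x * b, y * b, z * a),
                    "timestamp": e["timestamp"] * t_scale})
    return out

# Example: a 20% larger and 20% faster variant of a recorded swipe.
# larger_faster = augment_swipe(recorded_entries, b=1.2, t_scale=0.8)
```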
The one or more processors 2710 may include, e.g., an Intel™ processor chip, a math coprocessor, a graphics processor, etc. The one or more memory components 2715 may include, e.g., a volatile memory (RAM, SRAM, DRAM, etc.), a non-volatile memory (EPROM, ROM, Flash memory, etc.), or similar devices. The one or more input/output devices 2720 may include, e.g., display devices, keyboards, pointing devices, touchscreen devices, etc. The one or more storage devices 2725 may include, e.g., cloud based storages, removable USB storage, disk drives, etc. In some systems memory components 2715 and storage devices 2725 may be the same components. Network adapters 2730 may include, e.g., wired network interfaces, wireless interfaces, Bluetooth™ adapters, line-of-sight interfaces, etc.
One will recognize that only some of the components, alternative components, or additional components than those depicted in
In some embodiments, data structures and message structures may be stored or transmitted via a data transmission medium, e.g., a signal on a communications link, via the network adapters 2730. Transmission may occur across a variety of mediums, e.g., the Internet, a local area network, a wide area network, or a point-to-point dial-up connection, etc. Thus, “computer readable media” can include computer-readable storage media (e.g., “non-transitory” computer-readable media) and computer-readable transmission media.
The one or more memory components 2715 and one or more storage devices 2725 may be computer-readable storage media. In some embodiments, the one or more memory components 2715 or one or more storage devices 2725 may store instructions, which may perform or cause to be performed various of the operations discussed herein. In some embodiments, the instructions stored in memory 2715 can be implemented as software and/or firmware. These instructions may be used to perform operations on the one or more processors 2710 to carry out processes described herein. In some embodiments, such instructions may be provided to the one or more processors 2710 by downloading the instructions from another system, e.g., via network adapter 2730.
The drawings and description herein are illustrative. Consequently, neither the description nor the drawings should be construed so as to limit the disclosure. For example, titles or subtitles have been provided simply for the reader's convenience and to facilitate understanding. Thus, the titles or subtitles should not be construed so as to limit the scope of the disclosure, e.g., by grouping features which were presented in a particular order or together simply to facilitate understanding. Unless otherwise defined herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, this document, including any definitions provided herein, will control. A recital of one or more synonyms herein does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any term discussed herein is illustrative only, and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term.
Similarly, despite the particular presentation in the figures herein, one skilled in the art will appreciate that actual data structures used to store information may differ from what is shown. For example, the data structures may be organized in a different manner, may contain more or less information than shown, may be compressed and/or encrypted, etc. The drawings and disclosure may omit common or well-known details in order to avoid confusion. Similarly, the figures may depict a particular series of operations to facilitate understanding, which are simply exemplary of a wider class of such collection of operations. Accordingly, one will readily recognize that additional, alternative, or fewer operations may often be used to achieve the same purpose or effect depicted in some of the flow diagrams. For example, data may be encrypted, though not presented as such in the figures, items may be considered in different looping patterns (“for” loop, “while” loop, etc.), or sorted in a different manner, to achieve the same or similar effect, etc.
Reference herein to “an embodiment” or “one embodiment” means that at least one embodiment of the disclosure includes a particular feature, structure, or characteristic described in connection with the embodiment. Thus, the phrase “in one embodiment” in various places herein is not necessarily referring to the same embodiment in each of those various places. Separate or alternative embodiments may not be mutually exclusive of other embodiments. One will recognize that various modifications may be made without deviating from the scope of the embodiments.