Interacting with a computing device, such as a computer, game system or robot, without requiring an input device such as a keyboard, mouse, touch-sensitive screen or game controller, presents various challenges. For one, the device needs to determine when a user is interacting with it, and when the user is doing something else. There needs to be a way for the user to make such an intention known to the device.
For another, users tend to move around. This means that the device cannot depend upon the user being at a fixed location when interaction is desired.
Still further, the conventional way in which users interact with computing devices is via user interfaces that present options in the form of menus, lists, and icons. However, such user interface options are based upon conventional input devices. Providing for navigation between options and for selection of discrete elements without such a device is another challenge.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which depth data obtained via a depth camera is processed to provide a virtual touchpad region in which a user can use hand movements or the like to interact with a computing device. A virtual touchpad program processes the depth data to determine a representative position of a user, e.g., the user's head position, and logically generates the virtual touchpad region relative to the representative position. Further processing of the depth data indicates when and where the user interacts with the virtual touchpad region, with the interaction-related depth data processed to determine corresponding input data (e.g., coordinates), such as for posting to a message queue.
In one aspect, the representative position of the user is determined based at least in part on face detection. This helps facilitate tracking multiple users in a scene, although it is also feasible to do so based upon tracking head position only.
In one aspect, the computing device and depth camera are incorporated into a robot that moves on a floor. The depth camera may be angled upwardly relative to the floor to detect the user, and the dimensions of the virtual touchpad region may be logically generated to vertically tilt the virtual touchpad region relative to the floor.
To determine hand position within the virtual touchpad region, one or more connected blobs representing objects in the virtual touchpad region are selected, as detected via the depth data. Hands may be isolated from among the blobs by blob size, as well as by blob position by eliminating any blob that touches any horizontal or vertical edge of the virtual touchpad region. In general, this detects intentional physical projection by the user of one or more objects (e.g., hands) into the virtual touchpad region. A coordinate set that represents each hand's position within the virtual touchpad region may be computed, e.g., based upon a center of energy computation.
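By way of example and not limitation, the blob selection described above may be sketched in Python as follows; the label array, minimum-size threshold, and function name are illustrative assumptions rather than part of any particular embodiment:

import numpy as np

def select_hand_blobs(labels, num_blobs, min_pixels=200, max_hands=2):
    """Pick up to two blobs that plausibly represent hands.

    labels    : 2-D int array of connected-component labels (0 = empty),
                already restricted to pixels whose depth lies inside the
                virtual touchpad region.
    num_blobs : number of labels, excluding label 0.
    """
    candidates = []
    for blob_id in range(1, num_blobs + 1):
        mask = labels == blob_id
        # Eliminate any blob that touches a horizontal or vertical edge
        # of the virtual touchpad region.
        if mask[0, :].any() or mask[-1, :].any() or mask[:, 0].any() or mask[:, -1].any():
            continue
        size = int(mask.sum())
        if size < min_pixels:          # too small to reasonably be a hand
            continue
        candidates.append((size, blob_id))
    # Keep the largest remaining blobs (at most two) as hand candidates.
    candidates.sort(reverse=True)
    return [blob_id for _, blob_id in candidates[:max_hands]]

A representative coordinate set for each selected blob may then be computed, e.g., via the center of energy computation sketched later in this description.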
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards a virtual touchpad that is programmatically positioned in front of users, allowing them to reach out and move their hands to interact with a computing device. For example, hand movement within the virtual touchpad region may be used to provide input to the computing device, such as to control movement of one or more correlated cursors about a displayed representation of a user interface. The input may be used by the device to select between options and make selections, including operating as if the input came from a conventional input device.
It should be understood that any of the examples herein are non-limiting. For one, while one computing device is exemplified as a robot, it is understood that any computing and/or electronic device, such as a personal computer, television, game system, and so forth, may benefit from the technology described herein. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and in interfacing with computing and/or electronic devices in general.
As will be understood, the depth camera 106 provides an image including depth data that may be processed to determine relative user position (e.g., user head position), which may be used to determine where in space to position the virtual touchpad 102. As will be understood, the virtual touchpad region does not physically exist as an object, but rather is logically generated in space as a set of two- or three-dimensional coordinates relative to a representative position of the user, with interaction with the virtual touchpad detected based upon information in the depth data at the corresponding virtual touchpad image pixel coordinates.
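By way of illustration only, the following Python sketch shows one way the virtual touchpad region may be represented as a logical box relative to the user's representative position, and how a depth pixel may be tested for membership in that region; the camera intrinsics, region dimensions, and offsets are assumed placeholder values:

import numpy as np

# Illustrative pinhole intrinsics (focal lengths and principal point, in
# pixels); actual values come from the depth camera's calibration.
FX, FY, CX, CY = 570.0, 570.0, 320.0, 240.0

def pixel_to_point(u, v, depth_m):
    """Back-project depth pixel (u, v) with depth in meters to a 3-D
    camera-space point (x right, y down, z away from the camera)."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return np.array([x, y, depth_m])

def in_virtual_touchpad(point, head_pos, size=(0.5, 0.3, 0.2),
                        offset=(0.0, 0.35, -0.5)):
    """Return True if a 3-D point lies inside a box-shaped virtual
    touchpad region defined relative to the user's head position; the
    box is offset in front of (toward the camera) and below the head.
    The size and offset values are illustrative only."""
    center = np.asarray(head_pos) + np.asarray(offset)
    half = np.asarray(size) / 2.0
    return bool(np.all(np.abs(point - center) <= half))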
In addition, (or alternatively), the RGB data of the depth camera 106 may be used for face detection (and/or face recognition). Note that the exemplified depth camera 106 is configured to provide R, G, B and D (depth) values per pixel, per frame; however if RGB data is desired, it is also feasible to use data captured by an RGB camera in conjunction with a separate depth camera that only provides depth data. Face detection is advantageous when a person's head may otherwise be considered a background object rather than a foreground object, such as if the person is keeping his or her head very still. Face detection also helps differentiate multiple persons in a scene.
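As one non-limiting illustration, face detection on the RGB channels of an R, G, B, D frame may be sketched with a stock OpenCV cascade classifier; the detector choice and parameters are assumptions made purely for illustration:

import cv2

# Haar-cascade face detector shipped with OpenCV, used here as an
# illustrative stand-in for the face detection described above.
_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(rgb_frame):
    """Return a list of (x, y, w, h) face boxes found in the RGB channels
    of an R, G, B, D frame (the depth channel is ignored here)."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    return _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)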
As will be described below, the depth data is further processed to determine where the user's hands in the foreground are located relative to the virtual touchpad 102; the positions of any hands present in the virtual touchpad region may then be converted into x, y, z coordinates, e.g., on a per-frame basis. As will be understood, this is accomplished by tracking the hand positions when they intersect the virtual touchpad, and thus may be performed without the high computational power that is needed to interpret the user's body and construct a representational skeleton of the user in order to track the user's motions and body positions (e.g., as in Microsoft® Kinect™ technology).
As represented in
Following background subtraction, connected component analysis (another known computer vision technique) is performed to determine groups of pixels that are connected. In general, background subtraction and connected component analysis separate each region of the image where the depth reading is closer than previous readings, segmenting the image into background and foreground, with the connected blobs representing foreground objects.
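By way of example and not limitation, the background subtraction and connected component analysis may be sketched as follows in Python (using SciPy's labeling routine); the threshold value, array names, and choice of background model are illustrative assumptions:

import numpy as np
from scipy import ndimage

def segment_foreground(depth, background, closer_by=0.10):
    """Separate foreground blobs from a depth frame.

    depth      : 2-D array of depth readings in meters (0 = no reading).
    background : per-pixel depth model built from previous readings
                 (e.g., a running maximum over earlier frames).
    closer_by  : a pixel is foreground when its reading is at least this
                 much closer than the background model.
    Returns (labels, num_blobs) from connected-component analysis, where
    labels is a 2-D int array and 0 marks background pixels.
    """
    valid = (depth > 0) & (background > 0)
    foreground = valid & (depth < background - closer_by)
    labels, num_blobs = ndimage.label(foreground)
    return labels, num_blobs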
In one implementation, measurements are made on one or more foreground objects to determine whether a user's head is in the depth camera's field of view. To this end, the array (the pixel map comprising columns and rows) is processed by a head detection/position processing mechanism 226 from the top pixel row downward to determine leftmost and rightmost points of any object in the depth image; note that it is straightforward to determine horizontal (and vertical) distances from depth data. If the distance between the left and right side of an object is of a width that corresponds to a reasonable human head (e.g., between a maximum width and a minimum width), then the object is considered a head. If a head is not detected, then the next lower pixel row of the array is similarly processed and so on, until a head is detected or the row is too low. In other words, when foreground blobs are sufficiently large and a head-shaped blob is detected at the top of the region, then the region is tracked as a user for subsequent frames. Note that while more complex processing may be performed, this straightforward process is effective in actual practice.
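The row-by-row head detection just described may be illustrated with the following non-limiting Python sketch; the focal length, width limits, and lowest-row cutoff are assumed example values:

import numpy as np

FX = 570.0  # illustrative horizontal focal length, in pixels

def find_head_row(foreground, depth, min_width_m=0.10, max_width_m=0.30,
                  lowest_row_fraction=0.5):
    """Scan the foreground mask from the top pixel row downward; return
    (row, left, right) for the first row whose foreground span has a
    metric width consistent with a human head, or None if the scan
    reaches the cutoff row without detecting one."""
    rows, cols = foreground.shape
    for row in range(int(rows * lowest_row_fraction)):
        xs = np.flatnonzero(foreground[row])
        if xs.size == 0:
            continue
        left, right = xs[0], xs[-1]
        # Median depth along the row's foreground span, used to convert
        # the pixel span to meters with the pinhole model:
        # width_m = pixel_width * depth / focal_length.
        span_mask = foreground[row, left:right + 1]
        z = float(np.median(depth[row, left:right + 1][span_mask]))
        width_m = (right - left + 1) * z / FX
        if min_width_m <= width_m <= max_width_m:
            return row, left, right
    return None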
In a situation in which multiple users appear, the change in the location of each box that represents a head may be tracked. In this way, the head detection/position processing mechanism 226 (
In one embodiment, face tracking 228 (
Turning to another aspect, once the representative position of the user (e.g., the head position) is known, the position of the virtual touchpad 102 may be computed relative to that position, such as a reasonable distance (e.g., approximately one-half meter) in front of and below the head, with any tilting accomplished by basic geometry. The virtual touchpad 102 is thus created relative to the user, and may be given any appropriate width, height, and depth, which may vary by application. For example, if the device is at a lower surface than the user (facing up at the user), the pad may be angled to the user's body such that the penetration points are aligned to the person's body, not to the orientation of the device. This helps the user to intuitively learn that the touchpad is directly in front of him or her, regardless of the position and orientation of the device with which the user is interacting.
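As a non-limiting illustration, the placement of the virtual touchpad relative to the head position, including tilt compensation for an upwardly angled camera, may be sketched as follows; the offsets and coordinate conventions (y pointing down, z pointing away from the camera) are illustrative assumptions:

import numpy as np

def place_touchpad(head_pos, distance=0.5, drop=0.35, tilt_deg=0.0):
    """Compute the center and plane normal of the virtual touchpad
    relative to the user's head position in camera space.

    tilt_deg compensates for a camera that is angled upward from the
    floor (e.g., on a robot), so that the touchpad stays aligned with
    the user's body rather than with the device. Offsets vary by
    application; the defaults here are illustrative only."""
    tilt = np.radians(tilt_deg)
    # Rotation about the camera's x-axis by the tilt angle.
    rot_x = np.array([[1.0, 0.0,          0.0],
                      [0.0, np.cos(tilt), -np.sin(tilt)],
                      [0.0, np.sin(tilt),  np.cos(tilt)]])
    # Nominal offset: in front of (toward the camera) and below the head.
    offset = np.array([0.0, drop, -distance])
    center = np.asarray(head_pos) + rot_x @ offset
    normal = rot_x @ np.array([0.0, 0.0, 1.0])  # touchpad plane normal
    return center, normal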
In general, the virtual touchpad follows the head throughout the frames. Note that while the user is interacting with the virtual touchpad as described below, the virtual touchpad may be fixed (at least to an extent), e.g., so that unintentional head movements do not move the virtual touchpad relative to the user's hand positions. As long as the user is tracked, any object penetrating this region (which may be a plane or volume) is checked to determine if it should be interpreted as the user's hand; if so the hand's positional data is tracked by the device.
Once positioned, any objects within the virtual touchpad 332 may be isolated from other objects based upon their depths being within the virtual touchpad region, as represented in
In the event that more than two such island blobs exist in the virtual touchpad, the two largest such blobs are selected as representing the hands. It is alternatively feasible to eliminate any extra blob that is not of a size that reasonably represents a hand.
In one implementation, the coordinates that are selected to represent the position of each of the hands (blobs 441 and 442) are computed based upon each blob's center of energy. A center of mass computation is also feasible as an alternative way to determine the coordinates; however, with the center of energy, the coordinates for a user who is pointing a finger into the virtual touchpad will generally follow the fingertip, because the fingertip has the closest depth values.
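By way of example only, the center of energy computation (and the center of mass alternative) may be sketched as follows; the specific depth-based weighting is one illustrative choice among many:

import numpy as np

def blob_coordinates(labels, depth, blob_id, use_energy=True):
    """Compute a representative (x, y, z) coordinate for a hand blob.

    With use_energy=True, pixels are weighted by how far they penetrate
    the virtual touchpad (closer depth = larger weight), so a pointing
    fingertip pulls the coordinate toward itself; with use_energy=False
    this reduces to an ordinary center of mass of the blob."""
    ys, xs = np.nonzero(labels == blob_id)
    zs = depth[ys, xs].astype(float)
    if use_energy:
        weights = zs.max() - zs + 1e-6   # closest pixels get the most weight
    else:
        weights = np.ones_like(zs)
    w_sum = weights.sum()
    x = float((xs * weights).sum() / w_sum)
    y = float((ys * weights).sum() / w_sum)
    z = float((zs * weights).sum() / w_sum)
    return x, y, z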
The coordinates may be provided in any suitable way to an application or other program, such as by posting messages to a message queue with an accompanying timestamp. Note that if the z-coordinates are discarded, the virtual touchpad operates like any other touchpad, with two coordinate pairs able to be used as cursors, including for pinching (zoom out) or spreading (zoom in) operations; however, the z-coordinate may be retained for applications that are configured to use three-dimensional cursors. Similarly, only a single coordinate set (e.g., corresponding to the right hand, which may be user configurable) may be provided for an application that expects only one cursor to be controlled via conventional single pointing device messages.
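As one non-limiting illustration, posting timestamped coordinate messages and deriving pinch/spread operations from two coordinate pairs may be sketched as follows; the queue, message format, and distance threshold are illustrative assumptions:

import time
from queue import Queue

message_queue = Queue()   # illustrative stand-in for the device's message queue
_prev_distance = None

def post_hand_coordinates(coords):
    """Post per-frame hand coordinates (a list of (x, y, z) tuples, one
    per detected hand) with a timestamp, and derive a simple pinch or
    spread gesture when two hands are present."""
    global _prev_distance
    message_queue.put({"timestamp": time.time(), "coords": coords})
    if len(coords) == 2:
        (x1, y1, _), (x2, y2, _) = coords
        distance = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
        if _prev_distance is not None:
            if distance < _prev_distance - 5.0:      # hands moving together
                message_queue.put({"timestamp": time.time(), "gesture": "pinch"})   # zoom out
            elif distance > _prev_distance + 5.0:    # hands moving apart
                message_queue.put({"timestamp": time.time(), "gesture": "spread"})  # zoom in
        _prev_distance = distance
    else:
        _prev_distance = None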
Step 510 is performed after face detection (or if not yet time for face detection), in which the scene is evaluated to determine whether it is empty with respect to users. If so, the process ends for this frame. Otherwise, step 512 determines the head position or positions. Step 514 updates the coordinates for each user based on the hand or hands, if any, in the virtual touchpad region, which in one implementation is only performed for the primary user. Step 516 represents outputting the coordinates in some way, e.g., by posting messages for use by an application or the like. Note that instead of coordinates, some pre-processing may be performed based upon the coordinates, e.g., the movements may be detected as gestures, and a command provided to an application that represents the gesture.
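By way of illustration only, the per-frame processing of steps 510 through 516 may be tied together as in the following Python sketch, which assumes the illustrative helper functions sketched earlier in this description are available; the frame and state objects and their fields are hypothetical:

def process_frame(rgbd_frame, state):
    """One pass of the per-frame processing (steps 510-516). 'state'
    carries the background depth model, last-known faces and head, a
    frame counter, and the face-detection interval."""
    depth, rgb = rgbd_frame.depth, rgbd_frame.rgb
    # Periodic face detection (not necessarily performed every frame).
    if state.frame_index % state.face_detect_interval == 0:
        state.faces = detect_faces(rgb)
    labels, num_blobs = segment_foreground(depth, state.background)
    if num_blobs == 0 and not state.faces:
        state.frame_index += 1
        return                                   # step 510: scene empty of users
    head = find_head_row(labels > 0, depth)      # step 512: head position
    if head is not None:
        state.primary_head = head
    # Step 514: hand coordinates for the primary user; for simplicity the
    # masking of labels to the virtual touchpad region is omitted here.
    hand_blobs = select_hand_blobs(labels, num_blobs)
    coords = [blob_coordinates(labels, depth, b) for b in hand_blobs]
    post_hand_coordinates(coords)                # step 516: output coordinates
    state.frame_index += 1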
As can be seen, there is described a technology for receiving controller-free user input without requiring high computational expenditure. By analyzing a depth image to see if there are objects of interest that may be people, and logically creating a virtual touchpad in front of at least one person that is sensed for interaction therewith, controller-free interaction using only hand movements or the like is provided.
As mentioned, advantageously, the techniques described herein can be applied to any device. It can be understood, therefore, that handheld, portable and other computing devices and computing objects of all kinds, including robots, are contemplated for use in connection with the various embodiments. Accordingly, the general purpose remote computer described below in
Embodiments can partly be implemented via an operating system, for use by a developer of services for a device or object, and/or included within application software that operates to perform one or more functional aspects of the various embodiments described herein. Software may be described in the general context of computer executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers or other devices. Those skilled in the art will appreciate that computer systems have a variety of configurations and protocols that can be used to communicate data, and thus, no particular configuration or protocol is considered limiting.
With reference to
Computer 610 typically includes a variety of computer readable media and can be any available media that can be accessed by computer 610. The system memory 630 may include computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) and/or random access memory (RAM). By way of example, and not limitation, system memory 630 may also include an operating system, application programs, other program modules, and program data.
A user can enter commands and information into the computer 610 through input devices 640. A monitor or other type of display device is also connected to the system bus 622 via an interface, such as output interface 650. In addition to a monitor, computers can also include other peripheral output devices such as speakers and a printer, which may be connected through output interface 650.
The computer 610 may operate in a networked or distributed environment using logical connections to one or more other remote computers, such as remote computer 670. The remote computer 670 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, or any other remote media consumption or transmission device, and may include any or all of the elements described above relative to the computer 610. The logical connections depicted in
As mentioned above, while exemplary embodiments have been described in connection with various computing devices and network architectures, the underlying concepts may be applied to any network system and any computing device or system in which it is desirable to improve efficiency of resource usage.
Also, there are multiple ways to implement the same or similar functionality, e.g., an appropriate API, tool kit, driver code, operating system, control, standalone or downloadable software object, etc. which enables applications and services to take advantage of the techniques provided herein. Thus, embodiments herein are contemplated from the standpoint of an API (or other software object), as well as from a software or hardware object that implements one or more embodiments as described herein. Thus, various embodiments described herein can have aspects that are wholly in hardware, partly in hardware and partly in software, as well as in software.
The word “exemplary” is used herein to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited by such examples. In addition, any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs, nor is it meant to preclude equivalent exemplary structures and techniques known to those of ordinary skill in the art. Furthermore, to the extent that the terms “includes,” “has,” “contains,” and other similar words are used, for the avoidance of doubt, such terms are intended to be inclusive in a manner similar to the term “comprising” as an open transition word without precluding any additional or other elements when employed in a claim.
As mentioned, the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. As used herein, the terms “component,” “module,” “system” and the like are likewise intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer itself can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
The aforementioned systems have been described with respect to interaction between several components. It can be appreciated that such systems and components can include those components or specified sub-components, some of the specified components or sub-components, and/or additional components, and according to various permutations and combinations of the foregoing. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent components (hierarchical). Additionally, it can be noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and that any one or more middle layers, such as a management layer, may be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein may also interact with one or more other components not specifically described herein but generally known by those of skill in the art.
In view of the exemplary systems described herein, methodologies that may be implemented in accordance with the described subject matter can also be appreciated with reference to the flowcharts of the various figures. While for purposes of simplicity of explanation, the methodologies are shown and described as a series of blocks, it is to be understood and appreciated that the various embodiments are not limited by the order of the blocks, as some blocks may occur in different orders and/or concurrently with other blocks from what is depicted and described herein. Where non-sequential, or branched, flow is illustrated via flowchart, it can be appreciated that various other branches, flow paths, and orders of the blocks, may be implemented which achieve the same or a similar result. Moreover, some illustrated blocks are optional in implementing the methodologies described hereinafter.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.
In addition to the various embodiments described herein, it is to be understood that other similar embodiments can be used or modifications and additions can be made to the described embodiment(s) for performing the same or equivalent function of the corresponding embodiment(s) without deviating therefrom. Still further, multiple processing chips or multiple devices can share the performance of one or more functions described herein, and similarly, storage can be effected across a plurality of devices. Accordingly, the invention is not to be limited to any single embodiment, but rather is to be construed in breadth, spirit and scope in accordance with the appended claims.