Exemplary embodiments generally relate to computer graphics processing, image analysis, and data processing and, more particularly, to display peripheral interface input devices, to tracking and detecting targets, to pattern recognition, and to gesture-based operator interfaces.
Computer-based vision systems are used to control computers, video games, military vehicles, and even medical equipment. Images captured by a camera are interpreted to perform some task. Conventional vision systems, however, require a cumbersome calibration process.
The features, aspects, and advantages of the exemplary embodiments are better understood when the following Detailed Description is read with reference to the accompanying drawings, wherein:
The exemplary embodiments will now be described more fully hereinafter with reference to the accompanying drawings. The exemplary embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. These embodiments are provided so that this disclosure will be thorough and complete and will fully convey the exemplary embodiments to those of ordinary skill in the art. Moreover, all statements herein reciting embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future (i.e., any elements developed that perform the same function, regardless of structure).
Thus, for example, it will be appreciated by those of ordinary skill in the art that the diagrams, schematics, illustrations, and the like represent conceptual views or processes illustrating the exemplary embodiments. The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing associated software. Those of ordinary skill in the art further understand that the exemplary hardware, software, processes, methods, and/or operating systems described herein are for illustrative purposes and, thus, are not intended to be limited to any particular named manufacturer.
As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first device could be termed a second device, and, similarly, a second device could be termed a first device without departing from the teachings of the disclosure.
Regardless of how the vision system 20 is used, a process called calibration may be required. The vision system 20 may need to acclimate itself to an operator and/or to an environment being monitored (e.g., a field of view 30 of the camera 24). These two pre-conditions are conventionally resolved by creating very rigid environments (e.g., a well-known, pre-calibrated field of view 30) or by requiring the operator to wear awkward clothing (e.g., gloves, hats, or materials created with specific reflective regions) for acceptable interaction.
Exemplary embodiments, however, calibrate using a human gesture 40. Exemplary embodiments propose a marker-less vision system 20 that uses the human gesture 40 to automatically calibrate for operator interaction. The human gesture 40 may be any gesture that is visually unique, thus permitting the vision system 20 to quickly identify the human gesture 40 within a visually complex image 22. The vision system 20, for example, may be trained to calibrate using disjointed or unusual gestures, as later paragraphs will explain.
Calibration correlates the operator's physical world with a computer-based world. Three of the most popular computer-based world examples are an augmented reality, an interactive world, and a virtual reality. In the augmented reality world, the operator sees graphics and text overlaid onto the image 22 of the real world. In the interactive world, the electronic device 26 associates real-world actions with limited feedback from the virtual world. The virtual reality world immerses the operator in a wholly artificial, computer-based rendering that incorporates at least some information from the image 22. The human gesture 40 may be used to calibrate any of these computer-based world examples (the augmented reality, the interactive world, and the virtual reality). Conventionally, automatic calibration used an object with known geometry (e.g., a checker board or color bars) for calibration. This level of precision permits an exact association of the digitized image 22 with the computer's virtual world, but conventional methods require specialized props and experienced operators. Exemplary embodiments may eliminate both of these burdens by utilizing only the operator's hands and the human gesture 40 that is both intuitive and well-known (as
Exemplary embodiments, however, need not prompt the operator. The operator, instead, may calibrate and begin interaction without the prompt 60. For example, if there is one person playing a game of tic-tac-toe on the display device 56, one or more players may join the game by simply posing the human gesture 40. Exemplary embodiments may also accommodate games and other applications that require authentication (e.g., a password or PIN code).
Exemplary embodiments may utilize any calibration algorithm 110. Exemplary embodiments not only leverage existing algorithms for the detection of hand gestures as visual patterns, but exemplary embodiments may automatically calibrate real-world and virtual-world representations. As earlier paragraphs explained, the calibration algorithm 110 utilizes the intuitive human gesture 40 and the operator's perception of the display device 56 to automatically calibrate these two environments. Exemplary embodiments thus permit calibration in adverse conditions (e.g., low lighting, unusual room geometry, untrained operators, etc.) because the operator is providing a highly precise identification of the display device 56 from his or her perspective. While there are no physical demarcations for the display device 56 once the operator lowers his or her hands, exemplary embodiments cognitively remember the boundary 80 of the gesture area 88. Exemplary embodiments map the gesture area 88 to the pixel boundaries of the display device 56 in the operator's line of sight. Once the human gesture 40 has been correctly detected, calibration of the real-world and virtual-world environments may be conceptually simple. The image processing application 52 need only transform the coordinates of the camera's perspective into that of the operator to accurately detect the interaction region 70. Exemplary embodiments may thus perform planar and affine transformations for three-dimensional computer graphics, and the appropriate linear matrix multiplication is well known. As an added form of verification, exemplary embodiments may generate an acknowledgment 120 that calibration was successful or a notification 122 that calibration was unsuccessful.
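The mapping from the gesture area 88 to the pixel boundaries of the display device 56 may be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that the gesture area has already been detected as an axis-aligned rectangle in camera-image coordinates and that a scale-and-translate affine matrix is sufficient; the function names `gesture_to_display_matrix` and `to_display` and the numeric values are hypothetical and do not appear in the disclosure.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def gesture_to_display_matrix(
    gesture_box: Tuple[float, float, float, float],
    display_size: Tuple[int, int],
) -> Matrix:
    """Build a 3x3 homogeneous affine matrix (scale plus translation) that maps
    camera-image points inside the gesture box onto display pixels.

    gesture_box is (left, top, right, bottom) in camera-image coordinates;
    display_size is (width, height) in pixels. Both are assumed inputs.
    """
    left, top, right, bottom = gesture_box
    width, height = display_size
    sx = width / (right - left)   # horizontal scale factor
    sy = height / (bottom - top)  # vertical scale factor
    return [
        [sx, 0.0, -left * sx],
        [0.0, sy, -top * sy],
        [0.0, 0.0, 1.0],
    ]


def to_display(matrix: Matrix, point: Tuple[float, float]) -> Tuple[float, float]:
    """Apply the matrix to a camera-image point by linear matrix multiplication."""
    vec = (point[0], point[1], 1.0)
    out = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
    return out[0] / out[2], out[1] / out[2]


# A 240x180 gesture area mapped onto a 1920x1080 display (illustrative values).
calibration = gesture_to_display_matrix((200, 100, 440, 280), (1920, 1080))
print(to_display(calibration, (320, 190)))  # center of the gesture area -> (960.0, 540.0)
```

A full perspective (planar homography) transformation would use the same 3x3 matrix form with a non-trivial bottom row; the scale-and-translate case is shown only because it is the simplest instance.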
Exemplary embodiments may be utilized with any gesture. As human-computer interfaces move beyond physical interaction and voice commands, the inventor envisions a common lexicon of hand-based gestures will arise. Looking at modern touch-pads and mobile devices, a number of gestures are already present, such as clicks, swipes, multi-finger clicks or drags, and even some multi-component gestures (like finger dragging in an “L”-shape). With sufficient visual training data, exemplary embodiments may accommodate any gesture. For example:
Exemplary embodiments may utilize any algorithm. Any algorithm that detects visual patterns, visual templates, or regions of high or low pixel intensity may be used. The commonly used boosted cascade of Haar wavelet classifiers, for example, may be used, as described by Paul Viola & Michael J. Jones, Robust Real-Time Face Detection, 57 International Journal of Computer Vision 137-154 (2004). Exemplary embodiments, however, do not depend on a specific image resolution, even though high-resolution images and complex gestures may place a heavier demand on the processor 50 and memory 54. During the object recognition 140, the image processing application 52 has knowledge of where (within the real-world spatial location) the human gesture 40 or visual object is in the image 22 provided by the camera 24. If only one input from the camera 24 is provided, spatial knowledge may be limited to a single two-dimensional plane. More specifically, without additional computation (and calibration), exemplary embodiments may have little or no knowledge about the distance of the operator from the camera 24 in the interaction area (illustrated as reference numeral 70 in
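As a hedged illustration of how regions of high or low pixel intensity may be compared, the following sketch computes an integral image (summed-area table) and a two-rectangle Haar-like feature, which is one building block of the boosted cascade cited above. It is not the full cascade and is not presented as the detection method of the exemplary embodiments; the function names and the tiny grayscale image are hypothetical.

```python
from typing import List

Image = List[List[int]]


def integral_image(img: Image) -> Image:
    """Return a summed-area table with one extra row and column of zeros,
    so that ii[y][x] is the sum of all pixels above and to the left of (x, y)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Sum of the pixels in the w-by-h rectangle whose top-left corner is (x, y),
    using four table lookups regardless of rectangle size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Haar-like edge feature: left half minus right half of the rectangle.
    A large magnitude suggests a vertical bright/dark boundary."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Bright left half, dark right half (illustrative 4x2 grayscale patch).
patch = [[9, 9, 1, 1],
         [9, 9, 1, 1]]
table = integral_image(patch)
print(two_rect_feature(table, 0, 0, 4, 2))  # 36 - 4 = 32
```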
A secondary problem that some vision systems encounter is the need to recalibrate if the operator moves around the environment. Exemplary embodiments, however, even though originally envisioned for television viewing and entertainment purposes, were designed with this potential pitfall in mind. Exemplary embodiments thus include an elegant solution to accommodate operator mobility within the entire viewing area of the camera 24. During calibration, the operator performs the human gesture 40 to spatially identify the display device 56 according to his or her perspective. However, at the same time, the operator is also specifically identifying his or her face to the camera 24. Exemplary embodiments may thus perform a second detection for the operator's face and reuse that region for face recognition in subsequent use. Using the relative size and position of the operator's face, exemplary embodiments may accommodate small movements in the same viewing area without requiring additional calibration sessions. For additional performance improvement, additional detection and tracking techniques may be applied to follow the operator's entire body (i.e., his or her gait) while moving around the viewing area.
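One way the relative size and position of the operator's face might be used to follow small movements is sketched below. The sketch assumes, as a simplification not stated in the disclosure, that the calibrated gesture area keeps a fixed offset from the face center and scales in proportion to the apparent face width; the `Box` type and the `follow_operator` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in camera-image coordinates."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def follow_operator(gesture_area: Box, face_at_calibration: Box, face_now: Box) -> Box:
    """Re-position the calibrated gesture area after the operator moves.

    Assumption: apparent face width is proportional to distance-induced scale,
    and the gesture area stays at the same scaled offset from the face center.
    """
    scale = face_now.w / face_at_calibration.w
    cx0, cy0 = face_at_calibration.center
    cx1, cy1 = face_now.center
    return Box(
        x=cx1 + (gesture_area.x - cx0) * scale,
        y=cy1 + (gesture_area.y - cy0) * scale,
        w=gesture_area.w * scale,
        h=gesture_area.h * scale,
    )


# The operator steps back and to the right: the face appears half as wide.
area = Box(200, 100, 240, 180)
moved = follow_operator(area, Box(280, 60, 80, 80), Box(420, 80, 40, 40))
print(moved)  # Box(x=380.0, y=100.0, w=120.0, h=90.0)
```

Larger movements, or movements that change the operator's viewing angle toward the display device 56, would likely fall outside this approximation and could instead trigger a new calibration.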
Exemplary embodiments may utilize any display device 56 having any resolution. Exemplary embodiments also do not depend on the content being generated by the display device 56. The operator is implicitly resolving confusion about the size and location of the display device 56 when he or she calibrates the vision system 20. Therefore, the content being displayed on the display device 56 may be relatively static (a menu with several buttons to “click”), quite dynamic (a video game that has movement and several interaction areas on screen), or a hybrid of these examples. Exemplary embodiments, at a minimum, need only translate the operator's interactions into digital interaction commands, so these interactions may be a mouse movement, a button click, a multi-finger swipe, etc. Exemplary embodiments need only be trained with the correct interaction gesture.
Exemplary embodiments may also include automatic enrollment. Beyond automatic calibration itself, exemplary embodiments may also track and adjust internal detection and recognition algorithms or identify potential errors for a specific operator. Conventional vision systems, typically trained to perform detection of visual objects, either have a limited tolerance for variation in those objects (i.e., the size of fingers or face geometry is relatively fixed) or they require additional real-time calibration to handle operator-specific traits (often referred to as an “enrollment” process). Even though exemplary embodiments may utilize enrollment, the operator is already identifying his or her hands, face, and some form of body geometry to the vision system 20 during automatic calibration (by performing the human gesture 40). Exemplary embodiments may thus undertake any necessary adjustments, according to an operator's traits, at the time of the human gesture 40. Again, to reassure the operator, immediate audible or visual feedback may be provided. For example, when the vision system 20 observes an operator making the “picture frame” human gesture 40, exemplary embodiments may automatically compute the thickness of the operator's fingers and the span of the operator's hand, perform face detection, perform body detection (for gait-based tracking), and begin to extract low-level image features for recognition from the video segment used for calibration. Traditional vision systems that lack a form of automatic enrollment must explicitly request that an operator identify himself or herself to begin low-level feature extraction.
Exemplary embodiments may also detect and recognize different gestures as the operator moves within the viewing area. Automatic enrollment allows the vision system 20 to immediately identify errors due to out-of-tolerance conditions (such as the operator being too far from the camera 24, the operator's gesture being ill formed, or the lighting conditions being too poor for recognition of all gestures). With immediate identification of these potential errors, before any interaction begins, the operator is alerted and prompted to retry the calibration or adjust his or her location, allowing an uninterrupted operator experience and reducing frustration that may be caused by failures in the interaction that traditional vision systems could not predict.
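The out-of-tolerance checks described above might be sketched as follows. The three measurements (apparent face width, mean image brightness, and a gesture detection score) and all threshold values are illustrative assumptions; the disclosure does not specify how each condition is measured, and `calibration_problems` is a hypothetical name.

```python
from typing import List


def calibration_problems(
    face_width_px: float,
    mean_brightness: float,
    gesture_score: float,
    min_face_width_px: float = 40.0,   # hypothetical: smaller means too far away
    min_brightness: float = 50.0,      # hypothetical: on a 0-255 grayscale
    min_gesture_score: float = 0.6,    # hypothetical: detector confidence in [0, 1]
) -> List[str]:
    """Return human-readable problems found at calibration time.

    An empty list means calibration may be acknowledged as successful;
    otherwise the operator can be prompted before any interaction begins.
    """
    problems: List[str] = []
    if face_width_px < min_face_width_px:
        problems.append("operator too far from camera")
    if mean_brightness < min_brightness:
        problems.append("lighting too poor")
    if gesture_score < min_gesture_score:
        problems.append("gesture ill formed")
    return problems


issues = calibration_problems(face_width_px=20, mean_brightness=30, gesture_score=0.9)
if issues:
    print("Please retry calibration:", "; ".join(issues))
```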
Exemplary embodiments may also provide marker-less interaction. Conventional vision systems may require that the operator wear special clothing or use physical props to interact with the vision system 20. Exemplary embodiments, however, utilize a pre-defined real-world space (e.g., the interaction area 70 and/or the gesture area 88) that the operator has chosen and that can easily be transformed into virtual-world coordinates once calibrated. Once this real-world space is defined by the operator's hands, it is very easy for that operator to cognitively remember and interact within the real-world space. Thus, any interactive gesture, whether it is a simple pointing action to click or a swiping action to navigate between display “pages,” can be performed by the operator within the calibrated, real-world space with little or no effort.
Exemplary embodiments may also provide simultaneous calibration for multiple participants. Another inherent drawback of traditional vision systems that use physical remote controls, props, or “hot spot” areas for interaction is that these conventional systems only accommodate operators that have the special equipment. For example, a popular gaming console now uses wireless remotes and infrared cameras to allow multiple operators to interact with the game. However, if only two remotes are available, it may be impossible for a third operator to use the game. Because exemplary embodiments utilize the human gesture 40 for calibration, the number of simultaneous operators is limited only by processing power (e.g., the processor 50, memory 54, and the image processing application 52). As long as no operators/players occlude each other from the camera's perspective, exemplary embodiments have no limit to the number of operators that may simultaneously interact with the vision system 20. Even if operator occlusion should occur, multiple cameras may be used (as later paragraphs will explain). Exemplary embodiments may thus be quickly scaled to a large number of operators, thus opening up any software application to a more “social” environment, such as interactive voting for a game show (each operator could gesture a thumbs-up or thumbs-down movement), collaborative puzzle solving (each operator could work on a different part of the display device 56), or more traditional collaborative sports games (tennis, ping-pong, etc.).
Exemplary embodiments may also provide remote collaboration and teleconferencing. Because exemplary embodiments may be scaled to any number of operators, exemplary embodiments may include remote collaboration. Contrary to existing teleconferencing solutions, exemplary embodiments do not require a physical or virtual whiteboard, device, or other static object to provide operator interaction. Therefore, once an interaction by one operator is recognized, exemplary embodiments may digitally broadcast the operator's interaction to multiple display devices (via their respective command interpreters) to modify remote displays. Remote calibration thus complements the ability to automatically track operators and to instantly add an unlimited number of operators.
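The broadcast of a recognized interaction to multiple display devices may be sketched as a simple in-process fan-out. In this sketch each command interpreter is modeled as a plain callable and network transport is omitted entirely; the `InteractionBroadcaster` class and the fields of the interaction record are hypothetical and are not drawn from the disclosure.

```python
from typing import Callable, Dict, List

Interaction = Dict[str, object]
CommandInterpreter = Callable[[Interaction], None]


class InteractionBroadcaster:
    """Forwards each recognized interaction to every registered command interpreter."""

    def __init__(self) -> None:
        self._interpreters: List[CommandInterpreter] = []

    def register(self, interpreter: CommandInterpreter) -> None:
        """Add the command interpreter of one (local or remote) display device."""
        self._interpreters.append(interpreter)

    def broadcast(self, interaction: Interaction) -> int:
        """Deliver a copy of the interaction to every interpreter.

        Returns the number of interpreters notified.
        """
        for interpreter in self._interpreters:
            # Each display receives its own copy so one cannot alter another's data.
            interpreter(dict(interaction))
        return len(self._interpreters)


# Two displays, each modeled here as a list that records received commands.
local_display: List[Interaction] = []
remote_display: List[Interaction] = []
broadcaster = InteractionBroadcaster()
broadcaster.register(local_display.append)
broadcaster.register(remote_display.append)
broadcaster.broadcast({"operator": 1, "command": "click", "x": 960, "y": 540})
```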
As earlier paragraphs mentioned, exemplary embodiments may utilize any gesture.
Exemplary embodiments may be physically embodied on or in a computer-readable storage medium. This computer-readable medium may include CD-ROM, DVD, tape, cassette, floppy disk, memory card, and large-capacity disks. This computer-readable medium, or media, could be distributed to end-subscribers, licensees, and assignees. These types of computer-readable media, and other types not mentioned here, are considered within the scope of the exemplary embodiments. A computer program product comprises processor-executable instructions for calibrating, interpreting, and commanding vision systems, as explained above.
While the exemplary embodiments have been described with respect to various features, aspects, and embodiments, those skilled and unskilled in the art will recognize the exemplary embodiments are not so limited. Other variations, modifications, and alternative embodiments may be made without departing from the spirit and scope of the exemplary embodiments.
This application is a continuation of U.S. application Ser. No. 15/293,346 filed Oct. 4, 2016 and since issued as U.S. Pat. No. 9,933,856, which is a continuation of U.S. application Ser. No. 14/499,096 filed Sep. 27, 2014 and since issued as U.S. Pat. No. 9,483,690, which is a continuation of U.S. application Ser. No. 12/944,897 filed Nov. 12, 2010 and since issued as U.S. Pat. No. 8,861,797, with all applications incorporated herein by reference in their entireties.
Number | Date | Country |
---|---|---|
20180188820 A1 | Jul 2018 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15293346 | Oct 2016 | US |
Child | 15910277 | | US |
Parent | 14499096 | Sep 2014 | US |
Child | 15293346 | | US |
Parent | 12944897 | Nov 2010 | US |
Child | 14499096 | | US |