The technology disclosed relates to highly functional/highly accurate motion sensory control devices capable of capturing and providing images to motion capture systems that detect gestures in a three dimensional (3D) sensory space for use in automotive and industrial control systems.
The first dashboard consisted of a board placed in front of the driver of a carriage to shield from debris cast off the horses' hooves. As vehicles became more complex, and mechanical motive power supplanted the horse, controls for various systems (environmental, safety, entertainment and so forth) proliferated. The dashboard was retained as a convenient place for various controls. The operator's attention must be removed from the road (or runway, rail or sea-lane) to “hunt” for the knob or switch, hopefully labelled in his or her own language. In the 1970s, replacing English language labels with international symbols made the dashboard equally unintelligible to everyone everywhere. The need for a more simplified interface became apparent and joysticks, keyboards or keypads, glass cockpits, and so forth were pressed into service. But complexity—and confusion—proliferated.
Some have looked to capturing motions of the operator's hands (gestures) and interpreting the motions to provide commands. Some rudimentary efforts by SoftKinetic and others include inferring motion from shadows passing over simple photo-detector sensors. Unfortunately, such systems tend to be prone to false positives. The sensor cannot discriminate between the operator's hand and the wagging of the tail of the family dog. Changing and uncontrollable lighting situations, background objects, glare, reflections and so forth pose further challenges to the use of optical components. To date, such considerations have limited the deployment and use of motion capture technology in the vehicle cabin to little more than non-functional pipe dreams.
Implementations of the technology disclosed address these and other problems by providing an embeddable motion sensory control device capable of acquiring imaging information of a scene and providing at least a near real time (i.e., sufficiently fast that any residual lag between the scene change and the system's response is unnoticeable or practically insignificant) stream of imaging information to a motion capture or image analyzer that detects gestures in a three dimensional (3D) sensory space, interprets each gesture as a command to a system or machine under control, and issues the command when appropriate. The device can be embedded in a wide variety of machines or systems.
In a representative implementation, an embeddable motion sensory control device is provided that includes a plurality of imaging sensors arranged on a first portion that provide stereoscopic imaging information for a scene being viewed. One or more illumination sources arranged on a second portion are also included. A controller is coupled to the imaging sensors and illumination sources to control operation thereof, acquiring imaging information of a scene, and providing at least a near real time stream of the imaging information to a system or device under control.
Advantageously, some implementations can provide improved user experience, greater safety and improved functionality. Some implementations can enable motion capture or image analysis systems to recognize gestures, thereby enabling an operator to control a device or system, such as a vehicle or vehicle subsystem, by intuitive gesture sets. Some implementations can provide improved interfacing and/or control with a variety of machines (e.g., aircraft, automobiles, trains, fork lifts, ships and so forth). Devices can be embedded within the machine under control and can work cooperatively with a proxy or supporting device (smart telephones; portable computing systems, including laptops, tablet computing devices and personal data assistants; special purpose visualization computing machinery, including heads up displays (HUD) and wearable virtual and/or augmented reality systems, including Google Glass and others; graphics processors; embedded microcontrollers; gaming consoles; or the like; wired or wirelessly coupled networks of one or more of the foregoing; and/or combinations thereof). Device implementation can obviate or reduce the need for contact-based input devices such as a mouse, joystick, touch pad, or touch screen. Some implementations can provide for improved interface with computing and/or other machinery than would be possible with heretofore known techniques. In some implementations, a richer human-machine interface experience can be provided.
Other aspects and advantages of the present technology can be seen on review of the drawings, the detailed description and the claims, which follow.
The illumination board 172 has a number of individually controllable illumination sources 108, 110, which can be LEDs or other sources, embedded thereon. Two cameras 102, 104 provide stereoscopic image-based sensing and reside on the main board 182 of device 100 in the illustrated implementation. The main board 182 may also include a processor conducting basic image processing, control of the cameras 102, 104 and the sources 108, 110.
Stereoscopic imaging information provided by cameras 102, 104 can be provided selectively or continuously to a user by means of a presentation device (a HUD, a dashboard/console mounted display device, wireless transmission to a display associated with a portable device, or a wearable appliance such as an HMD). The device 100 can provide live real time or near real time image information from the cameras, real time or near real time imaging information augmented by computer generated graphics, information, icons or other virtualized presentations, virtualized representations of the scene being viewed, and/or time varying combinations selected therefrom. Gestures made by a user are sensed by the cameras 102, 104 of the sensory device 100, and the resulting imaging information can be provided to a motion capture or image analysis system to identify and determine commands to a system. Advantageously, integrating scanning with imaging capabilities into a single motion sensory device 100 provides a highly functional, flexible, yet compact device suited to installation in machines with limited space, such as vehicles, appliances, portable or wearable electronic devices, and so forth.
Some of the illumination sources 108, 110 can have associated focusing optics. In this example, six LEDs 108 (four of which are arranged at the center and two of which flank the board 172 at the sides) have focusing lenses, and ten additional LEDs 110 (which are arranged in columns of two, three, three, and two LEDs, respectively) are without focusing lenses. The board 172 may also include a socket 178 for coupling a photo-detector (or other sensor). During a “scanning” of a region of space into which the illumination sources 108, 110 emit light, the photo-detector senses changes in reflectance, and this information indicates the presence or absence of objects within that region.
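By way of illustration only, the scanning described above can be sketched as lighting each source in turn and comparing the photo-detector reading with an empty-scene baseline. The function name, the `read_reflectance` callable, the threshold and the simulated readings are all hypothetical and are not taken from the device firmware.

```python
from typing import Callable, List, Sequence


def scan_region(
    source_ids: Sequence[int],
    read_reflectance: Callable[[int], float],
    baseline: Sequence[float],
    threshold: float = 0.2,
) -> List[int]:
    """Return the ids of sources whose reflected light changed noticeably.

    read_reflectance(source_id) is a hypothetical stand-in for lighting one
    source and sampling the photo-detector while only that source is on.
    """
    changed = []
    for index, source_id in enumerate(source_ids):
        reading = read_reflectance(source_id)
        # A change relative to the empty-scene baseline suggests an object
        # is present in the part of the region this source illuminates.
        if abs(reading - baseline[index]) > threshold:
            changed.append(source_id)
    return changed


# Simulated detector: an object in front of source 2 raises its reflectance.
simulated = {0: 0.10, 1: 0.12, 2: 0.85, 3: 0.11}
hits = scan_region([0, 1, 2, 3], simulated.__getitem__, [0.10, 0.10, 0.10, 0.10])
```

In this sketch only source 2 is reported, because its reading departs from the baseline by more than the assumed threshold.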
Various modifications of the design shown in
Now with reference to
In some implementations, sensory system 200 is capable of separating information received from pixels of cameras 102, 104 sensitive to IR light from information received from pixels sensitive to visible light, e.g., RGB (red, green, and blue) and processing these two types of image information separately. For example, IR (infrared) images can be used for gesture recognition while RGB (visible light) images can be used for a live video feed via a presentation interface. In this example, a video stream including a sequence of images of a scene in the real world can be captured using cameras having a set of RGB pixels and a set of IR pixels. Information from the IR sensitive pixels is separated out for processing to recognize gestures. Information from the RGB sensitive pixels is provided to a presentation interface (HUD, HMD, etc.) of a host device as a live video feed to a presentation output. The presentation output is displayed to a user. One or more virtual objects can be integrated with the video stream images to form the presentation output. Accordingly, the sensory system 200 can provide any of gesture recognition, a real world presentation of real world objects via pass through video feed, and/or an augmented reality including virtual objects integrated with a real world view.
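The separation of IR and RGB pixel information can be illustrated with a minimal sketch. It assumes, purely for illustration, that each pixel arrives as a (red, green, blue, infrared) tuple; actual sensor layouts and the function name `split_rgb_ir` are not specified in this disclosure.

```python
from typing import List, Tuple

# Assumed pixel layout for this sketch: (red, green, blue, infrared).
Pixel = Tuple[int, int, int, int]


def split_rgb_ir(frame: List[List[Pixel]]):
    """Split one RGB-IR frame into an RGB image and an IR image."""
    rgb = [[(r, g, b) for (r, g, b, _ir) in row] for row in frame]
    ir = [[ir_value for (_r, _g, _b, ir_value) in row] for row in frame]
    return rgb, ir


frame = [
    [(10, 20, 30, 200), (11, 21, 31, 5)],
    [(12, 22, 32, 7), (13, 23, 33, 250)],
]
rgb_image, ir_image = split_rgb_ir(frame)
# rgb_image would feed the presentation interface as live video;
# ir_image would feed gesture recognition.
```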
Cameras 102, 104 are preferably capable of capturing video images (i.e., successive image frames at a constant rate of at least 15 frames per second), although no particular frame rate is required. The capabilities of cameras 102, 104 are not critical to the technology disclosed, and the cameras can vary as to frame rate, image resolution (e.g., pixels per image), color or intensity resolution (e.g., number of bits of intensity data per pixel), focal length of lenses, depth of field, etc. In general, for a particular application, any cameras capable of focusing on objects within a spatial volume of interest can be used. For instance, to capture motion of the hand of an otherwise stationary person, the volume of interest can be defined as a cube approximately one meter on a side. In some implementations, as illustrated by sensor 200-1, the cameras 102, 104 are disposed opposite the motion to be detected, e.g., where the hand 214 is expected to move. In this location, the amount of information recorded about the hand is proportional to the number of pixels it occupies in the camera images, and the hand will occupy more pixels when the camera's angle with respect to the hand's “pointing direction” is as close to perpendicular as possible. In an alternative implementation, shown by sensor 200-3, the sensor is disposed along the motion detected, e.g., where the hand 214 is expected to move.
In some implementations, the one or more sources 108, 110 can be disposed to illuminate a region of interest 212 that contains one or more portions of the operator's (or occupant's) body (in this example, a hand 214), which may optionally hold a tool or other object of interest, and cameras 102, 104 are oriented toward the region 212 to capture video images of the hand 214. The operation of light sources 108, 110 and cameras 102, 104 is controlled by sensory-analysis system 206, which can be a computer system, control logic implemented in hardware and/or software, or combinations thereof. Based on the captured images, sensory-analysis system 206 determines the position and/or motion of object 214.
In one implementation, the sources 108, 110 are infrared light sources. For example, the light sources can be infrared light-emitting diodes (LEDs), and cameras 102, 104 can be sensitive to infrared light. Use of infrared light can allow the system 200 to operate under a broad range of lighting conditions and can avoid various inconveniences or distractions that may be associated with directing visible light into the region where the person is moving. However, no particular wavelength or region of the electromagnetic spectrum is required. In one implementation, filters 221, 222 are placed in front of cameras 102, 104 to filter out extraneous light so that only the light provided by sources 108, 110 is registered in the images captured by cameras 102, 104. In one implementation, the system selectively chooses to process visible (RGB) information or infrared (IR) information from cameras 102, 104 differently, separately or in conjunction with one another to adjust operation of system 200 to varying ambient conditions.
In another implementation, one or more sonic transducers 215, 217 are sonic sources that send sonic energy and detect reflected sonic energy, used in conjunction with, or instead of, cameras 102, 104 and light sources 108, 110. The sonic sources transmit sound waves to the user; the user either blocks the sound waves (“sonic shadowing”) or alters them (“sonic deflections”) as they impinge upon her. Such sonic shadows and/or deflections can also be used to detect the user's gestures and/or provide presence information and/or distance information using ranging techniques known in the art. In some implementations, the sound waves are, for example, ultrasound, which is not audible to humans. Alternatively, lasers or other radiation emitting devices can be used to detect position, presence or both of hand 214.
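One common ranging technique of this kind, shown here only as a hedged sketch, estimates distance from the round-trip time of an echo: the sound travels to the object and back, so the one-way distance is half the product of speed and elapsed time. The function name and the example timing are hypothetical; the speed of sound used is an approximate value for dry air at about 20 °C.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value for dry air at about 20 °C


def echo_distance_m(round_trip_s: float, speed: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Estimate the one-way distance to a reflecting object from an echo delay."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse covers the distance twice (out and back), hence the division by 2.
    return speed * round_trip_s / 2.0


# Hypothetical example: an echo returning after 4 ms.
distance = echo_distance_m(0.004)
```

With the assumed 4 ms delay the estimate is roughly 0.69 m, which falls inside the one-meter volume of interest mentioned earlier.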
The illustrated system 200 can include any of various other sensors not shown in
It should be stressed that the arrangement shown in
The computing environment can also include other removable/non-removable, volatile/nonvolatile computer storage media. For example, a hard disk drive can read or write to non-removable, nonvolatile magnetic media. A magnetic disk drive can read from or write to a removable, nonvolatile magnetic disk, and an optical disk drive can read from or write to a removable, nonvolatile optical disk such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The storage media are typically connected to the system bus through a removable or non-removable memory interface.
Processor 342 can be a general-purpose microprocessor, but depending on implementation can alternatively be a microcontroller, peripheral integrated circuit element, a CSIC (customer-specific integrated circuit), an ASIC (application-specific integrated circuit), a logic circuit, a digital signal processor, a programmable logic device such as an FPGA (field-programmable gate array), a PLD (programmable logic device), a PLA (programmable logic array), an RFID processor, smart chip, or any other device or arrangement of devices that is capable of implementing the actions of the processes of the technology disclosed.
Sensor interface 336 can include hardware, firmware and/or software that enables communication between computer system 300 and cameras 102, 104 shown in
Sensor interface 336 can also include controllers 347, 349, to which light sources (e.g., light sources 108, 110) can be connected. In some implementations, controllers 347, 349 provide operating current to the light sources, e.g., in response to instructions from processor 342 executing mocap program 344. In other implementations, the light sources can draw operating current from an external power supply, and controllers 347, 349 can generate control signals for the light sources, e.g., instructing the light sources to be turned on or off or changing the brightness. In some implementations, a single controller can be used to control multiple light sources.
Instructions defining mocap program 344 are stored in memory 334, and these instructions, when executed, perform motion-capture analysis on images supplied from cameras connected to sensor interface 336. In one implementation, mocap program 344 includes various modules, such as an object detection module 352, an object analysis module 354, and a gesture-recognition module 356. Object detection module 352 can analyze images (e.g., images captured via sensor interface 336) to detect edges of an object therein and/or other information about the object's location. Object analysis module 354 can analyze the object information provided by object detection module 352 to determine the 3D position and/or motion of the object (e.g., a user's hand). In some implementations, object analysis module 354 can also analyze audio signals (e.g., audio signals captured via interface 336) to localize the object by, for example, time difference of arrival, multilateration or the like. (“Multilateration is a navigation technique based on the measurement of the difference in distance to two or more stations at known locations that broadcast signals at known times.” See Wikipedia, at http://en.wikipedia.org/w/index.php?title=Multilateration&oldid=523281858, on Nov. 16, 2012, 06:07 UTC.) Examples of operations that can be implemented in code modules of mocap program 344 are described below. Memory 334 can also include other information and/or code modules used by mocap program 344, such as an application platform 366 that allows a user to interact with the mocap program 344 using different applications like application 1 (App1), application 2 (App2), and application N (AppN).
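To make the audio localization idea concrete, the following sketch applies a time difference of arrival to a deliberately simplified one-dimensional case: two microphones at known positions on a line and a sound source assumed to lie between them. This geometry, the function name and the numbers are assumptions made for illustration; they are not the method used by object analysis module 354, and a practical system would solve the problem in two or three dimensions with more receivers.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value for dry air at about 20 °C


def locate_on_line(
    mic_a_m: float,
    mic_b_m: float,
    arrival_a_s: float,
    arrival_b_s: float,
    speed: float = SPEED_OF_SOUND_M_PER_S,
) -> float:
    """Estimate a source position on the line between two microphones.

    Assumes mic_a_m < source < mic_b_m, so that
    (source - mic_a_m) - (mic_b_m - source) = speed * (arrival_a_s - arrival_b_s).
    Only the difference of the arrival times is used, so the unknown
    emission time cancels out.
    """
    if mic_a_m >= mic_b_m:
        raise ValueError("mic_a_m must be less than mic_b_m")
    range_difference_m = speed * (arrival_a_s - arrival_b_s)
    return (mic_a_m + mic_b_m + range_difference_m) / 2.0


# Hypothetical example: microphones at 0 m and 1 m, true source at 0.3 m,
# sound emitted at an arbitrary time of 1.0 s.
emit_s = 1.0
arrival_a = emit_s + 0.3 / SPEED_OF_SOUND_M_PER_S
arrival_b = emit_s + 0.7 / SPEED_OF_SOUND_M_PER_S
estimate = locate_on_line(0.0, 1.0, arrival_a, arrival_b)
```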
Presentation I/F 238, speakers 339, microphones 340, and optional wireless interface 341 can be used to facilitate user or system interaction with computer system 300. In some implementations, results of gesture capture using sensor interface 336 and mocap program 344 can be interpreted as user input. For example, a user can perform hand gestures that are analyzed using mocap program 344, and the results of this analysis can be interpreted as an instruction to some other program executing on processor 342 (e.g., a web browser, GPS application, dictation program, or other application). Thus, by way of illustration, a user might use upward or downward swiping gestures to “scroll” a webpage currently displayed via presentation I/F 238, use rotating gestures to increase or decrease the volume of audio output from speakers 339, and so on.
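The interpretation of recognized gestures as instructions to another program can be sketched as a simple lookup from gesture names to actions. The gesture names, the `CabinState` class, the volume range and the scroll direction are all hypothetical choices made for this example.

```python
from typing import Callable, Dict


class CabinState:
    """Hypothetical state of the applications being controlled."""

    def __init__(self) -> None:
        self.scroll_offset = 0
        self.volume = 5  # assumed range 0..10


def scroll_up(state: CabinState) -> None:
    state.scroll_offset += 1


def scroll_down(state: CabinState) -> None:
    state.scroll_offset -= 1


def volume_up(state: CabinState) -> None:
    state.volume = min(10, state.volume + 1)


def volume_down(state: CabinState) -> None:
    state.volume = max(0, state.volume - 1)


# Mapping from a recognized gesture name to the instruction it represents.
GESTURE_COMMANDS: Dict[str, Callable[[CabinState], None]] = {
    "swipe_up": scroll_up,
    "swipe_down": scroll_down,
    "rotate_clockwise": volume_up,
    "rotate_counterclockwise": volume_down,
}


def handle_gesture(name: str, state: CabinState) -> bool:
    """Apply the command for a gesture; return False if the gesture is unknown."""
    action = GESTURE_COMMANDS.get(name)
    if action is None:
        return False
    action(state)
    return True


state = CabinState()
handled = [handle_gesture(g, state) for g in ("swipe_up", "rotate_clockwise", "wave")]
```

Returning False for an unrecognized gesture, rather than guessing, is one way to limit the false positives discussed in the background.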
It will be appreciated that computer system 300 is illustrative and that variations and modifications are possible. A particular implementation can include other functionality not described herein, e.g., wired and/or wireless network interfaces, media playing and/or recording system interfaces, etc. In some implementations, one or more cameras can be built into the vehicle or equipment into which the sensor 200 is embedded rather than being supplied as separate components. Further, an image analyzer can be implemented using only a subset of computer system components (e.g., as a processor executing program code, an ASIC, or a fixed-function digital signal processor, with suitable I/O interfaces to receive image data and output analysis results).
While computer system 300 is described herein with reference to particular blocks, it is to be understood that the blocks are defined for convenience of description and are not intended to imply a particular physical arrangement of component parts. Further, the blocks need not correspond to physically distinct components. To the extent that physically distinct components are used, connections between components (e.g., for data communication) can be wired and/or wireless as desired.
With reference to
As a result, the sensory-analysis system 206 can not only recognize gestures for purposes of providing input to the electronic device, but can also capture the position and shape of the user's hand in consecutive video images in order to characterize the hand gesture in 3D space and reproduce it on a display screen, for example via presentation I/F 238, as a rigged hand 99. Rigged hand 99 is determined from model hand 98, which includes a rigged hand overlay 94 covering one or more capsule elements 97 built from the images by the object detection module 352.
In one implementation, and with reference to
In one implementation, the gesture-recognition module 356 compares one or more primitives of the detected gesture to a library of gesture primitives electronically stored as records in a database, which is implemented in the sensory-analysis system 206, the electronic device, or on an external storage system. (As used herein, the term “electronically stored” includes storage in volatile or non-volatile storage, the latter including disks, Flash memory, etc., and extends to any computationally addressable storage media (including, for example, optical storage).) For example, gestures can be stored as vectors, i.e., mathematically specified spatial trajectories, other primitives or combinations thereof, and the gesture record can have a field specifying the relevant part of the user's body making the gesture; thus, similar trajectories executed by a user's hand and head can be stored in the database as different gestures so that an application can interpret them differently.
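A minimal sketch of such a library lookup follows. It assumes that a gesture primitive is a short, fixed-length sequence of 3D points and that similarity is measured by mean squared distance between corresponding points; both choices, along with the record names and the tolerance, are illustrative assumptions rather than the matching scheme of gesture-recognition module 356.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class GestureRecord:
    name: str
    body_part: str  # field specifying which body part makes the gesture
    trajectory: Tuple[Point, ...]


def trajectory_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean squared distance between corresponding points of two trajectories."""
    if len(a) != len(b) or not a:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum((pa - pb) ** 2 for p, q in zip(a, b) for pa, pb in zip(p, q))
    return total / len(a)


def match_gesture(
    observed: Sequence[Point],
    body_part: str,
    library: Sequence[GestureRecord],
    max_distance: float = 0.05,
) -> Optional[str]:
    """Return the closest library gesture for this body part, or None."""
    best_name, best_distance = None, max_distance
    for record in library:
        # Records for other body parts are skipped, so the same trajectory
        # can map to different gestures for a hand and a head.
        if record.body_part != body_part or len(record.trajectory) != len(observed):
            continue
        distance = trajectory_distance(observed, record.trajectory)
        if distance <= best_distance:
            best_name, best_distance = record.name, distance
    return best_name


upward = ((0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 1.0, 0.0))
library = [
    GestureRecord("hand_swipe_up", "hand", upward),
    GestureRecord("head_tilt_up", "head", upward),
]
observed = ((0.0, 0.0, 0.0), (0.02, 0.5, 0.0), (0.0, 0.98, 0.0))
hand_result = match_gesture(observed, "hand", library)
head_result = match_gesture(observed, "head", library)
```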
The number of frame buffers included in a system generally reflects the number of images simultaneously analyzed by the analysis system or module 430, which is described in greater detail below. Briefly, analysis module 430 analyzes the pixel data in each of a sequence of image frames 420 to locate objects therein and track their movement over time (as indicated at 440). This analysis can take various forms, and the algorithm performing the analysis dictates how pixels in the image frames 420 are handled. For example, the algorithm implemented by analysis module 430 can process the pixels of each frame buffer on a line-by-line basis—i.e., each row of the pixel grid is successively analyzed. Other algorithms can analyze pixels in columns, tiled areas, or other organizational formats.
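The line-by-line handling of a frame buffer can be illustrated as follows. The sketch assumes a grayscale frame stored as a list of rows and treats any pixel at or above a brightness threshold as belonging to the object; the threshold, the function name and the sample frame are hypothetical, and the analysis performed by analysis module 430 may differ.

```python
from typing import List, Tuple


def find_row_edges(frame: List[List[int]], threshold: int) -> List[Tuple[int, int, int]]:
    """Scan each row in order and report (row, first_column, last_column)
    of the pixels at or above the threshold, skipping rows with none."""
    edges = []
    for row_index, row in enumerate(frame):
        bright = [col for col, value in enumerate(row) if value >= threshold]
        if bright:
            edges.append((row_index, bright[0], bright[-1]))
    return edges


frame = [
    [0, 0, 0, 0, 0],
    [0, 200, 220, 0, 0],
    [0, 0, 210, 230, 0],
    [0, 0, 0, 0, 0],
]
row_edges = find_row_edges(frame, threshold=128)
```

Comparing the per-row edge columns across successive frames is one simple way to observe movement over time.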
In various implementations, the motion captured in a series of camera images is used to compute a corresponding series of output images for display via the presentation I/F 238. For example, camera images of a moving hand can be translated into a wire-frame or other graphic depiction of the hand by the processor 342. Alternatively, hand gestures can be interpreted as input used to control a separate visual output; by way of illustration, a user can use upward or downward swiping gestures to “scroll” a webpage or other document currently displayed, or open and close her hand to zoom in and out of the page. In any case, the output images are generally stored in the form of pixel data in a frame buffer, e.g., one of the frame buffers 415. A video display controller reads out the frame buffer to generate a data stream and associated control signals to output the images via the presentation I/F 238. The video display controller can be provided along with the processor 342 and memory 334 on-board the motherboard of the computer 300, and can be integrated with the processor 342 or implemented as a co-processor that manipulates a separate video memory. As noted, the computer 300 can be equipped with a separate graphics or video card that aids with generating the feed of output images for the presentation I/F 238. The video card generally includes a graphics processing unit (GPU) and video memory, and is useful, in particular, for complex and computationally expensive image processing and rendering. The graphics card can include the frame buffer and the functionality of the video display controller (and the on-board video display controller can be disabled). In general, the image-processing and motion-capture functionality of the system can be distributed between the GPU and the main processor 342 in various ways.
Suitable algorithms for motion-capture program 344 are described below as well as, in more detail, in U.S. Ser. Nos. 61/587,554, 13/414,485, 61/724,091, 13/724,357, and 13/742,953, filed on Jan. 17, 2012, Mar. 7, 2012, Nov. 8, 2012, Dec. 21, 2012 and Jan. 16, 2013, respectively, which are hereby incorporated herein by reference in their entirety. The various modules can be programmed in any suitable programming language, including, without limitation, high-level languages such as C, C++, C#, OpenGL, Ada, Basic, Cobra, FORTRAN, Java, Lisp, Perl, Python, Ruby, or Object Pascal, or low-level assembly languages.
Again with reference to
Acquisition parameters can be applied to the cameras 402, 404 and/or to the frame buffers 415. The cameras 402, 404, for example, can respond to acquisition parameters by acquiring images at a commanded rate, or the number of acquired frames passed (per unit time) to the frame buffers 415 can instead be limited. Image-analysis parameters can be applied to the image-analysis module 430 as numerical quantities that affect the operation of the contour-defining algorithm.
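Limiting the number of acquired frames passed per unit time can be sketched as a simple decimation step. The example assumes each frame carries a capture timestamp in seconds and that the acquisition parameter is a maximum forwarding rate; the function name, the tolerance and the rates shown are illustrative assumptions.

```python
from typing import List, Sequence


def frames_to_forward(timestamps_s: Sequence[float], max_rate_hz: float) -> List[int]:
    """Return indices of frames to pass on so that forwarded frames are
    spaced at least 1 / max_rate_hz seconds apart."""
    if max_rate_hz <= 0:
        raise ValueError("max_rate_hz must be positive")
    min_interval_s = 1.0 / max_rate_hz
    tolerance_s = 1e-9  # absorbs floating-point rounding in the timestamps
    kept: List[int] = []
    last_kept_s = None
    for index, stamp in enumerate(timestamps_s):
        if last_kept_s is None or stamp - last_kept_s >= min_interval_s - tolerance_s:
            kept.append(index)
            last_kept_s = stamp
    return kept


# Hypothetical example: a camera capturing at 60 frames per second,
# limited to 30 frames per second at the frame buffers.
timestamps = [i / 60 for i in range(12)]
forwarded = frames_to_forward(timestamps, max_rate_hz=30)
```

In this example every second frame is forwarded, halving the load on the downstream analysis.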
The desirable values for acquisition parameters and image-analysis parameters appropriate to a given level of available resources can depend, for example, on the characteristics of the image-analysis module 430, the nature of the application utilizing the mocap output, and design preferences. Whereas some image-processing algorithms can trade off a resolution of contour approximation against input frame resolution over a wide range, other algorithms may not exhibit much tolerance at all, requiring, for example, a minimal image resolution below which the algorithm fails altogether.
In one implementation, and with reference to
Projection can include an image or other visual representation of the user's hand 599 and/or one or more optional objects. Objects can include, e.g., objects associated with an application 522, 523, 524, objects representing an operational parameter of the vehicle 521, advertising objects 517, objects representing more abstract things 515, other types of objects, and combination objects. For example, visual projection mechanism 504 of
Alternatively, surface 516 can be a wearable computing device such as Google Glass™ or equivalent connectable wirelessly or by wire to sensory system 500.
Projections for augmented vehicle environments can be differentiated for front and rear seat passengers in an automobile, for example. Front seat passengers can experience clicks, chimes and/or speech feedback responsive to the occupant's gestures. Rear seat passengers can experience clicks, chimes and/or speech feedback on a separate audio channel to headphones or HMDs used by the rear seat passengers (to avoid distracting the driver).
Alternatively, in a driverless automobile implementation, the “driver” no longer drives the vehicle, so the price of distracting the “driver” is less significant. In one such implementation, gestures can be expanded for all front seat passengers to control vehicle (sub)systems. Driverless vehicles can include a larger, more interactive HUD (up to the whole windshield). Gestures can control non-safety related navigation decisions (e.g., overriding determined routing, setting waypoints on a moving map display, choosing rest stops for purposes of rerouting (e.g., bathroom breaks), and so forth).
This application claims the benefit of U.S. Provisional Patent Application No. 62/038,112, entitled “AUTOMOTIVE AND INDUSTRIAL MOTION SENSORY DEVICE”, filed 15 Aug. 2014, the disclosure of which is incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20160048725 A1 | Feb 2016 | US |
Number | Date | Country | |
---|---|---|---|
62038112 | Aug 2014 | US |