The invention relates generally to computer systems, and more particularly to controlling computer systems that have connected cameras.
The use of cameras with a personal computer system (computer cameras) is becoming commonplace. Such computer cameras, often referred to as “webcams” because many users use computer cameras for sending live video over the web, may be built into a personal computer, or may be added later, such as via a USB (universal serial bus) connection. Add-on computer cameras may be positioned on small stands, but are typically clipped to the user's monitor.
Computer cameras may be used in conjunction with software for face-tracking, in which the camera can adjust itself to essentially follow around a user's face. For example, face detection is described in U.S. patent application Ser. No. 10/621,260 filed Jul. 16, 2003, entitled “Robust Multi-View Face Detection Methods and Apparatuses.” Moreover, U.S. patent application Ser. No. 10/154,892 filed May 23, 2002, entitled “Head Pose Tracking System,” describes a mechanism by which not only may a user's face be tracked, but parallax is adjusted using mathematical correction techniques so that when a user having a video conference looks at a display monitor to view others' images, the appearance is that of the user looking into the camera rather than looking down (typically) at the monitor. This reduction in parallax provides a better user experience, because among other reasons, the appearance of looking down or away (even though actually looking at them in the display) from people during a conversation has many negative connotations, whereas maintaining eye contact has positive connotations. These patents are assigned to the assignee of the present invention and hereby incorporated by reference.
Other software is being improved for the purposes of performing pose detection, which is directed towards determining a user's general viewing direction, e.g., whether a user is generally looking at a computer camera (or some other fixed point), or is looking elsewhere. Gaze detection, another evolving technology, is generally directed towards determining more precisely where a user is looking among variable locations, e.g., at what part of a display.
While software is thus evolving to improve users' experiences and interactions with cameras, there are a number of non-camera related computing tasks and problems that could be improved by the visual detection capabilities of a computer camera and presence detection, pose detection and/or gaze detection software. What is needed is a set of software-based mechanisms that leverage the visual detection capabilities of a computer camera to improve a user's overall computing experience.
Briefly, the present invention provides a system and method that uses one or more computer cameras, along with visual cues based on presence detection, pose detection and/or gaze detection software, to improve a user's overall computing experience with respect to performing a number of non-camera related computing tasks. To this end, by detecting via visual cues as to whether and/or where a user is looking at a point such as a display monitor, one or more computer operating states may be changed to accomplish non-camera related computing tasks. Examples include better management of power consumption by reducing power when the user is not looking at the display, turning voice recognition on and off based on where the user is looking, faster-perceived startup by resuming from lower-power states based on user presence, different application program behavior, and other improvements. Visual cues may be used alone or in conjunction with other criteria, such as the current operating context and possibly other sensed data. For example, the time of day may be a factor in sensing motion, possibly including turning the camera on (which may be turned off after some time with no motion sensed) to again look for motion, such as to wake a computer system into a higher-powered state in anticipation of usage as soon as motion is sensed at the start of a workday.
In one example implementation, pose tracking may be used to control power consumption of a computer system, which is particularly beneficial for mobile computers running on battery power. In general, while presence detection may be used to turn the computer system's display on or off to save power, more specific visual cues such as pose detection can turn the display off or otherwise reduce its power consumption when the user is present, but not looking at the display. Other power-consuming resources such as processor, hard disk, and so on may be likewise controlled based on the current orientation of the user's face.
Similarly, one of the most significant challenges to speech recognition is determining, without manual input or specific verbal cues, when the user is intending to speak to the computer system/device, as opposed to otherwise just talking. To solve this challenge, the present invention employs visual cues, possibly in conjunction with other data, to determine when the person is likely intending to communicate with the computer or device (versus directing speech elsewhere). More particularly, by knowing via visual cues the direction a person is looking when he or she speaks, e.g., generally towards the display monitor or not, a mechanism running on a computer can determine if the user is likely intending to control the computer via voice commands or is directing the speech elsewhere.
In one implementation, pose detection, which may be trained, determines whether the user is considered as generally looking towards a certain point, typically the computer system's display. With this information, an architecture such as one incorporated into the computer's operating system utilizes the camera to process images of the user's face to obtain visual cues, by analyzing the user's face and the orientation of the face relative to the display, and may also obtain other information, such as by detecting key presses, mouse movements and/or speech. This information may be used by various logic to determine whether a user is interacting with a computer system, and thereby decide actions to take, including power management and speech handling.
Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which:
BRIEF DESCRIPTION OF THE DRAWINGS
Exemplary Operating Environment
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 110 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 110 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 110. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, discussed above and illustrated in
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160 or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
State Changes Based on Detected Visual Cues
The present invention is generally directed towards a system and method by which a computer system is controlled based on detected visual cues. The visual cues establish whether a user is present at a computer system, is physically looking at something (typically the computer system's display) indicative of intended user interaction with the computer system, and/or is looking at a more specific location. As will be understood, numerous ways to implement the present invention are feasible, and only some of the alternatives are described herein. For example, the present invention is highly advantageous with respect to reducing power consumption, as well as with activating/deactivating speech recognition; however, many other uses are feasible, and may be left up to specific application programs.
As will be understood, for obtaining visual cues, the present invention leverages existing video-based presence detection, pose detection and/or gaze detection technology to determine a user's intent with respect to interaction with a computer system. Thus, the examples set forth herein are representative of current ways to implement the present invention, each of which will continue to provide utility as these technologies evolve. As such, the present invention is not limited to any particular examples used herein, but rather may be used in various ways that provide benefits and advantages in computing in general.
Moreover,
In one implementation, an eye-spacing algorithm may be employed. Such an eye-spacing algorithm may be generic to apply to many users, or trained via a training mechanism 202 (e.g., of the operating system 134) for a particular user's face. For example, training may occur by having the user position his or her face in a typical location in front of a display during usage, and commanding a detection computation mechanism 204 through a suitable user interface (UI) to learn the face's characteristics. The user may be instructed to turn his or her head to the maximum angles that should be considered looking at the display 191, in order to train the detection computation mechanism 204 with suitable angular limits. Note that the examples described herein describe angles relative to the center of the display 191, rather than to the camera 164, although a user can set whatever point is desired as the center, and may set any suitable limits. Further, note that the position of the eyes within a facial image is detectable, and thus the spacing may be measured in any number of ways, including by blink detection, by detection of the pupils via contrast, by “red-eye” detection based on reflection, and so forth.
Once the facial image is captured and learned, the eye spacing (d) is measured relative to the head height (h), e.g., (d)/(h). As represented in
Whenever the user's head turns beyond a certain angle off-center relative to the display screen, which may be user-calibrated as described above, then the currently measured and normalized eye spacing value indicates to the detection computation mechanism 204 that the user's face is no longer positioned so as to be looking at the display 191. Note that by sampling at a rate that is faster than a user's head can turn, or by using other facial characteristics, it is known whether the user has turned left or right. This is useful for non-centered cameras as in
Thus, in the example of
In actual operation (following training), an event or the like indicative of whether the user is looking towards the display 191 or away from it may be output by the detection computation mechanism 204, such as whenever a transition is detected, for consumption by state change logic 206. Alternatively, the state change logic 206 may poll for position information, which has the advantage of not having to use processing power for facial processing (e.g., pose detection) except when actually needed. Note that for purposes of simplicity herein, one alternative aspect of the present invention is in part described via a polling model that obtains a True versus False result. However it is understood that any way of obtaining the information is feasible, including that the detection computation mechanism 204 may use the information itself to take action, e.g., the detection computation mechanism 204 may incorporate the state change logic 206. Further, the detection computation mechanism 204 may use or return an actual (e.g., offset-adjusted) degree value, possibly signed or the like to indicate left or right, so that for example, different decisions may be made based on certainty of looking away versus looking towards, that is, not simply True versus False, but a finer-grained decision.
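The eye-spacing cue and the True versus False polling model described above can be illustrated with a minimal sketch. It is not taken from the described implementation: the function names are hypothetical, and the assumption that the projected eye spacing shrinks roughly with the cosine of the head-turn angle is a simplification introduced here for illustration only. The frontal ratio and the angular limit stand in for values that would be learned during training.

```python
import math


def normalized_eye_spacing(eye_distance_px: float, head_height_px: float) -> float:
    """Return the eye spacing d measured relative to the head height h, i.e. d/h."""
    if head_height_px <= 0:
        raise ValueError("head height must be positive")
    return eye_distance_px / head_height_px


def estimate_turn_degrees(current_ratio: float, frontal_ratio: float) -> float:
    """Estimate the unsigned head-turn angle from how much d/h has shrunk.

    Assumption (not from the source): the projected eye spacing scales with
    cos(angle), so angle = acos(current / frontal).
    """
    shrink = max(0.0, min(1.0, current_ratio / frontal_ratio))
    return math.degrees(math.acos(shrink))


def is_looking(current_ratio: float, frontal_ratio: float,
               max_turn_degrees: float) -> bool:
    """Polling-style result: True while the estimated turn is within the trained limit."""
    return estimate_turn_degrees(current_ratio, frontal_ratio) <= max_turn_degrees
```

Because this estimate is unsigned, it cannot by itself say whether the head turned left or right; as noted above, that would have to come from sampling over time or from other facial characteristics.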
As described below, other criteria may be used to assist the state change logic 206 in making its decision, including user settings for example, or other operating system internal (e.g., time-of-day) input data and/or external data (e.g., whether the user is using a telephone). For example, input information such as mouse or keyboard-based input also indicates that a user is interacting with the computer system, and may thus supplant the need for pose detection, or enhance the pose detection data in the state change logic's decision making process.
To determine interaction, step 404 evaluates whether there is detected mouse movement, while step 406 evaluates whether the keyboard is being used. Note that such mechanisms currently exist today for screensaver control/power management, and may include timing considerations, e.g., whether the mouse is moving or has moved in the last N seconds, so that movement at the exact instant of evaluation is not required. In this simplified example, if mouse movement or keyboard usage is detected at steps 404 or 406, respectively, then the result is True at step 410, that is, the user is interacting with the computer system.
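The interaction test can be sketched as follows, covering the mouse and keyboard checks of steps 404 and 406 with an "in the last N seconds" window, and falling back to a visual-cue check (step 408) only when neither input is recent. This is a hedged illustration, not the described implementation: the class name, method names, default window, and the injected clock are all assumptions made for the example.

```python
import time


class InteractionMonitor:
    """Hypothetical helper that answers 'is the user interacting?'."""

    def __init__(self, window_seconds=5.0, clock=time.monotonic):
        self.window_seconds = window_seconds  # the "last N seconds" window
        self.clock = clock                    # injectable so it can be tested
        self.last_mouse = None
        self.last_key = None

    def note_mouse(self):
        """Record that mouse movement was observed now."""
        self.last_mouse = self.clock()

    def note_key(self):
        """Record that a key press was observed now."""
        self.last_key = self.clock()

    def _recent(self, stamp):
        return stamp is not None and self.clock() - stamp <= self.window_seconds

    def is_interacting(self, is_looking_at_display):
        """Return True or False, in the manner of steps 410 and 412.

        is_looking_at_display is a callable, so the comparatively expensive
        visual-cue processing runs only when the cheaper checks fail.
        """
        if self._recent(self.last_mouse):      # step 404
            return True
        if self._recent(self.last_key):        # step 406
            return True
        return bool(is_looking_at_display())   # step 408
```

Passing the pose check as a callable is a design choice of this sketch; it mirrors the point made earlier that facial processing need not consume processing power except when actually needed.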
In accordance with an aspect of the present invention, if the user is not physically interacting at steps 404 or 406, step 408 is executed to determine whether the user is looking at the screen. As described above, visual cues are used in this determination. If so, the result is True at step 410; otherwise, the result is False at step 412. Note that speech detection may likewise be included as a test for interaction, however as described below with reference to
Returning to
Turning to power management, it is well known that with current mobile computing technology, a significant power consumer is the display subsystem 312, including the LCD screen, backlight, and associated electronics, consuming on the order of up to forty percent of the power, and thereby being a major limiting factor of battery life. Thus, power conservation is particularly valuable in preserving battery life on mobile devices. However, power management also provides benefits with non-battery powered computer systems, including cost and environmental benefits resulting from conservation of electricity, prolonged display life, and so forth.
Contemporary operating systems attempt to ascertain user presence by the delay between keyboard or mouse presses, and attempt to save power by turning off the display when the user is deemed not present. However, the use of keyboard and mouse activity is a very unreliable method of detecting presence, often resulting in the display being turned off while a person is reading (e.g., an email message) but not physically interacting with an input device, or conversely resulting in the display being left on while the user is not even viewing it.
In accordance with an aspect of the present invention, there is provided a generalized method of managing power based on visual cues, by detecting user presence, pose and/or gaze. Visual cues are used to reduce power consumption, as well as improve the user's power-related computing experience by more intelligently controlling display power or other resource power. This may be accomplished in any number of ways, including modes that are configurable by the user's preferences and settings 310.
As one example of usage, whenever a user looks away from the display, the detection subsystem can dim or blank the screen by providing information to the display subsystem 312, to progressively dim the screen to completely blank or some other minimum limit. Similarly, other power-managed mechanisms as represented in
For example, the presence of a user that is neither typing nor moving the mouse/pointer (and possibly not interacting by speaking into the microphone) may be used as input, in conjunction with visual cues that indicate the user is not looking at the display, to turn off the display or fade the display to a lower-power setting. This information may also be used to control other power-managed mechanisms 314, such as to slow the processor speed, and so forth.
Other modes are possible. For example, when visual cues indicate that a user is not looking but is otherwise still interacting, e.g., typing, a mode may be triggered in which the display may be slowly dimmed to some lowered level, but no other action taken, which works well with users that are touch (sight) typists that look at the data to enter rather than the display, perhaps glancing occasionally at the display. In another possible mode, looking at the display while there is an open program window may be used to assume the user is reading, and thus in such a situation the lack of keyboard and mouse interaction may not be used as criteria to turn off the display. In another mode, a user or default (e.g., maximum battery) power setting may configure a machine such that simply looking away at any time may fade the display out (dim, slower refresh rate, lower color depth, change the color scheme and so on), while looking towards the display may fade the display in. Thus, depending on the aggressiveness of a given mode's power settings, visual cues may do different things, including dim the display or turn the display subsystem 312 completely off or on.
If the result is True as evaluated at step 502, that is, the user is interacting, step 502 branches to step 504 where a determination is made as to whether the power is already at maximum power. If not, the power is increased via step 506 towards the maximum level, otherwise there is no way to increase it and step 506 is bypassed. Note that the increase may be instantaneous, however step 506 allows for a gradual increase. Step 508 represents an optional delay, so that the interaction detection need not be evaluated continuously while the user is working, but rather can be occasionally (intermittently or periodically) checked. If used, the delay at step 508 also facilitates a gradual increase in power, e.g., to fade in the display once looking has resumed, thereby avoiding a sudden flashing effect.
In the event that the result is False, that is, the user is not interacting, step 510 is executed to determine whether the power is already at the minimum limit, e.g., corresponding to a current power settings mode, such as a maximum battery mode. If not, step 512 represents reducing the power, again instantly if desired, or gradually, until some lower limit is reached (which may be mode-dependent). Note that in order to come back when the user again interacts, some interaction detection is still necessary, e.g., the mouse detection, keyboard detection and camera/visual cues detection still need to be running, and thus the power management should not shut down these mechanisms, at least not until a specified (e.g., relatively long) time is reached. Step 514 represents an optional delay (shown as possibly different from the delay of step 508, because the delay times may be different), so that the power reduction may be gradual, e.g., the display will fade out.
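The power adjustment of steps 502 through 514 can be sketched as a single update function that is called once per evaluation. This is an illustrative sketch under stated assumptions: power is modeled as an integer percentage, and the function names, the default limits of 20 and 100, and the step size of 10 are hypothetical rather than taken from the description. The optional delays of steps 508 and 514 are left to the caller, which would wait between successive calls.

```python
def next_power_level(level, interacting, minimum=20, maximum=100, step=10):
    """Return the next power level (percent) after one pass through the logic."""
    if interacting:                              # step 502: result is True
        if level < maximum:                      # step 504: not yet at maximum
            return min(maximum, level + step)    # step 506: raise gradually
        return level                             # already at maximum
    if level > minimum:                          # step 510: not yet at minimum
        return max(minimum, level - step)        # step 512: reduce gradually
    return level                                 # already at the mode's lower limit


def simulate(start, observations, **limits):
    """Apply next_power_level over a sequence of True/False interaction results."""
    levels = []
    level = start
    for interacting in observations:
        level = next_power_level(level, interacting, **limits)
        levels.append(level)
    return levels
```

For instance, `simulate(100, [False, False, True])` fades from 100 to 90 to 80 and then back up to 90. A larger `step` approximates the instantaneous change that the text also allows.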
As mentioned above with reference to
In keeping with the present invention, by using visual cues such as pose detection or gaze detection data, a differentiation may be made between a user that is directing speech towards a computer or is directing speech elsewhere, such as towards someone in the room. In general, if the user is looking directly at the computer it is likely that the user wants to command the device, and thus speech input should be accepted for command and control. Note that speech recognition for dictating to application programs may use visual cues in a similar manner, however when dictating a particular dictation window (e.g., an application window) is open and thus at least this additional information is available for making a decision. In contrast, command and control speech may occur unpredictably and/or at essentially any time.
Step 604 represents determining whether the user is speaking on the telephone. For example, some contemporary computers know when landline or mobile telephones are cradled/active or not, and computer systems that use voice over internet protocol (VOIP) will know whether a connection is active (the same microphone may be used); a ring signal picked up at the microphone followed by a user's traditional answer (e.g., “Hello”) is another way to detect at least incoming calls. Although not necessary to the present invention, detection of phone activity is used herein as an example of an additional criterion that may be evaluated to help in the decision-making process. Other criteria, including sensing a manual control button or the like, recognizing that a dictation or messenger-type program is already active and is using the microphone, and/or detecting a voice cue corresponding to a recognized code word, may be similarly used in the overall decision-making process.
In
If the user is interacting, step 608 branches to step 610 where command and control is activated. Although not shown in
If not known to be using speech for other purposes, step 702 branches to step 704 where pose (or gaze) detection is used to determine whether the user is looking at the display screen. If not, step 704 branches back to step 702 and the process continues waiting, by looping in this example. Note that although processing visual cues consumes resources, the logic of
If at step 704 the user is looking at the screen, step 706 is executed to determine whether the user has begun speaking. If not, the process branches back to loop again. As can be readily appreciated, steps 702, 704 and 706 are essentially waiting for the user to speak what is likely to be a command to the screen. When this set of conditions occurs, step 706 branches to step 708, which sends the speech as data to a speech recognizer for command and control purposes.
Note that depending on the speech command, the command and control may end the process of
Step 710 represents detecting such further speech, which if detected, resets a timer at step 712 and returns to step 708 to send the further speech to the speech recognizer. If no further speech is detected within the timer's measured time as evaluated at step 714, the process returns to step 702 to again wait for further speech with a full set of conditions required, including whether the visual cues detected indicate that the user is looking at the computer screen while speaking. Note that the timeout at step 714 may be relatively short, to allow the user to briefly and naturally pause while speaking (by returning to step 710), without requiring visual cue processing and/or requiring that the user look at the screen the entire time he or she is entering (a possibly lengthy set of) verbal commands.
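The gating logic of steps 702 through 714 can be sketched as a small state machine that decides, for each audio sample, whether speech should be forwarded to the recognizer. This is a hedged sketch only: the class name, the method signature, the two-second pause timeout, and the representation of time as a number supplied by the caller are assumptions made for the example, and the speech recognizer itself is not modeled.

```python
class CommandSpeechGate:
    """Hypothetical gate deciding when speech is treated as command and control."""

    def __init__(self, pause_timeout=2.0):
        self.pause_timeout = pause_timeout
        self.active_until = None  # None means waiting for the full set of conditions

    def should_forward(self, now, speaking, looking, mic_busy_elsewhere=False):
        """Return True when the current speech should go to the recognizer."""
        if self.active_until is not None:
            if now <= self.active_until:          # step 714: timer not expired
                if speaking:                      # step 710: further speech
                    self.active_until = now + self.pause_timeout  # step 712
                    return True                   # step 708
                return False                      # natural pause; keep waiting
            self.active_until = None              # timed out; back to step 702
        if mic_busy_elsewhere:                    # step 702: e.g. telephone in use
            return False
        if not looking:                           # step 704: pose or gaze check
            return False
        if not speaking:                          # step 706: no speech yet
            return False
        self.active_until = now + self.pause_timeout
        return True                               # step 708
```

Once a command has started, the sketch keeps forwarding speech without consulting `looking` until the pause timer expires, which reflects the point that the user need not look at the screen for the entire time commands are being spoken.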
In this manner, various tasks such as power management and speech recognition are improved via presence detection and/or pose detection. As can be readily appreciated, gaze detection can further improve the handling of computer tasks.
For example, U.S. patent application Ser. No. 10/985,478 describes OLED technology in which individual LEDs can be controlled for brightness; gaze detection can conserve power, such as in conjunction with a power management mode that illuminates only the area of the screen that the user is looking at. Gaze detection can also move relevant data on the display screen. For example, auxiliary information may be displayed on the main display, while other information is turned off. The auxiliary information can move around with the user's eye movements via gaze detection. Gaze detection can also be used to launch applications, change focus, and so forth.
For use with speech recognition, gaze detection can be used to differentiate among various programs to which speech is directed, e.g., to a dictation program, or to a command and control program depending on where on the display the user is currently looking. Not only may this prevent one program from improperly sensing speech directed towards another program, but gaze detection may improve recognition accuracy, in that the lexicon of available commands may be narrowed according to the location at which the user is looking. For example, if a user is looking at a media player program, commands such as “Play” or “Rewind” may be allowed, while commands such as “Run” would not.
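Narrowing the lexicon by gaze location can be illustrated with a minimal lookup. The sketch assumes that gaze detection has already been resolved to the name of the program being looked at; the target names, the table, and the function name are hypothetical, and only the "Play", "Rewind" and "Run" commands come from the example above.

```python
# Hypothetical mapping from the program under the user's gaze to its commands.
COMMANDS_BY_TARGET = {
    "media_player": {"play", "rewind"},
    "shell": {"run"},
}


def accept_command(word, gaze_target):
    """Return the normalized command if it is allowed for the gazed-at program."""
    allowed = COMMANDS_BY_TARGET.get(gaze_target, set())
    normalized = word.strip().lower()
    return normalized if normalized in allowed else None
```

With this table, `accept_command("Run", "media_player")` returns `None`, so a command meant for another program is not acted upon while the user looks at the media player.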
As can be seen from the foregoing detailed description, there is provided a system and mechanism that leverage the visual detection capabilities of a computer camera to improve a user's overall computing experience. Power management, speech handling and other computing tasks may be improved based on visual cues. The present invention thus provides numerous benefits and advantages needed in contemporary computing.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.