As the capabilities of portable computing devices continue to improve, and as users are utilizing these devices in an ever-increasing number of ways, there is a corresponding need to adapt and improve the ways in which users interact with these devices. Certain devices use motions such as gestures or head tracking for input to various applications executing on these devices. While head tracking algorithms perform adequately under certain conditions, there are variations and conditions that can cause these algorithms to perform less accurately than desired, which can lead to false input and user frustration. Further, inaccuracies in face or head tracking can cause developers to shy away from incorporating such input into their applications and devices.
Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:
FIGS. 1(a) and 1(b) illustrate an example environment in which a user can interact with a portable computing device in accordance with various embodiments;
FIGS. 2(a), 2(b), 2(c), 2(d), and 2(e) illustrate an example head tracking approach that can be utilized in accordance with various embodiments;
FIGS. 3(a), 3(b), 3(c), 3(d), 3(e), 3(f), 3(g), and 3(h) illustrate example images that can be used to attempt to determine a face or head location in accordance with various embodiments;
Systems and methods in accordance with various embodiments of the present disclosure overcome one or more of the above-referenced and other deficiencies in conventional approaches to determining and/or tracking the relative position of an object, such as the head or face of a user, using an electronic device. In particular, various embodiments discussed herein provide for the dynamic selection of a tracking template for use in face, head, or user tracking based at least in part upon a state of a computing device, an aspect of the user, and/or an environmental condition. The template used can be updated as the state, aspect, and/or environmental condition changes. Further, in order to reduce the number of false positives as well as the amount of processing capacity needed, in some embodiments a computing device can suspend a tracking process when the device is in a certain orientation, such as upside down, or within a range of such orientations.
Various other functions and advantages are described and suggested below as may be provided in accordance with the various embodiments.
FIG. 1(a) illustrates an example environment 100 in which aspects of the various embodiments can be implemented. In this example, a user 102 is interacting with a computing device 104. During such interaction, the user 102 will typically position the computing device 104 such that at least a portion of the user (e.g., a face or body portion) is positioned within an angular capture range 108 of at least one camera 106, such as a primary front-facing camera, of the computing device. Although a portable computing device (e.g., an electronic book reader, smart phone, or tablet computer) is shown, it should be understood that any electronic device capable of receiving, determining, and/or processing input can be used in accordance with various embodiments discussed herein, where the devices can include, for example, desktop computers, notebook computers, personal data assistants, video gaming consoles, television set top boxes, smart televisions, wearable computers (e.g., smart watches, biometric readers and glasses), portable media players, and digital cameras, among others. In some embodiments the user will be positioned within the angular range of a rear-facing or other camera on the device, although in this example the user is positioned on the same side as a display element 112 such that the user can view content displayed by the device during the interaction.
The ability to determine the relative location of a user with respect to a computing device enables various approaches for interacting with such a device. For example, a device might render information on a display screen based on where the user is with respect to the device. The device also might power down if a user's head is not detected within a period of time. A device also might accept device motions as input, such as to display additional information in response to a movement of the user's head or a tilting of the device (causing the relative location of the user to change with respect to the device). These input mechanisms can thus depend upon information from various cameras (or sensors) to determine things like motions, gestures, and head movement.
In one example, the relative direction of a user's head can be determined using one or more images captured using a single camera. In order to get the relative location in three dimensions, it can be necessary to determine the distance to the head as well. While an estimate can be made based upon feature spacing viewed from a single camera, for example, it can be desirable in many situations to obtain more accurate distance information. One way to determine the distance to various features or points is to use stereoscopic imaging, or three-dimensional imaging, although various other distance or depth determining processes can be used as well within the scope of the various embodiments. For any pair of cameras that have at least a partially overlapping field of view, three-dimensional imaging can be performed by capturing image information for one or more objects from two different perspectives or points of view, and combining the information to produce a stereoscopic or “3D” image. In at least some embodiments, the fields of view can initially be matched through careful placement and calibration, such as by imaging using a known calibration standard and adjusting an optical axis of one or more cameras to have those axes be substantially parallel. The cameras thus can be matched cameras, whereby the fields of view and major axes are aligned, and where the resolution and various other parameters have similar values for each of the cameras. Three-dimensional or stereoscopic image information can be captured using two or more cameras to provide three-dimensional point data, or disparity information, which can be used to generate a depth map or otherwise determine the distance from the cameras to various features or objects. For a given camera pair, a stereoscopic image of at least one object can be generated using the respective image that was captured by each camera in the pair. 
Distance measurements for the at least one object then can be determined using each stereoscopic image.
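As an illustrative sketch of the disparity-based distance determination described above, the following function applies the standard stereo relation (distance = focal length × baseline / disparity) under the assumption of rectified, matched cameras with parallel optical axes; the function name and parameters are hypothetical, not part of the disclosure:

```python
def distance_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Estimate the distance (in meters) to a feature from its stereo
    disparity, assuming rectified cameras with parallel optical axes.

    focal_length_px: camera focal length, expressed in pixels
    baseline_m:      separation between the two cameras, in meters
    disparity_px:    horizontal offset of the feature between the
                     left and right images, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px
```

For example, with an 800-pixel focal length, a 6 cm baseline, and a 96-pixel disparity, the feature would be estimated to lie at 800 × 0.06 / 96 = 0.5 m from the cameras.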
FIGS. 2(a) through 2(e) illustrate an example approach for determining the relative position of a user's head with respect to a computing device. In the situation 200 illustrated in FIG. 2(a), at least one camera of the device captures image information that includes at least a portion of the user's head.
Various approaches to identifying a head or face of a user can be utilized in different embodiments. For example, images can be analyzed to locate elliptical shapes that may correspond to a user's head, or image matching can be used to attempt to recognize the face of a particular user by comparing captured image data against one or more existing images of that user. Another approach attempts to identify specific features of a person's head or face, and then use the locations of these features to determine a relative position of the user's head. For example, an example algorithm can analyze the images captured by the left camera and the right camera to attempt to locate specific features 234, 244 of a user's face, as illustrated in the example images 230, 240.
In many embodiments, a face detection and/or tracking process utilizes an object detector, also referred to as a classifier or object detection template, to detect all possible instances of a face under various conditions. These conditions can include, for example, variations in lighting, user pose, time of day, type of illumination, and the like. A face detector searches for specific features in an image in an attempt to determine the location and scale of one or more faces in an image captured by a camera (or other such sensor) of a computing device. In some embodiments, the incoming image is scanned and each potential sub-window is evaluated by the face detector. Face detector templates will often be trained using machine learning techniques, such as by providing positive and negative training examples. These can include images that include a face and images that do not include a face. Different classifiers can be trained to detect different types or categories of objects, such as faces, bikes, or birds, for example.
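The sub-window scanning described above can be sketched as follows, where a square window is slid over the image at several scales and the accept/reject decision is deferred to an arbitrary trained classifier (supplied here as a callable); the window size, stride, and scale step are illustrative assumptions rather than values from the disclosure:

```python
def scan_for_faces(image, classify, window=24, scale_step=1.25, stride=4):
    """Evaluate sub-windows of `image` at multiple scales.

    `image` is a 2-D list (rows of pixel values); `classify` is any
    trained detector (e.g., a cascade classifier) that returns True
    for a window containing a face-like pattern.
    Returns (x, y, size) tuples for accepted windows.
    """
    h, w = len(image), len(image[0])
    detections = []
    size = window
    while size <= min(h, w):
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                # Extract the sub-window and hand it to the classifier
                sub = [row[x:x + size] for row in image[y:y + size]]
                if classify(sub):
                    detections.append((x, y, size))
        size = int(size * scale_step)
    return detections
```

In practice the classifier would be one of the trained templates discussed herein, and overlapping detections would typically be merged into a single face location.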
The training process in various embodiments requires a very large number of positive (and negative) examples that can cover different variations that are expected to be seen in various inputs. In conventional face tracking applications, for example, there is no a priori knowledge about the type of the face (male vs. female, ethnicity), lighting conditions (indoor vs. outdoor, shadow vs. sunny), or pose of the user, that will likely be present in a particular image. In order to successfully detect faces under a wide range of conditions, the training data generally will contain examples of faces under different view angles, poses, lighting conditions, facial hair, glasses, etc. Increasing the variability in the training data allows the face detector to find faces under these varying conditions. By using a larger range of training data to cover a wide variety of cases, however, the average accuracy level can be decreased, as there can be higher rates of potential false detections. Using a specific set of training data can improve accuracy for a certain class of object or face, for example, but may be less accurate for other classes.
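The trade-off described above, where broader training data covers more conditions at some cost in average accuracy, can be illustrated with a deliberately simplified stand-in for a machine-learned detector: a "template" formed as the mean of positive example feature vectors, which accepts inputs sufficiently close to that mean. The feature vectors and distance threshold below are purely illustrative:

```python
def train_template(positives):
    """Form a toy 'template' as the element-wise mean of positive
    example feature vectors (a simplified stand-in for the trained
    classifiers described above)."""
    n = len(positives)
    dim = len(positives[0])
    return [sum(p[i] for p in positives) / n for i in range(dim)]

def matches(template, features, threshold=1.0):
    """Accept when the squared distance to the template is small.
    A template trained on a narrow class sits close to members of
    that class; one averaged over widely varying data does not."""
    d = sum((t - f) ** 2 for t, f in zip(template, features))
    return d < threshold
```

A template trained only on, say, IR-illuminated examples will lie closer to new IR examples (and match them more reliably) than a template averaged over both IR and ambient-light data, which mirrors the motivation for the condition-specific templates discussed below.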
As examples, FIGS. 3(a) through 3(h) illustrate images that can be captured under different conditions, of different users, and with different device states, which can be used to attempt to determine a face or head location.
As mentioned, however, the features detectable in an image, and the relative arrangement and/or spacing of those features, can vary significantly between images due to various factors. For example, in the example image 310, the pose or viewing angle of the user differs, such that the identified features appear with a different relative arrangement and spacing.
Similarly, the lighting conditions might affect the presence and/or arrangement of features identifiable in a captured image. For example, in the example image 320, captured under different illumination, certain features may not be detectable at all, while others may appear at different apparent locations.
Aspects of different users can result in substantially different feature locations as well. For example, the features 332 identified for a woman in the example image 330 may have a noticeably different relative arrangement and spacing than the corresponding features identified for another type of user.
Even for a single known user there can be different situations that can lead to different apparent feature arrangements. For example, in the image 350 the same user may appear under conditions, such as wearing glasses or having facial hair, that change which features can be identified and where those features appear.
Accordingly, approaches in accordance with various embodiments can utilize multiple face detector templates for face detection and tracking, and can attempt to determine information such as the state of the device, the user (or type of user), or an environmental condition in order to dynamically select the appropriate template to use for face detection. As mentioned, terms such as “up” and “down” are used for purposes of explanation and are not intended to imply specific directional requirements unless otherwise specifically stated herein.
In some embodiments, an offline analysis can be performed to determine situations where the typical selections, locations, relative positions, and/or arrangement of features are such that different templates may be beneficial. This can include, for example, a template for ambient light images and a template for infrared (IR) light images. Similarly, for a device with two or more cameras that are separated by an appreciable distance on the device, a template for a normal or straight-on view might be used, as well as one or more templates for different poses or views, such as may be captured by a side camera or a camera at an angle with respect to a user. Similarly, low light conditions with high exposure or gain settings might warrant a dedicated template. For each of these situations, a state of the device (e.g., orientation or active IR source) or environmental condition (e.g., amount of ambient light) can be determined that dictates which template to use for face tracking at a current point in time.
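A dynamic selection of this kind can be as simple as a lookup keyed on the determined device state and environmental condition. The template names, categories, and lux threshold below are illustrative assumptions, not values from the disclosure:

```python
# Hypothetical template identifiers; a real system would load trained
# detector data for each. Keys are (illumination source, light level).
TEMPLATES = {
    ("ir", "any"): "ir_template",
    ("ambient", "low"): "low_light_template",
    ("ambient", "normal"): "ambient_template",
}

def select_template(ir_active, ambient_lux):
    """Pick a detection template from a device state (whether an IR
    illumination source is active) and an environmental condition
    (the measured ambient light level)."""
    if ir_active:
        # Images captured under active IR differ enough to warrant
        # their own template regardless of ambient level.
        return TEMPLATES[("ir", "any")]
    level = "low" if ambient_lux < 50 else "normal"
    return TEMPLATES[("ambient", level)]
```

The same dispatch pattern extends naturally to additional keys, such as device orientation or which camera captured the image.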
Such an analysis can also be performed to determine when different templates might be advantageous for different types of users. For example, it might be beneficial to use a different template for men than for women, and for adults versus children. It might also be beneficial to utilize different templates for different regions or ethnicities, as facial dimensions and relative feature arrangements may differ significantly between different regions, such as a region of Asia with respect to a region of Europe or Africa. It also might be beneficial to have different templates for users who wear glasses or have certain types of facial hair. Any or all of these and other aspects of a user might be beneficial to use to determine the optimal template for face detection and tracking.
For each of these aspects, however, the computing device in at least some embodiments has to determine the appropriate aspect to use in selecting a template. Various approaches for determining these aspects can be used in accordance with the various embodiments. For example, a facial recognition process might be run to attempt to identify a user for which specific information, such as age, gender, and ethnicity, is known to the device or application. A particular user might login using a username, password, biometric, or other such information that can be used to identify a specific user as well. For some users for which specific information is not known, one or more processes can be used to attempt to determine one or more aspects of the user. This can include, for example, capturing and analyzing one or more images to attempt to determine recognizable aspects of a user, such as age range or gender. In some embodiments, information such as the location of the device can be used to select an appropriate template. For example, a device located in Asia might start with an Asian data-trained template, while a device located in South America might start with a template trained using data more representative of users in that region. The location can be determined using GPS data, IP address information, or any other appropriate information determinable by, or available to, a computing device or application executing on that device, such as may utilize a GPS, signal triangulation process, or other such location determination component or process. If there are multiple users of a device, information such as the way in which the user is holding or using the device might be indicative of a particular user for which to select a template. If a face cannot be detected using a specific template, additional attempts can be made by rotating the template (or image data) or using a different template, among other such options.
In some embodiments the dynamic determination of the appropriate template to use can include a ranking of templates based on available information. For example, the use of IR light to capture an image instead of ambient light might cause a greater difference than differences between genders, such that an IR template might be ranked higher than a gender-specific template, unless a template exists that is trained on both. In some embodiments, the various classes can have different rankings or weightings such that templates can be selected for use in a specific order unless available information dictates otherwise. In some embodiments categories might be created that include templates for specific combinations of features, such as a female child illuminated by IR or a male adult illuminated by ambient light, among other such options. A template determination algorithm can analyze the available information and determine and/or infer the appropriate category. In some embodiments a generic template might be used when no information is available that indicates the appropriate template to use. In other embodiments a device might track which template(s) are most used on that device and start with those template(s) if no other information is available. Various other approaches can be used as well within the scope of the various embodiments. In some embodiments different templates can be developed starting with the same face detector and using different data sets, while other embodiments might start with different detectors developed for different features, types of objects, etc.
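One way to sketch such a ranking is to score each candidate template by the weighted attributes it matches for the current situation, with illumination weighted above gender as suggested above, so that a template matching on both naturally outranks one matching on either alone. The attribute names and weight values are assumptions for illustration:

```python
def rank_templates(candidates, known):
    """Order candidate templates by a weighted count of attributes
    matching what is currently known about the situation.

    candidates: list of (template_name, attributes_dict) pairs
    known:      dict of determined attributes of the situation,
                e.g. {"illumination": "ir", "gender": "male"}
    """
    # Illustrative weights: illumination type matters more than
    # age group, which matters more than gender.
    WEIGHTS = {"illumination": 3, "age_group": 2, "gender": 1}

    def score(attrs):
        return sum(WEIGHTS.get(k, 0)
                   for k, v in attrs.items()
                   if known.get(k) == v)

    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)
```

A generic template with no attributes scores zero and thus falls to the end of the ordering, matching the fallback behavior described above.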
Once a template has been selected (or before or during the selection process in some embodiments) one or more images can be captured 412 or otherwise acquired using at least one camera of the computing device. As discussed, in some embodiments this can include a pair of images captured using stereoscopic data that provides distance information, in order to more accurately analyze relative feature positions for a given distance. The selected template then can be used to analyze the image and attempt to determine a face location 414 for the user. As mentioned, this can include detecting features in the image and using the selected face detector template to determine whether those features are indicative of a human face, and then determining a location of the face based at least in part upon the locations of those features. If it is determined 416 that there is no prior face position data, at least for the current session or within a threshold amount of time, then another image can be captured and analyzed using the process. If prior data exists, then the current head location data can be compared 418 to the prior location data to determine any change, or at least a change that exceeds a minimum change threshold. A minimum change threshold might be used to account for noise or slight user movements, which are not meant to be used as input and thus may not result in any change in the determined head location for input purposes. If there is a change, information about the change, movement, and/or new head position can be provided 420 as input to an application or service, for example, such as an application that tracks head position over time for purposes of controlling one or more aspects of a computing device.
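The comparison against prior position data using a minimum change threshold, as described above, might look like the following sketch; the pixel coordinates and the threshold value are illustrative assumptions:

```python
def update_head_position(prev, current, min_change=5.0):
    """Compare the current detected head location to the prior one and
    report a change only when it exceeds a minimum threshold, filtering
    out noise and slight user movements not meant as input.

    prev, current: (x, y) positions in pixels, or None for prev if no
    prior data exists. Returns the position to report as input, or
    None when there is no meaningful change.
    """
    if prev is None:
        return current          # first detection: establish a baseline
    dx, dy = current[0] - prev[0], current[1] - prev[1]
    if (dx * dx + dy * dy) ** 0.5 > min_change:
        return current          # meaningful movement: provide as input
    return None                 # within noise threshold: ignore
```

An application tracking head position over time would call this on each detection and forward only non-None results to its input handling.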
Although not shown in the figures, additional, fewer, or alternative steps can be performed in similar or alternative orders, or in parallel, within the scope of the various embodiments.
Accordingly, approaches in accordance with various embodiments can limit the range of rotation angles over which the device (or an application executing on the device) will analyze using a template for face detection. For example, a template might be trained to recognize a face that is rotated plus or minus sixty degrees from normal, or "upright," in the image. Thus, a single template can cover one hundred twenty degrees of rotation. For at least some embodiments, the device might use only one orientation of a template in order to attempt to recognize a face, forgoing face or head detection and tracking outside that device orientation range. This might be done for different device orientations, with an "up" orientation of the device being selected as the normal direction for range selection purposes. In other embodiments, the device might utilize different template rotations, such as plus or minus ninety degrees, but may ignore the "upside down" orientation of one hundred eighty degrees, as the device may be unlikely to be in that orientation with respect to a user, and the upside down orientation may be too susceptible to inaccuracies. In still other embodiments, a device might completely suspend face tracking processes if the device is in an upside down orientation, or in an orientation that is outside a determined range of acceptable orientations (such as more than sixty degrees from a conventional orientation such as portrait or landscape).
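The orientation gating described above can be sketched as a simple range check on the device's rotation from its normal "up" orientation, such that tracking is suspended for upside-down or far-rotated orientations; the sixty-degree limit follows the example above, and the function name is hypothetical:

```python
def tracking_allowed(rotation_deg, max_offset=60.0):
    """Decide whether to run face tracking, given the device's rotation
    (in degrees) from its normal 'up' orientation, as reported by an
    orientation sensor. Tracking is suspended outside +/- max_offset,
    so an upside-down device (180 degrees) never runs the detector.
    """
    # Normalize any input angle into the range (-180, 180]
    angle = (rotation_deg + 180.0) % 360.0 - 180.0
    return abs(angle) <= max_offset
```

A device supporting both portrait and landscape might instead apply this check against the nearest conventional orientation, allowing tracking near either while still suspending it for upside-down orientations.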
If the device is within range, one or more images can be captured 510 or otherwise acquired using at least one camera of the computing device. As discussed, in some embodiments this can include a pair of images captured using stereoscopic imaging that provides distance information, in order to more accurately analyze relative feature positions for a given distance. A template, which in some embodiments may be selected using one of the processes discussed herein, can be used to analyze the image and attempt to determine an object location 512 with respect to the device. As mentioned, this can include detecting features in the image and using a selected detector template to determine whether those features are indicative of a specified object, such as a human face, and then determining a location of the object based at least in part upon the locations of those features. If it is determined 514 that there is no prior position data, at least for the current session or within a threshold amount of time, then another image can be captured and analyzed using the process. If prior data exists, then the current location data can be compared 516 to the prior location data to determine any change, or at least a change that exceeds a minimum change threshold as discussed above. If there is a change, information about the change, movement, and/or new position can be provided 518 as input to an application or service, for example, such as an application that tracks head position over time for purposes of controlling one or more aspects of a computing device.
A specific example is provided that incorporates both of the processes described above, namely dynamic template selection and orientation-based suspension of tracking.
As mentioned, the appearance of the face can be dramatically different when illuminated by ambient light sources (e.g., the sun or fluorescent lamps) than when illuminated with IR LEDs. Following traditional face detection training approaches, a single monolithic face detector could be trained by adding IR-illuminated face examples to the ambient illuminated face examples in the training data to generate a combined template. Similar approaches could be used with the orientation and camera angle differences. However, approaches discussed herein can train different face detectors, each trained using a respective type of training data, allowing each individual face detector to be more accurate (and faster) within its respective category. Further, since the information used to select between these templates can be readily determined, the template selection can be dynamically performed with relatively high accuracy. In such embodiments, the device can use what is within the control of the device to select the best template to use under a particular situation for a particular device state.
The example computing device can include at least one microphone or other audio capture device capable of capturing audio data, such as words or commands spoken by a user of the device, music playing near the device, etc. In this example, a microphone is placed on the same side of the device as the display screen, such that the microphone will typically be better able to capture words spoken by a user of the device. In at least some embodiments, a microphone can be a directional microphone that captures sound information from substantially directly in front of the microphone, and picks up only a limited amount of sound from other directions. It should be understood that a microphone might be located on any appropriate surface of any region, face, or edge of the device in different embodiments, and that multiple microphones can be used for audio recording and filtering purposes, etc.
In some embodiments, the computing device 700 of FIG. 7 can include one or more additional elements as discussed elsewhere herein.
The device also can include at least one orientation or motion sensor. As discussed, such a sensor can include an accelerometer or gyroscope operable to detect an orientation and/or change in orientation, or an electronic or digital compass, which can indicate a direction in which the device is determined to be facing. The mechanism(s) also (or alternatively) can include or comprise a global positioning system (GPS) or similar positioning element operable to determine relative coordinates for a position of the computing device, as well as information about relatively large movements of the device. The device can include other elements as well, such as may enable location determinations through triangulation or another such approach. These mechanisms can communicate with the processor, whereby the device can perform any of a number of actions described or suggested herein.
As discussed, different approaches can be implemented in various environments in accordance with the described embodiments. While many processes discussed herein will be performed on a computing device capturing an image, it should be understood that any or all processing, analyzing, and/or storing can be performed remotely by another device, system, or service as well. For example, FIG. 8 illustrates an example environment 800 for implementing aspects in accordance with various embodiments.
The illustrative environment includes at least one application server 808 and a data store 810. It should be understood that there can be several application servers, layers or other elements, processes or components, which may be chained or otherwise configured, which can interact to perform tasks such as obtaining data from an appropriate data store. As used herein the term “data store” refers to any device or combination of devices capable of storing, accessing and retrieving data, which may include any combination and number of data servers, databases, data storage devices and data storage media, in any standard, distributed or clustered environment. The application server can include any appropriate hardware and software for integrating with the data store as needed to execute aspects of one or more applications for the client device and handling a majority of the data access and business logic for an application. The application server provides access control services in cooperation with the data store and is able to generate content such as text, graphics, audio and/or video to be transferred to the user, which may be served to the user by the Web server in the form of HTML, XML or another appropriate structured language in this example. The handling of all requests and responses, as well as the delivery of content between the client device 802 and the application server 808, can be handled by the Web server 806. It should be understood that the Web and application servers are not required and are merely example components, as structured code discussed herein can be executed on any appropriate device or host machine as discussed elsewhere herein.
The data store 810 can include several separate data tables, databases or other data storage mechanisms and media for storing data relating to a particular aspect. For example, the data store illustrated includes mechanisms for storing production data 812 and user information 816, which can be used to serve content for the production side. The data store also is shown to include a mechanism for storing log or session data 814. It should be understood that there can be many other aspects that may need to be stored in the data store, such as page image information and access rights information, which can be stored in any of the above listed mechanisms as appropriate or in additional mechanisms in the data store 810. The data store 810 is operable, through logic associated therewith, to receive instructions from the application server 808 and obtain, update or otherwise process data in response thereto. In one example, a user might submit a search request for a certain type of element. In this case, the data store might access the user information to verify the identity of the user and can access the catalog detail information to obtain information about elements of that type. The information can then be returned to the user, such as in a results listing on a Web page that the user is able to view via a browser on the user device 802. Information for a particular element of interest can be viewed in a dedicated page or window of the browser.
Each server typically will include an operating system that provides executable program instructions for the general administration and operation of that server and typically will include a computer-readable medium storing instructions that, when executed by a processor of the server, allow the server to perform its intended functions. Suitable implementations for the operating system and general functionality of the servers are known or commercially available and are readily implemented by persons having ordinary skill in the art, particularly in light of the disclosure herein.
The environment in one embodiment is a distributed computing environment utilizing several computer systems and components that are interconnected via communication links, using one or more computer networks or direct connections. However, it will be appreciated by those of ordinary skill in the art that such a system could operate equally well in a system having fewer or a greater number of components than are illustrated in FIG. 8.
As discussed above, the various embodiments can be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices, or processing devices which can be used to operate any of a number of applications. User or client devices can include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless, and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols. Such a system also can include a number of workstations running any of a variety of commercially-available operating systems and other known applications for purposes such as development and database management. These devices also can include other electronic devices, such as dummy terminals, thin-clients, gaming systems, and other devices capable of communicating via a network.
Various aspects also can be implemented as part of at least one service or Web service, such as may be part of a service-oriented architecture. Services such as Web services can communicate using any appropriate type of messaging, such as by using messages in extensible markup language (XML) format and exchanged using an appropriate protocol such as SOAP (derived from the “Simple Object Access Protocol”). Processes provided or executed by such services can be written in any appropriate language, such as the Web Services Description Language (WSDL). Using a language such as WSDL allows for functionality such as the automated generation of client-side code in various SOAP frameworks.
Most embodiments utilize at least one network that would be familiar to those skilled in the art for supporting communications using any of a variety of commercially-available protocols, such as TCP/IP, FTP, UPnP, NFS, and CIFS. The network can be, for example, a local area network, a wide-area network, a virtual private network, the Internet, an intranet, an extranet, a public switched telephone network, an infrared network, a wireless network, and any combination thereof.
In embodiments utilizing a Web server, the Web server can run any of a variety of server or mid-tier applications, including HTTP servers, FTP servers, CGI servers, data servers, Java servers, and business application servers. The server(s) also may be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more Web applications that may be implemented as one or more scripts or programs written in any programming language, such as Java®, C, C# or C++, or any scripting language, such as Perl, Python, or TCL, as well as combinations thereof. The server(s) may also include database servers, including without limitation those commercially available from Oracle®, Microsoft®, Sybase®, and IBM®.
The environment can include a variety of data stores and other memory and storage media as discussed above. These can reside in a variety of locations, such as on a storage medium local to (and/or resident in) one or more of the computers or remote from any or all of the computers across the network. In a particular set of embodiments, the information may reside in a storage-area network (“SAN”) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to the computers, servers, or other network devices may be stored locally and/or remotely, as appropriate. Where a system includes computerized devices, each such device can include hardware elements that may be electrically coupled via a bus, the elements including, for example, at least one central processing unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen, or keypad), and at least one output device (e.g., a display device, printer, or speaker). Such a system may also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices such as random access memory (“RAM”) or read-only memory (“ROM”), as well as removable media devices, memory cards, flash cards, etc.
Such devices also can include a computer-readable storage media reader, a communications device (e.g., a modem, a network card (wireless or wired), an infrared communication device, etc.), and working memory as described above. The computer-readable storage media reader can be connected with, or configured to receive, a computer-readable storage medium, representing remote, local, fixed, and/or removable storage devices as well as storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information. The system and various devices also typically will include a number of software applications, modules, services, or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or Web browser. It should be appreciated that alternate embodiments may have numerous variations from that described above. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software (including portable software, such as applets), or both. Further, connection to other computing devices such as network input/output devices may be employed.
Storage media and non-transitory computer-readable media for containing code, or portions of code, can include any appropriate media known or used in the art, such as but not limited to volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data, including RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a system device. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.