Increasing emphasis has been placed on face detection in the field of computer vision. The computational cost of face detection can be expensive, however, and rises with increasing image size. As image sensor resolution increases, so too does the computational cost of face detection, posing a challenge particularly for mobile devices whose computational resources are limited.
Computing device 102 may capture image data in any suitable form. For example, computing device 102 may be operated in a camera mode, in which case the set of images 106 may be captured as a sequence of images. In another example, computing device 102 may be operated in a video camera mode, in which case the set of images 106 may be captured as a sequence of frames forming video. In this example, face detection may be performed at a frequency matching that at which video is captured—e.g., 30 or 60 frames per second. Any suitable face detection frequency and image capture method may be used, however.
Although shown as a mobile device, computing device 102 may assume any suitable form, including but not limited to that of a desktop, server, gaming console, tablet computing device, etc. Regardless of the form taken, the set of computational resources (e.g., processing cycles, memory, and bandwidth) available to computing device 102 for performing face detection is limited. The computational resources may be further limited when computing device 102 is configured as a mobile device, due to the limited power available from its power source (e.g., battery). These and other constraints placed on face detection by limited computational resources may force an undesirable tradeoff between face detection and other tasks carried out by computing device 102, which in turn may degrade the user experience—e.g., deemphasizing face detection may render it slow and/or inaccurate, while emphasizing face detection may render running applications unresponsive. As such, computing device 102 may be configured to consider the availability of computational resources when determining whether to perform face detection, and may establish a compute budget based on the available resources. Face detection may then be limited to subsets, rather than the entirety, of the image data by performing face detection on regions where human faces are likelier to be found, without exceeding the established compute budget.
Computing device 102 may include a logic subsystem 108 and a storage subsystem 110 holding instructions executable by the logic subsystem to effect the approaches described herein. For example, the instructions may be executable to receive an image (e.g., from the set of images 106), apply a tile array to the image, the tile array comprising a plurality of tiles, and perform face detection on at least a subset of the tiles. As described below, one or more of the plurality of tiles may overlap one or more others of the plurality of tiles. Computing device 102 may determine whether or not to perform face detection on a given tile based on a likelihood that the tile includes at least a portion of a human face.
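As one non-limiting illustration (not part of the original description), the Python sketch below shows one way such instructions might be organized. The Tile type, the 64-pixel tile size, the 32-pixel overlapping stride, the 0.5 default likelihood, and the run_detector callback are assumptions introduced purely for illustration; later sketches in this section reuse the Tile fields (x, y, size, likelihood) defined here.

```python
from dataclasses import dataclass

@dataclass
class Tile:
    x: int                   # left edge of the tile, in pixels
    y: int                   # top edge of the tile, in pixels
    size: int                # tile edge length, in pixels (the tile "scale")
    likelihood: float = 0.5  # chance the tile contains at least part of a face

def apply_tile_array(width, height, tile_size=64, stride=32):
    """Cover a width-by-height image with tiles; a stride smaller than
    tile_size yields the partially overlapping tiles described herein."""
    return [Tile(x, y, tile_size)
            for y in range(0, height - tile_size + 1, stride)
            for x in range(0, width - tile_size + 1, stride)]

def detect_faces(image, tiles, run_detector, threshold=0.5):
    """Run the (comparatively expensive) detector only on tiles whose
    likelihood exceeds the threshold."""
    return [tile for tile in tiles
            if tile.likelihood >= threshold and run_detector(image, tile)]
```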
In view of the above, “face detection” as used herein may refer to the detection of a complete face or a portion, and not the entirety, thereof. For example, in some implementations face detection performed in a tile may produce positive results (i.e., detection of a face portion therein) if a sufficient face portion resides in the tile, without requiring that the entirety of the face resides in the tile to prompt positive face detection. The approaches disclosed herein, however, are equally applicable to implementations that do require the entirety, and not merely portions, of a face to reside in a tile for face detection to produce positive results in the tile. Further, in such implementations that do require complete faces to yield positive face detection, only tiles of scales suited to the size of a face in an image (e.g., large enough to completely contain the face without containing significant image portions that do not correspond to the face) may yield positive detection of the face, while tiles of scales unsuited to the size of the face (e.g., of scales that contain only portions, and not the entirety, of the face, or of scales that contain significant image portions that do not correspond to the face) may not yield positive detection of the face. Details regarding tile scale are discussed below.
The likelihood for each tile 204 in tile array 200 may be determined based on any practicable criteria, and in many examples it will be desirable to establish likelihoods with a focus on making efficient use of compute resources. Further, in most examples the likelihood determination will be performed via mechanisms that are significantly less computationally expensive than the actual face detection methods used on the tiles. As a non-limiting example, the likelihoods may be determined based at least on pixel color. With regard to pixel color, the colors of one or more pixels in a given tile (e.g., an average color of two or more pixels) may be compared to colors that correspond to human skin, with a greater correspondence between pixel color and human skin color leading to assignment of a greater likelihood, and a lesser correspondence leading to assignment of a lesser likelihood.
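A minimal sketch of such a color-based likelihood follows; the nominal skin tone, the distance normalization, and the clamp to [0.01, 0.99] are illustrative assumptions rather than values prescribed by this description. The image is assumed to be a NumPy array of shape (height, width, 3), and the tile to expose the x, y, and size fields sketched above.

```python
import numpy as np

# A single nominal skin tone in RGB; a practical system would use a richer
# model (e.g., a region in YCbCr space). Purely illustrative.
SKIN_RGB = np.array([200.0, 160.0, 130.0])

def color_likelihood(image, tile):
    """Map a tile's average color to a likelihood: closer to skin tone, higher."""
    patch = image[tile.y:tile.y + tile.size, tile.x:tile.x + tile.size]
    avg = patch.reshape(-1, 3).mean(axis=0)
    # Normalize the distance to the nominal skin tone by the largest possible
    # RGB distance, then invert so greater correspondence yields greater likelihood.
    dist = np.linalg.norm(avg - SKIN_RGB) / np.linalg.norm([255.0, 255.0, 255.0])
    return float(np.clip(1.0 - dist, 0.01, 0.99))
```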
Other criteria may be used in determining tile likelihoods. For example, an assessment of tiles in a frame in a sequence of video frames may be used in assigning likelihoods in subsequent frames.
In some examples, a maximum likelihood (e.g., 0.99) may be assigned to tiles 204A and 204B (e.g., based on positive face detection). When assigned to a tile, the maximum likelihood may ensure that face detection is performed on the tile; in this case, whether face detection is performed on a given tile may be controlled by comparing the tile's likelihood to a threshold (e.g., a threshold specified by an established compute budget) and performing face detection on tiles whose likelihoods exceed that threshold. From this example, it will be appreciated that mechanisms may be employed to guarantee that a given tile is inspected. In alternate methods, however, resource constraints or other considerations may lead to a scheme in which there is no such guarantee, but rather only a proportionately high possibility of a tile being selected for inspection.
The detection of a face in a tile may influence the likelihood assignment to other tiles.
The propagation of likelihoods among tiles may be implemented in a variety of suitable manners. Although the propagation of the maximum likelihood from tiles 204A and 204B to respectively adjacent tiles 204A′ and 204B′ is described above, non-maximum likelihoods may alternatively be propagated. In some configurations, non-maximum likelihoods may not ensure the performance of face detection. As a more particular example, the propagation of likelihoods may be a function of tile distance—e.g., a first tile to which a likelihood is propagated from a second tile may receive a likelihood that is reduced relative to the likelihood assigned to the second tile, in proportion to the distance between the first and second tiles.
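One hedged way to implement distance-proportional propagation is sketched below; the linear decay constant is an assumption, and the tiles are taken to expose the fields of the Tile sketch introduced earlier.

```python
import math

def propagate_likelihood(tiles, source, decay=0.25):
    """Propagate the source tile's likelihood to other tiles, reduced in
    proportion to the distance between tile origins (in tile-size units)."""
    for tile in tiles:
        if tile is source:
            continue
        d = math.hypot(tile.x - source.x, tile.y - source.y) / source.size
        propagated = source.likelihood * max(0.0, 1.0 - decay * d)
        # Only ever raise a tile's likelihood here; never lower it.
        tile.likelihood = max(tile.likelihood, propagated)
```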
In some implementations, facial part classification may be employed in assigning and/or propagating likelihoods. For example, tiles corresponding to face parts relatively more invariant to transformations (e.g., rotation), such as the nose and mouth, may be assigned greater likelihoods relative to other face parts that more frequently become occluded or otherwise obscured due to such transformations. When used in combination with motion, described below, facial part classification may lead to the assignment of greater likelihoods to tiles adjacent to the more invariant face parts, in contrast to the assignment of lesser likelihoods to tiles adjacent to the less invariant face parts. Such an approach may represent an expectation that face portions closer to the center of a face will have a greater persistence in images when the face is in motion.
A tile array may include at least one tile that overlaps another tile.
Likelihood determination may be based on motion. In one example, a change in the color (e.g., average pixel color) of corresponding tiles between frames may be considered an indication of motion.
Likelihood propagation may account for the speed and direction of motion. A motion vector, for example, may be computed based on observed rates of change in pixel color and the directions along which similar changes in pixel color propagate. The likelihood of a tile where motion originated may be propagated to tiles substantially on the path of the motion vector—e.g., intersecting or adjacent to the motion vector or an extension thereof. Further, likelihoods may be propagated to tiles of increasing distance from a tile where motion originated as the speed of motion (e.g., vector magnitude) increases—e.g., a relatively low speed of motion may lead to likelihood propagation to only immediately adjacent tiles, whereas a relatively higher speed of motion may lead to likelihood propagation to tiles beyond those that are immediately adjacent. In an alternative implementation, a likelihood propagated to other tiles may be scaled down as a function of distance, where the degree of scaling is less for higher speeds of motion and greater for lower speeds of motion.
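The sketch below illustrates one possible reading of the above: a change in a tile's average color between frames stands in for motion, and the source tile's likelihood is pushed onto tiles lying near the motion vector, with reach growing with speed. The color-to-likelihood scaling, the reach factor, and the off-path tolerance are all assumptions.

```python
import math

def motion_likelihood(prev_avg_color, curr_avg_color, scale=0.02):
    """Treat a change in a tile's average color between frames as motion."""
    delta = sum(abs(c - p) for c, p in zip(curr_avg_color, prev_avg_color))
    return min(0.99, max(0.01, scale * delta))

def propagate_along_motion(tiles, source, vx, vy, reach_per_speed=1.0):
    """Propagate the source tile's likelihood to tiles near the motion vector
    (vx, vy); higher speed reaches tiles farther from the source."""
    speed = math.hypot(vx, vy)
    if speed == 0:
        return
    ux, uy = vx / speed, vy / speed            # unit direction of motion
    reach = speed * reach_per_speed            # how far the likelihood travels
    for tile in tiles:
        dx, dy = tile.x - source.x, tile.y - source.y
        along = dx * ux + dy * uy              # distance along the motion path
        across = abs(dx * uy - dy * ux)        # distance off the path
        if 0 < along <= reach and across <= source.size:
            tile.likelihood = max(tile.likelihood, source.likelihood)
```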
Likelihood determination may be based on environmental priors. For example, a computing device (e.g., computing device 102) may assign greater likelihoods to tiles at locations where, based on prior knowledge of the environment, human faces are more likely to be found, and lesser likelihoods to tiles at locations where faces are unlikely to appear.
Likelihood determination may consider both environmental priors and motion, which may be weighted differently. For example, in lieu of assigning to a tile a moderate likelihood (e.g., 0.50) determined based only on moderate motion in that tile, a relatively greater likelihood may be assigned to the tile as a result of an environmental prior indicating that tile to be a likely location where faces may be found. As another example, a likelihood determined based only on motion for a tile may be reduced if an environmental prior indicates that tile to be at a location where faces are not likely to be found. In some examples, indications of large motion may lead to the assignment of high (e.g., the maximum) likelihoods to a tile, even if an environmental prior indicates that tile to be an unlikely face location. Generally, two or more of the criteria described herein may be considered in assigning likelihoods.
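A weighted combination of a motion-based likelihood and an environmental prior might look like the following sketch; the weights and the large-motion override threshold are assumptions, not values given in this description.

```python
def combined_likelihood(motion_l, prior_l, w_motion=0.7, w_prior=0.3,
                        large_motion=0.9):
    """Blend motion- and prior-based likelihoods; very large motion overrides
    an unfavorable environmental prior."""
    if motion_l >= large_motion:
        return 0.99                      # large motion wins even in "unlikely" tiles
    blended = w_motion * motion_l + w_prior * prior_l
    return min(0.99, max(0.01, blended))
```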
In some examples, the computing device may accept user input for establishing prior likelihoods—for example, the user input may be operable to identify locations (e.g., tiles) where the presence of faces is physically impossible, such that face detection is not performed at these locations (e.g., by assigning the corresponding tiles likelihoods of zero). User input may alternatively or additionally be used to assign any suitable likelihood to image locations.
In some implementations, two or more tile arrays at different scales may be used to effect the approaches described herein. “Scale” as used herein may refer to the size of tiles in a given tile array, and a collection of tile arrays at different scales may be referred to as a tile “hierarchy”.
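A tile hierarchy might be built as in the sketch below, which reuses the Tile type from the earlier sketch; the particular scales and the 50% overlap are illustrative assumptions.

```python
def build_tile_hierarchy(width, height, scales=(128, 64, 32), overlap=0.5):
    """Return one tile array per scale; the collection forms a tile hierarchy."""
    hierarchy = {}
    for size in scales:
        stride = max(1, int(size * (1.0 - overlap)))   # overlapping tiles per array
        hierarchy[size] = [Tile(x, y, size)
                           for y in range(0, height - size + 1, stride)
                           for x in range(0, width - size + 1, stride)]
    return hierarchy
```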
Tile array 250 includes a plurality of tiles (e.g., tile 254) that are each assigned a likelihood that the corresponding tile includes at least a portion of a human face based on one or more of the criteria described above. Similar to the application of tile array 200 to image 206, tiles 254 may be assigned likelihoods based on the outcome of assessing image 202;
The selection of tile scales may be based on motion. For example, the transition between tile scales may be controlled in proportion to a magnitude of detected or expected motion; if a relatively large degree of motion is believed to be occurring, a transition from a tile array of scale Y to a tile array of scale Y ± 2 may be effected, rather than to a tile array of scale Y ± 1 (e.g., an adjacent tile scale). Such an approach may allow a detected face to be persistently tracked in the event the face rapidly moves toward or away from a camera, for example. Generally, any suitable adjacent or non-adjacent transition between tile scales may occur, including a transition from a smallest to largest tile scale and vice versa.
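One hedged way to realize motion-proportional scale transitions is sketched below; the motion thresholds, the mapping of motion magnitude to step size, and the toward_camera flag are assumptions introduced for illustration.

```python
def next_scale_index(current_index, motion_magnitude, num_scales,
                     moderate_motion=5.0, large_motion=20.0,
                     toward_camera=True):
    """Pick the next tile-scale index: moderate motion moves to an adjacent
    scale (Y ± 1); large motion skips a scale (Y ± 2)."""
    if motion_magnitude < moderate_motion:
        step = 0
    elif motion_magnitude < large_motion:
        step = 1
    else:
        step = 2
    direction = 1 if toward_camera else -1
    return max(0, min(num_scales - 1, current_index + direction * step))
```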
In the course of using a tile hierarchy, determining whether to perform face detection on a tile may be based on a scale of the tile. For example, face detection may be preferentially performed for tiles of a relatively larger scale over tiles of a relatively smaller scale—e.g., tiles 204 of tile array 200 may be preferentially assessed over tiles 254 of tile array 250 due to the relatively greater scale of tile array 200. Such an approach may reduce computational cost, at least initially, as in some examples the cost of performing face detection may not scale linearly with tile scale—for example, the cost associated with tiles of scale 32×32 may not be reduced relative to the cost associated with tiles of scale 64×64 in proportion to the reduction in tile size when going from 64×64 to 32×32. The preferential exploration of tiles at relatively greater scales may increase the speed at which faces relatively close to a camera are detected, while slightly delaying the detection of faces relatively distant from the camera. It will be understood that, in some examples, the preferential exploration of relatively larger tiles may be a consequence of larger tiles generally having greater likelihoods of containing a face due to the greater image portions they cover, and not a result of an explicit setting causing such preferential exploration. Implementations are possible, however, in which an explicit setting may be established that causes preferential exploration of larger scales over smaller scales, smaller or medium-sized scales over larger scales, etc. For example, a set of scales (e.g., smaller scales) may be preferentially explored over a different set of scales (e.g., larger scales) based on an expected face distance, which may establish a range of expected face sizes in image-space on which exploration may be focused.
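The sketch below shows one possible ordering policy consistent with the above: largest scales first by default, or scales nearest an expected face size when such an expectation is available. Tying expected face distance to an expected face size in pixels is an assumption made here rather than something specified above.

```python
def exploration_order(hierarchy, expected_face_px=None):
    """Flatten a tile hierarchy into an exploration order over its tiles."""
    def scale_key(size):
        # Largest scale first by default; otherwise nearest the expected face size.
        return -size if expected_face_px is None else abs(size - expected_face_px)
    ordered = []
    for size in sorted(hierarchy, key=scale_key):
        ordered.extend(hierarchy[size])
    return ordered
```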
As described above, the approaches described herein for performing face detection based on tile likelihoods may be carried out based on an established compute budget. The compute budget may be established based on available (e.g., unallocated) computing resources and/or other potential factors such as application context (e.g., a relatively demanding application may force a reduced compute budget to maintain a desired user experience). The compute budget, in some scenarios, may limit the performance of face detection to a subset, but not all, of the tiles in a tile array or tile hierarchy. The subset of tiles that are evaluated for the presence of faces may be selected on the basis of likelihood, such that tiles of greater likelihood are evaluated before tiles of relatively lesser likelihood.
An established compute budget may constrain face detection in various manners. For example, the compute budget may constrain, in size, the subset of tiles on which face detection is performed—e.g., the budget may stipulate a number of tiles that can be evaluated without exceeding the budget. As another example, the compute budget may stipulate a length of time in which tiles can be evaluated. Regardless of its configuration, face detection may be performed on a subset of tiles until the compute budget is exhausted. In some examples, face detection may be performed on at least a subset of tiles, followed by the performance of face detection on additional tiles until the compute budget is exhausted. In this scenario, the compute budget may have constrained face detection to the subset of tiles, but, upon completion of face detection on the subset, the compute budget is not fully exhausted; as such, face detection may be performed on additional tiles until it is exhausted. In other examples, the compute budget may be re-determined upon its exhaustion, which may prompt the evaluation of additional tiles. Establishment of the compute budget may be performed in any suitable manner and at any suitable frequency; the compute budget may be established for every frame/image, at two or more times within a given frame/image, for each sequence of contiguous video frames, etc. Consequently, the number of tiles on which face detection is performed may vary from frame/image to frame/image for at least some of a plurality of received frames/images. Such variation may be based on variations in the established compute budget (e.g., established for each frame/image); thus, a compute budget may be dynamically established. It will be understood, however, that in some scenarios a common compute budget established for different frames may lead to face detection in different numbers of tiles across the frames. Further, the variation in the number of tiles on which face detection is performed may be a function of other factors, alternatively or in addition to a varying compute budget, including but not limited to randomness and/or image data (e.g., variation in the number of faces in different images).
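A budget-constrained detection loop consistent with the above might be sketched as follows, where the budget is expressed as a tile count, a time limit, or both; the specific accounting is an assumption, as is the run_detector callback.

```python
import time

def detect_within_budget(image, tiles, run_detector,
                         budget_tiles=None, budget_seconds=None):
    """Evaluate likelier tiles first, stopping when the compute budget
    (a tile count and/or a time limit) is exhausted."""
    deadline = None if budget_seconds is None else time.monotonic() + budget_seconds
    hits, evaluated = [], 0
    for tile in sorted(tiles, key=lambda t: t.likelihood, reverse=True):
        if budget_tiles is not None and evaluated >= budget_tiles:
            break                                  # tile-count budget exhausted
        if deadline is not None and time.monotonic() >= deadline:
            break                                  # time budget exhausted
        evaluated += 1
        if run_detector(image, tile):
            hits.append(tile)
    return hits
```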
Non-zero likelihoods may be assigned to every tile in a given tile array or tile hierarchy. For example, a minimum but non-zero likelihood (e.g., 0.01) may be assigned to tiles whose evaluation suggested that no face is present. The assignment of non-zero likelihoods to every tile—even for tiles in which the presence of a face is not detected or expected—enables their eventual evaluation so that no tile goes unexplored over the long term. Although the approaches described herein may preferentially evaluate likelier tiles, the tile selection process may employ some degree of randomness so that minimum-likelihood tiles are explored and all regions of an image are eventually assessed for the presence of faces. The assignment of non-zero likelihoods may be one example of a variety of approaches that enable the modification of a tile's likelihood relative to the likelihood that would otherwise be determined without such modification—e.g., based on one or more of the criteria described herein such as pixel color, motion, environmental priors, and previous face detections. A tile's likelihood may be modified to achieve a desired frequency with which face detection is performed therein, for example. In some implementations, a likelihood modification may be weighted less heavily than the likelihood determined from a criterion-based assessment; in this way, the modification may be limited to effecting small changes in likelihood.
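A minimal sketch of such flooring and lightly weighted modification follows; the 0.01 floor matches the example above, while the modification weight is an assumption.

```python
MIN_LIKELIHOOD = 0.01   # non-zero floor so no tile goes unexplored indefinitely

def assign_likelihood(criterion_likelihood, modification=0.0, modification_weight=0.1):
    """Combine a criterion-based likelihood with a lightly weighted
    modification, then clamp to the non-zero floor."""
    adjusted = criterion_likelihood + modification_weight * modification
    return min(0.99, max(MIN_LIKELIHOOD, adjusted))
```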
The process by which tiles are selected for face detection may be implemented in various suitable manners. In one example, each tile may be assigned a probability—e.g., likelihood 205. A random number (e.g., a decimal value between 0 and 1) may be generated and compared, for a given tile, to that tile's probability to determine whether or not to perform face detection in the tile. If the tile's probability exceeds the random number, the tile may be designated for face detection, whereas the tile may not be designated for face detection if its probability falls below the random number. A new random number may be generated for each image, so that the probability of performing face detection on a given region of an image at least once within N frames can be characterized.
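A sketch of this random-threshold selection, together with the resulting N-frame selection probability, is given below; drawing a single random number per image follows the example above, and the closed-form expression assumes independent draws across frames.

```python
import random

def select_by_random_threshold(tiles, rng=random):
    """Draw one random number per image and keep tiles whose likelihood exceeds it."""
    r = rng.random()                          # uniform value in [0, 1)
    return [tile for tile in tiles if tile.likelihood > r]

def prob_selected_within(p, n_frames):
    """Probability that a tile of likelihood p is selected at least once
    over n_frames independent frames."""
    return 1.0 - (1.0 - p) ** n_frames
```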
As another non-limiting example, probabilistic face detection may be implemented using what is referred to herein as a “token” based approach. In this example, a number of unique tokens (e.g., alphanumeric identifiers) may be assigned to each tile. The number of unique tokens assigned to a given tile may be in direct proportion to the likelihood associated with that tile, such that likelier tiles are assigned greater numbers of tokens. The collection of unique tokens assigned to all tiles may form a token pool. A number of unique tokens may then be randomly selected from the token pool. This number of tokens selected from the token pool may be stipulated by an established compute budget, for example. Each tile corresponding to each selected token may then be designated for face detection. Such an approach enables probabilistic tile selection in which likelier tiles are naturally selected by virtue of their greater number of assigned tokens.
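The token-based approach might be sketched as follows; the token granularity (tokens per unit of likelihood) and the use of Python's random.sample to draw tokens without replacement are implementation assumptions.

```python
import random

def select_by_tokens(tiles, num_tokens_to_draw, tokens_per_unit=100, rng=random):
    """Assign each tile tokens in proportion to its likelihood, draw tokens at
    random from the pool, and designate the owning tiles for face detection."""
    pool = []                                   # token pool: each entry names a tile
    for index, tile in enumerate(tiles):
        n_tokens = max(1, round(tile.likelihood * tokens_per_unit))
        pool.extend([index] * n_tokens)
    drawn = rng.sample(pool, min(num_tokens_to_draw, len(pool)))
    # Several drawn tokens may belong to the same tile; collapse duplicates.
    return [tiles[i] for i in sorted(set(drawn))]
```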
The approaches herein to tile-based face detection may be modified in various suitable manners. For example, the propagation of likelihoods to spatially adjacent tiles in a subsequent frame may also occur for spatially adjacent tiles in the same frame. In this example, face detection may be performed at multiple stages for a single image. Further, the propagation of likelihoods may be carried out in any suitable manner—e.g., the same likelihood may be propagated between tiles, or may be modified, such as by being slightly reduced as described above. Still further, entire images or frames may be evaluated for the likelihood of including a face; those images/frames considered unlikely to include a face may be discarded from face detection. Yet further, any suitable face detection methods may be employed with the approaches described herein. An example face detection method may include, for example, feature extraction, feature vector formation, and feature vector distance determination.
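Purely as an illustration of that extract-features, form-vector, measure-distance pipeline (and far too weak for practical use), the following sketch uses a grayscale histogram as the feature vector; the histogram features, bin count, reference vectors, and distance threshold are all assumptions.

```python
import numpy as np

def extract_features(patch, bins=16):
    """Toy feature extraction: a normalized grayscale intensity histogram."""
    gray = patch.mean(axis=2)                           # collapse RGB to intensity
    hist, _ = np.histogram(gray, bins=bins, range=(0, 255))
    return hist / max(1, hist.sum())

def is_face(patch, reference_vectors, max_distance=0.25):
    """Declare a face when the patch's feature vector lies within
    max_distance of any reference face vector."""
    features = extract_features(patch)
    distances = [np.linalg.norm(features - ref) for ref in reference_vectors]
    return bool(distances) and min(distances) <= max_distance
```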
At 402, method 400 may include receiving an image.
At 404, method 400 may include applying a tile array to the image. The tile array may comprise a plurality of tiles.
At 406, method 400 may include performing face detection on at least a subset of the tiles. Determining whether or not to perform face detection on a given tile may be based on a likelihood that the tile includes at least a portion of a human face. The subset of the tiles on which face detection is performed may be constrained in size by a compute budget. The subset of tiles may include at least one tile at a first scale and at least one tile at a second scale different from the first scale. At least one of the subset of tiles may at least partially overlap another one of the subset of tiles.
Method 400 may further comprise, for each tile in which at least a portion of a human face is detected, performing face detection on one or more respectively adjacent tiles. The one or more respectively adjacent tiles may be spatially and/or temporally adjacent.
In some embodiments, the methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as a computer-application program or service, an application-programming interface (API), a library, and/or other computer-program product.
Computing system 500 includes a logic machine 502 and a storage machine 504. Computing system 500 may optionally include a display subsystem 506, input subsystem 508, communication subsystem 510, and/or other components not shown.
Logic machine 502 includes one or more physical devices configured to execute instructions. For example, the logic machine may be configured to execute instructions that are part of one or more applications, services, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.
The logic machine may include one or more processors configured to execute software instructions. Additionally or alternatively, the logic machine may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. Processors of the logic machine may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Individual components of the logic machine optionally may be distributed among two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the logic machine may be virtualized and executed by remotely accessible, networked computing devices configured in a cloud-computing configuration.
Storage machine 504 includes one or more physical devices configured to hold instructions executable by the logic machine to implement the methods and processes described herein. When such methods and processes are implemented, the state of storage machine 504 may be transformed—e.g., to hold different data.
Storage machine 504 may include removable and/or built-in devices. Storage machine 504 may include optical memory (e.g., CD, DVD, HD-DVD, and Blu-Ray Disc), semiconductor memory (e.g., RAM, EPROM, and EEPROM), and/or magnetic memory (e.g., hard-disk drive, floppy-disk drive, tape drive, and MRAM), among others. Storage machine 504 may include volatile, nonvolatile, dynamic, static, read/write, read-only, random-access, sequential-access, location-addressable, file-addressable, and/or content-addressable devices.
It will be appreciated that storage machine 504 includes one or more physical devices. However, aspects of the instructions described herein alternatively may be propagated by a communication medium (e.g., an electromagnetic signal or an optical signal) that is not held by a physical device for a finite duration.
Aspects of logic machine 502 and storage machine 504 may be integrated together into one or more hardware-logic components. Such hardware-logic components may include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC/ASICs), program- and application-specific standard products (PSSP/ASSPs), system-on-a-chip (SOC), and complex programmable logic devices (CPLDs), for example.
The terms “module,” “program,” and “engine” may be used to describe an aspect of computing system 500 implemented to perform a particular function. In some cases, a module, program, or engine may be instantiated via logic machine 502 executing instructions held by storage machine 504. It will be understood that different modules, programs, and/or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” may encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
It will be appreciated that a “service”, as used herein, is an application program executable across multiple user sessions. A service may be available to one or more system components, programs, and/or other services. In some implementations, a service may run on one or more server-computing devices.
When included, display subsystem 506 may be used to present a visual representation of data held by storage machine 504. This visual representation may take the form of a graphical user interface (GUI). As the herein described methods and processes change the data held by the storage machine, and thus transform the state of the storage machine, the state of display subsystem 506 may likewise be transformed to visually represent changes in the underlying data. Display subsystem 506 may include one or more display devices utilizing virtually any type of technology. Such display devices may be combined with logic machine 502 and/or storage machine 504 in a shared enclosure, or such display devices may be peripheral display devices.
When included, input subsystem 508 may comprise or interface with one or more user-input devices such as a keyboard, mouse, touch screen, or game controller. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on- or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition; as well as electric-field sensing componentry for assessing brain activity.
When included, communication subsystem 510 may be configured to communicatively couple computing system 500 with one or more other computing devices. Communication subsystem 510 may include wired and/or wireless communication devices compatible with one or more different communication protocols. As non-limiting examples, the communication subsystem may be configured for communication via a wireless telephone network, or a wired or wireless local- or wide-area network. In some embodiments, the communication subsystem may allow computing system 500 to send and/or receive messages to and/or from other devices via a network such as the Internet.
An example provides a computing device comprising a logic subsystem and a storage subsystem holding instructions executable by the logic subsystem to receive an image, apply a tile array to the image, the tile array comprising a plurality of tiles, and perform face detection on at least a subset of the tiles, where determining whether or not to perform face detection on a given tile is based on a likelihood that the tile includes at least a portion of a human face. In such an example, the subset of the tiles on which face detection is performed alternatively or additionally may be constrained in size by a compute budget. In such an example, the instructions alternatively or additionally may be further executable to, after performing face detection on at least the subset of the tiles, perform face detection on additional tiles until a compute budget is exhausted. In such an example, the instructions alternatively or additionally may be executable for a plurality of received images, and a number of tiles on which face detection is performed alternatively or additionally may vary from image to image for at least some of the plurality of received images, such variation being based on variations in a compute budget. In such an example, the instructions alternatively or additionally may be further executable to, for each tile in which at least a portion of a human face is detected, perform face detection on one or more respectively adjacent tiles in response to such detection. In such an example, the tile array alternatively or additionally may be a first tile array comprising a first plurality of tiles at a first scale, the first tile array belonging to a tile hierarchy comprising a plurality of tile arrays including a second tile array comprising a second plurality of tiles at a second scale, and the subset of the tiles alternatively or additionally may include a first subset of the first plurality of tiles and a second subset of the second plurality of tiles. In such an example, the second subset of the second plurality of tiles alternatively or additionally may spatially correspond to the first subset of the first plurality of tiles. In such an example, some of the plurality of tiles alternatively or additionally may at least partially overlap others of the plurality of tiles. In such an example, the likelihood alternatively or additionally may be determined based on prior face detection. In such an example, the likelihood alternatively or additionally may be determined based on motion. In such an example, the likelihood alternatively or additionally may be determined based on one or both of pixel color and an environmental prior. In such an example, each likelihood alternatively or additionally may be non-zero. In such an example, determining whether or not to perform face detection on the given tile alternatively or additionally may be further based on a scale of the given tile, such that face detection is preferentially performed for tiles of a first scale over tiles of a second scale. Any or all of the above-described examples may be combined in any suitable manner in various implementations.
Another example provides a face detection method comprising receiving an image, applying a tile array to the image, the tile array comprising a plurality of tiles, and performing face detection on at least a subset of the tiles, where determining whether or not to perform face detection on a given tile is based on a likelihood that the tile includes at least a portion of a human face. In such an example, the subset of the tiles on which face detection is performed alternatively or additionally may be constrained in size by a compute budget. In such an example, the method alternatively or additionally may comprise, for each tile in which at least a portion of a human face is detected, performing face detection on one or more respectively adjacent tiles. In such an example, the one or more respectively adjacent tiles alternatively or additionally may be spatially and/or temporally adjacent. In such an example, the subset of tiles alternatively or additionally may include at least one tile at a first scale and at least one tile at a second scale different from the first scale. In such an example, at least one of the subset of tiles alternatively or additionally may at least partially overlap another one of the subset of tiles. Any or all of the above-described examples may be combined in any suitable manner in various implementations.
Another example provides a face detection method, comprising receiving an image, applying a tile array to the image, the tile array comprising a plurality of tiles, establishing a compute budget, and performing face detection on some, but not all, of the tiles until the compute budget is exhausted, where determining whether or not to perform face detection on a given tile is based on a likelihood that the tile includes at least a portion of a human face. Any or all of the above-described examples may be combined in any suitable manner in various implementations.
It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and nonobvious combinations and subcombinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.