Various types of image capture devices (referred to herein as cameras) are used to capture color and/or depth information representing subjects and objects at scenes being captured. For instance, a set of cameras (also referred to as a camera array) may be used to capture still and/or video images depicting the scene using color, depth, grayscale, and/or other image content. Such images may be presented to viewers and/or analyzed and processed for use in various applications.
As one example of such an application, three-dimensional (3D) representations of objects may be produced based on images captured by cameras with different poses (i.e., different positions and/or orientations so as to afford the cameras distinct vantage points) around the objects. As another example, computer vision may be performed to extract information about objects captured in the images and to implement autonomous processes based on this information. These and various other applications of image processing may be used in a variety of entertainment, educational, industrial, agricultural, medical, commercial, robotics, promotional, and/or other contexts and use cases. For instance, extended reality (e.g., virtual reality, augmented reality, etc.) use cases may make use of volumetric models generated based on intensity (e.g., color) and depth images depicting a scene from various vantage points (e.g., various perspectives, various locations, etc.) with respect to the scene.
In any of these types of applications, it would be desirable to capture complete and detailed image data for a subject as a 3D representation of the subject is constructed. Unfortunately, various circumstances and real-world conditions may make this ideal difficult to achieve. For instance, failure to capture a desired amount of information for a subject could occur as a result of no camera having a vantage point to see the subject (or a region of the subject), as a result of the subject (or a region of the subject) being occluded by another object at the scene, as a result of the subject being too far away from certain cameras to be captured with sufficient detail, or the like.
The accompanying drawings illustrate various embodiments and are a part of the specification. The illustrated embodiments are merely examples and do not limit the scope of the disclosure. Throughout the drawings, identical or similar reference numbers designate identical or similar elements.
Methods and systems for refining a 3D representation of a subject are described herein. As mentioned above, various types of 3D representations (e.g., point cloud representations, volumetric mesh representations, voxelized representations, etc.) may be generated for a subject (e.g., a human or animal subject, etc.) and used in various types of applications and use cases, including by incorporating the 3D representations into an extended reality experience such as a virtual reality or augmented reality experience presented to a user. Unfortunately, as has also been mentioned, various circumstances may lead 3D representations of subjects to be produced with various types of deficiencies. For example, deficiencies may arise when particular regions of the subject cannot be suitably captured (e.g., captured at all or captured at a desired level of detail, etc.) at a certain moment in time as a result of circumstances such as, for instance, the subject being outside the fields of view of the cameras, being occluded by other objects, being too far away from the cameras to be captured with a desired pixel density, and so forth. Methods and systems described herein therefore relate to ways of detecting and mitigating such deficiencies. For example, based on methods and systems described herein, a volumetric modeling system may refine (e.g., enhance, supplement, enrich, improve, reconstruct, interpolate, etc.) a 3D representation detected to have certain deficiencies by supplementing captured data with data drawn from a parameterizable body model associated with the subject.
As one example, a human subject engaged in a sporting event (e.g., a football player playing in a football game) may be volumetrically modeled as the sporting event occurs (e.g., so as to provide users with a virtual reality experience that is associated with the sporting event and features a representation of the player). The volumetric modeling of this subject may be most effective when the subject is captured from various vantage points all around the subject. However, during the ordinary course of the game, it is likely that one or more other players may occlude parts of the subject from one or more of the vantage points of the cameras capturing the subject. If, for example, another player blocks the view of the subject's hand that a particular camera has for a particular frame, the 3D representation of the subject generated for that frame may lack information about the hand (at least from that angle) and may accordingly be incomplete or otherwise deficient. For instance, for the frame in question, the 3D representation of the subject may appear to not have a hand, or to have a distorted or poorly detailed representation of the hand as a result of the missing data.
To mitigate this, methods and systems described herein may utilize a parameterizable body model associated with the subject to help fill in missing data about the subject for whom sufficient capture data has not been able to be attained (e.g., details about the subject's hand in this particular example). For example, a domain-specific body model trained (e.g., using machine learning technologies, etc.) to take the form of a specific subject type (e.g., a human subject) may be parameterized and fitted to the subject so as to imitate the pose of the subject (including various physical characteristics of the subject such as his or her height, build, etc.). Once such a body model is parameterized and posed like the subject for the frame in question, a provisional 3D representation that has been constructed with the available capture data (i.e., the 3D representation in which the deficient hand data has been detected) may be refined (e.g., enhanced, supplemented, etc.) using corresponding data from the body model. For example, the hand from the body model could be used to stand in for a missing hand in the provisional 3D representation, the hand from the body model could be used to correct and improve a deformed hand in the provisional 3D representation, the hand from the body model could be used to enhance (e.g., increase the pixel density for) a poorly-detailed hand in the provisional 3D representation, or the like.
Various advantages and benefits may result from refined 3D representations (i.e., those produced based on both captured data and information from parameterized body models) being used instead of 3D representations limited only to whatever image data can be physically captured from a scene. For example, for capture scenarios involving large or complex capture areas, multiple highly-active subjects, and/or other such circumstances likely to introduce deficiencies into 3D representations being produced, volumetric modeling systems employing methods described herein may be capable of producing 3D representations that are more complete and detailed, and that have a higher degree of quality than could be produced based on captured data alone. Such improvements in 3D representations may make extended reality experiences more immersive and enjoyable for users and may similarly improve other types of applications and use cases in which volumetric modeling is performed and 3D representations of subjects are used.
Various specific implementations will now be described in detail with reference to the figures. It will be understood that the specific implementations described below are provided as non-limiting examples and may be applied in various situations. Additionally, it will be understood that other examples not explicitly described herein may also fall within the scope of the claims set forth below. Methods and systems for refining a 3D representation of a subject may provide any or all of the benefits mentioned above, as well as various additional and/or alternative benefits that will be described and/or made apparent below.
System 100 may include memory resources configured to store instructions, as well as one or more processors communicatively coupled to the memory resources and configured to execute the instructions to perform functions described herein. For example, a generalized representation of system 100 is shown in
Memory 102 may store and/or otherwise maintain executable data used by processor 104 to perform any of the functionality described herein. For example, memory 102 may store instructions 106 that may be executed by processor 104. Memory 102 may be implemented by one or more memory or storage devices, including any memory or storage devices described herein, that are configured to store data in a transitory or non-transitory manner. Instructions 106 may be executed by processor 104 to cause system 100 to perform any of the functionality described herein. Instructions 106 may be implemented by any suitable application, software, script, code, and/or other executable data instance. Additionally, memory 102 may also maintain any other data accessed, managed, used, and/or transmitted by processor 104 in a particular implementation.
Processor 104 may be implemented by one or more computer processing devices, including general-purpose processors (e.g., central processing units (CPUs), graphics processing units (GPUs), microprocessors, etc.), special-purpose processors (e.g., application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), etc.), or the like. Using processor 104 (e.g., when processor 104 is directed to perform operations represented by instructions 106 stored in memory 102), system 100 may perform functions associated with refining a 3D representation of a subject in accordance with methods and systems described herein and/or as may serve a particular implementation.
As one example of functionality that processor 104 may perform,
In certain examples, operations of method 200 may be performed in real time so as to provide, receive, process, and/or use data described herein immediately as the data is generated, updated, changed, exchanged, or otherwise becomes available (e.g., analyzing captured image data, producing refined 3D representations based on the image data, providing volumetric content incorporating the refined 3D representations, etc., even as the subjects represented within the image data are engaged in the behaviors depicted in the volumetric content). In such examples, certain operations described herein may involve real-time data, real-time representations, real-time conditions, and/or other real-time circumstances. As used herein, “real time” will be understood to relate to data processing and/or other actions that are performed immediately, as well as conditions and/or circumstances that are accounted for as they exist in the moment when the processing or other actions are performed. For example, a real-time operation may refer to an operation that is performed immediately and without undue delay, even if it is not possible for there to be absolutely zero delay. Similarly, real-time data, real-time representations, real-time conditions, and so forth, will be understood to refer to data, representations, and conditions that relate to a present moment in time or a moment in time when decisions are being made and operations are being performed (e.g., even if after a short delay), such that the data, representations, conditions, and so forth are temporally relevant to the decisions being made and/or the operations being performed.
Each of operations 202-208 of method 200 will now be described in more detail as the operations may be performed by an implementation of system 100 (e.g., by processor 104 executing instructions 106 stored in memory 102).
At operation 202, system 100 may obtain a provisional 3D representation of a subject present at a scene. As used herein, a 3D representation of a subject may refer to any suitable volumetric or other three-dimensional representation, model, depiction, etc., of any suitable subject (e.g., human or animal subject, inanimate object, etc.). For instance, a 3D representation may be implemented as a point cloud representation, a mesh representation (with or without a texture applied to the mesh), a voxelized representation, or any other suitable data structure configured to represent the subject in any manner as may serve a particular implementation. As used herein, a 3D representation may be referred to as a “provisional” 3D representation when it has not yet been refined in accordance with methods and principles described herein, or at least when there is further refinement to be performed before the 3D representation will be considered to be fully-refined and presentable. For example, a provisional 3D representation may be implemented as a point cloud or mesh representation of a subject that has been constructed exclusively from captured image data and that does not yet incorporate any information from a body model or other auxiliary source of information (as will be described in more detail below). In contrast, as will be further described below, “refined” 3D representations, as that term is used herein, will be understood to refer to 3D representations that have been refined in accordance with methods and/or principles described herein so as to incorporate not only captured image data but also auxiliary data such as information derived from a particular body model or other machine learning model.
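By way of illustration only, the following Python sketch shows one possible way that point cloud, mesh, and textured mesh representations of the kind described above might be organized as data structures. The class and field names used here (e.g., PointCloudRepresentation) are hypothetical and are not intended to prescribe any particular implementation.

```python
# Illustrative only: hypothetical data structures for the 3D representations
# described above (point cloud, mesh, and textured mesh). Field layouts are
# assumptions and may differ in an actual implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class PointCloudRepresentation:
    points: np.ndarray       # (N, 3) surface points in scene coordinates

@dataclass
class MeshRepresentation:
    vertices: np.ndarray     # (V, 3) vertex positions
    faces: np.ndarray        # (F, 3) indices of vertices forming triangles

@dataclass
class TexturedMeshRepresentation(MeshRepresentation):
    uv_coords: np.ndarray    # (V, 2) texture coordinates per vertex
    texture: np.ndarray      # (H, W, 3) color image applied to the mesh
```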
The provisional 3D representation obtained at operation 202 may be based on (e.g., exclusively based on) image data captured by a set of cameras configured to have different fields of view according to different vantage points the cameras have at the scene. For example, if the subject is a human subject (e.g., a football player) present in a scene of a sports venue (e.g., on a football field), the set of cameras may be disposed at various locations around the scene (e.g., around the football field) and may be inwardly facing so as to capture the human subject and other subjects and/or objects present at the scene from a variety of vantage points. A sports venue scene with a human subject involved in a sporting event will be illustrated and described in more detail below.
In some implementations, system 100 may obtain the provisional 3D representation at operation 202 by receiving the provisional 3D representation from an associated system that is configured to analyze the image data from the cameras and to generate the provisional 3D representation based on that image data. In other implementations, system 100 may obtain the provisional 3D representation at operation 202 by generating the provisional 3D representation itself, based on the image data captured by the set of cameras. For example, the image data captured by the set of cameras may include both color data and depth data representative of objects present at the scene (captured by respective color and depth capture devices at the scene) and the obtaining of the provisional 3D representation of the subject may be performed by generating the provisional 3D representation of the subject based on color data and depth data representative of the subject. In some examples, system 100 may capture the image data itself (e.g., if the set of cameras are incorporated into system 100) or may receive captured image data from an external set of cameras (e.g., if the cameras are not incorporated into system 100) and use the captured image data to construct the provisional 3D representation. In some examples, this may be performed on a frame-by-frame basis as the subject moves through the scene and the set of cameras (which may be synchronized with one another in certain implementations) captures respective sequences of images depicting the subject from the respective vantage points of the cameras. When such frame-by-frame modeling is performed, individual 3D representations associated with discrete moments in time (e.g., associated with different frame times) may be sequenced to create a time-varying 3D representation (also referred to as a 4D representation with the fourth dimension being a time dimension) of the subject that moves and imitates behaviors of the subject (e.g., a football player running down the field, etc.).
At operation 204, system 100 may identify a deficiency in the provisional 3D representation of the subject that was obtained at operation 202. For example, as a provisional representation constructed exclusively from image data that has been captured at the scene, the obtained 3D representation may be limited in scope and quality by the image data that is available (i.e., that was able to be captured under whatever circumstances happen to presently exist at the scene). Such limits may lead to deficiencies associated with missing regions of the subject (e.g., for which no data is available), distorted regions of the subject (e.g., for which the data is suspect, incorrect, or incomplete), undetailed regions of the subject (e.g., for which the level of detail is lower than may be desired, etc.), or the like. As one example, if the subject happens to be far away from certain cameras in the set of cameras (e.g., on the opposite side of a large playing field), the deficiency identified at operation 204 may be associated with a relatively low pixel density of certain image data representing certain regions of the subject (i.e., depicting these regions with insufficient detail). As another example, if parts of the subject happen to be occluded from certain camera views (e.g., a player that is partially occluded by the presence of other players in the vicinity), the deficiency identified at operation 204 may be associated with missing data that could not be captured for the subject as a result of the occlusion.
At operation 206, system 100 may determine a set of parameters for a body model. The body model may be parameterizable to adaptively model (e.g., imitate the form of) a body type of the subject (e.g., a human body or another suitable body type for another type of subject such as an animal or particular type of inanimate object). As such, the determining of the set of parameters at operation 206 may be performed such that an application of the set of parameters to the body model will result in a parameterized body model that imitates a pose of the provisional 3D representation of the subject. For example, as will be described and illustrated in an extended example below, if the subject is a football player running down a football field, the set of parameters may be determined such that a human body model will imitate (e.g., approximately take the form of) the football player, including taking on an approximate height and build of the particular player, being posed with the same particular running posture the player has, and so forth.
In some implementations, the identifying of the deficiency at operation 204 and the determining of the set of parameters at operation 206 may be performed independently from one another, either concurrently or in an arbitrary sequence (e.g., with operation 204 performed prior to operation 206 as shown in
For example, in a particular implementation, the determining of the set of parameters for the body model (at operation 206) may specifically be performed prior to the identifying of the deficiency in the provisional 3D representation of the subject (at operation 204). Moreover, in this example, an additional operation (not explicitly shown in
At operation 208, system 100 may generate a refined 3D representation of the subject in which the deficiency identified at operation 204 is mitigated. As has been mentioned (and as will be described and illustrated in more detail below), this refined 3D representation may be generated based on both the provisional 3D representation obtained at operation 202 and the body model as parameterized at operation 206. In this way, the deficiency identified at operation 204 may be eliminated or otherwise mitigated from the final 3D representation that will be used (e.g., incorporated into volumetric content, presented to a user, or otherwise employed in the relevant application or use case). For instance, the nose-related example described above provides one example of filling in an identified “hole” in the model. As another example, if a hand of a subject is represented with missing data or distortion in an otherwise suitable provisional 3D representation, the refined 3D representation generated at operation 208 may involve incorporating hand-related data from the parameterized body model (for which the set of parameters were determined at operation 206) into the provisional 3D representation to mitigate this hand-related deficiency and produce a refined 3D representation of the subject in which the hand (as well as every other part of the subject) is complete, accurate, and sufficiently detailed.
While configuration 300 represents one particular use case or application of a volumetric modeling system such as system 100 (i.e., a specific extended reality use case in which image data 306 representing objects in scene 304 is used to generate volumetric representations of the objects for use in presenting an extended reality experience to user 314), it will be understood that system 100 may similarly be used in various other use cases and/or applications as may serve a particular implementation. For example, implementations of system 100 may be used to perform volumetric modeling (including by refining 3D representations of certain types of subjects) for use cases that do not involve extended reality content but rather are aimed at more general computer vision applications, object modeling applications, or the like. Indeed, system 100 may be employed for any suitable image processing application or use case in fields such as entertainment, education, manufacturing, medical imaging, robotic automation, or any other suitable field. Thus, while configuration 300 and various examples described and illustrated herein use volumetric object modeling and extended reality content production as an example use case, it will be understood that configuration 300 may be modified or customized in various ways to suit any of these other types of applications or use cases. Each of the elements of configuration 300 will now be described in more detail.
The array of cameras 302 (also referred to herein as image capture devices) may be configured to capture image data (e.g., color data, intensity data, depth data, and/or other suitable types of image data) associated with scene 304 and objects included therein (i.e., objects present at the scene). For instance, the array of cameras 302 may include a synchronized set of video cameras that are each oriented toward the scene and configured to capture color images depicting objects at the scene. Additionally, the same video cameras (or distinct depth capture devices associated with the video cameras) may be used to capture depth images of the objects at the scene using any suitable depth detection techniques (e.g., stereoscopic techniques, time-of-flight techniques, structured light techniques, etc.). As will be illustrated in more detail below, each of cameras 302 included in the array may have a different pose (i.e., position and orientation) with respect to the scene being captured (i.e., scene 304 in this example). The poses of the cameras may be selected, for example, to provide coverage of the scene, or at least of a particular volumetric capture zone defined at the scene (not explicitly shown in
Scene 304 may represent any real-world area for which image data is captured by the array of cameras 302. Scene 304 may be any suitable size from a small indoor studio space to a large outdoor field or larger space, depending on the arrangement and number of cameras 302 included in the array. Certain scenes 304 may include or otherwise be associated with a particular volumetric capture zone that is defined with an explicit boundary to guarantee a minimum level of coverage by the array of cameras 302 (e.g., coverage from multiple perspectives around the zone) that may not necessarily be provided outside of the zone. For example, for a scene 304 at a sports stadium, the volumetric capture zone may be limited to the playing field on which the sporting event takes place.
Scene 304 may include one or more objects (not explicitly shown in
To illustrate an example of how the array of cameras 302 may be posed with respect to scene 304 and the objects included therein,
Returning to
The implementation of volumetric modeling system 100 shown to be receiving image data 306 in
Extended reality content 308 may be represented by a data stream generated by system 100 that includes volumetric content (e.g., refined 3D representations of objects at scene 304, etc.) and/or other data (e.g., metadata, etc.) useful for presenting the extended reality content. As shown, a data stream encoding extended reality content 308 may be transmitted by way of network 310 to XR presentation device 312 so that extended reality content 308 may be presented by the device to user 314. Extended reality content 308 may include any number of volumetric representations of objects 402 (including subject 404) and/or other such content that, when presented by XR presentation device 312, provides user 314 with an extended reality experience involving the volumetric object representations. For instance, in the example in which scene 304 includes a playing field where a sporting event is taking place and the objects 402 represented volumetrically in extended reality content 308 are players involved in the sporting event, the extended reality experience presented to user 314 may allow user 314 to immerse himself or herself in the sporting event such as by virtually standing on the playing field, watching the players engage in the event from a virtual perspective of the user's choice (e.g., right in the middle of the action, etc.), and so forth.
Network 310 may serve as a data delivery medium by way of which data may be exchanged between a server domain (in which system 100 is included) and a client domain (in which XR presentation device 312 is included). For example, network 310 may be implemented by any suitable private or public networks (e.g., a provider-specific wired or wireless communications network such as a cellular carrier network operated by a mobile carrier entity, a local area network (LAN), a wide area network, the Internet, etc.) and may use any communication technologies, devices, media, protocols, or the like, as may serve a particular implementation.
XR presentation device 312 may be configured to provide an extended reality experience to user 314 and to present refined 3D representations (e.g., refined 3D representations incorporated into extended reality content 308) to user 314 as part of the extended reality experience. To this end, XR presentation device 312 may represent any device used by user 314 to view volumetric representations (e.g., including refined 3D representations) of objects 402 that are generated by system 100 and are included within extended reality content 308 received by way of network 310. For instance, in certain examples, XR presentation device 312 may include or be implemented by a head-mounted extended reality device that presents a fully-immersive virtual reality world, or that presents an augmented reality world based on the actual environment in which user 314 is located (but adding additional augmentations such as volumetric object representations produced and provided by system 100). In other examples, XR presentation device 312 may include or be implemented by a mobile device (e.g., a smartphone, a tablet device, etc.) or another type of media player device such as a computer, a television, or the like.
In the example of
Based on the different types of captured image data 306 (i.e., based on depth data 306-D and color data 306-C),
A generalized embodiment of a volumetric modeling system configured to obtain and refine a 3D representation of a subject has been described in relation to
Point cloud representation 504 is shown to be generated based on depth data 306-D (e.g., a plurality of synchronously captured depth data images that each depict subject 404 from different perspectives around the subject). Point cloud representation 504 may be generated based on depth data 306-D in any suitable way and may incorporate a plurality of points defined in relation to a coordinate system associated with scene 304. Because each of these points may correspond to depth data captured by at least one of cameras 302 in at least one of the depth images represented within captured image data 306, the plurality of points may collectively form a “cloud”-like object having the shape and form of subject 404 and located within the coordinate system at a location that corresponds to the location of subject 404 within scene 304.
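As a non-limiting illustration of how depth data 306-D might be converted into such a cloud of points, the following Python sketch unprojects a single depth image into scene coordinates under the assumption of a pinhole camera model with known intrinsics (K) and a known camera-to-scene pose (R, t); the function name and conventions used are hypothetical.

```python
# Hedged sketch: unproject one depth image into scene-space points, assuming a
# pinhole camera with intrinsics K and camera-to-scene rotation R / translation t.
import numpy as np

def depth_image_to_points(depth, K, R, t):
    """depth: (H, W) array of depth values; K: (3, 3); R: (3, 3); t: (3,)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                                     # keep measured pixels only
    pixels = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)
    rays = np.linalg.inv(K) @ pixels                      # camera-space rays
    cam_points = rays * depth[valid]                      # scale rays by depth
    return (R @ cam_points).T + t                         # (N, 3) scene-space points

# Points from all synchronized depth images may then be merged, e.g.:
# cloud = np.vstack([depth_image_to_points(d, K_i, R_i, t_i)
#                    for d, K_i, R_i, t_i in views])
```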
Mesh representation 506 is shown to be generated based on point cloud representation 504. Like point cloud representation 504, mesh representation 506 may be based only on depth data 306-D, ultimately leading mesh representation 506 to be limited to describing the physical shape, geometry, and location of subject 404 (rather than describing, for example, the appearance and texture of the subject). However, whereas point cloud representation 504 may be implemented by a large number of surface points where surfaces of subject 404 have been detected in the space of scene 304, mesh representation 506 may represent the outer surface of subject 404 in a cleaner (e.g., better defined and less fuzzy) and more efficient (in terms of data economy) manner as a mesh of interconnected vertices. These interconnected vertices of mesh representation 506 may form a large number of geometric shapes (e.g., triangles, quadrilaterals, etc.) that collectively reproduce the basic shape and form of point cloud representation 504 and the actual subject 404 as depicted in images 502.
Textured mesh representation 508 is shown to be based on mesh representation 506 but also to account for color data 306-C. For example, at this stage of the process, texture content (e.g., color data representative of photographic imagery that has been captured) may be applied to mesh representation 506 in a manner analogous to overlaying a skin onto a wireframe object. When this texturing is complete, textured mesh representation 508 may be considered a full and complete 3D representation or volumetric model of subject 404 that would be suitable for inclusion in extended reality content 308 or another type of content associated with another suitable volumetric application. However, as has been mentioned, various circumstances (e.g., circumstances present at the scene itself and/or associated with the image capture setup or procedure being used) may lead captured image data 306, and hence also the textured mesh representation 508 derived therefrom, to have certain deficiencies that it may be desirable to mitigate prior to the textured mesh representation being presented to a user (or otherwise used in the relevant application for which the volumetric modeling is being performed). A few examples of the types of circumstances that may lead to such deficiencies (and thereby make it useful to perform body model parameterization 510 and textured mesh refinement 512) will now be illustrated and described.
Due to this position of subject 404, as well as the arrangement and configuration of the set of cameras 302, subject 404 will be understood to be located right on the fringe of what cameras 302-5 and 302-6 are able to capture. Indeed, given the position of subject 404 shown in
Due to the geometry of camera 302-1, occluding object 702, and subject 404, a certain region of subject 404 will not be depicted in an image captured by camera 302-1 at this moment in time. For example, certain regions of subject 404 (e.g., protruding extremities such as a foot, etc.) may, from the perspective of camera 302-1, be positioned behind occluding object 702 (e.g., another player or another object on the field) so as to not be captured at all, or to be captured in a distorted or otherwise suboptimal way. Accordingly, a textured mesh representation 706 illustrated in
Due to this position of subject 404, as well as the arrangement and configuration of the set of cameras 302, subject 404 may be viewable from several different perspectives, but may be captured with greater point density (depth image data) and/or pixel density (color image data) from some of these perspectives than others. For example, due to the proximity of cameras 302-3 through 302-5, image data captured by these cameras may capture sufficiently detailed image data for subject 404, while due to the relatively long distance between subject 404 and camera 302-1, image data captured by this camera may be associated with a lower level of detail. Indeed, given the position of subject 404 shown in
Returning to
To illustrate,
In
In this way, a generic body model in a generic pose (such as illustrated by pose 902-1) may be iteratively adjusted (through intermediate poses such as pose 902-2) until reaching a final pose (pose 902-3) in which the joints of the body model imitate the corresponding joints of the subject represented in the captured image data (e.g., image data 306) and in the provisional 3D representation (e.g., any of representations 504-508). As shown by
System 100 may determine the set of body model parameters 904 (e.g., parameterizing body model 900 from initial pose 902-1 until ultimately reaching final pose 902-3) in any manner as may serve a particular implementation. For instance, in certain implementations, the determining of the set of parameters 904 for body model 900 may be performed using a model-fitting technique associated with an error equation that quantifies a fit of the body model to the pose of the provisional 3D representation. Such a model-fitting technique may involve iteratively adjusting and reassessing the fit of the body model to the pose of the provisional 3D representation with an objective of minimizing the error equation until the error equation satisfies a predetermined error threshold that represents an acceptable error in the fit of the body model to the pose of the provisional 3D representation. This error equation may include a plurality of terms representing different aspects of the fit of body model 900 to the pose of the provisional 3D representation and the error equation may be configured to quantify the fit of the body model to the pose of the provisional 3D representation based on a combination of different assessments of the fit corresponding to the plurality of terms. For example, if the error equation has two terms (also referred to as loss terms or error terms), system 100 may sum the error represented by both terms after each iteration (i.e., after each intermediate adjustment of the set of body model parameters 904) with an aim to decrease the error terms to the lowest values possible (representing the best imitation of the provisional 3D representation). By having different terms in the error equation, different aspects of the fit of the body model may be independently assessed and constrained to thereby produce a highly accurate fit that accounts for all of the different aspects.
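For purposes of illustration only, the set of body model parameters 904 might be organized along the lines of the following Python sketch, in which a small number of shape coefficients and per-joint rotations jointly define the body model's build and pose; the specific parameter counts, rotation convention, and field names are assumptions rather than requirements of any particular body model.

```python
# Hypothetical organization of body model parameters 904. The joint count,
# number of shape coefficients, and rotation convention are assumptions.
from dataclasses import dataclass, field
import numpy as np

NUM_JOINTS = 24              # assumed joint count for a human body model
NUM_SHAPE_COEFFS = 10        # assumed number of shape (height/build) coefficients

@dataclass
class BodyModelParameters:
    shape: np.ndarray = field(
        default_factory=lambda: np.zeros(NUM_SHAPE_COEFFS))
    joint_rotations: np.ndarray = field(                  # per-joint axis-angle rotations
        default_factory=lambda: np.zeros((NUM_JOINTS, 3)))
    root_translation: np.ndarray = field(                 # placement within the scene
        default_factory=lambda: np.zeros(3))
```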
To illustrate how an error equation with different terms may be employed to iteratively perform the body model parameterization shown in
At operation 1002-1, system 100 may begin the model-fitting technique by setting or otherwise obtaining an initial set of parameters (“Set Parameters”). The parameters set or obtained at this step may be a default set of parameters such as initial body model parameters 904-1 or may be based on parameters determined as part of a related operation (e.g., body model parameters determined for a previous frame, body model parameters that have been predetermined for the particular subject and only need to be modified to adjust the posture, etc.).
At operation 1002-2, system 100 may assess or otherwise test the body model parameters (“Test Fit”) that were initially set at operation 1002-1 or that have been adjusted by operation 1002-4 (described in more detail below). For example, at operation 1002-2, system 100 may assess the characteristics of joints, vectors, and/or other aspects of the body model (as it has been parameterized up to this point) and the provisional 3D representation of the subject (posed in the way that is targeted for imitation by the model) in accordance with an error equation 1004 that may include one or more terms 1006 (e.g., terms 1006-1 and 1006-2 in this example). As will be described in more detail below, an error equation such as error equation 1004 may allow system 100 to objectively quantify how close the fit of the body model is to the target pose of the subject so that, even if the body model never perfectly imitates the pose of the subject, system 100 may determine when the fit is close enough to be considered final (e.g., as with the final pose 902-3 illustrated in
To this end, system 100 may, at operation 1002-3, determine if the fit quantified at operation 1002-2 satisfies an objective predetermined error threshold (“Satisfy Threshold?”). For example, the error threshold may have been determined ahead of time (e.g., by a designer of the system) to represent an acceptable or target error in the fit of the body model to the pose of the provisional 3D representation.
As shown, if it is determined at operation 1002-3 that the error threshold is satisfied by the current set of parameters (“Yes”), flow diagram 1000 may be considered to be at an end and these parameters may be applied to produce the parameterized body model used for refining the provisional 3D representation in the ways described herein. Conversely, if it is determined at operation 1002-3 that the error threshold is not satisfied by the current set of parameters (“No”), flow diagram 1000 may continue on to operation 1002-4, where the parameters may be adjusted (“Adjust Parameters”) to attempt to make the body model better fit the target pose of the provisional 3D representation. Using this feedback loop, the model-fitting technique may loop and iterate for a maximum number of cycles or until the fit of the body model is such that the error threshold becomes satisfied and the flow diagram ends.
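One minimal sketch of this set/test/adjust loop is shown below in Python. It assumes a callable total_error() that implements error equation 1004 for a flattened parameter vector, and it uses a simple random-perturbation adjustment purely for illustration; practical implementations may instead use gradient-based or other dedicated solvers, and the threshold, step size, and iteration limit shown are arbitrary assumptions.

```python
# Hedged sketch of the loop of flow diagram 1000: set parameters, test fit,
# check the error threshold, and adjust parameters until the threshold is met
# or a maximum number of iterations is reached.
import numpy as np

def fit_body_model(initial_params, total_error, error_threshold=1e-3,
                   max_iterations=500, step_size=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    params = np.asarray(initial_params, dtype=float)      # Set Parameters (1002-1)
    error = total_error(params)                           # Test Fit (1002-2)
    for _ in range(max_iterations):
        if error <= error_threshold:                      # Satisfy Threshold? (1002-3)
            break
        candidate = params + step_size * rng.standard_normal(params.shape)
        candidate_error = total_error(candidate)          # Adjust + re-test (1002-4)
        if candidate_error < error:                       # keep only improvements
            params, error = candidate, candidate_error
    return params, error
```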
Error equation 1004 may be used in the ways described above to assess the fit of the body model to the target pose of the provisional 3D representation. More specifically, error equation 1004 may be used to quantify how well parameterized the body model is (how optimal the current set of body model parameters 904 are) in light of the objective to parameterize the body model to imitate the subject. To this end, as shown, error equation 1004 may be configured to quantify a total amount of error (ETotal) as a function of the pose detected for the subject (pd) and the pose of the body model (pb) when the current set of body model parameters are applied. As mentioned above, this total error may be determined as a sum of different aspects of error quantified using different terms (E1, E2, etc.). Accordingly, as shown, error equation 1004 may set ETotal(pd, pb) equal to the sum of error terms E1(pd, pb), E2(pd, pb), and/or any other error terms as may serve a particular implementation. It will also be understood that only a single error term may be used in certain implementations.
Next to error equation 1004,
The first error term 1006-1 included in this implementation of error equation 1004 is a joint-based term configured to account for a position similarity between joints of the body model and corresponding joints of the subject as represented in the provisional 3D representation. To illustrate, circles labeled pd will be understood to represent detected positions of one particular joint (e.g., an elbow, a shoulder, a hand, etc.) of subject 404 as represented in the provisional 3D representation, while squares labeled pb will be understood to represent current positions of the corresponding joint (e.g., the elbow, shoulder, or hand, etc.) of the parameterizable body model 900 according to the current set of body model parameters being assessed or tested. As shown, a relatively large distance 1008-1 between these joints may be associated with a low position similarity (i.e., the positions of the joints are not particularly similar as they are relatively far apart) and a corresponding high degree of error (“High Error”) when the joint-based term 1006-1 is assessed. In contrast, a relatively small distance 1008-2 between these joints may be associated with a high position similarity (i.e., the positions of the joints are substantially similar as they are relatively near one another) and a corresponding low degree of error (“Low Error”) when the joint-based term 1006-1 is assessed.
The second error term 1006-2 included in this implementation of error equation 1004 is a vector-based (or bone-based) term configured to account for a pose similarity between vectors extending between particular joints of the body model and corresponding vectors extending between particular joints of the subject as represented in the provisional 3D representation. To illustrate, vectors labeled pd and extending between joints represented by circles will be understood to represent detected poses of one particular vector between particular joints (e.g., a vector between a shoulder joint and an elbow joint representing the humerus bone, etc.) of subject 404 as represented in the provisional 3D representation, while vectors labeled pb and extending between joints represented by squares will be understood to represent current poses of corresponding vectors between the particular joints (e.g., the vector representing the humerus bone, etc.) of the parameterizable body model 900 according to the current set of body model parameters being assessed. As shown, vectors at relatively disparate angles 1010-1 may be associated with a low pose similarity (i.e., the poses of the vectors or bones are not particularly similar as they are angled and oriented in substantially different ways) and a corresponding high degree of error (“High Error”) when the vector-based term 1006-2 is assessed. In contrast, vectors at relatively parallel angles 1010-2 may be associated with a high pose similarity (i.e., the poses of the vectors or bones are substantially similar as they are angled and oriented in the same ways) and a corresponding low degree of error (“Low Error”) when the vector-based term 1006-2 is assessed.
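The following Python sketch illustrates, under stated assumptions, how a joint-based term such as term 1006-1 and a vector-based term such as term 1006-2 might be computed and summed in the manner of error equation 1004. The squared-distance and cosine-based formulations, as well as the optional weights, are illustrative choices rather than requirements.

```python
# Illustrative error terms: a joint-based term (position similarity) and a
# vector-based term (bone orientation similarity). Formulations and weights
# are assumptions made for the sake of example.
import numpy as np

def joint_term(detected_joints, model_joints):
    """Sum of squared distances between corresponding joint positions (pd vs. pb)."""
    return float(np.sum((detected_joints - model_joints) ** 2))

def bone_term(detected_joints, model_joints, bones):
    """Penalize angular disagreement between corresponding bone vectors.

    bones: list of (parent_index, child_index) joint pairs, e.g., shoulder-elbow.
    """
    error = 0.0
    for parent, child in bones:
        v_d = detected_joints[child] - detected_joints[parent]
        v_b = model_joints[child] - model_joints[parent]
        cos_angle = np.dot(v_d, v_b) / (np.linalg.norm(v_d) * np.linalg.norm(v_b))
        error += 1.0 - np.clip(cos_angle, -1.0, 1.0)      # 0 when vectors are parallel
    return float(error)

def total_error(detected_joints, model_joints, bones, w1=1.0, w2=1.0):
    """ETotal as an (optionally weighted) sum of the two terms, per equation 1004."""
    return (w1 * joint_term(detected_joints, model_joints)
            + w2 * bone_term(detected_joints, model_joints, bones))
```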
As mentioned above, different error terms may be useful for characterizing different aspects of the current fit provided by a particular set of body model parameters. This is illustrated in
Returning to
As has been mentioned, one aspect of refining a provisional 3D representation is identifying the deficiencies that are to be mitigated in the refinement. This identifying of deficiencies may be performed in any suitable way, including in ways that analyze only the provisional 3D representation (e.g., to determine if there is a missing or deficient body part such as the missing hand or foot illustrated, respectively, by deficiencies 604 and 804) and/or in ways that leverage the parameterized body model produced by the parameterization 510 process. For example, as mentioned above, once the parameterized body model is completely fitted to the pose of the subject, various pixels represented in captured image data 306 for the subject may be projected to the corresponding locations on the surface of the parameterized body model (e.g., among the vertices of the body model). Each detected point associated with the provisional 3D representation may be matched with its closest vertex on the parameterized body model and, once all points are matched to their corresponding vertices, any vertices detected to not be associated with any points (or detected to be associated with a number of associated points below a set threshold) may be marked as a deficiency (e.g., an occluded region, an out-of-frame region, an insufficiently-detailed region, etc.).
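A minimal sketch of this point-to-vertex matching approach is shown below in Python, assuming the availability of the SciPy library for nearest-neighbor lookup; the function name and the particular threshold value are assumptions, and other matching strategies could equally be used.

```python
# Hedged sketch: match each captured point to its nearest body model vertex and
# flag vertices supported by too few points as potentially deficient regions.
import numpy as np
from scipy.spatial import cKDTree

def find_deficient_vertices(captured_points, body_model_vertices,
                            min_points_per_vertex=3):
    tree = cKDTree(body_model_vertices)
    _, nearest_vertex = tree.query(captured_points)        # nearest vertex per point
    counts = np.bincount(nearest_vertex,
                         minlength=len(body_model_vertices))
    return np.flatnonzero(counts < min_points_per_vertex)  # indices of deficient vertices
```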
Once the deficiencies have been identified in the provisional 3D representation in any of these ways (e.g., leveraging or not leveraging the parameterized body model), system 100 may perform textured mesh refinement 512 to mitigate the identified deficiencies in any suitable manner to thereby generate the refined 3D representation 514. As one example, based on an identified deficiency (e.g., a nose on the subject's face that is partially occluded or otherwise lacks sufficient detail), system 100 may define a feature area on the provisional 3D representation (e.g., an area including the nose and surrounding part of the face) and a corresponding feature area on the parameterized body model. Given these corresponding feature areas, system 100 may identify base points in the feature area of the provisional 3D representation and map these to corresponding base points of the body model. System 100 may then deform the provisional 3D representation (e.g., raising or lowering points within the feature area) in accordance with corresponding points of the body model. For example, these target points may be raised or lowered along a specified vector (e.g., orthogonal to the main surface of the body model in the area) until they intersect with the parameterized body model. In the example of the nose, these base points could include the facial areas surrounding the nose (which are used for the alignment), and the feature points in the target mesh (representing the nose itself) may then be moved along a vector orthogonal to the face until they intersect the corresponding body model mesh. In some examples, the textured mesh refinement 512 may involve interpolating surface points from captured data to initially fill deficient parts of the 3D representation.
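As a simplified, non-limiting illustration of this kind of deformation, the following Python sketch moves each point in a feature area along a supplied direction (e.g., a normal orthogonal to the surrounding surface) by the component, along that direction, of its offset to the nearest body model vertex. This approximates the "deform until intersection" behavior described above without performing a full ray-mesh intersection test; the function names are hypothetical.

```python
# Hedged sketch: pull feature-area points toward the parameterized body model
# along given directions (e.g., surface normals). This approximates moving each
# point along its normal until it meets the body model surface.
import numpy as np
from scipy.spatial import cKDTree

def deform_feature_points(feature_points, normals, body_model_vertices):
    tree = cKDTree(body_model_vertices)
    _, nearest = tree.query(feature_points)                # nearest model vertex per point
    offsets = body_model_vertices[nearest] - feature_points
    unit_normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    distances = np.sum(offsets * unit_normals, axis=1, keepdims=True)
    return feature_points + distances * unit_normals       # points moved along normals
```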
In some implementations, upsampling may be performed to attempt to create a uniform distribution of points and/or vertices in the deficient area as compared to other areas of the 3D representation being refined. For example, since it may not be possible to deform a single triangle (or small number of triangles) that may currently be associated with the provisional 3D representation of the nose to imitate the nuances of the nose represented by the body model, system 100 may artificially generate additional triangles (by upsampling to create data that was not actually captured) that may be deformed to more closely match the body model mesh and/or to approximately match the vertex density that is desired (e.g., the vertex density present on other parts of the 3D representation). In these ways, insights obtained from the parameterized body model about how subject 404 is expected to look may be imported into refined 3D representation 514 to fill holes and otherwise mitigate identified deficiencies of the provisional 3D representation based on the captured data alone.
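One simple way such upsampling might be sketched is a midpoint subdivision in which each triangle of the deficient feature area is split into four smaller triangles, increasing the vertex density available for deformation. The Python sketch below is illustrative only and does not reflect the upsampling scheme of any particular implementation.

```python
# Hedged sketch: 1-to-4 midpoint subdivision of triangles to artificially raise
# vertex density in a deficient feature area prior to deformation.
import numpy as np

def subdivide_triangles(vertices, faces):
    vertices = [np.asarray(v, dtype=float) for v in vertices]
    new_faces, midpoint_cache = [], {}

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in midpoint_cache:                      # reuse shared edge midpoints
            vertices.append((vertices[i] + vertices[j]) / 2.0)
            midpoint_cache[key] = len(vertices) - 1
        return midpoint_cache[key]

    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return np.asarray(vertices), np.asarray(new_faces)
```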
In certain embodiments, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices. In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium (e.g., a memory, etc.), and executes those instructions, thereby performing one or more operations such as the operations described herein. Such instructions may be stored and/or transmitted using any of a variety of known computer-readable media.
A computer-readable medium (also referred to as a processor-readable medium) includes any non-transitory medium that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer). Such a medium may take many forms, including, but not limited to, non-volatile media and/or volatile media. Non-volatile media may include, for example, optical or magnetic disks and other persistent memory. Volatile media may include, for example, dynamic random-access memory (DRAM), which typically constitutes a main memory. Common forms of computer-readable media include, for example, a disk, hard disk, magnetic tape, any other magnetic medium, a compact disc read-only memory (CD-ROM), a digital video disc (DVD), any other optical medium, random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), FLASH-EEPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
As shown in
Communication interface 1102 may be configured to communicate with one or more computing devices. Examples of communication interface 1102 include, without limitation, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.
Processor 1104 generally represents any type or form of processing unit capable of processing data or interpreting, executing, and/or directing execution of one or more of the instructions, processes, and/or operations described herein. Processor 1104 may direct execution of operations in accordance with one or more applications 1112 or other computer-executable instructions such as may be stored in storage device 1106 or another computer-readable medium.
Storage device 1106 may include one or more data storage media, devices, or configurations and may employ any type, form, and combination of data storage media and/or device. For example, storage device 1106 may include, but is not limited to, a hard drive, network drive, flash drive, magnetic disc, optical disc, RAM, dynamic RAM, other non-volatile and/or volatile data storage units, or a combination or sub-combination thereof. Electronic data, including data described herein, may be temporarily and/or permanently stored in storage device 1106. For example, data representative of one or more executable applications 1112 configured to direct processor 1104 to perform any of the operations described herein may be stored within storage device 1106. In some examples, data may be arranged in one or more databases residing within storage device 1106.
I/O module 1108 may include one or more I/O modules configured to receive user input and provide user output. One or more I/O modules may be used to receive input for a single virtual experience. I/O module 1108 may include any hardware, firmware, software, or combination thereof supportive of input and output capabilities. For example, I/O module 1108 may include hardware and/or software for capturing user input, including, but not limited to, a keyboard or keypad, a touchscreen component (e.g., touchscreen display), a receiver (e.g., an RF or infrared receiver), motion sensors, and/or one or more input buttons.
I/O module 1108 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, I/O module 1108 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
In some examples, any of the facilities described herein may be implemented by or within one or more components of computing device 1100. For example, one or more applications 1112 residing within storage device 1106 may be configured to direct processor 1104 to perform one or more processes or functions associated with processor 104 of system 100. Likewise, memory 102 of system 100 may be implemented by or within storage device 1106.
To the extent the aforementioned implementations collect, store, or employ personal information of individuals, groups or other entities, it should be understood that such information shall be used in accordance with all applicable laws concerning protection of personal information. Additionally, the collection, storage, and use of such information can be subject to consent of the individual to such activity, for example, through well known “opt-in” or “opt-out” processes as can be appropriate for the situation and type of information. Storage and use of personal information can be in an appropriately secure manner reflective of the type of information, for example, through various access control, encryption and anonymization techniques for particularly sensitive information.
In the preceding description, various example embodiments have been described with reference to the accompanying drawings. It will, however, be evident that various modifications and changes may be made thereto, and additional embodiments may be implemented, without departing from the broader scope of the invention as set forth in the claims that follow. The specification and drawings are accordingly to be regarded in an illustrative rather than restrictive sense.