The present disclosure relates generally to a system, method and computer accessible medium for providing an image or video of a user changing his/her clothing or other attributes, and more specifically, to exemplary embodiments of a system, method and computer accessible medium that facilitate a user to select different clothing, accessories or other attributes from a database, and to provide that user with an image or a video showing a person wearing the clothing, accessories, or other attributes.
Determining which outfit to wear on a daily basis can be very time consuming. A person must first select the potential outfits to wear, and then try on all of the different outfits, including mixing and matching different parts of different outfits, in order to choose the best combination. This process needs to be repeated every day, and it can also be time consuming to select clothes to purchase at a store. First, the user must browse the entire store, and select the items the user wishes to try on. Then, the user enters a changing room, and tries on all of the clothing. In order to alleviate the time and stress of this process, different systems have been developed. Many fashion retail websites, such as Glamour magazine's and H&M's retail website's “Virtual Dressing Room”, and other companies, such as Embodee with its online try-on, developed applets in an attempt to address the above-described issues. For example, JC Penney, in collaboration with Seventeen Magazine's website, utilizes augmented reality, a camera, and a web browser having a Flash plug-in. These systems can use rudimentary face and body tracking, in combination with pre-rendered still images, to show the user wearing the clothing. However, such systems do not use real-time video for illustrating the clothing appearance changes.
Other full-body techniques typically employ graph-based structures derived from large motion-capture data. (See, e.g., ARIKAN, O., AND FORSYTH, D., “Interactive motion generation from examples”, ACM Transactions on Graphics 21, 3, 483-490, 2002; KOVAR, L., GLEICHER, M., AND PIGHIN, F., “Motion graphs”, ACM Transactions on Graphics (TOG) 21, 3, 473-482, 2002; LEE, J., CHAI, J., REITSMA, P., HODGINS, J., AND POLLARD, N., “Interactive control of avatars animated with human motion data”, ACM Transactions on Graphics 21, 3, 491-500, 2002; LI, Y., WANG, T., AND SHUM, H., “Motion texture: a two level statistical model for character motion synthesis”, In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, ACM, 465-472, 2002; PULLEN, K., AND BREGLER, C., “Motion capture assisted animation: texturing and synthesis”, ACM Transactions on Graphics (SIGGRAPH 2002) 21, 3, 501-508, 2002). However, there is no video used in these techniques. Other related technologies can include more general video based techniques, such as, Video Sprites (see, e.g., SCHODL, A., AND ESSA, I., “Controlled animation of video sprites”, In Proceedings of the 2002 ACM SIGGRAPH/Eurographics symposium on Computer animation, ACM, 121-127, 2002) and Human Video Textures (see, e.g., FLAGG, M., NAKAZAWA, A., ZHANG, Q., KANG, S., RYU, Y., ESSA, I., AND REHG, J., “Human video textures”, In Proceedings of the 2009 symposium on Interactive 3D graphics and games, ACM, 199-206, 2009). In such cases, either matting-based extraction is used without explicit skeletal annotation, or a marker-based system in parallel to HD video acquisition is utilized. However, no real-time video input is used to drive the animations. Three-dimensional (“3D”) extensions of video based acquisition techniques have been recently advanced.
(See, e.g., DE AGUIAR, E., STOLL, C., THEOBALT, C., AHMED, N., SEIDEL, H., AND THRUN, S., “Performance capture from sparse multi-view video”, In ACM Transactions on Graphics (TOG), vol. 27, ACM, 98, 2008; DENG, Z., AND NOH, J., “Computer facial animation: A survey. Data-Driven 3D Facial Animation”, pp. 1-28, 2007). Further, a dynamic simulation based cloth modeling has been incorporated into these 3D video based capture techniques. (See, e.g., STOLL, C., GALL, J., DE AGUIAR, E., THRUN, S., AND THEOBALT, C., “Video-based reconstruction of animatable human characters”, In ACM Transactions on Graphics (TOG), vol. 29, ACM, 139, 2010). None of the above, however, provides a user with real-time tracking to display what the clothing would look like to the user.
Thus, it may be beneficial to provide exemplary system, method and computer accessible medium for the real-time video display of a person wearing different clothing that can be easily manipulated and controlled by a user, and which can address and/or overcome at least some of the deficiencies described herein above.
Thus, to address and/or overcome at least some of the issues described herein above, exemplary embodiments of the system, method and computer accessible medium, called BodySwap or BodyJam, can be provided which can facilitate a user to change his/her outfit quickly. For example, the exemplary system, method and computer accessible medium can facilitate a real-time full body view of a user and display poses, in real-time, of a person standing in front of the camera/display mirror, facilitating the user to change his/her clothes as well as other appearance attributes. According to certain exemplary embodiments of the present disclosure, procedures can be provided for a real-time video based rendering system. For example, BodySwap can be used, e.g., as a virtual mirror to dress and re-dress people in different clothing. In certain exemplary embodiments of BodySwap, a specific garment can be changed, a different person can be provided, and/or a specific garment can be controlled.
The exemplary system, method and computer accessible medium can take advantage of marker-less skeletal tracking techniques, such as, e.g., Microsoft's Kinect. (See, e.g., Reference 16). Unlike conventional systems which are example-based rendering systems that need marker based data, according to the particular exemplary embodiments of the present disclosure, a marker-less annotation can be used for the input video that can be driving the animation, and marker-less annotation for the video-based render database. The exemplary system, method and computer accessible medium can include engines to learn from face and body retargeting and re-writing systems, such as, e.g., those described in References 2, 3, 6, and 18, that use computer vision to annotate or drive facial animation. For example, Reference 19 describes a Kinect-based real-time facial retargeting system.
According to an additional exemplary embodiment of the present disclosure, poses can be matched to a video database of different torsos and legs. “Pages” can be turned by gestures interpreted through the video tracking. Some or all body poses can be mirrored in real time, and outfits can be mixed and matched through gestures and poses by the user.
The exemplary applications of such technologies can be immense, including, e.g., video games, movies and fashion retail stores, to name a few areas.
These and other objects of the present disclosure can be achieved by provisions of exemplary systems, methods and computer-accessible mediums according to exemplary embodiments of the present disclosure for displaying visual information corresponding to at least one user, using which a selection of at least one attribute to be viewed can be received, and at least one user pose of the at least one user can be tracked in real-time using a marker-less capture procedure. The user pose(s) can be matched with at least one database pose in a database, and the database pose(s) can be displayed in combination with the attribute(s).
In particular exemplary embodiments of the present disclosure, the database can include a plurality of stored images of previously captured skeletal annotated poses captured using a marker-less capture procedure. The previously captured skeletal annotated poses can be of at least one person presenting different attributes. According to some exemplary embodiments, the attributes can include clothing and/or an accessory. In particular exemplary embodiments, a skin color of the person in the displayed pose can be modified to match the skin color of the user. The database pose(s) can approximately match a position and/or orientation of the user pose(s).
According to further exemplary embodiments of the present disclosure, clothing can be conformed to a body of the user(s) by analyzing the body style of the user. For example, the tracking of user(s) can be performed using a camera. The marker-less capture procedure can be performed using an OpenNI Framework. At least one further user pose can be tracked and matched to at least one further database pose, and the further database pose(s) can be displayed. In some exemplary embodiments of the present disclosure, the matching procedures can be performed by searching the database for poses that are close to the at least one first database pose. For example, the database can be searched using a nearest neighbor algorithm.
These and other objects, features and advantages of the exemplary embodiments of the present disclosure will become apparent upon reading the following detailed description of the exemplary embodiments of the present disclosure, when taken in conjunction with the appended claims.
Further objects, features and advantages of the present disclosure will become apparent from the following detailed description taken in conjunction with the accompanying Figures showing illustrative embodiments of the present disclosure, in which:
Throughout the drawings, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components, or portions of the illustrated embodiments. Moreover, while the present disclosure will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments and is not limited by the particular embodiments illustrated in the figures.
The exemplary embodiments of the present disclosure may be further understood with reference to the following description and the related appended drawings. The exemplary embodiments of the present disclosure relate to camera tracking system, method and computer-accessible medium, associated database for the real-time tracking, and a display of a person wearing different clothing. Specifically, the exemplary system, method and computer-accessible medium can track a user's movement and display a person moving in a similar manner wearing the desired clothing. The exemplary embodiments are described with reference to a person wearing clothing, although those having ordinary skill in the art will understand that the exemplary embodiments of the present disclosure may be implemented on any real-time tracking system that can display a person moving in a similar manner to the user.
Exemplary Generation of Clothing Database
According to exemplary embodiments of the present disclosure, a clothing database can be generated. For example, using a video detection system (see, e.g., SHOTTON, J., FITZGIBBON, A., COOK, M., SHARP, T., FINOCCHIO, M., MOORE, R., KIPMAN, A., BLAKE, A., “Realtime human pose recognition in parts from a single depth image”), the performance of one person (e.g., the model) wearing a piece of clothing can be recorded, and a database of the clothing's appearances from multiple poses can be created. To generate the image database for a piece of clothing, an exemplary model can be dressed with the clothing, and a performance of him or her moving around can be recorded, annotated by his/her 3D skeleton. This can be accomplished using a video camera and depth extraction (e.g., like a mocap system combined with a video camera). In an exemplary embodiment of the present disclosure, a Kinect sensor can be used to capture the performance, with the skeleton being computed by the OpenNI Framework (see, e.g., OPENNI. OpenNI. www.openni.org). For each frame of the performance, captured at a constant frame rate, a database entry can be created containing the video frame image and the corresponding skeleton for the model's pose.
To establish the notation, an image database D={Ef} can be a set of pairs Ef=(Sf, If) composed of a 3D skeleton Sf and an image If extracted from the video of the performance. The video frame number can be f. The skeleton Sf={Jf,j; j=1, . . . , n}, in turn, can be composed of the n 3D joint positions Jf,j for each frame f.
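The database layout described above can be sketched as follows; this is a minimal illustration, and the class and field names are assumptions not prescribed by the present disclosure.

```python
# Sketch of the image database D = {E_f}: each entry pairs a 3D skeleton
# S_f with the video frame image I_f; the frame number f is the list index.
from dataclasses import dataclass
from typing import List, Tuple

Joint = Tuple[float, float, float]   # one 3D joint position J_{f,j}

@dataclass
class Skeleton:
    joints: List[Joint]              # the n 3D joint positions for one frame

@dataclass
class Entry:
    skeleton: Skeleton               # S_f
    image: bytes                     # I_f, the raw video frame for frame f

# The database D is then simply a list of entries indexed by frame number f.
database: List[Entry] = []
database.append(Entry(Skeleton([(0.0, 1.0, 2.0)] * 15), b"<frame pixels>"))
print(len(database))  # 1
```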
According to certain exemplary embodiments of the present disclosure, poses can be matched using the distance between the joint orientation quaternions, which can be insensitive to the skeleton's bone sizes.
According to certain exemplary embodiments of the present disclosure, the performances of the reference model wearing various styles and colors of clothes can be recorded. Each performance can give rise to a separate image database capturing the clothing's appearances from multiple poses. Some exemplary databases can be marked, for example, as being suitable exclusively for the upper body, others, just for the lower body, while some can be used for both. The exemplary databases can then be revised and/or collected, for example, into a clothing library.
According to certain exemplary embodiments of the present disclosure, the exemplary database can be indexed frame-by-frame. The exemplary database can be made of small video clips of, e.g., approximately ½ second each (e.g., about 12 frames per video clip). Other manipulations can be performed using or based on information contained in the database, which can save search time (e.g., since each search returns a whole video clip of, e.g., 12 frames) and can create smoother sensations during playback.
According to certain exemplary embodiments of the present disclosure, a face database can be generated. As skeleton information may not be present or needed, the pose of the face can be extracted from the detection of the left and right eyes. The pose can be the information that replaces the joints of the skeleton for a body (see e.g.,
Referring to
Exemplary Body Swapping
Referring to
To control the exemplary system, method and computer accessible medium, the controller's skeleton S(t), as well as a position and orientation of the user, can be tracked, for example, in real-time, at moments t, to query the database for the frame that best matches his or her current pose. Initially, the entry Ef=(Sf, If) containing the best matching skeleton can be sought. At each moment t, the controller's skeleton S(t) can be used to search in a database D for the entry with the pose, which can include a position and orientation, that best matches his or her pose:

f(t)=argminf∈D d(S(t), Sf)  (1)
where d can be a skeleton distance function; the image If(t) can then be displayed on the screen back to the controller, giving him/her the impression that the model is mimicking his/her performance. For notational simplicity, the index t can sometimes be omitted from f(t), with Ef=(Sf, If) denoting the database entry corresponding to the best matching skeleton Sf (e.g., at time t). Next, If can be displayed on the screen, giving the controller the impression of a virtual mirror.
Exemplary Distance Function: For the skeleton distance function d used to search the image database, a weighted sum of squared distances between the 3D joints of the two skeletons can be used. Moreover, in order to make the control insensitive to translations, the skeletons can first be centered by their torso joints (e.g., the model can be instructed to roughly move in place when recording the performance for the image database). More precisely, the distance between skeletons S={Jj; j=1, . . . , n} and S′={J′j; j=1, . . . , n} is to be computed. For example, such skeletons can come already ordered by their type of joint, so the correspondence between joints can already be provided, e.g., by the Kinect. Their joints can first be centered around the respective torsos JT and J′T, obtaining, for example, new, translated skeletons:
S̃=(J̃i; i=1, . . . , n)=(Ji−JT; i=1, . . . , n),

S̃′=(J̃′i; i=1, . . . , n)=(J′i−J′T; i=1, . . . , n).
The distance can then be determined as:

d(S, S′)=Σj wj‖J̃j−J̃′j‖²  (2)
where the weights wj can be used to improve the playback smoothness (e.g., joints on the torso typically have higher weights than the limbs), as well as to eliminate some of the joints from the control altogether. For example, if there is interest only in moving the upper body, the weights of the leg joints can be set to zero. Further, the joint velocities can be incorporated in addition to the positions, which can be a simple matter of extending the database with more annotations. The velocities can facilitate resolving conflicts between nearby poses (e.g., an arm moving up versus moving down).
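The torso-centered, weighted skeleton distance described above can be sketched as follows; the function name, torso index, and toy joint values are illustrative assumptions.

```python
# Weighted, torso-centered skeleton distance: center both skeletons on
# their torso joints (removing translation), then sum weighted squared
# distances between corresponding 3D joints.
def skeleton_distance(S, S_prime, weights, torso=0):
    def centered(skel):
        tx, ty, tz = skel[torso]
        return [(x - tx, y - ty, z - tz) for (x, y, z) in skel]
    A, B = centered(S), centered(S_prime)
    # A zero weight removes a joint from the control entirely
    # (e.g., the leg joints when only the upper body is of interest).
    return sum(w * ((a[0] - b[0])**2 + (a[1] - b[1])**2 + (a[2] - b[2])**2)
               for w, a, b in zip(weights, A, B))

# Identical poses at different positions are at distance zero.
S = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
S_shifted = [(5.0, 5.0, 5.0), (6.0, 6.0, 5.0)]
dist0 = skeleton_distance(S, S_shifted, weights=[1.0, 1.0])
print(dist0)  # 0.0
```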
Exemplary Nearest Neighbor Search: For each query, it can be preferable to search through the database for the entry which holds the skeleton closest to the controller's current pose. According to certain exemplary embodiments of the present disclosure, a straight linear search can be used. Alternatively, more sophisticated nearest neighbor search algorithms, such as space partitioning approaches can be used. In large databases, an efficient search algorithm can be preferable.
Exemplary “Hysteresis” Thresholding for Smoothness: In order to remove jittering and facilitate real-time playback, nearby video frames can be used in consecutive queries for as long as the skeleton distance stays within a threshold. For example, suppose the query at one moment t returned the database entry Ef*=(Sf*, If*). For the next query, at time t+1, instead of searching the whole database, as in equation (1), candidates can be limited to entries inside a window of width W around frame f*(t). For example, equation (1) can be described by the following pseudo program, e.g.:

f′=argmin|f−f*(t)|≤W d(S(t+1), Sf)
f*(t+1)=f′, if d(S(t+1), Sf′)≤T; f*(t+1)=argminf∈D d(S(t+1), Sf), otherwise  (3)
where S(t+1) can be the controller's skeleton, and f*(t+1) can be the frame number displayed next. T can be a threshold parameter. In certain exemplary embodiments, W can equal 4. When the distance of the local optimum computed with the first equation in (eq. 3) becomes too large (e.g., according to parameter T), a long transition can be provided by resorting back to searching the closest matching skeleton over the database using, for example, the second equation in (eq. 3). This can remove jittering because, when the original model moved around, he or she may have passed multiple times through nearby poses, which can become a source of jittering in the real-time playback. By using adjacent frames, the system can use smoother video sequences present in the original recording.
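The windowed query with threshold fallback described above can be sketched as follows; the function name, toy one-dimensional distance, and default parameter values are assumptions.

```python
# "Hysteresis" query: search a window of width W around the previously
# matched frame, and fall back to a full database search when the local
# best match drifts beyond the threshold T.
def query(database, dist, S_now, f_prev, W=4, T=1.0):
    lo, hi = max(0, f_prev - W), min(len(database), f_prev + W + 1)
    local = min(range(lo, hi), key=lambda f: dist(S_now, database[f]))
    if dist(S_now, database[local]) <= T:
        return local                       # stay on nearby frames (smooth)
    # Local match too poor: transition by searching the whole database.
    return min(range(len(database)), key=lambda f: dist(S_now, database[f]))

# Toy 1D "poses": database frame f holds pose f, distance is |a - b|.
db = list(range(10))
d = lambda a, b: abs(a - b)
near = query(db, d, S_now=3, f_prev=2)   # found inside the window
far = query(db, d, S_now=9, f_prev=0)    # fell back to a global search
print(near, far)  # 3 9
```

Keeping consecutive results inside the window reuses smooth video runs from the original recording instead of jumping between distant, similar poses.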
Exemplary Image Buffering: The exemplary images can be obtained, for example, from disk to main memory on demand when performing database queries. As an option to limit memory consumption in the case of large databases, a memory budget can be assigned on how many images can be allowed to be in memory at one given time. Then, an LRU cache replacement policy can be employed by, when necessary, first swapping back to disk the frames with the oldest access time. Moreover, a simple predictive caching scheme can be used, e.g., by pre-loading into main memory a window of frames around the frame returned by a query.
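The LRU buffering policy described above can be sketched as follows; the class name and budget value are illustrative assumptions.

```python
# LRU image buffer: a memory budget caps how many frames stay resident,
# and the least recently used frame is evicted first.
from collections import OrderedDict

class FrameCache:
    def __init__(self, load_frame, budget=3):
        self.load_frame = load_frame    # e.g., reads a frame image from disk
        self.budget = budget
        self.cache = OrderedDict()

    def get(self, f):
        if f in self.cache:
            self.cache.move_to_end(f)           # mark as most recently used
        else:
            self.cache[f] = self.load_frame(f)
            if len(self.cache) > self.budget:
                self.cache.popitem(last=False)  # evict the oldest access
        return self.cache[f]

cache = FrameCache(load_frame=lambda f: f"frame-{f}", budget=2)
cache.get(1); cache.get(2); cache.get(3)   # frame 1 gets evicted
resident = list(cache.cache)
print(resident)  # [2, 3]
```

Predictive caching can then be layered on top by calling `get` for a window of frame numbers around each query result.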
Exemplary Frame Discarding: The exemplary system can be organized around two threads: the image database matching thread, which can produce the best matching frame based on the controller's real time skeleton, and the rendering thread, which can display the matched frames on the screen. The matching thread can add frames to a queue, annotated with a timestamp of the query, and the rendering thread consumes frames from the queue. In order to avoid occasional long lags between the controller's movement and the video that is displayed back to him/her, maintaining the feel of real-time control, the rendering thread can discard the frames that are too old when dequeuing a new frame for display.
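The stale-frame discarding on the renderer side can be sketched as follows; the function name, timestamps, and staleness bound are assumptions.

```python
# Matcher/renderer hand-off: matched frames carry the query timestamp,
# and the renderer discards frames older than a staleness bound so the
# display keeps the feel of real-time control.
from collections import deque

def dequeue_fresh(queue, now, max_age=0.1):
    while queue:
        timestamp, frame = queue.popleft()
        if now - timestamp <= max_age:
            return frame        # fresh enough to display
    return None                 # everything queued was stale

q = deque([(0.00, "A"), (0.05, "B"), (0.18, "C")])
frame = dequeue_fresh(q, now=0.20)
print(frame)  # C: frames A and B were older than 0.1 s
```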
Exemplary Skin Color Swapping
In order for the user to better identify himself/herself with the body being shown on the screen, the user's skin color can be transferred to the model that was originally used to create the database. The images from the databases can be modified at runtime using a statistical model of the color distribution on both skin regions. This exemplary transformation can give convincing results, and can run fast enough to be computed in real-time, immediately after a new user steps in, without disrupting the experience.
Skin Color Transfer: Images can be transformed from RGB space into lαβ space (see, e.g., Reference 25). The details of the transformation from RGB to lαβ space can be found in Reference 23. In the discussion that follows, the color is assumed to be represented in lαβ space. The skin color distribution of the target image ct can be modeled as a Gaussian:
ct˜N(ct; μt, Σt)  (4)
and the color distribution in the source image cs as a mixture of Gaussians:

cs˜Σi=1 . . . n πi N(cs; μi, Σi)  (5)

with the parameters estimated using the EM algorithm (see, e.g., Reference 5). Two components can be used for the Gaussian mixture of the face (i.e., n=2), which can be enough to model the actual skin pixels in one component and, in the other, pixels not corresponding to skin, such as eye and hair pixels. The Gaussian component responsible for the greater number of pixels can be used to model the user's skin color distribution, which can be denoted N(cs; μs, Σs).
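The component-selection step just described can be sketched as follows; the tuple layout and the example values are illustrative assumptions, and the EM fit itself is assumed to have already been performed.

```python
# After EM fits an n=2 Gaussian mixture to the pixels in the face region,
# the component responsible for the larger share of pixels is taken as
# the skin color model N(cs; mu_s, Sigma_s).
def dominant_component(components):
    # components: (weight, mean, covariance) triples produced by EM;
    # the weight is the fraction of pixels a component explains.
    return max(components, key=lambda c: c[0])

skin = (0.8, (180.0, 140.0, 120.0), 25.0)   # mostly skin pixels
other = (0.2, (60.0, 50.0, 45.0), 90.0)     # hair/eye pixels
weight, mu_s, sigma_s = dominant_component([other, skin])
print(mu_s)  # (180.0, 140.0, 120.0)
```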
With N(cs; μs, Σs) and N(ct; μt, Σt) describing the skin color distributions in the source and target images, respectively, each pixel in the skin region of the target image is then transformed by warping the distribution N(ct; μt, Σt) into N(cs; μs, Σs). More precisely, let Vt be the 3×3 matrix of eigenvectors of Σt, with one eigenvector per column, and Dt a diagonal matrix holding the corresponding eigenvalues on the main diagonal. Vs and Ds can be defined in the same way for Σs.
Each pixel ct in the target image is then transformed by:

c′t=Vs Ds^(1/2) Vs^T Vt Dt^(−1/2) Vt^T (ct−μt)+μs  (6)
and then converted back to RGB space.
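The Gaussian warping described above can be sketched in numpy as follows; it is a minimal sketch under the assumption that pixels are whitened under the target Gaussian and re-colored under the source Gaussian, and the function and variable names are assumptions.

```python
# Warp pixels distributed as N(mu_t, Sigma_t) into N(mu_s, Sigma_s):
# whiten under the target Gaussian, then re-color under the source one.
import numpy as np

def transfer(pixels, mu_t, cov_t, mu_s, cov_s):
    dt, Vt = np.linalg.eigh(cov_t)    # eigenvalues/eigenvectors of Sigma_t
    ds, Vs = np.linalg.eigh(cov_s)
    whiten = np.diag(dt ** -0.5) @ Vt.T   # N(mu_t, Sigma_t) -> N(0, I)
    color = Vs @ np.diag(ds ** 0.5)       # N(0, I) -> N(0, Sigma_s)
    return (pixels - mu_t) @ (color @ whiten).T + mu_s

# Synthetic "target skin" pixels get the statistics of the source skin.
rng = np.random.default_rng(0)
px = rng.multivariate_normal([5.0, 5.0, 5.0], np.eye(3), size=2000)
out = transfer(px, np.array([5.0, 5.0, 5.0]), np.eye(3),
               np.array([1.0, 2.0, 3.0]), 4.0 * np.eye(3))
print(np.round(out.mean(axis=0)))  # approximately [1. 2. 3.]
```

In practice this per-pixel linear map is cheap enough to run in a GPU fragment program, as the disclosure notes.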
Referring to
Defining the Skin Masks: To capture the skin region in the source image, whenever a new user jumps in, a Viola-Jones face detector can be employed (see, e.g., Reference 30: P. VIOLA AND M. JONES, “Rapid object detection using a boosted cascade of simple features”, In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pp. I-511-I-518, 2001) to mask out the controller's face in the source image. The target skin region, on the other hand, can be computed offline, since it corresponds to images in the clothing database. The skin regions can be masked by rotoscoping the database videos, e.g., in Adobe After Effects using the Roto Brush tool, although not limited thereto. As a result, a skin area mask for each frame in the database can be stored along with the Gaussian describing the skin color distribution of the reference model that was used to record the garment (also computed offline).
At run-time, the skin color transformation can be applied very fast, since all that is needed is to estimate the Gaussian mixture inside the user's face region in a single frame, and then use the resulting model to compute the transformation (see, e.g., eq. 6). To further increase speed, the transformation (eq. 6) and the lαβ⇄RGB color space conversions can be computed in a fragment program on the GPU whenever an image from the database is shown.
Exemplary Body Jam
According to an exemplary implementation of exemplary embodiments of the present disclosure, Microsoft's Kinect procedure can be used. For example, as shown in
According to another exemplary embodiment of the present disclosure, the user can move, for example, in front of a video sensor, and a screen can illustrate the user being dressed in different clothes. By using hand gestures, the user can independently flip through the clothes dressing their upper and lower body. With this interface, the user is able to, for example, choose between different styles, patterns and colors, as well as to evaluate which garments go well together. Moreover, by making use of the techniques presented in BodySwap, users not only can see themselves in different clothes, but can also control, in real time, the animation of the body.
According to certain exemplary embodiments of the present disclosure, the electronic representations of the clothes can be manipulated so that the clothes are conformed to the body of the user. For example, the clothes stored in the database can be modeled by individuals having different body styles, for example, different body-shapes than the user (e.g., rounded shoulders compared to square shoulders; slight build compared to muscular build; etc.). According to certain exemplary embodiments of the present disclosure, procedures can be provided to manipulate the appearance of the clothing to conform the clothing to the body to provide a more realistic fit on the user.
Exemplary Controlling of Three Separate Body Parts
Overview of the Exemplary User interface (UI): The exemplary screen can be divided into three separate stacked layers (see, e.g.,
Exemplary Aligning the Body Parts
In order to generate the final composition of the three stacked layers, the real-time video of the controller and the upper and lower body videos generated from the upper and lower body image databases can be cropped, scaled and/or aligned.
Exemplary Cropping: The video frames retrieved from a database feeding the upper body video between the neck and waistline can be cropped. To accomplish this, the projection, for example, on the Kinect's image plane, of the 3D skeleton annotations contained in the result of a database query, can be used. When used for the lower body, the frames below the waist line can be cropped. The real-time video of the controller, in turn, can be cropped above the neck using the real-time tracked skeleton, for example, with the Kinect procedure.
Exemplary Alignment: The exemplary images can be aligned based on the projected skeletons. A projected skeleton can be the skeleton described by the joint information without the “z-component”, e.g., the component that describes depth away from the Kinect. The real-time head position can be aligned with the neck position contained in the entry from the upper body database, and the lower body, or waist, in turn, can be aligned with the upper body.
Exemplary Scaling: In addition to the exemplary alignment, the videos can be appropriately scaled in order to generate a convincing final composition. Again, the projected joints can be employed. The lower body can be scaled in relation to the upper body based on the ratio of the projected torsos of each. The head can be scaled in relation to the torso based on the distance from the neck to the head.
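The projection and torso-ratio scaling described above can be sketched as follows; the helper names and the toy joint coordinates are illustrative assumptions.

```python
# Projected joints drop the z-component (depth away from the sensor);
# the lower-body layer is then scaled by the ratio of the projected
# torso (neck-to-waist) lengths of the two databases.
def project(joint3d):
    x, y, _z = joint3d
    return (x, y)

def length(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# Projected torso lengths in the upper and lower body database entries.
upper_torso = length(project((0.0, 1.0, 2.5)), project((0.0, 0.0, 2.5)))
lower_torso = length(project((0.0, 2.0, 3.0)), project((0.0, 0.0, 3.0)))
scale = upper_torso / lower_torso   # applied to the lower-body layer
print(scale)  # 0.5
```

The head layer can be scaled analogously using the projected neck-to-head distance relative to the torso.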
Exemplary Changing Clothes
According to certain exemplary embodiments of the present disclosure, flipping through the clothes can be accomplished via various computer-based procedures.
Exemplary Gesture Driven Switch: When using hand gestures for control, at each moment the clothes of the upper or the lower body can be changed, indicated to the user by two small yellow circles aligned with the active layer (see, e.g.,
Exemplary Timed Random Switch: As an alternative, a timed switch between clothes that randomly alternates between the databases available in the clothing library can be employed. It can be used offline to create avatars dressed in any clothes, or even to dress Hollywood actors to produce movies, without requiring them to ever try on the clothes.
Exemplary Hand Tracking Interface: In a more realistic setting, users should be able to pick clothes from a catalog. Also, a “hand cursor” interface can be implemented where thumbnails of the available clothes are overlaid on the screen, and, by tracking the user's hand, he/she is able to pick different outfits by placing the cursor on top of the thumbnail of his/her choice of garment (
As shown in
Further, the exemplary processing arrangement 102 can be provided with or include an input/output arrangement 114, which can include, e.g., a wired network, a wireless network, the internet, an intranet, a data collection probe, a sensor, etc. As shown in
The foregoing merely illustrates the principles of the disclosure. Various modifications and alterations to the described embodiments will be apparent to those skilled in the art in view of the teachings herein. It will thus be appreciated that those skilled in the art will be able to devise numerous systems, arrangements, and procedures which, although not explicitly shown or described herein, embody the principles of the disclosure and can be thus within the spirit and scope of the disclosure. In addition, all publications and references referred to above can be incorporated herein by reference in their entireties. It should be understood that the exemplary procedures described herein can be stored on any computer accessible medium, including a hard drive, RAM, ROM, removable disks, CD-ROM, memory sticks, etc., and executed by a processing arrangement and/or computing arrangement which can be and/or include a hardware processor, microprocessor, mini-, macro- or mainframe computer, etc., including a plurality and/or combination thereof. In addition, certain terms used in the present disclosure, including the specification, drawings and claims thereof, can be used synonymously in certain instances, including, but not limited to, e.g., data and information. It should be understood that, while these words, and/or other words that can be synonymous to one another, can be used synonymously herein, there can be instances when such words can be intended to not be used synonymously. Further, to the extent that the prior art knowledge has not been explicitly incorporated by reference herein above, it is explicitly incorporated herein in its entirety. All publications referenced are incorporated herein by reference in their entireties.
Certain details are set forth of various exemplary embodiments. However, one skilled in the relevant art will recognize that embodiments may be practiced without one or more of these details, or with other methods, components, materials, etc. In other instances, well-known structures associated with controllers, data storage devices and display devices, have not been shown or described in detail to avoid unnecessarily obscuring descriptions of the embodiments.
Unless the context requires otherwise, throughout the specification, the word “comprise” and variations thereof, such as, “comprises” and “comprising” can be construed in an open, inclusive sense, that is, as “including, but not limited to.”
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
The following references are hereby incorporated by reference in their entirety.
This application claims priority to U.S. Provisional Application No. 61/515,649 filed on Aug. 5, 2011. The entire disclosure of the above-referenced application is incorporated herein by reference in its entirety.