There are many different video and image editing systems that allow users to create sophisticated editing and compilation effects. With the right equipment, software, and commands, a user can apply effects to produce nearly any imaginable visual result. However, video editing typically requires complicated editing software that can be very expensive and difficult to use and that, without significant training, is unapproachable for the typical user.
Aspects of the present disclosure are directed to a gesture video effect system that can determine when a user has made a gesture that is mapped to an effect and can display the effect, e.g., as an overlay on the video. The gesture can be from a pre-defined set of gestures or a gesture specified by the effect creator. In various cases, the gesture can be recognized using a trained machine learning model that recognizes gestures or that can compare a kinematic model of a depicted user to a kinematic model for a gesture to determine a match. A selected effect can be displayed as a video overlay at a location corresponding to where the gesture was made or at another location defined by the effect creator.
Further aspects of the present disclosure are directed to a movement matching system that can automatically match movements between a source video and a live feed of a user. The movement matching system can do this by initially tracking a first user depicted in a source video. The movement matching system can then generate a corresponding set of kinematic model movements for that first user. The movement matching system can next provide an overlay on a second video with the same soundtrack according to the kinematic model movements. Finally, the movement matching system can track movements of a second user and determine how accurately they match the set of kinematic model movements.
Yet further aspects of the present disclosure are directed to a video customization system that can capture both video data and gaze data indicating where the video creator is looking throughout the video. The video customization system can correlate the gaze data to coordinates in the video. Based on these coordinates, the video customization system can perform various customizations on the video, such as setting the coordinates as the focal point of the video, recognizing an object at the coordinates and setting that object as a focus object and/or highlighting the focus object, and/or setting a creator's field-of-view in the video.
Aspects of the present disclosure are directed to a gesture video effect system that can determine when a user has made a gesture mapped to an effect and can implement the effect in the video. In various implementations, the gesture can be one of a pre-defined set of gestures, selected by the effect creator, or can be a unique gesture specified by the effect creator (e.g., by making a pose in front of a camera or by posing a virtual user model, which the gesture video effect system saves for comparison to depictions of poses made by users in videos). In various cases, the gesture can be recognized using a machine learning model trained to recognize gestures and/or by mapping a kinematic model to the depicted user and applying a machine learning model trained to determine a similarity between the kinematic model and a kinematic model defined for the gesture. Also in various implementations, once the effect is selected as mapped to a gesture that was performed by a depicted user, the effect can be displayed at one of various locations, such as at a location corresponding to where the gesture was made or at a location defined by the effect creator (e.g., at particular coordinates in the video frame or in relation to a recognized object or body part depicted in the video frame).
At block 502, process 500 can receive a next portion of a video feed. In various implementations, the portion of the video feed can be a latest portion of a live video feed or a next portion of a pre-recorded video feed received for post-processing. In some cases, when the video is being recorded, the gesture video effect system can include an affordance illustrating to the user what gestures the user can make to cause effects. For example, the gesture video effect system can put an overlay on the video showing a pose mapped to an effect, can provide a tutorial, can show an icon or description indicating mapped gestures, or can provide another instruction.
At block 504, process 500 can determine whether a gesture, mapped to an effect, is in the video feed. While any user movement or pose can be a gesture mapped to an effect, examples of possible gestures include a defined number of fingers raised, a hand raised, a punch, a kick, a wave, a head nod, a particular facial expression (e.g., a smile, mouth open, tongue out, a frown, etc.), a twirl, a jump, etc. The particular gesture that is mapped to an effect can be specified by the effect creator. For example, the effect creator can select a gesture from a pre-defined set of gestures, can pose a virtual user model, or can make a gesture in front of a capture camera. As a first example, the effect creator can have a virtual model of a user that can be moved to make a gesture (pose and/or movement) on a computing system. When the effect creator causes the model to make the gesture, a corresponding kinematic model for the gesture (which can be a pose of the model or the model in motion) can be tracked and saved as the gesture. As a second example, the effect creator can appear in front of a capture camera and perform the gesture (pose and/or movement) that she wants mapped to the effect. The gesture video effect system can determine a corresponding kinematic model for the performed gesture, which it can save as the gesture for the effect.
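As an illustrative sketch only (hypothetical Python names, not the disclosure's implementation), a creator-defined gesture could be stored as a kinematic template mapped to an effect:

```python
from dataclasses import dataclass

@dataclass
class GestureTemplate:
    """A creator-defined gesture stored as a kinematic 'template': named
    keypoints in normalized (x, y) coordinates, mapped to an effect."""
    name: str
    keypoints: dict          # keypoint name -> (x, y), e.g. "right_wrist": (0.7, 0.2)
    effect_id: str

class GestureRegistry:
    """Holds the gestures an effect creator has registered."""
    def __init__(self):
        self._templates = []

    def register(self, name, keypoints, effect_id):
        # Save the pose the creator performed on camera or posed on a virtual model.
        self._templates.append(GestureTemplate(name, keypoints, effect_id))

    def templates(self):
        return list(self._templates)

# Example: map a "hand raised" pose to a hypothetical confetti overlay effect.
registry = GestureRegistry()
registry.register(
    "hand_raised",
    {"right_wrist": (0.70, 0.20), "right_elbow": (0.65, 0.40), "right_shoulder": (0.60, 0.55)},
    effect_id="confetti_overlay",
)
```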
For the set of effects mapped to gestures, process 500 can determine whether one of these gestures is being performed using various machine learning approaches. For example, a machine learning model can be trained to define a kinematic model for the user depicted in one or more video frames. Such a kinematic model (also known as a skeletal model) can identify key points on a depicted user's body, such as at their forehead, chin, base of neck, shoulders, elbows, wrists, palms, fingertips, torso, hips, knees, feet, and tips of toes. In various implementations, more or fewer points can be used (e.g., additional points on a user's face can be mapped to determine more fine-grained facial expressions). Kinematic models are discussed in greater detail below in relation to
If a mapped gesture is recognized, process 500 can proceed to block 506; otherwise process 500 can proceed to block 510.
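A minimal sketch of the kinematic matching described above, using a simple distance threshold in place of a trained similarity model and continuing the registry sketch above; the keypoint names and threshold value are assumptions:

```python
import math

def pose_distance(pose_a, pose_b):
    """Mean Euclidean distance over the keypoints shared by two poses
    (keypoints given as normalized (x, y) coordinates)."""
    shared = set(pose_a) & set(pose_b)
    if not shared:
        return float("inf")
    return sum(math.dist(pose_a[k], pose_b[k]) for k in shared) / len(shared)

def detect_gesture(frame_pose, templates, threshold=0.08):
    """Return the effect mapped to the closest gesture template within the
    threshold, or None if no mapped gesture is recognized (block 504)."""
    best, best_dist = None, float("inf")
    for template in templates:          # GestureTemplate objects, as sketched above
        dist = pose_distance(frame_pose, template.keypoints)
        if dist < best_dist:
            best, best_dist = template, dist
    return best.effect_id if best is not None and best_dist <= threshold else None
```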
At block 506, process 500 can select a location for overlaying or otherwise applying the effect to the video feed. In some cases, the effect can be a full-frame effect, in which case the location is simply the entire frame. In other cases, the effect can be configured to be placed in relation to the location in the frame where the user made the gesture. In some cases, this location can be updated as the user continues to make the gesture across frames (e.g., causing the effect to move with the gesture). In yet other cases, the effect can have a defined location (e.g., specified by the effect creator), such as at an x-y offset from a corner of the frame or in relation to a recognized object or body part depicted in the frame (whether or not this object or body part was part of the recognized gesture).
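The placement options at block 506 could be expressed roughly as follows; the `placement`, `offset_x`, and `offset_y` fields are hypothetical configuration names, not terms from the disclosure:

```python
def select_effect_location(effect, frame_size, gesture_box=None, anchor_box=None):
    """Return an (x, y) anchor for drawing the effect in one frame (block 506).

    `effect.placement`, `effect.offset_x`, and `effect.offset_y` are assumed
    configuration fields; the box arguments are (x, y, w, h) tuples for the
    recognized gesture or a recognized object/body part.
    """
    width, height = frame_size
    if effect.placement == "full_frame":
        return (0, 0)                              # effect spans the whole frame
    if effect.placement == "at_gesture" and gesture_box:
        x, y, w, h = gesture_box                   # follows the gesture as it moves
        return (x + w / 2, y + h / 2)
    if effect.placement == "fixed_offset":
        base_x, base_y = anchor_box[:2] if anchor_box else (0, 0)
        return (base_x + effect.offset_x, base_y + effect.offset_y)
    return (width / 2, height / 2)                 # fallback: center of the frame
```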
At block 508, process 500 can enable the effect on the video at the selected location. This can include adding a sticker, playing an animation, adding shading, applying a warping, or any other possible video effect (which may or may not be an overlay). In some cases, adding the effect can be part of a post-processing procedure, in which case multiple effects may be mapped to the same gesture and the user can select which of several effects to apply.
At block 510, process 500 can determine whether there are additional portions of the video to review. For a live feed, this can include determining whether the live feed has ended. For a pre-recorded video, this can include determining whether there are additional portions of the pre-recorded video remaining. If there are additional portions of the video to review, process 500 can return to block 502; otherwise process 500 can end.
Aspects of the present disclosure are directed to a movement matching system that can automatically match movements between a source video and a live feed of a user. The movement matching system can use a machine learning model trained to recognize points on a first user depicted in the source video to label those points and map the first user's poses to a kinematic model. The movement matching system can then track the kinematic model across frames of the source video to get a set of kinematic model movements for the first user in the source video. In various implementations, the set of kinematic model movements can be for the entire source video or for key points, such as at certain beats or at certain intervals. The movement matching system can next provide an overlay on a second video that has the same soundtrack, according to the kinematic model movements. This can be an overlay showing an outline of a person drawn around the kinematic model. In various implementations, the second video can be the same as the first video, a feed of a second user, or another video, such as one that depicts the second user in a new environment such as on a stage or in a music video. The movement matching system can next track movements of a second user in a live feed, by again applying the machine learning model trained to recognize points on a depicted user and map those points to a kinematic model. Finally, the movement matching system can determine how accurately the second user's movements match the set of kinematic model movements from the source video, e.g., by determining distances between matched points of the two kinematic models or by applying another machine learning model trained to take two kinematic models and provide a match value.
At block 1102, process 1100 can receive a source video. This can be a pre-recorded video or live video depicting a first user taking actions, such as dancing to music.
At block 1104, process 1100 can map a first kinematic model to the first user depicted in the source video received at block 1102. In various implementations, this mapping can be for all frames of the source video or for just certain points, such as when certain beats occur (e.g., downbeats or upbeats in the music of the video) or at certain intervals (e.g., every 1, 5, or 10 seconds) of the video.
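For illustration, sampling only certain timestamps for the kinematic mapping might look like the sketch below; the beat times are assumed to come from beat detection or track metadata, which is not specified here:

```python
def sample_timestamps(duration_s, interval_s=None, beat_times=None):
    """Choose the timestamps at which to map the first kinematic model
    (block 1104): at supplied beat times, or every `interval_s` seconds."""
    if beat_times is not None:
        return [t for t in beat_times if 0.0 <= t <= duration_s]
    step = interval_s or 1.0
    times, t = [], 0.0
    while t <= duration_s:
        times.append(round(t, 3))
        t += step
    return times

# e.g. sample_timestamps(12.0, interval_s=5.0)        -> [0.0, 5.0, 10.0]
# e.g. sample_timestamps(12.0, beat_times=[0.5, 1.0]) -> [0.5, 1.0]
```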
At block 1106, process 1100 can generate an overlay based on the first kinematic model mapping. This overlay can be for the whole source video or for the parts of the source video for which the first kinematic model was specified. Generating the overlay can include showing all or a part of the first kinematic model or drawing a person-shaped outline around the pose of the first kinematic model. Examples of such outlines are provided in
The line connecting block 1106 and 1108 is shown as a dashed line to illustrate that block 1108 may not be triggered by the completion of block 1106 and that blocks 1102-1106 and blocks 1108-1116 may be performed on different systems. For example, block 1106 to create overlays may be performed on a server that ingests music videos and creates overlays for user video feeds, whereas those overlays may be provided to a client that performs blocks 1108-1116 to show the overlays on a user's feed and track how well the user's movements match them.
At block 1108, process 1100 can begin playback of a second video with the overlay generated at block 1106. In some cases, instead of generating an overlay at block 1106 and showing it at block 1108, process 1100 can simply play back the source video, having the user try to match the motions of the depicted user. In some implementations, the overlay generated at block 1106 can be provided on the source video, or on another video such as a feed of a second user with the music from the source video (as shown in examples 700 and 800), or on a video of the second user masked to have a different backdrop (e.g., a stage at a rock concert or the source video with the originally depicted user replaced with the second user).
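As one hedged example of what rendering such an overlay could involve (assuming OpenCV for drawing and a hypothetical set of skeleton edges; the disclosure does not prescribe a drawing library):

```python
import cv2  # assumed available; any image-drawing library would work

# Hypothetical skeleton edges connecting named keypoints of the kinematic model.
EDGES = [
    ("head", "neck"), ("neck", "left_shoulder"), ("neck", "right_shoulder"),
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("neck", "hips"), ("hips", "left_knee"), ("left_knee", "left_foot"),
    ("hips", "right_knee"), ("right_knee", "right_foot"),
]

def draw_kinematic_overlay(frame, keypoints, color=(0, 255, 0)):
    """Draw the kinematic model for one timestamp as a stick-figure overlay.
    `frame` is an HxWx3 image array; `keypoints` maps names to normalized (x, y)."""
    h, w = frame.shape[:2]
    to_px = lambda p: (int(p[0] * w), int(p[1] * h))
    for a, b in EDGES:
        if a in keypoints and b in keypoints:
            cv2.line(frame, to_px(keypoints[a]), to_px(keypoints[b]), color, 3)
    for point in keypoints.values():
        cv2.circle(frame, to_px(point), 5, color, -1)
    return frame
```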
At block 1110, process 1100 can receive posture data for a second user to which the video from block 1108 is being shown. In various implementations, the posture data can be from the second video; from IMU or other movement/position data of a wearable worn by the second user; from LIDAR, a depth camera, or other motion tracking data from a device monitoring the second user; etc. At block 1112, process 1100 can map a second kinematic model to the posture data for the second user. For example, process 1100 can map points to recognized body parts of the second user depicted in the second video (e.g., accomplished in a manner similar to that performed in block 1104). As another example, movement/position data can be associated with particular body parts, e.g., a watch wearable providing movement data can be defined to be associated with the wrist on which the second user is wearing the watch wearable.
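One possible way to fold these posture sources into a single kinematic model is sketched below; the device-to-body-part associations are assumptions for illustration:

```python
def build_second_kinematic_model(video_keypoints=None, device_samples=None,
                                 device_to_part=None):
    """Combine camera-derived keypoints with wearable/motion-tracker samples
    into one keypoint dict (block 1112).

    video_keypoints: name -> (x, y) from pose estimation on the second video.
    device_samples: device_id -> (x, y) position estimates (e.g., from IMU
        integration or LIDAR tracking).
    device_to_part: device_id -> body-part name, e.g. {"watch_01": "left_wrist"}.
    """
    model = dict(video_keypoints or {})
    for device_id, position in (device_samples or {}).items():
        part = (device_to_part or {}).get(device_id)
        if part:
            # In this sketch, a device reading is treated as the more direct
            # measurement and overrides the camera estimate for that body part.
            model[part] = position
    return model
```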
At block 1114, process 1100 can track an accuracy of how well the second kinematic model matches the first kinematic model. In some cases, the matching can be just between parts of the models, such as the parts for the user's hands, arms, head, and/or feet. For example, there can be a match between two kinematic models when both have an arm raised gesture at the same time, no matter what other actions the models are performing at that time. In some implementations, the comparison can be performed through distance comparisons of corresponding points when the two models are overlaid on one another or by applying another machine learning model trained to compare similarities between kinematic model postures.
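A minimal sketch of the distance-based comparison and the scoring that follows at block 1116, assuming both kinematic models use normalized coordinates and shared keypoint names (the alternative trained similarity model is not shown):

```python
import math

def pose_match_score(source_pose, user_pose, tracked_parts=None, tolerance=0.15):
    """Score in [0, 1] for how closely the second kinematic model matches the
    first at one timestamp (block 1114). Optionally compare only selected
    parts (e.g., hands, arms, head, and/or feet)."""
    keys = set(source_pose) & set(user_pose)
    if tracked_parts is not None:
        keys &= set(tracked_parts)
    if not keys:
        return 0.0
    # Each keypoint's score falls off linearly with distance, reaching 0 at `tolerance`.
    per_point = [max(0.0, 1.0 - math.dist(source_pose[k], user_pose[k]) / tolerance)
                 for k in keys]
    return sum(per_point) / len(per_point)

def overall_score(per_second_scores):
    """Average per-second match scores into a single score (block 1116)."""
    return sum(per_second_scores) / len(per_second_scores) if per_second_scores else 0.0
```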
At block 1116, process 1100 can provide scoring based on the tracked accuracy. For example, a score can be provided for each second of matching, and those scores can be averaged across the entire time (or segments of time, such as for each level as shown in
Aspects of the present disclosure are directed to a video customization system that can customize a video based on a determined focus of the video creator. The video customization system can capture both video and gaze data indicating where the video creator is looking throughout the video. This can be a video captured by a camera, mobile phone, artificial reality device, or other camera-enabled device. The gaze data can, for example, be determined by modeling the user's eye(s) and determining a vector cast out from the center of the user's cornea(s), into the world or onto a screen of the video capture device. In some implementations, this gaze capturing can be done with, or augmented with, one or more machine learning models. By projecting the gaze ray to image coordinates, the video customization system can correlate the gaze data to coordinates in the video, generating time labeled gaze coordinates throughout the video. Based on these coordinates, the video customization system can perform various customizations on the video, such as setting the coordinates as the focal point of the video, recognizing an object at the coordinates, and setting that object as a focus object and/or highlighting the focus object, and/or setting a field-of-view in the video (e.g., cropping the video to the creator's field-of-view or providing an overlay on the video indicating the creator's field-of-view).
At block 1502, process 1500 can capture video data, including visual data (which may be synchronized with audio data) and eye tracking data. The data gathered at block 1502 can be captured (at blocks 1504 and 1506) by a single device (e.g., an artificial reality device with both external-facing camera(s) and user-facing eye tracking cameras) or through time-synchronized data from multiple devices, such as a first device (e.g., a video camera, mobile device, or artificial reality device) capturing the visual data with timing information (at block 1504) and a second device (e.g., an artificial reality device or another camera pointed at the creator's eyes) capturing gaze data for the creator (at block 1506). In some implementations, the recorded visual data can be a recording of a virtual reality world generated by a computing system, in which case the capturing can be of the computing system's display output, without actually using a physical camera.
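When the visual and gaze data come from separate devices, they could be aligned by timestamp roughly as in the sketch below, which assumes both streams carry clock-synchronized timestamps:

```python
from bisect import bisect_left

def align_gaze_to_frames(frame_times, gaze_samples):
    """For each frame timestamp, pick the gaze sample closest in time.

    frame_times: sorted frame timestamps in seconds.
    gaze_samples: sorted list of (timestamp, gaze) pairs, where `gaze` is
        whatever the eye tracker reports (a ray or an on-screen position).
    Returns a list of (frame_time, gaze) pairs.
    """
    if not gaze_samples:
        return []
    gaze_times = [t for t, _ in gaze_samples]
    aligned = []
    for frame_time in frame_times:
        i = bisect_left(gaze_times, frame_time)
        # Consider the neighbors on either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(gaze_samples)]
        best = min(candidates, key=lambda j: abs(gaze_times[j] - frame_time))
        aligned.append((frame_time, gaze_samples[best][1]))
    return aligned
```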
The eye tracking data can capture images of the creating user's eyes while one or more light sources illuminate either or both of the user's eyes. An eye-facing camera can capture a reflection of this light to determine eye position (e.g., based on a set of reflections, i.e., “glints,” around the user's cornea). In some cases, a 3D model of the user's eyes can be generated such that positions of the eyes are set based on the glints. In some cases, this modeling can be enhanced or replaced through the use of a machine learning module trained to determine a gaze direction or to provide 3D modeling information from glint data inputs. The result can be gaze data that provides, e.g., a ray indicating the direction of a user's gaze in the world or a position on a display where the user was looking at a given time.
At block 1508, process 1500 can correlate the eye tracking data to coordinates in the video. This can include projecting the gaze ray, from the eye tracking data of block 1506, to image coordinates. Block 1508 can include determining how a user's field-of-view maps over the captured video and determining where the user's gaze was directed (based on the captured eye tracking data) within that field of view. This can provide, e.g., coordinates within the video (e.g., from a bottom left corner of the video) where the user's gaze was directed for each frame of the video. This information can be generated for each video frame, resulting in a time series of gazes within the video (that is, a map from timestamps to gaze coordinates), conveying where the video creator was looking as the video was captured. Alternatively or in addition, process 1500 at block 1508 can track what the creator's field-of-view was during the video capture. This can be performed, for example, where the captured video is larger than the creator's field-of-view, e.g., for a capture by a panoramic or 360-degree camera.
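A minimal sketch of the projection step, assuming a pinhole camera model with known intrinsics and a gaze ray already expressed in the capture camera's coordinate frame (assumptions not spelled out above):

```python
def gaze_ray_to_pixel(ray_dir, fx, fy, cx, cy):
    """Project a gaze direction (a unit vector in the capture camera's frame,
    z pointing forward) to pixel coordinates in the video frame (block 1508).
    fx, fy, cx, cy are the camera's intrinsic parameters."""
    dx, dy, dz = ray_dir
    if dz <= 0:
        return None    # gaze points away from (or parallel to) the image plane
    return (cx + fx * dx / dz, cy + fy * dy / dz)

def build_gaze_track(aligned_gaze_rays, fx, fy, cx, cy):
    """Turn time-aligned gaze rays into a time series of gaze coordinates:
    a map from timestamp to (u, v) pixel coordinates in the video."""
    return {t: gaze_ray_to_pixel(ray, fx, fy, cx, cy) for t, ray in aligned_gaze_rays}
```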
While any block can be removed or rearranged in various implementations, block 1510 is shown in dashed lines to indicate there are specific instances where block 1510 is skipped. At block 1510, process 1500 can perform object recognition on the video. This can include recognizing any objects displayed in the video or the object that is at the point of the coordinates determined at block 1508. In some implementations, the object recognition can identify and tag objects, while in other cases the object recognition simply determines which parts of a video frame correspond to an object (i.e., determining object outlines). Object recognition can be performed using existing machine learning models trained for this purpose.
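Selecting a focus object from the recognized objects could be as simple as the following sketch, which assumes detections with bounding boxes from any off-the-shelf object recognizer:

```python
def focus_object_at(gaze_xy, detections):
    """Return the recognized object whose bounding box contains the gaze
    coordinates, preferring the smallest (most specific) box when several
    overlap; each detection is assumed to carry a "box" = (x, y, w, h)."""
    if gaze_xy is None:
        return None
    gx, gy = gaze_xy
    containing = [d for d in detections
                  if d["box"][0] <= gx <= d["box"][0] + d["box"][2]
                  and d["box"][1] <= gy <= d["box"][1] + d["box"][3]]
    if not containing:
        return None
    return min(containing, key=lambda d: d["box"][2] * d["box"][3])
```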
At block 1512, process 1500 can customize the video based on the eye tracking coordinates in the video. In some implementations, the customization can include setting a focal distance in the video (see e.g.,
Processors 1610 can be a single processing unit or multiple processing units in a device or distributed across multiple devices. Processors 1610 can be coupled to other hardware devices, for example, with the use of a bus, such as a PCI bus or SCSI bus. The processors 1610 can communicate with a hardware controller for devices, such as for a display 1630. Display 1630 can be used to display text and graphics. In some implementations, display 1630 provides graphical and textual visual feedback to a user. In some implementations, display 1630 includes the input device as part of the display, such as when the input device is a touchscreen or is equipped with an eye direction monitoring system. In some implementations, the display is separate from the input device. Examples of display devices are: an LCD display screen, an LED display screen, a projected, holographic, or augmented reality display (such as a heads-up display device or a head-mounted device), and so on. Other I/O devices 1640 can also be coupled to the processor, such as a network card, video card, audio card, USB, firewire or other external device, camera, printer, speakers, CD-ROM drive, DVD drive, disk drive, or Blu-Ray device.
In some implementations, the device 1600 also includes a communication device capable of communicating wirelessly or wire-based with a network node. The communication device can communicate with another device or a server through a network using, for example, TCP/IP protocols. Device 1600 can utilize the communication device to distribute operations across multiple network devices.
The processors 1610 can have access to a memory 1650 in a device or distributed across multiple devices. A memory includes one or more of various hardware devices for volatile and non-volatile storage, and can include both read-only and writable memory. For example, a memory can comprise random access memory (RAM), various caches, CPU registers, read-only memory (ROM), and writable non-volatile memory, such as flash memory, hard drives, floppy disks, CDs, DVDs, magnetic storage devices, tape drives, and so forth. A memory is not a propagating signal divorced from underlying hardware; a memory is thus non-transitory. Memory 1650 can include program memory 1660 that stores programs and software, such as an operating system 1662, video effect system 1664, and other application programs 1666. Memory 1650 can also include data memory 1670, which can be provided to the program memory 1660 or any element of the device 1600.
Some implementations can be operational with numerous other computing system environments or configurations. Examples of computing systems, environments, and/or configurations that may be suitable for use with the technology include, but are not limited to, personal computers, server computers, handheld or laptop devices, cellular telephones, wearable electronics, gaming consoles, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, or the like.
In some implementations, server 1710 can be an edge server which receives client requests and coordinates fulfillment of those requests through other servers, such as servers 1720A-C. Server computing devices 1710 and 1720 can comprise computing systems, such as device 1600. Though each server computing device 1710 and 1720 is displayed logically as a single server, server computing devices can each be a distributed computing environment encompassing multiple computing devices located at the same or at geographically disparate physical locations. In some implementations, each server 1720 corresponds to a group of servers.
Client computing devices 1705 and server computing devices 1710 and 1720 can each act as a server or client to other server/client devices. Server 1710 can connect to a database 1715. Servers 1720A-C can each connect to a corresponding database 1725A-C. As discussed above, each server 1720 can correspond to a group of servers, and each of these servers can share a database or can have their own database. Databases 1715 and 1725 can warehouse (e.g., store) information. Though databases 1715 and 1725 are displayed logically as single units, databases 1715 and 1725 can each be a distributed computing environment encompassing multiple computing devices, can be located within their corresponding server, or can be located at the same or at geographically disparate physical locations.
Network 1730 can be a local area network (LAN) or a wide area network (WAN), but can also be other wired or wireless networks. Network 1730 may be the Internet or some other public or private network. Client computing devices 1705 can be connected to network 1730 through a network interface, such as by wired or wireless communication. While the connections between server 1710 and servers 1720 are shown as separate connections, these connections can be any kind of local, wide area, wired, or wireless network, including network 1730 or a separate public or private network.
Embodiments of the disclosed technology may include or be implemented in conjunction with an artificial reality system. Artificial reality or extra reality (XR) is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., a virtual reality (VR), an augmented reality (AR), a mixed reality (MR), a hybrid reality, or some combination and/or derivatives thereof. Artificial reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs). The artificial reality content may include video, audio, haptic feedback, or some combination thereof, any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, in some embodiments, artificial reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an artificial reality and/or used in (e.g., perform activities in) an artificial reality. The artificial reality system that provides the artificial reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, a “cave” environment or other projection system, or any other hardware platform capable of providing artificial reality content to one or more viewers.
“Virtual reality” or “VR,” as used herein, refers to an immersive experience where a user's visual input is controlled by a computing system. “Augmented reality” or “AR” refers to systems where a user views images of the real world after they have passed through a computing system. For example, a tablet with a camera on the back can capture images of the real world and then display the images on the screen on the opposite side of the tablet from the camera. The tablet can process and adjust or “augment” the images as they pass through the system, such as by adding virtual objects. “Mixed reality” or “MR” refers to systems where light entering a user's eye is partially generated by a computing system and partially comprises light reflected off objects in the real world. For example, a MR headset could be shaped as a pair of glasses with a pass-through display, which allows light from the real world to pass through a waveguide that simultaneously emits light from a projector in the MR headset, allowing the MR headset to present virtual objects intermixed with the real objects the user can see. “Artificial reality,” “extra reality,” or “XR,” as used herein, refers to any of VR, AR, MR, or any combination or hybrid thereof. Additional details on XR systems with which the disclosed technology can be used are provided in U.S. patent application Ser. No. 17/170,839, titled “INTEGRATING ARTIFICIAL REALITY AND OTHER COMPUTING DEVICES,” filed Feb. 8, 2021, which is herein incorporated by reference.
Those skilled in the art will appreciate that the components and blocks illustrated above may be altered in a variety of ways. For example, the order of the logic may be rearranged, substeps may be performed in parallel, illustrated logic may be omitted, other logic may be included, etc. As used herein, the word “or” refers to any possible permutation of a set of items. For example, the phrase “A, B, or C” refers to at least one of A, B, C, or any combination thereof, such as any of: A; B; C; A and B; A and C; B and C; A, B, and C; or multiple of any item such as A and A; B, B, and C; A, A, B, C, and C; etc. Any patents, patent applications, and other references noted above are incorporated herein by reference. Aspects can be modified, if necessary, to employ the systems, functions, and concepts of the various references described above to provide yet further implementations. If statements or subject matter in a document incorporated by reference conflicts with statements or subject matter of this application, then this application shall control.
This application claims priority to U.S. Provisional Application Nos. 63/293,389, filed Dec. 23, 2021, titled “Video Customizations From Creator Focus Indications,” with Attorney Docket Number 3589-0108DP01; 63/298,411, filed Jan. 11, 2022, titled “Automated Movement Matching Between Videos,” with Attorney Docket Number 3589-0099DP01; and 63/298,407, filed Jan. 11, 2022, titled “Gesture Triggering Video Effects,” with Attorney Docket Number 3589-0098DP01. Each patent application listed above is incorporated herein by reference in its entirety.
Number | Date | Country
--- | --- | ---
63298407 | Jan 2022 | US
63298411 | Jan 2022 | US
63293389 | Dec 2021 | US