Embodiments of the invention relate to identifying and editing a human pose in a video on a mobile device.
Human pose detection refers to the detection of key points of a human figure in an image. The positions of the key points describe the human pose. Each key point is associated with a body part such as the head, a shoulder, a hip joint, a knee and a foot. Human pose detection enables the determination of whether a person detected in an image is kicking his leg, raising his elbow, standing up or sitting down.
Conventionally, a human pose is captured by outfitting a human subject with a marker suit having embedded tracking sensors on several key locations. Such an approach is cumbersome, time-consuming and costly. Marker-less methods for pose estimation have been developed but require significant computing power, which is an obstacle for devices limited by computing resources, such as mobile devices.
In one embodiment, a mobile device is provided to generate a target human pose in a video. The mobile device includes processing hardware, memory coupled to the processing hardware, and a display. The processing hardware is operative to: identify key points of a human figure from a frame of the video in response to a user command, the user command further indicating a target position of a given key point of the key points; generate a target frame including the target human pose, with the given key point of the target human pose at the target position; and generate on the display an edited frame sequence including the target frame. The edited frame sequence shows movement of the human pose transitioning into the target human pose.
In another embodiment, a method is provided to generate a target human pose in a video. The method comprises: identifying key points of a human figure from a frame of the video in response to a user command, the user command further indicating a target position of a given key point of the key points; generating a target frame including the target human pose, with the given key point of the target human pose at the target position; and generating on a display an edited frame sequence including the target frame. The edited frame sequence shows movement of the human pose transitioning into the target human pose.
Other aspects and features will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments in conjunction with the accompanying figures.
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that different references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean at least one. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
In the following description, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.
Embodiments of the invention enable the editing of a human pose captured in a video. In one embodiment, the pose of a human figure is identified in a video, where the pose is defined by a number of key points which describe joint positions and joint orientations. A user, such as a smartphone user, may view the video on a display of the smartphone and edit the position of a key point in a frame of the video. The user-edited position of the key point is referred to as the target position. In response to the user input, the human pose is automatically modified in the video, including a target frame that shows the key point at the target position and the neighboring frames that precede and/or follow the target frame. For example, a human figure may extend his arm in an original frame sequence of the video, and a user may edit one frame in the video to bend the arm. A method and a system are developed to automatically generate an edited frame sequence based on the original frame sequence and the target position of the edited key point. In the edited frame sequence, the human figure is shown to bend his arm in a natural and smooth movement.
In one embodiment, a video editing application may be provided and executed by the user's smartphone, which according to a user command automatically generates an edited frame sequence with smooth transitions into and out of the target frame.
Although the terms “smartphone” and “mobile device” are used in this disclosure, it is understood that the methodology described herein is applicable to any computing and/or communication device capable of displaying a video, identifying a human pose and key points, editing one or more of the key points according to a user command, and generating an edited video. It is understood that the term “mobile device” includes a smartphone, a tablet, a network-connected device, a gaming device, etc. The video to be edited on a mobile device may be captured by the same mobile device, or by a different device and then downloaded to the mobile device. In one embodiment, a user may edit a human pose in a frame of the video, run a video editing application on the mobile device to generate an edited video, and then share the edited video on social media.
As an example, the video may be displayed and edited on the mobile device 100 of
In one embodiment, the key points of the human figure are shown on the display after the mobile device 100 receives the user's command to edit the video (e.g., when the user starts running a video editing application on the mobile device 100). The user may select a frame sequence (e.g., the original frame sequence 210) to be replaced by the edited frame sequence 220. The user may input his edits in the first frame of the selected frame sequence to define the target pose in the last frame (i.e., the target frame) of the edited frame sequence 220. The number of intermediate frames generated by the mobile device 100 between the original pose (in frame (F1)) and the target pose (in frame (F4)) may be controlled by a predetermined setting or a user-configurable setting (e.g., 1-2 seconds of frames such as 30-60 frames), and/or may be dependent on the amount of movement between the original pose and the target pose, to produce a smooth movement. In one embodiment, additional frames may also be generated and added after the target frame (e.g., frame (F4)) to produce a smooth movement of the human figure.
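The frame-count selection described above may be sketched as follows. The function name, the pixels-per-frame pacing constant, and the clamping bounds are illustrative assumptions chosen for this sketch, not part of any claimed embodiment:

```python
import math

def num_intermediate_frames(orig_xy, target_xy,
                            min_frames=30, max_frames=60,
                            px_per_frame=8.0):
    """Choose how many intermediate frames to generate, scaling with the
    distance the edited key point travels, then clamping to a range
    corresponding to roughly 1-2 seconds at 30 frames per second.

    px_per_frame is a hypothetical pacing constant (pixels of key-point
    travel per frame) used to make larger movements span more frames.
    """
    dist = math.hypot(target_xy[0] - orig_xy[0],
                      target_xy[1] - orig_xy[1])
    n = int(round(dist / px_per_frame))
    return max(min_frames, min(max_frames, n))
```

A small movement thus still receives enough frames to appear smooth, while a large movement is capped so the edited sequence does not become unnaturally long.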
With respect to human pose estimation 320, the mobile device 100 may identify the key points of a human pose from a human figure image by performing CNN-based parts identification and parts association. Parts identification refers to identifying the key points of a human figure, and parts association refers to associating the key points with body parts of a human figure. The human pose estimation 320 may be performed on the human figure cropped from the background image, and CNN computations are performed to associate the identified key points with body parts of the cropped human figure. CNN-based algorithms for image segmentation and human pose estimation are known in the art; the descriptions of these algorithms are beyond the scope of this disclosure. It is noted that the mobile device 100 may perform CNN computations to identify the human pose according to a wide range of algorithms.
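The output structure of parts identification and parts association may be sketched as follows. The CNN itself is abstracted away; the body-part list, the tuple layout of the raw detections, and the confidence threshold are assumptions made for this sketch only:

```python
# Named body parts, in the order the hypothetical detector emits them.
BODY_PARTS = ["head", "left_shoulder", "right_shoulder",
              "left_hip", "right_hip", "left_knee", "right_knee",
              "left_foot", "right_foot"]

def associate_parts(raw_keypoints, min_score=0.5):
    """Associate raw detections (x, y, score) with named body parts,
    keeping only detections above a confidence threshold (assumption).

    raw_keypoints is the hypothetical output of a CNN-based parts
    identification stage, one (x, y, score) tuple per body part.
    """
    pose = {}
    for part, (x, y, score) in zip(BODY_PARTS, raw_keypoints):
        if score >= min_score:
            pose[part] = (x, y)
    return pose
```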
After the key points of the human figure are identified and displayed on the mobile device 100, a user of the mobile device 100 may input a command to move any of the key points on the display. The user command may include a user-directed motion on a touch screen to move a key point to a target position. The user may move one or more of the key points via a user interface; e.g., by dragging a key point (referred to as the given key point) to a target position by hand or by a stylus pen on a touch screen or touch pad of the mobile device 100. Based on the edited coordinates of the given key point (e.g., in Cartesian space), the mobile device 100 computes the corresponding joint angles of the human figure. In one embodiment, the mobile device 100 converts the Cartesian coordinates to the corresponding joint angles by applying an inverse kinematics transformation 330. From the joint angles, the mobile device 100 computes the resulting key points which define the target pose, where the resulting key points include the given key point moved by the user and the other key points that are moved from their respective original positions as a result of the movement of the given key point.
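For a single limb, the inverse kinematics transformation 330 can be illustrated with a standard planar two-link solver (e.g., upper arm and forearm). This is a textbook two-link formulation offered as a sketch; the actual transformation applied by an embodiment may differ:

```python
import math

def two_link_ik(base, target, l1, l2):
    """Solve the two joint angles of a planar two-link limb so that the
    end key point reaches `target`.

    base   -- (x, y) of the fixed joint (e.g., the shoulder key point)
    target -- (x, y) of the dragged key point (e.g., the wrist)
    l1, l2 -- segment lengths (e.g., upper arm and forearm)
    Returns (base_angle, elbow_angle) in radians.
    """
    dx, dy = target[0] - base[0], target[1] - base[1]
    d = math.hypot(dx, dy)
    d = min(d, l1 + l2 - 1e-9)                  # clamp unreachable targets
    cos_elbow = (d * d - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    elbow = math.acos(max(-1.0, min(1.0, cos_elbow)))
    base_angle = math.atan2(dy, dx) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow))
    return base_angle, elbow
```

Applying forward kinematics to the returned angles recovers the intermediate joint position (e.g., the elbow key point), which is how the other key points can move as a result of the movement of the given key point.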
After the resulting key points are computed, the mobile device 100 applies global warping 340 to transform the original human figure pixels (having the original pose) to the target human figure pixels (having the target pose). The original human figure pixels are in an original coordinate system and the target human figure pixels are in a new coordinate system. The global warping 340 maps each pixel value of the human figure in the original coordinate system to the new coordinate system, such that the human figure is shown to have the target pose in the edited video. For example, if Q and P are the original coordinates of the two key points that define an arm in the original pose, and Q′ and P′ are the new coordinates of the corresponding resulting key points in the target pose, a transformation (T) can be computed from the line-pairs Q-P and Q′-P′. This transformation (T) can be used to warp pixels on the arm. If X is a pixel (or pixels) on the arm in the original pose, X′=T·X is the corresponding pixel (or pixels) on the arm in the target pose.
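One way to realize the transformation (T) for a single line-pair is a similarity transform (rotation, uniform scale, and translation) that carries segment Q-P onto Q'-P'. The sketch below uses complex arithmetic for brevity; the global warping 340 may combine contributions from multiple line-pairs, which is omitted here:

```python
def line_pair_transform(q, p, q2, p2):
    """Build a transform T mapping segment Q-P onto Q'-P', so that a
    pixel X on the original limb maps to X' = T(X).

    Represents 2-D points as complex numbers: dividing the target
    segment vector by the original segment vector yields the combined
    rotation and uniform scale.
    """
    z1 = complex(p[0] - q[0], p[1] - q[1])    # original segment vector
    z2 = complex(p2[0] - q2[0], p2[1] - q2[1])  # target segment vector
    a = z2 / z1                                # rotation + scale factor

    def T(x):
        z = complex(x[0] - q[0], x[1] - q[1])  # offset from Q
        w = a * z                              # rotate and scale
        return (q2[0] + w.real, q2[1] + w.imag)
    return T
```

By construction, T maps Q to Q' and P to P', and every pixel between them moves proportionally, so the warped limb keeps its shape while adopting the target orientation.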
In one embodiment, the inverse kinematics transformation 330 and the global warping 340 are also performed on each intermediate state of the human pose in each intermediate frame (which precedes the target frame) to produce a smooth path of movement of the human figure. A smooth simulated path of movement is computed with the inverse kinematics transformation 330, and the poses within the time window of the intermediate frames are warped accordingly to present a natural human pose.
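The intermediate states may be obtained by interpolating the joint angles between the original pose and the target pose, one set of angles per intermediate frame. The smoothstep easing used below is an assumption chosen to give a natural acceleration and deceleration; a given embodiment may use a different interpolation:

```python
def interpolate_joint_angles(orig, target, n_frames):
    """Interpolate each joint angle from the original pose to the target
    pose over n_frames intermediate states.

    orig, target -- dicts mapping joint name to angle (radians)
    Returns a list of n_frames angle dicts; the last one equals the
    target pose. Uses smoothstep easing (assumption) so the movement
    starts and ends gently.
    """
    poses = []
    for i in range(1, n_frames + 1):
        t = i / n_frames
        t = t * t * (3.0 - 2.0 * t)           # smoothstep easing
        poses.append({joint: angle + t * (target[joint] - angle)
                      for joint, angle in orig.items()})
    return poses
```

Each interpolated angle set is then converted back to key-point positions and warped with the global warping 340, yielding one intermediate frame per state.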
The mobile device 700 further includes memory and storage hardware 720 coupled to the processing hardware 710. The memory and storage hardware 720 may include memory devices such as dynamic random access memory (DRAM), static RAM (SRAM), flash memory and other volatile or non-volatile memory devices. The memory and storage hardware 720 may further include storage devices, for example, any type of solid-state or magnetic storage device.
The mobile device 700 may also include a display 730 to display information such as pictures, videos, messages, Web pages, games, texts, and other types of text, image and video data. In one embodiment, the display 730 and a touch screen may be integrated together.
The mobile device 700 may also include a camera 740 for capturing images and videos, which can then be viewed on the display 730. The videos may be edited via a user interface, such as a keyboard, a touch pad, a touch screen, a mouse, etc. The mobile device 700 may also include audio hardware 750, such as a microphone and a speaker, for receiving and generating sounds. The mobile device 700 may also include a battery 760 to supply operating power to hardware components of the mobile device 700.
The mobile device 700 may also include an antenna 770 and a digital and/or analog radio frequency (RF) transceiver 780 to transmit and/or receive voice, digital data and/or media signals, including the aforementioned video of edited human pose.
It is understood that the embodiment of
The method 800 begins at step 810 with the mobile device identifying key points of a human figure from a frame of a video in response to a user command. The user command further indicates a target position of a given key point of the key points. At step 820, the mobile device generates a target frame including a target human pose. The given key point of the target human pose is at the target position. At step 830, the mobile device generates on the display an edited frame sequence including the target frame. The edited frame sequence shows the movement of the human pose transitioning into the target human pose.
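Steps 810-830 may be summarized at the key-point level as follows. The sketch tracks only key-point coordinates, moving the given key point linearly toward the target position frame by frame; the per-frame inverse kinematics and pixel warping of the fuller description are omitted, and all names are illustrative:

```python
def edited_keypoint_sequence(orig_pose, given, target_xy, n_frames):
    """Sketch of steps 810-830: produce one key-point set per edited
    frame, with the given key point moving from its original position
    to target_xy; the final entry corresponds to the target frame.

    orig_pose -- dict mapping key-point name to (x, y)
    given     -- name of the user-edited key point
    """
    ox, oy = orig_pose[given]
    seq = []
    for i in range(1, n_frames + 1):
        t = i / n_frames
        pose = dict(orig_pose)                 # other key points held fixed
        pose[given] = (ox + t * (target_xy[0] - ox),
                       oy + t * (target_xy[1] - oy))
        seq.append(pose)
    return seq
```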
The operations of the flow diagram of
Various functional components or blocks have been described herein. As will be appreciated by persons skilled in the art, the functional blocks will preferably be implemented through circuits (either dedicated circuits, or general purpose circuits, which operate under the control of one or more processors and coded instructions), which will typically comprise transistors that are configured in such a way as to control the operation of the circuitry in accordance with the functions and operations described herein.
While the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described, and can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.