At least one of the present embodiments generally relates to a method or an apparatus for deep learning of foot contacts with convolutional neural networks.
Convolutional Neural Networks (CNNs) have allowed considerable progress in the processing of image, video, and time series signals. Their benefits in these fields have sparked an interest in applying deep learning to other tasks, such as foot contact detection.
At least one of the present embodiments generally relates to a method or an apparatus for deep learning of skeletal contacts with convolutional neural networks. In an exemplary embodiment, not limiting the general aspects, the skeletal contacts are described in terms of vertical foot contacts.
According to a first aspect, there is provided a method. The method comprises steps for training a deep neural network by determining vertical ground reaction force estimates from a database comprising motion information and ground reaction forces; applying the trained deep neural network to human motion data to generate vertical ground reaction force estimates; and applying a contacts function to the vertical ground reaction force estimates to generate foot contact estimates.
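The third step of the method above can be sketched in code. This is a minimal illustrative sketch, not the claimed implementation: the function name `contacts_function`, the 16-region layout, and the 20 N threshold are assumptions introduced for illustration only.

```python
# Hypothetical sketch of the contacts-function step: map per-region
# vertical GRF estimates (in newtons) for one foot at one frame to a
# binary foot-contact flag. The threshold value is an assumption.

def contacts_function(vgrf_per_region, threshold=20.0):
    """Return 1 if the total vertical force exceeds the threshold."""
    return 1 if sum(vgrf_per_region) > threshold else 0

# Three frames of vGRF estimates over 16 regions of one foot.
frames = [[0.0] * 16, [5.0] * 16, [0.5] * 16]
contacts = [contacts_function(f) for f in frames]
print(contacts)  # -> [0, 1, 0]
```

Only the middle frame (80 N total) crosses the assumed threshold, so it alone is labelled as a contact.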
According to another aspect, there is provided an apparatus. The apparatus comprises a processor. The processor can be configured to implement the general aspects by executing any of the described methods.
According to another general aspect of at least one embodiment, there is provided a device comprising an apparatus according to any of the described embodiments; and at least one of (i) an antenna configured to receive a signal, the signal including data representative of the foot contact estimates, (ii) a band limiter configured to limit the received signal to a band of frequencies that includes that data, or (iii) a display configured to display an output representative of the foot contact estimates.
According to another general aspect of at least one embodiment, there is provided a non-transitory computer readable medium containing data content generated according to any of the described embodiments or variants.
According to another general aspect of at least one embodiment, there is provided a signal comprising data generated according to any of the described embodiments or variants.
According to another general aspect of at least one embodiment, a bitstream is formatted to include data content generated according to any of the described embodiments or variants.
According to another general aspect of at least one embodiment, there is provided a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the described embodiments or variants.
These and other aspects, features and advantages of the general aspects will become apparent from the following detailed description of exemplary embodiments, which is to be read in connection with the accompanying drawings.
Convolutional Neural Networks (CNNs) have allowed considerable progress in the processing of image, video, and time series signals. Their benefits in these fields have sparked an interest in generalizing their application to other types of data, including data defined on non-Euclidean domains.
Perceived realism is central to human animation; however, artifacts are often introduced whenever one tries to edit or synthesize motion data. Because human perception is more sensitive to relative changes than to absolute ones, artifacts occurring close to fixed reference points such as walls or the ground are generally the most damaging to perceived realism. In this context, foot contact detection plays an important role in quantifying foot artifacts and in providing animation tools that are robust to those artifacts.
As of today, many motion synthesis and editing approaches suffer from such artifacts, e.g., feet sliding on, passing through, or floating above the ground, and require knowing when the feet are in contact with the ground in order to post-process motion sequences. Usually, foot contacts are derived from the motion sequences themselves, which are subject to artifacts, using simple heuristics: hand-crafted thresholds on velocity and on proximity to the ground. Unfortunately, these approaches suffer from three main limitations.
First, the terrain height map must be known, meaning that these approaches are usually not applicable to uneven or inclined ground. Second, optimal thresholds are not universal: they must be manually tuned, ideally for every type of motion, morphology, and contact location (e.g., heel or toe). Finally, even optimal thresholds fail to accurately detect every foot contact, which makes tedious manual checking necessary when using these approaches.
Under the general aspects described here, an approach is proposed to design a neural network that estimates ground contact forces in various areas under each foot from motion capture data at the input. At training time only, the input motion capture data is supplemented with synchronized pressure insole data, and this extra data is leveraged to optimize the neural network weights. At run time, the optimized network is fed with motion capture data to regress the ground contact forces and to predict more accurate ground contacts from these forces.
In the following described embodiments, the proposed data-driven foot contact prediction method is learned on a large motion capture database and outperforms traditional heuristic approaches. First, we constructed UnderPressure, a novel publicly available database composed of diverse motion capture data synchronized with pressure insole data, from which corresponding vertical ground reaction forces (vGRFs) and foot contacts can be computed. Then, we train a deep neural network to predict the GRF distribution from motion sequences.
The key idea here is that using a meaningful proxy representation, i.e., GRFs, which is richer than binary foot contacts, forces the network to actually model the complex interactions between the feet and the ground rather than, for instance, merely recovering optimal thresholds.
Finally, accurate foot contact predictions can be computed from the predicted GRFs at inference. The proposed embodiments enable more accurate ground contact detection because detection is based on fine-grained ground reaction forces rather than on motion capture joint locations alone.
In this approach, the estimation results can be used to improve the quality of the motion data. Based on contact detection results, skeletal motion sequences can be updated so that the feet are positioned on the ground plane and that the velocity of their joints is zero during contact. Inverse kinematics can be used to update the other body joints, for example.
We propose the first method for foot contacts detection learned on a significant amount of accurately labelled motion data.
It is the most accurate foot contact detection method to date, providing correct foot contact labels in locomotion for about 99 frames out of 100, as well as valuable estimates of the vertical ground reaction forces exerted on the feet at each frame.
Some example applications for these embodiments are:
For the purpose of training the neural network used for the prediction of foot contacts, we recorded motion capture data on a sample of adult volunteers equipped with motion capture sensors and performing a variety of movements including locomotion at different paces both forward and backward, sitting, chair climbing, jumping, as well as motions on uneven terrain like going up and down stairs.
In addition to motion capture, we recorded plantar foot pressure distribution. To this end, subjects were also equipped with Moticon's OpenGo Sensor Insoles placed into their shoes. Each insole has 16 plantar pressure sensors with a resolution of 0.25 N/cm2 and a 6-axis inertial measurement unit (IMU), both running at 100 Hz (See
Vertical GRFs. The captured data includes motion sequences, plantar pressure distributions, and feet acceleration. Since pressure is defined as the perpendicular force per unit area, we additionally compute vertical GRF components (vGRFs) by multiplying pressure values by the corresponding cell areas. The motivation is that vGRFs, unlike pressures, are easy to aggregate, i.e., by summation.
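The pressure-to-force conversion above can be sketched as follows. The per-cell areas below are illustrative placeholders, not Moticon's actual sensor geometry, and the uniform pressure frame is a toy input.

```python
# Converting insole pressure readings to vertical GRF components:
# force (N) = pressure (N/cm^2) x cell area (cm^2).
# The 16 cell areas are assumed values for illustration only.

CELL_AREAS_CM2 = [8.0] * 16          # hypothetical area of each pressure cell

def pressures_to_vgrfs(pressures_n_per_cm2):
    """Per-cell vertical GRF (N) from per-cell pressure (N/cm^2)."""
    return [p * a for p, a in zip(pressures_n_per_cm2, CELL_AREAS_CM2)]

pressures = [0.25] * 16              # one frame at the 0.25 N/cm^2 resolution
vgrfs = pressures_to_vgrfs(pressures)
total_force = sum(vgrfs)             # forces, unlike pressures, sum directly
print(total_force)  # -> 32.0
```

The final summation illustrates why forces are the preferred representation: the total load on the foot is simply the sum of the per-cell vGRFs, which has no direct analogue for raw pressures.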
In order to define foot contact labels from insole records, we designed a robust deterministic algorithm which computes binary foot contact labels divided into heels and toes, as commonly done in human animation. It roughly consists in considering the heel or toes in contact with the ground when their corresponding total force (See
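The deterministic labelling step can be sketched as below. The split of the 16 cells into heel and toe groups and the 15 N threshold are assumptions for illustration; the actual grouping and threshold are those of the described algorithm.

```python
# Sketch of deterministic heel/toe contact labelling: sum the vGRFs of
# the cells under the heel and under the toes separately, then threshold
# each total force. Cell grouping and threshold are assumed values.

HEEL_CELLS = range(0, 6)     # hypothetical: first 6 cells under the heel
TOE_CELLS = range(6, 16)     # hypothetical: remaining cells under the forefoot

def contact_labels(vgrfs, threshold=15.0):
    """Binary (heel, toe) contact labels for one foot at one frame."""
    heel_force = sum(vgrfs[i] for i in HEEL_CELLS)
    toe_force = sum(vgrfs[i] for i in TOE_CELLS)
    return (int(heel_force > threshold), int(toe_force > threshold))

frame = [4.0] * 6 + [1.0] * 10       # 24 N on heel cells, 10 N on toe cells
print(contact_labels(frame))  # -> (1, 0)
```

In this toy frame only the heel carries enough load to count as a contact, matching a heel-strike phase of gait.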
Since we jointly captured motion and foot pressure data with separate devices, our records must be accurately synchronized in the absence of a genlock signal. To this end, subjects were asked to perform a simple control movement at the beginning and end of each capture sequence for synchronization purposes. It consists of a small in-place double-leg jump, allowing us to match vertical acceleration peaks measured by the pressure insole IMUs and computed from the motion capture foot positions. Although numerical differentiation is known to amplify high-frequency noise, we found that acceleration peaks were precise enough to synchronize to within 1 insole frame, i.e., 10 ms.
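The peak-matching synchronization can be sketched as follows. The signals, sample spacing, and peak locations are toy values; real recordings are noisy and may need smoothing before differentiation.

```python
# Sketch of the synchronization step: differentiate the mocap foot
# height twice to get acceleration, locate the jump's acceleration peak
# in both streams, and align the streams on the index offset.

def acceleration(positions, dt):
    """Second-order central finite difference of a 1D position signal."""
    return [(positions[i - 1] - 2 * positions[i] + positions[i + 1]) / dt**2
            for i in range(1, len(positions) - 1)]

def peak_index(signal):
    """Index of the largest-magnitude sample."""
    return max(range(len(signal)), key=lambda i: abs(signal[i]))

# Toy example: the same impulsive event appears at different indices.
imu_accel = [0.0] * 10 + [50.0] + [0.0] * 10     # insole IMU, peak at 10
foot_z = [0.0] * 14 + [0.1] + [0.0] * 14         # mocap foot height spike
mocap_accel = acceleration(foot_z, dt=0.01)      # peak lands at index 13 here

offset = peak_index(imu_accel) - peak_index(mocap_accel)
print(offset)  # -> -3
```

Applying `offset` as a shift to one of the streams aligns the two recordings; at a 100 Hz insole rate, each index step corresponds to 10 ms.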
After synchronizing our data, we resampled the motion capture data from 240 Hz down to 100 Hz using spherical linear interpolation to match the pressure insoles' framerate. Nevertheless, we also provide the original motion sequences at 240 Hz in our database. We also trim the beginning and end of each sequence, which correspond to the synchronization movements.
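The resampling step can be sketched as below. This is a minimal illustrative slerp on quaternion joint rotations, not the database's actual resampling code; quaternions are assumed to be unit (w, x, y, z) tuples.

```python
import math

# Sketch of framerate resampling with spherical linear interpolation
# (slerp) on a quaternion rotation track, going from 240 Hz to 100 Hz.

def slerp(q0, q1, t):
    """Interpolate between unit quaternions q0 and q1 at fraction t."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                       # take the shorter arc
        q1, dot = tuple(-c for c in q1), -dot
    theta = math.acos(min(dot, 1.0))
    if theta < 1e-8:                    # nearly identical rotations
        return q1
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))

def resample(quats, src_hz=240, dst_hz=100):
    """Resample a quaternion track by slerping between source frames."""
    n_out = int((len(quats) - 1) * dst_hz / src_hz) + 1
    out = []
    for k in range(n_out):
        x = k * src_hz / dst_hz         # fractional source-frame index
        i = min(int(x), len(quats) - 2)
        out.append(slerp(quats[i], quats[i + 1], x - i))
    return out

identity = (1.0, 0.0, 0.0, 0.0)
track = [identity] * 241                # 1 s of 240 Hz identity rotations
print(len(resample(track)))  # -> 101
```

One second of motion at 240 Hz (241 frames including both endpoints) maps to 101 frames at 100 Hz, as expected.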
In this section, we describe the proposed method to learn a deep learning model of vGRF distribution prediction from motion capture data. vGRFs then serve as a rich proxy representation from which to compute robust and accurate binary foot contacts.
Input. At each frame t, the human pose X_t ∈ R^(J×3) is represented by the positions of its J=23 joints in a global Euclidean space. We design our vGRFs prediction network ψ_θ to output vGRFs at each frame from a few surrounding input frames, with padding when needed. The full input pose sequence is then X ∈ R^(T×J×3), where T is the number of frames.
Output. As previously described, our vGRFs prediction network 40 estimates the vGRF distribution from motion data. It outputs F̂ = ψ_θ(X) ∈ R^(T×2×16), i.e., 16 positive real-valued vGRF components at each frame for both feet, corresponding to the 16 insole pressure cells. At inference, we use the same deterministic algorithm, called the contacts function Γ, as the one used to compute ground truth binary foot contacts from captured insole data, to derive binary foot contacts ĉ = Γ(F̂) ∈ {0, 1}^(T×2×2) from estimated vGRFs at each frame, for both feet and both contact locations (heel and toe).
vGRFs Prediction Network
Architecture. We design our vGRFs prediction network 40 to process variable-length sequences. To this end, the network is composed of four 1D temporal convolutional layers followed by three linear layers shared across frames. Each layer is followed by an exponential linear unit (ELU) activation, except the last one, which uses a softplus activation to output nonnegative vGRFs. Since we use 3-frame-wide kernels in the convolutional layers, the intrinsic latency of the model is equivalent to 4 frames, or 40 ms.
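The stated latency follows directly from the kernel geometry and can be checked with a small calculation: each 3-frame kernel looks one frame into the future, so four stacked convolutional layers need four future frames.

```python
# Sanity check of the model's intrinsic latency: four 1D convolutional
# layers with 3-frame kernels each add (3 // 2) = 1 frame of lookahead.

n_conv_layers = 4
kernel_width = 3
frame_rate_hz = 100

lookahead_frames = n_conv_layers * (kernel_width // 2)
latency_ms = lookahead_frames * 1000 // frame_rate_hz
print(lookahead_frames, latency_ms)  # -> 4 40
```

At the 100 Hz insole framerate, those 4 frames of lookahead correspond to the 40 ms latency quoted above.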
Training. During training, our network is iteratively exposed to sequences of human poses and tries predicting corresponding vGRFs as depicted in
where F is the ground truth vGRFs and F̂ = ψ_θ(X) its predicted counterpart. Note that, for numerical convenience, 1 is added to both F and F̂, since ln(0) is not defined while vGRFs can be zero.
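A loss consistent with this description can be sketched as below. The exact form of the error (mean squared error is assumed here) is an illustrative choice; only the ln(1 + ·) compression is taken from the text above.

```python
import math

# Hedged sketch of a training loss over log-compressed vGRFs: 1 is
# added inside the logarithm so that zero forces remain well-defined.
# The MSE form is an assumption for illustration.

def log_vgrf_loss(f_true, f_pred):
    """Mean squared error between ln(1 + F) and ln(1 + F_hat)."""
    terms = [(math.log1p(t) - math.log1p(p)) ** 2
             for t, p in zip(f_true, f_pred)]
    return sum(terms) / len(terms)

f_true = [0.0, 10.0, 25.0]
f_pred = [0.0, 12.0, 25.0]
print(log_vgrf_loss(f_true, f_true))  # perfect prediction -> 0.0
print(log_vgrf_loss(f_true, f_pred) > 0.0)  # -> True
```

The log compression downweights errors on large forces relative to small ones, which matches the intuition that light contacts are the hard cases for contact detection.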
One embodiment of a method 500 under the described aspects is shown in
Processor 610 is also configured to insert or receive information in a bitstream and to compress, encode, or decode data using any of the described aspects.
This document describes a variety of aspects, including tools, features, embodiments, models, approaches, etc. Many of these aspects are described with specificity and, at least to show the individual characteristics, are often described in a manner that can sound limiting. However, this is for purposes of clarity in description, and does not limit the application or scope of those aspects. Indeed, all the different aspects can be combined and interchanged to provide further aspects. Moreover, the aspects can be combined and interchanged with aspects described in earlier filings as well.
The aspects described and contemplated in this document can be implemented in many different forms. These and other aspects can be implemented as a method, an apparatus, a computer readable storage medium having stored thereon instructions for carrying out any of the methods described, and/or a computer readable storage medium having stored thereon data generated according to any of the methods described.
Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined.
The embodiments can be carried out by computer software implemented by the processor 610 or by hardware, or by a combination of hardware and software. As a non-limiting example, the embodiments can be implemented by one or more integrated circuits. The memory 620 can be of any type appropriate to the technical environment and can be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples. The processor 610 can be of any type appropriate to the technical environment, and can encompass one or more of microprocessors, general purpose computers, special purpose computers, and processors based on a multi-core architecture, as non-limiting examples.
When a figure is presented as a flow diagram, it should be understood that it also provides a block diagram of a corresponding apparatus. Similarly, when a figure is presented as a block diagram, it should be understood that it also provides a flow diagram of a corresponding method/process.
The implementations and aspects described herein can be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed can also be implemented in other forms (for example, an apparatus or program). An apparatus can be implemented in, for example, appropriate hardware, software, and firmware. The methods can be implemented, for example, in a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well any other variations, appearing in various places throughout this document are not necessarily all referring to the same embodiment.
Additionally, this document may refer to “determining” various pieces of information. Determining the information can include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Further, this document may refer to “accessing” various pieces of information. Accessing the information can include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information.
Additionally, this document may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information can include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
As will be evident to one of ordinary skill in the art, implementations can produce a variety of signals formatted to carry information that can be, for example, stored or transmitted. The information can include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal can be formatted to carry the bitstream of a described embodiment. Such a signal can be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting can include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries can be, for example, analog or digital information. The signal can be transmitted over a variety of different wired or wireless links, as is known. The signal can be stored on a processor-readable medium.
Embodiments may include one or more of the following features or entities, alone or in combination, across various different claim categories and types:
Various other generalized, as well as particularized, inventions and claims are also supported and contemplated throughout this description.
Number | Date | Country | Kind
22305080.8 | Jan 2022 | EP | regional

Filing Document | Filing Date | Country | Kind
PCT/EP2023/051586 | 1/23/2023 | WO