The present disclosure relates to the field of image processing technologies and, more particularly, relates to a method and device for harmony-aware audio-driven motion synthesis.
Machine-based generation is widely used in tasks such as music video production, speech editing, and animation synthesis, where harmony refers to the subjectively consistent perception of rhythms, emotions, or visual appearances in the output.
As a typical problem in audio-visual cross-domain generation, the task of audio-driven motion synthesis has gained much attention in character animation, video generation, and choreography. Traditional methods tackle audio-to-visual generation by retrieving visual clips that share feature-level similarity with the given music. Different from regular motion synthesis, when the synthesis is conditioned on music, people are found to be sensitive to inharmonious synthesized motions, which heavily damages the qualitative evaluation. Harmony is considered one of the most important factors that influence the quality assessment of cross-domain results. However, the feeling of harmony relies on perceptual judgement, which makes it challenging to enhance the audio-visual harmony in audio-driven motion synthesis tasks.
The disclosed method and system are directed to solve one or more problems set forth above and other problems.
One aspect of the present disclosure provides a method for harmony-aware audio-driven motion synthesis applied to a computing device. The method includes determining a plurality of testing meter units according to an input audio, each testing meter unit corresponding to an input audio sequence of the input audio, obtaining an auditory input corresponding to each testing meter unit, obtaining an initial pose of each testing meter unit as a visual input based on a visual motion sequence synthesized for a previous testing meter unit, and automatically generating a harmony-aware motion sequence corresponding to the input audio using a generator of a generative adversarial network (GAN) model. The GAN model is trained by incorporating a hybrid loss function. The hybrid loss function includes a multi-space pose loss, a harmony loss, and a GAN loss. The harmony loss is determined according to beat consistencies of audio-visual beat pairs.
Another aspect of the present disclosure provides a device for harmony-aware audio-driven motion synthesis, including a memory and a processor coupled to the memory. The processor is configured to perform a plurality of operations including determining a plurality of testing meter units according to an input audio, each testing meter unit corresponding to an input audio sequence of the input audio, obtaining an auditory input corresponding to each testing meter unit, obtaining an initial pose of each testing meter unit as a visual input based on a visual motion sequence synthesized for a previous testing meter unit, and automatically generating a harmony-aware motion sequence corresponding to the input audio using a generator of a generative adversarial network (GAN) model. The GAN model is trained by incorporating a hybrid loss function. The hybrid loss function includes a multi-space pose loss, a harmony loss, and a GAN loss. The harmony loss is determined according to beat consistencies of audio-visual beat pairs.
The following drawings are merely examples for illustrative purposes according to various disclosed embodiments and are not intended to limit the scope of the present disclosure.
Reference will now be made in detail to exemplary embodiments of the invention, which are illustrated in the accompanying drawings. Hereinafter, embodiments consistent with the disclosure will be described with reference to the drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. It is apparent that the described embodiments are some but not all of the embodiments of the present invention. Based on the disclosed embodiments, persons of ordinary skill in the art may derive other embodiments consistent with the present disclosure, all of which are within the scope of the present invention.
Harmony is an essential part of artistic creation. Movie directors tend to produce appealing scenes with songs that enhance emotional expression. When musicians arrange different voice parts in a chorus, they are supposed to consider whether the combination sounds harmonious. Artists pursue harmony in their works to create the senses of beauty and comfort. Since professional skills and techniques are required to complete such creative works, to save financial cost and labor, automatic generation is gradually applied to imitate the human creation process by exploiting computational models. Similar to human work, the machine-based generation needs to obey the rule of harmony in order to produce high-quality results that satisfy human aesthetics.
Handling harmony in such generative tasks means the models should put effort into controlling the consistency between multiple signals, which appears as the alignment of features, either explicitly for observation or implicitly in the latent spaces. The synchronization for different signal pairs may differ in relevance, so that in highly related pairs, correlated features are easier to capture and align. In human perception, over 90 percent of sensory information derives from visual or auditory stimuli, and the two interrelate and interact with each other during brain processing.
The present disclosure provides a method and device for harmony-aware audio-driven motion synthesis. The disclosed method and/or device can be applied in any proper occasions where human motion synthesis with music is desired. The disclosed harmony-aware audio-driven motion synthesis process is implemented based on a beat-oriented generative adversarial network (GAN) model with a harmony-aware hybrid loss function, i.e., the HarmoGAN model, which utilizes audio sequences or extracted auditory features to generate visual motion sequences. The harmony evaluation mechanism added to the disclosed GAN model is verified to quantify the harmony between audio and visual sequences by analyzing beat consistency.
Processor 102 may include any appropriate processor(s). In certain embodiments, processor 102 may include multiple cores for multi-thread or parallel processing, and/or a graphics processing unit (GPU). Processor 102 may execute sequences of computer program instructions to perform various processes, such as an audio-visual harmony evaluation and harmony-aware audio-driven motion synthesis program, a GAN model training program, etc. Storage medium 104 may be a non-transitory computer-readable storage medium, and may include memory modules, such as ROM, RAM, flash memory modules, and erasable and rewritable memory, and mass storages, such as CD-ROM, U-disk, and hard disk, etc. Storage medium 104 may store computer programs that, when executed by processor 102, implement various processes. Storage medium 104 may also include one or more databases for storing certain data such as video data, training data sets, testing video data sets, data of the trained GAN model, and certain operations can be performed on the stored data, such as database searching and data retrieving.
The communication module 108 may include network devices for establishing connections through a network. Display 106 may include any appropriate type of computer display device or electronic device display (e.g., CRT or LCD based devices, touch screens). Peripherals 112 may include additional I/O devices, such as a keyboard, a mouse, and so on.
In operation, the processor 102 may be configured to execute instructions stored on the storage medium 104 and perform various operations related to a harmony-aware audio-driven motion synthesis method as detailed in the following descriptions.
As shown in
At S202, a plurality of testing meter units are determined according to an input audio, each testing meter unit corresponding to an input audio sequence of the input audio.
At S204, an auditory input corresponding to each testing meter unit is obtained.
At S206, an initial pose of each testing meter unit is obtained as a visual input based on a visual motion sequence synthesized for a previous testing meter unit; and
At S208, a harmony-aware motion sequence corresponding to the input audio is automatically generated using a generator of a GAN model. The GAN model is trained by incorporating a hybrid loss function. The hybrid loss function includes a multi-space pose loss, a harmony loss, and a GAN loss. The harmony loss is determined according to beat consistencies of audio-visual beat pairs.
As shown in
At S302, audio beats and audio beat strengths are obtained from a training sample audio. Each audio beat corresponds to one audio beat strength.
Spectrogram analysis is widely used to obtain the audio beats Ba(t) in audio processing. The spectrogram of given audio sequences A(t) can be obtained by the time-windowed Fast Fourier Transform (FFT). With the estimation of the amplitude of the spectrogram, the beats are extracted by looking for distinct amplitude changes in the time domain, which could be described as:
Amp denotes the function or model that estimates the amplitude of the spectrogram. The positive threshold c1 determines the existence of a beat at time t, which satisfies Ba(t)=1, compared with any other t′ in its punctured neighborhood U̇a determined by a pre-defined radius t0. In the mainstream methodologies, amplitude estimation is conducted by deriving the onset strengths from the obtained spectrogram. The audio beat Ba(t)=1 is then determined by the occurrence of a peak in the onset envelope.
In some embodiments, to obtain the audio beats, the mainstream approach making use of the onset strength is exploited. All the audio beats can be practically processed by methods in the open-source package LibROSA, which provides implementations of onset-driven beat detection for audio signals. In some embodiments, the audio beat strengths are pre-computed to estimate the tempo based on the auto-correlation inside the onset envelope by analyzing the Mel spectrogram. Referring to Equations (1) and (2), the audio beat Ba(t)=1 corresponds to the case where there is a peak in the onset envelope at a time t consistent with the obtained tempo. To assemble the valid beats in Ba(t), the position-based beats {pa(b)|b=1,2, . . . , N} are formed to collect all the positions in time t of the N occurring beats that satisfy Ba(t)=1.
Simultaneously, the corresponding strengths of beats pa(b) are thus represented with the peak values as sa(b).
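As a non-limiting illustration, the onset-driven beat positions pa(b) and strengths sa(b) may be obtained with LibROSA along the lines of the following Python sketch; the helper name and return convention are illustrative assumptions rather than the disclosed implementation.

import librosa

def extract_audio_beats(audio_path):
    # Illustrative helper: onset-driven audio beat detection.
    y, sr = librosa.load(audio_path)
    # Onset envelope derived from the (Mel-)spectrogram analysis.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    # Tempo estimation and beat tracking based on the onset envelope,
    # yielding the position-based beats p_a(b) as frame indices.
    tempo, p_a = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
    # Beat strengths s_a(b) taken as the onset-envelope values at the beat positions.
    s_a = onset_env[p_a]
    return p_a, s_a, tempo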
At S304, a plurality of training meter units are determined according to the audio beats and the audio beat strengths. Each training meter unit corresponds to a sample audio sequence of the training sample audio and a temporal index based on a time record of the training meter unit.
me(b)=1 means that the beat is strong and me(b)=0 means that the beat is weak. For example, with a quarter note, strong-weak beat combinations are mapped into 3 meter unit types, which are 4/4, 5/4, and 6/4, in 4 categories in total. To describe the flow of musical rhythms, several beats from the previous meter unit are added to a current meter unit for the transitions between meter units, so that the meter units are formed into a unified beat length. In the example shown in
Meter units are used as a basic unit of audio-visual pairs in the audio-driven video synthesis process. For each meter unit MU, the start time and end time are recorded as MU(t), t ∈ [tstart, tend]. The disclosed process can find, for a meter unit corresponding to an audio sequence A(t), a motion sequence V(t) that matches the audio sequence with desirable harmony by using a generator of a trained GAN model, and repeat a similar process for all meter units. According to the time records, when training the GAN model, known audio-visual matched training video samples having desirable harmony are obtained, and the corresponding audio and motion sequences A(t) and V(t) can be extracted as the ground-truth pairs, including the audio beats pa(b) and their beat strengths sa(b).
Meanwhile, the temporal index TI(t) is formed in binary to denote the separation between the current meter unit and the previous meter unit, where TI(t) is set to 1 if the time belongs to the priors from the previous meter unit. Using the example shown in
In some embodiments, training the GAN model also includes segmenting audio-visual clips based on the plurality of training meter units temporally and inputting the segmented audio-visual clips for the training. The initial pose of each training meter unit is obtained from a corresponding audio-visual clip that contains the sample audio sequence. If the human motion sequences are harmonious with the given auditory rhythms, the human motion sequences show regular recurring movement units related to the audio beats. It can be assumed that a correlation exists between such movement units and the obtained audio beats. Thus, the audio-visual clips are segmented temporally based on the defined meter units as input for training, to strengthen the learning of beat-driven cross-domain unit mapping in a deep model consistent with the embodiments of the present disclosure, which can indirectly benefit the audio-visual harmony of the generation.
At S306, features of the sample audio sequence of each training meter unit are extracted as a sample auditory input.
In some embodiments, features of the input audio sequence A(t) of the current meter unit are extracted as an auditory input Af(t). For example, for A(t), the features of Mel Frequency Cepstral Coefficients (MFCCs) are extracted as the auditory input Af(t). In some embodiments, auditory input corresponding to a testing meter unit (e.g., S204) may be obtained in a similar manner as obtaining training sample auditory input (e.g., S306).
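For example, the MFCC-based auditory input Af(t) for a meter unit may be computed along the lines of the following sketch, where the number of coefficients is an illustrative choice rather than a disclosed setting.

import librosa

def auditory_input(audio_segment, sr, n_mfcc=13):
    # MFCC features extracted from the audio sequence A(t) of one meter unit.
    return librosa.feature.mfcc(y=audio_segment, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)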
At S308, a sample initial pose of each training meter unit is obtained as a sample visual input based on the temporal index and a training sample visual motion sequence.
In some embodiments, for training, the initial poses Vf(t) can be obtained based on TI(t) and V(t) by:
That is, the visual features Vf(t) keep the movements from the previous meter unit as priors and use the mean pose of the previous meter unit as the initialization for the current meter unit. Using the example shown in
The auditory input (i.e., the audio features extracted at S306) and the visual input (i.e., the initial poses obtained at S308) can be inputted into a generator of a GAN model. The structure of the generator G can be summarized as G(Af(t), Vf(t))=V′(t), where V′(t) denotes the generated motion sequence for a meter unit.
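A minimal sketch of forming the visual input Vf(t) described above is given below; the array shapes and the helper interface are assumptions for illustration, and the same construction applies at testing time with the generated motion of the previous meter unit.

import numpy as np

def initial_pose(prev_unit_motion, n_frames, temporal_index):
    # prev_unit_motion: (T_prev, J, 3) motion of the previous meter unit.
    # n_frames:         number of frames T in the current meter unit.
    # temporal_index:   (T,) binary TI(t); 1 marks frames carried over as priors.
    n_prior = int(temporal_index.sum())
    # Initialize every frame with the mean pose of the previous meter unit.
    v_f = np.tile(prev_unit_motion.mean(axis=0), (n_frames, 1, 1))
    if n_prior > 0:
        # Keep the last movements of the previous meter unit as priors where TI(t) = 1.
        v_f[temporal_index == 1] = prev_unit_motion[-n_prior:]
    return v_f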
At S310, the GAN model is trained using the sample auditory input and the sample visual input of each training meter unit by incorporating a hybrid loss function, to obtain a trained GAN model. The hybrid loss function includes a multi-space pose loss, a harmony loss, and a GAN loss. The harmony loss is determined according to beat consistencies of audio-visual beat pairs corresponding to a training meter unit. Each audio beat in the audio-visual beat pairs is from the sample auditory input of the training meter unit. Each visual beat in the audio-visual beat pairs is from an estimated visual motion sequence corresponding to the training meter unit generated during a process of training the GAN model.
Since GANs have shown outstanding power in visual generation tasks, GANs are also popular in cross-domain generation.
In the tasks of audio-driven motion synthesis, the network is fed with audio sequences or extracted auditory features as input to generate the visual motion sequences. Due to the difficulty of cross-domain synthesis, it is always a problem to encourage effective feature transformation in the architecture. To solve this problem, an encoder-decoder structure is considered to handle the sequence-to-sequence translation. Taking into consideration the chronological order in the input and output sequences, recurrent structures are introduced into the architecture of the encoder and decoder to obtain features that account for temporal correlations. Gated Recurrent Units (GRUs), as a typical structure of recurrent neural network (RNN), can outperform the common Long Short-Term Memory (LSTM) structure in sequence learning because of their fewer parameters and reduced computation. In some embodiments, as shown in
In addition, differently from analyzing only the audio features outputted from the encoder when decoding the poses, the initial pose features are concatenated with the audio features to enhance cross-domain learning for the decoder. Skip connections are also applied to intentionally add the audio-visual features into later layers.
Because the generator G is aimed at producing human motions as 3D poses, it is more difficult to accurately estimate the additional depth dimension compared to synthesizing 2D motions. In some embodiments, the 2D poses are estimated first, and then a depth-lifting branch is constructed to produce the 3D poses based on the 2D estimation. Taking advantage of the similarity between the 2D and 3D poses, the depth can be efficiently generated.
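A highly simplified PyTorch sketch of a generator of the kind described above is shown below: a GRU encodes the auditory input, the encoded audio features are concatenated with initial-pose features for GRU-based decoding, a 2D pose sequence is estimated first, and a depth-lifting branch produces the 3D poses from the decoder features and the 2D estimate. All layer sizes and the exact connectivity are illustrative assumptions rather than the disclosed architecture.

import torch
import torch.nn as nn

class MotionGenerator(nn.Module):
    def __init__(self, audio_dim, pose_dim_2d, pose_dim_3d, hidden=256):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.pose_encoder = nn.Linear(pose_dim_3d, hidden)        # initial poses V_f(t)
        self.decoder = nn.GRU(hidden * 2, hidden, batch_first=True)
        self.to_2d = nn.Linear(hidden, pose_dim_2d)               # intermediate 2D poses
        # Depth-lifting branch: reuses the decoder features and the 2D estimate.
        self.lift = nn.Sequential(
            nn.Linear(hidden + pose_dim_2d, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim_3d - pose_dim_2d),
        )

    def forward(self, audio_feat, init_pose):
        a, _ = self.audio_encoder(audio_feat)                     # (B, T, hidden)
        p = self.pose_encoder(init_pose)                          # (B, T, hidden)
        h, _ = self.decoder(torch.cat([a, p], dim=-1))            # cross-domain fusion
        pose_2d = self.to_2d(h)
        depth = self.lift(torch.cat([h, pose_2d], dim=-1))        # skip connection into lifting
        pose_3d = torch.cat([pose_2d, depth], dim=-1)
        return pose_2d, pose_3d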
In music-to-motion synthesis, not only the consistency of content style between the generated human movements and the target audio sequences needs to be supervised, but also the realism of the synthesized human motions. Thus, a cross-domain discriminator Dcd and a spatial-temporal discriminator Dst are built to guide the network to learn the global content consistency between the audio-visual pairs and the targeted pose flow in the spatial and temporal domains, respectively.
In some embodiments, as shown in
In some embodiments, to penalize unrealistically produced motions, such as distorted human poses and unnatural transitions between movements, the spatial-temporal discriminator Dst is constructed by applying a temporal progressing network. As shown in
Harmony plays an important role in the evaluation of generated cross-modal results. Since the senses of vision and hearing are highly related and affect each other in brain processing, harmony is of particular concern in the tasks of audio-to-visual or visual-to-audio generation. Taking the example of audio-driven human motion synthesis, the quality assessment emphasizes the audio-visual harmony in that the synthesized movements should be rhythmic and harmonious with the music. In other words, the rhythms in the audio and visual sequences are required to be temporally consistent in order to satisfy the perceptual harmony. Since the feeling of rhythm relies on subjective human perception, given the audio sequences A(t) and visual sequences V(t) as functions of time t, it is an important topic to approximate the perceptual judgement of harmony as a quantitative measurement:
h=H(A(t), V(t)) (5)
H denotes the algorithm that analyzes the harmony between the cross-domain signals and h is a scalar representing the quantified judgement of harmony.
Referring to the rules that detect audio beats, the visual beats Bv(t) are similarly extracted based on the analysis of the motion trend between visual sequences V(t) and V(t−t0). When a drastic change in the motion trend occurs, the time t is considered as the occurrence of a beat, which can be depicted as:
MT denotes the function or model that estimates the motion trend during t1 (a pre-defined constant value), and c2 is a positive value that controls the threshold for obtaining a beat at time t, where Bv(t)=1 stands in comparison with any other t′ in its punctured neighborhood U̇v whose radius is given by a pre-defined constant t2. To process general pixel-based visual signals, optical flow can be used to capture the motion trend in the moving events. With the quantification of optical flow, the visual beats can be obtained by deriving the local maximums that denote obvious changes in the movements. When focusing on only the human motion in such pixel-based signals, a skeleton-driven method can be used to specify the motion trend for the pure skeleton-based motions extracted from the visual signals. Thus, the estimation of the motion trend can be converted into analyzing the directions of body movements by a joint-based standard deviation (SD). The visual beats Bv(t)=1 are thus defined as the distinct directional changes in the motion sequences.
Based on the observed audio and visual beats, a common assumption is derived to tackle rhythmic consistency that the appearance of every audio beat is supposed to synchronize with that of the visual beat and vice versa. Following such assumption, the existing strategies evaluate the quantified audio-visual harmony h by performing the alignment based on the extracted beats, which extend the Eq. (5) as:
h=L(fa(A(t)), fv(V(t)))=L(Ba(t), Bv(t)) (8)
fa and fv denote functions for beat detection in the audio and visual signals, respectively. L represents the alignment algorithm. In some embodiments, the algorithm L can formulate the alignment problem as analyzing the distances between the synchronized beat pairs by warping, cross-entropy, or F-score, which are effective to align the cross-domain objects.
In some embodiments, taking human video as an example, given humans and their movements as the attention points, the harmony is mainly considered as the alignment between the foreground human motion in the visual frames and the associated background music. To better analyze the foreground human motions, human skeletons are extracted to represent the human motions in videos. Hence, the harmony is evaluated between the audio signals and the skeleton-based human motions.
To translate the subjective evaluation of harmony into objective expressions, based on the human reaction time, tolerance fields neighboring the audio beats are set to represent the perceptual judgement of audio-visual harmony in terms of the synchronization between beats.
In the visual case, optical flow is often used to extract beats for general frame-based visual signals. However, when fed with human video, this approach does not function as effectively as the skeleton-based approach. One reason is that optical flow treats the motion of each pixel almost equally, whereas much higher weight should be put on the foreground human. As shown in
In some embodiments, the mainstream onset-based audio beat detection is combined with the SD-driven visual beat detection in the beat alignment experiment conducted on the dance dataset. Since the human reaction time is around 0.25 seconds, the radius of the tolerance field for each audio beat is set to 6 frames, corresponding to 0.24 seconds under 25 fps, to evaluate the audio-visual alignment results. However, the outcome is not very satisfactory. As shown in
In some embodiments, training the GAN model further includes detecting visual beats of the estimated visual motion sequence by considering a difference between joint velocity sums in neighboring frames of the estimated visual motion sequence.
Given the skeleton-oriented human motion sequences vs(t, j) with j joints at frame t obtained from V(t), the joint velocity sum Jv(t) is derived by calculating the frame difference as:
i denotes the ith joint. In some embodiments, the diversity and frequency can be regularized based on analyzing the joint sum.
To define the motion beats that are well-aligned with audio beats, the evolution of indivisible movement units (e.g., hand lift) is mainly focused on for the analysis of visual beats in the whole motion sequences.
Similar to the audio case, the position-based visual beats are then formulated as {pv(b)|b=1,2, . . . , M} for the M valid beats satisfying B̃v(t)=1, and their strengths sv(b) are assigned according to the corresponding Jv(t).
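Under the above description, a minimal sketch of this joint-velocity-based visual beat detection could look as follows; the exact definition of the frame difference and the use of local-extrema detection from SciPy are implementation assumptions rather than the disclosed formulation.

import numpy as np
from scipy.signal import argrelextrema

def visual_beats(motion):
    # motion: (T, J, 3) skeleton-based motion sequence v_s(t, j).
    # Joint velocity sum J_v(t) from the frame difference over all joints.
    j_v = np.linalg.norm(np.diff(motion, axis=0), axis=2).sum(axis=1)
    # Visual beats taken at the local extrema (peaks and valleys) of J_v(t).
    peaks = argrelextrema(j_v, np.greater)[0]
    valleys = argrelextrema(j_v, np.less)[0]
    p_v = np.sort(np.concatenate([peaks, valleys]))  # position-based beats p_v(b)
    s_v = j_v[p_v]                                   # beat strengths s_v(b)
    return p_v, s_v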
In addition to the synchronization between beats, another factor that highly influences the perception of audio-visual harmony is an attention mechanism consistent with the embodiments of the present disclosure. Since human attention is drawn to things that are more "attractive", other things may, on the contrary, be overlooked unconsciously in perception. Thus, to assess the audio-visual harmony in a way close to real human perception, the attention mechanism needs to be introduced into the evaluation framework.
The attention mechanism reveals that unsalient objects are neglected in human perception without any awareness, which influences both the vision and hearing systems. When it comes to the subjective perception of rhythmic harmony, the phenomena of inattentional blindness and deafness may also affect the judgement based on the saliency distribution in the audio and visual rhythms. In order to approximate the perceptual measurement of harmony, an attention-based evaluation framework consistent with embodiments of the present disclosure is provided to highlight the importance of salient beats, which extends Eq. (8) as:
h=L(Wa(pa(b)), Wv(pv(b))) (12)
Wa and Wv denote the attentional weighting masks derived from the audio and visual beat saliency, respectively.
In some embodiments, the beat consistencies of the audio-visual beat pairs corresponding to a training meter unit are determined by assigning a weight to each audio beat and a weight to each visual beat based on beat saliency. Because salient beats favor the perception of harmony, a weight is assigned to each beat based on its beat saliency to enhance the corresponding attentional impact in the evaluation. The beat saliency is represented by the beat strengths sa(b) and sv(b) and by adaptive weighting masks that are constructed by considering the global SD of the strengths.
In the analysis of auditory saliency, the attentional mask is built as:
Wa=sign(sa(b)−SD(sa(b))×λ1) (13)
λ1 denotes a constant scale factor to adjust the audio saliency threshold.
Differently from processing the mask for audio beats, in the visual case, the motion beats are extracted from not only the peaks but also the valleys of the joint velocity sum, which means that the direct comparison with SD is not applicable for analyzing the visual saliency. Therefore, the peak-to-valley difference is utilized to define the visual saliency strength for detecting the appearances of high-impact visual beats, which is shown as:
R(b)=|sv(b)−sv(b−1)|, b=2, . . . , M (14)
R(b) denotes the peak-to-valley difference for each beat.
The visual saliency mask Wv is then defined by utilizing the global SD as:
Wv=sign(sv(b)−SD(R(b))×λ2) (15)
λ2 denotes a constant scale factor to adjust the visual saliency threshold.
In some embodiments, the beat consistencies of the audio-visual beat pairs corresponding to a training meter unit are further determined by obtaining, among the audio-visual beat pairs of the training meter unit, attentional beats according to the weights of the audio beats and the weights of the visual beats, the attentional beats including one or more attentional audio beats and one or more attentional visual beats, and by obtaining the beat strength for each of the attentional beats. By applying the weighting masks Wa and Wv on the beats pa(b) and pv(b), respectively, the attentional beats p′a(b) and p′v(b) are obtained by extracting the positive results from Wa(pa(b)) and Wv(pv(b)). The corresponding beat strengths for the attentional beats are similarly defined as s′a(b) and s′v(b).
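For illustration, the attentional weighting of Eqs. (13)-(15) and the extraction of the attentional beats p′a(b) and p′v(b) may be sketched as follows, where SD is taken as the global standard deviation and λ1, λ2 use the values reported later in the implementation example; the function interface is an assumption.

import numpy as np

def attentional_beats(p_a, s_a, p_v, s_v, lambda1=0.1, lambda2=1.0):
    # Audio mask of Eq. (13): compare each strength with the global SD threshold.
    w_a = np.sign(s_a - np.std(s_a) * lambda1)
    # Peak-to-valley differences of Eq. (14) and visual mask of Eq. (15).
    r = np.abs(np.diff(s_v))
    w_v = np.sign(s_v - np.std(r) * lambda2)
    # Keep only the positively weighted (attentional) beats and their strengths.
    keep_a, keep_v = w_a > 0, w_v > 0
    return p_a[keep_a], s_a[keep_a], p_v[keep_v], s_v[keep_v]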
The harmonious feeling in audio-visual human perception can be described as a fuzzy measurement, which derives from the way that the human brain recognizes sensory signals. In some embodiments, an existing warping method is used to handle the beat alignment, which directly adjusts the strength curve of the visual beats to fit that of the audio beats by applying compensations. In some embodiments, the contrastive difference is constructed by calculating the cross-entropy distance between the auditory amplitude and motion labels. Because the human brain has limitations in recognizing signals at precise amplitudes, such strength-based fine mappings between audio-visual beats are not consistent with real perception.
In some embodiments, the beat consistencies of the audio-visual beat pairs corresponding to a training meter unit are further determined by constructing hitting scores, by counting labels in the audio and visual domains to represent aligned attentional beats in the sample auditory input and the estimated visual motion sequence, one label representing that one attentional audio beat is aligned with a corresponding attentional visual beat according to a human reaction time delay. That is, inspired by the binary labels given to represent whether the audio beats and the visual beats are synchronized in the time domain (e.g.,
Beginning with the selected high-saliency N audio beats p′a(b) and M visual beats p′v(b), Eq. (12) is reformulated by using the F-score measurement as:
h=L(p′a(b), p′v(b))=Fs(E(p′a(b)), E(p′v(b))) (16)
E denotes the algorithm obtaining the hitting score in both audio and visual domain. Fs represents the F-score measurement.
With the observation that there is a delay between visual perception and brain-processed recognition, the assumption can be made that a beat can be considered to be hit as long as the time interval between the beat and the nearest cross-domain beat is less than the human-reaction delay. In this way, a fuzzy interval-based judgement can be made for measuring the alignment, instead of depending on precise strength-based mappings. As the synchronized beats appear in audio-visual pairs, the audio beats can be seen as anchors in the analysis of hitting. To obtain the interval, the position matrix Z(ba, bv) is built by repeating the M visual beats p′v(bv) for N times as:
∀ba, Z(ba, bv)=Z(bv)=p′v(bv) (17)
ba=1,2, . . . , N and bv=1,2, . . . , M.
The column-wise audio-visual interval D(ba, bv) is computed from Z(ba, bv) by taking the absolute difference with p′a(ba):
D(ba, bv)=|Z(ba, bv)−p′a(ba)| (18)
Then the judgement Hp(ba) of whether an audio beat is hit can be obtained by comparing its row-wise minimum audio-visual interval T(ba) with the pre-defined reaction delay as:
Tdelay is a constant frame time, and Hp(ba)=1 denotes that there exists a synchronized audio-visual beat pair.
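A minimal sketch of this interval-based hitting judgement, following Eqs. (17)-(19) with Tdelay expressed in frames, is given below; the broadcasting-based layout is an implementation assumption.

import numpy as np

def hit_labels(p_a_att, p_v_att, t_delay=3.75):
    # Eq. (17): position matrix Z repeating the M attentional visual beats N times.
    z = np.tile(p_v_att, (len(p_a_att), 1))
    # Eq. (18): column-wise audio-visual intervals against each audio beat.
    d = np.abs(z - p_a_att[:, None])
    # Eq. (19): row-wise minimum interval compared with the reaction delay T_delay.
    t_min = d.min(axis=1)
    return (t_min <= t_delay).astype(int)  # H_p(b_a) = 1 for a synchronized pair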
Finally, the hitting score hs can be derived by performing the weighted sum of all the hitting points as:
Considering the normalization over the total numbers of audio beats and motion beats, the hitting score for audio harmony can be formed as ha, and the hitting score for visual harmony can be formed as hv.
However, the correlation between ha and hv differs from source to source. For instance, given specific input audio-visual sequences, the obtained ha may be higher than hv, while the contrary observation can be made for other input sequences.
In some embodiments, the beat consistencies of the audio-visual beat pairs corresponding to a training meter unit are further determined by using the hitting scores. That is, to balance between the audio-visual scores, in some embodiments, the final audio-visual harmony h is obtained by taking the harmonic mean, which reformulates Eq. (16) as:
β is a pre-defined constant. Therefore, Eq. (22) can be transformed into the function of hs as:
As implied by Equations (22) and (23), the quantification of audio-visual harmony in the evaluation can be suggested as:
The harmony evaluation in the present disclosure is applied according to the following Lemma (1): Given an audio clip with N obtained attentional audio beats and a visual clip with M visual beats, the quantified audio-visual harmony can be uniquely determined by hs.
The hybrid loss function includes a multi-space pose loss, a harmony loss, and a GAN loss. The multi-space pose loss is employed to regularize the realism of the estimated human movements. In some embodiments, the multi-space pose loss includes one or more of Kullback-Leibler (KL) loss, Charbonnier-based mean squared error (MSE) loss, and Charbonnier-based VGG loss.
For the distribution space, the Kullback-Leibler (KL) loss function ℒkl is applied based on the intermediate results of the 2D poses in the generation process, shown as:
ℒkl=KL(P(V2d(t))∥P(V2d′(t))) (24)
ℒkl denotes the Kullback-Leibler (KL) loss. P denotes the operation that transforms the ground-truth 2D motion sequences V2d(t) and the intermediate output V2d′(t) to the probability distribution.
In the pixel space, a Charbonnier-based MSE loss ℒmse is established to constrain the generation of the 3D poses as:
ϵ is a positive constant close to zero to alleviate gradient vanishing in training. A weight mask is applied based on the temporal index TI(t) to guide the network to focus more on the generation of motions for the current meter unit.
Since VGG networks are widely used to generate visual features consistent with human perception, the Charbonnier-based VGG loss ℒfeat is also applied to regularize the produced human motion in the deep feature space by:
In some embodiments, the feature-space pose loss, such as the Kullback-Leibler (KL) loss, the Charbonnier-based MSE loss, or the Charbonnier-based VGG loss, is assumed to be capable of capturing the deep features of the motion flow and regularizing the flow in the synthesized motions to be consistent with the ground truth.
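By way of a sketch only, the pose loss components described above may be implemented along the following lines in PyTorch; the softmax normalization to probability distributions, the Charbonnier form sqrt(x² + ϵ²), and the use of fixed VGG features on rendered poses are assumptions standing in for details of Eqs. (24)-(26) that are not reproduced here.

import torch
import torch.nn.functional as F

def charbonnier(diff, eps=1e-6):
    # Charbonnier penalty sqrt(x^2 + eps^2), averaged over all elements.
    return torch.sqrt(diff * diff + eps * eps).mean()

def kl_pose_loss(v2d_gt, v2d_pred):
    # KL divergence between softmax-normalized 2D pose distributions, KL(P(gt) || P(pred)).
    p = F.log_softmax(v2d_pred.flatten(1), dim=-1)
    q = F.softmax(v2d_gt.flatten(1), dim=-1)
    return F.kl_div(p, q, reduction="batchmean")

def mse_pose_loss(v3d_gt, v3d_pred, weight_mask):
    # Charbonnier-penalized MSE on 3D poses, weighted by a TI(t)-based mask.
    return charbonnier((v3d_pred - v3d_gt) * weight_mask)

def vgg_feat_loss(feat_gt, feat_pred):
    # Charbonnier penalty on deep features (e.g., from a fixed VGG network).
    return charbonnier(feat_pred - feat_gt)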
According to the Lemma (1), the harmony between the audio and human motion sequences can be determined by evaluating the audio-visual beat consistency, which is uniquely dependent on the hitting score hs. Thus, the harmony loss is created by formulating the function:
ℒharmo=E(p′a(b), s′a(b), VB(G(Af(t), Vf(t))))+√(|M−N|) (27)
VB denotes the extraction of attentional visual beats with the corresponding beat strengths based on the estimated human motion sequences from the generator. Such results are then sent to the algorithm E to calculate the hitting score with the pre-computed p′a(b) and s′a(b). Apart from minimizing the negative hitting score, over-frequent visual beats are penalized by adding an L1 distance comparing the number M of visual beats with the number N of audio beats.
GANs can learn to generate outputs based on the distribution of the given data during the adversarial training by solving the min-max problem:
ϕ and θ denote the parameters for the discriminator and generator, respectively. x represents the ground truth data while Y is the input to the generator.
In some embodiments, training the GAN model further includes minimizing the harmony loss, the multi-space pose loss, and the GAN loss from the generator, and maximizing the values of the loss functions of the cross-domain discriminator and the spatial-temporal discriminator to distinguish between a real training sample and a fake training sample. Thus, the cross-domain discriminator and the spatial-temporal pose discriminator try to distinguish between real and fake by maximizing the losses:
ℒdcd=𝔼[log(1−Dcd(A(t), G(Af(t), Vf(t))))]+𝔼[log Dcd(A(t), V(t))] (29)
ℒdst=𝔼[log Dst(V(t))]+𝔼[log(1−Dst(G(Af(t), Vf(t))))] (30)
On the contrary, the generator attempts to fool the discriminators by minimizing the function:
ℒgan=𝔼[−log Dcd(A(t), G(Af(t), Vf(t)))]+𝔼[−log Dst(G(Af(t), Vf(t)))] (31)
In summary, combining all the loss functions above, the final loss function for the generator can be formulated as:
ℒtotal=λklℒkl+λmseℒmse+λfeatℒfeat+λharmoℒharmo+λganℒgan (32)
The λs denote the corresponding weight for each loss component.
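As an illustration of Eq. (32), the weighted combination of the generator loss components may be sketched as follows, using the λ values reported in the implementation example below; the dictionary-based interface is an assumption for illustration.

def total_generator_loss(losses, weights=None):
    # losses: dict with tensors for the components 'kl', 'mse', 'feat', 'harmo', 'gan'.
    if weights is None:
        # Weights as reported in the implementation example below.
        weights = {"kl": 1e-4, "mse": 1e-3, "feat": 1e-3, "harmo": 1.0, "gan": 1e-3}
    return sum(weights[k] * losses[k] for k in weights)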
In some embodiments, referring back to
In some embodiments, obtaining the initial pose of each testing meter unit includes keeping the generated motion sequence from a previous testing meter unit right before a current testing meter unit in the initial pose of a current meter unit, and using a mean pose of the generated harmony-aware motion sequence from the previous testing meter unit as initialization for the current testing meter unit.
In some embodiments, in an implementation example, the dance dataset released by Tang et al. in "Dance with melody: An LSTM-autoencoder approach to music-oriented dance synthesis," Proceedings of the 26th ACM International Conference on Multimedia, 2018 (hereinafter, [Tang et al., 2018]), is utilized to train the HarmoGAN model, which includes 61 sequences of dancing videos performed by a professional dancer, totaling 94 minutes and 907,200 frames at 25 fps. It provides the 3D human body keypoints with 21 joints collected from wearable devices and the corresponding audio tracks. The dance dataset contains four typical types of dance: cha-cha, rumba, tango, and waltz. To save memory cost, all videos are resampled at 15 fps to create a sample dataset. 2014 clips of concatenated audio-visual input features are obtained with the corresponding target poses from the whole dance dataset, where 214 of them are selected randomly as the self-created testing data and the rest are used for model training. All the functions that handle the extraction of musical features can be found in the LibROSA package of McFee et al. in "librosa: Audio and music signal analysis in Python," Proceedings of the 14th Python in Science Conference, 2015 (hereinafter, [McFee et al., 2015]).
To evaluate the harmony between the audio and synthesized audio-driven motion sequences, the HarmoGAN model is tested based on the ballroom music dataset by Gouyon et al. in “An experimental comparison of audio tempo induction algorithms,” IEEE Transactions on Audio, Speech, and Language Processing, 2006 (hereinafter, [Gouyon et al., 2006]). It extracts 698 background music clips each of 30 seconds from the online dance videos. It contains music for 7 types of dance: cha-cha, jive, quickstep, rumba, samba, tango, and waltz. In each dance category, 6 audio sequences are randomly picked to form the testing dataset. The beat-based harmony mechanism is employed to quantify the audio-visual harmony with the use of the Librosa package of [McFee et al., 2015] to obtain information of auditory beats.
HarmoGAN is implemented in PyTorch. The generator is first pretrained to prepare a reasonable initialization for the following GAN training. The pretraining ends at 225 epochs with the use of a pretraining loss ℒpretrain that combines ℒmse with an additional pose loss term weighted by 0.14. The Adam optimizer proposed by Kingma and Ba in "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980 (2014) (hereinafter, [Kingma and Ba, 2014]), is utilized with a batch size of 10. The initial learning rate is set to 0.001 and is decreased every 50 epochs by multiplying with the factors in the order of [0.5, 0.2, 0.2, 0.5]. Initialized with the pretrained model, GAN training is started with both the generator and discriminator networks. The weights of the loss components in the hybrid loss function for the generator are set as follows: λkl=0.0001, λmse=λfeat=λgan=0.001, λharmo=1. The weight decay is set to 0.001 for the discriminators and 0.0001 for the generator. The learning rates for all the networks are initialized at 0.0001 and divided by 2 and 5 alternately every 5 epochs. The optimizer and batch size are kept the same as in pretraining. After 45 epochs of adversarial training, convergence is achieved to obtain the final HarmoGAN. It takes only 53 minutes to finish the whole training process on an NVIDIA TITAN V GPU, which is fairly efficient.
For the harmony evaluation mechanism, the constant factors λ1 and λ2 in Eqs. (13) and (15) are set to 0.1 and 1, respectively, to obtain the attentional saliency. Meanwhile, the reaction delay is defined as 0.25 seconds, shown as Tdelay=3.75 frames in Eq. (20) under 15 fps. When evaluating the quantified audio-visual harmony, the β of F-score in Eq. (22) is set as 2 to focus more on the hit rate of audio beats.
To confirm the assumption that the occurrence of inharmony in audio-visual objects can be observed by human perception, a user study is conducted to test whether the participants are sensitive to inharmonious audio-visual clips. 20 dance videos are collected, which consist of 10 harmonious contents from the ground truth in the dance dataset released by [Tang et al., 2018], and 10 inharmonious clips created by permuting the audio or visual sequences. The 10 invited participants are required to watch all 20 videos and provide the perceptual harmony evaluation by picking all the sequences that they consider inharmonious.
Before analyzing the harmonization performance of the model, the HarmoGAN model is first supposed to show a reasonable ability to synthesize natural motion flows based on human skeletons. To evaluate the motion generation, the HarmoGAN model is tested on the self-created testing dataset obtained from the dance dataset released by [Tang et al., 2018], which can provide ground-truth dance movements performed by a real human dancer. The Fréchet Inception Distance (FID) metric proposed by Heusel et al. in "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," arXiv preprint arXiv:1706.08500 (2017) (hereinafter, [Heusel et al., 2017]), is utilized to measure the perceptual distance between the estimated motion sequences and the human ground truth. As there exists no standard for extracting features from pose sequences, the VGG network proposed by Simonyan and Zisserman in "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2014) (hereinafter, [Simonyan and Zisserman, 2014]), is employed to obtain pose features for measuring FID. The average results are shown in
Meanwhile, in
To analyze the enhancement of audio-visual harmony after introducing the harmony loss into the network training, an ablation study is conducted to evaluate the performance of the HarmoGAN model against its variant without the use of ℒharmo on the self-created testing dataset, which contains relevant initial poses for the generation of motion sequences. Given the pre-computed audio beats from the music sequences, the harmony can be assessed by analyzing the audio-visual beat consistency based on the estimated human movements.
Apart from the quantified harmony derived from the harmony evaluation mechanism shown in
To further assess the ability of harmonization in the HarmoGAN model, in addition to the variant, the HarmoGAN model is compared against the other two powerful GAN-based state-of-the-art models for audio-driven motion synthesis proposed by [Lee et al., 2019] and [Ren et al., 2020]. For a fair comparison, all models are tested on the Ballroom dataset by [Gouyon et al., 2006], which is a public music dataset providing only background music for various dance types. 42 clips of 6-second audio tracks are randomly collected from the Ballroom dataset as the testing dataset. Without any given ground-truth human movement, motion sequences in the training dataset are selected as the initial poses to generate the dance sequences.
In
The detailed evaluation results for each dance type are shown in Tables 1 and 2. Compared with the baseline HarmoGAN model without the use of ℒharmo, the assistance of the spatial-temporal GCN proposed by [Ren et al., 2020] may intrinsically benefit the harmonization by regularizing the hierarchical representations of skeletons in the generation of motion sequences. However, such improvement lacks robustness and is highly affected by the bias in the training dataset. The post-processing beat warper proposed by [Lee et al., 2019] can lift the performance relatively evenly but is still limited. In comparison with the other models, the HarmoGAN model can directly produce a distinct and robust improvement in the audio-visual harmony that is independent of the dance types.
In some embodiments, the HarmoGAN model can be performed to generate the visual sequences based on video frames. In some embodiments, a multi-stage or end-to-end system can be built to perform the audio-visual harmonization based on video frames.
In addition, the cost of the tested models is analyzed based on the number of model parameters and training pairs. The number of parameters of the generator in the HarmoGAN model is close to that of [Ren et al., 2020] and half of that of [Lee et al., 2019], while [Lee et al., 2019] requires a 10-times larger training dataset for obtaining the final model. Thus, considering the results of the harmony evaluation for each model, it reveals that the HarmoGAN model can improve the performance efficiently without significantly increasing the cost in either the training or the testing phase.
To evaluate the audio-visual harmony qualitatively, dance videos are synthesized by combining the audio sequences and the generated motions from the tested models. Then a user study is conducted to compare the perceptual harmony of the synthesized videos. 12 unprofessional participants are invited to watch the video pairs from the 3 different models. The participants are asked to vote blindly on which video is better in terms of audio-visual harmony.
As shown in
As an example of qualitative evaluation, as shown in
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the claims.