The present invention relates generally to performing cross-lingual automatic dubbing, and more particularly to performing cross-lingual automatic dubbing of a multimedia digital signal that may include video, audio and other effects.
The amount of video content produced daily is huge and constantly growing. There is plenty of media content in the form of movies and other kinds of shows by classic cinema and TV producers, as well as relatively new actors in the video streaming scene. Moreover, online video platforms have millions of daily viewers and content creators. Social networks have also moved towards a type of interaction based on videos rather than other communication forms.
Video consumption is in many cases limited by language understanding. Large cinema and TV producers have sizable budgets and can easily afford dubbing to broaden the reach of selected movies and shows. However, dubbing is generally expensive and there is no standardized pricing in the industry, so it is not affordable for smaller players. The cost-efficient option for them is to use subtitles to internationalize their shows. Subtitles are cheaper than dubbing and, nowadays, can also be automated in many respects, reducing their price further.
However, watching a show with dubbing and watching it with subtitles are two different experiences. With subtitles, the audience is forced to read constantly if they do not understand the audio language. As such, subtitles can be a hindrance to fully enjoying a show, since attention is continually split between watching the scene and reading the bottom of the screen. On the other hand, many people do not appreciate dubbing because of the adaptations it forces on the language in order to fit the constraints imposed by the scenes.
From an accessibility point of view, subtitles are of no help to visually impaired or blind people, and they can be problematic for people suffering from dyslexia or for people who simply read slowly in a non-native language. Moreover, subtitles may not be available in all languages, and viewers may be forced to pick a language they understand that is not their native one. Dubbing, in turn, is of no use to deaf or hard-of-hearing people, who need subtitles to follow the dialogue.
In addition to the previous points, the preference for subtitling or dubbing is also cultural, with countries opting for (and being used to) one of the two options. In some countries, the broad population is simply used to watching everything in their native language, be it original or dubbed content, and watching a show with subtitles would not be considered an option. All things considered, dubbing has the potential to considerably increase the reach of video content. If it can satisfy quality and cost requirements, automatic dubbing can enable industry players of all sizes to increase the internationalization of their videos through dubbing.
The invention is based on State-of-the-Art (SOTA) technology for several tasks in signal processing and natural language processing. Given a video signal in input, an automatic dubbing system requires high accuracy in, at least, the following tasks: automatic speech recognition (ASR) for transcribing the speech from the signal; source separation to separate speech and background signal and reintroduce the background in the final video; machine translation (MT) to obtain text in the target language; and text-to-speech synthesis (TTS) to generate the utterances in the target language.
Low accuracy in any of the abovementioned components would compromise the overall quality of the final video, but high accuracy alone is not enough: some task-specific challenges need to be addressed to make the viewing experience pleasant for the audience. Additionally, given the high number of components, tradeoffs between accuracy and total processing time become relevant. While processing times are still lower than what humans could achieve, in many cases also considered in this disclosure the fully automated process is just a first step before human intervention, and longer processing increases costs and delays the start of the human work.
Speaker diarization is a component that segments an audio signal according to the speaker. Given that the number of speakers is generally not known in advance, finding the most probable number of speakers is also part of the task, which makes it particularly hard. The result of transcribing the audio signal depends on the quality of the ASR, diarization, audio segmentation and text punctuation components, but also on the way they interact with each other. One straightforward approach would be to first perform diarization, then speech recognition on single-speaker segments, and finally punctuation on each of those segments. However, ASR and diarization are both computationally expensive components, each with a real-time factor (RTF) of ~0.5 when computed on modern CPUs. Concatenating these two components would lead to an RTF of ~1. The availability of multiple compute nodes allows the two operations to be computed in parallel and the results to be merged before applying punctuation. The language model scores will be wrongly influenced by the words of another speaker at the boundaries between two speakers, but the RTF is halved in practice.
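As a rough illustration of the parallel scheme just described, the sketch below runs ASR and diarization concurrently on two workers and merges their outputs before punctuation. The functions run_asr and run_diarization are hypothetical stubs standing in for the actual, computationally expensive components; only the orchestration pattern is the point here.

```python
# Illustrative sketch: run ASR and diarization in parallel and merge results.
# run_asr() and run_diarization() are hypothetical stubs; the real components
# are the computationally expensive systems described in the text.
from concurrent.futures import ProcessPoolExecutor

def run_asr(audio_path):
    # Stub: would return [(start, end, word), ...] from the ASR system.
    return [(0.0, 0.4, "hello"), (0.5, 0.9, "world")]

def run_diarization(audio_path):
    # Stub: would return [(start, end, speaker_label), ...] from diarization.
    return [(0.0, 1.0, "SPK_0")]

def merge(words, turns):
    # Assign each recognized word to the speaker turn containing its midpoint.
    merged = []
    for start, end, word in words:
        mid = (start + end) / 2
        speaker = next((s for ts, te, s in turns if ts <= mid < te), "UNKNOWN")
        merged.append((start, end, word, speaker))
    return merged

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(run_asr, "input.wav")           # RTF ~0.5
        diar_future = pool.submit(run_diarization, "input.wav")  # RTF ~0.5, runs concurrently
    print(merge(asr_future.result(), diar_future.result()))
```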
A similar problem arises with source separation. This is another computationally expensive component, with an RTF of ~0.5 on modern multicore CPUs. If the transcription is done on the enhanced speech signal obtained from source separation, the accuracy should be higher than when recognizing from a noisy signal. Yet, the spectral representation of enhanced speech signals differs from that of recorded signals, and ASR systems need to be trained on this kind of signal to benefit from it. In some applications, the increased effort and RTF are not worth the gain in transcription quality, and the recognition can be performed on the original audio while source separation runs in parallel on another compute node.
Machine translation is one of the most studied components in the automatic dubbing literature. The reason is that translations need to be accurate, but also utterable by the TTS system under time constraints. The constraints are given by the lip movements when the speaker is in the foreground, and by the timings of the surrounding sentences in all cases. To achieve such goals, MT models are enhanced to output translations of a given length, or to predict pauses in a sentence in order to reproduce the prosody of the original speech. The present disclosure includes these features and also adds metadata-aware translations to speed up translation post-processing by easily changing aspects like gender, style, or politeness. Some metadata, like speaker gender, can be detected automatically to obtain higher translation quality in the first dubbed video, while other metadata become relevant mostly for post-editing.
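One common way to make a translation model metadata-aware is to prepend control tokens to the source sentence at training and inference time. The sketch below shows only this tagging step, with token names (<gender=...>, <style=...>, <politeness=...>) chosen for illustration rather than taken from the disclosure.

```python
# Illustrative tagging of MT input with metadata control tokens.
# The token vocabulary is an assumption for this sketch, not a fixed
# specification of the system described in the disclosure.

def tag_source(sentence, gender=None, style=None, politeness=None):
    tags = []
    if gender is not None:
        tags.append(f"<gender={gender}>")
    if style is not None:
        tags.append(f"<style={style}>")
    if politeness is not None:
        tags.append(f"<politeness={politeness}>")
    # The tagged sentence is what would be fed to the MT model, so the
    # metadata can be changed quickly during post-editing.
    return " ".join(tags + [sentence])

print(tag_source("How are you today?", gender="female", politeness="formal"))
```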
The TTS component needs to pronounce the words correctly, and the voice should sound similar to the original. In some cases, cloning of the original voice is required by the user, while in other cases a voice that fits the gender can be good enough. TTS with the speaker's voice can be achieved in at least two ways. The first approach considered in the present disclosure is to use speaker-adaptive TTS, which takes as an additional input a feature vector representing the characteristics of the speaker's voice. The system learns to pronounce the text with a voice that reproduces the encoded one. The second option is to use a more classical approach to TTS with fixed voices and cascade it with a voice conversion model, which takes as input a speech signal and a second speech signal carrying a reference voice, and outputs the same speech content in the reference voice. This second approach requires an additional model but decouples the problem of producing high-quality, emotional voices from the problem of producing the desired voice.
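The two options can be sketched as follows, assuming placeholder functions for the speaker-adaptive TTS, the fixed-voice TTS and the voice conversion model; none of these stubs reflects an actual model interface from the disclosure.

```python
# Two hypothetical ways to obtain the target voice, mirroring the text above.
# synthesize_adaptive(), synthesize_fixed_voice() and convert_voice() are
# placeholder stubs standing in for the actual TTS and voice conversion models.
import numpy as np

def synthesize_adaptive(text, speaker_vector):
    # Stub: speaker-adaptive TTS conditioned on a speaker embedding.
    return np.zeros(16000)  # 1 s of dummy waveform

def synthesize_fixed_voice(text):
    # Stub: classical TTS with a fixed voice.
    return np.zeros(16000)

def convert_voice(speech, reference_speech):
    # Stub: voice conversion from the fixed voice to the reference voice.
    return speech

def dub_utterance(text, speaker_vector, reference_speech, use_adaptive=True):
    if use_adaptive:
        # Option 1: a single speaker-adaptive model.
        return synthesize_adaptive(text, speaker_vector)
    # Option 2: fixed-voice TTS cascaded with voice conversion.
    return convert_voice(synthesize_fixed_voice(text), reference_speech)
```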
The bridge between MT and TTS is a component we call speech placement, which decides the actual timing to assign to each utterance. It uses the translated texts, the original timing and the time required to utter the sentences with the desired voice. It assigns the actual time stamps to each utterance and decides whether it should be sped up or slowed down.
Image processing models are also included in our automatic dubbing pipeline for several tasks. A speaker detection system detects lip movements in the video images to provide an a priori probability of the number of speakers. This information is used to narrow the range of speaker numbers to be considered by the diarization system. Scene cut detection is another technology that does not require machine learning and can accurately predict scene changes. It can help to further narrow down the range and obtain more accurate results. Additionally, speaker detection provides a priori probabilities for speaker changes, which are particularly difficult to detect for current speaker diarization systems when a speaker turn is very short (less than 2 seconds). This may sound like a corner case, but handling it improves the overall viewing experience, as a missed speaker change means for the viewer that two characters in the video speak with exactly the same voice. This may result in a loss of the viewer's trust in the automatic system.
According to an embodiment of the invention, a method of speaker-adaptive dubbing of an audio signal in a video into different languages includes:
The above described features and advantages will be more fully appreciated with reference to the below figures, in which:
The present disclosure relates to methods, systems, apparatuses, and non-transitory computer readable storage media for automatic dubbing and voiceover.
According to some embodiments, the system may be configured as a cloud service that works in two subsequent phases. For example,
According to some embodiments, a multimedia signal such as a video is transmitted through a network to the automatic dubbing system 200 residing in the cloud, as depicted in
The images separated from the video signal may be transmitted to the video processing service 230 that runs the person detection and speaker detection components. The audio signal extracted by the signal processing component is transmitted to the source separation service 220, which includes only the source separation components that separate sources such as speech and background. The output from speaker detection can also be used here as described above. The ASR service 240 includes the audio segmentation, ASR, punctuation and diarization systems. Its output is sent to an MT service 250 and a speaker processing service 260. The MT service runs the MT system, while the speaker segmentation is sent to the speaker processing service, which can compute translation durations, compute a speaker embedding for each speaker identified by diarization, and run speech placement. The speaker processing service 260 outputs information to a TTS service 270, which performs speech synthesis and emotion classification. The generated audio signals are then sent back to the signal processing service 210 to concatenate them, merge them with the background and render the final video.
A Speaker Embedding block 340 may receive the enhanced speech output from the speech separation block 320 and the output from the speaker diarization block 330. These audio signals, segmented by speaker, are input to the Speaker Embedding block to extract speaker vectors. Then, in a Group Embedding block 345, which receives the output from the Speaker Embedding block, a representative speaker vector is selected for each speaker. Such speaker vectors are input, together with the translations output by the Machine Translation block 350, to a Speech Duration block 355 to predict the duration of the resulting synthesized speech signal. Given the durations and the original times from the Speech Duration block, a Speech Placement block determines start and end times for each of the synthetic speech signals, plus a possible tempo rate to speed up or slow down the signal.
Translated sentences with the respective speaker vectors and placement results are then input to a Speech Synthesis block 365 to obtain the individual synthesized audio signals for each sentence, representing translated speech within the audio signal. These signals are then concatenated and blended with the background audio signal from block 320 to obtain a single audio signal in Blend Audio block 370. The result is then merged with the source video in Render Video to produce, in 375, a dubbed output video in a target language different from the source language.
Referring to
For example, the audio signal may then be processed by a source separation component, which creates two new audio files, one file with the speech signal and one file with the background signal, which includes noise and music among other types of background audio signals. In some embodiments, speech separation is deployed as a microservice in the cloud. In some embodiments, speech separation consists of a single or distributed deep learning model that receives an audio signal in the time domain as input and produces as output the two signals for speech and background. In some embodiments, the model transforms the signal into the frequency domain and then transforms the output back into the time domain.
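A minimal sketch of the frequency-domain variant is given below: the signal is transformed with an STFT, a time-frequency mask is applied to estimate the speech, and the complement of the mask yields the background before both are transformed back to the time domain. The predict_speech_mask function is a placeholder for the deep learning model; here it returns a constant mask only so the example runs.

```python
# Minimal sketch of frequency-domain separation with a predicted mask.
# predict_speech_mask() is a placeholder for the deep learning model; a real
# system would predict a time-frequency mask (or the signals directly).
import numpy as np
from scipy.signal import stft, istft

def predict_speech_mask(magnitude):
    # Stub: a real model would predict a mask in [0, 1] per time-frequency bin.
    return np.full_like(magnitude, 0.5)

def separate(audio, sample_rate=16000):
    f, t, spec = stft(audio, fs=sample_rate)                   # time -> frequency domain
    mask = predict_speech_mask(np.abs(spec))
    _, speech = istft(spec * mask, fs=sample_rate)             # speech estimate
    _, background = istft(spec * (1 - mask), fs=sample_rate)   # background estimate
    return speech, background

speech, background = separate(np.random.randn(16000))
```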
The embodiments using source separation may use a deep-learning-based system for speech enhancement to extract the clean speech, and a different deep-learning-based system for source separation to extract the clean background. This requires more computation than signal subtraction but is also more accurate. The accuracy advantage is twofold. First, a deep learning model may generate higher-quality signals than a simplistic signal processing algorithm; second, it enables the use of speech enhancement that removes, or strongly attenuates, reverberation. In fact, when the background is obtained by signal subtraction, the original speech is reverberated but the clean speech is not, so reverberation leaks into the background. This problem does not arise when the two signals are extracted separately according to an embodiment of the present invention.
The audio signal is then processed by an ASR system to obtain the text with the corresponding word timings. According to some embodiments, the input audio signal is the one that was originally extracted from the video. In such embodiments, the ASR system may be trained using audio data that includes both speech and the background signals. In some embodiments, the ASR system may be trained with speech signals obtained from speech enhancement or source separation systems. In such embodiments, the ASR component receives as input speech signals output by the source separation component.
In some embodiments, the automatic segmentation system 840 may split the audio signal into segments according to speech pauses, and the transcription is then performed on the speech segments. The ASR system 850 may generate the transcript of the audio signal in the form of a list of words without punctuation, with time markers corresponding to the beginning and end of each word. The ASR system can be a hybrid HMM-DNN system or an end-to-end system in the form of attention encoder-decoder models, transducers or CTC models. In some embodiments that do not use automatic segmentation, the audio signal may be processed by the ASR component with streaming input and output. The audio signal is taken as input by the ASR component in potentially overlapping chunks of the full signal, and the output is produced using the previous chunks and a "look-ahead" of future frames in a frame-synchronous manner. A speaker diarization system is used to assign audio windows, and their respective word sequences, to speaker labels.
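The chunked streaming mode can be illustrated as below, where the audio is consumed in overlapping chunks and a few future samples serve as look-ahead; recognize_chunk is a hypothetical stub for the frame-synchronous decoder, and the chunk, overlap and look-ahead sizes are arbitrary.

```python
# Illustrative chunking for streaming ASR with overlap and look-ahead.
# recognize_chunk() is a hypothetical stub for the frame-synchronous decoder.
import numpy as np

def recognize_chunk(chunk, lookahead):
    # Stub: a real decoder would emit words for `chunk` while peeking at
    # `lookahead` future samples for better context.
    return []

def stream_recognize(audio, chunk_size=48000, overlap=8000, lookahead=4000):
    hypotheses = []
    start = 0
    while start < len(audio):
        end = min(start + chunk_size, len(audio))
        chunk = audio[start:end]
        future = audio[end:end + lookahead]   # look-ahead frames
        hypotheses.extend(recognize_chunk(chunk, future))
        if end == len(audio):
            break
        start = end - overlap                 # overlapping chunks
    return hypotheses

stream_recognize(np.random.randn(160000))
```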
The speaker diarization system may include a speaker encoder, which takes as input an audio signal and produces as output a single vector representing the voice characteristics of the speaker in the given audio segment; a clustering algorithm, such as k-means or spectral clustering, that groups the resulting speaker vectors according to a similarity metric; an optimization algorithm to determine the most probable number of clusters; and a gender classification system that takes as input the same audio signal as the speaker encoder and produces as output a label 'male' or 'female' for the speaker. Some embodiments apply speaker gender classification to all the audio segments, and a majority vote then determines the gender of the speaker. Some embodiments compute speaker gender classification only on the longest segment assigned to a speaker.
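A simplified sketch of the clustering and gender-voting steps is shown below, assuming speaker vectors have already been extracted by a speaker encoder (random data is used here). A silhouette-based search is used as one possible stand-in for the optimization of the number of clusters; it is not presented as the method of the disclosure.

```python
# Sketch of clustering speaker vectors and voting on speaker gender.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
speaker_vectors = rng.normal(size=(50, 192))   # one vector per audio window

# Find the most probable number of speakers within a plausible range.
best_k, best_labels, best_score = 2, None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(speaker_vectors)
    score = silhouette_score(speaker_vectors, labels)
    if score > best_score:
        best_k, best_labels, best_score = k, labels, score

# Majority vote over per-segment gender predictions (dummy labels here).
segment_genders = ["female", "female", "male", "female"]
speaker_gender = Counter(segment_genders).most_common(1)[0][0]
print(best_k, speaker_gender)
```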
The sequences of frames with detected faces are input to a lip movement detection model 930 for processing. For all consecutive frames where the same face is detected, determined according to position and plausible movements, the model detects whether the corresponding lips move following a speech pattern across the frames. If no speaker with moving lips is detected in this way in 940, the image is discarded and becomes a candidate for "off-screen". If one speaker is detected, the face image is analyzed by a face recognition model in 950 that produces a vector encoding visual characteristics, which may encompass sex, skin color, age, ethnicity and facial traits useful to distinguish one person from another.
Some embodiments use additional information from the images in the video to improve the diarization results. A visual diarization system detects the persons in a video signal and clusters them to find all the occurrences of a person in the video. According to some embodiments, as depicted in
The information about the persons and the lip movements is matched with the time boundaries for speech detected by the ASR system. In some embodiments, the segments from acoustic speaker diarization are then mapped to the labels from visual speaker detection using the time information from both. The resulting information is used together with the labels from both systems to assign more accurate speaker labels. In case of a mismatch between the acoustic and the visual speaker, the system that obtained the lowest error rate during testing determines the label. In some embodiments, visual speaker detection is used to initialize the number of speakers for speaker diarization. Consecutive windows of the audio signal assigned the same speaker label are merged into a single segment.
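A minimal sketch of mapping acoustic diarization segments to visual speaker labels by temporal overlap is given below; the segment formats and label names are assumptions made only for this example.

```python
# Illustrative matching of acoustic diarization segments to visual speaker
# labels by temporal overlap.

def overlap(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def map_acoustic_to_visual(acoustic_segments, visual_segments):
    # acoustic_segments: [(start, end, acoustic_label)]
    # visual_segments:   [(start, end, visual_label)]
    mapping = []
    for a_start, a_end, a_label in acoustic_segments:
        # Pick the visual label with the largest temporal overlap, if any.
        best = max(visual_segments,
                   key=lambda v: overlap(a_start, a_end, v[0], v[1]),
                   default=None)
        v_label = best[2] if best and overlap(a_start, a_end, best[0], best[1]) > 0 else None
        mapping.append((a_start, a_end, a_label, v_label))
    return mapping

print(map_acoustic_to_visual([(0.0, 2.0, "SPK_0"), (2.0, 3.5, "SPK_1")],
                             [(0.0, 2.1, "FACE_A"), (2.1, 4.0, "FACE_B")]))
```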
An automatic punctuation system adds punctuation and proper casing to the output text from ASR. The input to this system consists of word sequences corresponding to the segments detected during the diarization phase. This way, there is no contextual confusion between words spoken by different speakers. In some embodiments, the punctuator receives as input only the text output by the ASR system. In other embodiments, the punctuator receives the text and the corresponding audio signal as input, with the idea of using acoustic cues, such as pauses, to predict punctuation.
The information from ASR, punctuation and diarization is then merged into a data structure that includes the transcripts with the initial and final time stamps of each word, divided into segments, each segment with its timing and speaker. According to some embodiments, this data structure is then stored in a structured data format such as XML or JSON for further post-editing.
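One possible shape for such a data structure, serialized to JSON for later post-editing, is sketched below; the field names are illustrative, not normative.

```python
# One possible JSON shape for the merged ASR/punctuation/diarization output.
import json

document = {
    "segments": [
        {
            "start": 1.20,
            "end": 2.05,
            "speaker": "SPK_0",
            "text": "Hello there.",
            "words": [
                {"word": "Hello", "start": 1.20, "end": 1.55},
                {"word": "there.", "start": 1.60, "end": 2.05},
            ],
        }
    ]
}

with open("transcript.json", "w", encoding="utf-8") as f:
    json.dump(document, f, indent=2)
```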
Some embodiments additionally use a speech emotion detection system to determine the emotions expressed in a sentence. Examples of emotions that can be expressed in speech are: happiness, sadness, anger, excitement, fear, surprise, or neutrality. In some embodiments, the output is a single emotion label for the whole segment. In some embodiments, the output is a set of weights for the various emotions. In some embodiments, the same output is computed at the word level using the word boundaries determined by the ASR system. Some embodiments additionally use a text emotion classifier, which takes as input a sentence in text format and produces as output emotion labels similarly to the speech emotion detection system. The scores from the two systems are then combined with weights for a more accurate result.
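The weighted combination of the two emotion classifiers can be sketched as follows; the weight value and the label set are illustrative assumptions.

```python
# Sketch of combining speech-based and text-based emotion scores with weights.

def combine_emotions(speech_scores, text_scores, speech_weight=0.6):
    text_weight = 1.0 - speech_weight
    labels = set(speech_scores) | set(text_scores)
    combined = {label: speech_weight * speech_scores.get(label, 0.0)
                       + text_weight * text_scores.get(label, 0.0)
                for label in labels}
    # Return the most likely emotion and the full weighted distribution.
    return max(combined, key=combined.get), combined

label, scores = combine_emotions(
    {"happiness": 0.7, "neutrality": 0.3},
    {"happiness": 0.4, "surprise": 0.4, "neutrality": 0.2},
)
print(label, scores)
```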
The MT system is trained to translate from the source to the target language. In some embodiments, the translation happens at a segment level, which may lack context for a proper translation. In some embodiments, the MT system can use information from the previous source text and/or translation for a better contextual translation. For proper video dubbing, it is required that the MT output can be uttered in the same timespan as the original utterance, and as such, some control in the output length is needed. In some embodiments, the MT model is trained to generate output of the same audio duration by using cross-lingual textual information such as character or syllable count. In some embodiments, the MT system is trained with utterance time information on both the source and target side to generate output that can be uttered at the same time as the input.
In some embodiments, the MT system receives as input also a speaker vector (described later) representing the speaker for the utterance and uses this additional information for predicting the utterance time. In fact, the time to utter a sentence varies from speaker to speaker also in automatic systems. In some embodiments, visual information from the video signal can be processed to predict if the speaker is off screen, in which case the timing is more relaxed.
In some embodiments, the N-best list from the MT system is retained to later choose the translation hypothesis that best fits the given time.
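A minimal sketch of this selection step is given below: each hypothesis in the N-best list is scored by how closely its predicted utterance duration matches the available time slot. The character-rate duration proxy is used only to keep the example self-contained; the actual system would use the speech duration model described herein.

```python
# Sketch of choosing the N-best translation that best fits the time slot.
# predict_duration() is a crude proxy standing in for the duration model.

def predict_duration(text, chars_per_second=15.0):
    # Approximate utterance duration from character count (illustrative only).
    return len(text) / chars_per_second

def pick_best_fit(n_best, slot_duration):
    return min(n_best, key=lambda hyp: abs(predict_duration(hyp) - slot_duration))

n_best = ["Ich komme gleich zurück.", "Ich bin gleich wieder da.", "Bin gleich zurück."]
print(pick_best_fit(n_best, slot_duration=1.2))
```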
Referring to
Some embodiments additionally compute the room impulse response (RIR) for each segment. This is based on algorithms that predict the reverberation in the signal and use it to estimate the RIR. Spectral subtraction and Wiener filter are among the algorithms commonly used for this task. The RIR from each segment is used to dereverberate the audio signal in 1240 and further improve its speech quality. Given a segment for which the speaker is known, the speaker encoder 1250 extracts vectors from consecutive, or overlapping, windows in it. The vectors obtained from the multiple windows are then grouped to obtain the speaker vector for the segment in 1260. For speaker vector selection in 1270, some embodiments use the vector obtained from this step as it is. Some embodiments, in order to have a consistent voice throughout the whole output video signal, group the speaker vectors for the same speakers to select one representative. Some techniques used for this step are: picking the feature vector corresponding to the longest segment for each speaker; computing an average vector; using the number of speakers found by speaker diarization to set the number of clusters for a clustering algorithm to run on the feature vectors, and picking the cluster centroids.
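The three selection strategies listed above can be sketched as follows, operating on per-segment vectors already produced by a speaker encoder; random vectors and an off-the-shelf k-means implementation are used only to keep the example self-contained.

```python
# Sketch of the three speaker-vector selection strategies described above.
import numpy as np
from sklearn.cluster import KMeans

def longest_segment_vector(vectors, durations):
    # Pick the vector of the longest segment for each speaker.
    return vectors[int(np.argmax(durations))]

def average_vector(vectors):
    # Average all segment vectors for a speaker.
    return np.mean(vectors, axis=0)

def cluster_centroids(vectors, n_speakers):
    # Use the number of speakers from diarization to set the cluster count.
    return KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit(vectors).cluster_centers_

rng = np.random.default_rng(0)
segment_vectors = rng.normal(size=(12, 192))
segment_durations = rng.uniform(0.5, 8.0, size=12)
print(longest_segment_vector(segment_vectors, segment_durations).shape)
print(average_vector(segment_vectors).shape)
print(cluster_centroids(segment_vectors, n_speakers=3).shape)
```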
The speech placement system decides the exact timing to assign to each sentence. Speech placement is designed according to the functioning of the TTS system. In some embodiments, the TTS system does not take as input the actual text but a phonemic representation of it. Then, the textual input is preprocessed with the TTS preprocessing pipeline to convert it into a phonemic representation and obtain a symbol-level duration. The speech placement system then determines the start and stop time of each segment, as well as a time correction factor for each word in order to expand or shrink its duration according to the time needs.
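A simplified placement computation is sketched below: given the original segment boundaries and the predicted duration of the synthesized speech, it returns start and end times together with a tempo correction factor. The clamping limits on the tempo are illustrative assumptions.

```python
# Simplified speech placement sketch: start/end times plus a tempo factor.

def place_segment(orig_start, orig_end, synth_duration,
                  min_tempo=0.8, max_tempo=1.25):
    available = orig_end - orig_start
    tempo = synth_duration / available             # >1 means speech must be sped up
    tempo = min(max(tempo, min_tempo), max_tempo)  # keep the change within natural limits
    placed_end = orig_start + synth_duration / tempo
    return orig_start, placed_end, tempo

print(place_segment(10.0, 12.5, synth_duration=3.0))
```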
Some embodiments use a module to predict the speech rate. This module takes as input a text sequence and the corresponding audio signal, and outputs a single real number (speaking pace) indicating how much the speech content is slower or faster than a normal speaking rate. The predicted duration of the synthesized audio may then be multiplied by this number, or otherwise varied, to modify the duration accordingly. This helps reproduce the original viewing experience in the target language when a sentence is purposely pronounced too slowly or too quickly.
Such a pace prediction component is based on a duration model trained on words and their durations. The training data may be obtained starting from aligned audio-text data in the same language and using a forced aligner to obtain word durations. A Gaussian model may then be trained to obtain the mean and variance of the duration of each word. During inference, if a measured duration, within an error interval given by a forced aligner, has a probability under the Gaussian model higher than a threshold t, then the pace is set to 1; otherwise, it is the measured duration divided by the Gaussian mean.
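The pace computation can be sketched as follows, with dummy per-word mean and variance values standing in for a trained Gaussian duration model and an arbitrary probability threshold.

```python
# Sketch of the word-level pace computation: a per-word Gaussian duration
# model, a probability threshold t, and a fallback ratio. The statistics and
# threshold below are dummy values, not a trained model.
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def word_pace(measured_duration, mean, var, threshold=0.1):
    # If the measured duration is plausible under the model, keep normal pace.
    if gaussian_pdf(measured_duration, mean, var) > threshold:
        return 1.0
    # Otherwise, the pace is the measured duration over the expected mean.
    return measured_duration / mean

print(word_pace(measured_duration=0.9, mean=0.45, var=0.02))   # ~2.0: spoken unusually slowly
print(word_pace(measured_duration=0.47, mean=0.45, var=0.02))  # 1.0: normal pace
```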
In some embodiments, the MT model is trained to output sequences of phonemes and their respective duration. The training data may be prepared by starting from normal bilingual parallel text used for MT. The source side is left unchanged, but the target side is converted to phonemes using a grapheme-to-phoneme (G2P) model, and a duration model is used to annotate phonemes with duration. The duration model may be trained using audio-phonemes aligned data. In some embodiments, the output of MT is a sequence of pairs (phoneme, duration). The output sequence is then postprocessed to extract the phoneme and duration sequences separately and these sequences are fed as input to the TTS systems after speech placement and N-best selection from the previous points.
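The post-processing of the (phoneme, duration) output can be sketched as below; the serialized pair format used here is an assumption for the example.

```python
# Sketch of splitting an MT output of (phoneme, duration) pairs into the two
# sequences expected downstream by the TTS system.

def split_pairs(mt_output):
    # Assumed serialization, e.g. "HH|0.06 AH|0.05 L|0.07 OW|0.12"
    phonemes, durations = [], []
    for token in mt_output.split():
        phoneme, duration = token.split("|")
        phonemes.append(phoneme)
        durations.append(float(duration))
    return phonemes, durations

print(split_pairs("HH|0.06 AH|0.05 L|0.07 OW|0.12"))
```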
The synthetic audio files are then concatenated while adding sufficiently long silence between the different segments. Silence is represented by a zero vector. In some embodiments, all synthesized audio signals are concatenated together, taking their timings into account. In some embodiments, following standards in the media industry, only synthesized speech signals with the same voice are concatenated together. These separate audio tracks can then be used for professional mixing. Such embodiments also generate an additional audio signal that combines all the voices into a single track for generating a dubbed video automatically. The resulting audio file is then merged with the background audio in some embodiments, and with the background and a volume-scaled version of the original speech audio in other embodiments. The final audio file obtained by this last merging is then added to the video and sent back to the user over the network.
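The concatenation with silence can be sketched as follows: a zero-valued track is allocated and each synthesized segment is written at its placed offset, so everything not covered by speech remains silent. The sample rate and dummy waveforms are illustrative.

```python
# Sketch of concatenating synthesized segments into one track by their timings,
# with zero-valued samples (silence) everywhere else.
import numpy as np

def concatenate_segments(segments, sample_rate=16000, total_duration=None):
    # segments: [(start_time_in_seconds, waveform_as_numpy_array), ...]
    end = max(start + len(wav) / sample_rate for start, wav in segments)
    total = total_duration if total_duration is not None else end
    track = np.zeros(int(round(total * sample_rate)))   # silence by default
    for start, wav in segments:
        offset = int(round(start * sample_rate))
        seg = wav[: max(0, len(track) - offset)]         # guard against overrun
        track[offset:offset + len(seg)] += seg
    return track

track = concatenate_segments([(0.5, np.ones(8000)), (2.0, np.ones(4000))])
print(len(track))  # 36000 samples = 2.25 s at 16 kHz
```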
Some embodiments also allow the user to modify intermediate outputs of the system. In such embodiments, the output artifact does not include only the output video signal, but may also include a data structure containing a sequence of segments, where for each segment it stores the initial and final time, input audio signal, automatic transcript, machine translation, speaker label, speaker gender, detected emotion, on/off-screen information, synthesized audio signal and phonemic representation of the audio signal, or subsets or combinations thereof. Additionally, it maps speaker labels to speaker vectors and emotion labels to emotion vectors. Such a data structure is then passed as input to a remote system that presents a GUI to the user. The GUI allows the user to edit all the information in it. In particular, it allows the user to change: the time markers of segments to fit the original utterances more precisely; the text of the translations to be uttered with the synthesized voices, and of the transcriptions, which leads to better translations when they are requested in multiple target languages; the speaker labels, to change voices easily; and the detected emotions, to change the prosody of the synthetic utterance. A button then starts the generation of the synthetic audio for a segment given the edited data. The synthesized audio is generated with the same TTS system used for automatic dubbing and includes speaker and emotion adaptation as well. Some embodiments offer an additional interface to modify the synthesized audio by changing the pitch contour and duration of each phoneme.
Various systems, components, services or micro-services, and functional blocks are described herein for processing audio and video, the corresponding speech, silence, voices and background noise embedded in the video, and speech recognition and machine translation (collectively, components). These components may be implemented in software instructions stored in memory that are executed by a processor such as a microprocessor, GPU, CPU or other centralized or decentralized processor, and may run on a local computer, a local server or in a distributed cloud environment. The processor may further be coupled to input/output devices including network interfaces for streaming video, audio or other data, databases, keyboards, mice, displays, speakers and cameras, among others. These components include all of those identified herein, including ffmpeg associated with basic signal processing, and the ASR, MT, speech separation, audio segmentation, casing and punctuation systems, visual diarization systems, speaker diarization, speech rate prediction, the speech placement system, speaker encoders, decoders, emotion classifiers for text and speech, and TTS and speaker-adaptive TTS. In addition, some components described herein comprise models that may be implemented and trained with AI, machine learning and deep neural networks. These may include any of the components above, but particularly the ASR, MT and TTS functionality, as well as other components described herein as implementing trained models or neural networks. The memory may store training models, data and operational models, and programs for training MT embodiments, ASR embodiments, TTS embodiments and embodiments of other components described herein.
While particular embodiments of the present invention have been shown and described, it will be understood by those having ordinary skill in the art that changes may be made to those embodiments without departing from the spirit and scope of the present invention.
This application claims priority to U.S. Provisional Patent Application No. 63/543,232, filed on Oct. 9, 2023, entitled "Automatic Dubbing: Methods and Apparatuses," the entirety of which is incorporated by reference herein.
Number | Date | Country
--- | --- | ---
63/543,232 | Oct. 9, 2023 | US