The present disclosure relates generally to devices, systems and methods for acquiring audio of a specific person's voice in a first language, duplicating the characteristic sound of the voice and reproducing the same voice in a second language. The present disclosure further relates to capturing facial movements of the specific person's face while speaking in a first language and to animating video facial movements to align with a translated voice sound.
Subtitling or dubbing video content is expensive and time-consuming. Commonly, dubbing is performed by a similar but different actor: a male actor may be dubbed by another male actor of a similar age, but not by the same actor. An actor's voice is often an important part of character development, and changing the character's voice changes the overall effect of the video content.
Prosody refers to the elements of speech that are not individual phonetic segments but are properties of syllables and larger units of speech, serving linguistic functions such as intonation, stress, and rhythm. Such elements are also referred to as sound identifiers. Prosody reflects the emotional features of a speaker: their underlying emotional state; whether they are making statements, commands, or interrogatives; or whether they are speaking with irony or sarcasm, for example.
Many physical prosodic properties may be objectively measured, as may the sound wave and the physiological characteristics of articulation. Objectively measured prosodic properties and sound wave characteristics are sound identifiers that may be replicated.
Major prosodic variables include pitch of the voice, length of sounds, volume and variations between soft and loud, and timbre of the sound. These variables correspond closely to fundamental frequency (measured in hertz), duration (measured in milliseconds), intensity (measured in decibels), and spectral characteristics (measured by the distribution of energy across different parts of the audible frequency range). In other words, the shape of a sound wave measured over time is an important sound identifier that can be objectively measured and replicated.
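By way of illustration, the prosodic variables above may be measured from an audio file with standard audio-analysis tooling. The following is a minimal sketch assuming the open-source librosa library; the file name is illustrative.

```python
# Minimal sketch: measuring prosodic variables with librosa.
# The file name "actor_line.wav" is illustrative.
import librosa
import numpy as np

y, sr = librosa.load("actor_line.wav", sr=None)  # waveform and sample rate

# Fundamental frequency (pitch) in hertz, estimated frame by frame.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Duration in milliseconds.
duration_ms = 1000.0 * len(y) / sr

# Intensity: root-mean-square energy per frame, converted to decibels.
rms = librosa.feature.rms(y=y)[0]
intensity_db = librosa.amplitude_to_db(rms, ref=np.max)

# Spectral characteristics: distribution of energy across frequencies.
spectrum = np.abs(librosa.stft(y))

print(f"mean pitch: {np.nanmean(f0):.1f} Hz, duration: {duration_ms:.0f} ms")
```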
By replicating objectively measured prosodic properties as well as the sound wave of an actor's voice, the sound of the voice may be replicated. This process is also referred to as voice cloning. Voice cloning is the process of replicating or synthesizing a person's voice from audio samples. The result is a digital replica of the original voice which may be used to generate speech from text.
Replacing the spoken word with a second language, in the original actor's voice and using artificial intelligence, would decrease the cost and labor of dubbing video content while preserving the quality of the actor's voice in the video content.
In the realm of digital media and film production, accurately synchronizing an actor's lip movements with newly recorded speech, such as in dubbing, voice replacement, or language translation, has long posed a technical challenge. Traditionally, reshooting scenes or manual animation techniques were required to ensure the actor's lip movements matched the altered speech, a process that was often labor-intensive, time-consuming, and costly. As digital technologies evolved, so did the need for more efficient methods to manipulate mouth movements without reshooting or manual intervention.
Recent advancements in computer vision, artificial intelligence (AI), and machine learning have enabled the development of automated systems that can capture, analyze, and duplicate an actor's facial expressions and lip movements in real time. The captured movements can then be reanimated to correspond to new speech inputs while maintaining the integrity of the original performance. This allows for seamless dubbing, re-voicing, and even creative applications such as altering performances or generating entirely new dialogue, all while preserving the actor's original visual appearance. Such technology has applications in the entertainment industry, including film, television, gaming, and virtual reality, as well as in marketing, training, and educational content.
Existing solutions often struggle with achieving a naturalistic appearance in real-time applications, maintaining synchronization under varying lighting conditions, and ensuring the actor's full range of emotions is captured. Therefore, there is a need for a more precise and efficient method for duplicating and reanimating mouth movements to match altered speech while maintaining the authenticity of the performance, in the original actor's voice and facial movements.
A method for generating dubbed speech in the voice of an original actor in a piece of video content is disclosed. Artificial intelligence (AI) is employed to extract sounds of an original actor's voice in a first language from background noise. The extracted language sounds are divided into a plurality of sound identifiers. Sound identifiers include but are not limited to prosodic content such as frequency, duration, intensity, and the overall shape of a sound wave of the actor's voice sounds, measured over time. These sound identifiers may also be referred to as pitch, length of sounds, volume (including variations between soft and loud), timbre, and sound wave peaks and valleys at specific times.
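By way of example, the extraction of voice sounds from background noise may be performed with a pre-trained source-separation model. The following sketch assumes the open-source Spleeter library and its two-stem model; file paths are illustrative.

```python
# Sketch: a pre-trained two-stem model (Spleeter) splits dialogue
# ("vocals") from music and background noise ("accompaniment").
# File paths are illustrative.
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")          # vocals / accompaniment
separator.separate_to_file("scene_audio.wav", "separated/")
# separated/scene_audio/vocals.wav        -> the actor's voice sounds
# separated/scene_audio/accompaniment.wav -> background, remixed later
```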
The overall shape of a sound wave of the actor's voice may be derived by various methods. Sampling is the reduction of a continuous-time signal to a discrete-time signal. In the field of music and audio recording, a sample is a value of a signal at a point in time. A sampler is a system or operation that extracts samples from a continuous signal. A theoretical ideal sampler produces samples equivalent to the instantaneous value of the continuous signal at a desired point in time.
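By way of illustration, the following sketch shows the reduction of a continuous-time signal to discrete samples, here a pure 220 Hz tone standing in for a voice; the sample rate of 44,100 Hz is a common audio standard.

```python
# Illustration of sampling: an ideal sampler records the instantaneous
# value of a continuous signal at regular points in time.
import numpy as np

sample_rate = 44_100                           # samples per second
t = np.arange(0, 1.0, 1.0 / sample_rate)       # sample instants over 1 s
signal = 0.5 * np.sin(2 * np.pi * 220.0 * t)   # a 220 Hz tone as the "voice"

print(signal[:5])  # the first five sampled values of the waveform
```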
An Artificial Intelligence component of the method involves extracting and duplicating a specific person's voice from an audio file. The audio file may be an audio/visual combination or a separate audio component. In other words, the audio file may be combined with a movie or video, or may be a separate file containing audio only. AI duplicates a voice by analyzing and modeling the unique characteristics of that voice using machine learning techniques. The process begins with gathering a substantial amount of audio data from the target speaker, which captures various elements such as pitch, tone, cadence, and pronunciation. This data is then fed into a neural network trained on speech patterns and vocal nuances. The AI uses this training to create a model that can replicate the speaker's voice with high fidelity, generating new speech that mimics the original voice's distinctive qualities. Advanced algorithms also incorporate contextual understanding and emotional nuances to make the synthetic voice sound more natural and authentic.
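By way of example, voice cloning of this kind may be performed with an open-source text-to-speech library. The following sketch assumes the Coqui TTS library and its XTTS multilingual model; the file names and target language are illustrative, and a production system would typically use substantially more reference audio of the target speaker.

```python
# Sketch: zero-shot voice cloning with the open-source Coqui TTS library.
# Model name, file names, and target language are illustrative.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Texto traducido que se pronuncia con la voz del actor.",
    speaker_wav="actor_reference.wav",   # audio samples of the target voice
    language="es",                       # the second language
    file_path="cloned_line_es.wav",
)
```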
AI used in the method of the disclosure derives text from speech through a process known as automatic speech recognition (ASR). This involves converting spoken language into written text using complex algorithms and machine learning models. Initially, the AI system processes audio input to break it down into smaller, manageable units called phonemes, which are the basic sounds of speech. The system then uses a pre-trained neural network to analyze these phonemes and match them with corresponding words or phrases based on context and language patterns. By leveraging vast datasets of spoken language, the AI can improve its accuracy over time, recognizing various accents, dialects, and speech patterns to generate text that accurately reflects the spoken content.
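By way of example, ASR of this kind may be performed with a pre-trained model such as OpenAI's open-source Whisper; the model size and file name below are illustrative.

```python
# Sketch: deriving a script from speech with the open-source Whisper model.
# Model size and file name are illustrative.
import whisper

model = whisper.load_model("base")
result = model.transcribe("vocals.wav")   # the separated dialogue track

print(result["text"])                     # the derived input script
for segment in result["segments"]:        # word timing for later alignment
    print(f'{segment["start"]:.2f}-{segment["end"]:.2f}: {segment["text"]}')
```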
A sound wave's shape may be derived by various methods. A sound wave's shape is created by the way a person's voice interacts with the environment as they speak. When a person talks, their vocal cords vibrate, producing sound waves. These vibrations cause the air around the vocal cords to move, creating pressure fluctuations that travel through the air. The shape and characteristics of these sound waves are influenced by the vocal tract's unique configuration, including the throat, mouth, and nasal passages. As the sound waves move through these resonating chambers, they are shaped by articulatory movements, such as how the tongue and lips are positioned, which affect the frequency and amplitude of the sound. These vibrations and modifications result in the distinct waveform that represents a person's voice, capturing its tone, timbre, pitch, volume, and rate. In some embodiments the voice characteristic sound further represents accent, speech pattern, and idiosyncrasies. Idiosyncrasies may include pauses, clicking of the tongue, and time-extending sounds such as “um,” “ah,” or “you know” that are commonly used in casual speech. One skilled in the art understands that each language may have its own common phrases similarly used to fill time. The combination of these factors creates a unique acoustic signature that characterizes each individual's voice and is referred to as a voice characteristic sound.
Extracted sounds, evaluated as to their prosodic characteristics, are combined with a script of the actor's spoken word in the original language. The script is translated into a second language using AI. The prosodic characteristics are applied to the script in the second language so that the sounds of the actor's voice may be heard in the second language. The sound of the actor's voice in the second language is then mixed back into the original file, including other speakers, background noise, and the like. The result is the actor's voice in a second language in the original video content.
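By way of example, the script translation step may be performed with a pre-trained machine-translation model. The following sketch assumes a MarianMT English-to-Spanish checkpoint from the Hugging Face transformers library; the language pair is illustrative.

```python
# Sketch: translating the input script into a second language with a
# pre-trained MarianMT model. The language pair is illustrative.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

input_script = ["The detective opened the door slowly."]
batch = tokenizer(input_script, return_tensors="pt", padding=True)
generated = model.generate(**batch)
output_script = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(output_script[0])   # the script in the second language
```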
Duplicating an actor's mouth movement to reanimate it with altered speech in a video involves a process called “lip-syncing” or “facial reanimation.” This technique captures the actor's original facial expressions, particularly the mouth movements, during their speech. Advanced software is used to track and map these movements, then replicate them frame by frame. Once the original movement is captured, animators can modify the lip movements to match new dialogue while keeping the rest of the facial expressions intact. This method allows for translated audio with matching facial movements.
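By way of illustration, frame-by-frame tracking of mouth movements may be performed with an off-the-shelf facial landmark model. The following sketch assumes Google's open-source MediaPipe Face Mesh; the video file name is illustrative.

```python
# Sketch: frame-by-frame mouth tracking with MediaPipe Face Mesh.
# The video file name is illustrative.
import cv2
import mediapipe as mp

# Landmark indices belonging to the lips, from the Face Mesh topology.
LIP_INDICES = sorted(
    {i for pair in mp.solutions.face_mesh.FACEMESH_LIPS for i in pair}
)

mouth_frames = []  # per-frame (x, y) positions of the lip landmarks
cap = cv2.VideoCapture("scene.mp4")
with mp.solutions.face_mesh.FaceMesh(static_image_mode=False) as mesh:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_face_landmarks:
            pts = results.multi_face_landmarks[0].landmark
            mouth_frames.append([(pts[i].x, pts[i].y) for i in LIP_INDICES])
cap.release()
# mouth_frames now maps the original mouth movement, ready to be
# reanimated to match the translated audio.
```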
An example method for duplicating media with translated speech corresponding to an input media file employs a computer to execute the steps of acquiring an input media file including the voice of a specific person, often an actor, speaking in a first language. The method continues by deriving an input script of the actor's spoken word. Further, the method continues by acquiring the actor's voice characteristic sound. A voice characteristic sound includes various prosodic characteristics and may include a combination of tone, timbre, pitch, volume and rate. A voice characteristic sound may also be defined by frequency measured in hertz, duration measured in time increments, intensity measured in decibels or a sound wave shape, also referred to as a wave form. In some embodiments, the voice characteristic sound is defined by a wave form. In other embodiments the sound of the actor's voice is divided into a plurality of sound identifiers from the audio file in the first language. After defining the actor's voice in this way the method continues by translating the input script into a second language to generate an output script. The actor's voice characteristic sound, including sound identifiers, is combined with the output script and the result is an actor's voice in a second language in the original video content.
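The example method may be summarized as the following orchestration sketch. Every helper function below is a hypothetical stand-in for a component discussed above, not a real API.

```python
# Orchestration sketch of the example method. All helpers are
# hypothetical stand-ins for the components discussed above.
def dub_in_actors_voice(input_media, source_lang, target_lang):
    vocals, background = separate_voice(input_media)          # hypothetical: split dialogue from noise
    input_script = transcribe(vocals, lang=source_lang)       # hypothetical: ASR derives the input script
    voice_profile = extract_sound_identifiers(vocals)         # hypothetical: prosody and wave form
    output_script = translate(input_script, target_lang)      # hypothetical: script translation
    dubbed_vocals = synthesize(output_script, voice_profile)  # hypothetical: cloned voice speaks the output script
    return remix(dubbed_vocals, background, input_media)      # hypothetical: restore background and video
```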
In some embodiments the actor's voice, translated into a second language, is coupled with a video adaptation of the actor's facial movements to complete the experience of seeing the actor speaking in the second language. The method further includes acquiring the input media file including a video of the actor's voice and face, speaking in the first language. The actor's facial movements are separated into individual phonetic pronunciation video segments of speech in the first language. The method continues by assembling the phonetic pronunciation video segments according to the duplicate media with translated speech, and then by generating duplicate media with the translated speech and facial movements of the actor.
Artificial intelligence is used to separate speech sounds from non-speech sounds. Non-speech sounds refers to background noise, music, others speaking, and the like. One skilled in the art understands that multiple actors speaking would all be translated using the method.
Although the method may be employed when creating a new audio/visual product, the method also works when translating a prior published work.