Embodiments of the present principles generally relate to content rendering and, more particularly, to a method, apparatus and system for creating a prosodic script for accurate rendering of content.
Currently, there is no way for text-to-speech (TTS) and other behavioral rendering systems to reliably solve the "one-to-many" mapping problem, in which the rendering system attempts to create one rendering that covers all the different ways an individual may choose to speak a phrase, with the result that current renderings tend to sound lifeless and lack full communicative intent. Some current systems have begun to attempt post-hoc changes to rendered speech to address the "one-to-many" mapping problem; however, such solutions are cumbersome and unreliable.
On the other hand, the ability to realistically render a single frame of video, given puppeteering input or other means of specifying the position of human gestures, such as facial features, for that frame, has grown significantly, and there are now numerous means to generate these frames. Similarly, for speech, given a desired stream of phonemes (e.g., derived from text and a pronunciation dictionary), numerous methods exist to (a) quickly train a system to render a voice with the timbre of the desired output, and (b) render any desired text in that voice. However, for both face and voice, the ability to accurately render realistic behavioral dynamics for the selected subject has lagged far behind. For example, given only text as input, current systems are able to render speech for that input, but the speech generally comes across without any real affect or intent, sounding much like an unengaged voice actor reading a script in a voice with only the minimal inflection needed to convey the syntax and semantics of the selected sentence. For faces, the problem is in some ways worse, as facial and head movements and expressions are generally completely divorced from any prosodic intent coordinated with the speech, a tell-tale sign of fakery and a method guaranteed to disengage the listener.
Embodiments of the present principles provide a method, apparatus and system for prosodic scripting for accurate rendering of content.
In some embodiments, a method for creating a script for rendering audio and/or video streams includes identifying at least one prosodic speech feature in a received audio stream and/or a received language model, and/or identifying at least one prosodic gesture in a received video stream, and automatically temporally annotating an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least one prosodic gesture to create a prosodic script that, when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.
In some embodiments, the method can further include converting a received audio stream and/or a received language model into a text stream to create the associated text stream and creating the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
In some embodiments, the method can further include rendering the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream and comparing prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.
In some embodiments, in the method the prosodic gestures are identified from movement of at least a portion of a body of a speaker of the received audio stream.
In some embodiments, in the method the portion of a body of a speaker comprises a face of the speaker of the audio stream and the at least one prosodic gesture comprises a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.
In some embodiments, the method can further include creating a spectrogram of the received audio stream, rendering the spectrogram from the prosodic script to create a predicted spectrogram, and comparing the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.
In some embodiments, in the method the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.
In some embodiments of the present principles, a method for creating a dynamic prosodic script for rendering audio and/or video streams includes identifying at least one prosodic speech feature in a received audio stream and/or a received language model, creating at least one modifiable prosodic speech symbol for each of the identified at least one prosodic speech features, converting the received audio stream into a text stream, automatically and temporally annotating the text stream with at least one created, modifiable prosodic speech symbol, identifying in a received video stream at least one prosodic gesture of at least a portion of a body of a speaker of the received audio stream, creating at least one modifiable prosodic gesture symbol for each of the identified at least one prosodic gestures, and temporally annotating the text stream with at least one created, modifiable prosodic gesture symbol along with the at least one modifiable prosodic speech symbol to create a prosodic script, wherein the at least one modifiable prosodic speech symbol and the at least one modifiable prosodic gesture symbol are modifiable in the prosodic script, such that a rendering of an audio stream or a video stream from the prosodic script is changed when at least one of the at least one modifiable prosodic speech symbol and the at least one modifiable prosodic gesture symbol is modified.
In some embodiments, in the above method at least one of the at least one prosodic speech symbol or the at least one prosodic gesture symbol comprises at least one of a predetermined character representative of at least one of the prosodic speech features identified in the audio stream or at least one of the prosodic gestures identified in the video stream, or a semantic description of at least one of the prosodic speech features identified in the audio stream or at least one of the prosodic gestures identified in the video stream.
In some embodiments, an apparatus for creating a script for rendering audio and/or video streams includes a processor and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to identify at least one prosodic speech feature in a received audio stream and/or a received language model and/or identify at least one prosodic gesture in a received video stream, and automatically temporally annotate an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least one prosodic gesture to create a prosodic script that, when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.
In some embodiments, the apparatus is further configured to convert a received audio stream and/or a received language model into a text stream to create the associated text stream, and create the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
In some embodiments, the apparatus is further configured to render the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream and compare prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.
In some embodiments, the prosodic gestures are identified from movement of at least a portion of a body of a speaker of the received audio stream.
In some embodiments, the portion of a body of a speaker comprises a face of the speaker of the audio stream and the at least one prosodic gesture comprises a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.
In some embodiments, the apparatus is further configured to create a spectrogram of the received audio stream, render the spectrogram from the prosodic script to create a predicted spectrogram, and compare the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.
In some embodiments, the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.
In some embodiments, a system for creating a script for rendering audio and/or video streams includes a spectral features module, a gesture features module, a streams to script module, and an apparatus comprising a processor and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions. In such embodiments, when the programs or instructions are executed by the processor, the apparatus is configured to identify, using the spectral features module and/or the gesture features module, at least one prosodic speech feature in a received audio stream and/or a received language model and/or identify at least one prosodic gesture in a received video stream, and automatically and temporally annotate, using the streams to script module, an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least one prosodic gesture to create a prosodic script that, when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.
In some embodiments, the system further includes a speech to text module and the apparatus is further configured to convert, using the speech to text module, a received audio stream into a text stream to create the associated text stream, and create, using the streams to script module, the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
In some embodiments, the system further includes a rendering module and the apparatus is further configured to render, using the rendering module, the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream, and compare, using the rendering module, prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.
In some embodiments, the prosodic gestures are identified, by the streams to script module, from movement of at least a portion of a body of a speaker of the received audio stream.
In some embodiments, the portion of a body of a speaker includes a face of the speaker of the audio stream and the at least one prosodic gesture includes a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.
In some embodiments, the apparatus is further configured to create a spectrogram of the received audio stream, render the spectrogram from the prosodic script to create a predicted spectrogram, and compare the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.
In some embodiments, the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.
Other and further embodiments in accordance with the present principles are described below.
So that the manner in which the above recited features of the present principles can be understood in detail, a more particular description of the principles, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments in accordance with the present principles and are therefore not to be considered limiting of its scope, for the principles may admit to other equally effective embodiments.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. The figures are not drawn to scale and may be simplified for clarity. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
Embodiments of the present principles generally relate to methods, apparatuses and systems for prosodic scripting for accurate rendering of content. While the concepts of the present principles are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. It should be understood that there is no intent to limit the concepts of the present principles to the particular forms disclosed. On the contrary, the intent is to cover all modifications, equivalents, and alternatives consistent with the present principles and the appended claims. For example, although embodiments of the present principles will be described primarily with respect to specific content such as facial features, such teachings should not be considered limiting. Embodiments in accordance with the present principles can be applied to substantially any content including other human body parts and robotic components.
Embodiments of the present principles enable an automatic extraction of prosodic elements from video and audio capture of visual and audio performances, such as spontaneously speaking humans. The prosodic text/audio and gesture annotations are scripted, and such a script can be rendered to, for example, train a prosodic scripting system of the present principles based on predicting prosody streams from the scripted text including the prosodic annotations, which can be used for behavioral analysis and feedback. Such analysis and feedback of the present principles can be implemented to, for example, enable an understanding of distinct components of a large-scale scene, such as a 3D scene, in which the contextual interactions between such components can provide a better understanding of the scene contents and enable a segmentation of the scene into various semantic categories of interest.
The term “symbol” is used throughout this disclosure. It should be noted that the term “symbol” is intended to define any representation of at least one of prosodic speech features and/or prosodic gestures that can be inserted/annotated in a text stream in accordance with the present principles. For example and as described below, in some embodiments a symbol can include a character that is representative of at least one of prosodic speech features and/or prosodic gestures that can be inserted/annotated in an associated text stream. Alternatively or in addition, in some embodiments a symbol can include a semantic description (e.g., text) of at least one of prosodic speech features and/or prosodic gestures that can be inserted/annotated in an associated text stream.
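By way of a non-limiting illustration only, the following sketch shows one way such a symbol could be represented in software. The class name, field names, and example values below are hypothetical assumptions and are not drawn from the present disclosure:

```python
# Hypothetical representation of a prosodic symbol; names and values
# are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProsodicSymbol:
    kind: str                       # "speech" or "gesture"
    char: Optional[str] = None      # compact character form, e.g., "^" for emphasis
    semantic: Optional[str] = None  # semantic description, e.g., "eyebrow raise"
    start: float = 0.0              # onset (seconds) in the source stream
    end: float = 0.0                # offset (seconds) in the source stream

# A symbol can carry a character, a semantic description, or both:
emphasis = ProsodicSymbol(kind="speech", char="^", start=1.20, end=1.45)
brow = ProsodicSymbol(kind="gesture", semantic="eyebrow raise", start=1.18, end=1.60)
```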
Prosody can be defined as behavioral dynamics, which can include the communication of meaning beyond what is included in actual words and gestures. That is, prosody refers to modulations of spoken speech, e.g., with pauses, emphases, rising and falling tones, etc., and to human gestures (e.g., facial gestures) and mannerisms that are synchronized to the spoken content. As such, in the present disclosure, a prosodic audio/speech feature can refer to parts/features of audio/speech that communicate meaning beyond the actual words, and a prosodic gesture can include any expression and/or movement of a body (human or otherwise) that communicates meaning beyond temporally respective words.
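As a non-limiting sketch of how such prosodic speech features might be identified in practice, the following example uses the open-source librosa library to extract a pitch contour, a frame-energy (emphasis) track, and candidate pause locations from an audio stream; the thresholds and hop sizes are illustrative assumptions, not values from the present disclosure:

```python
# Illustrative extraction of pitch, energy, and pause candidates with librosa.
import librosa
import numpy as np

def extract_prosodic_features(path, sr=16000, hop=256):
    y, sr = librosa.load(path, sr=sr)
    # Fundamental-frequency (pitch) contour; NaN where unvoiced.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
        sr=sr, hop_length=hop)
    # Frame energy as a rough proxy for emphasis.
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop)
    # Frames whose energy falls well below the median are treated as pauses.
    pause_times = times[rms < 0.1 * np.median(rms)]
    return {"f0": f0, "rms": rms, "times": times, "pause_times": pause_times}
```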
Although embodiments of the present principles will be described herein with respect to receiving audio and video data of a human performing an act, such as reading a monologue, alternatively or in addition, embodiments of a prosodic scripting system of the present principles, such as the prosodic scripting system 100 of FIG. 1, can be applied to substantially any other audio and/or video content in accordance with the present principles.
In accordance with embodiments of the present principles, a text stream is temporally annotated with at least one prosodic speech symbol created from at least one prosodic speech feature identified in a received audio stream and/or at least one prosodic gesture symbol created from at least one prosodic gesture identified in a received video stream to create a prosodic script. Because the prosodic speech symbols and/or the prosodic gesture symbols are temporally inserted into the text stream, a rendered prosodic script will accurately reflect, in time, a predicted audio performance of the received audio stream, and/or a predicted visual performance of the received video stream, and/or a coordinated performance of the received audio and video stream.
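The temporal insertion described above can be illustrated, under assumed input formats, as a merge of time-stamped words and time-stamped symbols; the tuple formats and symbol spellings here are hypothetical:

```python
# Illustrative temporal annotation: merge time-stamped words (from
# speech-to-text alignment) with time-stamped prosodic symbols.
def annotate(words, symbols):
    # words:   [(onset_seconds, word), ...]
    # symbols: [(onset_seconds, symbol_text), ...]
    events = list(words) + list(symbols)
    events.sort(key=lambda e: e[0])  # stable sort keeps word-before-symbol ties
    return " ".join(token for _, token in events)

words = [(0.0, "hello"), (0.5, "there")]
symbols = [(0.45, "<pitch-rise>"), (0.5, "[eyebrow-raise]")]
print(annotate(words, symbols))
# -> hello <pitch-rise> there [eyebrow-raise]
```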
In the embodiment of the prosodic scripting system 100 of FIG. 1, the prosodic scripting system 100 illustratively comprises a spectral features module 105, a visual features module 110 (illustratively a facial features module), a speech to text module 115, a streams to script module 120, and a script to speech and sketch prediction module 125.
In some embodiments of the present principles, the prosodic scripting system 100 of FIG. 1 can be implemented in a computing device, such as the computing device 700 described below with respect to FIG. 7.
In the embodiment of the prosodic scripting system 100 of FIG. 1, the spectral features module 105 identifies at least one prosodic speech feature, such as an emphasis, a duration, and/or a pitch of a temporal portion of the received audio stream.
In the embodiment of the prosodic scripting system 100 of FIG. 1, the speech to text module 115 converts the received audio stream into a text stream to create the associated text stream.
Referring back to the embodiment of the prosodic scripting system 100 of FIG. 1, the streams to script module 120 automatically and temporally annotates the text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature.
In accordance with the present principles, in some embodiments, the prosodic speech symbols inserted/annotated in the text stream can be modifiable. That is, in some embodiments of the present principles, the symbols or any other representation of identified speech prosody can be modified in a created prosodic script such that a rendering of the prosodic speech is changed. For example, a rendering of a performance of a prosodic script can be changed by modifying prosodic symbols and/or semantic descriptions representative of identified prosodic speech that were inserted into a determined prosodic script (described in further detail below).
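One minimal, hypothetical illustration of such modification follows: if an emphasis symbol carries a numeric level, rewriting that level in the script changes any subsequent rendering. The `<emph=...>` syntax is an assumption for illustration only, not a symbol set defined by the present disclosure:

```python
# Hypothetical modifiable-symbol edit: rewriting the numeric emphasis level
# carried by a symbol changes any later rendering of the script.
import re

script = "I <emph=0.4> never </emph> said that"  # assumed symbol syntax

def set_emphasis(script, level):
    return re.sub(r"<emph=\d+(?:\.\d+)?>", f"<emph={level:.1f}>", script)

stronger = set_emphasis(script, 0.9)  # same words, stronger rendered emphasis
```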
Similar to the spectral features module 105 of FIG. 1, the visual features module 110 identifies at least one prosodic gesture in a received video stream.
That is, in the embodiment of the prosodic scripting system 100 of FIG. 1, the visual features module 110 can identify at least one prosodic gesture from movement of at least a portion of a body of a speaker of the received audio stream, such as a change in at least a portion of the face of the speaker.
In some embodiments of the present principles, the streams to script module 120 can create a respective symbol for each of the identified prosodic gestures, similar to the prosodic speech symbols described above.
In accordance with the present principles, in some embodiments, the prosodic gesture symbols inserted/annotated in the text stream can be modifiable. That is, in some embodiments of the present principles, the symbols or any other representation of identified gesture prosody can be modified in a created prosodic script such that a rendering of identified body gestures can be changed. For example, a rendering of a performance of a prosodic script can be changed by modifying prosodic gesture symbols and/or semantic descriptions representative of identified prosodic gestures that were inserted into a determined prosodic script (described in further detail below).
In some embodiments of the present principles, the streams to script module 120 can implement pre-determined facial modulation models to determine at least one of a semantic description and/or symbols for the identified prosodic gestures of received video and to accurately capture 3D head shape. For example, in some embodiments, the streams to script module 120 can attempt to match at least one prosodic gesture of a received temporal prosodic gesture stream, received from, for example, the visual features module 110, to the pre-determined facial modulation models to determine at least one of a semantic description and/or a symbol for the at least one identified prosodic gesture. The streams to script module 120 can then annotate the text stream with the determined semantic description and/or symbol for the at least one identified prosodic gesture.
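A non-limiting sketch of such matching follows, in which an identified gesture's feature vector is compared against stored, pre-determined templates and the nearest template's symbol is selected; the template set, feature dimensionality, and Euclidean metric are illustrative assumptions:

```python
# Illustrative nearest-template gesture matching.
import numpy as np

TEMPLATES = {
    "[eyebrow-raise]": np.array([0.9, 0.1, 0.0]),
    "[head-nod]":      np.array([0.0, 0.8, 0.1]),
    "[smile]":         np.array([0.1, 0.0, 0.9]),
}

def match_gesture(feature_vec):
    # Select the stored symbol whose template is nearest to the observation.
    return min(TEMPLATES, key=lambda s: np.linalg.norm(TEMPLATES[s] - feature_vec))

print(match_gesture(np.array([0.85, 0.05, 0.05])))  # -> [eyebrow-raise]
```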
Although the embodiment of the prosodic scripting system 100 of FIG. 1 is described as creating a prosodic script from received audio and video streams, in some embodiments of the present principles a prosodic scripting system can create a prosodic script of the present principles from only a received audio stream.
Alternatively, in some embodiments of the present principles, a prosodic scripting system of the present principles can create a prosodic script of the present principles from only a received video stream. That is, in some embodiments, a visual features module of the present principles can receive a video stream and identify prosodic gestures in the received video stream in accordance with the present principles and as similarly described in the embodiment of FIG. 1.
In a prosodic scripting system of the present principles, such as the embodiment of the prosodic scripting system 100 of FIG. 1, the created prosodic script can be rendered to create at least one predicted audio stream and/or at least one predicted video stream, which can be compared with ground truth streams to determine respective loss functions for training the prosodic scripting system (described in further detail below).
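By way of illustration, a loss for such training could compare predicted and ground-truth spectrograms; the L1 distance and the PyTorch usage below are assumptions showing one common choice rather than a required implementation:

```python
# Illustrative training loss: L1 distance between a predicted spectrogram
# rendered from the prosodic script and the ground-truth spectrogram.
import torch
import torch.nn.functional as F

def spectrogram_loss(predicted, ground_truth):
    # predicted, ground_truth: (batch, mel_bins, frames)
    return F.l1_loss(predicted, ground_truth)

predicted = torch.randn(2, 80, 120, requires_grad=True)  # stand-in prediction
ground_truth = torch.randn(2, 80, 120)                   # stand-in target
loss = spectrogram_loss(predicted, ground_truth)
loss.backward()  # gradients would train the script-to-spectrogram predictor
```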
In some embodiments, a synthetic puppeteer rendering module of the present principles, such as the synthetic puppeteer rendering module 507 of FIG. 5, can render frames of a video stream given the positions of, for example, facial features specified by the prosodic script.
As depicted in the embodiment of the prosodic scripting system 100 of FIG. 1, the prosodic scripting system 100 can further include a spectro to speech module 150 and a sketch to face module 155.
In some embodiments of the present principles, the spectro to speech module 150 can provide a spectrogram stream, which is converted to a speech waveform for final output. For facial rendering, the sketch to face module 155 can provide a sketch-like rendering to indicate positions of facial features within each frame of the face. In some embodiments, the sketch to face module 155 can include a multi-frame, multimodal autoencoder that is trained to render the mesh as a specific, fully "fleshed out" face.
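As a non-limiting sketch of the spectrogram-to-waveform conversion, the following example uses librosa's Griffin-Lim implementation as a simple stand-in for whatever vocoder a given embodiment employs:

```python
# Illustrative spectrogram-to-waveform conversion; Griffin-Lim stands in
# for whatever vocoder a given embodiment employs.
import librosa

def spectrogram_to_waveform(mag_spec, hop=256):
    # mag_spec: linear-magnitude STFT, shape (1 + n_fft // 2, frames)
    return librosa.griffinlim(mag_spec, hop_length=hop)

# e.g.:
# import soundfile as sf
# sf.write("rendered.wav", spectrogram_to_waveform(S), 16000)
```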
In some embodiments of the present principles, in the script to speech and sketch prediction module 125, synchronization of the speech and facial prosody can be achieved in a number of different ways. The tightest synchronization is required for the lips of a speaker, and a script to speech and sketch prediction module of the present principles can, in some embodiments, implement a FaceFormer process that takes as input the already-rendered speech stream, thus giving the speech stream the control needed for tight synchronization.
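The looser layers of synchronization can be illustrated, purely as an assumption-laden sketch, by driving a per-video-frame mouth-openness value directly from the already-rendered speech waveform; this envelope-based stand-in shows only the data flow and is not the FaceFormer process itself:

```python
# Envelope-based stand-in for speech-driven face control: derive one
# mouth-openness value per video frame from the rendered waveform.
import numpy as np

def mouth_openness(waveform, sr=16000, fps=30):
    # waveform: 1-D NumPy array of rendered speech samples
    samples_per_frame = sr // fps
    n_frames = len(waveform) // samples_per_frame
    frames = waveform[: n_frames * samples_per_frame].reshape(n_frames, -1)
    envelope = np.abs(frames).mean(axis=1)      # per-video-frame amplitude
    return envelope / (envelope.max() + 1e-8)   # normalized 0..1 openness
```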
FIG. 6 depicts a flow diagram of a method 600 for creating a script for rendering audio and/or video streams in accordance with an embodiment of the present principles. The method 600 begins at 602, during which at least one prosodic speech feature is identified in a received audio stream and/or a received language model, and/or at least one prosodic gesture is identified in a received video stream. The method 600 can proceed to 604.
At 604, an associated text stream is automatically and temporally annotated with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least one prosodic gesture to create a prosodic script that, when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned. The method 600 can then be exited.
In some embodiments, the method 600 can further include converting a received audio stream and/or a received language model into a text stream to create the associated text stream and creating the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
In some embodiments, the method 600 can further include rendering the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream and comparing prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.
In some embodiments, in the method 600 the prosodic gestures are identified from movement of at least a portion of a body of a speaker of the received audio stream.
In some embodiments, in the method 600 the portion of a body of a speaker comprises a face of the speaker of the audio stream and the at least one prosodic gesture comprises a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.
In some embodiments, the method 600 can further include creating a spectrogram of the received audio stream, rendering the spectrogram from the prosodic script to create a predicted spectrogram, and comparing the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.
In some embodiments, in the method 600 the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.
In some embodiments of the present principles, a method for creating a dynamic prosodic script for rendering audio and/or video streams includes identifying at least one prosodic speech feature in a received audio stream and/or a received language model, creating at least one modifiable prosodic speech symbol for each of the identified at least one prosodic speech features, converting the received audio stream into a text stream, automatically and temporally annotating the text stream with at least one created, modifiable prosodic speech symbol, identifying in a received video stream at least one prosodic gesture of at least a portion of a body of a speaker of the received audio stream, creating at least one modifiable prosodic gesture symbol for each of the identified at least one prosodic gestures, and temporally annotating the text stream with at least one created, modifiable prosodic gesture symbol along with the at least one modifiable prosodic speech symbol to create a prosodic script, wherein the at least one modifiable prosodic speech symbol and the at least one modifiable prosodic gesture symbol are modifiable in the prosodic script, such that a rendering of an audio stream or a video stream from the prosodic script is changed when at least one of the at least one modifiable prosodic speech symbol and the at least one modifiable prosodic gesture symbol is modified.
In some embodiments, in the method at least one of the at least one prosodic speech symbol or the at least one prosodic gesture symbol comprises at least one of a predetermined character representative of at least one of the prosodic speech features identified in the audio stream and/or the received language model or at least one of the prosodic gestures identified in the video stream, or a semantic description of at least one of the prosodic speech features identified in the audio stream and/or a received language model or at least one of the prosodic gestures identified in the video stream.
In some embodiments, an apparatus for creating a script for rendering audio and/or video streams includes a processor and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to identify at least one prosodic speech feature in a received audio stream and/or a received language model and/or identify at least one prosodic gesture in a received video stream, and automatically temporally annotate an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least one prosodic gesture to create a prosodic script that, when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.
In some embodiments, the apparatus is further configured to convert a received audio stream and/or a received language model into a text stream to create the associated text stream, and create the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
In some embodiments, the apparatus is further configured to render the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream and compare prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.
In some embodiments, the prosodic gestures are identified from movement of at least a portion of a body of a speaker of the received audio stream.
In some embodiments, the portion of a body of a speaker comprises a face of the speaker of the audio stream and the at least one prosodic gesture comprises a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.
In some embodiments, the apparatus is further configured to create a spectrogram of the received audio stream, render the spectrogram from the prosodic script to create a predicted spectrogram, and compare the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.
In some embodiments, the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.
In some embodiments, a system for creating a script for rendering audio and/or video streams includes a spectral features module, a gesture features module, a streams to script module, and an apparatus comprising a processor and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions. In such embodiments, when the programs or instructions are executed by the processor, the apparatus is configured to identify, using the spectral features module and/or the gesture features module, at least one prosodic speech feature in a received audio stream and/or a received language model and/or identify at least one prosodic gesture in a received video stream, and automatically and temporally annotate, using the streams to script module, an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least one prosodic gesture to create a prosodic script that, when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.
In some embodiments, the system further includes a speech to text module and the apparatus is further configured to convert, using the speech to text module, a received audio stream into a text stream to create the associated text stream, and create, using the streams to script module, the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
In some embodiments, the system further includes a rendering module and the apparatus is further configured to render, using the rendering module, the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream, and compare, using the rendering module, prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.
In some embodiments, the prosodic gestures are identified, by the streams to script module, from movement of at least a portion of a body of a speaker of the received audio stream.
In some embodiments, the portion of a body of a speaker includes a face of the speaker of the audio stream and the at least one prosodic gesture includes a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.
In some embodiments, the apparatus is further configured to create a spectrogram of the received audio stream, render the spectrogram from the prosodic script to create a predicted spectrogram, and compare the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.
In some embodiments, the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.
Embodiments of the present principles can be implemented to determine prosodic scripts for training a prosodic scripting system of the present principles from actual behaving humans. Then, given input text (or in an alternative embodiment, the gist of what is to be said or a large language model), a prosodic scripting system of the present principles can predict appropriate prosodic enhancements to text, to generate a full prosodic script for a specific individual. In some embodiments, the prosodic script can then be rendered directly, but also edited, either textually or through a user interface, in order to modify the behavior of the rendering in desired ways.
In some embodiments the prosodic script can be used directly by a human actor, i.e., to provide guidance to the actor on desired behaviors.
In some embodiments, the prosodic script can be extracted from actual human behavior, and the prosodic script can be used to find markers for either undesirable or pathological behavior. For example, excessive frowning or eye-wandering can be revealed to the user in a biofeedback-type style or can be measured and reported to a clinician for evaluation (e.g., of autism spectrum disorder, speech disorders, etc.).
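A minimal, hypothetical sketch of such marker detection: count gesture symbols in an extracted prosodic script and flag those whose rate exceeds a per-symbol threshold. The symbol spellings and threshold values are illustrative assumptions:

```python
# Hypothetical marker mining over an extracted prosodic script.
from collections import Counter

def flag_markers(script_tokens, thresholds_per_min, duration_min):
    counts = Counter(t for t in script_tokens if t.startswith("["))
    return {sym: n for sym, n in counts.items()
            if n / duration_min > thresholds_per_min.get(sym, float("inf"))}

tokens = "I [frown] really [frown] think [gaze-away] so [frown]".split()
print(flag_markers(tokens, {"[frown]": 2.0}, duration_min=1.0))  # {'[frown]': 3}
```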
In some embodiments, a prosodic scripting system of the present principles can be used to edit real performances. That is, one can record a performance, use the system to generate a prosodic script, modify the prosodic script in selected spots to modify specific aspects of that performance, and then render the complete performance again, with the modified portions included, seamlessly integrated with the original performance elements. In some embodiments, the modified elements can be rendered in the style of a recorded actor, based on current or previous training data for that person, for example in a received large language model.
In some embodiments, however, for humorous or artistic effects, the styles from other subject models can be used instead. Alternatively or in addition, in some embodiments, dynamics can also be edited at a lower level (e.g., position of a facial feature or intensity of the voice at each moment in time), rather than at the sequence level that incorporates the predictions of individual styles.
In some embodiments, a user of a prosodic scripting system of the present principles can take advantage of the system's ability to randomly vary the rendering using modifiable prosodic symbols within a believable range for that subject. Then, for example, the user can record a performance, watch various renderings of that performance with, for example, subtle modifications of motions and tone, and then select a desired rendering and/or prosodic script.
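One hypothetical way to realize such variation is to jitter the numeric levels carried by modifiable symbols within a bounded range, producing several candidate renderings of the same performance; the symbol syntax and the ranges below are assumptions:

```python
# Hypothetical bounded random variation of modifiable emphasis symbols,
# yielding several believable candidate renderings of one performance.
import random
import re

def vary_emphasis(script, lo=0.8, hi=1.2, seed=None):
    rng = random.Random(seed)
    def jitter(match):
        level = min(float(match.group(1)) * rng.uniform(lo, hi), 1.0)
        return f"<emph={level:.2f}>"
    return re.sub(r"<emph=(\d+(?:\.\d+)?)>", jitter, script)

candidates = [vary_emphasis("I <emph=0.70> never </emph> said that", seed=i)
              for i in range(3)]  # three subtly different renderings
```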
Embodiments of a prosodic scripting system of the present principles enable a creation of audio and video streams that more closely match an actor's or director's intentions.
In the embodiment of FIG. 7, the computing device 700 illustratively includes one or more processors 710 coupled to a system memory 720 via an input/output (I/O) interface 730, a network interface 740, and one or more input/output devices 750.
In different embodiments, the computing device 700 can be any of various types of devices, including, but not limited to, a personal computer system, desktop computer, laptop, notebook, tablet or netbook computer, mainframe computer system, handheld computer, workstation, network computer, a camera, a set top box, a mobile device, a consumer device, video game console, handheld video game device, application server, storage device, a peripheral device such as a switch, modem, router, or in general any type of computing or electronic device.
In various embodiments, the computing device 700 can be a uniprocessor system including one processor 710, or a multiprocessor system including several processors 710 (e.g., two, four, eight, or another suitable number). Processors 710 can be any suitable processor capable of executing instructions. For example, in various embodiments processors 710 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs). In multiprocessor systems, each of processors 710 may commonly, but not necessarily, implement the same ISA.
System memory 720 can be configured to store program instructions 722 and/or data 732 accessible by processor 710. In various embodiments, system memory 720 can be implemented using any suitable memory technology, such as static random-access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing any of the elements of the embodiments described above can be stored within system memory 720. In other embodiments, program instructions and/or data can be received, sent or stored upon different types of computer-accessible media or on similar media separate from system memory 720 or computing device 700.
In one embodiment, I/O interface 730 can be configured to coordinate I/O traffic between processor 710, system memory 720, and any peripheral devices in the device, including network interface 740 or other peripheral interfaces, such as input/output devices 750. In some embodiments, I/O interface 730 can perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 720) into a format suitable for use by another component (e.g., processor 710). In some embodiments, I/O interface 730 can include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 730 can be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 730, such as an interface to system memory 720, can be incorporated directly into processor 710.
Network interface 740 can be configured to allow data to be exchanged between the computing device 700 and other devices attached to a network (e.g., network 790), such as one or more external systems or between nodes of the computing device 700. In various embodiments, network 790 can include one or more networks including but not limited to Local Area Networks (LANs) (e.g., an Ethernet or corporate network), Wide Area Networks (WANs) (e.g., the Internet), wireless data networks, some other electronic data network, or some combination thereof. In various embodiments, network interface 740 can support communication via wired or wireless general data networks, such as any suitable type of Ethernet network, for example; via digital fiber communications networks; via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
Input/output devices 750 can, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or accessing data by one or more computer systems. Multiple input/output devices 750 can be present in computer system or can be distributed on various nodes of the computing device 700. In some embodiments, similar input/output devices can be separate from the computing device 700 and can interact with one or more nodes of the computing device 700 through a wired or wireless connection, such as over network interface 740.
Those skilled in the art will appreciate that the computing device 700 is merely illustrative and is not intended to limit the scope of embodiments. In particular, the computer system and devices can include any combination of hardware or software that can perform the indicated functions of various embodiments, including computers, network devices, Internet appliances, PDAs, wireless phones, pagers, and the like. The computing device 700 can also be connected to other devices that are not illustrated, or instead can operate as a stand-alone system. In addition, the functionality provided by the illustrated components can in some embodiments be combined in fewer components or distributed in additional components. Similarly, in some embodiments, the functionality of some of the illustrated components may not be provided and/or other additional functionality can be available.
The computing device 700 can communicate with other computing devices based on various computer communication protocols such as Wi-Fi, Bluetooth® (and/or other standards for exchanging data over short distances, including protocols using short-wavelength radio transmissions), USB, Ethernet, cellular, an ultrasonic local area communication protocol, etc. The computing device 700 can further include a web browser.
Although the computing device 700 is depicted as a general-purpose computer, the computing device 700 is programmed to perform various specialized control functions and is configured to act as a specialized, specific computer in accordance with the present principles, and embodiments can be implemented in hardware, for example, as an application specific integrated circuit (ASIC). As such, the process steps described herein are intended to be broadly interpreted as being equivalently performed by software, hardware, or a combination thereof.
In the network environment 800 of FIG. 8, the network environment 800 illustratively includes a user domain 802, a computer network environment 806, and a cloud environment 810 including at least one cloud server/computing device 812.
In some embodiments, a user can implement a system for prosodic scripting in the computer networks 806 to provide prosodic scripts in accordance with the present principles. Alternatively or in addition, in some embodiments, a user can implement a system for prosodic scripting in the cloud server/computing device 812 of the cloud environment 810 in accordance with the present principles. For example, in some embodiments it can be advantageous to perform processing functions of the present principles in the cloud environment 810 to take advantage of the processing capabilities and storage capabilities of the cloud environment 810. In some embodiments in accordance with the present principles, a system for providing prosodic scripting in a container network can be located in a single and/or multiple locations/servers/computers to perform all or portions of the herein described functionalities of a system in accordance with the present principles. For example, in some embodiments components of the prosodic scripting system, such as spectral features module 105, a visual features module 110 (illustratively a facial features module), a speech to text module 115, a streams to script module 120, and a script to speech and sketch prediction module 125 can be located in one or more than one of the user domain 802, the computer network environment 806, and the cloud environment 810 for providing the functions described above either locally or remotely.
Those skilled in the art will also appreciate that, while various items are illustrated as being stored in memory or on storage while being used, these items or portions of them can be transferred between memory and other storage devices for purposes of memory management and data integrity. Alternatively, in other embodiments some or all of the software components can execute in memory on another device and communicate with the illustrated computer system via inter-computer communication. Some or all of the system components or data structures can also be stored (e.g., as instructions or structured data) on a computer-accessible medium or a portable article to be read by an appropriate drive, various examples of which are described above. In some embodiments, instructions stored on a computer-accessible medium separate from the computing device 700 can be transmitted to the computing device 700 via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link. Various embodiments can further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium or via a communication medium. In general, a computer-accessible medium can include a storage medium or memory medium such as magnetic or optical media, e.g., disk or DVD/CD-ROM, volatile or non-volatile media such as RAM (e.g., SDRAM, DDR, RDRAM, SRAM, and the like), ROM, and the like.
The methods and processes described herein may be implemented in software, hardware, or a combination thereof, in different embodiments. In addition, the order of methods can be changed, and various elements can be added, reordered, combined, omitted or otherwise modified. All examples described herein are presented in a non-limiting manner. Various modifications and changes can be made as would be obvious to a person skilled in the art having benefit of this disclosure. Realizations in accordance with embodiments have been described in the context of particular embodiments. These embodiments are meant to be illustrative and not limiting. Many variations, modifications, additions, and improvements are possible. Accordingly, plural instances can be provided for components described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and can fall within the scope of claims that follow. Structures and functionality presented as discrete components in the example configurations can be implemented as a combined structure or component. These and other variations, modifications, additions, and improvements can fall within the scope of embodiments as defined in the claims that follow.
In the foregoing description, numerous specific details, examples, and scenarios are set forth in order to provide a more thorough understanding of the present disclosure. It will be appreciated, however, that embodiments of the disclosure can be practiced without such specific details. Further, such examples and scenarios are provided for illustration, and are not intended to limit the disclosure in any way. Those of ordinary skill in the art, with the included descriptions, should be able to implement appropriate functionality without undue experimentation.
References in the specification to “an embodiment,” etc., indicate that the embodiment described can include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is believed to be within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly indicated.
Embodiments in accordance with the disclosure can be implemented in hardware, firmware, software, or any combination thereof. Embodiments can also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium can include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a “virtual machine” running on one or more computing devices). For example, a machine-readable medium can include any suitable form of volatile or non-volatile memory.
In addition, the various operations, processes, and methods disclosed herein can be embodied in a machine-readable medium and/or a machine accessible medium/storage device compatible with a data processing system (e.g., a computer system), and can be performed in any order (e.g., including using means for achieving the various operations). Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. In some embodiments, the machine-readable medium can be a non-transitory form of machine-readable medium/storage device.
Modules, data structures, and the like defined herein are defined as such for ease of discussion and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures can be combined or divided into sub-modules, sub-processes or other units of computer code or data as can be required by a particular design or implementation.
In the drawings, specific arrangements or orderings of schematic elements can be shown for ease of description. However, the specific ordering or arrangement of such elements is not meant to imply that a particular order or sequence of processing, or separation of processes, is required in all embodiments. In general, schematic elements used to represent instruction blocks or modules can be implemented using any suitable form of machine-readable instruction, and each such instruction can be implemented using any suitable programming language, library, application-programming interface (API), and/or other software development tools or frameworks. Similarly, schematic elements used to represent data or information can be implemented using any suitable electronic arrangement or data structure. Further, some connections, relationships or associations between elements can be simplified or not shown in the drawings so as not to obscure the disclosure.
This disclosure is to be considered as exemplary and not restrictive in character, and all changes and modifications that come within the guidelines of the disclosure are desired to be protected.
This application claims benefit of and priority to U.S. Provisional Patent Application Ser. No. 63/442,673, filed Feb. 2, 2023 and U.S. Provisional Patent Application Ser. No. 63/454,575, filed Apr. 13, 2023, which are both herein incorporated by reference in their entireties.
This invention was made with Government support under contract number H92401-22-9-P001 awarded by the United States Special Operations Command (USSOCOM). The Government has certain rights in this invention.
Number | Date | Country
---|---|---
63442673 | Feb 2023 | US
63454575 | Mar 2023 | US