1. Technical Field
The present disclosure relates to synthesizing speech and more specifically to providing access to a backend speech synthesis process via an application programming interface (API).
2. Introduction
To a casual observer, any text-to-speech (TTS) system appears to be a black-box solution for creating synthetic speech from input text. In fact, TTS systems are mostly used as black-box systems today. In other words, TTS systems do not require the user or application programmer to have linguistic or phonetic skills. However, internally, such a TTS system has multiple, clearly separated modules with unique functions. These modules process expensive source speech data for a specific speaker or task using algorithms and approaches that may be closely guarded trade secrets.
Often, one party generates the source speech data by recording many hours of speech for a particular speaker in a high-quality studio environment. Another party has a set of highly tuned, effective, and proprietary TTS algorithms. In order for these two parties to collaborate one with another, each must provide the other access to their own intellectual property, which one or both parties may oppose. Thus, the current approaches available in the art force parties that may be at arm's length to either cooperate at a much closer level than either party wants or not cooperate at all. This friction prevents the benefits of TTS to spread in certain circumstances.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be obvious from the description, or can be learned by practice of the herein disclosed principles. The features and advantages of the disclosure can be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the disclosure will become more fully apparent from the following description and appended claims, or can be learned by the practice of the principles set forth herein.
Disclosed are systems, methods, and non-transitory computer-readable storage media for generating speech and/or a TTS voice using a divided client-server approach that splits the front end from the back end via API calls. A server configured to practice the method receives, from a network client that has no access to and knowledge of internal operations of the server, a request to generate a text-to-speech voice, the request having speech samples, transcriptions of the speech samples, and metadata describing the speech samples. The server extracts sound units from the speech samples based on the transcriptions and generates an interactive demonstration of the text-to-speech voice based on the sound units, the transcriptions, and the metadata, wherein the interactive demonstration hides a back end processing implementation from the network client. Then the server provides access to the interactive demonstration to the network client. The server can optionally maintain logs associated with the text-to-speech voice and provide those logs as feedback to the client.
The server can also receive an additional request from the network client for the text-to-speech voice that is the subject of the interactive demonstration and provide the text-to-speech voice to the network client. In one aspect, the request is received via a web interface. The client and/or the server can impose a minimum quality threshold on the speech samples. The TTS voice can be language agnostic. In a variation designed to reduce the amount of redundant speech samples or to expedite the process of gathering speech samples, the server can analyze the speech samples to determine a coverage hole in the speech samples for a particular purpose. Then the server can suggest to the client a type of additional speech sample intended to address the coverage hole. The server and client can iterate through this approach several times until a threshold coverage for the particular purpose is reached.
On the other hand, the client can transmit to a server a request to generate the text-to-speech voice. The request can include speech samples, transcriptions of the speech samples, and metadata describing the speech samples such as a gender, age, or other speaker information, the conditions under which the speech samples were collected, and so forth. The client then receives a notification from the network-based automatic speech processing system that the text-to-speech voice is generated. This notification can arrive hours, days, or even weeks after the request, depending on the request, specific tasks, the speed of the server(s), a queue of tasks submitted before the client's request, and so forth. Then the client can test, via a network, the text-to-speech voice independent of knowledge of internal operations of the server and/or without access to and knowledge of internal operations of the server.
In order to describe the manner in which the above-recited and other advantages and features of the disclosure can be obtained, a more particular description of the principles briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only exemplary embodiments of the disclosure and are not therefore to be considered to be limiting of its scope, the principles herein are described and explained with additional specificity and detail through the use of the accompanying drawings in which:
Various embodiments of the disclosure are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without parting from the spirit and scope of the disclosure.
The present disclosure addresses the need in the art for generating TTS voices with resources divided among multiple parties. A brief introductory description of a basic general purpose system or computing device in
With reference to
The system bus 110 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. A basic input/output (BIOS) stored in ROM 140 or the like, may provide the basic routine that helps to transfer information between elements within the computing device 100, such as during start-up. The computing device 100 further includes storage devices 160 such as a hard disk drive, a magnetic disk drive, an optical disk drive, tape drive or the like. The storage device 160 can include software modules 162, 164, 166 for controlling the processor 120. Other hardware or software modules are contemplated. The storage device 160 is connected to the system bus 110 by a drive interface. The drives and the associated computer readable storage media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computing device 100. In one aspect, a hardware module that performs a particular function includes the software component stored in a non-transitory computer-readable medium in connection with the necessary hardware components, such as the processor 120, bus 110, display 170, and so forth, to carry out the function. The basic components are known to those of skill in the art and appropriate variations are contemplated depending on the type of device, such as whether the device 100 is a small, handheld computing device, a desktop computer, or a computer server.
Although the exemplary embodiment described herein employs the hard disk 160, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, digital versatile disks, cartridges, random access memories (RAMs) 150, read only memory (ROM) 140, a cable or wireless signal containing a bit stream and the like, may also be used in the exemplary operating environment. Non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.
To enable user interaction with the computing device 100, an input device 190 represents any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech and so forth. An output device 170 can also be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems enable a user to provide multiple types of input to communicate with the computing device 100. The communications interface 180 generally governs and manages the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
For clarity of explanation, the illustrative system embodiment is presented as including individual functional blocks including functional blocks labeled as a “processor” or processor 120. The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software and hardware, such as a processor 120, that is purpose-built to operate as an equivalent to software executing on a general purpose processor. For example the functions of one or more processors presented in
The logical operations of the various embodiments are implemented as: (1) a sequence of computer implemented steps, operations, or procedures running on a programmable circuit within a general use computer, (2) a sequence of computer implemented steps, operations, or procedures running on a specific-use programmable circuit; and/or (3) interconnected machine modules or program engines within the programmable circuits. The system 100 shown in
Having disclosed some components of a computing system, the disclosure now returns to a discussion of self-service TTS web services through an API. This approach can replace a monolithic TTS synthesizer by effectively splitting a TTS synthesizer into discrete parts. For example, the TTS synthesizer can include parts for language analysis, database search for appropriate units, acoustic synthesis, and so forth. The system can include all or part of these components as well as other components. In this environment, a user uploads voice data on a client device that accesses the server over the Internet via an API and the server provides voice. This configuration can also provide the ability for a client who has a module in a language unsupported by the server to use the rest of the server's TTS mechanisms to create a voice in that unsupported language. This approach can be used to cobble together a voice for testing, prototyping, or live services to see how the client's front-end fits together with the server back end before the client and server organizations make a contract to share the components.
Each discrete part of the TTS synthesizer approach 200 shown in
The text analysis front end 204 can model sentence and word melody, as well as stress assignment (all part of prosody) to create symbolic meta-tags that form part of the input to the unit selection module 210. The unit selection module 210 uses the text front end's output stream as a “fuzzy” search query to select the single sequence of speech units from the database that optimally synthesizes the input text. The system can change the sound unit features 212 and store of sound units 220 for each new voice and/or language 214. Then, a signal processing backend 216 concatenates snippets of audio to form the output audio stream that one can listen to, using signal processing to smooth over the concatenation boundaries between snippets, modifying pitch and/or durations in the process, etc. The signal processing backend 216 produces synthesized speech 218 as the “final product” of the left-to-right value chain. Even the identities of the speech units selected by the unit selection module 210 have value, for example, as part of an information stream that can be used as a very low bit-rate representation of speech. Such a low bit-rate representation can be suitable, for example, to communicate by voice with submarines. Another benefit is that the “fuzzy” database search query produced by the text-analysis front end 204 is a compact, but necessarily rich, symbolic representation for how a TTS system developer wants the output to sound.
This approach also makes use of the fact that this front-end 204 and the unit-selection 210 and backend 216 can reside elsewhere and can be produced, operated, and/or owned by separate parties. Accordingly, the boundary between unit selection 210 and signal-processing backend 216 can also be used to choose one or more from a variety of different owners/creators of modules. This approach allows a user to combine proprietary modules that are owned by separate parties for the purpose of forming a complete TTS system over the web, without disclosing one party's intellectual property to the other, as would be necessary to integrate each party's components into a standalone complete TTS system.
In one typical scenario, the linguistic and phonetic expertise for a specific language resides within the country where the specific language is spoken natively such as Azerbaijan, while the expertise for the unit-selection algorithms and signal-processing backend and their implementations might reside in a different country such as the United States. A server can operate the signal processing backend 216 and make the back end available via a comprehensive set of web APIs that allow “merging” different parts of a complete system. This arrangement allows collaboration of different teams across the globe towards a common goal of creating a complete system and allows for optimal use of each team's expertise while keeping each side's intellectual property separate during development.
In another aspect, illustrated in
In such a scenario, the “client” 304 side provides the audio and some meta information 308, for example, about the gender, age, ethnicity, etc. of the speaker to set the proper pitch range. The client 304 can also provide the voice-talent recordings and textual transcriptions that correspond accurately to the audio recordings. The client 304 provides this information to the voice-building procedure 316 of the TTS system 306 exposed to the client by a comprehensive set of APIs. When the voice build procedure completes, the TTS system 306 notifies the client 304 that the TTS voice was successfully built and invites the client 304 to an interactive demo of this voice. The interactive demo can provide, for example, a way for the client to enter arbitrary input text and receive corresponding audio for evaluation purposes, such as before integrating the voice database fully with the production TTS system.
The voice-build procedure 316 of the TTS system 306 includes an acoustic (or other) model training module 310, a segmentation and indexing database 314, and a lexicon 312. The voice-build procedure 316 of the TTS system 306 creates a large index of all speech units in the input set of audio recordings 308. For this, the TTS system 306 first trains a speaker or voice dependent Acoustic Model (AM) for segmenting the audio phonetically via an automatic speech recognizer. In one variation, segmenting includes marking the beginning and end of each phoneme. The speech recognizer can segment each recording in a forced alignment mode where the phoneme sequence to be aligned is derived from the also supplied text that corresponds accurately to what is being said. After creating the index 314, the voice build procedure 316 of the TTS system 306 can also compute other information, such as unit-selection caches to rapidly choose candidate acoustic units or compute unit compatibility or “join” costs, and store the other information in the TTS voice database 314.
The TTS system 306 can communicate data between modules as simple tables, as phonemes plus features, unit numbers plus features, and/or any other suitable data format. These exemplary information formats are compact and easily transferred, enabling practical communication between TTS modules or via a web API. Even if a TTS system 306 modules do not use such a data format naturally, the output they do produce can be rewritten, transcoded, converted, and/or compressed into such a format by interface handler routines, thus making disparate systems interoperable.
The process of creating TTS modules and creating high quality voices is difficult. Writing programs to implement text-analysis frontends can require extensive manual effort, including creating pronunciation dictionaries and/or Letter-to-Sound (LTS) rules, text normalization, and so forth. Voice recordings require high-quality microphone and recording equipment such as those found in recording studios. Segmentation and labeling requires good speech recognition and other software tools.
The principles disclosed herein are applicable to a variety of usage scenarios. One common element in these example scenarios is that two parties team up to overcome the generally high barriers to begin creating a new TTS system for a given language. One barrier in particular is the need for an instantiation of all modules to create audible synthetic speech. Each party uses their different skills to create language modules and voices more efficiently and at a higher quality together than doing it alone. For example, one party may have a legacy language module but no voices. Another party may have voices or recordings but no ability to perform text analysis.
The approaches disclosed herein provide the ability for a client to submit detailed phonetic information to a TTS system instead of pure text, and receive the resulting audio. This approach can be used to perform synthesis based on proprietary language modules, for example, if a client has a legacy (pre-existing) language module.
In another variation, the system introduces additional modules into the original data flow, possibly involving human intervention. For research or commercial purposes, the system can detect and/or correct defects output by one module before passing the data on to the next module. Some examples of programmatic correction include modifying incoming text, performing expansions that the frontend does not handle by default, modifying phonetic input to accommodate varying usage between systems (such as /f ao r/ or /f ow r/ for the word “four”), and injecting pre-tested units to represent specific words or phrases.
For audio that is created once and stored for later playback, a human listener can also judge the resulting audio, modifying data at one or more stages to improve the output. Such tools, often called “prompt sculptors”, can be tightly integrated into the core of a TTS system, but can also be applied to a distributed collection of proprietary modules. Prompt sculptors can, for example, change the prescribed prosody of specific words or phrases before unit selection to increase emphasis, and remember the unit sequences corresponding to good renderings of frequent words and phrases for re-use when that text reappears.
Having disclosed some basic system components and concepts, the disclosure now turns to the exemplary method embodiments shown in
The server extracts sound units from the speech samples based on the transcriptions (404) and generates a web interface, interactive or non-interactive demonstration, standalone file, or other output of the text-to-speech voice based on the sound units, the transcriptions, and the metadata, wherein the interactive demonstration hides a back end processing implementation from the network client (406). The server can also modify one or more of the sound units and the interactive demonstration based on an intervention from a human expert. The text-to-speech voice can be tailored for a specific language or language agnostic.
The server provides access to the interactive demonstration to the network client (408). The server can provide access via a downloadable application, a web-based speech synthesis program, a set of phones, a TTS voice, etc. In one example, the server provides a non-interactive or limited-interaction demonstration in the form of sample synthesized speech. In conjunction with the demonstration, the system can generate a log associated with how at least part of the interactive demonstration was generated and share all or part of the log with the client. The log can provide feedback to the client and guide efforts to tune or otherwise refine the parameters and data input to the server for another iteration. The server can optionally receive an additional request from the network client for the text-to-speech voice and provide the text-to-speech voice to the network client.
In one variation, the system helps the client focus the speech samples to avoid wasted time and effort. For example, the system can analyze the speech samples, determine a coverage hole in the speech samples for a particular purpose, and suggest to the network client a type, category, or particular content of additional speech sample intended to address the coverage hole. Then the client can prepare and submit additional speech samples based on the suggestion. The server and client can iteratively perform these steps until a threshold coverage for the particular purpose is reached. The system can use an iterative algorithm to compare additional audio files and suggest what to cover next, such as a specific vocabulary for a particular domain, for higher efficiency and to avoid repeating things that are not needed or are already done.
The client then receives a notification from the server that the text-to-speech voice is generated (504) and can test or assist a user in testing, via a network, the text-to-speech voice independent of access to and knowledge of internal operations of the server (506). The separation of data and algorithms between a client and a server provides a way for each to evaluate the likelihood of success for a more close collaboration on speech generation without compromising sensitive intellectual property of either party.
Embodiments within the scope of the present disclosure may also include tangible and/or non-transitory computer-readable storage media for carrying or having computer-executable instructions or data structures stored thereon. Such non-transitory computer-readable storage media can be any available media that can be accessed by a general purpose or special purpose computer, including the functional design of any special purpose processor as discussed above. By way of example, and not limitation, such non-transitory computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions, data structures, or processor chip design. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or combination thereof) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of the computer-readable media.
Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, components, data structures, objects, and the functions inherent in the design of special-purpose processors, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
Those of skill in the art will appreciate that other embodiments of the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination thereof) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
The various embodiments described above are provided by way of illustration only and should not be construed to limit the scope of the disclosure. For example, the principles herein can be adapted for use via a web interface, a mobile phone application, or any other network-based embodiment. Those skilled in the art will readily recognize various modifications and changes that may be made to the principles described herein without following the example embodiments and applications illustrated and described herein, and without departing from the spirit and scope of the disclosure.
Number | Name | Date | Kind |
---|---|---|---|
8103509 | Burns et al. | Jan 2012 | B2 |
8352268 | Naik et al. | Jan 2013 | B2 |
20060095848 | Naik | May 2006 | A1 |
20100082328 | Rogers et al. | Apr 2010 | A1 |
20110202344 | Meyer et al. | Aug 2011 | A1 |
Number | Date | Country | |
---|---|---|---|
20120136664 A1 | May 2012 | US |