This application is based on and claims priority under 35 U.S.C. § 119 to Korean Patent Application No. 10-2023-0178603, filed on Dec. 11, 2023, in the Korean Intellectual Property Office, the disclosure of which is herein incorporated by reference in its entirety.
The disclosure relates to a speech synthesis technology, and more particularly, to a speech synthesis method and system for supporting a real-time conversation model which outputs a response utterance in the unit of a text rather than in the unit of a sentence.
Related-art speech synthesis systems receive an input and synthesize a speech in the unit of a sentence.
However, recently, there have been attempts to generate conversation content in real time, in the unit of a text shorter than a sentence, through a conversation model. This method has the merit of outputting a conversation in a unit shorter than a sentence while concurrently analyzing and responding to the utterance content of a user, so that a more human-like conversation can be modeled.
However, related-art speech synthesis technologies still receive an input in the unit of a sentence. Accordingly, even when a conversation model generates a text in real time, no speech is generated until the sentence is completed and passed to the speech synthesizer.
The disclosure has been developed in order to solve the above-described problems, and an object of the disclosure is to provide a method and a system for synthesizing a speech in real time while continuously receiving texts, each a unit shorter than a sentence, as a solution for synthesizing a speech without delay from the output of a real-time conversation model.
According to an embodiment of the disclosure to achieve the above-described object, a real-time speech synthesis method may include: a step of outputting, by a conversation model, a sentence for responding to an utterance of a user in the unit of a text; and a step of synthesizing, by a speech synthesis model, a speech in the unit of the outputted text.
The step of outputting may include outputting texts constituting the sentence even before the sentence for responding is completed.
The step of synthesizing may include synthesizing a speech from the outputted texts even before all of the texts constituting the sentence are outputted at the step of outputting.
The real-time speech synthesis method may further include a step of outputting, by a speech output module, the synthesized speech in the unit of a text. The conversation model may be a machine learning model that is trained to receive an utterance of a user and to generate a sentence for responding to the utterance.
The step of outputting may include outputting the synthesized speech while a user is uttering.
The speech synthesis model may be a machine learning model that is trained to receive a text and to synthesize a speech.
The speech synthesis model may further receive a part of a text that was synthesized into a speech in a previous section, and may synthesize a speech.
The speech synthesis model may further receive a part of a text that is to be synthesized into a speech in a next section, and may synthesize a speech.
According to another aspect of the disclosure, there is provided a real-time speech synthesis system including: a conversation model configured to output a sentence for responding to an utterance of a user in the unit of a text; and a speech synthesis model configured to synthesize a speech in the unit of the text outputted from the conversation model.
According to still another aspect of the disclosure, there is provided a real-time speech synthesis method including: a step of synthesizing, by a speech synthesis model, a speech in the unit of a text with respect to a sentence which is outputted from a conversation model in the unit of a text; and a step of outputting, by a speech output module, the synthesized speech in the unit of a text.
According to yet another aspect of the disclosure, there is provided a real-time speech synthesis system including: a speech synthesis model configured to synthesize a speech in the unit of a text with respect to a sentence which is outputted from a conversation model in the unit of a text; and a speech output module configured to output the speech synthesized in the speech synthesis model in the unit of a text.
As described above, according to embodiments of the disclosure, a text which is a unit shorter than a sentence is continuously received and immediately synthesized into a speech, so that speech synthesis can be performed in real time, without delay, at the speed of the real-time conversation model even when a conversation is generated in the unit of a text.
According to embodiments of the disclosure, conversation generation and speech synthesis are performed in the unit of a text shorter than a sentence, so that a memory capacity necessary for a system can be reduced and a processing speed can be enhanced.
Other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the invention.
Before undertaking the DETAILED DESCRIPTION OF THE INVENTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like. Definitions for certain words and phrases are provided throughout this patent document; those of ordinary skill in the art should understand that in many, if not most instances, such definitions apply to prior, as well as future, uses of such defined words and phrases.
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
Hereinafter, the disclosure will be described in more detail with reference to the accompanying drawings.
Embodiments of the disclosure provide a streaming speech synthesis method and system for supporting a real-time conversation model. The disclosure relates to a technology for continuously generating and outputting a speech in the unit of a text without delay when generating a conversation in the unit of a text shorter than a sentence in a real-time conversation model.
The real-time conversation model 110 is a machine learning model that is trained to receive an utterance of a user and to generate and output a sentence for responding to the user utterance in the unit of a text, or a processor for executing the machine learning model. The term “text” used herein refers to a unit shorter than a sentence, for example, a word, a phrase, or a clause.
The real-time conversation model 110 generates and outputs a sentence for responding to a user in the unit of a text, that is, outputs a sentence in the unit of a text in the middle of generating the sentence. That is, the real-time conversation model 110 may output texts constituting a sentence even when the sentence for responding is not fully completed.
Since the output of the real-time conversation model 110 is composed of texts, the input to the streaming speech synthesis model 120, the speech synthesis, and the output of the synthesized speech may all be performed in the unit of a text.
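For illustration only, this per-text flow can be sketched in Python as follows. The names stream_response, synthesize_text, and play_audio, as well as the example chunking, are hypothetical placeholders standing in for the three components, not the claimed implementation.

```python
from typing import Iterator

def stream_response(utterance: str) -> Iterator[str]:
    """Stand-in for the real-time conversation model (110): yields a
    response sentence text-by-text (word / phrase / clause) as each
    unit is generated, without waiting for the full sentence."""
    for text in ["Nice to", "meet you;", "how can I", "help you", "today?"]:
        yield text  # each unit becomes available mid-sentence

def synthesize_text(text: str) -> bytes:
    """Stand-in for the streaming speech synthesis model (120):
    synthesizes a speech for exactly one text unit."""
    return f"<audio:{text}>".encode()

def play_audio(audio: bytes) -> None:
    """Stand-in for the speech output module (130)."""
    print("playing", audio.decode())

# Input, speech synthesis, and speech output all proceed per text unit:
for text in stream_response("How are you?"):
    play_audio(synthesize_text(text))  # no waiting for sentence completion
```

In this sketch, each loop iteration corresponds to one section Ti in the description below.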
Accordingly, 1) in a T1 section, a text F1 is generated and outputted from the real-time conversation model 110, and a corresponding voice I1 is synthesized and outputted from the streaming speech synthesis model 120,
2) in a T2 section, a text F2 is generated and outputted from the real-time conversation model 110, and a corresponding voice I2 is synthesized and outputted from the streaming speech synthesis model 120,
3) in a T3 section, a text F3 is generated and outputted from the real-time conversation model 110, and a corresponding voice I3 is synthesized and outputted from the streaming speech synthesis model 120,
. . .
7) in a T7 section, a text F7 is generated and outputted from the real-time conversation model 110, and a corresponding voice I7 is synthesized and outputted from the streaming speech synthesis model 120.
That is, speech synthesis is performed by the streaming speech synthesis model 120 in real time in the unit of a text even when all of the texts constituting a sentence have not yet been outputted from the real-time conversation model 110.
If speech synthesis is performed by the streaming speech synthesis model 120 in the unit of a sentence, speech synthesis may start after all of the texts F1, F2, F3, . . . , F7 constituting the sentence are outputted from the real-time conversation model 110, and thus, a time delay corresponding to T1 to T7 may occur.
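As a rough sketch of this timing difference (with hypothetical 0.1-second generation sections, and print statements standing in for synthesis and output), the text-unit variant produces its first output after roughly T1, while the sentence-unit variant produces nothing until all of T1 to T7 have elapsed:

```python
import time

def generate_texts():
    """Stand-in for the real-time conversation model: emits the texts
    F1..F7, one per generation section Ti (assumed to be 0.1 s each)."""
    for i in range(1, 8):
        time.sleep(0.1)
        yield f"F{i}"

# Text-unit (streaming) synthesis: first output after roughly T1.
start = time.time()
for text in generate_texts():
    print(f"{time.time() - start:.1f}s: synthesized speech for {text}")

# Sentence-unit synthesis: must buffer F1..F7 first, so the first
# output appears only after roughly T1 + ... + T7.
start = time.time()
sentence = " ".join(generate_texts())
print(f"{time.time() - start:.1f}s: synthesized speech for '{sentence}'")
```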
Since the response generation by the real-time conversation model 110 and the speech synthesis by the streaming speech synthesis model 120 are performed in the unit of a text rather than in the unit of a sentence, the speech output module 130 may output a synthesized speech even before the user finishes uttering.
Specifically, in synthesizing a speech I2 in a T2 section, a rear part of a previous text F1 is inputted to the streaming speech synthesis model 120 in addition to a current text F2, and is used for the synthesis.
This is to prevent a frame gap that may occur at a connection between texts. The length of the rear part of the previous text added to the current text is limited to shorter than half of the length of the current text.
Furthermore, it is also possible to add a front part of the next text to the current text, in addition to adding the rear part of the previous text. In this case, speech synthesis for the current text may be delayed until the next text is outputted, but smoother speech synthesis may be achieved.
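A minimal sketch of this context handling, assuming the texts are plain strings and the "rear part" and "front part" are measured in characters; only the half-length cap on the rear part is stated above, so the cap on the lookahead portion is an illustrative assumption:

```python
from typing import Optional

def build_synthesis_input(prev_text: str, cur_text: str,
                          next_text: Optional[str] = None) -> str:
    """Assemble the input for synthesizing the current text.

    A rear part of the previous text is prepended to avoid a frame gap
    at the boundary; its length is kept shorter than half the length of
    the current text. Optionally, a front part of the next text is
    appended, at the cost of waiting for that text to be generated.
    """
    tail_len = max(0, len(cur_text) // 2 - 1)  # shorter than half of cur_text
    tail = prev_text[-tail_len:] if prev_text and tail_len else ""
    head = next_text[: len(cur_text) // 2] if next_text else ""  # assumed cap
    return tail + cur_text + head

# Example: synthesizing I2 for F2 with the rear part of F1 as context.
print(build_synthesis_input("nice to", "meet you"))           # rear part only
print(build_synthesis_input("nice to", "meet you", "today"))  # with lookahead
```

In practice, the overlap would more likely be measured in frames or phonemes rather than characters; characters merely keep the sketch self-contained.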
Up to now, a streaming speech synthesis method and system for supporting a real-time conversation model has been described in detail with reference to preferred embodiments.
In the above-described embodiments, speech synthesis is performed in the unit of a text shorter than a sentence while an input is continuously received in the unit of a text, so that the input unit of the streaming speech synthesis model matches the output unit of the real-time conversation model, and a service can be provided while minimizing a system delay through real-time interoperation between the two models.
Furthermore, in generating a speech, only the inputs neighboring the speech to be generated are considered, rather than all of the inputs of a sentence, so that the memory capacity necessary for inference is reduced and, in proportion to this, the amount of calculation is reduced.
In addition, a part of a previous text or a next text is added to the input text, so that continuity between synthesized speeches is improved without greatly increasing the algorithmic delay.
The technical concept of the disclosure may be applied to a computer-readable recording medium which records a computer program for performing the functions of the apparatus and the method according to the present embodiments. In addition, the technical idea according to various embodiments of the disclosure may be implemented in the form of a computer readable code recorded on the computer-readable recording medium. The computer-readable recording medium may be any data storage device that can be read by a computer and can store data. For example, the computer-readable recording medium may be a read only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical disk, a hard disk drive, or the like. A computer readable code or program that is stored in the computer readable recording medium may be transmitted via a network connected between computers.
In addition, while preferred embodiments of the present disclosure have been illustrated and described, the present disclosure is not limited to the above-described specific embodiments. Various changes can be made by a person skilled in the art without departing from the scope of the present disclosure claimed in the claims, and such changed embodiments should not be understood as being separate from the technical idea or prospect of the present disclosure.