This application claims priority to Chinese Application No. 202310659465.7 filed Jun. 5, 2023, the disclosure of which is incorporated herein by reference in its entirety.
Embodiments of the disclosure relate to the field of computers, and more specifically, to a method and apparatus for generating a speech translation model, an electronic device, and a medium.
Speech-to-speech translation (S2ST), also known as speech translation, refers to technology that converts speech in one language into speech in another language, and is very helpful in breaking down communication barriers between people who do not speak the same language. Unlike conventional text-to-text machine translation, both the input and the output of the speech translation task are speech. The technology is becoming increasingly important amid the trend of globalization, providing more direct convenience especially in cross-border communication, tourism, business, and other fields.
A speech translation model is typically composed of three tasks: automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). An encoder-decoder structure is often used for sequence-to-sequence learning, and this approach can also be used to learn a direct end-to-end speech translation model. For the end-to-end speech translation model, there may be two research routes based on the prediction form at the target end: a continuous speech spectrum and discrete speech units.
Embodiments of the disclosure provide a method and apparatus for generating a speech translation model, an electronic device, and a computer-readable storage medium.
According to a first aspect of the disclosure, a method for generating a speech translation model is provided, and the speech translation model includes a semantic feature extractor and a plurality of decoders. The method includes extracting, by the semantic feature extractor, a source semantic unit sequence of source language audio and a target semantic unit sequence of target language audio, where the source language audio corresponds to the target language audio. The method further includes adjusting a first decoder of the plurality of decoders based on the source semantic unit sequence and the target semantic unit sequence. The method further includes adjusting a second decoder of the plurality of decoders based on the source semantic unit sequence, the target semantic unit sequence, a source acoustic unit sequence of the source language audio, and a target acoustic unit sequence of the target language audio, where the semantic feature extractor remains unchanged during the adjustment of the first decoder and the second decoder.
According to a second aspect of the disclosure, a method for speech translation is provided. The method is performed by the speech translation model generated according to the first aspect, where the speech translation model includes a semantic feature extractor and a plurality of decoders. The method includes generating a predicted target semantic unit sequence based on a given source semantic unit sequence of given source language audio. The method further includes generating a predicted acoustic unit sequence based on the given source semantic unit sequence of the given source language audio, the predicted target semantic unit sequence, and a given source acoustic unit sequence.
According to a third aspect of the disclosure, an apparatus for generating a speech translation model is provided. The speech translation model includes a semantic feature extractor and a plurality of decoders. The apparatus includes a semantic feature extraction module, configured to extract a source semantic unit sequence of source language audio and a target semantic unit sequence of target language audio, where the source language audio corresponds to the target language audio. The apparatus further includes a first decoding adjustment module, configured to adjust a first decoder of the plurality of decoders based on the source semantic unit sequence and the target semantic unit sequence. The apparatus further includes a second decoding adjustment module, configured to adjust a second decoder of the plurality of decoders based on the source semantic unit sequence, the target semantic unit sequence, a source acoustic unit sequence of the source language audio, and a target acoustic unit sequence of the target language audio, where the semantic feature extractor remains unchanged during the adjustment of the first decoder and the second decoder.
According to a fourth aspect of the disclosure, an apparatus for speech translation is provided. The apparatus is configured to use the speech translation model generated according to the third aspect. The apparatus includes a first predicted semantic determination module, configured to generate a predicted target semantic unit sequence based on a given source semantic unit sequence of given source language audio. The apparatus further includes a first predicted acoustic determination module, configured to generate a predicted acoustic unit sequence based on the given source semantic unit sequence of the given source language audio, the predicted target semantic unit sequence, and a given source acoustic unit sequence.
According to a fifth aspect of the disclosure, an electronic device is provided. The electronic device includes a processor and a memory coupled with the processor. The memory has instructions stored therein, and the instructions, when executed by the processor, enable the electronic device to perform the method according to the first aspect.
According to a sixth aspect of the disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores one or more computer instructions, where the one or more computer instructions, when executed by a processor, implement the method according to the first aspect.
The summary is provided to introduce a selection of concepts in a simplified form, which will be further described in the following specific implementations. The summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
The above and other features, advantages, and aspects of various embodiments of the disclosure will become more apparent in conjunction with the accompanying drawings and with reference to the following detailed descriptions. In the accompanying drawings, the same or similar reference numerals denote the same or similar elements.
It should be understood that before the technical solutions disclosed in various embodiments of the disclosure are used, users should be properly informed, in accordance with relevant laws and regulations, of the types, application scope, usage scenarios, etc. of the personal information (e.g., speech) involved in the disclosure, and the authorization of the users should be obtained.
For example, in response to receiving an active request of the user, a prompt message is sent to the user to clearly inform the user that the requested operation requires obtaining and using the personal information of the user. Therefore, the user may freely select, according to the prompt message, whether to provide the personal information (e.g., speech) to software or hardware, such as an electronic device, an application, a server, or a storage medium, that performs the operations of the technical solutions of the disclosure. It should be understood that the above process of notifying and obtaining user authorization is only illustrative and does not limit the implementations of the disclosure, and other methods that comply with relevant laws and regulations may also be applied to the implementations of the disclosure.
It should be understood that the data (including but not limited to the data itself, and the acquisition or usage of the data) involved in the technical solutions should comply with the requirements of corresponding laws and regulations and relevant stipulations.
The embodiments of the disclosure will be described in more detail below with reference to the accompanying drawings. Although the accompanying drawings show some embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms, and should not be construed as being limited to the embodiments stated herein. On the contrary, these embodiments are provided for a more thorough and complete understanding of the disclosure. It should be understood that the accompanying drawings and the embodiments of the disclosure are for exemplary purposes only, and are not intended to limit the scope of protection of the disclosure.
In the description of the embodiments of the disclosure, the term “include” and similar terms thereof should be understood as open-ended inclusions, that is, “including but not limited to”. The term “based on” should be understood as “at least partially based on.” The term “an embodiment” or “this embodiment” should be understood as “at least one embodiment”. The terms “first”, “second”, etc. may refer to different or identical objects, unless otherwise explicitly specified. Additional explicit and implicit definitions may be included below.
A speech translation model is typically composed of three tasks: speech recognition, machine translation, and text-to-speech. However, dividing the task into three cascaded sub-tasks causes errors from earlier sub-tasks to accumulate in subsequent ones, affecting the final effect of the speech translation task.
An encoder-decoder model architecture is often used for sequence-to-sequence learning, and this approach may also be used to learn a direct end-to-end speech translation model, thereby avoiding the error accumulation of a cascaded scheme. However, using the encoder-decoder architecture results in a very large number of model parameters, low training efficiency, and high engineering implementation requirements.
To solve the above problems, an embodiment of the disclosure provides a speech translation scheme that trains a decoder-only model architecture. In the scheme, only a plurality of decoders are trained when training the speech translation model, thereby reducing the number of parameters in the training process, improving training efficiency, lowering engineering implementation requirements, and achieving a better speech translation effect. The scheme uses one decoder to convert semantic units of the source audio. Because the conversion from audio into text is avoided, the decoder can process not only an ordinary written language but also an unwritten language (e.g., a dialect or a minority language). Meanwhile, another decoder is used to synthesize audio from the semantic units, incorporating acoustic units from the source audio during processing, thereby learning source audio characteristics, speech patterns, etc.
In the computing device 110, a speech translation model 120 may also be included. For example, the speech translation model 120 is deployed in the computing device 110. The speech translation model 120 may be configured to generate, based on the source audio 160, translation results of the source audio 160, namely target audio 170-1, target audio 170-2, target audio 170-3, and target audio 170-4 (individually or collectively referred to as target audio 170). Each source audio in the source audio 160 corresponds to a respective target audio in the target audio 170. For example, the speech translation model 120 may translate the source audio 160-1 into the target audio 170-1, and translate the source audio 160-2 into the target audio 170-2. In some embodiments, the speech translation model 120 may be obtained through unsupervised training based on a decoder-only architecture, using the source audio 160 and the target audio 170 to construct input data.
Referring to
The speech translation model 120 further includes the decoder 140. The decoder 140 may convert a source semantic unit sequence of source language audio into a target semantic unit sequence of target language audio. During training of the decoder 140, both the source semantic unit sequence and the target semantic unit sequence from the training data are simultaneously inputted into the decoder 140, and the decoder 140 is trained through unsupervised training.
The speech translation model 120 further includes the decoder 145. The decoder 145 may restore timbre information from the source audio 160 into the target semantic unit sequence generated by the decoder 140, so that the timbre information is ultimately preserved in the generated target audio 170. It should be understood that in some embodiments, the speech translation model 120 may still achieve the function of speech translation without the decoder 145.
The speech translation model 120 further includes the decoder 150. The decoder 150 may convert the target semantic unit sequence of the target language audio into a target acoustic unit sequence of the target language audio. During training of the decoder 150, the source semantic unit sequence, the target semantic unit sequence, a source acoustic unit sequence, and the target acoustic unit sequence from the training data are simultaneously inputted into the decoder 150 for model adjustment through unsupervised training, where the source acoustic unit sequence and the target acoustic unit sequence may be extracted using an audio stream encoder.
Both the decoder 140 and the decoder 150 may be language models that use only a decoder structure, and may therefore be trained in an unsupervised manner with a prompt sequence that includes source audio information and target audio information. In the unsupervised training process, parameters of the decoder 140 and the decoder 150 are adjusted by maximizing the probability of the next unit predicted from left to right over the prediction sequence. In an inference stage, a prompt sequence that includes the source audio information may be inputted into the decoder 140 and the decoder 150 to generate a target acoustic unit sequence associated with the target audio.
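For illustration only, the following sketch shows one way such a left-to-right next-unit prediction objective for a decoder-only language model might be implemented. It is a minimal example assuming PyTorch and an autoregressive decoder module that returns per-position logits over the unit vocabulary; all names are illustrative assumptions and are not part of the disclosure.

```python
# Minimal sketch (illustrative, not the claimed implementation): unsupervised
# next-unit prediction over a prompt sequence of integer unit IDs.
import torch
import torch.nn.functional as F

def next_unit_loss(decoder: torch.nn.Module, prompt_units: torch.Tensor) -> torch.Tensor:
    """prompt_units: (batch, seq_len) tensor of unit IDs forming the prompt sequence."""
    inputs = prompt_units[:, :-1]    # units observed so far, left to right
    targets = prompt_units[:, 1:]    # the next unit to be predicted at each position
    logits = decoder(inputs)         # assumed shape: (batch, seq_len - 1, vocab_size)
    # Maximizing the probability of each next unit is equivalent to minimizing
    # the cross-entropy between the predicted logits and the actual next units.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```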
It should be understood that the architecture and functions in the example environment 100 are described for illustrative purposes only, and do not imply any limitations on the scope of the disclosure. This embodiment of the disclosure may also be applied to other environments with different structures and/or functions.
The process according to this embodiment of the disclosure is described in detail in conjunction with
In a block 204, a first decoder of the plurality of decoders is adjusted based on the source semantic unit sequence and the target semantic unit sequence. For example, referring to
In a block 206, a second decoder of the plurality of decoders is adjusted based on the source semantic unit sequence, the target semantic unit sequence, the source acoustic unit sequence of the source language audio, and the target acoustic unit sequence of the target language audio, where the semantic feature extractor remains unchanged during the adjustment of the first decoder and the second decoder. For example, referring to
Therefore, according to the method 200 in this embodiment of the disclosure, when the speech translation model is generated, only the plurality of decoders in the speech translation model are trained, thereby reducing the number of model training parameters, improving training efficiency, lowering engineering implementation requirements such as hardware configuration, and also improving the speech translation effect.
The training process of the plurality of decoders is specifically described below in conjunction with
The source semantic unit sequence 308 and the target semantic unit sequence 310 are processed by a compression module 312 to obtain a compressed source semantic unit sequence 314 and a compressed target semantic unit sequence 316. For example, because the audio may contain repetitions, pauses, etc., the compression module 312 removes such redundant information, thereby improving the effect of the translation decoder 322. However, it should be understood that the compression module 312 is optional. In other words, in some embodiments, the source semantic unit sequence 308 and the target semantic unit sequence 310 may not be compressed and may be directly inputted into the translation decoder 322 for training.
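As a purely illustrative sketch, one simple form of such compression is to collapse runs of consecutive identical semantic units; the compression module 312 is not limited to this behavior, and the function below is an assumption for illustration.

```python
# Illustrative sketch: compress a semantic unit sequence by collapsing runs of
# consecutive identical units (e.g., repetitions and pauses in the audio).
from itertools import groupby

def compress_units(units: list[int]) -> list[int]:
    """Example: [5, 5, 5, 9, 9, 2] -> [5, 9, 2]."""
    return [unit for unit, _ in groupby(units)]
```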
The compressed source semantic unit sequence 314, the compressed target semantic unit sequence 316, and task information 318 may be combined to obtain a prompt sequence 320 (also referred to as a first prompt sequence). For example, when the source audio 302 is English audio, an example of the prompt sequence 320 may be "Translate English unit {src_unit} to Chinese unit {tgt_unit}", where the task information 318 is "Translate English unit . . . to Chinese unit . . . ", which instructs the translation decoder 322 to perform the task of converting an English unit to a Chinese unit. It should be understood that the language type may vary as needed, where "{src_unit}" represents the compressed source semantic unit sequence 314, and "{tgt_unit}" represents the compressed target semantic unit sequence 316. It should be understood that placing the task information 318 at the leftmost side of the prompt sequence 320 in
Parameters of the translation decoder 322 are trained and optimized by inputting the prompt sequence 320 into the translation decoder 322 for unsupervised training. For example, the translation decoder 322 may be subjected to unsupervised training based on the prompt sequence 320: the translation decoder 322 is trained by predicting, from left to right, the next prompt unit in the prompt sequence 320 and maximizing its probability.
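For illustration, the first prompt sequence described above might be assembled as follows; the template text mirrors the example given, while the function name and the space-separated unit formatting are assumptions.

```python
# Illustrative assembly of the first prompt sequence for the translation decoder.
def build_translation_prompt(src_units: list[int], tgt_units: list[int],
                             src_lang: str = "English", tgt_lang: str = "Chinese") -> str:
    src = " ".join(str(u) for u in src_units)   # compressed source semantic units
    tgt = " ".join(str(u) for u in tgt_units)   # compressed target semantic units
    return f"Translate {src_lang} unit {src} to {tgt_lang} unit {tgt}"

# During training, the full prompt (including the target units) is fed to the
# translation decoder with the next-unit prediction objective; at inference time,
# the portion after "to {tgt_lang} unit" is left empty and generated autoregressively.
```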
The source audio 402 is processed by an acoustic unit extractor 418 to obtain a source acoustic unit sequence 420, where the source acoustic unit sequence 420 represents acoustic characteristics related to the source audio 402, such as a frequency, a timbre, and a speech pattern. Similarly, the target audio 404 is processed by the acoustic unit extractor 418 to obtain a target acoustic unit sequence 422.
By combining the compressed source semantic unit sequence 414, the compressed target semantic unit sequence 416, the source acoustic unit sequence 420, and the target acoustic unit sequence 422, a prompt sequence 424 (also referred to as a second prompt sequence) is obtained. Parameters of a synthesis decoder 426 are trained and optimized by inputting the prompt sequence 424 into the synthesis decoder 426 for unsupervised training. For example, the synthesis decoder 426 may be subjected to unsupervised training based on the prompt sequence 424: the synthesis decoder 426 is trained by predicting, from left to right, the next prompt unit in the prompt sequence 424 and maximizing its probability.
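The layout below is an illustrative sketch of the second prompt sequence for the synthesis decoder; only the concatenation order follows the description above, and the separator tokens and unit prefixes are assumptions.

```python
# Illustrative layout of the second prompt sequence: compressed source/target
# semantic units followed by source/target acoustic units.
def build_synthesis_prompt(src_sem: list[int], tgt_sem: list[int],
                           src_ac: list[int], tgt_ac: list[int]) -> list[str]:
    return (["<src_sem>"] + [f"s{u}" for u in src_sem]
            + ["<tgt_sem>"] + [f"s{u}" for u in tgt_sem]
            + ["<src_ac>"] + [f"a{u}" for u in src_ac]
            + ["<tgt_ac>"] + [f"a{u}" for u in tgt_ac])
```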
As shown in
By combining the compressed source semantic unit sequence 514, the compressed target semantic unit sequence 516, the source timing value sequence 518, and the target timing value sequence 520, a prompt sequence 522 is obtained. Parameters of the timing decoder 524 are trained and optimized by inputting the prompt sequence 522 into the timing decoder 524 for unsupervised training. It should be understood that the timing decoder, like the translation decoder and the synthesis decoder, is trained based on a language modeling method; specifically, training is performed by predicting the next unit in a constructed input sequence.
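For illustration only, one plausible way to obtain such timing value sequences is to record the duration (run length) of each compressed semantic unit from the uncompressed unit sequence; the disclosure does not limit the timing values to this form, and the sketch below is an assumption.

```python
# Hypothetical sketch: derive one timing value (run length in frames) per
# compressed semantic unit from the uncompressed unit sequence.
from itertools import groupby

def timing_values(uncompressed_units: list[int]) -> list[int]:
    """Example: [5, 5, 5, 9, 2, 2] -> [3, 1, 2]."""
    return [len(list(run)) for _, run in groupby(uncompressed_units)]
```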
In 620, text translation data is acquired. It should be understood that a text translation task involves text conversion between two languages without audio conversion. A data format of the text translation data is <src_text, tgt_text>, which means converting a source language text to a target language text. In 625, a text translation prompt sequence is generated. For example, the text translation prompt sequence may be "Translate [src lang] text {src_text} to [tgt lang] text {tgt_text}". The prompt sequence is used for prompting the translation decoder 640 to translate the source language text to the target language text, where "src lang" represents the language type of the source language, and "tgt lang" represents the language type of the target language. For example, the source language may be English, and the target language may be Chinese. After the text translation prompt sequence is generated, it is used for adjusting the translation decoder 640 to optimize the parameters of the translation decoder 640.
In 630, speech-to-speech data is acquired. It should be understood that a speech-to-speech task involves direct conversion between speech in two languages. A format of the speech-to-speech data is <src_unit, tgt_unit, src_text, tgt_text>, including a source semantic unit, a target semantic unit, a source text, and a target text. In 635, a speech-to-speech prompt sequence is generated. Typically, the speech-to-speech prompt sequence is "Translate [src lang] unit {src_unit} to [tgt lang] unit {tgt_unit}". The prompt sequence is used for prompting the translation decoder 640 to translate the source semantic unit to the target semantic unit. After the speech-to-speech prompt sequence is generated, it is used for adjusting the translation decoder 640 to optimize the parameters of the translation decoder 640. Because the speech-to-speech data includes four types of data in total, namely a source semantic unit, a target semantic unit, a source text, and a target text, the following prompt sequences may also be constructed:
“Translate [src lang] unit {src_unit} to [src lang] text {src_text}”, and the prompt sequence is used for prompting the translation decoder 640 to translate the source semantic unit to the source text;
“Translate [src lang] unit {src_unit} to [tgt lang] text {tgt_text}”, and the prompt sequence is used for prompting the translation decoder 640 to translate the source semantic unit to the target text;
“Translate [src lang] text {src_text} to [tgt lang] unit {tgt_unit}”, and the prompt sequence is used for prompting the translation decoder 640 to translate the source text to the target semantic unit; and
“Translate [tgt lang] text {tgt_text} to [tgt lang] unit {tgt_unit}”, and the prompt sequence is used for prompting the translation decoder 640 to translate the target text to the target semantic unit.
It can be seen that by utilizing the multi-task learning scheme, data of various task types, such as any one or more of the speech recognition data 610, the text translation data 620, and the speech-to-speech data 630, may be used to adjust the parameters of the translation decoder 640, thereby improving the performance and effect of the translation decoder 640.
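The sketch below illustrates how the multi-task prompt templates listed above might be filled in; the dictionary keys, function name, and field formatting are illustrative assumptions rather than part of the disclosure.

```python
# Illustrative multi-task prompt construction for the translation decoder 640.
def build_prompt(task: str, fields: dict[str, str], src_lang: str, tgt_lang: str) -> str:
    templates = {
        "text_to_text":     "Translate [{sl}] text {src_text} to [{tl}] text {tgt_text}",
        "unit_to_unit":     "Translate [{sl}] unit {src_unit} to [{tl}] unit {tgt_unit}",
        "unit_to_src_text": "Translate [{sl}] unit {src_unit} to [{sl}] text {src_text}",
        "unit_to_tgt_text": "Translate [{sl}] unit {src_unit} to [{tl}] text {tgt_text}",
        "src_text_to_unit": "Translate [{sl}] text {src_text} to [{tl}] unit {tgt_unit}",
        "tgt_text_to_unit": "Translate [{tl}] text {tgt_text} to [{tl}] unit {tgt_unit}",
    }
    return templates[task].format(sl=src_lang, tl=tgt_lang, **fields)

# Example usage (all values are made up for illustration):
# build_prompt("unit_to_unit",
#              {"src_unit": "12 7 98", "tgt_unit": "4 31 4"},
#              "English", "Chinese")
```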
Then, processing continues with a unit-to-speech module 730. For example, the compressed target semantic unit sequence is processed by a decoder 732 to obtain a target semantic unit sequence, thereby restoring timbre information from the source audio 710 into the target semantic unit sequence. Then, an acoustic feature extractor 740 is utilized to extract a source acoustic unit sequence from the source audio; the source semantic unit sequence, the target semantic unit sequence, and the source acoustic unit sequence are combined into another prompt sequence; and the another prompt sequence is inputted into a synthesis decoder 734 to generate a target acoustic unit sequence. An acoustic unit converter 750 processes the target acoustic unit sequence to obtain predicted target audio 740.
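For illustration, the inference flow described above might be organized as follows; every component interface here (the extractors, decoders, and converter) is a hypothetical placeholder rather than an API defined by the disclosure.

```python
# Hypothetical end-to-end inference sketch mirroring the description above.
def translate_speech(source_audio,
                     semantic_extractor, compressor, translation_decoder,
                     unit_to_speech_decoder, acoustic_extractor,
                     synthesis_decoder, acoustic_unit_converter):
    src_sem = semantic_extractor(source_audio)                # source semantic unit sequence
    src_sem_c = compressor(src_sem)                           # compressed source semantic units
    tgt_sem_c = translation_decoder.generate(src_sem_c)       # compressed target semantic units
    tgt_sem = unit_to_speech_decoder(tgt_sem_c, source_audio) # restore timbre information
    src_ac = acoustic_extractor(source_audio)                 # source acoustic unit sequence
    tgt_ac = synthesis_decoder.generate(src_sem_c, tgt_sem, src_ac)  # target acoustic units
    return acoustic_unit_converter(tgt_ac)                    # predicted target audio
```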
Therefore, according to the apparatus 800 in the disclosure, when the speech translation model is generated, only the decoders are trained, thereby reducing the number of model training parameters, improving training efficiency, lowering engineering implementation requirements such as hardware configuration, and also improving the speech translation effect.
A plurality of components in the device 1100 are connected to the I/O interface 1105, including an input unit 1106 such as a keyboard and a mouse; an output unit 1107 such as various types of displays and speakers; the storage unit 1108 such as a disk and an optical disc; and a communication unit 1109 such as a network card, a modem, and a wireless communication transceiver. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the Internet, and/or various telecommunication networks.
The various methods or processes described above may be executed by the CPU/GPU 1101. For example, in some embodiments, the method may be implemented as a computer software program that is tangibly included in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the CPU/GPU 1101, one or more steps or actions of the method or process described above may be performed.
In some embodiments, the methods and processes described above may be implemented as a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for performing various aspects of the disclosure.
The computer-readable storage medium may be a tangible device that may retain and store instructions used by an instruction-executing device. The computer-readable storage medium may be, for example, but is not limited to, an electric storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as a punch card or a raised structure in a groove with instructions stored therein, or any suitable combination of the above. The computer-readable storage medium used here is not to be interpreted as transient signals, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagated through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through wires.
The computer-readable program instructions described herein may be downloaded from the computer-readable storage medium to various computing/processing devices or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in the computer-readable storage medium in each computing/processing device.
The computer program instructions for performing the operation of the disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, status setting data, or source code or object code written in any combination of one or more programming languages, where the programming languages include object-oriented programming languages and conventional procedural programming languages. The computer-readable program instructions may be executed entirely on a user computer, partly on the user computer, as a stand-alone software package, partly on the user computer and partly on a remote computer, or entirely on the remote computer or the server. In the case of involving the remote computer, the remote computer may be connected to the user computer through any type of network, including a local area network (LAN) or wide area network (WAN), or may be connected to the external computer (e.g., utilizing an Internet service provider for Internet connectivity). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is customized by utilizing status information of the computer-readable program instructions. The electronic circuit may execute the computer-readable program instructions so as to implement various aspects of the disclosure.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, thereby producing a machine, such that these instructions, when executed by the processing unit of the computer or the another programmable data processing apparatus, produce an apparatus for implementing functions/actions specified in one or more blocks in the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in the computer-readable storage medium, and these instructions make the computer, the programmable data processing apparatus, and/or another device operate in a specific method; and therefore, the computer-readable medium having instructions stored therein includes an article of manufacture that includes instructions for implementing various aspects of the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded to the computer, the another programmable data processing apparatus, or the another device, such that a series of operating steps are performed on the computer, the another programmable data processing apparatus, or the another device to produce a computer-implemented process, and accordingly, the instructions executed on the computer, the another programmable data processing apparatus, or the another device implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.
The flowcharts and the block diagrams in the accompanying drawings illustrate the system architectures, functions, and operations that may be implemented by the device, the method, and the computer program product according to the various embodiments of the disclosure. In this regard, each block in the flowcharts or the block diagrams may represent a module, a program segment, or a portion of code, and the module, program segment, or portion of code includes one or more executable instructions for implementing specified logical functions. In some alternative implementations, the functions marked in the blocks may also occur in an order different from that marked in the accompanying drawings. For example, two successive blocks may actually be executed substantially in parallel, and sometimes may also be executed in a reverse order, depending on the functions involved. It should be further noted that each block in the block diagrams and/or the flowcharts, as well as a combination of the blocks in the block diagrams and/or the flowcharts, may be implemented by utilizing a dedicated hardware-based system that executes specified functions or actions, or by using a combination of special hardware and computer instructions.
The embodiments of the disclosure have been described above. The above description is illustrative rather than exhaustive, and is not limited to the disclosed embodiments. Numerous modifications and alterations are apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terms used herein are selected to best explain the principles and practical applications of the various embodiments, or improvements to technologies in the market, or to enable other persons of ordinary skill in the art to understand the various embodiments disclosed herein.
Some example implementations of the disclosure are listed below.
Example 1. A method for generating a speech translation model, where the speech translation model includes a semantic feature extractor and a plurality of decoders, and the method includes:
Example 2. The method according to example 1, where adjusting the first decoder of the plurality of decoders includes:
Example 3. The method according to any one of examples 1 to 2, where adjusting the second decoder of the plurality of decoders includes:
Example 4. The method according to any one of examples 1 to 3, further includes:
Example 5. The method according to any one of examples 1 to 4, further includes:
Example 6. The method according to any one of examples 1 to 5, where the semantic feature extractor includes any one of an unsupervised model and a cluster model.
Example 7. The method according to any one of examples 1 to 6, further includes adjusting the first decoder using multi-task learning, where the multi-task learning includes at least one of the following:
Example 8. The method according to any one of examples 1 to 7, where at least one of the source language audio and the target language audio includes an unwritten language, and the unwritten language has no written text.
Example 9. A method for speech translation, where the method is performed by the speech translation model generated by any one of examples 1 to 8, the speech translation model includes a semantic feature extractor and a plurality of decoders, and the method includes:
Example 10. The method according to example 9, where generating the predicted target semantic unit sequence includes:
Example 11. The method according to any one of examples 9 to 10, where generating the predicted acoustic unit sequence includes:
Example 12. The method according to any one of examples 9 to 11, further includes:
Example 13. An apparatus for generating a speech translation model, where the speech translation model includes a semantic feature extractor and a plurality of decoders, and the apparatus includes:
Example 14. The apparatus according to example 13, where the first decoding adjustment module includes:
Example 15. The apparatus according to any one of examples 13 to 14, where the second decoding adjustment module includes:
Example 16. The apparatus according to any one of examples 13 to 15, further includes:
Example 17. The apparatus according to any one of examples 13 to 16, further includes:
Example 18. The apparatus according to any one of examples 13 to 17, where the semantic feature extraction module includes one of an unsupervised model and a cluster model.
Example 19. The apparatus according to any one of examples 13 to 18, further includes:
Example 20. The apparatus according to any one of examples 13 to 19, where at least one of the source language audio and the target language audio includes an unwritten language, and the unwritten language has no written text.
Example 21. An apparatus for speech translation, where the apparatus is configured to use the speech translation model generated by the apparatus according to any one of examples 13 to 20, the speech translation model includes a semantic feature extractor and a plurality of decoders, and the apparatus includes:
Example 22. The apparatus according to example 21, where a first target semantic determination module includes:
Example 23. The apparatus according to any one of examples 21 to 22, where a first target acoustic determination module includes:
Example 24. The apparatus according to any one of examples 21 to 23, further includes:
Example 25. An electronic device, includes:
Example 26. The electronic device according to example 25, where adjusting the first decoder of the plurality of decoders includes:
Example 27. The electronic device according to any one of examples 25 to 26, where adjusting the second decoder of the plurality of decoders includes:
Example 28. The electronic device according to any one of examples 25 to 27, where the actions further include:
Example 29. The electronic device according to any one of examples 25 to 28, where the actions further include:
Example 30. The electronic device according to any one of examples 25 to 29, where the semantic feature extractor includes one of an unsupervised model and a cluster model.
Example 31. The electronic device according to any one of examples 25 to 30, further includes adjusting the first decoder using multi-task learning, where the multi-task learning includes at least one of the following:
Example 32. The electronic device according to any one of examples 25 to 31, where at least one of the source language audio and the target language audio includes an unwritten language, and the unwritten language has no written text.
Example 33. An electronic device, includes:
Example 34. The electronic device according to example 33, where generating the predicted target semantic unit sequence includes:
Example 35. The electronic device according to any one of examples 33 to 34, where generating the predicted acoustic unit sequence includes:
Example 36. The electronic device according to any one of examples 33 to 35, where the actions further include:
Example 37. A computer-readable storage medium, has one or more computer instructions stored therein, where the one or more computer instructions, when executed by a processor, implement the method according to any one of examples 1 to 12.
Example 38. A computer program product, where the computer program product is tangibly stored in a computer-readable medium and includes computer-executable instructions, and the computer-executable instructions, when executed by a device, enable the device to perform the method according to any one of examples 1 to 12.
Although the disclosure has been described in language specific to structural features and/or method logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. On the contrary, the specific features and actions described above are merely example forms for implementing the claims.
Number | Date | Country | Kind
---|---|---|---
202310659465.7 | Jun. 5, 2023 | CN | national