This application claims the priority benefit of Taiwan application serial no. 110122715, filed on Jun. 22, 2021. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to a speech conference technology, particularly to a conference terminal and an embedding method of audio watermarks.
Remote conferences enable people at different locations or in different spaces to have conversations, and conference-related equipment, protocols, and/or applications are also well developed. It is worth noting that some real-time conference programs may synthesize speech signals and audio watermark signals. However, speech signal processing technologies (for example, frequency band filtering, noise suppression, dynamic range compression (DRC), echo cancellation, etc.) are generally designed for general speech signals, retaining only speech signals while removing non-speech signals. If the speech signal and the audio watermark signal undergo the same speech signal processing on the signal transmission path, the audio watermark signal may be treated as noise or as a non-speech signal and thus be filtered out.
In this light, the embodiments of the present disclosure provide a conference terminal and an embedding method of audio watermarks. The audio watermark is embedded in the terminal to retain the audio watermark through multiple paths.
The embedding method of audio watermarks in the embodiment of the present disclosure is suitable for conference terminals. The embedding method of audio watermarks includes (but is not limited to) the following steps: receiving a first speech signal and a first audio watermark signal respectively, wherein the first speech signal relates to a phonetic content of a speaker corresponding to another conference terminal, and the first audio watermark signal corresponds to the another conference terminal; assigning the first speech signal to a host path to output a second speech signal, and assigning the first audio watermark signal to an offload path to output a second audio watermark signal, wherein the host path provides more digital signal processing (DSP) effects than the offload path; and synthesizing the second speech signal and the second audio watermark signal to output a synthesized audio signal, wherein the synthesized audio signal is adapted for audio playback.
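As a non-limiting illustration, the three steps above (receiving, assigning to two paths, and synthesizing) may be sketched as follows. The names `limiter`, `run_path`, `synthesize`, and `embed_audio_watermark`, the choice of a simple limiter as the only host-path effect, and the sample-wise addition used for synthesis are all assumptions made for this sketch and are not specified by the disclosure.

```python
from typing import Callable, List, Sequence

Signal = List[float]
Effect = Callable[[Signal], Signal]


def limiter(signal: Signal, ceiling: float = 0.8) -> Signal:
    """Hypothetical stand-in for one speech DSP effect: clamp samples to +/- ceiling."""
    return [max(-ceiling, min(ceiling, s)) for s in signal]


def run_path(signal: Sequence[float], effects: Sequence[Effect]) -> Signal:
    """Pass a signal through every effect configured on a path, in order."""
    out = list(signal)
    for effect in effects:
        out = effect(out)
    return out


# Assumption: the host path carries more effects than the offload path.
HOST_EFFECTS: List[Effect] = [limiter]
OFFLOAD_EFFECTS: List[Effect] = []


def synthesize(speech: Sequence[float], watermark: Sequence[float]) -> Signal:
    """Sample-wise sum; the shorter signal is padded with zeros (an assumed choice)."""
    n = max(len(speech), len(watermark))
    padded_speech = list(speech) + [0.0] * (n - len(speech))
    padded_watermark = list(watermark) + [0.0] * (n - len(watermark))
    return [a + b for a, b in zip(padded_speech, padded_watermark)]


def embed_audio_watermark(first_speech: Sequence[float],
                          first_watermark: Sequence[float]) -> Signal:
    """Route speech to the host path and the watermark to the offload path, then mix."""
    second_speech = run_path(first_speech, HOST_EFFECTS)
    second_watermark = run_path(first_watermark, OFFLOAD_EFFECTS)
    return synthesize(second_speech, second_watermark)
```

In this sketch only the speech samples are altered by the limiter, while the watermark samples reach the mixing step unchanged.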
The conference terminal of the embodiment of the present disclosure includes (but is not limited to) a sound receiver, a loudspeaker, a communication transceiver, and a processor. The sound receiver is adapted to receive sound. The loudspeaker is adapted to play sound. The communication transceiver is adapted to transmit or receive data. The processor is coupled to the sound receiver, the loudspeaker, and the communication transceiver. The processor is adapted to receive a first speech signal and a first audio watermark signal respectively through the communication transceiver, assign the first speech signal to a host path to output a second speech signal, and assign the first audio watermark signal to an offload path to output a second audio watermark signal, and synthesize the second speech signal and the second audio watermark signal to output a synthesized audio signal. The first speech signal relates to a phonetic content of a speaker corresponding to another conference terminal, and the first audio watermark signal corresponds to the another conference terminal. The host path provides more digital signal processing effects than the offload path. The synthesized audio signal is adapted for audio playback.
Based on the above, in the conference terminal and the embedding method of audio watermarks according to the embodiment of the present disclosure, two transmission paths are provided at the terminal for the speech signal and the audio watermark signal, so that the audio watermark signal undergoes less signal processing before the two signals are synthesized. In this way, the conference terminal may completely play out the speech signal and the audio watermark signal of the speaker at the other terminal, and the noise in the environment may be reduced.
In order to make the above-mentioned features and advantages of the present disclosure more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.
Each of the conference terminals 10a and 10c may be a wired phone, a mobile phone, a tablet computer, a desktop computer, a notebook computer, or a smart speaker. Each of the conference terminals 10a and 10c includes (but is not limited to) a sound receiver 11, a loudspeaker 13, a communication transceiver 15, a memory 17, and a processor 19.
The sound receiver 11 may be a dynamic, condenser, or electret condenser microphone. The sound receiver 11 may also be a combination of other electronic components, analog-to-digital converters, filters, and audio processors that can receive sound waves (for example, human voice, environmental sound, machine operation sound, etc.) and convert them into speech signals. In one embodiment, the sound receiver 11 is adapted to receive/record the sound of the speaker to obtain the speech signals. In some embodiments, the speech signal may include the voice of the speaker, the sound emitted by the loudspeaker 13, and/or other environmental sounds.
The loudspeaker 13 may be a speaker or a loudspeaker. In one embodiment, the loudspeaker 13 is adapted to play sound.
The communication transceiver 15 is, for example, a transceiver that supports a wired network such as Ethernet, an optical fiber network, or cable (which may include, but is not limited to, connection interfaces, signal converters, communication protocol processing chips, and other components). It may also be a transceiver that supports Wi-Fi, fourth-generation (4G), fifth-generation (5G), or later generation mobile networks, and other wireless networks (which may include, but is not limited to, antennas, digital-to-analog/analog-to-digital converters, communication protocol processing chips, and other components). In one embodiment, the communication transceiver 15 is adapted to transmit or receive data.
The memory 17 may be any type of fixed or removable random access memory (RAM), read only memory (ROM), flash memory, hard disk drive (HDD), solid-state drive (SSD), or similar components. In one embodiment, the memory 17 is adapted to record program codes, software modules, configuration arrangement, data (for example, audio signals), or files.
The processor 19 is coupled to the sound receiver 11, the loudspeaker 13, the communication transceiver 15, and the memory 17. The processor 19 may be a central processing unit (CPU), a graphics processing unit (GPU), or another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or other similar components or a combination of the above devices. In one embodiment, the processor 19 is adapted to perform all or part of the operations of the conference terminals 10a and 10c, and may load and execute various software modules, files, and data recorded in the memory 17.
In an embodiment, the processor 19 includes a primary processor 191 and a secondary processor 193. For example, the primary processor 191 is a CPU, and the secondary processor 193 is a platform controller hub (PCH) or other chips or processors with lower power consumption than the CPU. However, in some embodiments, the functions and/or elements of the primary processor 191 and the secondary processor 193 may be integrated.
The cloud server 50 is directly or indirectly connected to the conference terminals 10a and 10c via the network. The cloud server 50 may be a computer system, a server, or a signal processing device. In an embodiment, the conference terminals 10a and 10c may also serve as the cloud server 50. In another embodiment, the cloud server 50 may be used as an independent cloud server different from the conference terminals 10a and 10c. In some embodiments, the cloud server 50 includes (but is not limited to) the same or similar communication transceiver 15, memory 17, and processor 19, and the implementation modes and functions of the components will not be repeated herein.
Various devices, components, and modules in the conference system 1 are used hereinafter to describe the method according to the embodiments of the present disclosure. Each process of the method may be adjusted according to the practical implementation, and the method is not limited thereto.
In addition, it should be noted that, for the convenience of description, the same components can implement the same or similar operations, and the same description will not be repeated herein. For example, the processor 19 of the conference terminals 10a and 10c can all implement the same or similar methods in the embodiments of the present disclosure.
The cloud server 50 may generate the audio watermark signal WB for the conference terminal 10c based on the speech signal SB.
It should be noted that, regarding how to obtain the speech signal Sa′, the speech signal SA, and the audio watermark signal WA for the conference terminal 10a, reference may be made to the foregoing description of the speech signal Sb′, the speech signal SB, and the audio watermark signal WB, which is not repeated here. For example, the cloud server 50 may generate the audio watermark signal WA based on an original watermark w0A and a watermark key kwA to be transmitted.
In one embodiment, the original watermark w0A and the audio watermark signal WA are used to identify the conference terminal 10a, or the original watermark w0B and the audio watermark signal WB are used to identify the conference terminal 10c. For example, the audio watermark signal WA is a sound that records an identification code of the conference terminal 10a. However, in some embodiments, the present disclosure does not limit the content of the audio watermark signals WA and WB.
In one embodiment, the processor 19 receives a network packet through the communication transceiver 15 via the network. This network packet includes both the speech signal SB and the audio watermark signal WB. The processor 19 may identify the speech signal SB and the audio watermark signal WB based on an identifier in the network packet. This identifier is adapted to indicate that one part of the payload of the network packet is the speech signal SB while the other part is the audio watermark signal WB. For example, the identifier indicates the starting positions of the speech signal SB and the audio watermark signal WB in the network packet.
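As a non-limiting illustration of the single-packet embodiment, the sketch below assumes a hypothetical packet layout in which the identifier is a four-byte header holding two unsigned 16-bit big-endian offsets: the starting position of the speech signal and the starting position of the audio watermark signal. The header format and the names `HEADER`, `build_packet`, and `split_packet` are assumptions for this sketch; the disclosure does not fix a packet format.

```python
import struct

# Hypothetical identifier: two unsigned 16-bit big-endian offsets
# (start of the speech data, start of the watermark data).
HEADER = struct.Struct("!HH")


def build_packet(speech: bytes, watermark: bytes) -> bytes:
    """Assemble one packet: identifier, then speech data, then watermark data."""
    speech_start = HEADER.size
    watermark_start = speech_start + len(speech)
    return HEADER.pack(speech_start, watermark_start) + speech + watermark


def split_packet(packet: bytes) -> tuple:
    """Use the identifier to separate the speech data from the watermark data."""
    if len(packet) < HEADER.size:
        raise ValueError("packet shorter than identifier")
    speech_start, watermark_start = HEADER.unpack_from(packet, 0)
    if not (HEADER.size <= speech_start <= watermark_start <= len(packet)):
        raise ValueError("identifier offsets out of range")
    speech = packet[speech_start:watermark_start]
    watermark = packet[watermark_start:]
    return speech, watermark
```

Under this assumed layout, the receiving terminal would call `split_packet` on each received packet and pass the first element to the host path and the second element to the offload path.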
In one embodiment, the processor 19 receives a first network packet through the communication transceiver 15 via the network. This first network packet includes the speech signal SB. The processor 19 also receives a second network packet through the communication transceiver 15 via the network. This second network packet includes the audio watermark signal WB. In other words, the processor 19 distinguishes the speech signal SB from the audio watermark signal WB through two or more network packets.
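As a non-limiting illustration of the multi-packet embodiment, the sketch below assumes that each packet begins with a hypothetical one-byte type field that tells the terminal whether the payload is speech data or watermark data. The type values, the queues, and the function name `dispatch` are assumptions for this sketch; the disclosure only states that the two signals arrive in separate packets.

```python
from collections import deque

# Hypothetical one-byte type values; not defined by the disclosure.
SPEECH_TYPE = 0x01
WATERMARK_TYPE = 0x02

host_queue = deque()     # payloads waiting for the host path
offload_queue = deque()  # payloads waiting for the offload path


def dispatch(packet: bytes) -> str:
    """Queue a packet's payload for the path that matches its type field."""
    if not packet:
        raise ValueError("empty packet")
    kind, payload = packet[0], packet[1:]
    if kind == SPEECH_TYPE:
        host_queue.append(payload)
        return "host"
    if kind == WATERMARK_TYPE:
        offload_queue.append(payload)
        return "offload"
    raise ValueError(f"unknown packet type: {kind:#04x}")
```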
In the embodiment of the present disclosure, the host path provides more digital signal processing (DSP) effects than the offload path. Accordingly, compared to the speech signal SB, the audio watermark signal WB is subjected to fewer digital signal processing effects or to none at all. For example, the processor 19 performs noise suppression on the speech signal SB, but the audio watermark signal WB is not subjected to noise suppression. Alternatively, the audio watermark signal WB may be subjected only to gain adjustment without undergoing the voice-related signal processing.
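As a non-limiting illustration of why the two paths differ, the sketch below compares what happens to a low-level watermark on each path. The simple amplitude gate standing in for noise suppression, the threshold of 0.05, the synthetic sinusoidal watermark, and the names `noise_gate`, `host_path`, and `offload_path` are assumptions for this sketch and do not represent the actual effects of any particular audio system.

```python
import math
from typing import List

Signal = List[float]


def noise_gate(signal: Signal, threshold: float = 0.05) -> Signal:
    """Crude stand-in for noise suppression: zero every sample below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in signal]


def apply_gain(signal: Signal, gain: float) -> Signal:
    """Scale every sample by a constant gain."""
    return [gain * s for s in signal]


def energy(signal: Signal) -> float:
    """Sum of squared samples, used here to check how much of a signal survives."""
    return sum(s * s for s in signal)


def host_path(signal: Signal) -> Signal:
    """Assumed host path: noise suppression followed by gain adjustment."""
    return apply_gain(noise_gate(signal), 1.0)


def offload_path(signal: Signal) -> Signal:
    """Assumed offload path: gain adjustment only."""
    return apply_gain(signal, 1.0)


# Synthetic low-amplitude watermark (peak 0.01), well below the gate threshold.
watermark = [0.01 * math.sin(2 * math.pi * k / 8) for k in range(64)]

lost = energy(host_path(watermark))      # the gate removes the whole watermark
kept = energy(offload_path(watermark))   # the watermark passes through unchanged
```

With these assumed values, every watermark sample is below the gate threshold, so the watermark would be removed on the host path while it is preserved on the offload path.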
In one embodiment, the host path is configured for primary applications that handle voice calls or multimedia playback, for example, the media player or call software in the Windows system. The offload path is configured for secondary applications that handle notification sounds, ringtones, or music playback, for example, a simple music player. The processor 19 may connect the speech signal SB with the primary application, so that the speech signal SB is input to the host path used by the primary application, whereas the processor 19 may connect the audio watermark signal WB with the secondary application, so that the audio watermark signal WB is input to the offload path used by the secondary application.
In one embodiment, the primary processor 191 performs signal processing on the host path, and the secondary processor 193 performs signal processing on the offload path. In other words, the primary processor 191 provides the digital signal processing effects corresponding to the host path to the speech signal SB, and the secondary processor 193 provides the digital signal processing effects corresponding to the offload path for the audio watermark signal WB. For example, the storage space provided by the secondary processor 193 for the mode effects is less than the storage space provided by the primary processor 191.
On the other hand, the processor 19 may obtain the speech signal Sa of the speaker through an audio receiving system 271. For example, the processor 19 records through the sound receiver 11 to obtain the speech signal Sa. The processor 19 may perform transmission end speech signal processing on the speech signal Sa to output the speech signal Sa′ (step S290), and transmit the speech signal Sa′ to the cloud server 50 through the communication transceiver 15. Similarly, the cloud server 50 may generate the speech signal SA and the audio watermark signal WA based on the speech signal Sa′. In addition, the conference terminal 10c may also output a complete or less distorted audio watermark signal WA through its loudspeaker 13.
In summary, in the conference terminal and the embedding method of audio watermarks of the embodiments of the present disclosure, the audio watermark signal and the speech signal are synthesized at the output end of the conference terminal, so that the audio watermark is embedded while bypassing the speech signal processing of the system. In this configuration, the embodiment of the present disclosure provides a host path and an offload path, so that the audio watermark signal receives less signal processing or no signal processing at all. In this way, the terminal may fully play the speech signal of the speaker and the audio watermark, and may reduce the noise in the environment.
Although the present disclosure has been disclosed in the above embodiments, it is not intended to limit the present disclosure. Anyone with ordinary knowledge in the relevant technical field can make changes and modifications without departing from the spirit and scope of the present disclosure. The scope of protection of the present disclosure shall be subject to those defined by the claims attached.
Number | Date | Country | Kind |
---|---|---|---|
110122715 | Jun 2021 | TW | national |