CONFERENCE TERMINAL AND EMBEDDING METHOD OF AUDIO WATERMARKS

Information

  • Patent Application
  • 20220406317
  • Publication Number
    20220406317
  • Date Filed
    August 16, 2021
  • Date Published
    December 22, 2022
Abstract
A conference terminal and an embedding method of audio watermarks are provided. In the method, a first speech signal and a first audio watermark signal are received respectively. The first speech signal relates to a speaker corresponding to another conference terminal, and the first audio watermark signal corresponds to the another conference terminal. The first speech signal is assigned to a host path to output a second speech signal. The first audio watermark signal is assigned to an offload path to output a second audio watermark signal. The host path provides more digital signal processing (DSP) effects than the offload path. The second speech signal and the second audio watermark signal are synthesized to output a synthesized audio signal. The synthesized audio signal is adapted for audio playback. A completed audio watermark signal is outputted accordingly.
Description
CROSS-REFERENCE TO RELATED APPLICATION

This application claims the priority benefit of Taiwan application serial no. 110122715, filed on Jun. 22, 2021. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.


BACKGROUND
Technical Field

The disclosure relates to a speech conference technology, particularly to a conference terminal and an embedding method of audio watermarks.


Description of Related Art

Remote conferences enable people at different locations or in different spaces to have conversations, and conference-related equipment, protocols, and/or applications are also well developed. It is worth noting that some real-time conference programs may synthesize speech signals and audio watermark signals. However, speech signal processing technologies (for example, frequency band filtering, noise suppression, dynamic range compression (DRC), echo cancellation, etc.) are generally designed for ordinary speech signals: they retain speech signals and remove non-speech signals. If the speech signal and the audio watermark signal undergo the same speech signal processing on the signal transmission path, the audio watermark signal may be treated as noise or as a non-speech signal and thus be filtered out.


SUMMARY

In this light, the embodiments of the present disclosure provide a conference terminal and an embedding method of audio watermarks. The audio watermark is embedded in the terminal to retain the audio watermark through multiple paths.


The embedding method of audio watermarks in the embodiment of the present disclosure is suitable for conference terminals. The embedding method of audio watermarks includes (but is not limited to) the following steps: receiving a first speech signal and a first audio watermark signal respectively, wherein the first speech signal relates to a phonetic content of a speaker corresponding to another conference terminal, and the first audio watermark signal corresponds to the another conference terminal; assigning the first speech signal to a host path to output a second speech signal, and assigning the first audio watermark signal to an offload path to output a second audio watermark signal, wherein the host path provides more digital signal processing (DSP) effects than the offload path; and synthesizing the second speech signal and the second audio watermark signal to output a synthesized audio signal, wherein the synthesized audio signal is adapted for audio playback.


The conference terminal of the embodiment of the present disclosure includes (but is not limited to) a sound receiver, a loudspeaker, a communication transceiver, and a processor. The sound receiver is adapted to receive sound. The loudspeaker is adapted to play sound. The communication transceiver is adapted to transmit or receive data. The processor is coupled to the sound receiver, the loudspeaker, and the communication transceiver. The processor is adapted to receive a first speech signal and a first audio watermark signal respectively through the communication transceiver, assign the first speech signal to a host path to output a second speech signal, and assign the first audio watermark signal to an offload path to output a second audio watermark signal, and synthesize the second speech signal and the second audio watermark signal to output a synthesized audio signal. The first speech signal relates to a phonetic content of a speaker corresponding to another conference terminal, and the first audio watermark signal corresponds to the another conference terminal. The host path provides more digital signal processing effects than the offload path. The synthesized audio signal is adapted for audio playback.


Based on the above, in the conference terminal and the embedding method of audio watermarks according to the embodiment of the present disclosure, two transmission paths are provided at the terminal for the speech signal and the audio watermark signal, so that the audio watermark signal receives less signal processing before the two signals are synthesized. In this way, the conference terminal may completely play out the speech signal and the audio watermark signal of the speaker at the other terminal, which reduces the noise in the environment.


In order to make the above-mentioned features and advantages of the present disclosure more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a schematic diagram of a conference system according to an embodiment of the present disclosure.



FIG. 2 is a flowchart of an embedding method of audio watermarks according to an embodiment of the present disclosure.



FIG. 3 is a flowchart of the generation of a speech signal and an audio watermark signal according to an embodiment of the present disclosure.



FIG. 4 is a flowchart illustrating the generation of an audio watermark signal according to an embodiment of the present disclosure.



FIG. 5 is a schematic diagram of an audio processing architecture according to an embodiment of the disclosure.





DESCRIPTION OF THE EMBODIMENTS


FIG. 1 is a schematic diagram of a conference system 1 according to an embodiment of the present disclosure. In FIG. 1, the conference system 1 includes (but is not limited to) a plurality of conference terminals 10a and 10c and a cloud server 50.


Each of the conference terminals 10a and 10c may be a wired phone, a mobile phone, a tablet computer, a desktop computer, a notebook computer, or a smart speaker. Each of the conference terminals 10a and 10c includes (but is not limited to) a sound receiver 11, a loudspeaker 13, a communication transceiver 15, a memory 17, and a processor 19.


The sound receiver 11 can be a dynamic, condenser, or electret condenser sound receiver. The sound receiver 11 may also be a combination of other electronic components, analog-to-digital converters, filters, and audio processors that can receive sound waves (for example, human voice, environmental sound, machine operation sound, etc.) and convert them into speech signals. In one embodiment, the sound receiver 11 is adapted to receive/record the sound of the speaker to obtain the speech signals. In some embodiments, the speech signal may include the voice of the speaker, the sound emitted by the loudspeaker 13, and/or other environmental sounds.


The loudspeaker 13 may be a speaker or a loudspeaker. In one embodiment, the loudspeaker 13 is adapted to play sound.


The communication transceiver 15 is, for example, a transceiver that supports a wired network such as Ethernet, an optical fiber network, or cable (which may include, but is not limited to, connection interfaces, signal converters, communication protocol processing chips, and other components). It may also be a transceiver that supports Wi-Fi, fourth-generation (4G), fifth-generation (5G), or later-generation mobile networks, or other wireless networks (which may include, but is not limited to, antennas, digital-to-analog/analog-to-digital converters, communication protocol processing chips, and other components). In one embodiment, the communication transceiver 15 is adapted to transmit or receive data.


The memory 17 may be any type of fixed or removable random access memory (RAM), read only memory (ROM), flash memory, hard disk drive (HDD), solid-state drive (SSD), or similar components. In one embodiment, the memory 17 is adapted to record program codes, software modules, configuration arrangement, data (for example, audio signals), or files.


The processor 19 is coupled to the sound receiver 11, the loudspeaker 13, the communication transceiver 15, and the memory 17. The processor 19 may be a central processing unit (CPU), a graphics processing unit (GPU), or another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), another similar component, or a combination of the above devices. In one embodiment, the processor 19 is adapted to perform all or part of the operations of the conference terminals 10a and 10c, and may load and execute various software modules, files, and data recorded in the memory 17.


In an embodiment, the processor 19 includes a primary processor 191 and a secondary processor 193. For example, the primary processor 191 is a CPU, and the secondary processor 193 is a platform controller hub (PCH) or other chips or processors with lower power consumption than the CPU. However, in some embodiments, the functions and/or elements of the primary processor 191 and the secondary processor 193 may be integrated.


The cloud server 50 is directly or indirectly connected to the conference terminals 10a and 10c via the network. The cloud server 50 may be a computer system, a server, or a signal processing device. In an embodiment, the conference terminals 10a and 10c may also serve as the cloud server 50. In another embodiment, the cloud server 50 may be used as an independent cloud server different from the conference terminals 10a and 10c. In some embodiments, the cloud server 50 includes (but is not limited to) the same or similar communication transceiver 15, memory 17, and processor 19, and the implementation modes and functions of the components will not be repeated herein.


Various devices, components, and modules in the conference system 1 are used to describe the method according to the embodiments of the present disclosure hereinafter. Each process of the method can be adjusted accordingly according to the practical implementation situation, and is not limited to this.


In addition, it should be noted that, for the convenience of description, the same components can implement the same or similar operations, and the same description will not be repeated herein. For example, the processor 19 of the conference terminals 10a and 10c can all implement the same or similar methods in the embodiments of the present disclosure.



FIG. 2 is a flowchart of an embedding method of audio watermarks according to an embodiment of the present disclosure. Referring to FIG. 1 and FIG. 2, it is assumed that the conference terminals 10a and 10c establish a conference call. For example, after a meeting is set up through video software or voice call software, or after a phone call is made, the speaker may start talking. The processor 19 of the conference terminal 10a receives a speech signal SB and an audio watermark signal WB through the communication transceiver 15 (i.e., via a network interface) (step S210). Specifically, the speech signal SB relates to the phonetic content of the speaker corresponding to the conference terminal 10c (for example, the speech signal obtained when the sound receiver 11 of the conference terminal 10c records the speaker). The audio watermark signal WB corresponds to the conference terminal 10c.


For example, FIG. 3 is a flowchart of the generation of the speech signal SB and the audio watermark signal WB according to an embodiment of the present disclosure. In FIG. 3, the cloud server 50 receives a speech signal Sb′ recorded by the conference terminal 10c through its sound receiver 11 via the network interface (step S310). The speech signal Sb′ may include the voice of the speaker, the sound played by the loudspeaker 13, and/or other environmental sounds. The cloud server 50 may perform speech signal processing like noise suppression and gain adjustment on the speech signal Sb′ (step S330), and generate the speech signal SB accordingly. However, in some embodiments, it is also possible to omit the speech signal processing and directly use the speech signal Sb′ as the speech signal SB.
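
The speech signal processing of step S330 can be illustrated with a minimal Python sketch. The source names only noise suppression and gain adjustment without specifying algorithms, so the noise gate, the threshold, the gain value, and every function name below are assumptions chosen for illustration.

```python
from typing import List


def noise_gate(samples: List[float], threshold: float = 0.02) -> List[float]:
    """Crude stand-in for noise suppression: zero samples below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def apply_gain(samples: List[float], gain: float = 2.0) -> List[float]:
    """Scale each sample and clamp the result to the range [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def process_speech(sb_raw: List[float], enable_processing: bool = True) -> List[float]:
    """Sketch of step S330: derive SB from Sb'.

    When processing is disabled, Sb' is used directly as SB, which mirrors
    the embodiment that omits the speech signal processing.
    """
    if not enable_processing:
        return list(sb_raw)
    return apply_gain(noise_gate(sb_raw))


# Hypothetical recorded samples standing in for Sb'.
sb_raw = [0.01, 0.3, -0.015, -0.6, 0.25]
sb = process_speech(sb_raw)
```

Note that the low-amplitude samples are removed by the gate. A low-level watermark sent through the same chain would be removed in the same way, which is the problem the disclosure addresses.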


The cloud server 50 may also generate the audio watermark signal WB for the conference terminal 10c based on the speech signal SB. Specifically, FIG. 4 is a flowchart of the generation of the audio watermark signal WB according to an embodiment of the present disclosure. In FIG. 4, the cloud server 50 evaluates the applicable parameters (for example, gain, time difference, and/or frequency band) of the watermark through a psychoacoustic model (step S410). The psychoacoustic model is a mathematical model that imitates the human hearing mechanism, and can be used to derive frequency bands that cannot be heard by human ears. The cloud server 50 may generate the audio watermark signal WB based on an original watermark w0B and a watermark key kwB to be transmitted (step S430). It should be noted that the key algorithm used in step S430 is adapted for information security and integrity protection. In some embodiments, the watermark key kwB is not applied to the audio watermark signal WB, and the original watermark w0B may be directly used as the audio watermark signal WB.
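
As an illustration of step S430, the following Python sketch derives a watermark signal from a bit string and a key. The source does not specify the key algorithm, so the keyed pseudo-random spreading used here is a swapped-in technique, and the names `pn_sequence`, `generate_watermark`, and `recover_bits`, the chip count, and the gain are all hypothetical. Python's `random` module is not cryptographically secure, so this does not provide the security properties the disclosure attributes to its key algorithm.

```python
import hashlib
import random
from typing import List


def pn_sequence(key: bytes, length: int) -> List[int]:
    """Key-dependent pseudo-random sequence of +1/-1 chips (illustrative only)."""
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else -1 for _ in range(length)]


def generate_watermark(bits: List[int], key: bytes,
                       chips_per_bit: int = 64, gain: float = 0.005) -> List[float]:
    """Spread each watermark bit over chips_per_bit low-amplitude samples."""
    chips = pn_sequence(key, len(bits) * chips_per_bit)
    signal: List[float] = []
    for i, bit in enumerate(bits):
        sign = 1 if bit else -1
        start = i * chips_per_bit
        signal.extend(sign * gain * c for c in chips[start:start + chips_per_bit])
    return signal


def recover_bits(signal: List[float], key: bytes, chips_per_bit: int = 64) -> List[int]:
    """Correlate each block with the keyed chips to recover the bits."""
    chips = pn_sequence(key, len(signal))
    bits = []
    for start in range(0, len(signal), chips_per_bit):
        block = signal[start:start + chips_per_bit]
        corr = sum(s * c for s, c in zip(block, chips[start:start + chips_per_bit]))
        bits.append(1 if corr > 0 else 0)
    return bits


# Hypothetical values standing in for the original watermark w0B and key kwB.
w0b = [1, 0, 1, 1, 0, 0, 1, 0]
kwb = b"hypothetical-key-for-terminal-10c"
wb = generate_watermark(w0b, kwb)
```

In this sketch the `gain` parameter plays the role of one of the parameters that step S410 would evaluate through the psychoacoustic model; here it is simply a fixed assumption.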


It should be noted that, for how the speech signal Sa′, the speech signal SA, and the audio watermark signal WA are obtained for the conference terminal 10a, reference may be made to the foregoing description of the speech signal Sb′, the speech signal SB, and the audio watermark signal WB, which will not be repeated here. For example, the cloud server 50 may generate an audio watermark signal WA based on an original watermark w0A and a watermark key kwA to be transmitted.


In one embodiment, the original watermark w0A and the audio watermark signal WA are used to identify the conference terminal 10a, or the original watermark w0B and the audio watermark signal WB are used to identify the conference terminal 10c. For example, the audio watermark signal WA is a sound that records an identification code of the conference terminal 10a. However, in some embodiments, the present disclosure does not limit the content of the audio watermark signals WA and WB.


In FIG. 3, the cloud server 50 transmits the speech signal SB and the audio watermark signal WB to the conference terminal 10a via the network interface, and the conference terminal 10a receives the speech signal SB and the audio watermark signal WB (step S370). Likewise, the cloud server 50 may transmit the speech signal SA and the audio watermark signal WA to the conference terminal 10c, and the conference terminal 10c receives the speech signal SA and the audio watermark signal WA.


In one embodiment, the processor 19 receives a network packet through the communication transceiver 15 via the network. This network packet includes both the speech signal SB and the audio watermark signal WB. The processor 19 may identify the speech signal SB and the audio watermark signal WB based on an identifier in the network packet. This identifier indicates that one part of the payload of the network packet is the speech signal SB while the other part is the audio watermark signal WB. For example, the identifier indicates the starting positions of the speech signal SB and the audio watermark signal WB in the network packet.
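
The single-packet embodiment can be sketched in Python as follows. The source states only that an identifier indicates where each signal starts in the packet; the concrete layout here (two big-endian 16-bit offsets at the front of the packet) and the names `HEADER`, `build_packet`, and `split_packet` are assumptions made for illustration.

```python
import struct
from typing import Tuple

# Hypothetical identifier: (speech_start, watermark_start) as byte offsets.
HEADER = struct.Struct("!HH")


def build_packet(speech: bytes, watermark: bytes) -> bytes:
    """Pack SB and WB into one payload, prefixed by their start offsets."""
    speech_start = HEADER.size
    watermark_start = speech_start + len(speech)
    return HEADER.pack(speech_start, watermark_start) + speech + watermark


def split_packet(packet: bytes) -> Tuple[bytes, bytes]:
    """Use the identifier to separate the speech part from the watermark part."""
    speech_start, watermark_start = HEADER.unpack_from(packet, 0)
    if not (HEADER.size <= speech_start <= watermark_start <= len(packet)):
        raise ValueError("identifier offsets are inconsistent with packet length")
    return packet[speech_start:watermark_start], packet[watermark_start:]


# Placeholder bytes standing in for encoded SB and WB data.
packet = build_packet(b"\x01\x02\x03\x04", b"\xaa\xbb")
sb_bytes, wb_bytes = split_packet(packet)
```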


In one embodiment, the processor 19 receives a first network packet through the communication transceiver 15 via the network. This first network packet includes the speech signal SB. The processor 19 also receives a second network packet through the communication transceiver 15 via the network. This second network packet includes the audio watermark signal WB. In other words, the processor 19 distinguishes the speech signal SB from the audio watermark signal WB through two or more network packets.
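
The multi-packet embodiment can be sketched in a similar way. The source does not say how the terminal tells the two kinds of packet apart, so the one-byte type tag and the names `SPEECH`, `WATERMARK`, and `demultiplex` below are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical one-byte type tags at the start of each packet.
SPEECH = 0x01
WATERMARK = 0x02


def demultiplex(packets: Iterable[bytes]) -> Dict[str, List[bytes]]:
    """Sort incoming packets into a speech stream and a watermark stream."""
    streams: Dict[str, List[bytes]] = {"speech": [], "watermark": []}
    for pkt in packets:
        if not pkt:
            continue  # ignore empty packets
        kind, payload = pkt[0], pkt[1:]
        if kind == SPEECH:
            streams["speech"].append(payload)
        elif kind == WATERMARK:
            streams["watermark"].append(payload)
        # Packets with an unknown tag are dropped in this sketch.
    return streams


packets = [bytes([SPEECH]) + b"abc", bytes([WATERMARK]) + b"xy", b"", b"\x7fzz"]
streams = demultiplex(packets)
```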


In FIG. 2, the processor 19 assigns the speech signal SB to the host path to output the speech signal SB′ (step S231), and assigns the audio watermark signal WB to the offload path to output the audio watermark signal WB (step S233). Specifically, the conference terminal 10a may provide one or more digital signal processing (DSP) effects to the audio stream. Digital signal processing effects are, for example, equalization processing, reverb, echo cancellation, gain control, or other audio processing. These sound effects may also be further packaged into one or more audio processing objects (APOs), such as stream effects (SFX), mode effects (MFX), and endpoint effects (EFX).



FIG. 5 is a schematic diagram of an audio processing architecture according to an embodiment of the disclosure. In the audio processing architecture of FIG. 5, a first layer L1 is the applications APP1 and APP2, a second layer L2 is the audio engine, a third layer L3 is the driver, and a fourth layer L4 is the hardware. The application APP1 may be referred to as the primary application. For the application APP1, the audio engine provides stream effects SFX, mode effects MFX, and endpoint effects EFX. The application APP2 may be referred to as the secondary application that provides system pins to the driver. For the application APP2, the audio engine provides the offload stream effects (OSFX) and the offload mode effects (OMFX), which provide offload pins to the driver.


In the embodiment of the present disclosure, the host path provides more digital signal processing (DSP) effects than the offload path. It can be seen that, compared to the speech signal SB, the audio watermark signal WB may be subjected to no digital signal processing effects or to fewer digital signal processing effects. For example, the processor 19 performs noise suppression on the speech signal SB, but the audio watermark signal WB is not subjected to noise suppression. Alternatively, the audio watermark signal WB may be subjected only to gain adjustment without undergoing the voice-related signal processing.
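
The relationship between the two paths can be illustrated with a short Python sketch in which each path is an ordered list of effects. This models only the idea that the host path applies more effects than the offload path; it is not the audio engine's APO interface, and the effect implementations and the names `HOST_PATH`, `OFFLOAD_PATH`, and `run_path` are assumptions.

```python
from typing import Callable, List, Sequence

Effect = Callable[[List[float]], List[float]]


def noise_suppression(x: List[float]) -> List[float]:
    """Crude stand-in: zero samples whose magnitude is below 0.02."""
    return [s if abs(s) >= 0.02 else 0.0 for s in x]


def gain_control(x: List[float]) -> List[float]:
    """Double each sample and clamp to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s * 2.0)) for s in x]


def run_path(signal: List[float], effects: Sequence[Effect]) -> List[float]:
    """Apply each effect of a path in order and return the processed signal."""
    for effect in effects:
        signal = effect(signal)
    return list(signal)


# The host path carries more effects than the offload path (here: none).
HOST_PATH: List[Effect] = [noise_suppression, gain_control]
OFFLOAD_PATH: List[Effect] = []

sb = [0.4, 0.01, -0.2]         # hypothetical speech samples
wb = [0.005, -0.005, 0.005]    # hypothetical low-level watermark samples

sb_prime = run_path(sb, HOST_PATH)        # speech receives the full chain
wb_out = run_path(wb, OFFLOAD_PATH)       # watermark passes through unchanged
wb_if_on_host = run_path(wb, HOST_PATH)   # what the host path would do to it
```

Under these assumptions the watermark survives the offload path intact, whereas the same samples would be zeroed by the noise suppression on the host path.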


It should be noted that FIG. 2 shows that the processor 19 performs the receiving end speech signal processing on the speech signal SB, while the audio watermark signal WB does not undergo the receiving end speech signal processing (that is, the output of the offload path is still the audio watermark signal WB). However, in some embodiments, the audio watermark signal WB may also undergo part of the receiving end speech signal processing (i.e., the output of the offload path is the new audio watermark signal WB).


In one embodiment, the host path is configured for primary applications such as voice calls or multimedia playback, for example the media player or call software in the Windows system. The offload path is configured for secondary applications such as notification sounds, ringtones, or music playback, for example a simple music player. The processor 19 may connect the speech signal SB with the primary application, so that the speech signal SB is input to the host path used by the primary application, whereas the processor 19 may connect the audio watermark signal WB with the secondary application, so that the audio watermark signal WB is input to the offload path used by the secondary application.


In one embodiment, the primary processor 191 performs signal processing on the host path, and the secondary processor 193 performs signal processing on the offload path. In other words, the primary processor 191 provides the digital signal processing effects corresponding to the host path to the speech signal SB, and the secondary processor 193 provides the digital signal processing effects corresponding to the offload path for the audio watermark signal WB. For example, the storage space provided by the secondary processor 193 for the mode effects is less than the storage space provided by the primary processor 191.


In FIG. 2, the processor 19 synthesizes the speech signal SB′ and the audio watermark signal WB to output a synthesized audio signal SB′+WB (step S250). For example, the processor 19 adds the audio watermark signal WB to the speech signal SB′ in the time domain through spread spectrum, echo hiding, phase encoding, etc. to form the synthesized audio signal SB′+WB. Alternatively, the processor 19 may add the audio watermark signal WB to the speech signal SB′ in the frequency domain by modulated carriers, subtracting frequency bands, etc. The synthesized audio signal SB′+WB can be used in an audio playback system 251. For example, the processor 19 plays the synthesized audio signal SB′+WB through the loudspeaker 13, such that the audio playback system 251 may output an audio watermark signal WB that is complete or less distorted.
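
A minimal Python sketch of the synthesis in step S250 follows. The source lists several embedding techniques (spread spectrum, echo hiding, phase encoding, and frequency-domain methods) without fixing one; the sketch assumes the simplest case, sample-wise addition in the time domain with clamping, and the function name `synthesize` and the sample values are hypothetical.

```python
from typing import List


def synthesize(speech: List[float], watermark: List[float]) -> List[float]:
    """Mix SB' and WB sample by sample, padding the shorter signal with zeros."""
    length = max(len(speech), len(watermark))
    mixed = []
    for i in range(length):
        s = speech[i] if i < len(speech) else 0.0
        w = watermark[i] if i < len(watermark) else 0.0
        # Clamp so the sum stays inside the normalized range [-1.0, 1.0].
        mixed.append(max(-1.0, min(1.0, s + w)))
    return mixed


sb_prime = [0.5, -0.25, 0.0, 0.75]   # hypothetical host-path output SB'
wb = [0.125, 0.125, -0.125]          # hypothetical offload-path output WB
mixed = synthesize(sb_prime, wb)     # stands in for SB' + WB
```

Because the mix happens after the host path has finished, the watermark samples are never exposed to the speech-oriented effects in this arrangement.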


On the other hand, the processor 19 may obtain the speech signal Sa of the speaker through an audio receiving system 271. For example, the processor 19 records through the sound receiver 11 to obtain the speech signal Sa. The processor 19 may perform transmission end speech signal processing on the speech signal Sa to output the speech signal Sa′ (step S290), and transmit the speech signal Sa′ to the cloud server 50 through the communication transceiver 15. Similarly, the cloud server 50 may generate the speech signal SA and the audio watermark signal WA based on the speech signal Sa′. In addition, the conference terminal 10c may also output a complete or less distorted audio watermark signal WA through its loudspeaker 13.


In summary, in the conference terminal and the embedding method of audio watermarks of the embodiments of the present disclosure, the audio watermark signal and the speech signal are synthesized at the output end of the conference terminal, so that the embedded audio watermark bypasses the speech signal processing of the system. In this configuration, the embodiment of the present disclosure provides a host path and an offload path, and makes the audio watermark signal receive less signal processing or no signal processing at all. In this way, the terminal may play the user's speech signal and the audio watermark fully, and may reduce the noise in the environment.


Although the present disclosure has been disclosed in the above embodiments, it is not intended to limit the present disclosure. Anyone with ordinary knowledge in the relevant technical field can make changes and modifications without departing from the spirit and scope of the present disclosure. The scope of protection of the present disclosure shall be subject to those defined by the claims attached.

Claims
  • 1. An embedding method of audio watermarks adapted for a conference terminal, and the embedding method of audio watermarks comprising: receiving, by the conference terminal, a first speech signal and a first audio watermark signal respectively, wherein the first speech signal relates to a phonetic content of a speaker corresponding to another conference terminal, and the first audio watermark signal corresponds to the another conference terminal, and the first audio watermark signal is received from a network packet; assigning the first speech signal to a host path to output a second speech signal, and assigning the first audio watermark signal to an offload path to output a second audio watermark signal, wherein an audio engine of the conference terminal has the host path and the offload path for providing audio processing objects (APOs) implementing digital signal processing effects, the host path provides more digital signal processing effects than the offload path; and synthesizing the second speech signal and the second audio watermark signal to output a synthesized audio signal, wherein the synthesized audio signal is adapted for audio playback.
  • 2. The embedding method of audio watermarks according to claim 1, wherein respectively receiving the first speech signal and the first audio watermark signal comprises: receiving the network packet via a network, wherein the network packet further comprises the first speech signal; and identifying the first speech signal and the first audio watermark signal based on an identifier in the network packet.
  • 3. The embedding method of audio watermarks according to claim 1, wherein respectively receiving the first speech signal and the first audio watermark signal comprises: receiving another network packet via a network, wherein the first network packet comprises the first speech signal; and receiving the network packet via the network.
  • 4. The embedding method of audio watermarks according to claim 1, wherein the host path is adapted for voice calls or multimedia playback, and the offload path is adapted for prompt sound, ringtone, or music playback.
  • 5. The embedding method of audio watermarks according to claim 1, further comprising: performing signal processing on the host path through a primary processor; and performing signal processing on the offload path through a secondary processor.
  • 6. The embedding method of audio watermarks according to claim 1, wherein the second audio watermark signal is the same as the first audio watermark signal via the offload path.
  • 7. The embedding method of audio watermarks according to claim 5, wherein a storage space provided by the secondary processor for mode effects (MFXs) is less than a storage space provided by the primary processor.
  • 8. The embedding method of audio watermarks according to claim 1, wherein the host path is configured for a first application, the offload path is configured for a second application different from the first application, and assigning the first speech signal to the host path further comprises: connecting the first speech signal with the first application, wherein assigning the first audio watermark signal to the offload path further comprises: connecting the first audio watermark signal with the second application.
  • 9. A conference terminal, comprising: a sound receiver, adapted to record sound; a loudspeaker, adapted to play sound; a communication transceiver, adapted to transmit or receive data; a processor, coupled to the sound receiver, the loudspeaker, and the communication transceiver, and adapted to: receive a first speech signal and a first audio watermark signal through the communication transceiver, wherein the first speech signal relates to a phonetic content of a speaker corresponding to another conference terminal, and the first audio watermark signal corresponds to the another conference terminal, and the first audio watermark signal is received from a network packet; assign the first speech signal to a host path to output a second speech signal, and assign the first audio watermark signal to an offload path to output a second audio watermark signal, wherein an audio engine of the conference terminal has the host path and the offload path for providing audio processing objects (APOs) implementing digital signal processing effects, the host path provides more digital signal processing effects than the offload path; and synthesize the second speech signal and the second audio watermark signal to output a synthesized audio signal, wherein the synthesized audio signal is adapted for audio playback.
  • 10. The conference terminal according to claim 9, wherein the processor is further configured to: receive the network packet via a network through the communication transceiver, wherein the network packet further comprises the first speech signal.
  • 11. The conference terminal according to claim 9, wherein the processor is further configured to: receive another network packet via a network through the communication transceiver, wherein the first network packet comprises the first speech signal; and receive the network packet via the network through the communication transceiver.
  • 12. The conference terminal according to claim 9, wherein the host path is adapted for voice calls or multimedia playback, and the offload path is adapted for prompt sound, ringtone, or music playback.
  • 13. The conference terminal according to claim 9, wherein the processor comprises: a primary processor, adapted for performing signal processing on the host path; and a secondary processor, adapted for performing signal processing on the offload path.
  • 14. The conference terminal according to claim 9, wherein the second audio watermark signal is the same as the first audio watermark signal via the offload path.
  • 15. The conference terminal according to claim 13, wherein a storage space provided by the secondary processor for mode effects (MFXs) is less than a storage space provided by the primary processor.
  • 16. The conference terminal according to claim 9, wherein the host path is configured for a first application, the offload path is configured for a second application different from the first application, and the processor is further configured to: connect the first speech signal with the first application; and connect the first audio watermark signal with the second application.
Priority Claims (1)
Number Date Country Kind
110122715 Jun 2021 TW national