This application claims the priority benefit of Taiwanese application no. 110147950, filed on Dec. 21, 2021. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to a sound signal processing technology. Particularly, the disclosure relates to a processing method of a sound watermark and a sound watermark generating apparatus.
Remote conferences enable people in different locations or spaces to have conversations, and conference-related equipment, protocols, and applications are also well developed. It is worth noting that some real-time conference programs may synthesize voice signals with watermark sound signals and use them to identify speaking persons.
Inevitably, if a sound signal is interfered with by noise, a correct rate of determining a watermark at a receiving end may be decreased, thus affecting voice components of a user in the sound signal on a conversation transmission path.
The embodiments of the disclosure provide a processing method of a sound watermark and a sound watermark generating apparatus, in which a watermark sound signal that is generated effectively combats noise, improving conversation quality.
A sound watermark processing method according to an embodiment of the disclosure is adapted for a conference terminal. The conference terminal includes a sound receiver. The processing method of a sound watermark includes (but is not limited to) the following. A conversation-received sound signal is obtained through the sound receiver. A reflected sound signal is generated according to a virtual reflection condition and the conversation-received sound signal. The virtual reflection condition includes a positional relationship between the sound receiver, a sound source, and two external objects. The reflected sound signal is a sound signal obtained from simulating a sound emitted by the sound source reflected by one of the external objects and recorded by the sound receiver. A first watermark sound signal is generated according to a watermark identification code and the reflected sound signal. A second watermark sound signal is generated according to a sound signal distance value and the first watermark sound signal. The sound signal distance value is determined according to a high/low-frequency sound ratio of the reflected sound signal. The sound signal distance value is related to a distance difference between two reflection distances of the sound emitted by the sound source under the positional relationship reflected by the two external objects and reaching the sound receiver. The first watermark sound signal and the second watermark sound signal are synthesized to generate an output watermark sound signal.
A sound watermark generating apparatus according to an embodiment of the disclosure includes (but is not limited to) a memory and a processor. The memory is configured to store a programming code. The processor is coupled to the memory. The processor is configured to load and execute the programming code to: obtain a conversation-received sound signal through a sound receiver; generate a reflected sound signal according to a virtual reflection condition and the conversation-received sound signal; generate a first watermark sound signal according to a watermark identification code and the reflected sound signal; generate a second watermark sound signal according to a sound signal distance value and the first watermark sound signal; and synthesize the first watermark sound signal and the second watermark sound signal to generate an output watermark sound signal. The virtual reflection condition includes a positional relationship between the sound receiver, a sound source, and two external objects. The reflected sound signal is a sound signal obtained from simulating a sound emitted by the sound source reflected by one of the external objects and recorded by the sound receiver. The sound signal distance value is determined according to a high/low-frequency sound ratio of the reflected sound signal. The sound signal distance value is related to a distance difference between two reflection distances of the sound emitted by the sound source under the positional relationship reflected by the two external objects and reaching the sound receiver.
Based on the foregoing, in the processing method of a sound watermark and the sound watermark generating apparatus according to the embodiments of the disclosure, based on the high/low-frequency sound ratio of the conversation-received sound signal, the sound signal distance value between two reflected sound signals to be simulated is determined, and two watermark sound signals are generated accordingly. Thereby, by outputting two synthesized watermark sound signals, the power of the overall watermark sound signal can be reduced, and the accuracy of determining the watermark identification code can be improved.
To make the aforementioned more comprehensible, several embodiments accompanied with drawings are described in detail as follows.
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the disclosure and, together with the description, serve to explain the principles of the disclosure.
The conference terminals 10, 20 may be a wired phone, a mobile phone, an Internet phone, a tablet computer, a desktop computer, a notebook computer, or a smart speaker.
The conference terminal 10 includes (but is not limited to) a sound receiver 11, a loudspeaker 13, a communication transceiver 15, a memory 17, and a processor 19.
The sound receiver 11 may be a microphone in, for example, a dynamic, condenser, or electret condenser form. The sound receiver 11 may also be a combination of other electronic components, analog-to-digital converters, filters, and audio processors that receive sound waves (e.g., human voice, environmental sound, and machine operation sound) and convert the sound waves into sound signals. In an embodiment, the sound receiver 11 is configured to receive/record sounds of a speaking person to obtain a conversation-received sound signal. In some embodiments, the conversation-received sound signal may include the sound of the speaking person, the sound emitted by the loudspeaker 13, and/or other environmental sounds.
The loudspeaker 13 may be a horn or a sound amplifier. In an embodiment, the loudspeaker 13 is configured to play sounds.
The communication transceiver 15 is, for example, a transceiver (which may include, but is not limited to, elements such as a connection interface, a signal converter, and a communication protocol processing chip) that supports wired networks such as Ethernet, optical fiber networks, or cables. The communication transceiver 15 may also be a transceiver (which may include, but is not limited to, elements such as an antenna, a digital-to-analog/analog-to-digital converter, and a communication protocol processing chip) that supports wireless networks such as Wi-Fi, fourth-generation (4G), fifth-generation (5G), or later-generation mobile networks. In an embodiment, the communication transceiver 15 is configured to transmit or receive data.
The memory 17 may be any type of fixed or removable random access memory (RAM), read only memory (ROM), flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or similar elements. In an embodiment, the memory 17 is configured to store programming codes, software modules, configurations, data (e.g., sound signals, watermark identification codes, or watermark sound signals), or files.
The processor 19 is coupled to the sound receiver 11, the loudspeaker 13, the communication transceiver 15, and the memory 17. The processor 19 may be a central processing unit (CPU), a graphic processing unit (GPU), or any other programmable general-purpose or special-purpose microprocessor, digital signal processor (DSP), programmable controller, field programmable gate array (FPGA), application-specific integrated circuit (ASIC), or other similar elements or a combination of the above elements. In an embodiment, the processor 19 is configured to perform all or part of operations of the conference terminal 10, and may load and execute the software modules, files, and data stored in the memory 17.
The conference terminal 20 includes (but is not limited to) a sound receiver 21, a loudspeaker 23, a communication transceiver 25, a memory 27, and a processor 29. For the implementation aspects and functions of the sound receiver 21, the loudspeaker 23, the communication transceiver 25, the memory 27, and the processor 29, reference may be made to the above description of the sound receiver 11, the loudspeaker 13, the communication transceiver 15, the memory 17, and the processor 19, which will not be repeated herein. The processor 29 is configured to perform all or part of operations of the conference terminal 20, and may load and execute the software modules, files, and data stored in the memory 27.
The cloud server 50 is directly or indirectly connected to the conference terminals 10, 20 via a network. The cloud server 50 may be a computer system, a server, or a signal processing device. In an embodiment, the conference terminals 10, 20 may also serve as the cloud server 50. In another embodiment, the cloud server 50 may serve as an independent cloud server different from the conference terminals 10, 20. In some embodiments, the cloud server 50 includes (but is not limited to) a same or similar communication transceiver 55, memory 57, and processor 59, and the implementation aspects and functions of the elements will not be repeatedly described.
In an embodiment, a sound watermark generating apparatus 70 may be the conference terminals 10, 20, and/or the cloud server 50. The sound watermark generating apparatus 70 is configured to generate a watermark sound signal and will be described in detail in subsequent embodiments.
Hereinafter, a method according to an embodiment of the disclosure in combination with the various devices, elements, and modules in the conference communication system 1 will be described. Each process flow of the method may be adjusted according to the implementation, and is not limited thereto.
It should also be noted that, for ease of description, the same element may perform the same or similar operations, and will not be repeatedly described. For example, the processor 19 of the conference terminal 10, the processor 29 of the conference terminal 20, and/or the processor 59 of the cloud server 50 may each perform a method same as or similar to the method of the embodiments of the disclosure.
The processor 59 of the cloud server 50 receives the conversation-received sound signal SRx from the conference terminal 20 through the communication transceiver 55. The processor 59 generates a reflected sound signal S′Rx according to a virtual reflection condition and the conversation-received sound signal SRx (step S230). Specifically, general echo cancellation algorithms may adaptively cancel components (e.g., the conversation-received sound signal SRx on a conversation-received path) belonging to reference signals in sound signals received by the sound receivers 11, 21 from the outside. The sounds recorded by the sound receivers 11, 21 include sounds arriving over the shortest paths from the loudspeakers 13, 23 to the sound receivers 11, 21 and over different reflection paths of the environment (i.e., paths formed when sounds are reflected by external objects). Positions of reflection affect the time delay and the amplitude attenuation of the sound signal. In addition, the reflected sound signal may also come from different directions, resulting in phase shifts. In the embodiments of the disclosure, the sound signal SRx of the known conversation-received path is utilized to generate a virtual/simulated reflected sound signal that can be cancelled by an echo cancellation mechanism, and a watermark sound signal SWM is generated accordingly.
In an embodiment, the processor 59 may determine a time delay and an amplitude attenuation of the reflected sound signal S′Rx relative to the conversation-received sound signal SRx according to a positional relationship. For example,
$$s'_{Rx}(n) = \alpha_1 \cdot s_{Rx}(n - n_{w1}) \quad (1)$$
where α1 is the amplitude attenuation caused by a first reflection (i.e., the reflection of a sound signal blocked by the wall W1), n is the sampling point or time, and nw1 is the time delay caused by a first reflection distance (i.e., the distance from the sound source SS through the wall W1 to the sound receiver 21).
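As a minimal sketch of equation (1), assuming the conversation-received sound signal is held as a list of samples, the simulated reflection is simply a delayed and attenuated copy of that signal. The function name `simulate_reflection` and the example values chosen for α1 and nw1 are illustrative assumptions, not values from the disclosure.

```python
def simulate_reflection(signal, alpha, delay):
    """Return alpha * signal[n - delay]; samples before the delay are zero."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    reflected = [0.0] * len(signal)
    for n in range(delay, len(signal)):
        reflected[n] = alpha * signal[n - delay]
    return reflected


# Hypothetical example: attenuation 0.5 and a delay of 2 samples.
s_rx = [1.0, 2.0, 3.0, 4.0]
s_reflected = simulate_reflection(s_rx, alpha=0.5, delay=2)
```

Zero-padding the first `delay` samples is a design choice of this sketch; a streaming implementation would instead carry samples over from the previous frame.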
With reference to
In an embodiment, the processor 59 may optionally apply a filter to the reflected sound signal to generate a filtered reflected sound signal. Specifically, the general echo cancellation mechanism processes sound signals at a low frequency (e.g., 2 kilohertz (kHz) or 3 kHz and below) with a slower rate of convergence, but processes sound signals at a high frequency (e.g., 3 kHz or 4 kHz and above) with a faster rate of convergence (e.g., 10 milliseconds (ms) and below). Therefore, the processor 59 may shift, according to the watermark identification code, the phase of only the reflected sound signal (e.g., a first reflected sound signal) passing through high-pass filtering (e.g., only passing sound signals at a frequency of 3 kHz or 4 kHz and above), making the interference difficult to perceive (i.e., the phase-shifted component is confined to the high-frequency band, which the echo cancellation mechanism removes most quickly).
In another embodiment, the processor 59 may also not perform specific frequency filtering on the reflected sound signal.
In an embodiment, the watermark identification code is encoded in a multi-based positional numeral system, and the multi-based positional numeral system provides multiple values at one bit or each of multiple bits of the watermark identification code. Taking a binary system as an example, the value of each bit in the watermark identification code may be “0” or “1”. Taking a hexadecimal system as an example, the value of each bit in the watermark identification code may be “0”, “1”, “2”, . . . , “E”, or “F”. In another embodiment, the watermark identification code is encoded with an alphabet, a character, and/or a symbol. For example, the value of each bit in the watermark identification code may be any one of “A” to “Z” among English alphabets.
In an embodiment, the different values at the bits in the watermark identification code correspond to different phase shifts. For example, assuming that a watermark identification code WO is in a base-N positional numeral system (where N is a positive integer), then an N number of values may be provided for each bit. The N number of different values respectively correspond to different phase shifts φ1 to φN. For another example, assuming that the watermark identification code WO is in a binary system, then two values (i.e., 1 and 0) may be provided for each bit. The two different values respectively correspond to two phase shifts φ and −φ. For example, the phase shift φ is 90°, and the phase shift −φ is −90° (i.e., −1).
The processor 59 may shift the phase of the reflected sound signal (whether passing through high-pass filtering or not) according to the value of one or more bits in the watermark identification code. Taking a base-N positional numeral system as an example, the processor 59 selects one or more of the phase shifts φ1 to φN according to one or more values in the watermark identification code, and performs phase shift using the selected one of the phase shifts φ1 to φN. For example, if the value of the first bit of the watermark identification code is 1, an output phase-shifted reflected sound signal Sφ1 is shifted by φ1 relative to the reflected sound signal, and inference may be made by analogy for other reflected sound signals SφN. The phase shift may be achieved using Hilbert transform or other phase shift algorithms.
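The phase shift described above can be sketched as follows. This is only an illustration under stated assumptions: it rotates every frequency bin of a real signal by the selected angle, which is equivalent to combining the signal with its Hilbert transform, and it uses a naive O(N²) discrete Fourier transform so that the block stays self-contained. The names `phase_shift`, `embed_bit`, and `PHASE_FOR_BIT` are hypothetical; the binary mapping of 1 to 90° and 0 to −90° follows the example in the text.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def phase_shift(x, phi_deg):
    """Advance every frequency component of the real signal x by phi_deg degrees."""
    n = len(x)
    phi = math.radians(phi_deg)
    spectrum = dft(x)
    for k in range(1, n):
        if 2 * k < n:        # positive-frequency bins
            spectrum[k] *= cmath.exp(1j * phi)
        elif 2 * k > n:      # negative-frequency bins (conjugate side)
            spectrum[k] *= cmath.exp(-1j * phi)
        # The DC bin (k == 0) and the Nyquist bin (2 * k == n) are left unchanged.
    return [value.real for value in idft(spectrum)]


# Binary example from the text: value 1 -> +90 degrees, value 0 -> -90 degrees.
PHASE_FOR_BIT = {1: 90.0, 0: -90.0}


def embed_bit(reflected, bit):
    return phase_shift(reflected, PHASE_FOR_BIT[bit])


N = 64
tone = [math.cos(2 * math.pi * 4 * n / N) for n in range(N)]
shifted = embed_bit(tone, 1)  # cos(theta + 90 degrees) equals -sin(theta)
```

A base-N code would extend `PHASE_FOR_BIT` with one entry per symbol value; the particular angles are a design choice not fixed by this sketch.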
In an embodiment, if the filtering process is adopted for the reflected sound signal, then the processor 59 may further synthesize one or more phase-shifted reflected sound signals and reflected sound signals (e.g., the first reflected sound signal) passing through low-pass filtering (e.g., only passing sound signals at a frequency of 4 kHz and below) to generate the first watermark sound signal. In another embodiment, if the filtering process is not adopted for the reflected sound signal, the processor 59 may take one or more phase-shifted reflected sound signals as the first watermark sound signal.
With reference to
$$s''_{Rx}(n) = \alpha_2 \cdot s_{Rx}(n - n_{w2}) \quad (2)$$
where α2 is the amplitude attenuation caused by a second reflection (i.e., the reflection of a sound signal blocked by the wall W2), n is the sampling point or time, and nw2 is the time delay caused by a second reflection distance (i.e., the distance from the sound source SS through the wall W2 to the sound receiver 21). In other words, the two reflected sound signals respectively simulate the sound signals reflected by two external objects.
It is worth noting that a difference between the time delay caused by the second reflection distance and the time delay caused by the first reflection distance (or a difference between transmission times of the sound signals reflected by two external objects) (i.e., a sound signal distance value Δn) may be expressed as follows:
Δn=nw2−nw1 (3)
and the cause of sound delay mainly lies in the transmission distance of the sound signal. Therefore, the sound signal distance value is also related to, under the positional relationship of the set virtual reflection condition, a distance difference between the two reflection distances of sounds emitted by the sound source SS respectively reflected by two external objects (e.g., the walls W1 and W2) and reaching the sound receiver 21.
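To illustrate how the sound signal distance value of equation (3) relates to a distance difference, the sketch below converts two reflection distances into sample delays. It assumes that the delay equals the path length divided by the speed of sound (about 343 m/s in air at room temperature) multiplied by the sampling rate; the disclosure does not prescribe this conversion, and the function names and the distances used are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly 20 degrees Celsius


def delay_in_samples(reflection_distance_m, sample_rate_hz):
    """Convert a reflection path length into a delay in whole samples."""
    return round(reflection_distance_m / SPEED_OF_SOUND_M_S * sample_rate_hz)


def distance_value(first_distance_m, second_distance_m, sample_rate_hz):
    """Delta n = n_w2 - n_w1, as in equation (3)."""
    n_w1 = delay_in_samples(first_distance_m, sample_rate_hz)
    n_w2 = delay_in_samples(second_distance_m, sample_rate_hz)
    return n_w2 - n_w1


# Hypothetical virtual walls: path lengths of 3.43 m and 3.5158 m at 16 kHz.
delta_n = distance_value(3.43, 3.5158, 16000)
```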
Assuming that the sound signal distance value Δn is far smaller than the time delay corresponding to any reflected signal (e.g., Δn<<nw1), then the two reflection distances (e.g., the first reflection distance and the second reflection distance) are almost equal or completely equal, and the amplitude attenuations of the two reflected sound signals (e.g., the first reflected sound signal and the second reflected sound signal) should also be almost equal or completely equal (e.g., α1≅−α2). Therefore, low-frequency parts of the two reflected sound signals after being superimposed/synthesized are canceled against each other, thus reducing the power of the overall watermark sound signal, and making it difficult for users to perceive the watermark sound signal that is added.
It is worth noting that the conversation-received sound signal SRx may change with time. It is found through experiments that, if the sound signal distance value Δn may be changed appropriately with the change of the conversation-received sound signal SRx, it helps to combat noise interference. In the embodiments of the disclosure, the sound signal distance value is determined according to a high/low-frequency sound ratio of the reflected sound signal (e.g., the first reflected sound signal).
In an embodiment, after the processor 59 generates the reflected sound signal, the processor 59 performs low-pass filtering on the reflected sound signal to generate a low-frequency sound signal. In addition, the processor 59 performs high-pass filtering on the reflected sound signal to generate a high-frequency sound signal. The high/low-frequency sound ratio is a power ratio between the low-frequency sound signal and the high-frequency sound signal.
For example, in the conversation-received sound signal SRx, when a power of the high-frequency sound signal SRxHP is not less than a power of the low-frequency sound signal SRxLP, the sound signal distance value Δn is set to 5 (i.e., the first value). In addition, in the conversation-received sound signal SRx, when the power of the high-frequency sound signal SRxHP is less than the power of the low-frequency sound signal SRxLP, the sound signal distance value Δn is set to 4 (i.e., the second value). The relationship between the sound signal distance value Δn, a power PRxLP of the low-frequency sound signal SRxLP, and a power PRxHP of the high-frequency sound signal SRxHP may be expressed as follows:
$$\Delta n = \begin{cases} 5, & P_{RxHP} \ge P_{RxLP} \\ 4, & P_{RxHP} < P_{RxLP} \end{cases} \quad (4)$$
where PRxHP is the power of the high-frequency sound signal SRxHP of the conversation-received sound signal SRx, and PRxLP is the power of the low-frequency sound signal SRxLP of the conversation-received sound signal SRx. In other words, the power ratio between the high and low-frequency sound signals is PRxHP/PRxLP or PRxLP/PRxHP. Moreover, since the reflected sound signal is derived from the conversation-received sound signal, a change in the conversation-received sound signal also changes the reflected sound signal, and the sound signal distance value Δn changes dynamically as a result. It has been found through experiments that a dynamic spacing helps to improve the accuracy of watermark identification. Additionally, it should be noted that the first value and the second value may be changed depending on actual requirements, and are not limited by the embodiments of the disclosure.
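The selection rule described above can be sketched as follows. As an assumption of this sketch, the low-frequency and high-frequency powers are measured by summing squared DFT magnitudes on either side of a cutoff rather than by running explicit low-pass and high-pass filters; the 4 kHz cutoff, the 16 kHz sampling rate, and the function names are hypothetical. The values 5 and 4 are the first and second values given in the text.

```python
import cmath
import math


def band_powers(signal, sample_rate_hz, cutoff_hz):
    """Return (low_power, high_power) of the signal, split at cutoff_hz."""
    n = len(signal)
    low = high = 0.0
    for k in range(1, n // 2 + 1):
        bin_value = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n))
        power = abs(bin_value) ** 2
        if k * sample_rate_hz / n < cutoff_hz:
            low += power
        else:
            high += power
    return low, high


def choose_distance_value(p_hp, p_lp, first_value=5, second_value=4):
    """First value when the high band is not weaker than the low band."""
    return first_value if p_hp >= p_lp else second_value


fs = 16000
low_tone = [math.sin(2 * math.pi * 500 * t / fs) for t in range(160)]
high_tone = [math.sin(2 * math.pi * 6000 * t / fs) for t in range(160)]

p_lp, p_hp = band_powers(low_tone, fs, cutoff_hz=4000)
delta_n_low = choose_distance_value(p_hp, p_lp)

p_lp, p_hp = band_powers(high_tone, fs, cutoff_hz=4000)
delta_n_high = choose_distance_value(p_hp, p_lp)
```

Re-evaluating this rule frame by frame is one way to obtain the dynamic spacing that the text describes.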
With reference to
$$S''_{WM}(n) = -S'_{WM}(n - \Delta n) \quad (5)$$
In other words, the second watermark sound signal S″WM is the first watermark sound signal S′WM in an opposite phase and with the time delay of Δn.
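A minimal sketch of equation (5) and of the subsequent synthesis is given below, assuming that synthesis is a sample-wise sum of the two watermark sound signals. The test tones, the sampling rate, and the function names are hypothetical; they serve only to illustrate that, for a small Δn, a low-frequency component largely cancels while a high-frequency component does not.

```python
import math


def second_watermark(first_wm, delta_n):
    """S''_WM(n) = -S'_WM(n - delta_n), with zeros before the delay."""
    return [-first_wm[n - delta_n] if n >= delta_n else 0.0
            for n in range(len(first_wm))]


def synthesize(first_wm, second_wm):
    """Sample-wise sum of the two watermark sound signals."""
    return [a + b for a, b in zip(first_wm, second_wm)]


def power(x):
    return sum(v * v for v in x) / len(x)


fs = 16000
delta_n = 4
low_tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(1600)]
high_tone = [math.sin(2 * math.pi * 2000 * n / fs) for n in range(1600)]

out_low = synthesize(low_tone, second_watermark(low_tone, delta_n))
out_high = synthesize(high_tone, second_watermark(high_tone, delta_n))
```

In this toy case the 200 Hz tone loses most of its power after synthesis, whereas the 2 kHz tone, for which four samples equal half a period, is reinforced.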
With reference to
The processor 19 of the conference terminal 10 receives the watermark sound signal SWM or the watermark-embedded signal SRx+SWM through the communication transceiver 15 via the network, to obtain a transmitted sound signal SA (i.e., the watermark sound signal SWM or the watermark-embedded signal SRx+SWM that is transmitted). Since the watermark sound signal SWM includes the conversation-received sound signal that is time-delayed and amplitude-attenuated (i.e., the reflected sound signal), the echo cancellation mechanism of the processor 19 can effectively eliminate the watermark sound signal SWM. Accordingly, a transmitted sound signal STx (e.g., the conversation-received sound signal that the conference terminal 10 intends to transmit via the network) on the communication transmission path is not affected.
For identification of the watermark sound signal SWM,
With reference to
In an embodiment, the processor 19 may estimate the sound signal distance value ΔnA according to a correlation of the transmitted sound signal SALP under different time delays. For example, through an auto-cepstrum function (e.g., a Mel-frequency cepstrum coefficient (MFCC) or a linear prediction cepstrum coefficient (LPCC)), or other auto-correlation functions, the processor 19 measures the sound signal distance value ΔnA corresponding to the local maximum of the transmitted sound signal SALP passing through the low-pass filtering LPF. For example, the sound signal distance value ΔnA is 3 or 4.
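The delay estimation can be illustrated with a plain autocorrelation search, which is a simplification substituted here for the cepstrum-based functions named in the text. The sketch assumes that the received signal contains a component together with its inverted copy delayed by Δn, so the autocorrelation shows a strong negative peak at that lag; the search therefore looks for the largest magnitude. The synthetic noise input and the function names are hypothetical.

```python
import random


def autocorrelation(x, lag):
    return sum(x[n] * x[n - lag] for n in range(lag, len(x)))


def estimate_distance_value(x, max_lag):
    """Return the lag in 1..max_lag with the largest |autocorrelation|."""
    return max(range(1, max_lag + 1),
               key=lambda lag: abs(autocorrelation(x, lag)))


# Synthetic stand-in for a watermark: noise minus its copy delayed by 4 samples.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_delta_n = 4
received = [noise[n] - (noise[n - true_delta_n] if n >= true_delta_n else 0.0)
            for n in range(len(noise))]

estimated_delta_n = estimate_distance_value(received, max_lag=10)
```

Real speech is far more correlated than white noise, so a practical estimator would likely need the low-pass filtering and cepstral analysis that the text mentions.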
The processor 19 generates a second shifted sound signal S″A90° according to the first shifted sound signal S′A90° and the estimated sound signal distance value ΔnA (step S590). The relationship between the second shifted sound signal S″A90° and the first shifted sound signal S′A90° may be expressed as follows:
$${S''_{A}}^{90^\circ}(n) = {S'_{A}}^{90^\circ}(n - \Delta n) \quad (6)$$
That is, the second shifted sound signal S″A90° is the first shifted sound signal S′A90° being time-delayed by Δn.
The processor 19 may obtain a correlation coefficient by determining a correlation (i.e., a first correlation) between the first shifted sound signal S′A90° and the transmitted sound signal (SA or SAHP), and determining a correlation (i.e., a second correlation) between the second shifted sound signal S″A90° and the transmitted sound signal (SA or SAHP). For example, the processor 19 calculates the cross-correlation between the first shifted sound signal S′A90° and the transmitted sound signal (SA or SAHP) to obtain a first correlation r′HP90°, and calculates the cross-correlation between the second shifted sound signal S″A90° and the transmitted sound signal (SA or SAHP) to obtain a second correlation r′LP90°. The processor 19 performs subtraction between the first correlation r′HP90° and the second correlation r′LP90° to obtain a correlation coefficient RHP90°. The correlation coefficient RHP90° may be expressed as follows:
$$R_{HP}^{90^\circ} = {r'_{HP}}^{90^\circ} - {r'_{LP}}^{90^\circ} \quad (7)$$
The processor 19 may identify the watermark identification code according to the correlation coefficient RHP90° (step S595). For example, if the processor 19 defines a threshold ThR (e.g., 0.3, 0.5, or 0.7), then an identified watermark identification code WE may be expressed as:
$$W_E = \begin{cases} 1, & R_{HP}^{90^\circ} > Th_R \\ 0, & R_{HP}^{90^\circ} < Th_R \end{cases} \quad (8)$$
That is, if the correlation coefficient RHP90° is higher than the threshold ThR, the processor 19 determines that the value at this bit is a value corresponding to the phase shift 90° (e.g., 1). If the correlation coefficient RHP90° is lower than the threshold ThR, the processor 19 determines that the value at this bit is a value corresponding to the phase shift −90° (e.g., 0).
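The decision rule can be sketched as below, assuming that the first shifted sound signal is already available and that each correlation is a zero-lag cross-correlation normalized by the signal energies. The normalization, the handling of a coefficient exactly equal to the threshold, the toy input signals, and the function names are all assumptions of this sketch rather than details given in the disclosure.

```python
import math
import random


def normalized_correlation(a, b):
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return numerator / denominator if denominator else 0.0


def delay(x, delta_n):
    return [x[n - delta_n] if n >= delta_n else 0.0 for n in range(len(x))]


def identify_bit(first_shifted, transmitted, delta_n, threshold=0.5):
    """Return (bit, coefficient), where coefficient = first minus second correlation."""
    second_shifted = delay(first_shifted, delta_n)
    r_first = normalized_correlation(first_shifted, transmitted)
    r_second = normalized_correlation(second_shifted, transmitted)
    coefficient = r_first - r_second
    # Assumption: a coefficient exactly equal to the threshold is decoded as 0.
    return (1 if coefficient > threshold else 0), coefficient


# Toy signals: the transmitted signal is the reference minus its delayed copy.
rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
transmitted = [a - b for a, b in zip(reference, delay(reference, 4))]

bit_one, coeff_one = identify_bit(reference, transmitted, delta_n=4)
bit_zero, coeff_zero = identify_bit(reference, [-v for v in transmitted], delta_n=4)
```

Subtracting the second correlation exploits the inverted, delayed copy in the watermark: a matching phase drives the two correlations in opposite directions, which widens the margin around the threshold.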
Further description aided by experiments is provided below.
In summary of the foregoing, in the processing method of a sound watermark and the sound watermark generating apparatus of the embodiments of the disclosure, the sound signal distance value between two reflected sound signals to be simulated is dynamically determined according to the power ratio between the high-frequency sound signal and the low-frequency sound signal in the sound signal, and two watermark sound signals corresponding to the two reflected sound signals are generated based on the sound signal distance value. Accordingly, the power of the overall watermark sound signal can be reduced, and the correct rate of identification of the watermark identification code can be improved.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed embodiments without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the disclosure covers modifications and variations provided that they fall within the scope of the following claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
110147950 | Dec 2021 | TW | national |