This application claims the priority benefit of Taiwan application no. 110127497, filed on Jul. 27, 2021. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to a sound signal processing technique, and in particular, to a processing method of a sound watermark and a sound watermark generating apparatus.
Remote conferencing allows people in different places or spaces to communicate, and the equipment, protocols, and applications for remote conferencing have advanced considerably. It is worth noting that some instant conferencing applications synthesize an audio signal with a sound watermark signal to identify a speaker.
For example, a conference terminal may include a loudspeaker S, a sound receiver R, and an echo cancellation C arranged on its call paths, with a user sp speaking near the sound receiver R.
Generally, a major function of the echo cancellation C on the call transmission path is to eliminate the component belonging to the call reception signal in the sound signal S2 obtained by the sound receiver R, so as to obtain a sound signal S3 without an echo. However, the generating path of the sound watermark signal may differ from the path of the call reception signal. When the sound receiver R picks up the sound of the loudspeaker S through a feedback path fp, the component belonging to the sound watermark signal in the sound signal S1 might not be eliminated and may be transmitted further via the Internet. As a result, the audio component of the user sp in the sound signal S3 on the call transmission path might be affected.
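The disclosure does not limit the echo cancellation to a specific algorithm; the following is a minimal normalized-LMS (NLMS) sketch assuming a single loudspeaker reference and a single microphone. All names (nlms_echo_cancel, far_end, mic) and parameter values are illustrative only.

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=256, mu=0.5, eps=1e-6):
    """Minimal NLMS echo canceller sketch (hypothetical helper).

    far_end : reference signal sent to the loudspeaker (call reception path)
    mic     : signal captured by the sound receiver (echo + near-end speech)
    Returns the echo-reduced microphone signal.
    """
    w = np.zeros(taps)                 # adaptive filter estimating the echo path
    buf = np.zeros(taps)               # most recent far-end samples
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n] if n < len(far_end) else 0.0
        echo_est = w @ buf             # estimated echo component
        e = mic[n] - echo_est          # residual: near-end speech + unmodeled parts
        w += (mu / (buf @ buf + eps)) * e * buf   # NLMS update
        out[n] = e
    return out
```

Because the adaptive filter models only signals that are linearly derived from the far-end reference, a delayed and attenuated copy of the call reception signal is cancelled, whereas a watermark generated independently of that reference remains in the residual, which is the situation described above.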
Accordingly, the embodiments of the disclosure provide a processing method of a sound watermark and a sound watermark generating apparatus that generate a sound watermark which may be eliminated by echo cancellation, thereby enhancing call quality.
The processing method of the sound watermark of the embodiments of the disclosure is adapted for a conference terminal, and the conference terminal includes a sound receiver. The processing method of the sound watermark includes, but is not limited to, the following steps. A call reception sound signal is obtained through the sound receiver. A reflection sound signal is generated according to a virtual reflection condition and the call reception sound signal. The virtual reflection condition includes a position relation among the sound receiver, a sound source, and an external object. The reflection sound signal is a sound signal obtained by simulating a sound that is output by the sound source, reflected by the external object, and then recorded by the sound receiver. A phase of the reflection sound signal is shifted according to a watermark indication code to generate a sound watermark signal. The sound watermark signal includes the reflection sound signal with a phase shift.
The sound watermark generating apparatus of the embodiments of the disclosure includes, but is not limited to, a memory and a processor. The memory is configured to store a program code. The processor is coupled to the memory. The processor is configured to load and execute the program code to obtain a call reception sound signal. The processor generates a reflection sound signal according to a virtual reflection condition and the call reception sound signal and shifts a phase of the reflection sound signal according to a watermark indication code to generate a sound watermark signal. The call reception sound signal is obtained by recording through a sound receiver. The virtual reflection condition includes a position relation among the sound receiver, a sound source, and an external object. The reflection sound signal is a sound signal obtained by simulating a sound that is output by the sound source, reflected by the external object, and then recorded by the sound receiver. The sound watermark signal includes the reflection sound signal with a phase shift.
Based on the above, in the processing method of a sound watermark and the sound watermark generating apparatus of the embodiments of the disclosure, the sound signal reflected by the external object is simulated, and the simulated sound signal is encoded through phase shifting to generate the sound watermark signal. Accordingly, the general call reception signal and the sound watermark signal are maintained simultaneously at the loudspeaker end, and both signals may be eliminated by a conventional echo cancellation algorithm. Hence, the audio signal on the call transmission path is not affected.
In order to make the aforementioned features and advantages of the disclosure comprehensible, embodiments accompanied with drawings are described in detail below.
The conference terminal 10 and the conference terminal 20 may be a wired telephone, a mobile phone, an Internet phone, a tablet computer, a desktop computer, a laptop computer, or a smart speaker.
The conference terminal 10 includes, but is not limited to, a sound receiver 11, a loudspeaker 13, a communication transceiver 15, a memory 17, and a processor 19.
The sound receiver 11 may be a dynamic microphone, a condenser microphone, or an electret condenser microphone. The sound receiver 11 may also be any other combination of electronic components, an analog-to-digital converter, a filter, and an audio signal processor capable of receiving a sound wave (e.g. a human voice, an ambient sound, or a sound of machine operation) and converting the sound wave into a sound signal. In an embodiment, the sound receiver 11 is configured to receive/record a sound from a speaker to obtain a call reception sound signal. In some embodiments, the call reception sound signal may include a voice of the speaker, a sound generated by the loudspeaker 13, and/or other ambient sounds.
The loudspeaker 13 may be a speaker or a megaphone. In an embodiment, the loudspeaker 13 is configured to play a sound.
The communication transceiver 15 is, for example, a transceiver supporting a wired network such as Ethernet, fiber-optic, or cable (it may include, but is not limited to, elements such as a connection interface, a signal converter, or a communication protocol processing chip). The communication transceiver 15 may also be a transceiver supporting Wi-Fi, 4G, 5G, or later-generation mobile networks (it may include, but is not limited to, elements such as an antenna, a digital-to-analog/analog-to-digital converter, or a communication protocol processing chip). In an embodiment, the communication transceiver 15 is configured to transmit or receive data.
The memory 17 may be any type of fixed or removable random access memory (RAM), read only memory (ROM), flash memory, conventional hard disk drive (HDD), solid-state drive (SSD), or other similar devices. In an embodiment, the memory 17 is configured to store a program code, a software module, a configuration setting, data (e.g. a sound signal, a watermark indication code, or a sound watermark signal), or a file.
The processor 19 is coupled to the sound receiver 11, the loudspeaker 13, the communication transceiver 15, and the memory 17. The processor 19 may be a central processing unit (CPU), a graphics processing unit (GPU), another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a similar device, or any combination of the above devices. In an embodiment, the processor 19 is configured to execute all or some of the tasks of the conference terminal 10 to which the processor 19 belongs. The processor 19 may load and execute each of the software modules, files, and data stored in the memory 17.
The conference terminal 20 includes, but is not limited to, a sound receiver 21, a loudspeaker 23, a communication transceiver 25, a memory 27, and a processor 29. With regard to the implementations and features of the sound receiver 21, the loudspeaker 23, the communication transceiver 25, the memory 27, and the processor 29, reference may be made to the description of the sound receiver 11, the loudspeaker 13, the communication transceiver 15, the memory 17, and the processor 19; the details are not repeated here. The processor 29 is configured to execute all or some of the tasks of the conference terminal 20 to which the processor 29 belongs. The processor 29 may load and execute each of the software modules, files, and data stored in the memory 27.
The cloud server 50 is connected to the conference terminal 10 and the conference terminal 20 directly or indirectly through the Internet. The cloud server 50 may be a computer system, a server, or a signal processing device. In an embodiment, the conference terminal 10 and the conference terminal 20 may also serve as the cloud server 50. In another embodiment, the cloud server 50 may be an independent cloud server which is different from the conference terminal 10 and the conference terminal 20. In some embodiments, the cloud server 50 includes, but is not limited to, a communication transceiver 55, a memory 57, and a processor 59 that are the same as or similar to those described above; their implementations and features are not repeated here.
In an embodiment, a sound watermark generating apparatus 70 may be the conference terminal 10, the conference terminal 20, or the cloud server 50. The sound watermark generating apparatus 70 is configured to generate a sound watermark signal, which will be described further in the embodiments below.
In the description below, the method of the embodiments of the disclosure is described with reference to the devices, elements, and modules of the conference call system 1. Each step of the method may be adjusted according to the implementation, and the disclosure is not limited thereto.
Note that, for convenience of description, operations that are the same or similar across elements are described once and not repeated. For example, the processor 19 of the conference terminal 10, the processor 29 of the conference terminal 20, and/or the processor 59 of the cloud server 50 may each carry out the same or a similar method of the embodiments of the disclosure.
The processor 59 of the cloud server 50 receives the call reception sound signal SRx from the conference terminal 20 through the communication transceiver 55. The processor 59 generates a reflection sound signal S′Rx according to a virtual reflection condition and the call reception sound signal (step S330). Specifically, a common echo cancellation algorithm may adaptively eliminate components belonging to a reference signal (e.g. the call reception sound signal SRx of the call reception path) in the sound signals received by the sound receiver 11 and the sound receiver 21 from the outside. Sounds recorded by the sound receiver 11 and the sound receiver 21 include sound arriving via the shortest paths from the loudspeaker 13 and the loudspeaker 23 to the sound receiver 11 and the sound receiver 21 and via different reflection paths in the environment (i.e. paths formed when a sound is reflected by an external object). A reflection sound signal is affected by the reflection coefficient of the reflecting object, and the reflection position affects the time delay and the amplitude attenuation of the sound signal. In addition, the reflection sound signal may arrive from different directions, which leads to a phase shift. In the embodiments of the disclosure, a virtual/simulated reflection sound signal which may be eliminated by the echo cancellation is generated by using the sound signal SRx of the known call reception path, and the sound watermark signal SWM is then generated from it.
In an embodiment, the processor 59 may determine the time delay and the amplitude attenuation of the reflection sound signal S′Rx relative to the call reception sound signal SRx according to the position relation and a reflection coefficient of the external object.
Ts is the sampling time, vs is the speed of sound, and n is a sampling point or time index.
If it is assumed that the reflection sound signal S′Rx has a time delay γw and an amplitude attenuation αw compared with the call reception sound signal SRx, the relation between the reflection sound signal S′Rx and the call reception sound signal SRx may be represented as follows:
s′Rx(n) = αw·sRx(n−nw)   (2)
According to equations (1) and (2), further relations among these quantities are obtained, in which nf is a time delay caused by a filter (optional; it is further described in the embodiments below) and nφ is a time delay caused by a phase shift (optional; it is further described in the embodiments below).
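A minimal sketch of equation (2) follows. Equation (1), which relates the position relation, the sampling time Ts, and the speed of sound vs to the delay in samples, is not reproduced in this excerpt, so the delay computation below (extra path length divided by vs, converted to samples) and all numeric values (16 kHz sampling rate, 3.43 m extra path, αw = 0.3) are assumptions for illustration only.

```python
import numpy as np

def simulate_reflection(s_rx, alpha_w, n_w):
    """Equation (2) sketch: s'_Rx(n) = alpha_w * s_Rx(n - n_w)."""
    s_ref = np.zeros_like(s_rx)
    if n_w < len(s_rx):
        s_ref[n_w:] = alpha_w * s_rx[:len(s_rx) - n_w]
    return s_ref

# Assumed numbers: fs = 1/Ts = 16 kHz, speed of sound vs = 343 m/s,
# extra path length of the virtual reflection = 3.43 m  ->  10 ms delay.
fs = 16000
vs = 343.0
extra_path_m = 3.43
n_w = int(round(extra_path_m / vs * fs))   # = 160 samples of delay
alpha_w = 0.3                              # assumed reflection attenuation

s_rx = np.random.randn(fs)                 # stand-in for the call reception sound signal SRx
s_rx_reflected = simulate_reflection(s_rx, alpha_w, n_w)   # simulated S'_Rx
```

Because the simulated S′Rx is a delayed, attenuated copy of SRx, it stays within the class of signals that an adaptive echo canceller referenced to SRx can model and remove.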
Note that, according to different design requirements, the variables in the virtual reflection condition may be further adjusted. For example, there may be more than one external object or relative position.
In an embodiment, the processor 59 may perform a filtering processing of a specific frequency (for example, high-pass filtering) on the reflection sound signal S′Rx to generate a filtered reflection sound signal S″Rx, and the filtering may introduce the time delay nf.
In another embodiment, the processor 59 may not perform a filtering processing of a specific frequency on the reflection sound signal S′Rx. That is, the reflection sound signal S″Rx is the same as the reflection sound signal S′Rx.
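The filtering processing is optional and this excerpt does not fix a pass band; the sketch below assumes a Butterworth high-pass filter with an arbitrary 4 kHz cutoff, loosely consistent with the high-pass notation SAHP used later. The function name and cutoff are illustrative only.

```python
import numpy as np
from scipy.signal import butter, lfilter

def highpass(x, fs, cutoff_hz=4000.0, order=4):
    """Optional filtering processing of a specific (high) frequency band; cutoff is assumed."""
    b, a = butter(order, cutoff_hz, btype="highpass", fs=fs)
    return lfilter(b, a, x)

fs = 16000
s_rx_reflected = np.random.randn(fs)            # stand-in for the reflection sound signal S'_Rx
s_rx_filtered = highpass(s_rx_reflected, fs)    # S''_Rx; if filtering is skipped, S''_Rx = S'_Rx
```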
Next, the processor 59 shifts a phase of the reflection sound signal S″Rx according to a watermark indication code Wo to generate the sound watermark signal SWM.
In an embodiment, different values of the digits in the watermark indication code Wo correspond to different phase shifts. For example, a digit value of 1 may correspond to a 90° phase shift, and a digit value of 0 may correspond to a −90° phase shift.
The processor 59 may shift the phase of the reflection sound signal S″Rx according to the values of one or more digits in the watermark indication code Wo.
In an embodiment, the watermark indication code includes multiple digits. The sound watermark signal SWM then includes multiple reflection sound signals with phase shifts, and each reflection sound signal with a phase shift occupies a time length in the sound watermark signal SWM. It is assumed that the time length of each digit, denoted by Lb, is for example 0.1, 0.5, or 1 second and is greater than the time delay nw. Similar to the concept of time-division multiplexing, the processor 59 divides the time period (i.e. a major time unit) of the sound watermark signal SWM into minor time units with the same or different time lengths according to the number of digits included in the watermark indication code Wo. Each minor time unit carries the reflection sound signal with the phase shift corresponding to a different digit.
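The following sketch encodes the watermark by phase-shifting S″Rx per digit over minor time units, under the assumptions that every minor time unit has the same length Lb, that a digit value of 1 maps to a +90° shift and 0 to a −90° shift (per the example above), and that the phase shift is realized by rotating the spectrum of each segment. The helper names (phase_shift, encode_watermark) and numeric values are illustrative only.

```python
import numpy as np

def phase_shift(x, phi_rad):
    """Shift the phase of every non-DC frequency component of a real signal by phi_rad."""
    X = np.fft.rfft(x)
    X[1:] *= np.exp(1j * phi_rad)          # rotate all non-DC bins
    return np.fft.irfft(X, n=len(x))

def encode_watermark(s_ref, code_bits, fs, lb_seconds=0.5):
    """Build S_WM: one minor time unit of length Lb per digit of the watermark indication code."""
    lb = int(lb_seconds * fs)
    units = []
    for k, bit in enumerate(code_bits):
        segment = s_ref[k * lb:(k + 1) * lb]
        phi = np.pi / 2 if bit == 1 else -np.pi / 2   # assumed mapping: 1 -> +90 deg, 0 -> -90 deg
        units.append(phase_shift(segment, phi))
    return np.concatenate(units)

fs = 16000
code = [1, 0, 1, 1]                                   # hypothetical watermark indication code Wo
s_ref = np.random.randn(len(code) * int(0.5 * fs))    # stand-in for S''_Rx
s_wm = encode_watermark(s_ref, code, fs)              # sound watermark signal S_WM
```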
In an embodiment, if the filtering processing is performed on the reflection sound signal S′Rx, the time delay nf caused by the filter and the time delay nφ caused by the phase shift may further be taken into account when generating the sound watermark signal SWM.
In some embodiments, the processor 59 may generate multiple identical sound watermark signals that respectively correspond to different major time units; that is, the sound watermark signals are output in a loop. To distinguish adjacent sound watermark signals, the processor 59 may add an interval between them, for example, a mute signal or another known high-frequency sound signal.
In an embodiment, the processor 59 may transmit the call reception sound signal SRx and the sound watermark signal SWM separately through the communication transceiver 55. In another embodiment, the processor 59 may synthesize the call reception sound signal SRx and the sound watermark signal SWM to generate an embedded watermark signal SRx+SWM and then transmit the embedded watermark signal SRx+SWM through the communication transceiver 55.
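A short sketch of how the looped sound watermark with a mute interval and the embedded watermark signal SRx+SWM might be assembled is shown below; the interval length and the simple additive mixing are assumptions, and the function name is illustrative.

```python
import numpy as np

def embed_watermark(s_rx, s_wm, fs, gap_seconds=0.2):
    """Repeat S_WM with a mute interval between copies and add it to S_Rx."""
    gap = np.zeros(int(gap_seconds * fs))          # mute signal separating adjacent watermarks
    unit = np.concatenate([s_wm, gap])
    repeats = int(np.ceil(len(s_rx) / len(unit)))
    wm_stream = np.tile(unit, repeats)[:len(s_rx)] # looped watermark trimmed to the call length
    return s_rx + wm_stream                        # embedded watermark signal S_Rx + S_WM
```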
The processor 19 of the conference terminal 10 receives the sound watermark signal SWM or the embedded watermark signal SRx+SWM through the communication transceiver 15 via the Internet to obtain a transmission sound signal SA (i.e. the transmitted sound watermark signal SWM or embedded watermark signal SRx+SWM). Since the sound watermark signal SWM includes the call reception sound signal (i.e. the reflection sound signal) with the time delay and the amplitude attenuation, the echo cancellation of the processor 19 can effectively eliminate the sound watermark signal SWM. Accordingly, the call transmission sound signal STx (e.g. the sound signal which the conference terminal 10 desires to transmit via the Internet) on the call transmission path is not affected.
With regard to identifying the sound watermark signal SWM, the processor 19 may first perform high-pass filtering on the transmission sound signal SA to obtain a filtered transmission sound signal SAHP.
The processor 19 may shift the phase of the transmission sound signal SAHP according to the correspondence relation between the value described in step S450 and the phase shift (i.e. phase shifting is performed in step S930). For example, in each minor time unit, the processor 19 may compute a correlation between the phase-shifted transmission sound signal SAHP and a reference signal and compare the correlation with a threshold value ThR.
That is, if the correlation is greater than the threshold value ThR, the processor 19 determines that the value of the digit corresponds to the value (e.g. 1) of the 90° phase shift; if the correlation is less than the threshold value ThR, the processor 19 determines that the value of the digit corresponds to the value (e.g. 0) of the −90° phase shift. In another embodiment, the processor 19 may determine the values of the transmission sound signal SAHP corresponding to different minor time units by using a deep learning classifier.
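Identification is described above in terms of phase shifting SAHP, a correlation, and the threshold value ThR. This excerpt does not specify the reference against which the correlation is computed, so the sketch below assumes the receiver correlates each minor time unit, after undoing the +90° shift, with a known reference (for example, the locally reconstructed filtered reflection of the call reception signal). All names, the threshold default, and the choice of reference are assumptions.

```python
import numpy as np

def phase_shift(x, phi_rad):
    """Same spectral phase-rotation helper as in the encoding sketch."""
    X = np.fft.rfft(x)
    X[1:] *= np.exp(1j * phi_rad)
    return np.fft.irfft(X, n=len(x))

def decode_watermark(sa_hp, reference, fs, n_digits, lb_seconds=0.5, th_r=0.0):
    """Decode one digit per minor time unit by correlating against a reference signal."""
    lb = int(lb_seconds * fs)
    bits = []
    for k in range(n_digits):
        seg = sa_hp[k * lb:(k + 1) * lb]
        ref = reference[k * lb:(k + 1) * lb]
        # Undo the phase shift that a digit value of 1 (+90 deg) would have introduced;
        # if that was the transmitted shift, the segment re-aligns with the reference.
        corr = np.dot(phase_shift(seg, -np.pi / 2), ref)
        bits.append(1 if corr > th_r else 0)       # compare the correlation with ThR
    return bits

# Hypothetical usage: bits = decode_watermark(received_sa_hp, known_reference, 16000, n_digits=4)
```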
In summary, in the processing method of the sound watermark and the sound watermark generating apparatus according to the embodiments of the disclosure, the reflection sound signal is simulated according to the principle of echo cancellation, and the sound watermark signal is encoded by performing phase shifting on the reflection sound signal. Accordingly, at the receiving end, the sound watermark signal obtained through a feedback path may be eliminated by the echo cancellation, and the sound watermark signal does not affect the call transmission signal on the call transmission path.
Although the disclosure has been described with reference to the above embodiments, they are not intended to limit the disclosure. It will be apparent to one of ordinary skill in the art that modifications to the described embodiments may be made without departing from the spirit and the scope of the disclosure. Accordingly, the scope of the disclosure will be defined by the attached claims and their equivalents and not by the above detailed descriptions.