This application claims the priority benefit of Taiwan application serial no. 110125761, filed on Jul. 13, 2021. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to a speech processing technology, and more particularly, to a processing method of a sound watermark and a speech communication system.
Remote conferences allow people in different locations or spaces to have conversations, and conference-related equipment, protocols, and/or applications are also well developed. It is worth noting that some real-time conference programs may synthesize speech signals and watermark sound signals. However, the watermark embedding process may take too much time, making it difficult to meet the real-time requirements of a conference call. In addition, the sound signal may be affected by noise and distorted after transmission, so the embedded watermark may also be affected and become difficult to recognize.
In view of this, the embodiments of the disclosure provide a processing method of a sound watermark and a speech communication system, which may embed a watermark sound signal in real time and also provide an anti-noise function.
The processing method of the sound watermark in the embodiment of the disclosure includes (but is not limited to) the following steps. Multiple sinewave signals are generated. Frequencies of the sinewave signals are different, and the sinewave signals belong to a high-frequency sound signal. A watermark pattern is mapped into a time-frequency diagram to form a watermark sound signal. Two dimensions of the watermark pattern in a two-dimensional coordinate system respectively correspond to a time axis and a frequency axis in the time-frequency diagram. Each of multiple audio frames on the time axis corresponds to the sinewave signals with different frequencies on the frequency axis. A speech signal and the watermark sound signal are synthesized in a time domain to generate a watermark-embedded signal.
The speech communication system in the embodiment of the disclosure includes (but is not limited to) a transmitting device. The transmitting device is configured to generate multiple sinewave signals, map a watermark pattern into a time-frequency diagram to form a watermark sound signal, and synthesize a speech signal and the watermark sound signal in a time domain to generate a watermark-embedded signal. Frequencies of the sinewave signals are different, and the sinewave signals belong to a high-frequency sound signal. Two dimensions of the watermark pattern in a two-dimensional coordinate system respectively correspond to a time axis and a frequency axis in the time-frequency diagram. Each of multiple audio frames on the time axis corresponds to the sinewave signals with different frequencies on the frequency axis.
Based on the above, according to the speech communication system and the processing method of the sound watermark in the embodiments of the disclosure, the sinewave signals belonging to the high-frequency sound and having different frequencies are used to synthesize the watermark sound signal corresponding to the watermark pattern, and the watermark sound signal and the speech signal are synthesized in the time domain. In this way, the watermark sound signal may be embedded in real time, and the noise impact of the pulse signal may be reduced.
In order for the aforementioned features and advantages of the disclosure to be more comprehensible, embodiments accompanied with drawings are described in detail below.
The transmitting device 10 and the receiving device 50 may be wired phones, mobile phones, Internet phones, tablet computers, desktop computers, notebook computers, or smart speakers.
The transmitting device 10 includes (but is not limited to) a communication transceiver 11, a storage 13 and a processor 15.
The communication transceiver 11 is, for example, a transceiver (which may include (but is not limited to) a component such as a connection interface, a signal converter, and a communication protocol processing chip) that supports a wired network such as Ethernet, an optical fiber network, or a cable, and may also be a transceiver (which may include (but is not limited to) a component such as an antenna, a digital-to-analog/analog-to-digital converter, and a communication protocol processing chip) that supports a wireless network such as Wi-Fi, and a fourth generation (4G), a fifth generation (5G), or later generation mobile networks. In an embodiment, the communication transceiver 11 is configured to transmit or receive data through a network 30 (for example, the Internet, a local area network, or other types of networks).
The storage 13 may be any type of fixed or removable random access memory (RAM), a read only memory (ROM), a flash memory, a conventional hard disk drive (HDD), a solid-state drive (SSD), or similar components. In an embodiment, the storage 13 is configured to store a program code, a software module, a configuration, data (for example, a sound signal, a watermark pattern, and a watermark sound signal, etc.), or a file.
The processor 15 is coupled to the communication transceiver 11 and the storage 13. The processor 15 may be a central processing unit (CPU), a graphics processing unit (GPU), other programmable general-purpose or special-purpose microprocessors, a digital signal processor (DSP), a programmable controller, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), other similar components, or a combination of the above. In an embodiment, the processor 15 is configured to perform all or a part of operations of the transmitting device 10, and may load and execute the software module, the program code, the file, and the data stored by the storage 13.
The receiving device 50 includes (but is not limited to) a communication transceiver 51, a storage 53, and a processor 55. Implementation aspects of the communication transceiver 51, the storage 53, and the processor 55 and functions thereof may respectively refer to the descriptions of the communication transceiver 11, the storage 13, and the processor 15. Thus, details in this regard will not be further reiterated in the following.
In some embodiments, the transmitting device 10 and/or the receiving device 50 further includes a sound receiver and/or a speaker (not shown). The sound receiver may be a dynamic, condenser, or electret condenser microphone. The sound receiver may also be a combination of other electronic components that receive a sound wave (for example, human voice, environmental sound, and machine operation sound, etc.) and convert the sound wave into a sound signal, together with an analog-to-digital converter, a filter, and an audio processor. In an embodiment, the sound receiver is configured to receive/record the voice of a talker to obtain a speech signal. In some embodiments, the speech signal may include a voice of the talker, a sound from the speaker, and/or other environmental sounds. The speaker may be a horn or loudspeaker. In an embodiment, the speaker is configured to play the sound.
Hereinafter, various devices, components, and modules in the speech communication system 1 will be used to illustrate a method according to the embodiment of the disclosure. Each of the processes of the method may be adjusted according to the implementation situation, and the disclosure is not limited thereto.
In an embodiment, the processor 15 may assign the frequencies of the sinewave signals Sf1 to SfN at a specific frequency spacing. For example, the frequency of the sinewave signal Sf1 is 16 kilohertz (kHz). The frequency of the sinewave signal Sf2 is 16.5 kHz. The frequency of the sinewave signal Sf3 is 17 kHz. That is, the frequency spacing is 500 Hz, and the rest may be derived by analogy. In another embodiment, the frequency spacing between the sinewave signals Sf1 to SfN may not be fixed.
The processor 15 sets a time length of the sinewave signals Sf1 to SfN to the number of samples of an audio frame (time unit) (for example, 512, 1024, or 2048). In addition, the sinewave signals belong to a high-frequency sound signal (for example, the frequency thereof is between 16 kHz and 20 kHz, but may vary depending on capabilities of the speaker).
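The tone generation described above can be sketched as follows. This is only an illustration under stated assumptions: the 48 kHz sample rate, the number of tones N = 8, and the helper names `tone_frequencies` and `sine_frame` are hypothetical and do not come from the disclosure; the 16 kHz base frequency, the 500 Hz spacing, and the 1024-sample frame are the example values given in the text.

```python
import math

SAMPLE_RATE = 48_000   # Hz; assumed, must exceed twice the highest tone frequency
FRAME_LEN = 1024       # samples per audio frame (one of the example sizes)
BASE_FREQ = 16_000.0   # Hz; frequency of Sf1 in the example
SPACING = 500.0        # Hz; fixed frequency spacing in the example
NUM_TONES = 8          # N; assumed value for illustration only


def tone_frequencies(base, spacing, count):
    """Return the frequencies of Sf1..SfN for a fixed frequency spacing."""
    return [base + k * spacing for k in range(count)]


def sine_frame(freq, frame_len, sample_rate):
    """Return one audio frame (frame_len samples) of a unit-amplitude sinewave."""
    return [math.sin(2.0 * math.pi * freq * n / sample_rate)
            for n in range(frame_len)]


frequencies = tone_frequencies(BASE_FREQ, SPACING, NUM_TONES)
sinewaves = [sine_frame(f, FRAME_LEN, SAMPLE_RATE) for f in frequencies]
```

With these assumed values the tones span 16 kHz to 19.5 kHz, which stays inside the 16 kHz to 20 kHz band mentioned above.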
In an embodiment, the processor 15 further windows the sinewave signals Sf1 to SfN based on a windowing function (for example, a Hamming window, a rectangular window, or a Gaussian window) to generate windowed sinewave signals Sf1w to SfNw. In this way, a time spacing is generated in a time domain between the adjacent audio frames, and a pulse is avoided between the audio frames.
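As a minimal sketch of the windowing step, the following applies a Hamming window to one frame of a 16 kHz tone. The 48 kHz sample rate and the function names `hamming` and `apply_window` are assumptions made for illustration; the disclosure does not specify them. The window tapers each frame toward its edges, which is one way to obtain the reduced energy at frame boundaries described above.

```python
import math


def hamming(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length < 2:
        return [1.0] * length
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame, window):
    """Multiply a frame by a window sample by sample."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [s * w for s, w in zip(frame, window)]


frame_len = 1024
sample_rate = 48_000  # Hz; assumed
tone = [math.sin(2.0 * math.pi * 16_000.0 * n / sample_rate)
        for n in range(frame_len)]
window = hamming(frame_len)
windowed = apply_window(tone, window)  # corresponds to one windowed signal Sf1w
```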
The processor 15 maps a watermark pattern W1 into a time-frequency diagram to form a watermark sound signal SW (step S220). Specifically, the watermark pattern W1 may be designed according to the user requirements, and the embodiment of the disclosure is not limited thereto.
The processor 15 converts the watermark pattern W1 from a two-dimensional coordinate system into the time-frequency diagram. The two-dimensional coordinate system includes two dimensions, which respectively correspond to the time axis and the frequency axis in the time-frequency diagram.
In an embodiment, the processor 15 further extends the watermark pattern W1 on a time axis corresponding to one dimension in the two-dimensional coordinate system according to an amount of superposition. The amount of superposition is the amount by which the adjacent audio frames overlap. For example, the amount of superposition is 0.5 audio frame or other time lengths, and the superposition of the audio frames will be detailed later.
On the other hand, the time-frequency diagram includes a time axis and a frequency axis. Each of the audio frames on the time axis corresponds to the sinewave signals with different frequencies on the frequency axis. In an embodiment, the processor 15 establishes a watermark matrix in the time-frequency diagram according to the watermark pattern W1. The watermark matrix includes multiple elements, and each of the elements is one of a marked element and an unmarked element. The marked element denotes that a corresponding position of the watermark pattern W1 in the two-dimensional coordinate system has a value, and the unmarked element denotes that the corresponding position of the watermark pattern W1 in the two-dimensional coordinate system does not have a value.
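The watermark matrix described above can be illustrated with a small sketch. The 3x3 pattern, the `#` character for a position that has a value, the function name `watermark_matrix`, and the choice that the bottom row of the pattern maps to the lowest-frequency sinewave are all assumptions made for this example only.

```python
# Hypothetical 3x3 watermark pattern: '#' marks a position that has a value.
# Columns run along the time axis (audio frames); rows run along the frequency axis.
PATTERN = [
    "#..",
    "#..",
    "###",
]


def watermark_matrix(pattern):
    """Return matrix[k][t], where 1 is a marked element and 0 an unmarked element.

    Index k selects a sinewave (k = 0 is the lowest frequency) and index t
    selects an audio frame.
    """
    width = len(pattern[0])
    if any(len(line) != width for line in pattern):
        raise ValueError("all pattern rows must have the same length")
    rows = [[1 if ch == "#" else 0 for ch in line] for line in pattern]
    rows.reverse()  # assumed orientation: bottom pattern row = lowest frequency
    return rows


matrix = watermark_matrix(PATTERN)
```

In this example the lowest-frequency sinewave is marked in every audio frame, while the two higher sinewaves are marked only in the first frame.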
The processor 15 selects one or more sinewave signals in each of the audio frames according to the watermark matrix. The one or more selected sinewave signals correspond to the marked elements among the elements.
The processor 15 superimposes the one or more selected sinewave signals on the audio frames in the time-frequency diagram in the time domain to form the watermark sound signal SW. The processor 15 superimposes the adjacent audio frames according to the amount of superposition.
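A minimal sketch of this selection and superposition, written as an overlap-add loop, is given here. It assumes Hamming-windowed tones, a 48 kHz sample rate in the usage line, and an overlap of 0.5 audio frame; the function names are hypothetical. Each frame restarts its tones at zero phase, which is a simplification of this sketch and not a statement about the disclosed method.

```python
import math


def windowed_tone(freq, frame_len, sample_rate):
    """One Hamming-windowed frame of a sinewave at the given frequency."""
    return [(0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1)))
            * math.sin(2.0 * math.pi * freq * n / sample_rate)
            for n in range(frame_len)]


def build_watermark_signal(matrix, freqs, frame_len, sample_rate, overlap=0.5):
    """Overlap-add the sinewaves selected by matrix[k][t] into one signal.

    matrix[k][t] == 1 means sinewave k is selected in audio frame t.
    overlap is the fraction of a frame shared by adjacent frames.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = max(1, int(frame_len * (1.0 - overlap)))  # start-to-start distance
    num_frames = len(matrix[0])
    signal = [0.0] * (hop * (num_frames - 1) + frame_len)
    tones = [windowed_tone(f, frame_len, sample_rate) for f in freqs]
    for t in range(num_frames):
        start = t * hop
        for k, row in enumerate(matrix):
            if row[t]:  # marked element: add this tone into frame t
                for n, sample in enumerate(tones[k]):
                    signal[start + n] += sample
    return signal


watermark = build_watermark_signal([[1, 0, 1], [0, 1, 0]],
                                   [16_000.0, 16_500.0], 64, 48_000)
```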
The processor 15 synthesizes a speech signal S′H and the watermark sound signal SW in the time domain to generate a watermark-embedded signal SHWed (step S230). Specifically, a speech signal SH is a sound signal obtained by the transmitting device 10 recording the talker through the sound receiver, or obtained from an external device (for example, a call conference server, a recording pen, or a smart phone). For example, in a conference call, the transmitting device 10 receives the sound of the talker.
In an embodiment, the processor 15 may filter out the sound signals in a frequency band where the sinewave signals Sf1 to SfN are located in the original speech signal SH to generate the speech signal S′H. For example, assuming that the frequency band where the sinewave signals Sf1 to SfN are located is 16 kHz to 20 kHz, the processor 15 passes the speech signal SH through a low-pass filter that passes frequencies below 16 kHz. In this way, it is possible to prevent the speech signal SH from affecting the watermark sound signal SW. In another embodiment, the processor 15 may directly use the original speech signal SH as the speech signal S′H.
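One possible way to realize such a low-pass filter is a windowed-sinc FIR design, sketched below. The filter type, the 101 taps, the 48 kHz sample rate, and the function names are assumptions of this sketch; the disclosure only requires a low-pass filter with a 16 kHz cutoff in the example. The demonstration at the end compares how a 1 kHz tone and an 18 kHz tone pass through the filter.

```python
import math


def lowpass_taps(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc low-pass FIR taps (Hamming window, unity gain at DC)."""
    fc = cutoff_hz / sample_rate          # normalized cutoff in cycles/sample
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2.0 * fc if x == 0 else math.sin(2.0 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def fir_filter(signal, taps):
    """Direct-form FIR filtering; samples before the start are taken as zero."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            if i - j >= 0:
                acc += t * signal[i - j]
        out.append(acc)
    return out


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


SAMPLE_RATE = 48_000  # Hz; assumed
taps = lowpass_taps(16_000.0, SAMPLE_RATE)
speech_band = [math.sin(2.0 * math.pi * 1_000.0 * n / SAMPLE_RATE) for n in range(600)]
tone_band = [math.sin(2.0 * math.pi * 18_000.0 * n / SAMPLE_RATE) for n in range(600)]
# Skip the first 200 output samples so the filter start-up transient is excluded.
low_rms = rms(fir_filter(speech_band, taps)[200:])
high_rms = rms(fir_filter(tone_band, taps)[200:])
```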
The processor 15 may add the watermark sound signal SW to the speech signal S′H in the time domain through methods such as spread spectrum, echo hiding, and phase encoding to form the watermark-embedded signal SHWed. In light of the above, in the embodiment of the disclosure, the watermark sound signal SW is established in advance to be synthesized with the speech signal S′H in the time domain in real time.
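The benefit of establishing the watermark sound signal in advance can be illustrated with a simple streaming mixer. Note that this sketch uses plain sample-wise addition with a gain factor, which is a simpler stand-in and not one of the methods named above (spread spectrum, echo hiding, phase encoding). The class name `WatermarkMixer` and the gain value are hypothetical.

```python
class WatermarkMixer:
    """Adds a precomputed, cyclically repeated watermark to incoming speech blocks."""

    def __init__(self, watermark, gain=0.05):
        if not watermark:
            raise ValueError("watermark must not be empty")
        self.watermark = list(watermark)  # computed once, before the call starts
        self.gain = gain                  # assumed scaling of the watermark level
        self.pos = 0                      # read position inside the watermark

    def process(self, speech_block):
        """Return the watermark-embedded version of one block of speech samples."""
        out = []
        for sample in speech_block:
            out.append(sample + self.gain * self.watermark[self.pos])
            self.pos = (self.pos + 1) % len(self.watermark)
        return out
```

Because the per-sample work is a single multiply-add on a stored buffer, the cost of embedding does not depend on how the watermark pattern was generated.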
The processor 15 transmits the watermark-embedded signal SHWed through the communication transceiver 11 and through the network 30 (step S240). The processor 55 of the receiving device 50 receives a transmitted sound signal SA through the communication transceiver 51. The transmitted sound signal SA is the transmitted watermark-embedded signal SHWed. In some cases, the watermark-embedded signal SHWed is distorted during transmission over the network 30 (for example, interfered with by other environmental sounds, reflections from obstacles, or other noise) to form the transmitted sound signal SA (also called an attacked signal). It is worth noting that the transmitting device 10 sets the watermark sound signal SW to the high-frequency sound signal, but the high-frequency sound signal may be interfered with by a pulse signal.
The processor 55 maps the transmitted sound signal SA into the time-frequency diagram, and compares multiple preset watermark signals W1 to WM (step S250). Specifically, the processor 55 may use a fast Fourier transform (FFT) or another transform from the time domain to a frequency domain to convert each of the non-superimposed audio frames in the transmitted sound signal SA to the frequency domain, and consider the overall time-frequency diagram formed by all the audio frames.
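A minimal sketch of this mapping is given here. Instead of a full FFT, it evaluates the discrete Fourier transform only at the known sinewave frequencies, which is a substitution made to keep the example short and dependency-free. The 48 kHz sample rate, the 96-sample frame (chosen so that the example tones fall on exact 500 Hz bins), and the function names are assumptions.

```python
import cmath
import math


def tone_magnitude(frame, freq, sample_rate):
    """Normalized magnitude of the DFT of one frame evaluated at a single frequency."""
    acc = 0j
    for n, sample in enumerate(frame):
        acc += sample * cmath.exp(-2j * math.pi * freq * n / sample_rate)
    return abs(acc) / len(frame)


def time_frequency_map(signal, freqs, frame_len, sample_rate):
    """Return tf[k][t]: magnitude of tone k in non-overlapping audio frame t."""
    num_frames = len(signal) // frame_len
    return [[tone_magnitude(signal[t * frame_len:(t + 1) * frame_len], f, sample_rate)
             for t in range(num_frames)]
            for f in freqs]


SAMPLE_RATE = 48_000  # Hz; assumed
FRAME_LEN = 96        # samples; assumed, gives a 500 Hz bin spacing
FREQS = [16_000.0, 16_500.0, 17_000.0]

# Test signal: frame 0 carries the 16 kHz tone, frame 1 carries the 17 kHz tone.
received = ([math.sin(2.0 * math.pi * 16_000.0 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]
            + [math.sin(2.0 * math.pi * 17_000.0 * n / SAMPLE_RATE) for n in range(FRAME_LEN)])
tf = time_frequency_map(received, FREQS, FRAME_LEN, SAMPLE_RATE)
```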
On the other hand, the preset watermark signals W1 to WM (where M is a positive integer) are respectively configured to recognize different transmitting devices 10 or different users. The preset watermark signals have been stored in the storage 53. The preset watermark signals W1 to WM correspond to multiple preset watermark patterns in the two-dimensional coordinate system. Similarly, each of the preset watermark patterns may be designed according to the user requirements, and the embodiment of the disclosure is not limited thereto.
The processor 55 recognizes the watermark sound signal SW (step S260) according to a correlation between the transmitted sound signal SA and the preset watermark signals W1 to WM (that is, a comparison result of the transmitted sound signal SA and the preset watermark signals W1 to WM). Specifically, the correlation herein is a degree of similarity between the transmitted sound signal SA and the preset watermark signals W1 to WM. Among the preset watermark signals, the preset watermark signal with the highest degree of similarity is recognized as the watermark sound signal SW.
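The recognition by highest degree of similarity can be sketched as follows. The use of a normalized correlation (cosine similarity) between time-frequency matrices is an assumption of this sketch, as are the function names and the two small preset matrices; the disclosure leaves the exact similarity measure open.

```python
import math


def similarity(a, b):
    """Normalized correlation (cosine similarity) of two equally sized matrices."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("matrices must have the same size")
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recognize(received_tf, presets):
    """Return the name of the most similar preset and all similarity scores."""
    scores = {name: similarity(received_tf, m) for name, m in presets.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical presets and a noisy received time-frequency map.
presets = {
    "W1": [[1, 0, 1], [0, 1, 0]],
    "W2": [[0, 1, 0], [1, 0, 1]],
}
received_tf = [[0.9, 0.1, 0.8], [0.0, 0.7, 0.2]]
best, scores = recognize(received_tf, presets)
```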
The processor 55 may modify the preset watermark signals W1 to WM according to the one or more pulse signals τx (step S830). Specifically, the processor 55 adds or subtracts a characteristic of pulse interference to the preset watermark signals W1 to WM on the vertical axis (corresponding to the frequency axis) in the two-dimensional coordinate system according to a position of the audio frame where the pulse signal τx is located (corresponding to a position in the horizontal axis in the two-dimensional coordinate system), so as to generate modified preset watermark signals W′1 to W′M.
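A minimal sketch of this modification is shown here for the "add" case. It assumes that the characteristic of pulse interference is energy across the whole frequency axis in the affected audio frame, so every element in that column of the preset matrix is set to marked. This model, the function name `add_pulse_columns`, and the matrix layout (`preset[k][t]` with k on the frequency axis and t on the time axis) are assumptions and are not specified by the disclosure.

```python
def add_pulse_columns(preset, pulse_frames, level=1):
    """Return a copy of a preset watermark matrix with pulse interference added.

    For every audio frame index in pulse_frames, all elements along the
    frequency axis are raised to at least `level`.
    """
    modified = [row[:] for row in preset]       # leave the stored preset untouched
    num_frames = len(preset[0]) if preset else 0
    for t in pulse_frames:
        if 0 <= t < num_frames:                 # ignore positions outside the pattern
            for row in modified:
                row[t] = max(row[t], level)
    return modified


preset_w1 = [[1, 0, 1], [0, 1, 0]]
modified_w1 = add_pulse_columns(preset_w1, pulse_frames=[1])
```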
In an embodiment, the above correlation includes a first correlation. The processor 55 may determine the first correlation between the transmitted sound signal SA and the preset watermark signals W1 to WM that have not been modified, and select multiple candidate watermark signals from the preset watermark signals W1 to WM according to the first correlation. The processor 55 may only modify the candidate watermark signals in the preset watermark signals W1 to WM. The processor 55 may, for example, select some candidate watermark signals with a relatively high degree of similarity to the transmitted sound signal SA according to a classifier based on deep learning or cross-correlation. Taking cross-correlation as an example, a preset watermark signal whose cross-correlation value is greater than the corresponding threshold value may be used as a candidate watermark signal.
In an embodiment, the above correlation includes a second correlation. The processor 55 may decide the second correlation between the transmitted sound signal SA and the modified preset watermark signals W′1 to W′M or the candidate watermark signals, and perform a pattern recognition accordingly (step S850). Specifically, since the watermark sound signal SW belongs to the high-frequency audio signal, the processor 55 may filter out the sound signals outside the frequency band where the sinewave signals Sf1 to SfN are located in the original transmitted sound signal SA. For example, the processor 55 passes the transmitted sound signal SA through a high-pass filter that passes frequencies above 16 kHz. In addition, the processor 55 may, for example, select the one candidate watermark signal with the highest degree of similarity to the transmitted sound signal SA according to the classifier based on deep learning or cross-correlation. Taking the cross-correlation as an example, the candidate with the maximum cross-correlation value may be taken as the recognized watermark sound signal SW. For example, the preset watermark signal W1 has the highest correlation, so that the preset watermark signal W1 is the watermark sound signal SW.
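The threshold-based candidate selection can be sketched in a few lines. The score values, the single shared threshold, and the function name `select_candidates` are hypothetical; the text above allows a separate threshold per preset watermark signal and also allows a deep-learning classifier instead.

```python
def select_candidates(first_correlation, threshold):
    """Return preset names whose first correlation exceeds the threshold.

    first_correlation maps a preset name to its cross-correlation value.
    Candidates are returned from most to least similar.
    """
    ranked = sorted(first_correlation.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked if value > threshold]


# Hypothetical first-correlation values for three preset watermark signals.
first_correlation = {"W1": 0.9, "W2": 0.2, "W3": 0.6}
candidates = select_candidates(first_correlation, threshold=0.5)
```

Only the returned candidates then need to be modified and compared again, which reduces the work of the second stage.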
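The effect of the second correlation can be illustrated with a small worked example in which a pulse in one audio frame changes which preset is most similar. The cosine-similarity measure, the full-column pulse model, the two preset matrices, and all function names are assumptions made for this sketch, and the band-limiting filter mentioned above is omitted for brevity.

```python
import math


def similarity(a, b):
    """Normalized correlation (cosine similarity) of two equally sized matrices."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def add_pulse_columns(preset, pulse_frames):
    """Mark every frequency element in the audio frames hit by a pulse."""
    modified = [row[:] for row in preset]
    for t in pulse_frames:
        for row in modified:
            row[t] = 1
    return modified


def recognize(received_tf, candidates, pulse_frames):
    """Second correlation against pulse-modified candidates; returns best name and scores."""
    scores = {name: similarity(received_tf, add_pulse_columns(m, pulse_frames))
              for name, m in candidates.items()}
    return max(scores, key=scores.get), scores


candidates = {
    "W1": [[1, 0, 1], [0, 1, 0]],
    "W2": [[1, 1, 1], [0, 1, 1]],
}
# W1 as received after a pulse filled the whole frequency axis in audio frame 1.
received_tf = [[1, 1, 1], [0, 1, 0]]

without_fix, _ = recognize(received_tf, candidates, pulse_frames=[])
with_fix, second_scores = recognize(received_tf, candidates, pulse_frames=[1])
```

In this constructed case, comparing against the unmodified presets favors W2, while comparing against the pulse-modified presets recovers W1.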
Based on the above, in the speech communication system and the processing method of the sound watermark according to the embodiments of the disclosure, the watermark sound signal formed by superimposing the sinewave signals with different frequencies corresponding to the audio frames is defined in advance at a transmitting end, so that the watermark sound signal may be embedded into the speech signal in real time, thereby meeting the needs of real-time call conferences. In addition, the pulse signal is determined at a receiving end, and the interference of the pulse signal on the preset watermark signals is considered, so that the watermark sound signal is accurately recognized, thereby reducing the noise impact of the pulse signal.
Although the disclosure has been described with reference to the above embodiments, they are not intended to limit the disclosure. It will be apparent to one of ordinary skill in the art that modifications to the described embodiments may be made without departing from the spirit and the scope of the disclosure. Accordingly, the scope of the disclosure will be defined by the attached claims and their equivalents and not by the above detailed descriptions.
Number | Date | Country | Kind |
---|---|---|---|
110125761 | Jul 2021 | TW | national |
Number | Date | Country |
---|---|---|
20230019841 A1 | Jan 2023 | US |