The present invention relates to a system for speech separation and a method performed in the system, and specifically relates to a system and a method for improving speech separation performance by a sliding window.
In recent years, more and more vehicles have voice recognition functions. However, when more than one person speaks in the vehicle at the same time, the host system of the vehicle cannot quickly pick out the driver's voice from among the plurality of voices. In this case, the operation corresponding to the driver's instruction cannot be performed accurately and promptly, and an erroneous operation can easily result.
Currently, there are mainly two ways to perform speech separation. The first is to create a microphone array for voice enhancement. The second is to use algorithms for speech separation. Various algorithms for speech separation may include Frequency Domain Independent Component Analysis (FDICA), Degenerate Unmixing Estimation Technique (DUET) or their extension algorithms.
A DUET Blind Source Separation method can separate any number of voice sources using only two mixtures. The method is valid when sources are W-disjoint orthogonal, that is, when the supports of the windowed Fourier transform of the signals in the mixture are disjoint. For anechoic mixtures of attenuated and delayed sources, the method allows one to estimate the mixing parameters by clustering relative attenuation-delay pairs extracted from the ratios of the time-frequency representations of the mixtures. The estimates of the mixing parameters are then used to partition the time-frequency representation of one mixture to recover the original sources.
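The attenuation-delay extraction described above can be illustrated for a single time-frequency bin. The following is a minimal sketch, not the DUET implementation of the invention: it assumes the anechoic model in which the second mixture is an attenuated and delayed copy of the first, so that the ratio of the two transform values equals a·exp(-jωδ). The function name attenuation_delay and the synthetic values are hypothetical, and the clustering of the pairs across all bins is omitted.

```python
import cmath
import math


def attenuation_delay(x1, x2, omega):
    """Estimate (attenuation, delay) from one time-frequency bin.

    x1, x2: complex transform values of the two mixtures at the same bin.
    omega:  angular frequency of the bin in radians per sample (non-zero).
    Assumes x2 = a * exp(-1j * omega * delta) * x1 with |omega * delta| < pi.
    """
    ratio = x2 / x1
    a = abs(ratio)                        # relative attenuation
    delta = -cmath.phase(ratio) / omega   # relative delay in samples
    return a, delta


# Synthetic check with hypothetical values: one source active in this bin.
omega = 2 * math.pi * 0.1
source = complex(0.3, -0.7)
a_true, delta_true = 0.8, 2.0
x1 = source
x2 = a_true * cmath.exp(-1j * omega * delta_true) * source

a_est, delta_est = attenuation_delay(x1, x2, omega)
print(round(a_est, 6), round(delta_est, 6))  # 0.8 2.0
```

When the sources are W-disjoint orthogonal, each bin is dominated by one source, so the pairs computed this way would be expected to cluster around one point per source.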
In practice, for example, if the time of a segment of speech is 4 seconds (such as shown in
Therefore, there is a need to develop an improved speech separation system and method that can quickly perform the speech separation so as to quickly recover the original sources of sounds.
In one or more illustrative embodiments, a method for speech separation is provided. The method uses at least one microphone to acquire at least one speech from at least one user and stores the at least one speech as a speech signal in a sound recording module. The method further extracts the speech signal from the sound recording module and processes the extracted speech signal through a sliding window, and transmits the processed speech signal to a DUET module for speech separation.
Preferably, in one embodiment the method applies the sliding window by: traversing the extracted speech signal to determine a maximum amplitude of the speech signal; determining a starting position of the sliding window, the starting position being the position where, scanning from the beginning of the speech signal, the amplitude of the speech signal first exceeds a predetermined proportion of the maximum amplitude; determining an ending position of the sliding window, the ending position being the position where, scanning from the end of the speech signal back toward the beginning, the amplitude of the speech signal first exceeds the predetermined proportion of the maximum amplitude; and selecting the segment of the speech signal between the starting position and the ending position of the sliding window as a processed speech signal for speech separation.
Preferably, in another embodiment the method applies the sliding window by: traversing the extracted speech signal to determine an average amplitude of the speech signal; determining a starting position of the sliding window, the starting position being the position where, scanning from the beginning of the speech signal, the amplitude of the speech signal first exceeds the average amplitude; determining an ending position of the sliding window, the ending position being the position where, scanning from the end of the speech signal back toward the beginning, the amplitude of the speech signal first exceeds the average amplitude; and selecting the segment of the speech signal between the starting position and the ending position of the sliding window as a processed speech signal for speech separation.
In one or more illustrative embodiments, a system for speech separation is provided. The system for speech separation comprises at least one microphone for acquiring at least one speech from at least one user, a sound recording module for storing the at least one speech as a speech signal, a sliding window for extracting the speech signal from the sound recording module and processing the extracted speech signal, and a DUET module for receiving the processed speech signal for speech separation.
Preferably, the sliding window in one embodiment is configured to: traverse the extracted speech signal to determine a maximum amplitude of the speech signal; determine a starting position of the sliding window, the starting position being the position where, scanning from the beginning of the speech signal, the amplitude of the speech signal first exceeds a predetermined proportion of the maximum amplitude; determine an ending position of the sliding window, the ending position being the position where, scanning from the end of the speech signal back toward the beginning, the amplitude of the speech signal first exceeds the predetermined proportion of the maximum amplitude; and select the segment of the speech signal between the starting position and the ending position of the sliding window as a processed speech signal for speech separation.
Preferably, the sliding window in another embodiment is configured to: traverse the extracted speech signal to determine an average amplitude of the speech signal; determine a starting position of the sliding window, the starting position being the position where, scanning from the beginning of the speech signal, the amplitude of the speech signal first exceeds the average amplitude; determine an ending position of the sliding window, the ending position being the position where, scanning from the end of the speech signal back toward the beginning, the amplitude of the speech signal first exceeds the average amplitude; and select the segment of the speech signal between the starting position and the ending position of the sliding window as a processed speech signal for speech separation.
A computer-readable medium having computer-executable instructions for performing the above method is also provided.
Advantageously, the disclosed speech separation system and method can improve the real-time performance of DUET by using a sliding window.
The systems, methods, features and advantages will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description and be within the scope of the invention.
The features, nature, and advantages of the present application may be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like reference numerals designate corresponding parts throughout the different views.
It is to be understood that the following description of examples of implementations is given only for the purpose of illustration and is not to be taken in a limiting sense. The partitioning of examples into function blocks, modules or units shown in the drawings is not to be construed as indicating that these function blocks, modules or units are necessarily implemented as physically separate units. Functional blocks, modules or units shown or described may be implemented as separate units, circuits, chips, functions, modules, or circuit elements. One or more functional blocks or units may also be implemented in a common circuit, chip, circuit element or unit.
When the system is working, for example, as shown in
The sliding window module can extract the speech signal from the sound recording module and process the extracted speech signal by a sliding window. The processed speech signal is then transmitted to a DUET module for speech separation. Finally, the different sources of speech can be separated. For example, the processed speech signal can be separated into the first speech (sound1) from the first person and the second speech (sound2) from the second person.
A sliding window will be illustrated referring to
For example, the extracted speech signal may last four seconds as shown in
For example,
As shown in
The processing using a sliding window at step 502 may comprise determining a window length of a sliding window, and selecting a segment of the speech signal within the window length of the sliding window as the processed speech signal for further speech separation.
According to one embodiment of the present invention, determining a window length of a sliding window may comprise traversing the extracted speech signal to determine a maximum amplitude of the speech signal. Then, a starting position of the sliding window and an ending position of the sliding window are determined to obtain the window length of the sliding window. The starting position of the sliding window is a position where the amplitude of the speech signal exceeds a predetermined proportion of the maximum amplitude for the first time from the beginning of the speech signal. The ending position of the sliding window is a position where the amplitude of the speech signal exceeds a predetermined proportion of the maximum amplitude for the first time from the ending of the speech signal back to the beginning of the speech signal. Preferably, the predetermined proportion may be greater than or equal to ¼ and less than or equal to ½.
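The maximum-amplitude embodiment above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation of the invention: the signal is taken to be a list of samples, amplitude is taken to be the absolute sample value, and the function name window_by_peak, the sample values, and the fallback of keeping the whole signal when no sample exceeds the threshold are all hypothetical choices.

```python
def window_by_peak(signal, proportion=0.25):
    """Return inclusive (start, end) indices of the sliding window.

    start: first index, scanning forward, whose amplitude exceeds
           proportion * maximum amplitude.
    end:   first index, scanning backward from the end, that does the same.
    """
    if not signal:
        raise ValueError("signal must not be empty")

    # Traverse the signal once to find the maximum amplitude.
    peak = max(abs(x) for x in signal)
    threshold = proportion * peak

    # Scan forward for the starting position.
    start = 0
    while start < len(signal) and abs(signal[start]) <= threshold:
        start += 1
    if start == len(signal):
        # Assumption: if nothing exceeds the threshold (e.g. a silent
        # signal), keep the whole signal unchanged.
        return 0, len(signal) - 1

    # Scan backward for the ending position; it cannot pass `start`.
    end = len(signal) - 1
    while abs(signal[end]) <= threshold:
        end -= 1
    return start, end


# Hypothetical samples: quiet edges around a louder middle section.
samples = [0.0, 0.05, 0.1, 0.6, -1.0, 0.4, 0.2, 0.05, 0.0]
start, end = window_by_peak(samples, proportion=0.25)
segment = samples[start:end + 1]
print(start, end, segment)  # 3 5 [0.6, -1.0, 0.4]
```

Only the selected segment would then be passed on for speech separation, so the low-amplitude samples at both ends are not processed.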
According to another embodiment of the present invention, determining a window length of a sliding window may comprise traversing the extracted speech signal to determine an average amplitude of the speech signal. Then, a starting position of the sliding window and an ending position of the sliding window are determined to obtain the window length of the sliding window. For example, the starting position of the sliding window is a position where the amplitude of the speech signal exceeds the average amplitude for the first time from the beginning of the speech signal. The ending position of the sliding window is a position where the amplitude of the speech signal exceeds the average amplitude for the first time from the ending of the speech signal back to the beginning of the speech signal.
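The average-amplitude embodiment can be sketched in the same way. The description does not state how the average amplitude is computed, so this sketch assumes the mean of the absolute sample values; the function name window_by_average, the sample values, and the fallback for a signal in which no sample exceeds the average are likewise hypothetical.

```python
def window_by_average(signal):
    """Return inclusive (start, end) indices of the sliding window.

    The threshold is the average amplitude, assumed here to be the
    mean of the absolute sample values.
    """
    if not signal:
        raise ValueError("signal must not be empty")

    # Traverse the signal once to compute the average amplitude.
    average = sum(abs(x) for x in signal) / len(signal)

    # Scan forward for the first sample exceeding the average.
    start = 0
    while start < len(signal) and abs(signal[start]) <= average:
        start += 1
    if start == len(signal):
        # Assumption: a constant-amplitude signal has no sample above
        # its own average, so the whole signal is kept.
        return 0, len(signal) - 1

    # Scan backward for the last sample exceeding the average.
    end = len(signal) - 1
    while abs(signal[end]) <= average:
        end -= 1
    return start, end


# Hypothetical samples; their average amplitude is about 0.267.
samples = [0.0, 0.05, 0.1, 0.6, -1.0, 0.4, 0.2, 0.05, 0.0]
start, end = window_by_average(samples)
print(start, end, samples[start:end + 1])  # 3 5 [0.6, -1.0, 0.4]
```

Unlike the maximum-amplitude variant, this variant needs no tuning parameter, since the threshold follows from the signal itself.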
The speech separation method and system of the present invention introduce a sliding window to pre-process the data collected by the microphone before the data is sent to the DUET module for processing. By extracting the portion of a signal segment in which the speech information is relatively concentrated and removing the unnecessary portions of the segment, the amount of data that the DUET algorithm needs to process is reduced, which reduces the running time of the DUET algorithm and improves the efficiency of the overall speech separation system.
The term “module” may be defined to include a plurality of executable modules. The modules may include software, hardware, firmware, or some combination thereof executable by a processor. Software modules may include instructions stored in memory, or another memory device, that may be executable by the processor or other processor. Hardware modules may include various devices, components, circuits, gates, circuit boards, and the like that are executable, directed, or controlled for performance by the processor.
The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as compact disc read only memory (CD-ROM) disks readable by a CD-ROM drive, flash memory, read only memory (ROM) chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored.
The invention has been described above with reference to specific embodiments. Persons of ordinary skill in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims.
This application claims priority to PCT Patent Application No. PCT/CN2019/077321, filed Mar. 7, 2019, and entitled “METHOD AND SYSTEM FOR SPEECH SEPARATION”, the entire disclosure of which is incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2019/077321 | 3/7/2019 | WO | 00 |