Speech processing systems (e.g., automatic speech recognition (ASR) systems, biometric voice systems, etc.) suffer degraded recognition accuracy on far-field audio signals, i.e., when speakers are distant from the microphone. This degradation results from corruption of the far-field speech signal by reverberation and background noise. Compared with a single microphone, a microphone array device comprising multiple microphones can capture multichannel audio as the input to a speech processing backend system (e.g., an ASR backend system), thereby alleviating such degradation. However, since a speech processing backend is usually designed to receive a single-channel audio input, a speech processing frontend component which receives the multichannel audio and emits single-channel audio may be utilized to bridge the gap between the multichannel audio input and the speech processing backend.
Multichannel signals may be processed by an ASR system to transcribe or otherwise process conversational speech. Spatial information contained in the multichannel audio can enhance the system's ability to accurately capture the nuances of conversational speech, enabling more robust transcription and analysis. Speech processing methods that exploit this information are particularly beneficial in scenarios where distinguishing between multiple speakers or capturing environmental cues is desirable for the transcription process.
Like reference symbols in the various drawings indicate like elements.
As will be discussed in greater detail below, implementations of the present disclosure address the problems of effectively utilizing a multichannel (e.g., microphone array) edge device to capture a distant speech signal and of efficiently using spatial information to enhance the signal for an end-to-end ASR system. One aspect of solving the distant ASR problem lies in employing microphone arrays and exploiting spatial information to improve signal quality.
Microphone arrays include multiple microphones spaced apart by known distances and relative positions. The array is therefore capable of capturing multichannel audio signals which contain spatial information. Spatial information in a microphone array refers to the location and orientation of individual microphones relative to each other within the array, as well as relative to the sound sources in the surrounding environment. This spatial arrangement of microphones enables the array to capture and analyze audio signals from different directions, providing valuable cues for various audio processing applications. Microphone arrays with well-defined spatial arrangements are particularly beneficial in challenging audio environments where noise, reverberation, and multiple sound sources are present.
However, since ASR systems are typically configured to process single-channel signals, the spatial information contained in a multichannel signal must remain accessible when the multichannel signal is transformed into a single-channel signal for processing by the ASR system.
A Short-Time Discrete Cosine Transform (STDCT) and, particularly, the Modified Discrete Cosine Transform (MDCT) is used to generate a time-frequency representation (i.e., the individual channel's spectrum) of the multichannel audio signal in a speech enhancement frontend for ASR. This representation encodes phase information within a real-valued representation of the spectrum, which results in better speech enhancement because the neural network computes magnitude- and phase-dependent weights similar to the weights computed in beamforming processes. This representation also facilitates reconstructing the time-domain waveform of the enhanced signal and extracting spatial information, e.g., direction of arrival (DOA), from the computed weights.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will become apparent from the description, the drawings, and the claims.
Referring to
In the implementation shown in
Referring to
As described, implementations of the present disclosure employ the modified discrete cosine transform (MDCT) for processing the multichannel audio signal to generate the spectral representation of the multichannel audio signal. The MDCT is performed according to the following equation (1):

$$X_k = \sum_{n=0}^{2N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \tfrac{1}{2} + \tfrac{N}{2}\right)\left(k + \tfrac{1}{2}\right)\right], \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where $x_n$ are the samples of a windowed frame of length $2N$ and $X_k$ are the $N$ real-valued MDCT coefficients of that frame.
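By way of non-limiting illustration, the following NumPy sketch applies equation (1) framewise to multichannel audio to produce the spectral representation described above. The sine analysis window, 50% frame overlap, and frame size are assumptions of the sketch, not requirements of the disclosure.

    import numpy as np

    def mdct(frames):
        # Equation (1): X_k = sum_{n=0}^{2N-1} x_n cos[(pi/N)(n + 1/2 + N/2)(k + 1/2)].
        # `frames` carries windowed frames of length 2N on its last axis;
        # the result carries N real-valued coefficients on its last axis.
        two_n = frames.shape[-1]
        n_bins = two_n // 2
        n = np.arange(two_n)
        k = np.arange(n_bins)
        basis = np.cos(np.pi / n_bins * (n[None, :] + 0.5 + n_bins / 2) * (k[:, None] + 0.5))
        return frames @ basis.T

    def stdct_spectrogram(audio, n_bins=256):
        # audio: (samples, channels); returns (frames, channels, n_bins).
        # Frames of length 2*n_bins overlap by 50%, as the MDCT requires;
        # the sine window satisfies the Princen-Bradley condition needed
        # later for time-domain aliasing cancellation.
        window = np.sin(np.pi / (2 * n_bins) * (np.arange(2 * n_bins) + 0.5))
        hop = n_bins
        n_frames = 1 + (audio.shape[0] - 2 * n_bins) // hop
        frames = np.stack([
            (window[:, None] * audio[t * hop : t * hop + 2 * n_bins]).T  # (C, 2N)
            for t in range(n_frames)
        ])                                                               # (T, C, 2N)
        return mdct(frames)                                              # (T, C, N)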
Magnitude- and phase-dependent weights are generated from a spectral representation of the multichannel audio by a DNN for signal enhancement, 1108. In an implementation, the DNN is jointly trained with the downstream ASR model and the weights are a periodically updated tensor of the shape (B, T, C, N), where B is the batch size, T is the number of MDCT frames, C is the number of channels, and N is defined with reference to equation (1). A single-channel representation of the multichannel signal is generated, 1112, by performing an element-wise product of the magnitude- and phase-dependent weights and the spectral representation of the multichannel signal and summing over the channel dimension to collapse the multiple channels into a single channel. In an implementation, the Log-Mel spectrogram of the single-channel representation, 1114, is calculated and then transmitted to the ASR model for processing. In other implementations, a raw STDCT spectrogram is processed, or the single-channel spectrum is converted back to a waveform, depending on the requirements of the downstream ASR model. As such, the downstream ASR may be any of a number of known systems that operate on a range of features, including Mel filterbank coefficients.
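A minimal sketch of this weighting-and-summing step follows (PyTorch is used for illustration; the WeightNet layer sizes and activation are assumptions and merely stand in for the jointly trained enhancement DNN):

    import torch

    class WeightNet(torch.nn.Module):
        # Hypothetical stand-in for the enhancement DNN: maps a (B, T, C, N)
        # spectral representation to same-shaped magnitude- and
        # phase-dependent weights.
        def __init__(self, n_bins):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(n_bins, n_bins), torch.nn.Tanh(),
                torch.nn.Linear(n_bins, n_bins),
            )

        def forward(self, spec):
            return self.net(spec)

    def enhance(spec, weights):
        # Element-wise product followed by a sum over the channel axis C
        # collapses (B, T, C, N) to a (B, T, N) single-channel spectrum.
        return (weights * spec).sum(dim=2)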
Since, as described above, the STDCT transform implicitly encodes the phase information of the multichannel audio signal, the system is able to determine direction of arrival (DOA) information, 1120, from the magnitude- and phase-dependent weights, 1124. The DOA information is of the type that enables an indication of not just the general direction from which the source of a signal originated based on which microphone in the monitored space captured the audio, but a more precise indication of the angle of incidence of the audio signal at the receiving microphone. For different microphone geometries, this will include corresponding spatial information, e.g., in the case of a linear array, the DOA information may indicate that an audio signal received and processed by the system 10 originated at an azimuth angle relative to the array of, for example, 20° to the left. Such information facilitates speaker localization and/or speaker diarization for one or more speakers in the monitored space. For certain other microphone array geometries, the DOA information may also allow for determination of elevation information of the speaker relative to the array.
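For context, the classical far-field relation for a linear array maps a time difference of arrival between two microphones to an azimuth angle. The sketch below illustrates only this geometry; it is not the disclosed mechanism, which derives DOA from the DNN weights rather than from an explicit time-difference measurement.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

    def azimuth_from_tdoa(tdoa_s, mic_spacing_m):
        # Far-field plane-wave model: tau = d * sin(theta) / c, hence
        # theta = arcsin(c * tau / d), returned in degrees. The sign
        # convention depends on how tau is measured between the mics.
        s = np.clip(SPEED_OF_SOUND * tdoa_s / mic_spacing_m, -1.0, 1.0)
        return float(np.degrees(np.arcsin(s)))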
The distinctions in microphone signals caused by various propagation paths between the acoustic signal source and microphones in an array encapsulate spatial information about sound sources. This information is preserved in the multichannel STDCT spectrograms and is used for signal enhancement.
Another feature of the STDCT is that it is real-valued and invertible. This allows an enhanced audio signal corresponding to the enhanced time-domain waveform to be obtained for listening or speech quality monitoring without applying computationally expensive methods for phase reconstruction. In the implementation using the MDCT, an inverse MDCT (IMDCT) is performed on the single-channel representation to obtain the enhanced audio signal, 1122. The IMDCT is defined by:

$$y_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \cos\left[\frac{\pi}{N}\left(n + \tfrac{1}{2} + \tfrac{N}{2}\right)\left(k + \tfrac{1}{2}\right)\right], \quad n = 0, 1, \ldots, 2N-1 \qquad (2)$$
where the variables are as indicated with reference to equation (1).
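A NumPy sketch of equation (2) with windowed overlap-add follows; it pairs with the analysis sketch after equation (1). Perfect reconstruction relies on the sine synthesis window satisfying the Princen-Bradley condition, which is an assumption of the sketch.

    import numpy as np

    def imdct(coeffs):
        # Equation (2): maps N coefficients back to 2N time samples.
        n_bins = coeffs.shape[-1]
        n = np.arange(2 * n_bins)
        k = np.arange(n_bins)
        basis = np.cos(np.pi / n_bins * (n[:, None] + 0.5 + n_bins / 2) * (k[None, :] + 0.5))
        return (basis @ coeffs) / n_bins

    def imdct_overlap_add(spec):
        # spec: (T, N) single-channel MDCT spectrogram. Adjacent frames
        # overlap by 50%; time-domain aliasing cancels in the windowed sum.
        n_frames, n_bins = spec.shape
        window = np.sin(np.pi / (2 * n_bins) * (np.arange(2 * n_bins) + 0.5))
        out = np.zeros(n_frames * n_bins + n_bins)
        for t in range(n_frames):
            out[t * n_bins : t * n_bins + 2 * n_bins] += window * imdct(spec[t])
        return out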
Referring to
MDCT is employed in most modern audio compression standards, including MP3, Dolby Digital (AC-3), Ogg Vorbis, Opus (CELT), Windows Media Audio (WMA), Advanced Audio Coding (AAC), AAC-LD (LD-MDCT), High-Definition Coding (HDC), LDAC, Dolby AC-4, and MPEG-H 3D Audio, as well as in the G.7xx family of speech compression standards.
As described above, since the phase information is maintained within the STDCT, the magnitude- and phase-dependent weights derived from the multichannel audio signal representation are processed by a deep neural network (DNN) 952 to generate the DOA information 844 and provide it as metadata for the cloud device to use for localization purposes such as diarization. DNN 952 uses the magnitude- and phase-dependent weights to calculate the distribution of the weights across the channels and thereby determine the direction of the received audio signal.
The encoded single-channel representation is transmitted over transmission channel 814 to cloud device 802, which includes decoder 832 and ASR device 828. In cloud device 802, the signal is decoded and input to ASR 828. Where an MDCT-based audio codec (such as CELT) was used in the encoder, the signal does not need to be decoded completely to the time domain; instead, an unpacker reconstructs the MDCT spectrogram and feeds it directly to ASR 828. Depending on the input format of the downstream ASR model, a Log-Mel spectrogram or any other features of the single-channel representation may be calculated.
Referring to
Referring to
Spectral representation 962 is processed in DNN 942 to extract magnitude information and phase information associated with the multichannel audio signal 618. Magnitude- and phase-dependent weights are generated, 944. A weighted sum 946 is performed on the spectral representation 962 using the magnitude- and phase-dependent weights to combine the multichannel representation into a single-channel representation 960 of the input multichannel audio signal 618. A power spectral density calculation PSD(x) = |x|² is performed on the single-channel representation 960 to produce a single-channel PSD signal 960a, which includes batch (B), time (T), and STDCT bin (N) dimensions. In some implementations, the Log-Mel spectrogram 950 is derived from the single-channel PSD 960a and provided to an ASR system (not shown) for processing.
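The PSD and Log-Mel steps may be sketched as follows. The triangular mel filterbank design and the approximate mapping of mel points to STDCT bins are assumptions of the sketch; any standard filterbank implementation would serve equally well.

    import numpy as np

    def psd(single_channel_spec):
        # PSD(x) = |x|^2, applied element-wise to a real (T, N) spectrogram.
        return single_channel_spec ** 2

    def log_mel(psd_spec, sample_rate=16000, n_mels=80):
        # psd_spec: (T, N) single-channel PSD; returns (T, n_mels).
        n_bins = psd_spec.shape[-1]

        def hz_to_mel(f):
            return 2595.0 * np.log10(1.0 + f / 700.0)

        def mel_to_hz(m):
            return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

        # n_mels + 2 equally spaced points on the mel scale, mapped
        # (approximately) to STDCT bin indices.
        mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
        bins = np.clip(np.round(mel_pts / (sample_rate / 2) * (n_bins - 1)), 0, n_bins - 1).astype(int)
        fb = np.zeros((n_mels, n_bins))
        for m in range(1, n_mels + 1):
            lo, center, hi = bins[m - 1], bins[m], bins[m + 1]
            for b in range(lo, center):                 # rising slope
                fb[m - 1, b] = (b - lo) / max(center - lo, 1)
            for b in range(center, hi):                 # falling slope
                fb[m - 1, b] = (hi - b) / max(hi - center, 1)
        return np.log(psd_spec @ fb.T + 1e-10)          # floor avoids log(0)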
As described above, to obtain an enhanced audio signal, 958, representative of the multichannel audio signal 618, an inverse STDCT (ISTDCT) is performed, 956, on the single-channel representation 960. Further, the magnitude- and phase-dependent weights are processed by a deep neural network (DNN) 952 to obtain direction of arrival (DOA) information 954 for the multichannel audio signal 618. In an example, DNN 952 is trained (for example, in a supervised manner with labeled data) to map the STDCT weights to a DOA estimate.
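A hypothetical sketch of such a supervised mapping follows; the time pooling, layer sizes, and discretization of azimuth into bins (here, 5° resolution) are assumptions, not the disclosed architecture of DNN 952. Training would minimize a cross-entropy loss between the predicted logits and labeled azimuth bins.

    import torch

    class DoaNet(torch.nn.Module):
        # Hypothetical stand-in for DNN 952: pools the (B, T, C, N)
        # magnitude- and phase-dependent weights over time, then
        # classifies the azimuth into discrete bins.
        def __init__(self, n_channels, n_bins, n_azimuths=72):
            super().__init__()
            self.classify = torch.nn.Sequential(
                torch.nn.Linear(n_channels * n_bins, 256), torch.nn.ReLU(),
                torch.nn.Linear(256, n_azimuths),
            )

        def forward(self, weights):
            # Average the per-frame weight distribution over time, then
            # flatten the channel and bin axes before classification.
            pooled = weights.mean(dim=1).flatten(start_dim=1)   # (B, C*N)
            return self.classify(pooled)                        # azimuth logits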
Referring to
Referring to
Accordingly, audio signal enhancement process 10 as used in this disclosure may include any combination of audio signal enhancement process 10s, audio signal enhancement process 10c1, audio signal enhancement process 10c2, audio signal enhancement process 10c3, and audio signal enhancement process 10c4.
Audio signal enhancement process 10s may be a server application and may reside on and may be executed by a computer system 1000, which may be connected to network 1002 (e.g., the Internet or a local area network). Computer system 1000 may include various components, examples of which may include but are not limited to: a personal computer, a server computer, a series of server computers, a mini computer, a mainframe computer, one or more Network Attached Storage (NAS) systems, one or more Storage Area Network (SAN) systems, one or more Platform as a Service (PaaS) systems, one or more Infrastructure as a Service (IaaS) systems, one or more Software as a Service (SaaS) systems, a cloud-based computational system, and a cloud-based storage platform.
A SAN includes one or more of a personal computer, a server computer, a series of server computers, a minicomputer, a mainframe computer, a RAID device, and a NAS system. The various components of computer system 1000 may execute one or more operating systems.
The instruction sets and subroutines of audio signal enhancement process 10s, which may be stored on storage device 1004 coupled to computer system 1000, may be executed by one or more processors (not shown) and one or more memory architectures (not shown) included within computer system 1000. Examples of storage device 1004 may include but are not limited to: a hard disk drive; a RAID device; a random-access memory (RAM); a read-only memory (ROM); and all forms of flash memory storage devices.
Network 1002 may be connected to one or more secondary networks (e.g., network 1006), examples of which may include but are not limited to: a local area network; a wide area network; or an intranet, for example.
Various IO requests (e.g., IO request 1008) may be sent from audio signal enhancement process 10s, audio signal enhancement process 10c1, audio signal enhancement process 10c2, audio signal enhancement process 10c3 and/or audio signal enhancement process 10c4 to computer system 1000. Examples of IO request 1008 may include but are not limited to data write requests (i.e., a request that content be written to computer system 1000) and data read requests (i.e., a request that content be read from computer system 1000).
The instruction sets and subroutines of audio signal enhancement process 10c1, audio signal enhancement process 10c2, audio signal enhancement process 10c3 and/or audio signal enhancement process 10c4, which may be stored on storage devices 1010, 1012, 1014, 1016 (respectively) coupled to client electronic devices 1018, 1020, 1022, 1024 (respectively), may be executed by one or more processors (not shown) and one or more memory architectures (not shown) incorporated into client electronic devices 1018, 1020, 1022, 1024 (respectively). Storage devices 1010, 1012, 1014, 1016 may include but are not limited to: hard disk drives; optical drives; RAID devices; random access memories (RAM); read-only memories (ROM); and all forms of flash memory storage devices. Examples of client electronic devices 1018, 1020, 1022, 1024 may include, but are not limited to, personal computing device 1018 (e.g., a smart phone, a personal digital assistant, a laptop computer, a notebook computer, and a desktop computer), audio input device 1020 (e.g., a handheld microphone, a lapel microphone, an embedded microphone (such as those embedded within eyeglasses, smart phones, tablet computers and/or watches) and an audio recording device), display device 1022 (e.g., a tablet computer, a computer monitor, and a smart television), a hybrid device (e.g., a single device that includes the functionality of one or more of the above-referenced devices; not shown), an audio rendering device (e.g., a speaker system, a headphone system, or an earbud system; not shown), and a dedicated network device (not shown).
Users 1026, 1028, 1030, 1032 may access computer system 1000 directly through network 1002 or through secondary network 1006. Further, computer system 1000 may be connected to network 1002 through secondary network 1006, as illustrated with link line 1034.
The various client electronic devices (e.g., client electronic devices 1018, 1020, 1022, 1024) may be directly or indirectly coupled to network 1002 (or network 1006). For example, personal computing device 1018 is shown directly coupled to network 1002 via a hardwired network connection. Further, client electronic device 1024 is shown directly coupled to network 1006 via a hardwired network connection. Audio input device 1020 is shown wirelessly coupled to network 1002 via wireless communication channel 1036 established between audio input device 1020 and wireless access point (i.e., WAP) 1038, which is shown directly coupled to network 1002. WAP 1038 may be, for example, an IEEE 802.11a, 802.11b, 802.11g, or 802.11n device, a Wi-Fi device, and/or any device that can establish wireless communication channel 1036 between audio input device 1020 and WAP 1038. Display device 1022 is shown wirelessly coupled to network 1002 via wireless communication channel 1040 established between display device 1022 and WAP 1042, which is shown directly coupled to network 1002.
The various client electronic devices (e.g., client electronic devices 1018, 1020, 1022, 1024) may each execute an operating system, wherein the combination of the various client electronic devices (e.g., client electronic devices 1018, 1020, 1022, 1024) and computer system 1000 may form modular system 1044.
As will be appreciated by one skilled in the art, the present disclosure may be embodied as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, the present disclosure may take the form of a computer program product on a computer-usable storage medium having computer-usable program code embodied in the medium.
Any suitable computer-usable or computer-readable medium may be used. The computer-usable or computer-readable medium may also be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in base-band or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to the Internet, wireline, optical fiber cable, RF, etc.
Computer program code for carrying out operations of the present disclosure may be written in an object-oriented programming language. Alternatively, the computer program code for carrying out operations of the present disclosure may be written in conventional procedural programming languages, such as C or similar. The program code may be executed entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through a local area network/a wide area network/the Internet.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer/special purpose computer/other programmable data processing apparatus, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures may illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, not at all, or in any combination with any other flowcharts depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The embodiment was chosen and described to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.
Several implementations have been described. Having thus described the disclosure of the present application in detail and by reference to embodiments thereof, it will be apparent that modifications and variations are possible without departing from the scope of the disclosure defined in the appended claims.