The present invention relates to noise reduction and particularly to speech enhancement during an audio conference.
Voice over Internet Protocol (VoIP) communication includes encoding voice as digital data, encapsulating the digital data into data packets, and transporting the data packets over a data network. A conference call is a telephone call between two or more participants at geographically distributed locations, which allows each participant to speak to, and to listen to, the other participants simultaneously. A conference call among the participants may be conducted via a voice conference bridge or centralized server. The conference call connects multiple endpoint devices (VoIP devices or computer systems) associated with the participants using appropriate Web conference communication protocols. Alternatively, conference calls may be conducted peer-to-peer, in which audio may be streamed directly between participants' computer systems without an intermediary server.
U.S. Pat. No. 5,210,796 discloses a stereo/monophonic detection apparatus for detecting whether two-channel input audio signals are stereo or monophonic. The level difference between the input audio signals is calculated. The signal representing the level difference is discriminated while maintaining a predetermined hysteresis. A stereo/monophonic detection is performed in accordance with the result of the discrimination, to prevent an erroneous detection that may otherwise be caused by a short-time variation in the level difference, as in a case where the sound field is positioned at the center of the stereo signals.
Various computerized systems and methods are disclosed herein, including an audio input configured to input an audio stream and a processor configured to enable noise reduction and to process the audio stream to emphasize speech content. A monophonic detector is configured to determine whether the audio stream is monophonic or not monophonic. A decision module is configured to receive an input from the monophonic detector and to output a decision to bypass the noise reduction when the audio stream is not monophonic. A speech detection module may be configured to detect speech in the audio stream and to maintain bypass of the noise reduction until speech is detected in the audio stream. The processor may be configured to apply the noise reduction when the audio stream is monophonic and speech is detected in the audio stream. The noise reduction may be bypassed upon starting input of the audio stream.

The processor may be configured to parse the audio stream into audio frames and to bypass the noise reduction when a current audio frame is not monophonic. The processor may be configured to enable noise reduction by computing time-frequency gains for emphasizing speech content in the audio stream. The processor may be configured to monitor the audio frames for speech and to update a status of the audio stream as including speech when more than a threshold number of, e.g. consecutive, audio frames are detected as including speech. The noise reduction for emphasizing the speech content may be applied when the status is updated. However, when fewer than a threshold number of audio frames are detected as including speech, noise reduction may not be applied, but time-frequency gains may be computed and stored for noise reduction during upcoming frames. The processor may be configured to maintain the noise reduction until the end of the audio stream unless the audio stream is determined not to be monophonic.

The processor may be configured to transform the audio stream into a time-frequency representation, compute time-frequency gains configured to emphasize speech content in the audio stream, and inverse-transform the time-frequency representation to the time domain while applying the time-frequency gains, to produce an audio stream with emphasized speech content.
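By way of a non-limiting example, the decision flow summarized above may be sketched as follows. This is a minimal Python sketch; the function names, the thresholds, and the mono and speech detectors passed in as arguments are hypothetical placeholders rather than the claimed implementation.

```python
def run_decision_loop(frames, is_mono, speech_prob, compute_nr_gains,
                      prob_threshold=0.5, speech_frames_threshold=5):
    """Bypass noise reduction until the stream is monophonic and speech is confirmed.

    Frames are assumed here to be magnitude spectra so that gains can be
    applied by element-wise multiplication (all parameter values illustrative).
    """
    consecutive_speech = 0
    speech_confirmed = False
    for frame in frames:
        # Gains are estimated for every frame but applied only once speech is confirmed.
        gains = compute_nr_gains(frame)
        if not is_mono(frame):
            speech_confirmed = False  # non-monophonic content: bypass noise reduction
            yield frame
            continue
        if speech_prob(frame) > prob_threshold:
            consecutive_speech += 1
        else:
            consecutive_speech = 0
        if consecutive_speech > speech_frames_threshold:
            speech_confirmed = True  # maintain noise reduction for the rest of the stream
        yield frame * gains if speech_confirmed else frame
```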
Various computer readable media are disclosed, that, when executed by a processor, cause the processor to execute methods disclosed herein.
The invention is herein described, by way of example only, with reference to the accompanying drawings.
The foregoing and/or other aspects will become apparent from the following detailed description when considered in conjunction with the accompanying drawing figures.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. The embodiments are described below to explain the present invention by referring to the figures.
By way of introduction, aspects of the present invention are directed to communications of speech audio signals, using Voice over Internet Protocol (VoIP) communications by way of example. Noise reduction, also known as speech emphasis or speech enhancement, for VoIP communications is intended to enhance human speech and/or reduce audio content other than human speech. However, noise reduction algorithms may also reduce desired audio content which is not related to human speech. Examples include a ringtone at the beginning of a call or an audible notification received during a conference. Other examples may include a music lesson over VoIP or desired audio content played during an online conference. Embodiments of the present invention are directed to applying noise reduction when there is speech, and otherwise bypassing the noise reduction when audio content other than speech is communicated, so as not to remove or reduce desired audio content during the conference.
Referring now to the drawings, reference is now made to
In parallel, one or more channels of input audio may be input to transform module 11, configured to perform a time-frequency transform, e.g. a short time Fourier transform (STFT). The time-frequency transform, e.g. STFT, may be input to a noise reduction module 14 configured to output noise reduction (NR) gains. Noise reduction module 14 may estimate the NR gains without applying the reduction operation. The NR gains may be input to decision module 19. Decision module 19 may select between the NR gains, which may be appropriate when the audio signal includes speech, and default gains, which may be appropriate for audio content other than speech. Gains selected by decision module 19 may be combined with, or multiplied by (block 15), magnitudes determined from the time-frequency transform, e.g. STFT. Complex coefficients or phases may be retrieved or reconstructed in block 16 from phase information from STFT transform 11. Inverse transform module 17 may inverse-transform to time-domain output audio, with either the noise reduction gains or the default gains applied, depending on the decision of decision module 19 as to whether the input audio includes speech content. Default gains may be unity gains or may include filtering, equalization, et cetera, depending on characteristics of the non-speech audio being processed.
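The signal path of transform module 11, noise reduction module 14, blocks 15 and 16, and inverse transform module 17 may be illustrated with a short Python sketch using SciPy's STFT. The percentile-based noise-floor estimate below is a crude stand-in for noise reduction module 14, whose actual gain computation is not specified here, and the parameter values are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(audio, fs=16000, apply_nr=True):
    """Illustrative transform -> gains -> inverse-transform chain."""
    f, t, Z = stft(audio, fs=fs, nperseg=512)   # time-frequency transform (module 11)
    mag, phase = np.abs(Z), np.angle(Z)         # magnitudes and phase information
    if apply_nr:
        # Crude noise-floor estimate; stands in for the NR gains of module 14.
        noise = np.percentile(mag, 10, axis=1, keepdims=True)
        gains = np.clip(1.0 - noise / np.maximum(mag, 1e-12), 0.0, 1.0)
    else:
        gains = np.ones_like(mag)               # default (unity) gains
    Z_out = gains * mag * np.exp(1j * phase)    # apply gains, restore phases (blocks 15, 16)
    _, out = istft(Z_out, fs=fs, nperseg=512)   # inverse transform to time domain (module 17)
    return out
```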
Reference is now also made to
Reference is now also made to
In this description and in the following claims, a “computer system” is defined as one or more software modules, one or more hardware modules, or combinations thereof, which work together to perform operations on electronic data. For example, the definition of computer system includes the hardware components of a personal computer, as well as software modules, such as the operating system of the personal computer. The physical layout of the modules is not important. A computer system may include one or more computers coupled via a computer network. Likewise, a computer system may include a single physical device (such as a mobile phone, a laptop computer or a tablet) where internal modules (such as a memory and a processor) work together to perform operations on electronic data.
In this description and in the following claims, a “network” is defined as any architecture where two or more computer systems may exchange data. Exchanged data may be in the form of electrical signals that are meaningful to the two or more computer systems. When data is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired and wireless) to a computer system or computer device, the connection is properly viewed as, and termed, a computer-readable medium. Combinations of the above should also be included within the scope of computer-readable media. Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer system or special-purpose computer system to perform a certain function or group of functions. The described embodiments can also be embodied as computer readable code on a non-transitory computer readable medium. The non-transitory computer readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of the non-transitory computer readable medium include read-only memory, random-access memory, CD-ROMs, HDDs, DVDs, magnetic tape, and optical data storage devices. The non-transitory computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
The various aspects, embodiments, implementations or features of the described embodiments can be used separately or in any combination. Various aspects of the described embodiments can be implemented by software, hardware or a combination of hardware and software.
The term “audio frame” as used herein refers to a segment of an analogue audio signal, which may include speech, after the signal has been sampled and digitized. The sampling rate may be 45 kilohertz, by way of example. The sampled speech signal may be parsed into audio frames, usually of equal duration, e.g. 50 milliseconds.
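For illustration only, parsing at the example values above may look as follows; this is a minimal sketch, and the function name is hypothetical while the rate and duration are the examples given in the preceding paragraph.

```python
import numpy as np

def parse_into_frames(samples, fs=45000, frame_ms=50):
    """Split a sampled, digitized signal into equal-duration audio frames."""
    frame_len = int(fs * frame_ms / 1000)  # 2250 samples per 50 ms frame at 45 kHz
    n_frames = len(samples) // frame_len   # any trailing partial frame is dropped
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)
```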
The terms “mono” and “monophonic” are used herein interchangeably and refer to an audio stream recorded with a single microphone, or to multiple audio streams recorded simultaneously with respective multiple microphones which are measurably identical within previously determined thresholds of time-frequency magnitudes and phases, except for an overall level adjustment between the multiple audio streams.
The terms “stereo” and “stereophonic” are used herein interchangeably and refer to multiple, e.g. two, audio streams recorded simultaneously with respective multiple, e.g. two, microphones which are measurably different, with differences greater than previously determined thresholds of time-frequency magnitudes and/or phases, except for overall levels.
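Because the mono/stereo distinction above is defined by thresholded time-frequency magnitude and phase differences, up to an overall level adjustment, one hypothetical two-channel test may be sketched as follows; the thresholds and the specific comparison are illustrative assumptions, not a prescribed detector.

```python
import numpy as np
from scipy.signal import stft

def is_monophonic(left, right, fs=45000, mag_tol=0.1, phase_tol=0.2):
    """Treat two channels as mono if they match within (assumed) thresholds."""
    _, _, L = stft(left, fs=fs, nperseg=512)
    _, _, R = stft(right, fs=fs, nperseg=512)
    # Compensate for an overall level adjustment between the channels.
    level = np.linalg.norm(R) / max(np.linalg.norm(L), 1e-12)
    mag_diff = np.mean(np.abs(np.abs(level * L) - np.abs(R)))
    phase_diff = np.mean(np.abs(np.angle(L * np.conj(R))))  # per-bin phase difference
    return bool(mag_diff < mag_tol and phase_diff < phase_tol)
```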
The term “speech” as used herein includes conversation, voice and/or vocal content such as singing. The terms “speech content” and “vocal content” are used herein interchangeably.
The term “detecting speech” as used herein is sometimes known as “voice activity detection” (VAD) and refers to a binary decision of whether one or more audio frames include speech or do not include speech. Voice activity detection (VAD) may be performed by first determining a speech presence probability in the audio frame and subsequently deciding, based on a previously defined threshold, whether or not the audio frame includes speech.
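A toy example of this two-step VAD (probability estimate, then thresholding) is sketched below; the energy-based pseudo-probability and the threshold value are illustrative assumptions, as practical VAD algorithms are typically more elaborate.

```python
import numpy as np

def frame_vad(frame, noise_floor, prob_threshold=0.5):
    """Toy VAD: derive a speech presence pseudo-probability, then threshold it."""
    energy = np.mean(np.asarray(frame, dtype=float) ** 2)
    snr = energy / max(noise_floor, 1e-12)   # energy relative to an estimated noise floor
    prob = snr / (1.0 + snr)                 # squash the ratio into the interval (0, 1)
    return prob > prob_threshold             # binary speech / no-speech decision
```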
The term “time-frequency” as in time-frequency analysis or time-frequency representation refers to techniques that analyze a signal in both the time and frequency domains simultaneously. A short time Fourier transform (STFT) is an example of a time-frequency representation.
The term “threshold” as used herein referring to multiple audio frames including speech content may be (but is not limited to) a consecutive number of frames or stereophonic frame pairs including speech, a fraction of previous audio frames including speech and/or a weighted fraction of audio frames including speech with greater weights on last frames, by way of example.
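Each of the example threshold criteria listed above may be realized along the following lines; the function name, mode names, and parameter values are illustrative assumptions.

```python
import numpy as np

def speech_status(vad_history, mode="consecutive", n=5, frac=0.6):
    """Decide whether the stream's status should be updated to 'including speech'."""
    h = np.asarray(vad_history, dtype=float)  # per-frame binary VAD decisions
    if mode == "consecutive":                 # the last n frames all include speech
        return len(h) >= n and bool(h[-n:].all())
    if mode == "fraction":                    # a fraction of previous frames include speech
        return h.mean() > frac
    weights = np.arange(1, len(h) + 1)        # greater weights on the last frames
    return (h * weights).sum() / weights.sum() > frac
```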
The term “gains” as used herein, in the context of time-frequency gains, refers to frequency-dependent coefficients which may be real-valued and normalized between zero and one. The term “noise reduction (NR) gains” as used herein refers to frequency-dependent coefficients computed to enhance speech and/or to reduce audio signal or noise other than speech.
The transitional term “comprising” as used herein is synonymous with “including”, is inclusive or open-ended, and does not exclude additional, unrecited elements or method steps. The articles “a” and “an”, as used herein, such as in “a computer system” or “an audio frame”, have the meaning of “one or more”, that is, “one or more computer systems” and “one or more audio frames”.
All optional and preferred features and modifications of the described embodiments and dependent claims are usable in all aspects of the invention taught herein. Furthermore, the individual features of the dependent claims, as well as all optional and preferred features and modifications of the described embodiments are combinable and interchangeable with one another.
Although selected features of the present invention have been shown and described, it is to be understood that the present invention is not limited to the described features.
Although selected embodiments of the present invention have been shown and described, it is to be understood that the present invention is not limited to the described embodiments. Instead, it is to be appreciated that changes may be made to these embodiments without departing from the scope of the invention defined by the claims and the equivalents thereof.
Number | Date | Country | Kind |
---|---|---|---
2106390.4 | May 2021 | GB | national |