The example and non-limiting embodiments of the present invention relate to processing of speech signals. In particular, at least some example embodiments relate to a method, to an apparatus and/or to a computer program for processing speech signals captured in noisy environments.
When a person speaks in presence of background noise he or she, in many cases unconsciously, adjusts the way he/she is speaking due to the background noise. The adjustment most notably comprises adjusting of voice loudness, but also adjustment of intonation, speaking pace and/or the spectral content etc. may be observed as a result of the speaker trying to adapt his/her voice to be heard better in presence of the background noise. This adjustment or adaptation is based on the auditory feedback from his/her own voice and the background noise—and interaction of the two. Such an adjustment of voice by the speaker may be referred to as a secondary impact of the background noise.
Many voice capturing arrangements apply noise suppression in order to remove/cancel or at least substantially reduce the background noise in the captured signal. However, while noise suppression is applied, the resulting speech from which the noise is removed or reduced still remains “adjusted” to the environmental background noise. This may make the resulting speech sound unnatural, annoying and/or even disturbing once the background noise has been removed or reduced, possibly even reducing the intelligibility of the speech. The impact may be especially disturbing for the listener when the characteristics of background noise change rapidly during talking, e.g. when during a phone call the far-end speaker raises his/her voice loudness temporarily due to environmental noise, e.g. due to traffic noise caused by a car passing by. Typically, the better the noise suppression is, the more noticeable and disturbing this effect may be. Moreover, with possible upcoming advances in noise suppression techniques this issue can be expected to become even more prominent.
Enhancement of a speech signal in the presence of background noise is a widely researched topic, having resulted in techniques such as noise cancelling, adaptive equalization, multi-microphone systems etc. aiming to either reduce the background noise in the captured signal or to improve the actual capture so that it becomes less sensitive to background noise. However, such speech enhancement techniques fail to address the above-mentioned issue of the speaker adapting his/her voice in presence of background noise.
According to an example embodiment, an apparatus is provided, the apparatus comprising at least one processor and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to obtain a current time frame of a noise-suppressed voice signal, derived on basis of a current time frame of a source audio signal comprising a source voice signal, to detect input voice characteristics for the current time frame of noise-suppressed voice signal, to obtain reference voice characteristics for said current time frame, said reference voice characteristics being descriptive of the source voice signal in noise-free or low-noise environment, and to create a current time frame of a modified voice signal by modifying said current time frame of the noise-suppressed voice signal in response to a difference between the detected input voice characteristic and the reference voice characteristics exceeding a predetermined threshold.
According to another example embodiment, a further apparatus is provided, the apparatus comprising means for obtaining a current time frame of a noise-suppressed voice signal, derived on basis of a current time frame of a source audio signal comprising a source voice signal, means for detecting input voice characteristics for the current time frame of noise-suppressed voice signal, means for obtaining reference voice characteristics for said current time frame, said reference voice characteristics being descriptive of the source voice signal in noise-free or low-noise environment, and means for creating a current time frame of a modified voice signal by modifying said current time frame of the noise-suppressed voice signal in response to a difference between the detected input voice characteristic and the reference voice characteristics exceeding a predetermined threshold.
According to another example embodiment, a method is provided, the method comprising obtaining a current time frame of a noise-suppressed voice signal, derived on basis of a current time frame of a source audio signal comprising a source voice signal, detecting input voice characteristics for the current time frame of noise-suppressed voice signal, obtaining reference voice characteristics for said current time frame, said reference voice characteristics being descriptive of the source voice signal in noise-free or low-noise environment, and creating a current time frame of a modified voice signal by modifying said current time frame of the noise-suppressed voice signal in response to a difference between the detected input voice characteristic and the reference voice characteristics exceeding a predetermined threshold.
According to another example embodiment, a computer program is provided, the computer program including one or more sequences of one or more instructions which, when executed by one or more processors, cause an apparatus at least to obtain a current time frame of a noise-suppressed voice signal, derived on basis of a current time frame of a source audio signal comprising a source voice signal, to detect input voice characteristics for the current time frame of noise-suppressed voice signal, to obtain reference voice characteristics for said current time frame, said reference voice characteristics being descriptive of the source voice signal in noise-free or low-noise environment, and to create a current time frame of a modified voice signal by modifying said current time frame of the noise-suppressed voice signal in response to a difference between the detected input voice characteristic and the reference voice characteristics exceeding a predetermined threshold.
The computer program referred to above may be embodied on a volatile or a non-volatile computer-readable record medium, for example as a computer program product comprising at least one computer readable non-transitory medium having program code stored thereon, the program code, when executed by an apparatus, causing the apparatus at least to perform the operations described hereinbefore for the computer program according to the fifth aspect of the invention.
The exemplifying embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb “to comprise” and its derivatives are used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features described hereinafter are mutually freely combinable unless explicitly stated otherwise.
Some features of the invention are set forth in the appended claims. Aspects of the invention, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of some example embodiments when read in connection with the accompanying drawings.
Throughout this text, the terms voice and speech are used interchangeably. Similarly, the terms noise suppression, noise reduction and noise removal are used interchangeably throughout this text.
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
The arrangement 100 comprises a microphone arrangement 110 for capturing audio signal(s) x(n), comprising e.g. a single microphone or a microphone array. The captured audio signal x(n) typically represents the voice uttered by a speaker corrupted by environmental noises, generally referred to as background noise(s). Hence, the captured audio signal x(n) can be, conceptually, considered as a sum of a voice signal v̂(n) representing the utterance by the speaker and the background noise signal n(n) representing the background noise component, i.e. x(n)=v̂(n)+n(n). The voice signal v̂(n) may also be referred to as source voice signal.
The arrangement 100 further comprises a noise suppressor 130 for removing or reducing the amount of the background noise in the captured audio signal x(n). Consequently, the noise suppressor 130 is arranged to derive a noise-suppressed voice signal v(n) on basis of the captured audio signal x(n) by aiming to remove the background noise signal n(n) therefrom. Noise suppression is, however, a non-trivial task and in a real-life scenario perfect cancellation of the noise signal n(n) is typically not possible. Therefore, the noise-suppressed voice signal v(n) is an approximation of the voice signal v̂(n) uttered by the speaker, from which the background noise component is suppressed to the extent possible. A number of noise suppression techniques are known in the art.
The arrangement 100 further comprises a speech encoder 170 for compressing the noise-suppressed voice signal v(n) into an encoded voice signal c(n) to produce a low bit-rate representation of the voice signal v(n). Generating the encoded voice signal c(n) facilitates transmission of the voice signal v(n) over a transmission channel and/or storage of the voice signal v(n) in a storage medium in a resource-saving manner. However, the arrangement 100 is usable also without the speech encoder 170, in which case the noise-suppressed voice signal v(n) may be provided for transmission and/or for storage without compression. A number of speech compression techniques are known in the art.
The arrangement 100 illustrates some components that are relevant for description of the present invention. The electronic device (or apparatus) hosting the arrangement 100 may, however, comprise a number of further components for processing the captured audio signal x(n), the noise-suppressed voice signal v(n) and/or the encoded voice signal c(n). Such additional components typically include an analog-to-digital (A/D) converter for converting the captured audio signal into a digital form. Hence, the captured audio signal x(n) is provided to noise suppressor 130 and the noise-suppressed voice signal v(n) is provided from the noise suppressor 130 as a digital signal. Further examples of additional components include an echo canceller for removing possible acoustic echo caused in the electronic device hosting the arrangement 100 e.g. from the captured audio signal x(n) or the noise-suppressed voice signal v(n) and an audio equalizer for modifying the frequency characteristics of the captured audio signal x(n) (e.g. to compensate for the known characteristics of the microphone arrangement 110 and/or to provide a captured audio signal of desired frequency characteristics).
The captured audio signal x(n) and the noise-suppressed voice signal v(n) are typically processed in short temporal segments, referred to as frames or time frames. Temporal duration of the frame is typically fixed to a predetermined value, e.g. to a suitable value in the range from 20 to 1000 milliseconds (ms). However, the frame duration does not necessarily have to be a fixed one but the duration may be varied over time. The frames may be consecutive (i.e. non-overlapping) in time, or there may be overlap between temporally adjacent frames. The noise suppressor 130 and the speech encoder 170 may be arranged to provide real-time processing of the respective voice signal to enable application of the arrangement 100 e.g. for voice communication. Alternatively, the noise suppressor 130 and/or the speech encoder 170 may be arranged to provide off-line processing of the respective voice signals e.g. for a voice recording application.
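As a non-limiting illustration of the framing described above, the following Python sketch splits a sampled signal into fixed-length frames with an optional overlap. The function name split_into_frames, the 16 kHz sampling rate and the choice to drop a trailing partial frame are assumptions made for this example only and are not taken from the description.

```python
def split_into_frames(signal, frame_len, hop_len=None):
    """Split a sequence of samples into frames of frame_len samples.

    hop_len is the distance between the starts of adjacent frames:
    hop_len == frame_len gives consecutive (non-overlapping) frames,
    a smaller hop_len gives overlapping frames. A trailing partial
    frame is dropped (an assumption of this sketch).
    """
    if hop_len is None:
        hop_len = frame_len
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(list(signal[start:start + frame_len]))
        start += hop_len
    return frames


# Example: 20 ms frames at an assumed 16 kHz sampling rate, 50 % overlap.
SAMPLE_RATE_HZ = 16000
FRAME_LEN = SAMPLE_RATE_HZ * 20 // 1000  # 320 samples per frame
frames = split_into_frames([0.0] * 1600, FRAME_LEN, hop_len=FRAME_LEN // 2)
```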
The arrangement 200 further comprises a speech enhancer 250 for naturalization of the noise-suppressed voice signal v(n). The speech enhancer 250 obtains the noise-suppressed voice signal v(n) and creates or derives a corresponding modified voice signal ṽ(n) based at least in part on the noise-suppressed voice signal v(n) on basis of a predetermined set of processing rules (i.e. a processing algorithm). A purpose of the speech enhancer 250 is to create the modified voice signal ṽ(n) in which the effect(s) of the speaker adjusting his/her voice to account for background noise conditions are compensated for, thereby providing a more natural-sounding voice signal for speech compression, storage and/or other processing. Further details of an exemplifying speech enhancer 250 will be described later in this text. Hence, in comparison to the arrangement 100, it is the modified voice signal ṽ(n) (instead of the noise-suppressed voice signal v(n)) that is provided for transmission/storage or for further processing e.g. by the speech encoder 170.
The noise suppressor 130 may be arranged to extract one or more parameters that are descriptive of characteristics of the background noise signal n(n) in the captured audio signal x(n) and to provide one or more of these parameters to the speech enhancer 250. Correspondingly, the speech enhancer 250 may be configured to obtain one or more parameters that are descriptive of characteristics of the background noise signal n(n). Such parameters may include, for example, one or more parameters descriptive of the power or average magnitude of the background noise signal n(n), one or more parameters descriptive of the spectral shape and/or spectral magnitude of the background noise signal n(n), etc.
Although illustrated as a dedicated component in
As an example, the speech enhancer 250 may be always enabled, thereby arranged to process the noise-suppressed voice signal v(n) regardless of the user's selection. As another example, the speech enhancer 250 may be enabled or disabled in accordance with the user's selection. As a further example, the speech enhancer 250 may be enabled or disabled in accordance with a request from a remote user. In the latter example, if the speech processing arrangement 200 comprising the speech enhancer 250 is applied for voice communication, the request may be provided e.g. by the user of the remote speech processing arrangement.
The illustrations of
The speaker adjusting his/her voice to account for variations in the background noise typically enables his/her voice to be heard even in relatively high levels of background noise. Furthermore, the increased magnitude of the speaker's voice helps the noise suppressor 130 to separate more efficiently the voice signal v̂(n) or an approximation thereof (i.e. the noise-suppressed voice signal v(n)) from the captured audio signal x(n) that also includes the background noise signal n(n) at a relatively high level. Hence, although the speaker adjusting his/her voice in response to variations in the background noise may result in an effect that makes the noise-suppressed voice signal v(n) sound unnatural or distorted, at the same time it contributes to efficiently preserving the contribution of the voice signal v̂(n) in the captured audio signal x(n) and it is also useful in facilitating high-quality operation of the noise suppressor 130 and the speech processing arrangement 100, 200 in general.
In general, the speech enhancer 250 is arranged to process the noise-suppressed voice signal as a sequence of frames, i.e. frame by frame. As described hereinbefore, a frame of the noise-suppressed voice signal v(n) is derived in the noise suppressor 130 on basis of the voice signal v̂(n), e.g. on basis of the corresponding frame of the voice signal v̂(n). For clarity and brevity of description, in the following the operation of the speech enhancer 250 is described for a single frame. The speech enhancer 250 is arranged to repeat the process for each frame of the sequence of frames.
The speech enhancer 250 is configured to obtain a frame of the noise-suppressed voice signal v(n). This frame may be referred to as a current frame of the noise-suppressed voice signal v(n) or frame t of the noise-suppressed voice signal and it may be denoted as frame vt(n). The frame vt(n) is provided for the input voice detector 504 for detection of the input voice characteristics Ci for the frame t and for the speech naturalizer 505 for creation of the respective frame of the modified speech signal ṽt(n). The frame vt(n) may be further provided for the noise detector 501 to assist the process of background noise characterization.
The input voice detector 504 may be arranged to detect the input voice characteristics Ci for the frame vt(n) on basis of the noise-suppressed voice signal v(n). Since the input voice characteristics Ci are derived on basis of the noise-suppressed voice signal v(n) thereby being representative of ‘clean’ voice, the input voice characteristics may also be referred to as clean voice characteristics. The input voice characteristics may include characteristics of a single type or characteristics of two or several types. As an example, the voice characteristics may include one or more of the following: loudness characteristics, pace characteristics, spectral characteristics, intonation characteristics. Examples of different voice characteristics will be described in more detail later in this text.
The input voice detector 504 may be arranged to carry out an analysis of a segment/period of the noise-suppressed voice signal v(n) covering one or more frames representing active speech in order to detect the input voice characteristics Ct,i (where t refers to the current frame and i identifies the characteristic) for the frame vt(n). As an example, the input voice characteristics Ct,i may be detected on basis of the frame vt(n) only. As another example, the input voice characteristics Ct,i may be detected on basis of the frame vt(n) and further on basis of a predetermined number of frames preceding the frame vt(n) (e.g. frames vt−k1(n), . . . , vt−1(n)) and/or a predetermined number of frames following the frame vt(n) (e.g. frames vt+1(n), . . . , vt+k2(n)). Detecting the input voice characteristics Ct,i over a segment of the noise-suppressed voice signal v(n) extending over a number of frames may comprise carrying out the analysis for a single segment of signal covering the respective frames or carrying out the analysis for each frame separately and combining, e.g. averaging, the analysis results obtained for individual frames into the input voice characteristics Ct,i representative of the frames included in the analysis. Detecting the input voice characteristics Ct,i over a number of frames provides the benefit that the input voice characteristics Ct,i reflect the overall input voice characteristics of the noise-suppressed voice signal v(n) rather than only the characteristics of particular sounds or short-term disturbances. As an example, the detection of the input voice characteristics Ct,i may be carried out for a signal segment covering up to 2-5 seconds of the noise-suppressed voice signal v(n).
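As a non-limiting illustration of combining per-frame analysis results by averaging, the following Python sketch detects a single loudness characteristic over the frames t−k1 to t+k2. The use of the RMS value as a loudness measure, the function names frame_rms and detect_loudness, and the clipping of the frame range at the signal boundaries are assumptions of this sketch; selection of frames representing active speech is omitted for brevity.

```python
import math


def frame_rms(frame):
    """Root-mean-square magnitude of one frame (used here as a loudness proxy)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_loudness(frames, t, k1=0, k2=0):
    """Average the per-frame RMS over frames t-k1 .. t+k2.

    The range is clipped to the frames that are actually available.
    """
    first = max(0, t - k1)
    last = min(len(frames) - 1, t + k2)
    values = [frame_rms(frames[j]) for j in range(first, last + 1)]
    return sum(values) / len(values)
```

With k1 = k2 = 0 the characteristic is detected on basis of the current frame only; larger values of k1 and/or k2 make the result less sensitive to individual sounds.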
The reference voice detector 502 is arranged to obtain the reference voice characteristics Rt,i (where t refers to the current frame and i identifies the characteristic) for the frame vt(n). The reference voice characteristics Rt,i are, preferably, descriptive of the voice signal v̂(n) (referred to also as the source voice signal) in a noise-free environment or in a low-noise environment. The reference voice characteristics Rt,i typically include a similar selection of voice characteristics as the input voice characteristics Ct,i (or a limited subset thereof). Since the reference voice characteristics Rt,i reflect the desired characteristics for the noise-suppressed speech signal v(n), they may also be referred to as pure voice characteristics.
The reference voice detector 502 is arranged to obtain the noise characteristics Ni from the noise detector 501. The noise characteristics for the current frame, i.e. the frame t, may be denoted as Nt,i. The noise characteristics Nt,i may include a noise indication Lt for indicating whether the frame t of the captured audio signal xt(n) comprises a significant background noise component or not. In the former case the frame xt(n) may be referred to as a noisy frame while in the latter case the frame xt(n) may be referred to as a clean frame. A clean frame may be considered to represent speech in noise-free or low-noise environment, whereas a noisy frame may be considered to represent speech in noisy environment. As an example, the noise indication Lt may comprise a parameter descriptive of the estimated noise level in the frame xt(n). The noise level may be indicated e.g. as an RMS value descriptive of the average magnitude of the noise. Consequently, the reference voice detector 502 may be configured to determine whether the frame xt(n) is a noisy frame or a clean frame e.g. such that frames for which the indicated noise level is larger than or equal to a predetermined noise threshold are considered as noisy frames while frames for which the indicated noise level is below said noise threshold are considered as clean frames. As another example, the noise indication Lt may be a binary flag that directly indicates whether the frame xt(n) is a noisy frame or a clean frame.
Obtaining the reference voice characteristics Rt,i may comprise determining whether the input voice characteristics Ct,i qualify as the reference voice characteristics Rt,i. This determination, typically, comprises determining whether the input voice characteristics represent speech in noise-free or low-noise environment. Consequently, the input voice characteristics Ct,i may be considered applicable as the reference voice characteristics Rt,i in response to the input voice characteristics Ct,i representing speech in noise-free or low-noise environment. As an example, the input voice characteristics Ct,i may be considered to represent speech in noise-free or low-noise environment in response to the frame xt(n) being indicated as a clean frame. As another example, the input voice characteristics Ct,i may be considered to represent speech in noise-free or low-noise environment in response to a predetermined number or a predetermined percentage of frames involved in detection of the input voice characteristics Ct,i being indicated as clean frames. As a specific example in this regard, the predetermined number/percentage may require all frames involved in detection of the input voice characteristics Ct,i being indicated as clean frames. In contrast, in case the input voice characteristics Ct,i are not considered as applicable for the reference voice characteristics Rt,i, e.g. in response to the input voice characteristics Ct,i representing noisy speech (e.g. the input voice characteristics Ct,i not representing speech in noise-free or low-noise environment), obtaining the reference voice characteristics Rt,i comprises applying the reference voice characteristics Rt−1,i obtained for a preceding frame, e.g. the frame vt−1(n), as the reference voice characteristics Rt,i.
The reference voice detector 502 is further configured to store (into a memory) the obtained reference voice characteristics Rt,i to make them available in processing of subsequent frames.
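As a non-limiting illustration of the selection between the input voice characteristics and the stored reference voice characteristics, the following Python sketch returns Rt,i for the current frame. The function name update_reference, the dictionary representation of the characteristics and the parameter min_clean_fraction are hypothetical and are introduced for this example only.

```python
def update_reference(input_chars, clean_flags, previous_reference,
                     min_clean_fraction=1.0):
    """Return the reference voice characteristics R_t for the current frame.

    input_chars:        dict, characteristic name -> C_{t,i}
    clean_flags:        one boolean per frame involved in detecting
                        input_chars (True = clean frame)
    previous_reference: dict holding R_{t-1}, reused when the input
                        characteristics do not qualify
    min_clean_fraction: required fraction of clean frames
                        (1.0 requires all involved frames to be clean)
    """
    if clean_flags:
        clean_fraction = sum(clean_flags) / len(clean_flags)
    else:
        clean_fraction = 0.0
    if clean_fraction >= min_clean_fraction:
        return dict(input_chars)        # input qualifies: R_t = C_t
    return dict(previous_reference)     # otherwise:       R_t = R_{t-1}
```

The returned dictionary would then be stored and passed as previous_reference when processing the next frame.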
In case the input voice characteristics Ct,i are considered applicable as reference voice characteristics Rt,i, the reference voice detector 502 may be further configured to adapt the detected input voice characteristics Ct,i on basis of general properties of speech signals in a noise-free environment or in a low-noise environment to derive the reference voice characteristics Rt,i. In this regard, the reference voice detector 502 may be arranged to apply knowledge of general properties of speech provided in block 503 to adapt the detected input voice characteristics Ct,i accordingly. The general properties of speech (block 503) may be provided e.g. as data stored in a memory accessible by the speech enhancer 250, e.g. in a memory provided in the speech enhancer 250.
As an example in this regard, the reference voice detector 502 may be configured to, in case the input voice characteristics Ct,i are considered applicable as basis for determining/updating the reference voice characteristics Rt,i, compute the reference voice characteristics Rt,i as a weighted sum of the input voice characteristics and respective ‘average’ voice characteristics Ai that represent respective voice characteristics in a noise-free or low-noise environment, e.g. as Rt,i=w1Ct,i+w2Ai, where w1+w2=1. The weighting values w1 and w2 may be fixed predetermined values, selected in accordance with the desired extent of the impact of the ‘average’ voice characteristics Ai.
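As a non-limiting illustration, the weighted sum Rt,i=w1Ct,i+w2Ai with w1+w2=1 may be sketched in Python as follows. The function name blend_with_average and the default value w1=0.8 are assumptions of this example; the description does not specify particular weights.

```python
def blend_with_average(c, a, w1=0.8):
    """Compute R_{t,i} = w1 * C_{t,i} + w2 * A_i with w2 = 1 - w1.

    c: dict, characteristic name -> detected input value C_{t,i}
    a: dict, characteristic name -> 'average' value A_i
    """
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in the range [0, 1]")
    w2 = 1.0 - w1
    return {i: w1 * c[i] + w2 * a[i] for i in c}
```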
As another example, the voice characteristics in a noise-free or low-noise environment may be represented by the ‘average’ voice characteristics Ai and respective margins mi that define the maximum allowable deviation from the respective ‘average’ voice characteristic Ai. In case any of the detected input voice characteristics Ct,i differs from the respective ‘average’ voice characteristic by more than the respective margin mi (e.g. if |Ct,i−Ai|>mi), the input voice characteristics may be disqualified from being applied as the reference voice characteristics Rt,i and the reference voice characteristics Rt−1,i are applied as the reference voice characteristics Rt,i instead.
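As a non-limiting illustration of the margin-based check, the following Python sketch disqualifies the input voice characteristics when any |Ct,i−Ai| exceeds the respective margin mi and falls back to Rt−1,i in that case. The function names within_margins and reference_with_margin_check are hypothetical.

```python
def within_margins(c, a, m):
    """True if every C_{t,i} deviates from A_i by at most the margin m_i."""
    return all(abs(c[i] - a[i]) <= m[i] for i in c)


def reference_with_margin_check(c, a, m, previous_reference):
    """Apply C_t as R_t only if it passes the margin check, else reuse R_{t-1}."""
    if within_margins(c, a, m):
        return dict(c)
    return dict(previous_reference)
```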
In case the input voice characteristics Ct,i are considered applicable as reference voice characteristics Rt,i, the reference voice detector 502 may be further configured to adapt the detected input voice characteristics Ct,i on basis of personal properties of speech signals uttered by the speaker of the voice signal v̂(n) to derive the reference voice characteristics Rt,i. The personal properties or personal characteristics of speech signals uttered by the speaker of the voice signal v̂(n) may be applied in a manner similar to that described for the general properties above. For adaptation on basis of the personal characteristics, predetermined average personal voice characteristics Ak,i for the speaker k are applied instead of the generic average voice characteristics Ai.
In this regard, the speech enhancer 250 may comprise a speaker identifier 507 arranged to apply a speaker recognition technique known in the art to identify the current speaker on basis of a segment/portion of the noise-suppressed voice signal v(n). Alternatively, the speaker identifier 507 may be arranged to identify the current speaker on basis of a segment/portion of the captured audio signal x(n). The speaker identifier 507 may be further configured to provide identification of the speaker to the speaker identification database 506 arranged to store predetermined personal voice characteristics Ak,i for a number of speakers. The speaker identification database 506, in turn, provides the personal voice characteristics Ak,i to the reference voice detector 502.
In case the reference voice characteristics Rt,i are not (yet) available, the general properties of speech signals in a noise-free environment or in a low-noise environment, the personal properties of speech signals uttered by the speaker of the voice signal v̂(n) (if available) or a combination thereof (e.g. a weighted average) may be used as the reference voice characteristics Rt,i. Such a situation may occur e.g. immediately after initialization or re-initialization (e.g. a reset) of the speech enhancer 250, e.g. in the beginning of a communication session or during a communication session due to an error condition.
The speech naturalizer 505 is configured to create the modified voice signal ṽ(n) on basis of the noise-suppressed voice signal v(n). In particular, the speech naturalizer 505 may be configured to create the frame t of the modified voice signal ṽ(n), denoted as ṽt(n), by modifying the frame vt(n) in response to difference(s) between the input voice characteristics Ct,i and the reference voice characteristics Rt,i meeting predetermined criteria. In contrast, in response to said difference(s) failing to meet said criteria, the speech naturalizer 505 may be configured to create the frame ṽt(n) as a copy of the frame vt(n). In case the previous frame of the modified voice signal ṽt−1(n) was created as a modification of the corresponding noise-suppressed frame vt−1(n), the speech naturalizer 505 may be configured to apply smoothing for the end of the frame ṽt−1(n) and for the beginning of the frame ṽt(n), such as cross-fading between a segment in the end of the frame ṽt−1(n) and a segment of similar length in the beginning of the frame ṽt(n), instead of applying a direct copy of the frame, in order to minimize the risk of introducing a discontinuity that may be perceived as an audible distortion in the modified voice signal ṽ(n).
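As a non-limiting illustration of cross-fading, the following Python sketch blends two segments of equal length with linearly changing weights. The function name linear_crossfade and the linear weighting are assumptions of this example; how the two segments are aligned in time and which fade shape is applied are left open by the description.

```python
def linear_crossfade(fade_out, fade_in):
    """Blend two equal-length segments sample by sample.

    The weight moves linearly from the fade_out segment (first sample)
    to the fade_in segment (last sample).
    """
    if len(fade_out) != len(fade_in):
        raise ValueError("segments must have the same length")
    n = len(fade_out)
    if n == 0:
        return []
    if n == 1:
        return [0.5 * (fade_out[0] + fade_in[0])]
    blended = []
    for k in range(n):
        w = k / (n - 1)  # 0.0 at the first sample, 1.0 at the last
        blended.append((1.0 - w) * fade_out[k] + w * fade_in[k])
    return blended
```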
Evaluation whether the difference(s) between the input voice characteristic Ct,i and the reference characteristics Rt,i meets the predetermined criteria may comprise determining respective comparison values Dt,i as the difference(s) between the respective input and reference voice characteristics, e.g. as Dt,i=Ct,i−Rt,i, and determining whether one or more of the comparison values Dt,i exceed a respective predetermined threshold Thi. The modification of the frame vt(n) may be applied e.g. in response to any of the comparison values Dt,i exceeding the respective threshold Thi, in response to a predetermined number of the comparison values Dt,i exceeding the respective threshold Thi or in response to all comparison values Dt,i exceeding the respective threshold Thi.
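As a non-limiting illustration of evaluating the predetermined criteria, the following Python sketch computes the comparison values Dt,i=Ct,i−Rt,i and counts how many of them exceed the respective threshold Thi. The function name should_modify and the parameter min_exceeding are hypothetical; the signed difference is used as given in the text, so only characteristics exceeding the reference trigger a modification in this sketch.

```python
def should_modify(c, r, thresholds, min_exceeding=1):
    """Decide whether the current frame is to be modified.

    c, r:          dicts, characteristic name -> C_{t,i} and R_{t,i}
    thresholds:    dict, characteristic name -> Th_i
    min_exceeding: number of comparison values that must exceed their
                   threshold (1 = any of them, len(c) = all of them)
    """
    d = {i: c[i] - r[i] for i in c}  # comparison values D_{t,i}
    exceeding = sum(1 for i in d if d[i] > thresholds[i])
    return exceeding >= min_exceeding
```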
The modification of the frame vt(n) in order to create the frame ṽt(n) may comprise modifying the frame vt(n) such that the frame ṽt(n) so created exhibits modified voice characteristics C̃t,i that correspond to the reference voice characteristics Rt,i. This may involve modification(s) bringing the modified voice characteristics C̃t,i to be identical to, essentially identical to or approximate the reference voice characteristics Rt,i. As another example, the modification may comprise modifying the frame vt(n) such that the frame ṽt(n) so created exhibits voice characteristics C̃t,i that are a weighted sum of the input voice characteristics Ct,i and the reference voice characteristics Rt,i, e.g. C̃t,i=wc*Ct,i+wr*Rt,i where wc and wr denote the weights assigned for the input voice characteristics and the reference voice characteristics, respectively, and where wc+wr=1 (and preferably also wc<wr, to give a higher emphasis to the reference voice characteristics).
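As a non-limiting illustration of the weighted-sum modification for a single characteristic, the following Python sketch scales a frame so that its loudness equals wc*Ct,i+wr*Rt,i. Measuring loudness as the RMS value of the frame, the function names rms and naturalize_loudness and the default weights wc=0.25 and wr=0.75 are assumptions of this example; other characteristics such as pace or intonation would require different processing.

```python
import math


def rms(frame):
    """Root-mean-square magnitude of a frame (loudness proxy in this sketch)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def naturalize_loudness(frame, reference_rms, wc=0.25, wr=0.75):
    """Scale the frame so that its RMS becomes wc * C + wr * R.

    C is the RMS of the input frame and R is reference_rms.
    Scaling every sample by a gain g scales the RMS by the same g.
    """
    if abs(wc + wr - 1.0) > 1e-9:
        raise ValueError("wc + wr must equal 1")
    current = rms(frame)
    if current == 0.0:
        return list(frame)  # silent frame: nothing to scale
    target = wc * current + wr * reference_rms
    gain = target / current
    return [gain * s for s in frame]
```

Setting wc=0 and wr=1 brings the loudness of the modified frame to the reference loudness itself.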
The noise detector 501 is configured to determine the noise characteristics Ni on basis of the captured audio signal x(n) and/or the noise-suppressed voice signal v(n). In particular, the noise detector 501 may be configured to detect the noise characteristics Nt,i for the current frame on basis of the current frame of the captured audio signal xt(n) and/or the current frame of the noise-suppressed voice signal vt(n). The noise detection may, additionally, consider a predetermined number of frames (of the respective voice signal) immediately preceding the frame xt(n) and/or vt(n) and/or a predetermined number of frames (of the respective signal) immediately following the frame xt(n) and/or vt(n).
As pointed out before, the noise characteristics Nt,i may include the noise indication Lt,n for indicating whether the frame t of the captured audio signal xt(n) comprises a significant background noise component or not, the noise indication Lt,n comprising a parameter descriptive of the estimated noise level in the frame xt(n). In this regard, the noise detector 501 may determine the difference signal d(n) between the captured audio signal x(n) and the noise-suppressed signal v(n), e.g. as d(n)=x(n)−v(n), for a signal segment/period of interest. The signal segment/period of interest typically comprises the current frame t, possibly together with a predetermined number of frames immediately preceding the current frame and/or a predetermined number of frames immediately following the current frame. The parameter descriptive of the noise level may be derived on basis of the difference signal d(n), e.g. as an RMS value descriptive of the average magnitude of the signal d(n) over the segment/period of interest. As also described hereinbefore, the noise indication Lt,n may, as another example, comprise a binary flag that directly indicates whether the frame xt(n) is a noisy frame or a clean frame. In this regard, the noise detector 501 may be configured to apply the approach described as an example in context of the reference voice detector 502 to determine the binary flag by comparing the determined noise level to the predetermined noise threshold.
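The noise level estimate based on the difference signal d(n)=x(n)−v(n) may be sketched as follows. The function names `noise_level_rms` and `is_noisy` are hypothetical, and treating a frame as noisy when the level strictly exceeds the threshold is an assumption; the text only states that the level is compared to the predetermined noise threshold.

```python
import math
from typing import Sequence


def noise_level_rms(x: Sequence[float], v: Sequence[float]) -> float:
    """RMS of d(n) = x(n) - v(n) over the segment/period of interest."""
    if len(x) != len(v) or len(x) == 0:
        raise ValueError("x and v must be non-empty and of equal length")
    d = [xi - vi for xi, vi in zip(x, v)]
    return math.sqrt(sum(di * di for di in d) / len(d))


def is_noisy(x: Sequence[float], v: Sequence[float],
             noise_threshold: float) -> bool:
    """Binary noise flag: True for a noisy frame, False for a clean one.

    Assumption: 'noisy' means the RMS level exceeds the threshold.
    """
    return noise_level_rms(x, v) > noise_threshold
```

The segment passed in may hold the samples of the current frame only or of the current frame together with neighbouring frames, in line with the alternatives described above.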
As a variation of the above-described approach for detecting the noise on basis of the captured audio signal x(n) and the noise-suppressed signal v(n), the speech enhancer may further receive a noise signal {circumflex over (n)}(n) from a microphone arrangement 510 arranged/dedicated to capture a signal that represents only the background noise component. Like the microphone arrangement 110, the microphone arrangement 510 may comprise a single microphone or a microphone array. Consequently, instead of estimating the noise as the difference signal d(n), in this approach the noise detector 501 may be arranged to detect the noise characteristics Nt,i, e.g. the noise indication Lt,n, on basis of the noise signal {circumflex over (n)}(n).
Instead of providing the noise detector 501 as a component of the speech enhancer 250, the noise detector 501 may be provided outside the speech enhancer 250, e.g. as part of the noise suppressor 130 or as a dedicated processing block/portion arranged to derive the noise characteristics Ni on basis of the captured audio signal x(n) and/or the noise-suppressed voice signal v(n).
In block 440, the difference(s) between the input voice characteristics Ct,i and the corresponding reference voice characteristics Rt,i are determined, and in block 450 a determination whether the determined difference(s) meet the predetermined criteria is carried out, as described hereinbefore in context of the speech naturalizer 505. In response to the difference(s) meeting the criteria, the frame of modified voice signal {tilde over (v)}t(n) is created by modifying the respective frame of the noise-suppressed voice signal vt(n) e.g. to exhibit modified voice characteristics {tilde over (C)}t,i that are similar to or approximate the reference voice characteristics Rt,i, as described hereinbefore in context of the speech naturalizer 505 and as indicated in block 460. In contrast, in response to the difference(s) failing to meet the predetermined criteria, the frame of modified voice signal {tilde over (v)}t(n) is created e.g. as a copy of the respective frame of the noise-suppressed voice signal vt(n), as described hereinbefore in context of the speech naturalizer 505 and as indicated in block 470. From block 460 or 470 the method 400 proceeds to obtain the next frame vt+1(n) of the noise-suppressed voice signal (in block 410) and the process from block 410 to 460 or 470 is repeated as long as further frames of the noise-suppressed voice signal are available, as indicated in block 480.
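The frame-by-frame flow of the method 400 may be sketched as a generator that either modifies or copies each frame. This is only an outline under assumptions: the name `naturalize_stream` and the four callables (`detect`, `get_reference`, `meets_criteria`, `modify`) are hypothetical placeholders for the processing described in context of the input voice detector 504, the reference voice detector 502 and the speech naturalizer 505.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[float]


def naturalize_stream(frames: Iterable[Frame],
                      detect: Callable[[Frame], float],
                      get_reference: Callable[[float], float],
                      meets_criteria: Callable[[float, float], bool],
                      modify: Callable[[Frame, float, float], Frame],
                      ) -> Iterator[Frame]:
    """Yield one frame of the modified voice signal per input frame."""
    for v_t in frames:                   # block 410; loop ends per block 480
        c_t = detect(v_t)                # input voice characteristic C_t
        r_t = get_reference(c_t)         # reference voice characteristic R_t
        if meets_criteria(c_t, r_t):     # blocks 440 and 450
            yield modify(v_t, c_t, r_t)  # block 460: modified frame
        else:
            yield list(v_t)              # block 470: plain copy of the frame
```

A single scalar characteristic per frame is used here for brevity; a set of characteristics Ct,i could be passed in the same way.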
As briefly referred to above, the voice characteristics applied as the input voice characteristics Ct,i, the reference voice characteristics Rt,i and the modified voice characteristics {tilde over (C)}t,i may include one or more parameters descriptive of voice characteristics. These parameters may include parameters descriptive of voice characteristics of a single type or voice characteristics of different types.
The voice characteristics may include one or more parameters descriptive of loudness or energy level of the respective voice signal, typically averaged over a signal segment/period of a desired length. The noise characteristics Nt,i may comprise one or more respective parameters descriptive of the background noise signal n(n).
The voice characteristics may include one or more parameters descriptive of the spectral magnitude or the spectral shape of the respective voice signal. The spectral shape/magnitude may be provided e.g. as a set of spectral bins, each indicating the spectral magnitude of the respective frequency region. The noise characteristics Nt,i may comprise one or more respective parameters descriptive of the background noise signal n(n).
The voice characteristics may include one or more parameters descriptive of the pace or rhythm of the speech in the respective voice signal. Such parameters may, for example, provide an indication of the minimum, maximum and/or average duration of pauses within the speech. These indications may concern e.g. indications of the pauses between words or pauses between phonemes in the respective voice signal.
The voice characteristics may include one or more parameters descriptive of the pitch of voice of the speaker in the respective voice signal.
Table 1 provides some examples of types of voice characteristics, (typically unconscious) reaction(s) by a speaker in an attempt to adapt his/her voice to account for the background noise conditions (i.e. the secondary impact of the background noise), and example(s) of corresponding actions that may be invoked as part of the speech naturalization process (e.g. in the speech naturalizer 505) in order to compensate for the secondary impact of the background noise.
The speech enhancer 650 comprises a reference voice loudness detector 602 for detection of the reference voice loudness Lr, an input voice loudness detector 604 for detection of the input voice loudness Lc and a speech loudness naturalizer 605 for creating the modified speech signal {tilde over (v)}(n). The speech enhancer 650 may comprise further processing portions or processing blocks, such as a noise loudness detector 601 for detection of the noise loudness Ln. Hence, the reference voice loudness detector 602 operates as the reference voice detector 502, the input voice loudness detector 604 operates as the input voice detector 504, the speech loudness naturalizer 605 operates as the speech naturalizer 505, and the noise loudness detector 601 operates as the noise detector 501.
The input voice loudness detector 604 is arranged to detect the input voice loudness for the frame vt(n), denoted as Lt,c on basis of the noise-suppressed voice signal v(n). The input voice loudness detector 604 may be arranged to carry out an analysis of a segment/period of the noise-suppressed voice signal v(n) covering one or more frames representing active speech in order to detect the input voice loudness Lt,c. As an example, the input voice loudness Lt,c may be detected on basis of the frame vt(n) only. As another example, the input voice loudness Lt,c may be detected on basis of the frame vt(n) and further on basis of a predetermined number of frames preceding the frame vt(n) (e.g. frames vt−k1(n), . . . vt−1(n)) and/or a predetermined number of frames following the frame vt(n) (e.g. frames vt+1(n), . . . , vt+k2(n)). As an example, the detection of the input voice loudness Lt,c may be carried out for a signal segment covering 500 to 3000 ms of the noise-suppressed voice signal v(n) and the analysis may be carried out for frames having duration in the range from 20 to 500 ms.
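Detection of the input voice loudness Lt,c over the frame vt(n) and its neighbouring frames may be sketched as follows. The text does not fix a particular loudness measure, so the use of an RMS value here is an assumption, as are the function name `input_loudness` and the representation of the signal as a list of frames.

```python
import math
from typing import List, Sequence


def input_loudness(frames: Sequence[Sequence[float]], t: int,
                   k1: int = 0, k2: int = 0) -> float:
    """Loudness L_t,c over frames v_{t-k1}(n) ... v_{t+k2}(n).

    Assumption: loudness is measured as the RMS of all samples in the
    analysis window; frames outside the available signal are skipped.
    """
    lo = max(0, t - k1)
    hi = min(len(frames), t + k2 + 1)
    samples: List[float] = [s for frame in frames[lo:hi] for s in frame]
    if not samples:
        raise ValueError("analysis window contains no samples")
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

With `k1=0` and `k2=0` the loudness is detected on basis of the frame vt(n) only, which corresponds to the first example given above.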
The reference voice loudness detector 602 is arranged to obtain the reference voice loudness for the frame vt(n), denoted as Lt,r, preferably descriptive of the loudness of the voice signal {circumflex over (v)}(n) in a noise-free environment or in a low-noise environment. The reference voice loudness detector 602 may be arranged to obtain the noise indication Lt,n from the noise loudness detector 601, the noise indication Lt,n being descriptive of the estimated noise level in the frame xt(n) or providing an indication whether the frame xt(n) is a noisy frame or a clean frame (as described in context of the reference voice detector 502). The process of obtaining the reference voice loudness Lt,r on basis of the input voice loudness Lt,c or on basis of the reference voice loudness Lt−1,r obtained for the previous frame vt−1(n) may be carried out in a manner similar to that described in the general case of obtaining the reference voice characteristics Rt,i in context of the reference voice detector 502.
The speech loudness naturalizer 605 is arranged to evaluate whether the difference between the input voice loudness Lt,c and the reference voice loudness Lt,r meets the predetermined criteria. This may comprise determining respective loudness comparison value(s) indicative of the difference between the input voice loudness Lt,c and the reference voice loudness Lt,r and determining whether the indicated difference in loudness exceeds a respective predetermined threshold. As an example, the comparison value may be determined as the loudness difference Lt,diff between the input voice loudness Lt,c and the reference voice loudness Lt,r, i.e. as Lt,diff=Lt,c−Lt,r, or as the loudness ratio Lt,ratio between the input voice loudness Lt,c and the reference voice loudness Lt,r, i.e. as Lt,ratio=Lt,c/Lt,r. Consequently, the modification of the frame vt(n) may be applied to create the respective modified voice frame {tilde over (v)}t(n) e.g. in response to the loudness difference Lt,diff exceeding the (first) loudness threshold, whereas the loudness difference Lt,diff that is smaller than or equal to the (first) loudness threshold results in applying a copy of frame vt(n) as the modified voice frame {tilde over (v)}t(n). As another example, the modification of the frame vt(n) may be applied to create the respective modified voice frame {tilde over (v)}t(n) e.g. in response to the loudness ratio Lt,ratio exceeding a (second) loudness threshold or falling below a (third) loudness threshold, whereas the loudness ratio Lt,ratio that is between these (second and third) thresholds results in applying a copy of frame vt(n) as the modified voice frame {tilde over (v)}t(n).
The modification of the frame vt(n) in order to create the frame {tilde over (v)}t(n) may comprise modifying the frame vt(n) by multiplying the signal samples of the frame vt(n) by a scaling factor k, i.e. {tilde over (v)}t(n)=k*vt(n), the scaling factor k determined e.g. as the ratio of the reference voice loudness Lt,r to the input voice loudness Lt,c, e.g. k=Lt,r/Lt,c.
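The loudness decision and the scaling by the factor k may be combined in a short sketch. The function name `naturalize_loudness` is hypothetical, and the sketch assumes that loudness is expressed as a linear amplitude-domain measure (such as an RMS value), so that multiplying the samples by the ratio of the reference loudness to the input loudness brings the frame to the reference loudness.

```python
from typing import List, Sequence


def naturalize_loudness(frame: Sequence[float], l_c: float, l_r: float,
                        diff_threshold: float) -> List[float]:
    """Scale the frame towards the reference loudness when warranted.

    l_c and l_r are the input and reference voice loudness L_t,c and L_t,r.
    Assumption: both are linear amplitude measures, e.g. RMS values.
    """
    if l_c <= 0.0:
        raise ValueError("input loudness must be positive")
    l_diff = l_c - l_r                  # L_t,diff = L_t,c - L_t,r
    if l_diff > diff_threshold:         # exceeds the (first) loudness threshold
        k = l_r / l_c                   # ratio of reference to input loudness
        return [k * s for s in frame]   # modified frame k * v_t(n)
    return list(frame)                  # otherwise a copy of v_t(n)
```

A frame whose loudness is twice the reference is thus halved in amplitude, whereas a frame whose loudness difference stays at or below the threshold passes through unchanged.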
Therefore, the reference voice loudness detector 602 (or the reference voice detector 502) may not apply the reference voice loudness Lr detected before the time period from time instant 4 to 17 for the time instants 12 to 15 but may apply detection of the reference voice loudness Lr based (at least in part) on a segment of the noise-suppressed voice signal v(n) corresponding to the time instants from 12 to 15 to account for the change in input voice loudness Lc when there was no corresponding change in the noise loudness Ln. To put it in other words, the increase in the input voice loudness Lc during time instants 12 to 15 is preferably not removed by the speech loudness naturalizer 605 (or the speech naturalizer 505). On the other hand, the change in the input voice loudness Lc during time instants 6 to 8 coincides with a change in the noise loudness Ln, thereby representing a change in the input voice loudness Lc that is preferably to be compensated for by the reference voice loudness detector 602 (or the reference voice detector 502). Hence, in the example of
The speech enhancer 1050 comprises a reference pitch detector 1002 for detection of the reference pitch Pr, an input pitch detector 1004 for detection of the pitch Pc of the input voice and a pitch naturalizer 1005 for creating the modified speech signal {tilde over (v)}(n). The speech enhancer 1050 may comprise further processing portions or processing blocks, such as the noise detector 501 for detection of the noise characteristics Ni, e.g. the noise loudness Ln. Hence, the reference pitch detector 1002 operates as the reference voice detector 502, the input pitch detector 1004 operates as the input voice detector 504, and the pitch naturalizer 1005 operates as the speech naturalizer 505.
The input pitch detector 1004 is arranged to detect the pitch Pc of the input voice for the frame vt(n), denoted as Pt,c on basis of the noise-suppressed voice signal v(n). The input pitch detector 1004 may be arranged to carry out an analysis of a segment/period of the noise-suppressed voice signal v(n) covering one or more frames representing active speech in order to detect the input pitch Pt,c. As an example, the input pitch Pt,c may be detected on basis of the frame vt(n) only. As another example, the input pitch Pt,c may be detected on basis of the frame vt(n) and further on basis of a predetermined number of frames preceding the frame vt(n) (e.g. frames vt−k1(n), . . . vt−1(n)) and/or a predetermined number of frames following the frame vt(n) (e.g. frames vt+1(n), . . . , vt+k2(n)). As an example, the detection of the input pitch Pt,c may be carried out for a signal segment covering 500 to 3000 ms of the noise-suppressed voice signal v(n) and the analysis may be carried out for frames having duration in the range from 20 to 500 ms.
The reference pitch detector 1002 is arranged to obtain the reference pitch for the frame vt(n), denoted as Pt,r, preferably descriptive of the pitch of the voice signal {circumflex over (v)}(n) in a noise-free environment or in a low-noise environment. The reference pitch detector 1002 may be arranged to obtain the noise indication Lt,n from the noise detector 501, the noise indication Lt,n being descriptive of the estimated noise level in the frame xt(n) or providing an indication whether the frame xt(n) is a noisy frame or a clean frame (as described in context of the reference voice detector 502). The process of obtaining the reference pitch Pt,r on basis of the input pitch Pt,c or on basis of the reference pitch Pt−1,r obtained for the previous frame vt−1(n) may be carried out in a manner similar to that described in the general case of obtaining the reference voice characteristics Rt,i in context of the reference voice detector 502.
The pitch naturalizer 1005 is arranged to evaluate whether the difference between the input pitch Pt,c and the reference pitch Pt,r meets the predetermined criteria. This may comprise determining respective pitch comparison value(s) indicative of the difference between the input pitch Pt,c and the reference pitch Pt,r and determining whether the indicated difference in pitch exceeds a respective predetermined threshold. As an example, the comparison value may be determined as the pitch difference Pt,diff between the input pitch Pt,c and the reference pitch Pt,r, i.e. as Pt,diff=Pt,c−Pt,r, or as the pitch ratio Pt,ratio between the input pitch Pt,c and the reference pitch Pt,r, i.e. as Pt,ratio=Pt,c/Pt,r. Consequently, the modification of the frame vt(n) may be applied to create the respective modified voice frame {tilde over (v)}t(n) e.g. in response to the pitch difference Pt,diff exceeding the (first) pitch difference threshold, whereas the pitch difference Pt,diff that is smaller than or equal to the (first) pitch difference threshold results in applying a copy of frame vt(n) as the modified voice frame {tilde over (v)}t(n). As another example, the modification of the frame vt(n) may be applied to create the respective modified voice frame {tilde over (v)}t(n) e.g. in response to the pitch ratio Pt,ratio exceeding a (second) pitch difference threshold or falling below a (third) pitch difference threshold, whereas the pitch ratio Pt,ratio that is between these (second and third) pitch difference thresholds results in applying a copy of frame vt(n) as the modified voice frame {tilde over (v)}t(n).
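The ratio-based pitch criterion may be sketched as follows. The function name `pitch_needs_modification` and the default threshold values of 1.1 and 0.9 are purely illustrative assumptions; the text does not specify numeric values for the (second and third) thresholds.

```python
def pitch_needs_modification(p_c: float, p_r: float,
                             upper: float = 1.1, lower: float = 0.9) -> bool:
    """Return True when the pitch ratio P_t,c / P_t,r leaves [lower, upper].

    upper and lower stand in for the (second) and (third) thresholds;
    their default values are hypothetical.
    """
    if p_r <= 0.0:
        raise ValueError("reference pitch must be positive")
    ratio = p_c / p_r                   # P_t,ratio = P_t,c / P_t,r
    return ratio > upper or ratio < lower
```

When the function returns False, a copy of the frame vt(n) would be applied as the modified voice frame, in line with the description above.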
The modification of the frame vt(n) in order to create the frame {tilde over (v)}t(n) may comprise modifying the frame vt(n) by applying a pitch modification technique known in the art.
As briefly referred to hereinbefore (e.g. in context of the example of
From block 815 the method 800a proceeds to block 845 for the optional step of aligning, at least in part, the reference voice characteristics Rt,i with general properties of speech signals in a noise-free environment or in a low-noise environment and/or with personal characteristics of speech uttered by the speaker of the voice signal {circumflex over (v)}(n). From block 845 the method 800a proceeds to block 850 for outputting the reference voice characteristics Rt,i e.g. for being applied for the current frame and for being stored (in a memory) for further use in subsequent frame(s).
In block 820 it is determined whether the input voice characteristics Ct,i are similar or essentially similar to those (most recently) detected in noise-free or low-noise conditions, denoted as noise-free voice characteristics Cnf,i. In response to this determination being affirmative, the input voice characteristics Ct,i are applied as the (adapted) reference voice characteristics Rt,i (block 815). In contrast, in response to the input voice characteristics Ct,i being found to be different from the noise-free voice characteristics Cnf,i, the method 800a proceeds to obtaining the most recently applied reference voice characteristics Rt−1,i (e.g. by reading from a memory) and (re)applying these as the (new) reference voice characteristics Rt,i, as indicated in block 825. The determination of similarity may comprise deriving the difference between the input voice characteristics Ct,i and the noise-free voice characteristics Cnf,i, and considering the two being different in response to (the absolute value of) the difference therebetween exceeding a predetermined threshold. The threshold may be set differently for different voice characteristics i.
In block 830 it is determined whether the input voice characteristics Ct,i are similar or essentially similar to those obtained for the reference frame Cref,i. In response to this determination being affirmative, the method 800a proceeds to the (optional) block 845 and further to block 850. In contrast, in response to the input voice characteristics Ct,i being found to be different from those of the reference frame Cref,i, the method 800a proceeds to block 835. The determination of similarity may comprise deriving the difference between the input voice characteristics Ct,i and the voice characteristics of the reference frame Cref,i, and considering the two being different in response to (the absolute value of) the difference therebetween exceeding a predetermined threshold. The threshold may be set differently for different voice characteristics i.
In block 835 it is determined whether the noise characteristics Nt,i are similar or essentially similar to noise characteristics obtained for the reference frame, denoted as Nref,i. In response to this determination being affirmative, the method 800a proceeds to the (optional) block 845 and further to block 850. In contrast, in response to the noise characteristics Nt,i being found to be different from the noise characteristics of the reference frame Nref,i, the method 800a proceeds to block 840. The determination of similarity may comprise deriving the difference between the noise characteristics Nt,i and noise characteristics of the reference frame Nref,i, and considering the two being different in response to (the absolute value of) the difference therebetween exceeding a predetermined threshold. The threshold may be set differently for different noise characteristics i.
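The similarity determination used in blocks 820, 830 and 835 may be sketched as a comparison of absolute differences against per-characteristic thresholds. The function name `is_similar` is hypothetical, and treating two sets of characteristics as different as soon as any single characteristic exceeds its threshold is an assumption made for this sketch.

```python
from typing import Sequence


def is_similar(a: Sequence[float], b: Sequence[float],
               thresholds: Sequence[float]) -> bool:
    """Return True when no |a_i - b_i| exceeds its threshold.

    a and b may hold e.g. C_t,i and C_ref,i, or N_t,i and N_ref,i.
    Assumption: one exceeded threshold suffices to call the sets different.
    """
    return all(abs(ai - bi) <= thi
               for ai, bi, thi in zip(a, b, thresholds))
```

Passing a separate threshold per index reflects the statement that the threshold may be set differently for different characteristics i.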
In block 840, the reference voice characteristics Rt,i are modified to align them with the observed change in the input voice characteristics Ct,i so that the change in the input voice characteristics Ct,i (e.g. increase in loudness) causes a corresponding change (e.g. increase in loudness) in the reference voice characteristics Rt,i, as illustrated in
In the following, exemplifying variations of the method 800a are described. Like the method 800a, also these variations thereof may be implemented e.g. by the reference voice detector 502 or the reference voice loudness detector 602.
The operations, procedures, functions and/or methods described in context of the components of the speech enhancer 250, 650, 1050 may be distributed between the components in a manner different from the one(s) described hereinbefore. There may be, for example, further components within the speech enhancer 250, 650, 1050 for carrying out some of the operations, procedures, functions and/or methods assigned in the description hereinbefore to components of the respective speech enhancer 250, 650, 1050, or there may be a single component or a unit for carrying out the operations, procedures, functions and/or methods described in context of the speech enhancer 250, 650, 1050.
In particular, the operations, procedures, functions and/or methods described in context of the components of the speech enhancer 250, 650, 1050 may be provided as software means, as hardware means, or as a combination of software means and hardware means. As an example in this regard, the speech enhancer 250 may be provided as an apparatus comprising means for obtaining a current time frame of a noise-suppressed voice signal, derived on basis of a current time frame of a source audio signal comprising a source voice signal, means for detecting input voice characteristics Ci for the current time frame of noise-suppressed voice signal, means for obtaining reference voice characteristics Ri for said current time frame, said reference voice characteristics Ri being descriptive of the source voice signal in noise-free or low-noise environment, and means for creating a current time frame of a modified voice signal {tilde over (v)}(n) by modifying said current time frame of the noise-suppressed voice signal in response to a difference between the detected input voice characteristics Ci and the reference voice characteristics Ri exceeding a predetermined threshold.
Along similar lines, the speech enhancer 650 may be provided as an apparatus comprising means for obtaining a current time frame of a noise-suppressed voice signal v(n), derived on basis of a current time frame of a source audio signal comprising a source voice signal, means for detecting input voice loudness Lc for the current time frame of noise-suppressed voice signal v(n), means for obtaining reference voice loudness Lr for said current time frame, said reference voice loudness Lr being descriptive of the source voice signal in noise-free or low-noise environment, and means for creating a current time frame of a modified voice signal {tilde over (v)}(n) by modifying said current time frame of the noise-suppressed voice signal v(n) in response to a difference between the detected input voice loudness Lc and the reference voice loudness Lr exceeding a predetermined threshold. As a further example, the speech enhancer 1050 may be provided as an apparatus comprising means for obtaining a current time frame of a noise-suppressed voice signal v(n), derived on basis of a current time frame of a source audio signal comprising a source voice signal, means for detecting a pitch Pc of the input voice for the current time frame of noise-suppressed voice signal v(n), means for obtaining a reference pitch Pr for said current time frame, said reference pitch Pr being descriptive of the source voice signal in noise-free or low-noise environment, and means for creating a current time frame of a modified voice signal {tilde over (v)}(n) by modifying said current time frame of the noise-suppressed voice signal v(n) in response to a difference between the input pitch Pc and the reference pitch Pr exceeding a predetermined threshold.
Although the processor 910 is presented in the example of
The apparatus 900 may be embodied for example as a mobile phone, a smartphone, a digital camera, a digital video camera, a music player, a media player, a gaming device, a laptop computer, a desktop computer, a personal digital assistant (PDA), a tablet computer, etc.
The memory 920 may store a computer program 950 comprising computer-executable instructions that control the operation of the apparatus 900 when loaded into the processor 910. As an example, the computer program 950 may include one or more sequences of one or more instructions. The computer program 950 may be provided as a computer program code. The processor 910 is able to load and execute the computer program 950 by reading the one or more sequences of one or more instructions included therein from the memory 920. The one or more sequences of one or more instructions may be configured to, when executed by one or more processors, cause an apparatus, for example the apparatus 900, to carry out the operations, procedures and/or functions described hereinbefore in context of the speech enhancer 250, 650, 1050.
Hence, the apparatus 900 may comprise at least one processor 910 and at least one memory 920 including computer program code for one or more programs, the at least one memory 920 and the computer program code configured to, with the at least one processor 910, cause the apparatus 900 to perform the operations, procedures and/or functions described hereinbefore in context of the speech enhancer 250, 650, 1050.
The computer program 950 may be provided at the apparatus 900 via any suitable delivery mechanism. As an example, the delivery mechanism may comprise at least one computer readable non-transitory medium having program code stored thereon, the program code, when executed by an apparatus, causing the apparatus at least to carry out the operations, procedures and/or functions described hereinbefore in context of the speech enhancer 250, 650, 1050. The delivery mechanism may be for example a computer readable storage medium, a computer program product, a memory device, a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program 950. As a further example, the delivery mechanism may be a signal configured to reliably transfer the computer program 950.
Reference to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processors, etc. Features described in the preceding description may be used in combinations other than the combinations explicitly described. Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not. Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.
Number | Date | Country | Kind |
---|---|---|---|
1317910.6 | Oct 2013 | GB | national |
Number | Date | Country |
---|---|---|
20150106088 A1 | Apr 2015 | US |