Desktop and laptop personal computers are increasingly being used as devices for sound capture in a variety of recording and communication scenarios. Some of these scenarios include the recording of meetings and lectures for archival purposes and the capture of speech for voice over Internet protocol (“VOIP”) telephony, video conferencing, and audio/video instant messaging. In these applications, audio input is typically captured using a local microphone. In many cases, such as with laptop computers, the microphone may be built into the computer itself and located very close to a keyboard. This type of configuration is highly vulnerable to environmental noise sources being picked up by the microphone. It is particularly vulnerable to one specific type of additive noise: that produced by a user simultaneously operating a user input device, such as typing on the keyboard of the computer being used for sound capture.
Continuous typing on a keyboard, mouse clicks, or stylus taps, for instance, produce a sequence of noise-like impulses in the captured audio stream. The presence of this non-stationary, impulsive noise in the captured audio stream can be very unpleasant for a downstream listener. In the past, some attempts have been made to deal with impulsive noise generated by keystrokes. However, these attempts have typically tried to explicitly model the keystroke noise and to remove it from the audio stream. This type of approach presents significant problems because keystroke noise (and other user input noise, for that matter) can be highly variable across different users and across different keyboard devices. Moreover, these previous attempts are computationally expensive, making them unacceptable for use in a real time communication environment where low latency is a primary goal.
It is with respect to these considerations and others that the disclosure made herein is presented.
Technologies are described herein for keystroke sound suppression. In particular, through the utilization of the concepts and technologies presented herein, keystroke noise in an audio signal is identified and suppressed by applying a suppression gain to the audio signal when keystroke noise is detected in the absence of speech. Because no attempt is made to model the keystroke noise or to remove it from the audio stream, the concepts and technologies presented herein are suitable for use in a real time communication environment where low latency is a primary goal.
In one implementation, an audio signal is received that might include keyboard noise and/or speech. The audio signal is digitized into a sequence of frames and each frame is transformed from a time domain to a frequency domain for analysis. The transformed audio is then analyzed to determine whether there is a high likelihood that keystroke noise is present in the audio. High likelihood of keystroke noise means that the probability of keystroke noise is higher than a predefined threshold. In one embodiment, the analysis is performed by selecting one of the frames as a current frame. A determination is then made as to whether other frames surrounding the current frame can be utilized to predict the value of the current frame. If the current frame cannot be predicted from the surrounding frames, then there is a high likelihood that keystroke noise is present in the audio signal at or around the current frame.
If it is determined there is high likelihood that the audio signal contains keystroke noise, a determination is made as to whether a keyboard event occurred around the time of the keystroke noise. In order to perform this function, keystroke information is received in one embodiment from an input device application programming interface (“API”) that is configured to deliver the keystroke information with minimal intervention, and therefore minimal latency, from an operating system. The keystroke information is received asynchronously and may identify that either a key-up event or a key-down event occurred. The determination as to whether a keyboard event occurred contemporaneously with the keystroke noise is made based upon the keystroke information received from the input device API in one embodiment.
If it is determined that a keyboard event occurred around the time possible keystroke noise was detected, a further determination is made as to whether speech is present in the audio signal around the time of the keystroke noise. A voice activity detection (“VAD”) component is utilized in one embodiment to make this determination. If no speech is present, the keystroke noise is suppressed in the audio signal. In one embodiment, an automatic gain control (“AGC”) component applies a suppression gain to the audio signal to thereby suppress the keystroke noise in the audio signal. If speech is detected in the audio signal or if the keystroke noise abates, the suppression gain is removed from the audio signal.
It should be appreciated that the above-described subject matter may also be implemented as a computer-controlled apparatus, a computer process, a computing system, or as an article of manufacture such as a computer-readable medium. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended that this Summary be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
The following detailed description is directed to concepts and technologies for keystroke noise suppression. While the subject matter described herein is presented in the general context of program modules that execute in conjunction with the execution of an operating system and application programs on a computer system, those skilled in the art will recognize that other implementations may be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks, implement particular abstract data types, and transform data. Moreover, those skilled in the art will appreciate that the subject matter described herein may be practiced with or tied to other specific computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
In the following detailed description, references are made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments or examples. Referring now to the drawings, in which like numerals represent like elements throughout the several figures, technologies for keystroke noise suppression will be described.
Turning now to
In the environment shown in
Keystrokes can be broadly classified as spectrally flat. However, the inherent variety of typing styles, key sequences, and the mechanics of the keys themselves introduce a degree of randomness in the spectral content of a keystroke. This leads to a significant variability across frequency and time for even the same key. The keystroke noise suppression system 102 shown in
According to one embodiment, a user provides a speech signal 104 to a microphone 106. The microphone 106 also receives keystroke noise 110 from the keyboard 108 that is being used by the user. The microphone 106 therefore provides an audio signal 112 that might include speech and keyboard noise to the keystroke noise suppression system 102. It should be appreciated that at any given time, the signal 112 may include silence or other background noise, keyboard noise only, speech only, or keyboard noise and speech.
In one implementation, the keystroke noise suppression system 102 includes a keystroke event detection component 116 and an acoustic feature analysis component 118. A voice activity detection (“VAD”) component 120 and an automatic gain control (“AGC”) component 122 may also be provided by the keystroke noise suppression system 102 or by an operating system.
As shown in
According to one implementation, the acoustic feature analysis component 118 is configured to receive the audio signal 112 and to perform an analysis on the audio signal 112 to determine whether there is high likelihood that keystroke noise 110 is present in the audio signal. In particular, the acoustic feature analysis component 118 is configured in one embodiment to take the digitized audio signal 112 and to subdivide the digitized audio signal 112 into a sequence of frames. The frames are then transformed from the time domain to the frequency domain for analysis.
Once the audio signal 112 has been transformed to the frequency domain, the acoustic feature analysis component 118 analyzes the transformed audio signal 112 to determine whether there is a high likelihood that keystroke noise 110 is present in the audio signal 112. In one embodiment, the analysis is performed by selecting one of the frames as a current frame. The acoustic feature analysis component 118 then determines whether other frames of the audio signal 112 surrounding the current frame can be utilized to predict the value of the current frame. If the current frame cannot be predicted from the surrounding frames, then there is a high likelihood that keystroke noise 110 is present in the audio signal 112 at or around the current frame.
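The framing and transform step can be sketched as follows. This is an illustrative sketch only, not the implementation described in the figures; the frame length, hop size, and Hann window are assumed values chosen for illustration.

```python
# Illustrative sketch: subdividing a digitized audio signal into frames and
# computing the magnitude short-time Fourier transform S(k, n) that the
# acoustic feature analysis operates on. Assumes len(samples) >= frame_len.
import numpy as np

def magnitude_stft(samples: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Return |S(k, n)| with shape (num_bins, num_frames)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(samples) - frame_len) // hop
    spectra = []
    for n in range(num_frames):
        frame = samples[n * hop : n * hop + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum of frame n
    return np.stack(spectra, axis=1)
```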
The measure of likelihood that keystroke noise 110 is present in the audio signal 112 can be summarized by the equation shown in Table 1.
In the equation shown in Table 1, S(k,n) represents the magnitude of a short-time Fourier transform (“STFT”) over the audio signal 112, wherein the variable k is a frequency bin index and the variable n is a time frame index. The likelihood that a current frame of the audio signal 112 includes keystroke noise is computed over the frame range [n−M, n+M]. A typical value of M is 2. The computed likelihood is compared to a fixed threshold to determine whether there is high likelihood that the audio signal 112 contains keystroke noise. The fixed threshold may be determined empirically.
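Table 1 itself is not reproduced here. One plausible form of the measure, consistent with the definitions above (a prediction of the current frame spectrum from its 2M neighboring frames, yielding a large value when the frame cannot be predicted), is sketched below; the exact expression used in the described embodiment may differ.

```latex
% Illustrative form of the predictability measure F_n (the exact expression in
% Table 1 is not reproduced here): the current frame spectrum compared to the
% average spectrum of its 2M neighboring frames.
\[
  F_n \;=\; \frac{\displaystyle\sum_{k} S(k,n)}
                 {\displaystyle\sum_{k} \frac{1}{2M} \sum_{m=n-M,\; m \neq n}^{n+M} S(k,m)}
\]
% A frame that cannot be predicted from its neighbors (e.g., a keystroke
% transient) yields a large F_n, which is then compared to the fixed threshold.
```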
The likelihood function shown in Table 1 is not, by itself, a completely reliable measure of the likelihood that keystroke noise 110 is present in the audio signal 112. More precisely, the equation in Table 1 is a measure of signal predictability, i.e. how well the current frame spectrum can be predicted by its neighbors. Because typing noise is highly transient, it cannot be predicted from its neighboring frames and therefore results in a large value of Fn. However, many other transient sounds or interferences can also produce a high value of Fn, for example the sound of a pen dropped onto a hard table. Even a normal voice speaking plosive consonants like “t” and “p” can produce a high value of Fn.
In order to improve the likelihood calculations, keyboard events generated by the computing system upon which the keystroke noise suppression system 102 is executing are utilized to constrain the likelihood calculations described above. In particular, on many types of computing systems a key-down event and a key-up event will be generated when a key is pressed or released, respectively, on the keyboard 108. For each frame of the audio signal 112, if the likelihood computation described above determines that it is likely that keystroke noise 110 is present and a key-down or key-up event is located proximately to the current frame, keystroke noise 110 is considered to be present.
In order to determine whether key-down or key-up events have been generated, the keystroke event detection component 116 is configured to utilize the services of an input device API 114. The input device API 114 provides an API for asynchronously delivering keystroke information, such as key-up events and key-down events, with minimal intervention from the operating system and low latency. The WINDOWS family of operating systems from MICROSOFT CORPORATION provides several APIs for obtaining keystroke information in this manner. It should be appreciated, however, that other operating systems from other manufacturers provide similar functionality for accessing keyboard input events in a low latency manner and may be utilized with the embodiments presented herein.
Because keyboard events are generated asynchronously, a separate thread may be created to receive the keystroke information. In this implementation, the keyboard events are pushed into a queue maintained by a detection thread and consumed by a processing function in a main thread. In one embodiment, the queue is implemented by a circular buffer that is designed to be lock- and wait-free while also maintaining data integrity. It should be appreciated that other implementations may be utilized.
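A minimal sketch of such a queue is shown below, assuming a single producer (the keyboard-event detection thread) and a single consumer (the audio processing thread). The KeyEvent structure is a hypothetical illustration, and the sketch relies on CPython's GIL for the simple index updates; a production lock-free implementation would use atomic head/tail indices in a lower-level language.

```python
# Sketch of a single-producer / single-consumer ring buffer for keyboard events.
# The keyboard-hook thread pushes key-up / key-down events; the audio
# processing thread pops them. The producer only writes _head, the consumer
# only writes _tail, so neither side blocks the other.
from dataclasses import dataclass

@dataclass
class KeyEvent:
    timestamp_ms: float   # time at which the key-up or key-down event was reported
    is_key_down: bool

class EventRing:
    def __init__(self, capacity: int = 256):
        self._buf = [None] * capacity
        self._capacity = capacity
        self._head = 0   # next slot to write (producer only)
        self._tail = 0   # next slot to read (consumer only)

    def push(self, event: KeyEvent) -> bool:
        """Producer side: drop the event if the ring is full."""
        nxt = (self._head + 1) % self._capacity
        if nxt == self._tail:
            return False
        self._buf[self._head] = event
        self._head = nxt
        return True

    def pop(self):
        """Consumer side: return the oldest event, or None if the ring is empty."""
        if self._tail == self._head:
            return None
        event = self._buf[self._tail]
        self._tail = (self._tail + 1) % self._capacity
        return event
```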
According to one embodiment, when the likelihood computation described above is higher than a threshold, keyboard events are located that have occurred contemporaneously with the keystroke noise 110. In one implementation, for instance, keyboard events occurring within −10 ms to 60 ms of the detected peak are identified. If one or more keyboard events are found in this search range, it is assumed that keystroke noise 110 is present. The frames within a certain duration of the peak are considered corrupted by the keystroke noise 110. The duration of corruption typically lasts 40 ms to 100 ms, depending upon the strength of the peak.
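The gating and corruption-marking logic can be sketched as follows. The −10 ms to 60 ms search window follows the values given above; the frame duration, the fixed 60 ms corruption span, and the event-list format are illustrative assumptions rather than the described implementation.

```python
# Illustrative sketch: given the frame index at which the predictability
# measure peaked, look for keyboard events near the peak and, if any are
# found, mark a span of frames as corrupted by keystroke noise.
def frames_corrupted_by_keystroke(peak_frame: int,
                                  event_times_ms: list,
                                  frame_duration_ms: float = 16.0,
                                  corruption_ms: float = 60.0) -> set:
    peak_ms = peak_frame * frame_duration_ms
    # Any key-up/key-down event within [-10 ms, +60 ms] of the peak confirms a keystroke.
    confirmed = any(-10.0 <= (t - peak_ms) <= 60.0 for t in event_times_ms)
    if not confirmed:
        return set()
    # Mark the frames within the corruption duration (40-100 ms depending on
    # peak strength; a fixed 60 ms is used here for simplicity) as corrupted.
    num_frames = max(1, int(round(corruption_ms / frame_duration_ms)))
    return {peak_frame + i for i in range(num_frames)}
```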
If the keystroke noise suppression system 102 determines that keystroke noise 110 is present during a particular group of frames based upon the likelihood computation and the keyboard event data, the voice activity detection (“VAD”) component 120 is utilized to determine whether speech 104 is also occurring within the frames. As known in the art, VAD refers to the process of determining whether voice is present or absent in an audio signal. Various algorithms exist for making this determination.
If speech 104 exists within the frames that have been determined to be corrupted by keystroke noise 110, the results from the VAD component 120 are ignored and no status change occurs. However, if speech 104 does not exist within the frames that have been determined to be corrupted by keystroke noise 110, then the AGC component 122 is instructed to apply a suppression gain to the frames to thereby minimize the keystroke noise 110. For instance, in one embodiment, the suppression gain may be −30 dB to −40 dB.
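A minimal sketch of this per-frame decision, assuming numpy frames and a fixed −35 dB gain chosen from the stated −30 dB to −40 dB range, might look like the following; it is an illustration, not the AGC component 122 itself.

```python
# Illustrative sketch of the suppression decision: when a frame is marked as
# corrupted by keystroke noise and the VAD reports no speech, a fixed
# suppression gain is applied; otherwise the frame passes through unchanged
# to normal AGC processing.
import numpy as np

SUPPRESSION_GAIN_DB = -35.0
SUPPRESSION_GAIN = 10.0 ** (SUPPRESSION_GAIN_DB / 20.0)

def process_frame(frame: np.ndarray, keystroke_detected: bool, speech_present: bool) -> np.ndarray:
    if keystroke_detected and not speech_present:
        return frame * SUPPRESSION_GAIN   # suppress the keystroke-only frame
    return frame                          # speech (or clean) frames pass through to normal AGC
```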
According to one embodiment, only frames of the audio signal 112 that have not been determined to be corrupted by keystroke noise 110 are provided to the VAD component 120 for the determination as to whether voice is present in the frames. In this manner, only uncorrupted frames are utilized by the VAD component 120 to determine voice activity.
The output of the AGC component 122 is the audio signal 124 that has the keystroke noise 110 contained therein suppressed. As described briefly above, the audio signal 124 may be provided to another software component for further processing 126. For instance, further processing 126 might include the transmission of the audio signal 124 as part of a VOIP conversation. Additional details regarding the operation of the keystroke noise suppression system 102 will be provided below with respect to
Referring now to
It should be appreciated that the logical operations described herein are implemented (1) as a sequence of computer implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations described herein are referred to variously as states, operations, structural devices, acts, or modules. These operations, structural devices, acts, and modules may be implemented in software, in firmware, in special purpose digital logic, and in any combination thereof. It should also be appreciated that more or fewer operations may be performed than shown in the figures and described herein. These operations may also be performed in a different order than those described herein.
The routine 200 begins at operation 202, where the acoustic feature analysis component 118 is executed in the manner described above to determine the likelihood that keystroke noise 110 is present in the audio signal 112. From operation 202, the routine 200 proceeds to operation 204, where a determination is made as to whether there is high likelihood that keystroke noise 110 is present. If there is no or low likelihood that keystroke noise is present, the routine 200 moves back to operation 202, where the execution of the acoustic feature analysis component 118 continues.
If, at operation 204, the acoustic feature analysis component 118 determines that the likelihood that keystroke noise 110 is present in the audio signal 112 exceeds a pre-defined threshold, the routine 200 proceeds to operation 206. At operation 206, the keystroke event detection component 116 is executed to determine whether a keyboard event has occurred contemporaneously with the keystroke noise 110. Although the routine 200 indicates that the keystroke event detection component 116 is executed after the acoustic feature analysis component 118, it should be appreciated that these components are executed concurrently in one embodiment. In this manner, and as described above, keyboard event information is continually received asynchronously from the input device API 114 and placed in a queue. When the acoustic feature analysis component 118 detects a high likelihood of keystroke noise 110, the contents of the queue can be searched for contemporaneous keyboard events.
If, at operation 208, the keystroke event detection component 116 concludes that no contemporaneous keyboard events are present, the routine 200 proceeds to operation 202, described above. If, however, one or more keyboard events are detected around the time of the detected keystroke noise 110, the routine 200 proceeds from operation 208 to operation 210. At operation 210, the VAD component 120 is utilized to determine whether speech 104 exists in the frames for which keystroke noise 110 has been detected. If the VAD component 120 determines that speech 104 is present, the routine 200 proceeds from operation 212 to operation 216. At operation 216, the AGC component 122 applies standard AGC to the frames. It should be appreciated that no gain control may be applied to frames containing speech in one embodiment.
If, at operation 210, the VAD component 120 determines that speech 104 is not present in the frames, the routine 200 proceeds from operation 212 to operation 214. At operation 214, the AGC component 122 applies suppression gain to the frames to suppress the detected keystroke noise 110. From operations 214 and 216, the routine 200 proceeds to operation 218, where the audio 124 is output to a software component for further processing 126. From operation 218, the routine 200 returns to operation 202, described above, where subsequent frames of the audio signal 112 are processed in a similar manner as described above. It should be appreciated that the operations shown in
In one embodiment, a two second “hangover” time is added when a determination is made that speech is present. This means that if speech is detected at operation 212, the following two seconds of audio are considered to have speech present regardless of whether speech is actually present. It should be appreciated that the hangover time is two seconds in one embodiment, but that another period of time may be utilized in other embodiments.
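Tying the pieces of routine 200 together, a per-frame decision loop with the two-second hangover might be sketched as follows. The is_keystroke and is_speech callables stand in for the acoustic analysis with event gating and for the VAD component 120, respectively, and the specific gain and frame duration are illustrative assumptions.

```python
# Sketch of the per-frame decision flow of routine 200, including the
# two-second speech "hangover". Frames are assumed to be numpy arrays.
def run_suppression(frames, is_keystroke, is_speech, frame_duration_ms: float = 16.0):
    """is_keystroke(n) -> bool and is_speech(frame) -> bool are placeholders for
    the acoustic feature analysis / keyboard event gating and the VAD component."""
    suppression_gain = 10.0 ** (-35.0 / 20.0)   # -35 dB, within the stated -30 to -40 dB range
    hangover_ms = 2000.0                        # two-second speech hangover
    speech_hold_ms = 0.0
    for n, frame in enumerate(frames):
        if is_speech(frame):
            speech_hold_ms = hangover_ms        # speech detected: refresh the hangover window
        else:
            speech_hold_ms = max(0.0, speech_hold_ms - frame_duration_ms)
        if is_keystroke(n) and speech_hold_ms == 0.0:
            yield frame * suppression_gain      # keystroke noise with no speech: suppress
        else:
            yield frame                         # otherwise pass through to normal AGC
```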
The computer architecture shown in
The mass storage device 310 is connected to the CPU 302 through a mass storage controller (not shown) connected to the bus 304. The mass storage device 310 and its associated computer-readable media provide non-volatile storage for the computer 300. Although the description of computer-readable media contained herein refers to a mass storage device, such as a hard disk or CD-ROM drive, it should be appreciated by those skilled in the art that computer-readable media can be any available computer storage media that can be accessed by the computer 300.
By way of example, and not limitation, computer-readable media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, digital versatile disks (“DVD”), HD-DVD, BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 300.
According to various embodiments, the computer 300 may operate in a networked environment using logical connections to remote computers through a network such as the network 320. The computer 300 may connect to the network 320 through a network interface unit 306 connected to the bus 304. It should be appreciated that the network interface unit 306 may also be utilized to connect to other types of networks and remote computer systems. The computer 300 may also include an input/output controller 312 for receiving and processing input from a number of other devices, including a keyboard 108, a microphone 106, a mouse, or an electronic stylus. Similarly, an input/output controller may provide output to a display screen, a printer, a speaker, or other type of output device.
As mentioned briefly above, a number of program modules and data files may be stored in the mass storage device 310 and RAM 314 of the computer 300, including an operating system 318 suitable for controlling the operation of a networked desktop, laptop, or server computer. The mass storage device 310 and RAM 314 may also store one or more program modules. In particular, the mass storage device 310 and the RAM 314 may store the keystroke noise suppression system 102, which was described in detail above with respect to
Based on the foregoing, it should be appreciated that technologies for keyboard noise suppression are provided herein. Although the subject matter presented herein has been described in language specific to computer structural features, methodological acts that include transformations, and computer readable media, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features, acts, or media described herein. Rather, the specific features, acts, and media are disclosed as example forms of implementing the claims.
The subject matter described above is provided by way of illustration only and should not be construed as limiting. Various modifications and changes may be made to the subject matter described herein without following the example embodiments and applications illustrated and described, and without departing from the true spirit and scope of the present invention, which is set forth in the following claims.