This application is a 35 U.S.C. § 371 National Phase Entry Application from PCT/US2019/066570 filed Dec. 16, 2019, designating the U.S., the disclosure of which is incorporated herein by reference in its entirety.
This document relates, generally, to amplitude-independent window sizes in audio encoding.
Audio processing remains an important aspect of today's technology environment. Digital assistants used in personal and professional situations to aid users in performing various tasks are trained to recognize speech so that they can detect users' cues and instructions. Speech recognition is also used to create a digitally accessible record of events where people are talking. In the rapidly growing world of virtual reality and/or augmented reality, audio processing provides the user a plausible auditory experience in order to best perceive and interact with a digital environment.
In an aspect of the present disclosure, there is provided a computer-implemented method. The method comprises receiving a first signal corresponding to a first flow of acoustic energy, applying a transform to the received first signal using at least a first amplitude-independent window size at a first frequency and a second amplitude-independent window size at a second frequency, the second amplitude-independent window size improving a temporal response at the second frequency, wherein the second frequency is subject to amplitude reduction due to a resonance phenomenon associated with the first frequency, and storing a first encoded signal, the first encoded signal based on applying the transform to the received first signal.
For example, the first frequency may be about 3 kHz, and the second frequency may be about 1.5 kHz or about 10 kHz. The first amplitude-independent window size may be about 18-30 ms (e.g., about 24 ms). The second amplitude-independent window size may be about 3-9 ms (e.g., about 6 ms).
The method may further comprise mapping the first amplitude-independent window size to the first frequency based on the first frequency being associated with energy integration in human hearing.
The method may further comprise mapping the second amplitude-independent window size to the second frequency based on the second frequency being associated with energy differentiation in the human hearing.
The first amplitude-independent window size may be applied for all frequencies of the received first signal except a band at the second frequency. The first amplitude-independent window size may be greater than the second amplitude-independent window size. The first amplitude-independent window size may be greater than the second amplitude-independent window size by an integer multiple. The first amplitude-independent window size may be about four times greater than the second amplitude-independent window size.
The method may further comprise using a third amplitude-independent window size in applying the transform to the first received signal, the third amplitude-independent window size used at a third frequency not associated with the resonance phenomenon, the third amplitude-independent window size different from the first and second amplitude-independent window sizes.
The third amplitude-independent window size may be smaller than the first amplitude-independent window size. The third amplitude-independent window size may be about half as large as the first amplitude-independent window size. The third amplitude-independent window size may be greater than the second amplitude-independent window size. The third amplitude-independent window size may be about twice as large as the second amplitude-independent window size.
Applying the transform using the first amplitude-independent window size at the first frequency may generate a first outcome, wherein applying the transform using the second amplitude-independent window size at the second frequency may generate a second outcome, the method further comprising storing the second outcome more frequently than storing the first outcome.
The method may further comprise storing the second outcome with less precision than the first outcome.
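The relationship between window size, storage frequency, and precision can be illustrated with a minimal sketch. It assumes non-overlapping windows and uses the example sizes of 24 ms and 6 ms; the function names and the quantization steps are hypothetical and are not taken from the disclosure.

```python
FIRST_WINDOW_MS = 24   # example first window size from the text
SECOND_WINDOW_MS = 6   # example second window size from the text


def outcomes_per_span(span_ms: int, window_ms: int) -> int:
    """Number of transform outcomes over a span, assuming non-overlapping windows."""
    return span_ms // window_ms


def quantize(value: float, step: float) -> float:
    """Round to the nearest multiple of step; a larger step means less precision."""
    return round(value / step) * step


# Hypothetical quantization steps: finer for the first outcome, coarser for the second.
FIRST_STEP = 0.01
SECOND_STEP = 0.25

first_count = outcomes_per_span(24, FIRST_WINDOW_MS)    # 1 outcome per 24 ms
second_count = outcomes_per_span(24, SECOND_WINDOW_MS)  # 4 outcomes per 24 ms

first_stored = quantize(0.337, FIRST_STEP)    # kept with finer precision
second_stored = quantize(0.337, SECOND_STEP)  # kept with coarser precision
```

Under these assumptions, the second outcome is stored four times as often as the first, and each stored second outcome carries fewer distinct levels.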
The method may further comprise using a third amplitude-independent window size in applying the transform at a third frequency, the third amplitude-independent window size improving a temporal response at the third frequency, the third frequency subject to amplitude reduction due to the resonance phenomenon associated with the first frequency.
The second and third frequencies may be positioned at opposite sides of the first frequency.
The third amplitude-independent window size may be about equal to the second amplitude-independent window size.
The second and third amplitude-independent window sizes may be smaller than the first amplitude-independent window size.
A first audio file may comprise the first encoded signal, and the method may further comprise receiving a second signal corresponding to a second flow of acoustic energy, applying the transform to the received second signal using at least the first amplitude-independent window size at the first frequency and the second amplitude-independent window size at the second frequency, storing a second encoded signal, the second encoded signal based on applying the transform to the received second signal, wherein a second audio file comprises the second encoded signal, and determining a difference between the first and second audio files.
Determining the difference may comprise playing the first and second audio files into a model of human hearing, the model including the resonance phenomenon.
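One way to picture such a comparison is sketched below. This is only an illustration under strong assumptions: the model of human hearing is replaced by a single two-pole resonator centered at about 3 kHz, and the difference is taken as a root-mean-square error after that filter. The function names, the pole radius, and the choice of filter are hypothetical and are not the model of the disclosure.

```python
import math


def resonator(signal, sample_rate_hz, center_hz=3000.0, pole_radius=0.95):
    """Two-pole resonator: a crude stand-in for a hearing model with a 3 kHz resonance."""
    w = 2.0 * math.pi * center_hz / sample_rate_hz
    a1 = 2.0 * pole_radius * math.cos(w)
    a2 = -(pole_radius ** 2)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def perceived_difference(first, second, sample_rate_hz):
    """RMS difference between two signals after both pass through the resonator."""
    ya = resonator(first, sample_rate_hz)
    yb = resonator(second, sample_rate_hz)
    n = min(len(ya), len(yb))
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(ya, yb)) / n)


SAMPLE_RATE_HZ = 16000  # assumed
reference = [0.0] * 800
# Two decoded versions with equally large errors at different frequencies.
error_at_3k = [0.1 * math.sin(2 * math.pi * 3000 * t / SAMPLE_RATE_HZ) for t in range(800)]
error_at_500 = [0.1 * math.sin(2 * math.pi * 500 * t / SAMPLE_RATE_HZ) for t in range(800)]

diff_3k = perceived_difference(reference, error_at_3k, SAMPLE_RATE_HZ)
diff_500 = perceived_difference(reference, error_at_500, SAMPLE_RATE_HZ)
```

In this simplified model, an error near the resonance frequency yields a larger difference than an equally large error far from it.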
In an aspect of the present disclosure there is provided a computer program product tangibly embodied in a non-transitory storage medium, the computer program product including instructions that when executed by a processor cause the processor to perform operations of any of the method steps described herein.
Optional features of one aspect may be combined with any other aspect.
Like reference symbols in the various drawings indicate like elements.
This document describes examples of audio processing using amplitude-independent window sizes. In some implementations, a relatively larger window size can be used in processing signals having a frequency that is associated with a resonance phenomenon in human ears. For example, the window size can be about two times as large as a window size used for another frequency. In some implementations, a relatively smaller window size can be used in processing signals having a frequency that is subject to amplitude reduction due to the resonance phenomenon. For example, the window size can be about two times smaller than a window size used for another frequency.
Prior to the resonance-enhanced encoder 106 encoding the signal from the sound sensors 102, one or more types of conditioning of the signal can be performed. In some implementations, the signal can be processed to generate a particular representation (e.g., according to a prespecified format). For example, the representation can be decomposed into respective channels of the sound from the sound sensors 102.
In the encoding, the resonance-enhanced encoder 106 can apply a transformation to the signal from the sound sensors 102. The transformation can involve applying two or more different window sizes to respective frequencies (or frequency bands) of the signal from the sound sensors 102. In some implementations, a window size is amplitude-independent, meaning that the window size is applied to the specific at least one frequency (band) regardless of the nature of that aspect of the signal. For example, the resonance-enhanced encoder 106 may not take into account whether the frequency (band) contains sustained levels of acoustic energy, and/or whether the frequency (band) contains any transients, such as a region of relatively short duration having a higher amplitude than surrounding portions of a waveform. The use of different window sizes can help address circumstances related to listening, including, but not limited to, acoustic characteristics such as resonance phenomena.
After encoding, the encoded signal can be stored, forwarded and/or transmitted to another location. For example, a channel 108 represents one or more ways that an encoded audio signal can be managed, such as by transmission to another system for playback.
If the audio of the encoded signal should be played, a decoding process can be performed. Such a decoding process can be performed by a resonance-enhanced decoder 110. For example, the resonance-enhanced decoder 110 can perform operations in essentially the opposite way as in the resonance-enhanced encoder 106. For example, an inverse transform can be performed in the decoding module that partially or completely restores a particular representation that was generated by the resonance-enhanced encoder 106. The resulting audio signals can be stored and/or played depending on the situation. For example, the system 100 can include two or more audio playback sources 112 (including, but not limited to, loudspeakers) to which the processed audio signal can be provided for playback.
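The difference between complete and partial restoration by an inverse transform can be sketched as follows. The example assumes an unnormalized DCT-II as the forward transform and its scaled DCT-III inverse; the rounding step that makes the restoration partial is a hypothetical choice, not a value from the disclosure.

```python
import math


def dct_ii(frame):
    """Forward transform (unnormalized DCT-II) of one window of samples."""
    n = len(frame)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(frame))
            for k in range(n)]


def inverse_dct_ii(coeffs):
    """Inverse of dct_ii above (a scaled DCT-III)."""
    n = len(coeffs)
    return [(coeffs[0] + 2.0 * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                                   for k in range(1, n))) / n
            for i in range(n)]


frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
coeffs = dct_ii(frame)

# Complete restoration: the decoder receives the coefficients unchanged.
restored = inverse_dct_ii(coeffs)

# Partial restoration: the coefficients were rounded (hypothetical step) before decoding.
STEP = 0.5
approx = inverse_dct_ii([round(c / STEP) * STEP for c in coeffs])
```

Here `restored` matches `frame` up to floating-point error, while `approx` only approximates it.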
The representation of the signal from the sound sensors 102 can be played out over headphones, and the system 100 can compute what should be rendered in the headphones. In some implementations, this can be applied in situations involving virtual reality (VR) and/or augmented reality (AR). In some implementations, the rendering can be dependent on how the user turns his or her head. For example, a sensor can be used that informs the system of the head orientation, and the system can then cause the person to hear the sound coming from a direction that is independent of the head orientation. As another example, the representation of the signal from the sound sensors 102 can be played out over a set of loudspeakers. That is, first the system 100 can store or transmit the description of the field of sound around the listener. At the resonance-enhanced decoder 110, a computation can then be made of what the individual speakers should produce to create the field of sound around the listener's head. That is, approaches exemplified herein can facilitate improved spatial decomposition of sound.
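The head-orientation example can be illustrated with a short sketch. It assumes that only the yaw angle of the head is tracked and that directions are expressed as azimuths in degrees; the function name and the angle convention are hypothetical.

```python
def rendered_azimuth_deg(source_azimuth_deg: float, head_yaw_deg: float) -> float:
    """Azimuth, relative to the head, at which to render a source fixed in the room.

    The result is wrapped into the range [-180, 180).
    """
    return (source_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


# A source fixed at 30 degrees in the room, rendered for three head orientations.
facing_forward = rendered_azimuth_deg(30.0, 0.0)    # 30.0
facing_source = rendered_azimuth_deg(30.0, 30.0)    # 0.0
turned_past = rendered_azimuth_deg(30.0, 90.0)      # -60.0
```

Subtracting the head yaw in this way keeps the perceived direction of the source fixed in the room while the head turns.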
People 204A-C are schematically illustrated as being in the physical space 200. The people symbols represent sources of any kind of sounds that the listener can hear. Such sounds can be generated by humans (e.g., speech, song or other utterances), by nature (e.g., wind, animals, or other natural phenomena), or by technology (e.g., machines, loudspeakers, or other human-made apparatuses). That is, the present subject matter relates to sound from one or more types of sources, whether the sounds are caused by humans or not. The locations of the people 204A-C around the circle 202 indicate that the listener represented by the circle 202 can perceive sounds from multiple separate directions. Here, each of the people 204A-C can be said to have associated with them a corresponding spatial profile 206A-C. The spatial profiles 206A-C signify the direction from which the listener can perceive the sound arriving. The spatial profiles 206A-C correspond to how the sound from different sound sources is captured: some of it arrives directly from the sound source, and other sound (generated simultaneously) first bounces on one or more surfaces before being perceived. That is, the sound(s) here represented by the person 204A can have the spatial profile 206A, the sound(s) here represented by the person 204B can have the spatial profile 206B, and the sound(s) here represented by the person 204C can have the spatial profile 206C.
In the context of a room, the notion of a spatial profile is a generalization of this illustrative example. There, the spatial profile includes both the direct path and all the reflective paths through which the sound of the source travels to reach the listener of the circle 202. In a different situation, such as when the physical space 200 is relatively free from structure or inhibits echoes and other acoustic reflections, the direct path of the acoustic energy can predominate at the circle 202. In some implementations, the term “direction” can be taken as having a generalized meaning and to be equivalent to a set of directions representing the direct path and all reflective paths. More or fewer spatial profiles than the spatial profiles 206A-C can occur in some implementations.
Different listeners represented by the circle 202 can have different abilities to spatially resolve arriving sound that has the respective spatial profiles 206A-C. A human, for example, may be able to identify ten, perhaps fifteen, sound sources in parallel based on their respective spatial profiles 206A-C. An apparatus, on the other hand (e.g., a computer-based system prior to the present subject matter), may be able to distinguish significantly fewer sound sources in parallel than the human listener. For example, prior computers have been able to distinguish fewer than three simultaneous sound sources (e.g., about two sound sources). This can give rise to limitations in the ability of audio equipment to perform spatial decomposition (e.g., in an AR/VR system). As such, using a computer-based system with an improved ability for spatial decomposition can allow the listener of the circle 202 to distinguish between more of the spatial profiles 206A-C.
Determining directionality of sound may be dependent on multiple factors, including, but not limited to, a temporal response. In some implementations, temporal response can signify a system's ability to temporally detect the beginning or ending of an acoustic phenomenon. For example, an improved temporal response corresponds to the system being better at pinpointing when a sound begins or ends. This applies to any kind of sound, both sustained levels of acoustic energy and transients.
Each of the input signals 302A-C can include any kind of audio signal content. In some implementations, the input signal 302A includes a waveform 304A. For example, the waveform 304A can be a relatively homogeneous group of waves that have similar or identical amplitude and have a frequency of about 1.5 kHz. In some implementations, the input signal 302B includes a waveform 304B. For example, the waveform 304B can be a relatively homogeneous group of waves that have similar or identical amplitude and have a frequency of about 3.0 kHz. In some implementations, the input signal 302C includes a waveform 304C. For example, the waveform 304C can be a relatively homogeneous group of waves that have similar or identical amplitude and have a frequency of about 10.0 kHz.
One or more acoustic phenomena can affect the perception of the input signals 302A-C. In some implementations, resonance can occur. For example, the human ear has a resonance at about 3 kHz that can be explained by elastoviscous properties of a membrane that is oscillating in the ear, and the interaction of hair cells on that membrane. This resonance phenomenon is common among all humans. The resonance can have certain impacts on how the human ear receives sound waves.
Beginning with the input signal 302B, this signal is at about the resonance frequency of 3.0 kHz and therefore the ear will receive a signal 306B that is affected by resonance. The resonance can cause an amplification of the input signal 302B. If the input signal 302B has a certain amplitude then the signal 306B can have an amplitude that is multiple times greater. For example, the amplitude of the signal 306B can be about double (e.g., an amplification by about +6 dB) the amplitude of the input signal 302B. The resonance can also cause a smearing of the time localization of transients at about the 3.0 kHz frequency. That is, the accumulation of energy associated with the resonance can integrate the signal energy over time. As such, the frequency 3.0 kHz can be associated with energy integration in human hearing. For example, this can blur the temporal characteristics of the transient and attenuate the transient (e.g., an attenuation by about a factor of 2). This blurring can make the transient more difficult to detect (e.g., the transient can be said to disappear). This can cause the transient sound to be heard for longer than it occurred (e.g., the transient can be smeared forward in time). For example, the signal 306B can include a waveform 308B that is multiple times longer (e.g., three times longer) than the waveform 304B.
Turning now to the input signals 302A and 302C, these signals are at about two frequencies (1.5 kHz and 10.0 kHz, respectively) that are also affected by the resonance in the human ear, and therefore the ear will receive signals 306A and 306C, respectively, that are also affected by the resonance. Particularly, the resonance can cause a reduction in the input signals 302A and 302C. If the input signal 302A has a certain amplitude then the signal 306A can have an amplitude that is multiple times smaller. For example, the amplitude of the signal 306A can be about half (e.g., a reduction by about −6 dB) of the amplitude of the input signal 302A. If the input signal 302C has a certain amplitude then the signal 306C can have an amplitude that is multiple times smaller. For example, the amplitude of the signal 306C can be about half (e.g., a reduction by about −6 dB) of the amplitude of the input signal 302C. A transient at about 1.5 and/or 10.0 kHz can become more temporally localized (e.g., sharpened in time). For example, the resonance at 3.0 kHz can work as a derivative filter by cancelling surrounding frequencies, making transients in these frequencies enhanced, but dampening the energy in sustained waves. This can allow for more quantization, but leaves less room for placing the transient. For example, the signal 306A can include a waveform 308A that is multiple times shorter (e.g., three times shorter) than the waveform 304A. As another example, the signal 306C can include a waveform 308C that is multiple times shorter (e.g., three times shorter) than the waveform 304C. As such, each of the frequencies 1.5 and 10.0 kHz can be associated with energy differentiation in human hearing.
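The decibel figures quoted for the doubling and halving of amplitude follow from the standard conversion of an amplitude ratio to decibels, 20·log10(ratio). A minimal sketch, with a hypothetical function name:

```python
import math


def amplitude_ratio_to_db(ratio: float) -> float:
    """Convert an amplitude ratio (output amplitude / input amplitude) to decibels."""
    return 20.0 * math.log10(ratio)


gain_at_resonance_db = amplitude_ratio_to_db(2.0)   # about +6.02 dB (amplitude doubled)
gain_off_resonance_db = amplitude_ratio_to_db(0.5)  # about -6.02 dB (amplitude halved)
```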
Applying aspects of the present subject matter can facilitate improved audio processing. For example, an audio compressor (e.g., as part of the resonance-enhanced encoder 106 in
The audio encoder 400 can include one or more transforms 406. The transform(s) 406 can convert an audio signal from a temporal domain to a frequency domain. The transform 406 can be performed on one or more ranges of time, sometimes referred to as the window(s) used for the transform 406. When sounds are developing slowly, it can be said that the larger the window (e.g., the greater the number of milliseconds (ms) transformed), the more that portion of the signal can be compressed. Sounds can sometimes be assumed to develop relatively slowly in a relevant frame of reference. For example, with speech the audio signal is produced by a column of air that is vibrating, such that at some given time the air will vibrate at least substantially as it was, say, 20 ms earlier. In this context, an integral transform can be used to obtain predictive characteristics of the vibration. Any transform relating to frequencies can be used, including, but not limited to, a Fourier transform or a cosine transform. In some implementations, a discrete variant of a transform can be used. For example, the discrete Fourier transform (DFT) can be implemented as the fast Fourier transform (FFT). As another example, the discrete cosine transform (DCT) can be used.
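The trade-off behind the choice of window can be made concrete with a small sketch. It assumes a sample rate of 48 kHz and a DFT, for which the spacing between frequency bins equals the sample rate divided by the number of samples in the window; the function name is hypothetical.

```python
def window_tradeoff(window_ms: float, sample_rate_hz: int = 48000):
    """Return (samples per window, DFT bin spacing in Hz) for a window size in ms."""
    samples = round(window_ms * sample_rate_hz / 1000.0)
    bin_spacing_hz = sample_rate_hz / samples
    return samples, bin_spacing_hz


for window_ms in (24.0, 12.0, 6.0):
    samples, spacing = window_tradeoff(window_ms)
    print(f"{window_ms:4.0f} ms -> {samples:4d} samples, {spacing:6.2f} Hz per bin")
```

A larger window gives finer frequency resolution but coarser time resolution, and a smaller window gives the reverse.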
The audio encoder 400 includes a mapping 408 between window size and frequency. The mapping 408 can be based on a resonance phenomenon in the human ear. In some implementations, the mapping 408 can associate a first window size with a frequency that is associated with energy integration in human hearing. For example, the frequency can be about 3.0 kHz (e.g., with a window size of about 18-30 ms, such as about 24 ms). In some implementations, the mapping 408 can associate a second window size with a frequency that is associated with energy differentiation in human hearing. For example, the frequency can be about 1.5 kHz and/or about 10.0 kHz (e.g., with a window size of about 3-9 ms, such as about 6 ms). In some implementations, the mapping 408 can associate a third window size with a frequency that is not associated with any particular acoustic phenomenon in human hearing (e.g., not associated with any resonance). For example, the frequency can be lower than about 1.0 kHz and/or greater than about 10.0 kHz (e.g., with a window size of about 6-18 ms, such as about 12 ms). The mapping 408 can effectuate associations between window sizes (e.g., in terms of size, such as ms) and frequency (e.g., in terms of one or more bands of frequencies) in any of multiple different ways. For example, the mapping 408 can include a lookup table to be used with one or more of the transforms 406. As another example, the mapping 408 can be integrated into one or more of the transforms 406 so as to automatically be applied to the transformation(s).
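A lookup-table form of such a mapping might look like the following sketch. Only the window sizes (24 ms, 6 ms, 12 ms) and the approximate center frequencies come from the description; the band edges, the sample rate, and all names are assumptions made for illustration.

```python
# Hypothetical lookup table in the spirit of the mapping 408.
# Band edges are illustrative assumptions, not values from the description.
BANDS = [
    # (low_hz, high_hz, window_ms)
    (1250.0, 1750.0, 6.0),    # around 1.5 kHz: energy differentiation
    (2500.0, 3500.0, 24.0),   # around 3.0 kHz: energy integration
    (9000.0, 11000.0, 6.0),   # around 10.0 kHz: energy differentiation
]
DEFAULT_WINDOW_MS = 12.0      # frequencies not associated with the resonance


def window_ms_for(frequency_hz: float) -> float:
    """Return the window size for a frequency; signal amplitude is never consulted."""
    for low_hz, high_hz, window_ms in BANDS:
        if low_hz <= frequency_hz < high_hz:
            return window_ms
    return DEFAULT_WINDOW_MS


def window_samples(frequency_hz: float, sample_rate_hz: int = 48000) -> int:
    """Convert the mapped window size from milliseconds to samples."""
    return round(window_ms_for(frequency_hz) * sample_rate_hz / 1000.0)
```

Because the lookup takes only a frequency as input, the resulting window size is amplitude-independent by construction.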
The encoder 400 is an example of an apparatus that can perform a method relating to improved coding. The method can include receiving a first signal (e.g., the signal 302B in
An encoder (e.g., the audio encoder 400 in
The following are examples of decoding. The transforms 600-1, 602-1, and 604 can be performed, of which the transforms 602-1 and 604 can be stored (e.g., in a memory, by the resonance-enhanced decoder 110 in
At 802, a signal can be received. The signal can be an audio signal that corresponds to a flow of energy. For example, the resonance-enhanced encoder 106 can receive a signal from the sound sensors 102 (
At 804, a transform can be applied to the received signal. In some implementations, the transform uses amplitude-independent window sizes. For example, DCT or FFT can be applied to any of the input signals 302A-C regardless of the amplitude of that signal. Different window sizes can be applied at different frequencies.
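The receiving and transform operations at 802 and 804 can be sketched as follows. This is a simplified illustration, not the encoder of the disclosure: it assumes a sample rate of 8 kHz, non-overlapping windows, an unnormalized DCT-II, and two hypothetical bands, and it keeps for each band only the transform bins whose frequencies fall inside that band.

```python
import math

SAMPLE_RATE_HZ = 8000  # assumed, kept low so the example runs quickly
# (low_hz, high_hz, window_ms): band edges are illustrative assumptions.
BANDS = [(1250, 1750, 6), (2500, 3500, 24)]


def dct_ii(frame):
    """Unnormalized DCT-II of one window of samples."""
    n = len(frame)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(frame))
            for k in range(n)]


def encode(signal):
    """Return {(low_hz, high_hz): one list of coefficients per window}."""
    encoded = {}
    for low_hz, high_hz, window_ms in BANDS:
        n = SAMPLE_RATE_HZ * window_ms // 1000  # window length in samples
        # DCT-II bin k is centered at k * sample_rate / (2 * n) Hz.
        bins = [k for k in range(n)
                if low_hz <= k * SAMPLE_RATE_HZ / (2 * n) < high_hz]
        outcomes = []
        for start in range(0, len(signal) - n + 1, n):
            coeffs = dct_ii(signal[start:start + n])
            outcomes.append([coeffs[k] for k in bins])
        encoded[(low_hz, high_hz)] = outcomes
    return encoded


# 24 ms of an arbitrary test signal.
signal = [math.sin(0.3 * t) for t in range(192)]
encoded = encode(signal)
```

The window length for each band is fixed by the band alone, so scaling the amplitude of `signal` changes the coefficient values but not the number or length of the windows.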
At 806, an encoded signal can be stored. For example, the resonance-enhanced encoder 106 (
The computing device illustrated in
The computing device 900 includes, in some embodiments, at least one processing device 902 (e.g., a processor), such as a central processing unit (CPU). A variety of processing devices are available from a variety of manufacturers, for example, Intel or Advanced Micro Devices. In this example, the computing device 900 also includes a system memory 904, and a system bus 914 that couples various system components including expansion ports 910 and the system memory 904 to the processing device 902 via an interface 908. The system bus 914 is one of any number of types of bus structures that can be used, including, but not limited to, a memory bus, or memory controller; a peripheral bus; and a local bus using any of a variety of bus architectures.
Examples of computing devices that can be implemented using the computing device 900 include a desktop computer, a laptop computer, a tablet computer, a mobile computing device (such as a smart phone, a touchpad mobile digital device, or other mobile devices), or other devices configured to process digital instructions.
The system memory 904 includes read only memory and random access memory. A basic input/output system containing the basic routines that act to transfer information within computing device 900, such as during start up, can be stored in the system memory 904.
The computing device 900 also includes a storage device 906 in some embodiments, such as a hard disk drive, for storing digital data. The storage device 906 is connected to the system bus 914 by a secondary storage interface. The storage device 906 and its associated computer readable media provide nonvolatile and non-transitory storage of computer readable instructions (including application programs and program modules), data structures, and other data for the computing device 900.
Although the example environment described herein employs a hard disk drive as a secondary storage device, other types of computer readable storage media are used in other embodiments. Examples of these other types of computer readable storage media include magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, compact disc read only memories, digital versatile disk read only memories, random access memories, or read only memories. Some embodiments include non-transitory media. For example, a computer program product can be tangibly embodied in a non-transitory storage medium. Additionally, such computer readable storage media can include local storage or cloud-based storage.
A number of program modules can be stored in the storage device 906 and/or system memory 904, including an operating system, one or more application programs, other program modules (such as the software engines described herein), and program data. The computing device 900 can utilize any suitable operating system, such as Microsoft Windows™, Google Chrome™ OS, Apple OS, Unix, or Linux and its variants, or any other operating system suitable for a computing device. Other examples can include Microsoft, Google, or Apple operating systems, or any other suitable operating system used in tablet computing devices.
In some embodiments, a user provides inputs to the computing device 900 through one or more input devices 926. Examples of input devices 926 include a keyboard, mouse, microphone (e.g., for voice and/or other audio input), touch sensor (such as a touchpad or touch sensitive display), and gesture sensor (e.g., for gestural input). In some implementations, the input device(s) 926 provide detection based on presence, proximity, and/or motion. In some implementations, a user may walk into their home, and this may trigger an input into a processing device. For example, the input device(s) 926 may then facilitate an automated experience for the user. Other embodiments include other input devices 926. The input devices can be connected to the processing device 902 through an input/output interface 912 that is coupled to the system bus 914. These input devices 926 can be connected by any number of input/output interfaces, such as a parallel port, serial port, game port, or a universal serial bus. Wireless communication between input devices 926 and the input/output interface 912 is possible as well, and includes infrared, BLUETOOTH® wireless technology, 802.11a/b/g/n, cellular, ultra-wideband (UWB), ZigBee, or other radio frequency communication systems in some possible embodiments, to name just a few examples.
In this example embodiment, a display device 916, such as a monitor, liquid crystal display device, projector, or touch sensitive display device, is also connected to a bus via an interface 908, such as a video adapter. In addition to the display device 916, the computing device 900 can include various other peripheral devices, such as speakers or a printer.
The computing device 900 can be connected to one or more networks through a network interface. The network interface can provide for wired and/or wireless communication. In some implementations, the network interface can include one or more antennas for transmitting and/or receiving wireless signals. When used in a local area networking environment or a wide area networking environment (such as the Internet), the network interface can include an Ethernet interface. Other possible embodiments use other communication devices. For example, some embodiments of the computing device 900 include a modem for communicating across the network.
The computing device 900 may be implemented as a standard server 920, a rack server system 924 or a laptop computer 922. A computing device 950 includes a processor 952, memory 964, a display 954, a communication interface 966, and a transceiver 968. The processor 952 may communicate with a control interface 958 and display interface 956 coupled to a display 954. The computing device 950 includes an external interface 962. Expansion memory 974 may also be provided and connected to device 950 through expansion interface 972. Computing device 950 may also communicate using audio codec 960.
The computing device 900, 950 can include at least some form of computer readable media. Computer readable media includes any available media that can be accessed by the computing device 900, 950. By way of example, computer readable media include computer readable storage media and computer readable communication media.
Computer readable storage media includes volatile and nonvolatile, removable and non-removable media implemented in any device configured to store information such as computer readable instructions, data structures, program modules or other data. Computer readable storage media includes, but is not limited to, random access memory, read only memory, electrically erasable programmable read only memory, flash memory or other memory technology, compact disc read only memory, digital versatile disks or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the computing device 900.
Computer readable communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” refers to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, computer readable communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
The computing device illustrated in
A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention.
In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the following claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/066570 | 12/16/2019 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2021/126155 | 6/24/2021 | WO | A |
Number | Date | Country | |
---|---|---|---|
20210233546 A1 | Jul 2021 | US |