This invention relates generally to the field of digital signal processing (DSP), audio engineering and audiology, and more specifically to systems and methods for providing personalized audio based on user hearing test results and based on specific audio content.
Traditional DSP sound personalization methods often rely on administration of an audiogram to parameterize a frequency gain compensation function. Typically, a pure tone threshold (PTT) hearing test is employed to identify frequencies in which a user exhibits raised hearing thresholds and the frequency output is modulated accordingly. These gain parameters are stored locally on the user's device for subsequent audio processing.
The use of frequency compensation is inadequate to the extent that solely applying a gain function to the audio signal does not sufficiently restore audibility. The gain may enable the user to recapture previously unheard frequencies, but the user may subsequently experience loudness discomfort. Listeners with sensorineural hearing loss typically have similar, or even reduced, discomfort thresholds when compared to normal hearing listeners, despite their hearing thresholds being raised. To this extent, their dynamic range is narrower and simply adding gain would be detrimental to their hearing health in the long run.
Although hearing loss typically begins at higher frequencies, listeners who are aware that they have hearing loss do not typically complain about the absence of high frequency sounds. Instead, they report difficulties listening in a noisy environment and in hearing out the details in a complex mixture of sounds, such as in an audio stream of a radio interview conducted in a busy street. In essence, off frequency sounds more readily mask information with energy in other frequencies for hearing-impaired (HI) individuals—music that was once clear and rich in detail becomes muddled. This is because music itself is highly self-masking, i.e. numerous sound sources have energy that overlaps in the frequency space, which can reduce outright detectability, or impede the users' ability to extract information from some of the sources.
As hearing deteriorates, the signal-conditioning capabilities of the ear begin to break down, and thus HI listeners need to expend more mental effort to make sense of sounds of interest in complex acoustic scenes (or miss the information entirely). A raised threshold in an audiogram is not merely a reduction in aural sensitivity, but a result of the malfunction of some deeper processes within the auditory system that have implications beyond the detection of faint sounds. To this extent, the addition of simple frequency gain provides an inadequate solution, and a multiband dynamic compression system would be better suited, as it more readily addresses the deficiencies of an impaired user.
Moreover, it is further inadequate to apply the same parameterized DSP algorithm to all types of audio content. Different forms of audio content require different DSP parameter settings as these systems aren't “one size fits all”. For example, the requirements for voice processing are different than for more complex audio streams, such as for movies or music. For example, users are more willing to accept more aggressive forms of compression for voice calls to improve speech clarity than they are for music. Likewise, in a movie, a user may want to fit a specific DSP somewhere in between that of pure speech and music—so that a balance is achieved between voice clarity and greater detail in background sound and music.
Accordingly, it is an aspect of the present disclosure to provide systems and methods for providing content-specific, personalized audio replay on consumer devices.
According to an aspect of the present disclosure, provided are systems and methods for providing content-specific, personalized audio replay on consumer devices. According to an aspect of the present disclosure, provided are methods and systems for processing an audio signal, the method comprising: generating a user hearing profile; calculating at least one set of audio content-specific DSP (digital signal processing) parameters for each of one or more sound personalization algorithms, the calculation of the content-specific DSP parameters based on at least the user hearing profile; associating one or more of the calculated sets of content-specific DSP parameters with a content-identifier for the specific content; in response to an audio stream on an audio output device, analyzing the audio stream to determine at least one content type of the audio stream; based on the at least one determined content type of the audio stream, outputting corresponding content-specific DSP parameters to the audio output device, wherein the corresponding content-specific DSP parameters are outputted based at least in part on their content-identifier; and processing, on the audio output device, an audio signal by using a given sound personalization algorithm parameterized by the corresponding content-specific DSP parameters.
In an aspect of the disclosure, the content-identifier further indicates the given sound personalization algorithm for which a given set of content-specific DSP parameters was calculated.
In a further aspect of the disclosure, the calculation of the content-specific DSP parameters further comprises applying a scaled processing level for the different types of specific content, wherein each scaled processing level is calculated based on one or more target age hearing curves that are different from a hearing curve of the user.
In a further aspect of the disclosure, the calculation of the content-specific DSP parameters further comprises calculating one or more wet mixing parameters and dry mixing parameters to optimize the content-specific DSP parameters for the different types of specific content.
In a further aspect of the disclosure, the calculation of the content-specific DSP parameters further comprises analyzing perceptually relevant information (PRI) to optimize a PRI value provided by the content-specific DSP parameters for the different types of specific content.
In a further aspect of the disclosure, the user hearing profile is generated by conducting at least one hearing test on the audio output device of a user.
In a further aspect of the disclosure, the hearing test is one or more of a masked threshold test (MT test), a pure tone threshold test (PTT test), a psychophysical tuning curve test (PTC test), or a cross frequency simultaneous masking test (xF-SM test).
In a further aspect of the disclosure, the user hearing profile is generated at least in part by analyzing a user input of demographic information to thereby interpolate a representative hearing profile.
In a further aspect of the disclosure, the user input of demographic information includes an age of the user.
In a further aspect of the disclosure, the sound personalization algorithm is a multiband dynamic processor; and the content-specific DSP parameters include one or more ratio values and gain values.
In a further aspect of the disclosure, the sound personalization algorithm is an equalization DSP; and the content-specific DSP parameters include one or more gain values and limiter values.
In a further aspect of the disclosure, the content type of the audio stream is determined by analyzing one or more metadata portions associated with the audio stream.
In a further aspect of the disclosure, the one or more metadata portions are extracted from the audio stream.
In a further aspect of the disclosure, the one or more metadata portions are calculated locally by an operating system of the audio output device.
In a further aspect of the disclosure, the content types include voice, video, music, and specific music genres.
In a further aspect of the disclosure, the audio output device is one of a mobile phone, a smart speaker, a television, headphones, or hearables.
In a further aspect of the disclosure, the at least one set of content-specific DSP parameters is stored on a remote server.
In a further aspect of the disclosure, the at least one set of content-specific DSP parameters is stored locally on the audio output device.
In a further aspect of the disclosure, the content type of the audio stream is determined by performing a Music Information Retrieval (MIR) calculation.
In a further aspect of the disclosure, the content type of the audio stream is determined by performing a spectral analysis calculation on the audio stream or providing the audio stream as input to a speech detection algorithm.
The term “sound personalization algorithm”, as used herein, is defined as any digital signal processing (DSP) algorithm that processes an audio signal to enhance the clarity of the signal to a listener. The DSP algorithm may be, for example: an equalizer, an audio processing function that works on the subband level of an audio signal, a multiband compressive system, or a non-linear audio processing algorithm.
The term “audio content type”, as used herein, is defined as any specific type of audio content in an audio stream, such as voice, video, music, or specific genres of music, such as rock, jazz, classical, pop, etc.
The term “audio output device”, as used herein, is defined as any device that outputs audio, including, but not limited to: mobile phones, computers, televisions, hearing aids, headphones, smart speakers, hearables, and/or speaker systems.
The term “headphone”, as used herein, is any earpiece bearing a transducer that outputs soundwaves into the ear. The headphone may be a wireless hearable, a corded or wireless headphone, a hearable device, or any pair of earbuds.
The term “hearing test”, as used herein, is any test that evaluates a user's hearing health, more specifically a hearing test administered using any transducer that outputs a sound wave. The test may be a threshold test or a suprathreshold test, including, but not limited to, a psychophysical tuning curve (PTC) test, a masked threshold (MT) test, a temporal fine structure test (TFS), temporal masking curve test and a speech in noise test.
The term “server”, as used herein, generally refers to a computer program or device that provides functionalities for other programs or devices.
In order to describe the manner in which the above-recited and other advantages and features of the disclosure can be obtained, a more particular description of the principles briefly described above will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. Understanding that these drawings depict only exemplary embodiments of the disclosure and are not therefore to be considered to be limiting of its scope, the principles herein are described and explained with additional specificity and detail through the use of the accompanying drawings in which:
Various embodiments of the disclosure are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the spirit and scope of the disclosure. Thus, the following description and drawings are illustrative and are not to be construed as limiting the scope of the embodiments described herein. Numerous specific details are described to provide a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be references to the same embodiment or any embodiment; and, such references mean at least one of the embodiments.
Reference to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others.
The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Alternative language and synonyms may be used for any one or more of the terms discussed herein, and no special significance should be placed upon whether or not a term is elaborated or discussed herein. In some cases, synonyms for certain terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any terms discussed herein is illustrative only and is not intended to further limit the scope and meaning of the disclosure or of any example term. Likewise, the disclosure is not limited to various embodiments given in this specification.
Without intent to limit the scope of the disclosure, examples of instruments, apparatus, methods and their related results according to the embodiments of the present disclosure are given below. Note that titles or subtitles may be used in the examples for convenience of a reader, which in no way should limit the scope of the disclosure. Unless otherwise defined, technical and scientific terms used herein have the meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, the present document, including definitions, will control.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be obvious from the description, or can be learned by practice of the herein disclosed principles. The features and advantages of the disclosure can be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the disclosure will become more fully apparent from the following description and appended claims or can be learned by the practice of the principles set forth herein.
Various example embodiments of the disclosure are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the spirit and scope of the present disclosure.
It is an aspect of the present disclosure to provide systems and methods for providing audio content-specific, personalized audio replay on consumer devices.
To this extent,
Other suprathreshold testing may be used. A cross frequency masked threshold test is illustrated in
Next, hearing test results are used to calculate 408 at least one set of audio content-specific DSP parameters (also referred to herein as “content-specific” DSP parameters) for at least one sound personalization algorithm. The calculated DSP parameters for a given sound personalization algorithm may include, but are not limited to: ratio, threshold and gain values within a multiband dynamic processor, gain and limiter values for equalization DSPs, and/or parameter values common to other sound personalization DSPs (see, e.g., commonly owned U.S. Pat. No. 10,199,047 and U.S. patent application Ser. No. 16/244,727, the contents of which are herein incorporated by reference in their entirety). One or more of the DSP parameter calculations may be performed directly or indirectly, as is explained below. The content specific parameters are then stored 409 on the audio output device and/or on a server database alongside a content identifier.
In some embodiments, when an audio stream is playing 410, the audio content type may be identified through metadata associated with the audio stream. For example, the metadata may be contained within the audio file itself or may be ascertained from the operating system of the audio output device. Various types of audio content may include: voice, video, music, or specific genres of music such as classical, rock, pop, jazz, etc. Alternatively or additionally, in some embodiments, other forms of audio signal analysis may be performed to identify audio content type, such as Music Information Retrieval (MIR), speech detection algorithms, or other forms of spectral analysis. After the audio content type has been identified or otherwise determined for the audio stream, content-specific DSP parameters are subsequently retrieved from the database using the audio content identifier as reference 412 and the parameters are then outputted to the device's sound personalization algorithm 413. The audio stream is then processed by the sound personalization algorithm 414.
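By way of non-limiting illustration, the retrieval described above can be sketched as a lookup keyed by a content identifier. The sketch below is an assumption-laden example, not the disclosed implementation: the names `DspParams`, `PARAM_STORE`, `content_type_from_metadata`, `classify_signal` and `select_params`, the `content_type` metadata field, and all numeric values are hypothetical. The identifier here pairs the algorithm name with the content type, and signal analysis is represented by a placeholder that would be replaced by MIR, speech detection or spectral analysis.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class DspParams:
    """One content-specific parameter set for a multiband dynamic processor."""
    thresholds_db: Tuple[float, ...]
    ratios: Tuple[float, ...]
    gains_db: Tuple[float, ...]


# Hypothetical store keyed by a content identifier (algorithm, content type).
# The numeric values are placeholders, not fitted parameters.
PARAM_STORE: Dict[Tuple[str, str], DspParams] = {
    ("multiband_drc", "voice"): DspParams((-40.0, -35.0), (3.0, 4.0), (6.0, 9.0)),
    ("multiband_drc", "music"): DspParams((-30.0, -25.0), (1.5, 2.0), (3.0, 4.0)),
}


def content_type_from_metadata(metadata: Dict[str, str]) -> Optional[str]:
    """Return the content type named in the stream metadata, or None if absent."""
    value = metadata.get("content_type", "").strip().lower()
    return value or None


def classify_signal(samples: Sequence[float]) -> str:
    """Placeholder for MIR, speech detection or spectral analysis."""
    return "music"


def select_params(algorithm: str, metadata: Dict[str, str],
                  samples: Sequence[float], default_type: str = "music") -> DspParams:
    """Determine the content type, then fetch the matching parameter set."""
    content_type = content_type_from_metadata(metadata) or classify_signal(samples)
    key = (algorithm, content_type)
    if key not in PARAM_STORE:
        # Unknown content type: fall back to a default parameter set.
        key = (algorithm, default_type)
    return PARAM_STORE[key]
```

In this sketch, metadata takes precedence and signal analysis is used only when no content type is declared; other orderings are equally possible.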
A wet/dry mixing approach may also be used, as seen in
Parallel compression provides the benefit of allowing the user to mix ‘dry’ unprocessed or slightly processed sound with ‘wet’ processed sound, enabling customization of processing based on subjective preference. For example, this enables hearing impaired users to use a higher ratio of heavily processed sound relative to users with moderate to low hearing loss. Furthermore, because it reduces the dynamic range of an audio signal by bringing up the softest sounds, rather than reducing the highest peaks, it provides audible detail to sound. The human ear is sensitive to loud sounds being suddenly reduced in volume, but less sensitive to soft sounds being increased in volume, and this mixing method takes advantage of this observation, resulting in a more natural sounding reduction in dynamic range compared with using a dynamic range compressor in isolation. Additionally, parallel compression is particularly useful for speech comprehension and/or for listening to music with full, original timbre. Mixing two different signal pathways requires that the signals in the pathways be phase-linear, or be brought into identical phase using phase distortion, or that the pathway mixing modulator include a phase correction network, in order to prevent phase cancellation when the correlated signals are summed to provide an audio signal to the control output.
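As a minimal, non-limiting sketch of the wet/dry mixing described above, the following example applies a static (memoryless) compressor to produce the wet path and sums it with the dry path. The function names `compress_sample` and `parallel_compress` and all parameter values are assumptions made for illustration. Attack and release smoothing and band splitting are omitted; because both paths here are sample-aligned, no phase correction is needed, whereas a processor with filter banks would need to align them before summing.

```python
import math
from typing import List, Sequence


def compress_sample(x: float, threshold_db: float, ratio: float,
                    makeup_db: float = 0.0) -> float:
    """Static downward compression of one sample (no attack/release smoothing)."""
    if x == 0.0:
        return 0.0
    level_db = 20.0 * math.log10(abs(x))
    if level_db > threshold_db:
        # Above threshold, the level rises by only 1/ratio dB per input dB.
        level_db = threshold_db + (level_db - threshold_db) / ratio
    level_db += makeup_db
    # Restore the sign so the wet path stays in phase with the dry path.
    return math.copysign(10.0 ** (level_db / 20.0), x)


def parallel_compress(signal: Sequence[float], threshold_db: float, ratio: float,
                      makeup_db: float, wet: float, dry: float) -> List[float]:
    """Sum the unprocessed (dry) path with the compressed (wet) path."""
    return [dry * x + wet * compress_sample(x, threshold_db, ratio, makeup_db)
            for x in signal]
```

With make-up gain on the wet path, soft samples receive more overall gain than loud samples, so the dynamic range is reduced mainly by raising quiet content. The `wet` and `dry` weights correspond to the mixing parameters that may be set per content type.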
A PRI optimization approach may also be employed, see
PRI can be calculated according to a variety of methods. One such method, also called perceptual entropy, was developed by James D. Johnston at Bell Labs, generally comprising: transforming a sampled window of audio signal into the frequency domain, obtaining masking thresholds using psychoacoustic rules by performing critical band analysis, determining noise-like or tone-like regions of the audio signal, applying thresholding rules for the signal and then accounting for absolute hearing thresholds. Following this, the number of bits required to quantize the spectrum without introducing perceptible quantization error is determined. For instance, Painter & Spanias disclose a formulation for perceptual entropy in units of bits/s, which is closely related to ISO/IEC MPEG-1 psychoacoustic model 2 [Painter & Spanias, Perceptual Coding of Digital Audio, Proc. Of IEEE, Vol. 88, No. 4 (2000); see also generally Moving Picture Expert Group standards https://mpeg.chiariglione.org/standards; both documents included by reference].
Various optimization methods are possible to maximize the PRI of audio samples, depending on the type of the applied audio processing function such as the above-mentioned multiband dynamics processor. For example, a subband dynamic compressor may be parameterized by compression threshold, attack time, gain and compression ratio for each subband, and these parameters may be determined by the optimization process. In some cases, the effect of the multiband dynamics processor on the audio signal is nonlinear and an appropriate optimization technique such as gradient descent is required. The number of parameters that need to be determined may become large, e.g. if the audio signal is processed in many subbands and a plurality of parameters needs to be determined for each subband. In such cases, it may not be practicable to optimize all parameters simultaneously and a sequential approach for parameter optimization may be applied. Although sequential optimization procedures do not necessarily result in the optimum parameters, the obtained parameter values result in increased PRI over the unprocessed audio sample, thereby improving the listener's listening experience.
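One possible form of the sequential approach is a coordinate-wise search, sketched below under stated assumptions. The function `sequential_optimize` is hypothetical, and `toy_pri` is a made-up stand-in objective used only so the example runs; it is not a PRI or perceptual entropy calculation. In practice the objective would process an audio sample with the candidate parameters and return the resulting PRI.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_optimize(
    objective: Callable[[List[float]], float],
    initial: Sequence[float],
    grids: Sequence[Sequence[float]],
    sweeps: int = 2,
) -> Tuple[List[float], float]:
    """Optimize one parameter at a time while holding the others fixed."""
    params = list(initial)
    best = objective(params)
    for _ in range(sweeps):
        for i, grid in enumerate(grids):
            for candidate in grid:
                trial = params[:i] + [candidate] + params[i + 1:]
                score = objective(trial)
                # Accept a candidate only if it strictly improves the objective,
                # so the result is never worse than the starting point.
                if score > best:
                    params, best = trial, score
    return params, best


def toy_pri(params: List[float]) -> float:
    """Stand-in objective (NOT a real PRI measure), peaking at ratios (2.0, 3.0)."""
    ratio_low, ratio_high = params
    return -((ratio_low - 2.0) ** 2) - ((ratio_high - 3.0) ** 2)
```

Because each step keeps a change only when the objective increases, the returned parameters score at least as well as the initial ones, although, as noted above, they need not be the global optimum.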
Other parameterization processes commonly known in the art may be used to calculate parameters based off user-generated threshold and suprathreshold information. For instance, common prescription techniques for linear and non-linear DSP may be employed. Well known procedures for linear hearing aid algorithms include POGO, NAL, and DSL. See, e.g., H. Dillon, Hearing Aids, 2nd Edition, Boomerang Press, 2012.
Fine tuning of any of the above-mentioned techniques may be estimated from manual fitting data. For instance, it is common in the art to fit a multiband dynamic processor according to a series of subjective tests 704 given to a patient in which parameters are adjusted according to the patient's responses, e.g. a series of A/B tests, decision tree paradigms, or a 2D exploratory interface, in which the patient is asked which set of parameters subjectively sounds better. This testing ultimately guides the optimal parameterization of the DSP.
The parameters of the multi-band compression system in a frequency band are threshold 1111 and ratio 1112. These two parameters are determined from the user masking contour curve 1106 for the listener and the target masking contour curve 1107. The threshold 1111 and ratio 1112 must satisfy the condition that the signal-to-noise ratio (SNR) 1121 of the user masking contour curve 1106 at a given frequency 1109 is greater than the SNR 1122 of the target masking contour curve 1107 at the same given frequency 1109. Note that the SNR is herein defined as the level of the signal tone compared to the level of the masker noise. The broader the curve, the greater the SNR. The given frequency 1109 at which the SNRs 1121 and 1122 are calculated may be arbitrarily chosen, for example, to be beyond a minimum distance from the probe tone frequency 1108.
The sound level 1130 (in dB) of the target masking contour curve 1107 at a given frequency corresponds (see bent arrow 1131 in
In the context of the present disclosure, a masking contour curve is obtained from a user hearing test. A target masking contour curve 1107 is interpolated from at least the user masking contour curve 1106 and a reference masking contour curve, representing the curve of a normal hearing individual. The target masking contour curve 1107 is preferred over a reference curve because fitting an audio signal to a reference curve is not necessarily optimal. Depending on the initial hearing ability of the listener, fitting the processing according to a reference curve may cause an excess of processing to spoil the quality of the signal. The objective is to process the signal in order to obtain a good balance between an objective benefit and a good sound quality.
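The interpolation of a target masking contour curve can be illustrated with a short sketch. The disclosure does not fix the interpolation scheme; the example below assumes a point-wise linear blend, and the function name `target_contour`, the weighting factor `alpha`, and the sample levels are hypothetical.

```python
from typing import List, Sequence


def target_contour(user_db: Sequence[float], reference_db: Sequence[float],
                   alpha: float) -> List[float]:
    """Blend the user curve toward the reference curve, point by point.

    alpha = 0.0 returns the user curve (no processing target);
    alpha = 1.0 returns the reference (normal hearing) curve.
    Both curves are assumed to be sampled at the same frequencies.
    """
    if len(user_db) != len(reference_db):
        raise ValueError("curves must be sampled at the same frequencies")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * u + alpha * r for u, r in zip(user_db, reference_db)]
```

An intermediate `alpha` yields a target between the listener's curve and the reference curve, which reflects the stated aim of balancing objective benefit against sound quality rather than fitting fully to the reference.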
The given frequency 1109 is then chosen. It may be chosen arbitrarily, e.g., at a certain distance from the tone frequency 1108. The corresponding sound levels of the listener and target masking contour curves are determined at this given frequency 1109. The value of these sound levels may be determined graphically on the y-axis 1102.
The right panel in
In some embodiments, content-specific DSP parameter sets may be calculated indirectly from a user hearing test based on preexisting entries or anchor points in a server database. An anchor point comprises a typical hearing profile constructed based at least in part on demographic information, such as age and sex, in which DSP parameter sets are calculated and stored on the server to serve as reference markers. Indirect calculation of DSP parameter sets bypasses direct parameter sets calculation by finding the closest matching hearing profile(s) and importing (or interpolating) those values for the user.
$$\sqrt{(\mathrm{d5a}-\mathrm{d1a})^2+(\mathrm{d6b}-\mathrm{d2b})^2+\dots} < \sqrt{(\mathrm{d5a}-\mathrm{d9a})^2+(\mathrm{d6b}-\mathrm{d10b})^2+\dots}$$
$$\sqrt{(y_1-x_1)^2+(y_2-x_2)^2+\dots} < \sqrt{(y_1-z_1)^2+(y_2-z_2)^2+\dots}$$
As would be appreciated by one of ordinary skill in the art, other methods may be used to quantify similarity amongst user hearing profile graphs, where the other methods can include, but are not limited to, methods such as Euclidean distance measurements, e.g. $(y_1-x_1)+(y_2-x_2)+\dots < (y_1-z_1)+(y_2-z_2)+\dots$, or other statistical methods known in the art. For indirect DSP parameter set calculation, then, the closest matching hearing profile(s) between a user and other preexisting database entries or anchor points can then be used.
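The indirect calculation can be sketched as a nearest-neighbor search followed by interpolation. The example below is illustrative only: the function names, the anchor labels, the profile values and the ratio values are hypothetical, and an unweighted mean is assumed as the interpolation rule.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two hearing profiles sampled at the same points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest_anchors(user_profile: Sequence[float],
                    anchors: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the names of the k stored profiles closest to the user's profile."""
    ranked = sorted(anchors, key=lambda name: euclidean(user_profile, anchors[name]))
    return ranked[:k]


def interpolate_param(names: Sequence[str], values: Dict[str, float]) -> float:
    """Linearly interpolate a DSP parameter as the mean over the chosen anchors."""
    return sum(values[name] for name in names) / len(names)


# Hypothetical database entries: hearing profile points and a stored DRC ratio.
ANCHOR_PROFILES = {"a": [10.0, 20.0, 30.0], "b": [12.0, 22.0, 33.0], "c": [40.0, 50.0, 60.0]}
ANCHOR_RATIOS = {"a": 0.7, "b": 0.8, "c": 0.5}
```

A user profile of `[11.0, 21.0, 31.0]` lies closest to anchors `a` and `b`, so the ratio would be interpolated from those two entries. Distance-weighted interpolation is an equally plausible alternative to the unweighted mean used here.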
DSP parameter sets may be interpolated linearly, e.g., a DRC ratio value of 0.7 for user 5 (u_id)5 and 0.8 for user 3 (u_id)3 would be interpolated as 0.75 for user 200 (u_id)200 in the example of
In some embodiments computing system 1600 is a distributed system in which the functions described in this disclosure can be distributed within a datacenter, multiple datacenters, a peer network, etc. In some embodiments, one or more of the described system components represents many such components each performing some or all of the function for which the component is described. In some embodiments, the components can be physical or virtual devices.
Example system 1600 includes at least one processing unit (CPU or processor) 1610 and connection 1605 that couples various system components including system memory 1615, such as read only memory (ROM) 1620 and random access memory (RAM) 1625 to processor 1610. Computing system 1600 can include a cache of high-speed memory 1612 connected directly with, in close proximity to, or integrated as part of processor 1610.
Processor 1610 can include any general-purpose processor and a hardware service or software service, such as services 1632, 1634, and 1636 stored in storage device 1630, configured to control processor 1610 as well as a special-purpose processor where software instructions are incorporated into the actual processor design. Processor 1610 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc. A multi-core processor may be symmetric or asymmetric.
To enable user interaction, computing system 1600 includes an input device 1645, which can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech, etc. Computing system 1600 can also include output device 1635, which can be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems can enable a user to provide multiple types of input/output to communicate with computing system 1600. Computing system 1600 can include communications interface 1640, which can generally govern and manage the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
Storage device 1630 can be a non-volatile memory device and can be a hard disk or other types of computer readable media which can store data that are accessible by a computer, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random access memories (RAMs), read only memory (ROM), and/or some combination of these devices.
The storage device 1630 can include software services, servers, services, etc., such that when the code that defines such software is executed by the processor 1610, it causes the system to perform a function. In some embodiments, a hardware service that performs a particular function can include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 1610, connection 1605, output device 1635, etc., to carry out the function.
The presented technology offers an efficient and accurate way to personalize audio replay automatically for a variety of audio content types. It is to be understood that the present disclosure contemplates numerous variations, options, and alternatives. For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks including functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software.
Methods according to the above-described examples can be implemented using computer-executable instructions that are stored or otherwise available from computer readable media. Such instructions can comprise, for example, instructions and data which cause or otherwise configure a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Portions of computer resources used can be accessible over a network. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.
Devices implementing methods according to these disclosures can comprise hardware, firmware and/or software, and can take any of a variety of form factors. Typical examples of such form factors include laptops, smart phones, small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. Functionality described herein also can be embodied in peripherals or add-in cards. Such functionality can also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example. The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are means for providing the functions described in these disclosures.
Although a variety of examples and other information was used to explain aspects within the scope of the appended claims, no limitation of the claims should be implied based on particular features or arrangements in such examples, as one of ordinary skill would be able to use these examples to derive a wide variety of implementations. Further and although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, such functionality can be distributed differently or performed in components other than those identified herein. Rather, the described features and steps are disclosed as examples of components of systems and methods within the scope of the appended claims.