The present technology relates to systems and methods of generating an audio image. In particular, the systems and methods allow generating an audio image for use in rendering audio to a listener.
Humans have only two ears, but can nonetheless locate sounds in three dimensions. The brain, inner ears, and external ears work together to infer locations of audio sources. In order for a listener to localize sound in three dimensions, the sound must perceptually arrive from a specific azimuth, elevation and distance. The brain of the listener estimates the source location of an audio source by comparing first cues perceived by a first ear to second cues perceived by a second ear to derive difference cues based on time of arrival, intensity and spectral differences. The brain may then rely on the difference cues to locate the specific azimuth, elevation and distance of the audio source.
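By way of a non-limiting illustration, one of the difference cues mentioned above, the interaural time difference, may be approximated with a simple spherical-head model (the Woodworth formula). The head radius and speed-of-sound figures below are typical values used for illustration only and are not parameters of the present technology:

```python
import math

def interaural_time_difference(azimuth_deg, head_radius=0.0875, speed_of_sound=343.0):
    """Woodworth spherical-head approximation of the interaural time
    difference (in seconds) for a distant source at the given azimuth
    (0 degrees = straight ahead, 90 degrees = directly to one side)."""
    theta = math.radians(azimuth_deg)
    return (head_radius / speed_of_sound) * (theta + math.sin(theta))

# The delay cue grows as the audio source moves toward the side of the head;
# at 90 degrees it reaches roughly 0.65 ms for a typical head.
itd_side = interaural_time_difference(90.0)
```

The brain combines this time-of-arrival cue with intensity and spectral cues to resolve azimuth, elevation and distance.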
From the phonograph developed by Edison and described in U.S. Patent 200,521 to the most recent developments in spatial audio, audio professionals and engineers have dedicated tremendous efforts to try to reproduce reality as we hear it and feel it in real life. This objective has become even more prevalent with the recent developments in virtual and augmented reality, as audio plays a critical role in providing an immersive experience to a user. As a result, the field of spatial audio has gained a lot of attention over the last few years. Recent developments in spatial audio mainly focus on improving how the source location of an audio source may be captured and/or reproduced. Such developments typically involve virtually positioning and/or displacing audio sources anywhere in a virtual three-dimensional space, comprising behind, in front of, on the sides of, above and/or below the listener.
Examples of recent developments in perception of locations and movements of audio sources comprise technologies such as (1) Dolby Atmos® from Dolby Laboratories, mostly dedicated to commercial and/or home theaters, and (2) Two Big Ears® from Facebook (also referred to as Facebook 360®), mostly dedicated to creation of audio content to be played back on headphones and/or loudspeakers. As a first example, Dolby Atmos® technology allows numerous audio tracks to be associated with spatial audio description metadata (such as location and/or pan automation data) and to be distributed to theaters for optimal, dynamic rendering to loudspeakers based on the theater capabilities. As a second example, Two Big Ears® technology comprises software suites (such as the Facebook 360 Spatial Workstation) for designing spatial audio for 360 video and/or virtual reality (VR) and/or augmented reality (AR) content. The 360 video and/or the VR and/or the AR content may then be dynamically rendered on headphones or VR/AR headsets.
Existing technologies typically rely on spatial domain convolution of sound waves using head-related transfer functions (HRTFs) to transform sound waves so as to mimic natural sound waves which emanate from a point in three-dimensional space. Such techniques allow, within certain limits, tricking the brain of the listener into perceiving different sound sources placed at different three-dimensional locations upon hearing audio streams, even though the audio streams are produced from only two speakers (such as headphones or loudspeakers). Examples of systems and methods of spatial audio enhancement using HRTFs may be found in U.S. Patent Publication 2014/0270281 to Creative Technology Ltd, International Patent Publication WO 2014/159376 to Dolby Laboratories Inc. and International Patent Publication WO 2015/134658 to Dolby Laboratories Licensing Corporation.
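By way of a non-limiting illustration, the HRTF-based approach described above amounts to convolving a mono signal with a left-ear and a right-ear head-related impulse response (HRIR). The sketch below shows the direct time-domain form of such a convolution; the function names are illustrative only and the quadratic implementation is for clarity, not efficiency:

```python
def convolve(signal, impulse_response):
    """Direct-form time-domain convolution (illustrative, O(n*m))."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Render a mono signal as a stereo pair by convolving it with
    left-ear and right-ear head-related impulse responses (HRIRs)."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)
```

For example, an identity impulse response `[1.0]` passes the signal through unchanged, while `[0.0, 1.0]` delays it by one sample, which is the mechanism by which an interaural delay cue can be imprinted on one ear's channel.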
Even though current technologies, such as the ones detailed above, may bring a listener a step closer to an immersive experience, they still present at least certain deficiencies. First, current technologies may present certain limits in tricking the brain of the listener into perceiving sound sources placed and displaced in three-dimensional locations. These limits result in a less immersive experience and/or a lower quality of audio compared to what the listener would have experienced in real life. Second, at least some current technologies require complex software and/or hardware components to operate conventional HRTF simulation software. As audio content is increasingly being played back through mobile devices (e.g., smart phones, tablets, laptop computers, headphones, VR headsets, AR headsets), complex software and/or hardware components may not always be appropriate, as they require substantial processing power that mobile devices, being usually lightweight, compact and low-powered, may not have.
Improvements may therefore be desirable.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches.
Embodiments of the present technology have been developed based on developers' appreciation of shortcomings associated with the prior art.
In particular, such shortcomings may comprise (1) a limited quality of an immersive experience, (2) a limited ability to naturally render audio content to a listener and/or (3) a required processing power of a device used to produce spatial audio content and/or play-back spatial audio content to a listener.
In one aspect, various implementations of the present technology provide a method of generating an audio image for use in rendering audio, the method comprising:
In another aspect, various implementations of the present technology provide a method of generating an audio image for use in rendering audio, the method comprising:
In yet another aspect, various implementations of the present technology provide a method of generating a volumetric audio image for use in rendering audio, the method comprising:
In another aspect, various implementations of the present technology provide a method of generating an audio image for use in rendering audio, the method comprising:
In yet another aspect, various implementations of the present technology provide a system for rendering audio output, the system comprising:
a sound-field positioner, the sound-field positioner being configured to:
In another aspect, various implementations of the present technology provide a system for generating an audio image file, the system comprising:
In yet another aspect, various implementations of the present technology provide a method of filtering an audio stream, the method comprising:
In another aspect, various implementations of the present technology provide a system for generating an audio image, the system comprising:
In yet another aspect, various implementations of the present technology provide a system for generating an audio image, the system comprising:
In another aspect, various implementations of the present technology provide a system for generating a volumetric audio image, the system comprising:
In yet another aspect, various implementations of the present technology provide a system for generating an audio image, the system comprising:
In another aspect, various implementations of the present technology provide a system for filtering an audio stream, the system comprising:
In yet another aspect, various implementations of the present technology provide a non-transitory computer readable medium comprising control logic which, upon execution by a processor, causes:
In another aspect, various implementations of the present technology provide a method of generating an audio image for use in rendering audio, the method comprising:
In other aspects, convolving the audio stream with the first positional impulse response, convolving the audio stream with the second positional impulse response and convolving the audio stream with the third positional impulse response are executed in parallel.
In other aspects, various implementations of the present technology provide a non-transitory computer-readable medium storing program instructions for generating an audio image, the program instructions being executable by a processor of a computer-based system to carry out one or more of the above-recited methods.
In other aspects, various implementations of the present technology provide a computer-based system, such as, for example, but without being limitative, an electronic device comprising at least one processor and a memory storing program instructions for generating an audio image, the program instructions being executable by the at least one processor of the electronic device to carry out one or more of the above-recited methods.
In the context of the present specification, unless expressly provided otherwise, a computer system may refer, but is not limited to, an “electronic device”, a “mobile device”, an “audio processing device”, “headphones”, a “headset”, a “VR headset device”, an “AR headset device”, a “system”, a “computer-based system” and/or any combination thereof appropriate to the relevant task at hand.
In the context of the present specification, unless expressly provided otherwise, the expressions “computer-readable medium” and “memory” are intended to include media of any nature and kind whatsoever, non-limiting examples of which include RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard disk drives, etc.), USB keys, flash memory cards, solid-state drives, and tape drives. Still in the context of the present specification, “a” computer-readable medium and “the” computer-readable medium should not be construed as being the same computer-readable medium. To the contrary, and whenever appropriate, “a” computer-readable medium and “the” computer-readable medium may also be construed as a first computer-readable medium and a second computer-readable medium.
In the context of the present specification, unless expressly provided otherwise, the words “first”, “second”, “third”, etc. have been used as adjectives only for the purpose of allowing for distinction between the nouns that they modify from one another, and not for the purpose of describing any particular relationship between those nouns.
Implementations of the present technology each have at least one of the above-mentioned objects and/or aspects, but do not necessarily have all of them. It should be understood that some aspects of the present technology that have resulted from attempting to attain the above-mentioned objects may not satisfy these objects and/or may satisfy other objects not specifically recited herein.
Additional and/or alternative features, aspects and advantages of implementations of the present technology will become apparent from the following description, the accompanying drawings and the appended claims.
For a better understanding of the present technology, as well as other aspects and further features thereof, reference is made to the following description which is to be used in conjunction with the accompanying drawings, where:
It should also be noted that, unless otherwise explicitly specified herein, the drawings are not to scale.
The examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the present technology and not to limit its scope to such specifically recited examples and conditions. It will be appreciated that those skilled in the art may devise various arrangements which, although not explicitly described or shown herein, nonetheless embody the principles of the present technology and are included within its spirit and scope.
Furthermore, as an aid to understanding, the following description may describe relatively simplified implementations of the present technology. As persons skilled in the art would understand, various implementations of the present technology may be of a greater complexity.
In some cases, what are believed to be helpful examples of modifications to the present technology may also be set forth. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and a person skilled in the art may make other modifications while nonetheless remaining within the scope of the present technology. Further, where no examples of modifications have been set forth, it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology.
Moreover, all statements herein reciting principles, aspects, and implementations of the present technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof, whether they are currently known or developed in the future. Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the present technology. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like represent various processes which may be substantially represented in computer-readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures, including any functional block labeled as a “processor”, a “controller”, an “encoder”, a “sound-field positioner”, a “renderer”, a “decoder”, a “filter”, a “localisation convolution engine”, a “mixer” or a “dynamic processor” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In some embodiments of the present technology, the processor may be a general purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a digital signal processor (DSP). Moreover, explicit use of the term “processor”, “controller”, “encoder”, “sound-field positioner”, “renderer”, “decoder”, “filter”, “localisation convolution engine”, “mixer” or “dynamic processor” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.
Software modules, or simply modules which are implied to be software, may be represented herein as any combination of flowchart elements or other elements indicating performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown. Moreover, it should be understood that a module may include, for example, but without being limitative, computer program logic, computer program instructions, software, stack, firmware, hardware circuitry or a combination thereof which provides the required capabilities.
Throughout the present disclosure, reference is made to audio image, audio stream, positional impulse response and virtual wave front. It should be understood that such reference is made for the purpose of illustration and is intended to be exemplary of the present technology.
Audio image: an audio signal or a combination of audio signals generated in such a way that, upon being listened to by a listener, a perception of a volumetric audio envelope similar to what the listener would experience in real life is recreated. While conventional audio systems, such as headphones, deliver an audio experience which is limited to being perceived between the listener's ears, an audio image, upon being rendered to the listener, may be perceived as a sound experience expanded to be outside and/or surrounding the head of the listener. This results in a more vibrant, compelling and life-like experience for the listener. In some embodiments, an audio image may be referred to as a holographic audio image and/or a three-dimensional audio image so as to convey a notion of volumetric envelope to be experienced by the listener. In some embodiments, the audio image may be defined by a combination of at least three virtual wave fronts. In some embodiments, the audio image may be defined by a combination of at least three virtual wave fronts generated from an audio stream.
Audio stream: a stream of audio information which may comprise one or more audio channels. An audio stream may be embedded as a digital audio signal or an analog audio signal. In some embodiments, the audio stream may take the form of a computer audio file of a predefined size (e.g., in duration) or a continuous stream of audio information (e.g., a continuous stream streamed from an audio source). As an example, the audio stream may take the form of an uncompressed audio file (e.g., a “.wav” file) or of a compressed audio file (e.g., an “.mp3” file). In some embodiments, the audio stream may comprise a single audio channel (i.e., a mono audio stream). In some other embodiments, the audio stream may comprise two audio channels (i.e., a stereo audio stream) or more than two audio channels (e.g., a 5.1 audio format, a 7.1 audio format, MPEG multichannel, etc.).
Positional impulse response: an output of a dynamic system when presented with a brief input signal (i.e., the impulse). In some embodiments, an impulse response describes a reaction of a system (e.g., an acoustic space) in response to some external change. In some embodiments, the impulse response enables capturing one or more characteristics of an acoustic space. In some embodiments of the present technology, impulse responses are associated with corresponding positions of an acoustic space, hence the name “positional impulse response”, which may also be referred to as “PIR”. Such acoustic space may be a real-life space (e.g., a small recording room, a large concert hall) or a virtual space (e.g., an acoustic sphere to be “recreated” around a head of a listener). The positional impulse responses may define a package or a set of positional impulse responses defining acoustic characteristics of the acoustic space. In some embodiments, the positional impulse responses are associated with equipment that passes a signal. The number of positional impulse responses may vary and is not limitative. The positional impulse responses may take multiple forms, for example, but without being limitative, a signal in the time domain or a signal in the frequency domain. In some embodiments, positions of each one of the positional impulse responses may be modified in real-time (e.g., based on commands of a real-time controller) or according to predefined settings (e.g., settings embedded in control data). In some embodiments, the positional impulse responses may be convolved with an audio signal and/or an audio stream.
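By way of a non-limiting illustration, a positional impulse response may be represented as an impulse response paired with a controllable position. The (azimuth, elevation, distance) representation and the field names below are one possible choice, not a format mandated by the present technology:

```python
from dataclasses import dataclass

@dataclass
class PositionalImpulseResponse:
    """An impulse response associated with a position in an acoustic space.
    The (azimuth, elevation, distance) representation is illustrative."""
    azimuth: float      # degrees
    elevation: float    # degrees
    distance: float     # metres
    samples: list       # time-domain impulse response samples

    def move_to(self, azimuth, elevation, distance):
        """Reposition the PIR, e.g. on commands from a real-time controller."""
        self.azimuth, self.elevation, self.distance = azimuth, elevation, distance

# A minimal set of PIRs characterizing a virtual acoustic space around a listener:
pir_set = [
    PositionalImpulseResponse(-30.0, 0.0, 1.0, [1.0, 0.3]),
    PositionalImpulseResponse(30.0, 0.0, 1.0, [1.0, 0.3]),
    PositionalImpulseResponse(0.0, 60.0, 1.0, [1.0, 0.2]),
]
```

The `move_to` method sketches the real-time repositioning mentioned above; in practice such commands could equally originate from predefined settings embedded in control data.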
Virtual wave front: a virtual wave front may be defined as a virtual surface representing corresponding points of a wave that vibrates in unison. When identical waves having a common origin travel through a homogeneous medium, the corresponding crests and troughs at any instant are in phase; i.e., they have completed identical fractions of their cyclic motion, and any surface drawn through all the points of the same phase will constitute a wave front. An exemplary representation of a virtual wave front is provided in
With these fundamentals in place, we will now consider some non-limiting examples to illustrate various implementations of aspects of the present technology.
Communication between the various components of the computing environment 100 may be enabled by one or more internal and/or external buses 160 (e.g. a PCI bus, universal serial bus, IEEE 1394 “Firewire” bus, SCSI bus, Serial-ATA bus, ARINC bus, etc.), to which the various hardware components are electronically coupled.
The input/output interface 150 may be coupled to, for example, but without being limitative, headphones, earbuds, a set of loudspeakers, a headset, a VR headset, an AR headset and/or an audio processing unit (e.g., a recorder, a mixer).
According to implementations of the present technology, the solid-state drive 120 stores program instructions suitable for being loaded into the random access memory 130 and executed by the processor 110 for generating an audio image. For example, the program instructions may be part of a library or an application.
In some embodiments, the computing environment 100 may be configured so as to generate an audio image in accordance with the present technology described in the following paragraphs. In some other embodiments, the computing environment 100 may be configured so as to act as one or more of an “encoder”, a “sound-field positioner”, a “renderer”, a “decoder”, a “controller”, a “real-time controller”, a “filter”, a “localisation convolution engine”, a “mixer”, a “dynamic processor” and/or any combination thereof appropriate to the relevant task at hand.
Referring to
In some embodiments, the authoring tool 210 comprises an encoder. In some embodiments, the authoring tool 210 may also be referred to as an encoder. In the illustrated embodiment, the audio image file 220 is created by the authoring tool 210 and comprises multiple positional impulse responses 222 (PIRs), control data 224 and one or more audio streams 226. Each one of the PIRs is referred to as PIR n, wherein n is an integer. Each one of the one or more audio streams 226 may be referred to as audio stream x, wherein x is an integer. In some embodiments, the PIRs 222 comprise three PIRs, namely PIR1, PIR2 and PIR3. In some other embodiments, the PIRs 222 comprise more than three PIRs.
In some embodiments, the authoring tool 210 allows creating audio image files such as the audio image file 220. Once created, the audio image files may then be stored and/or transmitted to a device for real-time or future rendering. In some embodiments, the authoring tool 210 comprises an input interface configured to access one or more audio streams and control data. The control data may comprise positions of impulse responses, the positions allowing positioning impulse responses in a three-dimensional space (such as, but not limited to, a sphere). In some embodiments, the authoring tool 210 comprises an encoder which is configured to encode, for example, in a predefined file format, the one or more audio streams and the control data so that an audio image renderer (such as, but not limited to, the audio image renderer 230) may decode the audio image file to generate an audio image based on the one or more audio streams and positional impulse responses, positions of the positional impulse responses being defined by the control data of the audio image file.
The renderer 230 may be configured to access and/or receive audio image files such as the audio image file 220. In other embodiments, the renderer 230 may independently access one or more audio streams, control data and positional impulse responses. In some embodiments, the renderer 230 may have access to a repository of control data and/or positional impulse responses and receive an audio image file solely comprising one or more audio streams. Conversely, the renderer 230 may have access to one or more audio streams and receive control data and/or positional impulse responses from an external source (such as, but not limited to, a remote server). In the illustrated embodiment, the renderer 230 comprises a sound-field positioner 232 and an audio image renderer 234. In some embodiments, the renderer 230 may also be referred to as a decoder.
The sound-field positioner 232 may be controlled by a real-time controller 240. Even though reference is made to a real-time controller 240, it should be understood that the control of the sound-field positioner 232 is not required to occur in real-time. As such, in various embodiments of the present technology, the sound-field positioner 232 may be controlled by various types of controllers, whether real-time or not. In some embodiments wherein positional impulse responses and their respective positions define a sphere, the sound-field positioner 232 may be referred to as a spherical sound-field positioner. In some embodiments, the sound-field positioner 232 allows associating positional impulse responses with positions and controlling such positions of the positional impulse responses, as will be further detailed below in connection with the description of
The audio image renderer 234 may decode an audio image file such as the audio image file 220 to render an audio image. In some embodiments, the audio image renderer 234 may also be referred to as a three-dimensional audio experiential renderer. In some embodiments, the audio image is rendered based on an audio stream and positional impulse responses whose positions are determined and/or controlled by the sound-field positioner 232. In some embodiments, the audio image is generated by combining multiple virtual wave fronts, each one of the multiple virtual wave fronts being generated by the audio image renderer 234. In some embodiments, the multiple virtual wave fronts are generated based on the audio stream and positional impulse responses, as will be further detailed below in connection with the description of
In some embodiments, the audio image renderer 234 mixes the virtual wave fronts and outputs an m-channel audio output so as to render the audio image to a listener. In the embodiments illustrated at
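By way of a non-limiting illustration, mixing several virtual wave fronts into one output channel may amount to a per-sample summation. The function below sketches such a mix; the sample values in the usage example are arbitrary:

```python
def mix_wave_fronts(wave_fronts):
    """Sum several virtual wave fronts (lists of samples) into one channel."""
    length = max(len(w) for w in wave_fronts)
    out = [0.0] * length
    for w in wave_fronts:
        for i, sample in enumerate(w):
            out[i] += sample
    return out

# Two output channels (e.g. headphones), each mixed from its own wave fronts:
left_channel = mix_wave_fronts([[0.5, 0.25], [0.25, 0.25], [0.0, 0.25]])
right_channel = mix_wave_fronts([[0.25, 0.25], [0.5, 0.25], [0.0, 0.25]])
```

An m-channel output is obtained by repeating the mix once per channel; gain staging or limiting, which a practical mixer would likely apply, is omitted here for clarity.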
As can be seen in
Turning now to
As illustrated in
In some embodiments, multiple positional impulse responses may be combined together to define a polygonal positional impulse response. Such polygonal positional impulse response is illustrated by a first polygonal positional impulse response 420 and a second polygonal positional impulse response 430.
The first polygonal positional impulse response 420 comprises a first positional impulse response, a second positional impulse response and a third positional impulse response. Each one of the first positional impulse response, the second positional impulse response and the third positional impulse response is associated with a respective position. The combination of all three positions thereby defines the geometry of the first polygonal positional impulse response 420, in the present case, a triangle. In some embodiments, the geometry may be modified (either in real-time or not) via a controller (e.g., the real-time controller 240) and may define any shape (e.g., the three positions may define a line).
The second polygonal positional impulse response 430 comprises a fourth positional impulse response, a fifth positional impulse response, a sixth positional impulse response and a seventh positional impulse response. Each one of the fourth positional impulse response, the fifth positional impulse response, the sixth positional impulse response and the seventh positional impulse response is associated with a respective position. The combination of all four positions thereby defines the geometry of the second polygonal positional impulse response 430, in the present case, a quadrilateral. In some embodiments, the geometry may be modified (either in real-time or not) via a controller (e.g., the real-time controller 240).
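By way of a non-limiting illustration, a polygonal positional impulse response may be represented as a list of (position, impulse response) pairs whose positions jointly define the geometry, together with a setter through which a controller may morph that geometry. The class and field names below are illustrative only:

```python
class PolygonalPIR:
    """A polygonal positional impulse response: several positional impulse
    responses whose positions jointly define a geometry (triangle,
    quadrilateral, ...). Positions are illustrative (x, y, z) tuples."""

    def __init__(self, positioned_pirs):
        self.positioned_pirs = positioned_pirs  # list of (position, ir) pairs

    @property
    def vertex_count(self):
        return len(self.positioned_pirs)

    def set_position(self, index, position):
        """Morph the geometry by moving one vertex, e.g. via a controller."""
        _, ir = self.positioned_pirs[index]
        self.positioned_pirs[index] = (position, ir)

# A triangle built from three positional impulse responses, as in the
# first polygonal positional impulse response described above:
triangle = PolygonalPIR([
    ((0, 1, 0), [1.0]),
    ((1, -1, 0), [1.0]),
    ((-1, -1, 0), [1.0]),
])
```

Moving all three vertices onto a single line would degenerate the triangle into a line, consistent with the remark above that the three positions may define any shape.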
In some embodiments, the first polygonal positional impulse response 420 and the second polygonal positional impulse response 430 may be relied upon to generate one or more audio images as it will be further depicted below in connection with the description of
Even though the example of
Referring now to
In the example of
In some embodiments, the control data 524 and the PIRs 522 are accessed by the sound-field positioner 532. The control data 524 may also be accessed and/or relied upon by the audio image renderer 534. In some embodiments, such as the one illustrated at
In the illustrated embodiments, the audio stream 526 is filtered by the ADBF filter 502 before being processed by the audio image renderer 534. It should be understood that even though a single audio stream is illustrated, the processing of multiple audio streams is also envisioned, as previously discussed in connection with the description of
As may be appreciated by a person skilled in the art of the present technology, the n-m channel mixer 510 may take 2 or more channels as an input and output 2 or more channels. In the illustrated example, the n-m channel mixer 510 takes the second audio sub-stream transmitted by the delay filter 506 and the signal outputted by the audio image renderer 534 and mixes them to generate an audio image output. In some embodiments wherein 2 channels are to be outputted, the n-m channel mixer 510 takes (1) the second audio sub-stream associated with a left channel transmitted by the delay filter 506 and the signal associated with a left channel outputted by the audio image renderer 534 and (2) the second audio sub-stream associated with a right channel transmitted by the delay filter 506 and the signal associated with a right channel outputted by the audio image renderer 534 to generate a left channel and a right channel to be rendered to a listener. In some alternative embodiments, the n-m channel mixer 510 may output more than 2 channels, for example, for cases where the audio image is being rendered on more than two speakers. Such cases include, without being limitative, cases where the audio image is being rendered on headphones having two or more drivers associated with each ear and/or cases where the audio image is being rendered on more than two loudspeakers (e.g., 5.1, 7.1, Dolby AC-4® from Dolby Laboratories, Inc. settings).
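By way of a non-limiting illustration, for each output channel the n-m channel mixer described above may sum the delayed second audio sub-stream with the corresponding signal outputted by the audio image renderer. The gain parameters below are hypothetical refinements, not taken from the text:

```python
def mix_channels(delayed_substream, rendered, gain_delayed=1.0, gain_rendered=1.0):
    """Mix one output channel: the delayed second audio sub-stream summed
    with the corresponding signal from the audio image renderer.
    The gain parameters are hypothetical, shown only for flexibility."""
    return [gain_delayed * d + gain_rendered * r
            for d, r in zip(delayed_substream, rendered)]

# Stereo (2-channel) case: one mix per output channel.
out_left = mix_channels([0.5, 0.25], [0.25, 0.25])
out_right = mix_channels([0.25, 0.25], [0.5, 0.25])
```

For an output on more than two speakers, the same per-channel mix would simply be repeated for each additional channel.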
Turning now to
In the embodiment illustrated at
As it may be appreciated in
As an example, the audio stream may be a one-channel stream which is then duplicated into three signals so that each one of the three signals may be convolved with each one of the PIR_1 602, the PIR_2 604 and the PIR_3 606. As it may be appreciated on
Turning now to
The audio image renderer 634 generates the right channel by convolving, in parallel, the audio stream with the right component PIR_1 R (also referred to as a first right positional impulse response) to generate a right component of the first virtual wave front VWF1 R, the audio stream with the right component PIR_2 R (also referred to as a second right positional impulse response) to generate a right component of the second virtual wave front VWF2 R and the audio stream with the right component PIR_3 R (also referred to as a third right positional impulse response) to generate a right component of the third virtual wave front VWF3 R.
Then, the n-m channel mixer 660 mixes the VWF1 L, the VWF2 L, the VWF3 L to generate the left channel and mixes the VWF1 R, the VWF2 R and the VWF3 R to generate the right channel. The left channel and the right channel may then be rendered to the listener so that she/he may experience a binaural audio image on a regular stereo setting (such as, headphones or a loudspeaker set).
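By way of a non-limiting illustration, the binaural rendering described above may be sketched as follows: the audio stream is convolved with the left and right components of each positional impulse response, producing left and right components of three virtual wave fronts, which are then summed per ear into the left and right channels. All names are illustrative, and the parallel execution mentioned above is omitted for clarity:

```python
def convolve(signal, ir):
    """Direct-form time-domain convolution (illustrative, O(n*m))."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out

def render_binaural(audio_stream, pirs_left, pirs_right):
    """Generate a left and a right channel from one audio stream and three
    positional impulse responses, each split into a left-ear and a right-ear
    component. Each convolution yields one component of a virtual wave front;
    summing the three components per ear mixes them into an output channel."""
    length = len(audio_stream) + max(len(p) for p in pirs_left + pirs_right) - 1
    left = [0.0] * length
    right = [0.0] * length
    for pir_l, pir_r in zip(pirs_left, pirs_right):
        for i, s in enumerate(convolve(audio_stream, pir_l)):
            left[i] += s   # component of a virtual wave front, left ear
        for i, s in enumerate(convolve(audio_stream, pir_r)):
            right[i] += s  # component of a virtual wave front, right ear
    return left, right
```

In a practical implementation the six convolutions could be executed in parallel, as contemplated above, since they are independent of one another.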
Turning now to
The first polygonal positional impulse response 1520 comprises a first positional impulse response, a second positional impulse response and a third positional impulse response. Each one of the first positional impulse response, the second positional impulse response and the third positional impulse response is associated with a respective position. The combination of all three positions thereby defines the geometry of the first polygonal positional impulse response 1520, in the present case, a triangle. In some embodiments, the geometry may be modified (either in real-time or not) via a controller (e.g., the real-time controller 240).
The second polygonal positional impulse response 1530 comprises a fourth positional impulse response, a fifth positional impulse response, a sixth positional impulse response and a seventh positional impulse response. Each one of the fourth positional impulse response, the fifth positional impulse response, the sixth positional impulse response and the seventh positional impulse response is associated with a respective position. The combination of all four positions thereby defines the geometry of the second polygonal positional impulse response 1530, in the present case, a quadrilateral. In some embodiments, the geometry may be modified (either in real-time or not) via a controller (e.g., the real-time controller 240).
In the illustrated embodiment, a first audio image 1540 is generated based on the first polygonal positional impulse response 1520 (e.g., based on a first audio stream and each one of the positional impulse responses defining the first polygonal positional impulse response 1520). A second audio image 1550 is generated based on the second polygonal positional impulse response 1530 (e.g., based on a second audio stream and each one of the positional impulse responses defining the second polygonal positional impulse response 1530). In some embodiments, the first audio stream and the second audio stream may be the same audio stream.
In some embodiments, the combination of the first audio image 1540 and the second audio image 1550 defines a complex audio image. As it may be appreciated, the complex audio image may be morphed dynamically by controlling positions associated with the first polygonal positional impulse response 1520 and the second polygonal positional impulse response 1530. As an example, the first audio image 1540 may be a volumetric audio image of a first instrument (e.g., a violin) and the second audio image 1550 may be a volumetric audio image of a second instrument (e.g., a guitar). Upon being rendered, the first audio image 1540 and the second audio image 1550 are perceived by a listener not just as point-source audio objects but rather as volumetric audio objects, as if the listener were standing by the first instrument and the second instrument in real life. Those examples should not be construed as being limitative and multiple variations and applications may be envisioned without departing from the scope of the present technology.
The representation of a virtual wave front 1560 aims at exemplifying wave fronts of a sound wave. As a person skilled in the art of the present technology may appreciate, the representation 1560 may be taken from a spherical wave front of a sound wave spreading out from a point source. Wave fronts for longitudinal and transverse waves may be surfaces of any configuration depending on the source, the medium and/or obstructions encountered. As illustrated in
Turning now to
In some embodiments, a volumetric audio image may be perceived by a human auditory system via median and/or lateral information pertaining to the volumetric audio image. In some embodiments, perception in the median plane may be frequency dependent and/or may involve inter-aural level difference (ILD) envelope cues. In some embodiments, lateral perception may be dependent on relative differences of the wave fronts and/or dissimilarities between two ear input signals. Lateral dissimilarities may consist of inter-aural time differences (ITD) and/or inter-aural level differences (ILD). ITDs may be dissimilarities between the two ear input signals related to a time when signals occur or when specific components of the signals occur. These dissimilarities may be described by a frequency plot of inter-aural phase difference φ(f). In the perception of ITD envelope cues, timing information may be used for higher frequencies as timing differences in amplitude envelopes may be detected. An ITD envelope cue may be based on extraction by the hearing system of timing differences of onsets of amplitude envelopes instead of timing of waveforms within an envelope. ILDs may be dissimilarities between the two ear input signals related to an average sound pressure level of the two ear input signals. The dissimilarities may be described in terms of differences in amplitude of an inter-aural transfer function |A(f)| and/or a sound pressure level difference 20 log|A(f)|.
In some embodiments, the acoustic renderer comprises a direct sound renderer, an early reflections renderer and/or a late reflections renderer. In some embodiments, the acoustic renderer is based on binaural room simulation, acoustic rendering based on DSP algorithm, acoustic rendering based on impulse response, acoustic rendering based on B-Format, acoustic rendering based on spherical harmonics, acoustic rendering based on environmental context simulation, acoustic rendering based on convolution with impulse response, acoustic rendering based on convolution with impulse response and HRTF processing, acoustic rendering based on auralization, acoustic rendering based on synthetic room impulse response, acoustic rendering based on ambisonics and binaural rendering, acoustic rendering based on high order ambisonics (HOA) and binaural rendering, acoustic rendering based on ray tracing and/or acoustic rendering based on image modeling.
In some embodiments, the binaural renderer is based on binaural signal processing, binaural rendering based on HRTF modeling, binaural rendering based on HRTF measurements, binaural rendering based on DSP algorithm, binaural rendering based on impulse response, binaural rendering based on digital filters for HRTF and/or binaural rendering based on calculation of HRTF sets.
As for the embodiment depicted in
Turning now to
In some embodiments, the cut-off frequency (f2) and/or the crossover frequency (f) may be defined based on the following equations:
As it can be seen on
In some embodiments, F1 is the upper boundary of region A and is determined based on a largest axial dimension L of a space. Region B defines a region where space dimensions are comparable to the wavelengths of sound (i.e., wave acoustics apply). F2 defines a cut-off frequency or a crossover frequency in Hz. RT60 corresponds to a reverberation time of the room in seconds. In some embodiments, RT60 may be defined as the time it takes for sound pressure to decrease by 60 dB, measured from the moment a generated test signal is abruptly ended. V corresponds to a volume of the space. Region C defines a region where diffusion and diffraction dominate, forming a transition between region B (where wave acoustics apply) and region D (where ray acoustics apply).
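The equations themselves are not reproduced above, but the quantities described are consistent with the standard room-acoustics relations: the lowest axial mode F1 = c/(2L) and the Schroeder frequency F2 = 2000·√(RT60/V) in metric units. The sketch below assumes those standard formulas; it is illustrative only and not taken from the patent:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, in air at roughly 20 °C

def region_a_upper_bound(largest_dimension_m):
    # F1: below this frequency the space supports no resonant modes;
    # the lowest axial mode of the longest dimension L is c / (2L).
    return SPEED_OF_SOUND / (2.0 * largest_dimension_m)

def crossover_frequency(rt60_s, volume_m3):
    # F2: the Schroeder frequency, a common estimate of where a room
    # transitions from modal (wave) behaviour toward ray acoustics.
    return 2000.0 * math.sqrt(rt60_s / volume_m3)

# Example: a 10 m x 7 m x 3 m room (V = 210 m^3) with RT60 = 0.6 s.
f1 = region_a_upper_bound(10.0)       # about 17 Hz
f2 = crossover_frequency(0.6, 210.0)  # about 107 Hz
```

For this example room, region B would span roughly 17 Hz to 107 Hz, above which ray-acoustic treatment becomes appropriate.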
Turning now to
The method 2500 starts at step 2502 by accessing an audio stream. In some embodiments, the audio stream is a first audio stream and the method 2500 further comprises accessing a second audio stream. In some embodiments, the audio stream is an audio channel. In some embodiments, the audio stream is one of a mono audio stream, a stereo audio stream and a multi-channel audio stream.
At a step 2504, the method 2500 accesses a first positional impulse response, the first positional impulse response being associated with a first position. At a step 2506, the method 2500 accesses a second positional impulse response, the second positional impulse response being associated with a second position. At a step 2508, the method 2500 accesses a third positional impulse response, the third positional impulse response being associated with a third position.
Then, the method 2500 generates an audio image by executing steps 2510, 2512 and 2514. In some embodiments, the steps 2510, 2512 and 2514 are executed in parallel. In some embodiments, the step 2510 comprises generating, based on the audio stream and the first positional impulse response, a first virtual wave front to be perceived by a listener as emanating from the first position. The step 2512 comprises generating, based on the audio stream and the second positional impulse response, a second virtual wave front to be perceived by the listener as emanating from the second position. The step 2514 comprises generating, based on the audio stream and the third positional impulse response, a third virtual wave front to be perceived by the listener as emanating from the third position.
In some embodiments, the method 2500 further comprises a step 2516. The step 2516 comprises mixing the first virtual wave front, the second virtual wave front and the third virtual wave front.
In some embodiments, generating the first virtual wave front comprises convolving the audio stream with the first positional impulse response; generating the second virtual wave front comprises convolving the audio stream with the second positional impulse response; and generating the third virtual wave front comprises convolving the audio stream with the third positional impulse response.
In some embodiments, the first positional impulse response comprises a first left positional impulse response associated with the first position and a first right positional impulse response associated with the first position; the second positional impulse response comprises a second left positional impulse response associated with the second position and a second right positional impulse response associated with the second position; and the third positional impulse response comprises a third left positional impulse response associated with the third position and a third right positional impulse response associated with the third position.
In some embodiments, generating the first virtual wave front, the second virtual wave front and the third virtual wave front comprises:
In some embodiments, convolving the audio stream with the summed left positional impulse response comprises generating a left channel signal; convolving the audio stream with the summed right positional impulse response comprises generating a right channel signal; and the method further comprises rendering the left channel signal and the right channel signal to a listener.
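Because convolution is linear, summing the left positional impulse responses once and performing a single convolution yields the same left channel as convolving three times and mixing the results. A brief NumPy check of that equivalence, with hypothetical toy data (the same reasoning applies to the right channel):

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.standard_normal(64)
pirs_left = [rng.standard_normal(16) for _ in range(3)]  # three left PIRs

# Per-wave-front path: three convolutions, then a mixdown.
mixed = sum(np.convolve(stream, pir) for pir in pirs_left)

# Summed-PIR path: sum the impulse responses once, convolve once.
summed_pir = np.sum(pirs_left, axis=0)
left_channel = np.convolve(stream, summed_pir)

# Convolution is linear, so both paths yield the same left channel.
assert np.allclose(mixed, left_channel)
```

The summed-PIR path trades three convolutions per channel for one, which may matter when the impulse responses are long.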
In some embodiments, generating the first virtual wave front, the second virtual wave front and the third virtual wave front comprises:
In some embodiments, the method 2500 further comprises:
In some embodiments, generating the first virtual wave front, generating the second virtual wave front and generating the third virtual wave front are executed in parallel.
In some embodiments, upon rendering the audio image to a listener, the first virtual wave front is perceived by the listener as emanating from a first virtual speaker located at the first position, the second virtual wave front is perceived by the listener as emanating from a second virtual speaker located at the second position; and the third virtual wave front is perceived by the listener as emanating from a third virtual speaker located at the third position.
In some embodiments, generating the first virtual wave front, generating the second virtual wave front and generating the third virtual wave front are executed synchronously.
In some embodiments, prior to generating the audio image, the method comprises:
In some embodiments, the audio stream is a first audio stream and the method further comprises accessing a second audio stream.
In some embodiments, the audio image is a first audio image and the method further comprises:
In some embodiments, the audio image is defined by a combination of the first virtual wave front, the second virtual wave front and the third virtual wave front.
In some embodiments, the audio image is perceived by a listener as a virtual immersive audio volume defined by the combination of the first virtual wave front, the second virtual wave front and the third virtual wave front.
In some embodiments, the method 2500 further comprises accessing a fourth positional impulse response, the fourth positional impulse response being associated with a fourth position.
In some embodiments, the method 2500 further comprises generating, based on the audio stream and the fourth positional impulse response, a fourth virtual wave front to be perceived by the listener as emanating from the fourth position.
In some embodiments, the first position, the second position and the third position correspond to locations of an acoustic space associated with the first positional impulse response, the second positional impulse response and the third positional impulse response.
In some embodiments, the first position, the second position and the third position define a portion of a spherical mesh.
In some embodiments, the first positional impulse response, the second positional impulse response and the third positional impulse response define a polygonal positional impulse response.
In some embodiments, the audio image is a first audio image and wherein the method further comprises:
In some embodiments, the first audio image and the second audio image define a complex audio image.
In some embodiments, the audio stream comprises a point source audio stream and the audio image is perceived by a user as a volumetric audio object of the point source audio stream defined by the combination of the first virtual wave front, the second virtual wave front and the third virtual wave front.
In some embodiments, the point source audio stream comprises a mono audio stream.
In some embodiments, the first positional impulse response, the second positional impulse response, the third positional impulse response and the audio stream are accessed from an audio image file.
In some embodiments, the first position, the second position and the third position are associated with control data, the control data being accessed from the audio image file.
In some embodiments, the audio stream is a first audio stream and the audio image file comprises a second audio stream.
In some embodiments, the audio image file has been generated by an encoder.
In some embodiments, the first positional impulse response, the second positional impulse response and the third positional impulse response are accessed by a sound-field positioner and the audio image is generated by an audio image renderer.
In some embodiments, the sound-field positioner and the audio image renderer define a decoder.
In some embodiments, before generating the audio image, the audio stream is filtered by an acoustically determined band filter.
In some embodiments, the audio stream is divided into a first audio sub-stream and a second audio sub-stream by the acoustically determined band filter.
In some embodiments, convolving the audio stream with the first positional impulse response comprises convolving the first audio sub-stream with the first positional impulse response, convolving the audio stream with the second positional impulse response comprises convolving the first audio sub-stream with the second positional impulse response and convolving the audio stream with the third positional impulse response comprises convolving the first audio sub-stream with the third positional impulse response.
In some embodiments, the first virtual wave front, the second virtual wave front and the third virtual wave front are mixed with the second audio sub-stream to generate the audio image.
In some embodiments, the acoustically determined band filter generates the first audio sub-stream by applying a high-pass filter (HPF) and the second audio sub-stream by applying a low-pass filter (LPF).
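One way to realize such an acoustically determined band filter is a complementary linear-phase FIR pair, in which the high-pass filter is a delayed unit impulse minus the low-pass filter, so that the two sub-streams recombine losslessly apart from a fixed delay. This is an illustrative sketch under assumed parameters (windowed-sinc design, tap count, sample rate), not the patent's design:

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Linear-phase windowed-sinc low-pass filter."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = (2.0 * cutoff_hz / fs_hz) * np.sinc(2.0 * cutoff_hz / fs_hz * n)
    return h * np.hamming(num_taps)

fs = 48_000.0
f2 = 120.0  # cut-off taken from the wave/ray transition of the space

h_lp = lowpass_fir(f2, fs)
# Complementary high-pass: a delayed unit impulse minus the low-pass,
# so that h_lp + h_hp is a pure delay and the split is lossless.
h_hp = -h_lp.copy()
h_hp[(len(h_lp) - 1) // 2] += 1.0

rng = np.random.default_rng(2)
stream = rng.standard_normal(1024)
low_sub = np.convolve(stream, h_lp)   # routed around the PIR convolutions
high_sub = np.convolve(stream, h_hp)  # convolved with the PIRs

# The two sub-streams recombine to the input, delayed by (taps-1)/2 samples.
recombined = low_sub + high_sub
delay = (len(h_lp) - 1) // 2
assert np.allclose(recombined[delay:delay + len(stream)], stream)
```

A gain or an additional delay applied to the low sub-stream, as contemplated above, can compensate for level and latency differences introduced on the convolved high sub-stream.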
In some embodiments, at least one of a gain and a delay is applied to the second audio sub-stream.
In some embodiments, at least one of the HPF and the LPF is defined based on at least one of a cut-off frequency (f2) and a crossover frequency (f).
In some embodiments, the at least one of the cut-off frequency and the crossover frequency is based on a frequency where sound transitions from wave to ray acoustics within a space associated with at least one of the first positional impulse response, the second positional impulse response and the third positional impulse response.
In some embodiments, the at least one of the cut-off frequency (f2) and the crossover frequency (f) is associated with control data.
In some embodiments, the method 2500 further comprises outputting an m-channel audio output based on the audio image.
In some embodiments, the audio image is delivered to a user via at least one of a headphone set and a set of loudspeakers.
In some embodiments, at least one of convolving the audio stream with the first positional impulse response, convolving the audio stream with the second positional impulse response and convolving the audio stream with the third positional impulse response comprises applying a Fourier-transform to the audio stream.
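Convolution via the Fourier transform can be sketched as follows. This illustrative NumPy example (the function name is hypothetical) checks frequency-domain "fast" convolution against direct time-domain convolution:

```python
import numpy as np

def fft_convolve(stream, pir):
    """Fast convolution: multiply the Fourier transforms of the audio
    stream and the positional impulse response, then transform back."""
    n = len(stream) + len(pir) - 1
    size = 1 << (n - 1).bit_length()   # next power of two for the FFT
    spectrum = np.fft.rfft(stream, size) * np.fft.rfft(pir, size)
    return np.fft.irfft(spectrum, size)[:n]

rng = np.random.default_rng(3)
stream = rng.standard_normal(256)
pir = rng.standard_normal(64)

# Frequency-domain and time-domain convolution agree to rounding error.
assert np.allclose(fft_convolve(stream, pir), np.convolve(stream, pir))
```

Zero-padding the FFT to at least len(stream) + len(pir) − 1 samples is what makes the circular convolution of the transforms equal to the desired linear convolution.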
In some embodiments, the first virtual wave front, the second virtual wave front and the third virtual wave front are mixed together.
In some embodiments, at least one of a gain, a delay and a filter/equalizer is applied to at least one of the first virtual wave front, the second virtual wave front and the third virtual wave front.
In some embodiments, applying at least one of the gain, the delay and the filter/equalizer to the at least one of the first virtual wave front, the second virtual wave front and the third virtual wave front is based on control data.
In some embodiments, the audio stream is a first audio stream and the method further comprises accessing multiple audio streams.
In some embodiments, the first audio stream and the multiple audio streams are mixed together before generating the audio image.
In some embodiments, the first position, the second position and the third position are controllable in real-time so as to morph the audio image.
Turning now to
The method 2600 starts at step 2602 by accessing an audio stream. Then, at a step 2604, the method 2600 accesses positional information, the positional information comprising a first position, a second position and a third position.
The method 2600 then executes steps 2610, 2612 and 2614 to generate an audio image. In some embodiments, the steps 2610, 2612 and 2614 are executed in parallel. The step 2610 comprises generating, based on the audio stream, a first virtual wave front to be perceived by a listener as emanating from the first position. The step 2612 comprises generating, based on the audio stream, a second virtual wave front to be perceived by the listener as emanating from the second position. The step 2614 comprises generating, based on the audio stream, a third virtual wave front to be perceived by the listener as emanating from the third position.
In some embodiments, upon rendering the audio image to the listener, the first virtual wave front is perceived by the listener as emanating from a first virtual speaker located at the first position, the second virtual wave front is perceived by the listener as emanating from a second virtual speaker located at the second position; and the third virtual wave front is perceived by the listener as emanating from a third virtual speaker located at the third position.
In some embodiments, at least one of generating the first virtual wave front, generating the second virtual wave front and generating the third virtual wave front comprises at least one of an acoustic rendering and a binaural rendering.
In some embodiments, the acoustic rendering comprises at least one of direct sound rendering, early reflections rendering and late reflections rendering.
In some embodiments, the acoustic rendering comprises at least one of binaural room simulation, acoustic rendering based on DSP algorithm, acoustic rendering based on impulse response, acoustic rendering based on B-Format, acoustic rendering based on spherical harmonics, acoustic rendering based on environmental context simulation, acoustic rendering based on convolution with impulse response, acoustic rendering based on convolution with impulse response and HRTF processing, acoustic rendering based on auralization, acoustic rendering based on synthetic room impulse response, acoustic rendering based on ambisonics and binaural rendering, acoustic rendering based on high order ambisonics (HOA) and binaural rendering, acoustic rendering based on ray tracing and acoustic rendering based on image modeling.
In some embodiments, the binaural rendering comprises at least one of binaural signal processing, binaural rendering based on HRTF modeling, binaural rendering based on HRTF measurements, binaural rendering based on DSP algorithm, binaural rendering based on impulse response, binaural rendering based on digital filters for HRTF and binaural rendering based on calculation of HRTF sets.
In some embodiments, generating the first virtual wave front, generating the second virtual wave front and generating the third virtual wave front are executed synchronously.
In some embodiments, prior to generating the audio image, the method comprises:
In some embodiments, generating the first virtual wave front comprises convolving the audio stream with the first positional impulse response; generating the second virtual wave front comprises convolving the audio stream with the second positional impulse response; and generating the third virtual wave front comprises convolving the audio stream with the third positional impulse response.
In some embodiments, prior to generating the audio image, the method 2600 comprises:
In some embodiments, generating the first virtual wave front, the second virtual wave front and the third virtual wave front comprises:
In some embodiments, convolving the audio stream with the summed left positional impulse response comprises generating a left channel; convolving the audio stream with the summed right positional impulse response comprises generating a right channel; and the method further comprises rendering the left channel and the right channel to a listener.
In some embodiments, the audio image is defined by a combination of the first virtual wave front, the second virtual wave front and the third virtual wave front.
In some embodiments, the method 2600 further comprises a step 2616 which comprises mixing the first virtual wave front, the second virtual wave front and the third virtual wave front.
Turning now to
The method 2700 starts at step 2702 by accessing an audio stream. Then, at a step 2704, the method 2700 accesses a first positional impulse response, a second positional impulse response and a third positional impulse response.
Then, at a step 2706, the method 2700 accesses control data, the control data comprising a first position, a second position and a third position. At a step 2708, the method 2700 associates the first positional impulse response with the first position, the second positional impulse response with the second position and the third positional impulse response with the third position.
The method 2700 then generates the volumetric audio image by executing steps 2710, 2712 and 2714. In some embodiments, the steps 2710, 2712 and 2714 are executed in parallel. The step 2710 comprises generating a first virtual wave front emanating from the first position by convolving the audio stream with the first positional impulse response. The step 2712 comprises generating a second virtual wave front emanating from the second position by convolving the audio stream with the second positional impulse response. The step 2714 comprises generating a third virtual wave front emanating from the third position by convolving the audio stream with the third positional impulse response.
In some embodiments, the method 2700 further comprises a step 2716 which comprises mixing the first virtual wave front, the second virtual wave front and the third virtual wave front.
Turning now to
The method 2800 starts at step 2802 by accessing an audio stream. Then, at a step 2804, the method 2800 accesses dimensional information relating to a space. The method 2800 then determines, at a step 2806, a frequency where sound transitions from wave to ray acoustics within the space. At a step 2808, the method 2800 divides the audio stream into a first audio sub-stream and a second audio sub-stream based on the frequency.
In some embodiments, dividing the audio stream comprises generating the first audio sub-stream by applying a high-pass filter (HPF) and the second audio sub-stream by applying a low-pass filter (LPF). In some embodiments, at least one of a gain and a delay is applied to the second audio sub-stream. In some embodiments, the frequency is one of a cut-off frequency (f2) and a crossover frequency (f). In some embodiments, at least one of the HPF and the LPF is defined based on at least one of the cut-off frequency (f2) and the crossover frequency (f).
In some embodiments, at least one of the cut-off frequency (f2) and the crossover frequency (f) is associated with control data. In some embodiments, the space is associated with at least one of a first positional impulse response, a second positional impulse response and a third positional impulse response.
While the above-described implementations have been described and shown with reference to particular steps performed in a particular order, it will be understood that these steps may be combined, sub-divided, or re-ordered without departing from the teachings of the present technology. At least some of the steps may be executed in parallel or in series. Accordingly, the order and grouping of the steps is not a limitation of the present technology.
It should be expressly understood that not all technical effects mentioned herein need to be enjoyed in each and every embodiment of the present technology. For example, embodiments of the present technology may be implemented without the user and/or the listener enjoying some of these technical effects, while other embodiments may be implemented with the user enjoying other technical effects or none at all.
Modifications and improvements to the above-described implementations of the present technology may become apparent to those skilled in the art. The foregoing description is intended to be exemplary rather than limiting. The scope of the present technology is therefore intended to be limited solely by the scope of the appended claims.
The present Application claims priority to U.S. Provisional Patent Application No. 62/410,132 filed on Oct. 19, 2016, the entire disclosure of which is incorporated herein by reference. The present application is a continuation of U.S. patent application Ser. No. 17/023,257 filed on Sep. 16, 2020, which itself is a continuation of U.S. patent application Ser. No. 16/388,146 filed on Apr. 18, 2019, which itself is a continuation of International Patent Application no. PCT/IB2017/056471, filed on Oct. 18, 2017, entitled “SYSTEM FOR AND METHOD OF GENERATING AN AUDIO IMAGE”. These applications are incorporated by reference herein in their entirety.
Number | Name | Date | Kind |
---|---|---|---|
6027428 | Thomas et al. | Feb 2000 | A |
6741706 | Mcgrath et al. | May 2004 | B1 |
8619998 | Walsh et al. | Dec 2013 | B2 |
9094771 | Tsingos et al. | Jul 2015 | B2 |
9172901 | Chabanne et al. | Oct 2015 | B2 |
10820135 | Boerum | Oct 2020 | B2 |
11516616 | Boerum | Nov 2022 | B2 |
20070297616 | Plogsties | Dec 2007 | A1 |
20080298610 | Virolainen et al. | Dec 2008 | A1 |
20120213375 | Mahabub et al. | Aug 2012 | A1 |
20140185812 | Van Achte et al. | Jul 2014 | A1 |
20140185844 | Haurais et al. | Jul 2014 | A1 |
20140219455 | Peters et al. | Aug 2014 | A1 |
20140355796 | Xiang et al. | Dec 2014 | A1 |
20150293655 | Tan | Oct 2015 | A1 |
20170223478 | Jot et al. | Aug 2017 | A1 |
Number | Date | Country |
---|---|---|
102694764 | Sep 2012 | CN |
104021373 | Sep 2014 | CN |
1613127 | Jan 2006 | EP |
2873254 | May 2015 | EP |
9949574 | Sep 1999 | WO |
2012088336 | Jun 2012 | WO |
2014014891 | Jan 2014 | WO |
2014159376 | Oct 2014 | WO |
2014194005 | Dec 2014 | WO |
2015134658 | Sep 2015 | WO |
2015147619 | Oct 2015 | WO |
Entry |
---|
Vocal.com, Early Reflections, published Jul. 2017, https://web.archive.org/web/20170703180923/https://vocal.com/dereverberation/early-reflections/ (Year: 2017). |
International Search Report and Written Opinion issued on Mar. 1, 2018 in corresponding International patent application No. PCT/IB2017/056471. |
Siltanen et al., “Rays or Waves? Understanding the Strengths and Weaknesses of Computational Room Acoustics Modeling Techniques”, Proceedings of the International Symposium of Room Acoustics, ISRA 2010, Melbourne, Australia, Aug. 29-31, 2010. |
Rober et al., "Ray Acoustics Using Computer Graphics Technology", Proceedings of the 10th International Conference on Digital Audio Effects (DAFx-07), Bordeaux, France, Sep. 10-15, 2007. |
Kiminki, “Sound Propagation Theory for Linear Ray Acoustic Modelling”, Master's Thesis, Helsinki University of Technology, Department of Computer Science and Engineering, Telecommunications Software and Multimedia Laboratory, Mar. 7, 2005. |
Begault, D.R., “3-D Sound for Virtual Reality and Multimedia”, National Aeronautics and Space Administration, NASA/TM-2000-209606. |
Blauert, J., "Communication Acoustics", Springer-Verlag Berlin Heidelberg, 2005, Chapters 1 and 4. |
Vorlander, M., “Auralization of spaces”, Physics Today, American Institute of Physics, S-0031-9228-0906-020-7, Jun. 2009, pp. 35-40. |
Everest, F.A. et al., “Master Handbook of Acoustics, Fifth Edition”, The McGraw-Hill Companies, Inc., 2009, Chapters 18 and 26. |
Melchior, F., "The theory and practice of generating improved headphone experiences, Part II", BBC R&D. |
European Search Report with regard to the counterpart EP Patent Application No. 17861420.2 mailed Jun. 25, 2019. |
Bernschutz, “A Spherical Far Field HRIR/HRTF Compilation of the Neumann KU100”, AIA-DAGA 2013 Merano, Proceedings of the International Conference on Acoustics, pp. 592-595. |
English Abstract for CN102694764 retrieved on Espacenet on Feb. 19, 2021. |
English Abstract for CN104021373 retrieved on Espacenet on Feb. 19, 2021. |
Communication pursuant to Article 94(3) EPC with regard to the counterpart EP Patent Application No. 17861420.2 mailed Jul. 15, 2021. |
Sorensen, “Waves And Rays—Acoustic Fields”, May 23, 2013 , XP055822573, Retrieved from the Internet URL: https://www.acousticfields.com/waves-and-rays/? nab=1 &utm_referrer=https://www.google.de/, 12 pages. |
Number | Date | Country | |
---|---|---|---|
20230050329 A1 | Feb 2023 | US |
Number | Date | Country | |
---|---|---|---|
62410132 | Oct 2016 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 17023257 | Sep 2020 | US |
Child | 17980370 | US | |
Parent | 16388146 | Apr 2019 | US |
Child | 17023257 | US | |
Parent | PCT/IB2017/056471 | Oct 2017 | WO |
Child | 16388146 | US |