This application was originally filed as PCT Application No. PCT/IB2013/054044 filed May 17, 2013.
The present application relates to apparatus for spatial object oriented audio signal processing. The invention further relates to, but is not limited to, apparatus for spatial object oriented audio signal processing within mobile devices.
Spatial audio signals are being used with greater frequency to produce a more immersive audio experience. A stereo or multi-channel recording can be passed from the recording or capture apparatus to a listening apparatus and replayed using a suitable multi-channel output such as a pair of headphones, a headset, a multi-channel loudspeaker arrangement etc.
Object oriented audio formats represent audio as separate tracks with trajectories. The trajectories describe the directions from which the audio on each track should appear to come during playback. These trajectories are typically expressed in polar coordinates, where the polar angle and azimuth provide the direction.
Several object oriented audio formats have been presented, e.g. Dolby Atmos and MPEG SAOC. Object oriented audio formats have several benefits. For the consumer the most important benefit is the ability to play back the audio on any equipment and still achieve improved audio quality, unlike when fixed 5.1 multichannel audio signals are downmixed for playback equipment which has fewer channels than the audio signals, or upmixed for playback equipment which has more channels than the audio signals. The playback equipment can for example be headphones, a 5.1 surround home theatre apparatus, or the mono/stereo speakers of a television or a mobile device.
However it would be understood that such object oriented representations can be problematic. The format known as Dolby Atmos can use up to 200 individual channels, and due to data transfer and computational limitations, attempting to transmit, store or render 200 channels can impose a significant bandwidth and processing load. This load can be significant for mobile devices, requiring additional processing capacity with cost and power usage disadvantages. Furthermore a fixed 5.1 downmix would lose all the benefits of an object oriented audio format, such as high quality with any loudspeaker or headphone setup and the possibility to play back audio from above or below.
Aspects of this application thus provide object oriented audio format reproduction without the high bandwidth or processing capacity requirements.
According to a first aspect there is provided an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to at least: perceptually order at least two object orientated audio signal channels; and process at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels.
Perceptually ordering at least two object orientated audio signal channels may further cause the apparatus to: determine a perception value for each of the at least two object orientated signal channels; and perceptually order the at least two object orientated audio signal channels based on the perception value.
Determining a perception value for each of the at least two object orientated signal channels may cause the apparatus to determine a perception value based on the distance difference between the channel and a defined position.
The defined position may be a nearest of a set of speaker positions.
The set of speaker positions in polar co-ordinates may be L=[Lr, Lθ, Lφ]=[1, −30, 0], R=[Rr, Rθ, Rφ]=[1, 30, 0], C=[Cr, Cθ, Cφ]=[1, 0, 0], Ls=[Lsr, Lsθ, Lsφ]=[1, −110, 0], and Rs=[Rsr, Rsθ, Rsφ]=[1, 110, 0].
Determining a perception value for each of the at least two object orientated signal channels may cause the apparatus to: divide each of the at least two object orientated signal channels into time parts; determine for each time part of the at least two object orientated signal channel Cx the following value:
where ∥Cx∥ is the energy level of the channel Cx, ∥Cmax∥ the maximum energy level of the at least two channels at the time part, ∥Cmin∥ the minimum energy level of the at least two channels at the time part, and δx is the angular distance for the channel Cx to a nearest of a set of speakers.
Determining a perception value for each of the at least two object orientated signal channels may cause the apparatus to: divide each of the at least two object orientated signal channels into time-frequency parts; determine for each time-frequency part of the at least two object orientated signal channel Cx the following value:
where ∥Cx,b∥ is the energy level of the channel for frequency band Cx, ∥Cmax,b∥ the maximum energy level of the at least two channels at the time frequency part, ∥Cmin,b∥ the minimum energy level of the at least two channels at the time frequency part, and δx is the angular distance for the channel Cx to a nearest of a set of speakers.
The value of δx may be defined by
where L=[Lr, Lθ, Lφ]=[1, −30, 0], R=[Rr, Rθ, Rφ]=[1, 30, 0], C=[Cr, Cθ, Cφ]=[1, 0, 0], Ls=[Lsr, Lsθ, Lsφ]=[1, −110, 0], and Rs=[Rsr, Rsθ, Rsφ]=[1, 110, 0].
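The equations referenced in the preceding paragraphs are not reproduced in this text. A plausible reconstruction, consistent with the stated definitions but an assumption rather than the original expressions (in particular the product form and the normalisation by 90 degrees), is:

$$\operatorname{perce}(C_x)=\frac{\lVert C_x\rVert-\lVert C_{min}\rVert}{\lVert C_{max}\rVert-\lVert C_{min}\rVert}\cdot\frac{\delta_x}{90}$$

$$\delta_x=\min_{S\in\{L,R,C,Ls,Rs\}}\arccos\bigl(\cos\varphi_P\cos\varphi_S\cos(\theta_P-\theta_S)+\sin\varphi_P\sin\varphi_S\bigr)$$

where P=[1, θP, φP] is the trajectory direction of channel Cx and S ranges over the speaker positions listed above; the time-frequency variant replaces ∥Cx∥, ∥Cmax∥ and ∥Cmin∥ with ∥Cx,b∥, ∥Cmax,b∥ and ∥Cmin,b∥.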
Processing at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels may cause the apparatus to: select a first set of the at least two object orientated audio signal channels, the first set of the at least two object orientated audio signal channels being the lower perceptually ordered channels; downmix the first set of the at least two object orientated audio signal channels to a downmixed channel representation; and output the downmixed channel representation with the remainder of the at least two object orientated audio signal channels.
Processing at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels may cause the apparatus to: select, for parts of the at least two object orientated audio signal channels, a highest perceptually ordered channel part; combine the selected highest perceptually ordered parts to generate a first audio signal; attenuate the highest perceptually ordered channel part of the at least two object orientated audio signal channels; combine the attenuated highest perceptually ordered channel parts with the remaining parts of the at least two object orientated audio signal channels to generate a second audio signal; and output the first audio signal and the second audio signal.
The parts may be frequency sub-bands and/or bands of time periods of the at least two object orientated audio signal channels.
According to a second aspect there is provided a method comprising: perceptually ordering at least two object orientated audio signal channels; and processing at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels.
Perceptually ordering at least two object orientated audio signal channels may comprise: determining a perception value for each of the at least two object orientated signal channels; and perceptually ordering the at least two object orientated audio signal channels based on the perception value.
Determining a perception value for each of the at least two object orientated signal channels may comprise determining a perception value based on the distance difference between the channel and a defined position.
The defined position may be a nearest of a set of speaker positions.
The set of speaker positions in polar co-ordinates may be L=[Lr, Lθ, Lφ]=[1, −30, 0], R=[Rr, Rθ, Rφ]=[1, 30, 0], C=[Cr, Cθ, Cφ]=[1, 0, 0], Ls=[Lsr, Lsθ, Lsφ]=[1, −110, 0], and Rs=[Rsr, Rsθ, Rsφ]=[1, 110, 0].
Determining a perception value for each of the at least two object orientated signal channels may comprise: dividing each of the at least two object orientated signal channels into time parts; determining for each time part of the at least two object orientated signal channel Cx the following value:
Determining a perception value for each of the at least two object orientated signal channels may comprise: dividing each of the at least two object orientated signal channels into time-frequency parts; determining for each time-frequency part of the at least two object orientated signal channel Cx the following value:
where ∥Cx,b∥ is the energy level of the channel for frequency band Cx, ∥Cmax,b∥ the maximum energy level of the at least two channels at the time frequency part, ∥Cmin,b∥ the minimum energy level of the at least two channels at the time frequency part, and δx is the angular distance for the channel Cx to a nearest of a set of speakers.
Processing at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels may comprise: selecting a first set of the at least two object orientated audio signal channels, the first set of the at least two object orientated audio signal channels being the lower perceptually ordered channels; downmixing the first set of the at least two object orientated audio signal channels to a downmixed channel representation; and outputting the downmixed channel representation with the remainder of the at least two object orientated audio signal channels.
Processing at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels may comprise: selecting, for parts of the at least two object orientated audio signal channels, a highest perceptually ordered channel part; combining the selected highest perceptually ordered parts to generate a first audio signal; attenuating the highest perceptually ordered channel part of the at least two object orientated audio signal channels; combining the attenuated highest perceptually ordered channel parts with the remaining parts of the at least two object orientated audio signal channels to generate a second audio signal; and outputting the first audio signal and the second audio signal.
The parts may be frequency sub-bands and/or bands of time periods of the at least two object orientated audio signal channels.
According to a third aspect there is provided an apparatus comprising: means for perceptually ordering at least two object orientated audio signal channels; and means for processing at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels.
The means for perceptually ordering at least two object orientated audio signal channels may comprise: means for determining a perception value for each of the at least two object orientated signal channels; and means for perceptually ordering the at least two object orientated audio signal channels based on the perception value.
The means for determining a perception value for each of the at least two object orientated signal channels may comprise means for determining a perception value based on the distance difference between the channel and a defined position.
The defined position may be a nearest of a set of speaker positions.
The set of speaker positions in polar co-ordinates may be L=[Lr, Lθ, Lφ]=[1, −30, 0], R=[Rr, Rθ, Rφ]=[1, 30, 0], C=[Cr, Cθ, Cφ]=[1, 0, 0], Ls=[Lsr, Lsθ, Lsφ]=[1, −110, 0], and Rs=[Rsr, Rsθ, Rsφ]=[1, 110, 0].
The means for determining a perception value for each of the at least two object orientated signal channels may comprise: means for dividing each of the at least two object orientated signal channels into time parts; means for determining for each time part of the at least two object orientated signal channel Cx the following value:
The means for determining a perception value for each of the at least two object orientated signal channels may comprise: means for dividing each of the at least two object orientated signal channels into time-frequency parts; means for determining for each time-frequency part of the at least two object orientated signal channel Cx the following value:
where ∥Cx,b∥ is the energy level of the channel for frequency band Cx, ∥Cmax,b∥ the maximum energy level of the at least two channels at the time frequency part, ∥Cmin,b∥ the minimum energy level of the at least two channels at the time frequency part, and δx is the angular distance for the channel Cx to a nearest of a set of speakers.
The value of δx may be defined by
where L=[Lr, Lθ, Lφ]=[1, −30, 0], R=[Rr, Rθ, Rφ]=[1, 30, 0], C=[Cr, Cθ, Cφ]=[1, 0, 0], Ls=[Lsr, Lsθ, Lsφ]=[1, −110, 0], and Rs=[Rsr, Rsθ, Rsφ]=[1, 110, 0].
The means for processing at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels may comprise: means for selecting a first set of the at least two object orientated audio signal channels, the first set of the at least two object orientated audio signal channels being the lower perceptually ordered channels; means for downmixing the first set of the at least two object orientated audio signal channels to a downmixed channel representation; and means for outputting the downmixed channel representation with the remainder of the at least two object orientated audio signal channels.
The means for processing at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels may comprise: means for selecting, for parts of the at least two object orientated audio signal channels, a highest perceptually ordered channel part; means for combining the selected highest perceptually ordered parts to generate a first audio signal; means for attenuating the highest perceptually ordered channel part of the at least two object orientated audio signal channels; means for combining the attenuated highest perceptually ordered channel parts with the remaining parts of the at least two object orientated audio signal channels to generate a second audio signal; and means for outputting the first audio signal and the second audio signal.
The parts may be frequency sub-bands and/or bands of time periods of the at least two object orientated audio signal channels.
According to a fourth aspect there is provided an apparatus comprising: a perception sorter configured to perceptually order at least two object orientated audio signal channels; and a selective channel processor configured to process at least one of the at least two object orientated audio signal channels based on the order of the at least two object orientated audio signal channels.
The perception sorter may comprise: a perception determiner configured to determine a perception value for each of the at least two object orientated signal channels; and a perception metric sorter configured to perceptually order the at least two object orientated audio signal channels based on the perception value.
The perception determiner may be configured to determine a perception value based on the distance difference between the channel and a defined position.
The defined position may be a nearest of a set of speaker positions.
The set of speaker positions in polar co-ordinates may be L=[Lr, Lθ, Lφ]=[1, −30, 0], R=[Rr, Rθ, Rφ]=[1, 30, 0], C=[Cr, Cθ, Cφ]=[1, 0, 0], Ls=[Lsr, Lsθ, Lsφ]=[1, −110, 0], and Rs=[Rsr, Rsθ, Rsφ]=[1, 110, 0].
The perception determiner may be configured to: divide each of the at least two object orientated signal channels into time parts; determine for each time part of the at least two object orientated signal channel Cx the following value:
The perception determiner may be configured to: divide each of the at least two object orientated signal channels into time-frequency parts; determine for each time-frequency part of the at least two object orientated signal channel Cx the following value:
where ∥Cx,b∥ is the energy level of the channel for frequency band Cx, ∥Cmax,b∥ the maximum energy level of the at least two channels at the time frequency part, ∥Cmin,b∥ the minimum energy level of the at least two channels at the time frequency part, and δx is the angular distance for the channel Cx to a nearest of a set of speakers.
The value of δx may be defined by
where L=[Lr, Lθ, Lφ]=[1, −30, 0], R=[Rr, Rθ, Rφ]=[1, 30, 0], C=[Cr, Cθ, Cφ]=[1, 0, 0], Ls=[Lsr, Lsθ, Lsφ]=[1, −110, 0], and Rs=[Rsr, Rsθ, Rsφ]=[1, 110, 0].
The selective channel processor may comprise: a perception filter configured to select a first set of the at least two object orientated audio signal channels, the first set of the at least two object orientated audio signal channels being the lower perceptually ordered channels; a downmixer configured to downmix the first set of the at least two object orientated audio signal channels to a downmixed channel representation; and an output configured to output the downmixed channel representation with the remainder of the at least two object orientated audio signal channels.
The selective channel processor may comprise: a perception filter configured to select, for parts of the at least two object orientated audio signal channels, a highest perceptually ordered channel part; a mid channel generator configured to combine the selected highest perceptually ordered parts to generate a first audio signal; an attenuator configured to attenuate the highest perceptually ordered channel part of the at least two object orientated audio signal channels; a side channel generator configured to combine the attenuated highest perceptually ordered channel parts with the remaining parts of the at least two object orientated audio signal channels to generate a second audio signal; and an output configured to output the first audio signal and the second audio signal.
The parts may be frequency sub-bands and/or bands of time periods of the at least two object orientated audio signal channels.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
For better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective spatial object oriented audio signal format processing.
The concept as embodied in the examples described herein is to utilize object oriented audio signal formats, for example the Dolby Atmos audio format, in a mobile device. As described herein, computational limits and other resource capacity issues make it difficult, if not practically impossible, to apply object oriented audio signal formats such as the Atmos format in mobile devices with limited bandwidth, storage and processing capacities.
In such a manner a scalable version of object oriented audio signal formats can be generated. In such embodiments as described herein both the compactness of regular surround audio and most of the benefits from an object oriented audio format can be retained.
In this regard reference is first made to
The electronic device 10 may for example be a mobile terminal or user equipment of a wireless communication system when functioning as an audio capturer or format converting apparatus. In some embodiments the apparatus can be an audio server for supplying audio signals to a suitable player or audio recorder, such as an MP3 player, a media recorder/player (also known as an MP4 player), or any suitable portable apparatus for recording audio or video, such as an audio/video camcorder or a memory audio or video recorder.
The apparatus 10 can in some embodiments comprise an audio-video subsystem. The audio-video subsystem for example can comprise in some embodiments a microphone or array of microphones 11 for audio signal capture. In some embodiments the microphone or array of microphones can be a solid state microphone, in other words capable of capturing audio signals and outputting a suitable digital format signal. In some other embodiments the microphone or array of microphones 11 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or micro electrical-mechanical system (MEMS) microphone. In some embodiments the microphone 11 is a digital microphone array, in other words configured to generate a digital signal output (and thus not requiring an analogue-to-digital converter). The microphone 11 or array of microphones can in some embodiments output the audio captured signal to an analogue-to-digital converter (ADC) 14.
In some embodiments the apparatus can further comprise an analogue-to-digital converter (ADC) 14 configured to receive the analogue captured audio signal from the microphones and output the captured audio signal in a suitable digital form. The analogue-to-digital converter 14 can be any suitable analogue-to-digital conversion or processing means. In some embodiments the microphones are 'integrated' microphones containing both audio signal generating and analogue-to-digital conversion capability.
In some embodiments the apparatus 10 audio-video subsystem further comprises a digital-to-analogue converter 32 for converting digital audio signals from a processor 21 to a suitable analogue format. The digital-to-analogue converter (DAC) or signal processing means 32 can in some embodiments be any suitable DAC technology.
Furthermore the audio-video subsystem can comprise in some embodiments a speaker 33. The speaker 33 can in some embodiments receive the output from the digital-to-analogue converter 32 and present the analogue audio signal to the user. In some embodiments the speaker 33 can be representative of a multi-speaker arrangement, a headset, for example a set of headphones, or cordless headphones.
In some embodiments the apparatus audio-video subsystem comprises a camera 51 or image capturing means configured to supply to the processor 21 image data. In some embodiments the camera can be configured to supply multiple images over time to provide a video stream.
In some embodiments the apparatus audio-video subsystem comprises a display 52. The display or image display means can be configured to output visual images which can be viewed by the user of the apparatus. In some embodiments the display can be a touch screen display suitable for supplying input data to the apparatus. The display can be any suitable display technology, for example the display can be implemented by a flat panel comprising cells of LCD, LED, OLED, or ‘plasma’ display implementations.
Although the apparatus 10 is shown having both audio/video capture and audio/video presentation components, it would be understood that in some embodiments the apparatus 10 can comprise only the audio capture parts of the audio subsystem such that in some embodiments of the apparatus the microphone (for audio capture) is present.
In some embodiments the apparatus 10 comprises a processor 21. The processor 21 is coupled to the audio-video subsystem and specifically in some examples: the analogue-to-digital converter 14 for receiving digital signals representing audio signals from the microphone 11; the digital-to-analogue converter (DAC) 32 configured to output processed digital audio signals; the camera 51 for receiving digital signals representing video signals; and the display 52 configured to output processed digital video signals from the processor 21.
The processor 21 can be configured to execute various program codes. The implemented program codes can comprise for example audio-video recording and audio-video presentation routines. For example in some embodiments the processor is suitable for generating object oriented audio format signals and storing such a format. In some embodiments the program codes can be configured to perform audio format conversion as described herein.
In some embodiments the apparatus further comprises a memory 22. In some embodiments the processor is coupled to memory 22. The memory can be any suitable storage means. In some embodiments the memory 22 comprises a program code section 23 for storing program codes implementable upon the processor 21. Furthermore in some embodiments the memory 22 can further comprise a stored data section 24 for storing data, for example data that has been converted in accordance with the application or data to be encoded via the application embodiments as described later. The implemented program code stored within the program code section 23, and the data stored within the stored data section 24 can be retrieved by the processor 21 whenever needed via the memory-processor coupling.
In some further embodiments the apparatus 10 can comprise a user interface 15. The user interface 15 can be coupled in some embodiments to the processor 21. In some embodiments the processor can control the operation of the user interface and receive inputs from the user interface 15. In some embodiments the user interface 15 can enable a user to input commands to the electronic device or apparatus 10, for example via a keypad, and/or to obtain information from the apparatus 10, for example via a display which is part of the user interface 15. The user interface 15 can in some embodiments as described herein comprise a touch screen or touch interface capable of both enabling information to be entered to the apparatus 10 and further displaying information to the user of the apparatus 10.
In some embodiments the apparatus further comprises a transceiver 13. The transceiver in such embodiments can be coupled to the processor and configured to enable communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver 13 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling. For example in some embodiments the transceiver 13 can be configured to output the audio signals in a hybrid object orientated audio format or other format converted from the object orientated audio format.
The transceiver 13 can communicate with further apparatus by any suitable known communications protocol, for example in some embodiments the transceiver 13 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
In some embodiments the apparatus comprises a position sensor 16 configured to estimate the position of the apparatus 10. The position sensor 16 can in some embodiments be a satellite positioning sensor such as a GPS (Global Positioning System), GLONASS or Galileo receiver.
In some embodiments the positioning sensor can be a cellular ID system or an assisted GPS system.
In some embodiments the apparatus 10 further comprises a direction or orientation sensor. The orientation/direction sensor can in some embodiments be an electronic compass, an accelerometer, or a gyroscope, or the orientation can be determined from the motion of the apparatus using the positioning estimate.
It is to be understood again that the structure of the electronic device 10 could be supplemented and varied in many ways.
With respect to
In some embodiments the object oriented audio format processor comprises a perception sorter 101. The perception sorter 101 is configured to receive the object oriented audio format signal channels. There can be a significant number of channels; for example Dolby Atmos can use up to 200 individual channels.
The operation of receiving the object oriented audio format signals is shown in
The perception sorter 101 can then be configured to perceptually rate each of these channels and sort the channels according to the perception rating value.
The perception sorter 101 can then output the perception sorted channels Cp1 to CpN to a selective channel processor 103.
In some embodiments the object oriented audio format converter comprises a selective channel processor 103. The selective channel processor 103 can be configured to receive the perception sorted channel information and selectively process channels based on the perception sorted values.
The operation of selectively processing the object oriented audio format signals based on perception sort is shown in
The selective channel processor 103 can then output the converted channel signals according to the channel processing performed.
The operation of outputting the converted channel signals is shown in
With respect to
In some embodiments the perception sorter 101 comprises a signal segmenter 301. The signal segmenter 301 can in some embodiments be configured to receive the object oriented audio format signals.
The operation of receiving the object oriented audio format signals is shown in
In some embodiments the signal segmenter 301 is configured to segment the audio signals into short time segments. For example in some embodiments the short time segments are 20 ms segments. In some embodiments the short time segments are overlapping short time segments. In other words, each of the segments comprises an element of the preceding segment and an element of the succeeding segment. For example in some embodiments the short time segments are 20 ms segments which overlap 10 ms with the preceding short time segment and 10 ms with the succeeding short time segment.
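As a minimal sketch of this segmentation (assuming a 48 kHz sample rate; the 20 ms segment and 10 ms overlap figures are those given above, and all names are illustrative):

```python
import numpy as np

def segment_channel(samples, sample_rate=48000, seg_ms=20, hop_ms=10):
    """Split one channel into overlapping short time segments.

    Each 20 ms segment overlaps the preceding and the succeeding
    segment by 10 ms, as described above.
    """
    seg_len = int(sample_rate * seg_ms / 1000)   # 960 samples at 48 kHz
    hop_len = int(sample_rate * hop_ms / 1000)   # 480 samples at 48 kHz
    n_segments = 1 + (len(samples) - seg_len) // hop_len
    if n_segments < 1:
        return np.empty((0, seg_len))
    return np.stack([samples[i * hop_len:i * hop_len + seg_len]
                     for i in range(n_segments)])
```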
In some embodiments the signal segmenter 301 is configured to output the segmented time domain short time segments to an energy level determiner 303. In the example shown in
The operation of segmenting the object oriented audio format signals into short time segments is shown in
In some embodiments the signal segmenter 301 is further configured to segment the object oriented audio format signals in the frequency domain as well as in the time domain. In such embodiments the short time segments can be converted by a suitable Time-to-Frequency domain converter. The Time-to-Frequency Domain Transformer or suitable transformer means can be configured to perform any suitable time-to-frequency domain transformation on the segmented or frame audio data. In some embodiments the Time-to-Frequency Domain Transformer can be a Discrete Fourier Transformer (DFT). However the Transformer can be any suitable Transformer such as a Discrete Cosine Transformer (DCT), a Modified Discrete Cosine Transformer (MDCT), a Fast Fourier Transformer (FFT) or a quadrature mirror filter (QMF). The Time-to-Frequency Domain Transformer can be configured to output a frequency domain signal for each channel to a sub-band filter.
In some embodiments the signal segmenter comprises a sub-band filter configured to sub-band or band filter the frequency domain short time segment or frame representations. In other words, for each of the channels C1 to CN, channel representations C1,1 to C1,B through CN,1 to CN,B are generated, where N is the number of input channels and B the number of sub-bands for each channel. The sub-band filter or suitable means can be configured to receive the frequency domain signals from the Time-to-Frequency Domain Transformer and divide each frequency domain representation signal into a number of sub-bands.
The sub-band division can be any suitable sub-band division. For example in some embodiments the sub-band filter can be configured to operate using psychoacoustic filtering bands. The sub-band filter can then be configured to output each frequency domain sub-band to the energy level determiner 303.
In some embodiments the perception sorter 101 comprises an energy level determiner 303. The energy level determiner 303 can be configured to receive the representations (either in the time domain Ca or frequency domain Ca,b) and can determine energy levels for the object oriented audio format channel signals ∥Ca∥ or ∥Ca,b∥. The energy level determiner 303 can then be configured to further determine the ‘loudest’ channel value ∥Cmax∥ and the quietest channel value ∥Cmin∥ from the energy of the signal for each signal segment.
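A minimal sketch of the energy determination for one segmented channel frame, assuming DFT-based sub-band filtering as described above; the band edges here are illustrative, not those of any particular psychoacoustic scale:

```python
import numpy as np

def band_energies(segment, band_edges=(0, 8, 32, 128, 481)):
    """Return the energy of one short time segment in each sub-band.

    `segment` is one 20 ms frame of one channel (960 samples at 48 kHz,
    so the rfft yields 481 bins); the bands group DFT bins in a roughly
    psychoacoustic (widening) manner.
    """
    spectrum = np.fft.rfft(segment * np.hanning(len(segment)))
    power = np.abs(spectrum) ** 2
    return np.array([power[lo:hi].sum()
                     for lo, hi in zip(band_edges[:-1], band_edges[1:])])

# Energies for all channels of one segment; the loudest and quietest
# channel levels ||Cmax|| and ||Cmin|| are then simple reductions:
# energies = np.array([band_energies(seg) for seg in channel_segments])
# c_max, c_min = energies.sum(axis=1).max(), energies.sum(axis=1).min()
```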
The energy level determiner 303 can then be configured to output the channels to the perception determiner 305 and further to the perception sorter 307.
The operation of determining the energy levels for the object oriented audio format signals is shown in
In some embodiments the perception sorter 101 comprises a perception determiner 305. The perception determiner 305 is configured to receive the channels Ca (or frequency domain Ca,b) and the energy levels for the object oriented audio format channel signals ∥Ca∥ (or ∥Ca,b∥), and from these determine a perceptual importance value which can be used to sort the object oriented audio format signals into a suitable order. Perceptually the most important channels are the loudest ones and those that are meant to be played from a position away from the speakers of a defined (such as 5.1 format) downmix. These positions include, for example, above or below the listener or straight behind, as such channels are not properly expressed by a 5.1 downmix, which has no height (azimuth) information and no speaker straight behind.
In some embodiments the perception determiner 305 is configured to generate a perception value for a channel Cx short time segment according to the following equation:
where δx for channel Cx can be defined as the angular distance δ from the channel's trajectory direction point P=[r, θ, φ] to the nearest speaker as follows:
where for a 5.1 multichannel system
L=[Lr, Lθ, Lφ]=[1, −30, 0]
R=[Rr, Rθ, Rφ]=[1, 30, 0]
C=[Cr, Cθ, Cφ]=[1, 0, 0]
Ls=[Lsr, Lsθ, Lsφ]=[1, −110, 0]
Rs=[Rsr, Rsθ, Rsφ]=[1, 110, 0],
and where the numbers are radius, polar angle and azimuth. We can assume the radius to be 1 without loss of generality.
The angular distance can be at minimum 0 and at maximum 90 degrees.
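A sketch combining the angular distance and the perception value for one segment; since the original equations are not reproduced in this text, the product form below (normalised energy scaled by normalised angular distance) is an assumption consistent with the stated definitions, and all names are illustrative:

```python
import numpy as np

# 5.1 speaker directions as (polar angle theta, azimuth phi) in degrees
SPEAKERS_51 = [(-30, 0), (30, 0), (0, 0), (-110, 0), (110, 0)]

def angular_distance(theta, phi, speakers=SPEAKERS_51):
    """Great-circle distance (degrees) from a trajectory direction to the
    nearest speaker; 0..90 degrees for a 5.1 layout."""
    t, p = np.radians(theta), np.radians(phi)
    dists = [np.degrees(np.arccos(np.clip(
                 np.cos(p) * np.cos(np.radians(sp)) * np.cos(t - np.radians(st))
                 + np.sin(p) * np.sin(np.radians(sp)), -1.0, 1.0)))
             for st, sp in speakers]
    return min(dists)

def perce(energy, e_min, e_max, delta):
    """Perception value: loud channels far from any 5.1 speaker rank highest."""
    loudness = (energy - e_min) / (e_max - e_min) if e_max > e_min else 0.0
    return loudness * delta / 90.0
```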
The perception determiner 305 can then be configured to output the perception values perce(Cx) to the perception sorter 307.
The determination of the perception metric for each of the channels is shown in
In some embodiments it would be understood that the perception determiner is configured to determine a perception value associated with each of the channel sub-bands. In such embodiments the perception determiner 305 is configured to generate a perception value for a channel Cx,b short time segment for channel x and sub-band b according to the following equation:
where ∥CMAX,b∥ and ∥CMIN,b∥ are the energies of bands b in the channels that have the largest and smallest energy in band b respectively.
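Under the same assumption as the time-part form above, the band-wise value referenced here can be written:

$$\operatorname{perce}(C_{x,b})=\frac{\lVert C_{x,b}\rVert-\lVert C_{MIN,b}\rVert}{\lVert C_{MAX,b}\rVert-\lVert C_{MIN,b}\rVert}\cdot\frac{\delta_x}{90}$$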
In some embodiments the perception sorter 101 comprises a perception metric sorter 307 configured to receive the channels and the perception values associated with each of these channels. The perception metric sorter 307 can then be configured to sort the channels according to the perception metric value. Thus in some embodiments the perception metric sorter 307 can be configured to output the channels and associated trajectory information to the selective channel processor 103 in a form where the selective channel processor 103 is able to determine the order of perceptually important channels.
The operation of sorting the object oriented audio format signals based on the perception metric is shown in
The operation of outputting the object oriented audio formats signals based on perception based sort is shown in
With respect to
The determination of available resources such as bit rate/storage/processing capacity is shown in
In some embodiments the selective channel processor 103 comprises a perception filter 503. The perception filter 503 is configured to receive the perception sorted object-oriented audio signal channels CP1 to CPN and filter the object-oriented audio format signal channels based on the determined available resources. In some embodiments the perception filter 503 is configured to filter the channels into high perception channels and low perception channels. The selection of the number of channels to be filtered is based on the available resources.
The perception filter 503 therefore can output the low perceptual channels CY1 to CYK to a downmixer 505 while passing the high perceptual channels CX1 to CXH to be output directly.
The operation of filtering the object-oriented audio format signal channels based on the available resources based on the perception values into high perception and low perceptual channels is shown in
Furthermore the outputting of the high perception channels directly is shown in
In some embodiments the selective channel processor 103 comprises a downmixer 505. The downmixer 505 is configured to receive the low perceptual channels CY1 to CYK and downmix these channels with their associated trajectories into a defined number of output channels. For example the downmixer 505 can be configured to output a 5.1 channel configuration with a left (L), right (R), centre (C), left surround (Ls), and right surround (Rs) speakers and associated sub-woofer or ambience signal. However it would be understood that the downmixer 505 can be configured to output any suitable stereo or multichannel output signal.
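A minimal sketch of this filtering and downmixing stage, assuming the resource determination yields a channel budget H and using simple nearest-speaker assignment for the downmix (the actual downmix method and all names here are illustrative assumptions):

```python
import numpy as np

def filter_and_downmix(sorted_channels, trajectories, budget_h, n_out=5):
    """Pass the `budget_h` most important channels through unchanged and
    downmix the remaining low-importance channels into `n_out` speakers.

    `sorted_channels`: list of sample arrays, highest perception first.
    `trajectories`: matching list of (theta, phi) directions in degrees.
    """
    high = sorted_channels[:budget_h]                 # output as-is
    low = sorted_channels[budget_h:]
    downmix = np.zeros((n_out, len(sorted_channels[0])))
    speaker_thetas = np.array([-30.0, 30.0, 0.0, -110.0, 110.0])
    for samples, (theta, _phi) in zip(low, trajectories[budget_h:]):
        # nearest speaker by wrapped angular difference
        diff = (speaker_thetas - theta + 180.0) % 360.0 - 180.0
        downmix[int(np.argmin(np.abs(diff)))] += samples
    return high, downmix
```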
The operation of down mixing the low perception channels to a small number of channels such as five channels or two channels is shown in
The downmixer 505 can then output the downmixed channels. The operation of outputting the downmixed channels is shown in
In such a manner the number of channels is significantly reduced such that apparatus configured to receive the channels can process the hybrid audio format and play back the audio in such a way that the channels can be rendered using limited resources.
With respect to
The selective channel processor 103 in some embodiments comprises a perception filter 703. The perception filter 703 is configured to receive each of the channels in the form of sorted sub-band object oriented audio format signal channels.
The operation of receiving sorted sub-band object-oriented audio format signal channels is shown in
The perception filter can then be configured to select from all of the channel sub-bands the channel sub-band which has the highest perceptual importance, in other words the highest perceptual metric value, and pass this component to a mid channel generator 705. Thus for example where channel CP1 had the most important 1st band and CP2 had the most important 2nd band, the mid channel generator receives the components CP1,1, CP2,2, . . . , CPB,B.
The operation of selecting for each sub-band the most perceptually important channel component is shown in
Furthermore for the same channel elements the perception filter can be configured to attenuate the most perceptually important channel sub-band components by a factor α, where 0 ≤ α ≤ 1. The value of α can in some embodiments be determined manually and is a compromise between possible artefacts and the directionality effect.
The attenuated perceptually important channel sub-band components and the other, non-important channel components are passed to a side channel generator 706. In other words, using the above example, the output to the side channel generator is channel CP1′ where CP1′=[αCP1,1, CP1,2, . . . , CP1,B], and channel CP2′ where CP2′=[CP2,1, αCP2,2, . . . , CP2,B].
The operation of attenuating the most perceptually important channel components is shown in
In some embodiments the selective channel processor 103 comprises a mid channel generator 705. The mid channel generator 705 is configured to receive from the perception filter the most perceptually important channel sub-band components. The mid channel generator 705 can then be configured to combine these to generate a mid signal. Thus according to the example shown above the mid signal is generated from the sub-band components according to M=[CP1,1, CP2,2, . . . , CPB,B].
The operation of generating the mid signal from the combination of the most perceptually important channel sub-bands is shown in
The mid channel generator 705 can then be configured to output the mid signal M.
The operation of outputting the mid signal is shown in
In some embodiments the selective channel processor 103 comprises a side channel generator 706. The side channel generator 706 is configured to combine the attenuated most perceptually important channel sub-band components with the other sub-band components to form the side signal. Using the above example the side signal is generated from
S = CP1′ + CP2′ + . . . + CPN′
The operation of combining the attenuated perceptually important and other sub-bands to form the side signal is shown in
Furthermore the side channel generator 706 can then be configured to output the side signal S.
The operation of outputting the side signal is shown in
It would be understood that in some embodiments the mid signal generator is further configured to output the object trajectory information associated with each of the perceptual important sub-bands.
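A minimal sketch of this mid/side construction, assuming the per-band perception values from the earlier stage are available as an array; the attenuation factor α and the band-wise selection follow the description above, while all names are illustrative:

```python
import numpy as np

def mid_side(channel_bands, perce_bands, alpha=0.5):
    """Build mid and side signals from per-channel sub-band components.

    `channel_bands`: array of shape (N, B) sub-band components for one
    segment; `perce_bands`: array of shape (N, B) perception values.
    For each band the most important channel's component goes to the
    mid signal and is attenuated by alpha in its own channel; the side
    signal is the sum of all (attenuated) channel components.
    """
    n_channels, n_bands = channel_bands.shape
    best = perce_bands.argmax(axis=0)                 # winning channel per band
    mid = channel_bands[best, np.arange(n_bands)].copy()
    attenuated = channel_bands.copy()
    attenuated[best, np.arange(n_bands)] *= alpha
    side = attenuated.sum(axis=0)
    return mid, side
```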
The output mid and side signals can be rendered and output on a suitable playback device. For example in some embodiments a playback device can comprise a decoder which receives the mid signal and the side signal, and the associated direction information (the trajectory information).
In such playback apparatus the mid, side and directional information is rendered according to the suitable output format. For example, in a stereo output the following operations can be performed to generate left and right channel signals for the audio output. In some embodiments an HRTF can be applied to the low frequency components of the mid signal: for sub-band b at segment n, the mid signal Mb(n) and the directional component αb give

$$\tilde M_{Lb}(n)=M_b(n)\,H_{L,\alpha_b}(n)$$

$$\tilde M_{Rb}(n)=M_b(n)\,H_{R,\alpha_b}(n)$$

The usage of HRTFs is straightforward: for a direction (angle) β, there are HRTF filters for the left and right ears, HLβ(z) and HRβ(z), respectively. A binaural signal with sound source S(z) in direction β is generated as L(z)=HLβ(z)S(z) and R(z)=HRβ(z)S(z), where L(z) and R(z) are the input signals for the left and right ears.
The same filtering can be performed in the DFT domain as presented above. For the sub-bands at higher frequencies the processing is as follows.
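The original expression for this higher-band processing is not reproduced in this text; a plausible form, consistent with the magnitude-only filtering and the fixed delay τHRTF described in the next paragraph (the delay term written here is an assumption), is:

$$\tilde M_{Lb}(n)=M_b(n)\,\lvert H_{L,\alpha_b}(n)\rvert\,e^{-j2\pi f_b\tau_{HRTF}},\qquad\tilde M_{Rb}(n)=M_b(n)\,\lvert H_{R,\alpha_b}(n)\rvert\,e^{-j2\pi f_b\tau_{HRTF}}$$

where f_b is the normalised centre frequency of sub-band b.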
In these embodiments it can be seen that only the magnitude part of the HRTF filters is used; in other words, the delays are not modified. On the other hand, a fixed delay of τHRTF samples is added to the signal. This is used because the processing of the low frequencies introduces a delay to the signal. In some embodiments, to avoid a mismatch between low and high frequencies, this delay needs to be compensated. τHRTF is the average delay introduced by HRTF filtering, and it has been found that delaying all the high frequencies by this average delay provides good results. The value of the average delay is dependent on the distance between sound sources and microphones in the HRTF set used.
The side signal does not have any directional information, and thus no HRTF processing is needed. However in some embodiments the delay caused by the HRTF filtering has to be compensated also for the side signal; this is done similarly as for the high frequencies of the mid signal.
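Under the same assumption as above, this compensation applies only the fixed delay:

$$\tilde S_b(n)=S_b(n)\,e^{-j2\pi f_b\tau_{HRTF}}$$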
For the side signal, the processing is the same for low and high frequencies.
The mid and side signals are then in some embodiments combined to determine left and right output channel signals. As HRTF filtering typically amplifies or attenuates certain frequency regions in the signal, the amplitudes of the mid and side signals may not correspond to each other. Therefore in some embodiments the average energy of the mid signal is returned to the original level, while still maintaining the level difference between the left and right channels. In one approach, this is performed separately for every subband.
The scaling factor for subband b is obtained, and the scaled mid signal is then computed from it.
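A plausible form, under the assumption that the combined energy of the left and right mid signals is restored to twice the original mid-signal energy (which preserves the level difference between the channels), is:

$$\gamma_b=\sqrt{\frac{2\,\lVert M_b\rVert^2}{\lVert\tilde M_{Lb}\rVert^2+\lVert\tilde M_{Rb}\rVert^2}},\qquad\bar M_{Lb}(n)=\gamma_b\,\tilde M_{Lb}(n),\quad\bar M_{Rb}(n)=\gamma_b\,\tilde M_{Rb}(n)$$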
The synthesized mid and side signals are then ready to be combined into the output channels.
In some embodiments the externalization of the output signal can be further enhanced by means of decorrelation. In an embodiment, decorrelation is applied only to the side signal, which represents the ambience part. Many kinds of decorrelation methods can be used; described here is a method applying an all-pass type of decorrelation filter to the synthesized binaural signals.
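A common form for such an all-pass decorrelation filter, used here as an assumed stand-in for the exact filter of the embodiments, is:

$$D(z)=\frac{\beta+z^{-P}}{1+\beta z^{-P}}$$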
where P is set to a fixed value, for example 50 samples for a 32 kHz signal. The parameter β is assigned opposite values for the two channels; for example, 0.4 is a suitable magnitude for β. It would be understood that there is a different decorrelation filter for each of the left and right channels.
The output left and right channels are now obtained in some embodiments, with the mid signals delayed by P samples to remain aligned with the decorrelated side signal, as:

$$L(z)=z^{-P}\tilde M_L(z)+D_L(z)\tilde S(z)$$

$$R(z)=z^{-P}\tilde M_R(z)+D_R(z)\tilde S(z)$$

where DL(z) and DR(z) denote the left and right decorrelation filters.
It shall be appreciated that the term user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers, as well as wearable devices.
Furthermore elements of a public land mobile network (PLMN) may also comprise apparatus as described above.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, Calif. and Cadence Design, of San Jose, Calif. automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or “fab” for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/IB2013/054044 | May 17, 2013 | WO | 00
Publishing Document | Publishing Date | Country | Kind
---|---|---|---
WO2014/184618 | Nov. 20, 2014 | WO | A
Number | Date | Country
---|---|---
20160119733 A1 | Apr. 2016 | US