This application claims priority to Great Britain Patent Application No. 2115768.0, filed Nov. 3, 2021, the entire contents of which are incorporated herein by reference.
Embodiments of the present disclosure relate to spatial audio. In particular, some embodiments relate to the transmission of audio signals between a transmitter apparatus and a receiver apparatus.
Spatial audio adapts with the changing point of view of a user. For example, for headphone listening, spatial audio can be rotated as a user turns his or her head.
Human beings are very good at detecting sound source directions. Human beings use a changing point of view, for example a head rotation, to improve detection of a sound source direction. For example, a user can rotate his or her head to get the desired sound to a central position where the user's sound source direction detection ability is best. Also head rotation can be used to distinguish between sound sources in front and behind a user. With a left-to-right head rotation sound sources in front move right-to-left whereas sound sources behind move left-to-right.
In existing solutions, point of view data that tracks the user's point of view is transmitted to or obtained by a transmitter apparatus, which modifies the audio signals to rotate an audio scene according to the point of view data. The transmitter apparatus then low-bit-rate encodes the audio signal and sends the coded audio to a receiver apparatus for rendering. In some examples, the receiver apparatus can be headphones. The receiver apparatus decodes the audio and plays it back to the user. These steps can cause delays in rendering the modified audio to the user in response to a change in point of view of the user. Typically, the delay can be several hundreds of milliseconds. As a consequence, the sound source directions can appear to lag. It would be desirable to reduce the delay.
According to various, but not necessarily all, embodiments there is provided an apparatus, for enabling adaptive playback, comprising means configured to: obtain, for a first point of view, a first audio signal for at least a first channel and a second channel;
obtain, for a second point of view, a second audio signal for at least the first channel and the second channel;
determine a single-channel difference audio signal, for the second point of view, based on at least a difference between the first audio signal and the second audio signal; and enable estimation of both the first channel and the second channel of the second audio signal for the second point of view in dependence on the single-channel difference audio signal and the first audio signal.
According to some but not necessarily all examples the means configured to determine a single-channel difference audio signal, for the second point of view, based on a difference between the first audio signal and the second audio signal, is configured to determine a difference between a reference channel of the first audio signal and the second audio signal, wherein the reference channel is the first channel, the second channel or a composition channel based on the first channel and the second channel, and
wherein the means configured to enable estimation of both the first channel and the second channel of the second audio signal enables estimation in dependence on the single-channel difference audio signal and the reference channel of the first audio signal.
According to some but not necessarily all examples the means configured to determine a single-channel difference audio signal, for the second point of view, is configured to determine a difference between the first audio signal and the second audio signal, in a time domain.
According to some but not necessarily all examples, the apparatus further comprises smoothing means configured to smooth the single-channel difference audio signal in a frequency domain to obtain a smoothed single-channel difference audio signal and to enable estimation of at least the second audio signal in dependence upon the smoothed single-channel difference audio signal and the first audio signal.
According to some but not necessarily all examples the smoothing means is configured to replicate frequency bins within a frequency band, for one or more different frequency bands.
According to some but not necessarily all examples the smoothing means is configured for dynamic smoothing, wherein the dynamic smoothing of the single-channel difference audio signal, based on at least the difference between the first audio signal for the first point of view and the second audio signal for the second point of view, is dependent upon a likelihood of a change in point of view from the first point of view to the second point of view.
According to some but not necessarily all examples the apparatus comprises means configured to:
when the second point of view is offset from the first point of view by a first angle in a positive sense and a third point of view is offset from the first point of view by the first angle in a negative sense, obtain the single-channel difference audio signal for the second point of view but not for the third point of view.
According to various, but not necessarily all, embodiments there is provided a method, for enabling adaptive playback, comprising:
obtaining, for a first point of view, a first audio signal for at least a first channel and a second channel;
obtaining, for a second point of view, a second audio signal for at least the first channel and the second channel;
determining a single-channel difference audio signal, for the second point of view, based on at least a difference between the first audio signal and the second audio signal,
enabling estimation of both the first channel and the second channel of the second audio signal for the second point of view in dependence on the single-channel difference audio signal for the second point of view and the first audio signal.
According to various, but not necessarily all, embodiments there is provided an apparatus, for adaptive playback, comprising means configured to:
obtain a single-channel difference audio signal, for a second point of view, dependent on at least a difference between a first audio signal for a first point of view and a second audio signal for a second point of view; and
estimate a first channel and a second channel of the second audio signal for the second point of view in dependence on the single-channel difference audio signal and the first audio signal.
According to some but not necessarily all examples the apparatus comprises means configured to obtain the single-channel difference audio signal, for the second point of view, in the time domain, being dependent on at least a difference, in the time domain, between the first audio signal for the first point of view and the second audio signal for the second point of view; and
estimate a first channel and a second channel of the second audio signal for the second point of view, in the time domain, in dependence on the single-channel difference audio signal, in the time domain, and the first audio signal, in the time domain.
According to some but not necessarily all examples, the apparatus is configured, if the second point of view corresponds to a head rotation relative to the first point of view, to estimate the second audio signal at least based on an addition involving the single-channel difference audio signal and one of the first and second channels of the first audio signal and a subtraction involving the single-channel difference audio signal and the other of the first and second channels of the first audio signal.
According to some but not necessarily all examples, the apparatus is configured, if the second point of view corresponds to a head translation relative to the first point of view, to estimate the second audio signal at least based on an addition involving the single-channel difference audio signal and one of the first and second channels of the first audio signal and an addition involving the single-channel difference audio signal and the other of the first and second channels of the first audio signal, or to estimate the second audio signal at least based on a subtraction involving the single-channel difference audio signal and one of the first and second channels of the first audio signal and a subtraction involving the single-channel difference audio signal and the other of the first and second channels of the first audio signal.
According to some but not necessarily all examples, the apparatus comprises means configured to:
when the second point of view is offset from the first point of view by a first angle in a positive sense and a third point of view is offset from the first point of view by the first angle in a negative sense, re-use an inverse of the single-channel difference audio signal for the second point of view as a single-channel difference audio signal for the third point of view.
According to various, but not necessarily all, embodiments there is provided a method comprising:
obtaining a single-channel difference audio signal, for a second point of view, dependent on at least a difference between a first audio signal for a first point of view and a second audio signal for a second point of view; and
estimating a first channel and a second channel of the second audio signal for the second point of view in dependence on the single-channel difference audio signal and the first audio signal.
According to various, but not necessarily all, embodiments there is provided an apparatus, for enabling adaptive playback, comprising means for:
obtaining, for a first point of view, a first audio signal;
obtaining, for a second point of view, a second audio signal;
determining, for the second point of view, at least a difference audio signal based on a difference, between the first audio signal and the second audio signal;
smoothing the difference audio signal in the frequency domain to obtain a smoothed difference audio signal;
enabling estimation of at least the second audio signal in dependence upon the smoothed difference audio signal and the first audio signal.
According to various, but not necessarily all, embodiments there is provided a method comprising:
obtaining, for a first point of view, a first audio signal;
obtaining, for a second point of view, a second audio signal;
determining, for the second point of view, at least a difference audio signal based on a difference, between the first audio signal and the second audio signal;
smoothing the difference audio signal in the frequency domain to obtain a smoothed difference audio signal;
enabling estimation of at least the second audio signal in dependence upon the smoothed difference audio signal and the first audio signal.
According to various, but not necessarily all, embodiments there are provided examples as claimed in the appended claims.
Some examples will now be described with reference to the accompanying drawings in which:
The transmitter apparatus 20 comprises means configured to: obtain, for a first point of view 401, a first audio signal 601; obtain, for a second point of view 402, a second audio signal 602; determine, for the second point of view 402, at least a difference audio signal 70 based on a difference between the first audio signal 601 and the second audio signal 602; and enable estimation of at least the second audio signal 602 in dependence upon the difference audio signal 70 and the first audio signal 601.
The enablement of the estimation of at least the second audio signal 602 can, for example, be achieved by transmitting the difference audio signal 70 via the interface 12 to the receiver apparatus 30.
The receiver apparatus 30 comprises means configured to:
obtain a difference audio signal 70, for a second point of view 402, dependent on at least a difference between a first audio signal 601 for a first point of view 401 and a second audio signal 602 for a second point of view 402; and estimate the second audio signal 602 for the second point of view 402 in dependence on the difference audio signal 70 and the first audio signal 601.
The difference audio signal 70 can be defined in various different ways.
In this example, the transmitter apparatus 20 is configured to enable estimation of both the first channel 51 and the second channel 52 of the second audio signal 602 for the second point of view 402 in dependence on the difference audio signal 70 and the first audio signal 601. Also, the receiver apparatus 30 is configured to estimate a first channel 51 and a second channel 52 of the second audio signal 602 for the second point of view 402 in dependence on the difference audio signal 70 and the first audio signal 601.
In this example, the difference audio signal 70 can have various different forms. For example, let us use the following abbreviations:
Firstly, the number of channels in the signals around the delayed user view direction is reduced. As an example, one of the following three formulas may be used:
The difference audio signal 70, based on a difference between the first audio signal 601 and the second audio signal 602, can be considered a difference between a reference channel of the first audio signal 601 and the second audio signal 602, wherein the reference channel is the first channel 51, the second channel 52 or a composition channel based on the first channel 51 and the second channel 52.
In Equation 1, the reference channel is right minus left. R0−L0 is the reference channel of the first audio signal and R90−L90 is the reference channel of the second audio signal. The difference between the reference channel of the first audio signal and the reference channel of the second audio signal is X90.
In Equation 2, the reference channel is the left channel L. L0 is the reference channel of the first signal, L90 is the reference channel of the second audio signal and the difference between the reference channel of the first audio signal and the reference channel of the second audio signal is X90.
In Equation 3, the reference channel is the right channel R. R0 is the reference channel of the first audio signal and R90 is the reference channel of the second audio signal. The difference between the reference channel of the first audio signal and the reference channel of the second audio signal is X90.
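The three reference-channel choices described above can be illustrated with a minimal time-domain sketch. The function names, the list-based sample representation and, in particular, the sign convention (second point of view minus first point of view) are assumptions made for illustration only and are not taken from the equations referred to above.

```python
from typing import List, Sequence, Tuple

# A two-channel signal as (left samples, right samples). Hypothetical representation.
Stereo = Tuple[Sequence[float], Sequence[float]]


def reference_channel(signal: Stereo, mode: str) -> List[float]:
    """Reduce a two-channel signal to a single reference channel."""
    left, right = signal
    if mode == "left":
        return list(left)
    if mode == "right":
        return list(right)
    if mode == "right_minus_left":
        # Composition channel based on both channels.
        return [r - l for l, r in zip(left, right)]
    raise ValueError(f"unknown reference mode: {mode}")


def single_channel_difference(first: Stereo, second: Stereo, mode: str) -> List[float]:
    """Sample-by-sample difference between the reference channels.

    Assumed sign convention: second point of view minus first point of view.
    """
    ref_first = reference_channel(first, mode)
    ref_second = reference_channel(second, mode)
    return [b - a for a, b in zip(ref_first, ref_second)]
```

Whatever sign convention is chosen at the transmitter apparatus 20, the receiver apparatus 30 would need to apply the matching convention when combining the difference with the first audio signal 601.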
It will be appreciated from
In the example illustrated in
In the example of
The term single-channel difference audio signal 70 can be replaced by:
“single-channel, difference audio signal 70”
“a single channel representation 70 of a difference audio signal” or
“a channel 70 comprising a difference audio signal” or
“a difference audio signal 70, configured to be transmitted in a single channel”
The difference can be based on one or more channels, but the representation of the difference is single channel.
Referring to the example of
In the example of
obtain, for a first point of view 401, a first audio signal 601 for at least a first channel 51 and a second channel 52;
obtain, for a second point of view 402, a second audio signal 602 for at least the first channel 51 and the second channel 52;
determine a single-channel difference audio signal 70, for the second point of view 402 based on at least a difference between the first audio signal 601 and the second audio signal 602; and
enable estimation of both the first channel 51 and the second channel 52 of the second audio signal 602 for the second point of view 402 in dependence on the single-channel difference audio signal 70 for the second point of view 402 and the first audio signal 601.
The receiver apparatus 30 comprises means configured to:
obtain a single-channel difference audio signal 70, for a second point of view 402, dependent on at least a difference between a first audio signal 601 for a first point of view 401 and a second audio signal 602 for a second point of view 402; and estimate a first channel 51 and a second channel 52 of the second audio signal 602 for the second point of view 402 in dependence on the single-channel difference audio signal 70 for the second point of view 402 and the first audio signal 601.
In the example illustrated in
As previously described, the difference means 22 used to determine the difference audio signal 70, for example the single-channel difference audio signal 70, for the second point of view 402 based on a difference between the first audio signal 601 and the second audio signal 602, is configured to determine a difference between a reference channel of the first audio signal 601 and the second audio signal 602, wherein the reference channel is the first channel 51, the second channel 52 or a composition channel based on the first channel 51 and the second channel 52. The estimator 32 is configured to estimate both the first channel 51 and the second channel 52 of the second audio signal 602 in dependence on the single-channel difference audio signal 70 and the reference channel of the first audio signal 601.
In the example illustrated in
In the transmitter apparatus 20, the means configured to determine a single-channel difference audio signal 70, for the second point of view 402, is configured to determine a difference between the first audio signal 601 and the second audio signal 602, in a time domain or in a frequency domain.
The advantage of determining the difference in the time domain is that it provides lower latency. In the example where the time domain difference is used, the receiver apparatus 30 comprises means configured to obtain the single-channel difference audio signal 70, for the second point of view 402, in the time domain, being dependent on at least a difference, in the time domain, between the first audio signal 601 for the first point of view 401 and the second audio signal 602 for the second point of view 402; and
estimate a first channel 51 and a second channel 52 of the second audio signal 602 for the second point of view 402, in the time domain, in dependence on the single-channel difference audio signal 70, in the time domain, and the first audio signal 601, in the time domain.
An advantage of determining the difference in the frequency domain is that the difference can be used for only a part of the available frequencies. For example, the difference may be calculated for high frequencies, whereas for low frequencies the signal for a corresponding point of view is sent as such.
At block 206, the enabling of the estimation can be provided by transmitting the single-channel difference audio signal 70 from the transmitter apparatus 20 to the receiver apparatus 30.
The method 200 can be performed by the transmitter apparatus 20.
At block 214, the method 210 comprises estimating a first channel 51 and a second channel 52 of the second audio signal 602 for the second point of view 402 in dependence on the single-channel difference audio signal 70 and the first audio signal 601. The single-channel difference audio signal 70 can be the single-channel difference audio signal 70 for the second point of view 402.
In the preceding examples, a single alternative point of view 402 and a single second audio signal 602 have been described. However, the preceding description can be used with any number of different points of view 40i and corresponding audio signals 60i. Thus, although the preceding examples illustrate a primary stream associated with the first point of view 401 and the first audio signal 601, and a single side stream associated with the second point of view 402 and the second audio signal 602, in other examples there may be multiple such side streams, each of which is associated with a different point of view 40i and corresponding audio signal 60i.
The information that indicates the direction of the primary stream and the side streams may be communicated between the transmitter apparatus 20 and the receiver apparatus 30.
Typically, the side stream directions could be +/−20°, 40°, 60°, 90°, 120° left (positive) or right (negative) of the primary stream direction. This gives enough directions so that switching between the different streams would not cause audible problems and that the directions are far enough left and right so that even if a user moves his point of view quickly there would typically be a side stream that is near the changed user point of view. For many use cases, such as watching movies or other non-360° content, a smaller number of side streams would typically suffice. For example, only the primary stream and any single 30° side stream could be used.
In some examples the selection of how the primary and side streams are rendered to the user is done after all the streams have been decoded. In this example, all the audio samples are available in the time domain and selection can be done sample by sample. In alternative examples, the selection can be done before decoding, thereby saving processing power because not all streams need to be decoded. However, for this option the delay will be longer (because of the audio decoding delay). This may be reduced by using a lower-latency audio coder/decoder for the side streams.
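One simple way to select a stream for a changed point of view is to pick the available stream direction nearest to the user's current offset from the primary stream direction. The following is a minimal sketch only; the function name, the nearest-direction rule and the representation of directions as signed offsets in degrees (positive for left, negative for right) are assumptions for illustration.

```python
from typing import Sequence


def nearest_stream_direction(view_offset_deg: float,
                             directions_deg: Sequence[float]) -> float:
    """Return the available stream direction closest to the user's view offset.

    Offsets are relative to the primary stream direction (0 degrees) and are
    assumed to lie within the range covered by the available directions, so no
    wrap-around at +/-180 degrees is handled in this sketch.
    """
    if not directions_deg:
        raise ValueError("at least one stream direction is required")
    return min(directions_deg, key=lambda d: abs(d - view_offset_deg))
```

For example, with the primary stream at 0° and side streams at +/−20°, 40°, 60°, 90° and 120°, a view offset of 35° would select the 40° side stream under this rule.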
In this example, the second point of view 402 is offset from the first point of view 401 by a first angle +α in a positive sense and a third point of view 403 is offset from the first point of view 401 by the first angle in a negative sense (−α). The transmitter apparatus 20 is configured to obtain the single-channel difference audio signal 70, for the second point of view 402 but not for the third point of view 403. The single-channel difference audio signal 701 for the second point of view 402, is transmitted from the transmitter apparatus 20 to the receiver apparatus 30 and can be used to estimate audio signals for both the second point of view 402 and the third point of view 403. The single-channel difference audio signal 701 for the third point of view 403, is not transmitted from the transmitter apparatus 20 to the receiver apparatus 30.
The receiver apparatus 30 uses the single-channel difference audio signal 70, for the second point of view 402, to estimate the second audio signal 602 for the second point of view 402 as previously described. In addition, the receiver apparatus 30 re-uses an inverse of the single-channel difference audio signal 70, for the second point of view 402, as a single-channel difference audio signal 70 for the third point of view 403. The single-channel difference audio signal 70, for the third point of view 403, is then used as previously described to estimate a third audio signal 603 for the third point of view 403 by combining it with the first audio signal 601.
The symmetry between the second point of view 402 and the third point of view 403 allows a single difference signal 70 to be used for the estimation of the audio signals for these different points of view. In
At the top of
The single-channel difference audio signal 70 is sent only once for the two different points of view 402, 403 because the symmetry of the problem makes it possible to use the difference signal 70 and its inverse for directions α° left of the current view direction and α° right of the current view direction.
At the bottom of
In this way the number of single-channel difference audio signals 70 transmitted is cut by half by using the same single-channel difference audio signal 70 for two symmetric directions.
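The re-use of one transmitted difference signal for two symmetric directions can be sketched as follows. The dictionary keyed by positive angle, the function name and the sample-list representation are hypothetical and are used only to illustrate the inversion.

```python
from typing import Dict, List


def difference_for_angle(transmitted: Dict[int, List[float]],
                         angle_deg: int) -> List[float]:
    """Return the single-channel difference signal for a signed rotation angle.

    Only differences for positive angles are assumed to have been transmitted.
    For a negative angle, the inverse (sample-wise negation) of the difference
    for the corresponding positive angle is re-used.
    """
    if angle_deg == 0:
        raise ValueError("the primary stream needs no difference signal")
    stored = transmitted[abs(angle_deg)]
    if angle_deg > 0:
        return list(stored)
    return [-sample for sample in stored]
```

With this arrangement the receiver would hold one stored signal per angle magnitude and derive the signal for the opposite direction on demand.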
Although the above has been described in relation to a single-channel difference audio signal 70, it will be appreciated that this approach can also be used when multiple difference audio signals 70 are used, i.e. for multiple channels.
Any of the preceding examples of the transmitter apparatus 20 can be adapted to introduce a smoothing means 100 as illustrated in
The difference audio signal 70 can be a single-channel difference audio signal 70. Then, in this example, the smoothing means 100 is configured to smooth the single-channel difference audio signal 70 in a frequency domain to obtain a smoothed single-channel difference audio signal 70′ and to enable estimation of at least the second audio signal 602 in dependence upon the smoothed single-channel difference audio signal 70′ and the first audio signal 601.
In the example illustrated in
In some examples, the smoothing means 100 is configured for dynamic smoothing. The dynamic smoothing of the (single-channel) difference audio signal 70, based on at least the difference between the first audio signal 601 for the first point of view 401 and the second audio signal 602 for the second point of view 402, is dependent upon a likelihood of a change in point of view from the first point of view 401 to the second point of view 402. Thus, different smoothing parameters, e.g. bandwidth size and number, can change with a likelihood of a change in point of view. A smoothed (single-channel) difference audio signal 70′ for a more likely point of view 40 can have more, smaller bandwidths than a smoothed (single-channel) difference audio signal 70′ for a less likely point of view 40.
Thus, in some examples, the transmitter apparatus 20 comprises means configured to:
obtain, for a first point of view 401, a first audio signal 601;
obtain, for a second point of view 402, a second audio signal 602;
determine, for the second point of view 402, at least a difference audio signal 70 based on a difference, between the first audio signal 601 and the second audio signal 602;
smooth the difference audio signal 70 in the frequency domain to obtain a smoothed difference audio signal 70′; and
enable estimation of at least the second audio signal 602 in dependence upon the smoothed difference audio signal 70′ and the first audio signal 601.
The transmitter apparatus 20 also performs the equivalent method of obtaining, for a first point of view 401, a first audio signal 601;
obtaining, for a second point of view 402, a second audio signal 602;
determining, for the second point of view 402, at least a difference audio signal 70 based on a difference, between the first audio signal 601 and the second audio signal 602;
smoothing the difference audio signal 70 for the second point of view 402 in the frequency domain to obtain a smoothed difference audio signal 70′; and
enabling estimation of at least a second audio signal 602 in dependence upon the smoothed difference audio signal 70′ and the first audio signal 601.
In the example of
In some embodiments, the bin values may be smoothed close to the frequency band borders towards the bin values in a neighboring frequency band.
Depending on the time-frequency transform, bins can be real or complex valued.
In some examples, the smoothing means 100 performs an averaging. An average of the bins inside a frequency band can be used to represent all bins inside that frequency band. The average may be a direct average of the complex-valued bins, an average of the bin absolute and angle values, or one of the bins that is closest to an average value, etc.
However, other approaches to smoothing are possible. For example, any low pass filtering will also be appropriate. The intention is to reduce the variance of the difference audio signal 70 by smoothing.
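The averaging variant, in which one value represents all bins inside a frequency band, can be sketched as below. This shows only the direct average of complex-valued bins; the function name and the representation of bands as half-open index ranges are assumptions for illustration, and bins outside any listed band are left unchanged.

```python
from typing import List, Sequence, Tuple


def smooth_by_band_average(bins: Sequence[complex],
                           bands: Sequence[Tuple[int, int]]) -> List[complex]:
    """Replace every bin in each band with the direct average of that band.

    Each band is a half-open index range (start, stop) into `bins`.
    """
    smoothed = list(bins)
    for start, stop in bands:
        if not 0 <= start < stop <= len(bins):
            raise ValueError(f"invalid band: ({start}, {stop})")
        mean = sum(bins[start:stop]) / (stop - start)
        # Replicate the single band value across all bins of the band.
        for k in range(start, stop):
            smoothed[k] = mean
    return smoothed
```

After this step each band carries a single repeated value, which reduces the variance of the difference audio signal 70 within the band.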
In some examples, code books or other parametric implementations may be used to represent a value for a frequency band after smoothing.
The selection of the frequency bands can be based on any suitable methodology. For example, they can be third-octave bands, Bark bands or equivalent rectangular bandwidth (ERB) bands. In some examples the frequency bands are narrower at low frequencies and wider at higher frequencies.
In some examples, the encoder used is an MPEG AAC, MP3, MPEG AAC+ or MPEG AAC-LD encoder. Speech encoders such as AMR-WB can also be used. Even a mono coder can be used to encode each channel in a multi-channel audio signal separately.
Multi-channel audio codecs such as MPEG AAC or Dolby Digital can also be used. Several streams may be coded using a single multi-channel audio codec.
Alternatively, a purpose-designed audio encoder can be used to low-bit-rate encode the difference audio signal 70 with the copied bins. This encoder can be designed to take full advantage of the structure of the smoothed difference audio signal 70.
In some examples, the number of side streams and the angles selected for them depend on how much the user can rotate his head and/or on the quality that is desired. The side streams that are deemed less likely, e.g. the ones furthest away in angle from the delayed user view direction where a user is less likely to turn his head, may be encoded with a lower bit rate than the more likely side streams. Also, the frequency bands used for replicating bins may be wider for the less essential or less used side streams.
In some examples, the difference audio signal 70 is generated in the time domain and is processed at the receiver apparatus 30 in the time domain. In this example, after the smoothed difference audio signal 70′ is determined in the frequency domain, it is converted from the frequency domain into the time domain. The conversion into the time domain can, for example, occur at the transmitter apparatus 20 or at the receiver apparatus 30. In some examples, frequency bin replication can occur within an audio encoder using the time-frequency transform that is used by the encoder.
The receiver apparatus 30 receives the primary stream comprising the first audio signal 601 for both first and second channels R, L. It also receives a number of side streams. A side stream is a single-channel difference audio signal 70 for a particular point of view. It can be used to estimate audio signals 60 for that point of view and its inverse can be used to estimate audio signals 60 for the symmetrically opposite point of view.
For example, the single-channel difference audio signal 70TL (for turn left, rotation α=90°), can be used to estimate audio signal 60TL for a user rotation (
A single-channel difference audio signal 70LF (for lean front), can be used to estimate audio signals 60LF for a user leaning forward (
A single-channel difference audio signal 70LL (for a lean left) can be used to estimate audio signal 60LL for a user leaning left (
It will be noticed from the figures that for rotation the single-channel difference audio signal 70 for a rotation point of view 40 and the first audio signal 601 of the primary stream are combined, in an estimator 32 in opposite senses for the different channels L, R (
It will also be noticed that for lean the single-channel difference audio signal 70 for a lean point of view 40 and the first audio signal 601 of the primary stream are combined, in estimator 32, in the same sense for the different channels L, R (
It is also possible to have different independent combinations of rotation, lean forwards/backwards and lean left/right. For example, it is possible to independently define a rotation as +/−α and/or define a forwards/backwards lean as forwards or backwards and/or define a left/right lean as either left or right.
For example,
Other combinations are possible such as a lean forward, lean left and turn left which would combine
Alternatively, or in addition, the receiver apparatus 30 is configured to manage head translation. The receiver apparatus 30 is configured, if the second point of view 402 corresponds to a head translation relative to the first point of view 401, to estimate the second audio signal 602 at least based on an addition involving the single-channel difference audio signal 70 and one of the first and second channels L, R of the first audio signal 601 and an addition involving the single-channel difference audio signal 70 and the other of the first and second channels L, R of the first audio signal 601 or to estimate the second audio signal 602 at least based on a subtraction involving the single-channel difference audio signal 70 and one of the first and second channels L, R of the first audio signal 601 and a subtraction involving the single-channel difference audio signal 70 and the other of the first and second channels L, R of the first audio signal 601.
In these examples the rotation mono signals (the single-channel difference audio signals 70 for rotation points of view) represent differences between current and future view direction binaural signals, and the translation mono signals (the single-channel difference audio signals 70 for different leans) represent differences between current and future head translation binaural signals. When the user rotates his or her head, the corresponding rotation mono signal 70 is added to the left channel and subtracted from the right channel of the current view direction binaural signal 601. When the user translates his or her head, the corresponding translation mono signal 70 is added to both channels L, R of the current view direction binaural signal 601. The rotation and translation mono signals are independent and are combined with (added to or subtracted from) the current view direction binaural signal 601 independently of each other.
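The combination described above can be sketched in a few lines. This is a minimal illustration only, assuming time-domain sample lists of equal length; the function name `estimate_second_view` and its parameters are hypothetical and do not appear in the disclosure.

```python
def estimate_second_view(left, right, rotation_diff=None, translation_diff=None):
    """Estimate the L/R channels for a new point of view from the primary stream.

    All arguments are equal-length lists of time-domain samples (an assumption).
    """
    n = len(left)
    rot = rotation_diff if rotation_diff is not None else [0.0] * n
    trans = translation_diff if translation_diff is not None else [0.0] * n
    # Rotation mono signal: opposite senses (added to L, subtracted from R).
    # Translation mono signal: same sense (added to both L and R).
    est_left = [l + rd + td for l, rd, td in zip(left, rot, trans)]
    est_right = [r - rd + td for r, rd, td in zip(right, rot, trans)]
    return est_left, est_right
```

Because the rotation and translation terms enter the sum separately, either can be omitted without affecting the other, which reflects the independence noted above.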
Typically, there would be more side streams for more points of view, especially orientations, than illustrated. This is indicated by the use of ellipsis “ . . . ”.
It is also possible to mix multiple side streams into the primary stream. For example, if future translation is towards the front right and the front right side stream is not available, the apparatus 30 may mix the front and right side streams and add the mix to the primary stream. The amount of translation side signal that is added may depend on the amount of user head movement.
It is possible to mix side streams in different amounts to get an interpolated version for a direction for which there is no side stream. For example, if a user is looking at direction 30°, and there is no direction 30° side stream available, the device may mix available side streams to create an interpolated version of the 30° direction. For example, a mix of one third of a 10° side stream and two thirds of a 40° side stream can give an approximation of the 30° side stream.
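The one third / two thirds split in the example follows from linear interpolation between the two available directions. The sketch below assumes linear weighting by angle, which matches the 10°/40° example but is not stated to be the only option; the function names are hypothetical.

```python
def interpolation_weights(target_deg, low_deg, high_deg):
    """Linear weights for the lower and higher side-stream directions."""
    if low_deg >= high_deg or not (low_deg <= target_deg <= high_deg):
        raise ValueError("target must lie between two distinct directions")
    w_high = (target_deg - low_deg) / (high_deg - low_deg)
    return 1.0 - w_high, w_high


def mix_side_streams(low_stream, high_stream, w_low, w_high):
    """Weighted sample-by-sample mix of two side streams (equal-length lists)."""
    return [w_low * a + w_high * b for a, b in zip(low_stream, high_stream)]
```

For a 30° target between 10° and 40° side streams, `interpolation_weights(30, 10, 40)` gives weights of approximately one third and two thirds, as in the example above.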
There are three single-channel difference audio signals 70 for three different rotations α, 2α and 3α. There are single-channel difference audio signals 70 for four different translations: front, back, left, right. The respective single-channel difference audio signals 70 are smoothed by respective smoothing means 100, as previously described, before being encoded by respective encoders 110 for transmission to the receiver apparatus 30.
It will be appreciated from the foregoing that this disclosure introduces point-of-view-tracked audio with practically zero latency and a significantly smaller bit rate. This can be achieved by transmitting, in addition to the current view direction audio (the first audio signal 601), one or more difference audio signals 70 for possible future points of view, each representing the difference between the current point of view and a potential future point of view. Zero latency can be achieved by adding a difference audio signal 70 to, and/or subtracting it from, the current view direction audio signal 601 in the time domain.
Low bit rate can be achieved by the difference signal 70 being mono and repetitive in the frequency domain after smoothing 100. With highly efficient codecs it is possible to achieve a bandwidth of less than 150 kb/s for near CD quality for music and less than 64 kb/s for speech.
In some embodiments the difference signal 70 is sent only once for two different (symmetric) points of view because of symmetry.
It will therefore be appreciated that the side streams can be modified to be more compatible with low bit rate encoding by reducing the number of channels in the side streams. Also, the side streams can be modified to be more compatible with low bit rate encoding by replicating frequency bins in the side streams. Also, frequency bin replication can be used more (more bandwidth) in the side streams that are associated with the directions where the user is likely to turn his or her head.
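One possible reading of frequency bin replication is sketched below. The exact scheme is not spelled out in this passage, so the sketch assumes, purely for illustration, that the bins are split into fixed-size groups and the first bin of each group is copied over the rest of the group. The name `replicate_bins` and the `group_size` parameter are hypothetical.

```python
def replicate_bins(bins, group_size):
    """Copy the first bin of each group over the whole group.

    `bins` is a list of frequency-domain values; `group_size` is an
    illustrative parameter, not a value taken from the disclosure.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    out = []
    for start in range(0, len(bins), group_size):
        group = bins[start:start + group_size]
        out.extend([group[0]] * len(group))
    return out
```

A larger `group_size` makes the spectrum more repetitive, which is the property relied on above for low bit rate encoding.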
As illustrated in
The processor 402 is configured to read from and write to the memory 404. The processor 402 may also comprise an output interface via which data and/or commands are output by the processor 402 and an input interface via which data and/or commands are input to the processor 402.
The memory 404 stores a computer program 406 comprising computer program instructions (computer program code) that controls the operation of the apparatus 20, 30 when loaded into the processor 402. The computer program instructions, of the computer program 406, provide the logic and routines that enable the apparatus to perform the methods required. The processor 402 by reading the memory 404 is able to load and execute the computer program 406.
The apparatus 20 can therefore comprise:
at least one processor 402; and
at least one memory 404 including computer program code
the at least one memory 404 and the computer program code configured to, with the at least one processor 402, cause the apparatus 20, 30 at least to perform:
obtaining, for a first point of view, a first audio signal for at least a first channel and a second channel;
obtaining, for a second point of view, a second audio signal for at least the first channel and the second channel;
determining a single-channel difference audio signal, for the second point of view, based on at least a difference between the first audio signal and the second audio signal, and
enabling estimation of both the first channel and the second channel of the second audio signal for the second point of view in dependence on the single-channel difference audio signal for the second point of view and the first audio signal.
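The determining step at the transmitter apparatus 20 can be illustrated as follows. This is a sketch under an assumption that is not taken from the description: the single-channel difference is chosen as the least-squares fit to the per-channel differences, given that the receiver adds it to one channel and subtracts it from the other (rotation) or adds it to both (lean). The function name and the `same_sense` flag are hypothetical.

```python
def single_channel_difference(first, second, same_sense=False):
    """Derive one mono difference signal from two stereo signals.

    `first` and `second` are (left, right) pairs of equal-length sample lists.
    same_sense=False models rotation (L + d, R - d);
    same_sense=True models lean (L + d, R + d).
    The least-squares choice of d is an assumption made for this sketch.
    """
    (l1, r1), (l2, r2) = first, second
    sign = 1.0 if same_sense else -1.0
    return [
        ((bl - al) + sign * (br - ar)) / 2.0
        for al, ar, bl, br in zip(l1, r1, l2, r2)
    ]
```

When the second audio signal differs from the first by exactly the assumed model, the returned signal reproduces the second audio signal exactly at the receiver; otherwise it yields an estimate.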
The apparatus 30 can therefore comprise:
at least one processor 402; and
at least one memory 404 including computer program code
the at least one memory 404 and the computer program code configured to, with the at least one processor 402, cause the apparatus 20, 30 at least to perform:
obtaining a single-channel difference audio signal, for a second point of view, dependent on at least a difference between a first audio signal for a first point of view and a second audio signal for a second point of view; and
estimating a first channel and a second channel of the second audio signal for the second point of view in dependence on the single-channel difference audio signal and the first audio signal.
The apparatus 20 can therefore comprise:
at least one processor 402; and
at least one memory 404 including computer program code
the at least one memory 404 and the computer program code configured to, with the at least one processor 402, cause the apparatus 20, 30 at least to perform:
obtaining, for a first point of view, a first audio signal;
obtaining, for a second point of view, a second audio signal;
determining, for the second point of view, at least a difference audio signal based on a difference, between the first audio signal and the second audio signal;
smoothing the difference audio signal in the frequency domain to obtain a smoothed first difference audio signal;
enabling estimation of at least the second audio signal in dependence upon the smoothed difference audio signal and the first audio signal.
As illustrated in
Computer program instructions for causing an apparatus 20 to perform at least the following or for performing at least the following:
obtaining, for a first point of view, a first audio signal for at least a first channel and a second channel;
obtaining, for a second point of view, a second audio signal for at least the first channel and the second channel;
determining a single-channel difference audio signal, for the second point of view, based on at least a difference between the first audio signal and the second audio signal, and
enabling estimation of both the first channel and the second channel of the second audio signal for the second point of view in dependence on the single-channel difference audio signal for the second point of view and the first audio signal.
Computer program instructions for causing an apparatus 30 to perform at least the following or for performing at least the following:
obtaining a single-channel difference audio signal, for a second point of view, dependent on at least a difference between a first audio signal for a first point of view and a second audio signal for a second point of view; and
estimating a first channel and a second channel of the second audio signal for the second point of view in dependence on the single-channel difference audio signal and the first audio signal.
The computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, or a machine readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.
Although the memory 404 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
Although the processor 402 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable. The processor 402 may be a single core or multi-core processor.
References to ‘computer-readable storage medium’, ‘computer program product’, ‘tangibly embodied computer program’ etc. or a ‘controller’, ‘computer’, ‘processor’ etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
As used in this application, the term ‘circuitry’ may refer to one or more or all of the following:
(a) hardware-only circuitry implementations (such as implementations in only analog and/or digital circuitry) and
(b) combinations of hardware circuits and software, such as (as applicable):
(i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
(ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and
(c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g. firmware) for operation, but the software may not be present when it is not needed for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
In this example, the receiver apparatus 30 is a headset 410. The headset can for example be a pair of headphones or a set of augmented reality or virtual reality glasses.
In some examples, the headset 410 may communicate via a wireless interface 12 that provides a wireless data connection such as a Bluetooth connection. In some examples, the transmitter apparatus 20 is a mobile phone or similar or other personal electronic device.
The point of view of the user used in the previously described examples can, in some examples, be determined by a point of view of the headset 410. The point of view of the headset 410 can be tracked using sensors in the headset 410. In this case, the headset 410 transmits head tracking information to the transmitter device 20.
In alternative implementations, the point of view of the user can be tracked by tracking the user using sensors at the transmitter device 20 or elsewhere.
A sensor for head tracking may, for example, be an accelerometer built into the headset 410, but other sensor types can also be used, such as optical, camera, infrared, Bluetooth, LT antenna array, 3D camera, etc. The tracking sensor may reside outside the headset 410. For example, a Microsoft Kinect-like device may be used to track user head position from outside the headset 410.
The headset 410 has applications such as augmented reality, virtual reality and teleconference applications. The transmitter apparatus 20 can modify audio based on the head tracking information. The transmitter apparatus 20 sends the modified audio to the headset 410 and the headset further modifies/selects what audio is played to the user. Because of transmission delay, the head tracking information is delayed relative to the actual current user view direction by the time it reaches the transmitter apparatus 20. The transmitter apparatus 20 uses the delayed head tracking information to create the different audio streams. One high quality stereo audio stream (the first audio signal 601) is optimized for the user's delayed view direction. This is the primary stream. The other streams, the side streams (difference audio signals 70 for different points of view 40), are of lower quality and can be used to modify the primary stream so that the primary stream becomes optimized for other points of view, one of which is typically close to the current user view direction. The headset 410 modifies the primary stream constantly in this way based on current head tracking information.
As previously described, the modification is done (for rotation) by adding the side stream that is associated with the current user view direction to the left channel of the primary stream and subtracting the side stream from the right channel of the primary stream.
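Selecting which side stream to apply can be sketched as a nearest-direction lookup. The sketch assumes that each side stream is labelled by its angular offset from the view direction of the primary stream and that an offset of 0 stands for the unmodified primary stream; both are illustrative assumptions, and the function name is hypothetical.

```python
def nearest_side_stream(current_deg, primary_deg, offsets_deg):
    """Return the available offset closest to the current view direction.

    `primary_deg` is the (delayed) view direction of the primary stream,
    `current_deg` the latest head-tracked direction, and `offsets_deg`
    the offsets, in degrees, for which side streams are available.
    """
    # Wrap the relative angle into the range [-180, 180).
    relative = (current_deg - primary_deg + 180.0) % 360.0 - 180.0
    return min(offsets_deg, key=lambda off: abs(off - relative))
```

The side stream for the returned offset would then be added to the left channel of the primary stream and subtracted from the right channel, as described above.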
The audio signal 60 used may be stereo, binaural, 5.1, Ambisonics, etc. With Ambisonics or 5.1, which have more than two channels, it may not be possible to reduce the difference signals into a mono signal. Instead, some of the channels in the 5.1 or Ambisonics signal may be grouped into stereo pairs and a difference signal is used for each pair.
Ordinarily the side stream points of view 40 would be fixed; for example, typical choices for the orientations might be +/−20°, +/−40°, +/−60°, +/−90°, +/−120° because these are close to the likely directions in which the user can turn his or her head. However, in some cases the directions may be selected for other reasons. The selection may be done in the mobile phone 20 or the headset 410. Either of the devices may determine a more likely direction. If the determination is done in the headset 410, then the determined directions need to be transmitted to the mobile phone 20 so that the mobile phone 20 can create the side streams 70 for the determined directions.
Either of the apparatus 20, 30 may determine sound source directions either in the audio signal that is transmitted from the mobile phone 20 to the headset 30 or in the real-world sound environment. For real-world sound sources, a device needs at least two (typically three or four) microphones to detect sound source directions. Sound source directions can be detected using methods such as beamforming or time difference. The directions of sound sources, such as a speaker in a teleconference or a real-world person other than the user, are likely directions towards which a user may turn his or her head. These directions can be used as more likely side stream directions, for which more likely side streams are created. The more likely side streams may be encoded with a higher bit rate than other side streams.
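As an illustration of the time difference approach, the direction of a source can be estimated from the arrival delay between two microphones. The sketch uses the standard far-field relation sin(θ) = c·Δt/d and assumes a speed of sound of 343 m/s; the function name, parameter names and default value are illustrative and not taken from the disclosure.

```python
import math


def direction_from_time_difference(delay_s, mic_spacing_m, speed_of_sound=343.0):
    """Far-field arrival angle in degrees, measured from broadside.

    `delay_s` is the arrival-time difference between the two microphones
    and `mic_spacing_m` is the distance between them.
    """
    ratio = speed_of_sound * delay_s / mic_spacing_m
    # Clamp to the valid asin domain to tolerate measurement noise.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

With only two microphones the result is ambiguous between front and back, which is consistent with the remark above that three or four microphones are typically used.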
The blocks illustrated in the FIGs may represent steps in a method and/or sections of code in the computer program 406. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the blocks may be varied. Furthermore, it may be possible for some blocks to be omitted.
Where a structural feature has been described, it may be replaced by means for performing one or more of the functions of the structural feature whether that function or those functions are explicitly or implicitly described.
The above-described examples find application as enabling components of:
automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and/or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.
The term ‘comprise’ is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use ‘comprise’ with an exclusive meaning then it will be made clear in the context by referring to “comprising only one . . . ” or by using “consisting”.
In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term ‘example’ or ‘for example’ or ‘can’ or ‘may’ in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus ‘example’, ‘for example’, ‘can’ or ‘may’ refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.
Features described in the preceding description may be used in combinations other than the combinations explicitly described above.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.
The term ‘a’ or ‘the’ is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising a/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use ‘a’ or ‘the’ with an exclusive meaning then it will be made clear in the context. In some circumstances the use of ‘at least one’ or ‘one or more’ may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.
The presence of a feature (or combination of features) in a claim is a reference to that feature (or combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.
In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.
Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance it should be understood that the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not emphasis has been placed thereon.
Number | Date | Country | Kind |
---|---|---|---|
2115768.0 | Nov 2021 | GB | national |