The present application relates to apparatus and methods for delaying enhanced orientation signalling for immersive communications, but not exclusively to enhanced orientation signalling for immersive communications within a spatial audio signal environment.
Immersive audio codecs are being implemented supporting a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network including use in such immersive services as for example immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
There is provided according to a first aspect an apparatus for encoding a spatial audio scene comprising means configured to: capture the spatial audio scene comprising at least one audio signal; determine, for an audio frame of the at least one audio signal, a change to the orientation of the apparatus, wherein the change to the orientation of the apparatus is with respect to an orientation of the apparatus from a previous audio frame of the at least one audio signal, wherein the change to the orientation of the apparatus forms at least one orientation change value which forms at least part of an orientation change data set; determine an orientation change delay time for the change to the orientation of the apparatus, wherein the orientation change delay time forms a further part of the orientation change data set; perform the change to the orientation of the apparatus after a period of time specified by the orientation change delay time; and output or store the orientation change data set.
The orientation change delay time may be expressed in units of audio frames of the at least one audio signal.
The at least one orientation change value may comprise at least one of: an azimuth value, an elevation value and a roll value.
The apparatus comprising means configured to output or store the orientation change data set may further comprise means configured to form at least part of the orientation change data set as an RTP header extension according to RFC 8285.
The RTP header extension may comprise an L field according to RFC 8285, wherein a value of the L field indicates that the RTP header extension contains at least one of the azimuth value, the elevation value, the roll value and the orientation change delay time.
The RTP header extension may be a one-byte header extension according to RFC 8285.
According to a second aspect there is provided an apparatus for decoding a spatial audio scene, comprising means configured to: receive an orientation change data set, wherein the orientation change data set comprises: at least one orientation change value specifying a change to an orientation of the apparatus with respect to the spatial audio scene comprising at least one audio signal; and an orientation change delay time for the change to the orientation of the apparatus; and perform the change to the orientation of the apparatus, wherein the change to the orientation of the apparatus is performed within a period of time specified by the orientation change delay time.
The orientation change delay time may be expressed in units of audio frames of the at least one audio signal.
The means configured to perform a change to the orientation of the apparatus may comprise means configured to: determine an increment of change with respect to the at least one orientation change value.
The increment of change may be a linear increment of change, and the means configured to determine the increment of change may comprise means configured to determine a factor relating to a ratio of the at least one orientation change value to the orientation change delay time.
The means configured to perform the change to the orientation of the apparatus may comprise means further configured to: apply the increment of change to the orientation of the apparatus on an audio frame by audio frame basis of the at least one audio signal for the period of time specified by the orientation change delay time.
Alternatively, the apparatus may comprise a signal activity detection function, and the means configured to perform the change to the orientation of the apparatus may comprise means further configured to: apply the increment of change to the orientation of the apparatus on an audio frame by audio frame basis of the at least one audio signal when the signal activity detection function indicates an active audio signal state; and apply a change to the orientation of the apparatus such that the at least one orientation change value is reached over a period of an audio frame of the at least one audio signal when the signal activity detection function indicates an inactive audio signal state.
Alternatively, the means configured to perform a change to the orientation of the apparatus may comprise means further configured to: override the change to the orientation of the apparatus by not performing the change to the orientation of the apparatus.
The at least one orientation change value may comprise at least one of: an azimuth value, an elevation value and a roll value.
The orientation change data set may be received in the form of an RTP header extension according to RFC8285.
The RTP header extension may comprise an L field according to RFC 8285, wherein a value of the L field indicates that the RTP header extension contains at least one of the azimuth value, the elevation value, the roll value and the orientation change delay time.
The RTP header extension may be a one-byte header extension according to RFC8285.
According to a third aspect there is a method for encoding a spatial audio scene comprising: capturing the spatial audio scene comprising at least one audio signal; determining, for an audio frame of the at least one audio signal, a change to the orientation of an apparatus, wherein the change to the orientation of the apparatus is with respect to an orientation of the apparatus from a previous audio frame of the at least one audio signal, wherein the change to the orientation of the apparatus forms at least one orientation change value which forms at least part of an orientation change data set; determining an orientation change delay time for the change to the orientation of the apparatus, wherein the orientation change delay time forms a further part of the orientation change data set; performing the change to the orientation of the apparatus after a period of time specified by the orientation change delay time; and outputting or storing the orientation change data set.
The orientation change delay time may be expressed in units of audio frames of the at least one audio signal.
The at least one orientation change value may comprise at least one of: an azimuth value, an elevation value and a roll value.
The method comprising outputting or storing the orientation change data set may further comprise forming at least part of the orientation change data set as an RTP header extension according to RFC 8285.
The RTP header extension may comprise an L field according to RFC 8285, wherein a value of the L field indicates that the RTP header extension contains at least one of the azimuth value, the elevation value, the roll value and the orientation change delay time.
The RTP header extension may be a one-byte header extension according to RFC8285.
According to a fourth aspect there is a method for decoding a spatial audio scene, comprising: receiving an orientation change data set, wherein the orientation change data set comprises: at least one orientation change value specifying a change to an orientation of an apparatus with respect to the spatial audio scene comprising at least one audio signal; and an orientation change delay time for the change to the orientation of the apparatus; and performing the change to the orientation of the apparatus, wherein the change to the orientation of the apparatus is performed within a period of time specified by the orientation change delay time.
The orientation change delay time may be expressed in units of audio frames of the at least one audio signal.
Performing a change to the orientation of the apparatus may comprise determining an increment of change with respect to the orientation change value.
The increment of change may be a linear increment of change and determining the increment of change may comprise determining a factor relating to a ratio of the at least one orientation change value to the orientation delay time.
Performing the change to the orientation of the apparatus may further comprise: applying the increment of change to the orientation of the apparatus on an audio frame by audio frame basis of the at least one audio signal for the period of time specified by the orientation change delay time.
Alternatively, the apparatus may comprise a signal activity detection function, and performing the change to the orientation of the apparatus may further comprise: applying the increment of change to the orientation of the apparatus on an audio frame by audio frame basis of the at least one audio signal when the signal activity detection function indicates an active audio signal state; and applying a change to the orientation of the apparatus such that the at least one orientation change value is reached over a period of an audio frame of the at least one audio signal when the signal activity detection function indicates an inactive audio signal state.
Alternatively, performing a change to the orientation of the apparatus may further comprise overriding the change to the orientation of the apparatus by not performing the change to the orientation of the apparatus.
The at least one orientation change value may comprise at least one of: an azimuth value, an elevation value and a roll value.
The orientation change data set may be received in the form of an RTP header extension according to RFC8285.
The RTP header extension may comprise an L field according to RFC 8285, wherein a value of the L field indicates that the RTP header extension contains at least one of the azimuth value, the elevation value, the roll value and the orientation change delay time.
The RTP header extension may be a one-byte header extension according to RFC8285.
According to a fifth aspect there is an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: capture a spatial audio scene comprising at least one audio signal; determine, for an audio frame of the at least one audio signal, a change to the orientation of the apparatus, wherein the change to the orientation of the apparatus is with respect to an orientation of the apparatus from a previous audio frame of the at least one audio signal, wherein the change to the orientation of the apparatus forms at least one orientation change value which forms at least part of an orientation change data set; determine an orientation change delay time for the change to the orientation of the apparatus, wherein the orientation change delay time forms a further part of the orientation change data set; perform the change to the orientation of the apparatus after a period of time specified by the orientation change delay time; and output or store the orientation change data set.
According to a sixth aspect there is an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: receive an orientation change data set, wherein the orientation change data set comprises: at least one orientation change value specifying a change to an orientation of the apparatus with respect to the spatial audio scene comprising at least one audio signal; and an orientation change delay time for the change to the orientation of the apparatus; and perform the change to the orientation of the apparatus, wherein the change to the orientation of the apparatus is performed within a period of time specified by the orientation change delay time.
According to a seventh aspect there is a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: capturing a spatial audio scene comprising at least one audio signal; determining, for an audio frame of the at least one audio signal, a change to the orientation of an apparatus, wherein the change to the orientation of the apparatus is with respect to an orientation of the apparatus from a previous audio frame of the at least one audio signal, wherein the change to the orientation of the apparatus forms at least one orientation change value which forms at least part of an orientation change data set; determining an orientation change delay time for the change to the orientation of the apparatus, wherein the orientation change delay time forms a further part of the orientation change data set; performing the change to the orientation of the apparatus after a period of time specified by the orientation change delay time; and outputting or storing the orientation change data set.
According to an eighth aspect there is a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: receiving an orientation change data set, wherein the orientation change data set comprises: at least one orientation change value specifying a change to an orientation of an apparatus with respect to the spatial audio scene comprising at least one audio signal; and an orientation change delay time for the change to the orientation of the apparatus; and performing the change to the orientation of the apparatus, wherein the change to the orientation of the apparatus is performed within a period of time specified by the orientation change delay time.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
The following describes in further detail suitable apparatus and possible mechanisms for a delayed orientation signalling for user-controlled spatial audio rendering.
There exists a problem for user-controlled spatial audio rendering when there is an update to the intended scene orientation.
A typical example of capturing spatial audio may involve a user walking down a busy road using a mobile device in a normal manner. In this scenario the user may take a turn or rotate their head to check for traffic resulting in rapid changes to the orientation of the spatial audio capturing device (mobile device). These changes to the spatial audio scene orientation tend to be rather random in nature and generally not of interest to the end user. These unintended changes to the spatial audio scene orientation may be compensated for in the design of the audio capture apparatus. For example, during the capture process orientation sensors may be used to rotate the captured audio scene such that a resulting audio scene may be stabilised.
There may be instances when there is an intended change to the audio scene which is required to be communicated to the end user. However, such intended changes may inadvertently also introduce changes that are unwanted and annoying to the end user. An example of such an instance is a user selecting a particular orientation when capturing starts. The transmission may include scene information having a first rotation value for the audio scene orientation. The user may then wish to bring to the listener's attention a particular directional aspect of the sound scene. The user's device may, in response to a command, transmit a new scene orientation instantaneously. However, depending on the circumstances of the change, such as its magnitude or frequency, the signalled orientation update to the receiving renderer may result in annoying spatial audio scene artefacts for the end user.
Furthermore, due to the nature of encoding spatial audio scenes at low bit rates it may be preferable to avoid any scene orientation change coinciding with the encoding of, e.g., attack sounds in the audio signal such as the beginning of a speech burst.
Therefore, embodiments address the problem of enabling robust transmission of spatial audio scenes when the scene orientation is changed by a user's interaction, without the unwanted effect of producing perceptually disturbing artefacts for the end listener.
Embodiments improve spatial audio rendering by avoiding sudden changes to the scene orientation at the listener's device: the signalling of the change of scene orientation is delayed, and the fact that a future change will take place is signalled in advance, thereby giving the rendering device sufficient time to apply a suitable form of compensation or smoothing.
Embodiments of the invention may be implemented within the framework of the Real-time Transport Protocol (RTP) which is used for transmitting digital media streams such as encoded audio over the Internet Protocol (IP). The parameters of the RTP payload are typically communicated between transmission end points using the Session Description Protocol (SDP).
Specific functionality may be implemented at the encoding device which can receive a user input to set a new orientation for a spatial audio scene. The encoding device may be arranged to respond to the request for an orientation change by selecting a suitable time window for the scene orientation change to take place, both at the encoding and the decoding/rendering devices, thereby introducing a delay such that an audio scene renderer does not respond with an instantaneous change to the audio scene. The suitable time window may be signalled to a receiving device including the decoder and renderer via signalling means such as the RTP protocol as discussed above. The receiving device may be arranged to respond to the signalled delay by performing a suitable smoothing or pre-compensation to counteract the effect of an instantaneous change to the scene orientation at the encoder.
The objective of the delay is to allow the receiving device, in particular the renderer, to determine how to handle the orientation change. The encoder may then signal to the decoder/renderer a buffer length or compensation window which defines a delay window in which the scene orientation change can be handled by the decoder/renderer. The delay information sent to the decoder may take into account analysis of the content at the time of the scene change and also user preferences.
According to various embodiments of the invention the delay applied both at the encoder/capturer and decoder/renderer can be a fixed delay or an adaptive delay dependent on past, current or future signals.
In this regard
Initially the encoder is shown as receiving a command from a user indicating that a change in scene orientation/rotation is desired Step 201 in
In response to receiving the command the encoder may then set an appropriate delay in terms of the number of spatial audio frames, step 203 in
In embodiments the delay value may be provided as a factor to the encoder/capturer, or equally the delay may be determined by the encoder/capturer.
The delay value may be dependent on the amount of orientation change that is to be applied to the audio scene. For example, a higher delay value may be used for larger changes to the orientation of the audio scene. For instance, the encoder/capturer may implement a table whereby a range of orientation change to a particular angle (whether it be azimuth or elevation) may be mapped to a specific delay in terms of the number of audio frames.
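Purely as a sketch of such a mapping (the angle thresholds, frame counts and function name here are illustrative assumptions rather than values defined by the embodiments), the delay selection could look like:

```python
def select_delay_frames(angle_change_deg):
    """Map the magnitude of a requested orientation change to a delay
    expressed as a number of spatial audio frames (e.g. 20 ms frames).

    The angle thresholds and frame counts below are purely illustrative."""
    angle = abs(angle_change_deg)
    if angle <= 15.0:
        return 2     # small change: short transition
    elif angle <= 45.0:
        return 5
    elif angle <= 90.0:
        return 10
    else:
        return 20    # large change: longest transition
```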
The encoder may then encode the current spatial audio frame. Additionally, encoding of the current audio frame may also include an encoded delay value and an encoded scene orientation update. Typically, the encoded scene orientation update will comprise at least one angle of rotation of the audio scene, such as an azimuth value, an elevation value or a roll value. This is shown as step 205 in
The encoder may then encode the audio scene on a regular frame by frame basis for the required delay number of spatial audio frames. This may be performed in order that any coding memories remain synchronised between encoder and decoder. This is shown as processing step 207 in
Finally, the encoder may perform the requested scene orientation change after the prescribed delay period. This is shown as processing step 209 in
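A minimal sketch of this encoder-side sequence (steps 201 to 209) is given below; the encoder interface, field names and the select_delay_frames() helper from the earlier sketch are assumptions made for illustration, not a specified API:

```python
def encode_with_orientation_change(encoder, frames, change_deg, start):
    """Encode spatial audio frames while scheduling a delayed scene
    orientation change of 'change_deg' degrees (azimuth, for simplicity).

    'encoder' is assumed to expose encode_frame() and apply_orientation()."""
    delay = select_delay_frames(change_deg)                  # step 203
    # Encode the current frame together with the delay and the target update.
    encoder.encode_frame(frames[start], scene_del=delay,
                         scene_azi=change_deg)               # step 205
    # Keep encoding normally so coding memories stay synchronised.
    for i in range(1, delay):
        encoder.encode_frame(frames[start + i])              # step 207
    # Only after the prescribed delay is the rotation applied at the capturer.
    encoder.apply_orientation(azimuth_deg=change_deg)        # step 209
    encoder.encode_frame(frames[start + delay])
```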
The decoder may be arranged to receive an encoded spatial audio frame together with a delay value and scene orientation update information. This is shown as processing step 301.
The decoder/renderer may then be arranged to update an orientation compensation curve. This is a function which enables a smooth transition from the current scene orientation to the upcoming scene orientation as received in the previous processing step 301. This is shown as processing step 303. The output of this step may in some embodiments be an incremental scene orientation change which can be applied on a frame by frame basis such that the full scene orientation change/rotation is achieved when the requisite number of delay frames has been reached. This is an example of a linear interpolation scheme where further details are given by the description accompanying
The decoder may then be arranged to decode the spatial audio frame and apply the incremental change to the scene orientation. This is shown as step 305 in
The next encoded spatial audio frame may then be received, shown as processing step 309. The encoded spatial audio frame may then be decoded. The incremental scene orientation change may then be applied to the audio scene, processing step 305. This processing loop may be repeated until the delay value in terms of the number of spatial audio frames is reached, processing step 307.
Consequently, by the time the delay value in terms of the number of spatial audio frames has been reached the orientation change to the audio scene will have been fully processed, thereby producing a gradual change to the orientation of the audio scene at the granularity of the incremental change on a frame by frame basis.
The orientation change parameter metadata set as encoded by the encoder may contain as a minimum the following fields:
In embodiments the above orientation parameters may be referred to collectively as the orientation_update (or orientation_change) metadata set.
In embodiments the orientation_update metadata set may be transported via the RTP protocol in accordance with the Internet Engineering Task Force (IETF) RFC 8285 “A General Mechanism for RTP Header Extensions.” In some embodiments, which deploy the SceneID, the RTP header extension mechanism may also be arranged to transport the orientation scene ID information.
When considering RFC 8285 it may be possible to utilise either a one-byte header extension format or a two-byte header extension format. Below is a one-byte header extension format for transporting the orientation time and the orientation update, according to the framework of RFC 8285.
The first two bytes 0xBE and 0xDE are used to identify the one-byte form of the header extension according to RFC 8285. The next two bytes are the "length" field, which gives the size of the data extension in terms of the number of whole 32-bit units (including any padding that may be needed to fill the 32-bit units). In this example the data extension uses two 32-bit units to contain the data extension. This again is a field specified by RFC 8285. The next field is the single-byte extension field as specified by RFC 8285. This is split into two nibbles: the first nibble specifies a unique ID field and the second nibble is L, whose value is related to the number of bytes of the data extension. These fields are again required by RFC 8285. The value of L plus one specifies the number of bytes required for the data extension. In embodiments L can be used as a form of embedded signalling where its value contains information relating to the orientation delay time (SceneDel), the scene change (SceneAzi, SceneEle and SceneRol) and the orientation scene ID (SceneID). In some embodiments the orientation delay time, scene change and orientation scene ID may be encoded according to the following table.
In the above table, the symbol "x" denotes the specific orientation parameters contained in the RTP data set extension, and therefore the value of L can be used to encode which specific orientation parameters of the orientation change parameter set are updated by the RTP packet. For example, when L takes the value of 2 the RTP header extension will contain updates to the SceneID, SceneDel and SceneAzi parameters. Similarly, with reference to the above one-byte header extension example, when L takes the value of 4 the RTP header extension contains values for, and therefore an update to, SceneID, SceneDel, SceneAzi, SceneEle and SceneRol.
Thus, in these embodiments the 4 bits allowed for the encoding of L (according to RFC 8285) allow for a sufficient range of values to indicate a change to all 5 of the above orientation parameters. Note that in RFC 8285 the 4-bit length allowed for L allows for up to 16 bytes of extension data.
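By way of a hedged sketch, the one-byte header extension described above could be assembled as follows; the one-byte-per-parameter quantisation, the cumulative parameter ordering and the function name are assumptions made for illustration, while the 0xBEDE pattern, the length word and the ID/L nibbles follow RFC 8285:

```python
import struct

def pack_orientation_extension(ext_id, scene_id, scene_del, angles=()):
    """Build an RFC 8285 one-byte RTP header extension carrying the
    orientation_update metadata set.

    'angles' is a cumulative tuple of up to three values in the order
    (SceneAzi, SceneEle, SceneRol); one byte per parameter is assumed."""
    payload = bytes([scene_id & 0xFF, scene_del & 0xFF])
    payload += bytes(v & 0xFF for v in angles)
    l_field = len(payload) - 1                   # L + 1 = number of data bytes
    body = bytes([(ext_id << 4) | l_field]) + payload
    body += b"\x00" * ((-len(body)) % 4)         # pad to whole 32-bit words
    return struct.pack("!HH", 0xBEDE, len(body) // 4) + body

# Example: update SceneID, SceneDel and SceneAzi only, i.e. L = 2 as in the table.
extension = pack_orientation_extension(ext_id=1, scene_id=3, scene_del=5,
                                       angles=(90,))
```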
According to RFC 8285 the local identifier (ID) in the stream may be negotiated or defined out of band, and each distinct extension, in other words the above orientation update may have a unique ID. For example, in embodiments the local identifier may be negotiated for the orientation_update metadata set using the session description protocol (SDP). For example, the above orientation_update may be negotiated to have the ID value of one. Using SDP signalling the negotiation may take the form of
The above SDP line uniquely describes the header extension with ID=1 using the unique URI http://3gpp.org/ivas/rtp_hdr_ext.htm#orientation update.
Alternatively, each orientation parameter may be considered on an individual basis as an RTP header extension and therefore each of the orientation parameters may be assigned their own ID value.
For instance, the SceneID parameter may take the following RTP header extension.
The SceneDel parameter may take the following RTP header extension.
The SceneAzi parameter may take the following RTP header extension.
The SceneEle parameter may take the following RTP header extension.
The SceneRol parameter may take the following RTP header extension
The relevant SDP lines for the above parameters may be formulated as follows, where each ID of 1 to 5 is uniquely assigned to a respective URI:
Next, we turn to the problem of packet loss when signalling orientation_update metadata set information. To counteract any effects of packet loss (or late arrival), an encoder may be arranged to incorporate some built-in redundancy by repeating the signalling of the orientation_update information in subsequent transmitted packets/audio frames. In this case the encoder may be arranged to adjust the orientation delay parameter to compensate for the audio frames that have been previously transmitted with respect to the original orientation_update request. The repeat signalling of the orientation update with the corresponding adjustment to the delay value (SceneDel) may be performed until the transmission of the audio frame immediately before the audio frame in which the orientation change is due to take place. To that extent
Alternatively, in other embodiments the encoder/transmitter may retransmit the orientation_update information (with an adjusted delay value) in response to a notification that packets have been lost. The encoder may then be arranged to retransmit the orientation_update information, with the appropriate adjustment to the SceneDel value, provided that the notification is received within the delay window. In these embodiments the retransmission of the orientation update information may be performed, e.g., in response to an RTCP NACK message.
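A short sketch of the redundancy scheme described above, assuming a hypothetical send_frame() transport and the pack_orientation_extension() helper from the earlier sketch:

```python
def send_with_redundant_update(send_frame, encoded_frames, scene_id,
                               scene_del, angles):
    """Repeat the orientation_update in every frame of the delay window,
    decrementing SceneDel each time so that a late or lost packet still
    describes the correct remaining delay."""
    for i, frame in enumerate(encoded_frames[:scene_del]):
        remaining = scene_del - i                # frames left until the change
        ext = pack_orientation_extension(ext_id=1, scene_id=scene_id,
                                         scene_del=remaining, angles=angles)
        send_frame(frame, header_extension=ext)
```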
As an aside, the encoder/capturer may also be arranged to transmit orientation_update information for a current frame, that is with a SceneDel value of 0. This may be useful when the encoder/capturer wishes to force an instantaneous update to the orientation of the audio scene, for instance to reset the audio scene orientation to a previous value or to a default orientation.
In embodiments the orientation_update information can have absolute orientation values, in which the scene orientation values such as SceneAzi, SceneEle and SceneRol are standalone values. Alternatively, other embodiments may deploy relative orientation values, in which scene orientation values such as SceneAzi, SceneEle and SceneRol are relative to a previous orientation update.
Having absolute orientation values for audio scene positioning may have the added effect of resetting the audio scene to a specific orientation upon executing the scene change at the decoder/renderer. Furthermore, absolute orientation values make it possible to repeat the orientation_update after the delay period has expired. For instance, in some operating instances the decoder/renderer may not have received the orientation_update information within the delay window. Providing for the retransmission of the orientation_update information outside of the delay window ensures that the decoder/renderer is made aware of the missed information and can therefore determine how, or indeed whether, to transition to the new audio scene orientation.
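As a trivial sketch of the difference between the two signalling modes (the function and parameter names are illustrative assumptions):

```python
def target_azimuth(current_azi, scene_azi, absolute=True):
    """Resolve the signalled SceneAzi into a target scene azimuth.

    With absolute signalling the value stands alone; with relative
    signalling it is added to the previously applied orientation."""
    return scene_azi % 360 if absolute else (current_azi + scene_azi) % 360
```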
When an orientation_update metadata set is received, the decoder/renderer may be arranged to have a transition mechanism which may determine how the change to the audio scene orientation is applied.
For instance, in embodiments the decoder/renderer may deploy a smooth transition mechanism whereby an audio scene orientation change is incrementally applied as a series of small adjustments to the audio scene until the target/signalled orientation change has been reached. This may be performed over a series of audio frames with an incremental change applied at each audio frame. Typically, the number of audio frames used for the transition may be determined by the received delay time, SceneDel, so that the full scene orientation change has been applied by the time the number of delay audio frames has been reached.
An example of a smooth transition mechanism is shown in
Note in this example that the convention of a positive angle corresponding to a clockwise rotation is followed.
The decoder side is shown as 505 in
Incremental change = audio scene angle change / (audio delay in frames + 1)
Therefore, returning to the example of
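A sketch of such a linear compensation curve at the decoder/renderer follows; the renderer interface and function names are assumptions made for illustration:

```python
def apply_orientation_smoothly(renderer, current_azi, target_azi, scene_del):
    """Spread a scene azimuth change over (scene_del + 1) audio frames.

    'renderer' is assumed to expose rotate_scene(azimuth_deg); the same
    scheme applies equally to elevation and roll."""
    increment = (target_azi - current_azi) / (scene_del + 1)
    azimuth = current_azi
    for _ in range(scene_del + 1):
        azimuth += increment
        renderer.rotate_scene(azimuth_deg=azimuth)   # applied once per frame

# For example, a 90 degree azimuth change signalled with SceneDel = 5 gives an
# increment of 90 / 6 = 15 degrees per decoded frame.
```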
In other instances, the decoder/renderer may behave differently when a request for an orientation_update is received. For example, the decoder/renderer may choose to effectively ignore the orientation_update request in accordance with user preferences. This particular use case is depicted in
The decoder side is shown as 605 in
In embodiments the smooth transition mechanism as shown in
As described above the delay sent to the decoder may be set by the encoder, e.g., as a fixed delay or may be based on particular rules or set in response to an external control signal. For example, a particular user selection at the encoder, a particular multimedia service or a particular orientation change to the audio scene may each trigger a specific and predetermined value for the delay.
Alternatively, the delay value may be adaptive at the encoder in the sense that the signal may be monitored for various levels of activity with a SAD or a VAD, rather like the above example of
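A sketch of such an adaptive delay choice at the encoder follows, assuming a hypothetical per-frame signal activity detector (SAD) flag and reusing the illustrative select_delay_frames() helper from the earlier sketch:

```python
def adaptive_delay_frames(sad_active, angle_change_deg):
    """Choose the SceneDel value based on current signal activity.

    During an inactive (e.g. silence/DTX) period the orientation change can
    be executed almost immediately, since little audible content is affected;
    during active content a longer transition window is preferred.  The
    values are illustrative."""
    if not sad_active:
        return 0                                   # change within the next frame
    return select_delay_frames(angle_change_deg)   # gradual change while active
```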
Furthermore, the system may be designed such that an intended orientation_update change sent to the decoder may be overridden by a subsequent orientation_update change sent from the encoder, provided the earlier change has not been fully performed. In order for this to function the encoder side may need to keep an internal state count of the orientation transitions taking place at the decoder, so that both sides can maintain a level of synchronisation. Therefore, any overriding orientation_update messages sent from the encoder to the decoder may need to take into account the transition states at the decoder.
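A minimal sketch of such an encoder-side mirror of the decoder transition state (class and method names are hypothetical):

```python
class OrientationTransitionState:
    """Encoder-side mirror of the transition assumed to be running at the
    decoder, so that an overriding orientation_update can account for how
    far the earlier transition has already progressed."""

    def __init__(self):
        self.target_azi = 0.0
        self.frames_left = 0

    def start(self, target_azi, scene_del):
        self.target_azi = target_azi
        self.frames_left = scene_del

    def tick(self):
        # Call once per encoded audio frame while a transition is pending.
        if self.frames_left > 0:
            self.frames_left -= 1

    def override(self, new_target_azi, new_scene_del):
        # Only meaningful while the earlier change has not fully completed.
        if self.frames_left > 0:
            self.start(new_target_azi, new_scene_del)
```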
With respect to
Thus, with respect to the capture apparatus 881 there is shown an audio capture and input format generator/obtainer+orientation control information generator/obtainer 801. In embodiments the aforementioned may be arranged in a single device or alternatively they may be arranged across several different processing modules. The audio capture and input format generator/obtainer+orientation control information generator/obtainer 801 is configured to obtain the audio signals and furthermore the orientation control information. The audio signals may be passed to an IVAS input audio formatter 811 and the orientation control information passed to an orientation input 817.
The capture apparatus 881 may furthermore comprise an IVAS input audio formatter 811 which is configured to receive the audio signals from the audio capture and input format generator/obtainer+orientation control information generator/obtainer 801 and format them in a suitable manner to be passed to an IVAS encoder 821. The IVAS input audio formatter 811 may for example comprise a mono formatter 812 configured to generate a suitable mono audio signal. The IVAS input audio formatter 811 may further comprise a CBA (channel based audio, for example 5.1 or 7.1+4 channel audio signals) formatter configured to generate a CBA format and pass it to a suitable audio encoder. The IVAS input audio formatter 811 may further comprise a metadata assisted spatial audio, MASA (SBA—(parametric) scene based audio), formatter configured to generate a suitable MASA format signal and pass it to a suitable audio encoder. The IVAS input audio formatter 811 may further comprise a first order ambisonics/higher order ambisonics (FOA/HOA (SBA)) formatter configured to generate a suitable ambisonic format and pass it to a suitable audio encoder. The IVAS input audio formatter 811 may further comprise an object based audio (OBA) formatter configured to generate an object audio format and pass it to a suitable audio encoder.
The capture apparatus 881 may furthermore comprise an orientation input 817 configured to receive the orientation control information and format it/pass it to an orientation information encoder 829 within the IVAS encoder 821.
The capture apparatus 881 may furthermore comprise an IVAS encoder 821. The IVAS encoder 821 can be configured to receive the audio signals and the orientation information and encode it in a suitable manner to generate a suitable bitstream, such as an IVAS bitstream 831 to be transmitted or stored.
The IVAS encoder 821 may in some embodiments comprise an EVS encoder 823 configured to receive a mono audio signal, for example from the mono formatter 812 and generate a suitable EVS encoded audio signal.
The IVAS encoder 821 may in some embodiments comprise an IVAS spatial audio encoder 825 configured to receive a suitable format input audio signal and generate suitable IVAS encoded audio signals.
The IVAS encoder 821 may in some embodiments comprise a metadata encoder 827 configured to receive spatial metadata signals, for example from the MASA formatter 814 and generate suitable metadata encoded signals.
The IVAS encoder 821 may in some embodiments comprise orientation information encoder 829 configured to receive the orientation information, for example from the orientation input 817 and generate suitable encoded orientation information signals.
The encoder 821 thus can be configured to transmit the information provided in the orientation input according to its capability to the decoder for rendering with user control. User control is allowed via an interface to the IVAS renderer or an external renderer.
Thus, with respect to the renderer or playback apparatus 883 there is shown an IVAS decoder 841. The IVAS decoder 841 can be configured to receive the encoded audio signals and orientation information and decode them in a suitable manner to generate suitable decoded audio signals and orientation information.
The IVAS decoder 841 may in some embodiments comprise an EVS decoder 843 configured to generate a mono audio signal from the EVS encoded audio signal.
The IVAS decoder 841 may in some embodiments comprise an IVAS spatial audio decoder 845 configured to generate a suitable format audio signal from IVAS encoded audio signals.
The IVAS decoder 841 may in some embodiments comprise a metadata decoder 847 configured to generate spatial metadata signals from metadata encoded signals.
The IVAS decoder 841 may in some embodiments comprise an orientation information decoder 849 configured to generate orientation information from encoded orientation information signals.
In some embodiments the renderer or playback apparatus 883 comprises an IVAS renderer 851 configured to receive the decoded audio signals, decoded metadata and decoded orientation information and generate a suitable rendered output to be output on a suitable output device such as headphones or a loudspeaker system. In some embodiments the IVAS renderer comprises an orientation controller 855 which is configured to receive the orientation information and based on the orientation information (and in some embodiments also user inputs) control the rendering of the audio signals.
In some embodiments the IVAS decoder 841 can be configured to output the orientation information from the orientation information decoder and audio signals to an external renderer 853 which is configured to generate a suitable rendered output to be output on a suitable output device such as headphones or a loudspeaker system based on the orientation information.
The summary of the operations of the system as shown in
For example, the system may receive audio signals as shown in
Furthermore, orientation information or orientation data may be received as shown in
There then follows a series of encoder or capture method operations 911.
These operations may comprise obtaining an input audio format (for example, an audio scene corresponding to any suitable audio format) and orientation input format as shown in
The next operation may be one of determining an input audio format encoding mode as shown in
Then there may be an operation of determining an orientation input information encoding based on at least one of an input audio format encoding mode and encoder stream bit rate (i.e., encoding bit rate) as shown in
The system may furthermore perform decoder operations 921.
The decoder operations may for example comprise obtaining from the bitstream the orientation information as shown in
Additionally, there may be an operation of providing orientation information to an internal renderer orientation control (or to a suitable external renderer interface) as shown in
With respect to the rendering operations 931 there may be an operation of receiving a user input 930 and furthermore applying orientation control of decoded audio signals (the audio scene) according to the orientation information and user input as shown in
The rendered audio scene according to the orientation control can then be output as shown in
With respect to
In some embodiments the device 1700 comprises at least one processor or central processing unit 1707. The processor 1707 can be configured to execute various program codes such as the methods such as described herein.
In some embodiments the device 1700 comprises a memory 1711. In some embodiments the at least one processor 1707 is coupled to the memory 1711. The memory 1711 can be any suitable storage means. In some embodiments the memory 1711 comprises a program code section for storing program codes implementable upon the processor 1707. Furthermore in some embodiments the memory 1711 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1707 whenever needed via the memory-processor coupling.
In some embodiments the device 1700 comprises a user interface 1705. The user interface 1705 can be coupled in some embodiments to the processor 1707. In some embodiments the processor 1707 can control the operation of the user interface 1705 and receive inputs from the user interface 1705. In some embodiments the user interface 1705 can enable a user to input commands to the device 1700, for example via a keypad. In some embodiments the user interface 1705 can enable the user to obtain information from the device 1700. For example the user interface 1705 may comprise a display configured to display information from the device 1700 to the user. The user interface 1705 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1700 and further displaying information to the user of the device 1700. In some embodiments the user interface 1705 may be the user interface for communicating.
In some embodiments the device 1700 comprises an input/output port 1709. The input/output port 1709 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1707 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
The transceiver input/output port 1709 may be configured to receive the signals.
In some embodiments the device 1700 may be employed as at least part of the synthesis device. The input/output port 1709 may be coupled to any suitable audio output, for example to a multichannel speaker system and/or headphones (which may be headtracked or non-tracked headphones) or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or “fab” for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2021/078115 | 10/12/2021 | WO |