Example embodiments may relate to systems, methods and/or computer programs for providing spatial audio. In particular, some embodiments relate to the transmission of spatial audio signals and associated metadata.
Spatial audio signals are being used with greater frequency to produce a more immersive audio experience. Spatial audio refers to 3D audio, i.e., it can provide a percept in which sound sources are heard from different directions. Spatial audio can be reproduced, e.g., using a loudspeaker setup or via headphones, preferably with head-tracking capability. A stereo or multi-channel recording can be passed from the recording or capture apparatus to a listening apparatus and replayed using a suitable multi-channel output, such as a multi-channel loudspeaker arrangement, or using binaural rendering or virtual surround processing on a pair of stereo headphones or a headset.
It may be possible for a mobile apparatus, such as a mobile phone, to have more than two microphones. This offers the possibility of recording real multichannel audio. With advanced signal processing, it is further possible to beamform, i.e., to directionally amplify or process, the audio signal captured by the microphones from a specific or desired direction.
The captured audio signals may comprise metadata to provide greater adaptability of spatial audio signals.
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
According to a first aspect, there is described an apparatus comprising: means for capturing spatial audio signals by a plurality of microphones using a first capture setting; means for generating a set of audio encoder input format data comprising a representation of the spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting; means for providing the set of audio encoder input format data to an audio encoder for encoding the representation of the spatial audio signals and the associated metadata to a bitstream for a current streaming session; means for transmitting the bitstream to one or more remote devices for rendering the representation of the spatial audio signals based, at least in part, on the set of descriptive metadata; and means for determining, during the current streaming session, that the first capture setting changes to a second capture setting, wherein, in response to the determination that the first capture setting changes to a second capture setting, and while transmitting the bitstream, the means for capturing spatial audio signals is configured to capture spatial audio signals using the second capture setting and the means for generating the audio encoder input format data is configured to change at least one of the one or more capture parameters of the set of descriptive metadata to be associated with the second capture setting.
In some embodiments, the means for determining that the first capture setting changes to the second capture setting is configured to determine that the orientation of the apparatus has changed from a first orientation to a second orientation.
In some embodiments, the means for determining that the first capture setting changes to the second capture setting is configured to determine that an input request has been received to change the first capture setting to the second capture setting.
In some embodiments, the apparatus further comprises means for transmitting the set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting to the one or more remote devices via an out-of-band signal.
In some embodiments, the one or more capture parameters comprise at least one of: a directional element comprising a number of directions described by the spatial metadata; a channel element comprising a number of transport channels supported by the apparatus; a source format describing a configuration of the apparatus; and a variable description describing at least one of a capture type, an angle between two microphones of the plurality of microphones, an apparatus size and a microphone polar pattern including omnidirectional, cardioid, hypercardioid, or supercardioid patterns.
In some embodiments, the bitstream for the current streaming session comprises the one or more capture parameters.
In some embodiments, the means for generating the set of audio encoder input format data is configured to use a metadata-assisted spatial audio, MASA, format.
In some embodiments, the audio encoder is an immersive voice and audio services, IVAS, codec.
In some embodiments, the apparatus comprises a user device.
According to a second aspect, there is described an apparatus comprising: means for receiving a bitstream for a current streaming session; means for decoding the bitstream for the current streaming session, to determine a representation of spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with a first capture setting; means for configuring an audio renderer according to the decoded descriptive metadata; means for providing the representation of the spatial audio signals to the audio renderer for rendering to produce a rendered audio output signal; means for outputting the rendered audio output signal by a plurality of speakers; means for receiving an updated bitstream for the current streaming session and means for decoding the updated bitstream for the current streaming session, to determine an updated representation of spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with a second capture setting, wherein the means for configuring the audio renderer according to the decoded descriptive metadata is configured to change at least one of the one or more capture parameters of the set of descriptive metadata to be associated with the second capture setting.
In some embodiments, the apparatus further comprises means for receiving the bitstream for the current streaming session.
In some embodiments, the apparatus further comprises means for receiving, via an out-of-band signal, the set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting; and wherein the means for configuring the audio renderer further comprises configuring the audio renderer according to the set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting.
In some embodiments, the apparatus comprises a user device.
According to a third aspect, there is described a method comprising: capturing spatial audio signals by a plurality of microphones using a first capture setting; generating a set of audio encoder input format data comprising a representation of the spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting; providing the set of audio encoder input format data to an audio encoder for encoding the representation of the spatial audio signals and the associated metadata to a bitstream for a current streaming session; transmitting the bitstream to one or more remote devices for rendering the representation of the spatial audio signals based, at least in part, on the set of descriptive metadata; and determining, during the current streaming session, that the first capture setting changes to a second capture setting, wherein, in response to the determination that the first capture setting changes to a second capture setting, and while transmitting the bitstream, capturing spatial audio signals comprising capturing spatial audio signals using the second capture setting and generating the audio encoder input format data comprises changing at least one of the one or more capture parameters of the set of descriptive metadata to be associated with the second capture setting.
According to a fourth aspect, there is described a method comprising: receiving a bitstream for a current streaming session; decoding the bitstream for the current streaming session, to determine a representation of spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with a first capture setting; configuring an audio renderer according to the decoded descriptive metadata; providing the representation of the spatial audio signals to the audio renderer for rendering to produce a rendered audio output signal; outputting the rendered audio output signal by a plurality of speakers; receiving an updated bitstream for the current streaming session and decoding the updated bitstream for the current streaming session, to determine an updated representation of spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with a second capture setting, wherein configuring the audio renderer according to the decoded descriptive metadata comprises changing at least one of the one or more capture parameters of the set of descriptive metadata to be associated with the second capture setting.
According to a fifth aspect, there is provided a computer program product comprising a set of instructions which, when executed on an apparatus, is configured to cause the apparatus to carry out the method of any preceding method definition.
According to a sixth aspect, there is provided a non-transitory computer readable medium comprising program instructions stored thereon for performing a method, comprising: capturing spatial audio signals by a plurality of microphones using a first capture setting; generating a set of audio encoder input format data comprising a representation of the spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting; providing the set of audio encoder input format data to an audio encoder for encoding the representation of the spatial audio signals and the associated metadata to a bitstream for a current streaming session; transmitting the bitstream to one or more remote devices for rendering the representation of the spatial audio signals based, at least in part, on the set of descriptive metadata; and determining, during the current streaming session, that the first capture setting changes to a second capture setting, wherein, in response to the determination that the first capture setting changes to a second capture setting, and while transmitting the bitstream, capturing spatial audio signals comprising capturing spatial audio signals using the second capture setting and generating the audio encoder input format data comprises changing at least one of the one or more capture parameters of the set of descriptive metadata to be associated with the second capture setting.
According to a seventh aspect, there is provided a non-transitory computer readable medium comprising program instructions stored thereon for performing a method, comprising: receiving a bitstream for a current streaming session; decoding the bitstream for the current streaming session, to determine a representation of spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with a first capture setting; configuring an audio renderer according to the decoded descriptive metadata; providing the representation of the spatial audio signals to the audio renderer for rendering to produce a rendered audio output signal; outputting the rendered audio output signal by a plurality of speakers; receiving an updated bitstream for the current streaming session and decoding the updated bitstream for the current streaming session, to determine an updated representation of spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with a second capture setting, wherein configuring the audio renderer according to the decoded descriptive metadata comprises changing at least one of the one or more capture parameters of the set of descriptive metadata to be associated with the second capture setting.
Example embodiments will now be described by way of non-limiting example, with reference to the accompanying drawings, in which:
Example embodiments relate to an apparatus, method and computer program for output of stereo or spatial audio. Stereo or spatial audio may be represented by data in any suitable form, whether in the form of one or more data files or, in the case of streaming data, data packets or any other suitable format. The stereo or spatial audio may relate to voice and other audio communications.
In its basic form, stereo audio data comprises two channels, left and right, for output by respective first and second loudspeakers. More advanced formats include 2.1, which adds lower frequencies for output to a third, subwoofer loudspeaker, as well as 5.1 and 7.1, which may be generally known as “surround sound” formats. Spatial audio data, also called three-dimensional audio or immersive audio, may describe audio data that enables users to perceive sounds from all around them. For a fully-immersive experience, the spatial audio data may include cues so that users can perceive other properties, such as directions of sounds emitted by one or more sound sources or objects, trajectories of the sound sources or objects, variations of sound magnitude with changing distance from the sound sources or objects, and other sound effects. For example, if a user moves their user device, e.g., their smartphone, this may change how the audio is perceived.
As used herein, the apparatus may comprise a user device having three or more microphones. The user device may be a portable user device, for example a smartphone, a tablet computer, digital assistant, wearable computer or head mounted device (HMD). This list is not exhaustive. The user device may also comprise loudspeakers.
User devices may have different form factors. For example, some user devices have multiple screens, some have three or more microphones and/or some may be foldable, i.e., having a foldable body carrying a foldable screen for use in both open and closed configurations and possibly in one or more intermediate configurations in which the screen is at some angle between the open and closed configurations. Some user devices may be used in different orientations, e.g., changing a user interface from a portrait mode to a landscape mode upon detecting rotation of the user device beyond 45 degrees of the horizontal plane or thereabouts. User devices may be configured to receive and decode different types of audio data, including monaural audio data, stereophonic (stereo) audio data comprising two channels, other forms of multi-channel audio data, e.g., 2.1, 5.1 and 7.1, and spatial audio data, e.g., Ambisonics or metadata-assisted spatial audio (MASA).
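The portrait/landscape switch mentioned above can be sketched as follows. This is a minimal, illustrative sketch: the 45-degree threshold follows the text, while the function name and the hysteresis band (added to avoid rapid mode flapping near the threshold) are assumptions not taken from the original.

```python
def update_ui_mode(current_mode, tilt_deg, threshold_deg=45.0, hysteresis_deg=5.0):
    """Illustrative portrait/landscape switch with hysteresis.

    tilt_deg is the device rotation from the horizontal plane; the mode
    only changes once the tilt clearly crosses the threshold, so small
    jitters around 45 degrees do not toggle the UI repeatedly.
    """
    if current_mode == "portrait" and tilt_deg > threshold_deg + hysteresis_deg:
        return "landscape"
    if current_mode == "landscape" and tilt_deg < threshold_deg - hysteresis_deg:
        return "portrait"
    return current_mode
```

The hysteresis band is a common design choice for orientation detection; without it, sensor noise near the threshold would cause the capture and UI configuration to oscillate.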
User devices may be capable of establishing a communications session with one or more other user devices, servers and/or nodes via a communications network. A user device may be configured to transmit and receive data using protocols for 3G, 4G, LTE, 5G or any future generation communication protocol. A user device may comprise means for short-range communications using, for example, Bluetooth, Zigbee or WiFi. The user device may comprise one or more antennas for communicating with external devices, for example one or more other remote user devices and/or one or more remote servers and/or one or more communications nodes of a network.
In use, a user device may process and output different types of audio data. For example, a user device may output stereo audio data associated with a music track or movie to first and second loudspeakers. Upon receipt of other audio data, i.e., audio data not being the stereo audio data currently being output to the first and second loudspeakers, the other audio data is usually output by one or both of the first and second loudspeakers. For example, upon receipt of a new text or multimedia message, an audible notification may be output to one or both of the first and second loudspeakers. The two types of audio data are mixed, at least for some period of time. The same or similar situation may be true for other types of data such as incoming call or conference notifications. Indeed, sometimes output of the other audio data may pause, mute or reduce the volume of the stereo audio data, at least for some period of time. Example embodiments aim to improve flexibility and user experience for user devices where there are three or more loudspeakers and/or microphones. For example, example embodiments may enable utilization of one or more loudspeakers that are currently not in use. For example, example embodiments may enable optimized audio output, e.g., stereo widening effects, enhanced immersivity and/or increased volume. Other advantages will become apparent.
Spatial audio refers to 3D audio. Spatial audio provides a perception of sound sources heard from different directions, replicating the sound scene for a user. Spatial audio can be reproduced, for example, using a loudspeaker setup or via headphones. Optionally, spatial audio rendering can utilize head-tracking capability.
Spatial capture is possible using various means. For example, a multi-microphone device such as a smartphone can be used to capture spatial audio.
The apparatus of
The apparatus may be a user device, or user equipment (UE), which typically refers to a portable computing device that includes wireless mobile communication devices operating with or without a subscriber identification module (SIM), including, but not limited to, the following types of devices: a mobile station (mobile phone), smartphone, personal digital assistant, handset, device using a wireless modem (alarm or measurement device, etc.), laptop and/or touch screen computer, tablet, game console, notebook, and multimedia device. It should be appreciated that a user device may also be a nearly exclusive uplink-only device, of which an example is a camera or video camera loading images or video clips to a network. A user device may also be a device having capability to operate in an Internet of Things (IoT) network, which is a scenario in which objects are provided with the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.
Metadata-assisted spatial audio (MASA) is a parametric spatial audio format, known from the ongoing 3GPP standardization of the Immersive Voice and Audio Services (IVAS) codec and specified in Annex A of 3GPP TS 26.258. (By contrast, Ambisonics is an example of a non-parametric spatial audio format.) In the MASA format, the spatial metadata may be obtained by analysis of the microphone signals. The MASA format consists of audio signals and metadata. The audio signals can be mono or stereo (i.e., 1-2 transport channels). The metadata comprises descriptive metadata and spatial metadata. Examples of descriptive metadata and spatial metadata are given below in detail:
Table A.1 presents the MASA descriptive common metadata parameters in order of writing.
Table A.2a and Table A.2b present the MASA spatial metadata parameters dependent and independent of the number of directions, respectively.
The MASA spatial metadata describes the spatial audio characteristics corresponding to the one or two transport audio signals. Thus, the spatial audio scene can be rendered for listening based on the combination of the transport audio signals and the spatial metadata.
The definitions and use of the MASA spatial metadata parameters are described in order in the following.
Spatial directions represent the directional energy flows in the sound scene. Each spatial direction together with corresponding direct-to-total energy ratio describes how much of the total energy for each time-frequency tile is coming from that specific direction. In general, this parameter can also be thought of as the direction of arrival (DOA).
There can be one or two spatial directions for each time-frequency tile in the input metadata. Each spatial direction is represented using a 16-bit direction index. This is an efficient representation of directions as points of a spherical grid with an accuracy of about 1 degree in any arbitrary direction.
The direction indexing corresponds to the function for transforming the audio direction angular values (azimuth ϕ and elevation θ) into an index, and the inverse function for transforming the index into the audio direction angular values.
Each pair of values containing the elevation and the azimuth is first quantized on a spatial spherical grid of points and the index of the corresponding point is constructed. The structure of the spherical grid is defined first, followed by the quantization function and lastly the index formation followed by the corresponding de-indexing function.
The spherical grid is defined as a succession of horizontal circles of points. The circles are distributed on the sphere, and they correspond to several elevation values. The indexing functions make the connection between the angles (elevation and azimuth) corresponding to each of these points on the grid and a 16-bit index.
The spherical grid is on a sphere of unitary radius that is defined by the following elements:
with cumN(1)=0 and
where δ is the uniform quantization step for i = 1, . . . , Nθ−1, and 2·round_i(x/2) is a rounding function to the nearest even integer (above x for i = 2, closest for i > 2). The term cumN(i) gives the cumulative cardinality (i.e., the cumulative number of points in the spherical grid) in a spherical zone going from the first non-zero elevation value to the i-th elevation value. This cumulative cardinality is derived from the relative area on the spherical surface, assuming a (near) uniform point distribution of the remaining 2^16 − 432 points (excluding the equator and poles).
The quantization in the spherical grid is done as follows:
The resulting quantized direction index is obtained by enumerating the points on the spherical grid by starting with the points for null elevation first, then the points corresponding to the smallest positive elevation codeword, the points corresponding to the first negative elevation codeword, followed by the points on the following positive elevation codeword and so on.
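The indexing scheme described above can be sketched in simplified form. The following is an illustrative, non-normative sketch: it quantizes elevation to uniform rings and gives each ring an azimuth point count proportional to cos(elevation), approximating a uniform density on the sphere, with cumulative ring offsets forming the index. The ring count, per-ring sizing, and enumeration order are assumptions for clarity and do not reproduce the normative 16-bit grid of TS 26.258 (this small grid does not fill all 2^16 codewords).

```python
import math

N_ELEV = 61  # elevation rings from -90 to +90 degrees (3-degree step); illustrative

def _ring_sizes():
    sizes = []
    for i in range(N_ELEV):
        elev = -90.0 + i * 180.0 / (N_ELEV - 1)
        # azimuth points per ring scale with cos(elevation);
        # the poles collapse to a single point
        sizes.append(max(1, round(120 * math.cos(math.radians(elev)))))
    return sizes

SIZES = _ring_sizes()
OFFSETS = [0]                      # cumulative cardinality up to each ring
for n in SIZES[:-1]:
    OFFSETS.append(OFFSETS[-1] + n)

def direction_to_index(azimuth_deg, elevation_deg):
    """Quantize (azimuth, elevation) in degrees to a grid index."""
    i = round((elevation_deg + 90.0) * (N_ELEV - 1) / 180.0)
    n = SIZES[i]
    j = round((azimuth_deg % 360.0) * n / 360.0) % n
    return OFFSETS[i] + j

def index_to_direction(index):
    """De-index: map a grid index back to (azimuth, elevation) in degrees."""
    i = max(k for k in range(N_ELEV) if OFFSETS[k] <= index)
    j = index - OFFSETS[i]
    elev = -90.0 + i * 180.0 / (N_ELEV - 1)
    azim = j * 360.0 / SIZES[i]
    return azim, elev
```

A round trip through `direction_to_index` and `index_to_direction` recovers the original direction to within the ring quantization step, illustrating why a point grid with cumulative offsets allows both compact indexing and simple de-indexing.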
Direct-to-total energy ratios work together with spatial directions as described above. Each direct-to-total energy ratio corresponds to a specific spatial direction and describes how much of the energy comes from that specific spatial direction compared to the total energy.
Spread coherence is a parameter that describes the directional energy flow further. It represents situations where coherent directional sound energy is coming from multiple directions at the same time. This is represented with a single spread coherence parameter that describes how the sound should be synthesized.
In synthesis, this parameter should be used such that a value of 0 means that the sound is synthesized to a single direction as directed by the spatial direction, a value of 0.5 means that the sound is synthesized to the spatial direction and two surrounding directions as coherent, and a value of 1 means that the sound is synthesized to two surrounding directions around the spatial direction.
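One possible synthesis mapping for the three reference values above can be sketched as follows. The piecewise-linear interpolation between the endpoints and the energy normalization are assumptions chosen for illustration; only the behaviour at 0, 0.5 and 1 follows the description above, and a real renderer may use a different curve.

```python
import math

def spread_coherence_gains(xi):
    """Illustrative mapping of spread coherence xi in [0, 1] to amplitude
    gains (center, left, right): the loudspeaker at the spatial direction
    and the two surrounding it. Normalized to unit total energy."""
    if not 0.0 <= xi <= 1.0:
        raise ValueError("spread coherence must be in [0, 1]")
    center = 1.0 - xi          # center fades out as spread increases
    side = min(xi, 0.5)        # sides fade in up to xi = 0.5, then hold
    norm = math.sqrt(center ** 2 + 2.0 * side ** 2)
    return center / norm, side / norm, side / norm
```

At xi = 0 all energy goes to the spatial direction, at xi = 0.5 the three directions are fed equally and coherently, and at xi = 1 only the two surrounding directions are used, matching the three anchor cases in the text.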
The diffuse-to-total energy ratio represents non-directional energy flow in the sound scene. It is a complement to the direct-to-total energy ratios: in an ideal capture with no undesired signal (or in a synthesized sound scene), the diffuse-to-total energy ratio is always equal to one minus the sum of the direct-to-total energy ratios.
Surround coherence is a parameter that describes the non-directional energy flow. It represents how much of the non-directional energy should be presented as coherent reproduction instead of decorrelated reproduction.
The remainder-to-total energy ratio represents all the energy that does not “belong” to the captured sound scene based on the used model. This includes possible microphone noise and other capture artefacts that have not been removed from the signal in pre-processing. This means that, by considering the direct-to-total energy ratios, the diffuse-to-total energy ratio, and the remainder-to-total energy ratio, we end up with a complete energy ratio model in which the ratios sum to one when there is any remainder energy present.
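The complete energy ratio model described above can be checked per time-frequency tile. The following is an illustrative container, assuming a simple list-of-ratios layout; the class and field names are chosen for readability and are not the normative MASA field names.

```python
from dataclasses import dataclass

@dataclass
class TileEnergyRatios:
    """Energy ratios of one time-frequency tile (illustrative sketch)."""
    direct: list       # one direct-to-total ratio per direction (1 or 2)
    diffuse: float     # diffuse-to-total energy ratio
    remainder: float   # remainder-to-total energy ratio

    def is_consistent(self, tol=1e-6):
        # every ratio must lie in [0, 1] and the full model must sum to 1
        ratios = list(self.direct) + [self.diffuse, self.remainder]
        if any(r < -tol or r > 1.0 + tol for r in ratios):
            return False
        return abs(sum(ratios) - 1.0) <= tol
```

Such a check makes the complementary nature of the ratios explicit: increasing the energy attributed to one component must decrease the others.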
The spatial metadata describes how spatial rendering is done, i.e., from which direction a certain time-frequency component should be rendered, etc. This defines the sound scene (together with the audio). The descriptive metadata provides additional information, e.g., on how the spatial capture or spatial audio generation was done. This information can be used to optimize encoding in certain cases, and it can be used to improve rendering quality in advanced rendering, when the information is made available at the receiver. For this to happen, the information needs to be passed through the transmission channel, e.g., using the codec bitstream, RTP signaling, or any other suitable method.
Channel audio format in MASA descriptive metadata defines the number of directions in MASA metadata (1-2), the number of channels (1-2 transport channels), and the source format (unknown source format including mixes (default value); microphone grid, e.g., smartphone or other such UE; channel-based source (e.g., 5.1); or Ambisonics). In addition, there is a so-called variable description. All this information can be constant/fixed in a session; however, the information is provided in each frame, i.e., once per 20 ms, and it can thus vary often. Depending on the exact source format value (and number of transport channels), the variable description provides, e.g.: transport channel polar pattern information, channel angles for the directive patterns, and channel distance information.
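The Channel audio format information described above can be sketched as a simple structure. This is an illustrative sketch only: the field and value names are chosen for readability and do not reproduce the normative bit-field layout of the MASA specification.

```python
from dataclasses import dataclass, field

@dataclass
class ChannelAudioFormat:
    """Illustrative sketch of Channel audio format descriptive metadata,
    provided once per 20 ms frame (i.e., at a 50 Hz rate)."""
    number_of_directions: int          # 1 or 2 directions in the spatial metadata
    number_of_channels: int            # 1 or 2 transport channels
    source_format: str                 # see VALID_SOURCE_FORMATS below
    variable_description: dict = field(default_factory=dict)
    # e.g. {"polar_pattern": ..., "channel_angle_deg": ..., "channel_distance_m": ...}

    VALID_SOURCE_FORMATS = ("unknown", "microphone_grid", "channel_based", "ambisonics")

    def is_valid(self):
        return (self.number_of_directions in (1, 2)
                and self.number_of_channels in (1, 2)
                and self.source_format in self.VALID_SOURCE_FORMATS)
```

Because the structure is re-sent every frame, a capture-side change (e.g., a new polar pattern) simply becomes a new frame value, which is the adaptation mechanism the rest of this disclosure builds on.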
Spatial audio capture can be done in many different usage scenarios. For example, regular spatial audio call, spatial audio with audio focus, and spatial audio capture in a rich ambience generally require somewhat different means and settings for optimal capture, encoding, and rendering. In capture, this is up to the implementor, and many solutions may exist. On the other hand, anything related to encoding needs to be available for the codec, and for anything related to rendering, the information needs to be made available for the receiving UE.
If these optimizations and adaptations are not fully considered, the resulting user experience, such as the rendered spatial audio quality, can be compromised to some degree; there is thus room for overall improvement. It is noted that the IVAS MASA format provides a mechanism to adapt several rendering-related aspects via a 50 Hz update rate for the descriptive metadata, and delivering immersive voice and audio capture optimizations can then be done using the IVAS codec in a suitable way.
The present disclosure relates to spatial audio capture, encoding, transmission, and rendering. A method is proposed to enable rendering quality improvements in spatial audio communications and real-time streaming by configuring and adding descriptive metadata parameters that correspond to the selected spatial audio capture mode and/or the targeted effect. This provides real-time information describing specific properties of the spatial audio capture setup and/or the targeted spatial audio effect to the spatial audio renderer. Utilization of the transmitted descriptive metadata parameters depends on the specific renderer implementation of the receiving device. The present disclosure thereby makes it possible to provide audio capture parameters that are synchronized with the corresponding spatial audio representation to a renderer via a codec bitstream or out-of-band signaling.
The adaptive audio capture parameters can, in some examples, indicate a separation of an audio-focused source or sources for improved rendering control in playback.
The adaptive audio capture parameters can, in further examples, indicate a change in at least one of: capture microphone polar patterns, the angle between the beams, and the distance between the stereo channels. This information can be used in rendering, for example, to modify rendering such that the changes become inaudible to a user (to maintain consistent rendering regardless of changes in capture) or such that the changes are made more pronounced (to make the effect stronger).
The rendering control is thus improved significantly, leading to improved user experience. Specifically, the current disclosure is described in the context of the IVAS MASA input format, and the disclosure can be implemented as part of an enhanced MASA audio capture system. According to the provided IVAS implementation, the adaptation rate for the system is 50 Hz, based on the IVAS frame size and MASA metadata definitions. For out-of-band signaling, in preferred examples, the Real-time Transport Protocol (RTP) can be used to transmit the information at least substantially synchronized with the audio codec bitstream. While the adaptation rate at the input can be 50 Hz, the transmitted update rate can be the same or lower.
The present disclosure relates to real-time adaptation of descriptive metadata in a parametric spatial audio format for encoding and transmission in spatial audio communications and streaming use cases and services. Specifically, the codec can be the 3GPP IVAS codec, the parametric audio format can be the MASA format, and the adapted part of the descriptive metadata can contain the Channel audio format including the Source format description comprising Transport definition, Channel angles and Channel distance, while Number of channels can be (at least predominantly) 2 and Number of directions can be 1 or 2.
In handset mode, the apparatus 100 can have several different orientations (e.g., the first 300, second 310 and third orientations 320 shown in
The user 202 can first be on a media call with the apparatus 100 (e.g., a mobile phone) against their ear, i.e., the user is capturing a spatial audio signal with no video in handset mode, and the first capture setting is used in this case. The microphone 302A-302D placement and the resulting selection are to be understood as one example only. In an example where there are four microphones 302A-302D, one microphone may be placed in the center of each side of the apparatus. An example capture first selects and maintains the following parameters in the MASA descriptive metadata:
As will be appreciated, many MASA formats may conceivably be used, however the above represents one example MASA format.
When finished with the media call, the user 202 then changes the apparatus 100 orientation to capture further audio (e.g., landscape orientation, see
In the second capture setting, the spatial audio signals and the corresponding metadata are adapted. The parameters in the MASA descriptive metadata can now be, by way of example, as follows:
Thus, the user triggering recording of a video as part of the media call results in selection of a second capture setting, which may in turn drive a new selection of the microphone polar pattern configuration and finally the corresponding change in the descriptive metadata that is sent to the receiving user. In addition, the orientation of the device changes the optimal transport signal selection. This change, too, is made visible to the receiver as part of the parametric format's descriptive metadata.
Thus, it is specifically the first two sub-fields in the “Variable description” that are adapted according to the choice driven by mode selection (audio-only, audio-video) in this example. In addition, the apparatus orientation change results in a different transport channel selection of microphones 502A and 502B, which is reflected primarily in the last sub-field value. The new transport channel selection of microphones 502A and 502B is based on which microphones are most useful in the new orientation.
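The adaptation described above can be sketched as follows. This is an illustrative assumption only: the field names, values, and microphone identifiers mirror the examples in this description rather than the normative MASA descriptive metadata bit layout.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DescriptiveMetadata:
    capture_type: str       # first "Variable description" sub-field
    polar_pattern: str      # second sub-field: microphone polar pattern
    transport_mics: tuple   # transport channel selection (last sub-field)

# First capture setting: handset-mode media call, no video.
handset = DescriptiveMetadata("audio-only", "omnidirectional", ("302A", "302B"))

def adapt_for_video(meta):
    """Second capture setting: the user starts video in landscape orientation,
    so the first two sub-fields and the transport selection change."""
    return replace(meta, capture_type="audio-video",
                   polar_pattern="cardioid",
                   transport_mics=("502A", "502B"))

landscape = adapt_for_video(handset)
```

The updated record would then be inserted into the encoder input for the next time segment, while the first capture setting remains untouched for earlier segments.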
This new adaptive behavior of the capture information, when delivered to the receiver, can then be used to optimize the rendering in a suitable renderer implementation. For example, a wider sound stage can be rendered for the music video part, while a more subdued spatial scene may be of interest during the handset mode call.
A second example use case (not shown) may also be considered. Audio focus is a feature that allows for pronounced capture of directional sound sources, such as talkers in a scene, via beamforming. Audio focus can thus, e.g., suppress some of the background noise or ambience signals to allow for a cleaner capture and reproduction of the talker. For example, a user wishes to use audio focus while recording a spatial scene.
Consider that an audio focus algorithm on the user's smartphone detects two talkers (or other sound sources of interest) in different parts of the scene, and that the active signals from these two sound sources at least partly overlap in time. For example, it can be useful to be able to concentrate more on one of the audio-focused sources at a given time during the rendering. According to the present disclosure, this can be better achieved by a suitable description of the capture scenario and related capture configuration, which is enabled by adapting the descriptive metadata (i.e., in a second capture setting) and transmitting it to the receiver.
According to the present disclosure, the system now adapts the descriptive metadata according to a second capture setting of the MASA format as follows during a specific portion of the audio focusing spatial audio capture:
Thus, it is specifically the first two sub-fields in the “Variable description” that are adapted according to the choice driven by the audio focus processing.
This set of information, when delivered to the receiver, can then be used to optimize the rendering in a suitable renderer implementation. For example, the listener may be able to control the rendering of the two main sound sources that the audio focus detected and focused on.
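Since the MASA spatial metadata described here carries one or two directions, the audio focus case above can be reduced to a small, hypothetical selection rule; the function name and inputs are assumptions for illustration, not part of the format.

```python
def directions_for_focus(num_focused_sources):
    """MASA-style spatial metadata supports 1 or 2 directions; raise the
    "Number of directions" to 2 only when two (at least partly time-
    overlapping) focused sound sources are detected in the scene."""
    return 2 if num_focused_sources >= 2 else 1
```

For example, when the beamformer reports two overlapping talkers, the second capture setting would carry `Number of directions = 2`, letting the receiver render or control each focused source separately.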
A third example use case (not shown) may also be considered. For example, the user is in a pleasant ambience in a forest with birds singing in the trees. The spatial audio capture algorithm on the user's tablet device detects a rich ambience in a 3D sound scene. For example, there can be an AI component that understands in which kind of environment the user is capturing audio. Good settings for the scenario are chosen; e.g., smooth playback with good 3D coverage is suitable.
According to the disclosure, the system selects the capture configuration and adapts the descriptive metadata of the MASA format. As will be appreciated, many MASA formats may conceivably be used; however, by way of example, a proposed MASA format is shown as follows:
Thus, it is specifically the first two sub-fields in the “Variable description” that are adapted according to the choice driven by the desired effect of smooth playback of an expansive immersive ambience. This generates the second capture setting, which may differ from the first capture setting initially used by the apparatus.
This set of information, when delivered to the receiver, can then be used to optimize the rendering in a suitable renderer implementation. For example, the listener may get better utilization of their multi-channel loudspeaker setup, with more content now being played back also from behind the user and from the height channels.
The method 600 comprises a first operation 601 of capturing spatial audio signals by a plurality of microphones using a first capture setting.
The method 600 comprises a second operation 602 of generating a set of audio encoder input format data comprising a representation of the spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting. Generating the set of audio encoder input format data may optionally comprise generating the set of audio encoder input format data using the MASA format.
The method 600 comprises a third operation 603 of providing the set of audio encoder input format data to an audio encoder for encoding the representation of the spatial audio signals and the associated metadata to a bitstream for a current streaming session. Optionally, the audio encoder may be the IVAS codec.
The method 600 comprises a fourth operation 604 of transmitting the bitstream to one or more remote devices for rendering the representation of the spatial audio signals based, at least in part, on the set of descriptive metadata.
The method 600 comprises a fifth operation 605 of determining, during the current streaming session, that the first capture setting changes to a second capture setting.
In response to the determining that the first capture setting changes to a second capture setting, and while transmitting the bitstream, capturing spatial audio signals comprises capturing spatial audio signals using the second capture setting, and generating the audio encoder input format data comprises changing at least one of the one or more capture parameters of the set of descriptive metadata to be associated with the second capture setting. Optionally, determining that the first capture setting changes to a second capture setting may comprise determining that the orientation of the apparatus has changed from a first orientation to a second orientation. For example, the apparatus may have changed from a handset mode as shown in
The method 600 may further comprise transmitting the set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting to the one or more remote devices. The transmission may be conducted via an out-of-band signal.
The capture parameters may comprise at least one of: a directional element comprising a number of directions described by the spatial metadata; a channel element comprising a number of transport channels supported by the apparatus; a source format describing a configuration of the apparatus; and a variable description describing at least one of a capture type, an angle between two microphones of the plurality of microphones, an apparatus size, and a microphone polar pattern including omnidirectional, cardioid, hypercardioid, or supercardioid patterns. Optionally, the bitstream for the current streaming session comprises the one or more capture parameters.
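The capture parameters listed above can be gathered into one illustrative data structure. This is only a sketch for clarity: the field names and units (degrees, millimeters) are assumptions, not the normative encoding of the descriptive metadata.

```python
from dataclasses import dataclass
from enum import Enum

class PolarPattern(Enum):
    OMNIDIRECTIONAL = "omnidirectional"
    CARDIOID = "cardioid"
    HYPERCARDIOID = "hypercardioid"
    SUPERCARDIOID = "supercardioid"

@dataclass
class VariableDescription:
    capture_type: str            # e.g. audio-only or audio-video capture
    mic_angle_deg: float         # angle between two of the microphones
    apparatus_size_mm: float     # apparatus size
    polar_pattern: PolarPattern  # selected microphone polar pattern

@dataclass
class CaptureParameters:
    num_directions: int          # directional element (1 or 2)
    num_transport_channels: int  # channel element (typically 2)
    source_format: str           # configuration of the apparatus
    variable: VariableDescription

params = CaptureParameters(
    num_directions=1,
    num_transport_channels=2,
    source_format="smartphone",
    variable=VariableDescription("audio-only", 90.0, 150.0,
                                 PolarPattern.OMNIDIRECTIONAL),
)
```

Changing the capture setting then amounts to rewriting one or more of these fields, with the updated record carried in (or alongside) the bitstream for the current streaming session.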
The method 700 comprises a first operation 701 of receiving a bitstream for a current streaming session. Receiving the bitstream for a current streaming session may include receiving the bitstream from the apparatus 100 of
The method 700 comprises a second operation 702 of decoding the bitstream for the current streaming session, to determine a representation of spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with a first capture setting.
The method 700 comprises a third operation 703 of configuring an audio renderer according to the decoded descriptive metadata.
The method 700 comprises a fourth operation 704 of providing the representation of the spatial audio signals to the audio renderer for rendering to produce a rendered audio output signal.
The method 700 comprises a fifth operation 705 of outputting the rendered audio output signal by a plurality of speakers.
The method 700 comprises a sixth operation 706 of receiving an updated bitstream for the current streaming session and decoding the updated bitstream for the current streaming session, to determine an updated representation of spatial audio signals and associated metadata, the metadata including, in part, a set of descriptive metadata indicative of one or more capture parameters associated with a second capture setting. Configuring the audio renderer according to the decoded descriptive metadata comprises changing at least one of the one or more capture parameters of the set of descriptive metadata to be associated with the second capture setting.
The method may optionally further comprise receiving the set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting via an out-of-band signal. Furthermore, configuring the audio renderer may further comprise configuring the audio renderer according to the set of descriptive metadata indicative of one or more capture parameters associated with the first capture setting.
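The receiving-side reconfiguration of operations 703 and 706 can be sketched as a simple merge: parameters carried by the updated bitstream overwrite the corresponding renderer settings, and unchanged parameters persist. The helper name and the dictionary representation are illustrative assumptions.

```python
def configure_renderer(current_config, descriptive_metadata):
    """Merge capture parameters decoded from the (updated) bitstream into
    the renderer configuration; parameters not carried are kept as-is."""
    merged = dict(current_config)
    merged.update(descriptive_metadata)
    return merged

# First capture setting decoded from the initial bitstream.
config = configure_renderer({"output": "headphones"},
                            {"capture_type": "audio-only", "num_directions": 1})
# Updated bitstream in the same session: second capture setting changes
# only one parameter, so the rest of the configuration is retained.
config = configure_renderer(config, {"num_directions": 2})
```

This matches the behavior described above: on receiving the updated bitstream, only the changed capture parameters of the descriptive metadata are re-associated, and the renderer is reconfigured accordingly.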
At step 801, a user begins a spatial audio capture, e.g., a spatial audio call. The capture system responds by defining an initial capture mode with a first set of capture settings. At step 802, the capture system captures the spatial audio and generates an audio encoder input format according to the captured audio (and, e.g., according to other system settings, such as a negotiated encoding mode for the codec being supported by the spatial audio call). For example, the capture system generates the MASA format for the IVAS encoder.
At step 803, the capture system configures descriptive metadata parameters for the audio encoder format based on the initially selected capture mode and settings. These are inserted as part of the encoder format to complete the input for current time segment. For example, this inserts MASA descriptive metadata parameters as shown in examples above for an IVAS encoder time segment, i.e., a 20-ms frame.
At step 804, the method proceeds to a next time segment, which is processed (e.g., a 20-ms frame for IVAS). The capture system detects any input to adapt the spatial audio capture. For example, this can be a user input, an application input, etc. For example, the user can add a video recording, select an audio focus, or provide any other relevant input that can change at least some aspect relating to the spatial audio capture. At step 806, if an input was detected, the spatial audio capture is updated according to a new mode setting. If no input is detected, the previously selected mode setting is maintained. Again, at step 802, the system captures the spatial audio and generates an audio encoder input format according to the captured audio. The spatial audio capture may now differ according to whether or not the spatial audio capture was updated. At step 803, the capture system again configures the descriptive metadata parameters for the audio encoder format based now on the currently selected capture mode and settings.
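The per-segment loop described above can be condensed into a minimal sketch, assuming IVAS-style 20-ms time segments; the input labels and dictionary layout are hypothetical.

```python
FRAME_MS = 20  # IVAS encoder time segment, i.e., a 20-ms frame

def capture_loop(frames, inputs_by_frame):
    """For each 20-ms segment: apply any detected adaptation input, then
    insert the currently selected descriptive metadata into the segment."""
    mode = {"capture_type": "audio-only"}    # initial capture mode
    segments = []
    for i, audio in enumerate(frames):
        user_input = inputs_by_frame.get(i)  # user/application input, if any
        if user_input == "add-video":        # e.g., user adds video recording
            mode = {"capture_type": "audio-video"}
        segments.append({"t_ms": i * FRAME_MS,
                         "audio": audio,
                         "descriptive_metadata": dict(mode)})
    return segments

# A video recording is added during the second segment.
segments = capture_loop(["f0", "f1", "f2"], {1: "add-video"})
```

Note that the metadata is re-inserted for every time segment, so a mid-session change takes effect from the segment in which the input was detected and persists until the next change.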
At step 806, the audio encoder format is encoded by the audio encoder, which is run, and the encoded audio data (bitstream) is transmitted at step 807. The encoded bitstream is received 808 at a separate apparatus. The separate apparatus may be a user device of any kind, as may the apparatus from which the encoded bitstream was sent.
On the receiving side, at step 809 the bitstream is received and decoded using the decoder. At step 810, an audio renderer is configured according to the transmitted and decoded descriptive metadata parameters. For example, in IVAS these are the MASA descriptive metadata parameters as shown in the examples above. There can also be other inputs, e.g., determining the output audio target based on the rendering device of the recipient (e.g., headphone or loudspeaker rendering). In addition, there could be, e.g., some user inputs relating to the rendering mode.
At step 811, an output audio is rendered according to at least the received spatial audio and the selected configuration, and finally, at step 812, the audio is output.
Steps 901-912 correspond to steps 801-812 of
The capture system disclosed in
The IVAS codec or any other suitable immersive audio codec can provide means for delivering the necessary descriptive metadata to the receiver. There can be provided means to encode the descriptive metadata as part of the codec bitstream (as part of “audio data”). The metadata signalling being part of the bitstream/payload can be called in-band signalling.
The IVAS codec or any other suitable immersive audio codec can provide means for delivering the necessary descriptive metadata to the receiver according to a suitable out-of-band signalling. For example, in preferred implementations, the descriptive metadata parameters or, e.g., their changes, can be transmitted using RTP. For example, RTP header extension mechanism can be used.
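As an illustration of the out-of-band path, a small descriptive-metadata payload could be carried in an RTP header extension element using the RFC 8285 one-byte header form. The element ID and the payload content are assumptions chosen for this sketch, not values defined by any standard.

```python
def one_byte_ext_element(ext_id, payload):
    """Pack a payload as an RFC 8285 'one-byte header' extension element:
    a 4-bit element ID, a 4-bit (length - 1), then 1..16 bytes of data."""
    if not (1 <= ext_id <= 14):
        raise ValueError("one-byte header IDs must be in 1..14")
    if not (1 <= len(payload) <= 16):
        raise ValueError("one-byte header payloads are 1..16 bytes")
    return bytes([(ext_id << 4) | (len(payload) - 1)]) + payload

# e.g., a hypothetical 2-byte payload signalling that the first two
# "Variable description" sub-fields changed for the second capture setting.
element = one_byte_ext_element(5, bytes([0x01, 0x02]))
```

Transmitting only the changed parameters this way keeps the out-of-band signalling small, while the full descriptive metadata could still be conveyed in-band as part of the codec bitstream.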
Names of network elements, protocols, and methods are based on current standards. In other versions or other technologies, the names of these network elements and/or protocols and/or methods may be different, as long as they provide a corresponding functionality. For example, embodiments may be deployed in 2G/3G/4G/5G networks and further generations of 3GPP but also in non-3GPP radio networks such as WiFi.
A memory may be volatile or non-volatile. It may be, e.g., a RAM, an SRAM, a flash memory, an FPGA block RAM, a DVD, a CD, a USB stick, or a Blu-ray disc.
If not otherwise stated or otherwise made clear from the context, the statement that two entities are different means that they perform different functions. It does not necessarily mean that they are based on different hardware. That is, each of the entities described in the present description may be based on a different hardware, or some or all of the entities may be based on the same hardware. It does not necessarily mean that they are based on different software. That is, each of the entities described in the present description may be based on different software, or some or all of the entities may be based on the same software. Each of the entities described in the present description may be embodied in the cloud.
Implementations of any of the above described blocks, apparatuses, systems, techniques or methods include, as non-limiting examples, implementations as hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Some embodiments may be implemented in the cloud.
It is to be understood that what is described above is what is presently considered the preferred embodiments. However, it should be noted that the description of the preferred embodiments is given by way of example only and that various modifications may be made without departing from the scope as defined by the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2313324.2 | Sep 2023 | GB | national |