The present application claims priority from Greece Provisional Patent Application No. 20190100493, filed Nov. 4, 2019, entitled “SIGNALLING OF AUDIO EFFECT METADATA IN A BITSTREAM,” which is incorporated by reference in its entirety.
Aspects of the disclosure relate to audio signal processing.
The evolution of surround sound has made available many output formats for entertainment nowadays. The range of surround-sound formats in the market includes the popular 5.1 home theatre system format, which has been the most successful in terms of making inroads into living rooms beyond stereo. This format includes the following six channels: front left (L), front right (R), center or front center (C), back left or surround left (Ls), back right or surround right (Rs), and low frequency effects (LFE). Other examples of surround-sound formats include the growing 7.1 format and the futuristic 22.2 format developed by NHK (Nippon Hoso Kyokai or Japan Broadcasting Corporation) for use, for example, with the Ultra High Definition Television standard. It may be desirable for a surround sound format to encode audio in two dimensions (2D) and/or in three dimensions (3D). However, these 2D and/or 3D surround sound formats require high bit rates to properly encode the audio in 2D and/or 3D.
Beyond channel-based formats, new audio formats for enhanced reproduction are becoming available, such as, for example, object-based and scene-based (e.g., higher-order Ambisonics or HOA) codecs. An audio object encapsulates individual pulse-code-modulation (PCM) audio streams, along with their three-dimensional (3D) positional coordinates and other spatial information (e.g., object coherence) encoded as metadata. The PCM streams are typically encoded using, e.g., a transform-based scheme (for example, MPEG Layer-3 (MP3), AAC, MDCT-based coding). The metadata may also be encoded for transmission. At the decoding and rendering end, the metadata is combined with the PCM data to recreate the 3D sound field.
Scene-based audio is typically encoded using an Ambisonics format, such as B-Format. The channels of a B-Format signal correspond to spherical harmonic basis functions of the sound field, rather than to loudspeaker feeds. A first-order B-Format signal has up to four channels (an omnidirectional channel W and three directional channels X, Y, Z); a second-order B-Format signal has up to nine channels (the four first-order channels and five additional channels R, S, T, U, V); and a third-order B-Format signal has up to sixteen channels (the nine second-order channels and seven additional channels K, L, M, N, O, P, Q).
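The channel counts listed above follow the usual relation for a full three-dimensional Ambisonics signal, in which order N carries (N+1)^2 channels and each order adds 2N+1 channels over the previous one. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not taken from the disclosure.

```python
def ambisonic_channel_count(order: int) -> int:
    """Number of channels in a full 3D Ambisonics signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def channels_added_at_order(order: int) -> int:
    """Number of channels that this order adds over the next lower order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return 2 * order + 1


if __name__ == "__main__":
    # Orders 1 to 3 correspond to the first-, second-, and third-order
    # B-Format signals described in the text.
    for n in (1, 2, 3):
        print(n, ambisonic_channel_count(n), channels_added_at_order(n))
```

For orders 1, 2, and 3 this gives totals of four, nine, and sixteen channels, with three, five, and seven channels added at each step, which agrees with the counts given above.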
Advanced audio codecs (e.g., object-based codecs or scene-based codecs) may be used to represent the soundfield (i.e., the distribution of air pressure in space and time) over an area to support multi-directional and immersive reproduction. The incorporation of head-related transfer functions (HRTFs) during rendering may be used to enhance these qualities for headphones.
A method of manipulating a soundfield according to a general configuration comprises receiving a bitstream that comprises metadata and a soundfield description; parsing the metadata to obtain an effect identifier and at least one effect parameter value; and applying, to the soundfield description, an effect identified by the effect identifier. The applying may include using the at least one effect parameter value to apply the identified effect to the soundfield description. Computer-readable storage media comprising code which, when executed by at least one processor, causes the at least one processor to perform such a method are also disclosed.
An apparatus for manipulating a soundfield according to a general configuration includes a decoder configured to receive a bitstream that comprises metadata and a soundfield description and to parse the metadata to obtain an effect identifier and at least one effect parameter value; and a renderer configured to apply, to the soundfield description, an effect identified by the effect identifier. The renderer may be configured to use the at least one effect parameter value to apply the identified effect to the soundfield description. Apparatus comprising a memory configured to store computer-executable instructions and a processor coupled to the memory and configured to execute the computer-executable instructions to perform such parsing and rendering operations are also disclosed.
Aspects of the disclosure are illustrated by way of example. In the accompanying figures, like reference numbers indicate similar elements.
A soundfield as described herein may be two-dimensional (2D) or three-dimensional (3D). One or more arrays used to capture a soundfield may include a linear array of transducers. Additionally or alternatively, the one or more arrays may include a spherical array of transducers. One or more arrays may also be positioned within the scene space, and such arrays may include arrays having fixed positions and/or arrays having positions that may change during an event (e.g., that are mounted on people, wires, or drones). For example, one or more arrays within the scene space may be mounted on people participating in the event such as players and/or officials (e.g., referees) in a sports event, performers and/or an orchestra conductor in a music event, etc.
A soundfield may be recorded using multiple distributed arrays of transducers (e.g., microphones) in order to capture spatial audio over a large scene space (e.g., a baseball stadium as shown in
Audio formats that provide for more accurate modeling of a soundfield (e.g., object- and scene-based codecs) may also allow for spatial manipulation of the soundfield. For example, a user may prefer to alter the reproduced soundfield in any one or more of the following aspects: to make sound arriving from a particular direction louder or softer as compared to sound arriving from other directions; to hear sound arriving from a particular direction more clearly as compared to sound arriving from other directions; to hear sound from only one direction and/or to mute sound from a particular direction; to rotate the soundfield; to move a source within the soundfield; to move the user's location within the soundfield. User selection or modification as described herein may be performed, for example, using a mobile device (e.g., a smartphone), a tablet, or any other interactive device or devices.
Such user interaction or direction (e.g., soundfield rotation, zooming into the audio scene) may be performed in a manner that is similar to selecting an area of interest in an image or video (as shown in
Although audio manipulation (e.g., zooming, focus) is described above as a consumer side-only process, it may be desirable for a content creator to be able to apply such effects during production of media content that includes a soundfield. Examples of such produced content may include recordings of live events, such as sports or musical performances, as well as recordings of scripted events, such as movies or plays. The content may be audiovisual (e.g., a video or movie) or audio only (e.g., a sound recording of a music concert) and may include one or both of recorded (i.e. captured) audio and generated (e.g., synthetic, meaning synthesized rather than captured) audio. A content creator may desire to manipulate a recorded and/or generated soundfield for any of various reasons, such as for dramatic effect, to provide emphasis, to direct a listener's attention, to improve intelligibility, etc. The product of such processing is audio content (e.g., a file or bitstream) having the intended audio effect baked-in (as shown in
While producing audio content in such form may ensure that the soundfield can be reproduced as the content creator intended, such production may also impede a user from being able to experience other aspects of the soundfield as originally recorded. For example, the result of a user's attempt to zoom into an area of the soundfield may be suboptimal, as audio information for that area may no longer be available within the produced content. Producing the audio content in this manner may also prevent consumers from being able to reverse the creator's manipulations and may even prevent the content creator from being able to modify the produced content in a desired manner. For example, a content creator may be dissatisfied with the audio manipulation and may want to change the effect in retrospect. As audio information necessary to support such a change may have been lost during the production, being able to alter the effects after production may require that the original soundfield has been stored separately as a backup (e.g., may require the creator to maintain a separate archive of the soundfield before the effects were applied).
Systems, methods, apparatus, and devices as disclosed herein may be implemented to signal intended audio manipulations as metadata. For example, the captured audio content may be stored in a raw format (i.e., without the intended audio effect), and a creator's intended audio effect behavior may be stored as metadata in the bitstream. A consumer of the content may decide whether she wants to listen to the raw audio or to hear the audio with the creator's intended audio effect (as shown in
Several illustrative configurations will now be described with respect to the accompanying drawings, which form a part hereof. While particular configurations, in which one or more aspects of the disclosure may be implemented, are described below, other configurations may be used and various modifications may be made without departing from the scope of the disclosure or the spirit of the appended claims.
Unless expressly limited by its context, the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium. Unless expressly limited by its context, the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, estimating, and/or selecting from a plurality of values. Unless expressly limited by its context, the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device), and/or retrieving (e.g., from an array of storage elements). Unless expressly limited by its context, the term “selecting” is used to indicate any of its ordinary meanings, such as identifying, indicating, applying, and/or using at least one, and fewer than all, of a set of two or more. Unless expressly limited by its context, the term “determining” is used to indicate any of its ordinary meanings, such as deciding, establishing, concluding, calculating, selecting, and/or evaluating. Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations. The term “based on” (as in “A is based on B”) is used to indicate any of its ordinary meanings, including the cases (i) “derived from” (e.g., “B is a precursor of A”), (ii) “based on at least” (e.g., “A is based on at least B”) and, if appropriate in the particular context, (iii) “equal to” (e.g., “A is equal to B”). 
Similarly, the term “in response to” is used to indicate any of its ordinary meanings, including “in response to at least.” Unless otherwise indicated, the terms “at least one of A, B, and C,” “one or more of A, B, and C,” “at least one among A, B, and C,” and “one or more among A, B, and C” indicate “A and/or B and/or C.” Unless otherwise indicated, the terms “each of A, B, and C” and “each among A, B, and C” indicate “A and B and C.”
Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa), and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa). The term “configuration” may be used in reference to a method, apparatus, and/or system as indicated by its particular context. The terms “method,” “process,” “procedure,” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context. A “task” having multiple subtasks is also a method. The terms “apparatus” and “device” are also used generically and interchangeably unless otherwise indicated by the particular context. The terms “element” and “module” are typically used to indicate a portion of a greater configuration. Unless expressly limited by its context, the term “system” is used herein to indicate any of its ordinary meanings, including “a group of elements that interact to serve a common purpose.”
Unless initially introduced by a definite article, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify a claim element does not by itself indicate any priority or order of the claim element with respect to another, but rather merely distinguishes the claim element from another claim element having a same name (but for use of the ordinal term). Unless expressly limited by its context, each of the terms “plurality” and “set” is used herein to indicate an integer quantity that is greater than one.
The soundfield description may include different audio streams for different regions based on, e.g., predetermined areas of interest inside the soundfield (for example, an object-based scheme for some regions and an HOA scheme for other regions). It may be desirable, for example, to use an object-based or HOA scheme to encode a region having a high degree of wavefield concentration, and to use HOA or a plane-wave expansion to encode a region having a low degree of wavefield concentration (e.g. ambience, crowd noise, clapping).
An object-based scheme may reduce a sound source to a point source, and directivity patterns (e.g., the variation with respect to direction of the sound emitted by, for example, a shouting player or a trumpet player) may not be preserved. HOA schemes (more generally, encoding schemes based on a hierarchical set of basis function coefficients) are typically more efficient than object-based schemes at encoding large numbers of sound sources (e.g., more objects can be represented by a smaller set of HOA coefficients as compared to an object-based scheme). Benefits of using an HOA scheme may include being able to evaluate and/or represent the soundfield at different listener positions without the need to detect and track individual objects. Rendering of an HOA-encoded audio stream is typically flexible and agnostic to loudspeaker configuration. HOA encoding is also typically valid under free-field conditions, such that translation of a user's virtual listening position can be performed within a valid region close to the nearest source.
Task T200 parses the metadata to obtain an effect identifier and at least one effect parameter value. Task T300 applies, to the soundfield description, an effect identified by the effect identifier. The information which is signaled in the metadata stream may include the type of audio effect to be applied to the soundfield: e.g., one or more of any of a focus, a zoom, a null, a rotation, and a translation. For each effect that is to be applied, the metadata may be implemented to include a corresponding effect identifier ID10 which identifies the effect (e.g., a different value for each of zoom, null, focus, rotate, and translate; a mode indicator to indicate a desired mode, such as a conference or meeting mode; etc.).
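The parsing performed by task T200 may be illustrated with a minimal sketch. The byte layout used here is purely an assumption made for illustration (one byte for the effect identifier, one byte for a parameter count, then that many big-endian 32-bit floating-point values); the disclosure does not fix these field widths, and the numeric code points in `EffectId` are likewise hypothetical.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class EffectId(IntEnum):
    # Hypothetical code points: one distinct value per effect type.
    FOCUS = 0
    ZOOM = 1
    NULL = 2
    ROTATE = 3
    TRANSLATE = 4


@dataclass
class EffectRecord:
    effect_id: EffectId
    params: List[float]


def parse_effects_metadata(payload: bytes) -> List[EffectRecord]:
    """Parse records of the assumed form: id (1 byte), count (1 byte), count float32 values."""
    records = []
    offset = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated record header")
        raw_id, count = struct.unpack_from(">BB", payload, offset)
        offset += 2
        end = offset + 4 * count
        if end > len(payload):
            raise ValueError("truncated parameter values")
        params = list(struct.unpack_from(f">{count}f", payload, offset))
        offset = end
        # EffectId(raw_id) raises ValueError for an unknown identifier.
        records.append(EffectRecord(EffectId(raw_id), params))
    return records


if __name__ == "__main__":
    # A zoom record with three parameter values, then a rotate record with one.
    payload = struct.pack(">BB3f", EffectId.ZOOM, 3, 30.0, 10.0, 6.0)
    payload += struct.pack(">BBf", EffectId.ROTATE, 1, 90.0)
    for record in parse_effects_metadata(payload):
        print(record.effect_id.name, record.params)
```

A renderer performing task T300 could then dispatch on `effect_id` to select the effect and pass `params` to it. The per-record count field is one way of letting different effects carry different numbers of parameter values.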
For each identified effect, the metadata may include a corresponding set of effect parameter values PM10 for parameters that define how the identified effect is to be applied (e.g., as shown in
It may be desirable to allocate more bits of the metadata stream to carry parameter values for one effect than for another effect. In one example, the number of bits allocated for the parameter values for each effect is a fixed value of the encoding scheme. In another example, the number of bits allocated for the parameter values for each identified effect is indicated within the metadata stream (e.g., as shown in
A focus effect may be defined as an enhanced directionality of a particular source or region. Parameters defining how a desired focus effect is to be applied may include a direction of the focus region or source, a strength of the focus effect, and/or a width of the focus region. The direction may be indicated in three dimensions, for example, as the azimuth angle and the angle of elevation corresponding to the center of the region or source. In one example, a focus effect is applied during rendering by decoding the source or region of focus at a higher HOA order (more generally, by adding one or more levels of the hierarchical set of basis function coefficients) and/or by decoding other sources or regions at a lower HOA order.
A zoom effect may be applied to boost an acoustic level of the soundfield in a desired direction. Parameters defining how a desired zoom effect is to be applied may include a direction of the region to be boosted. This direction may be indicated in three dimensions, for example, as the azimuth angle and the angle of elevation corresponding to the center of the region. Other parameters defining the zoom effect which may be included in the metadata may include one or both of a strength of the level boost and a size (e.g., width) of the region to be boosted. For a zoom effect that is implemented using a beamformer, the defining parameters may include selection of a beamformer type (e.g., FIR or IIR); selection of a set of beamformer weights (e.g., one or more series of tap weights); time-frequency masking values; etc.
A null effect may be applied to reduce an acoustic level of the soundfield in a desired direction. The parameters defining how a desired null effect is to be applied may be similar to those defining how a desired zoom effect is to be applied.
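As an illustration of how direction, width, and strength parameters might jointly define a zoom or a null, the following sketch computes a gain for an audio object from its direction. It is a deliberately simplified, hard-edged model applied to object directions rather than a beamformer, and the parameter names, the use of degrees, and the use of decibels for strength are all assumptions. A positive strength models a zoom and a negative strength models a null.

```python
import math


def angular_distance_deg(az1: float, el1: float, az2: float, el2: float) -> float:
    """Angle in degrees between two directions given as azimuth and elevation."""
    a1, e1, a2, e2 = (math.radians(v) for v in (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def directional_gain(source_az: float, source_el: float,
                     target_az: float, target_el: float,
                     width_deg: float, strength_db: float) -> float:
    """Linear gain for a source: strength_db inside the region, 0 dB outside it.

    The region is a cone of total width width_deg centred on the target direction.
    """
    distance = angular_distance_deg(source_az, source_el, target_az, target_el)
    gain_db = strength_db if distance <= width_deg / 2.0 else 0.0
    return 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    # Zoom: +6 dB in a 20-degree region centred at azimuth 30, elevation 0.
    print(directional_gain(30.0, 0.0, 30.0, 0.0, 20.0, 6.0))
    # Null: -20 dB in the same region.
    print(directional_gain(30.0, 0.0, 30.0, 0.0, 20.0, -20.0))
    # A source outside the region is left unchanged.
    print(directional_gain(90.0, 0.0, 30.0, 0.0, 20.0, 6.0))
```

A practical implementation would likely taper the gain smoothly toward the edge of the region to avoid audible discontinuities as a source moves across the boundary.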
A rotation effect may be applied by rotating the soundfield to a desired orientation. Parameters defining a desired rotation of the soundfield may indicate the direction which is to be rotated into a defined reference direction (e.g., as shown in
A translation effect may be applied to translate a sound source to a new location within the soundfield. Parameters defining a desired translation may include a direction and a distance (alternatively, an angle of rotation relative to the user position).
Each soundfield modification indicated in the metadata may be linked to a particular moment of the soundfield stream (e.g., by a timestamp included in the metadata, as shown in
As noted above, it may be desirable to enable a user to select a raw version of the soundfield or a version modified by the audio effects metadata, and/or modify the soundfield in a manner that is partially or completely different from the effects indicated in the effects metadata. A user may indicate such a command actively: for example, on a touchscreen, by gesture, by voice command, etc. Alternatively or additionally, a user command may be produced by passive user interaction via a device that tracks movement and/or orientation of the user: for example, a user tracking device that may include an inertial measurement unit (IMU).
In order to support an immersive VR experience, it may be desirable to adjust a provided audio environment in response to changes in the listener's virtual position. For example, it may be desirable to support virtual movement in six degrees of freedom (6DOF). As shown in
It may be desirable to allow a content creator to limit the degree to which effects described in the metadata may be changed downstream. For example, it may be desirable to impose a spatial restriction to permit a user to apply an effect only in a specific area and/or to prevent a user from applying an effect in a specific area. Such a restriction may apply to all signaled effects or to a particular set of effects, or a restriction may apply to only a single effect. In one example, a spatial restriction permits a user to apply a zoom effect only in a specific area. In another example, a spatial restriction prevents a user from applying a zoom effect in another specific area (e.g., a confidential and/or private area). In another example, it may be desirable to impose a time restriction to permit a user to apply an effect only during a specific interval and/or to prevent a user from applying an effect during a specific interval. Again, such a restriction may apply to all signaled effects or to a particular set of effects, or a restriction may apply to only a single effect.
To support such restriction, the metadata may include a flag to indicate a desired restriction. For example, a restriction flag may indicate whether one or more (possibly all) of the effects indicated in the metadata may be overwritten by user interaction. Additionally or alternatively, a restriction flag may indicate whether user alteration of the soundfield is permitted or disabled. Such disabling may apply to all effects, or one or more effects may be specifically enabled or disabled. A restriction may apply to the entire file or bitstream or may be associated with a particular period of time within the file or bitstream. In another example, the effect identifier may be implemented to use different values to distinguish a restricted version of an effect (e.g., which may not be removed or overwritten) and an unrestricted version of the same effect (which may be applied or ignored according to the consumer's choice).
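The following sketch shows one way a player might evaluate such restrictions before honoring a user request. The `Restriction` structure and all of its field names are hypothetical; the disclosure describes the restrictions only in general terms (an override flag, a spatial limit, and a time limit), and this example reduces the spatial limit to an azimuth range for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Restriction:
    # Hypothetical fields, not defined by the disclosure.
    user_override_allowed: bool = True
    # (min, max) azimuth in degrees where user effects are permitted; None means anywhere.
    allowed_azimuth_deg: Optional[Tuple[float, float]] = None
    # (start, end) in seconds during which the restriction is in force; None means always.
    active_interval_s: Optional[Tuple[float, float]] = None


def user_effect_permitted(restriction: Restriction,
                          azimuth_deg: float, time_s: float) -> bool:
    """Return True if a user-requested effect at this direction and time is allowed."""
    if restriction.active_interval_s is not None:
        start, end = restriction.active_interval_s
        if not (start <= time_s < end):
            return True  # The restriction is not in force at this time.
    if not restriction.user_override_allowed:
        return False
    if restriction.allowed_azimuth_deg is not None:
        low, high = restriction.allowed_azimuth_deg
        return low <= azimuth_deg <= high
    return True


if __name__ == "__main__":
    # User zoom is allowed only between -45 and +45 degrees during the first minute.
    r = Restriction(allowed_azimuth_deg=(-45.0, 45.0), active_interval_s=(0.0, 60.0))
    print(user_effect_permitted(r, 10.0, 30.0))
    print(user_effect_permitted(r, 90.0, 30.0))
    print(user_effect_permitted(r, 90.0, 120.0))
```

The same check could be kept per effect rather than globally, so that a restriction applies to a single effect or to a particular set of effects.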
An audio file or stream may include one or more versions of effects metadata, and different versions of such effects metadata may be provided for the same audio content (e.g., as user suggestions from a content generator). The different versions of effects metadata may provide, for example, different regions of focus for different audiences. In one example, different versions of effects metadata may describe effects of zooming in to different people (e.g., actors, athletes) in a video. A content creator may mark up interesting audio sources and/or directions (e.g., different levels of zooming and/or nulling for different hotspots as depicted, for example, in
Effects metadata may be created by human direction (e.g., by a content creator) and/or automatically in accordance with one or more design criteria. In a teleconferencing application, for example, it may be desired to automatically select a single loudest audio source, or audio from multiple talking sources, and to deemphasize (e.g., discard or lower the volume of) other audio components of the soundfield. A corresponding effects metadata stream may include a flag to indicate a “meeting mode.” In one example as shown in
Other parameters defining how a meeting mode is to be applied may include metadata to enhance extraction of the sources from the soundfield (e.g., beamformer weights, time frequency masking values, etc.). The metadata may also include one or more parameter values that indicate a desired rotation of the soundfield. The soundfield may be rotated according to the location of the loudest audio source: for example, to support auto-rotation of a remote user's video and audio so that the loudest speaker is in front of the remote user. In another example, the metadata may indicate auto-rotation of the soundfield so that a two-person discussion happens in front of the remote user. In a further example, the parameter values may indicate a compression (or other re-mapping) of the angular range of the soundfield as recorded (e.g., as shown in
An audio effects metadata stream as described herein may be carried in the same transmission as the corresponding audio stream (or streams) or may be received in a separate transmission or even from a different source (e.g., as described above). In one example, the effects metadata stream is stored or transmitted in a dedicated extension payload (e.g., in the afx_data field as shown in
While described with respect to AAC, the techniques may be performed using any type of psychoacoustic audio coding that, as described in more detail below, allows for an extension payload and/or extension packets (e.g., fill elements or other containers of information that include an identifier followed by fill data) or otherwise allows for backward compatibility. Examples of other psychoacoustic audio codecs include Audio Codec 3 (AC-3), Apple Lossless Audio Codec (ALAC), MPEG-4 Audio Lossless Streaming (ALS), aptX®, enhanced AC-3, Free Lossless Audio Codec (FLAC), Monkey's Audio, MPEG-1 Audio Layer II (MP2), MPEG-1 Audio Layer III (MP3), Opus, and Windows Media Audio (WMA).
Renderer SR10 may be configured to apply a focus effect to the soundfield, for example, by rendering a selected region of the soundfield at a higher resolution than other regions, and/or by rendering other regions to have a higher diffusivity. In one example, an apparatus or device performing task T300 (e.g., renderer SR10) is configured to implement a focus effect by requesting additional information for the focus source or region (e.g., higher-order HOA coefficient values) from a server over a wired and/or wireless connection (e.g., Wi-Fi and/or LTE).
Renderer SR10 may be configured to apply a zoom effect to the soundfield, for example, by applying a beamformer (e.g., according to parameter values carried within a corresponding field of the metadata). Renderer SR10 may be configured to apply a rotation or translation effect to the soundfield, for example, by applying a corresponding matrix transformation to a set of HOA coefficients (or more generally, to a hierarchical set of basis function coefficients) and/or by moving audio objects within the soundfield accordingly.
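For a first-order signal, a rotation about the vertical axis is a small instance of the matrix transformation mentioned above. The sketch below assumes the conventional B-Format axes (X forward, Y to the left, Z up) with azimuth increasing counterclockwise; under that assumption the W and Z channels are unaffected by a yaw rotation and only X and Y are mixed. Higher orders require larger rotation matrices, which are not shown. The function name is illustrative.

```python
import math
from typing import Tuple


def rotate_first_order_yaw(w: float, x: float, y: float, z: float,
                           yaw_deg: float) -> Tuple[float, float, float, float]:
    """Rotate one first-order sample frame about the vertical axis.

    A source at azimuth phi is moved to azimuth phi + yaw_deg.
    """
    yaw = math.radians(yaw_deg)
    c, s = math.cos(yaw), math.sin(yaw)
    # 2x2 rotation applied to the horizontal directional channels only.
    return (w, c * x - s * y, s * x + c * y, z)


if __name__ == "__main__":
    # A source straight ahead (energy in X only) rotated by 90 degrees
    # ends up on the left (energy in Y only, up to rounding).
    print(rotate_first_order_yaw(1.0, 1.0, 0.0, 0.0, 90.0))
```

Applying the function with the negated angle undoes the rotation, which is one reason a rotation signaled as metadata can be reversed by the consumer, unlike a rotation baked into the content.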
Hardware for virtual reality (VR) may include one or more screens to present a visual scene to a user, one or more sound-emitting transducers (e.g., an array of loudspeakers, or an array of head-mounted transducers) to provide a corresponding audio environment, and one or more sensors to determine a position, orientation, and/or movement of the user. User tracking device UT10 as shown in
Computer-mediated reality systems are being developed to allow computing devices to augment or add to, remove or subtract from, substitute or replace, or generally modify existing reality as experienced by a user. Computer-mediated reality systems may include, as a couple of examples, virtual reality (VR) systems, augmented reality (AR) systems, and mixed reality (MR) systems. The perceived success of computer-mediated reality systems is generally related to the ability of such systems to provide a realistically immersive experience in terms of both video and audio such that the video and audio experiences align in a manner that is perceived as natural and expected by the user. Although the human visual system is more sensitive than the human auditory system (e.g., in terms of perceived localization of various objects within the scene), ensuring an adequate auditory experience is an increasingly important factor in ensuring a realistically immersive experience, particularly as the video experience improves to permit better localization of video objects that enable the user to better identify sources of audio content.
In VR technologies, virtual information may be presented to a user using a head-mounted display such that the user may visually experience an artificial world on a screen in front of their eyes. In AR technologies, the real-world is augmented by visual objects that may be superimposed (e.g., overlaid) on physical objects in the real world. The augmentation may insert new visual objects and/or mask visual objects in the real-world environment. In MR technologies, the boundary between what is real or synthetic/virtual and visually experienced by a user is becoming difficult to discern. Techniques as described herein may be used with a VR device 400 as shown in
Video, audio, and other sensory data may play important roles in the VR experience. To participate in a VR experience, the user 402 may wear the VR device 400 (which may also be referred to as a VR headset 400) or other wearable electronic device. The VR client device (such as the VR headset 400) may track head movement of the user 402, and adapt the video data shown via the VR headset 400 to account for the head movements, providing an immersive experience in which the user 402 may experience a virtual world shown in the video data in visual three dimensions.
While VR (and other forms of AR and/or MR) may allow the user 402 to reside in the virtual world visually, often the VR headset 400 may lack the capability to place the user in the virtual world audibly. In other words, the VR system (which may include a computer responsible for rendering the video data and audio data—that is not shown in the example of
Though full three-dimensional audible rendering still poses challenges, the techniques in this disclosure enable a further step towards that end. Audio aspects of AR, MR, and/or VR may be classified into three separate categories of immersion. The first category provides the lowest level of immersion and is referred to as three degrees of freedom (3DOF). 3DOF refers to audio rendering that accounts for movement of the head in the three degrees of freedom (yaw, pitch, and roll), thereby allowing the user to freely look around in any direction. 3DOF, however, cannot account for translational (and orientational) head movements in which the head is not centered on the optical and acoustical center of the soundfield.
The second category, referred to as 3DOF plus (or “3DOF+”), provides for the three degrees of freedom (yaw, pitch, and roll) in addition to limited spatial translational (and orientational) movements due to the head movements away from the optical center and acoustical center within the soundfield. 3DOF+ may provide support for perceptual effects such as motion parallax, which may strengthen the sense of immersion.
The third category, referred to as six degrees of freedom (6DOF), renders audio data in a manner that accounts not only for the three degrees of freedom in terms of head movements (yaw, pitch, and roll) but also for translation of a person in space (x, y, and z translations). The spatial translations may be induced, for example, by sensors tracking the location of the person in the physical world, by way of an input controller, and/or by way of a rendering program that simulates transportation of the user within the virtual space.
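To make the combination of translation and head rotation concrete, the following sketch computes the direction and distance of a point source as heard by a listener whose position and yaw are tracked. It is a simplified illustration rather than a 6DOF renderer: pitch and roll are omitted, and the coordinate convention (x forward, y to the left, z up, azimuth counterclockwise) and the function name are assumptions.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def source_relative_to_listener(source: Vec3, listener: Vec3,
                                listener_yaw_deg: float) -> Tuple[float, float, float]:
    """Return (azimuth_deg, elevation_deg, distance) of the source in the head frame."""
    # Translation: offset of the source from the listener in world coordinates.
    dx = source[0] - listener[0]
    dy = source[1] - listener[1]
    dz = source[2] - listener[2]
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    if distance == 0.0:
        return (0.0, 0.0, 0.0)
    # Rotation: turn the world offset by minus the head yaw to get head-frame axes.
    yaw = math.radians(listener_yaw_deg)
    c, s = math.cos(yaw), math.sin(yaw)
    rx = c * dx + s * dy
    ry = -s * dx + c * dy
    azimuth = math.degrees(math.atan2(ry, rx))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, dz / distance))))
    return (azimuth, elevation, distance)


if __name__ == "__main__":
    source = (1.0, 0.0, 0.0)
    # Listener at the origin, facing forward: the source is straight ahead.
    print(source_relative_to_listener(source, (0.0, 0.0, 0.0), 0.0))
    # Listener steps to (1, -1, 0): the source is now directly to the left.
    print(source_relative_to_listener(source, (1.0, -1.0, 0.0), 0.0))
    # Listener at the origin turns 90 degrees to the left: the source is to the right.
    print(source_relative_to_listener(source, (0.0, 0.0, 0.0), 90.0))
```

A 3DOF renderer would use only the rotation step, whereas the translation step is what distinguishes 3DOF+ and 6DOF rendering.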
Audio aspects of VR may be less immersive than the video aspects, thereby potentially reducing the overall immersion experienced by the user. With advances in processors and wireless connectivity, however, it may be possible to achieve 6DOF rendering with wearable AR, MR and/or VR devices. Moreover, in the future it may be possible to take into account movement of a vehicle that has the capabilities of AR, MR and/or VR devices and provide an immersive audio experience. In addition, a person of ordinary skill would recognize that a mobile device (e.g., a handset, smartphone, tablet) may also implement VR, AR, and/or MR techniques.
In accordance with the techniques described in this disclosure, various ways by which to adjust audio data (whether in an audio channel format, an audio object format, and/or an audio scene-based format) may allow for 6DOF audio rendering. 6DOF rendering provides a more immersive listening experience by rendering audio data in a manner that accounts for the three degrees of freedom in terms of head movements (yaw, pitch, and roll) and also for translational movements (e.g., in a spatial three-dimensional coordinate system with axes x, y, and z). In implementation, where the head movements may not be centered on the optical and acoustical center, adjustments may be made to provide for 6DOF rendering, and such adjustments are not necessarily limited to spatial two-dimensional coordinate systems. As disclosed herein, the following figures and descriptions allow for 6DOF audio rendering.
The wearable device 800 may represent other types of devices, such as a watch (including so-called “smart watches”), glasses (including so-called “smart glasses”), headphones (including so-called “wireless headphones” and “smart headphones”), smart clothing, smart jewelry, and the like. Whether representative of a VR device, a watch, glasses, and/or headphones, the wearable device 800 may communicate with the computing device supporting the wearable device 800 via a wired connection or a wireless connection.
In some instances, the computing device supporting the wearable device 800 may be integrated within the wearable device 800 and as such, the wearable device 800 may be considered as the same device as the computing device supporting the wearable device 800. In other instances, the wearable device 800 may communicate with a separate computing device that may support the wearable device 800. In this respect, the term “supporting” should not be understood to require a separate dedicated device but that one or more processors configured to perform various aspects of the techniques described in this disclosure may be integrated within the wearable device 800 or integrated within a computing device separate from the wearable device 800.
For example, when the wearable device 800 represents the VR device 400, a separate dedicated computing device (such as a personal computer including one or more processors) may render the audio and visual content, while the wearable device 800 may determine the translational head movement upon which the dedicated computing device may render, based on the translational head movement, the audio content (as the speaker feeds) in accordance with various aspects of the techniques described in this disclosure. As another example, when the wearable device 800 represents smart glasses, the wearable device 800 may include the processor (e.g., one or more processors) that both determines the translational head movement (by interfacing with one or more sensors of the wearable device 800) and renders, based on the determined translational head movement, the loudspeaker feeds.
As shown, the wearable device 800 includes a rear camera, one or more directional speakers, one or more tracking and/or recording cameras, and one or more light-emitting diode (LED) lights. In some examples, the LED light(s) may be referred to as “ultra bright” LED light(s). In addition, the wearable device 800 includes one or more eye-tracking cameras, high sensitivity audio microphones, and optics/projection hardware. The optics/projection hardware of the wearable device 800 may include durable semi-transparent display technology and hardware.
The wearable device 800 also includes connectivity hardware, which may represent one or more network interfaces that support multimode connectivity, such as 4G communications, 5G communications, etc. The wearable device 800 also includes ambient light sensors, and bone conduction transducers. In some instances, the wearable device 800 may also include one or more passive and/or active cameras with fisheye lenses and/or telephoto lenses. The steering angle of the wearable device 800 may be used to select an audio representation of a soundfield (e.g., one of mixed-order ambisonics (MOA) representations) to output via the directional speaker(s)—headphones 404—of the wearable device 800, in accordance with various techniques of this disclosure. It will be appreciated that the wearable device 800 may exhibit a variety of different form factors.
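One possible way to use a steering angle to select among several representations of a soundfield is sketched below in Python. The sketch assumes, purely for illustration, that each available representation is associated with a center azimuth (in degrees) at which it offers its highest spatial resolution, and that the representation whose center is angularly closest to the steering angle is selected. The function names and the selection rule are hypothetical and are not specified by the disclosure.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def select_representation(steering_deg, center_azimuths_deg):
    """Return the index of the representation closest to the steering angle.

    center_azimuths_deg is an assumed list holding, for each candidate
    representation, the azimuth on which that representation is centered.
    """
    return min(
        range(len(center_azimuths_deg)),
        key=lambda i: angular_distance(steering_deg, center_azimuths_deg[i]),
    )
```

For example, with candidate representations centered at 0, 90, 180, and 270 degrees, a steering angle of 350 degrees selects the first candidate, because wrap-around at 360 degrees is taken into account.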
Although not shown in the example of
Although described with respect to particular examples of wearable devices, a person of ordinary skill in the art would appreciate that descriptions related to
The various elements of an implementation of an apparatus or system as disclosed herein (e.g., apparatus A100, A200, F100, and/or F200) may be embodied in any combination of hardware with software and/or with firmware that is deemed suitable for the intended application. For example, such elements may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips).
A processor or other means for processing as disclosed herein may be fabricated as one or more electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips). Examples of such arrays include fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, DSPs (digital signal processors), FPGAs (field-programmable gate arrays), ASSPs (application-specific standard products), and ASICs (application-specific integrated circuits). A processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors. It is possible for a processor as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to a procedure of an implementation of method M100 or M200 (or another method as disclosed with reference to operation of an apparatus or system described herein), such as a task relating to another operation of a device or system in which the processor is embedded (e.g., a voice communications device, such as a smartphone, or a smart speaker). It is also possible for part of a method as disclosed herein to be performed under the control of one or more other processors.
Each of the tasks of the methods disclosed herein (e.g., methods M100 and/or M200) may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. In a typical application of an implementation of a method as disclosed herein, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions), embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc.), that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability. Such a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP). For example, such a device may include RF circuitry configured to receive and/or transmit encoded frames.
In one or more exemplary aspects, the operations described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, such operations may be stored on or transmitted over a computer-readable medium as one or more instructions or code. The term “computer-readable media” includes both computer-readable storage media and communication (e.g., transmission) media. By way of example, and not limitation, computer-readable storage media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM), or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; and/or magnetic disk storage or other magnetic storage devices. Such storage media may store information in the form of instructions or data structures that can be accessed by a computer. Communication media can comprise any medium that can be used to carry desired program code in the form of instructions or data structures and that can be accessed by a computer, including any medium that facilitates transfer of a computer program from one place to another. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, and/or microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology such as infrared, radio, and/or microwave are included in the definition of medium. 
Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray Disc™ (Blu-Ray Disc Association, Universal City, Calif.), where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
In one example, a non-transitory computer-readable storage medium comprises code which, when executed by at least one processor, causes the at least one processor to perform a method of characterizing portions of a soundfield as described herein. Further examples of such a storage medium include a medium further comprising code which, when executed by the at least one processor, causes the at least one processor to receive a bitstream that comprises metadata and a soundfield description (e.g., as described herein with reference to task T100); parse the metadata to obtain an effect identifier and at least one effect parameter (e.g., as described herein with reference to task T200); and apply, to the soundfield description, an effect identified by the effect identifier (e.g., as described herein with reference to task T300). The applying may include using the at least one effect parameter to apply the identified effect to the soundfield description.
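The receive, parse, and apply operations described above may be illustrated by the following minimal Python sketch. The byte layout used here (a one-byte effect identifier, a one-byte parameter count, that many big-endian 32-bit floating-point parameter values, and then the soundfield payload) is an assumption made only for this sketch and does not reproduce any bitstream syntax of the disclosure. Likewise, `EFFECT_SCALE` is a hypothetical placeholder effect that merely scales the payload, and the soundfield description is modeled as a flat list of floating-point values.

```python
import struct

# Hypothetical effect identifier; the value is an assumption for this sketch.
EFFECT_SCALE = 1


def parse_metadata(bitstream):
    """Split a received bitstream into (effect_id, params, payload).

    Assumed layout: 1-byte effect identifier, 1-byte parameter count,
    the parameter values as big-endian float32, then the soundfield payload.
    """
    effect_id, n_params = struct.unpack_from(">BB", bitstream, 0)
    params = list(struct.unpack_from(">%df" % n_params, bitstream, 2))
    payload = bitstream[2 + 4 * n_params:]
    return effect_id, params, payload


def decode_soundfield(payload):
    """Interpret the payload as a sequence of big-endian float32 values."""
    count = len(payload) // 4
    return list(struct.unpack(">%df" % count, payload[:4 * count]))


def apply_effect(effect_id, params, soundfield):
    """Apply the effect identified by effect_id, using its parameter values."""
    if effect_id == EFFECT_SCALE:
        return [params[0] * value for value in soundfield]
    # Unrecognized identifiers leave the soundfield description unchanged.
    return list(soundfield)
```

In this sketch, a renderer would call `parse_metadata` on the received bytes, decode the payload, and pass the result to `apply_effect` together with the parsed identifier and parameter values. A practical implementation would replace the placeholder effect with effects such as rotation or focus and would validate the parameter count against the length of the bitstream.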
Implementation examples are described in the following numbered clauses:
Clause 1. A method of manipulating a soundfield, the method comprising: receiving a bitstream that comprises metadata and a soundfield description; parsing the metadata to obtain an effect identifier and at least one effect parameter value; and applying, to the soundfield description, an effect identified by the effect identifier.
Clause 2. The method of clause 1, wherein the parsing the metadata comprises parsing the metadata to obtain a timestamp corresponding to the effect identifier, and wherein the applying the identified effect comprises using the at least one effect parameter value to apply the identified effect to a portion of the soundfield description that corresponds to the timestamp.
Clause 3. The method of clause 1, wherein the applying the identified effect comprises combining the at least one effect parameter value with a user command to obtain at least one revised parameter value.
Clause 4. The method of any of clauses 1 to 3, wherein the applying the identified effect comprises rotating the soundfield to a desired orientation.
Clause 5. The method of any of clauses 1 to 3, wherein the at least one effect parameter value includes an indicated direction, and wherein the applying the identified effect comprises using the at least one effect parameter value to rotate the soundfield to the indicated direction.
Clause 6. The method of any of clauses 1 to 3, wherein the at least one effect parameter value includes an indicated direction, and wherein the applying the identified effect comprises using the at least one effect parameter value to increase an acoustic level of the soundfield in the indicated direction, relative to an acoustic level of the soundfield in other directions.
Clause 7. The method of any of clauses 1 to 3, wherein the at least one effect parameter value includes an indicated direction, and wherein the applying the identified effect comprises using the at least one effect parameter value to reduce an acoustic level of the soundfield in the indicated direction, relative to an acoustic level of the soundfield in other directions.
Clause 8. The method of any of clauses 1 to 3, wherein the at least one effect parameter value indicates a location within the soundfield, and wherein the applying the identified effect comprises using the at least one effect parameter value to translate a sound source to the indicated location.
Clause 9. The method of any of clauses 1 to 3, wherein the at least one effect parameter value includes an indicated direction, and wherein the applying the identified effect comprises using the at least one effect parameter value to increase a directionality of at least one of a sound source of the soundfield or a region of the soundfield, relative to another sound source of the soundfield or the region of the soundfield.
Clause 10. The method of any of clauses 1 to 3, wherein the applying the identified effect comprises applying a matrix transformation to the soundfield description.
Clause 11. The method of clause 10, wherein the matrix transformation comprises at least one of a rotation of the soundfield and a translation of the soundfield.
Clause 12. The method of any of clauses 1 to 3, wherein the soundfield description comprises a hierarchical set of basis function coefficients.
Clause 13. The method of any of clauses 1 to 3, wherein the soundfield description comprises a plurality of audio objects.
Clause 14. The method of any of clauses 1 to 3, wherein the parsing the metadata comprises parsing the metadata to obtain a second effect identifier, and wherein the method comprises determining to not apply, to the soundfield description, an effect identified by the second effect identifier.
Clause 15. An apparatus for manipulating a soundfield, the apparatus comprising: a decoder configured to receive a bitstream that comprises metadata and a soundfield description and to parse the metadata to obtain an effect identifier and at least one effect parameter value; and a renderer configured to apply, to the soundfield description, an effect identified by the effect identifier.
Clause 16. The apparatus of clause 15, further comprising a modem configured to: receive a signal that represents the bitstream; and provide the bitstream to the decoder.
Clause 17. A device for manipulating a soundfield, the device comprising: a memory configured to store a bitstream that comprises metadata and a soundfield description; and a processor coupled to the memory and configured to: parse the metadata to obtain an effect identifier and at least one effect parameter value; and apply, to the soundfield description, an effect identified by the effect identifier.
Clause 18. The device of clause 17, wherein the processor is configured to parse the metadata to obtain a timestamp corresponding to the effect identifier, and to apply the identified effect by using the at least one effect parameter value to apply the identified effect to a portion of the soundfield description that corresponds to the timestamp.
Clause 19. The device of clause 17, wherein the processor is configured to combine the at least one effect parameter value with a user command to obtain at least one revised parameter value.
Clause 20. The device of any of clauses 17 to 19, wherein the at least one effect parameter value includes an indicated direction, and wherein the processor is configured to apply the identified effect by using the at least one effect parameter value to rotate the soundfield to the indicated direction.
Clause 21. The device of any of clauses 17 to 19, wherein the at least one effect parameter value includes an indicated direction, and wherein the processor is configured to apply the identified effect by using the at least one effect parameter value to increase an acoustic level of the soundfield in the indicated direction, relative to an acoustic level of the soundfield in other directions.
Clause 22. The device of any of clauses 17 to 19, wherein the at least one effect parameter value includes an indicated direction, and wherein the processor is configured to apply the identified effect by using the at least one effect parameter value to reduce an acoustic level of the soundfield in the indicated direction, relative to an acoustic level of the soundfield in other directions.
Clause 23. The device of any of clauses 17 to 19, wherein the at least one effect parameter value indicates a location within the soundfield, and wherein the processor is configured to apply the identified effect by using the at least one effect parameter value to translate a sound source to the indicated location.
Clause 24. The device of any of clauses 17 to 19, wherein the at least one effect parameter value includes an indicated direction, and wherein the processor is configured to apply the identified effect by using the at least one effect parameter value to increase a directionality of at least one of a sound source of the soundfield or a region of the soundfield, relative to another sound source of the soundfield or region of the soundfield.
Clause 25. The device of any of clauses 17 to 19, wherein the processor is configured to apply the identified effect by using the at least one effect parameter value to apply a matrix transformation to the soundfield description.
Clause 26. The device of clause 25, wherein the matrix transformation comprises at least one of a rotation of the soundfield and a translation of the soundfield.
Clause 27. The device of any of clauses 17 to 19, wherein the soundfield description comprises a hierarchical set of basis function coefficients.
Clause 28. The device of any of clauses 17 to 19, wherein the soundfield description comprises a plurality of audio objects.
Clause 29. The device of any of clauses 17 to 19, wherein the processor is configured to parse the metadata to obtain a second effect identifier, and to determine to not apply, to the soundfield description, an effect identified by the second effect identifier.
Clause 30. The device of any of clauses 17 to 19, wherein the device comprises an application-specific integrated circuit that includes the processor.
Clause 31. An apparatus for manipulating a soundfield, the apparatus comprising: means for receiving a bitstream that comprises metadata and a soundfield description; means for parsing the metadata to obtain an effect identifier and at least one effect parameter value; and means for applying, to the soundfield description, an effect identified by the effect identifier.
Clause 32. The apparatus of clause 31, wherein at least one of the means for receiving, the means for parsing, or the means for applying is integrated in at least one of a mobile phone, a tablet computer device, a wearable electronic device, a camera device, a virtual reality headset, an augmented reality headset, or a vehicle.
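By way of a non-limiting example of the rotation and matrix transformation recited in the clauses above, the following Python sketch rotates one frame of a first-order ambisonic soundfield description about the vertical axis, after combining a rotation angle obtained from the metadata with a user command. The channel ordering (W, X, Y, Z), the additive rule for combining the two angles, and the function names are assumptions made for this sketch; the disclosure does not prescribe them.

```python
import math


def revised_angle(metadata_deg, user_deg):
    """Combine a metadata angle with a user command (assumed additive)."""
    return (metadata_deg + user_deg) % 360.0


def yaw_rotation_matrix(theta_deg):
    """4x4 matrix rotating first-order coefficients (W, X, Y, Z) about z.

    W (omnidirectional) and Z (vertical) are unchanged; X and Y are mixed
    so that a source at azimuth phi moves to azimuth phi + theta.
    """
    theta = math.radians(theta_deg)
    c, s = math.cos(theta), math.sin(theta)
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, c, -s, 0.0],
        [0.0, s, c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def apply_matrix(matrix, frame):
    """Multiply a 4x4 matrix by one frame of coefficients [W, X, Y, Z]."""
    return [sum(m * v for m, v in zip(row, frame)) for row in matrix]
```

For instance, if the metadata indicates a rotation of 60 degrees and the user command adds 30 degrees, the revised angle is 90 degrees, and a frame [1, 1, 0, 0] that represents a source directly ahead is transformed to approximately [1, 0, 1, 0], which places the source to the left under the assumed axis convention. Higher-order descriptions would use larger rotation matrices built on the same principle.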
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or processor-executable instructions depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions are not to be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transient storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor may read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description is provided to enable a person skilled in the art to make or use the disclosed implementations. Various modifications to these implementations will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other implementations without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the implementations shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
20190100493 | Nov 2019 | GR | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/058026 | 10/29/2020 | WO | |