This specification relates generally to methods and apparatus for distributed audio mixing. The specification further relates to, but is not limited to, methods and apparatus for distributed audio capture, mixing and rendering of spatial audio signals to enable spatial reproduction of audio signals.
Spatial audio refers to playable audio data that exploits sound localisation. In a real world space, for example in a concert hall, there will be multiple audio sources, for example the different members of an orchestra or band, located at different positions on the stage. The location and movement of the sound sources are parameters of the captured audio. In rendering the audio as spatial audio for playback, such parameters are incorporated in the data using processing algorithms so that the listener is provided with an immersive and spatially oriented experience.
Spatial audio processing is an example technology for processing audio captured via a microphone array into spatial audio; that is, audio with a spatial percept. The intention is to capture audio so that, when it is rendered to a user, the user will experience the sound field as if they were present at the location of the capture device.
An example application of spatial audio is in virtual reality (VR) and augmented reality (AR) whereby both video and audio data may be captured within a real world space. In the rendered version of the space, i.e. the virtual space, the user, through a VR headset, may view and listen to the captured video and audio which has a spatial percept.
The captured content may be manipulated in a mixing stage, which is typically a manual process involving a director or engineer operating a mixing computer or mixing desk. For example, the volume of audio signals from a subset of audio sources may be changed to improve end-user experience when consuming the content.
According to one aspect, a method comprises: providing one or more predefined constellations, each constellation defining a spatial arrangement of points forming a shape or pattern; receiving positional data indicative of the spatial positions of a plurality of audio sources in a capture space; identifying a correspondence between a subset of the audio sources and a constellation based on the relative spatial positions of audio sources in the subset; and responsive to said correspondence, applying at least one action.
The at least one action may be applied to selected ones of the audio sources.
The action applied may be one or more of an audio action, a visual action and a controlling action.
An audio action may be applied to audio signals of selected audio sources, comprising one or more of: reducing or muting the audio volume, increasing the audio volume, distortion and reverberation.
A controlling action may be applied to control the spatial position(s) of selected audio source(s).
The controlling action may comprise one or more of modifying spatial position(s), fixing spatial position(s), filtering spatial position(s), applying a repelling movement to spatial position(s) and applying an attracting movement to spatial position(s).
A controlling action may be applied to control movement of one or more capture devices in the capture space.
A controlling action may be applied to apply selected audio sources to a first audio channel and other audio sources to one or more other audio channel(s).
The or each constellation may define one or more of a line, arc, circle, cross or polygon.
The positional data may be derived from positioning tags carried by the audio sources in the capture space.
A correspondence may be identified if the relative spatial positions of the audio sources in the subset have substantially the same shape or pattern as the constellation, or deviate therefrom by no more than a predetermined distance.
The or each constellation may be defined by means of receiving, through a user interface, a user-defined spatial arrangement of points forming a shape or pattern.
The or each constellation may be defined by capturing current positions of audio sources in a capture space.
According to a second aspect, there is provided a computer program comprising instructions that, when executed by a computer, control it to perform a method comprising: providing one or more predefined constellations, each constellation defining a spatial arrangement of points forming a shape or pattern; receiving positional data indicative of the spatial positions of a plurality of audio sources in a capture space; identifying a correspondence between a subset of the audio sources and a constellation based on the relative spatial positions of audio sources in the subset; and responsive to said correspondence, applying at least one action.
According to a third aspect, there is provided a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: providing one or more predefined constellations, each constellation defining a spatial arrangement of points forming a shape or pattern; receiving positional data indicative of the spatial positions of a plurality of audio sources in a capture space; identifying a correspondence between a subset of the audio sources and a constellation based on the relative spatial positions of audio sources in the subset; and responsive to said correspondence, applying at least one action.
According to a fourth aspect, there is provided an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to provide one or more predefined constellations, each constellation defining a spatial arrangement of points forming a shape or pattern; to receive positional data indicative of the spatial positions of a plurality of audio sources in a capture space; to identify a correspondence between a subset of the audio sources and a constellation based on the relative spatial positions of audio sources in the subset; and responsive to said correspondence, to apply at least one action.
According to a fifth aspect, there is provided an apparatus configured to perform the method of: providing one or more predefined constellations, each constellation defining a spatial arrangement of points forming a shape or pattern; receiving positional data indicative of the spatial positions of a plurality of audio sources in a capture space; identifying a correspondence between a subset of the audio sources and a constellation based on the relative spatial positions of audio sources in the subset; and responsive to said correspondence, applying at least one action.
Embodiments will now be described, by way of non-limiting example, with reference to the accompanying drawings.
Embodiments herein relate generally to systems and methods relating to the capture, mixing and rendering of spatial audio data for playback.
In particular, embodiments relate to systems and methods in which there are multiple audio sources which may move over time. Each audio source generates respective audio signals and, in some embodiments, positioning information for use by the system. Embodiments provide automation of certain functions during, for example, the mixing stage, whereby one or more actions are performed automatically responsive to a subset of entities matching or corresponding to a predefined constellation which defines a spatial arrangement of points forming a shape or pattern.
An example application is in a VR system in which audio and video may be captured, mixed and rendered to provide an immersive user experience. Nokia's OZO® VR camera is used as an example of a VR capture device which comprises a microphone array to provide a spatial audio signal, but it will be appreciated that embodiments are not limited to VR applications nor the use of microphone arrays at the capture point. Local or close-up microphones or instrument pickups may be employed, for example. Embodiments may also be used in Augmented Reality (AR) applications.
Referring to
The sports team may comprise multiple members 7-13, each of whom has an associated close-up microphone providing audio signals. Each may therefore be termed an audio source for convenience. In other embodiments, other types of audio source may be used. For example, if the audio sources 7-13 are members of a musical band, the audio sources may comprise a lead vocalist, a drummer, a lead guitarist, a bass guitarist, and/or members of a choir or backing singers. As a further example, the audio sources 7-13 may be actors performing in a movie or television production. The number of audio sources and capture devices is not limited to what is presented in
As well as having an associated close-up microphone, the audio sources 7-13 may carry a positioning tag, which may be any module capable of indicating its respective spatial position to the CRS 15 through data. For example, the positioning tag may be a high accuracy indoor positioning (HAIP) tag which works in association with one or more HAIP locators 20 within the space 3. HAIP systems use Bluetooth Low Energy (BLE) communication between the tags and the one or more locators 20. For example, there may be four HAIP locators mounted on, or placed relative to, the VR capture device 6, with a respective HAIP locator to the front, left, back and right of the VR capture device 6. Each tag sends BLE signals from which the HAIP locators derive the tag location and, therefore, the audio source location.
In general, such direction of arrival (DoA) positioning systems are based on (i) a known location and orientation of the or each locator, and (ii) measurement of the DoA angle of the signal from the respective tag towards the locators in the locators' local co-ordinate system. Based on the location and angle information from one or more locators, the position of the tag may be calculated using geometry.
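By way of non-limiting illustration, the following Python sketch shows this geometric calculation in two dimensions for two locators. The function name, the use of a shared co-ordinate frame for both bearings, and the example values are assumptions for illustration only.

```python
import numpy as np

def intersect_bearings(p1, a1, p2, a2):
    """Estimate a tag position from two direction-of-arrival bearings.

    p1, p2: (x, y) positions of two locators in a common frame.
    a1, a2: DoA angles (radians from the x-axis) of the tag's signal as
            seen by each locator, already converted from the locators'
            local co-ordinate systems into the same common frame.
    """
    d1 = np.array([np.cos(a1), np.sin(a1)])  # unit vector along bearing 1
    d2 = np.array([np.cos(a2), np.sin(a2)])  # unit vector along bearing 2
    # Solve p1 + t1*d1 == p2 + t2*d2 for the line parameters t1, t2.
    A = np.column_stack([d1, -d2])
    t = np.linalg.solve(A, np.subtract(p2, p1))
    return np.asarray(p1) + t[0] * d1

# Two locators at known positions each report a bearing towards the tag.
tag = intersect_bearings((0.0, 0.0), np.deg2rad(45), (10.0, 0.0), np.deg2rad(135))
# -> array([5., 5.]): the tag position recovered by geometry.
```

With more than two locators, a practical system would typically combine all available bearings, for example by a least-squares solution, to reduce the effect of measurement noise.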
In some embodiments, other forms of positioning system may be employed, in addition, or as an alternative. For example, each audio source 7-13 may have a GPS receiver for transmitting respective positional data to the CRS 15.
The CRS 15 is a processing system having an associated user interface (UI) 16 which will be explained in further detail below. As shown in
The input audio data may be multichannel audio in loudspeaker format, e.g. stereo signals, 4.0 signals, 5.1 signals, Dolby Atmos® signals or the like. Instead of loudspeaker format audio, the input may be in a multi-microphone signal format, such as the raw eight-signal input from the OZO VR camera, if used for the VR capture device 6.
The memory 32 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 32 stores, amongst other things, an operating system 38 and one or more software applications 40. The RAM 34 is used by the controller 22 for the temporary storage of data. The operating system 38 may contain code which, when executed by the controller 22 in conjunction with the RAM 34, controls operation of each of the hardware components of the terminal.
The controller 22 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors.
In embodiments herein, the software application 40 is configured to provide video and distributed spatial audio capture, mixing and rendering to generate a VR environment, or virtual space, including the spatial audio.
The software application 40 may provide the UI 16 shown in
The input interface 36 receives video and audio data from the VR capture device 6, such as Nokia's OZO® device, and audio data from each of the audio sources 7-13. The capture device may be a 360 degree camera capable of recording approximately the entire sphere. The input interface 36 also receives the positional data from (or derived from) the positioning tags on each of the VR capture device 6 and the audio sources 7-13, from which an accurate determination may be made of their respective positions in the real world space 3 and of their positions relative to the other audio sources.
The software application 40 may be configured to operate in any of real-time, near real-time or even offline using pre-stored captured data.
During capture it is sometimes the case that audio sources move. For example, in the
In one example aspect of the mixing step 3.2, the software application 40 is configured to identify when at least a subset of the audio sources 7-13 matches a predefined constellation, as will be explained below.
A constellation is a spatial arrangement of points forming a shape or pattern which can be represented in data form.
The points may for example represent related entities, such as audio sources, or points in a path or shape. A constellation may therefore be an elongate line (i.e. not a discrete point), a jagged line, a cross, an arc, a two-dimensional shape or indeed any spatial arrangement of points that represents a shape or pattern. For ease of reference, a line, arc, cross etc. is considered a shape in this context. In some embodiments, a constellation may represent a 3D shape.
A constellation may be defined in any suitable way, e.g. as one or more vectors and/or a set of co-ordinates. Constellations may be drawn or defined using predefined templates, e.g. as shapes which are dragged and dropped from a menu. Constellations may be defined by placing markers on an editing interface, all of which may be manually input through the UI 16. A constellation may be of any geometrical shape or size, other than a discrete point. In some embodiments, the size may be immaterial, i.e. only the shape is important.
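By way of non-limiting illustration, one possible in-memory representation of a constellation is sketched below in Python, assuming two-dimensional co-ordinates; the class and field names are illustrative and not mandated by this specification.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Constellation:
    """A spatial arrangement of points forming a shape or pattern."""
    name: str
    points: List[Tuple[float, float]]  # relative (x, y) co-ordinates
    scale_invariant: bool = True       # if True, only the shape matters, not the size

# A line and a cross, each defined simply by a set of points.
LINE  = Constellation("line",  [(0, 0), (1, 0), (2, 0), (3, 0)])
CROSS = Constellation("cross", [(1, 0), (1, 1), (1, 2), (0, 1), (2, 1)])
```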
In some embodiments, a constellation may be defined by capturing the positions of one or more audio sources 7-13 at a particular point in time in a capture space. For example, referring to
The data representing each constellation 45, 46, 47 is stored in the memory 32 of the CRS 15, or may be stored externally or remotely and made available to the CRS by a data port or a wired or wireless network link. For example, the constellation data may be stored in a cloud-based repository for on-demand access by the CRS 15.
In some embodiments, only one constellation is provided. In other embodiments, a larger number of constellations are provided.
In overview, the software application 40 is configured to compare the relative spatial positions of the audio sources 7-13 with one or more of the constellations 45, 46, 47, and to perform some action in the event that a subset matches a constellation.
From a practical viewpoint, the audio sources 7-13 may be divided into subsets comprising at least two audio sources. In this way, the relative positions of the audio sources in a given subset may be determined and the corresponding shape or pattern they form may be compared with that of the constellations 45, 46, 47.
Referring to
A first step 5.1 comprises providing data representing one or more constellations. A second step 5.2 comprises receiving a current set of positions of audio sources within a subset. The first step 5.1 may comprise the CRS 15 receiving the constellation data from a connected or external data source, or accessing the constellation data from local memory 32. A third step 5.3 comprises determining if a correspondence or match occurs between the shape or pattern represented by the relative positions of the subset, and one of said constellations. Example methods for determining a correspondence will be described later on. If there is a correspondence, in step 5.4 one or more actions is or are performed. If there is no correspondence, the method returns to step 5.2, e.g. for a subsequent time frame.
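A minimal Python sketch of one pass of this loop (steps 5.2 to 5.4) for a single time frame is given below, re-using the Constellation sketch above. The helper names get_positions, matches and apply_actions are assumptions standing in for the position input, the correspondence test and the configured action(s) respectively.

```python
from itertools import combinations

def match_constellations(constellations, get_positions, matches, apply_actions):
    """One pass of the matching loop (steps 5.2 to 5.4); a sketch."""
    positions = get_positions()                            # step 5.2: {source_id: (x, y)}
    for constellation in constellations:
        k = len(constellation.points)                      # sources needed to form it
        for subset in combinations(sorted(positions), k):  # candidate subsets
            subset_positions = [positions[s] for s in subset]
            if matches(subset_positions, constellation):   # step 5.3
                apply_actions(constellation, subset)       # step 5.4
```

In practice the candidate subsets may be pruned, for example using the rules described below, rather than enumerated exhaustively.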
The method may be performed during capture or as part of a post-processing operation.
The actions performed in step 5.4 may be audio, visual, positional or other control effects, or a combination of said effects. Steps 5.4.1-5.4.4 represent example actions that may comprise step 5.4. A first example action 5.4.1 is that of modifying audio signals. A second example action 5.4.2 is that of modifying video or visual data. A third example action 5.4.3 is that of controlling the movement or position of certain audio sources 7-13. A fourth example action 5.4.4 is that of controlling something else, e.g. the capture device 6, which may involve moving the capture device or assigning audio signals from selected sources to one channel and other audio signals to another channel. Any of said actions 5.4.1-5.4.4 may be combined so that multiple actions may be performed responsive to a match in step 5.3.
Examples of audio effects in 5.4.1 include one or more of, but not limited to: enabling or disabling certain microphones; decreasing or muting the volume of certain audio signals; increasing the volume of certain audio signals; applying a distortion effect to certain audio signals; applying a reverberation effect to certain audio signals; and harmonising audio signals from certain multiple sources.
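As a non-limiting sketch of the volume-related effects in this list, the following Python fragment applies a gain in decibels to an audio signal held as a sample array; the function and action names are illustrative assumptions.

```python
import numpy as np

def apply_gain(signal, gain_db):
    """Scale an audio signal (a sample array) by a gain in decibels;
    a gain of -inf dB mutes the signal entirely."""
    if np.isneginf(gain_db):
        return np.zeros_like(signal)
    return signal * 10.0 ** (gain_db / 20.0)

# Illustrative named actions: -inf dB mutes, +6 dB roughly doubles amplitude.
actions = {
    "mute":  lambda s: apply_gain(s, -np.inf),
    "boost": lambda s: apply_gain(s, 6.0),
}
```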
Examples of video effects in 5.4.2 may include changing the appearance of one or more captured audio sources in the corresponding video data. The effects may be visual effects, for example, controlling lighting; controlling at least one video projector output; controlling at least one display output.
Examples of movement/positioning effects in 5.4.3 may include fixing the position of one or more audio sources and/or adjusting or filtering their movement in a way that differs from their captured movement. For example, certain audio sources may be attracted to, or repelled away from, a reference position. For example, audio sources outside of the matched constellation may be attracted to, or repelled away from, audio sources within said constellation.
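A minimal sketch of such an attraction or repulsion, assuming two-dimensional rendered positions and a simple linear pull toward (or push away from) a reference point:

```python
import numpy as np

def attract_or_repel(position, reference, strength):
    """Move a source position a fraction of the way toward the reference
    (strength > 0, attraction) or away from it (strength < 0, repulsion)."""
    position = np.asarray(position, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return position + strength * (reference - position)

# Pull a source 20% of the way toward a reference position at (2.0, 2.0)...
pulled = attract_or_repel((4.0, 1.0), (2.0, 2.0), 0.2)   # -> [3.6, 1.2]
# ...or push it the same fraction further away.
pushed = attract_or_repel((4.0, 1.0), (2.0, 2.0), -0.2)  # -> [4.4, 0.8]
```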
Examples of camera control effects in 5.4.4 may include moving the capture device 6 to a predetermined location when a constellation match is detected in step 5.3. Such effects may be applied to more than one capture device if multiple such devices are present.
In some embodiments, action(s) may be performed for a defined subset of the audio sources, for example only those that match the constellation, or, alternatively, those that do not.
As will be explained below, rules may be associated with each constellation.
For example, rules may determine which audio sources 7-13 may form the constellation. The term ‘forming’ in this context refers to audio sources which are taken into account in step 5.3.
Additionally, or alternatively, rules may determine a minimum (or maximum or exact) number of audio sources 7-13 that are required to form the constellation.
Additionally, or alternatively, rules may determine how close to the ideal constellation pattern or shape the audio sources 7-13 need to be, e.g. in terms of a maximum deviation from the ideal.
Other rules may determine what action is triggered when a constellation is matched in step 5.3.
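By way of illustration, the rules described above might be held per constellation in a structure such as the following Python sketch; the field names and default values are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

@dataclass
class ConstellationRules:
    """Per-constellation rules; a sketch with illustrative fields."""
    allowed_sources: Optional[Set[str]] = None  # sources that may form it (None = any)
    min_sources: int = 2                        # number of sources required
    max_deviation: float = 0.5                  # permitted deviation from the ideal shape
    action: Optional[Callable] = None           # action triggered on a match in step 5.3
```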
Applying the
In some embodiments, a correspondence is identified in step 5.3 if the pattern or shape formed by a subset of audio sources 7-13 overlies or has substantially the same shape as a constellation.
For example, in
In some embodiments, markers (not shown) may be defined as part of the constellation which indicate a particular configuration of where the individual audio sources need to be positioned in order for a match to occur.
In some embodiments, a tolerance or deviation measure may be defined to allow a limited amount of error between the respective positions of audio sources when compared with a predetermined constellation. One method is to perform a fit of the audio source positions to a constellation, for example using a least squares fit method. The resulting error, for example the Mean Squared Error, for the subset of audio sources may be compared with a threshold to determine if there is a match or not.
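A minimal Python sketch of this fit is given below, using the Kabsch (SVD-based) method to remove translation and rotation before computing the mean squared error. It assumes the correspondence between audio sources and constellation points is already known; a fuller implementation might also search over point assignments and, where size is immaterial, over scale.

```python
import numpy as np

def constellation_match(sources, constellation, threshold):
    """Least-squares fit of source positions to a constellation's points.

    Centres both 2D point sets, finds the best-fit rotation (Kabsch
    method) and compares the mean squared error with a threshold.
    """
    S = np.asarray(sources, dtype=float)
    C = np.asarray(constellation, dtype=float)
    S = S - S.mean(axis=0)                    # remove translation
    C = C - C.mean(axis=0)
    H = S.T @ C                               # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T        # best-fit rotation
    mse = np.mean(np.sum((S @ R.T - C) ** 2, axis=1))
    return mse, bool(mse <= threshold)
```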
Referring to
In some embodiments, the rules may define one or more matching criteria, i.e. criteria as to what constitutes a correspondence with said constellation for the purpose of performing step 5.3 of the
In some embodiments, the matching rules may determine that a correspondence occurs just prior to the pattern or shape overlaying that of a constellation. In other words, some form of prediction is performed based on movement as the pattern or shape approaches that of a constellation.
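One simple form of such prediction is linear extrapolation of each source position from its estimated velocity, as in the following sketch; the lookahead interval and function name are illustrative assumptions.

```python
def predict_positions(positions, velocities, lookahead):
    """Linearly extrapolate source positions `lookahead` seconds ahead,
    so a correspondence can be declared just before the pattern forms."""
    return {src: tuple(p + v * lookahead
                       for p, v in zip(pos, velocities[src]))
            for src, pos in positions.items()}

# With a 0.5 s lookahead, a source at (3.0, 0.0) moving at (2.0, 0.0) m/s
# is tested at its predicted position (4.0, 0.0).
ahead = predict_positions({"violin": (3.0, 0.0)}, {"violin": (2.0, 0.0)}, 0.5)
```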
In some embodiments, the matching rules may further define that the orientation of a subset of audio sources in relation to a capture device position, e.g. the position of a camera, is a factor for triggering an action.
In some embodiments, the simultaneous and coordinated movement of a subset of audio sources may be a factor for triggering an action.
Alternatively, or additionally, in some embodiments, rules may define one or more actions to be applied or triggered in the event of a correspondence in step 5.3. These may be termed action rules. The action rules may be applied for one or more selected subsets of the sound sources.
Responsive to this correspondence, the first and second action rules 63, 64 given by way of example in
Referring to
Further rules may for example implement a delay in the movement of audio sources, e.g. for a predetermined time period after the line constellation breaks.
For completeness,
In some embodiments, the action that is triggered upon detecting a constellation correspondence may result in audio sources of the constellation being assigned to a first channel or channel group of a physical mixing table and/or to a first mixing desk. Other audio sources, or a subset of audio sources corresponding to a different constellation, may be assigned to a different channel or channel group and/or to a different mixing desk. In this way, a single controller may be used to control all audio sources corresponding to one constellation. Multi-user mixing workflow is therefore enabled.
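A minimal sketch of such an assignment, mapping the matched constellation's sources to one channel group and the remaining sources to another; the group numbering is illustrative.

```python
def assign_channels(matched_sources, all_sources):
    """Route sources in the matched constellation to channel group 1 of a
    mixing desk and all remaining sources to channel group 2."""
    return {src: 1 if src in matched_sources else 2 for src in all_sources}

groups = assign_channels({"drums", "bass"}, ["drums", "bass", "vocals"])
# -> {"drums": 1, "bass": 1, "vocals": 2}
```

A physical controller may then be bound to each group, enabling the multi-user mixing workflow described above.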
As mentioned, the above described mixing method enables a reduction in the workload of a human operator because it performs or triggers certain actions automatically. The method may improve user experience for VR or AR consumption, for example by generating a noticeable effect if audio sources outside of the user's current field-of-view match a constellation. The method may be applied, for example, to VR or AR games for providing new features.
It will be appreciated that the above described embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present application.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein, or any generalization thereof. During the prosecution of the present application, or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.