The example and non-limiting embodiments of the present invention relate to capturing of audio signals.
For many years, mobile devices such as mobile phones and tablet computers have been provided with a camera and a microphone arrangement that enable the user of the device to capture audio and video. With the development of microphone technologies and with the increase in processing power and storage capacity available in mobile devices, providing such mobile devices with multi-microphone arrangements that enable capturing multi-channel audio is becoming increasingly common, which in turn e.g. enables usage of the mobile device for recording high-quality spatial audio. Typically, spatial audio (or multi-channel audio in general) is captured together with video, although multi-channel audio can obviously also be recorded as stand-alone media without accompanying video.
Typically, the process of capturing a multi-channel audio signal using the mobile device comprises operating a microphone array to capture a plurality of microphone signals and processing the captured microphone signals into a recorded multi-channel audio signal for further processing in the mobile device, for storage in the mobile device and/or for transmission to one or more other devices.
The audio processor 102 may enable a plurality of audio processing functions, whereas application of the audio processing functions available therein may be controlled via the capture control data. Non-limiting examples of such audio processing functions that may be applied by the audio processor 102 to the microphone signals or to one or more signals derived from the microphone signals include the following:
The capture control data may further define general audio characteristics pertaining to the received microphone signals (i.e. input audio) and/or to captured audio signal (i.e. output audio), such as the number of input channels, the number of output channels, the sampling rate of the audio, the sample resolution of the audio (e.g. the number of bits per audio sample), the applied (output) audio format (e.g. binaural, loudspeaker channels according to a specified channel configuration, parametric audio, Ambisonics), etc. In addition to the general audio characteristics (of the input and/or output audio), the capture control data may define which of the audio processing functions available in the audio processor 102 are (to be) applied and, if applicable, respective audio processing parameters for controlling application of the respective audio processing function. Hence, the capture control data identifies at least one (audio) characteristic for derivation of the captured audio signal.
The capture control data may comprise definitions originating from preselection made by the user of the mobile device (and stored in a memory of the mobile device) prior to an audio capturing session, definitions originating from automated selection made by the mobile device and/or definitions originating from user input received upon initiation of or during the audio capturing session. The capture control data, and hence the corresponding characteristics of operation of the audio processor 102, may remain unchanged throughout the audio capturing session. On the other hand, at least some aspects of the capture control data, and hence the corresponding characteristics of operation of the audio processor 102, may vary or be varied during the audio capturing session. Such variations may include enabling or disabling a certain audio processing function during the audio capturing session or changing characteristics of a certain audio processing function during the audio capturing session, either automatically under control of the mobile device or in response to user input received during the audio capturing session. The user input may include direct user input that directly addresses one or more audio characteristics of the audio capturing session and/or indirect user input that results from the user adjusting a related procedure in the mobile device, e.g. changing the video zooming settings applied for a concurrent video capturing session.
Consequently, the audio capturing session carried out along the lines described above results in a captured audio signal that may be subsequently accessed by a user of the mobile device applying the audio capturing arrangement 100 or by a user of another device. The resulting captured audio signal reflects the selections made (with respect to application and characteristics of the audio processing functions available in the audio processor 102) upon deriving the captured audio signal.
In a typical usage scenario, at the time of capture the user of the mobile device also directly listens to the real audio scene he or she is capturing with the mobile device, and hence no ‘monitoring’ of the captured audio signal takes place during the audio capturing session. Consequently, the user may subsequently find the selections made upon deriving the captured audio signal non-optimal and/or another user may have different preferences with respect to selections that control operation of the audio processing functions available in the audio processor 102. However, some of the audio processing functions that have been applied in the underlying audio capturing session may have an effect that cannot be reversed (or ‘undone’), or reversing (or ‘undoing’) the respective audio processing function may result in compromised audio quality and/or excessive computation. Moreover, some of the audio processing functions that are available in the audio processor 102 but that were not applied upon deriving the captured audio signal cannot necessarily be applied to the captured audio signal, or their application may result in compromised audio quality or excessive computation.
According to an example embodiment, a method for processing two or more microphone signals is provided, the method comprising: deriving a first captured audio signal having one or more channels based on the two or more microphone signals received from respective two or more microphones in accordance with capture control data that identifies at least one characteristic for derivation of the first captured audio signal; storing at least part of the capture control data and intermediate audio data derived based on the two or more received microphone signals; deriving modified capture control data as a combination of the stored capture control data and user-definable post-capture control data that identifies at least one characteristic for derivation of a second captured audio signal; and deriving the second captured audio signal having one or more channels based on said intermediate audio data in accordance with the modified capture control data.
According to another example embodiment, a method for processing two or more microphone signals is provided, the method comprising: deriving a first captured audio signal having one or more channels based on the two or more microphone signals received from respective two or more microphones in accordance with capture control data that identifies at least one characteristic for derivation of the first captured audio signal; and storing at least part of the capture control data and intermediate audio data derived based on the two or more received microphone signals to enable derivation of a second captured audio signal having one or more channels based on the intermediate audio data in accordance with at least part of the stored capture control data.
According to another example embodiment, a method for processing two or more microphone signals is provided, the method comprising: obtaining a first captured audio signal having one or more channels derived on basis of said two or more microphone signals in dependence of capture control data that identifies at least one characteristic for derivation of the first captured audio signal; obtaining at least part of said capture control data and intermediate audio data derived on basis of said two or more microphone signals; deriving modified capture control data as a combination of said obtained capture control data and user-definable post-capture control data that identifies at least one characteristic for derivation of a second captured audio signal; and deriving the second captured audio signal having one or more channels based on said intermediate audio data in accordance with the modified capture control data.
According to another example embodiment, a system for processing two or more microphone signals is provided, the system comprising: a means for deriving a first captured audio signal having one or more channels based on the two or more microphone signals received from respective two or more microphones in accordance with capture control data that identifies at least one characteristic for derivation of the first captured audio signal; a means for storing at least part of the capture control data and intermediate audio data derived based on the two or more received microphone signals; a means for deriving modified capture control data as a combination of the stored capture control data and user-definable post-capture control data that identifies at least one characteristic for derivation of a second captured audio signal; and a means for deriving the second captured audio signal having one or more channels based on said intermediate audio data in accordance with the modified capture control data.
According to another example embodiment, an apparatus for processing two or more microphone signals is provided, the apparatus comprising: a means for deriving a first captured audio signal having one or more channels based on the two or more microphone signals received from respective two or more microphones in accordance with capture control data that identifies at least one characteristic for derivation of the first captured audio signal; a means for storing at least part of the capture control data and intermediate audio data derived based on the two or more received microphone signals; a means for deriving modified capture control data as a combination of the stored capture control data and user-definable post-capture control data that identifies at least one characteristic for derivation of a second captured audio signal; and a means for deriving the second captured audio signal having one or more channels based on said intermediate audio data in accordance with the modified capture control data.
According to another example embodiment, an apparatus for processing two or more microphone signals is provided, the apparatus comprising: a means for deriving a first captured audio signal having one or more channels based on the two or more microphone signals received from respective two or more microphones in accordance with capture control data that identifies at least one characteristic for derivation of the first captured audio signal; and a means for storing at least part of the capture control data and intermediate audio data derived based on the two or more received microphone signals to enable derivation of a second captured audio signal having one or more channels based on the intermediate audio data in accordance with at least part of the stored capture control data.
According to another example embodiment, an apparatus for processing two or more microphone signals is provided, the apparatus comprising: a means for obtaining a first captured audio signal having one or more channels derived on basis of said two or more microphone signals in dependence of capture control data that identifies at least one characteristic for derivation of the first captured audio signal; a means for obtaining at least part of said capture control data and intermediate audio data derived on basis of said two or more microphone signals; a means for deriving modified capture control data as a combination of said obtained capture control data and user-definable post-capture control data that identifies at least one characteristic for derivation of a second captured audio signal; and a means for deriving the second captured audio signal having one or more channels based on said intermediate audio data in accordance with the modified capture control data.
According to another example embodiment, an apparatus for processing two or more microphone signals is provided, wherein the apparatus comprises at least one processor; and at least one memory including computer program code, which, when executed by the at least one processor, causes the apparatus to perform at least a method according to one of the example embodiments described in the foregoing.
According to another example embodiment, a computer program is provided, the computer program comprising computer readable program code configured to cause performing at least a method according to one of the example embodiments described in the foregoing when said program code is executed on one or more computing apparatuses.
The computer program according to an example embodiment may be embodied on a volatile or a non-volatile computer-readable record medium, for example as a computer program product comprising at least one computer-readable non-transitory medium having program code stored thereon, which program code, when executed by one or more apparatuses, causes the one or more apparatuses at least to perform the operations described hereinbefore for the computer program according to an example embodiment of the invention.
The exemplifying embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb “to comprise” and its derivatives are used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features described hereinafter are mutually freely combinable unless explicitly stated otherwise.
Some features of the invention are set forth in the appended claims. Aspects of the invention, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of some example embodiments when read in connection with the accompanying drawings.
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, where
Referring to
Although not shown in
Typically, although not necessarily, the two or more microphone signals include audio information that readily provides or can be processed into a representation of an audio scene in the environment of the mobile device that implements the capture arrangement 200a. In the following, where applicable, the representation of the audio scene provided by the two or more microphone signals is referred to as an original representation of the audio scene. The (perceptual) quality and/or accuracy of the representation of the audio scene captured in the audio information provided by the two or more microphone signals depends, for example, on the position and orientation of the microphones of the mobile device with respect to sound sources of the audio scene and on respective positions of the two or more microphones with respect to each other. Along similar lines, the captured audio signal may constitute a multi-channel audio signal (of at least two channels) that conveys a representation of the audio scene in the environment of the mobile device that implements the capture arrangement 200a, which may be similar to the original representation of the audio scene or a modified version thereof.
The term audio scene, as used in the present disclosure, refers to the sound field in the environment of the mobile device that implements the capture arrangement 200a, whereas e.g. the two or more microphone signals provide a representation of the audio scene. An audio scene may involve one or more sound sources at specific spatial positions of the audio scene and/or the ambience of the audio scene. A representation of an audio scene may be defined using a (spatial) audio format, such as binaural audio, audio channels according to a predefined channel configuration, parametric audio, Ambisonics, etc., that enables delivering (audio) information related to one or more directional sound components and/or related to ambient sounds such as environmental sounds and reverberation within the audio scene. Listening to such a representation of an audio scene enables the listener to experience the audio environment as if he or she were at the location the audio scene serves to represent.
The capture arrangement 200a is typically applied to process the two or more microphone signals as a sequence of input frames to derive a corresponding sequence of output frames that constitute the captured audio signal. Each input frame includes a respective segment of digital audio signal for each of the microphone signals at a respective predefined sampling frequency and each output frame includes a respective segment of digital audio signal for each channel of the captured audio signal at a respective predefined sampling frequency. In a typical example, the capture arrangement 200a employs a fixed predefined frame length such that each frame comprises respective L samples for each channel of the respective audio signal (i.e. the microphone signal(s) and the captured audio signal), which at the respective predefined sampling frequency maps to a corresponding duration in time. As an example in this regard, the fixed frame length may be 20 milliseconds (ms), which at a sampling frequency of 8, 16, 32 or 48 kHz results in a frame of L=160, L=320, L=640 and L=960 samples per channel, respectively. The frames may be non-overlapping or they may be partially overlapping. These values, however, serve as non-limiting examples and frame lengths and/or sampling frequencies different from these examples may be employed instead, depending e.g. on the desired audio bandwidth, on desired framing delay and/or on available processing capacity.
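By way of a non-limiting illustration, the frame arithmetic above may be expressed as follows (a minimal Python sketch; the helper function name is hypothetical):

```python
def frame_length_in_samples(frame_ms: float, sampling_rate_hz: int) -> int:
    """Number of samples per channel in one frame of the given duration."""
    return int(frame_ms * sampling_rate_hz / 1000)

# The 20 ms frame length of the example above at different sampling frequencies:
for fs_hz in (8_000, 16_000, 32_000, 48_000):
    print(fs_hz, frame_length_in_samples(20, fs_hz))  # -> 160, 320, 640, 960
```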
Still referring to
Still referring to
The capture control data format applied by the control data formatter 204 for the stored capture control data may include a sequence of control data entries, each control data entry identifying either a respective general audio characteristic of the audio capture or identifying an applied audio processing function of the audio processor 102, possibly together with respective audio processing parameters. According to an example, a control data entry comprises an indication of timing (e.g. a starting time and/or an ending time) assigned for the respective control data entry, the general audio characteristic or an audio processing function associated with the respective control data entry, and possible audio processing parameters (to be) applied for controlling application of the respective audio processing function. In other examples, the timing associated with a control data entry may be implicit, e.g. based on the position of the control data entry in the sequence of control data entries (e.g. if a dedicated control data entry is provided for each frame of the underlying audio signal) or based on another structural aspect of the capture control data format. In such examples the timing indications may be omitted from the control data entries.
In a non-limiting example, a control data entry of the stored capture control data stored by the control data formatter 204 may include a timestamp that indicates the starting time with respect to a reference time (e.g. as seconds with respect to the beginning of the underlying audio signal), an identification of the general audio characteristic or an identification of an audio processing function associated with the respective control data entry, and, if applicable, one or more audio parameters applied for controlling application of the respective audio processing function, e.g. as shown in the following example.
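By way of a non-limiting illustration, such control data entries might be laid out as follows (a Python sketch; the field names and values are illustrative assumptions, not a normative format):

```python
# Each entry: a timestamp relative to the beginning of the underlying audio
# signal, the general audio characteristic or audio processing function
# concerned, and any parameters controlling its application.
stored_capture_control_data = [
    {"time_s": 0.0,  "item": "output_format", "params": {"format": "binaural"}},
    {"time_s": 0.0,  "item": "sampling_rate", "params": {"rate_hz": 48_000}},
    {"time_s": 12.5, "item": "audio_focus",
     "params": {"azimuth_deg": 30.0, "focus_amount": 0.8}},
]
```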
The control data entries may be provided in human-readable (text) format or in a computer-readable (binary and/or encoded) format. The control data formatter 204 provides the stored capture control data as metadata associated with the audio data extracted from the two or more microphone signals by the audio formatter 206. In an example, the control data formatter 204 writes the stored capture control data in the storage 208 in a separate (or dedicated) file. In another example, the control data formatter 204 embeds the stored capture control data into another file stored in the storage 208, e.g. as metadata included in the file that (also) stores the audio data extracted from the two or more microphone signals by the audio formatter 206.
The control data formatter 204 may include all received capture control data in the stored capture control data, thereby enabling subsequent reconstruction of the captured audio signal by the post-capture arrangement 200b. In another example, the control data formatter 204 includes only a subset of the received capture control data in the stored capture control data in order to reduce the amount of storage (and/or data transfer) capacity required for the metadata. As an example in this regard, the control data formatter 204 may be arranged to include in the stored capture control data respective definitions that pertain to certain (first) one or more predefined general audio characteristics or audio processing functions and/or to omit from the stored captured control data respective definitions that pertain to certain (second) one or more predefined general audio characteristics or audio processing functions. As another example, the amount of stored capture control data may be reduced by reducing the update rate of a given audio parameter associated with an applied audio processing function, e.g. such that the updated value for the given audio parameter is included at N frame intervals instead of including the respective parameter value in each frame. Consequently, the post-capture arrangement 200b may interpolate the audio parameter value for those frames for which the audio parameter value is not explicitly indicated.
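By way of a non-limiting illustration, the update-rate reduction and the corresponding interpolation in the post-capture arrangement 200b might be sketched as follows (assuming a scalar per-frame audio parameter; the function names are hypothetical):

```python
import numpy as np

def decimate_parameter(values, n):
    """Store an explicit parameter value only at N-frame intervals."""
    return {frame: value for frame, value in enumerate(values) if frame % n == 0}

def interpolate_parameter(sparse, num_frames):
    """Linearly interpolate the parameter for frames whose value was not
    explicitly stored, as the post-capture arrangement 200b may do."""
    frames = sorted(sparse)
    return np.interp(np.arange(num_frames), frames, [sparse[f] for f in frames])
```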
Still referring to
In other examples, the audio formatter 206 may apply one or more audio processing functions to process the received two or more microphone signals into the intermediate audio data for storage in the storage 208. Examples of audio processing functions that may be applied by the audio formatter 206 include one or more of the following: gain control, audio equalization, noise suppression (such as wind noise removal) or audio enhancement processing of another kind, change (e.g. reduction) of sampling rate, change (e.g. reduction) of audio sample resolution, change (e.g. reduction) of the number of channels, audio encoding, conversion to a selected or predefined audio format (e.g. binaural, audio channels according to a predefined channel configuration, parametric audio, Ambisonics), etc. In such an example, the intermediate audio data may include one or more intermediate audio signals, possibly complemented by audio metadata. Typically, however, no or only a few audio processing functions are applied in the audio formatter 206 to retain as much as possible of the audio information conveyed by the two or more microphone signals (e.g. in view of the available capacity of the storage 208 and/or in view of the available processing capacity) in order to provide the post-capture arrangement 200b with intermediate audio data that is (significantly) closer in information content to the two or more microphone signals received at the capture arrangement 200a than the captured audio signal output from the audio processor 102. Typically, although not necessarily, the stored capture control data (stored e.g. as metadata) provided in the storage 208 by the control data formatter 204 includes respective control data entries pertaining only to those audio processing functions that are applied by the audio processor 102, whereas the one or more predefined audio processing functions possibly applied by operation of the audio formatter 206 are not identified in the stored capture control data provided in the storage 208. However, information that identifies at least some of the audio processing functions applied by the audio formatter 206 in derivation of the intermediate audio data may be stored in the storage 208 together with the one or more intermediate audio signals as metadata associated thereto, either in a separate (or dedicated) file or embedded into the file that (also) stores the one or more intermediate audio signals.
The audio formatter 206 may store the intermediate audio data using a suitable storage format known in the art. As an example, if the intermediate audio data is provided as one or more time-domain multi-channel audio signals, they may be stored in the storage 208 in the PCM format (e.g. as a .wav file). In another example, if the audio formatter 206 applies a selected or predefined audio encoding to the two or more microphone signals (or to one or more signals derived on basis of the two or more microphone signals) to derive the intermediate audio data, this information may be stored using a predefined container (or encapsulation) format defined for the respective audio encoder.
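By way of a non-limiting illustration, storing time-domain intermediate audio data as PCM in a .wav file might be sketched as follows (assuming the third-party soundfile library; the function name is hypothetical):

```python
import numpy as np
import soundfile as sf  # third-party library assumed for this sketch

def store_intermediate_audio(path: str, signals: np.ndarray, fs_hz: int) -> None:
    """Write multi-channel time-domain audio of shape
    (num_samples, num_channels) to `path` as 16-bit linear PCM."""
    sf.write(path, signals, fs_hz, subtype="PCM_16")
```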
Still referring to
As described in the foregoing, the intermediate audio data comprises audio information that includes one or more intermediate audio signals (possibly complemented by audio metadata). The intermediate audio data conveys the original representation of the audio scene (i.e. the one provided by the two or more microphone signals received at the capture arrangement 200a) or an approximation thereof. As also described in the foregoing, each of the intermediate audio signals included in the intermediate audio data may be provided as a respective single-channel (monophonic) audio signal or as a respective multi-channel audio signal (having two or more channels). The post-captured audio signal resulting from operation of the post-capture arrangement 200b typically comprises a multi-channel audio signal (of at least two channels) that conveys a representation of the audio scene, whereas in some examples the post-captured audio signal may comprise a single-channel (monophonic) audio signal, depending on the definitions provided in the modified capture control data.
The post-capture arrangement 200b is typically applied to process the intermediate audio data (e.g. the one or more intermediate audio signals) as a sequence of input frames to derive a corresponding sequence of output frames that constitute the post-captured audio signal. The description of the frame structure provided in the foregoing with references to the capture arrangement 200a applies also to the post-capture arrangement 200b, mutatis mutandis.
As described in the foregoing, the audio preprocessor 212 is arranged to derive the one or more reconstructed signals based on the intermediate audio data. In this regard, the audio preprocessor 212 obtains (e.g. reads) the intermediate audio data from the storage 208 and, depending on the content and format applied for the intermediate audio data, either passes the one or more intermediate audio signals included therein, as such, as the respective one or more reconstructed signals, or subjects the one or more intermediate audio signals included in the intermediate audio data to one or more audio processing functions to derive the one or more reconstructed signals for further processing by the audio processor 202.
In case the intermediate audio data obtained from (the capture arrangement 200a via) the storage 208 includes two or more intermediate audio signals provided as respective copies of the two or more microphone signals originally received at the capture arrangement 200a (i.e. as ‘raw’ audio signals), no audio processing by the audio preprocessor 212 is necessary and the two or more intermediate audio signals may be passed as such as the respective two or more reconstructed signals for processing by the audio processor 202. In contrast, in case the intermediate audio data obtained from the storage 208 includes one or more intermediate audio signals that provide an encoded representation of the two or more microphone signals originally received at the capture arrangement 200a, the audio preprocessor 212 may be arranged to apply respective audio decoding to the one or more intermediate audio signals to derive the one or more reconstructed signals.
As described in the foregoing, the audio processor 202 is arranged to derive the post-captured audio signal based on the one or more reconstructed signals in accordance with the modified capture control data. The audio processor 202 may be similar to the audio processor 102 with respect to its operation and capabilities. Hence, the audio processor 202 may enable a plurality of audio processing functions, whereas application of the audio processing functions available therein may be controlled via the modified capture control data. Non-limiting examples of audio processing functions that may be available in the audio processor 202 are described in the foregoing with references to the audio processor 102.
Although not shown in
As described in the foregoing, the control data combiner 210 is arranged to derive modified capture control data based on the stored capture control data and post-capture control data. In this regard, the control data combiner 210 obtains (e.g. reads) the stored capture control data from the storage 208. While the stored capture control data identifies at least one audio characteristic that has been applied in derivation of the captured audio signal in the capture arrangement 200a (and that may be applied for derivation of the post-captured audio signal), each of the post-capture control data and the resulting modified capture control data identifies at least one audio characteristic that is (to be) applied for derivation of the post-captured audio signal in the post-capture arrangement 200b.
Referring back to the characteristics of the capture control data described in the foregoing in context of the capture arrangement 200a, the stored capture control data identifies at least one audio characteristic applied for derivation of the captured audio signal, which may also be applied for derivation of the post-captured audio signal. In this regard, the stored capture control data may identify general audio characteristics pertaining to the received microphone signals (i.e. an input audio of the capture arrangement 200a) and/or to the captured audio signal (i.e. an output audio of the capture arrangement 200a) and/or the stored capture control data may define which of the audio processing functions available in the audio processor 102 have been applied in deriving the captured audio signal in the audio processor 102 and, if applicable, respective audio processing parameters that were used for controlling application of the respective audio processing functions. Examples of both general audio characteristics and audio processing functions available in the audio processor 102 are described in the foregoing.
Along the lines described in the foregoing in context of the capture arrangement 200a, the stored capture control data is stored in the storage 208 using a predefined capture control data format that may comprise a sequence of control data entries, each control data entry identifying either a respective general characteristic of the captured audio signal or identifying an audio processing function applied for processing the one or more microphone signals in the capture arrangement 200a, possibly together with respective audio processing parameters. Depending on the information available as the stored capture control data, the control data combiner 210 may interpolate between data points available in the stored capture control data to ensure availability of capture control data for the full duration (e.g. for each frame) of the corresponding intermediate audio data also stored in the storage 208.
Still referring to
The post-capture control data may comprise definitions originating from user input received upon initiating or carrying out a post-capturing session. In this regard, the post-capture control data, and hence the corresponding characteristics of operation of the audio processor 202, may remain unchanged throughout the post-capturing session. On the other hand, at least some aspects of the post-capture control data, and hence the corresponding characteristics of operation of the audio processor 202, may vary or be varied during the post-capturing session. Such variations may include enabling or disabling a certain audio processing function during the post-capturing session or changing characteristics of a certain audio processing function during the post-capturing session, for example, in response to user input received during the post-capturing session.
Still referring to
Hence, the control data combiner 210 may be arranged to carry out one or more of the following:
Consequently, the post-capture arrangement 200b enables a user to omit, to replace, to modify and/or to complement the selections made with respect to audio characteristics applied for derivation of the captured audio signal to derive the post-captured audio signal that provides improved perceptual audio quality and/or that otherwise more closely reflects the preferences of the user of the post-capture arrangement 200b.
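By way of a non-limiting illustration, such combination of the stored capture control data and the post-capture control data might be sketched as follows (per-frame timing is ignored for brevity; the entry layout follows the earlier illustrative sketch and the `disable` flag is a hypothetical convention):

```python
def combine_control_data(stored_entries, post_capture_entries):
    """Sketch of the control data combiner 210: post-capture definitions
    may omit, replace, modify or complement the stored ones."""
    combined = {entry["item"]: entry for entry in stored_entries}
    for entry in post_capture_entries:
        if entry.get("disable"):
            # Omit an audio processing function applied at capture time.
            combined.pop(entry["item"], None)
        else:
            # Replace or modify a stored definition, or complement the
            # stored capture control data with a new definition.
            combined[entry["item"]] = entry
    return list(combined.values())
```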
According to an example, as schematically illustrated in
In another example, as schematically illustrated in
In a further example, as schematically illustrated in
As described in the foregoing, a plurality of audio processing functions may be available in the audio processors 102, 202 for modification of the two or more microphone signals, the one or more reconstructed signals or one or more signals derived therefrom. Many of these audio processing functions may result in changes in the audio information conveyed in the respective processed audio signal(s) that cannot be reversed or ‘undone’, at least not to the full extent. A few examples in this regard are provided in the following:
Considering the examples above, a user of the post-capture arrangement 200b may, for example, prefer adjusting the gain or audio equalization settings differently from those applied in the capture arrangement 200a, prefer omitting one or more audio enhancement functions that were applied in the capture arrangement 200a, prefer omitting audio focusing applied in the capture arrangement 200a or applying audio focusing with different settings, prefer omitting audio encoding applied in the capture arrangement 200a, prefer applying an audio encoding technique different from that applied in the capture arrangement 200a, prefer converting the microphone signals into a (spatial) audio format different from that applied in the capture arrangement 200a, etc.
In the course of the operation of the capture arrangement 200a, the audio processor 102 typically derives the captured audio signal based on the two or more microphone signals in accordance with the capture control data frame by frame as further audio becomes available from the two or more microphones. Consequently, when processing a given frame of the two or more microphone signals, the audio processing functions available in the audio processor 102 typically do not have any (or have limited) access to audio content of the two or more microphone signals that follows the given frame. On the other hand, the audio processor 202 in the post-capture arrangement 200b typically has full access to the one or more reconstructed signals in their entirety when applying the audio processing functions available therein, including also frames of the one or more reconstructed signals that follow the frame currently under processing. Consequently, the audio processor 202 may be arranged to apply one or more of the audio processing functions available therein in a manner that differs from application of the respective audio processing function in the audio processor 102, e.g. such that the signal content in some of the future frames is taken into account when processing a given frame. A non-limiting example in this regard involves signal level adjustment by an automatic gain control (AGC) function that may benefit from access to the one or more reconstructed signals in their entirety when deriving and applying a gain for a given frame of the one or more reconstructed signals.
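By way of a non-limiting illustration, an AGC that exploits access to the signal in its entirety might be sketched as follows (a simplified single-channel example with an assumed target level and smoothing length; not a production-grade AGC):

```python
import numpy as np

def offline_agc_gains(signal, frame_len, target_rms=0.1, eps=1e-9):
    """Per-frame gains computed with access to the whole signal, including
    future frames: information unavailable to a capture-time AGC."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Smooth the level estimate over past *and* future frames.
    smoothed = np.convolve(rms, np.ones(5) / 5, mode="same")
    return target_rms / (smoothed + eps)
```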
In the following, a particular example that pertains to controlling operation of audio focusing (or “audio zooming”) is described in more detail. Audio focusing enables modifying the representation of an audio scene conveyed by a multi-channel audio signal by adjusting (e.g. one of increasing or decreasing) the sound level in a user-selectable spatial portion of the audio scene by a user-definable amount in relation to other spatial portions of the audio scene. Hence, the audio focusing enables modifying the multi-channel audio signal (and hence the representation of the audio scene conveyed by the multi-channel audio signal) e.g. such that the sounds in a user selectable focus direction are emphasized with respect to sounds in other directions by a user-selectable focus amount. Herein, the audio focusing may be applied to the two or more microphone signals (by the audio processor 102) and/or to the one or more reconstructed signals (by the audio processor 202). In an example, the operation of audio focusing may be controlled via user-definable focus direction and focus amount parameters, which may be provided as input to the audio processing arrangement as part of the capture control data and/or as part of the post-capture control data: the focus direction defines the spatial portion (e.g. one or more spatial directions or a range of spatial directions) of the audio scene to be modified and the focus amount defines the extent of adjustment to be applied to the sound level in the selected spatial portion of the audio scene. In particular, the user may define a first focus direction and a first focus amount upon operating the capture arrangement 200a, whereas the user or another user may define a second focus direction (that is different from the first focus direction) and/or a second focus amount (that is different from the first focus amount) upon operating the post-capture arrangement 200b. Consequently, the audio processing arrangement 200 enables correcting or otherwise re-defining the audio focusing defined by the first focus direction and the first focus amount applied upon deriving the captured audio signal (via operation of the capture arrangement 200a) by defining the second focus direction and the second focus amount differently for derivation of the post-captured audio signal (via operation of the post-capture arrangement 200b) e.g. to obtain audio focusing that better reflects his/her preferences.
As illustrated in
When applied as the audio processor 102, channels of the input audio signal to the audio processor 302 comprise respective two or more microphone signals received at the capture arrangement 200a and channels of the one or more output audio signals of the audio processor 302 represent respective channels of the captured audio signal, whereas when applied as the audio processor 202, the channels of input audio signal to the audio processor 302 comprise respective one or more reconstructed signals obtained at the post-capture arrangement 200b and the channels of the output audio signal of the audio processor 302 represent respective channels of the post-captured audio signal.
In context of the audio processor 302, the focus direction refers to a user-selectable spatial direction of interest. The focus direction may be, for example, a certain direction of the audio scene in general or, in another example, a direction in which a sound source of interest is currently positioned. In the former scenario, the user-selectable focus direction typically denotes a spatial direction that stays constant or changes infrequently, since the focus is predominantly in a specific spatial direction, whereas in the latter scenario the user-selected focus direction may change more frequently, since the focus is set to a certain sound source that may (or may not) change its position in the audio scene over time. In an example, the focus direction may be defined, for example, as an azimuth angle that defines the spatial direction of interest with respect to a first predefined reference direction and/or as an elevation angle that defines the spatial direction of interest with respect to a second predefined reference direction.
The focus amount refers to a user-selectable change in the relative sound level of sound arriving from the focus direction. The focus amount may be selectable between zero (i.e. no focus) and a predefined maximum focus amount. The focus amount may be applied by mapping the user-selected focus amount into a scaling factor in a range from 0 to 1 and modifying the sound level of one or more sound components in a representation of the audio scene arriving from the focus direction (in relation to other sounds in the representation of the audio scene) in accordance with the scaling factor.

As described in the foregoing, the filter bank 322 is arranged to transform the channels of the input audio signal from the time domain into a transform domain. In this regard, the processing by the filter bank 322 may comprise transforming each channel of each frame of the input audio signal from the time domain to the transform domain. Transforming a frame to the transform domain may involve using information also from one or more frames that (immediately) precede the current frame, depending on characteristics of the applied transform technique and/or the filter bank. Without losing generality, the transform domain may be considered as a frequency domain and the transform-domain samples resulting from the transform may be referred to as frequency bins. The filter bank 322 employs a predetermined transform technique known in the art. In an example, the filter bank 322 employs a short-time discrete Fourier transform (STFT) to convert each channel of the input audio signal into a respective channel of the transform-domain signal using a predefined analysis window length (e.g. 20 milliseconds). In another example, the filter bank 322 employs a complex-modulated quadrature-mirror filter (QMF) bank for time-to-frequency-domain conversion. The STFT and the QMF bank serve as non-limiting examples in this regard, and in further examples any suitable technique known in the art may be employed for creating the transform-domain signals. The inverse filter bank 332 is arranged to transform each frame of the focused audio signal (obtained from the combiner 330) from the transform domain back to the time domain for provision to the (optional) audio encoder 334. The inverse filter bank 332 employs an inverse transform matching the transform applied by the filter bank 322, e.g. an inverse STFT or an inverse QMF. The filter bank 322 and the inverse filter bank 332 are typically arranged to process each channel of the audio signal separately from the other channels.
The filter bank 322 may further divide each channel of the input audio signal into a respective plurality of frequency sub-bands, thereby resulting in the transform-domain input audio signal that provides a respective time-frequency representation for each channel of the input audio signal. A given frequency band in a given frame of the transform-domain audio signal may be referred to as a time-frequency tile, and the processing of the audio signal between the filter bank 322 and the inverse filter bank 332 is typically carried out separately for each time-frequency tile in the transform domain. The number of frequency sub-bands and the respective bandwidths of the frequency sub-bands may be selected e.g. in accordance with the desired frequency resolution and/or the available computing power. In an example, the sub-band structure involves 24 frequency sub-bands according to the Bark scale, an equivalent rectangular bandwidth (ERB) scale or a 3rd-octave band scale known in the art. In other examples, a different number of frequency sub-bands that have the same or different bandwidths may be employed. A specific example in this regard is a single frequency sub-band that covers the input spectrum in its entirety or a continuous subset thereof. Another specific example is consideration of each frequency bin as a separate frequency sub-band.
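By way of a non-limiting illustration, the time-frequency analysis and the sub-band grouping might be sketched as follows (using an STFT as one possible filter bank realization; the band edges passed in are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft  # STFT as one possible filter bank realization

def analyze_channel(x, fs_hz, win_ms=20):
    """Transform one channel to the time-frequency domain (cf. filter bank 322)."""
    nperseg = int(fs_hz * win_ms / 1000)
    freqs, _times, Z = stft(x, fs=fs_hz, nperseg=nperseg)
    return freqs, Z  # Z[k, n]: frequency bin k in frame n

def group_into_subbands(freqs, Z, band_edges_hz):
    """Group frequency bins into sub-bands; one (band, frame) pair then
    constitutes a time-frequency tile."""
    return [Z[(freqs >= lo) & (freqs < hi), :]
            for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:])]
```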
As described in the foregoing, the spatial analyzer 324 is arranged to estimate spatial characteristics of the input audio signal based on the transform-domain signal obtained from the filter bank 322. The processing carried out by the spatial analyzer 324 may be referred to as spatial analysis, which may be based on signal energies and correlations between audio channels in a plurality of time-frequency tiles of the transform-domain audio signal. The outcome of the spatial analysis may be referred to as spatial audio parameters, which are provided for the focus processor 326 and for the spatial processor 328. The spatial audio parameters may include at least the following for one or more frequency sub-bands and for a number of frames (i.e. for a number of time-frequency tiles):
The spatial analysis may be carried out using any suitable spatial analysis technique known in the art, while details of the spatial analysis are outside the scope of the present disclosure. As a non-limiting example, the input audio signal has three audio channels originating from respective microphones of a three-microphone array schematically illustrated in
As described in the foregoing, the focus processor 326 is arranged to generate the first spatial audio component that represents a focus portion in a representation of the audio scene conveyed by the input audio signal. The processing carried out by the focus processor 326 may be referred to as focus processing, which may be performed based on the transform-domain audio signal (obtained from the filter bank 322) in dependence of the spatial audio parameters (obtained from the spatial analyzer 324) and further in dependence of the focus direction and an output format indication (both derived based on user input).
The output of the focus processor 326 is the (transform-domain) first audio component, where at least some sound components of a portion in a representation of the audio scene indicated by the focus portion parameter are emphasized with respect to the remaining sound components in the representation of the audio scene and positioned in their original spatial position in the representation of the audio scene. The focus processing may be carried out using any suitable audio focusing technique known in the art, while details of the focus processing are outside the scope of the present disclosure.
According to a non-limiting example, the focus processing comprises a beamforming and a post-filtering in one or more frequency sub-bands and in a number of frames (i.e. in a number of time-frequency tiles) as outlined in the following:
The signal that results from the procedure that involves the beamforming and the post-filtering may comprise a single-channel (monophonic) focus signal, which is further processed into the focused (spatial) audio signal in accordance with an audio format indicated by the output format parameter. Non-limiting examples in this regard are outlined in the following:
As described in the foregoing, the spatial processor 328 is arranged to generate the second spatial audio component that represents a non-focus portion of the representation of the audio scene conveyed by the input audio signal. The processing carried out by the spatial processor 328 may be referred to as spatial conversion, which may be performed based on the transform-domain audio signal (obtained from the filter bank 322) in dependence of the spatial audio parameters (obtained from the spatial analyzer 324) and further in dependence of the output format indication (derived based on user input). The output of the spatial processor 328 is the (transform-domain) second audio component processed in accordance with the indicated output format. The spatial conversion may be carried out using any suitable processing technique known in the art, while details of the spatial conversion are outside the scope of the present disclosure.
According to a non-limiting example, the spatial conversion may be provided in one or more frequency sub-bands and in a number of frames (i.e. in a number of time-frequency tiles) as outlined in the following:
Some approaches known in the art implement the procedure according to steps 1) to 4) above depending on the applied output format, e.g. ones described in Laitinen, Mikko-Ville and Pulkki, Ville, “Binaural reproduction for directional audio coding”, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009, WASPAA'09, pp. 337-340, IEEE, 2009 and in Vilkamo, Juha, Lokki, Tapio and Pulkki, Ville. “Directional audio coding: Virtual microphone-based synthesis and subjective evaluation”, Journal of the Audio Engineering Society 57, no. 9 (2009), pp. 709-724. Further approaches that potentially result in higher perceptual audio quality with the cost of increased computational load may apply e.g. least-squares optimized mixing to generate the second spatial audio component based on the input audio signals and the spatial audio parameters (also referred to as spatial metadata), e.g. as described in Vilkamo, Juha and Pulkki, Ville, “Minimization of decorrelator artifacts in directional audio coding by covariance domain rendering”, Journal of the Audio Engineering Society 61, no. 9 (2013), pp. 637-646. As a further example, aspects related to providing the output of the spatial processor 328 (and hence the output of the audio processor 302) in Ambisonics format are described e.g. in WO 2018/060550.
In a further example, in case the output format is binaural audio, the focus processor 326 and the spatial processor 328 may further receive (as part of the capture control data and/or the post-capture control data) an indication of (current) head orientation and apply this information together with the indicated focus direction for selection of the HRTFs for generation of the first and second spatial audio components. In this regard, the focus direction applied by the focus processor 326 and the spatial processor 328 is modified in view of the indicated head orientation: as an example in this regard, if the indicated focus direction is the front direction (e.g. 0 degrees) and the indicated head orientation is 30 degrees left (e.g. −30 degrees), HRTFs assigned for the spatial direction at −30 degrees are selected for the respective processing in the focus processor 326 and the spatial processor 328.
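By way of a non-limiting illustration, the head-orientation compensation might be sketched as follows (the sign convention is an assumption chosen to match the numeric example above):

```python
def effective_focus_direction(focus_azimuth_deg: float,
                              head_orientation_deg: float) -> float:
    """Focus direction compensated for head orientation, used for HRTF
    selection, wrapped to the range [-180, 180) degrees."""
    return ((focus_azimuth_deg + head_orientation_deg + 180.0) % 360.0) - 180.0

# Focus direction at the front (0 degrees), head turned 30 degrees left
# (-30 degrees): HRTFs for -30 degrees are selected.
assert effective_focus_direction(0.0, -30.0) == -30.0
```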
As described in the foregoing, the combiner 330 is arranged to combine the first and second spatial audio components to form the focused (spatial) audio signal in accordance with the indicated focus amount. In this regard, the combiner 330 may be arranged to carry out the combination in each frequency sub-band in each channel of the focused audio signal. In each frequency sub-band in each channel, the combination may be carried out as a linear combination of the respective signals that represent the time-frequency tiles of the first and second spatial audio components in accordance with the focus amount. As an example in this regard, assuming that the focus amount is indicated by a parameter a that has a value in the range from 0 to 1, the linear combination may be provided e.g. as a weighted sum of the respective signals from the first and second spatial audio components such that the signal from the first spatial audio component is multiplied by a and the signal from the second spatial audio component is multiplied by (1-a) before summing the signals.
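By way of a non-limiting illustration, the combination of one time-frequency tile might be sketched as follows (a direct transcription of the weighted sum described above):

```python
import numpy as np

def combine_tile(focus_tile: np.ndarray, nonfocus_tile: np.ndarray,
                 a: float) -> np.ndarray:
    """Combiner 330 for one time-frequency tile of one channel: weighted sum
    of the first (focus) and second (non-focus) spatial audio components,
    where the focus amount a lies in the range from 0 to 1."""
    return a * focus_tile + (1.0 - a) * nonfocus_tile
```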
As described in the foregoing, the inverse filter bank 332 is arranged to transform each frame of the focused audio signal (obtained from the combiner 330) from the transform domain back to the time domain for provision to the (optional) audio encoder 334.
As described in the foregoing, the audio processor 302 may optionally include the audio encoder 334 that is arranged to encode the focused and/or spatially processed audio signal output from the inverse filter bank 332 for local storage and/or for transfer to another device. In this regard, any audio encoding technique known in the art that is suitable for encoding multi-channel audio signals may be applied. A non-limiting example in this regard is an advanced audio coding (AAC) encoder. In case the audio encoder 334 is not employed as part of the audio processor 302, the focused audio signal may be provided e.g. as a PCM signal.
In a scenario where the audio processor 302 is applied as (part of) the audio processor 102 of the capture arrangement 200a, the spatial audio parameters derived by the spatial analyzer 324 may be provided for the audio formatter 206 for storage in the storage 208 as spatial metadata associated with the intermediate audio data. When accessing the data in the storage 208, the audio preprocessor 212 may obtain the spatial metadata from the storage 208 together with the intermediate audio data and provide the spatial metadata along with the intermediate audio data for the audio processor 302 applied as (part of) the audio processor 202 in the post-capture arrangement 200b. Consequently, the audio processor 302 in the post-capture arrangement 200b may omit the processing described in the foregoing for the spatial analyzer 324 and directly apply the spatial audio parameters provided as the spatial metadata received along with the intermediate audio data.
In a variation of the audio processing arrangement 200, the audio formatter 206 is communicatively coupled (e.g. via a communication network) to a server that is arranged to provide audio enhancement processing for the two or more microphone signals obtained at the capture arrangement 200a to derive respective two or more enhanced microphone signals, which may serve (instead of the two or more microphone signals as originally received) as basis for deriving the intermediate audio data in the audio formatter 206 for writing into the storage 208. The purpose of such audio enhancement processing by the server is to provide the two or more enhanced microphone signals at higher (perceptual) audio quality, thereby enabling creation of a higher-quality post-captured audio signal via operation of the post-capture arrangement 200b. The server may be provided as a single server device (e.g. a computer) or as a combination of two or more server devices (e.g. computers) that may be arranged to provide, for example, a cloud computing service.
As an example of audio enhancement processing available at the server, the server may be arranged to provide a trained deep learning network, for example a generative adversarial network (GAN), for improving the signal-to-noise ratio (SNR) of the two or more microphone signals and/or for otherwise improving the (perceptual) audio quality of the two or more microphone signals.
As another example of audio enhancement processing available in the server, alternatively or additionally, the server may be arranged to carry out some or all of the predefined audio processing functions assigned to the audio formatter 206 on behalf of the audio formatter 206. As an example, the audio formatter 206 may provide the two or more microphone signals to the server, which carries out e.g. audio encoding (and/or one or more other predefined audio processing function(s)) based on the original two or more microphone signals (or based on the two or more enhanced microphone signals) and provides the audio data resulting from this procedure to the audio formatter 206, which writes this information as the intermediate audio data to the storage 208.
In another (or further) variation of the audio processing arrangement 200, an entity of the post-capture arrangement 200b, e.g. the control data combiner 210 and/or the audio preprocessor 212, may be communicatively coupled (e.g. via a communication network) to the server, which is (further) arranged to analyze the intermediate audio data obtained via the storage 208, or the one or more reconstructed signals derived therefrom by the audio preprocessor 212, and to extract, accordingly, secondary post-capture control data that may be applied to replace or complement the post-capture control data received at the post-capture arrangement 200b. In this regard, a machine learning network in the server may have been trained to identify situations where specific directions of interest exist in the representation of the audio scene conveyed by the intermediate audio data or by the one or more reconstructed signals. As an example, the audio scene may involve a talker on a stage, whereas the machine learning network may derive secondary post-capture control data that enables controlling the audio focus such that it follows the position of the person on the stage in the representation of the audio scene. The server may derive and track the position of the talker in the representation of the audio scene via analysis of the intermediate audio data or the one or more reconstructed signals. In a scenario where the captured audio signal is provided together with an associated video signal, derivation and tracking of the talker position may be, additionally or alternatively, based on the associated video signal.
In another variation of the audio processing arrangement 200, one or more of the functionalities described above with reference to the server may be carried out by the audio formatter 206 or by the audio preprocessor 212 instead, assuming that the device applied for implementing the respective entity has sufficient processing capacity available thereto.
As described in the foregoing, at least some definitions of the capture control data may originate from user input received upon initiation of or during the audio capturing session and/or at least some definitions of the post-capture control data may originate from user input received upon initiation of or during the post-capturing session. As a non-limiting example in this regard, such user input may be received via a user interface of the mobile device 150, 150a applied to implement the capture arrangement 200a and/or via a user interface of the (mobile) device 150, 150b applied to implement the post-capture arrangement 200b.
The functionality described in the foregoing with reference to components of the capture arrangement 200a and the post-capture arrangement 200b may be provided, for example, in accordance with a method 400, illustrated by a flowchart and described in the following.
The method 400 comprises deriving a captured audio signal based on the two or more microphone signals received from respective two or more microphones in accordance with the capture control data that identifies at least one audio characteristic for derivation of the captured audio signal, as indicated in block 402. The method 400 further comprises storing at least part of the capture control data and intermediate audio data derived based on the two or more received microphone signals, as indicated in block 404. The method 400 further comprises deriving modified capture control data as a combination of the stored capture control data and user-definable post-capture control data that identifies at least one audio characteristic for derivation of a post-captured audio signal, as indicated in block 406. The method 400 further comprises deriving the post-captured audio signal based on said intermediate audio data in accordance with the modified capture control data, as indicated in block 408. The method 400 optionally further comprises replacing the captured audio signal by the post-captured audio signal, as indicated in block 410.
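The flow of blocks 402 to 408 may be sketched as follows, under the assumption that the control data can be represented as key-value pairs, so that the combination of block 406 becomes a dictionary merge in which post-capture definitions override stored ones; the placeholder processing functions stand in for the actual audio processing chain and are not taken from the description:

```python
# Sketch of method 400 (blocks 402-408) with placeholder processing steps.
from typing import Dict

import numpy as np

def process_audio(signals: np.ndarray, ctrl: Dict) -> np.ndarray:
    """Placeholder for the audio processing chain (gain only, for brevity)."""
    return signals * ctrl.get("gain", 1.0)

def derive_intermediate(signals: np.ndarray) -> np.ndarray:
    """Placeholder: store the microphone signals themselves as intermediate data."""
    return signals.copy()

def run_method_400(mic_signals: np.ndarray,
                   capture_ctrl: Dict,
                   post_capture_ctrl: Dict) -> np.ndarray:
    captured = process_audio(mic_signals, capture_ctrl)        # block 402
    stored_ctrl = dict(capture_ctrl)                           # block 404
    intermediate = derive_intermediate(mic_signals)            # block 404
    # Block 406: post-capture definitions override the stored ones where they overlap.
    modified_ctrl = {**stored_ctrl, **post_capture_ctrl}
    post_captured = process_audio(intermediate, modified_ctrl) # block 408
    return post_captured                                       # block 410: may replace `captured`
```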
The method 400 may be varied in a plurality of ways, for example in accordance with examples pertaining to respective functionality of components of the audio processing arrangement 200 provided in the foregoing and in the following.
The apparatus 500 comprises a processor 516 and a memory 515 for storing data and computer program code 517. The memory 515 and a portion of the computer program code 517 stored therein may be further arranged, with the processor 516, to implement at least some of the operations, procedures and/or functions described in the foregoing in context of the capture arrangement 200a and/or the post-capture arrangement 200b.
The apparatus 500 comprises a communication portion 512 for communication with other devices. The communication portion 512 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses. A communication apparatus of the communication portion 512 may also be referred to as a respective communication means.
The apparatus 500 may further comprise user I/O (input/output) components 518 that may be arranged, possibly together with the processor 516 and a portion of the computer program code 517, to provide a user interface for receiving input from a user of the apparatus 500 and/or providing output to the user of the apparatus 500 to control at least some aspects of operation of the capture arrangement 200a and/or the post-capture arrangement 200b implemented by the apparatus 500. The user I/O components 518 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc. The user I/O components 518 may also be referred to as peripherals. The processor 516 may be arranged to control operation of the apparatus 500 e.g. in accordance with a portion of the computer program code 517 and possibly further in accordance with the user input received via the user I/O components 518 and/or in accordance with information received via the communication portion 512.
Although the processor 516 is depicted as a single component, it may be implemented as one or more separate processing components. Similarly, although the memory 515 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
The computer program code 517 stored in the memory 515 may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 500 when loaded into the processor 516. As an example, the computer-executable instructions may be provided as one or more sequences of one or more instructions. The processor 516 is able to load and execute the computer program code 517 by reading the one or more sequences of one or more instructions included therein from the memory 515. The one or more sequences of one or more instructions may be configured to, when executed by the processor 516, cause the apparatus 500 to carry out at least some of the operations, procedures and/or functions described in the foregoing in context of the capture arrangement 200a and/or the post-capture arrangement 200b.
Hence, the apparatus 500 may comprise at least one processor 516 and at least one memory 515 including the computer program code 517 for one or more programs, the at least one memory 515 and the computer program code 517 configured to, with the at least one processor 516, cause the apparatus 500 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the capture arrangement 200a and/or the post-capture arrangement 200b.
The computer programs stored in the memory 515 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 517 stored thereon, wherein the computer program code, when executed by the apparatus 500, causes the apparatus 500 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the capture arrangement 200a and/or the post-capture arrangement 200b. The computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program. As another example, the computer program may be provided as a signal configured to reliably transfer the computer program.
Reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processors, etc. Features described in the preceding description may be used in combinations other than the combinations explicitly described.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not. Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.