The present invention relates to a method and apparatus for processing audio data.
This application claims priority to copending United Kingdom utility application entitled, “PROCESSING AUDIO DATA,” having serial no. GB 0411297.5, filed Apr. 21, 2004, which is entirely incorporated herein by reference.
Audio data representing recordings of sound associated with physical environments are increasingly being stored in digital form, for example in computer memories. This is partly due to the increase in use of desktop computers, digital sound recording equipment and digital camera equipment. One of the main advantages of providing audio and/or image data in digital form is that it can be edited on a computer and output to an appropriate data output device so as to be played. Increasingly common is the use of personal sound capture devices that comprise an array of microphones to record a sound scene which a given person is interested in recording. The well-known camcorder type device is configured to record visual images associated with a given environmental scene and these devices may be used in conjunction with an integral personal sound capture device so as to create a visual and audiological recording of a given environmental scene. Frequently such camcorder type devices are used so that the resultant image and sound recordings are played back at a later date to colleagues of, or friends and family of, an operator of the device. Camcorder type devices may frequently be operated to record one or more of: sound only, static images or video (moving) images. With advances in technology, sound capture systems that capture spatial sound are also becoming increasingly common. By spatial sound system it is meant, in broad terms, a sound capture system that conveys some information concerning the location of perceived sound in addition to the mere presence of the sound itself. The environment in respect of which such a system records sound may be termed a “soundscape” (or a “sound scene” or “sound field”) and a given soundscape may comprise one or a plurality of sounds. The complexity of the sound scene may vary considerably depending upon the particular environment in which the sound capture device is located.
A further source of sound and/or image data is sound and image data produced in the virtual world by a suitably configured computer program. Sound and/or image sequences that have been computer generated may comprise spatial sound.
Owing to the fact that such audio and/or image data is increasingly being obtained by a variety of people there is a need to provide improved methods and systems for manipulating the data obtained. An example of a system that provides motion picture generation from a static digital image is that disclosed in European patent publication no. EP 1235182, incorporated herein by reference, and in the name of Hewlett-Packard Company. Such a system concerns improved digital images so as to inherently hold the viewer's attention for a longer period of time and the method as described therein provides for desktop type software implementations of “rostrum camera” techniques. A conventional rostrum camera is a film or television camera mounted vertically on a fixed or adjustable column, typically used for shooting graphics or animation—these techniques for producing moving images are the type that can typically be obtained from such a camera. The system described in EP 1235182 provides zooming and panning across static digital images.
According to an exemplary embodiment, there is provided a method of processing audio data comprising: characterising an audio data representative of a recorded sound scene into a set of sound sources occupying positions within a time and space reference frame; analysing the sound sources; and generating a modified audio data representing sound captured from at least one virtual microphone configured for moving about the recorded sound scene, wherein the virtual microphone is controlled in accordance with a result of the analysis of said audio data, to conduct a virtual tour of the recorded sound scene.
For a better understanding of the invention and to show how the same may be carried into effect, there will now be described by way of example only, specific embodiments, methods and processes according to the present invention with reference to the accompanying drawings in which:
a-3d schematically illustrate an example of a processed complex spatio-temporal audio scene that may result from operation of the application program of
There will now be described by way of example a specific mode contemplated by the inventors. In the following description numerous specific details are set forth in order to provide a thorough understanding. It will be apparent however, to one skilled in the art, that the present invention may be practiced without limitation to these specific details. In other instances, well known methods and structures have not been described in detail so as not to unnecessarily obscure the description.
Overview
A soundscape comprises a multi-dimensional environment in which different sounds occur at various times and positions. Specific embodiments and methods herein provide a system for navigating such a soundscape. An example of a soundscape may be a crowded room, a restaurant, a summer meadow, a woodland scene, a busy street or any indoor or outdoor environment where sound occurs at different positions and times. Soundscapes can be recorded as audio data, using directional microphone arrays or other like means.
Specific embodiments and methods herein may provide a post processing facility for a soundscape which is capable of navigating a stored soundscape data so as to provide a virtual tour of the soundscape. This is analogous to a person with a microphone navigating the environment at the time at which the soundscape was captured, but can be carried out retrospectively and virtually using the embodiments and methods disclosed herein.
Within the soundscape, a virtual microphone is able to navigate, automatically identifying and investigating individual sound sources, for example, conversations of persons, monologues, sounds produced by machinery or equipment, animals, activities, natural or artificially generated noises, and following sounds which are of interest to a human user. The virtual microphone may have properties and functionality analogous to those of a human sound recording engineer of the type known for television or radio programme production, including the ability to identify, seek out and follow interesting sounds, home in on those sounds, zoom in or out from those sounds, pan the environment in general landscape “views” across the soundscape. The virtual microphone provides a virtual mobile audio rostrum, capable of moving around within the virtual sound environment (the soundscape), in a similar manner to how a human sound recording engineer may move around within a real environment, holding a sound recording apparatus.
A 3D spatial location of sound sources is determined, and preferably also, acoustic properties of the environment. This defines a sound scene allowing a virtual microphone to be placed anywhere within it, adjusting the sounds according to the acoustic environment, and allows a user to explore a soundscape.
This spatial audio allows camera-like-operations to be defined for the virtual microphone as follows:
An audio zoom function is analogous to a camera zoom which determines a field of “view” that selects part of the scene. The audio zoom may determine which sound sources are to be used by their spatial relation to a microphone, for example within a cone about a 3D point of origin at the microphone;
An audio focus is analogous to a camera focus. This is akin to placing the microphone closer to particular sound sources so they appear louder; and
A panning (rotating) function and a translating function are respectively analogous to their camera counterparts for panning (rotating) or translating the camera. This is analogous to selecting different sound sources in particular spatial relation.
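By way of illustration only, the audio zoom described above may be sketched as a cone test about the virtual microphone. The function names `in_cone` and `audio_zoom`, the representation of sources as a name-to-position mapping, and the use of a half-angle in degrees are assumptions made for this sketch and are not taken from the described embodiment.

```python
import math

def in_cone(mic_pos, mic_dir, half_angle_deg, source_pos):
    """Return True if source_pos lies inside the cone whose apex is mic_pos
    and whose axis points along mic_dir (hypothetical helper)."""
    offset = [s - m for s, m in zip(source_pos, mic_pos)]
    dist = math.sqrt(sum(c * c for c in offset))
    axis_len = math.sqrt(sum(c * c for c in mic_dir))
    if axis_len == 0.0:
        raise ValueError("mic_dir must be a non-zero vector")
    if dist == 0.0:
        return True  # a source located at the microphone is always received
    # Cosine of the angle between the cone axis and the direction to the source.
    cos_angle = sum(o * a for o, a in zip(offset, mic_dir)) / (dist * axis_len)
    return cos_angle >= math.cos(math.radians(half_angle_deg))

def audio_zoom(sources, mic_pos, mic_dir, half_angle_deg):
    """Select the names of the sources falling within the field of 'view'."""
    return [name for name, pos in sources.items()
            if in_cone(mic_pos, mic_dir, half_angle_deg, pos)]

# Illustrative scene: positions are (x, y, z) co-ordinates in metres.
scene = {"child": (2.0, 0.0, 0.0), "parent": (2.0, 1.0, 0.0), "traffic": (-3.0, 1.0, 0.0)}
wide = audio_zoom(scene, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 30.0)
narrow = audio_zoom(scene, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 10.0)
```

Narrowing the half-angle corresponds to zooming in, while changing `mic_dir` or `mic_pos` corresponds respectively to panning or translating the virtual microphone.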
The existence of these camera-like operations in a soundscape allows the soundscape to be sampled in a similar manner to a rostrum camera moving about a still image. However there are important differences. For example:
Audio has a temporal nature that is somewhat ignored by the analogous operations that exploit the spatial properties of their sources; and
Rostrum camera work finds its most compelling use when used in combination with a display which is incapable of using the available resolution in the captured image signal. Part of the value of the rostrum camera is in revealing the extra detail through the inadequate display device. There is no similar analogue between the detail captured and displayed in the audio domain. However there is some benefit derived from zooming: it selects and hence emphasizes particular sound sources, as with zooming in on part of an image.
In attempting to apply the known light imaging rostrum camera concept, the temporal nature of sound forces the concept to be generalized into a “spatial-temporal rostrum camera” concept, better seen as some form of video editing operation for a wearable video stream where the editing selects both spatially and in time. The composed result may jump about in time and space, perhaps showing things happening with no respect for temporal order, that is, showing the future before the past events that caused it. This is common behavior in film directing or editing. Hence the automatic spatial-temporal rostrum camera attempts to perform automatic video editing.
An important feature of the present embodiments and methods is the extra option of selecting in time as well as the ability to move spatial signals into the temporal (e.g. a still into video).
Audio analysis may be applied to the soundscape, to automatically produce a tour of the spatial soundscape which emphasizes and de-emphasizes, omits and selects particular sound sources. To do this automatically requires some notion of interesting audio events and “saliency”. In accordance with the present preferred embodiment it is useful to detect when a particular sound source would be interesting; this depends upon the position of the virtual listener. For example, if you are close to a sound source you will not notice the contribution of other sound sources, and the saliency will be dominated by how the loudness, texture, etc. of this sound compare to those of the other sounds within the field of view. There may be provided a signal (a “saliency” signal) indicative of when a particular sound may be of interest to a listener located at a particular position in a given sound scene. As previously stated the sound scene may be associated with an image or image sequence that may itself have been recorded with a particular sound recording being played. The saliency of a sound source may be based upon cues from an associated image or images. The images may be still images or moving images. Furthermore the interest-measure provided in respect of sounds is not necessarily solely based on the intensity (loudness) of these sounds. The saliency signal may be based partly on an intensity-measure or may be based on parameters that do not include sound intensity.
A preferred embodiment uses zoom and focus features to select the virtual microphone or listening position and then predicts saliency based upon the auditory saliency at this position relative to particular sound sources.
In a preferred embodiment, auditory saliency is used to recognize particular human speakers, children's voices, laughter and to detect emotion or prosody. By prosody it is meant the manner in which one or more words is/are spoken. Known word recognition techniques are advanced enough such that a large number of words can be accurately recognized. Furthermore the techniques are sufficiently advanced, as those skilled in the art are aware, to recognize voice intensity pattern, lowered or raised voice, or a pattern of variation such as is associated with asking a question, hesitation, the manner in which words are spoken (i.e. the different stresses associated with different words) and to detect particular natural sounds etc. For example, U.S. Pat. No. 5,918,223 (Muscle Fish) discloses a system for the more detailed classification of audio signals by comparison with given sound signals. The system is claimed to be used for multimedia database applications and Internet search engines. Other Muscle Fish patents are known that concern techniques for recognizing particular natural or mechanical sounds. Certain sounds are known to be highly distinctive as is known to those skilled in the art that are familiar with the work of “The World Soundscape Project”. Moving sound sources attract attention as well as adding a temporal dimension, but after a while people get used to similar sounds and they are deemed less interesting.
The audio data of the soundscape is characterized into sound sources occupying positions within a time-spatial reference frame. There are natural ways of grouping or cropping sound sources based upon their spatial position. There are ways of detecting the natural scope of particular sounds. They provide some way of temporally segmenting the audio. But equally there are temporal ways of relating and hence selecting sound sources in the scene that need not be based upon the spatial grouping or temporal segmentation. The way in which sound sources work in harmony together can be compared using a wide variety of techniques as is known to those skilled in the art. The way in which one sound works in beat or rhythm with others over a period of time suggests that they might well be grouped together i.e. they go together because they would sound nice together. Also declaring sound sources to be independent of other sound sources is a useful facility, as is detecting when a sound source can be used to provide discrete background to other sounds.
An important commercial application may be achieved where a virtual tour of a soundscape is synchronized with a visual channel (such as with an audio photograph or with a panoramic audio photograph). The embodiments may be used with the virtual microphone located in a given soundscape, or the audio may be used to drive the visual. Combinations of these two approaches can also be used.
An example would be zooming in on a child when a high resolution video or still image is providing a larger field of view of the whole family group. The sound sources for the whole group are changed to one emphasizing the child, as the visual image is zoomed in.
A preferred embodiment may synchronize respective tours provided by a virtual audio rostrum and a visual virtual rostrum camera. This would allow the virtual camera to be driven by either or both of the auditory analysis and/or the visual analysis. By “virtual audio rostrum” it is meant, a position which may be a moving position within a recorded soundscape, at which a virtual microphone is present. By the term “visual virtual rostrum camera” it is meant a position within a three dimensional environment, which is also subject of a recorded sound scene, in which a still and/or video camera is positioned, where the position of the camera may be moveable within the environment.
Examples of the styles of producing an audio tour and the forms of analysis appropriate
There now follow several examples of how a soundscape comprising audio data may be analysed, the audio data characterized into sound sources, and a virtual microphone may be controlled to navigate the soundscape, controlled by results of the analysis of the sound sources to conduct a virtual tour of the soundscape.
Simultaneous Conversations
In one example of analysing sound sources and controlling a virtual microphone according to those sound sources, there may be supplied spatial sound sources for a restaurant/café/pub. A virtual microphone might focus in on a conversation on one table and leave out the conversation taking place at another table. This allows or directs a human listener to focus on one group. After playing this group of sound sources the virtual microphone or another virtual microphone might then focus in on the conversation on the other table that was taking place at the same time. To do this it is necessary to be sure that the groups of sounds are independent of each other (overlapping speakers that are spatially distributed would be a good indicator). However “showing” background sound sources common to both groups would add to the atmosphere. The background would probably show as lots of diffuse sounds.
Capturing an Atmosphere
In another example, audio data may be analysed, and a virtual microphone used to capture the atmosphere of a place that is crowded with sound sources. Here the one or more virtual microphones would not be configured to try to listen in on conversations, rather they would deliberately break up a speaker talking, deliberately preventing a listener from being distracted by what is said. Whilst listening to one sound source the other sounds might be removed using the zoom or perhaps de-emphasized and played less loudly. The emphasis could switch to other sound sources in the room, blending smoothly from one sound source to another or perhaps making sharper transitions (such as a cut). The sound sources might be sampled randomly in a temporal fashion or moved about as a virtual audio microphone.
This form of presentation of selecting different sound sources mirrors the way that a human listener's attention to sound works. A person can lock on to one sound source and lock out the effect of other sound sources. The attention of a person can flick around the scene. This provides another (non-geometric) inspiration for the selective focus upon different sound sources in the scene.
The Orchestra
This example envisages an orchestra playing, but it is possible for an expert listener to pick out the contributions of individual instruments. To re-create this for the unskilled listener the spatial distribution of the instruments of a certain type would be used to zoom in on them thereby emphasizing the instruments of interest. This can be seen as moving the virtual microphone amongst this particular block of instruments.
Another alternative would be to detect when the sound sources of the same type of instrument (or perhaps related instruments) occurred.
Bird Songs
Songs of birds of a particular species may be selected disregarding the sounds from other animals.
Parents and Children
Family groups consisting of parents and several children go through phases of interaction with each other and periods where the sound sources are independent. If the parents are watching the children it becomes important to disregard the sound of people nearby and people not from the group. It may be desirable to zoom and focus on the sounds of the children.
A source of spatial sound is required for capture of the soundscape. This may be obtained from a spatial sound capture system on, for example, a wearable camera. Depending upon the application requirements a source of video or a high resolution still image of the same scene may also be required. The system proceeds using image/video processing and audio analysis determining saliency.
An automatic method of synthesizing new content from within the spatial audio of a recorded sound scene may be possible using the embodiments and methods herein. Within the spatial audio there is an ability to suppress and emphasize particular sound sources. The method selects both spatially and temporally to produce new content. The method can expand simultaneous audio threads in time.
There are two ways in which spatial sound can be used—one is driven by geometrical considerations of the sound scene and explains the tour through geometric movements of the listener, the other is driven by attention and/or aesthetic considerations where the inspiration is of human perception of sounds.
Other aspects of the features include synchronizing visual and audio rostrum camera functionality.
In the case of spatial audio captured from crowded scenes a random like style may be identified for giving the atmosphere of a place. This avoids the need for long audio tracks.
Further there may be provided means of lifting auditory saliency measures into the realms of spatial sound.
There now follows description of a first specific embodiment. Where appropriate, like reference numbers denote similar or the same items in each of the drawings.
Hardware and Overview of Processing
Referring to
Referring to
Application program 201 is configured to receive and process a set of audio data received via data input port 105 and representative of a recorded sound scene such that the audio data is characterized into a set of sound sources located in a reference frame comprising a plurality of spatial dimensions and at least one temporal dimension. The application program 201 is configured to perform an analysis of the audio data to identify characteristic sounds associated with the sound sources and also to generate a set of modified audio data such that the modified audio data represents sound captured from at least one virtual microphone configurable to move about the recorded sound scene. The modified audio data generated by the application program 201 provides a playable “audio programme” representing a virtual microphone moving about the recorded sound scene. This audio programme can thereafter be played on an audio player, such as provided by processor 102, to generate resultant sound through speaker system 106, 107.
The acquired audio data is stored in memory 103. The application program 201 is launched, and the location of the file holding the audio data is accessed by the program. The application program 201, operating under the control of processor 102, performs an analysis of the audio data such that particular characteristics of the audio content (that is particular pre-defined characteristic sounds) are identified. The application program then proceeds to generate the above mentioned modified audio data based on the identified audio content characteristics. To facilitate this, the application program 201 includes an algorithm comprising a set of rules for determining how the audio programme should play the resultant modified audio data based on the different audio characteristics that have been identified.
An overview of the main processes undertaken by a preferred embodiment of a virtual microphone application program 201, is schematically illustrated in
At process 206 application program 201 is configured to undertake a search to select/determine a set of sound sources (based on an optimized saliency calculation resulting in either an optimal selection or one of a set of acceptable results). The selected result is then used to determine one or more virtual microphone trajectories.
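The selection performed in process 206 may be pictured, in a much reduced form, as scoring each candidate sound source and choosing the one whose score is greatest. The sketch below is illustrative only: the dictionary-based style table, the feature names, the numeric scores and the function names `saliency` and `select_most_salient` are all assumptions introduced here, and the search of the described embodiment may consider constraints that this sketch omits.

```python
def saliency(features, style):
    """Sum the style scores of the features recognised in one sound source.
    Features absent from the style table contribute nothing (assumed rule)."""
    return sum(style.get(feature, 0.0) for feature in features)

def select_most_salient(source_features, style):
    """Return (name, score) for the source whose saliency is greatest."""
    scored = {name: saliency(feats, style) for name, feats in source_features.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Hypothetical style: a level of interest, as a score, per recognised feature.
family_style = {"speech": 1.0, "laughter": 2.0, "child_voice": 3.0, "traffic": -1.0}

# Hypothetical recogniser output: the feature set found for each sound source.
recognised = {
    "table_a": {"speech"},
    "child": {"speech", "child_voice", "laughter"},
    "street": {"traffic"},
}

best_source, best_score = select_most_salient(recognised, family_style)
```

Supplying a different style table changes which source is chosen without changing the search itself, which is one way of giving top-down choices to the saliency measure.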
Following process 206, at process 207 application program 201 is configured to render or mix the sound sources so as to provide a resultant edited version of the recorded sound scene which may then be played back to a listener as mentioned above and as indicated at process 208. Rendering is the process of using the virtual microphone trajectory and selections of process 206 to produce an output sound signal. In the best mode contemplated application program 201 is configured to automatically determine the movement of and change of field of reception of the one or more virtual microphones. However the application program may be configured to permit semi-automatic processing according to choices made of certain parameters in each of the processes of
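A minimal sketch of such a rendering step is given below, assuming that each sound source is held as a list of normalised samples together with a function from time to position, that the selection is a mapping from source name to a blending weight, and that intensity falls off with the inverse square of distance with reflections ignored. The function name `render`, the data layout and the `min_distance` clamp (which avoids division by zero when the microphone coincides with a source) are assumptions of the sketch.

```python
import math

def render(sources, mic_trajectory, selection, sample_rate, min_distance=1.0):
    """Mix the selected sources as heard by a moving virtual microphone.

    sources:        name -> {"signal": [samples], "trajectory": f(t) -> (x, y, z)}
    mic_trajectory: f(t) -> (x, y, z) position of the virtual microphone
    selection:      name -> blending weight (0.0 rejects a source)
    """
    n_samples = max((len(src["signal"]) for src in sources.values()), default=0)
    output = [0.0] * n_samples
    for name, src in sources.items():
        weight = selection.get(name, 0.0)
        if weight == 0.0:
            continue  # source rejected by the selection
        for n, sample in enumerate(src["signal"]):
            t = n / sample_rate
            distance = math.dist(mic_trajectory(t), src["trajectory"](t))
            # Inverse square law; distances below min_distance are clamped.
            gain = 1.0 / max(distance, min_distance) ** 2
            output[n] += weight * gain * sample
    return output

# Two stationary sources and a stationary microphone at the origin.
demo_sources = {
    "a": {"signal": [1.0, 1.0], "trajectory": lambda t: (2.0, 0.0, 0.0)},
    "b": {"signal": [1.0, 1.0], "trajectory": lambda t: (0.0, 4.0, 0.0)},
}
origin = lambda t: (0.0, 0.0, 0.0)
only_a = render(demo_sources, origin, {"a": 1.0, "b": 0.0}, sample_rate=2)
both = render(demo_sources, origin, {"a": 1.0, "b": 1.0}, sample_rate=2)
```

Because the microphone and source positions are evaluated per sample, moving either trajectory changes the mix over time without altering the stored source signals.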
In this specification, the following terms have the following meanings.
“Spatial Sound”: Spatial sound is modelled as a set of identified sound sources mapped to their normalised sound signals and their trajectories. Each sound source is represented as a sound signal. Spatial sound as thus defined conveys some information concerning the location of a perceived sound in three-dimensional space. Although the best mode utilises such “spatially localised sound” it is to be understood by those skilled in the art that other forms of sound that convey some degree of spatial information may be utilised. One good example is “directional sound”, that is sound which conveys some information concerning the direction from which a perceived sound is derived.
“Trajectory”: The trajectory of an entity is a mapping from time to position where position could be a three dimensional space co-ordinate. In the best mode contemplated ‘position’ also includes orientation information and thus in this case trajectory is a mapping from time to position and orientation of a given sound source. The reason for defining trajectory in this way is that some sound sources, such as for example a loudhailer, do not radiate sound uniformly in all directions. Therefore in order to synthesise the intensity of the sound detected by a microphone at a particular position it is necessary to determine the orientation of the sound source (and the microphone). A further consideration that may be taken into account is that a sound source may be diffuse and therefore an improved solution would regard the sound source as occupying a region rather than being a point source.
“Sound Signal”: The sound signal is a mapping from time to intensity. In other words the intensity of a sound signal may vary with time.
“Sound Feature”: A feature is a recognised type of sound such as human speech, non-speech (e.g. whistle, scream) etc.
“Recogniser”: A recogniser classifies a sound signal and so maps sound signals to sets of features. Within an interval of recorded sound it is required to determine where in the interval the feature occurs. In the best mode a recogniser function returns a mapping from time to a feature set.
“Saliency”: Saliency is defined as a measure of the inherent interest of a given sound that is realised by a notional human listener. In the best mode application program 201 uses real numbers for the saliency metric. Those skilled in the art will realise that there are a wide variety of possibilities for implementing saliency measure. In the preferred embodiment described below saliency calculations only involve arithmetic to decide which of a number of calculated saliency measures is the greatest in magnitude.
“Style”: The style parameter is a mechanism for giving top down choices to the saliency measures (and associated constraints) that are used in the search procedure 206. The overall duration of the edited audio may be determined bottom up from the contents of the spatial sound, or it may be given in a top-down fashion through the style parameter. In the best mode both styles are accommodated through the mechanism of defining a tolerance within which the actual duration should be of target duration. The style parameter sets the level of interest, in the form of a score, assigned to particular features and groups of features.
“Virtual Microphone”: A virtual microphone trajectory specifies the position (3D co-ordinates and 3D orientation) and its reception. The implementation of application program 201 is simplified if the position includes orientation information because then reception needs to change only because a non-monopole radiator has rotated. The virtual microphone can move and rotate and change its field of view. The sound received at a microphone is a function of the position of the sound source and the microphone. In the simplistic model employed in process 207 of the preferred embodiment described herein sound reflections are ignored and the model simply takes into account the inverse square law of sound intensity.
“Reception”: The reception (otherwise termed “listening” herein) of the virtual microphone may be defined in various ways. In the preferred embodiment it is defined as the distance between the position of the virtual microphone and the position of the sound source. This distance is then used to reduce or increase (i.e. blend) the intensity of the sound source at the position of the virtual microphone. This definition provides a simple and intuitive way of defining contours of reception for a region. More complex embodiments may additionally use one or more other parameters to define reception.
As described later the reception is a function implementing the modification of the normalised sound signals associated with each sound source. It uses the position of the virtual microphone and sound source to determine a multiplier that is applied to the sound source signal for a particular time. The reception defines how sensitive a microphone is to sounds in different directions, i.e. a directional microphone will have a different reception as compared with an omnidirectional microphone. The directional microphone will have a reception of zero for certain positions whereas the omnidirectional microphone will be non-zero all around the microphone, but might weight some directions more than others.
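The directional part of such a multiplier may be sketched as follows. The cardioid pattern, 0.5 × (1 + cos θ), is used here only as one familiar example of a directional reception that is zero directly behind the microphone; it is an assumption of this sketch and is not stated to be the pattern used by the preferred embodiment. The function name `directional_weight` and its parameters are likewise hypothetical.

```python
import math

def directional_weight(mic_pos, mic_facing, source_pos, pattern="omni"):
    """Return a multiplier in [0, 1] for a source, by direction only.

    pattern "omni":     equal sensitivity in all directions
    pattern "cardioid": 0.5 * (1 + cos(theta)), zero directly behind
    """
    if pattern == "omni":
        return 1.0
    if pattern != "cardioid":
        raise ValueError("unknown pattern: " + pattern)
    offset = [s - m for s, m in zip(source_pos, mic_pos)]
    dist = math.hypot(*offset)
    facing_len = math.hypot(*mic_facing)
    if dist == 0.0 or facing_len == 0.0:
        return 1.0  # direction undefined; treat as fully received
    # theta is the angle between the microphone facing and the source direction.
    cos_theta = sum(o * f for o, f in zip(offset, mic_facing)) / (dist * facing_len)
    return 0.5 * (1.0 + cos_theta)

mic, facing = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
front = directional_weight(mic, facing, (1.0, 0.0, 0.0), "cardioid")
side = directional_weight(mic, facing, (0.0, 1.0, 0.0), "cardioid")
rear = directional_weight(mic, facing, (-1.0, 0.0, 0.0), "cardioid")
```

In a fuller model this directional weight would be multiplied by the distance-dependent term to give the overall multiplier applied to the normalised source signal.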
“Audio Rostrum Function 206”: The audio rostrum function or processing routine 206 can be seen as a function taking a style parameter and spatial sound and returning a selection of the spatial sound sources and a virtual microphone trajectory. One or more virtual microphones may be defined in respect of a given sound scene that is the subject of processing by application program 201.
“Selection Function”: The selection function of the audio rostrum process 206 is simply a means of selecting or weighting particular sound sources from the input spatial sound. Conceptually the selection function derives a new version of the spatial sound from the original source and the virtual microphone trajectory is rendered within the new version of the spatial sound. It may be implemented as a Boolean function to return a REAL value, returning a “0” to reject a sound source and returning a “1” to accept it. However in the best mode it is implemented to provide a degree of blending of an element of the sound source.
“Rendering Function”: Rendering is the process of using the virtual microphone trajectory and selection to produce an output signal.
“Normalisation of sound signals”: On recording of each sound signal, the signals may be recorded with different signal strengths (corresponding to different signal amplitudes). In order to be able to process the different sounds without having the sound strength varying in a manner which is unpredictable to a processor, each sound signal is normalised. That is to say, the maximum amplitude of the signal is set to a pre-set level, which is the same for all sound signals. This enables each signal to be referenced to a common maximum signal amplitude level, which means that subsequent processing stages can receive different sound signals which have amplitudes which are within a defined range of levels.
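As a simple illustration of the normalisation just described, the sketch below scales a signal so that its maximum absolute amplitude equals a pre-set level. The function name `normalise`, the default level of 1.0 and the handling of an all-zero signal are assumptions made for the example.

```python
def normalise(signal, target_peak=1.0):
    """Scale a signal so that its maximum absolute amplitude equals target_peak."""
    peak = max((abs(x) for x in signal), default=0.0)
    if peak == 0.0:
        return list(signal)  # silent or empty signal: nothing to scale
    scale = target_peak / peak
    return [x * scale for x in signal]

quiet = normalise([0.125, -0.25, 0.0625])
loud = normalise([0.5, -2.0, 1.0])
```

After this step both signals share the same maximum amplitude, so later stages can treat them as lying within one defined range of levels.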
Examples of Sound Scenes and Virtual Microphone Synthesis
In order to demonstrate the effects produced by virtual microphone application program 201,
In
In
Following the events of
Following
In this example, the sound scene recorded as audio data by the couple is subsequently required to be played back in a modified form to friends and family. The played back version of the audio sound recording is required to be modified from the original audio data so as to provide the friends and family with a degree of interest in the recording by way of their being made to feel that they were actually in the scene themselves. In the preferred embodiment, the modified audio is played in conjunction with a video recording so that the listener of the audio is also provided with the actual images depicted in
In
Following the scene of
The motion of the virtual microphone and expansion/contraction of the field of listening as described in the example of
The example sound scene environment of
An example of the scope of application of the presently described embodiments and methods is to consider the well-known fairground ride of the “merry-go-round”. The embodiments and methods may be used to process sound captured by a spatial sound capture device located on a person who takes a ride on the merry-go-round. The application program 201 may process the recorded spatial sound so that it is re-played from a stationary frame of reference relative to the rotating merry-go-round from which it is recorded. Thus the application program is not to be considered as limited to merely enabling sound sources to be tracked and zoomed in on by a moving virtual microphone since it may also be used to “step-back” from a moving frame of reference, upon which is mounted a spatial sound capture device, to a stationary frame. In this way the present embodiments may find useful application in a wide variety of possible situations where captured spatial sound is required to be played back from the point of view of a different frame of reference to that in which it was actually recorded.
Acquiring Audio Data, Process 202
A source of spatial sound is obtained. As will be understood by those skilled in the art this may be obtained in a variety of ways and is not to be considered as limited to any particular method. However it will also be understood that the particular method employed will affect the specific configuration of data processing processes 203-207 to some degree.
One commonly employed method of obtaining spatial sound is to use a microphone array such that information on the spatial position of the microphones with respect to the sound sources is known at any given time. In this case the rendering process 207 should be configured to utilize the stored information, thereby simplifying the rendering process. Another example is to obtain spatially localized sound from a virtual (computer generated) source and to utilize the positional information that is supplied with it.
Methods of obtaining spatial sound and of separating and localizing sound sources are detailed below.
Obtaining Spatial Sound
There are a number of different spatially characterised soundscapes that application program 201 may be configured to use:
1. Soundscapes captured using multiple microphones with unknown trajectories, e.g. where several people are carrying microphones and the variation in the position of each microphone either has been or can be calculated over time.
2. Virtual reality soundscapes such as those defined by the web's VRML (Virtual Reality Modelling Language), which can describe the acoustical properties of the virtual environment and the sounds emitted by different sources as they move about the virtual world (in 3D space and time).
3. Spatial sound captured using microphone arrays. Here there are multiple microphones with known relative positions that can be used to determine the location of sound sources in the environment.
4. Soundscapes captured using a set of microphone arrays with each microphone array knowing the relative positions of its microphones, but not knowing the spatial positions of the other microphone arrays.
It should be noted that with microphone arrays (method no. 3 above) the relative positions of the microphones in the array are known, whereas in the general case (method no. 1) the relative positions of the microphones have to be determined. It will be understood by those skilled in the art that the different characteristics associated with spatially characterised sound obtained from each of the four methods (1)-(4) affect the more detailed configuration requirements of application program 201. In consequence, different versions of the underlying processing algorithms result, which exploit the different characteristics and/or work within the limitations of a particular source of spatial sound.
Method no. 1 above, the use of multiple microphones, does not decompose the environment into distinct spatial sound sources, although a physical microphone located on a sound source, such as a person, will mean that the sound captured is dominated by this sound source. Ideally such a sound source would be separated from its carrier to provide a pure spatially characterised sound. However this might not be possible without distorting the signal. Specific implementations of application program 201 may be configured to work with such impure forms of spatial sound. In the simplest case a suitably configured application program 201 might simply switch between different microphones. In a more sophisticated version, application program 201 may be configured to separate the sound source co-located with the physical microphone from the other sounds in the environment and allow a virtual microphone to take positions around the original sound source. It is also possible to determine the relative position of a microphone co-located sound source whenever it is radiating sound because this gives the clearest mechanism for separating sounds from the general microphone mix. However any reliably separated sound source heard by multiple microphones could be used to constrain the location of the sound sources and the microphones.
Even if processing were performed to identify sound sources it is likely to be error prone and not robust. This is because errors arise in the determination of the location of a sound source both in its exact position and in the identification of an actual sound source as opposed to its reflection (a reflection can be mistaken for a sound source and vice versa). Application program 201 needs to take the probability of such errors into account and it should be conservative in the amount of movement of and the selecting and editing of sound sources that it performs.
Identification of spatial sound sources is difficult for diffuse sound sources such as, for example, motorway noise or the sound of the sea meeting the shore. This is due to a lack of a point of origin for such diffuse sound sources. Other diffuse sound sources such as a flock of birds consisting of indistinguishable sound sources also present problems that would need to be taken into account in a practical spatial sound representation as used by a suitably configured application program 201.
If the output from application program 201 is intended to be spatial sound then there is greater emphasis required on the accuracy of the locations and labelling of different spatial sound sources. This is because not only should the output sound be plausible, but application program 201 should also give plausible spatial sound cues to the listener of the resultant edited sound scene that is produced. This is unlikely to be possible without an accurate 3D model of the environment complete with its acoustic properties, and a truly accurate representation will generally only be available or possible when the spatial sound comes from a synthetic or virtual environment in the first place.
Sound Source Separation and Determination of Location of Sound Sources
Given access to a sound field, application program 201 is then required to recover the separate components if these have not already been determined. Solution of this problem concerns dealing with the following degrees of freedom: greater than N signals from N sensors where N is the number of sensors in the sound field. There are two general approaches to solving this problem:
Information-Theoretic Approaches
This type uses only very general constraints and relies on precision measurements; and
Anthropic Approaches
This type is based on examining human perception and then attempting to use the information obtained.
Two important methods of separating and localising sound sources are (i) use of microphone arrays and (ii) use of binaural models. In order to better understand the requirements for configuring application program 201 further details of these two methods are provided below.
(i) Microphone Arrays
Use of microphone arrays may be considered to represent a conventional engineering approach to solving the problem. The problem is treated as an inverse problem taking multiple channels with mixed signals and determining the separate signals that account for the measurements. As with all inverse problems this approach is under-determined and it may produce multiple solutions. It is also vulnerable to noise.
Two approaches to obtaining multiple channels include combining signals from multiple microphones to enhance/cancel certain sound sources and making use of ‘coincident’ microphones with different directional gains.
The general name given to the techniques used to solve this problem is, as is known to those skilled in the art, “Adaptive Beamforming & Independent Component Analysis (ICA)”. This involves formulation of mathematical criteria to optimise the process for determination of a solution. The method includes (a) beamforming to drive any interference associated with the sound sources to zero (energy during non-target intervals is effectively cancelled) and (b) independent component analysis to maximise mutual independence of the outputs from higher order moments during overlap. The method is limited in terms of separation model parameter space and may, in a given implementation, be restricted to a sound field comprising N sound source signals from N sensors.
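By way of illustration only, the idea of combining signals from multiple microphones so as to enhance a chosen sound source may be sketched with a fixed delay-and-sum beamformer. This is a deliberately simpler technique than the adaptive beamforming and independent component analysis referred to above, and it is not put forward as the method of any described embodiment. The function name `delay_and_sum` and the use of whole-sample delays are assumptions made for this sketch.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average the result.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   per-channel arrival delay (in samples) of the target source.
    """
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    output = []
    for n in range(usable):
        total = sum(ch[n + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output


# The target pulse reaches microphone B one sample after microphone A.
mic_a = [0.0, 1.0, 0.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 1.0, 0.0, 0.0]
steered = delay_and_sum([mic_a, mic_b], delays=[0, 1])
unsteered = delay_and_sum([mic_a, mic_b], delays=[0, 0])
```

When the delays match the direction of the target the pulse adds coherently and retains its full amplitude, whereas with mismatched delays its peak is halved.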
The following references, incorporated herein by reference, provide detailed information as regards sound source separation and localisation using microphone arrays:
Sumit Basu, Steve Schwartz, and Alex Pentland.
“Wearable Phased Arrays for Sound Localisation and Enhancement.” In Proceedings of the IEEE Int'l Symposium on Wearable Computing (ISWC '00). Atlanta, Ga. October, 2000. pp. 103-110;
Sumit Basu, Brian Clarkson, and Alex Pentland.
“Smart Headphones.” In Proceedings of the Conference on Human Factors in Computing Systems (CHI '01). Seattle, Wash. April, 2001;
Valin, J.-M., Michaud, F., Hadjou, B., Rouat, J.,
Localisation of Simultaneous Moving Sound Sources for Mobile Robot Using a Frequency-Domain Steered Beamformer Approach.
Accepted for publication in IEEE International Conference on Robotics and Automation (ICRA), 2004;
Valin, J.-M., Michaud, F., Rouat, J., Letourneau, D.,
Robust Sound Source Localisation Using a Microphone Array on a Mobile Robot.
Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003;
Microphone-Array Localisation Error Estimation with Application to Sensor Placement (1995)
Michael Brandstein, John E. Adcock, Harvey F. Silverman;
Algebraic Methods for Deterministic Blind Beamforming (1998)
Alle-Jan van der Veen;
Casey, M. A.; Westner, W., “Separation of Mixed Audio Sources by Independent Subspace Analysis”,
International Computer Music Conference (ICMC), August 2000;
B. Kollmeier, J. Peissig, and V. Hohmann,
“Binaural noise-reduction hearing aid scheme with real-time processing in the frequency domain,”
Scand. Audiol. Suppl., vol. 38, pp. 28-38, 1993;
Shoko Araki, Shoji Makino, Ryo Mukai & Hiroshi Saruwatari
Equivalence between Frequency Domain Blind Source Separation and Frequency Domain Adaptive Beamformers;
(ii) Binaural Models
Human listeners have only two audio channels (by way of the human ears) yet are better able to separate out and determine the location of sound sources accurately than a conventional microphone array based system. For this reason there are many approaches to emulating human sound localisation abilities, the principal ones concentrating on the main cues to spatial hearing: interaural time difference, interaural intensity difference and spectral detail.
Extraction of Interaural Time Difference Cues
The interaural time difference (ITD) cue arises due to the different path lengths around the head to each ear. Below 1.5 kHz it is the dominant cue that people use to determine the location of a sound source. However the ITD cue only resolves spatial position to a cone of confusion. The basic approach is to perform cross-correlation to determine the timing differences.
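By way of illustration only, the cross-correlation approach to determining timing differences may be sketched as follows. The function name `estimate_itd`, the sign convention and the sample rate used in the example are assumptions made for this sketch.

```python
def estimate_itd(left, right, sample_rate, max_lag):
    """Estimate the interaural time difference in seconds by cross-correlation.

    Returns the lag of `right` relative to `left` that maximises the
    cross-correlation. A positive value means the sound reached the
    left channel first (an assumed sign convention).
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, left_sample in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                score += left_sample * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


# The right channel is the left channel delayed by two samples.
left = [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0]
itd = estimate_itd(left, right, sample_rate=8000, max_lag=4)
```

In this example the estimated difference is two sample periods. As noted above, such a value constrains the source only to a cone of confusion and does not by itself give a unique position.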
Extraction of Interaural Intensity Difference Cues
Interaural intensity difference (IID) arises due to the shadowing of the far ear, and is negligible for low frequency, but becomes more useful for higher frequencies.
Extraction of Spectral Detail
The shape of the pinnae introduces reflections and spectral detail that is dependent on elevation. It is because of this that IID cues are used by people for detecting range and elevation. Head motion is a means of introducing synchronised spectral change.
Once the direction of the sound sources has been determined they can then be separated by application program 201 (assuming this is required in that sound sources have not been provided in a pre-processed format) based upon direction. As will be understood by those skilled in the art separation of sound sources based on direction may involve one or more of:
estimating direction locally;
choosing target direction; and
removing or minimising energy received from other directions.
The following references, incorporated herein by reference, provide detailed information as regards auditory scene analysis/binaural models:
G. J. Brown and M. P. Cooke (1994)
Computational auditory scene analysis. Computer Speech and Language, 8, pp. 297-336;
B. Kollmeier, J. Peissig, and V. Hohmann,
“Binaural noise-reduction hearing aid scheme with real-time processing in the frequency domain,”
Scand. Audiol. Suppl., vol. 38, pp. 28-38, 1993;
This latter reference provides further information on separation of sound sources based on direction.
Model and Application of a Binaural 360° Sound Localisation System (2001)
C. Schauer, H.-M. Gross
Lecture Notes in Computer Science;
Identification of Spectral Features as Sound Localisation Cues in the External Ear Acoustics
Paul Hofman, John van Opstal
IWANN;
Enhancing sound sources by use of binaural spatial cues
Johannes Nix, Volker Hohmann
AG Medizinische Physik
Universität Oldenburg, Germany;
Casey, M., “Sound Classification and Similarity Tools”, in B. S. Manjunath, P. Salembier and T. Sikora, (Eds), Introduction to MPEG-7: Multimedia Content Description Language, J. Wiley, 2001; and
Casey, M., “Generalized Sound Classification and Similarity in MPEG-7”, Organised Sound, 6:2, 2002.
However a source of spatial sound is obtained, the audio source may be received via input port 105 in a form wherein the spatial sound sources have already been determined, with unattributable sources being labeled as such and echoes and reflections having been identified. In this case the spatial sound sources may be required to be normalized by application program 201 as described below. Normalization greatly simplifies the processing required in the subsequent analysis and rendering processes of the pipeline.
Normalization of Sound Signal
The spatially characterized sound source signals are normalized with the normalized signals being stored in memory 103. Normalization is required to simplify the main rendering task of placing a virtual microphone in the soundscape and synthesizing the sound signals that it would capture.
Normalization involves processing the signals so that the resultant stored signals are those that would have been obtained by a microphone array (i) located at the same position as regards orientation from and distance from each of the sound sources and (ii) preferably, in an environment that is free of reverberations. In the preferred embodiment normalisation is applied to the intensity of the sound sources. Normalisation processing is preferably arranged so that when the virtual microphone is placed equidistant from two similar sound sources then they are rendered with an intensity that is proportional to the intensity produced at each sound source.
If the spatial sound sources are captured using microphones in known positions then the intensity of the sound sources detected will vary with the relative position of the sound source and the microphone. Thus to render spatially characterised sound for an arbitrary virtual microphone position it is preferred to store the intensity of the sound source from a standard distance and orientation with respect to the sound source. This process simplifies the sound source rendering process 207, but introduces an extra resampling of the captured sound. It is also a process that simplifies the pattern recognition because each sound source need only be recognised from a standard distance. Those skilled in the art will appreciate that the alternative is to store the orientation and position of the sound source and microphone (which will vary over time) and resample for the actual virtual microphone used in rendering. This would only resample the recorded sound once thus giving maximum quality.
A further preferred embodiment as regards normalization comprises both of the aforementioned approaches: normalizing the sound signals associated with each sound source to make recognition easier and also storing the positions of the original microphones. This latter approach provides the benefits of both approaches, but at a computational cost in relation to extra storage and sampling.
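By way of illustration only, storing a sound source as it would be heard from a standard distance may be sketched as follows. The sketch assumes free-field propagation in which intensity obeys the inverse square law (so that amplitude falls in proportion to distance) and it ignores orientation and reverberation. The function name `to_reference_distance` and the 1 m reference distance are assumptions made for this sketch.

```python
def to_reference_distance(samples, mic_distance, ref_distance=1.0):
    """Rescale amplitudes captured at mic_distance to the assumed ref_distance.

    Assumes free-field propagation: intensity falls with the square of
    distance, so amplitude falls linearly with distance.
    """
    if mic_distance <= 0 or ref_distance <= 0:
        raise ValueError("distances must be positive")
    gain = mic_distance / ref_distance
    return [s * gain for s in samples]


# The same source captured by a near microphone (1 m) and a far one (4 m).
near = to_reference_distance([0.8, -0.4], mic_distance=1.0)
far = to_reference_distance([0.2, -0.1], mic_distance=4.0)
```

Both captures reduce to the same stored signal, so that a recogniser need only deal with the source as heard from the standard distance.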
Characterizing the Sound Scene into Sound Sources, 203, 204.
Select or Determine Styles, Process 203
In the preferred embodiment of application program 201 process 203 concerning selection or determination of style initially identifies which one of a plurality of predefined sound classes the stored audio data to be processed actually represents. For automatic determination of styles the application program 201 is thus required to comprise a plurality of predefined sound classes in the form of stored exemplary waveforms.
Referring to
Process 203 concerning selection or determination of styles may be automatically effected by the application program 201 or the application program 201 may be configured to accept an appropriate selection made by an operator of the system. In general the style can be determined through:
user interaction via selection from a set of menu items or slider bars visible on a monitor or via explicit setting of particular parameters;
a priori or default settings (which may be varied randomly); and
parameters determined externally of the application program if the application program forms part of a larger composition program.
Although the process for selection/determination of styles (process 203) is illustrated in
Select or Determine Analysis Reference Frame (or Frames), Process 204
This process concerns selecting an appropriate analysis reference frame from:
a fixed reference frame of the type used in the example of
a reference frame that moves around.
In the best mode this decision is effected by the style determined either automatically or selected by the operator of application program 201 at process 203. The choice affects the overall style of the resultant edited soundscape produced by application program 201 and it affects the saliency accorded by application program 201 to particular sound sources.
Perform Analysis of Sound Sources, Process 205
Grouping together of sound sources as indicated at 501;
Determination of the causality of sound sources as indicated at 502;
Determination of the similarity of sound sources as indicated at 503;
Classification of the sound sources as indicated at 504;
Identification of new sounds as indicated at 505; and
Recognition of moving sound sources or anonymous sound sources as indicated at 506.
Grouping of Sound Sources, Process 501
Common fate process 601: Common fate describes the tendency to group sound sources whose properties change in a similar way over time. A good example is a common onset of sources.
Sound source similarity process 602: The similarity of sound sources according to some measure of the timbre, pitch or loudness correlation between the different sound sources indicates a tendency to group the sources.
Sound source proximity process 603: The proximity of sound sources in time, frequency and spatial position provides a good basis for grouping.
Sound source continuity process 604: The degree of smoothness between consecutive sound elements can be used to group, a higher degree of smoothness providing a greater tendency for application program 201 to link the elements as a group.
Sound source closure process 605: Sound sources that form a complete, but possibly partially obscured sound object, are required to be grouped.
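By way of illustration only, grouping by common fate using a common onset may be sketched as follows. The function name `group_by_common_onset`, the 50 ms tolerance and the example onset times are assumptions made for this sketch, and the other grouping processes 602 to 605 are not represented.

```python
def group_by_common_onset(onsets, tolerance=0.05):
    """Group sound sources whose onset times (seconds) lie close together.

    A source joins the current group when its onset is within `tolerance`
    of the previous member's onset; otherwise it starts a new group.
    """
    groups = []
    for source_id, onset in sorted(onsets.items(), key=lambda item: item[1]):
        if groups and onset - groups[-1][-1][1] <= tolerance:
            groups[-1].append((source_id, onset))
        else:
            groups.append([(source_id, onset)])
    return [[source_id for source_id, _ in group] for group in groups]


# Hypothetical onset times for four separated sound sources.
onsets = {"violin": 1.00, "cello": 1.02, "door": 3.50, "viola": 1.04}
groups = group_by_common_onset(onsets)
```

Here the three instruments, which start almost together, form one group, and the later door sound forms a group of its own.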
Determination of the Causality of Sound Sources, Process 502
Application program 201 is configured to determine whether one sound source causes another sound source to occur. A good example of causality is where a person asks another person a question and the other person replies with an answer. This process thus comprises another means of grouping sound sources by means of cause and effect rather than being based upon Gestalt principles. In the example on
Determination of the Similarity of Sound Sources, Process 503
Classifying (Recognizing) Sound Sources, Process 504
The sound source analysis process 205 of application program 201 also includes sound source classification processing as indicated at 504. By classification it is meant processing as regards recognizing different sounds, and classifying those sounds into sounds of similar types.
Below are listed various types of sounds that may be recognized. However the lists are not to be considered as exhaustive:
Sounds from crowds 902 may be further sub-divided into laughing sounds as indicated at 915, clapping and/or stomping as indicated at 916, cheering sounds as indicated at 917 and sounds of the people singing as indicated at 918. Application program 201 may be configured to recognize the different types of sounds 901 to 918 respectively. Sounds made by individuals and sounds made by crowds of people are very different as are vocal and non-vocal sounds and therefore application program 201 is, in the best mode contemplated, configured with recognizers for at least these categories.
Identification of New Sound Sources, Process 505
Application program 201 is, in the best mode contemplated, also provided with means of identifying new sound sources. Loud sounds cause the startle reflex to occur in humans, with the result that the loud sound captures the attention of the person. Application program 201 is preferably configured to incorporate processing that mimics the startle reflex so that attention can be drawn to such sounds as and when they occur. The ability of application program 201 to incorporate such processing is made substantially easier with spatial sound because it is known when a new object sound occurs. However a new sound that is different from any sound heard previously will also tend to capture the attention of people. In the best mode some form of recogniser for recognizing sound that differs from anything else heard previously is also provided since sounds that are similar to what has already been heard will be deemed less interesting and will fade from a person's attention.
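By way of illustration only, processing that mimics the startle reflex may be sketched as a detector that flags any frame whose level greatly exceeds the average level heard so far. The function name `startle_frames`, the per-frame level values and the ratio of 4 are assumptions made for this sketch. Recognition of sounds that are new in character rather than merely loud is not covered.

```python
def startle_frames(levels, ratio=4.0):
    """Return indices of frames whose level exceeds `ratio` times the mean
    level of all earlier frames (a crude stand-in for the startle reflex)."""
    flagged = []
    total = 0.0
    for i, level in enumerate(levels):
        if i > 0 and level > ratio * (total / i):
            flagged.append(i)
        total += level
    return flagged


# Hypothetical per-frame loudness values with one sudden loud event.
levels = [1.0, 1.2, 0.9, 1.1, 9.0, 1.0]
startled = startle_frames(levels)
```

Only the fifth frame (index 4) is flagged in this example, since it is the only one that stands well above the preceding background level.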
Determination of Motion of Sound Sources, Process 506
A recognizer configured to determine when sounds are stationary relative to the self (fixed analysis framework) or accompanying the self (moving framework) is important because sound sources can be transient and have no or little interaction with objects in the scene.
The above examples of recognizers are merely given to demonstrate the kinds of sound recognizers that may be implemented in a particular embodiment of application program 201. The number and type of recognizers that may be employed may clearly vary greatly from one system to another and many more examples of recognizers than those discussed above may find useful application depending on particular end-user requirements.
Controlling the Path/Trajectory of the Tour of the Virtual Microphone and Selecting Sound Sources Supplied on the Virtual Tour, Process 206
The matter of selecting sound sources and determining a virtual microphone trajectory in process 206 can be seen as a form of optimisation problem. However an optimal solution is not necessarily required. Rather, for many applications of a suitably configured application program 201, only an acceptable result is required such that the resultant virtual microphone provides a modified version of the sound scene that is aesthetically acceptable to a nominal listener of the resultant edited sound scene. In the preferred embodiment processing in process 206 therefore concerns a search 1401 to find an acceptable result from a number of reasonable candidates that are so produced. The search routines may therefore make use of genetic algorithms and one or more heuristic rules to find possible selections and tours of the virtual microphone about the sound field, the emphasis being to avoid clearly poor or embarrassing resultant processed audio data for use in play-back. For example:
when a person is on the move the virtual microphone should be configured by application program 201 to keep around the person;
when a person enters a new environment the virtual microphone should be configured to simulate attention drifting on to new or interesting sound sources nearby;
in a complex scene an overview of the sound scene should be given before zooming in on particular sound sources that are interesting.
The method described below uses a simple model of a four-dimensional soundscape and does not take into account reflections when the microphone is moved to different positions. For more complex embodiments VRML (Virtual Reality Modelling Language) BIFS (Binary Format for Scene description) may be employed to yield higher quality results as regards the form of the resultant edited sound scene produced.
At process 1402 the saliency of the selected sound sources is maximised over possible virtual microphone trajectories and the sound source selections of process 206. This processing is subject to one or more constraints 1403 that are provided by the style parameters introduced at process 203.
(1) Constraints
The constraints provided by the style parameters ensure that:
the duration of the output sound signal is within certain bounds as indicated at process 1404;
certain aesthetic constraints upon the selections are maintained within certain bounds as indicated at process 1405; and
the integrity of the sound sources is respected within certain bounds as indicated at process 1406.
The duration constraint 1404 is the most basic constraint that forces the editing process and it simply ensures that the duration of the selected material is within certain predefined limits.
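By way of illustration only, maximising saliency subject to a duration constraint may be sketched as follows. For brevity the sketch enumerates every combination of a handful of candidate clips rather than using genetic algorithms or heuristic rules, and it is not put forward as the search 1401 of the preferred embodiment. The function name `best_selection`, the clip names, durations and saliency values are all assumptions made for this sketch.

```python
from itertools import combinations


def best_selection(clips, min_duration, max_duration):
    """Pick the subset of clips with the highest total saliency whose total
    duration lies within [min_duration, max_duration].

    clips: list of (name, duration_seconds, saliency) tuples.
    Returns (names, total_saliency), or None if no subset is feasible.
    """
    best = None
    for size in range(1, len(clips) + 1):
        for subset in combinations(clips, size):
            duration = sum(clip[1] for clip in subset)
            if not (min_duration <= duration <= max_duration):
                continue  # violates the duration constraint
            saliency = sum(clip[2] for clip in subset)
            if best is None or saliency > best[1]:
                best = ([clip[0] for clip in subset], saliency)
    return best


# Hypothetical candidate clips: (name, duration in seconds, saliency score).
clips = [
    ("greeting", 10.0, 3.0),
    ("joke", 20.0, 8.0),
    ("laughter", 5.0, 4.0),
    ("traffic", 30.0, 1.0),
]
names, score = best_selection(clips, min_duration=20.0, max_duration=30.0)
```

With these figures the joke and the laughter are chosen together, since they give the highest total saliency among the selections lasting between 20 and 30 seconds.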
The most important function of the aesthetic constraint (or constraints) 1405 concerns control of the virtual microphone trajectory. As will be understood by those skilled in the art it would be confusing if the virtual microphone trajectory constantly changed to grab interesting features in the soundscape. Thus the motion of the virtual microphone is required to be damped. Similarly changing the region of reception over time will also cause confusion and therefore this action is also required to be damped. In the best mode an aesthetic constraint is therefore used to impose a smoothness constraint on the virtual microphone trajectory such that jerky virtual microphone movements are given poor scores. In addition other smoothing function aids are preferably employed such as target smoothness values and also predefined tolerances as regards acceptable movements.
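By way of illustration only, a smoothness score of the kind that gives poor scores to jerky virtual microphone movements may be sketched as follows. The use of squared second differences of position as the penalty, and the function name `jerkiness`, are assumptions made for this sketch.

```python
def jerkiness(trajectory):
    """Sum of squared second differences of position along a trajectory.

    trajectory: list of equal-length position tuples sampled at uniform
    time steps. Zero means constant-velocity motion; larger is jerkier.
    """
    penalty = 0.0
    for prev, cur, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
        penalty += sum((p - 2.0 * c + n) ** 2 for p, c, n in zip(prev, cur, nxt))
    return penalty


# A steady glide versus a path that darts back and forth.
smooth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
jerky = [(0.0, 0.0), (3.0, 0.0), (0.0, 0.0), (3.0, 0.0)]
smooth_penalty = jerkiness(smooth)
jerky_penalty = jerkiness(jerky)
```

A candidate trajectory might then be rejected, or scored down, when its penalty exceeds a predefined tolerance about a target smoothness value.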
Aesthetic constraints and selected style parameters are also required to constrain the balance of features contained within the selection. For example it may be undesirable to produce a resultant edited soundscape that focuses too much on one person and therefore a constraint may be defined and selected for ensuring that resultant edited sound content is provided from a number of people within a group of sound sources. Similarly a suitable constraint may be provided that focuses on a particular person whilst minimising the sounds produced by other members of the group.
Aesthetic and style parameters may also be provided to determine how groups of people are introduced. For example all the people within a group could first be introduced before showing each piecewise or in smaller chunks, or alternatively pieces or chunks may be provided first before showing the group as a whole. Aesthetic constraints may also be provided to determine how background or diffuse sound sources are to be used in a given editing session.
Aesthetic constraints may also be provided to constrain how stock sound sources such as music and background laughter or similar effects should be used. Stock footage can be treated as just another sound source to be used or optimised in the composition. Such footage is independent of the original timeline, and constraints on its use are tied to the edited or selected output signal. However actual ambient sound sources may be treated in the same way by application program 201.
Integrity constraints are required to be provided such that the resulting edited soundscape is, in some sense, representative of the events that occurred in the original soundscape. This would include, for example, a constraint to maintain the original temporal sequence of sound sources within a group and a constraint to ensure that the causality of sound sources is respected (if one sound causes another then both should be included and in the correct sequence). A suitably configured integrity constraint thus indicates how well a particular virtual microphone trajectory and spatial sound selection respects the natural sound envelopes of the sound sources. It is a matter of style as regards what is scored and by how much. Again tolerances for a target value are preferably defined and used as a constraint in application program 201.
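By way of illustration only, a check on temporal sequence and causality may be sketched as follows. The sketch treats the constraint as a simple pass or fail test rather than as a score with tolerances, and the function name `respects_integrity` and the event names are assumptions made for this sketch.

```python
def respects_integrity(selected, original_order, causes):
    """Check that a selection keeps the original temporal order and that
    each cause and its effect are either both included or both omitted.

    selected:       event names in the order they appear in the edit.
    original_order: all event names in their original temporal order.
    causes:         list of (cause, effect) pairs.
    """
    position = {name: i for i, name in enumerate(original_order)}
    indices = [position[name] for name in selected]
    if indices != sorted(indices):
        return False  # original temporal sequence not maintained
    chosen = set(selected)
    for cause, effect in causes:
        if (cause in chosen) != (effect in chosen):
            return False  # a cause or its effect has been cut
    return True


# Hypothetical events and causal links.
original = ["question", "answer", "joke", "laughter"]
links = [("question", "answer"), ("joke", "laughter")]
ok = respects_integrity(["joke", "laughter"], original, links)
reordered = respects_integrity(["laughter", "joke"], original, links)
orphaned = respects_integrity(["question", "joke", "laughter"], original, links)
```

The first selection passes, the second fails because the laughter precedes the joke, and the third fails because the question is included without its answer.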
As will be understood by those skilled in the art the types and nature of the particular constraints actually provided in a given application program configured as described herein may vary depending upon the particular requirements of a given user. However an automated or semi-automated system should be controllable in the sense that the results are predictable to some degree and therefore it will be appreciated that a fully automatic system may provide less freedom to make interesting edits than one which enables an operator to make certain choices.
(2) Saliency
In the preferred embodiment illustrated schematically in
i. The intrinsic saliency of the waveforms of each sound source, 1407;
ii. The saliency of recognised features in each sound source, 1408; and
iii. The saliency of certain sound sources when the sources are grouped together, 1409.
All three components of saliency 1407-1409 will be affected by the trajectory (the variation in position and orientation with time) of both the sound source and the virtual microphone. This is because the sound intensity received by the microphone, even in the simplest models (i.e. those ignoring room acoustics), varies in accordance with the inverse square law. In other words the intensity is inversely proportional to the square of the distance between the microphone and the sound source. All the component types of saliency are actually calculated over an interval of time and most forms of saliency should be affected by the style parameters. Since the saliency of sound is defined over intervals of time the application program 201 is required to determine the set of intervals for which each sound source is selected and then sum the resultant saliencies for each sound source over these intervals.
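By way of illustration only, the inverse square law and the summing of saliencies over the selected intervals may be sketched as follows. The function names `received_intensity` and `total_saliency`, the reference distance of 1 and the numerical values are assumptions made for this sketch.

```python
def received_intensity(source_intensity, distance, ref_distance=1.0):
    """Intensity at the virtual microphone under the inverse square law,
    ignoring room acoustics. source_intensity is taken at ref_distance."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return source_intensity * (ref_distance / distance) ** 2


def total_saliency(interval_saliencies):
    """Sum, per sound source, the saliencies of its selected intervals.

    interval_saliencies: dict mapping source name to a list of saliency
    values, one per interval in which that source is selected.
    """
    return {source: sum(values) for source, values in interval_saliencies.items()}


# Doubling the distance quarters the received intensity.
at_two = received_intensity(8.0, distance=2.0)
at_four = received_intensity(8.0, distance=4.0)
totals = total_saliency({"speaker": [1.0, 2.5], "band": [0.5]})
```

Because the received intensity depends on distance, moving the virtual microphone towards or away from a source changes the saliency contributed by that source in each interval.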
Intrinsic Saliency for the Interval
Intrinsic saliency derives from the inherent nature of a sound source waveform. It may comprise loudness (the human perception of intensity), the presence of rhythm, the purity of the pitch, the complexity of the timbre or the distribution of frequency.
Recognised Feature Based Saliency for the Interval
Feature based saliency is based upon some a priori interest in the presence of particular features within the interval. However features will have their own natural time interval and thus it is a requirement that the saliency interval includes the interval of the feature. The impact of each feature on the whole interval is affected by the relative duration of the feature and overall intervals. The features are detected prior to the search procedure 1401 by pattern recognition recogniser functions of the type described in relation to
Group Based Saliency for the Interval
The group based saliency is composed of an intrinsic saliency and a feature based saliency. A group's saliency in an interval is determined either by some intrinsic merit of the group's composite sound waveform or because the group is recognised as a feature with its own saliency. The group feature is required to place value upon interaction between different or distinct sound sources, such as capturing a joke told by a given person at a dinner table as well as capturing the resulting laughter. Thus the group feature should be configured to value causality between sound sources provided that they are similar according to some Gestalt measure and, in particular, providing that the sound sources are close in space and in time.
Process 1703 concerns determination of feature group saliency. Since a group can have a number of features that are significant for saliency purposes then application program 201 is required to sum over all such features in the interval as indicated at process 1708. Following summing at process 1708, the texture interval is determined at process 1709. Then at process 1710 the feature trajectory is determined. At process 1711 a table look-up for the saliency of the feature is performed whereafter at process 1712 the saliency obtained is modified to take account of the actual feature duration. Following process 1712, at process 1713 the saliency determined at processes 1711 and 1712 is then further modified for the feature trajectory determined at process 1710.
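A minimal sketch of processes 1708 and 1711 to 1713 is given below, purely by way of example. The Feature record, the SALIENCY_TABLE values, the use of relative duration as the duration modifier and the use of an inverse square distance factor as the trajectory modifier are all assumptions introduced for illustration; the description above does not prescribe these particular forms.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Feature:
    label: str            # recognised feature type (hypothetical labels)
    start: float          # feature start time, seconds
    end: float            # feature end time, seconds
    mean_distance: float  # assumed summary of the feature trajectory, metres


# Hypothetical look-up table of a priori feature saliencies.
SALIENCY_TABLE: Dict[str, float] = {"laughter": 0.8, "speech": 0.5}


def feature_group_saliency(
    features: List[Feature],
    interval: Tuple[float, float],
    table: Dict[str, float] = SALIENCY_TABLE,
    reference_distance: float = 1.0,
) -> float:
    """Sum feature saliencies over an interval (cf. process 1708)."""
    start, end = interval
    length = end - start
    if length <= 0:
        raise ValueError("interval must have positive duration")
    total = 0.0
    for f in features:
        # The saliency interval must include the interval of the feature.
        if f.start < start or f.end > end:
            continue
        base = table.get(f.label, 0.0)                # cf. process 1711
        duration_weight = (f.end - f.start) / length  # cf. process 1712
        if f.mean_distance <= reference_distance:     # cf. process 1713
            trajectory_weight = 1.0
        else:
            trajectory_weight = (reference_distance / f.mean_distance) ** 2
        total += base * duration_weight * trajectory_weight
    return total
```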
Saliency processing may be based on one or a number of approaches, but in the best mode it is based partly on a psychological model of saliency and attention. An example of such a model that may form a good basis for incorporating the required processing routines in application program 201 is that described in the PhD by Stuart N. Wrigley: “A Theory and Computational Model of Auditory Selective Attention”, August, 2002, Dept. of Computer Science, University of Sheffield, UK which is incorporated herein by reference. In particular Chapter 2 of this reference discloses methods for and considerations to be understood in auditory scene analysis, Chapter 4 provides details pertaining to auditory selective attention and Chapter 6 describes a computational model of auditory selective attention. In addition various heuristic based rules and probabilistic or fuzzy based rules may be employed to decide on which sound sources to select, to what extent given sound sources should be selected and also to determine the virtual microphone characteristics (trajectory and/or field of reception) at a given time.
The search procedure of the audio rostrum effectively guesses a virtual microphone trajectory and spatial sound selection, scores the saliency of each guess and ensures that each guess satisfies the various constraints. The search continues until either sufficiently interesting guesses have been found or some maximum number of guesses has been made. In the preferred embodiment a brute force search operation is used to obtain a set of acceptable guesses that utilises no intelligence except for that provided by way of the rules that score and constrain the search. However multi-objective optimisation might be used to use some of the constraints as additional objectives. There are many approaches to making the guesses that can be used. Other examples that may complement or replace the optimisation approach include: use of genetic algorithms and use of heuristics. In the case of using heuristics a template motion for the virtual microphone motion could be used for example. The template would be defined relative to an actual microphone's position and might recognise particular phases of the microphone motion.
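The brute force search described above may be sketched, by way of example only, as a guess, constrain and score loop. The function and parameter names (brute_force_search, propose, score, constraints, threshold, wanted, max_guesses) are hypothetical and are chosen solely for this illustration; a guess may be any object representing a candidate trajectory and spatial sound selection.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Guess = TypeVar("Guess")


def brute_force_search(
    propose: Callable[[], Guess],
    score: Callable[[Guess], float],
    constraints: Sequence[Callable[[Guess], bool]],
    threshold: float,
    wanted: int,
    max_guesses: int,
) -> List[Tuple[float, Guess]]:
    """Return acceptable guesses, best first.

    Stops when `wanted` guesses scoring at least `threshold` have been
    found, or when `max_guesses` guesses have been made.
    """
    accepted: List[Tuple[float, Guess]] = []
    for _ in range(max_guesses):
        guess = propose()
        # Discard any guess that violates a constraint.
        if not all(ok(guess) for ok in constraints):
            continue
        s = score(guess)
        if s >= threshold:
            accepted.append((s, guess))
            if len(accepted) >= wanted:
                break
    accepted.sort(key=lambda pair: pair[0], reverse=True)
    return accepted
```

A genetic algorithm or a heuristic template could be substituted for propose without changing the scoring and constraint rules.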
Alternative Approach to Determining Sound Sources and Virtual Microphone Trajectory (Process 206)
In an alternative of the aforementioned embodiment, the search/optimization method of determining sound sources and a virtual microphone trajectory may be simplified in various ways. One such method is to utilize the concept of index audio clips for intervals of sound. An index audio clip may be considered to represent a “key” spatial sound clip that denotes a set of spatial sound sources selected for a particular time interval. In this way a key part of the audio may be determined as a set of sound sources to focus on at a particular time. The virtual microphone may then be placed in a determined position such that the position enables the set of sound sources to be recorded (the virtual microphone being kept stationary or moving with the sound sources). By using index audio clips in this way the search problem is therefore reduced to picking the position of a fixed virtual microphone for each key spatial sound clip selection and then managing the transitions between these key sound clips. However it would also be required to permit operation of application program 201 such that the virtual microphone is allowed to accompany a group of moving sound sources. In this case the relative position of the virtual microphone would be fixed with respect to the group of sound sources, but again the absolute position of the virtual microphone would need to be fixed.
Using index audio clips leads to a heuristic based algorithm to be employed by application program 201 as follows:
1. Determine a set of index audio clips by identifying and selecting a set of sound sources within a common interval (for example, using sound source recognition processes of the type illustrated schematically in
2. For each index audio clip calculate a virtual microphone trajectory that would most suitably represent the selected sound sources. This determines the field of reception of the virtual microphone and its position during the interval. It should be noted that the virtual microphone might well be configured by application program 201 to track or follow the motion of the sound sources if they are moving together;
3. Determine a spatial sound selection for each index audio clip; and
4. Determine the nature of the audiological transitions between the key spatial sound clips (from one index audio clip to the next).
Process 4 above concerns the determination of the nature of the transitions. These may be achieved by panning between the virtual microphone positions or by moving to a wide field of view that encompasses fields of reception for two or more virtual microphones. Furthermore it should be appreciated that if the index audio clips are temporally separated then a need to cut or blend between sound sources that occurred at different times would arise.
It will be understood by those skilled in the art that the order in which the clips are visited need not follow the original sequence. In this case application program 201 should be provided with an extra process between processes 1 and 2 as follows:
1b. Determine the order in which the index audio clips are to be used.
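As a hedged illustration of the index audio clip approach, a fixed virtual microphone position for each clip and a panning transition between clips might be computed as below. Placing the microphone at the centroid of the selected sources and panning by linear interpolation are simplifying assumptions made for this sketch only; the names centroid and pan are likewise hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres


def centroid(points: Sequence[Point]) -> Point:
    """Assumed placement rule: centre of the selected sound sources."""
    if not points:
        raise ValueError("at least one sound source position is required")
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def pan(start: Point, end: Point, steps: int) -> List[Point]:
    """Linearly interpolated positions from one clip's microphone
    position to the next, including both end points."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        tuple(a + (b - a) * k / steps for a, b in zip(start, end))
        for k in range(steps + 1)
    ]
```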
Rendering or Mixing the Sound Sources, Process 207
The main rendering task is that of generating the sound signal detected by a virtual microphone (or a plurality of virtual microphones) at a particular position within the sound field environment. Thus in the case of a sound field sampled by using physical microphones a virtual microphone would be generated by application program 201 in any required position relative to the actual microphones. This process may be considered to comprise a two-stage process. In the first stage the selections are applied to obtain a new spatial sound environment composed only of sound sources that have been selected, and defined only for the interval that they were selected. The selected spatial sound may thus have a new duration, a new timeline, and possibly new labels for the sound sources. Furthermore additional sound sources can be added in for effect (e.g. a stock sound of background laughter). In the second stage the virtual microphone trajectory is applied to the selected spatial sound to output a new sound signal that would be output by a virtual microphone following a given calculated trajectory. This process takes into account the inverse square law and also introduces a delay that is proportional to the distance between the sound source and the virtual microphone.
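The second rendering stage may be illustrated, under considerable simplification, by the sketch below. It assumes a stationary virtual microphone, a free field with no reflections, point sources given as (position, samples) pairs and a speed of sound of 343 meters per second; the name render_static is hypothetical. Because intensity falls with the square of the distance, the signal amplitude is scaled by the reciprocal of the distance.

```python
import math
from typing import List, Sequence, Tuple

SPEED_OF_SOUND_M_S = 343.0

Point = Tuple[float, float, float]


def render_static(
    sources: Sequence[Tuple[Point, Sequence[float]]],
    mic_position: Point,
    sample_rate_hz: float,
    reference_m: float = 1.0,
) -> List[float]:
    """Mix the sources as heard at a fixed virtual microphone position."""
    rendered = []
    for position, samples in sources:
        # Clamp the distance so that the gain never exceeds unity.
        distance = max(math.dist(position, mic_position), reference_m)
        gain = reference_m / distance  # amplitude ~ 1/r, intensity ~ 1/r^2
        delay = round(distance / SPEED_OF_SOUND_M_S * sample_rate_hz)
        rendered.append((delay, [gain * s for s in samples]))
    length = max((d + len(x) for d, x in rendered), default=0)
    out = [0.0] * length
    for delay, scaled in rendered:
        for i, value in enumerate(scaled):
            out[delay + i] += value
    return out
```

A moving virtual microphone would require the distance, gain and delay to be recomputed along the trajectory, with interpolation of the delayed signal.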
As mentioned earlier the audio rostrum can be seen as a function 206 taking a style parameter and spatial sound and returning a selection of the spatial sound sources and a virtual microphone trajectory. The selection is simply a means of selecting or weighting particular sound sources from the input spatial sound. Conceptually the selection derives a new spatial sound from the original and the virtual microphone trajectory is rendered within this spatial sound.
Rendering process 207 is very important for getting realistic results. For example acoustic properties of the 3D environment need to be taken into account to determine the reflections of the sound. When the spatial sound is determined (for example from using a microphone array) then distinguishing the direct sound sources from reflections is important. If the reflection is seen as a distinct sound source then moving a virtual microphone towards it will mean changing the intensity of the reflection and changing the delay between the two sources, perhaps allowing the reflection to be heard before the direct sound signal.
As will be appreciated by those skilled in the art there are numerous known methods that may suitably be employed to perform one or more aspects of the required rendering. Examples of such systems, incorporated herein by reference, include:
U.S. Pat. No. 3,665,105 in the name of Chowning which discloses a method and apparatus for simulating location and movement of sound through controlling the distribution of energy between loud speakers;
U.S. Pat. No. 6,188,769 in the name of Jot which discloses an environmental reverberation processor for simulating environmental effects in, for example, video games; and
U.S. Pat. No. 5,544,249 in the name of Opitz, which discloses a method of simulating a room and/or sound impression.
Additionally those skilled in the art will appreciate that the rendering system could be configured to utilise MPEG4 audio BIFS for the purpose of defining a more complete model of a 3D environment having a set of sound sources and various acoustic properties. However for many applications it will suffice to rely on a relatively simple form of 3D model of acoustics and sound sources. This is particularly so if arbitrary motion of the virtual microphone from the original sound capture microphones is not allowed. These simpler approaches effectively make crude/simple assumptions about the nature of a 3D environment and its acoustics.
The difficulties in providing physically realistic rendering when using a simple acoustical model impose practical constraints upon how far the virtual microphone is allowed to move from the actual microphones that captured the spatial sound. It will be understood by those skilled in the art that these constraints should be built into the search procedure 206 for the spatial sound selections and virtual microphone trajectory.
A useful reference that addresses many of the relevant issues pertaining to the rendering process and which is incorporated herein by reference is “ACM Siggraph 2002 course notes ‘Sounds good to me!’ Computational sound for graphics, virtual reality and interactive systems” Thomas Funkhouser, Jean-Marc Jot, Nicolas Tsingos. The main effects to consider in determining a suitable 3D acoustical model are presented in this reference including the effect of relative position on such phenomena as sound delay, energy decay, absorption, direct energy and reflections. Methods of recovering sound source position are discussed in this reference based on describing the wavefront of a sound by its normal. The moving plane is effectively found from timing measurements at three points. To determine spatial location three parameters are required such as, for example, two angles and a range. The effects of the environment on sounds are also considered and these are also important in configuring required processing for rendering process 207. For instance reflections cause additional wavefronts and thus reverberation with resultant “smearing” of signal energy. The reverberation impulse response is dependent upon the exponential decay of reflections which, in turn, is dependent upon:
frequency of the sound(s)—there is a greater degree of absorption at higher frequencies resulting in faster decay;
size of the sound field environment—larger rooms are associated with longer delays and therefore slower decay of sound sources.
Normally the sound heard at a microphone (even if there is only one sound source) will be the combination or mixing of all the paths (reflections). These path lengths are important because sound is a coherent waveform phenomenon, and interference between out of phase waves can be significant. Since phase along each propagation path is determined by path length then path length needs to be computed to an accuracy of a small percentage of the wavelength. Path length will also introduce delay between the different propagation paths because of the speed of sound in air (343 meters per second).
The wavelength of audible sound ranges from approximately 0.02 to 17 meters (corresponding to 20 kHz and 20 Hz respectively). This impacts the spatial size of objects in an environment that are significant for reflection and diffraction. Acoustic simulations need less geometric detail because diffraction of sound occurs around obstacles of the same size as the wavelength. Also sound intensity is reduced with distance following the inverse square law and high frequencies also get reduced due to atmospheric scattering. When the virtual microphone is moving relative to the sound source, there is a frequency shift in the received sound compared to how it was emitted. This is the well-known Doppler effect.
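The relationships between frequency, wavelength, path length, delay and phase referred to above can be checked with a short sketch such as the following. The function names are hypothetical and a speed of sound of 343 meters per second in air is assumed.

```python
import math

SPEED_OF_SOUND_M_S = 343.0


def wavelength_m(frequency_hz: float) -> float:
    """Wavelength in metres of a tone of the given frequency."""
    return SPEED_OF_SOUND_M_S / frequency_hz


def delay_s(path_length_m: float) -> float:
    """Propagation delay in seconds along a path of the given length."""
    return path_length_m / SPEED_OF_SOUND_M_S


def phase_difference_rad(
    path_a_m: float, path_b_m: float, frequency_hz: float
) -> float:
    """Phase difference, in [0, 2*pi), between two propagation paths."""
    cycles = abs(path_a_m - path_b_m) / wavelength_m(frequency_hz)
    return (2.0 * math.pi * cycles) % (2.0 * math.pi)
```

For example, at 343 Hz the wavelength is one meter, so a path difference of half a meter places two paths in antiphase, which is why path length must be computed to a small fraction of the wavelength.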
The inverse square law and various other of the important considerations for effective rendering are more fully discussed below.
Inverse Square Law and Acoustic Environments
As has already been indicated the rendering process of process 207 is required to be configured to take account of the decay of sound signals based on the inverse square law associated with acoustic environments. Also a delay has to be introduced to take account of the time for the sound to travel the distance from the sound source to the virtual microphone. In a simple environment (i.e. ignoring reverberations) then a microphone placed equidistant between two sound sources would capture each sound proportional to the relative intensity of the original sound sources. The important properties of acoustic environments and of the effects of the inverse square law that require consideration for providing acceptable rendering processing 207 are briefly summarised below.
The acoustical field of a sound source depends upon the geometry of the source and upon the environment. The simplest sound source is the monopole radiator which is a symmetrically pulsating sphere. All other types of sound sources have some preferred directions for radiating energy. The physical environment in which sounds are created affects the sound field because sound waves are reflected from surfaces. The reflected waves add to the direct wave from the source and distort the shape of the radiating field.
The simplest environment, called a free-field, is completely homogeneous, without surfaces. Free-field conditions can be approximated in an anechoic room where the six surfaces of the room are made highly absorbing so that there are no reflections, or alternatively in an open field with a floor that does not reflect sound.
A monopole radiator expands and contracts, causing, respectively, over-pressure and partial vacuum in the surrounding air. In the free-field environment the peaks and troughs of pressure form concentric spheres as they travel out from a source.
The power in the field a distance r away from the source is spread over the surface of the sphere with an area 4πr². It follows that for a source radiating acoustical power P, the intensity I is given by:
I = P/(4πr²)
This is the inverse square law for the dependence of sound intensity on distance.
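As a worked illustration of this law, the sketch below evaluates the intensity for a monopole source in a free field and the corresponding change in level between two distances. The function names are hypothetical. Doubling the distance reduces the intensity to one quarter, that is, by approximately 6 dB.

```python
import math


def intensity_w_per_m2(power_w: float, distance_m: float) -> float:
    """Free-field intensity I = P / (4 * pi * r^2) of a monopole source."""
    return power_w / (4.0 * math.pi * distance_m ** 2)


def level_change_db(from_distance_m: float, to_distance_m: float) -> float:
    """Change in intensity level, in decibels, when moving the receiver
    from one distance to another."""
    return 10.0 * math.log10((from_distance_m / to_distance_m) ** 2)
```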
If the source is not spherically symmetric then in a free field, the intensity, measured in any direction with respect to the source is still inversely proportional to the square of the distance, but will have a constant of proportionality different from 1/(4π) that is affected by direction. Furthermore the area over which a microphone captures sounds will also affect the outcome.
Atmospheric Scattering
This is another form of attenuation of sound intensity that affects higher frequencies. The attenuation of propagating acoustic energy increases as a function of increasing frequency, decreasing temperature and decreasing humidity. For most sound fields atmospheric absorption can be neglected, but it becomes increasingly important where long distances or very high frequencies are involved. The following reference, incorporated herein by reference, provides further details on atmospheric considerations to be taken account of in the rendering process: Cyril Harris, “Absorption of Sound in Air versus Humidity and Temperature,” Journal of the Acoustical Society of America, 40, p. 148.
Doppler Shifting
This concerns the effect of relative motion between sound sources and virtual microphones that must be built into the rendering process if realistic edited sound is to be produced. When a sound source s and/or a receiver r are moving relative to one another, sound waves undergo a compression or dilation in the direction of the relative speed of motion. This compression or dilation modifies the frequency of the received sound relative to the emitted sound in accordance with the well known Doppler equation:
Fr/Fs = (1 − (n·Vr)/c) / (1 − (n·Vs)/c)
where Vs is the velocity of the source, Vr is the velocity of the receiver, Fr is the frequency of the received sound, Fs is the frequency of the sound emitted from a source, c is the speed of sound in the medium and n is the unit vector of the direction between source and receiver, pointing from the source towards the receiver (the dot denoting the scalar product).
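The Doppler equation above may be evaluated as in the following sketch, given purely by way of example. It assumes that n points from the source towards the receiver, that positions and velocities are three-component vectors in meters and meters per second, and that the speed of sound is 343 meters per second; the function name is hypothetical.

```python
import math
from typing import Sequence


def doppler_received_frequency(
    emitted_hz: float,
    source_pos: Sequence[float],
    receiver_pos: Sequence[float],
    source_vel: Sequence[float],
    receiver_vel: Sequence[float],
    speed_of_sound: float = 343.0,
) -> float:
    """Fr = Fs * (1 - n.Vr/c) / (1 - n.Vs/c), with n the unit vector
    from the source towards the receiver."""
    direction = [r - s for s, r in zip(source_pos, receiver_pos)]
    distance = math.sqrt(sum(x * x for x in direction))
    if distance == 0.0:
        raise ValueError("source and receiver positions coincide")
    n = [x / distance for x in direction]
    n_dot_vr = sum(a * b for a, b in zip(n, receiver_vel))
    n_dot_vs = sum(a * b for a, b in zip(n, source_vel))
    return (
        emitted_hz
        * (1.0 - n_dot_vr / speed_of_sound)
        / (1.0 - n_dot_vs / speed_of_sound)
    )
```

With this sign convention a virtual microphone approaching a stationary 1000 Hz source at 34.3 meters per second receives approximately 1100 Hz.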
Alternatives to using a full acoustical model of the environment and sound path tracing are based upon statistical characterisations of the environment. For example in the case of providing artificial reverberation algorithms wherein the sound received is a mixture of the direct signal, some relatively sparse “early reflections” and a set of dense damped reflections, these are better modelled statistically than through sound path tracing or propagation. These techniques are complementary to path tracing approaches.
From the above discussion pertaining to the difficulties associated with providing optimal spatial sound rendering it will be appreciated that use of plausible solutions or approximations may in many cases suffice to provide an acceptable rendering solution.
Process 206: Pre-Processing of the Sound Field
Application program 201 may be configured to operate with an additional processing process in the aforementioned processing pipeline. The recorded spatio-temporally characterised sound scene may itself be pre-processed by way of performing selective editing on the recorded sound scene. In this way there is generated a modified recorded sound scene for the subsequent selection processing (206) and rendering (207) processes to process. This of course results in the at least one generated virtual microphone being configurable to move about the modified recorded sound scene. Selective editing may be a desirable feature in configuring application program 201 for use by certain end users. By selective editing it is meant provision of a means of cutting out material from the recorded sound scene. It may be configured to remove particular intervals of time (temporal cutting) and/or it may remove sound sources from an interval (sound source cutting).
The selective editing functionality may also be used to re-weight the loudness of the spatial sound sources rather than simply removing one or more sound sources. In this way particular sound sources may be made less (or more) noticeable. Re-weighting is a generalisation of selection where a value of 0 means cut out the sound source and 1 means select the sound source. Values between 0 and 1 may be allocated to make a sound source less noticeable and values greater than 1 may be allocated to make a particular sound source more noticeable. It should be noted that the selection (or re-weighting) will vary over time, i.e. the original sound source may be made silent in one instance and be made louder in another. Temporal cutting may be considered to be equivalent to switching the virtual microphone off (by making it unreceptive to all sounds). However this would still leave sound source cutting and re-weighting.
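Time-varying re-weighting of a single sound source might, for illustration, be sketched as follows. The schedule format, a list of (start, end, weight) entries in seconds, and the name reweight are assumptions of this sketch; samples falling outside every scheduled entry retain a default weight of 1.

```python
from typing import List, Sequence, Tuple


def reweight(
    samples: Sequence[float],
    sample_rate_hz: float,
    schedule: Sequence[Tuple[float, float, float]],
    default_weight: float = 1.0,
) -> List[float]:
    """Scale each sample by the weight scheduled for its time.

    A weight of 0 cuts the source, 1 leaves it unchanged, values
    between 0 and 1 attenuate it and values above 1 emphasise it.
    """
    out = []
    for index, sample in enumerate(samples):
        t = index / sample_rate_hz
        weight = default_weight
        for start, end, scheduled in schedule:
            if start <= t < end:
                weight = scheduled
                break
        out.append(weight * sample)
    return out
```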
Collectively processing processes 205-207 thereby result in processor 102 generating a set of modified audio data for output to an audio player. One or a plurality of virtual microphones are generated in accordance with, and thereby controlled by, the characteristic sounds identified in the analysis of the sound sources. The modified audio data may represent sound captured from one or a plurality of virtual microphones that are configurable to be able to move about the recorded sound scene. Furthermore motion of the virtual microphones may of course comprise situations where they are required to be stationary (such as, for example, around a person who does not move) or where only the field of reception changes.
Although the aforementioned preferred embodiments of application program 201 have been described in relation to processing of sound sources of a spatially characterised sound field it should be remembered that the methods and apparatus described may be readily adapted for use in relation to spatially characterised sound that has been provided in conjunction with still or moving (video) images. In particular a suitably configured application program 201 may be used to process camcorder type video/spatial sound data such that the one or more virtual microphones thus created are also responsive to the actual image content to some degree. In this respect the methods and apparatus of European patent publication no. EP 1235182 in the name of Hewlett-Packard Company, incorporated herein by reference (and which may suitably be referred to as the auto-rostrum), find useful application in conjunction with the methods and apparatus described herein. The skilled person in the art will see that the following combinations are possible:
A virtual microphone application program controlled fully or in part by the sound content as substantially described herein before; and
A virtual microphone application program controlled to some degree by the image content of image data associated with the sound content.
The disclosure in European patent publication no. EP 1235182, concerns generation of “video data” from static image data wherein the video is generated and thereby controlled by determined characteristics of the image content itself. The skilled person in the art will therefore further appreciate that the methods and systems disclosed therein may be combined with a virtual microphone application program as described herein. In this way image data that is being displayed may be controlled by an associated sound content instead of or in addition to control actuated purely from the image content.
For applications where audio data is associated with image data the process of generating the virtual microphone comprises synchronising the virtual microphone with the image content. The modified audio data (representing the virtual microphone) is used to modify the image content for display in conjunction with the generated virtual microphone. In this way the resultant displayed image content more accurately corresponds to the type of sound generated. For example if the sound of children laughing is present then the image actually displayed may be a zoom in on the children.
Similarly, for applications where the audio data is associated with image data, the process of generating the virtual microphone may comprise synchronising the virtual microphone with identified characteristics of the image content. Here the identified image content characteristics are used to modify the audio content of the generated virtual microphone.
The specific embodiments and methods presented herein may provide an audio rostrum for use in editing spatial sound. The audio rostrum operates a method of editing a spatio-temporal recorded sound scene so that the resultant audio represents sound captured from at least one virtual microphone generated in accordance with, and thereby controlled by, identified characteristic sounds associated with the sound scene.
At least one virtual microphone is generated, which is configurable to move about a spatio-temporally recorded sound scene. The degree of psychological interest in the sound to a listener of the sound represented by the virtual microphone may thereby be enhanced.
There may be provided a method and system for generating a virtual microphone representation of a spatial sound recording that has been recorded by a spatial sound capture device.
There may be provided a method and system for generating a virtual microphone representation of a spatial sound capture device sound recording such that the frame of reference of the virtual microphone representation is rendered to be stationary with respect to the movements of the spatial sound capture device.
There may be provided a method and system for generating a virtual microphone representation of a spatial sound capture device sound recording such that the frame of reference of the virtual microphone representation is rendered to move relative to particular sound sources.
There may be provided a method and apparatus for generating a virtual microphone representation of a spatial sound capture device sound recording such that the virtual microphone is rendered to move closer to, or further away from, particular sound sources.
There may be provided an audio processing method and system configured to process complex recorded spatial sound scenes into component sound sources that can be consumed piecewise.
There may yet further be provided a method of editing of a spatio-temporal recorded sound scene, so that the resultant audio represents sound captured from at least one virtual microphone generated in accordance with, and thereby controlled by, identified characteristic sounds associated with the sound scene and identified image content characteristics of an associated digital image.
Optionally a soundscape as described herein may be recorded in conjunction with still or moving (video) images.
As noted above, according to one exemplary embodiment, there is provided a method of processing audio data, the method comprising: characterising an audio data representative of a recorded sound scene into a set of sound sources occupying positions within a time and space reference frame; analysing the sound sources; and generating a modified audio data representing sound captured from at least one virtual microphone configured for moving about the recorded sound scene, wherein the virtual microphone is controlled in accordance with a result of the analysis of the audio data, to conduct a virtual tour of the recorded sound scene.
Embodiments may further comprise identifying characteristic sounds associated with the sound sources; and controlling the virtual microphone in accordance with the identified characteristic sounds associated with the sound sources.
Embodiments may further comprise normalising the sound signals by referencing each sound signal to a common maximum signal level; and mapping the sound sources to the corresponding normalised sound signals.
Embodiments may further comprise selecting sound sources which are grouped together within the reference frame.
Embodiments may further comprise determining a causality of the sound sources.
Embodiments may further comprise recognizing sound sources representing sounds of a similar classification type.
Embodiments may further comprise identifying new sounds which first appear in the recorded sound scene and which were not present at an initial beginning time position of the recorded sound scene.
Embodiments may further comprise recognizing sound sources which accompany a self reference point within the reference frame.
The embodiment may further comprise recognizing a plurality of pre-classified types of sounds by comparing a waveform of the sound source against a plurality of stored waveforms that are characteristic of the pre-classified types.
Embodiments may further comprise classifying sounds into sounds of people and non-people sounds.
Embodiments may further comprise grouping the sound sources according to at least one criterion selected from the set of: physical proximity of the sound sources; and similarity of the sound sources.
In the various embodiments, generating modified audio data may further comprise executing an algorithm for determining a trajectory of the virtual microphone followed with respect to the sound sources, during the virtual tour.
In the various embodiments, generating a modified audio data may further comprise executing an algorithm for determining a field of reception of the virtual microphone with respect to the sound sources.
In the various embodiments, generating modified audio data may further comprise executing a search algorithm comprising a search procedure for establishing a saliency of the sound sources.
In the various embodiments, generating a modified audio data may further comprise a search procedure, based at least partly on the saliency of the sound sources, to determine a set of possible virtual microphone trajectories.
In the various embodiments, generating a modified audio data may further comprise a search procedure, based on the saliency of the sound sources, to determine a set of possible virtual microphone trajectories, the search being constrained by at least an allowable duration of a sound source signal output by the generated virtual microphone.
In the various embodiments, generating a modified audio data may further comprise a search procedure, based on the saliency of the sound sources, to determine a set of possible virtual microphone trajectories, the search procedure comprising a calculation of: an intrinsic saliency of the sound sources; and at least one selected from the set comprising: a feature-based saliency of the sources; and a group saliency of a group of the sound sources.
In the various embodiments, analysis may further comprise identifying a predefined sound scene class wherein, in that sound scene class, sub-parts of the sound scene have predefined characteristics; and establishing index audio clips based on recognized sound sources or groups of sound sources.
In the various embodiments, generating modified audio data comprises executing an algorithm for determining a trajectory and field of listening of the virtual microphone from one sound source or group of sound sources to the next.
In the various embodiments, analysis may further comprise identifying a predefined sound scene class wherein, in that sound scene class, sub-parts of the sound scene have predefined characteristics; and establishing index audio clips based on recognized sound sources or groups of sound sources; and the process of generating a modified audio data comprises executing an algorithm for determining a trajectory and field of view of the virtual microphone from one sound source or group of sound sources to the next, the algorithm further determining at least one parameter selected from the set comprising: the order of the index audio clips to be played; the amount of time for which each index audio clip is to be played; and the nature of the transition between each of the index audio clips.
In the various embodiments, generating a modified audio data may further comprise use of a psychological model of saliency of the sound sources.
The method may further comprise an additional process of performing a selective editing of the recorded sound scene to generate a modified recorded sound scene, the at least one virtual microphone being configurable to move about in the modified recorded sound scene.
In the various embodiments, generating the virtual microphone may further comprise a rendering process of placing the virtual microphone in the soundscape and synthesising the sounds that it would capture in accordance with a model of sound propagation in a three dimensional environment.
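A highly simplified rendering step of this kind is sketched below. It assumes a free-field propagation model with only two effects: a gain of 1/distance (clamped at distances below one metre) and a propagation delay of distance divided by the speed of sound. Reflections, directivity and air absorption are ignored, and the function `render_at` and its argument layout are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at roughly 20 degrees C


def render_at(mic_pos, sources, sample_rate):
    """Synthesise the signal a stationary virtual microphone would capture.

    mic_pos: (x, y, z) position of the virtual microphone in metres.
    sources: list of (position, samples) pairs, samples being a list of floats.
    Returns the mixed output as a list of floats.
    """
    prepared = []
    for pos, samples in sources:
        dist = math.dist(mic_pos, pos)
        gain = 1.0 / max(dist, 1.0)  # inverse-distance attenuation, clamped
        delay = round(dist / SPEED_OF_SOUND * sample_rate)  # delay in samples
        prepared.append((delay, gain, samples))
    if not prepared:
        return []
    out = [0.0] * max(delay + len(samples) for delay, _, samples in prepared)
    for delay, gain, samples in prepared:
        for i, x in enumerate(samples):
            out[delay + i] += gain * x
    return out
```

For example, a single unit impulse placed 3.43 m from the microphone and rendered at 1000 samples per second arrives 10 samples late, scaled by 1/3.43.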
In the various embodiments, audio data may be associated with image data and generating the virtual microphone comprises synchronising the virtual microphone with an image content of the image data.
In the various embodiments, audio data may be associated with image data and generating the virtual microphone comprises synchronising the virtual microphone with an image content of the image data, the modified audio data representing the virtual microphone being used to modify the image content for display in conjunction with the generated virtual microphone.
In the various embodiments, audio data may be associated with image data and generating the virtual microphone comprises synchronising the virtual microphone with identified characteristics of an image content of the image data.
The various embodiments may further comprise acquiring the audio data representative of the recorded sound scene.
In the various embodiments, the time and space reference frame may be moveable with respect to the recorded sound scene.
In the various embodiments, characterising of audio data may further comprise determining a style parameter for conducting a search process of the audio data for identifying the set of sound sources.
In the various embodiments, characterising may further comprise selecting the time and space reference frame from: a reference frame fixed with respect to the sound scene; and a reference frame which is moveable with respect to the recorded sound scene.
In the various embodiments, the virtual microphone may be controlled to tour the recorded sound scene following a path which is determined as a path which a virtual listener would traverse within the recorded sound scene; and wherein the modified audio data represents sound captured from the virtual microphone from a perspective of the virtual listener.
In the various embodiments, the virtual microphone may be controlled to conduct a virtual tour of the recorded sound scene, in which a path followed by the virtual microphone is determined from an analysis of sound sources which draw an attention of a virtual listener; and the generated modified audio data comprises the sound sources which draw the attention of the virtual listener.
In the various embodiments, the virtual microphone may be controlled to conduct a virtual tour along a path determined from a set of aesthetic considerations of objects within the recorded sound scene.
In the various embodiments, the virtual microphone may be controlled to follow a virtual tour of the recorded sound scene following a path which is determined as a result of aesthetic considerations of viewable objects in an environment coincident with the recorded sound scene; and wherein the generated modified audio data represents sounds which would be heard by a virtual listener following the path.
According to another embodiment, there is provided a method of processing audio data representative of a recorded sound scene, the audio data comprising a set of sound sources each referenced within a spatial reference frame, the method comprising: identifying characteristic sounds associated with each sound source; selecting individual sound sources according to their identified characteristic sounds; navigating the sound scene to sample the selected individual sound sources; and generating a modified audio data comprising the sampled sounds originating from the selected sound sources.
In the various embodiments, navigating may comprise following a multi-dimensional trajectory within the sound scene.
In the various embodiments, selecting may comprise determining which individual sound sources exhibit features which are of interest to a human listener in the context of the sound scene; and navigating the sound scene comprises visiting the individual sound sources which exhibit the features which are of interest to a human listener.
According to another embodiment, there is provided a method of processing audio data comprising: resolving an audio signal into a plurality of constituent sound elements, wherein each sound element is referenced to a spatial reference frame; defining an observer position within the spatial reference frame; and generating from the constituent sound elements, an audio signal representative of sounds experienced by a virtual observer at the observer position within the spatial reference frame.
In the various embodiments, observer position may be moveable within the spatial reference frame.
In the various embodiments, observer position may follow a three dimensional trajectory with respect to the spatial reference frame.
Embodiments may further comprise resolving an audio signal into constituent sound elements, wherein each constituent sound element comprises (a) a characteristic sound quality, and (b) a position within a spatial reference frame; defining a trajectory through the spatial reference frame; and generating from the constituent sound elements, an output audio signal which varies in time according to an output of a virtual microphone traversing the trajectory.
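The time-varying output of a virtual microphone traversing a trajectory may be illustrated, purely as a sketch, by recomputing a per-source gain at every output sample. The example assumes a clamped inverse-distance gain and ignores propagation delay and Doppler shift for brevity. The function `traverse` and the representation of the trajectory as a callable from sample index to position are hypothetical.

```python
import math


def traverse(sources, trajectory, n_samples):
    """Mix constituent sound elements as heard along a trajectory.

    sources: list of (position, samples) pairs; positions are (x, y, z) tuples.
    trajectory: callable mapping a sample index to the microphone position.
    n_samples: number of output samples to generate.
    """
    out = []
    for n in range(n_samples):
        mic = trajectory(n)
        value = 0.0
        for pos, samples in sources:
            if n < len(samples):
                # Gain falls off with distance; clamped below one metre.
                value += samples[n] / max(math.dist(mic, pos), 1.0)
        out.append(value)
    return out


if __name__ == "__main__":
    # A microphone moving away from a constant source along the x axis.
    source = ((0.0, 0.0, 0.0), [1.0, 1.0, 1.0])
    print(traverse([source], lambda n: (2.0 * n, 0.0, 0.0), 3))
```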
According to another embodiment, there is provided a method of processing audio data, the method comprising: acquiring a set of audio data representative of a recorded sound scene; characterising the audio data into a set of sound sources occupying positions within a time and space reference frame; identifying characteristic sounds associated with the sound sources; and generating a modified audio data representing sound captured from at least one virtual microphone configured for moving around the recorded sound scene, wherein the virtual microphone is controlled in accordance with the identified characteristic sounds associated with the sound sources, to conduct a virtual tour of the recorded sound scene.
According to another embodiment, there is provided a computer system comprising an audio data processing means, a data input port and an audio data output port, the audio data processing means being arranged to: receive from the data input port, a set of audio data representative of a recorded sound scene, the audio data characterized into a set of sound sources positioned within a time-space reference frame; perform an analysis of the audio data to identify characteristic sounds associated with the sound sources; generate a set of modified audio data, the modified audio data representing sound captured from at least one virtual microphone configurable to move about the recorded sound scene; and output the modified audio data to the data output port, wherein the virtual microphone is generated in accordance with, and is controlled by, the identified characteristic sounds associated with the sound sources.
In the various embodiments, performing an analysis of the audio data may comprise recognizing a plurality of pre-classified types of sounds by comparing a waveform of the sound source against a plurality of stored waveforms that are characteristic of the pre-classified types.
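One simple way to compare a waveform against stored waveforms is normalised correlation, sketched below. This is an assumed technique chosen for illustration; the embodiments do not state which comparison measure is used. The function names, the template dictionary and the threshold of 0.8 are hypothetical, and the waveforms are assumed to be time-aligned and of comparable length.

```python
import math


def normalized_correlation(a, b):
    """Zero-lag normalised correlation of two sample lists, in [-1, 1]."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def classify(waveform, templates, threshold=0.8):
    """Return the label of the best-matching stored waveform, or None.

    templates: dict mapping a pre-classified sound type to a stored waveform.
    A label is returned only if its correlation reaches the threshold.
    """
    best_label, best_score = None, threshold
    for label, template in templates.items():
        score = normalized_correlation(waveform, template)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label
```

Because the correlation is normalised, a louder or quieter copy of a stored waveform still matches its template.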
In the various embodiments, performing an analysis of the audio data may comprise classifying sounds into sounds of people and non-people sounds.
In the various embodiments, analysis of the sound sources may comprise grouping the sound sources according to at least one criterion selected from the set of: physical proximity of the sound sources; and similarity of the sound sources.
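Grouping by physical proximity might, for example, be realised by linking any two sources that lie within a chosen distance of each other and taking the connected sets as groups. The sketch below makes that assumption (single-linkage grouping with a fixed threshold); the function `group_by_proximity` and the `max_gap` parameter are hypothetical.

```python
import math


def group_by_proximity(positions, max_gap):
    """Group sound sources whose positions are chained within max_gap metres.

    positions: dict mapping a source name to its (x, y, z) position.
    Returns a sorted list of groups, each a sorted list of source names.
    """
    names = list(positions)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(positions[a], positions[b]) <= max_gap:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Grouping by similarity could reuse the same routine with a feature-space distance in place of the physical distance.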
In the various embodiments, the computer system may comprise an algorithm for determining a trajectory of the virtual microphone with respect to the sound sources.
In the various embodiments, the computer system may comprise an algorithm for determining a field of view of the virtual microphone with respect to the sound sources.
In the various embodiments, the computer system may comprise a search algorithm for performing a search procedure for establishing the saliency of the sound sources.
In the various embodiments, the computer system may comprise a search algorithm for performing a search procedure, based at least partly on the saliency of the sound sources, to determine a set of possible virtual microphone trajectories.
In the various embodiments, the computer system may comprise an algorithm for performing a search procedure, based on the saliency of the sound sources, to determine a set of possible virtual microphone trajectories, the search being constrained by at least the allowable duration of a sound source signal output by the generated virtual microphone.
In the various embodiments, generating the modified audio data may comprise a search procedure, based on the saliency of the sound sources, to determine a set of possible virtual microphone trajectories, the search procedure comprising a calculation of: an intrinsic saliency of the sound sources; and at least one selected from the set comprising: a feature based saliency of the sources; and a group saliency of a group of the sound sources.
In the various embodiments, performing an analysis of the audio data may further comprise identifying a predefined sound scene class wherein, in that sound scene class, sub-parts of the sound scene have predefined characteristics; and establishing index audio clips based on recognised sound sources or groups of sound sources, and the generating the modified audio data comprises executing an algorithm for determining a trajectory and field of view of the virtual microphone from one sound source or group of sound sources to another sound source or group of sound sources.
In the various embodiments, performing an analysis of the audio data may further comprise identifying a predefined sound scene class wherein, in that sound scene class, sub-parts of the sound scene have predefined characteristics; and establishing index audio clips based on recognized sound sources or groups of sound sources, the generating modified audio data comprising executing an algorithm for determining a trajectory and field of view of the virtual microphone from one sound source or group of sound sources to the next, the algorithm further determining at least one parameter from the set comprising: an order of the index audio clips to be played; an amount of time for which each index audio clip is to be played; and a nature of a transition between each of the index audio clips.
In the various embodiments, generating modified audio may comprise use of a psychological model of saliency of the sound sources.
In the various embodiments, the audio data processing means may be configured to perform a selective editing of the recorded sound scene to generate a modified recorded sound scene, the at least one virtual microphone being configurable to move about therein.
In the various embodiments, generating the virtual microphone may comprise a rendering process of placing the virtual microphone in the soundscape and synthesising the sounds that it would capture in accordance with a model of sound propagation in a three dimensional environment.
In the various embodiments, the audio data may be associated with image data and generating the virtual microphone comprises synchronising the virtual microphone with an image content of the image data, the modified audio data representing the virtual microphone being used to modify the image content for display in conjunction with the generated virtual microphone.
In the various embodiments, the audio data may be associated with image data and generating the modified audio data comprises synchronising the virtual microphone with identified characteristics of an image content of the image data.
According to another embodiment, there is provided a computer program stored on a computer-usable medium, the computer program comprising computer readable instructions for causing a computer to execute the functions of: acquiring a set of audio data representative of a recorded sound scene, the audio data characterized into a set of sound sources within a time-space reference frame; using an audio data processing means to perform an analysis of the audio data to identify characteristic sounds associated with the characterized sound sources; and generating, in the audio data processing means, a set of modified audio data for output to an audio-player, the modified audio data representing sound captured from at least one virtual microphone configurable to move about the recorded sound scene, wherein the virtual microphone is generated in accordance with, and thereby controlled by, the identified characteristic sounds associated with the sound sources.
According to another embodiment, there is provided an audio data processing apparatus for processing data representative of a recorded sound scene, the audio data comprising a set of sound sources each referenced within a spatial reference frame, the apparatus comprising: means for identifying characteristic sounds associated with each sound source; means for selecting individual sound sources according to their identified characteristic sounds; means for navigating the sound scene to sample the selected individual sound sources; and means for generating a modified audio data comprising the sampled sounds.
In the various embodiments, the navigating means may be operable for following a multi-dimensional trajectory within the sound scene.
In the various embodiments, the selecting means may comprise means for determining which individual sound sources exhibit features which are of interest to a human listener in the context of the sound scene; and the navigating means is operable for visiting the individual sound sources which exhibit the features which are of interest to a human listener.
In the various embodiments, the audio data processing apparatus may comprise a sound source characterisation component for characterising an audio data into a set of sound sources occupying positions within a time and space reference frame; a sound analyser for performing an analysis of the audio data to identify characteristic sounds associated with the sound sources; at least one virtual microphone component, configurable to move about the recorded sound scene; and a modified audio generator component for generating a set of modified audio data representing sound captured from the virtual microphone component, wherein movement of the virtual microphone component in the sound scene is controlled by the identified characteristic sounds associated with the sound sources.
In the various embodiments, the audio data processing apparatus may further comprise a data acquisition component for acquiring the audio data representative of a recorded sound scene.
According to another embodiment, there is provided a method of processing an audio visual data representing a recorded audio-visual scene, the method comprising: characterising the audio data into a set of sound sources, occupying positions within a time and space reference frame; analysing the audio-visual data to obtain visual cues; and generating a modified audio data representing sound captured from at least one virtual microphone configured for moving around the recorded audio-visual scene, wherein the virtual microphone is controlled in accordance with the visual cues arising as a result of the analysis of the audio-visual data to conduct a virtual tour of the recorded audio-visual scene.
According to another embodiment, there is provided an audio-visual data processing apparatus for processing audio-visual data representing a recorded audio-visual scene, the apparatus comprising: a sound source characterizer for characterizing audio data into a set of sound sources occupying positions within a time and space reference frame; an analysis component for analysing the audio-visual data to obtain visual cues; at least one virtual microphone component, configurable to navigate the audio-visual scene; and an audio generator component for generating a set of modified audio data representing sound captured from the virtual microphone component, wherein navigation of the virtual microphone component in the audio-visual scene is controlled in accordance with the visual cues arising as a result of the analysis of the audio-visual data.
The data processing apparatus may further comprise a data acquisition component for acquiring audio-visual data representative of a recorded audio-visual scene.
| Number | Date | Country | Kind |
|---|---|---|---|
| 0411297.5 | May 2004 | GB | national |