1. Field of the Invention
The present invention relates to wave-field synthesis systems and particularly to wave-field synthesis systems allowing moving virtual sources.
2. Description of the Related Art
There is an increasing demand for new technologies and innovative products in the field of consumer electronics. For new multimedia systems to succeed, it is an important prerequisite that they offer optimum functionality and capabilities. This is achieved by the use of digital technologies and particularly computer technology. Examples thereof are applications offering an improved, realistic audiovisual impression. In prior art audio systems, a significant weak point is the quality of the spatial sound reproduction of real as well as virtual environments.
Methods for multichannel loudspeaker reproduction of audio signals have been known and standardized for many years. All common techniques have the disadvantage that both the location of the loudspeakers and the position of the listener are already imprinted in the transmission format. If the loudspeakers are positioned incorrectly with regard to the listener, the audio quality suffers significantly. An optimum sound is only possible in a very small area of the reproduction room, the so-called sweet spot.
An improved natural spatial impression as well as stronger envelopment during audio reproduction can be obtained with the help of a new technology. The basics of this technology, the so-called wave-field synthesis (WFS), were researched at the TU Delft and first presented in the late 1980s (Berkhout, A. J.; de Vries, D.; Vogel, P.: Acoustic control by Wave-field Synthesis. JASA 93, 1993).
Due to the high demands of this method with regard to computing effort and transmission rates, wave-field synthesis has so far only rarely been applied in practice. Only the progress in the fields of microprocessor technology and audio encoding allows the use of this technology in specific applications today. First products in the professional field are expected next year. In a few years, the first wave-field synthesis applications for the consumer field will come on the market.
The basic idea of WFS is the application of the Huygens principle of wave theory:
Every point captured by a wave is the starting point of an elementary wave, which propagates in a spherical or circular way.
Applied to acoustics, any form of an incoming wave front can be reproduced by a large number of loudspeakers arranged next to one another (a so-called loudspeaker array). In the simplest case, a single point source to be reproduced and a linear arrangement of the loudspeakers, the audio signal has to be fed to every loudspeaker with a time delay and amplitude scaling such that the sound fields emitted by the individual loudspeakers superpose properly. With several sound sources, the contribution to every loudspeaker is calculated separately for every source and the resulting signals are added. In a virtual space with reflecting walls, the reflections can also be reproduced via the loudspeaker array as additional sources. Thus, the calculation effort depends heavily on the number of sound sources, the reflection characteristics of the recording room and the number of loudspeakers.
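For the simplest case described above, a single point source and a linear loudspeaker array, the per-loudspeaker delay and amplitude scaling can be sketched as follows. This is only an illustrative model under stated assumptions: the speed of sound of 343 m/s, the 1/distance gain law and the function name `driving_parameters` are not taken from the text, and a complete WFS driving function contains further terms.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not specified in the text


def driving_parameters(source_xy, speaker_positions, sample_rate=48000):
    """Return one (delay_in_samples, gain) pair per loudspeaker.

    Simplified, assumed model: the delay is the travel time from the
    virtual source to the loudspeaker, and the gain falls off as
    1/distance (clamped at 1 m to avoid division by very small values).
    """
    params = []
    for x, y in speaker_positions:
        distance = math.hypot(x - source_xy[0], y - source_xy[1])
        delay = round(distance / SPEED_OF_SOUND * sample_rate)
        gain = 1.0 / max(distance, 1.0)
        params.append((delay, gain))
    return params


# Linear array of three loudspeakers on the x axis, source 3.43 m behind it.
speakers = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
params = driving_parameters((0.0, -3.43), speakers)
```

In this sketch the outer loudspeakers receive a larger delay and a smaller gain than the central one, because they are farther from the virtual source.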
The particular advantage of this technique is that a natural spatial sound impression is possible across a large area of the reproduction room. In contrast to the known techniques, the direction and distance of sound sources are reproduced very accurately. To a limited degree, virtual sound sources can even be positioned between the real loudspeaker array and the listener.
Although wave-field synthesis functions well for surroundings whose conditions are known, irregularities occur when the conditions change or when wave-field synthesis is performed based on surrounding conditions that do not correspond to the actual conditions of the surroundings.
The technique of wave-field synthesis can also be used advantageously to add a corresponding spatial audio perception to a visual perception. So far, during production in virtual studios, the focus has been on the production of an authentic visual impression of the virtual scene. The acoustic impression matching the image is normally imprinted on the audio signal afterwards by manual operating steps in the so-called postproduction, or it is considered too expensive and too time-consuming to realize and is thus neglected. This normally causes a discrepancy between the individual sense impressions, which causes the designed space, i.e. the designed scene, to be perceived as less authentic.
In the expert publication “Subjective experiments on the effects of combining spatialized audio and 2D video projection in audio-visual systems”, W. de Bruijn and M. Boone, AES convention paper 5582, May 10th to 13th, 2003, Munich, subjective experiments with regard to the effects of combining spatial audio and a two-dimensional video projection in audiovisual systems are presented. Particularly, it is emphasized that two speakers standing at different distances to a camera, who stand almost behind one another, can be understood better by an audience when the two persons standing behind one another can be seen and reconstructed as different virtual sound sources with the help of wave-field synthesis. In that case, it has been found out by subjective tests that a listener can better understand and differentiate the two speakers speaking simultaneously when they are separated.
In a conference contribution for the 46th international academic colloquium in Ilmenau from Sep. 24 to 27, 2001, with the title “Automatisierte Anpassung der Akustik an virtuelle Raume”, U. Reiter, F. Melchior and C. Seidel, an approach for automating sound post-processing processes is presented. For this purpose, the parameters of a film set required for the visualization, such as room size, texture of the surfaces or camera position and position of the actors, are checked for their acoustic relevance, whereupon corresponding control data are generated. These then automatically influence the effect and post-processing processes used for postproduction, such as the adaptation of the speaker volume depending on the distance to the camera or the reverberation time depending on room size and wall conditions. The aim is to reinforce the visual impression of a virtual scene for an increased perception of reality.
It is intended to enable “listening with the ears of the camera” for making a scene appear more real. In this connection, it is intended that the correlation between the sound event location in the image and the listening event location in the surround field is as high as possible. This means that sound source positions are constantly adapted to the image. Camera parameters, such as zoom, are also to be incorporated in the sound design, as is a position of two loudspeakers L and R. For this purpose, tracking data of a virtual studio are written into a file by the system together with an associated time code. Image, sound and time code are recorded simultaneously on a MAZ. The Camdump file is transmitted to a computer, which generates control data for an audio workstation therefrom and outputs them via a MIDI interface synchronously to the image coming from the MAZ. The actual audio processing, such as positioning the sound source in the surround field and inserting early reflections and reverberation, is performed within the audio workstation. The signal is rendered for a 5.1 surround loudspeaker system.
Camera tracking parameters as well as positions of sound sources in the recording setting can be recorded in real film sets. Such data can also be generated in virtual studios.
In a virtual studio, an actor or presenter is alone in a recording room. In particular, he stands in front of a blue wall, which is also referred to as blue box or blue panel. A pattern of blue and light-blue stripes is applied to this blue wall. A special feature of this pattern is that the stripes have different widths, so that a plurality of stripe combinations results. During post-processing, when the blue wall is replaced by a virtual background, the unique stripe combinations on the blue wall make it possible to determine exactly in which direction the camera is looking. With the help of this information, the computer can determine the background for the current angle of view of the camera. Further, sensors at the camera are evaluated, which detect and output additional camera parameters. Typical camera parameters detected via sensor technology are the three translational degrees of freedom x, y, z, the three rotational degrees of freedom, which are also referred to as roll, tilt and pan, and the focal length or zoom, which is equivalent to the information about the aperture angle of the camera.
In order to be able to determine the exact position of the camera even without image recognition and without expensive sensor technology, a tracking system can also be used, which consists of several infrared cameras that determine the position of an infrared sensor mounted on the camera. Thereby, the position of the camera is also determined. With the camera parameters provided by the sensor technology and the stripe information evaluated by image recognition, a real-time computer can now calculate the background for the current image. Then, the blue hue of the blue background is removed from the image, so that the virtual background is inserted in place of the blue background.
In most cases, a concept is followed which is based on providing an overall acoustic impression of the visually imaged scene. This can be described with the expression “full shot”, which comes from image design. This “full shot” sound impression remains mostly constant across all settings in a scene, although the optical angle of view often changes considerably. Optical details are emphasized or moved into the background by corresponding angles. Countershots in film dialogs are likewise not reflected in the sound.
Thus, there is the need to embed the audience acoustically into an audiovisual scene. In this connection, the screen or the image area is the line of vision and the angle of view of the audience. This means that the sound is to follow the image in the form that it always corresponds to the image. This is particularly important for virtual studios since there is typically no correlation between the sound of the moderation, for example and the surroundings where the presenter is at the moment. In order to get an audiovisual overall impression of the scene, a room impression matching the rendered image has to be simulated. In that context, the location of a sound source, as it is perceived by, for example, an audience of a cinema screen, is a significant subjective characteristic in such a sound concept.
In the audio domain, a good spatial sound can be obtained for a large listener area by the technique of wave-field synthesis (WFS). As has been discussed, wave-field synthesis is based on the principle of Huygens, according to which wave fronts can be formed and structured by superposing elementary waves. According to a mathematically exact theoretical description, an infinite number of sources at infinitely small spacing would have to be used for generating the elementary waves. In practice, however, a finite number of loudspeakers is used at a finite, small spacing from each other. According to the WFS principle, each of these loudspeakers is controlled by an audio signal from a virtual source, which has a certain delay and a certain level. Levels and delays are normally different for all loudspeakers.
In the audio domain, there is a so-called natural Doppler effect. This Doppler effect arises when a source emits an audio signal with a certain frequency, a receiver receives the signal, and the source moves relative to the receiver. Due to an “extension” or “compression” of the acoustic waveforms, this causes the frequency of the audio signal to change for the receiver according to the movement. Normally, a person is the receiver and hears this frequency change directly, for example when an ambulance with siren moves towards a person and then passes the person. The person will hear the siren with a different pitch when the ambulance is in front of him than when the ambulance is behind him.
A Doppler effect also exists in wave-field synthesis or sound field synthesis, respectively. It is physically based on the same background as the above-described natural Doppler effect. However, in contrast to the natural Doppler effect, there is no direct path between sender and receiver in sound field synthesis. Instead, a differentiation is made in that there is a primary transmitter and a primary receiver. In addition, a secondary transmitter and a secondary receiver exist. This scenario will be discussed below with reference to
In wave-field synthesis, the transmission between primary transmitter and primary receiver takes place “virtually”. This means that the wave-field synthesis algorithms are responsible for extension and compression of the wave front of the waveforms. At the time when a loudspeaker 704 receives a signal from the wave-field synthesis module, there is no audible signal at first. The signal only becomes audible after being output by the loudspeaker. Thereby, Doppler effects can occur at different locations.
If the virtual source moves relative to the loudspeakers, every loudspeaker reproduces a signal with different Doppler effect, depending on its specific position with regard to the moving virtual source, since the loudspeakers are in different positions and thus the relative movements are different for every loudspeaker.
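The position dependence of the Doppler effect can be illustrated with a small sketch that treats every loudspeaker as a stationary receiver of the moving virtual source. The function name `doppler_factor`, the speed of sound of 343 m/s and the use of the classical moving-source formula f'/f = c / (c - v_r) are assumptions made for illustration; the text does not prescribe a formula.

```python
import math


def doppler_factor(source_xy, velocity_xy, speaker_xy, c=343.0):
    """Ratio of received to emitted frequency at a stationary loudspeaker.

    Assumes the source is not located exactly at the loudspeaker position
    and moves slower than the speed of sound c (in m/s).
    """
    dx = speaker_xy[0] - source_xy[0]
    dy = speaker_xy[1] - source_xy[1]
    distance = math.hypot(dx, dy)
    # Component of the source velocity pointing towards the loudspeaker
    # (positive when the source approaches the loudspeaker).
    radial_speed = (velocity_xy[0] * dx + velocity_xy[1] * dy) / distance
    return c / (c - radial_speed)


# Source 2 m behind a three-element array, moving parallel to it at 10 m/s.
speakers = [(-2.0, 0.0), (0.0, 0.0), (2.0, 0.0)]
factors = [doppler_factor((0.0, -2.0), (10.0, 0.0), s) for s in speakers]
```

For this geometry the loudspeaker the source moves away from gets a factor below 1, the loudspeaker directly in front of the source gets exactly 1, and the loudspeaker the source moves towards gets a factor above 1, so the three loudspeaker signals carry different frequency shifts at the same instant.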
On the other hand, the listener can also move relative to the loudspeakers. However, particularly in a cinema setting, this is an insignificant case in practice, since the movement of the listener with regard to the loudspeakers will always be a relatively slow movement with a relatively small Doppler effect, since the Doppler shift, as it is known in the art, is proportional to the relative motion between transmitter and receiver.
The former Doppler effect, which means when the virtual source moves relative to the loudspeakers, can sound relatively natural but also very unnatural. This depends on the direction of the movement. If the source moves away from the center of the system or towards the same in a straight manner, a rather natural effect results. With reference to
However, if the virtual source 700 “encircles” the listener, as it is illustrated with regard to
It is an object of the present invention to provide an improved concept for calculating a discrete value at a current time of a component in a loudspeaker signal where artifacts due to Doppler effects are reduced.
In accordance with a first aspect, the present invention provides an apparatus for calculating a discrete value for a current time of a component in a loudspeaker signal for a loudspeaker based on a virtual source in a wave-field synthesis system with a wave-field synthesis module and a plurality of loudspeakers, wherein the wave-field synthesis module is formed to determine delay information by using an audio signal associated to the virtual source and by using position information indicating a position of the virtual source, indicating delayed by how many samples the audio signal is to occur with regard to a time reference in the component, having: a means for providing a first delay associated to a first position of the virtual source at a first time, and for providing a second delay associated to a second position of the virtual source at a second later time, wherein the second position differs from the first position and wherein the current time lies between the first time and the second time; a means for determining a value of the audio signal delayed by the first delay for the current time and for determining a second value of the audio signal delayed by the second delay for the current time; a means for weighting the first value with a first weighting factor to obtain a first weighted value, and a second value with a second weighting factor to obtain a second weighted value; and a means for summing the first weighted value and the second weighted value to obtain the discrete value for the current time.
In accordance with a second aspect, the present invention provides a method for calculating a discrete value for a current time of a component in a loudspeaker signal for a loudspeaker based on a virtual source in a wave-field synthesis system with a wave-field synthesis module and a plurality of loudspeakers, wherein the wave-field synthesis module is formed to determine delay information by using an audio signal associated to the virtual source and by using position information indicating a position of the virtual source, indicating delayed by how many samples the audio signal is to occur with regard to a time reference in the component, having the steps of: providing a first delay associated to a first position of the virtual source at a first time, and providing a second delay associated to a second position of the virtual source at a second later time, wherein the second position differs from the first position and wherein the current time lies between the first time and the second time; determining a value of the audio signal delayed by the first delay for the current time and determining a second value of the audio signal delayed by the second delay for the current time; weighting the first value with a first weighting factor to obtain a first weighted value, and a second value with a second weighting factor to obtain a second weighted value; and summing the first weighted value and the second weighted value to obtain the discrete value for the current time.
In accordance with a third aspect, the present invention provides a computer program with a program code for performing the method for calculating a discrete value for a current time of a component in a loudspeaker signal for a loudspeaker based on a virtual source in a wave-field synthesis system with a wave-field synthesis module and a plurality of loudspeakers, wherein the wave-field synthesis module is formed to determine delay information by using an audio signal associated to the virtual source and by using position information indicating a position of the virtual source, indicating delayed by how many samples the audio signal is to occur with regard to a time reference in the component, having the steps of: providing a first delay associated to a first position of the virtual source at a first time, and providing a second delay associated to a second position of the virtual source at a second later time, wherein the second position differs from the first position and wherein the current time lies between the first time and the second time; determining a value of the audio signal delayed by the first delay for the current time and determining a second value of the audio signal delayed by the second delay for the current time; weighting the first value with a first weighting factor to obtain a first weighted value, and a second value with a second weighting factor to obtain a second weighted value; and summing the first weighted value and the second weighted value to obtain the discrete value for the current time, when the program runs on a computer.
The present invention is based on the finding that Doppler effects can be taken into account, since they are part of the information required for position identification of a source. If such Doppler effects were omitted completely, the sound experience could be less than optimum, since the Doppler effect is natural: a non-optimum impression would result if, for example, a virtual source moved towards a listener but no Doppler shift of the audio frequency took place.
On the other hand, according to the invention, in order to “slur” the Doppler effect so that it is present but leads to no or only reduced artifacts, “panning” is performed from one position to another position. In the prior art, when a delay change occurs, which means when a change of position of the virtual source occurs, samples are simply inserted artificially during a reduced delay or samples are simply omitted during an increased delay. This causes sharp jumps in the signal. According to the invention, however, these sharp jumps are reduced by achieving a continuous transition from one position of the virtual source to another position of the virtual source. For this purpose, a discrete value is calculated for a current time in a panning region by using a sample of the audio signal of the virtual source at the first position valid for the current time, which means at the first time, and by using a sample of the audio signal of the virtual source at the second position associated to the current time, which means the second time.
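The sharp jumps caused by an abrupt delay change can be made visible with a toy example. The following sketch is not taken from the text: the helper names `delayed` and `hard_switch`, the ramp-shaped test signal and the zero padding outside the signal are assumptions chosen so that the discontinuity is easy to read off.

```python
def delayed(signal, delay, t):
    """Value at time index t of `signal` delayed by `delay` samples.

    Outside the stored signal the value is assumed to be 0.0.
    """
    i = t - delay
    return signal[i] if 0 <= i < len(signal) else 0.0


def hard_switch(signal, d1, d2, switch_time, length):
    """Read the signal with delay d1 before switch_time and d2 from then on."""
    return [
        delayed(signal, d1 if t < switch_time else d2, t)
        for t in range(length)
    ]


ramp = [float(i) for i in range(10)]  # 0.0, 1.0, ..., 9.0

# Delay changes from 0 to 2 samples at t = 5: the read position steps back.
up = hard_switch(ramp, 0, 2, 5, 10)

# Delay changes from 2 to 0 samples at t = 5: the read position steps ahead.
down = hard_switch(ramp, 2, 0, 5, 10)
```

In both directions the otherwise smooth ramp shows a step of two sample values at the switching time, which is the kind of discontinuity that becomes audible as an artifact.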
Preferably, panning occurs to the effect that at the first time when the first position changes and thus the first delay information is valid, a weighting factor for the audio signal delayed by the first delay is 100%, while a weighting factor for the audio signal delayed by the second delay is 0%, and that then an opposing change of the two weighting factors is performed from the first time to the second time in order to “pan” “smoothly” from the one position to the other position.
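The weighting and summing described above can be sketched as a cross-fade between the two delayed versions of the audio signal. The linear course of the weighting factors, the names `delayed` and `panned_value`, and the zero padding outside the signal are assumptions for illustration; the text only requires that the two factors change in opposite directions between the first time and the second time.

```python
def delayed(signal, delay, t):
    """Value at time index t of `signal` delayed by `delay` samples.

    Outside the stored signal the value is assumed to be 0.0.
    """
    i = t - delay
    return signal[i] if 0 <= i < len(signal) else 0.0


def panned_value(signal, d1, d2, t1, t2, t):
    """Discrete value at current time t with t1 <= t <= t2.

    m weights the value delayed by the first delay d1 (1 at t1, 0 at t2),
    n weights the value delayed by the second delay d2 (0 at t1, 1 at t2).
    The linear ramp is an assumed choice.
    """
    if not t1 <= t <= t2 or t1 == t2:
        raise ValueError("current time must lie between t1 and t2")
    n = (t - t1) / (t2 - t1)
    m = 1.0 - n
    return m * delayed(signal, d1, t) + n * delayed(signal, d2, t)


ramp = [float(i) for i in range(20)]
# Example values: delay D=0 at the first time t'=0, D=2 at the second time t'=9.
component = [panned_value(ramp, 0, 2, 0, 9, t) for t in range(10)]
```

At the first time the result equals the signal delayed by the first delay, at the second time it equals the signal delayed by the second delay, and in between the values move gradually from one version to the other instead of jumping.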
The inventive concept represents a tradeoff. On the one hand, a certain loss of position information is accepted, since new position information of the source is no longer considered at every new current time: the position update of the virtual source is performed in rather coarse steps, with panning between one position of the source and the second position of the source occurring at a later time. This is achieved by first determining the delay for relatively coarse spatial step widths, i.e. for position information relatively distant in time (of course considering the speed of the source). On the other hand, the delay change leading to the above-mentioned virtual Doppler effect between the primary transmitter and the primary receiver is thereby slurred, i.e. transformed continuously from one delay to the other. According to the invention, “panning” is performed via volume scaling from one position to the next to avoid spatial jumps and thereby audible “clicks”. “Hard” omitting or adding of samples due to a delay change is thus replaced by a signal shape that follows the hard signal shape but has rounded edges, so that the delay changes are accounted for, while the hard influence on a loudspeaker signal caused by a change of position of the virtual source, which leads to artifacts, is avoided.
These and other objects and features of the present invention will become clear from the following description taken in conjunction with the accompanying drawings, in which:
a is a waveform of a discrete audio signal of a virtual source at a first time with a first delay D=0;
b is a representation of the same audio signal as in
c is a first panned version based on the audio signals shown in
d is a further panning representation at a later time than
Before reference will be made in more detail to
As has been explained above, one wave-field synthesis module feeds a plurality of loudspeakers LS1, LS2, LS3, LSm by outputting loudspeaker signals via the outputs 210 to 216 to the individual loudspeakers. Via the input 206, the positions of the individual loudspeakers in a reproduction setting, such as a cinema, are provided to the wave-field synthesis module 200. In the cinema, many individual loudspeakers are grouped around the audience, preferably arranged in arrays such that loudspeakers are located in front of the audience, which means, for example, behind the screen, behind the audience, and on the right-hand side and left-hand side of the audience. Further, other inputs can be provided to the wave-field synthesis module 200, such as information about the room acoustics, etc., in order to be able to simulate in a cinema the actual room acoustics of the recording setting.
Generally, the loudspeaker signal, which is, for example, supplied to the loudspeaker LS1 via the output 210, will be a superposition of component signals of the virtual sources, in that the loudspeaker signal comprises for the loudspeaker LS1 a first component coming from the virtual source 1, a second component coming from the virtual source 2 as well as an n-th component coming from the virtual source n. The individual component signals are linearly superposed, which means added after their calculation to reproduce the linear superposition at the ear of the listener who will hear a linear superposition of the sound sources he can perceive in a real setting.
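The linear superposition of the component signals can be written down in a few lines. The function name `loudspeaker_signal` and the handling of components of unequal length (shorter components are treated as silent after their end) are assumptions for illustration.

```python
def loudspeaker_signal(components):
    """Add the component signals of all virtual sources sample by sample.

    `components` holds one list of samples per virtual source, i.e. the
    first to n-th component for one loudspeaker.
    """
    length = max(len(c) for c in components)
    return [
        sum(c[t] for c in components if t < len(c))
        for t in range(length)
    ]


# Three components for loudspeaker LS1, one per virtual source.
ls1 = loudspeaker_signal([
    [1.0, 2.0, 3.0],
    [0.5, 0.5],
    [0.0, 0.0, 1.0],
])
```

Because the operation is a plain sum, each component can be calculated independently per virtual source before the addition.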
In the following, a detailed design of the wave-field synthesis module 200 will be illustrated with regard to
As can be seen from
In the following, the mode of operation of the apparatus illustrated in
At the first time t′=0, which is further marked by 401 in
The audio signal shifted from the virtual source by D=2 is illustrated in
For suppressing the undesired characteristics and for suppressing the artifacts caused by this switching from one delay to another delay, the inventive apparatus shown in
Thus, the means 10 for providing provides on the output side a first delay 12a for the first time as well as a second delay 12b for the second time. Optionally, the means 10 is further formed to also output scaling factors for the two times apart from the delays, as will be discussed below.
The two delays at the outputs 12a, 12b of the means 10 are supplied to a means 14 for determining the value of the audio signal delayed by the first delay, which is supplied to means 14 via an input 16, for the current time (which can be signalized via an input 18) and for determining a second value of the audio signal delayed by the second delay for the current time. On the output side, the means 14 for determining provides first a first value A1(ti′) at a time ti′=tA of the audio signal delayed by the first delay, indicated by 20a in
Further, the inventive apparatus comprises a means 22 for weighting the first value of A1 with a first weighting factor to obtain a weighted first value 24a. Further, the means 22 is effective to weight the second value 20b from A4 with a second weighting factor n to obtain a second weighted value 24b. The two weighted values 24a and 24b are supplied to a means 26 for summing the two values to obtain a “panned” discrete value 28 for the current time of the component Kij in a loudspeaker signal for a loudspeaker j based on the virtual source i.
In the following, the functionality of the apparatus shown in
According to the invention, neither the value of A1 at a first time 401 nor the value of A4 at a second time 402 is modified. However, all values between t1 401 and t2 402 are modified according to the invention, which means values associated to a current time tA, which lies between the first time 401 and the second time 402. Thus, the current time extends from the times t′=1 to t′=8 for the subsequent exemplary explanation.
In mathematical terms, this is expressed in the graph in
Merely exemplarily, in the embodiment illustrated in
A “finer” slurring could be achieved when the position update interval PAI shown in
In the embodiment illustrated in
However, for the inventive panning, the current time tA has to lie between the first time 401 and the second time 402. The minimum “step width”, which means the minimum distance between the first time 401 and the second time 402, is two sample periods according to the invention, so that the current time between the first time 401 and the second time 402 can be processed with, for example, respective weighting factors of 0.5. In practice, however, a larger step width is preferred, on the one hand for computing time reasons and on the other hand for generating a panning effect, which would not occur if the following position were already reached at the next time, which would again lead to a natural Doppler effect as in conventional wave-field synthesis. The upper limit for the step width, which means for the distance from the first time 401 to the second time 402, results from the fact that with increasing distance more and more position information, which would actually be available, is ignored due to panning, which will, in the extreme case, lead to a loss of locatability of the virtual source for the listener. Thus, step widths in the medium range are preferred, which, depending on the embodiment, can additionally depend on the speed of the virtual source to realize an adaptive step width control.
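The relation between step width and weighting factors can be illustrated with a short sketch. The function name `weight_schedule` and the linear spacing of the factors are assumptions; only the minimum step width of two sample periods and the example factors of 0.5 come from the text.

```python
def weight_schedule(step_width):
    """Weighting factor pairs for the current times between two updates.

    `step_width` is the distance between the first and the second time in
    sample periods. For every current time strictly in between, a pair
    (first weighting factor, second weighting factor) is returned; a
    linear course of the factors is assumed.
    """
    if step_width < 2:
        raise ValueError("step width must be at least two sample periods")
    return [
        (1.0 - k / step_width, k / step_width)
        for k in range(1, step_width)
    ]


minimum = weight_schedule(2)  # one current time in between
wider = weight_schedule(4)    # three current times in between
```

With the minimum step width there is exactly one current time between the two updates, weighted 0.5 and 0.5; larger step widths yield more intermediate times and thus a more gradual transition.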
In the embodiment shown in
In
AWi = B(tA)*m*SF1 + B(tA)*n*SF2.
As can be seen from the above expression, the multiplication of a value of the audio signal with two weighting factors can, for simplification, be replaced by a single multiplication of the value with the product of the two weighting factors.
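This simplification can be checked numerically. In the following sketch the function name `scaled_panned_value` and the numeric values are hypothetical; the two audio values are passed separately, which is an assumption, and the same value can be passed twice if both terms use the same sample.

```python
def scaled_panned_value(value_1, value_2, m, n, sf1, sf2):
    """Weighted sum using one combined factor per term.

    m and n are the panning weighting factors, sf1 and sf2 the scaling
    factors belonging to the first and the second time.
    """
    w1 = m * sf1  # combined factor for the first term
    w2 = n * sf2  # combined factor for the second term
    return value_1 * w1 + value_2 * w2


combined = scaled_panned_value(2.0, 4.0, 0.75, 0.25, 0.5, 1.0)
# Same result when each value is multiplied by both factors in turn.
stepwise = (2.0 * 0.75) * 0.5 + (4.0 * 0.25) * 1.0
```

Combining the factors first saves one multiplication per term and sample whenever the same factor pair is reused.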
Depending on the circumstances, the inventive method as illustrated with regard to
While this invention has been described in terms of several preferred embodiments, there are alterations, permutations, and equivalents, which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
DE 103 21 980.3 | May 2003 | DE | national |
This application is a continuation of copending International Application No. PCT/EP2004/005047, filed May 11, 2004, which designated the United States and was not published in English, which claimed priority to German Patent Application No. 103 21 980.3, filed on May 15, 2003, and which is incorporated herein by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP04/05047 | May 2004 | US |
Child | 11257781 | Oct 2005 | US |