The present invention relates to a device and method for calculating loudspeaker signals for a plurality of loudspeakers while using filtering in the frequency domain, such as a wave field synthesis renderer device and a method of operating such a device.
In the field of consumer electronics there is a constant demand for new technologies and innovative products. An example here is reproducing audio signals as realistically as possible.
Methods of multichannel loudspeaker reproduction of audio signals have been known and standardized for many years. All conventional technologies have the disadvantage that both the positions of the loudspeakers and the locations of the listeners are already impressed onto the transmission format. If the loudspeakers are arranged incorrectly with regard to the listener, the audio quality will decrease significantly. Optimum sound is only possible within a small part of the reproduction space, the so-called sweet spot.
An improved natural spatial impression and increased envelopment in audio reproduction may be achieved with the aid of a new technique. The basics of said technique, so-called wave field synthesis (WFS), were investigated at the Technical University of Delft and were presented for the first time in the late 1980s (Berkhout, A. J.; de Vries, D.; Vogel, P.: Acoustic Control By Wavefield Synthesis. JASA 93, 1993).
As a result of the enormous requirements said method has placed upon computer performance and transmission rates, wave field synthesis has only been rarely used in practice up to now. It is only the progress made in the fields of microprocessor technology and audio coding that now allows said technique to be used in specific applications.
The fundamental idea of WFS is based on applying Huygens' principle of wave theory: each point that is hit by a wave is a starting point of an elementary wave, which propagates in the shape of a sphere or a circle.
When applied to acoustics, any sound field may be replicated by using a large number of loudspeakers arranged adjacently to one another (a so-called loudspeaker array). To this end the audio signal of each loudspeaker is generated from the audio signal of the source by applying a so-called WFS operator. In the simplest case, e.g., when reproducing a point source and a linear loudspeaker array, the WFS operator will correspond to amplitude scaling and to a time delay of the input signal. Application of said amplitude scaling and time delay will be referred to as scale & delay below.
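The scale & delay operation described above may be sketched as follows; this is an illustrative sketch only, and the function name and parameter values are chosen for illustration, not taken from the invention:

```python
import numpy as np

def scale_and_delay(x, gain, delay_samples):
    """Simplest WFS operator: amplitude scaling plus a time delay."""
    y = np.zeros(len(x) + delay_samples)
    y[delay_samples:] = gain * x   # shift the signal and scale its amplitude
    return y

x = np.array([1.0, 0.5, -0.25])
y = scale_and_delay(x, gain=0.5, delay_samples=2)
# y is [0.0, 0.0, 0.5, 0.25, -0.125]
```

In a renderer, one such operation would be evaluated per source/loudspeaker combination and the results summed per loudspeaker.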
In the case of a single point source to be reproduced and a linear arrangement of the loudspeakers, a time delay and amplitude scaling may be applied to the audio signal of each loudspeaker so that the emitted sound fields of the individual loudspeakers will superpose correctly. In the event of several sound sources, the contribution to each loudspeaker will be calculated separately for each source, and the resulting signals will be added. If the sources to be reproduced are located in a room having reflecting walls, reflections will also have to be reproduced as additional sources via the loudspeaker array. The effort in terms of calculation will therefore highly depend on the number of sound sources, the reflection properties of the recording room, and on the number of loudspeakers.
The advantage of this technique consists, in particular, in that a natural spatial sound impression is possible across a large part of the reproduction room. Unlike the known technologies, the direction and distance of sound sources are reproduced in a highly exact manner. To a limited extent, virtual sound sources may even be positioned between the real loudspeaker array and the listener.
Application of wave field synthesis provides good results if the preconditions assumed in theory such as ideal loudspeaker characteristics, regular, unbroken loudspeaker arrays, or free-field conditions for sound propagation are at least approximately met. In practice, however, said conditions are frequently not met, e.g. due to incomplete loudspeaker arrays or a significant influence of the acoustics of a room.
The acoustic conditions of an environment can be described by the impulse response of the environment.
This will be set forth in more detail by means of the following example. It shall be assumed that a loudspeaker emits a sound signal against a wall, the reflection of which is undesired.
For this simple example, room compensation while using wave field synthesis would consist in initially determining the reflection of said wall in order to find out when a sound signal which has been reflected by the wall arrives back at the loudspeaker, and which amplitude this reflected sound signal has. If the reflection by this wall is undesired, wave field synthesis offers the possibility of eliminating the reflection by this wall by impressing upon the loudspeaker—in addition to the original audio signal—a signal that is opposite in phase to the reflection signal and has a corresponding amplitude, so that the forward compensation wave cancels the reflection wave such that the reflection by this wall is eliminated in the environment contemplated. This may be effected in that initially, the impulse response of the environment is calculated, and the nature and position of the wall is determined on the basis of the impulse response of this environment. This involves representing the sound that is reflected by the wall by means of an additional WFS sound source, a so-called mirror sound source, the signal of which is generated from the original source signal by means of filtering and delay.
If the impulse response of this environment is measured, and if the compensation signal that is superposed onto the audio signal and impressed onto the loudspeaker is subsequently calculated, cancellation of the reflection by this wall will occur such that a listener in this environment will have the impression that this wall does not exist at all.
However, what is decisive for optimum compensation of the reflected wave is that the impulse response of the room be accurately determined, so that neither overcompensation nor undercompensation occurs.
Thus, wave field synthesis enables correct mapping of virtual sound sources across a large reproduction area. At the same time, it offers to the sound mixer and the sound engineer a new technical and creative potential in generating even complex soundscapes. Wave field synthesis, as developed at the Technical University of Delft at the end of the 1980s, represents a holographic approach to sound reproduction. The Kirchhoff-Helmholtz integral serves as its basis. Said integral states that any sound field within a closed volume may be generated by means of distributing monopole and dipole sound sources (loudspeaker arrays) on the surface of said volume.
In wave field synthesis, a synthesis signal is calculated for each loudspeaker of the loudspeaker array from an audio signal associated with a virtual source at a virtual position, the synthesis signals having such amplitudes and delays that a wave resulting from the superposition of the individual sound waves output by the loudspeakers existing within the loudspeaker array corresponds to the wave that would result from the virtual source at the virtual position if said virtual source at the virtual position were a real source having a real position.
Typically, several virtual sources are present at different virtual positions. The synthesis signals are calculated for each virtual source at each virtual position, so that typically, a virtual source results in synthesis signals for several loudspeakers. From the point of view of one loudspeaker, said loudspeaker will thus receive several synthesis signals stemming from different virtual sources. Superposition of said sources, which is possible due to the linear superposition principle, will then yield the reproduction signal actually emitted by the loudspeaker.
The possibilities of wave field synthesis may be exploited all the more, the larger the loudspeaker arrays are, i.e. the larger the number of individual loudspeakers provided. However, this also results in an increase in the computing performance that a wave field synthesis unit must provide since, typically, channel information is also taken into account. Specifically, this means that in principle, a dedicated transmission channel exists from each virtual source to each loudspeaker, and that in principle, the case may exist where each virtual source leads to a synthesis signal for each loudspeaker, and/or that each loudspeaker obtains a number of synthesis signals which is equal to the number of virtual sources.
If the possibilities of wave field synthesis are to be exploited, specifically, in cinema applications to the effect that the virtual sources may also be movable, it has to be noted that quite substantial computing operations have to be effected because of the calculation of the synthesis signals, the calculation of the channel information, and the generation of the reproduction signals by combining the channel information and the synthesis signals.
A further important expansion of wave field synthesis consists in reproducing virtual sound sources with complex, frequency-dependent directional characteristics. For each source/loudspeaker combination, convolution of the input signal with a specific filter is taken into account in addition to a delay, which typically causes the computing expenditure to exceed the capacity of existing systems.
According to an embodiment, a device for calculating loudspeaker signals for a plurality of loudspeakers while using a plurality of audio sources, an audio source having an audio signal, may have: a forward transform stage for transforming each audio signal, block-by-block, to a spectral domain so as to acquire for each audio signal a plurality of temporally consecutive short-term spectra; a memory for storing a plurality of temporally consecutive short-term spectra for each audio signal; a memory access controller for accessing a specific short-term spectrum among the plurality of temporally consecutive short-term spectra for a combination having a loudspeaker and an audio signal on the basis of a delay value; a filter stage for filtering the specific short-term spectrum for the combination of the audio signal and the loudspeaker by using a filter provided for the combination of the audio signal and the loudspeaker, so that a filtered short-term spectrum is acquired for each combination of an audio signal and a loudspeaker; a summing stage for summing up the filtered short-term spectra for a loudspeaker so as to acquire summed-up short-term spectra for each loudspeaker; and a backtransform stage for backtransforming, block-by-block, summed-up short-term spectra for the loudspeakers to a time domain so as to acquire the loudspeaker signals.
According to another embodiment, a method of calculating loudspeaker signals for a plurality of loudspeakers while using a plurality of audio sources, an audio source having an audio signal, may have the steps of: transforming each audio signal, block-by-block, to a spectral domain so as to acquire for each audio signal a plurality of temporally consecutive short-term spectra; storing a plurality of temporally consecutive short-term spectra for each audio signal; accessing a specific short-term spectrum among the plurality of temporally consecutive short-term spectra for a combination having a loudspeaker and an audio signal on the basis of a delay value; filtering the specific short-term spectrum for the combination of the audio signal and the loudspeaker by using a filter provided for the combination of the audio signal and the loudspeaker, so that a filtered short-term spectrum is acquired for each combination of an audio signal and a loudspeaker; summing up the filtered short-term spectra for a loudspeaker so as to acquire summed-up short-term spectra for each loudspeaker; and backtransforming, block-by-block, summed-up short-term spectra for the loudspeakers to a time domain so as to acquire the loudspeaker signals.
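As an illustration only, the method steps above may be sketched as a minimal frequency-domain pipeline. All names, block sizes, and the use of NumPy's FFT routines are assumptions made for this sketch; it restricts itself to delays that are integer multiples of the stride, and it checks the result against direct time-domain convolution:

```python
import numpy as np

rng = np.random.default_rng(0)

B = 64                 # stride
NFFT = 2 * B           # FFT block length; overlap of length NFFT - B = B
n_src, n_ls = 2, 3     # number of audio sources and loudspeakers
n_blocks = 8

# one FIR filter (length <= B) and one block-granular delay per combination
h = rng.standard_normal((n_src, n_ls, B))
delay_blocks = rng.integers(0, 3, size=(n_src, n_ls))
H = np.fft.rfft(h, NFFT)               # filter spectra, computed in advance

x = rng.standard_normal((n_src, n_blocks * B))   # source signals

fdl = [[] for _ in range(n_src)]       # a frequency-domain delay line per source
y = np.zeros((n_ls, n_blocks * B))     # loudspeaker signals

for n in range(n_blocks):
    # forward transform: one FFT per source (not per combination)
    for s in range(n_src):
        blk = np.zeros(NFFT)
        seg = x[s, max(n * B - B, 0):n * B + B]  # previous B + current B samples
        blk[NFFT - len(seg):] = seg
        fdl[s].append(np.fft.rfft(blk))
    # filter, delay by memory access, and sum per loudspeaker
    for l in range(n_ls):
        acc = np.zeros(NFFT // 2 + 1, dtype=complex)
        for s in range(n_src):
            k = n - delay_blocks[s, l]           # access an older spectrum
            if k >= 0:
                acc += fdl[s][k] * H[s, l]
        # backtransform: one IFFT per loudspeaker; discard the overlap
        y[l, n * B:(n + 1) * B] = np.fft.irfft(acc, NFFT)[B:]

# reference: direct convolution plus delay for every combination
ref = np.zeros_like(y)
for l in range(n_ls):
    for s in range(n_src):
        full = np.convolve(x[s], h[s, l])[:n_blocks * B]
        d = delay_blocks[s, l] * B
        ref[l, d:] += full[:n_blocks * B - d]
assert np.allclose(y, ref)
```

The point of the structure is visible in the loop nesting: the FFT count scales with the number of sources and the IFFT count with the number of loudspeakers, while only the cheap spectrum multiply-add runs per combination.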
Another embodiment may have a computer program having a program code for performing the method as claimed in claim 18 when the program code runs on a computer or processor.
The present invention is advantageous in that it provides, due to the combination of a forward transform stage, a memory, a memory access controller, a filter stage, a summing stage, and a backtransform stage, an efficient concept characterized in that the forward transform need not be calculated for each individual combination of audio source and loudspeaker, but only once for each individual audio source.
Similarly, the backtransform need not be calculated for each individual audio signal/loudspeaker combination, but only once for each loudspeaker. This means that the number of forward transform calculations equals the number of audio sources, and the number of backward transform calculations equals the number of loudspeaker signals and/or of the loudspeakers to be driven when a loudspeaker signal drives a loudspeaker. In addition, it is particularly advantageous that the introduction of a delay in the frequency domain is efficiently achieved by a memory access controller in that, on the basis of a delay value for an audio signal/loudspeaker combination, the stride used in the transform is exploited for said purpose. In particular, the forward transform stage provides for each audio signal a sequence of short-term spectra (STS) that are stored in the memory for each audio signal. The memory access controller thus has access to a sequence of temporally consecutive short-term spectra. On the basis of the delay value, that short-term spectrum is then selected from the sequence, for an audio signal/loudspeaker combination, which best matches the delay value provided by, e.g., a wave field synthesis operator. For example, if the stride from one short-term spectrum to the next is 20 ms, and if the wave field synthesis operator requires a delay of 100 ms, said entire delay may easily be implemented by not using, for the audio signal/loudspeaker combination considered, the most recent short-term spectrum in the memory but that short-term spectrum which is also stored and is the fifth one counting backwards. Thus, the inventive device is already able to implement a delay solely on the basis of the stored short-term spectra within a specific raster (grid) determined by the stride.
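The 20 ms / 100 ms example may be expressed numerically as follows; the sampling rate of 48 kHz is an assumption made only for this illustration:

```python
fs = 48000                      # assumed sampling rate
B = 960                         # block stride: 960 samples = 20 ms at 48 kHz
stride_ms = 1000 * B / fs       # 20.0 ms

delay_ms = 100                  # delay requested by the WFS operator
blocks_back = int(delay_ms // stride_ms)

# the memory access controller selects the short-term spectrum that lies
# blocks_back entries behind the most recent one in the delay line
assert blocks_back == 5
```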
If said raster is already sufficient for a specific application, no further measures need to be taken. However, if finer delay control is required, it may also be implemented, in the frequency domain, in that in the filter stage, for filtering a specific short-term spectrum, one uses a filter the impulse response of which has been prepended with a specific number of zeros. In this manner, finer delay granulation may be achieved, which now does not take place in time durations in accordance with the block stride, as is the case in the memory access controller, but in a considerably finer manner in time durations in accordance with a sampling period, i.e. with the time distance between two samples. If, in addition, even finer granulation of the delay is required, it may also be implemented, in the filter stage, in that the impulse response, which has already been supplemented with zeros, is implemented while using a fractional delay filter. In embodiments of the present invention, thus, any required delay values may be implemented in the frequency domain, i.e. between the forward transform and the backward transform, the major part of the delay being achieved simply by means of a memory access control; here, a granulation is achieved which is in accordance with the block stride and/or with the time duration corresponding to a block stride. If finer delays are required, said finer delays are implemented by modifying, in the filter stage, the filter impulse response for each individual combination of audio signal and loudspeaker in such a manner that zeros are inserted at the beginning of the impulse response.
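Sample-granular delay via a zero-prepended impulse response may be sketched as follows; the function name is an assumption for illustration, and the check against direct convolution only demonstrates that multiplying by the modified filter spectrum delays the output:

```python
import numpy as np

def delayed_filter_spectrum(h, extra_delay_samples, nfft):
    """Prepend zeros to the impulse response so that filtering in the
    frequency domain also applies a sample-granular delay."""
    h_delayed = np.concatenate([np.zeros(extra_delay_samples), h])
    assert len(h_delayed) <= nfft          # must still fit the FFT block
    return np.fft.rfft(h_delayed, nfft)

h = np.array([1.0, -0.5, 0.25])            # illustrative impulse response
x = np.random.default_rng(1).standard_normal(8)
nfft = 32                                  # large enough to avoid wrap-around

X = np.fft.rfft(x, nfft)
y = np.fft.irfft(X * delayed_filter_spectrum(h, 3, nfft), nfft)

# reference: convolve with the zero-prepended response in the time domain
ref = np.convolve(np.concatenate([np.zeros(3), h]), x)
assert np.allclose(y[:len(ref)], ref)
```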
This represents a delay in the time domain, as it were, which delay, however, is “imprinted” onto the short-term spectrum in the frequency domain in accordance with the invention, so that the delay being applied is compatible with fast convolution algorithms such as the overlap-save algorithm or the overlap-add algorithm and/or may be efficiently implemented within the framework provided by the fast convolution.
The present invention is suited, in particular, for static sources since static virtual sources also have static delay values for each audio signal/loudspeaker combination. Therefore, the memory access control may be fixedly set for each position of a virtual source. In addition, the impulse response for the specific loudspeaker/audio signal combination within each individual block of the filter stage may be preset already prior to performing the actual rendering algorithm. For this purpose, the impulse response required for said audio signal/loudspeaker combination is modified to the effect that an appropriate number of zeros is inserted at the start of the impulse response so as to achieve a more finely resolved delay. Subsequently, this impulse response is transformed to the spectral domain and stored there in an individual filter. In the actual wave field synthesis rendering calculation, one may then resort to the stored transfer functions of the individual filters in the individual filter blocks. When a static source transitions from one position to the next, e.g. at a time interval of 10 seconds, resetting of the memory access control and of the individual filters will be useful; these settings, however, may already be calculated in advance. Thus, the frequency-domain transfer functions of the individual filters may already be calculated in advance while the static source is still rendered at its old position, so that when the static source is to be rendered at its new position, the individual filter stages will already have stored therein transfer functions which were calculated on the basis of an impulse response with the appropriate number of zeros inserted.
An advantageous wave field synthesis renderer device and/or an advantageous method of operating a wave field synthesis renderer device includes N virtual sound sources providing sampling values for the source signals x0 . . . xN-1, and a signal processing unit producing, from the source signals x0 . . . xN-1, sampling values for M loudspeaker signals y0 . . . yM-1; a filter spectrum is stored in the signal processing unit for each source/loudspeaker combination; each source signal x0 . . . xN-1 is transformed into spectra by using several FFT calculation blocks of the block length L, the FFT calculation blocks comprising an overlap of the length (L-B) and a stride of the length B; each spectrum is multiplied by the associated filter spectra of the respectively same source, whereby filtered spectra are produced; access to the spectra is effected such that the loudspeakers are driven with a predefined delay with regard to one another in each case, said delay corresponding to an integer multiple of the stride B; all spectra of the respectively same loudspeaker j are added up, whereby the spectra Qj are produced; and each spectrum Qj is transformed, by using an IFFT calculation block, to the sampling values for the M loudspeaker signals y0 . . . yM-1.
In one implementation, block-wise shifting of the individual spectra may be exploited for producing a delay in the loudspeaker signals y0 . . . yM-1 by means of targeted access to the spectra. The computing expenditure for this delay depends only on the targeted access to the spectra, so that no additional computing power is required for introducing delays as long as the delay corresponds to an integer multiple of the stride B.
Overall, the invention thus relates to wave field synthesis of directional sound sources, or sound sources with directional characteristics. For real listening scenes and WFS setups consisting of several virtual sources and a large number of loudspeakers, the need to apply individual FIR filters for each combination of a virtual source and a loudspeaker frequently prevents implementation from being simple.
In order to reduce this fast increase in complexity, the invention proposes an efficient processing structure based on time/frequency techniques. Combining the components of a fast convolution algorithm into the structure of a WFS rendering system enables efficient reuse of operations and intermediate results and, thus, a considerable increase in efficiency. Even though the potential acceleration increases as the number of virtual sources and loudspeakers increases, substantial savings are achieved also for WFS setups of moderate sizes. In addition, the performance gains are relatively constant across a broad variety of parameter choices for the filter order and for the block delay value. Handling of time delays, which are inherently involved in sound reproduction techniques such as WFS, involves a modification of the overlap-save technique. This is efficiently achieved by partitioning the delay value and by using frequency-domain delay lines, i.e. delay lines implemented in the frequency domain.
Thus, the invention is not limited to rendering directional sound sources, or sound sources comprising directional characteristics, in WFS, but is also applicable to other processing tasks using an enormous amount of multichannel filtering with optional time delays.
An advantageous embodiment provides for the spectra to be produced in accordance with the overlap-save method. The overlap-save method is a method of fast convolution. It involves decomposing each input sequence x0 . . . xN-1 into mutually overlapping subsequences. Following this, those portions which match the aperiodic (linear) convolution are extracted from the periodic (cyclic) convolution products that have formed.
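The overlap-save decomposition may be sketched for a single channel as follows; the block size and all names are illustrative assumptions, and the result is checked against direct convolution:

```python
import numpy as np

def overlap_save(x, h, B):
    """Fast convolution via overlap-save: FFT blocks of length 2B with
    stride B; the first B (time-aliased) output samples of each block
    are discarded, the remaining B samples are kept."""
    nfft = 2 * B
    assert len(h) <= B + 1                 # keeps B samples per block valid
    H = np.fft.rfft(h, nfft)
    n_out = len(x) + len(h) - 1
    x_pad = np.concatenate([np.zeros(B), x, np.zeros(nfft)])
    y = np.zeros(n_out)
    for start in range(0, n_out, B):
        blk = x_pad[start:start + nfft]    # previous B + current B samples
        out = np.fft.irfft(np.fft.rfft(blk) * H, nfft)
        stop = min(start + B, n_out)
        y[start:stop] = out[B:B + stop - start]
    return y

x = np.random.default_rng(2).standard_normal(100)
h = np.random.default_rng(3).standard_normal(9)
assert np.allclose(overlap_save(x, h, B=16), np.convolve(x, h))
```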
A further advantageous embodiment provides for the filter spectra to be transformed from time-discrete impulse responses by means of an FFT. The filter spectra may be provided before the time-critical calculation steps are actually performed, so that calculation of the filter spectra does not influence the time-critical part of the calculation.
A further advantageous embodiment provides that each impulse response is preceded by a number of zeros such that the loudspeakers are mutually driven with a predefined delay which corresponds to the number of zeros. In this manner, it is possible to realize even delays which do not correspond to an integer multiple of the stride B. To this end, the desired delay is decomposed into two portions: The first portion is an integer multiple of the stride B, whereas the second portion represents the remainder. In such a decomposition, the second portion thus is invariably smaller than the stride B.
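The decomposition of a delay into a block part and a remainder described above may be sketched as follows; the stride value is an illustrative assumption:

```python
B = 960   # assumed stride in samples

def split_delay(delay_samples, B):
    """Split a delay into a block part (integer multiple of the stride B,
    realized by delayed access to the stored spectra) and a remainder
    (< B, realized by prepending zeros to the impulse response)."""
    blocks, remainder = divmod(delay_samples, B)
    return blocks, remainder

assert split_delay(4800, B) == (5, 0)      # exactly 5 blocks, no zeros needed
assert split_delay(5003, B) == (5, 203)    # 5 blocks plus 203 prepended zeros
```

By construction, the remainder returned here is invariably smaller than the stride B, as stated above.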
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
Thus, the memory access controller is configured to resort to a specific short-term spectrum among the plurality of short-term spectra for a combination of loudspeaker and audio signal on the basis of a delay value predefined for this audio signal/loudspeaker combination. The specific short-term spectra determined by the memory access controller 600 are then fed to a filter stage 300 for filtering the specific short-term spectra for combinations of audio signals and loudspeakers so as to there perform filtering with a filter provided for the respective combination of audio signal and loudspeaker, and to obtain a sequence of filtered short-term spectra for each such combination of audio signal and loudspeaker. The filtered short-term spectra are then fed to a summing stage 400 by the filter stage 300 so as to sum up the filtered short-term spectra for a loudspeaker such that a summed-up short-term spectrum is obtained for each loudspeaker. The summed-up short-term spectra are then fed to a backtransform stage 800 for the purpose of block-wise backtransform of the summed-up short-term spectra for the loudspeakers to the time domain, whereby the loudspeaker signals are obtained. The loudspeaker signals are thus output at an output 12 by the backtransform stage 800.
In one embodiment, wherein the device is a wave field synthesis device, the delay values 701 are supplied by a wave field synthesis operator (WFS operator) 700, which calculates the delay values 701 for each individual combination of audio signal and loudspeaker as a function of source positions fed in via an input 702 and as a function of the loudspeaker positions, i.e. those positions where the loudspeakers are arranged within the reproduction room, and which are supplied via an input 703. If the device is configured for a different application than wave field synthesis, e.g. for an Ambisonics implementation or the like, there will also exist an element corresponding to the WFS operator 700 which calculates delay values for individual loudspeaker signals and/or which calculates delay values for individual audio signal/loudspeaker combinations. Depending on the implementation, the WFS operator 700 will also calculate scaling values in addition to delay values, which scaling values can typically also be taken into account by a scaling factor in the filter stage 300. Said scaling values may also be taken into account by scaling the filter coefficients used in the filter stage 300, without causing any additional computing expenditure.
The memory access controller 600 may therefore be configured, in a specific implementation, to obtain delay values for different combinations of audio signal and loudspeaker, and to calculate an access value to the memory for each combination, as will be set forth with reference to
In particular, the WFS operator 700 is configured to provide a delay value D, as is depicted in step 20 of
The delay achieved by controlling the filter in step 24 may be interpreted as a delay in the “time domain” even though said delay is applied in the frequency domain, due to the specific implementation of the filter stage, to the specific short-term spectrum which has been read out—specifically while using the multiple Db—from the memory 200. Thus, the result is a splitting up into three blocks for the entire delay, as is depicted at 26 in
Subsequently, an advantageous implementation of the filter stage 300 will be discussed while referring to
In a step 30, an impulse response for an audio signal/loudspeaker combination is provided. For directional sound sources, in particular, one will have a dedicated impulse response for each combination of audio signal and loudspeaker. However, for other sources, too, there are different impulse responses at least for specific combinations of audio signal and loudspeaker. In a step 31, the number of zeros to be inserted, i.e. the value DA, is determined, as was depicted in
In the embodiment, the forward transform stage 100 is configured to determine the sequence of short-term spectra with the stride B from a sequence of temporal samples, so that a first sample of a first block of temporal samples converted into a short-term spectrum is spaced apart from a first sample of a second subsequent block of temporal samples by a number of samples which equals the stride value. The stride value is thus defined by the respectively first sample of the new block, said stride value being present, as will be set forth by means of
In addition, in order to enable optional storage in the memory 200, a time value associated with a short-term spectrum is advantageously stored as a block index which indicates the number of stride values by which the first sample of the short-term spectrum is temporally spaced apart from a reference value. The reference value is, e.g., the index 0 of the short-term spectrum at 249 in
In addition, the memory access means is advantageously configured to determine the specific short-term spectrum on the basis of the delay value and of the time value of the specific short-term spectrum in such a manner that the time value of the specific short-term spectrum equals, or is larger by 1 than, the integer result of a division of the time duration corresponding to the delay value by the time duration corresponding to the stride value. In one implementation, the integer result used is precisely that which yields a delay smaller than the delay actually required. Alternatively, however, one might also use the integer result plus one, said value being a “rounding-up”, as it were, of the delay actually required. In the event of rounding-up, a slightly too large delay is achieved, which may well suffice for many applications, however. Depending on the implementation, whether rounding-up or rounding-down is performed may be decided as a function of the amount of the remainder. For example, if the remainder is larger than or equal to 50% of the time duration corresponding to the stride, rounding-up may be performed, i.e. the value which is larger by one may be taken. In contrast, if the remainder is smaller than 50%, “rounding-down” may be performed, i.e. the very result of the integer division may be taken. Strictly speaking, one may speak of rounding-down only when the remainder is not implemented by other means, e.g. by inserting zeros.
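The rounding rule described above may be sketched as follows; the function name, the stride value, and the nearest-rounding flag are assumptions made for illustration:

```python
def block_offset(delay_samples, B, round_to_nearest=True):
    """Block-granular delay: round to a multiple of the stride B when no
    finer (zero-insertion) delay stage is available. Rounds up when the
    remainder is at least half the stride, otherwise rounds down."""
    blocks, remainder = divmod(delay_samples, B)
    if round_to_nearest and remainder * 2 >= B:
        return blocks + 1      # round up: slightly too large a delay
    return blocks              # round down: the remainder is dropped

assert block_offset(1000, 960) == 1   # remainder 40 < 480, round down
assert block_offset(1500, 960) == 2   # remainder 540 >= 480, round up
```

With `round_to_nearest=False`, the function always rounds down, which matches the case in which the remainder is instead realized by inserting zeros into the impulse response.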
In other words, the implementation presented above, comprising rounding-up and/or rounding-down, may be useful when a delay is achieved only with the granularity of a block length, i.e. when no finer delay is achieved by inserting zeros into an impulse response. However, if a finer delay is achieved by inserting zeros into an impulse response, rounding-down rather than rounding-up will be performed in order to determine the block offset.
In order to explain this implementation, reference shall be made to
A specific exemplary access controller might read out, for example for the implementation of
In a specific implementation as was already illustrated with reference to
Advantageously, the memory 200 includes, for each audio source, a frequency-domain delay line, or FDL, 201, 202, 203 of
As is shown in
In an advantageous embodiment, the forward transform stage 100 and the backtransform stage 800 are configured in accordance with an overlap-save method, which will be explained below by means of
Alternatively, both the forward transform stage 100 and the backtransform stage 800 may be configured to perform an overlap-add method. The overlap-add method, which is also referred to as segmented convolution, is also a method of fast convolution and is controlled such that an input sequence is decomposed into adjacent, non-overlapping blocks of samples with a stride B, as is depicted at 43. However, due to the attachment of zeros (also referred to as zero padding) to each block, as is shown at 44, said blocks become consecutive overlapping blocks. The input signal is thus split up into portions of the length B, which are then extended by the zero padding in accordance with step 44 so as to provide sufficient length for the result of the convolution operation. Subsequently, the blocks produced by step 44 and padded with zeros are transformed by the forward transform stage 100 in a step 45 so as to obtain the sequence of short-term spectra. Subsequently, in accordance with the processing performed in block 39 of
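The overlap-add alternative may be sketched for a single channel as follows; again, the block size and names are illustrative assumptions, and the result is checked against direct convolution:

```python
import numpy as np

def overlap_add(x, h, B):
    """Fast convolution via overlap-add: non-overlapping input blocks of
    length B are zero-padded, filtered in the frequency domain, and the
    overlapping tails of consecutive output blocks are summed."""
    nfft = 2 * B
    assert len(h) <= B + 1                 # block convolution must fit nfft
    H = np.fft.rfft(h, nfft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), B):
        blk = x[start:start + B]           # adjacent, non-overlapping block
        out = np.fft.irfft(np.fft.rfft(blk, nfft) * H, nfft)
        stop = min(start + nfft, len(y))
        y[start:stop] += out[:stop - start]   # sum the overlapping tails
    return y

x = np.random.default_rng(4).standard_normal(100)
h = np.random.default_rng(5).standard_normal(9)
assert np.allclose(overlap_add(x, h, B=16), np.convolve(x, h))
```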
Depending on the implementation, the forward transform stage 100 and the backtransform stage 800 are configured as individual FFT blocks as shown in
As was already depicted by means of
There are several approaches to producing directional sound sources, or sound sources having directional characteristics, while using wave field synthesis. In addition to experimental results, most approaches are based on expanding the sound field into circular or spherical harmonics. The approach presented here also uses an expansion of the sound field of the virtual source into circular harmonics so as to obtain a driving function for the secondary sources. This driving function will also be referred to as a WFS operator below.
The following representation is an exemplary description of the wave field synthesis process. Alternative descriptions and implementations are also known. The sound field of the primary source ψ is generated in the region y<yL by using a linear distribution of secondary monopole sources along x (black dots).
Using the geometry of
It states that the sound pressure PR ({right arrow over (r)}R,{right arrow over (r)},ω) of a primary sound source may be generated at the receiver position R by using a linear distribution of secondary monopole line sources at y=yL. To this end, the velocity V{right arrow over (n)}({right arrow over (r)},ω) of the primary source ψ at the positions of the secondary sources must be known along its normal {right arrow over (n)}. In equation (1), ω is the angular frequency, c is the speed of sound, and
is the Hankel function of the second kind of order 0. The path from the primary source position to the secondary source position is designated by {right arrow over (r)}. By analogy, {right arrow over (r)}R is the path from the secondary source to the receiver R. The two-dimensional sound field emitted by a primary source ψ with any desired directional characteristic may be described by an expansion into circular harmonics.
wherein S(ω) is the spectrum of the source, and α is the azimuth angle of the vector {right arrow over (r)}. {hacek over (C)}v(2)(ω) are the circular-harmonics expansion coefficients of order v. While using the equation of motion, the WFS secondary source driving function Q ( . . . ) is indicated as
In order to obtain synthesis operators that can be realized, two assumptions are made: first of all, real loudspeakers behave rather like point sources if the size of the loudspeaker is small as compared to the emitted wavelength. Therefore, the secondary source driving function should use secondary point sources rather than line sources. Secondly, what is contemplated here is only the efficient processing of the WFS driving function. While calculation of the Hankel function involves a relatively large amount of effort, the near-field directional behavior is of relatively little importance from a practical point of view.
As a result, only the far-field approximation of the Hankel function is applied to the secondary and primary source descriptions (1) and (2). This results in the secondary source driving function
Consequently, the synthesis integral may be expressed as
For a virtual source having ideal monopole characteristics, the directivity term of the source driving function becomes simpler and results in G(ω,α)=1. In this case, only a gain
a delay term
corresponding to a frequency-independent time delay of
and a constant phase shift of j are applied to the secondary source signal.
In addition to the synthesis of monopole sources, a common WFS system enables reproduction of planar wave fronts, which are referred to as plane waves. These may be considered as monopole sources arranged at an infinite distance. As in the case of monopole sources, the resulting synthesis operator consists of a static filter, a gain factor, and a time delay.
For complex directional characteristics, the gain factor A( . . . ) becomes dependent on the directional characteristic, the alignment and the frequency of the virtual source as well as on the positions of the virtual and secondary sources. Consequently, the synthesis operator contains a non-trivial filter, specifically for each secondary source
As in the case of fundamental types of sources, the delay may be extracted from (4) from the propagation time between the virtual and secondary sources
For practical rendering, time-discrete filters for the directional characteristics are determined from the frequency response (8). Because of their ability to approximate arbitrary frequency responses and their inherent stability, only FIR filters will be considered here. These directivity filters will be referred to as hm,n[k] below, wherein n=0, . . . , N−1 designates the virtual-source index, m=0, . . . , M−1 is the loudspeaker index, and k is a time domain index. K is the order of the directivity filter. Since such filters are needed for each combination of N virtual sources and M loudspeakers, the filter design needs to be relatively efficient.
Here, a simple window (or frequency sampling) design is used. The desired frequency response (9) is evaluated at K+1 equidistantly sampled frequency values within the interval 0≤ω<2π. The discrete filter coefficients hm,n[k], k=0, . . . , K are obtained by an inverse discrete Fourier transform (IDFT) and by applying a suitable window function w[k] so as to reduce the Gibbs phenomenon caused by truncating the impulse response.
hm,n[k]=w[k]·IDFT{AD({right arrow over (r)}R,{right arrow over (r)},ω,α)} (10)
Implementing this design method enables several optimizations. First, due to the conjugate symmetry of the frequency response AD({right arrow over (r)}R,{right arrow over (r)},ω,α), this function is evaluated only for approximately half of the raster points. Secondly, several parts of the secondary source driving function, e.g. the expansion coefficients {hacek over (C)}v(2)(ω), are identical for all of the driving functions of any given virtual source and, therefore, are calculated only once. The directivity filters hm,n[k] introduce synthesis errors in two ways. On the one hand, the limited filter order results in an incomplete approximation of AD({right arrow over (r)}R,{right arrow over (r)},ω,α). On the other hand, the infinite summation of (4) is replaced by a finite one. As a result, the beam width of the generated directional characteristics cannot become arbitrarily narrow.
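The window (frequency sampling) design of equation (10) may be sketched as follows. This is illustrative only: a naive IDFT and a Hann window stand in for whatever transform and window function w[k] an actual implementation would use, and the sanity check uses a pure delay as the desired response so that the recovered impulse response is known exactly.

```python
import cmath, math

def frequency_sampling_fir(desired, window=None):
    """Frequency-sampling FIR design: `desired` holds the target frequency
    response at K+1 equidistant points in [0, 2*pi); the coefficients are
    the (real part of the) IDFT of those samples, multiplied by an optional
    window w[k] that tames the Gibbs phenomenon caused by truncation."""
    n = len(desired)
    h = [sum(desired[k] * cmath.exp(2j * cmath.pi * k * t / n)
             for k in range(n)).real / n
         for t in range(n)]
    if window is not None:
        h = [window(t, n) * h[t] for t in range(n)]
    return h

def hann(t, n):
    # Hann window, one common choice for w[k]
    return 0.5 - 0.5 * math.cos(2 * math.pi * t / (n - 1))

# Sanity check: sampling the response of a pure 2-sample delay,
# exp(-2*pi*i*k*2/n), recovers the impulse response delta(t - 2).
n = 8
desired = [cmath.exp(-2j * cmath.pi * k * 2 / n) for k in range(n)]
h = frequency_sampling_fir(desired)
assert abs(h[2] - 1.0) < 1e-9
assert all(abs(h[t]) < 1e-9 for t in range(n) if t != 2)
```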
WFS processing is generally implemented as a time-discrete processing system. It consists of two general tasks: calculating the synthesis operator and applying this operator to the time-discrete source signals. The latter will be referred to as WFS rendering in the following.
The impact of the synthesis operator on the overall complexity is typically low since said synthesis operator is calculated relatively rarely. If the source properties change in a discrete manner only, the operator will be calculated as needed. For continuously changing source properties, e.g. in the case of moving sound sources, it is typically sufficient to calculate said values on a coarse grid and to use simple interpolation methods in between.
In contrast to this, application of the synthesis operator to the source signals is performed at the full audio sampling rate.
The number of scale and delay operations is given by the product of the number of virtual sources N and the number of loudspeakers M; this product typically reaches high values. Consequently, the scale and delay operation is, in terms of performance, the most critical part of most WFS systems, even if only integer delays are used.
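The N·M scale & delay structure may be illustrated by the following time-domain sketch. The gains and integer delays are given as plain per-source, per-loudspeaker tables; all names and the toy signal are illustrative assumptions, not taken from the described system.

```python
def scale_and_delay_render(sources, gains, delays, num_samples):
    """Direct time-domain WFS rendering for point sources: each loudspeaker
    signal is the sum over all virtual sources of a scaled, integer-delayed
    copy of the source signal, i.e. N*M scale & delay operations per sample."""
    n_src = len(sources)
    n_spk = len(gains[0])
    out = [[0.0] * num_samples for _ in range(n_spk)]
    for n in range(n_src):
        for m in range(n_spk):
            g, d = gains[n][m], delays[n][m]
            for k in range(num_samples):
                if 0 <= k - d < len(sources[n]):
                    out[m][k] += g * sources[n][k - d]
    return out

# One source, two loudspeakers: gain 0.5 with no delay, gain 1.0 delayed by 2.
out = scale_and_delay_render([[1.0, 2.0, 3.0]],
                             gains=[[0.5, 1.0]], delays=[[0, 2]],
                             num_samples=5)
assert out[0] == [0.5, 1.0, 1.5, 0.0, 0.0]
assert out[1] == [0.0, 0.0, 1.0, 2.0, 3.0]
```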
By means of
In order to substantially reduce the required computing resources, the invention proposes a signal processing scheme based on two interacting effects.
The first effect relates to the fact that the efficiency of FIR filters may frequently be increased by using fast convolution methods in the transform domain, such as overlap-save or overlap-add, for example. Generally, said algorithms transform segments of the input signal to the frequency domain by means of fast Fourier transform (FFT) techniques, perform the convolution as a frequency-domain multiplication, and transform the signal back to the time domain. Even though the actual performance highly depends on the hardware, the filter order at which transform-based filtering becomes more efficient than direct convolution typically ranges between 16 and 50. For overlap-add and overlap-save algorithms, the forward and inverse FFT operations constitute the largest part of the computational expenditure.
Advantageously, only the overlap-save method is taken into account since it involves no addition of components of adjacent output blocks. In addition to the reduced arithmetic complexity as compared to overlap-add, this property results in a simpler control logic for the proposed processing scheme.
A further embodiment for reducing the computational expenditure exploits the structure of the WFS processing scheme. On the one hand, each input signal is here used for a large number of delay and filtering operations. On the other hand, the results for a large number of sound sources are summed for each loudspeaker. Thus, a partitioning of the signal processing algorithm which performs common operations only once for each input or output signal promises gains in efficiency. Generally, such a partitioning of the WFS rendering algorithm results in considerable improvements in performance for moving sound sources of the fundamental source types.
When transform-based fast convolution is employed for rendering directional sound sources, or sound sources having directional characteristics, the forward and inverse Fourier transform operations are obvious candidates for said partitioning. The resulting processing scheme is shown in
As was explained by means of
Conceptually, an arbitrary time delay may readily be built into the FIR directivity filter. Due to the large range of delay values in a typical WFS system, however, this approach results in very long filter lengths and, thus, in large FFT block sizes. On the one hand, this considerably increases the computational expenditure and the storage requirements. On the other hand, the latency caused by forming input blocks of such large FFT sizes is not acceptable for many applications.
For this reason, a processing scheme is proposed here which is based on a frequency-domain delay line and on a partitioning of the delay value. Similarly to the conventional overlap-save method, the input signal is segmented into overlapping blocks of the size L with a stride (or delay block size) B between adjacent blocks. The blocks are transformed to the frequency domain and are designated by Xn[l], wherein n designates the source and l is the block index. These blocks are stored in a structure which enables indexed access of the form Xn[l−i] to the most recent frequency domain blocks. Conceptually, this data structure is identical with the frequency-domain delay lines used within the context of partitioned convolution.
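The indexed-access structure Xn[l−i] may be sketched, for a single source, as a ring buffer holding the most recent frequency-domain blocks. The class name and interface below are illustrative assumptions and not part of the described system.

```python
from collections import deque

class FrequencyDomainDelayLine:
    """Stores the most recent frequency-domain input blocks of one source;
    get(i) returns X_n[l - i], i.e. the block pushed i block-strides ago.
    Older blocks beyond the maximum block delay are discarded automatically."""
    def __init__(self, max_block_delay):
        self.blocks = deque(maxlen=max_block_delay + 1)

    def push(self, spectrum):
        # called once per block clock l with the newest spectrum X_n[l]
        self.blocks.appendleft(spectrum)

    def get(self, block_delay):
        # indexed access X_n[l - D_b]
        return self.blocks[block_delay]

fdl = FrequencyDomainDelayLine(max_block_delay=3)
for l in range(4):
    fdl.push([complex(l)])            # stand-in spectra tagged by block index
assert fdl.get(0) == [complex(3)]     # current block X_n[l]
assert fdl.get(2) == [complex(1)]     # block from two strides ago, X_n[l-2]
```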
The delay value D, indicated in samples, is partitioned into a multiple of the block delay quantity B and a remainder Dr or Dr′:

D=Db·B+Dr with 0≤Dr≤B−1, Db∈ℕ. (11)
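The partitioning of equation (11) amounts to an integer division by the block stride, as the following sketch illustrates (the function name is chosen for illustration only):

```python
def partition_delay(d_samples, stride_b):
    """Split a delay D (in samples) per equation (11): D = Db*B + Dr with
    0 <= Dr <= B-1. Db selects the frequency-domain delay-line block,
    while Dr is absorbed into the directivity filter."""
    d_b, d_r = divmod(d_samples, stride_b)
    return d_b, d_r

# E.g. a 2500-sample delay with block stride B = 1024:
assert partition_delay(2500, 1024) == (2, 452)   # 2*1024 + 452 = 2500
```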
The block delay Db is applied as an indexed access to the frequency-domain delay line. By contrast, the remaining part is incorporated into the directivity filter hm,n[k], which is formally expressed by a convolution with the delay operator δ(k−Dr):
hm,nd[k]=hm,n[k]*δ(k−Dr). (12)
For integer delay values, this operation corresponds to preceding hm,n[k] with Dr zeros. The resulting filter is padded with zeros in accordance with the requirements of the overlap-save operation. Subsequently, the frequency-domain filter representation Hm,nd is obtained by means of an FFT.
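The construction of the frequency-domain filter Hm,nd according to equation (12) and the subsequent zero padding may be sketched as follows. A naive DFT stands in for the FFT, and the function name is an illustrative assumption.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def delay_embedded_filter(h, d_r, block_size_l):
    """Per equation (12): prepend Dr zeros to the directivity filter h
    (the residual delay), then zero-pad to the overlap-save block size L
    and transform to obtain the frequency-domain filter H^d."""
    h_d = [0.0] * d_r + list(h)                  # h * delta(k - Dr)
    assert len(h_d) <= block_size_l, "filter plus Dr zeros must fit in L"
    h_d = h_d + [0.0] * (block_size_l - len(h_d))  # zero padding for overlap-save
    return h_d, dft(h_d)

# A length-2 filter delayed by Dr = 3 residual samples, padded to L = 8:
h_d, H_d = delay_embedded_filter([1.0, 0.5], d_r=3, block_size_l=8)
assert h_d == [0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0]
```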
The frequency-domain representation of the signal component from the source n to the loudspeaker m is calculated as
Cm,n[l]=Hm,nd·Xn[l−Db] (13)
wherein · designates an element-by-element complex multiplication. The frequency-domain representation of the driving signal for the loudspeaker m is determined by accumulating the corresponding component signals, which is implemented as a complex-valued vector addition
The remainder of the algorithm is identical with the ordinary overlap-save algorithm. The blocks Ym[l] are transformed to the time domain, and the loudspeaker driving signals ym[k] are formed by deleting a predetermined number of samples from each time domain block. This signal processing structure is schematically shown in
The lengths of the transformed segments and the shift between adjacent segments follow from the derivation of the conventional overlap-save algorithm. A linear convolution of a segment of the length L with a sequence of the length P, P<L, corresponds to a complex multiplication of two frequency domain vectors of the size L and yields L−P+1 output samples. Thus, the input segments are shifted by this amount, subsequently referred to as B=L−P+1. Conversely, in order to obtain B output samples from each input segment for a convolution with an FIR filter of the order K (length P=K+1), the transformed segments have a length of
L=K+B. (15)
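The relation B=L−P+1 may be verified with the following overlap-save sketch, which discards the first P−1 time-aliased samples of each block and then reproduces direct linear convolution. A naive DFT pair stands in for the FFT, and all names and the tiny sizes are illustrative.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(xf):
    n = len(xf)
    return [sum(xf[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def overlap_save(x, h, fft_size_l):
    """Overlap-save: each length-L input segment overlaps the previous one
    by P-1 samples (stride B = L - P + 1); after the circular convolution
    in the frequency domain, the first P-1 samples of each output block
    are time-aliased and discarded."""
    p = len(h)
    b = fft_size_l - p + 1                       # valid output samples per block
    hf = dft(list(h) + [0.0] * (fft_size_l - p))
    padded = [0.0] * (p - 1) + list(x)           # prime the overlap with zeros
    y = []
    for start in range(0, len(x), b):
        seg = padded[start:start + fft_size_l]
        seg = seg + [0.0] * (fft_size_l - len(seg))
        block = idft([a * c for a, c in zip(dft(seg), hf)])
        y.extend(block[p - 1:])                  # keep only the B valid samples
    return y[:len(x)]

x = [1.0, -1.0, 2.0, 0.0, 3.0, 1.0]
h = [0.5, 0.25, 0.125]
direct = [sum(h[j] * x[i - j] for j in range(len(h)) if 0 <= i - j < len(x))
          for i in range(len(x))]
assert all(abs(a - b) < 1e-9
           for a, b in zip(overlap_save(x, h, fft_size_l=5), direct))
```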
If the integer part of the remainder Dr of the delay is embedded into the filter hm,nd[k] in accordance with (12), the required order of hm,nd[k] becomes K′=K+B−1. This is due to the fact that hm,nd[k] is preceded by a maximum of B−1 zeros, which is the maximum value for Dr (11). Thus, the segment length required for the proposed algorithm is given by
L=K+2B−1. (16)
So far, only integer sample delay values D have been taken into account. However, the proposed processing scheme may be extended to arbitrary delay values by incorporating a fractional-delay (FD) filter into the directivity filter hm,nd[k]. Here, only FIR FD filters are taken into account since they may readily be integrated into the proposed algorithm. To this end, the residual delay Dr is partitioned into an integer part Dint and a fractional delay value d, as is customary in FD filter design. The integer part is integrated into hm,nd[k] by preceding hm,n[k] with Dint zeros. The fractional delay value is applied to hm,nd[k] by convolving it with an FD filter designed for this fractional value d. Thus, the required order of hm,nd[k] is increased by the order KFD of the FD filter, and the required block size L (16) changes to
L=K+KFD+2B−1. (17)
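As one standard FIR FD design (not necessarily the one intended here), a Lagrange-interpolation fractional-delay filter of order KFD may be computed as follows; for order 1 and d=0.5 it reduces to linear interpolation between adjacent samples.

```python
def lagrange_fd_filter(d, order):
    """Lagrange fractional-delay FIR filter: coefficients
    h[t] = prod_{j != t} (d - j) / (t - j) for t = 0..order,
    approximating a delay of d samples (0 <= d <= order)."""
    coeffs = []
    for t in range(order + 1):
        c = 1.0
        for j in range(order + 1):
            if j != t:
                c *= (d - j) / (t - j)
        coeffs.append(c)
    return coeffs

# Half-sample delay with a first-order filter: plain linear interpolation.
assert lagrange_fd_filter(0.5, 1) == [0.5, 0.5]
# d = 0 degenerates to the identity filter.
assert lagrange_fd_filter(0.0, 1) == [1.0, 0.0]
```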
However, the advantages of using arbitrary delay values are limited. It has been shown that fractional delay values are beneficial only for moving virtual sources; they have no positive effect on the quality of static sources. On the other hand, the synthesis of moving directional sound sources, or sound sources having directional characteristics, would entail a constant temporal variation of the synthesis filters, the design of which would dominate the overall complexity of rendering in a simple implementation.
In a next step, fast convolution in accordance with the overlap-save method (OS) as well as a backtransform with an IFFT to the loudspeaker signals y0 . . . yM-1 is performed at stage 503. What is decisive here is the manner in which access to the spectra occurs. By way of example, access operations 504, 505, 506, and 507 are depicted in the figure. In relation to the time of the access operation 507, access operations 504, 505, and 506 are in the past.
If the loudspeaker 511 is driven by means of the access operation 507 and if, simultaneously, loudspeakers 510, 512 are driven by means of the access operation 506, it seems to the listener as if the loudspeaker signals of the loudspeakers 510, 512 are delayed as compared to the loudspeaker signal of the loudspeaker 511. The same applies to the access operation 505 and the loudspeaker signals of the loudspeakers 509, 513 as well as to the access operation 504 and to the loudspeaker signals of the loudspeakers 508, 514.
In this manner, each individual loudspeaker may be driven with a delay corresponding to a multiple of the block stride B. If further delay is to be provided which is smaller than the block stride B, this may be achieved by preceding the corresponding impulse response of the filter, which is the subject of the overlap-save operation, with zeros.
In order to evaluate the potential increase in efficiency achieved by the proposed processing structure, a performance comparison is provided here which is based on the number of arithmetic commands. It should be understood that this comparison can only provide rough estimates of the relative performances of the different algorithms. The actual performance may differ depending on the characteristics of the actual hardware architecture. Performance characteristics of, in particular, the FFT operations involved differ considerably, depending on the library used, the actual FFT sizes, and the hardware. In addition, the memory capacity of the hardware used may have a critical impact on the efficiency of the algorithms compared. For this reason, the memory requirements for the filter coefficients and the delay line structures, which are the main sources of memory consumption, are also indicated.
The main parameters determining the complexity of a rendering algorithm for directional sound sources, or sound sources having directional characteristics, are the number of virtual sources N, the number of loudspeakers M, and the order K of the directivity filters. For methods based on fast convolution, the shift between adjacent input blocks, which is also referred to as the block delay B, affects both performance and memory requirements. In addition, the block-by-block operation of the fast convolution algorithms introduces an implementation latency of B−1 samples. The maximally allowed delay value, which is referred to as Dmax and is indicated as a number of samples, influences the required memory size for the delay line structures.
Three different algorithms are compared: linear convolution, filter-by-filter fast convolution, and the proposed processing structure. The method based on linear convolution performs NM time domain convolutions of order K. This amounts to NM(2K+1) commands per sample. In addition, M(N−1) real additions are required for accumulating the loudspeaker driving signals. The memory required for an individual delay line is Dmax+K floating-point values. Each of the MN FIR filters hm,n[k] requires K+1 floating-point memory words. These performance numbers are summarized in the following table. The table shows a performance comparison for wave field synthesis signal processing schemes for directional sound sources, or sound sources having directional characteristics. The number of commands is indicated for calculating a sample for all of the loudspeakers. The memory requirements are specified as numbers of floating-point values.
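The command count of the linear convolution method may be reproduced with the following sketch (the function name is illustrative):

```python
def linear_convolution_commands(n_sources, m_speakers, filter_order):
    """Commands per output sampling clock for direct time-domain rendering:
    N*M FIR filters of order K (2K+1 commands each) plus M*(N-1) additions
    to accumulate the loudspeaker driving signals."""
    nm = n_sources * m_speakers
    return nm * (2 * filter_order + 1) + m_speakers * (n_sources - 1)

# Example configuration from the comparison: N=16, M=128, K=1023.
assert linear_convolution_commands(16, 128, 1023) == 2048 * 2047 + 128 * 15
```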
The second algorithm, referred to as filter-by-filter fast convolution, calculates the MN FIR filters separately while using the overlap-save fast convolution method. In accordance with (15), the FFT block size needed to calculate B samples per block is L=K+B. For each filter, a real-valued FFT of the size L and an inverse FFT of the same size are performed. A command count of pL log2(L) is assumed for a forward or inverse FFT of the size L, wherein p is a proportionality constant which depends on the actual implementation; p may be assumed to have a value between 2.5 and 3.
Since the frequency transforms of real-valued sequences are symmetrical, the complex vector multiplication of the length L, which is performed in the overlap-save method, requires only approximately L/2 complex multiplications. Since a single complex multiplication is implemented by 6 arithmetic commands, the effort involved in one vector multiplication amounts to 3L commands. Thus, filtering while using the overlap-save method requires
for one single output sample on all loudspeaker signals. Similarly to the direct convolution algorithm, the effort involved in accumulating the loudspeaker signals amounts to M(N−1) commands. The delay line memory is identical with that of the linear convolution algorithm. In contrast, the memory requirements for the filters are increased due to the zero padding of the filters hm,n[k] prior to the frequency transform. It is to be noted that the frequency domain representation of a real filter of the length L may be stored in L real-valued floating-point values because of the symmetry of the transformed sequence.
For the proposed efficient processing scheme, the block size for a block delay B equals L=K+2B−1 (16). Thus, a single FFT or inverse FFT operation requires p(K+2B−1)log2(K+2B−1) commands. However, only N forward and M inverse FFT operations are required for each audio block. The complex multiplication and addition are each performed on the frequency domain representation and require 3(K+2B−1) and K+2B−1 commands, respectively, for each symmetrical frequency domain block of the length K+2B−1. Since each processed block yields B output samples, the overall number of commands for one sampling clock iteration amounts to
Since the frequency-domain delay line stores the input signals in blocks of the size L, with a shift of B, the number of memory positions required for one single input signal is
By analogy therewith, a frequency-transformed filter requires K+2B−1 memory words.
In order to evaluate the relative performance of these algorithms, an exemplary wave field synthesis rendering system shall be assumed which comprises 16 virtual sources, 128 loudspeaker channels, directivity filters of order 1023, and a block delay of 1024. Each parameter is varied separately so as to evaluate its influence on the overall complexity.
The influence of the number of loudspeakers is shown in
The effect of the order of the directivity filters is examined in
In
For the contemplated configuration (N=16, M=128, K=1023, B=1024) and a maximum delay value Dmax=48000, which corresponds to a delay of one second at a sampling frequency of 48 kHz, the linear convolution algorithm requires approximately 2.9·10^6 memory words. For the same parameters, the filter-by-filter fast convolution algorithm uses approximately 5.0·10^6 floating-point memory positions. The increase is due to the size of the pre-calculated frequency domain filter representations. The proposed algorithm requires approximately 8.6·10^6 memory words due to the frequency-domain delay line and the increased block size for the frequency domain representations of the input signal and of the filters. Thus, the performance improvement of the proposed algorithm as compared to filter-by-filter fast convolution is obtained at the cost of a memory increase of about 72.7%. The proposed algorithm may thus be regarded as a space-time trade-off which uses additional memory in order to store pre-calculated results, such as frequency-domain representations of the input signal, so as to enable a more efficient implementation.
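The quoted memory figures may be reproduced with the following sketch, using the 128 loudspeaker channels stated for the example system. The number of blocks held by the frequency-domain delay line is assumed to be ceil(Dmax/B) per source; this assumption and all function names are illustrative, but the resulting totals match the approximately 2.9·10^6, 5.0·10^6, and 8.6·10^6 words and the 72.7% increase quoted above.

```python
import math

def memory_words(n, m, k, b, d_max):
    """Approximate floating-point memory words for the three schemes:
    time-domain delay lines plus filter storage for linear convolution and
    filter-by-filter fast convolution, and frequency-domain delay lines
    plus enlarged filter spectra for the proposed scheme."""
    linear = n * (d_max + k) + n * m * (k + 1)
    l_os = k + b                                 # filter-by-filter block size (15)
    filter_by_filter = n * (d_max + k) + n * m * l_os
    l_prop = k + 2 * b - 1                       # proposed block size (16)
    proposed = n * math.ceil(d_max / b) * l_prop + n * m * l_prop
    return linear, filter_by_filter, proposed

lin, fbf, prop = memory_words(n=16, m=128, k=1023, b=1024, d_max=48000)
assert round(lin / 1e6, 1) == 2.9 and round(fbf / 1e6, 1) == 5.0
assert round(prop / 1e6, 1) == 8.6
assert round((prop / fbf - 1) * 100, 1) == 72.7
```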
The additional memory requirements may have an adverse effect on the performance, e.g. due to reduced cache locality. At the same time, it is likely that the reduced number of commands, which implies a reduced number of memory access operations, mitigates this effect. It is therefore useful to examine and evaluate the performance gains of the proposed algorithm for the intended hardware architecture. By analogy therewith, the parameters of the algorithm, such as the FFT block size L or the block delay B, for example, should be adjusted to the specific target platform.
Even though specific elements are described as device elements, it shall be noted that this description may equally be regarded as a description of steps of a method, and vice versa.
Depending on the circumstances, the inventive method may be implemented in hardware or in software. Implementation may be effected on a non-transitory storage medium, a digital storage medium, in particular a disc or CD which comprises electronically readable control signals which may cooperate with a programmable computer system such that the method is performed. Generally, the invention thus also consists in a computer program product having a program code, stored on a machine-readable carrier, for performing the method when the computer program product runs on a computer. In other words, the invention may thus be realized as a computer program which has a program code for performing the method, when the computer program runs on a computer.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of U.S. patent application Ser. No. 14/329,457, filed Jul. 11, 2014, which is a continuation of copending International Application No. PCT/EP2012/077075, filed Dec. 28, 2012, which is incorporated herein by reference in its entirety, and additionally claims priority from German Application No. 102012200512.9, filed Jan. 13, 2012, which is also incorporated herein by reference in its entirety.
Relation | Number | Date | Country
--- | --- | --- | ---
Parent | 14329457 | Jul 2014 | US
Child | 15603946 | | US