The present invention relates to the field of audio rendering in a distributed and heterogeneous audio rendering system.
More particularly, the present invention relates to a method and system for calibrating an audio rendering system comprising a plurality of heterogeneous speakers or sound rendering elements.
The term “heterogeneous speakers” is understood to mean speakers which come from different suppliers and/or which are of different types, for example wired or wireless. In such a heterogeneous distributed context, where wired and wireless speakers, of different makes and models, are networked and controlled by a server, obtaining a coherent listening system which makes it possible to listen to a complete soundstage or to broadcast the same audio signal simultaneously in several rooms of the same house is not easy.
Indeed, several heterogeneity factors may arise. The various wireless speakers have their own clock. This situation creates a lack of coordination between the speakers. This lack of coordination includes both a lack of synchronization between the clocks of the speakers, i.e. the speakers do not start to “play” at the same time, and a lack of tuning, i.e. the speakers do not “play” at the same rate.
A lack of synchronization may result in an audible delay and/or a shift in the spatial image between the devices. A lack of tuning may result in a comb filter variation effect, an unstable spatial image, and/or audible clicks due to sample starvation or overload.
Another heterogeneity factor may arise from the fact that the different speakers may have different sound renderings. First of all, from an overall point of view, since some speakers are not on the same sound card and others are wireless speakers, they probably do not play at the same volume. In addition, each speaker has its own frequency response, thus meaning that the rendering of each frequency component of the signal to be played is not the same.
Yet another heterogeneity factor may lie in the spatial configuration of the speakers. In the case of a multichannel rendering, the speakers are generally not ideally positioned, i.e. their positions relative to one another do not follow standardized positions for obtaining optimal listening at a given position of a listener. For example, the ITU standard entitled “Multichannel stereophonic sound system with and without accompanying picture” from ITU-R BS.775-3, Radiocommunication Sector of ITU, Broadcasting service (sound), published in 2012 describes such a positioning of speakers for multichannel stereophonic systems.
There are various systems or protocols allowing only some heterogeneity factors to be corrected, and independently.
Conventional multichannel listening systems control various speakers from a single sound card, so these systems do not experience synchronization issues. Synchronization issues appear as soon as a plurality of sound cards are present or wireless speakers are used. In this case, the synchronization issue stems from a latency issue between the speakers.
Manufacturers of wireless speakers are able to address this issue by applying a network synchronization protocol between their products which of course come from the same manufacturer, but this is no longer possible in the case of heterogeneous distributed audio where the speakers come from different manufacturers.
Another solution consists in finding the latency between the speakers using electroacoustic measurement. If the same signal is sent at the same time to all of the speakers of a distributed audio system, each of them will play it at a different time. Measuring the differences between these times gives the relative latencies between the speakers. Synchronizing the speakers therefore means delaying those which are furthest ahead from the estimated values. This technique has already been applied to synchronize Bluetooth speakers of different makes and models. However, it does not take into account the clock drift that exists between the speakers. Thus, the speakers may appear to play at the same time at the start of playback but will fall out of sync over time.
Other techniques make it possible to reduce defects of sound rendering level or of speaker position type, but this requires independent measurements linked to each defect to be corrected.
An exemplary embodiment of the present invention aims to improve the situation.
To that end, an exemplary embodiment of the invention relates to a method for calibrating a distributed audio rendering system, comprising a set of N heterogeneous speakers controlled by a server. The method is such that it comprises the following steps:
a) placing a microphone in front of a speaker of the set;
b) sending a calibration signal, at a first time, to this speaker, the rendering of which is captured by the microphone;
c) sending the calibration signal, with a time shift, to each of the other speakers of the set, the renderings of which are captured by the microphone;
d) sending the calibration signal, at a second time, to the same speaker, the rendering of which is captured by the microphone;
e) iterating steps a) to d) for each speaker of the set;
f) collecting and analyzing the captured data in order to determine a plurality of heterogeneity factors to be corrected;
g) determining and applying corrections suitable for the determined heterogeneity factors.
The calibration process thus described makes it possible to optimize capture for various heterogeneous speakers which do not necessarily belong to the same supplier or which are of different types in order to obtain corrections adapted to the various heterogeneity factors of the speakers of the rendering system. A single calibration process makes it possible to correct various heterogeneity factors, which both allows the quality of the distributed system to be improved and the resources required for the calibration of this system to be optimized. Steps b), c) and d) of this method may be carried out in a different order without this adversely affecting the scope of the invention.
In one particular embodiment, the microphone is in a calibration device previously tuned with the server.
Thus, it is possible to use, for example, a terminal equipped with a microphone to carry out the capture steps. Since this calibration device is at the same rate as the server, it is then possible to correct the heterogeneity factors of the various speakers in an appropriate manner with respect to the server that controls them and by virtue of the captured data.
In one embodiment, the analysis of the captured data comprises multiple detections of peaks in a signal resulting from a convolution of the captured data with an inverse calibration signal, a maximum peak being detected by taking into account an exceedance threshold for the detected peak and a minimum duration between two detected peaks, in order to obtain N*(N+1) timestamp data.
The convolution of the captured data with the inverse calibration signal gives the impulse responses of the various speakers during the capture according to the described method. The detection of the peaks therefore makes it possible to find the timestamp data for these impulse responses.
According to one advantageous embodiment, an upsampling is implemented on the captured data before the detection of peaks. This upsampling makes it possible to have more precise detection of peaks, which refines the timestamp data determined on the basis of this detection of peaks and will make it possible to increase the precision of the estimated drifts.
In one particular embodiment, an estimate of a clock drift of a speaker of the set with respect to a clock of the processing server is made on the basis of the timestamp data obtained for the calibration signals sent at the first and at the second time and of the time elapsed between these two times.
The calculation of this clock drift makes it possible to determine the heterogeneity factor relating to the tuning of the speakers which may then be corrected in order to homogenize the rendering system.
To supplement this estimate of drift, in one embodiment, an estimate of the relative latency between the speakers of the set, taken in pairs, is made on the basis of the obtained timestamp data and the estimated drifts.
The calculation of these latencies makes it possible to determine the heterogeneity factor relating to the synchronization of the various speakers which may then be corrected in order to homogenize the rendering system.
On the basis of this latency estimate, it is possible, according to one embodiment, to estimate the distance between the speakers of the set, taken in pairs, on the basis of the obtained timestamp data, the estimated relative latencies and the estimated drifts.
The estimation of these distances makes it possible to determine the heterogeneity factor relating to the mapping of the speakers in the rendering system which may be corrected in order to homogenize it.
According to one embodiment, a heterogeneity factor relating to a tuning of the speakers of the set is corrected by resampling the audio signals intended for the corresponding speakers, according to the estimated clock drifts. This type of correction thus makes it possible to correct the clock drifts of the speakers without modifying the clock of their respective client.
According to one embodiment, a heterogeneity factor relating to a synchronization of the speakers of the set is corrected by adding a buffer, for the transmission of the audio signals intended for the corresponding speakers, the duration of which is dependent on the estimated latencies of the speakers. Similarly, this type of correction makes it possible to correct the relative latencies between the speakers without modifying the clocks of the respective clients.
According to one particular embodiment, a heterogeneity factor relating to the sound rendering and/or a heterogeneity factor relating to the sound volume of the speakers of the set is corrected by equalizing the audio signals intended for the corresponding speakers, according to gains dependent on the captured impulse responses of the speakers.
Thus, the correction made to the audio signals makes it possible to easily adapt the sound rendering and/or the sound volume. A plurality of heterogeneity factors may thus be corrected via one and the same calibration method.
In one particular embodiment, a heterogeneity factor relating to a mapping of the speakers of the set is corrected by applying a spatial correction to the corresponding speakers, according to at least one delay dependent on the estimated distances between the speakers and a given position of a listener.
Another heterogeneity factor is thus corrected on the basis of these same collected data and estimated distances between the speakers.
The present invention also relates to a system for calibrating a distributed audio rendering system, comprising a set of N heterogeneous speakers controlled by a server. The calibration system comprises:
- a microphone able to capture the signals rendered by the speakers of the set according to the calibration method described above;
- a collection module for collecting the captured data;
- an analysis module able to determine, on the basis of the captured data, a plurality of heterogeneity factors to be corrected on the speakers of the set;
- a correction module able to determine and apply corrections suitable for the determined heterogeneity factors.
The invention relates lastly to a storage medium, readable by a processor, which may or may not be integrated into the calibration system and which is potentially removable, on which there is recorded a computer program comprising code instructions for executing the steps of the calibration method as described above.
Other features and advantages of the invention will become more clearly apparent from reading the following description, given purely by way of non-limiting example and with reference to the appended drawings.
A rendering system comprising a set of N heterogeneous speakers controlled by a server is considered first.
The speaker represented by HP3 is, for example, a speaker using “Apple Airplay®” technology to connect wirelessly to a broadcast server.
Other speakers of the overall rendering system are connected by wire to devices which may be different and have different sound cards. For example, the speaker represented by HP2 is connected to a living room audio-video decoder of "set-top box" type, and the speaker HPi is connected to a personal computer. Of course, this configuration is only one example of a possible configuration: many other types of configuration are possible and the number N of speakers is variable.
All of these speakers in this set are therefore heterogeneous; they each have their own clock. Each sound card or wireless speaker is controlled by a software module called the client module represented here by C1, C2, C3, Ci, CN. These client modules are themselves connected to a processing server of a local network represented by 100. This local network server may be a personal computer, a compact computer of “Raspberry Pi®” type, an audio-video amplifier (“AVR” for audio-video receiver), a home gateway serving both as an external network access point and as a local network server, a communication terminal. The server 100 and the client modules may be integrated into the same device or distributed over a plurality of devices in the house. For example, the client module C1 of the speaker HP1 is integrated into the server 100 while the client module C2 of the speaker HP2 is integrated into a TV decoder controlled by the server 100.
The server 100 comprises a processing module 150 comprising a processor μP for controlling the interactions between the various modules of the server and cooperating with a memory block 120 (MEM) comprising a storage and/or working memory. The memory module 120 stores a computer program (Pg) comprising instructions for executing, when these instructions are executed by the processor, steps of the calibration method as described, for example, below.
This server 100 comprises an input or communication module 110 able to receive audio data S originating from various audio sources, whether local or from a communication network.
The processing module 150 then sends, to the client modules C1 to CN, the received audio data, in the form of RTP (for "Real-time Transport Protocol") packets. In order for these audio data to be rendered by the set of speakers in a homogeneous manner, i.e. so that they constitute a homogeneous and audible soundstage between the various speakers, the client modules have to be able to control their speakers without them having uncorrected heterogeneity factors between them. For example, the various clients C1 to CN have to be both synchronized and tuned with the server. An explanation of these two terms is given later.
The calibration system presented here comprises a microphone able to capture the signals rendered by the speakers, controlled by a microphone control module 230. In one embodiment, this microphone is directly connected to the server 100.
In another embodiment, a microphone 240 is integrated into a calibration device 200 comprising the microphone control module 230, a processing module 210 comprising a microprocessor and a memory MEM. Such a calibration device also comprises a communication module 220 able to communicate data to the server 100. This calibration device may for example be a communication terminal of smartphone type.
In this embodiment, the calibration device has its own sound card and its own clock. Tuning is then to be provided so that the calibration device and the server have the same clock rate and so that the capture of the data and the corrections to be made to the speakers are consistent with the clock of the server. For this, it is possible to implement a network synchronization protocol of PTP (for "Precision Time Protocol") type and as described for example in the IEEE standard entitled "Standard for a precision clock synchronization protocol for networked measurement and control systems", published by IEEE Instrumentation and Measurement Society, IEEE 1588-2008.
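As a non-normative illustration of the kind of two-way time transfer used by PTP-type protocols (simplified here: a single exchange and a symmetric network path are assumed; the function name is ours, not from the standard), the clock offset and the network delay may be computed as follows:

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Two-way time-transfer arithmetic of PTP-style protocols (simplified
    sketch, symmetric path assumed): t1 = master send time, t2 = slave
    receive time, t3 = slave send time, t4 = master receive time."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0   # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2.0    # one-way network delay
    return offset, delay

# Example: slave clock 5 ms ahead of the master, 2 ms symmetric delay.
offset, delay = ptp_offset_and_delay(100.0, 100.007, 100.010, 100.007)
```

Repeating such exchanges and filtering the results is how the calibration device and the server can be kept at the same rate before capture begins.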
To implement the calibration method according to the invention, the microphone is placed in front of the speakers of the set of speakers of the rendering system according to a calibration procedure described below. A calibration signal, described later, is then sent successively to the speakers of the set.
All of the data captured by this microphone and following this calibration procedure are collected, for example, by the collection module 160 of the server which memorizes the captured signals and the timestamp information determined after analysis of the rendered signals and the various times of sending of the calibration signals to the various speakers.
These captured and recorded data are analyzed by the analysis module 170 of the server 100 in order to determine a plurality of heterogeneity factors to be corrected on the various speakers. Corrections for these various heterogeneity factors are then determined by the correction module 180 which calculates the sampling frequencies, buffer duration, gains or other parameters to be applied to the speakers in order to make the system homogeneous. These various parameters are then sent to the various client modules so that the appropriate correction is made to the corresponding speakers.
In the case where the microphone is integrated into a calibration device 200, this device may also comprise a collection module 260 which collects the captured data and sends them to the server via the communication module 220. This calibration device may also integrate an analysis module 270 which, in the same way as described above for the server, analyzes the collected data in order to determine a plurality of heterogeneity factors to be corrected. The calibration device may send these heterogeneity factors to the server via its communication module 220 or else determine the corrections to be made itself if it integrates a correction module 270. In this case, it sends the server the corrections which are to be applied to the speakers via their respective client module.
Thus, when the calibration method is carried out, the rendering system has become homogeneous, i.e. the various heterogeneity factors of the speakers of the set have been corrected. The various speakers are then, for example, synchronized, tuned, they have homogeneous sound rendering and sound volume. Their spatial rendering may be corrected so that the soundstage rendered by this rendering system is optimal with respect to the given position of a listener.
A definition of the terms “synchronization” and “tuning” of the clocks of the various speakers is now presented. Two independently operating devices have their own clock. A clock is defined as a monotonic function equal to a time which increases at the rate determined by the clock frequency. It generally starts when the device is started up.
The clocks of two devices are necessarily different and three parameters are defined:
Conventional modeling of a clock ignores clock deviation, which is mainly caused by changes in temperature. Thus, in a server/client network context, the clock of the client TC is expressed according to the clock of the server TS according to the equation (EQ1): TC=α(TS+θ) where α represents the clock drift of the client with respect to that of the server, and θ represents the offset of the clock of the client.
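By way of a purely illustrative sketch (not part of the claimed method; the function names are ours), the clock model of equation EQ1 may be exercised as follows: given two (server time, client time) correspondences, the drift α and the offset θ are recovered and used to convert between the two time bases.

```python
# Illustrative sketch of the clock model of EQ1: TC = alpha * (TS + theta).
# Given two (server_time, client_time) correspondence pairs, alpha and
# theta can be recovered; names and API are illustrative assumptions.

def fit_clock_model(ts0, tc0, ts1, tc1):
    """Recover (alpha, theta) from two correspondences TC = alpha*(TS+theta)."""
    alpha = (tc1 - tc0) / (ts1 - ts0)   # drift: ratio of the clock rates
    theta = tc0 / alpha - ts0           # offset of the client clock
    return alpha, theta

def server_to_client(ts, alpha, theta):
    return alpha * (ts + theta)

# Example: a client clock running 10 ppm fast with a 0.5 s offset.
alpha_true, theta_true = 1.00001, 0.5
tc0 = server_to_client(0.0, alpha_true, theta_true)
tc1 = server_to_client(10.0, alpha_true, theta_true)
alpha, theta = fit_clock_model(0.0, tc0, 10.0, tc1)
```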
In an audio context, the drift may be found on the basis of the sampling frequencies.
The calibration method implemented by the calibration system described above is now detailed.
A first step E410 of initiating capture is implemented by initializing the index of the speaker taken into account to 0 (i=0).
In step E415, the capture microphone of the calibration device is placed in front of a first speaker (HPi) of the rendering system which therefore comprises N speakers.
In step E420, a calibration signal is sent, at a first time t1, to the speaker HPi by the server via the client module Ci of the speaker HPi. The rendering of this signal is captured by the microphone in this step E420.
The calibration signal is, for example, a signal the frequency of which increases logarithmically with time; such a signal is called a logarithmic "sweep" or "chirp".
The convolution of the signal measured at the output of the speaker with an inverse calibration signal makes it possible to obtain the impulse response of the speaker directly. Such a signal is, for example, an exponential sliding sine-type signal (ESS).
The measurement of this signal played by a speaker makes it possible to estimate its impulse response by calculating the cross-correlation between the measured signal and the theoretical signal ESS(t). This is achieved in practice by convolving the measured signal with an inverse sliding sine IESS exhibiting an exponential decay in order to compensate for the differences in energy between the frequencies (EQ4).
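As a hedged illustration of this technique (a Farina-style exponential sine sweep; the parameter values below are arbitrary, and neither the exact signal of the patent nor equation EQ4 is reproduced here), the sweep and its inverse filter may be sketched as:

```python
import numpy as np

fs = 8000.0                 # sampling rate (Hz); small values keep the demo fast
T = 0.5                     # sweep duration (s)
f1, f2 = 50.0, 3000.0       # start / end frequencies (illustrative assumptions)

t = np.arange(int(T * fs)) / fs
R = np.log(f2 / f1)
# Exponential (logarithmic) sine sweep:
sweep = np.sin(2 * np.pi * f1 * T / R * (np.exp(t / T * R) - 1.0))

# Inverse filter: time-reversed sweep with an exponential amplitude decay
# compensating the energy difference between low and high frequencies.
inverse = sweep[::-1] * np.exp(-t / T * R)

# Convolving the (re-captured) sweep with the inverse filter yields the
# impulse response; here the "measurement" is the sweep itself, so the
# result is a band-limited impulse whose peak marks the onset time.
ir = np.convolve(sweep, inverse)
peak = int(np.argmax(np.abs(ir)))
```

The position of the peak gives the timestamp datum exploited by the analysis step described later.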
In steps E430, E432 and E435, the calibration signal is then sent successively to each of the other speakers of the set, each sending being separated from the previous one by a time shift Δt.
This time shift is memorized in the server. It may be equivalent between each of the speakers or different. The rendering of these signals is captured in this step E430 by the microphone which is still in front of the speaker HPi.
The order in which the calibration signal is sent to these various speakers may be pre-established by the server. For example, in the embodiment illustrated in steps E430 to E435, the calibration signal is sent to the speakers in increasing order of their indices.
In step E440, the calibration signal is played again by the speaker HPi, at a time t2 different from t1, which may be at a time shift Δt from the last speaker of the loop E430 to E435 or else a time shifted by t1 and before the implementation of the loop E430 to E435.
The duration separating the time t2 from the time t1 is memorized in the memory of the processing server.
In step E450, it is checked whether the loop E415 to E440 is finished, i.e. whether all of the speakers have been processed in the same way. If this is not the case (N in E450), then steps E415 to E440 are iterated for the next speaker i, i ranging from 0 to N−1. The order of passage of the speakers is the same for the loop E430 to E435 for each iteration. When all of the speakers have been processed by the loop E415 to E440 (O in E450), step E460 is implemented.
Steps E420 to E440 may be carried out in a different order. For example, the capture of the calibration signal sent at times t1 and t2 to the same speaker i may be performed before the capture of the signals rendered by the other speakers. It is also possible to capture the signals rendered by the speakers other than i before capturing the signal rendered at times t1 and t2 by the speaker i. The order of these steps does not matter as far as the result of the method is concerned.
In step E460, the capture by the microphone is stopped and the captured data (Dc) are collected and recorded in a memory of the server or of the calibration device depending on the embodiment. These data are taken into account in the analysis step E470. This analysis step makes it possible to determine a plurality of heterogeneity factors to be corrected for all of the N speakers. These heterogeneity factors form part of a list from among:
- a heterogeneity factor relating to a tuning of the speakers of the set;
- a heterogeneity factor relating to a synchronization of the speakers of the set;
- a heterogeneity factor relating to the sound rendering and/or the sound volume of the speakers of the set;
- a heterogeneity factor relating to a mapping of the speakers of the set.
A correction suitable for the determined heterogeneity factors is then determined and applied in E480.
These steps E470 and E480 are now detailed. In a first analysis operation, the captured data are convolved with the inverse calibration signal in order to obtain the impulse responses of the various speakers.
Once this operation has been carried out, a signal is obtained comprising a series of impulse responses corresponding to the various speakers according to the order of rendering of the calibration signal of the capture procedure.
In step E520, a peak detection is performed on the impulse responses thus obtained. The times corresponding to the maxima of the impulse responses are kept as timestamp data. The detection step is in fact a detection of multiple peaks. The approach used here, as one embodiment, consists in finding each local maximum defined by the transition from a positive slope to a negative slope. All of these local maxima are then sorted in descending order and the first N*(N+1) are retained.
This approach is simple but may lead to errors if an impulse response has a maximum that is lower than noise. In order for these particular cases to be detected, a peak detection threshold is defined.
In addition, for each impulse response, secondary peaks may be present and higher than the primary peak of another response. To avoid this, a minimum duration is defined between two peaks detected on the signal.
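A minimal sketch of this multi-peak detection, using an exceedance threshold and a minimum duration between two detected peaks (here with SciPy's find_peaks on synthetic data; the threshold and spacing values are illustrative assumptions, not the patent's):

```python
import numpy as np
from scipy.signal import find_peaks

fs = 48000
# Synthetic magnitude envelope of a concatenation of impulse responses:
# three onsets, plus a low-level noise floor (assumed test data; the real
# input would be the convolved capture described above).
rng = np.random.default_rng(0)
sig = 0.01 * np.abs(rng.standard_normal(fs))
for onset, amp in [(2000, 1.0), (14000, 0.8), (26000, 0.6)]:
    sig[onset] += amp

height = 0.1                 # peak detection threshold (above the noise floor)
min_gap = int(0.1 * fs)      # minimum duration between two peaks (0.1 s here)
peaks, _ = find_peaks(sig, height=height, distance=min_gap)
timestamps = peaks / fs      # timestamp data, in seconds
```

The threshold discards spurious noise maxima and the minimum spacing discards secondary peaks of the same impulse response, as described above.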
N*(N+1) timestamp data are thus obtained.
In step E522, for each of the speakers HPi of the set, the drift αi of its clock with respect to that of the processing server is determined.
The captured data used are the N+1 timestamp data measured when the calibration microphone is placed in front of the speaker HPi. These timestamp data are denoted by Tik with k∈[0, . . . ,N+1[, and the theoretical time elapsed between two measurements of the same speaker HPi: t2−t1.
If the theoretical time elapsed between the signal played by the speaker HPi at time t1 and at time t2 is equal to Nδ, with δ=Δt the constant theoretical time elapsed between two renderings of the calibration signal on two adjacent speakers of the loop E430 to E435, it is possible to estimate the drift of the speaker HPi with respect to the server according to equation EQ5.
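The exact expression of EQ5 is not reproduced in this text; under the definitions above, a natural estimator, given here as an assumption rather than as the patented formula, divides the time measured between the two renderings by HPi by the theoretical time Nδ:

```python
def estimate_drift(timestamps, delta):
    """Estimate the clock drift of speaker i from the N+1 timestamps
    measured in front of it.  timestamps[0] and timestamps[-1] are the two
    renderings by HPi; the theoretical time between them is N * delta.
    (Assumed estimator; the patent's EQ5 is not reproduced here.)"""
    n = len(timestamps) - 1            # N renderings separate the two captures
    measured = timestamps[-1] - timestamps[0]
    return measured / (n * delta)

# Example: 4 speakers, delta = 2 s, speaker clock running 10 ppm fast.
delta, drift_true = 2.0, 1.00001
ts = [1.0 + k * delta * drift_true for k in range(5)]   # 5 = N + 1 timestamps
alpha = estimate_drift(ts, delta)
```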
This theoretical time t2−t1 is set before initiating the calibration and it may be chosen according to the desired precision in terms of estimating the various heterogeneity factors.
Specifically, the precision in the estimation of the various clock coordination and mapping parameters is mainly linked to the precision in the estimation of the timestamp data. The detection of peaks on the impulse responses means a temporal precision corresponding to one sample, i.e. approximately 20 μs for a sampling frequency of 48 kHz. Beyond the fact that better precision may be desirable, it is above all the estimation of the clock drift which is affected. Specifically, small drift values are to be expected, of the order of 10 ppm. If the theoretical duration between the two timestamp data being used to estimate the drift in the above equation EQ5 is equal to 1 s, an error of one sample in the estimation of the timestamp data results in an error of about 20 ppm.
A first solution for decreasing this error is to increase the duration δ between the renderings of the calibration signal. If this duration is such that the duration between the two renderings of the calibration signal on the same speaker (t2−t1) being used to estimate the drift is at least equal to 20 s, the estimation error becomes smaller than 1 ppm. This solution involves significantly increasing the total duration of the acoustic calibration, which is not always possible.
A second solution consists in upsampling the impulse responses in a step E510, before the detection of peaks, in order to refine the estimated timestamp data.
In practice, a mixture of the two solutions (increasing the time interval δ and upsampling) is used. The time between the signals being used to estimate the drift is increased to about 8 s and an upsampling by a factor of 10 is implemented.
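This refinement by upsampling may be sketched as follows (illustrative values; resample_poly and the synthetic band-limited impulse are our choices, not the patent's):

```python
import numpy as np
from scipy.signal import resample_poly

up = 10                          # upsampling factor used in this embodiment

# Band-limited impulse whose true peak falls between two samples
# (illustrative data; the real input is a measured impulse response).
n = np.arange(-64, 64)
true_offset = 0.3                # fractional-sample position of the peak
ir = np.sinc(0.8 * (n - true_offset))

ir_up = resample_poly(ir, up, 1)             # polyphase upsampling by 10
peak_coarse = int(np.argmax(ir)) - 64        # one-sample precision
peak_fine = int(np.argmax(ir_up)) / up - 64  # one-tenth-sample precision
```

The coarse estimate is quantized to the nearest sample, while the upsampled estimate recovers the fractional-sample position, which directly improves the drift estimation discussed above.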
Defining the relative latencies with respect to the first speaker is arbitrary and may lead to negative values. In order to achieve only positive values and thus have the delay of each speaker with respect to that which is furthest ahead, a quantity Φi is calculated for each speaker HPi (EQ7).
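The idea behind this positivity shift and the buffer correction described later may be sketched as follows (our reading, with an illustrative sign convention in which a positive value means a speaker plays late; the exact formula EQ7 is not reproduced here):

```python
def alignment_delays(latencies_ms):
    """Delays (ms) to add to each speaker's buffer so that all speakers
    play in step: each speaker is delayed up to the latest one, so the
    speaker furthest ahead receives the largest delay.  Sketch of the idea
    behind the buffer correction; not the patent's exact formula."""
    ref = max(latencies_ms)
    return [ref - l for l in latencies_ms]

# Relative latencies in ms, measured with respect to the first speaker.
lat = [0, -12, 35, 4]
phi = alignment_delays(lat)
```

After adding these delays, every speaker's effective latency is the same, which is exactly the synchronization goal.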
All of the relative latencies between speakers taken in pairs are thus obtained, in step E524. When all of the clock drifts and all of the relative latencies are known, the distances between the speakers may be estimated in step E526. According to the calibration procedure described above, the distance between each pair of speakers may be deduced from the obtained timestamp data, the estimated relative latencies and the estimated drifts, with c the speed of sound in air. The squared distances dij², for (i, j)∈[0 . . . N[², form a matrix D.
After this detailed analysis step E470, the calibration method implements a correction step E480 which is now detailed in order to homogenize the heterogeneous distributed audio system.
For each speaker HPi, a buffer the duration of which is dependent on its estimated latency Φi is thus determined. This buffer value is transmitted to the client module Ci of the speaker HPi in E580 so that the audio data received from the server are not sent directly to the sound card or to the wireless speaker but after a delay corresponding to the size of the buffer thus determined. The synchronization of all of the speakers may then be achieved by adding Φi to the size of the buffer of each client Ci.
To correct the heterogeneity factor of the sound rendering of the speakers, step E560 retrieves the impulse responses of the speakers which have been generated and retained from the captured data. The amplitude of the Fourier transform of each impulse response constitutes the response of the corresponding speaker as a function of frequency. It allows step E560 to calculate the energy in each frequency band in question. The calibration process then makes it possible to homogenize the sound volume of the various speakers.
For this, the client modules of the corresponding speakers have a volume option expressed as a percentage. If Ei is the overall energy estimated for each speaker i, its volume Vi (in %) is calculated according to equation EQ11.
This volume correction is thus sent, in E580, to the corresponding client modules so that they apply this volume correction by applying a suitable gain.
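The exact expression of EQ11 is not reproduced here; the following sketch gives one plausible reading, stated as an assumption: the quietest speaker stays at 100 % and louder ones are attenuated, the square root converting energy into an amplitude gain.

```python
import math

def volumes_from_energies(energies):
    """Per-speaker volume, in percent, homogenizing the perceived level.
    Assumed reading of the goal of EQ11 (not the published formula): the
    quietest speaker keeps 100 % and the others are attenuated; the square
    root converts an energy ratio into an amplitude ratio."""
    e_min = min(energies)
    return [100.0 * math.sqrt(e_min / e) for e in energies]

# Example: three speakers with overall energies E0, E1, E2.
vols = volumes_from_energies([1.0, 4.0, 0.25])
```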
The acoustic calibration produces the matrix D of the squares of the distances, in step E526, between each pair of speakers. In step E550, a mapping of the speakers is first produced on the basis of these data, in order to then be able to apply a spatial correction to adapt the optimum listening point to a given position of a listener.
This mapping may be obtained by applying a conventional multidimensional scaling (MDS) method, which assumes that the matrix D is a Euclidean distance matrix (EDM). According to the authors of this method, this assumption is true if the Gram matrix obtained after centering the matrix D is positive semi-definite, i.e. its eigenvalues are greater than or equal to 0. It turns out that this condition is not always met in the application case described above because of the placement of the measurement microphone or errors in the estimation of the distances between the speakers.
If the matrix D is not an EDM, another approach is needed for the mapping, for example the ACD (for "alternate coordinate descent") algorithm. This method consists of a gradient descent on each coordinate sought in order to minimize the error between the matrix D as measured and as estimated. This method is described in the document entitled "Euclidean Distance Matrices: Properties, Algorithms and Applications" by R. Parhizkar, PhD thesis, École Polytechnique Fédérale de Lausanne (Swiss Federal Institute of Technology Lausanne), Switzerland, 2013. While this algorithm converges quickly, it is still more cumbersome than the conventional MDS. For this reason, in one embodiment of the invention, the mapping algorithm begins with the application of the MDS method and applies the ACD method only once it has been verified that the matrix of the measured distances is not an EDM.
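The MDS branch of this mapping may be sketched as follows (a standard classical-MDS implementation with the positive semi-definiteness check described above, given purely as an illustration; the ACD fallback is only mentioned, not implemented):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS on a matrix D of squared distances.  Returns the
    speaker coordinates (up to a rigid transform) and whether D was
    (numerically) a Euclidean distance matrix, i.e. whether the centered
    Gram matrix is positive semi-definite.  If it is not, a fallback such
    as the ACD descent mentioned above would be used instead."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    G = -0.5 * J @ D @ J                    # centered Gram matrix
    w, V = np.linalg.eigh(G)
    is_edm = w.min() >= -1e-9 * max(w.max(), 1.0)
    idx = np.argsort(w)[::-1][:dim]         # keep the largest eigenvalues
    X = V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
    return X, is_edm

# Example: four speakers at the corners of a 3 m x 4 m rectangle.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # squared distances
X, is_edm = classical_mds(D)
D_rec = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
```

When D is an exact EDM, as here, the reconstructed pairwise distances match the measured ones, and the recovered positions can then be used for the spatial correction toward the listener position.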
Number | Date | Country | Kind |
---|---|---|---|
1873726 | Dec 2018 | FR | national |
This Application is a Section 371 National Stage Application of International Application No. PCT/FR2019/052961, filed Dec. 9, 2019, the content of which is incorporated herein by reference in its entirety, and published as WO 2020/128214 on Jun. 25, 2020, not in English.