The invention relates generally to test systems and methods that enable the prediction of the perceived spatial quality of an audio processing or reproduction system, where the systems and methods apply metrics derived from the audio signals to be evaluated in such a way as to generate predicted ratings that closely match those that would be given by human listeners.
It is desirable to be able to evaluate the perceived spatial quality of audio processing, coding-decoding (codec) and reproduction systems without needing to involve human listeners. This is because listening tests involving human listeners are time consuming and expensive to run. It is important to be able to gather data about perceived spatial audio quality in order to assist in product development, system setup, quality control or alignment, for example. This is becoming increasingly important as manufacturers and service providers attempt to deliver enhanced user experiences of spatial immersion and directionality in audio-visual applications. Examples are virtual reality, telepresence, home entertainment, automotive audio, games and communications products. Mobile and telecommunications companies are increasingly interested in the spatial aspect of product sound quality. Here simple stereophony over two loudspeakers, or headphones connected to a PDA/mobile phone/MP3 player, is increasingly typical. Binaural spatial audio is to become a common feature in mobile devices. Home entertainment involving multichannel surround sound is one of the largest growth areas in consumer electronics, bringing enhanced spatial sound quality into a large number of homes. Home computer systems are increasingly equipped with surround sound replay and recent multimedia players incorporate multichannel surround sound streaming capabilities, for example. Scalable audio coding systems involving multiple data rate delivery mechanisms (e.g. digital broadcasting, internet, mobile comms) enable spatial audio content to be authored once but replayed in many different forms. The range of spatial qualities that may be delivered to the listener will therefore be wide and degradations in spatial quality may be encountered, particularly under the most band-limited delivery conditions or with basic rendering devices.
Systems that record, process or reproduce audio can give rise to spatial changes including the following: changes in individual sound source-related attributes such as perceived location, width, distance and stability; changes in diffuse or environment related attributes such as envelopment, spaciousness and environment width or depth. In order to be able to analyse the reasons for overall spatial quality changes in audio signals it may also be desirable to be able to predict these individual sub-attributes of spatial quality.
Under conditions of extreme restriction in delivery bandwidth, major changes in spatial resolution or dimensionality may be experienced (e.g. when downmixing from many loudspeaker channels to one or two). Recent experiments involving multivariate analysis of audio quality show that in home entertainment applications spatial quality accounts for a significant proportion of the overall quality (typically as much as 30%).
Because listening tests are expensive and time-consuming, there is a need for a quality model, and for systems, devices and methods implementing this model, that is capable of predicting perceived spatial quality on the basis of measured features of audio signals. Such a model needs to be based on a detailed analysis of human listeners' responses to spatially altered audio material, so that the results generated by the model match closely those that would be given by human listeners when listening to typical programme material. The model may optionally take into account the acoustical characteristics of the reproducing space and its effects on perceived spatial fidelity, either using acoustical measurements made in real spaces or using acoustical simulations.
Based on the above background it is an object of the present invention to provide systems, devices and methods for predicting perceived spatial quality on the basis of metrics derived from psychoacoustically informed measurements of audio signals. Such signals may have been affected by any form of audio recording, processing, reproduction, rendering or other audio-system-induced effect on the perceived sound field.
The systems, devices and methods operate either in a non-intrusive (single-ended) fashion, or an intrusive (double-ended) fashion. In the former case predictions are made solely on the basis of metrics derived from measurements made on the audio signal(s) produced by a DUT (“device under test”), which in the present context means any audio system, device or method that is to be tested by the present invention, when no reference signal(s) is available or desirable. In the latter case, predictions of spatial quality are made by comparing the version of the audio signal(s) produced by the DUT with a reference version of the same signals. This is used when there is a known original or ‘correct’ version of the spatial audio signal against which the modified version should be compared. Thus, the non-intrusive prediction constitutes a prediction of perceived spatial quality as an absolute quantity, whereas the intrusive prediction constitutes a prediction of perceived spatial quality as a relative quantity, for instance a degradation compared with a given reference condition.
As will be described in more detail in the subsequent detailed description of the invention, the predictions of spatial audio quality provided by the present invention are basically obtained by the use of suitable metrics that derive objective measures relating to a given auditory space-related quantity or attribute (for instance the location in space of a sound source, the width of a sound source, the degree of envelopment of a sound field, etc.) when said metrics are provided with signals that represent an auditory scene (real or virtual). Alternatively, or additionally, the prediction of spatial audio quality (as a holistic quantity) may be derived from one or more metrics that do not have specifically named attribute counterparts, i.e. individual metrics may be objective measures that are only applied as functional relationships used in the total model for predicting perceived spatial audio quality as a holistic quantity, but with which there may not be associated individual perceptual attributes. The total model, according to a further alternative, utilises a combination of metrics related to perceived attributes and metrics to which there are not related perceived attributes. Said objective measures provided by the respective metrics must be calibrated (or interpreted) properly, so that they can represent a given human auditory perception, either of an individual attribute or of spatial audio quality as a holistic quantity. After translation to this perceptual measure, ratings on various scales, for instance, can be obtained and used for associating a value or verbal assessment with the perceptual measure. Once the system has been calibrated, i.e. once a relationship between the objective measures provided by the metrics and the perceptual measure has been established, the system can be used for evaluating other auditory scenes, and an “instrument” has hence been provided which makes expensive and time-consuming listening tests superfluous.
According to the invention, raw data relating to audio signals (which may be physical measurements, such as sound pressure level, or other objective quantities) are typically obtained, and from these data or measurements are derived metrics that serve as higher-level representations of the raw data or measurements. For example, “spectral centroid” is a single value based on a measurement of the frequency spectrum, “iacc0” is the average of the interaural cross-correlation (iacc) in octave bands at two different angles, etc. According to the invention, these higher-level representations are then used as inputs to predictor means (which could be a look-up table, a regression model, an artificial neural network etc.), which predictor means is calibrated against the results of listening tests.
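By way of illustration only, the derivation of a single-valued metric such as the spectral centroid from raw samples might be sketched as follows. The function names, the naive transform and the test frame are assumptions introduced for this sketch and are not taken from the specification; a practical implementation would normally use a fast Fourier transform and windowing.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short illustrative frames)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency in Hz; returns 0.0 for a silent frame."""
    mags = magnitude_spectrum(samples)
    total = sum(mags)
    if total == 0:
        return 0.0
    bin_hz = sample_rate / len(samples)
    return sum(k * bin_hz * m for k, m in enumerate(mags)) / total


# Hypothetical raw data: a 1 kHz sine sampled at 8 kHz, 64 samples (exactly 8 cycles).
frame = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
centroid_hz = spectral_centroid(frame, 8000)
```

For this pure tone the centroid falls at the tone frequency; for programme material it would summarise the spectral balance of the frame as one number suitable as an input to the predictor means.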
According to the invention said objective measures may be derived from said raw data or measurements (physical signals) through a “hierarchy” of metrics. Thus, low-level metrics may be derived directly from the raw data and higher-level metrics may derive the final objective measure from the set of low-level metrics. A schematic representation of this principle according to the invention is given in the detailed description of the invention.
Furthermore, it should be noted that there may not always be just one physical/objective metric that relates to one perceptual attribute. In most cases there are many metrics (e.g. for envelopment) that, appropriately weighted and calibrated, lead to an accurate prediction. Some further clarification will be given in the detailed description of the invention, for instance in connection with FIGS. 2(b), 3(a) and 3(b).
As mentioned, the systems, devices and methods according to the present invention comprise both single-ended (“unintrusive”) and double-ended (“intrusive”) versions. These different versions will be described in more detail in the following.
The above and further objects and advantages are according to the broadest form of the present invention obtained by a method and/or corresponding system or device for predicting perceived auditory spatial quality comprising:
Said raw data or measurements could for instance comprise microphone signals representing sound pressure in an acoustic environment. Many different microphone configurations can in practice be used to pick up sound signals that represent the acoustic environment in a manner suitable for prediction of spatial sound quality. Thus, a number of separate microphones distributed over an area in for instance a listening room or an array of microphones could be used. Specifically two microphones mounted in an artificial human head (an artificial head) could also be used to create binaural recordings that could be used in the invention. Furthermore, the acoustic environment is not limited to real environments but could be simulated, virtual environments as well, as might be created by use of suitable auralisation systems.
According to a specific embodiment of the invention, said metric means comprises at least two separate metric means, where the first of these metric means receives said raw data or measurements, or processed versions hereof, and based on these derives a set of first objective measures that is provided to a second of said metric means that based on said set of first objective measures derives a second set of objective measures that is provided to said predictor means.
According to another specific embodiment of the invention, said metric means comprises at least two separate metric means (level 1 metrics and level 2 metrics), where the first of these metric means receives said raw data or measurements, or processed versions hereof, and based on these derives a first set of first objective measures that is provided to said predictor means, and where the second of these metric means receives data (values) provided by said first metric means and based on these derives a second set of objective measures that is provided to said predictor means.
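The two-level arrangement described above, in which level 1 metrics are derived from the raw data and level 2 metrics are derived from the level 1 values, with both sets passed to the predictor means, can be illustrated by a minimal sketch. The particular metrics chosen here (per-channel RMS level and an inter-channel level difference) and all names are assumptions made for illustration only.

```python
import math


def level1_rms(channel):
    """Level 1 metric: RMS value computed directly from the raw samples of one channel."""
    return math.sqrt(sum(x * x for x in channel) / len(channel))


def level2_level_difference_db(rms_left, rms_right, floor=1e-12):
    """Level 2 metric: inter-channel level difference in dB, computed from level 1 values."""
    return 20.0 * math.log10(max(rms_left, floor) / max(rms_right, floor))


def feature_vector(left, right):
    """Collect level 1 and level 2 measures as the input for a predictor."""
    rms_l = level1_rms(left)
    rms_r = level1_rms(right)
    return [rms_l, rms_r, level2_level_difference_db(rms_l, rms_r)]


# Hypothetical raw data: the right channel carries half the amplitude of the left.
left = [0.5, -0.5, 0.5, -0.5]
right = [0.25, -0.25, 0.25, -0.25]
features = feature_vector(left, right)
```

In the embodiment of the preceding paragraph, only the level 2 value would be forwarded to the predictor; in the present embodiment the whole vector is forwarded.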
The above and further objects and advantages are according to a first aspect of the present invention obtained by a single-ended (unintrusive) method for predicting the perceived spatial quality of sound processing and reproducing equipment, where the method basically comprises the following steps:
The above and further objects and advantages are according to a first aspect of the present invention alternatively obtained by a double-ended (intrusive) method for predicting the perceived spatial quality of sound processing and reproducing equipment, where the method basically comprises the following steps:
The above and further objects and advantages are according to a second aspect of the present invention obtained by a system for predicting the perceived spatial quality of sound processing and reproducing equipment, where the system basically comprises:
The above and further objects and advantages are according to the second aspect of the present invention alternatively obtained by a double-ended (intrusive) system for predicting the perceived spatial quality of sound processing and reproducing equipment, where the system basically comprises:
The present invention furthermore relates to various specific devices (or functional items or algorithms) used for carrying out the different functions of the invention.
Still further, the present invention also relates to specific methods for forming look-up tables that translate a given physical measure provided by one or more of said metrics into a perceptually related quantity or attribute. One example would be a look-up table for transforming the physical measure interaural time difference (ITD) into a likely azimuth angle of a sound source placed in the horizontal plane around a listener. Another example would be a look-up table for transforming the physical measure interaural cross-correlation into the perceived width of a sound source. It should be noted that, instead of look-up tables comprising columns and rows defining cells, where each cell contains a specific numerical value, other equivalent means may be used in the method and system according to the invention, such as regression models describing the regression (correlation) between one or more physically related quantities provided by metrics in the system and a perceptually related quantity that constitutes the desired result of the evaluation carried out by the system. Artificial neural networks may also be used as prediction means according to the invention.
Generally, the regression models (equations) or equivalent means, such as a look-up table or artificial neural network, used according to the invention weight the individual metrics according to calibrated values.
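A look-up table of the kind mentioned above for ITD and azimuth might, purely as a sketch, be formed from training counts and queried as shown below. The bin centres, the counts, the sign convention (positive ITD for sources to the right) and the function names are hypothetical, and the row normalisation is only one plausible way of obtaining an estimate of the probability of an angle given an ITD.

```python
# Hypothetical training counts: rows are ITD bins (microseconds),
# columns are azimuth angles (degrees). Values are illustrative only.
AZIMUTHS_DEG = [-60, -30, 0, 30, 60]
ITD_BINS_US = [-500, -250, 0, 250, 500]
COUNTS = [
    [8, 3, 0, 0, 0],
    [2, 7, 1, 0, 0],
    [0, 1, 9, 1, 0],
    [0, 0, 1, 7, 2],
    [0, 0, 0, 3, 8],
]


def angle_given_itd(counts):
    """Normalise each ITD row to sum to 1, estimating P(azimuth | ITD bin)."""
    table = []
    for row in counts:
        total = sum(row)
        table.append([c / total if total else 0.0 for c in row])
    return table


def predict_azimuth(itd_us, table):
    """Select the nearest ITD bin and return its most likely azimuth."""
    row = min(range(len(ITD_BINS_US)), key=lambda i: abs(ITD_BINS_US[i] - itd_us))
    probs = table[row]
    return AZIMUTHS_DEG[probs.index(max(probs))]


TABLE = angle_given_itd(COUNTS)
estimate = predict_azimuth(230, TABLE)
```

A measured ITD of 230 microseconds falls nearest the 250 microsecond bin, whose most likely azimuth in this illustrative table is 30 degrees.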
The present invention incorporates one or more statistical regression models, look-up tables or said equivalent means of weighting and combining the results of the derived metrics so as to arrive at an overall prediction of spatial quality or predictions of individual attributes relating to spatial quality.
The present invention furthermore relates to a metric or method for prediction of perceived azimuth angle θ based on interaural differences, such as interaural time difference (ITD) and/or interaural level (or intensity) difference (ILD), where the method comprises the following steps:
Said frequency bands are according to a specific embodiment of the invention bands of critical bandwidth (“critical bands”).
The present invention also relates to systems or devices able to carry out the above method for prediction of perceived azimuth angle.
The present invention furthermore relates to a metric or method for predicting perceived envelopment, the method comprising the steps of:
The present invention also relates to systems or devices able to carry out the above method for predicting perceived envelopment.
Within the context of the present invention a division is made between foreground (F) attributes and background (B) attributes. Foreground refers to attributes describing individually perceivable and localisable sources within the spatial auditory scene, whereas background refers to attributes describing the perception of diffuse, unlocalisable sounds that constitute the perceived spatial environment components such as reverberation, diffuse effects, diffuse environmental noise etc. These provide cues about the size of the environment and the degree of envelopment it offers to the listener. Metrics and test signals designed to evaluate perceived distortions in the foreground and background spatial scenes can be handled separately and combined in some weighted proportion to predict overall perceived spatial quality.
Foreground location-based (FL) attributes are related to distortions in the locations of real and phantom sources (e.g. individual source location, direct envelopment, front/rear scene width, front/rear scene skew).
Foreground width-based (FW) attributes are related to distortions in the perceived width or size of individual sources (e.g. individual source width).
Background (B) attributes relate to distortions in diffuse environment-related components of the sound scene that have perceived effects, such as indirect envelopment and environment width/depth (spaciousness).
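The combination of separately handled foreground and background predictions in some weighted proportion could, as a minimal sketch, take the form of a weighted mean. The scores, the weights and the 0 to 100 scale used below are illustrative assumptions and not calibrated values from the specification.

```python
def combine_scores(scores, weights):
    """Weighted mean of per-group scores; the weights need not sum to 1."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical per-group predictions (0 to 100) and illustrative weights
# for the FL, FW and B attribute groups.
scores = {"FL": 80.0, "FW": 60.0, "B": 40.0}
weights = {"FL": 0.5, "FW": 0.2, "B": 0.3}
overall = combine_scores(scores, weights)
```

In a calibrated system the weights would be obtained from listening test data rather than chosen by hand.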
Depending, for instance, on whether overall (holistic) spatial sound quality (referred to in the present specification as SMOS (spatial mean opinion score) or SQ (spatial quality)) is predicted, or whether specific space-perception-related attributes, such as envelopment, location of sound sources, perceived width of sound sources etc., are predicted by the methods or systems according to the invention, different combinations of test signals, choices of metrics or structures of metrics may be most advantageously used. Hence, according to a specific embodiment of the invention, there is provided a modular version of the methods and systems according to the invention, by means of which the specific test signals, metrics and configuration of the metrics can be chosen according to a specific task. In the detailed description of the invention, configuration of the method or system according to the invention by means of a spreadsheet is illustrated, but it is understood that other graphical means of configuring the method or system could also be used without departing from the scope of the invention.
The method and system according to the present invention may be used to carry out an absolute (“un-intrusive”) prediction or a relative (“intrusive”) prediction of either SMOS or of specific auditory space-related perceptual attributes at a specific location in a (real or simulated) acoustic environment, for instance a listening room or auditorium. It is, however, also possible to use the method or system according to the invention to obtain predicted values at a plurality of different positions in an acoustic environment and thus for instance obtain a plot of SMOS or specific attributes over an entire environment, for instance over an entire listening room. An example would be predicted SMOS in a listening room in which a surround sound system is installed, where the method and system according to the invention would enable evaluation of the usable listening area within the room, the effect on SMOS of acoustical treatment of surfaces of the room or the effect of changes in the processing of signals to the individual loudspeakers in the loudspeaker set-up. Thus, the present invention may for instance be used together with existing auralisation systems that simulate an acoustic environment. A number of such plots are shown in the detailed description of the invention.
The invention will be better understood with reference to the following detailed description of embodiments hereof in conjunction with the figures of the drawing, where:
a illustrates a first embodiment of the invention in the form of unintrusive evaluation using audio programme material as a source, although the input to the DUT (i.e. the Device Under Test, which means any combination of the recording, processing, rendering or reproducing elements of an audio system) according to another option may be customised/dedicated test signals for instance aiming at highlighting specific auditory attributes or spanning the complete range of auditory perception;
b illustrates a second embodiment of the invention in the form of intrusive evaluation basically comprising comparison between signals processed by the DUT and a reference signal, i.e. a signal that has not been processed by the DUT; also in this embodiment the input signals may be either audio programme material or customised/dedicated signals as mentioned in connection with
a exemplifies the general concept of a system according to the second embodiment of the invention using the intrusive approach as mentioned in connection with
b gives a further example of the general concept of a system according to the second embodiment of the invention comprising prediction of separate attributes related to perceived auditory spatial quality and also prediction of perceived auditory spatial quality as a holistic quantity;
a shows a schematic representation of the principle of the invention, illustrating a set-up where a set of raw data are provided to a metric means that based on the raw data provides a higher-level representation relating to a given perceptive attribute (holistic or specific), where this higher-level representation is subsequently provided to predictor means that are calibrated on the basis of listening tests and which predictor means provides a prediction of the given perceptive attribute (holistic or specific);
b shows a further schematic representation of the principle of the invention, according to which a plurality of metrics are provided as input to the prediction model;
c shows schematically the relationship between raw data (physical signals), low-level metrics, high-level metrics and objective measures;
FIGS. 8a to 8d show an example of a look-up table: (a) the likelihoods of ITD values against the azimuth angle relative to the listener's head (in this case, that of an acoustical dummy), which were obtained from training data for filter-bank channel 12, and (b) the same with some smoothing applied, the grey level indicating the strength of the likelihood, with black being the highest. The proposed system performs vertical (c) and horizontal (d) normalisations of this table to produce estimates of the probability of an angle given the ITD.
e and f show an illustration of the procedure according to the invention for forming a histogram corresponding to a single, given frequency band;
a and b illustrate the use of head movement according to the invention to resolve front/back ambiguity.
Referring to
Referring to
As mentioned the abbreviation DUT represents ‘Device Under Test’, which refers broadly to any combination of the recording, processing, rendering or reproducing elements of an audio system and also any relevant processing method implemented by use of such elements (this can include loudspeaker format and layout) and QESTRAL encode 5 refers to a method for encoding spatial audio signals into an internal representation format suitable for evaluation by the quality model of the present invention (this may include room acoustics simulation, loudspeaker-to-listener transfer functions, and/or sound field capture by one or more probes or microphones). As mentioned previously test signals of a generic nature, i.e. signals that can be used to evaluate the spatial quality of any relevant DUT, may be provided by the test source 1. In order to use these signals in a special application a transcoding 8 may be necessary. An example would be the transcoding required in order to be able to use a test signal comprising a universal directional encoding for instance in the form of high order spherical harmonics for driving a standard 5.1 surround sound loudspeaker set-up. There may of course be instances where no transcoding is required. After QESTRAL encoding (if needed) suitable metrics 6 derive the physical measures m1 and m2 characterising the spatial quality (or specific attributes hereof) and these measures are compared in comparison means 9 and the result of this comparison c is translated to a predicted spatial fidelity difference grade 10 referring to the difference grade between the reference version of the signals and the version of these signals that has been processed through the DUT. As an addition to the comparison carried out by the system shown in
In one alternative of the present invention the spatial quality of one or more DUTs is evaluated using real acoustical signals. In another alternative the acoustical environment and transducers are simulated using digital signal processing. In still another alternative a combination of the two approaches is employed (simulated reproduction of the reference version, acoustical reproduction of the evaluation version).
Referring to
The reference system consists of a standard 5.1 surround sound reproduction system comprising a set-up of five loudspeakers 17 placed around a listening position in a well-known manner. The test signals 1 applied are presented to the loudspeakers 17 in the appropriate 5.1 surround sound format (through suitable power amplifiers, not shown in the figure) as symbolically indicated by the block “reference rendering” 14. The original test signals 1 may, if desired, be authored as indicated by reference numeral 8′. The sound signals emitted by the loudspeakers 17 generate an original sound field 15 that can be perceived by real listeners or recorded by means of an artificial listener (artificial head, head and torso simulator etc.) 16. The artificial listener 16 is provided with pinna replicas and microphones in a well-known manner and can be characterised by left and right head-related transfer functions (HRTF) and/or corresponding head-related impulse responses (HRIR). The sound signals (a left and a right signal) picked up by the microphones in the artificial listener 16 are provided (symbolised by reference numeral 18) to means 6′ that utilises appropriate metrics to derive a physical measure 19 that in an appropriate manner characterises the auditory spatial characteristics or attributes of the sound field 15. These physical measures 19 are provided to comparing means 9.
The system to be evaluated by this embodiment of the present invention is a virtual 2-channel surround system comprising only two front loudspeakers 25 instead of the five-loudspeaker set-up of the reference system. In the example, the total “device under test” DUT 2 consists of a processing/codec/transmission path 21 and a reproduction rendering 22 providing the final output signals to the loudspeakers 25. The loudspeakers generate a sound field 24 that is an altered version of the original sound field 15 of the reference system. This sound field is recorded by an artificial listener 16 and the output signals (left and right ear signals) from the artificial listener are provided to means 6″ that utilises appropriate metrics to derive a physical measure 20 that in an appropriate manner characterises the auditory spatial characteristics (in this case the same characteristics or attributes as the means 6′) of the sound field 24. These physical measures 20 are provided to comparing means 9 where they are compared with the physical measures 19 provided by the metric means 6′ in the reference system.
The result of the comparison carried out in the comparison means 9 is provided as designated by reference numeral 28.
The result 28 of the comparison of the two physical measures 19 and 20 is itself a physical measure and this physical measure must be translated to a predicted subjective (i.e. perceived) difference 10 that can for instance be described by means of suitable scales as described in more detail in following paragraphs of this specification.
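The chain from the two physical measures 19 and 20, through the comparison result 28, to a predicted difference 10 may be sketched as below. The metric names, the weights, the absolute-difference distortion measure and the 0 to -4 difference scale are all assumptions made for illustration; the actual translation would be calibrated against listening test results.

```python
def compare_measures(reference, evaluated):
    """Per-metric comparison result: evaluated value minus reference value."""
    return {name: evaluated[name] - reference[name] for name in reference}


def difference_grade(differences, weights, worst=-4.0):
    """Translate weighted absolute differences into a grade between 0 and `worst`."""
    distortion = sum(weights[name] * abs(d) for name, d in differences.items())
    return max(worst, -distortion)


# Hypothetical physical measures for the reference and the evaluated sound field.
reference = {"iacc": 0.30, "source_angle_deg": 30.0}
evaluated = {"iacc": 0.55, "source_angle_deg": 20.0}
# Illustrative weights standing in for calibrated values.
weights = {"iacc": 4.0, "source_angle_deg": 0.1}

c = compare_measures(reference, evaluated)
grade = difference_grade(c, weights)
```

With identical measures the grade is 0 (no predicted difference), and large deviations saturate at the lower end of the assumed scale.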
Referring to
Referring to
a illustrates un-intrusive prediction of perceived envelopment, but it is understood that envelopment is only an example of an auditory perceptual attribute that could be predicted by the invention and that other, specific attributes as well as overall spatial audio quality, as a holistic quantity could be predicted according to the invention, both intrusively and un-intrusively. Reference numeral 43 indicates raw physical measurements or quantities, for instance sound pressure measurements in an actual sound field or audio signals from a DVD or from a signal processing algorithm. The corresponding signals, measurements or quantities I are provided to a metric for providing an objective measure M relating to the attribute (in this example) “envelopment”, thus at the output of the metric is provided a higher level objective measure derived from the raw measurements, signals or quantities. This objective measure M is provided to a prediction model 46, which may for instance be implemented as a regression model, lookup table, or an artificial neural network, which prediction model has been calibrated by means of suitable listening tests on real listeners as symbolically indicated by reference numeral 45. The prediction model 46 “translates” the objective measure M into a subjective perceptual measure 47, in the shown example of the envelopment as it would be perceived by human listeners. This subjective perceptual measure 47 can for instance be characterised by a rating on a suitable scale as indicated by 47′ in the figure.
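The calibration of a prediction model against listening test results, and its subsequent use to translate an objective measure M into a predicted rating, might be sketched with a one-variable least-squares regression as follows. The calibration data, the 0 to 100 rating scale and the function names are hypothetical; a look-up table or artificial neural network could take the place of the regression.

```python
def calibrate(measures, ratings):
    """Ordinary least-squares fit of rating = slope * measure + intercept."""
    n = len(measures)
    mean_m = sum(measures) / n
    mean_r = sum(ratings) / n
    sxx = sum((m - mean_m) ** 2 for m in measures)
    sxy = sum((m - mean_m) * (r - mean_r) for m, r in zip(measures, ratings))
    slope = sxy / sxx
    return slope, mean_r - slope * mean_m


def predict(measure, model, low=0.0, high=100.0):
    """Translate an objective measure into a rating, clamped to the scale limits."""
    slope, intercept = model
    return min(high, max(low, slope * measure + intercept))


# Hypothetical calibration data: objective envelopment measures and the mean
# ratings (0 to 100) that listeners gave to the same stimuli.
measures = [0.1, 0.3, 0.5, 0.7, 0.9]
ratings = [12.0, 30.0, 52.0, 68.0, 90.0]

model = calibrate(measures, ratings)
predicted = predict(0.6, model)
```

Once calibrated, the model can be applied to measures derived from stimuli that were not part of the listening test, which is the sense in which the predictor replaces further listening tests.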
Referring to
c shows schematically the relationship between raw data (physical signals), low-level metrics, high-level metrics and objective measures. According to the invention the objective measures that are subsequently provided to the prediction means may be derived from the raw data or measurements (physical signals) 43 through a “hierarchy” of metrics. Thus, as exemplified in
The objective measures provided by the plurality of metrics are subsequently provided to a prediction model 46′ that has been calibrated appropriately by means of listening tests as described above and which prediction model 46′ (which as in
When auditioned by a human listener, one or more audio signals reproduced through one or more transducers give rise to a perceived spatial audio scene, whose features are determined by the content of the audio signal(s) and any inter-channel relationships between those audio signals (e.g. interchannel time and amplitude relationships). For the sake of clarity, the term ‘version’ is used to describe a particular instance of such a reproduction, having a specific channel format, transducer arrangement and listening environment, giving rise to the perception of a certain spatial quality. It is not necessary for any versions that might be compared by the system to have the same channel format, transducer arrangement or listening environment. The term ‘reference version’ is used to describe a reference instance of such, used as a basis for the comparison of other versions. The term ‘evaluation version’ is used to describe a version whose spatial quality is to be evaluated by the system, device and method according to the present invention described here. This ‘evaluation version’ may have been subject to any recording, processing, reproducing, rendering or acoustical modification process that is capable of affecting the perceived spatial quality.
In the case of single-ended embodiments of the system, device and method of the present invention, no reference version is available, hence any prediction of spatial quality is made on the basis of metrics derived from the evaluation version alone. In the case of double-ended embodiments of the system, device and method according to the invention, it is assumed that the evaluation version is an altered version of the reference version, and a comparison is made between metrics derived from the evaluation version and metrics derived from the reference version (as exemplified by
An ‘anchor version’ is a version of the reference signal, or any other explicitly defined signal or group of signals, that is aligned with a point on the quality scale to act as a scale anchor. Anchor versions can be used to calibrate quality predictions with relation to defined auditory stimuli.
Spatial quality, in the present context, means a global or holistic perceptual quality, the evaluation of which takes into account any and all of the spatial attributes of the reproduced sound, including, but not limited to:
Spatial quality can be evaluated by comparing the spatial quality of an evaluation version to a reference version (double-ended or intrusive method), or using only one or more evaluation versions (single-ended or unintrusive method).
In one embodiment of the invention the spatial quality rating can include a component that accounts for the subjective hedonic effect of such spatial attributes on a defined group of human subjects within a given application context. This subjective hedonic effect can include factors such as the appropriateness, unpleasantness or annoyance of any spatial distortions or changes in the evaluation version compared with the reference version.
When using the single-ended method, the global spatial quality grade is to some extent arbitrary, as there is no reference version available for comparison. In this case spatial quality is defined in terms of hedonic preference for one version over another, taking into account the application context, target population and programme content. Different databases of listening test results and alternative calibrations of the statistical regression model, look-up table or equivalent means may be required if it is desired to obtain accurate results for specific scenarios.
Even with the single-ended method, however, one manifestation of the system and method enables selected sub-attributes that contribute to the global spatial quality grade to be predicted. One example is the ‘envelometer’, which predicts the envelopment of arbitrary spatial audio signals, calibrated against explicit auditory anchors (see the following detailed description of an embodiment of an envelometer according to the present invention). Another example is the source location predictor (an embodiment of which is also described in detail in the following).
The spatial quality of the evaluation version can be presented in the form either of a numerical grade or rank order position among a group of versions, although other modes of descriptions may also be used.
A number of embodiments of the system, device and method according to the invention are possible, each of which predicts spatial quality on an appropriate scale, calibrated against a database of responses derived from experiments involving human listeners. The following are examples of scales that can be employed, which in a basic form of the system can give rise to ordinal grades that can be placed in rank order of quality, or in a more advanced form of the system can be numerical grades on an interval scale:
(1) A spatial quality scale. This is appropriate for use either with or without a reference version. If a reference version is available its spatial quality can be aligned with a specific point on the scale, such as the middle. Evaluation versions are graded anywhere on the scale, depending on the prediction of their perceived spatial quality. Evaluation versions can be graded either higher or lower than any reference version. If an evaluation version is graded above any reference version it is taken to mean that this represents an improvement in spatial quality compared to the reference.
(2) A spatial quality impairment scale. This is a special case of (1) appropriate for use only where a reference version, representing a correct original version, is available for comparison. Here the highest grade on the scale is deemed to have the same spatial quality as that of the reference version. Lower grades on the scale have lower spatial quality than that of the reference version. All evaluation versions have to be graded either the same as, or lower than, the reference version. It is assumed that any spatial alteration of the reference signal must be regarded as an impairment and should be graded with lower spatial quality.
As there is no absolute meaning to spatial quality, and no known reference point for the highest and lowest spatial quality possible in absolute terms, the range of scales employed must be defined operationally within the scope of the present invention. A number of embodiments are possible. These require alternative calibrations of, for instance, a statistical regression model, look-up table or equivalent means used to predict the spatial quality, and may require alternative metrics and databases of listening test results from human subjects if the most accurate results are to be obtained. In all the embodiments described below the minimum requirement is that the polarity of the scale is indicated, in other words, which direction represents higher or lower quality:
1) An unlabelled scale without explicit anchors. Here the evaluation versions are graded in relation to each other, making it possible to determine their relative spatial quality, but with no indication of their spatial quality in relation to verbal or auditory anchor points.
2) An unlabelled scale with explicit auditory anchors. Here the evaluation versions are graded against one or more explicit auditory anchors. The auditory anchors are aligned with specific points on the scale that may correspond to desired or meaningful levels of spatial quality. The auditory anchors define specific levels of spatial quality inherent in the anchor versions. In the case of the spatial impairment scale, the only explicit anchor is at the top of the scale and is the reference version.
3) An unlabelled scale with reference and hidden auditory anchors. Here the evaluation versions are graded in relation to the reference version. Hidden among the versions are one or more anchor stimuli having known spatial characteristics. This can be used during the calibration of the system to compensate for different uses of the scale across different calibration experiments, provided that the same anchor stimuli are used on each calibration occasion.
4) Any of the above scales can be used together with verbal labels that assign specific meanings to marked points on the scale. Examples of such labels are derived from ITU-R Recommendations BS.1116 and BS.1534. In the case of impairment scales these can be marked from top to bottom at equal intervals: imperceptible (top of scale); perceptible but not annoying; slightly annoying; annoying; very annoying (bottom of scale). In the case of quality scales the interval regions on the scale can be marked excellent (highest interval), good, fair, poor, bad (lowest interval). In all cases these scale labels are intended to represent equal increments of quality on a linear scale. It should be noted that such verbal labels are subject to biases depending on the range of qualities inherent in the stimuli evaluated, language translation biases, and differences in interpretation between listeners. For this reason it is recommended that labelled scales be used only when a verbally defined meaning for a certain quality level is mandatory.
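By way of illustration only, the attachment of five equally spaced quality labels to a numerical grade could be sketched as follows. The 0 to 100 scale range, the function name `quality_label` and the treatment of boundary values are assumptions made for this example and are not taken from the description above.

```python
# Hypothetical helper: map a grade on an assumed 0-100 quality scale
# to one of five equal-interval verbal labels (lowest interval first).
QUALITY_LABELS = ["bad", "poor", "fair", "good", "excellent"]


def quality_label(grade, scale_min=0.0, scale_max=100.0):
    """Return the verbal label of the interval that contains the grade."""
    if not scale_min <= grade <= scale_max:
        raise ValueError("grade lies outside the scale")
    width = (scale_max - scale_min) / len(QUALITY_LABELS)
    # The very top of the scale is assigned to the highest interval.
    index = min(int((grade - scale_min) / width), len(QUALITY_LABELS) - 1)
    return QUALITY_LABELS[index]
```

An impairment scale could be handled in the same way by substituting the five impairment labels, again ordered from the bottom of the scale to the top.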
In one embodiment of the invention the input signals to the DUT are any form of ecologically valid spatial audio programme material.
In another embodiment of the invention the input signals to the DUT are special test signals, having known spatial characteristics.
The system, device and method according to the invention include descriptions of ecologically valid, or programme-like, test signals, and sequences thereof, that have properties such that, when applied to the DUT and subsequently measured by the algorithms employed by the system, they lead to predictions of perceived spatial quality that closely match those given by human listeners when listening to typical programme material that has been processed through the same DUT. These test signals are designed in a generic fashion so that they stress the spatial performance of the DUT across a range of relevant spatial attributes.
The selection of appropriate test signals and the metrics used for their measurement depends on the chosen application area and context for the spatial quality prediction. This is because not all spatial attributes are equally important in all application areas or contexts. In one embodiment of the invention the test signals and sequence thereof can be selected from one of a number of stored possibilities, so as to choose the one that most closely resembles the application area of the test in question. An example of this is that the set of test signals and metrics required to evaluate spatial quality of 3D flight simulators would differ from the set required to evaluate home cinema systems.
Other examples of sets of test signals and metrics include those suitable for the prediction of typical changes in spatial quality arising from, for example (but not restricted to): audio codecs, downmixers, alternative rendering formats/algorithms, non-ideal or alternative loudspeaker layouts or major changes in room acoustics.
In one embodiment of the invention the test signals are created in a universal spatial rendering format of high directional accuracy (e.g. high order ambisonics). These are then transcoded to the channel format of the reference and/or evaluation versions so that they can be used. In this way the test signals are described in a fashion that is independent of the ultimate rendering format and can be transcoded to any desired loudspeaker or headphone format.
In another embodiment of the invention, the test signals are created in a specific channel format corresponding to the format of the system under test. An example of this is the ITU-R BS.775 3-2 stereo format. Other examples include the ITU 5-2 stereo format, the 2-0 loudspeaker stereo format and the two-channel binaural format. In the last case the test signals are created using an appropriate set of two-channel head-related transfer functions that enable the creation of test signals with controlled interaural differences. Such test signals are appropriate for binaural headphone systems or crosstalk-cancelled loudspeaker systems that are designed for binaural sources.
In one embodiment of the invention the spatial quality of one or more DUTs is evaluated using real acoustical signals reproduced in real rooms.
In another embodiment the acoustical environment and/or transducers are simulated using digital signal processing.
In another embodiment a combination of the two approaches is employed (e.g. simulated reproduction of the reference version, acoustical reproduction of the evaluation version). In this embodiment, for example, a stored and simulated reference version could be compared in the field against a number of real evaluation versions.
The DUT may include the transducers and/or room acoustics (e.g. if one is comparing different loudspeaker layouts or the effects of different rooms).
The room impulse responses used to simulate reproduction of loudspeaker signals in various listening environments may be obtained from a commercial acoustical modeling package, using room models built specifically for the purposes of capturing impulse responses needed for the loudspeaker layouts and listener positions needed for the purposes of this model.
The process of QESTRAL encoding is the translation of one or more audio channels of the reference or evaluation versions into an internal representation format suitable for analysis by the system's measurement algorithms and metrics. Such encoding involves one or more of the following processes, depending on whether the DUT includes the transducers and/or room acoustics:
(1) Loudspeaker or headphone reproduction, or simulation thereof, at one or more locations.
(2) Anechoic or reverberant reproduction, or simulation thereof, with one or more rooms.
(3) Pickup by probe transducers (real or simulated), at one or more locations, with one or more probes.
(4) Direct coupling of the audio channel signals from the DUT, if the DUT is an audio signal storage, transmission or processing device, (i.e. omitting the influence of transducers, acoustical environment and head-related transfer functions).
Depending on the set of metrics to be employed, according to the mode of operation of the system, device and method according to the invention, one or more of these encoding processes will be employed.
Examples of probe transducers include omnidirectional and directional (e.g. cardioid or bi-directional) microphones, Ambisonic ‘sound field’ microphones of any order, wavefield capture or sampling arrays, directional microphone arrays, binaural microphones, and dummy head and torso simulators.
In one example, given for illustration purposes, the DUT is a five-channel perceptual audio codec and it is desired to determine the spatial quality in relation to an unimpaired five-channel reference version. In such a case the evaluation and reference versions are five-channel digital audio signals. QESTRAL encoding then involves the simulated or real reproduction of those signals over loudspeakers, in either anechoic or reverberant room conditions, and finally the capture of the spatial sound field at one or more locations by means of one or more simulated or real pickup transducers or probes. This requires processes (1), (2) and (3) above. Alternatively, in another embodiment of the invention, results of limited applicability could be obtained by means of process (4) alone, assuming that appropriate metrics and listening test results can be obtained.
In another example the DUT is a loudspeaker array and it is desired to determine the spatial quality difference between a reference loudspeaker array and a modified array that has different loudspeaker locations, in the same listening room. In such a case the evaluation and reference versions are real or simulated loudspeaker signals reproduced in a real or simulated listening room. QESTRAL encoding then involves only process (3).
In one embodiment of the invention the spatial quality is predicted at a single listening location.
In another embodiment the spatial quality is predicted at a number of locations throughout the listening area. These results can either be averaged or presented as separate values. This enables the drawing of a quality map or contour plot showing how spatial quality changes over the listening area.
The system, device or method according to the invention is calibrated using ratings of spatial quality given on scales as described above, provided by one or more panels of human listeners. A database of such results can be obtained for every context, programme type and/or application area in which the system, device and method according to the invention is to be applied. It may also be desirable to obtain databases that can be used for different populations of listeners (e.g. audio experts, pilots, game players) and for scenarios with and without different forms of picture. For example, it may be necessary to obtain a database of quality ratings in the context of home cinema systems (application area), movie programme material (programme content) and expert audio listeners (population). Another database could relate to flight simulators (application area), battle sound effects and approaching missiles (programme content), and pilots (population).
In the case of each database, a range of programme material is chosen that, in the opinion of experts in the field, and a systematic evaluation of the spatial attributes considered important in that field, is representative of the genre. This programme material is subjected to a range of spatial audio processes, based on the known characteristics of the DUTs that are to be tested, appropriate to the field, giving rise to a range of spatial quality variations. It is important that all of the relevant spatial attributes are considered and that as many as possible of the spatial processes likely to be encountered in practical situations are employed. Greater accuracy of prediction is obtained from the system as more, and more relevant, examples are employed in the calibration process. It is important that the range of spatial qualities presented in the calibration phase spans the range of spatial qualities that are to be predicted by the system, and does so in a well distributed and uniform manner across the scale employed.
Calibration is achieved by listening tests which should be carried out using controlled blind listening test procedures, with clear instructions to subjects about the task, definition of spatial audio quality, meaning of the scale and range of stimuli. Training and familiarisation can be used to improve the reliability of such results. Multiple stimulus comparison methods enable fast and reliable generation of such quality data.
The systems, devices and methods according to the invention rely on psychoacoustically informed metrics that are derived from measurements of the audio signals (which may have been QESTRAL-encoded) and that, in an appropriately weighted linear or non-linear combination, enable predictions of spatial quality.
As noted above, it is possible for the input signals to the QESTRAL model to be either ecologically valid programme material, or, in another embodiment, specially designed test signals with known spatial characteristics. The metrics employed in each case may differ, as it is possible to employ more detailed analysis of changes in the spatial sound field when the characteristics of the signals to be evaluated are known and controllable. For example, known input source locations to the DUT could be compared against measured output locations in the latter scenario. In the case where programme material is used as a source a more limited range of metrics and analysis is likely to be possible.
The systems, devices and methods according to the invention incorporate a statistical regression model, look-up table or tables, or equivalent means of weighting and combining the results of the above metrics (for instance relating to the prediction of the perception of different auditory space-related attributes) so as to arrive at an overall prediction of spatial quality or fidelity. Such a model may scale and combine some or all of the metrics in an appropriate linear or non-linear combination, in such a way as to minimise the error between actual (listening test database) and predicted values of spatial quality.
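The weighting of metrics so as to minimise the error against listening test scores can be sketched, under simplifying assumptions, with an ordinary least-squares fit. Ordinary least squares is used here only for brevity and is one of many possible regression techniques; the function names `fit_linear_model` and `predict` are hypothetical, and any data passed to them in an example would be synthetic rather than taken from a listening test.

```python
def fit_linear_model(metrics, scores):
    """Least-squares weights (intercept first) so that scores ~ metrics.

    metrics: list of per-stimulus metric tuples (predictor variables).
    scores:  list of listening-test scores, one per stimulus.
    """
    rows = [[1.0] + list(m) for m in metrics]  # prepend intercept term
    n = len(rows[0])
    # Normal equations (X^T X) w = X^T y, held as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, scores))]
         for i in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        if abs(a[col][col]) < 1e-12:
            raise ValueError("metrics are collinear; weights are not unique")
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(weights, metric_values):
    """Predicted score for one stimulus from its metric values."""
    return weights[0] + sum(w * m for w, m in zip(weights[1:], metric_values))
```

The collinearity check in this sketch reflects a practical limitation of plain least squares when predictor variables are strongly correlated, which is one reason a different regression technique may be preferred in practice.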
In one embodiment of the invention a generic regression model is employed that aims to predict an average value for spatial audio quality of the evaluation version, based on a range of listening test databases derived from different application areas and contexts.
In another embodiment individual regression models are employed for each application area, context, programme genre and/or listener population. This enables more accurate results to be obtained, tailored to the precise circumstances of the test.
There follows an example of a regression model employed to predict the spatial quality of a number of evaluation versions when compared to a reference version.
The following is an example of the use of selected metrics, together with special test signals and a regression model calibrated using listening test scores derived from human listeners, to measure the reduction in spatial quality of 5-channel ITU BS.775 programme material compared with a reference reproduction, when subjected to a range of processes modifying the audio signals (representative of different DUTs), including downmixing, changes in loudspeaker location, distortions of source locations, and changes in interchannel correlation.
In this example, special test signals are used as inputs to the model, one of which enables the easy evaluation of changes in source locations. These test signals in their reference form are passed through the DUTs, leading to spatially impaired evaluation versions. The reference and evaluation versions of the test signals are then used as inputs to the selected metrics as described below. The outputs of the metrics are used as predictor variables in a regression model. A panel of human listeners audition a wide range of different types of real 5-channel audio programme material, comparing a reference version with an impaired version processed by the same DUTs. Spatial quality subjective grades are thereby obtained. This generates a database of listening test scores, which is used to calibrate the regression model. The calibration process aims to minimise the error in predicted scores, weighting the predictor variables so as to arrive at a suitable mathematical relationship between the predictor variables and the listening test scores. In this example, a linear partial-least-squares regression (PLS-R) model is used, which helps to ameliorate the effects of multicollinearity between predictor variables.
Test signal 1: a decorrelated pink noise played through all five channels simultaneously.
Test signal 2: thirty-six pink noise bursts, pairwise-constant-power-panned around the five loudspeakers from 0° to 360° in 10° increments. Each noise burst lasts one second.
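The panning positions of test signal 2 can be sketched as follows. The loudspeaker azimuths (0°, ±30° and ±110°, expressed clockwise from front centre), the sine/cosine pan law and the function name `pan_gains` are assumptions made for this example; the description above states only that the bursts are pairwise constant-power panned in 10° increments.

```python
import math

# Assumed nominal 3-2 layout, in degrees clockwise from front centre.
SPEAKERS = [("C", 0.0), ("R", 30.0), ("RS", 110.0), ("LS", 250.0), ("L", 330.0)]


def pan_gains(azimuth_deg):
    """Constant-power gains for the loudspeaker pair enclosing azimuth_deg."""
    az = azimuth_deg % 360.0
    gains = {name: 0.0 for name, _ in SPEAKERS}
    for i, (name_a, ang_a) in enumerate(SPEAKERS):
        name_b, ang_b = SPEAKERS[(i + 1) % len(SPEAKERS)]
        span = (ang_b - ang_a) % 360.0    # angular width of this pair
        offset = (az - ang_a) % 360.0     # position within the pair
        if offset < span:
            frac = offset / span          # 0 at speaker a, 1 at speaker b
            gains[name_a] = math.cos(frac * math.pi / 2)
            gains[name_b] = math.sin(frac * math.pi / 2)
            break
    return gains


# One gain set per burst: 0, 10, ..., 350 degrees (thirty-six positions).
burst_gains = [pan_gains(az) for az in range(0, 360, 10)]
```

Each burst would then be formed by scaling a one-second pink noise segment by the two non-zero gains of its position, so that the summed power across the loudspeakers stays constant around the circle.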
As shown in Table 1, one set of metrics is used with test signal 1, and another with test signal 2. In the case of the metrics used with test signal 1, these are calculated as difference values between the reference condition and the evaluation condition. These metrics are intended to respond to changes in envelopment and spaciousness caused by the DUT. In the case of the metrics used with test signal 2, the noise bursts are input to the localisation model, resulting in thirty-six source location angles in the range 0° to 360°. Three higher-level metrics that transform a set of thirty-six angles into a single value are then used. The metrics used on the second test signal are intended to respond to changes in source localisation caused by the DUT.
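To indicate how a set of thirty-six localisation angles might be reduced to a single value, the following sketch computes a mean absolute angular error between reference and evaluation angles. This particular metric, and the names `angle_difference` and `mean_abs_angle_error`, are purely illustrative assumptions; the three higher-level metrics actually used are those listed in Table 1.

```python
def angle_difference(a_deg, b_deg):
    """Signed smallest difference a - b, wrapped into [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def mean_abs_angle_error(reference_angles, evaluation_angles):
    """Average absolute angular shift between paired source locations."""
    if len(reference_angles) != len(evaluation_angles):
        raise ValueError("angle lists must have equal length")
    diffs = [abs(angle_difference(e, r))
             for r, e in zip(reference_angles, evaluation_angles)]
    return sum(diffs) / len(diffs)
```

The wrapping step matters because angles are reported in the range 0° to 360°: a source that moves from 350° to 10° has shifted by 20°, not by 340°.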
The coefficients of an example calibrated regression model, both raw and standardised (weighted), are shown in Table 2.
An example of the prediction of listening test scores using this regression model is shown in
There follows an exemplary list of high level metrics (described here as “features”) that according to the invention can be used in the prediction of spatial quality or individual attributes thereof.
Sound Localisation from Binaural Signals by Probabilistic Formulation of HRTF-Based Measurement Statistics
In the following there is described a method according to the present invention for estimation of sound source direction based on binaural signals, the signals being received at the ears of a listener. Based on cues extracted from these signals, a method and corresponding system or device is developed according to the invention that employs a probabilistic representation of the cue statistics to determine the most likely direction of arrival.
Just as a camera determines the direction of objects from which light emanates, it is useful to find the direction of sound sources in a scene, which could be in a natural or artificial environment. In many cases, it is important to perform localisation in a way that mimics human performance, for instance so that the spatial impression of a musical recording or immersive sensation of a movie can be assessed. From the perspective of engineering solutions to directional localisation of sounds, perhaps the most widespread approach involves microphone arrays and the time differences of arrival of incident sound waves.
According to a preferred embodiment of the present invention only two sensors are used (one to represent each ear of a human listener) and the prediction of the direction to a sound source does not rely only on time delay cues. It includes (but is not limited to) the use of both interaural time difference (ITD) and interaural level difference (ILD) cues. By enabling the responses of human listeners to be predicted, time-consuming and costly listening tests can be avoided. Many signals can be evaluated, having been processed by the system. Where acoustical simulations are manipulated in order to generate the binaural signals, it is possible to run extensive computer predictions and obtain results across an entire listening area. Such simulations could be performed for any given sound sources in any specified acoustical environment, including both natural sound scenes and those produced by sound reproduction systems in a controlled listening space.
There are possible applications of the invention at least in the following areas:
According to this embodiment of the invention there is provided a system, device and method for estimating the direction of a sound source from a pair of binaural signals, i.e. the sound pressures measured at the two ears of the listener. The listener could be a real person with microphones placed at the ears, an acoustical dummy or, more often, a virtual listener in a simulated sound field. The human brain relies on ITD and ILD cues to localise sounds, but its means of interpreting these cues to yield an estimated direction is not fully known. Many systems use a simple relation or a look-up table to convert from cues to an angle. According to the present invention this problem is considered within a Bayesian framework that guides the use of statistics from training examples to provide estimates of the posterior probability of an angle given a set of cues. In one embodiment, a discrete probability representation yields a set of re-weighted look-up tables that produce more accurate information about how a human listener would perceive the sound direction. An alternative continuous probability embodiment might use, for example, a mixture of Gaussian probability density functions to approximate the distributions learnt from the training data.
The localisation prediction according to the invention is compatible with any set of binaural signals for which the HRTF training examples are valid, or for any real or simulated sound field from which binaural signals are extracted. The current embodiment uses an HRTF database recorded with a KEMAR® dummy head, but future embodiments may use databases recorded with other artificial heads and torsos, properly averaged data from humans or personalised measurements from one individual. Where sufficient individual data are not available, the training statistics may be adapted to account for variations in factors such as the size of head, the shape of ear lobes and the position of the ears relative to the torso. By the principle of superposition, the training data may be used to examine the effects of multiple concurrent sources.
Although the invention applies, in principle, to localisation of a sound source in 3D space, which implies the estimation of azimuth angle, angle of elevation, and range (distance to the sound source), for the sake of simplicity the following discussion will deal only with azimuth θ. The discussion here is also restricted to consideration of ITD and ILD cues, although the invention includes the use of other cues, such as timbral features, spectral notches, measures of correlation or coherence, and estimated signal-to-noise ratio for time-frequency regions of the binaural signals.
The Bayesian framework uses the statistics of the training examples to form probability estimates that lead to an estimate of the localisation angle θ̂ based on the cues at that time. To take one particular instantiation of the system, we will first consider an implementation that can form an approximation of the posterior probability of any angle θ given the ITD cue ΔT and ILD cue ΔL. A more general case takes into account the dependency of the angle on both of these cues together, incorporating features of the joint distribution. However, the instantiation that we now describe combines the information by assuming independence of these cues: the product of their separate conditional probabilities is divided by the prior probability of the angle. Thus, the predicted or estimated source direction θ̂ is defined as the angle with the maximum probability:
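Written out from the verbal description above (product of the two conditional probabilities, divided by the prior, maximised over angle), the estimate can be expressed as follows. This is a reconstruction from that description under the stated independence assumption, with ΔT and ΔL denoting the ITD and ILD cues as in the text.

```latex
\[
\hat{\theta} \;=\; \arg\max_{\theta}\;
\frac{p(\theta \mid \Delta T)\, p(\theta \mid \Delta L)}{p(\theta)}
\]
```

The division by p(θ) compensates for the prior being contained in each of the two conditional probabilities, so that it is counted only once in the combined estimate.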
The prior probability p(θ) is a measure of how likely any particular source direction is to occur. In general, we may define all directions as equally likely by giving it a uniform distribution; in some applications however, there may be very strong priors on the audio modality, for instance, in television broadcast where the majority of sources coincide with people and objects shown on the screen in front of the listener.
The conditional probabilities for each cue are defined as:
where two normalisations are applied. Initial estimates of the probabilities may be gathered from counting the occurrences of interaural difference values at each angle. The first operation normalises the training counts from recordings made at specified angles to give an estimate of the likelihood of a cue value given an angle p(Δ|θ), similar to the relative frequency. We refer to this as vertical normalisation as it applies to each column in the look-up table. The second operation, the horizontal normalisation applied to the rows, employs Bayes' theorem to convert the likelihood into a posterior probability, dividing by the evidence probability p(Δ). These steps are the same for ITD and ILD cues, ΔT and ΔL, and are illustrated in
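The two normalisations can be sketched for a single cue as follows, assuming a count table whose rows are cue-value bins and whose columns are training angles. The function name `train_posterior_table`, the table orientation in code and the default uniform prior are assumptions made for this example.

```python
def train_posterior_table(counts, prior=None):
    """counts[d][a]: occurrences of cue bin d in training data at angle a.

    Returns posterior[d][a], an estimate of p(angle a | cue bin d).
    """
    n_bins, n_angles = len(counts), len(counts[0])
    prior = prior or [1.0 / n_angles] * n_angles
    # Vertical normalisation: each column sums to one, giving p(cue | angle).
    col_sums = [sum(counts[d][a] for d in range(n_bins))
                for a in range(n_angles)]
    likelihood = [[counts[d][a] / col_sums[a] if col_sums[a] else 0.0
                   for a in range(n_angles)]
                  for d in range(n_bins)]
    # Horizontal normalisation: Bayes' theorem, dividing each row by p(cue).
    posterior = []
    for row in likelihood:
        joint = [lik * p for lik, p in zip(row, prior)]
        evidence = sum(joint)  # p(cue) for this bin
        posterior.append([j / evidence if evidence else 0.0 for j in joint])
    return posterior
```

The same routine would be applied separately to the ITD counts and the ILD counts, giving one table per cue (and, in a multi-band system, per frequency band).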
One implementation of the process for training the system's representation of the posterior probabilities and thereby providing a prediction of the most likely azimuth angle to a sound source may be summarised as follows:
The trained look-up tables are then ready to be used in the chosen application with new unknown binaural signals for localisation. As shown in
Referring to
Reverting to
Referring to
The formation of histograms is further illustrated with reference to
The procedure according to the invention for forming a histogram corresponding to a single, given frequency band is furthermore illustrated with reference to
Referring to
Referring to
A method for distinguishing between sound incidence from the frontal hemisphere and the rear hemisphere (i.e. for front/back disambiguation) is illustrated with reference to
a and b show plots of the estimated angular probability of source location for a given stationary source at −30 degrees for two head orientations: straight ahead and 5 degrees to the left. According to the invention, in order to resolve front/back ambiguities, two sets of binaural signals are passed sequentially through the localisation model: one with the head at an orientation of θ and the other with an orientation of θ−Δ degrees (e.g. Δ equal to 5 degrees), i.e. with the head turned (rotated) 5 degrees to the left. The resulting angles are then compared, and the direction in which the measured angle (provided by the model) moves is used to determine whether the source is in the front or the back hemisphere. (Having measurements made at two or more head rotations is consistent with some of the other measures used in the model, in particular the IACC90, which uses a different set of binaural signals to the IACC0, i.e. a second set of measurements made at a different head rotation.)
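The comparison of the two measured angles can be sketched as follows. The sign convention (negative azimuth to the listener's left, so that a leftward head turn is a negative rotation), the idealised front-hemisphere model in `lateral_angle` and both function names are assumptions made for this example, not a description of the localisation model itself.

```python
import math


def lateral_angle(source_az_deg, head_az_deg):
    """Front-hemisphere angle in [-90, 90] that an idealised two-sensor
    model would report for a source at source_az_deg, head at head_az_deg."""
    rel = math.radians(source_az_deg - head_az_deg)
    return math.degrees(math.asin(max(-1.0, min(1.0, math.sin(rel)))))


def front_or_back(angle_before, angle_after, head_turn_deg):
    """Classify the hemisphere from the shift in measured angle.

    head_turn_deg < 0 means the head was turned to the left.
    """
    shift = angle_after - angle_before
    if abs(shift) < 1e-9 or head_turn_deg == 0:
        return "ambiguous"
    # A frontal source appears to move opposite to the head turn;
    # its rear mirror image appears to move with the head.
    return "front" if shift * head_turn_deg < 0 else "back"
```

For a source at −30° and a head turn of 5° to the left, the measured angle in this idealised model moves towards −25° if the source is in front and towards −35° if it is the rear mirror image at −150°.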
In the example shown in
The combination of information from each cue, across frequency bands and over time represents a form of multi-classifier fusion [Kittler, J. and Alkoot, F. M. (2003). “Sum versus vote fusion in multiple classifier systems”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 25, Issue 1, Pages: 110-115]. To achieve optimal performance it is possible to extend the localisation based model beyond a series of naïve Bayes probability estimates. Essentially, as well as making the most likely interpretation of the measurements at any given moment, the system can consider whether these measurements are reliable or consistent. Loudness weighting performs a related operation, in that it gives more confidence to the measurements that are assumed to have a higher signal-to-noise ratio. Similarly, methods for combining information over subsequent time frames, such as averaging, or thresholding and averaging, may be employed. A measure of the confidence of the extracted cue can be used to influence the fusion of scores, so that the overall output of the system combines the widest range of reliable estimates.
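One simple form of such fusion is a weighted sum of the per-band posterior estimates, with weights standing in for loudness or confidence. The sketch below assumes that all bands share a common angle grid; the function name `fuse_band_posteriors` and the choice of sum fusion are illustrative assumptions.

```python
def fuse_band_posteriors(band_posteriors, band_weights):
    """Weighted-sum fusion of per-band posteriors over a common angle grid.

    band_posteriors[b][a]: posterior of angle index a in frequency band b.
    band_weights[b]:       non-negative confidence (e.g. loudness) of band b.
    """
    total = sum(band_weights)
    if total <= 0:
        raise ValueError("at least one band needs a positive weight")
    n_angles = len(band_posteriors[0])
    return [sum(w * post[a] for w, post in zip(band_weights, band_posteriors))
            / total
            for a in range(n_angles)]
```

Because the weights are normalised by their sum, the fused values still sum to one whenever each per-band posterior does, and the most likely angle is the index of the largest fused value.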
The ITD and ILD cues on which the localisation prediction according to the present invention relies are currently extracted using standard techniques, as for instance described in the PhD thesis of Ben Supper [University of Surrey, 2005]. Yet, because the same signal processing is applied to the training data as to the test signals during system operation, alternative techniques can be substituted without any further change to the system, devices and method according to the present invention.
Prediction of spatial attributes can be performed for arbitrary test signals. There is no restriction of the nature or type of acoustical signal that can be processed by the proposed invention. It may include individual or multiple simultaneous sources.
The proposed prediction of direction may be applied to time-frequency elements or regions of time-frequency space (e.g., in spectrogram or Gabor representation). A straightforward example could implement this notion by identifying which critical frequency bands were to be included for a given time window (a block of samples). In one embodiment, the selection of bands for each frame could be based on a binary mask that was based on a local signal-to-noise ratio for the sound source of interest.
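The band selection described above might be sketched as follows. The sketch assumes that per-band power estimates for the source of interest and for the interfering sound are available for each frame, which the text does not specify; the function name `binary_band_mask` and the 0 dB default threshold are likewise hypothetical.

```python
import math

def binary_band_mask(target_power, noise_power, snr_threshold_db=0.0):
    """Return a binary mask (frames x bands) from the local SNR.

    target_power, noise_power: lists of frames, each a list of per-band
    power values (linear scale). A band is kept (True) when its local SNR
    exceeds the threshold.
    """
    mask = []
    for target_frame, noise_frame in zip(target_power, noise_power):
        row = []
        for signal, noise in zip(target_frame, noise_frame):
            if signal <= 0.0:
                row.append(False)   # no target energy in this band
            elif noise <= 0.0:
                row.append(True)    # target present, no interference
            else:
                snr_db = 10.0 * math.log10(signal / noise)
                row.append(snr_db > snr_threshold_db)
        mask.append(row)
    return mask

# One frame with three critical bands (illustrative values only).
mask = binary_band_mask([[4.0, 0.5, 1.0]], [[1.0, 1.0, 1.0]])
```

Only the bands marked True would then contribute their cues to the direction estimate for that frame.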
The localisation of sound sources can be applied for evaluation of foreground and background streams. Human perception of sound through the signals received at the ears is contingent on the interpretation of foreground and background objects, which are broadly determined by the focus of attention at any particular time. Although the present invention does not provide any means for the separation of sound streams into foreground and background, it may be used to predict the location of sources within them, which includes any type of pre-processing of the binaural signals aimed at separation of content into foreground and background streams.
Improved localisation and front-back disambiguation can, as mentioned above, be achieved by head movement. The resolution of human sound localisation is typically most accurate directly in front of the listener, so the location of a stationary source may be refined by turning the face towards the direction of the sound. Equally, such a procedure can be used in the present invention. Another active listening technique from human behaviour that can be incorporated into the present invention is head movement aimed at distinguishing between localisation directions in front of and behind the listener. Owing to the availability of only two sensors, there exists a “cone of confusion” about the interaural axis (the line between the two ears). Thus, for a sound source in the horizontal plane there would be two solutions, one in front and one behind. However, whereas the true direction of the source would stay fixed with respect to the environment (inertial reference), the image direction would move around and lack stability, allowing it to be discounted. The present invention can embody a similar behaviour, where predictions are gathered for multiple head orientations, and those hypotheses that remain consistently located are identified (while inconsistent ones can be suppressed).
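The front/back decision based on a small head rotation can be sketched as below. The sign convention (azimuth in degrees, positive to the listener's right, head turned to the left between the two measurements) and the simulated front-folded measurement are assumptions made for illustration; the function names are hypothetical.

```python
def wrap_deg(angle):
    """Wrap an angle in degrees to the interval (-180, 180]."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped

def fold_to_front(azimuth_deg):
    """Front-hemisphere angle that shares the same interaural cues."""
    a = wrap_deg(azimuth_deg)
    if a > 90.0:
        return 180.0 - a
    if a < -90.0:
        return -180.0 - a
    return a

def simulate_measurement(source_azimuth_deg, head_azimuth_deg):
    """Stand-in for the model output: head-relative, front-folded angle."""
    return fold_to_front(source_azimuth_deg - head_azimuth_deg)

def resolve_front_back(angle_before, angle_after):
    """Classify the hemisphere from the shift after a head turn to the left.

    A frontal source appears to move to the right (angle increases); a
    source behind the listener appears to move to the left.
    """
    shift = angle_after - angle_before
    if shift > 0.0:
        return "front"
    if shift < 0.0:
        return "back"
    return "undetermined"

delta = 5.0  # head turned 5 degrees to the left, i.e. head azimuth -5
front = resolve_front_back(simulate_measurement(-30.0, 0.0),
                           simulate_measurement(-30.0, -delta))
back = resolve_front_back(simulate_measurement(-150.0, 0.0),
                          simulate_measurement(-150.0, -delta))
```

This simple sign rule only holds while both measurements stay clear of the interaural axis; for sources close to ±90 degrees the folded angle changes direction and further head orientations would be needed.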
Use of Subjective Training Data from Listening Tests
For individual stationary sound sources in an anechoic environment, the majority of systems make the assumption that the perceived direction of localisation matches the physical direction of the sound source subtended to the listener. However, it is well known that human listeners make mistakes and introduce variability in their responses. In some cases, bias is introduced, for example for a sound source being perceived as coming from a location slightly higher than its true elevation angle. For the purposes of azimuth estimation, the embodiment of the present invention assumes perfect alignment between physical and perceived azimuth to provide annotation of the training data. In other words, we assume that a sound source presented at 45° to the right is actually perceived as coming from that direction. More generally, however, the probabilistic approach to localisation allows for any annotation of the recordings used for training. Thus, labels based on listening test results could equally be used, for example, to train the system to recognize source elevation in terms of the perceived elevation angles. Another embodiment involves the use of labels for other perceived attributes in training, such as source width, distance, depth or focus. The result is presented in terms of equivalent probability distributions based on the cues from the binaural signals for the attribute whose labels were provided. In other words, where the perception of an alternative spatial attribute (such as width or distance) may depend on the cues that the system uses (which typically include but are not limited to ITDs and ILDs), training data can be used in a similar way to formulate a probabilistic prediction of that alternative attribute. Therefore, the use of labels in a training procedure enables alternative versions of the system to output predicted attribute values for any given test signals. 
This approach produces outputs that are more reliable estimates of human responses because: (i) it uses binaural signal features, such as ITD and ILD cues, in a way that imitates the primary stages of human auditory processing, and (ii) it can be trained to model the pdf of listener responses based on actual attribute data, such as the set of localisation angles.
An innovative aspect of our listening test methodology that was used to elicit responses of spatial attributes, such as directional localisation and perceived source width, from subjects was the use of a graphical user interface. The interface allowed spatial attributes of the perceived sound field to be recorded in a spatial representation. The example shown in FIG. 5 demonstrates how the localisation of each sound source was represented by an arrow in a sound scene that included multiple sources. The benefit of this approach is that it allows for a direct conversion of listening test data from a listener's experience to the domain of the perceived angle, maintaining the spatial relations within the test environment and without the need for an arbitrary scale.
Integration of Components from Scene to Scale
Within the context of an overall system for predicting the perceived spatial quality of processed/reproduced sound, the present localisation prediction model constitutes a module that takes binaural signals as input and provides an estimate of the distribution of posterior probability over the possible localisation angles. Most directly, by picking the maximum probability, or the peak of the pdf, the module predicts the most likely direction of localisation. Hence, the localisation module according to the invention can be used in conjunction with a sound-scene simulation in order to predict the most likely perceptual response throughout the listening area.
The present implementation is designed for sound sources in an anechoic environment. Nonetheless, any processing aimed at enhancing direct sound in relation to indirect sound can be used to improve the performance of the system. Conversely, for cases where it is important to identify the directions of reflections, the localisation module may be applied to the indirect, reflected sound. By concentrating, for example, on a time window containing the early reflections, the locations of the dominant image sources can be estimated, which may prove valuable for interpreting the properties of the acoustical environment (e.g., for estimating wall positions).
As discussed above in the section on the use of subjective training data, alternative outputs from the system can be achieved through supervised training. Soundfield features obtained in this way can be used in an overall quality predictor.
The close relationship between perceived and physical source locations implies that the output from the prediction of direction localisation has a meaningful interpretation in terms of physical parameters. Many prediction schemes can only be treated as a black box, without the capability of drawing any inference from intermediate attributes. For instance, a system that used an artificial neural network or set of linear regressions or look-up tables to relate signal characteristics directly to a measurement of spatial audio quality would typically not provide any meaningful information concerning the layout of the spatial sound scene. In contrast, as a component of a spatial quality predictor, the present module gives a very direct interpretation of the soundfield in terms of the perceived angles of sound sources.
As a specific example there is in the following described a so-called “ENVELOMETER”, which is a device according to the present invention for measuring perceived envelopment of a surrounding sound field, for instance, but not limited to, a reproduced sound field for instance generated by a standard 5.1 surround sound set-up.
People are normally able to assess this subjectively in terms of “high”, “low” or “medium” envelopment. However, there have been very few attempts to predict this psychoacoustical impression for reproduced sound systems using physical metrics, and none that are capable of working with a wide range of different types of programme material, with and without reverberation. The envelometer according to the present invention, described in detail in the following, makes it possible to measure this perceptual phenomenon in an objective way.
Envelopment is a subjective attribute of audio quality that accounts for the enveloping nature of the sound. A sound is said to be enveloping if it “wraps around the listener”.
The need for listeners to feel enveloped (or surrounded) by a sound is a main driving force behind the introduction of surround sound. For example, a 5.1 channel format was introduced to movies by the film industry in order to increase the sense of realism, since it allows sound effects to be reproduced “around the listener”. Another example is related to sports broadcasts, which in the near future will allow the listener to experience the sound of a crowd coming from all directions and in this way will enhance the sense of immersion or involvement in the sports event. Hence, one of the most important features of a high-quality surround sound system is the ability to reproduce the illusion of being enveloped by a sound. An Envelometer according to the present invention could be used as a tool to verify objectively how good or bad a given audio system is in terms of providing a listener with a sensation of envelopment.
The overall aim of the present invention is to develop a system, one or more devices and corresponding methods that could for instance comprise an algorithm for prediction of spatial audio quality. Since, as mentioned above, envelopment is an important component (sub-attribute) of spatial audio quality, it is likely that the Envelometer proposed here, or metrics derived from it, will form an important part of the spatial quality prediction algorithm.
A schematic representation of an envelometer according to an embodiment of the present invention specifically for measuring/predicting envelopment of a five-channel surround sound is presented in
Although an envelometer according to the invention as shown in
The distinct feature of this implementation of the Envelometer is that it is a single-ended meter (also called “un-intrusive”), as opposed to the double-ended meters (“intrusive”). In a single-ended approach the envelometer 66 measures the envelopment 68 directly on the basis of the input signals 67 (see
Single-ended meters are much more difficult to develop than double-ended meters due to the difficulty in obtaining unbiased calibration data from listening tests. According to the invention this bias can be reduced by calibrating the scale in the listening tests using two auditory anchors 71 and 72, respectively near the ends of the scale 70, as shown in
In contrast to the double-ended approach, the advantage of the single-ended approach is the ease of interfacing with current industrial applications. For example, a single-ended version of the Envelometer does not require generating or transmitting any reference signals prior to measurement. Hence, for example, it can be directly “plugged-in” to broadcast systems for the in-line monitoring of envelopment of the transmitted programme material. Also, it can be directly used at a consumer site to test how enveloping a reproduced sound is. For example, placement of 5 loudspeakers for reproduction of surround sound in a typical living room is a challenging task. The Envelometer may help to assess different loudspeaker set ups so that the optimum solution can be found.
The above approach of calibrating the scale is well known in the literature. However, the novel aspects are at least that (1) according to the invention, the approach is applied to the scaling of envelopment and (2) specific anchor signals have been devised for application with the invention.
The following recordings can be used as an Anchor A defining a high sensation of envelopment on the scale used in the listening tests:
The Anchor B used to define a low sensation of envelopment can be achieved by processed versions of the signals described above. For example, if these signals are first down-mixed to mono and then reproduced by the front centre loudspeaker, this will give rise to a very low sensation of envelopment as the sound will be perceived only at the front of the listener.
Finding appropriate anchor recordings for subjective assessment of envelopment is not a trivial task, but a number of different signals may be used. According to a presently preferred embodiment, a spatially uncorrelated 5-channel applause recording is used to anchor the highly enveloping point on the scale (Anchor A), and a mono applause recording reproduced via the centre channel only is used to anchor the low-envelopment point (Anchor B). The advantage of using applause recordings instead of more analytical signals, such as uncorrelated noise, is that they are more ecologically valid: some listeners reported that the applause signals are easier to compare with musical signals in terms of envelopment than artificial noise signals are. In addition, ecologically valid signals such as applause are less intrusive and less fatiguing for listeners who are exposed to them for a long period of time. From a mathematical point of view, the applause signals have similar properties to some artificial uncorrelated noise signals. It should be noted, however, that the present invention is not limited to the above signals, nor to any specific processing of these.
It is possible to adapt the envelometer to the double-ended mode of measurement, which is exemplified by the embodiment shown in
Some preliminary tests with a double-ended version of the Envelometer have been carried out. The test signal consisted of 8 talkers surrounding the listeners at equal angles of 30 degrees (there were 8 loudspeakers around the listener). There were two versions of the test signal: foreground and background. The first version (foreground) contained only the anechoic (dry) recordings of speech. The second version (background) contained only very reverberant counterparts of the recordings in the first version.
Another novel approach of the proposed Envelometer is the scale used to display the measured envelopment. It is proposed to use a 100-point scale, where the two points on the scale, A and B (see figure pp) define the impression of the envelopment evoked by the high and low anchor signals, respectively, as for instance by said uncorrelated applause signals and by a mono down-mix of the applause signal reproduced by the front loudspeaker respectively.
It should be noted that there are several other possible scales that could be used both in the Envelometer and in the listening tests but the one proposed in
Below are outlined three major approaches that could be chosen for both subjective and objective assessment of envelopment. There are many variants of all three methods and only typical examples are presented in TABLE 1 below.
The table above shows only some manifestations of the scales discussed. For example, the other possible manifestations of the categorical scale are the uni-dimensional scale and semantic differential scale presented in
Moreover, it is possible to use indirect scales for assessment of envelopment, for example the Likert scales, shown in
Regardless of the type of scale used (ordinal, ratio, graphic), the main challenge is to obtain unbiased envelopment data from the listening test that is going to be used to calibrate the Envelometer. If the data from the listening test are biased, the errors will propagate and adversely affect the reliability and the precision of the meter. The task of obtaining unbiased data from a subjective test is not trivial, and there are several reports demonstrating how difficult it is. Currently, it seems that the only way of reducing biases, or at least keeping them constant, is to properly calibrate the scale using some carefully chosen auditory anchors as shown in TABLE 2 below:
As already discussed above, in the listening tests that were performed and in the embodiment of an Envelometer according to the invention it was decided to use a graphic scale with the two auditory anchors, which provides the listeners with a fixed frame of reference for their assessment of envelopment and in this way reduces the contextual biases and stabilises the results. Similarly, if the results from the Envelometer are interpreted by their users, the frame of reference is clearly defined (points A and B on the scale) and hence the user will know how to interpret the results. For example, if the envelopment predicted by the Envelometer is approximately 80, it would mean that the sound is very enveloping. To be more specific, it is almost as enveloping as the sound of the applause surrounding a listener, which defines the point 85 on the scale (highly enveloping Anchor A).
If the auditory anchors were not used, the contextual effects would make it almost impossible to predict the envelopment of recordings in different listening tests with high precision. However, it might still be possible to predict correctly the rank order of different stimuli in terms of their envelopment.
An internal structure of the current version of the Envelometer (a prototype) is presented in
The envelometer estimates the envelopment of the surround sound based on physical features of the input signals including, but not limited to:
More examples are presented in TABLE 3 below.
There are some additional features that have not been identified as statistically significant in the presently preferred embodiment of the Envelometer according to the invention, but which may be of importance as they were identified as significant in preliminary experiments. They include features such as:
Once the features are extracted in the Envelometer, they are used as input signals for the predictor 84 (see
In the present embodiment it was decided to use a linear regression model with first-order interactions between features, but it is understood that other models, including artificial neural networks, might be used in connection with the present invention. The adopted model can be expressed using the following equation:
$$y = k_1 x_1 + k_2 x_2 + k_3 x_3 + \dots + k_{12} x_1 x_2 + k_{13} x_1 x_3 + \dots + g,$$
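The regression model with first-order interactions given above can be evaluated as in the following minimal sketch. The function name `predict_envelopment`, the feature values and all coefficient values are hypothetical and chosen only for illustration; they are not the calibrated coefficients of the Envelometer.

```python
from itertools import combinations

def predict_envelopment(features, main_coeffs, interaction_coeffs, intercept):
    """Evaluate y = sum_i k_i x_i + sum_{i<j} k_ij x_i x_j + g.

    features:           feature values x_1..x_n (0-based indices in code)
    main_coeffs:        coefficients k_1..k_n of the individual features
    interaction_coeffs: dict {(i, j): k_ij} for feature pairs with i < j;
                        pairs that are absent contribute nothing
    intercept:          the constant term g
    """
    y = intercept
    for k, x in zip(main_coeffs, features):
        y += k * x
    for i, j in combinations(range(len(features)), 2):
        y += interaction_coeffs.get((i, j), 0.0) * features[i] * features[j]
    return y

# Illustrative numbers only.
score = predict_envelopment(
    features=[0.5, 2.0, 1.0],
    main_coeffs=[10.0, 5.0, -2.0],
    interaction_coeffs={(0, 1): 3.0, (0, 2): -1.0},
    intercept=20.0,
)
```

Storing the interaction coefficients in a dictionary keyed by feature pair keeps the sketch usable when only some of the possible interactions are retained in the model.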
In the listening test that was carried out, the participants assessed the envelopment of 181 audio recordings. These predominantly consisted of commercially released 5-channel surround sound recordings. In addition, two-channel stereo and one-channel mono recordings were also included in this database, as they represented recordings with a lower level of envelopment. Moreover, some of the recordings were deliberately degraded using processes typical of modern audio systems. Examples of controlled degradations are presented in TABLE 4.
With reference to
TABLE 5 shows the regression coefficients used in the Envelometer after its calibration. The table contains both raw and weighted coefficients. The raw coefficients were used to generate the predicted data presented in previously discussed
In the validation part of the development of the present embodiment of an envelometer according to the invention a separate database of subjective responses was used. This database was obtained using the same listeners as above but different programme material and different controlled degradation (but of the same nature). In total 65 recordings were used in the validation part of the development.
The results of the validation are presented in
Finally,
Thus,
In the following, a specific application of the present invention is described, whereby spatial sound quality (either as an absolute quantity or as a relative quantity, i.e. a degradation of spatial sound quality) is predicted over an area for instance in a listening room in which there is installed a loudspeaker system, for instance a standard 5.1 channels surround sound loudspeaker system. Referring to
As a second example, prediction of SMOS over a listening area in a room is shown, in which the specific set of metrics listed in TABLE X is used to provide input values to the regression model.
It is to be noted that although the various plots of predictions described above have related to the prediction of overall spatial sound quality or degradation thereof, the method and system according to the invention could also have been used for providing similar plots of specific spatial audio attributes, such as envelopment, source localisation, source width etc., over an area of a real or simulated room. Furthermore, such predictions, both of SQ/SMOS and of the individual attributes, could also be provided by the method and system according to the invention over the entire 3-dimensional space of a given, real or simulated, environment.
According to specifically preferred embodiments of the invention, predicted overall spatial quality is rated using a weighted set of low and high level metrics derived from binaural and/or microphone measurements (or from virtual microphones in a simulation system such as an auralisation system) of one or more test signals designed to represent components of a spatial audio scene.
In the following are described two levels of metrics: Level 1 metrics and Level 2 metrics. However, the hierarchical metrics structure might comprise even further levels of metrics.
Essential features of the present invention are the precise nature of the metrics, the way in which the high level metrics are formed from the low level metrics, the way they are related to specific test signals, the values of their coefficients in the regression model used to predict spatial quality and the nature of any transforms applied to the resulting prediction data or original metric data.
The Level 1 type of metrics receive binaural or microphone signals (real or virtual) corresponding to a single test signal and provide a Metric value and/or an Intermediate results file. Examples of Level 1 metrics are: iacc0, CardKLT and localisation for a single test signal.
The Level 2 type of metrics receive the Intermediate results calculated by the Level 1 metrics and provide a Metric value. Typical examples of Level 2 metrics are: Hull metric, mean_ang_diff and max_ang_diff.
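The two-level structure can be illustrated with a minimal sketch in which the intermediate results of a Level 1 localisation metric (one predicted angle per test signal) are passed to Level 2 metrics. The text names mean_ang_diff and max_ang_diff but does not define them here; the definitions below (mean and maximum of the absolute angular difference between a reference system and the system under test) are assumptions, as are the angle values.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff

def mean_ang_diff(reference_angles, test_angles):
    """Assumed Level 2 metric: mean angular difference over all test signals."""
    diffs = [angular_difference(r, t)
             for r, t in zip(reference_angles, test_angles)]
    return sum(diffs) / len(diffs)

def max_ang_diff(reference_angles, test_angles):
    """Assumed Level 2 metric: largest angular difference over all test signals."""
    return max(angular_difference(r, t)
               for r, t in zip(reference_angles, test_angles))

# Hypothetical Level 1 intermediate results: one localisation angle per
# test signal, for the reference system and for the system under test.
reference = [-30.0, 0.0, 30.0, 110.0, -110.0]
under_test = [-25.0, 0.0, 20.0, 170.0, -170.0]

mean_value = mean_ang_diff(reference, under_test)
max_value = max_ang_diff(reference, under_test)
```

The Level 2 functions never touch the audio signals themselves; they operate only on what the Level 1 metric has stored, which is what allows different combinations of test signals and metrics to be configured independently.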
The present invention is not restricted to the use of a specific regression model. However, in a number of the practical implementations of the method according to the invention, the “Unscrambler”® modelling package produced by CAMO (a partial least squares regression model) has been used as the predictor means.
Referring to
The application of the regression model on the metric values provided by level 1 and 2 metrics may not result in the desired linearity and reference stimulus prediction accuracy. A transform may be used to improve the linearity and reference stimulus prediction accuracy as schematically shown by the flow chart in
This section briefly describes the configuration of the method and system according to the present invention as a modular structure, in which a user can easily configure the system by choosing different combinations of test signals and metrics. According to a presently preferred embodiment of the invention, this configuration takes place by means of a spreadsheet program (referred to as “MetricMap” on the spreadsheet) as shown in
The currently most up-to-date list of metrics used in the method and system according to the present invention is shown in TABLE 7.
The regression models fall into two categories which incorporate different methods of ensuring that the unimpaired reference signal, should it be presented to the model, will be predicted at the top of the scale. The first example involves partial-least-squares regression and a post-hoc correction algorithm to improve the fit and alignment of the reference signal on the scale. The second example uses an approach involving linear equality constraints and multiple linear regression (MLR). Both achieve similar results in practice, in terms of correlation between predicted and actual scores for a wide range of test items, and in terms of RMS error (RMSE), and have been based on a current calibration set of listening test results. Both also have similar (although not identical) weightings of the attributes concerned.
Both models employ five metrics selected from the main list, which measure different aspects of the soundfield. The metrics are largely independent of each other (they have low correlation), except that there is a moderately high correlation between entropy and IACC. The PLS model is likely to be the more robust in the case of correlated metrics, although in practice both models produce similar results.
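One way in which a linear equality constraint can force the unimpaired reference to be predicted at the top of the scale is sketched below. The text does not specify how the constraint is formulated, so the reparametrisation used here (fitting y = top + Σ k_i (x_i − x_ref,i), which satisfies the constraint by construction), the top-of-scale value of 100 and all data values are assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_constrained_mlr(metrics, scores, reference_metrics, top=100.0):
    """Least-squares fit of y = top + sum_i k_i (x_i - x_ref_i)."""
    centred = [[x - r for x, r in zip(row, reference_metrics)]
               for row in metrics]
    targets = [s - top for s in scores]
    p = len(reference_metrics)
    # Normal equations (Z^T Z) k = Z^T t for the centred metrics Z.
    ztz = [[sum(row[i] * row[j] for row in centred) for j in range(p)]
           for i in range(p)]
    ztt = [sum(row[i] * t for row, t in zip(centred, targets))
           for i in range(p)]
    return solve_linear(ztz, ztt)

def predict(metric_values, coeffs, reference_metrics, top=100.0):
    """Predicted score; equals `top` exactly for the reference metrics."""
    return top + sum(k * (x - r)
                     for k, x, r in zip(coeffs, metric_values, reference_metrics))

# Illustrative calibration data: two metrics, four degraded items.
reference = [1.0, 0.5]
metrics = [[0.8, 0.5], [0.6, 0.3], [0.2, 0.4], [0.9, 0.1]]
scores = [90.0, 76.0, 58.0, 87.0]
coeffs = fit_constrained_mlr(metrics, scores, reference)
reference_score = predict(reference, coeffs, reference)
```

Because the constraint is built into the parametrisation, no post-hoc correction is needed for the reference item, in contrast to the first category of model described above.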
This application is a continuation-in-part of application Ser. No. 12/051,912, filed Mar. 20, 2008; and this application claims benefit of U.S. Provisional Application No. 61/097,697, filed Sep. 17, 2008 (both of which are hereby incorporated by reference).
Number | Date | Country
---|---|---
61097697 | Sep 2008 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 12051912 | Mar 2008 | US
Child | 12411108 | | US