This application claims the benefit of United Kingdom Patent Application No. 0326043.7, filed Nov. 7, 2003, the entirety of which is incorporated herein by reference.
This invention relates to a new parameter suitable for use in a non-intrusive speech quality assessment system.
Signals carried over telecommunications links can undergo considerable transformations, such as digitisation, encryption and modulation. They can also be distorted due to the effects of lossy compression and transmission errors.
Objective processes for the purpose of measuring the quality of a signal are currently under development and are of application in equipment development, equipment testing, and evaluation of system performance.
Some automated systems require a known (reference) signal to be played through a distorting system (the communications network or other system under test) to derive a degraded signal, which is compared with an undistorted version of the reference signal. Such systems are known as “intrusive” quality assessment systems, because whilst the test is carried out the channel under test cannot, in general, carry live traffic.
Conversely, non-intrusive quality assessment systems are systems which can be used whilst live traffic is carried by the channel, without the need for test calls.
Non-intrusive testing is required because in some circumstances it is not possible to make test calls. This could be because the call termination points are geographically diverse or unknown. It could also be that the cost of capacity is particularly high on the route under test. By contrast, a non-intrusive monitoring application can run continuously on live calls to give a meaningful measurement of performance.
A known non-intrusive quality assessment system uses a database of distorted samples which has been assessed by panels of human listeners to provide a Mean Opinion Score (MOS).
MOSs are generated by subjective tests which aim to find the average user's perception of a system's speech quality by asking a panel of listeners a directed question and providing a limited response choice. For example, to determine listening quality users are asked to rate “the quality of the speech” on a five-point scale from Bad to Excellent. The MOS is calculated for a particular condition by averaging the ratings of all listeners.
In order to train the quality assessment system each sample is parameterised and a combination of the parameters is determined which provides the best prediction of the MOSs indicated by the human listeners. International Patent Application number WO 01/35393 describes one method for parameterising speech samples for use in a non-intrusive quality assessment system.
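By way of illustration only, the averaging of listener ratings may be sketched as follows. The numeric values 1 to 5 and the intermediate category labels (Poor, Fair, Good) are assumptions made for the sketch; the text above specifies only a five-point scale from Bad to Excellent. The function name is hypothetical.

```python
from statistics import mean

# Assumed numeric coding of the five-point scale (only "Bad" and
# "Excellent" are named in the text; the rest are illustrative).
SCALE = {"Bad": 1, "Poor": 2, "Fair": 3, "Good": 4, "Excellent": 5}


def mean_opinion_score(ratings):
    """Average the category ratings given by all listeners for one condition."""
    if not ratings:
        raise ValueError("at least one listener rating is required")
    return mean(SCALE[r] for r in ratings)


# Four hypothetical listeners rating the same condition.
example_mos = mean_opinion_score(["Good", "Fair", "Excellent", "Good"])
```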
This invention relates to improved parameters for a speech quality assessment system.
According to the invention there is provided a method of generating a parameter from a signal comprising a sequence of values measured from voiced portions of said signal at a sampling frequency, said parameter suitable for use in a quality assessment tool, said method comprising the steps of
Said section of said sequence of values may be selected such that a pitch mark is associated with a value central to said section.
The frequency transform may comprise a Fast Fourier Transform.
The step of generating a pitch frequency estimate may comprise the steps of using pitch marks associated with said sequence of values; comparing the number of values between a value associated with a pitch mark and a value associated with an immediately preceding pitch mark with the number of values between the value associated with the pitch mark and a value associated with an immediately following pitch mark; and generating said pitch frequency estimate in dependence upon the minimum number of said values, and the sampling frequency.
The portions of said sequence of frequency values may be selected by generating multiples of said pitch frequency estimate, said multiples representing harmonics of said pitch frequency estimate; and selecting portions in which the frequency range of the portion is substantially equal to half said pitch frequency estimate; and in which the central frequency of each portion is either a frequency substantially equal to one of said multiples, or a frequency substantially half way between two of said multiples.
The invention also provides a method of training a quality assessment tool comprising the step of training a mapping for use in a method of assessing speech quality in a telecommunications network, such that a fit between a quality measure generated from a plurality of parameters for a signal and the mean opinion score associated with said signal is optimised by said mapping wherein said plurality of parameters includes a parameter generated according to any one of the preceding claims.
The invention also provides a method of assessing speech quality in a telecommunications network comprising the steps of generating a parameter according to any one of the preceding claims; generating a quality measure in dependence upon said parameter.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
a to 4c illustrate signal processing in order to generate a parameter in accordance with the present invention;
Referring to
The database 4 may store quality prediction results from a plurality of different intercept points. The database 4 may be remotely interrogated by a user via a user terminal 5, which provides analysis and visualisation of quality prediction results stored in the database 4.
The telecommunication network shown in
Non intrusive quality assessment may be performed, for example, at the following points:
A variety of testing regimes and configurations can be used to suit a particular application, providing quality measures for selections of calls based upon the user's requirements. These could include different testing schedules and route selections. With multiple assessment points in a network, it is possible to make comparisons of results between assessment points. This allows the performance of specific links or network subsystems to be monitored. Reductions in the quality perceived by customers can then be attributed to specific circumstances or faults.
The data, stored in the database 4, can be used for a number of applications such as:—
Referring now to
A database 60 contains distorted speech samples covering a diverse range of conditions and technologies. These have been assessed by panels of human listeners to provide a MOS, in a known manner. Each speech sample therefore has an associated MOS derived from subjective tests. The database 60 includes speech signals exhibiting, amongst others, the following network conditions and impairments: mobile network errors, mutes, low bit rate speech codecs, noise, transcoding, Voice over Internet Protocol (VoIP), and Digital Circuit Multiplication Equipment (DCME) clipping.
At 61 each sample is pre-processed to normalise the signal level and take account of any filtering effects of the network via which the speech sample was collected. The speech sample is filtered, level aligned and any DC offset is removed. The amount of amplification or attenuation applied is stored for later use.
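A minimal sketch of the DC offset removal and level alignment is given below, by way of example only. The network filtering stage is omitted, and the use of an RMS level target, the value of that target and the function name are all assumptions made for the sketch rather than details of the embodiment.

```python
import math


def preprocess(samples, target_rms=0.1):
    """Remove any DC offset and scale the sample to an assumed RMS level.

    Returns the processed samples together with the gain applied, so that
    the amount of amplification or attenuation can be stored for later use.
    """
    if not samples:
        raise ValueError("empty speech sample")
    dc_offset = sum(samples) / len(samples)
    centred = [s - dc_offset for s in samples]
    rms = math.sqrt(sum(s * s for s in centred) / len(centred))
    # A silent sample cannot be level aligned; leave it unscaled.
    gain = target_rms / rms if rms > 0 else 1.0
    return [s * gain for s in centred], gain
```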
At step 62 tone detection is performed for each sample to determine whether the sample is speech, data, or if it contains DTMF or musical tones. If it is determined that the sample is not speech then the sample is discarded, and is not used for training the quality assessment tool.
At step 63 each speech sample is annotated to indicate periods of speech activity and silence/noise. This is achieved by use of a Voice Activity Detector (VAD) together with a voiced/unvoiced speech discriminator.
At step 64 each speech sample is annotated to indicate positions of the pitch cycles using a temporal/spectral pitch extraction method. This allows parameters to be extracted on a pitch synchronous basis, which helps to provide parameters which are independent of the particular talker. Vocal Tract Descriptors are extracted as part of the speech parameterisation described later and need to be taken from the voiced sections of the speech file. A final pitch cycle identifier is used to provide boundaries for this extraction. A characterisation of the properties of the pitch structure over time is also passed to step 65 to form part of the speech parameters.
The parameterisation step 65 is designed to reduce the amount of data to be processed whilst preserving the information relevant to the distortions present in the speech sample.
In this embodiment of the invention over 300 candidate parameters are calculated including the following:
In addition to the above, various descriptions of the vocal tract parameters are calculated. They capture the overall fit of the vocal tract model, instantaneous improbable variations and illegal sequences. Average values and statistics for individual vocal tract model elements over time are also included as base parameters. For example, see International Patent Application Number WO 01/35393.
Distortion identification may also be performed. This is not described here, as it is not relevant to the present invention. A full description may be found in co-pending European Patent Application number 03250333.6.
The inventors have recently invented a new spectral clarity parameter which significantly improves performance of the speech quality assessment method.
The generation of this parameter from the portions of the signal which have been marked as voiced at step 63 will now be described, with reference to
At step 100 a section of a signal such as that shown in
The logarithm of each frequency value is calculated in order to provide a value which is independent of the level (average) of the original signal. At step 104, a pitch frequency estimate is generated as follows. The number of values between pitch mark P and pitch mark P+1 is compared to the number of values between pitch mark P and pitch mark P−1. In this example the numbers are 80 and 81 values respectively. The minimum, 80, is selected, and the pitch frequency estimate is calculated from it in dependence upon the sampling frequency. In this example the pitch frequency estimate is therefore 100 Hz, which corresponds to a sampling frequency of 8 kHz. The pitch frequency estimate represents the pitch of the speech and is represented by H0.
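The pitch frequency estimate of step 104 may be sketched as follows, by way of example only. The sketch assumes that pitch marks are held as sample indices and that the estimate is obtained by dividing the sampling frequency by the smaller of the two spacings; a sampling frequency of 8000 Hz is assumed because it reproduces the 100 Hz figure of the example. The function and argument names are hypothetical.

```python
def pitch_frequency_estimate(prev_mark, mark, next_mark, sampling_frequency):
    """Estimate the pitch frequency H0 (in Hz) at one pitch mark.

    The marks are sample indices of pitch marks P-1, P and P+1. The smaller
    of the two spacings is taken as the pitch period in samples.
    """
    values_before = mark - prev_mark
    values_after = next_mark - mark
    if values_before <= 0 or values_after <= 0:
        raise ValueError("pitch marks must be strictly increasing")
    period = min(values_before, values_after)
    return sampling_frequency / period


# Spacings of 81 and 80 values at an assumed 8000 Hz sampling frequency.
h0 = pitch_frequency_estimate(919, 1000, 1080, 8000)
```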
At step 106 portions of the sequence of frequency values are selected in dependence upon the pitch frequency estimate as follows. Harmonics (H1-H5) are estimated to occur around multiples of the pitch frequency estimate H0, so in this example we would expect H1 to be around 200 Hz, H2 to be around 300 Hz etc. These are illustrated schematically in
Portions comprising a frequency range of half the pitch frequency estimate are selected, although other shorter frequency ranges could be used. The centre frequency of each selected portion is equal either to the frequency value of a harmonic, or to a frequency value half way between two harmonics. Selected portions A, B, C, D, E, F, G are illustrated in
An average value for each portion is then calculated at step 108, simply by summing the sequence of values in each portion and dividing the total by the number of values in said portion.
Finally, at step 110, the sum of differences between adjacent portions is calculated and an average over the number of peaks used is generated. In this embodiment of the invention the differences used to generate the parameter are those associated with the portions relating to H2 to H5 and the subsequent portion in each case. This is because H1 is generally filtered out in practice because of the telephone bandwidth.
A parameter is thus generated for each pitch mark, and in order to generate a parameter for the whole of the voiced part of the signal a simple average is generated.
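Steps 106 to 110 may be sketched as follows for a single pitch mark, by way of example only. The sketch assumes that the logarithmic frequency values are supplied as a list with a known frequency spacing between values, that harmonic Hk lies at (k+1) times H0, and that each difference is taken as the average of the portion centred on a harmonic minus the average of the subsequent portion centred half way to the next harmonic. The sign convention and all names are assumptions made for the sketch.

```python
def portion_average(log_spectrum, bin_hz, centre_hz, width_hz):
    """Average the log frequency values falling within one portion."""
    low = centre_hz - width_hz / 2
    high = centre_hz + width_hz / 2
    values = [v for i, v in enumerate(log_spectrum) if low <= i * bin_hz <= high]
    if not values:
        raise ValueError("portion contains no frequency values")
    return sum(values) / len(values)


def spectral_clarity(log_spectrum, bin_hz, h0, harmonics=(2, 3, 4, 5)):
    """Average peak-to-valley difference over the harmonics used (H2 to H5)."""
    width = h0 / 2  # each portion spans half the pitch frequency estimate
    total = 0.0
    for k in harmonics:
        peak_centre = (k + 1) * h0            # portion centred on harmonic Hk
        valley_centre = peak_centre + h0 / 2  # subsequent portion, half way to Hk+1
        total += (portion_average(log_spectrum, bin_hz, peak_centre, width)
                  - portion_average(log_spectrum, bin_hz, valley_centre, width))
    return total / len(harmonics)


def signal_parameter(per_mark_values):
    """Simple average of the per-pitch-mark parameters over the voiced signal."""
    return sum(per_mark_values) / len(per_mark_values)
```

Under these assumptions a spectrum with pronounced harmonic peaks yields a larger value than a flat spectrum, for which the value is zero.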
Once all of the parameters have been calculated, including the new parameter described above, the mapping 76 is trained at step 68. Once the optimum mapping between the parameters for each speech sample and the MOS associated with each speech sample (provided by the database 60) has been determined, a characterisation of the mapping is saved at step 69, which includes identification of the particular parameters which resulted in the optimum mapping.
In this embodiment the mapping is a linear mapping between the chosen parameters and MOSs and the optimum mapping is determined using linear regression analysis, such that once the mapping has been trained at step 68, the mapping 76 is characterised by a set of parameters used together with a weight for each parameter.
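By way of example only, a linear mapping of this kind may be fitted by ordinary least squares as sketched below. The sketch assumes that the set of parameters has already been chosen, and it includes a constant offset term, which is an assumption not stated above. The selection of which parameters give the optimum mapping is not shown, and the function names are hypothetical.

```python
def fit_linear_mapping(parameter_rows, mos_values):
    """Least-squares fit of MOS = w0 + w1*p1 + ... + wn*pn.

    Solves the normal equations by Gauss-Jordan elimination and returns
    the weights [w0, w1, ..., wn].
    """
    rows = [[1.0] + list(r) for r in parameter_rows]  # prepend offset term
    n = len(rows[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, mos_values))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("parameters are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        pivot_value = a[col][col]
        a[col] = [v / pivot_value for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


def predict_mos(weights, parameters):
    """Apply a trained mapping to the parameters of one speech sample."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], parameters))
```

The returned weights correspond to the characterisation of the mapping, one weight per parameter used, which is then applied unchanged when the tool operates on live calls.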
The operation of the non-intrusive quality assessment tool, once training has been completed, will now be described with reference to
The steps for operation of the quality assessment tool are similar to the steps shown in
Steps 61-64 operate as described with reference to
It will be understood by those skilled in the art that the methods described above may be implemented on a conventional programmable computer, and that a computer program encoding instructions for controlling the programmable computer to perform the above methods may be provided on a computer readable medium.
It will be appreciated that whilst the process above has been described with specific reference to speech signals, the processes are equally applicable to other types of signals, for example video signals.
Number | Date | Country | Kind |
---|---|---|---|
0326043.7 | Nov 2003 | GB | national |