1. Field of the Invention
The present invention relates generally to the field of audio data processing systems and methods, and, more particularly, to a novel system and method for performing blind change detection audio segmentation.
2. Discussion of the Prior Art
Many audio resources, such as broadcast news, contain different kinds of audio signals (speech, music, noise) recorded under different environmental and channel conditions. The performance of many applications based on these streams, such as speech recognition and audio indexing, degrades significantly due to the presence of irrelevant portions of the audio stream. Segmenting the data into homogeneous portions according to type (speech, noise, music, etc.), speaker identity, environmental conditions, and channel conditions has therefore become an important preprocessing step. Previous approaches for automatic segmentation of audio data can be classified into two categories: informed and blind. Informed approaches include both decoder-based and model-based algorithms. In decoder-based approaches, the input audio stream is first decoded using speech and silence models; the desired segments are then produced using the silence locations generated by the decoder. In model-based approaches, different models are built to represent the different acoustic classes expected in the stream; the input audio stream is classified by maximum likelihood selection, and the locations of change in the acoustic class are identified as segment boundaries. In both cases, models trained on data representing all acoustic classes of interest are used in the automatic segmentation. Informed automatic segmentation is therefore limited to applications where a sufficient amount of training data is available for building the acoustic models, and it cannot generalize to acoustic conditions unseen in the training data. In addition, approaches based solely on speech and silence models mainly detect silence locations, which do not necessarily correspond to boundaries between different acoustic segments. We will focus on blind automatic segmentation techniques, which do not suffer from these limitations and therefore serve a wider range of applications.
Blind change detection avoids the requirements of the informed approach by building models of the observations in a neighborhood of a candidate point under the two hypotheses of change and no change, and using a criterion based on the log likelihood ratio of these two models for automatic segmentation of the acoustic data. Most previous approaches had the goal of providing input to a speech recognition or speaker adaptation system. They therefore evaluated their systems by comparing the word error rates achieved using the automatic and the manual segmentation, not by the accuracy of the boundaries generated by the automatic segmentation. Exceptions to this trend include cases where the main focus is data indexing.
In many applications, such as on-line audio indexing and information retrieval, the goal of the automatic segmentation algorithm is to detect the changes in the input audio stream while keeping the number of false alarms as low as possible. Unfortunately, all of the current techniques for automatic blind segmentation, such as those using the Kullback-Leibler distance, the generalized likelihood ratio distance, or the Bayesian Information Criterion (BIC), try to optimize an objective function that is not directly related to minimizing the missing probability for a given false alarm rate. If the missing probability is defined as the probability of not detecting a change within a reasonable period of time of a valid change in the stream, then minimizing the missing probability is equivalent to minimizing the duration between the detected change and the actual change, namely the detection time.
Known solutions to this problem, such as those using the BIC, are not accurate enough and have robustness problems because they employ a single criterion that is not directly related to minimizing the missing probability for a given false alarm rate and compare this criterion to a threshold.
Thus, it would be highly desirable to provide a novel approach for solving the automatic audio segmentation problems described herein with respect to the prior art.
It would be highly desirable to provide a novel approach for solving the automatic audio segmentation problem that combines the results of several segmentation algorithms to achieve better and more robust segmentation.
It is an object of the present invention to provide a comprehensive system, method and computer program product that enables blind change detection audio segmentation.
In one aspect, the system and method combines hypothesized boundaries from several segmentation algorithms to achieve the final segmentation of the audio stream. More particularly, a methodology is implemented that combines the output of at least two blind change detection audio segmentation systems to generate a final segmentation. Particularly, the system and method combines at least two approaches for change detection using different statistical modeling of the data, and optimizes at least two different criteria to generate an automatic segmentation of the audio stream.
Thus, according to the invention, there is provided a system, method and computer program product for blind change detection of audio segments. The method comprises the following:
providing an input audio stream to be segmented;
applying at least two change detection audio segmentation processes to said input audio stream and obtaining candidate change points from each;
combining said candidate change points of each said applied processes for audio segmentation change detection; and,
removing invalid candidate change points to thereby optimize audio segmentation change points of the audio stream.
According to the invention, the system and method searches for a proper segmentation of a given audio signal such that each resulting segment is homogeneous and belongs to one of the different acoustic classes (such as speech, noise, and music) and to a single speaker and a single channel. At least two algorithms known in the art are implemented, and assumptions are made to make the estimation of the segmentation points efficient. Three algorithms contemplated for use for automatic segmentation of the audio data include: the BIC, CuSum (cumulative sum), and the CDF comparison (Kolmogorov-Smirnov test).
As part of the audio segmentation process, the method further comprises recording a start time for each remaining change point in the audio stream, i.e., for each segment, determining whether a candidate change point exists, and recording a corresponding start time.
Advantageously, the system and method for providing automatic segmentation of audio streams according to the invention can be used in many applications where the actual boundaries of the audio segments are required, such as speech recognition, speaker recognition, audio data mining, online audio indexing, and information retrieval systems.
The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:
The present invention is directed to a system and method that combines various approaches for audio segmentation change detection using different statistical modeling of the data and optimizes different criteria to generate an automatic segmentation of the audio stream.
While an example embodiment described herein utilizes three (3) automatic change detection audio segmentation algorithms, it is understood that other algorithms providing for automatic segmentation of the audio data may be used in addition to or as alternates of the three algorithms described herein. While it is understood that the invention contemplates use of at least two algorithms, three (3) algorithms employed according to the present invention are now described:
A. Change Detection Using the CuSum Algorithm
Under the assumption that the sequence of the log likelihood ratios, $\{l_k\}_{k=1}^{n}$, is an i.i.d. process, the CuSum algorithm is optimal in the sense of minimizing detection time for a given false alarm rate. This assumption is valid for many interesting processes, such as some random processes that are modeled by Markov chains and some autoregressive processes. In the CuSum algorithm, the likelihood ratio of the conditional PDFs of the observations under both the hypothesis H1 of change for time r≦n and the hypothesis H0 is estimated; then the maximum of the sum of the log likelihood ratios of a given sequence of observations is compared to a threshold to determine whether a boundary exists between two segments of the observation sequence. Given n observations, a comparison is made as in equation (1) as follows:
$$\max_{1 \le r \le n} \sum_{k=r}^{n} l_k > \lambda \qquad (1)$$
where $l_k$ is the log likelihood ratio of observation k and λ is a threshold.
The CuSum algorithm assumes that the conditional PDFs of the observations under both the hypothesis H1 of change for time r≦n and the hypothesis H0 of no change (i.e., r≧n) are known. In most automatic segmentation applications, this is not true. Therefore, a two-Gaussian mixture is trained using the n observations in the given sequence. The two Gaussian components are initialized such that the mean of one of them corresponds to the mean of a few observations at the beginning of the sequence of observations and the mean of the other corresponds to the mean of a few observations at the end of the observation sequence. The automatic segmentation using the CuSum algorithm is then reduced to a binary hypothesis testing problem. The two hypotheses of this problem are
and
where $r^{*} = \arg\max_{1 \le r \le n} \sum_{k=r}^{n} l_k$, and $l_k$ is the log likelihood ratio estimated using the two Gaussian components.
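The CuSum procedure described above can be sketched in Python as follows. This is a minimal one-dimensional illustration under stated assumptions, not the implementation of the invention: the function names (`two_gaussian_llr`, `cusum`), the `edge` and `iters` parameters, the plain EM loop, and the sign convention (log density of the end-initialized component minus that of the start-initialized component) are hypothetical choices made for the example.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian evaluated at every element of x."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def two_gaussian_llr(x, edge=10, iters=20):
    """Per-observation log likelihood ratio l_k from a two-Gaussian mixture.

    ASSUMPTION: `edge` observations at each end initialise the two means,
    mirroring the initialisation described in the text.
    """
    x = np.asarray(x, dtype=float)
    means = np.array([x[:edge].mean(), x[-edge:].mean()])
    var = np.full(2, x.var() + 1e-6)
    weights = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of each component for each observation.
        log_p = np.stack([np.log(weights[j]) + log_gauss(x, means[j], var[j])
                          for j in range(2)])
        log_p -= log_p.max(axis=0)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=0)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=1) + 1e-12
        weights = nk / x.size
        means = (resp @ x) / nk
        var = (resp * (x - means[:, None]) ** 2).sum(axis=1) / nk + 1e-6
    # l_k: "after change" component versus "before change" component.
    return log_gauss(x, means[1], var[1]) - log_gauss(x, means[0], var[0])

def cusum(llr):
    """Return (max_r sum_{k=r}^{n} l_k, r*) using 0-based indices."""
    llr = np.asarray(llr, dtype=float)
    suffix = np.cumsum(llr[::-1])[::-1]  # suffix[r] = sum of llr[r:]
    r_star = int(np.argmax(suffix))
    return float(suffix[r_star]), r_star
```

A boundary would be declared at `r_star` when the returned statistic exceeds the threshold λ; how λ is chosen is not covered by this sketch.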
B. Change Detection Using the BIC Algorithm
The Bayesian information criterion is based on the log likelihood ratio of two models representing the two hypotheses of having a two-class or a one-class observation sequence. It adds a penalty term to account for the difference in the number of parameters of the two models. The parameters of both models are estimated using the maximum likelihood criterion. Given n observations, the Bayesian information criterion (BIC) approach performs a comparison as in equation (2) as follows:
where d1 and d2 are the number of parameters of the two models, and M is the dimension of the observation vector.
Thus, the conditional PDF of the observations under the hypothesis H1 of change consists of two Gaussian PDFs. Both Gaussian PDFs are trained using maximum likelihood estimation. One of them is trained using the observations before the hypothesized boundary and the other is trained using the observations after it. The conditional PDF of the observations under the hypothesis H0 of no change is modeled with a single Gaussian PDF trained by maximum likelihood estimation using all n observations. Detecting a change at time r using the BIC algorithm is then reduced to a binary hypothesis testing problem. The two hypotheses of this problem are
$$H_0: x_1, \ldots, x_n \sim N(\mu, \Sigma)$$
$$H_1: x_1, \ldots, x_r \sim N(\mu_1, \Sigma_1), \quad x_{r+1}, \ldots, x_n \sim N(\mu_2, \Sigma_2)$$
where $N(\mu, \Sigma)$ is the Gaussian model trained using all the n observations, $N(\mu_1, \Sigma_1)$ is trained using the first r observations, and $N(\mu_2, \Sigma_2)$ is trained using the last n−r observations. Since the model of the conditional PDF under the hypothesis H1 of change depends on the location of the change, reestimation of the model parameters is required for each new hypothesized boundary within the sequence of observations of length n. This problem is avoided in the CuSum algorithm implementation, as in this case both models are independent of the location of the hypothesized boundary.
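The BIC test for a hypothesized boundary r can be sketched as below. The sketch assumes the commonly used form of the criterion for full-covariance Gaussians, in which each model has M + M(M+1)/2 parameters and the penalty is half the parameter difference times log n; the penalty weight `lam`, the `margin` parameter, and the function names are hypothetical, and the exact form of equation (2) in the source may differ in detail.

```python
import numpy as np

def _logdet_ml_cov(x):
    """Log-determinant of the maximum likelihood covariance of x (rows = observations)."""
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    _, logdet = np.linalg.slogdet(cov)
    return logdet

def delta_bic(x, r, lam=1.0):
    """Penalised log likelihood ratio of 'change at r' versus 'no change'.

    Positive values favour the two-Gaussian (change) hypothesis.
    ASSUMPTION: full covariance matrices and penalty weight `lam`.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, m = x.shape
    d1 = m + m * (m + 1) / 2   # parameters of the single-Gaussian model
    d2 = 2 * d1                # parameters of the two-Gaussian model
    llr = 0.5 * (n * _logdet_ml_cov(x)
                 - r * _logdet_ml_cov(x[:r])
                 - (n - r) * _logdet_ml_cov(x[r:]))
    penalty = 0.5 * lam * (d2 - d1) * np.log(n)
    return llr - penalty

def best_boundary(x, margin=20, lam=1.0):
    """Re-estimate both models for every candidate r and keep the best score."""
    n = np.asarray(x).shape[0]
    scores = {r: delta_bic(x, r, lam) for r in range(margin, n - margin + 1)}
    r_best = max(scores, key=scores.get)
    return r_best, scores[r_best]
```

The loop in `best_boundary` makes the cost noted in the text visible: both segment models are re-estimated for every hypothesized boundary, whereas the CuSum sketch trains its models once per window.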
C. Change Detection Using the Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a nonparametric test of change in the input data. It compares the maximum of the difference of the empirical CDFs of the data before and after the hypothesized change point to a threshold to determine whether this point is a valid boundary point between two distinct classes. In other words, to test the validity of a boundary at observation k, the test performs a comparison as in equation (3) as follows:
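$$\max_{x} \left| F_{1}(x) - F_{2}(x) \right| \qquad (3)$$
where
$$F_{1}(x) = \frac{1}{k} \sum_{i=1}^{k} \Theta(x - x_i) \qquad (4)$$
$$F_{2}(x) = \frac{1}{n-k} \sum_{i=k+1}^{n} \Theta(x - x_i) \qquad (5)$$
$x_i$ is the i-th observation, and Θ(.) is the unit step function, to a threshold α.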
The Kolmogorov-Smirnov test was designed for one-dimensional observations. To generalize to observation vectors of dimension M, the elements of the observation vector are assumed to be statistically independent, and the criterion of the Kolmogorov-Smirnov test is replaced with the following criterion according to equation (6) as follows:
for m=1, . . . , M, where the range of values of each dimension is quantized to a fixed number of bins to be used in calculating the empirical CDFs.
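The CDF comparison can be sketched as below, using binned empirical CDFs as the text describes. The function names and the default of 32 bins are hypothetical. Because the combined criterion of equation (6) is not reproduced here, the multi-dimensional helper only returns one statistic per dimension and leaves the combination rule open.

```python
import numpy as np

def ks_statistic(x, k, bins=32):
    """Maximum absolute difference between the binned empirical CDFs of
    x[:k] (before the hypothesised boundary) and x[k:] (after it)."""
    x = np.asarray(x, dtype=float)
    if x.min() == x.max():
        return 0.0  # constant data: the two CDFs are identical
    edges = np.linspace(x.min(), x.max(), bins + 1)
    h1, _ = np.histogram(x[:k], bins=edges)
    h2, _ = np.histogram(x[k:], bins=edges)
    f1 = np.cumsum(h1) / k
    f2 = np.cumsum(h2) / (x.size - k)
    return float(np.max(np.abs(f1 - f2)))

def ks_per_dimension(x, k, bins=32):
    """Apply the 1-D statistic independently to each of the M dimensions.

    ASSUMPTION: x has shape (n, M); how the M values are merged into a
    single decision is not specified by this sketch.
    """
    x = np.asarray(x, dtype=float)
    return np.array([ks_statistic(x[:, m], k, bins) for m in range(x.shape[1])])
```

A boundary at k would be accepted when the statistic exceeds the threshold α.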
Since the three approaches (BIC, cumulative sum, and CDF comparison) for automatic segmentation of the audio data use different criteria and different modeling of the conditional PDFs of the observations under both hypotheses of valid change or no change, it is reasonable to expect these algorithms to employ complementary information for automatic change detection; combining the three approaches can therefore improve the overall performance and robustness of the automatic change detection system. For purposes of description, the three algorithms described herein are implemented for the automatic blind change detection scheme for audio segmentation according to one embodiment of the invention. It is understood that in alternate embodiments, two of the three automatic audio segmentation algorithms may be used for automatic change detection according to the principles described herein; furthermore, more than three audio segmentation algorithms (e.g., a number of “M” algorithms) may be combined for automatic change detection without departing from the scope of the invention. For example, change detection using the Kullback-Leibler measure, non-linear volume-preserving maps, support vector machines, or independent component analysis are examples of such change detection algorithms that may be employed.
Continuing in
To accomplish this, as indicated at step 25, a list of the candidate points is generated from the union of the outputs of the three (or more) algorithms, referred to as a candidate boundary list (L). Then, the values of the three (or more) measures used in the three (or more) algorithms for detection of the change are evaluated at every point of the three sets. This comprises calculating the values of the measurements of the three (or more) algorithms at every point of the candidate list as indicated at step 30.
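Building the candidate boundary list and evaluating every measure at every candidate can be sketched as below. The final filter uses a simple majority vote, which is only one possible rule for discarding invalid candidates; the function names, the per-measure thresholds, and `min_votes` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

def candidate_list(*outputs: Sequence[int]) -> List[int]:
    """Union of the change points proposed by each algorithm, sorted in time."""
    return sorted(set().union(*outputs))

def select_by_vote(candidates: Sequence[int],
                   measures: Dict[str, Callable[[int], float]],
                   thresholds: Dict[str, float],
                   min_votes: int = 2) -> List[int]:
    """Keep a candidate when at least `min_votes` measures exceed their thresholds.

    ASSUMPTION: each measure is a callable returning that algorithm's
    statistic at a given candidate time, and larger means 'more like a change'.
    """
    kept = []
    for point in candidates:
        # Evaluate every algorithm's measure at this candidate.
        values = {name: fn(point) for name, fn in measures.items()}
        votes = sum(values[name] > thresholds[name] for name in measures)
        if votes >= min_votes:
            kept.append(point)
    return kept
```

In practice the callables would wrap the CuSum, BIC, and CDF-comparison statistics computed on the current window of observations.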
Although not shown in the Figures, based on either a voting scheme or a likelihood ratio test of two models trained on the values of the three (or more) measurements of manually segmented data (i.e., change points labeled manually) near and far from a valid change respectively, the set of valid change points is selected from the collection of the three sets (i.e., invalid boundaries are removed). That is, as shown at step 35,
Continuing to step 40,
Returning to step 55, if it is determined that the difference between the current observation sequence time f and the start time l is greater than a multiple of time segment durations, then a new start time is calculated as performed at step 60 according to:
l=f−Xn0
Thus, for example, if the time commensurate with 3 time segments has elapsed without hitting a candidate boundary, then the process will result in execution of step 60 to set the next current starting time l to the next observation sequence f offset by the quantity Xn0; i.e., set l=f−Xn0. Thereafter, the process proceeds to step 65 to determine if the end of the audio stream (last time segment) has been reached. If the end of the audio stream has been reached, the process ends as indicated at step 70; otherwise, the next time segment of the observation sequence provided by each algorithm is processed by returning to step 20.
Returning to step 40, if a candidate change point is detected in the current segment, then the following calculations are performed:
set l=r; and
f=r+n0;
where r is the location (in time) of the last change in the candidate list (i.e., the time when a valid change point is encountered in an audio segment). Thus, according to these calculations, the observation sequence time f and the starting time l are changed after detection of a change point, and the process proceeds to step 65 to determine if the end of the audio stream (last time segment) has been reached. If the end of the audio stream has been reached, the process ends as indicated at step 70; otherwise, the next time segment of the observation sequence provided by each algorithm is processed by returning to step 20.
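The window bookkeeping described above can be sketched as a single update function. The two stated rules (l=r, f=r+n0 after a change; l=f−Xn0 when the window has grown past X segments) come from the description. The step that advances f by one segment when no change is found is an assumption made for this sketch, since the surviving text does not state it, and the names `next_window` and `x_mult` are hypothetical.

```python
from typing import Optional, Tuple

def next_window(l: int, f: int, n0: int, x_mult: int,
                change_at: Optional[int] = None) -> Tuple[int, int]:
    """Return the updated (l, f) pair after one window has been processed.

    l: start time of the current window, f: end time of the window,
    n0: duration of one time segment, x_mult: the multiple X.
    """
    if change_at is not None:
        # A valid change was found at time r: set l = r and f = r + n0.
        return change_at, change_at + n0
    # ASSUMPTION: with no change detected, the window end advances by
    # one segment before the length check is applied.
    f = f + n0
    if f - l > x_mult * n0:
        # Window longer than X segments: set l = f - X * n0.
        l = f - x_mult * n0
    return l, f
```

Capping the window at X segments bounds the number of observations each detector has to process when a long stretch of audio contains no change.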
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product, which is embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
Thus, as shown in
The computing system 100 additionally includes computer readable media, including a variety of types of volatile and non-volatile media, each of which can be removable or non-removable. For example, system memory 150 includes computer readable media in the form of volatile memory, such as random access memory (RAM), and non-volatile memory, such as read only memory (ROM). The ROM may include a basic input/output system (BIOS) that contains the basic routines that help to transfer information between elements within computer device 100, such as during start-up. The RAM component typically contains data and/or program modules in a form that can be quickly accessed by the processing unit. Other kinds of computer readable media 105 for storing program data and/or audio data to be segmented according to the invention include a hard disk drive (not shown) for reading from and writing to non-removable, non-volatile magnetic media, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from and/or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM, or other optical media. Any audio data storage media 105, including a hard disk drive, magnetic disk drive, and optical disk drive, would be connected to the system bus 101 by one or more data media interfaces 146. Alternatively, the hard disk drive, magnetic disk drive, and optical disk drive can be connected to the system bus 101 by a SCSI interface (not shown), or other coupling mechanism. Although not shown, the computer 100 can include other types of computer readable media. Generally, the above-identified computer readable media provide non-volatile storage of computer readable instructions, data structures, program modules, and other data for use by computer 100.
For instance, the readable media can store the operating system (O/S), one or more application programs, such as the audio segmentation editing software applications, and/or other program modules and program data for enabling blind change detection for audio segmentation according to the invention. Input/output interfaces 145, 146 are provided that couple the input devices and data storage devices to the processing unit 110. More generally, input devices can be coupled to the computer 100 through any kind of interface and bus structures, such as a parallel port, serial port, universal serial bus (USB) port, etc. The computer environment 100 also includes the display device 19 and a video adapter card 135 that couples the display device 19 to the bus 101. In addition to the display device 19, the computer environment 100 can include other output peripheral devices, such as speakers (not shown), a printer, etc. I/O interfaces 145 are used to couple these other output devices to the computer 100.
Computing system 100 is further adapted to operate in a networked environment using logical connections to one or more other computers that may include all of the features discussed above with respect to computer device 100, or some subset thereof. It is understood that any type of network can be used to couple the computer system 100 with server device 20, such as a local area network (LAN), or a wide area network (WAN) 300 (such as the Internet). When implemented in a LAN networking environment, the computer 100 connects to a local network via a network interface or adapter 29, e.g., supporting Ethernet or like network communications protocols. When implemented in a wide area network (WAN) networking environment, the computer 100 may connect to a WAN 300 via a high speed cable/DSL modem 180 or some other connection means. The cable/DSL modem 180 can be located internal or external to computer 100, and can be connected to the bus 101 via the I/O interfaces 145 or other appropriate coupling mechanism. Although not illustrated, the computing environment 100 can provide wireless communication functionality for connecting computer 100 with other networked remote devices (e.g., via modulated radio signals, modulated infrared signals, etc.).
In the networked environment, it is understood that the computer system 100 can draw from program modules stored in remote memory storage devices (not shown) in a distributed configuration. However, wherever physically stored, one or more of the application programs executing the blind change detection for audio segmentation system of the invention can include various modules for performing principal tasks. For instance, the application program can provide logic enabling input of audio source data for storage as media files in a centralized data storage system and/or performing the audio segmentation techniques thereon. Other program modules can be used to implement additional functionality not specifically identified here.
The present invention has been described with reference to flow diagrams and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flow diagram flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flow diagram flow or flows and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flow diagram flow or flows and/or block diagram block or blocks.
While it is apparent that the invention herein disclosed is well calculated to fulfill the objects stated above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.
This application relates to and claims the benefit of U.S. Provisional Patent Application Ser. No. 60/663,079 filed Mar. 18, 2005, the entire contents and disclosure of which is incorporated by reference herein.
This invention was made with Government support under contract number H98230-04-3-0001 awarded by the Distillery Phase II Program. The Government has certain rights in this invention.
Number | Date | Country
---|---|---
60663079 | Mar 2005 | US