Claims
- 1. A method of segmenting multimedia data using audio information, comprising:
receiving a search request identifying at least one target speaker; retrieving at least one model for the at least one target speaker; and segmenting the multimedia data into one or more target speaker segments and background segments based on feature vectors of the multimedia data and the at least one model for the at least one target speaker, wherein the step of segmenting comprises:
reading a first block of frames of the multimedia data; determining a score for the first block of frames based on the at least one model for the at least one target speaker; and determining if the score for the first block of frames is above or below a first threshold.
- 2. The method of claim 1, further comprising:
identifying the first block of frames as part of a target speaker segment if the score for the block of frames is above the first threshold; and identifying the first block of frames as part of a background segment if the score for the block of frames is below the first threshold.
- 3. The method of claim 1, further comprising:
identifying a tentative start point of a target speaker segment if the score for the first block of frames is above the first threshold; and identifying a tentative end point of a target speaker segment if the score for the first block of frames is below the first threshold.
- 4. The method of claim 3, further comprising:
reading a second block of frames of the multimedia data; determining a score for the second block of frames based on the model for the target speaker; verifying the tentative start point of the target speaker segment if the score for the second block of frames is above a second threshold; and verifying the tentative end point of the target speaker segment if the score for the second block of frames is below a third threshold.
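The block-based segmentation of claims 1, 3, and 4 can be sketched in code. The sketch below is illustrative only: the function name `segment_speaker`, the representation of blocks as a list of precomputed scores, and the state-machine structure are all assumptions, not the patented implementation. A block score above the first threshold marks a tentative start point, a score below it marks a tentative end point, and each tentative boundary is verified against the score of the next block using the second and third thresholds.

```python
def segment_speaker(block_scores, t1, t2, t3):
    """Label blocks as target-speaker segments (hypothetical sketch).

    block_scores: per-block scores against the target-speaker model.
    t1: first threshold, marking tentative start/end points.
    t2: second threshold, verifying a tentative start on the next block.
    t3: third threshold, verifying a tentative end on the next block.
    Returns a list of (start_block, end_block) target-speaker segments.
    """
    segments = []     # verified (start, end) block-index pairs
    start = None      # start index of a verified, still-open segment
    tentative = None  # ('start' | 'end', block_index) awaiting verification
    for i, score in enumerate(block_scores):
        if tentative is not None:
            kind, j = tentative
            if kind == 'start' and score > t2:
                start = j                    # tentative start point verified
            elif kind == 'end' and score < t3 and start is not None:
                segments.append((start, j))  # tentative end point verified
                start = None
            tentative = None                 # otherwise the boundary is dropped
        if start is None and score > t1:
            tentative = ('start', i)         # possible start of a segment
        elif start is not None and score < t1:
            tentative = ('end', i)           # possible end of a segment
    if start is not None:                    # segment still open at end of data
        segments.append((start, len(block_scores) - 1))
    return segments
```

Verifying each boundary against a second block, as claim 4 describes, keeps a single noisy block from opening or closing a segment on its own.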
- 5. The method of claim 1, wherein the score is a normalized score.
- 6. The method of claim 5, wherein the normalized score is calculated based on the model for the target speaker and one or more background data models.
- 7. The method of claim 1, wherein the score is an averaged normalized score for the first block of frames.
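Claims 5 through 7 describe a normalized score computed from the target-speaker model and one or more background models, averaged over a block of frames. One common realization of such a score is a per-frame log-likelihood ratio; the sketch below assumes that form, and the single-Gaussian "models" stand in for whatever models the system actually uses. All names and parameter choices here are illustrative assumptions.

```python
import math

def gaussian_loglik(x, mean, var):
    # log N(x; mean, var) for a single scalar feature value
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def averaged_normalized_score(frames, target, background):
    """Average of log p(x|target) - log p(x|background) over a block.

    frames: scalar feature values for the block (illustrative; real systems
            would use multidimensional feature vectors).
    target, background: (mean, variance) pairs standing in for the
            target-speaker and background models.
    """
    total = 0.0
    for x in frames:
        total += gaussian_loglik(x, *target) - gaussian_loglik(x, *background)
    return total / len(frames)
```

Normalizing by a background model makes the score a relative measure, so a single threshold can work across recordings with different acoustic conditions; averaging over the block smooths frame-level fluctuations.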
- 8. The method of claim 1, further comprising:
sending at least one of (a) at least a portion of the target speaker segments and (b) at least a portion of the background segments to a user device from which the search request was received to enable the user device to reproduce a multimedia presentation incorporating the at least one of (a) the at least a portion of target speaker segments and (b) the at least a portion of the background segments.
- 9. The method of claim 8, wherein the user device is one of a computer, a wired telephone, a wireless telephone, a WebTV™ terminal, and a Personal Digital Assistant.
- 10. The method of claim 1, wherein the at least one model for the at least one target speaker is a Gaussian Mixture Model.
- 11. The method of claim 1, wherein the at least one model for the at least one target speaker is a vector quantization codebook model.
- 12. The method of claim 1, wherein the at least one model for the at least one target speaker is a hidden Markov model.
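Claims 10 through 12 allow the target-speaker model to be a Gaussian Mixture Model, a vector quantization codebook, or a hidden Markov model. As one illustration of the GMM case, the sketch below scores a single scalar feature against a mixture of Gaussians; the function name and the mixture parameters are assumptions for demonstration, not taken from the patent.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """log p(x | GMM) for a scalar feature x under a Gaussian mixture.

    weights: mixture weights (should sum to 1).
    means, variances: per-component Gaussian parameters.
    """
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        # weighted density of component N(m, v) evaluated at x
        total += w * math.exp(-0.5 * ((x - m) ** 2 / v + math.log(2 * math.pi * v)))
    return math.log(total)
```

A VQ codebook would instead score a frame by its distance to the nearest codeword, and an HMM would score a sequence of frames jointly; the thresholding logic of claim 1 is independent of which model produces the score.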
- 13. The method of claim 1, further comprising retrieving at least one model for background, wherein the step of segmenting includes segmenting the multimedia data into the one or more target speaker segments and the background segments based on the at least one model for the background.
- 14. The method of claim 13, wherein the at least one model for the background is a Gaussian Mixture Model.
- 15. The method of claim 13, wherein the at least one model for the background is a vector quantization codebook model.
- 16. The method of claim 13, wherein the at least one model for the background is a hidden Markov model.
- 17. An apparatus that identifies segments of multimedia data for retrieval, comprising:
a controller; a network interface; and a memory, wherein the controller receives a search request via the network interface identifying at least one target speaker, retrieves at least one model for the at least one target speaker from the memory, and segments the multimedia data into one or more target speaker segments and background segments based on feature vectors of the multimedia data and the at least one model for the at least one target speaker; wherein the controller segments the multimedia data by reading a first block of frames of the multimedia data, determining a score for the first block of frames based on the at least one model for the at least one target speaker, and determining if the score is above or below a first threshold.
- 18. The apparatus of claim 17, wherein the controller identifies the first block of frames as part of a target speaker segment if the score is above the first threshold and identifies the first block of frames as part of a background segment if the score is below the first threshold.
- 19. The apparatus of claim 17, wherein the controller identifies a tentative start point of a target speaker segment if the score is above the first threshold and identifies a tentative end point of a target speaker segment if the score is below the first threshold.
- 20. The apparatus of claim 19, wherein the controller reads a second block of frames of the multimedia data, determines a score for the second block of frames based on the model for the target speaker, verifies the tentative start point of the target speaker segment if the score for the second block of frames is above a second threshold, and verifies the tentative end point of the target speaker segment if the score for the second block of frames is below a third threshold.
- 21. The apparatus of claim 17, wherein the score is a normalized score.
- 22. The apparatus of claim 21, wherein the normalized score is calculated based on the model for the target speaker and one or more background data models.
- 23. The apparatus of claim 17, wherein the score is an averaged normalized score for the first block of frames.
- 24. The apparatus of claim 17, wherein the controller sends at least one of (a) at least a portion of the target speaker segments and (b) at least a portion of the background segments to a user device from which the search request was received to enable the user device to reproduce a multimedia presentation incorporating the at least one of (a) the at least a portion of target speaker segments and (b) the at least a portion of background segments.
- 25. The apparatus of claim 24, wherein the user device is one of a computer, a wired telephone, a wireless telephone, a WebTV™ terminal, and a Personal Digital Assistant.
- 26. The apparatus of claim 17, wherein the at least one model for the at least one target speaker is a Gaussian Mixture Model.
- 27. The apparatus of claim 17, wherein the at least one model for the at least one target speaker is a vector quantization codebook model.
- 28. The apparatus of claim 17, wherein the at least one model for the at least one target speaker is a hidden Markov model.
- 29. The apparatus of claim 17, wherein the controller retrieves at least one model for background and segments the multimedia data into the one or more target speaker segments and the background segments based on the at least one model for the background.
- 30. The apparatus of claim 29, wherein the at least one model for the background is a Gaussian Mixture Model.
- 31. The apparatus of claim 29, wherein the at least one model for the background is a vector quantization codebook model.
- 32. The apparatus of claim 29, wherein the at least one model for the background is a hidden Markov model.
- 33. A user device that receives at least one of (a) at least a portion of the target speaker segments and (b) at least a portion of the background segments that are segmented by the method of claim 1 and reproduces a multimedia presentation incorporating the at least one of (a) the at least a portion of the target speaker segments and (b) the at least a portion of the background segments.
- 34. The user device of claim 33, wherein the user device is one of a computer, a wired telephone, a wireless telephone, a WebTV™ terminal, and a Personal Digital Assistant.
Parent Case Info
[0001] This nonprovisional application claims the benefit of U.S. provisional application No. 60/096,372 entitled “Speaker Detection in Broadcast Speech Databases” filed on Aug. 13, 1998. The provisional application and all references cited therein are hereby incorporated by reference.
Provisional Applications (1)

| Number | Date | Country |
| --- | --- | --- |
| 60096372 | Aug 1998 | US |
Continuations (1)

| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 09353192 | Jul 1999 | US |
| Child | 09976023 | Oct 2001 | US |