Claims
- 1. A method for indexing and summarizing digital video content, comprising the steps of:
a) receiving digital video content; b) automatically parsing digital video content into one or more fundamental semantic units based on a set of predetermined domain-specific cues; c) determining corresponding attributes for each of said fundamental semantic units to provide indexing information for said fundamental semantic units; and d) arranging one or more of said fundamental semantic units with one or more of said corresponding attributes for display and browsing.
- 2. The method of claim 1, wherein said step of automatically parsing digital video content further comprises the steps of:
a) automatically extracting a set of features from digital video content based on said predetermined set of domain-specific cues; b) recognizing one or more domain-specific segments based on said set of features for parsing digital video content; and c) parsing digital video content into one or more fundamental semantic units corresponding to said one or more domain-specific segments.
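The parse step of claim 2 reduces to a small pipeline: per-frame feature extraction, per-frame segment labeling, and cutting at label transitions. Below is a minimal Python sketch of that flow; the `SemanticUnit` type, the callable interfaces, and the frame-level labeling granularity are illustrative assumptions, not part of the claims.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SemanticUnit:
    start_frame: int
    end_frame: int
    label: str  # domain-specific segment type, e.g. "pitch" or "serve"

def parse_video(frames: List, extract: Callable, recognize: Callable) -> List[SemanticUnit]:
    """Steps (a)-(c) of claim 2: extract per-frame features, label each frame
    with a domain-specific segment type, and cut wherever the label changes."""
    features = [extract(f) for f in frames]       # (a) feature extraction
    labels = recognize(features)                  # (b) one segment label per frame
    units, start = [], 0
    for i in range(1, len(labels) + 1):           # (c) cut at label transitions
        if i == len(labels) or labels[i] != labels[start]:
            units.append(SemanticUnit(start, i - 1, labels[start]))
            start = i
    return units
```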
- 3. The method of claim 2, wherein said one or more domain-specific segments are views.
- 4. The method of claim 2, wherein said one or more domain-specific segments are events.
- 5. The method of claim 2, wherein said set of features from digital video content includes a set of visual features.
- 6. The method of claim 2, wherein said set of features from digital video content includes a set of audio features.
- 7. The method of claim 2, wherein said step of automatically extracting a set of features from digital video content further comprises a step of recognizing speech signals.
- 8. The method of claim 7, wherein said step of recognizing speech signals further comprises a step of converting said speech signals to recognized text data.
- 9. The method of claim 2, wherein said step of automatically extracting a set of features from digital video content further comprises a step of decoding closed caption information from digital video content.
- 10. The method of claim 2, wherein said step of automatically extracting a set of features from digital video content further comprises the steps of:
a) detecting text images in said digital video content; and b) converting said text images into text information.
- 11. The method of claim 10, wherein said step of detecting text images further comprises the steps of:
a) computing a set of frame-to-frame motion measures; b) comparing said set of frame-to-frame motion measures with a set of predetermined threshold values; and c) determining one or more candidate text areas based on said comparing.
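A minimal sketch of the motion-measure test in claim 11: overlay text is largely static, so image blocks whose frame-to-frame change stays below a threshold become candidate text areas. The block size, the mean-absolute-difference measure, and the threshold value are assumptions for illustration.

```python
import numpy as np

def candidate_text_blocks(prev: np.ndarray, curr: np.ndarray,
                          block: int = 16, motion_thresh: float = 2.0):
    """Return (row, col) origins of blocks whose frame-to-frame motion
    measure falls below the threshold (steps a-c of claim 11)."""
    h, w = prev.shape
    candidates = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            a = prev[r:r + block, c:c + block].astype(float)
            b = curr[r:r + block, c:c + block].astype(float)
            if np.abs(a - b).mean() < motion_thresh:   # static block: text cue
                candidates.append((r, c))
    return candidates
```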
- 12. The method of claim 11, further comprising the step of removing noise from said one or more candidate text areas.
- 13. The method of claim 12, further comprising the step of applying domain-specific spatio-temporal constraints to remove detection errors from said one or more candidate text areas.
- 14. The method of claim 12, further comprising the step of color-histogram filtering said one or more candidate text areas to remove detection errors.
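One plausible reading of the color-histogram filtering in claim 14, sketched in Python: a candidate area survives only if its normalized intensity histogram is close to a reference histogram drawn from known caption regions. The L1 distance, bin count, and threshold are assumed values.

```python
import numpy as np

def passes_histogram_filter(area: np.ndarray, ref_hist: np.ndarray,
                            bins: int = 32, max_dist: float = 0.5) -> bool:
    # Normalized intensity histogram of the candidate area, compared by L1
    # distance against a reference histogram of known text regions.
    hist, _ = np.histogram(area, bins=bins, range=(0, 256))
    hist = hist / max(hist.sum(), 1)
    return float(np.abs(hist - ref_hist).sum()) <= max_dist
```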
- 15. The method of claim 10, wherein said step of converting said text images into text information further comprises the steps of:
a) computing a set of temporal features for frame-to-frame differences of said one or more candidate text areas; b) computing a set of spatial features of an intensity projection histogram for said one or more candidate text areas containing peaks or valleys; c) determining a set of text character sizes and spatial locations of one or more characters located within said one or more candidate text areas based on said set of temporal features and said set of spatial features; and d) comparing said one or more characters to a set of pre-determined template characters to convert text images into text information.
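A sketch of claim 15's conversion step: valleys of the column intensity-projection histogram separate characters (steps a-c), and each segmented glyph is matched against pre-determined templates (step d). The binarized input, run-length segmentation, and pixel-agreement score are illustrative choices, not the patent's specific method.

```python
import numpy as np

def segment_characters(binary_area: np.ndarray):
    """Steps (a)-(c): runs of non-zero columns in the intensity projection
    histogram give each character's horizontal extent; zero columns are the
    valleys between characters."""
    projection = binary_area.sum(axis=0)        # intensity projection histogram
    ink = projection > 0
    boxes, start = [], None
    for x, has_ink in enumerate(list(ink) + [False]):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            boxes.append((start, x))            # columns [start, x) of one character
            start = None
    return boxes

def match_character(glyph: np.ndarray, templates: dict) -> str:
    # Step (d): nearest pre-determined template by pixel agreement; `templates`
    # maps each character to a binary array of the same shape as `glyph`.
    return max(templates, key=lambda ch: float(np.mean(templates[ch] == glyph)))
```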
- 16. The method of claim 2, wherein said step of automatically extracting a set of features from digital video content further comprises a step of synchronizing timing information between said set of features.
- 17. The method of claim 2, wherein said step of automatically extracting a set of features from digital video content further comprises a step of detecting scene changes.
- 18. The method of claim 17, wherein said step of detecting scene changes comprises a step of automatically detecting flashlights.
- 19. The method of claim 18, wherein said step of detecting flashlights further comprises the steps of:
a) calculating a frame-to-frame color difference of each frame; b) calculating a corresponding long-term color difference; c) computing a ratio of said frame-to-frame color difference to said long-term color difference; and d) comparing said ratio with a pre-determined threshold value to detect flashlights.
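A minimal sketch of the ratio test in claim 19, reading the long-term color difference as a trailing-window average of frame-to-frame differences; that reading, the histogram-based difference measure, and all numeric values are assumptions.

```python
import numpy as np

def color_difference(a: np.ndarray, b: np.ndarray, bins: int = 64) -> float:
    # L1 distance between normalized gray-level histograms; an illustrative
    # stand-in for the claimed frame-to-frame color difference.
    ha, _ = np.histogram(a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(b, bins=bins, range=(0, 256))
    return float(np.abs(ha / ha.sum() - hb / hb.sum()).sum())

def is_flashlight(frames: list, t: int, window: int = 20,
                  ratio_thresh: float = 5.0) -> bool:
    """A flash makes the instantaneous difference spike far above the
    long-term difference (here, the average over a trailing window)."""
    short = color_difference(frames[t - 1], frames[t])
    history = [color_difference(frames[i - 1], frames[i])
               for i in range(max(t - window, 1), t)]
    long_term = float(np.mean(history)) if history else short
    return short / max(long_term, 1e-6) > ratio_thresh
```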
- 20. The method of claim 17, wherein said step of detecting scene changes further comprises a step of automatically detecting direct scene changes.
- 21. The method of claim 20, wherein said step of detecting direct scene changes further comprises the step of computing a frame-to-frame color difference for each frame.
- 22. The method of claim 17, wherein said step of detecting scene changes further comprises the steps of:
a) determining one or more intra-block motion vectors from digital video content; b) determining a set of corresponding forward-motion vectors for each of said intra-block motion vectors; c) determining a set of corresponding backward-motion vectors for each of said intra-block motion vectors; and d) computing a ratio of said one or more intra-block motion vectors and said corresponding forward-motion vectors and backward-motion vectors to detect scene changes.
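Claim 22 works naturally in the compressed domain: at a cut, the encoder cannot predict from either direction, so intra-coded blocks dominate forward- and backward-predicted ones. A sketch of the ratio test, with an assumed threshold:

```python
def is_cut_from_block_counts(n_intra: int, n_forward: int, n_backward: int,
                             ratio_thresh: float = 2.0) -> bool:
    # When intra-coded blocks outnumber motion-compensated blocks by a wide
    # margin, the encoder found no usable reference, which signals a cut.
    return n_intra / max(n_forward + n_backward, 1) > ratio_thresh
```

For example, a frame with 300 intra-coded, 40 forward-predicted, and 35 backward-predicted blocks yields a ratio of 4 and is flagged as a scene change.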
- 23. The method of claim 17, wherein said step of detecting scene changes further comprises the step of computing a set of color differences from a local window of each digital video frame to detect gradual scene changes.
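A sketch of the local-window measure in claim 23: color differences accumulated inside a fixed window over a span of frames rise steadily through a dissolve even when no single frame-to-frame step is large. The centre-window placement, span, and mean-absolute difference are assumptions.

```python
import numpy as np

def gradual_change_score(frames: list, t: int, span: int = 15,
                         win: int = 32) -> float:
    """Accumulated local-window color difference ending at frame t; a large
    score with no single dominant step suggests a gradual transition."""
    h, w = frames[0].shape
    r0, c0 = (h - win) // 2, (w - win) // 2       # centre window, for example
    total = 0.0
    for i in range(max(t - span, 1), t + 1):
        a = frames[i - 1][r0:r0 + win, c0:c0 + win].astype(float)
        b = frames[i][r0:r0 + win, c0:c0 + win].astype(float)
        total += float(np.abs(a - b).mean())
    return total
```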
- 24. The method of claim 17, wherein said step of detecting scene changes further comprises a step of detecting camera aperture changes.
- 25. The method of claim 24, wherein said step of detecting camera aperture changes further comprises the steps of:
a) computing color differences between adjacent detected scene changes; and b) comparing said color differences with a pre-determined threshold value to detect camera aperture changes.
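One plausible reading of claim 25, sketched below: when the content just before one detected change matches the content just after the adjacent one, the pair brackets a brightness excursion and is reclassified as a camera aperture change rather than a true transition. The frame choices and threshold are assumptions.

```python
import numpy as np

def _hist_diff(a: np.ndarray, b: np.ndarray, bins: int = 64) -> float:
    # L1 distance between normalized gray-level histograms.
    ha, _ = np.histogram(a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(b, bins=bins, range=(0, 256))
    return float(np.abs(ha / ha.sum() - hb / hb.sum()).sum())

def aperture_changes(frames: list, boundaries: list, thresh: float = 0.25) -> list:
    """Flag adjacent detected scene changes whose surrounding content matches:
    the stretch between them was an exposure step, not a real scene change."""
    flagged = []
    for t1, t2 in zip(boundaries, boundaries[1:]):
        before = frames[max(t1 - 1, 0)]
        after = frames[min(t2, len(frames) - 1)]
        if _hist_diff(before, after) < thresh:
            flagged.extend([t1, t2])
    return flagged
```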
- 26. The method of claim 17, wherein said step of detecting scene changes further comprises the steps of:
a) determining a set of threshold levels using a decision tree based on a set of predetermined parameters; and b) automatically detecting corresponding multi-level scene changes for said set of threshold levels.
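A minimal sketch of claim 26: a small hand-built decision tree turns clip parameters into a set of thresholds, and each threshold yields one level of scene-change boundaries. The parameters and numeric values are invented for illustration.

```python
def threshold_levels(fast_motion: bool, high_noise: bool) -> list:
    """Decision tree from clip parameters to detection thresholds; looser
    thresholds give coarser detection levels."""
    if high_noise:
        return [0.6, 0.8] if fast_motion else [0.5, 0.7]
    return [0.3, 0.5, 0.7] if fast_motion else [0.2, 0.4, 0.6]

def multilevel_scene_changes(diffs: list, levels: list) -> dict:
    # One boundary list per level; level 0 uses the most sensitive threshold.
    return {k: [t for t, d in enumerate(diffs) if d > thr]
            for k, thr in enumerate(levels)}
```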
- 27. The method of claim 2, further comprising the step of integrating one or more of said domain-specific segments to form a domain-specific event for display.
- 28. The method of claim 2, wherein said set of predetermined domain-specific cues is determined based on user preferences.
- 29. The method of claim 2, wherein said predetermined domain-specific cues are at least one of color, motion, and object layout.
- 30. The method of claim 2, wherein said step of recognizing one or more domain-specific segments further comprises a fast adaptive color filtering of digital video content to select possible domain-specific segments.
- 31. The method of claim 30, wherein said fast adaptive color filtering is based on one or more pre-trained filtering models.
- 32. The method of claim 31, wherein such filtering models are built through a clustering-based training process.
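Claims 30 through 32 describe a fast color filter whose models come from clustering. A sketch under those assumptions: k-means centroids trained on color histograms of known domain segments serve as the pre-trained model, and a frame passes the fast filter if its histogram lies near any centroid. k, the distance measures, and the threshold are illustrative.

```python
import numpy as np

def train_color_model(hists: np.ndarray, k: int = 3, iters: int = 20,
                      seed: int = 0) -> np.ndarray:
    """Clustering-based training (claim 32): plain k-means over histograms of
    known domain segments; the centroids are the filtering model."""
    hists = np.asarray(hists, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = hists[rng.choice(len(hists), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((hists[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = hists[assign == j].mean(axis=0)
    return centroids

def passes_color_filter(hist: np.ndarray, centroids: np.ndarray,
                        max_dist: float = 0.15) -> bool:
    # Fast per-frame test (claim 30): accept the frame as a possible
    # domain-specific segment if its histogram is within max_dist (L1)
    # of any trained centroid.
    return float(np.min(np.abs(centroids - hist).sum(axis=1))) <= max_dist
```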
- 33. The method of claim 30, wherein said step of recognizing one or more domain-specific segments further comprises a segmentation-based verification for verifying domain-specific segments based on a set of pre-determined domain-specific parameters.
- 35. The method of claim 33, wherein said segmentation-based verification comprises a salient feature region extraction.
- 36. The method of claim 33, wherein said segmentation-based verification comprises a moving object detection.
- 37. The method of claim 33, wherein said segmentation-based verification comprises a similarity matching scheme of visual and structure features.
- 38. The method of claim 30, wherein said step of recognizing one or more domain-specific segments further comprises an edge-based verification for verifying domain-specific segments based on a set of pre-determined domain-specific parameters.
- 39. A method for content-based adaptive streaming of digital video content, comprising the steps of:
a) receiving digital video content; b) automatically parsing said digital video content into one or more video segments based on a set of predetermined domain-specific cues for adaptive streaming; c) assigning corresponding video quality levels to said video segments based on a set of predetermined domain-specific requirements; d) scheduling said video segments for adaptive streaming to one or more users based on corresponding video quality levels; and e) adaptively streaming said video segments with corresponding video quality levels to users for display and browsing.
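Steps (c) and (d) of claim 39 amount to a quality map plus a scheduler. A minimal sketch, assuming a label-to-bitrate table as the domain-specific requirements and a quality-ordered schedule; the labels, bitrates, and bandwidth budget are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_s: float
    end_s: float
    label: str  # semantic type produced by the parsing step

# Domain-specific requirements as a label-to-bitrate table (assumed values).
QUALITY_KBPS = {"event": 1500, "view": 800, "other": 300}

def assign_and_schedule(segments: list, budget_kbps: float = 1000.0) -> list:
    """Step (c): assign each segment a quality level from the table.
    Step (d): order delivery by assigned quality, capping each segment's
    rate at the available bandwidth budget."""
    plan = []
    for seg in sorted(segments, key=lambda s: -QUALITY_KBPS.get(s.label, 300)):
        rate = min(QUALITY_KBPS.get(seg.label, 300), budget_kbps)
        plan.append((seg, rate))      # stream this segment at this quality level
    return plan
```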
- 40. The method of claim 39, wherein said step of automatically parsing said digital video content further includes the steps of:
a) automatically extracting a set of features from said digital video content based on said predetermined set of domain-specific cues; b) recognizing one or more domain-specific segments based on said set of features for parsing said digital video content; and c) parsing said digital video content into one or more fundamental semantic units corresponding to said domain-specific segments.
- 41. The method of claim 40, wherein said one or more domain-specific segments are views.
- 42. The method of claim 40, wherein said one or more domain-specific segments are events.
- 43. The method of claim 40, wherein said set of features from said digital video content includes a set of visual features.
- 44. The method of claim 40, wherein said set of features from said digital video content includes a set of audio features.
- 45. The method of claim 40, wherein said step of automatically extracting a set of features from said digital video content further comprises a step of recognizing speech signals.
- 46. The method of claim 45, wherein said step of recognizing speech signals further comprises a step of converting said speech signals to recognized text data.
- 47. The method of claim 40, wherein said step of automatically extracting a set of features from digital video content further comprises a step of decoding closed caption information from said digital video content.
- 48. The method of claim 40, wherein said step of automatically extracting a set of features from said digital video content further comprises the steps of:
a) detecting text images in said digital video content; and b) converting said text images into text information.
- 49. The method of claim 48, wherein said step of detecting text images further comprises the steps of:
a) computing a set of frame-to-frame motion measures; b) comparing said set of frame-to-frame motion measures with a set of predetermined threshold values; and c) determining one or more candidate text areas based on said comparing.
- 50. The method of claim 49, further comprising the step of removing noise from said one or more candidate text areas.
- 51. The method of claim 50, further comprising the step of applying domain-specific spatio-temporal constraints to remove detection errors from said one or more candidate text areas.
- 52. The method of claim 50, further comprising the step of color-histogram filtering said one or more candidate text areas to remove detection errors.
- 53. The method of claim 48, wherein said step of converting said text images into text information further comprises the steps of:
a) computing a set of temporal features for frame-to-frame differences of said one or more candidate text areas; b) computing a set of spatial features of an intensity projection histogram for said one or more candidate text areas containing peaks or valleys; c) determining a set of text character sizes and spatial locations of one or more characters located within said one or more candidate text areas based on said set of temporal features and said set of spatial features; and d) comparing said one or more characters to a set of pre-determined template characters to convert text images into text information.
- 54. The method of claim 40, wherein said step of automatically extracting a set of features from digital video content further comprises a step of synchronizing timing information between said set of features.
- 55. The method of claim 40, wherein said step of automatically extracting a set of features from digital video content further comprises a step of detecting scene changes.
- 56. The method of claim 55, wherein said step of detecting scene changes comprises a step of automatically detecting flashlights.
- 57. The method of claim 56, wherein said step of detecting flashlights further comprises the steps of:
a) calculating a frame-to-frame color difference of each frame; b) calculating a corresponding long-term color difference; c) computing a ratio of said frame-to-frame color difference to said long-term color difference; and d) comparing said ratio with a pre-determined threshold value to detect flashlights.
- 58. The method of claim 55, wherein said step of detecting scene changes further comprises a step of automatically detecting direct scene changes.
- 59. The method of claim 58, wherein said step of detecting direct scene changes further comprises the step of computing a frame-to-frame color difference for each frame.
- 60. The method of claim 55, wherein said step of detecting scene changes further comprises the steps of:
a) determining one or more intra-block motion vectors from digital video content; b) determining a set of corresponding forward-motion vectors for each of said intra-block motion vectors; c) determining a set of corresponding backward-motion vectors for each of said intra-block motion vectors; and d) computing a ratio of said one or more intra-block motion vectors and said corresponding forward-motion vectors and backward-motion vectors to detect scene changes.
- 61. The method of claim 55, wherein said step of detecting scene changes further comprises the step of computing a set of color differences from a local window of each digital video frame to detect gradual scene changes.
- 62. The method of claim 55, wherein said step of detecting scene changes further comprises a step of detecting camera aperture changes.
- 63. The method of claim 62, wherein said step of detecting camera aperture changes further comprises the steps of:
a) computing color differences between adjacent detected scene changes; and b) comparing said color differences with a predetermined threshold value to detect camera aperture changes.
- 64. The method of claim 55, wherein said step of detecting scene changes further comprises the steps of:
a) determining a set of threshold levels using a decision tree based on a predetermined set of parameters; and b) automatically detecting corresponding multi-level scene changes for said set of threshold levels.
- 65. The method of claim 40, further comprising the step of integrating one or more of said domain-specific segments to form a domain-specific event for display.
- 66. The method of claim 40, wherein said set of predetermined domain-specific cues is determined based on user preferences.
- 67. The method of claim 40, wherein said predetermined domain-specific cues are at least one of color, motion, and object layout.
- 68. The method of claim 40, wherein said step of recognizing one or more domain-specific segments further comprises a fast adaptive color filtering of digital video content to select possible domain-specific segments.
- 69. The method of claim 68, wherein said fast adaptive color filtering is based on one or more filtering models.
- 70. The method of claim 69, wherein said filtering models are built through a clustering-based training process.
- 71. The method of claim 68, wherein said step of recognizing one or more domain-specific segments further comprises a segmentation-based verification for verifying domain-specific segments based on a set of pre-determined domain-specific parameters.
- 72. The method of claim 71, wherein said segmentation-based verification comprises a salient feature region extraction.
- 73. The method of claim 71, wherein said segmentation-based verification comprises a moving object detection.
- 74. The method of claim 71, wherein said segmentation-based verification comprises a similarity matching scheme of visual and structure features.
- 75. The method of claim 71, wherein said step of recognizing one or more domain-specific segments further comprises an edge-based verification for verifying domain-specific segments based on a set of pre-determined domain-specific parameters.
- 78. A system for indexing digital video content, comprising:
a means for receiving digital video content; a means, coupled to said receiving means, for automatically parsing digital video content into one or more fundamental semantic units based on a set of predetermined domain-specific cues; a means, coupled to said parsing means, for determining corresponding attributes for each of said fundamental semantic units; and a means, coupled to said parsing means and said determining means, for arranging one or more of said fundamental semantic units with one or more of said corresponding attributes for browsing.
- 79. The system of claim 78, wherein said means for automatically parsing digital video content further comprises:
a) a means, coupled to said receiving means, for automatically extracting a set of features from digital video content based on said predetermined set of domain-specific cues; b) a means, coupled to said extracting means, for recognizing one or more domain-specific segments based on said set of features for parsing digital video content; and c) a means, coupled to said recognizing means, for parsing digital video content into one or more fundamental semantic units corresponding to said one or more domain-specific segments.
- 80. The system of claim 79, wherein said one or more domain-specific segments are views.
- 81. The system of claim 79, wherein said one or more domain-specific segments are events.
- 82. The system of claim 79, wherein said set of features from said digital video content includes a set of visual features.
- 83. The system of claim 79, wherein said set of features from said digital video content includes a set of audio features.
- 84. The system of claim 79, wherein said extracting means further comprises a means for recognizing speech signals.
- 85. The system of claim 84, wherein said means for recognizing speech signals converts said speech signals into recognized text data.
- 86. The system of claim 79, wherein said extracting means comprises a means for decoding closed caption information from said digital video content.
- 87. The system of claim 79, wherein said extracting means detects text images in said digital video content and converts said text images into text information.
- 88. The system of claim 87, wherein said extracting means computes a set of frame-to-frame motion measures and compares said set of frame-to-frame motion measures with a set of predetermined threshold values to determine one or more candidate text areas.
- 89. The system of claim 88, wherein said extracting means further removes noise from said one or more candidate text areas.
- 90. The system of claim 89, wherein said extracting means further applies domain-specific spatio-temporal constraints to remove detection errors from said one or more candidate areas.
- 91. The system of claim 90, wherein said extracting means further applies color-histogram filtering on said one or more candidate text areas to remove detection errors.
- 92. The system of claim 79, wherein said extracting means synchronizes timing information between said set of features.
- 92. The system of claim 79, wherein said extracting means comprises a detector of scene changes.
- 93. The system of claim 92, wherein said detector of scene changes comprises an automatic flashlight detector.
- 94. The system of claim 93, wherein said automatic flashlight detector comprises a comparator for comparing a ratio of a frame-to-frame color difference for each frame to a corresponding long-term color difference to detect flashlights.
- 95. The system of claim 92, wherein said detector of scene changes comprises an automatic detector of direct scene changes.
- 96. The system of claim 92, wherein said detector of scene changes comprises an automatic detector of gradual scene changes.
- 97. The system of claim 92, wherein said detector of scene changes comprises a detector of camera aperture changes.
- 98. The system of claim 79, wherein said set of pre-determined domain-specific cues is determined based on user preferences.
- 99. The system of claim 79, wherein said set of pre-determined domain-specific cues is at least one of color, motion, and object layout.
- 100. The system of claim 79, wherein said means for recognizing one or more domain-specific segments comprises a fast adaptive color filter for selecting possible domain-specific segments.
- 101. The system of claim 100, wherein said adaptive color filter uses one or more pre-trained filtering models built through a clustering-based training.
- 102. The system of claim 101, wherein said means for recognizing one or more domain-specific segments comprises a segmentation-based verification module for verifying domain-specific segments based on a set of pre-determined domain-specific parameters.
- 103. The system of claim 102, wherein said segmentation-based verification module comprises a salient-feature region extraction module.
- 104. The system of claim 103, wherein said segmentation-based verification module further comprises a moving object detection module.
- 105. The system of claim 104, wherein said segmentation-based verification module further comprises a module for similarity matching of visual and structure features.
- 106. The system of claim 105, wherein said means for recognizing one or more domain-specific segments comprises an edge-based verification module for verifying domain-specific segments based on a set of pre-determined domain-specific parameters.
- 107. A system for content-based adaptive streaming of digital video content, comprising:
a means for receiving digital video content; a means, coupled to said receiving means, for automatically parsing digital video content into one or more video segments based on a set of predetermined domain-specific cues; a means, coupled to said parsing means, for assigning corresponding video quality levels to said video segments based on a set of predetermined content-specific requirements; a means, coupled to said assigning means and said parsing means, for scheduling said video segments for adaptive streaming to one or more users based on corresponding video quality levels; and a means, coupled to said scheduling means, for adaptively streaming said video segments with corresponding video quality levels to users for display.
- 108. The system of claim 107, wherein said means for automatically parsing digital video content further comprises:
a) a means, coupled to said receiving means, for automatically extracting a set of features from digital video content based on said predetermined set of domain-specific cues; b) a means, coupled to said extracting means, for recognizing one or more domain-specific segments based on said set of features for parsing digital video content; and c) a means, coupled to said recognizing means, for parsing digital video content into one or more fundamental semantic units corresponding to said one or more domain-specific segments.
- 109. The system of claim 108, wherein said one or more domain-specific segments are views.
- 110. The system of claim 108, wherein said one or more domain-specific segments are events.
- 111. The system of claim 108, wherein said set of features from said digital video content includes a set of visual features.
- 112. The system of claim 108, wherein said set of features from said digital video content includes a set of audio features.
- 113. The system of claim 108, wherein said extracting means further comprises a means for recognizing speech signals.
- 114. The system of claim 113, wherein said means for recognizing speech signals further converts said speech signals into recognized text data.
- 115. The system of claim 108, wherein said extracting means further comprises a means for decoding closed caption information from said digital video content.
- 116. The system of claim 108, wherein said extracting means further detects text images in said digital video content and converts said text images into text information.
- 117. The system of claim 116, wherein said extracting means computes a set of frame-to-frame motion measures and compares said set of frame-to-frame motion measures with a set of predetermined threshold values to determine one or more candidate text areas.
- 118. The system of claim 117, wherein said extracting means further removes noise from said one or more candidate text areas.
- 119. The system of claim 118, wherein said extracting means further applies domain-specific spatio-temporal constraints to remove detection errors from said one or more candidate areas.
- 120. The system of claim 119, wherein said extracting means further applies color-histogram filtering on said one or more candidate text areas to remove detection errors.
- 121. The system of claim 108, wherein said extracting means further synchronizes timing information between said set of features.
- 122. The system of claim 108, wherein said extracting means further comprises a detector of scene changes.
- 123. The system of claim 122, wherein said detector of scene changes further comprises an automatic flashlight detector.
- 124. The system of claim 123, wherein said automatic flashlight detector further comprises a comparator for comparing a ratio of a frame-to-frame color difference for each frame to a corresponding long-term color difference to detect flashlights.
- 125. The system of claim 122, wherein said detector of scene changes further comprises an automatic detector of direct scene changes.
- 126. The system of claim 122, wherein said detector of scene changes further comprises an automatic detector of gradual scene changes.
- 127. The system of claim 122, wherein said detector of scene changes further comprises a detector of camera aperture changes.
- 128. The system of claim 108, wherein said set of pre-determined domain-specific cues is determined based on user preferences.
- 129. The system of claim 108, wherein said means for recognizing one or more domain-specific segments further comprises a fast adaptive color filter for selecting possible domain-specific segments.
- 130. The system of claim 129, wherein said adaptive color filter uses one or more pre-trained filtering models built through a clustering-based training.
- 131. The system of claim 130, wherein said means for recognizing one or more domain-specific segments further comprises a segmentation-based verification module for verifying domain-specific segments based on a set of pre-determined domain-specific parameters.
- 132. The system of claim 131, wherein said segmentation-based verification module further comprises a salient-feature region extraction module.
- 133. The system of claim 132, wherein said segmentation-based verification module further comprises a moving object detection module.
- 134. The system of claim 133, wherein said segmentation-based verification module further comprises a module for similarity matching of visual and structure features.
- 135. The system of claim 134, wherein said means for recognizing one or more domain-specific segments further comprises an edge-based verification module for verifying domain-specific segments based on a set of pre-determined domain-specific parameters.
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application is based on U.S. Provisional Application Serial No. 60/218,969, filed Jul. 17, 2000, and U.S. Provisional Application Serial No. 60/260,637, filed Jan. 3, 2001, which are incorporated herein by reference for all purposes and from which priority is claimed.
PCT Information
| Filing Document | Filing Date | Country | Kind |
| --- | --- | --- | --- |
| PCT/US01/11485 | 4/9/2001 | WO | |