The disclosure relates generally to the detection of subtitles and in particular to the detection of subtitles embedded in video.
Among the visual, audio and textual information present in a video sequence, text provides condensed information for understanding the content of the video and thus plays an important role in browsing and retrieving video data. Subtitles are one form of text information that has drawn researchers' attention, and massive efforts have been made toward subtitle detection and recognition. However, it is not trivial to reliably detect and locate subtitles embedded in images. The font, size and even color of subtitles can vary between videos; subtitles can appear and disappear in arbitrary video frames; and the background video can be complex and changing, independent of the subtitle.
Methods for subtitle detection applied to a single frame of a video sequence can generally be categorized into connected-component-based analysis, texture-based analysis and edge-based analysis. Known methods based on connected-component analysis use the uniformity of text as a distinguishing feature. Such analysis is typically combined with color-based analysis, which is based on the assumption that text is presented with a uniform color. After color quantization, the connected components with homogeneous color, which further conform to certain size, shape and spatial alignment constraints, are extracted as text. In C. Garcia and X. Apostolidis, “Text detection and segmentation in complex color images,” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume 04, IEEE Computer Society, 2000, pp. 2326-2329, Garcia performs connected component analysis on the edge map derived by a recursive Deriche edge detector. In Soo-Chang Pei and Yu-Ting Chuang, “Automatic text detection using multi-layer color quantization in complex color images,” 2004 IEEE International Conference on Multimedia and Expo (ICME '04), 2004, pp. 619-622, Pei applies a neural network for color quantization and 3D histogram analysis for color candidate selection, and then performs connectivity analysis and morphological operations to produce text candidates. Connected-component-based methods are efficient, yet run into difficulties when the background is complex or the text is noisy, degraded, textured or touching other graphical objects.
Texture-based analysis treats the text region as texture, because text regions are known to possess a special texture due to the high contrast of text versus background and a periodic horizontal intensity variation. For example, in Y. Zhong, H. Zhang, and A. K. Jain, “Automatic Caption Localization in Compressed Video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, 2000, pp. 385-392, Zhong uses the coefficients of the Discrete Cosine Transform to capture the directionality and periodicity of local image blocks as texture properties and then detects the text region based on these properties. In Xueming Qian and Guizhong Liu, “Text Detection, Localization and Segmentation in Compressed Videos,” 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), 2006, p. II, Qian et al. also use the DCT coefficients to compute text boxes using vertical and horizontal block texture projections. Wu et al. [10] implement linear and nonlinear filtering, nonlinear transformation and k-means clustering to derive texture features for text detection. In K. I. Kim, K. Jung, and J. H. Kim, “Texture-Based Approach for Text Detection in Images Using Support Vector Machines and Continuously Adaptive Mean Shift Algorithm,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, 2003, pp. 1631-1639, Kim et al. adopt a support vector machine for texture classification. In general, texture-based methods perform better than connected-component-based methods in dealing with complex backgrounds. However, the texture-based methods run into difficulties when the background displays texture structures similar to those of the text region.
Edge-based analysis has also been widely used in text extraction, since characters are composed of line segments and have sharp edges. In Congjie Mi, Yuan Xu, Hong Lu, and Xiangyang Xue, “A Novel Video Text Extraction Approach Based on Multiple Frames,” Fifth International Conference on Information, Communications and Signal Processing, 2005, pp. 678-682, Mi et al. detect edges with an improved Canny edge detector combined with text line features. In H. Li, D. Doermann, and O. Kia, “Automatic text detection and tracking in digital video,” IEEE Transactions on Image Processing, vol. 9, 2000, pp. 147-156, Li et al. adopt wavelet decomposition to detect edges in high frequency sub-bands. In C. Liu, C. Wang, and R. Dai, “Text Detection in Images Based on Unsupervised Classification of Edge-based Features,” Proceedings of the Eighth International Conference on Document Analysis and Recognition, IEEE Computer Society, 2005, pp. 610-614, Liu et al. detect edges with the Sobel edge operator in four directions (horizontal, vertical, up-right and up-left). In G. Guo, J. Jin, X. Ping, and T. Zhang, “Automatic Video Text Localization and Recognition,” International Conference on Image and Graphics, Los Alamitos, Calif., USA: IEEE Computer Society, 2007, pp. 484-489, Guo et al. implement the Susan edge detector to derive a corner point response. These edge-based methods are effective in segmenting text after text bounding boxes are already located, but are less reliable when applied to the entire frame, since other objects in the scene may also possess strong edges.
Besides the above techniques applied to individual frames independently, there are techniques which take temporal information into account based on multiple frames of video. These techniques exploit the static location of subtitle/text to overcome the complexity of the background, meaning that the subtitle stays in the same position for many consecutive frames. In Xian-Sheng Hua, Pei Yin, and Hong-Jiang Zhang, “Efficient video text recognition using multiple frame integration,” Proceedings of the 2002 International Conference on Image Processing, 2002, pp. II-397-II-400, vol. 2, Hua et al. utilize multiple frame verification and high contrast frame averaging to extract clear text from very complex backgrounds. In Congjie Mi, Yuan Xu, Hong Lu, and Xiangyang Xue, “A Novel Video Text Extraction Approach Based on Multiple Frames,” Fifth International Conference on Information, Communications and Signal Processing, 2005, pp. 678-682, Mi et al. compare the similarity of both the text region and the edge map to determine the start and end frames containing the same object for multiple frame integration, and refine the text regions by reapplying text detection on a synthesized image produced by minimum/maximum pixel search over consecutive frames. In C. Wolf, J. Jolion, and F. Chassaing, “Text Localization, Enhancement and Binarization in Multimedia Documents,” International Conference on Pattern Recognition, Los Alamitos, Calif., USA: IEEE Computer Society, 2002, p. 21037, Wolf et al. develop an algorithm of bi-linear interpolation using multiple frames for text enhancement. In X. Tang, X. Gao, J. Liu, and H. Zhang, “A spatial-temporal approach for video caption detection and recognition,” IEEE Transactions on Neural Networks, vol. 13, 2002, pp. 961-971, Tang et al. choose a self-organizing neural network to segment the video sequence into camera shots and then compute the histogram and spatial difference to detect the shot boundaries for multiple frame integration. In Bo Luo, Xiaoou Tang, Jianzhuang Liu, and Hongjiang Zhang, “Video caption detection and extraction using temporal information,” Proceedings of the 2003 International Conference on Image Processing (ICIP 2003), 2003, pp. I-297-300, vol. 1, Luo et al. propose a method to segment video text using temporal feature vectors to detect the appearance and disappearance of subtitles.
Other works on subtitle/text detection, localization and extraction adopt analysis techniques similar to those mentioned above, but differ in their combination methodology and processing sequence.
As described in B. Penz, “Subtitle Text Detection and Localization in MPEG-2 Streams,” 2002, subtitles have the following characteristics: size, contrast, color, geometry, temporal uniformity, and overlay.
Based on the above subtitle characteristics, the multitude of features that subtitles possess, such as high contrast between text and background, font size limitations, the geometry and aspect ratio of the subtitle text area, uniform color and temporal uniformity, can be utilized for subtitle detection and localization. However, reliable detection may still prove difficult in practice. For example, despite the rule of thumb on size, fonts and sizes can vary considerably; some text pixels might share the color of the background, which lowers the contrast; and the background may be non-uniform, rapidly changing, and may contain features similar to those of subtitles. Therefore, it is desirable to provide a method that carefully considers the merit of using each feature and its effect on the performance of the detection result.
The current static region detection algorithm uses pixel-level information such as high-contrast transitions in the luminance channel and the stability of their spatial location over multiple video frames. This detection algorithm complies with the temporal uniformity of subtitles such that nearly all subtitle areas are detected. However, the simplicity of this algorithm causes the detection result to contain a significant number of false positives (non-subtitle regions being detected as subtitle). These excessive false positives are due to the reasons discussed below.
Despite the above shortcomings, the current algorithm is a good starting point for subtitle detection, since the majority of subtitle regions are detected. Therefore, it is desirable to provide a system and method for detecting subtitles in a way that achieves a true positive rate similar to the current static region detection algorithm (nearly all subtitle regions are detected), while obtaining a lower false positive rate (reject more non-subtitle areas).
Thus, it is desirable to provide a system and method for detecting and localizing subtitles from real-time accessing videos based on the detection of a static region and other subtitle-specific features and it is to this end that the disclosure is directed.
a) and (b) illustrate a static region detection input image and output, respectively;
a)-(c) illustrate horizontal luminance variation;
a)-(b) illustrate the definition of a transition pair;
a)-(c) illustrate pruning a static region map;
a)-(e) illustrate a first bi-projection operation and results;
a)-(d) illustrate an iterative bi-projection of the method;
a)-(c) illustrate fixed thresholding;
a)-(e) illustrate the computation of an adaptive threshold;
a)-(c) illustrate adaptive temporal filtering;
a)-(f) illustrate an example of adaptive temporal filtering;
a)-(b) illustrate horizontal transition pairs used for pruning a static region map;
a) and (b) illustrate examples of convolution kernels for measuring alignment of transition pairs;
a)-(c) illustrate a text stroke alignment feature;
a) and (b) illustrate two exemplary misclassified subtitle bounding boxes;
a) and (b) are a bounding box computation on moving zoneplates test pattern; and
The system and method are described and illustrated in the context of a subtitle detection system and method implemented in a device that detects subtitles in a digital television signal and it is in this context that the system and method are described. However, the subtitle detection system and method may be implemented in other types of display devices (including analog television devices), may be implemented on a stand-alone computer, or may have other implementations. In addition, while the implementation below is in a hardware device (a controller chip in a television analysis unit or a controller in a television receiver or television set top box) that executes code (firmware or the like) to implement the subtitle detection method described below, the system may also be implemented solely in software and then executed on a processing unit based device, or in a combination of hardware and software.
A subtitle detection system and method is illustrated in
Now, returning to the method of subtitle detection, a process for pruning a static region map is described.
Pruning the Static Region Map
The existing static region detector classifies strong edges (in the luminance channel) that remain at the same spatial location over a few frames. This simple detector finds nearly all subtitle areas and thus can be taken as a good starting point for subtitle detection. However, not only subtitles, but also many other objects in an image are stationary.
To make things worse, the static region detector might also capture pseudo-stationary regions such as moving black-and-white stripes. Hence, before making use of the static region map, we need to perform some form of pruning on this map, using subtitle-specific features other than stationary strong edges.
For pruning the static region map, we propose a method based on spatial density of horizontal transition pairs, computed in the luminance channel of the input image. Only areas on the static region map that coincide with a high density of the above-mentioned transition pairs are accepted as pruned static regions. This method of pruning is based on the observation that subtitles are in strong contrast to the background. Subtitles typically have bright colors and are surrounded by a dark border, in order to make them distinguishable in bright or dark backgrounds. The method only requires image information of a single frame.
In
The method defines a ‘Transition Pair’ as a pair of closely located falling-followed-by-rising edges, or a pair of rising-followed-by-falling edges.
The pruning is implemented by the following procedure (a sketch of the steps in code follows the list).
a. Apply a horizontal gradient filter on the input image to localize sharp rising and falling edges. Sharp is quantified as a gradient of luminance bigger than 20 over two horizontally adjacent pixels.
b. Find closely-located rising and falling edges and mark them as transition pairs (we use both rising-followed-by-falling and falling-followed-by-rising edge pairs). Denote a transition point for each transition pair as the middle of corresponding rising and falling edges.
c. Calculate the density of transition pairs with a block size of 12×12, by counting the number of transition pairs within the block.
d. Select the regions with a high density of horizontal transition pairs, and apply dilation to also admit surrounding regions that have a lower density of horizontal transition pairs. We denote the output of this step as the density map of horizontal transition pairs.
e. Combine the density map of horizontal transition pairs with the static region map to create the pruned static region map. Combining is implemented by a logical AND function.
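As referenced above, a minimal sketch of steps a-e is given below, assuming a grayscale luminance image and a binary static region map as inputs. The gradient threshold of 20 and the 12×12 block size follow the text; the maximum transition-pair width, the density threshold and the dilation radius are assumed values chosen for illustration only.

```python
import numpy as np

def prune_static_map(luma, static_map, grad_thresh=20, max_pair_width=20,
                     block=12, density_thresh=3, dilate=12):
    h, w = luma.shape
    # (a) horizontal gradient: locate sharp rising and falling edges
    grad = np.zeros((h, w), dtype=np.int32)
    grad[:, 1:] = luma[:, 1:].astype(np.int32) - luma[:, :-1].astype(np.int32)
    transition_points = np.zeros((h, w), dtype=np.uint8)
    # (b) mark transition points between closely located rising/falling edges
    for y in range(h):
        edges = [(x, 1 if grad[y, x] > 0 else -1)
                 for x in range(w) if abs(grad[y, x]) > grad_thresh]
        for (x0, s0), (x1, s1) in zip(edges, edges[1:]):
            if s0 != s1 and (x1 - x0) <= max_pair_width:
                transition_points[y, (x0 + x1) // 2] = 1
    # (c) density of transition pairs per 12x12 block
    density = np.zeros((h, w), dtype=np.float32)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            density[by:by+block, bx:bx+block] = \
                transition_points[by:by+block, bx:bx+block].sum()
    # (d) select high-density regions and dilate (simple box dilation)
    dense = density >= density_thresh
    pad = np.pad(dense, dilate)
    dilated = np.zeros_like(dense)
    for dy in range(-dilate, dilate + 1):
        for dx in range(-dilate, dilate + 1):
            dilated |= pad[dilate+dy:dilate+dy+h, dilate+dx:dilate+dx+w]
    # (e) combine with the static region map (logical AND)
    return dilated & (static_map > 0)
```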
Compared to the result of the static region detector, it can be seen that the adoption of the above method is effective in eliminating the static regions corresponding to the non-subtitle background. The pruned static region map can now undergo further processing for subtitle detection.
Bounding Box Computation
Before introducing adaptive temporal filtering, we first describe the details of bounding box computation. Adaptive temporal filtering, which is based on the result of bounding box computation on previous frames, is explained in the Adaptive Temporal Filtering section below.
A subtitle bounding box delimits the smallest rectangle that encompasses a subtitle area. Bounding box computation relies on properties such as the distance between characters and words, and the height and width of the subtitle. The method uses iterative bi-projections of a temporally filtered version of the pruned static region map. For brevity, the notion of temporal filtering is omitted in the remainder of this section, so where we say pruned static region map, we mean the temporally filtered, pruned static region map.
Below is an overview of the steps involved in bounding box computation, and then each step is explained in more detail.
Overview of Operations for Bounding Box Computation
Bounding box computation starts (
First Bi-Projection with Fixed Thresholding
The first step of bounding box computation is bi-projection of the pruned static region map. Bi-projection computes the sum of the pixel values of an image first in the horizontal direction and then in the vertical direction, resulting in two 1-dimensional arrays of values (histograms). The method computes the sum of pixels that are indicated as static in the pruned static region map.
a) shows the pruned static region map in gray, and the bounding boxes computed on the image in green. Let us consider the pruned static regions (the gray pixels) to have a value of 1, and the rest of the image to have a value of 0. The process of bi-projection is described in more detail below.
The first step is to perform a horizontal projection. For horizontal projection, we calculate two values for each line: the number of pruned static pixels (1s) and the compactness of the static pixels on the line (shown in
This is illustrated in Table 3-1, where series A and series B have the same number of pruned static pixels (with value 1) but series B is more densely clustered. The value of the compactness measure is generally higher for subtitle regions than for other regions, and can therefore be used for eliminating sparsely spaced static regions.
Employing both the histogram and the compactness measure, the method applies fixed thresholds to select regions with high values in both. These regions (R1 and R2 in
The second step is to perform a vertical projection on each of the above-mentioned selected regions (shown by the two histograms in
In
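A brief sketch of the first bi-projection with fixed thresholding is given below, assuming the pruned static region map is available as a binary array. The compactness measure used here (the fraction of static pixels within the span they occupy on a line) and the fixed threshold values are assumptions for illustration; the text only states that denser clustering of static pixels on a line yields a higher compactness value.

```python
import numpy as np

def line_compactness(row):
    idx = np.flatnonzero(row)
    if idx.size == 0:
        return 0.0
    span = idx[-1] - idx[0] + 1
    return idx.size / span          # 1.0 when all static pixels are contiguous

def first_biprojection(pruned, hist_thresh=10, compact_thresh=0.3):
    H, W = pruned.shape
    hist = pruned.sum(axis=1)                               # horizontal projection
    compact = np.array([line_compactness(pruned[y]) for y in range(H)])
    selected = (hist > hist_thresh) & (compact > compact_thresh)
    # group selected lines into vertical bands R1, R2, ...
    bands, y = [], 0
    while y < H:
        if selected[y]:
            y0 = y
            while y < H and selected[y]:
                y += 1
            bands.append((y0, y))                           # [v1, v2)
        else:
            y += 1
    boxes = []
    for v1, v2 in bands:
        col_hist = pruned[v1:v2].sum(axis=0)                # vertical projection
        cols = np.flatnonzero(col_hist > 0)
        if cols.size:
            # one box per band; splitting at horizontal gaps is left to the
            # second bi-projection iteration described below
            boxes.append((int(cols[0]), int(cols[-1]) + 1, v1, v2))
    return boxes                                            # (h1, h2, v1, v2)
```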
Iterative Bi-Projection
After the first iteration of bi-projection, the method applies a second iteration of bi-projection on the region bounded by each bounding box (as if each part were a small input image), in order to determine whether the content can be separated into individual bounding boxes with higher precision of spatial positions.
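A short sketch of this second iteration is shown below; it reuses the hypothetical first_biprojection helper from the previous sketch and simply re-applies it inside each first-pass bounding box, offsetting the resulting sub-boxes back to image coordinates.

```python
def iterative_biprojection(pruned, **kw):
    refined = []
    for h1, h2, v1, v2 in first_biprojection(pruned, **kw):
        sub = pruned[v1:v2, h1:h2]                 # treat the box as a small image
        inner = first_biprojection(sub, **kw)
        if inner:                                  # offset sub-boxes back to image coordinates
            refined += [(h1 + a, h1 + b, v1 + c, v1 + d) for a, b, c, d in inner]
        else:
            refined.append((h1, h2, v1, v2))
    return refined
```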
Adaptive Thresholding
In some cases, even iterative bi-projection cannot split two vertically stacked subtitle lines. In
It is a plausible assumption that the number of static pixels (in pruned static region map) between two subtitle lines is proportional to the horizontal width of the subtitle lines (the wider the subtitle lines, the more spurious pixels may exist in between the two subtitle lines). Based on this assumption, the method implements the following adaptive threshold to improve separation of subtitle lines.
The method defines a Gaussian convolution kernel with a width of approximately two subtitle-line heights as illustrated in
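A hedged sketch of this adaptive thresholding is given below: the horizontal projection histogram is smoothed with a Gaussian kernel roughly two subtitle-line heights wide, and the per-line threshold is taken as a fraction of the smoothed value, so that wider subtitle lines tolerate more spurious static pixels between them. The kernel width, the sigma and the fraction are assumed values for illustration.

```python
import numpy as np

def adaptive_threshold(hist, line_height=20, fraction=0.25):
    # Gaussian kernel roughly two subtitle-line heights wide (assumed sigma)
    width = 2 * line_height
    x = np.arange(width) - (width - 1) / 2.0
    sigma = width / 4.0
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    local = np.convolve(hist.astype(float), kernel, mode='same')
    # per-line threshold: line y is kept only if hist[y] > fraction * local[y]
    return fraction * local
```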
Categorizing Bounding Boxes
In this step, the method categorizes each bounding box as standard or non-standard, based on bounding box geometry (height) and the filling degree of the pruned static regions within the bounding box.
If the height of the bounding box exceeds a pre-defined threshold, the bounding box is considered to be non-standard (marked with pink in the images of this report). If the filling degree of the pruned static regions within the bounding box is under a pre-defined threshold, the bounding box is considered to be non-standard (marked with blue in the images of this report). In other cases, the bounding box is considered as standard (marked with green in the images of this report).
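A minimal sketch of the categorization step follows, assuming a bounding box labeled (h1, h2, v1, v2) and the pruned static region map; the height limit and the filling-degree limit are assumed values, since the text only states that pre-defined thresholds are used.

```python
def categorize_box(box, pruned, max_height=60, min_fill=0.2):
    h1, h2, v1, v2 = box
    if (v2 - v1) > max_height:
        return 'non-standard (height)'       # marked pink in the report images
    area = (h2 - h1) * (v2 - v1)
    fill = pruned[v1:v2, h1:h2].sum() / float(area) if area else 0.0
    if fill < min_fill:
        return 'non-standard (filling)'      # marked blue in the report images
    return 'standard'                        # marked green in the report images
```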
Adaptive Temporal Filtering
The property of temporal uniformity of subtitles was mentioned above, meaning that subtitles remain stable on the screen for a minimum number of frames, while the background may change during this period. Even though our static region detector uses temporal information to verify the stillness of the location of sharp luminance transitions, this temporal filtering is performed at the pixel level only. Since the method has access to bounding box information, we propose to perform a more advanced temporal filtering, which adapts its filtering strength based on the position of each pixel. The method enables strong temporal filtering on the pixels within a bounding box, while allowing a quick response (weak temporal filtering) for pixels outside bounding boxes. Such adaptive temporal filtering improves the spatial and temporal stability of the computed bounding boxes.
The method is particularly applicable to TV applications, which are implemented in a real-time streaming video processing environment, where accessing data of multiple frames is costly. For this reason, the method uses a strategy that requires access to the pruned static region map of the current frame (frame n) only. Of older frames, the method only needs to remember the bounding box list of frame n−1, and the temporally filtered version of the pruned static region map of frame n−1. This information is sufficient for computing the temporally filtered pruned static region map of the current frame (frame n), on which the bounding box computation is performed. Below we specify the required steps for the temporal filter.
a. Multiply the pruned static region map by a factor of 255, in order to increase the computation precision (all the pixel values of the temporally filtered pruned static region map range from 0 to 255, where 255 means a static pixel with 100% certainty).
b. Let the standard bounding box BB computed on the temporally filtered pruned static region map of frame n−1 be labeled with its horizontal and vertical boundary positions as BBn−1[h1,h2,v1,v2]. Further, let TPij(n−1) be the temporally filtered version of the pruned static region map of frame n−1 at location i and j, and let CPij(n) be the pruned static region map of frame n at location i and j. The method calculates the normalized absolute difference NAD between the temporally filtered version of the pruned static region map of frame n−1 and the pruned static region map of frame n as
c. Determine the adaptive temporal filter coefficient α as
d. Compute the temporally filtered pruned static region map of frame n, TPij(n), at location i and j, as:
e. Store the temporally filtered pruned static region map of frame n, TPij(n), to be used in the next frame.
f. Binarize the temporally filtered pruned static region map of frame n, TPij(n), for bounding box computation of frame n.
Step d of the above procedure indicates that if a pixel lies within or in the vicinity of (dh or dv pixels away from) a bounding box, a first order temporal filter with filter coefficient α is applied, where a high α value means a quick adaptation to the pruned static region map of the current frame, and a low α value means a slow adaptation to the static region map of the current frame. Furthermore, step d prescribes that if the pixel does not lie within or in the vicinity of any bounding box, no temporal filtering is applied (the pruned static region map of the current frame is used).
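The sketch below illustrates steps a-f under stated assumptions. The exact formulas for NAD and for the mapping from NAD to α are not reproduced in this text, so a per-pixel mean absolute difference and a simple clipped mapping are assumed here; the first-order filter itself follows the description in step d.

```python
import numpy as np

def temporal_filter(pruned_n, tp_prev, boxes_prev, dh=8, dv=8):
    # (a) scale the binary pruned static region map of frame n to 0..255
    cp = pruned_n.astype(np.float32) * 255.0
    tp = cp.copy()                     # default: no filtering outside boxes
    H, W = cp.shape
    for h1, h2, v1, v2 in boxes_prev:  # standard boxes BB_{n-1}[h1,h2,v1,v2]
        a, b = max(0, v1 - dv), min(H, v2 + dv)
        c, d = max(0, h1 - dh), min(W, h2 + dh)
        prev_roi, cur_roi = tp_prev[a:b, c:d], cp[a:b, c:d]
        if prev_roi.size == 0:
            continue
        # (b) normalized absolute difference between TP(n-1) and CP(n) (assumed form)
        nad = np.abs(prev_roi - cur_roi).mean() / 255.0
        # (c) adaptive coefficient: larger change -> faster adaptation (assumed mapping)
        alpha = np.clip(nad, 0.05, 0.5)
        # (d) first-order temporal filter inside / near the bounding box
        tp[a:b, c:d] = alpha * cur_roi + (1.0 - alpha) * prev_roi
    # (e) tp is stored for the next frame; (f) binarize for bounding box computation
    binary = tp > 127
    return tp, binary
```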
The adaptive temporal filter accounts for different situations that may occur in a video, using the schematic illustration
Region R4 in
a)-(c) are a schematic illustration of adaptive temporal filtering: (a) temporally filtered static region map of frame n−1, (b) pruned static region map of frame n, (c) resulting temporally filtered static region map of frame n.
In
After temporal filtering, the result of bounding box computation is significantly improved. On a single frame it leads to higher bounding box accuracy, and on multiple frames of video, it yields a more stable bounding box computation. We illustrate this by the following sample result.
In
Text Stroke Alignment Features
Despite using constraints such as geometry and filling degree for categorizing bounding boxes, as described above, non-subtitle areas in the pruned static region map that have size and filling-degree properties similar to those of subtitles will be categorized as standard bounding boxes. Therefore, the method needs other subtitle-specific features in order to reject non-subtitle bounding boxes. The method uses a pair of vertical and horizontal text stroke alignment features that help classify subtitle bounding boxes.
Observing subtitles, it can be noticed that text strokes have a few frequently occurring spatial orientations. In typically used Latin subtitle fonts, vertical strokes usually dominate the subtitle text. This spatial orientation is a feature that can be used to distinguish subtitles from other areas that contain high-contrast static edges.
In order to measure the above-mentioned spatial orientation, the method uses the previously developed transition pairs. Recall the horizontal transition pair above.
The method first defines the vertical text stroke alignment feature. To start with, the method computes horizontal transition pairs and creates a transition center map that has a value of 1 at spatial locations corresponding to the middle (transition point) of rising-followed-by-falling transition pairs, and a value of 0 elsewhere. Next, a 2D vertical kernel may be convolved with this transition center map in order to measure the vertical alignment of horizontal transition points. The vertical kernel is depicted in
The following procedure specifies the steps for computing the vertical stroke alignment feature.
a. Apply a horizontal gradient filter on the input image to obtain sharp rising and falling edges. Sharp is quantified as a gradient of intensity bigger than 20 over two horizontally adjacent pixels.
b. Find closely-located rising-followed-by-falling edge pairs and mark them as transition pairs. Denote a transition point for each transition pair as the middle of the corresponding rising and falling edges. Notice that, unlike the pruning procedure above, only rising-followed-by-falling edge pairs are treated.
c. Convolve the transition center map with the pre-defined vertical kernel. Denote VAij as the convolution result for pixel (i, j).
d. For a standard bounding box labeled as [h1,h2,v1,v2], calculate the vertical stroke alignment feature VF as
The above procedure can be used to compute text stroke alignment in any direction. For Latin fonts, we found the horizontal stroke alignment feature to also be useful. We compute the horizontal stroke alignment by replacing the horizontal gradient filter of step (a) above with a vertical gradient filter, and by replacing the vertical kernel in step (c) above with the horizontal kernel depicted in
Other orientations, e.g. 45° and 135°, may be useful depending on the language of the subtitles. In our case, we found the horizontal and vertical stroke alignment features to have a higher distinction power for our test videos.
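A sketch of the vertical stroke alignment feature (steps a-d) is given below. The shape of the vertical kernel and the normalization used for VF are assumptions, since the corresponding figure and formula are not reproduced in this text; the horizontal feature can be obtained analogously by transposing the image (vertical gradient filter, horizontal kernel).

```python
import numpy as np

def transition_center_map(luma, grad_thresh=20, max_pair_width=20):
    h, w = luma.shape
    grad = np.zeros((h, w), dtype=np.int32)
    grad[:, 1:] = luma[:, 1:].astype(np.int32) - luma[:, :-1].astype(np.int32)
    centers = np.zeros((h, w), dtype=np.float32)
    for y in range(h):
        edges = [(x, 1 if grad[y, x] > 0 else -1)
                 for x in range(w) if abs(grad[y, x]) > grad_thresh]
        for (x0, s0), (x1, s1) in zip(edges, edges[1:]):
            # only rising-followed-by-falling pairs, unlike the pruning step
            if s0 > 0 and s1 < 0 and (x1 - x0) <= max_pair_width:
                centers[y, (x0 + x1) // 2] = 1.0
    return centers

def vertical_stroke_feature(luma, box, kernel_height=9):
    centers = transition_center_map(luma)
    # vertical column kernel: rewards vertically aligned transition points
    kernel = np.ones(kernel_height, dtype=np.float32) / kernel_height
    va = np.apply_along_axis(lambda col: np.convolve(col, kernel, mode='same'),
                             0, centers)
    h1, h2, v1, v2 = box
    area = max(1, (h2 - h1) * (v2 - v1))
    return va[v1:v2, h1:h2].sum() / area     # assumed normalization for VF
```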
In
Bounding Box Classification
The text stroke alignment features described above are used for classifying the standard bounding boxes as either a subtitle bounding box or a non-subtitle bounding box. In this section, the classification procedure is described, which includes training and model parameter estimation using a training set, and measurement of the performance of the classifier using a test set.
For training and testing the classifier, we use two videos with similar subtitle fonts, with a total length of around 10700 frames. These sequences have been recorded from a TV broadcast. The method assigns roughly half of both sequences (around 5500 frames) as the training set, and uses the other half of both sequences (around 5200 frames) as the test set used for measuring the performance of the classifier.
Training the Classifier
As a first step, the distribution of the horizontal and vertical text stroke alignment features is investigated (referred to as features in the remainder of this section). This requires manual annotation of the detected standard bounding boxes as either subtitle or non-subtitle classes. To this end, we compute the standard bounding boxes for the training set, and then manually mark each bounding box as either a subtitle or a non-subtitle bounding box. For all bounding boxes, we calculate the two features.
Observing the compactness of distribution of the features of the subtitle class in
The probability density function for our 2-D Gaussian model is defined as

$$p(\mathbf{x}) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mu)^{T}\Sigma^{-1}(\mathbf{x}-\mu)\right)$$

where μ is the mean value, Σ is the covariance matrix and |·| denotes the determinant. The model parameters μ and Σ may be estimated from the data by

$$\mu = \frac{1}{N}\sum_{k=1}^{N} \mathbf{x}_k, \qquad \Sigma = \frac{1}{N}\sum_{k=1}^{N} (\mathbf{x}_k-\mu)(\mathbf{x}_k-\mu)^{T}$$

where $\mathbf{x}_k$ is the pair of horizontal and vertical stroke alignment features of the k-th bounding box in the training set and N is the number of bounding boxes in the training set.
For our training set, the model parameters are computed to be
Having computed the model parameters, the probability of each bounding box belonging to the subtitle class may be calculated based on its pair of horizontal and vertical alignment features, and the 2D Gaussian model. Since we are interested in a binary classification to either of the two classes, this probability needs to be thresholded. In the following, we explain how the threshold level is determined.
Using the result of the 2D continuous Gaussian model and the manual annotations, we first compute the Receiver Operating Characteristic curve (ROC curve). From the ROC curve in
We determine a suitable set of requirements as follows. The True Positive Rate must reach at least 99.9% and the False Positive Rate may not exceed 1%. That is to say, at least 99.9% of subtitle bounding boxes should be correctly classified as subtitle, and at most 1% of non-subtitle bounding boxes may be mistakenly classified as subtitle bounding boxes. The optimal threshold that fulfils the requirement is 0.00133307. This threshold level cannot be seen in
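A brief sketch of the resulting classifier is shown below: the model parameters are estimated from the training feature pairs, the 2D Gaussian density is evaluated for each bounding box's (horizontal, vertical) feature pair, and the result is compared against the threshold chosen from the ROC curve (0.00133307 above). The helper names are hypothetical; the training parameter values themselves are not reproduced here.

```python
import numpy as np

def fit_gaussian(features):                  # features: N x 2 array of (HF, VF) pairs
    mu = features.mean(axis=0)
    diff = features - mu
    sigma = diff.T @ diff / features.shape[0]
    return mu, sigma

def gaussian_pdf(x, mu, sigma):
    d = x - mu
    inv = np.linalg.inv(sigma)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(sigma))   # (2*pi)^(d/2)|Sigma|^(1/2) for d=2
    return np.exp(-0.5 * d @ inv @ d) / norm

def classify_box(feature_pair, mu, sigma, threshold=0.00133307):
    # True -> subtitle bounding box, False -> non-subtitle bounding box
    return gaussian_pdf(np.asarray(feature_pair, dtype=float), mu, sigma) >= threshold
```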
Testing the Classifier
We mentioned previously that we reserve half of the available frames as a test set. Applying the classifier to the test set, we obtain the detection result illustrated in Table 3-2, which indicates that 99.67% of the subtitle bounding boxes are correctly classified (retained), while 93.82% of the non-subtitle bounding boxes are rejected.
The above measurement shows the superior discrimination power of our features and classifier for differentiating between subtitle and non-subtitle bounding boxes.
It is essential that we do not reject bounding boxes that actually contain subtitles. Hence we further check the bounding boxes that were mistakenly classified as non-subtitle. As can be seen in
It needs to be mentioned that this classifier is sensitive to the font and size of subtitle characters, and hence it is important to use a training set that contains all desired subtitle fonts and sizes.
Summary of the Subtitle Detection Method and System
Here follows a summary of the system and method for subtitle detection.
The first step is to compute the static region map on the video. The Matlab implementation of the method operates on a (Pfspd) video file on which the static region map has already been computed. Next, the method computes horizontal transition pairs over the image, selects regions with a high density of transition pairs, and uses this to prune the static region map. Then the method performs adaptive temporal filtering on the pruned static region map to reduce variation in the pruned static region map; this efficiently integrates information from multiple frames. Thereafter the method roughly locates subtitles by generating standard bounding boxes using iterative horizontal and vertical projections. Finally, for each standard bounding box, the method computes a pair of horizontal and vertical text stroke alignment features, and uses these to carry out a binary classification of the standard bounding box into subtitle or non-subtitle classes, based on a trained 2D Gaussian model.
Result of the Method on Test Patterns
Synthetic test patterns are used for evaluating the quality and shortcomings of TV image processing functions. These test patterns usually contain sharp repetitive high frequency edges, which are detected by the static region detector as static regions. In moving test patterns, this misdetection could result in unacceptable artifacts after frame-rate up-conversion. Therefore, we present here the result of the proposed subtitle detection on the well-known Zoneplate test pattern, shown in
Application
The above described subtitle detection system and method may be used to remove unintended artifacts in the up-converted video. Foremost, the subtitle detection method could potentially be extended to provide a pixel-level segmentation of subtitle pixels. This level of information could be used to guide the motion estimator such that the motion vectors of the background are not disturbed by the overlaid subtitle. Alternatively, we could remove (or fill) the subtitle pixels accurately before frame-rate up-conversion and overlay the subtitle pixels back on the up-converted video. Previously performed experiments have shown that this procedure could significantly reduce the artifacts around subtitles in the frame-rate up-conversion. Due to the subjective visual importance of subtitles, this reduction in artifacts could greatly improve the quality of up-converted videos.
We also showed above that the method is effective in rejecting patterns that have high spatial frequency, such as a moving Zoneplate. This kind of pattern has a tendency to appear static to the static region detector, causing unacceptable artifacts after frame-rate up-conversion. The result above shows that the proposed subtitle detection method can effectively classify such pseudo-static regions as non-subtitle. This object-level information can be used for preventing artifacts in such cases.
While the foregoing has been with reference to particular embodiments of the disclosure, it will be appreciated by those skilled in the art that changes in this embodiment may be made without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims.
This application claims the benefit under 35 USC 119(e) and 120 of U.S. Provisional Patent Application Ser. No. 61/382,244 filed on Sep. 13, 2010 and entitled “Subtitle Detection to the Television Video,” the entirety of which is incorporated herein by reference.