The present invention relates to a method for classifying digital image data. More particularly, the present invention inter alia relates to the noise-robust detection of caption text overlays on non-uniform video scene background.
When generating and/or displaying single images or sequences of images, for instance video scenes or the like, for or on a display, it is sometimes desirable to add and incorporate additional information in the form of text into the displayed material. If such combined display material is received, it is often important to maintain or even improve the display quality of the added text information. Therefore, it is necessary to reliably detect those areas within digital image data representing an image or a sequence thereof which carry the text information in the display process.
It is an object of the present invention to provide a method for classifying digital image data which is capable of reliably indicating text elements in digital image data representing an image or a sequence of images.
The object underlying the present invention is achieved by a method for classifying digital image data according to the feature combination of independent claim 1. The object is further achieved by an apparatus, by a computer program product, as well as by a computer readable storage medium, according to independent claims 51, 52, and 53, respectively.
According to the present invention in its broadest sense a method for classifying digital image data is provided, wherein a post-processing is employed operating non-linearly and using artificial text overlay attribute constraints.
According to the present invention a method for classifying digital image data is provided, wherein a luminance component of the input image is processed by a filter bank with band-pass transfer characteristic that generates N separate filter responses, wherein each of said filter responses is binarized and post-processed non-linearly using typical attribute constraints of artificial text overlays, wherein said N post-processed filter results are recombined into a single binary image map, and wherein said single binary image map classifies each pixel of the original luminance image as being text or non-text.
According to the present invention a method for classifying digital image data is provided, comprising (a) a step of receiving (S1) digital image data or a part thereof as an input signal or as a part thereof, said digital image data or said part thereof being representative for an image or for a part or a sequence thereof, (b) a step of processing (S2) said digital image data in order to generate and provide image classification data, said image classification data at least one of indicating and describing at least one of the presence, the position and the further properties of text portions with respect to said image, said part of an image (I) or said sequence of images underlying said digital image data or a part thereof, and (c) a step of providing and/or applying (S3) said image classification data.
Said step (c) of processing (S2) said digital image data may comprise (c1) a sub-step of detecting and providing (S2-1) a luminance component of said digital image data, (c2) a sub-step of processing (S2-2) said luminance component by a filter bank operation, said filter bank operation having a band-pass transfer characteristic and said filter bank operation generating a plurality of N separate filter response signal components, N being an integer, (c3) a sub-step of binarizing (S2-3) said N filter response signal components, thereby generating respective binarized filter response signal components, (c4) a sub-step of applying (S2-4) to each of said binarized filter response signal components a respective post-processing operation, thereby generating respective binary band signals as post-processed binarized filter response signal components, said respective post-processing operation in each case operating non-linearly and said respective post-processing operation in each case using text overlay attribute constraints, and (c5) a sub-step of recombining (S2-5) said N post-processed binary filter response signal components in order to form a single binary image map as a part or as a preform of said image classification data, said single binary image map classifying each pixel of said digital image data or of said luminance component thereof as being text or as being non-text.
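The chain of sub-steps (c2) to (c5) may, purely by way of illustration, be sketched as follows. This is a hedged sketch, not the claimed implementation: the filter kernels, the thresholds, and the post-processing placeholder are hypothetical choices supplied by the caller.

```python
import numpy as np

def classify_text_pixels(luma, kernels, thresholds, post_process):
    """Sketch of sub-steps (c2)-(c5): horizontal band-pass filter bank,
    binarization, non-linear post-processing, and recombination.
    `kernels`, `thresholds`, and `post_process` are assumptions."""
    band_maps = []
    for kernel, thr in zip(kernels, thresholds):
        # (c2) horizontal band-pass filtering of the luminance component
        response = np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, luma)
        # (c3) binarize the filter response against a band-specific threshold
        binary = (np.abs(response) >= thr).astype(np.uint8)
        # (c4) non-linear post-processing using text overlay attribute constraints
        band_maps.append(post_process(binary))
    # (c5) recombine the N post-processed band maps into one binary map
    combined = band_maps[0]
    for m in band_maps[1:]:
        combined = np.logical_or(combined, m).astype(np.uint8)
    return combined  # 1 = text, 0 = non-text
```

The recombination is shown here as a bit-wise OR; the description below also mentions a look-up-table based combination, of which the OR is one special case.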
The invention will now be explained based on preferred embodiments thereof and with reference to the accompanying schematic figures.
In the following, functionally and structurally similar or equivalent elements will be denoted with the same reference symbols. A detailed description will not be repeated at every occurrence.
According to the present invention a method for classifying digital image data is provided, wherein a post-processing is employed operating non-linearly and using artificial text overlay attribute constraints.
According to the present invention a method for classifying digital image data is provided, wherein a luminance component of the input image is processed by a filter bank with band-pass transfer characteristic that generates N separate filter responses, wherein each of said filter responses is binarized and post-processed non-linearly using typical attribute constraints of artificial text overlays, wherein said N post-processed filter results are recombined into a single binary image map, and wherein said single binary image map classifies each pixel of the original luminance image as being text or non-text.
According to the present invention a method for classifying digital image data is provided, comprising (a) a step of receiving (S1) digital image data ID or a part thereof as an input signal IS or as a part thereof, said digital image data ID or said part thereof being representative for an image I or for a part or a sequence thereof, (b) a step of processing (S2) said digital image data ID in order to generate and provide image classification data ICD, said image classification data ICD at least one of indicating and describing at least one of the presence, the position and the further properties of text portions with respect to said image I, said part of an image I or said sequence of images I underlying said digital image data ID or a part thereof, and (c) a step of providing and/or applying (S3) said image classification data ICD.
Said step (c) of processing (S2) said digital image data ID may comprise (c1) a sub-step of detecting and providing (S2-1) a luminance component SI of said digital image data ID, (c2) a sub-step of processing (S2-2) said luminance component SI by a filter bank operation FB, said filter bank operation FB having a band-pass transfer characteristic and said filter bank operation FB generating a plurality of N separate filter response signal components FRSj; j=1, . . . , N, N being an integer, (c3) a sub-step of binarizing (S2-3) said N filter response signal components FRSj; j=1, . . . , N, thereby generating respective binarized filter response signal components SFj; j=1, . . . , N, (c4) a sub-step of applying (S2-4) to each of said binarized filter response signal components SFj; j=1, . . . , N a respective post-processing operation PPj; j=1, . . . , N, thereby generating respective binary band signals as post-processed binarized filter response signal components SPj; j=1, . . . , N, said respective post-processing operation PPj; j=1, . . . , N in each case operating non-linearly and said respective post-processing operation PPj; j=1, . . . , N in each case using text overlay attribute constraints TOAC, and (c5) a sub-step of recombining (S2-5) said N post-processed binary filter response signal components SPj; j=1, . . . , N in order to form a single binary image map SC as a part or as a preform of said image classification data ICD, said single binary image map SC classifying each pixel of said digital image data ID or of said luminance component SI thereof as being text or as being non-text.
The inventive method may be adapted and designed in order to reliably detect pixels and/or areas of said image (I) or a part thereof underlying said digital image data (ID) or a part thereof.
Said text overlay attribute constraints TOAC may be representative for one or an arbitrary combination of attributes of the group consisting of
Said filter bank FB may be adapted in order to operate one-dimensionally in the horizontal spatial direction.
Said filter bank operation FB may comprise one or a plurality of processes of the group consisting of short window discrete Fourier transform operations, short window discrete cosine transform operations, Goertzel algorithm based operations, FIR operations and IIR operations, in particular in order to obtain a band-limited, horizontally directed and/or multi-band representation of the luminance signal component SI.
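A band-limited, horizontally directed multi-band representation with FIR filters may, for instance, be sketched as follows; the choice of cosine-modulated Hann windows, the number of bands, and the kernel length are hypothetical, since the description only requires a band-pass transfer characteristic.

```python
import numpy as np

def make_bandpass_kernels(n_bands=4, length=9, fs=1.0):
    """Build N horizontal band-pass FIR kernels as cosine-modulated
    Hann windows (one possible choice among short-window DFT/DCT,
    Goertzel, FIR, and IIR realizations mentioned above)."""
    kernels = []
    window = np.hanning(length)
    t = np.arange(length) - (length - 1) / 2
    for j in range(1, n_bands + 1):
        f = j * fs / (2 * (n_bands + 1))        # band center frequency
        k = window * np.cos(2 * np.pi * f * t)  # modulated window
        k -= k.mean()                           # remove DC gain -> band-pass
        kernels.append(k)
    return kernels
```

Removing the mean forces a zero response at DC, so a uniform background produces no filter output, which is the property exploited for separating text from flat background.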
Said post-processing operations PPj; j=1, . . . , N may be adapted in order to determine a short window signal energy, in particular in a small horizontal window first, and then in particular to binarize the respective signals using a band-specific threshold.
Said single binary image map SC as said part or preform of said image classification data ICD may be obtained in said sub-step c5 of recombining S2-5.
Said N post-processed binary filter response signals SPj; j=1, . . . , N may be subjected to a combined binary cleaning operation BCLC in order to generate a final binary map ST or a final binary map signal ST as a control signal.
For each of said post-processing operations PPj; j=1, . . . , N in a first step a respective signal energy or energy value may be determined, in particular for a respective short window of a respective horizontal length Sw, in particular by an EC operation, thereby generating respective signal energy values SEj; j=1, . . . , N.
For each of said post-processing operations PPj; j=1, . . . , N a respective resulting energy signal SEj; j=1, . . . , N may be formed with a resolution which is reduced horizontally by a factor which is given by the horizontal length Sw of the respective short window.
For each of said post-processing operations PPj; j=1, . . . , N in a following step a respective signal energy value or level SEj; j=1, . . . , N may be compared to a respective threshold value TCj; j=1, . . . , N, in particular by a respective binarization operation BIN in particular in order to derive a respective binary map signal SBj; j=1, . . . , N.
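The short-window energy computation and the subsequent binarization BIN may be sketched as follows; the window length Sw and the threshold TCj are illustrative assumptions.

```python
import numpy as np

def binarize_band_energy(response, sw=8, threshold=1.0):
    """First stage of PPj: compute the signal energy SEj in short
    horizontal windows of length Sw (reducing the horizontal resolution
    by the factor Sw), then binarize against a band-specific threshold
    TCj to obtain the binary band map SBj."""
    h, w = response.shape
    w_trim = (w // sw) * sw                        # drop an incomplete window
    blocks = response[:, :w_trim].reshape(h, w_trim // sw, sw)
    energy = (blocks ** 2).sum(axis=2)             # SEj: short-window energy
    return (energy >= threshold).astype(np.uint8)  # SBj: binary band map
```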
For each of said post-processing operations PPj; j=1, . . . , N the respective threshold values TCj; j=1, . . . , N may be adaptively changed with respect to a measured noise level NL, in particular in order to mitigate effects of additive noise which in particular might be contained in the input signal IS, SI.
The adaptive change of the respective threshold values TCj; j=1, . . . , N may be achieved by a respective threshold adaptation operation TA, which in particular combines respective fixed but band specific threshold levels THj; j=1, . . . , N, in particular with a respective variable offset, which is in particular controlled by the measured noise level NL.
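The threshold adaptation operation TA may, for illustration, be sketched as a linear combination; the per-band gain factors are hypothetical, since the description only requires that the variable offset depends on the filter bank type and the expected noise statistics.

```python
def adapt_thresholds(base_thresholds, noise_level, noise_gains):
    """Threshold adaptation TA: combine the fixed, band-specific levels
    THj with a variable offset controlled by the measured noise level NL.
    A linear offset per band (gain * NL) is one hypothetical choice."""
    return [th + g * noise_level
            for th, g in zip(base_thresholds, noise_gains)]
```

With a noise level of zero, the adapted thresholds TCj reduce to the fixed band-specific levels THj.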
For each of said post-processing operations PPj; j=1, . . . , N the respective variable offset may be determined depending on the respective type of the used filter bank or filter bank operation FB and/or on the statistics of the expected noise signal.
The respective filter bank and the respective filter bank operations FB may be implemented by linear and time-invariant FIR filters.
The respective noise model may be additive white Gaussian noise. For each of said post-processing operations PPj; j=1, . . . , N after the respective binarization process a respective initial line profile may be generated as a respective horizontal projection from the respective binary band map signal SBj; j=1, . . . , N, in particular by a respective line profile generation operation LPG.
For each of said post-processing operations PPj; j=1, . . . , N the respective line profile may be defined as a respective binary vector with H elements for a picture height of H scan lines, in particular realizing 1 bit per scan line, H being an integer.
For each of said post-processing operations PPj; j=1, . . . , N a respective line profile element may be set to a value of “1”, if there is a substantial indication for a text area from the respective binary map SBj; j=1, . . . , N.
For each of said post-processing operations PPj; j=1, . . . , N a respective line profile element may be set to a value of “0”, if there is no substantial indication for a text area from the respective binary map SBj; j=1, . . . , N.
For said line profile generation operation LPG in a first step an image area may be partitioned into M slices, in particular by a respective partitioning operation VSPk; k=1, . . . , M, M being an integer.
For each of said line profile generation processes LPG in a following step a respective slice profile may be generated in particular by summing up all of the plurality of Hw horizontal bits in a respective slice of a respective binary map, in particular by the respective binarization operation VSBk; k=1, . . . , M.
For each of said line profile generation operations LPG a respective sum may be compared against a fixed threshold value VTH, and a binary output value having a value of “1” may be generated, if the respective sum is larger than or equal to the respective threshold value VTH.
For each of said line profile generation operations LPG the respective output bit may be generated having a value of “0”, if the respective sum is smaller than the respective threshold value VTH.
A respective overall line profile SPLj; j=1, . . . , N may be created by a respective profile combination operation PC, in particular from all slice profiles.
The respective slice profiles may be combined by means of a bit-wise OR operation.
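The slice-based line profile generation LPG described above may be sketched as follows; the number of slices M and the threshold VTH are assumptions.

```python
import numpy as np

def line_profile(binary_map, n_slices=4, vth=2):
    """Line profile generation LPG: partition the binary band map into
    M vertical slices (VSPk), sum the bits of each scan line within a
    slice, threshold the sum against VTH (VSBk), and combine the M slice
    profiles by a bit-wise OR (PC) into one H-element binary line
    profile SPLj, i.e. 1 bit per scan line."""
    h, w = binary_map.shape
    slices = np.array_split(binary_map, n_slices, axis=1)
    profile = np.zeros(h, dtype=np.uint8)
    for s in slices:
        slice_profile = (s.sum(axis=1) >= vth).astype(np.uint8)
        profile |= slice_profile
    return profile
```

The per-slice thresholding makes the profile respond to locally dense text evidence, so that a short caption confined to one slice still marks its scan lines even though the line-wide bit count might be low.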
The respective initial line profile SPLj; j=1, . . . , N may be used as an auxiliary input value for a respective binary cleaning operation BCLj; j=1, . . . , N.
The respective initial binary line profile SPLj; j=1, . . . , N may be processed by a respective line run length cleaning operation RLC, in particular in order to produce a respective cleaned profile SPCj; j=1, . . . , N.
For each of said binary cleaning operations BCLj; j=1, . . . , N first all sequences of up to NVC,N elements having the value “0” which are enclosed by elements having the value “1” may be replaced by the value “1”.
For each of said binary cleaning operations BCLj; j=1, . . . , N in a further step all sequences of pluralities of up to NVC,N elements having the value “1” which are enclosed by elements having the value “0” may be replaced by the value “0”.
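The two-pass run-length cleaning used in the RLC and BCLj operations may be sketched as follows; the run-length limits are illustrative stand-ins for NVC,N.

```python
def clean_runs(profile, max_gap=2, max_island=2):
    """Run-length cleaning: first close gaps (runs of up to max_gap
    zeros enclosed by ones), then remove islands (runs of up to
    max_island ones enclosed by zeros)."""
    def fill(bits, value, max_run):
        out = list(bits)
        n = len(out)
        i = 0
        while i < n:
            if out[i] != value:
                i += 1
                continue
            j = i
            while j < n and out[j] == value:
                j += 1
            # a maximal run is enclosed if it touches neither border
            enclosed = i > 0 and j < n
            if enclosed and (j - i) <= max_run:
                for k in range(i, j):
                    out[k] = 1 - value
            i = j
        return out

    cleaned = fill(profile, 0, max_gap)   # close 0-gaps with 1s
    return fill(cleaned, 1, max_island)   # remove short 1-islands
```

Closing gaps first merges the scan lines of a single caption block; removing islands afterwards discards isolated false detections that are too short to be a text line.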
Each of said binary band map signals SBj; j=1, . . . , N may be processed by a respective column profile generation operation CPG, in particular in order to produce a respective binary band map SBMj; j=1, . . . , N.
A respective cleaned profile SPCj; j=1, . . . , N may be adapted to control which lines in the respective binary map SBj; j=1, . . . , N are used for processing.
All elements of a corresponding scan line in a respective binary band map signal SBMj; j=1, . . . , N may be set to be zero, if a profile element has the value “0”.
A corresponding element in a respective output line profile SPPj; j=1, . . . , N may be set to have a value of “0”, in particular via the respective profile update signal SPUj; j=1, . . . , N and a respective profile update operation PU, if a processing of remaining lines of a respective binary map SBj; j=1, . . . , N results in a line having elements with values which are all set to “0” in the respective binary band map SBMj; j=1, . . . , N.
The processing may be designed such that the respective binary map and the line profile are always kept in synchrony.
The respective column profile generation operation CPG may be adapted in order to loop over all sections marked in the respective binary map and the line profile SPCj; j=1, . . . , N as potential text blocks to be evaluated.
For each “0” to “1” transition in the respective line profile SPCj; j=1, . . . , N an iteration may be started and a respective column profile is initialized with the respective contents of the corresponding line in the binary map, wherein in particular the respective scan line number is recorded as a value n1.
All following scan lines of the respective binary map may be added to the respective column profile, in particular up to and including a last line before a respective “1” to “0” transition in the line profile, wherein the respective line number is recorded as a value n2.
The respective elements of a respective column profile may be compared against a threshold value HTH in order to obtain the binary column profile.
The column profile may be cleaned up by replacing sequences of pluralities of up to NHC,N elements having a value “0” which are enclosed by elements having a value “1” with a value “1”, in particular in a similar manner as with respect to the RLC operation for the line profile.
In a following step all sequences of pluralities of up to NHC,N elements having a value of “1” which are enclosed by elements having a value of “0” may be replaced by values of “0”.
All lines in a range of n1 to n2 within the respective binary output map SBMj; j=1, . . . , N may be replaced by a cleaned binary column profile.
A respective line profile SPPj; j=1, . . . , N may be updated and set to a value “0” for all elements from n1 to n2, if the respective column profile contains only values of “0” after the respective binarization step has been performed.
The respective column profile generation operation CPG may be repeated iteratively with a next iteration step until an end of the respective image at a respective scan line H.
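The iteration of the column profile generation operation CPG over the blocks marked in the line profile may be sketched as follows; the threshold HTH is an assumption, and the column run-length cleaning step described above is omitted here for brevity.

```python
import numpy as np

def column_profile_pass(binary_map, line_profile, hth=1):
    """Column profile generation CPG: for each block delimited by a
    0->1 and the following 1->0 transition in the line profile (scan
    lines n1..n2), accumulate a column profile over those lines,
    binarize it against HTH, and replace all lines of the block by the
    binary column profile.  If the column profile is all zero, the
    line profile is updated accordingly (profile update PU)."""
    out_map = binary_map.copy()
    out_profile = line_profile.copy()
    h = len(line_profile)
    n1 = None
    for row in range(h + 1):
        active = row < h and line_profile[row] == 1
        if active and n1 is None:
            n1 = row                    # 0 -> 1 transition: block start
        elif not active and n1 is not None:
            n2 = row - 1                # last line before 1 -> 0 transition
            col = binary_map[n1:n2 + 1].sum(axis=0)
            col_bin = (col >= hth).astype(binary_map.dtype)
            out_map[n1:n2 + 1] = col_bin
            if col_bin.sum() == 0:      # block vanished: update profile
                out_profile[n1:n2 + 1] = 0
            n1 = None
    return out_map, out_profile
```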
Respective resulting binary band maps SBMj; j=1, . . . , N may be combined by a respective band combination operation BBC, in particular in order to produce a single binary map SCM.
Said binary line profiles SPPj; j=1, . . . , N may be combined, in particular in order to produce a single binary line profile SCP.
The respective single binary map SCM and the respective single binary line profile SCP may be used together as said single binary map SC.
The respective combination operation may be realized via a look-up table, which in particular performs a mapping from a N bit value to a binary value, further in particular by combining and using the binary values of band maps or line profiles from a same spatial position or image coordinate as a table index, in particular in order to find the respective binary replacement values.
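The look-up-table based combination may be sketched as follows; the majority-vote table is only one hypothetical mapping from the N-bit index to a binary value.

```python
import numpy as np

def combine_bands(band_maps, lut):
    """Band combination BBC via look-up table: the binary values of the
    N band maps at the same image coordinate are packed into an N-bit
    table index; the table entry is the combined binary value (SCM)."""
    index = np.zeros(band_maps[0].shape, dtype=np.int64)
    for bit, m in enumerate(band_maps):
        index |= (m.astype(np.int64) << bit)  # pack N bits per pixel
    return lut[index]                         # table look-up

def majority_lut(n_bands):
    """Hypothetical example table: a pixel is classified as text if at
    least half of the bands vote text."""
    size = 1 << n_bands
    lut = np.zeros(size, dtype=np.uint8)
    for i in range(size):
        lut[i] = 1 if bin(i).count("1") * 2 >= n_bands else 0
    return lut
```

A bit-wise OR or AND of the band maps corresponds to the special tables that map every non-zero index, or only the all-ones index, to “1”.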
The final cleaning operation BCLC of the combined signal SC as a combination of SCM and of SCP may be performed, which is in particular structurally identical to the cleaning operation BCLj; j=1, . . . , N for the respective band signals, in particular except for the output or the cleaned line profile.
According to a further aspect of the present invention a system and/or an apparatus for classifying digital image data are provided, which are adapted and comprise means for realizing a method for classifying digital image data according to the present invention.
According to a further aspect of the present invention a computer program product is provided, comprising computer program means which is adapted in order to perform a method for classifying digital image data according to the present invention and the steps thereof when it is executed on a computer or a digital signal processing means.
According to a further aspect of the present invention a computer readable storage medium is provided, comprising a computer program product according to the present invention.
These and further aspects of the present invention will be further discussed in the following:
The present invention inter alia also relates to the noise robust detection of caption text overlays in or on non-uniform video scene background.
Problems arise in the field of the detection of image areas with artificial text overlay in video sequences. The detection should be robust in the presence of additive noise. The detection should be invariant to interlaced or progressive mode of the video sequence.
The present invention inter alia presents a solution for such problems. The luminance component of the input image is processed by a filter bank with band-pass transfer characteristic that generates N separate filter responses. Each of these filter responses is binarized and post-processed non-linearly using typical attribute constraints of artificial text overlays. The N post-processed filter results are then recombined into a single binary image map that classifies each pixel of the original luminance image as being text or non-text.
In [1], a method for extraction and recognition of video captions in television news broadcast is described. The overall system identifies text regions in groups of subsequent luminance video frames, segments individual characters in these text regions, and uses a conventional pattern matching technique to recognize the characters. The text region detection part uses a 3×3 horizontal differential filter to generate vertical edge features, followed by a smoothing and spatial clustering technique to identify the bounding region of text candidates. The candidate regions are interpolated to sub-pixel resolution and integrated over multiple frames to help improve the separation of non-moving text from moving scene background.
The method described in [2] first segments a luminance image into non-overlapping homogeneous regions using a technique called generalized region labelling (GRL), which is based on contour tracking with chain codes. The homogeneous regions are then filtered initially by spatial size properties to remove non-text regions. The regions are then refined and binarized using a local threshold operation. The refinement is followed by another verification step that removes regions of small size or with low contrast to their bounding background. The remaining individual character regions are then tested for consistency, i.e. alignment along a straight line, inter-character spacing, etc. In a final step, text regions are verified by analysis over five consecutive frames in the video sequence.
The text extraction described in [3] first computes a 2-D colour intensity gradient image from RGB colour frames at multiple scales. Fixed rectangular regions of 20×10 pixels in all scales of the gradient images are used as input features into an artificial neural network for classification into text regions and non-text background regions. The network responses from different scales are integrated into a single saliency map from which initial text region boxes are extracted using a shape-restricted region growing method. The initial text region boxes are then refined by evaluation of local horizontal and vertical projection profiles. The text region boxes are then tracked over multiple frames to reduce the detection of false positives.
In [4], a method for detection of still and moving text in video sequences is presented. The detector is intended for the identification of text which is sensitive to video processing. The primary features are luminance edges (i.e. derivatives) in the horizontal direction, which are correlated over three adjacent scan lines in an interlaced video frame. The density of edges per line is then used to decide during post-processing whether a line contains text or not.
In [5], a method for text extraction from video sequences for a video retrieval system is described. The detection part uses a spatial, local accumulation of horizontal gradients derived by the Sobel operator on the luminance component as basic text feature. The accumulated gradient image is binarized using a modification of Otsu's method to determine an optimal threshold from the grey value histogram of the input image. The binary image is then processed by a number of morphological operations, and the resulting text candidate regions are selected by geometrical constraints of typical horizontal text properties. The quality of localized text region is finally improved by multi frame integration.
The method described in [6] uses the coefficients of DCT compressed video sequences for detection of image areas containing text. Specifically, the coefficients representing horizontal high frequency luminance variation are utilized to initially classify each 8×8 pixel image block of a MPEG stream into text or non-text area. The 8×8 pixel block units are morphologically processed and spatially clustered by a connected component analysis to form the text region candidates. In a refinement step, only candidate regions are retained, which enclose at least one row of DCT coefficients representing vertical high luminance variation.
The method proposed in [7] employs a multi-scale coarse detection step to localize candidate text areas, followed by a fine detection step that collects local image properties into a high dimensional feature vector which is then classified into text or non-text region by a support vector machine. The coarse detection step is based on a discrete wavelet decomposition with Daubechies-4 wavelet function and scale decimation, where a local wavelet energy is derived from the bandpass wavelet coefficients for each decomposition level individually. The candidate regions are formed by a region growing process that attempts to fit a rectangular area in six different directions. In the fine detection step, features like moment, histogram, co-occurrence and crossing counts are extracted from the candidate regions in the wavelet domain for the subsequent classification.
In the approach presented in [8], a local energy variation measure is defined for the horizontal and vertical bandpass coefficients of a decimating Haar wavelet decomposition. For each scale level, the local energy variation is thresholded, and a connected component analysis is performed, followed by geometric filtering of the resulting boundary boxes. In a final step, the results of the individual scale levels are recombined in a multi-scale fusion step.
In a broader scope extending to texture segmentation, in [9] a design method is described for an optimal single Gabor filter to segment a two-texture image. The magnitude of the Gabor filter output is followed by a Gaussian post-filter, the output of which is thresholded to achieve the segmentation result. The design method relies on an equivalence assumption that models the texture signal at the input of the Gabor filter as a superposition of a dominant frequency component within the filter passband and an additive bandpass noise component that captures all remaining signal components of the texture.
The work in [10] analyzes the suitability of the wavelet transform with critical sampling for the purpose of deriving texture description features from the decomposition coefficients. The effect of shift-variance is exemplified for a range of popular wavelet basis functions, and a ranking scheme is proposed to select the optimal basis function for the purpose of texture classification.
This report addresses the problem of detecting image areas with artificial text overlay in video sequences. The objective of such a detector is to segment the image into regions that have been superimposed with a video character generator and the residual part of the image that contains the main scene content without text. The intended target application of the text detector is a picture improvement system that applies different types of processing operations to the text and the non-text regions to achieve an overall enhanced portrayal of both text and non-text image areas.
Text overlays can originate from several steps in the production and transport chain. Specifically, open captions can be inserted during movie or video post-production, by the broadcaster, by transformation or transcoding during video transport, or by a multimedia playback device such as a DVD player. The insertion point in the end-to-end chain between production and display influences the amount of quality impairment of the text representation. Obviously, there is no impairment to be expected if the display device superimposes the text at the end of the chain without further processing, as with traditional closed captions or OSD. However, the earlier in the transport chain text is superimposed onto the video scene, the more vulnerable it is to image quality degradation, especially if transport includes a lossy compression scheme such as MPEG. In general, the degradation of the text area will be more apparent to the viewer, since the usual codec and/or other video processing during transport, as well as potential picture improvement processing at the display end, is designed with a focus on the best representation of natural scene content rather than artificial signals like text. A text region detector would therefore be helpful in order to switch to a different type of processing for text than for non-text areas. Conversely, it is also beneficial for the processing of the natural scene if the text area is properly excluded. This affects especially operations that select their parameters from global image statistics, like e.g. a colour or luminance histogram based transformation.
First of all digital image data ID which are representative for and therefore a function of an image I are provided as an input signal IS. This is realized in the embodiment shown in
Said image classification data ICD are then forwarded to a third or application section 30 where the respective image classification data ICD are in some sense further processed for instance applied to other processes or provided as output data.
In a following second step S2 said digital image data ID are processed to thereby generate image classification data ICD.
In a following third step S3 said image classification data ICD are provided and/or applied in some sense.
In each case said image classification data are generated so as to indicate and/or describe the presence and/or further properties of text portions and of text contained in the underlying image I or in a sequence of images I.
In the following, details of the distinct processing steps are explained in more detail by means of FIGS. 1 to 11.
The list of representative video processing operations includes artefact reduction in general, analogue noise reduction, digital noise reduction (block noise, mosquito noise), sharpness enhancement, colour transformation, histogram transformation, interlaced to progressive conversion, frame rate conversion, pre-processing before compression, and post-processing after decompression, but is not limited thereto.
For the application scenario outlined above, it is important that the text detection performance is independent from progressive or interlaced video mode, esp. if the video processing operation VPO itself includes an interlaced to progressive conversion step.
In case of reception from analogue broadcast or playback from an analogue VCR device, the input signal SI is susceptible to noise. It is therefore desirable that the text detector is robust against additive noise, esp. if the video processing operation VPO includes a noise reduction step.
In a slightly different application scenario, the text detection result ST does not control directly the video processing but rather supports other video analysis modules, like e.g. realizing a ticker detection for motion estimation.
Most of the existing literature on methods for text detection is focussed on the application of video summarization and meta content extraction for digital video libraries [1-3, 5, 6]. These methods assume noise-free, progressive video and thus require additional noise reduction and/or interlaced to progressive conversion beforehand for such video material. Furthermore, these methods exploit the property of steady captions to appear in a number of consecutive frames for temporal sub-sampling and/or multi-frame integration. As a consequence, the regions detected by these methods expose a temporal inaccuracy, which makes them disadvantageous for the purpose of picture improvement. There is only little prior art [4] that addresses text detection for the application of video enhancement.
The appearance of text in video can be categorized by two distinct origins. The first origin is in-scene text, which is usually found on in-scene objects. This kind of text has an unlimited variety of appearance and is usually not prepared for good video reproduction. However, a special treatment of this type of text for video enhancement is less compelling. In contrast, the second origin is artificial text, which is characterized by being intentionally superimposed onto the video background to carry additional information complementing the visual information. For such text, a couple of attributes can be postulated, which can then be exploited for detection. Since the artificial text appears intentionally, it is designed for good readability for the viewer. Good readability is achieved by constraints like:
The method presented here is designed to reliably detect artificially superimposed text which is aligned in the horizontal direction. The initial feature that allows a separation of text from background is derived from the observation that image areas with a high-contrast text overlay exhibit a higher luminance gradient density compared to the surrounding non-overlay background. For most language fonts, the gradient density feature in the horizontal direction is more prominent than in the vertical direction, because the text characters are dominantly composed of vertical strokes. This feature can be exploited by a properly designed horizontal band-pass filter arrangement, which results in an initial map of text candidate areas. These candidate areas are then further filtered non-linearly using some of the attribute constraints for artificial text listed above.
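The gradient density observation can be illustrated with a minimal sketch: a horizontal first difference of the luminance, averaged over a short horizontal window. The function name and the window width are illustrative assumptions, not values prescribed by the method.

```python
import numpy as np

def horizontal_gradient_density(luma, window=15):
    """Per-pixel density of horizontal luminance gradients.

    High values indicate candidate text areas. `window` is an
    assumed analysis width, not a value from the method itself.
    """
    luma = np.asarray(luma, dtype=np.float64)
    # Horizontal first difference approximates the luminance gradient.
    grad = np.abs(np.diff(luma, axis=1))
    # Average the gradient magnitude over a short horizontal window.
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, grad)
```

On a synthetic image with a striped (text-like) region over a flat background, the density is markedly higher inside the striped region, which is exactly the separation the band-pass filter bank exploits.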
It should be emphasized here that a conventional FIR or IIR filter is preferred over multi-scale approaches like the wavelet transform used e.g. in [7] and [8]. There are several properties of the wavelet transform that make it appear less favourable for the intended purpose.
First, the band-pass filter parameters are inherently constrained by the wavelet decomposition, which leads to a filter bank with octave band division of the spectrum. This can be seen from the typical implementation of the transform, where half-band filters divide the spectrum into a lower and an upper frequency band, followed by a 2-to-1 decimation step, recursively repeating the two steps for the residual low-pass signal at each scale level. The only degree of freedom is the selection of the wavelet function.
Second, due to the recursive decimation steps, the filter response will be shift-variant except for the case of the Haar wavelet functions. As a consequence for the intended application, the pattern to be analyzed will yield different filter results depending on its location in the picture. A detailed analysis of the shift-variance for different decimating wavelet transforms can be found in [10].
The only shift-invariant transform, the Haar wavelet transform, as used e.g. in [8], suffers from the well-known low selectivity of the rectangular filter, which leads to pronounced aliasing artifacts after decimation.
The set of filter parameters for the filter bank FB can be determined by an ad hoc method based on a set of video scenes with relevant text overlay together with a manual pre-segmentation which represents the ground truth. Then, a spectral analysis of the pre-segmented text and background areas is performed, and a set of filter parameters is chosen such that band-pass channels are located around pronounced peaks in the text area spectrum which are not present in the background spectrum.
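The band selection step above can be sketched as a comparison of the averaged magnitude spectra of ground-truth text and background scan lines. The function name, the ratio-based selection criterion, and the `margin` parameter are assumptions for illustration; the patent only requires peaks pronounced in the text spectrum but absent from the background spectrum.

```python
import numpy as np

def candidate_band_centres(text_rows, bg_rows, n_bands=3, margin=2.0):
    """Pick band-pass centre frequencies from ground-truth spectra.

    `margin` (ratio of text to background spectral magnitude) is an
    assumed selection criterion.
    """
    # Average magnitude spectra over all ground-truth scan lines.
    text_spec = np.abs(np.fft.rfft(text_rows, axis=1)).mean(axis=0)
    bg_spec = np.abs(np.fft.rfft(bg_rows, axis=1)).mean(axis=0)
    # Ignore DC; rank bins where text energy dominates the background.
    ratio = text_spec[1:] / (bg_spec[1:] + 1e-9)
    bins = 1 + np.argsort(ratio)[::-1]
    chosen = [b for b in bins if ratio[b - 1] > margin][:n_bands]
    # Return normalised frequencies in cycles per pixel.
    width = text_rows.shape[1]
    return [b / width for b in chosen]
```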
Each of the band filter output signals SF1 to SFN is then individually processed by the post-processing operations PP1 to PPN. The post-processing first determines the short-window signal energy in a small horizontal window and then binarizes the signal using a band-specific threshold. The resulting binary band maps are then combined by the band combination operation BBC to produce a single binary map SC. As a last processing step, the combined binary cleaning operation BCLC generates the final binary map signal ST.
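A minimal sketch of this per-band post-processing on one scan line follows. The window width and the OR-type band combination are assumptions for illustration; the text only specifies a short-window energy followed by a band-specific threshold.

```python
import numpy as np

def binarize_band(filter_response, threshold, window=9):
    """Short-window signal energy of one band filter output on a
    scan line, binarized against a band-specific threshold (TCN).
    `window` is an assumed energy-window width."""
    energy = np.convolve(np.asarray(filter_response) ** 2,
                         np.ones(window), mode="same")
    return (energy > threshold).astype(np.uint8)

def combine_bands(band_maps):
    """Band combination (BBC), sketched as a logical OR: a pixel is
    flagged as text if any band flags it (an assumption)."""
    return np.bitwise_or.reduce(np.stack(band_maps), axis=0)
```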
In order to mitigate the effects of additive noise in the input signal, the threshold value TCN is changed adaptively to the measured noise level NL. This is achieved by the threshold adaptation operation TA, which combines the fixed but band-specific threshold level THN with a variable offset controlled by the noise level NL. The variable offset has to be determined depending on the type of filter bank and the statistics of the expected noise signal. In a particular embodiment, the filter bank is implemented by linear time-invariant FIR filters, and the noise is modelled as additive white Gaussian noise. In this case, for a known (measured) noise level of variance σ2, the required threshold offset is proportional to σ.
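For the FIR/white-Gaussian-noise embodiment, the proportionality of the offset to σ can be sketched as follows: white noise of standard deviation σ appears at the FIR filter output with standard deviation σ·||h||₂. The function name and the `gain` safety factor are assumptions; the patent only states that the offset is proportional to σ.

```python
import numpy as np

def adapted_threshold(th_band, fir_coeffs, noise_sigma, gain=3.0):
    """Threshold adaptation (TA) sketch for an FIR filter bank.

    The fixed band-specific threshold THN is raised by an offset
    proportional to the measured noise level; `gain` is an assumed
    constant depending on the energy-window statistics.
    """
    h = np.asarray(fir_coeffs, dtype=np.float64)
    # Output noise std of an FIR filter fed with white noise of std sigma.
    out_noise_std = noise_sigma * np.sqrt(np.sum(h ** 2))
    return th_band + gain * out_noise_std
```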
The threshold value TCN is derived from the threshold value THN by the threshold adaptation operation TA. The threshold value THN for a filter channel is determined from the statistics of the signal energy level SEN on the data set used for the filter setup. It is assumed that the ground-truth data set is free of independent noise and contains only signal components.
If the filter bank is selected to be based on Gabor filters, the method proposed in [9]—reduced to the one-dimensional case—can be used to determine the filter parameters and the threshold THN for each band-pass channel. In the context of [9], the ground-truth text area data is then interpreted as the first texture and the ground-truth non-text areas as the second texture. It should be emphasized that in [9] the noise component refers to those signal components that are not represented by the dominant frequency component. In other words, the notion of noise in that work must not be confused with noise from an independent origin that is superposing the texture signal.
Therefore, in any case, a fixed but band-specific threshold THN is determined by the above methods, such threshold being dependent on the characteristics of the ground-truth segmented data set only.
After binarization, an initial line profile is generated as a horizontal projection from the binary band map signal SBN by the line profile generation operation LPG. The line profile is defined as a binary vector with H elements for a picture height of H scan lines, i.e. there is 1 bit per scan line. A line profile element is set to value “1”, if there is substantial indication of text area from the binary map SBN. Otherwise, the line profile element is set to “0”.
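The line profile generation (LPG) can be sketched as follows; the `min_count` parameter is an assumed notion of "substantial indication", which the text leaves open.

```python
import numpy as np

def line_profile(binary_map, min_count=8):
    """Line profile: one bit per scan line of the binary band map.

    An element is "1" when the scan line holds a substantial number
    of text-candidate pixels; `min_count` is an assumption.
    """
    counts = np.asarray(binary_map).sum(axis=1)
    return (counts >= min_count).astype(np.uint8)
```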
Usually, subtitle text is not covering the whole image area horizontally. Instead it is restricted to a shorter text string that covers only a fraction of the horizontally available space. Furthermore, the position of the text is not known. The text can appear left or right adjusted, or at any position in-between. In order to improve the robustness of the line profile generation, the input image is partitioned horizontally into M vertical slices. For each slice, an individual line profile is generated.
The vertical slices are spatially arranged with maximum horizontal overlap. The horizontal window size of a vertical slice depends on the aspect ratio of the luminance image and the expected minimal horizontal length of text lines.
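The slice-based profile generation of the two paragraphs above can be sketched as follows. The concrete slice width (half the image width) and the per-slice threshold are assumptions; the patent ties the window size to the aspect ratio and the minimal expected text length.

```python
import numpy as np

def sliced_line_profiles(binary_map, n_slices=4, min_count=8):
    """Individual line profiles over M maximally overlapping
    vertical slices of the binary map. Slice width and `min_count`
    are illustrative assumptions."""
    h, w = binary_map.shape
    win = w // 2                                  # assumed slice width
    step = max(1, (w - win) // max(1, n_slices - 1))  # maximum overlap
    profiles = []
    for m in range(n_slices):
        x0 = min(m * step, w - win)
        sl = binary_map[:, x0:x0 + win]
        profiles.append((sl.sum(axis=1) >= min_count).astype(np.uint8))
    return profiles
```

With this arrangement a short text string near the left image border still produces a clear "1" in the leftmost slice even though the full-width count might fall below the threshold.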
In
The initial line profile SPLN is an auxiliary input to the binary cleaning operation BCLN. The internals of the cleaning operation BCLN are depicted in
The binary band map signal SBN is processed by the column profile generation operation CPG to produce the binary band map SBMN. The cleaned profile SPCN controls which lines in the binary map SBN are used for processing. If a profile element has the value “0”, then all elements of the corresponding scan line in the signal SBMN will also be set to zero. If the processing of the remaining lines of SBN results in a line with all elements set to value “0” in signal SBMN, then the corresponding element in the output line profile SPPN will also be set to the value “0” via the profile update signal SPUN and the profile update operation PU. This procedure ensures that the binary map and the line profile are always in sync.
The CPG operation now loops over all potential text blocks marked in the binary map and the line profile. With each “0” to “1” transition in the line profile SPCN, one iteration begins and a column profile is initialised with the contents of the corresponding line in the binary map and the scan-line number is recorded as n1. All following scan-lines of the binary map are added to the column profile up to and including the last line before a “1” to “0” transition in the line profile, whose scan-line number is recorded as n2. The elements of the column profile of this region are then compared against a threshold value HTH to obtain a binary column profile. Similar to the RLC operation for the line profile, the column profile is cleaned up by replacing sequences of up to NHC,N “0” elements enclosed by “1” elements with the value “1”. In a second step, all sequences of up to NHO,N “1” elements enclosed by “0” elements are replaced by “0” values.
Then, all lines in the range from n1 to n2 in the binary output map SBMN are replaced by the cleaned binary column profile. If the column profile contains only zeros after the binarization step, the line profile SPPN has to be updated and set to value “0” for all elements from n1 to n2, as indicated above.
This column profiling is repeated with the next iteration until the end of the image at scan line H is reached.
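The column profiling and run-length cleaning for one text block (scan lines n1 to n2) can be sketched as follows. The function names and default run lengths are illustrative; HTH, NHC,N and NHO,N are the band-specific parameters named in the text.

```python
import numpy as np

def clean_runs(vec, max_zero_run, max_one_run):
    """Replace short "0" runs enclosed by "1"s (up to NHC,N), then
    short "1" runs enclosed by "0"s (up to NHO,N)."""
    v = list(vec)

    def fill(values, target, keep, max_run):
        i = 0
        while i < len(values):
            if values[i] == target:
                j = i
                while j < len(values) and values[j] == target:
                    j += 1
                # A maximal run is enclosed if it touches neither border.
                if i > 0 and j < len(values) and (j - i) <= max_run:
                    values[i:j] = [keep] * (j - i)
                i = j
            else:
                i += 1
        return values

    v = fill(v, 0, 1, max_zero_run)   # close small gaps
    v = fill(v, 1, 0, max_one_run)    # drop short spurious runs
    return np.array(v, dtype=np.uint8)

def column_profile(block, h_th, max_zero_run=2, max_one_run=1):
    """CPG for one text block: accumulate the block column-wise,
    binarize against HTH, then run-length clean."""
    col = np.asarray(block).sum(axis=0)
    return clean_runs((col >= h_th).astype(np.uint8),
                      max_zero_run, max_one_run)
```

The cleaned column profile is then written back into all scan lines n1 to n2 of the binary output map, as described above.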
The resulting binary band maps SBM1 to SBMN are then combined by the band combination operation BBC to produce a single binary map SCM. Similarly, the binary line profiles SPP1 to SPPN are combined to produce a single binary line profile SCP. Both signals SCM and SCP together are denoted as SC in
The final cleaning operation BCLC of the combined signal in
Number | Date | Country | Kind |
---|---|---|---|
06 006 320.3 | Mar 2006 | EP | regional |