The present disclosure relates to video processing. More specifically, the present disclosure relates to detection of text such as subtitles or captions in input video data, and modification of output video data based on the detected text.
Captions and subtitles are added to video to describe or enhance video content using the written word. Subtitles are typically a transcription or translation of the dialogue spoken by the actors in a video. Captioning is intended to convey information for hearing-impaired viewers, and typically contains non-speech information such as sound effects in addition to dialogue text. Captions and subtitles may be “open” or “closed.” Open captions/subtitles are part of the video signal and cannot be turned on or off by the viewer, whereas closed captions/subtitles can be turned on or off based on user selection.
A number of technologies have been developed for delivery of captions and subtitles. Closed Caption (CC) technology has been widely used in the United States, and PAL Teletext has been used primarily in Europe. These caption technologies involve encoding text data in a non-visible portion of the video signal. In DVDs, subtitle data is stored as bitmaps, separate from the main video data. To generate an output video signal that includes subtitles, the main DVD video data is decoded, and then subtitle data is added to the main video data as a subpicture data stream.
Although subtitles and captions are intended to improve the viewer's understanding of the accompanying video, they can often be difficult to read. Difficulty in reading on-screen text can be attributed to a number of causes: the display on which the subtitles appear may be too small; the viewer may have poor eyesight or may be too far from the display; the viewer may have difficulty with the language in which the text is displayed; the text may be poorly authored; the text may change too quickly; or the background on which the text is rendered may be of a color that makes the text difficult to read. Therefore, improvements in the readability and accessibility of on-screen text are needed.
The present disclosure relates to methods and apparatus for detecting text information in a video signal that includes subtitles, captions, credits, or other text. Additionally, the present disclosure relates to methods and apparatus for applying enhancements to the display of text areas in video. The sharpness and/or contrast ratio of detected text areas may be improved. Text areas may be displayed in a magnified form in a separate window on a display, or on a secondary display. Further disclosed are methods and apparatus for extending the duration for which subtitles appear on the display, for organizing subtitles to be displayed in a scrolling format, for allowing the user to control, with a remote control, when the display advances to the next subtitle, and for allowing a user to scroll back to a past subtitle when the user has not finished reading it. Additionally, optical character recognition (OCR) technology may be applied to detected areas of a video signal that include text, and the text may then be displayed in a more readable font, displayed in a translated language, or rendered using voice synthesis technology.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
The video source component 100 may be, for example, a cable box, media player such as a DVD or Blu-Ray player, set-top box, digital media library device, or other device for producing a video signal. The video processing component 110 may be configured to process the video signal according to the principles as shown in
The video processing component 110 may be, for example, an integrated circuit (IC), a system-on-a-chip (SOC), a software module for use with a general purpose computer, or an alternative type of component. The video processing component 110 includes a text-detection sub-module 111 and a text enhancement sub-module 112.
The display component 120 may be, for example, a monitor or television display, a plasma display, a liquid crystal display (LCD), or a display based on technologies such as front or rear projection, light emitting diodes (LEDs), organic light-emitting diodes (OLEDs), or Digital Light Processing (DLP).
The video processing component 110 and the display component 120 may include interfaces to receive user input, such as input from a remote control or other control device. The video processing component 110 and the display component 120 may be integrated into a single device. The integrated device may be, for example, a television or monitor. Alternatively, the video processing component 110 and the display component 120 may be implemented in physically distinct devices. For example, the video processing component 110 may be included in a set-top box or general purpose computer, and the display component 120 may be included in a separate television or monitor.
The video processing device 210 receives an input video signal from video input device 202 at input interface 220. When a video signal is received, the video processing device 210 is configured to send the signal to the frame buffer 201. The motion estimator 212 is configured to receive frame data from the frame buffer 201. The motion estimator 212 is configured to generate motion vector data from the frame data, and to store the motion vector data in the RAM 213. The processor 215 is configured to process the motion vector data stored in the RAM 213. The processor 215 is also configured to use the video engine 211 to modify video data. For example, the video engine 211 may include a scaler and a sharpener (not shown) which may receive information directly or indirectly from the processor 215. The video processing device 210 may also include another RAM (not shown) coupled to the processor 215 and an additional video engine (not shown). The additional video engine may be used for a picture-in-picture display.
The video processing device 210 also includes an alpha (A) blender 214 and a graphics block 216. The graphics block 216 is configured to read graphics such as on-screen display (OSD) graphics and menus. The alpha blender 214 is configured to combine display signals from the video engine 211 and the graphics block 216, and to output the signal to the display 203 via one or more output interface components. The integrated device 200 may include additional components (not depicted) for receiving the signal from the alpha (A) blender 214 and driving the display 203.
The methods and features to be described with reference to
Although features and elements are described herein with reference to subtitles, captions, and credits, the disclosed methods and apparatus are applicable to text included in a video frame regardless of the initial storage format of the text data, or whether the text is classifiable as captions, subtitles, credits, or any other form. As used herein, the terms “text area” and “text region” include geometric shapes with bounds that encompass text (such as captions, subtitles, credits, or other kinds of text) included in video content. As used herein, the term “video data” includes a representation of the graphics of video content. Video data may be represented as video frames, or in other formats. A video frame is a data structure that represents the graphical contents of a single image, where video content on the whole is made up of successive video frames. Video frames may include or be associated with timing and synchronization information indicating when the video frames should be displayed. Video data may also include audio data.
The motion vectors may be included in a data structure such as example motion vector map 412. Each tile in the example motion vector map 412 represents a motion vector region in the video data. In the example motion vector map 412, a region 414 is shaded to indicate near-zero values for the motion vectors in the motion vector regions within the bounds of the region 414. To determine if the video data includes a text region 406, the motion vector map 412 may be analyzed to produce a histogram including data representing lines 416, 418. To generate the example horizontal histogram lines 416, motion vector values are analyzed for the corresponding row in the motion vector map 412. The lengths of the histogram lines 416 are inversely proportional to the values of the motion vectors in the corresponding row. Accordingly, a longer line indicates a higher probability that the corresponding row in the motion vector map 412 includes text such as subtitles or captions. Similarly, columns in the motion vector map 412 may be analyzed on a per-column basis to generate the values represented by the example histogram lines 418 beneath the motion vector map 412. Again, longer lines indicate less motion, and therefore a higher probability of text, in the corresponding columns. Based on the values in the example motion vector map 412, the example histogram lines 416, 418 are longer in the rows and columns which correspond to the region 414 containing motion vectors with near-zero values.
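By way of illustration only, per-row and per-column histogram values of the kind represented by lines 416, 418 might be computed as in the following Python sketch. The array layout, the inversion formula, and the function name are assumptions made for the example, not requirements of the disclosure.

```python
import numpy as np

def stillness_histograms(mv_magnitude):
    """Compute per-row and per-column 'stillness' scores from a 2-D array of
    motion vector magnitudes (one magnitude per motion vector region).

    Larger scores correspond to rows/columns whose motion vectors are near
    zero, i.e., rows/columns more likely to contain static text such as
    subtitles or captions.
    """
    # Invert the magnitudes so that near-zero motion yields a large score;
    # the +1 avoids division by zero for perfectly still regions.
    stillness = 1.0 / (mv_magnitude + 1.0)

    row_hist = stillness.mean(axis=1)   # one value per row (cf. lines 416)
    col_hist = stillness.mean(axis=0)   # one value per column (cf. lines 418)
    return row_hist, col_hist

# Example: a 6x8 map with a near-zero band in rows 4-5, columns 1-6.
mv = np.full((6, 8), 12.0)
mv[4:6, 1:7] = 0.5
row_hist, col_hist = stillness_histograms(mv)
print(np.round(row_hist, 2))   # rows 4 and 5 score highest
print(np.round(col_hist, 2))   # columns 1 through 6 score highest
```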
Histogram values may be analyzed 406 to determine the bounds of a potential text region in the video data. This may be accomplished by, for example, searching for spikes or plateaus in the histogram lines 416, 418 or in the motion vector values. Based on the bounds of the associated motion vector regions, the bounds of a potential text region may be defined. Regarding motion vector map 412, for example, the bounds of region 414 may be defined based on the longer histogram lines 416, 418 in the columns and rows that correspond to region 414.
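Continuing the illustration, one hypothetical way to turn such histogram values into region bounds is to threshold them and take the extent of the contiguous run that stands out from the background level; the threshold rule below is an assumption made for the sketch.

```python
import numpy as np

def histogram_bounds(hist):
    """Return (start, end) indices of the run of histogram values standing
    out from the background level, or None if no spike/plateau is found.

    The threshold rule (midpoint between the minimum and maximum values) is
    illustrative only; plateau detection may be performed in other ways.
    """
    hist = np.asarray(hist, dtype=float)
    threshold = 0.5 * (hist.min() + hist.max())
    above = np.flatnonzero(hist > threshold)
    if above.size == 0:
        return None
    return int(above[0]), int(above[-1])

# Example row histogram: a plateau in rows 4-5 suggests a horizontal text band.
print(histogram_bounds([0.08, 0.08, 0.08, 0.08, 0.52, 0.52]))              # (4, 5)

# Example column histogram: the band spans columns 1-6.
print(histogram_bounds([0.08, 0.27, 0.27, 0.27, 0.27, 0.27, 0.27, 0.08]))  # (1, 6)
```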
Additional heuristics may optionally be applied 408 to the video data, motion vector data (including associated probability values), and/or histogram values to further define a probability that the region includes text. For example, the bounds of a potential detected text region may be compared to the center of the video frame. Because subtitles or captions are typically horizontally centered in a video frame, this may be used to increase the probability that the region includes text. If human speech can be detected on the audio track at the time corresponding to the video frames, that may be used to increase the probability that the region includes text. Additionally, characteristics of the potential text region may be compared against characteristics of previously-detected text regions. For example, if a region is similar in color or dimensions with a previous text region that had a high probability of including text, that may be used to increase the probability that the region includes text. A final probability value indicating the probability that the region includes text is generated 410.
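The following sketch shows one hypothetical way these heuristics could be folded into a final probability value; the weights, tolerances, and function name are assumptions made for the example and are not taken from the disclosure.

```python
def text_region_probability(base_probability, region, frame_width,
                            speech_detected=False, previous_regions=()):
    """Adjust a base probability (from the motion-vector analysis) using the
    heuristics described above. All weights are illustrative assumptions.

    'region' is (left, top, right, bottom) in pixels; 'previous_regions' is a
    sequence of bounds of regions previously judged likely to contain text.
    """
    probability = base_probability
    left, top, right, bottom = region

    # Subtitles and captions are usually horizontally centered in the frame.
    region_center = (left + right) / 2.0
    if abs(region_center - frame_width / 2.0) < 0.1 * frame_width:
        probability += 0.15

    # Detected speech on the audio track at the corresponding time makes a
    # subtitle more likely.
    if speech_detected:
        probability += 0.10

    # Similarity to previously detected text regions (here: similar width and
    # height) also increases confidence.
    for prev_left, prev_top, prev_right, prev_bottom in previous_regions:
        same_width = abs((right - left) - (prev_right - prev_left)) < 16
        same_height = abs((bottom - top) - (prev_bottom - prev_top)) < 8
        if same_width and same_height:
            probability += 0.10
            break

    return min(probability, 1.0)

# Example: a horizontally centered region with speech on the audio track.
p = text_region_probability(0.6, (460, 950, 1460, 1010), 1920,
                            speech_detected=True)
print(p)  # 0.85
```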
As shown in
Output video data may be modified 508 such that a text region is displayed in a magnified form in a separate window on the display where the main video data is displayed. This can be accomplished by using a picture-in-picture display feature of a video processing device, or alternatively by copying the text region into the display's graphical overlay with a scaled blit operation. Sharpness and contrast ratio can be adjusted for the magnified text, and video enhancement settings may be applied to the region in which the magnified text is displayed.
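A minimal sketch of the overlay approach follows, assuming the video frame and the graphical overlay are available as NumPy arrays in the same pixel format; nearest-neighbor pixel repetition stands in for whatever scaler and sharpener the device actually provides.

```python
import numpy as np

def magnify_text_region(frame, region, scale=2):
    """Return a nearest-neighbor-magnified copy of a text region.

    'frame' is an H x W x C array and 'region' is (left, top, right, bottom).
    A real implementation would use the device's scaler and sharpening
    hardware; pixel repetition here is only a stand-in.
    """
    left, top, right, bottom = region
    crop = frame[top:bottom, left:right]
    return np.repeat(np.repeat(crop, scale, axis=0), scale, axis=1)

def blit_to_overlay(overlay, magnified, x, y):
    """Copy the magnified text region into a graphical overlay at (x, y)."""
    h, w = magnified.shape[:2]
    overlay[y:y + h, x:x + w] = magnified
    return overlay

# Example: magnify a 60-row by 500-column subtitle band from a 1080p frame
# and place it in a separate overlay window near the bottom of the screen.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
overlay = np.zeros((1080, 1920, 3), dtype=np.uint8)
magnified = magnify_text_region(frame, (710, 950, 1210, 1010), scale=2)
blit_to_overlay(overlay, magnified, x=460, y=1080 - magnified.shape[0])
```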
Additionally, output video data may be modified 508 such that the main video data is displayed on a first side of an output frame, and the text region is displayed on a second side of the output frame. The text region may be enhanced as described above, and/or may be magnified. One of the sides of the frame may be an active area, and include the main video data. The other side may be an inactive area, and include no data other than the text region. An example of this is depicted in
Referring again to
A determination is made 806 as to whether the first text region should continue to be included in output video data (and displayed), or whether the second text region should replace the first text region in the output data. This determination 806 may be based on a parameter such as whether input from a user has been received indicating that the next subtitle should be displayed. If a determination is made that the second text region should be included in the output video data, then the second text region is included in the output video data and the first text region is no longer included in the output video data 810. If a determination is made 806 that the first text region should continue to be included in the output video data, then the first text region is further included in the output video data and the second text region is not included in the output video data 808. The determination 806 may subsequently be repeated until the second text region is included in the output video data 810.
The determination 806 may be made based on whether the first text region has already been displayed for a time longer than a time threshold. The time threshold indicates how long the first text region should be included in the output video data. If the first text region has already been displayed for a time longer than the time threshold, then the first text region should no longer be displayed and the second text region should be included in the output video data 810. If the first text region has not yet been displayed for a time exceeding the time threshold, then the first text region should be further included in the output video data and the second text region is not yet included in the output video data 808.
The time threshold used to make the determination 806 may be based on different parameters. For example, the time threshold may be based on the size of the first text region: a larger text region would correspond to a longer time threshold, allowing the viewer more time to read, while a smaller text region would correspond to a shorter time threshold. Alternatively, the time threshold may be based on the number of characters included in the first text region. After a text region is detected, character-recognition technology may be used to count the number of characters in the region; a higher number of characters would correspond to a longer time threshold, and a lower number of characters to a shorter one. Additionally, the time threshold may be based on an average reading rate parameter that indicates how long the user requires to read a text region. The determination 806 may be based on any combination or sub-combination of the factors described above, and may also be based on other factors.
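By way of illustration, the sketch below combines a character-count-based threshold, a fallback based on region size, and a user-advance input into the determination 806; the formulas and constants are assumptions made for the example.

```python
def display_time_threshold(region_width, region_height, character_count=None,
                           reading_rate_cps=12.0, min_seconds=1.5):
    """Estimate how long (in seconds) a text region should stay on screen.

    If a character count is available (e.g., from character recognition), the
    threshold scales with the number of characters and an average reading
    rate in characters per second; otherwise it scales with region area.
    All constants are illustrative assumptions.
    """
    if character_count is not None:
        threshold = character_count / reading_rate_cps
    else:
        # Larger regions presumably hold more text; scale with area.
        threshold = (region_width * region_height) / 20000.0
    return max(threshold, min_seconds)

def keep_first_region(elapsed_seconds, threshold_seconds, user_advanced=False):
    """Determination 806: True keeps the first text region in the output;
    False replaces it with the second text region."""
    if user_advanced:              # viewer pressed "next subtitle"
        return False
    return elapsed_seconds < threshold_seconds

# Example: a 1000x60 subtitle region holding 48 recognized characters.
threshold = display_time_threshold(1000, 60, character_count=48)
print(threshold)                              # 4.0 seconds
print(keep_first_region(2.5, threshold))      # True: keep showing it
print(keep_first_region(5.0, threshold))      # False: advance to the next
```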
After character detection, the output video data is modified 1206 to include data based on the character values of the text in the detected region. The text output region may include a rendering of the text in a more readable font than originally used. The text may be translated into another language, and then text of the translated subtitles can be generated and included in the output text region.
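A hedged sketch of this step is shown below. It assumes the pytesseract OCR package, the Pillow imaging library, and a font file supplied by the caller; none of these specific tools are named by the disclosure, and the translation step is left to an arbitrary callable.

```python
import pytesseract                      # assumed OCR package
from PIL import Image, ImageDraw, ImageFont

def rerender_text_region(region_image, font_path, font_size=48,
                         translate=None):
    """Recognize the characters in a detected text region and re-render them
    in a more readable font (optionally after translation).

    'region_image' is a PIL image of the detected region; 'translate' is an
    optional callable mapping the recognized string to another language.
    """
    text = pytesseract.image_to_string(region_image).strip()
    if translate is not None:
        text = translate(text)

    # Render the recognized text on an opaque band sized to the region.
    font = ImageFont.truetype(font_path, font_size)
    band = Image.new("RGB", region_image.size, color=(0, 0, 0))
    draw = ImageDraw.Draw(band)
    draw.text((10, 5), text, font=font, fill=(255, 255, 255))
    return band   # to be composited into the output video frame
```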
After subtitles have been converted to a text representation by a character-recognition technique, voice synthesis technology may be applied 1206 to the text, and a synthesized voice may read the text mixed with the audio track(s) of the video source. Alternatively, the synthesized voice may be rendered through an auxiliary audio output. The auxiliary output could drive an audio output device different from the device rendering the audio track(s) of the video, such as a pair of headphones.
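A minimal sketch of the voice-synthesis step, assuming the pyttsx3 text-to-speech package; this is an illustrative choice, the disclosure does not name a synthesis technology, and mixing the speech with the program audio or routing it to an auxiliary output is platform-specific and not shown.

```python
import pyttsx3   # assumed text-to-speech package; an illustrative choice

def speak_subtitle(text, rate_wpm=160):
    """Render recognized subtitle text as synthesized speech.

    Mixing the speech into the program audio, or routing it to an auxiliary
    output such as headphones, is platform-specific and not shown here.
    """
    engine = pyttsx3.init()
    engine.setProperty("rate", rate_wpm)   # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()

speak_subtitle("Recognized subtitle text goes here.")
```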
Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements. The sub-elements of the methods and features as described above with respect to
In addition to processor types mentioned above, suitable processors for use in accordance with the principles of the present disclosure include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine.