Video subtitles or video captions (e.g., closed captioning) are transcriptions of spoken words and sounds in multimedia content into text. Video subtitles or video captions can be displayed with video when, at least, watching television, playing a video clip on a personal computer (PC), or browsing a video website. Original video frames can be composited with video subtitles and then displayed on the screen. The positioning of subtitles on the screen can affect a user's video viewing experience.
Some online video websites allow manual adjustment of the position of subtitles on a screen. If the subtitles occlude information in the picture, a user can use a mouse to re-position (drag) the subtitles to another position on the screen. However, re-positioning subtitles may not be feasible for a television if there is no input system with which to re-position the subtitles.
Mocanu and Tapu, “Automatic Subtitle Synchronization and Positioning System Dedicated to Deaf and Hearing Impaired People” (IEEE Access 2021), introduces an automatic subtitle positioning system that is based on text information in the image, where text information is part of a region of interest (ROI) in the image. However, other important areas of a screen, such as a human face in a movie or a ball in a sports broadcast, may still be occluded by the subtitles.
So that the manner in which the features of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments and are therefore not to be considered limiting of their scope.
At least to attempt to provide automatic subtitle positioning without manual input of a subtitle position by a user, and to reduce a likelihood that subtitles overwrite areas of interest in a video, some examples set subtitle position based on saliency detection to detect areas of interest in a video frame and attempt to avoid overlap of areas of interest by subtitles. Some examples can be implemented as processor-executed software or circuitry to automatically find a position for the subtitles in video frames. Some examples can be used to place subtitles for online video, live streaming, television (TV) broadcasting, and video software vendors. Some examples can automatically position video subtitles based, at least, on (1) artificial intelligence (AI) based saliency detection (e.g., a neural network) to divide a video frame into one or more regions of interest and one or more regions that are not of interest and (2) a time adaptive scheme to select a position for the subtitles for different frames of video.
As described earlier, in some cases, subtitles can be positioned in a portion of a screen that would occlude information of interest to a user.
Application 204 can include applications such as a media player, web browser, video editor, social media application, video game, or others. Driver 202, executed by processor 200, can provide access to display interface 254 for an application 204 executing on processor 200. In some examples, driver 202 can process image file 208 to identify one or more potential locations on a frame of video or an image in which to position a subtitle based on one or more regions of interest to a viewer and considering frame-to-frame subtitle location, as described herein. Image file 208 can include video, one or more still images, audio, metadata, 2D or 3D graphics images generated by a graphics processing pipeline, and so forth. Display interface 254 can provide video frames or images with subtitles 210 to display 270 for display.
Decoder 404 can output video frames 406 for processing. In some examples, decoder 404 can generate subtitle text 420. For example, subtitle text 420 can be generated based on closed captioning (CC) data associated with video file 402 that provides a transcript of audio and can be synchronized with audio playback. In some examples, subtitle text can be encoded into one or more associated video frames and decoding video can provide subtitle text for each frame. In some examples, subtitle text 420 can be generated by transcribing audio generated from decoding video file 402. In some examples, subtitle text 420 can be retrieved from a file associated with video file 402.
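For purposes of illustration only, the following is a minimal sketch of one way subtitle text 420 could be retrieved from a file associated with video file 402: parsing a SubRip (.srt) file and looking up the entry active at a frame's presentation time. The use of the .srt format and the helper and parameter names are assumptions for this sketch and are not required by any example.

```python
import re
from dataclasses import dataclass

@dataclass
class SubtitleEntry:
    start_ms: int   # display start time in milliseconds
    end_ms: int     # display end time in milliseconds
    text: str       # subtitle text to render

def _timestamp_to_ms(ts: str) -> int:
    # "HH:MM:SS,mmm" -> milliseconds
    hours, minutes, rest = ts.split(":")
    seconds, millis = rest.split(",")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)

def parse_srt(path: str) -> list[SubtitleEntry]:
    """Parse a SubRip (.srt) file into timed subtitle entries."""
    entries = []
    with open(path, encoding="utf-8") as f:
        blocks = re.split(r"\n\s*\n", f.read().strip())
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue
        match = re.match(r"(\S+)\s*-->\s*(\S+)", lines[1])
        if match is None:
            continue
        start, end = match.groups()
        entries.append(SubtitleEntry(_timestamp_to_ms(start),
                                     _timestamp_to_ms(end),
                                     "\n".join(lines[2:])))
    return entries

def subtitle_for_frame(entries: list[SubtitleEntry], pts_ms: int):
    """Return the subtitle text active at a frame's presentation time, if any."""
    for entry in entries:
        if entry.start_ms <= pts_ms <= entry.end_ms:
            return entry.text
    return None
```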
Saliency detection 408 can detect images that are considered salient or of interest to a viewer and over which subtitles are less likely to be placed. For example, a largest person or people in a frame can be considered salient. For example, a car or plane in a frame can be considered salient. For example, a centered image in a frame can be considered salient. For example, text in a frame can be considered salient. For example, an image of a human can be considered salient. For example, the following can be considered non-salient: a lake region, a sky region, a uniform colored region (e.g., a house, football field, or baseball field), or others. In some examples, salient images can be identified by use of neural networks such as convolutional neural networks (CNNs). Neural networks can be trained to perform inference of features in a visual feature representation. A CNN or other neural network can be trained with video and game content in which images are tagged to identify salient regions.
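As a non-limiting sketch of how a salient-region bounding box could be derived from a per-pixel saliency map, the following assumes an HxW array of saliency scores in [0, 1] (e.g., produced by a trained CNN, which is not shown here); the threshold value is an assumption for illustration.

```python
import numpy as np

def salient_bounding_box(saliency_map: np.ndarray, threshold: float = 0.5):
    """Derive a single bounding box (x, y, w, h) covering pixels whose
    saliency exceeds `threshold`.

    `saliency_map` is assumed to be an HxW array of per-pixel saliency
    scores in [0, 1], e.g., the output of a CNN trained on tagged video
    and game frames.
    """
    mask = saliency_map >= threshold
    if not mask.any():
        return None  # nothing salient in this frame
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing salient pixels
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing salient pixels
    y0, y1 = rows[0], rows[-1]
    x0, x1 = cols[0], cols[-1]
    return (int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1))
```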
Subtitle positioning 410 can select one or more positions in a frame to place subtitle text 420 to avoid salient region(s) that include salient image(s). In some examples, a position of the subtitle text can be selected on a frame-by-frame basis. Subtitle text 420 can be bounded by a rectangle or other shape, and subtitle positioning 410 can attempt to reduce overlap between the rectangle or other shape and the salient region(s). In some examples, a size of subtitle text 420 can be reduced so that subtitle text 420 can be positioned within a region that does not overlap with salient region(s). In some examples, subtitle text 420 can be positioned within a region selected for a prior frame, as described herein.
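The following is a minimal sketch of reducing the size of the subtitle bounding shape until it no longer overlaps salient region(s); the minimum scale and step values are assumptions for illustration.

```python
def overlaps(a, b) -> bool:
    """True if two (x, y, w, h) boxes share any pixels."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def shrink_to_fit(subtitle_box, salient_boxes, min_scale=0.6, step=0.1):
    """Shrink the subtitle bounding box about its top-left corner until it
    no longer overlaps any salient region, or a minimum scale is reached.
    Returns the (possibly reduced) box and the scale that was applied."""
    x, y, w, h = subtitle_box
    scale = 1.0
    while scale >= min_scale:
        box = (x, y, int(w * scale), int(h * scale))
        if not any(overlaps(box, s) for s in salient_boxes):
            return box, scale
        scale -= step
    return subtitle_box, 1.0  # could not avoid overlap by shrinking alone
```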
Position of subtitle text 420 on a frame can be set by pixel coordinates such as top left coordinates of a bounding shape of subtitle text 420. Blend 412 can position subtitle text 420 in a position specified by subtitle positioning 410. The composite image can be output to display 450 for display, storage, or streaming. Audio frames 430 potentially corresponding to subtitle text 420 can be output to speaker 460, stored, or streamed.
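As a non-limiting illustration of blend 412, the following sketch uses the Pillow imaging library to draw subtitle text 420 onto a decoded frame at the top-left pixel coordinates chosen by subtitle positioning 410; the font path, font size, and stroke values are assumptions for illustration.

```python
from PIL import Image, ImageDraw, ImageFont

def blend_subtitle(frame: Image.Image, text: str, top_left: tuple,
                   font_path: str = "DejaVuSans.ttf", font_size: int = 28) -> Image.Image:
    """Compose subtitle text onto a decoded video frame at the position
    selected by subtitle positioning, given as the bounding box's top-left pixel."""
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    try:
        font = ImageFont.truetype(font_path, font_size)
    except OSError:
        font = ImageFont.load_default()  # fall back if the font file is absent
    # Outline the text so it stays readable over varied backgrounds.
    draw.text(top_left, text, font=font, fill="white",
              stroke_width=2, stroke_fill="black")
    return out
```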
Position calculation 552 can be performed on a client side and/or a server side. For example, a client device can include one or more of a personal computer, laptop, mobile phone, smart phone, television, or other device. For example, a server can include one or more of a server, content delivery network (CDN), or other device.
In some examples, to attempt to smooth a transition of subtitles' position, a weighted moving average of position b2 can be used to calculate the subtitle bounding box's top left pixel position as:
Pos_actual = α × Pos_prev + (1 − α) × Pos_new
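A minimal sketch of the weighted moving average above follows, applied to the subtitle bounding box's top-left pixel position; the default value of α is an assumption for illustration.

```python
def smooth_position(prev_pos, new_pos, alpha: float = 0.8):
    """Weighted moving average of the subtitle bounding box's top-left pixel:
    Pos_actual = alpha * Pos_prev + (1 - alpha) * Pos_new.
    A larger alpha keeps the subtitle closer to its previous position."""
    x = alpha * prev_pos[0] + (1 - alpha) * new_pos[0]
    y = alpha * prev_pos[1] + (1 - alpha) * new_pos[1]
    return (round(x), round(y))
```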
At 708, a determination can be made if b1 and b2 are overlapped. In other words, a determination can be made if the subtitle bounding box overlaps with the salient map's bounding box. Based on a non-overlap of the subtitle bounding box with the salient map's bounding box, at 710, the subtitle bounding box (b2) position can be set as the position of b2 in a prior frame. Based on an overlap of the subtitle bounding box with the salient map's bounding box, at 720, the subtitle bounding box (b2) can be moved to another position (b3) by movement in one or more of four directions: up, down, left, or right.
At 722, a determination can be made if a position of the subtitle bounding box at position b3 overlaps with the salient region bounding box. A bounding box can be a region of a solid color on screen and can overwrite pixels of a video frame or image. Based on non-overlap of the subtitle bounding box at position b3 with the salient region bounding box (b1), at 724, position b3 can be used as the subtitle position. At 740, this position can be saved for a next frame's calculation. Operation 722 can attempt to reduce movement of the subtitle position (despite subtitle text changing) from frame to frame of video by finding a region of the frame that is non-salient for multiple consecutive frames. Note that operations 720 and 722 can repeat a configured integer X number of times until operation 722 yields a non-overlap result. In some examples, a size of the subtitle bounding box or associated text can be reduced one or more times and operations 720 and 722 can repeat until a non-overlap result occurs. Based on overlap of the subtitle bounding box at position b3 with the salient region bounding box (b1) after repeat of 720 and 722 an integer X number of times, at 726, a full frame per-pixel search can be performed in raster scan order to find a position (b4) that is not overlapped with b1. If b4 is found, at 732, b4 can be used as the subtitle position and the position is saved at 740. If b4 is not found, at 730, a previous frame's subtitle position (b2) can be used for the current frame.
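A non-limiting sketch of the position selection of operations 708-740 follows. The nudge step size, the bound on the number of moves (integer X), and the scan granularity are assumptions for illustration, and the optional reduction of subtitle size is omitted for brevity.

```python
def boxes_overlap(a, b) -> bool:
    """True if two (x, y, w, h) boxes share any pixels."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def position_subtitle(salient_box, prev_box, frame_w, frame_h,
                      step=16, max_moves=8):
    """Choose a subtitle bounding box for the current frame.

    salient_box -- bounding box b1 of the salient region, (x, y, w, h)
    prev_box    -- subtitle bounding box b2 from the previous frame
    Returns the subtitle box to use for this frame.
    """
    x, y, w, h = prev_box
    # 708/710: keep the previous position if it does not occlude the salient region.
    if not boxes_overlap(salient_box, prev_box):
        return prev_box
    # 720/722: nudge the box up, down, left, or right a bounded number of times (X).
    for i in range(1, max_moves + 1):
        for dx, dy in ((0, -step * i), (0, step * i), (-step * i, 0), (step * i, 0)):
            nx = min(max(x + dx, 0), frame_w - w)
            ny = min(max(y + dy, 0), frame_h - h)
            candidate = (nx, ny, w, h)
            if not boxes_overlap(salient_box, candidate):
                return candidate  # b3
    # 726: full-frame per-pixel search in raster-scan order for a position b4.
    for ny in range(0, frame_h - h + 1):
        for nx in range(0, frame_w - w + 1):
            if not boxes_overlap(salient_box, (nx, ny, w, h)):
                return (nx, ny, w, h)  # b4
    # 730: no non-overlapping position exists; reuse the previous frame's position.
    return prev_box
```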
Various examples may be implemented using hardware elements, software elements, or a combination of both. In some examples, hardware elements may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, ASICs, PLDs, DSPs, FPGAs, memory units, logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. In some examples, software elements may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, APIs, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or combination thereof. Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given implementation. A processor can be one or more combination of a hardware state machine, digital control logic, central processing unit, or any hardware, firmware and/or software elements.
Some examples may be implemented using or as an article of manufacture or at least one computer-readable medium. A computer-readable medium may include a non-transitory storage medium to store logic. In some examples, the non-transitory storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. In some examples, the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, API, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or combination thereof.
According to some examples, a computer-readable medium may include a non-transitory storage medium to store or maintain instructions that when executed by a machine, computing device or system, cause the machine, computing device or system to perform methods and/or operations in accordance with the described examples. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a machine, computing device or system to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
One or more aspects of at least one example may be implemented by representative instructions stored on at least one machine-readable medium which represents various logic within the processor, which when read by a machine, computing device or system causes the machine, computing device or system to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
The appearances of the phrase “one example” or “an example” are not necessarily all referring to the same example or embodiment. Any aspect described herein can be combined with any other aspect or similar aspect described herein, regardless of whether the aspects are described with respect to the same figure or element. Division, omission or inclusion of block functions depicted in the accompanying figures does not infer that the hardware components, circuits, software and/or elements for implementing these functions would necessarily be divided, omitted, or included in embodiments.
Some examples may be described using the expression “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, descriptions using the terms “connected” and/or “coupled” may indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
The terms “first,” “second,” and the like, herein do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced items. The term “asserted” used herein with reference to a signal denotes a state of the signal in which the signal is active, and which can be achieved by applying any logic level, either logic 0 or logic 1, to the signal. The terms “follow” or “after” can refer to immediately following or following after some other event or events. Other sequences of operations may also be performed according to alternative embodiments. Furthermore, additional operations may be added or removed depending on the particular application. Any combination of changes can be used, and one of ordinary skill in the art with the benefit of this disclosure would understand the many variations, modifications, and alternative embodiments thereof.
Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is otherwise understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. Additionally, conjunctive language such as the phrase “at least one of X, Y, and Z,” unless specifically stated otherwise, should also be understood to mean X, Y, Z, or combination thereof, including “X, Y, and/or Z.”
Illustrative examples of the devices, systems, and methods disclosed herein are provided below. An embodiment of the devices, systems, and methods may include one or more, and any combination of, the examples described below.
Example 1 includes one or more examples and includes: at least one non-transitory computer-readable medium comprising instructions stored thereon, that if executed by one or more processors, cause the one or more processors to: for a frame of a video: identify one or more bounding regions in the frame that correspond to regions of interest; select a location of a text bounding region in the frame that does not overlap with the one or more bounding regions, wherein the text bounding region in the frame is associated with text; and cause the text to be displayed in the text bounding region corresponding to the selected location.
Example 2 includes one or more examples, wherein the identify one or more bounding regions in the frame that correspond to regions of interest is based on a neural network trained based on regions of interest.
Example 3 includes one or more examples, wherein the regions of interest comprise one or more of: a largest image in a frame, a moving image, a centered image, text, or an image of a human.
Example 4 includes one or more examples, wherein the regions of interest exclude a solid colored region.
Example 5 includes one or more examples, wherein the text to be displayed in the region comprises subtitles or closed captioning (CC) text.
Example 6 includes one or more examples, wherein the select a location of a text bounding region in the frame that does not overlap with the one or more bounding regions is based on locations of text bounding regions in multiple frames.
Example 7 includes one or more examples, wherein the select a location of a text bounding region in the frame that does not overlap with the one or more bounding regions is based on locations of text bounding regions in multiple frames of the video and reduces an amount of movement of the locations of text bounding regions in multiple frames.
Example 8 includes one or more examples, wherein the select a location of a text bounding region in the frame that does not overlap with the one or more bounding regions is based on a per-pixel raster order scan of a frame.
Example 9 includes one or more examples, wherein the select a location of a text bounding region in the frame that does not overlap with the one or more bounding regions comprises reduce a size of the bounding region in the frame until identifying a location of a text bounding region in the frame that does not overlap with the one or more bounding regions.
Example 10 includes one or more examples, wherein the frame of video comprises one or more of: text, audio, graphics, video, or holographic images or video.
Example 11 includes one or more examples, and includes an apparatus that includes: at least one processor and at least one memory comprising instructions stored thereon, that if executed by the at least one processor, cause the at least one processor to: for a frame of a video file: identify one or more bounding regions in the frame that correspond to regions of interest; select a location of a text bounding region in the frame that does not overlap with the one or more bounding regions, wherein the text bounding region in the frame is associated with text; and cause the text to be displayed in the text bounding region corresponding to the selected location.
Example 12 includes one or more examples, wherein the identify one or more bounding regions in the frame that correspond to regions of interest is based on a neural network trained based on regions of interest.
Example 13 includes one or more examples, wherein the regions of interest comprise one or more of: a largest image in a frame, a moving image, a centered image, text, or an image of a human.
Example 14 includes one or more examples, wherein the text to be displayed in the region comprises subtitles or closed captioning (CC) text.
Example 15 includes one or more examples, wherein the select a location of a text bounding region in the frame that does not overlap with the one or more bounding regions is based on locations of text bounding regions in multiple frames.
Example 16 includes one or more examples, wherein the select a location of a text bounding region in the frame that does not overlap with the one or more bounding regions is based on a per-pixel raster order scan of a frame.
Example 17 includes one or more examples, and includes a method that includes: for frames of a video file: determining a location of subtitle text by: based on a user-input configuration specifying to select a location of display of the subtitle text to avoid images of interest: determining one or more regions of interest in the frame and selecting a location of the subtitle text to avoid overlapping a bounding box surrounding the subtitle text with the one or more regions of interest.
Example 18 includes one or more examples, wherein the regions of interest comprise one or more of: a largest image in a frame, a moving image, a centered image, text, or an image of a human.
Example 19 includes one or more examples, wherein the selecting a location of the subtitle text to avoid overlapping a bounding box surrounding the subtitle text with the one or more regions of interest is based on a per-pixel raster order scan of a frame.
Example 20 includes one or more examples, wherein the selecting a location of the subtitle text to avoid overlapping a bounding box surrounding the subtitle text with the one or more regions of interest comprises reducing a size of the bounding box in the frame until identifying a location of the bounding box in the frame that does not overlap with the one or more regions of interest.
Number | Date | Country | Kind
---|---|---|---
PCT/CN2023/079885 | Mar. 6, 2023 | WO | international
This application claims the benefit of priority to Patent Cooperation Treaty (PCT) Application No. PCT/CN2023/079885, filed Mar. 6, 2023. The entire contents of that application are incorporated by reference.