The present invention relates generally to the field of video and audio engineering and, in particular, to systems and methods for manipulating pre-recorded and live video and audio content that is broadcast or streamed with commercial advertisements (“ads”).
Conventional advertisement-supported broadcast television or radio and over-the-top (OTT) programming delivered to viewers, whether live or prerecorded, includes commercial ads that are inserted into breaks that occur regularly during the progress of the viewed programs. The ads that are placed into ad breaks of TV/video programming are often 1-5 minutes in length and are often, but not always, preceded by a single black frame serving as an ad break marker.
Ads are typically created and stored separately from the programs to which they are added. Conventionally, at selected times during a broadcast, the ads are spliced into the video feed so as to deliver them to viewers during program breaks. This provides flexibility for changing the ads to be played during a program. Some commercials are inserted at the broadcast origination point by the TV network or other content provider. Other ads may be inserted further downstream by the program service distributor, such as a cable or OTT operator, using ad break markers coded into the video as indicators of when the programming breaks and restarts. This is very desirable because the sale and insertion of local and regional advertisements have become a significant source of revenue for the suppliers of ad-sponsored programming to their subscribers/customers.
In this latter model, different ads directed to localized audiences may be inserted into a program that is broadcast over a much wider geographical area. It is thus understood that “targeted ads” can provide great value for advertisers and, in turn, enhanced revenue-generating opportunities for distributors, operators and other providers. Also, during a rebroadcast of a program, ad break markers enable new ads to replace the original ads that were burned into the video when the program was originally broadcast. An important aspect of program delivery is therefore program splicing and, more particularly, ad insertion. Similar arrangements are employed with audio-only streams.
When live broadcasts are recorded, the ads are recorded as well. This creates a scenario in which the ads are “burnt in” with the original video or audio stream content. If ad break markers are not already encoded in the stream during recording, it is time-consuming to effectively and reliably identify the ad break starts and ends for future playback in order to differentiate between entertainment and ad content segments. Nonetheless, it is understandably desirable to be able to replace the pre-recorded ads that are burnt in with dynamically generated ads that can be targeted to the viewing audience, thus enhancing the revenue-generating opportunities.
Various solutions for discriminating between programming and ads—“ad detection”—in “content streams” that have not been pre-encoded with commercial ad break markers have been proposed to address this problem. By “content streams” the inventors of the present invention mean any audio, video or audio/video media files, or live audio, video or audio/video media streams, that are broadcast or streamed and that comprise programming content and typically also include advertising content. One conventional, but crude and inefficient, solution is the manual method: a person simply watches (or listens to) the recorded programming at an editing desk and manually adds “ad start” and “ad end” markers to the content stream file. Obviously, this solution is only applicable to recorded shows and is painfully slow and costly. Others have attempted to automate this process by analyzing the video or audio signals of recorded content. One automated solution, for example, attempts to employ the detection of low audio signals as an indicator of a break, transitioning from the end of a program segment to an ad, or from the end of an ad to the restart of the programming. See, for example, the solution proposed by U.S. Pat. No. 8,654,255 to Hua, et al. While this single-metric “binary” technique (i.e., “Is scene x an ad transition or not?”) may work for some program-to-ad transitions, it unfortunately produces a high degree of false positives when, for example, the content itself, whether in the midst of a program segment or an ad, has low audio levels close to the low-audio threshold the system uses as a transition indicator.
What is needed, therefore, is a robust and reliable solution that automatically identifies ad break starts and ends in both pre-recorded and live video and/or audio programming content—media streams—that do not contain ad break markers. Such a solution would, in real time, reliably differentiate between entertainment content and ad content segments, without suffering from the aforementioned problems, thereby enabling automated ad skipping as well as the dynamic (“on the fly”) replacement of “burnt-in” ad content in the media streams with custom ad content.
The present invention meets these needs and more by disclosing systems and methods for automatically, and in real time, identifying and marking a commercial advertisement break transition in a content stream that is not pre-encoded with commercial advertisement breaks. In preferred embodiments, the method comprises the steps of receiving the content stream in a real-time learning A/V computing engine; parsing the content stream into content stream scenes; selecting a current scene from the content stream scenes for identifying whether the scene is likely programming content or advertising content; using a scene recognition subsystem of the computing engine, conducting at least one of object and speech recognition analyses on the current scene to recognize one or more objects and/or speech streams in the scene; automatically selecting a first visual or audio context model corresponding to a first recognized object or speech stream for operation on the content stream scene, the context model being selected from a set of context models stored in the computing engine (108); applying the selected context model to the scene to extract information from the scene indicative of whether the scene is likely programming content or advertising content (109); computing a preliminary score on the extracted information to quantify a likelihood of the scene being programming content or advertising content (110); and comparing the computed context model preliminary score for the current scene against preexisting and stored scene scores computed using the selected context model for prior scenes in the content stream, if any (112).
In further embodiments, the method further includes, using the recognized objects and/or speech streams in the scene, automatically selecting one or more additional audio or visual context models, if any, each corresponding to an additionally recognized object or speech stream for operation on the content stream scene (108); and repeating the last three steps for each additionally recognized object or speech stream having a corresponding context model. Finally, the method aggregates the preliminary scores computed on all extracted information for the scene into an aggregated computed similarity score. Further, when the aggregated computed similarity score of the current scene is less than an aggregated acceptance threshold (which may be predetermined, provided by the system, calculated by the machine learning algorithm, or determined in another way understood by one skilled in the art), the method of the present invention classifies the current scene as a transition between programming content and advertising content. The method may then electronically mark the scene transition with an ad break marker.
In embodiments, when the computed similarity score of the current scene is greater than the acceptance threshold, the scene is classified as programming content.
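By way of illustration only, the following is a minimal sketch of how the per-scene scoring and threshold comparison summarized above might be organized in software. The names used (ContextModel, score_scene, the 0.5 default threshold) and the simple averaging used to aggregate the preliminary scores are assumptions made for this sketch and are not taken from the present disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ContextModel:
    """One audio or visual context model (e.g., actor complement, audio level)."""
    name: str
    score_fn: Callable[[Dict], float]  # returns a 0..1 similarity-to-programming score

def score_scene(scene: Dict, models: List[ContextModel], threshold: float = 0.5) -> str:
    """Apply each selected context model to the scene, aggregate the preliminary
    scores, and classify the scene against the acceptance threshold."""
    preliminary = [m.score_fn(scene) for m in models]
    aggregated = sum(preliminary) / len(preliminary)  # simple average (assumption)
    if aggregated < threshold:
        return "ad break transition"   # below threshold: likely a programming/ad transition
    return "programming content"       # at or above threshold: likely the show itself

# Hypothetical usage with two stand-in models:
models = [
    ContextModel("actor_complement", lambda s: 1.0 if s.get("known_actors") else 0.2),
    ContextModel("audio_level", lambda s: min(s.get("rms_level", 0.0) / 0.1, 1.0)),
]
print(score_scene({"known_actors": ["actor_a"], "rms_level": 0.08}, models))
```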
In more detailed embodiments, the step of conducting at least one of object and speech recognition analyses on the current scene to recognize one or more objects and/or speech streams in the scene further includes the step of inputting into the scene recognition subsystem known object and speech patterns stored in a scene classification database.
In various embodiments, a set of context models is provided for use by the method of the present invention. These models may include one or more of an actor identification model, a scene object identification model, a topics-of-discussion model, an audio level model, and a blackscreen model. Other context models are contemplated by the present invention.
In other embodiments, the content stream is live broadcast television programming. In others, it may be a recorded video or audio or A/V file. In preferred embodiments, the present invention is capable of adding ad break markers to the content stream in real time.
It is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components described hereinafter and illustrated in the drawings and photographs. Those skilled in the art will recognize that various modifications can be made without departing from the scope of the invention.
Further advantages of the present invention may become apparent to those skilled in the art with the benefit of the following detailed description of the preferred embodiments and upon reference to the accompanying drawings in which:
Referring now to the drawings, like reference numerals designate identical or corresponding features throughout the several views.
The present invention takes a novel approach to ad scene identification in both live and pre-recorded video that does not include ad markers by applying, in real time, a novel combination of machine learning algorithms to a video stream containing both programming content and non-program content, such as commercial ads. In particular, the systems and methods of the present invention use real-time machine learning to automate the recognition of objects and speech in order to differentiate entertainment content from burnt-in commercial ad content segments.
The present invention accomplishes this advance in systems and methods for scene detection by preferably ingesting pre-recorded or live video content, and comparing multiple audio, visual, and contextual cues in a current scene against previously played scenes to determine on the fly if the video stream is no longer streaming entertainment content and has switched to an ad break. If a scene in the video content is recognized as a commercial ad, a “begin-ad” timestamp is recorded and a “begin-ad” ad marker is inserted into the video content to signify the start of a commercial ad break. When the machine learning algorithm recognizes that the ad break is over and entertainment programming content is resuming, an “end-ad” timestamp is recorded, and a marker is inserted into the video stream to signify the end of the commercial ad break. The video stream with ad markers added signifying the start and end of the commercial break can then be ingested by a dynamic ad insertion (DAI) system which can automatically replace the existing burnt-in commercial ads with targeted commercial ads in real-time.
The present invention is extremely beneficial to Multichannel Video Programming Distributors (MVPDs) who offer digital video recording (DVR) services allowing customers to record live content such as series/episodic content, movies, and sports events that contain commercial breaks. This creates an opportunity to generate revenue using dynamic ad insertion to replace burnt-in ads with targeted advertising.
A high-level block diagram of the system 1000 of the present invention according to one embodiment is shown in
One embodiment of the ad content recognition and replacement invention disclosed herein is shown in the block diagram of
At step 100, the system of the present invention ingests a content stream from either a live video stream or a video file. The ingested video content could be a currently broadcast live show or may be recorded content (an A/V file, such as a recorded TV show) actively distributed to end users. At step 102, the system parses the ingested video content into scenes using an image scene cutter 12. For example, in one preferred embodiment, a scene may comprise 6 frames of video captured from video shot at a standard 24 frames per second (fps) rate. However, it is understood that a scene may comprise any other short piece or snippet of A/V content. Next, in step 104, a scene X is selected and advanced to scene recognition step 106. Here, the software of the system attempts to recognize all identifiable objects and the speech in scene X. Turning momentarily to the scene recognition subsystem 20 employed at step 106, the selected scene X serves as its primary input.
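As a rough illustration of the scene-parsing step only, the sketch below groups an already-decoded sequence of frames into fixed-length snippets of six frames, matching the 24 fps example above. Frame decoding itself (for example, with FFmpeg or OpenCV) is omitted, and the snippet length is a configurable assumption rather than a requirement of the invention.

```python
from typing import Iterable, Iterator, List

def cut_into_scenes(frames: Iterable, frames_per_scene: int = 6) -> Iterator[List]:
    """Group a decoded frame stream into fixed-length scene snippets.
    At 24 fps, six frames correspond to a quarter of a second of video."""
    scene: List = []
    for frame in frames:
        scene.append(frame)
        if len(scene) == frames_per_scene:
            yield scene
            scene = []
    if scene:  # emit any trailing partial snippet
        yield scene

# Hypothetical usage with integer indices standing in for decoded frames:
for snippet in cut_into_scenes(range(14)):
    print(snippet)
```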
As seen, another input to step 106 (scene recognition subsystem 20) is classification database 210, which stores object and speech classifications from previous scenes and compares those classifications with the current content. Thus, if a previous scene stored in classification database 210 classified an image in that scene as an infant, then a similarly shaped object in the current scene might also be characterized as an infant.
Next, in the preferred embodiment, system 10 employs machine learning to accomplish ad content recognition. To do this, object recognition and speech recognition are conducted on the visual and audio content of each scene. The results of this analysis are then used to determine whether a scene contains entertainment content or ad content. These learning “models” extract available information describing the context of the content being presented in the video and audio streams. Extracted information may include, but is not limited to, “background scene setting,” “actor complement” (i.e., the number and identity of actors in a given scene), context of the discussion, volume of the audio file, black screen identification, cues or other markers indicating ad placement, standard times for ad breaks, or other information as available.
Accordingly, for any scene X being analyzed, in step 108, one or more “context models” are identified and selected as relevant for that scene using the recognized objects and speech from step 106. Various exemplary machine learning context models 300, 400, 500, 600, 700 are detailed in
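The following is a minimal sketch of how step 108 might map recognized objects and speech onto the available context models. The trigger sets shown are purely illustrative assumptions; the model identifiers simply mirror the reference numerals used in this description.

```python
# Assumed mapping from recognized entity types to the context models they trigger.
MODEL_TRIGGERS = {
    "actor_complement_300":   {"face", "person"},
    "speech_recognition_400": {"speech"},
    "scene_objects_500":      {"object"},
    "audio_level_600":        {"audio"},
    "blackscreen_700":        {"frame"},
}

def select_context_models(recognized: set) -> list:
    """Select every context model whose trigger entities appear among the
    objects and speech recognized for the current scene (step 108)."""
    return [model for model, triggers in MODEL_TRIGGERS.items() if triggers & recognized]

# Hypothetical usage: a scene containing recognized faces, a speech track, and frames.
print(select_context_models({"face", "speech", "frame"}))
```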
Details of each of these preferred context models employed in connection with the presently described embodiment are now explained. Considering first the visual context models, the visual content within a video stream is extracted and compiled using object recognition on both actors and scenery contained within the scene. In actor complement model 300 shown in
Now, at decision 310, actor complement model 300 asks whether all actors in the scene are recognized from previously played scenes. If the answer to that question is “YES”, then the system decides at step 312 that the content is likely programming (the show) and not likely an advertisement. However, if the answer to question 310 is that not all actors are likely recognized, then the system further asks at step 316 whether any program actors are recognized. If “NO”, then at step 318, model 300 concludes that the content in scene X may be an ad and not the program, and at step 320 labels faces identified therein as Ad Actors and sends that tentative conclusion to actor database 308. However, if the answer to question 316 is “YES”—that some but not all faces in the scene are recognized as show actors—the algorithm is programmed to conclude that the content is not likely an Ad at step 312.
If the content in scene X is determined from the actor complement at step 312 to likely not be an ad, step 314 then labels the faces identified therein as Show Actors and sends that data to be stored in actor database 308 for use against future analyzed scenes.
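The decision logic of actor complement model 300, as just described, can be summarized in the following sketch. Face recognition itself is omitted; the face identifiers and the database structure (a simple set) are assumptions made for illustration.

```python
def actor_complement(scene_faces: set, known_show_actors: set) -> str:
    """Compare the faces recognized in the current scene against actors stored
    from previously played scenes (actor database 308)."""
    if not scene_faces:
        return "no decision"               # nothing to compare (assumption)
    if scene_faces <= known_show_actors:
        return "likely programming"        # decision 310: all actors recognized
    if scene_faces & known_show_actors:
        return "likely programming"        # decision 316: some show actors present
    return "possible advertisement"        # step 318: no show actors recognized

# Hypothetical usage; the identifiers stand in for face-recognition output.
known = {"lead_actor", "supporting_actor"}
print(actor_complement({"lead_actor", "new_face"}, known))   # likely programming
print(actor_complement({"spokesperson"}, known))             # possible advertisement
```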
Turning to the next visual model, scene object identification model 500 recognizes the non-actor objects appearing in scene X and compares them against objects cataloged in database 508 from previously played scenes. At decision 510, the model asks whether all objects in the scene are recognized as programming objects; if so, the model concludes at step 512 that the content is not likely an ad.
On the other hand, if at decision 510 not all objects are recognized as programming objects, then model 500 further asks at step 516 whether any objects are recognized as objects found in a program (not commercial ad) scene. If the answer is YES, then the algorithm loops back to step 512 with the conclusion that the content is not likely an ad, and labels and catalogs those objects as programming objects for storage in DB 508. Only if at step 516 no objects are recognized as programming objects does the model conclude at step 518 that the scene may be an advertisement. In that case, at step 520, the objects are labeled as advertisement objects and stored in DB 508 as such.
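A corresponding sketch of the scene object identification logic follows; it mirrors the actor complement sketch above, with non-actor objects compared against objects cataloged in DB 508. The object names are hypothetical.

```python
def scene_object_check(objects_in_scene: set, known_show_objects: set) -> str:
    """Decisions 510/516: if all or some objects are recognized as programming
    objects, the content is likely the show; otherwise it may be an ad (step 518)."""
    if objects_in_scene and not (objects_in_scene & known_show_objects):
        return "possible advertisement"
    return "likely programming"

# Hypothetical usage:
known_objects = {"precinct_desk", "squad_car"}
print(scene_object_check({"squad_car", "coffee_mug"}, known_objects))   # likely programming
print(scene_object_check({"soda_can", "product_logo"}, known_objects))  # possible advertisement
```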
The final visual algorithm that may be invoked for an A/V content scene in this preferred embodiment is the Blackscreen Recognition model. One simple preferred embodiment is model 700 as shown in
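Because the detailed operation of model 700 is not reproduced above, the sketch below simply assumes that a blackscreen is detected when the average pixel intensity of a frame falls below a small threshold, which is one common way such a detector is built; the actual model may work differently.

```python
def is_blackscreen(frame, intensity_threshold: float = 10.0) -> bool:
    """Assumed blackscreen test: the mean pixel intensity of a grayscale frame
    (values 0-255) falls below a small threshold."""
    pixels = [value for row in frame for value in row]
    return sum(pixels) / len(pixels) < intensity_threshold

# Hypothetical usage on tiny hand-made "frames" (nested lists of gray values):
dark_frame = [[0, 2, 1], [3, 0, 1]]
bright_frame = [[120, 200, 90], [80, 60, 140]]
print(is_blackscreen(dark_frame), is_blackscreen(bright_frame))  # True False
```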
Turning now to speech recognition models in the presently preferred embodiment, speech recognition model 400 is shown in
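The internals of model 400 are likewise not reproduced here. As one possible reading of a topics-of-discussion analysis, the sketch below measures how much of the current scene's transcript vocabulary overlaps with the vocabulary accumulated from prior programming scenes; speech-to-text itself is omitted, and the overlap measure is an assumption.

```python
def topic_overlap(current_transcript: str, prior_vocabulary: set) -> float:
    """Fraction of words in the current scene's transcript that also appeared
    in previously played programming scenes (assumed topics-of-discussion cue)."""
    words = {w.lower().strip(".,!?") for w in current_transcript.split()}
    if not words:
        return 0.0
    return len(words & prior_vocabulary) / len(words)

# Hypothetical usage: ad-style dialogue departs sharply from the show's prior vocabulary.
prior = {"detective", "case", "suspect", "precinct", "evidence"}
print(topic_overlap("The detective reviewed the evidence in the case", prior))     # higher
print(topic_overlap("Call now for a limited time offer on car insurance", prior))  # 0.0
```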
The final sound recognition model employed by the presently preferred embodiment is the audio level model, one of which is model 600 shown in
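Again, the details of model 600 are not reproduced above. The sketch below assumes the audio level cue is a simple root-mean-square (RMS) level check that flags a scene whose level drops far below the level observed for prior programming scenes, consistent with the low-audio transition indicator discussed in the background; the actual model may be more elaborate.

```python
import math

def rms_level(samples) -> float:
    """Root-mean-square level of a scene's audio samples (floats in -1..1)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def audio_level_hint(scene_samples, typical_program_rms: float, ratio: float = 0.2) -> str:
    """Flag a scene whose audio level is far below the running programming level."""
    if rms_level(scene_samples) < ratio * typical_program_rms:
        return "possible break transition"
    return "likely programming"

# Hypothetical usage with synthetic sample values:
loud = [0.40, -0.50, 0.45, -0.35]
quiet = [0.01, -0.02, 0.015, -0.01]
print(audio_level_hint(loud, typical_program_rms=0.4))   # likely programming
print(audio_level_hint(quiet, typical_program_rms=0.4))  # possible break transition
```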
It should be understood that all of these and other context models work individually in an iterative fashion in that their content databases get “smarter” with greater use—that is, with machine learning—and therefore the accuracy of these “learning modules” gets better over time as the databases grow with relevant data from previous learnings.
Turning now back to the flow diagram, at step 110 a preliminary score is computed on the information extracted by each context model applied to scene X, quantifying the likelihood that the scene is programming content or advertising content. At decision 112, the computed similarity score for scene X is compared against the acceptance threshold and against the scores stored for prior scenes in the content stream. When the score is lower than the acceptance threshold, meaning scene X is not sufficiently similar to the prior programming scenes, the scene is classified as a transition to advertising content.
Optionally, then, the system of the present invention at step 128 inserts, on the fly, an ad marker at the end of scene X indicating that it is an advertisement. The scene can then be saved and prepared for content distribution at step 130. This marker can then be used at step 132 to dynamically, and in real time, insert a replacement ad from an ad server into the scene string that represents the predicted ad content. At this point, the ad identification and replacement for that scene/snippet is complete, and the system ingests the next scene/snippet of A/V content to repeat the process on it. The databases, however, are now loaded with an additional set of data from scene X.
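As a simple illustration of the marker insertion and hand-off to a DAI system, the sketch below records begin-ad and end-ad boundaries as timestamped metadata attached to the stream. The marker format shown is an assumption; in practice the markers could take whatever form the downstream ad insertion system expects.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AdMarker:
    kind: str           # "begin-ad" or "end-ad"
    timestamp_s: float  # position in the content stream, in seconds

@dataclass
class MarkedStream:
    markers: List[AdMarker] = field(default_factory=list)

    def mark(self, kind: str, timestamp_s: float) -> None:
        """Record an ad break boundary so a downstream DAI system can replace
        the burnt-in ad lying between a begin-ad and end-ad marker pair."""
        self.markers.append(AdMarker(kind, timestamp_s))

# Hypothetical usage: scene X was classified as an ad starting at 612.25 s.
stream = MarkedStream()
stream.mark("begin-ad", 612.25)
stream.mark("end-ad", 672.25)
print([(m.kind, m.timestamp_s) for m in stream.markers])
```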
However, if at decision 112 the similarity score of scene X (relative to a value representing prior stored programming content) is not lower than the acceptance threshold, meaning the scene is similar enough to the prior scenes it was compared to, then at step 124 the system extracts the content differences between scene X and the scene before it, scene X−1, and at step 122 adds the observed differences to the appropriate context model and to the classification database 210. In a feedback loop, this ever-growing database provides input back into object and speech recognition module 106.
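Finally, the feedback step just described might look like the following sketch, in which any objects or speech terms seen in scene X but not in the preceding scene are added to the classification database for use against later scenes. The dictionary structure and keys are assumptions made for illustration.

```python
def update_classification_db(db: dict, scene_x: dict, scene_prev: dict) -> None:
    """Store, per category, the content differences between scene X and scene X-1
    so the ever-growing classification database can inform future recognition."""
    for key in ("objects", "speech_terms"):
        new_items = set(scene_x.get(key, [])) - set(scene_prev.get(key, []))
        db.setdefault(key, set()).update(new_items)

# Hypothetical usage:
db: dict = {}
update_classification_db(
    db,
    {"objects": ["couch", "lamp"], "speech_terms": ["case"]},
    {"objects": ["couch"], "speech_terms": []},
)
print(db)  # {'objects': {'lamp'}, 'speech_terms': {'case'}}
```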
While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. Various changes, modifications, and alterations in the teachings of the present invention may be contemplated by those skilled in the art without departing from the intended spirit and scope thereof. It is intended that the present invention encompass such changes and modifications.