The invention relates generally to the field of image augmentation. More specifically, the present invention relates to automated augmentation of images or videos with added static or dynamic content.
Marketing in digital video is an important and profitable industry, as billions of videos are watched every day across the globe. A variety of standard advertising techniques exist, but they each fall short of an advertising ideal in at least one important dimension. Ideally, ads should be dynamically targeted, actionable, and non-interruptive.
The current mainstream video ad formats, in which video ads roll before, during, or after a content video, are interruptive to viewers. These formats are adapted from television (TV) ads which the viewer is forced to watch. In contrast, video ads on the web are mostly skipped by viewers, and this slows down the growth of the video advertising industry.
The predominant monetization model used by publishers in broadcast media and on the internet is advertising. On the internet, space on a website is rented as real estate for placing advertisements. Unlike other multimedia content, hosting and publishing video on the internet has proved to be much more expensive due to storage and bandwidth requirements. Monetization approaches common in broadcast media such as TV, and internet advertisement media such as banners, have so far been adopted for video monetization. Current video monetization strategies can be categorized as pre-roll, mid-roll, and post-roll, also referred to as linear advertisements. There can also be ads that appear on screen with the program content by utilizing only a portion of the screen, such as banners and side bars, also referred to as nonlinear ads.
In the case of pre-roll, a short video advertisement (often at or under 30 seconds) is played before the start of the content video. Similarly, mid-roll is a short video advertisement that interrupts the content video partway through. Post-roll is similar to pre-roll except that it is placed after the video, and is often used to direct the user to additional content. Banners are overlays on top of the video content being played.
These advertisements are typically only profitable for a publisher when they play, or an overlay banner remains on the screen, for a certain amount of time. Pre-rolls are usually skipped, as the viewer does not want to wait for the requested content. Sometimes skipping is not allowed; however, this is considered a bad user experience. Overlay banner ads usually appear at a fixed location (the bottom of the screen) and suffer from banner blindness, as viewers are trained, consciously or subconsciously, to ignore predefined standard ad locations.
To overcome these limitations, there have been efforts to make ads part of the video content instead of a banner overlay. Ads can be made part of the video content through product placement, i.e. by placing the product or product ad in the scene at the time of recording the video (referred to as "product placement in content"). Examples of product placement include the use of a particular brand in movies, e.g. Pepsi bottles or Starbucks coffee in filmed scenes. Products can also be placed in pre-recorded video through computational means, such as by manually post-processing the video (referred to as "computational product placement"). Such ads are referred to as native in-video ads. Post-processing videos to manually introduce advertisements before they are uploaded to the internet or shown on broadcast channels results in an advertisement that is less distracting for the viewer. And since the advertisement is an integral part of the video, it cannot be skipped or cancelled by the viewer, and an impression is guaranteed. However, these ads are non-actionable, and targeting based on user persona is not possible with this method.
Another available technology lets publishers and advertisers tag individual products within videos, making them actionable so that products can be bought from the videos. However, the focus there is on tagging products already present within videos rather than augmenting the videos with new products. Again, this does not allow targeted placement, as the same products will appear no matter who views the video.
The native in-video advertisement mechanisms currently available fail to fulfill core requirements of the ad sector. The ads are static, lack targeting based on a specific audience, and do not allow user interaction. These ads are shown without being disclosed as advertisements, and viewers are forced to watch them as they cannot be removed once added. Thus they are only suitable for big brands that want to send subliminal messages reminding the viewer of their existence. They also lack any way to measure conversion by the user, since the ads are entirely non-interactive.
Embodiments disclosed herein include a system and method for automatically placing native in-video content, which could be any rich media content such as advertisements. The system and methods enable automatic, dynamically targeted, interactive native content (including but not limited to images, videos, text, animations, or computer-generated graphics) augmentation in real time, and decouple meta-data generation from content augmentation. In some cases the augmented content could be an advertisement; in other cases it could be any additional content. Although the method is general and can be applied to automatic augmentation of dynamically targeted, interactive native rich media content to any image or video, for clarity this disclosure will focus mainly on augmentation of ads in images and videos. Aspects of embodiments of the inventions also include automatic augmentation of dynamically targeted, interactive native content into an advert itself, such that certain content of the advert is dynamically targeted while the remaining content of the ad can be fixed.
Aspects of embodiments of the inventions include a method for automatic meta-data generation. This method can be implemented by a computing device executing instructions for various modules, including: a shot-detection module; a tracking module that can automatically track features in videos, which are then used to improve 3D plane identification; a module for automatic 3D plane identification in videos; a module for spot detection; a module for spot tracking; and a module for manual correction of identified or marked spots through an interactive ad placement interface. The method is capable of identifying repetitive shots and/or shots with similarities, and enhances the overall process of content (such as ad) augmentation, including tracking, 3D plane identification, spot detection and spot tracking. The disclosed system and method provide real-time augmentation of native, in-video, dynamic, interactive content (such as ads) on videos as well as on an image or sequence of images.
The system and method will be referred to as “ingrain.” The advertisements are not part of video content in the sense that they are not in the originally filmed scene nor do they replace the pixels in the actual video. The disclosed method of placing native in-video ads is automatic, dynamically targeted, and actionable. The ads are less distracting and less interruptive of the viewing experience than ads placed by existing methods, in part because the ads appear to be part of the scene. The disclosed methods inherit all the advantages of previous ad formats and overcome all of their stated limitations. The ad formats enabled by the embodiments described herein can adapt old ad formats (e.g. display, text, or rich media), as well as use emerging native ad formats. The ingrain system can also introduce new ad formats and cooperate with other ad formats.
Embodiments automatically analyze a video and identify regions where an ad can be placed in a most appropriate manner. The resultant metadata is stored on an ad server of the disclosed system. When a viewer requests a video, the video host provides the video data and the system server provides the metadata associated with the video and also serves ads through the system ad server. The ad server interacts with the advertisers to get the ads that need to be placed. Based on the user persona, video content-related tags, and disclosed computer vision-based analysis of the video content, a dynamic targeted and actionable ad is embedded in the video in the form of a native in-video ad. The dynamically targeted ads are served using existing ad services.
A product embodying the invention includes automatically marking regions within a video where dynamically targeted, native in-video ads can be placed. Video content creators are able to guide the process of ad placement if desired. A product embodying the invention also includes manually marking or correcting automatically marked regions within a video where pre-selected ads can be previewed. Multiple such videos are hosted on a website and viewers are brought in through social media and advertisements to view the videos. Existing methods of video ads, as well as native in-video ad methodology, can then be applied to these videos while delivering these ads to viewers. Because native in-video ads are subtle, they can be assumed to be a part of the video content itself.
Aspects of the present invention allow for analysis of a "Cooperative Conversion Rate" (CCR) by coupling native in-video ads with conventional video ad formats when the same item is advertised in both formats. In some other cases the present invention also provides for the addition of a post-roll which replays the content of a single video to remind the user of the native in-video ad, and then magnifies the ad, leading to a full post-roll. In some other embodiments of the present invention, other conventional formats can be coupled with native in-video ads in a similar way.
The network flow diagram of the system is shown in
The flow diagram of this process of video submission and metadata generation is shown in
The flow diagrams of this dynamically targeted ad delivery network are shown in
Using targeting algorithms, the ad server retrieves a targeted ad based on the received user data (406). The ad server also receives augmentation metadata which provides instructions for adding the native advertisement to the particular video (408). The ad server sends both the ad and the metadata to the video player (410), which in turn uses the metadata to include the native ad in the video as the video is played for the viewer (412).
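By way of illustration only, the augmentation metadata passed to the video player in this flow could be organized per video roughly as follows. All field names and values here are hypothetical, as the disclosure does not prescribe a concrete schema:

```python
# Hypothetical shape of the augmentation metadata for one video.
# Field names are illustrative, not taken from the disclosure.
augmentation_metadata = {
    "video_id": "v-001",
    "shots": [
        {
            "start_frame": 0,
            "end_frame": 239,
            "motion_type": "static",  # e.g. static / camera motion / object motion
            "spots": [
                {
                    "spot_id": "s-01",
                    # Per-frame 4-vertex polygon of the spot, plus a 3x3
                    # projection (homography) mapping a standard ad size
                    # into the spot for that frame.
                    "tracks": {
                        0: {
                            "quad": [[10, 10], [110, 12], [108, 80], [9, 78]],
                            "projection": [[1.0, 0.0, 10.0],
                                           [0.0, 1.0, 10.0],
                                           [0.0, 0.0, 1.0]],
                        },
                    },
                },
            ],
        },
    ],
}
```

The player would look up the shot containing the current frame, fetch the spot's projection for that frame, and warp the targeted ad accordingly.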
The steps of the flowchart 400 show that, in order to make the advert dynamically targeted, user tracking information (such as the user's persona) is also sent to the ad server to fetch an appropriate targeted ad. The same augmentation can be applied to other rich media content, such as augmentation on images instead of videos, as well as augmentation of any images, videos, animations or graphics on any video. The rich media content is not limited to adverts; it could be any generic rich media content.
Embodiments of the ingrain system described herein include a user interface referred to as the “Ad Placement Interface” (AdPI) (see
In an embodiment, the process of native in-video advertising starts with a content producer accessing the AdPI. The user can either upload a new video to an ingrain system server or submit a link to a video already uploaded on another video-hosting website. In some cases the video link can be discovered or notified automatically. The ingrain system temporarily downloads the video to a system ad server (AdPI backend server) for processing. The video upload and processing is demonstrated in
A user can place one or more marks in the video window 504 to represent locations where a native ad could be placed. Such a mark 510 is shown in
The detailed ad placement interface 600 of a disclosed ingrain system is shown in
During the video registration phase the following six major operations are performed on the video to enable automatic augmentation of native in-video ads:
These operations are performed through the use of novel functionalities available via the ingrain system. First, the video is automatically segmented into multiple shots. The system then automatically identifies and tracks multiple 3D planar regions within each shot. Entire planes or smaller regions within these planes are then selected as spots for ad placement. The system then computes various transformations to be applied to the advertisement in order to embed it into these 3D planes. The resulting information is stored in ingrain system databases as metadata, along with a video identifier (ID). When the same (now processed) video is accessed by the viewer through a video player that has the ingrain SDK running on it, the SDK uses the tracking information of the viewer (such as his/her persona, browsing history, etc.), requests a targeted advertisement, accesses the metadata stored with the video, transforms the ad into a native ad, augments the advertisement into the video as an overlay, and displays it to the viewer. The system can also perform refinements to fit the ad within the content of the video. These refinements include, but are not limited to: blending the retrieved ad content within the video content; relighting the ad according to the content of the scene to create a better augmentation with fewer visually perceivable artifacts; and selecting ad content that is similar to the video content. This similarity between ad content and video content includes one or more of the following: color similarity, motion similarity, text similarity, and other contextual similarity. The ad content could be an image, animation, video or simply a piece of text. The result of this process is automatic placement of a native in-video advertisement that is non-interruptive, dynamically targeted, and augmented.
The first major operation on the video is that of shot detection, or extraction of shot boundaries. Videos are usually composed of multiple shots, each of which is a series of frames that runs for an uninterrupted period of time. Since the present ad format augments an ad within the 3D structure of the scene, it is valid only for a single shot or a portion of a shot. Once a video is received by the ingrain system ad server, the first main processing step is to identify the shot boundaries. These boundaries are identified by analyzing the change between consecutive frames. A shot boundary is detected on a sub-sampled version of the video using two different tests: i) a trivial boundary test; and ii) a non-trivial boundary test. The trivial boundary test is a computationally efficient mechanism to identify a shot boundary. The non-trivial boundary test is performed only when the trivial test fails. In the trivial boundary test, the system acquires a current frame fi and a frame after a certain offset k, fi+k, and computes a per-pixel absolute difference (in some cases a sum of squared differences, or "SSD," is computed instead) between the two as follows:
diff(x,y)=|fi(x,y)−fi+k(x,y)|
The pixel-wise differences are then summed over all pixel locations (x,y) to obtain a single difference value, the Sum of Absolute Differences (SAD):
SAD=Σx,y diff(x,y)
If the SAD value (or SSD value in some cases) is greater than a certain automatically computed threshold (defined as μ+ασ, where μ and σ are the mean and standard deviation of the difference values and α is a predetermined blending factor which may be set, for example, to 0.7), a shot boundary is declared, and the next frame fi+k is considered the start of a new shot or scene. In cases when the SAD value (or SSD value) is below the computed threshold, it is considered a non-trivial case and the frames are processed further for detailed analysis. This ensures that there are no false negatives, i.e. missed scene changes. The non-trivial boundary test tries to find the precise boundary of every detected shot. It achieves this by processing the differences between each shot, including, for example, the motion and edge information.
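A minimal sketch of the trivial boundary test, assuming grayscale frames held as NumPy arrays and the μ+ασ threshold described above (the function names are illustrative):

```python
import numpy as np

def sad(frame_a, frame_b):
    """Sum of absolute differences between two grayscale frames."""
    return np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32)).sum()

def trivial_boundaries(frames, k=1, alpha=0.7):
    """Flag frame indices where SAD exceeds mu + alpha*sigma of all SAD values.

    frames: list of equally sized grayscale frames (e.g. uint8 arrays).
    Frames flagged here start a new shot; frames below the threshold would be
    handed to the non-trivial (motion/edge) analysis instead.
    """
    diffs = np.array([sad(frames[i], frames[i + k])
                      for i in range(len(frames) - k)], dtype=float)
    threshold = diffs.mean() + alpha * diffs.std()
    return [i + k for i, d in enumerate(diffs) if d > threshold]
```

On a clip that cuts from a dark scene to a bright one, the single large difference stands well above μ+ασ and is flagged as a boundary.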
In non-trivial analysis, motion information between the frames is computed. In some cases motion information is computed only between consecutive frames, and in other cases between frames separated by a fixed number of frames or intervals. In one approach, the optical flow between consecutive frames is computed and stored for later motion analysis. Instead of computing optical flow between just the current and next frames, the flow is computed between each consecutive pair of frames up to n frames following the current frame. In some cases the optical flow is computed between consecutive frames up to n frames before the current frame, provided that at least n frames have already been processed. Each time an optical flow is computed, a counter is incremented and checked to determine whether it has reached its maximum desired value (initially set in the system). For example, a maximum value of 7 is used in some cases, and in other cases the maximum value is computed based on the frame rate. In yet other cases, some motion feature other than optical flow is computed.
Once the shot boundaries are identified, the ingrain system then performs classification of shots. To perform the shot classification, the system computes the statistics on the computed motion information. In the case of optical flow, a histogram of its X and Y motion components is created. In some cases this histogram is computed by grouping together similar motion values (i.e. values within the interval of [x−a, x+a], where a is a positive real valued number) into the same bins of the histogram. This grouping may be done independently on X and Y components, or may be done on a total magnitude of the vector obtained from X and Y components. Frequencies of various bins of the histogram are then analyzed to estimate the motion type in the shot. In some cases only the frequency in the highest bin is analyzed; if its value is below a certain minimum threshold, its motion type is declared to be static (i.e. without significant motion). If its value is above a higher maximum threshold, then its motion type is declared to be camera motion. Otherwise, with values between the two thresholds, the motion type is declared to be an object motion—that is, one or more objects in the scene are moving. As an alternative, the motion type may be classified as static when the frequency of the highest bin is lower than a threshold and as continuous motion when the frequency of the highest bin is higher than a threshold. These low and high thresholds can also be dynamically computed. The continuous motion case can be further classified as either being one that can be defined by a single homography or one that can be defined only by multiple homographies.
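As an illustrative sketch, the dominant-bin analysis above could be implemented as follows. The two thresholds here are hypothetical fractions of the total count, not values taken from the disclosure:

```python
import numpy as np

def classify_motion(flow_x, flow_y, bins=16, t_min=0.3, t_max=0.7):
    """Classify a shot's motion type from the highest bin of its flow histogram.

    flow_x, flow_y: X and Y optical-flow components for the shot.
    t_min, t_max: illustrative minimum/maximum threshold fractions.
    Returns "static", "camera" (camera motion), or "object" (object motion).
    """
    # Histogram of total flow magnitudes (X and Y combined); the disclosure
    # also allows binning X and Y components independently.
    magnitude = np.hypot(np.asarray(flow_x), np.asarray(flow_y)).ravel()
    hist, _ = np.histogram(magnitude, bins=bins)
    peak = hist.max() / hist.sum()  # frequency of the highest bin
    if peak < t_min:
        return "static"
    if peak > t_max:
        return "camera"
    return "object"
```

A uniform spread of magnitudes yields a low peak (static), a single dominant motion concentrates one bin (camera motion), and a mixture falls between the two thresholds (object motion).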
The shot detection and classification information is stored in the metadata along with the video, to be used by later modules such as the tracking module, which uses it to decide where to stop tracking previous frames and reinitialize tracks. The system then proceeds to process the next shot. The flow diagram 700 of an algorithm for shot classification is shown in
The next operation is to automatically identify 3D planes across some or all of the scenes in the video. The planes are identified by analyzing the geometric information in the scene. The ingrain system identifies regions in the scene that are suitable for placing ads without degrading the video content. These include regular flat regions in the scene such as flat walls, windows, and other rectangular structures common in the man-made world. To identify such 3D planes in the scene, an embodiment uses angle regularity as a geometric constraint for reconstruction of 3D structure from a single image (referred to as "structure from angle regularities" or "SfAR"). A key idea in exploiting angle regularity is that the image of a 3D plane can be rectified to a fronto-parallel view by searching for the homography that maximizes the number of orthogonal angles between projected line-pairs. This homography yields the normal vector of the 3D plane. The present approach is fully automatic and is applicable to both single-plane and multi-planar scenarios. The invented method does not place any restriction on plane orientations. Many flat-region hypotheses are generated using angle regularity, vanishing points, and single-view learning based methods. The rectangular patches used for segmentation need not be axis-aligned. The camera can be in any arbitrary orientation, and visibility of the ground plane is not required. The planar identification process gives multiple hypotheses for spot identification.
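The search over candidate homographies is not detailed above; the scoring of a single candidate, i.e. counting projected line-pairs that become orthogonal after rectification, could be sketched as follows (the angular tolerance is an illustrative assumption):

```python
import numpy as np

def orthogonality_score(H, line_pairs, cos_tol=0.05):
    """Count line pairs whose directions become orthogonal after warping by H.

    line_pairs: list of ((p0, p1), (q0, q1)) line-segment endpoint pairs
    in image coordinates. cos_tol bounds |cos(angle)| near zero, i.e.
    accepts angles close to 90 degrees (an illustrative tolerance).
    """
    def warp(p):
        # Apply the homography to a 2D point in homogeneous coordinates.
        v = H @ np.array([p[0], p[1], 1.0])
        return v[:2] / v[2]

    score = 0
    for (p0, p1), (q0, q1) in line_pairs:
        d1 = warp(p1) - warp(p0)
        d2 = warp(q1) - warp(q0)
        cos = abs(np.dot(d1, d2)) / (np.linalg.norm(d1) * np.linalg.norm(d2))
        if cos < cos_tol:
            score += 1
    return score
```

The rectifying homography of a plane is then the candidate that maximizes this score over the plane's detected line-pairs.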
Once the hypothetical flat 3D regions are identified, the next major operation is to track and verify these regions across the shot. As discussed earlier, shots are classified as having either static or continuous camera motion, and camera motion can be further classified as either motion that can be explained by a single homography (PTZ or single plane, i.e. no parallax) or motion that can be defined only by multiple homographies (multiple planes as well as translation, i.e. parallax). Depending upon the camera motion, two different tracking algorithms are disclosed here. In the case of generic camera motion, multiple planes within the same frame are identified and tracked; in the case of PTZ or a single plane, only one homography is needed. When a user submits a video or a video link through the AdPI, the tracking process can start in one of two ways: i) automatically, through 3D understanding of the scene; or ii) after manual initialization by the content producer or the AdPI administrator.
In one embodiment, the multi-view tracking algorithm is an extension of single-video geometric video parsing and requires computing line and vanishing point matches along with feature point tracking. These matched vanishing points serve to constrain the search for a homography (by providing two fixed correspondences in RANSAC sampling), since the whole image could otherwise be related by a single unconstrained homography in a narrow-baseline case. Thus the homography will always correspond to the correct vanishing points and the tracked rectangle will always be distorted correctly. Not all parallel lines grouped with one vanishing point correspond to the same plane; coplanar subsets are instead identified by further analyzing the matched lines. This way, when the user marks a rectangle by snapping it to some lines in the neighborhood, all the needed homographies are computed without performing any feature tracking. Moreover, the orientation map of planes generated from physically coplanar subsets is more accurate as well. This also allows the user to visualize other physically coplanar lines when marking the rectangle, as an additional visual aid, either confirming the rectangle tracking or prompting an additional check in a subsequent frame (e.g., in case the physically coplanar set is not detected correctly).
Optionally, Delaunay triangulation can be utilized so that even a rectangle marked inside completely flat regions will be associated to some features (which form the Delaunay triangles it intersects with). If two vanishing points are available, they can be utilized as a default part of the RANSAC random samples and two other points can be picked at random. This ensures additional speed and stability. The population for RANSAC is all the feature points inside a marked rectangle as well as (optionally) the features forming all the Delaunay triangles it intersects with.
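A sketch of homography RANSAC with the two matched vanishing points held fixed in every sample, as described above. The DLT solver and parameter values are illustrative, and the vanishing points are assumed finite so they can be used as ordinary point correspondences:

```python
import numpy as np

def dlt_homography(src, dst):
    """Direct linear transform: homography from four point correspondences."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(A))
    return Vt[-1].reshape(3, 3)  # null-space vector, reshaped to 3x3

def ransac_with_vps(pts_a, pts_b, vp_a, vp_b, iters=200, tol=2.0, seed=0):
    """RANSAC homography where the two matched vanishing points are always sampled.

    pts_a, pts_b: (N, 2) matched feature points in two views.
    vp_a, vp_b: (2, 2) matched vanishing points used as fixed correspondences.
    tol: inlier reprojection threshold in pixels (illustrative).
    """
    rng = np.random.default_rng(seed)
    best_H, best_inliers = None, -1
    for _ in range(iters):
        # Two fixed vanishing-point correspondences + two random feature points.
        idx = rng.choice(len(pts_a), 2, replace=False)
        H = dlt_homography(np.vstack([vp_a, pts_a[idx]]),
                           np.vstack([vp_b, pts_b[idx]]))
        proj = np.hstack([pts_a, np.ones((len(pts_a), 1))]) @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = int((np.linalg.norm(proj - pts_b, axis=1) < tol).sum())
        if inliers > best_inliers:
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```

Fixing the vanishing-point correspondences both speeds up the sampling and ensures the estimated homography respects the scene's dominant directions.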
The initial results of these tracking processes can then be refined using projective flow based techniques. For a moving camera scenario, this step also involves background subtraction to test for the visibility of the rectangle in all views.
In a different embodiment, a single-view tracking algorithm may be required where little or no camera motion is identified within a shot or scene. The single-view tracking algorithm uses adjacent line-pairs and appearance-based segmentation. Physically adjacent line-pairs are detected (similarly to SfAR), with an additional appearance-based test such as a Harris corner-ness measure or an edge-corner measure to remove false positives in the SfAR line-pairs. Since these are only adjacent pairs in 2D, they might occur at discontinuous lines in 3D. If the discontinuous line is on the rectangle boundary, the line evidence is assumed to come from two lines in 3D, one on each plane. When geometry does not provide enough cues, the system may fall back to segmentation for rectangles as well as planes using an approach based on appearance, texture, and gradient entropy.
The next major operation is to identify and track spots for ad placement. Ads are not meant to be placed on the entire detected and tracked plane; instead, a small sub-region within these planes, called a "spot," is used to place ads. These spots are detected using a ratio test performed on the set of rectangles that were used to form a plane in 3D. Spot tracking is performed by utilizing the tracks obtained for each plane; in fact, the tracks associated with a spot are simply a subset of the tracks associated with the plane inscribing that spot. Additional smoothing, filtering and refinements are applied to remove jitter and noise in these tracks. The tracks, along with the 3D position and orientation of planes and spots across the shots, are then stored in the meta-data along with the video.
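The ratio test itself is not specified above; one plausible reading, matching the plane's aspect ratio against standard ad sizes and inscribing the best match, could be sketched as follows (the size list and selection rule are illustrative assumptions):

```python
# Illustrative standard ad sizes (width, height); common IAB display units.
STANDARD_AD_SIZES = [(300, 250), (728, 90), (160, 600)]

def best_spot(plane_w, plane_h, ad_sizes=STANDARD_AD_SIZES):
    """Pick the ad size whose aspect ratio best matches the plane's,
    then scale it to fit entirely inside the plane.

    A hypothetical form of the ratio test; returns the spot's (width, height).
    """
    plane_ratio = plane_w / plane_h
    # Choose the candidate with the closest aspect ratio to the plane.
    w, h = min(ad_sizes, key=lambda s: abs(s[0] / s[1] - plane_ratio))
    # Inscribe it: scale uniformly so it fits within the plane.
    scale = min(plane_w / w, plane_h / h)
    return int(w * scale), int(h * scale)
```

A wide, low plane would thus select a leaderboard-shaped spot, while a roughly square plane would select a medium-rectangle-shaped one.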
The ingrain method and system can also perform analysis of the video content and can deliver ads that are relevant to the video content. Various aspects of the video can be analyzed by the system, including but not limited to 3D content of the scene in the video, color content, scene lighting, position of light sources (particularly sun vector), motion information, amount of excitement in the scene using audio visual analysis, understanding through subtitles and available transcription via speech-to-text etc. The ad can be modified to better fit the content of the video in one or more of aspects of the video content. These modifications include, but are not limited to, color blending, text, conversions to appropriate size, shape, language etc.
In the case of manual initialization, the user is asked to mark a region (spot) in the shot appropriate for placing an advertisement. The AdPI also allows users to manually select one of the suggested ad spots, or to identify a region or spot in one of the frames of a shot as a potential place for one or more native in-video ads. Region marking involves drawing a polygon, which could be a four-vertex polygon representing a projected rectangular patch in the scene. The user is required to mark the rectangle in just one frame of the scene. The system then automatically tracks this polygonal patch across each frame of the scene. Tracks of both manually marked spots and automatically detected spots can be interactively corrected through the AdPI. In some cases, the ingrain system automatically tracks identified 3D planes (identified through the automatic plane identification algorithm or through manual identification) using feature-based tracking. The system first detects the salient features in a frame and then tracks them in the next frame. To improve robustness against scene variations, additional features are also detected and added to the list of features to be tracked in the next frame. Feature tracking can also be performed using any tracker that makes use of spatial intensity information to direct the search for the position that yields the best match.
The system then performs a random sampling of the features using a modified RANSAC implementation that identifies inliers and filters outliers. The outliers are then removed from the list of features and a new region is computed using the existing features (i.e. the set of inliers). To speed up the overall tracking, feature correspondence between two consecutive frames is done by searching for each feature within a small rectangular window centered on each feature or an extended window that encloses all the features and also includes an extra margin within the window. This also increases robustness against symmetric structures within the scene.
Across two consecutive frames, some portions within the polygon may become occluded or lost due to noise while others may reappear. To enable consistent and smooth tracking across the shot, new features are detected at each frame, and features with significantly high confidence are added to the set of features to be tracked. Similarly, features that are occluded or whose tracks are noisy result in a very low confidence match and are thus discarded. Once the feature correspondence is done, the extents of the polygon are established and any new features detected outside these extents are also discarded. A flow diagram of the tracking algorithm according to an embodiment is shown in
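One possible form of this per-frame track maintenance, dropping low-confidence tracks and admitting new detections only inside the polygon's extents, assuming a simple bounding-box test and a hypothetical confidence cutoff:

```python
import numpy as np

def refresh_features(tracked_pts, confidences, new_pts, polygon_bbox, c_min=0.5):
    """Update the feature set for the next frame.

    tracked_pts: (N, 2) currently tracked points; confidences: (N,) match scores.
    new_pts: (M, 2) freshly detected features; polygon_bbox: (x0, y0, x1, y1)
    extents of the tracked polygon. c_min is an illustrative cutoff.
    """
    x0, y0, x1, y1 = polygon_bbox
    # Drop occluded/noisy tracks (low-confidence matches).
    kept = tracked_pts[confidences >= c_min]
    # Admit only new detections that fall inside the polygon's extents.
    inside = ((new_pts[:, 0] >= x0) & (new_pts[:, 0] <= x1) &
              (new_pts[:, 1] >= y0) & (new_pts[:, 1] <= y1))
    return np.vstack([kept, new_pts[inside]])
```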
Once all the polygonal regions have been tracked across their respective shots, the system computes the projection matrices between each frame and between all standard ad sizes (including projections of 3D ads) and the deformed ad placement region in the scene. The tracking information, along with the projection matrices, is then stored for each video frame within the database on the system server. In some cases the database can be a relational database; in other cases it could be a flat-file database. The system then analyzes the appearance of the ad placement regions as well as the remaining frames for selection of advertisements with appropriate color schemes.
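As an illustrative sketch of computing one such projection, the homography taking an upright standard-size ad rectangle onto the tracked four-vertex spot in a given frame (a minimal DLT; the per-frame quad is assumed to come from the stored tracks):

```python
import numpy as np

def dlt_homography(src, dst):
    """Direct linear transform: homography from four point correspondences."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(A))
    return Vt[-1].reshape(3, 3)

def ad_to_spot_homography(ad_w, ad_h, quad):
    """Homography mapping an ad_w x ad_h ad rectangle onto a 4-vertex spot.

    quad: the spot's four corners for one frame, in the same winding order
    as the ad corners (top-left, top-right, bottom-right, bottom-left).
    """
    src = np.array([[0, 0], [ad_w, 0], [ad_w, ad_h], [0, ad_h]], float)
    return dlt_homography(src, np.asarray(quad, float))
```

Warping every ad pixel through this matrix (and storing the matrix per frame, as described above) places the ad in perspective within the scene.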
In some implementations, the final major processing step is that of preview and correction of detected and tracked ad placement spots. Once the video is processed at the ingrain server and all the transformation matrices are computed, the video is presented to the user as a “preview”. The AdPI presents these identified planes to the user as suggested regions for advertisement placement. These planes are presented as editable polygons whose vertices can be adjusted by the user. The user can select one or more of such planes, or can modify these planes to improve the quality of regions for ad placement. Once the user has finished editing the planes, the system tracks them across their respective shot using the same tracking approach as the one employed in case of manual initialization of regions.
In some cases, once the video is uploaded or its link is submitted to the ingrain system, the user can opt to wait for the processing to complete, or can be informed via a message when the video is ready for preview. In some cases the preview is available once the entire video is processed. In yet other cases the preview is available once a particular shot has been processed. The system selects an advertisement from the ad repository to be inserted into the scene. The ad could be selected by analyzing the appearance properties of the scene, or picked at random and then modified to resemble the color and lighting properties of the scene. The ad could also be selected by the user of the system from the list of available advertisements. The user of the ad placement and preview (APP) module of the AdPI can change as many ads as desired.
The ingrain interface also allows users to correct any inaccuracies to maximize the viewing quality of the scene. The interface also allows the user to select the color scheme of suitable ads. Each time the user previews the video, the system dynamically modifies the ad for best viewing quality using the tracking information, the computed projection matrices, and the appearance information. These modifications include transforming the advertisement using the projection matrices and warping it into the ad region, alpha blending the advertisement with the scene, edge preserving color blending using Poisson image editing through Laplacian pyramids, and relighting. In some cases the video content and the ad placement polygon are first projected into the advertisement space using the inverse of projection matrices. The advertisement is then modified using the same techniques listed above, and the modified ad is projected back and embedded into the scene using the projection matrices.
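Of the refinements listed above, simple alpha blending of the warped ad into the frame could be sketched as follows (the mask and alpha handling are illustrative; Poisson blending and relighting are not shown):

```python
import numpy as np

def alpha_blend(frame, warped_ad, mask, alpha=0.85):
    """Composite the warped ad over the frame inside the spot mask.

    frame, warped_ad: (H, W, 3) uint8 images; warped_ad is the advertisement
    already transformed into the spot's position via the projection matrix.
    mask: (H, W) array, 1 inside the spot region, 0 elsewhere.
    alpha: opacity of the ad over the scene (illustrative default).
    """
    m = mask.astype(float)[..., None] * alpha  # per-pixel blend weight
    out = frame.astype(float) * (1.0 - m) + warped_ad.astype(float) * m
    return out.astype(np.uint8)
```

With alpha below 1.0, some of the underlying scene texture shows through, which helps the ad read as part of the scene rather than a pasted overlay.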
In some cases all the processing to augment an advertisement into the video is done at the system server and only the processed frames are transmitted to the user for preview. In some other cases, all or some of the processing is done by the video player on the client terminal using a system software module running on the client terminal. In some cases, tracking, estimation of projection matrices, and automatic video understanding are performed at the system server side, while the projection and blending are done by the video player on the client terminal using system methods and the corresponding metadata of the video stored on the ad server. In some cases, the system can analyze the processing capabilities of the client terminal and dynamically decide which steps should be processed at the server side and which at the client side to maintain the responsiveness of the system.
Once the content creator is satisfied with the output quality of the video, the modified metadata is stored back on the system server and the video is published for viewing by the viewers. Note that the advertisements used at the time of preview are for preview purposes only; the actual advertisement shown to the viewer depends entirely on the tracking information about the viewer and other metadata associated with the video.
An example of an overall process 900 for publishing a video with native in-video ads using the proposed system is illustrated in
Once the video is uploaded to either a system server or a partner publisher's server, the system processes the video and automatically identifies the region(s) within the 3D scene of the video where native ads can be placed. The content generator can then preview the result and make any adjustments if needed. This metadata is then stored along with the identity information on the system. Videos provided via other platforms are then removed, and only those videos that the content generator uploaded from local storage to the system are kept.
When the viewer plays the video using a video player with the ingrain system software, the SDK takes the user persona information and video metadata and requests a targeted advertisement. Using the metadata, these regions are automatically replaced by dynamically targeted advertisements without disrupting the viewing experience. This augments the video content with the proposed ad content.
These ads are also interactive in the same manner as banner ads, in that the user can click on or otherwise select an advertisement in order to proceed to a website associated with the advertised product. The proposed ad format is also dynamically targeted and changes based on the user persona. Furthermore, the proposed ad format is suitable for any screen, including smart TVs, touch pads, mobile devices, and wearable devices. The described system and methods are also applicable to real-time augmented reality in addition to pictures and videos on desktop and mobile.
Using publication methods according to the present invention, systems could be set up to remunerate content publishers in a variety of ways. For instance, the system can be presented as a platform interposed between the ad delivery networks and the publisher. The system software (also referred to as a “player host”) running on video players (with the system SDK) acts as a publisher for any website or mobile application that embeds it. The platform receives compensation for delivering the ad, which is then shared with those who have embedded the player host. The compensation can be calculated using any standard online advertising metric (such as CPM, CPC, CPV, or CPA). The amount of compensation offered to the player host can be negotiated on a client-by-client basis. Since the ad content is modified to make it a non-distracting part of the 3D scene present in the video, CPM and CPV methods are redefined for native in-video advertisements. The disclosed new format of ads is less disruptive for viewers than existing formats, yielding more impressions and higher conversion rates. Given the lack of monetization on video content, particularly in the mobile space, once the proposed native in-video advertisement mechanism is proven more effective for both advertisers and publishers, it can be widely adopted.
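As a simple numerical illustration of one such compensation metric, a CPM (cost per mille, i.e., per thousand impressions) payout to a player host can be computed as follows; the impression count, CPM rate, and revenue share below are invented for the example.

```python
def cpm_revenue(impressions, cpm_rate, publisher_share):
    """Revenue owed to a player host under a CPM model.

    impressions      -- number of ad impressions delivered
    cpm_rate         -- price paid per 1,000 impressions
    publisher_share  -- fraction of platform revenue passed to the host
    """
    return impressions / 1000 * cpm_rate * publisher_share

# 500,000 impressions at a $4.00 CPM with a 70% share to the player host.
print(cpm_revenue(500_000, 4.00, 0.70))  # 1400.0
```

Analogous helpers could be written for CPC, CPV, or CPA by substituting clicks, views, or actions for impressions.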
Each week, a variety of popular shows release previously-unseen new episodes, which provide additional opportunities for native content. Although the episodes are new, a given show will often re-use the same sets and camera angles episode after episode. This disclosure introduces a technique for “on-boarding,” by which data from existing episodes of a given show can be used to more accurately and efficiently analyze a new episode of the same show.
The on-boarding process computes several show-specific parameters and data that can be used for fully automatic processing of unseen video of the already on-boarded show. On-boarding involves understanding the visual content present in the scene, creating a 3D understanding of the scene, training classifiers for recognizing objects present in the scene, and tuning the parameters of several modules. Each of the following modules may be specially tuned or trained during on-boarding: shot/scene segmentation, duplicate and target scene identification, 3D plane identification, spot ROI detection, training of object detectors, and mapping of scene lighting and shading.
Duplicate and Target Scene Identification
In some implementations, on-boarding is an interactive process involving user input in understanding the video content. In some other embodiments, on-boarding is a fully automated process that can understand and on-board a new unseen episode or show without any user input.
When a new show is to be on-boarded, multiple episodes are provided to the ingrain system. In some implementations, user feedback is taken into account in order to refine the on-boarding process and avoid major errors.
In one implementation, the system first performs the shot segmentation and presents the output to the user, as described above with respect to scene segmentation and
Using the shot segmentation interface 1100 as shown in
In some embodiments of the present invention, global feature point tracking is performed across the entire video. Global feature point tracking is performed by first detecting salient features in each frame and then finding correspondences between the features in consecutive frames. In some cases a hybrid of KLT and SIFT is employed to perform tracking. The hybrid approach first applies KLT on a low-resolution version of the video to identify moving patches. More precise tracking is then performed in each of these patches using SIFT. The hybrid approach provides computational efficiency and reduces time complexity.
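The coarse pass of this hybrid approach can be sketched as follows: a cheap comparison of downsampled frames flags the low-resolution cells whose content changed, and only those patches would then be handed to a precise matcher (SIFT, per the text above). The frame data, downsampling factor, and motion threshold are invented for illustration; real KLT and SIFT would come from a computer vision library.

```python
def downsample(frame, factor):
    """Average-pool a 2-D list of pixel intensities by `factor`."""
    h, w = len(frame), len(frame[0])
    return [[sum(frame[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w // factor)]
            for y in range(h // factor)]

def moving_patches(prev, curr, factor=2, thresh=10):
    """Return low-resolution cell coordinates whose intensity changed notably."""
    a, b = downsample(prev, factor), downsample(curr, factor)
    return [(x, y) for y in range(len(a)) for x in range(len(a[0]))
            if abs(a[y][x] - b[y][x]) > thresh]

# Two tiny 4x4 frames: only the top-left 2x2 patch changes between them.
prev = [[0] * 4 for _ in range(4)]
curr = [[0] * 4 for _ in range(4)]
curr[0][0] = curr[0][1] = curr[1][0] = curr[1][1] = 100
print(moving_patches(prev, curr))  # [(0, 0)]
```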
In some embodiments of the present invention, the hybrid tracking process can result in multiple one-dimensional signals. The system can perform C0 (end point) and C1 (first derivative) continuity tests on each of these signals to compute a track continuity score. The aggregate of the track continuity scores can be computed for each frame of the video. Applying a threshold to the track continuity score can be used to detect a scene boundary. In some cases, during on-boarding, the threshold on the aggregate track continuity score can be automatically tuned to maximize accuracy.
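A minimal sketch of the C0/C1 continuity test on a single one-dimensional track signal follows: a large jump in position (C0) or in first derivative (C1) between consecutive frames marks a discontinuity, and thresholding such discontinuities flags candidate scene boundaries. The tolerances here are illustrative, not the tuned values the text describes.

```python
def continuity_breaks(signal, c0_tol=5.0, c1_tol=3.0):
    """Return frame indices where C0 or C1 continuity is violated."""
    breaks = []
    for i in range(1, len(signal)):
        if abs(signal[i] - signal[i - 1]) > c0_tol:   # C0: end-point jump
            breaks.append(i)
            continue
        # Skip the C1 test immediately after a C0 break, where the
        # derivative is dominated by the jump itself.
        if i >= 2 and (i - 1) not in breaks:
            d_prev = signal[i - 1] - signal[i - 2]
            d_curr = signal[i] - signal[i - 1]
            if abs(d_curr - d_prev) > c1_tol:          # C1: derivative jump
                breaks.append(i)
    return breaks

# Smooth motion, then a cut at frame 5.
track = [0, 1, 2, 3, 4, 50, 51, 52]
print(continuity_breaks(track))  # [5]
```

In the full system, such per-signal results would be aggregated over all tracks before the per-frame threshold is applied.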
In some cases, several other features are also computed using the hybrid KLT and SIFT signals to identify scene boundary. These features may include minimum, maximum and median track length, birth and death rate of tracks in an interval, variance and standard deviation, and others.
Once scene segmentation is performed, motion inside each frame is analyzed to classify each scene as static or moving, as illustrated in the flowchart 1300 of
In some cases, a threshold is applied on the cumulative scene motion score to classify each scene as moving or static. In some cases the user is also asked to correct the classification decision. Each time the user makes a correction, the system may automatically tune the parameters for generating the cumulative scene motion score in order to maximize the system's performance.
In many cases, the camera is moving only during a small portion of the scene. For example, at the start of a scene the camera may zoom in on a particular person and then remain static. Alternatively, the camera may only move when an object of interest moves during the scene. Such scenes are difficult to classify using a cumulative scene motion score. In such cases, the scenes may be segmented into smaller intervals which are individually analyzed and classified as static or moving.
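The static/moving classification and the sub-interval refinement above can be sketched as follows. The per-frame motion magnitudes, the threshold, and the interval size are hypothetical; note how the cumulative score calls the whole scene "moving" even though half of it is static, which is exactly the ambiguity the interval-wise classification resolves.

```python
def classify(motion, threshold=2.0):
    """Classify a scene from its per-frame motion magnitudes."""
    return "moving" if sum(motion) / len(motion) > threshold else "static"

def classify_intervals(motion, window=4, threshold=2.0):
    """Split a scene into fixed-size intervals and classify each one."""
    return [classify(motion[i:i + window], threshold)
            for i in range(0, len(motion), window)]

# Camera zooms for the first four frames, then holds still.
scene = [6, 5, 6, 5, 0, 0, 0, 0]
print(classify(scene))            # moving
print(classify_intervals(scene))  # ['moving', 'static']
```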
As part of the on-boarding process, the scene classification module may place each scene in a variety of categories in order to match it to similar scenes. The scene classification module can classify each scene as either indoor or outdoor, and further as a daytime scene, a nighttime scene, or a studio lighting scene. The scenes can be further classified as captured using a handheld or a tripod-mounted camera. Further features, such as whether the scene is single- or dual-anchor, can also be determined. This classification is done using various low-level and high-level features such as color, gradients, and the output of one or more pieces of face detection software. If the identified type of the scene is a known type, then the on-boarding already completed for the known type can be used to automatically on-board the new scene. For example, if the system has already identified a scene associated with a particular talk show (indoor, studio lights, one anchor) and the target scene of an unseen video is also classified as an indoor, studio-lights, one-anchor scenario, then this new target scene is automatically on-boarded using the knowledge from the already-known show.
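The reuse of completed on-boarding can be pictured as a lookup keyed on the scene's classification tuple. The feature names and stored shows below are invented for illustration; a real system would likely use richer descriptors and a similarity measure rather than exact matching.

```python
# Hypothetical registry of already on-boarded scene types, keyed by
# (indoor/outdoor, lighting, number of anchors).
onboarded = {
    ("indoor", "studio", 1): "talk_show_A",
    ("outdoor", "day", 2): "field_report_B",
}

def match_scene(indoor_outdoor, lighting, anchors):
    """Return the on-boarded show whose scene type matches, if any."""
    return onboarded.get((indoor_outdoor, lighting, anchors))

print(match_scene("indoor", "studio", 1))  # talk_show_A
```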
After scene segmentation, the system may undergo a process similar to that described in the flowchart 1400 of
During on-boarding, the user is presented with a duplicate scene clustering and correction interface to correct any of the incorrect clustering of duplicate scenes. The input provided by the user is then used by an iterative algorithm that tunes the threshold on cumulative scene matching score. Once the scenes are clustered together, some of the clusters are marked as target scenes and are further analyzed for detailed scene understanding.
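The threshold tuning from user corrections can be sketched as follows: pairs the user labels as duplicates should score above the threshold, and pairs labeled as different should score below it. A single midpoint update is shown; the iterative algorithm described above would repeat such updates as corrections accumulate, and the scores here are invented.

```python
def tune_threshold(labelled_scores):
    """Pick a threshold separating user-confirmed duplicates from non-duplicates.

    labelled_scores -- list of (cumulative_matching_score, is_duplicate) pairs
    supplied via the clustering correction interface.
    """
    same = [s for s, dup in labelled_scores if dup]
    diff = [s for s, dup in labelled_scores if not dup]
    # Midpoint between the weakest confirmed duplicate and the strongest
    # confirmed non-duplicate.
    return (min(same) + max(diff)) / 2

corrections = [(0.9, True), (0.7, True), (0.5, False), (0.3, False)]
print(tune_threshold(corrections))  # 0.6
```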
The ingrain system utilizes several already-trained object and environment detectors commonly available as well as object detectors specifically trained by the ingrain system to increase the scene understanding. In some cases, already trained object detectors require retraining by utilizing the examples present in the scenes from the current video set. In some other cases, object detectors and classifiers for additional objects are also trained during the on-boarding process to further improve the scene understanding for the current video and other such videos utilizing the same set. In some cases, the training is performed by cropping out several positive and negative samples from the scene, as described in the flowchart 1600 of
In some cases, the user can be presented with an interface similar to that described above with respect to
In some embodiments, scene lighting information is also extracted so that new content can be realistically rendered. This includes identifying directional light 3D vectors for shadow creation, directional light 3D vectors for reflection, parameters of plane (A,B,C,D) on which the creative will be placed, an optional weight value indicating gradient for shadow and reflection, and an optional value indicating alpha for shadow and reflection. In some cases these vectors and plane parameters are extracted automatically by analyzing color information in the scene and utilizing shape from shading and single view reconstruction, such as by exploiting angle regularity as described above. Based on this information, 4-point correspondence between creative and the plane (for spot suggestion) is established and a transformation is computed to create the shadow and reflection layers.
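Given the plane parameters (A, B, C, D) described above, positioning a reflection layer amounts to mirroring points across that plane; the sketch below shows the mirroring step, with the plane, point, and the notion of a separate alpha weight taken as illustrative assumptions.

```python
def reflect_point(p, plane):
    """Mirror a 3-D point across the plane Ax + By + Cz + D = 0."""
    A, B, C, D = plane
    x, y, z = p
    n2 = A * A + B * B + C * C                 # squared normal length
    d = (A * x + B * y + C * z + D) / n2       # scaled signed distance
    return (x - 2 * d * A, y - 2 * d * B, z - 2 * d * C)

# Ground plane z = 0, i.e. (A, B, C, D) = (0, 0, 1, 0): a point two units
# above the plane mirrors to two units below it.
print(reflect_point((1.0, 1.0, 2.0), (0, 0, 1, 0)))  # (1.0, 1.0, -2.0)
```

The reflected geometry would then be rendered with the optional alpha and gradient weights mentioned above to produce a plausible reflection layer.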
In some implementations, several standard object detectors are applied on the target scene to further enhance the understanding of the scene. For example, different detectors may be used for faces, people, upper bodies, furniture, and common objects. In some cases, the outputs of these detectors are aggregated to get a better understanding of the scene. For example, the localized faces, shelves, and objects on shelves can provide information about available empty space on the shelves where a product can be placed. All such locations in the scene are recorded and combined to create a larger region of interest (ROI) where there is a possibility of detecting objects of interest as well as finding spaces for spot insertion. In some cases the user can also create a polygon to define a ROI.
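The combination of detector outputs into a larger ROI can be sketched as merging their bounding boxes into one enclosing region; the boxes and their labels below are invented for the example.

```python
def merge_rois(boxes):
    """Merge (x1, y1, x2, y2) bounding boxes into one enclosing ROI."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

# Hypothetical detector outputs in pixel coordinates.
detections = [(10, 10, 50, 40),   # shelf
              (60, 20, 90, 70),   # object on shelf
              (30, 5, 45, 25)]    # face
print(merge_rois(detections))  # (10, 5, 90, 70)
```

A production system might instead merge only overlapping or nearby boxes into several ROIs, but the enclosing-box form shows the aggregation principle.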
In some cases, on-boarding is performed fully automatically resulting in an automatic generation of configuration files and metadata for the new unseen show/set of videos. In automatic on-boarding, results produced by different modules are directly passed to the next module without user correction/update as shown in
It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.
Although the disclosed subject matter has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the disclosed subject matter may be made without departing from the spirit and scope of the disclosed subject matter, which is limited only by the claims which follow.
This application claims priority to U.S. Provisional Patent Application No. 62/038,525, filed Aug. 18, 2014, which is incorporated by reference in its entirety as though fully disclosed herein.