The present disclosure relates to media delivery systems and, in particular, to media delivery systems that require conservation of network resources between devices.
The proliferation of media data captured by audio-visual devices in daily life has become immense, which creates significant problems in the exchange of such data over communication or computer networks. Device operators oftentimes capture video at the highest resolution and highest frame rate available, then exchange the video with remote devices for further processing. Those remote devices, however, may not require video at such high levels of quality. Accordingly, the exchange of such videos consumes device and network resources unnecessarily.
Embodiments of the present invention provide a video delivery system that generates and stores reduced-bandwidth videos (called “tracks,” for convenience) from source video. The system may include a track generator that executes functionality of application(s) to be used at sink devices, in which the track generator generates tracks from execution of the application(s) on source video and generates tracks having a reduced data size as compared to the source video. The track generator may execute a first instance of application functionality on the source video, which identifies region(s) of interest from the source video. The track generator further may downsample the source video according to downsampling parameters, and execute a second instance of application functionality on the downsampled video. The track generator may determine, from a comparison of outputs from the first and second instances of the application, whether the output from the second instance of application functionality is within an error tolerance of the output from the first instance of application functionality. If so, the track generator may generate a track from the downsampled video. In this manner, the system generates tracks that enable reliable application operation when processed by sink devices but also have reduced size as compared to source video.
Although only one sink terminal 120 is illustrated in
In an embodiment, content of a source video may be parsed into one or more regions of interest (ROIs) according to the needs of the different applications executed by sink devices 120, and tracks 152.1-152.n, 154.1-154.n, 156.1-156.n may be created from the regions of interest at one or more resolutions.
When content is stored as tracks 152.1-152.n, 154.1-154.n, 156.1-156.n, the content may be encoded according to bandwidth compression algorithms to reduce its size. Applications such as face recognition algorithms that require high resolution video may be coded by highly efficient coding techniques, such as neural network-based encoders. Track codings also may contain metadata hints that assist client-side applications 122 in performing their processing operations.
The set of tracks 152.1-152.n, 154.1-154.n, 156.1-156.n, 158 illustrated in
Sink devices 120 may identify tracks for download in a variety of ways. In one embodiment, a sink device 120 may request a track by identifying the application for which the track is to be used, which a source device 110 may use to index and retrieve the track. Alternatively, the sink device 120 may identify a requested track by identifying a purpose for the video (e.g., face recognition); a source device 110 may identify track(s) that were generated for such a purpose and supply the track. Alternatively, a source device 110 may supply to a sink device 120 a manifest file (not shown) that provides information regarding such tracks, such as the applications for which they were created, their data rates, spatial resolutions, frame rates, etc., and the sink device 120 may select an appropriate track from options presented in the manifest file.
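The manifest-based selection described above can be sketched as follows. This is a hypothetical illustration only: the manifest schema (the `purpose`, `data_rate`, `resolution`, and `frame_rate` fields) and the function name `select_track` are assumptions for the example, as the disclosure does not specify a manifest format.

```python
# Hypothetical sketch of manifest-based track selection at a sink device.
# The manifest is modeled as a list of entries describing each track.

def select_track(manifest, purpose, max_data_rate=None):
    """Pick the lowest-data-rate track entry matching the requested purpose."""
    candidates = [t for t in manifest if t["purpose"] == purpose]
    if max_data_rate is not None:
        candidates = [t for t in candidates if t["data_rate"] <= max_data_rate]
    if not candidates:
        return None  # no track was generated for this purpose
    return min(candidates, key=lambda t: t["data_rate"])

manifest = [
    {"purpose": "face_recognition", "data_rate": 2_000_000,
     "resolution": (1280, 720), "frame_rate": 5},
    {"purpose": "face_recognition", "data_rate": 500_000,
     "resolution": (640, 360), "frame_rate": 5},
    {"purpose": "trick_play", "data_rate": 300_000,
     "resolution": (320, 180), "frame_rate": 30},
]
```

A sink device requesting a face-recognition track would receive the 500 kbps entry, the smallest track that still serves the stated purpose.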
Tracks prepared as discussed herein may be used in a variety of use cases. In one instance, for example, processing of ROI-based tracks may support privacy initiatives in video conferencing, where it may be desired to obscure location-specific information from exchanged video when it is recovered and displayed. To support such an application, a sink device 120 may retrieve tracks corresponding to persons recognized in videos (see
Tracks also may support frame rate conversion operations in certain embodiments. A sink device 120 may perform frame rate upconversion on a track that has a low frame rate and is accompanied by metadata describing object motion at times between frames. In this manner, a client-side application may refine object motion estimates beyond those that would be obtained solely from the content of track frames and thereby provide higher-quality upconverted video.
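The metadata-assisted upconversion above can be sketched as follows. The metadata format (a mapping of intermediate timestamps to observed object positions) and the function name `object_position` are assumptions for illustration; the disclosure says only that the metadata describes object motion at times between frames.

```python
# Sketch of metadata-assisted frame-rate upconversion: estimate an object's
# position at an intermediate time, preferring a metadata sample over plain
# linear interpolation between the two surrounding track frames.

def object_position(t, frame_times, frame_positions, metadata):
    """Estimate an object's (x, y) position at time t.

    frame_times: (t0, t1) timestamps of the surrounding track frames.
    frame_positions: ((x0, y0), (x1, y1)) object positions in those frames.
    metadata: dict mapping intermediate timestamps to observed positions.
    """
    if t in metadata:
        return metadata[t]  # metadata sample refines the motion estimate
    t0, t1 = frame_times
    (x0, y0), (x1, y1) = frame_positions
    a = (t - t0) / (t1 - t0)  # linear interpolation fallback
    return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
```

Without metadata, the estimate is a straight-line interpolation; a metadata sample at the queried time overrides it, which is how the client-side application refines its motion estimates.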
Further, sink devices 120 may integrate content from multiple tracks into a composite user interface. A sink device 120, for example, may download a low-resolution trick-play video with high frame rate that represents motion of video content, and also download a high-resolution but low frame rate track that contains face crops at sample frames. The sink device 120 may merge these two representations into a common output interface.
The track generator 200 may include a first instance of an application 210 that may identify region(s) of interest from a source video. The application 210 typically may contain functional elements that process video for the sink application's purpose. For example, if the track is intended for use with a face recognition application, the application 210 may include functionality to recognize faces from video. So, too, with tracks intended for text recognition applications, action recognition applications, and the like; the first instance of the application 210 may include functionality corresponding to those applications. The application 210, however, need not include application functionality that is unrelated to the video-processing functionality for which tracks are generated.
The first instance of the application 210 may operate on video at a source resolution and it may output data identifying recognized content in the video. Continuing with the face recognition example, the application 210 may output data identifying face(s) that the application 210 recognizes and the location(s) within video where those face(s) are recognized. Recognized content may be processed as regions of interest within the track generator 200.
A downsampler 220 may downsample source video according to a set of downsampling parameters. Downsampling may occur via spatial downsampling, which typically reduces the resolution of source video (sometimes perceived as a reduction in the frames' sizes), by temporal downsampling, which causes a reduction of the video's frame rate, or both. The downsampler 220 may output a downsampled copy of the source video to a second instance of the application 230.
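The downsampler's two parameter axes can be sketched as below. This is a minimal illustration of the parameterization only; a real implementation would low-pass filter before decimating, and the function name `downsample` is an assumption, not a name from the disclosure.

```python
# Minimal sketch of the downsampler 220: spatial downsampling by pixel
# decimation and temporal downsampling by frame decimation.

def downsample(video, spatial_step=1, temporal_step=1):
    """video: list of frames, each frame a 2-D list of pixel values.

    temporal_step reduces the frame rate; spatial_step reduces the
    resolution (perceived as a reduction in frame size).
    """
    out = []
    for frame in video[::temporal_step]:           # temporal: drop frames
        out.append([row[::spatial_step]            # spatial: drop columns
                    for row in frame[::spatial_step]])  # ...and rows
    return out
```

With `spatial_step=2, temporal_step=2`, a four-frame 4x4 video becomes a two-frame 2x2 video, a 16x reduction in raw sample count.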
The second instance of the application 230 may process the downsampled video according to its operation. As with the first instance 210 of the application, it is expected that the second application 230 will be provided to perform a predetermined action on video, such as performing face recognition, object recognition, action recognition, text recognition, or the like. And, as with the first instance of the application 210, the second instance of the application 230 may output data identifying the region(s) of interest recognized from input video (this time, the downsampled video) and their location(s). But, again, the second instance of the application 230 need not have functionality of sink device applications 122 (
An error estimator 240 may compare region of interest data from the first and second instances of the application 210, 230. The error estimator 240 may determine whether the recognized regions of interest and locations information from the two application instances agree with each other within a predetermined range of error. If so, the video output by the downsampler 220 may be processed into tracks and placed in storage. Specifically, a partitioning unit 250 may generate cropped versions of the downsampled video that correspond to the regions of interest identified by the second instance of the application 230.
If the error estimator 240 determines that the recognized regions of interest and locations information from the two application instances do not agree with each other, the error estimator 240 may cause a parameter generator 260 to revise downsampling parameters. In this manner, the downsampler 220 may generate a new version of downsampled source video and operation of the second instance of the application 230 and the error estimator 240 may repeat.
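The feedback loop among the downsampler 220, the second application instance 230, the error estimator 240, and the parameter generator 260 can be sketched as follows. All names here are illustrative assumptions: `roi_app` stands in for an application instance that maps video to ROI boxes, intersection-over-union is one plausible agreement measure (the disclosure specifies only "a predetermined range of error"), and the coarse-to-fine parameter schedule is one possible parameter-generator strategy.

```python
# Sketch of the error-estimator / parameter-generator feedback loop.

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def within_tolerance(rois_ref, rois_test, min_iou=0.7):
    """Error estimator: do the two ROI sets agree within tolerance?"""
    if len(rois_ref) != len(rois_test):
        return False  # content recognized from one video but not the other
    return all(iou(r, t) >= min_iou for r, t in zip(rois_ref, rois_test))

def converge(source_video, roi_app, downsample, steps=(8, 4, 2, 1)):
    """Try progressively finer downsampling parameters until the application's
    output on the downsampled video matches its output on the source video."""
    rois_ref = roi_app(source_video)          # first application instance
    for step in steps:                        # parameter generator schedule
        candidate = downsample(source_video, step)
        if within_tolerance(rois_ref, roi_app(candidate)):
            return candidate, step            # track video + converged parameter
    return source_video, 1                    # fall back to source resolution
```

Each failed comparison causes the parameter generator to emit a finer step, the downsampler to produce a new candidate, and the second application instance and error estimator to run again, as the paragraph above describes.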
It is expected that the track generator 200 will converge on a set of downsampling parameters that cause the second instance of the application 230 to operate reliably upon downsampled video obtained from the downsampler 220. Such convergence will lead to generation of tracks that induce reliable operation of an application at a sink device 120 (
The error estimator 240 also may perform other estimates of application errors. Some sink device applications 122 (
In this example, when text recognition is applied to the source video, a first region of interest ROI1 may be identified therefrom. The downsampler may downsample the source video both spatially and temporally. Spatial downsampling parameters may be determined to converge appropriately when characters from the text crawl are properly recognized. Temporal downsampling parameters may be determined to converge appropriately when the frames in which a new text character or a new word appears are properly recognized. For example, frames in which individual characters (or words) are presented only partially may be removed from the track.
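The temporal selection for a text-crawl ROI can be sketched as follows. The function names `select_text_frames` and `recognize_text` are illustrative assumptions; `recognize_text` stands in for the OCR step, and the sketch relies on the observation that frames showing a partially rendered character tend to yield the same recognized string as the preceding frame.

```python
# Sketch of temporal downsampling for a text crawl: keep only frames in
# which the recognized string changes, i.e. frames where a new character
# or word has fully appeared; in-between frames are dropped from the track.

def select_text_frames(frames, recognize_text):
    """Return indices of frames whose recognized text differs from the
    previously kept frame's text."""
    kept, last_text = [], None
    for i, frame in enumerate(frames):
        text = recognize_text(frame)
        if text != last_text:
            kept.append(i)     # a new character/word became recognizable
            last_text = text
    return kept
```

The retained frames are exactly those at which the recognized content advances, which is the convergence condition stated for the temporal downsampling parameters.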
The example of
The device 600 may possess a transceiver system 630 to communicate with other system components, for example, sink devices 120 (
Although the source device (
Embodiments of the present disclosure also find application with on-device generation of video tracks in which tracks are generated and consumed by a common device. In such applications, certain processing operations such as video coding/decoding and video scaling may be performed with less resource consumption than would occur when performing such operations on the source video from which the tracks are generated. Moreover, use of tracks may enable fast seeking to items of interest such as recognized people, video detected as having face(s), and/or video having recognized text. The techniques described herein find application in local processing operations where the tracks are generated and processed on a common device.
Several embodiments of the disclosure are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the disclosure are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the disclosure. The present specification describes components and functions that may be implemented in particular embodiments, which may operate in accordance with one or more particular standards and protocols. However, the disclosure is not limited to such standards and protocols. Such standards periodically may be superseded by faster or more efficient equivalents having essentially the same functions. Accordingly, replacement standards and protocols having the same or similar functions are considered equivalents thereof.
The present disclosure claims the benefit of priority of U.S. application Ser. No. 63/348,282, filed Jun. 2, 2022, entitled "Analytic- and Application-Aware Video Derivative Generation Techniques," the disclosure of which is incorporated herein in its entirety.