Embodiments of the present disclosure relate to image processing. More specifically, embodiments of the present disclosure relate to processing video information.
Video is ubiquitous on the Internet. In fact, many people today watch video exclusively online and, according to recent statistics, almost 90% of Internet traffic is attributable to video. All of this is possible, in part, due to sophisticated video compression. Video compression thus plays an important role in the modern world's communication infrastructure. By way of illustration, uncompressed video at standard resolution (i.e., 640×480) would require approximately 240 Mbps of bandwidth to transmit. This amount of bandwidth, for just a standard-resolution video, significantly exceeds the capacity of today's infrastructure and, for that matter, the widely available infrastructure of the foreseeable future.
Embodiments of the present disclosure include systems and methods for processing video data. Embodiments utilize segmentation and object analysis techniques to achieve video processing such as, for example, compression and/or encoding with greater efficiency and quality than conventional video processing techniques.
While the disclosed subject matter is amenable to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the subject matter disclosed herein to the particular embodiments described. On the contrary, the disclosure is intended to cover all modifications, equivalents, and alternatives falling within the scope of the subject matter disclosed herein, and as defined by the appended claims.
As used herein in association with values (e.g., terms of magnitude, measurement, and/or other degrees of qualitative and/or quantitative observations that are used herein with respect to characteristics (e.g., dimensions, measurements, attributes, components, etc.) and/or ranges thereof, of tangible things (e.g., products, inventory, etc.) and/or intangible things (e.g., data, electronic representations of currency, accounts, information, portions of things (e.g., percentages, fractions), calculations, data models, dynamic system models, algorithms, parameters, etc.), “about” and “approximately” may be used, interchangeably, to refer to a value, configuration, orientation, and/or other characteristic that is equal to (or the same as) the stated value, configuration, orientation, and/or other characteristic or equal to (or the same as) a value, configuration, orientation, and/or other characteristic that is reasonably close to the stated value, configuration, orientation, and/or other characteristic, but that may differ by a reasonably small amount such as will be understood, and readily ascertained, by individuals having ordinary skill in the relevant arts to be attributable to measurement error; differences in measurement and/or manufacturing equipment calibration; human error in reading and/or setting measurements; adjustments made to optimize performance and/or structural parameters in view of other measurements (e.g., measurements associated with other things); particular implementation scenarios; imprecise adjustment and/or manipulation of things, settings, and/or measurements by a person, a computing device, and/or a machine; system tolerances; control loops; machine-learning; foreseeable variations (e.g., statistically insignificant variations, chaotic variations, system and/or model instabilities, etc.); preferences; and/or the like.
Although the term “block” may be used herein to connote different elements illustratively employed, the term should not be interpreted as implying any requirement of, or particular order among or between, various blocks disclosed herein. Similarly, although illustrative methods may be represented by one or more drawings (e.g., flow diagrams, communication flows, etc.), the drawings should not be interpreted as implying any requirement of, or particular order among or between, various steps disclosed herein. However, certain embodiments may require certain steps and/or certain orders between certain steps, as may be explicitly described herein and/or as may be understood from the nature of the steps themselves (e.g., the performance of some steps may depend on the outcome of a previous step). Additionally, a “set,” “subset,” or “group” of items (e.g., inputs, algorithms, data values, etc.) may include one or more items, and, similarly, a subset or subgroup of items may include one or more items. A “plurality” means more than one.
As used herein, the term “based on” is not meant to be restrictive, but rather indicates that a determination, identification, prediction, calculation, and/or the like, is performed by using, at least, the term following “based on” as an input. For example, predicting an outcome based on a particular piece of information may additionally, or alternatively, base the same determination on another piece of information.
The terms “up,” “upper,” and “upward,” and variations thereof, are used throughout this disclosure for the sole purpose of clarity of description and are only intended to refer to a relative direction (i.e., a certain direction that is to be distinguished from another direction), and are not meant to be interpreted to mean an absolute direction. Similarly, the terms “down,” “lower,” and “downward,” and variations thereof, are used throughout this disclosure for the sole purpose of clarity of description and are only intended to refer to a relative direction that is at least approximately opposite a direction referred to by one or more of the terms “up,” “upper,” and “upward,” and variations thereof.
Embodiments of the disclosure include systems and methods for processing video data. According to embodiments, processing video data may include, for example, any number of different techniques, processes, and/or the like for performing one or more operations on video data. For example, in embodiments, processing video data may include compressing video data, encoding video data, transcoding video data, analyzing video data, indexing features of video data, delivering (transporting) video data, and/or the like. In embodiments, video data may include any information associated with video content, such as, for example, image data, metadata, raw video information, compressed video information, encoded video information, view information, camera information (e.g., camera position information, camera angle information, camera settings, etc.), object information, segmentation information, encoding instructions (e.g., requested encoding formats and/or parameters), object group information, feature information, quantization information, time information (e.g., time codes, markers, presentation time stamps (PTSs), decoding time stamps (DTSs), program clock references (PCRs), GPS time stamps, other types of reference time stamps, etc.), frame index information (e.g., information configured to facilitate reconstruction of a video file such as, e.g., information associated with the order of video frames, etc.), and/or the like.
Video processing platform 102 is illustratively coupled to a receiving device 108 by a communication link 110. Although not illustrated herein, the receiving device 108 may include any combination of components described herein with reference to the video processing platform 102, components not shown or described, and/or combinations of these. In embodiments, the video processing platform 102 communicates video data over the communication link 110. The term “communication link” may refer to an ability to communicate some type of information in at least one direction between at least two devices, and is not meant to be limited to a direct, persistent, or otherwise limited communication channel. That is, according to embodiments, one or more of the communication links 106 and 110 may be a persistent communication link, an intermittent communication link, an ad-hoc communication link, and/or the like. The communication links 106 and/or 110 may refer to direct communications between the video image source 104 and the video processing platform 102, and between the video processing platform 102 and the receiving device 108, respectively, and/or indirect communications that travel between the devices via at least one other device (e.g., a repeater, router, hub, and/or the like).
In embodiments, communication links 106 and/or 110 are, include, or are included in, a wired network, a wireless network, or a combination of wired and wireless networks. In embodiments, one or both of communication links 106 and 110 include a network. Illustrative networks include any number of different types of communication networks such as a short messaging service (SMS) network, a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), the Internet, a P2P network, or other suitable networks. The network may include a combination of multiple networks.
According to embodiments, the video processing platform 102 may be, include, be similar to, or be included in any one or more of a video encoding system, a video compression system, a video optimization system, a video analysis system, a video search system, a video index system, a content delivery network (CDN), a video-on-demand (VOD) system, a digital video recorder (DVR), a cloud DVR, a mobile video system, and/or the like. According to embodiments, the video processing platform 102 may include any number of different video processing technologies. For example, the video processing platform 102 may include any number of different hardware, software, and/or firmware components configured to perform aspects of embodiments of the process 200 depicted in
According to embodiments, the video processing platform 102 may be, include, be similar to, or be included in, any number of different aspects (or combinations thereof) of embodiments of the systems and methods described in U.S. application Ser. No. 13/428,707, filed Mar. 23, 2012, entitled “VIDEO ENCODING SYSTEM AND METHOD;” U.S. Provisional Application No. 61/468,872, filed Mar. 29, 2011, entitled “VIDEO ENCODING SYSTEM AND METHOD;” U.S. application Ser. No. 13/868,749, filed Apr. 23, 2013, entitled “MACROBLOCK PARTITIONING AND MOTION ESTIMATION USING OBJECT ANALYSIS FOR VIDEO COMPRESSION,” and issued on Sep. 20, 2016, as U.S. Pat. No. 9,451,253; U.S. application Ser. No. 15/269,960, filed Sep. 19, 2016, entitled “MACROBLOCK PARTITIONING AND MOTION ESTIMATION USING OBJECT ANALYSIS FOR VIDEO COMPRESSION;” U.S. Provisional Application No. 61/646,479, filed May 14, 2012, entitled “MACROBLOCK PARTITIONING AND MOTION ESTIMATION USING OBJECT ANALYSIS FOR VIDEO COMPRESSION;” U.S. Provisional Application No. 61/637,447, filed Apr. 24, 2012, entitled “MACROBLOCK PARTITIONING AND MOTION ESTIMATION USING OBJECT ANALYSIS FOR VIDEO COMPRESSION;” U.S. application Ser. No. 14/696,255, filed Apr. 24, 2015, entitled “METHOD AND SYSTEM FOR UNSUPERVISED IMAGE SEGMENTATION USING A TRAINED QUALITY METRIC,” and issued on Nov. 22, 2016, as U.S. Pat. No. 9,501,837; U.S. application Ser. No. 15/357,906, filed Nov. 21, 2016, entitled “METHOD AND SYSTEM FOR UNSUPERVISED IMAGE SEGMENTATION USING A TRAINED QUALITY METRIC;” U.S. Provisional Application No. 62/132,167, filed Mar. 12, 2015, entitled “TRAINING BASED MEASURE FOR SEGMENTATION QUALITY AND ITS APPLICATION;” U.S. Provisional Application No. 62/058,647, filed Oct. 1, 2014, entitled “A TRAINING BASED MEASURE FOR SEGMENTATION QUALITY AND ITS APPLICATION;” U.S. application Ser. No. 14/737,401, filed Jun. 11, 2015, entitled “LEARNING-BASED PARTITIONING FOR VIDEO ENCODING;” U.S. Provisional Application No. 62/042,188, filed Aug. 26, 2014, entitled “LEARNING-BASED PARTITIONING FOR VIDEO ENCODING;” U.S. application Ser. No. 14/737,418, filed Jun. 11, 2015, entitled “FOREGROUND DETECTION USING FRACTAL DIMENSIONAL MEASURES;” U.S. Provisional Application No. 62/134,534, filed Mar. 17, 2015, entitled “FOREGROUND DETECTION USING FRACTAL DIMENSIONAL MEASURES;” U.S. application Ser. No. 15/237,048, filed Aug. 15, 2016, entitled “OBJECT CATEGORIZATION USING STATISTICALLY-MODELED CLASSIFIER OUTPUTS;” U.S. Provisional Application No. 62/204,925, filed Aug. 13, 2015, entitled “OBJECT CATEGORIZATION USING STATISTICALLY-MODELED CLASSIFIER OUTPUTS;” and/or U.S. Provisional Application No. 62/368,853, filed Jul. 29, 2016, entitled “LOGO IDENTIFICATION;” the entirety of each of which is hereby incorporated herein by reference for all purposes.
The illustrative system 100 shown in
Embodiments of the process 200 include a segmentation process 202. According to embodiments, the segmentation process 202 may include any number of different image segmentation techniques, algorithms, and/or the like. In embodiments, for example, the segmentation process 202 is configured to be used to segment images (e.g., video frames) to generate segmentation information. Segmentation information may include any information associated with a segmentation of an image such as, for example, identification of segments (e.g., segment boundaries, segment types, etc.), segment maps, and/or the like.
According to embodiments, the segmentation process 202 may be performed to segment a video frame into a number of segments. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmentation process 202 may include any number of various automatic image segmentation methods known in the field. In embodiments, the segmentation process 202 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmentation process 202 may include Canny edge detection for detecting edges on a video frame for optimum cut partitioning, and may create segments using the optimum cut partitioning of the resulting pixel connectivity graph. In embodiments, the segmentation process may be, be similar to, include or be included in, aspects of embodiments of the segmentation techniques described in U.S. application Ser. No. 14/696,255, filed Apr. 24, 2015, entitled “METHOD AND SYSTEM FOR UNSUPERVISED IMAGE SEGMENTATION USING A TRAINED QUALITY METRIC,” the entirety of which is incorporated herein by reference for all purposes.
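By way of non-limiting illustration, the following sketch shows one way a segmentation process such as the segmentation process 202 could be implemented using commonly available libraries (OpenCV and SciPy); the function name, parameter values, and the choice of Canny-edge-based partitioning of the pixel connectivity graph are illustrative assumptions rather than requirements of the disclosure.

```python
# Illustrative sketch only: Canny edges are treated as cuts in the pixel
# connectivity graph, and the remaining connected regions are labeled as
# segments.
import cv2
from scipy import ndimage


def segment_frame(frame_bgr, low_thresh=50, high_thresh=150):
    """Return (segment_map, num_segments) for one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low_thresh, high_thresh)

    # Non-edge pixels remain connected; labeling them yields the segments.
    segment_map, num_segments = ndimage.label(edges == 0)
    return segment_map, num_segments
```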
Embodiments of process 200 may include a template-based pattern recognition process 204. According to embodiments, the template-based pattern recognition process 204 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform template-based pattern recognition on an image. According to embodiments, the process 200 may include an emblem identification process 206, which may be facilitated by template-based pattern recognition information generated by the template-based pattern recognition process 204. In embodiments, the emblem identification process 206 may be, be similar to, include, or be included in, aspects of embodiments of the logo identification techniques described in U.S. Application Ser. No. 62/368,853, filed Jul. 29, 2016, entitled “LOGO IDENTIFICATION,” the entirety of which is hereby incorporated herein by reference for all purposes.
Embodiments of the process 200 may include a foreground detection process 208. According to embodiments, the foreground detection process 208 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform foreground and/or background detection on an image. For example, in embodiments, the foreground detection process 208 may include segment-based foreground detection, where the foreground segments, or portions of the segments, determined in the segmentation process 202 are detected using one or more aspects of embodiments of the methods described herein. In embodiments, the foreground detection process 208 may include foreground detection on images that have not been segmented. In embodiments, the foreground detection process 208 may be, be similar to, include, or be included in, aspects of embodiments of the foreground detection techniques described in U.S. application Ser. No. 14/737,418, filed Jun. 11, 2015, entitled “FOREGROUND DETECTION USING FRACTAL DIMENSIONAL MEASURES,” the entirety of which is hereby incorporated herein by reference for all purposes.
Embodiments of the process 200 may include a segment-based motion estimation process 210. According to embodiments, the segment-based motion estimation process 210 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform motion estimation on an image. For example, in embodiments, the segment-based motion estimation process 210 may include any number of various motion estimation techniques known in the field. Two examples of motion estimation techniques are optical pixel flow and feature tracking. As an example, the segment-based motion estimation process 210 may include feature tracking in which Speeded Up Robust Features (SURF) are extracted from both a source image (e.g., a source frame) and a subsequent image (e.g., a subsequent frame). The individual features of the two images may then be compared using a Euclidean metric to establish a correspondence between features, thereby generating a motion vector for each feature. In embodiments, a motion vector for a segment may be the median of all of the motion vectors for each of the segment's features. After a motion vector for a segment is determined, each segment may be categorized based on its motion properties. For example, a segment may be categorized as either moving or stationary.
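As a non-limiting illustration of the segment-based motion estimation process 210, the sketch below uses ORB features as a freely available stand-in for SURF (whose OpenCV implementation resides in a non-free contrib module) and takes, as the motion vector of each segment, the median of the motion vectors of the features falling inside that segment; the function name and parameter values are assumptions for illustration only.

```python
# Illustrative sketch of segment-based motion estimation via feature tracking.
import cv2
import numpy as np


def segment_motion_vectors(src_gray, nxt_gray, segment_map):
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(src_gray, None)
    kp2, des2 = orb.detectAndCompute(nxt_gray, None)
    if des1 is None or des2 is None:
        return {}

    # Establish feature correspondences (Hamming distance for ORB; a
    # Euclidean metric would be used with SURF descriptors).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    per_segment = {}
    for m in matches:
        x1, y1 = kp1[m.queryIdx].pt
        x2, y2 = kp2[m.trainIdx].pt
        seg_id = segment_map[int(y1), int(x1)]
        per_segment.setdefault(seg_id, []).append((x2 - x1, y2 - y1))

    # A segment's motion vector is the median of its feature motion vectors;
    # segments with near-zero vectors could then be labeled "stationary".
    return {seg: tuple(np.median(np.array(v), axis=0))
            for seg, v in per_segment.items()}
```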
Embodiments of the process 200 may include an object group analysis process 212. According to embodiments, the object group analysis process 212 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform analysis and/or tracking of one or more objects in video data. For example, in embodiments, the object group analysis process 212 may be configured to identify, using a segment map and/or motion vectors, the presence of objects (e.g., clusters of segments that move together, or at least approximately together) within digital images such as, for example, frames of video. In embodiments, object group analysis process 212 may be configured to identify and/or analyze only objects that are moving within a particular scene. In embodiments, the object group analysis process 212 may include object analysis on images that have not been segmented.
Embodiments of the process 200 may include a super-resolution process 214. According to embodiments, the super-resolution process 214 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform super-resolution upscaling of video, encoding of video, enhancement of video, and/or the like. In embodiments, the super-resolution process 214 may be, be similar to, include, or be included in, aspects of embodiments of the super-resolution techniques described in U.S. Pat. No. 8,958,484, issued Feb. 17, 2015, entitled “ENHANCED IMAGE AND VIDEO SUPER-RESOLUTION PROCESSING;” and/or U.S. Pat. No. 8,861,893, issued Oct. 14, 2014, entitled “ENHANCING VIDEO USING SUPER-RESOLUTION,” the entirety of each of which is hereby incorporated herein by reference for all purposes.
Embodiments of the process 200 may include a feature-based pattern recognition process 216. According to embodiments, the feature-based pattern recognition process 216 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform feature-based pattern recognition on an image. According to embodiments, feature-based pattern recognition information may be used to facilitate object classification.
Embodiments of the process 200 may include an object classification process 216. According to embodiments, the object classification process 216 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform classification of objects in video data. For example, in embodiments, the object classification process 216 may include one or more classifiers configured to classify objects within video data. That is, for example, the one or more classifiers may be configured to receive any number of different inputs such as, for example, video data, segmentation information associated with the video data, motion information associated with the video data, object group information associated with the video data, and/or feature-based pattern recognition information, and may be configured to use aspects of the received information to classify objects in the video data. Classifying an object in video data may include, for example, identifying the existence of an object, determining and/or tracking the location of the object, determining and/or tracking the motion of the object, determining a class to which the object belongs (e.g., determining whether the object is a person, an animal, an article of furniture, etc.), developing an object profile (e.g., a set of information corresponding to the object such as, e.g., characteristics of the object) corresponding to an identified object, and/or the like.
In embodiments, the object classification process 216 may be, be similar to, include, or be included in, aspects of embodiments of the object classification techniques described in U.S. application Ser. No. 15/237,048, filed Aug. 15, 2016, entitled “OBJECT CATEGORIZATION USING STATISTICALLY-MODELED CLASSIFIER OUTPUTS,” the entirety of which is hereby incorporated herein by reference for all purposes.
Embodiments of the process 200 may include a deep scene level analysis process 220. According to embodiments, the deep scene level analysis process 220 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform analysis, indexing, metatagging, labelling, and/or the like, associated with video data (e.g., a video scene). For example, in embodiments, the deep scene level analysis process 220 may include analyzing video data to identify characteristics of a video scene (e.g., identification of objects in the scene, characteristics of the objects, behavior of the objects, characteristics of foreground/background features, characteristics of segmentation of the images of the video data, characteristics of motion of segments and/or objects in the scene, etc.). According to embodiments, characteristics of a video scene may be captured using a metatagging (referred to herein, interchangeably, as “labeling”) procedure. In embodiments, information resulting from a metatagging procedure may be referred to, for example, as “metadata.” In embodiments such metadata may be stored as a file, in a database, and/or the like.
Embodiments of the process 200 may include a partitioning process 222. According to embodiments, the partitioning process 222 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform partitioning of video frames for encoding (e.g., macroblock partitioning). For example, the partitioning process 222 may include machine-learning techniques, and/or macroblock partitioning using biased cost calculations so as to encourage separation of objects among macroblock partitions. In embodiments, the partitioning process 222 may be, be similar to, include, or be included in, aspects of embodiments of the partitioning techniques described in U.S. application Ser. No. 14/737,401, filed Jun. 11, 2015, entitled “LEARNING-BASED PARTITIONING FOR VIDEO ENCODING” and/or U.S. Application No. 13/868,749, filed Apr. 23, 2013, entitled “MACROBLOCK PARTITIONING AND MOTION ESTIMATION USING OBJECT ANALYSIS FOR VIDEO COMPRESSION;” the entirety of each of which is hereby incorporated herein by reference for all purposes.
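As a non-limiting illustration of a biased cost calculation of the kind referenced above, the following sketch increases the rate-distortion cost of a candidate macroblock partition when that partition mixes pixels from more than one detected object, thereby encouraging the mode decision to separate objects among partitions; the cost terms, the bias weight, and the function name are illustrative assumptions rather than the claimed partitioning technique.

```python
# Illustrative sketch of an object-aware bias applied to a partition cost.
import numpy as np


def biased_partition_cost(rd_cost, object_map_block, bias_weight=0.2):
    """object_map_block: object labels for the pixels of one candidate partition."""
    labels, counts = np.unique(object_map_block, return_counts=True)
    if len(labels) <= 1:
        return rd_cost  # partition covers a single object: no penalty
    # Fraction of pixels not belonging to the dominant object in the partition.
    mixing = 1.0 - counts.max() / counts.sum()
    return rd_cost * (1.0 + bias_weight * mixing)
```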
Embodiments of the process 200 may include an adaptive quantization process 224. According to embodiments, the adaptive quantization process 224 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform adaptive quantization for encoding. For example, in embodiments, an encoder may be configured to perform adaptive quantization and encoding that may utilize metadata, as described herein, to facilitate efficient encoding.
Embodiments of the process 200 may include an encoding process 226. According to embodiments, the encoding process 226 may be, include, be similar to, or be included in, any number of different techniques, algorithms, and/or the like, configured to perform encoding of video data. In embodiments, the encoding process 226 may be, be similar to, include, or be included in, aspects of embodiments of the encoding techniques described in U.S. application Ser. No. 13/428,707, filed Mar. 23, 2012, entitled “VIDEO ENCODING SYSTEM AND METHOD;” the entirety of which is hereby incorporated herein by reference for all purposes.
According to embodiments, any number of different aspects of one or more of the sub-processes (also referred to herein, interchangeably, as processes) of the illustrative process 200 may be performed in various implementations. In this manner, for example, embodiments of the process 200 may be implemented by a video processing platform (e.g., the video processing platform 102 depicted in
For example, in embodiments, an illustrative video processing platform (e.g., the video processing platform 102 depicted in
As shown in
As shown in
According to embodiments, as indicated above, various components of the operating environment 300, illustrated in
In embodiments, a computing device includes a bus that, directly and/or indirectly, couples the following devices: a processor, a memory, an input/output (I/O) port, an I/O component, and a power supply. Any number of additional components, different components, and/or combinations of components may also be included in the computing device. The bus represents what may be one or more busses (such as, for example, an address bus, data bus, or combination thereof). Similarly, in embodiments, the computing device may include a number of processors, a number of memory components, a number of I/O ports, a number of I/O components, and/or a number of power supplies. Additionally, any number of these components, or combinations thereof, may be distributed and/or duplicated across a number of computing devices.
In embodiments, the memory 314 includes computer-readable media in the form of volatile and/or nonvolatile memory and may be removable, nonremovable, or a combination thereof. Media examples include Random Access Memory (RAM); Read Only Memory (ROM); Electronically Erasable Programmable Read Only Memory (EEPROM); flash memory; optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices; data transmissions; or any other medium that can be used to store information and can be accessed by a computing device such as, for example, quantum state memory, and the like. In embodiments, the memory 314 stores computer-executable instructions for causing the processor 312 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components include a segmenter 318, an emblem identifier 320, a foreground detector 322, a motion estimator 324, an object analyzer 326, an object classifier 328, a partitioner 330, a metatagger 332, an encoder 334, and a communication component 336. Program components may be programmed using any number of different programming environments, including various languages, development kits, frameworks, and/or the like. Some or all of the functionality contemplated herein may also, or alternatively, be implemented in hardware and/or firmware.
In embodiments, the segmenter 318 may be configured to segment a video frame into a number of segments. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 318 may employ any number of various automatic image segmentation methods known in the field. In embodiments, the segmenter 318 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmenter 318 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph.
In embodiments, the emblem identifier 320 is configured to perform emblem identification with respect to video data. For example, in embodiments, the emblem identifier 320 may be configured to perform a template-based pattern recognition. According to embodiments, the template-based pattern recognition may include any number of different techniques for performing template-based pattern recognition with respect to images (e.g., video data).
In embodiments, the foreground detector 322 is configured to perform foreground detection on a video frame. For example, in embodiments, the foreground detector 322 may perform segment-based foreground detection, where the foreground segments, or portions of the segments, determined by the segmenter 318 are detected using one or more aspects of embodiments of the methods described herein. In embodiments, the foreground detector 322 may perform foreground detection on images that have not been segmented. In embodiments, results of foreground detection may be used by the segmenter 318 to inform a segmentation process.
In embodiments, the motion estimator 324 is configured to perform motion estimation on video data. For example, in embodiments, the motion estimator 324 may utilize any number of various motion estimation techniques known in the field. Two example motion estimation techniques are optical pixel flow and feature tracking. As an example, the motion estimator 324 may use feature tracking in which Speeded Up Robust Features (SURF) are extracted from both a source image (e.g., a source frame) and a subsequent image (e.g., a subsequent frame). The individual features of the two images may then be compared using a Euclidean metric to establish a correspondence between features, thereby generating a motion vector for each feature. In embodiments, a motion vector for a segment may be the median of all of the motion vectors for each of the segment's features. After a motion vector for a segment is determined, each segment may be categorized based on its motion properties. For example, a segment may be categorized as either moving or stationary.
In embodiments, the object analyzer 326 is configured to perform object analysis and/or object group analysis on video data. For example, in embodiments, the object analyzer 326 may be configured to identify, using a segment map and/or motion vectors computed by the motion estimator 324, the presence of objects (e.g., clusters of segments that move together, or at least approximately together) within digital images such as, for example, frames of video. In embodiments, the object analyzer 326 may be configured to identify and/or analyze only objects that are moving within a particular scene. In embodiments, the object analyzer 326 may perform object analysis on images that have not been segmented. Results of object group analysis may be used by the segmenter 318 to facilitate a segmentation process, by an encoder 334 to facilitate an encoding process, and/or the like.
In embodiments, the object classifier 328 is configured to perform object classification on video data. For example, in embodiments, the object classifier 328 may include one or more classifiers configured to classify objects within video data. That is, for example, the one or more classifiers may be configured to receive any number of different inputs such as, for example, video data, segmentation information associated with the video data (e.g., resulting from the segmentation process 202 depicted in
In embodiments, the partitioner 330 is configured to partition video frames for encoding. For example, in embodiments, the partitioner 330 may be configured to use any number of different partitioning techniques to partition a video frame. In embodiments, the partitioner 330 may be configured to utilize metadata (e.g., object information, segmentation information, etc.) as part of the partitioning process.
In embodiments, the metatagger 332 is configured to generate metadata associated with video data. For example, in embodiments, the metatagger 332 analyzes video data to identify characteristics of a video scene (e.g., identification of objects in the scene, characteristics of the objects, behavior of the objects, characteristics of foreground/background features, characteristics of segmentation of the images of the video data, characteristics of motion of segments and/or objects in the scene, etc.). According to embodiments, characteristics of a video scene may be captured using a metatagging (referred to herein, interchangeably, as “labeling”) procedure. In embodiments, information resulting from a metatagging procedure may be referred to, for example, as “metadata.” In embodiments such metadata may be stored as a file, in a database, and/or the like.
In embodiments, the encoder 334 is configured to encode video data. For example, in embodiments, the encoder 334 may be configured to perform adaptive quantization and encoding that may utilize metadata, as described herein, to facilitate efficient encoding. In embodiments, the communication component 336 is configured to facilitate communications between the video processing device 302 and other devices such as, for example, the decoding device 308.
The illustrative operating environment 300 shown in
As explained above, embodiments of video processing processes and methods described herein include segmentation. Segmentation is a key processing step in many applications, ranging, for instance, from medical imaging to machine vision and video compression technology. Although different approaches to segmentation have been proposed, those based on graphs have been particularly attractive to researchers because of their computational efficiency.
Many segmentation algorithms are known to practitioners in the field. Some examples include the watershed algorithm and simple linear iterative clustering (SLIC), a superpixel algorithm based on nearest neighbor aggregation. These algorithms typically share a common disadvantage in that they require a scale parameter to be set by a human supervisor. Thus, practical applications have, in general, involved supervised segmentation. This may limit the range of applications, since in many instances segmentation is to be generated dynamically and there may be no time or opportunity for human supervision.
In embodiments, a graph-based segmentation algorithm based on the work of P. F. Felzenszwalb and D. P. Huttenlocher may be used to segment images (e.g., video frames). Felzenszwalb and Huttenlocher discussed basic principles of segmentation in general, and applied these principles to develop an efficient segmentation algorithm based on graph cutting in their paper, “Efficient Graph-Based Image Segmentation,” Int. Jour. Comp. Vis., 59(2), September 2004, the entirety of which is hereby incorporated herein by reference for all purposes. Felzenszwalb and Huttenlocher stated that any segmentation algorithm should “capture perceptually important groupings or regions, which often reflect global aspects of the image.”
Based on the principle of a graph-based approach to segmentation, Felzenszwalb and Huttenlocher first build an undirected graph, G=(V, E), where vi∈V is the set of pixels of the image to be segmented and (vi, vj)∈E is the set of edges that connects pairs of neighboring pixels. A non-negative weight, w(vi, vj), is associated with each edge, and has a magnitude proportional to the difference between vi and vj. Image segmentation is identified by finding a partition of V such that each component is connected, and where the internal difference between the elements of each component is minimal whereas the difference between elements of different components is maximal. This is achieved by the definition of a predicate in Equation (1) that determines if a boundary exists between two adjacent components C1 and C2:

D(C1, C2)=true if Dif(C1, C2)>MInt(C1, C2), and false otherwise  (1)
where Dif(C1, C2) is the difference between the two components, defined as the minimum weight of the set of edges that connects C1 and C2; and MInt(C1, C2) is the minimum internal difference, defined in Equation (2) as:
MInt(C1, C2)=min[Int(C1)+τ(C1), Int(C2)+τ(C2)]  (2)
where Int(C) is the largest weight in the minimum spanning tree of the component C and therefore describes the internal difference between the elements of C; and where τ(C)=k/|C| is a threshold function used to establish whether there is evidence for a boundary between two components. The threshold function forces two small segments to fuse unless there is strong evidence of a difference between them.
In practice, the segment parameter k sets the scale of observation. Although Felzenszwalb and Huttenlocher demonstrate that the algorithm generates a segmentation map that is neither too fine nor too coarse, the definition of fineness and coarseness depends on k, which is set by the user to obtain a perceptually reasonable segmentation.
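As a non-limiting illustration of the effect of the scale parameter, the sketch below runs a Felzenszwalb-Huttenlocher style segmentation at several values of k using scikit-image, whose implementation names the parameter "scale"; the test image and parameter values are illustrative assumptions.

```python
# Illustrative sketch: larger k yields a coarser segmentation (fewer segments).
from skimage.data import astronaut
from skimage.segmentation import felzenszwalb

image = astronaut()  # any RGB test image
for k in (50, 200, 800):
    seg = felzenszwalb(image, scale=k, sigma=0.8, min_size=20)
    print(f"k={k}: {seg.max() + 1} segments")
```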
The definition of the proper value of k for the graph-based algorithm, as well as the choice of the threshold value used for edge extraction in other edge-based segmentation algorithms such as, for example, the algorithms described by Iannizzotto and Vita in “Fast and Accurate Edge-Based Segmentation with No Contour Smoothing in 2-D Real Images,” Giancarlo Iannizzotto and Lorenzo Vita, IEEE Transactions on Image Processing, Vol. 9, No. 7, pp. 1232-1237 (July 2000), the entirety of which is hereby incorporated by reference herein for all purposes, had been, until development of the segmentation algorithm described herein, an open issue when “perceptually important groupings or regions” are to be extracted from an image. In the algorithm described by Iannizzotto and Vita, edges are detected by looking at gray-scale gradient maxima with gradient magnitudes above a threshold value. For this algorithm, k is this threshold value and is to be set appropriately for proper segmentation. In embodiments, segmentation based on edge-extraction may be used. In those embodiments, edge thresholds are established based on a strength parameter k. In the field of segmentation algorithms, in general, a parameter is used to set the scale of observation. In cases in which segmentation is performed in a supervised mode, a human user selects the k value for a particular image. It is, however, clear that the segmentation quality provided by a certain algorithm is generally related to the quality perceived by a human observer, especially for applications (like video compression) where a human being constitutes the final beneficiary of the output of the algorithm.
For example, a 640×480 color image is shown in
An illustrative segmentation device 500 is schematically illustrated in
In embodiments, one or more of the program components utilize training information 518 to assist in determining the appropriate segmentation of an image. For example, the training information 518 may include a plurality of images segmented at different values of k. The training images may be used to determine a value for k that corresponds to a well-segmented result for the given image type (medical image, video image, landscape image, etc.). As explained in more detail herein, this value of k from the training images may be used to assist in determining appropriate segmentation of further images automatically by the segmentation device 500. In embodiments, the training information 518 includes information associated with a variety of different image types. In embodiments, the training information 518 includes a segmentation quality model that was derived from a set of training images, their segmentations, and the classification of these segmentations by a human observer. In embodiments, the training images and their segmentations are not retained.
In embodiments, the segmenter 508 is configured to segment an image into a plurality of segments, as described in more detail below. The segments may be stored in memory 504 as segmented image data 520. The segmented image data 520 may include a plurality of pixels of the image. The segmented image data 520 may also include one or more parameters associated with the image data 520. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 508 may employ any number of various automatic image segmentation methods known in the field. In embodiments, the segmenter 508 may use image color of the pixels and corresponding gradients of the pixels to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmenter 508 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph.
In embodiments, the comparison module 510 may compare a calculated value or pair of values as described in more detail below. For example, the comparison module 510 may compare the parameter k in Equation (2) above, and/or Uw in Equation (4) below, with a reference value or pair of values, such as is shown in
In embodiments, the filter module 512 may apply a filter to the image data 506 or the segmented image data 520, as described in more detail below. For example, the filter module 512 may apply a low pass filter to a scale map of an image to avoid sharp transitions between adjacent sub-images.
In the illustrative embodiment of
A similarity function such as, for example, a quantitative index, can be defined to represent the amount of information contained in the original image, img, that is captured by the segmentation process. In embodiments, for example, a segmented color image, seg, may be defined by substituting the RGB value of each pixel with the average RGB value of the pixels in the corresponding segment. For each color channel, the symmetric uncertainty U between img and seg can be computed by Equation (3), as given by Witten & Frank in Witten, Ian H. & Frank, Eibe, “Data Mining: Practical Machine Learning Tools and Techniques,” Morgan Kaufmann, Amsterdam, ISBN 978-0-12-374856-0, the entirety of which is hereby incorporated herein by reference for all purposes:

Ui=2*I(img, seg)/(Si,img+Si,seg)  (3)
where Sij indicates the Shannon's entropy, in bits, of the i-th channel for the image j, and where I(i,j) is the mutual information, in bits, of the images i and j.
The symmetric uncertainty U expresses the percentage of bits that are shared between img and seg for each color channel. The value of U tends to zero when the segmentation map is uncorrelated with the original color image channel, whereas it is close to one when the segmentation map represents the fine details in the corresponding channel of img.
Different images have different quantity of information in each color channel. For example, the color image of
Uw=(SR*UR+SG*UG+SB*UB)/(SR+SG+SB)  (4)

where U is determined for each channel as in Equation (3), and S is the Shannon's entropy for each channel.
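A non-limiting sketch of how the index of Equations (3) and (4) could be computed is shown below; the histogram-based estimates of entropy and mutual information, and the assumption of 8-bit RGB channels, are implementation choices for illustration only.

```python
# Illustrative sketch of per-channel symmetric uncertainty and the weighted
# index Uw between an image and its segment-mean reconstruction.
import numpy as np


def entropy(channel, bins=256):
    p, _ = np.histogram(channel, bins=bins, range=(0, 256))
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def mutual_information(a, b, bins=256):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    joint = joint / joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (pa[:, None] * pb[None, :])[nz]))


def weighted_symmetric_uncertainty(img, seg_mean_img):
    """img, seg_mean_img: HxWx3 uint8 arrays (seg_mean_img holds per-segment mean colors)."""
    U, S = [], []
    for c in range(3):
        a, b = img[..., c], seg_mean_img[..., c]
        s_img = entropy(a)
        u = 2.0 * mutual_information(a, b) / (s_img + entropy(b))  # Equation (3)
        U.append(u)
        S.append(s_img)
    S = np.array(S)
    return float(np.sum(S * np.array(U)) / S.sum())                # Equation (4)
```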
The index Uw is a value between 0 and 1 and is correlated with the segmentation quality. Referring to
For a typical image, Uw will decrease as k increases, passing from over-segmentation to under-segmentation. For a particular segmentation quality model, a representative set of training images at representative resolutions may be selected. For example, the curve depicted in
The results are presented in
For each block considered, an S-shaped curve Uw=Uw[log(k)] in the (log(k), Uw) space was observed. As shown in
Output of the segmentation algorithm was classified as under-segmented, well-segmented, or over-segmented by a human supervisor for each training image and each input value of k. In embodiments, the straight-line quality model was stored. In other embodiments, all of the training results may be stored and the quality model may be derived as needed. In embodiments, some form of classifier may be stored that allows the classification of the (k, numsegments(k)) ordered pair as over-segmented, under-segmented, or well-segmented. The (log(k), Uw) plane is subdivided into three different regions corresponding to 3 qualities of the segmentation result. Equation (5) was utilized to estimate the (m, b) parameters of the line Uw=m*log(k)+b that separates under-segmented and well-segmented regions:
where NUS and NWE are respectively the number of under-segmented and well-segmented points; and where δUS,i and δWE,i are 0 if the point is correctly classified (e.g., any under-segmented point should lie under the Uw=m*log(k)+b line) and 1 otherwise.
Equation (6) was utilized to estimate the (m,b) parameters of the line Uw=m*log(k)+b that divides over-segmented and well-segmented regions:
where NOS and NWE are respectively the number of over-segmented and well-segmented points; and where δOS,i and δWE,i are 0 if the point is correctly classified (e.g., any well-segmented point should lie under the Uw=m*log(k)+b line) and 1 otherwise.
The values of Equations (5) and (6) were minimized using a numerical algorithm. In embodiments, a simplex method is used. In practice, the cost function in each of Equation (5) and Equation (6) is the sum of the distances from the line Uw=m*log(k)+b of all the points that are misclassified. The estimate of the two lines that divide the (log(k), Uw) plane may be performed independently.
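As a non-limiting illustration, the following sketch fits one separating line Uw=m*log(k)+b by minimizing the summed distances of misclassified training points with a Nelder-Mead (simplex) search; the initial guess and the use of perpendicular point-to-line distance are assumptions for illustration.

```python
# Illustrative sketch of fitting a separating line in the (log(k), Uw) plane.
import numpy as np
from scipy.optimize import minimize


def fit_separating_line(log_k_under, uw_under, log_k_well, uw_well):
    """Fit Uw = m*log(k) + b so under-segmented points fall below the line
    and well-segmented points fall above it."""
    log_k_under, uw_under = np.asarray(log_k_under), np.asarray(uw_under)
    log_k_well, uw_well = np.asarray(log_k_well), np.asarray(uw_well)

    def cost(params):
        m, b = params
        norm = np.hypot(m, 1.0)
        # Under-segmented points lying above the line are misclassified.
        d_under = uw_under - (m * log_k_under + b)
        # Well-segmented points lying below the line are misclassified.
        d_well = uw_well - (m * log_k_well + b)
        return (np.sum(d_under[d_under > 0]) - np.sum(d_well[d_well < 0])) / norm

    result = minimize(cost, x0=np.array([-0.1, 1.0]), method="Nelder-Mead")
    return tuple(result.x)  # (m, b)
```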
The average line between the line dividing the under-segmented and well-segmented points, and the line dividing the well-segmented and over-segmented points, was assumed to be the optimal line 80 for segmentation in the (log(k), Uw) plane. Given the S-like shape of the typical Uw=Uw[log(k)] curve in the (log(k), Uw) plane, a point of intersection between the optimal line for segmentation and the Uw=Uw[log(k)] curve can generally be identified. In embodiments, identification of this point gives an optimal k value for a given 160×120 image.
As used herein, the term “optimal” refers to a value, conclusion, result, setting, circumstance, and/or the like, that may facilitate achieving a particular objective, and is not meant to necessarily refer to a single, most appropriate, value, conclusion, result, setting, circumstance, and/or the like. That is, for example, an optimal value of a parameter may include any value of that parameter that facilitates achieving a result (e.g., a segmentation that is more appropriate for an image than the segmentation achieved based on some other value of that parameter). Similarly, the term “optimize” refers to a process of determining or otherwise identifying an optimal value, conclusion, result, setting, circumstance, and/or the like.
Referring next to
An optimal k value can be computed iteratively through a bisection method 800, as illustrated in
In block 806, the value of i is increased for the first iteration. In block 808, the mean log value (k=exp{[log(kLeft)+log(kRight)]/2}) is used to determine a new k value.
In block 810, the current iteration i is compared to the maximum number of iterations. In embodiments, the maximum number of iterations is a predetermined integer, such as 5 or any other integer, which may be selected, for example, to optimize a trade-off between computational burden and image segmentation quality. In other embodiments, the maximum number of iterations is based on a difference between the k and/or Uw value determined in successive iterations. In embodiments, if the maximum number of iterations has been reached, the k value determined in block 808 is chosen as the final value of k for segmentation, as shown in block 812.
If the maximum number of iterations has not yet been reached in block 810, in block 814, the image is segmented and the corresponding Uw is computed for the k value determined in block 808. As shown in
In block 816, the determined k and Uw values are compared to the optimal line in the (log(k), Uw) plane. For example, as shown in
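A non-limiting sketch of the bisection search of method 800 is shown below; the helper segment_and_score (which segments the image at a given k and returns the corresponding Uw), the initial bracket, and the direction of the comparison against the optimal line are illustrative assumptions.

```python
# Illustrative sketch: halve the k interval in log space and keep the half
# indicated by comparing the (log(k), Uw) point against the optimal line.
import math


def find_optimal_k(image, m, b, segment_and_score,
                   k_left=10.0, k_right=10000.0, max_iterations=5):
    k = math.exp((math.log(k_left) + math.log(k_right)) / 2.0)
    for _ in range(max_iterations):
        uw = segment_and_score(image, k)
        # Point above the optimal line suggests over-segmentation: increase k.
        if uw > m * math.log(k) + b:
            k_left = k
        else:  # below the line suggests under-segmentation: decrease k
            k_right = k
        k = math.exp((math.log(k_left) + math.log(k_right)) / 2.0)
    return k
```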
Exemplary results of embodiments of method 800 are presented in
Although sub-images of 160×120 pixels were considered in
Method 800 (
Referring next to
As shown in block 1006, each image is then divided into a plurality of sub-images. Illustratively, each sub-image may be 160×120 pixels. In block 1008, a value of k is determined for each sub-image, independently from the other sub-images. In embodiments, the value of k is determined for the sub-image using method 800 as described above with respect to
In block 1012, a scale map of k(x,y) for the image is smoothed through a low pass filter to avoid sharp transition of k(x,y) along the image.
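As a non-limiting illustration of blocks 1006 through 1012, the sketch below tiles an image into 160×120 sub-images, estimates a k value per tile using a hypothetical per-tile estimator, and low-pass filters the resulting scale map k(x,y); the Gaussian filter and the smoothing in log space are implementation assumptions.

```python
# Illustrative sketch of building and smoothing a per-pixel scale map k(x, y).
import numpy as np
from scipy.ndimage import gaussian_filter


def build_scale_map(image, estimate_k_for_tile, tile_w=160, tile_h=120, sigma=1.0):
    h, w = image.shape[:2]
    rows, cols = h // tile_h, w // tile_w
    k_tiles = np.zeros((rows, cols), dtype=np.float64)
    for r in range(rows):
        for c in range(cols):
            tile = image[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w]
            k_tiles[r, c] = estimate_k_for_tile(tile)

    # Low-pass filter in log space, then expand to a per-pixel map to avoid
    # sharp transitions of k(x, y) between adjacent sub-images.
    k_tiles = np.exp(gaussian_filter(np.log(k_tiles), sigma=sigma))
    return np.kron(k_tiles, np.ones((tile_h, tile_w)))
```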
The results of an exemplary method 1000 are illustrated in
In some embodiments, the segmentation of a second image can be estimated based on the segmentation of a first image. Exemplary embodiments include video processing or video encoding, in which adjacent frames of images may be highly similar or highly correlated. A method 1300 for segmenting a second image is provided in
In embodiments, as shown in
In embodiments, such as in applications like video encoding, the computational cost of segmenting the video images can also be significantly reduced. When applied to a single frame, embodiments of the proposed method include performing a search for the optimal k value for each sub-image considering the entire range for k. For a video-encoding application, since adjacent frames in a video are highly correlated, the range for k can be significantly reduced by considering the estimates obtained at previous frames for the same sub-image and/or a corresponding sub-image. In embodiments, k values may be updated only at certain frame intervals and/or scene changes.
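A minimal sketch of such a reduced search range is shown below, assuming each sub-image's previous-frame estimate is available; the bracketing factor and limits are illustrative assumptions.

```python
# Illustrative sketch: bracket the k search around the previous frame's value.
def bracket_from_previous(k_prev, factor=2.0, k_min=10.0, k_max=10000.0):
    """Return a reduced (k_left, k_right) interval centered on last frame's k."""
    return max(k_min, k_prev / factor), min(k_max, k_prev * factor)
```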
Embodiments of the methods described above for automatically optimizing a segmentation algorithm may also be performed based on edge thresholding and working in the YUV color space, achieving similar results. In embodiments in which multiple input parameters are used by the segmentation algorithm, a similar segmentation quality model may be used, but the optimal segmentation line as shown in
Embodiments of the systems and methods described herein may include a template-based pattern-recognition process (e.g., the template-based pattern recognition process 204). Aspects of the template-based recognition process may be configured to facilitate an emblem identification process (e.g., the emblem identification process 206 depicted in
There is a growing interest in identifying emblems in video scenes. An emblem is a visible representation of something. For example, emblems may include symbols, letters, numbers, pictures, drawings, logos, colors, patterns, and/or the like, and may represent any number of different things such as, for example, concepts, companies, brands, people, things, places, emotions, and/or the like. In embodiments, for example, marketers, corporations, and content providers have an interest in quantifying ad visibility, both from ads inserted by the content deliverer (e.g., commercials) and ads inserted by the content creators (e.g., product placement in-program). For example, knowing the visibility at point of delivery of a banner at a football match can inform decisions by marketers and stadium owners regarding value. Further, when purchasing ad space in content delivery networks (e.g., internet-based video providers), logo owners may desire to purchase ad space that is not proximal to ad space occupied by logos of their competitors. Conventional emblem identification techniques are characterized by computationally-intensive brute force pattern matching.
Embodiments of the disclosure include systems, methods, and computer-readable media for identifying emblems in a video stream. In embodiments, a video stream is analyzed frame-by-frame to identify emblems, utilizing the results of efficient and robust segmentation processes to facilitate the use of high-precision classification engines that might otherwise be too computationally expensive for large-scale deployment. In embodiments, the emblem identification may be performed by, or in conjunction with, an encoding device. Embodiments of the technology for identifying emblems in video streams disclosed herein may be used with emblems of any kind, and generally may be most effective with emblems that have relatively static color schemes, shapes, sizes, and/or the like. Emblems are visual representations of objects, persons, concepts, brands, and/or the like, and may include, for example, logos, aspects of trade dress, colors, symbols, crests, and/or the like.
Embodiments of the disclosed subject matter include systems and methods configured for identifying one or more emblems in a video stream. The video stream may include multimedia content targeted at end-user consumption. Embodiments of the system may perform segmentation, classification, tracking, and reporting.
Embodiments of the classification procedures described herein may be similar to, or include, cascade classification (as described, for example, in Rainer Lienhart and Jochen Maydt, “An Extended Set of Haar-like Features for Rapid Object Detection,” IEEE ICIP, Vol. 1, pp. 900-903 (September 2002), attached herein as Appendix A, the entirety of which is hereby incorporated herein by reference for all purposes), an industry standard in object detection. Cascade classification largely takes advantage of two types of features: Haar-like features and local binary patterns (LBP). These feature detectors have shown substantial promise for a variety of applications. For example, live face detection for autofocusing cameras is typically accomplished using this technique.
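As a non-limiting illustration, a cascade classifier of this kind can be applied with OpenCV as sketched below; the cascade file name is a placeholder for a Haar or LBP cascade trained for the object of interest, and the detection parameters are assumptions for illustration.

```python
# Illustrative sketch of cascade classification with OpenCV.
import cv2

# Hypothetical trained cascade (Haar-like or LBP features).
cascade = cv2.CascadeClassifier("emblem_cascade.xml")


def detect(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Sliding-window, multi-scale detection; returns (x, y, w, h) boxes.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
```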
In embodiments, emblem information is received from emblem owners (e.g., customers of ad services associated with delivery of the video streams), and the emblem information is processed to enable more robust classification. That is, for example, emblem information may include images of emblems. The images may be segmented and feature extraction may be performed on the segmented images. The results of the feature extraction may be used to train classifiers to more readily identify the emblems.
Disadvantages of cascade classification include multiple detection, sensitivity to lighting conditions, and classification performance. The multiple detection issue may be due to overlapping sub-windows (caused by the sliding-window part of the approach). Embodiments mitigate this disadvantage by implementing a segmentation algorithm that provides a complete but disjoint labeling of the image. The sensitivity and performance issues may be mitigated, in embodiments of the disclosure, by using a more robust feature set that would normally be too computationally expensive for live detection.
According to embodiments, the video processing device 1402 may be, include, be similar to, or be included in the video processing device 302 depicted in
The video processing device 1402 may be, or include, an encoding device and may be configured to encode video data received from the image source 1404 and may, in embodiments, be configured to facilitate insertion of ads into the encoded video data. In embodiments, the video processing device 1402 may encode video data into which ads have already been inserted by other components (e.g., ad network components). The ads may be provided by an ad provider 1410 that communicates with the video processing device 1402 via the network 1406. In embodiments, an emblem owner (e.g., a company that purchases ads containing its emblem from the ad provider 1410) may interact, via the network 1406, with the ad provider 1410, the video processing device 1402, the image source 1404, and/or the like, using an emblem owner device 1412. In embodiments, the emblem owner may wish to receive reports containing information about the placement of its emblem(s) in content encoded by the encoding device. The emblem owner may also, or alternatively, wish to purchase ad space that is not also proximate to emblem placement by its competitors. For example, an emblem owner may require that a video stream into which its emblem is to be inserted not also contain an emblem of a competitor, or the emblem owner may require that its emblem be separated, within the video stream, from a competitor's emblem by a certain number of frames, a certain amount of playback time, and/or the like.
In embodiments, the video processing device 1402 may instantiate an emblem identifier 1414 configured to identify one or more emblems within a video stream. According to embodiments, emblem identification may be performed by an emblem identifier that is instantiated independently of the video processing device 1402. For example, the emblem identifier 1414 may be instantiated by another component such as a stand-alone emblem identification device, a virtual machine running in an encoding environment, by a device that is maintained by an entity that is not the entity that maintains/provides the video processing device 1402 (e.g., emblem identification may be provided by a third-party vendor), by a component of an ad provider, and/or the like. According to embodiments, an emblem identifier 1414 may be implemented on a device and/or in an environment that includes a segmenter (e.g., the segmenter 318 depicted in
In embodiments, the emblem owner may provide emblem information to the emblem identifier 1414. The emblem information may include images of the emblem owner's emblem, images of emblems of the emblem owner's competitors, identifications of the emblem owner's competitors (e.g., which the emblem identifier 1414 may use to look up, from another source of information, emblem information associated with the competitors), and/or the like. The emblem identifier 1414 may store the emblem information in a database 1416. In embodiments, the emblem identifier 1414 may process the emblem information to facilitate more efficient and accurate identification of the associated emblem(s). For example, the emblem identifier 1414 may utilize a segmenter implemented by the video processing device 1402 to segment images of emblems, and may perform feature extraction on the segmented images of the emblems, to generate feature information that may be used to at least partially train one or more classifiers to identify the emblems. By associating the emblem identifier 1414 with the video processing device 1402 (e.g., by implementing the emblem identifier 1414 on the processing device 1402, by facilitating communication between the emblem identifier 1414 and the video processing device 1402, etc.), the emblem identifier 1414 may be configured to utilize the robust and efficient segmentation performed by the video processing device 1402 to facilitate emblem identification.
The illustrative system 1400 shown in
As shown in
In embodiments, the memory 1514 stores computer-executable instructions for causing the processor 1512 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components include a segmenter 1518, an emblem identifier 1520, an encoder 1522, and a communication component 1524.
In embodiments, as described above with reference to
In embodiments, the segmenter 1518 may be configured to segment a video frame into a number of segments to generate a segment map. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 1518 may employ any number of various automatic image segmentation methods known in the field. In embodiments, the segmenter 1518 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmenter 1518 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph. In embodiments, the segmenter 1518 implements aspects of the segmentation techniques described in Iuri Frosio and Ed Ratner, “Adaptive Segmentation Based on a Learned Quality Metric,” Proceedings of the 10th International Conference on Computer Vision Theory and Applications, March 2015, the entirety of which is hereby incorporated herein by reference for all purposes.
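For illustration only, the following is a simplified, edge-based sketch of producing a segment map (an integer label per pixel); it is not the adaptive, learned-quality-metric segmentation cited above, and the Canny thresholds and file path are assumptions.

```python
# Minimal sketch: Canny edges separate regions, and the non-edge pixels are
# labeled by connected components to form a segment map (one index per pixel).
import cv2
import numpy as np

frame = cv2.imread("frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

edges = cv2.Canny(gray, 100, 200)                 # edge mask (255 on edges)
non_edge = (edges == 0).astype(np.uint8)          # regions bounded by edges
num_segments, segment_map = cv2.connectedComponents(non_edge)

# segment_map assigns an index to every non-edge pixel; edge pixels receive
# label 0 and could be merged into a neighboring segment in a refinement pass.
print("segments:", num_segments)
```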
The resulting segment map of image segments includes an assignment of indices to every pixel in the image, which allows for the frame to be dealt with in a piecemeal fashion. Each index, which may be indexed in a database 1526 stored in the memory 1514, may be considered a mask for this purpose. The database 1526, which may refer to one or more databases, may be, or include, one or more tables, one or more relational databases, one or more multi-dimensional data cubes, and the like. Further, though illustrated as a single component, the database 1526 may, in fact, be a plurality of databases 1526 such as, for instance, a database cluster, which may be implemented on a single computing device or distributed between a number of computing devices, memory components, or the like.
In embodiments, the emblem identifier 1520 may be configured to identify, using the segment map, the presence of emblems within digital images such as, for example, frames of video. In embodiments, the emblem identifier 1520 may perform emblem identification on images that have not been segmented. In embodiments, results of emblem identification may be used by the segmenter 1518 to inform a segmentation process. According to embodiments, as shown in
According to embodiments, the pre-filter 1528 is configured to compute basic color metrics for each of the segments of a segmented image and to identify, based on emblem data 1530 (which may, for example, include processed emblem information), segments that are unlikely to contain a particular emblem. For example, the pre-filter 1528 may identify segments unlikely to contain a certain emblem based on emblem data, and/or may identify those segments based on other information such as, for example, texture information, known information regarding certain video frames, and/or the like.
In embodiments, the pre-filter 1528 is configured to determine, using one or more color metrics, that each segment of a first set of segments of a segment map is unlikely to include the target emblem; and to remove the first set of segments from the segment map to generate a pre-filtered segment map. As used herein, the term “target emblem” refers to an emblem that an emblem identifier (e.g., the emblem identifier 1520) is tasked with identifying within a video stream. The color metrics may include, in embodiments, color histogram matching to the emblem data 1530. For example, by comparing means and standard deviations associated with color distributions in the frame and the emblem data 1530, embodiments may facilitate removing segments that are unlikely to contain a target emblem. In embodiments, the pre-filter 1528 pre-filters the image on a per-emblem basis, thereby facilitating a more efficient and accurate feature extraction process.
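For illustration only, a minimal sketch of the color-metric pre-filtering idea: per-segment color means and standard deviations are compared against emblem color statistics, and segments that differ by more than a tolerance are dropped. The tolerance and the emblem statistics passed in are assumptions of the sketch.

```python
# Hedged sketch of pre-filtering on a per-emblem basis using simple color
# statistics; emblem_mean/emblem_std would come from processed emblem data.
import numpy as np

def prefilter_segments(frame, segment_map, emblem_mean, emblem_std, tol=60.0):
    """Return the set of segment indices that plausibly contain the emblem."""
    kept = set()
    for idx in np.unique(segment_map):
        pixels = frame[segment_map == idx].astype(np.float64)  # N x 3 colors
        seg_mean = pixels.mean(axis=0)
        seg_std = pixels.std(axis=0)
        # Keep the segment only if both statistics are reasonably close.
        if (np.linalg.norm(seg_mean - emblem_mean) < tol and
                np.linalg.norm(seg_std - emblem_std) < tol):
            kept.add(int(idx))
    return kept
```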
The emblem identifier 1520 also may include a feature extractor 1532 configured to extract one or more features from an image to generate a feature map. In embodiments, the feature extractor 1532 may represent more than one feature extractor. The feature extractor 1532 may include any number of different types of feature extractors, implementations of feature extraction algorithms, and/or the like. For example, the feature extractor 1532 may perform histogram of oriented gradients (HOG) feature extraction (as described, for example, in Navneet Dalal and Bill Triggs, “Histograms of Oriented Gradients for Human Detection,” available at http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf, 2005, the entirety of which is hereby incorporated herein by reference for all purposes), Gabor feature extraction (as explained, for example, in John Daugman, “Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 36, No. 7, 1988, the entirety of which is hereby incorporated herein by reference for all purposes), KAZE feature extraction, speeded-up robust features (SURF) feature extraction, features from accelerated segment test (FAST) feature extraction, scale-invariant feature transform (SIFT) feature extraction (as explained, for example, in D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004, the entirety of which is hereby incorporated herein by reference for all purposes), and/or the like. In embodiments, the feature extractor 1532 may detect features in an image based on emblem data 1530. By generating the features on the full frame, embodiments allow for feature detection in cases where nearby data could still be useful (e.g., an edge at the end of a segment).
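For illustration only, the sketch below uses ORB keypoints and descriptors as a freely available stand-in for the extractors named above (SURF and SIFT may not be present in every OpenCV build); computing features on the full frame preserves context near segment boundaries, as noted above.

```python
# Illustrative only: ORB stands in for the HOG/Gabor/KAZE/SURF/FAST/SIFT
# extractors named above. The image path is an assumption.
import cv2

orb = cv2.ORB_create(nfeatures=2000)
gray = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2GRAY)
keypoints, descriptors = orb.detectAndCompute(gray, None)
# Each keypoint carries its (x, y) location, so features can later be masked
# against the pre-filtered segment map on a per-emblem basis.
```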
The emblem identifier 1520 (e.g., via the feature extractor 1532 and/or classifier 1534) may be further configured to use the list of culled segments from the pre-filter step to remove features that fall outside the expected areas of interest. The practice also may facilitate, for example, classification against different corporate emblems by masking different areas depending on the target emblem. For example, the emblem identifier 1520 may be configured to remove, from a feature map, a first set of features, wherein at least a portion of each of the first set of features is located in at least one of the segments of the first set of segments.
After masking, the remaining features may be classified using a classifier 1534 configured to classify at least one of the plurality of features to identify the target emblem in the video frame. In embodiments, the emblem identifier 1520 may further comprise an additional classifier configured to mask at least one feature corresponding to a non-target emblem. The classifier 1534 may be configured to receive input information and produce output that may include one or more classifications. In embodiments, the classifier 1534 may be a binary classifier and/or a non-binary classifier. The classifier 1534 may include any number of different types of classifiers such as, for example, a support vector machine (SVM), an extreme learning machine (ELM), a neural network, a kernel-based perceptron, a k-NN classifier, a bag-of-visual-words classifier, and/or the like. In embodiments, high-quality classifications are selected as matches, providing both identification and location for the target emblem. Embodiments of classification techniques that may be utilized by the classifier 1534 include, for example, techniques described in Andrey Gritsenko, Emil Eirola, Daniel Schupp, Ed Ratner, and Amaury Lendasse, “Probabilistic Methods for Multiclass Classification Problems,” Proceedings of ELM-2015, Vol. 2, January 2016; and Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cedric Bray, “Visual Categorization with Bags of Keypoints,” Xerox Research Centre Europe, 2004, the entirety of each of which is hereby incorporated herein by reference for all purposes.
The emblem identifier 1520 may include a tracker 1536 that is configured to track the identified target emblem by identifying an additional video frame in which the identified target emblem appears. For example, to provide valuable insights, it may be desirable to identify an emblem on more than a frame-by-frame basis, thereby eliminating the need for a human operator to interpret a marked-up stream to provide a timing report. That is, it may be desirable to track the identified emblem from frame to frame. According to embodiments, given a match in a frame (determined by the appearance of a high-quality feature match), the tracker 1536 looks at neighboring frames for high-quality matches that are well localized. This not only allows robust reporting, but also may improve match quality. In embodiments, single-frame appearances may be discarded as false positives, and temporal hints may allow improved robustness for correct classification for each frame of the video. As an example, in embodiments, the classifier 1534 may classify at least one of a plurality of features to identify a candidate target emblem in the video frame. The tracker 1536 may be configured to determine that the candidate target emblem does not appear in an additional video frame; and identify, based on determining that the candidate target emblem does not appear in an additional video frame, the candidate target emblem as a false-positive.
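For illustration only, a minimal sketch of the temporal-consistency idea: a candidate detection is retained only if a well-localized detection also appears in a neighboring frame, and single-frame appearances are discarded as false positives. The detection format (frame index to candidate centers) and the distance threshold are assumptions.

```python
# Hedged sketch: keep a candidate only if a nearby candidate exists in at
# least one neighboring frame; otherwise treat it as a false positive.
def filter_false_positives(detections_by_frame, max_dist=30.0, window=1):
    """detections_by_frame: {frame_idx: [(x, y), ...]} candidate emblem centers."""
    confirmed = {}
    for f, candidates in detections_by_frame.items():
        kept = []
        for (x, y) in candidates:
            supported = False
            for df in range(-window, window + 1):
                if df == 0:
                    continue
                for (nx, ny) in detections_by_frame.get(f + df, []):
                    if ((x - nx) ** 2 + (y - ny) ** 2) ** 0.5 <= max_dist:
                        supported = True
            if supported:
                kept.append((x, y))
        if kept:
            confirmed[f] = kept
    return confirmed
```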
The tracked identified emblems may be used to generate a report of emblem appearance, location, apparent size, and duration of appearance. As shown in
As shown in
According to embodiments, the emblem identifier 1520 may be configured to process emblem information to generate the emblem data 1530. That is, for example, prior to live identification, the database 1526 of target emblems may be processed for feature identification offline. Processing the emblem information may be performed by the segmenter 1518 and the feature extractor 1532. By processing the emblem information before performing an emblem identification procedure, embodiments of the present disclosure facilitate training classifiers (e.g., the classifier 1534) that can more efficiently identify the emblems. Additionally, in this manner, emblems that are split by the segmentation algorithm at runtime may still be well identified by the classifier 1534. As an example, an emblem that incorporates several leaves has a high chance of having the leaves segmented apart. By identifying the local features for each segment (e.g., a shape/texture descriptor for a leaf), embodiments of the present disclosure facilitate identifying those features on a segment-by-segment basis in the video stream.
In embodiments, for example, the encoding device 1502 may be configured to receive target emblem information, the target emblem information including an image of the target emblem. The encoding device 1502 may be further configured to receive non-target emblem information, the non-target emblem information including an image of one or more non-target emblems. The segmenter 1518 may be configured to segment the images of the target emblems and/or non-target emblems, and the feature extractor 1532 may be configured to extract a set of target emblem features from the target emblem; and extract a set of non-target emblem features from the non-target emblem. In this manner, the emblem identifier 1520 may train the classifier 1534 based on the set of target emblem features and the set of non-target emblem features.
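For illustration only, a minimal training sketch: feature vectors extracted from target-emblem images are labeled 1, feature vectors from non-target emblems are labeled 0, and a linear SVM is fit on the pooled set. The use of scikit-learn's LinearSVC, and the variable names, are assumptions rather than part of the described embodiments.

```python
# Hedged sketch of training a binary emblem classifier from target and
# non-target feature sets (each an iterable of equal-length 1-D vectors).
import numpy as np
from sklearn.svm import LinearSVC

def train_emblem_classifier(target_features, non_target_features):
    X = np.vstack([np.asarray(target_features), np.asarray(non_target_features)])
    y = np.concatenate([np.ones(len(target_features)),
                        np.zeros(len(non_target_features))])
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(X, y)
    return clf
```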
The illustrative operating environment 1500 shown in
Embodiments of the method 1600 further include segmenting the image to generate a segment map (block 1604). The image may be pre-filtered (block 1606), based on the segment map. For example, in embodiments, the method 1600 includes pre-filtering the image by determining, using one or more color metrics, that each segment of a first set of segments is unlikely to include the target emblem; and removing the first set of segments from the image to generate a pre-filtered image. In embodiments, the one or more color metrics includes metrics generated by performing color histogram matching.
According to embodiments, the method 1600 further includes extracting features from the pre-filtered image to generate a feature map (block 1608). Embodiments of the method 1600 further include identifying the target emblem in the image (block 1610). Identifying the target emblem may include classifying at least one of a plurality of features, using a classifier, to identify the target emblem in the video frame. In embodiments, the classifier may include a bag-of-visual-words model. In embodiments, before classification, the method 1600 further includes removing, from the feature map, a first set of features, wherein at least a portion of each of the first set of features is located in at least one of the segments of the first set of segments. Additionally, or alternatively, embodiments of the method 1600 further include masking, using an additional classifier, at least one feature corresponding to a non-target emblem.
The method 1600 may further include tracking the identified target emblem by identifying an additional video frame in which the identified target emblem appears (block 1612). Although not illustrated, embodiments of the method 1600 may further include generating a report based on the tracked identified emblem. According to embodiments, the report may include any number of different types of information including, for example, target emblem appearance frequency, target emblem size, target emblem placement, and/or the like.
Embodiments of the illustrative method 1700 may include segmenting the image to generate a segment map (block 1704). Embodiments of the method 1700 further include pre-filtering the image by determining a first set of segments unlikely to include the target emblem (block 1706) and removing the first set of segments from the image to generate a pre-filtered image (block 1708). For example, the image may be the illustrative image 1800 depicted in
Embodiments of the method 1700 further include generating a feature map (block 1710). In embodiments, the method 1700 may also include removing a first set of features corresponding to the first set of segments (block 1712) and masking features corresponding to the non-target emblems (block 1714). A classifier is used to identify a candidate target emblem (block 1716). In embodiments, the method 1700 includes tracking the candidate target emblem (block 1718) and identifying, based on the tracking, the candidate target emblem as the target emblem or as a false-positive (block 1720). For example, in embodiments, the method 1700 includes determining that the candidate target emblem does not appear in an additional video frame; and identifying, based on determining that the candidate target emblem does not appear in an additional video frame, the candidate target emblem as a false-positive. In embodiments, the method 1700 may alternatively include determining that the candidate target emblem appears in an additional video frame; and identifying, based on determining that the candidate target emblem appears in an additional video frame, the candidate target emblem as the target emblem.
As indicated in
Embodiments of the subject matter disclosed herein include systems and methods for foreground detection that facilitate identifying pixels that indicate a substantive change in visual content between frames, and applying a filtration technique that is based on fractal-dimension methods. For example, a filter may be applied that is configured to eliminate structures of dimensionality less than unity, while preserving those of dimensionality of unity or greater. Embodiments of techniques described herein may enable foreground detection to be performed in real-time (or near real-time) with modest computational burdens. Embodiments of the present disclosure also may utilize variable thresholds for foreground detection and image segmentation techniques. As the term is used herein, “foreground detection” (also referred to, interchangeably, as “foreground determination”) refers to the detection (e.g., identification, classification, etc.) of pixels that are part of a foreground of a digital image (e.g., a picture, a video frame, etc.).
As shown in
In embodiments, the memory 1914 stores computer-executable instructions for causing the processor 1912 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components include a segmenter 1918, a foreground detector 1920, an encoder 1922, and a communication component 1924.
In embodiments, the segmenter 1918 may be configured to segment a video frame into a number of segments. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 1918 may employ any number of various automatic image segmentation methods known in the field. In embodiments, the segmenter 1918 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmenter 1918 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph. According to embodiments, the segmenter 1918 may be, include, be similar to, or be included in the segmenter 318 depicted in
In embodiments, the foreground detector 1920 is configured to perform foreground detection on a video frame. For example, in embodiments, the foreground detector 1920 may perform segment-based foreground detection, where the foreground segments, or portions of the segments, determined by the segmenter 1918 are detected using one or more aspects of embodiments of the methods described herein. In embodiments, the foreground detector 1920 may perform foreground detection on images that have not been segmented. In embodiments, results of foreground detection may be used by the segmenter 1918 to inform a segmentation process. According to embodiments, the foreground detector 1920 may be, include, be similar to, or be included in the foreground detector 322 depicted in
As shown in
The illustrative operating environment 1900 shown in
For instance, if a recording camera moves or changes zoom during recording of a video sequence, embodiments include providing a way to compensate for that motion, so that the background of the sequence may be kept at least substantially properly aligned between frames to a degree of acceptable accuracy. Similarly, if there is some sort of lighting change in the video sequence (e.g., due to a change in the physical lighting of the scene and/or due to a fade effect applied to the video), images may be adjusted to compensate for such effects.
For example,
In a second exemplary implementation of embodiments of an algorithm for detecting foreground in a video frame (“second example”),
As shown in
As shown, embodiments of the illustrative method 2000 may include constructing an ambient background image (block 2008). The ambient background image may be used as an approximation of the unchanging background and may be constructed in any number of different ways. For example, in embodiments, the ambient background image may be a median background image that includes a set of pixels, each of the set of pixels of the median background image having a plurality of color components, where each color component of each pixel of the median background image is the median of a corresponding component of corresponding pixels in the current image, the first previous image, and the second previous image. In embodiments, the ambient background image may be constructed using other types of averages (e.g., mean and/or mode), interpolations, and/or the like.
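For illustration only, a minimal sketch of the median ambient background described above:

```python
# Each color component of each pixel is the median of that component across
# the current frame and the two previous frames.
import numpy as np

def median_background(current, prev1, prev2):
    """All inputs: H x W x 3 arrays of the same shape and dtype."""
    stack = np.stack([current, prev1, prev2], axis=0)
    return np.median(stack, axis=0).astype(current.dtype)
```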
Examples of a median background image are shown in the
According to embodiments of the method 2000, a difference image is constructed (block 2010). In embodiments, the difference image includes a set of pixels, where each pixel of the difference image indicates a difference between a corresponding pixel in the ambient background image and a corresponding pixel in the current image. Embodiments further include constructing a foreground threshold image (block 2012). In embodiments, the threshold image includes a set of pixels, where each pixel of the threshold image indicates an amount by which a pixel can change between images and still be considered part of the background.
Examples of a difference image and foreground threshold image are shown in the
In the second example,
As shown, embodiments of the illustrative method 2000 may include constructing a foreground indicator map (FIM) (block 2014). The foreground indicator map includes a set of pixels, where each of the set of pixels of the foreground indicator map corresponds to one of the set of pixels in the current image, and where each of the set of pixels of the foreground indicator map includes an initial classification corresponding to a foreground or a background. The foreground indicator map may be a binary map or a non-binary map. In a binary map, each of the pixels may be classified as foreground or background, while in a non-binary map, each pixel may provide a measure of the probability associated therewith, where the probability is a probability that the pixel is foreground. In embodiments of the method 2000, a binary foreground indicator map (BFIM) may be used, a non-binary foreground indicator map (NBFIM) may be used, or both may be used.
Embodiments of the illustrative method 2000 further include constructing a filtered FIM by filtering noise from the FIM (block 2016). In embodiments, the FIM is filtered to remove sparse noise while preserving meaningful structures; for example, it may be desirable to retain sufficiently large one-dimensional structures during the filter process because the FIM more readily shows the edges of a moving object than the body of a moving object. Motivated by the concept of the box-counting fractal dimension, embodiments may include techniques that involve looking at varying size box-regions of the FIM, and using various criteria to declare pixels in the FIM as noise. In embodiments, these criteria may be chosen such that sufficiently large one-dimensional structures with some gaps are not eliminated while sufficiently sparse noise is eliminated.
Embodiments of the illustrative method 2000 may further include determining foreground segments (block 2018). According to embodiments, identifying at least one segment as a foreground segment or a background segment may include determining, based on the filtered BFIM, at least one foreground metric corresponding to the at least one segment; determining, based on the at least one foreground metric, at least one variable threshold; and applying the at least one variable threshold to the filtered BFIM to identify the at least one segment as a foreground segment or a background segment. Embodiments of the foreground detection algorithm make use of an image segmentation of the current frame, and may include determining which segments are likely to be moving, erring on the side of allowing false-positives.
In embodiments, foreground may be detected by applying a static threshold for each of the three fractions for each segment, declaring any segment over any of the thresholds to be in the foreground. According to embodiments, the algorithm may use variable thresholds which simultaneously consider the foreground fractions of a plurality of (e.g., every) segments in the current frame, providing an empirically justified trade-off between the threshold and the area of the frame that is declared to be foreground. This may be justified under the assumption that the system will rarely consider input where it is both content-wise correct and computationally beneficial for the entire frame to be considered foreground, and, simultaneously, there is little overhead to allowing a few false-positives when the entire frame should be considered background.
As indicated above, embodiments of the illustrative method may include constructing a binary foreground indicator map (BFIM). In embodiments, constructing the BFIM includes determining, for each of the set of pixels in the BFIM, whether a corresponding pixel in the difference image corresponds to a difference that exceeds a threshold, where the threshold is indicated by a corresponding pixel in the threshold image, the threshold image comprising a set of pixels, where each pixel of the threshold image indicates an amount by which a pixel can change between images and still be considered part of the background; and assigning an initial classification to each of the set of pixels in the BFIM, wherein the initial classification of a pixel is foreground if the corresponding difference exceeds the threshold. That is, for example, in embodiments, each pixel of the BFIM is given a value of 1 if the corresponding pixel in the difference image shows a difference larger than the threshold indicated by the corresponding pixel in the threshold image; otherwise, the pixel is given a value of 0. Embodiments may allow for pixels to be marked as foreground in the BFIM based on any available analysis from previous frames, such as projected motion.
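For illustration only, a minimal sketch of constructing the BFIM from a difference image and a per-pixel threshold image is shown below; collapsing the color components with a sum is an assumption of the sketch, and embodiments may compare components individually.

```python
# Hedged sketch: a pixel is marked foreground (1) when its difference from the
# ambient background exceeds the per-pixel threshold, otherwise background (0).
import numpy as np

def build_bfim(current, ambient_background, threshold_image):
    """current, ambient_background: H x W x C; threshold_image: H x W (>= 1)."""
    difference = np.abs(current.astype(np.int32) - ambient_background.astype(np.int32))
    per_pixel_diff = difference.sum(axis=-1)        # collapse color components
    return (per_pixel_diff > threshold_image).astype(np.uint8)
```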
Examples of the BFIM are shown in
In embodiments, the BFIM is filtered to remove sparse noise while preserving meaningful structures; for example, it may be desirable to retain sufficiently large one-dimensional structures during the filter process because the edges of a moving object more readily show up in the BFIM than the body of the object. As discussed above, a modified box counting technique may be used to filter the BFIM. According to embodiments, using the modified box counting technique may include constructing a neighbor sum map, the neighbor sum map including, for a first pixel of the set of pixels in the BFIM, a first neighbor sum map value and a second neighbor sum map value, where the first pixel includes an initial classification as foreground. The technique may also include applying, for the first pixel, a set of filter criteria to the first and second neighbor sum map values; and retaining the initial classification corresponding to foreground for the first pixel if the filter criteria are satisfied.
For example, in embodiments of the method 2100 depicted in
(1) neighborSumMap(s1)≧floor(C1*s1);
(2) neighborSumMap(s2)≧floor(C1*s2); and,
(3) neighborSumMap(s2)≧neighborSumMap(s1)+floor(C2*(s2−s1)).
In embodiments, all of the conditions may be tested. In other embodiments, as shown in
As shown in
According to embodiments, C1 and C2 may be selected based on empirical evidence, formulas, and/or the like, to produce desired results. In embodiments, for example, C1 may be 0.9 and C2 may be 0.4. Note that the coefficients less than unity may be used to allow for some gaps in the structures that are desired to be preserved; and, in embodiments, if it is desirable to preserve only structures without gaps, those could be increased to unity. Further, if it is desirable to preserve only internal points of the structures while allowing the ends to be eliminated, those coefficients could be increased to a value of 2. Also, in embodiments, the exponents of the half sizes for the requirements (s1, s2, and (s2−s1)) are all unity; and if a different dimensionality was desired for structures to be preserved, those exponents could be modified accordingly.
As for the exact values of s1 and s2 used, it may be desirable to take s2 to be sufficiently larger than s1 to sample enough of the space for the “sufficient increase requirement” (that is, requirement “(3)”) to be meaningful. Also, in embodiments, the maximum values of s1 and s2 that should be used depend on the expected sizes of the objects in the frames. The values presented here work well for video sequences that have been scaled to be a few hundred pixels by a few hundred pixels. As for the specific values of s1 chosen, the present iterative schedule has empirically been found to provide reasonably good achievement of the desired filtering; and subject to that level of quality, this schedule seems to be the minimum amount of work required.
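For illustration only, the sketch below applies filter criteria (1)-(3) using neighbor sums computed over (2s+1)×(2s+1) boxes; the (s1, s2) schedule shown is an assumption, as the specific iterative schedule is not reproduced here.

```python
# Hedged sketch of the box-counting-style filter defined by conditions (1)-(3):
# neighbor_sum(s) counts foreground pixels in the (2s+1) x (2s+1) box around
# each pixel; foreground pixels failing any condition are declared noise.
import numpy as np
import cv2

def neighbor_sum(bfim, s):
    k = 2 * s + 1
    return cv2.boxFilter(bfim.astype(np.float32), -1, (k, k), normalize=False)

def filter_bfim(bfim, pairs=((2, 6), (4, 12)), c1=0.9, c2=0.4):
    kept = bfim.copy()
    for s1, s2 in pairs:                                   # illustrative schedule
        n1, n2 = neighbor_sum(kept, s1), neighbor_sum(kept, s2)
        cond = ((n1 >= np.floor(c1 * s1)) &
                (n2 >= np.floor(c1 * s2)) &
                (n2 >= n1 + np.floor(c2 * (s2 - s1))))
        kept = np.where(cond, kept, 0).astype(np.uint8)    # drop failing pixels
    return kept
```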
For example,
Examples of the weight map are shown in
According to embodiments, in order to make use of a variable threshold to detect foreground using a BFIM, three foreground curves may be constructed, one for each of the foreground fractions. Each foreground curve may represent the cumulative distribution function of the area-weighted foreground fraction for that metric. As shown in
In embodiments, the method 2200 further includes determining variable thresholds for each metric (VTH(UFAF), VTH(FPF), and VTH(WFAF)) (blocks 2226, 2228, and 2230). The variable thresholds may be determined by finding the intersection of each foreground curve with a specified monotonically decreasing threshold curve. In the case of no intersection between the curves, all of the segments may be declared to be background. The inventors have achieved positive results by taking the threshold curves to be straight lines passing through the listed points: the (unweighted) area threshold curve (VTH(UFAF)) through (0, 0.8), (1, 0.1); the perimeter threshold curve (VTH(FPF)) through (0, 1.0), (1, 0.5); and the weighted area threshold curve (VTH(WFAF)) through (0, 0.6), (1, 0.2). The method 2200 may include classifying all segments which are above any of the variable thresholds as foreground (block 2232), and all other segments as background.
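For illustration only, the sketch below shows one plausible reading of the variable-threshold construction for a single metric: an area-weighted foreground curve is intersected with the straight-line threshold curve through (0, 0.8) and (1, 0.1). The exact curve construction used in embodiments may differ.

```python
# Hedged sketch: for candidate thresholds t in [0, 1], the curve gives the
# fraction of total segment area whose foreground fraction meets or exceeds t;
# the variable threshold is the first crossing with the straight threshold line.
import numpy as np

def variable_threshold(fractions, areas, p0=0.8, p1=0.1, samples=256):
    """fractions, areas: per-segment foreground fraction and pixel area."""
    fractions, areas = np.asarray(fractions, float), np.asarray(areas, float)
    ts = np.linspace(0.0, 1.0, samples)
    curve = np.array([areas[fractions >= t].sum() for t in ts]) / areas.sum()
    line = p0 + (p1 - p0) * ts                    # straight threshold line
    crossings = np.where(curve <= line)[0]
    if len(crossings) == 0:
        return None                               # no intersection: all background
    return ts[crossings[0]]

# Segments whose foreground fraction meets or exceeds the returned value would
# be declared foreground for this metric.
```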
In many cases, where good moving foreground detection seems possible under human inspection, the above criteria generally function well. However, there are some low-noise cases where small foreground motion may not be detected by the above criteria but is possible under human inspection. In order to handle these cases, a conditional second pass may be utilized. For example, a determination may be made whether the total area declared foreground is less than a specified threshold (e.g., approximately 25%) of the total segment area (block 2234). If not, the classification depicted in block 2232 may be retained (block 2236), but if so, then, a second-pass variable threshold may be applied (block 2238). In embodiments, any other criteria may be used to determine whether a second-pass variable threshold may be applied.
This second-pass threshold may be applied, for example, only to the (unweighted) area fraction of doubly-eroded segments with the fraction normalized by the non-eroded area of the segment. Examples of a doubly eroded segment map are shown in
Examples of the foreground curves and determination of the thresholds for the first example are shown in
Examples of the parts of an image declared to be foreground are shown in
In the second example,
In embodiments, double erosion of the segments for the second pass may be used because, for example, it provides a good balance between completely ignoring the foreground pixels in adjacent segments (and their impact on noise reduction) and fully considering them for that purpose.
As indicated above, the foreground indicator map (FIM) may be binary (BFIM) or non-binary (NBFIM) and various embodiments of methods described herein may use a BFIM, an NBFIM, or some combination of the two. As described above, the binary foreground indicator map (BFIM) may be calculated from the difference image and the foreground threshold image. Additional insight into foreground analysis may be provided by using a non-binary foreground indicator map (NBFIM). The use of a non-binary map may include a modified fractal-based analysis technique.
To construct the NBFIM, embodiments include defining the normalized absolute difference image (NADI) to be an image where each component of each pixel is equal to the corresponding value in the difference image divided by the corresponding value in the foreground threshold image. The foreground threshold image may be constructed so that it has a value of at least one for each component of each pixel. Each pixel of the unfiltered NBFIM may be defined to be equal to the arc-hyperbolic sine (asinh) of the sum of the squares of the components of the corresponding pixel in the normalized absolute difference image (NADI), with a coefficient of 0.5 for each of the chroma components; that is, for each pixel: unfiltered NBFIM = asinh(NADI_Y^2 + 0.5*NADI_Cb^2 + 0.5*NADI_Cr^2). The asinh() function is used to provide an unbounded quasi-normalization, as asinh(x) ~ x for small x and asinh(x) ~ log(x) for large x.
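For illustration only, a direct transcription of the unfiltered NBFIM formula above, assuming the difference and threshold images are available with components in Y/Cb/Cr order:

```python
# unfiltered NBFIM = asinh(NADI_Y^2 + 0.5*NADI_Cb^2 + 0.5*NADI_Cr^2)
import numpy as np

def unfiltered_nbfim(difference_image, threshold_image):
    """Both inputs: H x W x 3 arrays (Y, Cb, Cr); threshold >= 1 everywhere."""
    nadi = np.abs(difference_image) / threshold_image   # normalized absolute difference
    return np.arcsinh(nadi[..., 0] ** 2
                      + 0.5 * nadi[..., 1] ** 2
                      + 0.5 * nadi[..., 2] ** 2)
```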
For example,
Sample unfiltered NBFIM are shown in
A non-binary fractal-based analysis may be used to generate the filtered NBFIM from the unfiltered NBFIM. The concept for the analysis may be the same as for the binary case, and may be based, for example, on a selected minimal linear growth rate in the neighborSumMap with respect to the box size; however, unlike the binary case, the coefficients for the growth may be based on the average value of pixels in the frame during that iteration.
For example, in embodiments, of the method 2300 depicted in
In embodiments, both of the conditions may be tested. In other embodiments, as shown in
As shown in
Sample filtered NBFIM are shown in
The filtered NBFIM can be used for foreground detection in a manner analogous to the binary case.
Samples of the foreground curves and determined foreground for a number of cases are illustrated in
According to embodiments, a number of other possible foreground metrics may exist. For example, even in some difficult cases, embodiments of the methods described above may facilitate eliminating noise while retaining many desired one-dimensional structures. Thus, for example, a simple edge-enhancing technique can be applied to the filtered NBFIM to help identify the edges of moving objects. One such technique may include setting the preliminary mask to be the set of all points in the filtered NBFIM that have a value greater than unity, and setting the base mask = dilate3×3(preliminary mask), where dilate3×3() means that each pixel of the output is set to the maximum value among the corresponding pixel and its 8 neighbors (this step may be performed to reduce any gaps in the structures). The technique may further include setting the outline mask = base mask & ~(erode3×3(base mask)), where erode3×3() is analogous to dilate3×3() but uses the minimum, “&” indicates a bit-wise “and” operation, and “~” indicates a negation. The outline mask may give the outline of the base mask, not the edges of moving objects. The edge mask may then be set as edge mask = erode3×3(dilate3×3(outline mask)). Samples of the base, outline, and edge masks are shown in
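For illustration only, the sketch below transcribes the edge-enhancing step using OpenCV's 3×3 morphology operators in place of dilate3×3()/erode3×3():

```python
# Hedged transcription of the edge-enhancing technique described above.
import numpy as np
import cv2

def edge_mask_from_nbfim(filtered_nbfim):
    kernel = np.ones((3, 3), np.uint8)
    preliminary = (filtered_nbfim > 1.0).astype(np.uint8)     # points with value > unity
    base = cv2.dilate(preliminary, kernel)                     # close small gaps
    outline = base & (1 - cv2.erode(base, kernel))             # base AND NOT eroded base
    edge = cv2.erode(cv2.dilate(outline, kernel), kernel)      # clean up the outline
    return edge
```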
According to various embodiments of the disclosure, the choice of the variable threshold may allow for some room for customization and/or optimization. For example, embodiments may focus on finding the “knee” of the foreground curves, which may be considered as the minima of the derivative of (the foreground curve minus x), which serves as a discrete analogue of searching for points where df/dx=1. Embodiments may incorporate some method of detecting locally varying illumination changes, such as that used by Farmer, as described in M. E. Farmer, “A Chaos Theoretic Analysis of Motion and Illumination in Video Sequences”, Journal of Multimedia, Vol. 2, No. 2, 2007, pp. 53-64; and M. E. Farmer, “Robust Pre-Attentive Attention Direction Using Chaos Theory for Video Surveillance”, Applied Mathematics, Vol. 4, 2013, pp. 43-55, the entirety of each of which is hereby incorporated herein by reference for all purposes. Embodiments may include applying the fractal-analysis filter or a variable threshold to the difference image, such as is shown in
For the non-binary case, embodiments may include various modifications. For example, embodiments may make use of a multi-pass method, as we described above for the binary case. For example, the first pass may be sufficient to identify any clearly moving objects; then, if too little area were identified as foreground, a second pass may be performed to identify slight motion. In embodiments, the calculation of the threshold image may be performed in any number of different ways such as, for example, by applying one or more filters to it, and/or considering the temporal variation of pixels as part of the threshold. Another approach, according to embodiments, may include dividing the foreground sum by some linear measure of a segment size, e.g., the square root of the area. Embodiments of the foreground detection techniques described herein may be used to detect foreground in images that have not been segmented. That is, for example, an FIM (e.g., a BFIM and/or an NBFIM) may be constructed based on an unsegmented image. In embodiments, such a FIM may be filtered and the remaining foreground pixels may be classified as foreground.
To produce multi-view video content, (e.g., three-dimensional (3D) video, augmented-reality (AR) video, virtual-reality (VR) video, etc.) multiple views may be used to present a scene (or scene augmentation) to a user. A view refers to a perspective of a scene and may include one or more images corresponding to a scene, where all of the images in the view represent a certain spatial (and/or temporal) perspective of a video scene (e.g., as opposed to a different perspective of the video scene, represented by a second view). According to embodiments, a perspective may include multiple spatial and/or temporal viewpoints such as, for example, in the case in which the “viewer” —that is, the conceptual entity that is experiencing the scene from the perspective—is moving relative to the scene, or an aspect thereof (or, put another way, the scene, or an aspect thereof, is moving relative to the viewer).
In embodiments, a view may include video information, which may be generated in any number of different ways such as, for example, using computing devices (e.g., computer-generated imagery (CGI)), video cameras (e.g., multiple cameras may be used to respectively record a scene from different perspectives), and/or the like. Accordingly, in embodiments, a view of a scene (e.g., computer-generated and/or recorded by a video camera) may be referred to herein as a video feed and multiple views of a scene may be referred to herein as video feeds. In embodiments, each video feed may include a plurality of video frames.
To encode a scene for multi-view video, some conventional systems and methods may independently encode the video feeds. That is, in embodiments, a first video feed may be encoded without regard to other video feeds. For example, during encoding, conventional systems and methods may estimate the motion vectors for each of the video feeds independently. Estimating motion vectors for each video feed independently can be computationally burdensome. Embodiments of this disclosure may provide a solution that is less computationally burdensome.
In embodiments, the video data 3404 may include views of a scene embodied in a number of video feeds. In embodiments, the video feeds of the scene, or aspects thereof, may have been respectively recorded by cameras positioned at different locations so that the scene is recorded from multiple different viewpoints. In embodiments, the video feeds of the scene, or aspects thereof, may have been computer-generated. In some instances, view information—information about the perspective corresponding to the view (e.g., camera angle, virtual camera angle, camera position, virtual camera position, camera motion (e.g., pan, zoom, translate, rotate, etc.), virtual camera motion, etc.)—may be received with (or in association with) the video data 3404 (e.g., multiplexed with the video data 3404, as metadata, in a separate transmission from the video data 3404, etc.). In other embodiments, view information may be determined, e.g., by the encoding device 3402. Each of the video feeds of the video data 3404 may be comprised of multiple video frames. In embodiments, the video feeds may be combined to produce multi-view video.
As described herein, while producing the encoded video data 3406 from the video data 3404, the encoding device 3402 may determine motion vectors of the video data 3404. In embodiments, the encoding device 3402 may determine motion vectors of the video data 3404 in a computationally less demanding way than conventional encoding systems by, for example, using methods that include extrapolating motion vectors from a first video feed to other video feeds.
As shown in
As shown in
In embodiments, the memory 3414 stores computer-executable instructions for causing the processor 3412 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components may include a segmenter 3418, a foreground detector 3420, a multi-view motion estimator 3422, an object analyzer 3424, an encoder 3426 and a communication component 3428.
As indicated above, in embodiments, the video data 3404 includes multiple video feeds and each video feed includes multiple video frames. In embodiments, the segmenter 3418 may be configured to segment one or more video frames into a plurality of segments. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 3418 may employ any number of various automatic image segmentation techniques such as, for example, those discussed herein. For example, the segmenter 3418 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. In embodiments, the segmenter 3418 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph. In embodiments, the segmenter 3418 implements aspects of the segmentation techniques described in Iuri Frosio and Ed Ratner, “Adaptive Segmentation Based on a Learned Quality Metric,” Proceedings of the 10th International Conference on Computer Vision Theory and Applications, March 2015, the entirety of which is hereby incorporated herein by reference for all purposes.
The resulting segment map of image segments includes an assignment of indices to every pixel in the image, which allows for the frame to be dealt with in a piecemeal fashion. Each index, which may be indexed and stored in memory 3414, may be considered a mask for this purpose.
In embodiments, the foreground detector 3420 may be configured to perform foreground detection on one or more video frames of the video data 3404. For example, in embodiments, the foreground detector 3420 may perform segment-based foreground detection, where the foreground segments, or portions of the segments, are detected using any number of different techniques such as, for example, those discussed above with respect to U.S. application Ser. No. 14/737,418, filed Jun. 11, 2015, entitled “FOREGROUND DETECTION USING FRACTAL DIMENSIONAL MEASURES,” the entirety of which is incorporated herein by reference for all purposes. For example, in embodiments, the foreground detector 3420 may identify a segment as a foreground segment or a background segment by: determining at least one foreground metric for the segment based on a filtered binary foreground indicator map (BFIM), determining at least one variable threshold based on the foreground metric, and applying the at least one variable threshold to the filtered BFIM to identify the segment as either a foreground segment or a background segment. Alternatively, the foreground detector 3420 may perform foreground detection on images that have not been segmented. In embodiments, results of foreground detection may be used by the segmenter 3418 to inform a segmentation process.
In embodiments, the multi-view motion estimator 3422 is configured to perform motion estimation for multiple video feeds of video data 3404. To facilitate motion estimation, the multi-view motion estimator 3422 may include one or more program components. Examples of such program components include a single-view motion estimator 3430, a camera position and viewing angle calculator 3432, a depth analyzer 3434, and an extrapolator 3436.
The single-view motion estimator 3430 may be configured to estimate the motion of one or more segments between video frames of a single video feed. For example, the single-view motion estimator 3430 may receive a single video feed of the video data 3404. The single video feed may be received after video frames of the video feed are segmented by the segmenter 3418. The single-view motion estimator 3430 may then perform motion estimation on the segmented video frames of the video feed. That is, the single-view motion estimator 3430 may estimate the motion of a segment between video frames of the single video feed, with the current frame in the temporal sequence serving as the source frame and the subsequent frame in the temporal sequence serving as the target frame.
In embodiments, the single-view motion estimator 3430 may utilize any number of various motion estimation techniques known in the field. Two example motion estimation techniques are optical pixel flow and feature tracking. As an example, the single-view motion estimator 3430 may use feature tracking in which Speeded Up Robust Features (SURF) are extracted from both a source image (e.g., a source frame) and a subsequent image (e.g., a subsequent frame). The individual features of the two images may then be compared using a Euclidean metric to establish a correspondence between features, thereby generating a motion vector for each feature. In embodiments, a motion vector for a segment may be the median of all of the motion vectors for each of the segment's features. After a motion vector for a segment is determined, each segment may be categorized based on its motion properties. For example, a segment may be categorized as either moving or stationary.
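For illustration only, the sketch below estimates per-segment motion vectors by matching ORB features (a stand-in for SURF, which may not be available in all OpenCV builds) between a source and a target frame and taking the median per-feature vector within each segment.

```python
# Hedged sketch of single-view, feature-tracking motion estimation. ORB uses a
# Hamming descriptor distance rather than the Euclidean metric named above.
import numpy as np
import cv2

def segment_motion_vectors(src_gray, tgt_gray, segment_map):
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(src_gray, None)
    kp2, des2 = orb.detectAndCompute(tgt_gray, None)
    if des1 is None or des2 is None:
        return {}
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    vectors = {}                                   # segment index -> list of (dx, dy)
    for m in matcher.match(des1, des2):
        (x1, y1), (x2, y2) = kp1[m.queryIdx].pt, kp2[m.trainIdx].pt
        seg = int(segment_map[int(y1), int(x1)])
        vectors.setdefault(seg, []).append((x2 - x1, y2 - y1))
    # Median per segment; segments with no matched features get no vector here.
    return {seg: tuple(np.median(np.array(v), axis=0)) for seg, v in vectors.items()}
```

Segments whose median vector has a non-negligible magnitude could then be categorized as moving, consistent with the categorization described above.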
After the single-view motion estimator 3430 determines motion vectors for a single video feed, the multi-view motion estimator 3422 may extrapolate motion vectors for the other video feeds of the video data 3404. To do so, in embodiments, the camera position and viewing angle calculator 3432 may calculate the relative positions and viewing angles of the cameras that respectively recorded the video feeds of the video data 3404 based on the fields of view of each of the cameras. The relative positions and angles of the cameras may be used to determine 3D coordinates of a video scene. Additionally or alternatively, the relative positions and viewing angles of the cameras may be included in metadata of the video data 3404 and received by the multi-view motion estimator 3422.
As shown in
After the 3D map is created, the extrapolator 3436 may be configured to extrapolate the motion vectors determined by the single-view motion estimator 3430 for a video feed to other video feeds. To do so, in embodiments, the extrapolator 3436 assigns 3D coordinates to each of the motion vectors computed by the single-view motion estimator 3430 based on the 3D map. That is, the extrapolator 3436 may be configured to receive two-dimensional motion vector data from the single-view motion estimator 3430 and determine the three-dimensional representations of the motion vectors using the 3D map determined from the calculated pixel depth. The extrapolator 3436 then can use the 3D representation of motion vectors to compute two-dimensional projections onto one or more of the other two-dimensional coordinate systems associated with the other video feeds. In embodiments, a local search can be performed by the extrapolator 3436 to determine whether the motion vectors projected onto a video feed accurately represent motion vectors for the video feed. In embodiments, the projected motion vectors may be compared to computed motion vectors for one or more of the other video feeds using a Euclidean metric to establish a correspondence between motion vectors and/or determine a projection error of the projected motion vectors.
In embodiments, the object analyzer 3424 may be configured to identify, using the segment map and/or the motion vectors computed by the multi-view motion estimator 3422, the presence of objects (e.g., clusters of segments that move together, or at least approximately together) within digital images such as, for example, frames of video. In embodiments, the object analyzer 3424 may be configured to identify and/or analyze only objects that are moving within a particular scene. In embodiments, the object analyzer 3424 may perform object analysis on images that have not been segmented. Results of object group analysis may be used by the segmenter 3418 to facilitate a segmentation process, by an encoder 3426 to facilitate an encoding process, and/or the like.
As shown in
The illustrative operating environment 3400 shown in
As shown in
As shown in
Embodiments of the method 3500 further include extrapolating motion vectors from the first feed to other feeds (block 3506). Embodiments of a motion vector extrapolation method 3600 are discussed below with respect to
After the motion vectors from a first feed are extrapolated onto other feeds, the motion vectors are encoded (block 3508). By extrapolating motion vectors from a first feed onto another feed, the computational demands of encoding multiple video feeds may be reduced. That is, for example, extrapolating motion vectors from one video feed onto another video feed may be computationally less demanding than computing motion vectors for each video feed independently. In embodiments, encoding the motion vectors may be performed by an encoder 3426, as shown in
After encoding the motion vectors, the encoded motion vector data may be transmitted (block 3510). The encoded motion vector data may be transmitted to a decoding device (e.g., the decoding device 3408 depicted in
The illustrative method 3500 shown in
As shown in
Embodiments of the method 3600 further include receiving motion vectors computed for a first video feed (block 3604). Determining motion vectors for video frame segments may include any number of different techniques known in the field, such as the ones described above in relation to
In embodiments, the method 3600 may further comprise determining relative positions and angles of cameras used to record the video feeds (block 3606). In embodiments, the relative positions and angles of the cameras may be computed based on a comparison of the fields of view of each of the cameras. Additionally or alternatively, the relative positions and viewing angles of the cameras may be included in metadata of the received segmented video feeds. The relative positions and angles of the cameras may be used to determine 3D coordinates of a video scene, as described below. In embodiments, a camera position and viewing angle calculator 3432, as depicted in
Based on the relative positions of the cameras used to record the two video feeds, pixel depths may be calculated and a 3D map may be created (block 3608). As an example, if an object encompasses an area of a1×b1 pixels in one video feed and the same object encompasses an area of a2×b2 pixels in another video feed, then, based on the relative positions and angles of the two cameras used to record the two video feeds, a transformation function can be determined that will transform the object from a1×b1 pixels to a2×b2 pixels. Because the transformation function is based on the relative positions and angles of the cameras, the transformation function will include distance information that can be used to calculate the distance of the object from the first camera used to record the first video feed. That is, a depth can be assigned to the pixels comprising the object. Similarly, distances to other objects included in the video feed may be determined. Once the distances to each of the objects are calculated, a pixel depth (i.e., a depth coordinate) can be assigned to each object and pixel included in a video feed. And, based on the calculated pixel depths, a 3D map of a video feed may be determined. That is, for each pixel of a video frame, a depth coordinate (e.g., a z coordinate) can be added to the horizontal and vertical coordinates (e.g., x, y coordinates). In embodiments, a depth analyzer 3434, as depicted in
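For illustration only, the narrow special case of two parallel cameras is sketched below, where pixel depth follows from disparity as depth = focal_length × baseline / disparity; the more general transformation-function approach described above, based on relative camera positions and angles, is not reproduced here.

```python
# Hedged sketch: assigning a depth (z) coordinate per pixel from disparity,
# for the simplified case of two parallel, calibrated cameras.
import numpy as np

def depth_from_disparity(disparity, focal_length_px, baseline_m):
    """disparity: H x W array of pixel disparities between the two views."""
    disparity = np.asarray(disparity, float)
    depth = np.full(disparity.shape, np.inf)       # zero disparity: depth unresolved
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth
```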
As shown in
The illustrative method 3600 shown in
In addition, the same scene is recorded by a second video camera from a second viewpoint 3702′. The second video camera has a respective position and angle relative to the scene. In embodiments, the viewpoint 3702′ of the scene includes a two-dimensional coordinate system 3704′. According to embodiments, the same or substantially the same segment 3706 may be determined for the two-dimensional coordinate system 3704′. The segment 3706 is depicted as segment 3706′ in the two-dimensional coordinate system 3704′. In embodiments, the segment 3706′ may be determined using a variety of embodiments, including, for example, the embodiments described above in relation to
Based on the representations of the segments 3706, 3706′ a pixel depth may be calculated based on the respective positions and angles of the first and second cameras. In addition, using the calculated pixel depth, a three-dimensional representation of the segment 3706, 3706′ and/or the motion vector 3708 may be determined. That is, depth coordinates may be assigned to the segment 3706, 3706′ and/or the motion vector 3708. In embodiments, the pixel depth and three-dimensional representation may be determined using a variety of embodiments, including, for example, the embodiments described above in relation to
After a three-dimensional representation of the motion vector 3708″ is determined, the motion vector 3708″ may be projected onto the two-dimensional coordinate system 3704′ to yield a projected motion vector 3708′. In embodiments, a local search may be performed to determine whether the motion vector 3708′ projected onto the two-dimensional coordinate system 3704′ accurately represents the motion vector for the segment 3706′. For example, the projected motion vector 3708′ may be compared to a computed motion vector for the two-dimensional coordinate system 3704′ using a Euclidean metric to establish a correspondence between the motion vectors and/or to determine a projection error of the projected motion vectors. Additionally or alternatively, the motion vector 3708″ may be projected onto other viewpoints of the scene recorded by other video cameras.
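By way of a non-limiting illustration, the projection step might be sketched as follows, assuming a pinhole camera model in which the second view is described by a 3×4 projection matrix; the matrix, function names, and the Euclidean error check are illustrative assumptions rather than the exact procedure of the embodiments.

import numpy as np

def project_point(P, point_3d):
    # Project a 3D point into a camera's 2D coordinate system using an assumed
    # 3x4 pinhole projection matrix P for the second camera.
    homo = P @ np.append(point_3d, 1.0)
    return homo[:2] / homo[2]

def project_motion_vector(P, start_3d, motion_3d):
    # Project the start and end points of a 3D motion vector into the second
    # view and return the resulting 2D motion vector.
    p0 = project_point(P, start_3d)
    p1 = project_point(P, start_3d + motion_3d)
    return p1 - p0

def projection_error(projected_mv, computed_mv):
    # Euclidean distance used in the local search to validate the projection.
    return float(np.linalg.norm(projected_mv - computed_mv))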
By extrapolating the motion vector onto the two-dimensional coordinate system 3704′ rather than computing it independently, processing video from multiple video feeds in this manner may reduce the computational requirements of a system encoding the video feeds.
The illustrative block diagram 3700 shown in
Various modifications and additions can be made to the exemplary embodiments discussed without departing from the scope of the present disclosure. For example, while the embodiments described above refer to particular features, the scope of this disclosure also includes embodiments having different combinations of features and embodiments that do not include all of the described features. Accordingly, the scope of the present disclosure is intended to embrace all such alternatives, modifications, and variations as fall within the scope of the claims, together with all equivalents thereof.
As explained above, embodiments of systems and methods described herein include an object group analysis process (e.g., the object group analysis process 212 depicted in
In embodiments, graph partitioning may be used in an object group analysis process. Graph partitioning often arises as a useful component of solving many numerical problems. Many graph partitioning methods have been devised, and selection of an appropriate method for a given problem may depend on the meaning of the underlying data and available computational resources. In the context of various aspects of video processing, graphs can be useful for performing processes such as, for example, image segmentation, object analysis and tracking, partitioning, and/or the like.
Embodiments include a graph partitioning algorithm that uses a max-flow technique. Embodiments of the algorithm include selecting source and drain vertices from within the existing graph to generate the min-cut.
Embodiments include a clustering algorithm for partitioning graphs using a max-flow/min-cut technique. See, for example, P. Sanders and C. Schulz (2011), “Engineering Multilevel Graph Partitioning Algorithms,” Proceedings of the 19th European Symposium on Algorithms (ESA), pp. 469-480 (hereinafter “Sanders & Schulz”), for an illustrative discussion of max-flow/min-cut techniques, the entirety of which is hereby incorporated herein by reference for all purposes. In contrast to conventional max-flow/min-cut techniques, in which a source vertex (sometimes referred to, interchangeably, as an “origin”) and a drain vertex (sometimes referred to, interchangeably, as a “sink”) are added to the graph to facilitate refining a candidate cut, embodiments of the subject matter disclosed herein include a max-flow/min-cut technique in which the source and drain vertex are selected from the set of graph vertices. Embodiments of the clustering algorithm described herein may be used for any number of purposes such as, for example, for object group analysis, segmentation, network modeling (e.g., routing of Internet Protocol (IP) packets), scheduling (e.g., scheduling of encoding jobs distributed between encoders, scheduling of tasks associated with distributed processing, etc.), other problems that may be modeled using graph partitioning, and/or the like. For example, in object group analysis, embodiments of the clustering algorithm described herein may be used to identify object boundaries, thereby facilitating more efficient encoding. For example, embodiments may be utilized with other techniques described herein to increase video processing efficiency by between approximately 10 and approximately 30 percent.
Embodiments of the disclosed algorithm may facilitate use of max-flow/min-cut techniques in situations in which these techniques may not have been traditionally used, a more accurate partitioning process, and/or the like. For example, max-flow/min-cut partitioning techniques are typically used only when there is a good indication of which graph vertices the added source and drain vertices should be attached to. For example, as explained in U.S. Pat. Nos. 6,973,212 and 7,444,019, filed on Aug. 30, 2001, and Feb. 15, 2005, respectively, the entirety of each of which is hereby incorporated herein by reference for all purposes, a source vertex may be connected to graph vertices identified as object seeds, and the drain vertex may be connected to graph vertices identified as background seeds. Embodiments of the clustering algorithm described herein may be used to generalize the problem to situations in which object and/or background seeds are not known. In this manner, embodiments facilitate use of the max-flow/min-cut partitioning technique in situations in which other types of algorithms (e.g., spectral partitioning, Markov Cluster (MCL) algorithms, etc.) are typically used.
In contrast with conventional max-flow/min-cut techniques, embodiments disclosed herein include a clustering algorithm in which the source and drain vertices used for the partitioning are selected from the existing graph vertices. This technique and the other embodiments described herein may provide one or more advantages over conventional methods: whereas conventional max-flow algorithms optimize the max-flow metric and spectral algorithms optimize the cut-ratio metric, the embodiments described herein can perform well on both metrics simultaneously, which conventional approaches cannot. In addition, the embodiments described herein produce partitions that are more meaningful from an object-creation point of view.
An example is depicted in
Embodiments disclosed herein include systems and methods for object group analysis.
As shown in
In embodiments, the memory 3914 stores computer-executable instructions for causing the processor 3912 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components include a segmenter 3918, an object analyzer 3920, an encoder 3922, and a communication component 3924.
In embodiments, the segmenter 3918 may be configured to segment a video frame into a number of segments to generate a segment map. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 3918 may employ any number of various automatic image segmentation methods known in the field. In embodiments, the segmenter 3918 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmenter 3918 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph. In embodiments, the segmenter 3918 implements aspects of the segmentation techniques described in Iuri Frosio and Ed Ratner, "Adaptive Segmentation Based on a Learned Quality Metric," Proceedings of the 10th International Conference on Computer Vision Theory and Applications, March 2015, the entirety of which is hereby incorporated herein by reference for all purposes.
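By way of a non-limiting illustration, a minimal segmenter along these lines might be sketched as follows, using OpenCV's Canny edge detector to delimit regions and a marker-based watershed to fill every pixel with a segment index; the thresholds and the use of connected components to seed the watershed are illustrative choices, and the optimum-cut partitioning alternative is not shown.

import cv2
import numpy as np

def segment_frame(frame_bgr, low_thresh=100, high_thresh=200):
    # Produce a segment map (an index per pixel) for one 8-bit BGR video frame.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low_thresh, high_thresh)

    # Seed markers from the regions between edges; edge pixels get label 0,
    # which the watershed treats as "unknown" and fills in.
    non_edge = (edges == 0).astype(np.uint8)
    _, markers = cv2.connectedComponents(non_edge)

    # Watershed assigns every pixel to a segment; it marks boundary pixels -1,
    # which are set to 0 here so the result is a nonnegative segment index map.
    markers = cv2.watershed(frame_bgr, markers.astype(np.int32))
    markers[markers == -1] = 0
    return markers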
The resulting segment map of image segments includes an assignment of indices to every pixel in the image, which allows for the frame to be dealt with in a piecemeal fashion. Each index, which may be indexed in a database 3926 stored in the memory 3914, may be considered a mask for this purpose. The database 3926, which may refer to one or more databases, may be, or include, one or more tables, one or more relational databases, one or more multi-dimensional data cubes, and the like. Further, though illustrated as a single component, the database 3926 may, in fact, be a plurality of databases 3926 such as, for instance, a database cluster, which may be implemented on a single computing device or distributed between a number of computing devices, memory components, or the like.
In embodiments, the object analyzer 3920 may be configured to identify, using the segment map, the presence of objects (e.g., clusters of segments that move together, or at least approximately together) within digital images such as, for example, frames of video. In embodiments, the object analyzer 3920 may be configured to identify and/or analyze only objects that are moving within a particular scene. In embodiments, the object analyzer 3920 may perform object analysis on images that have not been segmented. Results of object group analysis may be used by the segmenter 3918 to facilitate a segmentation process, by an encoder 3922 to facilitate an encoding process, and/or the like.
According to embodiments, as shown in
As shown in
The object analyzer 3920 also may include an object identifier 3932 configured to identify the presence of one or more objects in an image. In embodiments, the object identifier 3932 may represent more than one object identifier 3932. The object identifier 3932 may utilize any number of different types of object identification techniques such as, for example, clustering. According to embodiments, an object may be a group of one or more segments that move at least approximately together from frame to frame. After identifying the objects, the objects may be classified using a classifier 3934 configured to classify at least one of the objects to identify characteristics of the objects. The classifier 3934 may be configured to receive input information and produce output that may include one or more classifications. In embodiments, the classifier 3934 may be a binary classifier and/or a non-binary classifier. The classifier 3934 may include any number of different types of classifiers such as, for example, a support vector machine (SVM), an extreme learning machine (ELM), a neural network, a kernel-based perceptron, a k-NN classifier, a bag-of-visual-words classifier, and/or the like. The object analyzer 3920 may include a tracker 3936 that is configured to track one or more of the identified objects, groups of objects, and/or the like, as they move throughout the video.
As shown in
The illustrative operating environment 3900 shown in
As shown in
As shown in
Embodiments of the method 4000 further include translating each of the segments based on the determined estimated motion (block 4006) and overlaying the translated segments on the segmentation of the next frame (e.g., the immediately following frame) (block 4008). In this manner, embodiments of the method 4000 facilitate determining overlaps between the translated segments and the segments of the next frame (block 4010).
As shown in
As is further shown in
In embodiments, each group of segments in each frame may be a vertex, and the edge weights may be defined as the amount of overlap between (camera and group) motion-compensated groups, modified according to a difference in group motion. According to embodiments, segments may be grouped in any number of different manners. In embodiments, adjacent segments may be grouped based on characteristics of their respective motion from a current frame to the subsequent frame. For example, a number of segments may be associated with a particular group if they are adjacent (e.g., within a specified distance from one another) and if their motion vectors are similar. According to embodiments, two or more motion vectors may be similar if they satisfy a similarity metric. For example, motion vectors may be deemed to be similar if their respective directions are within a specified number of degrees of one another, if a difference between their magnitudes is within a specified range or exceeds a specified threshold (or is less than a specified threshold), if a metric calculated based on one or more features of the motion vectors satisfies a specified criterion, and/or the like. Using the grouped segments, graphs may be generated and partitioned to create objects, as described herein. In embodiments, for the purpose of creating graphs, each group of segments may be a group of one—that is, for example, graph vertices may correspond to individual segments. For example, an edge weight may be defined as follows:
edgeWeight=floor(10*motionFactor*(forwardOverlap+backwardOverlap)/2),
where the forwardOverlap is based on the forward motion of each warped (camera motion compensated) group onto groups in the next frame, the backwardOverlap is based on the backward motion of each unwarped group onto groups in the previous frame, the floor(10* . . . ) is used to convert from order-unity-or-greater floating point values to integer values, and the motionFactor is given by:
dx = max{0, |vx1 + vx2| − vsoak},
dy = max{0, |vy1 + vy2| − vsoak},
vΔ = sqrt(dx² + dy²),
vΣ = sqrt(vx1² + vx2² + vy1² + vy2²);
v0² = vA² + vB*vΣ + C*vΣ², and
motionFactor = v0²/(v0² + vΔ²),
where vx1, vy1 are the forward motion vector components of one group, vx2, vy2 are the backward motion vector components of the other group (hence the use of addition to obtain dx, dy); max{0, x} is x if x is positive and 0 otherwise; sqrt( ) is the square root function; and the other symbols are constants. In embodiments, for example, vsoak=0.5, vA=0.002, vB=0.1, and C=0.0. According to various embodiments, these constants may be assigned any number of other combinations of values. The values of the constants may be adjusted to achieve edge capacities that produce results meeting any number of different outcome criteria.
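By way of a non-limiting illustration, the edge-weight formula above might be transcribed directly into Python as follows, using the illustrative constant values listed; the identifiers simply mirror the notation above.

import math

# Illustrative constants from the passage above; other values may be used.
V_SOAK, V_A, V_B, C = 0.5, 0.002, 0.1, 0.0

def motion_factor(vx1, vy1, vx2, vy2):
    # motionFactor from one group's forward motion vector (vx1, vy1) and the
    # other group's backward motion vector (vx2, vy2).
    dx = max(0.0, abs(vx1 + vx2) - V_SOAK)
    dy = max(0.0, abs(vy1 + vy2) - V_SOAK)
    v_delta_sq = dx * dx + dy * dy
    v_sigma = math.sqrt(vx1**2 + vx2**2 + vy1**2 + vy2**2)
    v0_sq = V_A**2 + V_B * v_sigma + C * v_sigma**2
    return v0_sq / (v0_sq + v_delta_sq)

def edge_weight(forward_overlap, backward_overlap, vx1, vy1, vx2, vy2):
    # Integer edge weight between two groups, per the formula above.
    mf = motion_factor(vx1, vy1, vx2, vy2)
    return math.floor(10.0 * mf * (forward_overlap + backward_overlap) / 2.0)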
In embodiments, when constructing edge weights, the algorithm does not directly consider how well the overlapping pixels match—this information is somewhat captured by the (typically L1) residuals during segment motion, and that information impacts declaration of which segments can be said to be moving and how groups ("proto-groups") are formed. But, in embodiments, the residuals are not directly considered when constructing the edge weights. In other embodiments, the residuals are considered when constructing edge weights. For example, embodiments of the method 4100 may include examining the actual overlaps between groups. In embodiments, considering, instead of the actual overlaps between groups, the aggregate residuals of the segments in each group may facilitate a computationally less intensive operation while maintaining the robustness of the algorithm.
As shown in
As shown in
Turning briefly to
strength(v) ≡ Σu capacity(u, v).
Embodiments of the method 4200 further include determining the single-path max-flow, spmFlow(u,vsource) from the first candidate source vertex, vsource, to each of the other vertices (block 4204). In embodiments, spmFlow(u,vsource) may be determined by solving the widest path problem. A merit function is evaluated for each vertex (block 4206) and the next candidate source vertex (represented by “NCSV” in
merit(u)≡strength(u)/spmFlow(u, vsource),
where merit(vsource)=−1. The vertex with the greatest merit is taken to be the next candidate source vertex.
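By way of a non-limiting illustration, this selection step might be sketched in Python as follows, assuming the graph is represented as an adjacency dictionary and that spmFlow is obtained by solving the widest-path problem with a max-heap variant of Dijkstra's algorithm; the representation and helper names are illustrative assumptions.

import heapq
from collections import defaultdict

def vertex_strength(graph, v):
    # strength(v): sum of the capacities of the edges attached to v.
    # graph: dict mapping each vertex to a dict of {neighbor: capacity}.
    return sum(graph[v].values())

def single_path_max_flow(graph, source):
    # spmFlow(u, source) for every vertex u, obtained by maximizing, over paths,
    # the minimum edge capacity along the path (the widest-path problem).
    spm = defaultdict(float)
    spm[source] = float("inf")
    heap = [(-spm[source], source)]
    visited = set()
    while heap:
        neg_width, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        for v, cap in graph[u].items():
            width = min(-neg_width, cap)
            if width > spm[v]:
                spm[v] = width
                heapq.heappush(heap, (-width, v))
    return spm

def next_candidate_source(graph, source):
    # merit(u) = strength(u) / spmFlow(u, source), with merit(source) = -1;
    # the vertex with the greatest merit becomes the next candidate source.
    spm = single_path_max_flow(graph, source)
    best, best_merit = None, float("-inf")
    for u in graph:
        if u == source:
            merit = -1.0
        elif spm[u] > 0:
            merit = vertex_strength(graph, u) / spm[u]
        else:
            continue  # unreachable vertex; the graphs here are assumed connected
        if merit > best_merit:
            best, best_merit = u, merit
    return best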
According to embodiments, the partitioning algorithm may be improved by modifying the merit function used to select the source and drain vertices. Similarly, modifying the merit function may allow the algorithm to be tuned to a desired application. For example, one may consider using not just the single-path max-flow and the vertex strength, but also the greatest edge capacity among the edges associated with each vertex and any other information available about each vertex or edge based on the underlying meaning of the graph. In embodiments, one or more of these types of information may be combined with one or more other types of information (whether or not listed here) to obtain a useful merit function that produces results appropriate for a type of video content, application, and/or the like.
As shown in
The source and drain vertex-selecting process may be repeated any number of times (e.g., up to five times), or until the next candidate source vertex is a vertex that was previously a candidate. In practice, this process has generally been found to converge after three iterations. Accordingly, in embodiments, the process may be programmed to terminate after two iterations.
With reference to
residual(u, v)≡capacity(u, v)−flow(u, v).
Embodiments of the method 4100 further include deciding whether or not to accept the candidate partition, which may include evaluating the partition factor, PF (block 4114):
PF≡maxFlow/min{strength(vsource), strength(vdrain)},
where vsource and vdrain are the source and drain vertices, and min{...} indicates the smallest value in the set. The partition factor is compared to a partition factor threshold, TH(P), and it is determined whether the partition factor exceeds the partition factor threshold (block 4116). If the partition factor does not exceed the partition factor threshold, the candidate partition may be rejected and a new set of source and drain vertices may be selected (block 4110) and evaluated, as shown in
As shown in
Embodiments of the process of determining the source- and drain-side subgraphs are depicted in
According to embodiments, if the partition was accepted, the process described above may be applied to both of the determined sub-graphs iteratively until all subgraphs have been dropped or have been rejected for partitioning. According to embodiments, any number of various refinements, additional criteria, filtering processes, and/or the like may be employed as part of the clustering algorithm described above.
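By way of a non-limiting illustration, the cut-and-accept step described above might be sketched as follows, using the networkx max-flow/min-cut routines as a stand-in for the embodiments' own flow computation; the edge-list representation and the threshold value are illustrative assumptions, and the acceptance test simply follows the comparison to TH(P) described above.

import networkx as nx

def partition_graph(edges, v_source, v_drain, threshold=0.33):
    # Candidate partition via max-flow/min-cut with source and drain chosen
    # from the existing vertices. edges: iterable of (u, v, capacity);
    # threshold stands in for TH(P), whose value is not specified above.
    G = nx.DiGraph()
    for u, v, cap in edges:
        # Model each undirected edge as two directed arcs of equal capacity.
        G.add_edge(u, v, capacity=cap)
        G.add_edge(v, u, capacity=cap)

    max_flow, (source_side, drain_side) = nx.minimum_cut(G, v_source, v_drain)

    def strength(v):
        # Sum of capacities of edges attached to v (each undirected edge once).
        return sum(d["capacity"] for _, _, d in G.out_edges(v, data=True))

    pf = max_flow / min(strength(v_source), strength(v_drain))
    accepted = pf > threshold  # per block 4116: reject if PF does not exceed TH(P)
    return accepted, pf, source_side, drain_side

If a partition is accepted, the same routine may be applied recursively to the source-side and drain-side subgraphs, consistent with the iterative application described above.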
As described above, embodiments include a novel graph partitioning algorithm that may be used to partition graphs to identify clusters, which may be referred to as objects. Identification of the objects may facilitate object group analysis, in which the position and motion of objects and/or groups of objects are tracked. Information about objects, groups of objects, the location (in video frames) of objects, the motion of objects, and/or the like, may be used to facilitate more efficient partitioning, encoding, and/or the like.
The inventors have found, through experimentation, that embodiments of the algorithm described herein facilitate more effective and useful partitioning of graphs related to video frames. When analyzing a partitioning method, it is beneficial to have a simple numerical method to determine how “good” a cut is. The appropriate measure depends on what is meant by a “good” cut for a given purpose. One of the most widely used measures of merit is the cut ratio, which is the sum of the edge capacities cut divided by the lesser of the sum of the edge capacities in either partition:
cutRatio ≡ (Σedges cut weightedge)/(min{Σleft partition weightedge, Σright partition weightedge}),
where the identification of which partition is “left” and which is “right” is arbitrary.
For many applications, a more appropriate measure of merit may be one defined herein, called the "flow ratio," defined as follows. First, for each vertex in each partition, consider the strength of the vertex to be the sum of the capacities of the edges connected to that vertex. For any candidate partition, consider the maximum vertex strength in each candidate subgraph; the lesser of those two is the min-max vertex strength. Finally, for any candidate partition, the flow ratio is the sum of the weights of the edges cut divided by the min-max vertex strength. The flow ratio may also be called the min-max vertex strength cut ratio. Note that the flow ratio is similar to, but different from, the partitionFactor used in embodiments of the partitioning algorithm described above. The flow ratio may be expressed in the following form:
flowRatio ≡ (Σedges cut weightedge)/(min{maxleft partition strengthvertex, maxright partition strengthvertex}).
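By way of a non-limiting illustration, the two measures of merit might be computed as in the following Python sketch, assuming the partition is given as two vertex sets and the graph as an edge list with weights; cut_ratio additionally assumes each side of the partition contains at least one internal edge.

from collections import defaultdict

def cut_ratio(edges, left, right):
    # cutRatio: total weight of the edges cut divided by the lesser of the total
    # edge weights internal to either partition. edges: iterable of (u, v, weight).
    cut = left_w = right_w = 0.0
    for u, v, w in edges:
        if (u in left) == (v in left):
            if u in left:
                left_w += w
            else:
                right_w += w
        else:
            cut += w
    return cut / min(left_w, right_w)

def flow_ratio(edges, left, right):
    # flowRatio: total weight of the edges cut divided by the min-max vertex
    # strength (the lesser of the maximum vertex strengths in the two partitions).
    strength = defaultdict(float)
    cut = 0.0
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
        if (u in left) != (v in left):
            cut += w
    min_max = min(max(strength[v] for v in left), max(strength[v] for v in right))
    return cut / min_max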
One of the simplest traditional partitioning methods is based on minimum spanning trees. In this method, one first constructs an approximate minimum spanning tree from the original graph, using the reciprocal of the capacities of each edge as the cost of the edge. Prim's algorithm provides a simple greedy method for constructing an approximate minimum spanning tree, and works essentially as follows: starting with an arbitrary vertex, grow the tree one edge at a time, always adding the cheapest edge that connects the tree to a vertex not yet connected to the tree, and continue until all vertices are connected. Given an approximate minimum spanning tree, one then decides which edge to cut in the tree, and partitions vertices in the original graph according to the partitioning of the vertices in the tree. A simple approach would be to cut the tree at the weakest edge; a more sophisticated approach would be to consider every edge in the tree and evaluate the resultant cut ratio in the original graph, taking the cut according to the best cut ratio. One of the downsides of this method is that it can perform very poorly in some cases, due to the greedy algorithm used to construct the tree.
Another widely used method of partitioning is spectral partitioning, and its properties have been the subject of extended analysis. See, for example, Stephen Guattery and Gary L. Miller, "On the Performance of Spectral Graph Partitioning Methods," Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, ACM-SIAM, 1995, pp. 233-242; and Daniel A. Spielman and Shang-Hua Teng, "Spectral partitioning works: Planar graphs and finite element meshes," Linear Algebra and its Applications 421 (2007) 284-305, the entirety of each of which is hereby incorporated herein by reference for all purposes. The method works as follows: First, consider the adjacency matrix A, with [A]ij being the capacity between vertex i and j in the graph and [A]ii=0 for any i. Construct the diagonal strength matrix D, with [D]ii=the sum of the edge capacities of all edges attached to vertex i and [D]ij=0 for i≠j. Now construct the Laplacian matrix L=D−A. Evaluate the Fiedler vector, which is the eigenvector corresponding to the second smallest eigenvalue of L. In spectral partitioning methods, the vertices in the graph are partitioned based on how the corresponding element in the Fiedler vector compares to one or more thresholds. One of the simplest thresholds would be to partition all vertices with a positive Fiedler value from those with a negative Fiedler value. In many cases, better results are achieved by taking the threshold to be such that some measure of partition merit is optimized.
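By way of a non-limiting illustration, a compact NumPy sketch of the spectral method just described, using the simplest sign-based threshold on the Fiedler vector, is shown below; it assumes the graph is given as a dense adjacency matrix.

import numpy as np

def spectral_partition(A):
    # Build L = D - A, take the Fiedler vector (eigenvector of the second-smallest
    # eigenvalue of L), and split vertices by the sign of their Fiedler component.
    A = np.asarray(A, dtype=np.float64)
    D = np.diag(A.sum(axis=1))
    L = D - A
    _, eigvecs = np.linalg.eigh(L)   # eigenvalues are returned in ascending order
    fiedler = eigvecs[:, 1]          # eigenvector of the second-smallest eigenvalue
    left = np.flatnonzero(fiedler >= 0)
    right = np.flatnonzero(fiedler < 0)
    return left, right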
The inventors have applied embodiments of the new max-flow graph partitioning algorithm described herein (referred to herein, interchangeably, as “the new algorithm” and “the new method”) to several graphs that arise during a video analysis application. The algorithm was applied to four different cases (“persp”, “fishClip”, “quickAtk” and “dog”), distinguishing the base graphs used as input to the algorithm from the induced graphs which resulted from the application of the algorithm. Several of the statistics of these graphs are summarized in Table 1.
The first column of Table 1 lists the name for each set. The second column lists the total number of graphs in that set. The third column lists the distribution of the number of vertices in the graphs in that set, in the format: minimum˜maximum (mean ±standard deviation). The graph with the most vertices contained 688 vertices and was in the quickAtk-base set. The fourth column lists the distribution of the number of edges in the graphs in each set similarly. The graph with the most edges had 1,687 edges and was in the fishClip-base set. The fifth column lists the distribution of the density (number of edges divided by (number of vertices squared)) of the graphs in each set, showing only the mean and standard deviation.
The merit of the generated partitions was evaluated using both the cut ratio and flow ratio merits, and the results have been compared to those obtained using spectral partitioning. In Table 2, results for the cut ratio are displayed, taking the spectral partition to minimize the cut ratio. For each graph, the inventors examined and classified the partitioning into one of five possible outcomes, in order: (1) the two methods yielded identical cuts (“Idnt”); (2) neither method found a good cut because the cut ratio was >⅓ for both (“No”); (3) the new method yielded a better cut (lower cut ratio) (“Bet”); (4) the new method yielded a worse cut (higher cut ratio) (“Wrs”); and (5) the two methods yielded a cut with an equivalent cut ratio to within numerical precision. Outcome 5 never occurred in these data sets, and is not indicated in the table. Note that the new method can partition an individual vertex from the rest of the graph, leading to an infinite cut ratio; and in all graphs where that happened, the spectral partitioning produced a partition with a cut ratio of over ⅓, leading to such graphs being classified as outcome (2) (“No”). The last two columns list the mean ±standard deviation of the cut ratios for graphs that resulted in outcomes (3)˜(5), for the spectral partitioning and the new max-flow partitioning algorithm respectively.
Table 2, above, indicates the results of a comparison of the new algorithm to spectral partitioning using the min-cut ratio. The first two columns are the same as in Table 1, above. Column 3 ("Idnt") lists the number of graphs where the two partitioning methods produce an identical partition. Column 4 ("No") lists the number of graphs where both the new algorithm and spectral partitioning produce a cut with a cut ratio of over ⅓; it is deemed that no cut exists in this case. Column 5 ("Bet") lists the number of graphs where the new method produces a cut with a better cut ratio than spectral partitioning; similarly, Column 6 ("Wrs") lists the number where the new method is worse. Column 7 ("SP Cut Ratio") lists the mean ± standard deviation of the cut ratios generated using spectral partitioning among cases in the "Bet" and "Wrs" categories. Column 8 ("New Cut Ratio") lists the same for the new partitioning method among cases in the "Bet" and "Wrs" categories.
As indicated in Table 2, the new algorithm produces an identical cut a significant fraction of the time and generally similar results overall—and, this happens without any analysis of the cut ratio factoring into the new algorithm. Further, in four of the sets (both the base and induced graphs for the “persp” and “fishClip” cases), switching to the new algorithm yields a better cut (lower cut ratio) more often than it yields a worse cut (higher cut ratio). However, for most of the sets, the new algorithm yields a slightly higher mean cut ratio; the exceptions are the fishClip-base set, where a somewhat lower mean cut ratio was obtained, and the quickAtk-base set, where a much higher mean cut ratio was obtained. Note that an increase in the mean cut ratio is reconciled with having more “better” than “worse” outcomes by the fact that there are a few worse cases that are significantly worse, while most of the other better and worse cases change by less.
In Table 3, results for the flow ratio are shown, taking the spectral partition to minimize the flow ratio. Table 3 depicts the results of a comparison of the new algorithm to spectral partitioning using the min-max vertex strength ratio. The meanings of the columns are identical to those of Table 2, above, except that the last two columns list flow ratios instead of cut ratios. Observe that the two methods produce identical partitions somewhat more often than when using the cut ratio. However, for every set, the new algorithm produces results that are equal to or better than those of spectral partitioning, and sometimes significantly better.
Various modifications and additions can be made to the exemplary embodiments discussed without departing from the scope of the present disclosure. For example, while the embodiments described above refer to particular features, the scope of this disclosure also includes embodiments having different combinations of features and embodiments that do not include all of the described features. Accordingly, the scope of the present disclosure is intended to embrace all such alternatives, modifications, and variations as fall within the scope of the claims, together with all equivalents thereof.
Embodiments of the systems and methods described herein include a feature-based pattern recognition process (e.g., the feature-based pattern recognition process 216 depicted in
In embodiments, the feature-based pattern recognition process may be facilitated using segmentation information, foreground/background information, object group analysis information, and/or the like. According to embodiments, for example, a pattern recognition component (e.g., the pattern recognition component 4420 described below with regard to
Embodiments of the systems and methods described herein include an object classification process (e.g., the object classification process 218 depicted in
According to embodiments, multiple classifiers may be used for a more robust labeling scheme. Embodiments of such a technique involve extracting meaningful features from video data. Extracting the features may be achieved using any number of different feature-extraction techniques such as, for example, kernel-based approaches (e.g. Laplacian of Gaussian, Sobel, etc.), nonlinear approaches (e.g. Canny, SURF, etc.), and/or the like. After feature vectors are extracted, a learning algorithm (e.g., SVM, neural network, etc.) is used to train a classifier. Approaches such as deep learning seek to use cascaded classifiers (e.g., neural networks) to combine together the decisions from disparate feature sets into one decision.
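By way of a non-limiting illustration of the feature-extraction-plus-training step, the following sketch computes a few kernel-based feature statistics (Laplacian of Gaussian and Sobel responses) and trains an RBF-kernel SVM with scikit-learn; the specific features, statistics, and library choices are illustrative assumptions, not the feature set of the embodiments.

import numpy as np
from scipy import ndimage
from sklearn.svm import SVC

def extract_features(patch):
    # Kernel-based feature vector for one image patch: Laplacian of Gaussian
    # and Sobel responses summarized by simple statistics.
    patch = np.asarray(patch, dtype=np.float64)
    log = ndimage.gaussian_laplace(patch, sigma=2.0)
    sob = ndimage.sobel(patch)
    return np.array([log.mean(), log.std(), sob.mean(), sob.std()])

def train_classifier(patches, labels):
    # Train an RBF-kernel SVM on the extracted feature vectors.
    X = np.vstack([extract_features(p) for p in patches])
    clf = SVC(kernel="rbf")
    clf.fit(X, labels)
    return clf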
Embodiments involve characterizing the output of a classifier using a histogram, and applying classical Bayesian decision theory on the result to build a statistically-backed prediction. Embodiments of this approach may facilitate improved accuracy and/or computational efficiency. For example, embodiments of the technique may be implemented in a modular manner, in that models may be trained independently, and added to the boosting stage ad-hoc, thereby potentially improving accuracy on the fly. As another example, by implementing a model that automatically provides a statistical model of a number of classifier outputs, computational efficiencies may be realized due, at least in part, to avoiding complex schemes for using cascaded classifiers to combine together the decisions from disparate feature sets. Embodiments of the techniques and systems described herein may be applicable to any number of different situations in which classifiers are used; pattern recognition is one example, and any other situation in which one or more classifiers are utilized is contemplated herein.
As shown in
In embodiments, the memory 4414 stores computer-executable instructions for causing the processor 4412 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components include a segmenter 4418, a pattern recognition component 4420, an encoder 4422, and a communication component 4424.
In embodiments, the segmenter 4418 may be configured to segment a video frame into a number of segments. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 4418 may employ any number of various automatic image segmentation methods known in the field. In embodiments, the segmenter 4418 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmenter 4418 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph.
In embodiments, the pattern recognition component 4420 may perform pattern recognition on digital images such as, for example, frames of video. In embodiments, the pattern recognition component 4420 may perform pattern recognition on images that have not been segmented. In embodiments, results of pattern recognition may be used by the segmenter 4418 to inform a segmentation process. Pattern recognition may be used for any number of other purposes such as, for example, detecting regions of interest, foreground detection, facilitating compression, and/or the like.
According to embodiments, as shown in
As is also shown in
For example, in the case of a binary SVM, embodiments of the learning algorithm include, in simple terms, trying to maximize the average distance to the hyperplane for each label. In embodiments, kernel-based SVMs (e.g., RBF) allow for nonlinear separating planes that can nevertheless be used as a basis for distance measures to each sample point. That is, for example, after an SVM is trained on a test set, distance features may be computed for each sample point between the sample point and the separating hyperplane. The result may be binned into a histogram, as shown, for example, in
A similar approach may be taken for the case of an Extreme Learning Machine (ELM). An ELM is an evolution of a neural network that has a series of output nodes, each generally corresponding to a confidence that the sample belongs to class n (where n is the node number). While the ELM is not necessarily binary in nature, the separate output nodes may allow a similar analysis to take place. In general, for example, the node with the highest output value may be predicted as the classification, but embodiments of the techniques described herein, when applied to the node outputs in a similar way as the SVM decisions, may facilitate significant improvements in performance. According to embodiments, any learning machine with a continuous output may be utilized. Embodiments of the techniques described herein may facilitate boosts in accuracy of classification, as well as more robust characterization of the prediction (e.g., confidence).
The pattern recognition component 4420 may include a distribution builder 4430 that is configured to receive, from the classifier, a number of classifications corresponding to the input information and to determine a distribution of the classifications. In embodiments, the distribution builder 4430 may determine the distributions based on distances between the classifications and the hyperplane.
For example, the distribution builder 4430 may be configured to determine the distribution by characterizing the plurality of classifications using a histogram. In embodiments, the distribution builder may be configured to compute a number of distance features, such as, for example, a distance, in the virtual feature space, between each of the classifications and the hyperplane. The distribution builder 4430 may assign each of the distance features to one of a number of bins of a histogram.
In the case of sparse or incomplete samples in the histogram, it may be advantageous to model the distribution to generate a projected score for a bin. In the case of sufficient data density (e.g., a significant number of samples fall in the bin of interest), it may be advantageous to use computed probabilities directly. As a result, modeling may be done on a per-bin basis, by checking each bin for statistical significance and backfilling probabilities from the modeled distribution in the case of data that has, for example, a statistically insignificant density, as depicted, for example, in
In embodiments, for example, the distribution builder 4430 is configured to determine a data density associated with a bin of the histogram, and determine whether the data density is statistically significant. That is, for example, the distribution builder 4430 may determine whether the data density of a bin is below a threshold, where the threshold corresponds to a level of statistical significance. If the data density of the bin is not statistically significant, the distribution builder 4430 may be configured to model the distribution of data in the bin using a modeled distribution. In embodiments, the Cauchy (also known as the Lorentz) distribution may be used, as it exhibits strong data locality with long tails, although any number of other distributions may be utilized.
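By way of a non-limiting illustration, the histogram-and-backfill procedure might be sketched as follows, assuming a scikit-learn-style classifier exposing decision_function and using a simple count threshold in place of a formal significance test; those choices, and the parameter names, are illustrative assumptions.

import numpy as np
from scipy import stats

def build_distance_histogram(clf, X, n_bins=20):
    # Bin the signed distances between each sample and the separating hyperplane
    # into a histogram (clf is assumed to expose decision_function, as
    # scikit-learn SVMs do).
    distances = clf.decision_function(X)
    counts, bin_edges = np.histogram(distances, bins=n_bins)
    return counts, bin_edges, distances

def backfill_sparse_bins(counts, bin_edges, distances, min_count=5):
    # Per-bin probabilities; bins with statistically insignificant density are
    # backfilled from a fitted Cauchy (Lorentz) model of the distances.
    loc, scale = stats.cauchy.fit(distances)
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    widths = np.diff(bin_edges)
    probs = counts / counts.sum()
    sparse = counts < min_count
    probs[sparse] = stats.cauchy.pdf(centers[sparse], loc, scale) * widths[sparse]
    return probs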
Having determined statistical distributions associated with outputs from one or more classifiers, the pattern recognition component 4420 may utilize a predictor 4432 configured to generate a prediction by estimating, using a decision engine, a probability associated with the distribution. That is, for example, the class with the highest probability predicted by the distribution may be the one selected by the decision engine. A confidence interval may be calculated for each prediction based on the distribution, using any number of different techniques.
In embodiments, for example, the probability for a single classifier may be estimated using an improper Bayes estimation (e.g., a Bayes estimation without previous probability determinations, at least initially). That is, for example, the decision function may be:
Using histogram distributions, the P(distance | in/out class) may be calculated by determining the percentage of samples in the distance bin, or by substituting an appropriately modeled projection (any of which may be handled by the model internally). Any number of different decision functions may be utilized, and different decision functions may be employed depending on desired system performance, characteristics of the classifier outputs, and/or the like. In embodiments, for example, the decision function may utilize Bayes estimation, positive predictive value (PPV) maximization, negative predictive value (NPV) maximization, a combination of one or more of these, and/or the like. Embodiments of the statistical model described herein may be well suited to a number of decision models as the sensitivity, specificity, and prevalence of the model are all known. Precision and recall may also be determined from the model directly, thereby facilitating potential efficiencies in calculations.
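One possible form of such a decision function, assuming equal (uninformative) priors as an illustration of the "improper" Bayes estimation mentioned above, is sketched below; the function name, default prior, and tie-breaking rule are assumptions, and the exact decision function used in embodiments may differ.

def bayes_decision(p_dist_given_in, p_dist_given_out, prior_in=0.5):
    # With no prior information the class priors default to 0.5, so the posterior
    # reduces to the relative likelihood of the observed distance bin under each
    # class. Inputs are the per-bin likelihoods taken from the histograms above.
    prior_out = 1.0 - prior_in
    evidence = p_dist_given_in * prior_in + p_dist_given_out * prior_out
    if evidence == 0.0:
        return "out-of-class", 0.0
    posterior_in = p_dist_given_in * prior_in / evidence
    label = "in-class" if posterior_in >= 0.5 else "out-of-class"
    return label, posterior_in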
As shown in
The illustrative operating environment 4400 shown in
In embodiments, the trained classifiers 4510 and 4512 are used to build distributions that support more robust decision engines. The distribution is generated using a classifier evaluation process 4514 that produces a distance/response scalar 4516. In embodiments, for example, distances between classification output points and a hyperplane are computed and included in the distance/response scalar 4516. The process flow 4500 further includes histogram generation 4518, through which the distributions 4520 are created. A Bayes estimator 4522 may be used to generate, based on the distributions 4520, predictions 4524. According to embodiments, any other prediction technique or techniques may be utilized.
The illustrative process flow 4500 shown in
Embodiments of the method 4600 further include generating at least one classifier (block 4604). The at least one classifier may be configured to define at least one decision hyperplane that separates a first classification region of a virtual feature space from a second classification region of the virtual feature space, and may include, for example, at least one of a support vector machine (SVM), an extreme learning machine (ELM), a neural network, a kernel-based perceptron, a k-NN classifier, and/or the like. Input is provided to the classifier (block 4606), and a number of classifications is received from the at least one classifier (block 4608).
Embodiments of the method 4600 include determining a distribution of the plurality of classifications (block 4610). In embodiments, determining a distribution of the plurality of classifications includes characterizing the plurality of classifications using a histogram. Embodiments of the method 4600 further include generating a prediction function based on the distribution (block 4612). According to embodiments, generating the prediction function may include generating a decision function that may be used for estimating, using the decision function, a probability associated with the distribution, where the decision function may utilize at least one of Bayes estimation, positive predictive value (PPV) maximization, negative predictive value (NPV) maximization and/or the like.
Embodiments of the method 4700 further include assigning each of the distance features to one of a plurality of bins of a histogram (block 4706). The method 4700 may also include determining a data density associated with a bin of the histogram (block 4708); determining that the data density is below a threshold, wherein the threshold corresponds to a level of statistical significance (block 4712); and modeling the distribution of data in the bin using a modeled distribution (block 4714). For example, in embodiments, the modeled distribution includes a Cauchy distribution. In a final illustrative step of embodiments of the method 4700, the bin is backfilled with probabilities from the modeled distribution (block 4716).
Embodiments of the method 4800 further include providing input information (e.g., the extracted features and/or information derived from the extracted features) to at least one classifier (block 4804). The at least one classifier may be configured to define at least one decision hyperplane that separates a first classification region of a virtual feature space from a second classification region of the virtual feature space, and may include, for example, at least one of a support vector machine (SVM), an extreme learning machine (ELM), a neural network, a kernel-based perceptron, a k-NN classifier, and/or the like. Embodiments of the method 4800 further include generating a prediction based on the classification distribution provided by the at least one classifier (block 4806). According to embodiments, generating the prediction may include using the decision function associated with the distribution, where the decision function may utilize at least one of Bayes estimation, positive predictive value (PPV) maximization, negative predictive value (NPV) maximization and/or the like.
To produce multi-view video content (e.g., 3D video, AR video and/or VR video), multiple views may be used to present a scene (or scene augmentation) to a user. A view refers to a perspective of a scene and may include one or more images corresponding to a scene, where all of the images in the view represent a certain spatial (and/or temporal) perspective of a video scene (e.g., as opposed to a different perspective of the video scene, represented by a second view). According to embodiments, a perspective may include multiple spatial and/or temporal viewpoints such as, for example, in the case in which the “viewer” —that is, the conceptual entity that is experiencing the scene from the perspective—is moving relative to the scene, or an aspect thereof (or, put another way, the scene, or an aspect thereof, is moving relative to the viewer).
In embodiments, a view may include video information, which may be generated in any number of different ways such as, for example, using computing devices (e.g., computer-generated imagery (CGI)), video cameras (e.g., multiple cameras may be used to respectively record a scene from different perspectives), and/or the like. Accordingly, in embodiments, a view of a scene (e.g., computer-generated and/or recorded by a video camera) may be referred to herein as a video feed and multiple views of a scene may be referred to herein as video feeds. In embodiments, each video feed may include a plurality of video frames.
Embodiments of this disclosure may provide efficiencies over conventional approaches when processing video for multi-view video content. Examples of some computational efficiencies include, but are not limited to, reducing the redundancy of encoding the same object in more than one video feed, using an object from one video feed to identify the same object in another frame, and/or using the object registration to tag and/or label the same object in more than one video feed.
In embodiments, the video data 5004 may include views of a scene embodied in a number of video feeds. In embodiments, the video feeds of the scene, or aspects thereof, may have been respectively recorded by cameras positioned at different locations so that the scene is recorded from multiple different viewpoints. In embodiments, the video feeds of the scene, or aspects thereof, may have been computer-generated. In some instances, view information—information about the perspective corresponding to the view (e.g., camera angle, virtual camera angle, camera position, virtual camera position, camera motion (e.g., pan, zoom, translate, rotate, etc.), virtual camera motion, etc.)—may be received with (or in association with) the video data 5004 (e.g., multiplexed with the video data 5004, as metadata, in a separate transmission from the video data 5004, etc.). In other embodiments, view information may be determined, e.g., by the encoding device 5002. Each of the video feeds of the video data 5004 may be comprised of multiple video frames. In embodiments, the video feeds may be combined to produce multi-view video.
As described herein, while producing the encoded video data 5006 from the video data 5004, the encoding device 5002 may determine motion vectors of the video data 5004. In embodiments, the encoding device 5002 may determine motion vectors of the video data 5004 in a computationally less demanding way than conventional encoding systems, for example, by using methods that include extrapolating motion vectors from a first video feed to other video feeds.
As shown in
As shown in
In embodiments, the memory 5014 stores computer-executable instructions for causing the processor 5012 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components may include a segmenter 5018, a foreground detector 5020, a motion estimator 5022, an object analyzer 5024, an encoder 5026 and a communication component 5028.
As indicated above, in embodiments, the video data 5004 includes multiple video feeds and each video feed includes multiple video frames. In embodiments, the segmenter 5018 may be configured to segment one or more video frames into a plurality of segments. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 5018 may employ any number of various automatic image segmentation techniques such as, for example, those discussed herein. For example, the segmenter 5018 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two embodiments of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. In embodiments, the segmenter 5018 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph. In embodiments, the segmenter 5018 implements aspects of the segmentation techniques described in Iuri Frosio and Ed Ratner, "Adaptive Segmentation Based on a Learned Quality Metric," Proceedings of the 10th International Conference on Computer Vision Theory and Applications, March 2015, the entirety of which is hereby incorporated herein by reference for all purposes.
The resulting segment map of image segments includes an assignment of indices to every pixel in the image, which allows for the frame to be dealt with in a piecemeal fashion. Each index, which may be indexed and stored in memory 5014, may be considered a mask for this purpose.
In embodiments, the foreground detector 5020 may be configured to perform foreground detection on one or more video frames of the video data 5004. For example, in embodiments, the foreground detector 5020 may perform segment-based foreground detection, where the foreground segments, or portions of the segments, are detected using any number of different techniques such as, for example, those discussed above with respect to U.S. application Ser. No. 14/737,418, filed Jun. 11, 2015, entitled "FOREGROUND DETECTION USING FRACTAL DIMENSIONAL MEASURES," the entirety of which is incorporated herein. For example, in embodiments, the foreground detector 5020 may identify a segment as a foreground segment or a background segment by: determining at least one foreground metric for the segment based on a filtered binary foreground indicator map (BFIM), determining at least one variable threshold based on the foreground metric, and applying the at least one variable threshold to the filtered BFIM to identify the segment as either a foreground segment or background segment. Alternatively, the foreground detector 5020 may perform foreground detection on images that have not been segmented. In embodiments, results of foreground detection may be used by the segmenter 5018 to inform a segmentation process.
The motion estimator 5022 is configured to estimate the motion of one or more segments between video frames of a single video feed. For example, the motion estimator 5022 may receive a single video feed of the video data 5004. The single video feed may be received after video frames of the video feed are segmented by the segmenter 5018. The motion estimator 5022 may then perform motion estimation on the segmented video frames of the video feed. That is, the motion estimator 5022 may estimate the motion of a segment between video frames of the single video feed, with the current frame in the temporal sequence serving as the source frame and the subsequent frame in the temporal sequence serving as the target frame.
In embodiments, the motion estimator 5022 may utilize any number of various motion estimation techniques known in the field. Two example motion estimation techniques are optical pixel flow and feature tracking. As an example, the motion estimator 5022 may use feature tracking in which Speeded Up Robust Features (SURF) are extracted from both a source image (e.g., a source frame) and a subsequent image (e.g., a subsequent frame). The individual features of the two images may then be compared using a Euclidean metric to establish a correspondence between features, thereby generating a motion vector for each feature. In embodiments, a motion vector for a segment may be the median of all of the motion vectors for each of the segment's features. After a motion vector for a segment is determined, each segment may be categorized based on its motion properties. For example, a segment may be categorized as either moving or stationary.
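By way of a non-limiting illustration, segment-level feature tracking might be sketched as follows; the sketch substitutes ORB features (which ship with stock OpenCV) and a Hamming-distance matcher for the SURF features and Euclidean metric described above, so it should be read as an illustration of the median-of-feature-motion-vectors idea rather than the embodiments' exact pipeline.

import cv2
import numpy as np

def segment_motion_vector(src_frame, dst_frame, segment_mask):
    # Estimate a segment's motion vector between a source frame and a target
    # frame by matching features detected inside the segment's mask.
    orb = cv2.ORB_create()
    mask = segment_mask.astype(np.uint8)
    kp1, des1 = orb.detectAndCompute(src_frame, mask)
    kp2, des2 = orb.detectAndCompute(dst_frame, None)
    if des1 is None or des2 is None:
        return np.zeros(2)

    # Match features between the two frames (Hamming distance for ORB descriptors).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    if not matches:
        return np.zeros(2)

    # Per-feature motion vectors; the segment's motion vector is their median.
    vectors = np.array([
        np.array(kp2[m.trainIdx].pt) - np.array(kp1[m.queryIdx].pt)
        for m in matches
    ])
    return np.median(vectors, axis=0)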
In embodiments, the motion estimator 5022 may include a multi-view motion estimator 5038. In embodiments, the multi-view motion estimator 5038 may extrapolate motion vectors for the other video feeds of the video data 5004.
To do so, the multi-view motion estimator 5038 may determine a pixel depth. The multi-view motion estimator 5038 is configured to receive two or more video feeds of the video data 5004. Based on the relative positions of the cameras used to record the video feeds, as determined by the camera position and viewing angle calculator 5028, the multi-view motion estimator 5038 is configured to calculate and assign a pixel depth for each pixel located in the video frames of the video feeds. As an example, if an object encompasses an area of a1×b1 pixels in one video feed and the same object encompasses an area of a2×b2 pixels in another video feed, then, based on the relative positions and angles of the two cameras used to record the two video feeds, a transformation function can be determined that will transform the object from a1×b1 pixels to a2×b2 pixels. Due to the transformation function being based on the relative positions and angles of the cameras, the transformation function will include distance information that can be used to calculate the distance of the object from the first camera used to record the first video feed. That is, a depth can be assigned to the pixels including the object. Similarly, distances to other objects included in the video feed may be determined. Once the distances to each of the objects are calculated, a pixel depth (i.e., a depth coordinate) can be assigned to each object and pixel included in a video feed. And, based on the calculated pixel depth, a 3D map of each video feed may be determined. That is, for each pixel of a video frame, a depth coordinate (e.g., a z coordinate) can be added to the horizontal and vertical coordinates (e.g., x, y coordinates).
After the 3D map is created, the multi-view motion estimator 5038 may be configured to extrapolate the motion vectors determined by the motion estimator 5022 for a video feed onto other video feeds. To do so, in embodiments, the multi-view motion estimator 5038 assigns 3D coordinates to each of the motion vectors computed by the motion estimator 5022 based on the 3D map. That is, the multi-view motion estimator 5038 may be configured to receive 2D motion vector data from the motion estimator 5022 and determine the three-dimensional representations of the motion vectors using the 3D map determined from the calculated pixel depth. The multi-view motion estimator 5038 then can use the 3D representations of the motion vectors to compute 2D projections onto one or more of the other 2D coordinate systems associated with the other video feeds. In embodiments, a local search can be performed by the multi-view motion estimator 5038 to determine whether the motion vectors projected onto a video feed accurately represent motion vectors for the video feed. In embodiments, the projected motion vectors may be compared to computed motion vectors for one or more of the other video feeds using a Euclidean metric to establish a correspondence between motion vectors and/or to determine a projection error of the projected motion vectors.
In embodiments, the object analyzer 5024 may be configured to identify, using the segment map and/or the motion vectors computed by the motion estimator 5022, the presence of objects (e.g., clusters of segments that move together, or at least approximately together) within digital images such as, for example, frames of video. In embodiments, the object analyzer 5024 may be configured to identify and/or analyze only objects that are moving within a particular scene. In embodiments, the object analyzer 5024 may determine the presence of objects in all of the video feeds of the video data 5004. In embodiments, the object analyzer 5024 may perform object analysis on images that have not been segmented. Results of object group analysis may be used by the segmenter 5018 to facilitate a segmentation process, by the encoding device 5002 to facilitate an encoding process, and/or the like.
According to embodiments, the pattern recognition component 5026 may perform pattern recognition on digital images such as, for example, frames of video. For example, the pattern recognition component 5026 may perform pattern recognition of the objects that are determined by the object analyzer 5024 in one or more frames of video. In embodiments, the pattern recognition component 5026 may recognize patterns of one or more of the objects of a video frame and determine whether the recognized patterns correspond to a specific class of objects. To perform pattern recognition, the pattern recognition component 5026 may use any number of different techniques such as, for example, those discussed in U.S. application Ser. No. 15/237,048, filed Aug. 15, 2016, entitled “OBJECT CATEGORIZATION USING STATISTICALLY-MODELED CLASSIFIER OUTPUTS” and/or U.S. Application Ser. No. 62/368,853, filed Jul. 29, 2016, entitled “LOGO IDENTIFICATION.”
According to embodiments, if the recognized patterns correspond to a class of objects, the pattern recognition component 5026 may classify and label the object as the corresponding class. In embodiments, the pattern recognition component 5026 may classify and label objects using any number of different techniques such as, for example, those discussed in U.S. Application Ser. No. 15/237,048, filed Aug. 15, 2016, entitled “OBJECT CATEGORIZATION USING STATISTICALLY-MODELED CLASSIFIER OUTPUTS.” Additionally or alternatively, the pattern recognition component 5026 may be used for any number of other purposes such as, for example, detecting regions of interest, foreground detection, facilitating compression, and/or the like.
According to embodiments, the camera position and viewing angle calculator 5028 may calculate the relative positions and viewing angles of the cameras that respectively recorded the video feeds of the video data 5004 based on the fields of view of each of the cameras. The relative positions and angles of the cameras may be used, by the multi-view motion estimator 5038 and/or by the object register 5030, to determine 3D coordinates of a video scene. Additionally or alternatively, the relative positions and viewing angles of the cameras may be included in metadata of the video data 5004 and received by the multi-view motion estimator 5038 and/or the object register 5030.
Based on the relative positions and viewing angles of the cameras, the object register 5030 may transform an identified object from the perspective of a first video feed to the perspective of a second video feed. To do so, the object register 5030 may calculate a pixel depth for each of the pixels included in a video frame based on two video feeds and the relative positions and angles of the cameras used to record the two video feeds. As an example, if an object encompasses an area of a1×b1 pixels in one video feed and the same object encompasses an area of a2×b2 pixels in another video feed, then, based on the relative positions and angles of the two cameras used to record the two video feeds, a transformation function can be determined that will transform the object from a1×b1 pixels to a2×b2 pixels. Because the transformation function is based on the relative positions and angles of the cameras, the transformation function will include distance information that can be used to calculate the distance of the object from the first camera used to record the first video feed. That is, a depth can be assigned to the pixels that make up the object. Similarly, distances to other objects included in the video feed may be determined. Once the distances to each of the objects are calculated, a pixel depth (i.e., a depth coordinate) can be assigned to each object and pixel included in a video feed. And, based on the calculated pixel depth, a 3D map of each video feed may be determined. That is, for each pixel of a video frame, a depth coordinate (e.g., a z coordinate) can be added to the horizontal and vertical coordinates (e.g., x, y coordinates).
After a 3D map is created, the object register 5030 may transform an object from the perspective of a first video camera to the perspective of a second video camera. That is, the 3D representation of an object may be projected onto the perspective of the second video camera. In embodiments, the object register 5030 may compare the transformed object against one or more objects identified in the second feed by the object analyzer 5024. The object register 5030 may then determine the closest match between the transformed object and an object identified in the second feed. The object register 5030 may then register the first object and its closest match in the second feed as the same object.
As shown in
The illustrative operating environment 5000 shown in
As shown in
As shown in
Embodiments of the method 5100 further include registering objects of a video feed (block 5106). Embodiments describing an object registration method 5200 are discussed below with respect to
After the objects are registered, the registered objects are encoded (block 5108). By registering the same object as it appears in different video feeds, computational efficiencies may be obtained. Examples of some computational efficiencies include, but are not limited to, reducing redundancies associated with encoding the same object in another video feed, using an object from one video feed to identify the same object in another frame, and/or using the object registration to tag and/or label the same object in more than one video feed. In embodiments, the encoder 5032 depicted in
After encoding the motion vectors, the encoded motion vector data may be transmitted (block 5110). The encoded motion vector data may be transmitted to a decoding device (e.g., the decoding device 5008 depicted in
The illustrative method 5100 shown in
As shown in
According to embodiments, the method 5200 may include determining relative positions and angles of cameras (block 5204). The relative positions and angles of the cameras may be used to determine 3D coordinates of a video scene, as described below. In embodiments, any number of different techniques may be used to determine relative positions and angles of cameras such as, for example, the embodiments discussed above with respect to
According to embodiments, the method 5200 may include calculating a pixel depth based on two video feeds (block 5206) and determining 3D coordinates of objects for video based on pixel depth (block 5208). As an example, if an object encompasses an area of a1×b1 pixels in one video feed and the same object encompasses an area of a2×b2 pixels in another video feed, then, based on the relative positions and angles of the two cameras used to record the two video feeds, a transformation function can be determined that will transform the object from a1×b1 pixels to a2×b2 pixels. Because the transformation function is based on the relative positions and angles of the cameras, the transformation function will include distance information that can be used to calculate the distance of the object from the first camera used to record the first video feed. That is, a depth can be assigned to the pixels that make up the object. Similarly, distances to other objects included in the video feed may be determined. Once the distances to each of the objects are calculated, a pixel depth (i.e., a depth coordinate) can be assigned to each object and pixel included in a video feed. And, based on the calculated pixel depth, a 3D map of a video feed may be determined. That is, for each pixel of a video frame, a depth coordinate (e.g., a z coordinate) can be added to the horizontal and vertical coordinates (e.g., x, y coordinates). In embodiments, an object register 5030, as depicted in
Then, the 3D coordinate representation of each of the objects may be projected onto a respective 2D coordinate system of a second video feed (block 5210). Once a 3D representation of an object is projected onto a 2D coordinate perspective of a second video feed, the projected objects from the first feed are compared to objects identified in the second video feed (block 5212). In embodiments, the closest object of the second video feed to the projected object may be identified and registered as the same object (block 5214).
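As one illustrative sketch of blocks 5210-5214, projected object centroids can be matched against the centroids of objects identified in the second feed using a Euclidean metric, and the nearest match within a tolerance is registered as the same object. The max_dist threshold and the use of centroids (rather than full object regions) are assumptions for illustration.

```python
import numpy as np

def register_objects(projected_centroids, detected_centroids, max_dist=50.0):
    """Register each projected object from feed 1 to its closest object in feed 2.

    projected_centroids: (N, 2) centroids of objects projected onto feed 2.
    detected_centroids: (M, 2) centroids of objects identified in feed 2.
    Returns {index_in_feed_1: index_in_feed_2} for matches within max_dist pixels
    (max_dist is an illustrative threshold, not taken from the description above).
    """
    registrations = {}
    for i, p in enumerate(projected_centroids):
        dists = np.linalg.norm(detected_centroids - p, axis=1)  # Euclidean metric
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            registrations[i] = j   # same physical object seen in both feeds
    return registrations
```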
The illustrative method 5200 shown in
Various modifications and additions can be made to the exemplary embodiments discussed without departing from the scope of the present disclosure. For example, while the embodiments described above refer to particular features, the scope of this disclosure also includes embodiments having different combinations of features and embodiments that do not include all of the described features. Accordingly, the scope of the present disclosure is intended to embrace all such alternatives, modifications, and variations as fall within the scope of the claims, together with all equivalents thereof.
Embodiments of the systems and methods described herein include a deep scene level analysis process (e.g., the deep scene level analysis process 220 depicted in
According to embodiments, metadata may be maintained in a database, provided to users and/or other devices, provided with the video data, and/or the like. In embodiments, metadata may be used to facilitate a partitioning process (e.g., the partitioning process 222 depicted in
Labelling objects within a video feed has conventionally been performed manually by humans. This approach, however, is labor intensive. Further, humans cannot identify and label objects quickly enough to facilitate labelling objects in a real-time video feed or in a near real-time video feed (e.g., where a slight delay is introduced into the video feed). Embodiments described herein may provide solutions to these, and other, shortcomings of conventional labelling techniques. In particular, embodiments disclosed herein provide an automated technique for labelling objects within a video feed. As such, labelling of objects using embodiments of the techniques described herein may be less time-consuming than in conventional systems and, therefore, objects may be labelled in a real-time video feed or a near real-time video feed. Further, because embodiments include automatic labelling (e.g., labelling without human intervention), robust, comprehensive labelling of large sets of video data may be facilitated.
As described herein, while producing the encoded video data 5306 from the video data 5304, the video processing device 5302 may analyze and tag objects in the video data 5304 automatically to facilitate labeling of objects in real time and/or near real time. In embodiments, the video processing device 5302 may be configured to label video data to produce metadata. In embodiments, the video processing device 5302 may be a stand-alone device, virtual machine, program component, and/or the like, and may function independently of encoding functions.
As shown in
As shown in
In embodiments, the memory 5314 stores computer-executable instructions for causing the processor 5312 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components may include a segmenter 5318, a foreground detector 5320, a motion estimator 5322, an object analyzer 5324, a pattern recognition component 5326, an encoder 5328, and a communication component 5330.
In embodiments, the segmenter 5318 may be configured to segment one or more video frames into a plurality of segments. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 5318 may employ any number of various automatic image segmentation techniques such as, for example, those discussed in U.S. application Ser. No. 14/696,255, filed Apr. 24, 2015, entitled “METHOD AND SYSTEM FOR UNSUPERVISED IMAGE SEGMENTATION USING A TRAINED QUALITY METRIC.” For example, the segmenter 5318 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmenter 5318 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph. In embodiments, the segmenter 5318 implements aspects of the segmentation techniques described in Iuri Frosio and Ed Ratner, “Adaptive Segmentation Based on a Learned Quality Metric,” Proceedings of the 10th International Conference on Computer Vision Theory and Applications, March 2015, the entirety of which is hereby incorporated herein by reference for all purposes.
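A minimal sketch of the watershed-style segmentation mentioned above might look like the following, using scikit-image; the gradient threshold used to seed the markers is an illustrative assumption and is not a parameter from the cited work.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import color, filters, segmentation, util

def segment_frame(frame_rgb, marker_threshold=0.1):
    """Segment a video frame into regions of similar color/texture.

    Watershed sketch: the gradient magnitude acts as the topography and
    low-gradient (flat) areas seed the regions. marker_threshold is an
    illustrative parameter, not taken from the original description.
    """
    gray = color.rgb2gray(util.img_as_float(frame_rgb))
    gradient = filters.sobel(gray)                       # edge strength
    markers, _ = ndi.label(gradient < marker_threshold)  # seeds in flat areas
    segment_map = segmentation.watershed(gradient, markers)
    return segment_map   # integer index assigned to every pixel (the segment map)
```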
The resulting segment map of image segments includes an assignment of an index to every pixel in the image, which allows the frame to be dealt with in a piecemeal fashion. Each index, which may be indexed and stored in memory 5314, may be considered a mask for this purpose.
In embodiments, the foreground detector 5320 may be configured to perform foreground detection on one or more video frames of the video data 5304. For example, in embodiments, the foreground detector 5320 may perform segment-based foreground detection, where the foreground segments, or portions of the segments, are detected using any number of different techniques such as, for example, those discussed in U.S. application Ser. No. 14/737,418, filed Jun. 11, 2015, entitled “FOREGROUND DETECTION USING FRACTAL DIMENSIONAL MEASURES.” For example, in embodiments, the foreground detector 5320 may identify a segment as a foreground segment or a background segment by: determining at least one foreground metric for the segment based on a filtered binary foreground indicator map (BFIM), determining at least one variable threshold based on the foreground metric, and applying the at least one variable threshold to the filtered BFIM to identify the segment as either a foreground segment or background segment. Additionally, or alternatively, the foreground detector 5320 may perform foreground detection on images that have not been segmented. In embodiments, results of foreground detection may be used by the segmenter 5318 to inform a segmentation process.
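By way of example only, segment-based foreground detection could be sketched as follows, with an OpenCV background subtractor standing in for the binary foreground indicator map (BFIM) of the cited application and a fixed threshold standing in for the variable threshold; all parameter values are assumptions.

```python
import numpy as np
import cv2

def classify_segments(frame_bgr, segment_map, subtractor, base_threshold=0.25):
    """Mark each segment as foreground (True) or background (False).

    A background subtractor provides a binary foreground indicator map, each
    segment receives a foreground metric (fraction of indicated pixels), and a
    threshold separates foreground from background segments.
    """
    indicator = subtractor.apply(frame_bgr) > 0          # binary foreground map
    foreground = {}
    for lbl in np.unique(segment_map):
        mask = segment_map == lbl
        metric = indicator[mask].mean()                  # per-segment foreground metric
        foreground[int(lbl)] = metric >= base_threshold
    return foreground

# subtractor = cv2.createBackgroundSubtractorMOG2()  # one instance per video feed
```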
In embodiments, the motion estimator 5322 is configured to estimate the motion of one or more segments between video frames of a video feed. For example, the motion estimator 5322 may receive a video feed of the video data 5304. The video feed may be received after video frames of the video feed are segmented by the segmenter 5318. The motion estimator 5322 may then perform motion estimation on the segmented video frames of the video feed. That is, the motion estimator 5322 may estimate the motion of a segment between video frames of the single video feed, with the current frame in the temporal sequence serving as the source frame and the subsequent frame in the temporal sequence serving as the target frame.
In embodiments, the motion estimator 5322 may utilize any number of various motion estimation techniques known in the field. Two example motion estimation techniques are optical pixel flow and feature tracking. As an example, the motion estimator 5322 may use feature tracking in which Speeded Up Robust Features (SURF) are extracted from both a source image (e.g., a source frame) and a subsequent image (e.g., a subsequent frame). The individual features of the two images may then be compared using a Euclidean metric to establish a correspondence between features, thereby generating a motion vector for each feature. In embodiments, a motion vector for a segment may be the median of all of the motion vectors for each of the segment's features. After a motion vector for a segment is determined, each segment may be categorized based on its motion properties. For example, a segment may be categorized as either moving or stationary. In embodiments, the motion estimator 5322 may use one or more of the techniques described in U.S. application Ser. No. 15/237,048, filed Aug. 15, 2016, entitled “OBJECT CATEGORIZATION USING STATISTICALLY-MODELED CLASSIFIER OUTPUTS.”
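A sketch of the feature-tracking approach described above follows. ORB features are used here as a stand-in for SURF (which typically requires the opencv-contrib build), and the per-segment motion vector is taken as the median of the matched feature displacements within that segment; the feature count is an illustrative choice.

```python
import numpy as np
import cv2

def segment_motion_vectors(src_gray, tgt_gray, segment_map):
    """Estimate one motion vector per segment by feature tracking.

    src_gray, tgt_gray: consecutive grayscale frames (source and target).
    segment_map: per-pixel segment indices for the source frame.
    """
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(src_gray, None)
    kp2, des2 = orb.detectAndCompute(tgt_gray, None)
    if des1 is None or des2 is None:
        return {}

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    per_segment = {}
    for m in matches:
        x1, y1 = kp1[m.queryIdx].pt
        x2, y2 = kp2[m.trainIdx].pt
        seg = int(segment_map[int(y1), int(x1)])
        per_segment.setdefault(seg, []).append((x2 - x1, y2 - y1))

    # Median of all feature motion vectors inside each segment.
    return {seg: tuple(np.median(np.array(v), axis=0))
            for seg, v in per_segment.items()}
```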
In embodiments, the object analyzer 5324 may be configured to perform object group analysis on one or more video frames of the video data 5304. For example, the object analyzer 5324 may categorize each segment based on its motion properties (e.g., as either moving or stationary) and adjacent segments may be combined into objects. In embodiments, if the segments are moving, the object analyzer 5324 may combine the segments based on similarity of motion. If the segments are stationary, the object analyzer 5324 may combine the segments based on similarity of color and/or the percentage of shared boundaries. Additionally or alternatively, the object analyzer 5324 may use any of the object analyzation embodiments described in U.S. application Ser. No. 15/237,048, filed Aug. 15, 2016, entitled “OBJECT CATEGORIZATION USING STATISTICALLY-MODELED CLASSIFIER OUTPUTS.”
According to embodiments, the pattern recognition component 5326 may perform pattern recognition on digital images such as, for example, frames of video. For example, the pattern recognition component 5326 may perform pattern recognition of the objects that are determined by the object analyzer 5324 in one or more frames of video. In embodiments, the pattern recognition component 5326 may recognize patterns of one or more of the objects of a video frame and determine whether the recognized patterns correspond to a specific class of objects. If the recognized patterns correspond to a class of objects, the pattern recognition component 5326 may classify and label the object as the corresponding class. Additionally or alternatively, the pattern recognition component 5326 may be used for any number of other purposes such as, for example, detecting regions of interest, foreground detection, facilitating compression, and/or the like.
To perform pattern recognition on frames of a video, the pattern recognition component 5326 may include a feature extractor 5332. The feature extractor 5332 may be configured to extract one or more features from a video frame. In embodiments, the feature extractor 5332 may extract features from one or more of the objects determined by the object analyzer 5324. In embodiments, the feature extractor 5332 may be configured to correlate the extracted features with the objects determined by the object analyzer 5324. That is, in addition to labelling the video frame from which the features were extracted, the feature extractor 5332 may correlate the specific object in a video frame with the extracted features. By correlating extracted features to respective objects, the device 5302 may be less likely to misclassify an object, since features of a first object in a video frame will not be mistakenly used to classify a second object in the same video frame.
In embodiments, the feature extractor 5332 may represent more than one feature extractor. The feature extractor 5332 may include any number of different types of feature extractors, implementations of feature extraction algorithms, and/or the like. For example, the feature extractor 5332 may perform histogram of oriented gradients feature extraction (i.e., “HOG,” as described, for example, in Navneet Dalal and Bill Triggs, “Histograms of Oriented Gradients for Human Detection,” available at http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf, 2005, the entirety of which is hereby incorporated herein by reference for all purposes), Gabor feature extraction (as explained, for example, in John Daugman, “Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 36, No. 7, 1988, the entirety of which is hereby incorporated herein by reference for all purposes), Kaze feature extraction, speeded-up robust features (SURF) feature extraction, features from accelerated segment test (FAST) feature extraction, scale-invariant feature transform (SIFT) feature extraction (as explained, for example, in D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, 60, 2, pp. 91-110, 2004, the entirety of which is hereby incorporated herein by reference for all purposes), and/or the like.
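As a hedged illustration of one such extractor, a HOG descriptor could be computed per object and correlated with that object as follows; the fixed 64×64 window and the bounding-box input are assumed normalizations, not part of the original description.

```python
import cv2
import numpy as np
from skimage.feature import hog

def extract_object_features(frame_gray, object_bbox, cell=(8, 8)):
    """Extract a HOG descriptor for one object, keyed to that object.

    object_bbox = (x, y, w, h) is the bounding box of an object determined by
    the object analyzer; resizing to a fixed window keeps descriptors
    comparable across objects (the 64x64 window is an illustrative choice).
    """
    x, y, w, h = object_bbox
    patch = frame_gray[y:y + h, x:x + w]
    patch = cv2.resize(patch, (64, 64))
    descriptor = hog(patch, orientations=9, pixels_per_cell=cell,
                     cells_per_block=(2, 2), feature_vector=True)
    return descriptor   # correlate this vector with the specific object
```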
As illustrated in
To classify the object based on extracted features, the device 5302 may include an object database 5336, and the object database 5336 may include object features 5338 that are correlated to classes of objects. In embodiments, the object classifier 5334 may determine whether one or more of the extracted features correspond to one or more of the object features 5338 stored in the object database 5336. If the extracted features correspond to object features 5338, then the object classifier 5334 may classify the object as the class of object having those object features 5338.
The object classifier 5334 may include any number of different types of classifiers to classify the extracted features. For example, in embodiments, the object classifier 5334 may be a binary classifier, a non-binary classifier, and/or may include one or more of the following types of classifiers: a support vector machine (SVM), an extreme learning machine (ELM), a neural network, a kernel-based perceptron, a k-NN classifier, a bag-of-visual-words classifier, and/or the like. In embodiments, high quality matches between extracted features and object features 5338 are selected as matches. According to embodiments, a high quality match is a match for which a corresponding match-quality metric satisfies one or more specified criteria. For example, in embodiments, any number of different measures of the match quality (e.g., similarity metrics, relevance metrics, etc.) may be determined and compared to one or more criteria (e.g., thresholds, ranges, and/or the like) to facilitate identifying matches. Embodiments of classification techniques that may be utilized by the object classifier 5334 include, for example, techniques described in Andrey Gritsenko, Emil Eirola, Daniel Schupp, Ed Ratner, and Amaury Lendasse, “Probabilistic Methods for Multiclass Classification Problems,” Proceedings of ELM-2015, Vol. 2, January 2016; and Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cedric Bray, “Visual Categorization with Bags of Keypoints,” Xerox Research Centre Europe, 2004; the entirety of each of which is hereby incorporated herein by reference for all purposes.
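One possible sketch of such a classifier, assuming a scikit-learn SVM trained on descriptors from the object database and a probability-based match-quality criterion (the min_confidence value is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def train_object_classifier(database_features, database_labels):
    """Train an SVM on object features from a labeled object database."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(np.asarray(database_features), np.asarray(database_labels))
    return clf

def classify_object(clf, descriptor, min_confidence=0.6):
    """Return the predicted class label, or None if no high-quality match exists."""
    probs = clf.predict_proba([descriptor])[0]
    best = int(np.argmax(probs))
    if probs[best] < min_confidence:     # match-quality criterion not satisfied
        return None
    return clf.classes_[best]
```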
According to embodiments, the pattern recognition component 5326 includes an object labeler 5340. The object labeler 5340 is configured to label an object and one or more features of the object (e.g., movement of the object) based on the determined classification of the object by the object classifier 5334.
As shown in
In embodiments, the communication component 5330 is configured to communicate encoded video data 5306. For example, in embodiments, the communication component 5330 may facilitate communicating encoded video data 5306 to the decoding device 5308. In embodiments, the classification metadata of the objects may be transmitted with or separately from the encoded object data.
The illustrative operating environment 5300 shown in
As shown in
As shown in
Embodiments of the method 5400 further include classifying objects (block 5406). Embodiments describing an object classification method 5500 are discussed below with respect to
After the objects are classified, the objects are encoded (block 5408). In embodiments, the classification data of an object may be encoded as metadata of the object. In embodiments, encoding the objects may be performed by an encoder 5328 as depicted in
After encoding the objects, the object data may be transmitted (block 5410). In embodiments, the classification metadata of the objects may be transmitted with or separately from the encoded object data. The encoded object data may be transmitted to a decoding device (e.g., the decoding device 5308 depicted in
The illustrative method 5400 shown in
As shown in
As shown in
In embodiments, the method 5500 further comprises extracting features from the objects (block 5506). Features may be extracted using any of a number of different types of feature extraction algorithms including, for example, embodiments of algorithms indicated above. In embodiments, the feature extractor 5332 depicted in
Method 5500 may further include classifying objects based on extracted features (block 5508). To classify the object based on extracted features, extracted features may be correlated to features associated with specific classes of objects. If the extracted features are correlated to features associated with a specific class of objects, then the object may be classified as the class of object having the correlated features. In embodiments, the object classifier 5334 depicted in
According to embodiments, the method 5500 may include labeling objects based on the object classification (block 5510) and encoding said labeled objects (block 5512). In embodiments, an object may be labeled using object labeler 5340 and encoded using the encoder 5328 depicted in
The illustrative method 5500 shown in
Various modifications and additions can be made to the exemplary embodiments discussed without departing from the scope of the present disclosure. For example, while the embodiments described above refer to particular features, the scope of this disclosure also includes embodiments having different combinations of features and embodiments that do not include all of the described features. Accordingly, the scope of the present disclosure is intended to embrace all such alternatives, modifications, and variations as fall within the scope of the claims, together with all equivalents thereof.
Embodiments of the systems and methods described herein may include a partitioning process (e.g., the partitioning process 222 depicted in
The process of breaking a video frame into smaller blocks for encoding has been common to the H.26x family of video coding standards since the release of H.261. A more recent version, H.265, uses blocks of sizes up to 64×64 samples, and utilizes more reference frames and greater motion vector ranges than its predecessors. In addition, these blocks can be partitioned into smaller sub-blocks. The frame sub-blocks in H.265 are referred to as Coding Tree Units (CTUs). In H.264 and VP8, these are known as macroblocks and are 16×16 pixels. These CTUs can be subdivided into smaller blocks called Coding Units (CUs). While CUs provide greater flexibility in referencing different frame locations, they may also be computationally expensive to locate due to multiple cost calculations performed with respect to CU candidates. Often, many CU candidates are not used in a final encoding.
A common strategy for selecting a final CTU follows a recursive, quad-tree structure. A CU's motion vectors and cost are calculated. The CU may be split into multiple (e.g., four) parts, and a similar cost examination may be performed for each. This subdividing and examining may continue until the size of each CU is 4×4 samples. Once the cost of each sub-block for all of the viable motion vectors is calculated, the sub-blocks are combined to form a new CU candidate. This new candidate is then compared to the original CU candidate, and the CU candidate with the higher rate-distortion cost is discarded. This process may be repeated until a final CTU is produced for encoding. With the above approach, unnecessary calculations may be made at each CTU for both divided and undivided CU candidates. Additionally, conventional encoders may examine only local information.
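The recursive quad-tree decision can be sketched as follows. The cost here is simplified to a sum-of-absolute-differences term against a co-located reference block plus a flat per-CU motion-vector cost; a real encoder would perform a motion search and rate-distortion weighting per CU, so the names and constants are assumptions for illustration.

```python
import numpy as np

def best_ctu_partition(block, ref_block, size, min_size=4, mv_cost=8.0):
    """Recursive quad-tree CU decision: keep the block whole or split into four.

    block, ref_block: equally sized square pixel arrays (current and reference).
    Returns (cost, layout) where layout records the chosen split structure.
    """
    whole_cost = float(np.abs(block.astype(np.int32) -
                              ref_block.astype(np.int32)).sum()) + mv_cost
    if size <= min_size:
        return whole_cost, [("leaf", size)]

    half = size // 2
    split_cost, split_layout = 0.0, []
    for dy in (0, half):
        for dx in (0, half):
            sub_cost, sub_layout = best_ctu_partition(
                block[dy:dy + half, dx:dx + half],
                ref_block[dy:dy + half, dx:dx + half],
                half, min_size, mv_cost)
            split_cost += sub_cost
            split_layout.append(((dy, dx), sub_layout))

    # Keep whichever candidate has the lower cost; discard the other.
    if split_cost < whole_cost:
        return split_cost, [("split", split_layout)]
    return whole_cost, [("leaf", size)]
```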
Embodiments of the present disclosure use a classifier to facilitate efficient coding unit (CU) examinations. The classifier may include, for example, a neural network classifier, a support vector machine, a random forest, a linear combination of weak classifiers, and/or the like. The classifier may be trained using various inputs such as, for example, object group analysis, segmentation, localized frame information, and global frame information. Segmentation on a still frame may be generated using any number of techniques. For example, in embodiments, an edge detection based method may be used. Additionally, a video sequence may be analyzed to ascertain areas of consistent inter frame movements which may be labeled as objects for later referencing. In embodiments, the relationships between the CU being examined and the objects and segments may be inputs for the classifier.
According to embodiments, frame information may be examined both on a global and local scale. For example, the average cost of encoding an entire frame may be compared to a local CU encoding cost and, in embodiments, this ratio may be provided, as an input, to the classifier. As used herein, the term “cost” may refer to a cost associated with error from motion compensation for a particular partitioning decision and/or costs associated with encoding motion vectors for a particular partitioning decision. These and various other, similar, types of costs are known in the art and may be included within the term “costs” herein. Examples of these costs are defined in U.S. application Ser. No. 13/868,749, filed Apr. 23, 2013, entitled “MACROBLOCK PARTITIONING AND MOTION ESTIMATION USING OBJECT ANALYSIS FOR VIDEO COMPRESSION,” the entirety of which is hereby incorporated by reference herein for all purposes.
Another input to the classifier may include a cost decision history of local CTUs that have already been processed. This may be, e.g., a count of the number of times a split CU was used in a final CTU within a particular region of the frame. In embodiments, the Early Coding Unit decision, as developed in the Joint Video Team's Video Coding HEVC Test Model 12, may be provided, as input, to the classifier. Additionally, the level of the particular CU in the quad tree structure may be provided, as input, to the classifier.
According to embodiments, information from a number of test videos may be used to train a classifier to be used in future encodings. In embodiments, the classifier may also be trained during actual encodings. That is, for example, the classifier may be adapted to characteristics of a new video sequence for which it may subsequently influence the encoder's decisions of whether to bypass unnecessary calculations.
According to various embodiments of the present disclosure, a pragmatic partitioning analysis may be employed, using a classifier to help guide the CU selection process. Using a combination of segmentation, object group analysis, and a classifier, the cost decision may be influenced in such a way that human visual quality may be increased while lowering bit expenditures. For example, this may be done by allocating more bits to areas of high activity than are allocated to areas of low activity. Additionally, embodiments of the present disclosure may leverage correlation information between CTUs to make more informed global decisions. In this manner, embodiments of the present disclosure may facilitate placing greater emphasis on areas that are more sensitive to human visual quality, thereby potentially producing a result of higher quality to end-users.
As shown in
In embodiments, the memory 5714 stores computer-executable instructions for causing the processor 5712 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components include a segmenter 5718, a motion estimator 5720, a partitioner 5722, a classifier 5724, an encoder 5726, and a communication component 5728.
In embodiments, the segmenter 5718 may be configured to segment a video frame into a number of segments. The segments may include, for example, objects, groups, slices, tiles, and/or the like. The segmenter 5718 may employ any number of various automatic image segmentation methods known in the field. In embodiments, the segmenter 5718 may use image color and corresponding gradients to subdivide an image into segments that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. For example, the segmenter 5718 may use Canny edge detection to detect edges on a video frame for optimum cut partitioning, and create segments using the optimum cut partitioning of the resulting pixel connectivity graph.
In embodiments, the motion estimator 5720 is configured to perform motion estimation on a video frame. For example, in embodiments, the motion estimator may perform segment-based motion estimation, where the inter-frame motion of the segments determined by the segmenter 5718 is determined. The motion estimator 5720 may utilize any number of various motion estimation techniques known in the field. Two examples are optical pixel flow and feature tracking. For example, in embodiments, the motion estimator 5720 may use feature tracking in which Speeded Up Robust Features (SURF) are extracted from both a source image (e.g., a first frame) and a target image (e.g., a second, subsequent, frame). The individual features of the two images may then be compared using a Euclidean metric to establish a correspondence, thereby generating a motion vector for each feature. In such cases, a motion vector for a segment may be, for example, the median of all of the motion vectors for each of the segment's features.
In embodiments, the encoding device 5702 may perform an object group analysis on a video frame. For example, each segment may be categorized based on its motion properties (e.g., as either moving or stationary) and adjacent segments may be combined into objects. In embodiments, if the segments are moving, they may be combined based on similarity of motion. If the segments are stationary, they may be combined based on similarity of color and/or the percentage of shared boundaries.
In embodiments, the partitioner 5722 may be configured to partition the video frame into a number of partitions. For example, the partitioner 5722 may be configured to partition a video frame into a number of coding tree units (CTUs). The CTUs can be further partitioned into coding units (CUs). Each CU may include a luma coding block (CB), two chroma CBs, and an associated syntax. In embodiments, each CU may be further partitioned into prediction units (PUs) and transform units (TUs). In embodiments, the partitioner 5722 may identify a number of partitioning options corresponding to a video frame. For example, the partitioner 5722 may identify a first partitioning option and a second partitioning option.
To facilitate selecting a partitioning option, the partitioner 5722 may determine a cost of each option and may, for example, determine that a cost associated with the first partitioning option is lower than a cost associated with the second partitioning option. In embodiments, a partitioning option may include a candidate CU, a CTU, and/or the like. In embodiments, costs associated with partitioning options may include costs associated with error from motion compensation, costs associated with encoding motion vectors, and/or the like.
To minimize the number of cost calculations made by the partitioner 5722, the classifier 5724 may be used to facilitate classification of partitioning options. In this manner, the classifier 5724 may be configured to facilitate a decision as to whether to partition the frame according to an identified partitioning option. According to various embodiments, the classifier may be, or include, a neural network, a support vector machine, and/or the like. The classifier may be trained using test videos before and/or during its actual use in encoding.
In embodiments, the classifier 5724 may be configured to receive, as input, at least one characteristic corresponding to the candidate coding unit. For example, the partitioner 5722 may be further configured to provide, as input to the classifier 5724, a characteristic vector corresponding to the partitioning option. The characteristic vector may include a number of feature parameters that can be used by the classifier to provide an output to facilitate determining that the cost associated with a first partitioning option is lower than the cost associated with a second partitioning option. For example, the characteristic vector may include one or more of localized frame information, global frame information, output from object group analysis and output from segmentation. The characteristic vector may include a ratio of an average cost for the video frame to a cost of a local CU in the video frame, an early coding unit decision, a level in a CTU tree structure corresponding to a CU, and a cost decision history of a local CTU in the video frame. For example, the cost decision history of the local CTU may include a count of a number of times a split CU is used in a corresponding final CTU.
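For illustration, the characteristic vector could be assembled as follows before being passed to the classifier 5724. All inputs are assumed to be produced elsewhere in the encoder, and the feature ordering and the use of a scikit-learn-style classifier in the usage comment are assumptions.

```python
import numpy as np

def cu_characteristic_vector(cu_cost, frame_avg_cost, early_cu_decision,
                             quadtree_level, local_split_count):
    """Assemble a characteristic vector for one candidate CU.

    cu_cost / frame_avg_cost: local and global (frame-average) encoding costs.
    early_cu_decision: boolean output of the HEVC test model heuristic.
    quadtree_level: level of this CU in the CTU tree structure.
    local_split_count: number of times a split CU was used in nearby final CTUs
    (the cost decision history of local CTUs).
    """
    cost_ratio = cu_cost / max(frame_avg_cost, 1e-9)   # local vs. global cost
    return np.array([cost_ratio,
                     float(early_cu_decision),
                     float(quadtree_level),
                     float(local_split_count)], dtype=np.float32)

# Example usage with a previously trained classifier object:
# skip_split = bool(classifier.predict([cu_characteristic_vector(...)])[0])
```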
As shown in
The illustrative operating environment 5700 shown in
Embodiments of the method 5800 further include a process 5807 that is performed for each of a number of coding units or other partition structures. For example, a first iteration of the process 5807 may be performed for a first CU that may be a 64×64 block of pixels, then for each of four 32×32 blocks of the CU, using information generated in each step to inform the next step. The iterations may continue, for example, by performing the process for each 16×16 block that makes up each 32×32 block. This iterative process 5807 may continue until a threshold or other criteria are satisfied, at which point the method 5800 is not applied at any further branches of the structural hierarchy.
As shown in
As shown in
As shown in
As shown in
Modern video compression techniques take advantage of the fact that information content in video exhibits significant redundancy. Video exhibits temporal redundancy inasmuch as, in a new frame of a video, most content was present previously. Video also exhibits significant spatial redundancy, inasmuch as, in a given frame, pixels have color values similar to their neighbors. The first commercially widespread video coding methods, MPEG1 and MPEG2, took advantage of these forms of redundancy and were able to reduce bandwidth requirements substantially.
For high quality encoding, MPEG1 generally cut the bandwidth requirement for standard-definition resolution from 240 Mbps to 6 Mbps. MPEG2 brought the requirement down further, to 4 Mbps. As a result, MPEG2 is used for digital television broadcasting all over the world. MPEG1 and MPEG2 each took advantage of temporal redundancy by leveraging block-based motion compensation. To compress using block-based motion compensation, a new frame that is to be encoded by an encoder is broken up into fixed-size, 16×16 pixel blocks, labeled macroblocks. These macroblocks are non-overlapping and form a homogeneous tiling of the frame. When encoding, the encoder searches, for each macroblock in a new frame, for the best matching macroblock of a previously encoded frame. In fact, in MPEG1 and MPEG2, up to two previously encoded frames can be searched. Once a best match is found, the encoder establishes and transmits a displacement vector, known in this case as a motion vector, referencing and, thereby, approximating, each macroblock.
MPEG1 and MPEG2, as international standards, specified the format of the motion vector coding but left the means of determination of the motion vectors to the designers of the encoder algorithms. Originally, the absolute error between the actual macroblock and its approximation was targeted for minimization in the motion vector search. However, later implementations took into account the cost of encoding the motion vector, too. Although MPEG1 and MPEG2 represented significant advances in video compression, their effectiveness was limited, due largely to the fact that real video scenes are not composed of moving square blocks. Realistically, certain macroblocks in a new frame are not represented well by any macroblocks from a previous frame and have to be encoded without the benefit of temporal redundancy. With MPEG1 and MPEG2, these macroblocks could not be compressed well and contributed disproportionately to overall bitrate.
The newer generation of video compression standards, such as H.264 and Google's VP8, has addressed this temporal redundancy problem by allowing the 16×16 macroblocks to be partitioned into smaller blocks, each of which can be motion compensated separately. The option is to go, potentially, as far down as 4×4 pixel block partitions. The finer partitioning potentially allows for a better match of each partition to a block in a previous frame. However, this approach incurs the cost of coding extra motion vectors. The encoders, operating within standards, have the flexibility to decide how the macroblocks are partitioned and how the motion vectors for each partition are selected. Regardless of path, ultimately, the results are encoded in a standards compliant bitstream that any standards compliant decoder can decode.
Determining how to partition and motion compensate each macroblock is complex, and the original H.264 test model used an approach based on rate-distortion optimization. In rate-distortion optimization, a combined cost function, including both the error for a certain displacement and the coding cost of the corresponding motion vector, is targeted for minimization. To partition a particular macroblock, the total cost-function is analyzed. The total cost function contains the errors from motion compensating each partition and the costs of encoding all the motion vectors associated with the specific partitioning. The cost is given by the following equation:
$$F(v_1, \ldots, v_N) = \sum_{\text{partitions}} \mathrm{Error}_{\text{partition}} + \alpha \sum_{\text{partitions}} R(v_{\text{partition}}), \qquad (1)$$
where $\alpha$ is the Lagrange multiplier relating rate and distortion, $\sum_{\text{partitions}} \mathrm{Error}_{\text{partition}}$ is the cost associated with the mismatch of the source and the target, and $\sum_{\text{partitions}} R(v_{\text{partition}})$ is the cost associated with encoding the corresponding motion vectors.
For each possible partitioning, the cost function F is minimized as a function of the motion vectors v. For the final decision, the optimal cost functions of each potential partitioning are considered, and the partitioning with the lowest overall cost function is selected. The macroblocks are encoded in raster scan order, and this choice is made for each macroblock as it is encoded. The previous macroblocks impact the current macroblock by differentially predicting the motion vectors for the current macroblock and, thus, impacting the coding cost of a potential candidate motion vector. This approach is the de facto standard in video compression encoders for H.264 and VP8 today.
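A simplified sketch of selecting among partitioning options by minimizing the cost function of Equation (1); the per-partition errors and motion-vector rates are assumed to have been computed already by the encoder's motion search, and the candidate representation is illustrative.

```python
import numpy as np

def partition_cost(errors, mv_rates, alpha):
    """Equation (1): sum of partition errors plus alpha times motion-vector rate."""
    return float(np.sum(errors)) + alpha * float(np.sum(mv_rates))

def select_partitioning(candidates, alpha):
    """Pick the partitioning option with the lowest rate-distortion cost.

    candidates: list of (name, errors, mv_rates) tuples, where errors and
    mv_rates hold the motion-compensation error and motion-vector coding cost
    of each partition under that option.
    """
    costs = {name: partition_cost(err, rates, alpha)
             for name, err, rates in candidates}
    return min(costs, key=costs.get), costs
```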
Embodiments of the methods and systems herein may include a method of partitioning including: determining objects within a frame, such determining being at least partially based on movement characteristics of pixels of the frame; creating a mask corresponding to the frame; determining a pre-scaling cost function value associated with each of a plurality of partitioning options for partitioning a macroblock of the frame, the pre-scaling cost function value comprising a sum of motion compensation errors associated with the partitioning option and the costs of encoding the motion vectors associated with the partitioning option; determining a post-scaling cost function value associated with each of the plurality of partitioning options, comprising: determining, based on the mask, that a macroblock overlaps at least two of the determined objects; determining that a first partitioning option of the plurality of partitioning options results in partitioning the macroblock into a plurality of blocks, wherein the first partitioning option separates at least some of the determined objects into different blocks of the plurality of blocks; and reducing, by a scaling factor, a pre-scaling cost function value of the first partitioning option to create a post-scaling cost function value of the first partitioning option in response to determining that the first partitioning option results in partitioning the macroblock into a plurality of blocks, wherein the first partitioning option separates at least some of the determined objects into different blocks of the plurality of blocks; selecting a partitioning option of the plurality of partitioning options having the lowest associated post-scaling cost function value; and partitioning the frame into blocks according to the selected partitioning option.
In an exemplary and non-limited embodiment, aspects of the disclosure are embodied in a method of encoding video including determining objects within a frame at least partially based on movement characteristics of underlying pixels and partitioning the frame into blocks by considering a plurality of partitioning options, such partitioning favoring options that result in different objects being placed in different blocks. That is, for example,
In another example, aspects of the present disclosure are embodied in a partitioner operable to partition a frame into blocks by considering a plurality of partitioning options, such partitioning favoring options that result in different objects being placed in different blocks.
In yet another example, aspects of the present disclosure are embodied in a computer-readable medium having instructions thereon that, when interpreted by a processor, cause the processor to determine objects within a frame at least partially based on movement characteristics of underlying pixels; and partition a frame into blocks by considering a plurality of partitioning options, such partitioning favoring options that result in different objects being placed in different blocks.
The methods and systems described herein improve on the currently prevailing compression approach by taking a more global view of the encoding of a frame of video. Using the traditional rate-distortion optimization approach, no weight is given to the fact that the choice of partitions and their corresponding motion vectors will impact subsequent macroblocks. The result of this comes in the form of higher cost for encoding motion vectors and potential activation of the de-blocking filter, negatively impacting overall quality.
Referring to
Memory 6204 includes communication component 6218 which, when executed by processor 6202, permits video encoding computing system 6200 to communicate with other computing devices over a network. Although illustrated as software, communication component 6218 may be implemented as software, hardware (such as state logic), or a combination thereof. Video encoding computing system 6200 further includes data to be encoded, such as at least one video file 6206, which is received from a client computing system and is stored in memory 6204. The video file is to be encoded and subsequently stored as a processed video file 6208. Exemplary video encoding computing systems 6200 include desktop computers, laptop computers, tablet computers, cell phones, smart phones, and other suitable computing devices. In the illustrative embodiment, video encoding computing system 6200 includes memory 6204, which may be multiple memories accessible by processor 6202.
Memory 6204 associated with the one or more processors of processor 6202 may include, but is not limited to, memory associated with the execution of software and memory associated with the storage of data. Memory 6204 includes computer-readable media. Computer-readable media may be any available media that may be accessed by one or more processors of processor 6202 and includes both volatile and non-volatile media. Further, computer-readable media may be one or both of removable and non-removable media. By way of example, computer-readable media may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which may be used to store the desired information and which may be accessed by processor 6202.
Memory 6204 further includes video encoding component 6216. Video encoding component 6216 relates to the processing of video file 6206. Exemplary processing sequences of the video encoding component are provided below. Although illustrated as software, video encoding component 6216 may be implemented as software, hardware, or a combination thereof.
Video encoding computing system 6200 further includes a user interface 6210. User interface 6210 includes one or more input devices 6212 and one or more output devices, illustratively a display 6214. Exemplary input devices include a keyboard, a mouse, a pointer device, a trackball, a button, a switch, a touch screen, and other suitable devices which allow an operator to provide input to video encoding computing system 6200. Exemplary output devices include a display, a touch screen, a printer, and other suitable devices which provide information to an operator of video encoding computing system 6200.
In one embodiment, the computer systems disclosed in U.S. application Ser. No. 13/428,707, filed Mar. 23, 2012, titled VIDEO ENCODING SYSTEM AND METHOD, the entirety of which is hereby incorporated by reference herein for all purposes, utilize the video encoding processing sequences described herein to encode video files.
In embodiments, a two-pass approach through the video is implemented. In the first pass, video is analyzed both for coherently moving and for stationary objects. With respect to each frame of video, mask generator 6226 generates a mask. Mask generator 6226 assigns each pixel of a frame to either a moving or a stationary object. Objects are determined (block 6300) and enumerated, with each object's numeral corresponding to the pixel value in the mask. Moreover, via motion estimator 6230, associated metadata may specify which objects are in motion.
According to embodiments, the first pass includes two steps. In the first step, segmenter receives a frame 6700 and breaks up the frame into image segments 6800 (
A number of different automatic image segmentation methods are known to practitioners in the field. Generally, the techniques use image color and corresponding gradients to subdivide an image into segment regions that have similar color and texture. Two examples of image segmentation techniques include the watershed algorithm and optimum cut partitioning of a pixel connectivity graph. In the specific embodiment, Canny edge detection is used to detect edges on an image for optimum cut partitioning. Segments are then created using the optimum cut partitioning of the pixel connectivity graph.
The second step is segment-based motion estimation, where the motion of the segments is determined. Once the segments are created, motion estimator 6230 estimates motion of each segment between frames, with the current frame in the temporal sequence serving as the source frame and the subsequent frame in the temporal sequence serving as the target frame. A number of motion estimation techniques are known to practitioners in the field. Two examples are optical pixel flow and feature tracking. In a feature tracking technique, for example, Speeded Up Robust Features (SURF) are extracted from both the source image and the target image. The individual features of the two images are then compared using a Euclidean metric to establish a correspondence. This generates a motion vector for each feature. A motion vector for a segment is the median of all of the motion vectors of the segment's features. Accordingly, in embodiments, each segment is categorized based on its motion properties (Block 6410). Such categorization includes categorizing each segment as either moving or stationary (Block 6610).
As shown, adjacent segments, as understood from the foregoing two steps, are combined into objects (Block 6420). If the segments are moving, they are combined based on similarity of motion (Block 6620). If the segments are stationary, they are combined based on similarity of color and the percentage of shared boundaries (Block 6630). Objects are enumerated, and a mask is generated for a given frame.
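A sketch of combining adjacent segments into objects under the rules above, using a simple union-find over the segment adjacency graph; the motion, color, and shared-boundary tolerances are illustrative assumptions, as are the input data structures.

```python
import numpy as np

def group_segments(adjacency, motion, moving, colors, shared_boundary,
                   motion_tol=2.0, color_tol=20.0, boundary_tol=0.3):
    """Combine adjacent segments into objects.

    adjacency: iterable of (a, b) pairs of neighboring segment ids.
    moving[s]: True if segment s was categorized as moving.
    motion[s], colors[s]: per-segment motion vector and mean color.
    shared_boundary[(a, b)]: fraction of boundary the pair shares, keyed by the
    same (a, b) ordering used in adjacency.
    Returns a mapping from segment id to object id.
    """
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in adjacency:
        if moving[a] and moving[b]:
            # Moving segments merge on similarity of motion.
            if np.linalg.norm(np.subtract(motion[a], motion[b])) <= motion_tol:
                union(a, b)
        elif not moving[a] and not moving[b]:
            # Stationary segments merge on color similarity and shared boundary.
            if (np.linalg.norm(np.subtract(colors[a], colors[b])) <= color_tol
                    and shared_boundary[(a, b)] >= boundary_tol):
                union(a, b)

    return {s: find(s) for s in parent}   # segment id -> object id
```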
In the second pass, the actual encoding is performed by encoder 6220. The object mask generated by the first pass is available to encoder 6220. Partitioner 6222 operates to determine which macroblocks are kept whole and which macroblocks are further divided into smaller partitions. Partitioner 6222 makes the partitioning decision by taking object mask information into account. Partitioner 6222 illustratively “decides” between multiple partitioning options.
Partitioner 6222 determines if a macroblock overlaps multiple objects of the mask (Block 6500, 6640). The costs associated with each partitioning option are determined (Block 6510). In one example, costs associated with error from motion compensation for a particular partitioning decision are determined (Block 6650). Costs associated with encoding motion vectors for a particular partitioning decision are also determined (Block 6660).
In the case where a macroblock overlaps two objects, cost adjuster 6221 favors the partitioning option that separates the two objects by adjusting (reducing) its cost function via multiplying it by a coefficient, β, which is less than 1 (Block 6520, 6670). Stated differently, the processing of macroblocks is biased to encourage partitioning that separates objects (block 6310). β is a learned constant and, in the specific embodiment, depends on whether one of the two objects is moving, both objects are moving, or both are stationary. In the case of a macroblock containing more than two objects, the cost function of a partition that separates three of the objects is further scaled by β². This approach may be applied, potentially indefinitely, for an indefinite number of additional objects within a macroblock. In the specific embodiment, scaling factors beyond β² are equal to 1. The partition corresponding to the best cost function value post-scaling is determined (block 6680), selected, and processed (Block 6690).
The specific cost functions are given by:
$$F(v_1, \ldots, v_n)_{\text{objects separated}} = \beta\left(\sum_{\text{partitions}} \mathrm{Error}_{\text{partition}} + \alpha \sum_{\text{partitions}} R(v_{\text{partition}})\right)$$
$$F(v_1, \ldots, v_n)_{\text{objects together}} = \sum_{\text{partitions}} \mathrm{Error}_{\text{partition}} + \alpha \sum_{\text{partitions}} R(v_{\text{partition}})$$
Partitioning that favors separation of objects is thereby more likely, because a β less than one gives such partitioning a lower cost. In other words, additional present real cost is taken on in anticipation that such present cost results in later savings. Moreover, this potentially leads to less expensive encoding of macroblocks reached subsequently when they contain portions of one of the objects in the considered macroblock. In the specific embodiment, the error metric chosen (i.e., the first addend) is the sum of absolute differences. The coding cost of the motion vectors (i.e., the second addend) is derived by temporarily quantifying the vectors' associated bitrates using Binary Adaptive Arithmetic Coding. Nothing is written to the bitstream until the final choice for the macroblock is made. Once such macroblock choice is made, along with the decisions for all other macroblocks, the frame is divided into macroblocks (Block 6320).
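The β scaling can be sketched as a small helper applied to the pre-scaling cost of each partitioning option; the way the number of separated objects is counted here is an illustrative assumption.

```python
def scaled_partition_cost(base_cost, objects_in_macroblock, separates_objects, beta):
    """Apply the beta scaling from the cost functions above.

    base_cost: the unscaled cost F for a partitioning option.
    objects_in_macroblock: how many mask objects the macroblock overlaps.
    separates_objects: how many of those objects this option places into
    different partitions (an illustrative way to count the scaling level).
    """
    if objects_in_macroblock < 2 or separates_objects < 2:
        return base_cost                  # no bias toward splitting
    if separates_objects == 2:
        return beta * base_cost           # two objects separated
    return (beta ** 2) * base_cost        # three or more; further factors equal 1
```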
An exemplary processing will now be described with reference to
Based on analysis of the motion of the segments from frame to frame, segments are grouped into objects.
In the current example, macroblocks are illustratively 16 pixels by 16 pixels in size.
In the present example, the cost calculation has determined that the changes between frames warrant subdivision within the 16×16 macroblock to give four 8×8 macroblocks. Similar cost calculations are performed for each resulting 8×8 macroblock. It should be appreciated that two of the 8×8 macroblocks (upper left and lower right) are deemed to be homogenous enough and/or stationary enough to not warrant further division. However, the other two 8×8 macroblocks (those that contain the majority of the edges of the objects) have satisfied the criteria (cost calculation) for further division. As previously noted, the cost calculation is biased to favor division of objects.
Embodiments of the systems and methods described herein may include an encoding process (e.g., the encoding process 226 depicted in
Referring to
Exemplary information that the video encoding system 7000 receives from a client computing system 7006A-N includes a video file and information regarding processing of the video file. In embodiments, the client computing system 7006A-N sends the video file to the video encoding system 7000 or instructs another computing system to send the video file to the video encoding system 7000. In embodiments, the client computing system 7006A-N provides the video encoding system 7000 with instructions on how to retrieve the video file.
Exemplary information that the video encoding system 7000 sends to the client computing system 7006A-N includes at least one processed video file, which is generated based on the video file and the information regarding processing of the video file. In embodiments, the video encoding system 7000 sends the processed video file to the client computing system 7006A-N or instructs another computing system to send the processed video file to the client computing system 7006A-N. In embodiments, the video encoding system 7000 sends the processed video file to a destination specified by the client computing system 7006A-N or instructs another computing system to send the processed video file to a destination specified by the client computing system 7006A-N. In embodiments, the video encoding system 7000 provides client computing system 7006A-N instructions on how to retrieve the processed video file.
In embodiments, the information regarding processing of the video file may include information related to the desired video encoding format. Exemplary information related to the desired format includes the video encoding format, a bit rate, a resolution, and/or other suitable information. Additional exemplary information regarding the processing of the video file may include key point positions and other metadata. In embodiments, the information relates to multiple encoding formats for the video file, so that multiple processed video files are produced by the video encoding system 7000.
Referring to
In embodiments, the memory 7104 stores computer-executable instructions for causing the processor 7102 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. An example of such a program component may include a communication component 7112. In embodiments, the information file 7114 includes the information regarding processing of the video file 7110. In embodiments, the information regarding processing of the video file 7110 is stored as part of the video file 7110.
In embodiments, the client computing system 7100 further includes a user interface 7116. The user interface 7116 includes one or more input devices 7118 and one or more output devices, illustratively a display 7120. Exemplary input devices include a keyboard, a mouse, a pointer device, a trackball, a button, a switch, a touch screen, and other suitable devices which allow an operator to provide input to the client computing system 7100. Exemplary output devices include a display, a touch screen, a printer, speakers, and other suitable devices which provide information to an operator of the client computing system 7100. In embodiments, as indicated above, the client computing system 7100 may include a video camera 7106 and associated microphone 7108. The video camera 7106 may be used to capture the video file 7110. In embodiments, the client computing system 7100 receives the video file 7110 from another computing device.
Returning to
In embodiments, the number, N, of the worker instances 7010A-N is fixed, while, in other embodiments, the number, N, of worker instances 7010A-N is dynamic and provides a scalable, on-demand system. In embodiments, for example, the video encoding system 7000 is implemented in a cloud-computing platform. In embodiments, cloud computing refers to a networked server (or servers) configured to provide computing services. Computing resources in a cloud computing platform may include worker instances 7010A-N. An “instance” is an instantiation of a computing resource. In embodiments, a worker instance may include a particular CPU, a certain amount of memory, and a certain amount of storage (e.g., hard-disk space). In embodiments, a worker instance may include an instantiated program component. That is, for example, a computing device may be configured to instantiate a number of different worker instances, each of which may be instantiated, for example, in a separate virtual machine. In embodiments, the instances may be launched and shut down programmatically and/or dynamically (e.g., in response to a request for computing resources).
According to embodiments, any number of different types of algorithms may be utilized for scheduling encoding tasks among a number of different encoders (e.g., worker instances 7010 depicted in
Embodiments of the systems and methods described herein include determining an encoding schedule by using an algorithm configured to minimize (or at least approximately minimize) a cost associated with encoding (e.g., an encoding cost). The encoding cost may be determined based on encoder load capacities, an estimated encoding load associated with a video file, requested encoding parameters (e.g., a requested target encoding format, bitrate, resolution, etc.), and/or other information. In embodiments, a video encoding manager (or any other component configured to determine encoding schedules) may include, as input to the scheduling decision algorithm, a state associated with each encoder. The state may represent the availability of an encoder. That is, for example, in embodiments, an encoder may, at any given time, have an associated state that is one of three potential states: fully available; partially available (e.g., storage available, but computing resources not available, or vice-versa); or fully unavailable.
According to embodiments, the video encoding manager may be configured to use any number of different types of mathematical models to determine a cost associated with a potential encoding task, at least partially in view of the respective states of the various (at least potentially) available encoders. An encoder in the fully unavailable state, for example, may be associated with a higher cost than the costs associated with each of the other state options. Similarly, an encoder in the partially available state may be associated with a higher cost than the cost associated with an encoder in the fully available state, but a lower cost than the cost associated with an encoder in the fully unavailable state. In embodiments, the algorithm (e.g., the illustrative method 7300 depicted in
In embodiments, the video encoding manager may be configured to determine the state of an encoder by referencing a database, by querying the encoder (or a component that manages the encoder such as, for example, a master worker instance), and/or the like. In embodiments, the video encoding manager may be configured to determine an estimated time before an encoder in the fully unavailable state becomes partially available and/or fully available, before an encoder in the partially available state becomes fully available (or fully unavailable due to another scheduled encoding task), and/or before an encoder in the fully available state can be instantiated (or becomes partially and/or fully unavailable due to another scheduled encoding task). In this manner, a video encoding manager may be configured, for example, to optimize cost and achieve specified performance requirements (e.g., encoding speed, encoding task duration, encoded video quality, and/or the like). In embodiments, a video encoding manager may also be configured to take into account any number of other scheduled encoding tasks, reserved computing resources, and/or the like, in determining encoding schedules.
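By way of a non-limiting illustration, one possible way to fold encoder state into a scheduling cost is sketched below; the three states mirror those described above, while the numeric weights and the base-cost input are assumptions introduced only for illustration.

```python
# Sketch of a state-weighted scheduling cost (illustrative weights).
from enum import Enum

class EncoderState(Enum):
    FULLY_AVAILABLE = "fully_available"
    PARTIALLY_AVAILABLE = "partially_available"
    FULLY_UNAVAILABLE = "fully_unavailable"

# Higher weight -> higher cost -> less likely to be scheduled.
STATE_WEIGHT = {
    EncoderState.FULLY_AVAILABLE: 1.0,
    EncoderState.PARTIALLY_AVAILABLE: 2.0,
    EncoderState.FULLY_UNAVAILABLE: 10.0,
}

def encoding_cost(base_cost: float, state: EncoderState,
                  minutes_until_available: float = 0.0) -> float:
    """Combine an estimated encoding cost with the encoder's availability state
    and an estimated wait time before the encoder can accept the task."""
    return base_cost * STATE_WEIGHT[state] + minutes_until_available

def pick_encoder(candidates):
    """candidates: iterable of (encoder_id, base_cost, state, wait_minutes)."""
    return min(candidates, key=lambda c: encoding_cost(c[1], c[2], c[3]))
```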
According to embodiments, referring to
In embodiments, the memory 7204 stores computer-executable instructions for causing the processor 7202 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components may include a video encoding management component 7208 and a communication component 7210. In embodiments, the information file 7212 includes the information regarding processing of the video file 7206. In embodiments, the information regarding processing of the video file 7206 is stored as part of the video file 7206.
In embodiments, the video encoding manager 7200 further includes a user interface 7214. The user interface 7214 includes one or more input devices 7216 and one or more output devices, illustratively a display 7218. Exemplary input devices include a keyboard, a mouse, a pointer device, a trackball, a button, a switch, a touch screen, and other suitable devices which allow an operator to provide input to the video encoding manager 7200. Exemplary output devices include a display, a touch screen, a printer, speakers, and other suitable devices which provide information to an operator of the video encoding manager 7200.
According to embodiments, the video encoding management component 7208 may be configured to process the video file 7206, e.g., to facilitate an encoding process. In a cloud or other distributed computing environment, the processor 7202 may be configured to instantiate the video encoding management component 7208 to process the video file 7206. For example, the video encoding management component 7208 may be configured to determine a number of worker instances to use to encode the video file 7206. In embodiments, the video encoding management component 7208 may be configured to determine the number of worker instances to use to encode the video file 7206 based on the size of the video file 7206, requested processing parameters, and a load capacity of each of the worker instances (or an average, or otherwise estimated, load capacity associated with a number of worker instances). In embodiments, the video encoding management component 7208 is configured to assign a worker instance to function as a master worker instance.
Referring to
As shown, the video encoding manager 7200 determines the load capacity of each of the worker instances (block 7304). In embodiments, each of the worker instances that can be instantiated may be similarly (or identically) configured, in which case the video encoding manager 7200 only needs to determine one load capacity. In embodiments, two or more of the worker instances may be differently configured. In embodiments, the video encoding manager 7200 may determine the load capacities of the worker instances by accessing the memory 7204, as the load capacities may be stored in the memory of the video encoding manager 7200. In embodiments, the video encoding manager 7200 may be configured to request load capacity information from another device (e.g., from a worker instance, from a website, and/or the like). In embodiments, the load capacity determination may additionally, or alternatively, be based on historical information available regarding the performance of one or more worker instances. According to embodiments, a load capacity of a worker instance refers to its ability to handle an encoding task, and may be calculated based on any number of different parameters such as, for example, the speed at which the worker instance encodes video data under a specified set of circumstances (e.g., to achieve a certain resolution); scheduling information associated with the worker instance (e.g., information regarding other tasks that the worker instance is scheduled to complete, an estimated (or programmed) length of time for completing scheduled tasks, etc.); and/or the like.
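By way of a non-limiting illustration, and because no specific load-capacity formula is prescribed above, one possible sketch of estimating a worker instance's load capacity from its measured encoding speed and its scheduled work is shown below; the inputs and units are assumptions.

```python
# Rough sketch of a load-capacity estimate (assumed formula and units).
def load_capacity(pixels_per_second: float,
                  scheduled_task_seconds: float,
                  scheduling_window_seconds: float = 3600.0) -> float:
    """Capacity = measured throughput scaled by the fraction of the scheduling
    window not already committed to other scheduled tasks."""
    free_fraction = max(0.0, 1.0 - scheduled_task_seconds / scheduling_window_seconds)
    return pixels_per_second * free_fraction
```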
Embodiments of the method 7300 include estimating a total encoding load (block 7306). In embodiments, the video encoding manager 7200 may be configured to analyze the video file, the information file, the worker instance load capacities, and/or the like, to estimate the total encoding load. In embodiments, for example, the information file may include a requested target encoding format, the resolution of the target encoding format for the video file, a bit rate of the target encoding format, and/or the like.
According to embodiments, the total load, LOAD, may be determined using the following formula:
\text{LOAD} = A \cdot T \cdot [H \cdot W]^{n} \qquad (1)
where A is the requested frame rate of the target encoding; T is the duration of the video file; H is the number of rows in the target encoding; W is the number of columns in the target encoding; and n is a variable whose value is determined based on the speed of each of the worker instances. In embodiments in which all of the worker instances are at least approximately identical in functionality, n may be determined based on the encoding speed of any one of the worker instances. In embodiments in which the worker instances include different capabilities and/or functionalities, n may represent an aggregated characteristic across the different worker instances. In embodiments, a separate calculation may be made for each type of worker instance, in which case a total load for the system may be determined by mathematically combining the multiple calculations. The variable, n, may take any number of different values (e.g., between approximately 0 and approximately 5). For example, in embodiments in which a worker instance includes the Lyrical Labs H.264 encoder available from Lyrical Labs, of New York, N.Y., the value of n may be between approximately 1.2 and approximately 2.5 (e.g., 1.5).
According to embodiments of the method 7300, the video encoding manager may be configured to determine the number of worker instances to use to encode the video file based on the estimated total load and the worker instance load capacity (block 7308). In embodiments, for example, the number of worker instances to use may be determined based on equation (2):

\text{NUM} = \text{CEILING}\!\left( \text{LOAD} / \text{SINGLE\_WORKER\_LOAD} \right) \qquad (2)
where NUM is the number of worker instances to launch; LOAD is the estimated total encoding load associated with encoding the video file; SINGLE_WORKER_LOAD is the load capacity of an illustrative worker instance; and the CEILING function rounds the quotient up to the nearest integer. In embodiments, the SINGLE_WORKER_LOAD may be a single value representing an aggregate of multiple load capacities of multiple different worker instances, and/or the like. In embodiments, the SINGLE_WORKER_LOAD may be weighted based on the state, as explained above, associated with the corresponding worker instance. According to embodiments, any number of different techniques may be used to incorporate encoder state, and/or other information discussed herein, into embodiments of the method 7300 to enhance the decision-making process.
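By way of a non-limiting illustration, equations (1) and (2) may be transcribed directly as follows; the example frame rate, duration, resolution, value of n, and single-worker load are illustrative only.

```python
# Direct transcription of equations (1) and (2) with illustrative inputs.
import math

def total_load(frame_rate: float, duration_s: float,
               height: int, width: int, n: float) -> float:
    # Equation (1): LOAD = A * T * (H * W)^n
    return frame_rate * duration_s * (height * width) ** n

def workers_needed(load: float, single_worker_load: float) -> int:
    # Equation (2): NUM = CEILING(LOAD / SINGLE_WORKER_LOAD)
    return math.ceil(load / single_worker_load)

if __name__ == "__main__":
    load = total_load(frame_rate=30, duration_s=120, height=720, width=1280, n=1.5)
    print(workers_needed(load, single_worker_load=load / 4))  # -> 4
```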
In contrast to embodiments of the algorithm described above, conventional encoding systems may be configured to analyze the complexity of the source video file, partition the video file based on the complexity analysis, and determine an encoding rate for each chunk based on the complexity analysis. In such conventional systems, encoding formats are then selected based on the determined encoding rates, rather than based on a desired encoding format that has been received by the system, as in embodiments of the systems and methods described herein. Additionally, embodiments include determining a number of computer resources (e.g., encoding instances) needed to produce a processed video file based on the video file duration, an estimated total encoding load, and a load capacity of each computer resource. Embodiments further include instructing the master computer resource (e.g., the master worker instance) to partition the video file into a number of partitions equal to the number of computer resources, where each partition corresponds to a time interval of the video file. That is, for example, embodiments include receiving a requested frame rate and/or resolution, determining the estimated total load based on the requested frame rate and/or resolution, and then determining the number of partitions according to that load. In this manner, embodiments facilitate distributing video encoding processes among computer resources that are available at a given time but not necessarily dedicated (e.g., one or more of the computer resources may be shared among a number of users for encoding video), by partitioning the video file into a number of partitions equal to the number of available computer resources, based on a calculated load of the video file and the load capacity of each resource.
According to embodiments of the method 7300, the video encoding manager may be configured to launch (e.g., instantiate) the determined number of worker instances (block 7310). The video encoding manager may be further configured to designate one of the worker instances as a master worker instance to manage the processing of the video file (block 7312). According to embodiments of the illustrative method 7300, the video encoding manager may be configured to provide partitioning instructions and/or encoding instructions to the master worker instance (block 7314). In embodiments, the video encoding manager may be configured to pass the video file to the master worker instance and/or to instruct (e.g., by providing computer-executable instructions to) the master worker instance to partition the video file into a number of partitions. In embodiments, the number of partitions is equal to the number of worker instances launched. In embodiments, for example, the master worker instance may partition the video file into time partitions of equal length, as sketched below. In embodiments, the master worker instance may also be configured to encode a partition of the video file.
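By way of a non-limiting illustration, the computation of equal-length time partitions (one per launched worker instance) may be sketched as follows; the function name and the interval representation are assumptions for illustration.

```python
# Sketch of computing equal-length time windows, one per worker instance.
def partition_windows(duration_s: float, num_workers: int):
    """Return [(start, stop), ...] covering the clip in equal time intervals."""
    step = duration_s / num_workers
    return [(i * step, min((i + 1) * step, duration_s)) for i in range(num_workers)]

# Example: a 600-second clip split for 5 workers ->
# [(0.0, 120.0), (120.0, 240.0), (240.0, 360.0), (360.0, 480.0), (480.0, 600.0)]
```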
Referring to
In embodiments, the memory 7404 stores computer-executable instructions for causing the processor 7402 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components may include a video partitioning component 7408, a video concatenation component 7410, a video encoding component 7412, and a communication component 7414. In embodiments, the information file 7416 includes the information regarding processing of the video file 7406. In embodiments, the information regarding processing of the video file 7406 is stored as part of the video file 7406.
In embodiments, the master worker instance 7400 further includes a user interface 7418. The user interface 7418 includes one or more input devices 7420 and one or more output devices, illustratively a display 7422. Exemplary input devices include a keyboard, a mouse, a pointer device, a trackball, a button, a switch, a touch screen, and other suitable devices which allow an operator to provide input to the master worker instance 7400. Exemplary output devices include a display, a touch screen, a printer, speakers, and other suitable devices which provide information to an operator of the master worker instance 7400.
According to embodiments, the memory 7404 further includes the determined number of partitions 7424 and a worker instance assignment index 7426. In embodiments, the video partitioning component 7408 partitions the video file 7406 into a plurality of partitions, each partition having a duration approximately equal to that of each of the other partitions. In embodiments, the video partitioning component 7408 is configured to generate separate video clips, each corresponding to one of the plurality of partitions. In embodiments, the video encoding manager determines the time window for each partition (e.g., the start and stop times corresponding to each partition) and provides these time intervals to the video partitioning component 7408, which generates the separate video clips. In embodiments, the video partitioning component 7408 may be any type of video partitioning component such as, for example, FFmpeg, available from FFmpeg.org, located online at http://www.ffmpeg.org. The video concatenation component 7410 concatenates video pieces into a video file. The video concatenation component 7410 can be any type of video concatenation component such as, for example, Flvbind, available from FLVSoft, located online at http://www.flvsoft.com.
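By way of a non-limiting illustration, driving FFmpeg from a script to cut the video file into the computed time windows, and later to concatenate processed pieces, may be sketched as follows. The use of FFmpeg's concat demuxer here is a stand-in for the concatenation component (e.g., Flvbind) named above and is not a statement about that component's interface.

```python
# Sketch of partitioning and concatenation via FFmpeg command-line calls.
import subprocess

def extract_partition(src: str, start: float, stop: float, dst: str) -> None:
    # Cut the interval [start, stop) without re-encoding (stream copy).
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(stop - start),
         "-i", src, "-c", "copy", dst],
        check=True)

def concatenate(pieces: list[str], dst: str, list_path: str = "pieces.txt") -> None:
    # Write a concat list file and join the pieces without re-encoding.
    with open(list_path, "w") as f:
        f.writelines(f"file '{p}'\n" for p in pieces)
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_path, "-c", "copy", dst],
        check=True)
```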
The video encoding component 7412 may be configured to encode at least a portion of the video file 7406. In embodiments, the video encoding component 7412 may be configured to encode at least one partition of the video file 7406 to the targeted encoding format specified in the information file 7416. The video encoding component 7412 may be any type of video encoding component configured to encode video according to any number of different encoding standards (e.g., H.264, HEVC, AVC, VP9, etc.), such as, for example, the x264 encoder available from VideoLAN, located online at http://www.videolan.org.
As indicated above, the number of partitions 7424 corresponds to the number of worker instances launched by a video encoding manager (e.g., the video encoding manager 7200 depicted in
Referring to
In embodiments, the memory 7504 stores computer-executable instructions for causing the processor 7502 to implement aspects of embodiments of system components discussed herein and/or to perform aspects of embodiments of methods and procedures discussed herein. Computer-executable instructions may include, for example, computer code, machine-useable instructions, and the like such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components may include a video encoding component 7506 and a communication component 7508. In embodiments, an information file includes the information regarding processing of the video file. In embodiments, the information regarding processing of the video file is stored as part of the video file.
In embodiments, the worker instance 7500 further includes a user interface 7510. The user interface 7510 includes one or more input devices 7512 and one or more output devices, illustratively a display 7514. Exemplary input devices include a keyboard, a mouse, a pointer device, a trackball, a button, a switch, a touch screen, and other suitable devices which allow an operator to provide input to the worker instance 7500. Exemplary output devices include a display, a touch screen, a printer, speakers, and other suitable devices which provide information to an operator of the worker instance 7500.
As shown, the worker instance 7500 further includes one or more partitions 7516 of a video file, which may be received from a master worker instance (e.g., the master worker instance 7400 depicted in
When the processed partitions 7518 are received by a master worker instance (e.g., the master worker instance 7400 depicted in
Referring to
A messaging system 7602 is also used for communication between client computing systems 7604 and video encoding system 7606. In the illustrated embodiment, Amazon's Simple Queue Service (“Amazon SQS”) is used for messaging. A queuing system allows messages to be communicated between client computing system 7604 and the in-the-cloud computing system of video encoding system 7606. A file-hosting system 7608, which shares files among the instances (the worker instances 7610 and the master worker instance 7612), and a video encoding manager 7614 are also provided. In the illustrated embodiment, Amazon's S3 file system (“Amazon S3”) is used for file transfer.
The encoding process begins with the content host, hereinafter called the “publisher”, copying a video file 7616 into a specified directory 7618 on its own server, client computing system 7604. In addition to the video file 7616, the publisher provides an information file 7620 specifying the resolution and bitrate for the target encoding. Multiple target resolutions and bitrates may be specified within a single information file 7620.
Next, the files (both the video 7616 and the information files 7620) are transferred to the cloud-based file system by a local service, file-hosting system 7608, running on the publisher's server, client computing system 7604. In the illustrated embodiment, Amazon S3 is used.
The publisher's local service, file-hosting system 7608, places a message on the message queue of messaging system 7602 indicating that the specific file has been uploaded. In the illustrated embodiment, Amazon SQS is used. Amazon SQS is a bi-directional, highly available service, accessible by many client computing systems 7604 simultaneously. The Clip Manager, or video encoding manager 7614, which is a service running on a cloud-based instance, video encoding system 7606, accesses the queue and reads the specific messages regarding which file to encode. As shown in
When the Clip Manager accesses a message on the message queue indicating that a new file is to be encoded, it accesses the video file 7616 and the information file 7620 from file-hosting system 7608 and determines the resources needed to complete the encoding. The Clip Manager may decide to use a single instance to process a video clip or it may split the job up among multiple instances. In the illustrated embodiment, a single instance is loaded with a fixed compute load (i.e., a video clip at a specific resolution for a particular length of time). Depending on resolution, a video file will be processed in pieces of fixed lengths. The length of the pieces is a function of the target resolution of the encoding. For example, a two hour video clip may be split up into two minute pieces and processed in parallel on 60 instances. However, a 90 second clip would likely be processed on a single instance. The instances are launched programmatically on demand. The cloud-based system only provisions the resources required, and instances that are not used are shut down programmatically.
The instances that are launched to encode a given video clip are termed “worker” instances, such as worker instances 7610 and master worker instance 7612. A worker instance is given a pointer to the file in file-hosting system 7608, Amazon S3, along with information about the target resolution, bitrate, and portion of the file it must encode (e.g., the interval between the two minute and four minute marks of the clip). The worker accesses the video file 7616 from file-hosting system 7608, Amazon S3. Given the high availability of file-hosting system 7608, Amazon S3, many workers can access the same file simultaneously with relatively little degradation of performance due to congestion. The worker decodes its designated time interval to a canonical format. In the illustrated embodiment, the format is uncompressed .yuv files. Many programs in the public domain can decode a wide range of standard formats. The .yuv files are subsequently resized to the target resolution for encoding. An encoder then encodes the file to the target format. In embodiments, the Lyrical Labs H.264 encoder available from Lyrical Labs located at 405 Park Ave., New York, N.Y. 10022 accepts .yuv files (color-space pixel format) as input and outputs .flv (Flash Video) files, .mp4 (MPEG-4) files, or both. The encoder functions at the full range of commercially interesting resolutions and bitrates. The resultant encoded file is placed back into Amazon S3, as illustrated in
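By way of a non-limiting illustration, one worker's decode-resize-encode pipeline as described above may be sketched as follows, with FFmpeg and libx264 standing in for the decoder and the encoder; the flags shown are standard FFmpeg options, and the function names are assumptions.

```python
# Sketch of a worker's pipeline: decode interval to raw .yuv, resize, encode.
import subprocess

def decode_and_resize(src: str, start: float, duration: float,
                      width: int, height: int, fps: int, yuv_path: str) -> None:
    # Decode the assigned time interval to uncompressed yuv420p at the target size.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration), "-i", src,
         "-vf", f"scale={width}:{height}", "-r", str(fps),
         "-pix_fmt", "yuv420p", "-f", "rawvideo", yuv_path],
        check=True)

def encode_yuv(yuv_path: str, width: int, height: int, fps: int,
               bitrate: str, dst_mp4: str) -> None:
    # Encode the raw frames to the target format (here H.264 in an .mp4 container).
    subprocess.run(
        ["ffmpeg", "-y", "-f", "rawvideo", "-pix_fmt", "yuv420p",
         "-s", f"{width}x{height}", "-r", str(fps), "-i", yuv_path,
         "-c:v", "libx264", "-b:v", bitrate, dst_mp4],
        check=True)
```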
If the encoding process was split into multiple parts or partitions 7616A-N (as shown, e.g., in
Once the encoded file is placed in Amazon S3, the worker, master worker instance 7612, notifies the Clip Manager, video encoding manager 7614, that the job is complete. The Clip Manager, video encoding manager 7614, assigns a new job to the free worker or terminates the instance if no further jobs exist.
The Clip Manager, video encoding manager 7614, then places on the message queue of messaging system 7602 both a message that a particular encoding job has been completed and a pointer to the encoded file on file-hosting system 7608, Amazon S3. The local service running on the publisher's server will access the message queue and download the encoded file, processed video file 7622. Multiple encoded files can result from a single input file. This process is illustrated in
In the illustrated embodiment, the publisher does not need to provision a large encoder farm targeted to peak use. The cloud-based system scales to the customer's demands, and the customer's cost is only related to that of the actual compute resources used. There are substantially no up-front costs, so the publisher's costs scale with their business, providing a strong economic advantage.
Whether video data is encoded using a cloud-computing environment, a stand-alone encoding device, a software encoder, or the like, embodiments of the systems and methods described herein may facilitate more efficient and effective encoding by leveraging aspects of the rich metadata generated during aspects of embodiments of video processing processes described herein (e.g., the video processing process 200 depicted in
According to embodiments, metadata such as that described above may be used to facilitate more intelligent macroblock partitioning decisions that may be implemented using machine-learning techniques. In this manner, embodiments of video processing platforms, including, for example, encoders, may be configured to get more efficient over time.
Embodiments of encoding processes described herein may include adaptive quantization techniques, which may be implemented, for example, as part of any number of embodiments of the processes described herein (e.g., the adaptive quantization process 224 depicted in
According to embodiments, metadata (e.g., object information, segmentation information, etc.) may be leveraged to dynamically configure group-of-picture (GOP) structure during encoding. In embodiments, methods of dynamically configuring GOP structure may include specifying a maximum I-frame interval, and selecting I-frame positions in locations (sequential positions within the GOP structure) calculated to result in increased quality. In embodiments, for example, I-frame positions may be selected to maximize (or optimize) video quality (e.g., as measured by a quality metric). In embodiments, the I-frame positions may be selected based on object information. Similarly, P- and/or B-frame positions may be selected based on object information to maximize or optimize quality. In embodiments, a conventional dynamic GOP structure configuring algorithm may be modified to be biased toward favoring certain structures based on various metadata.
For example, in embodiments, I-frames may be favored for placement in locations associated with the appearance or disappearance of an object or objects. In embodiments, P- and B-frames may be favored for placement in locations associated with other material changes that are less significant than the appearance and/or disappearance of an object. For example, P frames might be placed just before a sudden movement of an object within a scene occurs. Embodiments include utilizing machine-learning techniques that may be configured to enhance the GOP structure configuration process. For example, in embodiments, objects and their movements may be treated as feature inputs to a classifier that learns to make GOP structure configuration decisions that improve the video quality and encoding efficiency over time.
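By way of a non-limiting illustration, biasing I-frame placement toward frames at which an object appears or disappears, subject to a maximum I-frame interval, may be sketched as follows; the data model mapping frame indices to object events is an assumption for illustration.

```python
# Sketch of I-frame placement biased by object appearance/disappearance events.
def place_iframes(num_frames: int, max_interval: int,
                  object_events: dict[int, int]) -> list[int]:
    iframes = [0]          # always start a GOP at the first frame
    last = 0
    for frame in range(1, num_frames):
        due = frame - last >= max_interval
        # Favor an I-frame at an object appearance/disappearance even before
        # the maximum interval forces one.
        if object_events.get(frame, 0) > 0 or due:
            iframes.append(frame)
            last = frame
    return iframes

# Example: 300 frames, maximum interval 120, an object appears at frame 75.
# place_iframes(300, 120, {75: 1}) -> [0, 75, 195]
```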
Additionally, or alternatively, embodiments may include optimizing encoding of multiple versions of a video file. In conventional video content delivery systems, a content source may maintain multiple encoded renditions of a video file to facilitate rapid access to a rendition encoded at a resolution and/or bitrate that is appropriate for a particular transmission. In embodiments, for example, a video file may be encoded in as many as 8 or 10 (or more) different ways (e.g., 1920×1080 at 4 Mbps, 1280×720 at 3 Mbps, 640×480 at 1.5 Mbps, etc.), in which each encoding run is performed independently and includes a full motion estimation process independent of the others. This conventional practice is often computationally burdensome.
In contrast, embodiments of the systems and methods described herein may be configured to encode a first rendition of the video file, and then leverage motion information generated during a motion estimation process associated with the first encoding run to facilitate more efficient motion estimation in each subsequent encoding run (e.g., by seeding motion vector searches, etc.). In embodiments, the highest mode encoding run is performed fully, as it generates the most information. Each subsequent motion estimation process may be performed, in embodiments, by refining the motion information from the first run. That is, for example, during a second encoding run, the encoder may be configured to adjust the motion information generated during a first encoding run for differences in the bitrate and/or resolution, thereby reducing the computational burden of generating multiple renditions of a video file.
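By way of a non-limiting illustration, reusing motion information across renditions may be sketched as follows: motion vectors from the first run are scaled to the target rendition's resolution and used to seed a small refinement search in place of a full-range search. The function names and search range are assumptions for illustration.

```python
# Sketch of seeding motion estimation in subsequent renditions.
def seed_motion_vector(mv_x: float, mv_y: float,
                       src_w: int, src_h: int,
                       dst_w: int, dst_h: int) -> tuple[float, float]:
    """Scale a motion vector from the first rendition's resolution to the
    target rendition's resolution."""
    return mv_x * dst_w / src_w, mv_y * dst_h / src_h

def refinement_candidates(seed: tuple[float, float], search_range: int = 2):
    """Candidate offsets for a small local search around the seeded vector,
    in place of a full-range motion search."""
    sx, sy = seed
    return [(sx + dx, sy + dy)
            for dx in range(-search_range, search_range + 1)
            for dy in range(-search_range, search_range + 1)]
```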
Various modifications and additions can be made to the exemplary embodiments discussed without departing from the scope of the present disclosure. For example, while the embodiments described above refer to particular features, the scope of this disclosure also includes embodiments having different combinations of features and embodiments that do not include all of the described features.
Accordingly, the scope of the present disclosure is intended to embrace all such alternatives, modifications, and variations as fall within the scope of the claims, together with all equivalents thereof.
Number | Date | Country
---|---|---
62042188 | Aug 2014 | US
62132167 | Mar 2015 | US
62058647 | Oct 2014 | US
62134534 | Mar 2015 | US
62204925 | Aug 2015 | US
61468872 | Mar 2011 | US
61646479 | May 2012 | US
61637447 | Apr 2012 | US
62368853 | Jul 2016 | US

Relationship | Number | Date | Country
---|---|---|---
Parent | 14737418 | Jun 2015 | US
Child | 14696255 | — | US
Parent | 15237048 | Aug 2016 | US
Child | 14737418 | — | US
Parent | 13428707 | Mar 2012 | US
Child | 15237048 | — | US
Parent | 15269960 | Sep 2016 | US
Child | 13428707 | — | US
Parent | 13868749 | Apr 2013 | US
Child | 15269960 | — | US

Relationship | Number | Date | Country
---|---|---|---
Parent | 14737401 | Jun 2015 | US
Child | 15480361 | — | US
Parent | 15357906 | Nov 2016 | US
Child | 14737401 | — | US
Parent | 14696255 | Apr 2015 | US
Child | 15357906 | — | US