In the present day Internet, websites and web applications are highly dynamic and increasingly rich in content, and a multitude of network devices (e.g., smart phones, laptops, and tablets) may be utilized for accessing websites and web applications. Additionally, the average size of webpages has been steadily increasing, as has the average load time of content-rich websites. Since images represent the most significant portion of data content on a typical webpage, improved techniques for delivering images to the multitude of user devices in a manner that maximizes the Quality of Experience (QoE) for the users would be desirable.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
One way to reduce the size of an image file, and thus its network footprint, is by transcoding the image from its original encoding format to a different encoding format with a reduced file size and quality before sending the modified image through the network. These techniques are herein referred to as front-end optimization compression quality adjustment techniques (FEO-CQ). For example, the original image may be transcoded to a lossy image format with a lower image quality before it is sent through the network.
Lossy compression algorithms trade quality for the degree of compression. A lower quality image results in a smaller file, whereas a higher quality image results in a larger file. For JPEG images, the quality level (Q level) ranges from 1% (lowest quality) to 100% (almost no loss of quality). The lower the Q level of an image, the greater the amount of distortion from the original image. Compression algorithms that support lossy compression include JPEG, JPEG 2000, WebP, JPEG XR, and the like.
FEO-CQ techniques have a number of drawbacks. Different types of images may differ as to which Q levels are appropriate or optimal. Some images can be compressed significantly but still look visually acceptable, while other images can only be compressed by a small amount before they become visually unacceptable to most users. The Q level of the modified file should be preferably selected such that the image quality does not degrade appreciably from that of the original image file. However, FEO-CQ techniques do not take into account the diversity of the images. If a fixed Q level is used to compress all images, then most images will have Q levels that are either lower or higher than their respective appropriate or optimal Q levels. On the other hand, manually selecting the appropriate Q level for every single image is highly inefficient and not a practical solution either.
An alternate way to reduce the wait time experienced by an end-user of an image is to stream the image to the user device in partial-download mode, as opposed to sending it in full-download mode. In full-download mode, the web application (e.g., an application that runs on a web browser) downloads the entire image content onto the user device and then displays the image. Using full-download mode to download dynamic multimedia-rich content in a mobile environment may lead to unacceptably long transfer times and low QoE for the end-users. In contrast, in partial-download mode, the web application starts to render an image on the user device as soon as a portion of the image file is buffered. For example, progressive image encoding (e.g., progressive JPEG encoding) allows rendering after partial download. Through a shorter latency before the initial display of image content and a more efficient prioritization of bandwidth, partial-download mode reduces the transfer time and provides better QoE for the end-users.
One key parameter in partial-download mode of image-delivery is the threshold t selected for the initial rendering of the image. For example, threshold t may be a percentage of the image size, where 0≤t≤100%. For example, if t=40%, then after the initial 40% of an image file is received, the web browser may start to render the received portion, displaying a lower quality and distorted initial version of the original image. The initial rendered portion should display a reasonable image without degrading the image quality to the extent that it becomes clearly noticeable to the end-user. However, different images have different appropriate or optimal t thresholds. Without taking into account the diversity of the images and using a fixed t threshold to deliver all images, most images will have t thresholds that are either lower or higher than their respective appropriate or optimal t thresholds. On the other hand, manually selecting an appropriate t threshold for every single image is highly inefficient and not a practical solution either.
In summary, image delivery using partial-download mode has a number of existing problems. First, there is a lack of a universally applicable threshold for determining the deliverable image buffer size. Second, there is a lack of a quantitative definition for QoE for a given image. Third, it is difficult to determine a suitable threshold for every single image in a large-scale web delivery service (WDS).
In some embodiments, client 104 is running within web browser 106. For example, client 104 may be an Instart Nanovisor that is injected into web browser 106 by edge server 110 (e.g., an Instart edge server) to interface between web browser 106 and edge server 110 for efficient delivery of various types of data (e.g., images, videos, audio clips, scripts, text, and webpage files) to device 102. In some embodiments, client 104 may be injected into web browser 106 based on standards-based (e.g., HTML or JavaScript) procedures. For example, after edge server 110 receives a request from web browser 106 requesting an HTML webpage file, edge server 110 may parse the HTML webpage file, inject client 104 into the HTML webpage file, and then send the response back to web browser 106. In some embodiments, client 104 may be injected by adding JavaScript client code in the head section of the HTML webpage file. In some embodiments, the user may add the script to their own HTML webpage file.
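By way of a non-limiting illustration, the injection of client code into the head section of an HTML webpage file may be sketched as follows. The script URL `/client.js` and the function name `inject_client` are hypothetical and are not taken from this description; the sketch assumes the edge server holds the HTML as a string and simply inserts a script tag after the opening head tag.

```python
import re

# Hypothetical script tag; the description does not specify how the client code is served.
CLIENT_SCRIPT_TAG = '<script src="/client.js"></script>'


def inject_client(html: str, script_tag: str = CLIENT_SCRIPT_TAG) -> str:
    """Insert script_tag immediately after the opening <head> tag, if one exists."""
    # \b keeps the pattern from matching <header ...> elements.
    match = re.search(r"<head\b[^>]*>", html, flags=re.IGNORECASE)
    if match is None:
        return html  # no head section found: return the page unchanged
    pos = match.end()
    return html[:pos] + script_tag + html[pos:]
```

A production implementation would more likely use a streaming HTML parser than a regular expression; the regular expression is used here only to keep the sketch short.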
Device 102 is connected to edge server 110 through network 108. Network 108 may be any combination of public or private networks, including intranets, local area networks (LANs), wide area networks (WANs), radio access networks (RANs), Wi-Fi networks, the Internet, and the like. Edge server 110 includes image quality threshold analyzer 112 and database 114, as will be described in greater detail below. In some embodiments, image quality threshold analyzer 112 and database 114 may be located on one or more hosts that are external to edge server 110. In some embodiments, database 114 comprises an image database. In some embodiments, database 114 comprises a database that stores image signatures.
In some embodiments of process 200, edge server 110 reduces the size of the requested image file, and thus its network footprint, by transcoding the image to a modified image file with a reduced file size and quality before sending the modified image file through the network to web browser 106 or client 104. In these embodiments, the image quality threshold determined by image quality threshold analyzer 112 of edge server 110 at step 204 is a Q level that ranges from 1% (lower quality) to 100% (almost no loss of quality). The lower the Q level of the modified image file, the greater the amount of distortion from the original image and the smaller the size of the modified image file. The higher the Q level of the modified image file, the smaller the amount of distortion from the original image and the greater the size of the modified image file. The determined Q level is an appropriate or optimal Q level specific to the requested image file as determined by image quality threshold analyzer 112. At 206, edge server 110 transcodes the original image file to the modified image file at the determined Q level before sending the modified image file through the network to web browser 106 or client 104.
In some embodiments of process 200, edge server 110 sends the requested image file to web browser 106 or client 104 in partial download mode. In these embodiments, the image quality threshold determined by image quality threshold analyzer 112 of edge server 110 at step 204 is a t percentage threshold that ranges from 1% (lower quality) to 100% (highest quality) in size of the original file. The determined t percentage threshold is an appropriate or optimal t percentage specific to the requested image file as determined by image quality threshold analyzer 112. At 206, edge server 110 sends the requested image file to web browser 106 or client 104 in partial download mode, and web browser 106 starts to render the image on the user device as soon as t percent of the original image file is buffered. The buffer size is t percent of the original image file.
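As a minimal sketch of the relationship between the t percentage threshold and the buffer size, the number of bytes that must be buffered before the initial rendering may be computed as shown below. The function name `initial_render_bytes` is hypothetical, and rounding up to a whole byte is an assumption not stated in this description.

```python
import math


def initial_render_bytes(file_size_bytes: int, t_percent: float) -> int:
    """Return the buffer size, in bytes, corresponding to t percent of the file."""
    if not 0 <= t_percent <= 100:
        raise ValueError("t_percent must be between 0 and 100")
    # Rounding up (an assumption) so that at least t percent is buffered.
    return math.ceil(file_size_bytes * t_percent / 100)
```

For example, under these assumptions a 200,000-byte image with t=40% would begin rendering once 80,000 bytes are buffered.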
Image quality threshold analyzer 112 uses a simple quantitative metric for characterizing the Quality of Experience (QoE) for any given image that is sent through a web delivery service (WDS). The quantitative metric is a quantitative signature known as the variation of quality signature (VoQS). The signature identifies the quality degradation of an image as it passes through a web delivery pipeline. The advantage of VoQS is that it allows any two arbitrary images (e.g., images of different sizes or formats) to be compared in the context of web delivery performance; images that have similar VoQSs have similar web delivery performance. Therefore, images in a large database may be separated into multiple clusters, each cluster including images that have similar VoQSs and thus similar web delivery performances. Once the images are separated into different clusters, an exemplar, or best representative, image may be selected for each cluster, and an appropriate image quality threshold may be determined for the exemplar image. Since the images within the same cluster have similar web delivery performances, the assigned image quality threshold for the exemplar image is also an appropriate image quality threshold for the other images within the same cluster. When image quality threshold analyzer 112 is requested to determine an image quality threshold for a new image (e.g., at step 204 of process 200), the analyzer computes the new image's VoQS, compares it to the VoQSs of the stored exemplars to find a nearest neighbor to the new image, and selects an appropriate image quality threshold for the new image based on the nearest exemplar.
Image quality threshold analyzer 112 includes two components. In some embodiments, the first component may be implemented as an offline process, and the second component may be implemented as an online or real-time process.
With reference to
The variation of quality signature (VoQS) of a particular image includes a vector of signatures, and the elements are the signatures of different versions of the image when different levels of distortion/compression are introduced to the image in exchange for their corresponding reduced image file sizes.
In the partial-download example, if VoQS of an original reference image I is Q(I), then Q(I) may be a vector defined as:
$Q(I) = [q(I_{t_1}), q(I_{t_2}), \ldots, q(I_{t_N})]$ Equation (1)
where
In the JPEG transcoding example, if VoQS for an original reference image I is Q(I), then Q(I) may be a vector defined as:
$Q(I) = [q(I_{Q_1}), q(I_{Q_2}), \ldots, q(I_{Q_M})]$ Equation (2)
where
The signature for each level of distortion (i.e., q(It) or q(IQ)) may be measured in different ways. The signature should be efficient to compute, and the signature should measure a degree of similarity between the reference image I and a distorted version of the image. For example, the signature for each level of distortion may be a measure of the perceptual quality of the image experienced by an end-user when that level of distortion is introduced to the image. In some embodiments, the signature for each level of distortion may be determined by soliciting subjective feedback from human users. However, this approach is time-consuming and therefore not scalable for large image databases. In some embodiments, the signature for each level of distortion may be an objective measure of the perceptual quality of the image.
An objective measure of the perceptual quality of the image may be a combination of one or more metrics. For example, one metric is the peak signal-to-noise ratio (PSNR), which is the ratio between the maximum power of a signal and the power of corrupting noise that affects the fidelity of its representation. Another metric is the structural similarity metric (SSIM). SSIM is a metric that measures the similarity between two images. SSIM tries to explicitly model the function of the human visual system. PSNR and SSIM are examples of metrics that may be used to measure the perceptual quality of the distorted images. Other objective measures may be used as well.
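For illustration only, the two metrics may be sketched in plain Python over flat sequences of grayscale pixel values. The function names are hypothetical. The SSIM shown here is a simplified single-window (global) variant computed over the whole image with the commonly used constants 0.01 and 0.03; practical SSIM implementations typically average the statistic over local windows, so this sketch should not be read as the exact metric used in any embodiment.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no corrupting noise
    return 10.0 * math.log10(max_value ** 2 / mse)


def global_ssim(reference, distorted, max_value=255.0):
    """Simplified SSIM computed once over the whole image (no sliding window)."""
    n = len(reference)
    mu_x = sum(reference) / n
    mu_y = sum(distorted) / n
    var_x = sum((x - mu_x) ** 2 for x in reference) / n
    var_y = sum((y - mu_y) ** 2 for y in distorted) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(reference, distorted)) / n
    c1 = (0.01 * max_value) ** 2  # stabilizing constants
    c2 = (0.03 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```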
In the partial-download example, if the combination of PSNR and SSIM (denoted as p(It) and s(It), respectively) is used as the objective measure of the perceptual quality of the distorted streamed image It, then the signature may be determined as
$q(I_t) = [p(I_t), s(I_t)]$ Equation (3)
The number of elements of Q(I) is then equal to N*2, because each of the N signatures contributes a PSNR element and an SSIM element. For example, if the number of levels of distortion (N) introduced to image I is 10, such that
$Q(I) = [q(I_{10\%}), q(I_{20\%}), q(I_{30\%}), q(I_{40\%}), q(I_{50\%}), q(I_{60\%}), q(I_{70\%}), q(I_{80\%}), q(I_{90\%}), q(I_{100\%})]$
then Q(I) has a total of 20 elements. The length of the VoQS vector (i.e., Q(I)) linearly impacts the computational complexity of the offline and online computations. Therefore, the length of the VoQS vector should be selected based on the availability of computational resources, acceptable response time, and the like.
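The assembly of a VoQS vector from per-level signatures may be sketched as follows. The function `build_voqs` and its parameters are hypothetical; the sketch assumes that the distorted versions of the image have already been produced (e.g., by partial download or transcoding) and that each metric is a callable taking the reference and a distorted version.

```python
from typing import Callable, List, Sequence

Pixels = Sequence[float]
Metric = Callable[[Pixels, Pixels], float]


def build_voqs(reference: Pixels,
               distorted_versions: Sequence[Pixels],
               metrics: Sequence[Metric]) -> List[float]:
    """Concatenate the signature of every distortion level into one flat vector."""
    voqs: List[float] = []
    for distorted in distorted_versions:
        # One signature per distortion level, e.g. [PSNR, SSIM].
        for metric in metrics:
            voqs.append(metric(reference, distorted))
    return voqs
```

With ten distorted versions and two metrics, the returned vector has 20 elements, matching the example above.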
In some embodiments, Q(I) is normalized. For example, the components of Q(I) (e.g., the PSNR and SSIM components) may be normalized by computing their z-scores across the entire image database.
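One possible way to compute such z-scores is sketched below. The function name `zscore_normalize` is hypothetical. The sketch assumes that every VoQS vector in the database has the same length, uses the population standard deviation, and maps a component with zero variance to 0.0; these choices are assumptions rather than requirements of any embodiment.

```python
from statistics import mean, pstdev


def zscore_normalize(voqs_vectors):
    """Normalize each VoQS component to zero mean and unit variance across the database."""
    columns = list(zip(*voqs_vectors))  # one column per VoQS component
    means = [mean(column) for column in columns]
    stdevs = [pstdev(column) for column in columns]
    return [
        [(value - m) / s if s > 0 else 0.0
         for value, m, s in zip(vector, means, stdevs)]
        for vector in voqs_vectors
    ]
```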
With continued reference to
$\mathrm{sim}(I', I'') = -\sqrt{\sum_{k} \left(Q(I')_k - Q(I'')_k\right)^2}$ Equation (4)
Equation (4) allows a quantitative comparison between any two arbitrary images in the context of web content delivery. Equation (4) is one exemplary measure of the similarity between two images based on Euclidean distance. Other measures may be used as well. For example, Equation (4) may be modified to give unequal weights to different elements of the VoQS vectors. In some embodiments, the weights may be learned through machine learning.
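A minimal sketch of the similarity measure of Equation (4), together with the weighted variant mentioned above, is shown below. The function name `similarity` and the form of the weighting (a per-element multiplier on each squared difference) are assumptions made for illustration.

```python
import math


def similarity(voqs_a, voqs_b, weights=None):
    """Negative (optionally weighted) Euclidean distance between two VoQS vectors."""
    if len(voqs_a) != len(voqs_b):
        raise ValueError("VoQS vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(voqs_a)  # equal weights reproduce Equation (4)
    return -math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, voqs_a, voqs_b))
    )
```

Under this measure, identical VoQS vectors have a similarity of zero, and the similarity becomes more negative as the vectors diverge.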
Different clustering techniques may be used to divide the collection of images into a plurality of clusters of images. Examples of clustering techniques include K-means clustering and EM algorithms. Another example of the clustering techniques is affinity propagation clustering. The affinity propagation algorithm may be used in conjunction with the image similarity measure defined in Equation (4) to find QoE-driven image clusters from the image database. The affinity propagation algorithm does not require a pre-specification of the number of expected clusters.
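As a non-limiting illustration of clustering VoQS vectors, a small K-means sketch is given below. K-means is shown only because it is the shortest of the named techniques to write out; unlike affinity propagation, it requires the number of clusters k to be specified in advance. The function names and the deterministic initialization (the first k vectors serve as initial centers) are assumptions made to keep the sketch reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=100):
    """Cluster vectors into k groups; returns (labels, centers)."""
    centers = [list(v) for v in vectors[:k]]  # assumption: first k vectors as seeds
    labels = [-1] * len(vectors)
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```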
With continued reference to
With continued reference to
With continued reference to Figure, at 310, the clusters, the exemplars of the clusters, and the exemplars' assigned image quality thresholds are stored, e.g., in image database 114 of edge server 110 as shown in
At 404, the nearest exemplar image to the new image is determined by comparing the VoQS of the new image to the VoQSs of the exemplar images. For example, the nearest exemplar image to the new image may be determined using an efficient nearest-neighbor lookup.
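The lookup at 404 may be sketched as a linear scan over the stored exemplars, as shown below. The function name `nearest_exemplar` and the dictionary layout (exemplar identifier mapped to its VoQS vector) are hypothetical. For a large number of exemplars, a spatial index could replace the linear scan; that substitution is not shown here.

```python
import math


def nearest_exemplar(new_voqs, exemplars):
    """Return the identifier of the exemplar whose VoQS is closest to new_voqs.

    exemplars: dict mapping an exemplar identifier to its VoQS vector.
    """
    # Minimizing Euclidean distance is equivalent to maximizing Equation (4).
    return min(exemplars, key=lambda name: math.dist(new_voqs, exemplars[name]))
```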
At 406, the image quality threshold (or its equivalent buffer size) for delivering the new image is determined based on the image quality threshold assigned to the nearest exemplar image. In one example, both the new image and the nearest exemplar image have the same image format (e.g., JPEG). In this case, the image quality threshold (e.g., a t threshold or Q level) assigned to the nearest exemplar image may be assigned to the new image as well. In another example, the new image is a WebP image file, while the nearest exemplar image is a JPEG image file. In this case, a simple conversion (e.g., by using lookup-tables) is performed to convert the image quality threshold to a new value that takes into consideration the differences between the JPEG and WebP formats. The new converted value is then assigned to the new image as the image quality threshold for delivery.
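The assignment at 406, including the lookup-table conversion between formats, may be sketched as follows. The function name, the table layout, and in particular the numeric values in `JPEG_TO_WEBP_Q` are placeholders invented for illustration; they are not measured conversions and are not taken from this description.

```python
# Placeholder values only: a real table would be derived empirically.
JPEG_TO_WEBP_Q = {50: 45, 60: 55, 70: 65, 80: 75, 90: 85}

# Conversion tables keyed by (exemplar format, new image format).
CONVERSION_TABLES = {("JPEG", "WebP"): JPEG_TO_WEBP_Q}


def threshold_for_new_image(exemplar_threshold, exemplar_format, new_format,
                            tables=CONVERSION_TABLES):
    """Return the image quality threshold to assign to the new image."""
    if exemplar_format == new_format:
        # Same format: reuse the exemplar's threshold directly.
        return exemplar_threshold
    # Different formats: convert through the lookup table for that format pair.
    return tables[(exemplar_format, new_format)][exemplar_threshold]
```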
Image formats may be optimized for displaying on different device display formats. For example, they may be optimized for displaying on desktop monitors, laptop computers, tablets, smartphones, and the like.
In some embodiments, the VoQS unsupervised image categorization technique discussed above may be used to categorize and select images with formats that are specifically optimized and targeted for different device display formats. In one experiment, images that are optimized for the four different device types above (smartphone, tablet, laptop, and monitor) are collected. The VoQS for each image is computed (e.g., see step 302 of process 300), and a plurality of clusters of images are determined (e.g., see step 304 of process 300). It has been shown that the VoQS image categorization technique separates the images into four different clusters, each cluster corresponding to one of the device types.
At 702, the VoQS of each image in a collection of images with different formats optimized for different device types is determined, as similarly described in step 302 of process 300. To cover a wide variety of images, the images may be collected from various image databases and content providers. The collected images may include images in different formats optimized for different user devices. For example, the image formats may be optimized for display on desktop monitors, laptop computers, tablets, smartphones, and the like.
At 704, the collection of images is categorized into a plurality of clusters of images (as similarly described in step 304 of process 300), each cluster including images that have similar VoQSs and thus similar web delivery performances. As described above, images that are optimized for different device types are separated into different clusters. Each cluster may be identified and labeled as a cluster of images targeted for a particular type of user device.
At 706, an image is acquired. The acquired image may be obtained from different sources, including content providers, image databases, end-users, and the like. The VoQS of the acquired image is determined.
At 708, the nearest exemplar image to the acquired image is determined by comparing the VoQS of the acquired image to the VoQSs of the exemplar images, as similarly described in step 404 of process 400.
At 710, the type of device the acquired image is optimized for is determined based on the nearest exemplar. For example, if the nearest exemplar is an image optimized for smartphones, then the acquired image is determined to be an image optimized for smartphones as well. Information about the device type corresponding to the acquired image is also stored.
At 712, versions of the acquired image that are optimized for other device types may be generated and stored. For example, if the acquired image is determined as an image optimized for smartphones, then versions of the image corresponding to the device types of tablets, laptop computers, and desktop monitors may be generated and stored.
At 714, a request for the image is received. At 716, the device type corresponding to the request is determined. For example, client 104 may notify edge server 110 of the user device type on which web browser 106 runs. At 718, the version of the image that is optimized for the user device type is delivered to web browser 106.
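Steps 714 through 718 may be sketched as a lookup of the stored version that matches the reported device type. The function name `select_version`, the device-type strings, and the fallback to a default device type when no matching version exists are assumptions; this description does not specify a fallback behavior.

```python
def select_version(versions, device_type, default_type="desktop"):
    """Return the stored image version optimized for device_type.

    versions: dict mapping a device type to the stored image version (e.g., a path).
    """
    if device_type in versions:
        return versions[device_type]
    # Assumption: fall back to a default version for unknown device types.
    return versions[default_type]
```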
As described above, the VoQS unsupervised image categorization technique allows a large image database to be efficiently parsed into coherent groups in a content-dependent and device-targeted manner for optimized image content delivery. The technique significantly reduces the average delivered bits per image across a large image database, while preserving the perceptual quality across the entire image database. In some embodiments, the VoQS unsupervised categorization technique may be further extended to deliver other media files, including audio and video files.
Edge server 110 includes media file quality threshold analyzer 812 and database 814. In some embodiments, media file quality threshold analyzer 812 and database 814 may be located on one or more hosts that are external to edge server 110. In some embodiments, database 814 comprises a media file database. In some embodiments, database 814 comprises a database that stores media file signatures.
In some embodiments of process 900, edge server 110 reduces the size of the requested media file, and thus its network footprint, by transcoding the media file to a modified media file with a reduced file size and quality before sending the modified media file through the network to web browser 106 or client 104. In these embodiments, the media file quality threshold determined by media file quality threshold analyzer 812 of edge server 110 at step 904 is a quality level that ranges from 1% (lower quality) to 100% (almost no loss of quality). The lower the quality level of the modified media file, the greater the amount of distortion from the original media file and the smaller the size of the modified media file. The higher the quality level of the modified media file, the smaller the amount of distortion from the original media file and the greater the size of the modified media file. The determined quality level is an appropriate or optimal quality level specific to the requested media file as determined by media file quality threshold analyzer 812. At 906, edge server 110 transcodes the original media file to the modified media file at the determined quality level before sending the modified media file through the network to web browser 106 or client 104.
Similar to image quality threshold analyzer 112, media file quality threshold analyzer 812 also uses VoQS to characterize the Quality of Experience (QoE) for any given media file that is sent through a web delivery service (WDS).
The variation of quality signature (VoQS) of a media file includes a vector of signatures, and the elements are the signatures of different versions of the media file when different levels of distortion/compression are introduced to the media file in exchange for corresponding reductions in file sizes.
The signature for each level of distortion may be measured in different ways. The signature should be efficient to compute, and the signature should measure a degree of similarity between the reference media file and a distorted version of the media file. For example, the signature for each level of distortion may be a measure of the perceptual quality of the media file experienced by an end-user when that level of distortion is introduced.
An objective measure of the perceptual quality of the media file may be a combination of one or more metrics. Video clips are similar to images in that a video clip includes multiple frames of images. In some embodiments, PSNR and SSIM are used as the metrics for measuring the perceptual quality of both images and video clips. PSNR may also be used as a metric for measuring the perceptual quality of audio files. Different metrics may be used to measure the perceptual quality of video clips and audio clips. The examples provided herein are not intended to be exhaustive.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application claims priority to U.S. Provisional Patent Application No. 61/980,319 entitled QOE-DRIVEN UNSUPERVISED IMAGE CATEGORIZATION FOR OPTIMIZED WEB DELIVERY filed Apr. 16, 2014 which is incorporated herein by reference for all purposes. This application claims priority to U.S. Provisional Patent Application No. 62/042,739 entitled QOE-DRIVEN UNSUPERVISED IMAGE CATEGORIZATION FOR OPTIMIZED WEB DELIVERY filed Aug. 27, 2014 which is incorporated herein by reference for all purposes.